1. How Apache Hadoop is Revolutionizing
Business Intelligence and Data Analytics
Strata Conference, Sept 22nd 2011, New York, NY
Dr. Amr Awadallah, Founder, CTO, VP of Engineering
aaa@cloudera.com, twitter: @awadallah
2. Business Intelligence Before Adopting Apache Hadoop
[Diagram: data flows from Instrumentation to Collection (mostly append), into a Storage Only Grid (original raw data), through an ETL Compute Grid, and into an RDBMS (processed data) that serves BI Reports + Interactive Apps.]
Problems:
• Can’t Explore Original High Fidelity Raw Data
• Moving Data To Compute Doesn’t Scale
• Archiving = Premature Data Death
Copyright © 2011, Cloudera, Inc. All Rights Reserved.
3. Business Intelligence After Adopting Apache Hadoop
[Diagram: data flows from Instrumentation to Collection (mostly append) into Hadoop: Storage + Compute Grid, which feeds an RDBMS serving BI Reports + Interactive Apps.]
What Hadoop adds:
• ETL and Aggregations
• Complex Data Processing
• Data Exploration & Advanced Analytics
• Keep Data Alive For Ever
4. So What is Apache Hadoop?
• A scalable fault-tolerant distributed system for data storage and
processing (open source under the Apache license)
• Core Hadoop has two main components:
• Hadoop Distributed File System: self-healing high-bandwidth clustered storage
• MapReduce: fault-tolerant distributed processing
• Key business values:
• Flexible – Store any data, Run any analysis (Mine First, Govern Later)
• Scalable – Start at 1TB/3-nodes then grow to petabytes/thousands of nodes
• Affordable – Cost per TB at a fraction of traditional options
• Open Source – No Lock-In, Rich Ecosystem, Large developer community
• Broadly adopted – A large and active ecosystem, Proven to run at scale
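To make the MapReduce component concrete, here is a minimal single-process sketch in Python of the map, shuffle, and reduce phases, using word count as the job. It is an illustration only: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and this is not the Hadoop API. A real job would run these phases across many nodes, with the framework handling the shuffle and any failed tasks.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    """Run the three phases in sequence in a single process (illustration only)."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

if __name__ == "__main__":
    print(word_count(["store any data", "run any analysis"]))
```

Because each call to `map_phase` and `reduce_phase` depends only on its own input, the calls can be spread over many machines and rerun independently if one fails.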
5. The Main Benefit: Agility/Flexibility
Schema-on-Write (RDBMS):
• Schema must be created before data is loaded
• Explicit load operation has to take place which transforms data to database internal structure
• New columns must be added explicitly before data for such columns can be loaded into the database
• Read is Fast
• Benefits: Standards/Governance
Schema-on-Read (Hadoop):
• Data is simply copied to the file store, no special transformation is needed
• A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns
• New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse them
• Load is Fast
• Benefits: Flexibility/Agility
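The schema-on-read idea can be sketched in a few lines of Python. Raw lines are stored untouched, and a reader function (standing in for a SerDe) extracts columns only at read time; updating the reader makes a new field appear retroactively for data that was already stored. The log format, the field names, and the functions `parse_fields`, `serde_v1`, and `serde_v2` are assumptions made for illustration, not Hive's SerDe interface.

```python
# Raw records are kept exactly as they arrived; no schema is applied at load time.
RAW_LINES = [
    "2011-09-22T10:00:00 user=alice page=/home",
    "2011-09-22T10:00:05 user=bob page=/search ref=google",
]

def parse_fields(line):
    """Split one raw line into its timestamp and its key=value fields."""
    timestamp, *pairs = line.split()
    fields = dict(pair.split("=", 1) for pair in pairs)
    return timestamp, fields

def serde_v1(line):
    """First reader: only knows about the 'user' and 'page' columns."""
    timestamp, fields = parse_fields(line)
    return {"ts": timestamp, "user": fields.get("user"), "page": fields.get("page")}

def serde_v2(line):
    """Updated reader: also extracts 'ref', which the first reader ignored."""
    row = serde_v1(line)
    _, fields = parse_fields(line)
    row["ref"] = fields.get("ref")  # None when the field is absent from the line
    return row

if __name__ == "__main__":
    # Same stored lines, two schemas: the new column appears without reloading data.
    for line in RAW_LINES:
        print(serde_v1(line))
        print(serde_v2(line))
```

The trade-off matches the two lists above: loading is just a copy, while every read pays the parsing cost that a schema-on-write system pays once at load time.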
6. What is Complex Data Processing?
1. Java MapReduce: Gives the most flexibility and performance,
but potentially long development cycle (the “assembly
language” of Hadoop).
2. Streaming MapReduce (also Pipes): Allows you to develop in
any programming language of your choice, but slightly lower
performance and less flexibility.
3. Pig: A high-level language out of Yahoo, suitable for batch data
flow workloads.
4. Hive: A SQL interpreter out of Facebook, also includes a metastore
that maps files to their schemas and associated SerDe.
5. Oozie: A PDL XML workflow server engine that enables creating
a workflow of jobs composed of any of the above.
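As a sketch of option 2, Streaming MapReduce, the Python script below follows the usual streaming contract: the mapper reads lines from standard input and writes tab-separated key/value lines, the framework sorts those lines by key, and the reducer reads the sorted lines. The script name `wordcount.py`, the `map`/`reduce` command-line argument, and the `run_locally` helper are hypothetical conveniences for illustration, not part of Hadoop.

```python
import sys
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper writes to stdout."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(sorted_lines):
    """Sum counts per key; relies on the input arriving sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

def run_locally(lines):
    """Simulate 'mapper | sort | reducer' in one process for testing."""
    return list(reducer(sorted(mapper(lines))))

if __name__ == "__main__":
    # Hypothetical usage: 'python wordcount.py map' or 'python wordcount.py reduce',
    # each reading stdin; with no argument, run a small local demonstration.
    stage = sys.argv[1] if len(sys.argv) > 1 else "local"
    if stage == "map":
        output = mapper(sys.stdin)
    elif stage == "reduce":
        output = reducer(sys.stdin)
    else:
        output = run_locally(["mine first", "govern later", "mine data"])
    for out_line in output:
        print(out_line)
```

Because the two stages only communicate through text lines, the same contract can be implemented in any language, which is the flexibility the list above attributes to streaming.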
7. What This Means For You: Agility
Up Front Design vs. Just in Time
8. What This Means For You: Innovation
Data Committee vs. Data Scientist
9. What This Means For You: Consolidation
Silos vs. Sharing
10. What This Means For You: Extract Value from Latent Data
Archive to Tape vs. Keep Data Alive
11. What This Means For You: Ability to Grow Fluidly
Benefit #2: Scalability
12. What This Means For You: Data Beats Algorithm
Smarter Algos vs. More Data
13. Where Does Hadoop Fit in the Enterprise Data Stack?
[Diagram: Hadoop in the enterprise data stack.]
• Roles: Data Scientists, Analysts, Business Users, System Operators, Data Architects, Customers
• Tools: IDEs (Development Tools); BI, Analytics (Business Intelligence Tools); Enterprise Reporting; Cloudera Mgmt Suite; ETL Tools; Web Application
• Connected systems: Enterprise Data Warehouse, Low-Latency Serving Systems, Relational Databases
• Data sources: Logs, Files, Web Data
14. Use The Right Tool For The Right Job
Relational Databases, use when:
• Interactive OLAP Analytics (<1sec)
• Multistep ACID Transactions
• 100% SQL Compliance
Hadoop, use when:
• Structured or Not (Agility)
• Scalability of Storage/Compute
• Complex Data Processing
15. Two Core Use Cases Common Across Many Industries
Use case 1: ADVANCED ANALYTICS. Use case 2: DATA PROCESSING.
Advanced Analytics Application    Industry          Data Processing Application
Social Network Analysis           Web               Clickstream Sessionization
Content Optimization              Media             Clickstream Sessionization
Network Analytics                 Telco             Mediation
Loyalty & Promotions              Retail            Data Factory
Fraud Analysis                    Financial         Trade Reconciliation
Entity Analysis                   Federal           SIGINT
Sequencing Analysis               Bioinformatics    Genome Mapping
Product Quality                   Manufacturing     Mfg Process Tracking
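Clickstream sessionization, listed above as a data processing application, can be illustrated with a short Python sketch. It groups each user's clicks and starts a new session whenever the gap since the previous click exceeds a timeout. The 30-minute timeout, the `(user, timestamp)` event shape, and the `sessionize` function are assumptions for illustration; the slides do not specify how sessions are defined.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity timeout; not specified in the slides

def sessionize(events, gap=SESSION_GAP_SECONDS):
    """Group (user, timestamp) click events into per-user sessions.

    A new session starts whenever the time since the user's previous click
    exceeds `gap` seconds. Returns {user: [[t1, t2, ...], ...]}.
    """
    clicks_by_user = defaultdict(list)
    for user, timestamp in events:
        clicks_by_user[user].append(timestamp)

    sessions = {}
    for user, timestamps in clicks_by_user.items():
        timestamps.sort()
        user_sessions = [[timestamps[0]]]
        for previous, current in zip(timestamps, timestamps[1:]):
            if current - previous > gap:
                user_sessions.append([current])    # gap too long: open a new session
            else:
                user_sessions[-1].append(current)  # still within the current session
        sessions[user] = user_sessions
    return sessions

if __name__ == "__main__":
    clicks = [("alice", 0), ("alice", 600), ("bob", 100), ("alice", 4000)]
    print(sessionize(clicks))
```

In a MapReduce setting the grouping step would correspond to using the user as the shuffle key, so that each reducer sees all clicks for one user and can split them into sessions independently.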
16. CDH: Cloudera’s Distribution Including Apache Hadoop
[Diagram: CDH components by layer.]
• UI Framework: HUE
• SDK: HUE SDK
• Workflow: OOZIE
• Scheduling: OOZIE
• Metadata: HIVE
• Languages / Compilers: PIG, HIVE
• Data Integration: FLUME, SQOOP, ODBC
• Fast Read/Write Access: HBASE
• Coordination: ZOOKEEPER
• Open Source – 100% Apache licensed, 100% Open Source, 100% Free.
• Enterprise Ready – Predictable releases, Documentation, Hotfix Patches, Intensive QA
• Integrated – All required component versions & dependencies are managed for you
• Industry Standard – Existing RDBMS, ETL and BI systems work best with it
• Many Form Factors – Public Cloud, Private Cloud, Ubuntu, RHEL, 32/64bit, etc
17. SCM Express: Simplifies Installation and Configuration
Service & Configuration Manager (SCM) Express takes the complexity out of
deploying and configuring CDH.
• Provision a complete Hadoop stack in minutes
• Centrally manage system services through a user-friendly interface
• Manages services for up to 50 nodes
• FREE to download
KEY FEATURES
1. Automated, wizard-based installation of the complete Hadoop stack
2. Central, real-time dashboard for configuration management
3. Ability to configure the cluster while it’s running
4. Incorporates comprehensive validation and error checking
5. Automates the expansion of services to new nodes when they come online
18. What is Cloudera Enterprise?
Cloudera Enterprise makes open source Apache Hadoop enterprise-easy
• Simplify and Accelerate Hadoop Deployment
• Reduce Adoption Costs and Risks
• Lower the Cost of Administration
• Increase the Transparency & Control of Hadoop
• Leverage the Experience of Our Experts
CLOUDERA ENTERPRISE COMPONENTS
• Cloudera Management Suite: Comprehensive Toolset for Hadoop Administration
• Production-Level Support: Our Team of Experts On-Call to Help You Meet Your SLAs
3 of the top 5 telecommunications, mobile services, defense & intelligence,
banking, media and retail organizations depend on Cloudera Enterprise
• EFFECTIVENESS: Ensuring Repeatable Value from Apache Hadoop Deployments
• EFFICIENCY: Enabling Apache Hadoop to be Affordably Run in Production
19. Hadoop World 2011
The largest gathering of Hadoop practitioners, developers,
business executives, industry luminaries and innovative
companies in the Hadoop ecosystem.
November 8-9, Sheraton New York Hotel & Towers, NYC
• 1400 attendees, 25+ sponsors
• 60 sessions across 5 tracks for:
– Business Decision Makers
– Enterprise Architects
– IT Operators
– Data Scientists
– Developers
• Cloudera Training and Certification (November 7, 10, 11)
Learn more and register at www.hadoopworld.com
$50 discount for Strata attendees
20. What I Would Like You To Remember:
• The Key Benefits of the Apache Hadoop Data Platform:
• Agility/Flexibility (Enables Innovation/Exploration).
• Complex Data Processing (Any Language, Any Problem).
• Scalability of Storage/Compute (Freedom to Grow).
• Economical Active Archive (Keep All Your Data Alive).
• Cloudera Enterprise provides:
• Lower Cost of Management and Administration.
• Simpler, Faster Hadoop Deployment.
• Greater Transparency & Control of Hadoop.
• Firm SLAs on Issue Resolution.
21. Contact Information:
Amr Awadallah
aaa@cloudera.com
650-644-3921
http://twitter.com/awadallah
23. Appendix
24. Hadoop Timeline
[Timeline, 2002 to 2009]
• Doug Cutting & Mike Cafarella started working on Nutch
• Google publishes GFS & MapReduce papers
• Doug Cutting adds DFS & MapReduce support to Nutch
• Yahoo! hires Cutting, Hadoop spins out of Nutch
• NY Times converts 4TB of image archives over 100 EC2 instances
• Fastest sort of a TB, 3.5 mins over 910 nodes
• Facebook launches Hive: SQL Support for Hadoop
• Cloudera Founded
• Fastest sort of a TB, 62 secs over 1,460 nodes
• Sorted a PB in 16.25 hours over 3,658 nodes
• Doug Cutting joins Cloudera
• Hadoop Summit 2009, 750 attendees
25. Cloudera’s Track Record
• Customers: Multiple customers with >1,000 Hadoop nodes under management
• Supporting dozens of diverse production use cases including ones that are revenue critical with tight SLAs
• Community: years of demonstrated leadership in the Apache Hadoop ecosystem.
Cloudera employees are:
• The largest contributor to the Hadoop ecosystem in patches
• Founders of 70% of the projects in the Apache Hadoop ecosystem including Apache Hadoop itself
• The first to build & integrate what is now the reference Hadoop stack
• Industry: Multiple years of experience providing Hadoop solutions across industries:
• 2 of the top 5 payments companies run Cloudera
• 3 of the top 5 commercial banks run Cloudera
• 2 of the top 4 online travel companies run Cloudera
26. Cloudera Enterprise Management Suite
Activity Monitor
• It Helps You: Consolidate all user activities into a real-time view; Diagnose user performance; Track activity metrics
• So You Can: Improve performance; Improve conformance to SLAs; Improve QOS
• It’s Like: MySQL Enterprise Monitor; Quest Foglight for Oracle / SQL Server
Service & Configuration Manager
• It Helps You: Manage system services; Automate changes; Validate settings; 1-click security
• So You Can: Lower cost of administration; Improve uptime
• It’s Like: Red Hat Satellite Server; Microsoft System Center; Oracle Enterprise Manager
Resource Manager
• It Helps You: Report on the usage of scarce resources; Plan for capacity expansion
• So You Can: Improve quality of service; Extend the life of the cluster
• It’s Like: VMware vCenter
Authorization Manager
• It Helps You: Centralize management of all users, groups and privileges; Manage permissions via delegated administration
• So You Can: Lower the costs of administration; Improve compliance
• It’s Like: Teradata security administration
27. CDH Integrates with Existing IT Infrastructure
Integration categories: BI/Analytics, ETL, Databases, Cloud/OS, Hardware