© Hortonworks Inc. 2013
Hortonworks
Data Science with Hadoop: A Primer
Hadoop Summit, June 2013
Ofer Mendelevitch
ofer@hortonworks.com
@ofermend
Who am I?
currently <- c(
  role = "director of data sciences",
  company = "Hortonworks")
• Previously: Nor1, Yahoo!, Risk Insight, Quiver, etc.
• Blog: www.achessdad.com
What will I be talking about?
• What is Data Science?
• Hadoop and Data Science
• Use-cases: data science with Hadoop
• How to get started?
What is Data Science?
The art of building data products.
Data product: a software product whose core functionality relies on applying statistical (or machine learning) methods to data.
What is a data scientist?
A person who does this.
Data science & big data
With Hadoop…
The time and cost of building large-scale data products are dramatically reduced.
An Apache Hadoop Platform
Hortonworks Data Platform (HDP)
• Hadoop Core: distributed storage and processing (HDFS, MapReduce)
• Platform Services: enterprise readiness (HA, DR, snapshots, security, …)
• Data Services: store, process, and access data (HCatalog, Hive, Pig, HBase, Sqoop, Flume)
• Operational Services: manage and operate at scale (Oozie, Ambari)
• Runs on: OS / VM, cloud, appliance
A typical Big Data Architecture
• Data sources
  - Traditional sources (RDBMS, OLTP, OLAP)
  - New sources (web logs, email, sensor data, social media)
  - Mobile data; OLTP and POS systems
• Data systems
  - Traditional repositories: RDBMS, EDW, MPP
  - Hortonworks Data Platform
• Applications
  - Business analytics
  - Custom applications
  - Packaged applications
• Operational tools: manage and monitor
• Dev and data tools: build and test
Keys to Hadoop's power
• Computation co-located with data
  - Data and computation system co-designed to work together
• Affordable at scale
  - Uses "commodity" hardware nodes
  - Self-healing; failure handled by software
  - Very good at batch processing of large datasets
Hadoop improves productivity of data scientists
• All data in one place
  - Ability to store all the data in raw format
  - Data silo convergence
  - Data scientists will find innovative uses of combined data assets
• Data/compute capabilities available as a shared asset
  - Data scientists can quickly prototype a new idea without an up-front request for funding
Data-driven innovation is accelerated since Hadoop is "schema on read"
With a "schema change" project (timeline: start, 6 months, 9 months):
• "I need new data"
• "Finally, we start collecting"
• "Let me see… is it any good?"
With Hadoop (timeline: 3 months):
• "Let's just put it in a folder on HDFS"
• "Let me see… is it any good?"
• "My model is awesome!"
Hadoop is ideal for pre-processing of large raw datasets
Raw data -> processed data, through steps such as:
• Strip away HTML/PDF/DOC/PPT
• Entity resolution
• Term normalization
• Document vector generation
• Sampling, filtering
• Joins
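Three of these steps (stripping HTML, term normalization, document vector generation) can be sketched locally in Python with the standard library. This is an illustrative sketch under assumptions, not the talk's pipeline: `strip_html`, `normalize_terms`, and `document_vector` are hypothetical helper names, and on a cluster the same per-document logic would run inside a MapReduce, Pig, or Hive job.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(raw):
    """Return the text content of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return " ".join(parser.parts)


def normalize_terms(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def document_vector(raw):
    """Bag-of-words vector: term -> number of occurrences."""
    return Counter(normalize_terms(strip_html(raw)))


vector = document_vector("<p>Pump FAILED; replace <b>pump</b> seal</p>")
```

A production version would also need extractors for PDF, DOC, and PPT files, which are outside the standard library.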
In machine learning, very often: more data -> better outcomes (Banko & Brill, 2001)
• More examples to learn from
• More possible feature types
  - We're looking for the ones most useful for our task
Use-cases
A (partial) map of data science "tasks"
Discovery
• Clustering: detect natural groupings
• Outlier detection: detect anomalies
• Affinity analysis: co-occurrence patterns
Prediction
• Classification: predict a category
• Regression: predict a value
• Recommendation: predict a preference
Big Data Science: high energy physics, genomics, etc.
Use-case: product recommendation
• Inputs:
  - Explicit product ratings (when provided)
  - Implicit information: purchase transactions, page views, comments
• Data sources: ratings, page views, forum comments
Ratings matrix (rows: users U101, U102, U103, U104, U105, …; "?" marks an unknown rating):
Epic  X-Men  Hobbit  Argo  Pirates
5     2      4       ?     ?
?     ?      5       2     ?
1     2      ?       ?     3
?     2      3       1     5
Goal: predict a preference
Observed ratings (rows: users U101, U102, U103, U104, U105, …; "?" marks an unknown rating):
Epic  X-Men  Hobbit  Argo  Pirates
5     2      4       ?     ?
?     ?      5       2     ?
1     2      ?       ?     3
?     2      3       1     5
Predicted ratings (same users and movies, unknowns filled in):
Epic  X-Men  Hobbit  Argo  Pirates
5     2      4       1     3
4     1      5       2     3
1     2      4       1     3
3     2      3       1     5
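One common way to fill in unknown ratings is user-based collaborative filtering. The sketch below is a hedged illustration, not the method behind the slide's predicted values (the talk does not name one): it predicts an unknown rating as a similarity-weighted average of other users' ratings. The function names are hypothetical, and the rows are left unlabeled because the transcript does not say which user each row belongs to.

```python
from math import sqrt

ITEMS = ["Epic", "X-Men", "Hobbit", "Argo", "Pirates"]

# Observed ratings from the slide; None marks an unknown rating ("?").
RATINGS = [
    [5, 2, 4, None, None],
    [None, None, 5, 2, None],
    [1, 2, None, None, 3],
    [None, 2, 3, 1, 5],
]


def cosine(u, v):
    """Cosine similarity over the items both users rated (0.0 if none)."""
    common = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    if not common:
        return 0.0
    dot = sum(a * b for a, b in common)
    norm_u = sqrt(sum(a * a for a, _ in common))
    norm_v = sqrt(sum(b * b for _, b in common))
    return dot / (norm_u * norm_v)


def predict(ratings, user, item):
    """Similarity-weighted average of other users' ratings for one item."""
    num = den = 0.0
    for other, row in enumerate(ratings):
        if other == user or row[item] is None:
            continue
        weight = cosine(ratings[user], row)
        num += weight * row[item]
        den += weight
    return num / den if den else None


def fill(ratings):
    """Keep known ratings and replace each unknown with a prediction."""
    return [[r if r is not None else predict(ratings, u, i)
             for i, r in enumerate(row)]
            for u, row in enumerate(ratings)]


filled = fill(RATINGS)
```

With only four rows the similarities are very rough, so the output will not match the slide's completed matrix; the sketch only shows the shape of the computation.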
Using Hadoop for recommendation
[Diagram: data sources (transactions, page views, content) feed HDFS and MapReduce, where pre-processing, SQL, and custom logic run; the results feed online serving, which recommends.]
With Hadoop, we can process very large preference datasets.
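As a rough illustration of how a preference dataset can be processed in the MapReduce style, the Python sketch below counts how often two items appear in the same user's history, a typical building block for item-to-item recommendation. It is a local simulation under assumptions: `map_phase`, `reduce_phase`, and `run_job` are hypothetical names, the histories are invented sample data, and the grouping step stands in for the shuffle that Hadoop performs between mappers and reducers.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(user_id, items):
    """Emit ((item_a, item_b), 1) for every item pair in one user's history."""
    for a, b in combinations(sorted(set(items)), 2):
        yield (a, b), 1


def reduce_phase(pair, counts):
    """Sum the partial counts emitted for one item pair."""
    return pair, sum(counts)


def run_job(records):
    """Local stand-in for a MapReduce job: map, group by key, reduce."""
    grouped = defaultdict(list)
    for user_id, items in records:
        for key, value in map_phase(user_id, items):
            grouped[key].append(value)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical purchase histories, for illustration only.
HISTORY = [
    ("U101", ["Epic", "X-Men", "Hobbit"]),
    ("U102", ["Hobbit", "Argo"]),
    ("U104", ["X-Men", "Hobbit", "Argo"]),
]

co_occurrence = run_job(HISTORY)
```

Because each user's history is mapped independently and each pair is reduced independently, the same two functions scale out across a cluster without changes to their logic.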
Use-case: failure prediction
• Inputs:
  - Equipment history: install date, model, past issues
  - Equipment sensor data
  - Product catalog: product families, expected lifetime
• Data sources: history, sensor data, product catalog
SKU     Install date  Service person ID  Zip code  Avg temp  TTF (days)
113454  5/1/2011      1345               94002     72        180
998323  5/3/2009      3234               88321     68        450
345375  8/2/2005      1112               53323     82        332
…       …             …                  …
Building a prediction model
Labeled data -> model; unseen data -> model -> predicted TTF
Labeled data:
SKU     Install date  Service person ID  Zip code  Avg temp  TTF (days)
113454  5/1/2011      1345               94002     72        180
998323  5/3/2009      3234               88321     68        450
345375  8/2/2005      1112               53323     82        332
…       …             …                  …
Unseen data:
SKU     Install date  Service person ID  Zip code  Avg temp
332456  3/3/2013      1345               94005     71
442343  6/6/2013      1112               77485     67
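The labeled-data, model, unseen-data flow can be sketched with a one-feature least-squares regression in plain Python. This is only an illustration under assumptions: the talk does not say which model is used, three labeled rows are far too few to learn anything meaningful, and using average temperature as the single feature is an arbitrary choice. The helper names `fit_line` and `predict` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, x):
    """Apply a fitted (slope, intercept) model to one feature value."""
    slope, intercept = model
    return slope * x + intercept


# Labeled rows from the slide: (avg temp, TTF in days).
labeled = [(72, 180), (68, 450), (82, 332)]
model = fit_line([temp for temp, _ in labeled], [ttf for _, ttf in labeled])

# Unseen rows from the slide: SKU -> avg temp.
unseen = {"332456": 71, "442343": 67}
predicted_ttf = {sku: predict(model, temp) for sku, temp in unseen.items()}
```

In practice the feature set would include equipment age, model, past issues, and catalog attributes, and the fit would be validated on held-out labeled data before any prediction is trusted.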
Using Hadoop for failure prediction
• HDFS: central repository for all data
  - Service records (Word, PDF, etc.)
  - Equipment purchase transaction data
  - Product catalog: SKUs, model numbers, etc.
• Pre-process
  - Convert service records to item features: remove PDF formatting, detect entities in records
  - Normalize data using service records, product catalog
  - Create feature matrix; ready for modeling algorithm
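The final pre-processing step, turning joined records into a feature matrix, might look like the Python sketch below. Everything specific here is an assumption made for illustration: the catalog entry and its fields, the `to_features` helper, the choice of features, the reference date, and the reading of the slide's dates as month/day/year.

```python
from datetime import date, datetime

# Hypothetical product catalog keyed by SKU (field names are assumptions).
CATALOG = {"113454": {"family": "pump", "expected_lifetime_days": 365}}

# One equipment record in the shape of the table on the earlier slide.
RECORDS = [{"sku": "113454", "install_date": "5/1/2011", "avg_temp": 72}]


def to_features(record, catalog, as_of):
    """Join one record with the catalog and return a numeric feature row."""
    # Assumes dates are written month/day/year.
    installed = datetime.strptime(record["install_date"], "%m/%d/%Y").date()
    product = catalog[record["sku"]]
    return [
        (as_of - installed).days,                  # equipment age in days
        float(record["avg_temp"]),                 # sensor summary
        float(product["expected_lifetime_days"]),  # catalog attribute
    ]


feature_matrix = [to_features(r, CATALOG, date(2013, 6, 1)) for r in RECORDS]
```

Each row of `feature_matrix` is then ready to be paired with a TTF label and passed to a modeling algorithm.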
Use-case: SaaS application security
• Inputs:
  - Click-stream: user interaction with application
• Data sources: user data, clicks
User ID  User since  Logins/month  Avg DL KB/day  …
123456   1/3/2004    6             30
998323   5/3/2009    1             5
345375   8/2/2005    22            120
…        …           …             …
Detecting anomalous behavior records
• User access profile modeled as a vector of features
• Detect anomalies in application access patterns
  - Rules based
  - Machine learning based (determine "outlier factor": 0…1)
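A very simple way to produce an outlier score between 0 and 1 from feature vectors is sketched below in Python. It is an assumption-laden stand-in, not the talk's method (which is not specified): each feature is standardized, the mean absolute z-score is taken as a distance, and `d / (1 + d)` squashes it into the range 0 to 1. The `outlier_factors` name is hypothetical; the three profiles reuse the logins/month and average download values from the user table.

```python
from statistics import mean, pstdev


def outlier_factors(profiles):
    """Map each user's feature vector to a score in [0, 1); higher = more unusual."""
    columns = list(zip(*profiles.values()))
    centers = [mean(col) for col in columns]
    spreads = [pstdev(col) for col in columns]
    scores = {}
    for user, vector in profiles.items():
        # Absolute z-score per feature; a constant feature contributes 0.
        zs = [abs(x - c) / s if s else 0.0
              for x, c, s in zip(vector, centers, spreads)]
        distance = mean(zs)
        scores[user] = distance / (1.0 + distance)
    return scores


# User ID -> (logins per month, average download KB per day).
PROFILES = {"123456": (6, 30), "998323": (1, 5), "345375": (22, 120)}
scores = outlier_factors(PROFILES)
most_unusual = max(scores, key=scores.get)
```

Density-based or clustering-based detectors would be the more usual choice on real access profiles; this version only shows how a vector of features becomes a single bounded score.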
Using Hadoop for anomaly detection
• HDFS: central repository for all raw data
  - Raw user-access logs
  - User information (organization, demographics)
• Pre-process
  - Build a (behavioral) access profile for each user
• Detect anomalies
  - In Hadoop
  - Using existing tools: R, SAS, rules engine, etc.
How do I get started?
Getting started with data science on Hadoop
1. Pick a good use-case that delivers immediate business value
2. Implement a proof-of-value (POV)
3. Build a team (hire/train)
Implement a proof-of-value
• Put together a Hadoop cluster
• Define the POV business use-case
• Pull the raw data you need into the cluster
• Build it
• Show the business value of your data assets
Contact us. We can help!
Build a team: the data scientist skillset continuum
Continuum: software engineer, data engineer, data scientist, applied scientist, research scientist
Data Engineer
• Function: builds production-grade data products
• Good at:
  - Data and systems architecture
  - Hadoop, Pig/Hive, MapReduce, Mahout
  - Java, Python, Perl, SQL, C++, etc.
  - NoSQL (HBase, Cassandra, Mongo)
Applied Scientist
• Function: finds signal/meaning in the data; applies statistical/ML models and tunes the algorithm
• Good at:
  - Statistics, machine learning
  - Text processing, NLP
  - R, Matlab, SAS, SQL
  - Scripting, prototyping
  - Visualization / telling the story
Thank you!
Any Questions?
Ofer Mendelevitch
Director, Data Sciences @ Hortonworks
ofer@hortonworks.com
@ofermend
We're hiring!
Data Science training: www.hortonworks.com/training

Floating on a RAFT: HBase Durability with Apache Ratis
Ā 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Ā 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
Ā 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Ā 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
Ā 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
Ā 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
Ā 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Ā 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Ā 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Ā 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
Ā 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
Ā 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Ā 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
Ā 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Ā 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Ā 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Ā 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
Ā 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Ā 

Recently uploaded

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
Ā 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
Ā 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
Ā 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
Ā 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
Ā 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
Ā 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
Ā 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
Ā 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
Ā 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
Ā 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
Ā 
Integration and Automation in Practice: CI/CD in MuleĀ Integration and Automat...
Integration and Automation in Practice: CI/CD in MuleĀ Integration and Automat...Integration and Automation in Practice: CI/CD in MuleĀ Integration and Automat...
Integration and Automation in Practice: CI/CD in MuleĀ Integration and Automat...Patryk Bandurski
Ā 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
Ā 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
Ā 
Bun (KitWorks Team Study ė…øė³„ė§ˆė£Ø ė°œķ‘œ 2024.4.22)
Bun (KitWorks Team Study ė…øė³„ė§ˆė£Ø ė°œķ‘œ 2024.4.22)Bun (KitWorks Team Study ė…øė³„ė§ˆė£Ø ė°œķ‘œ 2024.4.22)
Bun (KitWorks Team Study ė…øė³„ė§ˆė£Ø ė°œķ‘œ 2024.4.22)Wonjun Hwang
Ā 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
Ā 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
Ā 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
Ā 
Anypoint Exchange: Itā€™s Not Just a Repo!
Anypoint Exchange: Itā€™s Not Just a Repo!Anypoint Exchange: Itā€™s Not Just a Repo!
Anypoint Exchange: Itā€™s Not Just a Repo!Manik S Magar
Ā 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
Ā 

Recently uploaded (20)

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
Ā 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
Ā 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
Ā 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Ā 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
Ā 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Ā 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
Ā 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
Ā 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
Ā 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
Ā 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
Ā 
Integration and Automation in Practice: CI/CD in MuleĀ Integration and Automat...
Integration and Automation in Practice: CI/CD in MuleĀ Integration and Automat...Integration and Automation in Practice: CI/CD in MuleĀ Integration and Automat...
Integration and Automation in Practice: CI/CD in MuleĀ Integration and Automat...
Ā 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
Ā 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
Ā 
Bun (KitWorks Team Study ė…øė³„ė§ˆė£Ø ė°œķ‘œ 2024.4.22)
Bun (KitWorks Team Study ė…øė³„ė§ˆė£Ø ė°œķ‘œ 2024.4.22)Bun (KitWorks Team Study ė…øė³„ė§ˆė£Ø ė°œķ‘œ 2024.4.22)
Bun (KitWorks Team Study ė…øė³„ė§ˆė£Ø ė°œķ‘œ 2024.4.22)
Ā 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
Ā 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
Ā 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
Ā 
Anypoint Exchange: Itā€™s Not Just a Repo!
Anypoint Exchange: Itā€™s Not Just a Repo!Anypoint Exchange: Itā€™s Not Just a Repo!
Anypoint Exchange: Itā€™s Not Just a Repo!
Ā 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
Ā 

Data Science with Hadoop: A Primer

  • 1. © Hortonworks Inc. 2013. Hortonworks: Data Science with Hadoop – A Primer. Hadoop Summit, June 2013. Ofer Mendelevitch, ofer@hortonworks.com, @ofermend
  • 2. Who am I? currently <- c(role="director of data sciences", company="Hortonworks") • Previously: Nor1, Yahoo!, Risk Insight, Quiver, etc. • Blog: www.achessdad.com
  • 3. What will I be talking about? • What is Data Science? • Hadoop and Data Science • Use-cases: data science with Hadoop • How to get started?
  • 4. What is Data Science? The art of building data products. Data product: a software product whose core functionality relies on applying statistical (or machine learning) methods to data. What is a data scientist? A person who does this.
  • 5. Data science & big data
  • 6. With Hadoop… Time and cost of building large-scale data products is dramatically reduced
  • 7. An Apache Hadoop Platform: Hortonworks Data Platform (HDP). [Diagram labels] Hadoop Core (distributed storage & processing): HDFS, MapReduce. Platform Services (enterprise readiness): HA, DR, snapshots, security, … Data Services (store, process and access data): HCatalog, Hive, Pig, HBase, Sqoop, Flume. Operational Services (manage & operate at scale): Oozie, Ambari. Deployment targets: OS / VM, cloud, appliance.
  • 8. A typical Big Data Architecture. [Diagram labels] Data sources: mobile data; OLTP, POS systems; traditional sources (RDBMS, OLTP, OLAP); new sources (web logs, email, sensor data, social media). Data systems: traditional repos (RDBMS, EDW, MPP); Hortonworks Data Platform. Applications: business analytics, custom applications, packaged applications. Operational tools: manage & monitor. Dev & data tools: build & test.
  • 9. Keys to Hadoop’s power • Computation co-located with data – Data and computation system co-designed to work together • Affordable at scale – Use “commodity” hardware nodes – Self-healing; failure handled by software – Very good at batch processing of large datasets
  • 10. Hadoop improves productivity of data scientists • All data in one place – Ability to store all the data in raw format – Data silo convergence – Data scientists will find innovative uses of combined data assets • Data/compute capabilities available as shared asset – Data scientists can quickly prototype a new idea without an up-front request for funding
  • 11. Data-driven innovation is accelerated since Hadoop is “schema on read”. [Timeline figure] Traditional path (timeline marks: start, 6 months, 9 months): “I need new data”; “Schema change” project; “Finally, we start collecting”; “Let me see… is it any good?” Hadoop path (timeline mark: 3 months): “Let’s just put it in a folder on HDFS”; “Let me see… is it any good?”; “My model is awesome!”
  • 12. Hadoop is ideal for pre-processing of large raw datasets. [Diagram: raw data to processed data] Steps shown: strip away HTML/PDF/DOC/PPT; entity resolution; document vector generation; sampling, filtering; joins; term normalization.
  • 13. In machine learning, very often: more data -> better outcomes (Banko & Brill, 2001) • More examples to learn from • More possible feature types – We’re looking for the most useful for our task
  • 14. Use-cases
  • 15. A (partial) map of data science “tasks”. Discovery: clustering (detect natural groupings); outlier detection (detect anomalies); affinity analysis (co-occurrence patterns). Prediction: classification (predict a category); regression (predict a value); recommendation (predict a preference). Big data science: high energy physics, genomics, etc.
  • 16. Use-case: product recommendation • Inputs: – Explicit product ratings (when provided) – Implicit information: purchase transactions, page views, comments. [Figure] Data sources: ratings, page views, forum comments. Ratings matrix with columns Epic, X-Men, Hobbit, Argo, Pirates and users U101, U102, U103, U104, U105, …; rows as shown: 5 2 4 ? ? | ? ? 5 2 ? | 1 2 ? ? 3 | ? 2 3 1 5
  • 17. Goal: predict a preference. [Figure] Columns: Epic, X-Men, Hobbit, Argo, Pirates; users U101, U102, U103, U104, U105, … Observed ratings: 5 2 4 ? ? | ? ? 5 2 ? | 1 2 ? ? 3 | ? 2 3 1 5. Completed ratings: 5 2 4 1 3 | 4 1 5 2 3 | 1 2 4 1 3 | 3 2 3 1 5
  • 18. Using Hadoop for recommendation. [Diagram labels] Data sources: transactions, page views, content. HDFS; MapReduce; pre-process; SQL; custom logic; recommend; online serving. With Hadoop, we can process very large preference datasets.
  • 19. Use-case: failure prediction • Inputs: – Equipment history: install date, model, past issues – Equipment sensor data – Product catalog: product families, expected lifetime. [Figure] Data sources: history, sensor data, product catalog. Table columns: SKU | Install date | Service Person ID | Zip code | Avg temp | TTF (days). Rows: 113454 | 5/1/2011 | 1345 | 94002 | 72 | 180; 998323 | 5/3/2009 | 3234 | 88321 | 68 | 450; 345375 | 8/2/2005 | 1112 | 53323 | 82 | 332; …
  • 20. Building a prediction model. [Diagram labels: labeled data, model, unseen data, TTF] Labeled data, columns: SKU | Install date | Service Person ID | Zip code | Avg temp | TTF (days). Rows: 113454 | 5/1/2011 | 1345 | 94002 | 72 | 180; 998323 | 5/3/2009 | 3234 | 88321 | 68 | 450; 345375 | 8/2/2005 | 1112 | 53323 | 82 | 332; … Unseen data, columns: SKU | Install date | Service Person ID | Zip code | Avg temp. Rows: 332456 | 3/3/2013 | 1345 | 94005 | 71; 442343 | 6/6/2013 | 1112 | 77485 | 67
  • 21. Using Hadoop for failure prediction • HDFS: central repository for all data – Service records (Word, PDF, etc.) – Equipment purchase transaction data – Product catalog: SKUs, model numbers, etc. • Pre-process – Convert service records to item features: remove PDF formatting, detect entities in records – Normalize data using service records, product catalog – Create feature matrix; ready for modeling algorithm
  • 22. Use-case: SaaS application security • Inputs: – Click-stream: user interaction with application. [Figure] Data sources: user data, clicks. Table columns: User ID | User since | Logins/month | Avg DL KB/day | … Rows: 123456 | 1/3/2004 | 6 | 30; 998323 | 5/3/2009 | 1 | 5; 345375 | 8/2/2005 | 22 | 120; …
  • 23. Detecting anomalous behavior records • User access profile modeled as vector of features • Detect anomalies in application access patterns – Rules based – Machine learning based (determine “outlier factor”: 0…1)
  • 24. Using Hadoop for anomaly detection • HDFS: central repository for all raw data – Raw user-access logs – User information (organization, demographics) • Pre-process – Build access-profile (behavioral) for each user • Detect anomalies – In Hadoop – Using existing tools: R, SAS, rules engine, etc.
  • 25. How do I get started?
  • 26. Getting started with data science on Hadoop: 1. Pick a good use-case that delivers immediate business value 2. Implement a proof-of-value (POV) 3. Build a team (hire/train)
  • 27. Implement a proof-of-value • Put together a Hadoop cluster • Define the POV business use-case • Pull raw data you need into the cluster • Build it • Show the business value of your data assets. Contact us. We can help!
  • 28. Build a team: The data scientist skillset continuum. [Continuum labels] Software engineer, Research Scientist, Data Engineer, Data Scientist, Applied Scientist. Role: Data Engineer | Applied Scientist. Function: builds production-grade data products | finds signal/meaning in the data; applies statistical/ML models and tunes the algorithm. Good at…: data and systems architecture; Hadoop, Pig/Hive, MapReduce, Mahout; Java, Python, Perl, SQL, C++, etc.; NoSQL (HBase, Cassandra, Mongo) | statistics, machine learning; text processing, NLP; R, Matlab, SAS, SQL; scripting, prototyping; visualization / telling the story
  • 29. Thank you! Any Questions? Ofer Mendelevitch, Director, Data Sciences @ Hortonworks, ofer@hortonworks.com, @ofermend. We’re hiring! Data Science training: www.hortonworks.com/training
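
The "predict a preference" goal in the recommendation use-case can be illustrated with a small sketch. The deck does not say which algorithm fills in the unknown ratings, so the method below is an assumption chosen for brevity: a baseline predictor that estimates each missing rating as the global mean plus a per-user offset and a per-item offset, clipped to the 1 to 5 scale. The function name fill_missing is hypothetical, the four rows are taken in slide order without assuming which user ID each belongs to, and the output is not expected to reproduce the completed matrix shown on the slide.

```python
from statistics import mean

# Column order as shown on the slide.
ITEMS = ["Epic", "X-Men", "Hobbit", "Argo", "Pirates"]

# Observed ratings from the slide, in slide order; None marks an unknown ("?") rating.
RATINGS = [
    [5, 2, 4, None, None],
    [None, None, 5, 2, None],
    [1, 2, None, None, 3],
    [None, 2, 3, 1, 5],
]


def fill_missing(ratings, lo=1.0, hi=5.0):
    """Fill unknown ratings with global mean + user offset + item offset.

    This is an illustrative baseline (an assumption), not the method
    used in the original talk.
    """
    known = [r for row in ratings for r in row if r is not None]
    mu = mean(known)

    # Average deviation of each user's known ratings from the global mean.
    user_offset = []
    for row in ratings:
        devs = [r - mu for r in row if r is not None]
        user_offset.append(mean(devs) if devs else 0.0)

    # Average deviation of each item's known ratings from the global mean.
    item_offset = []
    for j in range(len(ratings[0])):
        devs = [row[j] - mu for row in ratings if row[j] is not None]
        item_offset.append(mean(devs) if devs else 0.0)

    completed = []
    for u, row in enumerate(ratings):
        filled = []
        for j, r in enumerate(row):
            if r is not None:
                filled.append(float(r))  # keep observed ratings unchanged
            else:
                estimate = mu + user_offset[u] + item_offset[j]
                filled.append(min(hi, max(lo, estimate)))  # clip to rating scale
        completed.append(filled)
    return completed


if __name__ == "__main__":
    for row in fill_missing(RATINGS):
        print(" ".join(f"{value:.1f}" for value in row))
```

On a real preference dataset the same per-user and per-item aggregates would be computed at scale (for example as grouped averages in the pre-processing stage) rather than in memory; that mapping is a suggestion, not something the slides specify.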

Editor's Notes

  1. Data science is not new. But now we need to do it with much larger datasets.
  2. As the volume of data has exploded, we increasingly see organizations acknowledge that not all data belongs in a traditional database. The drivers are both cost (as volumes grow, database licensing costs can become prohibitive) and technology (databases are not optimized for very large datasets). Instead, we increasingly see Hadoop, and HDP in particular, being introduced as a complement to the traditional approaches. It does not replace the database; it complements it, and as such must integrate easily with existing tools and approaches. This means it must interoperate with: existing applications, such as Tableau, SAS, Business Objects, etc.; existing databases and data warehouses, for loading data to and from the data warehouse; development tools used for building custom applications; and operational tools for managing and monitoring.