SlideShare a Scribd company logo
1 of 32
© comScore, Inc. Proprietary.
Using Hadoop to Process a
Trillion+ Events
Michael Brown, CTO | September 23rd, 2013
© comScore, Inc. Proprietary. 2
comScore is a leading internet technology company that
provides Analytics for a Digital World™
NASDAQ SCOR
Clients 2,100+ Worldwide
Employees 1,000+
Headquarters Reston, Virginia, USA
Global Coverage Measurement from 172 Countries; 44 Markets Reported
Local Presence 32 Locations in 23 Countries
Big Data Over 1 Trillion Digital Interactions Captured Monthly
V0113
© comScore, Inc. Proprietary.
Broad Client Base and Deep Expertise Across Key Industries
Media Agencies Telecom/Mobile Financial Retail Travel CPG Pharma Technology
V0910
© comScore, Inc. Proprietary.
Panel Heat Map
© comScore, Inc. Proprietary.
CENSUS
Unified Digital Measurement™ (UDM) Establishes Platform For
Panel + Census Data Integration
PANEL
Unified Digital Measurement (UDM)
Patent-Pending Methodology
Adopted by 90% of Top 100 U.S. Media Properties
Global PERSON
Measurement
Global DEVICE
Measurement
V0411
© comScore, Inc. Proprietary. 6
0
200,000,000,000
400,000,000,000
600,000,000,000
800,000,000,000
1,000,000,000,000
1,200,000,000,000
1,400,000,000,000
1,600,000,000,000
1,800,000,000,000
2,000,000,000,000
Jul
Aug
Sep
Oct
Nov
Dec
Jan
Feb
Mar
Apr
May
Jun
Jul
Aug
Sep
Oct
Nov
Dec
Jan
Feb
Mar
Apr
May
Jun
Jul
Aug
Sep
Oct
Nov
Dec
Jan
Feb
Mar
Apr
May
Jun
Jul
Aug
Sep
Oct
Nov
Dec
Jan
Feb
Mar
Apr
May
Jun
Jul
Aug
2009 2010 2011 2012 2013
#ofrecords
Panel Records Beacon Records
Total records collected in August 2013
1,729,895,147,710
Worldwide Tags per Day
© comScore, Inc. Proprietary.
Worldwide UDM™ Penetration
December 2012 Penetration Data
Europe
Austria 87%
Belgium 93%
Switzerland 89%
Germany 92%
Denmark 88%
Spain 95%
Finland 93%
France 92%
Ireland 90%
Italy 90%
Netherlands 93%
Norway 91%
Portugal 92%
Sweden 90%
United Kingdom 92%
Asia Pacific
Australia 90%
Hong Kong 95%
India 92%
Japan 82%
Malaysia 93%
New Zealand 91%
Singapore 92%
North America
Canada 94%
United States 91%
Latin America
Argentina 95%
Brazil 96%
Chile 94%
Colombia 95%
Mexico 93%
Puerto Rico 92%
Middle East & Africa
Israel 92%
South Africa 78%
Percentage of Machines Included in UDM Measurement
© comScore, Inc. Proprietary.
High Level Data Flow
Panel
Census
Custom Code +
Delivery
© comScore, Inc. Proprietary.
Our Cluster
Production Hadoop Cluster
 224 nodes: Mix of Dell 720xd, R710 and R510 servers
 Each R720xd has (24x1.2TB drives; 64GB RAM; 24 cores)
 6300+ total CPUs
 13.3TB total memory
 4.3PB total disk space
 Our distro is MapR M5 2.1.3
© comScore, Inc. Proprietary.
The Project:
vCE – Validated Campaign Essentials
© comScore, Inc. Proprietary. 11
 vCE provides real-time, cloud-
based, on-demand monitoring and
optimization of digital advertising
campaigns
 Deep industry penetration
 22 of the Top 25 Largest Global
Advertisers, representing 89% of
global ad dollars, are vCE/CE
clients*
 Includes ALL Top 10 CPG
Advertisers*
What is vCE?
*Source: AdAge 2012 Top 25 Global Advertisers (directly or through their advertising agency)
Allstate
© comScore, Inc. Proprietary.
comScore - vCE
© comScore, Inc. Proprietary.
The Problem Statement
Calculate the number of events and unique cookies for each reportable
campaign element
Key take away
 Data on input will be aggregated daily
 Need to process all data for 3 months
 Need to calculate values for every day in the 92 day period spanning all
reportable campaign elements
© comScore, Inc. Proprietary.
Structure of the Required Output
Client Campaign Population Location Cookie Ct Period
1234 160873284 840 1 863,185 1
1234 160873284 840 1 1,719,738 2
1234 160873284 840 1 2,631,624 3
1234 160873284 840 1 3,572,163 4
1234 160873284 840 1 4,445,508 5
1234 160873284 840 1 5,308,532 6
1234 160873284 840 1 6,032,073 7
1234 160873284 840 1 6,710,645 8
1234 160873284 840 1 7,421,258 9
1234 160873284 840 1 8,154,543 10
© comScore, Inc. Proprietary.
Counting Uniques from a Time Ordered Log File
A
B
C
D
B
A
A
Major Downsides:
Need to keep all key elements in memory.
Constrained to one machine for final aggregation.
© comScore, Inc. Proprietary.
First Version
Java Map-Reduce application which processes pre-aggregated data from 92 days
Map reads the data and emits each cookie as the key of the key value pair
All 130B records go though the shuffle
Each Reducer will get all the data for a particular campaign sorted by cookie
Reducer aggregates the data by grouping key ( Client / Campaign / Population ) and calculates
unique cookies for period 1-92
Volume Grew rapidly to the point the daily processing took more than a day
© comScore, Inc. Proprietary.
M/R Data Flow
CB
Mapper MapperMapperMap Map Map
Reduce ReduceReduce
BA AC
AA BB CC
A B C
© comScore, Inc. Proprietary.
Scaling Issue
As our volume has grown we have the following stats:
 Over 500 billion events per month
 Daily Aggregate 1.5 billion
 130 billion aggregate records for 92 days
 70K Campaigns
 Over 50 countries
 We see 15 billion distinct cookies in a month
 We only need to output 25 million rows
© comScore, Inc. Proprietary.
Basic Approach Retrospective
Processing speed is not scaling to our needs on a sample of the input data
Diagnosis
 Most aggregations could not take significant advantage of combiners.
 Large shuffles caused poor job performance. In some cases large aggregations ran slower on the
Hadoop cluster due to shuffle and skew in data for keys.
Diagnosis
 A new approach is required to reduce the shuffle
© comScore, Inc. Proprietary.
Counting Uniques from a Key Ordered Log File
A
D
B
C
B
A
A
Major Downsides:
Need to sort data in advance.
The sort time increases as volume grows.
© comScore, Inc. Proprietary.
Counting Uniques from a Key Ordered Log File
© comScore, Inc. Proprietary.
Counting Uniques from Sharded Key Ordered Log Files
© comScore, Inc. Proprietary.
Solution to reduce the shuffle
The Problem:
 Most aggregations within comScore can not take advantage of combiners, leading to large shuffles and
job performance issues
The Idea:
 Partition and sort the data by cookie on a daily basis
 Create a custom InputFormat to merge daily partitions for monthly aggregations
© comScore, Inc. Proprietary.
Custom Input Format with Map Side Aggregation
CB
Mapper MapperMapperMap Map Map
Reduce ReduceReduce
BA AC
A B C
A B C
Combiner Combiner Combiner
A B C
© comScore, Inc. Proprietary.
Risks for Partitioning
Data locality
 Custom InputFormat requires reading blocks of the partitioned data over the network
 This was solved using a feature of the MapR file system. We created volumes and set the chunk size to
zero which guarantees that the data written to a volume will stay on one node
Map failures might result in long run times
 Size of the map inputs is no longer set by block size
 This was solved by creating a large number (10K) of volumes to limit the size of data processed by each
mapper
© comScore, Inc. Proprietary.
Partitioning Summary
Benefits:
 A large portion of the aggregation can be completed in the map phase
 Applications can now take advantage of combiners
 Shuffles sizes are minimal
Results:
 Took a job from 35 hours to 3 hours with no hardware changes
© comScore, Inc. Proprietary.
DMX & comScore
© comScore, Inc. Proprietary.
DMX use at comScore
We use DMX from Syncsort across hundreds of servers for efficient data
processing and aggregation.
We currently run over 100+ unique jobs every day.
With these jobs we process over 150 billion rows of data through DMX!
Connect
Design
Process Accelerate
© comScore, Inc. Proprietary.
Compression w/Sorting
Compress Log Files when processing large volumes of log data
Several advantages to Sorting Data First:
 Reduces the size of the data
 Improves application performance
Examples:
 1 Hour of one source of our data (313 GB raw, 815 million rows)
 Standard compression of time ordered data is 93GB (30% of original)
 Standard compression on a 2 key sorted set is 56GB (18% of original)
 For one day it saves 800GB
When applied to all our sources we save
 4.5 TB per day
 137 TB per month
 412TB per quarter
© comScore, Inc. Proprietary.
TCO with Large Cluster Systems
Examine the ability to sort data to reduce disk usage
Example:
Hadoop cluster that needs to support 100TB of base compressed data
Hypothetical Configurations @ 75% disk utilization:
 Replication Factor of 3 using 1.2 TB drives
R710 (6x drives, JBOD); requires 26 servers
R510 (12x drives JBOD); requires 52 servers
R720xd (24x drives JBOD); requires 13 servers
© comScore, Inc. Proprietary.
Useful Factoids
Visit www.comscoredatamine.com or follow @datagems for the latest gems.
Colorful, bite-sized graphical representations of the best discoveries we unearth.
© comScore, Inc. Proprietary.
Thank You!
Michael Brown
CTO
comScore, Inc.
mbrown@comscore.com

More Related Content

What's hot

Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareMapR Technologies
 
State of the Art Robot Predictive Maintenance with Real-time Sensor Data
State of the Art Robot Predictive Maintenance with Real-time Sensor DataState of the Art Robot Predictive Maintenance with Real-time Sensor Data
State of the Art Robot Predictive Maintenance with Real-time Sensor DataMathieu Dumoulin
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsMapR Technologies
 
CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016Mathieu Dumoulin
 
Big data processing with PubSub, Dataflow, and BigQuery
Big data processing with PubSub, Dataflow, and BigQueryBig data processing with PubSub, Dataflow, and BigQuery
Big data processing with PubSub, Dataflow, and BigQueryThuyen Ho
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageMapR Technologies
 
Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...
Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...
Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...Mathieu Dumoulin
 
Innovating to Create a Brighter Future for AI, HPC, and Big Data
Innovating to Create a Brighter Future for AI, HPC, and Big DataInnovating to Create a Brighter Future for AI, HPC, and Big Data
Innovating to Create a Brighter Future for AI, HPC, and Big Datainside-BigData.com
 
Distributed graph mining
Distributed graph miningDistributed graph mining
Distributed graph miningSayeed Mahmud
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data AnalyticsMapR Technologies
 
Big Data and High Performance Computing Solutions in the AWS Cloud
Big Data and High Performance Computing Solutions in the AWS CloudBig Data and High Performance Computing Solutions in the AWS Cloud
Big Data and High Performance Computing Solutions in the AWS CloudAmazon Web Services
 
Streaming Goes Mainstream: New Architecture & Emerging Technologies for Strea...
Streaming Goes Mainstream: New Architecture & Emerging Technologies for Strea...Streaming Goes Mainstream: New Architecture & Emerging Technologies for Strea...
Streaming Goes Mainstream: New Architecture & Emerging Technologies for Strea...MapR Technologies
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMapR Technologies
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureMapR Technologies
 
Large-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC WorkloadsLarge-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC Workloadsinside-BigData.com
 
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...Alluxio, Inc.
 
GPUdb: A Distributed Database for Many-Core Devices
GPUdb: A Distributed Database for Many-Core DevicesGPUdb: A Distributed Database for Many-Core Devices
GPUdb: A Distributed Database for Many-Core Devicesinside-BigData.com
 

What's hot (19)

Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in Healthcare
 
State of the Art Robot Predictive Maintenance with Real-time Sensor Data
State of the Art Robot Predictive Maintenance with Real-time Sensor DataState of the Art Robot Predictive Maintenance with Real-time Sensor Data
State of the Art Robot Predictive Maintenance with Real-time Sensor Data
 
Penny Pinching at Scale
Penny Pinching at ScalePenny Pinching at Scale
Penny Pinching at Scale
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and Analytics
 
CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016CEP - simplified streaming architecture - Strata Singapore 2016
CEP - simplified streaming architecture - Strata Singapore 2016
 
Big data processing with PubSub, Dataflow, and BigQuery
Big data processing with PubSub, Dataflow, and BigQueryBig data processing with PubSub, Dataflow, and BigQuery
Big data processing with PubSub, Dataflow, and BigQuery
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
 
Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...
Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...
Streaming Architecture to Connect Everything (Including Hybrid Cloud) - Strat...
 
Innovating to Create a Brighter Future for AI, HPC, and Big Data
Innovating to Create a Brighter Future for AI, HPC, and Big DataInnovating to Create a Brighter Future for AI, HPC, and Big Data
Innovating to Create a Brighter Future for AI, HPC, and Big Data
 
Distributed graph mining
Distributed graph miningDistributed graph mining
Distributed graph mining
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics
 
Big Data and High Performance Computing Solutions in the AWS Cloud
Big Data and High Performance Computing Solutions in the AWS CloudBig Data and High Performance Computing Solutions in the AWS Cloud
Big Data and High Performance Computing Solutions in the AWS Cloud
 
Streaming Goes Mainstream: New Architecture & Emerging Technologies for Strea...
Streaming Goes Mainstream: New Architecture & Emerging Technologies for Strea...Streaming Goes Mainstream: New Architecture & Emerging Technologies for Strea...
Streaming Goes Mainstream: New Architecture & Emerging Technologies for Strea...
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model Management
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data Capture
 
Large-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC WorkloadsLarge-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC Workloads
 
HDF Data in the Cloud
HDF Data in the CloudHDF Data in the Cloud
HDF Data in the Cloud
 
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
 
GPUdb: A Distributed Database for Many-Core Devices
GPUdb: A Distributed Database for Many-Core DevicesGPUdb: A Distributed Database for Many-Core Devices
GPUdb: A Distributed Database for Many-Core Devices
 

Viewers also liked

Robert Moakler, Data Science Intern, Integral Ad Science at MLconf SEA - 5/01/15
Robert Moakler, Data Science Intern, Integral Ad Science at MLconf SEA - 5/01/15Robert Moakler, Data Science Intern, Integral Ad Science at MLconf SEA - 5/01/15
Robert Moakler, Data Science Intern, Integral Ad Science at MLconf SEA - 5/01/15MLconf
 
An Introduction to Causal Discovery, a Bayesian Network Approach
An Introduction to Causal Discovery, a Bayesian Network ApproachAn Introduction to Causal Discovery, a Bayesian Network Approach
An Introduction to Causal Discovery, a Bayesian Network ApproachCOST action BM1006
 
Integral Ad Science Digital Ad Fraud Presentation
Integral Ad Science Digital Ad Fraud PresentationIntegral Ad Science Digital Ad Fraud Presentation
Integral Ad Science Digital Ad Fraud PresentationIntegral Ad Science
 
Integral Ad Science Viewability Presentation
Integral Ad Science Viewability PresentationIntegral Ad Science Viewability Presentation
Integral Ad Science Viewability PresentationIntegral Ad Science
 
Mystery Shopping Inside the Ad-Verification Bubble
Mystery Shopping Inside the Ad-Verification BubbleMystery Shopping Inside the Ad-Verification Bubble
Mystery Shopping Inside the Ad-Verification BubbleShailin Dhar
 

Viewers also liked (6)

Robert Moakler, Data Science Intern, Integral Ad Science at MLconf SEA - 5/01/15
Robert Moakler, Data Science Intern, Integral Ad Science at MLconf SEA - 5/01/15Robert Moakler, Data Science Intern, Integral Ad Science at MLconf SEA - 5/01/15
Robert Moakler, Data Science Intern, Integral Ad Science at MLconf SEA - 5/01/15
 
2015 Q4 Media Quality Report
2015 Q4 Media Quality Report2015 Q4 Media Quality Report
2015 Q4 Media Quality Report
 
An Introduction to Causal Discovery, a Bayesian Network Approach
An Introduction to Causal Discovery, a Bayesian Network ApproachAn Introduction to Causal Discovery, a Bayesian Network Approach
An Introduction to Causal Discovery, a Bayesian Network Approach
 
Integral Ad Science Digital Ad Fraud Presentation
Integral Ad Science Digital Ad Fraud PresentationIntegral Ad Science Digital Ad Fraud Presentation
Integral Ad Science Digital Ad Fraud Presentation
 
Integral Ad Science Viewability Presentation
Integral Ad Science Viewability PresentationIntegral Ad Science Viewability Presentation
Integral Ad Science Viewability Presentation
 
Mystery Shopping Inside the Ad-Verification Bubble
Mystery Shopping Inside the Ad-Verification BubbleMystery Shopping Inside the Ad-Verification Bubble
Mystery Shopping Inside the Ad-Verification Bubble
 

Similar to Syncsort & comScore Big Data Warehouse Meetup Sept 2013

How to Suceed in Hadoop
How to Suceed in HadoopHow to Suceed in Hadoop
How to Suceed in HadoopPrecisely
 
Control m customers using big data
Control m customers using big dataControl m customers using big data
Control m customers using big dataJuliette Smit
 
Concept to production Nationwide Insurance BigInsights Journey with Telematics
Concept to production Nationwide Insurance BigInsights Journey with TelematicsConcept to production Nationwide Insurance BigInsights Journey with Telematics
Concept to production Nationwide Insurance BigInsights Journey with TelematicsSeeling Cheung
 
Lessons from handling up to 26 Billion transactions a day - The Weather Compa...
Lessons from handling up to 26 Billion transactions a day - The Weather Compa...Lessons from handling up to 26 Billion transactions a day - The Weather Compa...
Lessons from handling up to 26 Billion transactions a day - The Weather Compa...Derek Baron
 
Bmc joe goldberg
Bmc joe goldbergBmc joe goldberg
Bmc joe goldbergBigDataExpo
 
Microsoft Windows Azure - EBC Deck June 2010 Presentation
Microsoft Windows Azure -  EBC Deck June 2010 PresentationMicrosoft Windows Azure -  EBC Deck June 2010 Presentation
Microsoft Windows Azure - EBC Deck June 2010 PresentationMicrosoft Private Cloud
 
Jazz for Service Management
Jazz for Service ManagementJazz for Service Management
Jazz for Service ManagementIBM Danmark
 
BMC Discovery with new Multi-Cloud Function
BMC Discovery with new Multi-Cloud FunctionBMC Discovery with new Multi-Cloud Function
BMC Discovery with new Multi-Cloud FunctionBill Spinner
 
AWS Summit Berlin 2013 - Big Data Analytics
AWS Summit Berlin 2013 - Big Data AnalyticsAWS Summit Berlin 2013 - Big Data Analytics
AWS Summit Berlin 2013 - Big Data AnalyticsAWS Germany
 
Microsoft: Ride the new opportunity with the Microsoft Cloud Platform
Microsoft: Ride the new opportunity with the Microsoft Cloud PlatformMicrosoft: Ride the new opportunity with the Microsoft Cloud Platform
Microsoft: Ride the new opportunity with the Microsoft Cloud PlatformGabriele Bozzi
 
Utilizing Aster nCluster to support processing in excess of 100 Billion rows ...
Utilizing Aster nCluster to support processing in excess of 100 Billion rows ...Utilizing Aster nCluster to support processing in excess of 100 Billion rows ...
Utilizing Aster nCluster to support processing in excess of 100 Billion rows ...Teradata Aster
 
Why You Need to Move Your Website to the Cloud
Why You Need to Move Your Website to the CloudWhy You Need to Move Your Website to the Cloud
Why You Need to Move Your Website to the CloudEktron
 
Building a real-time, scalable and intelligent programmatic ad buying platform
Building a real-time, scalable and intelligent programmatic ad buying platformBuilding a real-time, scalable and intelligent programmatic ad buying platform
Building a real-time, scalable and intelligent programmatic ad buying platformJampp
 
New Technologies For The Sustainable Enterprise; keynote @Wharton
New Technologies For The Sustainable Enterprise; keynote @WhartonNew Technologies For The Sustainable Enterprise; keynote @Wharton
New Technologies For The Sustainable Enterprise; keynote @WhartonPaul Hofmann
 
Paris FOD Meetup #5 Cognizant Presentation
Paris FOD Meetup #5 Cognizant PresentationParis FOD Meetup #5 Cognizant Presentation
Paris FOD Meetup #5 Cognizant PresentationAbdelkrim Hadjidj
 
Cloud: The Commercial Silver Lining for Partners
Cloud: The Commercial Silver Lining for PartnersCloud: The Commercial Silver Lining for Partners
Cloud: The Commercial Silver Lining for PartnersAmazon Web Services
 
IBM Z for the Digital Enterprise - IBM Z Open Data Analytics
IBM Z for the Digital Enterprise - IBM Z  Open Data AnalyticsIBM Z for the Digital Enterprise - IBM Z  Open Data Analytics
IBM Z for the Digital Enterprise - IBM Z Open Data AnalyticsDevOps for Enterprise Systems
 
Bitmovin LIVE Tech Talks: Analytics for Workflow Automation (ft. Touchstream ...
Bitmovin LIVE Tech Talks: Analytics for Workflow Automation (ft. Touchstream ...Bitmovin LIVE Tech Talks: Analytics for Workflow Automation (ft. Touchstream ...
Bitmovin LIVE Tech Talks: Analytics for Workflow Automation (ft. Touchstream ...Bitmovin Inc
 

Similar to Syncsort & comScore Big Data Warehouse Meetup Sept 2013 (20)

How to Suceed in Hadoop
How to Suceed in HadoopHow to Suceed in Hadoop
How to Suceed in Hadoop
 
Control m customers using big data
Control m customers using big dataControl m customers using big data
Control m customers using big data
 
Concept to production Nationwide Insurance BigInsights Journey with Telematics
Concept to production Nationwide Insurance BigInsights Journey with TelematicsConcept to production Nationwide Insurance BigInsights Journey with Telematics
Concept to production Nationwide Insurance BigInsights Journey with Telematics
 
Lessons from handling up to 26 Billion transactions a day - The Weather Compa...
Lessons from handling up to 26 Billion transactions a day - The Weather Compa...Lessons from handling up to 26 Billion transactions a day - The Weather Compa...
Lessons from handling up to 26 Billion transactions a day - The Weather Compa...
 
Bmc joe goldberg
Bmc joe goldbergBmc joe goldberg
Bmc joe goldberg
 
Microsoft Windows Azure - EBC Deck June 2010 Presentation
Microsoft Windows Azure -  EBC Deck June 2010 PresentationMicrosoft Windows Azure -  EBC Deck June 2010 Presentation
Microsoft Windows Azure - EBC Deck June 2010 Presentation
 
Jazz for Service Management
Jazz for Service ManagementJazz for Service Management
Jazz for Service Management
 
BMC Discovery with new Multi-Cloud Function
BMC Discovery with new Multi-Cloud FunctionBMC Discovery with new Multi-Cloud Function
BMC Discovery with new Multi-Cloud Function
 
IBM Z for the Digital Enterprise 2018 - Z Keynote
IBM Z for the Digital Enterprise 2018 - Z KeynoteIBM Z for the Digital Enterprise 2018 - Z Keynote
IBM Z for the Digital Enterprise 2018 - Z Keynote
 
AWS Summit Berlin 2013 - Big Data Analytics
AWS Summit Berlin 2013 - Big Data AnalyticsAWS Summit Berlin 2013 - Big Data Analytics
AWS Summit Berlin 2013 - Big Data Analytics
 
Microsoft: Ride the new opportunity with the Microsoft Cloud Platform
Microsoft: Ride the new opportunity with the Microsoft Cloud PlatformMicrosoft: Ride the new opportunity with the Microsoft Cloud Platform
Microsoft: Ride the new opportunity with the Microsoft Cloud Platform
 
comScore
comScorecomScore
comScore
 
Utilizing Aster nCluster to support processing in excess of 100 Billion rows ...
Utilizing Aster nCluster to support processing in excess of 100 Billion rows ...Utilizing Aster nCluster to support processing in excess of 100 Billion rows ...
Utilizing Aster nCluster to support processing in excess of 100 Billion rows ...
 
Why You Need to Move Your Website to the Cloud
Why You Need to Move Your Website to the CloudWhy You Need to Move Your Website to the Cloud
Why You Need to Move Your Website to the Cloud
 
Building a real-time, scalable and intelligent programmatic ad buying platform
Building a real-time, scalable and intelligent programmatic ad buying platformBuilding a real-time, scalable and intelligent programmatic ad buying platform
Building a real-time, scalable and intelligent programmatic ad buying platform
 
New Technologies For The Sustainable Enterprise; keynote @Wharton
New Technologies For The Sustainable Enterprise; keynote @WhartonNew Technologies For The Sustainable Enterprise; keynote @Wharton
New Technologies For The Sustainable Enterprise; keynote @Wharton
 
Paris FOD Meetup #5 Cognizant Presentation
Paris FOD Meetup #5 Cognizant PresentationParis FOD Meetup #5 Cognizant Presentation
Paris FOD Meetup #5 Cognizant Presentation
 
Cloud: The Commercial Silver Lining for Partners
Cloud: The Commercial Silver Lining for PartnersCloud: The Commercial Silver Lining for Partners
Cloud: The Commercial Silver Lining for Partners
 
IBM Z for the Digital Enterprise - IBM Z Open Data Analytics
IBM Z for the Digital Enterprise - IBM Z  Open Data AnalyticsIBM Z for the Digital Enterprise - IBM Z  Open Data Analytics
IBM Z for the Digital Enterprise - IBM Z Open Data Analytics
 
Bitmovin LIVE Tech Talks: Analytics for Workflow Automation (ft. Touchstream ...
Bitmovin LIVE Tech Talks: Analytics for Workflow Automation (ft. Touchstream ...Bitmovin LIVE Tech Talks: Analytics for Workflow Automation (ft. Touchstream ...
Bitmovin LIVE Tech Talks: Analytics for Workflow Automation (ft. Touchstream ...
 

Recently uploaded

办理(PITT毕业证书)美国匹兹堡大学毕业证成绩单原版一比一
办理(PITT毕业证书)美国匹兹堡大学毕业证成绩单原版一比一办理(PITT毕业证书)美国匹兹堡大学毕业证成绩单原版一比一
办理(PITT毕业证书)美国匹兹堡大学毕业证成绩单原版一比一F La
 
2024 WRC Hyundai World Rally Team’s i20 N Rally1 Hybrid
2024 WRC Hyundai World Rally Team’s i20 N Rally1 Hybrid2024 WRC Hyundai World Rally Team’s i20 N Rally1 Hybrid
2024 WRC Hyundai World Rally Team’s i20 N Rally1 HybridHyundai Motor Group
 
办理阳光海岸大学毕业证成绩单原版一比一
办理阳光海岸大学毕业证成绩单原版一比一办理阳光海岸大学毕业证成绩单原版一比一
办理阳光海岸大学毕业证成绩单原版一比一F La
 
Transportation Electrification Funding Strategy by Jeff Allen and Brandt Hert...
Transportation Electrification Funding Strategy by Jeff Allen and Brandt Hert...Transportation Electrification Funding Strategy by Jeff Allen and Brandt Hert...
Transportation Electrification Funding Strategy by Jeff Allen and Brandt Hert...Forth
 
What Could Be Causing My Jaguar XF To Lose Coolant
What Could Be Causing My Jaguar XF To Lose CoolantWhat Could Be Causing My Jaguar XF To Lose Coolant
What Could Be Causing My Jaguar XF To Lose CoolantEMC- European Motor Cars
 
办理乔治布朗学院毕业证成绩单|购买加拿大文凭证书
办理乔治布朗学院毕业证成绩单|购买加拿大文凭证书办理乔治布朗学院毕业证成绩单|购买加拿大文凭证书
办理乔治布朗学院毕业证成绩单|购买加拿大文凭证书zdzoqco
 
call girls in Jama Masjid (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Jama Masjid (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Jama Masjid (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Jama Masjid (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
办理萨省大学毕业证成绩单|购买加拿大USASK文凭证书
办理萨省大学毕业证成绩单|购买加拿大USASK文凭证书办理萨省大学毕业证成绩单|购买加拿大USASK文凭证书
办理萨省大学毕业证成绩单|购买加拿大USASK文凭证书zdzoqco
 
907MTAMount Coventry University Bachelor's Diploma in Engineering
907MTAMount Coventry University Bachelor's Diploma in Engineering907MTAMount Coventry University Bachelor's Diploma in Engineering
907MTAMount Coventry University Bachelor's Diploma in EngineeringFi sss
 
(毕业原版)曼尼托巴大学毕业证(曼大学位证)毕业证成绩单留信学历认证原版一比一
(毕业原版)曼尼托巴大学毕业证(曼大学位证)毕业证成绩单留信学历认证原版一比一(毕业原版)曼尼托巴大学毕业证(曼大学位证)毕业证成绩单留信学历认证原版一比一
(毕业原版)曼尼托巴大学毕业证(曼大学位证)毕业证成绩单留信学历认证原版一比一ffhuih11ff
 
What Could Cause A VW Tiguan's Radiator Fan To Stop Working
What Could Cause A VW Tiguan's Radiator Fan To Stop WorkingWhat Could Cause A VW Tiguan's Radiator Fan To Stop Working
What Could Cause A VW Tiguan's Radiator Fan To Stop WorkingEscondido German Auto
 
( Best ) Genuine Call Girls In Mandi House =DELHI-| 8377087607
( Best ) Genuine Call Girls In Mandi House =DELHI-| 8377087607( Best ) Genuine Call Girls In Mandi House =DELHI-| 8377087607
( Best ) Genuine Call Girls In Mandi House =DELHI-| 8377087607dollysharma2066
 
办理科廷科技大学毕业证Curtin毕业证留信学历认证
办理科廷科技大学毕业证Curtin毕业证留信学历认证办理科廷科技大学毕业证Curtin毕业证留信学历认证
办理科廷科技大学毕业证Curtin毕业证留信学历认证jdkhjh
 
248649330-Animatronics-Technical-Seminar-Report-by-Aswin-Sarang.pdf
248649330-Animatronics-Technical-Seminar-Report-by-Aswin-Sarang.pdf248649330-Animatronics-Technical-Seminar-Report-by-Aswin-Sarang.pdf
248649330-Animatronics-Technical-Seminar-Report-by-Aswin-Sarang.pdfkushkruthik555
 
Digamma / CertiCon Company Presentation
Digamma / CertiCon Company  PresentationDigamma / CertiCon Company  Presentation
Digamma / CertiCon Company PresentationMihajloManjak
 
原版1:1定制阳光海岸大学毕业证(JCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制阳光海岸大学毕业证(JCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制阳光海岸大学毕业证(JCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制阳光海岸大学毕业证(JCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
-The-Present-Simple-Tense.pdf english hh
-The-Present-Simple-Tense.pdf english hh-The-Present-Simple-Tense.pdf english hh
-The-Present-Simple-Tense.pdf english hhmhamadhawlery16
 
(办理学位证)墨尔本大学毕业证(Unimelb毕业证书)成绩单留信学历认证原版一模一样
(办理学位证)墨尔本大学毕业证(Unimelb毕业证书)成绩单留信学历认证原版一模一样(办理学位证)墨尔本大学毕业证(Unimelb毕业证书)成绩单留信学历认证原版一模一样
(办理学位证)墨尔本大学毕业证(Unimelb毕业证书)成绩单留信学历认证原版一模一样whjjkkk
 
(办理学位证)(Toledo毕业证)托莱多大学毕业证成绩单修改留信学历认证原版一模一样
(办理学位证)(Toledo毕业证)托莱多大学毕业证成绩单修改留信学历认证原版一模一样(办理学位证)(Toledo毕业证)托莱多大学毕业证成绩单修改留信学历认证原版一模一样
(办理学位证)(Toledo毕业证)托莱多大学毕业证成绩单修改留信学历认证原版一模一样gfghbihg
 
EPA Funding Opportunities for Equitable Electric Transportation by Mike Moltzen
EPA Funding Opportunities for Equitable Electric Transportationby Mike MoltzenEPA Funding Opportunities for Equitable Electric Transportationby Mike Moltzen
EPA Funding Opportunities for Equitable Electric Transportation by Mike MoltzenForth
 

Recently uploaded (20)

办理(PITT毕业证书)美国匹兹堡大学毕业证成绩单原版一比一
办理(PITT毕业证书)美国匹兹堡大学毕业证成绩单原版一比一办理(PITT毕业证书)美国匹兹堡大学毕业证成绩单原版一比一
办理(PITT毕业证书)美国匹兹堡大学毕业证成绩单原版一比一
 
2024 WRC Hyundai World Rally Team’s i20 N Rally1 Hybrid
2024 WRC Hyundai World Rally Team’s i20 N Rally1 Hybrid2024 WRC Hyundai World Rally Team’s i20 N Rally1 Hybrid
2024 WRC Hyundai World Rally Team’s i20 N Rally1 Hybrid
 
办理阳光海岸大学毕业证成绩单原版一比一
办理阳光海岸大学毕业证成绩单原版一比一办理阳光海岸大学毕业证成绩单原版一比一
办理阳光海岸大学毕业证成绩单原版一比一
 
Transportation Electrification Funding Strategy by Jeff Allen and Brandt Hert...
Transportation Electrification Funding Strategy by Jeff Allen and Brandt Hert...Transportation Electrification Funding Strategy by Jeff Allen and Brandt Hert...
Transportation Electrification Funding Strategy by Jeff Allen and Brandt Hert...
 
What Could Be Causing My Jaguar XF To Lose Coolant
What Could Be Causing My Jaguar XF To Lose CoolantWhat Could Be Causing My Jaguar XF To Lose Coolant
What Could Be Causing My Jaguar XF To Lose Coolant
 
办理乔治布朗学院毕业证成绩单|购买加拿大文凭证书
办理乔治布朗学院毕业证成绩单|购买加拿大文凭证书办理乔治布朗学院毕业证成绩单|购买加拿大文凭证书
办理乔治布朗学院毕业证成绩单|购买加拿大文凭证书
 
call girls in Jama Masjid (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Jama Masjid (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Jama Masjid (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Jama Masjid (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
办理萨省大学毕业证成绩单|购买加拿大USASK文凭证书
办理萨省大学毕业证成绩单|购买加拿大USASK文凭证书办理萨省大学毕业证成绩单|购买加拿大USASK文凭证书
办理萨省大学毕业证成绩单|购买加拿大USASK文凭证书
 
907MTAMount Coventry University Bachelor's Diploma in Engineering
907MTAMount Coventry University Bachelor's Diploma in Engineering907MTAMount Coventry University Bachelor's Diploma in Engineering
907MTAMount Coventry University Bachelor's Diploma in Engineering
 
(毕业原版)曼尼托巴大学毕业证(曼大学位证)毕业证成绩单留信学历认证原版一比一
(毕业原版)曼尼托巴大学毕业证(曼大学位证)毕业证成绩单留信学历认证原版一比一(毕业原版)曼尼托巴大学毕业证(曼大学位证)毕业证成绩单留信学历认证原版一比一
(毕业原版)曼尼托巴大学毕业证(曼大学位证)毕业证成绩单留信学历认证原版一比一
 
What Could Cause A VW Tiguan's Radiator Fan To Stop Working
What Could Cause A VW Tiguan's Radiator Fan To Stop WorkingWhat Could Cause A VW Tiguan's Radiator Fan To Stop Working
What Could Cause A VW Tiguan's Radiator Fan To Stop Working
 
( Best ) Genuine Call Girls In Mandi House =DELHI-| 8377087607
( Best ) Genuine Call Girls In Mandi House =DELHI-| 8377087607( Best ) Genuine Call Girls In Mandi House =DELHI-| 8377087607
( Best ) Genuine Call Girls In Mandi House =DELHI-| 8377087607
 
办理科廷科技大学毕业证Curtin毕业证留信学历认证
办理科廷科技大学毕业证Curtin毕业证留信学历认证办理科廷科技大学毕业证Curtin毕业证留信学历认证
办理科廷科技大学毕业证Curtin毕业证留信学历认证
 
248649330-Animatronics-Technical-Seminar-Report-by-Aswin-Sarang.pdf
248649330-Animatronics-Technical-Seminar-Report-by-Aswin-Sarang.pdf248649330-Animatronics-Technical-Seminar-Report-by-Aswin-Sarang.pdf
248649330-Animatronics-Technical-Seminar-Report-by-Aswin-Sarang.pdf
 
Digamma / CertiCon Company Presentation
Digamma / CertiCon Company  PresentationDigamma / CertiCon Company  Presentation
Digamma / CertiCon Company Presentation
 
原版1:1定制阳光海岸大学毕业证(JCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制阳光海岸大学毕业证(JCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制阳光海岸大学毕业证(JCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制阳光海岸大学毕业证(JCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
-The-Present-Simple-Tense.pdf english hh
-The-Present-Simple-Tense.pdf english hh-The-Present-Simple-Tense.pdf english hh
-The-Present-Simple-Tense.pdf english hh
 
(办理学位证)墨尔本大学毕业证(Unimelb毕业证书)成绩单留信学历认证原版一模一样
(办理学位证)墨尔本大学毕业证(Unimelb毕业证书)成绩单留信学历认证原版一模一样(办理学位证)墨尔本大学毕业证(Unimelb毕业证书)成绩单留信学历认证原版一模一样
(办理学位证)墨尔本大学毕业证(Unimelb毕业证书)成绩单留信学历认证原版一模一样
 
(办理学位证)(Toledo毕业证)托莱多大学毕业证成绩单修改留信学历认证原版一模一样
(办理学位证)(Toledo毕业证)托莱多大学毕业证成绩单修改留信学历认证原版一模一样(办理学位证)(Toledo毕业证)托莱多大学毕业证成绩单修改留信学历认证原版一模一样
(办理学位证)(Toledo毕业证)托莱多大学毕业证成绩单修改留信学历认证原版一模一样
 
EPA Funding Opportunities for Equitable Electric Transportation by Mike Moltzen
EPA Funding Opportunities for Equitable Electric Transportationby Mike MoltzenEPA Funding Opportunities for Equitable Electric Transportationby Mike Moltzen
EPA Funding Opportunities for Equitable Electric Transportation by Mike Moltzen
 

Syncsort & comScore Big Data Warehouse Meetup Sept 2013

  • 1. © comScore, Inc. Proprietary. Using Hadoop to Process a Trillion+ Events Michael Brown, CTO | September 23rd, 2013
  • 2. © comScore, Inc. Proprietary. 2 comScore is a leading internet technology company that provides Analytics for a Digital World™ NASDAQ SCOR Clients 2,100+ Worldwide Employees 1,000+ Headquarters Reston, Virginia, USA Global Coverage Measurement from 172 Countries; 44 Markets Reported Local Presence 32 Locations in 23 Countries Big Data Over 1 Trillion Digital Interactions Captured Monthly V0113
  • 3. © comScore, Inc. Proprietary. Broad Client Base and Deep Expertise Across Key Industries Media Agencies Telecom/Mobile Financial Retail Travel CPG Pharma Technology V0910
  • 4. © comScore, Inc. Proprietary. Panel Heat Map
  • 5. © comScore, Inc. Proprietary. CENSUS Unified Digital Measurement™ (UDM) Establishes Platform For Panel + Census Data Integration PANEL Unified Digital Measurement (UDM) Patent-Pending Methodology Adopted by 90% of Top 100 U.S. Media Properties Global PERSON Measurement Global DEVICE Measurement V0411
  • 6. © comScore, Inc. Proprietary. 6 0 200,000,000,000 400,000,000,000 600,000,000,000 800,000,000,000 1,000,000,000,000 1,200,000,000,000 1,400,000,000,000 1,600,000,000,000 1,800,000,000,000 2,000,000,000,000 Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Jan Feb Mar Apr May Jun Jul Aug 2009 2010 2011 2012 2013 #ofrecords Panel Records Beacon Records Total records collected in August 2013 1,729,895,147,710 Worldwide Tags per Day
  • 7. © comScore, Inc. Proprietary. Worldwide UDM™ Penetration December 2012 Penetration Data Europe Austria 87% Belgium 93% Switzerland 89% Germany 92% Denmark 88% Spain 95% Finland 93% France 92% Ireland 90% Italy 90% Netherlands 93% Norway 91% Portugal 92% Sweden 90% United Kingdom 92% Asia Pacific Australia 90% Hong Kong 95% India 92% Japan 82% Malaysia 93% New Zealand 91% Singapore 92% North America Canada 94% United States 91% Latin America Argentina 95% Brazil 96% Chile 94% Colombia 95% Mexico 93% Puerto Rico 92% Middle East & Africa Israel 92% South Africa 78% Percentage of Machines Included in UDM Measurement
  • 8. © comScore, Inc. Proprietary. High Level Data Flow Panel Census Custom Code + Delivery
  • 9. © comScore, Inc. Proprietary. Our Cluster Production Hadoop Cluster  224 nodes: Mix of Dell 720xd, R710 and R510 servers  Each R720xd has (24x1.2TB drives; 64GB RAM; 24 cores)  6300+ total CPUs  13.3TB total memory  4.3PB total disk space  Our distro is MapR M5 2.1.3
  • 10. © comScore, Inc. Proprietary. The Project: vCE – Validated Campaign Essentials
  • 11. © comScore, Inc. Proprietary. 11  vCE provides real-time, cloud- based, on-demand monitoring and optimization of digital advertising campaigns  Deep industry penetration  22 of the Top 25 Largest Global Advertisers, representing 89% of global ad dollars, are vCE/CE clients*  Includes ALL Top 10 CPG Advertisers* What is vCE? *Source: AdAge 2012 Top 25 Global Advertisers (directly or through their advertising agency) Allstate
  • 12. © comScore, Inc. Proprietary. comScore - vCE
  • 13. © comScore, Inc. Proprietary. The Problem Statement Calculate the number of events and unique cookies for each reportable campaign element Key take away  Data on input will be aggregated daily  Need to process all data for 3 months  Need to calculate values for every day in the 92 day period spanning all reportable campaign elements
  • 14. © comScore, Inc. Proprietary. Structure of the Required Output Client Campaign Population Location Cookie Ct Period 1234 160873284 840 1 863,185 1 1234 160873284 840 1 1,719,738 2 1234 160873284 840 1 2,631,624 3 1234 160873284 840 1 3,572,163 4 1234 160873284 840 1 4,445,508 5 1234 160873284 840 1 5,308,532 6 1234 160873284 840 1 6,032,073 7 1234 160873284 840 1 6,710,645 8 1234 160873284 840 1 7,421,258 9 1234 160873284 840 1 8,154,543 10
  • 15. © comScore, Inc. Proprietary. Counting Uniques from a Time Ordered Log File A B C D B A A Major Downsides: Need to keep all key elements in memory. Constrained to one machine for final aggregation.
  • 16. © comScore, Inc. Proprietary. First Version Java Map-Reduce application which processes pre-aggregated data from 92 days Map reads the data and emits each cookie as the key of the key value pair All 130B records go though the shuffle Each Reducer will get all the data for a particular campaign sorted by cookie Reducer aggregates the data by grouping key ( Client / Campaign / Population ) and calculates unique cookies for period 1-92 Volume Grew rapidly to the point the daily processing took more than a day
  • 17. © comScore, Inc. Proprietary. M/R Data Flow CB Mapper MapperMapperMap Map Map Reduce ReduceReduce BA AC AA BB CC A B C
  • 18. © comScore, Inc. Proprietary. Scaling Issue As our volume has grown we have the following stats:  Over 500 billion events per month  Daily Aggregate 1.5 billion  130 billion aggregate records for 92 days  70K Campaigns  Over 50 countries  We see 15 billion distinct cookies in a month  We only need to output 25 million rows
  • 19. © comScore, Inc. Proprietary. Basic Approach Retrospective Processing speed is not scaling to our needs on a sample of the input data Diagnosis  Most aggregations could not take significant advantage of combiners.  Large shuffles caused poor job performance. In some cases large aggregations ran slower on the Hadoop cluster due to shuffle and skew in data for keys. Diagnosis  A new approach is required to reduce the shuffle
  • 20. © comScore, Inc. Proprietary. Counting Uniques from a Key Ordered Log File A D B C B A A Major Downsides: Need to sort data in advance. The sort time increases as volume grows.
  • 21. © comScore, Inc. Proprietary. Counting Uniques from a Key Ordered Log File
  • 22. © comScore, Inc. Proprietary. Counting Uniques from Sharded Key Ordered Log Files
  • 23. © comScore, Inc. Proprietary. Solution to reduce the shuffle The Problem:  Most aggregations within comScore can not take advantage of combiners, leading to large shuffles and job performance issues The Idea:  Partition and sort the data by cookie on a daily basis  Create a custom InputFormat to merge daily partitions for monthly aggregations
  • 24. © comScore, Inc. Proprietary. Custom Input Format with Map Side Aggregation CB Mapper MapperMapperMap Map Map Reduce ReduceReduce BA AC A B C A B C Combiner Combiner Combiner A B C
  • 25. © comScore, Inc. Proprietary. Risks for Partitioning Data locality  Custom InputFormat requires reading blocks of the partitioned data over the network  This was solved using a feature of the MapR file system. We created volumes and set the chunk size to zero which guarantees that the data written to a volume will stay on one node Map failures might result in long run times  Size of the map inputs is no longer set by block size  This was solved by creating a large number (10K) of volumes to limit the size of data processed by each mapper
  • 26. © comScore, Inc. Proprietary. Partitioning Summary Benefits:  A large portion of the aggregation can be completed in the map phase  Applications can now take advantage of combiners  Shuffles sizes are minimal Results:  Took a job from 35 hours to 3 hours with no hardware changes
  • 27. © comScore, Inc. Proprietary. DMX & comScore
  • 28. © comScore, Inc. Proprietary. DMX use at comScore We use DMX from Syncsort across hundreds of servers for efficient data processing and aggregation. We currently run over 100+ unique jobs every day. With these jobs we process over 150 billion rows of data through DMX! Connect Design Process Accelerate
  • 29. © comScore, Inc. Proprietary. Compression w/Sorting Compress Log Files when processing large volumes of log data Several advantages to Sorting Data First:  Reduces the size of the data  Improves application performance Examples:  1 Hour of one source of our data (313 GB raw, 815 million rows)  Standard compression of time ordered data is 93GB (30% of original)  Standard compression on a 2 key sorted set is 56GB (18% of original)  For one day it saves 800GB When applied to all our sources we save  4.5 TB per day  137 TB per month  412TB per quarter
  • 30. © comScore, Inc. Proprietary. TCO with Large Cluster Systems Examine the ability to sort data to reduce disk usage Example: Hadoop cluster that needs to support 100TB of base compressed data Hypothetical Configurations @ 75% disk utilization:  Replication Factor of 3 using 1.2 TB drives R710 (6x drives, JBOD); requires 26 servers R510 (12x drives JBOD); requires 52 servers R720xd (24x drives JBOD); requires 13 servers
  • 31. © comScore, Inc. Proprietary. Useful Factoids Visit www.comscoredatamine.com or follow @datagems for the latest gems. Colorful, bite-sized graphical representations of the best discoveries we unearth.
  • 32. © comScore, Inc. Proprietary. Thank You! Michael Brown CTO comScore, Inc. mbrown@comscore.com

Editor's Notes

  1. Key MessagecomScore is a global internet technology company providing customers with Analytics for a Digital WorldSupporting Talking PointsFounded in 1999, comScore is best known as the gold standard for measuring digital activity, including website visitation, search, video, social, digital advertisingcomScore’s data and technologies are well-established crucial components in measuring and analyzing the rapidly evolving digital world, and are widely deployed at a broad range of publishers, advertising agencies, advertisers, retailers and telecom operators, both in the US and internationally
  2. comScore leverages DMExpress from SyncSort across hundreds of our servers to allow us to efficiently process our data.A generic design pattern for us is to sort the input data based on the column that we will be counting uniques. Counting uniques is one of the more costly measures to calculate in a system. By sorting the data in advance, you only need to see if the prior value has changed from the current value and increment a counter.This approach has let us implement aggregation systems that can process over 50 GB of data with 357 million rows in less than an hour on a Dell R710 2U server.