Elliott Cordo, Chief Architect at Caserta Concepts, will give a live demo using Amazon's AWS to build a Big Data Warehouse, using S3 for data storage, Elastic MapReduce (EMR) for data manipulation, and Redshift for interactive queries.
For more information, visit http://www.casertaconcepts.com/.
Build a Big Data Warehouse on the Cloud in 30 Minutes
1. Big Data Warehousing
Meetup - April 8, 2014
Building a Big Data Warehouse
on the Cloud in 30 Minutes
Sponsored By:
2. Agenda
7:00 – 7:15  Networking (15 min)
             Grab some food and drink... make some friends.
7:15 – 7:35  Bob Eilbacher (20 min), VP Sales, Caserta Concepts
             Welcome + Intro: about the Meetup, about Caserta Concepts + swag
7:35 – 8:20  Elliott Cordo (45 min), Chief Architect, Caserta Concepts
             Building a Big Data Warehouse on the Cloud: live demo of Amazon's AWS, S3, EMR, and Redshift
8:20 – 8:40  Ben Sgro (20 min), Sr. Software Engineer, Simulmedia
             Implementing Redis on the Cloud: an ultra-low-latency customer segmentation tool with AWS ElastiCache
8:40 – 9:00  Q&A (10 min) + More Networking (10 min)
             Tell us what you're up to…
3. Gathering music brought to you by….
BIG DATA: a paranoid electronic music project from the Internet, formed out of a general distrust for technology and The Cloud (despite a growing dependence on them).
bigdata.fm
4. • Big Data is a complex, rapidly changing
landscape
• We want to share our stories and hear
about yours
• Great networking opportunity for like-minded
data nerds
• Opportunities to collaborate on exciting
projects
• Founded by Caserta Concepts
• Big Data Analytics, DW, BI Consulting
About the BDW Meetup
6. Real-world Data Science
w/Claudia Perlich
• Date:
• Tuesday May 27, 2014, 7:00 PM
• Location:
• New Work City, Broadway & Canal
• Sponsor:
• Revolution Analytics
Next BDW Meetup
7. Caserta Concepts
• Technology innovation company with expertise in:
• Big Data Solutions
• Data Warehousing
• Business Intelligence
• Core focus in the following industries:
• eCommerce / Retail / Marketing
• Financial Services / Insurance
• Healthcare / Digital Media
• Established in 2001:
• Increased growth year-over-year
• Industry recognized work force
• Consulting, Writing, Education
• Data Science & Analytics
• Data on the Cloud
• Data Interaction & Visualization
8. Innovation & Implementation
Listed among the Top 20 Most Promising
Data Analytics Consulting Companies
CIOReview looked at hundreds of data analytics consulting companies and shortlisted
those at the forefront of tackling real analytics challenges.
A distinguished panel comprising CEOs, CIOs, VCs, industry analysts, and the editorial
board of CIOReview selected the final 20.
9. Expertise & Offerings
Strategic Roadmap /
Assessment / Education /
Implementation
Data Warehousing/
ETL/Data Integration
BI/Visualization/
Analytics
Big Data
Analytics
12. Does this word cloud excite you?
Speak with us about our open positions: leslie@casertaconcepts.com
Join Our Network
[Word cloud: Storm, Big Data Architect, HBase, Cassandra, …]
14. Big Data is like water.
There is little point in debating how much there is.
It’s the flow and use that matters.
#gigaomlive
@dominiek
3/20/2014
Gigaom Structure Data
15. BUILDING A BIG DATA WAREHOUSE IN THE
CLOUD IN 30 MIN
Elliott Cordo
Chief Architect, Caserta Concepts
16. What is a Big Data Warehouse??
• An enterprise system providing reliable ad-hoc analytics,
reporting, and decision support
• Large Scale – Big Data
• Not only confined to traditional Dimensional model
17. Big Data Warehouse
• Data governance is still important!
• Data Quality
• Metadata: Naming, Lineage, etc.
Data cannot be governed until it is structured.
Layered architecture (top to bottom):
• Big Data Warehouse
• Data Science Workspace
• Data Lake – Integrated Sandbox
• Landing – Source Data in “Full Fidelity”
18. Cloud
• Infrastructure is not fun
• Months to server procurement
• Inability to handle growth
• Servers idling all day doing nothing
• Cloud to the rescue
• Unlimited cheap storage
• Provision new servers in minutes
• Use elastic services like EMR
• AWESOME for prototypes and POCs
19. About our sample data
• Consumer Yelp ratings
• Generated from a Kaggle dataset; 100 million rows
• The model looks something like this: a central fact table, f_reviews, joined to the dimensions d_date, d_business, and d_user
20. So let’s get cooking
1. Create an EMR cluster (On-Demand Hadoop)
2. Provision a Redshift cluster (Data Warehouse)
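The two provisioning steps above can be sketched with boto3, the current successor to the Boto library mentioned later in the deck. The cluster names, sizes, release label, and instance types below are illustrative assumptions, not values from the demo.

```python
# Sketch: request parameters for the two clusters. Building them as plain
# dicts keeps the live AWS calls (bottom) separate from the testable logic.

def emr_cluster_params(name="bdw-demo", core_nodes=4):
    """On-demand Hadoop: a throwaway EMR cluster for ETL."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.36.0",          # assumed release
        "Applications": [{"Name": "Hadoop"}, {"Name": "Pig"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 1 + core_nodes,   # master + core nodes
            # Shut down when the steps finish -- pay only for processing.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

def redshift_cluster_params(identifier="bdw-demo", nodes=2):
    """Data warehouse: a small multi-node Redshift cluster."""
    return {
        "ClusterIdentifier": identifier,
        "NodeType": "dc2.large",               # assumed node type
        "ClusterType": "multi-node",
        "NumberOfNodes": nodes,
        "MasterUsername": "admin",
        "MasterUserPassword": "ChangeMe123!",  # placeholder only
        "DBName": "bdw",
    }

def provision():
    """The real AWS calls; requires boto3 and configured credentials."""
    import boto3
    emr = boto3.client("emr").run_job_flow(**emr_cluster_params())
    rs = boto3.client("redshift").create_cluster(**redshift_cluster_params())
    return emr["JobFlowId"], rs["Cluster"]["ClusterIdentifier"]
```

Both clusters take only minutes to come up, which is the whole point of the slide above: no procurement cycle, and the EMR cluster terminates itself when its work is done.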
21. Redshift
• Massively Parallel Processing (MPP)
• Columnar DBs that present themselves as relational
• MPPs grew up in parallel to Hadoop
• Impala and HAWQ are MPPs themselves!
• OEM of Actian Matrix (formerly ParAccel)
• A modern MPP: clean, reliable, SCHEMA AGNOSTIC
22. Redshift is inexpensive
Enterprise-grade EDW @ $1,000/TB per year
23. MPP Design Considerations
• JOINs
• Shuffle – data is large and distributed by key to servers
• Broadcast – data is small and gets distributed to all servers
• Collocated – all data needed for the join is on the same server
• Design considerations for MPP
• Distribution Key
• Collocated joins
• Even distribution of work across the cluster
• A customer key will usually work well
• Sort Key
• Fastest scan operations
• Primary date field is usually best
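As a concrete sketch, the guidance above applied to the sample f_reviews fact table might look like the DDL below. The column names are assumptions based on the sample model (slide 19), not the demo's actual schema.

```python
# Redshift DDL for the fact table: distribute on the user key so joins to
# d_user are collocated, and sort on the date key for fast range scans.
F_REVIEWS_DDL = """
CREATE TABLE f_reviews (
    review_id    BIGINT,
    user_key     INTEGER,
    business_key INTEGER,
    date_key     INTEGER,
    stars        SMALLINT
)
DISTKEY (user_key)    -- collocated joins + even work distribution
SORTKEY (date_key);   -- primary date field: fastest scans
"""
```

A high-cardinality, evenly distributed key like a user or customer key avoids skew (one node doing all the work), while sorting on the date column lets range-restricted scans skip whole blocks.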
24. ETL – Transform your data
• S3 is the ultimate staging ground
• Use EMR for the heavy lifting:
• Run your ETL program and kill the cluster when done!
• Pay just for processing.
• Pig, native MapReduce, streaming
• For the right use case, Hive or Impala can be used for
ETL too (mainly for aggregates, summaries)
25. Smaller data – don’t need EMR?
• Python ETL on EC2 (On-Demand)
• Can later “graduate” to big data using Hadoop streaming
• Your favorite ETL tool is just fine too
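A sketch of what that "graduation" looks like: a mapper written against plain stdin/stdout runs unchanged on a single EC2 box or as the `-mapper` of a Hadoop streaming job on EMR. The tab-separated input layout here is an assumption for illustration.

```python
#!/usr/bin/env python
import sys

def map_reviews(lines):
    """Emit (business_id, stars) pairs from tab-separated review rows."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue                       # skip malformed rows
        business_id, stars = fields[1], fields[-1]
        yield "%s\t%s" % (business_id, stars)

if __name__ == "__main__":
    # Locally: python mapper.py < reviews.tsv
    # On EMR:  pass this file as -mapper to the hadoop-streaming jar
    for pair in map_reviews(sys.stdin):
        print(pair)
```

Because the script only speaks stdin/stdout, nothing changes between the small-data and big-data versions except what invokes it.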
26. Presentation Layer – Data Warehouse
How do you get your ETL output in?
• Hadoop distcp – high-performance transfer of data from
S3 to HDFS
• Distributed COPY from S3 to Redshift
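The distributed COPY is fast because Redshift ingests multiple S3 files in parallel, one per slice. A minimal sketch of building the statement; the table, S3 prefix, IAM role, and file options below are placeholders, and the demo may have used the older access-key CREDENTIALS form instead of IAM_ROLE.

```python
def copy_statement(table, s3_prefix, iam_role):
    """Build a parallel COPY from S3; Redshift splits files across slices."""
    return (
        "COPY {t} FROM '{p}' "
        "IAM_ROLE '{r}' "
        "DELIMITER '\\t' GZIP;"
    ).format(t=table, p=s3_prefix, r=iam_role)
```

Pointing COPY at a prefix such as `s3://bucket/reviews/part-` lets every slice load a file at once, which is why COPY is dramatically faster than row-by-row INSERTs.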
27. And how to orchestrate all of this?
• AWS Data Pipeline
• AWS CLI
• Build a driver program using modules like Boto (Python)
• Cron or an external scheduler
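In the Boto-plus-cron style, the driver program can be as simple as running the steps in order and stopping at the first failure. The step names and lambdas below are hypothetical stand-ins for the real boto or CLI calls they represent.

```python
def run_pipeline(steps):
    """Run named ETL steps in order; return (completed, failed_step)."""
    completed = []
    for name, step in steps:
        try:
            step()
        except Exception:
            return completed, name        # stop at the first failure
        completed.append(name)
    return completed, None

# Example wiring -- each lambda would be a real boto call or CLI invocation:
pipeline = [
    ("launch_emr",       lambda: None),
    ("run_etl",          lambda: None),
    ("copy_to_redshift", lambda: None),
]
```

Cron (or any external scheduler) then just invokes this driver on whatever cadence the warehouse loads require.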
28. Back to AWS
1. Apply Redshift DDL and load tables
2. Run some queries
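An example of the kind of interactive, ad-hoc query the loaded star schema supports. The column names are assumptions based on the sample model, not queries taken from the demo.

```python
# Top-rated businesses by average stars -- a typical ad-hoc query against
# the f_reviews fact table and its d_business dimension.
TOP_BUSINESSES = """
SELECT d.business_key,
       AVG(f.stars) AS avg_stars,
       COUNT(*)     AS review_count
FROM   f_reviews f
JOIN   d_business d ON f.business_key = d.business_key
GROUP  BY d.business_key
HAVING COUNT(*) >= 100
ORDER  BY avg_stars DESC
LIMIT  10;
"""
```

With the sort key on the date column, adding a date-range predicate via d_date would let Redshift skip blocks entirely, keeping such queries interactive even at 100 million rows.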