Elliott Cordo, Chief Architect at Caserta Concepts, will give a live demo using Amazon's AWS to build a Big Data Warehouse, using S3 for data storage, Elastic MapReduce (EMR) for data manipulation, and Redshift for interactive queries.
For more information, visit http://www.casertaconcepts.com/.
Build a Big Data Warehouse on the Cloud in 30 Minutes
1. Big Data Warehousing
Meetup - April 8, 2014
Building a Big Data Warehouse
on the Cloud in 30 Minutes
Sponsored By:
2. Agenda
7:00 – 7:15  Networking (15 min)
             Grab some food and drink... make some friends.
7:15 – 7:35  Bob Eilbacher (20 min), VP Sales, Caserta Concepts
             Welcome + Intro: about the Meetup, about Caserta Concepts + swag
7:35 – 8:20  Elliott Cordo (45 min), Chief Architect, Caserta Concepts
             Building a Big Data Warehouse on the Cloud: live demo of Amazon's AWS, S3, EMR, and Redshift
8:20 – 8:40  Ben Sgro (20 min), Sr. Software Engineer, Simulmedia
             Implementing Redis on the Cloud: an ultra-low-latency customer segmentation tool with AWS ElastiCache
8:40 – 9:00  Q&A (10 min) + More Networking (10 min)
             Tell us what you're up to…
3. Gathering music brought to you by….
BIG DATA: a paranoid electronic music project from the Internet, formed out of a general distrust for technology and The Cloud (despite a growing dependence on them).
bigdata.fm
4. • Big Data is a complex, rapidly changing
landscape
• We want to share our stories and hear
about yours
• Great networking opportunity for like-minded
data nerds
• Opportunities to collaborate on exciting
projects
• Founded by Caserta Concepts
• Big Data Analytics, DW, BI Consulting
About the BDW Meetup
6. Real-world Data Science
w/Claudia Perlich
• Date:
• Tuesday May 27, 2014, 7:00 PM
• Location:
• New Work City, Broadway & Canal
• Sponsor:
• Revolution Analytics
Next BDW Meetup
7. Caserta Concepts
• Technology innovation company with expertise in:
• Big Data Solutions
• Data Warehousing
• Business Intelligence
• Core focus in the following industries:
• eCommerce / Retail / Marketing
• Financial Services / Insurance
• Healthcare / Digital Media
• Established in 2001:
• Increased growth year-over-year
• Industry recognized work force
• Consulting, Writing, Education
• Data Science & Analytics
• Data on the Cloud
• Data Interaction & Visualization
8. Innovation & Implementation
Listed among the Top 20 Most Promising
Data Analytics Consulting Companies
CIOReview looked at hundreds of data analytics consulting companies and shortlisted
those at the forefront of tackling real analytics challenges.
A distinguished panel comprising CEOs, CIOs, VCs, industry analysts, and the editorial
board of CIOReview selected the final 20.
9. Expertise & Offerings
Strategic Roadmap /
Assessment / Education /
Implementation
Data Warehousing/
ETL/Data Integration
BI/Visualization/
Analytics
Big Data
Analytics
12. Does this word cloud excite you?
Speak with us about our open positions: leslie@casertaconcepts.com
Join Our Network
[Word cloud: Storm, Big Data Architect, HBase, Cassandra, …]
14. Big Data is like water.
There is little point in debating how much there is.
It’s the flow and use that matters.
#gigaomlive
@dominiek
3/20/2014
Gigaom Structure Data
15. BUILDING A BIG DATA WAREHOUSE IN THE
CLOUD IN 30 MIN
Elliott Cordo
Chief Architect, Caserta Concepts
16. What is a Big Data Warehouse??
• An enterprise system providing reliable ad-hoc analytics,
reporting, and decision support
• Large Scale – Big Data
• Not only confined to traditional Dimensional model
17. Big Data Warehouse
• Data governance is still important!
• Data Quality
• Metadata: Naming, Lineage, etc.
Data cannot be governed until it is structured.
Layered architecture (top to bottom):
• Big Data Warehouse
• Data Science Workspace
• Data Lake – Integrated Sandbox
• Landing – Source Data in “Full Fidelity”
18. Cloud
• Infrastructure is not fun
• Months to server procurement
• Inability to handle growth
• Servers idling all day doing nothing
• Cloud to the rescue
• Unlimited cheap storage
• Provision new servers in minutes
• Use elastic services like EMR
• AWESOME for prototypes and POCs
19. About our sample data
• Consumer Yelp ratings
• Generated from a Kaggle dataset; 100 million rows
• The model looks something like this: a central fact table, f_reviews, joined to the dimensions d_date, d_business, and d_user
20. So let’s get cooking
1. Create an EMR cluster (On-Demand Hadoop)
2. Provision a Redshift cluster (Data Warehouse)
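The two provisioning steps above can be sketched with boto3, the current successor to the Boto library mentioned later in the deck. The cluster names, sizes, release label, and instance types below are illustrative assumptions, not values from the demo.

```python
# Sketch: request parameters for the two clusters. Building them as plain
# dicts keeps the live AWS calls (bottom) separate from the testable logic.

def emr_cluster_params(name="bdw-demo", core_nodes=4):
    """On-demand Hadoop: a throwaway EMR cluster for ETL."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.36.0",          # assumed release
        "Applications": [{"Name": "Hadoop"}, {"Name": "Pig"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 1 + core_nodes,   # master + core nodes
            # Shut down when the steps finish -- pay only for processing.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

def redshift_cluster_params(identifier="bdw-demo", nodes=2):
    """Data warehouse: a small multi-node Redshift cluster."""
    return {
        "ClusterIdentifier": identifier,
        "NodeType": "dc2.large",               # assumed node type
        "ClusterType": "multi-node",
        "NumberOfNodes": nodes,
        "MasterUsername": "admin",
        "MasterUserPassword": "ChangeMe123!",  # placeholder only
        "DBName": "bdw",
    }

def provision():
    """The real AWS calls; requires boto3 and configured credentials."""
    import boto3
    emr = boto3.client("emr").run_job_flow(**emr_cluster_params())
    rs = boto3.client("redshift").create_cluster(**redshift_cluster_params())
    return emr["JobFlowId"], rs["Cluster"]["ClusterIdentifier"]
```

Both clusters take only minutes to come up, which is the whole point of the slide above: no procurement cycle, and the EMR cluster terminates itself when its work is done.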
21. Redshift
• Massively Parallel Processing (MPP)
• Columnar DBs that present themselves as relational
• MPPs grew up in parallel to Hadoop
• Impala and HAWQ are MPPs themselves!
• OEM of Actian Matrix (formerly ParAccel)
• A modern MPP: clean, reliable, SCHEMA AGNOSTIC
22. Redshift is inexpensive
Enterprise-grade EDW @ $1,000/TB per year
23. MPP Design Considerations
• JOINs
• Shuffle – data is large and distributed by key to servers
• Broadcast – data is small and gets distributed to all servers
• Collocated – all data needed for the join is on the same server
• Design considerations for MPP
• Distribution Key
• Collocated joins
• Even distribution of work across the cluster
• A customer key will usually work well
• Sort Key
• Fastest scan operations
• Primary date field is usually best
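As a concrete sketch, the guidance above applied to the sample f_reviews fact table might look like the DDL below. The column names are assumptions based on the sample model (slide 19), not the demo's actual schema.

```python
# Redshift DDL for the fact table: distribute on the user key so joins to
# d_user are collocated, and sort on the date key for fast range scans.
F_REVIEWS_DDL = """
CREATE TABLE f_reviews (
    review_id    BIGINT,
    user_key     INTEGER,
    business_key INTEGER,
    date_key     INTEGER,
    stars        SMALLINT
)
DISTKEY (user_key)    -- collocated joins + even work distribution
SORTKEY (date_key);   -- primary date field: fastest scans
"""
```

A high-cardinality, evenly distributed key like a user or customer key avoids skew (one node doing all the work), while sorting on the date column lets range-restricted scans skip whole blocks.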
24. ETL – Transform your data
• S3 is the ultimate staging ground
• Use EMR for the heavy lifting:
• Run your ETL program and kill the cluster when done!
• Pay just for processing.
• Pig, native MapReduce, streaming
• For the right use case, Hive or Impala can be used for
ETL too (mainly for aggregates, summaries)
25. Smaller data – don’t need EMR?
• Python ETL on EC2 (On-Demand)
• Can later “graduate” to big data using Hadoop streaming
• Your favorite ETL tool is just fine too
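A sketch of what that "graduation" looks like: a mapper written against plain stdin/stdout runs unchanged on a single EC2 box or as the `-mapper` of a Hadoop streaming job on EMR. The tab-separated input layout here is an assumption for illustration.

```python
#!/usr/bin/env python
import sys

def map_reviews(lines):
    """Emit (business_id, stars) pairs from tab-separated review rows."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue                       # skip malformed rows
        business_id, stars = fields[1], fields[-1]
        yield "%s\t%s" % (business_id, stars)

if __name__ == "__main__":
    # Locally: python mapper.py < reviews.tsv
    # On EMR:  pass this file as -mapper to the hadoop-streaming jar
    for pair in map_reviews(sys.stdin):
        print(pair)
```

Because the script only speaks stdin/stdout, nothing changes between the small-data and big-data versions except what invokes it.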
26. Presentation Layer – Data Warehouse
How do you get your ETL output in?
• Hadoop distcp – high-performance transfer of data from
S3 to HDFS
• Distributed COPY from S3 to Redshift
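The distributed COPY is fast because Redshift ingests multiple S3 files in parallel, one per slice. A minimal sketch of building the statement; the table, S3 prefix, IAM role, and file options below are placeholders, and the demo may have used the older access-key CREDENTIALS form instead of IAM_ROLE.

```python
def copy_statement(table, s3_prefix, iam_role):
    """Build a parallel COPY from S3; Redshift splits files across slices."""
    return (
        "COPY {t} FROM '{p}' "
        "IAM_ROLE '{r}' "
        "DELIMITER '\\t' GZIP;"
    ).format(t=table, p=s3_prefix, r=iam_role)
```

Pointing COPY at a prefix such as `s3://bucket/reviews/part-` lets every slice load a file at once, which is why COPY is dramatically faster than row-by-row INSERTs.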
27. And how to orchestrate all of this?
• AWS Data Pipeline
• AWS CLI
• Build a driver program using modules like Boto (Python)
• Cron or an external scheduler
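In the Boto-plus-cron style, the driver program can be as simple as running the steps in order and stopping at the first failure. The step names and lambdas below are hypothetical stand-ins for the real boto or CLI calls they represent.

```python
def run_pipeline(steps):
    """Run named ETL steps in order; return (completed, failed_step)."""
    completed = []
    for name, step in steps:
        try:
            step()
        except Exception:
            return completed, name        # stop at the first failure
        completed.append(name)
    return completed, None

# Example wiring -- each lambda would be a real boto call or CLI invocation:
pipeline = [
    ("launch_emr",       lambda: None),
    ("run_etl",          lambda: None),
    ("copy_to_redshift", lambda: None),
]
```

Cron (or any external scheduler) then just invokes this driver on whatever cadence the warehouse loads require.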
28. Back to AWS
1. Apply Redshift DDL and load tables
2. Run some queries
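An example of the kind of interactive, ad-hoc query the loaded star schema supports. The column names are assumptions based on the sample model, not queries taken from the demo.

```python
# Top-rated businesses by average stars -- a typical ad-hoc query against
# the f_reviews fact table and its d_business dimension.
TOP_BUSINESSES = """
SELECT d.business_key,
       AVG(f.stars) AS avg_stars,
       COUNT(*)     AS review_count
FROM   f_reviews f
JOIN   d_business d ON f.business_key = d.business_key
GROUP  BY d.business_key
HAVING COUNT(*) >= 100
ORDER  BY avg_stars DESC
LIMIT  10;
"""
```

With the sort key on the date column, adding a date-range predicate via d_date would let Redshift skip blocks entirely, keeping such queries interactive even at 100 million rows.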