SlideShare a Scribd company logo
1 of 48
Download to read offline
Practical Medium Data
Analytics with Python
PyData NYC 2013
Practical Medium Data
Analytics with Python
10 Things I Hate
About pandas
PyData NYC 2013
Wes McKinney
@wesmckinn
• Former quant and MIT math dude
• Creator of Pandas project for Python
• Author of
Python for Data Analysis — O’Reilly

• Founder and CEO of DataPad

3

www.datapad.io
•
•

4

> 20k copies since Oct 2012
Bringing many new people
to Python and data analysis
with code

www.datapad.io
• http://datapad.io
Founded in 2013, located in SF
•
In private beta, join us!
•
• Hiring for engineering
www.datapad.io
Why hate on pandas?
7

www.datapad.io
pandas rocks!
So, pandas
• Easy-to-use, fast in-memory data wrangling
and analytics library

• Enabled loads of complex data work to be
done by mere mortals in Python

• Might have kept R from taking over the
world (hehe)

10

www.datapad.io
11

www.datapad.io
pandas, the project

• 170 distinct contributors
• Over 5400 issues and pull requests
on GitHub

•
12

Upcoming 0.13 release

www.datapad.io
But.

• pandas’s broad applicability also a
liability

•
pandas being used in some
•

Only game in town for lot of things
unplanned ways

13

www.datapad.io
Some things to love
• No more structured dtype drudgery!
• Easy IO!
• Data alignment!
• Hierarchical indexing!
• Time series analytics!
14

www.datapad.io
More things to love

• Table reshaping
• Missing data handling
pandas.merge, pandas.concat
•
Expressive groupby machinery
•
15

www.datapad.io
Some pandas use cases

• General data wrangling
• ETL jobs
Business analytics (incl. BI uses)
•
Time series analysis, statistical
•
modeling

16

www.datapad.io
pandas does many things
that are tedious, slow, or
difficult to do correctly
without it
Unfortunately, pandas is
not a database
#1 Slightly too far from
the metal

• DataFrame’s internal structure

intended to make row-oriented ops
fast on numerical data

•
19

Python objects can be used as data,
indices (a feature, not a bug)
www.datapad.io
#2 No support (yet) for
memory maps
• Many analytics ops require a small portion
of the data

• Many ways to “materialize” the full data set
in memory by accident

• Axis indexes wouldn’t necessarily make
sense on out of core data sets

20

www.datapad.io
#2 No support (yet) for
memory maps

• N.B. HDF5/PyTables support is a
partial solution

21

www.datapad.io
#3 No tight database
integration

• Makes it difficult to be a serious tool
in an ETL toolchain on top of some
SQL-ish system

•
22

Inadequacy of pandas/NumPy data
type systems
www.datapad.io
#3 No tight database
integration

• Jobs with heavy SQL-reading are
slow and use tons of memory

•

23

TODO: integrate pandas with ODBC
C API and write out SQL data directly
into NumPy arrays
www.datapad.io
#4 Best-efforts NA
representation

• Inconsistent representation of
missing data

•
NA needs to be a first class citizen in
•
No Boolean or Integer NA values
analytics operations

24

www.datapad.io
#5 RAM management
• Difficult to understand footprint of pandas
object

• Ample data copying throughout library
• Would benefit from being able to compress

data in-memory or shuttle data temporarily
to disk

25

www.datapad.io
#6 Weak support for
categorical data

• Makes pandas not quite a fullyfledged R replacement

•

26

GroupBy and Joins slower than they
could be

www.datapad.io
#7 Complex GroupBy
operations get messy

• Must write custom functions to pass
to .apply(..)

•

27

Easy to run up against DRY
problems and general Python
syntax limitations
www.datapad.io
#8 Appending data slow
and tedious

• DataFrame not intended as a
database table

•

Makes streaming data use a
challenge

• B+ tree tables interesting?
28

www.datapad.io
#9 Limited type system,
column metadata

• Currencies, units
• Time zones
Geographic data
•
Composite data types
•
29

www.datapad.io
#10 No true query
processing layer

•
•
•
•
•
•
30

Filter
Group
Join
Aggregate
Limit/TopK
Sorting

WHERE, HAVING
GROUP BY
JOIN
SUM, MEAN, ...
LIMIT
ORDER BY
www.datapad.io
#11 “Slow”: no multicore /
distributed algos

• Hampered by use of Python data
structures / GIL interactions

•

31

Object internals not designed for
concurrent use

www.datapad.io
Oh no what do we do
Stop believing in the “one
tool to rule them all”
“Real Artists Ship”
- Steve Jobs
www.datapad.io
Focus on results

• I am heavily biased by focus on
business analytics/BI use cases

•

36

Need production-ready software to
ship in relatively short time frame

www.datapad.io
A new project

• In internal development at DataPad
• Code named “badger”
pandas-ish syntax: designed for
•
data processing and analytical
queries

37

www.datapad.io
Badger in a nutshell

•
Compressed columnar binary storage
•
• High perf analytical query processor
• Data preparation/cleaning tools
Consistent data type system

38

www.datapad.io
Badger in a nutshell

•
Immutable array data, little copying
•
• Analytics kernels: written C with no
Time series analytics

dependencies

•
39

Caching of useful intermediates
www.datapad.io
Some benchmarks

• Data set: 2012 Election data (FEC)
5.3 mm records 7 columns
•
• Tools
• pandas
badger
•
• R: data.table
SQL: PostgreSQL, SQLite
•
40

www.datapad.io
Query 1

• Total contributions by candidate
SELECT	
  cand_nm,	
  
	
  	
  	
  	
  	
  	
  	
  sum(contb_receipt_amt)	
  AS	
  total
FROM	
  fec
GROUP	
  BY	
  cand_nm

41

www.datapad.io
Query 1

• Total contributions by candidate
badger	
  (in-­‐memory)	
  :	
  	
  	
  19ms	
  (1x)
badger	
  (from-­‐disk)	
  :	
  	
  131ms	
  (6.9x)
pandas	
  (in-­‐memory)	
  :	
  	
  273ms	
  (14.3x)
R	
  data.table	
  1.8.10:	
  	
  382ms	
  (20x)
PostgreSQL	
  	
  	
  	
  	
  	
  	
  	
  	
  :	
  	
  	
  4.7s	
  (247x)
SQLite	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  :	
  	
  	
  	
  72s	
  (3800x)

42

www.datapad.io
Query 2
contributions by candidate
• Totalstate
and
SELECT	
  cand_nm,	
  contbr_st,
	
  	
  	
  	
  	
  	
  	
  sum(contb_receipt_amt)	
  AS	
  total
FROM	
  fec
GROUP	
  BY	
  cand_nm,	
  contbr_st

43

www.datapad.io
Query 2

•

Total contributions by candidate and
state

badger	
  (in-­‐memory)	
  :	
  	
  269ms	
  (1x)
badger	
  (from-­‐disk)	
  :	
  	
  391ms	
  (1.5x)
R	
  data.table	
  1.8.10:	
  	
  500ms	
  (1.8x)
pandas	
  (in-­‐memory)	
  :	
  	
  770ms	
  (2.9x)
PostgreSQL	
  	
  	
  	
  	
  	
  	
  	
  	
  :	
  	
  5.96s	
  (23x)

44

www.datapad.io
Query 3

• Total contributions by candidate
and state with 2 filter predicates

SELECT	
  cand_nm,
	
  	
  	
  	
  	
  	
  	
  sum(contb_receipt_amt)	
  as	
  total
FROM	
  fec
WHERE	
  contb_receipt_dt	
  BETWEEN
	
  	
  	
  	
  	
  	
  	
  	
  '2012-­‐05-­‐01'	
  and	
  '2012-­‐11-­‐05'
	
  	
  AND	
  contb_receipt_amt	
  BETWEEN	
  
	
  	
  	
  	
  	
  	
  	
  	
  0	
  and	
  2500
GROUP	
  BY	
  cand_nm
45

www.datapad.io
Query 3

• Total contributions by candidate
and state with 2 filter predicates

badger	
  (in-­‐memory)	
  :	
  	
  	
  96ms	
  (1x)
badger	
  (from-­‐disk)	
  :	
  	
  275ms	
  (2.9x)
pandas	
  (in-­‐memory)	
  :	
  	
  946ms	
  (9.8x)
PostgreSQL	
  	
  	
  	
  	
  	
  	
  	
  	
  :	
  	
  	
  6.2s	
  (65x)

46

www.datapad.io
Badger, the future

• Distributed in-memory analytics
• Multicore algorithms
• ETL job-building tools
• Open source in some form someday
Looking for algorithms hackers to help
•
47

www.datapad.io
Thank you!

48

www.datapad.io

More Related Content

What's hot

Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsCloudera, Inc.
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architecturesDaniel Marcous
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...Simplilearn
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream ProcessingGuido Schmutz
 
Building Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkBuilding Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkDatabricks
 
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...DataWorks Summit
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lakeJames Serra
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceDatabricks
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overviewDataArt
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component rebeccatho
 
Improving HDFS Availability with IPC Quality of Service
Improving HDFS Availability with IPC Quality of ServiceImproving HDFS Availability with IPC Quality of Service
Improving HDFS Availability with IPC Quality of ServiceDataWorks Summit
 
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...Simplilearn
 
Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...
Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...
Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...Edureka!
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark InternalsPietro Michiardi
 
Improving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMImproving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMHolden Karau
 
Apache phoenix: Past, Present and Future of SQL over HBAse
Apache phoenix: Past, Present and Future of SQL over HBAseApache phoenix: Past, Present and Future of SQL over HBAse
Apache phoenix: Past, Present and Future of SQL over HBAseenissoz
 

What's hot (20)

Hadoop and Spark
Hadoop and SparkHadoop and Spark
Hadoop and Spark
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
 
Spark
SparkSpark
Spark
 
Big data real time architectures
Big data real time architecturesBig data real time architectures
Big data real time architectures
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream Processing
 
Building Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache SparkBuilding Robust ETL Pipelines with Apache Spark
Building Robust ETL Pipelines with Apache Spark
 
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
Unify Stream and Batch Processing using Dataflow, a Portable Programmable Mod...
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
 
Improving HDFS Availability with IPC Quality of Service
Improving HDFS Availability with IPC Quality of ServiceImproving HDFS Availability with IPC Quality of Service
Improving HDFS Availability with IPC Quality of Service
 
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
 
Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...
Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...
Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
 
Improving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMImproving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVM
 
Big data architectures
Big data architecturesBig data architectures
Big data architectures
 
Big data Analytics Hadoop
Big data Analytics HadoopBig data Analytics Hadoop
Big data Analytics Hadoop
 
Apache phoenix: Past, Present and Future of SQL over HBAse
Apache phoenix: Past, Present and Future of SQL over HBAseApache phoenix: Past, Present and Future of SQL over HBAse
Apache phoenix: Past, Present and Future of SQL over HBAse
 

Similar to Practical Medium Data Analytics with Python (10 Things I Hate About pandas, PyData NYC 2013)

The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)DataPad Inc.
 
Building Better Analytics Workflows (Strata-Hadoop World 2013)
Building Better Analytics Workflows (Strata-Hadoop World 2013)Building Better Analytics Workflows (Strata-Hadoop World 2013)
Building Better Analytics Workflows (Strata-Hadoop World 2013)Wes McKinney
 
Data Science meets Software Development
Data Science meets Software DevelopmentData Science meets Software Development
Data Science meets Software DevelopmentAlexis Seigneurin
 
Continuum Analytics and Python
Continuum Analytics and PythonContinuum Analytics and Python
Continuum Analytics and PythonTravis Oliphant
 
Games Industry Analytics Forum 2 - Plumbee
Games Industry Analytics Forum 2 - PlumbeeGames Industry Analytics Forum 2 - Plumbee
Games Industry Analytics Forum 2 - PlumbeeGIAF
 
Workflow Hacks #1 - dots. Tokyo
Workflow Hacks #1 - dots. TokyoWorkflow Hacks #1 - dots. Tokyo
Workflow Hacks #1 - dots. TokyoTaro L. Saito
 
DSD-INT 2017 The use of big data for dredging - De Boer
DSD-INT 2017 The use of big data for dredging - De BoerDSD-INT 2017 The use of big data for dredging - De Boer
DSD-INT 2017 The use of big data for dredging - De BoerDeltares
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadatamarkgrover
 
MySQL performance monitoring using Statsd and Graphite (PLUK2013)
MySQL performance monitoring using Statsd and Graphite (PLUK2013)MySQL performance monitoring using Statsd and Graphite (PLUK2013)
MySQL performance monitoring using Statsd and Graphite (PLUK2013)spil-engineering
 
Data Integration and Data Warehousing for Cloud, Big Data and IoT: 
What’s Ne...
Data Integration and Data Warehousing for Cloud, Big Data and IoT: 
What’s Ne...Data Integration and Data Warehousing for Cloud, Big Data and IoT: 
What’s Ne...
Data Integration and Data Warehousing for Cloud, Big Data and IoT: 
What’s Ne...Rittman Analytics
 
Big Data Ecosystem at LinkedIn. Keynote talk at Big Data Innovators Gathering...
Big Data Ecosystem at LinkedIn. Keynote talk at Big Data Innovators Gathering...Big Data Ecosystem at LinkedIn. Keynote talk at Big Data Innovators Gathering...
Big Data Ecosystem at LinkedIn. Keynote talk at Big Data Innovators Gathering...Mitul Tiwari
 
Teradata Partners Conference Oct 2014 Big Data Anti-Patterns
Teradata Partners Conference Oct 2014   Big Data Anti-PatternsTeradata Partners Conference Oct 2014   Big Data Anti-Patterns
Teradata Partners Conference Oct 2014 Big Data Anti-PatternsDouglas Moore
 
Data Warehousing 2016
Data Warehousing 2016Data Warehousing 2016
Data Warehousing 2016Kent Graziano
 
pandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Pythonpandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for PythonWes McKinney
 
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...Mihai Criveti
 
New Capabilities in the PyData Ecosystem
New Capabilities in the PyData EcosystemNew Capabilities in the PyData Ecosystem
New Capabilities in the PyData EcosystemTuri, Inc.
 
Webinar: Introducing the MongoDB Connector for BI 2.0 with Tableau
Webinar: Introducing the MongoDB Connector for BI 2.0 with TableauWebinar: Introducing the MongoDB Connector for BI 2.0 with Tableau
Webinar: Introducing the MongoDB Connector for BI 2.0 with TableauMongoDB
 
A look inside pandas design and development
A look inside pandas design and developmentA look inside pandas design and development
A look inside pandas design and developmentWes McKinney
 

Similar to Practical Medium Data Analytics with Python (10 Things I Hate About pandas, PyData NYC 2013) (20)

The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)
 
Intro to Big Data - Spark
Intro to Big Data - SparkIntro to Big Data - Spark
Intro to Big Data - Spark
 
Building Better Analytics Workflows (Strata-Hadoop World 2013)
Building Better Analytics Workflows (Strata-Hadoop World 2013)Building Better Analytics Workflows (Strata-Hadoop World 2013)
Building Better Analytics Workflows (Strata-Hadoop World 2013)
 
Data Science meets Software Development
Data Science meets Software DevelopmentData Science meets Software Development
Data Science meets Software Development
 
A data analyst view of Bigdata
A data analyst view of Bigdata A data analyst view of Bigdata
A data analyst view of Bigdata
 
Continuum Analytics and Python
Continuum Analytics and PythonContinuum Analytics and Python
Continuum Analytics and Python
 
Games Industry Analytics Forum 2 - Plumbee
Games Industry Analytics Forum 2 - PlumbeeGames Industry Analytics Forum 2 - Plumbee
Games Industry Analytics Forum 2 - Plumbee
 
Workflow Hacks #1 - dots. Tokyo
Workflow Hacks #1 - dots. TokyoWorkflow Hacks #1 - dots. Tokyo
Workflow Hacks #1 - dots. Tokyo
 
DSD-INT 2017 The use of big data for dredging - De Boer
DSD-INT 2017 The use of big data for dredging - De BoerDSD-INT 2017 The use of big data for dredging - De Boer
DSD-INT 2017 The use of big data for dredging - De Boer
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadata
 
MySQL performance monitoring using Statsd and Graphite (PLUK2013)
MySQL performance monitoring using Statsd and Graphite (PLUK2013)MySQL performance monitoring using Statsd and Graphite (PLUK2013)
MySQL performance monitoring using Statsd and Graphite (PLUK2013)
 
Data Integration and Data Warehousing for Cloud, Big Data and IoT: 
What’s Ne...
Data Integration and Data Warehousing for Cloud, Big Data and IoT: 
What’s Ne...Data Integration and Data Warehousing for Cloud, Big Data and IoT: 
What’s Ne...
Data Integration and Data Warehousing for Cloud, Big Data and IoT: 
What’s Ne...
 
Big Data Ecosystem at LinkedIn. Keynote talk at Big Data Innovators Gathering...
Big Data Ecosystem at LinkedIn. Keynote talk at Big Data Innovators Gathering...Big Data Ecosystem at LinkedIn. Keynote talk at Big Data Innovators Gathering...
Big Data Ecosystem at LinkedIn. Keynote talk at Big Data Innovators Gathering...
 
Teradata Partners Conference Oct 2014 Big Data Anti-Patterns
Teradata Partners Conference Oct 2014   Big Data Anti-PatternsTeradata Partners Conference Oct 2014   Big Data Anti-Patterns
Teradata Partners Conference Oct 2014 Big Data Anti-Patterns
 
Data Warehousing 2016
Data Warehousing 2016Data Warehousing 2016
Data Warehousing 2016
 
pandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Pythonpandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Python
 
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
 
New Capabilities in the PyData Ecosystem
New Capabilities in the PyData EcosystemNew Capabilities in the PyData Ecosystem
New Capabilities in the PyData Ecosystem
 
Webinar: Introducing the MongoDB Connector for BI 2.0 with Tableau
Webinar: Introducing the MongoDB Connector for BI 2.0 with TableauWebinar: Introducing the MongoDB Connector for BI 2.0 with Tableau
Webinar: Introducing the MongoDB Connector for BI 2.0 with Tableau
 
A look inside pandas design and development
A look inside pandas design and developmentA look inside pandas design and development
A look inside pandas design and development
 

More from Wes McKinney

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowWes McKinney
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityWes McKinney
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkWes McKinney
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache ArrowWes McKinney
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportWes McKinney
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesWes McKinney
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Wes McKinney
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future Wes McKinney
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackWes McKinney
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionWes McKinney
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackWes McKinney
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Wes McKinney
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"Wes McKinney
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Wes McKinney
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataWes McKinney
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataWes McKinney
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data ScienceWes McKinney
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Wes McKinney
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningWes McKinney
 

More from Wes McKinney (20)

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data Science
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
 

Recently uploaded

Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 

Recently uploaded (20)

Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 

Practical Medium Data Analytics with Python (10 Things I Hate About pandas, PyData NYC 2013)

  • 1. Practical Medium Data Analytics with Python PyData NYC 2013
  • 2. Practical Medium Data Analytics with Python 10 Things I Hate About pandas PyData NYC 2013
  • 3. Wes McKinney @wesmckinn • Former quant and MIT math dude • Creator of Pandas project for Python • Author of Python for Data Analysis — O’Reilly • Founder and CEO of DataPad 3 www.datapad.io
  • 4. • • 4 > 20k copies since Oct 2012 Bringing many new people to Python and data analysis with code www.datapad.io
  • 5. • http://datapad.io Founded in 2013, located in SF • In private beta, join us! • • Hiring for engineering www.datapad.io
  • 6. Why hate on pandas?
  • 9.
  • 10. So, pandas • Easy-to-use, fast in-memory data wrangling and analytics library • Enabled loads of complex data work to be done by mere mortals in Python • Might have kept R from taking over the world (hehe) 10 www.datapad.io
  • 12. pandas, the project • 170 distinct contributors • Over 5400 issues and pull requests on GitHub • 12 Upcoming 0.13 release www.datapad.io
  • 13. But. • pandas’s broad applicability also a liability • pandas being used in some • Only game in town for lot of things unplanned ways 13 www.datapad.io
  • 14. Some things to love • No more structured dtype drudgery! • Easy IO! • Data alignment! • Hierarchical indexing! • Time series analytics! 14 www.datapad.io
  • 15. More things to love • Table reshaping • Missing data handling pandas.merge, pandas.concat • Expressive groupby machinery • 15 www.datapad.io
  • 16. Some pandas use cases • General data wrangling • ETL jobs Business analytics (incl. BI uses) • Time series analysis, statistical • modeling 16 www.datapad.io
  • 17. pandas does many things that are tedious, slow, or difficult to do correctly without it
  • 19. #1 Slightly too far from the metal • DataFrame’s internal structure intended to make row-oriented ops fast on numerical data • 19 Python objects can be used as data, indices (a feature, not a bug) www.datapad.io
  • 20. #2 No support (yet) for memory maps • Many analytics ops require a small portion of the data • Many ways to “materialize” the full data set in memory by accident • Axis indexes wouldn’t necessarily make sense on out of core data sets 20 www.datapad.io
  • 21. #2 No support (yet) for memory maps • N.B. HDF5/PyTables support is a partial solution 21 www.datapad.io
  • 22. #3 No tight database integration • Makes it difficult to be a serious tool in an ETL toolchain on top of some SQL-ish system • 22 Inadequacy of pandas/NumPy data type systems www.datapad.io
  • 23. #3 No tight database integration • Jobs with heavy SQL-reading are slow and use tons of memory • 23 TODO: integrate pandas with ODBC C API and write out SQL data directly into NumPy arrays www.datapad.io
  • 24. #4 Best-efforts NA representation • Inconsistent representation of missing data • NA needs to be a first class citizen in • No Boolean or Integer NA values analytics operations 24 www.datapad.io
  • 25. #5 RAM management • Difficult to understand footprint of pandas object • Ample data copying throughout library • Would benefit from being able to compress data in-memory or shuttle data temporarily to disk 25 www.datapad.io
  • 26. #6 Weak support for categorical data • Makes pandas not quite a fullyfledged R replacement • 26 GroupBy and Joins slower than they could be www.datapad.io
  • 27. #7 Complex GroupBy operations get messy • Must write custom functions to pass to .apply(..) • 27 Easy to run up against DRY problems and general Python syntax limitations www.datapad.io
  • 28. #8 Appending data slow and tedious • DataFrame not intended as a database table • Makes streaming data use a challenge • B+ tree tables interesting? 28 www.datapad.io
  • 29. #9 Limited type system, column metadata • Currencies, units • Time zones Geographic data • Composite data types • 29 www.datapad.io
  • 30. #10 No true query processing layer • • • • • • 30 Filter Group Join Aggregate Limit/TopK Sorting WHERE, HAVING GROUP BY JOIN SUM, MEAN, ... LIMIT ORDER BY www.datapad.io
  • 31. #11 “Slow”: no multicore / distributed algos • Hampered by use of Python data structures / GIL interactions • 31 Object internals not designed for concurrent use www.datapad.io
  • 32. Oh no what do we do
  • 33. Stop believing in the “one tool to rule them all”
  • 36. Focus on results • I am heavily biased by focus on business analytics/BI use cases • 36 Need production-ready software to ship in relatively short time frame www.datapad.io
  • 37. A new project • In internal development at DataPad • Code named “badger” pandas-ish syntax: designed for • data processing and analytical queries 37 www.datapad.io
  • 38. Badger in a nutshell • Compressed columnar binary storage • • High perf analytical query processor • Data preparation/cleaning tools Consistent data type system 38 www.datapad.io
  • 39. Badger in a nutshell • Immutable array data, little copying • • Analytics kernels: written C with no Time series analytics dependencies • 39 Caching of useful intermediates www.datapad.io
  • 40. Some benchmarks • Data set: 2012 Election data (FEC) 5.3 mm records 7 columns • • Tools • pandas badger • • R: data.table SQL: PostgreSQL, SQLite • 40 www.datapad.io
  • 41. Query 1 • Total contributions by candidate SELECT  cand_nm,                sum(contb_receipt_amt)  AS  total FROM  fec GROUP  BY  cand_nm 41 www.datapad.io
  • 42. Query 1 • Total contributions by candidate badger  (in-­‐memory)  :      19ms  (1x) badger  (from-­‐disk)  :    131ms  (6.9x) pandas  (in-­‐memory)  :    273ms  (14.3x) R  data.table  1.8.10:    382ms  (20x) PostgreSQL                  :      4.7s  (247x) SQLite                          :        72s  (3800x) 42 www.datapad.io
  • 43. Query 2 contributions by candidate • Totalstate and SELECT  cand_nm,  contbr_st,              sum(contb_receipt_amt)  AS  total FROM  fec GROUP  BY  cand_nm,  contbr_st 43 www.datapad.io
  • 44. Query 2 • Total contributions by candidate and state badger  (in-­‐memory)  :    269ms  (1x) badger  (from-­‐disk)  :    391ms  (1.5x) R  data.table  1.8.10:    500ms  (1.8x) pandas  (in-­‐memory)  :    770ms  (2.9x) PostgreSQL                  :    5.96s  (23x) 44 www.datapad.io
  • 45. Query 3 • Total contributions by candidate and state with 2 filter predicates SELECT  cand_nm,              sum(contb_receipt_amt)  as  total FROM  fec WHERE  contb_receipt_dt  BETWEEN                '2012-­‐05-­‐01'  and  '2012-­‐11-­‐05'    AND  contb_receipt_amt  BETWEEN                  0  and  2500 GROUP  BY  cand_nm 45 www.datapad.io
  • 46. Query 3 • Total contributions by candidate and state with 2 filter predicates badger  (in-­‐memory)  :      96ms  (1x) badger  (from-­‐disk)  :    275ms  (2.9x) pandas  (in-­‐memory)  :    946ms  (9.8x) PostgreSQL                  :      6.2s  (65x) 46 www.datapad.io
  • 47. Badger, the future • Distributed in-memory analytics • Multicore algorithms • ETL job-building tools • Open source in some form someday Looking for algorithms hackers to help • 47 www.datapad.io