3. Wes McKinney
@wesmckinn
• Former quant and MIT math dude
• Creator of Pandas project for Python
• Author of
Python for Data Analysis — O’Reilly
• Founder and CEO of DataPad
3
www.datapad.io
4. •
•
4
> 20k copies since Oct 2012
Bringing many new people
to Python and data analysis
with code
www.datapad.io
10. So, pandas
• Easy-to-use, fast in-memory data wrangling
and analytics library
• Enabled loads of complex data work to be
done by mere mortals in Python
• Might have kept R from taking over the
world (hehe)
10
www.datapad.io
12. pandas, the project
• 170 distinct contributors
• Over 5400 issues and pull requests
on GitHub
•
12
Upcoming 0.13 release
www.datapad.io
13. But.
• pandas’s broad applicability also a
liability
•
pandas being used in some
•
Only game in town for lot of things
unplanned ways
13
www.datapad.io
14. Some things to love
• No more structured dtype drudgery!
• Easy IO!
• Data alignment!
• Hierarchical indexing!
• Time series analytics!
14
www.datapad.io
15. More things to love
• Table reshaping
• Missing data handling
pandas.merge, pandas.concat
•
Expressive groupby machinery
•
15
www.datapad.io
16. Some pandas use cases
• General data wrangling
• ETL jobs
Business analytics (incl. BI uses)
•
Time series analysis, statistical
•
modeling
16
www.datapad.io
17. pandas does many things
that are tedious, slow, or
difficult to do correctly
without it
19. #1 Slightly too far from
the metal
• DataFrame’s internal structure
intended to make row-oriented ops
fast on numerical data
•
19
Python objects can be used as data,
indices (a feature, not a bug)
www.datapad.io
20. #2 No support (yet) for
memory maps
• Many analytics ops require a small portion
of the data
• Many ways to “materialize” the full data set
in memory by accident
• Axis indexes wouldn’t necessarily make
sense on out of core data sets
20
www.datapad.io
21. #2 No support (yet) for
memory maps
• N.B. HDF5/PyTables support is a
partial solution
21
www.datapad.io
22. #3 No tight database
integration
• Makes it difficult to be a serious tool
in an ETL toolchain on top of some
SQL-ish system
•
22
Inadequacy of pandas/NumPy data
type systems
www.datapad.io
23. #3 No tight database
integration
• Jobs with heavy SQL-reading are
slow and use tons of memory
•
23
TODO: integrate pandas with ODBC
C API and write out SQL data directly
into NumPy arrays
www.datapad.io
24. #4 Best-efforts NA
representation
• Inconsistent representation of
missing data
•
NA needs to be a first class citizen in
•
No Boolean or Integer NA values
analytics operations
24
www.datapad.io
25. #5 RAM management
• Difficult to understand footprint of pandas
object
• Ample data copying throughout library
• Would benefit from being able to compress
data in-memory or shuttle data temporarily
to disk
25
www.datapad.io
26. #6 Weak support for
categorical data
• Makes pandas not quite a fullyfledged R replacement
•
26
GroupBy and Joins slower than they
could be
www.datapad.io
27. #7 Complex GroupBy
operations get messy
• Must write custom functions to pass
to .apply(..)
•
27
Easy to run up against DRY
problems and general Python
syntax limitations
www.datapad.io
28. #8 Appending data slow
and tedious
• DataFrame not intended as a
database table
•
Makes streaming data use a
challenge
• B+ tree tables interesting?
28
www.datapad.io
29. #9 Limited type system,
column metadata
• Currencies, units
• Time zones
Geographic data
•
Composite data types
•
29
www.datapad.io
30. #10 No true query
processing layer
•
•
•
•
•
•
30
Filter
Group
Join
Aggregate
Limit/TopK
Sorting
WHERE, HAVING
GROUP BY
JOIN
SUM, MEAN, ...
LIMIT
ORDER BY
www.datapad.io
31. #11 “Slow”: no multicore /
distributed algos
• Hampered by use of Python data
structures / GIL interactions
•
31
Object internals not designed for
concurrent use
www.datapad.io
36. Focus on results
• I am heavily biased by focus on
business analytics/BI use cases
•
36
Need production-ready software to
ship in relatively short time frame
www.datapad.io
37. A new project
• In internal development at DataPad
• Code named “badger”
pandas-ish syntax: designed for
•
data processing and analytical
queries
37
www.datapad.io
38. Badger in a nutshell
•
Compressed columnar binary storage
•
• High perf analytical query processor
• Data preparation/cleaning tools
Consistent data type system
38
www.datapad.io
39. Badger in a nutshell
•
Immutable array data, little copying
•
• Analytics kernels: written C with no
Time series analytics
dependencies
•
39
Caching of useful intermediates
www.datapad.io
40. Some benchmarks
• Data set: 2012 Election data (FEC)
5.3 mm records 7 columns
•
• Tools
• pandas
badger
•
• R: data.table
SQL: PostgreSQL, SQLite
•
40
www.datapad.io
41. Query 1
• Total contributions by candidate
SELECT
cand_nm,
sum(contb_receipt_amt)
AS
total
FROM
fec
GROUP
BY
cand_nm
41
www.datapad.io
43. Query 2
contributions by candidate
• Totalstate
and
SELECT
cand_nm,
contbr_st,
sum(contb_receipt_amt)
AS
total
FROM
fec
GROUP
BY
cand_nm,
contbr_st
43
www.datapad.io
44. Query 2
•
Total contributions by candidate and
state
badger
(in-‐memory)
:
269ms
(1x)
badger
(from-‐disk)
:
391ms
(1.5x)
R
data.table
1.8.10:
500ms
(1.8x)
pandas
(in-‐memory)
:
770ms
(2.9x)
PostgreSQL
:
5.96s
(23x)
44
www.datapad.io
45. Query 3
• Total contributions by candidate
and state with 2 filter predicates
SELECT
cand_nm,
sum(contb_receipt_amt)
as
total
FROM
fec
WHERE
contb_receipt_dt
BETWEEN
'2012-‐05-‐01'
and
'2012-‐11-‐05'
AND
contb_receipt_amt
BETWEEN
0
and
2500
GROUP
BY
cand_nm
45
www.datapad.io
46. Query 3
• Total contributions by candidate
and state with 2 filter predicates
badger
(in-‐memory)
:
96ms
(1x)
badger
(from-‐disk)
:
275ms
(2.9x)
pandas
(in-‐memory)
:
946ms
(9.8x)
PostgreSQL
:
6.2s
(65x)
46
www.datapad.io
47. Badger, the future
• Distributed in-memory analytics
• Multicore algorithms
• ETL job-building tools
• Open source in some form someday
Looking for algorithms hackers to help
•
47
www.datapad.io