Practical Machine
Learning in Python
Matt Spitz
       via
@mattspitz

This is the Age of Aquarius Data
• Data is plentiful
  • application logs
  • external APIs
    • Facebook, Twitter
  • public datasets
• Analysis adds value
  • understanding your users
  • dynamic application decisions
• Storage / CPU time is cheap

Machine Learning in Python
• Python is well-suited for data analysis
• Versatile
  • quick and dirty scripts
  • full-featured, realtime applications
• Mature ML packages
  • tons of choices (see: mloss.org)
  • plug-and-play or DIY

Classification Problem: Terminology
• Data points
  • feature set: “interesting” facts about an event/thing
  • label: a description of that event/thing
• Classification
  • training set: a bunch of labeled feature sets
  • given a training set, build a classifier to predict labels for
    unlabeled feature sets

SluggerML
• Two questions
   • What features are strong predictors for home runs and strikeouts?
   • Given a particular situation, with what probability will the batter
     hit a home run or strike out?
• Feature sets represent game state for a plate appearance
   • game: day vs. night, wind direction...
   • at-bat: inning, #strikes, left-right matchup...
   • batter/pitcher: age, weight, fielding position...
• Labels represent outcome
   • HR (home run), K (strikeout), OTHER
• Poor Man’s Sabermetrics

SluggerML: Example
• Training set
  • {game_daynight: day, batter_age: 24, pitcher_weight: 211}
    • label: HR
  • {game_daynight: day, batter_age: 36, pitcher_weight: 242}
    • label: K
  • {game_daynight: night, batter_age: 27, pitcher_weight: 195}
    • label: OTHER
• Classifier predictions
  • {game_daynight: night, batter_age: 36, pitcher_weight: 225}
    • 2.6% HR, 15.6% K
  • {game_daynight: day, batter_age: 20, pitcher_weight: 216}
    • 2.2% HR, 19.1% K
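
The train-then-predict flow on this slide can be sketched in plain Python. This is a minimal categorical naive Bayes with add-one smoothing, written only for illustration; it is not the deck's code or nltk's implementation, and the function names `train` and `predict_proba` are assumptions. Every feature value, including the numeric ones, is treated as a category.

```python
from collections import Counter, defaultdict


def train(labeled_feature_sets):
    """Count label frequencies and per-label (feature, value) frequencies."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)  # label -> Counter of (feature, value)
    values_seen = defaultdict(set)       # feature -> values seen in training
    for features, label in labeled_feature_sets:
        label_counts[label] += 1
        for name, value in features.items():
            value_counts[label][(name, value)] += 1
            values_seen[name].add(value)
    return label_counts, value_counts, values_seen


def predict_proba(model, features):
    """Return {label: probability} for one unlabeled feature set."""
    label_counts, value_counts, values_seen = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = count / total  # prior P(label)
        for name, value in features.items():
            # Add-one smoothing; the extra +1 reserves mass for unseen values.
            n_values = len(values_seen[name]) + 1
            score *= (value_counts[label][(name, value)] + 1) / (count + n_values)
        scores[label] = score
    norm = sum(scores.values())
    return {label: score / norm for label, score in scores.items()}


# The three training rows from the slide.
training_set = [
    ({"game_daynight": "day", "batter_age": 24, "pitcher_weight": 211}, "HR"),
    ({"game_daynight": "day", "batter_age": 36, "pitcher_weight": 242}, "K"),
    ({"game_daynight": "night", "batter_age": 27, "pitcher_weight": 195}, "OTHER"),
]
model = train(training_set)
probs = predict_proba(
    model, {"game_daynight": "night", "batter_age": 36, "pitcher_weight": 225}
)
print(probs)
```

With only three training rows the resulting probabilities are not meaningful; the percentages on the slide come from the full training set.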

SluggerML: Gathering Data
• Sources
  • Retrosheet
    • play-by-play logs for every game since 1956
  • Sean Lahman’s Baseball Archive
    • detailed stats about individual players
• Coalescing
  • 1st pass, Lahman: create player database
    • shelve module
  • 2nd pass, Retrosheet: track game state, join on player db
• Scrubbing
  • ensure consistency
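
The two-pass coalescing step might look roughly like the sketch below. The record layouts, field names, and the helpers `build_player_db` and `labeled_feature_sets` are hypothetical stand-ins; the real Lahman and Retrosheet files carry many more fields and need their own parsers.

```python
import os
import shelve
import tempfile

# Hypothetical, simplified records standing in for the two sources.
players = [
    {"player_id": "batter01", "birth_year": 1975, "weight": 205},
    {"player_id": "pitcher01", "birth_year": 1980, "weight": 220},
]
plate_appearances = [
    {"year": 2004, "daynight": "day", "batter": "batter01",
     "pitcher": "pitcher01", "outcome": "K"},
]


def build_player_db(path, player_rows):
    """First pass: store each player record on disk, keyed by player ID."""
    with shelve.open(path) as db:
        for row in player_rows:
            db[row["player_id"]] = row


def labeled_feature_sets(path, events):
    """Second pass: join each plate appearance against the player database."""
    with shelve.open(path, flag="r") as db:
        for event in events:
            batter = db[event["batter"]]
            pitcher = db[event["pitcher"]]
            features = {
                "game_daynight": event["daynight"],
                # Approximate age: season year minus birth year.
                "batter_age": event["year"] - batter["birth_year"],
                "pitcher_weight": pitcher["weight"],
            }
            yield features, event["outcome"]


with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "players")
    build_player_db(db_path, players)
    rows = list(labeled_feature_sets(db_path, plate_appearances))

print(rows)
```

Keeping the player database in a shelf rather than a dict means the second pass only loads the players it actually looks up.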

SluggerML: Gathering Data
• Training set
  • regular-season games from 1980-2011
  • 5,669,301 plate appearances
     • 135,602 home runs
     • 871,226 strikeouts
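
As a quick worked example, the counts above give the base rate of each label. The snippet is only arithmetic on the slide's numbers; the variable names are illustrative.

```python
plate_appearances = 5_669_301
counts = {"HR": 135_602, "K": 871_226}
# Everything that is neither a home run nor a strikeout is labeled OTHER.
counts["OTHER"] = plate_appearances - sum(counts.values())

rates = {label: n / plate_appearances for label, n in counts.items()}
for label, rate in rates.items():
    print(f"{label}: {rate:.2%}")
```

Home runs come out at roughly 2.4% of plate appearances and strikeouts at roughly 15.4%, so the labels are heavily imbalanced toward OTHER.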

Selecting a Toolkit: Tradeoffs
• Speed
  • offline vs. realtime
• Transparency
   • internal visibility
   • customizability
• Support
  • maturity
  • community

Selecting a Toolkit: High-Level Options
• External bindings
  • python interfaces to popular packages
  • Matlab, R, Octave, SHOGUN Toolbox
  • transition legacy workflows
• Python implementations
  • collections of algorithms
  • (mostly) python
  • external subcomponents
• DIY
  • building blocks

Selecting a Toolkit: Python Implementations
• nltk
  • focus on NLP
  • book: Natural Language Processing with Python (O’Reilly ’09)
• mlpy
  • regression, classification, clustering
• PyML
  • focus on SVM
• PyBrain
  • focus on neural networks

Selecting a Toolkit: Python Implementations
• mdp-toolkit
  • data processing management
  • nodes represent tasks in a data workflow
  • scheduling, parallelization
• scikit-learn
  • supervised, unsupervised, feature selection, visualization
  • heavy development, large team
  • excellent documentation
  • active community

Selecting a Toolkit: Do It Yourself
• Basic building blocks
  • NumPy
  • SciPy
• C/C++ implementations
  • LIBLINEAR
  • LIBSVM
  • OpenCV
  • ...your own?

SluggerML: Two Questions
• What features are strong predictors for home runs
  and strikeouts?
• Given a particular situation, with what probability will
  the batter hit a home run or strike out?

SluggerML: Feature Selection
• Identifies predictive features
  • strongly correlated with labels
  • predictive: max_benchpress
  • not predictive: favorite_cookie
• scikit-learn: chi-square feature selection
• Visualizing significance
  • for each well-supported value, find correlation with HR/K
     • “well-supported”: >= 0.05% of samples with feature=value
     • correlation: ( P(HR | feature=value) / P(HR) ) - 1
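
The correlation measure and the support threshold can be sketched as follows. This covers only the visualization metric, not the chi-square selection done with scikit-learn; the function name `correlations`, the feature name `batter_team`, and the toy data are assumptions made up for the example, not values from the deck.

```python
from collections import Counter


def correlations(labeled_feature_sets, feature, label, min_support=0.0005):
    """Return {value: P(label | feature=value) / P(label) - 1}.

    Only values seen in at least `min_support` (0.05%) of samples are kept.
    """
    total = 0
    label_total = 0
    value_counts = Counter()
    value_label_counts = Counter()
    for features, outcome in labeled_feature_sets:
        total += 1
        hit = outcome == label
        label_total += hit
        if feature in features:
            value = features[feature]
            value_counts[value] += 1
            value_label_counts[value] += hit
    if total == 0 or label_total == 0:
        return {}
    base_rate = label_total / total  # P(label)
    result = {}
    for value, n in value_counts.items():
        if n / total >= min_support:
            result[value] = (value_label_counts[value] / n) / base_rate - 1
    return result


# Toy data: 3 of 10 home-team and 1 of 10 visiting-team plate appearances are HR.
toy = (
    [({"batter_team": "home"}, "HR")] * 3
    + [({"batter_team": "home"}, "OTHER")] * 7
    + [({"batter_team": "visiting"}, "HR")] * 1
    + [({"batter_team": "visiting"}, "OTHER")] * 9
)
result = correlations(toy, "batter_team", "HR")
print(result)
```

A value of 0 means the feature value leaves the label's probability unchanged; positive and negative values mean the label is more or less likely than its base rate.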

SluggerML: Feature Selection
• Chart: Batter: Home vs. Visiting
  • y-axis: correlation, -50.0% to 50.0%
  • series: Home Run, Strikeout
  • x-axis: home team, visiting team

SluggerML: Feature Selection
• Chart: Batter: Fielding Position
  • y-axis: correlation, -50.0% to 50.0%
  • series: Home Run, Strikeout
  • x-axis: P, C, 1B, 2B, 3B, SS, LF, CF, RF, DH, PH

SluggerML: Feature Selection
• Chart: Game: Temperature (˚F)
  • y-axis: correlation, -50.0% to 50.0%
  • series: Home Run, Strikeout
  • x-axis: 5-degree bins from 35-39 to 100-104

SluggerML: Feature Selection
• Chart: Game: Year
  • y-axis: correlation, -50.0% to 50.0%
  • series: Home Run, Strikeout
  • x-axis: 1980-1984, 1985-1989, 1990-1994, 1995-1999, 2000-2004, 2005-2009, 2010-2011

SluggerML: Realtime Classification
• Given features, predict label probabilities
• nltk: NaiveBayesClassifier
• Web frontend
  • gunicorn, nginx
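
A web frontend of this kind could be as small as a single WSGI callable, which is the interface gunicorn serves. The sketch below is an assumption about the shape of such an app, not the deck's code: `classify` is a placeholder that returns uniform probabilities where a real app would query the trained classifier (for example via `prob_classify` on an nltk classifier), and features are read from the query string as strings.

```python
import json
from urllib.parse import parse_qsl

LABELS = ("HR", "K", "OTHER")


def classify(features):
    """Placeholder: a real app would call the trained classifier here."""
    return {label: 1 / len(LABELS) for label in LABELS}


def application(environ, start_response):
    """WSGI entry point: query-string parameters are treated as features."""
    features = dict(parse_qsl(environ.get("QUERY_STRING", "")))
    body = json.dumps(
        {"features": features, "probabilities": classify(features)}
    ).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

Saved as a hypothetical app.py, this could be started with `gunicorn app:application`, with nginx proxying requests to it.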

Tips and Tricks
• Persistent classifier internals
  • once trained, save and reuse
  • depends on implementation
    • string representation may exist
    • create your own
• Using generators where possible
  • avoid keeping data in memory
    • single-pass algorithms
    • conversion pass before training
• Multicore text processing
  • scrubbing: low memory footprint
  • multiprocessing module
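
Two of these tips, generators and persistence, can be illustrated together. The comma-separated line format and the helper names below are invented for the example, and `pickle` is just one way to create your own saved representation; the label counts stand in for whatever internal state a real classifier keeps.

```python
import io
import pickle
from collections import Counter


def parse_events(lines):
    """Yield (features, label) one line at a time instead of building a list."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        daynight, inning, label = line.split(",")
        yield {"game_daynight": daynight, "inning": int(inning)}, label


def count_labels(labeled_feature_sets):
    """Single pass over the stream; only the counts stay in memory."""
    counts = Counter()
    for _features, label in labeled_feature_sets:
        counts[label] += 1
    return counts


# A file object would work the same way as this in-memory stand-in.
log = io.StringIO("day,1,K\nnight,3,HR\nday,7,OTHER\nday,9,K\n")
counts = count_labels(parse_events(log))

# Persist the trained state once, then reload it instead of retraining.
blob = pickle.dumps(counts)
restored = pickle.loads(blob)
print(restored)
```

Pickled data should only be loaded from sources you trust, since unpickling can execute arbitrary code.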

The Fine Print™
• Plug-and-play is easy!
• Don’t blindly apply ML
  • understand your data
  • understand your algorithms
     • ml-class.org is an excellent resource

Thanks!
github.com/mattspitz/sluggerml
slideshare.net/mattspitz/practical-machine-learning-in-python


@mattspitz
Practical ML in Python: HR/K Prediction

  • 1. Practical Machine Learning in Python Matt Spitz via @mattspitz
  • 2. Practical Machine Learning in Python 2 This is the Age of Aquarius Data • Data is plentiful • application logs • external APIs • Facebook, Twitter • public datasets • Analysis adds value • understanding your users • dynamic application decisions • Storage / CPU time is cheap
  • 3. Practical Machine Learning in Python 3 Machine Learning in Python • Python is well-suited for data analysis • Versatile • quick and dirty scripts • full-featured, realtime applications • Mature ML packages • tons of choices (see: mloss.org) • plug-and-play or DIY
  • 4. Practical Machine Learning in Python 4 Classification Problem: Terminology • Data points • feature set: “interesting” facts about an event/thing • label: a description of that event/thing • Classification • training set: a bunch of labeled feature sets • given a training set, build a classifier to predict labels for unlabeled feature sets
  • 5. Practical Machine Learning in Python 5 SluggerML • Two questions • What features are strong predictors for home runs and strikeouts? • Given a particular situation, with what probability will the batter hit a home run or strike out? • Feature sets represent game state for a plate appearance • game: day vs. night, wind direction... • at-bat: inning, #strikes, left-right matchup... • batter/pitcher: age, weight, fielding position... • Labels represent outcome • HR (home run), K (strikeout), OTHER • Poor Man’s Sabermetrics
  • 6. Practical Machine Learning in Python 6 SluggerML: Example • Training set • {game_daynight: day, batter_age: 24, pitcher_weight: 211} • label: HR • {game_daynight: day, batter_age: 36, pitcher_weight: 242} • label: K • {game_daynight: night, batter_age: 27, pitcher_weight: 195} • label: OTHER • Classifier predictions • {game_daynight: night, batter_age: 36, pitcher_weight: 225} • 2.6% HR 15.6% K • {game_daynight: day, batter_age: 20, pitcher_weight: 216} • 2.2% HR 19.1% K
  • 7. Practical Machine Learning in Python 7 SluggerML: Gathering Data • Sources • Retrosheet • play-by-play logs for every game since 1956 • Sean Lahman’s Baseball Archive • detailed stats about individual players • Coalescing • 1st pass, Lahman: create player database • shelve module • 2nd pass, Retrosheet: track game state, join on player db • Scrubbing • ensure consistency
  • 8. Practical Machine Learning in Python 8 SluggerML: Gathering Data • Training set • regular-season games from 1980-2011 • 5,669,301 plate appearances • 135,602 home runs • 871,226 strikeouts
  • 9. Practical Machine Learning in Python 9 Selecting a Toolkit: Tradeoffs • Speed • offline vs. realtime • Transparency • internal visibility • customizability • Support • maturity • community
  • 10. Practical Machine Learning in Python 10 Selecting a Toolkit: High-Level Options • External bindings • python interfaces to popular packages • Matlab, R, Octave, SHOGUN Toolbox • transition legacy workflows • Python implementations • collections of algorithms • (mostly) python • external subcomponents • DIY • building blocks
  • 11. Practical Machine Learning in Python 11 Selecting a Toolkit: Python Implementations • nltk • focus on NLP • book: Natural Language Processing with Python (O’Reilly ‘09) • mlpy • regression, classification, clustering • PyML • focus on SVM • PyBrain • focus on neural networks
  • 12. Practical Machine Learning in Python 12 Selecting a Toolkit: Python Implementations • mdp-toolkit • data processing management • nodes represent tasks in a data workflow • scheduling, parallelization • scikit-learn • supervised, unsupervised, feature selection, visualization • heavy development, large team • excellent documentation • active community
  • 13. Practical Machine Learning in Python 13 Selecting a Toolkit: Do It Yourself • Basic building blocks • NumPy • SciPy • C/C++ implementations • LIBLINEAR • LIBSVM • OpenCV • ...your own?
  • 14. Practical Machine Learning in Python 14 SluggerML: Two Questions • What features are strong predictors for home runs and strikeouts? • Given a particular situation, with what probability will the batter hit a home run or strike out?
  • 15. Practical Machine Learning in Python 15 SluggerML: Feature Selection • Identifies predictive features • strongly correlated with labels • predictive: max_benchpress • not predictive: favorite_cookie • scikit-learn: chi-square feature selection • Visualizing significance • for each well-supported value, find correlation with HR/K • “well-supported”: >= 0.05% of samples with feature=value • correlation: ( P(HR | feature=value) / P(HR) ) - 1
  • 16. SluggerML: Feature Selection
      • [Chart] Batter: Home vs. Visiting
        • y-axis: Correlation (-50.0% to 50.0%); series: Home Run, Strikeout
        • x-axis: home team, visiting team
  • 17. SluggerML: Feature Selection
      • [Chart] Batter: Fielding Position
        • y-axis: Correlation (-50.0% to 50.0%); series: Home Run, Strikeout
        • x-axis: P, C, 1B, 2B, 3B, SS, LF, CF, RF, DH, PH
  • 18. SluggerML: Feature Selection
      • [Chart] Game: Temperature (˚F)
        • y-axis: Correlation (-50.0% to 50.0%); series: Home Run, Strikeout
        • x-axis: 5-degree bins from 35-39 to 100-104
  • 19. SluggerML: Feature Selection
      • [Chart] Game: Year
        • y-axis: Correlation (-50.0% to 50.0%); series: Home Run, Strikeout
        • x-axis: 1980-1984, 1985-1989, 1990-1994, 1995-1999, 2000-2004, 2005-2009, 2010-2011
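The measure defined on slide 15, ( P(HR | feature=value) / P(HR) ) - 1, is the relative change in a label's probability when a feature takes a given value. Below is a minimal sketch of how it could be computed for every well-supported value of one feature. The function name `value_correlations`, its parameters, and the toy data are assumptions made for illustration; they are not taken from the SluggerML source.

```python
from collections import Counter


def value_correlations(samples, feature, label, min_support=0.0005):
    """Return {value: P(label | feature=value) / P(label) - 1}.

    samples: list of (feature_dict, label) pairs.
    min_support: minimum fraction of all samples that must have
        feature=value (0.05% on the slide) for the value to be reported.
    """
    total = len(samples)
    # Baseline probability of the label across all samples.
    p_label = sum(1 for _, lab in samples if lab == label) / total
    if p_label == 0:
        return {}  # the ratio is undefined if the label never occurs

    value_counts = Counter()  # samples with feature=value
    hit_counts = Counter()    # samples with feature=value and this label
    for features, lab in samples:
        if feature not in features:
            continue
        value = features[feature]
        value_counts[value] += 1
        if lab == label:
            hit_counts[value] += 1

    result = {}
    for value, count in value_counts.items():
        if count / total < min_support:
            continue  # not well-supported: skip
        p_label_given_value = hit_counts[value] / count
        result[value] = p_label_given_value / p_label - 1
    return result


# Toy data, invented purely for illustration.
samples = (
    [({"game_daynight": "day"}, "HR")] * 3
    + [({"game_daynight": "day"}, "OTHER")] * 7
    + [({"game_daynight": "night"}, "HR")] * 1
    + [({"game_daynight": "night"}, "OTHER")] * 9
)
correlations = value_correlations(samples, "game_daynight", "HR")
```

A positive result means the label is more likely than baseline when the feature takes that value, and a negative result means it is less likely. The charts on slides 16 to 19 plot this kind of number per value.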
  • 20. SluggerML: Realtime Classification
      • Given features, predict label probabilities
        • nltk: NaiveBayesClassifier
      • Web frontend
        • gunicorn, nginx
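Slide 20 names nltk's NaiveBayesClassifier for turning a feature set into label probabilities. To show the calculation itself without depending on any toolkit, here is a hand-rolled categorical naive Bayes with add-one smoothing. This is a swapped-in sketch, not nltk's implementation and not the SluggerML code; the class name `TinyNaiveBayes` and its method are hypothetical, and numeric features such as `batter_age` are treated as plain categories.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Hypothetical minimal categorical naive Bayes (add-one smoothing)."""

    def __init__(self, labeled_featuresets):
        self.label_counts = Counter()
        # (label, feature name) -> Counter of values seen with that label
        self.value_counts = defaultdict(Counter)
        # feature name -> set of all values seen during training
        self.feature_values = defaultdict(set)
        for features, label in labeled_featuresets:
            self.label_counts[label] += 1
            for name, value in features.items():
                self.value_counts[(label, name)][value] += 1
                self.feature_values[name].add(value)

    def prob_classify(self, features):
        """Return {label: P(label | features)} under the naive assumption."""
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, count in self.label_counts.items():
            score = math.log(count / total)  # log prior
            for name, value in features.items():
                if name not in self.feature_values:
                    continue  # feature never seen in training: ignore it
                seen = self.value_counts[(label, name)]
                # Add-one smoothing; the extra slot reserves probability
                # mass for values that never appeared in training.
                numerator = seen[value] + 1
                denominator = (sum(seen.values())
                               + len(self.feature_values[name]) + 1)
                score += math.log(numerator / denominator)
            log_scores[label] = score
        # Normalize in log space to avoid underflow with many features.
        top = max(log_scores.values())
        scaled = {lab: math.exp(s - top) for lab, s in log_scores.items()}
        norm = sum(scaled.values())
        return {lab: s / norm for lab, s in scaled.items()}


# The three-play training set from the SluggerML example slide.
training_set = [
    ({"game_daynight": "day", "batter_age": 24, "pitcher_weight": 211}, "HR"),
    ({"game_daynight": "day", "batter_age": 36, "pitcher_weight": 242}, "K"),
    ({"game_daynight": "night", "batter_age": 27, "pitcher_weight": 195}, "OTHER"),
]
classifier = TinyNaiveBayes(training_set)
probabilities = classifier.prob_classify(
    {"game_daynight": "night", "batter_age": 36, "pitcher_weight": 225}
)
```

With only three training plays the numbers mean little; the point is that training is a single counting pass and prediction is a handful of lookups, which is what makes this family of classifiers a reasonable fit for a realtime frontend.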
  • 21. Tips and Tricks
      • Persistent classifier internals
        • once trained, save and reuse
        • depends on implementation
          • string representation may exist
          • create your own
      • Using generators where possible
        • avoid keeping data in memory
        • single-pass algorithms
        • conversion pass before training
      • Multicore text processing
        • scrubbing: low memory footprint
        • multiprocessing module
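The "once trained, save and reuse" tip can be sketched with the standard library's pickle module. Whether a given toolkit's classifier object pickles cleanly depends on its implementation, as the slide warns, so the dictionary below is only a stand-in for trained internals. The helper names and the file name are assumptions made for illustration.

```python
import os
import pickle
import tempfile


def save_classifier(classifier, path):
    """Serialize a trained classifier (or its internals) to disk."""
    with open(path, "wb") as f:
        pickle.dump(classifier, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_classifier(path):
    """Load a previously saved classifier. Only unpickle trusted files."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in for whatever the training step produced (hypothetical contents).
trained_internals = {
    "label_counts": {"HR": 1, "K": 1, "OTHER": 1},
    "feature_names": ["game_daynight", "batter_age", "pitcher_weight"],
}

model_path = os.path.join(tempfile.mkdtemp(), "classifier.pickle")
save_classifier(trained_internals, model_path)  # done once, after training
restored = load_classifier(model_path)          # done at web app startup
```

Loading at startup means the web process never pays the training cost. If the toolkit's objects do not pickle, the alternative from the slide is to extract the internals into plain data structures like the dictionary above and save those.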
  • 22. The Fine Print™
      • Plug-and-play is easy!
      • Don’t blindly apply ML
        • understand your data
        • understand your algorithms
      • ml-class.org is an excellent resource
  • 23. Thanks!
      • github.com/mattspitz/sluggerml
      • slideshare.net/mattspitz/practical-machine-learning-in-python
      • @mattspitz

Editor's Notes

  1. Data is everywhere: clickstream data, for one. Users are bad at managing FB permissions, so you can get a lot out of the Graph API. There’s value in learning about data: how people use your site, feature or advertisement personalization. One thing that enables this is that resources are cheap these days.
  2. Python is a fantastic programming environment for data processing and analytics. On one end of the spectrum, quick and dirty scripts; on the other, full-featured applications ready for deployment at scale. There is a wide variety of toolkits for off-the-shelf analysis or for building out your own data processing applications.
  3. For this talk, I’m discussing one flavor of analytics and machine learning: the classification problem. Intuition: the training set is what you know about the world; you train a classifier to predict the things that you don’t.
  4. As a concrete example, I started playing around with some baseball stats to illustrate how one might go about building ML applications in Python. Even if you’re not into baseball, you know that the iconic visions of success and failure are the home run and the strikeout. In all the movies, hitting a home run is equivalent to getting the girl, and striking out is seen as a major setback.
  5. As with any machine learning problem, you want to get your data into a classifier-consumable format. That is, labeled feature sets. For each play in the game, keep track of the game state and output a labeled feature bundle representing the situation and its outcome: HR, K, (other)
  6. Speed: offline means a deadline of hours or days; realtime means a user is waiting on the other side (user actions => milliseconds). Transparency: seeing what’s going on with an algorithm in case the docs aren’t clear; modifying or patching an algorithm to meet your needs. Support: maturity and active development. How strong is the community around the project? Are there tutorials available?
  7. Interface with external packages if you’ve done some analysis already and want to transition to Python without throwing away code. Python toolkits provide sets of algorithms, mostly as Python implementations; they often use external packages with C bindings, and some even use other toolkits. DIY: use the external packages yourself.
  8. To give a sampling of what’s available, I chose some toolkits that were last updated within a year. As a disclaimer: this is not exhaustive, just a sampling; some of these tools I’ve used, some I haven’t; and I’m sure I’ve missed your favorite, and for that I apologize. Different packages focus on different things, so one isn’t necessarily going to suit all of your needs.
  9. There was buzz around scikit-learn last year; I checked it out recently and it’s been built out a lot.
  10. NumPy: fast and efficient arrays. SciPy: scientific tools and algorithms built on NumPy. You can also use popular C/C++ implementations through Python bindings. Python is a modular language, so you can always sub out your implementation without disrupting your workflow too much. Now, as an example of applying these toolkits...
  11. First question: speed isn’t critical. Second question: speed is critical (imagine that you’re a coach); baseball is slow, but it’s not THAT slow.
  12. Identifies predictive features: certain values are strongly correlated with certain labels. sklearn: I wasn’t clear on the documented usage, so I looked at the code.
  13. for a coach
  14. Don’t we need to train our classifier to run our web application? Save the trained classifier on disk! Pickle it, or pull out a textual representation (another argument for using a package that allows you to do this). Why compute things twice? Use generators: with lots and lots of data, avoid keeping it all in memory. Single-pass algorithm (Bayes). First-pass conversion to compact data (NumPy vectors, not Python objects). Not always possible, but keep it in mind. Take advantage of multiple cores: if your processing step has a minimal memory footprint (just one line at a time), do it on multiple cores. Run multiple processes on different input files, or use the multiprocessing module, which is great at this.
  15. You don't need to know everything about the algorithms you use, but you can't just blindly apply these things and hope that they magically work. ml-class.org: free class, provides an excellent foundation and starting point for understanding ML. In no time, you, too, can be a number muncher.
  16. Source code for SluggerML is on GitHub; it’s kind of a mess, and I’m sorry about that. And I’m @mattspitz on the twitters.
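Note 14's advice to use generators and single-pass algorithms can be sketched as follows. The CSV layout, the column names, and the function names are assumptions made for illustration; the real SluggerML input format is not described in the slides. Each play is parsed and handed on one at a time, so the full dataset never sits in memory.

```python
import csv
import io
from collections import Counter


def read_plays(lines):
    """Yield (features, label) one play at a time from CSV lines."""
    for row in csv.DictReader(lines):
        label = row.pop("label")
        yield row, label


def count_labels(labeled_featuresets):
    """Single pass over the stream: tally labels without storing plays."""
    counts = Counter()
    for _, label in labeled_featuresets:
        counts[label] += 1
    return counts


# An in-memory stand-in for a large file; an open file object works the same.
raw = io.StringIO(
    "game_daynight,batter_age,label\n"
    "day,24,HR\n"
    "day,36,K\n"
    "night,27,OTHER\n"
    "night,31,K\n"
)
label_counts = count_labels(read_plays(raw))
```

The same pattern extends to the counting that naive Bayes training needs, since every count can be updated as each play streams past. Algorithms that need several passes over the data are where the note's "conversion pass" to a compact representation comes in.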