SlideShare a Scribd company logo
1 of 23
Revolution Confidential




 Introduc tion to R for
  Data Mining

2012 S pring Webinar S eries



J os eph B . R ic kert,
R evolution A nalytic s
J une 5, 2012


                                               1
G oals for Today’s Webinar                                  Revolution Confidential




                    To convince you that:


                                                 Seriously, it is
                                                 not difficult to
           R                                    learn enough R
     is a serious                                  to do some
     platform for                                 serious data
     data mining                                      mining


                             Revolution R
                             Enterprise is
                          is the platform for
                                serious
                              data mining


                                                                              2
Data Mining   Applications        Actions                Revolution Confidential
                                                  Algorithms


                 Credit Scoring    Acquire Data         CART




                Fraud Detection      Prepare        Random Forests




                Ad Optimization      Classify            SVM




                   Targeted
                                     Predict           KMeans
                   Marketing




                                                      Hierarchical
                Gene Detection      Visualize
                                                       clustering




                Recommendation                         Ensemble
                                    Optimize
                    systems                           Techniques




                Social Networks      Interpret



                                                                           3
R ec ent K DD Nuggets P oll s ugges ts s o are a lot
of other s erious data miners                 Revolution Confidential


          What Analytics, Data mining, Big Data software you used in the past 12
          months for a real project (not just evaluation) [798 voters]


          Software                            % users in 2012                      % users in 2011


          R (245)                             30.7%                                23.3%
          Excel (238)                         29.8%                                21.8%

          Rapid-I RapidMiner (213)            26.7%                                27.7%
          KNIME (174)                         21.8%                                12.1%

          Weka / Pentaho (118)                14.8%                                11.8%

          StatSoft Statistica (112)           14.0%                                8.5%

          SAS (101)                           12.7%                                13.6%

          Rapid-I RapidAnalytics (83)         10.4%                                Not asked in 2011

          MATLAB (80)                         10.0%                                7.2%

          IBM SPSS Statistics (62)            7.8%                                 7.2%

          IBM SPSS Modeler (54)               6.8%                                 8.3%

          SAS Enterprise Miner (46)           5.8%                                 7.1%


                                                                                                       4
Revolution Confidential




Learning R

WHAT DOE S IT ME A N TO
LE AR N R ?

                                            5
What does it mean to learn F renc h? Revolution Confidential




  To get around Paris on the Metro

          To read a Menu
     To carry on a conversation




                                                       6
L earning R                                                                            Revolution Confidential




Levels of R Skill

 Write production level code                                             R developer


    Write an R package                                   R contributor


       Write functions                    R programmer



      Use R Functions                R user


         Use a GUI              R aware


                                   10                                                        10,000
                                                         Hours of use


                               The Malcolm Gladwell “Outlier” Scale



                                                                                                         7
Revolution Confidential




Productive from the Get go!

T HE S T R UC T UR E OF R
FA C IL ITAT E S L E A R NING

                                                  8
R is s et up to c ompute func tions on data
                                          Revolution Confidential




                           lm.model
 lm <-   function(x,y)         lm.model$assign
         {                     lm.model$coefficients
         . . .                 lm.model$df.residual
         }                     lm.model$effects
                               lm.model$fitted.values
                                  .
                                  .
                                  .




                                                            9
A little knowledge goes a long way in R          Revolution Confidential



  R’s functional design facilitates
   performing small tasks
  For the most part, the output of a   The trick is
                                        knowing which
   function depends only on the         functions to
   values of its arguments              call
  calling a function multiple times
   with the same values of its
   arguments will produce the same
   result each time
  Minimal side effects means it is
   much easier to understand and
   predict the behavior of a program


                                                                  10
B as ic Mac hine L earning F unc tions                              Revolution Confidential



              Function       Library        Description
Cluster       hclust         stats          Hierarchical cluster analysis
              kmeans         stats          Kmeans clustering
Classifiers   glm            stats          Logistic Regression
              rpart          rpart          Recursive partitioning and
                                            regression trees
              ksvm           kernlab        Support Vector Machine
Ensemble      ada            ada            Stochastic boosting
              randomForest   randomForest   Random Forests classification and
                                            regression




                                                                                     11
Noteworthy Data Mining P ac kages                            Revolution Confidential




     Package   Comment
     rattle    A very intuitive GUI for data mining that
               produces useful R code
     caret     Well organized and remarkably complete
               collection of functions to facilitate model
               building for regression and classification
               problems




                                                                              12
Revolution Confidential




Doing a lot with a little R

T IME TO R UN S OME C ODE


                                               13
S c ripts to run                                          Revolution Confidential




        Script                      Some key Functions
    0   Setup                       Load libraries
    1   Explore weather data        Read.csv, plot
    2   Run clustering algorithms   kmeans, hclust
    3   Basic decision tree         rpart
    4   Boosted Tree                ada
    5   Random Forest               randomForest
    6   Support Vector Machine      randomForest, varImpPlot
    7   Big Data Mortgage Default   rxLogit, rxKmeans
        model




                                                                           14
B ig Data and R                         Revolution Confidential




There are some challenges:
 All of your data and model code must fit into
  memory
 Big data sets as well as big models (lots of
  variables) can run out of memory
 Parallel computation might be necessary for
  models to run in a reasonable time



                                                         15
R evoS c aleR in R evolution R E nterpris e      Revolution Confidential




Can help in a number of ways:
 Manipulate large data sets, and perhaps
  aggregating data so that it will fit in memory
   For example, boiling down time-stamped data
    like a web log to form a time series that will fit in
    memory
 Run RevoScaleR Functions directly on big
  data sets
 Run R functions in parallel
                                                                  16
Top R evoS c aleR F unc tions for Data Mining
parallel external memory algorithms                         Revolution Confidential




      Task                        RevoScaleR function
      Data processing             rxDataStep
      Descriptive Statistics      rxSumary
      Tables and cubes            rxCube, rxCrosstabs
      Correlations / covariance   rxCovCor, rxCor, rxCov,
                                  rxSSCP
      Linear Models               rxLinMod
      Logistic regressions        rxLogit
      Generalized linear models   rxGlm
      K means clustering          rxKmeans
      Predictions (scoring)       rxPredict




                                                                             17
Revolution Confidential




More than code, R is a community

WHE R E TO G O F R OM HE R E ?


                                                    18
F inding your way around the R world         Revolution Confidential




   Machine Learning
   Data Mining
   Visualization
   Finding Packages
        Task Views
        crantastic.org
   Blogs
        Revolutions
        R-Bloggers
        Quick-R
   Getting Help
        StackOverflow
        @RLangTip
        Inside-R
        www.rseek.org
   Finding R People
        User Groups worldwide
        #rstats

                                 Word Cloud for @inside_R




                                                                       19
L ook at s ome more s ophis tic ated examples     Revolution Confidential




 Thomson Nguyen on the Heritage Health Prize
 Shannon Terry & Ben Ogorek (Nationwide Insurance):
  A Direct Marketing In-Flight Forecasting System
 Jeffrey Breen:
  Mining Twitter for Airline Consumer Sentiment
 Joe Rothermich: Alternative Data Sources for Measuring
  Market Sentiment and Events (Using R)




                                                                   20
R evolution A nalytic s Training          Revolution Confidential




    http://www.revolutionanalytics.com/
    products/training/




                                                           21
R eferenc es   Revolution Confidential




                                22
Revolution Confidential




                          Revolution Confidential




                                           23

More Related Content

What's hot

Introduction to Deep Learning and AI at Scale for Managers
Introduction to Deep Learning and AI at Scale for ManagersIntroduction to Deep Learning and AI at Scale for Managers
Introduction to Deep Learning and AI at Scale for ManagersDataWorks Summit
 
SuanIct-Bigdata desktop-final
SuanIct-Bigdata desktop-finalSuanIct-Bigdata desktop-final
SuanIct-Bigdata desktop-finalstelligence
 
Data Wrangling and the Art of Big Data Discovery
Data Wrangling and the Art of Big Data DiscoveryData Wrangling and the Art of Big Data Discovery
Data Wrangling and the Art of Big Data DiscoveryInside Analysis
 
Hadoop, Big Data, and the Future of the Enterprise Data Warehouse
Hadoop, Big Data, and the Future of the Enterprise Data WarehouseHadoop, Big Data, and the Future of the Enterprise Data Warehouse
Hadoop, Big Data, and the Future of the Enterprise Data Warehousetervela
 
Leveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science ToolsLeveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science ToolsDomino Data Lab
 
Challenging Problems for Scalable Mining of Heterogeneous Social and Informat...
Challenging Problems for Scalable Mining of Heterogeneous Social and Informat...Challenging Problems for Scalable Mining of Heterogeneous Social and Informat...
Challenging Problems for Scalable Mining of Heterogeneous Social and Informat...BigMine
 
Full-Stack Data Science: How to be a One-person Data Team
Full-Stack Data Science: How to be a One-person Data TeamFull-Stack Data Science: How to be a One-person Data Team
Full-Stack Data Science: How to be a One-person Data TeamGreg Goltsov
 
Using AI to Solve Data and IT Complexity -- And Better Enable AI
Using AI to Solve Data and IT Complexity -- And Better Enable AIUsing AI to Solve Data and IT Complexity -- And Better Enable AI
Using AI to Solve Data and IT Complexity -- And Better Enable AIDana Gardner
 
JIMS Rohini IT Flash Monthly Newsletter - October Issue
JIMS Rohini IT Flash Monthly Newsletter  - October IssueJIMS Rohini IT Flash Monthly Newsletter  - October Issue
JIMS Rohini IT Flash Monthly Newsletter - October IssueJIMS Rohini Sector 5
 
Big data analytics 1
Big data analytics 1Big data analytics 1
Big data analytics 1gauravsc36
 
Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Caserta
 
The Role of Data Wrangling in Driving Hadoop Adoption
The Role of Data Wrangling in Driving Hadoop AdoptionThe Role of Data Wrangling in Driving Hadoop Adoption
The Role of Data Wrangling in Driving Hadoop AdoptionInside Analysis
 
The Importance of Open Innovation in AI era
The Importance of Open Innovation in AI eraThe Importance of Open Innovation in AI era
The Importance of Open Innovation in AI eraJongwook Woo
 
Introduction of Data Science
Introduction of Data ScienceIntroduction of Data Science
Introduction of Data ScienceJason Geng
 
Introduction to Big Data and AI for Business Analytics and Prediction
Introduction to Big Data and AI for Business Analytics and PredictionIntroduction to Big Data and AI for Business Analytics and Prediction
Introduction to Big Data and AI for Business Analytics and PredictionJongwook Woo
 

What's hot (20)

Introduction to Deep Learning and AI at Scale for Managers
Introduction to Deep Learning and AI at Scale for ManagersIntroduction to Deep Learning and AI at Scale for Managers
Introduction to Deep Learning and AI at Scale for Managers
 
SuanIct-Bigdata desktop-final
SuanIct-Bigdata desktop-finalSuanIct-Bigdata desktop-final
SuanIct-Bigdata desktop-final
 
#BigDataCanarias: "Big Data & Career Paths"
#BigDataCanarias: "Big Data & Career Paths"#BigDataCanarias: "Big Data & Career Paths"
#BigDataCanarias: "Big Data & Career Paths"
 
Data Wrangling and the Art of Big Data Discovery
Data Wrangling and the Art of Big Data DiscoveryData Wrangling and the Art of Big Data Discovery
Data Wrangling and the Art of Big Data Discovery
 
Open problems big_data_19_feb_2015_ver_0.1
Open problems big_data_19_feb_2015_ver_0.1Open problems big_data_19_feb_2015_ver_0.1
Open problems big_data_19_feb_2015_ver_0.1
 
Hadoop, Big Data, and the Future of the Enterprise Data Warehouse
Hadoop, Big Data, and the Future of the Enterprise Data WarehouseHadoop, Big Data, and the Future of the Enterprise Data Warehouse
Hadoop, Big Data, and the Future of the Enterprise Data Warehouse
 
Big data 101
Big data 101Big data 101
Big data 101
 
Leveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science ToolsLeveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science Tools
 
Challenging Problems for Scalable Mining of Heterogeneous Social and Informat...
Challenging Problems for Scalable Mining of Heterogeneous Social and Informat...Challenging Problems for Scalable Mining of Heterogeneous Social and Informat...
Challenging Problems for Scalable Mining of Heterogeneous Social and Informat...
 
Full-Stack Data Science: How to be a One-person Data Team
Full-Stack Data Science: How to be a One-person Data TeamFull-Stack Data Science: How to be a One-person Data Team
Full-Stack Data Science: How to be a One-person Data Team
 
Using AI to Solve Data and IT Complexity -- And Better Enable AI
Using AI to Solve Data and IT Complexity -- And Better Enable AIUsing AI to Solve Data and IT Complexity -- And Better Enable AI
Using AI to Solve Data and IT Complexity -- And Better Enable AI
 
JIMS Rohini IT Flash Monthly Newsletter - October Issue
JIMS Rohini IT Flash Monthly Newsletter  - October IssueJIMS Rohini IT Flash Monthly Newsletter  - October Issue
JIMS Rohini IT Flash Monthly Newsletter - October Issue
 
Big data analytics 1
Big data analytics 1Big data analytics 1
Big data analytics 1
 
Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)
 
The Role of Data Wrangling in Driving Hadoop Adoption
The Role of Data Wrangling in Driving Hadoop AdoptionThe Role of Data Wrangling in Driving Hadoop Adoption
The Role of Data Wrangling in Driving Hadoop Adoption
 
AI on Big Data
AI on Big DataAI on Big Data
AI on Big Data
 
The Importance of Open Innovation in AI era
The Importance of Open Innovation in AI eraThe Importance of Open Innovation in AI era
The Importance of Open Innovation in AI era
 
Big Data: Issues and Challenges
Big Data: Issues and ChallengesBig Data: Issues and Challenges
Big Data: Issues and Challenges
 
Introduction of Data Science
Introduction of Data ScienceIntroduction of Data Science
Introduction of Data Science
 
Introduction to Big Data and AI for Business Analytics and Prediction
Introduction to Big Data and AI for Business Analytics and PredictionIntroduction to Big Data and AI for Business Analytics and Prediction
Introduction to Big Data and AI for Business Analytics and Prediction
 

Similar to Introduction to R for Data Mining

100% R and More: Plus What's New in Revolution R Enterprise 6.0
100% R and More: Plus What's New in Revolution R Enterprise 6.0100% R and More: Plus What's New in Revolution R Enterprise 6.0
100% R and More: Plus What's New in Revolution R Enterprise 6.0Revolution Analytics
 
Integrate Your Advanced Analytics into BI Apps and MS Office and Multiply The...
Integrate Your Advanced Analytics into BI Apps and MS Office and Multiply The...Integrate Your Advanced Analytics into BI Apps and MS Office and Multiply The...
Integrate Your Advanced Analytics into BI Apps and MS Office and Multiply The...Revolution Analytics
 
Real-time Big Data Analytics: From Deployment to Production
Real-time Big Data Analytics: From Deployment to ProductionReal-time Big Data Analytics: From Deployment to Production
Real-time Big Data Analytics: From Deployment to ProductionRevolution Analytics
 
Turbo-Charge Your Analytics with IBM Netezza and Revolution R Enterprise: A S...
Turbo-Charge Your Analytics with IBM Netezza and Revolution R Enterprise: A S...Turbo-Charge Your Analytics with IBM Netezza and Revolution R Enterprise: A S...
Turbo-Charge Your Analytics with IBM Netezza and Revolution R Enterprise: A S...Revolution Analytics
 
Nagios Conference 2012 - Dave Josephsen - 2002 called they want there rrd she...
Nagios Conference 2012 - Dave Josephsen - 2002 called they want there rrd she...Nagios Conference 2012 - Dave Josephsen - 2002 called they want there rrd she...
Nagios Conference 2012 - Dave Josephsen - 2002 called they want there rrd she...Nagios
 
Revolution R Enterprise: 100% R and More (14 Mar 2013)
Revolution R Enterprise: 100% R and More (14 Mar 2013)Revolution R Enterprise: 100% R and More (14 Mar 2013)
Revolution R Enterprise: 100% R and More (14 Mar 2013)Revolution Analytics
 
Turbo charge-your-analytics-with-ibm-netezza-and-revolution-r-enterprise-pres...
Turbo charge-your-analytics-with-ibm-netezza-and-revolution-r-enterprise-pres...Turbo charge-your-analytics-with-ibm-netezza-and-revolution-r-enterprise-pres...
Turbo charge-your-analytics-with-ibm-netezza-and-revolution-r-enterprise-pres...Massimo Gaetano Panunzio
 
Software Development Engineers Ireland
Software Development Engineers IrelandSoftware Development Engineers Ireland
Software Development Engineers IrelandSean O'Sullivan
 
Revolution R Enterprise - 100% R and More
Revolution R Enterprise - 100% R and MoreRevolution R Enterprise - 100% R and More
Revolution R Enterprise - 100% R and MoreRevolution Analytics
 
Teaching Elephants to Dance (and Fly!) A Developer's Journey to Digital Trans...
Teaching Elephants to Dance (and Fly!) A Developer's Journey to Digital Trans...Teaching Elephants to Dance (and Fly!) A Developer's Journey to Digital Trans...
Teaching Elephants to Dance (and Fly!) A Developer's Journey to Digital Trans...Burr Sutter
 
Farklı Ortamlarda Büyük Veri Kavramı -Big Data by Sybase
Farklı Ortamlarda Büyük Veri Kavramı -Big Data by Sybase Farklı Ortamlarda Büyük Veri Kavramı -Big Data by Sybase
Farklı Ortamlarda Büyük Veri Kavramı -Big Data by Sybase Sybase Türkiye
 
Big Data Analytics in a Heterogeneous World - Joydeep Das of Sybase
Big Data Analytics in a Heterogeneous World - Joydeep Das of SybaseBig Data Analytics in a Heterogeneous World - Joydeep Das of Sybase
Big Data Analytics in a Heterogeneous World - Joydeep Das of SybaseBigDataCloud
 
Dynatrace: Davis - Hololens - AI update - Cloud announcements - Self driving IT
Dynatrace: Davis - Hololens - AI update - Cloud announcements - Self driving ITDynatrace: Davis - Hololens - AI update - Cloud announcements - Self driving IT
Dynatrace: Davis - Hololens - AI update - Cloud announcements - Self driving ITDynatrace
 
AI in the Financial Services Industry
AI in the Financial Services IndustryAI in the Financial Services Industry
AI in the Financial Services IndustryAlison B. Lowndes
 

Similar to Introduction to R for Data Mining (20)

100% R and More: Plus What's New in Revolution R Enterprise 6.0
100% R and More: Plus What's New in Revolution R Enterprise 6.0100% R and More: Plus What's New in Revolution R Enterprise 6.0
100% R and More: Plus What's New in Revolution R Enterprise 6.0
 
Integrate Your Advanced Analytics into BI Apps and MS Office and Multiply The...
Integrate Your Advanced Analytics into BI Apps and MS Office and Multiply The...Integrate Your Advanced Analytics into BI Apps and MS Office and Multiply The...
Integrate Your Advanced Analytics into BI Apps and MS Office and Multiply The...
 
Big Data Analysis Starts with R
Big Data Analysis Starts with RBig Data Analysis Starts with R
Big Data Analysis Starts with R
 
Real-time Big Data Analytics: From Deployment to Production
Real-time Big Data Analytics: From Deployment to ProductionReal-time Big Data Analytics: From Deployment to Production
Real-time Big Data Analytics: From Deployment to Production
 
Turbo-Charge Your Analytics with IBM Netezza and Revolution R Enterprise: A S...
Turbo-Charge Your Analytics with IBM Netezza and Revolution R Enterprise: A S...Turbo-Charge Your Analytics with IBM Netezza and Revolution R Enterprise: A S...
Turbo-Charge Your Analytics with IBM Netezza and Revolution R Enterprise: A S...
 
Revolution R - 100% R and More
Revolution R - 100% R and MoreRevolution R - 100% R and More
Revolution R - 100% R and More
 
Using R with Hadoop
Using R with HadoopUsing R with Hadoop
Using R with Hadoop
 
Nagios Conference 2012 - Dave Josephsen - 2002 called they want there rrd she...
Nagios Conference 2012 - Dave Josephsen - 2002 called they want there rrd she...Nagios Conference 2012 - Dave Josephsen - 2002 called they want there rrd she...
Nagios Conference 2012 - Dave Josephsen - 2002 called they want there rrd she...
 
Revolution R Enterprise: 100% R and More (14 Mar 2013)
Revolution R Enterprise: 100% R and More (14 Mar 2013)Revolution R Enterprise: 100% R and More (14 Mar 2013)
Revolution R Enterprise: 100% R and More (14 Mar 2013)
 
Turbo charge-your-analytics-with-ibm-netezza-and-revolution-r-enterprise-pres...
Turbo charge-your-analytics-with-ibm-netezza-and-revolution-r-enterprise-pres...Turbo charge-your-analytics-with-ibm-netezza-and-revolution-r-enterprise-pres...
Turbo charge-your-analytics-with-ibm-netezza-and-revolution-r-enterprise-pres...
 
Software Development Engineers Ireland
Software Development Engineers IrelandSoftware Development Engineers Ireland
Software Development Engineers Ireland
 
Revolution R Enterprise - 100% R and More
Revolution R Enterprise - 100% R and MoreRevolution R Enterprise - 100% R and More
Revolution R Enterprise - 100% R and More
 
Teaching Elephants to Dance (and Fly!) A Developer's Journey to Digital Trans...
Teaching Elephants to Dance (and Fly!) A Developer's Journey to Digital Trans...Teaching Elephants to Dance (and Fly!) A Developer's Journey to Digital Trans...
Teaching Elephants to Dance (and Fly!) A Developer's Journey to Digital Trans...
 
Infochimps + CloudCon: Infinite Monkey Theorem
Infochimps + CloudCon: Infinite Monkey TheoremInfochimps + CloudCon: Infinite Monkey Theorem
Infochimps + CloudCon: Infinite Monkey Theorem
 
Farklı Ortamlarda Büyük Veri Kavramı -Big Data by Sybase
Farklı Ortamlarda Büyük Veri Kavramı -Big Data by Sybase Farklı Ortamlarda Büyük Veri Kavramı -Big Data by Sybase
Farklı Ortamlarda Büyük Veri Kavramı -Big Data by Sybase
 
Big Data Analytics in a Heterogeneous World - Joydeep Das of Sybase
Big Data Analytics in a Heterogeneous World - Joydeep Das of SybaseBig Data Analytics in a Heterogeneous World - Joydeep Das of Sybase
Big Data Analytics in a Heterogeneous World - Joydeep Das of Sybase
 
Big Data & The Cloud
Big Data & The CloudBig Data & The Cloud
Big Data & The Cloud
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Dynatrace: Davis - Hololens - AI update - Cloud announcements - Self driving IT
Dynatrace: Davis - Hololens - AI update - Cloud announcements - Self driving ITDynatrace: Davis - Hololens - AI update - Cloud announcements - Self driving IT
Dynatrace: Davis - Hololens - AI update - Cloud announcements - Self driving IT
 
AI in the Financial Services Industry
AI in the Financial Services IndustryAI in the Financial Services Industry
AI in the Financial Services Industry
 

More from Revolution Analytics

Speeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the CloudSpeeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the CloudRevolution Analytics
 
Migrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to AzureMigrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to AzureRevolution Analytics
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudRevolution Analytics
 
Predicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per SecondPredicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per SecondRevolution Analytics
 
The Value of Open Source Communities
The Value of Open Source CommunitiesThe Value of Open Source Communities
The Value of Open Source CommunitiesRevolution Analytics
 
Building a scalable data science platform with R
Building a scalable data science platform with RBuilding a scalable data science platform with R
Building a scalable data science platform with RRevolution Analytics
 
The Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data ScienceThe Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data ScienceRevolution Analytics
 
Taking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudTaking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudRevolution Analytics
 
The Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductorThe Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductorRevolution Analytics
 
The network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalThe network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalRevolution Analytics
 
Simple Reproducibility with the checkpoint package
Simple Reproducibilitywith the checkpoint packageSimple Reproducibilitywith the checkpoint package
Simple Reproducibility with the checkpoint packageRevolution Analytics
 

More from Revolution Analytics (20)

Speeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the CloudSpeeding up R with Parallel Programming in the Cloud
Speeding up R with Parallel Programming in the Cloud
 
Migrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to AzureMigrating Existing Open Source Machine Learning to Azure
Migrating Existing Open Source Machine Learning to Azure
 
R in Minecraft
R in Minecraft R in Minecraft
R in Minecraft
 
The case for R for AI developers
The case for R for AI developersThe case for R for AI developers
The case for R for AI developers
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the Cloud
 
The R Ecosystem
The R EcosystemThe R Ecosystem
The R Ecosystem
 
R Then and Now
R Then and NowR Then and Now
R Then and Now
 
Predicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per SecondPredicting Loan Delinquency at One Million Transactions per Second
Predicting Loan Delinquency at One Million Transactions per Second
 
Reproducible Data Science with R
Reproducible Data Science with RReproducible Data Science with R
Reproducible Data Science with R
 
The Value of Open Source Communities
The Value of Open Source CommunitiesThe Value of Open Source Communities
The Value of Open Source Communities
 
The R Ecosystem
The R EcosystemThe R Ecosystem
The R Ecosystem
 
R at Microsoft (useR! 2016)
R at Microsoft (useR! 2016)R at Microsoft (useR! 2016)
R at Microsoft (useR! 2016)
 
Building a scalable data science platform with R
Building a scalable data science platform with RBuilding a scalable data science platform with R
Building a scalable data science platform with R
 
R at Microsoft
R at MicrosoftR at Microsoft
R at Microsoft
 
The Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data ScienceThe Business Economics and Opportunity of Open Source Data Science
The Business Economics and Opportunity of Open Source Data Science
 
Taking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the CloudTaking R Analytics to SQL and the Cloud
Taking R Analytics to SQL and the Cloud
 
The Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductorThe Network structure of R packages on CRAN & BioConductor
The Network structure of R packages on CRAN & BioConductor
 
The network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 finalThe network structure of cran 2015 07-02 final
The network structure of cran 2015 07-02 final
 
Simple Reproducibility with the checkpoint package
Simple Reproducibilitywith the checkpoint packageSimple Reproducibilitywith the checkpoint package
Simple Reproducibility with the checkpoint package
 
R at Microsoft
R at MicrosoftR at Microsoft
R at Microsoft
 

Recently uploaded

The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...itnewsafrica
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Kaya Weers
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Nikki Chapple
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxfnnc6jmgwh
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 

Recently uploaded (20)

The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 

Introduction to R for Data Mining

  • 1. Revolution Confidential Introduc tion to R for Data Mining 2012 S pring Webinar S eries J os eph B . R ic kert, R evolution A nalytic s J une 5, 2012 1
  • 2. G oals for Today’s Webinar Revolution Confidential To convince you that: Seriously, it is not difficult to R learn enough R is a serious to do some platform for serious data data mining mining Revolution R Enterprise is is the platform for serious data mining 2
  • 3. Data Mining Applications Actions Revolution Confidential Algorithms Credit Scoring Acquire Data CART Fraud Detection Prepare Random Forests Ad Optimization Classify SVM Targeted Predict KMeans Marketing Hierarchical Gene Detection Visualize clustering Recommendation Ensemble Optimize systems Techniques Social Networks Interpret 3
  • 4. R ec ent K DD Nuggets P oll s ugges ts s o are a lot of other s erious data miners Revolution Confidential What Analytics, Data mining, Big Data software you used in the past 12 months for a real project (not just evaluation) [798 voters] Software % users in 2012 % users in 2011 R (245) 30.7% 23.3% Excel (238) 29.8% 21.8% Rapid-I RapidMiner (213) 26.7% 27.7% KNIME (174) 21.8% 12.1% Weka / Pentaho (118) 14.8% 11.8% StatSoft Statistica (112) 14.0% 8.5% SAS (101) 12.7% 13.6% Rapid-I RapidAnalytics (83) 10.4% Not asked in 2011 MATLAB (80) 10.0% 7.2% IBM SPSS Statistics (62) 7.8% 7.2% IBM SPSS Modeler (54) 6.8% 8.3% SAS Enterprise Miner (46) 5.8% 7.1% 4
  • 5. Revolution Confidential Learning R WHAT DOE S IT ME A N TO LE AR N R ? 5
  • 6. What does it mean to learn F renc h? Revolution Confidential To get around Paris on the Metro To read a Menu To carry on a conversation 6
  • 7. L earning R Revolution Confidential Levels of R Skill Write production level code R developer Write an R package R contributor Write functions R programmer Use R Functions R user Use a GUI R aware 10 10,000 Hours of use The Malcolm Gladwell “Outlier” Scale 7
  • 8. Revolution Confidential Productive from the Get go! T HE S T R UC T UR E OF R FA C IL ITAT E S L E A R NING 8
  • 9. R is s et up to c ompute func tions on data Revolution Confidential lm.model lm <- function(x,y) lm.model$assign { lm.model$coefficients . . . lm.model$df.residual } lm.model$effects lm.model$fitted.values . . . 9
  • 10. A little knowledge goes a long way in R Revolution Confidential  R’s functional design facilitates performing small tasks  For the most part, the output of a The trick is knowing which function depends only on the functions to values of its arguments call  calling a function multiple times with the same values of its arguments will produce the same result each time  Minimal side effects means it is much easier to understand and predict the behavior of a program 10
  • 11. B as ic Mac hine L earning F unc tions Revolution Confidential Function Library Description Cluster hclust stats Hierarchical cluster analysis kmeans stats Kmeans clustering Classifiers glm stats Logistic Regression rpart rpart Recursive partitioning and regression trees ksvm kernlab Support Vector Machine Ensemble ada ada Stochastic boosting randomForest randomForest Random Forests classification and regression 11
  • 12. Noteworthy Data Mining P ac kages Revolution Confidential Package Comment rattle A very intuitive GUI for data mining that produces useful R code caret Well organized and remarkably complete collection of functions to facilitate model building for regression and classification problems 12
  • 13. Revolution Confidential Doing a lot with a little R T IME TO R UN S OME C ODE 13
  • 14. S c ripts to run Revolution Confidential Script Some key Functions 0 Setup Load libraries 1 Explore weather data Read.csv, plot 2 Run clustering algorithms kmeans, hclust 3 Basic decision tree rpart 4 Boosted Tree ada 5 Random Forest randomForest 6 Support Vector Machine randomForest, varImpPlot 7 Big Data Mortgage Default rxLogit, rxKmeans model 14
  • 15. B ig Data and R Revolution Confidential There are some challenges:  All of your data and model code must fit into memory  Big data sets as well as big models (lots of variables) can run out of memory  Parallel computation might be necessary for models to run in a reasonable time 15
  • 16. R evoS c aleR in R evolution R E nterpris e Revolution Confidential Can help in a number of ways:  Manipulate large data sets, and perhaps aggregating data so that it will fit in memory  For example, boiling down time-stamped data like a web log to form a time series that will fit in memory  Run RevoScaleR Functions directly on big data sets  Run R functions in parallel 16
  • 17. Top R evoS c aleR F unc tions for Data Mining parallel external memory algorithms Revolution Confidential Task RevoScaleR function Data processing rxDataStep Descriptive Statistics rxSumary Tables and cubes rxCube, rxCrosstabs Correlations / covariance rxCovCor, rxCor, rxCov, rxSSCP Linear Models rxLinMod Logistic regressions rxLogit Generalized linear models rxGlm K means clustering rxKmeans Predictions (scoring) rxPredict 17
  • 18. Revolution Confidential More than code, R is a community WHE R E TO G O F R OM HE R E ? 18
  • 19. F inding your way around the R world Revolution Confidential  Machine Learning  Data Mining  Visualization  Finding Packages  Task Views  crantastic.org  Blogs  Revolutions  R-Bloggers  Quick-R  Getting Help  StackOverflow  @RLangTip  Inside-R  www.rseek.org  Finding R People  User Groups worldwide  #rstats Word Cloud for @inside_R 19
  • 20. L ook at s ome more s ophis tic ated examples Revolution Confidential  Thomson Nguyen on the Heritage Health Prize  Shannon Terry & Ben Ogorek (Nationwide Insurance): A Direct Marketing In-Flight Forecasting System  Jeffrey Breen: Mining Twitter for Airline Consumer Sentiment  Joe Rothermich: Alternative Data Sources for Measuring Market Sentiment and Events (Using R) 20
  • 21. R evolution A nalytic s Training Revolution Confidential http://www.revolutionanalytics.com/ products/training/ 21
  • 22. R eferenc es Revolution Confidential 22
  • 23. Revolution Confidential Revolution Confidential 23