We at Revolution Analytics are often asked, “What is the best way to learn R?” While acknowledging that there may be as many effective learning styles as there are people, we have identified three factors that greatly facilitate learning R. For a quick start:
- Find a way of orienting yourself in the open source R world
- Have a definite application area in mind
- Set an initial goal of doing something useful and then build on it
In this webinar, we focus on data mining as the application area and show how anyone with just a basic knowledge of elementary data mining techniques can become immediately productive in R. We will:
- Provide an orientation to R’s data mining resources
- Show how to use the "point and click" open source data mining GUI, rattle, to perform the basic data mining functions of exploring and visualizing data, building classification models on training data sets, and using these models to classify new data.
- Show the simple R commands to accomplish these same tasks without the GUI
- Demonstrate how to build on these fundamental skills to gain further competence in R
- Move beyond small test data sets and show how, with the same level of skill, one can analyze fairly large data sets with RevoScaleR
Data scientists and analysts using other statistical software as well as students who are new to data mining should come away with a plan for getting started with R.
Introduction to R for Data Mining
1. Introduction to R for Data Mining
Revolution Confidential
2012 Spring Webinar Series
Joseph B. Rickert, Revolution Analytics
June 5, 2012
2. Goals for Today’s Webinar
To convince you that:
R is a serious platform for data mining
Seriously, it is not difficult to learn enough R to do some serious data mining
Revolution R Enterprise is the platform for serious data mining
3. Data Mining
Applications: Credit Scoring, Fraud Detection, Ad Optimization, Targeted Marketing, Gene Detection, Recommendation systems, Social Networks
Actions: Acquire Data, Prepare, Classify, Predict, Visualize, Optimize, Interpret
Algorithms: CART, Random Forests, SVM, KMeans, Hierarchical clustering, Ensemble Techniques
4. Recent KDnuggets Poll suggests so are a lot of other serious data miners
What Analytics, Data mining, Big Data software you used in the past 12 months for a real project (not just evaluation) [798 voters]
Software (votes) % users in 2012 % users in 2011
R (245) 30.7% 23.3%
Excel (238) 29.8% 21.8%
Rapid-I RapidMiner (213) 26.7% 27.7%
KNIME (174) 21.8% 12.1%
Weka / Pentaho (118) 14.8% 11.8%
StatSoft Statistica (112) 14.0% 8.5%
SAS (101) 12.7% 13.6%
Rapid-I RapidAnalytics (83) 10.4% Not asked in 2011
MATLAB (80) 10.0% 7.2%
IBM SPSS Statistics (62) 7.8% 7.2%
IBM SPSS Modeler (54) 6.8% 8.3%
SAS Enterprise Miner (46) 5.8% 7.1%
6. What does it mean to learn French?
To get around Paris on the Metro
To read a menu
To carry on a conversation
7. Learning R
Levels of R Skill
Write production level code: R developer
Write an R package: R contributor
Write functions: R programmer
Use R functions: R user
Use a GUI: R aware
Hours of use: 10 to 10,000 (The Malcolm Gladwell “Outlier” Scale)
9. R is set up to compute functions on data
lm <- function(x,y) { . . . }
lm.model
lm.model$assign
lm.model$coefficients
lm.model$df.residual
lm.model$effects
lm.model$fitted.values
. . .
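As a minimal sketch of the idea on this slide, the following fits a linear model on R's built-in mtcars data set and inspects the components of the returned object. The data set and the formula are assumptions chosen for illustration; note that the real lm() takes a formula and a data frame rather than the schematic (x, y) arguments shown on the slide.

```r
# Fit a simple linear regression on the built-in mtcars data (illustrative choice).
lm.model <- lm(mpg ~ wt, data = mtcars)

# The result is a list-like object; its components are accessed with `$`.
component.names <- names(lm.model)
print(component.names)

# A few of the components named on the slide.
print(lm.model$coefficients)          # intercept and slope
print(lm.model$df.residual)           # residual degrees of freedom
print(head(lm.model$fitted.values))   # first few fitted values
```

Calling names() on a fitted object is a quick way to discover what a modeling function returns before reading its help page.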
10. A little knowledge goes a long way in R
R’s functional design facilitates performing small tasks
For the most part, the output of a function depends only on the values of its arguments: calling a function multiple times with the same values of its arguments will produce the same result each time
Minimal side effects means it is much easier to understand and predict the behavior of a program
The trick is knowing which functions to call
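The point about functions and side effects can be illustrated with a small sketch. The helper standardize() below is hypothetical and written only for this example; it is not part of the webinar scripts.

```r
# A hypothetical helper: rescale a numeric vector to mean 0 and standard deviation 1.
standardize <- function(x) {
  (x - mean(x)) / sd(x)
}

v <- c(2, 4, 6, 8)
first.call  <- standardize(v)
second.call <- standardize(v)

# Same arguments, same result.
print(identical(first.call, second.call))

# The input vector is not modified by the call (no side effect on `v`).
print(v)
```

The "for the most part" qualifier matters: functions that draw random numbers, such as kmeans with random starting centers, return repeatable results only when the seed is fixed with set.seed().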
11. Basic Machine Learning Functions
Task         Function      Library       Description
Cluster      hclust        stats         Hierarchical cluster analysis
Cluster      kmeans        stats         Kmeans clustering
Classifiers  glm           stats         Logistic Regression
Classifiers  rpart         rpart         Recursive partitioning and regression trees
Classifiers  ksvm          kernlab       Support Vector Machine
Ensemble     ada           ada           Stochastic boosting
Ensemble     randomForest  randomForest  Random Forests classification and regression
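A minimal sketch of the three functions from the stats package listed on this slide follows. The data sets (iris and mtcars), the number of clusters, and the model formula are assumptions chosen for illustration and are not taken from the webinar scripts.

```r
# Cluster the four numeric iris measurements (built-in data, illustrative choice).
features <- scale(iris[, 1:4])

# kmeans (stats): fix the seed because the starting centers are chosen at random.
set.seed(42)
km <- kmeans(features, centers = 3, nstart = 10)
print(table(cluster = km$cluster, species = iris$Species))

# hclust (stats): build the tree from a distance matrix, then cut it into 3 groups.
hc <- hclust(dist(features), method = "complete")
hc.groups <- cutree(hc, k = 3)
print(table(hc.groups))

# glm (stats): logistic regression of transmission type (am) on weight in mtcars.
logit <- glm(am ~ wt, data = mtcars, family = binomial)
print(coef(logit))
```

The other functions in the table (rpart, ksvm, ada, randomForest) follow the same pattern of a formula or data matrix in and a model object out, but each requires loading its own package first.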
12. Noteworthy Data Mining Packages
Package  Comment
rattle   A very intuitive GUI for data mining that produces useful R code
caret    Well-organized and remarkably complete collection of functions to facilitate model building for regression and classification problems
14. Scripts to run
Script                              Some key functions
0 Setup                             Load libraries
1 Explore weather data              read.csv, plot
2 Run clustering algorithms         kmeans, hclust
3 Basic decision tree               rpart
4 Boosted Tree                      ada
5 Random Forest                     randomForest
6 Support Vector Machine            randomForest, varImpPlot
7 Big Data Mortgage Default model   rxLogit, rxKmeans
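The webinar scripts themselves are not reproduced in this transcript. As a rough sketch of what a script like "Basic decision tree" might involve, the following builds a classification tree on a training set and uses it to classify held-out rows. It assumes the rpart package is installed (it ships with standard R distributions as a recommended package) and substitutes the built-in iris data for the weather data used in the webinar; the split sizes and seed are arbitrary.

```r
library(rpart)

# Split the built-in iris data into training and test sets (arbitrary 100/50 split).
set.seed(1)
train.idx <- sample(nrow(iris), 100)
train <- iris[train.idx, ]
test  <- iris[-train.idx, ]

# Build a classification tree on the training data.
tree <- rpart(Species ~ ., data = train, method = "class")

# Use the fitted tree to classify the held-out rows.
pred <- predict(tree, newdata = test, type = "class")

# Compare predictions with the true labels.
confusion <- table(actual = test$Species, predicted = pred)
print(confusion)
accuracy <- sum(diag(confusion)) / nrow(test)
print(accuracy)
```

The same train, predict, and tabulate pattern carries over to the other model-fitting functions named in the table, which is part of why a small amount of R goes a long way.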
15. Big Data and R
There are some challenges:
All of your data and model code must fit into memory
Big data sets as well as big models (lots of variables) can run out of memory
Parallel computation might be necessary for models to run in a reasonable time
16. RevoScaleR in Revolution R Enterprise
Can help in a number of ways:
Manipulate large data sets, perhaps aggregating data so that it will fit in memory
For example, boiling down time-stamped data like a web log to form a time series that will fit in memory
Run RevoScaleR functions directly on big data sets
Run R functions in parallel
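The aggregation idea, boiling time-stamped records down to a compact time series, can be sketched in base R. This is not RevoScaleR code: it uses simulated in-memory data and only base functions, and the data frame weblog, its column names, and the hourly granularity are all assumptions made for illustration.

```r
# Simulate a small "web log": 10,000 request timestamps spread over one day.
set.seed(7)
start  <- as.POSIXct("2012-06-05 00:00:00", tz = "UTC")
weblog <- data.frame(timestamp = start + sort(runif(10000, 0, 24 * 3600)))

# Reduce each timestamp to its hour of the day (0 to 23).
weblog$hour <- as.integer(format(weblog$timestamp, "%H", tz = "UTC"))

# Count requests per hour; the factor levels keep hours with zero hits.
hits.per.hour <- table(factor(weblog$hour, levels = 0:23))

# Store the 24 hourly counts as a time series object.
hits.ts <- ts(as.vector(hits.per.hour), start = 0, frequency = 1)
print(hits.ts)
```

Ten thousand rows collapse to 24 numbers here; the same reduction applied to a log too large for memory is the kind of preprocessing step the slide describes.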
17. Top RevoScaleR Functions for Data Mining: parallel external memory algorithms
Task                        RevoScaleR function
Data processing             rxDataStep
Descriptive Statistics      rxSummary
Tables and cubes            rxCube, rxCrossTabs
Correlations / covariance   rxCovCor, rxCor, rxCov, rxSSCP
Linear Models               rxLinMod
Logistic regressions        rxLogit
Generalized linear models   rxGlm
K means clustering          rxKmeans
Predictions (scoring)       rxPredict
19. Finding your way around the R world
Machine Learning, Data Mining, Visualization
Finding Packages: Task Views, crantastic.org
Blogs: Revolutions, R-Bloggers, Quick-R
Getting Help: StackOverflow, @RLangTip, Inside-R, www.rseek.org
Finding R People: User Groups worldwide, #rstats
[Figure: Word Cloud for @inside_R]
20. Look at some more sophisticated examples
Thomson Nguyen on the Heritage Health Prize
Shannon Terry & Ben Ogorek (Nationwide Insurance): A Direct Marketing In-Flight Forecasting System
Jeffrey Breen: Mining Twitter for Airline Consumer Sentiment
Joe Rothermich: Alternative Data Sources for Measuring Market Sentiment and Events (Using R)
21. Revolution Analytics Training
http://www.revolutionanalytics.com/products/training/