BSidesLV 2013 - Using Machine Learning to Support Information Security

Using Machine Learning to
support Information Security
Alexandre Pinto
alexcp@mlsecproject.org
@alexcpsec
@MLSecProject
Proving Ground (Many Thanks to Joel Wilbanks)

• This is a talk about DEFENDING not attacking
– NO systems were harmed in the development of
this talk.
– This is NOT about some vanity hack that will be
patched tomorrow
– We are actually trying to BUILD something here.
• This talk includes more MATH than the daily
intake recommended by the FDA.
• You have been warned...
WARNING!

• 12 years in Information Security, done a little bit of
everything.
• Past 7 or so years leading security consultancy and
monitoring teams in Brazil, London and the US.
– If there is any way a SIEM can hurt you, it did to me.
• Researching machine learning and data science in
general for the past year or so. Participates in
Kaggle machine learning competitions (for fun, not
for profit).
• First presentation at a real Infosec conference! (give
or take a few hours)
Who’s Alex?

• The elephant in the room
• Enter Machine Learning
• Principles and Kinds of ML
• ML and InfoSec
• MLSec Project
• How to get started?
• Take Aways
Agenda

The elephant in the room
• “Internet-scale companies”

• “Machine learning systems automatically
learn programs from data” (*)
• You don’t really code the program, but it
is inferred from data.
• Intuition of trying to mimic the way the
brain learns: that’s where terms like
artificial intelligence come from.
Enter Machine Learning
(*) CACM 55(10) - A Few Useful Things to Know about Machine Learning

• Sales
• Trading
• Image and Voice Recognition
Applications of Machine Learning

• Fraud detection systems:
– Is what he just did consistent with
past behavior?
• Network anomaly detection (?):
– NOPE!
– More like statistical analysis, and a bad
one at that
• Predicting which actors are likely
to attack
– Create different predictive models
and chain them to gain more
confidence in each step.
• SPAM filters
Security Applications of ML

• Data Mining:
How to do Machine Learning?
• Exploring the space:

• Supervised Learning:
– Classification (NN, SVM, Naïve Bayes)
– Regression (linear, logistic)
• Unsupervised Learning:
– Clustering (k-means)
– Decomposition (PCA, SVD)
Kinds of Machine Learning
Source – scikit-learn.github.io/scikit-learn-tutorial/

• Paper from Microsoft Research circa Sept’98!
• (Thanks, Wikipedia!)
Kinds of ML: Naïve Bayes (SPAM filters)
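
To make the Naïve Bayes spam-filter idea concrete, here is a minimal sketch in plain Python. It is not the method from the paper mentioned on this slide: the four training messages are made up, words are split on whitespace only, and add-one (Laplace) smoothing is an assumption chosen to keep the arithmetic simple.

```python
import math
from collections import Counter

def train(messages):
    """messages: list of (text, label) pairs, label is "spam" or "ham"."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in messages:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, label_counts, vocab

def classify(text, model):
    """Return the label with the highest log-posterior score."""
    word_counts, label_counts, vocab = model
    total_msgs = sum(label_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        # Start from the log prior of the class.
        score = math.log(label_counts[label] / total_msgs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing avoids taking log(0) for unseen pairs.
            p = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up training data, for illustration only.
training = [
    ("win free money now", "spam"),
    ("free prize click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
model = train(training)
print(classify("free money", model))
print(classify("monday lunch", model))
```

The "naïve" part is the assumption that words are independent given the class, which is why the per-word log probabilities are simply added.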

• One of the simplest examples of ML
• Try to infer a relationship between a result variable (y)
and a linear combination of others (x), minimizing the
“squared error” (distance measurement)
Kinds of ML: Linear Regression
Jesse Johnson – shapeofdata.wordpress.com
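
The "minimize the squared error" idea can be sketched for a single input variable using the textbook closed-form least-squares solution. The data points below are made up and lie exactly on y = 2x + 1, so the fit is easy to check by hand; the function names are illustrative only.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def squared_error(xs, ys, slope, intercept):
    """Sum of squared vertical distances between the data and the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Made-up data on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept, squared_error(xs, ys, slope, intercept))
```

With noisy data the squared error would be greater than zero, and the fitted line is the one that makes it as small as possible.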

Kinds of ML: SVM FTW!
• One of my favorite algorithms!
• Support Vector Machines (SVM):
– Good for classification problems with numeric features
– Few parameters; built-in regularization in the model helps
control overfitting; usually robust
– However, sometimes slow to train (# of points, # of features)
– Also awesome: hyperplane separation in an unknown, possibly
infinite-dimensional space.
Jesse Johnson – shapeofdata.wordpress.com
No idea… Everyone copies this
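
As a rough illustration of the regularization point, here is a sketch of a linear SVM trained by stochastic subgradient descent on the regularized hinge loss. It is a simplification: there is no kernel, so it does not show the infinite-dimensional trick mentioned above, and the data, learning rate `lr`, regularization strength `lam`, and epoch count are all made-up assumptions.

```python
def train_linear_svm(points, labels, lam=0.01, epochs=50, lr=0.1):
    """Linear SVM via subgradient descent on (lam/2)*||w||^2 + hinge loss.
    labels must be +1 or -1."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Inside the margin or misclassified: hinge loss is active.
                w = [wi - lr * (lam * wi - y * xi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Only the regularization term shrinks the weights.
                w = [wi - lr * lam * wi for wi in w]
    return w, b

def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Made-up, linearly separable data.
points = [(-2, -1), (-1, -2), (-2, -2), (2, 1), (1, 2), (2, 2)]
labels = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(points, labels)
print(predict(w, b, (3, 3)), predict(w, b, (-3, -3)))
```

The `lam` term is the built-in regularization: it keeps shrinking the weights, which corresponds to preferring a wider margin over fitting every point tightly.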

• SIEM and Log Monitoring tools are just vertical BI
applications (from the 90’s)
• “I don't have time for your marketing hype!” – Infosec
• How many logs do you think there are in your
organization?
ML and Infosec

InfoSec Data Scientists
Data Science Venn Diagram by Drew Conway
• “Data Scientist (n.): Person who is better at statistics than
any software engineer and better at software engineering
than any statistician.” -- Josh Willis, Cloudera

Considerations on Data Gathering
• Models will (generally) get better with more data
– But we always have to consider bias and variance as we
select our data points
– Also adversaries – we may be force-fed “bad data”, find
signal in weird noise or design bad (or exploitable) features
• “I’ve got 99 problems, but data ain’t one”
Domingos, 2012; Abu-Mostafa, Caltech, 2012

• Adversaries - Exploiting the learning process
• Understand the model, understand the
machine, and you can circumvent it
• Something the InfoSec community knows very well
• Any predictive model in InfoSec will be pushed
to the limit (LIMIT!)
• Again, think back on the
way SPAM engines evolved.
Considerations on Data Gathering

MLSec Project
• Sign up, send logs, receive reports generated by
robots... I mean, machine learning models!
– FREE! I need the data! Please help! ;)
• Also looking for contributors, ideas, and skeptics to
support the project.
• Visit https://www.mlsecproject.org, message
@MLSecProject or just e-mail me.

• We developed an algorithm to detect malicious
behavior from log entries of firewall blocks
• Over 6 months of data from SANS DShield
• We don’t focus on frequency or network
anomaly detection. Get ground truth “badness”
and roll with it.
• After a lot of statistics-based math (true
positive ratio, true negative ratio, odds
likelihood), it can pinpoint actors that would
be 13x-18x more likely to attack you.
MLSec Project
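
One way to read the "N times more likely" figure is as a positive likelihood ratio computed from the true positive and true negative ratios. The sketch below assumes that standard definition; the slides do not confirm that this is the exact calculation the project uses, and the confusion-matrix counts are made up.

```python
def rates(tp, fn, tn, fp):
    """Return (true positive ratio, true negative ratio)."""
    tpr = tp / (tp + fn)  # share of real attackers the model flags
    tnr = tn / (tn + fp)  # share of benign actors the model clears
    return tpr, tnr

def positive_likelihood_ratio(tpr, tnr):
    """How much more often a 'malicious' verdict is given to a real
    attacker than to a benign actor."""
    return tpr / (1.0 - tnr)

# Made-up counts for illustration; not results from the project.
tp, fn, tn, fp = 90, 10, 900, 100
tpr, tnr = rates(tp, fn, tn, fp)
lr_plus = positive_likelihood_ratio(tpr, tnr)
print(round(tpr, 2), round(tnr, 2), round(lr_plus, 2))
```

A ratio well above 1 means that being flagged by the model is real evidence of "badness"; a ratio near 1 means the verdict carries almost no information.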

Map of the
Internet
• (Hilbert Curve)
• Block port 22
• 2013-07-20
[Figure: Hilbert-curve map of the IPv4 space; labeled blocks: 0, 10, 127, MULTICAST AND FRIENDS]

Map of the
Internet
• (Hilbert Curve)
• Block port 22
• 2013-07-20
[Figure: Hilbert-curve map of the IPv4 space; labeled blocks: 0, 10, 127, MULTICAST AND FRIENDS; annotated hot spots: CN; RU; CN, BR, TH]
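
For readers unfamiliar with this kind of map, the sketch below shows how each /8 block (first octet 0 to 255) can be placed on a 16x16 grid along a Hilbert curve, so that numerically adjacent blocks end up in adjacent tiles. It uses the well-known distance-to-coordinates conversion; the orientation and tile layout of the map in the slides may differ, and `slash8_to_tile` is a hypothetical helper name.

```python
def d2xy(n, d):
    """Convert distance d along a Hilbert curve into (x, y) on an
    n-by-n grid, where n is a power of two."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate/flip the quadrant so sub-curves connect end to end.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def slash8_to_tile(first_octet):
    """Hypothetical helper: tile position of a /8 block on a 16x16 map."""
    return d2xy(16, first_octet)

# Tiles for the blocks labeled on the map (0, 10, 127).
for octet in (0, 10, 127):
    print(octet, slash8_to_tile(octet))
```

Because consecutive distances always land on neighboring tiles, contiguous address ranges show up as contiguous regions, which is what makes "hot" neighborhoods visible at a glance.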

• Behavior: block on port 22
• Trial inference on 100k IP addresses per Class A subnet
• Logarithmic scale: brightest tiles are 10 to 1000 times
more likely to attack.
MLSec Project

MLSec Project - Some interesting results
• Ok, robot: show me who the “evil guys” are on
port 80 (highest likelihood of attack), by AS name

MLSec Project - Some interesting results
• ZOMG! It KNOWS! Call John Connor!
• 1st model did not take web crawler activity into consideration.
• Without netsec/infosec experience, scientists would be
scratching their heads for days.
• Ok, robot: show me who the “evil guys” are on
port 80 (highest likelihood of attack), by AS name

• Programming is a must (Python / R)
• Statistical knowledge keeps you from
making dumb mistakes
• Specific machine learning courses and
books:
– Coursera (ML/ Data Analysis / Data Science)
• Practice, Practice, Practice:
– Kaggle
– KDD, VAST, VizSec
How to get started?

• Big data is here! *BUZZWORD ALERT*
• Machine learning / predictive analytics are
coming.
• In 6-12 months, everyone will wish they were a
Data Scientist (not really!)
• There is a lot of applicability in InfoSec
• Embrace the change: the correct application of
ML models can greatly enhance defensive
practices.
• MLSec Project is cool, check out my talk at BH/DC
• And MOST IMPORTANTLY…
Take Aways

Machine Learning = ROBOT Unicorns + Rainbows

Thanks!
• Q&A?
• Feedback is welcome!
• (bad = Joel’s fault :P)
Alexandre Pinto
alexcp@mlsecproject.org
@alexcpsec
@MLSecProject
"Prediction is very difficult, especially if it's about the future."

- Niels Bohr
