Big Data, Data Science, Machine Learning and Analytics are a few of the new buzzwords that have invaded our industry of late. Again we are being sold a unicorn-laden, silver-bullet panacea by heavy-handed marketing folks, evoking the expected pushback from the most enlightened members of our community. However, as was the case before, there might just be enough technical meat in there to help with our security challenges and the overwhelming odds we face every day. And if so, what do we as a community have to know about these technologies in order to be better professionals? Can we really use the data we have been collecting to help automate our security decision making? Is a robot going to steal my job?
If you are interested in what is behind this marketing buzz and are not scared of a little math, this talk offers some insights into applying Machine Learning techniques to data any of us can easily access. It tries to bring home the point that if all of this technology can be used to show us “better” ads in social media and track our behavior online (and a bit more than that), it can also be used to defend our networks.
BSidesLV 2013 - Using Machine Learning to Support Information Security
1. Using Machine Learning to
support Information Security
Alexandre Pinto
alexcp@mlsecproject.org
@alexcpsec
@MLSecProject
Proving Ground (Many Thanks to Joel Wilbanks)
2. • This is a talk about DEFENDING not attacking
– NO systems were harmed in the development of
this talk.
– This is NOT about some vanity hack that will be
patched tomorrow
– We are actually trying to BUILD something here.
• This talk includes more MATH than the daily
intake recommended by the FDA.
• You have been warned...
WARNING!
3. • 12 years in Information Security, done a little bit of
everything.
• Past 7 or so years leading security consultancy and
monitoring teams in Brazil, London and the US.
– If there is any way a SIEM can hurt you, it did to me.
• Researching machine learning and data science in
general for the past year or so. Participating in
Kaggle machine learning competitions (for fun, not
for profit).
• First presentation at a real Infosec conference! (give
or take a few hours)
Who’s Alex?
4. • The elephant in the room
• Enter Machine Learning
• Principles and Kinds of ML
• ML and InfoSec
• MLSec Project
• How to get started?
• Take Aways
Agenda
7. • “Machine learning systems automatically
learn programs from data” (*)
• You don’t really code the program, but it
is inferred from data.
• Intuition of trying to mimic the way the
brain learns: that’s where terms like
artificial intelligence come from.
Enter Machine Learning
(*) CACM 55(10) - A Few Useful Things to Know about Machine Learning
9. • Fraud detection systems:
– Is what he just did consistent with
past behavior?
• Network anomaly detection (?):
– NOPE!
– More like statistical analysis, bad
one at that
• Predicting likelihood of attack
actors
– Create different predictive models
and chain them to gain more
confidence in each step.
Security Applications of ML
• SPAM filters
12. • Paper from Microsoft Research circa Sept’98!
• (Thanks, Wikipedia!)
Kinds of ML: Naïve Bayes (SPAM filters)
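A minimal sketch of the Naïve Bayes idea behind SPAM filters, assuming a toy bag-of-words model with Laplace smoothing. The training messages and the names `train` and `predict` are illustrative assumptions, not taken from the talk or from the paper the slide credits.

```python
import math
from collections import Counter

def train(messages):
    """messages: list of (text, label) pairs with label 'spam' or 'ham'."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for text, label in messages:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, doc_counts, vocab

def predict(text, model):
    """Return the label with the highest log-probability score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        # Log prior: how common this class is in the training data.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # words never seen in training carry no evidence
            # Laplace (add-one) smoothing avoids zero probabilities.
            p = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy training set, invented for illustration only.
TOY_MESSAGES = [
    ("cheap pills buy now", "spam"),
    ("win money now", "spam"),
    ("meeting notes attached", "ham"),
    ("lunch meeting tomorrow", "ham"),
]
model = train(TOY_MESSAGES)
print(predict("buy cheap pills", model))
print(predict("meeting tomorrow", model))
```

The "naïve" part is the assumption that words are independent given the class, which is why the per-word log-probabilities can simply be added.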
13. • One of the simplest examples of ML
• Try to infer a relationship between a result variable (y)
and a linear combination of others (x), minimizing the
“squared error” (distance measurement)
Kinds of ML: Linear Regression
Jesse Johnson – shapeofdata.wordpress.com
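A minimal sketch of the "squared error" idea, assuming a single feature x rather than the linear combination of several features the slide mentions. The function names and the data points are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def squared_error(xs, ys, slope, intercept):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data lying exactly on y = 2x + 1.
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]
slope, intercept = fit_line(xs, ys)
print(slope, intercept, squared_error(xs, ys, slope, intercept))
```

Any other slope or intercept gives a larger `squared_error` on the same points, which is what "minimizing" means here.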
14. Kinds of ML: SVM FTW!
• One of my favorite algorithms!
• Support Vector Machines (SVM):
– Good for classification problems with numeric features
– Not a lot of parameters, which helps control overfitting; built-in
regularization in the model; usually robust
– However, sometimes slow to train (# of points, # of features)
– Also awesome: hyperplane separation in an unknown,
infinite-dimensional space.
Jesse Johnson – shapeofdata.wordpress.com
No idea… Everyone copies this
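A rough sketch of a linear SVM, assuming plain sub-gradient descent on the regularized hinge loss. It shows only the separating-hyperplane part; the kernel trick behind the "infinite dimension" remark is not implemented. The data, the hyperparameters and the names `train_linear_svm` and `svm_predict` are assumptions for illustration, and a real project would more likely reach for an existing library.

```python
def train_linear_svm(points, labels, lam=0.001, lr=0.1, epochs=50):
    """Fit w, b so that sign(w.x + b) separates labels in {-1, +1}."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Regularization step: shrink w slightly on every sample.
            w = [wi - lr * lam * wi for wi in w]
            if margin < 1:
                # Point is inside the margin or misclassified:
                # move the hyperplane toward classifying it correctly.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def svm_predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Hypothetical, linearly separable toy data with numeric features.
POINTS = [(-2, -1), (-1, -2), (-2, -2), (2, 1), (1, 2), (2, 2)]
LABELS = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(POINTS, LABELS)
print([svm_predict(w, b, p) for p in POINTS])
```

The `lam` term is the built-in regularization the slide refers to: it keeps the weights small, which corresponds to preferring a wider margin.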
15. • SIEM and Log Monitoring tools are just vertical BI
applications (from the 90’s)
• “I don't have time for your marketing hype!” – Infosec
• How many logs do you think there are in your
organization?
ML and Infosec
16. InfoSec Data Scientists
Data Science Venn Diagram by Drew Conway
• “Data Scientist (n.): Person who is better at statistics than
any software engineer and better at software engineering
than any statistician.” -- Josh Willis, Cloudera
17. Considerations on Data Gathering
• Models will (generally) get better with more data
– But we always have to consider bias and variance as we
select our data points
– Also adversaries – we may be force fed “bad data”, find
signal in weird noise or design bad (or exploitable) features
• “I’ve got 99 problems, but data ain’t one”
Domingos, 2012; Abu-Mostafa, Caltech, 2012
18. • Adversaries - Exploiting the learning process
• Understand the model, understand the
machine, and you can circumvent it
• Something the InfoSec community knows very well
• Any predictive model in InfoSec will be pushed
to the limit (LIMIT!)
• Again, think back on the
way SPAM engines evolved.
Considerations on Data Gathering
19. MLSec Project
• Sign up, send logs, receive reports generated by
machine learning models!
– FREE! I need the data! Please help! ;)
• Looking for contributors, ideas, skeptics to support
the project as well.
• Visit https://www.mlsecproject.org , message
@MLSecProject or just e-mail me.
20. • We developed an algorithm to detect malicious
behavior from log entries of firewall blocks
• Over 6 months of data from SANS DShield
• We don’t focus on frequency or network
anomaly detection. Get ground truth “badness”
and roll with it.
• After a lot of statistics-based math (true
positive ratio, true negative ratio, odds
likelihood), it can pinpoint actors that would
be 13x-18x more likely to attack you.
MLSec Project
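One way to read the "13x-18x more likely" figure is as a positive likelihood ratio built from the true positive and true negative ratios. The sketch below assumes that reading; it is not the MLSec Project's actual algorithm, and the confusion-matrix counts are invented for illustration, not results from the talk.

```python
def rates(tp, fn, tn, fp):
    """True positive ratio (sensitivity) and true negative ratio (specificity)."""
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return tpr, tnr

def positive_likelihood_ratio(tpr, tnr):
    """How much more often the model flags a true attacker than a benign IP."""
    return tpr / (1.0 - tnr)

def posterior_odds(prior_odds, likelihood_ratio):
    """Bayes' rule in odds form: posterior odds = prior odds * LR."""
    return prior_odds * likelihood_ratio

# Hypothetical counts: 100 known-bad IPs and 1000 known-benign IPs.
tpr, tnr = rates(tp=80, fn=20, tn=950, fp=50)
lr_plus = positive_likelihood_ratio(tpr, tnr)
print(round(lr_plus, 2))  # roughly 16 for these made-up counts
```

With these made-up counts, an IP flagged by the model has odds of being an attacker roughly 16 times its prior odds.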
21. Map of the Internet
• (Hilbert Curve)
• Block port 22
• 2013-07-20
[Figure: Hilbert curve map of the IPv4 address space; labeled regions: 0, 10, 127, MULTICAST AND FRIENDS]
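A sketch of how an IPv4 address can be placed on a Hilbert curve map, assuming one tile per /8 (Class A) block on a 16x16 grid and the standard distance-to-coordinate conversion. The orientation may differ from the map in the slides, and the names `d2xy` and `slash8_cell` are illustrative.

```python
def d2xy(order, d):
    """Map distance d along a Hilbert curve to (x, y) on a 2**order grid."""
    n = 1 << order
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate/flip the quadrant so the sub-curves join end to end.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def slash8_cell(ip):
    """Grid cell of the /8 block containing a dotted-quad IPv4 address."""
    first_octet = int(ip.split(".")[0])
    return d2xy(4, first_octet)  # order 4 gives 16 x 16 = 256 cells

# Cells for the blocks labeled on the map (0, 10, 127).
for address in ("0.0.0.0", "10.1.2.3", "127.0.0.1"):
    print(address, slash8_cell(address))
```

The curve keeps numerically adjacent blocks next to each other on the grid, which is why neighboring networks show up as contiguous patches on this kind of map.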
22. Map of the Internet
• (Hilbert Curve)
• Block port 22
• 2013-07-20
[Figure: same Hilbert curve map (labeled regions: 0, 10, 127, MULTICAST AND FRIENDS) with country annotations on hot spots: CN; RU; CN, BR, TH]
23. • Behavior: block on port 22
• Trial inference on 100k IP addresses per Class A subnet
• Logarithmic scale: brightest tiles are 10 to 1000 times more likely to attack.
MLSec Project
24. MLSec Project - Some interesting
results
• Ok, robot: show me who the “evil guys” are on
port 80 (highest likelihood of attack), by AS name
25. MLSec Project - Some interesting
results
• ZOMG! It KNOWS! Call John Connor!
• 1st model did not take into consideration web crawler activity.
• Without netsec/infosec experience, scientists would be
scratching their heads for days.
• Ok, robot: show me who the “evil guys” are on
port 80 (highest likelihood of attack), by AS name
26. • Programming is a must (Python / R)
• Statistical knowledge keeps you from
making dumb mistakes
• Specific machine learning courses and
books:
– Coursera (ML/ Data Analysis / Data Science)
• Practice, Practice, Practice:
– Kaggle
– KDD, VAST, VizSec
How to get started?
27. • Big data is here! *BUZZWORD ALERT*
• Machine learning / predictive analytics are
coming.
• In 6-12 months, everyone will wish they were a
Data Scientist (not really!)
• There is a lot of applicability in InfoSec
• Embrace the change: the correct applicability of
ML models can greatly enhance defensive
practices.
• MLSec Project is cool, check out my talk at BH/DC
• And MOST IMPORTANTLY…
Take Aways
30. Thanks!
• Q&A?
• Feedback is welcome!
• (bad = Joel’s fault :P)
Alexandre Pinto
alexcp@mlsecproject.org
@alexcpsec
@MLSecProject
"Prediction is very difficult, especially if it's about the future."
- Niels Bohr