AppSec USA 2014
Denver, Colorado
Catch me if you can
Machine Learning, VMs, honeypots and more..
Introduction
Anirban Banerjee
• San Francisco
• Web-malware detection
• Machine learning, scalable systems
• Ph.D. CSE – works at CloudFlare
• Interface with hosting industry
• Co-Founder of StopTheHacker
• Post acquisition at CloudFlare
• Interested in malware detection, RE
• Various talks at HostingCon, Parallels Summit
Quick Overview
• StopTheHacker
• CloudFlare
• Web Malware – Existing tools Fail
• Web Malware – Attack Vectors
• Identification
• Scaling honeypots
• Machine Learning
StopTheHacker - CloudFlare
• StopTheHacker
– Founded in 2009
– Funded by NSF
– Identifies, cleans web-malware automatically
– Partners with hosters
– Uses Machine Learning, pattern matching, AVs, VMs
• CloudFlare
– DDoS protection
– CDN
– WAF
– Cloud Solution
– Contribute to NGINX
– Use Lua, Go
– 5->7% of Internet traffic daily
Web Malware - Existing Tools Fail
• AVs
– Polymorphic malware
– Checks for AV processes
– Avast, ClamAV, AVG
– Linux versions seem to not be updated as frequently
• Pattern Matching
– Trivial to change code structure
– Trivial to change commands
– Yara, Perl, Grep, Awk
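To illustrate why pattern matching is trivial to evade, here is a minimal sketch. The signature and both PHP samples are hypothetical and were made up for this example; they are not taken from the talk or from any real rule set. The second sample behaves the same in PHP, but builds the function name at runtime so the literal token never appears in the file.

```python
import base64
import re

# Hypothetical signature: flags a literal eval(base64_decode(...)) call in PHP.
SIGNATURE = re.compile(r"eval\s*\(\s*base64_decode\s*\(")

# Harmless stand-in payload, base64-encoded the way injected code often is.
payload = base64.b64encode(b"echo 'pwned';").decode("ascii")

# Original sample: contains the exact token sequence the signature expects.
original = f'<?php eval(base64_decode("{payload}")); ?>'

# Rewritten sample: the decoder name is concatenated at runtime, so a static
# text match on "base64_decode" has nothing to match against.
rewritten = (
    '<?php $f = "base" . "64_" . "decode"; '
    f'eval($f("{payload}")); ?>'
)


def is_flagged(source: str) -> bool:
    """Return True when the static signature matches the source text."""
    return SIGNATURE.search(source) is not None


if __name__ == "__main__":
    print("original flagged: ", is_flagged(original))
    print("rewritten flagged:", is_flagged(rewritten))
```

A one-line change to the code structure is enough to slip past the rule, which is the weakness this slide points at.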
Web Malware – Attack Vectors
• Via Website
– SQL Injection
– XSS
– Ads
– 3rd party libraries
– Themes
– Plugins
• Bypass
– FTP creds
– Apache modules
– SEO poisoning
Web Malware – Attack Vectors
• Making it a bit harder
– Custom WP packages e.g. Dreampress
– Auto upgrades
– WAFs
– Proper separation of web server and CMS roles
– End clients must be educated
– *Some* default scanning for *every* site
• Free to end client
– Web-Malware collaboration group (SBW)
Identification - Highlights
Web malware
• High churn
– Iframe targets
– Fast flux networks
– Encoded, encrypted, randomly generated domains
– PHP code changes
Binaries
• Low churn
– Primarily PE32/Win
– Target old IE exploits
– Spyware/Adware more than malware
– FTP sniffers, IRC drop
Identification – Challenges
Web malware
• Detection is hard
– What is malware? Redirection, binary drop, registry modification..
– PHP, ASP, Shell, Perl, Python, Ruby..
– Malware is smart: UA, Geo IP, Time of day, only once per IP..
– Blacklists very outdated
– AVs have very poor catch rate
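One way to probe for the "smart" behavior listed above (content that varies by UA, Geo IP and so on) is to request the same URL under several client profiles and compare what comes back. The sketch below covers only the comparison step. The profile names and the `cloaked_profiles` helper are hypothetical, and fetching the pages is left to the caller.

```python
import hashlib


def body_fingerprint(body: bytes) -> str:
    """Return a SHA-256 hex digest of a response body."""
    return hashlib.sha256(body).hexdigest()


def cloaked_profiles(responses: dict[str, bytes], baseline: str) -> list[str]:
    """Return the profiles whose body differs from the baseline profile.

    `responses` maps a client profile name (for example a User-Agent label)
    to the raw body served to that profile for the same URL.
    """
    reference = body_fingerprint(responses[baseline])
    return sorted(
        profile
        for profile, body in responses.items()
        if profile != baseline and body_fingerprint(body) != reference
    )


if __name__ == "__main__":
    sample = {
        "crawler": b"<html>ok</html>",
        "firefox": b"<html>ok</html>",
        "ie8": b"<html>ok<iframe src=x></iframe></html>",
    }
    print(cloaked_profiles(sample, baseline="crawler"))
```

Exact hashing is deliberately crude: legitimately dynamic pages also differ between requests, so a difference is a hint for further analysis, not a verdict. Behavior such as "only once per IP" needs fresh source addresses per probe as well.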
Scaling honeypots
[Architecture diagram. Labels: Bad Hacker; Bad Bot; dev.go.com; Front End; WAF; BL; Container (Docker) on OS on Bare Metal, running WP 3.6.1, 3.7, 2.8, 3.0, Joomla, Drupal, Django (any flavor we want); Public API; Cuckoo based VM (Windows binaries and honeypot). Annotations: host content, tripwire, analyze binary; IP, file deposited etc.]
Scaling honeypots - Is this better?
Yes
• Docker – common library re-use
• Spawn thousands of instances on one rack
• Any flavor of CMS you like
• Watchdog for file system changes
• Dropped files shipped off to Cuckoo VM
• Complete trace, screenshots with specific IE version
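The "watchdog for file system changes" step can be sketched as a snapshot-and-diff over a honeypot's web root. This is an assumption about one simple way to do it, not the system described in the talk; the function names are hypothetical, and a production setup would more likely use kernel notifications than polling.

```python
import hashlib
import os


def snapshot(root: str) -> dict[str, str]:
    """Map each file under root (as a relative path) to its SHA-256 digest."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            digests[os.path.relpath(path, root)] = digest
    return digests


def diff_snapshots(before: dict[str, str], after: dict[str, str]):
    """Return (added, modified, removed) relative paths, each list sorted."""
    added = sorted(p for p in after if p not in before)
    removed = sorted(p for p in before if p not in after)
    modified = sorted(
        p for p in after if p in before and after[p] != before[p]
    )
    return added, modified, removed
```

Files reported as added or modified are the candidates to ship off for sandbox analysis; a clean container image gives the baseline snapshot.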
Scaling honeypots - Challenges
Constant Cat and Mouse game
• Rotate IPs, avoid customer IPs
• Juicy target for DDoS (400 Gigs/s +)
• Keep up with new variants
• Malware getting smarter, check for VM
• Malware targets mobile devices
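"Rotate IPs, avoid customer IPs" could look roughly like the sketch below, which cycles through a honeypot address pool while skipping any address inside a customer range. The helper name is hypothetical and the CIDR blocks in the usage example are documentation ranges, not real allocations.

```python
import ipaddress
import itertools


def honeypot_ip_cycle(pool_cidr: str, customer_cidrs: list[str]):
    """Return an endless iterator over pool addresses outside customer ranges."""
    pool = ipaddress.ip_network(pool_cidr)
    customers = [ipaddress.ip_network(cidr) for cidr in customer_cidrs]
    usable = [
        ip for ip in pool.hosts()
        if not any(ip in network for network in customers)
    ]
    if not usable:
        raise ValueError("no honeypot addresses left after exclusions")
    return itertools.cycle(usable)


if __name__ == "__main__":
    rotation = honeypot_ip_cycle("192.0.2.0/29", ["192.0.2.4/30"])
    for _ in range(4):
        print(next(rotation))
```

Round-robin is the simplest policy; randomizing the order or retiring addresses that attackers have already fingerprinted would be natural refinements.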
Machine Learning
Helps identify the unseen
• Need a dataset
– Offensive Computing, VirusTotal, blacklists..
• Analyze what is important
– Reduce noise
– More features is not always better
– PCA type experiments
– Use rules of thumb – forests/Trees
– scikit-learn/PyBrain/Weka are your friends
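Before any classifier can be trained, each sample has to be turned into numbers. The sketch below shows a small, hand-picked feature vector for a web page or script. The specific features (entropy, longest token, counts of eval, iframe and decoder calls) are assumptions chosen for illustration and are not the feature set used by the speaker.

```python
import math
import re
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of the text in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def extract_features(source: str) -> dict[str, float]:
    """Turn page or script source into a small numeric feature vector."""
    longest_token = max((len(tok) for tok in source.split()), default=0)
    decoders = r"\b(?:unescape|fromCharCode|base64_decode|atob)\b"
    return {
        # Encoded or packed payloads tend to have higher character entropy.
        "entropy": shannon_entropy(source),
        # Obfuscated blobs often show up as one very long unbroken token.
        "longest_token": float(longest_token),
        "eval_calls": float(len(re.findall(r"\beval\s*\(", source))),
        "iframes": float(len(re.findall(r"<iframe\b", source, re.IGNORECASE))),
        "decoder_calls": float(len(re.findall(decoders, source))),
    }
```

Keeping the vector short fits the point that more features are not always better: each extra feature should earn its place in PCA-style or tree-based importance experiments.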
Machine Learning
Toolkit strategy
• PyBrain
– Use for clustering, neural networks
– Identify what clusters are present
• scikit-learn/Weka
– Use for classification
– Constant retraining needed: high recall, precision
– Feedback-loop-based system is important
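A feedback loop needs a trigger for retraining. The sketch below computes precision and recall on freshly labeled samples and flags when either drops below a floor. The threshold values and the `needs_retraining` helper are hypothetical; the talk does not state concrete targets.

```python
def precision_recall(y_true: list[int], y_pred: list[int]) -> tuple[float, float]:
    """Return (precision, recall) for binary labels where 1 means malicious."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def needs_retraining(
    y_true: list[int],
    y_pred: list[int],
    min_precision: float = 0.95,  # assumed floor, tune per deployment
    min_recall: float = 0.90,     # assumed floor, tune per deployment
) -> bool:
    """Return True when either metric falls below its configured floor."""
    precision, recall = precision_recall(y_true, y_pred)
    return precision < min_precision or recall < min_recall
```

In practice `y_true` would come from analyst verdicts or sandbox results on recent samples, which is what closes the loop.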
Machine Learning
What is the benefit
• Fuzzed iframes caught easily
• Fuzzed/encoded PHP/JS caught easily
• Catches ad misbehavior
• Catches binary that is missed by AV but tries to do “obvious” bad things
• Let’s move away from signatures
Machine Learning
Is it all roses and honey?
• No – constant retraining needed
• Need to be able to get a large dataset
– As features increase, the data needed increases exponentially
• CPU needed
• Near-Real-time very hard
• Toolkits are good – but can be better
Current Status and Future Plans
Right now
• PyBrain
– Use for clustering, neural networks
– Identify what clusters are present
• scikit-learn/Weka
– Use for classification
– Constant retraining needed: high recall, precision
– Feedback-loop-based system is important
Current Status and Future Plans
Future Plans
• Inline ML for WAF
• More focus on mobile malware
• More focus on DDoS malware
• More focus on using ML – traffic anomalies
More work needed
The road ahead
• Make VM detection harder
• Use on metal type solution – performance!
• Investigate Go for inline traffic processing
• Potentially open source portions of code
• Automated malware collection at massive scale
That’s it folks
Q&A