slides - Data Sciences and Knowledge Discovery Laboratory

ACTIONABLE KNOWLEDGE DISCOVERY
FOR THREATS INTELLIGENCE SUPPORT
~
A MULTI-DIMENSIONAL DATA MINING METHODOLOGY
2nd Int. Workshop on
Domain Driven Data Mining
Pisa - Dec 15th, 2008
Olivier Thonnard
Royal Military Academy
Polytechnic Faculty
Belgium
[email protected]
Marc Dacier
Symantec Research Labs
Sophia Antipolis
France
[email protected]
Outline
1. Introduction
2. A multi-dimensional & domain-driven approach for mining network traffic (e.g. malicious)
3. Experimental environment
4. A real-world example
5. Conclusions
Introduction
• According to the security community, today’s cybercrime:
  • Is increasingly organized
  • Involves the commoditization of various activities:
    • Selling 0-days and new (undetected) malware
    • Selling or renting compromised hosts or entire botnets
  • Seems to be specialized in certain countries
→ Coordination → patterns …
Threats intelligence
• What is the prevalence of emerging coordinated malicious activities?
• Which countries / IP blocks seem to be more affected?
• Can we observe various “communities” of machines coordinating their efforts?
• How to discover knowledge about:
  1. The modus operandi of attack phenomena
  2. The underlying root causes of attacks
• How to analyze Internet threats from a global strategic level?
• Can we enable some sort of Internet threat “situational awareness”?
Our « multi-dimensional KDD » approach to analyze network threats
• Collect real-world attack traces from a number of (worldwide) distributed sensors
  • Network of honeypots = “Honeynet”
• Threats analysis (semi-automated):
  • Collect “attack events” from each sensor
  • Multi-dimensional KDD:
    1) Extract relevant nuggets of knowledge → DDDM (with expert-defined features)
       – Using clique algorithms (clique-based clustering) → extraction of maximal weighted cliques
    2) Synthesize those pieces of knowledge to create “concepts” describing the attack phenomena
       – Using clique combinations → DDDM
[Figure: Leurré.com Project, +/- 40 sensors, 30 countries, 5 continents]
Leurre.com / SGNET Honeynet
• Global distributed honeynet (http://www.leurrecom.org)
  • 50+ sensors distributed in more than 30 countries worldwide
  • Ongoing effort of EURECOM since 2003
• Same configuration for all sensors:
  • (V1.0): low-interaction honeypots based on honeyd
  • (V2.0): high-interaction honeypots based on ScriptGen
• Data enrichment:
  • Dataset enriched with contextual information:
    • Geo, reverse-DNS, ASN, external blacklists (SpamHaus, Shadowserver, Dshield, EmergingThreats, etc.)
  • Parsed and uploaded into an Oracle DB
  • All partners have full access (for free) to the whole DB
Research context
WOMBAT
• Worldwide Observatory of Malicious Behaviors And Threats
• EU-FP7 project (http://www.wombat-project.eu)
• Joint effort in collecting, sharing and analyzing data on global Internet threats
Definition 1: Attack profiles
• In our honeynet:
  • A source = an IP address that targets a honeypot platform on a given day, with a certain port sequence.
  • All sources are clustered into “attack profiles” based on certain network characteristics (*):
    • targeted port sequence,
    • #packets,
    • attack duration,
    • packet payload,
    • …
  → Attack tool fingerprint(s)
(*) F. Pouget, M. Dacier, Honeypot-Based Forensics. AusCERT Asia Pacific Information Technology Security Conference 2004.
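As a rough illustration of Definition 1, the sketch below turns packet records into sources keyed by (IP, platform, day) and then buckets the sources by targeted port sequence alone. The record layout, the sample addresses and the function names are assumptions made for this example. Collapsing consecutive repeats of a port is also an assumption, and the clustering described in (*) additionally uses #packets, duration and payload.

```python
from collections import defaultdict

# Hypothetical packet records: (source_ip, platform, day, port_label).
# Port labels use the notation of the slides, e.g. "445T" or "I".
PACKETS = [
    ("192.0.2.1", "p1", "2007-01-05", "I"),
    ("192.0.2.1", "p1", "2007-01-05", "445T"),
    ("192.0.2.1", "p1", "2007-01-05", "445T"),
    ("198.51.100.7", "p1", "2007-01-05", "I"),
    ("198.51.100.7", "p1", "2007-01-05", "445T"),
    ("203.0.113.9", "p2", "2007-01-06", "1433T"),
]

def build_sources(packets):
    """Map (ip, platform, day) -> port sequence string, in arrival order."""
    sequences = defaultdict(list)
    for ip, platform, day, port in packets:
        seq = sequences[(ip, platform, day)]
        if not seq or seq[-1] != port:  # assumption: collapse consecutive repeats
            seq.append(port)
    return {key: "".join(seq) for key, seq in sequences.items()}

def group_by_port_sequence(sources):
    """Bucket sources by port sequence (only one of the profile features)."""
    groups = defaultdict(list)
    for key, seq in sources.items():
        groups[seq].append(key)
    return dict(groups)

sources = build_sources(PACKETS)
profiles = group_by_port_sequence(sources)
```

Here the first two sources end up in the same bucket because both produce the sequence I445T, while the third one stands alone.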
Definition 2: Attack event on sensor ‘x’
[Figure: attack events on sensor ‘x’, labelled Event 1, Event 2 and Event 3]
Dimensions used to create “attack cliques”
• We need to identify salient features for the creation of meaningful cliques (“viewpoints”)
  → expert-defined characteristics for each dimension
• Geolocation
  • Botnets located in specific regions
  • So-called “safe harbors” for the hackers
• IP netblocks / ISPs of origin
  • Bias in worm propagation (e.g. malware coding strategies)
  • “Uncleanliness” of certain networks (e.g. clusters of zombie machines)
  • Many others
• Time series
  • Synchronized activities targeting different sensors
• Targeted sensors
Remark: distances used for distributions → Kullback-Leibler, Chi-2, and Kolmogorov-Smirnov
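To make the remark on distances concrete, here is a minimal sketch of the three measures applied to discrete distributions, such as the share of sources per country for two attack events. The sample counts are invented and the slides do not say which exact variant of each distance was used, so the symmetric chi-square form and the clipping constant eps are assumptions. The Kolmogorov-Smirnov statistic also presumes some fixed ordering of the categories.

```python
import math

def normalize(counts):
    """Turn raw counts into a probability distribution."""
    total = float(sum(counts))
    return [c / total for c in counts]

def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q) in nats; q is clipped at eps."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def chi2_distance(p, q):
    """Symmetric chi-square distance between two discrete distributions."""
    return 0.5 * sum((pi - qi) ** 2 / (pi + qi) for pi, qi in zip(p, q) if pi + qi > 0)

def ks_statistic(p, q):
    """Kolmogorov-Smirnov statistic: largest gap between the cumulative sums."""
    gap = cum_p = cum_q = 0.0
    for pi, qi in zip(p, q):
        cum_p += pi
        cum_q += qi
        gap = max(gap, abs(cum_p - cum_q))
    return gap

# Invented source counts per country (same country order in each vector).
event_a = normalize([80, 15, 5, 0])
event_b = normalize([10, 20, 30, 40])
event_c = normalize([75, 20, 5, 0])
```

With these numbers, event_a is closer to event_c than to event_b under all three measures, which is the property a clique-building step would rely on.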
Cliques combination: creating multi-dimensional “concepts”
[Figure: geographical cliques of attack events + temporal cliques of attack events, combined into a dimension-2 concept; axis: time]
Remark: for each dimension, we extract maximal weighted cliques using the « dominant sets » approximation (note: this needs a full similarity matrix)
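The dominant sets approximation is commonly computed with discrete replicator dynamics on the similarity matrix. The sketch below follows that standard formulation and is not taken from the authors' implementation. The matrix SIM, the thresholds min_cohesion and cutoff, and the peel-off loop are illustrative assumptions.

```python
def replicator_dynamics(sim, nodes, iters=1000, tol=1e-9):
    """Run discrete replicator dynamics on the sub-matrix indexed by `nodes`."""
    n = len(nodes)
    x = [1.0 / n] * n  # start from the barycenter of the simplex
    cohesion = 0.0
    for _ in range(iters):
        ax = [sum(sim[nodes[i]][nodes[j]] * x[j] for j in range(n)) for i in range(n)]
        cohesion = sum(x[i] * ax[i] for i in range(n))  # x^T A x
        if cohesion <= 0.0:  # no similarity left among these nodes
            return x, 0.0
        new_x = [x[i] * ax[i] / cohesion for i in range(n)]
        moved = sum(abs(a - b) for a, b in zip(new_x, x))
        x = new_x
        if moved < tol:
            break
    return x, cohesion  # cohesion is the value from the last iteration

def extract_cliques(sim, min_cohesion=0.1, cutoff=1e-4):
    """Peel off dominant sets one by one until the remainder is incoherent."""
    remaining = list(range(len(sim)))
    cliques = []
    while len(remaining) > 1:
        x, cohesion = replicator_dynamics(sim, remaining)
        members = [node for node, w in zip(remaining, x) if w > cutoff]
        if cohesion < min_cohesion or len(members) < 2:
            break
        cliques.append(members)
        remaining = [node for node in remaining if node not in members]
    return cliques

# Full, symmetric similarity matrix with a zero diagonal (invented values).
SIM = [
    [0.0, 0.9, 0.9, 0.05, 0.05],
    [0.9, 0.0, 0.9, 0.05, 0.05],
    [0.9, 0.9, 0.0, 0.05, 0.05],
    [0.05, 0.05, 0.05, 0.0, 0.8],
    [0.05, 0.05, 0.05, 0.8, 0.0],
]
cliques = extract_cliques(SIM)
```

Every iteration reads all pairwise similarities of the remaining nodes, which is why the full matrix is needed.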
Dynamic creation of concept lattices
Dimensional level:
• Initial set of attack events
• Cliques = D1-concepts
• D2-concepts
• D3-concepts
• D4-concept
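One plausible reading of the lattice levels is that a Dk-concept groups the attack events shared by k cliques taken from k different dimensions. The sketch below builds such concepts by plain set intersection. The slides do not spell out the combination rule, so the intersection, the min_size filter and the sample clique contents are all assumptions for illustration.

```python
from itertools import combinations, product

# Hypothetical cliques per dimension: clique id -> set of attack event ids.
CLIQUES = {
    "geo": {"g1": {1, 2, 3, 4}, "g2": {5, 6}},
    "time": {"ts1": {1, 2, 5}, "ts2": {3, 4, 6}},
    "subnet": {"s1": {1, 2, 3}, "s2": {4, 5, 6}},
}

def concepts(cliques, level, min_size=2):
    """Return {tuple of clique ids: shared events} for every combination
    of `level` distinct dimensions (dimensions taken in sorted name order)."""
    found = {}
    for dims in combinations(sorted(cliques), level):
        for ids in product(*(sorted(cliques[d]) for d in dims)):
            events = set.intersection(*(cliques[d][c] for d, c in zip(dims, ids)))
            if len(events) >= min_size:
                found[ids] = events
    return found

d1 = concepts(CLIQUES, 1)  # the cliques themselves
d2 = concepts(CLIQUES, 2)  # pairs of cliques from two dimensions
d3 = concepts(CLIQUES, 3)  # triples across all three dimensions
```

In this toy data only events 1 and 2 share a clique in all three dimensions, so the top level holds a single concept.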
Some experiments
• Some analysis details:
  • Timeframe: Sep 2006 → June 2008
  • Network traffic volume: 282,363 IP sources (grouped into 351 attack events)
  • Nr of targeted sensors: 36
    • In 20 different countries, 18 different subnets
  • 136 different attack profiles (i.e. attack clusters)
Experimental results
Cliques overview

Attack dimension       | Nr of cliques | Volume of sources (%)
Geolocation            | 45            | 66.4
IP subnets (Class A)   | 30            | 56.0
Targeted sensors       | 17            | 70.1
Attack time series     | 82            | 92.2

Most targeted port sequences, per attack dimension:
• Geolocation: 1027U, I, 1433T, 1026U, I445T, 5900T, 1028U, 9763T, I445T80T, 15264T, 29188T, 6134T, 6769T, 1755T, 64264T, 1028U1027U1026U, 32878T, 64783T, 4152T, 25083T, 9661T, 25618T, …
• IP subnets (Class A): 1027U, I, 1433T, 1026U, I445T, 5900T, 1028U, 9763T, 15264T, 29188T, 6134T, 6769T, 1755T, 50656T, 64264T, 1028U1027U1026U, 32878T, 64783T, 18462T, 4152T, 25083T, 9661T, 25618T, 7690T, …
• Targeted sensors: I, 1433T, I445T, 1025T, 5900T, 1026U, I445T139T445T139T445T, 4662T, 9763T, 1008T, 6211T, I445T80T, 15264T, 29188T, 12293T, 33018T, 6134T, 6769T, 1755T, 2968T, 26912T, 50656T, 64264T, 32878T, …
• Attack time series: 135T, I, 1433T, I445T, 5900T, 1026U, I445T139T445T139T445T, I445T80T, 6769T, 1028U1027U1026U, 50286T, 2967T, …
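The port sequences above are compact strings. Assuming that "I" stands for ICMP and that a number followed by "T" or "U" is a TCP or UDP port (an interpretation, not something the slides define), a small parser such as the following can split them into steps. The function name and the output format are hypothetical.

```python
import re

# One step is either "I" or a port number followed by "T" or "U".
TOKEN = re.compile(r"I|(\d+)([TU])")

def parse_port_sequence(seq):
    """Split e.g. 'I445T80T' into [('ICMP', None), ('TCP', 445), ('TCP', 80)]."""
    seq = seq.replace("-", "")  # also accept the dashed form, e.g. 'I-445T'
    steps, pos = [], 0
    while pos < len(seq):
        match = TOKEN.match(seq, pos)
        if match is None:
            raise ValueError(f"unexpected token at position {pos} in {seq!r}")
        if match.group(0) == "I":
            steps.append(("ICMP", None))
        else:
            proto = "TCP" if match.group(2) == "T" else "UDP"
            steps.append((proto, int(match.group(1))))
        pos = match.end()
    return steps

example = parse_port_sequence("I445T139T445T139T445T")
```

The parsed form makes it easy to compare sequences step by step, or to check which ones consist of ICMP only.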
Visualizing cliques using Multi-dimensional Scaling
• High-dimensional dataset → low-dimensional map retaining the global and local structure
  • ‘Dimensionality reduction’
• Build a matrix with e.g.:
  • Rows = attack events
  • Columns = feature vectors
    • Example: Geolocation → vector of 226 country variables
• MDS techniques
  • Linear → PCA
  • Non-linear → Sammon mapping, Isomap, LLE, (t-)SNE
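The matrix construction and the linear (PCA) case can be sketched in a few lines. The example uses four countries instead of 226, invented source counts, and a plain power iteration that recovers only the first principal axis, so it yields a 1-D map rather than the 2-D maps one would normally plot. All names in it are hypothetical.

```python
import math

COUNTRIES = ["CN", "US", "KR", "DE"]  # stand-in for the 226 country variables

# Hypothetical attack events: country -> number of sources.
EVENTS = {
    "e1": {"CN": 80, "US": 15, "KR": 5},
    "e2": {"CN": 75, "US": 20, "KR": 5},
    "e3": {"US": 10, "KR": 30, "DE": 60},
    "e4": {"US": 15, "KR": 25, "DE": 60},
}

def feature_matrix(events, countries):
    """One row per attack event, one column per country (relative frequency)."""
    rows = []
    for name in sorted(events):
        total = sum(events[name].values())
        rows.append([events[name].get(c, 0) / total for c in countries])
    return rows

def first_principal_axis(rows, iters=200):
    """Power iteration on the covariance matrix of the mean-centered rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / n for b in range(d)] for a in range(d)]
    # Start on a single axis: a constant vector would lie in the null space,
    # because every frequency row sums to 1.
    v = [1.0] + [0.0] * (d - 1)
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

matrix = feature_matrix(EVENTS, COUNTRIES)
axis, scores = first_principal_axis(matrix)
```

On this toy input, events e1 and e2 land on one side of the axis and e3 and e4 on the other, which mirrors how events with similar country distributions end up close together on the map.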
[Figure: MDS map; legend: clique number]
Visualizing cliques using MDS and country labels
Combining cliques: real-world example
Botnet scans on ports: I, I-445T, I-445T-139T, I-445T-80T
[Figure: attack events {1,2,3,…,67} arranged by dimension:
  cliques of time series (over time): ts1, ts2, ts4, ts6
  platform cliques: p7 (superclique)
  geo cliques: g1, g9, g16, g12, g3, g32
  subnet cliques: s4, s12, s19, s26, s28, s30, s2, s24
  annotations: “Only scanners! (ICMP)” and “Only attackers! (I-445T-139T…)”]
Visualizing cliques using Multi-dimensional Scaling
[Figure: MDS map; legend: clique number; regions labelled “scanners” and “attackers”]
Real-world example: botnet attack waves
• Inferred facts:
  • Different waves in time
  • Those 4 botnet waves have hit the same group of platforms
  • Dynamic evolution of the botnet population (IP blocks) between each attack wave
  • Separation of attackers and scanners
Scanners vs attackers …
[Figure: scanning bots vs attacking bots]
Conclusions
• This KDD methodology can produce concise, high-level summaries of attack traffic:
  • Attack cliques deliver insights into global attack phenomena
• Facilitates the interpretation of traffic correlations:
  • Attack concepts are rich in semantics
  • It helps to uncover certain modus operandi
• Flexible and open to additional correlation « viewpoints »:
  • New clique dimensions can be added easily when experts find them relevant (i.e. domain-driven)
Future work
• Integration of other relevant attack features:
  • Botnet / worm patterns separation
  • Malware characteristics (e.g. from high-interaction traffic)
• Find appropriate combination of attack dimensions:
  • Generation of higher-level “concepts” describing real-world phenomena
• Knowledge engineering:
  • Exploit attack concepts → “reasoning system”
  • Decision tree, expert system, kNN, … ?
Thank you.
Any questions?
Note:
If you’d like to participate in the WOMBAT project (*),
please do not hesitate to contact us:
Engin Kirda: [email protected]
Marc Dacier: [email protected]
Olivier Thonnard: [email protected]
(*) Leita, C.; Pham, V.; Thonnard, O.; Ramirez-Silva, E.; Pouget, F.; Kirda, E.; Dacier, M.
The Leurre.com Project: Collecting Internet Threats Information Using a Worldwide Distributed Honeynet.
1st WOMBAT workshop, April 21st-22nd, Amsterdam.
Leurre.com V2.0: SGNET (*)
• Novel high-interaction honeypots
• SGNET = ScriptGen honeypots + Argos emulator + Nepenthes
• Malware analysis: VirusTotal + Anubis Sandbox
[Figure: SGNET architecture; labels: ScriptGen, Anubis, “0-day”, malware repository, automated submissions]
(*) Corrado Leita and Marc Dacier. SGNET: a worldwide deployable framework to support the analysis of malware threat models. (EDCC 2008, Lithuania)
