Bayesian bot detection based on DNS traffic similarity

Presented by Hsi-Shan Chung
1
REFERENCES
• Source: Proceedings of the 2009 ACM Symposium on Applied Computing
• Authors: Ricardo Villamarín-Salomón and José Carlos Brustoloni, University of Pittsburgh, Pittsburgh, PA
2
Outline
• Bayesian Probability
• Bayesian Inference
• The experimental results
• Discussion and Conclusion
3
Bayesian Probability
• An extension of logic that enables reasoning with
uncertain statements.
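• Forward probability (frequentist probability): suppose a bag holds N white balls and M black balls, and you reach in and draw one. What is the probability of drawing a black ball?
• Conversely, suppose we do not know the proportion of black and white balls in the bag in advance. If we draw one (or several) balls with our eyes closed and observe their colors, what can we infer about the proportion of black and white balls in the bag?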
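The inverse question on this slide can be made concrete with a small sketch. It assumes a uniform Beta(1, 1) prior over the unknown fraction of black balls and draws with replacement; the function name is hypothetical and chosen only for illustration.

```python
from fractions import Fraction

def posterior_black_fraction(black_seen: int, draws: int) -> Fraction:
    """Posterior mean of the black-ball fraction under a uniform Beta(1, 1) prior.

    Assumes draws with replacement, so each draw is an independent Bernoulli trial.
    """
    if not 0 <= black_seen <= draws:
        raise ValueError("black_seen must be between 0 and draws")
    # The Beta(1 + black, 1 + white) posterior has mean (black + 1) / (draws + 2).
    return Fraction(black_seen + 1, draws + 2)

# No draws yet: the estimate stays at the prior mean.
print(posterior_black_fraction(0, 0))
# Three black balls in four draws shift the estimate toward black.
print(posterior_black_fraction(3, 4))
```

With no observations the estimate is 1/2, and each observed ball moves it toward the observed proportion.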
4
Bayesian Inference
• Collect evidence that is either consistent or inconsistent with a given hypothesis.
• As evidence accumulates, the degree of belief in
a hypothesis ought to change. With enough
evidence, it should become very high or very low.
• Bayesian inference uses a numerical estimate of
the degree of belief in a hypothesis before
evidence has been observed and calculates a
numerical estimate of the degree of belief in the
hypothesis after evidence has been observed.
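The prior-to-posterior step described above can be sketched for a single binary hypothesis. The likelihood values 0.8 and 0.2 are made up for illustration, and `update_belief` is a hypothetical helper, not something taken from the slides.

```python
def update_belief(prior: float, p_evidence_if_true: float, p_evidence_if_false: float) -> float:
    """One application of Bayes' rule for a binary hypothesis."""
    numerator = prior * p_evidence_if_true
    denominator = numerator + (1.0 - prior) * p_evidence_if_false
    return numerator / denominator

belief = 0.5  # degree of belief before any evidence has been observed
for _ in range(5):
    # Each observation is assumed to be four times likelier if the hypothesis holds.
    belief = update_belief(belief, 0.8, 0.2)
    print(round(belief, 4))
```

Each consistent observation pushes the belief closer to 1; evidence that is likelier under the alternative would push it toward 0 instead.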
5
Bayesian Method
6
Bayesian Method Parameter(1)
• During a specific time period:
• Q: the set of DNS queries made,
• D: the set of domain names queried,
• H: the set of hosts,
• Q_h: the DNS queries made by host h.
• The objective of the Bayesian approach: estimate Pr(host h is infected | Q_h).
7
Bayesian Method Parameter(2)
• B: the blacklist, a set containing the domain names of known C&C servers.
• H_B = { h ∈ H : h queried some b ∈ B }
• D_{H_B}: the set of all domain names queried by hosts in H_B.
• D_H: the set of all domain names queried by hosts in H.
•
=
∩B,
=(
\
).
•
: the set of queries in
.
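As a minimal sketch of how such sets could be derived from captured traffic, the snippet below groups (host, domain) observations and extracts the hosts that queried a blacklisted name, the names those hosts queried, and the names queried by any host. The log format, the variable names, and the `.example` domains are all assumptions made for illustration.

```python
from collections import defaultdict

def build_sets(query_log, blacklist):
    """Derive host and domain-name sets from (host, domain) observations."""
    names_by_host = defaultdict(set)
    for host, name in query_log:
        names_by_host[host].add(name)

    hosts = set(names_by_host)
    # Hosts that queried at least one blacklisted name.
    flagged_hosts = {h for h, names in names_by_host.items() if names & blacklist}
    # Names queried by flagged hosts, and names queried by any host.
    names_of_flagged = set().union(*(names_by_host[h] for h in flagged_hosts))
    all_names = set().union(*names_by_host.values())
    return hosts, flagged_hosts, names_of_flagged, all_names

log = [("pc1", "cc.example"), ("pc1", "update.example"),
       ("pc2", "update.example"), ("pc3", "news.example")]
hosts, flagged, names_of_flagged, all_names = build_sets(log, {"cc.example"})
print(sorted(flagged), sorted(names_of_flagged))
```

Plain Python sets keep the later intersections and differences one-liners.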
8
Bayesian Method Parameter(3)
• DNS name sets, grouped according to the hosts that queried them during the monitoring period.
9
Bayesian Method Parameter(4)
• H_N: the set of uninfected hosts,
• H_I: the set of hosts that are infected but not in H_B (the hosts that queried a blacklisted name).
• Assign a score to every q ∈ Q indicating the probability that a host making it is infected.
• Infected (H_I): higher scores. Uninfected (H_N): lower scores.
• Assign a score to each host that combines the scores of all the queries it made.
10
Bayesian Method
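• Assign each host to the set of infected hosts or the set of uninfected hosts, depending on the score received.
• If an infected host that queried no blacklisted name is in the same botnet as a host that did query one, the intersection of the domain names they queried is expected to be larger than the intersection with a host outside the botnet.
• This is because the two bots receive the same commands and therefore issue overlapping DNS queries.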
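The overlap argument can be illustrated with plain set intersections. The three name sets below are invented for the example and do not come from the traces.

```python
def overlap(names_a: set, names_b: set) -> int:
    """Number of domain names queried by both hosts."""
    return len(names_a & names_b)

# Hypothetical name sets: a known bot, a suspected bot in the same botnet, a clean host.
known_bot = {"cc.example", "drop.example", "update.example"}
suspect = {"drop.example", "update.example", "mail.example"}
clean = {"update.example", "news.example"}

print(overlap(suspect, known_bot), overlap(clean, known_bot))
```

In this toy data the suspect shares two names with the known bot, while the clean host shares only the popular update name.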
11
Bayesian Method(1)
• An indicator variable records whether a host is infected (1) or not (0).
• The likelihood that a query q is made by a host that queried a blacklisted name:
• The likelihood that a query q is made by a host in H:
12
Bayesian Method(2)
• The posterior probability that a host sending query q is infected:
• assuming a prior probability of 0.5 that the host is infected.
13
Bayesian Method
• If the only host h querying a given domain is one that queried a blacklisted name, the posterior for that domain will be 1 (and 0 if h is not such a host).
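A minimal sketch of a per-query posterior with an equal prior is shown below. It assumes that hosts which queried no blacklisted name stand in for the uninfected group, and that each likelihood is the fraction of a group that issued the query; both choices are assumptions made for illustration and may differ from the formulas on the slides. The sketch does reproduce the boundary behaviour stated above (1 when only flagged hosts issue the query, 0 when only unflagged hosts do).

```python
def query_posterior(q_hosts: set, flagged_hosts: set, all_hosts: set) -> float:
    """P(infected | q) with equal priors, from per-group query rates.

    Assumption: hosts outside flagged_hosts stand in for the uninfected group.
    """
    others = all_hosts - flagged_hosts
    # Fraction of each group that issued the query.
    rate_flagged = len(q_hosts & flagged_hosts) / len(flagged_hosts) if flagged_hosts else 0.0
    rate_others = len(q_hosts & others) / len(others) if others else 0.0
    if rate_flagged + rate_others == 0.0:
        return 0.5  # nobody issued the query: stay at the prior
    # Bayes' rule with P(infected) = 0.5 reduces to a ratio of likelihoods.
    return rate_flagged / (rate_flagged + rate_others)

all_hosts = {"pc1", "pc2", "pc3", "pc4"}
flagged = {"pc1"}
print(query_posterior({"pc1"}, flagged, all_hosts))         # only a flagged host asked
print(query_posterior({"pc2"}, flagged, all_hosts))         # only an unflagged host asked
print(query_posterior({"pc1", "pc2"}, flagged, all_hosts))  # both groups asked
```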
14
Bayesian Statistics(1)
• x: the a priori belief that a domain name that was never queried before will be queried by an infected host.
• Assume that the infection indicator is a binomial random variable with a beta-distributed prior, and that it represents the infection classification of a host making a specific query q.
• Next, we perform n trials, each of which consists of examining the next host h ∈ H that sent q to see if it is infected. If it is, we consider the experiment successful and set the indicator to 1.
15
Bayesian Statistics(2)
• The probability that the (n+1)-th trial will be successful can be expressed as (α + s) / (α + β + n).
• α and β are the parameters of the Beta distribution, n is the number of trials, and s is the number of successes involving q.
• Define f = α + β and α = f·x, where f is a constant interpreted as the strength we give to x. The expression then becomes (f·x + s) / (f + n).
16
Bayesian Statistics(3)
• In this case s was approximated using the total number of times a query has been made during the traffic monitoring period.
• If we take, for instance, f = 1 and x = 0.5, then a query receives a neutral score of 0.5 before it is sent by any host.
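The smoothing step can be sketched with the standard Beta-binomial estimate (f·x + s) / (f + n). The approximation s ≈ n·p, where p is the unsmoothed probability for the query, is an assumption borrowed from similar Bayesian filters; the function and parameter names are hypothetical.

```python
def smoothed_score(p: float, n: int, f: float = 1.0, x: float = 0.5) -> float:
    """Beta-binomial estimate (f*x + s) / (f + n) with s approximated as n * p.

    p: unsmoothed probability for the query, n: times the query was seen,
    x: prior belief, f: strength given to x. The s = n * p step is an assumption.
    """
    s = n * p
    return (f * x + s) / (f + n)

print(smoothed_score(p=1.0, n=0))   # unseen query: falls back to the prior x
print(smoothed_score(p=1.0, n=1))   # one sighting pulls the score toward p
print(smoothed_score(p=1.0, n=20))  # many sightings: the score approaches p
```

A larger f keeps scores near x for longer, which protects rarely seen names from extreme scores.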
17
Bayesian Statistics(4)
• The host score takes the geometric mean of the host’s most extreme query scores (those closest to 0 and 1), selected with two thresholds:
18
Bayesian Statistics(5)
• N(h) and I(h) indicate how likely it is that a host is non-infected or infected, respectively.
• Define a combined score (between -1 and 1):
• Or, a combined score between 0 and 1:
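One way to combine query scores along these lines is sketched below. The formulas are assumptions, not taken from the slides: they follow the geometric-mean combination commonly used in Bayesian spam filters, with an infected indicator, a non-infected indicator, a signed score in [-1, 1], and a rescaled score in [0, 1]. The threshold defaults 0.1 and 0.9 are placeholders.

```python
import math

def combine(scores, low=0.1, high=0.9):
    """Combine query scores into a host score in [0, 1].

    Keeps only extreme scores (<= low or >= high), then contrasts two
    geometric-mean indicators. Formulas and defaults are assumptions.
    """
    extreme = [s for s in scores if s <= low or s >= high]
    if not extreme:
        return 0.5  # no informative query: neutral score
    n = len(extreme)
    eps = 1e-12  # clamp so that log(0) never occurs
    clipped = [min(max(s, eps), 1 - eps) for s in extreme]
    # Close to 1 when the extreme scores are high.
    infected = 1 - math.exp(sum(math.log(1 - s) for s in clipped) / n)
    # Close to 1 when the extreme scores are low.
    clean = 1 - math.exp(sum(math.log(s) for s in clipped) / n)
    signed = (infected - clean) / (infected + clean)  # between -1 and 1
    return (1 + signed) / 2                            # between 0 and 1

print(combine([0.99, 0.97, 0.5]))  # mostly high scores
print(combine([0.01, 0.02, 0.5]))  # mostly low scores
```

A host would then be flagged when the rescaled score reaches a chosen cut-off.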
19
METHODOLOGY
• Bots: Backdoor.Win32.SdBot.cmz and Net-Worm.Win32.Bobic.k.
• DNS traffic from uninfected hosts: 89 PCs in instructional laboratories, captured during two 24-hour periods. We named these traces CSL-1 (February 13-14) and CSL-2 (February 14-15).
• DNS traffic from infected hosts:
• We also obtained a daily blacklist of known C&C servers from Shadowserver.
20
METHODOLOGY
• Number of DNS names queried per trace:
21
METHODOLOGY
• We picked m traces of our n infected hosts and altered their DNS packets by obfuscating any blacklisted names in them.
• The names were obfuscated by appending a non-existent ccTLD (.nv) to them.
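The obfuscation step can be sketched at the level of domain-name strings. Rewriting real DNS packets involves more than string handling, so this is only an illustration; the `obfuscate` helper and the `.example` names are hypothetical.

```python
def obfuscate(name: str, blacklist: set, suffix: str = ".nv") -> str:
    """Append a non-existent ccTLD to blacklisted names; leave others unchanged."""
    normalized = name.rstrip(".").lower()
    return normalized + suffix if normalized in blacklist else name

blacklist = {"cc.example"}
print([obfuscate(n, blacklist) for n in ["cc.example", "update.example"]])
```

After this rewrite the altered name no longer matches the blacklist, so the host that queried it is not flagged directly and must be found through its other queries.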
22
METHODOLOGY
• Number of DNS names queried in each test trace:
23
METHODOLOGY
• TP: infected hosts classified as infected.
• FN: infected hosts classified as non-infected.
• True Positive Rate (TPR): TPR = TP / (TP + FN)
• FP: non-infected hosts classified as infected.
• TN: non-infected hosts classified as non-infected.
• False Positive Rate (FPR): FPR = FP / (FP + TN)
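The two rates follow the standard definitions TPR = TP / (TP + FN) and FPR = FP / (FP + TN). A minimal sketch, with hypothetical host names and a set-based representation chosen for illustration:

```python
def rates(predicted_infected: set, truly_infected: set, all_hosts: set):
    """Return (TPR, FPR) for a set-based classification of hosts."""
    clean = all_hosts - truly_infected
    tp = len(predicted_infected & truly_infected)  # infected, classified infected
    fn = len(truly_infected - predicted_infected)  # infected, classified clean
    fp = len(predicted_infected & clean)           # clean, classified infected
    tn = len(clean - predicted_infected)           # clean, classified clean
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr

all_hosts = {"a", "b", "c", "d", "e"}
tpr, fpr = rates(predicted_infected={"a", "c"}, truly_infected={"a", "b"}, all_hosts=all_hosts)
print(tpr, fpr)
```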
24
EXPERIMENTAL RESULTS
• Finding parameters: we applied the Bayesian method to the test traces in CSL-1-Sdbot-T.
•
=0.95 and
=0.5, and we consider any
host with a score P(h)≥0.9 to be infected.
•
, because in our traces there are
hundreds of queries with very low
scores.
25
EXPERIMENTAL RESULTS
• Average False Positive Rate for traces in CSL-1-Sdbot-T
26
EXPERIMENTAL RESULTS
• Average True Positive Rate for traces in CSL-1-Sdbot-T
27
EXPERIMENTAL RESULTS
• Average True Positive Rate for traces in CSL-2-Bobic.k-T
28
Discussion and Conclusion
• If parameters are not well tuned:
1. False positives: a domain name is queried only by an infected host and one or a few of the uninfected hosts.
2. False negatives: very popular domain names are queried by both infected and uninfected hosts, which yields many queries with a score near zero.
• These errors do not occur if the parameters are properly tuned.
• The technique successfully recognized C&C
servers with multiple domain names.
29