Bayesian Network - Computer Science Department


BAYESIAN NETWORK
Submitted By
 Faisal Islam
 Srinivasan Gopalan
 Vaibhav Mittal
 Vipin Makhija
Prof. Anita Wasilewska
State University of New York at Stony
Brook
References









[1] Jiawei Han: "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, ISBN 1-55860-489-8.
[2] Stuart Russell, Peter Norvig: "Artificial Intelligence: A Modern Approach", Pearson Education.
[3] Kandasamy, Thilagavathi, Gunavathi: "Probability, Statistics and Queueing Theory", Sultan Chand Publishers.
[4] D. Heckerman: "A Tutorial on Learning with Bayesian Networks", in "Learning in Graphical Models", ed. M.I. Jordan, The MIT Press, 1998.
[5] http://en.wikipedia.org/wiki/Bayesian_probability
[6] http://www.construction.ualberta.ca/civ606/myFiles/Intro%20to%20Belief%20Network.pdf
[7] http://www.murrayc.com/learning/AI/bbn.shtml
[8] http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html
[9] http://en.wikipedia.org/wiki/Bayesian_belief_network
CONTENTS







HISTORY
CONDITIONAL PROBABILITY
BAYES THEOREM
NAÏVE BAYES CLASSIFIER
BELIEF NETWORK
APPLICATION OF BAYESIAN NETWORK
PAPER ON CYBER CRIME DETECTION
HISTORY




Bayesian probability was named after Reverend Thomas Bayes (1702-1761).
He proved a special case of what is currently known as Bayes' Theorem.
The term "Bayesian" came into use around the 1950s.
Pierre-Simon, Marquis de Laplace (1749-1827) independently proved a generalized version of Bayes' Theorem.
http://en.wikipedia.org/wiki/Bayesian_probability
HISTORY (Cont.)





1950s – New knowledge in Artificial Intelligence
1958 – Genetic Algorithms by Friedberg (Holland and Goldberg ~1985)
1965 – Fuzzy Logic by Zadeh at UC Berkeley
1970 – Bayesian Belief Networks at Stanford University (Judea Pearl 1988)
The ideas proposed above were not fully developed until later. BBNs became popular in the 1990s.
http://www.construction.ualberta.ca/civ606/myFiles/Intro%20to%20Belief%20Network.pdf
HISTORY (Cont.)
Current uses of Bayesian Networks:
 Microsoft’s printer troubleshooter.
 Diagnose diseases (Mycin).
 Used to predict oil and stock prices
 Control the space shuttle
Risk Analysis – Schedule and Cost Overruns.
CONDITIONAL PROBABILITY


Probability : How likely is it that an event will happen?
Sample Space S




Element of S: elementary event
An event A is a subset of S
P(A)
P(S) = 1
Events A and B

P(A|B) – the probability that event A occurs given that event B has already occurred.
Example:
There are 2 baskets. B1 has 2 red balls and 5 blue balls. B2 has 4 red balls and 3 blue balls. What is the probability of picking a red ball from basket 1?
CONDITIONAL PROBABILITY
The question above asks for P(red ball | basket 1).
 Intuitively, we restrict the sample space to basket 1 only and compute the probability of a red ball there.
 So the answer is 2/7.
 The equation used to solve it is:
P(A|B) = P(A∩B)/P(B)  [Product Rule]
P(A,B) = P(A)*P(B)  [If A and B are independent]
How do you solve P(basket 2 | red ball)?

BAYESIAN THEOREM
A special case of Bayes' Theorem:
P(A∩B) = P(B) x P(A|B)
P(B∩A) = P(A) x P(B|A)
Since P(A∩B) = P(B∩A),
P(B) x P(A|B) = P(A) x P(B|A)
=> P(A|B) = [P(A) x P(B|A)] / P(B)

BAYESIAN THEOREM
Solution to P(basket 2 | red ball):
P(basket 2 | red ball) = [P(b2) x P(r|b2)] / P(r)
where P(r) = P(b1) x P(r|b1) + P(b2) x P(r|b2) = (1/2)(2/7) + (1/2)(4/7) = 6/14
= [(1/2) x (4/7)] / (6/14)
= 2/3 ≈ 0.67
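
As a quick check of this arithmetic, here is a minimal Python sketch of the two-basket example (the basket contents and the uniform prior over baskets come from the example above):

# Two-basket example: compute P(basket 2 | red ball) with Bayes' rule.
baskets = {
    "b1": {"red": 2, "blue": 5},   # basket 1: 2 red, 5 blue
    "b2": {"red": 4, "blue": 3},   # basket 2: 4 red, 3 blue
}
prior = {"b1": 0.5, "b2": 0.5}      # each basket is equally likely to be chosen

def p_color_given_basket(color, basket):
    counts = baskets[basket]
    return counts[color] / sum(counts.values())

# Total probability of drawing a red ball (law of total probability).
p_red = sum(prior[b] * p_color_given_basket("red", b) for b in baskets)

# Bayes' rule: P(b2 | red) = P(b2) * P(red | b2) / P(red)
posterior_b2 = prior["b2"] * p_color_given_basket("red", "b2") / p_red
print(round(p_red, 4), round(posterior_b2, 4))   # 0.4286 0.6667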
BAYESIAN THEOREM
Example 2: A medical cancer diagnosis problem.
There are 2 possible outcomes of a diagnosis: +ve, -ve. We know 0.8% of the world population has cancer. The test gives a correct +ve result 98% of the time and a correct -ve result 97% of the time.
If a patient's test returns +ve, should we diagnose the patient as having cancer?

BAYESIAN THEOREM
P(cancer) = .008
P(+ve|cancer) = .98
P(+ve|-cancer) = .03
P(-cancer) = .992
P(-ve|cancer) = .02
P(-ve|-cancer) = .97
Using Bayes' formula:
P(cancer|+ve) = P(+ve|cancer) x P(cancer) / P(+ve) = (0.98 x 0.008) / P(+ve) = 0.0078 / P(+ve)
P(-cancer|+ve) = P(+ve|-cancer) x P(-cancer) / P(+ve) = (0.03 x 0.992) / P(+ve) = 0.0298 / P(+ve)
Since 0.0298 > 0.0078, the patient most likely does not have cancer.
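
A short Python sketch of the same calculation, normalizing the two unnormalized scores so they sum to one (all numbers are taken from the slide above):

# Cancer-diagnosis example: compare P(cancer | +ve) and P(-cancer | +ve).
p_cancer, p_not_cancer = 0.008, 0.992
p_pos_given_cancer, p_pos_given_not = 0.98, 0.03

score_cancer = p_pos_given_cancer * p_cancer   # 0.00784
score_not = p_pos_given_not * p_not_cancer     # 0.02976

# Dividing both scores by P(+ve) does not change the comparison;
# normalizing them gives the actual posterior probabilities.
p_pos = score_cancer + score_not
print(round(score_cancer / p_pos, 2))   # ~0.21 -> P(cancer | +ve)
print(round(score_not / p_pos, 2))      # ~0.79 -> P(-cancer | +ve), so diagnose 'no cancer'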
BAYESIAN THEOREM

General Bayes' Theorem:
Given mutually disjoint and exhaustive events E1, E2, …, En with P(Ei) ≠ 0 (i = 1, 2, …, n), and any event A with P(A) > 0:
P(Ei|A) = [P(Ei) x P(A|Ei)] / Σj [P(Ej) x P(A|Ej)],  i = 1, 2, …, n
BAYESIAN THEOREM

Example:
There are 3 boxes. B1 has 2 white, 3 black and 4 red balls. B2 has 3 white, 2 black and 2 red balls. B3 has 4 white, 1 black and 3 red balls. A box is chosen at random and 2 balls are drawn; one is white and the other is red. What is the probability that they came from the first box?
BAYESIAN THEOREM
Let E1, E2, E3 denote the events of choosing B1, B2, B3 respectively. Let A be the event that the 2 balls selected are white and red.
P(E1) = P(E2) = P(E3) = 1/3
P(A|E1) = [2C1 x 4C1] / 9C2 = 2/9
P(A|E2) = [3C1 x 2C1] / 7C2 = 2/7
P(A|E3) = [4C1 x 3C1] / 8C2 = 3/7
BAYESIAN THEOREM
P(E1|A) = [P(E1) x P(A|E1)] / Σj [P(Ej) x P(A|Ej)]
= 0.23727
P(E2|A) = 0.30509
P(E3|A) = 1 – (0.23727 + 0.30509) = 0.45764
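
The same posteriors can be reproduced with a few lines of Python, using math.comb for the combinations (box contents and the uniform prior are as in the example):

from math import comb

# Box contents as (white, black, red) counts.
boxes = {"E1": (2, 3, 4), "E2": (3, 2, 2), "E3": (4, 1, 3)}
prior = {e: 1/3 for e in boxes}

def p_white_and_red(contents):
    # Probability that an unordered draw of 2 balls gives one white and one red.
    w, b, r = contents
    total = w + b + r
    return comb(w, 1) * comb(r, 1) / comb(total, 2)

evidence = sum(prior[e] * p_white_and_red(c) for e, c in boxes.items())
posterior = {e: prior[e] * p_white_and_red(c) / evidence for e, c in boxes.items()}
print({e: round(p, 5) for e, p in posterior.items()})
# {'E1': 0.23729, 'E2': 0.30508, 'E3': 0.45763}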
BAYESIAN CLASSIFICATION
Why use Bayesian Classification:
 Probabilistic learning: calculates explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems.
 Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
BAYESIAN CLASSIFICATION


Probabilistic prediction: predicts multiple hypotheses, weighted by their probabilities.
Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured.
NAÏVE BAYES CLASSIFIER

A simplified assumption: attributes are conditionally independent given the class.

This greatly reduces the computation cost: only the class distribution and the per-class attribute counts need to be estimated.
NAÏVE BAYES CLASSIFIER
The probabilistic model of the NBC is to find the probability of a certain class given multiple attribute values that are assumed to be conditionally independent.
The naïve Bayes classifier applies to learning tasks where each instance x is described by a conjunction of attribute values and where the target function f(x) can take on any value from some finite set V. A set of training examples of the target function is provided, and a new instance is presented, described by the tuple of attribute values <a1,a2,…,an>. The learner is asked to predict the target value, or classification, for this new instance.
NAÏVE BAYES CLASSIFIER
Abstractly, the probability model for a classifier is a conditional model
P(C|F1,F2,…,Fn)
over a dependent class variable C with a small number of outcomes, or classes, conditioned on several feature variables F1,…,Fn.
Naïve Bayes formula:
P(C|F1,F2,…,Fn) = [P(C) x P(F1|C) x P(F2|C) x…x P(Fn|C)] / P(F1,F2,…,Fn)
and the predicted class is the value of C that maximizes the numerator. Since P(F1,F2,…,Fn) is common to all classes, we do not need to evaluate the denominator for comparisons.
NAÏVE BAYES CLASSIFIER
Tennis-Example
NAÏVE BAYES CLASSIFIER
Problem:
Use the training data from above to classify the following instances:
a) <Outlook=sunny, Temperature=cool, Humidity=high, Wind=strong>
b) <Outlook=overcast, Temperature=cool, Humidity=high, Wind=strong>
NAÏVE BAYES CLASSIFIER
Answer to (a):
P(PlayTennis=yes) = 9/14 = 0.64
P(PlayTennis=no) = 5/14 = 0.36
P(Outlook=sunny|PlayTennis=yes) = 2/9 = 0.22
P(Outlook=sunny|PlayTennis=no) = 3/5 = 0.60
P(Temperature=cool|PlayTennis=yes) = 3/9 = 0.33
P(Temperature=cool|PlayTennis=no) = 1/5 = 0.20
P(Humidity=high|PlayTennis=yes) = 3/9 = 0.33
P(Humidity=high|PlayTennis=no) = 4/5 = 0.80
P(Wind=strong|PlayTennis=yes) = 3/9 = 0.33
P(Wind=strong|PlayTennis=no) = 3/5 = 0.60
NAÏVE BAYES CLASSIFIER
P(yes)xP(sunny|yes)xP(cool|yes)xP(high|yes)xP(strong|yes) = 0.0053
P(no)xP(sunny|no)xP(cool|no)xP(high|no)xP(strong|no) = 0.0206
So the class for this instance is 'no'. We can normalize the probability by:
[0.0206]/[0.0206+0.0053] = 0.795
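
A small Python sketch that plugs the slide's estimates into the naive Bayes decision rule for instance (a); the probability values are copied from the slides above:

# Naive Bayes scoring for instance (a): sunny, cool, high, strong.
p = {
    "yes": {"prior": 9/14, "sunny": 2/9, "cool": 3/9, "high": 3/9, "strong": 3/9},
    "no":  {"prior": 5/14, "sunny": 3/5, "cool": 1/5, "high": 4/5, "strong": 3/5},
}

scores = {}
for cls, probs in p.items():
    score = probs["prior"]
    for attr in ("sunny", "cool", "high", "strong"):
        score *= probs[attr]          # multiply the class-conditional probabilities
    scores[cls] = score

print({c: round(s, 4) for c, s in scores.items()})    # {'yes': 0.0053, 'no': 0.0206}
print(max(scores, key=scores.get))                    # 'no'
print(round(scores["no"] / sum(scores.values()), 3))  # 0.795 after normalization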
NAÏVE BAYES CLASSIFIER
Answer to (b):
P(PlayTennis=yes) = 9/14 = 0.64
P(PlayTennis=no) = 5/14 = 0.36
P(Outlook=overcast|PlayTennis=yes) = 4/9 = 0.44
P(Outlook=overcast|PlayTennis=no) = 0/5 = 0
P(Temperature=cool|PlayTennis=yes) = 3/9 = 0.33
P(Temperature=cool|PlayTennis=no) = 1/5 = .20
P(Humidity=high|PlayTennis=yes) = 3/9 = 0.33
P(Humidity=high|PlayTennis=no) = 4/5 = 0.80
P(Wind=strong|PlayTennis=yes) = 3/9 = 0.33
P(Wind=strong|PlayTennis=no) = 3/5 = 0.60
NAÏVE BAYES CLASSIFIER
Estimating Probabilities:
In the previous example, P(overcast|no) = 0, which makes the product
P(no)xP(overcast|no)xP(cool|no)xP(high|no)xP(strong|no) = 0.0
This causes problems in the comparison because the other probabilities are not considered at all. We can avoid this difficulty by using the m-estimate.
NAÏVE BAYES CLASSIFIER
M-estimate formula:
(c + k) / (n + m), where c/n is the raw ratio used before, m is the equivalent sample size (here, the number of possible values of the attribute), and k = m x p = 1 under a uniform prior p = 1/m.
Using this method, our new probability values are given below.
NAÏVE BAYES CLASSIFIER
New answer to (b):
P(PlayTennis=yes) = 10/16 = 0.63
P(PlayTennis=no) = 6/16 = 0.37
P(Outlook=overcast|PlayTennis=yes) = 5/12 = 0.42
P(Outlook=overcast|PlayTennis=no) = 1/8 = .13
P(Temperature=cool|PlayTennis=yes) = 4/12 = 0.33
P(Temperature=cool|PlayTennis=no) = 2/8 = .25
P(Humidity=high|PlayTennis=yes) = 4/11 = 0.36
P(Humidity=high|PlayTennis=no) = 5/7 = 0.71
P(Wind=strong|PlayTennis=yes) = 4/11 = 0.36
P(Wind=strong|PlayTennis=no) = 4/7 = 0.57
NAÏVE BAYES CLASSIFIER
P(yes)xP(overcast|yes)xP(cool|yes)xP(high|yes)xP(strong|yes) = 0.011
P(no)xP(overcast|no)xP(cool|no)xP(high|no)xP(strong|no) = 0.00486
So the class of this instance is 'yes'.
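
A minimal sketch of the m-estimate smoothing used above; c and n are the raw counts, m is the number of possible values of the attribute, and a uniform prior p = 1/m is assumed so that m*p = 1 is added to the numerator:

def m_estimate(c, n, m, p=None):
    # Smoothed estimate (c + m*p) / (n + m); with a uniform prior p = 1/m
    # this becomes (c + 1) / (n + m), as used in the slides.
    if p is None:
        p = 1 / m
    return (c + m * p) / (n + m)

# P(Outlook=overcast | PlayTennis=no): raw count 0 out of 5, Outlook has 3 values.
print(round(m_estimate(0, 5, 3), 2))   # 0.13 instead of 0.0
# P(Humidity=high | PlayTennis=yes): raw count 3 out of 9, Humidity has 2 values.
print(round(m_estimate(3, 9, 2), 2))   # 0.36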
NAÏVE BAYES CLASSIFIER
The conditional probability values of all the attributes with respect to the class are pre-computed and stored on disk.
 This prevents the classifier from computing the conditional probabilities every time it runs.
 This stored data can be reused to reduce the computation required for later classifications.
BAYESIAN BELIEF NETWORK



In the Naïve Bayes Classifier we make the assumption of class conditional independence, that is, given the class label of a sample, the values of the attributes are conditionally independent of one another.
However, there can be dependencies between the values of attributes. To handle this we use a Bayesian Belief Network, which provides a joint conditional probability distribution.
A Bayesian network is a form of probabilistic graphical model. Specifically, a Bayesian network is a directed acyclic graph of nodes representing variables and arcs representing dependence relations among the variables.
BAYESIAN BELIEF NETWORK


A Bayesian network is a representation of the joint distribution over all the variables represented by nodes in the graph. Let the variables be X(1), ..., X(n).
Let Parents(A) be the parents of the node A. Then the joint distribution for X(1) through X(n) is represented as the product of the probability distributions P(Xi|Parents(Xi)) for i = 1 to n. If X has no parents, its probability distribution is said to be unconditional, otherwise it is conditional.
BAYESIAN BELIEF NETWORK
(Figure: the Cloudy, Sprinkler, Rain, Wet Grass network with its conditional probability tables.)
BAYESIAN BELIEF NETWORK
By the chain rule of probability, the joint probability of all the nodes in the graph above is:
P(C, S, R, W) = P(C) * P(S|C) * P(R|C) * P(W|S,R)
W = Wet Grass, C = Cloudy, R = Rain, S = Sprinkler
Example: P(W∩-R∩S∩C)
= P(W|S,-R) * P(-R|C) * P(S|C) * P(C)
= 0.9 * 0.2 * 0.1 * 0.5 = 0.009

BAYESIAN BELIEF NETWORK
What is the probability of wet grass on a given day, P(W)?
P(W) = P(W|S,R) * P(S) * P(R) +
P(W|S,-R) * P(S) * P(-R) +
P(W|-S,R) * P(-S) * P(R) +
P(W|-S,-R) * P(-S) * P(-R)
Here P(S) = P(S|C) * P(C) + P(S|-C) * P(-C)
P(R) = P(R|C) * P(C) + P(R|-C) * P(-C)
P(W) = 0.5985
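
The sketch below reproduces both numbers in Python. The entries P(C)=0.5, P(S|C)=0.1, P(R|C)=0.8 and P(W|S,-R)=0.9 come from the example above; the remaining CPT entries (P(S|-C), P(R|-C), P(W|S,R), P(W|-S,R), P(W|-S,-R)) are the usual textbook values for this network and are assumptions here. The P(W) step follows the slide in treating S and R as independent when marginalizing:

# Sprinkler network: C -> S, C -> R, and (S, R) -> W.
p_c = 0.5
p_s_given_c = {True: 0.1, False: 0.5}    # P(S | C); P(S | -C) is an assumed value
p_r_given_c = {True: 0.8, False: 0.2}    # P(R | C); P(R | -C) is an assumed value
p_w_given_sr = {                         # P(W | S, R); only (S, -R) = 0.9 is from the slide
    (True, True): 0.99, (True, False): 0.9,
    (False, True): 0.9, (False, False): 0.0,
}

# Joint probability of one full assignment: P(W, -R, S, C)
joint = p_w_given_sr[(True, False)] * (1 - p_r_given_c[True]) * p_s_given_c[True] * p_c
print(round(joint, 4))                   # 0.009

# Marginals of S and R, then P(W) exactly as written on the slide.
p_s = p_s_given_c[True] * p_c + p_s_given_c[False] * (1 - p_c)   # 0.3
p_r = p_r_given_c[True] * p_c + p_r_given_c[False] * (1 - p_c)   # 0.5
p_w = sum(p_w_given_sr[(s, r)] * (p_s if s else 1 - p_s) * (p_r if r else 1 - p_r)
          for s in (True, False) for r in (True, False))
print(round(p_w, 4))                     # 0.5985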
Advantages of the Bayesian Approach



Bayesian networks can readily handle
incomplete data sets.
Bayesian networks allow one to learn
about causal relationships
Bayesian networks readily facilitate use
of prior knowledge.
APPLICATIONS
OF
BAYESIAN NETWORKS
Sources/References

Naive Bayes Spam Filtering Using Word-Position-Based Attributes, Johan Hovold, Department of Computer Science, Lund University, Box 118, 221 00 Lund, Sweden. [E-mail: [email protected]] [Presented at CEAS 2005, Second Conference on Email and Anti-Spam, July 21-22, Stanford University.] http://www.ceas.cc/papers-2005/144.pdf

Tom Mitchell, "Machine Learning", Tata McGraw-Hill.

"A Bayesian Approach to Filtering Junk E-Mail", Mehran Sahami, Susan Dumais, David Heckerman, Eric Horvitz; Computer Science Department, Stanford University, Stanford, CA, and Microsoft Research, Redmond, WA. [Presented at the AAAI Workshop on Learning for Text Categorization, July 1998, Madison, Wisconsin.]
Problem???




A real-world Bayesian network application: "Learning to classify text."
Instances are text documents.
We might wish to learn the target concept "electronic news articles that I find interesting," or "pages on the World Wide Web that discuss data mining topics."
In both cases, if a computer could learn the target concept accurately, it could automatically filter the large volume of online text documents to present only the most relevant documents to the user.
TECHNIQUE





Learning how to classify text, based on the naive Bayes classifier.
It is a probabilistic approach and is among the most effective algorithms currently known for learning to classify text documents.
The instance space X consists of all possible text documents.
We are given training examples of some unknown target function f(x), which can take on any value from some finite set V.
We will consider the target function classifying documents as interesting or uninteresting to a particular person, using the target values like and dislike to indicate these two classes.
Design issues

How to represent an arbitrary text document in terms of attribute values.

How to estimate the probabilities required by the naive Bayes classifier.
Approach

Our approach to representing arbitrary text documents is disturbingly simple: Given a text document, such as this paragraph, we define an attribute for each word position in the document and define the value of that attribute to be the English word found in that position. Thus, the current paragraph would be described by 111 attribute values, corresponding to the 111 word positions. The value of the first attribute is the word "our," the value of the second attribute is the word "approach," and so on. Notice that long text documents will require a larger number of attributes than short documents. As we shall see, this will not cause us any trouble.
ASSUMPTIONS



assume we are given a set of 700
training documents that a friend has
classified as dislike and another 300
she has classified as like
We are now given a new document and
asked to classify it
let us assume the new text document is
the preceding paragraph




We know P(like) = .3 and P(dislike) = .7 in the current example.
P(ai = wk|vj) (here we introduce wk to indicate the kth word in the English vocabulary).
Estimating the class conditional probabilities (e.g., P(ai = "our" | dislike)) is more problematic because we must estimate one such probability term for each combination of text position, English word, and target value.
There are approximately 50,000 distinct words in the English vocabulary, 2 possible target values, and 111 text positions in the current example, so we must estimate 2 * 111 * 50,000 ≈ 10 million such terms from the training data.





We shall assume the probability of encountering a specific word wk (e.g., "chocolate") is independent of the specific word position being considered (e.g., a23 versus a95).
We estimate the entire set of probabilities P(a1 = wk|vj), P(a2 = wk|vj), ... by the single position-independent probability P(wk|vj).
The net effect is that we now require only 2 * 50,000 distinct terms of the form P(wk|vj).
We adopt the m-estimate, with uniform priors and with m equal to the size of the word vocabulary:
P(wk|vj) = (nk + 1) / (n + |Vocabulary|)
where n is the total number of word positions in all training examples whose target value is vj, nk is the number of times word wk is found among these n word positions, and |Vocabulary| is the total number of distinct words (and other tokens) found within the training data.
Final Algorithm

LEARN_NAIVE_BAYES_TEXT(Examples, V)
Examples is a set of text documents along with their target values. V is the set of all possible target values. This function learns the probability terms P(wk|vj), describing the probability that a randomly drawn word from a document in class vj will be the English word wk. It also learns the class prior probabilities P(vj).
1. Collect all words, punctuation, and other tokens that occur in Examples
• Vocabulary ← the set of all distinct words and tokens occurring in any text document from Examples
2. Calculate the required P(vj) and P(wk|vj) probability terms
• For each target value vj in V do
• docsj ← the subset of documents from Examples for which the target value is vj
• P(vj) ← |docsj| / |Examples|
• Textj ← a single document created by concatenating all members of docsj
• n ← total number of word positions in Textj
• for each word wk in Vocabulary
• nk ← number of times word wk occurs in Textj
• P(wk|vj) ← (nk + 1) / (n + |Vocabulary|)

CLASSIFY_NAIVE_BAYES_TEXT(Doc)
Return the estimated target value for the document Doc. ai denotes the word found in the ith position within Doc.
• positions ← all word positions in Doc that contain tokens found in Vocabulary
• Return vNB, where vNB = argmax over vj in V of P(vj) * Π over i in positions of P(ai|vj)

During learning, the procedure LEARN_NAIVE_BAYES_TEXT examines all training documents to extract the vocabulary of all words and tokens that appear in the text, then counts their frequencies among the different target classes to obtain the necessary probability estimates. Later, given a new document to be classified, the procedure CLASSIFY_NAIVE_BAYES_TEXT uses these probability estimates to calculate vNB according to the equation above. Note that any words appearing in the new document that were not observed in the training set are simply ignored by CLASSIFY_NAIVE_BAYES_TEXT.
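
A compact Python sketch of the two procedures above: a plain bag-of-words learner with the (nk + 1)/(n + |Vocabulary|) estimate, plus a classifier that works in log space to avoid underflow. The tokenizer and the tiny training set are illustrative placeholders:

from collections import Counter, defaultdict
import math, re

def tokenize(doc):
    # Placeholder tokenizer: lowercase alphabetic tokens.
    return re.findall(r"[a-z']+", doc.lower())

def learn_naive_bayes_text(examples):
    # examples: list of (document_text, target_value) pairs.
    vocabulary = set()
    docs_by_class = defaultdict(list)
    for doc, v in examples:
        tokens = tokenize(doc)
        vocabulary.update(tokens)
        docs_by_class[v].append(tokens)

    priors, word_probs = {}, {}
    for v, docs in docs_by_class.items():
        priors[v] = len(docs) / len(examples)               # P(vj)
        counts = Counter(tok for doc in docs for tok in doc)
        n = sum(counts.values())                            # word positions in Text_j
        word_probs[v] = {w: (counts[w] + 1) / (n + len(vocabulary))
                         for w in vocabulary}               # P(wk | vj)
    return priors, word_probs, vocabulary

def classify_naive_bayes_text(doc, priors, word_probs, vocabulary):
    positions = [w for w in tokenize(doc) if w in vocabulary]
    scores = {v: math.log(priors[v]) + sum(math.log(word_probs[v][w]) for w in positions)
              for v in priors}
    return max(scores, key=scores.get)                      # v_NB

# Tiny made-up training set, only to show the call pattern.
train = [("I loved this article", "like"),
         ("boring and useless text", "dislike"),
         ("great and interesting read", "like"),
         ("I disliked this boring article", "dislike")]
model = learn_naive_bayes_text(train)
print(classify_naive_bayes_text("an interesting and great article", *model))   # like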
Effectiveness of the Algorithm






Problem: classifying Usenet news articles.
Target classification for an article: the name of the Usenet newsgroup in which the article appeared.
In the experiment described by Joachims (1996), 20 electronic newsgroups were considered.
1,000 articles were collected from each newsgroup, forming a data set of 20,000 documents. The naive Bayes algorithm was then applied using two-thirds of these 20,000 documents as training examples, and performance was measured over the remaining third.
The 100 most frequent words were removed (these include words such as "the" and "of"), and any word occurring fewer than three times was also removed. The resulting vocabulary contained approximately 38,500 words.
The accuracy achieved by the program was 89%.
The 20 newsgroups used in the experiment:
comp.graphics
misc.forsale
soc.religion.christian
alt.atheism
comp.os.ms-windows.misc
rec.autos
talk.politics.guns
sci.space
comp.sys.ibm.pc.hardware
rec.sport.baseball
talk.politics.mideast
sci.crypt
comp.windows.x
rec.motorcycles
talk.politics.misc
sci.electronics
comp.sys.mac.hardware
rec.sport.hockey
talk.religion.misc
sci.med
APPLICATIONS



A newsgroup posting service that learns to assign documents to the appropriate newsgroup.
The NEWSWEEDER system: a program for reading netnews that allows the user to rate articles as he or she reads them. NEWSWEEDER then uses these rated articles (i.e., its learned profile of user interests) to suggest the most highly rated new articles each day.
Naive Bayes spam filtering using word-position-based attributes.
Thank you !
Bayesian Learning Networks
Approach to
Cybercrime Detection
Bayesian Learning Networks Approach to
Cybercrime Detection
N S ABOUZAKHAR, A GANI and G MANSON
The Centre for Mobile Communications Research
(C4MCR),
University of Sheffield, Sheffield
Regent Court, 211 Portobello Street,
Sheffield S1 4DP, UK
[email protected]
[email protected]
[email protected]
M ABUITBEL and D KING
The Manchester School of Engineering,
University of Manchester
IT Building, Room IT 109,
Oxford Road,
Manchester M13 9PL, UK
[email protected]
[email protected]
REFERENCES
1. David J. Marchette, Computer Intrusion Detection and Network Monitoring: A Statistical Viewpoint, 2001, Springer-Verlag, New York, Inc., USA.
2. Heckerman, D. (1995), A Tutorial on Learning with Bayesian Networks, Technical Report MSR-TR-95-06, Microsoft Corporation.
3. Michael Berthold and David J. Hand, Intelligent Data Analysis: An Introduction, 1999, Springer, Italy.
4. http://www.ll.mit.edu/IST/ideval/data/data_index.html, accessed on 01/12/2002.
5. http://kdd.ics.uci.edu/, accessed on 01/12/2002.
6. Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, 2000, Morgan Kaufmann, USA.
7. http://www.bayesia.com, accessed on 20/12/2002.
Motivation behind the paper
Growing dependence of modern society on telecommunication and information networks.

The increase in the number of networks interconnected to the Internet has led to an increase in security threats and cybercrimes.
Structure of the paper

In order to detect distributed network attacks as early as possible, a probabilistic approach based on Bayesian networks, currently under research and development, has been proposed.
Where can this model be utilized

Learning agents which deploy the Bayesian network approach are considered to be a promising and useful tool in determining suspicious early events of Internet threats.
Before we look at the details given in the paper, let's understand what Bayesian networks are and how they are constructed.
Bayesian Networks

A simple, graphical notation for conditional independence assertions and hence for compact specification of full joint distributions.

Syntax:
 a set of nodes, one per variable
 a directed, acyclic graph (link ≈ "directly influences")
 a conditional distribution for each node given its parents:
P(Xi | Parents(Xi))
In the simplest case, the conditional distribution is represented as a conditional probability table (CPT) giving the distribution over Xi for each combination of parent values.
Some conventions……….




Variables are depicted as nodes.
Arcs represent probabilistic dependence between variables.
Conditional probabilities encode the strength of the dependencies.
Missing arcs imply conditional independence.
Semantics
The full joint distribution is defined as the product of the local conditional distributions:
P(X1, …, Xn) = Π (i = 1 to n) P(Xi | Parents(Xi))
e.g., P(j ∧ m ∧ a ∧ ¬b ∧ ¬e)
= P(j | a) P(m | a) P(a | ¬b, ¬e) P(¬b) P(¬e)
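
As a small sketch of this semantics, the joint probability of one full assignment in the burglary/earthquake alarm network can be computed as the product of local conditionals. The CPT numbers below are the standard textbook values and are assumed here, since the slide does not list them:

# Alarm network: Burglary, Earthquake -> Alarm -> JohnCalls, MaryCalls.
p_b, p_e = 0.001, 0.002                        # assumed priors P(b), P(e)
p_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # assumed P(a | B, E)
p_j = {True: 0.90, False: 0.05}                # assumed P(j | A)
p_m = {True: 0.70, False: 0.01}                # assumed P(m | A)

# P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) = P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e)
joint = p_j[True] * p_m[True] * p_a[(False, False)] * (1 - p_b) * (1 - p_e)
print(round(joint, 8))                         # ≈ 0.00062811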
Example of Construction of a BN
Back to the discussion of the paper…
Description

This paper shows how a Bayesian network probabilistically detects communication network attacks, allowing for generalization of Network Intrusion Detection Systems (NIDSs).
Goal
How well does our model detect or classify attacks and respond to them later on?
The system requires the estimation of two quantities:
 the probability of detection (PD)
 the probability of false alarm (PFA)
 It is not possible to simultaneously achieve a PD of 1 and a PFA of 0.
Input DataSet

The 2000 DARPA Intrusion Detection Evaluation Program, prepared and managed by MIT Lincoln Labs, provided the necessary dataset.
Sample dataset
Construction of the network
The following figure shows the Bayesian network that has been automatically constructed by the learning algorithms of BayesiaLab.
The target variable, activity_type, is directly connected to the variables that heavily contribute to its knowledge, such as service and protocol_type.
Data Gathering
MIT Lincoln Labs set up an environment to acquire several weeks of raw TCP dump data for a local-area network (LAN) simulating a typical U.S. Air Force LAN. The generated raw dataset contains a few million connection records.
Mapping the simple
Bayesian Network that we saw to
the one used in the paper
Observation 1:
As shown in the next figure, the most probable activity corresponds to a smurf attack (52.90%), an ecr_i (ECHO_REPLY) service (52.96%) and an icmp protocol (53.21%).
Observation 2:


What would happen if the probability of receiving ICMP protocol packets is increased? Would the probability of having a smurf attack increase?
Setting the protocol to its ICMP value increases the probability of having a smurf attack from 52.90% to 99.37%.
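
The effect of setting evidence can be mimicked on a small scale by conditioning a discrete joint distribution. The sketch below uses a toy, made-up joint over (protocol, activity_type), not the paper's learned network, purely to show how fixing protocol = icmp shifts the posterior toward smurf:

# Toy joint distribution P(protocol, activity_type); the numbers are illustrative only.
joint = {
    ("icmp", "smurf"):     0.50, ("icmp", "normal"): 0.03,
    ("tcp",  "portsweep"): 0.05, ("tcp",  "normal"): 0.30,
    ("udp",  "normal"):    0.12,
}

def posterior(joint, query_index, evidence):
    # Keep only assignments consistent with the evidence {index: value},
    # renormalize, and marginalize onto the query variable.
    consistent = {k: p for k, p in joint.items()
                  if all(k[i] == v for i, v in evidence.items())}
    z = sum(consistent.values())
    out = {}
    for k, p in consistent.items():
        out[k[query_index]] = out.get(k[query_index], 0.0) + p / z
    return out

print(posterior(joint, 1, {}))            # prior over activity_type
print(posterior(joint, 1, {0: "icmp"}))   # posterior after observing protocol = icmp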
Observation 3:


Let's look at the problem from the opposite direction. If we set the probability of a portsweep attack to 100%, then the values of some associated variables would inevitably vary.
We note from Figure 4 that the probabilities of the TCP protocol and private service have been increased from 38.10% to 97.49% and from 24.71% to 71.45% respectively. Also, we can notice an increase in the REJ and RSTR flags.
How do the previous examples work?
PROPAGATION
(Figure: propagation of evidence through the network.)
Benefits of the Bayesian Model




The benefit of using Bayesian IDSs is the ability to adjust our IDS's sensitivity.
This allows us to trade off between accuracy and sensitivity.
Furthermore, the automatic detection of network anomalies by learning allows distinguishing the normal activities from the abnormal ones.
They allow network security analysts to see the amount of information being contributed by each variable in the detection model to the knowledge of the target node.
Performance evaluation
QUESTIONS OR QUERIES
Thank you !