Bayesian Filtering

Download Report

Transcript Bayesian Filtering

Bayesian
Filtering
Team Glyph
Debbie Bridygham Pravesvuth Uparanukraw
Ronald Ko Rihui Luo Thuong Luu
Background
• Strong need exists to identify “bad”
items in a population and remove
them -- Examples: SPAM,
Unsolicited IMs, Etc.
• Filtering often results in “Arm’s
Race” requiring rapid response
• “Arm’s Race” favors inherently
adaptive methods over others
Benefits of Filters
• Less unwanted traffic, thus less
wasted space on clients & servers
• Greater use of internet services due
to reduced customer frustration
• Provide some protection against
dangerous traffic: scams, phishing
attacks, viruses, etc.
Downsides of Filtering
• Exclusion of even one legitimate
item (i.e., False Positives) less
desirable than letting 10 or more
illegitimate items pass.
• Reducing the percentage of
undesirable traffic often causes
legitimate traffic to be excluded as
well.
Cost of Filtering
• Manual filtering has become
prohibitive
• Maintenance of static filters costs
time & money
• Time spent maintaining keywords or
updating software delays response
• “Arm’s Race” often results in ever
escalating costs
Methodologies
• Manual filtering prohibitive in terms
of time
• Static filtering based on heuristics
and keywords does not adapt except
via manual updates
• Bayesian filtering is dynamic,
adapting with each new item
scanned and/or marked
What is
Bayesian Filtering?
• Uses Naïve Bayes Classifier, which
uses Bayes Theorem
• Classifier allows items to be
adaptively categorized using
probabilities & has low rate of False
Positives
• Most well-known use in SPAM
filtering; often credited to initial work
by Paul Graham (“A Plan for Spam”)
in 2002
Naïve Bayes Classifier
•
•
•
Uses Bayes Theorem with assumptions
that probabilities are independent (rarely
true), thus “naïve”
Classifier can start with initial assumptions,
i.e., probabilities that words occur in
legitimate or illegitimate messages
Is trained over time and adapts. If final
probability reaches some threshold, an
item is rejected. Superior to keyword
filtering.
Bayes Theorem
•
•
First presented in 1763 based on work by
mathematician Thomas Bayes
Pr(A|B) = Pr(B|A)· Pr(A) / Pr(B)
• Specifies relationships between
conditional probabilities
• Currently has practical use in many
fields
Bayesian Filtering
Usage
• Uses user input to develop individual
statistics
• Probability matrix changes over time
based on scanned messages and
user decisions
• Matrix is used to calculate probability
a message is unwanted
• Matrix adapts quickly to new input,
resulting in surprisingly good results
Example Matrix
Example
• Suppose the word “guarantee”
occurs in 500 of 2000 Spam emails,
but only in 5 of 1000 Non-Spam
emails
• The probability of Spam for this word
is then (500 / 2000) / ((500 / 2000) +
(5 / 1000)) = 0.98
• This probability is combined with that
of others obtained from message to
compute a probability for the entire
message being Spam.
Bayesian Poisoning
• Attempts to fool BF systems by
adding irrelevant words (often
hidden)
• Type I attacks attempt to get
messages through filter -- could be
active or passive, with active
producing feedback to sender via a
“Web Bug” or other means
• Type II attacks attempt to cause
“False Positives”, i.e., force
desirable messages to be rejected
Poisoning Effectiveness
• Passive attacks are rarely effective
as filters are individual and sender
gets no feedback
• Active attacks can be initially highly
effective, if systems access “Web
Bugs”
• All attacks lose effectiveness as the
filter adjusts to incoming traffic
Products that use
Bayesian Filtering
AlienCamel
DSPAM
Eudora
eXpurgate
Junk-Out
Mozilla
Pegasus
Mail
POPFile
Postini
SeaMonkey
SpamAssas
sin
SpamBayes
SpamProbe
Thunderbird
Summary
• BF adapts to individual needs
• BF is highly effective
• BF adapts more quickly than other
solutions
• BF is resistant to “poisoning”
References
• [1] Sahami, M., et. al. “A Bayesian
Approach to Filtering Junk E-Mail”,
1998
• [2] Graham, Paul.
“A Plan for
SPAM”, 2002
• [3] Graham-Cumming, John. “Does
Bayesian poisoning exist?”, 2006
References, cont.
• [4] Naive Bayes Classifier,
Wikipedia, 2007
• [5] Bayes Theorem, Wikipedia, 2007