
CSC 380
Algorithm Project Presentation
Spam Detection Algorithms
Kyle McCombs
Bridget Kelly
Objective
• Create a text-filtering algorithm that can
accurately and efficiently identify spam emails
based on data collected from past spam
emails.
Background
• Spam: unwanted e-mail, sent to large numbers
of people and consisting mostly of advertising;
unsolicited, usually commercial, e-mail sent to
a large number of addresses
• Spam is estimated to account for anywhere
from 70–95% of all emails
Method
• Create a word bank by parsing the body of
spam emails in the database
– Our methods disregard the sender address and subject line
• Each word is associated with a frequency of
appearance within all emails evaluated during the
learning phase
• Use this data to evaluate emails with one of two
methods:
– Naïve Bayes classifier
– Markov model
Naïve Bayes Classifier - Background
• One of the most popular/oldest methods of spam
detection, first known use in 1996
• Common text identification method – utilizing
features from the “bag of words” model
– Disregards grammar and word order, but not multiplicity
• Assumes independence among features - value of
any particular feature is unrelated to the
presence or absence of any other feature
• Tailored to a specific user
• Offers low false-positive detection rate
Naïve Bayes Classifier - Process
• Each word has a probability of being in a spam
email
– The training phase builds these probabilities
(e.g. from the email user marking messages as spam)
• Probabilities of individual words are used to
compute the probability that an email with a
particular set of words is spam or not
• If this probability meets a certain threshold – the
email is determined to be spam
Naïve Bayes Classifier - Process
Considering one word’s effect on an email being spam:

Pr(S|W) = Pr(W|S) Pr(S) / (Pr(W|S) Pr(S) + Pr(W|H) Pr(H))

Pr(S|W) – probability an email is spam given that it contains word W
Pr(W|S) – probability that word W appears in spam
Pr(S) – probability any given message is spam
Pr(W|H) – probability that word W appears in non-spam (ham)
Pr(H) – probability any given message isn’t spam

Recent statistics suggest priors such as Pr(S) = .8, Pr(H) = .2,
or even Pr(S) = .9, Pr(H) = .1. Most Bayesian spam software,
however, makes no assumption about incoming emails and takes
Pr(S) = Pr(H) = .5, so the formula can be simplified to:

Pr(S|W) = Pr(W|S) / (Pr(W|S) + Pr(W|H))
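The simplified per-word probability (equal priors, Pr(S) = Pr(H) = .5) can be sketched in a few lines of Python. The function name and the count-dictionary inputs are illustrative assumptions, not the project's actual code:

```python
def spamicity(word, spam_counts, ham_counts, n_spam, n_ham):
    """Pr(S|W) for one word, assuming equal priors Pr(S) = Pr(H) = 0.5.

    spam_counts/ham_counts map a word to the number of training
    emails of that class containing it.
    """
    pr_w_s = spam_counts.get(word, 0) / n_spam  # Pr(W|S)
    pr_w_h = ham_counts.get(word, 0) / n_ham    # Pr(W|H)
    if pr_w_s + pr_w_h == 0:
        return 0.5  # word never seen in training: treat as neutral
    return pr_w_s / (pr_w_s + pr_w_h)
```

For example, a word seen in 80 of 100 spam emails but only 20 of 100 ham emails gets a spamicity of 0.8.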
Naïve Bayes Classifier - Process
Combining individual probabilities:

p = (p1 p2 ··· pn) / (p1 p2 ··· pn + (1 − p1)(1 − p2) ··· (1 − pn))

p = probability the email in question is spam
pi = probability the email is spam given that it contains the i-th word
n = number of words being evaluated

*The multiplication shown here is actually done as addition in the
log domain, because the numbers involved are very small.

Compare p to a determined threshold:
if p is below the threshold, the email is not classified as spam
if p is equal to or above the threshold, the email is classified as spam
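The combining step can be sketched in Python, carrying out the multiplication as addition in the log domain exactly as the slide notes; the function name is an illustrative assumption:

```python
import math

def combined_spam_probability(word_probs):
    """Combine per-word spam probabilities p1..pn via
    p = prod(pi) / (prod(pi) + prod(1 - pi)),
    computed in the log domain to avoid underflow when
    many small factors are multiplied together."""
    log_p = sum(math.log(p) for p in word_probs)
    log_q = sum(math.log(1.0 - p) for p in word_probs)
    # p / (p + q) rewritten as 1 / (1 + exp(log_q - log_p))
    return 1.0 / (1.0 + math.exp(log_q - log_p))
```

Two words that are each 90% spammy combine to 0.81 / (0.81 + 0.01) ≈ 0.988, i.e. agreement between words strengthens the verdict.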
Naïve Bayes Classifier - Results
• 15,000 spam emails evaluated
during learning phase
• Average classifier value of
emails in learning phase used
as threshold
– 2.86% success rate in testing
(86/3000 emails could be
confidently identified as spam)
• Median – better summary
statistic for data that is not
normally distributed
– 52.03% success rate when
using median value as
threshold (1561/3000)
The SAS output shown on the right displays results from a PROC UNIVARIATE procedure run on a data set
containing the Bayes classifier values for the 15,000 emails in the learning set. The data is highly
skewed, and three different normality tests support that it is not normally distributed. This
evidence suggests that a model considering the individual probabilities of every word within an email
is not the best fit for our data.
Naïve Bayes Classifier - Results
• Only consider the 15 most “interesting” (highest)
probabilities for each email in the classifier
• Neutral words (words associated with a low
spam probability) should not affect the statistical
significance of highly incriminating words, no
matter how many there are
• 97.13% success rate (2914/3000 spam emails
correctly identified) – using the average Bayes
value from the learning set as threshold
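The "15 most interesting probabilities" filter described above is a one-line selection. A minimal sketch, assuming per-word probabilities are held in a plain list (the function name and `k` parameter are illustrative):

```python
def most_interesting(word_probs, k=15):
    """Keep only the k highest per-word spam probabilities, so that
    many neutral, low-probability words cannot dilute the evidence
    from a few highly incriminating ones."""
    return sorted(word_probs, reverse=True)[:k]
```

The surviving probabilities would then be fed into the combining formula from the previous slide instead of the full word list.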
Markov Model - Background
• Models the statistical behaviors of spam
emails.
• Widely used in current spam classification
systems.
• In essence, a Bayes filter works on single
words alone, while a Markovian filter works
on phrases or possibly whole sentences.
Markov Model - Process
• Training – Analyze a training set of emails that
are all known to be spam
• Examining adjacent words, ‘A’ and ‘B’,
compute the frequency with which word ‘B’
follows word ‘A’, for every word in the body
of an email.
• If word ‘A’ is followed by a period, question
mark, or exclamation point, skip that pair.
Markov Model - Process
• Calculate and store the average occurrence rate of word ‘B’ following
word ‘A’, for every word in each email in the training set:

avgPerEmail(‘A’,‘B’) = (times ‘B’ follows ‘A’ in the email) /
(total adjacent word pairs in the email)

• Summing all of the average occurrence rates of ‘B’ following ‘A’ and
dividing by the total number of emails in the training set gives the
final average rate that word ‘B’ followed word ‘A’ in the training set:

Final Avg. Occurrence(‘B’ follows ‘A’) =
[avgPerEmail(‘A’,‘B’) Email 1 + … + avgPerEmail(‘A’,‘B’) Email n] /
(Number of Emails in Training Set)

• Using a weighted directed graph, store each word encountered as a
vertex, with edges between adjacent words weighted by the average rate
of occurrence in all spam emails from the training set.
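The training steps above can be sketched in Python. This is a sketch under stated assumptions: emails arrive as token lists, the per-email rate is taken as pair count over total adjacent pairs, and the graph is a nested dict; none of these details are confirmed by the slides:

```python
from collections import defaultdict

SENTENCE_END = {".", "?", "!"}

def train_markov(emails):
    """Build a weighted directed graph: graph[a][b] is the average
    per-email rate at which word b follows word a across the spam
    training set. `emails` is a list of token lists."""
    totals = defaultdict(float)
    n = len(emails)
    for tokens in emails:
        counts = defaultdict(int)
        pairs = 0
        for a, b in zip(tokens, tokens[1:]):
            # skip pairs that bridge a sentence boundary
            if a in SENTENCE_END or b in SENTENCE_END:
                continue
            counts[(a, b)] += 1
            pairs += 1
        if pairs:
            for pair, c in counts.items():
                totals[pair] += c / pairs  # avgPerEmail('A','B')
    graph = defaultdict(dict)
    for (a, b), s in totals.items():
        graph[a][b] = s / n  # average over all training emails
    return graph
```

Storing only edges that actually occurred keeps the graph sparse; a missing edge simply means the pair was never seen in training.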
Markov Model - Process
Classification:
When “grading” an email in question,
• Examine adjacent words
• Look up the corresponding edge weight in the graph
(the average rate that one word follows the other in the training
collection)
• Accumulate these weights for each email and calculate the average
weight as a final grade for the email
• If this grade is greater than or equal to a determined threshold,
classify the email as spam; if less, classify it as not spam
• If an edge does not exist (the two words were never adjacent in the
training collection), it is skipped, having no effect on the overall
grade
• Skip common words that could be frequent in both spam and
non-spam emails (e.g. the, this, I)
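The grading procedure can be sketched as follows, assuming the graph produced in training is a nested dict of edge weights; the stopword list here is a small illustrative placeholder, not the project's actual list:

```python
STOPWORDS = {"the", "this", "i", "a", "and", "to"}  # illustrative only

def grade_email(tokens, graph):
    """Average edge weight over adjacent word pairs. Missing edges
    and common stopwords are skipped, so they have no effect."""
    weights = []
    for a, b in zip(tokens, tokens[1:]):
        if a in STOPWORDS or b in STOPWORDS:
            continue
        w = graph.get(a, {}).get(b)
        if w is not None:  # edge absent: pair never seen in training
            weights.append(w)
    return sum(weights) / len(weights) if weights else 0.0

def is_spam(tokens, graph, threshold):
    """Classify as spam when the grade meets the threshold."""
    return grade_email(tokens, graph) >= threshold
```
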
Markov Model - Results
• 3000 spam emails evaluated during learning
phase
• 1000 spam emails used in the testing set
• Average classifier grade of emails in learning
phase used as threshold
• 920 spam emails correctly identified as spam
• 92% success rate