SPAM FILTERING
By
Ankur Khator 01005028
Gaurav Sharma 01005029
Arpit Mathur 01D05014
What is Spam Email?
• Spam: "junk email" or "unsolicited commercial email".
• Spam filtering is a special case of email classification.
• Only 2 classes: Spam and Non-spam.
Various Approaches
• Bayesian Learning: probabilistic model for spam filtering, bag-of-words representation
• Ripper algorithm: context-sensitive learning
• Boosting algorithm: improving accuracy by combining weaker hypotheses, term vectors
Naive Bayes for Spam
• We seek a model for P(Y=1 | X1=x1, X2=x2, ..., Xd=xd), the probability that a mail is spam given its words.
• From Bayes' theorem:
  P(Y=1 | X1=x1, ..., Xd=xd) = P(Y=1) · P(X1=x1, ..., Xd=xd | Y=1) / P(X1=x1, ..., Xd=xd)
  P(Y=0 | X1=x1, ..., Xd=xd) = P(Y=0) · P(X1=x1, ..., Xd=xd | Y=0) / P(X1=x1, ..., Xd=xd)
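A tiny numeric illustration of the two posterior expressions above, with made-up probabilities chosen only to show how the shared denominator normalizes them (the numbers are not from the slides):

```python
# Made-up numbers for illustration only.
p_y1 = 0.4                      # P(Y=1), prior probability of spam
p_y0 = 0.6                      # P(Y=0)
p_x_given_y1 = 0.02             # P(X1=x1,...,Xd=xd | Y=1)
p_x_given_y0 = 0.001            # P(X1=x1,...,Xd=xd | Y=0)

p_x = p_y1 * p_x_given_y1 + p_y0 * p_x_given_y0   # P(X1=x1,...,Xd=xd)
post_spam = p_y1 * p_x_given_y1 / p_x
post_ham  = p_y0 * p_x_given_y0 / p_x
print(post_spam, post_ham, post_spam + post_ham)  # the two posteriors sum to 1
```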
Justification of using Bayes Theorem
• Sparseness of data.
• P(B|A) can be determined more easily and accurately than P(A|B).
Naive Bayes for Spam (contd.)
• Assume conditional independence: P(X1=x1, ..., Xd=xd | Y=k) = ∏i P(Xi=xi | Y=k)
• Also assume binary features: Xi = 1 if the number of occurrences of word i >= 1, and Xi = 0 otherwise.
Naive Bayes for Spam (contd.)
• For each word, the log-ratio log[ P(Xi=xi | Y=1) / P(Xi=xi | Y=0) ] is referred to as its weight of evidence.
• Inconsistency when some estimated probability is zero: a single zero zeroes out the whole product.
• Smooth the estimates by adding a small positive constant to both the numerator and denominator of each probability estimate.
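A minimal sketch of this model in Python, assuming binary word-presence features, smoothing with a constant of 1, and the logistic link from the summed weights of evidence (plus the log prior odds) to the spam probability; the function and variable names are illustrative, not from the slides.

```python
import math
from collections import Counter

def train_naive_bayes(messages, labels, alpha=1.0):
    """Estimate smoothed P(word present | class) for spam (1) and legitimate (0)."""
    vocab = set(w for text in messages for w in set(text.lower().split()))
    counts = {0: Counter(), 1: Counter()}
    n_class = Counter(labels)
    for text, y in zip(messages, labels):
        for w in set(text.lower().split()):          # binary features: presence only
            counts[y][w] += 1
    # smoothed estimates: (count + alpha) / (class size + 2*alpha)
    p_word = {y: {w: (counts[y][w] + alpha) / (n_class[y] + 2 * alpha) for w in vocab}
              for y in (0, 1)}
    prior_log_odds = math.log(n_class[1] / n_class[0])
    return vocab, p_word, prior_log_odds

def spam_probability(text, vocab, p_word, prior_log_odds):
    """Sum weights of evidence over the vocabulary, then apply the logistic function."""
    present = set(text.lower().split())
    log_odds = prior_log_odds
    for w in vocab:
        p1 = p_word[1][w] if w in present else 1 - p_word[1][w]
        p0 = p_word[0][w] if w in present else 1 - p_word[0][w]
        log_odds += math.log(p1 / p0)                # weight of evidence of word w
    return 1.0 / (1.0 + math.exp(-log_odds))

mails = ["buy cheap pills now", "meeting agenda for monday",
         "free gift click now", "quick question about the report"]
labels = [1, 0, 1, 0]
model = train_naive_bayes(mails, labels)
print(spam_probability("free pills now", *model))
```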
Classifying
• Assume a new mail with the text "The quick rabbit rests".
• Summing the weights of evidence: 0.51 + 0.51 + 0 + 0.51 + 1.10 + 0 = 2.63
• Applying the logistic function, probability of spam = 1/(1 + e^-2.63) ≈ 0.93
Threshold
• Lower threshold: higher false positive rate (legitimate mail classified as spam).
• Higher threshold: higher false negative rate (spam let through); this is the preferred trade-off, since false positives are more costly.
Non-Linear Classification
• A linear classifier ignores the effect of a word's context on its meaning, which is unrealistic.
• One could build a linear classifier that tests for more complex features, such as simultaneous occurrences of words, but at high computation cost.
• Non-linear classification is the solution.
Ripper
• A classifier is a disjunction of different contexts.
• Each context is a conjunction of simple terms.
• Example: a context of w1 is "w2 belongs to the document and w3 belongs to the document", i.e. for the context to be true, w1 must occur together with w2 and w3.
Three Components of Ripper Algorithm
• Rule Learning. Example rules:
  Spam ← "spam" ∈ Subject
  Spam ← "Free" ∈ Subject, "Spam" ∈ Subject
  Spam ← "Gift!!" ∈ Subject, "Click" ∈ Subject
• The learned rule is the disjunction of the three statements above (see the sketch below).
• There is also an initial set of rules.
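A small sketch of how such a rule set classifies a message, assuming each rule is simply a set of words that must all appear in the subject; the rule contents are the illustrative ones above, not learned rules.

```python
# Each rule: a conjunction of word-presence tests; the rule set: their disjunction.
RULES = [
    {"spam"},            # Spam <- "spam" in Subject
    {"free", "spam"},    # Spam <- "Free" in Subject AND "Spam" in Subject
    {"gift!!", "click"}, # Spam <- "Gift!!" in Subject AND "Click" in Subject
]

def is_spam(subject: str) -> bool:
    words = set(subject.lower().split())
    # The message is spam if ANY rule has ALL of its conditions satisfied.
    return any(rule <= words for rule in RULES)

print(is_spam("Free gift!! click here"))   # True (third rule fires)
print(is_spam("Meeting notes attached"))   # False
```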
Constructing Rule Set
• The initial rule set is constructed using a greedy strategy.
• Based on IREP (Incremental Reduced Error Pruning).
• To construct a new rule, the dataset is partitioned into two parts: a growing (training) set and a pruning set.
• Each time, a single condition is added to the rule.
Simplification and Optimization
• At every step, the condition added is the one that increases the density of positive examples covered.
• Adding stops when the clause covers no negative example or there is no further positive gain (see the growing sketch below).
• After this, pruning, i.e. simplification, is done, again following a greedy strategy at every stage.
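A rough sketch of the greedy growing step, assuming the "density of positive examples" criterion from the slide is measured as the fraction of covered examples that are spam; this illustrates the idea rather than reproducing Cohen's exact IREP/RIPPER procedure, and the data is a toy example.

```python
def grow_rule(grow_pos, grow_neg, candidate_words):
    """Greedily add word-presence conditions while they raise positive density."""
    rule = set()                     # conjunction of required words
    covers = lambda conds, msg: conds <= msg
    while True:
        pos = [m for m in grow_pos if covers(rule, m)]
        neg = [m for m in grow_neg if covers(rule, m)]
        if not neg:                  # clause covers no negative example: stop
            break
        best_word, best_density = None, len(pos) / (len(pos) + len(neg))
        for w in candidate_words - rule:
            p = sum(1 for m in pos if w in m)
            n = sum(1 for m in neg if w in m)
            if p + n == 0:
                continue
            density = p / (p + n)
            if density > best_density:   # only additions with positive gain
                best_word, best_density = w, density
        if best_word is None:        # no positive gain: stop
            break
        rule.add(best_word)
    return rule

# Messages as sets of words; positives are spam.
spam = [{"free", "gift", "click"}, {"free", "pills"}, {"click", "now"}]
ham  = [{"meeting", "agenda"}, {"free", "time", "tomorrow"}]
vocab = set().union(*spam, *ham)
print(grow_rule(spam, ham, vocab))
```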
Reaching Sufficient Rules
• During pruning, the clause whose deletion maximizes the pruning function is removed, where U+ and U- denote the positive and negative examples covered on the pruning set.
• Rule addition terminates when every rule has non-zero information gain, i.e. every rule covers positive examples.
• But if the data is noisy, the number of rules grows large (a pruning sketch follows below).
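The exact pruning function is not shown in the transcript; as an illustration, the sketch below greedily deletes the condition whose removal maximizes (p - n)/(p + n) on the pruning set, the metric used in Cohen's RIPPER. Treat the metric and the toy data as assumptions.

```python
def prune_rule(rule, prune_pos, prune_neg):
    """Greedily delete the condition whose removal maximizes the pruning score."""
    def score(conds):
        p = sum(1 for m in prune_pos if conds <= m)   # positives still covered
        n = sum(1 for m in prune_neg if conds <= m)   # negatives still covered
        return (p - n) / (p + n) if p + n else -1.0
    rule = set(rule)
    while len(rule) > 1:
        best = max(rule, key=lambda c: score(rule - {c}))
        if score(rule - {best}) <= score(rule):
            break                                     # no deletion improves the score
        rule.discard(best)
    return rule

spam_prune = [{"free", "gift"}, {"free", "click", "now"}]
ham_prune  = [{"free", "lunch", "friday"}]
print(prune_rule({"free", "gift", "click"}, spam_prune, ham_prune))
```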

MDL
• Several heuristics are applied to address this problem; MDL (Minimum Description Length) is one of them.
• After each rule is added, the total description length of the current rule set plus the examples is calculated.
• Rule addition stops when this length is d bits larger than the shortest length seen so far.
AdaBoost
• It is easy to find rules of thumb that are often correct.
  Example: if "buy now" occurs in the message, then predict spam.
• It is hard to find one rule that is highly accurate.
• AdaBoost helps here: a general method for converting rough rules of thumb into a highly accurate prediction rule, by concentrating on the hard examples.
Pictorially (figure not transcribed)
Algorithm
• Input: S = {(xi, yi)}, i = 1, ..., m, with yi ∈ {-1, +1}
• Initialize D1(i) = 1/m for all i
• For t = 1 to T:
  ht = WeakLearner(S, Dt)
  Choose βt = ½ ln((1 - εt)/εt), proven to minimize training error for the two-class case [2]
  Update Dt+1(i) = Dt(i) exp(-βt yi ht(xi)) and normalize
• Final hypothesis: f(x) = ∑t βt ht(x)
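A compact sketch of this loop in Python, assuming word-presence decision stumps as the weak learner and labels in {-1, +1}, with βt = ½ ln((1 - εt)/εt) as in [2]; the names and the toy corpus are illustrative.

```python
import math

def stump_learner(messages, labels, D, vocab):
    """Weak learner: pick the (word, sign) stump with lowest weighted error."""
    best = None
    for w in vocab:
        for sign in (+1, -1):            # sign +1: predict spam if word present
            preds = [sign if w in m else -sign for m in messages]
            err = sum(d for d, p, y in zip(D, preds, labels) if p != y)
            if best is None or err < best[0]:
                best = (err, w, sign)
    err, w, sign = best
    return (lambda m, w=w, sign=sign: sign if w in m else -sign), err

def adaboost(messages, labels, T=10):
    m = len(messages)
    vocab = set().union(*messages)
    D = [1.0 / m] * m                    # uniform initial distribution
    hs, betas = [], []
    for _ in range(T):
        h, eps = stump_learner(messages, labels, D, vocab)
        eps = min(max(eps, 1e-10), 1 - 1e-10)          # guard against log(0)
        beta = 0.5 * math.log((1 - eps) / eps)
        D = [d * math.exp(-beta * y * h(x)) for d, x, y in zip(D, messages, labels)]
        Z = sum(D)
        D = [d / Z for d in D]           # normalize so D stays a distribution
        hs.append(h)
        betas.append(beta)
    return lambda x: sum(b * h(x) for b, h in zip(betas, hs))   # f(x) = sum beta_t h_t(x)

# Toy corpus: messages as sets of words, +1 = spam, -1 = legitimate.
msgs = [{"buy", "now"}, {"free", "gift"}, {"meeting", "now"}, {"lunch", "tomorrow"}]
ys   = [+1, +1, -1, -1]
f = adaboost(msgs, ys, T=5)
print(f({"buy", "gift"}), f({"meeting", "tomorrow"}))
```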
Example (worked examples shown as figures, not transcribed)
Accuracy
• Weighted accuracy measure:
  WAcc = (λL- + S+) / (λL + S)
  where λ is a strictness measure,
  L : number of legitimate messages, S : number of spam messages,
  L- : legitimate messages classified as legitimate,
  S+ : spam messages classified as spam.
Improving accuracy
• Increase λ.
• Introduce a threshold θ: an example is classified as positive (spam) only if f(x) > θ; the default is zero.
• Recall: correctly predicted spam out of the number of spam messages in the corpus.
• Precision: correctly classified spam out of the number of messages predicted as spam.
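A small helper, assuming the counts come from a labeled test corpus, that computes the weighted accuracy above together with spam recall and precision; the function and variable names are illustrative.

```python
def evaluate(y_true, y_pred, lam=9):
    """y_true/y_pred: 1 = spam, 0 = legitimate. lam is the strictness measure."""
    L  = sum(1 for y in y_true if y == 0)                             # legitimate messages
    S  = sum(1 for y in y_true if y == 1)                             # spam messages
    Lm = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)  # L-: ham kept
    Sp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)  # S+: spam caught
    predicted_spam = sum(y_pred)
    wacc = (lam * Lm + Sp) / (lam * L + S)
    recall = Sp / S if S else 0.0
    precision = Sp / predicted_spam if predicted_spam else 0.0
    return wacc, recall, precision

print(evaluate([1, 1, 0, 0, 1], [1, 0, 0, 0, 1], lam=9))
```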
Results on corpus PU1 [1]:

Configuration                     T     Recall   Precision   Acc
Tree depth 1, θ = 10.2, λ = 9     525   93.55    98.71       98.59
Tree depth 1, θ = 46.9, λ = 999   550   74.43    100.00      99.98
Tree depth 5, θ = 37.4, λ = 9     525   93.97    99.12       98.92
Tree depth 5, θ = 178,  λ = 999   550   66.53    100.00      99.97
Pros and Cons
• Pros:
  Fast and simple
  No parameters to tune
  Flexible: can be combined with any learning algorithm
  No knowledge of the WeakLearner is needed
  Error reduces exponentially
  Robust to overfitting
• Cons:
  Data driven: requires lots of data
  Performance depends on the WeakLearner; may fail if the WeakLearner is too weak
Conclusion
• RIPPER as a text categorization algorithm works better than Naive Bayes (the advantage grows with more classes).
• The two are comparable for spam filtering (2 classes).
• Boosting performs better than any weak learner it builds on.
References
[1] Xavier Carreras and Lluís Màrquez. Boosting Trees for Anti-Spam Email Filtering. 2001.
[2] Robert E. Schapire. The Boosting Approach to Machine Learning: An Overview. In MSRI Workshop on Nonlinear Estimation and Classification, 2002.
[3] David Madigan. Statistics and the War on Spam. 2004.
[4] I. Androutsopoulos, J. Koutsias, K. V. Chandrinos, G. Paliouras, and C. D. Spyropoulos. An Evaluation of Naive Bayesian Anti-Spam Filtering. In Proc. of the Workshop on Machine Learning in the New Information Age, 2000. http://citeseer.ist.psu.edu/androutsopoulos00evaluation.html
[5] William W. Cohen and Yoram Singer. Context-Sensitive Learning Methods for Text Categorization. SIGIR 1996: 307-315.