SPAM: Content based methods - Computer Science and Engineering
SPAM: Content-based methods – Seminar II
Different methods
Metrics and benchmarks
-- Seminar I
presented by Christian Loza, Srikanth Palla and Liqin Zhang
Different Methods:
Non-content-based approach
removes spam messages that contain viruses or worms before they are read
leaves some messages unlabeled
Content-based method:
widely used method
may need many pre-labeled messages
labels a message based on its content
Zdziarski [5] said that it is possible to stop spam, and that content-based filters are the way to do it
Focus on content based method
Content-based methods
Bayesian-based method [6]
Centroid-based method [7]
Machine-learning methods [8]
Latent Semantic Indexing (LSI)
Contextual Network Graphs (CNG)
Rule-based method [9]
RIPPER rules: a list of predefined rules that can be changed by hand
Memory-based method [10]
saves cost
Rule-based method
A list of predefined rules that can be changed by hand (RIPPER rules)
Each rule/test is associated with a score
If an email fails a rule, its score is increased
After all rules are applied, if the score is above a certain threshold, the email is classified as spam
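As a sketch of the scoring loop, not the actual RIPPER rule set: the rules, weights, and threshold below are invented for illustration.

```python
# Toy rule-based spam scorer: each rule an email "fails" adds its score;
# a total above the threshold classifies the email as spam.
# All rules, weights, and the threshold are made-up examples.
RULES = [
    (lambda email: "free money" in email.lower(), 2.5),  # suspicious phrase
    (lambda email: email.isupper(), 1.5),                # all-caps body
    (lambda email: email.count("!") > 3, 1.0),           # many exclamations
]
THRESHOLD = 3.0

def classify(email: str) -> str:
    score = sum(points for rule, points in RULES if rule(email))
    return "spam" if score > THRESHOLD else "not spam"
```

New rules can be appended to RULES without touching the classifier, which is the main appeal of the approach.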
Rule-based method
Advantage:
able to employ diverse and specific rules to check for spam
check the size of the email
count the number of pictures it contains
no training messages are needed
Disadvantage:
rules have to be entered and maintained by hand
--- they cannot be updated automatically
Latent Semantic Indexing
Keyword:
an important word for text classification
a highly frequent word in a message
can be used as an indicator for the message
Why LSI?
Polysemy: a word can be used in more than one category
ex: play
Synonymy: two words that have identical meaning
Based on a nearest-neighbors algorithm
Latent Semantic Indexing
considers semantic links between words
searches keywords over the semantic space
two words that have the same meaning are treated as one word
eliminates synonymy
considers the overlap between different messages; this overlap may indicate:
polysemy or a stop-word
two messages in the same category
Latent Semantic Indexing
Step 1:
build a term-document matrix X from the input documents
Doc1: computer science department
Doc2: computer science and engineering science
Doc3: engineering school
        computer  science  department  and  engineering  school
Doc1       1         1         1        0        0         0
Doc2       1         2         0        1        1         0
Doc3       0         0         0        0        1         1
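Step 1 on this toy corpus can be sketched in plain Python (the term order follows the slide):

```python
# Build the term-document matrix X: X[i][j] counts occurrences of
# term j in document i, using the three example documents.
docs = [
    "computer science department",                # Doc1
    "computer science and engineering science",  # Doc2
    "engineering school",                         # Doc3
]
terms = ["computer", "science", "department", "and", "engineering", "school"]

X = [[doc.split().count(term) for term in terms] for doc in docs]
```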
Latent Semantic Indexing
Step 2:
Singular Value Decomposition (SVD) is performed on matrix X
to extract a set of linearly independent FACTORS that describe the matrix
generalizes terms that have the same meaning
three new matrices T, S, D are produced to reduce the vocabulary's size
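A minimal sketch of Step 2 with NumPy (the library choice is an assumption; the slides do not prescribe one): SVD factors X into T, S, D, and keeping only the k largest singular values gives the reduced factor space.

```python
import numpy as np

# Term-document matrix from Step 1 (rows = documents, columns = terms).
X = np.array([[1, 1, 1, 0, 0, 0],
              [1, 2, 0, 1, 1, 0],
              [0, 0, 0, 0, 1, 1]], dtype=float)

# SVD: X = T @ diag(S) @ D, with S holding the singular values.
T, S, D = np.linalg.svd(X, full_matrices=False)

# Rank-k truncation: keep the k strongest factors to shrink the space.
k = 2
X_k = T[:, :k] @ np.diag(S[:k]) @ D[:k, :]
```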
Latent Semantic Indexing
Two documents can be compared by finding the distance between their document vectors, stored in matrix X1
Text classification is done by finding the nearest neighbors; the message is assigned to the category with the most documents among them
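The nearest-neighbors step can be sketched as follows; cosine similarity over document vectors and k = 3 are assumptions, since the slides do not fix the distance measure:

```python
import math

def cosine(u, v):
    # Cosine similarity between two document vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_classify(test_vec, labeled_vecs, k=3):
    # labeled_vecs: (vector, label) pairs; assign the test message to the
    # category holding the most of its k nearest neighbors.
    ranked = sorted(labeled_vecs, key=lambda p: cosine(test_vec, p[0]),
                    reverse=True)
    top = [label for _, label in ranked[:k]]
    return max(set(top), key=top.count)
```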
[Figure: spam, un-spam, and test message points in the document space; the nearest-neighbors method classifies the test message as UN-SPAM.]
Latent Semantic Indexing
Advantage:
the entire training set can be learned at the same time
no intermediate model needs to be built
good when the training set is predefined
Disadvantage:
when a new document is added, matrix X changes, and T, S, D need to be re-calculated
time consuming
a real classifier needs the ability to change the training set
Contextual Network Graphs
A weighted, bipartite, undirected graph of term and document nodes
[Figure: bipartite graph with term nodes t1, t2, t3 and document nodes d1, d2, connected by weighted edges w11, w12, w13, w21, w22, w23.]
t1: w11 + w21 = 1
d1: w11 + w12 + w13 = 1
At any time, for each node, the sum of the weights is 1
Contextual Network Graphs
When a new document d is added, the weights at node d are energized, and the weights at the connected nodes may need rebalancing
The document is classified to the class with the maximum average energy (weight)
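A deliberately simplified sketch of the sum-to-1 invariant: adding a document node attaches weighted edges, and every touched node has its edge weights renormalized. The uniform rebalancing is an assumption; [8] describes a more nuanced energy redistribution, and all names here are hypothetical.

```python
from collections import defaultdict

graph = defaultdict(dict)  # node -> {neighbour: weight}

def add_document(doc, terms):
    # Attach the new document node to its term nodes...
    for t in terms:
        graph[doc][t] = 0.0
        graph[t][doc] = 0.0
    # ...then restore the invariant: each node's weights sum to 1
    # (a uniform split, simplifying the energy rebalancing).
    for node in [doc] + terms:
        n = len(graph[node])
        for neighbour in graph[node]:
            graph[node][neighbour] = 1.0 / n

add_document("d1", ["t1", "t2", "t3"])
add_document("d2", ["t2", "t3"])
```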
Comparison of Bayesian, LSI, CNG, centroid, and rule-based methods

Bayesian: classification = statistical/probabilities; learning = update statistics; semantic = no; automatic model update = yes
LSI: classification = generalization/contextual data; learning = recalculation of matrices; semantic = yes; automatic model update = yes
CNG: classification = energy rebalancing; learning = addition of nodes to the graph; semantic = yes; automatic model update = yes
Centroid: classification = cosine similarity (TF-IDF); learning = recalculation of the centroid; semantic = no; automatic model update = yes
Rule-based: classification = predefined rules; learning = test against rules; semantic = no; automatic model update = no
Result
Result and conclusion:
LSI & CNG surpass the Bayesian approach by 5% in accuracy, and reduce false positives and negatives by up to 71%
LSI & CNG show better performance even with a small document set
Comparison of content-based and non-content-based methods
Non-content based:
Disadvantage:
depends on special factors like email address, IP address, special protocols
leaves some messages unclassified
Advantage: detects spam before the message is read, with high accuracy
Content based:
Disadvantage:
needs some training messages
not 100% correctly classified, because spammers also know the anti-spam techniques
Advantage:
leaves no message unclassified
Improvement for spam filtering
Combine both methods
[1] proposes an email-network-based algorithm with 100% accuracy, but it leaves 47% of messages unclassified; combining it with a content-based method can improve the performance
Build up multiple layers [11]
[11] Chris Miller, A Layered Approach to Enterprise Antispam
Metrics and Benchmarks
Measurements --- Metrics
Accuracy:
the percentage of correctly classified messages
correct / (correct + incorrect)
False positive: a message that is not spam, but is misclassified as spam
Goal:
improve accuracy
prevent false positives
[Figure: table of classification outcomes, correct vs. incorrect, for spam and non-spam messages, with false positives marked.]
[Figure: the entire document collection, with the relevant documents and the retrieved documents shown as overlapping subsets.]
Measurements -- Metrics
[Figure: 2x2 diagram of documents split by retrieved / not retrieved and relevant / irrelevant.]
recall = (number of relevant documents retrieved) / (total number of relevant documents)
precision = (number of relevant documents retrieved) / (total number of documents retrieved)
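The two ratios can be computed from the retrieved and relevant document sets; the document IDs below are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    # "Number of relevant documents retrieved" is the overlap of the sets.
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)

# Example: 4 documents retrieved, 3 actually relevant, 2 of them found.
p, r = precision_recall(retrieved={1, 2, 3, 4}, relevant={2, 3, 5})
```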
Data sets for spam:
Non-content based:
Email network:
one author's email corpus, formed by 5,486 messages
IP address: none
Content based:
Data sets for spam
LSI & CNG:
corpus of varying size (250 ~ 4000)
spam and un-spam emails in equal amounts
Bayesian based:
corpus of 1,789 emails
211 spam, 1,578 non-spam
Centroid based:
200 email messages in total
90 spam, 110 non-spam
Most recently used benchmarks:
Reuters:
about 7,700 training and 3,000 test documents, 30,000 terms, 135 categories, 21 MB
each category has about 57 instances
a collection of newswire stories
20NG:
about 18,800 total documents, 94,000 terms, 20 topics, 25 MB
each category has about 1,000 instances
WebKB:
about 8,300 documents, 7 categories, 26 MB
each category has about 1,200 instances
data from 4 university websites
The above three are well-known in IR, presented recently, and small in size; they are used to test the performance and CPU scalability
Benchmarks
OHSUMED:
348,566 documents, 230,000 terms and 308,511 topics, 400 MB
each category has about 1 instance
abstracts from medical journals
Dmoz:
482 topics, 300 training documents for each topic, 271 MB
each category has less than 1 instance
taken from the Dmoz (http://dmoz.org/) topic tree
a large dataset, used to test the memory scalability of a model
Sources
Slide 1, image: http://www.ecommerce-guide.com
Slide 1, image: http://www.email-firewall.jp/products/das.html
References
Nuanwan Soonthornphisaj, Kanokwan Chaikulseriwat, P. Tang-On, "Anti-Spam Filtering: A Centroid-Based Classification Approach", 2002
Eui-Hong (Sam) Han and George Karypis, "Centroid-Based Document Classification: Analysis & Experimental Results", 2000
Thanaruk Theeramunkong, "Multi-dimensional Text Classification", 2002
Thanaruk Theeramunkong and Verayuth Lertnattee, "Improving Centroid-Based Text Classification Using a Term-Distribution-Based Weighting System and Clustering"
Verayuth Lertnattee and Thanaruk Theeramunkong, "Combining Homogeneous Classifiers for Centroid-Based Text Classification"
References
[1] P. Oscar Boykin and Vwani Roychowdhury, "Personal Email Networks: An Effective Anti-Spam Tool", IEEE Computer, volume 38, 2004
[2] Andras A. Benczur, Karoly Csalogany, Tamas Sarlos and Mate Uher, "SpamRank - Fully Automatic Link Spam Detection", citeseer.ist.psu.edu/benczur05spamrank.html
[3] R. Dantu, P. Kolan, "Detecting Spam in VoIP Networks", Proceedings of the USENIX SRUTI (Steps for Reducing Unwanted Traffic on the Internet) workshop, July 2005 (accepted)
[4] IP addresses in email clients, http://www.ceas.cc/papers-2004/162.pdf
[5] Paul Graham, "A Plan for Spam", http://www.paulgraham.com/spam.html
References
[6] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, "A Bayesian Approach to Filtering Junk E-Mail", Learning for Text Categorization – Papers from the AAAI Workshop, pages 55–62, Madison, Wisconsin, 1998. AAAI Technical Report WS-98-05
[7] N. Soonthornphisaj, K. Chaikulseriwat, P. Tang-On, "Anti-Spam Filtering: A Centroid-Based Classification Approach", IEEE Proceedings ICSP 02
[8] Spam Filtering Using Contextual Network Graphs, www.cs.tcd.ie/courses/csll/dkellehe0304.pdf
[9] W.W. Cohen, "Learning Rules that Classify E-mail", In Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access, 1996
[10] G. Sakkis, I. Androutsopoulos, G. Paliouras, V. Karkaletsis, C.D. Spyropoulos, P. Stamatopoulos, "A Memory-Based Approach to Anti-Spam Filtering for Mailing Lists", Information Retrieval, 2003