But how do we filter spam out from normal emails?

Spam Detection
Kingsley Okeke
Nimrat Virk
• Spam emails, also known as junk emails, are unwanted
messages sent in bulk to numerous recipients.
• They make it harder to recognise normal emails.
• They can also be a threat to computer security.
Everyone hates spam!
• But how do we filter spam out from normal emails?
• What is text mining?
• "Text mining usually involves the process of structuring the
input text (usually parsing, along with the addition of some
derived linguistic features and the removal of others, and
subsequent insertion into a database), deriving patterns within
the structured data, and finally evaluation and interpretation of
the output." (Wikipedia)
Text Mining!
• Marketing applications
• Used to improve predictive analytics models for customers
• E.g. analysing answers to open-ended survey questions
• Online media applications
• Used by large media companies to give users a better
search experience
• Academic applications
• Publishers with large databases use text mining for easier
information retrieval
Applications
• Using text mining, we can analyse patterns common in
spam emails in order to distinguish them from ham
(legitimate) emails.
1) Get some training data
• A large collection of spam and normal emails
• SpamAssassin public corpus
(http://www.spamassassin.org/publiccorpus/)
Steps
2) Data pre-processing
a) Stop-word removal: drop very common words, e.g. for, when, to, a, be
• Also drop domain-specific stop words, e.g. email, send
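Stop-word removal can be sketched in a few lines of Python. This is only an illustration: the two word lists contain just the examples from the slide, and the function name `remove_stop_words` and the whitespace tokenisation are assumptions, not part of the original pipeline.

```python
# Illustrative stop-word lists; real systems use much longer ones.
GENERAL_STOP_WORDS = {"for", "when", "to", "a", "be"}
DOMAIN_STOP_WORDS = {"email", "send"}  # domain-specific examples from the slide

def remove_stop_words(text, stop_words=GENERAL_STOP_WORDS | DOMAIN_STOP_WORDS):
    """Lowercase the text, split on whitespace, and drop stop words."""
    tokens = text.lower().split()
    return [tok for tok in tokens if tok not in stop_words]

print(remove_stop_words("Send a reply to claim the prize"))
# ['reply', 'claim', 'the', 'prize']
```

Note that "the" survives only because it is not in this tiny example list.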
b) Stemming: reducing words to their stem (root) by stripping suffixes
• E.g. discussed, discussing → discuss
• Porter stemming algorithm
• One of the most widely used stemming algorithms
• Developed by Martin Porter
http://www.tartarus.org/~martin/PorterStemmer/
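To make the idea concrete, here is a toy suffix stripper in Python. It is deliberately NOT the Porter algorithm, which applies several ordered rule phases; the suffix list, the `min_stem_len` threshold, and the name `simple_stem` are all assumptions chosen for illustration.

```python
# Toy suffix stripper: NOT the Porter algorithm, only an illustration.
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order

def simple_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if suffix == "s" and word.endswith("ss"):
            continue  # keep a double "s", e.g. "discuss"
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

print([simple_stem(w) for w in ["discussed", "discussing", "discuss"]])
# ['discuss', 'discuss', 'discuss']
```

All three forms from the slide map to the same stem, so they count as one feature.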
c) Feature Selection
What are good and bad features?
Good features:
• Occur frequently with one particular category
• Do not co-occur with other categories
Bad features:
• Uniform across all categories
• Very infrequent occurrence
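• Information Gain
• A common feature selection technique used in machine
learning applications. The information gain of a term t measures
how much knowing whether t occurs in a document reduces the
uncertainty (entropy) about the document's class.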
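A common way to write information gain down, assumed here rather than taken from the slide, is IG(t) = H(C) - [P(t) H(C|t) + P(not t) H(C|not t)], where H is Shannon entropy over the classes. The sketch below computes it from term presence only; the toy corpus and the function names are made up for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(term, docs):
    """IG(t) = H(C) - [P(t) H(C|t) + P(not t) H(C|not t)].

    docs is a list of (tokens, label) pairs; only term presence is used.
    """
    labels = [label for _, label in docs]
    with_term = [label for tokens, label in docs if term in tokens]
    without_term = [label for tokens, label in docs if term not in tokens]
    p_with = len(with_term) / len(docs)
    p_without = len(without_term) / len(docs)
    conditional = p_with * entropy(with_term) + p_without * entropy(without_term)
    return entropy(labels) - conditional

# Made-up toy corpus: two spam and two ham documents.
docs = [
    ({"free", "prize", "today"}, "spam"),
    ({"free", "offer", "today"}, "spam"),
    ({"meeting", "agenda", "today"}, "ham"),
    ({"lunch", "meeting", "today"}, "ham"),
]

print(information_gain("free", docs))   # occurs only in spam: high gain
print(information_gain("today", docs))  # occurs everywhere: no gain
```

In this corpus "free" behaves like a good feature (it separates the classes), while "today" is uniform across categories and carries no information.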
• Feature Representation
        word1   word2   …   class
doc1    0       2       …   c1
doc2    2       4       …   c2
doc3    2       1       …   c3
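A document-term count matrix like the one above could be built as follows. This is a minimal sketch: `build_count_matrix` is a hypothetical helper, and the token lists are invented so that the counts match the slide's example values.

```python
from collections import Counter

def build_count_matrix(tokenized_docs, vocabulary):
    """Return one row of term counts per document, in vocabulary order."""
    rows = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)  # missing terms count as 0
        rows.append([counts[term] for term in vocabulary])
    return rows

vocabulary = ["word1", "word2"]
tokenized_docs = [
    ["word2", "word2"],                                       # doc1
    ["word1", "word2", "word1", "word2", "word2", "word2"],   # doc2
    ["word1", "word1", "word2"],                              # doc3
]

matrix = build_count_matrix(tokenized_docs, vocabulary)
print(matrix)  # [[0, 2], [2, 4], [2, 1]]
```

Each row, together with its class label, becomes one training example for the classifier.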
• TF: Term Frequency
• Definition: TF = t(i,j)
• t(i,j): frequency of term i in document j
• Purpose: gives more weight to words that occur frequently in
the document
• TF-IDF (Term Frequency - Inverse Document Frequency)
• Value of term i in document j
• Definition: TF×IDF = t(i,j) × log(N/ni)
• ni: number of documents containing term i
• N: total number of documents
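The TF×IDF definition can be applied directly to a count matrix. The sketch below assumes the natural logarithm (the slide does not state a base) and reuses the example counts for word1 and word2 across three documents; `tf_idf` is a hypothetical helper name.

```python
import math

def tf_idf(count_matrix):
    """Apply TF x IDF = t(i,j) * log(N / n_i) to a document-term count matrix."""
    n_docs = len(count_matrix)
    n_terms = len(count_matrix[0])
    # n_i: number of documents in which term i occurs at least once
    doc_freq = [sum(1 for row in count_matrix if row[i] > 0) for i in range(n_terms)]
    weighted = []
    for row in count_matrix:
        weighted.append([
            tf * math.log(n_docs / doc_freq[i]) if doc_freq[i] else 0.0
            for i, tf in enumerate(row)
        ])
    return weighted

counts = [[0, 2], [2, 4], [2, 1]]  # rows: doc1..doc3, columns: word1, word2
weights = tf_idf(counts)
for row in weights:
    print([round(w, 3) for w in row])
```

Here word2 occurs in all three documents, so log(N/ni) = log(3/3) = 0 and its weight is zero everywhere, while word1 (in two of three documents) keeps a positive weight where it occurs.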
d) Text Classification
• WEKA
• Training data is used to build a classification model
• This model is built from the pre-processed data
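The slides use WEKA for this step and do not say which classifier was chosen. As a stand-in only, the following plain-Python multinomial Naive Bayes with add-one smoothing illustrates what "building a model from pre-processed training data" can look like; it is not WEKA's API, and the class name and training examples are invented.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Tiny multinomial Naive Bayes with add-one smoothing (illustrative)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {tok for tokens in docs for tok in tokens}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for tok in tokens:
                if tok not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][tok]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Made-up, already pre-processed training data.
train_docs = [
    ["free", "prize", "claim"],
    ["free", "offer", "click"],
    ["meeting", "agenda", "monday"],
    ["lunch", "meeting", "today"],
]
train_labels = ["spam", "spam", "ham", "ham"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict(["free", "prize"]))      # spam
print(model.predict(["meeting", "agenda"]))  # ham
```

In practice the token lists would come out of the stop-word, stemming, and feature selection steps, and the model would be evaluated on emails held out from training.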
END