PowerPoint 프레젠테이션 - University at Buffalo

Download Report

Transcript PowerPoint 프레젠테이션 - University at Buffalo

E-Mail Filtering
Soonyeon Kim
1
Good Site for Data Mining

http://liinwww.ira.uka.de/bibliography/
- The Collection of
Computer Science Bibliographies

Major Conferences in Data Mining
- KDD 2000 of ACM SIGKDD
http://www.acm.org/sigs/sigkdd/kdd2000/
- SIGMOD 2000 of ACM SIGMOD

Other Conferences
- VLDB, IEEE ICDE, PAKDD conference
2
Text Mining:
Finding Nuggets in Mountains of
Textual Data

Author
- Jochen Dorre, Peter Gerstl, Roland Seiffert
- {doerre,gerstl,seiffert}@de.ibm.com
 Method to find this paper
- Searching from The Collection of
Computer Science Bibliographies
- key word used : Data mining & Text classification
3
Brief Description

What is Text Mining?
- same analytical functions of data mining to the domain of textual
information.
 How Text mining differs from Data mining?
- Data mining : addresses a very limited part of data (structured
information available in database)
- Text mining : helps to dig out the hidden gold from textual
information & requires the very complex feature extraction
function
 Describe in more detail the unique technologies that are key to
successful text mining
4
Ifile: An Application of Machine
Learning to E-mail Filtering

Author
- Jason D. M. Rennie
Artificial Intelligence Lab, MIT
- [email protected]
 Method to find this paper
- KDD 2000 of ACM SIGKDD
5
Outline of Paper





Introduction
- need for automated e-mail filtering
- Ishmail
- important issues regarding mail filtering
Mail Filtering
- Classification Efficiency
- Features
- Naïve Bayes algorithm
IFILE
Experiment
Conclusion
6
Introduction



Popular E-mail clients allow users to manage their mail into
folders by meaningful topic
- popular e-mail client : Netscape Messenger, Pine, Microsoft
Outlook, Eudora and EXMH
Ishmail
- purpose of a prioritization system
- alert the user when high-priority mail is arrived or a large
number of messages have accumulated in a lower-priority folder
Barrier
- implementation for mail filters
(speed efficiency, database size, collection of supervised training
data)
- integration into e-mail clients
7
Classification Efficiency



Traditional classification method
- kNN, C4.5, Naïve Bayes
Recent development
- SVM (Support Vector Machine), Maximum Entropy
discrimination)
Efficiency Problems
- SVM and MEM provide significant improvement in accuracy,
but at the cost of simplicity and time efficiency
- kNN : time to classify
8
Classification Efficiency(2)

Naïve Bayes
- efficient training, quick classification and extensibility to
iterative learning
- training : updating word counts
- classification : normalized sum of the counts corresponding to
the words in the document in question
9
Personal E-mail Filtering




Every user has a unique collection of e-mail
User organizes their e-mail in unique way
It pertains directly to his preference
Key fact for effective personal e-mail filtering
- using the information made through the user interface of
the mail client
10
Learning Architecture




Label is assigned to newly filtered e-mail message
Added to the classification model
Update the classification model : every filtered e-mail
is a training example
- assumed to be correct if user does not move the
message to another folder
- update the model if user moves misclassified mail
into the appropriate folder
Update for Naïve Bayes
- shift word counts from one folder to another
11
Features


Classification model act as a function f
F
C
F
C
f
- F : Features C: class
Mail filter is a special classifier
- F : characteristics of e-mail message C: mail folder
- by considering each e-mail message as a bag of
words function f maps an unordered set of words to a
folder name
12
Features(2)



Naïve Bayes keeps the track of word frequency statistics
Reduce the number of features for classification to make
filtering more efficient
Feature selection cutoff
- old, infrequent words are dropped
- word that occur fewer than log(age)-1 times should be
discarded from the model
- age : number of e-mail messages added to the model since
statistic has been kept for that word
e.q. if “baseball” occurred in the 1st document and occurred 5 or
fewer times in the next 63 document, the word and statistics
would be eliminated from database.
13
Maintaining Dictionary

Cutoff Algorithm
- word that occur fewer than log(age)-1 times should be
discarded from the model
e.q. “datamining” occurred in the 1st document
63 documents are coming after the document
age = 1 + 63 = 64
log(age) – 1 = 5
if “datamining” appears 5 or fewer times, the word and
statistics would be eliminated from database.
14
Maintaining Dictionary(2)
-----------.idata------------ABC
526
211
party 4 0:2 1:1
belch 3 0:1
yellow 4 0:2 2:3
kick 2 1:1
peep 1 2:2
list of folders(A:0 B:1 C:2)
total word instances
# of message
word age folder:frequency
two msg in A - "party party belch yellow yellow"
one msg in B - "party kick"
one msg in C - "peep peep yellow yellow yellow"
15
Word Selection

1.
2.
Header Trimming
E-mail
Body: content
Header : list of fields pertaining to the message
From: To: Subject:
- keep this part
Received: Date: Message-id
- remove
16
Tokenizing text

1.
2.
Two techniques
Using stop list
- decrease the amount of noise in the data by
eliminating uninformative words
e.g) pronoun, modifier, adverb
Stemming
- link together words which have the same root
e.g) serve, service, serves, served
=> same root serv
17
Naïve Bayes

What is Naïve Bayes?
-
Simple, yet effective classifier of text documents
Statistical Machine learning algorithm

Assumption
each document is considered as a set of words
Each word is independent
-
18
Naïve

st
Bayes-1
step
Probability of d having been generated by ci
P(ci | d ) 
P(ci ) wj  dP( wj | ci )
P( d )
- With the assumption that attribute values are
independent,
P(a1, a2..an | vj )  iP(ai | vj )
19
Naïve Bayes-second step

Computing P(ci|d) for all classes

Find the class to be classified

Maximum likelyhood
CNB  arg max ci  CP(ci ) P(wj | ci )
wjd
- Probability values are only used for comparison
Purpose, P(d) can be dropped
20
Naïve Bayes

-
M-estimate
purpose : to give a reasonable probability in the case
of sparse data
nj  1
P( wj | ci ) 
n | Vocabulary|
-nj : number of instances of wj in the documents of class ci
-n : total number of words in documents of class ci
-|Vocabulary| : number of distinct words
21
Experimental Result


Information about the e-mail corpora on which
classification experiments were conducted.
Four volunteers including author
22
Experimental Result
23
Experimental Result

1.
2.
3.
4.
5.
6.
7.
Individual Experiments with different setting
Alpha lexer, stoplist used, header trimming, feature
selection, no stemming
Alpha only lexer replaces alpha lexer
White lexer replaces alpha lexer
No stoplist is used
Stemming is used
No feature selection is used
All headers are used for classification purposes
24
Experimental Result

1.
2.
3.
Three Lexers
Alpha lexer
- default lexer
- tokenizes strings of alphabetic characters
Alpha only lexer
- tokenizes only strings of alphabetic characters
- does not lex e-mail addresses into tokens
White lexer
- tokenizes strings separated by whitespace
25
Experimental Result


Result
- No experimental environment setting provide the
best results across all users
Experiment with highest average accuracy
- experiment #1 shows the best average result (89%
accuracy)
- ranging from 86% to 91%
26
Experimental Result


Time Efficiency
- Naïve Bayes : “fast enough”
- 27 seconds to build a model of 7000+ e-mail
messages (average 259 msg/second)
(tar-gzip of same msg requires 17 seconds)
Space Efficiency
- classification model built on 7000+ messages across
49 folders requires only 447,090 bytes
27
Filtering Junk E-Mail
Soonyeon Kim
28
A Bayesian Approach to Filtering
Junk E-Mail

Authors
- Mehran Sahami, Susan Dumais, David Heckerman, Eric
Horvitz
- Stanford University & Microsoft Researh

From
- AAAI 98 (American Association for Artificial Intelligence
29
Problems of Junk-mail

Wasting time
- Many users must now spend a non-trivial portion of their
time because of unwanted messages
 Content of Material
- Some of these messages can contain offensive material
such as graphic pornography
 Space problem
-Junk-mails also quickly fill up file server storage space
30
Machine Learning Approach



Learning
- system S learns from experience E with respect to a class of tasks
T and performance P
Learning in junk-mail
S : E-mail classifier
T : classify an e-mail message as junk/legitimate
P : fraction of correct prediction
E : a set of pre-classified e-mail messages
Vector Space Model
- to represent mail messages as feature vectors
- e-mail message has single fixed-length feature vector
- individual message can be represented as a binary vector
denoting which word are present or absent. (1 for present 0 for
absent)
31
Bayesian Classifier



e-mail message as a vector of N features
X = X1, X2, X3, ..., XN
- For example, X42 might be ‘the e-mail contains “money”’
- x42=0 means “the message described by x does not contain the
“money”’.
classify messages in K classes
C = {c1 , c2} = {junk, legit} (K=2)
Now suppose we see a new e-mail message, with encoding x. We
seek the probability that the class C is junk,
Pr[C=junk | X=x]
shorthand for Pr[C=junk | X1=x1 & X2=x2 & … & XN=xN]
32
Bayesian networks
(a) a Naïve Bayesian classifier
(b) a more complex Bayesian classifier with limited
dependencies between the features
33
Bayesian Rule

Bayes theorem
P( X  x | C  ck ) P(C  ck )
P(C  ck | X  x) 
P( X  x)

ssume that each Xi is independent
P( X  x | C  ck )   P( Xi  xi | C  ck )
34
Features
1.
2.
Words
- fixed width vector <X = X1, X2,…, Xn>
Hand-crafted
Phrasal Features
- “FREE!”, “only $” ( as in “only $4.95”) and “be over 21”
- 35 such hand-crafted phrases are included
Domain-specific features
- domain type of sender (.edu or .com)
- junk mail is usually not from .edu domain
Resolving familiar E-mail address
- i.e. replace [email protected] with Susan Dumais
Time
- most junk E-mail is sent at night
35
Features(2)
Peculiar punctuation
- percentage of non-alphanumeric characters in the subject of a mail
- “$$$$$ MONEY $$$$$”
X : subject has peculiar
punctuation
Y : pct of total messages
36
Feature Selection

Mutual Information
- Mutual information MI(A,B) is a numeric measure of what we
can conclude about A if we know B, and vice-versa.
- Example: If A and B are independent, then we can’t conclude
anything: MI(A, B) = 0
P( Xi, C )
MI ( Xi; C )   P( Xi, C ) log
P( Xi) P(C )
Xi x C c
- Select 500 features with greatest value
37
Evaluation

1.
2.
3.
Three ways
Using Domain-specific Features
- Words only
- Words + Phrases
- Words + Phrases + Extra Features
Three way Categorization
- 3 categories {porn-junk, other-junk, legit}
instead of 2 categories {junk, legit}.
“Real” scenario
38
Using different features




The cost of missing legitimate email is much higher than the cos
ting of inadvertently reading junk.
The authors wanted to make their system very “optimistic” so th
at it only predicts “junk” if it is very certain -- uses threshold 99.
9%.
1789 hand-tagged e-mail messages
–
1578 junk
–
211 legit
Split into…
–
1538 training messages (86%)
–
251 testing messages (14%)
39
Using different features

Result of experiment
- words only
- words + 35 phrasal features
- words + phrasal features + 20 non-textual domain-specific
features
40
Using different features
predict Junk
predict Legit
actually Junk
A (true positives)
B (false negatives)
actually Legit
C (false positives)
D (true negatives)
Junk Precision = A / (A + C)
Legit Precision = D / (D + B)
Junk Recall = A / (A + B)
Legit Recall = D / (D + C)
Junk precision is of greatest concern to most users, because they
would not want their legitimate mail discarded as junk
41
Using different features

Precision/ Recall curves for junk mail
42
Sub-classes of junk E-mail



Three way Categorization
- 3 categories {porn-junk, other-junk, legit}
instead of 2 categories {junk, legit}
Consider that classifying is correct if any “junk” messages is cla
ssifed either “porn-junk” or “other-junk”
Unfortunately, it didn’t work!
- Probably because more parameters means need (exponentially
!) more data to estimate them accuractely
- some feature may be very clearly indicative of junk versus legit
imate, but may not be powerful in three categories
(they do not distinguish well between the sub-classes of junk
43
Sub-classes of junk E-mail

Precision/recall curves considering sub-groups of
junk mail
44
Real Usage Scenario



Three kinds of messages
1. Read and keep
2. Read and discard (ex. Joke from a friend)
3. Junk
Result
Misclassified mails – news stories from a e-mail news service
that the user subscribes to. (No loss of significant information)
45
Real Usage Scenario

Precision/recall curves in real usage scenario
46