ApMl (All Purpose Machine Learning) Toolkit
David W. Miller and Helen Howell
Semantic Web Final Project
Spring 2002
Department of Computer Science
University of Georgia
www.cs.uga.edu/~miller/SemWeb
www.cs.uga.edu/~helen/SemWeb/SemWeb.html
What Has Been Done
• Extensive research into the effectiveness of machine learning algorithms has been performed
– Train the system on an expert-created taxonomy with expert-specified documents
2
What We Did
• Train the system on a domain-specific taxonomy
– e.g., CNN’s Sports Pages
• Test the system’s ability to correctly classify documents from a second, yet similar taxonomy
– e.g., Yahoo! Sports Pages
3
Automatic Text Classification via Statistical Methods
Text categorization is the problem of assigning predefined categories to free-text documents.
Statistical learning methods used in ApMl:
• Bayes Method
• Rocchio Method (most popular)
• K-Nearest Neighbor Classification
• Probabilistic Indexing
4
A Probabilistic Generative Model
• Define a probabilistic generative model for “bag-of-words” documents with classes.
• Example document, “Reinforcement Learning: a Survey”: “This paper surveys the field of reinforcement learning from a computer science perspective.”
• Its bag-of-words representation keeps only per-word counts (a Python sketch follows), e.g.: a: 35, block: 1, computer: 12, field: 4, leg: 1, machine: 7, of: 44, paper: 3, perspective: 2, rate: 1, reinforcement: 5, science: 9, survey: 2, the: 56, this: 11, underrated: 1, …
Automatic Text Classification through Machine Learning, McCallum et al.
5
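To make the bag-of-words idea concrete, here is a minimal Python sketch; the variable names are illustrative, and the text is just the example sentence from this slide:

```python
from collections import Counter

text = ("This paper surveys the field of reinforcement learning "
        "from a computer science perspective.")

# Lowercase, strip the trailing period, and split on whitespace;
# the bag-of-words representation keeps only per-word counts.
bag = Counter(text.lower().rstrip(".").split())

print(bag["reinforcement"])  # 1 in this sentence; 5 across the whole paper
```

Word order is discarded entirely; only the count vector remains, as in the table above.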
Bayes Method
Pick the most probable class, given the evidence:

c = \arg\max_{c_j} \Pr(c_j \mid d)

where c_j is a class (like “Planning”) and d is a document (like “language intelligence proof...”).

Bayes’ Rule:

\Pr(c_j \mid d) = \frac{\Pr(c_j) \Pr(d \mid c_j)}{\Pr(d)}

i.e., the probability that category c_j should be assigned to document d.

Automatic Text Classification through Machine Learning, McCallum et al.
6
Bayes Rule
\Pr(c_j \mid d) = \frac{\Pr(c_j) \Pr(d \mid c_j)}{\Pr(d)}

• \Pr(c_j \mid d): probability that document d belongs to category c_j
• \Pr(d): probability that a randomly picked document has the same attributes as d
• \Pr(c_j): probability that a randomly picked document belongs to category c_j
• \Pr(d \mid c_j): probability that category c_j contains document d
7
Bayes Method
• Generates conditional probabilities of particular words occurring in a document, given that it belongs to a particular category.
• A larger vocabulary generates better probabilities.
• Each category is given a threshold p against which the worthiness of a document to fall in that classification is judged.
• Documents may fall into one, more than one, or no category (see the sketch below).
8
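A minimal sketch of the method just described: multinomial naive Bayes with a per-category threshold p, so a document can land in one, several, or no categories. This is an illustration of the general technique, not ApMl’s or rainbow’s actual code; the Laplace smoothing and the softmax normalization are assumptions, since the slides leave those details open.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def __init__(self):
        self.doc_counts = Counter()              # documents per category
        self.word_counts = defaultdict(Counter)  # word counts per category
        self.vocab = set()
        self.total_docs = 0

    def train(self, labeled_docs):
        """labeled_docs: iterable of (word_list, category) pairs."""
        for words, cat in labeled_docs:
            self.doc_counts[cat] += 1
            self.total_docs += 1
            self.word_counts[cat].update(words)
            self.vocab.update(words)

    def _log_score(self, words, cat):
        """log Pr(c_j) + log Pr(d | c_j), with Laplace smoothing (assumed)."""
        score = math.log(self.doc_counts[cat] / self.total_docs)
        total = sum(self.word_counts[cat].values())
        for w in words:
            score += math.log((self.word_counts[cat][w] + 1)
                              / (total + len(self.vocab)))
        return score

    def classify(self, words, p=0.5):
        """Return every category whose normalized posterior reaches its
        threshold p; Pr(d) cancels out in the normalization."""
        logs = {c: self._log_score(words, c) for c in self.doc_counts}
        m = max(logs.values())
        norm = sum(math.exp(v - m) for v in logs.values())
        return [c for c, v in logs.items() if math.exp(v - m) / norm >= p]
```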
Rocchio Method
• Each document D is represented as a vector within a given vector space V:

\vec{d} = (d^{(1)}, \ldots, d^{(|F|)})

• Documents with similar content have similar vectors.
• Each dimension of the vector space represents a word selected via a feature selection process.
9
Rocchio Method
• Values of d^{(i)} for a document d are calculated as a combination of the statistics TF(w, d) and DF(w).
• TF(w, d) (term frequency) is the number of times word w occurs in document d.
• DF(w) (document frequency) is the number of documents in which the word w occurs at least once.
10
Rocchio Method
• The inverse document frequency is calculated as

IDF(w) = \log\left(\frac{|D|}{DF(w)}\right)

• The value d^{(i)} of feature w_i for a document d is calculated as the product

d^{(i)} = TF(w_i, d) \cdot IDF(w_i)

• d^{(i)} is called the weight of the word w_i in the document d.
11
Rocchio Method
• Based on word-weight heuristics, the word w_i is an important indexing term for a document d if it occurs frequently in that document.
• However, words that occur frequently in many documents spanning many categories are rated less important (see the sketch below).
12
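A short sketch of the TF-IDF weighting defined on the last two slides; the function and variable names are illustrative, not ApMl’s:

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """documents: list of token lists; returns one {word: weight} dict per
    document, with d(i) = TF(w_i, d) * IDF(w_i) and IDF(w) = log(|D|/DF(w))."""
    df = Counter()
    for doc in documents:
        df.update(set(doc))   # DF(w): one count per document containing w

    n = len(documents)
    vectors = []
    for doc in documents:
        tf = Counter(doc)     # TF(w, d): occurrences of w in d
        vectors.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vectors
```

Note that a word occurring in every document gets IDF zero, which realizes the heuristic above: terms spanning the whole collection carry no weight.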
K-Nearest Neighbor
• Features
– All instances correspond to points in an n-dimensional Euclidean space
– Classification is delayed until a new instance arrives
– Classification is done by comparing feature vectors of the different points
– The target function may be discrete or real-valued
K-Nearest Neighbor Learning, Dipanjan Chakraborty
13
1-Nearest Neighbor
K-Nearest Neighbor Learning, Dipanjan Chakraborty
14
K-Nearest Neighbor
• An arbitrary instance is represented by (a_1(x), a_2(x), a_3(x), ..., a_n(x)), where a_i(x) denotes the features.
• Euclidean distance between two instances:

d(x_i, x_j) = \sqrt{\sum_{r=1}^{n} \left(a_r(x_i) - a_r(x_j)\right)^2}

• Find the k nearest neighbors whose distance from the test case falls within a threshold p.
• If x of those k nearest neighbors are in category c_i, then assign the test case to c_i; otherwise it is unmatched (see the sketch below).
K-Nearest Neighbor Learning, Dipanjan Chakraborty
15
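Putting the distance formula and the voting rule on this slide together, a minimal sketch; the tie-breaking and threshold handling are guesses at details the slide leaves unspecified:

```python
import math
from collections import Counter

def euclidean(xi, xj):
    """d(x_i, x_j) = sqrt(sum over r of (a_r(x_i) - a_r(x_j))^2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

def knn_classify(training, test, k, p, x):
    """training: list of (feature_vector, category) pairs.
    Keep the k nearest neighbors within distance threshold p; assign the
    test case to category c_i only if at least x of them belong to c_i,
    otherwise report it as unmatched (None)."""
    by_distance = sorted(training, key=lambda pair: euclidean(pair[0], test))
    nearest = [cat for vec, cat in by_distance[:k]
               if euclidean(vec, test) <= p]
    votes = Counter(nearest)
    if votes:
        cat, count = votes.most_common(1)[0]
        if count >= x:
            return cat
    return None  # unmatched
```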
Probabilistic Indexing
• Goal is to estimate P(C | s_i, d_m)
– the probability that the assignment of term s_i to the document d_m is correct
• Once terms have been identified, assign a Form of Occurrence (FOC)
– certainty that the term is correctly identified
– significance of the term
16
Probabilistic Indexing Cont.
• If term t appears in document d and a term descriptor from t to s exists, with s an indexing term, then generate a descriptor indicator.
• The set of generated term descriptors can be evaluated and a probability calculated that document d lies in class c (a hedged sketch follows).
17
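The slides leave the combination step open, so the following is only a rough sketch: descriptor indicators are generated exactly as described above, but combining their probabilities with a noisy-OR rule is an assumption of mine, not something stated here or attributed to Fuhr and Pfeifer’s formulation.

```python
def class_probability(document_terms, descriptors, indicator_prob):
    """descriptors: {term t: indexing term s}; indicator_prob(s, t) is an
    assumed callable estimating the probability that the generated
    descriptor indicator is correct. Combination by noisy-OR (assumed):
    the document misses the class only if every indicator is wrong."""
    miss = 1.0
    for t in document_terms:
        if t in descriptors:           # a descriptor from t to s exists
            s = descriptors[t]
            miss *= 1.0 - indicator_prob(s, t)
    return 1.0 - miss
```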
ApMl Toolkit
• Built on top of and extends existing toolkits
– rainbow (CMU) – Machine Learning
– wget (GNU) – Web Crawler
• 4 Machine Learning Algorithms and 2 Classification Committees
• Web Crawler and Document Retrieval
• Automated Testing
18
Machine Learning Components
• 4 Machine Learning Algorithms (rainbow)
– Naïve Bayes, Rocchio, KNN, Probabilistic Indexing
• 2 Classification Committees (ApMl)
– Weight assigned for overall accuracy
– Weights assigned for accuracy within each class of the taxonomy (see the sketch below)
19
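The committee mechanics are not spelled out on this slide; the sketch below shows one plausible reading, where each member classifier’s vote is weighted either by its overall accuracy (committee 1) or by its accuracy within the predicted class (committee 2). All names here are illustrative, not ApMl’s.

```python
from collections import defaultdict

def committee_classify(members, document, per_class=False):
    """members: list of (classifier, weight) pairs, where classifier is a
    callable document -> category. For committee 1 the weight is a single
    overall-accuracy float; for committee 2 (per_class=True) it is a
    {category: accuracy} dict."""
    votes = defaultdict(float)
    for classify, weight in members:
        cat = classify(document)
        votes[cat] += weight.get(cat, 0.0) if per_class else weight
    return max(votes, key=votes.get)
```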
20
21
Document Retrieval
• Web Crawler and Document Retrieval
– Specify Starting URL
– Specify Recursion Depth
– Allow Multiple Domain Spanning
– Specify Excluded Domains
– Store all retrieved pages into a single directory (ApMl) – see the sketch below
22
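A sketch of how the crawler options listed above map onto GNU wget. The flags shown are standard wget options; the exact invocation ApMl builds is not given in the slides.

```python
import subprocess

def crawl(start_url, depth, excluded_domains, output_dir):
    """Fetch pages recursively and flatten them into one directory."""
    subprocess.run([
        "wget",
        "--recursive",                  # follow links from the starting URL
        f"--level={depth}",             # recursion depth
        "--span-hosts",                 # allow crossing into other domains
        f"--exclude-domains={','.join(excluded_domains)}",
        "--no-directories",             # store everything in one directory
        f"--directory-prefix={output_dir}",
        start_url,
    ], check=True)
```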
23
Automated Testing
• Choose Algorithms to Test
• Choose Test Directory
• Specify Number of Tests
• All results are placed into a persistent window for evaluation
24
25
Effectiveness: Contingency Table

                 Truth
               Yes    No
System  Yes     a      b
        No      c      d
Machine Learning for Text Classification, David D. Lewis, AT&T Labs
26
Effectiveness Measures
                 Truth
               Yes    No
System  Yes     a      b
        No      c      d
• precision = a/(a+b)
– documents classified correctly vs. all classified as a particular category
• recall = a/(a+c)
– documents classified correctly vs. all that should have been classified in a category
• accuracy = (a+d)/(a+b+c+d)
– all documents classified correctly as positive or negative in a category vs. all classified (computed in the sketch below)
Machine Learning for Text Classification, David D. Lewis, AT&T Labs
27
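The three measures above in code form, as a direct transcription of the formulas (assuming non-degenerate denominators):

```python
def effectiveness(a, b, c, d):
    """a, b, c, d follow the contingency table above: rows are the
    system's yes/no decisions, columns are the truth."""
    precision = a / (a + b)
    recall = a / (a + c)
    accuracy = (a + d) / (a + b + c + d)
    return precision, recall, accuracy
```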
Test Plan
• Choose two areas and selected subcategories
– Sports
• Football
• Tennis
• Golf
• NBA
– Health
• Children
• Men
• Women
28
Test Plan Continued
• Sports Web Sites
– www.sportsillustrated.com
– sports.yahoo.com
– www.usatoday.com/sports/sfront.htm
• Health Web Sites
– www.patient.co.uk
– www.cdc.gov/health
– www.bbc.co.uk/health
29
Test Plan Continued
• Train the system on pages from one taxonomy from one domain and test on another taxonomy for the same area
• Determine contingency tables for each category
• Compute effectiveness using precision, recall, and accuracy
30
Sports Test Results
[Bar chart “ApMl Test Results”: precision and recall, scale 0.00-0.90, for Bayes, K Nearest, Rocchio, Prob, Com 1, and Com 2.]
31
Health Test Results
[Bar chart “ApMl Test Results”: precision and recall, scale 0.00-0.35, for Bayes, K Nearest, Rocchio, Prob, Com 1, and Com 2.]
32
Comparison of Precision
[Bar chart “ApMl Test Results”: Sports Precision vs. Health Precision, scale 0-0.9, for Bayes, K Nearest, Rocchio, Prob, Com 1, and Com 2.]
33
Comparison of Recall
[Bar chart “ApMl Test Results”: Sports Recall vs. Health Recall, scale 0-0.8, for Bayes, K Nearest, Rocchio, Prob, Com 1, and Com 2.]
34
Comparison of Sports Additional Levels
[Bar chart “ApMl Test Results”: Sports Precision (50), Sports Recall (50), Sports Precision (200), Sports Recall (200), scale 0-0.9, for Bayes, K Nearest, Rocchio, Prob, Com 1, and Com 2.]
35
Comparison of Health Additional Levels
[Bar chart “ApMl Test Results”: Health Precision (30), Health Recall (30), Health Precision (60), Health Recall (60), scale 0-0.4, for Bayes, K Nearest, Rocchio, Prob, Com 1, and Com 2.]
36
Comparison of Accuracy
[Bar chart “ApMl Test Results”: Sports (50), Sports (200), Health (30), Health (60), scale 0-1, for Bayes, K Nearest, Rocchio, Prob, Com 1, and Com 2.]
37
Trends of Results
• K-Nearest Neighbor effectiveness was significantly lower than the other algorithms
– it continually assigned documents to the same category
• The Health class was much more difficult for the algorithms to correctly categorize
– children’s health is a non-gender class
• No improvement in our results with additional training
38
Conclusions
• Results of automatic text categorization are subjective
• Trends can occur because of various factors
• Heterogeneous taxonomies can be used for automatic classification with acceptable effectiveness
• More research is needed
39
Resources
1. Dipanjan Chakraborty. “K-Nearest Neighbor Learning.” A PowerPoint presentation.
2. Norbert Fuhr and Ulrich Pfeifer. “Combining Model-Oriented and Description-Oriented Approaches for Probabilistic Indexing.” Proceedings of the Fourteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 46-56. ACM, New York, 1991.
3. Thorsten Joachims. “A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization.” Technical report, CMU, March 1996.
4. Fabrizio Sebastiani. “Machine Learning in Automated Text Categorization.” ACM Computing Surveys, 34(1):1-47, 2002.
5. Amit Sheth et al. “Semantic Web Content Management for Enterprises and the Web.” In submission to IEEE Internet Computing.
40