Supervised learning for text


Lecture 6: Supervised Learning for Text
(Chap. 5, Chakrabarti)
Wen-Hsiang Lu (盧文祥)
Department of Computer Science and Information Engineering,
National Cheng Kung University
2004/11/4
Organizing knowledge
 Systematic knowledge structures
 Ontologies
• The Dewey Decimal System, the Library of Congress catalog, the AMS Mathematics Subject Classification, and the US Patent subject classification
 Web catalogs
• Yahoo & Dmoz
 Problem: Manual maintenance
Medical Subject Heading (MeSH)
Topic Tagging
 Finding similar documents
 Guiding queries
 Naïve Approach:
• Syntactic similarity between documents
 Better approach
• Topic tagging
Topic Tagging
 Advantages
• Increase vocabulary of classes
• Hierarchical visualization and browsing aids
 Applications
• Email/Bookmark organization
• News Tracking
• Tracking authors of anonymous texts (e.g., the Flesch-Kincaid index)
• Classifying the purpose of hyperlinks
Supervised learning
 Learning to assign objects to classes, given examples
 Learner (classifier)
A typical supervised text learning scenario.
Differences with text
 ML (machine learning) classification techniques are used for structured data
 Text: lots of features and lots of noise
 No fixed number of columns
 No categorical attribute values
 Data scarcity
 Larger number of class labels
 Hierarchical relationships between classes, less systematic than in structured data
Techniques
 Nearest Neighbor Classifier
• Lazy learner: remembers all training instances
• Decision on a test document: the distribution of labels on the training documents most similar to it
• Assigns large weights to rare terms
 Feature selection
• Removes terms in the training documents which are statistically uncorrelated with the class labels
 Bayesian classifier
• Fit a generative term distribution Pr(d|c) to each class c of documents {d}
• Testing: the distribution most likely to have generated a test document is used to label it
Other Classifiers
 Maximum entropy classifier:
• Estimate a direct distribution Pr(c|d) from term space to the probability of the various classes
 Support vector machines:
• Represent classes by numbers
• Construct a direct function from term space to the class variable
 Rule induction:
• Induce rules for classification over diverse features
• E.g.: information from ordinary terms, the structure of the HTML tag tree in which terms are embedded, link neighbors, citations
Other Issues
 Tokenization
• E.g.: replacing monetary amounts by a special token
 Evaluating text classifiers
• Accuracy
• Training speed and scalability
• Simplicity, speed, and scalability for document modifications
• Ease of diagnosis, interpretation of results, and adding human judgment and feedback
Benchmarks for accuracy
 Reuters
• 10700 labeled documents
• 10% documents with multiple class labels
 OHSUMED
• 348566 abstracts from medical journals
 20NG
• 18800 labeled USENET postings
• 20 leaf classes, 5 root level classes
 WebKB
• 8300 documents in 7 academic categories.
 Industry
• 10000 home pages of companies from 105 industry sectors
• Shallow hierarchies of sector names
Measures of accuracy
 Assumptions
• Each document is associated with exactly one class, or
• Each document is associated with a subset of classes
 Confusion matrix (M)
• For more than 2 classes
• M[i, j]: number of test documents belonging to class i which were assigned to class j
• Perfect classifier: only the diagonal elements M[i, i] would be nonzero
Evaluating classifier accuracy
 Two-way ensemble
• To avoid searching over the power set of class labels in the subset scenario
• Create a positive class $C_d$ and a negative class $\bar{C}_d$ for each document d (e.g., “Sports” and “Not sports”, i.e., all remaining documents)
 Recall and precision
• A 2×2 contingency matrix per (d, c) pair:
$M_{d,c}[0,0] = [\,c \in C_d \text{ and classifier outputs } c\,]$
$M_{d,c}[0,1] = [\,c \in C_d \text{ and classifier does not output } c\,]$
$M_{d,c}[1,0] = [\,c \notin C_d \text{ and classifier outputs } c\,]$
$M_{d,c}[1,1] = [\,c \notin C_d \text{ and classifier does not output } c\,]$
Evaluating classifier accuracy (contd.)
• Micro-averaged contingency matrix (equal importance for each document):
$M^\mu = \sum_{d,c} M_{d,c}$
• Micro-averaged precision and recall:
$M^\mu(\text{precision}) = \frac{M^\mu[0,0]}{M^\mu[0,0] + M^\mu[1,0]}, \qquad M^\mu(\text{recall}) = \frac{M^\mu[0,0]}{M^\mu[0,0] + M^\mu[0,1]}$
• Macro-averaged precision and recall (equal importance for each class): with $M_c = \sum_d M_{d,c}$,
$M_c(\text{precision}) = \frac{M_c[0,0]}{M_c[0,0] + M_c[1,0]}, \qquad M_c(\text{recall}) = \frac{M_c[0,0]}{M_c[0,0] + M_c[0,1]}$
averaged over classes as $\frac{1}{|C|}\sum_c$
Evaluating classifier accuracy (contd.)
• Precision-recall tradeoff
 Plot of precision vs. recall: a better classifier has a higher curve
 Harmonic mean: discard classifiers that sacrifice one measure for the other:
$F_1 = \frac{2 \cdot \text{recall} \cdot \text{precision}}{\text{recall} + \text{precision}}$
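To make the averaging concrete, here is a small illustrative sketch (mine, not from the book) that computes micro/macro precision, recall, and F1 from per-document true and predicted label sets, following the contingency-cell definitions above.

```python
def evaluate(truth, predicted, classes):
    """truth, predicted: dicts mapping document id -> set of class labels."""
    # Per-class counts of the 2x2 contingency cells:
    # tp = M[0,0], fp = M[1,0], fn = M[0,1].
    tp = {c: 0 for c in classes}
    fp = {c: 0 for c in classes}
    fn = {c: 0 for c in classes}
    for d in truth:
        for c in classes:
            in_truth, in_pred = c in truth[d], c in predicted[d]
            if in_truth and in_pred:
                tp[c] += 1
            elif in_pred:
                fp[c] += 1
            elif in_truth:
                fn[c] += 1
    # Micro-averaging: sum the per-(d, c) matrices first, then take ratios.
    TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro_p = TP / (TP + FP) if TP + FP else 0.0
    micro_r = TP / (TP + FN) if TP + FN else 0.0
    # Macro-averaging: per-class ratios, then an unweighted mean over classes.
    macro_p = sum(tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
                  for c in classes) / len(classes)
    macro_r = sum(tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
                  for c in classes) / len(classes)
    # Harmonic mean of the micro-averaged precision and recall.
    f1 = (2 * micro_p * micro_r / (micro_p + micro_r)
          if micro_p + micro_r else 0.0)
    return micro_p, micro_r, macro_p, macro_r, f1
```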
Nearest Neighbor classifiers
 Intuition
• Similar documents are expected to be assigned the same class label
• Vector space model + cosine similarity
• Training:
 Index each document and remember its class label
• Testing:
 Fetch the k most similar documents to the given document
– Majority class wins
– Alternative: weighted counts, i.e., counts of classes weighted by the corresponding similarity measure
– Alternative: a per-class offset $b_c$, tuned by testing the classifier on a portion of the training data held out for this purpose
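As an illustration of this procedure, a minimal k-NN sketch (mine, not the book's; a real system would use the inverted index rather than scanning all training documents):

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity between two sparse term->weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(test_vec, train, k=30, offsets=None):
    """train: list of (term->weight dict, label) pairs; offsets: optional
    per-class offsets b_c tuned on held-out training data."""
    # Similarity-weighted votes from the k most similar training documents.
    top = sorted(((cosine(test_vec, vec), label) for vec, label in train),
                 reverse=True)[:k]
    score = defaultdict(float)
    for sim, label in top:
        score[label] += sim
    for c, b in (offsets or {}).items():
        score[c] += b
    return max(score, key=score.get)
```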
Nearest neighbor classification
Pros
 Easy availability and reuse of inverted index
 Collection updates trivial
 Accuracy comparable to best known
classifiers
Cons
 Iceberg category questions: classifying a test document $d_q$ involves
• as many inverted index lookups as there are distinct terms in $d_q$,
• scoring the (possibly large number of) candidate documents which overlap with $d_q$ in at least one word,
• sorting by overall similarity,
• picking the best k documents
 Space overhead and redundancy
• Data stored at the level of individual documents
• No distillation
Workarounds
 To reduce space requirements and speed up classification
• Find clusters in the data
• Store only a few statistical parameters per cluster
• Compare with documents in only the most promising clusters
 Again…
• Ad hoc choices for the number and size of clusters and parameters
• k is corpus-sensitive
TF-IDF
 TF-IDF done for whole corpus
 Interclass correlations and term
frequencies unaccounted for
 Terms which occur relatively frequently in
some classes compared to others should
have higher importance
 Overall rarity in the corpus is not as
important.
Feature selection
 Data sparsity
• Term distributions could be estimated well if the training set were much larger than the test set
• Not the case, however…
• A vocabulary W allows $2^{|W|}$ possible documents (binary model)
• For Reuters, only about 10300 documents are available
 Over-fitting problem
• The joint distribution may fit the training instances…
• but may not fit unforeseen test data that well
Marginals rather than joint
 Marginal distribution of each term in each class
 Empirical distributions may still not reflect actual distributions if data is sparse
 Therefore, feature selection
• Purposes:
 Improve accuracy by avoiding overfitting
 Maintain accuracy while discarding as many features as possible, to save a great deal of space for storing statistics
• Heuristic, guided by linguistic and domain knowledge, or statistical
Feature selection
 Perfect feature selection
• goal-directed: pick all possible subsets of features; for each subset, train and test a classifier; retain the subset which results in the highest accuracy
• COMPUTATIONALLY INFEASIBLE
 Simple heuristics
• Stop words like “a”, “an”, “the”, etc.
• Empirically chosen thresholds (task- and corpus-sensitive) for discarding “too frequent” or “too rare” terms
 Larger and more complex data sets
• Confusion with stop words
• Especially for topic hierarchies
 Greedy inclusion (bottom-up) vs. truncation (top-down)
Greedy inclusion algorithm
 Most commonly used in text
 Algorithm:
1. Compute, for each term, a measure of discrimination among classes.
2. Arrange the terms in decreasing order of this measure.
3. Retain a number of the best terms or features for use by the classifier.
 Greedy because
• the measure of discrimination of a term is computed independently of other terms
• Over-inclusion: mild effects on accuracy
Measure of discrimination
• Dependent on
• the model of documents
• the desired speed of training
• the ease of updates to documents and class assignments
• Observation
• feature sets included for acceptable accuracy tend to have large overlap
The χ² test
• Similar to the likelihood ratio test
• Build a 2×2 contingency matrix per class-term pair:
$k_{i,0}$ = number of documents in class i not containing term t
$k_{i,1}$ = number of documents in class i containing term t
$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$
 where the expected counts $E_{ij}$ are computed under the independence hypothesis
• Aggregates the deviations of observed values from expected values
• The larger the value of χ², the lower our belief that the independence assumption is upheld by the observed data
The χ² test (contd.)
• With n documents, class variable C ∈ {0, 1}, term indicator $I_t \in \{0, 1\}$, and $k_{l,m}$ counting documents with C = l and $I_t = m$ (a 2×2 table with cells $k_{00}, k_{01}, k_{10}, k_{11}$):
$\chi^2 = \sum_{l,m} \frac{[k_{l,m} - n \Pr(C = l)\Pr(I_t = m)]^2}{n \Pr(C = l)\Pr(I_t = m)} = \frac{n\,(k_{11} k_{00} - k_{10} k_{01})^2}{(k_{11} + k_{10})(k_{01} + k_{00})(k_{11} + k_{01})(k_{10} + k_{00})}$
• Feature selection process
• Sort terms in decreasing order of their χ² values
• Train several classifiers with varying numbers of features
• Stop at the point of maximum accuracy
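A small sketch (mine, not the book's) of χ² scoring from the four contingency counts, used to rank terms before the cut-off search:

```python
def chi2_score(k00, k01, k10, k11):
    """Chi-square statistic for one (class, term) pair from the 2x2 counts
    above (k_lm: documents with class l and term-indicator m)."""
    n = k00 + k01 + k10 + k11
    num = n * (k11 * k00 - k10 * k01) ** 2
    den = (k11 + k10) * (k01 + k00) * (k11 + k01) * (k10 + k00)
    return num / den if den else 0.0

def rank_by_chi2(term_counts):
    """term_counts: term -> (k00, k01, k10, k11). Returns terms sorted in
    decreasing order of chi-square, ready for the cut-off search."""
    return sorted(term_counts,
                  key=lambda t: chi2_score(*term_counts[t]), reverse=True)
```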
Mutual information
• Useful when the multinomial document model is used
• X and Y are discrete random variables taking values x, y
• The mutual information (MI) between them is defined as
$MI(X, Y) = \sum_{x,y} \Pr(x, y) \log \frac{\Pr(x, y)}{\Pr(x)\Pr(y)}$
• A measure of the extent of dependence between random variables:
• the extent to which the joint deviates from the product of the marginals,
• weighted with the distribution mass at (x, y)
Mutual Information
 Advantages
• To the extent MI(X, Y) is large, X and Y are dependent
• Deviations from independence at rare values of (x, y) are played down
• Interpretations
• Reduction in the entropy of Y given X:
$MI(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$
• KL (Kullback-Leibler) distance between the non-independence and independence hypotheses
• The KL distance gives the average number of bits wasted by encoding events from the ‘correct’ distribution using a code based on a not-quite-right distribution
Feature selection with MI
• Fix a term t and let $I_t$ be an event associated with that term
• E.g., for the binary model, $I_t = 0/1$
• $\Pr(I_t)$ = the empirical fraction of documents in the training set in which event $I_t$ occurred
• $\Pr(I_t, c)$ = the empirical fraction of training documents which are in class c and in which event $I_t$ occurred
• $\Pr(c)$ = fraction of training documents belonging to class c
• Formula:
$MI(I_t, C) = \sum_{l,m} \frac{k_{l,m}}{n} \log \frac{k_{l,m}/n}{(k_{l,0} + k_{l,1})(k_{0,m} + k_{1,m})/n^2}$
• Problem: document lengths are not normalized
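A direct transcription of this formula into code (an illustrative sketch; the counts k[l][m] are as in the χ² slide):

```python
import math

def mi_score(k, n):
    """Mutual information MI(I_t, C) from counts k[l][m] (l: class 0/1,
    m: term event 0/1) and n total documents, per the formula above."""
    mi = 0.0
    for l in (0, 1):
        for m in (0, 1):
            p_lm = k[l][m] / n
            if p_lm == 0.0:
                continue                      # 0 * log(...) contributes 0
            p_l = (k[l][0] + k[l][1]) / n     # class marginal
            p_m = (k[0][m] + k[1][m]) / n     # term-event marginal
            mi += p_lm * math.log(p_lm / (p_l * p_m))
    return mi
```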
Fisher's discrimination index
• Useful when documents are scaled to constant length
• Term occurrences are regarded as fractional real numbers
• E.g.: the two-class case
• Let X and Y be the sets of length-normalized document vectors corresponding to the two classes
• Let $\mu_X = \frac{1}{|X|}\sum_{x \in X} x$ and $\mu_Y = \frac{1}{|Y|}\sum_{y \in Y} y$ be the centroids of each class
• Let the covariance matrices be
$S_X = \frac{1}{|X|}\sum_{x \in X}(x - \mu_X)(x - \mu_X)^T, \qquad S_Y = \frac{1}{|Y|}\sum_{y \in Y}(y - \mu_Y)(y - \mu_Y)^T$
Fisher's discrimination index (contd.)
• Goal: find a projection of the data sets X and Y onto a line such that
• the two projected centroids are far apart compared to the spread of the point sets projected onto the same line
• Find a column vector $\alpha^* \in \mathbb{R}^m$ such that the ratio of
– the square of the difference in projected mean vectors, $(\alpha^T(\mu_X - \mu_Y))^2$
– to the average projected variance, $\frac{1}{2}\alpha^T(S_X + S_Y)\alpha$
is maximized
• This gives
$\alpha^* = \arg\max_\alpha \frac{(\alpha^T(\mu_X - \mu_Y))^2}{\alpha^T(S_X + S_Y)\alpha}$
Fisher's discrimination index
• Optimality
• Suppose X and Y, for both the training and test data, are generated from multivariate Gaussian distributions
• Let $S_X = S_Y$
• Then this value of α induces the optimal (minimum-error) classifier, by suitable thresholding on $\alpha^T q$ for a test point q
• Problems
• Inverting S would be unacceptably slow for tens of thousands of dimensions
• Linear transformations would destroy already-existing sparsity
Solution
• Recall
• The goal was to eliminate terms from consideration,
• not to arrive at linear projections involving multiple terms
• Regard each term t as providing a candidate direction $\alpha_t$ which is parallel to the corresponding axis in the vector space model
• Compute the Fisher index of t
FI : Solution (contd.)
• Formula
• For the two-class case:
$FI(t) = \frac{(\alpha_t^T(\mu_X - \mu_Y))^2}{\alpha_t^T(S_X + S_Y)\alpha_t} = \frac{(\mu_{X,t} - \mu_{Y,t})^2}{\frac{1}{|X|}\sum_{x \in X}(x_t - \mu_{X,t})^2 + \frac{1}{|Y|}\sum_{y \in Y}(y_t - \mu_{Y,t})^2}$
• Can be generalized to a set {c} of more than two classes:
$FI(t) = \frac{\sum_{c_1, c_2}(\mu_{c_1,t} - \mu_{c_2,t})^2}{\sum_c \frac{1}{|D_c|}\sum_{d \in D_c}(x_{d,t} - \mu_{c,t})^2}$
• Feature selection
• Terms are sorted in decreasing order of FI(t)
• The best ones are chosen as features
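An illustrative per-term computation of the multi-class Fisher index (my sketch; it sums over unordered class pairs, which rescales but does not reorder the scores):

```python
def fisher_index(docs_by_class):
    """Multi-class Fisher index FI(t) for every term, per the formula above.
    docs_by_class: class -> list of length-normalized term->weight dicts."""
    vocab = {t for docs in docs_by_class.values() for d in docs for t in d}
    mean, var = {}, {}
    for c, docs in docs_by_class.items():
        for t in vocab:
            vals = [d.get(t, 0.0) for d in docs]
            m = sum(vals) / len(vals)
            mean[c, t] = m
            var[c, t] = sum((v - m) ** 2 for v in vals) / len(vals)
    classes = list(docs_by_class)
    fi = {}
    for t in vocab:
        # Between-class separation over within-class spread.
        between = sum((mean[c1, t] - mean[c2, t]) ** 2
                      for i, c1 in enumerate(classes)
                      for c2 in classes[i + 1:])
        within = sum(var[c, t] for c in classes)
        fi[t] = between / within if within else 0.0
    return fi
```

Terms would then be sorted in decreasing order of `fi[t]` and the best ones retained.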
Validation
• How to decide a cut-off rank?
• Validation approach
• A portion of the training documents is held out
• The rest is used to do the term ranking
• The held-out set is used as a test set
• Various cut-off ranks can be tested using the same held-out set
• Leave-one-out cross-validation / partitioning the data into two
• An aggregate accuracy is computed over all trials
• Wrapper to search for the number of features
• In decreasing order of discriminative power
• Yields the highest accuracy
Validation (contd.)
• Simple search heuristic
• Keep adding one feature at every step until the classifier's accuracy ceases to improve
A general illustration of wrapping for feature selection.
Validation (contd.)
• For naive Bayes-like classifiers
• Evaluation on many choices of feature sets can be done at once
• For maximum entropy / support vector machines
• Essentially involves training a classifier from scratch for each choice of the cut-off rank
• Therefore inefficient
Validation: observations
• Bayesian classifier cannot overfit much
Effect of feature selection on Bayesian classifiers
Truncation algorithms
• Start from the complete set of terms T
1. Keep selecting terms to drop
2. till you end up with a feature subset
3. Question: when should you stop truncating?
• Two objectives
• Minimize the size of the selected feature set F
• Keep the distorted distribution Pr(C|F) as similar as possible to the original Pr(C|T)
Truncation Algorithms: Example
• Kullback-Leibler (KL) distance
• Measures similarity or distance between two distributions
• Markov blanket
• Let X be a feature in T, and let $M \subseteq T \setminus \{X\}$
• If the presence of M renders the presence of X unnecessary as a feature, M is a Markov blanket for X
• Technically, M is called a Markov blanket for $X \in T$ if X is conditionally independent of $(T \cup C) \setminus (M \cup \{X\})$ given M
• Eliminating a variable because it has a Markov blanket contained in other existing features does not increase the KL distance between Pr(C|T) and Pr(C|F)
Finding Markov Blankets
• Markov blankets are often absent in practice
• Finding approximate Markov blankets
• Purpose: to cut down computational complexity
• Restrict the search for Markov blankets M to those with at most k features
• For a given feature X, restrict the members of M to those features which are most strongly correlated with X (using tests similar to the χ² or MI tests)
• Example: for the Reuters dataset, over two-thirds of T could be discarded while increasing classification accuracy
Feature Truncation algorithm
1. while the truncated Pr(C|F) is reasonably close to the original Pr(C|T) do
2.   for each remaining feature X do
3.     Identify a candidate Markov blanket M: for some tuned constant k, find the set M of k variables in F \ {X} that are most strongly correlated with X
4.     Estimate how good a blanket M is: estimate
$\sum_{x_M, x} \Pr(X_M = x_M, X = x)\; KL\big(\Pr(C \mid X_M = x_M, X = x) \,\|\, \Pr(C \mid X_M = x_M)\big)$
5.   end for
6.   Eliminate the feature having the best surviving Markov blanket
7. end while
General observations on feature selection
• The issue of document length should be addressed properly
• The choice of association measure does not make a dramatic difference
• Greedy inclusion algorithms scale nearly linearly with the number of features
• The Markov blanket technique takes time proportional to at least $|T|^k$
• Advantage of the Markov blanket algorithm over greedy inclusion
• A greedy algorithm may include features with high individual correlations even though one subsumes the other
• Features individually uncorrelated with the class could be jointly more correlated with it
• This rarely happens (e.g., phrases)
• The binary feature selection view may not be the only view to subscribe to
• Suggestion: combine features into fewer, simpler ones
• E.g.: project the document vectors to a lower-dimensional space
Bayesian Learner
• A very practical text classifier
• Assumptions
1. A document can belong to exactly one of a set of classes or topics.
2. Each class c has an associated prior probability Pr(c), with $\sum_c \Pr(c) = 1$.
3. There is a class-conditional document distribution Pr(d|c) for each class.
• Posterior probability (obtained using Bayes rule):
$\Pr(c \mid d) = \frac{\Pr(c, d)}{\Pr(d)} = \frac{\Pr(c)\Pr(d \mid c)}{\sum_\gamma \Pr(\gamma)\Pr(d \mid \gamma)}$
• The parameter set Θ consists of all the Pr(d|c)
Parameter Estimation for Bayesian Learner
• The estimate of Θ is based on two sources of information:
1. Prior knowledge of the parameter set, before seeing any training documents
2. Terms in the training documents D
• Bayes optimal classifier
• Take the expectation of each parameter over Pr(Θ|D):
$\Pr(c \mid d) = \sum_\Theta \Pr(c \mid d, \Theta)\Pr(\Theta \mid D) = \sum_\Theta \frac{\Pr(c \mid \Theta)\Pr(d \mid c, \Theta)}{\sum_\gamma \Pr(\gamma \mid \Theta)\Pr(d \mid \gamma, \Theta)}\Pr(\Theta \mid D)$
• Computationally infeasible
• Maximum likelihood estimate
• Replace the sum above with the value of the summand, $\Pr(c \mid d, \Theta)$, at $\Theta = \arg\max_\Theta \Pr(\Theta \mid D)$
• Works poorly for text classification
Naïve Bayes Classifier
• Naïve
• assumption of independence between terms,
• joint term distribution is the product of the marginals.
• Widely used owing to
• simplicity and speed of training, applying, and updating
• Two kinds of widely used marginals for text
• Binary model
• Multinomial model
Naïve Bayes Models
• Binary model
• Each parameter $\phi_{c,t}$ indicates the probability that a document in class c will mention term t at least once:
$\Pr(d \mid c) = \prod_{t \in d} \phi_{c,t} \prod_{t \in W,\, t \notin d} (1 - \phi_{c,t})$
• Multinomial model
• Each class has an associated die with |W| faces
• Each parameter $\theta_{c,t}$ denotes the probability of face t turning up on tossing the die
• Term t occurs n(d, t) times in document d; the document length $\ell_d = \sum_t n(d, t)$ is a random variable denoted L:
$\Pr(d \mid c) = \Pr(L = \ell_d \mid c)\Pr(d \mid \ell_d, c) = \Pr(L = \ell_d \mid c)\binom{\ell_d}{\{n(d, t)\}}\prod_{t \in d} \theta_{c,t}^{\,n(d,t)}$
Analysis of Naïve Bayes Models
1. Multiplying together a large number of small probabilities
• Result: extremely tiny probabilities as answers
• Solution: store all numbers as logarithms
2. The class which comes out at the top wins by a huge margin
• Sanitize scores using the likelihood ratio
• Also called the logit function:
$\mathrm{logit}(d) = \frac{1}{1 + e^{-LR(d)}}, \qquad LR(d) = \frac{\Pr(C = 1 \mid d)}{\Pr(C = -1 \mid d)}$
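A sketch of multinomial scoring in log space (my illustration; the smoothed parameters θ come from the following slides, and the sanitized score here uses one common variant, the logistic of the log likelihood ratio):

```python
import math

def nb_log_scores(doc_counts, log_prior, log_theta, log_unseen):
    """Multinomial naive Bayes scoring in log space to avoid underflow.
    doc_counts: term -> n(d, t); log_prior: c -> log Pr(c);
    log_theta: (c, t) -> log theta_{c,t}; log_unseen: c -> log-probability
    for terms never seen in class c (from smoothing, next slides)."""
    scores = {}
    for c in log_prior:
        s = log_prior[c]
        for t, n in doc_counts.items():
            # log of theta_{c,t}^{n(d,t)}, accumulated additively.
            s += n * log_theta.get((c, t), log_unseen[c])
        scores[c] = s
    return scores

def sanitized_score(scores, pos, neg):
    """Logistic of the log likelihood ratio between two classes: one common
    way to squash the huge winning margins into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(scores[pos] - scores[neg])))
```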
Parameter smoothing
• What if a test document $d_q$ contains a term t that never occurred in any training document in class c?
• Answer: $\Pr(c \mid d_q)$ will be zero,
• even if many other terms clearly hint at a high likelihood of class c generating the document
• Bayesian estimation
• Estimating a probability from insufficient data
 If you toss a coin n times and it always comes up heads, what is the probability that the (n + 1)th toss will also come up heads?
• Posit a prior distribution on θ, called π(θ)
 E.g.: the uniform distribution
• Resultant posterior distribution:
$\pi(\theta \mid k, n) = \frac{\pi(\theta)\Pr(k, n \mid \theta)}{\int_0^1 \pi(p)\Pr(k, n \mid p)\,dp}$
Laplace Smoothing
• Based on Bayesian estimation
• Laplace's law of succession
• A loss function $L(\theta, \tilde\theta)$ (penalty) for picking a smoothed value $\tilde\theta$ as against the ‘true’ value θ
• E.g.: the loss function is the square error
• For this choice of loss, the best choice of the smoothed parameter is simply the expectation of the posterior distribution on having observed the data:
$\tilde\theta = E(\pi(\theta \mid k, n)) = \frac{k + 1}{n + 2}$
Laplace Smoothing (contd.)
• Heuristic alternatives
• Lidstone's law of succession:
$\tilde\theta = \frac{k + \lambda}{n + 2\lambda}$
• Derivation for the multinomial model
• There are |W| possible events, where W is the vocabulary:
$\theta_{c,t} = \frac{1 + \sum_{d \in D_c} n(d, t)}{|W| + \sum_{d \in D_c} \sum_{\tau \in d} n(d, \tau)}$
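The Laplace-smoothed multinomial estimate above translates directly into code; a minimal illustrative sketch:

```python
from collections import defaultdict

def estimate_multinomial(docs_by_class, vocab_size):
    """Laplace-smoothed multinomial parameters, as in the formula above:
    theta_{c,t} = (1 + sum_{d in D_c} n(d,t)) / (|W| + total length of D_c).
    docs_by_class: class -> list of term->count dicts."""
    theta = {}
    for c, docs in docs_by_class.items():
        counts = defaultdict(int)
        total = 0
        for d in docs:
            for t, n in d.items():
                counts[t] += n
                total += n
        denom = vocab_size + total
        theta[c] = {t: (1 + n) / denom for t, n in counts.items()}
        theta[c]["__unseen__"] = 1.0 / denom  # mass assigned to unseen terms
    return theta
```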
Performance analysis
• The multinomial naive Bayes classifier generally outperforms the binary variant
• k-NN may outperform naive Bayes
• Naive Bayes is faster and more compact
• Decision boundaries:
• regions of potential confusion
NB: Decision boundaries
• A Bayesian classifier partitions the multidimensional term space into regions
• Within each region, the probability of one class is higher than the others
• On the boundaries, the probabilities of two or more classes are exactly equal
• NB is a linear classifier
• It makes a decision between c = 1 and c = −1
• by thresholding the value of $\alpha_{NB} \cdot d + b$ (b reflects the priors) for a suitable vector $\alpha_{NB}$
Pitfalls
• Strong bias
• Fixes the policy that $\alpha_{NB}(t)$ (the t-th component of the linear discriminant) depends only on the statistics of term t in the corpus
• Therefore it cannot pick from the entire set of possible linear discriminants
Bayesian Networks
• Attempt to capture statistical dependencies between
terms themselves
• Approximations to the joint distribution over terms
• Probability of a term occurring depends on observation about
other terms as well as the class variable.
• A directed acyclic graph
• All random variables (classes and terms) are nodes
• A dependency edge is drawn from c to t for each term t (parent-child edges)
• To represent additional dependencies between terms, further parent-child dependency edges are drawn
Bayesian networks. For the naive Bayes assumption, the only edges are from the class variable to individual terms. Towards better approximations to the joint distribution over terms: the probability of a term occurring may now depend on observations about other terms as well as the class variable.
Bayesian Belief Network (BBN)
• DAG
• Parents Pa(X)
• nodes that are connected by directed edges to a node X
• Fixing the values of the parent variables completely
determines the conditional distribution of X
• Conditional Probability tables
• For discrete variables, the distribution data for X can be stored in
the obvious way as a table with each row showing a set of
values of the parents, the value of X, and a conditional
probability.
• Unlike Naïve Bayes
• P(d|c) is not a simple product over all terms:
$\Pr(x) = \prod_X \Pr(x_X \mid pa(X))$
BBN: difficulty
• Getting a good network structure
• At least quadratic time: enumeration of all pairs of features
• Exploited only for the binary model
• Multinomial model
• Prohibitive CPT sizes
Exploiting hierarchy among topics
• An ordering between the class labels
• For data warehousing
• E.g.: high-, medium-, or low-cancer-risk patients
• Text class labels:
• Taxonomy:
• a large and complex class hierarchy that relates the class labels
• Tree structure
• The simplest form of taxonomy
• widely used in directory browsing,
• often the output of clustering algorithms
• Inheritance:
• if class c0 is the parent of class c1, any training document which belongs to c1 also belongs to c0
Topic Hierarchies: Feature selection
• The discriminating ability of a term is sensitive to the node (or class) in the hierarchy
• The measure of discrimination of a term
• can be evaluated with respect to only the internal nodes of the hierarchy
• ‘can’ may be a noisy word at the root node of Yahoo!,
• yet help classify documents under the subtree of /Science/Environment/Recycling
Topic Hierarchies: Enhanced parameter estimation
• Uniform priors not good
• Idea
• If a parameter estimate is shaky at a node
with few training documents, perhaps we can
impose a strong prior from a well-trained
parent to repair the estimates.
• Shrinkage
• Seeks to improve estimates of descendants
using data from ancestors,
Shrinkage
• Assume the multinomial model
• Introduce a dummy class c0 as the parent of the root c1, where all terms are equally likely
• For a specific path c0, c1, …, cn
• the ‘shrunk’ estimate $\tilde\theta_{c_n,t}$ is determined by a convex linear interpolation of the MLE parameters at the ancestor nodes up through c0
• Estimation of the mixing weights
• A simple form of the EM algorithm
• Determined empirically, by iteratively maximizing the probability of a held-out portion Hn of the training set for node cn
Shrinkage: Observation
• Improves accuracy beyond hierarchical
naïve Bayes,
• Improvement is high when data is sparse
• Capable of utilizing many more features
than Naïve Bayes
Topic search in Hierarchy
• By definition
• All documents are relevant to the root ‘topic’
• Pr(root|d) = 1.
• Given a test document d:
• Find one or more of the most likely leaf nodes in the hierarchy
• A document cannot belong to more than one path,
• so for the children {ci} of any node c0:
$\sum_i \Pr(c_i \mid d) = \Pr(c_0 \mid d)$
Topic search in Hierarchy: Greedy search strategy
• Search starts at the root
• Decisions are made greedily
• At each internal node, pick the highest-probability class
• Continue downward
• Drawback
• Early errors have a compounding effect
Topic search in Hierarchy: Best-first search strategy
• For finding the m most probable leaf classes
• Find the weighted shortest path from the root to a leaf
• Edge (c0, ci) is assigned the (non-negative) edge weight $-\log \Pr(c_i \mid c_0, d)$:
$-\log \Pr(c_i \mid d) = (-\log \Pr(c_0 \mid d)) + (-\log \Pr(c_i \mid c_0, d))$
• To make best-first search different from greedy search
• rescale/smooth the probabilities
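A compact best-first search over the hierarchy, as a sketch (`edge_cost(c0, ci)` is assumed to return the weight $-\log \Pr(c_i \mid c_0, d)$):

```python
import heapq

def best_leaves(root, children, edge_cost, m=1):
    """Best-first search for the m most probable leaf classes.
    children(c): list of child classes of c (empty for a leaf);
    edge_cost(c0, ci): non-negative weight -log Pr(ci | c0, d).
    Path cost from the root is then -log Pr(ci | d), as derived above."""
    frontier = [(0.0, root)]                  # Pr(root | d) = 1, cost 0
    leaves = []
    while frontier and len(leaves) < m:
        cost, node = heapq.heappop(frontier)
        kids = children(node)
        if not kids:                          # a leaf: costs pop in order,
            leaves.append((node, cost))       # so these are the best leaves
            continue
        for child in kids:
            heapq.heappush(frontier, (cost + edge_cost(node, child), child))
    return leaves
```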
Using best-first search on a hierarchy can improve both accuracy and speed. Because the hierarchy has four internal nodes, the second column shows the number of features for each. These were tuned so that the total numbers of features for both flat and best-first classification are roughly the same (so that the model complexity is comparable). Because each document belonged to exactly one leaf node, recall equals precision in this case and is called ‘accuracy’.
The semantics of hierarchical classification
• Asymmetry
• A training document can be associated with any node,
• but a test document must be routed to a leaf
• Routing test documents to internal nodes makes sense when
• none of the children matches the document,
• many children match the document, or
• the chance of making a mistake while pushing the test document down one more level is too high
• A research issue
Maximum entropy learners: Motivation
• Bayesian learner
• First, model Pr(d|c) at training time
• Apply Bayes rule at test time
• Two problems with Bayesian learners
• d is represented in a high-dimensional term space,
• so Pr(d|c) cannot be estimated accurately from a training set of limited size (sparse data)
• No systematic way of adding synthetic features
• Such an addition may result in highly correlated features that “crowd out” useful features
Maximum entropy learners
• Assume that each document has only one class label and no duplicates
• Indicator functions $f_j(d, c)$ (“features”)
• flag the j-th condition relating class c to document d
• The expectation of indicator $f_j$ is
$E(f_j) = \sum_{d,c} \Pr(d, c) f_j(d, c) = \sum_d \Pr(d) \sum_c \Pr(c \mid d) f_j(d, c)$
• Approximate Pr(d, c) and Pr(d) with their empirical estimates $\tilde{\Pr}(d, c)$, $\tilde{\Pr}(d)$
• Constraint:
$\sum_i \tilde{\Pr}(d_i, c_i) f_j(d_i, c_i) = \sum_i \tilde{\Pr}(d_i) \sum_c \Pr(c \mid d_i) f_j(d_i, c)$
• With uniform empirical probability 1/n:
$\frac{1}{n} \sum_i f_j(d_i, c_i) = \frac{1}{n} \sum_i \sum_c \Pr(c \mid d_i) f_j(d_i, c)$
Principle of Maximum Entropy
• The constraints don't determine Pr(c|d) uniquely
• Principle of maximum entropy:
• prefer the simplest model that explains the observed data
• Choose the Pr(c|d) that maximizes the entropy of Pr(c|d)
• In the event of an empty training set, we should consider all classes to be equally likely
• Constrained optimization
• Maximize the entropy of the model distribution Pr(c|d)
 while obeying the constraints for all j
• Optimize by the method of Lagrange multipliers:
$G(\Pr(c \mid d), \Lambda) = -\sum_{d,c} \Pr(d)\Pr(c \mid d)\log \Pr(c \mid d) + \sum_j \lambda_j \Big(\sum_i f_j(d_i, c_i) - \sum_{i,c} \Pr(c \mid d_i) f_j(d_i, c)\Big)$
• The solution has the form
$\Pr(c \mid d) = \frac{1}{Z(d)} \prod_j \mu_j^{f_j(d, c)}, \qquad \mu_j = e^{\lambda_j}, \; Z(d) \text{ a scale factor ensuring } \sum_c \Pr(c \mid d) = 1$
Maximum Entropy solution
• Fitting the distribution to the data involves two steps:
1. Identify a set of indicator functions derived from the data.
2. Iteratively arrive at values for the parameters that satisfy the constraints while maximizing the entropy of the distribution being modeled.
• An equivalent optimization problem:
$\max \sum_{d \in D} \log \Pr(c_d \mid d)$
Text Classification using Maximum Entropy Model
• Example
• Pick an indicator for each (class, term) combination
• For the binary document model:
$f_{c',t}(d, c) = \begin{cases} 1 & \text{if } c = c' \text{ and } t \in d \\ 0 & \text{otherwise} \end{cases}$
• For the multinomial document model:
$f_{c',t}(d, c) = \begin{cases} \dfrac{n(d, t)}{\sum_\tau n(d, \tau)} & \text{if } c = c' \\ 0 & \text{otherwise} \end{cases}$
• What we gain with maximum entropy over naive Bayes
• It does not suffer from the independence assumptions
• E.g.: if the terms t1 = machine and t2 = learning are often found together in class c,
• $\mu_{c,t_1}$ and $\mu_{c,t_2}$ would be suitably discounted
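A sketch of scoring with a fitted model (illustrative only; the iterative fitting of the parameters μ, e.g. by iterative scaling, is not shown):

```python
import math

def maxent_prob(doc, classes, mu, features):
    """Pr(c|d) = (1/Z(d)) * prod_j mu_j^{f_j(d,c)}, as in the solution above.
    mu: feature id -> parameter; features(d, c) yields (feature id, f_j(d, c))
    pairs, e.g. the (class, term) indicators defined on this slide."""
    log_score = {}
    for c in classes:
        # Work in log space: sum_j f_j(d,c) * log(mu_j). Unknown features
        # default to mu = 1 and so contribute nothing.
        log_score[c] = sum(f * math.log(mu.get(j, 1.0))
                           for j, f in features(doc, c))
    z = sum(math.exp(s) for s in log_score.values())  # the scale factor Z(d)
    return {c: math.exp(s) / z for c, s in log_score.items()}

def binary_features(doc, c):
    """Binary-model indicators: f_{c',t}(d, c) = 1 iff c' = c and t in d."""
    return (((c, t), 1.0) for t in doc)
```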
Performance of Maximum Entropy Classifier
• Outperforms naive Bayes in accuracy, but not consistently
• Requires dealing with a large number of synthesized, possibly redundant features
Discriminative classification
• Naïve Bayes and maximum entropy classifiers
• “induce” linear decision boundaries between classes in the feature space:
$\Pr(c \mid d) = \frac{1}{Z(d)}\prod_j \mu_j^{f_j(d,c)}, \qquad \log \Pr(c \mid d) = -\log Z(d) + \sum_{t \in d} f_{c,t}(d, c) \log \mu_{c,t}$
• Discriminative classifiers
• Directly map the feature space to class labels
• Class labels are encoded as numbers
• e.g: +1 and –1 for two class problem
• Two examples
• Linear least-square regression
• Support Vector Machines
Linear least-square regression
• There is no inherent reason for going through the modeling step, as in the Bayesian or maximum entropy classifier, to get a linear discriminant
• Linear regression problem
• Look for some arbitrary α such that $\alpha \cdot d_i + b$ directly predicts the label $c_i$ of document $d_i$
• Minimize the square error between the observed and predicted class variables:
$\sum_i (\alpha \cdot d_i + b - c_i)^2$
• Optimization: use gradient-descent methods
• E.g., the Widrow-Hoff (WH) update rule (η: learning rate):
$\alpha^{(i)} = \alpha^{(i-1)} - 2\eta\,(\alpha^{(i-1)} \cdot d_i - c_i)\,d_i$
• Scale α to norm 1
• Two equivalent interpretations
• The classifier is a hyperplane
• Documents are projected onto a direction α
• Performance
• Comparable to naive Bayes and maximum entropy
Support vector machines
• Assumption: the training and test populations are drawn from the same distribution
• Hypothesis
• A hyperplane that is close to many training data points has a greater chance of misclassifying test instances
• A hyperplane which passes through a “no-man's land” has lower chances of misclassification
 Make a decision by thresholding $\alpha_{SVM} \cdot d_i + b$
 Seek an $\alpha_{SVM}$ which maximizes the distance of any training point from the hyperplane:
Minimize $\frac{1}{2}\alpha \cdot \alpha = \frac{1}{2}\|\alpha\|^2$
subject to $c_i(\alpha \cdot d_i + b) \ge 1$, for $i = 1, \ldots, n$; $d_i$: document vector, $c_i \in \{-1, +1\}$: class
 d b 1
  d  b  1
  d  b  1

  d  b  1
  d  b  1
 d b  0
Illustration of the SVM optimization problem
Support vector machines
• Optimal separator
• Orthogonal to the shortest line connecting the convex hulls of the two classes
• Intersects this shortest line halfway
• The distance of any training point from the optimized hyperplane (the margin)
 is at least $\frac{1}{\|\alpha\|}$
SVMs: non separable classes
• Classes in the training data are not always separable (a single hyperplane may not be enough)
• Introduce slack (“fudge”) variables $\xi_i$:
Minimize $\frac{1}{2}\alpha \cdot \alpha + C\sum_i \xi_i$
subject to $c_i(\alpha \cdot d_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$, for $i = 1, \ldots, n$
• Equivalent dual:
Maximize $\sum_i \lambda_i - \frac{1}{2}\sum_{i,j} \lambda_i \lambda_j c_i c_j (d_i \cdot d_j)$
subject to $\sum_i c_i \lambda_i = 0$ and $0 \le \lambda_i \le C$, for $i = 1, \ldots, n$
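For illustration, here is a primal stochastic-subgradient sketch of the soft-margin objective above (a Pegasos-style stand-in, not the dual working-set solvers the slides describe):

```python
import random

def svm_subgradient(docs, labels, C=1.0, eta=0.01, epochs=20):
    """Soft-margin linear SVM trained in the primal by stochastic subgradient
    descent on (1/2)||alpha||^2 + C * sum_i hinge(c_i(alpha . d_i + b)).
    docs: list of sparse term->weight dicts; labels: +1/-1."""
    alpha, b = {}, 0.0
    n = len(docs)
    data = list(zip(docs, labels))
    for _ in range(epochs):
        random.shuffle(data)
        for d, c in data:
            margin = c * (sum(alpha.get(t, 0.0) * w for t, w in d.items()) + b)
            # Regularizer subgradient, spread over the n examples.
            for t in list(alpha):
                alpha[t] -= eta * alpha[t] / n
            if margin < 1:                      # the point violates the margin
                for t, w in d.items():
                    alpha[t] = alpha.get(t, 0.0) + eta * C * c * w
                b += eta * C * c
    return alpha, b
```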
SVMs: Complexity
• A quadratic optimization problem
• Working set: refine a few λ's at a time, holding the others fixed
• Large memory:
• on-demand computation of inner products
• Time: for n documents, training time $\propto n^a$, with $1.7 \le a \le 2.1$
 Recent SVM packages
• achieve linear time by clever selection of working sets
SVM training time variation as the training set size is increased, with and without
sufficient memory to hold the training set. In the latter case, the memory is set to about a
quarter of that needed by the training set.
Performance
• Comparison with other classifiers
• Among most accurate classifier for text
• Better accuracy than naive Bayes and
decision tree classifier
• Interesting revelation
• Linear SVMs suffice
• Standard text classification tasks have
classes almost separable using a hyperplane
in feature space
• Research issues
• Non-linear SVMs
Comparison of LSVM with previous classifiers on the Reuters data set (data taken from Dumais). (The naive Bayes classifier used binary features, so its accuracy can be improved.)
Comparison of accuracy across three classifiers: Naive Bayes, Maximum Entropy and Linear
SVM, using three data sets: 20 newsgroups, the Recreation sub-tree of the Open Directory,
and University Web pages from WebKB.
Comparison between several classifiers using the Reuters collection.
Hypertext classification
• Techniques to address hypertextual features
• Document Object Model (DOM)
• A well-formed HTML document is a properly nested hierarchy of regions, represented by a tree-structured DOM:
• internal nodes are elements (e.g., UL, LI)
• some of the leaf nodes are segments of text
• other nodes are hyperlinks to other Web pages
Representing hypertext for supervised learning
• Paying special attention to tags can help with learning
• Keyword-based search
• Assign heuristic weights to terms that occur in specific HTML tags
 E.g., TITLE, H1, …
• Example (next slide)
Prefixing with tags
• Distinguish between the two occurrences of the word “surfing”
• by prefixing each term with the sequence of tags that we need to follow from the DOM root to get to the term
• A repeated term in different sections should reinforce belief in a class label
• Using a maximum entropy classifier,
• accumulate evidence from different features
• Maintain both forms of a term:
• plain text and prefixed text (all path prefixes)
Example
• <resume>
<publication>
<title>Web-surfing models</title>
</publication>
<hobbies>
<item>Wind-surfing
</hobbies>
</resume>
• Prefixed term
• resume.publication.title.surfing
• resume.hobbies.item.surfing
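A small sketch of generating both prefixed and plain forms from a document like the one above (illustrative; it assumes well-formed XML with all tags closed, and uses a deliberately naive whitespace tokenizer):

```python
import xml.etree.ElementTree as ET

def prefixed_terms(xml_text):
    """Emit each term both prefixed by its tag path from the DOM root and in
    plain form, mirroring the example above."""
    root = ET.fromstring(xml_text)

    def walk(node, path):
        path = path + [node.tag]
        for term in (node.text or "").split():
            token = term.lower()
            yield ".".join(path) + "." + token   # prefixed form
            yield token                          # plain form
        for child in node:
            yield from walk(child, path)

    return list(walk(root, []))
```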
Experiments
• 10705 patents from the US Patent Office:
• 70% error with a plain-text classifier
• 24% error with path-tagged terms
• 17% error with path prefixes
• 1700 resumes (with a naive Bayes classifier):
• 53% error with flattened HTML
• 40% error with prefix-tagged terms
 Simple tricks suffice to boost accuracy
Limitations
• Prefix representations are
• ad hoc and
• inflexible
• Generalizability:
• How to incorporate additional features?
• E.g.: adding features derived from hyperlinks
• Relations
• a uniform way to codify hypertextual features
• Example:
classified(A, facultyPage) :- contains-text(A, professor), contains-text(A, phd), links(B, A), contains-text(B, faculty).
Rule Induction for Relational Learning
• Inductive classifiers
• Goal: discover a set of predicate rules from a collection of relations
• Consider a 2-class setting
• Positive examples: D+
• Negative examples: D−
• Test instance:
• Apply the predicate rules
 True => positive instance
 Else => negative instance
Rule Induction with First-Order Inductive Logic (FOIL)
• A well-known rule learner
• Start with an empty rule set:
1. Learn a new rule.
2. Add conjunctive literals to the new rule (specialize) until no negative example is covered by the new rule;
3. pick each literal so that it rapidly increases the ratio of surviving positive to negative bindings.
4. Remove positive examples covered by any rule generated thus far.
• Repeat until no positive instances are left
Types of Literals Explored
• $X_i = X_j$, $X_i = c$, $X_i > X_j$, $X_i \ge X_j$; where $X_i$, $X_j$ are variables and c is a constant
• $Q(X_1, \ldots, X_k)$, where Q is a relation and the $X_i$ are variables, at least one of which must be already bound
• not(L), where L is a literal of the above forms
Advantage of Relational Learning
• Can learn class labels for individual pages
• Can learn relationships between labels
• member(homePage, department)
• teaches(homePage, coursePage)
• advises(homePage, homePage)
• writes(homePage, paper)
• Hybrid approaches
• Statistical classifier (naive Bayes)
• with a more complex search for literals
• Inductive learning
• comparing the estimated probabilities of various classes
Advantage of Relational Learning
• Recursively labeling relations
• Relating page label in terms of labels of
neighboring pages
classified(A, facultyPage) :- links-to(A, B), classified(B, studentPage),
                              links-to(A, C), classified(C, coursePage),
                              links-to(A, D), classified(D, publicationsPage).