Concept Ontology for Text Classification


Concept Ontology for Text Classification
Bhumika Thakker
Outline
- Introduction
- Techniques Used
- Shrinkage in a Hierarchy of Classes
- Classification Using Very Few Words
- Enhanced Word Clustering for Hierarchical Text Classification
- Classification of Web Content
- References
Introduction
- There has been an explosion in the availability of online information, with millions of documents on every topic easily accessible via the internet.
- People are increasingly unable to assimilate and profitably utilize this growing amount of information.
- It therefore becomes important to organize this mass of information.

Techniques for Classification
- Shrinkage in a Hierarchy of Classes
- Classification Using Very Few Words
- Word Clustering for Hierarchical Text Classification

Shrinkage in a Hierarchy of Classes
- The technique leverages commonly available topic hierarchies to significantly improve classification accuracy.
- It works well even when the hierarchy is large and the training data set is sparse.
- It also provides a method for exponentially reducing the amount of computation necessary for classification, sacrificing only a small amount of accuracy.
Shrinkage in a Hierarchy of Classes
- The approach applies a technique from statistics called shrinkage, which provides improved estimates of parameters.
- The technique exploits the hierarchy by "shrinking" parameter estimates in data-sparse children toward the estimates of data-rich ancestors.
- It employs a simple form of shrinkage that creates new parameter estimates in a child by linearly interpolating the estimates of all hierarchy nodes from the child to the root.
Shrinkage in a Hierarchy of Classes
Shrinkage for Text Classification
- Shrinkage is used to better estimate the probability of a word given a class, θjt.
- For each node in the tree, a maximum likelihood (ML) estimate is constructed from the data associated with that node.
- An improved estimate for each leaf node is then derived by "shrinking" its ML estimate toward the ML estimates of all its ancestors.
- In effect, a unigram model is built for each node in the tree, and each leaf model is smoothed by linearly interpolating it with all the models found along the path to the root (a small sketch follows).
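The construction above can be made concrete with a short sketch. The following Python snippet (using illustrative data structures, not the paper's code) builds an ML unigram estimate per node and linearly interpolates along a leaf-to-root path:

```python
import numpy as np

def ml_unigram(word_counts, vocab_size):
    """Maximum-likelihood unigram estimate from a vector of word counts."""
    counts = np.asarray(word_counts, dtype=float)
    total = counts.sum()
    return counts / total if total > 0 else np.full(vocab_size, 1.0 / vocab_size)

def smoothed_leaf_model(path_counts, lambdas, vocab_size):
    """Shrink a leaf's unigram model toward its ancestors.

    path_counts : word-count vectors for the nodes on the path [leaf, ..., root]
                  (illustrative layout; the paper additionally keeps the per-node
                  estimates independent by subtracting child data from parents).
    lambdas     : interpolation weights for those nodes, summing to 1.
    """
    models = np.stack([ml_unigram(c, vocab_size) for c in path_counts])  # (k, V)
    lambdas = np.asarray(lambdas, dtype=float)                           # (k,)
    return lambdas @ models   # linear interpolation of the k models
```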
Cont…
- The estimates along the path from the leaf to the root represent a tradeoff between specificity and reliability.
- The estimate at the leaf is the most specific, since it is based on data from that topic alone, but it is also the least reliable, since it is based on the least amount of data.
- The estimate at the root is the most reliable but the least specific.
Cont…
- Each child's data is subtracted from its parent's before the parent's ML estimate is calculated, to ensure that the ML estimates along a given path are independent.
- The parent's estimate is thus based on the data belonging to all the siblings of the child, but not to the child itself.
- As a result, for any path from leaf to root, every datum in the tree is used in exactly one of the ML estimates, providing both independence among the estimates and efficient use of the training data.
Determining the weights
- Let θ̂_jt^1, …, θ̂_jt^k be the k estimates, where θ̂_jt^1 is the estimate at the leaf and k-1 is the depth of class cj in the tree.
- The interpolation weights among the ancestors of class cj are written {λ_j^1, …, λ_j^k}, where Σ_i λ_j^i = 1.
- The new estimate of the class-conditional word probabilities based on shrinkage is
  θ̌_jt = λ_j^1 θ̂_jt^1 + λ_j^2 θ̂_jt^2 + … + λ_j^k θ̂_jt^k
- The likelihood of the data under this mixture model is a convex function of the weights and attains a single global maximum. This maximum is found for each leaf class cj using the iterative, EM-style procedure described next.
- The algorithm can be viewed as a particularly simple form of EM (expectation maximization).
- Each datum is assumed to have been generated by first choosing one of the tree nodes on the path to the root, say the one with estimate θ̂_jt^i, and then using that estimate to generate the datum.
- EM then maximizes the total likelihood when the choices of estimates made for the various data are unknown.
- The first step in the iterative part is thus the E step and the second is the M step (a sketch of the weight-estimation loop follows).
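A minimal sketch of that weight-estimation loop, assuming the per-node unigram estimates are stacked in a matrix `theta` and the held-out data is given as word indices (both illustrative choices, not the authors' code):

```python
import numpy as np

def shrinkage_weights(theta, heldout_word_ids, n_iter=100, tol=1e-6):
    """EM-style estimation of the interpolation weights for one leaf class.

    theta            : (k, V) array; row i is the unigram estimate at the i-th
                       node on the leaf-to-root path.
    heldout_word_ids : word indices from held-out documents of this class.
    Returns (k,) mixture weights that sum to 1.
    """
    k = theta.shape[0]
    lambdas = np.full(k, 1.0 / k)                      # start from uniform weights
    words = np.asarray(list(heldout_word_ids))
    for _ in range(n_iter):
        # E step: responsibility of each node for each held-out word occurrence
        mix = lambdas[:, None] * theta[:, words] + 1e-12   # (k, n_words)
        resp = mix / mix.sum(axis=0, keepdims=True)
        # M step: new weights are the normalized total responsibilities
        new_lambdas = resp.sum(axis=1) / resp.sum()
        if np.abs(new_lambdas - lambdas).max() < tol:
            return new_lambdas
        lambdas = new_lambdas
    return lambdas
```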
- As described so far, the method makes inefficient use of the available training data by carving off some of it into a held-out set.
- To overcome this problem the algorithm is modified as follows:
  - All the available data is used both to construct the ML estimates and to optimize the weights.
  - As each document is used in the above algorithm, the ML estimates are modified to exclude its data, so that they remain independent of it.
- This method is known as "leave-one-out" or "jackknifing".
Experimental Results
- The following figure shows classification accuracy on the Industry Sector data set with 50-50 train-test splits while varying the vocabulary size.
- Larger vocabulary sizes generally perform better.
- Hierarchical feature selection somewhat improves the performance of flat naive Bayes in the mid range of feature selection, at about 5,000 words.
- With hierarchical feature selection, accuracy reaches 64%.
- Shrinkage improves classification accuracy to 74%.
Shrinkage helps more when training data is sparse
- The following figure shows accuracy on the Newsgroups data set with the full vocabulary and varying amounts of training data.
- Accuracy in this domain is highest with no feature selection for both classifiers when the amount of training data is small.
- Shrinkage provides more improvement when the amount of training data is small, because shrinkage reduces variance in the classifications.
- Estimates are improved by using shrinkage to smooth a class's parameters with those of its ancestors.
- The following figure shows classification accuracy on the Science hierarchy as a function of vocabulary size.
- The best performance for both classifiers is equal.
- Among the 51 classes with fewer than 50 training documents, shrinkage provides a 6% improvement in accuracy, from 39% for the flat classifier to 45% for the hierarchical one.
Pruning the Tree for Increased Computational Efficiency
- The class hierarchy can be leveraged to improve computational efficiency.
- The classifier can avoid computing the document likelihood for a majority of the classes by pruning the tree dynamically during the classification of each document (one possible scheme is sketched below).
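The slides do not spell out the pruning criterion, so the following is only a hypothetical greedy sketch of the idea; `node.children` and `node.log_likelihood` are assumed helpers:

```python
def classify_with_pruning(doc, root):
    """Greedy top-down classification: at each level, score only the current node's
    children, descend into the best one, and never evaluate the subtrees of the
    losing siblings (those branches are pruned for this document)."""
    node = root
    while node.children:
        node = max(node.children, key=lambda child: child.log_likelihood(doc))
    return node   # the leaf class reached
```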

Classification Using Very Few Words
- The task is to categorize documents according to their topic, where topics are organized in a hierarchy of increasing specificity.
- The bottleneck in this classification task is the need for a person to read each document and decide on its appropriate place in the hierarchy.
- This is avoided by automatically classifying new documents using machine learning techniques.
- The approach is to divide the classification task into a set of smaller classification tasks, each corresponding to some split in the classification hierarchy.
- The key insight is that each subtask is significantly simpler than the original task.
- The classifier at each node in the hierarchy needs to distinguish only between a small number of categories, which is possible using a small set of features.
- The ability to restrict attention to a very small feature set avoids many of the difficulties associated with very large feature spaces.
- Such models are more robust and less subject to overfitting.
- Thus they achieve better accuracy even with a very simple classifier such as naive Bayes.
- The key point is not merely the use of feature selection but its integration with the hierarchical structure.
- Choosing a small set of features would not give good performance if it were used for classification over the flattened class space.
- In the hierarchical approach, any document only sees a small fraction of the features throughout the process.
- The features it does see are divided so as to focus the classifier's attention on the features relevant to the classification subtask at hand.
Probabilistic Framework
- The approach constructs a hierarchical set of classifiers, each based on its own set of relevant features.
- It uses two main subroutines (see the skeleton sketch after this list):
  - A feature selection algorithm for deciding on the appropriate feature set at each decision point.
  - A supervised learning algorithm for constructing a classifier for that decision.
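A hypothetical Python skeleton of such a hierarchy of classifiers; `select_features` and `make_classifier` stand in for the two subroutines and are not defined in the slides:

```python
class HierNode:
    """One decision point in the hierarchy, with its own features and classifier."""

    def __init__(self, children=None):
        self.children = children or {}   # child label -> HierNode (empty at a leaf)
        self.features = None             # indices chosen by feature selection
        self.model = None                # classifier restricted to those features

    def fit(self, docs, child_labels, select_features, make_classifier, n_features=10):
        # Pick the features that best separate this node's immediate children,
        # then train a local classifier over them.
        self.features = select_features(docs, child_labels, n_features)
        self.model = make_classifier(docs[:, self.features], child_labels)

    def classify(self, doc):
        """Route a document down the tree, one small local decision at a time."""
        node = self
        while node.children:
            label = node.model.predict(doc[node.features].reshape(1, -1))[0]
            node = node.children[label]
        return node
```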
Bayesian Classifiers
- A Bayesian network provides compact descriptions of complex distributions over a large number of random variables.
- It uses a directed acyclic graph to encode conditional independence assumptions about the domain.
- These independence assumptions allow the distribution to be described as a product of small local interaction models.
- A Bayesian classifier is simply a Bayesian network applied to a classification domain.
- It contains a node C for the class variable and a node Xi for each of the features.
- For a specific instance x, the Bayesian network allows us to compute the probability P(C = ck | X1 = x1, …, Xn = xn) for each possible class ck.
- Bayes-optimal classification is achieved by simply selecting the class ck for which this probability is maximized (a sketch follows).
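For binary features, the decision rule amounts to the following small sketch (the Bernoulli parameterization here is illustrative):

```python
import numpy as np

def naive_bayes_predict(x, prior, theta):
    """Bayes-optimal decision for a naive Bayes classifier with binary features.

    prior : (n_classes,) array of P(C = c_k).
    theta : (n_classes, n_features) array of P(X_i = 1 | c_k).
    x     : (n_features,) 0/1 feature vector for one instance.
    """
    # log P(c_k | x) is proportional to log P(c_k) + sum_i log P(X_i = x_i | c_k)
    log_post = (np.log(prior)
                + (np.log(theta) * x).sum(axis=1)
                + (np.log(1.0 - theta) * (1 - x)).sum(axis=1))
    return int(np.argmax(log_post))          # select the most probable class
```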
Feature Selection
- Feature selection employs information-theoretic measures to determine a subset of the original domain features that best captures the class distribution in the data.
- For each feature Xi the algorithm determines the expected cross entropy
  δ_i = Σ_x P(x) · D( P(C | x) ‖ P(C | x_{-i}) ),
  where x_{-i} denotes the values of all domain features except Xi and D(· ‖ ·) denotes the cross entropy (KL divergence).
- It eliminates the feature Xi for which δ_i is minimized.
Feature Selection
- The feature eliminated is the one that least disrupts the original conditional class distribution.
- This process can be iterated to eliminate as many features as desired.
- In this respect the algorithm is well suited to text domains with many features (a brute-force sketch of the elimination step follows).
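A brute-force illustration of one elimination step for small discrete data sets (the paper itself relies on an approximation to keep this tractable; the helper below is only a sketch):

```python
import numpy as np
from collections import defaultdict

def kl(p, q, eps=1e-12):
    """KL divergence between two (unnormalized) count vectors."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float((p * np.log(p / q)).sum())

def least_disruptive_feature(X, y, n_classes):
    """Index of the feature whose removal least changes the empirical P(C | x).

    X : (n_samples, n_features) array of discrete feature values.
    y : (n_samples,) array of class labels in range(n_classes).
    """
    n, d = X.shape

    def class_counts(keys):
        counts = defaultdict(lambda: np.zeros(n_classes))
        for key, label in zip(keys, y):
            counts[key][label] += 1
        return counts

    full = class_counts(tuple(row) for row in X)
    scores = []
    for i in range(d):
        reduced_keys = [tuple(np.delete(row, i)) for row in X]
        reduced = class_counts(reduced_keys)
        # expected cross entropy between P(C | x) and P(C | x without feature i)
        delta = sum(kl(full[tuple(row)], reduced[key])
                    for row, key in zip(X, reduced_keys)) / n
        scores.append(delta)
    return int(np.argmin(scores))
```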

Experimental Results
- Both the hierarchical and flat classification schemes are first run on the datasets without employing any probabilistic feature selection.
- The original number of features in each data set and the results are given in the following table.


- Two important phenomena are observed:
  - In the Hier1 dataset, the very large number of features used prevents the hierarchical scheme from performing better than the simple flat method.
  - In the Hier2 dataset, the large number of features and small dataset size allow the more expressive KDB algorithm to overfit the training data.
- These results provide an empirical motivation for integrating feature selection.
- The table results show that the hierarchical method clearly outperforms the flat classification method in a direct comparison of the 10- and 20-feature runs.
- The hierarchical method produces 8-41% fewer errors than the flat method on Hier1, with more modest relative gains on Hier2.





- The results show that the feature selection stage does serve to focus the algorithm on the features relevant to the local classification task.
- The table shows the set of 10 features found to be most discriminating at each level of the hierarchy learned for the Hier1 dataset.
- At the top level of the hierarchy, high-level terms are selected from the various major topics; at lower levels, more specific words distinguishing the subtopics appear.
Enhanced Word Clustering for Hierarchical Text Classification
- Distributional clustering of words/features is one way to reduce dimensionality.
- Each word cluster can be treated as a single feature, so dimensionality can be drastically reduced.
- Feature clustering is more effective than feature selection, especially at lower numbers of features.
- Feature clustering appears to preserve classification accuracy compared to a full-feature classifier.
- With small training sets and noisy features, word clustering can actually increase classification accuracy.
- The algorithms previously used are agglomerative in nature, yielding suboptimal word clusters at a high computational cost.
Distributional Word Clustering
- Let C be a discrete random variable that takes values from the set of classes, and let W be the random variable that ranges over the set of words.
- The joint distribution p(C, W) can be estimated from the training set.
- Now suppose the words are clustered into k clusters W1, …, Wk. Since the goal is to reduce the number of features and the model size, only "hard" clustering is considered, where each word belongs to exactly one word cluster.
- Let W^c be the random variable that ranges over the word clusters.
- The information about C captured by W can be measured by the mutual information I(C; W). Ideally, forming word clusters would preserve this mutual information exactly.
- However, clustering usually lowers mutual information, so the aim is to find a clustering that minimizes the decrease I(C; W) - I(C; W^c) (a small sketch of this quantity follows).
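The quantity being minimized can be computed directly from the empirical joint distribution; a minimal sketch (array shapes are illustrative):

```python
import numpy as np

def mutual_information(joint):
    """I(row variable; column variable) of a joint distribution given as a matrix."""
    joint = joint / joint.sum()
    pr = joint.sum(axis=1, keepdims=True)    # marginal of the row variable
    pc = joint.sum(axis=0, keepdims=True)    # marginal of the column variable
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (pr @ pc)[nz])).sum())

def mi_loss(p_cw, cluster_of_word):
    """Loss I(C; W) - I(C; W^c) for a hard clustering of the word columns.

    p_cw            : (n_classes, n_words) empirical joint distribution p(C, W).
    cluster_of_word : (n_words,) array assigning each word to one cluster id.
    """
    k = int(cluster_of_word.max()) + 1
    p_cwc = np.zeros((p_cw.shape[0], k))
    for w, s in enumerate(cluster_of_word):
        p_cwc[:, s] += p_cw[:, w]            # hard clustering: pool word columns
    return mutual_information(p_cw) - mutual_information(p_cwc)
```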
Classifying Using Word Clusters
- The Naive Bayes method can be translated directly to use word clusters instead of words.
- This is done by estimating new parameters p(Ws | ci) for the word clusters, analogous to the word parameters p(wt | ci).
- When the estimates of p(wt | ci) for individual words are relatively poor, the corresponding word-cluster parameters p(Ws | ci) provide more robust estimates, resulting in higher classification scores.
- The Naive Bayes rule for classifying a test document d can then be rewritten in terms of the cluster parameters, scoring each class with cluster counts rather than individual word counts (a sketch follows).
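A minimal sketch of Naive Bayes over word clusters, assuming a hard word-to-cluster assignment and Laplace smoothing (both illustrative choices):

```python
import numpy as np

def cluster_nb_params(word_counts, cluster_of_word, alpha=1.0):
    """Estimate p(Ws | ci) by pooling the counts of all words in each cluster.

    word_counts     : (n_classes, n_words) word counts per class from training data.
    cluster_of_word : (n_words,) cluster id for every word.
    Returns an (n_classes, n_clusters) matrix of smoothed cluster probabilities.
    """
    k = int(cluster_of_word.max()) + 1
    cluster_counts = np.zeros((word_counts.shape[0], k))
    for w, s in enumerate(cluster_of_word):
        cluster_counts[:, s] += word_counts[:, w]
    smoothed = cluster_counts + alpha
    return smoothed / smoothed.sum(axis=1, keepdims=True)

def classify(doc_word_counts, cluster_of_word, class_prior, p_cluster_given_class):
    """Naive Bayes decision using cluster features instead of individual words."""
    k = p_cluster_given_class.shape[1]
    doc_cluster_counts = np.zeros(k)
    for w, n in enumerate(doc_word_counts):
        doc_cluster_counts[cluster_of_word[w]] += n
    log_post = np.log(class_prior) + doc_cluster_counts @ np.log(p_cluster_given_class).T
    return int(np.argmax(log_post))
```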


- Figures 2 and 3 plot the fraction of mutual information lost against the number of clusters for both the divisive and agglomerative algorithms on the 20Ng and Dmoz data sets.
- Less mutual information is lost with divisive clustering than with agglomerative clustering at every number of clusters, though the difference is more pronounced at lower numbers of clusters.





- Figure 5 shows classification accuracies on the 20 Newsgroups data set for the algorithms considered.
- The horizontal axis indicates the number of features/clusters used in the classification model, while the vertical axis indicates the percentage of test documents that were classified correctly.
- Divisive clustering (with SVM as well as NB) achieves significantly better results at lower numbers of features than feature selection using information gain or mutual information.
- With 50 clusters, divisive clustering (NB) achieves 78.05% accuracy.
- The largest gain occurs when the number of clusters equals the number of classes.








- Figure 6 plots classification accuracy on the 20Ng data using Naive Bayes when the training data is sparse.
- 2% of the available data, that is 20 documents per class, is used for training, and testing is done on the remaining 98% of the documents.
- The results are averages over 5-10 trials.
- Divisive clustering obtains better results than information gain at every number of features, and achieves a significant 12% increase over the maximum accuracy achieved by information gain.
- This is in contrast to Figure 5, where information gain eventually catches up as the number of features increases.
- When the training data is small, the word-by-class frequency matrix contains many zero entries. By clustering words, more robust estimates of word-class probabilities are obtained, which lead to higher classification accuracies.
Classification of Web Content
- Using the known hierarchical structure, the classification problem is decomposed into a set of smaller problems corresponding to hierarchical splits in the tree.
- The approach first learns to distinguish among classes at the top level; lower-level distinctions are then learned only within the appropriate top-level branch of the tree.
- Each of these sub-problems can be solved much more efficiently, and more accurately as well.
Classification of Web Content
- Using a hierarchical decomposition of a classification problem allows for efficiencies in both learning and representation.
- Each sub-problem is smaller than the original problem, and it is sometimes possible to use a much smaller set of features for each.
- The hierarchical structure can also be used to set the negative set for discriminative training and, at classification time, to combine information from different levels.
SVM for Web Content
- SVMs have been found to be efficient and effective for text classification, but had not previously been explored in the context of hierarchical classification.
- The efficiency of SVMs for both initial learning and real-time classification makes them applicable to large, dynamic collections such as web content.
Hierarchical Structure Used in Web Content
- The hierarchical structure is used for two purposes:
  - To train second-level category models using different contrast sets (either within the same top-level category in the hierarchical case, or across all categories in the flat, non-hierarchical case).
  - To combine scores from the top- and second-level models using different combination rules, some of which require a threshold to be exceeded at the top level before second-level comparisons are made.
Classifying Web Search Results
- Classification techniques are used to automatically organize search results into existing hierarchical structures.
- Classification models are learned offline using a training set of human-labeled documents and web categories.
- Classification offers two advantages compared to clustering:
  - Run-time classification is very efficient.
  - Manually generated category names are easily understood.
Constraints
- To support the goal of automatically classifying web search results, two constraints are imposed:
  - Only the short summaries returned by web search engines are used, since it takes too long to retrieve the full text of pages in a networked environment. These automatically generated summaries are much shorter than the texts used in most classification experiments, and they are much noisier than other document surrogates, such as abstracts, that others have worked with.
  - The focus is on the top levels of the hierarchy, since many search results can be usefully disambiguated at this level.
- Interfaces that tightly couple search results and category structure are found to have large preference and performance advantages for automatically classified search results.
Text Classification Using SVMs
- Text classification involves a training phase and a testing phase.
- In the training phase, a large set of web pages with known category labels is used to train a classifier. An initial model is built using a subset of the labeled data, and a holdout set is used to identify optimal model parameters.
- During the testing or operational phase, the learned classifier is used to classify new web pages.
- A support vector machine (SVM) algorithm was used as the classifier.
- A linear SVM is a hyperplane that separates a set of positive examples from a set of negative examples with maximum margin.
- The margin is the distance from the hyperplane to the nearest of the positive and negative examples.
- In the linearly separable case, maximizing the margin can be expressed as an optimization problem (shown below).
- In cases where the points are not linearly separable, slack variables are introduced that permit, but penalize, points that fall on the wrong side of the decision boundary.
- For problems that are not linearly separable, kernel methods can also be used to transform the input space so that some non-linear problems can be learned.
- The simplest, linear form of the SVM is used here because it provides good classification accuracy and is fast to learn and apply.
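For reference, the soft-margin optimization the slide describes (with slack variables ξi and penalty C) has the standard form:

```latex
\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}} \;
  \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2} + C \sum_{i} \xi_{i}
\quad \text{subject to} \quad
  y_{i}\bigl(\mathbf{w}\cdot\mathbf{x}_{i} + b\bigr) \ge 1 - \xi_{i},
  \qquad \xi_{i} \ge 0 .
```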
Feature Selection in Web Content
- The feature space is reduced by eliminating words that appear in only a single document, and then selecting the 1,000 words with the highest mutual information with each category.
- The mutual information MI(F, C) between a feature F and a category C is defined as
  MI(F, C) = Σ_f Σ_c P(f, c) · log [ P(f, c) / (P(f) · P(c)) ],
  where f and c range over presence/absence of the feature and membership/non-membership in the category.
- The mutual information is computed between every pair of features and categories (a sketch follows).
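A small sketch of this selection step for one category, assuming binary (presence/absence) feature indicators and binary category membership:

```python
import numpy as np

def mutual_information_fc(f, c, eps=1e-12):
    """MI between a binary feature indicator f and a binary category indicator c."""
    f, c = np.asarray(f, bool), np.asarray(c, bool)
    n = len(f)
    mi = 0.0
    for fv in (True, False):
        for cv in (True, False):
            p_fc = np.sum((f == fv) & (c == cv)) / n
            p_f, p_c = np.sum(f == fv) / n, np.sum(c == cv) / n
            if p_fc > 0:
                mi += p_fc * np.log(p_fc / (p_f * p_c + eps))
    return mi

def top_features_for_category(X_binary, in_category, k=1000):
    """Indices of the k features with highest MI with one category.

    X_binary    : (n_docs, n_features) 0/1 matrix of feature occurrence.
    in_category : (n_docs,) 0/1 vector of category membership.
    """
    scores = [mutual_information_fc(X_binary[:, j], in_category)
              for j in range(X_binary.shape[1])]
    return np.argsort(scores)[::-1][:k]
```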
SVM Parameters
- In addition to varying the number of features, SVM performance is governed by two parameters:
  - C, the penalty imposed on training examples that fall on the wrong side of the decision boundary.
  - p, the decision threshold.
- The default value of the C parameter (0.01) is used.
- The decision threshold p can be set to control precision and recall for different tasks. Increasing p results in fewer test items meeting the criterion, which usually increases precision but decreases recall; conversely, decreasing p typically decreases precision but increases recall.
- p is chosen so as to optimize performance on the F1 measure on a training validation set (a sketch of this tuning follows).
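A minimal sketch of that tuning step: sweep candidate thresholds over a validation set and keep the one with the best F1 (the grid is an illustrative choice):

```python
import numpy as np

def choose_threshold(scores, labels, grid=np.linspace(-1.0, 1.0, 201)):
    """Pick the decision threshold p that maximizes F1 for one category.

    scores : SVM decision values for the validation documents.
    labels : 1 if the document belongs to the category, else 0.
    """
    scores, labels = np.asarray(scores), np.asarray(labels)
    best_p, best_f1 = grid[0], -1.0
    for p in grid:
        pred = scores >= p
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        if f1 > best_f1:
            best_p, best_f1 = p, f1
    return best_p
```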
Results
- Decision thresholds were established on a training validation set.
- For each category, if a test item exceeds the decision threshold, it is judged to be in the category; a test item can therefore be in zero, one, or more than one category.
- From this, precision (P) and recall (R) are computed, and micro-averaged so that each category's contribution is weighted by the number of test examples in it (see the sketch below).
- For each test example, the probability of it being in each of the 13 top-level categories and each of the 150 second-level categories is computed.
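Micro-averaging pools the per-category decision counts before computing the ratios; a small sketch:

```python
def micro_average(per_category_counts):
    """Micro-averaged precision and recall from per-category (tp, fp, fn) tuples."""
    tp = sum(c[0] for c in per_category_counts)
    fp = sum(c[1] for c in per_category_counts)
    fn = sum(c[2] for c in per_category_counts)
    precision = tp / max(tp + fp, 1)   # guard against the no-predictions edge case
    recall = tp / max(tp + fn, 1)
    return precision, recall
```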
References
- Improving Text Classification by Shrinkage in a Hierarchy of Classes. Andrew McCallum, Ronald Rosenfeld, Tom Mitchell, and Andrew Y. Ng. http://www.cs.cmu.edu/~mccallum/papers/hier-icml98.ps.gz
- Hierarchically Classifying Documents Using Very Few Words. Daphne Koller and Mehran Sahami. http://dbpubs.stanford.edu:8090/pub/showDoc.Fulltext?lang=en&doc=1997-75&format=pdf&compression=&name=1997-75.pdf
- Hierarchical Classification of Web Content. Susan Dumais and Hao Chen. http://research.microsoft.com/~sdumais/sigir00.pdf
- Enhanced Word Clustering for Hierarchical Text Classification. Inderjit S. Dhillon, Subramanyam Mallela, and Rahul Kumar. http://www.cs.utexas.edu/users/inderjit/public_papers/hierdist.pdf
THANK YOU