Concept Ontology For Text Classification


Concept Ontology For Text Classification
Swaranika Palit
Content
• Introduction
• Shrinkage in a Hierarchy of Classes
• Hierarchical Classification Using Few Words
• Enhanced Word Clustering for Hierarchical Text Classification
• Hierarchical Classification of Web Content
• References
Motivation
• Huge volumes of worldwide accessible information
• Need to assimilate and profitably use this large volume of information
• Manual organization may not be feasible: it is expensive, time-constrained, and involves a large number of documents
• Automated classification: a combination of information retrieval technology and machine learning
Text Classification Task
• Text classification is the task of automatically sorting a set of documents into categories from a predefined set.
• Given a text document, decide which of several pre-defined categories it belongs to.
• The count of each word gives the document a certain probability of belonging to each category.
• Task of approximating an unknown target function Φ : D × C → {T, F} (which describes how documents ought to be classified, according to a supposedly authoritative expert) by means of a function Φ̂ : D × C → {T, F} called the classifier, where C = {c1, ..., c|C|} is a predefined set of categories and D is a (possibly infinite) set of documents. If Φ(dj, ci) = T, then dj is called a positive example (or a member) of ci, while if Φ(dj, ci) = F it is called a negative example of ci.
Hierarchical Text Classification
Motivation:
• In a flat classifier, categories are treated independently and the relationships among them are not exploited
• A hierarchy allows larger problems to be handled with a divide-and-conquer approach
Hierarchy ≡ is-a relationship, e.g. Sportsperson → Cricketer → Sachin Tendulkar
Statistical Approaches For Learning Text Classification
• Each document di is represented as a vector of word counts, generated from a multinomial distribution (a standard form of this model is sketched below).
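A commonly used form of the class-conditional multinomial document model (assumed here, since the slide's own equation is not preserved in the transcript; N(wt, di) denotes the number of times word wt occurs in document di, as defined later in the deck):

  P(d_i \mid c_j; \theta) \propto \prod_{t=1}^{|V|} P(w_t \mid c_j; \theta)^{N(w_t, d_i)}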
Modeling Document Classes
• Documents with similar word-count vectors are assumed to belong to the same category; categories (classes) are modeled as clusters of documents in word-count space.
• Mixture Model of Classes:
The overall probability of a document is a mixture of the per-class multinomial distributions, weighted by the class probabilities, as written out below.
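In symbols, the mixture-of-classes assumption (standard form, stated here for clarity rather than taken from the slide) is:

  P(d_i) = \sum_{j=1}^{|C|} P(c_j)\, P(d_i \mid c_j)

so each document is generated by first picking a class cj with probability P(cj) and then drawing the document from that class's multinomial.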
Problem
• Typically represent documents as vectors of words
• Rely on learned word statistics, so these approaches are data-intensive
• Require a large number of hand-labeled training documents per class to achieve high classification accuracy
Question
How can we scale up statistical learning algorithms
1) to tasks with a large number of classes, and
2) to sparse training data per class?
Introduction of advanced techniques…
Shrinkage in Hierarchy Of Classes
• Apply a technique from Statistics called “ Shrinkage” which provides improved estimates
of parameters which would otherwise be uncertain due to sparse training data.
• This method exponentially reduces the amount of computation for classification, with only a small sacrifice in accuracy
• Specially applicable when hierarchy is large and training data is sparse
• It exploits hierarchy by shrinking parameter estimates in data-sparse children towards the
estimates of data-rich parents.
• Creates new parameter estimates in a child by linear interpolation of all hierarchy nodes
from child to root.
Shrinkage in Statistics
• Given several different estimates of the same quantity, the estimates can be improved by scaling and shifting them all toward a common value, thereby shrinking the variance of each.
Text Classification
Automatically placing documents in their correct
categories.
[Slide figure: a test document ("grow corn tractor…", true class Crops) is to be classified against training data for the categories Irrigation (water, grating, ditch, farm, tractor…), Crops (corn, wheat, silo, farm, grow…), Botany (corn, tulips, splicing, grow…), Evolution (selection, mutation, Darwin, Galapagos, DNA…), Magnetism (…), and Relativity (…).]
The Idea: “Shrinkage”
We can improve the parameter estimates in a leaf by averaging
them with the estimates in its ancestors.
[Slide figure: the same categories arranged in a hierarchy with Science at the root; Agriculture has children Irrigation and Crops, Biology has children Botany and Evolution, and Physics has children Magnetism and Relativity. The test document "corn grow tractor…" (true class Crops) is to be classified.]
Bayesian Learning Framework Approach
• Assume that text data is generated by a parametric model and use training data to calculate estimates of the model parameters.
• With the help of these estimates, we classify new text documents by Bayes' rule, calculating the posterior probability that each class would have generated the test document in question.
• Classification then amounts to selecting the most probable class.
• Apply the naive Bayes assumption: the probability of each word event in a document is independent of the word's context and of its position in the document, given the class.
Bayesian Text Classification
• Given a document di, find the category or categories cj that it belongs to. Two parameters must be learned:
The probability that category cj generated document di.
The prior probability of category cj.
Continued…
• N(wt, di) is the number of times wt occurs in document di, and P(cj|di) ∈ {0, 1} is the document's class label. The estimate of the probability of word wt in class cj is then as sketched below:
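A minimal sketch of this word-probability estimate, following the standard Laplace-smoothed form used with multinomial naive Bayes (the exact equation on the slide is not preserved, so this is an assumption; V is the vocabulary and D the training set):

  \hat{P}(w_t \mid c_j) = \frac{1 + \sum_{d_i \in D} N(w_t, d_i)\, P(c_j \mid d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{d_i \in D} N(w_s, d_i)\, P(c_j \mid d_i)}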
Shrinkage for Text Classification
• Use shrinkage to better estimate the probability of a word given a class.
• For each node in the tree, a maximum-likelihood (ML) estimate based on the data associated with that node is calculated.
• An improved estimate for each leaf node is obtained by shrinking its ML estimate towards the ML estimates of all of its ancestors.
• In terms of statistical modeling, a unigram model is built for each node, and each leaf model is smoothed by linearly interpolating it with all the models found along the path to the root.
• The estimates along the path from leaf to root represent a tradeoff between specificity and reliability.
Question
Given a set of ML estimates along the path from leaf to root , how do we
decide on the weights for interpolating (mixing) them?
Shrinkage Algorithm: Learning the Mixture Weights λ
Shrinkage improves the estimated probability of word wt given class cj by bringing it closer to the estimates for its super-classes (ancestors) k, through a linear combination, as sketched below.
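A sketch of the interpolation (standard shrinkage form; the λ notation follows the slide, the rest is assumed), where level 1 is the leaf class, the higher levels are its ancestors up to the root, and level k can be a uniform distribution:

  \hat{P}^{\text{shrink}}(w_t \mid c_j) = \lambda_j^{1}\, \hat{P}^{1}(w_t \mid c_j) + \lambda_j^{2}\, \hat{P}^{2}(w_t \mid c_j) + \cdots + \lambda_j^{k}\, \hat{P}^{k}(w_t \mid c_j), \qquad \sum_{a=1}^{k} \lambda_j^{a} = 1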
Shrinkage Algorithm
• Calculate initial parameter estimates at each node using word counts.
• Initialize the weights for the super-classes to any normalized value, such as λja = 1/k.
• Repeat until convergence:
Expectation step
Maximization step
• Expectation step: use the current λ's to estimate the degree to which each node was likely to have generated the words in held-out documents.
• Maximization step: use these estimates to recalculate new values for the λ's.
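A minimal Python sketch of this EM loop (illustrative only, not the authors' implementation; node_probs, held_out and the variable names are assumptions):

# EM for shrinkage mixture weights (illustrative sketch).
# node_probs[a][w]: ML estimate P_a(w | c_j) at ancestor level a (0 = leaf, k-1 = root/uniform)
# held_out: word indices from held-out documents of class c_j
def learn_shrinkage_weights(node_probs, held_out, n_iters=50):
    k = len(node_probs)
    lam = [1.0 / k] * k                      # start with uniform weights
    for _ in range(n_iters):
        # E-step: expected degree to which each level generated each held-out word
        counts = [0.0] * k
        for w in held_out:
            mix = [lam[a] * node_probs[a][w] for a in range(k)]
            total = sum(mix) or 1e-12
            for a in range(k):
                counts[a] += mix[a] / total
        # M-step: renormalize the expected counts into new weights
        total_counts = sum(counts) or 1e-12
        lam = [c / total_counts for c in counts]
    return lam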
Experimental Results
The shrinkage algorithm is tested on three hierarchical sets of documents:
Company web pages, classified by industry sector.
Usenet articles, classified into 20 newsgroup categories.
Web pages from the Yahoo “science” and “health” hierarchies.
The industry sector results show that:
larger vocabulary sizes generally perform better; shrinkage improves classification accuracy (to roughly 74%); hierarchical feature selection improves the naive Bayes classifier at around 5,000 words; and shrinkage performs better than naive Bayes at all vocabulary sizes.
Yahoo Science
• Accuracy does not improve with vocabulary size for either classifier.
• Accuracy is actually reduced for naïve Bayes above 10,000 words.
• The lack of improvement may be an artifact of the experimental setup, which can
only handle hierarchies of depth two.
• The second graph shows accuracy averaged over classes instead of documents.
• Shrinkage shows a much greater performance improvement over naive Bayes.
• This is due to the nature of the Yahoo training set, in which the majority of documents were lumped into a few classes.
Newsgroup
Shrinkage Helps more when training data is
sparse
• In this experiment, the vocabulary size is held constant, while the size of the
training set varies.
• Hierarchical feature selection does not improve either classifier and is not
shown.
• Shrinkage performs better than naive Bayes, but by a smaller margin, due to
lower fan-out.
• Shrinkage performs better than naive Bayes by a greater margin for smaller
training sets.
Hierarchical Classification using few words
Motivation:
• Categorize documents according to their topic, where topics are organized in a hierarchy of increasing specificity.
• In text classification we face a large number of classes and a huge number of relevant features needed to distinguish them.
• We are restricted to simple classifiers due to computational cost and the tendency of complex models to overfit.
Approach:
• Divide the classification task into a set of smaller classification tasks, each of which corresponds to some split in the classification hierarchy.
• Each subtask is significantly simpler than the original task.
Key Note
• The classifier at each node in the hierarchy needs to distinguish only between a smaller number of categories, so it is possible to use a small set of features.
• With a small feature set, models are robust, less subject to overfitting, and provide better accuracy even for simple classifiers.
• The key point is the integration of feature selection with the hierarchical structure.
• In the hierarchical approach, any document only sees a small fraction of the features throughout the process, and the features it sees are divided so as to focus the attention of the classifier on the features relevant to the classification subtask at hand.
• The approach is applicable to the medical domain as well, e.g. to classify the disease a patient has based on symptoms and test results.
Probabilistic Framework
• Construct a hierarchical set of classifiers, each based on its own set of relevant features.
• Two main subroutines:
(1) A feature selection algorithm for deciding on the appropriate feature set at each decision point.
(2) A supervised learning algorithm for constructing a classifier at each decision point.
• The basic idea is that our model of the world is represented as a probability distribution over the space of possible states of the world.
• A state of the world is described via some set of random variables, so that each state is an assignment of values to these variables.
Bayesian Classifier
• A Bayesian network provides compact descriptions of complex distributions over a large number of variables.
• It uses a directed acyclic graph to encode conditional independence assumptions about the domain, and these assumptions allow the distribution to be described as a product of small local interaction models.
• A Bayesian classifier is a Bayesian network applied to a classification domain.
• It contains a node C for the class variable and a node Xi for each of the features.
• For a specific instance x, it allows us to compute the probability P(C = ck | X = x) for each possible class ck.
• Bayes-optimal classification is achieved by selecting the class ck for which this probability is maximal, as sketched below.
• The naïve Bayesian classifier restricts the domain features to be conditionally independent of each other, given the class variable.
• This assumption is clearly unrealistic in the text domain, so approaches have been proposed for augmenting the Bayesian classifier with limited interactions between feature variables, i.e. allowing each node to have some parents beyond the class variable.
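For reference, the naive Bayes decision rule implied above (standard form, not copied from the slide) is:

  c^{*} = \arg\max_{c_k} P(C = c_k) \prod_{i} P(X_i = x_i \mid C = c_k)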
Feature Selection
• Employs information-theoretic measures to determine a subset of the original domain features that seems to best capture the class distribution in the data.
• For each feature Xi, it determines the expected cross-entropy between the class distribution given all features X and the class distribution given X_i, where X_i denotes the set of all domain features except Xi (a sketch of this measure is given after this list).
• It then eliminates the feature Xi for which this expected cross-entropy is minimized.
• The feature eliminated is the one that least disrupts the original conditional class distribution.
• The process can be iterated to eliminate as many features as desired.
• Since the method assumes conditional independence of features, updating P(C|X) to P(C|X_i) after eliminating Xi is a fast process, which makes it very applicable to text domains with many features.
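A sketch of the cross-entropy criterion under the above notation (my reading of the Koller and Sahami formulation; the exact slide equation is not preserved):

  \delta_i = \sum_{\mathbf{x}} P(\mathbf{x}) \, D_{KL}\big( P(C \mid \mathbf{x}) \;\|\; P(C \mid \mathbf{x}_i) \big)

where x_i is the instance x with the value of Xi removed, and the feature with the smallest δi is eliminated.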
Experimental Results
• Two subsets of the Reuters collection, Hier1 and Hier2, were extracted and split 70%/30% into class-stratified training and testing sets.
• A single pass of a Zipf's-law-based feature selection method was applied, eliminating all words appearing fewer than 10 or more than 1000 times in the training corpus.
• Both hierarchical and flat classification schemes were run on the datasets without employing any probabilistic feature selection.
• The original numbers of features in each dataset and the results are given in the following table:
Two important phenomena are observed:
• In the Hier1 dataset, the very large number of features used precludes the hierarchical scheme from performing better than the simple flat method.
• In the Hier2 dataset, the large number of features and the small dataset size allow the more expressive KDB algorithm to overfit the training data.
• These initial results provide an empirical motivation for integrating feature selection.
Next, the feature set was reduced to 20 and then 10 features per decision node (so the hierarchical method actually examines a larger overall set of features), while for the flat method the 50 most relevant features were used. The results are shown in the table:
• In every run using feature selection, an improvement in accuracy was found, even though 1,248 features were eliminated from the Hier1 dataset.
• The hierarchical method clearly outperforms the flat classification method in a direct comparison of the 10- and 20-feature runs.
• The hierarchical method produces 8-41% fewer errors than the flat methods on Hier1, with more modest relative gains on Hier2.
• These results reveal that the feature selection stage does serve to focus the algorithm on the features relevant to the local classification task.
• The last table shows the set of 10 features found to be most discriminating at each level of the hierarchy learned for the Hier1 dataset.
• At the top level of the hierarchy, high-level terms are selected from the various major topics.
• Further down, more specific words distinguishing the subtopics are seen.
Enhanced Word Clustering For Hierarchical
Text Classification
• Typically, document vectors are formed using a vector-space or bag-of-words model and are extremely high dimensional.
• This high dimensionality can be an obstacle for classification algorithms such as SVMs and k-nearest neighbor.
• Distributional clustering of words/features is a way to reduce dimensionality, where each word cluster is treated as a single feature.
• Feature clustering is more effective than feature selection at lower numbers of features.
• For small training sets and noisy features, word clustering increases classification accuracy.
• Previously used algorithms are agglomerative in nature, yielding suboptimal word clusters at a high computational cost.
Distributional Word Clustering
• C is a discrete random variable that takes values from the set of classes C = {c1, c2, …, cl}, and W is a random variable that ranges over the set of words W = {w1, …, wm}.
• The joint distribution p(C, W) can be estimated from the training set.
• Suppose we cluster the words into k clusters W1, W2, …, Wk. We consider only "hard" clustering, where each word belongs to exactly one word cluster, since we are concentrating on reducing the number of features and the model size.
• Let the random variable Ŵ range over the word clusters.
• To judge the quality of the word clusters we use an information-theoretic measure.
• The information about C captured by W is measured by the mutual information I(C; W). Ideally, forming word clusters would preserve the mutual information exactly.
• But clustering usually lowers the mutual information.
• This motivates finding a clustering that minimizes the decrease in mutual information, I(C; W) - I(C; Ŵ).
• The paper introduces a divisive algorithm for word clustering based on Kullback-Leibler (KL) divergences.
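For reference, the mutual information referred to above is the standard quantity (definition added here for clarity, not taken from the slide):

  I(C; W) = \sum_{j=1}^{l} \sum_{t=1}^{m} p(c_j, w_t) \log \frac{p(c_j, w_t)}{p(c_j)\, p(w_t)}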
Classifying using Word Clusters
• We can adapt the naïve Bayes method so that it uses word clusters instead of words.
• The naïve Bayes classifier is parameterized by word probabilities p(wt|ci).
• New parameters p(Ws|ci) for the word clusters are estimated in a way similar to p(wt|ci).
• When the estimates of p(wt|ci) for individual words are relatively poor, the corresponding word-cluster parameters p(Ws|ci) provide more robust estimates, which results in higher classification scores.
• The naïve Bayes rule for classifying a test document can then be written in terms of clusters, as sketched below.
• SVMs can similarly be used with word clusters.
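A sketch of the cluster-based quantities (my paraphrase of the standard construction, ignoring smoothing terms; the exact slide equations are not preserved): the cluster parameter aggregates the counts of the cluster's member words, and the decision rule scores a test document d by its cluster counts,

  p(W_s \mid c_i) = \frac{\sum_{w_t \in W_s} \sum_{d \in c_i} N(w_t, d)}{\sum_{s'=1}^{k} \sum_{w_t \in W_{s'}} \sum_{d \in c_i} N(w_t, d)}, \qquad
  c^{*}(d) = \arg\max_{c_i} \Big[ \log p(c_i) + \sum_{s=1}^{k} N(W_s, d) \log p(W_s \mid c_i) \Big]

where N(Ws, d) is the total count in d of words belonging to cluster Ws.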
• Divisive Clustering achieves higher classification accuracy than the best feature
selection method when training data is sparse.
• The previous figures plot the fraction of mutual information lost against the number of clusters for both the divisive and agglomerative algorithms on the 20Ng and Dmoz datasets.
• The fraction of mutual information lost due to clustering words is (I(C; W) - I(C; Ŵ)) / I(C; W).
• The numerator, I(C; W) - I(C; Ŵ), is precisely the global objective function that Divisive Clustering attempts to minimize.
• Less mutual information is lost with Divisive Clustering than with agglomerative clustering at all numbers of clusters, though the difference is more pronounced at lower numbers of clusters.
• Figure 5 shows classification accuracies on the 20 Newsgroups data set for the algorithms considered.
• The horizontal axis indicates the number of features/clusters used in the classification model, while the vertical axis indicates the percentage of test documents that were classified correctly.
• Divisive Clustering (with SVM as well as NB) achieves significantly better results at lower numbers of features than feature selection using Information Gain or Mutual Information.
• With 50 clusters, Divisive Clustering (NB) achieves 78.05% accuracy.
• The largest gain occurs when the number of clusters equals the number of classes.
• In Figure 6, classification accuracy on the 20Ng data using Naive Bayes is plotted for the case where the training data is sparse.
• 2% of the available data, that is 20 documents per class, is taken for training, and testing is done on the remaining 98% of the documents.
• The results are averages over 5-10 trials.
• Divisive Clustering obtains better results than Information Gain at all numbers of features.
• It also achieves a significant 12% increase over the maximum accuracy achieved by Information Gain.
• This is in contrast to Figure 5, where Information Gain eventually catches up as the number of features increases.
• When the training data is small, the word-by-class frequency matrix contains many zero entries.
• By clustering words, more robust estimates of word-class probabilities are obtained, which lead to higher classification accuracies.
Hierarchical Classification of Web Content
• In the hierarchical case, a model is learned to distinguish a second-level category from the other categories within the same top level.
• Classification problem is decomposed into a set of smaller problems corresponding
to hierarchical splits in the tree by utilizing hierarchical structure.
• Use of a hierarchical decomposition of a classification problem allows for
efficiencies in both learning and representation.
• Each sub-problem is smaller than the original problem, and it is sometimes possible to use a much smaller set of features for each.
• Hierarchical structure can also be used to set the negative set for discriminative
training and at classification time to combine information from different levels.
Use Of SVM for Web Content
• The SVM learning model has been found to be efficient and effective for text classification.
• The efficiency of SVMs for both initial learning and real-time classification makes them applicable to large, dynamic collections such as web content.
• The hierarchical structure is used for two purposes:
• Train second-level category models using different contrast sets (either within the same top-level category in the hierarchical case, or across all categories in the flat, non-hierarchical case).
• Combine scores from the top- and second-level models using different combination rules, some requiring a threshold to be exceeded at the top level before second-level comparisons are made.
Classifying web search results:
• Classification techniques are used to automatically organize search results into existing
hierarchical structures
• Classification models are learned offline using a training set of human-labeled documents and
web categories
• Classification offers two advantages compared to clustering –
• run time classification is very efficient
• manually generated category names are easily understood
Constraints:
• To support the goal of automatically classifying web search results, two constraints are imposed:
• Use just the short summaries returned by web search engines, since it takes too long to retrieve the full text of pages in a networked environment. These automatically generated summaries are much shorter than the texts used in most classification experiments, and they are much noisier than other document surrogates, such as abstracts, that others have worked with.
• Focus on the top levels of the hierarchy, since many search results can be usefully disambiguated at this level.
• Interfaces that tightly couple search results and category structure have been found to have large preference and performance advantages for automatically classified search results.
TEXT CLASSIFICATION USING SVMs
• Text classification involves a training phase and a testing phase.
• In the training phase, a large set of web pages with known category labels is used to train a classifier.
• An initial model is built using a subset of the labeled data, and a holdout set is used to identify optimal model parameters.
• During the testing or operational phase, the learned classifier is used to classify new web pages.
• The support vector machine (SVM) algorithm was used as the classifier.
• A linear SVM is a hyperplane that separates a set of positive examples from a set of negative examples with maximum margin.
• The margin is the distance from the hyperplane to the nearest of the positive and negative examples.
• In the linearly separable case, maximizing the margin can be expressed as an optimization problem (a standard form is sketched below).
• In cases where the points are not linearly separable, slack variables are introduced that permit, but penalize, points that fall on the wrong side of the decision boundary.
• For problems that are not linearly separable, kernel methods can be used to transform the input space so that some non-linear problems can be learned.
• The simplest, linear form of the SVM is used because it provides good classification accuracy and is fast to learn and apply.
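For reference, the standard soft-margin formulation of the linear SVM (added here for clarity; not reproduced from the slide) is:

  \min_{\mathbf{w}, b, \boldsymbol{\xi}} \ \tfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1 - \xi_i, \ \ \xi_i \ge 0

where C is the penalty parameter discussed under SVM Parameters below.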
Feature Selection in Web Content
• The feature space is reduced by eliminating words that appear in only a single document and then selecting the 1000 words with the highest mutual information with each category.
• The mutual information MI(F, C) between a feature F and a category C is defined as sketched below.
• The mutual information is computed between every pair of features and categories.
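A sketch of the definition, in the standard form for binary feature/category indicator variables (assumed here; the slide's own equation is not preserved):

  MI(F, C) = \sum_{f \in \{0,1\}} \sum_{c \in \{0,1\}} P(f, c) \log \frac{P(f, c)}{P(f)\, P(c)}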
SVM Parameters:
• In addition to varying the number of features, SVM performance is governed by two parameters:
• C (the penalty imposed on training examples that fall on the wrong side of the decision boundary)
• p (the decision threshold)
• The default C parameter value (0.01) is used.
• The decision threshold, p, can be set to control precision and recall for different tasks.
• Increasing p results in fewer test items meeting the criterion, which usually increases precision but decreases recall.
• Conversely, decreasing p typically decreases precision but increases recall.
• p is chosen so as to optimize performance on the F1 measure on a training validation set.
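For reference, the F1 measure mentioned above is the standard harmonic mean of precision P and recall R (definition added for clarity):

  F_1 = \frac{2 P R}{P + R}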
Results:
• Decision thresholds were established on a training validation set
• For each category, if a test item exceeds the decision threshold, it is judged to be in that category.
• A test item can be in zero, one, or more than one category.
• From this, precision (P) and recall (R) are computed.
• These are micro-averaged to weight the contribution of each category by the number of test examples in it (see the sketch below).
• For each test example, the probability of it being in each of the 13 top-level categories and each of the 150 second-level categories is computed.
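A sketch of micro-averaged precision and recall (standard definitions, added for clarity; TPc, FPc, and FNc denote true positives, false positives, and false negatives for category c):

  P_{micro} = \frac{\sum_c TP_c}{\sum_c (TP_c + FP_c)}, \qquad R_{micro} = \frac{\sum_c TP_c}{\sum_c (TP_c + FN_c)}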
References
• Improving Text Classification by Shrinkage in a Hierarchy of Classes, by Andrew McCallum, Ronald Rosenfeld, Tom Mitchell, and Andrew Y. Ng
http://www.cs.cmu.edu/~mccallum/papers/hier-icml98.ps.gz
• Hierarchically Classifying Documents Using Very Few Words, by Daphne Koller and Mehran Sahami
http://dbpubs.stanford.edu:8090/pub/showDoc.Fulltext?lang=en&doc=199775&format=pdf&compression=&name=1997-75.pdf
• Hierarchical Classification of Web Content, by Susan Dumais and Hao Chen
http://research.microsoft.com/~sdumais/sigir00.pdf
• Enhanced Word Clustering for Hierarchical Text Classification, by Inderjit S. Dhillon, Subramanyam Mallela, and Rahul Kumar
http://www.cs.utexas.edu/users/inderjit/public_papers/hierdist.pdf
• Text Classification Combining Clustering and Hierarchical Approaches, by S. Ranganathan
www.ittc.ku.edu/research/thesis/.../shankar_ranganathan_thesis.pdf
Question
Thank You