Towards Improving
Classification of Real World
Biomedical Articles
Kostas Fragos
TEI of Athens
[email protected]
Christos Skourlas
TEI of Athens
[email protected]
Summary
• We propose a method to improve performance in
biomedical article classification.
• We use Naïve Bayes and Maximum Entropy classifiers to
classify real world biomedical articles derived from the
dataset used in the classification competition task BC2.5.
• To improve classification performance, we use two
merging operators, Max and Harmonic Mean to combine
results of the two classifiers.
• The results show that we can improve classification
performance of real world biomedical data
Introduction
• From the biomedical point of view there are many challenges in
classifying biomedical information [3].
• Even the most sophisticated of solutions often overfit to the
training data and do not perform as well on real-world data [4].
• In this paper we try to devise a method which makes real world
biomedical data classification more robust.
• First, we parse the documents, applying a keyword extraction
algorithm to extract keywords from the full text. Second,
we apply a chi-square feature selection strategy to identify the
most relevant ones. Finally, we apply Naïve Bayes and Maximum
Entropy classifiers to classify the documents, and then combine
their results using two merging operators to improve performance.
The Classification Method
• Naïve Bayes Classifiers:
• A text classifier can be defined as a function that maps a
document d of n words (features), d = (x1, x2, x3, ..., xn),
to a confidence that the document d belongs to a text category.
• The Naïve Bayes classifier [1] is often used to estimate the
probability of each category.
• Bayes' theorem can be used to estimate the
probabilities: Pr(c|d) = Pr(d|c) × Pr(c) / Pr(d) [6]
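As a sketch, the Naïve Bayes decision above, choosing the class that maximizes Pr(d|c) × Pr(c), can be written in a few lines of Python. The toy corpus, tokens, and classes below are illustrative, not the BC2.5 data:

```python
from collections import Counter
import math

# Toy training corpus: (tokenized document, class) pairs — purely illustrative.
train = [
    (["protein", "interaction", "binding"], "relevant"),
    (["protein", "complex", "interaction"], "relevant"),
    (["weather", "report", "city"], "irrelevant"),
    (["city", "traffic", "report"], "irrelevant"),
]

classes = {c for _, c in train}
prior = {c: sum(1 for _, y in train if y == c) / len(train) for c in classes}
vocab = {w for doc, _ in train for w in doc}
counts = {c: Counter(w for doc, y in train if y == c for w in doc) for c in classes}

def log_posterior(doc, c):
    """log Pr(c) + sum_i log Pr(x_i | c), with Laplace (add-one) smoothing."""
    total = sum(counts[c].values())
    lp = math.log(prior[c])
    for w in doc:
        lp += math.log((counts[c][w] + 1) / (total + len(vocab)))
    return lp

def classify(doc):
    # Pr(d) is constant across classes, so comparing log-posteriors suffices.
    return max(classes, key=lambda c: log_posterior(doc, c))

print(classify(["protein", "binding"]))  # -> relevant
```

Since Pr(d) in the denominator does not depend on the class, the classifier only needs to compare the numerators of Bayes' theorem.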
The Classification Method
• Maximum Entropy Classifiers:
• Entropy was introduced by Shannon (Shannon, 1948) in
communication theory. The entropy H measures the
average uncertainty of a single random variable X:
H(p) = H(X) = -Σx p(x) log2 p(x) [2]
• The maximum entropy model can be specially adjusted for
text classification.
• This can be done using the improved iterative scaling (IIS)
algorithm, a hill-climbing algorithm for estimating the parameters
of the maximum entropy model [6]
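A maximum entropy classifier assigns Pr(c|d) through a log-linear form, exp(Σ λi fi(d, c)) normalized over classes. A minimal sketch of that scoring step follows; the feature weights here are hypothetical and assumed pre-trained (the IIS parameter estimation itself is not shown):

```python
import math

def maxent_prob(doc_features, weights, classes):
    """p(c|d) = exp(sum_i lambda_i * f_i(d, c)) / Z(d) — the log-linear form
    of a maximum entropy model, with binary features. `weights` maps
    (feature, class) -> lambda; training those weights (e.g. via IIS)
    is not shown here."""
    scores = {c: sum(weights.get((f, c), 0.0) for f in doc_features)
              for c in classes}
    z = sum(math.exp(s) for s in scores.values())  # partition function Z(d)
    return {c: math.exp(s) / z for c, s in scores.items()}

# Hypothetical weights: the feature "interaction" votes for the relevant class.
weights = {("interaction", "relevant"): 1.2, ("interaction", "irrelevant"): -0.4}
probs = maxent_prob({"interaction", "protein"}, weights,
                    ["relevant", "irrelevant"])
```

The normalization by Z(d) guarantees the class probabilities sum to one for every document.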
Merging Classifiers
We use two operators to combine the results of the Naïve
Bayes Classifier (NBC) and the Maximum Entropy
Classifier (MEC) to improve the classification
performance: the Maximum and the Harmonic Mean of
the results of the two classifiers.
• MaxC(d) = Max {NBC(d), MEC(d)}
• HarmC(d) = 2.0 × NBC(d) × MEC(d) / (NBC(d) + MEC(d))
• The MaxC(d) operator chooses the maximum of the
results of the two classifiers. The HarmC(d) operator
computes the Harmonic Mean of the results of these two
classifiers.
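The two merging operators can be sketched directly from their definitions; the scores 0.6 and 0.8 below are illustrative classifier outputs, not measured values:

```python
def max_c(nbc, mec):
    """MaxC(d): take the larger of the two classifiers' scores."""
    return max(nbc, mec)

def harm_c(nbc, mec):
    """HarmC(d): harmonic mean of the two classifiers' scores."""
    return 2.0 * nbc * mec / (nbc + mec)

# e.g. NBC(d) = 0.6, MEC(d) = 0.8
print(max_c(0.6, 0.8))   # -> 0.8
print(harm_c(0.6, 0.8))  # ≈ 0.6857
```

Note that the harmonic mean is dominated by the smaller of the two scores, so HarmC only assigns a high confidence when both classifiers agree, while MaxC trusts whichever classifier is more confident.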
BioCreAtIvE challenge
• Description [2004-01-02]
• The BioCreAtIvE (Critical Assessment of Information
Extraction systems in Biology) challenge evaluation
consists of a community-wide effort for evaluating text
mining and information extraction systems applied to the
biological domain.
• http://www.biocreative.org/about/background/description/
BioCreative II.5 challenge
• Evaluation library [2009-12-17]
• This is the current version of the BioCreative evaluation
library, including a command line tool to use it; current,
official version: 3.2 (use the command line option --version to
see the version of the script you have installed: bc-evaluate
--version). If you have reason to believe that there is a bug
in the tool or the library, or you have any other questions
related to it, please contact the author, Florian Leitner.
• http://www.biocreative.org/resources/biocreativeii5/evaluation-library/
BioCreative II.5 challenge
• Task 2: Protein-Protein Interactions [2006-04-01]
• This task is organized as a collaboration between the
IntAct and MINT protein interaction databases and the
CNIO Structural Bioinformatics and Biocomputing group.
• Background
• Introduction
• Task description
• Data
• Resources
• http://www.biocreative.org/tasks/biocreative-ii/task-2protein-protein-interac/
Preparing the Data.
• For experimentation purposes we used the data used in the
article classification competition task BC2.5 [4].
• This classification task was based on a training data set
comprising 61 full-text articles relevant to protein-protein
interaction and 558 irrelevant ones.
• For training, we chose the first 60 relevant articles and
randomly sampled 60 irrelevant ones; for testing, we used the
BioCreative II.5 test data set consisting of 63 full-text
articles relevant to protein-protein interaction and 532
irrelevant ones.
Preparing the Data.
• Before using the data for training and testing, we pre-processed
all articles by filtering out stop words and Porter-stemming the
remaining words/keywords.
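The pre-processing step can be sketched as below. The stop word list is a tiny illustrative sample, and `crude_stem` is a toy suffix stripper standing in for the full Porter stemmer; a real pipeline would use a complete Porter implementation:

```python
STOP_WORDS = {"the", "of", "and", "in", "to", "a", "is"}  # tiny illustrative list

def crude_stem(word):
    """Toy suffix stripping standing in for the Porter stemmer;
    it only removes a few common suffixes from long-enough words."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, drop stop words, then stem the surviving tokens."""
    tokens = [t.lower() for t in text.split()]
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("binding of the interacting proteins"))
# -> ['bind', 'interact', 'protein']
```

Stemming conflates inflected forms ("interacting", "interaction") toward a shared stem, which shrinks the feature space before chi-square ranking is applied.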
• Finally, we ranked the keywords extracted from the BC2.5
training articles according to the chi-square scoring formula to
identify the top-ranked relevant keywords [6].
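The chi-square score of a keyword for a class can be computed from a 2×2 term/class contingency table. A minimal sketch follows; the counts in the example are illustrative, not taken from the BC2.5 corpus:

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square score of a term/class pair from a 2x2 contingency table:
    n11 = relevant docs containing the term, n10 = irrelevant docs with it,
    n01 = relevant docs without it,        n00 = irrelevant docs without it.
    Formula: N * (n11*n00 - n10*n01)^2 / ((n11+n10)(n01+n00)(n11+n01)(n10+n00))."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return num / den if den else 0.0

# Illustrative: a term appearing in 40 of 60 relevant and 5 of 60
# irrelevant articles scores much higher than an evenly spread term.
score = chi_square(40, 5, 20, 55)
```

Ranking all extracted keywords by this score and keeping the highest-scoring ones yields the feature sets used in the experiments.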
Experiments
• The experiments consist of the following phases:
• First, we collect five sets of top relevant keywords using
chi-square feature selection strategy. Second, we compare
the performance of the two classifiers, Naïve Bayes and
Maximum Entropy, for each set of word features. Third,
we use merging operators to combine the results of these
two classifiers to improve performance.
• In each experiment we calculate Precision, Recall, True
Negative Rate and Accuracy measures.
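The four evaluation measures follow directly from the confusion-matrix counts of a binary (relevant/irrelevant) classifier; a minimal sketch, with hypothetical counts for illustration:

```python
def metrics(tp, fp, fn, tn):
    """Precision, Recall, True Negative Rate and Accuracy from the
    confusion-matrix counts: tp = relevant articles classified relevant,
    fp = irrelevant classified relevant, fn = relevant classified
    irrelevant, tn = irrelevant classified irrelevant."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),              # a.k.a. true positive rate
        "tnr": tn / (tn + fp),                 # a.k.a. specificity
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Hypothetical counts on a test set of 63 relevant + 532 irrelevant articles:
m = metrics(tp=54, fp=236, fn=9, tn=296)
```

With a heavily imbalanced test set like BC2.5 (63 relevant vs. 532 irrelevant), Precision stays low even when Recall is high, which is why all four measures are reported together.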
Results
• The Maximum Entropy classifier shows its best Precision,
Recall and Accuracy (0.186, 0.857 and 0.589, respectively)
at 500 top-ranked keywords, while its best True Negative
Rate (0.565) is reached at 700 top-ranked keywords.
• We combine the results of the two classifiers using the
two merging operators mentioned above to improve the
performance, especially the Recall rate.
• The merging operators do improve performance: Precision
0.189, Recall 0.873, True Negative Rate 0.560 and
Accuracy 0.591.
Conclusion
• The results show that the Maximum Entropy classifier
achieves its best performance at 500 top-ranked relevant
keywords.
• Combining the results of the two classifiers, we can
improve classification performance on real world
biomedical data.
References
1. Fuhr, N., Hartmann, S., Lustig, G., Schwantner, M., and
Tzeras, K. 1991. AIR/X – a rule-based multi-stage indexing
system for large subject fields. RIAO'91, pp. 606-623.
2. Galathiya, A. S., Ganatra, A. P., and Bhensdadia, K. C.
2012. An Improved decision tree induction algorithm, with
feature selection, cross validation, model complexity &
reduced error pruning, IJSCIT, March 2012.
3. Feldman, R., Sanger, J. 2006. The Text Mining Handbook:
advanced approaches in analyzing unstructured data.
Cambridge University Press.
4. Krallinger, M., et al. 2009. The BioCreative II.5 challenge
overview. In: Proc. of the BioCreative II.5 Workshop 2009 on
Digital Annotations, pp. 7-9.
References
5. Fragos, K., Maistros, I. 2006. A Goodness of Fit Test
Approach in Information Retrieval. Information Retrieval,
Springer, Volume 9, Number 3, pp. 331-342.
6. McCallum, A. and Nigam, K. 1998. A comparison of event
models for Naive Bayes text classification. In AAAI/ICML-98
Workshop on Learning for Text Categorization.
7. Fragos, K., Maistros, I., Skourlas, C. 2005. A X2-Weighted
Maximum Entropy Model for Text Classification. 2nd
International Conference On N.L.U.C.S, Miami, Florida.
Questions…