Hierarchical Text Categorization and Its Evaluation


Hierarchical Text Categorization and its Application to Bioinformatics
Stan Matwin and Svetlana Kiritchenko
joint work with Fazel Famili (NRC) and Richard Nock (Université Antilles-Guyane)
School of Information Technology and Engineering, University of Ottawa
Outline
• What is hierarchical text categorization (HTC)
• Functional gene annotation requires HTC
• Ensemble-based learning and AdaBoost
• Multi-class multi-label AdaBoost
• Generalized local hierarchical learning method
• New global hierarchical learning algorithm
• New hierarchical evaluation measure
• Application to Bioinformatics
Text categorization
• Given: d_j ∈ D – textual documents; C = {c1, …, c|C|} – predefined categories
• Task: assign each pair ⟨d_j, c_i⟩ ∈ D × C a value in {True, False}
[Figure: a flat set of categories c1–c7 (TC)]
Hierarchical text categorization
• Hierarchy of categories: ≤ ⊆ C × C – a reflexive, antisymmetric, transitive binary relation on C
[Figure: the categories c1–c7 arranged in a tree (HTC)]
Advantages of HTC
• Additional, potentially valuable information
  – relationships between categories
• Flexibility
  – high levels: general topics
  – low levels: more detail
Text classification and bioinformatics
• Clustering and classification of gene expression data
  – DNA chip time series – performance data
  – gene function, process, … – genetic knowledge (GO)
  – literature will connect the two – domain knowledge
• Validation of results from performance data
Example: Gene Ontology
From data to knowledge via literature
• Functional annotation of genes from biomedical literature
Other applications
• Web directories
• Digital libraries
• Patent databases
• Biological ontologies
• Email folders
Boosting
• not a learning technique on its own, but a method in which a family of "weakly" learning agents (simple learners) is used for learning
• based on the fact that multiple classifiers that disagree with one another can together be more accurate than their component classifiers
• if there are L classifiers, each with an error rate < 1/2, and the errors are independent, then the probability that the majority vote is wrong is the area under the binomial distribution for more than L/2 hypotheses (see the sketch below)
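The binomial argument on this slide can be checked directly; below is a minimal sketch, where the example numbers (11 classifiers, each with a 30% error rate) are illustrative and not taken from the talk.

```python
from math import comb

def majority_vote_error(L, eps):
    """Probability that a majority vote of L independent classifiers,
    each with error rate eps, is wrong, i.e. that more than L/2 of
    them err (the binomial tail mentioned on the slide)."""
    return sum(comb(L, k) * eps**k * (1 - eps)**(L - k)
               for k in range(L // 2 + 1, L + 1))

# 11 independent classifiers with 30% error each: the committee's error
# drops to roughly 0.078, far below any individual member's 0.30.
print(round(majority_vote_error(11, 0.30), 3))
```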
Why do we have committees (ensembles)?
Boosting – the very idea
• Train an ensemble of classifiers, sequentially
• Each next classifier focuses more on the training instances on which the previous one has made a mistake
• The "focusing" is done through the weighting of the training instances
• To classify a new instance, make the ensemble vote
Boosting – properties
• If each h_l is only better than chance, boosting can attain ANY accuracy!
• No need for new examples, additional knowledge, etc.
• Original AdaBoost works on single-labeled data
AdaBoost.MH [Schapire and Singer, 1999]
• Reduce multi-label data to (document, label) pairs: (d_i, C_i) → ((d_i, l), C_i[l]) for each l ∈ C
• Initialize the distribution P_1(i, l) = 1/(mk)
• For t = 1, …, T:
  – Train a weak learner using distribution P_t
  – Get a weak hypothesis h_t: D × C → ℝ
  – Update: P_{t+1}(i, l) = P_t(i, l) · exp(−α_t · C_i[l] · h_t(d_i, l)) / Z_t, where Z_t is a normalization factor
• The final hypothesis: f(d, l) = Σ_t α_t · h_t(d, l), predicting label l for d when f(d, l) > 0 (a schematic loop is sketched below)
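A schematic version of that loop, with the confidence α_t folded into a real-valued weak hypothesis as in the confidence-rated formulation; this is a sketch of the reduction described on the slide, not the authors' implementation, and `weak_learner` is an assumed callback (for instance the one-word stump of the next slide).

```python
import math

def adaboost_mh(docs, Y, labels, weak_learner, T=10):
    """Sketch of AdaBoost.MH over (document, label) pairs.
    Y[i][l] is +1/-1; weak_learner(docs, Y, labels, P) must return a
    real-valued hypothesis h(d, l) trained against the distribution P."""
    m = len(docs)
    P = {(i, l): 1.0 / (m * len(labels)) for i in range(m) for l in labels}
    hypotheses = []
    for _ in range(T):
        h = weak_learner(docs, Y, labels, P)
        # Re-weight: pairs on which h disagrees with Y gain weight.
        unnorm = {(i, l): p * math.exp(-Y[i][l] * h(docs[i], l))
                  for (i, l), p in P.items()}
        Z = sum(unnorm.values())
        P = {k: v / Z for k, v in unnorm.items()}
        hypotheses.append(h)
    # Final real-valued score; predict label l for d when it is positive.
    return lambda d, l: sum(h(d, l) for h in hypotheses)
```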
BoosTexter [Schapire and Singer, 2000]
• "Weak" learner: a decision stump on a single word w (see the sketch below)
  – if w occurs in the document:  q_1l = ½ · ln( Σ_{i: w ∈ d_i, C_i[l] = +1} P_t(i, l) / Σ_{i: w ∈ d_i, C_i[l] = −1} P_t(i, l) )
  – if w does not occur:  q_0l = ½ · ln( Σ_{i: w ∉ d_i, C_i[l] = +1} P_t(i, l) / Σ_{i: w ∉ d_i, C_i[l] = −1} P_t(i, l) )
Thresholds for AdaBoost
• AdaBoost often underestimates its confidences
• 3 approaches to selecting better thresholds (one option is sketched below):
  – a single threshold for all classes
  – individual thresholds for each class
  – separate thresholds for each subtree rooted in the children of the top node (for tree hierarchies only)
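As an illustration of the second option (individual thresholds per class), one could pick each class's threshold on a validation set, for example by maximizing F1; the selection criterion is an assumption here, since the slide does not specify it.

```python
def per_class_thresholds(scores, truth, classes):
    """Pick, for every class, the validation threshold with the best F1.
    scores[i][c] is the confidence for class c on validation document i,
    truth[i] is the set of true classes of that document."""
    thresholds = {}
    n = len(scores)
    for c in classes:
        best_t, best_f1 = 0.0, -1.0
        for t in sorted({scores[i][c] for i in range(n)}):
            tp = sum(1 for i in range(n) if scores[i][c] >= t and c in truth[i])
            fp = sum(1 for i in range(n) if scores[i][c] >= t and c not in truth[i])
            fn = sum(1 for i in range(n) if scores[i][c] < t and c in truth[i])
            f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
            if f1 > best_f1:
                best_t, best_f1 = t, f1
        thresholds[c] = best_t
    return thresholds
```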
Hierarchical consistency
• if ⟨d_j, c_i⟩ is assigned True, then ⟨d_j, Ancestor(c_i)⟩ must also be True (a consistency check is sketched below)
[Figure: two labelings of the tree c1–c7; the left one is consistent, the right one inconsistent]
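Checking and enforcing this property is straightforward when the hierarchy is given as a child-to-parent map; the sketch below treats the root as implicit (it is excluded, matching the "excluding root" convention of the measure later in the talk), and the tree shape in the example is one plausible reading of the figure.

```python
def ancestors(c, parent):
    """Proper ancestors of category c, excluding the root,
    for a tree given as a child -> parent map (root maps to None)."""
    out = set()
    while parent[c] is not None:
        c = parent[c]
        if parent[c] is not None:  # do not include the root itself
            out.add(c)
    return out

def is_consistent(labels, parent):
    """True iff every assigned category comes with all of its ancestors."""
    return all(ancestors(c, parent) <= set(labels) for c in labels)

def make_consistent(labels, parent):
    """Smallest consistent superset: add any missing ancestor labels."""
    closed = set(labels)
    for c in labels:
        closed |= ancestors(c, parent)
    return closed

# One plausible reading of the tree on the slides: c1 is the root.
parent = {"c1": None, "c2": "c1", "c3": "c1",
          "c4": "c2", "c5": "c2", "c6": "c2", "c7": "c3"}
print(is_consistent({"c4"}, parent))    # False: c4 without its ancestor c2
print(make_consistent({"c4"}, parent))  # {'c2', 'c4'}
```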
Hierarchical local approach
[Figure sequence: a document is classified top-down through the tree c1–c9, one local decision at a time, ending in a consistent classification]
Generalized hierarchical local approach
• stop classification at an intermediate level if none of the children categories seem relevant
• a category node can be assigned only after all its parent nodes have been assigned (see the sketch below)
[Figure: the tree c1–c9, with classification stopped at an internal node]
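A sketch of this generalized top-down scheme, where `children` and `local_classifiers` are assumed inputs: each node has a local binary classifier, a child is assigned only if its classifier fires, and a branch stops as soon as no child looks relevant.

```python
def classify_top_down(doc, children, local_classifiers, root="c1"):
    """Generalized local (top-down) classification sketch: descend from the
    root, assign a child only if its local classifier accepts the document,
    and stop in a branch when none of the children seems relevant.  A node
    can therefore only be assigned after all of its ancestors were."""
    assigned = set()
    frontier = [root]
    while frontier:
        node = frontier.pop()
        for child in children.get(node, []):
            if local_classifiers[child](doc):  # relevant: assign and go deeper
                assigned.add(child)
                frontier.append(child)
            # otherwise this branch ends here (intermediate-level assignment)
    return assigned
```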
New global hierarchical approach
• Make the dataset consistent with the class hierarchy
  – add ancestor category labels
• Apply a regular learning algorithm
  – AdaBoost
• Make the prediction results consistent with the class hierarchy
  – for an inconsistent labeling, make a consistent decision based on the confidences of all ancestor classes (see the sketch below)
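A rough sketch of the first and last steps, with the last step read as "keep a category only if it and all of its ancestors are confident enough"; that exact resolution rule is an assumption, since the slide only says the decision uses the ancestors' confidences.

```python
def expand_with_ancestors(labels, parent):
    """Step 1: make training labels consistent by adding ancestor labels
    (parent is a child -> parent map; the root, mapped to None, is skipped)."""
    out = set(labels)
    for c in labels:
        p = parent[c]
        while p is not None and parent[p] is not None:
            out.add(p)
            p = parent[p]
    return out

def consistent_prediction(conf, parent, threshold=0.0):
    """Step 3 (one possible reading): keep a category only when it and every
    ancestor score above the threshold, so the final labeling is consistent.
    conf[c] is the flat classifier's confidence for category c."""
    keep = set()
    for c in conf:
        chain = expand_with_ancestors({c}, parent)  # c plus its ancestors
        if all(conf.get(a, float("-inf")) > threshold for a in chain):
            keep |= chain
    return keep
```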
New global hierarchical approach
• Hierarchical (shared) attributes
[Figure: a small example hierarchy: sports (shared attributes: team, game, winner, etc.) with children hockey (NHL, Senators, goalkeeper, etc.) and football (Super Bowl, Patriots, touchdown, etc.)]
Evaluation in TC
[Figure: the tree c1–c7 with the correct category and the predicted (incorrect) category marked]
precision = correctly predicted / total predicted
recall = correctly predicted / total in category
F-measure = ((β² + 1) · P · R) / (β² · P + R),  β ≥ 0
Weaknesses of standard measures
[Figure: three classifiers H1, H2, H3 on the tree c1–c7; H1 predicts the parent of the correct category, H2 predicts a sibling, H3 predicts a node in a different subtree]
P(H1) = P(H2) = P(H3)
R(H1) = R(H2) = R(H3)
F(H1) = F(H2) = F(H3)
Ideally, M(H1) > M(H3) and M(H2) > M(H3)
Requirements for a hierarchical measure
1. to give credit to partially correct classification
[Figure: classifiers H1 and H2 making different errors on a tree c1–c11]
M(H1) > M(H2)
Requirements for a hierarchical measure
2. to punish distant errors more heavily:
  – to give higher evaluation for correctly classifying one level down compared to staying at the parent node
[Figure: classifiers H1 and H2 on a tree c1–c11]
M(H1) > M(H2)
Requirements for a hierarchical measure
2. to punish distant errors more heavily:
  – to give lower evaluation for incorrectly classifying one level down compared to staying at the parent node
[Figure: classifiers H1 and H2 on a tree c1–c11]
M(H1) > M(H2)
Requirements for a hierarchical measure
3. to punish errors at higher levels of a hierarchy more heavily
[Figure: classifiers H1 and H2 on a tree c1–c11]
M(H1) > M(H2)
Advantages of the new measure
• Simple, straightforward to calculate
• Based solely on a given hierarchy (no parameters to tune)
• Satisfies all three requirements
• Has much discriminating power
• Allows trading off classification precision against classification depth
Our new hierarchical measure
• Each correct and each predicted category is expanded with all of its ancestors (excluding the root); precision, recall, and the F-measure are then computed on these expanded sets
[Figure: the tree c1–c7 with the correct category and its ancestors (excluding the root) highlighted, plus an incorrect prediction]
precision = correctly predicted / total predicted
recall = correctly predicted / total in category
F-measure = ((β² + 1) · P · R) / (β² · P + R),  β ≥ 0
Our new hierarchical measure
[Figure: the classifiers H1, H2, H3 on the tree c1–c7; the correct category is c4] (the numbers below are reproduced in the sketch that follows)

H1:  correct: {c4} → {c2, c4};  predicted: {c2} → {c2}
     hP(H1) = |{c2}| / |{c2}| = 1        hR(H1) = |{c2}| / |{c2, c4}| = 1/2

H2:  correct: {c4} → {c2, c4};  predicted: {c5} → {c2, c5}
     hP(H2) = |{c2}| / |{c2, c5}| = 1/2  hR(H2) = |{c2}| / |{c2, c4}| = 1/2

H3:  correct: {c4} → {c2, c4};  predicted: {c7} → {c3, c7}
     hP(H3) = |{}| / |{c3, c7}| = 0      hR(H3) = |{}| / |{c2, c4}| = 0
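The sketch below reproduces these numbers; the parent map is an assumed reading of the example tree (the placement of c6 does not affect the results), and the root c1 is excluded from the expanded sets as on the previous slide.

```python
def with_ancestors(labels, parent):
    """Expand a label set with all ancestors, excluding the root."""
    out = set(labels)
    for c in labels:
        p = parent[c]
        while p is not None and parent[p] is not None:
            out.add(p)
            p = parent[p]
    return out

def h_precision_recall(predicted, correct, parent):
    """Hierarchical precision and recall over the expanded label sets."""
    P = with_ancestors(predicted, parent)
    C = with_ancestors(correct, parent)
    overlap = len(P & C)
    return (overlap / len(P) if P else 0.0,
            overlap / len(C) if C else 0.0)

parent = {"c1": None, "c2": "c1", "c3": "c1",
          "c4": "c2", "c5": "c2", "c6": "c2", "c7": "c3"}
print(h_precision_recall({"c2"}, {"c4"}, parent))  # H1: (1.0, 0.5)
print(h_precision_recall({"c5"}, {"c4"}, parent))  # H2: (0.5, 0.5)
print(h_precision_recall({"c7"}, {"c4"}, parent))  # H3: (0.0, 0.0)
```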
Measure consistency
• Definition [Huang & Ling, 2005]: for two measures f, g over a domain Ψ,
  R = {(a, b) | a, b ∈ Ψ, f(a) > f(b), g(a) > g(b)}
  S = {(a, b) | a, b ∈ Ψ, f(a) > f(b), g(a) < g(b)}
  f is statistically consistent with g if |R| > |S|
• Experiment:
  – 100 randomly chosen hierarchies
  – the new hierarchical F-measure and standard accuracy were consistent on 85% of random classifiers (|R| > 5|S|)
Measure discriminancy
• Definition [Huang & Ling, 2005]: for two measures f, g over a domain Ψ,
  P = {(a, b) | a, b ∈ Ψ, f(a) > f(b), g(a) = g(b)}
  Q = {(a, b) | a, b ∈ Ψ, f(a) = f(b), g(a) > g(b)}
  f is statistically more discriminating than g if |P| > |Q|
  (both pair counts are sketched in code below)
• Examples: [Figure: the classifiers H1, H2, H3 from the earlier slides]
  For one accuracy value – 3 different hierarchical values
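Both pair counts can be computed in one pass over all pairs of classifiers; a small sketch, where `f_vals` and `g_vals` are assumed dictionaries mapping each classifier to its score under the two measures.

```python
from itertools import combinations

def compare_measures(f_vals, g_vals):
    """Counts for the Huang & Ling definitions quoted above.
    Returns (|R|, |S|, |P|, |Q|): f is statistically consistent with g when
    |R| > |S|, and statistically more discriminating than g when |P| > |Q|."""
    R = S = P = Q = 0
    for a, b in combinations(f_vals, 2):
        df, dg = f_vals[a] - f_vals[b], g_vals[a] - g_vals[b]
        if df * dg > 0:    # both measures prefer the same classifier
            R += 1
        elif df * dg < 0:  # the two measures disagree
            S += 1
        elif df != 0:      # f separates a pair that g ties
            P += 1
        elif dg != 0:      # g separates a pair that f ties
            Q += 1
    return R, S, P, Q
```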
Results: Hierarchical vs. Flat
Synthetic data (hierarchical attributes)

levels  branching  Flat   H. AdaBoost
2       2          68.30  76.22
3       2          58.35  74.21
4       2          44.90  73.22
5       2          20.88  72.70
2       3          53.47  63.45
3       3          29.51  60.69
4       3           2.67  58.22
2       4          41.35  55.25
3       4           6.98  50.70
2       5          29.99  47.87
Results: Hierarchical vs. Flat
Synthetic data (no hierarchical attributes)

levels  branching  Flat   H. AdaBoost
2       2          61.69  65.95
3       2          42.47  51.53
4       2          24.49  40.18
5       2           8.45  32.61
2       3          41.53  48.02
3       3          14.50  29.97
4       3           0.79  21.91
2       4          26.72  35.01
3       4           2.46  19.70
2       5          17.14  27.12
Results: Hierarchical vs. Flat
Real data

dataset     Flat   H. AdaBoost
newsgroups  75.51  79.26
reuters     87.06  88.31
Results: Hierarchical vs. Local
Synthetic data (hierarchical attributes)

levels  branching  Local  H. AdaBoost
2       2          73.42  76.22
3       2          69.40  74.21
4       2          68.18  73.22
5       2          68.44  72.70
2       3          61.99  63.45
3       3          58.81  60.69
4       3          57.40  58.22
2       4          54.26  55.25
3       4          50.66  50.70
2       5          47.26  47.87
Results: Hierarchical vs. Local
Synthetic data (no hierarchical attributes)

levels  branching  Local  H. AdaBoost
2       2          59.83  65.95
3       2          44.00  51.53
4       2          33.44  40.18
5       2          26.03  32.61
2       3          43.87  48.02
3       3          26.33  29.97
4       3          17.97  21.91
2       4          32.51  35.01
3       4          17.96  19.70
2       5          26.04  27.12
Results: Hierarchical vs. Local
Real data

dataset     Local  H. AdaBoost
newsgroups  80.01  79.26
reuters     89.11  88.31
Application to Bioinformatics
• Functional annotation of genes from biomedical literature

Learning (from fully-annotated genes in the db)
1. Retrieve GO codes and IDs of Medline entries from the db records

Genomic database (SGD)

ID         Symbol     Name                   Medline reference  Evidence  GO ID
S0007287   15S_RRNA                          PMID:6261980       ISS       GO:0003735
S0007287   15S_RRNA                          PMID:6280192       IGI       GO:0006412
S0004660   AAC1       ADP/ATP translocator   PMID:2167309       TAS       GO:0005743
…          …          …                      …                  …         …
Learning (from fully-annotated genes in the db)
2. Retrieve the corresponding Medline abstracts

Medline

PMID:6261980
Nucleotide sequence of the gene for the mitochondrial 15S ribosomal RNA of yeast
Sor F, Fukuhara H.
We have determined the nucleotide sequence of a DNA segment carrying the entire 15S ribosomal RNA gene of yeast mitochondrial genome. …

PMID:6280192
Suppressor of yeast mitochondrial ochre mutations that maps in or near the 15S ribosomal RNA gene of mtDNA.
Fox TD, Staempfli S.
A polypeptide chain-terminating mutation in the yeast mitochondrial oxi 1 gene has been shown to be an ochre (TAA) mutation by DNA sequence analysis. …

PMID:2167309
Structure-function studies of adenine nucleotide transport in mitochondria. II. Biochemical analysis of distinct AAC1 and AAC2 proteins in yeast.
Gawaz M, Douglas MG, Klingenberg M.
AAC1 and AAC2 genes in yeast each encode functional ADP/ATP carrier (AAC) proteins of the mitochondrial inner membrane. …

…
Learning (from fully-annotated genes in the db)
3. Form the training set: words from Medline abstracts (features) and GO codes (categories)

Training set (abstract → GO ID)

Nucleotide sequence of the gene for the mitochondrial 15S ribosomal RNA of yeast / Sor F, Fukuhara H. / We have determined the nucleotide sequence of a DNA segment carrying the entire 15S ribosomal RNA gene of yeast mitochondrial genome. …  →  GO:0003735

Suppressor of yeast mitochondrial ochre mutations that maps in or near the 15S ribosomal RNA gene of mtDNA. / Fox TD, Staempfli S. / A polypeptide chain-terminating mutation in the yeast mitochondrial oxi 1 gene has been shown to be an ochre (TAA) mutation by DNA sequence analysis. …  →  GO:0006412

Structure-function studies of adenine nucleotide transport in mitochondria. II. Biochemical analysis of distinct AAC1 and AAC2 proteins in yeast. / Gawaz M, Douglas MG, Klingenberg M. / AAC1 and AAC2 genes in yeast each encode functional ADP/ATP carrier (AAC) proteins of the mitochondrial inner membrane. …  →  GO:0005743

…
Learning (from fully-annotated genes in the db)
4. Learn a classifier from the training set (a stand-in pipeline is sketched below)

[Diagram: training set → Learning Algorithm → Classifier: Abstracts → GO codes]
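A minimal sketch of this step with off-the-shelf tools (TF-IDF features and one-vs-rest logistic regression); the talk itself uses the hierarchical AdaBoost variants described earlier, so this only illustrates the shape of the pipeline, and the two tiny training examples re-use abstracts shown on the previous slides.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

abstracts = [
    "We have determined the nucleotide sequence of a DNA segment carrying "
    "the entire 15S ribosomal RNA gene of yeast mitochondrial genome.",
    "AAC1 and AAC2 genes in yeast each encode functional ADP/ATP carrier "
    "(AAC) proteins of the mitochondrial inner membrane.",
]
go_codes = [{"GO:0003735"}, {"GO:0005743"}]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(go_codes)  # multi-label indicator matrix

# Bag-of-words features feeding a flat multi-label classifier.
model = make_pipeline(TfidfVectorizer(),
                      OneVsRestClassifier(LogisticRegression(max_iter=1000)))
model.fit(abstracts, Y)

# Classify a new abstract and map indicator columns back to GO codes.
new_abstract = ["open reading frame YLL057c is predicted to encode a dioxygenase"]
print(mlb.inverse_transform(model.predict(new_abstract)))
```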
Classification (for genes with missing annotation)
1. Retrieve Medline abstracts mentioning the gene

Medline

Gene: YLL057C
Abstract: Cloning and characterization of a sulfonate/alpha-ketoglutarate dioxygenase from Saccharomyces cerevisiae. Hogan DA, Auchtung TA, Hausinger RP. The Saccharomyces cerevisiae open reading frame YLL057c is predicted to encode a gene product with 31.5% amino acid sequence identity to Escherichia coli taurine/alpha-ketoglutarate dioxygenase and 27% identity to Ralstonia eutropha TfdA, a herbicide-degrading enzyme. …
Classification (for genes with missing annotation)
2. Classify these abstracts into GO codes

Classifier: Abstracts → GO codes

Gene      GO code      GO function
YLL057C   GO:0006790   sulfur metabolism
Results

dataset          level  branching  Flat   Local  H. AdaBoost
biol. process    12     5.41       15.06  59.27  59.31
mol. function    10     10.29       8.78  43.36  38.17
cell. component   8     6.45       44.18  72.07  73.35
Conclusion
We have presented:
• the hierarchical categorization task (categories are partially ordered)
• a generalized hierarchical local approach
• a new hierarchical global approach (hierarchical AdaBoost)
• a new hierarchical evaluation measure
• an application to the gene annotation task
Future work
• try the global hierarchical approach with other learning algorithms
• extend the gene annotation training sets with similar documents from Medline
• perform a similar task for other organisms
• use gene annotations in gene classification and clustering
Gene expression analysis with functional annotations
[Figure: gene expression analysis annotated with the GO codes GO:0006790, GO:0006798, GO:0007315, GO:0007289, GO:0002132, GO:0002166]