
Three Approaches to GO-Tagging Biomedical Abstracts
Neil Davis
Henk Harkema
Rob Gaizauskas
Yikun Guo
Moustafa Ghanem
Tom Barnwell
Yike Guo
Jon Ratcliffe
InforSense
Imperial College London
University of Sheffield
Symposium on Semantic Mining in Biomedicine 2006
12/4/6
Introduction
• On-going explosive growth of biomedical literature
• Text Mining techniques can help through:
• Extractive processes: extracting terms or facts
from papers for searching and linking
• Structuring processes: grouping papers based on content
for conceptual navigation of large document collections
• GO-tag project:
• Annotating biomedical papers with terms from
the Gene Ontology
SMBM 2006
Gene Ontology
• Provides common descriptive framework for
genes and gene products across species
• Consists of three structured, controlled
vocabularies (ontologies) that describe genes
and gene products in terms of:
• Biological processes
• Cellular components
• Molecular functions
Gene Ontology
• Contains almost 20,000 terms
• GO Slim (87 terms): subset of all GO terms
• Aims to give broad overview of ontology content
• Can be species-specific
• Typical GO term
Term name: isotropic cell growth
Accession: GO:0051210
Ontology: biological_process
Synonyms: related: uniform cell growth
Definition: “The process by which a cell irreversibly increases in size uniformly in all directions. In general, a rounded cell morphology reflects isotropic cell growth.”
Common Use of GO
• Associations of genes and gene products with GO terms in
model organism and protein databases
• FlyBase, SGD, MGD
• For example (from SGD), gene ACT1:
GO Annotation: Structural constituent of cytoskeleton; Exocytosis; Histone acetyltransferase complex
References (Evidence Code):
Botstein D, et al. (1997) The yeast cytoskeleton (Traceable Author Statement)
Pruyne D and Bretscher (2000) Polarization of in yeast (Traceable Author Statement)
Botstein D, et al. (1997) The yeast cytoskeleton (Traceable Author Statement)
Galarneau L, et al. (2000) Multiple links between the NuA4 … (Inferred from Direct Assay)
GO-Tagging
• Task: given a text (PubMed abstract) and GO/GO Slim,
assign 0 or more GO terms to the text if the text is “about”
the process/component/function identified by the GO term
• Only most specific terms are assigned
• No association of GO term with specific genes or gene products
• User scenarios:
• Research scientists: clustering of PubMed search results
• Database curators: identifying texts that may support Gene-GO
term associations
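The rule that only the most specific terms are assigned can be sketched as a filter over the ontology's parent links. This is a minimal illustration, not the project's code; the `PARENTS` table and the function names are hypothetical, and the hierarchy fragment is made up for the example.

```python
# Hypothetical child -> parents table; illustrative only, not taken from GO.
PARENTS = {
    "isotropic cell growth": {"cell growth"},
    "cell growth": {"growth"},
    "growth": set(),
    "exocytosis": {"secretion"},
    "secretion": set(),
}

def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen

def most_specific(candidates, parents):
    """Drop any candidate that is an ancestor of another candidate."""
    candidates = set(candidates)
    covered = set()
    for term in candidates:
        covered |= ancestors(term, parents)
    return candidates - covered

result = most_specific(
    {"growth", "cell growth", "isotropic cell growth", "secretion"}, PARENTS
)
print(sorted(result))
```

Here "growth" and "cell growth" are dropped because a more specific descendant is also a candidate, while "secretion" survives because none of its descendants was proposed.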
Outline of Rest of Talk
• Data sets / Gold standards
• SGD Gold Standard
• IC Gold Standard
• Three approaches to GO-tagging
• Lexical look-up
• Information retrieval approach
• Machine learning
• Evaluation results
• Conclusions
SGD Gold Standard
• Derive Gold Standard from SGD model organism
database (yeast)
• Given the annotated genes in SGD, assign a GO term T
to a paper P if the paper P is referenced in support of a
Gene-GO term association involving T
• SGD Gold Standard
• 4922 PMIDs
• 2455 GO terms
• 10485 PMID-GO term pairs
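The derivation rule above (assign term T to paper P if P is cited in support of a Gene-GO association involving T) amounts to regrouping association records by paper. A minimal sketch follows; the record layout, the placeholder identifiers, and the function name are assumptions for illustration, not the SGD file format.

```python
from collections import defaultdict

# Hypothetical association records: (gene, GO term, PMID of supporting paper).
ASSOCIATIONS = [
    ("GENE1", "TERM_A", "1001"),
    ("GENE1", "TERM_B", "1002"),
    ("GENE2", "TERM_A", "1001"),  # same paper and term, different gene
    ("GENE2", "TERM_C", "1001"),
]

def derive_gold_standard(associations):
    """Group GO terms by the paper cited in support of each association."""
    gold = defaultdict(set)
    for _gene, go_term, pmid in associations:
        gold[pmid].add(go_term)  # the gene itself is discarded
    return dict(gold)

gold = derive_gold_standard(ASSOCIATIONS)
n_pairs = sum(len(terms) for terms in gold.values())
print(len(gold), "PMIDs,", n_pairs, "PMID-GO term pairs")
```

With the counts reported above, 10485 pairs over 4922 PMIDs works out to roughly 2.1 GO terms per paper.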
SGD Gold Standard
• Advantages
• SGD data already exists – no further annotation work required
• More Gold Standard data can be obtained from other model organism databases
• Disadvantage
• List of Gene-GO term assignments in SGD is incomplete for our task
• Each paper is associated with GO terms whose assignment to
specific genes it supports, but the paper may be missing other
GO terms which can also be legitimately attached to it
• List does not contain all papers supporting a given assignment
• Consequence
• SGD Gold Standard is “GO-term incomplete”
• Weak measure of Recall
• Precision figures difficult to interpret
SGD Gold Standard
• Further issue:
• SGD Gene-GO term assignments are based on full
papers, whereas system only has access to abstracts
• Consequence:
• Limit on maximum Recall obtainable by system
IC Gold Standard
• Manually extend SGD Gold Standard to obtain
GO-term complete annotation
• Select SGD papers for which all GO term
assignments are supported by abstract or title
• Semi-automatically add further GO terms by
fuzzy term matching + post-editing
• IC Gold Standard
• 785 PMIDs
• 1006 GO terms
• 5170 PMID-GO term pairs
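The slides do not say which fuzzy matching method was used to propose additional GO terms. As one possible reading, the sketch below slides a window over the abstract and compares it to each term with the standard library's `difflib.SequenceMatcher`; the threshold, the function name, and the sample data are all assumptions. The output is only a candidate list, which a human annotator would still post-edit.

```python
from difflib import SequenceMatcher

def fuzzy_candidates(abstract, go_terms, threshold=0.85):
    """Propose GO terms whose name approximately matches a span of the abstract."""
    tokens = abstract.lower().split()
    candidates = set()
    for term in go_terms:
        term_lc = term.lower()
        width = len(term_lc.split())
        # Compare the term against every window of the same word count.
        for start in range(len(tokens) - width + 1):
            window = " ".join(tokens[start:start + width])
            if SequenceMatcher(None, window, term_lc).ratio() >= threshold:
                candidates.add(term)
                break
    return candidates

# Hypothetical abstract and term list, for illustration only.
abstract = "Actin cables direct polarized exocytosis and cell growths in budding yeast"
terms = ["exocytosis", "cell growth", "histone acetylation"]
suggested = fuzzy_candidates(abstract, terms)
print(sorted(suggested))
```

The near-match "cell growths" is accepted for "cell growth", which is the kind of variation an exact look-up would miss.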
IC Gold Standard
• Advantage
• Closer to GO-term complete Gold Standard
• Disadvantages
• Still not GO-term complete
• Direct mentions of GO terms vs. semantically
inferred GO terms
• Gold Standard creation method favors lexical
look-up approach to GO-tagging
• Data set is small
Lexical Look-Up
• (Task: given a text (PubMed abstract) and GO/GO
Slim, assign 0 or more GO terms to the text if the
text is “about” the process/component/function
identified by the GO term)
• GO term T is assigned to a paper if term T
occurs in the abstract of the paper
• Simple & fast baseline
• GO terms recognized in text can be used as
features in Machine Learning approach
Lexical Look-Up
• Web service calls to Termino term tagger
• Term classes in Termino
• GO terms
• GO term synonyms
• SGD yeast gene names
• Lexical look-up method
• Case-insensitive
• Simple morphological analysis
• Cells mapped onto cell
• Mitochondrial, mitochondria not mapped onto mitochondrion
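The look-up behaviour described above can be approximated without the Termino service. The following is a stand-in sketch under stated assumptions: the plural-stripping rule and all function names are hypothetical and are not Termino's actual morphology. It reproduces the two examples from the slide: "Cells" matches "cell", while "mitochondria" does not match "mitochondrion".

```python
import re

def normalize(token):
    """Lower-case and strip a plain plural 's' (a deliberately crude rule).

    It over-stems words such as 'exocytosis'; that is harmless here because
    GO terms and abstract text are normalised in exactly the same way.
    """
    token = token.lower()
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def normalized_tokens(text):
    return [normalize(tok) for tok in re.findall(r"[A-Za-z0-9]+", text)]

def contains_term(abstract_tokens, term_tokens):
    """True if the term's tokens occur as a contiguous run in the abstract."""
    n = len(term_tokens)
    return any(abstract_tokens[i:i + n] == term_tokens
               for i in range(len(abstract_tokens) - n + 1))

def lexical_lookup(abstract, go_terms):
    tokens = normalized_tokens(abstract)
    return {term for term in go_terms
            if contains_term(tokens, normalized_tokens(term))}

found = lexical_lookup(
    "Yeast Cells show isotropic cell growth; mitochondria were unaffected.",
    ["cell", "isotropic cell growth", "mitochondrion"],
)
print(sorted(found))
```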
Lexical Look-Up Results
• Recall
• Full text (SGD) vs. abstracts only (IC)
• Inherent drawbacks of lexical look-up: term variation, literal mentions
• Effects of Gold Standard creation method (IC)
• Precision
• Effects of Gold Standard creation method (IC)
• GO vs. GO Slim
• Recognizing GO Slim terms is easier than recognizing GO terms
Lexical Look-Up
• Extensions
• GO term T is assigned to a paper if synonym of
term T occurs in the abstract of the paper
• GO term T is assigned to a paper if yeast gene name
associated with term T occurs in the abstract of the paper
• Effects on performance
• Adding synonyms: slight decrease in Precision, substantial
increase in Recall
• Adding yeast terms: substantial decrease in Precision, substantial
increase in Recall
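The two extensions can be pictured as widening the lexicon: each surface string (term name, synonym, or gene name) points at the GO terms it triggers. This is a rough sketch with hypothetical function names; the synonym and the ACT1 associations are the ones shown earlier in the talk, and matching is plain case-insensitive whole-word search.

```python
import re

def build_lexicon(term_names, synonyms, gene_to_terms):
    """Map each lower-cased surface string to the GO terms it triggers."""
    lexicon = {}
    for term in term_names:
        lexicon.setdefault(term.lower(), set()).add(term)
    for term, syns in synonyms.items():
        for syn in syns:
            lexicon.setdefault(syn.lower(), set()).add(term)
    for gene, terms in gene_to_terms.items():
        lexicon.setdefault(gene.lower(), set()).update(terms)
    return lexicon

def tag(abstract, lexicon):
    text = abstract.lower()
    assigned = set()
    for surface, terms in lexicon.items():
        if re.search(r"\b" + re.escape(surface) + r"\b", text):
            assigned |= terms
    return assigned

TERMS = ["isotropic cell growth", "exocytosis",
         "structural constituent of cytoskeleton"]
SYNONYMS = {"isotropic cell growth": ["uniform cell growth"]}
GENES = {"ACT1": ["structural constituent of cytoskeleton", "exocytosis"]}

abstract = "ACT1 mutants lose uniform cell growth."  # made-up sentence
baseline = tag(abstract, build_lexicon(TERMS, {}, {}))
extended = tag(abstract, build_lexicon(TERMS, SYNONYMS, GENES))
print(sorted(baseline), sorted(extended))
```

The gene name pulls in every term associated with ACT1, whether or not the abstract is about that term, which is consistent with the precision drop reported for the gene-name extension.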
IR-Based Approach
• Document collection
• For each GO term, create a document consisting
of the GO term, its synonyms, and its definition
• Query
• For each paper, create a query consisting
of the words in the abstract of the paper
• Given a query (i.e., abstract), retrieve relevant
documents (i.e., GO terms) from the document
collection
• Assign top-ranked 5, 10, … GO terms to abstract
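The inversion of roles (abstract as query, GO terms as documents) can be sketched with a small TF-IDF and cosine ranking in plain Python. This is an illustrative stand-in, not the system's retrieval engine: the weighting formula and function names are assumptions, stop-word removal and stemming are omitted, and the two shorter GO "documents" use made-up definitions.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def rank_go_terms(abstract, go_documents, top_k=5):
    """Rank GO-term documents by cosine similarity to the abstract."""
    doc_tokens = {term: tokenize(text) for term, text in go_documents.items()}
    n_docs = len(doc_tokens)
    df = Counter(tok for toks in doc_tokens.values() for tok in set(toks))
    idf = {tok: math.log(1 + n_docs / count) for tok, count in df.items()}

    def vector(tokens):
        counts = Counter(tok for tok in tokens if tok in idf)
        return {tok: freq * idf[tok] for tok, freq in counts.items()}

    def cosine(a, b):
        dot = sum(weight * b.get(tok, 0.0) for tok, weight in a.items())
        norm = (math.sqrt(sum(w * w for w in a.values()))
                * math.sqrt(sum(w * w for w in b.values())))
        return dot / norm if norm else 0.0

    query = vector(tokenize(abstract))
    scores = {term: cosine(query, vector(toks))
              for term, toks in doc_tokens.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

GO_DOCUMENTS = {
    "isotropic cell growth": "isotropic cell growth uniform cell growth The "
        "process by which a cell irreversibly increases in size uniformly "
        "in all directions",
    # The two definitions below are invented for the example.
    "exocytosis": "exocytosis release of vesicle contents from the cell",
    "histone acetyltransferase complex": "histone acetyltransferase complex "
        "a protein complex that acetylates histones",
}

ranked = rank_go_terms(
    "The cell increases in size uniformly during growth", GO_DOCUMENTS, top_k=2
)
print(ranked)
```

The `top_k` cut-off corresponds to assigning the top-ranked 5, 10, … GO terms to each abstract.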
IR-Based Approach
• Index documents using Lucene search engine
• Standard IR preprocessing: tokenization, stop word
removal, case normalization, stemming
• Similarity measure: vector space model
• Two kinds of document
• Flat document = GO term + synonyms + definition
• Hierarchical document = GO term + synonyms +
definition + terms, synonyms, and definitions of parent
GO nodes
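The difference between the two kinds of document can be made concrete with a small builder. This sketch reads "parent GO nodes" as direct parents only; whether the original system walked further up the hierarchy is not stated. The entry layout, function names, and the parent's definition are assumptions for illustration.

```python
def flat_document(term, entries):
    """GO term + synonyms + definition, joined into one text."""
    entry = entries[term]
    return " ".join([term, *entry["synonyms"], entry["definition"]])

def hierarchical_document(term, entries, parents):
    """Flat document for the term plus the flat documents of its direct parents."""
    parts = [flat_document(term, entries)]
    for parent in sorted(parents.get(term, ())):
        parts.append(flat_document(parent, entries))
    return " ".join(parts)

ENTRIES = {
    "isotropic cell growth": {
        "synonyms": ["uniform cell growth"],
        "definition": "The process by which a cell irreversibly increases "
                      "in size uniformly in all directions.",
    },
    # Hypothetical parent entry with an invented definition.
    "cell growth": {
        "synonyms": [],
        "definition": "The process by which a cell increases in size.",
    },
}
PARENTS = {"isotropic cell growth": ["cell growth"]}

flat = flat_document("isotropic cell growth", ENTRIES)
hier = hierarchical_document("isotropic cell growth", ENTRIES, PARENTS)
print(len(flat.split()), len(hier.split()))
```

The hierarchical text repeats general words such as "cell" more often than the flat one does, which illustrates how parent material can dilute the more specific vocabulary.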
IR-Based Results
• Better performance on IC abstracts than on SGD abstracts
• Hierarchical documents do slightly worse than flat documents
• Discriminatory effect of specific GO terms may be reduced
by occurrence of general terms such as cell and protein
Machine Learning
• Variety of text classification algorithms: Naïve Bayes,
Decision Tree, SVM classifier, …
• Naïve Bayes predicts only one GO term per abstract
• SGD GS: 2.1 GO terms/abstract; IC GS: 6.6 GO terms/abstract
• Features: words, frequent phrases
• Preprocessing steps: tokenization, removal of
stop words, stemming
• Training on 66% of annotated data, evaluation on
remainder of data
• GO term assignments vis-à-vis generic GO Slim to
mitigate data sparsity problems
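The experimental setup can be sketched as a hold-out split plus a per-term binary view of the labels. The 66% training fraction comes from the slide; everything else is an assumption. In particular, training one binary classifier per GO term is a common workaround for single-label learners and is not stated to be what the authors did. The gold standard fragment is made up.

```python
import random

def holdout_split(pmids, train_fraction=0.66, seed=0):
    """Shuffle deterministically, then cut into training and evaluation sets."""
    shuffled = sorted(pmids)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def binary_targets(gold, pmids, go_term):
    """One-vs-rest view: 1 if the paper carries `go_term`, else 0."""
    return [1 if go_term in gold.get(pmid, ()) else 0 for pmid in pmids]

# Hypothetical gold standard: PMID -> set of GO (Slim) terms.
GOLD = {
    "1001": {"TERM_A", "TERM_B"},
    "1002": {"TERM_A"},
    "1003": {"TERM_C"},
    "1004": {"TERM_B", "TERM_C"},
    "1005": {"TERM_A", "TERM_C"},
    "1006": {"TERM_B"},
}

train, held_out = holdout_split(GOLD)
targets_a = binary_targets(GOLD, sorted(GOLD), "TERM_A")
print(len(train), len(held_out), targets_a)
```

Because each paper can be positive for several terms at once, this view keeps all gold-standard assignments, whereas a classifier that outputs a single label per abstract cannot reach the 2.1 (SGD) or 6.6 (IC) terms per abstract found in the gold standards.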
Machine Learning Results
• One GO term vs. multiple GO terms per abstract makes a difference
• Higher precision scores than lexical look-up (SGD): GO terms directly
mentioned in text are not assigned if they are not present in the training set
• Oracle Text Decision Tree (IC): classifier learns systematic, strong
correlation between words in text and words in GO terms
Comparison of Approaches
• Best F scores for GO Slim
• SGD Gold Standard
         R      P      F
LLU     51.0   29.9   37.7
IR      51.5   26.2   34.7
ML      36.8   51.6   43.0
• IC Gold Standard
         R      P      F
LLU     79.5   98.5   88.0
IR      59.5   37.6   46.1
ML      76.5   83.0   79.6
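The slides do not define F, but the reported figures agree with the balanced F-measure, the harmonic mean of Precision and Recall. The short check below recomputes F from the R and P values in the comparison; the function and variable names are illustrative.

```python
def f_score(recall, precision):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if recall + precision == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (gold standard, method) -> (R, P, F) as reported in the comparison.
REPORTED = {
    ("SGD", "LLU"): (51.0, 29.9, 37.7),
    ("SGD", "IR"): (51.5, 26.2, 34.7),
    ("SGD", "ML"): (36.8, 51.6, 43.0),
    ("IC", "LLU"): (79.5, 98.5, 88.0),
    ("IC", "IR"): (59.5, 37.6, 46.1),
    ("IC", "ML"): (76.5, 83.0, 79.6),
}

recomputed = {key: round(f_score(r, p), 1) for key, (r, p, _f) in REPORTED.items()}
for key, (_r, _p, reported_f) in REPORTED.items():
    print(key, "reported", reported_f, "recomputed", recomputed[key])
```

All six recomputed values match the reported F column to one decimal place.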
Conclusions
• GO-tagging is an interesting task
• NLP challenges
• Benefits of functional GO-tagger for
researchers and curators
• Creating valid Gold Standard
• Completeness of annotation
Conclusions
• Methods for GO-tagging
• Lexical look-up
• Fast, simple
• Term variation, relevant GO terms inferred from text
• Information retrieval approach
• Novel perspective
• Noise from general biomedical terms
• Machine Learning
• Able to capture generalizations
• Feature selection
Future Work
• Enhancements to each of the three simple approaches
• Combining three approaches into a hybrid system
• Improving resources and methodology for evaluating
the technology
• Building and evaluating end-user applications employing
this technology
• Look at other tasks:
• Extracting GO term-gene/gene product pairs
• Assigning evidence codes
Navigating GO-Tagged Document Collections
[Screenshot of the navigation interface; panel labels: Abstract Titles, GO Hierarchy, Abstract Bodies, GO Terms/Gene Names]