Text Mining for Biological Databases

Text Mining for Biology
Lynette Hirschman
The MITRE Corporation
Bedford, MA, USA
RegCreative Jamboree
Nov 29-Dec 1, 2006
© 2006 The MITRE Corporation. ALL RIGHTS RESERVED.
Outline
Overview of text mining
- Retrieval and extraction
- Where are we?
How text mining can help
- Database consistency assessment
- Tools to aid curators
Conclusions
Text Mining Overview
Example resources: PIR, Genbank, MEDLINE
Granularity of text
- Collections: Gigabytes
- Documents: Megabytes
- Lists, Tables: Kilobytes
- Phrases: Bytes
Information Retrieval:
Retrieve & classify documents via key words
Question Answering:
question to answer
Information Extraction:
Identify, extract & normalize entities, relations
- Example phrase: "Protease-resistant prion protein interacts with..."
- Example extracted table:
Disease  Source  Country  City_name  Date         Cases  New_cases  Dead
Ebola    PROMED  Uganda   Gula       26-Oct-2000  182    17         64
Ebola    PROMED  Uganda   Gula       5-Nov-2000   280    14         89
Ebola    PROMED  Uganda   Gulu       13-Oct-2000  42     9          30
Ebola    PROMED  Uganda   Gulu       15-Oct-2000  51     7          31
Ebola    PROMED  Uganda   Gulu       16-Oct-2000  63     12         33
Ebola    PROMED  Uganda   Gulu       17-Oct-2000  73     2          35
Ebola    PROMED  Uganda   Gulu       18-Oct-2000  94     21         39
Ebola    PROMED  Uganda   Gulu       19-Oct-2000  111    17         41
The MOD Curation Pipeline and Text Mining
Source: MEDLINE
1. Select papers
- KDD 2002 Task 1; TREC Genomics 2004 Task 2
- BioCreAtIvE II: PPI article selection
2. List genes for curation
- BioCreAtIvE: Gene Normalization
Extract gene names & normalize: 20 participants
3. Curate genes from paper
- BioCreAtIvE II: Protein annotation
Find relations & supporting evidence in text: 28 participants
ORegAnno Curation Pipeline & Text Mining
Source: MEDLINE
1. Select papers
- Curation queue management
2. List TFBS for curation
- Gene & TF Normalization:
Extract gene, protein names & normalize to standard ID
3. Curate genes from paper
- Extract evidence passages and map to evidence types/sub-types
State of the Art: Document Retrieval
Input: query words
Output: ranked list of documents
Approach
- Speed, scalability, domain independence and robustness
are critical for access to large collections of documents
Techniques
- Shallow processing provides coarse-grained
result (entire documents or passages)
- Query is transformed to collection of words,
but grammatical relations between words lost
- Documents are indexed by word occurrences
- Search matches query bag-of-words against
indexed documents using Boolean combination of
terms, or vector of word occurrences or
language model
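The indexing and matching steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not any particular search engine: the toy documents, the function names (`tokenize`, `build_index`, `search`) and the choice of TF-IDF weights with document-length normalization are all hypothetical choices made for brevity.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split into word tokens; grammatical relations are lost."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(query, docs, index):
    """Rank documents by a TF-IDF dot product divided by document vector length."""
    n_docs = len(docs)
    idf = {t: math.log(n_docs / len(p)) + 1.0 for t, p in index.items()}

    # Squared length of each document's TF-IDF vector.
    norms = defaultdict(float)
    for term, postings in index.items():
        for doc_id, tf in postings.items():
            norms[doc_id] += (tf * idf[term]) ** 2

    # Accumulate scores only for documents sharing a term with the query.
    scores = defaultdict(float)
    for term, q_tf in Counter(tokenize(query)).items():
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += (q_tf * idf[term]) * (tf * idf[term])

    ranked = [(d, s / math.sqrt(norms[d])) for d, s in scores.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy collection (not from the talk).
DOCS = {
    "d1": "insulin receptor gene expression in tumor cells",
    "d2": "esterase 6 allozymes in Drosophila melanogaster",
    "d3": "the insulin receptor and tumorigenesis in mouse",
}
INDEX = build_index(DOCS)
RESULTS = search("insulin receptor tumorigenesis", DOCS, INDEX)
```

Here `RESULTS` is a ranked list of `(doc_id, score)` pairs. Documents that share no word with the query never enter the ranking, which is the coarse-grained, word-level behavior the slide describes.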
State of the Art: Extraction
For news, automated systems exist now that can:
- Identify entities (90-95% F-measure*)
- Extract relations among entities (70-80% F)
(information extraction)
- Answer simple factual questions using large
document collections at 75-85% accuracy
(question answering)
How good is text mining applied to biology?
- Is biology easier, because it has structured
resources (ontology, synonym lists)?
- Is it harder because of specialized biological
language, complex biological reasoning?
* F-measure is the harmonic mean of precision and recall: F = 2*P*R/(P+R)
Precision = TP/(TP+FP); Recall = TP/(TP+FN)
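The definitions above translate directly into code. The sketch below is illustrative only: the function name and the example counts are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention rather than something the slide specifies.

```python
def precision_recall_f(tp, fp, fn):
    """Compute precision, recall and F-measure from raw counts.

    tp: true positives, fp: false positives, fn: false negatives.
    Assumed convention: a ratio with a zero denominator is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical counts: balanced system, then a high-precision / low-recall one.
BALANCED = precision_recall_f(tp=80, fp=20, fn=20)
SKEWED = precision_recall_f(tp=90, fp=10, fn=60)
```

With the second set of counts, precision is 0.9 and recall is 0.6, and the F-measure works out to 0.72, below the arithmetic mean of 0.75. The harmonic mean penalizes a system that trades away one measure for the other.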
Assessments: Document Classification
TREC Genomics track focused on retrieval
- Part of the Text REtrieval Conference (TREC), run by the
National Institute of Standards and Technology (NIST)
- Tasks have included retrieval of
Documents to identify gene function
Documents for MGI curation pipeline
Documents, passages to answer queries, e.g.,
“what effect does the insulin receptor gene
have on tumorigenesis?”
- 40+ groups participating starting 2004
KDD Challenge Cup task 2002
- Yeh et al, MITRE; Gelbart, Mathew et al,
FlyBase task
KDD Challenge Cup
Task: automate part of FlyBase curation:
- Determine which papers need to be
curated for Drosophila gene expression
information
- Curate only those papers containing
experimental results on gene products
(RNA transcripts and proteins)
Teamed with FlyBase, who provided
- Data annotation plus biological expertise
- Input on the task formulation
Venue: ACM conference on Knowledge
Discovery and Data Mining (KDD)
- Alex Yeh (MITRE) ran Challenge Cup task
FlyBase: Evidence for Gene Products
Results
18 teams submitted results (32 entries)
Winner: a team from ClearForest and Celera
- Used manually generated rules and patterns to
perform information extraction
Subtask results              Best   Median
Ranked-list for curation:    84%    69%
Yes/No curate paper:         78%    58%
Yes/No gene products:        67%    35%
Conclusion: ranking papers for curation promising;
open question: would this help curators?
BioCreAtIvE I: Workshop March 2004
- Tasks (Participation)
Gene Mention (15)
Gene Normalization: Fly, Mouse, Yeast (8)
Functional Annotation (8)
BioCreAtIvE II: Workshop April 2007
- Tasks (Participation)
Gene Mention (21)
Gene Normalization: Human (20)
Protein-Protein Interaction (28)
Gene Normalization
List unique gene IDs for Fly, Mouse, Yeast abstracts
Abstract ID           Organism Gene ID
fly_00035_training    FBgn0000592
fly_00035_training    FBgn0026412
A locus has been found, an allele of which causes a modification of
some allozymes of the enzyme esterase 6 in Drosophila melanogaster.
There are two alleles of this locus, one of which is dominant to the other
and results in increased electrophoretic mobility of affected allozymes.
The locus responsible has been mapped to 3-56.7 on the standard
genetic map (Est-6 is at 3-36.8). Of 13 other enzyme systems analyzed,
only leucine aminopeptidase is affected by the modifier locus.
Neuraminidase incubations of homogenates altered the electrophoretic
mobility of esterase 6 allozymes, but the mobility differences found are
not large enough to conclude that esterase 6 is sialylated.
Sample Gene ID and synonyms:
FBgn0000592: Est-6, Esterase 6, CG6917, Est-D, EST6, est-6, Est6, Est,
EST-6, Esterase-6, est6, Est-5, Carboxyl ester hydrolase
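A simple baseline for this task is dictionary lookup: match each known synonym against the abstract and report the unique gene IDs found. The sketch below uses the synonym list from the slide and part of the sample abstract. The function names, the case-insensitive matching and the boundary rule are assumptions for illustration, and the sketch does not describe how any BioCreAtIvE participant actually worked.

```python
import re

# Synonym list for one FlyBase gene, copied from the slide above.
SYNONYMS = {
    "FBgn0000592": [
        "Est-6", "Esterase 6", "CG6917", "Est-D", "EST6", "est-6", "Est6",
        "Est", "EST-6", "Esterase-6", "est6", "Est-5",
        "Carboxyl ester hydrolase",
    ],
}


def build_lexicon(synonyms):
    """Map each lowercased synonym to the set of gene IDs that use it."""
    lexicon = {}
    for gene_id, names in synonyms.items():
        for name in names:
            lexicon.setdefault(name.lower(), set()).add(gene_id)
    return lexicon


def normalize_genes(text, lexicon):
    """Return the sorted unique gene IDs whose synonyms occur in the text."""
    found = set()
    for name, gene_ids in lexicon.items():
        # Require non-alphanumeric context so "est" cannot match inside "esterase".
        pattern = r"(?<![A-Za-z0-9])" + re.escape(name) + r"(?![A-Za-z0-9])"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.update(gene_ids)
    return sorted(found)


ABSTRACT = (
    "A locus has been found, an allele of which causes a modification of "
    "some allozymes of the enzyme esterase 6 in Drosophila melanogaster. "
    "The locus responsible has been mapped to 3-56.7 on the standard "
    "genetic map (Est-6 is at 3-36.8)."
)
LEXICON = build_lexicon(SYNONYMS)
GENE_IDS = normalize_genes(ABSTRACT, LEXICON)
```

The slide lists a second ID for this abstract, FBgn0026412, whose synonyms are not shown, so this sketch cannot recover it. A synonym shared by several genes would map to several IDs in `LEXICON`; resolving that ambiguity is the hard part of normalization and is left out here.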
BioCreAtIvE I Results: Gene Normalization
[Figure: precision (y-axis) vs. recall (x-axis) for FLY, MOUSE and YEAST results, with 0.8 and 0.9 F-measure curves]
Yeast results good: high 0.93 F
- Smallest vocabulary
- Short names
- Little ambiguity
Fly: 0.82 F
- High ambiguity
Mouse: 0.79 F
- Large vocabulary
- Long names
Human: ~80% (BioCreAtIvE II)
Impact of BioCreAtIvE I
BioCreAtIvE showed state of the art:
- Gene name mentions: F = 0.83
- Normalized gene IDs: F = 0.8 - 0.9
- Functional annotation: F ~ 0.3
BioCreAtIvE II
- Participation 2-3x higher!
- Results and workshop April 23-25, Madrid
What next?
- New model of curator/text mining cooperation
Have biological curators contribute data
(training and test sets)
Text mining developers work on real
biological problems
- RegCreative is an instance of this model
How Text Mining Can Help
Quality & Consistency
- Assess consistency of annotation
- First step is to determine consistency of human
performance on classification or annotation tasks
- Use agreement studies to improve annotation
guidelines and resources (training materials,
annotated data)
Coverage
- Text mining can speed up curation to achieve
better coverage
Currency
- Faster curation improves currency of annotations
Inter-Annotator Agreement
Thesis: if people cannot do a task consistently, it
will be hard to automate the task
- Also, data will be less valuable
Method
- Two humans perform same classification task on
a “blind” data set, using classification guidelines
(after some designated training)
- Results are compared via a scoring metric
Outcome: Determine whether guidelines are
sufficient to ensure consistent classification
Study can be informal
- Used to flag places that need improvement
- Or more formal, to measure progress over time
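The slide does not name the scoring metric. As one hedged illustration, the sketch below compares two annotators with raw observed agreement and with Cohen's kappa, which corrects for the agreement expected by chance. The function names and the "curate"/"skip" labels are hypothetical and do not come from the talk.

```python
from collections import Counter


def observed_agreement(labels_a, labels_b):
    """Fraction of items on which the two annotators chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two annotators."""
    n = len(labels_a)
    p_o = observed_agreement(labels_a, labels_b)
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement from each annotator's own label distribution.
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical decisions by two curators on the same ten "blind" papers.
ANNOTATOR_A = ["curate", "curate", "skip", "skip", "curate",
               "skip", "curate", "skip", "skip", "curate"]
ANNOTATOR_B = ["curate", "skip", "skip", "skip", "curate",
               "skip", "curate", "curate", "skip", "curate"]

AGREEMENT = observed_agreement(ANNOTATOR_A, ANNOTATOR_B)
KAPPA = cohen_kappa(ANNOTATOR_A, ANNOTATOR_B)
```

In this toy case the annotators agree on 8 of 10 papers, but half of that agreement would be expected by chance given their label distributions, so kappa is lower than the raw agreement. Listing the items where the two label lists differ is the informal use the slide mentions: flagging places where the guidelines need improvement.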
Checking Interannotator Agreement:
An Experiment from BioCreAtIvE I
Camon et al. ran the first inter-curator agreement experiment*
- 3 EBI GOA annotators annotated 12 overlapping
documents for GO terms (4 docs/pair of curators)
- Results after developing consensus gold standard:
Avg precision (% annotations correct): ~95%
Avg recall (% correct annotations found): ~72%
Lessons learned
- Very few wrong annotations, but some were missed
- Annotators differed on specificity of annotation,
depending on their biological knowledge
- Annotation by paper meant evidence standard was
less clear (normal annotation is by protein)
- Annotation is a complex task for people!
* Camon et al., BMC Bioinformatics 2005, 6(Suppl 1):S17
Conclusions
Text mining can provide a methodology to assess
consistency of annotation
Text mining can provide tools
- To manage the curation queue
- To assist curators, particularly in normalization
& mapping into ontologies
Next steps
- Define intended uses of RegCreative data
- Establish curator training materials
- Identify key bottlenecks in curation
- Provide data, user input to develop tools
Major stumbling block for text mining
- Handling of PDF documents!
Acknowledgements
US National Science Foundation for funding of
BioCreAtIvE I and BioCreAtIvE II*
MITRE colleagues who worked on BioCreAtIvE
- Alex Morgan (now at Stanford)
- Marc Colosimo
- Jeff Colombe
- Alex Yeh (also KDD Challenge Cup)
Collaborators at CNB and CNIO
- Alfonso Valencia
- Christian Blaschke (now at bioalma)
- Martin Krallinger
* Contract numbers EIA-0326404 and IIS-0640153.