Transcript Slide 1
BioNLP Tutorial
K. Bretonnel Cohen
Olivier Bodenreider
Lynette Hirschman
PSB 2006
Wailea, Maui, HI
The Biological Data Cycle
Experimental
Data
Ontologies
Databases
Genbank
SwissProt
Literature
Collections
MEDLINE
Expert
Curation
Bottleneck: getting
knowledge from literature to
databases
Solution: text mining
Model Organism Curation
Pipeline
3. Curate genes from paper
2. List genes for curation
1. Select papers
MEDLINE
Double exponential growth
in the literature
New entries in Medline with publication date in
Jan-Aug 2005: 431,478 (avg. 1775/day)
Examples of BioNLP in action
Application types
Information retrieval: find documents in
response to an “information need”
p53
Resistance to apoptosis, increased growth
potential, and altered gene expression in cells
that survived genotoxic hexavalent chromium
exposure.
PMID: 16283527
Application types
Question-answering: question as input, answer as
output
What is BRCA1?
A gene located on the seventeenth
chromosome associated with a risk of
breast and ovarian cancer (Yu and Sable 2005)
Application types
• Summarization
– Input: one or more texts
– Output: single (shorter) text
– Ling et al. (multiple documents)
– Lu et al. (single document)
Information extraction: Information extraction systems find
statements about some specified type of relationship in text.
Entity identification is a necessary prerequisite to information
extraction.
Information retrieval: Information retrieval is classically defined
as the location of documents that are relevant to some information
need. PubMed is a premier example of a sophisticated biomedical
information retrieval system.
Summarization systems benefit from high-performance entity
identification and normalization. Other approaches involve
information extraction.
Application types
Information extraction: relationships
between things
BINDING_EVENT
Binder:
Bound:
Application types
Met28 binds to DNA.
BINDING_EVENT
Binder: Met28
Bound: DNA
Papers in this session:
Lussier (gene/phenotype)
Maguitman (protein/family)
Chun (gene/disease)
Höglund (protein/location)
Stoica (protein/function)
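The template-filling step can be sketched in a few lines of Python. This is only a minimal illustration, assuming a single hand-written pattern; the `BindingEvent` class, the `BINDS_TO` pattern, and the `extract_binding` function are hypothetical names, not part of any system described in the tutorial.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class BindingEvent:
    """Filled template for one binding relation (hypothetical structure)."""
    binder: str
    bound: str


# Illustrative pattern for the single surface form "<binder> binds to <bound>".
BINDS_TO = re.compile(r"\b(?P<binder>\w+) binds to (?P<bound>\w+)")


def extract_binding(sentence: str) -> Optional[BindingEvent]:
    """Return a filled BindingEvent if the sentence matches, else None."""
    match = BINDS_TO.search(sentence)
    if match is None:
        return None
    return BindingEvent(binder=match.group("binder"), bound=match.group("bound"))


print(extract_binding("Met28 binds to DNA."))
# BindingEvent(binder='Met28', bound='DNA')
```

A pattern this literal only covers one way of phrasing the relation; any other wording returns `None`.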
Application types
Entity identification:
HSP60
Hsp-60
heat shock protein 60
Cerberus
wingless
Ken and Barbie
the
Application types
Entity normalization: find concepts in
text and map them to unique identifiers
A locus has been found, an allele of which causes a modification of some
allozymes of the enzyme esterase 6 in Drosophila melanogaster. There are
two alleles of this locus, one of which is dominant to the other and results in
increased electrophoretic mobility of affected allozymes. The locus
responsible has been mapped to 3-56.7 on the standard genetic map (Est-6
is at 3-36.8). Of 13 other enzyme systems analyzed, only leucine
aminopeptidase is affected by the modifier locus. Neuraminidase
incubations of homogenates altered the electrophoretic mobility of esterase
6 allozymes, but the mobility differences found are not large enough to
conclude that esterase 6 is sialylated.
Application types
• Perfect entity identification finds 5 mentions; they
correspond to just 2 genes:
– FBgn0000592 (esterase 6)
– FBgn0026412 (leucine aminopeptidase)
Application types
• Partial list of synonyms for FBgn0000592:
– Esterase 6
– Carboxyl ester hydrolase
– CG6917
– Est6
– Est-D
– Est-5
Papers in this session:
Chun (gene/disease)
Johnson (ontology alignment)
Stoica (gene/function)
Vlachos (FlyBase mapping)
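Entity normalization by dictionary lookup can be sketched as follows. This is a toy illustration under stated assumptions: the synonym table contains only the names shown on the slides, the `normalize` and `find_mentions` helpers are hypothetical, and matching is done by stripping case, spaces, and punctuation so that "Est-6" and "Est6" collapse to the same key. Real systems need far larger lexicons and disambiguation.

```python
import re
from typing import List, Tuple

# Synonyms copied from the slides; a real system would load them from FlyBase.
SYNONYMS = {
    "FBgn0000592": ["Esterase 6", "Carboxyl ester hydrolase", "CG6917",
                    "Est6", "Est-D", "Est-5"],
    "FBgn0026412": ["leucine aminopeptidase"],
}


def normalize(text: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


LOOKUP = {normalize(name): gene_id
          for gene_id, names in SYNONYMS.items() for name in names}
MAX_TOKENS = max(len(name.split())
                 for names in SYNONYMS.values() for name in names)


def find_mentions(text: str) -> List[Tuple[str, str]]:
    """Greedy longest-match scan; returns (surface form, identifier) pairs."""
    tokens = text.split()
    mentions = []
    i = 0
    while i < len(tokens):
        for n in range(MAX_TOKENS, 0, -1):
            candidate = " ".join(tokens[i:i + n])
            gene_id = LOOKUP.get(normalize(candidate))
            if gene_id is not None:
                mentions.append((candidate.strip("(),.;"), gene_id))
                i += n
                break
        else:
            i += 1  # no synonym starts at this token
    return mentions


ABSTRACT = (
    "A locus has been found, an allele of which causes a modification of some "
    "allozymes of the enzyme esterase 6 in Drosophila melanogaster. There are "
    "two alleles of this locus, one of which is dominant to the other and "
    "results in increased electrophoretic mobility of affected allozymes. The "
    "locus responsible has been mapped to 3-56.7 on the standard genetic map "
    "(Est-6 is at 3-36.8). Of 13 other enzyme systems analyzed, only leucine "
    "aminopeptidase is affected by the modifier locus. Neuraminidase "
    "incubations of homogenates altered the electrophoretic mobility of "
    "esterase 6 allozymes, but the mobility differences found are not large "
    "enough to conclude that esterase 6 is sialylated."
)

mentions = find_mentions(ABSTRACT)
for surface, gene_id in mentions:
    print(f"{surface!r} -> {gene_id}")
print(len(mentions), "mentions,", len({g for _, g in mentions}), "genes")
```

On the example abstract, the scan finds the five mentions and maps them to the two FlyBase identifiers given on the slide.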
Biological Nomenclature: “V-SNARE”
V-SNARE
Vesicle SNARE
SNAP Receptor
Soluble NSF Attachment Protein
N-Ethylmaleimide-Sensitive Fusion Protein
Maleic acid N-ethylimide
Vesicle Soluble Maleic acid N-ethylimide Sensitive
Fusion Protein Attachment Protein Receptor
(A. Morgan)
The Biological Data Cycle
Experimental
Data
Ontologies
Databases
Genbank
SwissProt
Literature
Collections
MEDLINE
Expert
Curation
What’s the organizing
principle for all of this?
Organizing principles
[Figure: UMLS at the center, linking terminologies to subdomains: SNOMED (clinical repositories), OMIM (genetic knowledge bases), MeSH (biomedical literature), NCBI Taxonomy (model organisms), GO (genome annotations), UWDA (anatomy), … (other subdomains)]
Ontologies as text mining
resources
Neurofibromatosis type 2 (NF2) is often not
recognised as a distinct entity from peripheral
neurofibromatosis. NF2 is a predominantly
intracranial condition whose hallmark is bilateral
vestibular schwannomas. NF2 results from a
mutation in the gene named merlin, located on
chromosome 22.
(Uppal, S., and A. P. Coatesworth. “Neurofibromatosis Type 2.” Int J
Clin Pract, 57, no. 8, 2003, pp. 698-703.)
• vestibular schwannoma manifestation of neurofibromatosis 2
– Tumor manifestation of Disease
• neurofibromatosis 2 associated with mutation of merlin
– Disease associated with mutation of Gene
• merlin located on chromosome 22
– Gene located on Chromosome
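The step from statements about specific entities to statements about semantic types can be sketched as a lookup over typed triples. This is a minimal illustration: the `SEMANTIC_TYPE` dictionary stands in for an ontology lookup and the `lift` function is a hypothetical helper; only the entities, relations, and types themselves come from the slide.

```python
# Semantic types for the entities in the NF2 example (toy stand-in for an
# ontology lookup; the dictionary itself is an illustrative assumption).
SEMANTIC_TYPE = {
    "vestibular schwannoma": "Tumor",
    "neurofibromatosis 2": "Disease",
    "merlin": "Gene",
    "chromosome 22": "Chromosome",
}

# Instance-level relations read off the example text.
RELATIONS = [
    ("vestibular schwannoma", "manifestation of", "neurofibromatosis 2"),
    ("neurofibromatosis 2", "associated with mutation of", "merlin"),
    ("merlin", "located on", "chromosome 22"),
]


def lift(triple):
    """Replace the subject and object of a triple by their semantic types."""
    subject, predicate, obj = triple
    return (SEMANTIC_TYPE[subject], predicate, SEMANTIC_TYPE[obj])


for triple in RELATIONS:
    print(triple, "->", lift(triple))
```

The type-level triples act as a schema: a text mining system can use them to decide which candidate relations are plausible for a given pair of entities.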
What’s the state of the art?
Source           Relation  Entity   DB       Prec  Recall
Craven '99       location  protein  Yeast    92%   21%
Rindflesch '99   binding   UMLS     MEDLINE  79%   72%
Proux '00        interact  gene     Flybase  81%   44%
Friedman '01     pathway   many     Articles 96%   63%
Pustejovsky '02  inhibit   gene     MEDLINE  90%   57%
Bunescu '05      interact  protein  MEDLINE  ~37%  ~50%
Precision ≈ Specificity
Recall ≈ Sensitivity
• Tasks differ greatly: finding human protein
interactions (Bunescu ‘05) may be harder than finding
“inhibition” relations (Pustejovsky ‘02)
• Need a CASP-style competitive evaluation
What’s the state of the art?
• KDD Cup (2002)
• TREC Genomics (2003, 2004, 2005)
• BioCreAtIvE (2004)
• BioNLP (2004)
What’s the state of the art?
BioCreAtIvE information extraction task:
PDB → Gene Ontology
3. Curate genes from paper
2. List genes for curation
1. Select papers
MEDLINE
BioCreAtIvE entity
identification and entity
normalization tasks
KDD 2002, TREC Genomics 2004
What’s the state of the art?
• Yeast results good:
High: 0.93 F
Smallest vocab
Short names
Little ambiguity
• Fly: 0.82 F
High ambiguity
• Mouse: 0.79 F
Large vocabulary
Long names
[Figure: precision (y-axis) vs. recall (x-axis) for FLY, MOUSE, and YEAST, with 0.8 and 0.9 F-measure contours]
**F-measure is balanced precision and recall: 2*P*R/(P+R)
Recall:
# correctly identified/# possible correct
Precision:
# correctly identified/# identified
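These three definitions translate directly into code. The sketch below is a minimal illustration; the function names and the counts in the first example are made up for demonstration, while the second example reuses the 92% precision and 21% recall reported for Craven '99 earlier in the tutorial.

```python
def precision(correct: int, identified: int) -> float:
    """# correctly identified / # identified."""
    return correct / identified if identified else 0.0


def recall(correct: int, possible: int) -> float:
    """# correctly identified / # possible correct."""
    return correct / possible if possible else 0.0


def f_measure(p: float, r: float) -> float:
    """Balanced F-measure: 2*P*R/(P+R)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts: 10 items returned, 8 of them correct, 16 possible.
p = precision(8, 10)
r = recall(8, 16)
print(f"P={p:.2f} R={r:.2f} F={f_measure(p, r):.3f}")

# High precision with low recall still yields a low F-measure.
print(f"Craven '99: F={f_measure(0.92, 0.21):.3f}")
```

Because F is a harmonic mean, it is pulled toward the lower of the two values, which is why a system with 92% precision and 21% recall scores only about 0.34.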
What’s the state of the art?
user    run  evaluated  "perfect"      correct protein,
             results    predictions    "general" GO
user4   1    1048       268 (25.57%)   74 (7.06%)
user5   1    1053       166 (15.76%)   77 (7.31%)
        2    1050       166 (15.81%)   90 (8.57%)
        3    1050       154 (14.67%)   86 (8.19%)
user7   1    1057       272 (25.73%)   154 (14.57%)
        2    1864       43 (2.31%)     40 (2.15%)
        3    1703       66 (3.88%)     40 (2.35%)
user9   1    251        125 (49.80%)   13 (5.18%)
        2    70         33 (47.14%)    5 (7.14%)
        3    89         41 (46.07%)    7 (7.87%)
user10  1    45         36 (80.00%)    3 (6.67%)
        2    59         45 (76.27%)    2 (3.39%)
        3    64         50 (78.12%)    4 (6.25%)
user14  1    1050       303 (28.86%)   69 (6.57%)
user15  1    524        59 (11.26%)    28 (5.34%)
        2    998        125 (12.53%)   69 (6.91%)
user17  1    413        83 (20.10%)    19 (4.60%)
        2    458        7 (1.53%)      (0.00%)
user20  1    1048       301 (28.72%)   57 (5.44%)
        2    1048       280 (26.72%)   60 (5.73%)
        3    1050       239 (22.76%)   59 (5.62%)
Blaschke et al.
What’s the state of the art?
Cellular Component: 34.61% (561/1621)
Molecular Function: 33.00% (933/2827)
Biological Process: 23.02% (1011/4391)
Cellular component is easier because task is
relation between “entities”
located_in (protein,cell_component)
Biological process is hardest because it is
the most abstract
Blaschke et al.
2.5 types of solutions
• Rule-based
– Patterns
– Grammars
• Statistical/machine learning
– Labelled training data
– Noisy training data
• Hybrid statistical/rule-based
Papers in this session:
Johnson (ontology alignment, GO → other OBO)
Chun (IE, multiple gene → UMLS disease)
Höglund (information extraction, gene → localiz.)
Lu (summarization, Entrez Gene → GeneRIFs)
Ling (summarization, FlyBase)
Maguitman (info. extract., SWISSPROT → Pfam)
Lussier (info. extraction, GOA gene → phenotype)
Vlachos (entity normalization, → FlyBase)
Vlachos (coreference, FlyBase & Sequence Ont.)
Stoica (gene → GO code)
Common tools/techniques
• “Stop word” removal: eliminate
features that are rarely helpful
the, a, and…
• (Porter) stemming: convert
inflected words to their roots
promot, mitochondri, cytochrom
• POS: “part of speech”— ≈80
categories
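The first two preprocessing steps can be sketched as below. Note the assumptions: the stop-word list is a tiny illustrative sample, and `toy_stem` is a crude suffix stripper written for this example. It is not the Porter algorithm, which applies several ordered rule phases; it merely reproduces the three stems shown on the slide.

```python
# Tiny illustrative stop-word list (real lists contain hundreds of words).
STOP_WORDS = {"the", "a", "and", "of", "in", "to"}

# Suffixes tried in order; longer ones come first so "ers" wins over "s".
SUFFIXES = ("ation", "ing", "ers", "er", "ed", "al", "es", "e", "s")
MIN_STEM_LENGTH = 4


def toy_stem(word: str) -> str:
    """Strip the first matching suffix (a toy stand-in, NOT Porter stemming)."""
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM_LENGTH:
            return stem
    return word


def preprocess(text: str) -> list:
    """Lowercase, drop stop words, and stem the remaining tokens."""
    tokens = text.lower().split()
    return [toy_stem(token) for token in tokens if token not in STOP_WORDS]


print(preprocess("the promoters of the mitochondrial cytochrome and a promoter"))
# ['promot', 'mitochondri', 'cytochrom', 'promot']
```

After stemming, "promoters" and "promoter" become the same feature, which is the point of the step: inflectional variants no longer count as different words.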
Why text mining is difficult
• Variability
• Pervasive ambiguity at every level of
analysis
Why text mining is difficult
Met28 binds to DNA
…binding of Met28 to DNA…
…Met28 and DNA bind…
…binding between Met28 and DNA…
…Met28 is sufficient to bind DNA…
…DNA bound by Met28…
Why text mining is difficult
…binding of Met28 to DNA…
…binding under unspecified conditions of
Met28 to DNA…
…binding of this translational variant of
Met28 to DNA…
…binding of Met28 to upstream regions of
DNA…
Why text mining is difficult
…binding under unspecified conditions of
this translational variant of Met28 to
upstream regions of DNA…
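The cost of this variability for a pattern-based system can be made concrete. The sketch below is an illustration only: the regular expressions are hand-written for the six paraphrases listed above, and the `extract` and `coverage` helpers are hypothetical names. One pattern per paraphrase is needed, and the version with embedded modifiers still escapes all of them.

```python
import re
from typing import List, Optional, Pattern, Tuple

VARIANTS = [
    "Met28 binds to DNA",
    "binding of Met28 to DNA",
    "Met28 and DNA bind",
    "binding between Met28 and DNA",
    "Met28 is sufficient to bind DNA",
    "DNA bound by Met28",
]

# One hand-written pattern per surface form (illustrative, not exhaustive).
SINGLE_PATTERN = [re.compile(r"(?P<binder>\w+) binds to (?P<bound>\w+)")]
ALL_PATTERNS = SINGLE_PATTERN + [
    re.compile(r"binding of (?P<binder>\w+) to (?P<bound>\w+)"),
    re.compile(r"(?P<binder>\w+) and (?P<bound>\w+) bind\b"),
    re.compile(r"binding between (?P<binder>\w+) and (?P<bound>\w+)"),
    re.compile(r"(?P<binder>\w+) is sufficient to bind (?P<bound>\w+)"),
    re.compile(r"(?P<bound>\w+) bound by (?P<binder>\w+)"),
]


def extract(text: str, patterns: List[Pattern]) -> Optional[Tuple[str, str]]:
    """Return (binder, bound) from the first pattern that matches, else None."""
    for pattern in patterns:
        match = pattern.search(text)
        if match:
            return match.group("binder"), match.group("bound")
    return None


def coverage(patterns: List[Pattern]) -> int:
    """Count how many of the listed paraphrases the patterns can handle."""
    return sum(extract(variant, patterns) is not None for variant in VARIANTS)


print("one pattern covers", coverage(SINGLE_PATTERN), "of", len(VARIANTS))
print("six patterns cover", coverage(ALL_PATTERNS), "of", len(VARIANTS))

MODIFIED = ("binding under unspecified conditions of this translational "
            "variant of Met28 to upstream regions of DNA")
print("modified sentence:", extract(MODIFIED, ALL_PATTERNS))
```

Adding patterns fixes the listed paraphrases but not the modified sentence, which suggests why purely lexical patterns are often supplemented with syntactic analysis.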
Why text mining is difficult
• Document segmentation
• Sentence segmentation
• Tokenization
• Part of speech tagging
• Parsing
Why text mining is difficult
Here, we show that Bifocal (Bif), a putative cytoskeletal
regulator, is a component of the Msn pathway for
regulating R cell growth targeting. bif displays strong
genetic interaction with msn.
(Ruan et al. 2002)
Sentence segmenter F-measure (Baumgartner, in prep.):
MaxEnt_1 .40
MaxEnt_2 .67
KeX .95
LingPipe .96
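One plausible reason this passage is hard for segmenters is that its second sentence begins with a lowercase gene name. The sketch below illustrates that point with two hand-written rules; both regular expressions are assumptions made for this example and are unrelated to the tools named above.

```python
import re

TEXT = (
    "Here, we show that Bifocal (Bif), a putative cytoskeletal regulator, "
    "is a component of the Msn pathway for regulating R cell growth "
    "targeting. bif displays strong genetic interaction with msn."
)

# Rule 1: a sentence ends at . ! or ? followed by whitespace and a capital.
capital_rule = re.split(r"(?<=[.!?])\s+(?=[A-Z])", TEXT)

# Rule 2: a sentence ends at . ! or ? followed by any whitespace.
any_space_rule = re.split(r"(?<=[.!?])\s+", TEXT)

print(len(capital_rule), "sentence(s) with the capital-letter rule")
print(len(any_space_rule), "sentence(s) with the any-whitespace rule")
print(any_space_rule[-1])
```

The capital-letter rule misses the boundary before "bif". The looser rule finds it, but it would also split after abbreviations such as "et al.", so neither rule is adequate on its own for biomedical text.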
Why text mining is difficult
lead
• 69 tokens in GENIA
– “bare stem” verb: 34
– 3rd person singular present tense verb: 29
– Noun: 3
– Past tense verb: 2
– Past participle: 1
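These counts show how little a word-level lookup can achieve for an ambiguous token. As a small worked example, the sketch below computes the accuracy of a hypothetical tagger that always assigns "lead" its most frequent tag; the counts are those listed above, and the variable names are illustrative.

```python
from collections import Counter

# Tag counts for the token "lead" as listed on the slide.
LEAD_TAGS = Counter({
    "bare stem verb": 34,
    "3rd person singular present tense verb": 29,
    "noun": 3,
    "past tense verb": 2,
    "past participle": 1,
})

total = sum(LEAD_TAGS.values())
best_tag, best_count = LEAD_TAGS.most_common(1)[0]
baseline_accuracy = best_count / total

print(f"{total} tokens, most frequent tag: {best_tag}")
print(f"always-guess-the-majority accuracy: {baseline_accuracy:.1%}")
```

Guessing the majority tag would be right for fewer than half of the 69 occurrences, so the tagger has to use the surrounding context.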
Why text mining is difficult
HUNK
• Human natural killer (cell type)
• HUN kinase (gene/protein)
• Radiological/orthopedic classification
scheme
• Piece of something
Why text mining is difficult
NaCT is expressed in liver, testis and
brain in rat and shows preference for
citrate over dicarboxylates… (GeneRIF
266998:12177002)
NACT:
neoadjuvant chemotherapy (PMID 8898170)
N-acetyltransferase (PMID 10725313)
Na+-coupled citrate transporter (PMID 12177002 )
Why text mining is difficult
NaCT is expressed in liver, testis and
brain in rat and shows preference for
citrate over dicarboxylates… (GeneRIF
266998:12177002)
• (liver), (testis) and (brain in rat)
• liver, (testis and brain in rat)
• (liver, testis and brain in rat)
Why text mining is difficult
NaCT is expressed in liver, testis and brain
in rat and shows preference for citrate over
dicarboxylates… (GeneRIF 266998:12177002)
• shows preference for (citrate over
dicarboxylates)
• shows preference (for citrate) (over
dicarboxylates)
Why text mining is difficult
regulation of cell migration and proliferation
(PMID …)
serine phosphorylation, translocation, and degradation
of IRS-1 (PMID 16099428)
! proliferation and regulation of cell migration
! regulation of proliferation and cell migration
regulation of cell migration and regulation of cell
proliferation
Why text mining is difficult
regulation of cell migration and proliferation
(PMID …)
serine phosphorylation, translocation, and
degradation of IRS-1 (PMID 16099428)
!degradation of IRS-1, translocation, and serine
phosphorylation
!serine phosphorylation, serine translocation,
and serine degradation (of IRS-1)
Most biomedical text mining to
date: “ungrounded”
• Drosophila OBP76a is necessary for fruit
flies to respond to the aggregation
pheromone 11-cis vaccenyl acetate (PMID
15664166)
• lush is completely devoid of evoked activity
to the pheromone 11-cis vaccenyl acetate (VA),
revealing that this binding protein is
absolutely required for activation of
pheromone-sensitive chemosensory neurons
(PMID 15664171)
Entrez Gene ID:40136
The next step
• Text mining can be a key tool for linking
biological knowledge from the literature
to structured data in biological
databases…
• …and databases to each other.
Papers in the text mining session
• 5 papers on linkage to ontologies
• Höglund et al.: generating cellular localization annotations
• Lussier et al.: PhenoGO for capture of phenome data
• Stoica and Hearst: functional annotation of proteins
• Johnson et al.: ontology alignments
• Vlachos et al.: ontology for name extraction, anaphora
• 2 papers linking other sets of resources
• Maguitman et al. on “bibliome” to reproduce Pfam classes
• Chun et al. on linking genes and diseases
• 2 papers on summarization, using linked
resources
• Lu et al.: automated GeneRIF extraction
• Ling et al.: automated gene summary generation
Acknowledgements
• Alex Morgan for several slides
• Christian Blaschke for data and slides
• Bill Baumgartner for sentence
segmenter performance data
• Helen Johnson for data on POS
ambiguity in GENIA
• Lu Zhiyong for syntactic ambiguity
examples
• Larry Hunter for current PubMed graph
How big is a
humuhumunukunukuapua’a?