
Accomplishments and Challenges in
Literature Data Mining for Biology
L. Hirschman et al.
Presented by Jing Jiang
CS491CXZ Spring, 2004
Outline
Accomplishments
– Natural Language Processing Perspective
– Biomedical Applications
Challenges
– Organizing A Challenge Evaluation
– Sample Challenge Problems:
    Extraction of Biological Pathways
    Automated Database Curation and Ontology Development
Early Work: to Identify Protein Names
Fukuda et al. (1998)
Challenges encountered:
– Long compound names
– Different names for the same protein
– Common English words as protein names
Solutions proposed:
– Uppercase letters (Src homology 2 domains)
– Numerals (p54 SAP kinase)
– Special endings (EGF receptor)
Recent Work: to Recognize Interactions between Proteins and Other Molecules
Statistical Approach
– Stapley & Benoit (2000): co-occurrences of gene names to predict connections
– Ding et al. (2002): co-occurrences when the unit is an abstract, a sentence, or a phrase
NLP Approach
– Ng & Wong (1999): templates with linguistic structures to recognize interactions
– Others: extended Ng & Wong’s work
– All based on grammars
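The statistical approach above can be sketched as a simple pair counter. This is only an illustration under stated assumptions, not the method of Stapley & Benoit or Ding et al.: the function name `cooccurrence_counts`, the gene names `GENE1` to `GENE3`, and the toy texts are all invented, and names are matched by naive substring search.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(units, gene_names):
    """Count how often each pair of gene names appears in the same text unit.

    A unit may be an abstract, a sentence, or a phrase; the caller decides.
    Matching is a naive case-insensitive substring test (an assumption).
    """
    counts = Counter()
    names = [g.lower() for g in gene_names]
    for unit in units:
        text = unit.lower()
        present = sorted({g for g in names if g in text})
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts

# Invented toy abstracts with hypothetical gene names.
abstracts = [
    "GENE1 interacts with GENE2 in repair. GENE3 is also discussed.",
    "GENE2 recruits GENE1 to damage sites.",
]
genes = ["GENE1", "GENE2", "GENE3"]

# Same texts, two unit sizes: whole abstracts versus crude sentence splits.
sentences = [s for a in abstracts for s in a.split(". ") if s]
by_abstract = cooccurrence_counts(abstracts, genes)
by_sentence = cooccurrence_counts(sentences, genes)

print(by_abstract[("gene1", "gene2")])  # 2
print(by_sentence[("gene2", "gene3")])  # 0
```

Shrinking the unit from abstract to sentence drops the GENE2/GENE3 pair, which only co-occurs at the abstract level in this toy data.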
NLP in Biological Applications
To capture specific relations in databases
– To learn ontological relations
– To extract biological pathways
To improve retrieval and clustering in searching large collections
– Homology search using sequence similarity
– Clustering MEDLINE abstracts
For classification
Problem I
How to compare different approaches?

Researchers                   Precision/Specificity  Recall/Sensitivity  Data Set                      Extracted Results
Yakushiji et al. (2001)       60 – 80%               /                   MEDLINE abstracts             argument structures
Friedman et al. (2001)        96%                    63%                 8000-word article from Cell   broad set of biological relations
Pustejovsky & Castaño (2002)  90%                    57%                 MEDLINE                       the “inhibit” relations
Problem II
How well does a system have to perform to be useful?
– What does 90% specificity at 57% sensitivity mean to the user?
– Need user-centered evaluations.
Challenge Evaluation
[Diagram: components of a challenge evaluation: Identification of Challenge Problem, Task Definition, Training Data, Test Data, Evaluator, Participants, Building System, Evaluation, Evaluation Methodology, Funding]
Sample Challenge Problem I: Extraction of Biological Pathways
What are biological pathways?
A network of interactions and events between proteins, drugs, and other molecules.
E.g. the Glycolytic Pathway
Challenge Problem
Three layers of challenges:
– To recognize names of proteins, drugs, and other molecules
– To recognize basic interaction events between molecules
– To recognize the relationships between the basic interaction events
Task Definition
$db = \{(t_1, F_1), (t_2, F_2), \ldots, (t_m, F_m)\}$: set of records
$t_i$: texts (sentences, abstracts, or whole articles)
$F_i = \{f_{i,1}, f_{i,2}, \ldots, f_{i,n_i}\}$: set of expected facts (short sentences in highly standardized forms, e.g. “P1 activate P2”)
Evaluation Methodology
recall(E) = TP(E)/[TP(E) + FN(E)]
precision(E) = TP(E)/[TP(E) + FP(E)]
E: information extractor
TP: true positive
FN: false negative
FP: false positive
Evaluation Methodology
At the record level
$$TP(E) = \sum_{(t,F) \in db} |E(t) \cap F|$$
$$FN(E) = \Big(\sum_{(t,F) \in db} |F|\Big) - TP(E)$$
$$FP(E) = \Big(\sum_{(t,F) \in db} |E(t)|\Big) - TP(E)$$
At the database level
$$TP(E) = \Big|\bigcup_{(t,F) \in db} \big(E(t) \cap F\big)\Big|$$
$$FN(E) = \Big|\bigcup_{(t,F) \in db} F\Big| - TP(E)$$
$$FP(E) = \Big|\bigcup_{(t,F) \in db} E(t)\Big| - TP(E)$$
Question: which one is the more effective measure?
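To make the two counting levels concrete, here is a minimal sketch that computes TP, FN, and FP both ways and feeds them into precision and recall. The toy database, the lookup-table "extractor", and all function names are assumptions made for illustration; a real extractor E would analyze the text itself.

```python
def record_level_counts(db, extractor):
    """Sum the counts record by record; repeated facts are counted each time."""
    tp = sum(len(extractor(t) & facts) for t, facts in db)
    fn = sum(len(facts) for _, facts in db) - tp
    fp = sum(len(extractor(t)) for t, _ in db) - tp
    return tp, fn, fp

def database_level_counts(db, extractor):
    """Pool facts over the whole database first; each distinct fact counts once."""
    correct = set().union(*(extractor(t) & facts for t, facts in db))
    expected = set().union(*(facts for _, facts in db))
    extracted = set().union(*(extractor(t) for t, _ in db))
    tp = len(correct)
    return tp, len(expected) - tp, len(extracted) - tp

def precision(tp, fn, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn, fp):
    return tp / (tp + fn) if tp + fn else 0.0

# Toy database of (text, expected facts) records; contents are invented.
db = [
    ("text 1", {"P1 activate P2"}),
    ("text 2", {"P1 activate P2", "P2 inhibit P3"}),
]

# Hypothetical extractor: a fixed lookup table standing in for a real system.
outputs = {
    "text 1": {"P1 activate P2"},
    "text 2": {"P1 activate P2", "P3 activate P1"},
}
extractor = lambda t: outputs[t]

print(record_level_counts(db, extractor))    # (2, 1, 1)
print(database_level_counts(db, extractor))  # (1, 1, 1)
```

In this toy case the fact "P1 activate P2" is rewarded twice at the record level but only once at the database level, so record-level precision is 2/3 while database-level precision is 1/2.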
Test Data
Appendix of Kohn (1999)
– 200 statements of interaction events
– Sentences of a fairly complex form
MEDLINE abstracts on “Topoisomerase inhibitors”
– 150 – 200 new abstracts each year
– Fewer than 1000 names and fewer than 200 interaction events each year
Sample Challenge Problem II:
Automated Database Curation and Ontology Development
Importance:
– Proteins are referred to by names
The nomenclature problem for proteins:
– A newly discovered protein may be named based on its functions, sequence features, gene name, cellular location, molecular weight, etc.
NLP technologies in information extraction, classification and ontology induction can be applied here
An Example
3 fields from the entry for Appl+P130kD in FlyBase:
(1) Protein size (kD): Luo et al, 1990 130
(2) Cell location: Luo et al, 1990 axon
(3) Expression pattern: Luo et al, 1990
    Stage   Tissue/Position
    Embryo  Embryonic Central Nervous System
    Embryo  Peripheral Nervous System
The abstract of Luo et al. (1990)
(1) APPL … is converted to a 130-kDa secreted form …
(2) APPL … was observed in … axonal tracts, …
(3) In the embryo, APPL proteins are expressed exclusively in the CNS and PNS neurons …
Knowledge Discovery and Data Mining Challenge Cup 2002
Participants are given
– A collection of journal articles
– Each labeled with genes mentioned in the article
Participants are required to answer
– Does the article contain any experimental results about gene expression that should be put in the database?
– If so, for each gene in the article, is there experimental evidence for any transcripts (RNA), protein, or polypeptide products of that gene?
Protein Knowledge Base
Evaluation of Ontologies
Challenging:
– No established metric for measuring knowledge in terms of content or value
Two levels:
– Intrinsic: compare terms and ontological relations discovered by the system against those found by humans
– Extrinsic: evaluate ontology’s usefulness in manual query expansion
Summary
Contributions of this paper:
– Summarized the work done so far in the field of literature data mining for biology
– Identified the important ingredients for a successful evaluation
– Gave concrete evaluation examples
End of the Talk
Identifying Protein Names from Biological Papers (Fukuda et al.)
Capital letters, numerical figures, and special symbols (core-terms)
– Src homology (SH) 2 and SH3 domains
– p54 SAP kinase
Key-words (feature-terms)
– EGF receptor
– Ras GTPase-activating protein (GAP)
IE system:
– Core-term extraction from tokenized texts
– Concatenation of core-terms and f-terms
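The two IE steps above (core-term extraction, then concatenation with feature-terms) can be sketched roughly as follows. This is a simplified illustration, not Fukuda et al.'s actual system: the `FEATURE_TERMS` list, the tokenizer, and the rule "a token with an uppercase letter or digit is a core-term" are assumptions, and the sketch ignores sentence-initial capitalization.

```python
import re

# Assumed, illustrative keyword list; the real feature-term list is not given here.
FEATURE_TERMS = {"receptor", "kinase", "protein", "domain", "domains"}

def is_core_term(token):
    """Treat any token containing an uppercase letter or a digit as a core-term."""
    return any(c.isupper() or c.isdigit() for c in token)

def candidate_names(text):
    """Join adjacent core-terms and feature-terms into candidate protein names."""
    tokens = re.findall(r"[A-Za-z0-9][A-Za-z0-9/-]*", text)
    names, run = [], []
    for tok in tokens + [""]:  # the empty sentinel flushes the final run
        if tok and (is_core_term(tok) or tok.lower() in FEATURE_TERMS):
            run.append(tok)
            continue
        # Keep a run only if it contains at least one core-term.
        if any(is_core_term(t) for t in run):
            names.append(" ".join(run))
        run = []
    return names

print(candidate_names("we found that p54 SAP kinase binds the EGF receptor"))
# ['p54 SAP kinase', 'EGF receptor']
```

A run made only of feature-terms, such as a bare "kinase", is discarded, which reflects the idea that keywords alone do not identify a protein name.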
Toward Routine Automatic Pathway Discovery from On-line Scientific Text Abstracts (Ng & Wong)
Key function words:
– Inhibitor: {inhibit, suppress, negatively regulate}
– Activator: {activate, transactivate, induce, upregulate, positively regulate}
Pattern matching rules:
– <A> … <fn> … <B>
– <A> … <fn> of … <B>
– <A> … <fn> by … <B>
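The three rules above can be approximated with a single regular expression. The sketch below is an assumption-laden illustration rather than Ng & Wong's implementation: it uses only a subset of the function words, matches them by stem, assumes the protein names are already known, and assumes that the "by" rule reverses the direction of the interaction.

```python
import re

# Stems for a subset of the function words, mapped to an interaction type.
# Matching on stems (e.g. "inhibit" for "inhibits", "inhibitor") is an assumption.
FUNCTION_STEMS = {
    "inhibit": "inhibit",
    "suppress": "inhibit",
    "activat": "activate",
    "induc": "activate",
}

def find_interaction(sentence, names):
    """Return (agent, type, target) for the first rule match, else None."""
    name_alt = "|".join(re.escape(n) for n in sorted(names, key=len, reverse=True))
    stem_alt = "|".join(FUNCTION_STEMS)
    pattern = re.compile(
        rf"\b(?P<a>{name_alt})\b.*?"
        rf"\b(?P<fn>{stem_alt})\w*(?:\s+(?P<prep>of|by))?\b.*?"
        rf"\b(?P<b>{name_alt})\b",
        re.IGNORECASE,
    )
    m = pattern.search(sentence)
    if m is None:
        return None
    kind = FUNCTION_STEMS[m.group("fn").lower()]
    a, b = m.group("a"), m.group("b")
    # Assumed reading of "<A> ... <fn> by ... <B>": B acts on A (passive voice).
    if m.group("prep") and m.group("prep").lower() == "by":
        a, b = b, a
    return (a, kind, b)

names = ["P1", "P2"]  # hypothetical, pre-recognized protein names
print(find_interaction("P1 inhibits P2", names))            # ('P1', 'inhibit', 'P2')
print(find_interaction("P1 is an inhibitor of P2", names))  # ('P1', 'inhibit', 'P2')
print(find_interaction("P1 is activated by P2", names))     # ('P2', 'activate', 'P1')
```

Templates this loose match any sentence in which the words appear in order, so negation and intervening clauses would produce false positives in real text.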
Evaluation Methodology
Simple Matching Coefficient (SMC)
– SMC(E) = TP(E)/[TP(E) + FN(E) + FP(E)]
Satisfies two conditions:
– To distinguish the ideal information extractor from the worst one
– To show a gradual monotonic change in value when the information extractor is changed from the worst to the best
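A quick numerical check of the two conditions, using the SMC formula exactly as given on the slide. The sweep from worst to best extractor is an invented scenario (10 expected facts, with each missed fact replaced by one spurious fact), and returning 0.0 when all three counts are zero is an assumption.

```python
def smc(tp, fn, fp):
    """SMC(E) = TP / (TP + FN + FP), as defined on the slide."""
    total = tp + fn + fp
    return tp / total if total else 0.0  # value for the empty case is assumed

# Hypothetical sweep: 10 expected facts; an extractor that finds `tp` of them
# also misses 10 - tp facts and emits 10 - tp spurious ones.
scores = [smc(tp, 10 - tp, 10 - tp) for tp in range(11)]

print(scores[0], scores[-1])  # 0.0 1.0
print(all(a < b for a, b in zip(scores, scores[1:])))  # True
```

Under this scenario the worst extractor scores 0, the ideal one scores 1, and every step in between raises the score.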
Three Tasks
To recognize names: obvious
To recognize interaction events: grammar
    PosEvent ::= P phosphorylate P [on T] [at L]
               | P dephosphorylate P [on T] [at L] …
    Event ::= PosEvent [mediated-by P+] [independent-of P+] …
To recognize relationships: grammar
    Relationship ::= Event [is-caused-by Event+] [provided Event+] …
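As a rough illustration of how facts in this standardized form could be checked against the grammar, the sketch below recognizes only the two PosEvent alternatives shown. The slide does not define P, T, and L, so they are treated here as opaque whitespace-free tokens; the `PosEvent` tuple, its field names, and `parse_pos_event` are hypothetical, and the elided parts of the grammar are not covered.

```python
import re
from typing import NamedTuple, Optional

class PosEvent(NamedTuple):
    agent: str         # first P
    verb: str          # phosphorylate or dephosphorylate
    target: str        # second P
    on: Optional[str]  # optional T
    at: Optional[str]  # optional L

# One regex for: P (phosphorylate|dephosphorylate) P [on T] [at L]
POS_EVENT = re.compile(
    r"^(?P<agent>\S+) (?P<verb>phosphorylate|dephosphorylate) (?P<target>\S+)"
    r"(?: on (?P<on>\S+))?(?: at (?P<at>\S+))?$"
)

def parse_pos_event(fact):
    """Parse one standardized fact; return None if it is not a PosEvent."""
    m = POS_EVENT.match(fact.strip())
    return PosEvent(**m.groupdict()) if m else None

print(parse_pos_event("P1 phosphorylate P2 on T1 at L1"))
print(parse_pos_event("P1 activate P2"))  # None
```

Event and Relationship nest other rules, so a full recognizer would need a recursive parser rather than a single regular expression.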