ISMB 2003 presentation
Extracting Synonymous Gene and Protein Terms from Biological Literature
Hong Yu and Eugene Agichtein
Dept. of Computer Science, Columbia University, New York, USA
{hongyu, eugene}@cs.columbia.edu
212-939-7028
Significance and Introduction

- Genes and proteins are often associated with multiple names
  - e.g., Apo3, DR3, TRAMP, LARD, and lymphocyte associated receptor of death
- Authors often use different synonyms
- Information extraction benefits from identifying those synonyms
- Synonym knowledge sources are not complete
- Goal: develop automated approaches for identifying gene/protein synonyms from the literature
Background: Synonym Identification

- Semantically related words
  - Distributional similarity [Lin 98] [Li and Abe 98] [Dagan et al. 95]
  - e.g., "beer" and "wine" share context words such as "drink", "people", "bottle", and "make"
- Mapping abbreviations to full forms
  - e.g., map LARD to "lymphocyte associated receptor of death"
  - [Bowden et al. 98] [Hisamitsu and Niwa 98] [Liu and Friedman 03] [Pakhomov 02] [Park and Byrd 01] [Schwartz and Hearst 03] [Yoshida et al. 00] [Yu et al. 02]
Methods for Detecting Biomedical Multiword Synonyms

- Sharing a word(s) [Hole 00]
  - e.g., "cerebrospinal fluid" and "cerebrospinal fluid protein assay"
- Information retrieval approach
  - Trigram matching algorithm [Wilbur and Kim 01]
  - Vector space model over character trigrams:
    - "cerebrospinal fluid" -> cer, ere, …, uid
    - "cerebrospinal fluid protein assay" -> cer, ere, …, say
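A minimal Python sketch of the trigram vector-space idea: each term becomes a bag of character trigrams and similarity is their cosine. This is an illustration of the technique, not Wilbur and Kim's exact algorithm, whose term weighting differs.

```python
from collections import Counter
import math

def trigrams(term):
    """Split a term into overlapping character trigrams: 'cerebrospinal' -> 'cer', 'ere', ..."""
    s = term.lower()
    return [s[i:i + 3] for i in range(len(s) - 2)]

def trigram_similarity(t1, t2):
    """Cosine similarity between the trigram count vectors of two terms."""
    v1, v2 = Counter(trigrams(t1)), Counter(trigrams(t2))
    dot = sum(v1[g] * v2[g] for g in v1.keys() & v2.keys())
    norm = math.sqrt(sum(c * c for c in v1.values())) * math.sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0

# Terms sharing words score high despite different lengths.
print(trigram_similarity("cerebrospinal fluid", "cerebrospinal fluid protein assay"))
```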
Background: Synonym Identification

- GPE [Yu et al. 02]: a rule-based approach for detecting synonymous gene/protein terms
  - Manually recognize patterns authors use to list synonyms
  - Extract synonym candidates and apply heuristics to filter out unrelated terms
    - Apo3/TRAMP/WSL/DR3/LARD (synonyms)
    - ng/kg/min (not synonyms; filtered out)
- Advantages and disadvantages
  - High precision (90%)
  - Recall might be low, and the rules are expensive to build
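A toy illustration of the rule-based idea: one slash-list pattern plus a unit-word filter. The actual GPE rules and heuristics are more extensive; the UNIT_WORDS list and the regex below are assumptions made for the example.

```python
import re

# Hypothetical unit words used to filter out measurement expressions like "ng/kg/min".
UNIT_WORDS = {"ng", "kg", "min", "mg", "ml", "h", "s"}

def slash_synonyms(sentence):
    """Extract candidate synonym groups written as slash-separated lists
    (e.g. Apo3/TRAMP/WSL/DR3/LARD), discarding groups that look like units."""
    candidates = []
    for match in re.finditer(r"\b(\w+(?:/\w+)+)\b", sentence):
        terms = match.group(1).split("/")
        if any(t.lower() in UNIT_WORDS for t in terms):
            continue  # heuristic filter: unrelated terms, not gene/protein synonyms
        candidates.append(terms)
    return candidates

print(slash_synonyms("Apo3/TRAMP/WSL/DR3/LARD binds at 5 ng/kg/min."))
# [['Apo3', 'TRAMP', 'WSL', 'DR3', 'LARD']]
```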
Background: Machine Learning

- Machine learning reduces manual effort by automatically acquiring rules from data
- Unsupervised and supervised approaches
- Semi-supervised approaches
  - Bootstrapping [Hearst 92] [Yarowsky 95] [Agichtein and Gravano 00]
    - Hyponym detection [Hearst 92]: "The bow lute, such as the Bambara ndang, is plucked and has an individual curved neck for each string." -> A Bambara ndang is a kind of bow lute
  - Co-training [Blum and Mitchell 98]
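A rough sketch of a single Hearst-style "such as" pattern. Hearst's method uses several lexico-syntactic patterns over chunked noun phrases; the regex below is a simplified stand-in.

```python
import re

# One "NP, such as NP" pattern; real systems use noun-phrase chunking, not raw regexes.
SUCH_AS = re.compile(
    r"(?:the\s+)?([\w\s]+?),?\s+such as\s+(?:the\s+)?([\w\s]+?)[,.]",
    re.IGNORECASE)

def hyponym(sentence):
    """Return a 'kind of' statement if the sentence matches the such-as pattern."""
    m = SUCH_AS.search(sentence)
    if m:
        hypernym, hypo = m.group(1).strip(), m.group(2).strip()
        return f"A {hypo} is a kind of {hypernym}"
    return None

print(hyponym("The bow lute, such as the Bambara ndang, is plucked "
              "and has an individual curved neck for each string."))
# A Bambara ndang is a kind of bow lute
```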
Method: Outline

- Machine learning
  - Unsupervised: contextual similarity [Dagan et al. 95]
  - Semi-supervised: bootstrapping with SNOWBALL [Agichtein and Gravano 02]
  - Supervised: Support Vector Machine
- Comparison between machine learning and GPE
- Combined approach
Method: Unsupervised

- Contextual similarity [Dagan et al. 95]
- Hypothesis: synonyms have similar surrounding words
- Mutual information:

  I(t, w) = \log_2 \frac{N \cdot \mathrm{freq}(t, w)}{d \cdot \mathrm{freq}(t) \cdot \mathrm{freq}(w)}

- Similarity:

  \mathrm{sim}(t_1, t_2) = \frac{\sum_{w \in \mathrm{lexicon}} \min(I(w, t_1), I(w, t_2)) + \min(I(t_1, w), I(t_2, w))}{\sum_{w \in \mathrm{lexicon}} \max(I(w, t_1), I(w, t_2)) + \max(I(t_1, w), I(t_2, w))}
Methods: Semi-Supervised

- SNOWBALL [Agichtein and Gravano 02]
- Bootstrapping: starts with a small set of user-provided seed tuples for the relation, then automatically generates and evaluates patterns for extracting new tuples
- Example:
  - Seed tuple: {Apo3, DR3}
  - Occurrences: "Apo3, also known as DR3…", "DR3, also called LARD…"
  - Generated patterns: "<GENE>, also known as <GENE>", "<GENE>, also called <GENE>"
  - New tuples: {DR3, LARD}, {LARD, Apo3}
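A toy bootstrapping loop in the spirit of SNOWBALL. Real SNOWBALL represents left/middle/right contexts as weighted term vectors and assigns confidence scores to patterns and tuples; this sketch keeps exact middle strings and every match.

```python
import re

def bootstrap(sentences, seeds, rounds=3):
    """Alternate between learning middle-context patterns from known synonym
    pairs and extracting new pairs with those patterns."""
    known = set(seeds)  # e.g. {("Apo3", "DR3")}
    for _ in range(rounds):
        # 1. Generate patterns from occurrences of known pairs.
        patterns = set()
        for a, b in known:
            for s in sentences:
                m = re.search(re.escape(a) + r"(.{1,30}?)" + re.escape(b), s)
                if m:
                    patterns.add(m.group(1))  # middle context, e.g. ", also known as "
        # 2. Use the patterns to extract new candidate pairs.
        for mid in patterns:
            for s in sentences:
                for m in re.finditer(r"(\w+)" + re.escape(mid) + r"(\w+)", s):
                    known.add((m.group(1), m.group(2)))
    return known

sents = ["Apo3, also known as DR3, was cloned.",
         "Apo3, also called DR3, binds TWEAK.",
         "DR3, also called LARD, is a receptor."]
print(bootstrap(sents, {("Apo3", "DR3")}))  # now includes ('DR3', 'LARD')
```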
Method: Supervised

- Support Vector Machine
  - State-of-the-art text classification method
  - SVMlight implementation
- Training sets: the same sets of positive and negative tuples as SNOWBALL
- Features: the same terms and term weights used by SNOWBALL
- Kernel function: radial basis function (RBF) kernel
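A minimal sketch of the supervised step, using scikit-learn's SVC with an RBF kernel as a stand-in for SVMlight. The example contexts and TF-IDF weighting are illustrative assumptions; the talk reuses SNOWBALL's terms and term weights as features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Contexts joining candidate gene pairs, labeled by whether the pair is synonymous.
contexts = [", also known as ",   # positive: introduces a synonym
            ", also called ",     # positive
            " phosphorylates ",   # negative: a relation, not synonymy
            " binds to "]         # negative
labels = [1, 1, 0, 0]

vec = TfidfVectorizer()
X = vec.fit_transform(contexts)            # term-weighted feature vectors
clf = SVC(kernel="rbf", gamma="scale")     # radial basis function kernel, as in the talk
clf.fit(X, labels)

print(clf.predict(vec.transform([", also termed "])))
```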
Methods: Combined

- Rationale
  - Machine-learning approaches increase recall
  - The manual rule-based approach GPE has high precision but lower recall
  - Combining them should boost both recall and precision
- Method
  - Assume each system is an independent predictor
  - Combined confidence is the probability that not all systems extracted the pair incorrectly: P = 1 - \prod_i (1 - p_i)
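The combination rule written out: under the independence assumption, the combined confidence is one minus the probability that every system is wrong.

```python
def combined_confidence(probs):
    """P(correct) = 1 - prod_i (1 - p_i), assuming independent predictors."""
    p_all_wrong = 1.0
    for p in probs:
        p_all_wrong *= (1.0 - p)
    return 1.0 - p_all_wrong

# Hypothetical per-system confidences for one candidate pair:
print(combined_confidence([0.9, 0.6, 0.5]))  # 1 - 0.1*0.4*0.5 = 0.98
```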
Evaluation: Data

- Data: GeneWays corpora [Friedman et al. 01]
  - 52,000 full-text journal articles from Science, Nature, Cell, EMBO, Cell Biology, PNAS, and Journal of Biochemistry
- Preprocessing
  - Gene/protein named-entity tagging: AbGene [Tanabe and Wilbur 02]
  - Sentence segmentation: SentenceSplitter
- Training and testing
  - 20,000 articles for training (tuning SNOWBALL parameters such as the context window)
  - 32,000 articles for testing
Evaluation: Metrics

- Estimating precision
  - Randomly select 20 synonym pairs from each confidence-score bin (0.0-0.1, 0.1-0.2, …, 0.9-1.0)
  - Biological experts judged the correctness of the synonym pairs
- Estimating recall
  - SWISSPROT as the gold standard
  - 989 pairs of SWISSPROT synonyms co-appear in at least one sentence in the test set
  - Biological experts judged that 588 of these pairs were indeed synonyms
  - e.g., "…and cdc47, cdc21, and mis5 form another complex, which relatively weakly associates with mcm2…"
Results

- Patterns SNOWBALL found:

  Conf | Left | Middle                             | Right
  0.75 | -    | <( 0.55> <ALSO 0.53> <CALLED 0.53> | -
  0.54 | -    | <ALSO 0.47> <KNOWN 0.47> <AS 0.47> | -
  0.47 | -    | <( 0.54> <ALSO 0.54> <TERMED 0.54> | -

- Of 148 evaluated synonym pairs, 62 (42%) were not listed as synonyms in SWISSPROT
Results

[Figure: recall vs. confidence score (0 to 0.9) for the Snowball, SVM, Similarity, GPE, and Combined systems]
Results

[Figure: precision vs. recall for the Snowball, SVM, GPE, and Combined systems]
Results

- System running time:

  System     | Time
  Tagging    | 40 min
  Similarity | 7 h
  Snowball   | 2 h
  SVM        | 1.5 h
  GPE        | 35 min
Conclusions

- Extraction techniques can serve as a valuable supplement to resources such as SWISSPROT
- Extracting synonym relations can be automated through machine-learning approaches
- SNOWBALL can be applied successfully to recognize the patterns