Using Biomedical Literature Mining to Consolidate the Set

Integrating Co-occurrence Statistics with IE for Robust Retrieval of Protein Interactions from Medline

Razvan C. Bunescu, Raymond J. Mooney
Machine Learning Group, Department of Computer Sciences
University of Texas at Austin
{razvan, mooney}@cs.utexas.edu

Arun K. Ramani, Edward M. Marcotte
Institute for Cellular and Molecular Biology, Center for Computational Biology and Bioinformatics
University of Texas at Austin
{arun, marcotte}@icmb.utexas.edu
Introduction
• Two orthogonal approaches to mining binary relations from a collection of documents:
  – Information Extraction:
    • Relation extraction from individual sentences;
    • Aggregation of the results over the entire collection.
  – Co-occurrence Statistics:
    • Compute (co-)occurrence counts over the entire corpus;
    • Use statistical tests to detect whether co-occurrence is due to chance.
• Aim: combine the two approaches into an integrated extraction model.
Outline
✓ Introduction.
• Two approaches to relation extraction:
  – Information Extraction.
  – Co-occurrence Statistics.
• Integrated Model.
• Evaluation Corpus.
• Experimental Results.
• Future Work & Conclusion.
Information Extraction
• Most IE systems detect relations only between entities mentioned in the same sentence.
• The existence & type of the relationship is inferred from lexico-semantic cues in the sentence context:

    "In synchronized human osteosarcoma cells, cyclin D1 is induced in early G1 and becomes associated with p9Ckshs1, a Cdk-binding subunit."

• Given a pair of entities, corpus-level results are assembled by combining the confidence scores that the IE system associates with each occurrence.
University of Texas at Austin
4
Machine Learning Group
Relation Extraction using a Subsequence Kernel
[Bunescu et al., 2005]
• Subsequences of words and POS tags are used as implicit features:

    interaction of (3) PROT (3) with PROT

• Assumes the entities have already been annotated.
• An exponential penalty factor λ is used to downweight longer word gaps.
• A generalization of the extraction system from [Blaschke et al., 2001].
• The system is trained to output a normalized confidence value for each extraction.
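To make the gapped-subsequence idea concrete, here is a minimal Python sketch of matching one such pattern with an exponential gap penalty, reading the "(3)" markers above as gaps of at most three words. This is not the SSK itself: the pattern, tokenization, recursion, and λ value are illustrative assumptions.

    # Minimal sketch: score one gapped-subsequence pattern in a sentence,
    # downweighting each gap word by an exponential penalty factor.
    # Illustrative only -- not the SSK implementation; LAM is an assumed value.
    LAM = 0.75

    def pattern_score(tokens, pattern, max_gap=3, lam=LAM):
        """Best score for matching `pattern` in `tokens`, allowing at most
        `max_gap` gap words between consecutive pattern words; each gap
        word multiplies the score by `lam`."""
        def search(start, p_idx, score):
            if p_idx == len(pattern):
                return score                     # whole pattern matched
            # the first pattern word may start anywhere; later ones are gap-limited
            limit = len(tokens) if p_idx == 0 else min(len(tokens), start + max_gap + 1)
            best = 0.0
            for j in range(start, limit):
                if tokens[j] == pattern[p_idx]:
                    gap = 0 if p_idx == 0 else j - start
                    best = max(best, search(j + 1, p_idx + 1, score * lam ** gap))
            return best
        return search(0, 0, 1.0)

    # PROT stands for an annotated protein mention, as on the slide.
    sent = "the interaction of human PROT with recombinant PROT was confirmed".split()
    print(pattern_score(sent, "interaction of PROT with PROT".split()))  # 0.5625 = LAM**2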
Aggregating Corpus-Level Results
[Diagram: each sentence S1, ..., Sn mentioning the pair (p1, p2) receives a confidence P(R(p1, p2) | Si) from the Information Extraction system; an Aggregation step then combines these into a single corpus-level confidence P(R(p1, p2) | C).]
Aggregation Operators
• Maximum:

    $\pi_{max} = \max_i P(R(p_1, p_2) \mid S_i)$

• Noisy-OR:

    $\pi_{nor} = 1 - \prod_i \left(1 - P(R(p_1, p_2) \mid S_i)\right)$

• Average:

    $\pi_{avg} = \frac{1}{n} \sum_i P(R(p_1, p_2) \mid S_i)$

• AND:

    $\pi_{and} = \prod_i P(R(p_1, p_2) \mid S_i)^{1/n}$
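Restated as a minimal Python sketch (my own restatement of the four formulas above, not the authors' code):

    import math

    def aggregate(confidences, op):
        """Combine per-sentence confidences P(R(p1,p2)|Si) into a
        corpus-level confidence, using one of the four operators above."""
        n = len(confidences)
        if op == "max":   # Maximum
            return max(confidences)
        if op == "nor":   # Noisy-OR: 1 - prod(1 - ci)
            return 1.0 - math.prod(1.0 - c for c in confidences)
        if op == "avg":   # Average
            return sum(confidences) / n
        if op == "and":   # AND: geometric mean, prod(ci)^(1/n)
            return math.prod(confidences) ** (1.0 / n)
        raise ValueError(f"unknown operator: {op}")

    confs = [0.9, 0.4, 0.7]
    for op in ("max", "nor", "avg", "and"):
        print(op, round(aggregate(confs, op), 4))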
Co-occurrence Statistics
• Compute (co-)occurrence counts for the two entities in the entire corpus.
• Based on these counts, detect whether the co-occurrence of the two entities is due to chance or to an underlying relationship.
• Can use various statistical measures:
  – Pointwise Mutual Information (PMI)
  – Chi-square test (χ²)
  – Log-Likelihood Ratio (LLR)
Pointwise Mutual Information
• N: the total number of protein pairs co-occurring in the same sentence in the entire corpus.
• P(p1, p2) ≈ n12/N: the probability that p1 and p2 co-occur in the same sentence.
• P(p1, p) ≈ n1/N: the probability that p1 co-occurs with any protein in the same sentence.
• P(p2, p) ≈ n2/N: the probability that p2 co-occurs with any protein in the same sentence.

    $PMI(p_1, p_2) = \log \frac{P(p_1, p_2)}{P(p_1, p) \cdot P(p_2, p)} = \log N \frac{n_{12}}{n_1 \cdot n_2}$

    $sPMI(p_1, p_2) = \frac{n_{12}}{n_1 \cdot n_2}$

• Since N is constant over the corpus and the logarithm is monotonic, ranking pairs by sPMI is equivalent to ranking them by PMI.
• The higher the sPMI(p1, p2) value, the less likely it is that p1 and p2 co-occurred by chance => they may be interacting.
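To make the counting concrete, a minimal Python sketch of sPMI from sentence-level annotations; the input format (a set of annotated protein identifiers per sentence) and the way n1, n2 are tallied as pairwise co-occurrences are assumptions:

    from collections import Counter
    from itertools import combinations

    def spmi_scores(sentences):
        """sPMI(p1,p2) = n12 / (n1 * n2). `sentences` is a list of sets of
        protein identifiers annotated in each sentence (assumed format)."""
        n12 = Counter()   # co-occurrence count of a specific pair
        n = Counter()     # co-occurrences of each protein with any protein
        for prots in sentences:
            for p1, p2 in combinations(sorted(prots), 2):
                n12[(p1, p2)] += 1
                n[p1] += 1
                n[p2] += 1
        return {(p1, p2): c / (n[p1] * n[p2]) for (p1, p2), c in n12.items()}

    sents = [{"cyclin D1", "p9Ckshs1"}, {"cyclin D1", "p9Ckshs1", "Cdk4"}]
    for pair, score in sorted(spmi_scores(sents).items(), key=lambda x: -x[1]):
        print(pair, round(score, 3))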
Integrated Model
• [Local] The sentence-level relation extraction (SSK) uses information that is local to one occurrence of a pair of entities (p1, p2).
• [Global] The corpus-level co-occurrence statistics (PMI) are based on counting all occurrences of a pair of entities (p1, p2).
• [Local & Global] Achieve more reliable extraction performance by combining the two orthogonal approaches into an integrated model.
• Rewrite sPMI as:

    $sPMI(p_1, p_2) = \frac{n_{12}}{n_1 \cdot n_2} = \frac{1}{n_1 \cdot n_2} \sum_{i=1}^{n_{12}} 1$

• Instead of counting 1 for each co-occurrence, use the confidence output by the IE system => a weighted PMI:

    $wPMI(p_1, p_2) = \frac{1}{n_1 \cdot n_2} \sum_{i=1}^{n_{12}} P(R(p_1, p_2) \mid S_i) = \frac{n_{12}}{n_1 \cdot n_2} \cdot avg\left(\{P(R(p_1, p_2) \mid S_i)\}\right)$

• Can use any aggregation operator, e.g.:

    $wPMI(p_1, p_2) = \frac{n_{12}}{n_1 \cdot n_2} \cdot \max\left(\{P(R(p_1, p_2) \mid S_i)\}\right)$
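A minimal sketch of wPMI with a pluggable aggregation operator; function and variable names are my own:

    def wpmi(confidences, n1, n2, agg):
        """wPMI(p1,p2) = n12/(n1*n2) * agg({P(R(p1,p2)|Si)}), where
        `confidences` holds the IE confidence for each of the n12
        sentences mentioning the pair, and n1, n2 count each protein's
        co-occurrences with any protein."""
        n12 = len(confidences)
        return (n12 / (n1 * n2)) * agg(confidences)

    confs = [0.9, 0.4, 0.7]
    avg = lambda xs: sum(xs) / len(xs)
    # With avg, wPMI is exactly sum(confidences)/(n1*n2):
    print(wpmi(confs, n1=10, n2=8, agg=avg))   # 0.025
    print(wpmi(confs, n1=10, n2=8, agg=max))   # 0.03375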
Evaluation Corpus
• An evaluation corpus needs to provide two types of information:
– The complete list of interactions mentioned in the corpus.
– Annotations of protein mentions, together with their gene identifiers.
• The corpus was compiled based on the HPRD [www.hprd.org] and
NCBI [www.ncbi.nih.gov] databases:
– Every interaction is linked to a set of Medline articles that report the
corresponding experiment.
– An interaction is specified as a tuple containing:
• The LocusLink (EntrezGene) identifiers of the proteins involved.
• The PubMed identifiers of the corresponding Medline articles.
Evaluation Corpus (cont'd)
Example records from the corpus:

Interactions (XML) (HPRD):

    <interaction>
      <gene>2318</gene>
      <gene>58529</gene>
      <pubmed>10984498 11171996</pubmed>
    </interaction>

Participant Genes (XML) (NCBI):

    <gene id="2318">
      <name>FLNC</name>
      <description>filamin C, gamma</description>
      <synonyms>
        <synonym>ABPA</synonym>
        <synonym>ABPL</synonym>
        <synonym>FLN2</synonym>
      </synonyms>
      <proteins>
        <protein>gamma filamin</protein>
        <protein>filamin 2</protein>
        <protein>filamin C, gamma</protein>
      </proteins>
    </gene>

    <gene id="58529">
      <name>MYOZ1</name>
      <description>myozenin 1</description>
      <synonyms> ... </synonyms>
      <proteins> FATZ ... </proteins>
    </gene>

Medline Abstracts (XML) (NCBI):

    <PMID>10984498</PMID>
    <AbstractText>
      We found that this protein binds to three other Z-disc proteins;
      therefore we have named it FATZ, gamma-filamin, alpha-actinin
      and telethonin binding protein of the Z-disc.
    </AbstractText>
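For illustration, a minimal ElementTree sketch that reads interaction records shaped like the snippet above; the element names are taken from the example, and the actual corpus files may differ:

    import xml.etree.ElementTree as ET

    def load_interactions(xml_text):
        """Yield (gene_id1, gene_id2, [pubmed ids]) tuples from
        HPRD-style <interaction> records like the example above."""
        root = ET.fromstring(xml_text)
        for inter in root.iter("interaction"):
            g1, g2 = (g.text for g in inter.findall("gene"))
            pmids = inter.findtext("pubmed", "").split()
            yield g1, g2, pmids

    xml_text = """<interactions>
      <interaction>
        <gene>2318</gene><gene>58529</gene>
        <pubmed>10984498 11171996</pubmed>
      </interaction>
    </interactions>"""
    print(list(load_interactions(xml_text)))
    # [('2318', '58529', ['10984498', '11171996'])]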
Gene Name Annotation and Normalization
• NCBI provides a comprehensive dictionary of human genes, where each gene is specified by a unique identifier and qualified with:
– an official name,
– a description,
– a list of synonyms,
– a list of protein names.
• All these names (including the description) are considered as referring
to the same entity.
• Use a dictionary-based annotation, similar to [Cohen, 2005].
• Each name is reduced to a normal form by:
  1) Replacing dashes with spaces;
  2) Introducing spaces between letters and digits;
  3) Replacing Greek letters with their Latin counterparts;
  4) Substituting Roman numerals with Arabic numerals;
  5) Decapitalizing the first word (if capitalized).
• The names are further tokenized, and checked against a dictionary of
100K English nouns.
• Names associated with more than one gene identifier (i.e. ambiguous
names) are ignored.
• The final gene name dictionary is implemented as a trie-like structure.
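A minimal sketch of the five normalization steps above; the Greek-letter and Roman-numeral maps are abridged stand-ins for the real tables:

    import re

    GREEK = {"alpha": "a", "beta": "b", "gamma": "g"}   # abridged, assumed map
    ROMAN = {"ii": "2", "iii": "3", "iv": "4"}          # abridged, assumed map

    def normalize(name):
        """Reduce a gene name to the normal form described above."""
        name = name.replace("-", " ")                                  # 1) dashes -> spaces
        name = re.sub(r"(?<=[A-Za-z])(?=\d)|(?<=\d)(?=[A-Za-z])", " ", name)  # 2) split letters/digits
        tokens = [GREEK.get(t.lower(), t) for t in name.split()]      # 3) Greek -> Latin
        tokens = [ROMAN.get(t.lower(), t) for t in tokens]            # 4) Roman -> Arabic
        if tokens and tokens[0][:1].isupper():                        # 5) decapitalize first word
            tokens[0] = tokens[0].lower()
        return " ".join(tokens)

    print(normalize("Cyclin-D1"))        # -> "cyclin d 1"
    print(normalize("interleukin II"))   # -> "interleukin 2"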
Experimental Results
• Compare four methods on the task of interaction extraction:
  – Information Extraction:
    • [SSK.Max] Relation extraction with the subsequence kernel (SSK), followed by aggregation of corpus-level results using Max.
  – Co-occurrence Statistics:
    • [PMI] Pointwise Mutual Information.
    • [HG] The hypergeometric distribution method from [Ramani et al., 2005].
  – Integrated Model:
    • [PMI.SSK.Max] The combined model of PMI & SSK.
• Draw precision vs. recall graphs by ranking the extractions and keeping only the top N interactions, as N varies.
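For illustration, a minimal sketch of tracing such a precision-recall curve by sweeping N over the ranked extractions (the example pairs are illustrative):

    def pr_curve(ranked_pairs, gold_pairs):
        """Precision/recall of the top-N extractions for N = 1..len(ranked).
        `ranked_pairs` is sorted by decreasing extraction score;
        `gold_pairs` is the set of true interactions in the corpus."""
        points, hits = [], 0
        for n, pair in enumerate(ranked_pairs, start=1):
            hits += pair in gold_pairs
            points.append((hits / n, hits / len(gold_pairs)))  # (P, R) at top N
        return points

    gold = {("2318", "58529"), ("595", "1019")}
    ranked = [("2318", "58529"), ("2318", "595"), ("595", "1019")]
    for p, r in pr_curve(ranked, gold):
        print(f"P = {p:.2f}  R = {r:.2f}")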
[Figure: precision vs. recall curves comparing SSK.Max, PMI, HG, and the integrated PMI.SSK.Max on the evaluation corpus.]
Future Work
• Derive an evaluation corpus from a potentially more accurate database
(Reactome).
• Investigate combining IE with other statistical tests (LLR, χ²).
• Design an IE method that is trained to do corpus-level extraction (as
opposed to sentence-level extraction).
Conclusion
• Introduced an integrated model that combines two orthogonal
approaches to corpus-level relation extraction:
– Information Extraction (SSK).
– Co-occurrence Statistics (PMI).
• Derived an evaluation corpus from the HPRD and NCBI databases.
• Experimental results show more consistent performance across the precision-recall curve.
Aggregating Corpus-Level Results
• Two entities p1 and p2 are mentioned in a corpus C of n sentences, C = {S1, ..., Sn}.
• The IE system outputs a confidence value for each of the n occurrences:

    $P(R(p_1, p_2) \mid S_i) \in [0, 1]$

• The corpus-level confidence value is computed using an aggregation operator $\oplus$:

    $P(R(p_1, p_2) \mid C) = \oplus\left(\{P(R(p_1, p_2) \mid S_i) \mid i = 1..n\}\right)$
Experimental Results
• Compare [PMI] and [HG] on the task of extracting interactions from the entire Medline.
• Use the shared protein function benchmark from [Ramani et al., 2005]:
  – Calculate the extent to which interaction partners share functional annotations, as specified in the KEGG and GO databases.
  – Use a Log-Likelihood Ratio (LLR) scoring scheme to rank the interactions:

    $LLR = \ln \frac{P(D \mid I)}{P(D \mid \neg I)}$

• Also plot the scores associated with the HPRD, BIND, and Reactome databases.
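As a rough illustration only (the benchmark's exact estimation procedure is not spelled out here), the LLR can be read as a count-based estimate, with D the event that a pair shares a KEGG/GO annotation; `shares_function` is a hypothetical stand-in for such a lookup:

    import math

    def functional_llr(predicted, background, shares_function):
        """LLR = ln( P(D|I) / P(D|~I) ), estimated by counting how often
        pairs share a functional annotation in the predicted interactions
        (I) versus a background set of non-interacting pairs (~I).
        `shares_function(p1, p2)` is a hypothetical KEGG/GO lookup."""
        p_d_i = sum(shares_function(p1, p2) for p1, p2 in predicted) / len(predicted)
        p_d_not_i = sum(shares_function(p1, p2) for p1, p2 in background) / len(background)
        return math.log(p_d_i / p_d_not_i)

    annotations = {"A": {"kinase"}, "B": {"kinase"}, "C": {"ligase"}, "D": {"kinase"}}
    shares = lambda p1, p2: bool(annotations[p1] & annotations[p2])
    print(functional_llr([("A", "B"), ("A", "D")], [("A", "C"), ("B", "D")], shares))
    # ln(1.0 / 0.5) = 0.693...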
[Figure: LLR-based functional annotation scores for the PMI and HG interaction sets extracted from full Medline, with the HPRD, BIND, and Reactome databases plotted for reference.]