Annotating Gene List From Literature


Xin He
Department of Computer Science
UIUC
Motivation



Biologists often need to understand the
commonalities of a list of genes (e.g. whether
they are involved in the same pathway).
These genes typically come from clustering
results of microarray expression data.
Given a list of gene names, is there any
automatic way to find the common themes
from literature articles?
Related Work



The most popular way is based on the analysis of GO
terms associated with genes.
Method: each gene is associated with a set of GO
terms. Find the GO terms that are overrepresented in
the input list
Hypergeometric test: p-value of a GO term
$$P = 1 - \sum_{i=0}^{k-1} \frac{\binom{M}{i}\binom{N-M}{n-i}}{\binom{N}{n}}$$
N: total number of genes
M: total number of genes annotated with this term
n: number of genes in the list
k: number of genes in the list annotated with this
term
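As a rough illustration, the p-value above can be computed directly from the four counts. This is a minimal sketch using only the Python standard library; the function name `go_term_p_value` is hypothetical and not part of the original slides.

```python
from math import comb

def go_term_p_value(N, M, n, k):
    """Probability of seeing at least k annotated genes in a list of n genes.

    N: total number of genes
    M: total number of genes annotated with the term
    n: number of genes in the list
    k: number of genes in the list annotated with the term
    """
    total = comb(N, n)
    # Number of ways to draw a list with fewer than k annotated genes.
    below = sum(comb(M, i) * comb(N - M, n - i) for i in range(k))
    # Subtract in exact integer arithmetic before dividing, so that very
    # small p-values are not lost to floating-point cancellation.
    return (total - below) / total

# Toy numbers (made up for illustration): 10 genes, 4 annotated, list of 3.
p = go_term_p_value(N=10, M=4, n=3, k=3)
```

A small p-value suggests the term is overrepresented in the list. In practice many GO terms are tested at once, so some multiple-testing correction would normally be applied on top of this.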
Problems with GO-based Approach


GO cannot cover all the important concepts
in the literature. E.g. GO has relatively low
coverage for behavior terms (compared with
a specialized behavior ontology)
The associations of genes and concepts
change very rapidly. E.g. new functions of
known genes are constantly found.
Text-based Gene List Annotation

Hypothesis testing approach:



find terms that are overrepresented for each gene:
Poisson distribution
find common terms across the gene list:
hypergeometric distribution
Comparative text mining approach: find the
common themes in multiple collections (one
for each gene)
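The per-gene step of the hypothesis testing approach could look roughly like the sketch below. The slides only name the Poisson distribution, so the details here are assumptions: a term is taken to occur at a known background rate per token, its expected count in a gene's article collection is that rate times the collection size, and the term is called overrepresented when the upper-tail Poisson probability falls below a threshold. The function names, the `alpha` threshold, and the toy numbers are all hypothetical.

```python
from math import exp

def poisson_upper_tail(count, lam):
    """P(X >= count) for X ~ Poisson(lam)."""
    term = exp(-lam)          # P(X = 0)
    below = 0.0
    for i in range(count):    # accumulate P(X = 0) ... P(X = count - 1)
        below += term
        term *= lam / (i + 1)
    return max(0.0, 1.0 - below)

def overrepresented_terms(term_counts, collection_size, background_rate, alpha=0.01):
    """Return {term: p_value} for terms seen more often than the background predicts.

    term_counts: term -> occurrences in the gene's article collection
    collection_size: total number of tokens in that collection
    background_rate: term -> assumed per-token rate in a background corpus
    """
    hits = {}
    for term, count in term_counts.items():
        lam = background_rate[term] * collection_size  # expected count
        p = poisson_upper_tail(count, lam)
        if p < alpha:
            hits[term] = p
    return hits

# Toy example with made-up numbers.
hits = overrepresented_terms(
    term_counts={"signaling": 12, "the": 50},
    collection_size=1000,
    background_rate={"signaling": 0.001, "the": 0.05},
)
```

In this toy input "the" occurs exactly as often as the background predicts and is not flagged, while "signaling" occurs far more often than expected. The terms flagged for each gene could then be compared across the gene list with a hypergeometric test.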
Comparative Text Mining



For each gene, find a collection of articles
that discuss this gene
Each article in a collection is a mixture of two
distributions: a theme common to all
collections; and a collection-specific theme
Parameter estimation in the mixture model:
the standard EM algorithm
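A minimal sketch of EM for such a two-component mixture follows. It simplifies the model described above in ways that are assumptions, not the slides' method: the articles of each gene are pooled into one bag of word counts, the mixing weight `lam` between the common theme and the collection-specific theme is fixed rather than estimated, and there is no background model. All names are hypothetical.

```python
def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}

def fit_common_theme(collections, lam=0.5, iters=50):
    """EM for one common theme plus one specific theme per collection.

    collections: list of dicts mapping word -> positive count (one per gene)
    lam: fixed probability that a word is drawn from the common theme
    Returns (common, specific): a word distribution shared by all
    collections and a list of per-collection word distributions.
    """
    vocab = sorted({w for counts in collections for w in counts})
    # Initialise the common theme from pooled counts and each specific
    # theme from its own collection's counts.
    common = normalize({w: sum(c.get(w, 0) for c in collections) for w in vocab})
    specific = [normalize(dict(counts)) for counts in collections]

    for _ in range(iters):
        new_common = {w: 0.0 for w in vocab}
        new_specific = []
        for counts, spec in zip(collections, specific):
            acc = {}
            for w, n in counts.items():
                pc = lam * common[w]
                ps = (1 - lam) * spec[w]
                # E-step: posterior that this word came from the common theme.
                z = pc / (pc + ps) if pc + ps > 0 else 0.0
                # M-step accumulators: split the count between the two themes.
                new_common[w] += n * z
                acc[w] = n * (1 - z)
            new_specific.append(normalize(acc))
        common = normalize(new_common)
        specific = new_specific
    return common, specific

# Toy collections (made-up counts): one shared word, one gene-specific word each.
toy = [
    {"pathway": 5, "toll": 5},
    {"pathway": 5, "tube": 5},
    {"pathway": 5, "pelle": 5},
]
common, specific = fit_common_theme(toy)
top_common_word = max(common, key=common.get)
```

On this toy input the shared word ends up at the top of the common theme, while each gene-specific word dominates its own collection's theme. Ranking words by their probability in `common` is one way to produce a top-word list for a gene list.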
Results: Pelle System


Pelle system in Drosophila: Spätzle, Toll,
Pelle, Tube, Cactus, Dorsal
Among the top-50 words: signaling, pathway,
receptor, embryo, ventral, dorsoventral,
patterning, embryonic
Results: MET cluster


MET cluster from yeast cell-cycle data:
MET28, MET14, MET16, MET10, MET2,
MUP1
Among the top-50 words: amino, met25,
sulphite
Problems and Plan

Many common words (such as stop words) appear in
the top list; word scores are not properly normalized



Using the entire Medline corpus as background: does
not work
Hypothesis testing approach as alternative
Single words not very suggestive

Phrase extraction as the postprocessing step