Annotating Gene List From Literature
Xin He
Department of Computer Science
UIUC
Motivation
Biologists often need to understand the
commonalities of a list of genes (e.g. whether
they are involved in the same pathway).
These genes typically come from clustering
results of microarray expression data.
Given a list of gene names, is there any
automatic way to find the common themes
from literature articles?
Related Work
The most popular approach is based on the analysis of
Gene Ontology (GO) terms associated with genes.
Method: each gene is associated with a set of GO
terms. Find the GO terms that are overrepresented in
the input list.
Hypergeometric test: p-value of a GO term
$$P = 1 - \sum_{i=0}^{k-1} \frac{\binom{M}{i}\binom{N-M}{n-i}}{\binom{N}{n}}$$
N: total number of genes
M: total number of genes annotated with this term
n: number of genes in the list
k: number of genes in the list annotated with this
term
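The p-value defined by N, M, n and k above can be computed directly. The following is a minimal sketch using only the Python standard library; the function name `go_term_pvalue` is a hypothetical choice for illustration, not something from the original slides.

```python
from math import comb

def go_term_pvalue(N, M, n, k):
    """Probability of seeing at least k annotated genes in a list of n genes.

    N: total number of genes
    M: total number of genes annotated with the term
    n: number of genes in the list
    k: number of genes in the list annotated with the term
    """
    total = comb(N, n)  # number of ways to draw a list of n genes
    # Probability mass of observing fewer than k annotated genes (i = 0 .. k-1).
    below = sum(comb(M, i) * comb(N - M, n - i) for i in range(k))
    return 1 - below / total

# Example with small, made-up numbers: 10 genes, 4 annotated,
# a list of 3 genes of which 2 carry the annotation.
p = go_term_pvalue(N=10, M=4, n=3, k=2)
```

A small p-value suggests the term is overrepresented in the list. In practice many GO terms are tested at once, so some form of multiple-testing correction would usually be applied on top of this.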
Problems with GO-based Approach
GO cannot cover all the important concepts
in the literature. E.g. GO has relatively low
coverage of behavior terms (compared with
a specialized behavior ontology).
The associations between genes and concepts
change very rapidly. E.g. new functions of
known genes are constantly being found.
Text-based Gene List Annotation
Hypothesis testing approach:
find terms that are overrepresented for each gene:
Poisson distribution
find common terms across the gene list:
hypergeometric distribution
Comparative text mining approach: find the
common themes in multiple collections (one
for each gene)
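The first step of the hypothesis testing approach (finding terms overrepresented for a single gene under a Poisson model) might look like the sketch below. The slides do not give details, so the following are assumptions: a term's expected count is its background rate times the total number of words in the gene's articles, and the names `poisson_tail`, `overrepresented_terms`, `background_rate` and the threshold `alpha` are hypothetical.

```python
import math

def poisson_tail(count, expected):
    """P(X >= count) for X ~ Poisson(expected)."""
    if count <= 0:
        return 1.0
    term = math.exp(-expected)  # P(X = 0)
    cdf = 0.0
    for i in range(count):      # accumulate P(X = 0) .. P(X = count - 1)
        cdf += term
        term *= expected / (i + 1)
    return max(0.0, 1.0 - cdf)

def overrepresented_terms(gene_counts, gene_total, background_rate, alpha=0.01):
    """Return {term: p-value} for terms seen more often than the background predicts.

    gene_counts: term -> occurrences in the articles about one gene
    gene_total: total number of words in those articles
    background_rate: term -> per-word rate in a background corpus (assumed given)
    """
    significant = {}
    for term, count in gene_counts.items():
        rate = background_rate.get(term)
        if rate is None:
            continue  # assumption: skip terms with no background estimate
        p = poisson_tail(count, rate * gene_total)
        if p < alpha:
            significant[term] = p
    return significant

# Made-up counts for illustration only.
hits = overrepresented_terms(
    {"toll": 10, "the": 50}, 1000, {"toll": 0.0005, "the": 0.05}
)
```

The per-gene term sets produced this way could then be treated like GO annotations, with a hypergeometric test across the gene list picking out the terms shared by more genes than expected.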
Comparative Text Mining
For each gene, find a collection of articles
that discuss this gene
Each article in a collection is a mixture of two
distributions: a theme common to all
collections; and a collection-specific theme
Parameter estimation in the mixture model:
the standard EM algorithm
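The two-component mixture can be sketched as follows. This is a simplified illustration, not the author's implementation: it assumes a fixed mixing weight `lam` for the common theme, pools the articles of each collection into one bag of words (equivalent to the per-article model when the weight is fixed), and uses hypothetical names throughout.

```python
from collections import Counter

def comparative_mixture_em(collections, lam=0.5, iters=50):
    """Estimate one common theme and one specific theme per collection.

    collections: list of collections; each collection is a list of
                 tokenized articles (lists of words).
    lam: assumed fixed probability that a word comes from the common theme.
    Returns (common, specific) where common is a word -> probability dict
    and specific is a list of such dicts, one per collection.
    """
    counts = [Counter(w for doc in coll for w in doc) for coll in collections]
    pooled = sum(counts, Counter())
    total = sum(pooled.values())
    # Initialise the common theme from pooled frequencies and each
    # specific theme from its own collection's frequencies.
    common = {w: c / total for w, c in pooled.items()}
    specific = [
        {w: c / sum(cnt.values()) for w, c in cnt.items()} for cnt in counts
    ]
    for _ in range(iters):
        new_common = Counter()
        new_specific = []
        for cnt, spec in zip(counts, specific):
            spec_mass = {}
            for w, c in cnt.items():
                # E-step: probability that word w was drawn from the common theme.
                pc = lam * common[w]
                ps = (1 - lam) * spec[w]
                z = pc / (pc + ps)
                # Accumulate expected counts for the M-step.
                new_common[w] += c * z
                spec_mass[w] = c * (1 - z)
            norm = sum(spec_mass.values())
            new_specific.append({w: m / norm for w, m in spec_mass.items()})
        # M-step: renormalise expected counts into distributions.
        norm = sum(new_common.values())
        common = {w: m / norm for w, m in new_common.items()}
        specific = new_specific
    return common, specific

# Toy input: two "genes", each with two short articles.
colls = [
    [["toll", "pathway", "signaling"], ["toll", "pathway"]],
    [["pelle", "pathway", "signaling"], ["pelle", "pathway"]],
]
common, specific = comparative_mixture_em(colls)
```

On this toy input the shared word "pathway" ends up with the highest probability in the common theme, while "toll" and "pelle" are absorbed by their collection-specific themes. Sorting `common` by probability gives the kind of top-word list reported in the results.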
Results: Pelle System
Pelle system in Drosophila: Spätzle, Toll,
Pelle, Tube, Cactus, Dorsal
Among the top-50 words: signaling, pathway,
receptor, embryo, ventral, dorsoventral,
patterning, embryonic
Results: MET cluster
MET cluster from yeast cell-cycle data:
MET28, MET14, MET16, MET10, MET2,
MUP1
Among the top-50 words: amino, met25,
sulphite
Problems and Plan
Many common words (such as stop words) appear in
the top list; word weights are not properly normalized
Using the entire Medline corpus as background did not
work
Hypothesis testing approach as alternative
Single words are not very suggestive
Phrase extraction as a postprocessing step
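One simple way the planned postprocessing could work is to drop stop words and collect adjacent word pairs that contain a top-ranked word. The slides do not specify a phrase extraction method, so this bigram-counting sketch is only an assumption, and the stop-word list, function name and `min_count` threshold are hypothetical.

```python
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "of", "and", "in", "a", "to", "is"}

def candidate_phrases(docs, top_words, min_count=2):
    """Return (phrase, count) pairs for frequent bigrams containing a top word.

    docs: list of tokenized articles (lists of lowercase words)
    top_words: single words ranked highly by the theme model
    """
    top = set(top_words)
    bigrams = Counter()
    for doc in docs:
        for w1, w2 in zip(doc, doc[1:]):
            if w1 in STOP_WORDS or w2 in STOP_WORDS:
                continue  # skip pairs that include a stop word
            if w1 in top or w2 in top:
                bigrams[(w1, w2)] += 1
    return [
        (" ".join(pair), c) for pair, c in bigrams.most_common() if c >= min_count
    ]

# Made-up sentences for illustration only.
docs = [
    ["toll", "signaling", "pathway", "in", "the", "embryo"],
    ["the", "signaling", "pathway", "of", "dorsal"],
]
phrases = candidate_phrases(docs, top_words=["pathway"])
```

Here the single word "pathway" is expanded to the more informative phrase "signaling pathway", which occurs in both toy articles.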