BeeSpace Informatics Research
ChengXiang ("Cheng") Zhai
Department of Computer Science, Institute for Genomic Biology,
Department of Statistics, Graduate School of Library & Information Science
University of Illinois at Urbana-Champaign
BeeSpace Workshop, May 22, 2009
Goal of Informatics Research
• Develop general and scalable computational
methods to enable
– Semantic integration of data and information
– Effective information access and exploration
– Knowledge discovery
– Hypothesis formulation and testing
• Reinforcement of research in biology and
computer science
– CS research to automate manual tasks of biologists
– Biology research to raise new challenges for CS
Overview of BeeSpace Technology

[System architecture, bottom to top:]
• Raw data: literature text and meta data
• Content analysis: natural language understanding and text mining
  produce words/phrases, entities, and relations, stored in a
  relational database
• Information access & exploration: search engine, plus space/region
  manager and navigation support
• Knowledge discovery & hypothesis testing: gene summarizer, question
  answering, and function annotator, serving the users
Informatics Research Accomplishments

[Accomplishments mapped onto the architecture:]
• Entity/gene summarization [Ling et al. 06], [Ling et al. 07],
  [Ling et al. 08]
• Automatic function annotation and question answering [He et al. 09/10]
• Topic discovery and interpretation [Mei et al. 06a], [Mei et al. 07a],
  [Mei et al. 07b], [Chee & Schatz 08]
• Biomedical information retrieval [Jiang & Zhai 07], [Lu et al. 08]
• Entity/relation extraction [Jiang & Zhai 06], [Jiang & Zhai 07a],
  [Jiang & Zhai 07b]
Overview of BeeSpace Technology

[The same architecture, annotated with the four parts of this talk:]
• Part 1. Information Extraction — content analysis: natural language
  understanding and text mining of words/phrases, entities, and
  relations into the relational database
• Part 2. Navigation Support — space/region manager and search engine
• Part 3. Entity Summarization — gene summarizer
• Part 4. Function Analysis — question answering and function annotator
Part 1. Information Extraction
Natural Language Understanding

…We have cloned and sequenced a cDNA encoding Apis mellifera
ultraspiracle (AMUSP) and examined its responses to …

[The slide marks the noun phrases (NP) and verb phrases (VP) of the
parse, and tags "Apis mellifera ultraspiracle" and "AMUSP" as Gene
mentions.]
Entity & Relation Extraction

Example source: Lopes FJ et al., 2005 J. Theor. Biol.

  Genetic Interaction        Expression Location
  Gene X    Gene Y           Gene X    Anatomy Y
  Bcd       hb               Bcd       embryo
  …         …                Hb        egg
                             …         …
General Approach: Machine Learning
• Computers learn from labeled examples to
compute a function to predict labels of new
examples
• Examples of predictions
– Given a phrase, predict whether it is a gene name
– Given a sentence with two gene names mentioned,
predict whether there is a genetic interaction
relation
• Many learning methods are available, but
training data isn’t always available
Extraction Example 1:
Gene Name Recognition

… expression of terminal gap genes is mediated by the local
activation of the Torso receptor tyrosine kinase (Tor). At the
anterior, terminal gap genes are also activated by the Tor
pathway but Bcd contributes to their activation.

For each candidate mention (Torso, Tor, Bcd): is it a gene name?
Features for Recognizing Genes
• Syntactic clues:
– Capitalization (especially acronyms)
– Numbers (gene families)
– Punctuation: -, /, :, etc.
• Contextual clues:
– Local: surrounding words such as “gene”,
“encoding”, “regulation”, “expressed”, etc.
– Global: same noun phrase occurs several times in
the same article
Maximum Entropy Model for Gene Tagging

• Given an observation (a token or a noun phrase), together with its
  context, denoted as x
• Predict y ∈ {gene, non-gene}
• Maximum entropy model:
  P(y|x) = (1/Z(x)) · exp( Σ_i λ_i · f_i(x, y) )
• Typical features f:
  – y = gene & candidate phrase starts with a capital letter
  – y = gene & candidate phrase contains digits
• Estimate the λ_i with training data
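A minimal sketch of such a two-class maximum-entropy (logistic) tagger, using the two slide features plus a bias; the training phrases are invented for illustration:

```python
import math

# Minimal two-class maximum-entropy (logistic) model for gene tagging.
# The feature functions mirror the slide: they fire for the gene class.
def features(phrase):
    return [
        1.0 if phrase[:1].isupper() else 0.0,              # starts with a capital
        1.0 if any(c.isdigit() for c in phrase) else 0.0,  # contains digits
        1.0,                                               # bias term
    ]

def p_gene(weights, phrase):
    # P(gene | x) = exp(sum_i w_i f_i) / Z(x), a sigmoid for 2 classes
    z = sum(w * f for w, f in zip(weights, features(phrase)))
    return 1.0 / (1.0 + math.exp(-z))

def train(data, epochs=200, lr=0.5):
    w = [0.0] * 3
    for _ in range(epochs):
        for phrase, y in data:  # gradient ascent on the log-likelihood
            err = y - p_gene(w, phrase)
            for i, f in enumerate(features(phrase)):
                w[i] += lr * err * f
    return w

# Toy labeled phrases (invented for illustration only).
data = [("Tor1", 1), ("Bcd", 1), ("AmUSP2", 1),
        ("pathway", 0), ("expression", 0), ("the", 0)]
w = train(data)
print(p_gene(w, "Hsp70") > 0.5)    # capitalized and contains digits
print(p_gene(w, "anterior") > 0.5)
```

With more than two classes the same model generalizes by keeping one weight vector per class and normalizing with Z(x) over all classes.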
Special Challenges
• Gene name disambiguation
• Domain adaptation
Gene Name Disambiguation
• Gene names can be common English words:
for (foraging), in (inturned), similar (sima),
yellow (y), black (b)…
• Solution:
– Disambiguate by looking at the context of the
candidate word
– Train a classifier
Discriminative Neighbor Words
Sample Disambiguation Results

[Classifier scores are shown in brackets after each ambiguous
occurrence; higher scores indicate the gene sense.]

"black":
  the cuticular melanization phenotype of black [-2.780] flies is
  rescued by beta-alanine but beta-alanine production by aspartate
  decarboxylation was reported to be normal in assays of black
  [+9.759] mutants and although …

"foraging", "for":
  ... affect complex behaviors such as locomotion and foraging
  [-1.468]. The foraging [+3.359] (for [+5.497]) gene encodes a pkg
  in drosophila melanogaster here we demonstrate a function for
  [-0.582] the for [+5.980] gene in sensory responsiveness and …
Problem of Domain Overfitting

• Ideal setting (training and test data from the same domain):
  the gene name recognizer reaches 54.1%
• Realistic setting (test domain differs from the training domain;
  gene names such as wingless, daughterless, eyeless, apexless, …):
  performance drops to 28.1%
Solution:
Learn Generalizable Features
…decapentaplegic and wingless are expressed in
analogous patterns in each primordium of…
…that CD38 is expressed by both neurons and glial
cells…that PABPC5 is expressed in fetal brain and in
a range of adult tissues.
Generalizable Feature: "w+2 = expressed" (the word two positions
after the candidate is "expressed")
Generalizability-Based Feature Ranking

[Figure: features are ranked separately within each training domain.
Domain-specific features such as "-less" rank highly in only one
domain, while features such as "expressed" rank consistently well
across domains. The per-domain rankings are combined into
generalizability scores (e.g., 0.125 for "-less" vs. 0.167 for
"expressed"), and features are re-ranked by these scores.]
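One way to sketch the idea: rank features within each training domain, then score each feature by its worst rank across domains. The 1/worst-rank combination and the toy per-domain rankings below are assumptions for illustration, not the exact formula from the slides:

```python
# Generalizability-based feature ranking, sketched: a feature is only
# as good as its WORST (largest) rank across the training domains.
def generalizability_scores(domain_rankings):
    all_feats = set()
    for r in domain_rankings:
        all_feats.update(r)
    scores = {}
    for f in all_feats:
        worst = max(r.index(f) + 1 if f in r else len(r) + 1
                    for r in domain_rankings)
        scores[f] = 1.0 / worst
    return scores

# "-less" ranks highly only in the fly domain; "expressed" ranks well
# everywhere (toy rankings, invented).
fly   = ["-less", "gene", "expressed", "protein"]
mouse = ["gene", "expressed", "protein", "-less"]
yeast = ["expressed", "gene", "protein", "-less"]

scores = generalizability_scores([fly, mouse, yeast])
print(scores["expressed"] > scores["-less"])  # cross-domain feature wins
```

Training a recognizer on only the top-scoring features then favors clues that transfer to unseen organisms.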
Effectiveness of Domain Adaptation

• Standard learning: train a gene name recognizer on Fly + Mouse,
  test on Yeast — 63.3%
• Domain adaptive learning: same setting — 75.9%
More Results on Domain Adaptation

  Exp     Method     Precision  Recall   F1
  F+M→Y   Baseline   0.557      0.466    0.508
          Domain     0.575      0.516    0.544
          % Imprv.   +3.2%      +10.7%   +7.1%
  F+Y→M   Baseline   0.571      0.335    0.422
          Domain     0.582      0.381    0.461
          % Imprv.   +1.9%      +13.7%   +9.2%
  M+Y→F   Baseline   0.583      0.097    0.166
          Domain     0.591      0.139    0.225
          % Imprv.   +1.4%      +43.3%   +35.5%

• Text data from BioCreAtIvE (Medline)
• 3 organisms (Fly, Mouse, Yeast)
Extraction Example 2:
Genetic Interaction Relation

Is there a genetic interaction relation between the two gene
mentions (Bcd, hunchback) here?

Bcd regulates the expression of the maternal and zygotic gene
hunchback (hb) that shows a step-like-function expression pattern,
in the anterior half of the egg.
Challenges
• No/little training data
• What features to use?
Solution: Pseudo Training Data

Gene pair: hunchback, bicoid (Bcd) — a pair known to interact, so
the sentence below is labeled as a positive (+) example:

These results uncovered an antagonism between hunchback and bicoid
at the anterior pole, whereas the two genes are known to act in
concert for most anterior segmented development.
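The pseudo-labeling step can be sketched as follows, assuming a curated interaction list (invented here) plays the role of the knowledge base:

```python
# Pseudo training data, sketched: a sentence mentioning a gene pair that
# a curated resource lists as interacting becomes a positive example;
# other pairs become negatives. The interaction list and sentences are
# invented stand-ins for a resource such as FlyBase.
known_interactions = {frozenset(("bicoid", "hunchback"))}

def pseudo_label(pair):
    return "+" if frozenset(pair) in known_interactions else "-"

sentences = [
    ("these results uncovered an antagonism between hunchback and bicoid "
     "at the anterior pole", ("hunchback", "bicoid")),
    ("torso and wingless were both sequenced in this study",
     ("torso", "wingless")),
]
for text, pair in sentences:
    print(pair, pseudo_label(pair))
```

The labeled sentences then train an ordinary classifier over the wBetween/wBefore/wAfter word features, with no manual annotation.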
Pseudo Training Data Works Reasonably Well

[Figure: precision–recall curves comparing feature sets — wBetween,
wBetween+wBefore, wBetween+wAfter, wBetween+wBefore+wAfter, and
wBetween+wBefore+wAfter+iword. Using all features works the best.]
Large-Scale Entity/Relation Extraction

• Entity annotation

  Entity Type   Resource           Method
  Gene          NCBI, FlyBase, …   Dictionary string search + machine learning
  Anatomy       FlyBase            Dictionary string search
  Chemical      MeSH, Biosis, …    Dictionary string search
  Behavior                         "x x behavior" pattern search

• Relation extraction

  Relation Type     Method
  Regulatory        Pre-defined pattern + machine learning
  Expressed In      Co-occurrence + relevant keywords
  Gene – Behavior   Co-occurrence
  Gene – Chemical   Co-occurrence
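A sketch of dictionary string search plus the "x x behavior" pattern search; the dictionaries and sentence are invented stand-ins for the NCBI/FlyBase/MeSH resources:

```python
import re

# Dictionary string search plus the "x x behavior" pattern, sketched.
# Real runs would load gene names from NCBI/FlyBase, anatomy terms from
# FlyBase, chemicals from MeSH/Biosis, etc.
gene_dict = {"bcd", "hb", "tor"}
anatomy_dict = {"embryo", "egg"}

def annotate(text):
    low = text.lower()
    ents = []
    for m in re.finditer(r"\w+", low):  # dictionary lookups per token
        if m.group() in gene_dict:
            ents.append((m.group(), "Gene"))
        elif m.group() in anatomy_dict:
            ents.append((m.group(), "Anatomy"))
    # "x x behavior": two words immediately preceding "behavior"
    for m in re.finditer(r"(\w+ \w+) behavior", low):
        ents.append((m.group(1) + " behavior", "Behavior"))
    return ents

ents = annotate("Bcd is expressed in the embryo and affects nest cleaning behavior")
print(ents)
```

In the full system the dictionary pass is only the first step for genes; the machine-learned tagger and disambiguator filter its candidates.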
Part 2: Semantic Navigation
Space-Region Navigation

[Figure: users work with literature spaces (e.g., Fly, Bee, Bird,
Behavior) and with regions/topics (e.g., Fly Rover, Bee Forager,
Bird Singing). EXTRACT derives topic regions from a space; MAP
projects a topic/region onto a space; SWITCHING moves between
spaces. Both spaces and regions can be combined by intersection,
union, etc.]
General Approach: Language Models

• Topic = word distribution
• Modeling text in a space with mixture models of multinomial
  distributions
• Text mining = parameter estimation + inferences
• Matching = computing similarity between word distributions
• Users can "control" a model by specifying topic preferences
A Sample Topic & Corresponding Space

Word distribution (language model):

  filaments    0.0410238
  muscle       0.0327107
  actin        0.0287701
  z            0.0221623
  filament     0.0169888
  myosin       0.0153909
  thick        0.00968766
  thin         0.00926895
  sections     0.00924286
  er           0.00890264
  band         0.00802833
  muscles      0.00789018
  antibodies   0.00736094
  myofibrils   0.00688588
  flight       0.00670859
  images       0.00649626

Meaningful labels: actin filaments, flight muscle, flight muscles

Example documents:
• actin filaments in honeybee-flight muscle move collectively
• arrangement of filaments and cross-links in the bee flight muscle
  z disk by image analysis of oblique sections
• identification of a connecting filament protein in insect
  fibrillar flight muscle
• the invertebrate myosin filament: subfilament arrangement of the
  solid filaments of insect flight muscles
• structure of thick filaments from insect flight muscle
MAP: Topic/Region → Space

• MAP: Use the topic/region description as a query to search a
  given space
• Retrieval algorithm:
  – Query word distribution: p(w|θ_Q)
  – Document word distribution: p(w|θ_D)
  – Score a document by the similarity of Q and D:

    score(Q, D) = -D(θ_Q || θ_D)
                = -Σ_{w ∈ Vocabulary} p(w|θ_Q) · log [ p(w|θ_Q) / p(w|θ_D) ]

• Leverage existing retrieval toolkits: Lemur/Indri
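The scoring formula can be sketched directly; the additive-epsilon smoothing of p(w|θ_D) below is a crude stand-in for the proper smoothing a toolkit like Lemur/Indri would apply:

```python
import math

# score(Q, D) = -D(theta_Q || theta_D), as on the slide.
def kl_score(q_dist, d_dist, vocab, eps=1e-6):
    score = 0.0
    for w in vocab:
        pq = q_dist.get(w, 0.0)
        if pq == 0.0:
            continue  # 0 * log(0/x) = 0 by convention
        pd = d_dist.get(w, 0.0) + eps  # crude smoothing (an assumption)
        score -= pq * math.log(pq / pd)
    return score

vocab = {"muscle", "actin", "filament", "forager"}
query = {"muscle": 0.5, "actin": 0.5}
doc_on = {"muscle": 0.4, "actin": 0.4, "filament": 0.2}  # on-topic
doc_off = {"forager": 0.9, "muscle": 0.1}                # off-topic
print(kl_score(query, doc_on, vocab) > kl_score(query, doc_off, vocab))
```

Ranking documents by this score reduces to standard query-likelihood retrieval when θ_Q is the empirical query distribution.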
EXTRACT: Space → Topic/Region

• Assume k topics, each represented by a word distribution
• Use a k-component mixture model to fit the documents in a given
  space (EM algorithm)
• The estimated k component word distributions are taken as k topic
  regions

Likelihood:

  log p(C|Λ) = Σ_{D ∈ C} Σ_{i=1}^{|D|} log [ λ·p(D_i|θ_B)
                 + (1-λ)·Σ_{j=1}^{k} π_j·p(D_i|θ_j) ]

Maximum likelihood estimator: Λ* = arg max_Λ p(C|Λ)
Bayesian estimator: Λ* = arg max_Λ p(Λ|C) = arg max_Λ p(C|Λ)·p(Λ)
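The EM fitting can be sketched as follows; corpus-level mixing weights π and the toy documents are simplifying assumptions:

```python
import random

# EM sketch for the mixture on the slide: each word of each document is
# drawn from the background theta_B with probability lam, otherwise from
# topic j with probability pi[j].
def em(docs, vocab, k, lam=0.5, iters=50, seed=0):
    rng = random.Random(seed)
    topics = []
    for _ in range(k):  # random initialization, then normalize
        t = {w: rng.random() + 0.1 for w in vocab}
        z = sum(t.values())
        topics.append({w: t[w] / z for w in vocab})
    pi = [1.0 / k] * k
    counts = {w: 0 for w in vocab}  # background = corpus frequencies
    for d in docs:
        for w in d:
            counts[w] += 1
    total = sum(counts.values())
    bg = {w: counts[w] / total for w in vocab}
    for _ in range(iters):
        new_t = [{w: 1e-9 for w in vocab} for _ in range(k)]
        new_pi = [1e-9] * k
        for d in docs:
            for w in d:
                p_bg = lam * bg[w]
                p_top = [(1 - lam) * pi[j] * topics[j][w] for j in range(k)]
                norm = p_bg + sum(p_top)
                for j in range(k):  # E-step responsibility -> M-step count
                    r = p_top[j] / norm
                    new_t[j][w] += r
                    new_pi[j] += r
        for j in range(k):
            z = sum(new_t[j].values())
            topics[j] = {w: new_t[j][w] / z for w in vocab}
        zpi = sum(new_pi)
        pi = [p / zpi for p in new_pi]
    return topics, pi

docs = [["muscle", "actin", "gene"], ["muscle", "filament", "gene"],
        ["foraging", "nectar", "gene"], ["foraging", "dance", "gene"]]
vocab = sorted({w for d in docs for w in d})
topics, pi = em(docs, vocab, k=2)
for t in topics:  # top words of each estimated component
    print(sorted(t, key=t.get, reverse=True)[:2])
```

With enough data the two components typically separate the muscle and foraging vocabularies, while shared words like "gene" are largely absorbed by the background model.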
User-Controlled Exploration: Sample Topic 1

  age          0.0672687
  division     0.0551497
  labor        0.052136
  colony       0.038305
  foraging     0.0357817
  foragers     0.0236658
  workers      0.0191248
  task         0.0190672
  behavioral   0.0189017
  behavior     0.0168805
  older        0.0143466
  tasks        0.013823
  old          0.011839
  individual   0.0114329
  ages         0.0102134
  young        0.00985875
  genotypic    0.00963096
  social       0.00883439

Prior: labor 0.2, division 0.2
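The user prior shown here can be folded into the EM M-step as pseudo-counts (conjugate, Dirichlet-style). A minimal sketch; μ, the prior strength, is an assumed knob and the expected counts are invented:

```python
# Fold a user-specified topic prior into the M-step as pseudo-counts,
# pulling one component toward the seed words.
def m_step_with_prior(expected_counts, prior, mu=10.0):
    total = sum(expected_counts.values()) + mu * sum(prior.values())
    words = set(expected_counts) | set(prior)
    return {w: (expected_counts.get(w, 0.0) + mu * prior.get(w, 0.0)) / total
            for w in words}

counts = {"labor": 1.0, "division": 1.0, "colony": 8.0}  # E-step output (toy)
prior = {"labor": 0.2, "division": 0.2}                  # user preference
theta = m_step_with_prior(counts, prior)
print(theta["labor"] > counts["labor"] / sum(counts.values()))  # boosted
```

Setting μ = 0 recovers the unconstrained maximum-likelihood M-step; larger μ pins the component more firmly to the user's concept.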
User-Controlled Exploration: Sample Topic 2

  behavioral    0.110674
  age           0.0789419
  maturation    0.057956
  task          0.0318285
  division      0.0312101
  labor         0.0293371
  workers       0.0222682
  colony        0.0199028
  social        0.0188699
  behavior      0.0171008
  performance   0.0117176
  foragers      0.0110682
  genotypic     0.0106029
  differences   0.0103761
  polyethism    0.00904816
  older         0.00808171
  plasticity    0.00804363
  changes       0.00794045

Prior: behavioral 0.2, maturation 0.2
Exploit Prior for Concept Switching

Foraging topic in one space:

  foraging      0.142473
  foragers      0.0582921
  forage        0.0557498
  food          0.0393453
  nectar        0.03217
  colony        0.019416
  source        0.0153349
  hive          0.0151726
  dance         0.013336
  forager       0.0127668
  information   0.0117961
  feeder        0.010944
  rate          0.0104752
  recruitment   0.00870751
  individual    0.0086414
  reward        0.00810706
  flower        0.00800705
  dancing       0.00794827
  behavior      0.00789228

The same concept after switching to another space:

  foraging      0.290076
  nectar        0.114508
  food          0.106655
  forage        0.0734919
  colony        0.0660329
  pollen        0.0427706
  flower        0.0400582
  sucrose       0.0334728
  source        0.0319787
  behavior      0.0283774
  individual    0.028029
  rate          0.0242806
  recruitment   0.0200597
  time          0.0197362
  reward        0.0196271
  task          0.0182461
  sitter        0.00604067
  rover         0.00582791
  rovers        0.00306051
Part 3: Entity Summarization
Automated Gene Summarization?

Goal: a multi-aspect text summary of a gene (e.g., gene Abl)
covering:
• Gene product
• Expression
• Sequence
• Interactions
• Mutations
• General functions

A two-stage approach.
General Entity Summarizer

• Task: Given any entity and k aspects to summarize, generate a
  semi-structured summary
• Assumption: Training sentences available for each aspect
• Method:
  – Train a recognizer for each aspect
  – Given an entity, retrieve sentences relevant to the entity
  – Classify each sentence into one of the k aspects
  – Choose the best sentences in each category
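The method above can be sketched end-to-end; the keyword-based aspect classifier below is a toy stand-in for the trained per-aspect recognizers, and the sentences are invented:

```python
# Two-stage summarizer, sketched: retrieve sentences mentioning the
# entity, classify each into an aspect, keep the best per aspect.
aspect_keywords = {
    "expression": {"expressed", "expression"},
    "interactions": {"interacts", "binds", "interaction"},
}

def classify(sentence):
    words = set(sentence.lower().split())
    scores = {a: len(words & kw) for a, kw in aspect_keywords.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def summarize(entity, sentences, per_aspect=1):
    summary = {a: [] for a in aspect_keywords}
    for s in sentences:
        if entity.lower() in s.lower():  # stage 1: retrieval
            aspect = classify(s)         # stage 2: classification
            if aspect:
                summary[aspect].append(s)
    return {a: ss[:per_aspect] for a, ss in summary.items()}

sents = ["Abl is expressed in the axons of the embryonic CNS",
         "Abl interacts with the adaptor protein Dab",
         "The weather was fine"]
result = summarize("Abl", sents)
print(result)
```

A real system would also rank the retained sentences within each aspect rather than keep the first match.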
Further Generalizations

• Task: Given any entity and k pre-specified aspects to summarize,
  generate a semi-structured summary
• Assumption: Training sentences available for each aspect
• Method: a new method based on a mixture model and regularized
  optimization, replacing the per-aspect recognizers while keeping
  the same pipeline: retrieve sentences relevant to the entity,
  assign each sentence to one of the k aspects, and choose the best
  sentences in each category
Part 4. Function Analysis
Annotating Gene Lists:
GO Terms vs. Literature Mining
Limitations of GO annotations:
- Labor-intensive
- Limited Coverage
Literature Mining:
- Automatic
- Flexible exploration in the
entire literature space
Overview of Gene List Annotator

[Pipeline:]
1. Gene group (e.g., Bcd, Cad, …, Tll)
2. For any gene: retrieve its relevant documents (via Entrez Gene)
   → document sets
3. For any term: test its significance → enriched concepts, e.g.:
     Segmentation   56.0
     Pattern        34.2
     Cell_cycle     25.6
     Development    22.1
     Regulation     20.4
     …
4. Interactive analysis
Intuition for Literature-based Annotation

Term counts in the documents relevant to each gene:

  Gene                TPI1  GPM1  PGK1  TDH3  TDH2
  protein_kinase         0     0     2     0     0
  decarboxylase         10     0    10     7     6
  protein               39    26    65    44    33
  stationary_phase       2     7     3     4     2
  energy_metabolism      4     5     5     8     0
  oscillation            0     0     0     0     1
Likelihood Ratio Test with 2-Poisson Mixture Model

• Reference distribution: Poisson(λ0; d)
• Dataset distribution: Poisson(λ; d)

Per-document term counts are modeled as Poisson; the test asks
whether the rate λ in the gene group's documents significantly
exceeds the reference rate λ0.
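A sketch of the enrichment test, simplified here to a two-sample Poisson likelihood-ratio test; the reference counts below are invented, while the group counts come from the table above:

```python
import math

# Two-sample Poisson LRT, sketched: group documents ~ Poisson(lam),
# reference documents ~ Poisson(lam0); the statistic compares separate
# rates against one shared rate (a simplification of the slide's
# 2-Poisson mixture formulation).
def poisson_ll(counts, lam):
    return sum(c * math.log(lam) - lam - math.lgamma(c + 1) for c in counts)

def lrt(group_counts, ref_counts):
    lam_g = sum(group_counts) / len(group_counts)
    lam_r = sum(ref_counts) / len(ref_counts)
    pooled = group_counts + ref_counts
    lam0 = sum(pooled) / len(pooled)
    ll_alt = poisson_ll(group_counts, lam_g) + poisson_ll(ref_counts, lam_r)
    return 2 * (ll_alt - poisson_ll(pooled, lam0))

enriched = lrt([10, 0, 10, 7, 6], [1, 0, 2, 1, 0])  # "decarboxylase"
flat = lrt([2, 7, 3, 4, 2], [2, 3, 2, 4, 3])        # "stationary_phase"
print(enriched > flat)
```

Terms are then ranked by the statistic, which is what the enrichment scores (Segmentation 56.0, Pattern 34.2, …) on the earlier slide represent.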
Agreement with GO-based Method

• Gene list: 93 genes up-regulated by the manganese treatment

  GO Theme                Related Annotator terms
  neurogenesis            axon guidance, growth cone, commissural
                          axon, proneural gene
  synaptic transmission   synaptic vesicle, neurotransmitter
                          release, synaptic transmission, sodium
                          channel
  cytoskeletal protein    alpha tubulin, actin filament
  cell communication      tight junction, heparan sulfate
                          proteoglycan
Discovering Novel Themes

• Gene list: 69 genes up-regulated by the methoprene treatment

  Theme                   Annotator terms
  muscle                  flight muscle, muscle myosin, nonmuscle
                          myosin, light chain, myosin ii, thick
                          filament, thin filament, striated muscle
  synaptic transmission   neurotransmitter release, synaptic
                          transmission, synaptic vesicle
  signaling pathway       notch signal
Summary

[The architecture revisited, with the talk's four parts:]
• Part 1. Information Extraction (content analysis)
• Part 2. Navigation Support (space/region manager, search engine)
• Part 3. Entity Summarization (gene summarizer)
• Part 4. Function Analysis (question answering, function annotator)

Underlying techniques: machine learning + language models +
minimum human effort.

General and scalable, but there's room for deeper semantics.
Looking Ahead…
• Knowledge integration, inferences
• Support for hypothesis formulation and
testing
Exploring Knowledge Space

[Figure: a typed graph of genes (A1, A1', A2, A3, A4, A4', A5) and
behaviors (B1–B4) connected by edges for co-occurrence (per
organism: fly, bee, mosquito), isa, orthology (orth), and
regulation (reg).]

Sample exploration:
1. X = NeighborOf(B4, Behavior, {co-occur, isa}) → {B1, B2, B3}
2. Y = NeighborOf(X, Gene, {co-occur, orth}) → {A1, A1', A2, A3}
3. Y = Y + {A5, A6} → {A1, A1', A2, A3, A5, A6}
4. Z = NeighborOf(Y, Gene, {reg}) → {A4, A4'}
5. P = PathBetween(Z, B4, {co-occur, reg, isa})
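The NeighborOf operator can be sketched over a small typed graph; the adjacency list below is an invented stand-in for the extracted relations (edges are treated as undirected for neighborhood queries):

```python
# NeighborOf over a small typed knowledge graph, sketched. Node types
# and edge labels follow the slide; the concrete edges are invented.
edges = [
    ("B4", "B1", "co-occur"), ("B4", "B2", "isa"), ("B4", "B3", "co-occur"),
    ("B1", "A1", "co-occur"), ("B2", "A2", "co-occur"),
    ("B3", "A3", "co-occur"), ("A1", "A1p", "orth"),
    ("A4", "A1", "reg"), ("A4p", "A2", "reg"),
]
node_type = {n: ("Gene" if n.startswith("A") else "Behavior")
             for e in edges for n in e[:2]}

def neighbor_of(nodes, wanted_type, labels):
    out = set()
    for a, b, lab in edges:
        if lab not in labels:
            continue
        for src, dst in ((a, b), (b, a)):  # undirected neighborhood
            if src in nodes and node_type[dst] == wanted_type:
                out.add(dst)
    return out

X = neighbor_of({"B4"}, "Behavior", {"co-occur", "isa"})
Y = neighbor_of(X, "Gene", {"co-occur", "orth"})
Z = neighbor_of(Y, "Gene", {"reg"})
print(sorted(X), sorted(Y), sorted(Z))
```

Step 3 of the slide (Y = Y + {A5, A6}) is plain set union, and PathBetween would be a label-constrained path search over the same graph.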
Full-Fledged BeeSpace V5

[Planned system: biomedical literature and experiment data analysis
feed entities (gene, behavior, anatomy, chemical) and relations
(orthology, regulatory interaction, …); expert knowledge
contributes additional entities and relations; on top sit
inferences and hypothesis formulation & testing.]
Thanks to
Xin He (UIUC)
Jing Jiang (SMU)
Yanen Li (UIUC)
Xu Ling (UIUC)
Yue Lu (UIUC)
Qiaozhu Mei (UIUC/Michigan)
& Bruce Schatz (PI, BeeSpace)
Thank You!