Some Natural Language Processing (NLP) Tasks


Parsing A Bacterial Genome
Mark Craven
Department of Biostatistics & Medical Informatics
University of Wisconsin
U.S.A.
[email protected]
www.biostat.wisc.edu/~craven
The Task
Given: a bacterial genome
Do: use computational methods to predict a “parts list” of regulatory elements
Outline
1. background on bacterial gene regulation
2. background on probabilistic language models
3. predicting transcription units using probabilistic language models
4. augmenting training with “weakly” labeled examples
5. refining the structure of a stochastic context free grammar
The Central Dogma of Molecular Biology
Transcription in Bacteria
Operons in Bacteria
[diagram: promoter – gene – gene – gene – terminator, transcribed into a single mRNA]
• operon: sequence of one or more genes transcribed as a unit under some conditions
• promoter: “signal” in DNA indicating where to start transcription
• terminator: “signal” indicating where to stop transcription
The Task Revisited
Given:
– DNA sequence of the E. coli genome
– coordinates of known/predicted genes
– known instances of operons, promoters, terminators
Do:
– learn models from known instances
– predict a complete catalog of operons, promoters, terminators for the genome
Our Approach: Probabilistic Language Models
1. write down a “grammar” for elements of interest (operons, promoters, terminators, etc.) and relations among them
2. learn probability parameters from known instances of these elements
3. predict new elements by “parsing” uncharacterized DNA sequence
Transformational Grammars
• a transformational grammar characterizes a set of legal strings
• the grammar consists of
– a set of abstract nonterminal symbols: {S, C1, C2, C3, C4}
– a set of terminal symbols (those that actually appear in strings): {a, c, g, t}
– a set of productions:
C1 → t C2
C2 → a C3
C2 → g C4
C3 → a
C3 → g
C4 → a
A Grammar for Stop Codons
S → C1
C1 → t C2
C2 → a C3
C2 → g C4
C3 → a
C3 → g
C4 → a
• this grammar can generate the 3 stop codons: taa, tag, tga
• with a grammar we can ask questions like
– what strings are derivable from the grammar?
– can a particular string be derived from the grammar?
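As a concrete illustration (not part of the original slides), the sketch below encodes this grammar as a small Python data structure and answers the first question by brute-force enumeration; the names GRAMMAR and derive are ours.

```python
# A minimal sketch: the stop-codon grammar as a dictionary, plus a
# brute-force enumeration of all strings derivable from S.
GRAMMAR = {
    "S":  [["C1"]],
    "C1": [["t", "C2"]],
    "C2": [["a", "C3"], ["g", "C4"]],
    "C3": [["a"], ["g"]],
    "C4": [["a"]],
}
TERMINALS = {"a", "c", "g", "t"}

def derive(symbols):
    """Expand the leftmost nonterminal recursively; yield terminal strings."""
    for i, sym in enumerate(symbols):
        if sym not in TERMINALS:                  # found a nonterminal
            for rhs in GRAMMAR[sym]:              # try each of its productions
                yield from derive(symbols[:i] + rhs + symbols[i + 1:])
            return
    yield "".join(symbols)                        # all symbols are terminals

print(sorted(derive(["S"])))   # ['taa', 'tag', 'tga']
```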
The Parse Tree for tag
S → C1
C1 → t C2
C2 → a C3
C2 → g C4
C3 → a
C3 → g
C4 → a
[parse tree for tag: S dominates C1; C1 yields t and C2; C2 yields a and C3; C3 yields g]
A Probabilistic Version of the Grammar
S → C1 (1.0)
C1 → t C2 (1.0)
C2 → a C3 (0.7)
C2 → g C4 (0.3)
C3 → a (0.2)
C3 → g (0.8)
C4 → a (1.0)
• each production has an associated probability
• the probabilities for productions with the same left-hand side sum to 1
• this grammar has a corresponding Markov chain model
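For example, under the probability assignment shown above, the probability of the (single) parse of tag is the product of the probabilities of the productions it uses: Pr(tag) = Pr(S → C1) · Pr(C1 → t C2) · Pr(C2 → a C3) · Pr(C3 → g) = 1.0 × 1.0 × 0.7 × 0.8 = 0.56.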
A Probabilistic Context Free Grammar for Terminators
START → PREFIX STEM_BOT1 SUFFIX
PREFIX → B B B B B B B B B
STEM_BOT1 → tl STEM_BOT2 tr
STEM_BOT2 → tl* STEM_MID tr* | tl* STEM_TOP2 tr*
STEM_MID → tl* STEM_MID tr* | tl* STEM_TOP2 tr*
STEM_TOP2 → tl* STEM_TOP1 tr*
STEM_TOP1 → tl LOOP tr
LOOP → B B LOOP_MID B B
LOOP_MID → B LOOP_MID | ε
SUFFIX → B B B B B B B B B
B → a | c | g | u
t = {a, c, g, u}, t* = {a, c, g, u, ε}
[figure: an example terminator RNA annotated with its prefix, stem, loop, and suffix regions]
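To illustrate why a context-free grammar (rather than a Markov chain) suits terminators, here is a small Python sketch, with made-up extension probabilities rather than the trained parameters, that generates stem-loop strings in which the two halves of the stem are complementary; nested productions of the form STEM → tl STEM tr are what capture this long-range dependency.

```python
# A minimal sketch (hypothetical probabilities, not the trained model):
# an SCFG-style generator in which nested productions emit complementary
# base pairs at long range, as in the stem of a terminator hairpin.
import random

PAIRS = {"a": "u", "u": "a", "c": "g", "g": "c"}
P_EXTEND_STEM = 0.7    # assumed probability of adding another stem base pair
P_EXTEND_LOOP = 0.5    # assumed probability of adding another loop base

def sample_loop():
    loop = random.choice("acgu")
    while random.random() < P_EXTEND_LOOP:
        loop += random.choice("acgu")
    return loop

def sample_stem_loop():
    left = random.choice("acgu")
    inner = sample_stem_loop() if random.random() < P_EXTEND_STEM else sample_loop()
    return left + inner + PAIRS[left]    # STEM -> tl STEM tr (complementary pair)

print(sample_stem_loop())   # e.g. 'gcaucuaugc': the stem halves pair up
```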
Inference with Probabilistic Grammars
• for a given string there may be many parses, but some are more probable than others
• we can do prediction by finding relatively high probability parses
• there are dynamic programming algorithms for finding the most probable parse efficiently (see the sketch below)
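One standard such algorithm is the probabilistic CYK (Viterbi-parse) computation; the sketch below is ours, not the authors' code, and assumes the grammar has been converted to Chomsky normal form (lexical rules A → a and binary rules A → B C).

```python
# A minimal sketch: probabilistic CYK, returning the probability of the most
# probable parse of a string under a grammar in Chomsky normal form.
from collections import defaultdict

def best_parse_prob(seq, lexical, binary, start="S"):
    """lexical: {(A, a): p} for rules A -> a
       binary:  {(A, B, C): p} for rules A -> B C"""
    n = len(seq)
    best = defaultdict(float)          # best[(i, j, A)] = best prob of seq[i:j] from A
    for i, ch in enumerate(seq):       # length-1 spans from lexical rules
        for (A, a), p in lexical.items():
            if a == ch:
                best[(i, i + 1, A)] = max(best[(i, i + 1, A)], p)
    for span in range(2, n + 1):       # longer spans from binary rules
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point between the two children
                for (A, B, C), p in binary.items():
                    score = p * best[(i, k, B)] * best[(k, j, C)]
                    best[(i, j, A)] = max(best[(i, j, A)], score)
    return best[(0, n, start)]         # 0.0 if no parse exists
```

Keeping backpointers alongside the table recovers the parse itself; replacing the maximizations with sums gives the inside probabilities used by the Inside-Outside (EM) algorithm mentioned later.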
Learning with Probabilistic Grammars
• in this work, we write down the productions by hand, but learn the probability parameters
• to learn the probability parameters, we align sequences of a given class (e.g. terminators) with the relevant part of the grammar
• when there is hidden state (i.e. the correct parse is not known), we use Expectation Maximization (EM) algorithms
Outline
1. background on bacterial gene regulation
2. background on probabilistic language models
3. predicting transcription units using probabilistic language models [Bockhorst et al., ISMB/Bioinformatics ‘03]
4. augmenting training with “weakly” labeled examples
5. refining the structure of a stochastic context free grammar
A Model for Transcription Units
[diagram: the transcription unit model, a chain of component submodels over untranscribed and transcribed regions: promoter (-35 motif, spacer, -10 motif, post-promoter), TSS, start ORF, intra-ORF regions, internal ORFs, UTR, last ORF, pre-terminator, end spacers, and RIT (prefix, stem-loop, suffix) and RDT (prefix, suffix) terminators; components are represented by position-specific Markov models, semi-Markov models, and an SCFG]
The Components of the Model
• position-specific Markov models represent fixed-length sequence motifs
  Pr(x_{i:j} | \theta) = Pr(x_i | \theta_1) \prod_{l=i+1}^{j} Pr(x_l | x_{l-1}, \theta_{l-i+1})
• semi-Markov models represent variable-length sequences
  Pr(x_{i:j} | \theta) = Pr(L | \theta_L) \cdot Pr(x_i | \theta_0) \prod_{l=i+1}^{j} Pr(x_l | x_{l-1}, \theta_0)
• stochastic context free grammars (SCFGs) represent variable-length sequences with long-range dependencies
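As an illustration of the first formula, here is a short Python sketch (with hypothetical, uniform parameters rather than trained ones) that scores a fixed-length motif under a position-specific Markov model: the first base uses a position-1 distribution, and each later base is conditioned on its predecessor with position-specific parameters.

```python
# A minimal sketch (hypothetical parameters): scoring a fixed-length motif
# under a position-specific first-order Markov model.
def motif_prob(x, initial, transition):
    """initial: {base: p} for the first position
       transition: one {(prev_base, base): p} table per subsequent position"""
    p = initial[x[0]]                              # Pr(x_i | theta_1)
    for l in range(1, len(x)):
        p *= transition[l - 1][(x[l - 1], x[l])]   # Pr(x_l | x_{l-1}, theta_{l-i+1})
    return p

# e.g. a length-3 motif under uniform parameters: 0.25 * 0.25 * 0.25 = 0.015625
uniform_init = {b: 0.25 for b in "acgt"}
uniform_trans = [{(prev, b): 0.25 for prev in "acgt" for b in "acgt"} for _ in range(2)]
print(motif_prob("tat", uniform_init, uniform_trans))
```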
Gene Expression Data
• in addition to DNA sequence data, we also use expression data to make our parses
• microarrays enable the simultaneous measurement of the transcription levels of thousands of genes
[figure: microarray data matrix, genes/sequence positions × experimental conditions]
Incorporating Expression Data
• our models parse two sequences simultaneously
– the DNA sequence of the genome
– a sequence of expression measurements associated with particular sequence positions
[figure: a DNA sequence aligned with per-position expression measurements]
• the expression data is useful because it provides information about which subsequences look like they are transcribed together
Predictive Accuracy for Operons
[bar chart: sensitivity, specificity, and precision for operon prediction using sequence only, expression only, and sequence+expression]
sensitivity = TP / (TP + FN)
specificity = TN / (TN + FP)
precision = TP / (TP + FP)
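For reference, a small sketch of the three metrics reported in this and the following accuracy charts, computed directly from confusion-matrix counts:

```python
# A minimal sketch: the evaluation metrics used in the accuracy charts.
def metrics(tp, fp, tn, fn):
    return {
        "sensitivity": tp / (tp + fn),   # fraction of true elements recovered
        "specificity": tn / (tn + fp),   # fraction of negatives correctly rejected
        "precision":   tp / (tp + fp),   # fraction of predictions that are correct
    }
```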
Predictive Accuracy for Promoters
[bar chart: sensitivity, specificity, and precision for promoter prediction using sequence only, expression only, and sequence+expression; metrics defined as above]
Predictive Accuracy for Terminators
[bar chart: sensitivity, specificity, and precision for terminator prediction using sequence only, expression only, and sequence+expression; metrics defined as above]
Accuracy of Promoter & Terminator Localization
[ROC curves for terminator prediction: true positive rate TP/(TP+FN) vs. false positive rate FP/(FP+TN), comparing the SCFG, the SCFG without training, a complementarity matrix, and an interpolated Markov model]
Outline
1. background on bacterial gene regulation
2. background on probabilistic language models
3. predicting transcription units using probabilistic language models
4. augmenting training data with “weakly” labeled examples [Bockhorst & Craven, ICML ’02]
5. refining the structure of a stochastic context free grammar
Key Idea: Weakly Labeled Examples
• regulatory elements are inter-related
– promoters precede operons
– terminators follow operons
– etc.
• relationships such as these can be exploited to augment training sets with “weakly labeled” examples
Inferring “Weakly” Labeled Examples
[figure: a genome region with genes g1–g5 annotated on the two DNA strands]
• if we know that an operon ends at g4, then there must be a terminator shortly downstream
• if we know that an operon begins at g2, then there must be a promoter shortly upstream
• we can exploit relations such as this to augment our training sets (a sketch follows)
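A small sketch of this inference step (the window size and function name are ours, chosen for illustration): given the coordinates of known operons, emit a weakly labeled promoter region just upstream of each operon and a weakly labeled terminator region just downstream.

```python
# A minimal sketch (hypothetical window size): deriving weakly labeled
# promoter and terminator regions from known operon boundaries.
def weak_examples(operons, window=200):
    """operons: list of (start, end) coordinates with start < end.
    Returns (promoter_regions, terminator_regions) as coordinate intervals."""
    promoters, terminators = [], []
    for start, end in operons:
        promoters.append((max(0, start - window), start))   # shortly upstream
        terminators.append((end, end + window))             # shortly downstream
    return promoters, terminators

# e.g. an operon spanning 1200-4800 yields a weakly labeled promoter region
# (1000, 1200) and a weakly labeled terminator region (4800, 5000)
print(weak_examples([(1200, 4800)]))
```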
Strongly vs. Weakly Labeled Terminator Examples
strongly labeled terminator (annotated with sub-class: rho-independent, end of stem-loop, and extent of terminator):
gtccgttccgccactattcactcatgaaaatgagttcagagagccgcaagatttttaattttgcggtttttttgtatttgaattccaccatttctctgttcaatg
weakly labeled terminator (the same sequence region, with no annotations):
gtccgttccgccactattcactcatgaaaatgagttcagagagccgcaagatttttaattttgcggtttttttgtatttgaattccaccatttctctgttcaatg
Training the Terminator Models: Strongly Labeled Examples
[diagram: negative examples train the negative model; rho-independent examples train the rho-independent terminator model; rho-dependent examples train the rho-dependent terminator model]
Training the Terminator Models: Weakly Labeled Examples
[diagram: negative examples train the negative model; weakly labeled examples train a combined terminator model that incorporates the rho-independent and rho-dependent terminator models]
Do Weakly Labeled Terminator Examples Help?
• task: classification of terminators (both sub-classes) in E. coli K-12
• train SCFG terminator model using:
– S strongly labeled examples and
– W weakly labeled examples
• evaluate using area under ROC curves (see the sketch below)
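The area under the ROC curve can be computed without explicitly drawing the curve; a minimal sketch using the rank-based (Mann-Whitney) formulation, assuming each candidate terminator receives a real-valued score from the model:

```python
# A minimal sketch: area under the ROC curve as the probability that a
# randomly chosen positive example outscores a randomly chosen negative one.
def auc(pos_scores, neg_scores):
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

print(auc([0.9, 0.8, 0.4], [0.7, 0.3]))   # 5/6 ≈ 0.83
```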
Learning Curves using Weakly Labeled Terminators
[plot: area under ROC curve (0.5–1) vs. number of strong positive examples (0–140), for 0, 25, and 250 weak examples]
Are Weakly Labeled Examples Better than Unlabeled Examples?
• train SCFG terminator model using:
– S strongly labeled examples and
– U unlabeled examples
• vary S and U to obtain learning curves
Training the Terminator Models: Unlabeled Examples
[diagram: unlabeled examples train a combined model built from the negative model, the rho-independent terminator model, and the rho-dependent terminator model]
Learning Curves: Weak vs. Unlabeled
[plots: area under ROC curve (0.6–1) vs. number of strong positive examples (0–120); left panel: weakly labeled, with 0, 25, and 250 weak examples; right panel: unlabeled, with 0, 25, and 250 unlabeled examples]
Are Weakly Labeled Terminators from Predicted Operons Useful?
• train operon model with S labeled operons
• predict operons
• generate W weakly labeled terminators from the W most confident predictions
• vary S and W
Learning Curves using Weakly Labeled Terminators
[plot: area under ROC curve (0.5–1) vs. number of strong positive examples (0–160), for 0, 25, 100, and 200 weak examples]
Outline
1. background on bacterial gene regulation
2. background on probabilistic language models
3. predicting transcription units using probabilistic language models
4. augmenting training with “weakly” labeled examples
5. refining the structure of a stochastic context free grammar [Bockhorst & Craven, IJCAI ’01]
Learning SCFGs
• given the productions of a grammar, we can learn the probabilities using the Inside-Outside algorithm
• we have developed an algorithm that can add new nonterminals & productions to a grammar during learning
• basic idea:
– identify nonterminals that seem to be “overloaded”
– split these nonterminals into two; allow each to specialize
Refining the Grammar in a SCFG
• there are various “contexts” in which each grammar nonterminal may be used
• consider two contexts for the nonterminal w2
context 1 (w1 → A w2 U):
w2 → A w3 U (0.4) | C w3 G (0.1) | G w3 C (0.1) | U w3 A (0.4)
context 2 (w1 → C w2 G):
w2 → A w3 U (0.1) | C w3 G (0.4) | G w3 C (0.4) | U w3 A (0.1)
• if the probabilities for w2 look very different, depending on its context, we add a new nonterminal and specialize w2
Refining the Grammar in a SCFG
• we can compare two probability distributions P and Q using Kullback-Leibler divergence
  H(P || Q) = \sum_i P(x_i) \log \frac{P(x_i)}{Q(x_i)}
P: w2 → A w3 U (0.4) | C w3 G (0.1) | G w3 C (0.1) | U w3 A (0.4)
Q: w2 → A w3 U (0.1) | C w3 G (0.4) | G w3 C (0.4) | U w3 A (0.1)
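A short sketch of this comparison (the split threshold is a hypothetical value, not from the talk): compute the KL divergence between the two context-specific production distributions for w2 shown above, and split the nonterminal when the divergence is large.

```python
# A minimal sketch (threshold is hypothetical): comparing a nonterminal's
# production distributions in two contexts with Kullback-Leibler divergence.
import math

P = {"A w3 U": 0.4, "C w3 G": 0.1, "G w3 C": 0.1, "U w3 A": 0.4}   # context 1
Q = {"A w3 U": 0.1, "C w3 G": 0.4, "G w3 C": 0.4, "U w3 A": 0.1}   # context 2

def kl(p, q):
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)

divergence = kl(P, Q)
print(round(divergence, 3))        # ~0.832 (natural log)
if divergence > 0.5:               # hypothetical split threshold
    print("split w2 into two specialized nonterminals")
```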
Learning Terminator SCFGs
• extracted grammar from the literature (~120 productions)
• data set consists of 142 known E. coli terminators, 125 sequences that do not contain terminators
• learn parameters using the Inside-Outside algorithm (an EM algorithm)
• consider adding nonterminals guided by three heuristics
– KL divergence
– chi-squared
– random
SCFG Accuracy After Adding 25 New Nonterminals
[ROC curves: true positive rate vs. false positive rate for the chi-square, KL divergence, and random heuristics, and for the original grammar]
SCFG Accuracy vs. Nonterminals Added
[plot: area under ROC curve (0.5–1) vs. number of additional nonterminals (0–25) for the chi-square, KL divergence, and random heuristics]
Conclusions
• summary
– we have developed an approach to predicting transcription units in bacterial genomes
– we have predicted a complete set of transcription units for the E. coli genome
• advantages of the probabilistic grammar approach
– can readily incorporate background knowledge
– can simultaneously get a coherent set of predictions for a set of related elements
– can be easily extended to incorporate other genomic elements
• current directions
– expanding the vocabulary of elements modeled (genes, transcription factor binding sites, etc.)
– handling overlapping elements
– making predictions for multiple related genomes
Acknowledgements
• Craven Lab: Joe Bockhorst, Keith Noto
• David Page, Jude Shavlik
• Blattner Lab: Fred Blattner, Jeremy Glasner, Mingzhu Liu, Yu Qiu
• funding from the National Science Foundation and the National Institutes of Health