L529 Presentation Computer Model for Promoter Recognition

HISPIG – A Discriminative
Model Refinement Approach
with Iterations for Detecting
Regulatory Regions
Takuma Tsukahara
Milton Taylor Laboratory
• Using microarrays and bioinformatics
technologies to develop better treatments
for HCV (Virahep-C project)
– The only known treatment for HCV is
interferon-alpha (IFN-a), or more recently a
combination of pegylated IFN-a and
Ribavirin
– Interferons were discovered as proteins that
inhibit virus replication and are induced in
mammalian cells in response to virus infection
PBMC Experiment
• PBMCs were isolated from a group of healthy
individuals and treated with IFN-a alone or
with Ribavirin.
• Microarray results showed that expression of a
large number of genes was either up-regulated
or down-regulated
– It was of interest to analyze the upstream regions
of these genes for the presence of motifs (ISRE
and GAS)
Goal of My Project
• Build a computer model that effectively
searches for ISRE and GAS sequences in
human genes
– ISRE and GAS both act as promoter elements
– ISRE drives the expression of most type I
IFN-stimulated genes (and some IFN-gamma-stimulated genes)
– GAS drives the expression of type II
IFN-stimulated genes
– Genes that contain ISRE or GAS are expressed more
strongly with IFN than genes that do not
– Generalize the model so it can search for any motif in the
future
Type I IFN Signal Transduction
IFN/
HETERODIMER
1 2
p48
(IRF-9)
JAK1
TYK2
P
STAT2
STAT1
ISGF3
Transcription
ISRE
CYTOPLASM
NUCLEUS
The Situation
• We have a list of known motifs to refer to
– Numerous ISRE and GAS sequences are known and published
• We have sets of sequences from microarray
experiments that are
– likely to contain motifs…S1 (up-regulated genes)
– unlikely to contain motifs…S2 (down-regulated
genes and random genes)
• To detect motifs, build a model M(+) using the
list of known motifs
– Occurrences of the model will be detected in both
S1 and S2
How to Solve
• Still, it is difficult to accurately predict motifs
– Motifs are short in length, and also divergent
– So, occurrences in S1 and S2 are difficult to
distinguish
• We overcome this problem by a
discriminative model refinement approach
– We make two models:
M(+)…from known motifs
M(-)…from false motifs
– Iteratively refine the models, and separate the
occurrences in S1 and S2
HISPIG
Methods Used
• HMMER
• Log-likelihood Method
• Both with iterative model
refinement approach
HMMER
• Detects ISRE and GAS sequences in three gene groups (up-regulated
genes, down-regulated genes and random genes)
1. Build a model with hmmbuild from a list of known,
functional motifs taken from journals
(result: hmm consensus sequence)
2. Parse the promoter region of each gene
3. Look for occurrences of the consensus within the
promoter regions of the three gene groups with
hmmsearch
Alignment File (.aln)
• List of known motifs – as .aln file
• Example of ISRE:
IP10   AGGTTTCACTTTCCA
ISG15  CAGTTTCGGTTTCCC
Factor CAGTTTCTGTTTCCT
Tla    TAGTTTCACTTTTTG
GBP    TACTTTCAGTTTCAT
ISG20  ATCTTTGACTTTGTC
          ***   ***
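The asterisks under the alignment mark columns in which every listed motif has the same base. A minimal Python sketch of that check, assuming the six ISRE sequences are read as an ungapped alignment; the helper names `conserved_columns` and `star_line` are hypothetical and not part of HMMER or of the original program:

```python
# Known ISRE motifs copied from the alignment slide (name -> sequence).
ISRE_ALIGNMENT = {
    "IP10":   "AGGTTTCACTTTCCA",
    "ISG15":  "CAGTTTCGGTTTCCC",
    "Factor": "CAGTTTCTGTTTCCT",
    "Tla":    "TAGTTTCACTTTTTG",
    "GBP":    "TACTTTCAGTTTCAT",
    "ISG20":  "ATCTTTGACTTTGTC",
}


def conserved_columns(alignment):
    """Return 0-based indices of columns where all sequences share one base."""
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    return [i for i in range(length) if len({s[i] for s in seqs}) == 1]


def star_line(alignment):
    """Build the '*' marker line shown under the alignment."""
    conserved = set(conserved_columns(alignment))
    length = len(next(iter(alignment.values())))
    return "".join("*" if i in conserved else " " for i in range(length))
```

For these six motifs the only fully conserved columns are the two "TTT" blocks, which is the property the base composition tweaking slide relies on.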
Result for INDO gene (2 ISREs)
Alignments of top-scoring domains:
INDO: domain 1 of 2, from 4901 to 4915: E = 0.0097
          *->gggaaa.tgaaacta<-*
             +gaaa+tgaaa c+a
INDO 4901    TAGAAAaTGAAACCA    4915
negative strand
INDO: domain 2 of 2, from 5370 to 5384: E = 0.18
          *->gggaaa.tgaaacta<-*
              g++aa +gaaacta
INDO 5370    TGAGAAaGGAAACTA    5384
Iterative Model Refinement
Model: S1 … Sm
1. look for more occurrences
Model: S1 … Sm+n
– n sequences were significant (may be functional)
– But that is too many to add
2. rank the new sequences
– Let’s add only the k most relevant sequences
3. add top k sequences
Model: S1 … Sm+k
– This is my new model for the next iteration
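The refinement loop on this slide (search, rank, add the top k) can be sketched in a few lines of Python. This is an illustration only: the function name `refine_once`, the `score` callable and the `threshold` parameter are assumptions, and the slide does not say how the original program stores the model or what cutoff it uses.

```python
def refine_once(model_seqs, candidates, score, threshold, k):
    """One refinement round: keep significant new hits, rank them, add the top k.

    model_seqs -- sequences currently in the model (S1 .. Sm)
    candidates -- occurrences returned by a search step
    score      -- callable giving an e-value-like number (lower is better)
    threshold  -- significance cutoff on that number (assumed, not from the slides)
    k          -- how many of the best new sequences to add
    """
    known = set(model_seqs)
    # dict.fromkeys drops duplicate hits while keeping their first-seen order.
    significant = [c for c in dict.fromkeys(candidates)
                   if c not in known and score(c) < threshold]
    ranked = sorted(significant, key=score)   # best (lowest) score first
    return list(model_seqs) + ranked[:k]      # S1 .. Sm+k


# Toy usage with placeholder hits and made-up scores (not real data).
toy_scores = {"hit_a": 0.05, "hit_b": 0.004, "hit_c": 0.4}
new_model = refine_once(["seed_1", "seed_2"], list(toy_scores),
                        toy_scores.get, threshold=0.1, k=1)
```

Capping the additions at k keeps one noisy search round from flooding the model with borderline sequences.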
hmmsearch results (ISRE)
e-value cutoff  iterations  up-regulated  down-regulated  random
e-val < 0.01    1           6             2               0
e-val < 0.01    2           22            4               1
e-val < 0.1     1           53            11              16
e-val < 0.1     2           82            25              28
hmmsearch results (GAS)
e-value cutoff  iterations  up-regulated  down-regulated  random
e-val < 0.1     1           0             0               0
e-val < 0.1     2           23            7               19
e-val < 0.3     1           9             2               7
e-val < 0.3     2           72            37              52
Problems of hmmsearch
• Number of significant motifs detected
– ISRE >>> GAS (in terms of e-value)
• Cannot tell whether the detected motifs are
functional or not
– E-value is the only measure
• GAS overlap between different gene groups
– 25% between up-regulated and random
• As in previous slides, occurrences detected
from the different gene groups are hard to
distinguish
Log Likelihood Method
• Calculate a score for each detected motif to tell
whether it is functional, and to discriminate gene
groups
– Score = log (P(motif | M(+)) / P(motif | M(-)))
– M(+)… Known motifs, M(-)… False motifs
– 1 pseudo count for each nucleotide per 10 sequences
• If the log-likelihood score for the given motif is
– positive… the motif is considered functional, provided it also
has a significantly low e-value
– negative… the motif is considered not functional
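A minimal Python sketch of such a log-likelihood score, under stated assumptions: each model is treated as a set of independent per-position base frequencies, the pseudocount rule is read as adding len(seqs) / 10 to every base count, and the natural logarithm is used. The slides do not confirm any of these details, the function names are hypothetical, and the negative sequences in the usage lines are made-up placeholders.

```python
import math

BASES = "ACGT"


def column_probs(seqs):
    """Per-position base probabilities for equal-length sequences.

    Pseudocount: 1 per nucleotide per 10 sequences, interpreted here as
    len(seqs) / 10 added to each base count (an assumption).
    """
    n = len(seqs)
    pseudo = n / 10.0
    total = n + len(BASES) * pseudo
    probs = []
    for i in range(len(seqs[0])):
        counts = {b: pseudo for b in BASES}
        for s in seqs:
            counts[s[i]] += 1
        probs.append({b: counts[b] / total for b in BASES})
    return probs


def log_likelihood_score(motif, pos_probs, neg_probs):
    """Sum over positions of log(P(base | M(+)) / P(base | M(-)))."""
    return sum(math.log(pos_probs[i][b] / neg_probs[i][b])
               for i, b in enumerate(motif))


# Three known ISREs from the alignment slide as M(+);
# two invented placeholder sequences stand in for M(-).
positives = ["AGGTTTCACTTTCCA", "CAGTTTCGGTTTCCC", "CAGTTTCTGTTTCCT"]
negatives = ["ACGTACGTACGTACG", "GGGCCCAAATTTGGG"]
m_pos = column_probs(positives)
m_neg = column_probs(negatives)
score_known = log_likelihood_score(positives[0], m_pos, m_neg)
score_false = log_likelihood_score(negatives[0], m_pos, m_neg)
```

The pseudocounts keep every probability above zero, so a base never seen in one model still yields a finite score.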
Concept of Models(+/-)
List of known & functional motifs
ISRE1 CAGTTT..
ISRE2 TAGTTT..
GAS1 TTTCAA..
1. build model
Model(+)
2. search occurrences of M(+)
in negative model
List of false positive motifs
ISRE1 TACTTT..
ISRE2 AGGCTT..
GAS1 TATGAA..
3. build model
Model(-)
Base Composition Tweaking
• All known functional ISREs have two “TTT”s
– Without tweaking, a motif with a “TTT” and a
“TCC” will receive a high log-likelihood score
• To solve this problem, we look for high-percentage
nucleotides and make them
dominant
– Example: base composition of a certain column
before tweak: A 3%, G 14%, C 12%, T 71%
after tweak: A 0.1%, G 0.1%, C 0.1%, T 99.7%
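The column example above can be reproduced with a small Python sketch. The 0.1% floor comes from the slide; the 70% cutoff for what counts as a "high percentage" base is an assumption, since the slides give no threshold, and `tweak_column` is a hypothetical name.

```python
def tweak_column(freqs, threshold=0.70, floor=0.001):
    """Make a high-percentage base dominant in one model column.

    freqs     -- dict base -> frequency for a single column (sums to 1)
    threshold -- minimum frequency for a base to be made dominant (assumed value)
    floor     -- frequency given to every other base (0.1% on the slide)
    """
    top = max(freqs, key=freqs.get)
    if freqs[top] < threshold:
        return dict(freqs)            # no dominant base: leave column unchanged
    tweaked = {base: floor for base in freqs}
    tweaked[top] = 1.0 - floor * (len(freqs) - 1)
    return tweaked


# Column from the slide: T at 71% becomes 99.7%, the rest 0.1% each.
column = {"A": 0.03, "G": 0.14, "C": 0.12, "T": 0.71}
tweaked = tweak_column(column)
```

With the tweak applied, a candidate that lacks T in this column is penalized far more heavily than it would be with the raw 71% frequency.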
Iteration and Model Refinement
[Diagram: Model(+) contains S(+)1 … S(+)n; Model(-) contains S(-)1 … S(-)n]
First iteration (model refinement)
[Diagram: Model(+) with S(+)1 … S(+)n and Model(-) with S(-)1 … S(-)n]
Second iteration (model refinement)
[Diagram: Model(+) with S(+)1 … S(+)n and Model(-) with S(-)1 … S(-)n]
Up-regulated vs. Random
[Plot: log-likelihood score (y-axis, 1 to -3.5) against iterations (x-axis, 1 to 3). Series: ISRE(positive), ISRE(negative), GAS(positive), GAS(negative). Annotations: up-regulated genes AVG, random genes AVG.]
Search Result of HISPIG
• Numerous potentially functional ISRE and
GAS motifs were detected in the 100 most up-regulated genes (both known and unknown)
– Approximately 80% of the genes had a
functional ISRE or GAS
– Numerous genes contain previously unknown functional
motifs, in agreement with other gene expression
experiments published in journals
• All motifs included in the model were
concluded to be functional
Improvement of log-likelihood
• Re-aligning process of model refinement
– Rank sequences that match criteria by
1. e-value
2. log-likelihood score
3. both (not easy to implement algorithm)
– The approach is more convincing if option 2 works better than the others
• Which model to refine each iteration
– Only positive? Only negative? Both?
Measuring the Reliability of the
Program
• Best Way – Do wet lab experiments to see if
a detected unknown motif is really
functional
• Alternative
1. Remove some known and functional
sequences from the initial model
2. See if the program still detects those in
the end
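The alternative check can be written as a small hold-out routine. This is a sketch only: `run_pipeline` is a hypothetical stand-in for the whole build, search and refine procedure, and the slides do not describe how the original program exposes it.

```python
def holdout_recovery(known_motifs, held_out_names, run_pipeline):
    """Hide some known motifs from the initial model, then see which come back.

    known_motifs   -- dict name -> sequence of known, functional motifs
    held_out_names -- names removed before the initial model is built
    run_pipeline   -- hypothetical callable: takes the training dict and
                      returns the set of motif sequences detected in the end
    """
    training = {name: seq for name, seq in known_motifs.items()
                if name not in held_out_names}
    detected = run_pipeline(training)
    # True means the held-out motif was still detected without being in the model.
    return {name: known_motifs[name] in detected for name in held_out_names}


# Toy usage: a stub pipeline that always "detects" one fixed sequence.
known = {"IP10": "AGGTTTCACTTTCCA", "ISG20": "ATCTTTGACTTTGTC"}
stub = lambda training: {"AGGTTTCACTTTCCA"}
report = holdout_recovery(known, ["IP10", "ISG20"], stub)
```

A held-out motif that the program recovers counts in its favor; one that it misses points to a weakness in the model or the cutoffs.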
Reliability Experiment (ISRE)
gene name  detected  e-value  log-likelihood  result
INDO       YES       0.23     4.28            FAIR
INDO       YES       0.097    2.74            GOOD
ISG20      NO                                 BAD
BF         YES       0.057    5.90            GOOD
IFIT2      YES       0.011    5.88            GOOD
G1P3       YES       0.0033   5.06            GOOD
G1P3       YES       0.0039   5.54            GOOD
CXCL10     YES       0.43     4.31            FAIR
OAS1       YES       0.01     4.68            GOOD
Acknowledgements
Sun Kim
Milton Taylor
Stuart Young