Transcript Slides

CS5263 Bioinformatics
Lecture 20
Practical issues in motif finding
Final project
Homework problem 3.1
• Count separately the number of character
comparisons and the number of steps
needed to find the next matching character
using the bad character rule
• Question: can you give an example?
Extended bad character rule
k
T: xpbctbxabpqqaabpqz
P: tpabxab
In this iteration:
*^^
Find T(k) in P that is
# of comparison = 3
immediately left to i,
shift P to align T(k)
with that position
P: tpabxab
char
Position in P
a
6, 3
b
7, 4
p
2
t
1
x
5
Table lookup: 2
Results for some real genes
llr = 394
E-value = 2.0e-023
llr = 347
E-value = 9.8e-002
llr = 110
E-value = 1.4e+004
Strategies to improve results
• Combine results from different algorithms
usually helpful
– Ones that appeared multiple times are
probably more interesting
• Except simple repeats like AAAAA or ATATATATA
– Cluster motifs into groups according to their
similarities
Strategies to improve results
• Compare with known motifs in database
– TRANSFAC
– JASPAR
• Issues:
– How to determine similarity between motifs?
• Alignment between matrices
– How similar is similar?
• Empirically determine some threshold
Strategies to improve results
• Statistical test of significance
– Enrichment in target sequences vs
background sequences
Target set
T
Assumed to contain a
common motif, P
Background set
B
Assumed to not contain P,
or with very low frequency
Ideal case: every sequence in T has P, no sequence in B has P
Statistical test for significance
P
Background set + target set
B+T
Target set
T
Size = N
Appear in n
sequences
Appear in m
sequences
Size = M
• Intuitively: if n / N >> m / M
– P is enriched (over-represented) in T
– Statistical significance?
Hypergeometric distribution
• A box with M balls, of which N are
red, and the rest are blue.
– Red ball: target sequences
– Blue ball: background sequences
• If we randomly draw m balls from
the box,
• what’s the probability we’ll see n
red balls?
 N  M  N 
 

n mn 
Hypegeom ( M , N , m, n)   
M 
 
m
– If probability very small, we are
probably not drawing randomly
• Total # of choices: (M choose m)
• # of choices to have n red balls:
(N choose n) x (M-N choose m-n)
Cumulative hypergeometric test for
motif significance
• We are interested in: if we
randomly pick m balls, how likely
that we’ll see at least n red balls?
 N  M  N 
 

min( m , N ) 
i  m  i 

cHypegeom ( M , N , m, n)  
M 
i n
 
m
This can be interpreted as the p-value for the null
hypothesis that we are randomly picking.
Alternative hypothesis: our selection favors red balls.
Equivalent: the target set T is enriched with motif P.
Or: P is over-represented in T.
 N  M  N 
 

n 1 
i  m  i 

 1 
M 
i 0
 
m
Examples
•
•
•
•
•
•
•
Yeast genome has 6000 genes
Select 50 genes believed to be co-regulated by a common TF
Found a motif for these 50 genes
It appeared in 20 out of these 50 genes
In the whole genome, 100 genes have this motif
M = 6000, N = 50, m = 100+20 = 120, n = 20
Intuition:
– m/M = 120/6000. In Genome, 1 out 50 genes have the motif
– N = 50, would expect only 1 gene in the target set to have the motif
– 20-fold enrichment
• P-value = 6 x 10-22
• n = 5. 5-fold enrichment. P-value = 0.003
• Normally a very low p-value is needed, e.g. 10-10
0
0
10
10
hypegeom(6000, 50, 120, n)
chypegeom(6000, 50, 120, n)
hypegeom(6000, 50, 120, n)
chypegeom(6000, 50, 120, n)
-1
10
-20
10
-2
10
-3
-40
10
10
-4
10
-60
10
-5
10
-6
10
-80
10
-7
10
10
0
2
4
6
n
-100
0
10
20
30
n
40
50
8
10
ROC curve for motif significance
• Motif is usually a PWM
• Any word will have a score
–
–
–
–
Typical scoring function: Log P(W | M) / P(W | B)
W: a word.
M: a PWM.
B: background model
• To determine whether a sequence contains a
motif, a cutoff has to be decided
– With different cutoffs, you get different number of
genes with the motif
– Hyper-geometric test first assumes a cutoff
– It may be better to look at a range of cutoffs
ROC curve for motif significance
P
Target set
T
Size = N
•
•
•
•
•
•
Given a score cutoff
Appeared in
n sequences
Appeared in
m sequences
Background set + target set
B+T
Size = M
With different score cutoff, will have different m and n
Assume you want to use P to classify T and B
Sensitivity: n / N
Specificity: (M-N-m+n) / (M-N)
False Positive Rate = 1 – specificity: (m – n) / (M-N)
With decreasing cutoff, sensitivity , FPR 
ROC curve for motif significance
A good cutoff
Lowest cutoff. Every sequence
has the motif. Sensitivity = 1.
specificity = 0.
sensitivity
1
ROC-AUC: area under curve.
• 1: perfect separation.
• 0.5: random.
Motif 1
Motif 2
Random
0
0
1-specificity
Motif 1 is better than motif 2.
1
Highest cutoff. No motif can pass the cutoff. Sensitivity = 0. specificity = 1.
Other strategies
• Cross-validation
– Randomly divide sequences into 10 sets, hold 1 set
for test.
– Do motif finding on 9 sets. Does the motif also appear
in the testing set?
• Phylogenetic conservation information
– Does a motif also appears in the homologous genes
of another species?
– Strongest evidence
– However, will not be able to find species-specific ones
Other strategies
• Finding motif modules
– Will two motifs always appear in the same gene?
• Location preference
– Some motifs appear to be in certain location
• E.g., within 50-150bp upstream to transcription start
– If a detect motif has strong positional bias, may be a sign of its
function
• Evidence from other types of data sources
– Do the genes having the motif always have similar activities
(gene expression levels) across different conditions?
– Interact with the same set of proteins?
– Similar functions?
– etc.
To search for new instances
• Usually many false positives
• Score cutoff is critical
• Can estimate a score cutoff from the “true”
binding sites
Motif finding
Scoring function
A set of scores for the “true” sites. Take mean - std as a cutoff.
(or a cutoff such that the majority of “true” sites can be predicted).
To search for new instances
• Use other information, such as positional
biases of motifs to restrict the regions that
a motif may appear
• Use gene expression data to help: the
genes having the true motif should have
similar activities
• Phylogenetic conservation is the key
Final project
• Write a review paper on a topic that we didn’t cover in
lectures
Or
• Implement an algorithm and do some experiments
• Compare several algorithms (existing implementation ok)
• Combine several algorithms to form a pipeline (e.g. gene
expression + motif analysis)
• Final:
– 5 -10 pages report (single space, single column, 12pt) + 15
minutes presentation
Possible topics for term paper
• Possible topics:
– Haplotype inferencing
– Computational challenges associated with new
microarray technologies
– Phylogenetic footprinting
– Small RNA gene / target prediction (siRNA, mRNA,
…)
– Biomedical text mining
– Protein structure prediction
– Topology of biological networks
An example project
•
•
•
•
•
Given a gene expression data (say cell cycle)
Cluster genes using k-means
Find motifs using several algorithms
(Cluster and combine similar motifs)
Rank motifs according to their specificity to the
target sequences comparing to the other
clusters
• Get their logos
• Use the sequences to search the whole genome
for more genes with the motif
• Do they have any functional significance?