Prbsel - NYU Computer Science Department

Download Report

Transcript Prbsel - NYU Computer Science Department

Efficient Probe Selection
in Micro-array Design
Algorithmics Group, Dept. of Computer Science, University of Liverpool
Speaker: Cindy Y. Li
Joint work with:
Leszek Gąsieniec, Paul Sant, and Prudence Wong
Special thanks go to: David Peleg
http://www.csc.liv.ac.uk/~cindy
1
Talk Overview





Background: Microarrays & Hybridization
Problem Statement
Our Approach
Experimental Work
Conclusion
http://www.csc.liv.ac.uk/~cindy
2
Hybridization Process
DNA
5’...
TGTGCTTGACAACATAGTTG... 3’
||
|
|
Short DNA Fragments
3’-CTACGGACCGAT-5’
A single-stranded DNA probe (middle panel) is linked to an
enzyme and allowed to base pair (hybridize) with the
mRNA. After a series of washes, only fragments that are
hybridized with the target mRNA remain.
http://www.csc.liv.ac.uk/~cindy
3
Tool: DNA Microarrays
Labeled DNA/RNA mixture flushed
over array of short DNA fragments
Laser activation of fluorescent labels
http://www.csc.liv.ac.uk/~cindy
4
Talk Overview





Background: Microarrays & Hybridization
Problem Statement
Our Approach
Experimental Work
Conclusion
http://www.csc.liv.ac.uk/~cindy
5
Probe concept
A probe is a substring of a gene, which acts
as its fingerprint (a.k.a., signature)
 Probes are relatively short DNA sequences.
Usually, a probe is ~ 20-25 base pairs long.
 For example:
DNA ...TGTGCTTGGCAACATAGATAGATGC...

Probe
TGCTTGGCAACATAGATAGA
http://www.csc.liv.ac.uk/~cindy
6
Finding unique probes
P1
P2
P3
P4
P5
Probes
G1
Genes
G2
G3
G4


We are interested in finding a single (or a small
group of) unique probe(s) for each gene
The search process should be both time and
space efficient
http://www.csc.liv.ac.uk/~cindy
7
Finding unique probes




Given a database S of gene sequences
For each sequence g in S try to find a single
probe P which hybridizes only with g
If P cross-hybridizes with some other
sequences in S (i.e., P has a close
occurrence in S) then find a small set of
probes that uniquely identifies g.
Sometimes multiple probes are required due
to the error prone wet lab environment
http://www.csc.liv.ac.uk/~cindy
8
The use of probes


The uniqueness of probes allows us to
identify the genes taking part in the
experiment in the wet lab
I.e., seeing the trace (green color) of a
number of probes on the microarray we
can identify precisely which genes were
involved in the experiment
http://www.csc.liv.ac.uk/~cindy
9
Finding Unique Probes Performance Measure





Each gene in the database S should be uniquely identified by a
smallest possible number of probes
The search for probes should be time/space efficient
The time of the search for probes should be “fairly”
independent of the length of the probes
All probes should be far (Hamming distance) from each other
Probes should satisfy some extra (e.g., related to hybridization
process) conditions
Naive approach:
Scans through the whole length-n genome for every length-m probe and
determine if the Hamming distance is big enough, which takes O(mn2) time.
For example, 72 hours for S. pombe genome of length 7.1 x 106 bps and
thus impractical for large genome.
http://www.csc.liv.ac.uk/~cindy
10
Previous Work –
Approaches based on


Suffix array and fast pattern matching [Li F. and Stormo G., 2001]
BLAST to avoid cross-hybridization
[Rouillard J. M., Herbert C. J. and Zuker M., 2002]



Longest common substrings [Rahmann S. 2002]
Various filtering techniques [Lockhart DJ et al, 1996]
Methods based on pigeon hole principle
[Lee W. H. and Sung W. K., 2003]

etc
http://www.csc.liv.ac.uk/~cindy
11
Previous Work –
The probe selection criteria






No single base exceeds 50% of the probe size
The length of any contiguous As and Ts or Cs and Gs
is less than 25% of the probe size
(G+C)% is between 40% and 60% of the probe
Sensitivity - No self-complementarity within the probe
sequence
Homogeneity - Melting Temperature not being too low
or too high
Specificity – probes are unique to each gene
http://www.csc.liv.ac.uk/~cindy
12
Previous Work – Test data
Test data
Genome
Name
Genome
Length
Number of
Genes
E. coli
S. pombe
4,752,411 13,149,651
5,253
521
http://www.csc.liv.ac.uk/~cindy
Yeast
8,783,280
5,888
13
Previous Work – Test data
Yeast
Total length
8,783,280
2500
# of genes
2000
Total # of genes
5,888
1500
1000
500
0
<1K
[1K, 2K)
[2K, 3K)
[3K, 4K)
[4K, 5K)
[5K, 6K)
[6K, 7K)
>= 7K
gene length
http://www.csc.liv.ac.uk/~cindy
14
Previous Work
Li and
Stormo
BIBE,2000
E.Coli
23-bps
1.5 days
Yeast
24-bps
4 days
Rouillard et
al,Bioinformatics,2002
50-bps
31 minutes
50-bps
1 day
Neurospora
crassa
# of probes
Top 10
Rahmann, Lee and
WABI,
Sung,CSB,
2002
2003
All probes
50-bps
49 minutes
25-bps
4 hours
50-bps
3.5 hours
Top 50
All probes
http://www.csc.liv.ac.uk/~cindy
15
Talk Overview





Background: Microarrays & Hybridization
Problem Statement
Our new alternative approach
- main observations
- the algorithm
Experimental work
Conclusion
http://www.csc.liv.ac.uk/~cindy
16
Main Observations
In general randomness help!

80% of “randomly” (based on our algorithm) chosen
candidates for probes satisfy the probe selection
criteria related to hybridization process
[this suggests that random sequences hybridize
properly more likely]

The expected Hamming distance between two
randomly chosen sequences of a length n over 4
letter alphabet is ~ 3n/4.
[this suggests that randomly chosen probes will be
far from each other]
http://www.csc.liv.ac.uk/~cindy
17
An interesting observation


In general, fragments of DNA sequences
representing genes are more deterministic
(contain more organized information)
comparing to the rest of the sequence.
In contrary, the best probes (signatures)
representing genes are very likely to be
random or almost random!
http://www.csc.liv.ac.uk/~cindy
18
The Algorithm
(*) For every gene g in the database S:
a) generate a random base-pair sequence of length m
b) find the closest length-m substring P in gene g
c) check P for good probe criteria [80% pass this test]
 If P does not pass the criteria go to a)
d) cross-hybridization checking for P [98% pass this test]
 For every length-m substring Q in other sequences S-{g}:
 If H(P,Q) > d, P is chosen as the probe for g, goto (*)
 Otherwise, P can possibly cross-hybridize and we must
generate another length-m random substring P', go to a)
http://www.csc.liv.ac.uk/~cindy
19
The algorithm
(*) For every gene g in the database S:
a) generate a random base-pair sequence of length m
R
b) find the closest length-m substring P in gene g
g
P
c) Check P for good probe criteria, if P does not pass the
criteria, go to a)
http://www.csc.liv.ac.uk/~cindy
20
The algorithm
d) Check P for cross-hybridization checking
For every length-m substring Q in other sequences (S - {g}):
If H(P,Q) > d, P is chosen as the probe for g, goto (*);
Otherwise, P can possibly cross-hybridize and we must
generate another length-m random substring, go to a)
g
g1
Background g
2
Sequences
gi
P
P is far from g1 √
P is far from g2 √
…
H(P,Q)<d
Q
X
Generate another length-m random substring
http://www.csc.liv.ac.uk/~cindy
21
Talk Overview





Background: Microarrays & Hybridization
Problem Statement
Algorithm
Experimental Work
Conclusion
http://www.csc.liv.ac.uk/~cindy
22
Experimental Work
For Yeast:
 1.80% genes with no probes


Duplicated / very similar / too short
apart from that



98.0% genes need only one probe
1.5% genes need two probes
0.5% genes need three probes
Similar
result with genome E.coli
http://www.csc.liv.ac.uk/~cindy
23
Talk Overview





Background: Microarrays & Hybridization
Problem Statement
Algorithm
Experimental Work
Conclusion
http://www.csc.liv.ac.uk/~cindy
24
Conclusion

Almost all (98%) genes can be uniquely
identified by a single probe; the others need
at most three probes

Our method is:

Suitable for large scale probe design

Fairly independent from the length of probes

Both time and space efficient

Useful in design of fault-tolerant system of probes
http://www.csc.liv.ac.uk/~cindy
25
Ongoing Work
Distinguish multiple targets in a sample
g1
P1
g2
P2
g3
P1 ’
P2 ’
http://www.csc.liv.ac.uk/~cindy
26
Questions
c
ag
tg
?
ag
cc
t
??
a
a
tg
gc
?
aa
ga
a
cc
??
t
ac
?
?
?
c?
g
gc
gc
a
gc
t
a
ct
http://www.csc.liv.ac.uk/~cindy
g
g
ga
??
c
ac
??
c
?t
tt
t
ct
cg
27
Thank You!
Presented By Cindy Y. Li
http://www.csc.liv.ac.uk/~cindy
28
self-complementarity
Probe 5‘
TTTCAGTAATAAAAGATTTCTGT 3‘
||||
Probe 3‘ TGTCTTTAGAAAAATTAGACTTT
5‘
http://www.csc.liv.ac.uk/~cindy
29
Melting Temperature


TM can be used as a parameter to evaluate probe hybridization
behavior
TM is calculated for each probe as (SantaLucia et al., 1996)
is the sum of the nearest neighbor enthalpy changes
is the sum of the nearest neighbor entropy changes
R is the Gas Constant (1.987 cal deg-1 mol-1)
CT is the total molar concentration of strands (
)
http://www.csc.liv.ac.uk/~cindy
30
Melting Temperature

thermodynamic stability / nearest neighbour/
A
A
T
C
G
-1.2 -0.9 -1.5 -1.5
T
C
G
-0.9 -1.2 -1.5 -1.7
-1.7 -1.5 -2.1 -2.8
-1.5 -1.5 -2.3 -2.1
TTTCAGTAATTAAAAAGATTTCTGT
-1.2
-1.7 -1.5 kcal/mol
http://www.csc.liv.ac.uk/~cindy
31