Waiting Time and Seed Selection for Homology Search

Download Report

Transcript Waiting Time and Seed Selection for Homology Search

Waiting Time and Seed Selection
for Homology Search
Gary Benson
Department of Computer Science
Department of Biology
Graduate Program in Bioinformatics
Boston University
Classroom Experiment – Estimating Waiting
Time
• Each student tosses a coin until the first
occurrence of 3 heads in a row. This is one
trial. If ten tosses do not produce three heads in
a row, stop and begin a new trial.
• Repeat trials, if necessary, so that there are 100
trials across all the students in the class.
• For each trial, record in a table which toss (if
any) gives the first occurrence of three heads in
a row.
Example
Trial 1:
Trial 2:
Trial 3:
Trial 4:
1
H
T
H
H
2
T
T
T
T
3
H
H
T
T
4
H
H
H
H
5
T
H
H
H
6 7 8 9 10
H H H
T H H H
T H T H H
First
occurrence
8
5
9
None
Fill in Table
N =
Trial 1
Trial 2
Trial 3
Trial 4
Total in
4 trials
3
4
5
6
7
8
X
9
10
None
X
X
X
0
0
1
0
0
1
1
0
1
Class Results
Class Trial Record
N =
3
Total in
100 trials
Probability
Total/100
Cumulative
Probability
4
5
6
7
8
9
10
None
ACGTTCGACT
Speciation
Nucleotide substitution
ACGTTCGACT
ACGTTCGGCT
Nucleotide substitution
AAGTTCGACT
Evolutionary time
AAGTTCGACT
ACCTTCGGCT
AAGTTCGACA
Human
ACGTTCGGCT
ACCTTCGGCT
Molecular evolution
Mouse
Biologists Ask:
I have an unstudied human gene sequence
AAGTTCGACA
What other genes (from other organisms)
are similar to this one?
Why do they ask that question?
Genes that have similar sequences often:
1. produce proteins that have similar function
2. are regulated in similar ways
So a gene that has already been studied can often
provide information about an unknown gene.
How we answer the question:
Compare the human gene to
every sequence
in a database of known genomes
How
BIG is the database?
Combined length of all sequences
is greater than
1011 = 100,000,000,000 nucleotides.
http://www.ncbi.nlm.nih.gov/
http://www.ncbi.nlm.nih.gov/
How
long does it take to search?
The best algorithm (not the fastest) for detecting
similarity takes time proportional to the product
of the lengths of the sequence and the
database.
Assume the unstudied gene sequence is
102 = 100 nucleotides and the database is
1011 = 100,000,000,000 nucleotides.
Searching the database
102 • 1011 = 1013 nucleotide comparisons.
Good desktop computers now run at
3 gigahertz = 3 billion = 3 • 109 comparisons
per second
So comparing one sequence to the database takes
1013 / 3 • 109 = 3333 seconds = 55 minutes
Searching the database
102 • 1011 = 1013 nucleotide comparisons.
Good desktop computers now run at
3 gigahertz = 3 billion = 3 • 109 comparisons
per second
So comparing one sequence to the database takes
1013 / 3 • 109 = 3333 seconds = 55 minutes
UGH!!!
A faster way
Take small “words” from A and find matches in the
database first:
AAGTTCGACA
AAGTT AGTTC GTTCG … CGACA
Then only do comparisons where there is a “hit”.
Example
Sequence: AAGTTCGACA
Targets: AAGTT AGTTC GTTCG … CGACA
Database:
ACTGGTTCAAATGGCGCATGCAAAAGTTGGCT
GATTTGCATGACGTACCCTGAGACCTCGGAATT
CTAGCTTGCGAAGTAATACGATACCGTACGTTG
CCGACATACGGTACGTCGTCTACGTACGTACG
CCTACGCTACGTACCTTCGGCTTTTCATGGCAG
CGATCGTACTCCTCTAGTTCCTGACTGACTAC
…
Example
Sequence: AAGTTCGACA
Targets: AAGTT AGTTC GTTCG … CGACA
Database:
ACTGGTTCAAATGGCGCATGCAAAAGTTGGCT
GATTTGCATGACGTACCCTGAGACCTCGGAATT
CTAGCTTGCGAAGTAATACGATACCGTACGTTG
CCGACATACGGTACGTCGTCTACGTACGTACG
CCTACGCTACGTACCTTCGGCTTTTCATGGCAG
CGATCGTACTCCTCTAGTTCCTGACTGACTAC
…
But we missed the mouse gene!
What size should the “words” be?
• Too long and we miss similar sequences
This reduces sensitivity.
• Too small and there are “hits” everywhere.
This reduces specificity.
We need to balance these two issues.
Computing sensitivity
We need to answer the following type of question:
Suppose a gene sequence of length 100 has a
match in the database and the two sequences
are expected to be identical at 80% of their
positions. If we search with “words” of length 5,
how often will we find the match?
Computing sensitivity
We model the two sequences with coin tosses,
where a match is considered a head and a
mismatch (mutation) is considered a tail.
Example:
AAGTTCGACA
ACCTTCGGCT
HTTHHHHTHT
Computing sensitivity
We model the two sequences with coin tosses,
where a match is considered a head and a
mismatch (mutation) is considered a tail.
Example:
AAGTTCGACA
ACCTTCGGCT
HTTHHHHTHT
We are interested in at least one occurrence of five
heads if we use small words of length five.
Waiting Time
Our question is a classic waiting time problem. It
asks, for coin toss sequences of length n, and
with probability of heads = P(H), what is the
probability of tossing k heads in a row at least
once?
For our question,
n = 100,
P(H) = 0.8
k=5
Answer
99.99% -- Which means that using small words of
length 5, we will never miss similar sequences
of length 100.
In fact, the sequences have to be as short as 23
nucleotides before we will miss 5%.
And at 9 nucleotides we miss 41% of similarities.
AAGTTCGACA
ACCTTCGGCT
But wait! Maybe the “words” can
be modified
It turns out, that words composed of letters
with don’t care spaces produce better
results.
AAGTTCGACA
AA*TTC AG*TCG GT*CGA … TC*ACA
But wait! Maybe the “words” can
be modified
It turns out, that words composed of letters
with don’t care spaces produce better
results.
AAGTTCGACA
AA*TTC AG*TCG GT*CGA … TC*ACA
Current research involves finding the best
word shapes.