Peter, Ingmar

Download Report

Transcript Peter, Ingmar

Overview
Two stage search
 Indexing on intervals
 Frames
 Scoring
 Optimization
 Results
 Questions + answers

Two-stage searching

Coarse searching
 Uses heuristics to
 Interval matching

find promising alignments
Fine searching
 Intensive processing of results from coarse search
 Calculates the actual score of the alignments by using
more detailed information from the matching
sequences
 This requires retrieval of sequences from the
database which is expensive
Intervals

Overlapping intervals:
 Sequence: ACCTGACG,
 Interval
with length l=8
length: n=3
 Intervals: ACC, CCT, TGA, GAC, ACG
Indexed VS. Exhaustive

Exhaustive:
 Every
sequence in the database is retrieved
and processed, this is costly for large
databases
 Often use heuristics to reduce the number of
sequences that need to be aligned
 Popular exhaustive systems FASTA, BLAST1,
BLAST2 are based on Wilbur-Lipman interval
approach
Indexed VS. Exhaustive

Wilbur-Lipman
 Build
a hashing structure with all intervals in
the query sequence as keys
 Hash all intervals in the entire database, and
if the interval is present in the hashing
structure we know that there is a match
 This is faster than walking through the query
sequence for every interval in the database
Wilbur-Lipman
query
sequence
A C T G A C T C
hashing structure
offsets
interval
A C T
0 4
C T C
5
C T G
1
G A C
3
T G A
2
Indexed search
Instead of retrieving all sequences from
the database, use an index to locate
promising alignments
 Cost

 storage

of the index
Advantages
 fast
lookup, only index needs to be accessed
 enables partial retrieval (for fine searching)
Indexed search

FLASH
 Enables
gapped searching by indexing permuted
intervals
 Example:





Interval length = 5, subsequence length = 3
Sequence: ACCTGATT, results in:
ACC, ACT, ACG
ACT ACG
ATG
 Problem:
Index becomes huge
Indexed search in CAFE


(similar to RAMDB)
Index contains:
 Search structure
 Searching is done on keys, these keys
represent overlapping intervals that
occur in the database
 Posting lists
 For every interval a list is stored
containing references to all occurences
of this interval. These references
contain a sequence id and an offset
Indexed search in CAFE
interval

sequence id
offsets
AA
8
3 12 17
AC
8
4 11 20
11 2 6 14
Posting lists can be long
 Compressed
to reduce used disk space
Compression techniques





To reduce the size of the index posting lists are
compressed
Instead of storing each sequence number of the
sequence in which the interval occurs, store the
first one and use relative offsets for the rest
Eg.
101, 109, 217, 412, 980, 1013 becomes
101, 8, 108, 195, 568, 33
Compression techniques
Of course the same same can be done
with the places where the interval occurs
inside a query
 Also Elias Gamma and Delta coding
 But just compression was not good
enough

 Use
heuristics to reduce the size of the index
even further
Heuristics to decrease index size

Limit indexing of wildcards
 wildcards
are instantiated before being
indexed.
 Several adjacent wildcards may lead to an
explosion of matches (Eg. NNNNN)
 Solution: only index intervals containing at
most one wildcard
Heuristics to decrease index size

Another one:
 Do
not create an entry in the index for
intervals that occur in more than x% of the
sequences
Coarse searching
Uses FRAMES
 “A frame is a set of one or more matching
intervals between a database sequence
and a query sequence that are at the
same relative offset”

FRAMES
Frames are constructed using the offsets
stored in the index
 A frame can be represented by a
combination of sequence id and offset
 The information contained in frames is
used to determine the most promising
alignments

FRAMES



F2 : (9,7), (10,8), (27,25), (28,26), (29,27), (30,28), (31,29), (32,…)
F1 : (17,16), (18,17)
F26 : (53,27), (54,28)
Constructing frames

For every interval match with offsets x and y:
 Calculate
the relative offset z
 If a frame with this relative offset (Fz) already
exists, append (x, y) to this frame
 Otherwise, create a new frame containing (x, y)
Frame ranking schemes

First approach: FRAMECOUNT
 Count
the amount of matching intervals for
each frame, and take the maximum over all
frames
FRAMES




F2 : Contains 8 matching intervals
F1 : Contains 2 matching intervals
F26 : Contains 2 matching intervals
FRAMECOUNT = 8, resulting from frame F2
Frame ranking schemes

Problem with FRAMECOUNT:
 Does
not take into account relative positioning
of intervals within a single frame

COVERAGE:
 Amount
of matching bases per frame
 Takes into account overlapping of intervals

Comparing FRAMECOUNT and COVERAGE:

A : FRAMECOUNT = 7, COVERAGE = 9
 B : FRAMECOUNT = 7, COVERAGE = 21

COVERAGE seems to be a more accurate scoring scheme
Frame ranking schemes

Another scheme: LENGTH
 Amount
of bases between first and last base
contained in matching intervals in a frame
A: LENGTH = 21
 B: LENGTH = 55
 C: LENGTH = 55

Frame ranking schemes

What we did not understand:


“Despite this, the length scheme is particularly attractive since it
ranks highly regions that are longer and, therefore, will rank long
homologous alignments ahead of shorter alignments.”
COMBINED scheme:




COVERAGE and LENGTH combined
Takes into account residues that are not part of matching intervals
COMBINED = COVERAGE – k * (LENGTH – COVERAGE)
The value for k is determined empirically (<< 1)
Scoring with amino acids

COVERAGE scheme can be improved by
using a substitution matrix
 Instead
of counting the number of matches,
use the sum of appropriate values in the
substitution matrix
 Interval scores can be cached
Normalizing scores

To compensate for increased score with
longer sequences:
 Snorm

= (S * 21.21)/(ln l1 * ln l2)
“l1 and l2 are the lengths of the two aligned
regions”
 We
assume they mean the lengths of the
aligned sequences, since we have read that
this is used by Shpaer et al.
Optimising frames

Apply a fixed ceiling on the amount of frames.
 Decreases
memory usage and CPU time
 Which frames are most important ?




Frames containing the most discriminating intervals
So we should look up the most discriminating intervals first.
The database-frequency of each interval occuring in the
query is determined
After sorting them, we lookup the intervals with the lowest
frequency first, in this way have the best discrimination.
Optimising frames

NEIGHBOURHOOD scheme
 Gives
higher scores for frames that are close
to other frames
 Allows gaps (indels), by “combining” frames
 For every other frame in the same sequence:
add s1/d to the score, where
 s is the score of the other frame
 d is the difference in offset from the other frame

Optimising frames

Information stored for frames can be
(re)used for fine searching:
 Matching
regions can be used as a starting
point for fine-searching
 Frames allow us to partially retrieve all
relevant sequences from the database
Test data

PIR database used to assess accuracy
 Sequences
used that are classified in super families
 Single member families removed
 Filtered test set: 1834 sequences
 Only query sequences with #residues < 500
 Precision and recall are measured


Precision = Relevant results/Total results
Recall = Relevant results/Total relevant
Test data

Used for assessing speed and index size:
 Genbank

GENBANK97:


databases:
652 mln. nucleotide bases in 1 mln. sequences
GENBANK108:

1797 mln. nucleotide bases in 2.5 mln. sequences
 VERTE:

177 mln. nucleotide bases in 0.12 mln. sequences
Results

CAFE compared to BLAST and FASTA:
 CAFE
has similar precision
 CAFE becomes relatively faster with increasing
database size

Further observations
 For
CAFE, queries take more time and memory for
processing when their intervals occur often in the
database
 For BLAST and FASTA, performance drops when the
entire database can not be stored in memory
Questions

Bogdan:
 Question

1(overlapping intervals):
Why do they use overlapping intervals? Is that
because if you have the string “ABCDEF” and the
intervals “ABC” and “DEF”, that than a query
interval “BCD” wouldn’t match ?
 Answer:

We think your intuition is correct, if we do not use
overlapping intervals lots of existing intervals
would be missed
Questions
 Question

2 (optimising frames):
What do they mean with sorting the query intervals,and how does sorting
make them discriminate well between sequences?
 Answer:

Query intervals: the intervals that are generated from the query sequence.


The database-frequency of each interval occuring in the query is determined
After sorting them, we lookup the intervals with the lowest frequency first, in
this way have the best discrimination.
3: “Could you explain the two alternatives when the
threshold is reached?”
 Answer:
 Question


Option 1: Stop checking matches immediately when the ceiling is reached
Option 2: Keep adjusting existing frames with new matches
Questions

Marjolijn:
 Question:

Can you explain the difference between an
inverted index and a fine-grain index?
 Answer:
Only one type of index is used
 This index is an inverted list
 The index also has a high granularity (fine-grain)

Questions

Lee:
 Question

1 (compression):
What sorts of compression are used to decrease
the size of the index ?
 Answer:

See the part of this presentation about
compression
Questions
 Question 2:
 “They say that the problem with uncompressed lists is that
you can be penalized in time by the disk retrievals. On page
22 they are saying "However, these algorithms are highly
reliant on having sufficent memory to store the complete
database." So if I understand it right, they store the index on
the disk and the database in main memory? Isn't it more
common to put the index in main memory and the database
on disk?
 Answer:
 The sentence from page 22 is not about CAFE, but about two
other methods: FASTA and BLAST.
 These methods do not use an inverted index and hence need
fast access to the database.
Questions

Jacob:
 Question(COMBINED):
 “I was wondering if the combination of coverage and length as
mentioned on page 14 is sufficient for scoring alignments. I believe
that there are lots of examples with different alignments which would
score the same. Like this one

CGATCGAATAGCATCGTGGCGGTGAGCGGTTTCTGTTTCTGTTCTT
::::::
:::
TTATCGAATAGCGGCGCTAGCATCGATCATTCTACTTTCAAACTGC

CGATCGAATAGCATCGTGGCGGTGAGCGGTTTCTGTTTCTGTTCTT
:::
:::
:::
TTATCTCTATAGCGGCGGGCGCATCGATCATTCTCTTTCAAACTGC
 Answer:
 We should investigate whether clustering influences the probability of
a successful allignment
 If so we could try to think of a new scoring scheme:


Scheme: CLUSTERING
Calculate the degree of clustering within a frame
Questions

Laurence:

Question:

“On page 15 in section 3.3 in paragraph 4 it says that composition alignment
between matching intervals in the length metric is difficult to model without
fine-searching the region. Therefore they elect not to modify the LENGTH
scheme for complex models but rather apply statistical normalisation to
incorporate variations in composition the resultant score.
Can you maybe explain in further detail why is chosen to apply statistical
normalisation in contrast to modifying the LENGTH scheme, and what does
fine-searching the region have to do with this”

Answer:



LENGTH misses compositional information
Compositional information requires retrieval of database sequences
However, when random query and database strings get larger, average
allignment scores get higher. To rule this out we do a normalisation. By doing
this, length still plays an important role in scoring.
Questions

Bram:
 Question:

“For COVERAGE they use pre-calculations of the scores to
incorporate variations in composition in the resultant score. In
LENGTH they use the Shpaer normalisation for it. At the end of the
paragraph they propose to use the Shpaer scheme to normalise the
COVERAGE scores. What's the advantage of normalising the
COVERAGE scores, if we already did the pre-calculation?”
 Answer:


You probably think of normalisation as ruling out which substitution
matrix was used, that’s not what is meant.
The Shpaer normalisation is applied to values calculated for frames
Questions

Adriano
 Question:
 page 22 “In CAFE, evaluation times depend on query length
and on statistics of the intervals in the query; intervals with
longer inverted lists require more processing. What does
"longer inverted lists" mean? and why does it require more
processing?”
 Answer:
 Inverted lists = posting lists for matching intervals
 For every matching interval we need to process its posting
list in order to constuct frames