lecture03_16


Database Similarity Search
Why do we care to align sequences?
Sequences that are similar probably have the same function
Discover Function of a new sequence
[Figure: a new sequence (function = ?) is searched against a sequence database; a database sequence that is similar (≈) suggests a similar function.]
Searching Databases
for similar sequences
Naïve solution: Use exact algorithm to
compare each sequence in the database to
query.
Is this reasonable ??
Complexity for genomes
• Human genome contains 3 × 10^9 base pairs
– Searching an mRNA against the human genome (HG) requires ~10^12
alignment matrix cells
– Even efficient exact algorithms will be extremely
slow when performed millions of times, even
with parallel computing.
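As a rough back-of-the-envelope check of the cell count above, the sketch below multiplies the genome length by an mRNA length. The mRNA length of 1,000 bases and the rate of 10^9 cells per second are illustrative assumptions, not values from the lecture.

```python
GENOME_LENGTH = 3 * 10**9   # base pairs in the human genome (from the slide)
MRNA_LENGTH = 1_000         # ASSUMPTION: illustrative mRNA length
CELLS_PER_SECOND = 10**9    # ASSUMPTION: illustrative speed of one machine


def dp_cells(query_length, target_length):
    """Number of cells in a full dynamic-programming alignment matrix."""
    return query_length * target_length


cells = dp_cells(MRNA_LENGTH, GENOME_LENGTH)
seconds_per_query = cells / CELLS_PER_SECOND
print(f"{cells:.1e} cells, about {seconds_per_query / 60:.0f} minutes per query")
```

Under these assumptions a single query already takes the better part of an hour, which is why repeating the exact comparison millions of times is not practical.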
So what can we do?
Searching databases
Solution:
Use a heuristic (approximate) algorithm
Heuristic strategy
• Reduce the search space
Remove regions that are not useful for
meaningful alignments
• Preprocess database into a new data structure
to enable fast access
What sequences to remove?
• AAAAAAAAAAA
• ATATATATATATA
• Transposable elements
53% of the genome is repetitive DNA
Low-complexity sequences (JUNK???)
Low Complexity Sequences
What's wrong with them?
* Not informative
* Produce artificially high-scoring alignments.
So what do we do?
We apply Low Complexity masking to the database
and the query sequence
TCGATCGTATATATACGGGGGGTA
Mask
TCGATCGNNNNNNNNCNNNNNNTA
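The masking step can be sketched as follows. This is a deliberately simplified illustration that masks runs of period-1 and period-2 repeats (such as GGGGGG or TATATATA); production tools such as DUST or SEG use complexity statistics instead. The function name, the `max_period` and `min_len` parameters, and their default values are assumptions chosen for this example.

```python
def mask_low_complexity(seq, max_period=2, min_len=6, mask_char="N"):
    """Replace short-period repeats of at least min_len residues with mask_char.

    ASSUMPTION: "low complexity" is approximated here as a run in which
    every residue equals the residue `period` positions after it.
    """
    masked = list(seq)
    n = len(seq)
    for period in range(1, max_period + 1):
        start = 0
        while start + period < n:
            end = start
            # Extend while the sequence keeps repeating with this period.
            while end + period < n and seq[end] == seq[end + period]:
                end += 1
            run_len = (end - start) + period if end > start else 0
            if run_len >= min_len:
                for i in range(start, end + period):
                    masked[i] = mask_char
            start = end + 1
    return "".join(masked)


print(mask_low_complexity("TCGATCGTATATATACGGGGGGTA"))
```

With the default thresholds, the example sequence from the slide is masked in the same two regions shown above.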
Heuristic strategy
• Remove low-complexity regions that are
not useful for meaningful alignments
• Perform efficient search strategies
Preprocess database into a new data structure to
enable fast access
BLAST
Basic Local Alignment Search Tool
• General idea - a good alignment contains
subsequences of high identity (local alignment):
ACGCCCGGGAGCGC
CTGGGCGTATAGCCC
Altschul et al 1990
DNA/RNA vs protein alphabet
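DNA (4)
ATGC
A→T = A→G …. (all substitutions score the same)
RNA (4)
AUGC
A→U = A→G …. (all substitutions score the same)
Protein (20)
ACDEFGHIKLMNPQRSTVWY
A→G >> A→W …. (some substitutions score much higher than others)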
WHY is it different?
The 20 Amino Acids
[Figure: the 20 amino acids; G, A, and W are labeled]
Scoring system for amino acid mismatches
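To make the A→G versus A→W comparison concrete, here is a minimal sketch of a substitution-score lookup. The lecture does not say which matrix it uses; the six values below are taken from the widely used BLOSUM62 matrix as an assumption, and the names `SCORES` and `substitution_score` are hypothetical.

```python
# ASSUMPTION: a small excerpt of BLOSUM62; the slides do not name a matrix.
SCORES = {
    ("A", "A"): 4,
    ("G", "G"): 6,
    ("W", "W"): 11,
    ("A", "G"): 0,
    ("A", "W"): -3,
    ("G", "W"): -2,
}


def substitution_score(a, b):
    """Look up a symmetric substitution score for residues a and b."""
    if (a, b) in SCORES:
        return SCORES[(a, b)]
    return SCORES[(b, a)]


print(substitution_score("A", "G"), substitution_score("A", "W"))
```

Replacing alanine with the small glycine is scored as neutral, while replacing it with the bulky tryptophan is penalized, which is the asymmetry that a DNA-style match/mismatch score cannot express.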
BLAST
Basic Local Alignment Search Tool
• General idea - a good alignment contains
subsequences of high identity (local alignment):
ACGCCCGGGAGCGC
CTGGGCGTATAGCCC
– First, identify (most efficiently) short almost exact
matches.
– Next, extend them to longer regions of similarity.
– Finally, optimize the alignment using an exact
algorithm.
Altschul et al 1990
BLAST
(Protein Sequence Example)
First, identify (most efficiently) short almost exact matches
between the query sequence and the database.
Query sequence …FSGTWYA…
Words of length 3: FSG, SGT, GTW, TWY, WYA
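Splitting the query into overlapping words can be sketched in a few lines. The function name `words` and the default word length `k=3` are choices made for this example.

```python
def words(seq, k=3):
    """Return all overlapping words of length k, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


print(words("FSGTWYA"))
```

A sequence of length L yields L - k + 1 words, so the 7-residue example gives the 5 words listed above.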
BLAST
Preprocessing of the database
Seq 1 FSGTWYA
Seq 2 FDRTSYV
Seq 3 SWRTYVA
…….
Seq 1: FSG, SGT, GTW, TWY, WYA
Seq 2: FDR, DRT, RTS, TSY, SYV
Seq 3: SWR, WRT, RTY, TYV, YVA
BAG OF WORDS
Seq 102, Seq 3546
FSG    SGT    GTW    TWY    WYA
YSG    TGT    ATW    SWY    WFA
FTG..  SVT.   GSW.   TWF..  WYS….
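One way to picture the preprocessing step is as a lookup table from each word to the places where it occurs. The sketch below builds such a table for the three example sequences; the dictionary layout and the name `build_word_index` are assumptions for illustration, not the data structure BLAST itself uses.

```python
from collections import defaultdict

DATABASE = {
    "Seq 1": "FSGTWYA",
    "Seq 2": "FDRTSYV",
    "Seq 3": "SWRTYVA",
}


def build_word_index(database, k=3):
    """Map every word of length k to a list of (sequence name, position)."""
    index = defaultdict(list)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i))
    return dict(index)


index = build_word_index(DATABASE)
print(index["GTW"])
```

The table is built once per database, so each later query only pays for dictionary lookups rather than a scan of every sequence.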
BLAST
Query sequence …FSGTWYA…
Words of length 3: FSG, SGT, GTW, TWY, WYA…
DATABASE
FSG  YSG  FTG
SGT  TGT  SVT
GTW  ATW  GSW
TWY  SWY  TWF
WYA  WFA  WYS….
SEQ N INVIEIAFDGTWTCATTNAMHEWASNINETEEN
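The lookup of query words in the database can be sketched as below, using the query and SEQ N from the slide. For simplicity this sketch finds exact word matches only; matching similar words such as YSG or FTG would additionally need a scoring matrix and a threshold. The name `find_seeds` is hypothetical.

```python
def find_seeds(query, subject, k=3):
    """Return (word, query position, subject position) for shared words."""
    # Index the subject once: word -> list of positions.
    positions = {}
    for j in range(len(subject) - k + 1):
        positions.setdefault(subject[j:j + k], []).append(j)
    # Look up each query word in the index.
    seeds = []
    for i in range(len(query) - k + 1):
        word = query[i:i + k]
        for j in positions.get(word, []):
            seeds.append((word, i, j))
    return seeds


QUERY = "FSGTWYA"
SEQ_N = "INVIEIAFDGTWTCATTNAMHEWASNINETEEN"
print(find_seeds(QUERY, SEQ_N))
```

Here the only shared word is GTW, which becomes the seed for the extension step.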
BLAST
2. Extend word pairs as much as possible,
i.e., as long as the total score increases
High-scoring Segment Pairs (HSPs)
Q: FIRSTLINIHFSGTWYAAMESIRPATRICKREAD
D: INVIEIAFDGTWTCATTNAMHEWASNINETEEN
3. Finally, optimize the alignment using an
exact algorithm.
Q= query sequence, D= sequence in database
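The extension step can be sketched as an ungapped extension in both directions from the seed. The scoring used here (match = +2, mismatch = -1) and the drop-off threshold `x_drop` that decides when to give up are assumptions for illustration; real BLAST scores proteins with a substitution matrix. The function name `extend_seed` is hypothetical.

```python
def extend_seed(query, subject, q_pos, s_pos, k=3,
                match=2, mismatch=-1, x_drop=3):
    """Extend a seed without gaps; return (q_start, s_start, length, score)."""

    def pair_score(a, b):
        return match if a == b else mismatch

    def extend(q, s, step):
        # Walk outward; remember the best gain and how far it reached.
        best = current = best_len = length = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            current += pair_score(query[q], subject[s])
            length += 1
            if current > best:
                best, best_len = current, length
            elif best - current > x_drop:
                break  # score has dropped too far below the best seen
            q += step
            s += step
        return best, best_len

    seed_score = sum(pair_score(query[q_pos + i], subject[s_pos + i])
                     for i in range(k))
    right_gain, right_len = extend(q_pos + k, s_pos + k, +1)
    left_gain, left_len = extend(q_pos - 1, s_pos - 1, -1)
    return (q_pos - left_len, s_pos - left_len,
            left_len + k + right_len, seed_score + left_gain + right_gain)


Q = "FIRSTLINIHFSGTWYAAMESIRPATRICKREAD"
D = "INVIEIAFDGTWTCATTNAMHEWASNINETEEN"
q_start, d_start, length, score = extend_seed(Q, D, Q.find("GTW"), D.find("GTW"))
print(Q[q_start:q_start + length])
print(D[d_start:d_start + length])
print(score)
```

Under these assumed scores the GTW seed grows to the left into the segment pair INIHFSGTW / IEIAFDGTW and does not grow to the right, where the residues stop matching.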
Running BLAST to predict a
function of a new protein
>Arrestin protein (C. elegans)
MFIANNCMPQFRWEDMPTTQINIVLAEPRCMAGEFFNAKVLLDSSDPDTVVHSFCAEIKG
IGRTGWVNIHTDKIFETEKTYIDTQVQLCDSGTCLPVGKHQFPVQIRIPLNCPSSYESQF
GSIRYQMKVELRASTDQASCSEVFPLVILTRSFFDDVPLNAMSPIDFKDEVDFTCCTLPF
GCVSLNMSLTRTAFRIGESIEAVVTINNRTRKGLKEVALQLIMKTQFEARSRYEHVNEKK
LAEQLIEMVPLGAVKSRCRMEFEKCLLRIPDAAPPTQNYNRGAGESSIIAIHYVLKLTAL
PGIECEIPLIVTSCGYMDPHKQAAFQHHLNRSKAKVSKTEQQQRKTRNIVEENPYFR
How to interpret a BLAST score:
•The score is a measure of the similarity of the
query to the sequence shown.
How do we know if the score is significant?
-Statistical significance
-Biological significance
How to interpret a BLAST search:
For each blast score we can calculate an
expectation value (E-value)
The expectation value E-value is the number of alignments
with scores greater than or equal to score S
that are expected to occur by chance in a
database search.
page 105
BLAST E-value:
E = K · m · n · e^(−λs)
• Increases linearly with the length of the query sequence (m)
• Increases linearly with the length of the database (n)
• Decreases exponentially with the score of the alignment (s)
m = length of query; n = length of database; s = score
K, λ: statistical parameters dependent upon the scoring system
and background residue frequencies
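The three dependencies described above can be checked numerically with the standard Karlin-Altschul form E = K · m · n · e^(−λs). The values of K and λ below are placeholders chosen for illustration only; in practice they depend on the scoring system, as the slide notes.

```python
import math

K_PARAM = 0.1   # ASSUMPTION: placeholder value, not from the lecture
LAMBDA = 0.3    # ASSUMPTION: placeholder value, not from the lecture


def e_value(score, query_length, database_length, k_param=K_PARAM, lam=LAMBDA):
    """Expected number of chance alignments scoring at least `score`."""
    return k_param * query_length * database_length * math.exp(-lam * score)


base = e_value(score=50, query_length=200, database_length=10**6)
print(f"baseline:            {base:.3g}")
print(f"query twice as long: {e_value(50, 400, 10**6):.3g}")
print(f"score 10 higher:     {e_value(60, 200, 10**6):.3g}")
```

Doubling the query length doubles E, while raising the score by 10 shrinks E by a factor of e^(10λ).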
What is a Good E-value
(Rule of thumb)
• E-values of less than 0.00001 show that
sequences are almost always related.
• Greater E-values can represent functional
relationships as well.
• Sometimes a real (biological) match has an
E-value > 1
• Sometimes a similar E-value occurs for a
short exact match and a long less exact match
Treating Gaps in BLAST
>Human DNA
CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
>Human mRNA
CATGCGACTGACATCGATCATA
Sometimes corrections to the model are needed to infer biological
significance
Gap Scores
• Standard solution: affine gap model
w_x = g + r(x − 1)
w_x: total gap penalty; g: gap open penalty;
r: gap extend penalty; x: gap length
– Once-off cost for opening a gap
– Lower cost for extending the gap
– Changes required to algorithm
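The affine formula w_x = g + r(x − 1) can be evaluated directly. The sketch below applies it to the lowercase stretch from the Human DNA example; the penalty values (g = 11, r = 1) are assumptions for illustration, and `linear_gap_penalty` is included only to show the contrast with charging every gap position the full opening cost.

```python
def affine_gap_penalty(length, gap_open, gap_extend):
    """Total penalty w_x = g + r(x - 1) for a gap of `length` positions."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + gap_extend * (length - 1)


def linear_gap_penalty(length, per_position):
    """Penalty when every gap position costs the same."""
    return per_position * length


# Lowercase stretch of the Human DNA example that is absent from the mRNA.
GAP = "cgacgtcgatcgatacgactagctagc"
GAP_OPEN, GAP_EXTEND = 11, 1  # ASSUMPTION: illustrative penalty values

print(len(GAP), affine_gap_penalty(len(GAP), GAP_OPEN, GAP_EXTEND),
      linear_gap_penalty(len(GAP), GAP_OPEN))
```

With a low extension cost, one long gap is far cheaper than under the linear model, which suits cases like the DNA versus mRNA comparison where a single long insertion is biologically plausible.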
Gapped BLAST
4. Connect several HSPs by aligning the
sequences in between them:
THEFIRSTLINIHFSGTWYAA____M_ESIRPATRICKREAD
INVIEIAFDGTWTCATTNAMHEW___ASNINETEEN
The Gapped BLAST algorithm allows several segments that are
separated by short gaps to be connected into one alignment
BLAST
BLAST is a family of programs
Query: DNA or protein
Database: DNA or protein