E value - Webcourse

Download Report

Transcript E value - Webcourse

From Pairwise Alignment
to
Database Similarity Search
Pairwise Alignment Summary
Global
• Best score for aligning
the full length sequences
• Dynamic programming
• Algorithm:
Needelman- Wunch
• Table cells are allowed
any score
Local
• Best score for aligning
part of sequences
• Dynamic programming
• Algorithm:
Smith-Waterman
• Table cells never score
below zero
Why do we care to align sequences?
Sequences that are similar probably have the same function
3
Discover Function of a new sequence
new sequence
Sequence Database
?
≈
Similar function
Searching Databases
for similar sequences
Naïve solution: Use exact algorithm to
compare each sequence in the database to
query.
Is this reasonable ??
How much time will it take to calculate?
Complexity for genomes
• Human genome contains 3  109 base pairs
– Searching an mRNA against HG requires ~1013
cells
-Even efficient exact algorithms will be extremely
slow when preformed millions of times even
with parallel computing.
So what can we do?
Searching databases
Solution:
Use a heuristic (approximate) algorithm to
discard most irrelevant sequences and perform
the exact algorithm on the small group of
remaining sequences.
Heuristic strategy
• Remove regions that are not useful for
meaningful alignments
• Preprocess database into new data structure
to enable fast accession
Heuristic strategy
• Remove regions that are not useful for
meaningful alignments
• Preprocess database into new data structure
to enable fast accession
What sequences to remove?
• AAAAAAAAAAA
• ATATATATATATA
• Transposable elements
(LINEs, SINEs)
Low complexity
sequences
Low Complexity Sequences
What's wrong with them?
Produce artificial high scoring alignments.
So what do we do?
We apply Low Complexity masking to the database
and the query sequence
TCGATCGTATATATACGGGGGGTA
Mask
TCGATCGNNNNNNNNCNNNNNNTA
Low Complexity Sequences
Complexity is calculated as:
K=1/L logN(L!/Π ni!)
all i
Where N=4 in DNA (4 bases), L is the length of the sequence
and ni the number of each residue in the sequence
For the sequence GGGG:
For the sequence CTGA:
L! =4x3x2x1=24
ng =4
nc =0
na =0
nt =0
Πni =24x1x1x1=24
L! =4x3x2x1=24
ng =1
nc =1
na =1
nt =1
Πni =1x1x1x1
K
K
=1/4 log4 (24/24)=0
=1/4 log4 (24/1)=0.573
Heuristic strategy
• Remove low-complexity regions that are
not useful for meaningful alignments
• Preprocess database into new data structure
to enable fast accession
Heuristic (approximate solution) Methods:
FASTA and BLAST
• FASTA (Lipman & Pearson 1985)
– First fast sequence searching algorithm for comparing
a query sequence against a database
• BLAST - Basic Local Alignment Search
Technique (Altschul et al 1990)
– improvement of FASTA: Search speed, ease of use,
statistical rigor
FASTA and BLAST
• Common idea - a good alignment contains
subsequences of absolute identity:
– First, identify very short (almost) exact matches.
– Next, the best short hits from the 1st step are extended
to longer regions of similarity.
– Finally, the best hits are optimized using the SmithWaterman algorithm.
FastA (fast alignment)
• Assumption: a good alignment probably matches
some identical ‘words’
• Example: Aligning a query sequence to a database
Database record:
ACTTGTAGATACAAAATGTG
Query sequence:
A-TTGTCG-TACAA-ATCTG
FastA
• Preprocess of all the sequences in the
database. Find short words and organize in
dictionaries.
• Process the query sequence and prepare a
dictionary.
– ATGGCTGCTCAAGT….
ATGG
TGGC
GGCT
…
…
Query
FastA locates regions of the query sequence and the
search set sequence that have high densities of exact
word matches.
For DNA sequences the word length used is 6.
Words
in
seq1
Words in seq2
The 10 highest-scoring sequence regions are saved
and re-scored using a scoring matrix.
seq1
seq2
FastA determines if any of the initial regions
from different diagonals may be joined together
to form an approximate alignment with gaps.
Only non-overlapping regions may be joined.
seq1
seq2
The score for the joined regions is the sum of the
scores of the initial regions minus a joining penalty
for each gap.
seq1
seq2
BLAST
Basic Local Alignment Search Tool
• Developed to be as sensitive as FastA but much
faster.
• Also searches for short words.
– Protein 3 letter words
– DNA 11 letter words.
– Words can be similar, not only identical
BLAST
(Protein Sequence Example)
1. Search the database for matching word pairs (> T)
Example:
…FSGTWYA…
A list of words (w=3) is:
FSG SGT GTW TWY WYA
YSG TGT ATW SWY WFA
FTG SVT GSW TWF WYS
BLAST
(Protein Sequence Example)
1.Search the database for matching word pairs (>T)
2.Extend word pairs as much as possible,
i.e., as long as the total score increases
• Result: High-scoring Segment Pairs (HSPs)
THEFIRSTLINIHFSGTWYAAMESIRPATRICKREAD
INVIEIAFDGTWTCATTNAMHEWASNINETEEN
BLAST
3. Try to connect HSPs by aligning the
sequences in between them:
THEFIRSTLINIHFSGTWYAA____M_ESIRPATRICKREAD
INVIEIAFDGTWTCATTNAMHEW___ASNINETEEN
The Gapped Blast algorithm allows several segments that are
separated by short gaps to be connected together to one alignment
How to interpret a BLAST search:
•The score is a measure of the similarity of the
query to the sequence shown.
How do we know if the score is significant?
-Statistical significance
-Biological significance
Assessing Alignment Significance
Determine probability of alignment occurring at random
No Good
Ideal
Frequency
Random
Related
Score
Score
For each score we can count the probability of getting it by chance
How to interpret a BLAST search:
For each blast score we get an E-value
The expect value E-value is the number of alignments
with scores greater than or equal to score S
that are expected to occur by chance in a
database search.
An E value is related to a probability value p (p-value).
page 105
BLAST- E value:
Increases
linearly with
length of query
sequence
Increases
linearly with
length of
database
Decreases
exponentially
with score of
alignment
m = length of query ; n= length of database ; s= score
–K ,λ: statistical parameters dependent upon scoring system
and background residue frequencies
From raw scores to bit scores
• Bit
scores S’ are normalized and are
comparable in different databases
The E value corresponding to a given bit
score is:
E = mn 2 -S’
page 106
What is a Good E-value
(Thumb rule)
• E values of less than 0.00001 show that
sequences are almost always homologues.
• Greater E values, can represent homologues
as well.
• Generally the decision whether an E-value
is biologically significant depends on the
size of database that is searched
• Sometimes a real match has an E value > 1
• Sometimes a similar E value occurs for a
short exact match and long less exact match
Treating Gaps in BLAST
>Human DNA
CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
>Human mRNA
CATGCGACTGACATCGATCATA
Biologically, indels occur in groups we want our gap score to
reflect this
Gap Scores
• Standard solution: affine gap model
wx = g + r(x-1)
wx : total gap penalty; g: gap open penalty;
r: gap extend penalty ;x: gap length
– Once-off cost for opening a gap
– Lower cost for extending the gap
– Changes required to algorithm
Significance of Gapped
Alignments
• Gapped alignments use same statistics
•  and K cannot be easily estimated
• Empirical estimations and gap scores
determined by looking at random
alignments
BLAST
BLAST is a family of programs
Query:
DNA
Protein
Database:
DNA
Protein
Choose the BLAST program
Program Input
Database
1
blastn
DNA
DNA
1
blastp
protein
protein
6
blastx
DNA
protein
6
tblastn
protein
DNA
36
tblastx
DNA
DNA
Example :The lipocalins (each dot is a protein)
retinol-binding
protein
apolipoprotein D
odorant-binding
protein
Example is taken from
Bioinformatics and Functional Genomics
by Jonathan Pevsner (ISBN 0-471-21004-8).
Copyright © 2003 by John Wiley & Sons, Inc.
BLAST search with PAEP as a query
finds many other lipocalins
Assessing whether proteins are homologous
RBP4 and PAEP:
Low bit score, E value 0.49, 24% identity
but they are indeed homologous.