lecture03_08

Download Report

Transcript lecture03_08

From Pairwise Alignment
to
Database Similarity Search
Outline
• Summary Local and Global alignments
• FASTA and BLAST algorithms
• Evaluating significance of alignments
Pairwise Alignment Summary
Global
• Best score for aligning
the full length sequences
• Dynamic programming
• Algorithm:
Needelman- Wunch
• Table cells are allowed
any score
Local
• Best score for aligning
part of sequences
• Dynamic programming
• Algorithm:
Smith-Waterman
• Table cells never score
below zero
Available tools for Sequence
Alignments
ALIGN -- GLOBAL (N-W)/ LOCAL (S-W)
BLAST2SEQ – Only LOCAL using word match
spidey -- aligns mRNAs to genomic sequence
est2genome -- aligns ESTs to genomic
sequence
Gap Scores
>Human DNA
CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
>Human mRNA
CATGCGACTGACATCGATCATA
Biologically, indels occur in groups we want our gap score to
reflect this
Gap Scores
• Standard solution: affine gap model
– Once-off cost for opening a gap
– Lower cost for extending the gap
– Changes required to algorithm
Affine Gap Penalty
• wx = g + r(x-1)
• wx : total gap penalty; g: gap open penalty; r: gap
extend penalty;x: gap length
• gap penalty chosen
– Gaps not excluded
– Gaps not over included
– Typical Values: g=-12; r = -4
Is this good enough ???
Drawbacks to DP Approaches
• Compute intensive
• Memory Intensive
Complexity
• Complexity is determined by size of table
– Aligning a sequence of length m against one of
length n requires calculating (m  n) cells
• Estimate: we calculate 108 cells per second
– Aligning two mRNA sequences of 8,000 bp
requires 64,000,000 cells
– Aligning an mRNA and a 107 bp chromosome
requires ~1011 cells
Searching databases
•
Goal: Find homologue sequences in
database to query input.
Sequence Database
new sequence
?
≈
similar
Similar function
function
Searching databases
•
Goal: Find homologue sequences in
database to query input.
•
Naïve solution:
Use exact algorithm to compare each
sequence in the database to query.
Complexity for genomes
• Human genome contains 3  109 base pairs
– Searching an mRNA against HG requires ~1013
cells
-Even efficient exact algorithms will be extremely
slow when preformed millions of times.
-Running the computations in parallel is expensive.
So what can we do?
Searching databases
•
Solutions:
1. Use a heuristic (approximate) algorithm to
discard most irrelevant sequences.
2. Perform the exact algorithm on the small
group of remaining sequences.
Heuristic strategy
• Homologous sequences are expected to
contain common short segments (probably
with substitutions, but without ins/dels)
• Preprocess database (DB) into new data
structure to enable fast accession
• Remove low-complexity regions that are
not useful for meaningful alignments
Low Complexity Sequences
• AAAAAAAAAAA
• ATATATATATATA
• Transposable elements
(LINEs, SINEs)
Low Complexity Sequences
Whats wrong with them?
produce artificial high scoring alignments.
So what do we do?: We apply Low Complexity
masking to the query sequence
TCGATCGTATATATACGGGGGGTA
Mask
TCGATCGNNNNNNNNCNNNNNNTA
Low Complexity Sequences
Complexity is calculated as:
K=1/L logN(L!/Π ni!)
all i
Where N=4 in DNA (4 bases), L is the length of the sequence
And ni the number of each residue in the sequence
For the sequence GGGG:
For the sequence CTGA:
L! =4x3x2x1=24
ng =4
nc =0
na =0
nt =0
Πni =24x1x1x1=24
L! =4x3x2x1=24
ng =1
nc =1
na =1
nt =1
Πni =1x1x1x1
K
K
=1/4 log4 (24/24)=0
=1/4 log4 (24/1)=0.573
Heuristic (approximate solution) Methods:
FASTA and BLAST
• FASTA (Lipman & Pearson 1985)
– First fast sequence searching algorithm for comparing
a query sequence against a database
• BLAST - Basic Local Alignment Search
Technique (Altschul et al 1990)
– improvement of FASTA: Search speed, ease of use,
statistical rigor
– Gapped BLAST (Altschul et al 1997)
FASTA and BLAST
• Common idea - a good alignment contains
subsequences of absolute identity:
– First, identify very short (almost) exact matches.
– Next, the best short hits from the 1st step are extended
to longer regions of similarity.
– Finally, the best hits are optimized using the SmithWaterman algorithm.
FastA (fast alignment)
• Assumption: a good alignment probably matches
some identical ‘words’
• Example: Aligning a query sequence to a database
Database record:
ACTTGTAGATACAAAATGTG
Query sequence:
A-TTGTCG-TACAA-ATCTG
Matching words of size 4
FastA
• Preprocess of all the sequences in the
database. Find short words and organize in
dictionaries.
• Process the query sequence and prepare a
dictionary.
– ATGGCTGCTCAAGT….
ATGG
TGGC
GGCT
…
…
Query
FastA locates regions of the query sequence and the
search set sequence that have high densities of exact
word matches.
For DNA sequences the word length used is 6.
seq1
seq2
The 10 highest-scoring sequence regions are saved
and re-scored using a scoring matrix.
seq1
seq2
FastA determines if any of the initial regions
from different diagonals may be joined together
to form an approximate alignment with gaps.
Only non-overlapping regions may be joined.
seq1
seq2
The score for the joined regions is the sum of the
scores of the initial regions minus a joining penalty
for each gap.
seq1
seq2
FastA final stage
• Apply an exact algorithm of local alignment
on surviving records, computing the final
alignment score.
• Calculate an Alignment score (S)
• Evaluate the statistical significance
Assessing Alignment Significance
Determine probability of alignment occurring at random
Random
Related
No Good
Ideal
FastA at EMBL
FastA
• A set of programs for database searching.
FastA EMBL
FASTA SEQUENCE FORMAT
This format contains a one line header followed
by lines of sequence data.
•Sequences in fasta formatted files are preceded by a line starting
with a" >" symbol.
•The first word on this line is the name of the sequence.
•The remaining lines contain the sequence itself.
•Blank lines in a FASTA file are ignored,
FastA at EMBL
• Output, general information:
–Z’ score = deviation (in sd) of the actual
score from the mean of random scores
Z=(x-mean)/sd
– Opt: the number of optimized scores observed.
Lower limit
– E( ): the number of sequences expected in the score
range.
FastA at EMBL
FastA at EMBL
BLAST
Basic Local Alignment Search Tool
• Developed to be as sensitive as FastA but much
faster.
• Also searches for short words.
– Protein 3 letter words
– DNA 11 letter words.
– Words can be similar, not only identical
Word Search -BLAST
•
•
•
•
Identity - CAT : CAT
Similarity – CAT : CAT, CAY, HAT …
But even CAT : XTX can be similar
For each three letter word there are many
similar words (depending on the alphabet).
• Similar words are only the ones that have a
minimum cut-off score (T).
Y= C or T
H= A, C, T
X=A or T or C or G
BLAST
• Find matching word pairs
• Extend word pairs as much as possible,
i.e., as long as the total weight increases
• Result: High-scoring Segment Pairs (HSPs)
THEFIRSTLINIHAVEADREAMESIRPATRICKREAD
INVIEIAMDEADMEATTNAMHEWASNINETEEN
BLAST
• Try to connect HSPs by aligning the
sequences in between them:
THEFIRSTLINIHAVEADREA____M_ESIRPATRICKREAD
INVIEIAMDEADMEATTNAMHEW___ASNINETEEN
The Gapped Blast algorithm allows several segments that are
separated by short gaps to be connected together to one alignment
Score and E-value
•The score is a measure of the similarity of the
query to the sequence shown.
•The E-value is a measure of the reliability of
the score.
•E-value is the probability due to chance, that
there is another alignment with a similarity
greater than the given S score.
BLAST- Score
Score (S): (identities + mismatches)-gaps
Bit score (S) :
– Similar to alignment score
– Normalized
– Higher means more significant
BLAST- E value:
Expected # of alignments with score at least S
Increases
linearly with
length of query
sequence
Increases
linearly with
length of
database
Decreases
exponentially
with score of
alignment
m = length of query ; n= length of database ; s= score
–K ,λ: statistical parameters dependent upon scoring system
and background residue frequencies
What is a Good E-value
- thumb rules
• E values of less than 0.00001 show that
sequences are almost always homologues.
• Greater E values, can represent homologues
as well.
• Generally the decision whether an E-value
is biologically significant depends on the
size of database that is searched
Significance of Gapped
Alignments
• Gapped alignments use same statistics
•  and K cannot be easily estimated
• Empirical estimations and gap scores
determined by looking at random
alignments
BLAST
• Blast is a family of programs:
BlastN, BlastP, BlastX, tBlastN, tBlastX
•
•
•
•
•
Query:
DNA
Protein
Database:
DNA
Protein
BlastN - nt versus nt database
BlastP - protein versus protein database
BlastX - translated nt versus protein database
tBlastN - protein versus translated nt database
tBlastX - translated nt versus translated nt database
BLAST at NCBI
•
Output
–
–
–
Graphical out put of top results
The alignments for top scores
Scores for each alignment:
1. E value
2. Bits score: a score normalized with respect to the
scoring system. Can be used to compare different
searches.