lecture05_06

Introduction to Bioinformatics
From Pairwise to
Multiple Alignment
Outline
• Advances in BLAST
• Multiple Sequence Alignment: CLUSTAL
Scoring system for BLAST
Substitution Matrix +
Gap Penalty
Substitution Matrix
• BLOSUM matrices are based on the
replacement patterns found in more highly
conserved regions of the sequences without
gaps
• PAM matrices are based on mutations observed
throughout a global alignment, which includes
both highly conserved and highly mutable
regions
Gap penalty
• Example showed -1 score per indel
– So gap cost is proportional to its length
• Biologically, indels occur in groups
– We want our gap score to reflect this
• Standard solution: affine gap model
– Once-off cost for opening a gap
– Lower cost for extending the gap
– Changes required to algorithm
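The difference between the two gap models can be sketched in a few lines. This is a minimal illustration, assuming one common convention in which a gap of length k costs one opening penalty plus (k - 1) extension penalties; the values -5 and -1 and the function names are illustrative assumptions, not taken from the lecture.

```python
def linear_gap_score(length, per_indel=-1):
    """Linear model: every indel costs the same, so cost grows with length."""
    return per_indel * length


def affine_gap_score(length, gap_open=-5, gap_extend=-1):
    """Affine model: once-off opening cost, then a lower cost per extension.

    Assumed convention: the opening penalty covers the first gap position.
    """
    if length == 0:
        return 0
    return gap_open + gap_extend * (length - 1)


one_long_gap = affine_gap_score(4)          # a single gap of length 4
four_short_gaps = 4 * affine_gap_score(1)   # four separate gaps of length 1
print(one_long_gap, four_short_gaps)
```

Under the affine model one long gap scores better than the same number of indels scattered as separate gaps, which is the grouping behaviour the bullets ask for.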
Statistical significance
E-value
• The number of hits (with the same similarity score) one can
"expect" to see just by chance when searching the given
string in a database of a particular size.
• The higher the E-value, the less significant the similarity
– “sequences with E-value of less than 0.01 are almost always
found to be homologous”
• The lower bound is normally 0 (the best hits have E-values closest to 0)
Expectation Values
The E-value:
• Increases linearly with the length of the query sequence
• Increases linearly with the length of the database
• Decreases exponentially with the score of the alignment
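These three trends match the standard Karlin-Altschul form E = K * m * n * exp(-lambda * S), with m the query length, n the database length and S the alignment score. The sketch below only demonstrates the trends; the values of K and lambda depend on the scoring system, and the ones used here are illustrative assumptions, as is the function name.

```python
import math


def expected_hits(score, query_len, db_len, k=0.13, lam=0.32):
    """E = K * m * n * exp(-lambda * S); k and lam are illustrative values."""
    return k * query_len * db_len * math.exp(-lam * score)


base = expected_hits(score=50, query_len=300, db_len=1_000_000)
longer_query = expected_hits(score=50, query_len=600, db_len=1_000_000)
bigger_db = expected_hits(score=50, query_len=300, db_len=2_000_000)
higher_score = expected_hits(score=60, query_len=300, db_len=1_000_000)

print(longer_query / base)   # doubling the query length doubles E
print(bigger_db / base)      # doubling the database doubles E
print(higher_score / base)   # ten extra score points shrink E by exp(-lam * 10)
```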
Remote homologues
• Sometimes BLAST isn’t enough.
• For a large protein family, BLAST may return
only the close members, while we also want the
more distant members
PSI-BLAST
• Position Specific Iterated BLAST
– Regular BLAST search
– Construct a profile from the BLAST results
– BLAST search with the profile
– Final results
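The "construct profile" step can be pictured as counting, for each alignment column, how often each residue occurs among the hits. The sketch below is a deliberate simplification: the real PSI-BLAST profile is a log-odds matrix with sequence weighting and pseudocounts, none of which is modelled here. The toy hit sequences and function names are assumptions made for illustration.

```python
from collections import Counter


def build_profile(aligned_hits):
    """Per-column residue frequencies from equal-length aligned hits."""
    profile = []
    for column in zip(*aligned_hits):
        residues = [r for r in column if r != "-"]
        counts = Counter(residues)
        total = len(residues)
        profile.append({r: c / total for r, c in counts.items()} if total else {})
    return profile


def score_against_profile(sequence, profile):
    """Sum the profile frequency of each residue at its position."""
    return sum(col.get(res, 0.0) for res, col in zip(sequence, profile))


hits = ["NKYLS", "NKFLS", "NRYLS"]   # toy hits from a first search round
profile = build_profile(hits)

for candidate in ["NKYLS", "NRFLS", "AAAAA"]:
    print(candidate, round(score_against_profile(candidate, profile), 2))
```

A candidate such as NRFLS matches no single hit exactly but still scores well against the profile, which is how a profile search can reach more distant family members.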
PSI-BLAST
• Advantage: PSI-BLAST looks for sequences that
are close to ours, and learns from them to
extend the circle of friends
• Disadvantage: if a WRONG sequence is picked
up, the search is led to unrelated
sequences. This gets worse with each
iteration
Multiple Sequence Alignment
MSA
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG-
ATLVCLISDFYPGA--VTVAWKADS-
AALGCLVKDYFPEP--VTVSWNSG--
VSLTCLVKGFYPSD--IAVEWWSNG--
Like pairwise alignment BUT compare n
sequences instead of 2
Rows represent individual sequences
Columns represent ‘same’ position
May be gaps in some sequences
Why multiple alignments?
BLAST usually obtains many sequences
that are significantly similar to the query
sequence
Practically
Comparing each and every sequence to
every other may be impractical when the
number of sequences is large
Solution
generating a profile
MSA
MSA can give you a better picture of functional sites on
proteins and nucleic acids as well as the forces that shape
evolution!
• Important amino acids or nucleotides are not allowed to mutate
• Less important positions change more easily
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGSSSNIGS--ITVNWYQQLPG
LRLSCTGSGFIFSS--YAMYWYQQAPG
LSLTCTGSGTSFDD-QYYSTWYQQPPG
Alignment Example
GTCGTAGTCGGCTCGAC
GTCTAGCGAGCGTGAT
GCGAAGAGGCGAGC
GCCGTCGCGTCGTAAC
GTCGTAGTCG-GC-TCGAC
GTC-TAG-CGAGCGT-GAT
GC-GAAG-AG-GCG-AG-C
GCCGTCG-CG-TCGTA-AC
Column score: 4/4 = 1, 3/4 = 0.75, 2/4 = 0.5, 1/4 = 0
Sequences without gaps: 1*1 + 2*0.75 + 11*0.5 = 8
Alignment with gaps: 4*1 + 11*0.75 + 2*0.5 = 13.25
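The 13.25 total for the gapped alignment can be checked with a short script. This is a sketch that assumes each column is scored by the count of its most common residue, with gap characters ignored, mapped through the column-score rule above; the function and constant names are hypothetical.

```python
from collections import Counter

# Score per column, keyed by how many of the 4 rows share the most common residue.
COLUMN_SCORES = {4: 1.0, 3: 0.75, 2: 0.5, 1: 0.0, 0: 0.0}


def alignment_score(rows):
    """Sum the column scores of an alignment of four equal-length rows."""
    total = 0.0
    for column in zip(*rows):
        residues = [c for c in column if c != "-"]   # gaps do not count as matches
        most_common = Counter(residues).most_common(1)[0][1] if residues else 0
        total += COLUMN_SCORES[most_common]
    return total


gapped = [
    "GTCGTAGTCG-GC-TCGAC",
    "GTC-TAG-CGAGCGT-GAT",
    "GC-GAAG-AG-GCG-AG-C",
    "GCCGTCG-CG-TCGTA-AC",
]
print(alignment_score(gapped))  # 13.25
```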
Example of 3 sequences:
Dynamic Programming
• Pairwise A–B alignment table
– Cell (i,j) = score of best alignment between first i
elements of A and first j elements of B
– Complexity: length of A × length of B
• 3-way A–B–C alignment table
– Cell (i,j,k) = score of best alignment between first i
elements of A, first j of B, first k of C
– Complexity: length A × length B × length C
• Example: protein family alignment
– 100 proteins, 1000 amino acids each
– Complexity: 1000^100 = 10^300 table cells
– Calculation time: beyond the big bang!
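The growth in table size is easy to reproduce: the table has one axis per sequence, so the cell count is the product of the sequence lengths. A minimal sketch (the function name is hypothetical; Python integers have arbitrary precision, so the large product is exact):

```python
import math


def table_cells(lengths):
    """Number of cells in a DP table with one axis per sequence."""
    return math.prod(lengths)


pairwise = table_cells([1000, 1000])
three_way = table_cells([1000, 1000, 1000])
family = table_cells([1000] * 100)     # 100 proteins, 1000 amino acids each

print(pairwise)
print(three_way)
print(len(str(family)) - 1)            # power of ten of the family table size
```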
Feasible Approach
• Based on pairwise alignment scores
– Build n by n table of pairwise scores
• Align similar sequences first
– After alignment, consider as single sequence
– Continue aligning with further sequences
– For n sequences, there are n(n-1)/2 pairs
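By contrast, the number of pairwise alignments needed for the score table stays small. A brief sketch that enumerates the pairs and checks the n(n-1)/2 count (the function name is hypothetical):

```python
from itertools import combinations


def sequence_pairs(names):
    """All unordered pairs of sequences; each pair is aligned once."""
    return list(combinations(names, 2))


four = sequence_pairs(["1", "2", "3", "4"])
hundred = sequence_pairs(range(100))

print(four)
print(len(four), len(hundred))
```

For the 100-protein family this is 4950 pairwise alignments, each a table of about 1000 × 1000 cells.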
GTCGTAGTCG-GC-TCGAC
GTC-TAG-CGAGCGT-GAT
GC-GAAG-AG-GCG-AG-C
GCCGTCG-CG-TCGTA-AC
Sequences 1 to 4, top to bottom
GTCGTAGTCG-GC-TCGAC
GTC-TAG-CGAGCGT-GAT
GC-GAAGAGGCG-AGC
GCCGTCGCGTCGTAAC
Sequences 1 to 4, top to bottom
GTCGTA-GTCG-GC-TCGAC
GTC-TA-G-CGAGCGT-GAT
G-C-GAAGA-G-GCG-AG-C
G-CCGTCGC-G-TCGTAA-C
CLUSTAL method
Progressive Sequence Alignment
• Higgins and Sharp 1988
– ref: CLUSTAL: a package for performing
multiple sequence alignment on a
microcomputer. Gene, 73, 237–244.
An approximation strategy (heuristic
algorithm) yields a possible alignment, but
not necessarily the best one
First step:
Compute the pairwise alignments for all sequences (A, B, C, D)
against all; the similarities are stored in a table
    A   B   C   D
A
B   11
C   3   1
D   2   2   10
Second step:
Cluster the sequences to create a tree
• Represents the order in which pairs of
sequences are to be aligned
• Similar sequences are neighbors in the
tree
• Distant sequences are distant from each
other in the tree
    A   B   C   D
A
B   11
C   3   1
D   2   2   10
[Figure: tree with leaves A, B, C, D]
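One way to turn the similarity table into a tree is to repeatedly merge the two most similar clusters. The sketch below uses greedy average-linkage clustering as a stand-in; the lecture does not state which clustering method CLUSTAL applies, so treat the method and all function names as assumptions. The similarity values are the ones in the table.

```python
from itertools import combinations


def leaves(tree):
    """Flatten a nested-tuple tree into its list of sequence names."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def average_similarity(tree_a, tree_b, similarity):
    """Mean pairwise similarity between the members of two clusters."""
    pairs = [(a, b) for a in leaves(tree_a) for b in leaves(tree_b)]
    return sum(similarity[frozenset(p)] for p in pairs) / len(pairs)


def build_guide_tree(names, similarity):
    """Merge the most similar pair of clusters until one tree remains."""
    clusters = list(names)
    while len(clusters) > 1:
        best = max(
            combinations(clusters, 2),
            key=lambda pair: average_similarity(pair[0], pair[1], similarity),
        )
        clusters = [c for c in clusters if c not in best]
        clusters.append(best)
    return clusters[0]


similarity = {
    frozenset("AB"): 11, frozenset("AC"): 3, frozenset("BC"): 1,
    frozenset("AD"): 2, frozenset("BD"): 2, frozenset("CD"): 10,
}
tree = build_guide_tree("ABCD", similarity)
print(tree)  # (('A', 'B'), ('C', 'D'))
```

The nesting gives the alignment order: A with B first, then C with D, then the two resulting alignments with each other.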
Join alignments
NYLS + NKYLS → N K/- Y L S
NFS + NFLS → N F L/- S
(N K/- Y L S) + (N F L/- S) → N K/- Y/F L/- S
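The slash notation lists, column by column, every symbol that occurs in the alignment built so far. The sketch below reproduces the three summaries; the gapped rows are an assumption chosen to be consistent with the summaries, since only the summaries themselves appear on the slide, and the function name is hypothetical.

```python
def column_summary(rows):
    """Summarise an alignment as the symbols seen in each column.

    Residues are listed in order of first appearance; a gap, if present,
    is listed last, following the notation used on the slide.
    """
    summary = []
    for column in zip(*rows):
        seen = []
        for symbol in column:
            if symbol != "-" and symbol not in seen:
                seen.append(symbol)
        if "-" in column:
            seen.append("-")
        summary.append("/".join(seen))
    return " ".join(summary)


left = ["N-YLS", "NKYLS"]                       # NYLS aligned with NKYLS
right = ["NF-S", "NFLS"]                        # NFS aligned with NFLS
joined = ["N-YLS", "NKYLS", "N-F-S", "N-FLS"]   # the two alignments joined

print(column_summary(left))
print(column_summary(right))
print(column_summary(joined))
```

In the joined alignment the gaps already present in each sub-alignment are kept, and a new gap column is added to the second pair so that its columns line up with the first.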
Treating Gaps in ClustalW
• Penalty for opening gaps and additional
penalty for extending the gap
• Gaps found in initial alignment remain
fixed
• New gaps are introduced as more sequences
are added (the penalty is decreased where a gap already exists)
• Gap penalties are decreased within stretches of
hydrophilic residues
MSA Approaches
• Progressive approach
– CLUSTALW (CLUSTALX): http://www.ebi.ac.uk/clustalw/
– PILEUP
– T-COFFEE
• Iterative approach: repeatedly realign subsets of sequences
– MultAlin, DiAlign
• Statistical methods: Hidden Markov Models
– SAM2K
• Genetic algorithm
– SAGA