Lesson06MultipleSequenceAlignments
Doug Raiford
Lesson 5
Dynamic programming methods
Needleman-Wunsch (global alignment)
Smith-Waterman (local alignment)
BLAST
Fixed: best
Linear: next best
Polynomial (n^2): not bad
Exponential (3^n): very bad
BLAST fast (linear)
But not as sensitive
Speed
Sensitivity
Similarity matrix
Especially with amino
acids
Some amino acids have
similar chemical
characteristics
Similarity to all 8,000 3mers calculated
Usually ~50 are above a
threshold
All of these ~50 are
considered hits when
searching
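The neighborhood-word idea above can be sketched as follows. This is a toy illustration, not BLAST's actual scoring: the chemical groupings in GROUPS, the score values in toy_score, and the threshold are all assumptions made up for the example (a real search would use a PAM or BLOSUM matrix). It scores a query 3-mer against all 8,000 possible 3-mers and keeps those at or above the threshold.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical chemical groupings, used only to build a toy similarity score.
GROUPS = ["AVLIM", "FWY", "ST", "NQ", "DE", "KRH", "C", "G", "P"]
GROUP_OF = {aa: i for i, group in enumerate(GROUPS) for aa in group}


def toy_score(a, b):
    """Toy similarity: identical > same chemical group > unrelated."""
    if a == b:
        return 5
    if GROUP_OF[a] == GROUP_OF[b]:
        return 2
    return -2


def neighborhood(word, threshold):
    """Return every k-mer whose summed score against `word` meets the threshold."""
    hits = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(toy_score(a, b) for a, b in zip(word, candidate))
        if score >= threshold:
            hits.append("".join(candidate))
    return hits


# All words similar enough to the query word "LKD" count as hits.
hits = neighborhood("LKD", threshold=9)
```

With these toy scores the neighborhood is small; how many words pass depends entirely on the matrix and threshold chosen.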
Matrices
PAM (Point Accepted Mutation)
Built from observed substitution
rates in closely related proteins
BLOSUM (BLOcks SUbstitution Matrix)
Built from observed substitution
rates in evolutionarily divergent
proteins
PSI-BLAST (Position
Specific Iterative)
Align using default
similarity matrix
At each query location
build a Position Specific
Scoring Matrix (PSSM)
based upon observed
search and alignment
results
Repeat with new matrix
until results no longer
change
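One way to picture the PSSM-building step is the sketch below. It assumes the hits have already been aligned to the query (equal-length strings, gaps as '-'), uses a uniform background frequency, and adds a simple pseudocount; these are simplifying assumptions, not PSI-BLAST's actual weighting scheme. The search-and-repeat loop is left out because it needs a database search.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_hits, pseudocount=1.0):
    """Build one dict of log2-odds scores per query position.

    aligned_hits: equal-length strings aligned to the query, gaps as '-'.
    Background frequencies are assumed uniform (1/20) for simplicity.
    """
    background = 1.0 / len(ALPHABET)
    pssm = []
    for column in zip(*aligned_hits):
        # Count observed residues in this column, ignoring gaps.
        counts = Counter(c for c in column if c in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return pssm


def score(pssm, seq):
    """Sum the position-specific scores of `seq` (ungapped, same length)."""
    return sum(column[aa] for column, aa in zip(pssm, seq))


example_hits = ["MKV", "MRV", "MKI", "MK-"]
example_pssm = build_pssm(example_hits)
```

Unlike a single similarity matrix, the score for K at position 2 here differs from the score for K at any other position.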
PSI-BLAST
Build sensitivity by specifying
allowed similarity at each position
Slower, but still faster than local
alignment
Central to
bioinformatics
Need for
Phylogeny
Protein function
Protein structure
▪ Structure → function
Drug discovery
Some parts of proteins are very important to
maintain function
Must be similar from species to species
Can we spot these regions through
alignment?
atgccgca-actgccgcaggagatcaggactttcatgaatatcatcatgcgtggga-ttcag
acctcgatacgtgccgcaggagatcaggactttcacct--tggatcatgcgaccgtacctac
Often conserved regions are near active sites
Ligand binding sites (docking)
Protein-to-protein interface
Important regions for tertiary structure
Ligand: small molecule, target of protein,
e.g. O2 is the ligand for hemoglobin
Substrate: a molecule upon which an
enzyme acts
What if we look at more proteins
Increase our confidence?
But how to go about performing multiple
sequence alignment?
atgccgca-actgccgcaggagatcaggactttcatgaatatcatcatgcgtggga-ttcag
acctccatacgtgccccaggagatctggactttcacc---tggatcatgcgaccgtacctac
t-atgg-t-cgtgccgcaggagatcaggactttca-gt--g-aatcatctgg-cgc--c-aa
t--tcgt-ac-tgccccaggagatctggactttcaaa---ca-atcatgcgcc-g-tc-tat
aattccgtacgtgccgcaggagatcaggactttcag-t--a-tatcatctgtc-ggc--tag
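Spotting the conserved regions in an alignment like the one above can be sketched as a scan for runs of columns where every sequence agrees. The function name conserved_runs, the min_len parameter, and the short toy alignment are assumptions for illustration; real tools usually tolerate some variation rather than requiring exact identity.

```python
def conserved_runs(alignment, min_len=3):
    """Return (start, end) half-open column ranges where all rows agree.

    A column counts as conserved only if no row has a gap there.
    """
    runs = []
    start = None
    n_cols = min(len(row) for row in alignment)
    # Iterate one past the end so a run touching the last column is closed.
    for i in range(n_cols + 1):
        conserved = (
            i < n_cols
            and alignment[0][i] != "-"
            and all(row[i] == alignment[0][i] for row in alignment)
        )
        if conserved and start is None:
            start = i
        elif not conserved and start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    return runs


# Toy alignment (hypothetical, much shorter than the slide example).
toy_alignment = [
    "ac-tgccgcagg",
    "acgtgccccagg",
    "t-gtgccgcagg",
]
runs = conserved_runs(toy_alignment)
```

Adding more sequences can only shrink or keep these runs, which is why more proteins increase confidence that a surviving run is functionally important.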
Hyper-dimensional
dynamic programming
Becomes exponential
with respect to number
of sequences
O(n^L) with L = number of
sequences
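A quick calculation shows why the full dynamic programming approach does not scale. The helper names below are made up for this sketch; it assumes all sequences have the same length n and counts the table with its boundary row, giving (n+1)^L cells, each of which looks back at 2^L - 1 neighboring cells.

```python
def dp_cells(n, num_seqs):
    """Cells in the full DP table for `num_seqs` sequences of length n."""
    return (n + 1) ** num_seqs


def predecessors_per_cell(num_seqs):
    """Each cell considers every non-empty subset of sequences advancing."""
    return 2 ** num_seqs - 1


# Table size for sequences of length 100 as the number of sequences grows.
growth = {L: dp_cells(100, L) for L in range(2, 6)}
```

For two sequences this is the familiar table with three predecessors per cell (diagonal, up, left); each added sequence multiplies the table size by roughly n.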
ClustalW:
Determine all pair-wise distances
Fast: number of l-mer matches
Slower: full global alignments
Cluster-alignment
Starts with the closest pair and aligns them
Then aligns the next closest to those two
And so on..
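The ordering step of the progressive approach can be sketched as below. This is a simplification and an assumption on my part: ClustalW builds a guide tree from the distances, whereas this sketch just picks the closest pair and then repeatedly adds whichever remaining sequence is closest to one already chosen. The distance is the fast variant (shared l-mers, here as a Jaccard distance), and the alignment itself is omitted.

```python
from itertools import combinations


def kmers(seq, k=3):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a, b, k=3):
    """Fast distance: 1 minus the fraction of shared k-mers (Jaccard)."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        return 0.0
    return 1.0 - len(ka & kb) / len(union)


def join_order(seqs, k=3):
    """Order in which sequences would be added to the growing alignment.

    seqs: dict of name -> sequence, with at least two entries.
    """
    names = list(seqs)
    dist = {
        frozenset(pair): kmer_distance(seqs[pair[0]], seqs[pair[1]], k)
        for pair in combinations(names, 2)
    }
    # Start with the closest pair.
    first = min(combinations(names, 2), key=lambda p: dist[frozenset(p)])
    order = list(first)
    remaining = [n for n in names if n not in order]
    # Then add the next closest sequence, and so on.
    while remaining:
        nxt = min(
            remaining,
            key=lambda n: min(dist[frozenset((n, m))] for m in order),
        )
        order.append(nxt)
        remaining.remove(nxt)
    return order


toy_seqs = {
    "a": "ACGTACGTAC",
    "b": "ACGTACGTAA",
    "c": "ACGTTTTTTT",
    "d": "GGGGGGGGGG",
}
order = join_order(toy_seqs)
```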
Profile: matrix of real values, representing
the probability of amino acids at each
position in a corresponding multiple
sequence alignment
A modification of the Smith/Waterman
algorithm
Degree to which an aa is preferred is the degree of
match between the profile and the sequence
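A minimal sketch of a profile and of matching a sequence against it follows. It is deliberately simpler than the approach named above: instead of a modified Smith/Waterman alignment with gaps, it slides the profile along the sequence without gaps and sums log2 probabilities. The function names, the DNA alphabet, and the pseudocount value are assumptions for the example, and the sequence is assumed to contain only alphabet symbols.

```python
import math
from collections import Counter


def build_profile(msa, alphabet="ACGT", pseudocount=0.5):
    """One dict of symbol probabilities per alignment column (gaps ignored)."""
    profile = []
    for column in zip(*msa):
        counts = Counter(c for c in column if c in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        profile.append(
            {s: (counts[s] + pseudocount) / total for s in alphabet}
        )
    return profile


def best_window(profile, seq):
    """Return (start, log2 probability) of the best ungapped match in seq.

    Returns None if seq is shorter than the profile.
    """
    width = len(profile)
    best = None
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        s = sum(math.log2(col[ch]) for col, ch in zip(profile, window))
        if best is None or s > best[1]:
            best = (start, s)
    return best


toy_msa = ["GATC", "GATC", "GTTC", "GATG"]
toy_profile = build_profile(toy_msa)
match = best_window(toy_profile, "TTGATCAA")
```

The more strongly a symbol is preferred at a position, the more that position contributes to the match score.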
Consensus   1 M.ERS.HLPEG.PFAAALSGARFAAQSSGN.ASVL..DWNVLP.E 38
OPSD_XENLA  1 MNG.GTE..EGPN.NFYVP.PMS...SN.NKTGVVRSP.P..PFD 33
Mistakes early in a
progressive approach
propagated
throughout process
Once aligned not
revisited
Iterative methods
devised to revisit
Newest version of
ClustalW (version 2)
includes iteration
Other MSA apps
•T-Coffee
•PSalign
•DIALIGN
•MUSCLE
Height of letter represents how prevalent
that letter is at that position
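Letter heights for a logo can be sketched as below. This assumes the common sequence-logo convention in which each column's total height is its information content in bits and each letter gets a share proportional to its frequency; the small-sample correction used by real logo tools is left out, and the function name and DNA alphabet are choices made for the example.

```python
import math
from collections import Counter


def logo_heights(msa, alphabet="ACGT"):
    """Per column, map each observed letter to its height in bits."""
    max_bits = math.log2(len(alphabet))
    heights = []
    for column in zip(*msa):
        symbols = [c for c in column if c in alphabet]
        if not symbols:
            # Column with only gaps: nothing to draw.
            heights.append({})
            continue
        counts = Counter(symbols)
        freqs = {s: counts[s] / len(symbols) for s in counts}
        entropy = -sum(f * math.log2(f) for f in freqs.values())
        info = max_bits - entropy  # information content of the column
        heights.append({s: f * info for s, f in freqs.items()})
    return heights


# First column fully conserved, second column completely mixed.
toy_heights = logo_heights(["AC", "AG", "AT", "AA"])
```

A fully conserved DNA column reaches the maximum of 2 bits, while a column with all four letters equally common has height zero.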
Scores are affected by sequence lengths
To compare scores across different query lengths,
the scores need to be normalized
Term “bit” comes from fact that probabilities
are stored as log2 values (binary, bit)
Done so can add across length of sequence
instead of multiply
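The point about adding instead of multiplying can be checked with a few lines. The probability values are arbitrary examples; the sketch only shows that summing log2 values gives the same answer as taking log2 of the product, and that the sum stays usable when the raw product underflows.

```python
import math


def product_of_probs(probs):
    """Multiply per-position probabilities directly."""
    result = 1.0
    for p in probs:
        result *= p
    return result


def sum_of_bits(probs):
    """Add per-position log2 probabilities (scores in bits)."""
    return sum(math.log2(p) for p in probs)


probs = [0.25, 0.5, 0.125, 0.5]
bits = sum_of_bits(probs)          # same value as log2 of the product
direct = math.log2(product_of_probs(probs))

# Over a long sequence the raw product underflows to zero,
# while the sum of bits is still a finite number.
long_probs = [0.01] * 200
underflowed = product_of_probs(long_probs)
long_bits = sum_of_bits(long_probs)
```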
Database Searches