Lesson06MultipleSequenceAlignments


Doug Raiford
Lesson 5

Dynamic programming methods
 Needleman-Wunsch (global alignment)
 Smith-Waterman (local alignment)

BLAST
Fixed: best
Linear: next best
Polynomial (n^2): not bad
Exponential (3^n): very bad


BLAST fast (linear)
But not as sensitive
Speed
Sensitivity
Similarity matrix
Especially with amino acids
 Some amino acids have similar chemical characteristics
 Similarity to all 8,000 3-mers calculated
 Usually ~50 are above a threshold
 All of these ~50 are considered hits when searching
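
As a rough illustration of the neighborhood idea above, the sketch below enumerates all 20^3 = 8,000 three-letter words and keeps those that score at or above a threshold against a query word. The scoring scheme is a toy assumption (identity +5, same chemical group +2, otherwise -2), not PAM or BLOSUM. The groupings, the threshold values and the function names are hypothetical, so the hit counts will not match the ~50 quoted for real matrices.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Toy chemical groupings (an assumption for illustration, not a real matrix).
GROUPS = ["AVLIM", "FWY", "ST", "NQ", "DE", "KRH", "C", "G", "P"]
GROUP_OF = {aa: i for i, group in enumerate(GROUPS) for aa in group}


def toy_score(a, b):
    """Toy similarity: identical > same chemical group > unrelated."""
    if a == b:
        return 5
    if GROUP_OF[a] == GROUP_OF[b]:
        return 2
    return -2


def neighborhood(word, threshold):
    """Return every word of the same length scoring >= threshold against `word`."""
    hits = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(toy_score(a, b) for a, b in zip(word, candidate))
        if score >= threshold:
            hits.append(("".join(candidate), score))
    # Highest-scoring neighbors first.
    return sorted(hits, key=lambda hit: -hit[1])


if __name__ == "__main__":
    for neighbor, score in neighborhood("LKE", 12):
        print(neighbor, score)
```

Lowering the threshold admits more neighbors, which raises sensitivity at the cost of more hits to follow up.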
Matrices
PAM (Point Accepted Mutation)
Built from observed substitution rates in closely related proteins
BLOSUM (BLOcks SUbstitution Matrix)
Built from observed substitution rates in evolutionarily divergent proteins
PSI-BLAST (Position-Specific Iterative)
 Align using default similarity matrix
 At each query location build a Position-Specific Scoring Matrix (PSSM) based upon observed search and alignment results
 Repeat with new matrix until results no longer change
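
The matrix-building step above can be sketched as follows. This is a minimal sketch, not PSI-BLAST's actual procedure: it assumes ungapped hits of equal length, a uniform background frequency and a pseudocount of 1, and it omits sequence weighting and the search loop itself. The names `build_pssm` and `score_segment` and the example hits are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Build one column of log2-odds scores per alignment position.

    Assumes ungapped, equal-length sequences and a uniform background.
    """
    background = 1.0 / len(ALPHABET)
    total = len(aligned) + pseudocount * len(ALPHABET)
    pssm = []
    for pos in range(len(aligned[0])):
        counts = Counter(seq[pos] for seq in aligned)
        column = {
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        }
        pssm.append(column)
    return pssm


def score_segment(pssm, segment):
    """Sum the position-specific scores for an ungapped segment."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


if __name__ == "__main__":
    hits = ["MKVL", "MRVL", "MKIL", "MKVI"]  # hypothetical aligned hits
    pssm = build_pssm(hits)
    print(score_segment(pssm, "MKVL"), score_segment(pssm, "GGGG"))
```

In an iterative search, a matrix like this would replace the default similarity matrix for the next round, and the loop would stop once the set of hits stops changing.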

PSI-BLAST
Build sensitivity by specifying allowed similarity at each position
Slower, but still faster than local alignment


Central to bioinformatics
Need for
 Phylogeny
 Protein function
 Protein structure
▪ Structure → function
 Drug discovery



Some parts of proteins are very important to maintain function
Must be similar from species to species
Can we spot these regions through alignment?
atgccgca-actgccgcaggagatcaggactttcatgaatatcatcatgcgtggga-ttcag
acctcgatacgtgccgcaggagatcaggactttcacct--tggatcatgcgaccgtacctac

Often conserved regions are near active sites
 Ligand binding sites (docking)
 Protein-to-protein interface
 Important regions for tertiary structure
Ligand: small molecule, target of protein, e.g. O2 is the ligand for hemoglobin
Substrate: a molecule upon which an enzyme acts



What if we look at more proteins?
Would that increase our confidence?
But how do we go about performing multiple sequence alignment?
atgccgca-actgccgcaggagatcaggactttcatgaatatcatcatgcgtggga-ttcag
acctccatacgtgccccaggagatctggactttcacc---tggatcatgcgaccgtacctac
t-atgg-t-cgtgccgcaggagatcaggactttca-gt--g-aatcatctgg-cgc--c-aa
t--tcgt-ac-tgccccaggagatctggactttcaaa---ca-atcatgcgcc-g-tc-tat
aattccgtacgtgccgcaggagatcaggactttcag-t--a-tatcatctgtc-ggc--tag
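
To make "spotting conserved regions" concrete, the sketch below scans the five aligned sequences above column by column and reports runs where every sequence carries the same base. The function name `conserved_blocks` and the minimum run length of 4 are assumptions chosen for illustration. Columns containing a gap are treated as not conserved.

```python
MSA = [
    "atgccgca-actgccgcaggagatcaggactttcatgaatatcatcatgcgtggga-ttcag",
    "acctccatacgtgccccaggagatctggactttcacc---tggatcatgcgaccgtacctac",
    "t-atgg-t-cgtgccgcaggagatcaggactttca-gt--g-aatcatctgg-cgc--c-aa",
    "t--tcgt-ac-tgccccaggagatctggactttcaaa---ca-atcatgcgcc-g-tc-tat",
    "aattccgtacgtgccgcaggagatcaggactttcag-t--a-tatcatctgtc-ggc--tag",
]


def conserved_blocks(msa, min_len=4):
    """Return (start_column, bases) for each fully conserved run of columns.

    Start columns are 0-based. A column counts as conserved only if it has
    no gap and every sequence shows the same character.
    """
    blocks = []
    start = None
    width = len(msa[0])
    # Iterate one past the end so a run touching the last column is closed.
    for col in range(width + 1):
        conserved = (
            col < width
            and msa[0][col] != "-"
            and all(seq[col] == msa[0][col] for seq in msa)
        )
        if conserved and start is None:
            start = col
        elif not conserved and start is not None:
            if col - start >= min_len:
                blocks.append((start, msa[0][start:col]))
            start = None
    return blocks


if __name__ == "__main__":
    for start, bases in conserved_blocks(MSA):
        print(start, bases)
```

Requiring perfect identity is the strictest possible criterion. A softer version would score each column by the frequency of its most common base.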


Hyper-dimensional dynamic programming
Becomes exponential with respect to number of sequences
 O(n^L) with n = sequence length and L = number of sequences
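
A quick back-of-the-envelope calculation shows why this is impractical. The sketch assumes every sequence has the same length n, so the dynamic programming table has (n + 1)^L cells, and each cell looks back at 2^L - 1 neighboring cells (every combination of "advance or gap" per sequence except all gaps). The function names are hypothetical.

```python
def dp_cells(n, num_sequences):
    """Cells in an L-dimensional alignment table for sequences of length n."""
    return (n + 1) ** num_sequences


def predecessors_per_cell(num_sequences):
    """Each sequence either advances or takes a gap; all-gaps is excluded."""
    return 2 ** num_sequences - 1


if __name__ == "__main__":
    for num_sequences in (2, 3, 5, 10):
        print(
            num_sequences,
            dp_cells(100, num_sequences),
            predecessors_per_cell(num_sequences),
        )
```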

ClustalW:
Determine all pair-wise distances
 Fast: number of l-mer matches
 Slower: full global alignments



Start with the closest pair and align them
Then align the next closest to those two
And so on...
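
The ordering step can be sketched with the fast l-mer distance. This is a simplified greedy ordering under stated assumptions: distance is the Jaccard distance between sets of 3-mers, and each round adds the sequence closest to anything already chosen. ClustalW itself builds a guide tree from the pairwise distances, so the function names and the toy sequences here are hypothetical. Sequences are assumed to be at least k letters long.

```python
from itertools import combinations


def kmers(seq, k=3):
    """Set of all overlapping k-letter words in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a, b, k=3):
    """Jaccard distance between k-mer sets (0 = identical sets, 1 = disjoint)."""
    ka, kb = kmers(a, k), kmers(b, k)
    return 1.0 - len(ka & kb) / len(ka | kb)


def progressive_order(seqs, k=3):
    """Order in which sequences would be added to a progressive alignment."""
    names = list(seqs)
    dist = {
        frozenset(pair): kmer_distance(seqs[pair[0]], seqs[pair[1]], k)
        for pair in combinations(names, 2)
    }
    # Start with the closest pair.
    closest = min(dist, key=dist.get)
    order = sorted(closest)
    remaining = [name for name in names if name not in closest]
    # Repeatedly add the sequence nearest to anything already aligned.
    while remaining:
        nxt = min(
            remaining,
            key=lambda n: min(dist[frozenset((n, m))] for m in order),
        )
        order.append(nxt)
        remaining.remove(nxt)
    return order


if __name__ == "__main__":
    toy = {
        "a": "ACGTACGTAC",
        "b": "ACGTACGTAA",
        "c": "ACGTTTTTAA",
        "d": "GGGGGGCCCC",
    }
    print(progressive_order(toy))
```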


Profile: matrix of real values, representing the probability of amino acids at each position in a corresponding multiple sequence alignment
A modification of the Smith-Waterman algorithm
 Degree to which an aa is preferred is the degree of match between the profile and the sequence
Consensus   1 M.ERS.HLPEG.PFAAALSGARFAAQSSGN.ASVL..DWNVLP.E 38
OPSD_XENLA  1 MNG.GTE..EGPN.NFYVP.PMS...SN.NKTGVVRSP.P..PFD 33
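
A stripped-down version of matching a sequence against a profile is sketched below. It is an ungapped scan, not the modified Smith-Waterman algorithm mentioned above: the profile is slid along the sequence and each window is scored by summing log2 probabilities. The three-column profile, the probability floor of 0.01 for unlisted residues and the name `best_window` are all assumptions made for illustration.

```python
import math

# Hypothetical three-position profile: probability of each residue per column.
PROFILE = [
    {"G": 0.8, "A": 0.2},
    {"P": 0.9, "A": 0.1},
    {"N": 0.7, "D": 0.3},
]


def best_window(profile, sequence, floor=0.01):
    """Return (start, score) of the best ungapped match of profile to sequence.

    Residues missing from a column get the small `floor` probability so the
    logarithm stays defined.
    """
    width = len(profile)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(
            math.log2(column.get(aa, floor))
            for column, aa in zip(profile, window)
        )
        if best is None or score > best[1]:
            best = (start, score)
    return best


if __name__ == "__main__":
    print(best_window(PROFILE, "MNGTEGPNFYVP"))
```

The more strongly a column prefers the residue it meets, the closer that column's contribution is to zero, so well-matched windows stand out from the rest.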

Mistakes made early in a progressive approach are propagated throughout the process
 Once aligned, not revisited
 Iterative methods devised to revisit
 Newest version of ClustalW (version 2) includes iteration
Other MSA apps
•T-Coffee
•PSalign
•DIALIGN
•MUSCLE

Height of letter represents how prevalent
that letter is at that position
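
One common way to compute those heights is sketched here, assuming a DNA alphabet of four letters and no small-sample correction: the column's information content is log2(4) minus its entropy, and each letter gets a share of that proportional to its frequency. The function name `logo_heights` is hypothetical, and gaps are simply ignored.

```python
import math
from collections import Counter


def logo_heights(column, alphabet_size=4):
    """Letter heights in bits for one alignment column (gaps ignored)."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return {}
    counts = Counter(residues)
    freqs = {r: n / len(residues) for r, n in counts.items()}
    entropy = -sum(p * math.log2(p) for p in freqs.values())
    information = math.log2(alphabet_size) - entropy
    # Each letter's height is its frequency times the column's information.
    return {r: p * information for r, p in freqs.items()}


if __name__ == "__main__":
    print(logo_heights("ttttt"))  # fully conserved column
    print(logo_heights("gcgcg"))  # mixed column, shorter stack
    print(logo_heights("acgt"))   # uniform column, no information
```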


Scores are affected by sequence lengths
If we want scores that can be compared across different query lengths, we need to normalize

The term “bit” comes from the fact that probabilities are stored as log2 values (binary, bit)
 Done so we can add across the length of the sequence instead of multiplying
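
The add-instead-of-multiply point can be checked directly. The probabilities below are made-up values, and `log2_score` is a hypothetical helper. The second half shows a practical reason for the trick: multiplying many small probabilities underflows to zero in floating point, while the sum of logs stays usable.

```python
import math


def log2_score(probabilities):
    """Sum of log2 values, equivalent to log2 of the product."""
    return sum(math.log2(p) for p in probabilities)


if __name__ == "__main__":
    probs = [0.5, 0.25, 0.125, 0.5]  # made-up per-position probabilities
    print(math.prod(probs), 2 ** log2_score(probs))

    # A long sequence of small probabilities: the product underflows,
    # the log-space sum does not.
    long_probs = [0.1] * 400
    print(math.prod(long_probs), log2_score(long_probs))
```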
Database Searches