lecture04_05

Sequence Alignment
Part 3
Protein Sequence Alignment
Multiple Sequence Alignment
Table 3.1. Web sites for alignment of sequence pairs
Name of site | Web address | Reference
Bayes block aligner | http://www.wadsworth.org/resnres/bioinfo | Zhu et al. (1998)
Likelihood-weighted sequence alignment | http://stateslab.bioinformatics.med.umich.edu/service | see Web site
PipMaker (percent identity plot), a graphical tool for assessing long alignments | http://www.bx.psu.edu/miller_lab/ | Schwartz et al. (2000)
BCM Search Launcher | http://searchlauncher.bcm.tmc.edu/ | see Web site
SIM: local similarity program for finding alternative alignments | http://us.expasy.org/ | Huang et al. (1990); Huang and Miller (1991); Pearson and Miller (1992)
Global alignment programs (GAP, NAP) | http://genome.cs.mtu.edu/align/align.html | Huang (1994)
FASTA program suite | http://fasta.bioch.virginia.edu/ | Pearson and Miller (1992); Pearson (1996)
Pairwise BLAST | http://www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html | Altschul et al. (1990)
AceView, shows alignment of mRNAs and ESTs to the genome sequence | http://www.ncbi.nlm.nih.gov/IEB/Research/Acembly | see Web site
BLAT, fast alignment for finding genes in genome | http://genome.ucsc.edu | Kent (2002)
GeneSeqer, predicts genes and aligns mRNA and genome sequences | http://www.bioinformatics.iastate.edu/bioinformatics2go/ | Usuka et al. (2000)
SIM4 | http://globin.cse.psu.edu | Florea et al. (1998)
Protein Sequence Alignment
Protein Pairwise Sequence Alignment
• The alignment tools are similar to the DNA
alignment tools
• BLASTP, FASTA
• Main difference: instead of scoring match (+2)
and mismatch (-1) we have similarity scores:
• Score s(i,j) > 0 if amino acids i and j have
similar properties
• Score s(i,j) ≤ 0 otherwise
• How should we score s(i,j)?
The 20 Amino Acids
Chemical Similarities Between
Amino Acids
• Acids & Amides: DENQ (Asp, Glu, Asn, Gln)
• Basic: HKR (His, Lys, Arg)
• Aromatic: FYW (Phe, Tyr, Trp)
• Hydrophilic: ACGPST (Ala, Cys, Gly, Pro, Ser, Thr)
• Hydrophobic: ILMV (Ile, Leu, Met, Val)
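One naive way to turn these chemical groups into a similarity score s(i,j) is sketched below. The score values (2, 1, -1) are illustrative assumptions, not values from the lecture, and real matrices such as PAM and BLOSUM are derived from observed data rather than from group membership alone.

```python
# Chemical groups as listed on the slide.
GROUPS = {
    "acids_amides": "DENQ",
    "basic": "HKR",
    "aromatic": "FYW",
    "hydrophilic": "ACGPST",
    "hydrophobic": "ILMV",
}

# Map each one-letter amino acid code to the name of its group.
GROUP_OF = {aa: name for name, members in GROUPS.items() for aa in members}


def similarity(a, b, identical=2, similar=1, different=-1):
    """Toy s(i,j): positive for identical or same-group residues, negative otherwise.

    The three score values are hypothetical defaults chosen for illustration.
    """
    if a == b:
        return identical
    if GROUP_OF[a] == GROUP_OF[b]:
        return similar
    return different


print(similarity("D", "E"))  # same group (acids & amides)
print(similarity("D", "K"))  # different groups
```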
Amino Acid Substitutions Matrices
• For aligning amino acids, we need a scoring
matrix of 20 rows × 20 columns
• Matrices represent biological processes
– Mutation causes changes in sequence
– Evolution tends to conserve protein function
– Similar function requires similar amino acids
• Could base matrix on amino acid properties
– In practice: based on empirical data
Given an alignment of closely related sequences,
we can score the relation between amino acids
based on how frequently they substitute for each other.
[Figure: aligned sequences (AGHKKKR D SFHRRRAGC) with one
column highlighted; in this column E & D are found 8/10 times.
Figure labels: identity, similarity.]
Amino Acid Matrices
• Symmetric matrix of 20x20 entries:
entry (i,j) = entry (j,i)
• Entry (i,j): the score of aligning amino
acid i against amino acid j.
• Entry (i,i) is greater than any
entry (i,j), j≠i.
PAM - Point Accepted Mutations
• Developed by Margaret Dayhoff, 1978.
• Analyzed very similar protein sequences
• Used global alignment.
– Proteins are evolutionarily close.
– Alignment is easy.
• Point mutations - mainly substitutions
• Accepted mutations - by natural selection.
• Counted the number of substitutions (i,j) per amino acid
pair: Many i<->j substitutions => high score s(i,j)
• Found that common substitutions occurred
involving chemically similar amino acids.
PAM 250
• Similar amino acids are close to each other.
• Regions define conserved substitutions.
Selecting a PAM Matrix
• Low PAM numbers: short sequences, strong local
similarities.
• High PAM numbers: long sequences, weak
similarities.
– PAM120 recommended for general use (40% identity)
– PAM60 for close relations (60% identity)
– PAM250 for distant relations (20% identity)
• If uncertain, try several different matrices
– PAM40, PAM120, PAM250 recommended
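The rule of thumb above can be written as a small lookup. This is only a sketch: the helper name pick_pam and the choice of "nearest listed identity" are assumptions made for illustration, and the percentages are the approximate ones quoted on the slide.

```python
# Approximate percent identity quoted on the slide for each matrix.
PAM_BY_IDENTITY = {60: "PAM60", 40: "PAM120", 20: "PAM250"}


def pick_pam(percent_identity):
    """Return the listed PAM matrix whose nominal identity is closest.

    Hypothetical helper: the nearest-value rule is an assumption,
    not a published selection procedure.
    """
    nearest = min(PAM_BY_IDENTITY, key=lambda ident: abs(ident - percent_identity))
    return PAM_BY_IDENTITY[nearest]


print(pick_pam(55))
print(pick_pam(25))
```

When the expected identity is unknown, the slide's advice still applies: run the search with several matrices and compare.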
BLOSUM
• Blocks Substitution Matrix
– Steven Henikoff and Jorja G. Henikoff (1992)
• Based on BLOCKS database (www.blocks.fhcrc.org)
– Families of proteins with identical function
– Highly conserved protein domains
• Ungapped local alignment to identify motifs
– Each motif is a block of local alignment
– Counts amino acids observed in same column
– Symmetrical model of substitution
[Figure: example block of aligned segments]
AABCDA...BBCDA
DABCDA.A.BBCBB
BBBCDABA.BCCAA
AAACDAC.DCBCDB
CCBADAB.DBBDCC
AAACAA...BBCCC
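The counting step can be sketched as follows. The toy block and the function names are made up for illustration; the log-odds step (observed pair frequency over expected frequency, in half-bit units) follows the usual BLOSUM formulation as I understand it, and the sketch leaves out the clustering of similar sequences that distinguishes one BLOSUMn from another.

```python
from collections import Counter
from itertools import combinations
from math import log2


def pair_counts(block):
    """Count unordered residue pairs within each column of an ungapped block."""
    counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            # Sorting makes (A, S) and (S, A) the same key: the symmetric model.
            counts[tuple(sorted((a, b)))] += 1
    return counts


def log_odds(block):
    """Rounded 2*log2(observed/expected) for every pair seen in the block."""
    counts = pair_counts(block)
    total = sum(counts.values())
    observed = {pair: c / total for pair, c in counts.items()}

    # Background frequency of each residue, derived from the pair frequencies.
    background = Counter()
    for (a, b), freq in observed.items():
        if a == b:
            background[a] += freq
        else:
            background[a] += freq / 2
            background[b] += freq / 2

    scores = {}
    for (a, b), freq in observed.items():
        if a == b:
            expected = background[a] ** 2
        else:
            expected = 2 * background[a] * background[b]
        scores[(a, b)] = round(2 * log2(freq / expected))
    return scores  # pairs never observed get no entry in this sketch


toy_block = ["AD", "AD", "AE", "SD"]  # hypothetical 4-sequence, 2-column block
print(pair_counts(toy_block))
```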
BLOSUM Matrices
• Different BLOSUMn matrices are
calculated independently from
BLOCKS
• BLOSUMn is based on sequences that
are at most n percent identical.
Selecting a BLOSUM Matrix
• For BLOSUMn, a higher n is suitable for
sequences that are more similar
– BLOSUM62 recommended for general use
– BLOSUM80 for close relations
– BLOSUM45 for distant relations
Multiple Sequence Alignment
Multiple Alignment
• Like pairwise alignment
– n input sequences instead of 2
– Add indels to make same length
– Local and global alignments
• Score columns in alignment independently
• Seek an alignment to maximize score
Alignment Example
GTCGTAGTCGGCTCGAC
GTCTAGCGAGCGTGAT
GCGAAGAGGCGAGC
GCCGTCGCGTCGTAAC
GTCGTAGTCG-GC-TCGAC
GTC-TAG-CGAGCGT-GAT
GC-GAAG-AG-GCG-AG-C
GCCGTCG-CG-TCGTA-AC
Column score: 4/4 = 1, 3/4 = 0.75, 2/4 = 0.5, 1/4 = 0
Without gaps: 1*1 + 2*0.75 + 11*0.5 = 8
With gaps: 4*1 + 11*0.75 + 2*0.5 = 13.25
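The column scoring used in this example can be checked with a short script. It is a sketch under one assumption that the slide does not state explicitly but that reproduces its total: gap characters are not counted as matching each other, so a column holding a single base and three gaps scores 0.

```python
from collections import Counter


def column_score(column):
    """Fraction of rows sharing the most common base; 0 if no base repeats."""
    bases = [c for c in column if c != "-"]  # assumption: gaps never match
    if not bases:
        return 0.0
    top = Counter(bases).most_common(1)[0][1]
    return top / len(column) if top > 1 else 0.0


def alignment_score(rows):
    """Sum of independent column scores over an alignment of equal-length rows."""
    return sum(column_score(col) for col in zip(*rows))


aligned = [
    "GTCGTAGTCG-GC-TCGAC",
    "GTC-TAG-CGAGCGT-GAT",
    "GC-GAAG-AG-GCG-AG-C",
    "GCCGTCG-CG-TCGTA-AC",
]
print(alignment_score(aligned))  # 13.25
```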
Dynamic Programming
• Pairwise A–B alignment table
– Cell (i,j) = score of best alignment between first
i elements of A and first j elements of B
– Complexity: length of A × length of B
• 3-way A–B–C alignment table
– Cell (i,j,k) = score of best alignment between
first i elements of A, first j of B, first k of C
– Complexity: length A × length B × length C
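A minimal sketch of the pairwise table, to make the cell definition concrete. It uses the match (+2) and mismatch (-1) scores quoted earlier for DNA; the linear gap score of -2 and the function name are assumptions for illustration. The nested loops visit every cell once, which is where the length of A × length of B cost comes from.

```python
def global_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Score of the best global alignment of a and b (linear gap cost)."""
    rows, cols = len(a) + 1, len(b) + 1
    # table[i][j] = best score aligning the first i of a with the first j of b
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = i * gap  # a prefix aligned against gaps only
    for j in range(1, cols):
        table[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            table[i][j] = max(
                table[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                table[i - 1][j] + gap,               # gap in b
                table[i][j - 1] + gap,               # gap in a
            )
    return table[-1][-1]


print(global_alignment_score("GTC", "GC"))
```

A 3-way table would add a third index and a third nested loop, with seven predecessor cells instead of three.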
MSA Complexity
• n-way S1–S2–…–Sn-1–Sn alignment table
– Cell (x1,…,xn) = best alignment score between
first x1 elements of S1, …, xn elements of Sn
– Complexity: length S1 × … × length Sn
• Example: protein family alignment
– 100 proteins, 1000 amino acids each
– Complexity: 1000^100 = 10^300 table cells
– Calculation time: beyond the big bang!
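The arithmetic behind this example is easy to verify. The processing rate of 10^9 cells per second is a hypothetical figure picked for illustration, and the age of the universe is rounded to about 13.8 billion years.

```python
sequence_length = 1000
n_sequences = 100

# One table dimension per sequence, each of size ~1000.
cells = sequence_length ** n_sequences

CELLS_PER_SECOND = 10 ** 9                 # hypothetical hardware speed
AGE_OF_UNIVERSE_SECONDS = 435 * 10 ** 15   # roughly 13.8 billion years

seconds_needed = cells // CELLS_PER_SECOND
print(len(str(cells)) - 1)                        # exponent of 10 in the cell count
print(seconds_needed // AGE_OF_UNIVERSE_SECONDS)  # how many universe lifetimes
```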
Feasible Approach
• Based on pairwise alignment scores
– Build n by n table of pairwise scores
• Align similar sequences first
– After alignment, consider as single sequence
– Continue aligning with further sequences
• Sum of pairwise alignment scores
– For n sequences, there are n(n-1)/2 pairs
1 GTCGTAGTCG-GC-TCGAC
2 GTC-TAG-CGAGCGT-GAT
3 GC-GAAG-AG-GCG-AG-C
4 GCCGTCG-CG-TCGTA-AC

1 GTCGTAGTCG-GC-TCGAC
2 GTC-TAG-CGAGCGT-GAT
3 GC-GAAGAGGCG-AGC
4 GCCGTCGCGTCGTAAC

1 GTCGTA-GTCG-GC-TCGAC
2 GTC-TA-G-CGAGCGT-GAT
3 G-C-GAAGA-G-GCG-AG-C
4 G-CCGTCGC-G-TCGTAA-C
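The sum of pairwise alignment scores can be sketched as below: every pair of rows in the multiple alignment is scored as if it were a pairwise alignment, and the n(n-1)/2 pair scores are added. The match, mismatch and gap values are assumptions for illustration, as is the choice to ignore columns where both rows have a gap.

```python
from itertools import combinations


def pair_score(row_a, row_b, match=2, mismatch=-1, gap=-2):
    """Score two rows of a multiple alignment position by position."""
    score = 0
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            continue  # assumption: a gap opposite a gap costs nothing
        if x == "-" or y == "-":
            score += gap
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


def sum_of_pairs(alignment):
    """Add the pairwise scores of all n(n-1)/2 row pairs."""
    return sum(pair_score(a, b) for a, b in combinations(alignment, 2))


alignment = [
    "GTCGTAGTCG-GC-TCGAC",
    "GTC-TAG-CGAGCGT-GAT",
    "GC-GAAG-AG-GCG-AG-C",
    "GCCGTCG-CG-TCGTA-AC",
]
print(len(list(combinations(alignment, 2))))  # 6 pairs for n = 4
print(sum_of_pairs(alignment))
```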
ClustalW Algorithm
Progressive Sequence Alignment (Higgins and Sharp 1988)
• Compute pairwise alignment for all the pairs of
sequences.
• Use the alignment scores to build a phylogenetic
tree such that
– similar sequences are neighbors in the tree
– distant sequences are distant from each other in
the tree.
• The sequences are progressively aligned
according to the branching order in the guide tree.
• http://www.ebi.ac.uk/clustalw/
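The first two steps can be illustrated with a toy script that derives a joining order from pairwise similarities. This is not ClustalW's actual procedure: the similarity measure (Python's difflib ratio) and the average-linkage joining rule are stand-ins chosen to keep the sketch short, and the function names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Stand-in pairwise score in [0, 1]; a real tool would use alignment scores."""
    return SequenceMatcher(None, a, b).ratio()


def average_similarity(cluster_a, cluster_b, seqs):
    """Mean similarity over all cross-cluster sequence pairs."""
    sims = [similarity(seqs[x], seqs[y]) for x in cluster_a for y in cluster_b]
    return sum(sims) / len(sims)


def guide_order(seqs):
    """Return the list of cluster pairs in the order they would be aligned."""
    clusters = [(name,) for name in seqs]
    merges = []
    while len(clusters) > 1:
        # Join the two most similar clusters first.
        best = max(
            combinations(clusters, 2),
            key=lambda pair: average_similarity(pair[0], pair[1], seqs),
        )
        merges.append(best)
        clusters = [c for c in clusters if c not in best] + [best[0] + best[1]]
    return merges


seqs = {"1": "NYLS", "2": "NKYLS", "3": "NFS", "4": "NFLS"}
for left, right in guide_order(seqs):
    print(left, "+", right)
```

Each printed line corresponds to one progressive step: the two groups named are aligned to each other and then treated as a single unit.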
Progressive Sequence Alignment
(Protein sequences example)
NYLS + NKYLS → N K/- Y L S
NFS + NFLS → N F L/- S
Both groups combined → N K/- Y/F L/- S
Treating Gaps in ClustalW
• Penalty for opening gaps and additional
penalty for extending the gap
• Gaps found in initial alignment remain
fixed
• New gaps are introduced as more sequences
are added (decreased penalty if gap exists)
• Gap penalty is decreased within stretches of
hydrophilic residues
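The open-plus-extend penalty can be written as a one-line function. The numeric defaults are illustrative, not ClustalW's settings, and the convention that the opening penalty covers the first gap position is an assumption (some programs charge the extension for every position instead).

```python
def gap_penalty(length, gap_open=10.0, gap_extend=0.5):
    """Affine penalty for a gap of the given length.

    Assumed convention: gap_open pays for the first position and
    gap_extend for each additional one. Both defaults are hypothetical.
    """
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)


for n in (1, 2, 4):
    print(n, gap_penalty(n))
```

With these values one gap of length 4 costs far less than four separate gaps of length 1, which is the point of separating the two penalties.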
MSA Approaches
• Progressive approach:
– CLUSTALW (CLUSTALX)
– PILEUP
– T-COFFEE
• Iterative approach:
repeatedly realign subsets of sequences.
– MultAlin, DiAlign
• Statistical methods:
Hidden Markov Models
– SAM2K
• Genetic algorithm:
– SAGA