Alignment - Cloudfront.net

Download Report

Transcript Alignment - Cloudfront.net

Sequence Alignment
Goal: line up two or more sequences
An alignment of two amino acid sequences:
123456789….
Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP
HKIYHLQSKVP R +AP+G+L IHE+AWNAYPYC+TV+TN +YMKE+F +KIET H P
Seq2: HKIYHLQSKVPAILRKIAPKGSLAIHEEAWNAYPYCKTVVTNPDYMKENFYVKIETIHLP
Position within the alignment = columns
Sequence Alignment
The alignment is a hypothesis:
• The positions with identical nt/AA were present in the common
ancestor
• Differences represent the nt/AA that have diverged since the
common ancestor
Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP
HKIYHLQSKVP +R +AP+G+L IHE+AWNAYPYC+TV+TN +YMKE+F +KIET H P
Seq2: HKIYHLQSKVPAILRKIAPKGSLAIHEEAWNAYPYCKTVVTNPDYMKENFYVKIETIHLP
From extant sequences to evolution
Constructing and evaluating alignments
• how to identify regions of sequence similarity between two
sequences?
• How to evaluate the degree of similarity?
• What is the biological significance of the alignment?
Dot Plots: visualization of an alignment
1) Take two English words:
THISSEQUENCE and THATSEQUENCE
2) place the two sequences on vertical and horizontal axes of graph
3) put dots wherever there is a match
4) diagonal line is the region of identity
–
local alignment
Alignments reveal insertions and deletions
seq1
seq2
THIS---SEQUENCE
THISISASEQUENCE
a gap in Seq1 accounts
for the insertion of ISA
into Seq2
Alignments reveal substitutions
THISSEQUENCE
THATSEQUENCE
– How many substitutions?
– Are all substitutions equal?
– If these were real AA sequences in two extant organisms, how can
we determine whether they reflect evolutionary ancestry?
• Would two unrelated sequence share this level of identity?
Substitutions
Need a methods for evaluating the likelihood of
- A to V (alanine to valine)
- R to F (Arginine to Phenylalanine)
Scoring schemes to assess similarity
• Percent identity = number of identical amino acids
• Percent similarity (biochemical equivalence)
123456789….
Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP
HKIYHLQSKVP R +AP+G+L IHE+AWNAYPYC+TV+TN +YMKE+F +KIET H P
Seq2: HKIYHLQSKVPAILRKIAPKGSLAIHEEAWNAYPYCKTVVTNPDYMKENFYVKIETIHLP
• Substitution matrices
– value assigned based on the probability of substitution
– score the alignment
Substitution matrices
How might one construct a scoring matrix?
– what types of sequence events should we consider?
• DNA level? Transition vs. transversion
• Amino acid level? Biochemical equivalence
• Using known proteins?
– comparing protein homologs
– post hoc determination of probabilities
– should we use the same substitution matrix for two very closely
related proteins vs. proteins that diverged long ago?
• Probability of substitutions increases over time
• Probability that multiple substitutions occurred in a single position
Substitution matrix
Each cell represents the likelihood of substitution of each possible
pair of amino acids
THISSEQUENCE
THATSEQUENCE Sum up the score = 52
581145505695
PAM and BLOSUM matrices for AA sequences
Most protein alignment matrices are empirically derived:
• PAM Scoring Matrices
– compared full length of closely related proteins
– Measured the frequency of all possible substitution pairs
• BLOSUM Scoring Matrices
– compared highly conserved regions of proteins
– blocks
How to score gaps?
THISISASEQUENCE vs THATSEQUENCE?
THISISASEQUENCE
TH----ATSEQUENCE
THA---TSEQUENCE More than one possible alignment
TH---ATSEQUENCE
TH-A-T-SEQUENCE
Scoring the alignment must take into account
1) Substitutions
2) Gaps
Gap penalties:
1) start a new gap (-4)
2) extend an existing gap (-1)
Score all, choose highest score
Alignments
• Finding regions of sequence identity or similarity
• Inserting gaps to reflect indels
• Scoring the possible alignments to find the
optimal alignment by
Common tools that produce alignments
• BLAST to identify similar sequences, given a query
sequence
• ClustalW to align two or more sequences across their
entire length
the blast algorithm
• Uses PAM or BLOSUM matrix
• divides query sequence into short strings, called words
• searches through the database to find subject sequences
that contain similar words
• When finds similar words, it extends and scores the
alignment
• Output consists of all subject sequences that align to the
query at or above a threshold score
• If no words are similar, then no alignment
BLAST Algorithm
divide entire length into words (segments of X length)
Extend hits one base at a time
S is the alignment score:
If it falls below a threshold,
the extension processes
ends
HSPs are Aligned Regions
• High scoring segment pairs = the original word match plus
the extension
– high scoring = score of the alignment above threshold
– segment = the region of the query sequence aligned to the subject
– pair = alignment between two sequences (query and subject)
• BLAST often produces several short HSPs rather than a
single aligned region
BLAST Results report local alignments
Query was the entire protein sequence (position 1 to 749)
• Score,E-value, Identities, Positives, Gaps
>gi|17556182|ref|NP_497582.1|
Predicted CDS, phosphatidylinositol transfer protein
[Caenorhabditis elegans]
Score = 283 bits (723), Expect = 8e-75
Identities = 144/270 (53%), Positives = 186/270 (68%), Gaps = 13/270 (4%)
Query: 48
Sbjct: 70
KEYRVILPVSVDEYQVGQLYSVAEASKNXXXXXXXXXXXXXXPYEK----DGE--KGQYT 101
K+ RV+LP+SV+EYQVGQL+SVAEASK
P++
+G+ KGQYT
KKSRVVLPMSVEEYQVGQLWSVAEASKAETGGGEGVEVLKNEPFDNVPLLNGQFTKGQYT 129
Query: 102 HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP 160
HKIYHLQSKVP +R +AP+G+L IHE+AWNAYPYC+TV+TN +YMKE+F +KIET H P
Sbjct: 130 HKIYHLQSKVPAILRKIAPKGSLAIHEEAWNAYPYCKTVVTNPDYMKENFYVKIETIHLP 189
Query: 161 DLGTQENVHKLEPEAWKHVEAVYIDIADRSQVL-SKDYKAEEDPAKFKSIKTGRGPLGPN 219
D GT EN H L+ +
E V I+IA+ + L S D
+ P+KF+S KTGRGPL N
Sbjct: 190 DNGTTENAHGLKGDELAKREVVNINIANDHEYLNSGDLHPDSTPSKFQSTKTGRGPLSGN 249
Query: 220 WKQELVNQKDCPYMCAYKLVTVKFKWWGLQNKVENFIHKQERRLFTNFHRQLFCWLDKWV 279
WK +
P MCAYKLVTV FKW+G Q VEN+ H Q RLF+ FHR++FCW+DKW
Sbjct: 250 WKDSVQ-----PVMCAYKLVTVYFKWFGFQKIVENYAHTQYPRLFSKFHREVFCWIDKWH 304
Query: 280 DLTMDDIRRMEEETKRQLDEMRQKDPVKGM 309
LTM DIR +E + +++L+E R+
V+GM
Sbjct: 305 GLTMVDIREIEAKAQKELEEQRKSGQVRGM 334
BLAST Statistics
• E-value is equivalent to a P value
• smaller numbers are more significant
– 1e-4 = 1 x 10-4 = 0.0004
– 1e-50 = 1 x 10-50
• E-value is calculated from the alignment score (S)
• how many alignments of that score would likely occur by
chance if you query a database of that size?
– if GenBank contains 10 million sequences, there is a good
probability that the sequence “MAGAV” will occur multiple times in
sequences that are NOT evolutionarily related
• The E-value represents the likelihood that the observed
alignment is due to chance alone
Interpretation of output
• very low E-values (e-100) represent sequences that are very
close to being identical
• moderate E-values are related genes (homologs)
• long list of gradually declining of E-values indicates a large
gene family
• you must examine the results when e-value is in the 10-4 to -5
range
– examine sequences
• a few AA matches in a long sequence?
• many AA matches in a very short sequence?
Evaluating Blast results
• Alignment:
– colored bar alignments (region of alignment, score along length)
– sequence alignments (region of alignment, AA information)
• Exploring potential function from significant blast hits
– use accession link to go to the record page for each hit
• published papers
• full sequence information
• annotation
• Blast is linked to a protein domain tool