Pairwise alignments 2 - Bioinformatics Centre
Download
Report
Transcript Pairwise alignments 2 - Bioinformatics Centre
Pairwise alignment 2:
Scoring matrices and gaps BLAST,
BLAT and FASTA
Introduction to bioinformatics
Stinus Lindgreen
[email protected]
Bioinformatics Centre, University of Copenhagen
1
Outline of the lecture
Scoring an alignment: BLOSUM and PAM
Scoring matrices for nucleotides
Treatment of gaps
The BLAST algorithm
Position Specific Scoring Matrices
PSI-BLAST and other variants
BLAT
FASTA
2
Definition of scoring system
To build an alignment we need a scoring system
Aligning two different residues indicates a
substitution
How likely is it to replace A with B?
A likely alignment should receive a large score
Create a substitution matrix
A 20×20 symmetrical table (matrix)
Look up the score for aligning two amino acids
The diagonal represents conservation
3
Considerations
Conservative substitutions
Prefer changes that respect structure and function
E.g. hydrophobicity, size, chemical properties etc.
Frequencies
How often do specific residues occur?
Scores should weigh the rare ones higher
Evolution
How distant are the sequences?
Ancient divergence more changes
4
The creation
Approach 1:
Make an evolutionary model
Use closely related sequences
Extrapolate to greater distances
Done in the PAM family
Approach 2:
Look at related sequences
Observe actual mutations in motifs
Use sets of different overall identity
Done in the BLOSUM family
5
The PAM250 matrix[1]
[1]
M.O. Dayhoff: Survey of new data and computer methods of analysis (1978),
Atlas of protein sequence and structure, 5:3
6
The PAM family
Defining characteristics
PAM: Point (or percent) accepted mutations
One PAM unit: 1 change per 100 residues (~ 10M ys)
Not the probability that a residue A is something else
What would PAM250 mean in that case?
But: The number of expected changes overall
Including changes A B A
7
Construction of PAM matrices
The general idea:
Assume evolution is independent
Describe ”one step of evolution”
The next step will follow
PAMx: Scoring matrix for x evolutionary steps
Very similar sequences: Low x
Very divergent sequences: Large x
What is an evolutionary step?
8
Evolutionary model
Assume independence
between positions
over time
P(AB)=P(BA)
Consequences:
How a sequence looks tomorrow depends
ONLY on how it looks today
Not on how it looked yesterday
Known as a Markov chain
The context does not matter
Evolution ”reversible”
9
Building PAM1
1.
2.
3.
4.
Find set of very similar sequences (1% ID)
Make global alignment (71 groups)
Count the substitutions (1572 changes)
Calculate weights
10
From counts to scores
The simplified version
Count mutation frequencies
Normalize by mutability of respective amino acid
Makes comparisons possible
Multiply to wanted evolutionary distance x
The respective probabilities
Still probabilities
Calculate log-odds score for (AB+BA)/2
You have the PAMx matrix!
11
Why log-odds?
Si , j log
qi , j
pi p j
Summation instead of multiplication
log(A·B)=log(A)+log(B)
S>0: More often than per chance
S=0: Number expected by chance
S<0: Less often than per chance
12
The PAM1 matrix
Highly identical sequences 1 evolutionary step
PAM1 describes likelihood of changes
Diagonal close to 1, off-diagonal close to 0
PAM2=PAM1×PAM1
PAM3=PAM1×PAM2 …
Markov chain property (independence)
Greater divergence through matrix multiplication
Large PAM: More even values
13
PAM problems
Good and bad things
Based on evolution
Assumes independence
Global data evolution of proteins
Based on small dataset
Extrapolates to greater divergence
information loss & error growth
Evolutionary model partly verified
But still simplified…
Mutations happen at the nucleotide level
14
The BLOSUM matrix[2]
[2]
S. Henikoff and J.G. Henikoff: Amino acid substitution matrices from
protein blocks (1992), Proc. Natl. Acad, Sci., 89:10915–10919s
15
The BLOSUM family
Look at conserved motifs in proteins
Scan databases for motifs
No gaps in alignment a BLOCK
Make groups of motifs
One protein can contain many motifs
~2000 blocks, >500 protein families
The BLOCKS database
BLOSUM: BLOCKS substitution matrices
16
The BLOSUM idea
Don’t extrapolate data
Collect divergent dataset
Observe mutations directly
Local alignment of sequences
Evolution treated implicitly
Still assumes independence between sites
17
Building BLOSUMx
Substitution scores for sequences of x %ID
1.
Create sets of blocks of x %ID
2.
Very similar sequences are grouped and weighted
3.
4.
5.
Avoid bias due to numbers
Count the substitutions
Calculate log-odds scores for all pairs
You have the BLOSUMx matrix!
18
Some general notes
BLOSUM62 has proved to work for a range of
similarities
Other values (e.g. 30, 45 and 80) also used
Today: BLOSUM used more than PAM
19
BLOSUM problems
Good and bad things
Evolution not considered explicitly
Known related proteins
Assumes star-shaped phylogeny
Based on conserved blocks /
Local alignment
20
The matrix matters
Example: BLAST using BLOSUM45 and BLOSUM80
Note number of hits
Note difference in scores
NCBI BLAST
>GLBP_CHITH 152 P11582 GLOBIN CTT-E/E' PRECURSOR.
MKFIILALCVAAASALSGDQIGLVQSTYGKVKGDSVGILYAVFKADPTIQAAFPQ
FVGKDLDAIKGGAEFSTHAGRIVGFLGGVIDDLPNIGKHVDALVATHKPRGVT
HAQFNNFRAAFIAYLKGHVDYTAAVEAAWGATFDAFFGAVFAKM
21
Nucleotide scoring matrices
Normally just simple 1/-1 cost scheme
Problematic simplification
Especially for structural sequences (e.g. ncRNA)
Consider evolution (PAM-like matrix)
Use actual frequencies of nucleotides as background
Weigh transitions and transversions differently
22
Gap penalties
We can score pairs of nucleotides/amino acids
What about indels and gaps?
Two main strategies
Linear gap penalty: G(n)=a·n
Penalize each gap the same (cf. last lecture)
Affine gap penalty: G(n)=O+E·n
More evolutionary sound
Opening a gap costs more than extending it
Standard today
23
Gaps: Things to consider
How large should O and E be?
Depends on the scoring matrix used
Normally good values are given
Too large: No gaps created
Too small: Too many gaps created
Should end gaps be penalized?
Local alignment: No
Global: Probably
24
Heuristics
So, now we know all about scoring an alignment
Let us use Smith-Waterman or Needleman-Wunsch
Against a database or a full-length genome?
Bad idea!
Instead: Heuristic method
Guaranteed to give a good solution fast
Not necessarily the optimal alignment
25
The BLAST algorithm
Basic Local Alignment Search Tool
You have all used it, but how does it work?
Used to find statistically significant local alignments
between a query and a database
Many versions (protein-protein, nucleotidenucleotide,nucleotide-protein …)
Fast, widely used, good reason to know the algorithm
26
BLAST overview
The six steps
1.
Filtering of low complexity regions (optional)
2.
Compile list of relevant words
3.
Scan database sequences
4.
Extend hits to HSPs
5.
Calculate E-value for significant hits
6.
Report hits and alignments
27
BLAST step 1
Filtering
Some sequences contain low complexity regions
Give rise to many random hits
Remember dotplots
Filter out by replacing with Xs
28
BLAST step 2
Compiling word list
Word length L=3 (protein) or L=11 (nucleotides)
Choose score threshold T (usually 11)
Find all words of length L in query sequence
One for each position
Compare to all possible length L words (8000/~4.2M)
For each position in query, remove words below T
Limited to appr. 50 words per position
29
The word list
APLSADQASLVKSTWAQVRNSEVEILAAVFTAYPDIQARF…
APL
PLS
LSA
SAD
ADQ
DQA…
Word list, position 1 APL: APC,APS,APT,APE…
Remove words w with score(APL,w)<T
30
BLAST step 3
Scan database
Store words for each position in efficient search tree
Scan each sequence in database
Remember exact word hits
If query length is 100, scan for appr. 50·100 hits
31
Scanning
APL
APA
LPL
IIAPLIGNESNAPAVQTLVGQLPLSHKARG…
Perfect match between word and database is a hit
32
BLAST step 4
Extend hits (BLAST2)
HSP: High-scoring Segment Pair
Find two hits on same diagonal with distance ≤A
Connect them (ungapped alignment)
Extend using gaps, matches and mismatches
Extension continues while score increases
Stop when score drops X below highest score
Original BLAST
Extend all single word hits (higher T needed)
33
BLAST extension
Find diagonal hits
Extend alignment
34
BLAST step 5
Calculate E-scores
Compile list of HSPs scoring more than S
Let x be the score of a HSP
The probability of seeing this score by chance
P(score≥x)
The expectation of seeing this score in the database
E≈1-e-P(score≥x)D
Rather complicated calculations
The Karlin-Altschul equation (written in many ways)
35
BLAST step 6
Report the results
List the hits (sorted by E-value)
Graphical representation
Show Smith-Waterman alignment of the HSPs
36
Using BLAST
Settings
Query sequence
Sub-sequence
Database important
Conserved Domain search: Additional info
What organisms to search
Scoring matrix
Normally leave the rest as defaults
But you can change, well, anything
37
Understanding BLAST
Output
Database searched and query used
Number of hits
Color-coded diagram
Magnitude of score
Relation between hits (if any)
Hits sorted by E-value
Alignments
Additional info by clicking a link
38
BLAST summary
Heuristic local alignment
Finds word matches
Extends locally
Might miss optimal solutions
Fast
Lower E-value better result
Many hits between query and sequence possible
Remember: Use a proper scoring matrix!
39
Position Specific Scoring Matrix
Another brief detour: PSSM or Profile
Useful to represent (part of) a multiple alignment
Represents the information in a motif
For each position: What are the frequencies of the
characters? (Nucleotides or amino acids)
Frequencies can give the probabilities
Find most probable hit to a PSSM in a sequence
Compare to the background probability (i.e. the
overall frequencies)
Pseudocounts to avoid 0 probabilities
40
PSSM example
123456789
--------AAGGTAAGT
TGTGTGAGT
CAGGTATAC
ATGGTAACT
TAGGTACTG
TAGGTCATT
ACAGTCAGT
CAGGTTGGA
TCCGTAAGT
GAGGTAAAC
| 1 2 3 4 5 6 7 8 9
-|-------------------A| 3 6 1 0 0 6 7 2 1
C| 2 2 1 0 0 2 1 1 2
G| 1 1 7 10 0 1 1 5 1
T| 4 1 1 0 10 1 1 2 6
Just divide by 10 to get probabilities
Assume equal background distribution
P(A)=P(C)=P(G)=P(T)=0.25
41
PSSM score
Comparing sequence CAGGTAGTC to the PSSM
Probability given the model
0,2·0,6·0,7·1,0·1,0·0,6·0,1·0,2·0,2=0,0002016
Probability given the background
0,259=0,000003815
log-odds score (bits)
Pmodel
0,0002016
log
log
5,72
Pbckg
0,000003815
42
PSI-BLAST
Position-Specific-Iterated BLAST
Designed to find distant homologs
More sensitive than BlastP
Use PSSM instead of just sequence comparison
43
The PSI-BLAST algorithm
1.
2.
3.
4.
5.
Perform standard BlastP search
Create PSSM based on query and best hits
Search database again for hits to the PSSM
Incorporate new hits (i.e. update frequencies)
Iterate (repeat) until convergence
End result: More distant homologs found
Slower than standard BLAST (of course)
44
PSI-BLASTing a protein
Output at first iteration, 2nd iteration, …
When to stop?
In this case: 7 iterations
PSI-BLAST
>GLBP_CHITH 152 P11582 GLOBIN CTT-E/E' PRECURSOR.
MKFIILALCVAAASALSGDQIGLVQSTYGKVKGDSVGILYAVFKADPTIQAAFPQ
FVGKDLDAIKGGAEFSTHAGRIVGFLGGVIDDLPNIGKHVDALVATHKPRGVT
HAQFNNFRAAFIAYLKGHVDYTAAVEAAWGATFDAFFGAVFAKM
45
Other BLAST versions
MegaBLAST
BLAST2SEQUENCES
Blast on only two sequences (local alignments)
PHI-BLAST
Assumes high similarity, longer sequences
Pattern Hit Initiated BLAST – searches for motif
WU-BLAST
BLAST from Washington University, not NCBI
46
BLAST variants
Nucleotide vs nucleotide: blastn
Protein vs protein: blastp
Translated sequence vs protein database: blastx
Protein sequence vs translated database: tblastn
Translated sequence vs translated database: tblastx
Find divergent protein sequences: PSI-BLAST
47
The BLAT algorithm
BLAST-Like Alignment Tool
Designed for the genome projects
Local alignments between long sequences
Speed important!
BLAST turned upside-down
48
What BLAT does
How does it work?
1.
Index the database (length 11 words)
2.
Find hits in the query
3.
Opposite the BLAST strategy
Extend hits to HSPs
Useful when the database does not often change
Too time (and space) consuming for normal BLAST
49
The FASTA algorithm
Similar to BLAST … but different
Fast All (since it works with all alphabets)
Not as widely used as BLAST – but one of the first
Works in a step-wise fashion
Locate word hits, extend using Smith-Waterman
50
The steps in FASTA
1.
2.
3.
4.
Find word hits
Score the hits and trim results
Join regions of similarity
Find the best alignment
51
FASTA step 1
Word hits
Choose word size ktup (2 for protein, 4 or 6 for DNA)
Create two word lists: Query and database
Find all words that occur in both
Connect nearby hits directly (i.e. no gaps)
52
Word hits
Find all hits
Connect hits on same diagonal
53
FASTA step 2
Score and trim
Only keep the best 10 segments from step 1
Re-evaluate all hits using PAM250
For each hit: Note the best score
54
Score and trim
Keep 10 best hits
Recalculate scores
55
FASTA step 3
Join regions
All regions scoring over a threshold are kept
Crudely join regions
Add linear gap penalty for joining to diagonals
Keep the best scoring rough alignments
Removes unlikely similar regions
56
Join and penalize
Keep the good regions
Connect using gaps. Remove the low scoring
57
FASTA step 4
Optimize alignment
Consider the good but crude alignments
Reoptimize using Smith-Waterman
Only a small part of the matrix is needed
Banded version (much faster)
58
Optimize alignment
Optimize crude alignment
Use banded Smith-Waterman
59
FASTA results
Histogram of scores
Actual scores versus
expected scores
Optimal alignments
E-value representing the
probability of the hit
60
BLAST vs. FASTA
Differences between the two
BLAST searches the neighbourhood
FASTA looks for exact matches
BLAST returns all the best hits in a sequence
FASTA returns one hit per sequence
BLAST is faster than FASTA
FASTA produces better final alignment
61
Summary
Introduction to Markov chains and HMMs
Scoring matrices: PAM and BLOSUM
Larger PAM Greater distance
Larger BLOSUM Greater identity
Gap penalties
BLAST, PSI-BLAST etc.
BLAT
FASTA
Next time: Multiple alignment
62