Transcript Document

Alignment principles and homology
searching using (PSI-)BLAST
Jaap Heringa
Centre for Integrative Bioinformatics VU (IBIVU)
http://ibivu.cs.vu.nl
Bioinformatics
“Nothing in Biology makes sense except in
the light of evolution” (Theodosius
Dobzhansky (1900-1975))
“Nothing in bioinformatics makes sense
except in the light of Biology”
Evolution
Four requirements:
• Template structure providing stability
(DNA)
• Copying mechanism (meiosis)
• Mechanism providing variation (mutations;
insertions and deletions; crossing-over; etc.)
• Selection (enzyme specificity, activity, etc.)
Evolution
Ancestral sequence: ABCD
ACCD (B C)
ACCD
AB─D
or
ABD (C ø)
mutation
deletion
ACCD
A─BD
Pairwise Alignment
See “Primer of Genome Science” P. 114 –
box “Phylogenetics”
Evolution
Ancestral sequence: ABCD
ACCD (B C)
ACCD
AB─D
true alignment
or
ABD (C ø)
mutation
deletion
ACCD
A─BD
Pairwise Alignment
See “Primer of Genome Science” P. 114 –
box “Phylogenetics”
Comparing two sequences
•We want to be able to choose the best
alignment between two sequences.
•Alignment assumes divergent evolution
(common ancestry) as opposed to convergent
evolution
•The first sequence to be compared is assigned
to the horizontal axis and the second is assigned
to the vertical axis. See “Primer of Genome Science” P. 72-75
box “Pairwise Sequence Alignment”
MTSAVLPAAYDRKHTSIIFQTSWQ
M
T
S
A
V
L
P
A
A
Y
D
R
K
H
T
T
S
W
Q
All possible alignments between the two sequences can be
represented as a path through the search matrix
MTSAVLPAAYDRKHTSIIFQTSWQ
M
T
S
A
V
L
P
A
A
Y
D
R
K
H
T
T
S
W
Q
Corresponds
to stretch
“SIIFQ” in
horizontal
sequence
(indel)
All possible alignments between the two sequences can be
represented as a path through the search matrix
A protein sequence alignment
MSTGAVLIY--TSILIKECHAMPAGNE-------GGILLFHRTHELIKESHAMANDEGGSNNS
A DNA sequence alignment
attcgttggcaaatcgcccctatccggccttaa
attt---ggcggatcg-cctctacgggcc----
Sequence alignment
History
1970 Needleman-Wunsch global pair-wise
alignment
1981 Smith-Waterman local pair- wise alignment
1984 Hogeweg-Hesper progressive multiple
alignment
1989 Lipman-Altschul-Kececioglu simultaneous
multiple alignment
1994 Hidden Markov Models (HMM) for
multiple alignment
1996 Iterative strategies for progressive multiple
alignment revived
1997 PSI-Blast (PSSM)
Pair-wise alignment
T D W V T A L K
T D W L - - I K
Combinatorial explosion
- 1 gap in 1 sequence: n+1 possibilities
- 2 gaps in 1 sequence: (n+1)n
- 3 gaps in 1 sequence: (n+1)n(n-1), etc.
2n
~
=
n
22n
(2n)!
(n!)2
n
2 sequences of 300 a.a.: ~1088 alignments
2 sequences of 1000 a.a.: ~10600 alignments!
Dynamic programming
Scoring alignments
 s (a , b ) + 
Sa,b =
l
i
j
k
Nk  gp(k )
gp(k) = -Popen -kPextension
gp(k) is gap of
size k, Nk is the
number of gaps
of length k
affine gap penalties
Popen and Pextension are the penalties for gap initialisation and
extension, respectively
 s (a , b )
l
i
j
describes the likelihood of a given
residue match in the alignment
Amino acid exchange matrices
How do we get one?
First systematic method to derive amino
acid exchange matrices by Margaret
Dayhoff et al. (1978) – Atlas of Protein
Structure. There are now various matrix
series (PAM, BLOSUM) corresponding to
different evolutionary speeds or time
since divergence
2020
And how do we get associated gap penalties?
Gap-opening
penalty
Gap-extension
penalty
Formalisms are available for exchange matrices but for gap penalties no formal
theory exists yet. Most researchers use recommended gap penalty values provided by
experts
Dynamic programming
Scoring alignments
T D W V T A L K
T D W L - - I K
Gap is 2 positions
long
2020
10
Amino Acid Exchange
Matrix
1
Affine gap penalties (Popen,
Pextension)
Score: s(T,T)+s(D,D)+s(W,W)+s(V,L) -Popen -2Pext +
+s(L,I)+s(K,K)
A
2
R -2
6
N
0
0
2
D
0 -1
2
PAM250 matrix
4
C -2 -4 -4 -5 12
Q
0
1
1
2 -5
4
E
0 -1
1
3 -5
2
4
G
1 -3
0
1 -3 -1
0
2
1 -3
1 -2
H -1
2
3
5
6
I -1 -2 -2 -2 -2 -2 -2 -3 -2
5
L -2 -3 -3 -4 -6 -2 -3 -4 -2
2
1
0 -5
1
amino acid
exchange matrix
(log odds)
0 -2
6
K -1
3
M -1
0 -2 -3 -5 -1 -2 -3 -2
2
4
0
6
F -4 -4 -4 -6 -4 -5 -5 -5 -2
1
2 -5
0
9
1
0 -1 -1 -3
0 -2 -3 -1 -2 -5
S
1
0
1
0
0 -1
0
1 -1 -1 -3
0 -2 -3
1
2
T
1 -1
0
0 -2 -1
0
0 -1
0 -1 -3
0
1
0 -2
2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4
Y -3 -4 -2 -4
0 -4 -4 -5
0 -1 -1 -4 -2
3
0 -6 -2 -5 17
7 -5 -3 -3
B
0 -1
2
3 -4
1
2
0
1 -2 -3
1 -2 -5 -1
0
0 -5 -3 -2
2
Z
0
0
1
3 -5
3
3 -1
2 -2 -3
0 -2 -5
0
0 -1 -6 -4 -2
2
3
A
R
N
D
Q
E
H
K
P
S
B
Z
I
L
2 -1 -1 -1
0 10
0 -2 -2 -2 -2 -2 -2 -1 -2
G
2 -2
6
V
C
4
Positive exchange values
denote mutations that are
more likely than randomly
expected, while negative
numbers correspond to
avoided mutations compared
to the randomly expected
situation
5
P
W -6
0 -1 -1
0 -2 -3
M
F
0 -6 -2
T
W
Y
4
V
Pairwise sequence alignment needs
sense of evolution
Global dynamic programming
MDAGSTVILCFVG
M
D
A
A
S
T
I
L
C
G
S
Evolution
Search matrix
MDAGSTVILCFVGMDAAST-ILC--GS
Alignment
Amino Acid Exchange
Matrix
Gap penalties
(open,extension)
Global dynamic programming
j-1
i-1
Si,j = si,j + Max
Max{S0<x<i-1, j-1 - Pi - (i-x-1)Px}
Si-1,j-1
Max{Si-1, 0<y<j-1 - Pi - (j-y-1)Px}
Global dynamic programming
Global dynamic programming
Pairwise alignment
• Global alignment: all gaps are penalised
• Semi-global alignment: N- and C-terminal
gaps (end-gaps) are not penalised
End-gaps
MSTGAVLIY--TS-------GGILLFHRTSGTSNS
End-gaps
Local dynamic programming
(Smith & Waterman, 1981)
LCFVMLAGSTVIVGTR
E
D
A
S
T
I
L
C
G
S
Negative
numbers
Amino Acid
Exchange Matrix
Search matrix
Gap penalties
(open, extension)
AGSTVIVG
A-STILCG
This is a local alignment (only part of the sequences aligned)
Local dynamic programming
(Smith & Waterman, 1981)
j-1
i-1
Si,j = Max
Si,j + Max{S0<x<i-1,j-1 - Pi - (i-x-1)Px}
Si,j + Si-1,j-1
Si,j + Max {Si-1,0<y<j-1 - Pi - (j-y-1)Px}
0
Local dynamic programming
Multiple sequence alignment (MSA)
of 12 * Flavodoxin + cheY sequence
Progressive multiple alignment
- general principle
1
2
1
3
Score 1-2 AllagainstScore 1-3 all
pairwise
Score 4-5 alignment
4
5
Scores
5×5
Scores to distances
Guide tree
Similarity
matrix
Iteration possibilities
Multiple alignment
Sequence database (or homology) searching
-available techniques
DP too slow
for repeated
database
searches
• Dynamic Programming (DP)
• FASTA
• BLAST and PSI-BLAST
• QUEST
Fast heuristics
This lecture
• HMMER
• SAM-T99
Hidden Markov modelling
(more recent, slow)
Homology Searching Motivation
•If you have an unknown gene, you can try
and find a homologous sequence (an ortholog
or a paralog) in an annotated sequence
database, i.e. a database containing sequences
for which the functions are known
•You then transfer the information from a
putatively homologous database sequence to
the query sequence
This transfer of information based on homology has
arguably produced more knowledge about genes than any
other technique
See “Primer of Genome Science” Pp. 25-26
box “GenBank Files”
Heuristic Alignment Motivation
•dynamic programming has performance
O(mn), where m and n are the sequence
lengths, which is too slow for large databases
with high query traffic
•heuristic methods do fast approximation to
dynamic programming
– FASTA [Pearson & Lipman, 1988]
– BLAST [Altschul et al., 1990]
Heuristic Alignment Motivation
• consider the task of searching SWISS-PROT
against a query sequence:
– say our query sequence is 362 amino-acids long
– SWISS-PROT release 38 contains 29,085,265 amino
acids
• finding local alignments via dynamic
programming would entail O(1010) matrix
operations
• many servers handle thousands of such queries a
day (NCBI > 50,000)
BLAST
• Basic Local Alignment Search Tool
• BLAST heuristically finds high scoring segment pairs
(HSPs):
– identical length segments each time from 2 sequences (query
and database sequence) with statistically significant match
scores
– i.e. ungapped local alignments
• key tradeoff: sensitivity vs. speed
• Sensitivity = number of significant matches detected/
number of significant matches in DB
BLAST Overview
• Given: query sequence q, word length w, word score
threshold T, segment score threshold S
– compile a list of “words” that score at least T when
compared to words from q
To gain speed, BLAST generates all words (tripeptides) from a query sequence and
for each of those the derivation of a table of similar tripeptides: the number of
tripeptides is only a fraction of total number possible.
– scan database for matches to words in list
The initial search is done for each tripeptide that can be found in the table of
similar tripeptides for each query tripeptide, and scores at least the threshold value
T when compared to the query tripeptide using a substitution matrix for scoring.
– extend all matches to seek high-scoring segment pairs
BLAST quickly scans each sequence in a database of protein sequences for
ungapped regions showing high similarity, which are called high-scoring segment
pairs (HSP), using the tables of similar peptides. The word hits are extended in
either direction in an attempt to generate an alignment with a score exceeding the
threshold of S, and as far as the cumulative alignment score can be increased.
• Return: segment pairs (HSPs) scoring at least S
Compiling list of words…
• Given:
– query sequence: QLNFSAGW
– word length w = 3
– word score threshold T = 8
• Step 1: determine all words of length w in
query sequence
QLN LNF NFS FSA SAG AGW
Compiling list of words (Ctd)…
• Step 2: determine all words that score at least T
when compared to a word in the query sequence:
words from
sequence
QLN
LNF
NFS
…
SAG
...
query words w/ T=8
QLN=11, QMD=9, HLN=8, ZLN=9,…
LNF=9, LBF=8, LBY=7, FNW=7,…
NFS=12, AFS=8, NYS=8, DFT=10,…
none
Scanning the Database
• Search all sequences in the database for all
occurrences of query words that
• Remember hits
Extending Hits
• Extend hits in both directions (without allowing
gaps)
• Terminate extension in one direction when score
falls certain distance below best score for shorter
extensions
• return segment pairs scoring at least S
Sensitivity versus Running Time
• the main parameter controlling the
sensitivity vs. running-time trade-off is T
(threshold for what becomes a query word)
– small T: greater sensitivity, more hits to expand
– large T: lower sensitivity, fewer hits to expand
BLAST Notes
• may fail to find all HSPs
– may miss seeds if T is too stringent
– extension is greedy
• empirically, 10 to 50 times faster than SmithWaterman
• is a heuristic local alignment technique
• large impact:
– NCBI’s BLAST server handles more than 50,000
queries a day
– most used bioinformatics program
BLAST flavours
• blastp compares an amino acid query sequence
against a protein sequence database
• blastn compares a nucleotide query sequence
against a nucleotide sequence database
• blastx compares the six-frame conceptual protein
translation products of a nucleotide query
sequence against a protein sequence database
• tblastn compares a protein query sequence against
a nucleotide sequence database translated in six
reading frames
• tblastx compares the six-frame translations of a
nucleotide query sequence against the six-frame
translations of a nucleotide sequence database.
More Recent BLAST Extensions
• the two-hit method
• gapped BLAST
• PSI-BLAST
all are aimed at increasing sensitivity while
limiting run-time
• Altschul et al., Nucleic Acids Research 1997
The Two-Hit Method
• extension step typically accounts for 90% of
BLAST’s execution time
• key idea: do extension only when there are
two hits on the same diagonal within
distance A of each other
• to maintain sensitivity, lower T parameter
– more single hits found
– but only small fraction have associated 2nd hit
The Two-Hit Method
Figure from: Altschul et al. Nucleic Acids Research 25, 1997
Gapped BLAST
• Start gapped alignment only if two-hit
extension has a sufficiently high score
• find length-11 segment with highest score;
use central pair in this segment as seed
• run DP process both forward & backward
from seed
• prune cells when local alignment score falls
a certain distance below best score yet
Gapped BLAST
The black parts in the figure are the parts that are covered by Dynamic Programming
starting in two directions from the seed: the best alignment found in both directions
are then combined in the final optimal gapped alignment.
Figure from: Altschul et al. Nucleic Acids Research 25, 1997
BLAST usage
• BLAST produces a list of
Q
sequences that score higher
than the specified threshold
Pos.
(putative homologs)
• But there is always the
T
problem of false positives
and false negatives
DB
Neg.
• As a trick to find more
sequences, you can use
database sequences found as
a query for a new BLAST
search or use
See “Primer of Genome Science” P. 86-87
box “Searching Sequence Databases Using
PSI-BLAST
BLAST”
PSI-BLAST
PSI (Position Specific Iterated) BLAST
basic idea:
1. Carry out gapped-BLAST using the query sequence to find first
hits
Query sequence is first scanned for the presence of so-called low-complexity regions
(Wooton and Federhen, 1996), i.e. regions with a biased composition likely to lead to
spurious hits are excluded from alignment.
2. use results from (gapped) BLAST query to construct a profile
matrix (PSSM), containing information about the query sequence
and hits found
The program takes significant local alignments found (E-value better than threshold),
constructs a (master-slave) multiple alignment and abstracts a position specific scoring
matrix (PSSM) from this alignment.
3.
4.
search database with PSSM (containing improved information
from multiple sequence segments) instead of single query
sequence
Iterate preceding two steps
Rescan the database in a subsequent round to find more homologous sequences. Iteration
continues until user decides to stop or search has converged (no more hits found)
PSI-BLAST iteration
Lowcomplexity
region
Q
Query sequence
xxxxxxxxxxxxxxxxx
Gapped BLAST search
Q
Query sequence
xxxxxxxxxxxxxxxxx
Database hits
make PSSM
make new
PSSM
A
C
D
.
.
Y
PSSM
Pi
Px
Gapped BLAST search
A
C
D
.
.
Y
Pi
Px
PSSM
Database hits
A Profile Matrix (Position Specific
Scoring Matrix – PSSM)
PSI BLAST
• Searching with a Profile
• aligning profile matrix to a simple sequence
– like aligning two sequences
– except score for aligning a character with a matrix
position is given by the matrix itself
– not a substitution matrix
PSI BLAST:
Constructing the Profile Matrix
Remember that only local fragments are fished out of the database
by BLAST! These can cover only part of the query sequence.
Figure from: Altschul et al. Nucleic Acids Research 25, 1997
PSI-BLAST output example
Normalised sequence similarity
The p-value is defined as the probability of seeing at
least one unrelated score S greater than or equal to a
given score x in a database search over n sequences.
This probability follows the Poisson distribution
(Waterman and Vingron, 1994):
P(x, n) = 1 – e-nP(S x),
where n is the number of sequences in the database
Depending on x and n (fixed)
Normalised sequence similarity
Statistical significance
The E-value is defined as the expected number of nonhomologous sequences with score greater than or equal
to a score x in a database of n sequences:
E(x, n) = nP(S  x)
if E-value = 0.01, then the expected number of random
hits with score S  x is 0.01, which means that this Evalue is expected by chance only once in 100
independent searches over the database.
if the E-value of a hit is 5, then five fortuitous hits with S
 x are expected within a single database search, which
renders the hit not significant.
Normalised sequence similarity
Statistical significance
• Database searching is commonly performed
using an E-value in between 0.1 and 0.001.
• Low E-values decrease the number of false
positives in a database search, but increase
the number of false negatives, thereby
lowering the sensitivity of the search.
Functional annotation by BLAST
local search
Serious problem: multi-domain proteins
See “Primer of
Genome Science”
Pp. 105-108:
“Functional
Annotation and
Gene Family
Clusters”
Homology-derived Secondary Structure of Proteins
(HSSP)
Sander & Schneider, 1991
Literature: Read the following pages
in Gibson and Muse’s “Primer of
Genome Science”
Pp. 25-26 box “GenBank Files”
Pp. 72-75 box “Pairwise Sequence Alignment”
Pp. 86-87 box “Searching Sequence Databases Using BLAST”
Pp. 105-108: “Functional Annotation and Gene Family Clusters”
P. 114 box “Phylogenetics”