Searching Sequence Databases

Download Report

Transcript Searching Sequence Databases

Dot Plots, Path Matrices, Score Matrices
Sequence A
Sequence B
VI L S T R IV HVNS I L P S T N
V
I
L
S
T
R
I
V
I
L
P
E
F
S
T
diagonal lines give equivalent residues
Sequence A
Sequence B
VI L S T R I V HVNS I L P S T N
V
I
L
S
T
R
I
V
I
L
P
E
F
S
T
identical residues score 1
highest scoring path across the matrix gives best alignment
Alignment from Dot Plot
VILSLV
ILPQRSLVVILSLVI LALTV
STVILSLVNVILPQR
ILSLVISLAL
score = 20
sequence identity = 20/26 = 75%
Path or Score Matrix
A
I
C
I
N
R
C
K
C
R
H
P
A
1
0
0
0
0
0
0
0
0
0
0
0
H
0
0
0
0
0
0
0
0
0
0
1
0
C
0
0
1
0
0
0
1
0
1
0
0
0
N
0
0
0
0
1
0
0
0
0
0
0
0
I
0
1
0
1
0
0
0
0
0
0
0
0
R
0
0
0
0
0
1
0
0
0
1
0
0
Q
0
0
0
0
0
0
0
0
0
0
0
0
C
0
0
1
0
0
0
1
0
1
0
0
0
L
0
0
0
0
0
0
0
0
0
0
0
0
C
0
0
1
0
0
0
1
0
1
0
0
0
R
0
0
0
0
0
1
0
0
0
1
0
0
P
0
0
0
0
0
0
0
0
0
0
0
1
M
0
0
0
0
0
0
0
0
0
0
0
0
…HRKVLA
ALVKRH…
1 0 0 0 0……
1
1
1
1
Residue
substitution
matrix
Needleman & Wunsch
A H C N I
R Q C L C R P M
A
I
1 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0 0 0 0
C
I
N
R
0
0
0
0
C
K
C
0 0 1 0 0 0 0 1 0 1 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 1 0 1 0 0 0
R
H
P
0 0 0 0 0 1 0 0 0 0 1 0 0
0 1 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 1 0
0
0
0
0
1
0
0
0
0
0
1
0
0
1
0
0
0
0
0
1
0
0
0
0
1
0
0
0
0
0
0
0
1
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
Needleman & Wunsch Algorithm
•
Accumulate the matrix by starting in the bottom row
• moving from right to left a row at a time
• adding score from highest scoring cell in ‘column or row, to right and
below’ to current cell.
•
find the highest scoring path in the matrix by:
•
starting in the top left corner
• moving down across the matrix from cell to cell
• choosing the highest scoring cell at each move
•
the path can not go back on itself or cross the same row or column twice
Accumulating the Matrix
• Add to the score in the cell the highest score from a cell in:
‘the column or row to right and below’
i,j
i-1,j-1
i-n,j-1
i-1,j-m
Sequence A
Sequence B
A H C N I
R Q C L C R P M
A
I
8 7 6 6 5 4 4 3 3 2 1 0 0
7 7 6 6 6 4 4 3 3 2 1 0 0
C
I
N
R
6
6
5
4
C
K
C
3 3 4 3 3 3 3 4 3 3 1 0 0
3 3 3 3 3 3 3 3 3 2 1 0 0
2 2 3 2 2 2 2 3 2 3 1 0 0
R
H
P
2 1 1 1 1 2 1 1 1 1 2 0 0
1 2 1 1 1 1 1 1 1 1 1 0 0
0 0 0 0 0 0 0 0 0 0 0 1 0
6
6
5
4
7
6
5
4
6
5
6
4
5
6
5
4
4
4
5
5
4
4
4
4
4
3
3
3
3
3
3
3
3
2
3
2
1
1
1
2
0
0
0
0
0
0
0
0
Possible Moves in Finding a Path across
the Matrix
• start in the leftmost or topmost row
• move to the highest scoring cell in:
‘the column or row to right and below’
i,j
i-1,j-1
i-n,j-1
i-1,j-m
Sequence A
Sequence B
A H C N I
R Q C L C R P M
A
I
8 7 6 6 5 4 4 3 3 2 1 0 0
7 7 6 6 6 4 4 3 3 2 1 0 0
C
I
N
R
6
6
5
4
C
K
C
3 3 4 3 3 3 3 4 3 3 1 0 0
3 3 3 3 3 3 3 3 3 2 1 0 0
2 2 3 2 2 2 2 3 2 3 1 0 0
R
H
P
2 1 1 1 1 2 1 1 1 1 2 0 0
1 2 1 1 1 1 1 1 1 1 1 0 0
0 0 0 0 0 0 0 0 0 0 0 1 0
6
6
5
4
7
6
5
4
6
5
6
4
5
6
5
4
4
4
5
5
4
4
4
4
4
3
3
3
3
3
3
3
3
2
3
2
1
1
1
2
0
0
0
0
0
0
0
0
Sequence B
Sequence A
A
I
C
I
N
R
C
K
C
R
H
P
AH
8 7
7 7
6 6
6 6
5 5
4 4
3 3
3 3
2 2
2 1
1 2
0 0
C
6
6
7
6
5
4
4
3
3
1
1
0
N
6
6
6
5
6
4
3
3
2
1
1
0
I
5
6
5
6
5
4
3
3
2
1
1
0
R
4
4
4
4
5
5
3
3
2
2
1
0
Q
4
4
4
4
4
4
3
3
2
1
1
0
C
3
3
4
3
3
3
4
3
3
1
1
0
L
3
3
3
3
3
3
3
3
2
1
1
0
C
2
2
3
2
3
2
3
2
3
1
1
0
R
1
1
1
1
1
2
1
1
1
2
1
0
P
0
0
0
0
0
0
0
0
0
0
0
1
M
0
0
0
0
0
0
0
0
0
0
0
0
AHCNI -RQCLCR - PM
A IC - INR- CKCRHPM
Sequence A
Sequence B
V I L S L V I L P Q R S L V V I L S L V I L A L T V
gap
penalty
=3
S
T
V
I
L
S
L
V
R
N
V
I
L
P
Q
R
I
L
S
L
V
I
S
L
A
L
6
3
6
3
5
6
runs
(tuples) of
3
residues
3
6
3
SCORE =
20 - 9 =
11
BLAST
Basic Local Alignment Tool
Altschul et al (1990)
• A highest scoring segment pair (HSP) is found between
two sequences
the sequences may be related if
HSP score > cutoff
matches significant ‘words’ or segments and then extends
these matches using local dynamic programming
BLAST
Step 1: match significant words
query sequence of
length L
For each sequence find the
‘words’ with significant
scores
BLAST
Step 2: compare the word list to the database and identify exact matches
BLAST
Step 3: for each word match, extend the alignment using a PAM matrix and
dynamic programming
BLAST
• searches for 2 non-overlapping segments on same diagonal
• must be within a certain distance of each other before extension is
invoked
• can also allow gaps so that the method joins segments on different
diagonals
Assessing the Significance of Sequence Match
• length - can get artificially high scores between small sequences
• composition - if sequences are rich in particular amino acid
residues can get high scores for unrelated proteins
• to assess the significance of a match it is necessary to compare
the score with that returned by random or unrelated sequences
• if the database is small or when considering a pair-wise
comparison, the sequences can be shuffled to generate random
sequences
Assessing the Significance of Scores Returned from a Database Scan
frequency
S-m
s.d
mean
score
probe score S
Z score = score (S) - mean for unrelated (m)
standard deviation (s.d)
Z value > 3 s.d
related sequences
BLAST results
BLAST best hit
>gi|17472322|ref|XP_061555.1| (XM_061555) similar to orphan G
protein-coupled receptor GPR26
S - score for the pairwise
alignment.
[Homo sapiens]
Length = 337
Score =
298 bits (762), Expect = 8e-80
Identities = 168/327 (51%)
Query: 1
MGPGEALLAGLLVMVLAVALLSNALVLLCCAYSAELRTRASGVLLVNLSLGHLLLAALDM 60
G+LL
++M
M
A LAGLLV
+ V+LLSNALVLLC
+SA++R +A
+
+NL+
Sbjct: 1
MNSWNAGLAGLLVGTIGVSLLSNALVLLCLLHSADIRRQAPALFTLNLTCGNLLCTVVNM 60
Query: 61
PFTLLGVMRGRTPSAPGACQVIGFLDTFLASNAALSVAALSADQWLAVGFPLRYAGRLRP 120
Y
++R
P TL GV+
R P+
C++
FLDTFLA+N+ LS+AALS D+W+AV FPL
Sbjct: 61
PLTLAGVVAQRQPAGDRLCRLAAFLDTFLAANSMLSMAALSIDRWVAVVFPLSYRAKMRL 120
Query: 121
RYAGLLLGCAWGQSLAFSGAALGCSWLGYSSAFASCSLRLPPEPERPRFAAFTATLHAVG 180
HA+
R A L++
W
+L F
AAL
SWLG+
+ASC+L
ER RFA FT
Sbjct: 121
RDAALMVAYTWLHALTFPAAALALSWLGFHQLYASCTLCSRRPDERLRFAVFTGAFHALS 180
E value - number of hits
you would expect by
chance with score S or
higher given the size of
the database and the
length of the alignment
Good Match
< 1 X 10-50
Possible Match
1 X 10-50 to 1 X 10-2
Multiple Alignment
• direct extensions of the standard DP approach for the alignment of
2 sequences are computationally impossible for more than 3
sequences
• practical heuristic solutions are based on the idea that sequences
are evolutionary related and can be aligned using an underlying
phylogenetic tree
this is known as progressive alignment
(1) Pairwise Alignment
4 sequences A, B, C, D
A
6 pairwise comparisons
B
then cluster analysis
C
D
(2) Multiple Alignment following the tree from 1
B
Align most similar pair
D
gaps to optimise alignment
A
C
A
C
B
D
Align next most similar pair
new gap to optimise alignment of BD with AC
Align alignments - preserve gaps
B
D
A
C
Multiple Alignment
• start by aligning the most closely related pairs using DP and
gradually align these groups together keeping the gaps that
appear in earlier alignments fixed
• alternatively can add sequences one at a time to a growing
multiple alignment
the heuristic approach is not guaranteed to find the optimum
alignment - but it is soundly based, biologically
ClustalW
Higgins, 1997
• since the choice of parameters used can have significant effect on the
alignment for very distant sequences, ClustalW addresses this problem
by:
position specific gap opening and extension penalties
using different amino acid substitution matrices - one for close relatives,
one for distant
More recent resources:
MAFFT
MUSCLE
JALVIEW
ClustalW
Gap penalties
• where structure is known, one would want to increase the gap penalty
within helices and strands and decrease it between them - forcing gaps
to occur more frequently in loops
• if no structure known, can use simple rules which depends on the
residues occurring and the frequencies of gaps
e.g. use lower gap penalties where gaps already occur
Identifying sequence conserved residue positions
multiple sequence
alignment of relatives
from functional group
1 = highly
conserved
Structural model
Putative functional site
0=
unconserved
Scorecons - Thornton
Score
conservation
for each
position in the
alignment using
an entropy
measure
Searching Sequence Databases
Do fast scans using approximate
methods e.g. BLAST or PSIBLAST
Align proteins carefully using a dynamic
programming method Needleman & Wunsch
Smith & Waterman
Scan against sequence profiles (or
HMMs) in secondary databases e.g.
Pfam, Gene3D, InterPro
Align query sequence against family relatives
using: ClustalW, Jalview, MUSCLE, MAFFT
Can you inherit functional information?
Profile Based Sequence Search Methods
 by comparing related sequences within a protein family can
identify patterns of conserved residues
 even the most distant members of the family should have these
patterns of conserved residues
 can make a profile which encapsulates these patterns and use it
to detect more distantly related sequences
 highly conserved positions usually correspond to the buried core
or functional residues within the active site
Iterated Application of BLAST
PSI-BLAST
Altschul et al. (1997)
• first constructs a multiple alignment of all the related sequences
identified by BLAST
• then estimates the residue frequencies at each position to construct a
score matrix Position Specific Score Matrices (PSSM) also known as
weight matrices or 1D profile
PSI-BLAST
Altschul et al. (1997)
UniProt Database
query
sequence
aligns matched
sequences and builds
profile
further iterations pull out more distant sequence relatives
PSI-BLAST
Use the Multiple Alignment to Calculate Residue Frequencies
the residue frequencies at each position are used to calculate the scores
for aligning a query sequence against the pattern
query
relatives
P1……...
P5 P6…………... Pn…………...
three times more powerful than BLAST!!
putative
relative
…HRVLA
10
10
20
70
90
.
A
I
C
I
N
R
C
K
C
R
H
P
10
70
70
90
Position specific
substitution
matrix
Path matrix or
score matrix
Searching Protein Family Databases
Secondary databases (as opposed to primary sequence databases) group
proteins into related families
Families are usually represented by a sequence profile or sequence model
(Hidden Markov Model HMM) derived from a multiple sequence alignment of the
relatives
HMMs for Protein Domain Family Recognition
Pfam, SUPERFAMILY, Gene3D :
Hidden Markov Models (HMMs)
sequence is aligned using a probabilistic model of interconnecting
match, delete or insert states
contains statistical information on observed and expected
positional variation - “fingerprint of a protein family”
5 times more powerful than BLAST
Di
Ii
B
Mi
E
Pfam-A
Pfam-B
Other

Pfam-A

10,340 curated families with annotation
Pfam-B

224,303 families derived from ADDA

(50% clearly related to a Pfam-A)

UniProt coverage




74% of sequences
51% of residues
PDB coverage


94% of sequences
76% of residues
Pfam :
SEED alignment
Profile-HMM
representative members
HMMer-2.0
Search UniProt
FULL alignment
Manually curated
Automatically made
Pfam classification
Protein fold,
etc.
Protein
Pfam classification
Protein fold,
etc.
Family
Protein
Pfam classification
Clan
Protein fold,
etc.
Family
Protein