Searching Sequence Databases


Bioinformatics Methods for Inheriting Structural and Functional Annotations for Gene Sequences
• If a related sequence has a known function, can you inherit its functional properties?
• If a related sequence has a known structure, can you model the unknown structure using the known one?
• Structural information can often provide additional clues as to the function
• What are the best methods to use?
• What thresholds should be used for safe inheritance of functional properties?
Homologues are related sequences:
• paralogs arise by duplication: gene a duplicates to give a and b
• orthologs arise by speciation: a and b in species 1 correspond to a and b in species 2
Protein Sequence and Structure Databases
• GenBank sequence database in the States has over 120 million sequences, some partial. More than a million non-identical sequences
• DNA Data Bank of Japan (DDBJ)
• UniProt (SWISS-PROT) database has more than a million non-identical sequences, all validated gene sequences
• Protein Data Bank (PDB in the States, PDBe in the UK) has more than 70,000 entries
Web Based Public Resources containing Functional
Annotations
• Protein Family and Function databases
Pfam, InterPro, PROSITE, PRINTS, PANTHER, SMART, SCOP,
CATH, HOMSTRAD
• Databases of biochemical pathways and biological function
KEGG, WIT, GO, FunCat, EC
• Databases of Protein-Ligand Interactions
IntAct, MIPS, RELIBASE, BIND, DIP, IrefIndex
• Species Databases
ENSEMBL, FlyBase, YPD, WORMDb, GenProtEC, EcoCyc
Evolution of Protein Sequences
• substitutions due to single base mutations
• insertions or deletions (indels) of residues, usually not in the secondary structures but in the connecting loops
• indels can make it harder to compare sequences: the equivalent regions have to be lined up and gaps put where there are indels
Human Hemoglobin: Alpha and Beta Chains
a VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTT
b VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQ

a KTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPN
b RFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHL

a ALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEF
b DNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHH

a TPAVHASLDKFLASVSTVLTSKYR
b FGKEFTPPVQAAYQKVVAGVANALAHKYH
Human Hemoglobin: Alpha and Beta Chains
a VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTT
b VHLTPEEKSAVTALWGKV-NVDEVGGEALGRLLVVYPWT

a KTYFPHF-DLSH-----GSAQVKGHGKKVADALTNAVAHV
b QRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHL

a DDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAH
b DNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHH

a LPAEFTPAVHASLDKFLASVSTVLTSKYR
b FGKEFTPPVQAAYQKVVAGVANALAHKYH
Percentage Sequence Identity
percentage sequence identity = (number of identical residues x 100) / (number of residues in smallest protein)
For globin example:
without gaps ~9%
with gaps ~41%
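The formula can be tried out on a short stretch of the globin alignment. What follows is only a minimal sketch: the function name percent_identity and the use of '-' as the gap character are assumptions made for illustration, and the input is a 12-column fragment rather than the full chains, so the result is not the ~41% quoted for the whole alignment.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage identity of two aligned rows of equal length."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have the same length")
    # Identical residues: same letter in a column, and not a gap.
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap
    )
    # Denominator: number of residues (gaps removed) in the smaller protein.
    smallest = min(
        len(aligned_a.replace(gap, "")), len(aligned_b.replace(gap, ""))
    )
    return 100.0 * identical / smallest


# Fragment of the alpha/beta globin alignment, with the gap written as '-'.
alpha = "KTYFPHF-DLSH"
beta = "QRFFESFGDLST"
print(round(percent_identity(alpha, beta), 1))
```

Five columns are identical and the shorter fragment has 11 residues, so the fragment comes out at 5 x 100 / 11.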
Searching for Homologues with Related Functions
• How do you handle the evolutionary changes?
• How similar do the sequences need to be to inherit structural and functional properties?
• How do you cope with the volume of data, i.e. millions of sequences to search?
Searching Sequence Databases
• Do fast scans using approximate methods, e.g. BLAST or PSI-BLAST
• Align proteins carefully using a dynamic programming method: Needleman & Wunsch, Smith & Waterman
• Scan against sequence profiles (or HMMs) in secondary databases, e.g. Pfam, InterPro, Gene3D
• Align query sequence against family relatives using: ClustalW, Jalview, MUSCLE, MAFFT
• Can you inherit functional information?
Dot Plots, Path Matrices, Score Matrices
[Figure: dot plot of Sequence A against Sequence B, with VILSTRIVHVNSILPSTN along the top and VILSTRIVILPEFST down the side]
diagonal lines give equivalent residues
identical residues score 1
highest scoring path across the matrix gives best alignment
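A dot plot scored with identities can be sketched in a few lines. This is an illustration only: the function names are hypothetical, the two sequences are the ones read off the figure, and putting one of them on the columns and the other on the rows is an assumption about how the figure was drawn.

```python
def identity_matrix(seq_top: str, seq_side: str) -> list[list[int]]:
    """Score matrix: 1 where the residues are identical, otherwise 0.

    Columns follow seq_top, rows follow seq_side.
    """
    return [[1 if a == b else 0 for a in seq_top] for b in seq_side]


def longest_diagonal(matrix: list[list[int]]) -> int:
    """Length of the longest unbroken diagonal run of 1s."""
    best = 0
    # run[i + 1][j + 1] holds the diagonal run length ending at cell (i, j).
    run = [[0] * (len(matrix[0]) + 1) for _ in range(len(matrix) + 1)]
    for i, row in enumerate(matrix):
        for j, cell in enumerate(row):
            if cell:
                run[i + 1][j + 1] = run[i][j] + 1
                best = max(best, run[i + 1][j + 1])
    return best


top = "VILSTRIVHVNSILPSTN"
side = "VILSTRIVILPEFST"
matrix = identity_matrix(top, side)

# Print the dot plot: '*' marks identical residues.
print("  " + " ".join(top))
for residue, row in zip(side, matrix):
    print(residue + " " + " ".join("*" if cell else "." for cell in row))

print(longest_diagonal(matrix))
```

The longest diagonal here is the shared stretch VILSTRIV at the start of both sequences.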
[Figure: dot plot of Sequence A against Sequence B, with VILSLVILPQRSLVVILSLVILALTV along the top and STVILSLVRNVILPQRILSLVISLAL down the side, showing only runs (tuples) of 3 residues; the diagonals are labelled 6, 3, 6, 3, 5, 6, 3, 6, 3]
gap penalty = 3
SCORE = 20 - 9 = 11
Alignment from Dot Plot
--VILSLV--ILPQRSLVVILSLVI-LALTV
STVILSLVNVILPQR----ILSLVISLAL--
score = 20
sequence identity = 20/26 = 77%
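The numbers quoted for this alignment can be checked with a short script. It is a sketch under stated assumptions: the two rows are the alignment above written with '-' for gaps (the gap placement is a reconstruction), each gap is charged the penalty of 3 once whatever its length, and gaps at the ends are not penalised. Those choices are one reading that reproduces 20 - 9 = 11. The function names are hypothetical.

```python
import re

GAP = "-"

# Alignment from the dot plot; the gap placement is a reconstruction.
ROW_1 = "--VILSLV--ILPQRSLVVILSLVI-LALTV"
ROW_2 = "STVILSLVNVILPQR----ILSLVISLAL--"


def count_identities(row_1: str, row_2: str) -> int:
    """Number of columns holding the same residue in both rows."""
    return sum(1 for x, y in zip(row_1, row_2) if x == y and x != GAP)


def count_internal_gaps(row: str) -> int:
    """Count runs of '-' inside the row; end gaps are stripped first."""
    return len(re.findall(r"-+", row.strip(GAP)))


def alignment_score(row_1: str, row_2: str, gap_penalty: int = 3) -> int:
    """Identities minus one penalty per internal gap (assumed scheme)."""
    gaps = count_internal_gaps(row_1) + count_internal_gaps(row_2)
    return count_identities(row_1, row_2) - gap_penalty * gaps


print(count_identities(ROW_1, ROW_2))           # 20
print(alignment_score(ROW_1, ROW_2))            # 11
print(round(100 * count_identities(ROW_1, ROW_2) / 26, 1))
```

There are three internal gaps in total (two in the first row, one in the second), which gives the penalty of 9.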
Dynamic Programming Methods
• Needleman & Wunsch: global alignment
• Smith & Waterman: local alignment
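The difference between the two methods shows up when a short sequence sits inside a longer one. The sketch below computes scores only, not the alignments themselves. The scoring scheme (match +1, mismatch -1, gap -1) is an assumption chosen for illustration and is not taken from the slides. It uses the commonly taught recurrence with a linear gap penalty.

```python
def global_score(a: str, b: str, match=1, mismatch=-1, gap=-1) -> int:
    """Needleman & Wunsch style score: both sequences aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def local_score(a: str, b: str, match=1, mismatch=-1, gap=-1) -> int:
    """Smith & Waterman style score: best scoring pair of sub-sequences."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, so an alignment can restart anywhere.
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


short = "VILSTRIV"
long_seq = "QQQVILSTRIVWW"
print(global_score(short, long_seq))  # 3: eight matches minus five gap positions
print(local_score(short, long_seq))   # 8: the embedded copy is found intact
```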
Significance of sequence similarity – length dependence
[Figure: sequence identity (%) plotted against length (0 to 400 residues), marking the region of homologous pairs]
• protein pairs having > 150 residues are homologous if the sequence identity is > 25%
• short proteins/fragments of 20-40 residues: 30% sequence identity frequently occurs by chance
If proteins are homologous they are likely to have similar structures and functions.
Sequence identity between homologues required for inheriting structure or function:
• Modelling a structure based on the structure of a homologue: >= 30%
• Inheriting functional properties from a homologue: >= 60%
The structures of proteins in a family tend to be much more highly conserved during evolution than the sequences (and, in some families, the function)
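The thresholds above can be written down as a simple rule of thumb. This is a sketch only: the function and key names are hypothetical, and treating pairs of 150 residues or fewer as "not supported by identity alone" is an assumption, since the slides give no safe cut-off for short sequences.

```python
def annotation_transfer(identity_pct: float, length: int) -> dict[str, bool]:
    """Apply the quoted thresholds to one pair of sequences.

    identity_pct: percentage sequence identity of the pair
    length: number of residues compared
    """
    # > 150 residues and > 25% identity: homologous according to the slides.
    supported = length > 150 and identity_pct > 25
    return {
        "homology_supported": supported,
        # >= 30% identity to model a structure on a homologue.
        "model_structure": supported and identity_pct >= 30,
        # >= 60% identity to inherit functional properties.
        "inherit_function": supported and identity_pct >= 60,
    }


print(annotation_transfer(65, 300))
print(annotation_transfer(35, 300))
print(annotation_transfer(30, 30))   # short fragment: 30% can occur by chance
```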
Residue Substitution Matrices
A substitution matrix is a 20 x 20 matrix which scores each possible comparison of residues
Identity Matrix
• simplest scoring scheme: amino acids are either identical (score 1) or non-identical (score 0)
Physicochemical Properties Matrix
• scores residue pairs according to similarities in their physico-chemical properties, e.g. Val->Leu scores well, Val->Arg scores low
Evolutionary Matrices
• scores residue pairs according to how frequently the mutation is observed to occur in evolution, e.g. Dayhoff (PAM), BLOSUM matrices
Dayhoff Matrix (PAM or MDM)
• based on evolutionary relationships, it is derived by analysing the substitutions observed in closely related sequences (>80% identity)
• the method measures evolutionary distance by determining the number of point accepted mutations, where:
1 PAM = one accepted point mutation per 100 residues
• for distant relatives in the twilight zone (<25% identity), generally use a 250 PAM matrix
• for database searches generally use a 120 PAM matrix
BLOSUM Substitution Matrices
Henikoff & Henikoff (1993)
• matrix is derived from analysing substitution patterns in more distant
relatives (i.e. < 85% identity)
• for clusters of related sequences (e.g. 60% ID, 80% ID) derive
multiple alignments without gaps, for short regions of related
sequences
• use the alignments to calculate residue substitution frequencies
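The last step, going from an ungapped block to substitution scores, can be illustrated with a toy calculation. This is a simplified sketch of a log-odds calculation in the BLOSUM style, not the published procedure in full: the clustering of sequences by percentage identity is left out, the block below is invented for illustration, and the function names are hypothetical. Scores are in half-bit units (2 x log2 of observed over expected pair frequency), rounded to integers.

```python
from collections import Counter
from itertools import combinations
from math import log2


def pair_frequencies(block: list[str]) -> Counter:
    """Count unordered residue pairs within each column of an ungapped block."""
    counts = Counter()
    for column in zip(*block):
        for x, y in combinations(column, 2):
            counts[tuple(sorted((x, y)))] += 1
    return counts


def log_odds_scores(block: list[str]) -> dict:
    """Half-bit log-odds score for every residue pair seen in the block."""
    counts = pair_frequencies(block)
    total = sum(counts.values())
    observed = {pair: n / total for pair, n in counts.items()}
    # Background frequency of each residue implied by the observed pairs.
    background = Counter()
    for (x, y), freq in observed.items():
        if x == y:
            background[x] += freq
        else:
            background[x] += freq / 2
            background[y] += freq / 2
    scores = {}
    for (x, y), freq in observed.items():
        chance = background[x] * background[y]
        expected = chance if x == y else 2 * chance
        scores[(x, y)] = round(2 * log2(freq / expected))
    return scores


# Invented 4-sequence block, 4 columns, no gaps.
toy_block = ["AASV", "AASV", "AASI", "ASSV"]
scores = log_odds_scores(toy_block)
print(scores[("A", "A")], scores[("A", "S")])
```

With so little data the numbers mean nothing biologically; the point is only that a pair seen more often than chance predicts gets a positive score and a pair seen less often gets a negative one.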
Which Matrix Should be Used?
• Matrices derived from observed substitution data (e.g. DAYHOFF,
BLOSUM) are better than identity matrix or those based on physical
properties
• various studies suggest that PAM250 gives the best result when
aligning distant proteins using dynamic programming algorithms
• in database searching it may be better to use PAM120 or
BLOSUM62
BLAST
Basic Local Alignment Search Tool
Altschul et al (1990)
• A highest scoring segment pair (HSP) is found between two sequences
• the sequences may be related if HSP score > cutoff
• matches significant ‘words’ or segments and then extends these matches using local dynamic programming
BLAST
Step 1: match significant words. For the query sequence of length L, find the ‘words’ with significant scores
Step 2: compare the word list to the database and identify exact matches
Step 3: for each word match, extend the alignment using a PAM matrix and dynamic programming
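Steps 1 and 2 can be sketched as a word index over the database. This is a simplification and not BLAST itself: every word of the query is used (no scoring of words against a threshold), the word length of 3 is an assumption, step 3 is not covered, and the database below is invented. All function names are hypothetical.

```python
from collections import defaultdict


def words(sequence: str, k: int = 3) -> list[tuple[int, str]]:
    """All overlapping words of length k, with their start positions."""
    return [(i, sequence[i:i + k]) for i in range(len(sequence) - k + 1)]


def build_index(database: dict[str, str], k: int = 3) -> dict:
    """Map each word to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, sequence in database.items():
        for position, word in words(sequence, k):
            index[word].append((name, position))
    return index


def find_hits(query: str, database: dict[str, str], k: int = 3) -> list:
    """Exact word matches as (sequence name, query position, subject position)."""
    index = build_index(database, k)
    hits = []
    for q_pos, word in words(query, k):
        for name, s_pos in index.get(word, []):
            hits.append((name, q_pos, s_pos))
    return hits


toy_database = {"s1": "AAVILSTT", "s2": "GGGGGG"}
print(find_hits("VILS", toy_database))
```

Each hit would then be the starting point for the extension in step 3.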
BLAST
• searches for 2 non-overlapping segments on same diagonal
• must be within a certain distance of each other before extension is
invoked
• can also allow gaps so that the method joins segments on different
diagonals
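The two-segment requirement can be sketched by grouping word hits by diagonal. Only the first two bullets are covered here. The word length of 3 and the maximum distance of 40 are assumptions for illustration, and two_hit_seeds is a hypothetical name.

```python
from collections import defaultdict


def two_hit_seeds(hits, word_len: int = 3, max_distance: int = 40):
    """Find pairs of non-overlapping word hits on the same diagonal.

    hits: (query position, subject position) pairs for one database sequence.
    Returns (diagonal, first query position, second query position) tuples.
    """
    by_diagonal = defaultdict(list)
    for q_pos, s_pos in hits:
        # Hits on the same diagonal share the same subject - query offset.
        by_diagonal[s_pos - q_pos].append(q_pos)

    seeds = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        for idx, second in enumerate(positions):
            for first in positions[:idx]:
                distance = second - first
                # Non-overlapping (at least one word apart) and close enough.
                if word_len <= distance <= max_distance:
                    seeds.append((diagonal, first, second))
                    break
    return seeds


toy_hits = [(0, 10), (5, 15), (6, 16), (2, 30)]
print(two_hit_seeds(toy_hits))
```

The hit at (2, 30) lies alone on its diagonal, so it does not trigger an extension.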
Assessing the Significance of Sequence Match
• length - can get artificially high scores between small sequences
• composition - if sequences are rich in particular amino acid
residues can get high scores for unrelated proteins
• to assess the significance of a match it is necessary to compare
the score with that returned by random or unrelated sequences
• if the database is small or when considering a pair-wise
comparison, the sequences can be shuffled to generate random
sequences
Assessing the Significance of Scores Returned from a Database Scan
[Figure: frequency distribution of scores for unrelated sequences, marking the mean (m), the standard deviation (s.d.) and the probe score S]
Z score = (score (S) - mean for unrelated (m)) / standard deviation (s.d.)
Z value > 3 s.d.: related sequences
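The Z score is easy to compute once a set of scores for unrelated (or shuffled) sequences is available. This is a minimal sketch: the simple identity count used as the score, the use of the population standard deviation, the number of shuffles and the fixed random seed are all assumptions made for illustration.

```python
import random
from statistics import mean, pstdev


def z_score(score: float, unrelated_scores: list[float]) -> float:
    """(S - m) / s.d., with m and s.d. taken from the unrelated scores."""
    return (score - mean(unrelated_scores)) / pstdev(unrelated_scores)


def identity_score(a: str, b: str) -> int:
    """Toy score: identical residues at the same position, no gaps."""
    return sum(1 for x, y in zip(a, b) if x == y)


def shuffled_scores(query: str, target: str, n: int = 200, seed: int = 0):
    """Score the query against n shuffled copies of the target."""
    rng = random.Random(seed)
    residues = list(target)
    scores = []
    for _ in range(n):
        rng.shuffle(residues)  # same composition, random order
        scores.append(identity_score(query, "".join(residues)))
    return scores


# Worked example with made-up background scores: mean 5, s.d. 2.
background = [2, 4, 4, 4, 5, 5, 7, 9]
print(z_score(11, background))  # 3.0

# The same idea using shuffled sequences as the random background.
query = target = "VILSTRIVHVNSILPSTN"
print(z_score(identity_score(query, target), shuffled_scores(query, target)))
```

Shuffling keeps the length and composition of the target the same, which addresses the two effects listed on the previous slide.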
BLAST results
BLAST best hit
>gi|17472322|ref|XP_061555.1| (XM_061555) similar to orphan G protein-coupled receptor GPR26 [Homo sapiens]
Length = 337
Score = 298 bits (762), Expect = 8e-80
Identities = 168/327 (51%)

Query: 1   MGPGEALLAGLLVMVLAVALLSNALVLLCCAYSAELRTRASGVLLVNLSLGHLLLAALDM 60
           M    A LAGLLV  + V+LLSNALVLLC  +SA++R +A  +  +NL+ G+LL   ++M
Sbjct: 1   MNSWNAGLAGLLVGTIGVSLLSNALVLLCLLHSADIRRQAPALFTLNLTCGNLLCTVVNM 60

Query: 61  PFTLLGVMRGRTPSAPGACQVIGFLDTFLASNAALSVAALSADQWLAVGFPLRYAGRLRP 120
           P TL GV+  R P+    C++  FLDTFLA+N+ LS+AALS D+W+AV FPL Y  ++R
Sbjct: 61  PLTLAGVVAQRQPAGDRLCRLAAFLDTFLAANSMLSMAALSIDRWVAVVFPLSYRAKMRL 120

Query: 121 RYAGLLLGCAWGQSLAFSGAALGCSWLGYSSAFASCSLRLPPEPERPRFAAFTATLHAVG 180
           R A L++   W  +L F  AAL  SWLG+   +ASC+L      ER RFA FT   HA+
Sbjct: 121 RDAALMVAYTWLHALTFPAAALALSWLGFHQLYASCTLCSRRPDERLRFAVFTGAFHALS 180

S: score for the pairwise alignment
E value: number of hits you would expect by chance with score S or higher, given the size of the database and the length of the alignment
Good Match: E < 1 x 10^-50
Possible Match: E from 1 x 10^-50 to 1 x 10^-2
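The E value cut-offs can be applied directly to a line of BLAST output. This is a sketch only: the regular expression is written for the "Expect = 8e-80" form shown in this report and is not a general BLAST parser, the label "not significant" for anything above 1 x 10^-2 is an assumption, and the function names are hypothetical.

```python
import re


def parse_expect(line: str):
    """Pull the E value out of a line such as 'Score = ..., Expect = 8e-80'."""
    found = re.search(r"Expect\s*=\s*([0-9.]+(?:e[-+]?\d+)?)", line, re.IGNORECASE)
    return float(found.group(1)) if found else None


def classify_e_value(e_value: float) -> str:
    """Apply the two cut-offs quoted above."""
    if e_value < 1e-50:
        return "good match"
    if e_value <= 1e-2:
        return "possible match"
    return "not significant"  # label assumed; no name is given for this range


line = "Score = 298 bits (762), Expect = 8e-80"
e_value = parse_expect(line)
print(e_value, classify_e_value(e_value))
```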
Needleman & Wunsch
  A H C N I R Q C L C R P M
A 1 0 0 0 0 0 0 0 0 0 0 0 0
I 0 0 0 0 1 0 0 0 0 0 0 0 0
C 0 0 1 0 0 0 0 1 0 1 0 0 0
I 0 0 0 0 1 0 0 0 0 0 0 0 0
N 0 0 0 1 0 0 0 0 0 0 0 0 0
R 0 0 0 0 0 1 0 0 0 0 1 0 0
C 0 0 1 0 0 0 0 1 0 1 0 0 0
K 0 0 0 0 0 0 0 0 0 0 0 0 0
C 0 0 1 0 0 0 0 1 0 1 0 0 0
R 0 0 0 0 0 1 0 0 0 0 1 0 0
H 0 1 0 0 0 0 0 0 0 0 0 0 0
P 0 0 0 0 0 0 0 0 0 0 0 1 0
Needleman & Wunsch Algorithm
• Accumulate the matrix by adding to each cell the highest score in the row or column to the right of and below it
• Find the highest scoring path in the matrix by:
    • starting in the top left corner
    • moving down and across the matrix from cell to cell
    • choosing the highest scoring cell at each move
• The path cannot go back on itself or cross the same row or column twice
Accumulating the Matrix
• Add to the score in the cell the highest score from a cell in the row or column to the right and below
[Figure: cell i,j and the cells it can draw on, labelled i-1,j-1, i-n,j-1 and i-1,j-m]
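The accumulation rule can be sketched as follows, using the two sequences from the identity matrix. Assumptions: identities score 1, there is no gap penalty, Sequence A runs along the columns and Sequence B down the rows, and the matrix is filled from the bottom right corner back to the top left. The function name accumulate is hypothetical.

```python
def accumulate(seq_a: str, seq_b: str) -> list[list[int]]:
    """Accumulated identity matrix; columns follow seq_a, rows follow seq_b."""
    rows, cols = len(seq_b), len(seq_a)
    m = [[1 if a == b else 0 for a in seq_a] for b in seq_b]
    # The last row and last column keep their identity scores.
    for i in range(rows - 2, -1, -1):
        for j in range(cols - 2, -1, -1):
            # Cells in the next row, from the next column rightwards ...
            next_row = m[i + 1][j + 1:]
            # ... and cells in the next column, from the next row downwards.
            next_col = [m[k][j + 1] for k in range(i + 1, rows)]
            m[i][j] += max(next_row + next_col)
    return m


seq_a = "AHCNIRQCLCRPM"
seq_b = "AICINRCKCRHP"
matrix = accumulate(seq_a, seq_b)

print("  " + " ".join(seq_a))
for residue, row in zip(seq_b, matrix):
    print(residue + " " + " ".join(str(value) for value in row))
```

The value in the top left cell, 8, is the score of the best path through the whole matrix.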
Accumulated matrix (Sequence A across, Sequence B down):
  A H C N I R Q C L C R P M
A 8 7 6 6 5 4 4 3 3 2 1 0 0
I 7 7 6 6 6 4 4 3 3 2 1 0 0
C 6 6 7 6 5 4 4 4 3 3 1 0 0
I 6 6 6 5 6 4 4 3 3 2 1 0 0
N 5 5 5 6 5 4 4 3 3 2 1 0 0
R 4 4 4 4 4 5 4 3 3 2 2 0 0
C 3 3 4 3 3 3 3 4 3 3 1 0 0
K 3 3 3 3 3 3 3 3 3 2 1 0 0
C 2 2 3 2 2 2 2 3 2 3 1 0 0
R 2 1 1 1 1 2 1 1 1 1 2 0 0
H 1 2 1 1 1 1 1 1 1 1 1 0 0
P 0 0 0 0 0 0 0 0 0 0 0 1 0
Possible Moves in Finding a Path across the Matrix
• start in the leftmost column or topmost row
• move to the highest scoring cell in the row or column to the right and below
[Figure: cell i,j and the cells that can be reached from it, labelled i-1,j-1, i-n,j-1 and i-1,j-m]
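The moves can be sketched as a greedy walk over the accumulated matrix. Assumptions: identities score 1 with no gap penalty, Sequence A runs along the columns and Sequence B down the rows, and ties are broken in favour of the cell nearest the diagonal (a choice made here, not stated in the slides). The function names are hypothetical.

```python
def accumulate(seq_a: str, seq_b: str) -> list[list[int]]:
    """Accumulated identity matrix; columns follow seq_a, rows follow seq_b."""
    rows, cols = len(seq_b), len(seq_a)
    m = [[1 if a == b else 0 for a in seq_a] for b in seq_b]
    for i in range(rows - 2, -1, -1):
        for j in range(cols - 2, -1, -1):
            next_row = m[i + 1][j + 1:]
            next_col = [m[k][j + 1] for k in range(i + 1, rows)]
            m[i][j] += max(next_row + next_col)
    return m


def best_path(m: list[list[int]]) -> list[tuple[int, int]]:
    """Walk from the best starting cell, always taking the highest scoring move."""
    rows, cols = len(m), len(m[0])
    # Start in the topmost row or leftmost column.
    starts = [(0, j) for j in range(cols)] + [(i, 0) for i in range(1, rows)]
    i, j = max(starts, key=lambda cell: m[cell[0]][cell[1]])
    path = [(i, j)]
    while i + 1 < rows and j + 1 < cols:
        # Allowed moves: the next row (to the right) or the next column (below).
        moves = [(i + 1, c) for c in range(j + 1, cols)]
        moves += [(r, j + 1) for r in range(i + 2, rows)]
        i, j = max(moves, key=lambda cell: m[cell[0]][cell[1]])
        path.append((i, j))
    return path


seq_a = "AHCNIRQCLCRPM"
seq_b = "AICINRCKCRHP"
path = best_path(accumulate(seq_a, seq_b))
matched = "".join(seq_a[j] for i, j in path if seq_a[j] == seq_b[i])
print(path)
print(matched, len(matched))
```

Cells on the path where the residues differ are aligned mismatches. Rows or columns that the path skips correspond to gaps.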
Sequence A and Sequence B aligned along the highest scoring path:
AHCNI-RQCLCR-PM
AIC-INR-CKCRHP-