Transcript PowerPoint
Sequence Alignments and Database Searching
08/20/07
Why compare protein sequences?
Significant sequence similarities
allow associations based upon
known functions.
Protein A of interest to you.
ornithine decarboxylase?
Homology vs. similarity
Possible for proteins to
possess high sequence identity
between segments and not be
homologous
In this example, cytochrome c4, has
reasonably high sequence similarity
with trypsins, yet does not have
common ancestor, nor common
fold.
Also, subtilisin has same spatial
arrangement of active site
residues, but is not related to
trypsins
Extracted from ISMB2000 tutorial,
WR Pearson, U. of Virginia
Homology vs. similarity
Homologous proteins always share a common threedimensional fold, often with common active or binding site.
Proteins that share a common ancestor are homologous.
Proteins that possess >25% identity across entire length
generally will be homologous.
Proteins with <20% identity are not necessarily homologous
Extracted from ISMB2000 tutorial,
WR Pearson, U. of Virginia
Orthologous cyctochrome c isozymes
Homologous sequences
are either: 1) orthologous,
or 2) paralogous
Orthologs - sequence differences arises
from divergence in different species (i.e.
cyctochrome c)
Paralogs - sequence differences arise
after gene duplication within a given
species (i.e. GPCRs, hemoglobins)
Hemoglobins contain both
orthologs and paralogs
•For orthologs - sequence divergence and
evolutionary relationships will agree.
•For paralogs - no necessary linkage
between sequence divergence and
speciation.
We’ve all seen and/or used sequence alignments, but how
are they accomplished?
Sequence searches and alignments using DNA/RNA are usually not as
informative as searches and alignments using protein sequences. However.
DNA/RNA searches are intuitively easier to understand:
AGGCTTAGCAAA........TCAGGGCCTAATGCG
|||||||| |||
||||||||||| |||
AGGCTTAGGAAACTTCCTAGTCAGGGCCTAAAGCG
The above alignment could be scored giving a “1” for each identical nucleotide,
A zero for a mismatch, and a -4 for “opening a “gap” and a -1 for each extension
of the gap. So score = 25 – 11= 14
Protein sequence alignments are much more complicated.
How would this alignment be scored?
ARDTGQEPSSFWNLILMY.........DSCVIVHKKMSLEIRVH
|
| | |
|
||| | | ||
|||
AKKSAEQPTSYWDIVILYESTDKNDSGDSCTLVKKRMSIQLRVH
Unlike nucleotide sequence alignments, which are either identical or
not identical at a given position, protein sequence alignments include
“shades of grey” where one might acknowledge that a T is sort of
equivalent to an S etc. But how equivalent? What number would you
assign to an S-T mismatch? And what about gaps? Since alanine is
a common amino acid, couldn’t the A-A match be by chance? Since
Trp and Cys are uncommon, should those matches be given higher
scores?
Do you see that accurately aligning sequences and accurately
finding related sequences are the same problem?
Databases
Nucleotide: GenBank (NCBI), EMBL, DDBJ
Protein: SwissProt, TrEMBL, GenPept(GenBank)
Huge databases – share much information. Many entries linked to other
databases (e.g. PDB). SwissProt small but well “curated”. NCBI non-redundant
(nr) protein sequence database is very large but sometimes confusing.
These databases can be searched in a number of ways. Can search only
human or metazoan sequences. Can eliminate entries made before a given
Date. Etc.
What do all those numbers mean?
Type of Record
Sample accession format
GenBank/EMBL/DDBJ Nucleotide
One letter followed by five digits, e.g.:
U12345
Two letters followed by six digits, e.g.:
SequenceRecords
AY123456, AF123456
GenPept Sequence Records
(which contain the amino acid translations from
GenBank/EMBL/DDBJ records that have a coding
region feature annotated on them)
Three letters and five digits, e.g.:
Protein Sequence Records from SWISS-PROT
All are six characters:
Character/Format
1 [O,P,Q]
2 [0-9]
3 [A-Z,0-9]
4 [A-Z,0-9]
5 [A-Z,0-9]
6 [0-9]
e.g.:
P12345 and Q9JJS7
Protein Sequence Records from PRF
A series of digits (often six or seven)
followed by a letter, e.g.:
1901178A
RefSeq Nucleotide Sequence Records
Two letters, an underscore bar, and six digits, e.g.:
mRNA records (NM_*):
NM_000492
genomic DNA contigs (NT_*):
NT_000347 complete genome or chromosome
(NC_*):
NT_000907 genomic region (NG_*):
NG000019
NCBI
and PIR
AAA12345
Continued….
RefSeq Protein Sequence Records
Two letters (NP), an underscore bar, and six digits,
e.g.:
NP_000483
RefSeq Model (predicted) Sequence Records from
the Human Genome annotation process
Two letters (XM, XP, or XT), an underscore bar,
and six digits, e.g.:
Protein Structure Records
PDB accessions generally contain one digit
followed by three letters, e.g.:
1TUP
MMDB ID numbers generally contain four digits,
e.g.:
3973
The record for the Tumor Suppressor P53
Complexed With DNA can be retrieved by either
number above
NCBI
XM_000583
GI numbers:
a series of digits that are assigned consecutively by NCBI to each sequence it processes.
Version numbers:
consist of the accession number followed by a dot and a version number.
Nucleotide sequence:
GI: 6995995
VERSION: NM_000492.2
Protein translation:
GI: 6995996
VERSION: NP_000483.2
>gi|897557|gb|AAA98443.1| TIAM1 protein
http://www.ornl.gov/sci/techresources/Human_Genome/posters/chromosome/geneguide.shtml
We’ve got the data, now how do we score/search?
First, we need a way to assign numbers to “shades
of grey” matches.
Genetic code scoring system – This assumes that changes in protein
sequence arise from mutations. If only one point mutation is needed
to change a given AA to another (at a specific position in alignment),
the two amino-acids are more closely related than if two point mutations
were required.
Physicochemical scoring system – a Thr is like a Ser, a Trp is not like
an Ala……
These systems are seldom used because they have problems. Why
try to second guess Nature? Since there are many related sequences
out there, we can look at some (trusted) alignments to SEE which substitutions have occurred and the frequency with which they occur.
Observed substitution scoring schemes
PAM (percent acceptable mutation) matrices are derived from studying
global alignments of well-characterized protein families. Use 1% residue
change (short evolutionary distance) to get PAM1 matrix. Raise this to
250 power to get 250% change (greater evolutionary distance). Therefore a
PAM 30 would be used to analyze more closely related proteins, PAM 400
is used for finding and analyzing distantly related proteins.
Block substitution matrices (BLOSUM) are derived from studying local
alignments (blocks) of sequences from related proteins. In other words,
one might use the portions of aligned sequences from related proteins
that have >62% identity (in the portions or blocks) to derive the BLOSUM
62 scoring matrix. One might use only the blocks that have >80% identity
to derive the BLOSUM 80 matrix.
Amino acid substitution matrices
•Negative scores - unlikely substitutions
Note that for identical matches,
scores vary depending upon
observed frequencies. That is,
rare amino acid (i.e. Trp) that are
not substituted have high scores;
frequently occuring amino acids
(i.e. Ala) are down-weighted
because of the high probability of
aligning by chance.
PAM250 matrix
Extracted from ISMB2000 tutorial,
WR Pearson, U. of Virginia
Amino acid substitution matrices
In general, Blosum62 matrix is more
accurate than PAM.
However, should be aware that
search performance will depend on
underlying matrix
Q. Which are more divergent:
PAM120 or PAM250; Blosum45 or
Blosum62?
PNAS 89, 10915 (1992).
Gap penalties – Intuitively one recognizes that there should be a penalty
for introducing (requiring) a gap during identification/alignment of a given
sequence. But if two sequences are related, the gaps may well be located
In loop regions which are more tolerant of mutational events and probably
have little impact on structure. Therefore, a new gap should be penalized,
but extending an existing gap should be penalized very little.
Filtering – many proteins and nucleotides contain simple repeats or
regions of low sequence complexity. These must be excluded from
searches and alignments. Why?
Significance of a “hit” during a search - More important than an arbitrary
score is an estimation of the likelihood of finding a hit through pure chance.
Ergo the “Expectation value” or E-value. E-values can be as low as 10-70.
Statistics
* - Similarity score
distribution expected
based upon random
chance using given
searched database.
= - distribution of
normalized similarity
scores (observed) for a
search using a proton
ATPase against the same
database.
E-value
So, for sufficiently large databases (so can apply statistics):
E = Kmne-S
m- query length
n - database length
E - expectation value
K - scale factor for search space (database)
- scale factor for scoring system
S - score, dependent on substitution matrix, gappenalties, etc.
Doubling either sequence string doubles number of sequences with a
given expectation value; similarly, double the score and expectation
value decreases exponentially
Expectation value - probability that given score will occur by chance
given the query AND database strings
Statistics
Must account for
increases in similarity
score due to increase
in sequence length
searched.
Extracted from ISMB2000 tutorial,
WR Pearson, U. of Virginia
Basic local alignment search tool (BLAST)
1) Break query up into “words” e.g. ASTGHKDLLV
AST
WORDS
STG
TGH
2) Generate expanded list of words that would match with (i.e. PAM250)
a score of at least T – You’re acknowledging that you may not have any
exact matches with original list of words.
3) Use expanded list of words to search database for exact matches.
4) Extend alignments from where word(s) found exact match.
Heuristic algorithm – Uses guesses. Increases speed without a great
loss of accuracy (BLASTP, FASTA (local Hueristic), S-W local rigorous,
Needleman-Wunsch global, rigorous)
Global versus local alignments
Global scores require alignment of entire sequence length.
Cannot be used to detect relationships between domains in
mosaic proteins.
Local alignments are necessary to detect domains within mosaic
proteins, internal duplications.
Extracted from ISMB2000 tutorial,
WR Pearson, U. of Virginia
Pictorial representation of BLAST algorithm.
Query sequence
Words (they overlap)
Expand list of words
Search database, find exact hits, extend alignments
Report sorted list of hits
BLAST
ATCGCCATGCTTAATTGGGCTT
CATGCTTAATT exact word match
one hit
Nucleotide BLAST looks for exact matches
Protein BLAST requires two hits
two hits
NCBI
GTQITVEDLFYNI
neighborhood words SEI
YYN
FASTA
Instead of breaking up query into words (and then generating a list
of similar words), find all sequences in the database that contain
short sequences that are exact or nearly exact matches for sequences
within the query. Score these and sort. Sort of reverse methodology to
BLAST
Query sequence
Database sequence
Protein database
mouse over
sorted by e values
5 X 10-98
link to entrez
LocusLink
Low complexity filter
Identifying distant homologies
(use several different query sequences)
Also remember - If A is homologous
to B, and B to C, then A should be
homologous to C
Examine output carefully. A lack of
statistical significance doesn’t
necessarily mean a lack of homology!
Extracted from ISMB2000 tutorial,
WR Pearson, U. of Virginia
PSI-BLAST
Very sensitive, but must not include a non-member sequence!
1) Regular BLAST search
2) Sequences above a certain threshold (< specified E-value) are
included. Assumed to be related proteins. This group of sequences
is used to define a “profile” that contains the essence of the “family”.
3) Now with the important sequence positions highlighted, can look
for more distantly related sequences that should still have the essence
of the protein family.
4) Inclusion of more distantly related sequences modifies the profile
further (further defines the essence) and allows for identification of
even more distantly related sequences. Etc.
Note: PSI-BLAST may find and then subsequently lose a homologous
sequence during the iteration process! “Drifting” of the program, would
be the gradual loss of close homologs during the iteration process.
Position specific scoring matrix (PSSM)
(learning from your “hits”)
Weakly conserved serine
Active site serine
Position specific scoring matrix (PSSM)
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
D
G
V
I
S
S
C
N
G
D
S
G
G
P
L
N
C
Q
A
A
0
-2
-1
-3
-2
4
-4
-2
-2
-5
-2
-3
-3
-2
-4
-1
0
0
-1
R N D C Q E G H I L
-2 0 2 -4 2 4 -4 -3 -5 -4
-1 0 -2 -4 -3 -3 6 -4 -5 -5
1 -3 -3 -5 -1 -2 6 -1 -4 -5
3 -3 -4 -6 0 -1 -4 -1 2 -4
-5 0 8 -5 -3 -2 -1 -4 -7 -6
-4 -4 -4 -4 -1 -4 -2 -3 -3 -5
scored
-7 Serine
-6 -7 12
-7 -7 differently
-5 -6 -5 -5
0 in
2 these
-1 -6 two
7 0
-2 0 -6 -4
positions
-3 -3 -4 -4 -4 -5 7 -4 -7 -7
-5 -2 9 -7 -4 -1 -5 -5 -7 -7
-4 -2 -4 -4 -3 -3 -3 -4 -6 -6
-6 -4 -5 -6 -5 -6 8 -6 -8 -7
-6 -4 -5 -6 -5 -6 8 -6 -7 -7
Active site nucleophile
-6 -6 -5 -6 -5 -5 -6 -6 -6 -7
-6 -7 -7 -5 -5 -6 -7 0 -1 6
-6 0 -6 -4 -4 -6 -6 -1 3 0
-4 -5 -5 10 -2 -5 -5 1 -1 -1
1 4 2 -5 2 0 0 0 -4 -2
-1 1 3 -4 -1 1 4 -3 -4 -3
K
0
0
1
6
-4
-4
-7
2
-5
-4
-3
-5
-5
-4
-6
-5
-5
1
-1
M
-2
-2
-5
-2
-6
-4
-5
0
-4
-7
-5
-6
-6
-6
1
4
0
0
-2
F
-6
-3
-6
-5
-7
-5
0
-2
-4
-7
-6
-7
-7
-7
0
-3
-1
0
-2
P
1
-2
-4
-5
-5
-1
-7
-5
-6
-5
-4
-6
-6
9
-6
-6
-4
0
-3
S
0
-2
0
-3
1
4
-4
-1
-3
-4
7
-4
-2
-4
-6
-2
-1
-1
0
T
-1
-1
-2
0
-3
3
-4
-3
-5
-4
-2
-5
-4
-4
-5
-1
0
-1
-2
W
-6
0
-6
-1
-7
-6
-5
-3
-6
-8
-6
-6
-6
-7
-5
-6
-5
-3
-2
Y
-4
-6
-4
-4
-5
-5
0
-4
-6
-7
-5
-7
-7
-7
-4
-1
0
-3
-2
V
-1
-5
-2
0
-6
-3
-4
-3
-6
-7
-5
-7
-7
-6
0
6
0
-4
-3
PSI-BLAST: initial run
>gi|113340|sp|P03958|ADA_MOUSE ADENOSINE DEAMINASE (ADENOSINE AMINOH
MAQTPAFNKPKVELHVHLDGAIKPETILYFGKKRGIALPADTVEELRNIIGMDKPLSLPGFLAKFDYY
VIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVDPMPWNQTEGDVTPDDVVDLVNQGLQ
EQAFGIKVRSILCCMRHQPSWSLEVLELCKKYNQKTVVAMDLAGDETIEGSSLFPGHVEAYEGAVKNG
RTVHAGEVGSPEVVREAVDILKTERVGHGYHTIEDEALYNRLLKENMHFEVCPWSSYLTGAWDPKTTH
VRFKNDKANYSLNTDDPLIFKSTLDTDYQMTKKDMGFTEEEFKRLNINAAKSSFLPEEEKKELLERLY
e value cutoff for PSSM
PSI-BLAST: initial run
NCBI
PSI-BLAST: first PSSM search
Other purine nucleotide metabolizing enzymes not found by ordinary
BLAST
PSI-BLAST: importance of original query
(remember, if A is like B….)
iteration
1
iteration
2
PSI-Blast of
human Tiam1
PSI-BLAST: importance of original query
iteration 1
iteration 2
Ras-binding domains
PSI-Blast of
mouse Tiam2 (~90%
identity with human
Tiam1)
iteration 3
Three-dimensional Position Specific
Scoring Matrix (3D-PSSM)
Extremely sensitive, but the structure of a homolog must exist!
Uses a Library of structures that represent all the known folds*
and a non-redundant sequence database.
Preparing the 3D-PSSM database
1) 1D-PSSM generation. For every entry in the Library of structures,
perform 20 iteration of PSI-BLAST against the NR database. Use
E-value cutoff of 0.0005. Keep intermediate results from 1st through
20th iteration. Recombine these intermediates at the end. Generate
a PSSM (1D-PSSM) from the results.
(A 1D-PSSM for a protein of length L will have dimensions L X 20 )
2) For each Library entry, assign 2ndry structure (Helix, Strand, Coil)
3) Perform 3D structural superposition between each entry in the Library
and all other members of its fold superfamily. Use cutoff criteria. Use
the “residue equivalencies” from the superpositions to augment the 1DPSSMs for Library members in a given superfamily. (Key here is that
structural alignment reduces possibility of miss-alignment of sequences).
4) Use the structural info from the whole Library to assign “solvation
potentials” for each residue type. e.g. Alanines with only 5%
solvent exposure are seen 122 times. The total number of residues
In the Library with 5% exposure is 3246. So the solvation potential
would be 122/3246=0.038 for an alanine with 5% exposure. Do this
for Ala at 10%, 15%, …95%, 100%. Do for all 20 AAs.
Enter query sequence
5) Use Psi-BLAST to generate 1D-PSSM for query (nr database)
6) Perform 2ndry structure prediction for query
7) Align the query sequence against each member of the Library using
a “3 pass” approach:
I) query is aligned against Library member using the 1D-PSSM
of the Library entry
2) query is aligned to the 3D-PSSM of the Library entry
3) Library entry is aligned to the query’s 1D-PSSM
During these procedures the 2ndry structure matching and solvation
potentials are being used but are constant. The highest scoring of the
3 passes is taken as the final result.
So, how good is 3D-PSSM?
Three papers report the initial characterization of PLC-e
(what, no PH domain???)
JBC, 276, 2758 (2001)
EMBO J 20, 743 (2001)
JBC 276, 2752 (2001)
A fourth paper quickly follows….
(PLC-es share architecture of PLC-b isozymes. How’d they do that???)
Wing M. et al.,
JBC 2002
3D-PSSM
PLCe sequence
entered as query
3D-PSSM
PDB entry
(for existing structure)
Expectation value
3-D model
(with sidechains!)
Sequence alignment
(between query and existing structure)
Fold
A very simple HMM for a protein with 4 amino acids
The square boxes are called “match states” – these will emit a amino
acid with a set probability for each AA. Diamond boxes are for insertions
between match states, and the circles are for deletions.
Not only are there emission probabilities for the set and insert states,
there are probabilities for the transitions between states. There are
many possible paths through the Model!
Random transitions through the Model and emissions from the states
are guided by probabilities. All you see at the end is the generated
sequence. The model that generated the sequence is “hidden”. But the
resulting sequence is related to those sequences used to construct the
model. IT IS POSSIBLE TO CALCULATE THE PROBABILITY
THAT A GIVEN SEQUENCE WAS GENERATED BY THE MODEL!
Multiple sequence alignments (MSAs)
In this example, an MSA is used to identify regions of high
sequence conservation presumably reflecting structural and
functional constraints. Useful for delimiting known domains and
potential new functional regions (e.g. the Ras-binding domain in
yellow and the blue box of currently unknown function).
Fun with MSA...
MSA used to locate
functional residues and
domain boundries in
homologs of Dbl-proteins
with known structure (Dbs
and Tiam1).
Red amino acids directly
interact with GTPases.
Blue residues directly
interact with
phosphoinositides.
What you should know
The general approaches to finding related sequences – i.e. the
methodology the terminology, how they differ.
Some of the definitions (e.g. what factors affect the E-value?, what’s
paralogous?)