Sequence Alignments and Database
Searching
August 26, 2011
Biochemistry 201
David Worthylake, 7152 MEB, x5176
Why compare protein sequences?
Significant sequence similarities
allow associations based upon
known functions.
Protein A of interest to you.
ornithine decarboxylase?
Homology vs. similarity
Possible for proteins to possess high
sequence identity/ similarity between
segments and not be homologous
1) Homologous proteins (i.e. having
similar structures) need not
possess high sequence identity /
similarity:
S. griseus trypsin 36%
S. griseus protease A 25%
2) cytochrome c4, has reasonably
high sequence identity/
similarity with trypsins, yet does
not have common ancestor, nor
common fold.
3) subtilisin has same spatial
arrangement of active site
residues, but is not related to
trypsins
Extracted from ISMB2000 tutorial,
WR Pearson, U. of Virginia
Homology vs. similarity
Homologous proteins always share a common three-dimensional fold, often with a common active or binding site.
Proteins that share a common ancestor are homologous.
Proteins that possess >25% identity across their entire length
generally will be homologous (but there can be exceptions).
Proteins with <20% identity may still be
homologous.
Orthologous cytochrome c isozymes
Homologous sequences
are either: 1) orthologous,
or 2) paralogous
Orthologs - sequence differences arise
from divergence in different species (e.g.
cytochrome c)
Paralogs - sequence differences arise
after gene duplication within a given
species (e.g. GPCRs, hemoglobins)
Hemoglobins contain both
orthologs and paralogs
•For orthologs - sequence divergence and
evolutionary relationships will agree.
•For paralogs - no necessary linkage between
sequence divergence and speciation.
We’ve all seen and/or used sequence alignments, but how
are they accomplished?
Sequence searches and alignments using DNA/RNA are usually not as
informative as searches and alignments using protein sequences. However,
DNA/RNA searches are intuitively easier to understand:
AGGCTTAGCAAA........TCAGGGCCTAATGCG
|||||||| |||        ||||||||||| |||
AGGCTTAGGAAACTTCCTAGTCAGGGCCTAAAGCG
The above alignment could be scored by giving a “1” for each identical nucleotide,
a zero for a mismatch, a -4 for opening a gap, and a -1 for each extension
of the gap. The 25 identities score 25; the 8-position gap costs 4 for opening plus 7 for extensions. So score = 25 – 11 = 14
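The arithmetic above can be checked with a few lines of Python. This is only a sketch of the scoring scheme described on the slide; the function name and the convention that the gap-opening penalty covers the first gapped position are assumptions chosen to reproduce the slide's total.

```python
def score_alignment(top, bottom, match=1, mismatch=0,
                    gap_open=-4, gap_extend=-1, gap_char="."):
    """Score two pre-aligned, equal-length strings column by column."""
    score = 0
    in_gap = False
    for x, y in zip(top, bottom):
        if x == gap_char or y == gap_char:
            # First gapped column pays the opening penalty, later ones the extension.
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if x == y else mismatch
            in_gap = False
    return score

top = "AGGCTTAGCAAA........TCAGGGCCTAATGCG"
bottom = "AGGCTTAGGAAACTTCCTAGTCAGGGCCTAAAGCG"
print(score_alignment(top, bottom))  # 25 identities - 11 gap penalty = 14
```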
Protein sequence alignments are much more complicated.
How would this alignment be scored?
ARDTGQEPSSFWNLILMY.........DSCVIVHKKMSLEIRVH
|      | | |     |         |||  | | ||   |||
AKKSAEQPTSYWDIVILYESTDKNDSGDSCTLVKKRMSIQLRVH
Unlike nucleotide sequence alignments, which are either identical or
not identical at a given position, protein sequence alignments include
“shades of grey” where one might acknowledge that a T is sort of
equivalent to an S etc. But how equivalent? What number would you
assign to an S-T mismatch? And what about gaps? Since alanine is
a common amino acid, couldn’t the A-A match be by chance? Since
Trp and Cys are uncommon, should those matches be given higher
scores?
Do you see that accurately aligning sequences and accurately
finding related sequences are the same problem?
Global versus local alignments
Local: BLAST
Global: Needleman-Wunsch
Global scores require alignment of entire sequence length.
Cannot be used to detect relationships between domains in
mosaic proteins.
Local alignments are necessary to detect domains within mosaic
proteins, internal duplications.
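A minimal sketch of why global scores fail on mosaic proteins. It computes only the best score (no traceback), uses a hypothetical match/mismatch scheme with a linear gap penalty rather than a real substitution matrix, and the sequences are invented for illustration.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score by dynamic programming.

    local=False: global (Needleman-Wunsch style), whole of a against whole of b.
    local=True:  local (Smith-Waterman style), best-scoring pair of segments.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment must pay for leading gaps.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]

domain = "HEAGAWGHEE"                      # hypothetical domain
mosaic = "PPPPPPPP" + domain + "KKKKKKKK"  # same domain inside a longer protein
print(align_score(domain, mosaic, local=True))   # 10: the domain is found intact
print(align_score(domain, mosaic, local=False))  # -22: flanking regions are penalized
```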
Databases
Nucleotide: GenBank (NCBI), EMBL, DDBJ (Japan)
Protein: SwissProt, TrEMBL, GenPept(GenBank)
Huge databases – share much information. Many entries linked to other
databases (e.g. PDB). SwissProt small but well “curated”. NCBI non-redundant
(nr) protein sequence database is very large but sometimes confusing.
These databases can be searched in a number of ways. Can search only
human or metazoan sequences. Can eliminate entries made before a given
date, etc.
We’ve got lots of sequences, now how do we
score/search? First, we need a way to assign numbers
to “shades of grey” matches.
Genetic code scoring system – This assumes that changes in protein
sequence arise from mutations. If only one point mutation is needed
to change a given AA to another (at a specific position in alignment),
the two amino-acids are more closely related than if two point mutations
were required.
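As a rough illustration of the genetic code scoring idea, the sketch below counts the minimum number of point mutations separating any codon of one amino acid from any codon of another, using the standard genetic code. The function names are made up for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def codons_for(aa):
    """All codons that encode the given one-letter amino acid."""
    return [codon for codon, residue in CODON_TABLE.items() if residue == aa]

def min_point_mutations(aa1, aa2):
    """Smallest number of base changes that converts some aa1 codon to some aa2 codon."""
    return min(sum(x != y for x, y in zip(c1, c2))
               for c1 in codons_for(aa1) for c2 in codons_for(aa2))

print(min_point_mutations("S", "T"))  # 1 (e.g. TCT -> ACT)
print(min_point_mutations("W", "A"))  # 2 (TGG -> GCG)
print(min_point_mutations("F", "K"))  # 3
```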
Physicochemical scoring system – a Thr is like a Ser, a Trp is not like
an Ala……
These systems are seldom used because they have problems. Why
try to second guess Nature? Since there are many related sequences
out there, we can look at some (trusted) alignments to SEE which substitutions have occurred and the frequency with which they occur.
PAM (Point Accepted Mutation) matrices
• Are derived from studying global alignments of well-characterized protein families.
• PAM1 = only 1% of residues have changed (i.e. short evolutionary distance)
• Raise this to the 250th power to get 250% change between two sequences (greater
evolutionary distance), or about 20% sequence identity.
• Therefore,
a PAM 30 would be used to analyze more closely related proteins,
a PAM 400 is used for finding and analyzing very distantly related proteins.
• $\mathrm{PAM}_x = (\mathrm{PAM}_1)^x$
(Dayhoff, Atlas of Protein Sequence and Structure, vol. 5, suppl 3, p 345-352)
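The "raise PAM1 to a power" step is matrix exponentiation of a mutation-probability matrix. The sketch below uses a toy 3-letter alphabet with invented probabilities (not Dayhoff's 20x20 data) just to show how the chance of a residue remaining unchanged falls as the exponent grows.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = [row[:] for row in m]
    while power:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result

# Toy "PAM1": each residue has a 99% chance of staying the same (assumed values).
toy_pam1 = [[0.990, 0.005, 0.005],
            [0.005, 0.990, 0.005],
            [0.005, 0.005, 0.990]]
toy_pam30 = mat_pow(toy_pam1, 30)
toy_pam250 = mat_pow(toy_pam1, 250)
print(round(toy_pam30[0][0], 3), round(toy_pam250[0][0], 3))
```

Each row still sums to 1 after exponentiation, while the diagonal drifts toward the background frequency (1/3 in this toy alphabet).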
Block substitution matrices (BLOSUM)
Are derived from studying local alignments (blocks) of sequences from related proteins
that share no more than X% identity. (Henikoff & Henikoff, PNAS ‘92, 89, p10915-10919)
1) In other words, one might use the portions of aligned sequences from related
proteins that have no more than 62% identity (in the portions or blocks) to derive
the BLOSUM 62 scoring matrix.
2) One might use only the blocks that have <80% identity to derive the BLOSUM 80
matrix.
3) BLOSUM and PAM substitution matrices have the opposite effects:
a) The higher the number of the BLOSUM matrix (BLOSUM X), the more closely
related proteins you are looking for.
b) The higher the number of the PAM matrix (PAM X), the more distantly related
proteins you are looking for.
Amino acid substitution matrices
•Negative scores - unlikely substitutions
Note that for identical matches,
scores vary depending upon
observed frequencies. That is,
rare amino acids (e.g. Trp) that are
not substituted have high scores;
frequently occurring amino acids
(e.g. Ala) are down-weighted
because of the high probability of
aligning by chance.
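One common way to turn observed frequencies into scores is a log-odds ratio: how often a pair is seen in trusted alignments versus how often it would pair up by chance. The slide does not give the formula, so the half-bit scaling and the frequencies below are assumptions, chosen only to be roughly realistic.

```python
import math

def log_odds_half_bits(pair_freq, freq_a, freq_b):
    """Rounded score 2*log2(observed pair frequency / chance pair frequency)."""
    return round(2 * math.log2(pair_freq / (freq_a * freq_b)))

# Illustrative (assumed) frequencies: Trp is rare, Ala is common.
trp_trp = log_odds_half_bits(pair_freq=0.0065, freq_a=0.013, freq_b=0.013)
ala_ala = log_odds_half_bits(pair_freq=0.0215, freq_a=0.074, freq_b=0.074)
print(trp_trp, ala_ala)  # 11 4
```

A conserved Trp is far more surprising than a conserved Ala, so it earns the larger score.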
PAM250 matrix
Gap penalties – Intuitively one recognizes that there should be a penalty
for introducing (requiring) a gap during identification/alignment of a given
sequence. But if two sequences are related, the gaps may well be located
in loop regions which are more tolerant of mutational events and probably
have little impact on structure. Therefore, a new gap should be penalized,
but extending an existing gap should be penalized very little.
Filtering – many proteins and nucleotides contain simple repeats or
regions of low sequence complexity. These must be excluded from
searches and alignments. Why?
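Low-complexity stretches tend to match many unrelated sequences and so inflate scores. One simple way to flag them is Shannon entropy over a sliding window; this is only a sketch with an arbitrary window size and threshold, not the filter that BLAST itself applies.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Entropy in bits of the residue composition of a string."""
    total = len(window)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(window).values())

def mask_low_complexity(seq, window=12, threshold=2.2, mask_char="X"):
    """Replace every residue lying in a low-entropy window with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for k in range(start, start + window):
                masked[k] = mask_char
    return "".join(masked)

# Hypothetical protein with a poly-Q repeat in the middle.
seq = "MKTAYIAKQR" + "Q" * 14 + "DLVHGESWNP"
print(mask_low_complexity(seq))
```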
Significance of a “hit” during a search - More important than an arbitrary
score is an estimation of the likelihood of finding a hit through pure chance.
Ergo the “Expectation value” or E-value. E-values can be as low as 0 for
an identical (long) match (e.g. a 250 AA protein finding itself in a search).
E-value
So, for sufficiently large databases (so one can apply statistics):
E = Kmne-S
m- query length
n - database length
E - expectation value
K - scale factor for query sequence (AA composition)
 - scale factor for scoring system (e.g. PAM250)
S - score, dependent on substitution matrix, gappenalties, etc.
Doubling either m or n doubles the number of chance hits expected at a
given score; similarly, double the score and the expectation
value decreases exponentially
Expectation value - number of hits with at least the given score expected to occur by chance
given the query AND database “strings”
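A small numerical sketch of the E-value formula and its scaling behaviour. The default K and λ are assumed, illustrative values; in practice they depend on the scoring matrix and gap penalties in use.

```python
import math

def expect_value(score, query_len, db_len, K=0.041, lam=0.267):
    """E = K * m * n * exp(-lambda * S); K and lam are assumed example values."""
    return K * query_len * db_len * math.exp(-lam * score)

m, n = 250, 1_000_000
e_base = expect_value(100, m, n)
e_double_db = expect_value(100, m, 2 * n)     # doubling n doubles E
e_double_score = expect_value(200, m, n)      # doubling S shrinks E exponentially
print(e_base, e_double_db, e_double_score)
```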
Removing length bias from scoring statistics
• Must account for
increases in similarity
score due to increase in
sequence length
searched.
• Scaling the sequence
length allows the
detection of distantly related sequences.
• solids = individual
sequences
• opens = average score
Basic local alignment search tool (BLAST)
1) Break query up into “words”, e.g. ASTGHKDLLV
WORDS: AST, STG, TGH, ...
2) Generate expanded list of words that would match (using e.g. PAM250) with
a score of at least T – You’re acknowledging that you may not have any
exact matches with original list of words.
3) Use expanded list of words to search database for exact matches.
4) Extend alignments from where word(s) found exact match.
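Steps 1 to 3 can be sketched in a few lines. The scoring function here is a made-up stand-in for a real substitution matrix such as PAM250, and the threshold T, residue groups and subject sequence are all hypothetical; step 4 (extension) is left out.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Hypothetical "similar residue" groups standing in for a substitution matrix.
SIMILAR_GROUPS = ["ST", "DE", "KR", "ILVM", "FYW", "NQ"]

def toy_score(a, b):
    if a == b:
        return 5
    if any(a in g and b in g for g in SIMILAR_GROUPS):
        return 2
    return -2

def words(seq, w=3):
    """Step 1: all overlapping words of length w."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]

def neighborhood(word, threshold):
    """Step 2: every word scoring at least `threshold` against `word`."""
    return ["".join(cand) for cand in product(AMINO_ACIDS, repeat=len(word))
            if sum(toy_score(a, b) for a, b in zip(word, cand)) >= threshold]

def find_seeds(query, subject, threshold=11, w=3):
    """Step 3: exact matches between neighborhood words and the subject."""
    index = {}
    for qpos, word in enumerate(words(query, w)):
        for neighbor in neighborhood(word, threshold):
            index.setdefault(neighbor, []).append(qpos)
    return [(qpos, spos)
            for spos, sword in enumerate(words(subject, w))
            for qpos in index.get(sword, [])]

print(words("ASTGHKDLLV"))
print(neighborhood("AST", 11))                   # ['ASS', 'AST', 'ATT']
print(find_seeds("ASTGHKDLLV", "QQASSGHKQQ"))    # (query pos, subject pos) seeds
```

Note that the subject contains ASS rather than AST, so the first seed is found only because of the expanded word list.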
Heuristic algorithm – Uses guesses. Increases speed without a great
loss of accuracy (BLASTP, FASTA (local, heuristic); S-W (local, rigorous);
Needleman-Wunsch (global, rigorous))
Pictorial representation of BLASTp algorithm
(Basic Local Alignment Search Tool proteins).
Query sequence
Words (they overlap)
Expand list of words (each word (left) has “similar” words)
Search database, find hits, extend alignments
Report sorted list of hits
BLAST
Nucleotide BLAST looks for exact matches (one hit):
ATCGCCATGCTTAATTGGGCTT
     CATGCTTAATT   exact word match
Protein BLAST (BLASTp) requires two hits:
GTQITVEDLFYNI
neighborhood words: SQI, YYN
(Figure: NCBI)
FASTA
Instead of breaking up query into words (and then generating a list
of similar words), find all sequences in the database that contain
short sequences that are exact or nearly exact matches for sequences
within the query. Score these and sort. Sort of reverse methodology to
BLAST
Query sequence
Database sequence
(Screenshot annotations: Protein database; mouse over; sorted by E-values, 5 x e-98; link to Entrez Gene)
$S' = \dfrac{\lambda S - \ln K}{\ln 2}$
$E = mn\,2^{-S'} = Kmn\,e^{-\lambda S}$
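The bit-score form and the raw-score form of the E-value are the same quantity. A short derivation, using only the symbols already defined (m, n, K, λ, S), makes the step explicit:

```latex
\begin{align*}
S' &= \frac{\lambda S - \ln K}{\ln 2} \\
2^{-S'} &= e^{-S' \ln 2} = e^{-(\lambda S - \ln K)} = K\,e^{-\lambda S} \\
E &= mn\,2^{-S'} = Kmn\,e^{-\lambda S}
\end{align*}
```

So the bit score S' absorbs K and λ, which is why bit scores from searches with different scoring systems can be compared directly.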
Identifying distant homologies
(use several different query sequences)
Also remember - If A is homologous
to B, and B to C, then A should be
homologous to C
Examine output carefully. A lack of
statistical significance doesn’t
necessarily mean a lack of homology!
PSI-BLASTp
Very sensitive, but must not include a non-member sequence!
1) Regular BLASTp search
2) Sequences above a certain threshold (< specified E-value) are
included. Assumed to be related proteins. This group of sequences
is used to define a “profile” that contains the sequence “essence” of
the protein family.
3) Now with the important sequence positions highlighted, can look
for more distantly related sequences that should still have the “essence”
of the protein family.
4) Inclusion of more distantly related sequences modifies the profile
further (further defines the essence) and allows for identification of
even more distantly related sequences. Etc.
Note: PSI-BLASTp may find and then subsequently lose a homologous
sequence during the iteration process! “Drifting” of the program would
be the gradual loss of distant homologs during the iteration process.
PSI-BLAST: initial run
>gi|113340|sp|P03958|ADA_MOUSE ADENOSINE DEAMINASE (ADENOSINE AMINOH
MAQTPAFNKPKVELHVHLDGAIKPETILYFGKKRGIALPADTVEELRNIIGMDKPLSLPGFLAKFDYY
VIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVDPMPWNQTEGDVTPDDVVDLVNQGLQ
EQAFGIKVRSILCCMRHQPSWSLEVLELCKKYNQKTVVAMDLAGDETIEGSSLFPGHVEAYEGAVKNG
RTVHAGEVGSPEVVREAVDILKTERVGHGYHTIEDEALYNRLLKENMHFEVCPWSSYLTGAWDPKTTH
VRFKNDKANYSLNTDDPLIFKSTLDTDYQMTKKDMGFTEEEFKRLNINAAKSSFLPEEEKKELLERLY
(NCBI screenshot; annotation: E-value cutoff for inclusion)
PSI-BLAST: first PSSM search
Other purine nucleotide metabolizing enzymes not found by ordinary
BLAST
Note: These E-values are different from
usual BLASTp because of position-specific scoring matrix (later).
PSI-BLAST: importance of original query
(remember, if A is like B….)
PSI-BLAST of human Tiam1: iteration 1, iteration 2
PSI-BLAST: importance of original query
PSI-BLAST of mouse Tiam2 (~90% identity with human Tiam1): iteration 1, iteration 2, iteration 3
Ras-binding domains
Position specific scoring matrix (PSSM)
(learning from your “hits”)
Weakly conserved serine
Active site serine
NCBI
Position specific scoring matrix (PSSM)
Pos AA   A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
206 D    0  -2   0   2  -4   2   4  -4  -3  -5  -4   0  -2  -6   1   0  -1  -6  -4  -1
207 G   -2  -1   0  -2  -4  -3  -3   6  -4  -5  -5   0  -2  -3  -2  -2  -1   0  -6  -5
208 V   -1   1  -3  -3  -5  -1  -2   6  -1  -4  -5   1  -5  -6  -4   0  -2  -6  -4  -2
209 I   -3   3  -3  -4  -6   0  -1  -4  -1   2  -4   6  -2  -5  -5  -3   0  -1  -4   0
210 S   -2  -5   0   8  -5  -3  -2  -1  -4  -7  -6  -4  -6  -7  -5   1  -3  -7  -5  -6
211 S    4  -4  -4  -4  -4  -1  -4  -2  -3  -3  -5  -4  -4  -5  -1   4   3  -6  -5  -3
212 C   -4  -7  -6  -7  12  -7  -7  -5  -6  -5  -5  -7  -5   0  -7  -4  -4  -5   0  -4
213 N   -2   0   2  -1  -6   7   0  -2   0  -6  -4   2   0  -2  -5  -1  -3  -3  -4  -3
214 G   -2  -3  -3  -4  -4  -4  -5   7  -4  -7  -7  -5  -4  -4  -6  -3  -5  -6  -6  -6
215 D   -5  -5  -2   9  -7  -4  -1  -5  -5  -7  -7  -4  -7  -7  -5  -4  -4  -8  -7  -7
216 S   -2  -4  -2  -4  -4  -3  -3  -3  -4  -6  -6  -3  -5  -6  -4   7  -2  -6  -5  -5
217 G   -3  -6  -4  -5  -6  -5  -6   8  -6  -8  -7  -5  -6  -7  -6  -4  -5  -6  -7  -7
218 G   -3  -6  -4  -5  -6  -5  -6   8  -6  -7  -7  -5  -6  -7  -6  -2  -4  -6  -7  -7
219 P   -2  -6  -6  -5  -6  -5  -5  -6  -6  -6  -7  -4  -6  -7   9  -4  -4  -7  -7  -6
220 L   -4  -6  -7  -7  -5  -5  -6  -7   0  -1   6  -6   1   0  -6  -6  -5  -5  -4   0
221 N   -1  -6   0  -6  -4  -4  -6  -6  -1   3   0  -5   4  -3  -6  -2  -1  -6  -1   6
222 C    0  -4  -5  -5  10  -2  -5  -5   1  -1  -1  -5   0  -1  -4  -1   0  -5   0   0
223 Q    0   1   4   2  -5   2   0   0   0  -4  -2   1   0   0   0  -1  -1  -3  -3  -4
224 A   -1  -1   1   3  -4  -1   1   4  -3  -4  -3  -1  -2  -2  -3   0  -2  -2  -2  -3
Figure annotations: “Serine scored differently in these two positions”; “Active site nucleophile”.
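To see how the same residue can receive different scores at different positions, the sketch below builds a tiny PSSM from a hypothetical set of aligned hits. It uses simple pseudocounts, a uniform background and half-bit log-odds, which is a simplification and not PSI-BLAST's actual weighting scheme.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned, pseudocount=1.0):
    """One {residue: score} dict per alignment column (uniform background assumed)."""
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({aa: round(2 * math.log2(((counts[aa] + pseudocount) / total)
                                             / background))
                     for aa in AMINO_ACIDS})
    return pssm

def score_sequence(pssm, seq):
    """Sum the position-specific scores for an ungapped sequence."""
    return sum(row[aa] for row, aa in zip(pssm, seq))

# Hypothetical hits: column 0 has a weakly conserved Ser, column 3 an invariant Ser.
hits = ["SGDSGG", "AGDSGG", "SGDSGG", "TGDSGG", "AGDSGG", "SGDSGG"]
pssm = build_pssm(hits)
print(pssm[0]["S"], pssm[3]["S"])          # weakly vs strongly conserved serine
print(score_sequence(pssm, "AGDSGG"), score_sequence(pssm, "SGDAGG"))
```

Substituting the variable position costs little, while substituting the invariant serine is penalized heavily, which a single global matrix such as PAM250 cannot express.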
Multiple sequence alignments (MSAs)
In this example, an MSA is used to identify regions of high
sequence conservation presumably reflecting structural and
functional constraints. Useful for delimiting known domains and
potential new functional regions (e.g. the Ras-binding domain in
yellow and the blue box of currently unknown function).
Fun with MSA...
MSA used to locate
functional residues and
domain boundaries in
homologs of Dbl-proteins
with known structure (Dbs
and Tiam1).
Red amino acids directly
interact with GTPases.
Blue residues directly
interact with
phosphoinositides.
Phyre uses a 3-dimensional Position Sensitive Scoring Matrix!
Hidden Markov Models – devices for generating folds
HMM is created using some examples and general rules.
The examples are defined folds.
For instance, 60 PH domains might be used to create
an HMM for PH domains.
An HMM can assign a probability that it generated a given
sequence (e.g. does this sequence represent a PH domain?)
A very simple HMM for a protein with 4 amino acids
The square boxes are called “match states” – these will emit an amino
acid with a set probability for each AA. Diamond boxes are for insertions
between match states, and the circles are for deletions of match states.
There are probabilities associated with all of the arrows. There are
many possible paths through the Model! These are the “rules”
learned from the examples (e.g. PH domains you used).
Random transitions through the Model and emissions from the states
are guided by probabilities. All you see at the end is the generated
sequence. The model that generated the sequence is “hidden”. But the
resulting sequence is related to those sequences used to construct the
model. Again, IT IS POSSIBLE TO CALCULATE THE PROBABILITY
THAT A GIVEN SEQUENCE WAS GENERATED BY THE MODEL!
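The probability that a model generated a sequence is usually computed with the forward algorithm, which sums over every possible hidden path. The sketch below uses a generic two-state toy HMM over a two-letter alphabet with invented probabilities; it is not the match/insert/delete architecture of a real profile HMM, only an illustration of the calculation.

```python
from itertools import product

def forward_probability(seq, start, trans, emit):
    """P(seq | model), summed over all hidden state paths (forward algorithm)."""
    states = list(start)
    alpha = {s: start[s] * emit[s][seq[0]] for s in states}
    for symbol in seq[1:]:
        alpha = {t: emit[t][symbol] * sum(alpha[s] * trans[s][t] for s in states)
                 for t in states}
    return sum(alpha.values())

# Toy model (assumed numbers): "M" strongly prefers L, "I" emits L or S equally.
start = {"M": 0.9, "I": 0.1}
trans = {"M": {"M": 0.8, "I": 0.2}, "I": {"M": 0.6, "I": 0.4}}
emit = {"M": {"L": 0.9, "S": 0.1}, "I": {"L": 0.5, "S": 0.5}}

p_like = forward_probability("LLLL", start, trans, emit)
p_unlike = forward_probability("SSSS", start, trans, emit)
total = sum(forward_probability("".join(s), start, trans, emit)
            for s in product("LS", repeat=4))
print(p_like, p_unlike, total)
```

Sequences resembling what the model was built to emit get the higher probability, and the probabilities of all length-4 sequences sum to 1.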
What you should know
Homology
If two proteins are homologous, they have a common fold and
a common ancestor
If two proteins have >25% identity across their entire length, they are likely to be
homologs. However, sometimes true homologs have quite low sequence identity!
Orthologs
Homologous (and equivalent) proteins from different species.
Arise from speciation.
Paralogs
Homologous (and equivalent) proteins found in same species.
Divergence of sequences NOT from speciation (gene duplication).
Alignments
How to score?
Minimum # of mutations?, Physicochemical properties (as
perceived by us)?, Or learn from nature?
Scoring schemes PAM, BLOSUM
Algorithms - BLASTp, FASTA, Smith-Waterman, Needleman-Wunsch
BLAST vs. PSI-BLAST
E values
What it means in words
$E = Kmn\,e^{-\lambda S}$
Alignment algorithms
Why use local alignment algorithm?
Why use global alignment?
BLAST (Local, heuristic)
FASTA (Local, heuristic)
Smith-Waterman (Local, rigorous)
Needleman-Wunsch (Global, rigorous)