Transcript Similarity
Sequence analysis
June 17, 2003
Learning objectives: Review amino acid
structures. Understand sliding window programs.
Understand the difference between identity, similarity,
and homology. Understand the difference between
global alignment and local alignment.
Workshop: Perform a sliding window analysis to compute
%GC as a function of position in the sequence.
Sliding window (1)
The window size is the number of characters
you look at at any one time.
Window size 4: GCATATGCGCATATCCCGTCAATACCA
Window size 5: GCATATGCGCATATCCCGTCAATACCA
Window size 6: GCATATGCGCATATCCCGTCAATACCA
Sliding window (2)
A "window" can be defined as a span of a certain
number of residues (nucleotides or amino acids). One
calculates some value for the residues in that fragment.
Once the calculation is completed, the program
analyzes the next window of residues and this process
repeats itself until the end of the sequence is reached.
A simple example is to calculate the %GC
content within a window. Then move the
window one nucleotide and repeat the
calculation.
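The %GC calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the workshop's own program: the function name gc_percent_windows and the choice to report 1-based start positions are assumptions made for the example.

```python
def gc_percent_windows(seq, window):
    """Return (start_position, %GC) for each window, sliding one nucleotide at a time.

    Positions are 1-based (an assumption made for this sketch).
    """
    seq = seq.upper()
    if window <= 0 or window > len(seq):
        raise ValueError("window must be between 1 and the sequence length")
    results = []
    for start in range(len(seq) - window + 1):
        fragment = seq[start:start + window]
        gc_count = fragment.count("G") + fragment.count("C")
        results.append((start + 1, 100.0 * gc_count / window))
    return results


if __name__ == "__main__":
    example = "GCATATGCGCATATCCCGTCAATACCA"  # sequence from the earlier slide
    for position, percent in gc_percent_windows(example, 6):
        print(position, round(percent, 1))
```

Plotting the second value against the first gives %GC as a function of position in the sequence.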
Sliding window (3)
If the window is too small it is difficult to detect the trend
of the measurement. If too large you could miss meaningful
data.
[Figure: two plots of %GC against position in the sequence, one computed with a small window and one with a large window.]
Sliding window (4)
Four levels of protein structure
1) Primary
Linear sequence: AGHIPLLQ
2) Secondary
Initial folding patterns:
AGHIPLLQ
aaaTTTbb
3) Tertiary
Complex folding patterns
4) Quaternary
Interactions between polypeptides
Other classification schemes
Two major types:
Alpha Helical Regions
Beta Sheet Regions
Other classification schemes:
Turns
Transmembrane regions
Internal regions
External regions
Antigenic regions
Chou-Fasman Rules (Mathews, Van Holde, Ahern)
Amino Acid   a-Helix   b-Sheet   Turn
Ala          1.29      0.90      0.78
Cys          1.11      0.74      0.80
Leu          1.30      1.02      0.59
Met          1.47      0.97      0.39
Glu          1.44      0.75      1.00
Gln          1.27      0.80      0.97
His          1.22      1.08      0.69
Lys          1.23      0.77      0.96
Val          0.91      1.49      0.47
Ile          0.97      1.45      0.51
Phe          1.07      1.32      0.58
Tyr          0.72      1.25      1.05
Trp          0.99      1.14      0.75
Thr          0.82      1.21      1.03
Gly          0.56      0.92      1.64
Ser          0.82      0.95      1.33
Asp          1.04      0.72      1.41
Asn          0.90      0.76      1.23
Pro          0.52      0.64      1.91
Arg          0.96      0.99      0.88
Row groups, in order: Favors a-Helix (Ala through Lys), Favors b-Sheet (Val through Thr), Favors Turns (Gly through Pro).
Chou-Fasman
First widely used procedure
If propensity in a window of six residues (for a
helix) is above a certain threshold the helix is
chosen as secondary structure.
If propensity in a window of five residues (for a
beta strand) is above a certain threshold then beta
strand is chosen.
Each classification is extended until the average
propensity in a 4 residue window falls below a
value.
Output: helix, strand, or turn.
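The helix nucleation step can be sketched as a sliding window over the a-helix propensities from the table. This is a simplified illustration only: the slide does not give the threshold, so the default of 1.05 is a placeholder assumption, the function name is hypothetical, and the extension step with the 4-residue window is not implemented.

```python
# a-Helix propensities from the table above, keyed by one-letter amino acid code.
HELIX_PROPENSITY = {
    "A": 1.29, "C": 1.11, "L": 1.30, "M": 1.47, "E": 1.44,
    "Q": 1.27, "H": 1.22, "K": 1.23, "V": 0.91, "I": 0.97,
    "F": 1.07, "Y": 0.72, "W": 0.99, "T": 0.82, "G": 0.56,
    "S": 0.82, "D": 1.04, "N": 0.90, "P": 0.52, "R": 0.96,
}


def helix_nucleation_windows(protein, window=6, threshold=1.05):
    """Return 1-based start positions of windows whose average helix
    propensity exceeds the threshold.

    threshold=1.05 is an illustrative placeholder, not a value from the slide.
    """
    protein = protein.upper()
    starts = []
    for start in range(len(protein) - window + 1):
        fragment = protein[start:start + window]
        average = sum(HELIX_PROPENSITY[aa] for aa in fragment) / window
        if average > threshold:
            starts.append(start + 1)
    return starts


if __name__ == "__main__":
    print(helix_nucleation_windows("AAAAAAGGGGGG"))
```

The same loop with the b-sheet column and a window of five residues would give candidate strand nucleation sites.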
Chou & Fasman structure
prediction
Chou & Fasman [Biochemistry 13(2):222-245 (1974)] studied a number of proteins whose structures
were known and determined which stretches of amino acids could serve to form an a-helix or a b-sheet.
These amino acids are called helix formers or sheet formers, and they differ in how strongly they favor
forming their structures. Once these nucleation sites are determined, adjacent amino acids are examined to
see if the structure can be extended in either or both directions. Values for some amino acids allow extension;
those for other amino acids do not. Some amino acids are categorized as helix breakers or sheet breakers, and a
string of these will terminate the current structure. This method is about 60-65% accurate.
Kyte-Doolittle Hydropathy
Another sliding window routine. Kyte and Doolittle [J. Mol. Biol. 157:105-132 (1982)] assigned each
amino acid a value on a "hydropathy scale" based on empirical observations.
Amino Acid   Hydropathy value
A             1.8
C             2.5
D            -3.5
E            -3.5
F             2.8
G            -0.4
H            -3.2
I             4.5
K            -3.9
L             3.8
M             1.9
N            -3.5
P            -1.6
Q            -3.5
R            -4.5
S            -0.8
T            -0.7
V             4.2
W            -0.9
Y            -1.3
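A hydropathy profile can be sketched by averaging these values over a sliding window. The slide does not state a window size, so the default of 9 below is an assumption, as are the function name and the 1-based positions.

```python
# Kyte-Doolittle hydropathy values from the table above.
HYDROPATHY = {
    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8,
    "G": -0.4, "H": -3.2, "I": 4.5, "K": -3.9, "L": 3.8,
    "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5, "R": -4.5,
    "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3,
}


def hydropathy_profile(protein, window=9):
    """Return (start_position, mean hydropathy) for each window.

    window=9 is an assumed default; positions are 1-based.
    """
    protein = protein.upper()
    profile = []
    for start in range(len(protein) - window + 1):
        fragment = protein[start:start + window]
        mean_value = sum(HYDROPATHY[aa] for aa in fragment) / window
        profile.append((start + 1, mean_value))
    return profile


if __name__ == "__main__":
    for position, value in hydropathy_profile("MKKLLIVVAAGGDDEE", window=5):
        print(position, round(value, 2))
```

Stretches with high positive averages are hydrophobic; strongly negative averages indicate hydrophilic stretches.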
Purpose of finding differences and similarities
of amino acids in two proteins.
Infer structural information
Infer functional information
Infer evolutionary relationships
Evolutionary Basis of Sequence
Alignment
1. Similarity: Quantity that relates how much
two amino acid sequences are alike.
2. Identity: Quantity that describes how much
two sequences are alike in the strictest terms.
3. Homology: a conclusion drawn from data
suggesting that two genes share a common
evolutionary history.
One is mouse trypsin and the other is crayfish trypsin.
They are homologous proteins. The sequences share 41% identity.
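Percent identity, such as the 41% quoted above, is computed from an alignment by counting identical columns. The sketch below assumes two already-aligned strings of equal length with "-" as the gap character, and divides by the number of alignment columns; other denominators are also in use, so treat that choice as an assumption.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns in which both sequences have the same residue.

    Assumes the inputs are already aligned (equal length, gap character "-").
    Columns where both sequences have a gap are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [
        (a, b) for a, b in zip(aligned_a, aligned_b)
        if not (a == gap and b == gap)
    ]
    if not columns:
        return 0.0
    identical = sum(1 for a, b in columns if a == b and a != gap)
    return 100.0 * identical / len(columns)


if __name__ == "__main__":
    print(percent_identity("ACDE", "ACDF"))  # 3 of 4 columns identical
```

Identity is the measured quantity; homology remains a conclusion drawn from it.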
Evolutionary Basis of Sequence
Alignment (Cont. 1)
Why are there regions of identity?
1) Functional: conserved residues participate in the
reaction.
2) Structural: for example, conserved cysteine
residues that form a disulfide linkage.
3) Historical: residues that are conserved solely
because of a common ancestral gene.
Evolutionary Basis of Sequence
Alignment (Cont. 2)
Note: it is possible that two proteins share a high degree of
similarity but have two different functions. For example,
human gamma-crystallin is a lens protein that has no known
enzymatic activity. It shares a high percentage of identity with
E. coli quinone oxidoreductase. These proteins likely had a
common ancestor but their functions diverged.
This is analogous to a railroad car that now functions as a diner.
Modular nature of proteins
The previous alignment was global. However,
many proteins do not display global patterns of
similarity. Instead, they possess local regions of
similarity.
Proteins can be thought of as assemblies of
modular domains. THINK OF MR.
POTATOHEAD. It is thought that this may, in
some cases, be due to a process known as exon
shuffling.
Modular nature of proteins (cont. 1)
Gene A: Exon 1a, Exon 2a
Duplication of Exon 2a
Gene A: Exon 1a, Exon 2a, Exon 2a
Exchange with Gene B
Gene B: Exon 1b, Exon 2b, Exon 2b
Gene A: Exon 1a, Exon 2a, Exon 3 (Exon 2b from Gene B)
Gene B: Exon 1b, Exon 2b, Exon 3 (Exon 2a from Gene A)
Dot Plots
  T G C C T A G
A . . . . . * .
A . . . . . * .
T * . . . * . .
G . * . . . . *
C . . * * . . .
C . . * * . . .
T * . . . * . .
A . . . . . * .
G . * . . . . *
Window = 1
Note that 25% of the table will be
filled by random chance: there is a
1 in 4 chance of a match at each position.
Dot Plots with window = 2
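  T G C C T A G
A . . . . . . .
A . . . . . . .
T * . . . . . .
G . * . . . . .
C . . * . . . .
C . . . * . . .
T . . . . * . .
A . . . . . * .
G . . . . . . *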
Window = 2
The larger the window,
the more noise can
be filtered out.
What is the percent chance
that a window will match
at random? One in 4^2:
1/16 * 100 = 6.25%
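The effect of the window size on a dot plot can be sketched as follows, using the two short sequences from the slide axes (AATGCCTAG and TGCCTAG). The function names are hypothetical, and marking every cell covered by a matching window is an assumed convention; programs differ on where they place the dot.

```python
def dot_plot(seq_rows, seq_cols, window=1):
    """Return a grid of booleans; True marks a cell covered by a matching window.

    A window matches when the two substrings of length `window` are identical.
    """
    grid = [[False] * len(seq_cols) for _ in seq_rows]
    for i in range(len(seq_rows) - window + 1):
        for j in range(len(seq_cols) - window + 1):
            if seq_rows[i:i + window] == seq_cols[j:j + window]:
                for k in range(window):
                    grid[i + k][j + k] = True
    return grid


def render(grid, seq_rows, seq_cols):
    """Format the grid as text, with '*' for a dot and '.' for an empty cell."""
    lines = ["  " + " ".join(seq_cols)]
    for base, row in zip(seq_rows, grid):
        lines.append(base + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


def random_match_chance(window, alphabet_size=4):
    """Chance that a window matches at random, assuming equally likely letters."""
    return (1.0 / alphabet_size) ** window


if __name__ == "__main__":
    rows, cols = "AATGCCTAG", "TGCCTAG"
    for w in (1, 2):
        grid = dot_plot(rows, cols, window=w)
        print("Window =", w, "random chance =", random_match_chance(w))
        print(render(grid, rows, cols))
        print("dots:", sum(sum(row) for row in grid))
```

For these two sequences the sketch gives 15 dots at window = 1 and 7 dots at window = 2, with only the main diagonal surviving the larger window.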
Identity Matrix
A  1
C  0 1
I  0 0 1
L  0 0 0 1
   A C I L
Simplest type of scoring matrix
Similarity
It is easy to score if an amino acid is identical to another (the
score is 1 if identical and 0 if not). However, it is not easy to
give a score for amino acids that are somewhat similar.
[Figure: structures of isoleucine and leucine.]
Should they get a 0 (non-identical), a 1 (identical), or
something in between?
The Dotter Program
• The program consists of three components:
  • A sliding window
  • A table that gives a score for each amino acid match
  • A graph that converts the score to a dot of a certain density.
    The higher the density, the higher the score.
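The three components can be sketched together in a few lines. This is not Dotter's actual code or scoring table: the pair_score function, the 0.5 score for an isoleucine/leucine pair, the window size, and the character shades standing in for dot density are all assumptions made for illustration.

```python
# Hypothetical set of "similar" pairs; a real program would use a full scoring matrix.
SIMILAR_PAIRS = frozenset({("I", "L"), ("L", "I")})

# Characters standing in for dot density, from empty to darkest.
SHADES = " .:*#"


def pair_score(a, b):
    """Score table: 1 for identical residues, 0.5 (assumed) for similar ones, else 0."""
    if a == b:
        return 1.0
    if (a, b) in SIMILAR_PAIRS:
        return 0.5
    return 0.0


def window_scores(seq_rows, seq_cols, window=3):
    """Sliding window: average pair score along each diagonal window."""
    scores = []
    for i in range(len(seq_rows) - window + 1):
        row = []
        for j in range(len(seq_cols) - window + 1):
            total = sum(pair_score(seq_rows[i + k], seq_cols[j + k])
                        for k in range(window))
            row.append(total / window)
        scores.append(row)
    return scores


def to_density(score):
    """Graph step: map a score in [0, 1] to a shade; higher score, denser mark."""
    index = min(int(score * len(SHADES)), len(SHADES) - 1)
    return SHADES[index]


if __name__ == "__main__":
    for row in window_scores("ACILLCA", "ACLLICA", window=3):
        print("".join(to_density(score) for score in row))
```

Replacing pair_score with a lookup into a full amino acid scoring table would leave the other two components unchanged.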
Two proteins that are similar in
certain regions
[Figure: comparison of tissue plasminogen activator (PLAT) and coagulation factor 12 (F12), with the regions of similarity marked. A single region on F12 is similar to two regions on PLAT.]