Detecting Positively and Negatively Selected Sites in a


Prediction of
functional/structural sites in a
protein using conservation and
hyper-variation
(ConSeq, ConSurf, Selecton)
1
Empirical findings:
variation among genes
“Important” proteins evolve more slowly than “unimportant” ones.
2
Histone H4 protein
3
Empirical findings:
variation among genes
Functional regions evolve more slowly than nonfunctional regions.
4
Conservation = functional/structural
importance
5
Alignment preproinsulin

Xenopus  MALWMQCLP-LVLVLLFSTPNTEALANQHL
Bos      MALWTRLRPLLALLALWPPPPARAFVNQHL
         **** : * *.*: *:..* :. *:****

Xenopus  CGSHLVEALYLVCGDRGFFYYPKIKRDIEQ
Bos      CGSHLVEALYLVCGERGFFYTPKARREVEG
         **************:******** :*::*

Xenopus  AQVNGPQDNELDG-MQFQPQEYQKMKRGIV
Bos      PQVG---ALELAGGPGAGGLEGPPQKRGIV
         .**.     ** *       *    *****

Xenopus  EQCCHSTCSLFQLENYCN
Bos      EQCCASVCSLYQLENYCN
         **** *.***:*******
6
7
8
Conservation-based inference
• Conserved sites:
  - Important for the function or structure
  - Not allowed to mutate
  - “Slow evolving” sites
  - Low rate of evolution
• Variable sites:
  - Less important (usually)
  - Change more easily
  - “Fast evolving” sites
  - High rate of evolution
9
Detecting conservation:
Evolutionary rates
• Rate (~speed) = distance / time
• Distance = number of substitutions per site (d)
• Time = 2 × number of years since divergence (T); doubled because the two sequences evolved independently

$$ r = \frac{d}{2T} $$
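The rate formula above can be sketched in a few lines. The function name and the numbers (a distance of 0.2 substitutions per site and a divergence time of 100 million years) are hypothetical values chosen only for illustration, not data from the slides.

```python
def substitution_rate(d: float, t_years: float) -> float:
    """Return r = d / (2T).

    d       -- distance: substitutions per site between two sequences
    t_years -- T: years since the two sequences diverged
    """
    if t_years <= 0:
        raise ValueError("divergence time must be positive")
    # The factor 2 accounts for both lineages evolving independently.
    return d / (2.0 * t_years)


# Hypothetical example values, for illustration only.
rate = substitution_rate(d=0.2, t_years=100e6)
print(f"{rate:.1e} substitutions/site/year")
```

With these made-up inputs the result lands on the order of 10^-9 substitutions/site/year.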
10
Mean Rate of Nucleotide Substitution in Mammalian Genomes
~10^-9 substitutions/site/year
Evolution is a very slow
process at the molecular
level (“Nothing
happens…”)
11
Rate computation

Position        1  2  3  4  5  6  7
Human           D  M  A  A  H  A  M
Chimp           D  E  A  A  G  G  C
Cow             D  Q  A  A  W  A  P
Fish            D  L  A  A  C  A  L
S. cerevisiae   D  D  G  A  F  A  A
S. pombe        D  D  G  A  L  G  E
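To get a first feel for which positions in the toy alignment above are conserved, one can count the distinct residues per column and compute the Shannon entropy. This is only a naive stand-in and an assumption of this sketch: it ignores the phylogenetic tree and the substitution model, so it is not the site-specific rate method that ConSeq uses.

```python
import math
from collections import Counter

# The seven-position toy alignment from the slide.
alignment = {
    "Human": "DMAAHAM",
    "Chimp": "DEAAGGC",
    "Cow": "DQAAWAP",
    "Fish": "DLAACAL",
    "S. cerevisiae": "DDGAFAA",
    "S. pombe": "DDGALGE",
}


def column_entropy(column: str) -> float:
    """Shannon entropy (bits) of the residues in one alignment column."""
    counts = Counter(column)
    n = len(column)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


# Transpose the alignment so that each string is one column.
columns = ["".join(col) for col in zip(*alignment.values())]
entropies = [column_entropy(col) for col in columns]

for pos, (col, h) in enumerate(zip(columns, entropies), start=1):
    print(f"position {pos}: {col}  distinct={len(set(col))}  entropy={h:.2f}")
```

Positions 1 and 4 contain a single residue in every species (entropy 0), while positions 5 and 7 show a different residue in each species.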
12
http://conseq.tau.ac.il
Site-specific rate computation method
13
Using the ConSeq server
14
ConSeq results:
15
Crash course in protein structure
16
Why protein structure?
• Each protein has a particular 3D structure that determines its function
• Protein structure is better conserved than protein sequence and more closely related to function
• Analyzing a protein structure is more informative than analyzing its sequence for function inference
17
PDB: Protein Data Bank
http://www.rcsb.org
• Holds 3D models of biological macromolecules (protein, RNA, DNA, small molecules)
• All data are available to the public
• X-ray crystal structures (84%), NMR models (16%)
• Submitted by biologists and biochemists from around the world
18
PDB model
• Defines the 3D coordinates (x, y, z) of each of the atoms in one or more molecules (i.e., a complex)
• There are models of proteins, protein complexes, proteins and DNA, protein segments, etc.
• The models also include the positions of ligand molecules, solvent molecules, metal ions, etc.
• PDB code: one digit followed by 3 alphanumeric characters (e.g., 1a14)
19
The PDB file – text format
20
The PDB file – text format
Fields of each coordinate record, in file order: atom number, atom identity, residue identity, chain, residue number, the coordinates (X, Y, Z), temperature factor
ATOM: usually protein or DNA
HETATM: usually ligand, ion, water
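Because PDB coordinate records are fixed-width text, the fields listed above can be read by slicing columns. The sketch below follows the column layout of the PDB format specification for ATOM/HETATM records; the sample record is an illustrative line with made-up coordinates, not taken from any particular entry, and `parse_atom_record` is a hypothetical helper name.

```python
def parse_atom_record(line: str) -> dict:
    """Slice one fixed-width ATOM/HETATM record into named fields."""
    return {
        "record": line[0:6].strip(),      # ATOM or HETATM
        "serial": int(line[6:11]),        # atom number
        "atom_name": line[12:16].strip(), # atom identity
        "res_name": line[17:20].strip(),  # residue identity
        "chain": line[21],                # chain identifier
        "res_seq": int(line[22:26]),      # residue number
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "b_factor": float(line[60:66]),   # temperature factor
    }


# Illustrative record, written field by field so the column widths are visible.
sample = (
    "ATOM  "      # columns 1-6: record name
    "    1"       # columns 7-11: atom number
    " "
    " N  "        # columns 13-16: atom name
    " "
    "MET"         # columns 18-20: residue name
    " "
    "A"           # column 22: chain
    "   1"        # columns 23-26: residue number
    "    "
    "  27.340"    # columns 31-38: X
    "  24.430"    # columns 39-46: Y
    "   2.614"    # columns 47-54: Z
    "  1.00"      # columns 55-60: occupancy
    "  9.67"      # columns 61-66: temperature factor
)

atom = parse_atom_record(sample)
print(atom)
```

For real work an established parser (for example the one in Biopython) is the safer choice; the point here is only to show how the labelled fields map onto the text line.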
21
Viewing structures
Wireframe
Spacefill
Backbone
22
Conservation in the structure
Protein core: structurally constrained - usually conserved
Active site: functionally constrained - usually conserved
Surface loops: tolerant to mutations - usually variable
[Figure: protein structure with the surface loops, active site, and hydrophobic core labeled]
23
http://consurf.tau.ac.il
Same algorithm as ConSeq, but here the results
are projected onto the 3D structure of the protein
24
Using the ConSurf server
25
ConSurf example:
potassium channel
• An integral membrane protein with sequence similarity to all known K+ channels, particularly in the pore region.
• PDB code: 1bl8, chain A
26
ConSurf results:
27
ConSurf example:
potassium channel
• Alignment of homologs found by PSI-BLAST:
28
ConSurf results:
29
ConSurf example:
potassium channel
• Neighbor-Joining reconstructed phylogenetic tree:
30
ConSurf results:
31
Conservation scores:
• The scores are standardized: the average score over all residues is zero, and the standard deviation is one
• The lowest score represents the most conserved site in the protein
  - negative values: slowly evolving (= low evolutionary rate), conserved sites
• The highest score represents the most variable site in the protein
  - positive values: rapidly evolving (= high evolutionary rate), variable sites
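The standardization described above is an ordinary z-score transform. The sketch below assumes the population standard deviation is used (the slides do not say which variant the server applies), and the raw rates are hypothetical values.

```python
from statistics import mean, pstdev


def standardize(rates):
    """Rescale rates so that their mean is 0 and standard deviation is 1."""
    mu = mean(rates)
    sigma = pstdev(rates)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("all rates are identical; scores are undefined")
    return [(r - mu) / sigma for r in rates]


# Hypothetical per-site evolutionary rates for a five-residue protein.
raw_rates = [0.1, 0.4, 1.0, 2.5, 0.2]
scores = standardize(raw_rates)

most_conserved = scores.index(min(scores)) + 1  # 1-based site number
most_variable = scores.index(max(scores)) + 1

for site, score in enumerate(scores, start=1):
    label = "conserved" if score < 0 else "variable"
    print(f"site {site}: {score:+.2f} ({label})")
```

Because the scores are relative to the protein's own average, a score is comparable between sites of the same protein but not between different proteins.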
32
ConSurf results: amino-acid
conservation scores
33
ConSurf result with First Glance in
Jmol:
34
ConSeq/ConSurf user intervention
(advanced options)
1. Method of calculating the amino acid conservation scores: Bayesian/Max Likelihood
2. Enter your own MSA file
3. Multiply Align Sequences using: MUSCLE/CLUSTALW
4. Collect the Homologues from: SWISS-PROT/UniProt
5. Max. Number of Homologues (default = 50)
6. No. of PSI-BLAST Iterations (default = 1)
7. PSI-BLAST E-value Cutoff (default = 0.001)
8. Model of substitution for proteins: JTT/Dayhoff/mtREV/cpREV/WAG
9. Enter your own PDB file
10. Enter your own TREE file
35
Codon-level selection
• ConSeq/ConSurf: compute the evolutionary rate of amino-acid sites → the data are amino acids.
• But codons encode amino acids…
• 61 codons vs. 20 amino acids!
• Aren’t we losing information?
36
Darwin – the theory of
natural selection
• Adaptive evolution: favorable traits will become more frequent in the population
37
M. Kimura – the neutral theory
of molecular evolution
Most of the DNA variation between species is neutral with regard to the phenotype
Selection operates to preserve a trait
38
Synonymous (silent) and nonsynonymous (non-silent) substitutions
Silent
Non-silent…
39
Synonymous vs. nonsynonymous substitutions
Synonymous substitutions = silent substitutions
Nonsynonymous substitutions = non-silent, or amino-acid altering, substitutions
UUU → UUC (Phe → Phe): synonymous
UUU → CUU (Phe → Leu): nonsynonymous
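The two examples above can be checked mechanically against the standard genetic code, which also shows where the figure of 61 amino-acid-encoding codons versus 20 amino acids comes from. The sketch assumes the standard (nuclear) code; the function name `classify_substitution` is hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order at each position.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Codons that encode an amino acid ('*' marks the three stop codons).
sense_codons = [c for c, aa in CODON_TABLE.items() if aa != "*"]
amino_acids = {aa for aa in CODON_TABLE.values() if aa != "*"}


def classify_substitution(before: str, after: str) -> str:
    """Label a codon change as synonymous or nonsynonymous."""
    if CODON_TABLE[before] == CODON_TABLE[after]:
        return "synonymous"
    return "nonsynonymous"


print(len(sense_codons), "codons encode", len(amino_acids), "amino acids")
print("UUU -> UUC:", classify_substitution("UUU", "UUC"))
print("UUU -> CUU:", classify_substitution("UUU", "CUU"))
```

Working at the amino-acid level collapses all synonymous changes into "no change", which is exactly the information a codon-level analysis recovers.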
40
Synonymous vs. non-synonymous substitutions
For most proteins, the rate of synonymous substitutions is much higher than the nonsynonymous rate.
This pattern indicates purifying selection (= conservation; this is what ConSeq/ConSurf compute).
41
Synonymous vs. non-synonymous substitutions
Structural proteins
42
Saturation of synonymous substitutions
Histone H4 between human and wheat: saturation of
synonymous substitutions
43
Synonymous vs. nonsynonymous substitutions
There are rare cases where the nonsynonymous rate is much larger than the synonymous rate.
This is called positive selection.
44
Positive Selection
The hypothesis:
Promotes the fitness of the organism
Examples:
• Proteins of the immune system
• Pathogen proteins evading the host immune system
• Pathogen proteins that are drug targets
• Proteins that are products of gene duplication
• Proteins involved in the reproduction system
45
Computing synonymous and nonsynonymous rates
• Codon-based MSA: translate DNA to amino acids, align,
backtrack to the DNA but keep alignment
• Phylogenetic tree: 5 replacements in 10 positions between
human and chimp is a lot, but between human and
cucumber is nothing
• Different replacement probabilities between two amino acids: Lys→Arg ≠ Lys→Cys
Positive selection occurs at only a few sites!
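The codon-based MSA step above (translate, align the proteins, then backtrack to the DNA while keeping the alignment) can be sketched as threading each unaligned coding sequence onto its aligned protein. The function name `thread_codons` is hypothetical, and the sketch assumes the DNA is exactly the coding sequence of the aligned protein, without a trailing stop codon.

```python
def thread_codons(aligned_protein: str, dna: str) -> str:
    """Map an aligned protein (with '-' gaps) back onto its coding DNA.

    Each residue is replaced by its original codon and each gap by '---',
    so the protein alignment is preserved at the codon level.
    """
    n_residues = sum(1 for ch in aligned_protein if ch != "-")
    if len(dna) != 3 * n_residues:
        raise ValueError("DNA length must be 3 x the number of residues")
    codons = (dna[i:i + 3] for i in range(0, len(dna), 3))
    return "".join("---" if ch == "-" else next(codons) for ch in aligned_protein)


# Toy example: the protein alignment placed a gap between Met and Lys.
print(thread_codons("M-K", "ATGAAA"))
```

Aligning at the protein level and only then returning to the DNA keeps gaps in multiples of three, so codons are never split by the alignment.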
46
Inferring positive selection
Divide the rate of non-silent substitutions (Ka)
by the rate of silent substitutions (Ks):

$$ \frac{K_a}{K_s} $$
47
Inferring positive selection
Basic assumptions:
Selection score (Ka/Ks) > 1
↓
positive selection
Selection score (Ka/Ks) < 1
↓
purifying selection
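The decision rule above is simple enough to write down directly. In this sketch the rates are hypothetical, and the label for a ratio of exactly 1 (neutral evolution) is an addition based on the usual interpretation of Ka/Ks rather than something stated on the slide.

```python
def selection_score(ka: float, ks: float) -> float:
    """Return Ka/Ks, the nonsynonymous rate divided by the synonymous rate."""
    if ks <= 0:
        raise ValueError("Ks must be positive to form the ratio")
    return ka / ks


def interpret(score: float) -> str:
    """Apply the basic assumptions: > 1 positive, < 1 purifying."""
    if score > 1:
        return "positive selection"
    if score < 1:
        return "purifying selection"
    return "neutral"  # assumption: not covered on the slide


# Hypothetical rates, for illustration only.
for ka, ks in [(0.02, 0.20), (0.30, 0.10)]:
    score = selection_score(ka, ks)
    print(f"Ka={ka}, Ks={ks}: Ka/Ks={score:.2f} -> {interpret(score)}")
```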
48
Not so fast!
• Our computational model assumes there is positive selection in the data
• There is a good chance our model will find a few positively selected sites whatever the case
• Is this really indicative of positive selection or plain randomness?
So, maybe there’s no positive selection after all
49
Statistics helps us to compare
between hypotheses
• H0: There’s no positive selection
• H1: There is positive selection
• H0: compute the probability (likelihood) of the data using a model that does not account for positive selection
• H1: compute the probability (likelihood) of the data using a model that does account for positive selection
Perform a likelihood ratio test (LRT):

$$ 2 \ln\left( \frac{L(\mathrm{Data} \mid M(H_1))}{L(\mathrm{Data} \mid M(H_0))} \right) \sim \chi^2 $$
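A minimal sketch of the likelihood ratio test, working with log-likelihoods as programs usually report them. It assumes the two models differ by one free parameter, so the statistic is compared with a chi-square distribution with 1 degree of freedom; the slide does not state the degrees of freedom, and the log-likelihood values are hypothetical.

```python
import math


def lrt_statistic(lnl_h0: float, lnl_h1: float) -> float:
    """2 * ln(L1 / L0), computed from the two log-likelihoods."""
    return 2.0 * (lnl_h1 - lnl_h0)


def chi2_sf_df1(x: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    if x <= 0:
        return 1.0
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0))


# Hypothetical log-likelihoods of the null and alternative models.
lnl_h0 = -1234.5
lnl_h1 = -1230.1

stat = lrt_statistic(lnl_h0, lnl_h1)
p_value = chi2_sf_df1(stat)
print(f"LRT statistic = {stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: the model with positive selection fits significantly better")
else:
    print("Cannot reject H0: no significant evidence for positive selection")
```

Using only the standard library keeps the sketch dependency-free; with SciPy available, `scipy.stats.chi2.sf` covers other degrees of freedom.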
50
http://selecton.tau.ac.il
51
Using the selecton server
52
Input = a coding sequence
at the codon level
• The user must provide the sequences; there is no PSI-BLAST option
• The length of each sequence must be divisible by 3 (ORF), and the sequences must not include any stop codons
• A user-supplied alignment should be a codon alignment (e.g., built with RevTrans)
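The input requirements above can be checked before submitting sequences. This is a sketch of such a pre-check, not part of Selecton itself; it assumes the standard nuclear genetic code for the stop codons, and `check_coding_sequence` is a hypothetical helper name.

```python
# Stop codons of the standard nuclear code (assumption: other codes differ).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_coding_sequence(seq: str) -> list:
    """Return a list of problems found in an unaligned coding sequence."""
    seq = seq.upper().replace("U", "T")
    problems = []
    if len(seq) % 3 != 0:
        problems.append(f"length {len(seq)} is not divisible by 3")
    # Scan complete codons in frame, ignoring any leftover bases at the end.
    usable = len(seq) - len(seq) % 3
    for i in range(0, usable, 3):
        codon = seq[i:i + 3]
        if codon in STOP_CODONS:
            problems.append(f"stop codon {codon} at codon {i // 3 + 1}")
    return problems


for name, seq in [("ok", "ATGAAACCC"), ("short", "ATGAA"), ("stop", "ATGTAAAAA")]:
    print(name, check_coding_sequence(seq) or "passes")
```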
53
Similar to ConSurf
optional
Nuclear/mitochondria
different species
Default run: M8 (H1) and M8a (H0)
54
Selecton Example: HIV Protease
The Protease is an
essential enzyme
for viral infectivity
PDB ID: 1hxw
55
Selecton Results:
56
Selecton Results:
57
Selecton results:
58
Selection scores (Ka/Ks):
• The scores are normalized
• Ka/Ks > 1: positively selected site
• Ka/Ks < 1: site under purifying selection
59
Coloring scheme:
• The coloring used for visualization is based on the continuous Ka/Ks scores.
• The color grades (1-7):
  - 1 for positively selected sites (blue)
  - 7 for sites under purifying selection (bordeaux)
[Figure: color coding scheme of Selecton]
60
61