Transcript AP review

Sequence specific recognition of DNA by
proteins.
• Nitrogen and oxygen atoms exposed in the grooves can form
hydrogen bonds with proteins.
• Different Watson/Crick base pairs have different patterns
of donors and acceptors.
[Figure: patterns of H-bond acceptors, H-bond donors, hydrogen atoms, and methyl groups exposed in the major groove and minor groove for the G-C, A-T, C-G, and T-A base pairs.]
Differences between DNA & RNA:
• T is replaced by U
• Extra –OH group at the 2’ position of the pentose sugar: the sugar
is ribose, not deoxyribose
• RNA usually does not form a double helix; it makes loops
within one strand and often contains modified bases
• The additional 2’-OH group can form hydrogen bonds (HB),
stabilizing tertiary structure
Illustration of RNA secondary structures.
From M.S. Andronescu
DNA/RNA thermodynamics.
Two major types of interactions:
• Base pairing (hydrogen bonds)
• Base stacking of nearest neighbors (π-electron sharing
between aromatic rings plus hydrophobic interactions)
$\Delta G = \Delta G_{\text{init}} + \sum \Delta G_{\text{pairing}} + \sum \Delta G_{\text{stacking}}$
RNA secondary structure prediction
Assumptions used in predictions:
- The most likely structure is the most stable one.
- The energy of each base pair depends only on the
energy of the previous base pair.
- Energy parameters for different types of RNA secondary
structures are derived from experiment.
- The structure is formed without knots.
Minimum energy method of RNA
secondary structure prediction.
• Self-complementary regions can be found in a dot matrix
• The energy of each base pair depends only on the
energy of the previous base pair
• Energy parameters for different types of RNA secondary
structures are derived from experiment
• The most energetically favorable conformations are
predicted by a method similar to dynamic programming
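The dynamic-programming idea can be illustrated with a deliberately simplified sketch. It assumes a swap: instead of experimentally derived energies it maximizes the number of nested base pairs (the Nussinov recursion), so it is not the energy-minimization method itself. The names `max_pairs` and `min_loop` are hypothetical.

```python
def max_pairs(seq, min_loop=3):
    """Nussinov-style DP: maximum number of nested base pairs in an RNA sequence.

    Simplifying assumption: every allowed pair scores 1 (no energy parameters),
    and a hairpin loop must contain at least `min_loop` unpaired bases.
    """
    allowed = {("A", "U"), ("U", "A"), ("G", "C"),
               ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within seq[i..j]
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]              # case 1: base j stays unpaired
            for k in range(i, j - min_loop):    # case 2: base j pairs with base k
                if (seq[k], seq[j]) in allowed:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


print(max_pairs("GGGAAAUCC"))
```

Replacing the constant score of 1 with stacking energies that depend on the previous base pair gives the kind of recursion used by minimum-energy methods.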
Sequence covariation method.
Some positions from different species can covary because they are
involved in pairing
$f_m(B_1)$ – frequencies in column m;
$f_n(B_2)$ – frequencies in column n;
$f_{m,n}(B_1, B_2)$ – joint frequencies of two nucleotides in two columns.
Covariation ratio: $f_{m,n}(B_1, B_2) / (f_m(B_1) \cdot f_n(B_2))$
Seq 1   ---G------C---
Seq 2   ---G------C---
Seq 3   ---A------T---
Seq 4   ---T------A---
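A minimal sketch of the covariation ratio for two alignment columns. The two columns are read from the four-sequence example above (G/C, G/C, A/T, T/A). The function name `covariation_ratio` is hypothetical, and returning 0.0 when a nucleotide is absent from a column is an assumed convention.

```python
from collections import Counter


def covariation_ratio(col_m, col_n, b1, b2):
    """f_mn(b1, b2) / (f_m(b1) * f_n(b2)) for two alignment columns.

    col_m and col_n hold one character per sequence, in the same sequence order.
    """
    n = len(col_m)
    f_m = Counter(col_m)[b1] / n
    f_n = Counter(col_n)[b2] / n
    f_mn = Counter(zip(col_m, col_n))[(b1, b2)] / n
    if f_m == 0 or f_n == 0:
        return 0.0  # assumed convention: ratio is undefined, report 0
    return f_mn / (f_m * f_n)


col_m = "GGAT"  # column m in Seq 1..4
col_n = "CCTA"  # column n in Seq 1..4
print(covariation_ratio(col_m, col_n, "G", "C"))
```

A ratio above 1 means the two nucleotides occur together more often than expected if the columns were independent, which is what pairing positions should show.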
Gene prediction.
Gene – DNA sequence encoding protein, rRNA,
tRNA …
Gene concept is complicated:
- Introns/exons
- Alternative splicing
- Genes-in-genes
- Multisubunit proteins
Codon usage tables.
- Each amino acid can be encoded by several codons.
- Each organism has characteristic pattern of codon usage.
Problems arising in gene prediction.
• Distinguishing pseudogenes (former genes that no longer
function) from genes.
• Exon/intron structure in eukaryotes: exon
flanking regions are not very well conserved.
• Exons can be joined in alternative combinations (alternative
splicing).
• Genes can overlap each other and occur on
different strands of DNA.
Gene identification
• Homology-based gene prediction
  – Similarity searches (e.g. BLAST, BLAT)
  – ESTs
• Ab initio gene prediction
  – Prokaryotes
    • ORF identification
  – Eukaryotes
    • Promoter prediction
    • PolyA-signal prediction
    • Splice site, start/stop-codon predictions
Ab initio gene prediction.
Predictions are based on the observation that gene DNA
sequence is not random:
- Gene-coding sequence has start and stop codons.
- Each species has a characteristic pattern of synonymous
codon usage.
- Non-coding ORFs are very short.
- A gene typically corresponds to the longest ORF.
These methods look for the characteristic features of genes
and score them high.
Example of ORFs.
There are six possible reading frames for ORFs in each sequence: three for
each direction of transcription.
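A minimal sketch of an ORF scan over all six reading frames. It assumes an ORF runs from the first ATG in a frame to the next in-frame stop codon (standard stop codons TAA, TAG, TGA). The names `find_orfs` and `min_codons` are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(_COMPLEMENT)[::-1]


def find_orfs(dna, min_codons=1):
    """Return (strand, frame, orf_sequence) for ATG...stop ORFs in six frames.

    min_codons is the minimum number of codons before the stop codon
    (the start codon counts).
    """
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # open an ORF at the first ATG
                elif start is not None and codon in STOPS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, seq[start:i + 3]))
                    start = None                   # close it at the stop codon
    return orfs


print(find_orfs("CCATGAAATGACC"))
```

Raising `min_codons` is one simple way to discard the short ORFs that occur by chance in non-coding sequence.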
Gene preference score – important
indicator of coding region.
Observation: frequencies of codons and codon pairs in coding and noncoding regions are different.
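Given a sequence of codons: $C = C_1 C_2 \ldots C_n$
and assuming independence, the probability of finding coding region: $P(C) = f(C_1)\, f(C_2) \cdots f(C_n)$, where $f(C_i)$ is the frequency of codon $C_i$ in coding regions.
The probability of finding sequence “C” in non-coding regions: $P_0(C) = f_0(C_1)\, f_0(C_2) \cdots f_0(C_n)$, where $f_0(C_i)$ is the frequency of codon $C_i$ in non-coding regions.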
The gene preference score:
$\mathrm{GPS} = \log\left( \dfrac{P(C)}{P_0(C)} \right)$
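A minimal sketch of the score under the independence assumption, where the log of a ratio of products becomes a sum of per-codon log-ratios. The codon frequencies below are hypothetical placeholders, not measured values, and `gene_preference_score` is a hypothetical name.

```python
import math


def gene_preference_score(codons, coding_freq, noncoding_freq):
    """log(P(C) / P0(C)) for a codon list, assuming independent codons.

    coding_freq and noncoding_freq map each codon to its frequency in
    coding and non-coding regions respectively.
    """
    return sum(math.log(coding_freq[c] / noncoding_freq[c]) for c in codons)


# Hypothetical frequencies, for illustration only.
coding = {"ATG": 0.03, "GCC": 0.04}
noncoding = {"ATG": 0.015, "GCC": 0.02}

score = gene_preference_score(["ATG", "GCC", "GCC"], coding, noncoding)
print(score)
```

A positive score means the codon sequence is more probable under the coding model than under the non-coding one.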
Gene prediction accuracy.
True positives (TP) – nucleotides, which are
correctly predicted to be within the gene.
Actual positives (AP) – nucleotides, which are
located within the actual gene.
Predicted positives (PP) – nucleotides, which are
predicted in the gene.
Sensitivity = TP / AP
Specificity = TP / PP
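A minimal sketch of the two measures at the nucleotide level, assuming the actual and predicted annotations are given as equal-length lists of 0/1 flags (1 = nucleotide lies within a gene). The function name is hypothetical.

```python
def sensitivity_specificity(actual, predicted):
    """Return (TP / AP, TP / PP) for per-nucleotide 0/1 gene flags."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    ap = sum(1 for a in actual if a)      # actual positives
    pp = sum(1 for p in predicted if p)   # predicted positives
    sensitivity = tp / ap if ap else float("nan")
    specificity = tp / pp if pp else float("nan")
    return sensitivity, specificity


actual = [0, 1, 1, 1, 1, 0, 0, 0]
predicted = [0, 0, 1, 1, 1, 1, 0, 0]
print(sensitivity_specificity(actual, predicted))
```

In this toy case three of the four gene nucleotides are found and three of the four predicted nucleotides are correct, so both values are 0.75.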
The value of genome sequences lies in their
annotation
• Annotation – Characterizing genomic features
using computational and experimental methods
• Genes: levels of annotation
– Gene Prediction – Where are genes?
– What do they encode?
– What proteins/pathways are they involved in?
Human Genome project.
Analysis of gene order (synteny).
Genes with a related function are frequently
clustered on the chromosome.
Ex: E. coli genes responsible for synthesis of Trp
are clustered, and their order is conserved between
different bacterial species.
Operon: a set of genes transcribed simultaneously
with the same direction of transcription.
Structure and stability of globular proteins.
Native proteins are marginally stable.
Scale of interactions in proteins:
- Interactions less than kT ~ 0.6 kcal/mol
are neglected.
- ΔG ~ 5 - 20 kcal/mol
[Figure: free energy G versus reaction coordinate, with the unfolded (U) and folded (F) states separated by ΔG.]
Potential energy = Van der Waals + Electrostatic + …
Hydrophobic effect.
Hydrophobic interaction – tendency of
nonpolar compounds to transfer from an
aqueous solution to an organic phase.
- The entropy of water molecules decreases when they make contact with
a nonpolar surface (TΔS = -9.6 kcal/mol for cyclohexane).
- The effect is entropic because the energy of hydrogen bonds (HB) is very high:
water molecules keep their hydrogen bonds and become more ordered instead of breaking them.
- The hydrophobic effect is proportional to the buried surface area; the energy is
~ 20-25 cal/mol/Å^2.
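As a quick worked example of the proportionality, the sketch below converts a buried nonpolar area into an energy range using the 20-25 cal/mol/Å^2 figure quoted above. The function name and the 100 Å^2 input are hypothetical.

```python
def hydrophobic_energy_kcal(buried_area_A2, cal_per_A2=(20.0, 25.0)):
    """Return the (low, high) hydrophobic energy estimate in kcal/mol.

    buried_area_A2 is the buried nonpolar surface area in square Angstroms.
    """
    low, high = cal_per_A2
    return (buried_area_A2 * low / 1000.0, buried_area_A2 * high / 1000.0)


print(hydrophobic_energy_kcal(100.0))
```

Burying about 100 Å^2 therefore contributes roughly 2-2.5 kcal/mol, several times kT.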
Hierarchy of protein structure.
1. Amino acid sequence
2. Secondary structure
3. Tertiary structure
4. Quaternary structure
Picture from Branden & Tooze
“Introduction to protein structure”
Protein secondary structure prediction.
Assumptions:
• There should be a correlation between amino acid sequence and
secondary structure (SS). A short amino acid sequence is more likely to form one type
of SS than another.
• Local interactions determine SS. The SS of a residue is determined by its
neighbors (usually a sequence window of 13-17 residues is used).
Exceptions: short identical amino acid sequences can sometimes be found
in different SS.
Accuracy: 65% - 75%; the highest accuracy is for prediction of an α helix.
Methods of SS prediction.
• Chou-Fasman method
• GOR (Garnier, Osguthorpe and Robson)
• Neural network method
PHD – neural network program with multiple
sequence alignments.
• A BLAST search of the input sequence is performed and
similar sequences are collected.
• A multiple alignment of the similar sequences is used
as an input to a neural network.
• The sequence pattern in a multiple alignment is
enhanced compared with using a single sequence as
an input.
Protein structure prediction.
Fold recognition.
Unsolved problem: direct prediction of protein structure from physicochemical principles.
Solved problem: recognizing which of the known folds is similar to the fold
of an unknown protein.
Fold recognition is based on observations/assumptions:
- The overall number of different protein folds is limited (1000-3000 folds)
- The native protein structure is in its ground state (minimum energy)
Protein structure prediction.
Prediction of the three-dimensional structure of a protein from its sequence. Different
approaches:
- Homology modeling (predicted structure has a very close homolog in the
structure database).
- Fold recognition (predicted structure has an existing fold).
- Ab initio prediction (predicted structure has a new fold).
Steps of homology modeling.
1. Template recognition & initial alignment.
2. Backbone generation.
3. Loop modeling.
4. Side-chain modeling.
5. Model optimization.
1. Template recognition.
Recognition of similarity between the target and template.
Target – protein with unknown structure.
Template – protein with known structure.
Main difficulty – deciding which template to pick, multiple
choices/template structures.
Template structure can be found by searching for structures in PDB using
sequence-sequence alignment methods.
Fold recognition.
Goal: to find protein with known structure which best matches
a given sequence.
Since similarity between the target and its closest template is
not high, sequence-sequence alignment methods fail.
Solution: threading – sequence-structure alignment method.
Threading – method for structure prediction.
Sequence-structure alignment, target sequence is compared
to all structural templates from the database.
Requires:
- Alignment method (dynamic programming, Monte Carlo,…)
- Scoring function, which yields relative score for each
alternative alignment
Scoring function for threading.
• Contact-based scoring function
depends on the amino acid types of
two residues and distance between
them.
• Sequence-sequence alignment
scoring function does not depend on
the distance between two residues.
• If distance between two nonadjacent residues in the template is
less than 8 Å, these residues make
a contact.
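A minimal sketch of the contact definition, assuming each residue is represented by a single point (for example its C-alpha atom, which is an assumption not stated above). The name `find_contacts` is hypothetical.

```python
import math


def find_contacts(coords, cutoff=8.0):
    """Return index pairs (i, j) of non-adjacent residues closer than cutoff.

    coords holds one (x, y, z) point per residue, in Angstroms.
    """
    pairs = []
    for i in range(len(coords)):
        for j in range(i + 2, len(coords)):   # i + 2 skips sequence neighbors
            if math.dist(coords[i], coords[j]) < cutoff:
                pairs.append((i, j))
    return pairs


# Four residues on a straight line, 3.8 Angstroms apart.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
print(find_contacts(coords))
```

A contact-based score would then sum a pair potential over these pairs, using the amino acid types that the alignment places at positions i and j.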
Threading model validation.
• Correct bond length and bond angles
>> 3.8 Angstroms
• Correct placement of functionally important sites
• Prediction of global topology, not partial alignment (minimum
number of gaps)
Classwork II: Homology modeling.
- Go to NCBI Entrez, search for gi461699
- Do Blast search against PDB
- Repeat the same for gi60494508
- Predict functionally important sites
Protein engineering and protein design.
Protein engineering – altering protein sequence to change protein function or
structure
Protein design – designing de novo protein which satisfies a given requirement
Stability of mutants compared to wild-type
protein.
Measure of stability – melting
temperature (Tm), at which 50% of the enzyme is
inactivated during reversible heat
denaturation. For wild-type, Tm = 42 °C.
• all mutants were more stable than
wild-type.
• the longer the loop between Cys, the
larger the effect (the more restricted
the unfolded state is).
• the more disulfide bonds were
introduced, the more stable was the
mutant.
From B. Matthews et al.
Can structural scaffolds be reduced in size
while maintaining function?
A. Braisted & J.A. Wells used Z-domain (58 residues) of
bacterial protein A:
• removed third helix (truncated protein - 38 residues);
• mutated residues in the first and second helices;
• used phage display to select active forms;
• restored the binding of truncated protein.
Designing an amino acid sequence that will
fold into a given structure.
• Inverse protein folding problem:
designing a sequence which will fold
into a given structure – much easier
than folding problem!
• B. Dahiyat & S. Mayo: designed a
sequence of zinc finger domain that
does not require stabilization by Zn.
• Wild type protein domain is
stabilized by Zn (bound to two Cys
and two His); mutant is stabilized by
hydrophobic interactions.
Molecular basis of evolution.
Goal – to reconstruct the evolutionary history of all organisms
in the form of phylogenetic trees.
Classical approach: phylogenetic trees were constructed
based on the comparative morphology and physiology.
Molecular phylogenetics: phylogenetic trees are constructed
by comparing DNA/protein sequences between organisms.
Mechanisms of evolution.
- By mutations of genes. Mutations spread
through the population via genetic drift
and/or natural selection.
- By gene duplication and recombination.
Measures of evolutionary distance between
amino acid sequences.
1. P-distance. Evolutionary distance is usually
measured by the number of amino acid
substitutions.
$p = n_d / n$
$n_d$ – number of amino acid differences between two
sequences; $n$ – number of aligned amino acids.
Poisson correction for evolutionary distance.
2. PC-distance. Takes into account multiple
substitutions and therefore is proportional to
divergence time.
PC-distance can be expressed through the p-distance:
$d = -\ln(1 - p)$
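A minimal sketch of both distances. It assumes the two sequences are already aligned and that columns containing a gap character "-" are not counted as aligned amino acids (an assumption). The function names are hypothetical.

```python
import math


def p_distance(seq1, seq2):
    """Proportion of differing amino acids among aligned, gap-free positions."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    n_d = sum(1 for a, b in pairs if a != b)
    return n_d / len(pairs)


def poisson_distance(p):
    """Poisson-corrected (PC) distance d = -ln(1 - p)."""
    return -math.log(1.0 - p)


p = p_distance("ACDEFGHIKL", "ACDEFGHIKV")
print(p, poisson_distance(p))
```

For small p the correction is negligible, but it grows quickly as p increases because multiple substitutions at the same site become more likely.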
The concept of evolutionary trees.
- Trees consist of nodes and branches, topology - branching
pattern.
- The length of each branch represents the number of
substitutions occurring between two nodes. If rate of
evolution is constant, branches will have the same length
(molecular clock hypothesis).
- The distance along the tree is calculated by summing up all
intervening branch lengths.
- Trees can be bifurcating (binary) or multifurcating.
- Trees can be rooted and unrooted. The root is placed by
including a taxon which is known to branch off earlier than
others.
Accuracies of phylogenetic trees.
Two types of errors:
- Topological error
- Branch length error
Bootstrap test:
Resampling of alignment columns with
replacement; recalculating the tree; counting how
many times this topology occurred – “bootstrap
confidence value”. If it is close to 100% – reliable
topology/interior branch.
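A minimal sketch of the resampling step and the support count. Tree building itself is not shown: `build_topology` is a hypothetical caller-supplied function that takes an alignment and returns a comparable description of the topology.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement.

    alignment is a list of equal-length strings, one per sequence.
    """
    length = len(alignment[0])
    cols = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[c] for c in cols) for seq in alignment]


def bootstrap_support(alignment, build_topology, replicates=100, seed=0):
    """Fraction of replicates that reproduce the original topology."""
    rng = random.Random(seed)
    reference = build_topology(alignment)
    hits = sum(
        build_topology(bootstrap_alignment(alignment, rng)) == reference
        for _ in range(replicates)
    )
    return hits / replicates


replicate = bootstrap_alignment(["ACGT", "AGGT", "TCGA"], random.Random(0))
print(replicate)
```

Each replicate keeps the number of sequences and columns of the original alignment, so the same tree-building method can be applied to it unchanged.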
Estimation of evolutionary rates in hemoglobin
alpha-chains.
                  P-distance   PC-distance   Gamma-distance
Human/cow         0.121        0.129         0.134
Human/kangaroo    0.186        0.205         0.216
Human/carp        0.486        0.665         0.789
Estimate the evolutionary rate of divergence between human
and cow (time of divergence between these groups is ~90
million years).
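One way to work this exercise, assuming the standard relation r = d / (2T): the factor 2 appears because substitutions accumulate along both lineages after divergence. The sketch uses the human/cow PC-distance of 0.129. The function name is hypothetical.

```python
def substitution_rate(distance, divergence_time_years):
    """Rate per site per year per lineage, r = d / (2T)."""
    return distance / (2.0 * divergence_time_years)


rate = substitution_rate(0.129, 90e6)
print(f"{rate:.2e} substitutions per site per year")
```

Using the p-distance or the gamma-distance instead changes the estimate only slightly for this closely related pair.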
1. Distance methods. Calculating branch
lengths from distances.
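[Figure: unrooted tree joining A, B, and C through branches of lengths a, b, and c.]
Distance matrix:
      A      B      C
A   -----    20     40
B   -----  -----    44
C   -----  -----  -----
$a + b = 20$; $a + c = 40$; $b + c = 44$;
$a = 8$; $b = 12$; $c = 32$.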
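The three equations a + b = 20, a + c = 40, b + c = 44 can be solved in closed form. A minimal sketch (the function name is hypothetical):

```python
def three_point_branches(d_ab, d_ac, d_bc):
    """Branch lengths (a, b, c) of a three-taxon tree from pairwise distances."""
    a = (d_ab + d_ac - d_bc) / 2
    b = (d_ab + d_bc - d_ac) / 2
    c = (d_ac + d_bc - d_ab) / 2
    return a, b, c


print(three_point_branches(20, 40, 44))
```

Each formula follows from adding two of the equations and subtracting the third, for example (a + b) + (a + c) - (b + c) = 2a.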
Neighbor-joining method.
NJ is based on minimum evolution principle (sum of branch
length should be minimized).
Given the distance matrix between all sequences, NJ joins
sequences in a tree so as to give an estimate of the
branch lengths.
1. Starts with the star tree, calculates the sum of branch
lengths.
[Figure: star tree with external nodes A, B, C, D, E on branches of lengths a, b, c, d, e.]
$d_{AB} = a + b$; $d_{AC} = a + c$; $d_{AD} = a + d$; $d_{AE} = a + e$;
$S = a + b + c + d + e = (d_{AB} + d_{AC} + d_{AD} + d_{AE} + d_{BC} + d_{BD} + d_{BE} + d_{CD} + d_{CE} + d_{DE}) / (N - 1)$
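A minimal sketch of the star-tree sum S, covering only this first step of NJ and not the joining procedure. In a star tree every branch appears in N - 1 pairwise distances, which is why the total is divided by N - 1. The names are hypothetical, and the distances are the three-taxon values 20, 40, and 44 used earlier.

```python
from itertools import combinations


def star_tree_length(taxa, dist):
    """Sum of branch lengths of the star tree: total pairwise distance / (N - 1).

    dist maps frozenset({x, y}) to the distance between taxa x and y.
    """
    total = sum(dist[frozenset(pair)] for pair in combinations(taxa, 2))
    return total / (len(taxa) - 1)


dist = {frozenset("AB"): 20, frozenset("AC"): 40, frozenset("BC"): 44}
print(star_tree_length("ABC", dist))
```

For three taxa the result equals a + b + c = 8 + 12 + 32 = 52.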
2.1 Maximum parsimony: definition of
informative sites.
Maximum parsimony tree – the tree that requires the smallest number of
evolutionary changes to explain the differences between external nodes.
Informative site – a site that favors some trees over the others.
Site    1 2 3 4 5 6 7
Seq 1   A A G A C T G
Seq 2   A G C C C T G
Seq 3   A G A T T T C
Seq 4   A G A G T T C
                *   *
Site is informative (for nucleotide sequences) if there are at least two
different kinds of letters at the site, each of which is represented in at
least two of the sequences.
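A minimal sketch of this rule. The four sequences are the example alignment as read from the table above (the row/column reading is an assumption), and `informative_sites` is a hypothetical name.

```python
from collections import Counter


def informative_sites(alignment):
    """Return 1-based indices of parsimony-informative sites.

    A site qualifies if at least two different letters each occur in
    at least two sequences.
    """
    sites = []
    for index, column in enumerate(zip(*alignment), start=1):
        counts = Counter(column)
        if sum(1 for count in counts.values() if count >= 2) >= 2:
            sites.append(index)
    return sites


alignment = ["AAGACTG", "AGCCCTG", "AGATTTC", "AGAGTTC"]
print(informative_sites(alignment))
```

Sites 2, 3, and 4 vary but are not informative: any tree explains them with the same number of changes.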