DNA properties.

download report

Transcript DNA properties.

DNA properties.
Sugar-phosphate backbones form
ridges on edges of helix.
Copyright © Ramaswamy H. Sarma 1996
Classwork I.
1. Go to http://ndbserver.rutgers.edu/.
2. Select Crystal structure of B-DNA,
resolution >=2 Angstroms.
3. Select Crystal structure of singlestranded RNA with mismatch base
pairing with resolution >= 2 Angstroms.
RNA secondary structure prediction
Assumptions used in predictions:
- The most likely structure is the most stable one.
- The energy associated with a given position
depends only on the local sequence/structure
- The structure is formed w/o knots.
Minimum energy method of RNA
secondary structure prediction.
• Self-complementary regions can be found in a
dot matrix
• The energy of each structure is estimated by the
nearest-neighbor rule
• The most energetically favorable conformations
are predicted by the method similar to dynamic
programming
Minimum energy method of RNA
secondary structure prediction.
Classwork II: Predict secondary
structure for RNA “ACGUGCGU”.
Stacking energies for base pairs
A/U
C/G
G/C
U/A
G/U
U/G
A/U
-0.9
-1.8
-2.3
-1.1
-1.1
-0.8
C/G
-1.7
-2.9
-3.4
-2.3
-2.1
-1.4
G/C
-2.1
-2.0
-2.9
-1.8
-1.9
-1.2
U/A
-0.9
-1.7
-2.1
-0.9
-1.0
-0.5
G/U
-0.5
-1.2
-1.4
-0.8
-0.4
-0.2
U/G
-1.0
-1.9
-2.1
-1.1
-1.5
-0.4
Destabilizing energies for loops
Number of
bases
1
5
10
20
30
Internal
-
5.3
6.6
7.0
7.4
Bulge
3.9
4.8
5.5
6.3
6.7
Hairpin
-
4.4
5.3
6.1
6.5
Prediction of most probable structure.
Probability of forming a base pair:
P  exp( G / kt)
For a double-stranded structure probability =
product of Boltzmann factors for each of stacking
base pairs.
Sequence covariation method.
Some positions from different species can covary because
they are involved in pairing
fm(B1) - frequences in column m;
fn(B2) – frequences in column n;
fm,n(B1,B2) – joint frequences of two nucleotides in two
columns.
f m,n ( B1 , B2 ) /( f m ( B1 )  f n ( B2 ))
Seq 1
Seq 2
Seq 3
Seq 4
---G------C-----C------G-----A------C-----A------T---
Gene Prediction.
Gene prediction.
Gene – DNA sequence encoding protein,
rRNA, tRNA (snRNA, snoRNA)…
Gene concept is complicated:
- Introns/exons
- Alternative splicing
- Genes-in-genes
- Multisubunit proteins
Gene identification
• Homology-based gene prediction
– Similarity Searches (e.g. BLAST, BLAT)
– Genome Browsers
– RNA evidence (ESTs)
• Ab initio gene prediction
– Prokaryotes
• ORF identification
– Eukaryotes
• Promoter prediction
• PolyA-signal prediction
• Splice site, start/stop-codon predictions
Prokaryotic genes – searching for
ORFs.
- Small genomes have high gene density
Haemophilus influenza – 85% genic
- No introns
- Operons
One transcript, many genes
- Open reading frames (ORF) –
contiguous set of codons, start with Met-codon,
ends with stop codon.
Ab initio gene prediction.
Predictions are based on the observation
that gene DNA sequence is not random.
- Each species has a characteristic pattern of
synonymous codon usage.
- Non-coding ORFs are very short.
GeneMark (HMMs), GenScan, Grail
II(neural networks) and GeneParser (DP)
Gene preference score – important
indicator of coding region.
Observation: occurrence of codon pairs in coding regions is
not random.
The probability of exon starting at base 1:
P  a1 / a  Cn1
a1 – the score for an exon starting at base 1;
a – the sum of all scores for base 1, base2 and base 3;
n – the score for noncoding region starting at base 1;
C – the ratio of coding to noncoding bases in the organism.
Gene prediction accuracy.
True positives (TP) – nucleotides, which are
correctly predicted to be within the gene.
Actual positives (AP) – nucleotides, which
are located within the actual gene.
Predicted positives (PP) – nucleotides,
which are predicted in the gene.
Sensitivity = TP / AP
Specificity = TP / PP
Principles of protein structure and
stability.
Hydrophobic effect.
Hydrophobic interaction – tendency of
nonpolar compounds to transfer from an
aqueous solution to an organic phase.
H
O
H
O
H
H
- The entropy of water molecules decreases when they
make a contact with a nonpolar surface, the energy
increases.
- As a result, upon folding nonpolar AA are burried
inside the protein, polar and charged AA – outside.
Summary:
- Hydrophobic effect is mostly responsible for
making a compact globule. Final specific tertiary
structure is formed by van der Waals
interactions, HB, disulfide bonds.
- Secret of stability of native structures is not in
the magnitude of the interactions but in their
cooperativity.
Protein secondary structure prediction.
Assumptions:
• There should be a correlation between amino acid
sequence and secondary structure. Short aa sequence is
more likely to form one type of SS than another.
• Local interactions determine SS. SS of a residues is
determined by their neighbors (usually a sequence window
of 13-17 residues is used).
Exceptions: short identical amino acid sequences can
sometimes be found in different SS.
Accuracy: 65% - 75%, the highest accuracy – prediction of an
α helix
Chou-Fasman method.
Analysis of frequences for all amino acids to be in different
types of SS.
Ala, Glu, Leu and Met – strong predictors of alpha-helices,
Pro and Gly predict to break the helix.
Score(ai , S )  log( f (ai , S ) / f ( S )
GOR method.
Assumption: formation of SS of an amino acid is determined
by the neighboring residues (usually a window of 17
residues is used).
GOR uses principles of information theory for predictions.
I ( S ; a)  log( f S ,a / f S )
Method maximizes the information difference between two
competing hypothesis: that residue “a” is in structure “S”,
and that “a” is not in conformation “S”.
Input
sequence
window
L
Input layer
0
0
0
W
0
E
0
Hj
A
0
0
coil
0
0
0
T
0
Y
0
0
0
0
0
1
β
0
0
Predicted SS
Si
Oi
0
G
P
α
0
0
S
Output layer
1
G
V
Hidden layer
0
A
P
Neural network method.
Wij Sj
Hj
Oi
PHD – neural network program with multiple
sequence alignments.
• Blast search of the input sequence is
performed, similar sequences are
collected.
• Multiple alignment of similar sequences is
used as an input to a neural network.
• Sequence pattern in multiple alignment is
enhanced compared to if one sequence
used as an input.
Classwork
• Go to http://ncbi.nlm.nih.gov, search for
protein “flavodoxin” in Entrez, retrieve its
amino acid sequence.
• Go to
http://cubic.bioc.columbia.edu/predictprotei
n and run PHD on the sequence.
The Conserved Domain Architecture
Retrieval Tool (CDART).
• Performs similarity searches of the NCBI Entrez Protein
Database based on domain architecture, defined as the
sequential order of conserved domains in proteins.
•
The algorithm finds protein similarities across significant
evolutionary distances using sensitive protein domain
profiles. Proteins similar to a query protein are grouped
and scored by architecture.
Fold recognition.
Unsolved problem: direct prediction of protein structure from
the physico-chemical principles.
Solved problem: to recognize, which of known folds are
similar to the fold of unknown protein.
Fold recognition is based on observations/assumptions:
- The overall number of different protein folds is limited
(1000-3000 folds)
- The native protein structure is in its ground state (minimum
energy)
Protein structure prediction.
Prediction of three-dimensional structure from its protein
sequence. Different approaches:
- Homology modeling (predicted structure has a very close
homolog in the structure database).
- Fold recognition (predicted structure has an existing fold).
- Ab initio prediction (predicted structure has a new fold).
Steps of homology modeling.
1.
2.
3.
4.
5.
6.
Template recognition & initial alignment.
Backbone generation.
Loop modeling.
Side-chain modeling.
Model optimization.
Model validation.
Classwork I: Homology modeling.
- Go to NCBI Entrez, search for gi461699
- Do Blast search against PDB
- Do CD-search.
Fold recognition.
Goal: to find in PDB a fold which best matches a
given sequence.
Since similarity between target and the closest to it
template is not high, sequence-sequence
alignment methods fail to find a closest match.
Solution: threading – sequence-structure alignment
method.
Threading – method for structure prediction.
Sequence-structure alignment, target sequence is
compared to all structural templates from the
database.
Requires:
- Alignment method (dynamic programing, Monte
Carlo,…)
- Scoring function, which yields relative score for
each alternative alignment
Scoring function for threading.
Contact-based scoring function
depends on the amino acid types of
two residues and distance between
them.
Sequence-sequence alignment
scoring function does not depend on
the distance between two residues.
If distance between two nonadjacent residues in the template is
less than 8 Å, these residues make a
contact.
Mechanisms of evolution.
- Evolution is caused by mutations of genes.
- Mutations spread through the population via
genetic drift and/or natural selection.
- If mutant gene produces an advantage (new
morphological character), this feature will be
inherited by all descendant species.
Gene duplication and recombination.
New genes/proteins can occur through the gene
duplication and recombination.
Ancestral globin
duplication
Gene 1
+
Gene 2
globin
globin
hemoglobin
myoglobin
New gene
Duplication
Recombination
Measures of evolutionary distance
between amino acid sequences.
Evolutionary distance is usually
measures by the number of amino
acid substitutions.
1. P-distance.p  n / n
d
nd – number of amino acid differences between
two sequences; n – number of aligned amino
acids.
Poisson correction for evolutionary distance.
Takes into account multiple substitutions
and therefore is proportional to
divergence time.
PC-distance – total # of substitutions per
d sequences
 ln( 1  p )
site for two
Gamma-distance.
Substitution rate varies from site to site
according to gamma-distribution.
a – gamma-parameter, describing the
shape of the distribution, =0.2-3.5.
When P<0.2, there is no need to use
gamma-distance.

d G  a (1  p )
1 / a

1
Another method to estimate evolutionary
distances: amino acid substitution matrices.
Substitutions occur more often between amino acids of
similar properties.
Dayhoff (1978) derived first matrices from multiple
alignments of close homologs.
The number of aa substitutions is measured in terms of
accepted point mutations (PAM) – one aa substitution
per 100 sites.
Dayhoff-distance can be approximated by gamma-distance
with a=2.25.
The concept of evolutionary trees.
-
Trees show relationships between organisms.
- Trees consist of nodes and branches, topology branching pattern.
- The length of each branch represents the number of
substitutions occurred between two nodes. If rate of
evolution is constant, branches will have the same length
(molecular clock hypothesis).
- Trees can be binary or bifurcating.
- Trees can be rooted and unrooted. The root is placed by
including a taxon which is known to branch off earlier
than others.
Accuracies of phylogenetic trees.
Two types of errors:
- Topological error
- Branch length error
Bootstrap test:
Resampling of alignment columns with replacement;
recalculating the tree; counting how many times
this topology occurred – “bootstrap confidence
value”. If it is >0.95 – reliable topology/interior
branch.
Calculating branch lengths from distances.
A
B
C
A
-------------
B
20
---------
C
30
44
-----
a
c
b
a  b  20;
a  c  40;
b  c  44;
a  8; b  12; c  32.
1. Distance methods: Neighbor-joining
method.
NJ is based on minimum evolution principle (sum of branch length
should be minimized).
Given the distance matrix between all sequences, NJ joins sequences
in a tree so that to give the estimate of branch lengths.
1. Starts with the star tree, calculates the sum of branch lengths.
C
d AB  a  b;
B
b
d AC  a  c;
c
a
d
D
d AD  a  d ;
d AE  a  e;
S  abcd e 
e
A
(d AB  d AC  d AD  d AE  d BC  d BD  d BE  d CD  d CE  d DE ) /( N  1)
E
Neighbor-joining method.
2. Combine two sequences in a pair, modify the tree.
Recalculate the sum of branch lengths, S for each
possible pair, choose the lowest S.
C
B
c
b
d AX  (d AC  d AD  d AE ) / 3;
d
a
A
D
e
d BX  (d BC  d BD  d BE ) / 3;
a  b  d AB ; a  x  d AX ; b  x  d BX .
E
3. Treat cluster CDE as one sequence “X”, calculate average distances
between “A” and “X”, “B” and “X”, calculate “a” and “b”.
4. Treat AB as a single sequence, recalculate the distance matrix.
5. Repeat the cycle and calculate the next pair of branch lengths.
2.1 Maximum parsimony: definition of
informative sites.
Maximum parsimony tree – tree, that requires the smallest
number of evolutionary changes to explain the differences
between external nodes.
Site, which favors some trees over the others.
1
2
3
4
A
A
A
A
A
G
G
G
G
C
A
A
A
C
T
G
5
6
7
C T G
C T G
T T C
T T C
*
*
Site is informative if there are at least two different kinds of
letters at the site, each of which is represented in at least
two of the sequences.
2.2 Maximum parsimony.
Site 3
1.G
3.A 1.G
G
A
2.C
A
A
4.A 3.A
Tree 1.
2.C
2.C 1.G
A
4.A
A
4.A
Tree 2.
3.A
Tree 3.
Site 3 is not informative, all trees are realized by the same number of
substitutions.
Advantage: deals with characters, don’t need to compute distance matrices.
Disadvantage:
- multiple substitutions are not considered
- branch lengths are difficult to calculate
- slow
Maximum likelihood methods.
• Similarity with maximum parsimony:
- for each column of the alignment all possible trees are
calculated
- trees with the least number of substitutions are more likely
• Advantage of maximum likelihood over maximum parsimony:
- takes into account different rates of substitution between
different amino acids and/or different sites
- applicable to more diverse sequences
Molecular clock.
• First observation: rates of amino acid substitutions in
hemoglobin and cytochrome c are ~ the same among
different mammalian lineages.
• Molecular clock hypothesis: rate of evolution is ~
constant over time in different lineages; proteins evolve
at constant rates.
• This hypothesis is used in estimating divergence times
and reconstruction of phylogenetic trees.
Estimation of species divergence time.
Assumption: rate constancy, molecular clock.
Find T1 if T2 is known.
T1
T2
A
B
K AC
K AB

;
2T1
2T2
T1 
K AC T2
K AB
C
Neutral theory of evolution.
• Kimura in 1968: majority of molecular changes
in evolution are due to the random fixation of
neutral mutations (do not effect the fitness of
organism.
• As a consequence the random genetic drift
occurs.
• Value of selective advantage of mutation should
be stronger than effect of random drift.
Genome analysis.
Genome – the sum of genes and intergenic
sequences of a haploid cell.
Analysis of gene order (synteny).
Genes with a related function are frequently
clustered on the chromosome.
Ex: E.coli genes responsible for synthesis of
Trp are clustered and order is conserved
between different bacterial species.
Operon: set of genes transcribed
simultaneously with the same direction of
transcription
Role of “junk” DNA in a cell.
1. There is almost no correlation between the number of genes and
organism’s complexity.
2. There is a correlation between the amount of nonprotein-coding
DNA and complexity.
Species
H.sapiens
Size
Genes Genes/Mb
3,200Mb
35,000
11
D.melanogaster
137Mb
13.338
97
C.elegans
85.5Mb
18,266
214
A.thaliana
115Mb
25,800
224
15Mb
6,144
410
4,300
934
S.cerevisiae
E.coli
Regulatory role of non-coding regions.
- “Micro-RNAs” control timing of processes in
development and apoptosis.
- Intron’s RNAs inform about the transcription of a
particular gene.
- Alternative splicing can be regulated by noncoding regions.
- Non-coding regions can be very well conserved
between the species and many genetic
deseases have been linked to
Systems biology.
• Integrative approach to study the relationships
and interactions between various parts of a
complex system.
Goal: to develop a model of interacting
components for the whole system.
Characteristics of networks: vertex degree
distribution.
K=2
K=2
K=3
K=1
P(k,N) – degree distribution, k - degree of the vertex, N - number of
vertices.
If vertices are statistically independent and connections are random, the
degree distribution completely determines the statistical properties of a
network.
Characteristics of networks: clustering
coefficient.
The clustering coefficient characterizes the density of
connections in the environment close to a given vertex.
d – total number of edges connecting
nearest neighbors; n – number of nearest
verteces for a given vertex
2d
C
n(n  1)
C = 2/6
Different network models: Barabasi-Alberts.
Model of preferential attachment.
• At each step, a new vertex is added to the graph
• The new vertex is attached to one of old vertices with probability
proportional to the degree of that old vertex.
ln(P(k))
Degree distribution – power law distribution.
p(k )  k

ln(k)
Power Law distribution
p(k ) ~ k
p(k )  (k )


Multiplying k by a constant, does not
change the shape of the distribution –
scale free distribution.


p(k )
From T. Przytycka
Difference between scale-free and random
networks.
Random networks are
homogeneous, most nodes
have the same number of
links.
Scale-free networks have a few
highly connected verteces.