Genome analysis.
Genome – the sum of genes and intergenic
sequences of a haploid cell.
The value of genome sequences lies in
their annotation
• Annotation – Characterizing genomic features
using computational and experimental methods
• Genes: Four levels of annotation
– Gene Prediction – Where are genes?
– What do they look like?
– What do they encode?
– What proteins/pathways involved in?
Koonin & Galperin
Accuracy of genome annotation.
• In most genomes, functional predictions have been made
for the majority of genes (54-79%).
• Sources of annotation errors:
- overprediction (database hits that are statistically
significant are accepted without further checking)
- multidomain proteins (similarity is found to only one
domain, but the annotation is extended to the
whole protein).
The error rate of genome annotation can be as high as 25%.
Sample genomes
Species          Size       Genes    Genes/Mb
H.sapiens        3,200Mb    35,000   11
D.melanogaster   137Mb      13,338   97
C.elegans        85.5Mb     18,266   214
A.thaliana       115Mb      25,800   224
S.cerevisiae     15Mb       6,144    410
E.coli           4.6Mb      4,300    934
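The Genes/Mb column is simply the gene count divided by the genome size. A minimal sketch that recomputes it from the Size and Genes columns of the table (the dictionary layout and function name are illustrative choices, not part of the slides):

```python
# Genome size (Mb) and gene count, copied from the "Sample genomes" table.
genomes = {
    "H.sapiens": (3200.0, 35000),
    "D.melanogaster": (137.0, 13338),
    "C.elegans": (85.5, 18266),
    "A.thaliana": (115.0, 25800),
    "S.cerevisiae": (15.0, 6144),
    "E.coli": (4.6, 4300),
}


def gene_density(size_mb, n_genes):
    """Return the number of genes per megabase of genome."""
    return n_genes / size_mb


densities = {species: gene_density(*row) for species, row in genomes.items()}

for species, density in densities.items():
    print(f"{species:15s} {density:7.1f} genes/Mb")
```

The recomputed values agree with the table up to rounding, and they make the trend explicit: gene density drops by almost two orders of magnitude from E.coli to H.sapiens.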
So much DNA – so “few” genes …
[Figure: genic vs. intergenic DNA]
Human Genome project.
Comparative genomics - comparison of gene
number, gene content and gene location in
genomes.
Campbell & Heyer “Genomics”
Analysis of gene order (synteny).
Genes with a related function are frequently
clustered on the chromosome.
Ex: the E.coli genes responsible for Trp synthesis
are clustered, and their order is conserved between
different bacterial species.
Operon: a set of genes transcribed together,
in the same direction of transcription
Analysis of gene order (synteny).
Koonin & Galperin “Sequence, Evolution, Function”
Analysis of gene order (synteny).
• Gene order is poorly conserved when the
%identity between prokaryotic genomes is less
than 50%.
• A gene neighborhood can be conserved so
that all neighboring genes belong to the same
functional class.
• Functional prediction can therefore be based on
gene neighborhood.
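A minimal sketch of how conserved gene neighborhoods could be detected: collect the pairs of adjacent genes in each genome and keep the pairs found in both. The two gene orders below are hypothetical toy data (only the trp gene names are real), and matching genes by name stands in for a real ortholog assignment.

```python
def adjacent_pairs(gene_order):
    """Return the set of unordered pairs of neighboring genes."""
    return {frozenset(pair) for pair in zip(gene_order, gene_order[1:])}


def conserved_neighbors(order_a, order_b):
    """Return the neighboring gene pairs present in both genomes."""
    return adjacent_pairs(order_a) & adjacent_pairs(order_b)


# Hypothetical gene orders: the trp genes stay together, the flanks differ.
genome_a = ["thrA", "trpE", "trpD", "trpC", "trpB", "trpA", "lacZ"]
genome_b = ["lacZ", "gyrB", "trpE", "trpD", "trpC", "trpB", "trpA"]

conserved = conserved_neighbors(genome_a, genome_b)
for pair in sorted(sorted(p) for p in conserved):
    print(" - ".join(pair))
```

An uncharacterized gene that repeatedly appears inside such a conserved block is a candidate for sharing the functional class of its neighbors.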
Role of “junk” DNA in a cell.
1. There is almost no correlation between the number of genes and
organism’s complexity.
2. There is a correlation between the amount of non-protein-coding
DNA and complexity.
Species          Size       Genes    Genes/Mb
H.sapiens        3,200Mb    35,000   11
D.melanogaster   137Mb      13,338   97
C.elegans        85.5Mb     18,266   214
A.thaliana       115Mb      25,800   224
S.cerevisiae     15Mb       6,144    410
E.coli           4.6Mb      4,300    934
New interpretation of introns.
1. Modern introns invaded eukaryotes late in
evolution; they are derived from self-splicing
mobile genetic elements similar to group II
introns.
2. The nucleus, which separates transcription from
translation, appears only in eukaryotes. In
prokaryotes translation starts while transcription is
still in progress, so there would not be time for
introns to splice themselves out.
3. Hypothesis: introns have an important regulatory role.
Regulatory role of non-coding regions.
- “Micro-RNAs” control the timing of processes in
development and apoptosis.
- Intronic RNAs signal that a particular gene is
being transcribed.
- Alternative splicing can be regulated by non-coding
regions.
- Non-coding regions can be very well conserved between
species, and many genetic diseases have been
linked to variations/mutations in non-coding regions.
COGs – Clusters of Orthologous Groups.
Orthologs – genes in different
species that evolved from a
common ancestral gene by
speciation.
Paralogs – genes
related by duplication within a
genome.
Classwork I: Comparing microbial
genomes.
• Go to
http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi
• Select Thermus thermophilus genome
• View TaxTable
• What gene clusters do you see which are
common with Archaea?
Systems biology.
• Integrative approach to study the relationships and
interactions between various parts of a complex system.
Goal: to develop a model of interacting components for
the whole system.
Basic notions of networks.
Network (graph) – a set of vertices connected via edges.
The degree of a vertex – the total number of connections of
a vertex.
Random networks – networks with a disordered
arrangement of edges.
Properties of networks.
• Vertex degree distribution/connectivity.
• Clustering coefficient.
• Network diameter.
Characteristics of networks: vertex degree
distribution.
[Figure: example graph with vertex degrees k = 2, 2, 3, 1]
P(k,N) – degree distribution, k - degree of the vertex, N - number of
vertices.
If vertices are statistically independent and connections are random, the
degree distribution completely determines the statistical properties of a
network.
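A minimal sketch of computing vertex degrees and the degree distribution P(k, N) from an edge list. The four-vertex graph is a toy example chosen so that its degrees are 2, 2, 3 and 1; it is not necessarily the graph drawn on the slide.

```python
from collections import Counter


def degrees(edges):
    """Count the connections of every vertex that appears in an edge."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def degree_distribution(edges, n_vertices):
    """Return {k: fraction of the N vertices that have degree k}."""
    deg = degrees(edges)
    counts = Counter(deg.values())
    # Vertices that appear in no edge have degree 0.
    counts[0] += n_vertices - len(deg)
    return {k: c / n_vertices for k, c in sorted(counts.items()) if c > 0}


# Toy graph (assumption): A and B have degree 2, C has degree 3, D has degree 1.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
distribution = degree_distribution(edges, n_vertices=4)
print(distribution)
```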
Characteristics of networks: clustering
coefficient.
The clustering coefficient characterizes the density of connections in the
environment close to a given vertex.
d – total number of edges connecting the
nearest neighbors; n – number of nearest-neighbor
vertices of a given vertex
$C = \frac{2d}{n(n-1)}$
C = 2/6
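A minimal sketch of the clustering coefficient for a single vertex, using the definitions of d and n given above. The adjacency dictionary is a toy graph (an assumption, not the slide's figure) chosen so that vertex "v" has n = 3 neighbors with d = 1 edge among them, which gives C = 2/6.

```python
from itertools import combinations


def clustering_coefficient(adj, vertex):
    """Return C = 2d / (n(n-1)) for one vertex of an undirected graph.

    adj maps each vertex to the set of its nearest neighbors.
    """
    neighbors = adj[vertex]
    n = len(neighbors)
    if n < 2:
        return 0.0  # convention chosen here: C is 0 with fewer than 2 neighbors
    # d = number of edges that connect two neighbors of the vertex.
    d = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return 2 * d / (n * (n - 1))


# Toy graph: v is linked to a, b, c; only a and b are linked to each other.
adj = {
    "v": {"a", "b", "c"},
    "a": {"v", "b"},
    "b": {"v", "a"},
    "c": {"v"},
}
c_v = clustering_coefficient(adj, "v")
print(c_v)
```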
Characteristics of networks: diameter, small world.
Diameter of a network – shortest path along the existing
links averaged over all pairs of vertices. Distance
between two vertices = the smallest number of steps
one can take to reach one vertex from another.
Small-world character of the networks: any two vertices
can be connected by relatively short paths.
For random networks the diameter increases
logarithmically with the addition of new vertices.
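A minimal sketch of the diameter as defined above, that is, the shortest-path length averaged over all pairs of vertices, computed by breadth-first search. It assumes an undirected graph and averages only over pairs that are actually connected; the four-vertex chain is a toy example.

```python
from collections import deque


def bfs_distances(adj, source):
    """Return the smallest number of steps from source to each reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def average_path_length(adj):
    """Average shortest-path length over all connected pairs of vertices."""
    total, pairs = 0, 0
    for source in adj:
        for target, d in bfs_distances(adj, source).items():
            if target != source:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0


# Toy chain A - B - C - D.
chain = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
avg = average_path_length(chain)
print(avg)
```

For the chain the six pairwise distances are 1, 2, 3, 1, 2, 1, so the average is 10/6.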
Different network models:
Erdős-Rényi model.
• Start with a fixed set of vertices.
• Iterate the following process:
Choose two vertices at random and connect them by an edge.
• Stop at a certain number of edges.
Degree distribution – Poisson
distribution, λ – average degree
$P(k) = \frac{e^{-\lambda} \lambda^{k}}{k!}$
[Plot: ln(P(k)) versus ln(k)]
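A minimal simulation sketch of the procedure above, compared with the Poisson formula. Two details are assumptions of this sketch rather than statements from the slide: a pair that is already connected is skipped instead of being linked twice, and the graph size (2000 vertices, 4000 edges) is an arbitrary choice.

```python
import math
import random
from collections import Counter


def erdos_renyi(n_vertices, n_edges, seed=0):
    """Connect randomly chosen pairs of vertices until n_edges edges exist."""
    if n_edges > n_vertices * (n_vertices - 1) // 2:
        raise ValueError("more edges requested than distinct vertex pairs")
    rng = random.Random(seed)
    edges = set()
    while len(edges) < n_edges:
        u, v = rng.sample(range(n_vertices), 2)
        edges.add((min(u, v), max(u, v)))  # assumption: no repeated edges
    return edges


def poisson(k, lam):
    """Poisson probability P(k) = exp(-lam) * lam**k / k!."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


n, m = 2000, 4000
edges = erdos_renyi(n, m)
degree = Counter()
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

lam = 2 * m / n  # average degree
observed = Counter(degree[v] for v in range(n))
for k in range(9):
    print(k, observed[k] / n, round(poisson(k, lam), 4))
```

The observed fractions should track the Poisson values closely, with a peak near the average degree and a rapidly decaying tail.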
Different network models: model 2.
• At each step, a new vertex is added to the graph.
• Simultaneously, a pair of randomly chosen vertices is connected by an
edge.
This is a non-equilibrium model – the total number of vertices is not fixed.
Degree distribution – exponential distribution.
$p(k) \propto e^{-k/\kappa}$, with κ a constant
[Plot: ln(P(k)) versus ln(k)]
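A minimal simulation sketch of this growing model. The starting configuration (two isolated vertices), the choice to let the newly added vertex take part in the random pair, and the tolerance for repeated edges are all assumptions of the sketch; the slide does not specify them.

```python
import random
from collections import Counter


def growing_random_network(n_steps, seed=0):
    """Return the list of vertex degrees after n_steps growth steps."""
    rng = random.Random(seed)
    degree = [0, 0]  # assumption: start from two isolated vertices
    for _ in range(n_steps):
        degree.append(0)  # a new vertex is added
        # Simultaneously, two distinct random vertices are connected.
        u, v = rng.sample(range(len(degree)), 2)
        degree[u] += 1
        degree[v] += 1
    return degree


steps = 5000
degree = growing_random_network(steps)
counts = Counter(degree)
for k in range(8):
    print(k, counts[k] / len(degree))
```

The printed fractions fall by a roughly constant factor from one k to the next, which is the signature of an exponential degree distribution.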
Different network models: Barabási-Albert.
Model of preferential attachment.
• At each step, a new vertex is added to the graph.
• The new vertex is attached to one of the old vertices with probability proportional
to the degree of that old vertex.
Degree distribution – power law distribution.
$p(k) \propto k^{-\gamma}$
[Plot: ln(P(k)) versus ln(k)]
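A minimal simulation sketch of preferential attachment with one new edge per step. The seed graph (a single edge between vertices 0 and 1) is an assumption. The list `edge_ends` holds every vertex once per edge it belongs to, so a uniform draw from it selects an old vertex with probability proportional to its degree.

```python
import random
from collections import Counter


def preferential_attachment(n_vertices, seed=0):
    """Return the list of vertex degrees of a grown network."""
    rng = random.Random(seed)
    degree = [1, 1]      # assumption: start from one edge between 0 and 1
    edge_ends = [0, 1]   # each vertex listed once per incident edge
    for new in range(2, n_vertices):
        old = rng.choice(edge_ends)  # degree-proportional choice
        degree.append(1)
        degree[old] += 1
        edge_ends.extend([new, old])
    return degree


n = 5000
degree = preferential_attachment(n)
counts = Counter(degree)
print("largest degree:", max(degree))
for k in (1, 2, 4, 8, 16):
    print(k, counts[k] / n)
```

Most vertices keep degree 1, while a few early vertices collect many links, which is the hub structure expected for a power-law degree distribution.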
Power Law distribution
$p(k) \sim k^{-\gamma}$
$p(\alpha k) \sim (\alpha k)^{-\gamma} = \alpha^{-\gamma} k^{-\gamma} \propto p(k)$
Multiplying k by a constant α does not
change the shape of the distribution –
scale free distribution.
From T. Przytycka
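A small numerical check of the scale-free property: for a power law, the ratio p(2k)/p(k) is the same at every k, whereas for an exponential distribution it depends on k. The exponent 2.2 and the exponential scale 2.0 are illustrative values only, and both functions are left unnormalized because only ratios matter here.

```python
import math


def power_law(k, gamma=2.2):
    """Unnormalized power law k**(-gamma)."""
    return k ** (-gamma)


def exponential(k, scale=2.0):
    """Unnormalized exponential exp(-k / scale)."""
    return math.exp(-k / scale)


def rescaling_ratios(p, ks, factor=2):
    """Return p(factor * k) / p(k) for each k in ks."""
    return [p(factor * k) / p(k) for k in ks]


ks = [1, 2, 4, 8, 16]
power_ratios = rescaling_ratios(power_law, ks)
exp_ratios = rescaling_ratios(exponential, ks)
print("power law:  ", [round(r, 4) for r in power_ratios])
print("exponential:", [round(r, 4) for r in exp_ratios])
```

The power-law ratios are all equal to 2 raised to the power -2.2, so rescaling k only multiplies the distribution by a constant; the exponential ratios shrink as k grows, so its shape changes under rescaling.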
Difference between scale-free and random
networks.
Random networks are
homogeneous: most nodes
have about the same number of
links.
Scale-free networks have a few
highly connected vertices.
Example 1: the large-scale organization of
metabolic networks.
Glycolysis metabolic network
[Figure: substrates linked by enzyme-catalyzed reactions, enzymes labeled with EC numbers. Slide credit: Hagai Ginsburg]
D-Glucose → D-Glucose-6P: hexokinase (2.7.1.1), ATP → ADP
D-Glucose-6P → D-Fructose-6P: phosphoglucose isomerase (5.3.1.9)
D-Fructose-6P → D-Fructose-1,6P2: phosphofructokinase (2.7.1.11), ATP → ADP
D-Fructose-1,6P2 → Glycerone-P + Glyceraldehyde-3P: aldolase (4.1.2.13)
Glycerone-P ↔ Glyceraldehyde-3P: triose phosphate isomerase (5.3.1.1)
Glyceraldehyde-3P → Glycerate-1,3P2: glyceraldehyde 3-P dehydrogenase (1.2.1.12), NAD+ + Pi → NADH + H+
Glycerate-1,3P2 → Glycerate-3P: phosphoglycerate kinase (2.7.2.3), ADP → ATP
Glycerate-3P → Glycerate-2P: phosphoglycerate mutase (5.4.2.1)
Glycerate-2P → Phosphoenol-pyruvate: enolase (4.2.1.11), releases H2O
Phosphoenol-pyruvate → Pyruvate: pyruvate kinase (2.7.1.40), ADP → ATP
Pyruvate → Lactate: lactate dehydrogenase (1.1.1.27), NADH + H+ → NAD+
Connected pathways shown: pentose phosphate cycle, glycerolipid metabolism, apicoplast FA synthesis, pyruvate metabolism.
Example 1: the large-scale organization of
metabolic networks.
Jeong et al, Nature, 2000:
- Compared metabolic networks of 43 organisms.
- Vertices – substrates, connected with each other through
links (metabolic reactions).
Results:
- Scale-free nature of metabolic
networks for all organisms, γ = 2.2
- Diameters of metabolic networks
for all organisms are the same.
Biological interpretations of power-law
connectivity.
• A few vertices dominate the overall connectivity
of the network.
• Self-similarity of networks.
• Small diameter: the network can respond quickly to a
mutation that destroys an enzyme by activating
alternative paths.
Protein-protein interaction networks.
Sneppen & Maslov:
• Vertices – proteins, edges connect those
proteins which interact in a cell
• Network: 3278 interactions, 1289 proteins
• Scale free network, γ = 2.5 ± 0.3