Chapter 5 - SAGE Research Methods

5 Genomics, Proteomics, and Systems Biology
• Genomes and Transcriptomes
• Proteomics
• Systems Biology
Introduction
Genome sequencing projects
introduced large-scale experimental
approaches that generate vast
amounts of data to the study of
biological systems.
Complete genome sequences can be
determined, and all the RNAs and
proteins expressed in a cell can be
analyzed on a large scale.
Introduction
These global experimental approaches
form the basis of the new field of
systems biology, which seeks a
quantitative understanding of the
integrated behavior of complex
biological systems.
Genomes and Transcriptomes
The Human Genome Project: the effort
to sequence the entire human
genome (3 billion base pairs),
published in 2004.
The genome sequences of many other
species have also been determined,
and advances in sequencing
technology now allow rapid
sequencing of individual genomes.
Genomes and Transcriptomes
The first complete genome was
reported in 1995, of the bacterium
Haemophilus influenzae.
It contains 1.8 × 10^6 base pairs.
Protein-coding regions were identified
by computer analysis to detect open
reading frames: long stretches that
do not contain any stop codons.
Figure 5.1 The genome of Haemophilus influenzae
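The open reading frame search described above can be sketched in a few lines of Python. This is a minimal illustration, not the method used for the H. influenzae genome: it scans only the three forward reading frames, assumes the standard stop codons (TAA, TAG, TGA), and requires an ATG start codon. The function name find_orfs and the min_codons cutoff are hypothetical choices made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code stop codons


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) tuples for ATG...stop stretches.

    Only the three forward frames are scanned; a real gene finder would
    also scan the reverse complement and use a much longer length cutoff.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate reading frame
            elif start is not None and codon in STOP_CODONS:
                # count the codons before the stop codon
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs
```

In practice the length cutoff would be set to hundreds of base pairs, because short stretches without stop codons occur frequently by chance.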
Genomes and Transcriptomes
In bacteria, most of the DNA encodes
proteins.
The E. coli genome is more than twice
the size of the H. influenzae genome,
at 4.6 × 10^6 base pairs
(about 4,000 genes). Nearly 90% of
the DNA is protein-coding.
More than 2,000 bacterial genomes
have now been sequenced.
Genomes and Transcriptomes
The yeast Saccharomyces cerevisiae
has the simplest eukaryotic genome,
making it a useful model for
eukaryotic cells.
Yeasts have about 6,000 genes; about
70% of the genome codes for
proteins.
Genomes and Transcriptomes
Multicellular organisms (C. elegans,
Drosophila, and Arabidopsis) were
sequenced next.
These genomes are about 10 times
larger than that of yeast, but have
fewer genes than expected for more
complex organisms.
Much less of the DNA is protein-coding
than in bacteria and yeasts.
Table 5.1 Representative Genomes
Genomes and Transcriptomes
Drosophila has fewer genes than C.
elegans.
Sequencing revealed that
biological complexity is not simply
a function of the number of genes.
Genomes and Transcriptomes
The genome of Arabidopsis thaliana
was sequenced in 2000 and found to
have about 26,000 genes.
Even more genes occur in other plant
genomes (e.g., 57,000 in apples).
Genomes and Transcriptomes
The human genome has about 3 × 10^9
base pairs.
Draft sequences were published in
2001 by two different groups using
different approaches.
The complete sequence was published
in 2004.
Genomes and Transcriptomes
The International Human Genome
Sequencing Consortium sequenced
DNA fragments derived from BAC
(bacterial artificial chromosome)
clones that had been previously
mapped to human chromosomes.
Key Experiment, Ch. 5, p. 161 (3)
Genomes and Transcriptomes
A team led by Craig Venter of Celera
Genomics used a shotgun approach:
Small DNA fragments were cloned and
sequenced; overlaps between
sequences were then used to
assemble the sequence of the
genome.
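The idea of assembling a sequence from overlapping fragments can be illustrated with a toy greedy assembler. This is only a sketch under strong assumptions (error-free reads, a single strand, no repeats); it is not the assembly method used by Celera Genomics. The names overlap and greedy_assemble and the min_len parameter are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that matches a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no overlaps left; remaining contigs stay separate
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)]
        reads.append(merged)
    return reads
```

For example, the three reads ATGCGT, CGTACC, and ACCTTG share 3-base overlaps and merge into the single contig ATGCGTACCTTG.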
Genomes and Transcriptomes
A major surprise from the human
genome sequence was that there are
only about 21,000 protein-coding
genes, whose coding sequences make up
about 1% of the total genome.
Genomes and Transcriptomes
40% of human proteins are related to
proteins in simpler eukaryotes; most
function in basic cellular processes.
Most proteins that are unique to
humans are made up of domains that
are also found in other organisms,
but are arranged in novel
combinations.
Genomes and Transcriptomes
The genomes of many other
vertebrates have now been
sequenced.
This allows comparisons to the human
genome, and helps identify functional
sequences.
Comparison of human, mouse,
chicken, and zebrafish genomes
shows that about half of
protein-coding genes are common to all
vertebrates.
Figure 5.2 Evolution of sequenced vertebrates
Figure 5.3 Comparison of vertebrate genomes
Genomes and Transcriptomes
Mice, rats, and humans have 90% of
their genes in common.
Mouse and rat genome sequences
provide essential databases for
research in mammalian genetics and
human physiology and medicine.
Genomes and Transcriptomes
The dog genome sequence has
become important in understanding
the genetic basis of morphology,
behavior, and a variety of diseases.
Characteristics of the many dog
breeds are highly specific, which
facilitates identification of the
responsible genes.
Genomes and Transcriptomes
Many diseases, including cancer, are
common in some breeds, and
understanding the genetic basis will
benefit both veterinary and human
medicine.
Genomes and Transcriptomes
Genome sequences of other primates
may help pinpoint unique features
that distinguish humans.
Human and chimpanzee genomes are
nearly 99% identical.
But sequence differences often alter
the coding sequences, leading to
different amino acid sequences of
most of the proteins in the two
species.
Genomes and Transcriptomes
Neandertals and modern humans
diverged 300,000 to 400,000 years
ago, and their genomes are about
99.9% identical.
The differences alter coding
sequences of only 90 genes that are
conserved in modern humans.
Genomes and Transcriptomes
The human genome project used the
dideoxynucleotide technique first
described by Fred Sanger in 1977.
But even with automation, this
approach is slow and expensive.
Next-generation sequencing: new
techniques that increased speed and
lowered costs.
Figure 5.4 Progress in DNA sequencing
Genomes and Transcriptomes
Next-generation, or massively parallel,
sequencing refers to methods in which
millions of templates are sequenced
simultaneously.
Figure 5.5 Next-generation sequencing
Genomes and Transcriptomes
The first individual human genomes to
be sequenced were those of Craig
Venter and James Watson (2007 and
2008).
Since then, thousands of individual
genomes have been sequenced.
Personal sequences will allow
therapies to be specifically tailored to
the needs of individual patients.
Genomes and Transcriptomes
In the future, genome sequencing may
be important in disease prevention by
identifying genes that confer
susceptibility to particular diseases.
Genomes and Transcriptomes
Transcriptome: all the RNAs that are
transcribed in a cell.
Complete genome sequences allow
study of gene expression for the
whole genome, instead of one gene
at a time.
One method used is hybridization to
DNA microarrays.
Genomes and Transcriptomes
Oligonucleotides are printed by a
robotic system onto glass or silicon
chips.
Each spot on the array consists of a
single oligonucleotide.
DNA microarrays can be used to
compare gene expression between
two cell types.
Figure 5.6 DNA microarrays
Genomes and Transcriptomes
cDNAs are synthesized from mRNAs
by reverse transcription, labeled with
fluorescent dyes and hybridized to
DNA microarrays.
The relative level of expression of
each gene is indicated by intensity of
fluorescence at each position on the
microarray.
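When two cell types are compared, the fluorescence intensities are often summarized as a log2 ratio per gene. The sketch below assumes that background-corrected intensities are already available as dictionaries keyed by gene name; the function name log2_ratios and the pseudocount (added to avoid division by zero) are assumptions made for this example.

```python
import math


def log2_ratios(sample, reference, pseudocount=1.0):
    """Log2 ratio of sample to reference intensity for each shared gene.

    Positive values mean higher expression in the sample; negative
    values mean higher expression in the reference.
    """
    return {
        gene: math.log2(
            (sample[gene] + pseudocount) / (reference[gene] + pseudocount)
        )
        for gene in sample
        if gene in reference
    }
```

With this convention, a value of 3 corresponds to an 8-fold increase and a value of -2 to a 4-fold decrease relative to the reference.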
Genomes and Transcriptomes
RNA-seq reveals the sequences of all
mRNAs in a cell.
Cellular mRNAs are reverse
transcribed to cDNAs, which are
analyzed by next-generation
sequencing.
The frequency with which each mRNA
sequence is found also indicates its
abundance in the cell.
Figure 5.7 RNA-seq
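The link between read frequency and abundance can be sketched as a simple counting step. The example assumes that each sequenced read has already been assigned to a gene (the alignment step is not shown) and it ignores transcript length, which real pipelines usually correct for. The function name counts_per_million is a hypothetical choice.

```python
from collections import Counter


def counts_per_million(assigned_reads):
    """Scale per-gene read counts to counts per million reads.

    assigned_reads is an iterable of gene names, one entry per read.
    """
    counts = Counter(assigned_reads)
    total = sum(counts.values())
    return {gene: 1e6 * n / total for gene, n in counts.items()}
```

Scaling to a fixed total makes samples with different sequencing depths comparable.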
Proteomics
To understand cell function, it is
necessary to know what proteins are
expressed and how they function
within the cell.
The large-scale analysis of cell
proteins is called proteomics.
The goal is to identify and quantify all
proteins expressed in a given cell
(the proteome).
Proteomics
The number of proteins expressed in a
cell is greater than the number of
genes.
Many genes can be expressed to yield
several distinct mRNAs, which
encode different polypeptides as a
result of alternative splicing.
Proteins can also be modified in
various ways.
Proteomics
The first technology to separate
proteins was two-dimensional gel
electrophoresis.
Proteins are separated based on
charge and then size.
This technique is biased toward the
most abundant proteins.
Figure 5.8 Two-dimensional gel electrophoresis
Proteomics
The main tool currently used is mass
spectrometry.
A protease cleaves the protein into
small peptides. These are ionized
and analyzed in a mass
spectrometer, which determines the
mass-to-charge ratio of each peptide.
The mass spectrum is compared to a
database of known spectra.
Figure 5.9 Identification of proteins by mass spectrometry
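The digestion and mass-to-charge steps can be illustrated with a small calculation. This sketch makes several assumptions that are not stated above: the protease is taken to be trypsin (cleaving after K or R unless the next residue is P), peptides are unmodified, and monoisotopic residue masses are rounded to five decimals. The names digest and mass_to_charge are hypothetical. Instead of searching a database of known spectra, it only computes the theoretical values that such a search would be matched against.

```python
# Monoisotopic residue masses in daltons (rounded).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # added once per peptide for the free termini
PROTON = 1.00728   # mass of each charge-carrying proton


def digest(protein):
    """Cleave after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1:i + 2]  # empty string at the end
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def mass_to_charge(peptide, charge=1):
    """Theoretical m/z of a peptide carrying `charge` extra protons."""
    mass = sum(RESIDUE_MASS[aa] for aa in peptide) + WATER
    return (mass + charge * PROTON) / charge
```

Each peptide from digest can be passed to mass_to_charge to obtain a list of theoretical peaks for one protein.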
Proteomics
A “shotgun” approach eliminates the
gel electrophoresis step.
Cell proteins are digested with
protease and the whole mixture
sequenced by tandem mass
spectrometry.
Figure 5.10 Tandem mass spectrometry
Proteomics
Determining the locations of proteins in
cells and organelles is also important.
Organelles are isolated by subcellular
fractionation and the proteins are
analyzed by mass spectrometry.
The proteomes of a variety of
organelles and structures have been
characterized.
Table 5.2 Protein composition of cellular structures
Proteomics
Proteins function by interacting with
other proteins in protein complexes
and networks.
The systematic analysis of these
complexes and interactions has
become an important goal of
proteomics.
Proteomics
Proteins can be isolated from cells
under gentle conditions so that
protein complexes are not disrupted.
Typically, an antibody against a protein
of interest would be used to isolate
the protein from a cell extract by
immunoprecipitation.
Figure 5.11 Immunoprecipitation
Proteomics
Immunoprecipitated protein complexes
can then be analyzed by mass
spectrometry.
The protein against which the antibody
was directed can be identified, along
with other proteins it was associated
with in the cell extract.
Figure 5.12 Analysis of protein complexes
Proteomics
Alternative approaches include
screens for protein interactions in
vitro, and screens that detect
interactions between pairs of proteins
introduced into yeast cells.
Proteomics
In the yeast two-hybrid system, two
different cDNAs (e.g., from human
cells) are joined to two distinct
domains of a protein that stimulates
expression of a target gene in yeast.
Figure 5.13 The yeast two-hybrid system
Proteomics
Screens have identified thousands of
protein–protein interactions, which
can be presented as maps that depict
a network of interacting proteins
within a cell.
Figure 5.14 A protein interaction map of Drosophila
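An interaction map of this kind is commonly stored as an undirected graph. The sketch below builds one from a list of interacting pairs and reports highly connected proteins; the protein names used with it are placeholders, and build_network, hubs, and the min_partners cutoff are hypothetical names chosen for this example.

```python
from collections import defaultdict


def build_network(pairs):
    """Map each protein to the set of proteins it interacts with."""
    network = defaultdict(set)
    for a, b in pairs:
        network[a].add(b)
        network[b].add(a)  # interactions are treated as symmetric
    return network


def hubs(network, min_partners=3):
    """Proteins with at least min_partners interaction partners."""
    return sorted(p for p, partners in network.items()
                  if len(partners) >= min_partners)
```

A pair list such as [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3")] gives P1 three partners, so it is the only hub at the default cutoff.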
Bioinformatics and Systems Biology
Genome sequencing, proteomics, and
other large-scale experiments have
yielded vast amounts of data.
Bioinformatics, at the interface
between biology and computer
science, uses computational methods
to analyze and extract biological
information from all this data.
Bioinformatics and Systems Biology
These large-scale experimental
approaches form the basis of the new
field of systems biology.
The goal: A quantitative understanding
of the integrated dynamic behavior of
complex biological systems and
processes.
Figure 5.15 Systems biology
Bioinformatics and Systems Biology
Systematic screens of gene function:
One approach to studying gene function
is to inactivate (knock out) each gene.
Collections of strains with mutations in
all known genes are available for E.
coli, yeast, Drosophila, C. elegans,
and Arabidopsis thaliana.
Bioinformatics and Systems Biology
A large-scale international project to
systematically knock out all genes in
the mouse is also under way.
Targeted mutagenesis has determined
functions of more than 7,000 mouse
genes.
Bioinformatics and Systems Biology
Other large-scale screening projects
are based on RNA interference
(RNAi).
Double-stranded RNAs are used to
induce degradation of homologous
mRNAs in cells.
Figure 4.38 RNA Interference
Bioinformatics and Systems Biology
With the availability of complete
genome sequences, libraries of
double-stranded RNAs can be
designed and used in genome-wide
screens to identify all of the genes
involved in any biological process.
Figure 5.16 Genome-wide RNAi screen for cell growth and viability
Bioinformatics and Systems Biology
Regulation of gene expression:
Understanding the mechanisms that
control gene expression is a central
undertaking in cell and molecular
biology.
It is far more difficult to identify gene
regulatory sequences than
protein-coding sequences.
Bioinformatics and Systems Biology
Most regulatory elements are short
sequences, typically only about ten
base pairs.
Consequently, sequences resembling
regulatory elements occur frequently
by chance in genomic DNA.
Identifying regulatory sequences is a
major challenge in systems biology.
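A rough calculation shows why short elements match so often by chance. The sketch assumes that the four bases are equally frequent and independent at every position, and it counts one strand only; both are simplifications. The function name expected_chance_matches is hypothetical.

```python
def expected_chance_matches(genome_length, motif_length):
    """Expected number of exact matches to one fixed motif by chance.

    Assumes each base occurs with probability 1/4, independently.
    """
    positions = genome_length - motif_length + 1
    return positions * 0.25 ** motif_length


# A 10-base-pair element in a genome of 3 x 10^9 base pairs:
# about 2,900 exact matches are expected on one strand by chance alone.
chance_matches = expected_chance_matches(3_000_000_000, 10)
```

Regulatory elements also tolerate mismatches, so the number of chance look-alikes in a real genome would be higher still.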
Bioinformatics and Systems Biology
Global studies of gene expression
using microarrays or RNA-seq can
reveal overall changes in gene
regulation associated with discrete
cell behaviors, such as the response
of cells to a particular hormone.
Changes in expression of multiple
genes can help pinpoint shared
regulatory elements.
Bioinformatics and Systems Biology
Computational approaches are also
used to characterize regulatory
elements.
Comparative analysis of genome
sequences of related organisms
assumes that functionally important
sequences are conserved in
evolution, and nonfunctional
segments diverge more rapidly.
Bioinformatics and Systems Biology
Computational analysis to identify
noncoding sequences that are
conserved between the mouse, rat,
dog, and human genomes has
helped identify sequences that
control gene transcription.
Figure 5.17 Conservation of functional gene regulatory elements
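The search for conserved noncoding sequences can be sketched as a sliding-window scan over an alignment. This example assumes that the sequences have already been aligned to equal length (with "-" for gaps), and it uses a simple identity threshold in place of the statistical models used by real tools. The name conserved_windows and its parameters are hypothetical.

```python
def conserved_windows(aligned, window=10, threshold=0.9):
    """Start positions of windows conserved across all aligned sequences.

    A column counts as conserved when every sequence has the same base
    and that base is not a gap character.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to equal length")
    conserved = [
        len({seq[col] for seq in aligned}) == 1 and aligned[0][col] != "-"
        for col in range(length)
    ]
    hits = []
    for start in range(length - window + 1):
        fraction = sum(conserved[start:start + window]) / window
        if fraction >= threshold:
            hits.append(start)
    return hits
```

Windows that pass the threshold in noncoding regions would be candidates for regulatory elements, to be confirmed experimentally.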
Bioinformatics and Systems Biology
Methods for genome-wide analysis of
the binding sites of regulatory
proteins have also been developed.
Genome-wide analysis of the sites of
histone modifications can also
help identify gene
regulatory sequences.
Bioinformatics and Systems Biology
ENCODE (Encyclopedia of DNA
Elements) utilized RNA-seq to
characterize all transcribed RNAs,
plus global methods to determine
gene regulatory sequences in 147
different types of human cells.
One result: Many transcribed
noncoding sequences play important
roles in gene regulation.
Bioinformatics and Systems Biology
Networks:
Classical experimental biology focuses
on single genes and proteins, which
often act sequentially to catalyze
reactions in a metabolic pathway.
Signaling pathways act similarly to
transmit information from the
environment, such as presence of a
hormone, to targets within the cell.
Figure 5.18 Example of a signaling pathway
Bioinformatics and Systems Biology
But metabolic and signaling pathways
do not operate in isolation.
There is extensive crosstalk between
pathways, so that multiple pathways
interact with one another to form
networks.
Computational modeling of networks is
currently a major challenge in
systems biology.
Bioinformatics and Systems Biology
Many pathways are controlled by
feedback loops (e.g., feedback
inhibition of metabolic pathways, a
form of negative feedback).
Feedforward relays: activity of one
component of a pathway stimulates a
distant downstream component.
Bioinformatics and Systems Biology
Crosstalk: interaction of one pathway
with another; can be positive (one
pathway stimulates the other) or
negative (one pathway inhibits the
other).
Figure 5.19 Elements of signaling networks
Bioinformatics and Systems Biology
In this view of the cell as an integrated
system, a full understanding of cell
signaling will require development of
network models.
A model of a gene regulatory network
controlling development of an
embryonic cell lineage in sea urchins
has recently been developed.
Figure 5.20 A gene regulatory network
Bioinformatics and Systems Biology
Synthetic biology:
The goal is to design and create new
(unnatural or synthetic) systems, to
create useful products and to better
understand how the behavior of
existing cells is controlled.
Bioinformatics and Systems Biology
Synthetic biologists can synthesize
new molecules with biological
properties, such as RNA, or engineer
new systems using components of
existing cells.
The ability to engineer a novel
biological system tests and expands
our understanding of how natural
systems function.
Bioinformatics and Systems Biology
Genetic circuits in E. coli were first
engineered in 2000.
A genetic toggle switch was designed
to confer stability and memory on a
network regulating gene expression.
The key feature is that two repressors
control expression of each other as
well as a reporter gene.
Figure 5.21 A genetic toggle switch
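The stability and memory of a two-repressor circuit can be illustrated with a small simulation. The sketch uses a commonly used dimensionless model in which each repressor is synthesized at a rate that falls as the other accumulates, integrated with simple Euler steps. The parameter values (alpha, n, dt, steps) are illustrative assumptions rather than measured values, the reporter gene is not modeled, and simulate_toggle is a hypothetical name.

```python
def simulate_toggle(u0, v0, alpha=10.0, n=2.0, dt=0.01, steps=5000):
    """Euler integration of two mutually repressing genes.

    u and v are the repressor concentrations; each is produced at rate
    alpha / (1 + other**n) and lost at a rate equal to its own level.
    """
    u, v = u0, v0
    for _ in range(steps):
        du = alpha / (1.0 + v ** n) - u
        dv = alpha / (1.0 + u ** n) - v
        u, v = u + dt * du, v + dt * dv
    return u, v


# The final state depends on which repressor starts ahead.
u_high = simulate_toggle(5.0, 0.1)   # settles with u high, v low
v_high = simulate_toggle(0.1, 5.0)   # settles with v high, u low
```

The same equations reach two different stable states depending only on the starting point, which is the memory property: once flipped, the circuit stays in its state without a continuing signal.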
Bioinformatics and Systems Biology
Similar genetic circuits have since
been engineered in eukaryotic
models.
This has substantially advanced our
understanding of how a regulatory
circuit can alternate between two
stable states—a common feature of
networks involved in many aspects of
cell signaling and regulation of cell
proliferation.
Bioinformatics and Systems Biology
Practical applications of synthetic
biology—treating malaria:
Malaria is a serious parasitic disease,
caused by the protozoan
Plasmodium and transmitted by
mosquitoes.
Research on vaccine development is
underway, but none is currently
available.
Molecular Medicine, Ch. 5, p. 180
Bioinformatics and Systems Biology
The most effective antimalarial drug
right now is artemisinin, a compound
produced by a plant that takes 8
months to mature.
The supply of artemisinin from these
plants is limited and the price
fluctuates.
Figure 5.22 Structure of artemisinin
Bioinformatics and Systems Biology
Synthetic biologists have developed
strains of yeast engineered to
produce a precursor to artemisinin,
which is then used for commercial
production of this important drug.
Bioinformatics and Systems Biology
The first cell with a completely
synthetic genome was created in
2010.
Venter et al. synthesized overlapping
oligonucleotides corresponding to the
complete genome sequence of
Mycoplasma mycoides.
Bioinformatics and Systems Biology
The synthetic genome was then
introduced into a different
mycoplasma species, M.
capricolum.
These cells grew normally and showed
the morphology of normal M.
mycoides.
Because the cell proteins are specified
by the synthetic genome, they
represent the first synthetic cells.
Figure 5.23 First cell with a synthetic genome