Human Genome Project

Basic Strategy
• How to determine the sequence of the roughly 3 billion
base pairs of the human genome. Started in 1990.
• Various side projects: genetic diseases, variations
between individuals, ethnic variation, comparison to
other species.
• Strategy:
– 1. physical map relating specific DNA markers to the proper
chromosomal position.
– 2. Overlapping set of cloned DNAs (contigs)
– 3. sequencing and assembly
– 4. finding the genes in the sequence
– 5. annotation of gene function
Physical Maps
• A genetic map uses recombination (crossing over during
meiosis) to determine how frequently two genes (or markers)
are inherited together.
• A physical map determines where a given DNA marker is
located on the DNA of the chromosome.
• Genetic and physical maps are (supposed to be) colinear: all
the genes appear in the same order in both maps. But distances
are quite different: there is very little recombination in the
centromeres, so large DNA distances there correspond to very
short recombination distances.
• Genetic maps using microsatellite (SSR) markers were used to
develop physical maps: the appropriate SSR sites were expected
to be found on the corresponding cloned DNA.
Sequence Tagged Sites
• A sequence tagged site (STS) is a short sequence that is unique in the
genome.
• You obtain the sequence information from cloned DNA, and then locate it in
the genome.
• Using PCR it is then possible to determine whether your STS is present in
any other clone or cell line.
• Obtaining STS: sequencing the ends of large cloned DNAs (BACs or YACs,
for example).
• Uniqueness: use the cloned DNA from the STS as a probe on a Southern
blot of genomic DNA: if the STS is unique, only 1 band will hybridize.
• Repetitive DNA is very common in the human genome, and many DNA
sequences are not unique.
• A good source of unique DNA is EST clones: cDNA made from messenger
RNA.
• Size: a DNA sequencing run will usually give 500-600 bp of good, reliable
sequence information. On the other hand, consider the size of the
genome: 3 x 10^9 bp. Each base is one of 4 choices, so a 16 bp sequence
will appear about once in 4.3 x 10^9 bp. In practice, 20 bp is about the
minimum size for good PCR amplification, and 24 bp is about the minimum
that will give a good BLAST hit.
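The size argument above can be checked with a few lines of arithmetic. This is only a sketch: it assumes a random genome with equal base frequencies, which real (repeat-rich) DNA is not, and the function name is made up for illustration.

```python
GENOME_SIZE = 3e9  # approximate human genome size in bp, as quoted above


def expected_random_hits(k, genome_size=GENOME_SIZE):
    """Expected number of chance matches to one specific k-mer.

    Assumes every base is an independent, equally likely choice of
    A, C, G or T (an idealization; repeats are ignored).
    """
    return genome_size / 4 ** k


for k in (12, 16, 20, 24):
    # 4 ** k is the number of distinct sequences of length k
    print(k, 4 ** k, expected_random_hits(k))
```

At k = 16 there are 4^16 (about 4.3 x 10^9) possible sequences, so a given 16-mer is expected less than once by chance in 3 x 10^9 bp. Shorter sequences are expected many times.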
Somatic Cell Hybrids
• Human and mouse (or hamster) cultured cells can be
fused together using polyethylene glycol.
– The resulting fused cell is a heterokaryon: it has 2 nuclei from
different species.
– If the heterokaryon undergoes mitosis, the nuclei fuse.
– Human chromosomes are unstable in a mixed nucleus, and
most of them are randomly lost. The mouse chromosomes all
stay.
– Different cell lines can be established that contain different
combinations of human chromosomes
– You can identify which human chromosomes remain using
chromosome banding techniques.
• Hybrid cell lines are a good way to determine which chromosome
a DNA sequence is on. Sometimes they also work for gene products
or phenotypes.
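The chromosome assignment described above comes down to finding the one human chromosome whose presence matches the marker in every cell line. A minimal sketch, assuming a small invented panel (the cell line names, retained chromosomes and PCR results are all hypothetical):

```python
# Hypothetical panel: human chromosomes retained by each hybrid cell line.
panel = {
    "line_A": {1, 7, 11},
    "line_B": {7, 21},
    "line_C": {1, 11, 21},
    "line_D": {7, 11},
    "line_E": {1, 21},
}

# Hypothetical PCR results: is the marker detected in each line?
marker_present = {
    "line_A": True,
    "line_B": True,
    "line_C": False,
    "line_D": True,
    "line_E": False,
}


def concordant_chromosomes(panel, marker_present):
    """Chromosomes whose presence/absence matches the marker in every line."""
    candidates = set().union(*panel.values())
    return [
        chrom
        for chrom in sorted(candidates)
        if all(
            (chrom in retained) == marker_present[line]
            for line, retained in panel.items()
        )
    ]


print(concordant_chromosomes(panel, marker_present))
```

With this toy panel only chromosome 7 is present exactly where the marker is detected. A real panel needs enough lines that every chromosome has a distinct retention pattern.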
Radiation Hybrids
• Standard somatic cell fusions contain entire human chromosomes.
To locate a gene more closely, you need to use chromosome
fragments.
• Start by irradiating human cells with a controlled dose of
X-rays: chromosomes break up. Then, fuse the cells to mouse
cells. The human chromosome fragments get integrated into the
mouse chromosomes.
• Create a panel of mouse/human hybrid cell lines.
– The current standard panels contain about 100 cell lines.
– Each line contains about 32% of the human genome.
– Average size of human genome fragment = 25 Mbp
– More radiation = smaller fragments
• Mapping: the hybrid cell lines contain random human chromosome
fragments, but closely linked sites are usually in the same
cell line (same basic principle as recombination mapping).
– Until you have located some of the markers on the chromosomes,
radiation hybrid mapping only gives you information about
whether any two sequences are close together on the chromosome.
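The mapping principle above (closely linked sites tend to be retained in the same cell lines) can be illustrated by comparing retention patterns. This is a simplified sketch with invented data for a 10-line panel. Real radiation hybrid mapping uses statistical models of breakage and retention, not a raw match fraction.

```python
def co_retention(a, b):
    """Fraction of cell lines where two markers are both present or both absent.

    a and b are strings of '1' (marker detected) and '0' (not detected),
    one character per hybrid cell line, in the same line order.
    """
    if len(a) != len(b):
        raise ValueError("retention patterns must cover the same cell lines")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical retention patterns across a 10-line panel.
marker_x = "1100101001"
marker_y = "1100101011"  # differs from marker_x in one line: likely close by
marker_z = "0110011000"  # agrees with marker_x in about half the lines

print(co_retention(marker_x, marker_y))
print(co_retention(marker_x, marker_z))
```

A high score suggests the two markers usually sit on the same fragment. A score near what chance would give suggests they are far apart or on different chromosomes.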
Contigs
• A contig is a set of partially overlapping clones, a contiguous
set of clones. No gaps between them.
• Contigs allow you to build up the sequence of the chromosome
over much larger regions than any single clone.
• The first reasonably complete physical map of the human genome
involved contigs generated by YACs (yeast artificial
chromosomes).
• Initially, you have a collection of clones with no information
about how they are ordered on the chromosome.
• Contigs are built up by using PCR to identify unique sequences
(STS or EST) on each clone, and then looking for overlaps
between the clones.
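The contig-building step above can be sketched as grouping clones that share markers. This is a minimal illustration, assuming each clone has already been scored by PCR for a set of STSs. The clone and STS names are hypothetical, and ordering clones within a contig is not attempted.

```python
def build_contigs(clone_sts):
    """Group clones into contigs.

    Two clones are taken to overlap if they share at least one STS.
    Overlap is transitive: A-B and B-C put A, B and C in one contig.
    """
    contigs = []  # list of (set of clone names, set of STS names)
    for clone, sts in clone_sts.items():
        clones, markers = {clone}, set(sts)
        untouched = []
        for c_clones, c_markers in contigs:
            if c_markers & markers:      # shares an STS with the new clone
                clones |= c_clones
                markers |= c_markers
            else:
                untouched.append((c_clones, c_markers))
        contigs = untouched + [(clones, markers)]
    return sorted(sorted(c) for c, _ in contigs)


# Hypothetical STS content of six clones.
clone_sts = {
    "clone1": {"sts1", "sts2"},
    "clone2": {"sts2", "sts3"},
    "clone3": {"sts3", "sts4"},
    "clone4": {"sts7", "sts8"},
    "clone5": {"sts8", "sts9"},
    "clone6": {"sts12"},
}

print(build_contigs(clone_sts))
```

Here clones 1-3 form one contig, clones 4-5 a second, and clone 6 stays alone until some other clone is found that shares its STS.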
Sequencing Strategy
• Once a contig map of the genome was obtained, it was necessary
to sequence each individual clone.
• Most of the actual human genome sequencing was done on BAC
clones, which are less prone to rearrangement than YAC clones.
BACs are about 100-200 kbp long.
• Large clones are generally sequenced by shotgun sequencing: the
large cloned DNA is randomly broken up into a series of small
fragments (less than 1 kb). These fragments are cloned and
sequenced. A computer program then assembles them based on
overlaps between the sequences of each clone.
• To ensure that every bit has been covered, you need to sequence
random clones until you have covered each spot 5-10 times on
average.
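The 5-10x figure can be motivated with a rough calculation. The sketch below assumes reads land uniformly at random, so the number of reads covering a given base is approximately Poisson and the chance a base is missed is about e^(-coverage). That model and the specific clone and read sizes are illustrative assumptions, not figures from the project.

```python
import math


def reads_needed(clone_len_bp, read_len_bp, coverage):
    """Number of reads giving the requested average coverage of one clone."""
    return math.ceil(coverage * clone_len_bp / read_len_bp)


def fraction_uncovered(coverage):
    """Expected fraction of bases hit by no read, under a Poisson model."""
    return math.exp(-coverage)


# Illustrative numbers: a 150 kbp BAC and 500 bp reads.
for cov in (1, 5, 8, 10):
    print(cov, reads_needed(150_000, 500, cov), fraction_uncovered(cov))
```

At 1x average coverage over a third of the clone is expected to be missed. At 5x it is under 1%, and at 10x it is negligible for a clone of this size.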
Whole Genome Shotgun Sequencing
• Why bother with creating a large scale physical map: all that YAC and BAC
cloning, radiation hybrids, STS comparisons, etc? Why not just fragment
the whole genome into 1 kb pieces, sequence them all, and let the
computer assemble the whole genome?
• In practice, the genome is cloned into large fragments first, and then each
large fragment is broken up for shotgun sequencing. But, the large
fragments are not ordered: no physical map or set of contigs is created.
• Requires a lot of overlapping coverage.
• Also requires good software.
• Very successful for prokaryotic genomes (10 Mbp or less).
– but the human genome is 300 times larger
• Big problem: repeat sequence DNA, which is everywhere, and especially
near the centromere. To find overlaps between clones, you need unique
regions.
• It remains unclear whether whole genome shotgun sequencing will work if
there is no other information available to provide order. It has not been
widely adopted for eukaryotic projects (so far).
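A toy example of why repeats are a problem for assembly: if a read ends inside a repeat, several other reads can overlap it equally well. The sequences and the helper function below are invented for illustration and are not taken from any real assembler.

```python
def overlap(a, b, min_len=5):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


REPEAT = "TTAGGGTTAGGG"  # stands in for a repeat found at many loci

read = "GATTCA" + REPEAT           # unique sequence running into the repeat
candidate_1 = REPEAT + "CCGTA"     # repeat followed by one unique region
candidate_2 = REPEAT + "AAGCT"     # same repeat, different unique region

print(overlap(read, candidate_1))
print(overlap(read, candidate_2))
```

Both candidates overlap the read by the full length of the repeat, so overlap alone cannot say which one really follows it. Only reads that span the repeat into unique sequence on both sides, or outside ordering information, can resolve the choice.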
Gene Detection
• The best evidence that a given DNA sequence is expressed is to find
an EST (cDNA copy of mRNA) that matches it. Large numbers of
EST libraries have been constructed and sequenced.
– The primary result of this was to determine that many genes have
several different intron splicing patterns: sequences are exons in some
tissues but introns in others.
• Homology searches, using BLAST, are a good way to find genes. If
a DNA sequence closely matches a sequence from another
organism, it has been evolutionarily conserved, and that usually
means that it is an expressed gene.
• Exon prediction: exons need to be open reading frames (no stop
codons), and they display patterns of nucleotide usage different from
random DNA. Several different programs exist, and they give
somewhat varying results. “Hypothetical genes” are genes whose
existence has been predicted by computer but which lack any
experimental or cross-species data to confirm them.
– A “conserved hypothetical gene” is a sequence that matches other
species even though there is no EST or other experimental evidence for
its expression.
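The open reading frame criterion from the exon prediction bullet can be sketched as a scan for stop-free stretches. This is a deliberately minimal illustration: it looks only at the three forward frames, ignores splice sites and nucleotide-usage statistics, and uses a made-up sequence and length threshold.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def open_stretches(seq, min_codons=3):
    """Stop-free stretches in the three forward reading frames.

    Returns (frame, start, end) tuples with 0-based, end-exclusive
    coordinates; each stretch is at least min_codons codons long.
    """
    seq = seq.upper()
    found = []
    for frame in range(3):
        start = frame
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    found.append((frame, start, i))
                start = i + 3  # resume scanning after the stop codon
        # Stretch running to the last complete codon of this frame
        end = frame + ((len(seq) - frame) // 3) * 3
        if (end - start) // 3 >= min_codons:
            found.append((frame, start, end))
    return found


print(open_stretches("ATGGCCAAATAGGGCTGA"))
```

On a sequence this short, frames 1 and 2 are open from end to end purely by chance. That is why real predictors require long open stretches and add other evidence, and part of why different programs give different answers.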
Gene Annotation
• Computer predictions of gene function are mediocre at best.
Humans, especially those who are experts in the field, do a
much better job of evaluating evidence and deciding what a
given gene’s function is.
• There is a big problem of too much information not uniformly
coded or maintained. The scientific literature contains
numerous examples of the same gene or protein with several
different names, and getting common definitions of functions is
even harder.
• To counter this, the Gene Ontology Consortium (GO) has created
a controlled vocabulary of about 11,000 terms.
• Every gene product (protein) can be annotated into three
general categories:
– molecular function: what the protein actually does, such as
“kinase activity”
– biological process: what cellular process the protein
participates in, such as “signal transduction”
– cellular component: where the protein is found in the cell,
such as “integral to the plasma membrane”
• Each gene product can have multiple descriptive terms.
• The terms are hierarchical: more specific terms are contained
within less specific terms.
• But, a given term can have more than one parent and more than
one child term.
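Because a term can have more than one parent, the vocabulary forms a graph and not a simple tree. A minimal sketch of what that means for annotation, using a tiny invented hierarchy (the term names and parent links below are illustrative only and are not copied from the actual GO files):

```python
# Hypothetical mini-ontology: each term maps to its parent terms.
parents = {
    "molecular function": [],
    "catalytic activity": ["molecular function"],
    "binding": ["molecular function"],
    "kinase activity": ["catalytic activity"],
    "nucleotide binding": ["binding"],
    # An invented term with two parents:
    "toy nucleotide-dependent kinase": ["kinase activity", "nucleotide binding"],
}


def ancestors(term, parents):
    """All less specific terms reachable by following parent links."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


print(sorted(ancestors("toy nucleotide-dependent kinase", parents)))
```

A gene product annotated with the most specific term is implicitly annotated with every ancestor, along both parent paths. This is why annotation is done at the most specific level the evidence supports.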
GO Example