Transcript Lecture 2

The Human Genome Project –
Part 2
BLT/ Topic 1 Pt 2/Apr 2012
© 2005 Prentice Hall Inc. / A Pearson Education Company / Upper Saddle
River, New Jersey 07458
Your assignment for last week…
• Groups 1 and 2 – explain the following terms
related to genome sequencing:
▫ 1: mapping, STSs and ESTs, coverage, contigs,
golden tiling path
▫ 2: library, BACs, finishing, annotation
• Group 3 – explain the hierarchical approach
• Group 4 – explain the whole genome shotgun
approach
Some animations to watch first ….
http://www.yourgenome.org/teachers/bac.shtml
http://www.dnalc.org/resources/animations/
http://bcs.whfreeman.com/thelifewire/content/chp17/1702002.html
http://www.dnaftb.org/39/
Background
• Field of genomics began with decision to sequence
human genome
▫ Size of human genome is 3 billion base pairs, which
necessitated new ways to do sequencing
• Approaches to sequencing the human genome
▫ Scale up existing techniques
▫ Develop new sequencing techniques
▫ Start with smaller genomes used as a warm-up project
Whole-genome shotgun sequencing I
• Developed by Celera
▫ Subsidiary of Applied
Biosystems, maker of
automated
sequencers
• No mapping
• Instead, the whole
genome is sheared
• Randomly sequenced
Whole-genome shotgun sequencing II
Generate tens of millions of
sequence reads
Assemble
Whole-genome shotgun sequencing III
• Major challenge: assembly
▫ Repetitive elements are the biggest problem
• Performed on very high-speed computers, using
novel software
• Key to assembly is paired reads
▫ Sequence both ends of each clone
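To make the assembly problem concrete, here is a minimal sketch of shotgun reads being merged by their overlaps. It is a toy greedy overlap-merge, not the software Celera used; the function names (`shear`, `overlap`, `drop_contained`, `greedy_assemble`), the example genome and the `min_len` threshold are all assumptions made for illustration.

```python
import random

def shear(genome, read_len, n_reads, rng):
    """Toy 'shotgun': sample n_reads random substrings of length read_len."""
    starts = [rng.randrange(len(genome) - read_len + 1) for _ in range(n_reads)]
    return [genome[s:s + read_len] for s in starts]

def overlap(a, b, min_len):
    """Length of the longest suffix of a that is a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def drop_contained(reads):
    """Remove duplicates and reads that lie entirely inside another read."""
    unique = list(dict.fromkeys(reads))
    return [r for r in unique if not any(r != o and r in o for o in unique)]

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = drop_contained(reads)
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:          # nothing left to merge
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        rest = [c for n, c in enumerate(contigs) if n not in (best_i, best_j)]
        contigs = drop_contained(rest + [merged])

# Hypothetical 15-base "genome" and three overlapping reads taken from it.
genome = "ATGGCGTACCTTAGA"
reads = ["ATGGCGTA", "CGTACCTT", "ACCTTAGA"]
contigs = greedy_assemble(reads)

# Random shearing of the same toy genome (result depends on the seed).
random_reads = shear(genome, read_len=8, n_reads=20, rng=random.Random(0))
```

With the three hand-picked reads the sketch returns one contig equal to the toy genome. On a sequence containing repeats, a purely overlap-based merge like this can join the wrong reads, which is why repetitive elements are the biggest problem and why paired reads from both ends of a clone are needed.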
Map-based sequencing I
• Human Genome Project adopted a map-based
strategy
▫ Start with well-defined physical map
▫ Produce shortest tiling path for large-insert clones
▫ Assemble the sequence for each clone
▫ Then assemble the entire sequence, based on the
physical map
Map-based sequencing
Steps in genomic sequencing
• Library making
▫ Large-insert library from genome
• Production sequencing
▫ Generate fragments to be sequenced
▫ Perform sequencing reactions
▫ Determine sequence
• Finishing
▫ Assemble into continuous sequence
▫ Fill gaps
Map-based sequencing
Library making
• Library of genomic fragments made in vector
▫ BAC – Bacterial artificial chromosome
▫ Usually have several-fold coverage
▪ Every DNA sequence on five to eight different clones
• Difficult and inefficient to sequence straight from
large fragment
• Need to break into manageable pieces
▫ Random shearing
▪ By nebulization or sonication
Fragments for sequencing
• Generally use 2–10 kb
pieces for sequencing
• Clone into sequencing
vector
▫ Contains binding sites
for sequencing primers
Pros and cons of large-insert vectors
• Lambda phage and
cosmids
▫ Inserts stable
▫ But insert size too small
for large-scale
sequencing projects
• YACs
▫ Largest insert size
▫ But difficult to work
with
BACs and PACs
• BACs and PACs
▫ Most commonly used
vectors for large-scale
sequencing
▫ Good compromise
between insert size and
ease of use
▫ Growth and isolation
similar to that for
plasmids
Contigs
• Contigs are groups of overlapping pieces of chromosomal
DNA
▫ Make contiguous clones
• For sequencing one wants to create “minimum tiling path”
▫ Contig of smallest number of inserts that covers a region of
the chromosome
[Figure: genomic DNA, contig, and minimum tiling path]
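Choosing a minimum tiling path from mapped clones can be sketched as a classic interval-covering problem. The version below is only an illustration under simplifying assumptions: each clone is a hypothetical `(name, start, end)` tuple with map coordinates, `end` is exclusive, and clones that merely touch are treated as joined, whereas a real tiling path needs a verified overlap between neighbours.

```python
def minimum_tiling_path(clones, region_start, region_end):
    """Greedy cover: among clones starting at or before the covered
    position, always take the one that reaches furthest right."""
    path = []
    covered = region_start
    ordered = sorted(clones, key=lambda c: c[1])   # sort by start coordinate
    i = 0
    while covered < region_end:
        best = None
        while i < len(ordered) and ordered[i][1] <= covered:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered:
            raise ValueError(f"gap in clone coverage at position {covered}")
        path.append(best)
        covered = best[2]
    return path

# Hypothetical contig of five overlapping clones (coordinates in kb).
clones = [("A", 0, 100), ("B", 50, 180), ("C", 90, 200),
          ("D", 150, 300), ("E", 190, 320)]
path = minimum_tiling_path(clones, 0, 300)
names = [name for name, _, _ in path]
```

Here three of the five clones (A, C, E) are enough to cover the region, so only those would go forward to sequencing.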
Physical mapping
• Restriction mapping (by restriction endonucleases)
• STS (sequence-tagged site) mapping
• FISH – fluorescence in-situ hybridisation
PPT from D. Bartholomeu
Finishing I
• Process of assembling raw
sequence reads into
accurate contiguous
sequence
▫ Required to achieve accuracy of
one error in 10,000 bases
• Manual process
▫ Look at sequence reads at
positions where programs
can’t tell which base is the
correct one
▫ Fill gaps
▫ Ensure adequate coverage
[Figure labels: Gap; Single stranded]
Finishing II
• To fill gaps in
sequence, design
primers and sequence
from primer
• To ensure adequate
coverage, find regions
where there is not
sufficient coverage
and use specific
primers for those
areas
[Figure: gap flanked by two primers]
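Finding the regions that still need primer-directed reads can be sketched as a depth scan over the assembled clone. This is an illustrative sketch only: reads are assumed to be already placed as hypothetical `(start, end)` intervals with `end` exclusive, and `min_depth` is a threshold chosen by the finisher.

```python
def coverage_depth(reads, length):
    """Number of reads covering each position of a sequence of given length."""
    depth = [0] * length
    for start, end in reads:
        for pos in range(max(start, 0), min(end, length)):
            depth[pos] += 1
    return depth

def low_coverage_regions(depth, min_depth):
    """Return (start, end) intervals, end exclusive, where depth < min_depth."""
    regions = []
    start = None
    for pos, d in enumerate(depth):
        if d < min_depth and start is None:
            start = pos                      # a low-coverage stretch begins
        elif d >= min_depth and start is not None:
            regions.append((start, pos))     # the stretch ends
            start = None
    if start is not None:
        regions.append((start, len(depth)))
    return regions

# Hypothetical placement of three reads on a 30-base region.
reads = [(0, 10), (5, 15), (20, 30)]
depth = coverage_depth(reads, 30)
gaps = low_coverage_regions(depth, min_depth=1)   # positions with no read at all
thin = low_coverage_regions(depth, min_depth=2)   # positions covered fewer than twice
```

Each interval in `gaps` is a place to design primers on either side and sequence across; each interval in `thin` is a candidate for extra reads with specific primers.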
Verification
• Region verified for the following:
▫ Coverage
▫ Sequence quality
▫ Contiguity
• Determine restriction-enzyme cleavage sites
▫ Generate restriction map of sequenced region
▫ Must agree with fingerprint generated of clone
during mapping step
Sequencing coverage
• Coverage is the number of times the same region
is sequenced
▫ Ideally, one wants an equal number of sequences
in each direction
• To obtain accuracy of one error in 10,000 bases,
one needs the following:
▫ 10x coverage
▪ Stringent finishing
▫ Complete sequence
▪ Base-perfect sequencing
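The coverage figures on this slide can be turned into read counts with simple arithmetic: average coverage is (number of reads × read length) / target size. The sketch below assumes a read length of 700 bases (the middle of the 600–800 range quoted in these slides) and a hypothetical 150 kb BAC insert; the function names are illustrative. The uncovered-fraction estimate uses the standard Poisson approximation for randomly placed reads, which real libraries only roughly follow.

```python
import math

def coverage(n_reads, read_len, target_size):
    """Average number of times each base is sequenced."""
    return n_reads * read_len / target_size

def reads_needed(target_coverage, read_len, target_size):
    """Smallest number of reads giving at least the target average coverage."""
    return math.ceil(target_coverage * target_size / read_len)

def expected_fraction_uncovered(cov):
    """Poisson estimate of the fraction of bases hit by no read at all."""
    return math.exp(-cov)

READ_LEN = 700                     # assumed read length in bases
bac_reads = reads_needed(10, READ_LEN, 150_000)            # one BAC clone
genome_reads = reads_needed(10, READ_LEN, 3_000_000_000)   # whole human genome
missed_at_10x = expected_fraction_uncovered(10)
```

Under these assumptions a single BAC needs a little over two thousand reads for 10x coverage, and the whole genome needs about 43 million, which matches the "several thousand reads per clone" and "tens of millions of reads" quoted for the two strategies.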
Map-based sequencing II
Construct clone map and
select mapped clones
Generate several thousand
sequence reads per clone
Assemble
Why map before sequencing?
• Major problem in large-scale sequencing:
▫ Current technologies can only sequence 600–800
bases at a time
• One solution: make a physical map of
overlapping DNA fragments
▫ Determine sequence of each fragment
▫ Then assemble to form contiguous sequence
Controversy: Map-based sequencing vs.
whole-genome shotgun sequencing
• Celera used publicly funded sequence to produce
its published draft of the human genome
• Scientists who worked on the map-based effort
claimed Celera couldn’t have produced a draft
without access to the public sequence
• Celera scientists claim that they could have
produced an accurate draft even without the
public sequence
Hybrid approach
• Combines aspects of both map-based and whole-genome
shotgun approaches
▫ Map clones
▫ Sequence some of the mapped clones
▫ Do whole-genome sequencing
▫ Combine information from both methods
▪ Use sequence from mapped clones as scaffold to assemble
whole-genome shotgun reads
• Used for sequencing the mouse genome
Sequence annotation
• Annotation
performed on
completed sequence
• Computer programs
used to find the
following:
▫ Genes
▫ Exons and introns
▫ Regulatory sequences
▫ Repetitive elements
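As a very rough illustration of what "finding genes by computer" means, the sketch below scans the forward strand for open reading frames (an ATG followed in the same frame by a stop codon). This is a deliberate simplification and not how annotation pipelines actually work: it ignores introns, the reverse strand and regulatory signals. The function name `find_orfs` and the `min_codons` cut-off are assumptions for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return (start, end) of forward-strand ORFs, end exclusive,
    counting at least min_codons codons before the stop codon."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i                              # first start codon in frame
            elif codon in STOP_CODONS and start is not None:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))        # include the stop codon
                start = None
    return orfs

# Hypothetical 16-base sequence containing one short ORF.
example = "CCATGAAATTTTAGGG"
orfs = find_orfs(example)
orf_sequences = [example[s:e] for s, e in orfs]
```

In the example the only ORF is `ATGAAATTTTAG`, found in the third reading frame.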
Industrialization of sequencing
• Most large-scale
sequencing projects
divide tasks among
different teams
▫ Large-insert libraries
▫ Production
sequencing
▫ Finishing
• Sequencing machines
run 24/7
• Many tasks
performed by robots
More about Mapping!
• Genetic mapping
• Physical mapping
• Chromosome walking
• Determining DNA sequences
• New techniques for mapping and sequencing
Mapping I
• Mapping is
identifying
relationships between
genes on
chromosomes
▫ Just as a road map
shows relationships
between towns on a
highway
• Two types of
mapping: genetic and
physical
Mapping II
• Genetic mapping
▫ Based on differences in recombination frequency
between genetic loci
• Physical mapping
▫ Based on distances in base pairs between specific
sequences found on the chromosome
• Most powerful when genetic and physical
mapping are combined
Genetic mapping I
• Based on recombination frequencies
▫ The further away two points are on a
chromosome, the more recombination there is
between them
• Because recombination frequencies vary along a
chromosome, genetic mapping gives only the relative
positions of the loci, not exact physical distances
Genetic mapping II
• Genetic mapping requires that a cross be
performed between two related organisms
▫ The organisms should have phenotypic differences
resulting from allele differences at two or more
loci
• The frequency of recombination is determined
by counting the F2 progeny with each phenotype
Genetic mapping example I
• Genes on two
different
chromosomes
▫ Independent
assortment during
meiosis
▫ No linkage
[Figure: F1 cross; progeny phenotypic ratio 9 : 3 : 3 : 1]
Genetic mapping example II
• Genes very close
together on same
chromosome
▫ Will usually end up
together after meiosis
▫ Tightly linked
[Figure: F1 cross; progeny ratio 1 : 2 : 1]
Genetic mapping example III
• Genes on same
chromosome, but not
very close together
▫ Recombination will
occur
▫ Frequency of
recombination
proportional to
distance between
genes
▫ Measured in
centiMorgans
[Figure label: recombinants]
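The link between progeny counts and map distance can be sketched in a few lines. The sketch assumes the progeny have already been scored as parental or recombinant types (as in a simple testcross; scoring F2 progeny takes more careful bookkeeping), and it uses the convention that 1% recombination corresponds to 1 centiMorgan, which holds approximately for closely spaced loci only. The counts are made up for illustration.

```python
def recombination_frequency(parental, recombinant):
    """Fraction of scored progeny that are recombinant."""
    total = parental + recombinant
    if total == 0:
        raise ValueError("no progeny counted")
    return recombinant / total

def map_distance_cm(rf):
    """Approximate genetic distance in centiMorgans (1% recombination = 1 cM)."""
    return 100 * rf

# Hypothetical counts: 920 parental-type and 80 recombinant-type progeny.
rf = recombination_frequency(parental=920, recombinant=80)
distance = map_distance_cm(rf)
```

With these counts the recombination frequency is 0.08, placing the two loci roughly 8 cM apart. Unlinked loci give a frequency near 0.5, so the measure stops being informative for distant genes.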
Genetic markers
• Genetic mapping measures distances between positions on
chromosomes
▫ Positions can be genes
▪ Responsible for phenotype
▪ Examples: eye color or disease trait
▫ Positions can be physical markers
▪ DNA sequence variation
Physical mapping
• Determination of physical distance between two
points on chromosome
▫ Distance in base pairs
• Physical markers are DNA sequences that vary
between two related genomes
▫ Referred to as a DNA polymorphism
▫ Usually not in a gene
▫ Examples
▪ RFLP
▪ SSLP
▪ SNP
RFLP
• Restriction-fragment length polymorphism
▫ Cut genomic DNA from two individuals with
restriction enzyme
▫ Run Southern blot
▫ Probe with different pieces of DNA
▫ Sequence difference creates different band pattern
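[Figure: restriction maps of individuals 1 and 2 showing GGATCC sites, variant sites where a base change destroys the site (GTATCC, GCATCC), and fragment lengths of 200, 400 and 600; Southern blot lanes 1 and 2 with probed bands marked *]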
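The idea behind an RFLP can be sketched as an in-silico digest: a single base change that destroys a restriction site merges two fragments into one longer fragment. The sketch assumes the GGATCC site from the figure is cut after its first base (the BamHI pattern), uses made-up short sequences scaled down from the 200/400/600 example, and reports every fragment, whereas a Southern blot would show only the fragments that hybridize to the probe.

```python
def digest(seq, site="GGATCC", cut_offset=1):
    """Cut seq at every occurrence of site; return fragment lengths in order."""
    seq = seq.upper()
    cuts = []
    i = seq.find(site)
    while i != -1:
        cuts.append(i + cut_offset)      # position of the cut within the sequence
        i = seq.find(site, i + 1)
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]

# Hypothetical individual 1: three sites, 20 and 40 bases apart.
individual_1 = "GGATCC" + "A" * 14 + "GGATCC" + "T" * 34 + "GGATCC"
# Individual 2: one base change turns the middle GGATCC into GTATCC.
individual_2 = individual_1[:21] + "T" + individual_1[22:]

fragments_1 = digest(individual_1)
fragments_2 = digest(individual_2)
```

Individual 1 gives internal fragments of 20 and 40 bases, while individual 2 gives a single 60-base fragment in their place, which is the difference in band pattern that the blot detects.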
SSLP
• Simple-sequence length polymorphism
• Most genomes contain repeats of three or four nucleotides
• Length of repeat varies
• Use PCR with primers external to the repeat region
• On gel, see difference in length of amplified fragment
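1 ATCCTACGACGACGACGATTGATGCT (repeat region of 12 nucleotides)
2 ATCCTACGACGACGACGACGACGATTGATGCT (repeat region of 18 nucleotides)
[Figure: gel with lanes 1 and 2 showing the difference in fragment length]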
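The two alleles shown on this slide can be used to sketch why PCR across the repeat gives products of different length. The flanking sequences `ATCCT` and `ATTGATGCT` are read off the slide's example and stand in for the primer binding sites; for simplicity both are given on the top strand, whereas a real reverse primer would be the reverse complement of the right flank. The function name is an assumption.

```python
def amplicon_length(template, left_flank, right_flank):
    """Length of the product from the start of left_flank to the end of right_flank."""
    start = template.find(left_flank)
    if start == -1:
        raise ValueError("left flank not found")
    end = template.find(right_flank, start + len(left_flank))
    if end == -1:
        raise ValueError("right flank not found")
    return end + len(right_flank) - start

allele_1 = "ATCCTACGACGACGACGATTGATGCT"          # 4 copies of ACG
allele_2 = "ATCCTACGACGACGACGACGACGATTGATGCT"    # 6 copies of ACG

len_1 = amplicon_length(allele_1, "ATCCT", "ATTGATGCT")
len_2 = amplicon_length(allele_2, "ATCCT", "ATTGATGCT")
difference = len_2 - len_1
```

The products are 26 and 32 bases long, so the alleles differ by 6 bases (two copies of the three-base repeat), a difference that can be resolved on a gel.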
SNP
• Single-nucleotide polymorphism
▫ One-nucleotide difference in sequence of two
organisms
▫ Found by sequencing
▫ Example: Between any two humans, on average
one SNP every 1,000 base pairs
1 ATCGATTGCCATGAC
2 ATCGATGGCCATGAC
(SNP at position 7: T in sequence 1, G in sequence 2)
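Once two sequences are aligned, spotting SNPs is a position-by-position comparison. The sketch below uses the two sequences from this slide and assumes they are already aligned and of equal length (no insertions or deletions); the function name is illustrative.

```python
def find_snps(seq_a, seq_b):
    """Return (position, base_a, base_b) for each mismatch, positions 1-based."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [(i + 1, a, b)
            for i, (a, b) in enumerate(zip(seq_a, seq_b))
            if a != b]

seq_1 = "ATCGATTGCCATGAC"
seq_2 = "ATCGATGGCCATGAC"
snps = find_snps(seq_1, seq_2)

# Rough genome-wide expectation from the slide's figure of 1 SNP per 1,000 bp.
expected_snps_between_two_humans = 3_000_000_000 // 1_000
```

The comparison reports a single difference at position 7 (T versus G). At one SNP per 1,000 base pairs, two human genomes of 3 billion base pairs would be expected to differ at about 3 million positions.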
Expressed Sequence Tags (EST)
• Idea: sequence only
“important” genes
▫ Those genes
expressed in a
particular tissue
• Sequence random
cDNAs made from
RNA extracted from
tissue of interest
[Figure labels: muscle mRNA; cDNA libraries; “New” Biolims; robotized stations; DNA sequencers]
EST sequencing II
• Make cDNA library
• Select clones at
random
• Sequence in from one
or both ends
▫ One-pass sequencing
• The resulting partial
sequence = expressed
sequence tag (EST)
[Figure: cDNA with 5’ and 3’ ends; partial sequence = EST]
EST sequencing: pros and cons
• Advantages
▫ Relatively inexpensive
▫ Certainty that sequence
comes from transcribed gene
▫ Information about tissue and
developmental stage
• Disadvantages
▫ No regulatory information
▫ Usually less than 60% of
genes found in EST
collections
▫ Location of sequence in
genome unknown