New Sequencing Technologies
& Diploid Personal Genomes
George Church Thu 27-Apr-2006 9:30-11 Broad-MPG
Thanks to:
NHGRI Seq Tech 2004: Agencourt, 454, Microchip,
2005: Nanofluidics, Network, VisiGen
Affymetrix, Helicos, Solexa-Lynx
‘Next Generation’
Technology Development
Multi-molecule    Our role
Affymetrix        Software
Gorfinkel         Polony to Capillary
454 LifeSci       Paired ends, emulsion
Lynx/Solexa       Multiplexing & polony
Agencourt         Seq by Ligation (SbL)
Single molecules
Helicos Biosci SAB, cleavable fluors
Pacific Biosci
Agilent
Nanopores
Visigen Biotech
Complete Genomics SbL
Sequencing components
1. Applications & goals
2. Cost, accuracy, continuity goals
3. Source, consent, ELSI
4. Sample prep
5. Technology development, deployment, scaling
6. Software: data acquisition to interpretation
7. Human interface, education
Sequencing applications
1. Environment (genetic): maternal, allergens, microbes
2. Small mutations: whole genome vs targeted
3. DNA copy number & rearrangements (paired ends)
4. Exons conserved &/or mutable regions
5. Haplotype: LD &/or causative combinations in cis
6. RNA Digital Analysis of Gene Expression (by counting)
7. RNA splicing (that arrays can’t handle)
8. Proteomics: MS, Ab, aptamers
9. Metabolomics: MS, Ab, aptamers
10. Microbial evolution resequencing (needs consensus accuracy)
11. Cancer resequencing
12. Gene synthesis by sequencing (needs raw accuracy)
13. DNA methylation
Why single chromosome sequencing?
(or single cell or single particle?)
(1) When we only have one cell as in Preimplantation
Genetic Diagnosis (PGD) or environmental samples
(2) Sequence relations >100 kbp (haplotypes)
(3) Prioritizing or pooling (rare) species based
on an initial DNA screen
(4) Anything relating 2 or more chromosomes
(in a cell or virus)
(5) Cell-cell interactions
(e.g. predator-prey, symbionts, commensals, parasites, etc)
Sequencing/genotyping on single human chromosomes
Method#1: ‘in situ’ haplotyping
153 Mbp
Zhang et al. Nature Genet. Mar 2006
Sequencing/genotyping on single human chromosomes
Method#2: Chromosome dilution library
QC: Reverse-FISH of amplicons
Amplicon 19
Amplicon 6q
Single chromosome molecule
sequencing
• How?
– Isothermal Strand Displacement Amplification
from a single chromosome (Ploning)
– Shotgun sequencing on the amplicon
• Challenges
– Non-specific amplification competes with a single
template molecule
– Amplicons have high-order DNA structures, which
creates issues in sequencing library construction
Single cell chromosome molecule sequencing
Reduce chimeras when cloning from SDA Plones: from 19% to 6%
S1 nuclease digestion
Phi-29 debranching
DNA pol I nick translation
Single cell chromosome molecule sequencing
Ploning & sequencing 2.5 Mbp molecules
Chromosome #          #1          #2
# Good seq reads      7,166       10,660
Average length (bp)   769.4       676.6
Total length (bp)     5,513,520   7,212,556
# unknown seqs        12          10
# vectors             23          44
# other seqs          74          2
% genome sampled      63%         67%
Plone amplification errors: < 1.7×10^-5
Integrated Polony Sequencing Pipeline
(open source hardware, software, wetware)
In vitro paired tag libraries
Bead polonies via emulsion PCR
Enrich amplified beads
Monolayer gel immobilization
SBE or SBL sequencing
Epifluorescence & Flow Cell
SOFTWARE: Images → Tag Sequences; Tag Sequences → Genome
Shendure, Porreca, Reppas, Lin, McCutcheon, Rosenbaum,
Wang, Zhang, Mitra, Church (2005) Science 309:1728.
Paired-end libraries
Shendure, Porreca, et al. (2005) Science 309: 1728
Margulies et al. (2005) Nature 437: 376.
[Diagram: paired-end library construction. Labels: shear or Nla III digest; ligate, select; dilute, ligate; amplify; Mme I digest; ligate; amplify; hRCA; ePCR; adapters L, M, R]
Distribution of Distances Between Mate-Paired Tags
[Figure: histogram of frequency vs distance (bp), axis marks at 1.0 kb and 2.0 kb; 980 ± 96 bp; FT inset: 10.7 bp]
[Diagram: ePCR bead amplicon, 5’-L-Tag 1-M-Tag 2-R-3’; labels of 7 bp and 6 bp at each tag]
4 positions for paired-end anchor 'primers'
Each yields 6 to 7 bp of contiguous sequence
34 bp new sequence per 135 bp amplicon
Sequencing by Ligation (SBL) with
fluorescent combinatorial 9-mers
5’-Cy5-nnnnAnnnn-3’
5’-Cy3-nnnnGnnnn-3’
5’-TR-nnnnCnnnn-3’
5’-Cy3+Cy5-nnnnTnnnn-3’
Fluor      Excitation (nm)   Emission (nm)
Cy5        647               700
Cy3        555               605
TR         572               630
Cy3+Cy5    555               700
5'PO4
ACUCAUC…
(3’)…TAGAGT????????????????TGAGTAG…(5’)
Shendure, Porreca, et al. (2005) Science 309:1728
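The fluor-to-base assignment of the four 9-mer pools suggests a simple per-cycle base call: image each bead in the four channels and report the base whose channel is brightest. The following is only a minimal sketch of that idea. The function names and the intensity dictionaries are hypothetical, and a real pipeline would also need image registration, normalization and quality scoring, none of which is shown here.

```python
# Fluor-to-base mapping as listed on the slide for the combinatorial 9-mers.
CHANNEL_TO_BASE = {"Cy5": "A", "Cy3": "G", "TR": "C", "Cy3+Cy5": "T"}

def call_base(intensities):
    """Return the base for the brightest channel (hypothetical helper).

    `intensities` maps a channel name to a background-corrected intensity
    for one bead in one ligation cycle.
    """
    brightest = max(intensities, key=intensities.get)
    return CHANNEL_TO_BASE[brightest]

def call_read(cycles):
    """Concatenate the per-cycle base calls for one bead."""
    return "".join(call_base(cycle) for cycle in cycles)

# Illustrative, made-up intensities for three cycles on one bead.
example_cycles = [
    {"Cy5": 900, "Cy3": 120, "TR": 80, "Cy3+Cy5": 60},
    {"Cy5": 100, "Cy3": 90, "TR": 750, "Cy3+Cy5": 70},
    {"Cy5": 50, "Cy3": 60, "TR": 40, "Cy3+Cy5": 640},
]
print(call_read(example_cycles))
```

Taking the maximum keeps the sketch short; it does not model crosstalk between the Cy3 and Cy3+Cy5 channels, which share an excitation wavelength.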
Automation Schematic
[Diagram labels: microscope & xyz stage; HPLC autosampler; flow-cell; (96 wells); temperature control; syringe pump]
Off the Shelf Instrumentation
$140,000
Mitra
Shendure
Porreca
Image Collection & Data Processing
514 raster positions x 4 images per cycle
26 cycles of sequencing
2 additional image sets for object-finding algorithms
54996 images (1000 x 1000, 14-bit)
100 GBytes
5M reads
$500 run
Porreca et al.
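The image counts and the data volume on this slide can be cross-checked with a few lines of arithmetic. This is a rough sketch. It assumes, which the slide does not state, that each 14-bit pixel is stored in a 16-bit word, and it treats whatever is left after the sequencing cycles as the object-finding image sets.

```python
# Figures taken from the slide.
RASTER_POSITIONS = 514
IMAGES_PER_CYCLE = 4
CYCLES = 26
TOTAL_IMAGES = 54_996
PIXELS_PER_IMAGE = 1000 * 1000

# Assumption (not on the slide): each 14-bit pixel occupies a 16-bit word.
BYTES_PER_PIXEL = 2

def sequencing_images(positions, images_per_cycle, cycles):
    """Images acquired during the sequencing cycles themselves."""
    return positions * images_per_cycle * cycles

def raw_gigabytes(images, pixels, bytes_per_pixel):
    """Uncompressed pixel data in decimal gigabytes (1 GB = 1e9 bytes)."""
    return images * pixels * bytes_per_pixel / 1e9

cycle_images = sequencing_images(RASTER_POSITIONS, IMAGES_PER_CYCLE, CYCLES)
# Remainder, presumably the additional object-finding image sets.
other_images = TOTAL_IMAGES - cycle_images
total_gb = raw_gigabytes(TOTAL_IMAGES, PIXELS_PER_IMAGE, BYTES_PER_PIXEL)

print(cycle_images, other_images, round(total_gb, 1))
```

Under the 16-bit storage assumption the total comes to roughly 110 GB, which is in line with the 100 GBytes quoted on the slide.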
Open Source Readmapper
v1.0 (Shendure, Porreca et al)
• Hash all the reads (n)
• Scan genome (m), and for each window:
– Does current window exist in hash?
– If so, move downstream, scan d positions & test hash for membership
m + (n * d) = 10+ hours, 20 nodes, 1.6e6 reads
v2.0 (Gary Gao, Sasha Wait)
• Hash all possible reads from genome (m)
• Scan the reads (n), and for each:
– Does it occur in the hash?
– If so, does the second exist?
– If so, take union (k)
n * k = 10 hours, 1 node, 1.6e6 reads
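The v2.0 strategy (index the genome once, then look up each paired read) can be sketched in a few lines. This is not the actual readmapper code. It is a minimal illustration that assumes exact matches on one strand only, a fixed tag length k, and a simple distance window for combining the two tag hits. All function and parameter names are hypothetical.

```python
from collections import defaultdict

def index_genome(genome, k):
    """Map every k-mer in the genome to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index

def place_pairs(index, pairs, min_dist, max_dist):
    """Place each (tag1, tag2) pair on the genome.

    A placement (p1, p2) is kept when both tags occur in the index and
    min_dist <= p2 - p1 <= max_dist. Pairs with no valid placement are omitted.
    """
    placements = {}
    for tag1, tag2 in pairs:
        hits1 = index.get(tag1, [])
        if not hits1:
            continue  # first tag absent: skip without looking up the second
        hits2 = index.get(tag2, [])
        found = [(p1, p2) for p1 in hits1 for p2 in hits2
                 if min_dist <= p2 - p1 <= max_dist]
        if found:
            placements[(tag1, tag2)] = found
    return placements

# Toy example with a made-up 20 bp "genome" and 4 bp tags.
toy_genome = "AAAACCCCGGGGTTTTACGT"
toy_index = index_genome(toy_genome, 4)
toy_pairs = [("AAAA", "TTTT"), ("AAAA", "GGGG"), ("NNNN", "TTTT")]
print(place_pairs(toy_index, toy_pairs, min_dist=10, max_dist=14))
```

Indexing the genome rather than the reads means the index is built once and reused for every run, which is one plausible reason the v2.0 figures quote a single node.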
Error quantitation
6X consensus <3E-7
[>Q65, 99.99997%]
Median raw
Polony = 3E-3 (99.7%)
454 raw = 4E-2 (96%)
Shendure, Porreca et al, 2005
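The Q value and the percentages quoted here follow from the standard Phred relation Q = -10·log10(p), where p is the per-base error rate. A small sketch of the conversion, with illustrative function names:

```python
import math

def phred(error_rate):
    """Phred-scaled quality for a per-base error probability."""
    return -10 * math.log10(error_rate)

def accuracy_percent(error_rate):
    """Per-base accuracy expressed as a percentage."""
    return 100 * (1 - error_rate)

# Error rates quoted on the slide.
for label, rate in [("6X consensus", 3e-7), ("Polony raw", 3e-3), ("454 raw", 4e-2)]:
    print(f"{label}: Q{phred(rate):.1f}, {accuracy_percent(rate):.5f}%")
```

With these definitions an error rate of 3E-7 corresponds to just over Q65, consistent with the ">Q65" annotation.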
ABI 454 Sep05
$/kb @4E-5
$/3e9@1X
Paired ends
Device $
$7
Sep05 Feb 06
$9
0.8
0.07
300K $30K
no
500K
yes
140K
3M
yes
300K
Polony
[Figure: Cost vs consensus error rate. Series: SBL $/kb, ABI $/kb, 454 $/kb; x-axis: consensus error rate (1E-8 to 1E-2); y-axis ticks 0 to 18]
Why low error rates?
Goal of genotyping & resequencing → Discovery of variants
E.g. cancer somatic mutations ~1E-6 (or lab evolved cells)
Consensus error rate          Total errors (E.coli)   (Human)
1E-4  Bermuda/Hapmap          500                     600,000
4E-5  454 @40X                200                     240,000
3E-7  Polony-SbL @6X          0                       1800
1E-8  Goal for 2006           0                       60
Also, effectively reduce (sub)genome target size by enrichment for
exons or common SNPs to reduce cost & # false positives.
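The error totals on this slide are consistent with multiplying the consensus error rate by the number of bases sequenced. The sketch below makes that explicit. The genome sizes are assumptions chosen to match the slide's numbers: about 5e6 bp for E. coli and 6e9 bp for a diploid human genome. The target-fraction argument illustrates the enrichment point, with 3% used only as an example value.

```python
# Assumed genome sizes (not stated on the slide; chosen to match its totals).
ECOLI_BP = 5e6
HUMAN_DIPLOID_BP = 6e9

def expected_errors(error_rate, genome_bp, target_fraction=1.0):
    """Expected number of consensus errors over the sequenced target."""
    return error_rate * genome_bp * target_fraction

for rate in (1e-4, 4e-5, 3e-7, 1e-8):
    print(rate,
          round(expected_errors(rate, ECOLI_BP), 1),
          round(expected_errors(rate, HUMAN_DIPLOID_BP)))

# Restricting the target (for example to a 3% subset) scales false positives down.
print(round(expected_errors(3e-7, HUMAN_DIPLOID_BP, target_fraction=0.03)))
```

For E. coli at 3E-7 this rule gives about 1.5 expected errors, whereas the slide lists 0, so that entry may be an observed count rather than an expectation.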
Mutation Discovery in Engineered/Evolved E.coli
Position    Type    Gene    Location       ABI Confirm   Comments
986,334     T>G     ompF    Promoter-10                  Only in evolved strain
985,797     T>G     ompF    Glu > Ala                    Only in evolved strain
931,960     Δ8 bp   lrp     frameshift                   Only in evolved strain
3,957,960   C>T     ppiC    5' UTR                       MG1655 heterogeneity
λ-3274      T>C     cI      Glu > Glu                    λ-red heterogeneity
λ-9846      T>C     ORF61   Lys > Gly                    λ-red heterogeneity
Shendure, Porreca, et al. (2005) Science 309:1728
Sequence monitoring of evolution
(optimize small molecule synthesis/transport)
[Figure: doubling time (hr, 0 to 8) vs # of passages (0 to 150); series Q1, Q2-1, Q2-2, Q3; annotations: Sequence trp-, EcNR1]
Reppas, Lin & Church
ompF - non-specific transport channel
Can increase import & export capability simultaneously
AAAGAT / CAAGAT (position labels: -12 -11 -10 -9 -8 -7 -6)
• Promoter mutation at position (-12)
• Makes -10 box more consensus-like
• Glu-117 → Ala (in the pore)
• Charged residue known to affect pore size and selectivity
Co-evolution of mutual biosensors
sequenced across time & within each time-point
3 independent lines of Trp/Tyr co-culture frozen.
OmpF: 42R-> G, L, C, 113 D->V, 117 E->A
Promoter: -12A->C, -35 C->A
Lrp: 1bp deletion, 9bp deletion, 8bp deletion, IS2 insertion, R->L in DBD.
Heterogeneity within each time-point reflecting colony heterogeneity.
Mixture of wild & 2kb Inversion (pin)
[Figure: proximal and distal tag placements across 1,206k to 1,210k; red = same strand, black = opposite strand; pairs with incorrect distance marked]
Using paired ends, rearrangement & copy-number detection is
>1000X easier than point mutation detection (6X vs 6000X)
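One way to see why paired ends make rearrangements easy to spot: a mate pair that spans an inversion breakpoint lands with the wrong relative strand, the wrong spacing, or both, so a single pair can flag an event. The sketch below is a hedged illustration only. The class and function names are hypothetical, the 980 ± 96 bp spacing is borrowed from the mate-pair distance slide, the 3 SD tolerance is an arbitrary choice, and which relative strand counts as "normal" is left as a parameter because the slide does not say.

```python
from dataclasses import dataclass

@dataclass
class TagPlacement:
    position: int  # genome coordinate of the mapped tag
    strand: str    # '+' or '-'

def classify_pair(proximal, distal, mean=980, sd=96, n_sd=3,
                  expect_same_strand=False):
    """Label a mate pair as consistent or anomalous (illustrative rule).

    A pair is anomalous if its relative strand differs from the expected
    one, or if its spacing is more than n_sd standard deviations from
    the mean library spacing.
    """
    same_strand = proximal.strand == distal.strand
    if same_strand != expect_same_strand:
        return "unexpected strand"
    distance = abs(distal.position - proximal.position)
    if abs(distance - mean) > n_sd * sd:
        return "incorrect distance"
    return "consistent"

# Made-up placements near the region shown on the slide.
print(classify_pair(TagPlacement(1_206_000, "+"), TagPlacement(1_206_980, "-")))
print(classify_pair(TagPlacement(1_206_500, "+"), TagPlacement(1_207_400, "+")))
print(classify_pair(TagPlacement(1_206_000, "+"), TagPlacement(1_209_000, "-")))
```

Because each anomalous pair is informative on its own, modest coverage can suffice, in contrast to point mutations where many overlapping reads are needed to separate true variants from errors.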
Open source
hardware, software, wetware
Human Diplome Sequencing Strategies
Diplome chromosome dilution shotgun (0.01X $300)
Exons & conserved 3% (6X $9K)
1M Causative Genome Changes CGCs (10X MIP pool $20)
40K RNA diplome (10X MIP pool $20)
Strand displacement amplification (ploning)
Polony sequencing 7E8 pixels
Chip Genotyping/Haplotyping
Personal Genome Project (ELSI)
Personal Genome Project (ELSI)
Padlock, Molecular Inversion Probes (MIPs)
Causative Genomic Changes (CGCs, e.g. conserved 3%)
(not restricted to Single Nucleotides or Polymorphisms >1%)
[Diagram: molecular inversion probe annealed to genomic DNA; labels: universal primers, L, R, optional multiplex tag; alternative alleles CG, CA, TG]
Hardenbol .. Landegren Davis et al. Multiplexed genotyping with sequence-tagged
molecular inversion probes. Nat Biotechnol. 2003 21:673-8 .
“10,000 targeted SNPs genotyped in a single tube assay.” Genome Res. 2005 15:269
Vitkup, Sander, Church (2003) The Amino-acid Mutational Spectrum of Human
Genetic Disease. Genome Biol. 4: R72. (CG to CA, TG)
MIPs for VDJ Polonies
Over the whole field of human T-cells
1 TRAC + 2 TRBC primers
cDNA
xxx
47 TRAV * 50 TRAJ + 46 TRBV * 13 TRBJ = 2948 MIP oligos
or
47 TRAV * 1 TRAC + 46 TRBV * 2 TRBC = 139 MIP oligos
In situ RCA or PCR for each T-cell
Polony sequencing of tag &/or gap fill (e.g. 18 to 33bp in CDR3)
(two tags per cell sufficient?)
http://www.infobiogen.fr/services/chromcancer/Genes/TCRBID24.html
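The two oligo counts on this slide can be reproduced with a short calculation. The sketch assumes the second V-segment count (46) refers to the beta chain, pairing with the beta J and beta C segments. Function and parameter names are illustrative.

```python
def vj_design(alpha_v, alpha_j, beta_v, beta_j):
    """One MIP per V/J combination, alpha and beta chains counted separately."""
    return alpha_v * alpha_j + beta_v * beta_j

def vc_design(alpha_v, alpha_c, beta_v, beta_c):
    """One MIP per V/constant-region combination (far fewer oligos)."""
    return alpha_v * alpha_c + beta_v * beta_c

# Segment counts as given on the slide.
print(vj_design(alpha_v=47, alpha_j=50, beta_v=46, beta_j=13))
print(vc_design(alpha_v=47, alpha_c=1, beta_v=46, beta_c=2))
```

Anchoring on the constant regions cuts the probe set by roughly twentyfold, at the cost of relying on gap fill or sequencing to recover the J segment and CDR3.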
Human subjects consent
“Because the database will be public, people who do identity
testing, such as for paternity testing or law enforcement, may also
use the samples, the database, and the HapMap, to do general
research. However, it will be very hard for anyone to learn anything
about you personally from any of this research because none of
the samples, the database, or the HapMap will include
your name or any other information that could identify
you or your family.”
http://www.hapmap.org/downloads/elsi/CEPH_Reconsent_Form.pdf
YRI = Yoruba, Ibadan, Nigeria
JPT = Japan, Tokyo
CHB = China (Han) Beijing
CEU = CEPH (N&W Europe) Utah
Is anonymity in genomics realistic?
http://arep.med.harvard.edu/PGP/Anon.htm
(1) Re-identification after “de-identification” using other public data.
Group Insurance Commission list of birth date, gender, and zip code was sufficient to re-identify medical records of Governor Weld & family via voter-registration records (1998)
(2) Hacking. “Drug Records, Confidential Data vulnerable via Harvard ID number &
PharmaCare loophole” (2005). A hacker gained access to confidential medical info at the
U. Washington Medical Center -- 4000 files (names, conditions, etc, 2000)
(3) Combination of surnames from genotype with geographical info
An anonymous sperm donor was traced on the internet 2005 by his 15 year old son who
used his own Y chromosome genealogy to access surname relations.
(4) Inferring phenotype from genotype Markers for eye, skin, and hair color, height,
weight, racial features, dysmorphologies, etc. are known & the list is growing.
(5) Unexpected self-identification. An example of this at Celera undermined confidence
in the investigators. Kennedy D. Science. 2002 297:1237. Not wicked, perhaps, but tacky.
(6) A tiny amount of DNA data in the public domain with a name leverages the rest.
This would allow the vast amount of DNA data in the HapMap (or other study) to be
identified. This can happen for example in court cases even if the suspect is acquitted.
(7) Identification by phenotype. If CT or MR imaging data is part of a study, one could
reconstruct a person’s appearance. Even blood chemistry can be identifying in some cases.
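Route (1) is essentially a database join: records stripped of names still carry quasi-identifiers, and a public list with the same fields plus names links them back. The sketch below illustrates the mechanism with entirely fictional records. The field names and the function are hypothetical, and nothing here reproduces the actual 1998 data.

```python
def reidentify(deidentified, public, keys=("birth_date", "gender", "zip")):
    """Link de-identified records to names via shared quasi-identifiers.

    Only unambiguous links are returned: a record is matched when exactly
    one person in the public list shares all key fields.
    """
    index = {}
    for person in public:
        index.setdefault(tuple(person[k] for k in keys), []).append(person["name"])
    matches = {}
    for record in deidentified:
        names = index.get(tuple(record[k] for k in keys), [])
        if len(names) == 1:
            matches[record["record_id"]] = names[0]
    return matches

# Fictional example data.
medical = [
    {"record_id": "r1", "birth_date": "1950-01-02", "gender": "M", "zip": "00001"},
    {"record_id": "r2", "birth_date": "1971-05-06", "gender": "F", "zip": "00002"},
]
voters = [
    {"name": "Person A", "birth_date": "1950-01-02", "gender": "M", "zip": "00001"},
    {"name": "Person B", "birth_date": "1971-05-06", "gender": "F", "zip": "00002"},
    {"name": "Person C", "birth_date": "1971-05-06", "gender": "F", "zip": "00002"},
]
print(reidentify(medical, voters))
```

In this toy example record r1 links to a single name while r2 stays ambiguous, which is why the uniqueness of the key combination in the population matters so much.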
"Open-source"
Personal Genome Project (PGP)
• Harvard Medical School IRB Human Subjects protocol
submitted Sep-2004, approved Aug-2005 renewed Feb-2006.
• Start with 3 highly-informed individuals consenting to non-anonymous genomes & extensive phenotypes (medical records,
imaging, omics).
• Cell lines in Coriell NIGMS Repository
Church GM (2005) The Personal Genome Project
Nature Molecular Systems Biology doi:10.1038/msb4100040
Kohane IS, Altman RB. (2005) Health-information altruists--a potentially
critical resource. N Engl J Med. 10;353(19):2074-7.
Discussion: Ascertainment bias vs.
risk of disclosure without consent.
It is likely that less-privileged citizens ‘might be’
less likely to volunteer & will be more likely
to volunteer due to higher financial risk.
These same people ‘might be’ even less likely
to volunteer if the data might become public.
These same folks might be especially
impacted socially if identifying (genome
and/or phenome) data were to get out after
they were assured that it would not.
Proposal for multi-tiered (re)consent of
subjects in genomic studies
Five categories:
1) Withdrawal from studies due to new information on risks
(all data destroyed).
2) Highest security (possibly higher than the original study)
encryption, aggressive de-identification, only expert access
with IRB-approval of each person, not whole teams. Consent
form clearly states the risks (see previous slides).
3) Medium security, similar to current practice, but consented as
above. IRB approval for teams to download de-identified
data.
4) Open-PGP-type security. Click-through agreement. IRB-approval only for data collection, not for data reading.
5) Fully open. No IRB approval; full web access e.g. subject
initiated.