Chapter 2 PowerPoint Slides

ASHG Redux
2008
• Session -- Using DNA sequence to detect
variation related to disease
– Richard Wilson – WashU – deep sequencing of
cancer tumors (AML) identified variations in 8
genes
– Richard Gibbs – Baylor College of Medicine –
"Complete Genomics" – genome for < $5,000
• Accurate sequencing by hybridization for DNA
diagnostics and individual genomics, Drmanac, et
al., Nature Biotechnology
1
ASHG Redux
• Session -- Using DNA sequence to detect
variation related to disease
– Michael Stratton – Wellcome Trust Sanger
Institute – genomic sequencing of breast cancer
cell lines
• Copy number variations ("structural variants")
• "genomic shards" – 305 rearrangements in breast
cancer cell line
• Difficult to assemble with short-read technology
2
ASHG Redux
• Session – Genomics I
– Sharp – whole genome screen for novel
imprinting genes
• Bisulphite treatment – converts all unmethylated C's
to U (uracil) -- then sequence; the C's that remain
mark the methylated sites
• Drawback – harsh, fragments DNA
– High density HapMap of Humans, Dogs, and
Cattle
• Genotyped 900 dogs w/ Affy 2.0 array at 61,344
SNPs
• Dogs have very uniform phylogenetic tree with
breed-specific recombination rates
3
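The bisulphite idea from the imprinting talk can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the function names, the example sequence, and the methylated positions are made up for illustration, and the U produced by the treatment is shown as T because that is how it reads after amplification and sequencing.

```python
def bisulfite_convert(seq, methylated_positions):
    """Return how `seq` reads after bisulphite treatment and sequencing.

    Unmethylated C is converted to U, which reads as T after amplification.
    Methylated C is protected and still reads as C.
    """
    methylated = set(methylated_positions)
    out = []
    for i, base in enumerate(seq):
        if base == "C" and i not in methylated:
            out.append("T")
        else:
            out.append(base)
    return "".join(out)


def call_methylated(reference, converted):
    """Positions where the reference has C and the converted read still has C."""
    return [
        i
        for i, (ref_base, conv_base) in enumerate(zip(reference, converted))
        if ref_base == "C" and conv_base == "C"
    ]


# Hypothetical example: C's at positions 1 and 5 are methylated.
reference = "ACGTCCGATC"
converted = bisulfite_convert(reference, methylated_positions=[1, 5])
print(converted)                              # ACGTTCGATT
print(call_methylated(reference, converted))  # [1, 5]
```

Comparing the treated read against the untreated reference is what makes the surviving C's informative; the sketch ignores the DNA fragmentation that the slide lists as the drawback.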
ASHG Redux
• Session – Genomics I
– Biesecker – ClinSeq – effort to map phenotypic
features to genotypes for atherosclerosis
• 1000 subjects
Figure (diagram text only): clinical data and genome for the subjects; plot of Penetrance vs. SNP Freq (axis value 0.5); labels: "Rare", "Common", "Mendelian", "Variants", "Common SNPs", "Unknown Territory", "Desired Data"
4
ASHG Redux
• Session – Genomics II
– BGI (Beijing Genomics Institute)
• First Asian genome sequenced
• 100 bioinformaticians (-> 300)
• 18 Solexas
• 5 454's
• 4 SOLiDs (?)
– Altshuler (1000 Genomes Project) – effort to sequence
1000 genomes to catalogue variations in genome
• www.1000genomes.org
• Duplicated amount of sequence in GenBank in Sept.
• Again in October
• Data release – Jan 2009
5
6
Genome Sequence
Reference:
"Discovering Genomics, Proteomics, and
Bioinformatics." Second Edition 2/e.
Campbell and Heyer. 2007. ISBN: 0-8053-8219-4.
Chapter 2:
7
genomics
• reduction -- for a very long time molecular
methods were primarily tools to dissect
cells and understand how parts work in
isolation
• expansion -- genomics, in theory, enables
science to begin piecing together how parts
work together as a system (systems
biology?)
8
Overview
• What is Genomics?
• How to sequence a genome?
• Annotating (annotation)
• Protein function
• Gene Ontology
9
Genomics
• "involves large data sets"
– human genome -- 3 billion nucleotides
– hundreds of genomes have been finished
• "high-throughput methods"
– sequencing
– measuring the expression of all genes
– genotyping (1,000,000 SNPs on 1 chip)
• other -omes
– proteome, transcriptome, metabolome, variome?, exome
– http://cancergenome.nih.gov/media/process_textonly.asp
10
How do we sequence a genome?
• preliminary sequencing
• finishing (not always performed -- coverage)
• annotating
• The "dideoxy method"
• Need (for DNA replication):
– DNA, DNA polymerase, primers, deoxyribonucleotide
triphosphates (dNTPs: G, T, A, C; one with
radioactive atoms), dideoxyribonucleotide
triphosphates (ddNTPs)
11
Dideoxy Method Obsolete?
• Next-generation sequencing technology
– Cost per nucleotide down by factor of 100-1000
– Cost per run is still very high
– Expen$ive for validation on an individual basis
– Dideoxy method is very mature, very well understood
12
dideoxy method
• Under normal DNA polymerization, dNTPs are added to
the end of the elongating strand of DNA.
• If a ddNTP is incorporated, the elongation terminates -- it
also carries a "label" -- radioactive isotope or fluorescent dye
• This is performed in 4 different containers (test tubes),
with each test tube having one of ddATP, ddGTP, ddCTP, or
ddTTP.
• Therefore, each tube terminates with the same ddNTP
• Run these out on a gel, and smallest migrate fastest.
• Expose to x-ray film (or scan with laser), read gel
13
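The four-tube logic on this slide can be sketched in a few lines. This is a simplified illustration, not lab-accurate modelling: the function names and the example template are hypothetical, the template is written in the order the polymerase reads it, and every possible termination position is assumed to be represented by some fragment.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def sanger_fragments(template):
    """Return {ddNTP base: [fragment lengths]} for the four reactions.

    The new strand is the complement of the template, built base by base.
    In the tube containing ddX, synthesis can stop wherever the new strand
    has base X, so that tube holds fragments ending at those positions.
    """
    new_strand = "".join(COMPLEMENT[base] for base in template)
    tubes = {"A": [], "C": [], "G": [], "T": []}
    for length, base in enumerate(new_strand, start=1):
        tubes[base].append(length)
    return tubes


def read_gel(tubes):
    """Read the gel from the shortest (fastest) fragment to the longest."""
    bands = sorted(
        (length, base) for base, lengths in tubes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


tubes = sanger_fragments("TACGGA")  # hypothetical template
print(tubes)            # {'A': [1], 'C': [4, 5], 'G': [3], 'T': [2, 6]}
print(read_gel(tubes))  # ATGCCT
```

Each lane only tells you which lengths end in one base; ordering all bands by length across the four lanes is what recovers the sequence of the newly synthesized strand.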
14
Figure 2.1
15
Figure 2.2
Comment
• Note -- this is pretty awful work
• The gel material is toxic
• Working with radioactive molecules
• Slow and tedious
• reading bands on glass
• capturing/entering data
• 500 bases took 24 hours (16,438 years to do the
human genome with this method)
16
Automated sequencing
• Leroy Hood -- developed nonradioactive dideoxy
method
• Each of the four ddNTPs is "labeled" with a different
fluorescent dye
• 1 lane could be used instead of 4 (why?)
• A laser excites the dye so it fluoresces, and the band can be
"read", indicating which ddNTP terminated the sequence
• The intensities of these bands are now captured
and graphed -- in what is called a chromatogram
• Lane in a gel is replaced with a capillary
• Can run 96, or 384 capillaries at a time (Applied
Biosystems)
• A run is approximately 1 hour
17
• 500 bases * 384 capillaries per 1-hour run ==> about 651 days
(under 2 years) for the human genome
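The throughput estimates on these slides can be checked with a short calculation. The sketch below assumes a 3 billion base genome read exactly once (no extra coverage), a single instrument running non-stop, and 365-day years; the function name is made up for illustration.

```python
GENOME_BASES = 3_000_000_000  # human genome size used on the slides


def years_to_sequence(bases_per_run, hours_per_run, genome_bases=GENOME_BASES):
    """Years of non-stop running needed to read the genome once."""
    runs = genome_bases / bases_per_run
    hours = runs * hours_per_run
    return hours / (24 * 365)


# Manual gels: 500 bases per 24-hour run.
manual_years = years_to_sequence(500, 24)

# Capillary instrument: 384 capillaries * 500 bases per 1-hour run.
capillary_years = years_to_sequence(500 * 384, 1)

print(f"manual:    {manual_years:,.0f} years")
print(f"capillary: {capillary_years * 365:,.0f} days ({capillary_years:.2f} years)")
```

Under these assumptions the manual method comes to about 16,438 years, while one 384-capillary instrument comes to about 15,625 hours, which is roughly 651 days rather than years. Real projects also need several-fold coverage, so actual totals would be correspondingly larger.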
18
Box 2.1 Table
Choosing genomes
• Big 7
– human, mouse, yeast, E. coli, fly, worm, arabidopsis
• medical applications
– Pseudomonas aeruginosa (CF infection), mosquito,
trypanosomes, HIV
• evolutionary significance
– microbes, archaea, chimp, gorilla, fugu fish
• environmental impact
– microbes
• food production
– wheat, rice, bovine, pig, yeast
19
20
Figure 2.3
21
Figure 2.3 (detail)
Automated Reads
• Automated sequencing almost requires automated
base-calling
– PHRED
• reads chromatograms
• quality assessment (for re-sequencing)
• peak height and spacing
– assemble multiple reads (PHRAP) into a "contig"
– What about mutations, variations, SNPs?
• Gaps
– requires human intervention -- techniques to try and
span specific DNA regions
• ex) chromosome walking
22
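The quality assessment that PHRED performs is usually expressed as a quality score Q = -10 * log10(P), where P is the estimated probability that a base call is wrong. The sketch below shows only that conversion and a simple flagging step; it does not reproduce how PHRED derives P from peak height and spacing, and the threshold of 20 and the example scores are illustrative assumptions.

```python
import math


def phred_quality(error_probability):
    """Quality score for a given probability that the base call is wrong."""
    return -10 * math.log10(error_probability)


def error_probability(quality):
    """Inverse of phred_quality: probability that the base call is wrong."""
    return 10 ** (-quality / 10)


def low_quality_positions(qualities, threshold=20):
    """Indices of base calls below the threshold (candidates for re-sequencing)."""
    return [i for i, q in enumerate(qualities) if q < threshold]


print(round(phred_quality(0.001)))              # 30: 1 error in 1,000 calls
print(error_probability(20))                    # 0.01: 1 error in 100 calls
print(low_quality_positions([35, 12, 40, 18]))  # [1, 3]
```

Scores like these let an assembler weight overlapping reads by confidence instead of treating every base call as equally reliable.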
Gaps
• 2001 draft sequence published
– 147,821 gaps
– pressure to publish a sequence because of Celera and
Craig Venter
• 2004
– 341 gaps
• Usually repeats (but may be epigenetic)
• Very expensive to completely finish
– many genomes never "finished"
23
24
Figure 2.4
25
Figure MM2.1
Show BL2SEQ example
Annotation
• "functionally" important sections of a
genome
– exons, introns, promoters, enhancers, splice
sites, UTR's,
– pseudogenes, SNPs, markers, repeats, Alus,
gene duplications, gene families, micro-RNAs,
methylation, phosphorylation, tissue specific
alternative splicing, copy number variations
(CNVs, also called "structural variations"),
differential expression, gene function, ????
26
Gene Identification
• Gene prediction (ORF finding)
– was a hot topic
– cooled when it became clear that EST sequencing was
far superior
– EST sequencing in human (and some model organisms
-- rat, mouse, others) was very extensive -- millions of
sequencing reads
– The most effective approach to gene finding was the
overlaying of EST sequences to genomic sequence (but
note you need both).
– Gene prediction accuracy was 40-60% at best
– Gene prediction has made a bit of a resurgence because
of the cost savings of "in silico" gene finding
27
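The simplest form of "in silico" gene finding is an open reading frame (ORF) scan. The sketch below is a minimal illustration, not any particular gene-prediction program: it scans only the forward strand, treats ATG as the sole start codon, and ignores introns, which is one reason this kind of approach does poorly on eukaryotic genes. The function name, the `min_codons` parameter, and the example sequence are assumptions made for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) tuples for ORFs on the forward strand.

    An ORF starts at ATG and runs to the first in-frame stop codon.
    `end` is exclusive and includes the stop codon; `min_codons` counts
    the start and stop codons.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


# Hypothetical sequence with one ORF: ATG AAA CCC TAG in frame 2.
print(find_orfs("CCATGAAACCCTAGGG"))  # [(2, 14, 2)]
```

Overlaying EST sequences on the genome, as the slide describes, sidesteps these limits because the ESTs come from transcripts that were actually expressed.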
Pseudogenes
• text -- mammalian genome contains
approximately 225 BP per KB of
pseudogenes
• What are pseudogenes?
28