Next generation sequencing -- Tutorial


Bioinformatics for next-generation DNA sequencing
Gabor T. Marth
Boston College Biology Department
BC Biology new graduate student orientation
September 2, 2008
Genetic code (DNA)
AGCGTGGTAGCGCGAGTTTGCGAGCTAGCTAGGCTCCGGATGCGA
CCAGCTTTGATAGATGAATATAGTGTGCGCGACTAGCTGTGTGTT
GAATATATAGTGTGTCTCTCGATATGTAGTCTGGATCTAGTGTTG
GTGTAGATGGAGATCGCGTAGCGTGGTAGCGCGAGTTTGCGAGCT
AGCTAGGCTCCGGATGCGACCAGCTTTGATAGATGAATATAGTGT
GCGCGACTAGCTGTGTGTTGAATATATAGTGTGTCTCTCGATATGT
AGTCTGGATCTAGTGTTGGTGTAGATGGAGATCGCGTGCTTGAG
TCGTTCGTTTTTTTATGCTGATGATATAAATATATAGTGTTGGTG
GGGGGTACTCTACTCTCTCTAGAGAGAGCCTCTCAAAAAAAAAGCT
CGGGGATCGGGTTCGAAGAAGTGAGATGTACGCGCTAGXTAGTAT
ATCTCTTTCTCTGTCGTGCTGCTTGAGATCGTTCGTTTTTTTATGCT
GATGATATAAATATATAGTGTTGGTGGGGGGTACTCTACTCTCTCT
AGAGAGAGCCTCTCAAAAAAAAAGCTCGGGGATCGGGTTCGAAGA
AGTGAGATGTACGCGCTAGXTAGTATATCTCTTTCTCTGTCGTGCT
The genome
Genome sequencing
[Figure: example genomes of increasing size: ~1 Mb, ~100 Mb, >100 Mb, ~3,000 Mb]
Next-generation sequencing machines
[Figure: bases per machine run (1 Mb to 1 Gb) plotted against read length (10 bp to 1,000 bp)]
• Illumina, AB/SOLiD short-read sequencers (1 Gb in 25-50 bp reads)
• 454 pyrosequencer (20-100 Mb in 100-250 bp reads)
• ABI capillary sequencer
Individual human resequencing
Variations at every scale of genome organization
Single-base substitutions (SNPs)
Structural variations, including large-scale chromosomal rearrangements
Insertion-deletion polymorphisms
Epigenetic variations (e.g. changes in methylation / chromatin structure)
We care about genetic variations because…
… they underlie phenotypic differences
… cause heritable diseases and determine responses to drugs
… allow tracking ancestral human history
Individual resequencing / SNP discovery
(i) base calling
(ii) micro-repeat analysis
(iii) read mapping
(iv) read assembly
(v) SNP and short INDEL calling
(vii) data validation, hypothesis generation
[Figure labels: REF, IND]
Tools
The variation discovery “toolbox”
• base callers
• read mappers
• SNP callers (GigaBayes)
• SV callers
• assembly viewers
Base calling
Quinlan et al. Nature Methods 2008
Read mapping
Read mapping is like doing a jigsaw puzzle…
…you get the pieces…
… and they give you the picture on the box
Problem is, some pieces are easier to place than others…
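To make the jigsaw analogy concrete, here is a minimal sketch of exact-match read mapping. It is a toy illustration only, not the mapper described in these slides; the function names, the toy reference, and the fixed read length are all assumptions made for the example. A read that occurs once in the reference is easy to place, while a read from a repeated region matches several positions and cannot be placed unambiguously.

```python
from collections import defaultdict

def build_index(reference, k):
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def map_read(read, index):
    """Return all exact-match start positions of a read.

    Assumes the read length equals the k used to build the index.
    """
    return index.get(read, [])

# Hypothetical toy reference containing the repeat "ACGT" twice.
reference = "ACGTACGTTTGACCA"
index = build_index(reference, 4)

unique_hits = map_read("TTGA", index)   # one placement: an "easy" piece
repeat_hits = map_read("ACGT", index)   # two placements: an ambiguous piece
missing_hits = map_read("GGGG", index)  # no placement at all
```

Real mappers also have to tolerate sequencing errors and true variation, so they allow mismatches and gaps rather than requiring exact matches.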
Read mapping
Michael Stromberg, in prep.
SNP discovery
GigaBayes
Marth et al. Nature Genetics 1999
Quinlan et al. in prep.
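As a rough illustration of what a Bayesian SNP caller does, the sketch below combines base quality values with a prior to pick the most probable base at one position. It is not GigaBayes and does not reproduce its model: it assumes a haploid sample, independent reads, Phred-scaled qualities, and an arbitrary prior SNP rate of 0.001. All names are hypothetical.

```python
import math

def phred_to_error(q):
    """Convert a Phred quality value to an error probability."""
    return 10 ** (-q / 10)

def call_base(ref_base, observations, snp_prior=0.001):
    """Return (most probable base, posterior probability) at one position.

    observations: list of (base, phred_quality) pairs from aligned reads.
    Assumes a haploid sample and independent read errors (toy model).
    """
    log_post = {}
    for candidate in "ACGT":
        # Prior: the reference base is expected unless there is a SNP.
        prior = 1 - snp_prior if candidate == ref_base else snp_prior / 3
        logp = math.log(prior)
        for base, qual in observations:
            # Clamp so that quality 0 does not produce log(0).
            err = min(phred_to_error(qual), 0.75)
            logp += math.log(1 - err) if base == candidate else math.log(err / 3)
        log_post[candidate] = logp
    # Normalize in log space for numerical stability.
    top = max(log_post.values())
    weights = {b: math.exp(lp - top) for b, lp in log_post.items()}
    total = sum(weights.values())
    best = max(weights, key=weights.get)
    return best, weights[best] / total

# Five high-quality reads disagree with the reference: a SNP candidate.
snp_call = call_base("A", [("G", 30)] * 5)
# One low-quality mismatch among matching reads: most likely a sequencing error.
ref_call = call_base("A", [("A", 30)] * 4 + [("G", 10)])
```

The point of the example is that a single low-quality mismatch barely moves the posterior, whereas several concordant high-quality mismatches overwhelm the prior.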
Structural variation discovery
[Figure labels: navigation bar; fragment lengths in selected region; depth of coverage in selected region]
Stewart et al. in prep.
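Depth of coverage in a selected region, one of the signals named above, can be sketched as follows. This is a simplified illustration and not the tool shown on the slide: the alignment format (0-based start, read length), the threshold, and the minimum run length are assumptions chosen for the example. A run of bases with little or no coverage is one possible hint of a deletion.

```python
def depth_of_coverage(alignments, region_length):
    """Per-base depth from (start, read_length) alignments with 0-based starts."""
    depth = [0] * region_length
    for start, length in alignments:
        for pos in range(max(start, 0), min(start + length, region_length)):
            depth[pos] += 1
    return depth

def low_coverage_runs(depth, threshold, min_run):
    """Return half-open (start, end) intervals where depth < threshold
    for at least min_run consecutive bases."""
    runs = []
    run_start = None
    # The appended sentinel value closes a run that reaches the region end.
    for pos, d in enumerate(depth + [threshold]):
        if d < threshold:
            if run_start is None:
                run_start = pos
        elif run_start is not None:
            if pos - run_start >= min_run:
                runs.append((run_start, pos))
            run_start = None
    return runs

# Hypothetical alignments in a 20 bp region, leaving a gap at positions 7-11.
alignments = [(0, 5), (2, 5), (12, 5), (14, 5)]
depth = depth_of_coverage(alignments, 20)
candidates = low_coverage_runs(depth, threshold=1, min_run=3)
```

In practice a depth signal like this would be combined with other evidence, such as the fragment lengths of paired reads, before calling a structural variant.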
Assembly viewers
Huang and Marth. Genome Research 2008
Data mining
SNP calling in single-read 454 coverage
DNA courtesy of Chuck Langley, UC Davis
• collaborative project with Andy Clark (Cornell) and Elaine Mardis (Wash. U.)
• goal was to assess polymorphism rates between 10 different African and American Drosophila melanogaster isolates
• 10 runs of 454 reads (~300,000 reads per isolate) were collected
Mutational profiling in deep 454 data
Pichia stipitis reference sequence
Image from JGI web site
• collaboration with Doug Smith at Agencourt
• Pichia stipitis is a yeast that efficiently converts xylose to ethanol (bio-fuel production)
• one specific mutagenized strain had especially high conversion efficiency
• goal was to determine where the mutations were that caused this phenotype
• we analyzed 10 runs of 454 reads (~3 million reads, ~20x coverage of the 15 Mb genome)
• processed the sequences with our 454 pipeline
• found 39 mutations (in about as many reads as yielded 650K SNPs in melanogaster)
• informatics analysis in < 24 hours (including manual checking of all candidates)
Smith et al. Genome Research 2008
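The ~20x coverage figure follows from simple arithmetic: total sequenced bases divided by genome size. The sketch below reproduces it under the assumption of a mean read length of about 100 bp, which is not stated for this dataset but lies within the 100-250 bp range quoted earlier for 454 reads. The function name is hypothetical.

```python
def fold_coverage(num_reads, mean_read_length_bp, genome_size_bp):
    """Average fold coverage = total sequenced bases / genome size."""
    return num_reads * mean_read_length_bp / genome_size_bp

# ~3 million reads, assumed ~100 bp mean read length, 15 Mb genome.
coverage = fold_coverage(3_000_000, 100, 15_000_000)
```

With these inputs the result is 20, matching the coverage quoted on the slide.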
SNP calling in short-read coverage
C. elegans reference genome (Bristol, N2 strain)
Bristol, N2 strain (3 ½ machine runs)
Pasadena, CB4858 (1 ½ machine runs)
• goal was to evaluate the Solexa/Illumina technology for the complete resequencing of large model-organism genomes
• 5 runs (~120 million Illumina reads) from the Wash. U. Genome Center, as part of a collaborative project led by Elaine Mardis at Washington University
• we found 45,000 SNPs with a very high validation rate
Hillier et al. Nature Methods 2008
Current focus
1000 Genomes Project
• data quality assessment
• project design (# samples, depth of read coverage)
• read mapping
• SNP calling
• structural variation discovery
SV discovery in autism
deletion
amplification
Transcriptome sequencing
(from: Mortazavi et al. Nature Methods 2008)
Lab
The team
Michael Stromberg
Michele Busby
Aaron Quinlan
Chip Stewart
Damien Croteau-Chonka
Eric Tsung
Derek Barnett
Weichun Huang
Resources
• computer cluster
• 128 GB RAM server
• 20TB disk space
• 2 large R01 grants from the NIH
• a BC RIG grant
Collaborations
Genome Canada
Baylor HGSC
Wash. U. GSC
UC Davis
UBC GSC
UCSF
UCLA
Cornell
NCBI @ NIH
NCI @ NIH
Marshfield Clinic
Pfizer
Graduate student rotations
• Looking for new graduate students
• Spots are available for all three rotations
• Lots of projects
• Caveat: you need to be able to program…
• Check us out at:
http://bioinformatics.bc.edu/marthlab/
• If you are interested, please talk to me