Paired end sequencing - Siteman Cancer Center
Mutational evolution in a
lobular breast tumour profiled
at single nucleotide resolution
Elaine Mardis, Ph.D.
Associate Professor in Genetics
DEFINITIONS
NEXT-GENERATION SEQUENCING
• Unlike gel-based sequencing, next-generation methods
involve massively parallel sequencing of random
fragments.
• Paired end sequencing samples bases from each end
of the random fragments comprising a library.
• Computer programs perform short read alignment
onto the human reference and discover where the
sequence differences are.
COVERAGE
• “Coverage” is based on theoretical considerations of
Poisson sampling, read length, library insert size, error
rate.
• Complete coverage attempts to provide both a breadth
and depth of reads genome-wide that is sufficient to
detect variants with confidence.
• Short read alignment algorithms use different
approaches to find the best-in-genome placement for
each read pair. The repetitive content of the human
genome makes exact read placement challenging in
some regions.
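The Poisson view of coverage mentioned above can be sketched in a few lines. This is an idealized model only: it assumes reads land uniformly at random and ignores library insert size, sequencing error, GC bias and repeats. The function names `mean_depth` and `fraction_covered` are illustrative, not taken from the talk.

```python
import math

def mean_depth(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth: total sequenced bases divided by genome size."""
    return n_reads * read_length / genome_size

def fraction_covered(depth: float, min_depth: int = 1) -> float:
    """Expected fraction of bases seen at least min_depth times,
    assuming per-base depth follows a Poisson distribution with mean depth."""
    p_below = sum(
        math.exp(-depth) * depth ** k / math.factorial(k)
        for k in range(min_depth)
    )
    return 1.0 - p_below

if __name__ == "__main__":
    # Example: fraction of bases expected at 10 or more reads for 30x mean depth.
    print(fraction_covered(30.0, min_depth=10))
```

Real genomes fall short of this ideal, which is one reason coverage is also checked empirically against SNP array data.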
Evaluating Coverage
• Genome-wide SNP array data:
– Positive tumor:normal identity
– Tumor purity and ploidy estimates
– LOH information
– Identity and positions of homo- and
heterozygous SNPs on every
chromosome
• A means to track accumulating coverage of
NGS sequence data toward complete
genome coverage
• Our goals are >98% coverage of SNPs
genome-wide, for tumor and normal
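One way to track progress toward the >98% goal is to compare genotypes called from the sequence data with the array genotypes at the same sites. The sketch below assumes genotypes are stored as allele pairs keyed by (chromosome, position); the data layout and the name `snp_concordance` are hypothetical.

```python
def normalize(genotype):
    """Order alleles so that ('G', 'A') and ('A', 'G') compare equal."""
    return tuple(sorted(genotype))

def snp_concordance(array_genotypes, sequence_genotypes):
    """Fraction of array SNP sites whose genotype is reproduced by sequencing.

    Sites that are missing from the sequence calls count as not covered.
    """
    if not array_genotypes:
        raise ValueError("no array genotypes supplied")
    matched = 0
    for site, genotype in array_genotypes.items():
        called = sequence_genotypes.get(site)
        if called is not None and normalize(called) == normalize(genotype):
            matched += 1
    return matched / len(array_genotypes)

if __name__ == "__main__":
    array = {("chr1", 100): ("A", "G"), ("chr1", 200): ("C", "C")}
    sequence = {("chr1", 100): ("G", "A")}
    print(snp_concordance(array, sequence))
```

Running this on tumor and normal separately, as more sequence data accumulates, gives a simple stopping criterion.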
“-omics” Definitions
• Genome re-sequencing: studying the chromosomal
and mitochondrial DNA by massively parallel whole
genome methods
– requires a high-quality reference sequence for read alignment
– ability to discover various types of sequence variations
• Transcriptome sequencing: studying the transcript
population by cDNA library construction and massively
parallel sequencing
– Total RNA, polyA+ RNA, miRNA
– Align to genome or assemble reads
DNA Variant Detection
• Single nucleotide variants (SNVs): tumor-specific (“somatic”) and normal-specific
(“germline”)
– Mutations in genes are non-synonymous,
synonymous, nonsense, non-stop (readthrough) or
affect splice site recognition
• Focused insertions and deletions (1-100 bp)
• Copy number alterations (large-scale
amplifications and deletions)
• Insertions and Inversions
• Chromosomal translocations
Detecting Somatic Mutations in cancer
genomes
Sequence tumor to 30x
Sequence normal to 30x
Compare to human reference, call variants
Compare to each other, identify somatic variants
Remove known dbSNPs, calculate high confidence
Candidate Tumor-unique SNVs
Validation by targeted PCR and sequencing
Evaluate mutation prevalence in tumor cells
Validated SNVs
Recurrency screening by
targeted PCR and sequencing in
additional tumor/normal samples
Recurrent SNVs
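The comparison steps in this workflow amount to set subtraction. The following is a minimal sketch, assuming each variant call is a (chromosome, position, reference base, variant base) tuple; it leaves out the confidence scoring and validation steps, and the name `candidate_somatic` is illustrative.

```python
def candidate_somatic(tumor_calls, normal_calls, dbsnp):
    """Variants called in the tumor, absent from the matched normal,
    and not already catalogued in dbSNP."""
    tumor_only = set(tumor_calls) - set(normal_calls)
    return tumor_only - set(dbsnp)

if __name__ == "__main__":
    tumor = {("chr17", 100, "G", "A"), ("chr2", 50, "C", "T"), ("chr5", 9, "A", "C")}
    normal = {("chr2", 50, "C", "T")}      # germline: also seen in normal
    known = {("chr5", 9, "A", "C")}        # known polymorphism
    print(candidate_somatic(tumor, normal, known))
```

The surviving candidates are what go forward to targeted PCR and sequencing.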
BreakDancer: detecting
structural variation
• Read pair analysis with BreakDancer identifies putative SVs for tumor and normal simultaneously.
• We visually examine a Pairoscope graph to add confidence.
• The identified reads are used to produce an assembly.
• Putative somatic SVs are validated by PCR.
K. Chen et al., Nature Methods 6: 677-81 (2009)
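The read-pair signal behind this kind of analysis can be illustrated with a toy classifier. This is not BreakDancer's algorithm, only a simplified sketch of the idea: a pair is flagged when its mapped span or orientation disagrees with what the library predicts. The `ReadPair` layout, the three-standard-deviation cutoff and the labels are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class ReadPair:
    chrom1: str
    pos1: int
    strand1: str  # "+" or "-"
    chrom2: str
    pos2: int
    strand2: str

def classify_pair(pair, mean_insert, sd_insert, n_sd=3.0):
    """Label one read pair by the kind of structural variant it might support."""
    if pair.chrom1 != pair.chrom2:
        return "translocation"
    if pair.strand1 == pair.strand2:
        # Paired-end reads are expected on opposite strands.
        return "inversion"
    span = abs(pair.pos2 - pair.pos1)
    if span > mean_insert + n_sd * sd_insert:
        return "deletion"   # ends map farther apart than the library allows
    if span < mean_insert - n_sd * sd_insert:
        return "insertion"  # ends map closer together than expected
    return "concordant"

if __name__ == "__main__":
    pair = ReadPair("chr1", 1000, "+", "chr1", 6000, "-")
    print(classify_pair(pair, mean_insert=300, sd_insert=30))
```

A single discordant pair is weak evidence; a real caller looks for clusters of pairs supporting the same event before proposing it for assembly and PCR validation.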
RNA-seq Detection
• Single-nucleotide variants
• Insertion/deletion variants (focused)
• Alternative splicing isoforms
• Allelic expression bias
• RNA editing (non-synonymous amino acid
changes introduced by RNA editing enzymes)
• If an adjacent normal tissue sample is
available, differential expression levels can be
detected/studied.
Shah et al.
Nature 2009
Lobular breast cancer
• Estrogen-receptor positive disease
• Low/intermediate grade tumor
• Approx. 15% of all breast cancer diagnoses
• Samples studied
– Metastatic tumor: gDNA and RNA
– Normal tissue gDNA (PBL)
– Primary tumor: gDNA
Sequencing and Variant
Detection
• Produced 43X coverage of paired end
sequencing reads from metastatic DNA library
(WGSS)
• Produced 160.9 Mreads from cDNA library
(WTSS)
• WGSS data yielded SNVs, insertion/deletions,
translocations, inversions and CNAs
• WTSS data yielded SNVs, gene fusions
• Normal DNA used only for validation
Filtering Putative ns Variants
1,456 predicted ns SNVs
Remove pseudogenes, HLA
1,178 predicted ns SNVs
PCR primer design
1,120 predicted ns SNVs
PCR of met and normal DNA
437 confirmed
32 somatic confirmed
(2 unique to WTSS)
405 germline
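The filtering funnel can be checked with simple arithmetic. The counts below are the ones on this slide; the stage labels and the helper `retention` are illustrative.

```python
# Counts at each stage of the filtering funnel, as given on the slide.
STAGES = [
    ("predicted ns SNVs", 1456),
    ("after removing pseudogenes and HLA", 1178),
    ("with PCR primers designed", 1120),
    ("confirmed by PCR", 437),
]
SOMATIC, GERMLINE = 32, 405

def retention(stages):
    """Return (label, count, fraction kept relative to the previous stage)."""
    rows = []
    previous = None
    for label, count in stages:
        kept = 1.0 if previous is None else count / previous
        rows.append((label, count, kept))
        previous = count
    return rows

if __name__ == "__main__":
    for label, count, kept in retention(STAGES):
        print(f"{count:>5}  {kept:6.1%}  {label}")
    # The confirmed variants split into somatic and germline.
    print(SOMATIC + GERMLINE == STAGES[-1][1])
```

Most confirmed variants turn out to be germline, which is why validation in both metastatic and normal DNA is needed when the normal genome is not sequenced.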
Why validate?
• Orthogonal validation is important. Why?
– Alignment and variant discovery algorithms aren’t
perfect
– Instruments have biases and errors happen
– In this study, the normal genome wasn’t sequenced
by a WGS approach, so validation and germline
determination of variants are coupled
• Limited to the coding variants identified (expense)
Evaluating the Somatic
Mutations
• CAN breast genes (0)
• COSMIC (11*)
• Screening for recurrent mutations in 192 breast
cancers (112 lobular, 80 ductal)
– 3/192 contained ns variants or deletions in HER2
kinase domain
– 2/192 had nonsense HAUS3 mutations (genome
stability mediated by kinetochore attachment and
centromere morphogenesis)
• Evaluating mutational prevalence
Prevalence Assays of Mutations
• Deep read counts of specific loci for 28/32 mutations and 36 heterozygous
germline SNPs.
• PCR, alignment of sequences and counting of reference vs. variant bases.
• Germline het and metastatic somatic variants were ~50% (mode).
• Primary disease showed HAUS3, ABCB11, PALB2 and SLC24A4 as
prevalent; 6 variants were between 1-13%; 19 mutations were met-specific (not
detected in the primary)
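Counting reference versus variant bases at a locus gives a direct prevalence estimate. This is a minimal sketch, assuming the aligned base calls at one position are available as a string; `variant_allele_fraction` is an illustrative name, and base quality and sequencing error are ignored.

```python
from collections import Counter

def variant_allele_fraction(bases, ref, alt):
    """Fraction of informative reads carrying the variant base.

    Reads showing neither the reference nor the variant base are ignored.
    Returns None when no informative reads cover the locus.
    """
    counts = Counter(bases.upper())
    informative = counts[ref] + counts[alt]
    if informative == 0:
        return None
    return counts[alt] / informative

if __name__ == "__main__":
    # A heterozygous site in a pure diploid sample is expected near 0.5.
    print(variant_allele_fraction("GGGGAAAA", ref="G", alt="A"))
```

With deep read counts, fractions of a few percent become measurable, which is what allows low-prevalence variants in the primary tumor to be distinguished from those not detected at all.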
Why check for prevalence of
mutations?
• Each tumor gDNA sample consists of the
contributions of many tumor (and associated
normal) cells.
• The digital nature of NGS data allows an
estimation of how common each validated
mutation is in the tumor cell population.
• More prevalent mutations are likely “older” and
this adds evidence for their importance in
driving carcinogenesis.
Why screen for recurrent
mutations?
• Adds evidence for a given mutation as a
“driver” of carcinogenesis.
• Cumulative information on recurrent mutations
allows early pathways-based analysis without
the time required to fully sequence hundreds of
cancer cases.
• With improvements in sequencer throughput,
the spectrum of recurrency testing is likely to
change by becoming more focused, but still will
be important to know.
RNA-seq Analysis
• Fusion transcripts were predicted but not
validated.
• RNA editing events suspect: estrogen
regulation and gene expression of ADAR
• 3,122 candidates, 1,637 genes => 75
events in 12 genes
– COG3 and SRP9 showed high frequency non-synonymous tx-editing
Conclusions
• Both DNA and RNA are important to study in
tumor genomics
• Sequencing primary and metastatic disease
yields important insights.
• Some power to assess genome-wide diversity
may be lost by not doing WGSS of the normal
(but it’s cheaper)
Breast cancer “quartet”
• African-American female, mid-40s at diagnosis
• Basal subtype (“triple negative”) breast cancer
• Metastatic brain tumor (frontal lobe)
• BRCA1/2 genotypes unknown
• Deceased
• Four samples:
– PBL (normal)
– Primary tumor
– Metastatic tumor
– Xenograft (“HIM”) of primary tumor
Coverage/Mapping Stats
Sample                   Gbp analyzed  Haploid Coverage  Known Het SNP Coverage  Known Hom Coverage  dbSNP concordance  Filtered SNPs Called
Normal                   130.7         38.8X             98.3%                   99.3%               78.5%              4,325,512
Primary Tumor            124.9         29.0X             96.8%                   93.7%               78.9%              4,121,595
Brain Metastasis         111.8         32.0X             96.2%                   97.2%               79.0%              3,860,638
Primary Tumor Xenograft  149.2         23.8X             88.8%                   98.7%               79.5%              3,626,361
• Xenograft alignment rate is 65%, compared to 95% for other samples
• Xenograft generates 2.1X coverage of mouse genome (~13% contamination),
compared to < 1% for Normal
Basal Breast Cancer Quartet:
Results
• 50 somatic point mutations and focused indels
were validated (48 were shared)
• 28 large deletions, 6 inversions and 7
translocations were validated as somatic
• Of the 48 point mutations in the primary tumor,
20 were found at increased prevalence (allele
frequency) in the metastatic tumor, and 22
similarly increased in prevalence in the
xenograft (overlap of 16 genes)
Altered Mutation Prevalence
[Figure: breast tumor lineage (primary tumor, brain metastasis, xenograft)]