NGS tool tutorial I MuTect * detection of somatic


Detection of somatic mutations:
A data mining and a computational
approach
Presenter: Huy Vuong, PhD
Department of Biomedical Informatics
Vanderbilt University
5/3/2013
Somatic single nucleotide variants (sSNV)
• Play a major role in tumorigenesis and cancer development
• Catalogue of Somatic Mutations In Cancer (COSMIC): the most comprehensive catalogue today
• Aim 1: Literature mining
• Aim 2: Tumor-specific mutations in tumor-normal pairs
[Chart: Mutations in COSMIC by release. Releases: V1 (2004), V60 (7/2012), V61 (9/2012), V62 (11/2012). Bar labels: 10,647; 340,585; 405,271; 745,924]
Classes of somatic mutations
• Point mutation:
• Coding
• Silent
• Missense
• Nonsense
• Noncoding (UTR, ncRNA, miRNA…)
• Intronic
• Intergenic
• Small scale mutation:
• Small insertions
• Small deletions
• Large scale mutation: rearrangements
• Intrachromosomal
• Deletion
• Inversion
• Duplication
• Interchromosomal
• Translocation
• Insertion
Aim 1: Mining COSMIC For
Protein Domain Interaction
History of COSMIC
The Evolution of the Cosmos started with the Big Bang!
http://en.wikipedia.org/wiki/Big_Bang
Yet, another COSMIC
• History of the Catalogue Of Somatic Mutations In Cancer (Wellcome
Trust Sanger Institute)
Timeline: V1 (2004) to V64 (2013)
Comparison V1 vs. V64
COSMIC V1 (4th February, 2004) vs. COSMIC V64 (26th March, 2013)
Categories compared: Genes, Mutations, Tumours
Values shown on the slide: 913,166; 424,394; 10,647; 847,698; 57,444
Advantages and Disadvantages
Advantages:
• Bimonthly updates
• Manually curated data, with low-quality data removed
• Consistent vocabulary (histology and tissue)
• Each mutation maps to a single version of the gene (no alternative splicing)
• FREE availability!!!
Disadvantages:
• Curation bias
• Many positive results, few negative results
• Other quality issues: experimental error, missing mutations
• Interpretation of mutation frequency
Typical workflow
Histogram
Distribution
Specific aims
• Map somatic mutations (SM) in COSMIC to
protein structural model
• Identify SM in pocket region of protein
• Use statistical analysis to score SM in the
context of cancer (specificity, sensitivity)
Dataset and preprocessing step
• Data are downloaded from COSMIC version 62 via Biomart interface
as TSV file (http://cancer.sanger.ac.uk/biomart/martview/)
• Use R to clean the data (e.g. remove duplicates) and import it into a
SQLite database
• Database contained 776,917 mutations and 15 variables:
1. Gene.Name
2. CDS.Mutation.Syntax
3. AA.Mutation.Syntax
4. Zygosity
5. Primary.Site
6. Primary.Histology
7. In.Cancer.Census
8. Tumour.Source
9. Genomic.Coordinates.GRCh37
10. CDS.Mutation.Type
11. AA.Mutation.Type
12. Somatic.status
13. Validation.status
14. Entrez.Gene.ID
15. COSMIC.Sample.ID
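The preprocessing step above (drop duplicate rows, then load into SQLite) can be sketched as follows. The talk used R for this step; this is a Python stand-in using only the standard library, and the function name and table name are hypothetical.

```python
import csv
import sqlite3


def load_cosmic_rows(lines, conn, table="cosmic_mutations"):
    """Load tab-separated lines (header first) into SQLite, skipping duplicates.

    `lines` is any iterable of strings, e.g. an open file handle.
    Returns the number of unique rows inserted.
    """
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader)
    seen = set()
    unique_rows = []
    for row in reader:
        key = tuple(row)
        # Skip rows with the wrong number of fields and exact duplicates.
        if len(key) != len(header) or key in seen:
            continue
        seen.add(key)
        unique_rows.append(key)

    # Column names are quoted because they contain dots (e.g. Gene.Name).
    columns = ", ".join('"{}" TEXT'.format(name) for name in header)
    placeholders = ", ".join("?" for _ in header)
    with conn:  # commits on success
        conn.execute('CREATE TABLE "{}" ({})'.format(table, columns))
        conn.executemany(
            'INSERT INTO "{}" VALUES ({})'.format(table, placeholders),
            unique_rows,
        )
    return len(unique_rows)
```

In practice `lines` would be the opened TSV export and `conn` a connection from `sqlite3.connect(...)` to a database file of your choice.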
Protein pocket region
• Li et al. developed an algorithm to identify
functional pocket regions in proteins
The vast majority of disease-associated SNPs are located in
pockets. (Tseng and Li, PNAS, 2011)
A case study: KRAS
About 64% of SM in KRAS are located in the functional pocket region
Yu et al (Nature Biotechnology, 2012) also reported that about 65% of disease-associated in-frame
mutations are located on the interaction surfaces of the proteins associated with the diseases.
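A percentage like the KRAS figure above comes down to counting how many mutated residues fall inside a set of pocket residues. The sketch below is only an illustration of that count: it assumes amino-acid changes written like `p.G12D`, and the pocket residue set passed in is a placeholder, not the real KRAS pocket.

```python
import re

# Matches substitutions such as "p.G12D" or "p.R1685X"; captures the residue number.
_AA_POSITION = re.compile(r"^p\.[A-Z*](\d+)")


def residue_position(aa_syntax):
    """Return the residue number from a string like 'p.G12D', or None."""
    match = _AA_POSITION.match(aa_syntax)
    return int(match.group(1)) if match else None


def pocket_fraction(aa_mutations, pocket_residues):
    """Fraction of parseable mutations whose residue lies in pocket_residues."""
    positions = [residue_position(m) for m in aa_mutations]
    positions = [p for p in positions if p is not None]
    if not positions:
        return 0.0
    in_pocket = sum(1 for p in positions if p in pocket_residues)
    return in_pocket / len(positions)


# Hypothetical pocket residues, for illustration only.
print(pocket_fraction(["p.G12D", "p.G13D", "p.Q61H", "p.A146T"], {12, 13}))
```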
Aim 2: Tumor-specific
mutations in tumor-normal
pairs
Outline
• Challenges in detecting somatic single
nucleotide variants (sSNV)
• GATK pipeline for calling sSNV
• Installing and running MuTect
• MuTect output
• Summary
Detecting sSNV in cancer: challenge #1
Many sSNVs occur at very low frequency in the
genome (0.1 to 100 mutations per megabase)
Slide adapted from Mike Lawrence, TCGA Annual Symposium
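To put the 0.1 to 100 mutations per megabase range in perspective, the short calculation below converts it into a per-base rate and a genome-wide count. The 3,000 Mb genome size is a round-number assumption used only for illustration.

```python
def expected_ssnv_count(rate_per_mb, territory_mb):
    """Expected number of sSNVs = mutation rate (per Mb) x territory (Mb)."""
    return rate_per_mb * territory_mb


def per_base_rate(rate_per_mb):
    """Convert mutations per megabase into mutations per base."""
    return rate_per_mb / 1_000_000


GENOME_MB = 3000  # assumption: round figure for the human genome, in Mb

for rate in (0.1, 100):
    print(rate, "per Mb:",
          per_base_rate(rate), "per base,",
          expected_ssnv_count(rate, GENOME_MB), "sSNVs genome-wide")
```

At the low end this is about one mutation per ten million bases, so a caller has to separate very few true sSNVs from a much larger number of sequencing and alignment artifacts.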
Detecting sSNV in cancer: challenge #2
Tumors are impure (i.e. contain contaminating normal cells) and
heterogeneous (i.e. contain sub-clones)
[Figure panel C: tri-clonal tumor]
Slide adapted from Christopher Miller, TCGA Annual Symposium, and Elaine Mardis
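Impurity and sub-clonality both push the observed allelic fraction down. A minimal sketch of the arithmetic, assuming a heterozygous mutation at a diploid locus with no copy-number change (a simplification, and the function name is hypothetical):

```python
def expected_allelic_fraction(purity, clone_fraction):
    """Expected fraction of reads carrying the mutant allele.

    purity:          fraction of cells in the sample that are tumor cells
    clone_fraction:  fraction of tumor cells that carry the mutation
    Assumes one mutant copy out of two at the locus in every carrying cell.
    """
    if not (0.0 <= purity <= 1.0 and 0.0 <= clone_fraction <= 1.0):
        raise ValueError("purity and clone_fraction must be between 0 and 1")
    return purity * clone_fraction * 0.5


# A pure, mono-clonal tumor versus an impure tumor with a sub-clonal mutation.
print(expected_allelic_fraction(1.0, 1.0))
print(expected_allelic_fraction(0.6, 0.5))
```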
GATK pipeline
GATK Best Practices: http://www.broadinstitute.org/gatk/guide/topic?name=best-practices
NGS: Resources
• SEQanswers (http://seqanswers.com/)
• SEQanswers software list (http://seqanswers.com/wiki/Software/list)
• Galaxy (https://main.g2.bx.psu.edu/)
• NGS Catalog (http://bioinfo.mc.vanderbilt.edu/NGS/)
Slide adapted from Peilin Jia, PhD
Two types of error
• USER ERRORS:
• Due to wrong command line or incorrect user
input files
• Please do not post this error to the GATK
forum
• RUNTIME ERRORS:
• Due to the program code
• Do post this error to the GATK forum (together
with the trace file)
USER ERROR
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 2.2-25-g2a68eab):
##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
##### ERROR Please do not post this error to the GATK forum
##### ERROR
##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: SAM/BAM file SAMFileReader{/scratch/vuongh/Lungevity_Project/GATK/bwa/13_karosorted_RG_MarkDup_Realigned_Recal.bam} is malformed: read starts with deletion. Cigar: 9D18M15I38M26S. Although the SAM spec technically permits such reads, this is often indicative of malformed files. If you are sure you want to use this file, re-run your analysis with the extra option: -rf BadCigar
BEST OF RUNTIME ERROR
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 2.4-7g5e89f01):
##### ERROR
##### ERROR Please visit the wiki to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: START (0) > (-1) STOP -- this should never happen -- call Mauricio!
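The two error types can be told apart by the banner line in the log. A small sketch, assuming the banner wording matches the two examples above (other GATK versions may word it differently):

```python
USER_MARKER = "A USER ERROR has occurred"
RUNTIME_MARKER = "A GATK RUNTIME ERROR has occurred"


def classify_gatk_error(log_text):
    """Return 'user', 'runtime', or None if neither banner is present."""
    if USER_MARKER in log_text:
        return "user"
    if RUNTIME_MARKER in log_text:
        return "runtime"
    return None


def should_post_to_forum(log_text):
    """Only runtime errors should be posted to the GATK forum."""
    return classify_gatk_error(log_text) == "runtime"


example = "##### ERROR A USER ERROR has occurred (version 2.2-25-g2a68eab):"
print(classify_gatk_error(example), should_post_to_forum(example))
```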
MuTect: a highly sensitive and specific
sSNV caller
• Distinct Features
• Focuses on identifying low-allelic-fraction
mutations, which arise from tumor heterogeneity
(sub-clones) and contaminating normal cells
• Uses a Bayesian model with allelic fraction as a
parameter → yields high sensitivity
• Carefully tuned, elaborate set of filters →
yields high specificity
Overview of the detection of a somatic
point mutation using MuTect
Bayesian model
Variant Filter
Panel of Normal Filter
Cibulskis, K. et al.Nat Biotechnology (2013).doi:10.1038/nbt.2514
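The Bayesian model compares two hypotheses at each site: a variant present at allelic fraction f versus reference only (f = 0), scored as a log-odds (LOD) value. The sketch below is a simplified illustration of that idea, not MuTect's implementation: it estimates f as the fraction of reads showing the alternate base, treats reads as independent, and uses only base quality for the error probability. The 6.3 value is the cutoff threshold quoted elsewhere in this talk.

```python
import math


def read_likelihood(base, error_prob, ref, alt, f):
    """P(observed base | true allelic fraction f, base error probability)."""
    if base == ref:
        return f * (error_prob / 3.0) + (1.0 - f) * (1.0 - error_prob)
    if base == alt:
        return f * (1.0 - error_prob) + (1.0 - f) * (error_prob / 3.0)
    return error_prob / 3.0  # neither ref nor alt: must be an error


def tumor_lod(bases, quals, ref, alt):
    """log10 likelihood ratio: variant at estimated f versus f = 0."""
    if not bases:
        return 0.0
    f_hat = sum(1 for b in bases if b == alt) / len(bases)
    lod = 0.0
    for base, qual in zip(bases, quals):
        error_prob = 10 ** (-qual / 10.0)  # Phred scale to probability
        lod += math.log10(read_likelihood(base, error_prob, ref, alt, f_hat))
        lod -= math.log10(read_likelihood(base, error_prob, ref, alt, 0.0))
    return lod


# Ten reads at Q30: three support the alternate allele G.
lod = tumor_lod("AAAAAAAGGG", [30] * 10, "A", "G")
print(lod, lod > 6.3)
```

With a single alternate read out of ten the same function stays well below 6.3, which is the intuition behind needing several supporting reads before a low-fraction call is made.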
Benchmarking mutation-detection
methods
Advantages:
High sensitivity at low allelic fraction (f=0.1)
High specificity achieved by filters
Cibulskis, K. et al.Nat Biotechnology (2013).doi:10.1038/nbt.2514
Filter options
• Proximal gap
• Poor mapping
• Triallelic site
• Strand bias
• Clustered position
• Observed in Control
• Panel of normal samples
[Figure: strand bias, "Good" vs. "Bad" examples]
Jia et al. PLoS ONE 7(6): e38470
Installing MuTect
• Installation (Linux)
• Version 1.1.4 available for download at
http://www.broadinstitute.org/cancer/cga/mutect_download
(must register an account at Broad)
• Can also be built from source, available for download at
http://www.nature.com/nbt/journal/v31/n3/extref/nbt.2514-S3.zip
Preparing input
• Resources:
• COSMIC VCF file: use b37_cosmic_v54_120711.vcf
• dbSNP VCF file: use dbsnp_132_b37.leftAligned.vcf.gz
• Human reference fasta: downloaded from GATK
reference bundle, use
Homo_sapiens_assembly19.fasta, *.fai, *.dict files
• Inputs:
• Tumor bam file and matched normal bam file from
read alignment tool output (e.g. BWA, Tophat)
• BAM files need to be sorted and indexed.
• Recommendation: perform local realignment around indels and
mark PCR duplicates, following the GATK Best Practices for
variant detection
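Before launching a long run it helps to confirm that the companion files listed above are in place. This sketch derives the expected index and dictionary paths from the usual naming conventions (`.fasta.fai`, `.dict`, `.bam.bai`) and reports the ones that are missing. Some tools write the BAM index as `name.bai` instead, so treat the BAM check as an assumption about your setup.

```python
import os


def expected_companions(reference_fasta, bam_paths):
    """Return index/dictionary paths expected next to the given inputs."""
    root, _ = os.path.splitext(reference_fasta)
    expected = [reference_fasta + ".fai", root + ".dict"]
    for bam in bam_paths:
        expected.append(bam + ".bai")  # assumption: index named <file>.bam.bai
    return expected


def missing_files(paths):
    """Return the subset of paths that do not exist on disk."""
    return [path for path in paths if not os.path.exists(path)]


companions = expected_companions(
    "/ref/Homo_sapiens_assembly19.fasta",
    ["/Huy-RNAseq/1/accepted_hits.sorted.RG.bam",
     "/Huy-RNAseq/2/accepted_hits.sorted.RG.bam"],
)
print(missing_files(companions))
```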
Running MuTect
• Command line with all default parameters:
java -Xmx4g -jar /scratch/vuongh/mutect_latest/muTect-1.1.4.jar \
--analysis_type MuTect \
--reference_sequence /ref/Homo_sapiens_assembly19.fasta \
-cosmic /ref/hg19_cosmic_v54_120711.vcf \
-dbsnp /ref/dbsnp_132_b37.leftAligned.vcf \
--input_file:normal /Huy-RNAseq/1/accepted_hits.sorted.RG.bam \
--input_file:tumor /Huy-RNAseq/2/accepted_hits.sorted.RG.bam \
--out /out/1_2_cal_stats.out \
--vcf /out/1_2_mutation.vcf \
-cov /out/1_2_coverage.wig.txt \
--enable_extended_output
Notes:
• Put all resource files (COSMIC, dbSNP and reference fasta) in folder ref
• Normal BAM file and index in folder 1, tumor BAM file and index in folder 2.
• Output call stats and VCF file of mutation candidates in folder out
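When several tumor-normal pairs have to be processed, the same command can be assembled programmatically. The helper below is a hypothetical convenience wrapper that reuses exactly the flags from the command line above; it only builds the argument list and does not run anything.

```python
def build_mutect_command(jar, reference, cosmic, dbsnp,
                         normal_bam, tumor_bam, out_prefix, memory="4g"):
    """Return the MuTect command from this talk as an argument list."""
    return [
        "java", "-Xmx" + memory, "-jar", jar,
        "--analysis_type", "MuTect",
        "--reference_sequence", reference,
        "-cosmic", cosmic,
        "-dbsnp", dbsnp,
        "--input_file:normal", normal_bam,
        "--input_file:tumor", tumor_bam,
        "--out", out_prefix + "_cal_stats.out",
        "--vcf", out_prefix + "_mutation.vcf",
        "-cov", out_prefix + "_coverage.wig.txt",
        "--enable_extended_output",
    ]


cmd = build_mutect_command(
    "/scratch/vuongh/mutect_latest/muTect-1.1.4.jar",
    "/ref/Homo_sapiens_assembly19.fasta",
    "/ref/hg19_cosmic_v54_120711.vcf",
    "/ref/dbsnp_132_b37.leftAligned.vcf",
    "/Huy-RNAseq/1/accepted_hits.sorted.RG.bam",
    "/Huy-RNAseq/2/accepted_hits.sorted.RG.bam",
    "/out/1_2",
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the paths are valid on your system.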
Result
• Test data: RNA-seq data from squamous
cell lung cancer patients (tumor/normal
pair)
• Total run time: 6 hours on 8 Intel Nehalem
CPUs (2.4 GHz), processing 65.1 million
reads per sample
• View the result with Excel
Example of MuTect output
contig  position  ref_allele  alt_allele  t_lod_fstar  tumor_f   contaminant_lod  failure_reasons                                                                                        judgement
1       14470     G           A           8.631487     0.272727  -0.096458        normal_lod,alt_allele_in_normal,poor_mapping_region_alternate_allele_mapq                              REJECT
1       14542     A           G           4.993144     0.076923  -0.228097        fstar_tumor_lod,possible_contamination,normal_lod,alt_allele_in_normal                                 REJECT
1       14574     A           G           4.82618      0.071429  -0.245647        fstar_tumor_lod,possible_contamination                                                                 REJECT
1       14653     C           T           137.966026   0.714286  -0.429894        normal_lod,alt_allele_in_normal                                                                        REJECT
1       14673     G           C           5.07638      0.030769  2.317242         fstar_tumor_lod,possible_contamination,alt_allele_in_normal,poor_mapping_region_alternate_allele_mapq  REJECT
1       139393    G           T           8.97833      0.3       -0.087734                                                                                                               KEEP
1       788867    C           T           7.335518     0.285714  -0.061414                                                                                                               KEEP
1       1321326   C           G           7.495658     0.333333  -0.052641                                                                                                               KEEP
1       1498692   T           C           6.681093     0.2       -0.087736                                                                                                               KEEP
1       1498813   T           C           6.706235     0.166667  -0.105281                                                                                                               KEEP
Keep: 1143 (0.5%)  Reject: 213000 (99.5%)
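Instead of opening the call stats file in Excel, the KEEP/REJECT tally can be computed with a few lines of Python. This sketch assumes the file is tab-separated, has a `judgement` column as in the example output, and may start with comment lines beginning with `#`; adjust if your file differs.

```python
import csv
from collections import Counter


def count_judgements(lines):
    """Tally the judgement column of a MuTect call stats file."""
    # Assumption: comment/header lines start with '#' and can be skipped.
    data_lines = (line for line in lines if not line.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t")
    return Counter(row["judgement"] for row in reader)


def keep_fraction(counts):
    """Fraction of all calls that were judged KEEP."""
    total = sum(counts.values())
    return counts["KEEP"] / total if total else 0.0


# The totals reported on this slide.
slide_counts = Counter(KEEP=1143, REJECT=213000)
print(round(keep_fraction(slide_counts) * 100, 1), "% kept")
```

`count_judgements` accepts any iterable of lines, so an open file handle for the `--out` file can be passed directly.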
Distribution of keep versus reject
calls
[Figure: density plot with cutoff threshold = 6.3; x-axis: allelic fraction f, y-axis: density]
• Most reject calls are high-allelic-fraction sSNV
• Most of the low-allelic-fraction sSNV are kept
• Mono-clonal ???
Variant annotation (Annovar)
Displaying 10 of 432 genes
Chr  Start     End       Ref  Alt  Effect             Variant annotation
1    9833381   9833381   G    A    nonsynonymous SNV  CLSTN1:NM_014944:exon2:c.C163T:p.L55F,CLSTN1:NM_001009566:exon2:c.C163T:p.L55F,
1    11090294  11090294  A    T    stopgain SNV       MASP2:NM_006610:exon10:c.T1236A:p.C412X,
1    12475169  12475169  G    C    nonsynonymous SNV  VPS13D:NM_018156:exon63:c.G11985C:p.L3995F,VPS13D:NM_015378:exon64:c.G12060C:p.L4020F,
1    12628426  12628426  C    G    nonsynonymous SNV  DHRS3:NM_004753:exon6:c.G852C:p.E284D,
1    15988104  15988104  C    T    nonsynonymous SNV  RSC1A1:NM_006511:exon1:c.C1741T:p.L581F,
1    21940577  21940577  A    T    nonsynonymous SNV  RAP1GAP:NM_001145657:exon9:c.T297A:p.H99Q,RAP1GAP:NM_001145658:exon8:c.T489A:p.H163Q,RAP1GAP:NM_002885:exon8:c.T297A:p.H99Q,
1    22186457  22186457  G    A    stopgain SNV       HSPG2:NM_005529:exon41:c.C5053T:p.R1685X,
1    24019099  24019099  C    G    nonsynonymous SNV  RPL11:NM_000975:exon2:c.C7G:p.Q3E,
1    24019100  24019100  A    C    nonsynonymous SNV  RPL11:NM_000975:exon2:c.A8C:p.Q3P,
1    26900691  26900691  G    A    synonymous SNV     RPS6KA1:NM_002953:exon22:c.G2207A:p.X736X,RPS6KA1:NM_001006665:exon21:c.G2234A:p.X745X,
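Each annotation string packs one entry per transcript in the form gene:transcript:exon:cDNA change:protein change, separated by commas. A minimal parser for that layout, assuming every entry has exactly these five colon-separated fields (entries that do not are skipped):

```python
def parse_annotation(annotation):
    """Split an Annovar-style annotation string into per-transcript records."""
    records = []
    for entry in annotation.split(","):
        parts = entry.split(":")
        if len(parts) != 5:
            continue  # skips the empty entry left by the trailing comma
        gene, transcript, exon, cdna_change, protein_change = parts
        records.append({
            "gene": gene,
            "transcript": transcript,
            "exon": exon,
            "cdna_change": cdna_change,
            "protein_change": protein_change,
        })
    return records


example = ("CLSTN1:NM_014944:exon2:c.C163T:p.L55F,"
           "CLSTN1:NM_001009566:exon2:c.C163T:p.L55F,")
for record in parse_annotation(example):
    print(record["gene"], record["transcript"], record["protein_change"])
```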
Summary
• MuTect is a highly sensitive and specific tool
for somatic SNV calling
• Designed to detect low-allelic-fraction somatic
mutations present in as few as 10% of cancer cells
• Easy to install and run on all OS
• Works on all NGS data
• Limitations:
• Computationally intensive
• Can’t call indels
THANK YOU