1. dia - Amazon Web Services

Considerations for Analyzing Targeted
NGS Data
Introduction
Tim Hague, CTO
Introduction
Many mapping, alignment and variant calling
algorithms exist.
Most of these have been developed for
whole genome sequencing and to some
extent population genetic studies.
Premise
In contrast, NGS based diagnostics deals
with particular genes or mutations of an
individual.
Different diagnostic targets present specific
challenges.
Goal
Present analysis issues related to
differences in:
Sequencing technologies
Targeting technologies
Target specifics
Pseudogenes and segmental duplication
NGS Sequencers
• Illumina
• Ion Torrent
• Roche 454
• (SOLiD)
Illumina
Ion Torrent
Roche 454
Moore B, Hu H, Singleton M, De La Vega, FM, Reese MG, Yandell M. Genet Med. 2011 Mar;13(3):210-7.
Sequencing Technology
Differences:
Homopolymer error rates
G/C content errors
Read length
Sequencing protocols (single vs paired
reads)
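As a rough illustration of the first two differences, the sketch below measures the G/C fraction and the longest homopolymer run of a read. These are the two sequence properties the slide ties to platform-specific errors. The function names and the example read are hypothetical and do not come from any sequencer's toolkit.

```python
from itertools import groupby
from typing import Tuple


def gc_fraction(seq: str) -> float:
    """Fraction of G/C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def longest_homopolymer(seq: str) -> Tuple[str, int]:
    """Return (base, length) of the longest run of one repeated base."""
    best_base, best_len = "", 0
    for base, run in groupby(seq.upper()):
        run_len = sum(1 for _ in run)
        if run_len > best_len:
            best_base, best_len = base, run_len
    return best_base, best_len


# Hypothetical read with a six-base T homopolymer.
read = "ACGTTTTTTGGCCA"
print(gc_fraction(read))          # 6 of 14 bases are G or C
print(longest_homopolymer(read))  # ('T', 6)
```

Reads with long runs or extreme G/C fractions could be flagged this way for closer inspection before variant calling.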
Targeting Methods
• PCR primers (e.g. amplicons)
• Hybridization probes (e.g. exome kits)
Targeting Technology
Differences:
Exact matching regions vs regions with
SNPs.
Results in:
Need for mapping against whole
chromosomes to avoid false positives.
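A toy sketch of why the reference matters: a read that really comes from a near-identical off-target copy is forced onto the target when only the target is used as reference, and the one differing base then looks like a SNP. The sequences, names and the ungapped Hamming-distance placement are all illustrative assumptions. This is a stand-in for a real aligner, not how any particular mapper works.

```python
def best_hit(read, references):
    """Return (mismatches, name, offset) of the best ungapped placement."""
    best = None
    for name, seq in references.items():
        for offset in range(len(seq) - len(read) + 1):
            window = seq[offset:offset + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            hit = (mismatches, name, offset)
            if best is None or hit < best:
                best = hit
    return best


# Hypothetical target and a homologous copy differing at one base.
target = "ACGTACGGTCAGTTCA"
pseudogene = "ACGTACGATCAGTTCA"
read = pseudogene[2:12]  # the read truly originates from the copy

# Target-only reference: the read lands on the target with one mismatch,
# which a variant caller could report as a false positive SNP.
print(best_hit(read, {"target": target}))  # (1, 'target', 2)

# Wider reference: the read is placed on its true origin with no mismatch.
print(best_hit(read, {"target": target, "pseudogene": pseudogene}))
# (0, 'pseudogene', 2)
```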
Analysis Targets
Differences:
Rate of polymorphism
Repetitive structures
Mutation profiles
G/C content
Single genes vs multi gene complexes
BRCA1/2: 1/2000
HLA: 1/29
CFTR: 1/2000
Distributions of insertions and deletions
Distribution of repeat elements
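To make the rates above concrete, the arithmetic below assumes each figure means roughly one polymorphism per that many bases, which is an interpretation of the slide rather than something it states. It also assumes a hypothetical 100 bp read. Under those assumptions a read in HLA carries several expected differences from the reference, while a read in BRCA1/2 or CFTR usually carries none.

```python
def expected_variants(read_length: int, bases_per_variant: float) -> float:
    """Expected number of polymorphisms in one read, assuming uniform spacing."""
    return read_length / bases_per_variant


# Rates as quoted on the slide, read as "one variant per N bases" (assumption).
bases_per_variant = {"BRCA1/2": 2000, "HLA": 29, "CFTR": 2000}
READ_LENGTH = 100  # hypothetical read length in bases

for gene, spacing in bases_per_variant.items():
    print(f"{gene}: {expected_variants(READ_LENGTH, spacing):.2f} variants per read")
# BRCA1/2: 0.05 variants per read
# HLA: 3.45 variants per read
# CFTR: 0.05 variants per read
```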
Segmental Duplications
• Sometimes called Low Copy Repeats (LCRs)
• Highly homologous, >95% sequence identity
• Rare in most mammals
• Comprise a large portion of the human genome (and other primate genomes)
• Important for understanding HLA
Segmental Duplications
Many LCRs are concentrated in "hotspots"
Recombinations in these regions are responsible for a wide
range of disorders, including:
• Charcot-Marie-Tooth syndrome type 1A
• Hereditary neuropathy with liability to pressure palsies
• Smith-Magenis syndrome
• Potocki-Lupski syndrome
Data Analysis Tools
Differences:
Detection rates of complex variants (sensitivity)
False positive rates (accuracy)
Speed
Ease of use
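One way to put numbers on the first two differences is to compare a tool's calls against a trusted variant set. The sketch below is a minimal version of that comparison. The variant tuples and the `compare_calls` helper are hypothetical, and it reports precision (the share of calls that are correct) rather than a formal false positive rate, since the latter needs a count of true negatives.

```python
def compare_calls(truth: set, calls: set) -> dict:
    """Compare called variants with a truth set of (chrom, pos, ref, alt)."""
    tp = len(truth & calls)   # called and real
    fp = len(calls - truth)   # called but not real
    fn = len(truth - calls)   # real but missed
    return {
        "true_positives": tp,
        "false_positives": fp,
        "false_negatives": fn,
        "sensitivity": tp / (tp + fn) if truth else 0.0,
        "precision": tp / (tp + fp) if calls else 0.0,
    }


# Hypothetical truth set and tool output.
truth = {
    ("chr17", 100, "A", "G"),
    ("chr17", 250, "C", "T"),
    ("chr17", 300, "GA", "G"),
    ("chr17", 410, "T", "C"),
}
calls = {
    ("chr17", 100, "A", "G"),
    ("chr17", 250, "C", "T"),
    ("chr17", 410, "T", "C"),
    ("chr17", 999, "G", "A"),  # not in the truth set
}

result = compare_calls(truth, calls)
print(result["sensitivity"], result["precision"])  # 0.75 0.75
```

Exact tuple matching is strict: the same indel written with a different position or padding would count as both a miss and a false call, so real comparisons normalise variant representation first.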
Data analysis
shouldn’t be like this!
“Depending upon which tool you use, you can
see pretty big differences between even the
same genome called with different tools—
nearly as big as the two Life Tech/Illumina
genomes.”
Mark Yandell in BioIT-World.com, June 8, 2011
Examples
• Missing variants
• SNPs, a DNP and deletions
Identify more valid variants
Find homopolymer indels
Examples
• Coverage differences
Four times exon coverage: coverage range [0-96] vs [0-432]
Higher exome coverage: coverage range [0-10] vs [0-24]
First conclusion
Read accuracy is not the limiting factor in
accurate variant analysis.
Example
• Dense region of SNPs
www.omixon.com
Second conclusion
As variant density increases the
performance of most tools goes down.
Variant Calling
• There are a few popular variant callers: GATK, SAMtools mpileup, VarScan
• The most comprehensive (GATK) has a whole pipeline, including a quality recalibration step and an indel realignment step
• These recalibration and realignment steps are highly recommended to be run before any variant call
• Deduplication and removing non-primary alignments may also be required
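The last filtering step can be sketched from the SAM FLAG field alone. The bit values below come from the SAM specification (0x100 secondary, 0x400 duplicate, 0x800 supplementary). The sketch assumes duplicates were already marked by an upstream tool, and treating supplementary alignments as non-primary is a choice made for this example. The helper name and the SAM records are hypothetical.

```python
# FLAG bits defined in the SAM specification.
FLAG_SECONDARY = 0x100      # secondary alignment
FLAG_DUPLICATE = 0x400      # PCR or optical duplicate (set by a marking tool)
FLAG_SUPPLEMENTARY = 0x800  # supplementary alignment

DROP_MASK = FLAG_SECONDARY | FLAG_DUPLICATE | FLAG_SUPPLEMENTARY


def keep_alignment(sam_line: str) -> bool:
    """True for header lines and for primary, non-duplicate alignments."""
    if sam_line.startswith("@"):
        return True
    flag = int(sam_line.split("\t")[1])  # FLAG is the second SAM column
    return (flag & DROP_MASK) == 0


# Hypothetical records, truncated to the mandatory leading columns.
records = [
    "@HD\tVN:1.6\tSO:coordinate",
    "read1\t99\tchr17\t100\t60\t10M",    # primary, not a duplicate
    "read2\t1123\tchr17\t100\t60\t10M",  # 99 + 0x400: marked duplicate
    "read3\t355\tchr17\t500\t0\t10M",    # 99 + 0x100: secondary
    "read4\t2147\tchr17\t900\t60\t10M",  # 99 + 0x800: supplementary
]

kept = [line for line in records if keep_alignment(line)]
print(len(kept))  # 2: the header and read1
```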
Indel realigner problem
Variants that can be hard to find
• DNPs
• TNPs
• Small indels next to SNPs
• 30+ bp indels
• Homopolymer indels
• Homopolymer indel and SNP together
• Indels in palindromes
• Dense regions of variants
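DNPs and TNPs are often reported as separate neighbouring SNPs. A minimal sketch of one way to recover them is to merge single-base substitutions at consecutive positions into one multi-base variant. It assumes all input SNPs lie on the same chromosome and the same haplotype, which real data would need phasing or read-level evidence to confirm. The function name and example positions are hypothetical.

```python
def merge_adjacent_snps(snps):
    """Merge SNPs at consecutive positions into multi-base substitutions.

    snps: iterable of (pos, ref, alt) single-base substitutions, assumed to be
    on one chromosome and one haplotype (an assumption of this sketch).
    """
    merged = []
    for pos, ref, alt in sorted(snps):
        if merged and merged[-1][0] + len(merged[-1][1]) == pos:
            # This SNP directly follows the previous variant: extend it.
            prev_pos, prev_ref, prev_alt = merged[-1]
            merged[-1] = (prev_pos, prev_ref + ref, prev_alt + alt)
        else:
            merged.append((pos, ref, alt))
    return merged


snps = [
    (100, "A", "G"), (101, "C", "T"),                   # becomes a DNP
    (105, "G", "A"),                                    # stays a SNP
    (200, "T", "C"), (201, "A", "G"), (202, "G", "T"),  # becomes a TNP
]
print(merge_adjacent_snps(snps))
# [(100, 'AC', 'GT'), (105, 'G', 'A'), (200, 'TAG', 'CGT')]
```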