1. dia - Amazon Web Services
Considerations for Analyzing Targeted
NGS Data
Introduction
Tim Hague, CTO
Introduction
Many mapping, alignment and variant calling
algorithms exist.
Most of these have been developed for
whole genome sequencing and to some
extent population genetic studies.
Premise
In contrast, NGS-based diagnostics deals
with particular genes or mutations of an
individual.
Different diagnostic targets present specific
challenges.
Goal
Present analysis issues related to
differences in:
Sequencing technologies
Targeting technologies
Target specifics
Pseudogenes and segmental duplication
NGS Sequencers
Illumina
Ion Torrent
Roche 454
(SOLiD)
[Figure panels: Illumina, Ion Torrent, Roche 454]
Moore B, Hu H, Singleton M, De La Vega FM, Reese MG, Yandell M. Genet Med. 2011 Mar;13(3):210-7.
Sequencing Technology
Differences:
Homopolymer error rates
G/C content errors
Read length
Sequencing protocols (single vs paired
reads)
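Two of the differences above, homopolymer runs and G/C content, are easy to measure on a read or a target region. The following is a minimal sketch, not part of the original slides; the function names and the example read are made up for illustration.

```python
from itertools import groupby
from typing import Tuple


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def longest_homopolymer(seq: str) -> Tuple[str, int]:
    """Return (base, length) of the longest run of a single repeated base."""
    best_base, best_len = "", 0
    for base, run in groupby(seq.upper()):
        run_len = sum(1 for _ in run)
        if run_len > best_len:
            best_base, best_len = base, run_len
    return best_base, best_len


# Hypothetical read containing a run of six T bases.
read = "ACGTTTTTTGGCCGCGA"
print(gc_fraction(read))
print(longest_homopolymer(read))  # ('T', 6)
```

Regions with long homopolymer runs or extreme G/C content are the ones where sequencer-specific error profiles are most likely to matter.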
Targeting Methods
PCR primers (e.g. amplicons)
Hybridization probes (e.g. exome kits)
Targeting Technology
Differences:
Exact matching regions vs regions with
SNPs.
Results in:
Need for mapping against whole
chromosomes to avoid false positives.
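The reason for mapping against whole chromosomes can be shown with a toy example. The sketch below is an illustration only: the sequences are invented, and the ungapped mismatch count stands in for a real aligner. A read that really comes from a similar off-target region looks like a target read carrying two SNPs when the target is the only reference available.

```python
def best_hit(read, references):
    """Return (name, offset, mismatches) of the ungapped placement
    with the fewest mismatches across all reference sequences."""
    best = None
    for name, ref in references.items():
        for offset in range(len(ref) - len(read) + 1):
            window = ref[offset:offset + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if best is None or mismatches < best[2]:
                best = (name, offset, mismatches)
    return best


# Invented sequences: the off-target copy differs from the target at two bases.
target = "ACGTACGGATCCTTAGCA"
off_target = "ACGTACGAATCCTTGGCA"
read = off_target[4:16]  # the read truly originates from the off-target copy

# Target-only reference: the read is forced onto the target with 2 mismatches,
# which a variant caller could report as two false positive SNPs.
print(best_hit(read, {"target": target}))  # ('target', 4, 2)

# Larger reference: the read finds its true origin with 0 mismatches.
print(best_hit(read, {"target": target, "off_target": off_target}))  # ('off_target', 4, 0)
```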
Analysis Targets
Differences:
Rate of polymorphism
Repetitive structures
Mutation profiles
G/C content
Single genes vs multi gene complexes
BRCA1/2: 1/2000
HLA: 1/29
CFTR: 1/2000
Distributions of insertions and deletions
Distribution of repeat elements
Segmental Duplications
Sometimes called Low Copy Repeats (LCRs)
Highly homologous, >95% sequence identity
Rare in most mammals
Comprise a large portion of the human genome (and
other primate genomes)
Important for understanding HLA
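The >95% sequence identity criterion can be checked directly on two already-aligned sequences. This is a rough sketch under simple assumptions: the inputs are pre-aligned and of equal length, gap columns count as mismatches, and the function name and example sequences are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns where both sequences carry the same base.
    Columns containing a gap ('-') are counted as mismatches."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        return 0.0
    matches = sum(
        a == b and a != "-"
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
    )
    return 100.0 * matches / len(aligned_a)


# Two invented 100 bp copies that differ at three positions.
copy_a = "ACGT" * 25
copy_b = list(copy_a)
copy_b[10], copy_b[50], copy_b[90] = "T", "A", "C"
copy_b = "".join(copy_b)

identity = percent_identity(copy_a, copy_b)
print(identity, identity > 95.0)  # 97.0 True
```

At this level of identity, short reads from one copy can map equally well to the other, which is why such regions complicate targeted analysis.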
Segmental Duplications
Many LCRs are concentrated in "hotspots"
Recombinations in these regions are responsible for a wide
range of disorders, including:
Charcot-Marie-Tooth syndrome type 1A
Hereditary neuropathy with liability to pressure palsies
Smith-Magenis syndrome
Potocki-Lupski syndrome
Data Analysis Tools
Differences:
Detection rates of complex variants (sensitivity)
False positive rates (specificity)
Speed
Ease of use
Data analysis
shouldn’t be like this!
“Depending upon which tool you use, you can
see pretty big differences between even the
same genome called with different tools—
nearly as big as the two Life Tech/Illumina
genomes.”
Mark Yandell in BioIT-World.com, June 8, 2011
Examples
Missing variants
SNPs, a DNP and deletions
Identify more valid variants
Find homopolymer indels
Examples
Coverage differences
Four times exon coverage (track scales [0-96] vs [0-432])
Higher exome coverage (track scales [0-10] vs [0-24])
First conclusion
Read accuracy is not the limiting factor in
accurate variant analysis.
Example
Dense region of SNPs
www.omixon.com
Second conclusion
As variant density increases the
performance of most tools goes down.
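One way to find the regions where this effect is likely to appear is to count variants in a sliding window. The sketch below is illustrative only; the window size, the threshold and the positions are arbitrary assumptions, not values from the talk.

```python
from bisect import bisect_right


def dense_windows(positions, window=50, min_variants=4):
    """Return (start, end, count) for every window that starts at a variant
    position, spans `window` bases and holds at least `min_variants` variants."""
    positions = sorted(positions)
    hits = []
    for i, start in enumerate(positions):
        end = start + window - 1
        # Index just past the last variant that still falls inside the window.
        j = bisect_right(positions, end)
        count = j - i
        if count >= min_variants:
            hits.append((start, end, count))
    return hits


# Hypothetical variant positions on one chromosome.
variant_positions = [100, 105, 112, 130, 400, 900, 910]
print(dense_windows(variant_positions))  # [(100, 149, 4)]
```

Windows flagged this way are candidates for closer inspection, or for comparing the output of more than one tool.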
Variant Calling
There are a few popular variant callers: GATK, SAMtools mpileup, VarScan
The most comprehensive (GATK) has a whole pipeline, including a quality recalibration step and an indel realignment step
These recalibration and realignment steps are highly recommended to be run before any variant call
Deduplication and removing non-primary alignments may also be required
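Removing duplicates and non-primary alignments can be expressed as a filter on the SAM FLAG field. The sketch below uses the flag bits defined in the SAM specification and makes two assumptions: duplicates have already been marked by an upstream tool, and supplementary alignments are treated as non-primary. The function names and the example records are hypothetical.

```python
# Flag bits from the SAM specification.
FLAG_SECONDARY = 0x100      # not the primary alignment
FLAG_DUPLICATE = 0x400      # PCR or optical duplicate (set by a marking tool)
FLAG_SUPPLEMENTARY = 0x800  # supplementary (e.g. chimeric) alignment

EXCLUDE_MASK = FLAG_SECONDARY | FLAG_DUPLICATE | FLAG_SUPPLEMENTARY


def keep_for_calling(flag: int) -> bool:
    """True if none of the excluded flag bits are set."""
    return flag & EXCLUDE_MASK == 0


def filter_sam_lines(lines):
    """Yield header lines unchanged and alignment lines that pass the filter."""
    for line in lines:
        if line.startswith("@"):
            yield line
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if keep_for_calling(flag):
            yield line


# Minimal made-up records: only the first three SAM columns are shown.
sam = [
    "@HD\tVN:1.6",
    "read1\t99\tchr1",    # primary, not a duplicate: kept
    "read2\t1123\tchr1",  # 99 + 0x400, duplicate: dropped
    "read3\t355\tchr1",   # 99 + 0x100, secondary: dropped
    "read4\t2147\tchr1",  # 99 + 0x800, supplementary: dropped
]
for kept in filter_sam_lines(sam):
    print(kept)
```

In practice the same mask is usually passed to an existing tool rather than reimplemented; the point here is only which alignments end up excluded.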
Indel realigner problem
Variants that can be hard to find
DNPs
TNPs
Small indels next to SNPs
30+ bp indels
Homopolymer indels
Homopolymer indel and SNP together
Indels in palindromes
Dense regions of variants
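DNPs and TNPs are often reported as separate neighbouring SNPs. A post-processing step can merge them, as in the minimal sketch below. It assumes all input calls are single-base substitutions on the same chromosome and the same haplotype (phase is not checked); the function names and example calls are hypothetical.

```python
def merge_adjacent_snps(snps):
    """Merge SNPs at consecutive positions into multi-base substitutions.
    `snps` is a list of (position, ref_base, alt_base) tuples."""
    merged = []
    for pos, ref, alt in sorted(snps):
        if merged and merged[-1][0] + len(merged[-1][1]) == pos:
            # This SNP directly follows the previous call: extend it.
            prev_pos, prev_ref, prev_alt = merged[-1]
            merged[-1] = (prev_pos, prev_ref + ref, prev_alt + alt)
        else:
            merged.append((pos, ref, alt))
    return merged


def classify(ref: str) -> str:
    """Name a substitution by the number of reference bases it replaces."""
    return {1: "SNP", 2: "DNP", 3: "TNP"}.get(len(ref), "MNP")


calls = [
    (100, "A", "G"), (101, "C", "T"),                   # adjacent pair
    (200, "G", "A"),                                    # isolated SNP
    (300, "T", "C"), (301, "A", "G"), (302, "C", "A"),  # adjacent triple
]
for pos, ref, alt in merge_adjacent_snps(calls):
    print(pos, ref, alt, classify(ref))
```

With the example calls this prints one DNP at position 100, one SNP at 200 and one TNP at 300.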