Transcript Map reads
Introduction to Variant Analysis
of Exome- and Amplicon sequencing data
Lecture by:
Date:
Training:
Extended version see:
Dr. Christian Rausch
29 May 2015
TraIT Galaxy Training
http://tinyurl.com/o74uehq
Focus of lecture and
practical part
• Lecture: from NGS data to Variant analysis
• Hands on training: we will analyze NGS read
data of a panel of cancer genes (Illumina
TruSeq Amplicon - Cancer Panel) of prostate
cancer cell line VCaP.
• Analysis software tools will be run
interactively through Galaxy, “a web-based
platform for data intensive biomedical
research”
Image source: A survey of tools for variant analysis of next-generation
genome sequencing data, Pabinger et al., Brief Bioinform (2013) doi:
10.1093/bib/bbs086
Workflow: QC & Mapping reads
Input reads
(fastq files)
Quality check
with FastQC
Not OK?
Quality- & Adaptertrimming
OK?
Map reads to
reference genome
using e.g. BWA or
Bowtie2
Sort by coordinates using SAMtools
sort or PicardTools SortSam
Output:
Sorted BAM file
(binary SAM
sequence
alignment map)
Variant Calling & Annotation pipeline
Reads
mapped to
reference
genome
SAM or BAM
file
Filter variant
allele
frequency
Discard variants
with variant
allele frequency
below
threshold
SAMtools Mpileup
• Analyze
mismatches &
compute
likelihoods of
SNP etc.
Slice VCF
Cut vcf file to retain
only the regions
that were enriched
for sequencing (=
discard regions
covered by off
target reads
Varscan2
• does the actual calling
• Output: VCF
• Various statistics on quality of
each variant (read depth etc.),
homozygous/heterozygous etc.
ANNOVAR: Annotate
SNPs with:
• statistics (dbSNP,
1000 genomes
etc.) and
• predictions (SIFT,
PholyPhen etc.)
DGIdb: Drug Gene
Interaction Database
Find drugs to
diseases arising from
gene mutations
standard format to sequencing reads with quality information
(‘Q’ stands for Quality)
@unique identifiers and optional descriptions of the sequence
the actual DNA sequence
+ separator optionally followed by description
The quality values of the sequence
(one character per nucleotide)
More info see wikipedia ‘FASTQ_format’
standard format to store sequence data (DNA and protein seq.)
>FASTA header, often contains unique identifiers and descriptions of the sequence
• format
Read Quality Control with
FastQC
Examples of per base sequence quality
of all reads
Not so good, might still be usable,
depending on application
50 bp
Historical example of very first Solexa
reads 2006 (Solexa acquired by Illumina 2007)
“Mapping reads to the reference” is
finding where their sequence occurs in the genome
100 bp identified
200 – 500 bp unknown sequence
100 bp identified
Source: Wikimedia, file:Mapping Reads.png
“Mapping reads to the reference”:
naïve text search algorithms are too slow
• Naïve approach: compare each read with every position in the genome
– Takes too long, will not find sequences with mismatches
• Search programs typically create an index of the reference sequence (or
text) and store the reference sequence (text) in an advanced data
structure for fast searching.
• An index is basically like a phone book (with
addresses) Quickly find address (location)
of a person
Example of algorithm using ‘indexed
seed tables’ to quickly find locations
of exact parts of a read
Read Mapping: General problems
• Read can match equally well at more than one location (e.g.
repeats, pseudo-genes)
• Even fit less well to it’s actual position, e.g. if it carries a break
point, insertions and/or deletions
Example from: genetics.stanford.edu/gene211/lectures/Lecture3_Resequencing_Functional_Genomics-2014.pdf
Variant Calling & Annotation
Possible reasons for a mismatch
• True SNP
• Error generated in library preparation
• Base calling error
– May be reduced by better base calling methods, but cannot be
eliminated
• Misalignment (mapping error):
– Local re-alignment to improve mapping
• Error in reference genome sequence
Partly from www.biostat.jhsph.edu/~khansen/LecSNP2.pdf
Variant Calling: Principles
• Naïve approach (used in early NGS studies):
– Filter base calls according to quality
– Filter by frequency
– Typically, a quality Filter of PHRED Q 20 was used (i.e.,
probability of error 1% ).
– Then, the following frequency thresholds were used according to
the frequency of the non-ref base, f(b):
– The frequency heuristic works well if the sequencing depth is
high, so that the probability of a heterozygous nucleotide falling
outside of the 20% - 80% region is low.
– Problems with frequency heuristic:
• For low sequencing depth, leads to undercalling of heterozygous
genotypes
• Use of quality threshold leads to loss of information on
• individual read/base qualities
• Does not provide a measure of confidence in the call
In parts from: compbio.charite.de/contao/index.php/genomics.html
Variant Calling: Principles
• Today’s Variant Callers rely on probability
calculations
• Use of Bayes’ Theorem:
– E.g. MAQ: One of the first widely used read mappers
and variant callers
• Takes into account a quality score for whole read
alignment & quality of base at the individual
position
• Calls the most likely genotype given observed
substitutions
• Reliability score can be calculated
Variant Calling & Annotation: Popular Tools
•
•
•
•
•
SAMtools (Mpileup & Bcftools)
GATK
Varscan2
Freebayes
MAQ
VCF = Variant Call Format
• Variant Call Format / BCF = binary version
ANNOVAR
a Swiss Knife to annotate genetic variants
(SNPs and CNVs)
• Input:
– variants as VCF file
– various databases with statistical and predictive annotations: dbSNP,
1000 genomes, …
• Output:
– In coding region? Which gene? How frequently observed in 1000
genomes project? (and more statistics).
• According to coordinates from RefSeq genes, UCSC genes, ENSEMBL
genes, GENCODE genes or others, etc.
– In non-coding region? Conserved region?
• According to conservation in 44 species, transcription factor binding sites,
GWAS hits, etc.
– Predicted Effect?
• Score of SIFT, PolyPhen-2, GERP++
Annotation with DGIdb: mining the druggable genome
Drug Gene Interaction Database:
Matching disease genes with potential drugs.
• Searches genes against a compendium of drug-gene
interactions and identify potentially 'druggable' genes
Practical part
Import Workflow into Galaxy
• Import the workflow 'Goecks Exome pipeline hg19_noDedup’
to your workflows by either
– clicking on ‘Shared Data’ ‘Published Workflows’ Goecks Exome
pipeline hg19_noDedup green plus symbol to import
– Or click here:
– http://bioinf-galaxian.erasmusmc.nl/galaxy/u/crausch/w/importedgoecks-exome-pipeline-hg19
– This workflow has been imported with small modifications from:
http://www.usegalaxy.org/cancer Ref: Goecks et al., Cancer Med.
2014.
Import data set
• Import data by either
– Clicking on: http://bioinfgalaxian.erasmusmc.nl/galaxy/u/crausch/h/vcap-variant-analysis
– Or go to ‘Shared Data’ published histories VCaP Variant Analysis
– Or download the data from http://tinyurl.com/pfmshlu, unzip and
upload to Galaxy by clicking on
Run Goeck’s Exome analysis pipeline on your
data
• Chose parameter Variant allele frequency: e.g. 10%
• Chose the name of the data
• The genomic regions (bed file) contains the locations of the
exons / amplicons
• Overview of the workflow: see next slide
• Because the whole workflow runs about 20 min, you can
import all results from:
– http://bioinf-galaxian.erasmusmc.nl/galaxy/u/crausch/h/goecksexome-pipeline-hg19nodedupvcap
• Before looking at the results of the variant analysis pipeline,
check that our seq. data has good quality, using program
FastQC (logically that would be the first one would do…)
Overview of the Variant Analysis workflow
Legend: see next slide
Overview of the Variant
Analysis workflow
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
Input: 2 fastq files, of the forward and reverse reads. Make sure that sequencing adapters and
parts of reads with low quality have been removed.
Mapping of the reads to the reference genome hg19, output format: BAM
Sorting: Reads are sorted according to their coordinate position on the genome
Marking and Removing of ‘duplicate reads’: Reads with the identical position on the genome are
likely duplicates created during the PCR amplification step. Exome sequencing typically relies on
hybridization-based selection of genomic shared DNA fragments in the so-called target regions.
Because the DNA sharing step is (assumed to be) a random process, reads starting at exactly the
same position are more likely to be due to PCR amplification than to originate from two
independent fragments starting at the same position. Note: this step is omitted when processing
Amplicon sequencing reads, because reads of a given amplicon all start at the same position but
can be copies of different original templates.
Summary of Alignment Statistics
MPileup: Variant counts per position and statistics (1st step of variant calling)
Varscan2: variant calling with program Varscan 2. Check: different analysis types are possible. In
this course: “Analysis type: single nucleotide variation” is selected. Output format: VCF (variant
call file).
Label and filter out variants with low Variant Allele frequency.
Slice VCF: discard all genetic regions except the exons (defined in the input BED file).
ANNOVAR: filter and annotate variants.
Annotate with DGI db (Drug Gene Interaction database).
Variant filtering and visualization
• Open the Annovar output in a new tab
• Go with the mouse over the lines of the VCF file and open the
file in trackster (click on bar chart symbol)
• In trackster, load the sorted BAM file as additional track
• Choose one mutation that is annotated by the last annotation
step, DGI. What was done in the last filter step before DGI was
run?
• Read all annotations that the mutation has that you have
selected
• Browse to this genomic location in Trackster. What is the
coverage? Is this variant reliably covered?