pgmx - UCSD CSE

Personalized genomics
Goal
• Input
– Genomic sequence (WGS) from family
– Pedigree & affectedness
– Disease (standard ontology needed)
• Output
– Genes/mutations relevant to the disease.
[Figure: pipeline flowchart. Data: Pedigree & Affectedness, Genomic Sequence, FASTQ, BAM, VCF. Steps: Read Mapping, BAM prep, SNV Calling, SV Calling & Validation, Merge VCFs, Variant annotation, Variant filtering, Disease Gene/Mutation. Resources: GRCh37, known sites, dbSNP, SeattleSeq, HGMD.]
1. Sequencing
2. Mapping
3. BAM Prep
4. Variant calling (SNV)
5. Variant Calling (SV)
6. VCF manipulation/merging
7. Variant annotation
8. Variant filtering
9. Disease gene association
• Platform
– HiSeq/MiSeq
– PacBio
– Ion proton (life)
– CGI
– …
• Mode
– Whole Genome
– Exome
– RNA-Seq
– …
• Map short reads (FASTQ format) to a reference
• Output a BAM file
• Mapping tools
– BWA
– Bowtie
– Custom
• Compute/disk intensive part of the pipeline.
• WGS file size: ~200Gb per sample.
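As a rough illustration of the mapping step, the sketch below only assembles the command lines for one common setup (`bwa mem` piped into `samtools sort`); it does not run anything. The choice of `bwa mem`, the thread count, and every file name are assumptions made for illustration, since the slides do not say how the aligner was invoked.

```python
import shlex

def mapping_commands(reference, fastq1, fastq2, out_bam, threads=4):
    """Build (but do not run) an alignment command and a sort command.

    All paths are hypothetical placeholders; the reference is assumed
    to have been indexed for BWA beforehand.
    """
    bwa = ["bwa", "mem", "-t", str(threads), reference, fastq1, fastq2]
    # "-" tells samtools to read the alignments from standard input.
    sort = ["samtools", "sort", "-o", out_bam, "-"]
    return bwa, sort

def as_shell_pipeline(*commands):
    """Render the commands as one shell pipeline string."""
    return " | ".join(shlex.join(cmd) for cmd in commands)

bwa_cmd, sort_cmd = mapping_commands(
    "GRCh37.fa", "s1_R1.fastq", "s1_R2.fastq", "s1.sorted.bam"
)
pipeline = as_shell_pipeline(bwa_cmd, sort_cmd)
print(pipeline)
```

Keeping the commands as argument lists until the last moment avoids quoting mistakes if a path contains spaces.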
• Input: BAM file
• Output: BAM file
– Sorting BAM
  • Picard Tools
– Marking (PCR) Duplicates
  • Picard Tools
– Base-Q Covariates & Recalibration
  • GATK
– INDEL Re-alignment
  • GATK
• Compute intensive part of the pipeline
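To make the duplicate-marking idea concrete, here is a toy sketch. It treats reads that share chromosome, 5' position, and strand as copies of one fragment and keeps the copy with the highest summed base quality. This is a simplification for illustration, not a reimplementation of any particular tool; the `Read` class and its fields are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Read:
    name: str
    chrom: str
    pos: int            # 5' mapping position
    reverse: bool       # True if mapped to the reverse strand
    qual_sum: int       # sum of base qualities
    duplicate: bool = False

def mark_duplicates(reads):
    """Flag all but the best-quality read in each (chrom, pos, strand) group."""
    groups = defaultdict(list)
    for read in reads:
        groups[(read.chrom, read.pos, read.reverse)].append(read)
    for group in groups.values():
        best = max(group, key=lambda r: r.qual_sum)
        for read in group:
            read.duplicate = read is not best
    return reads

reads = [
    Read("a", "chr1", 100, False, 900),
    Read("b", "chr1", 100, False, 950),   # same position and strand as "a"
    Read("c", "chr1", 100, True, 800),    # same position, other strand
    Read("d", "chr2", 100, False, 700),
]
mark_duplicates(reads)
flags = {r.name: r.duplicate for r in reads}
print(flags)
```

Duplicates are flagged rather than deleted, so downstream steps can decide whether to ignore them.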
• Input: multiple BAMs
• Output: VCF (loci that differ from the reference)
– SNVs
  • Broad’s GATK Caller
– SVs
  • Custom pipelines needed
– Browsing variant calls
  • Genome Savant
• Confirming variants via resequencing
• Compute intensive part of the pipeline.
• Integrating SVs and SNVs.
[Figure: SV calling workflow. Labels: Extract FASTQ, Bowtie, BreakDancer, CNVer, Reprever, Zygosity calling, GQL + Genome Savant.]
VCF merging and validation
• Given multiple VCF files, merge them (each column corresponds to an individual sample).
• Can be mostly done by VCFtools. Our goal would be to visualize problematic regions for manual validation, and design primers for confirmation automatically.
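The merge itself can be pictured with a small pure-Python sketch: each sample contributes a mapping from site to genotype, and the merged table has one genotype column per sample. The data layout and the `merge_calls` helper are hypothetical. A site absent from a sample is written as `./.` (unknown) rather than assumed to be homozygous reference, which is a deliberate, conservative assumption.

```python
def merge_calls(per_sample):
    """Merge per-sample calls into rows with one genotype column per sample.

    per_sample: {sample_name: {(chrom, pos, ref, alt): genotype_string}}
    Returns (sample_names, rows); each row is site fields plus genotypes.
    """
    samples = sorted(per_sample)
    # Union of all sites seen in any sample. Chromosome names sort as
    # strings here, which is enough for a toy example.
    sites = sorted({site for calls in per_sample.values() for site in calls})
    rows = []
    for site in sites:
        genotypes = [per_sample[s].get(site, "./.") for s in samples]
        rows.append((*site, *genotypes))
    return samples, rows

per_sample = {
    "father": {("1", 1000, "A", "G"): "0/1"},
    "mother": {("1", 1000, "A", "G"): "0/1", ("1", 2000, "C", "T"): "0/1"},
    "child": {("1", 1000, "A", "G"): "1/1"},
}
samples, rows = merge_calls(per_sample)
for row in rows:
    print(row)
```

Rows containing `./.` entries are natural candidates for the manual validation mentioned above, because the missing genotype may hide either a reference call or a coverage gap.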
• Input: variant calls (raw VCF)
• Output: annotation of variants (annotated VCF)
– Coding
– Synonymous
– Splice-variant
– Regulatory
– ncRNA
• Annotating coding variation for deleteriousness
– SIFT
– Polyphen
– GERP
– SeattleSeq
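For coding variants, the most basic annotation compares the reference codon with the altered codon under the standard genetic code. The sketch below does only that; the `classify_codon_change` helper and its category labels are illustrative assumptions, and real annotators also handle transcripts, strand, splice sites, and indels.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "*" is a stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

def classify_codon_change(ref_codon, alt_codon):
    """Classify a single-codon substitution by its effect on the protein."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"       # premature stop codon
    if ref_aa == "*":
        return "stop-loss"      # label chosen here, not from the slides
    return "missense"

print(classify_codon_change("GAG", "GTG"))  # Glu -> Val
print(classify_codon_change("TAC", "TAA"))  # Tyr -> stop
print(classify_codon_change("CTG", "CTA"))  # Leu -> Leu
```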
• Input: VCF (annotated)
• Output: set of relevant variants/genes
– Filters based on variant annotation
  • deleterious: missense/nonsense/splice
– Filters based on inheritance patterns
  • Disease model (recessive/dominant/compound het)
• Filtering tools:
  • Gemini (http://gemini.readthedocs.org/en/latest/)
  • FamAnn (https://sites.google.com/site/famannotation/home)
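An inheritance-pattern filter can be sketched as a check of each family member's genotype against the disease model. The functions below are hypothetical and assume full penetrance, a single variant at a time, and VCF-style genotype strings; the compound-het model is omitted because it needs pairs of variants in the same gene.

```python
def alt_count(genotype):
    """Number of non-reference alleles in e.g. '0/1' or '1|1'; None if missing."""
    alleles = genotype.replace("|", "/").split("/")
    if "." in alleles:
        return None
    return sum(allele != "0" for allele in alleles)

def fits_recessive(genotypes, affected):
    """Affected members are homozygous alt; unaffected members are not."""
    for person, gt in genotypes.items():
        n = alt_count(gt)
        if n is None:
            return False
        if (person in affected) != (n == 2):
            return False
    return True

def fits_dominant(genotypes, affected):
    """Affected members carry the variant; unaffected members do not."""
    for person, gt in genotypes.items():
        n = alt_count(gt)
        if n is None:
            return False
        if (person in affected) != (n >= 1):
            return False
    return True

trio = {"father": "0/1", "mother": "0/1", "child": "1/1"}
affected = {"child"}
print(fits_recessive(trio, affected), fits_dominant(trio, affected))
```

Treating a missing genotype as a failed check is a conservative choice; a more permissive filter could skip such members instead.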
• Input: collection of genes with mutations.
• Output: relevant diseases, functional information
• Basic Information
– Genecards
• Adding pathway
– Ingenuity
• Databases of Disease gene links
– HGMD
– OMIM
– ClinVar
• We are currently using an outdated version of HGMD, but can possibly do better, or just replace it with Step 9.
• Automated machine learning approach to correlating genes with diseases
– Standard ontologies for diseases
  • MeSH
  • Disease Ontology
– Standard vocabulary for gene names
– ML approach (parse abstracts to make these connections)
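As a minimal stand-in for "parsing abstracts", the sketch below counts how often a gene name and a disease name are mentioned in the same abstract. This is plain co-occurrence counting, not a machine learning model, and it uses crude substring matching; the abstracts are toy strings written for this example, and `co_mentions` is a hypothetical helper. It does show why standard vocabularies matter: matching only works if genes and diseases are written consistently.

```python
from collections import Counter
from itertools import product

def co_mentions(abstracts, genes, diseases):
    """Count abstracts in which a gene and a disease are both mentioned."""
    counts = Counter()
    for text in abstracts:
        lowered = text.lower()
        found_genes = [g for g in genes if g.lower() in lowered]
        found_diseases = [d for d in diseases if d.lower() in lowered]
        counts.update(product(found_genes, found_diseases))
    return counts

# Toy abstracts, invented for illustration only.
abstracts = [
    "Mutations in CFTR cause cystic fibrosis.",
    "CFTR variants and cystic fibrosis severity.",
    "HBB mutations underlie sickle cell anemia.",
]
counts = co_mentions(
    abstracts,
    genes=["CFTR", "HBB"],
    diseases=["cystic fibrosis", "sickle cell anemia"],
)
print(counts.most_common())
```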
Step                   Disk/sample   CPU/sample
Read Mapping           800 Gb        320 h*
BAM prep               150 Gb        140 h
SNV & INDEL calling    20 Gb         540 h*
SV & CNV calling       200 Gb        30 h + 30 h
Merging VCF            1.5 Gb        1 h
Variant Annotation     20 Gb         1 h
Variant Filtering      -             1 h
Disease/Gene Assoc.    ?             ?
*amenable to multithread parallelization (up to a point when memory becomes bottleneck)
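For a per-sample budget, the figures above can simply be summed. The sketch below assumes that "SV & CNV calling" takes 200 Gb and 30 h + 30 h and that "Merging VCF" takes 1.5 Gb and 1 h, which is a reading of the flattened slide rather than a confirmed assignment; entries shown as "-" or "?" are left out of the totals.

```python
# (step, disk in Gb, CPU hours); None marks "-" or "?" in the table.
RESOURCES = [
    ("Read mapping", 800, 320),
    ("BAM prep", 150, 140),
    ("SNV & INDEL calling", 20, 540),
    ("SV & CNV calling", 200, 30 + 30),
    ("Merging VCF", 1.5, 1),
    ("Variant annotation", 20, 1),
    ("Variant filtering", None, 1),
    ("Disease/gene assoc.", None, None),
]

total_disk = sum(disk for _, disk, _ in RESOURCES if disk is not None)
total_cpu = sum(cpu for _, _, cpu in RESOURCES if cpu is not None)

print(f"disk per sample: {total_disk} Gb")
print(f"CPU per sample:  {total_cpu} h (about {total_cpu / 24:.0f} CPU-days)")
```

Under this reading, read mapping and SNV/INDEL calling account for most of the CPU time, and they are also the two steps marked as amenable to multithreading.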
• Variant annotation
• Variant filtering
• Gene Disease connection
The HPO aims to act as a central resource to connect several genomics datasets with the diseasome.
Sebastian Köhler et al. Nucl. Acids Res. 2014;42:D966-D974
© The Author(s) 2013. Published by Oxford University Press.
• 10,000 terms describing human phenotypic abnormalities (7300 human hereditary syndromes).
• 2741 genes used to create DAG (Disease Associated Genes)
• 3 independent sub-ontologies
– mode of inheritance
– onset and clinical course
– phenotypic abnormalities
• The phenotypic terms are cross-linked
• Whole genome sequencing
• Exome sequencing
• Disease associated genome sequencing
• At 20X coverage, what fraction of het variants will be called?
– 15% will be missed
[Figure: plot with two series, "Missing homozygous variant" and "Het variants"; y-axis ticks from 0 to 1.4, x-axis ticks from 2 to 10.]
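Why heterozygous variants get missed at moderate coverage can be illustrated with a simple sampling model. The sketch assumes that read depth at a site is Poisson-distributed around the mean coverage, that each read shows the alternate allele with probability 1/2, and that the variant is called only when at least `min_alt_reads` reads support it. Under these assumptions the number of supporting reads is itself Poisson with mean coverage/2. The model and the threshold are assumptions of this sketch; it is not meant to reproduce the 15% figure, which depends on the caller and criteria actually used.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events under a Poisson(lam) distribution."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def prob_het_missed(mean_coverage, min_alt_reads):
    """P(fewer than min_alt_reads reads carry the alternate allele).

    Alt-supporting reads at a het site ~ Poisson(mean_coverage / 2)
    under the assumptions stated in the text.
    """
    lam = mean_coverage / 2.0
    return sum(poisson_pmf(i, lam) for i in range(min_alt_reads))

for threshold in (3, 5, 7):
    p = prob_het_missed(20, threshold)
    print(f"20X, need >= {threshold} alt reads: P(missed) = {p:.4f}")
```

The miss rate rises quickly with the required number of supporting reads and falls with coverage, which is the trade-off the slide's question points at.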
• Remove off-target and synonymous variants
• Test population frequency of other variants
– These are known SNPs
  • frequency score: max(0, 1 - 0.13 exp(100*f))
– Scores from SIFT/Polyphen
– Most pathogenic score was taken
• Final variant score: pathogenic score × frequency score
• Clinical relevance score: semantic similarity between phenotypic abnormalities and 2741 genes.
• Average (clinical, variant)
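The scoring steps above can be written out directly. In this sketch, `f` is assumed to be the population allele frequency as a fraction (so 0.01 means 1%), pathogenicity scores are assumed to be scaled so that higher means more damaging, and the clinical relevance score is taken as a given number in [0, 1]. The function names are hypothetical.

```python
import math

def frequency_score(f):
    """max(0, 1 - 0.13 * exp(100 * f)): drops to 0 for common variants."""
    return max(0.0, 1.0 - 0.13 * math.exp(100.0 * f))

def variant_score(pathogenicity_scores, f):
    """Most pathogenic predictor score times the frequency score."""
    return max(pathogenicity_scores) * frequency_score(f)

def combined_score(clinical_score, pathogenicity_scores, f):
    """Average of the clinical relevance score and the variant score."""
    return (clinical_score + variant_score(pathogenicity_scores, f)) / 2.0

for f in (0.0, 0.001, 0.01, 0.05):
    print(f"f = {f:<5} frequency score = {frequency_score(f):.3f}")
print(combined_score(0.5, [0.4, 0.9], 0.0))
```

With this formula an unobserved variant (f = 0) scores 0.87 rather than 1, and the score reaches 0 once the frequency passes roughly 2%.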
• Simulated mutation data from HGMD