Transcript 06-pvpx

The role of ontologies in the analysis of personal genome data
Imane Boudellioua
King Abdullah University of Science and Technology
NGS and personalized medicine
• Sequencing cost is dropping faster than expected
• What to do with the massive amount of genomic data?
• Need computational tools for genome analysis:
  • Better diagnosis
  • Effective treatment
Human genetic variations
• All humans have 99.9% identical DNA.
• The 0.1% difference in DNA is what makes us unique!
• Genetic variations
  • Different traits (skin color, height, hair color)
  • Could lead to diseases.
Challenge: Identify which genetic variant is causing a disease
Mutations and Diseases
De novo mutations
• Humans have a mutation rate of between 7.6 × 10⁻⁹ and 2.2 × 10⁻⁸ per bp per generation
• An average newborn is calculated to have acquired 50 to 100 new mutations in their genome
• 0.86 novel nonsynonymous mutations
• Recurrent mutations
• Copy number variations
• Large genes, e.g. Duchenne muscular dystrophy
Mendelian disorders
• Diseases in which the phenotypes are largely determined by the action, or lack of action, of mutations at individual loci.
• Rare: 1% of all live born individuals (<1:2000)
• Over 6,000 diseases are known
• 4 types of inheritance:
  • Autosomal dominant
  • Autosomal recessive
  • X-linked dominant
  • X-linked recessive
How do we identify the de novo mutation responsible?
Compared to the human genome reference sequence, which is itself constructed from 13 individuals
1000 Genomes Project: A map of human genome variation from population-scale sequencing, Nature 467:1061–1073
Identifying a causative de novo mutation
1. Sequence genome and select exomic sequences (exome re-sequencing): ~22,000 variants
2. Keep coding variants (e.g. MSGTCASTTR vs. MSGTNASTTR): ~5,640 coding variants
3. Minor allele frequency filter (1%): ~143 novel coding variants
4. Trio analysis; sequence parents and exclude their private variants: 10 de novo novel coding variants
5. Look at affected gene function and mutational impact: identify a single likely-causative mutation?
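The filtering funnel above can be sketched in a few lines of Python. This is only an illustration on toy data: the `Variant` fields, the gene labels, and the `filter_funnel` helper are hypothetical names, not part of any tool mentioned in this talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    gene: str          # placeholder gene label (hypothetical)
    is_coding: bool    # variant falls in a protein-coding region
    maf: float         # population minor allele frequency, 0.0 if never observed
    in_parents: bool   # variant is also present in either parent of the trio


def filter_funnel(variants, maf_threshold=0.01):
    """Apply the three filters in order: coding, rare (MAF), de novo (trio)."""
    coding = [v for v in variants if v.is_coding]
    rare = [v for v in coding if v.maf < maf_threshold]
    de_novo = [v for v in rare if not v.in_parents]
    return coding, rare, de_novo


# Toy input, invented purely for illustration.
toy_variants = [
    Variant("GENE_A", True, 0.0, False),    # coding, novel, de novo
    Variant("GENE_B", True, 0.0, True),     # coding, novel, inherited
    Variant("GENE_C", True, 0.15, True),    # coding, common
    Variant("GENE_D", False, 0.0, False),   # non-coding
    Variant("GENE_E", True, 0.004, True),   # coding, rare, inherited
]

coding, rare, de_novo = filter_funnel(toy_variants)
print(len(toy_variants), len(coding), len(rare), len(de_novo))
```

Each stage only shrinks the candidate list. The last step of the funnel (judging gene function and mutational impact) is left out here because it is not a simple filter.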
Disease-causing variants discovery
• First successful identification of the disease-causing variation of Miller syndrome from WES [Ng et al., 2010]
• Current success rate of WES and WGS: 22–25%
  • Using systematic filtering of variants (MAF, quality, etc.)
• Issues:
  • Approach does not scale well
  • A human exome contains around 30K variants, with about 100 genuine loss-of-function mutations
  • After filtering, hundreds of variants are left. Which one is causative?
“Needles in stacks of Needles!”
Gregory M. Cooper and Jay Shendure. Needles in stacks of needles: finding disease-causal variants in a wealth of genomic data. Nature Reviews Genetics, 12(9):628–640, September 2011.
Predicting deleteriousness
• Leveraging biochemical, evolutionary, and structural information about variants.
• In WES data:
  • SIFT [Kumar et al., 2009]
  • PolyPhen [Adzhubei et al., 2010]
  • GERP++ [Davydov et al., 2010]
• In WGS data:
  • GWAVA [Ritchie et al., 2014]
  • MutationTaster [Schwarz et al., 2014]
  • FATHMM-MKL [Shihab et al., 2015]
More sophisticated methods...
• Combined Annotation-Dependent Depletion (CADD) [Kircher et al., 2014]
  • A trained SVM classifier over 1K features associated with known pathogenic variants (functional annotations, scores from SIFT, PolyPhen, GERP++, etc.)
• Another method, DANN [Quang et al., 2014]
  • A deep neural network classifier trained on the same features and training data as CADD
  • Outperformed CADD!
A promising approach...
• Adding model organisms into the mix!
• Computational phenotype analysis:
  • To exploit phenotype-genotype associations observed in humans and model organisms
  • Phenotypic similarity of a patient's phenotypes to known diseases and characterized non-human disease models
  • To prioritize disease-causing variants.
• Existing tools:
  • PhenomeNet [Hoehndorf et al., 2011]
  • Extasy [Sifrim et al., 2013]
  • Exomiser [Robinson et al., 2014]
  • Phevor [Singleton et al., 2014]
PhenomeNET-VP (PVP)
• A phenotype-driven method for the prioritization of disease-causative variants in both WES and WGS data
• Incorporates the PhenomeNET ontology for cross-species phenotypic similarity (human, mouse, zebrafish)
• A trained random forest classifier to score variants for prioritization
• Scores SNPs and indels
• Combines two sets of features:
  • Molecular information: to assess variants' pathogenicity
  • Phenotypic information: to determine variants' causality
PVP Training data design
• ClinVar pathogenic variants (43,236)
• Non-causative variants, total (43,236):
  - 50% simulated pathogenic disease-non-causing variants
  - 50% ClinVar benign variants
PVP Features
• Each variant is represented by 60 features:
  • 54 binary high-level phenotypes from HP and MP
  • Disease inheritance mode associated with variant (Dominant, Recessive, X-linked, and Others)
  • CADD, DANN, and GWAVA scores
  • Phenotype similarity score
  • Genotype (homozygote or heterozygote)
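To make the feature count concrete, here is a minimal sketch of how such a 60-element vector could be assembled (54 + 1 + 3 + 1 + 1). The placeholder phenotype names, the integer coding of the inheritance mode, and the `build_features` helper are assumptions made for illustration and are not taken from the PVP source code. The score values in the example are those of the top-ranked variant in the demo output later in this talk.

```python
# Placeholder names for the 54 high-level phenotype classes (hypothetical).
HIGH_LEVEL_PHENOTYPES = [f"HLP_{i:02d}" for i in range(54)]
# Inheritance modes as listed on the slide; the integer coding is an assumption.
INHERITANCE_MODES = ["dominant", "recessive", "x-linked", "others"]


def build_features(hlp_present, inheritance, cadd, dann, gwava, sim_score, homozygous):
    """Return a flat list of 60 numeric features for one variant."""
    # 54 binary indicators, one per high-level phenotype class.
    hlp_bits = [1.0 if name in hlp_present else 0.0 for name in HIGH_LEVEL_PHENOTYPES]
    # One categorical feature for the inheritance mode.
    mode_code = float(INHERITANCE_MODES.index(inheritance))
    # One binary feature for the genotype.
    genotype = 1.0 if homozygous else 0.0
    return hlp_bits + [mode_code, cadd, dann, gwava, sim_score, genotype]


features = build_features(
    hlp_present={"HLP_03", "HLP_17"},   # arbitrary placeholder phenotypes
    inheritance="dominant",
    cadd=35.0,
    dann=0.999526,
    gwava=0.463333,
    sim_score=0.975058,
    homozygous=False,
)
print(len(features))
```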
PVP Pipeline
Offline Training Phase
1. Preprocessing
   • Split variants (multiple alleles, multiple disease)
   • Design training set
2. Variants Annotation
   • Pathogenicity scores (CADD, DANN, GWAVA)
   • Phenotype similarity score of variant-associated gene to OMIM ID
   • High level phenotypes from HP and MP
   • OMIM ID inheritance mode
3. Feature Extraction
   • 60 features (pathogenicity scores + phenotype score + GT + HLP + inheritance mode)
4. RF
   • Random forest classifier trained by WEKA
Output: trained model
Prediction Phase
Input: variants + phenotypes or OMIM ID + inheritance mode
1. Preprocessing
   • Remove CNVs
   • Remove variants with missing allele in the GT field
2. Variants Annotation
   • Pathogenicity scores (CADD, DANN, GWAVA)
   • Phenotype similarity score of variant-associated gene to OMIM ID/phenotypes
   • High level phenotypes from HP and MP
   • Inheritance mode
3. Feature Extraction
   • 60 features (pathogenicity scores + phenotype score + GT + HLP + inheritance mode)
4. Classification
   • Apply trained model to get prediction scores for each variant
   • Rank variants by sorting them in descending order by prediction score
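The final ranking step is a plain descending sort on the prediction score. A minimal sketch follows, where `rank_variants` is a hypothetical helper and the three gene/score pairs are borrowed from the demo output later in this talk.

```python
def rank_variants(scored):
    """Sort (label, prediction_score) pairs by score, highest first, and number them."""
    ordered = sorted(scored, key=lambda item: item[1], reverse=True)
    return [(rank, label, score) for rank, (label, score) in enumerate(ordered, start=1)]


# Deliberately unsorted input.
scored = [("PLEK", 0.789026), ("VWF", 0.986995), ("BCLAF1", 0.840971)]
ranking = rank_variants(scored)
for rank, label, score in ranking:
    print(rank, label, score)
```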
PVP Evaluation
• Cross validation results on fully trained model

  Precision   Recall   F-measure   ROC AUC
  0.894       0.893    0.893       0.963

• Benchmarking:
  • Train model on 80% of the data
  • Create synthetic exomes with the 20% holdout set (8746 exomes) using 1000G Project WES data
  • Compare performance with Exomiser, CADD, and DANN
PVP Evaluation
Method     Top hit (%)   Top 10 hits (%)
Exomiser   24.65         58.56
CADD       15.15         32.05
DANN       6.06          26.69
Phevor     18.02         47.82
PVP        45.82         72.64
PVP Evaluation by Inheritance mode
[Bar chart: PVP performance by inheritance mode (AD, AR, Others/Unknown); bar values shown: 71.40%, 40.68%, 25.94%]
Analysis of UK10K data
• 19 exomes from UK10K_RARE_THYROID study samples (EGA study ID: EGAS00001000131) with mutations confirmed by Sanger sequencing.
• Includes patients with:
  • Congenital Hypothyroidism (CH)
  • Resistance to Thyroid hormone (RTH)
• Shared phenotypes used:
  • HP:0000821 Hypothyroidism
  • HP:0000851 Congenital hypothyroidism
  • HP:0002925 Thyroid-stimulating hormone excess
  • HP:0005990 Thyroid hypoplasia
  • HP:0011791 Inactivating thyroid-stimulating hormone receptor (TSHR) defect
• PVP was run on all 19 exomes.
Analysis of UK10K data
• Success rate with top hit: 52.6%!

ID                 Gene (c.DNA, zygosity)                 PVP Rank   Phevor Rank
UK10K_THY5329055   TG c.1583C>A (hom)                     1          209
UK10K_THY5329056   TG c.1583C>A (hom)                     1          103
UK10K_THY5329059   TG c.2177G>A (hom), c.3149G>T (hom)    1,2        223
UK10K_THY5329060   TG c.2177G>A (hom), c.3149G>T (hom)    1,2        209
UK10K_THY5370898   TG c.638+5G>A (hom)                    8031       92
UK10K_THY5329047   TG c.638+5G>A (hom)                    5616       103
UK10K_THY5329053   TG c.4478G>A (hom)                     1          125
UK10K_THY5329054   TG c.4478G>A (hom)                     1          127
UK10K_THY5329061   TG c.8054G>T (hom)                     1          74
UK10K_THY5329062   TG c.8054G>T (hom)                     1          220
UK10K_THY5236178   TG c.5071T>C (het), c.7640T>A (het)    94,13      116
UK10K_THY5236179   TG c.5071T>C (het)                     91         298
UK10K_THY5236180   TG c.5071T>C (het)                     129        67
UK10K_THY5236181   TG c.7640T>A (het)                     11         306
UK10K_THY5329044   TG c.2311C>T (het)                     1          209
UK10K_THY5329045   TG c.2311C>T (het)                     1          91
UK10K_THY5329046   TSHR c.1637G>A (het)                   5          9
UK10K_THY5068932   TG c.3433+3_3433+6delGAGT (het)        43         246
UK10K_THY5068934   TG c.3433+3_3433+6delGAGT (het)        41         202

Highlighted in yellow are variants reported in ClinVar as Pathogenic
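The reported top-hit success rate can be recomputed from the rank values listed above. The sketch below rests on two assumptions: that the first 19 rank entries are the PVP ranks (this reading reproduces the 52.6% figure), and that a sample with two causative variants counts as a hit if either of them reaches the cutoff. `best_rank` and `hit_rate` are hypothetical helper names.

```python
# PVP rank entries, one per exome, in the order listed above (assumed column).
pvp_ranks = [
    "1", "1", "1,2", "1,2", "8031", "5616", "1", "1", "1", "1",
    "94,13", "91", "129", "11", "1", "1", "5", "43", "41",
]


def best_rank(entry):
    """Best (lowest) rank among the comma-separated ranks of one sample."""
    return min(int(r) for r in entry.split(","))


def hit_rate(entries, k):
    """Percentage of samples whose best-ranked causative variant is within the top k."""
    hits = sum(1 for entry in entries if best_rank(entry) <= k)
    return 100.0 * hits / len(entries)


print(f"top hit: {hit_rate(pvp_ranks, 1):.1f}%")
print(f"top 10:  {hit_rate(pvp_ranks, 10):.1f}%")
```

With 19 exomes, 10 top hits give 10/19, which rounds to the 52.6% quoted on the slide.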
Sounds good, where can I get PVP?
• PVP is open source and freely available to use.
• STEP 1: Go to our GitHub page:
  https://github.com/bio-ontology-research-group/phenomenet-vp
• STEP 2: Check the software requirements:
  • At least 32 GB RAM
  • Any Unix-based operating system
  • Java 8 or above
  • At least 170 GB free disk space to accommodate the necessary databases for annotation
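Before installing, the requirements listed above can be checked with a short standard-library script. This is a rough sketch, not something shipped with PVP: `check_requirements` and `total_ram_gb` are hypothetical helpers, the RAM query only works on systems that expose the `SC_PHYS_PAGES` value, and the Java check only confirms that a `java` executable is on the PATH, not that it is version 8 or above.

```python
import os
import shutil

MIN_RAM_GB = 32          # from the requirements list
MIN_FREE_DISK_GB = 170   # from the requirements list


def total_ram_gb():
    """Physical memory in GB, or None if the platform does not expose it."""
    try:
        return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE") / 1e9
    except (ValueError, OSError, AttributeError):
        return None


def check_requirements(db_dir="."):
    """Report whether Java is on the PATH and whether RAM and disk look sufficient."""
    free_gb = shutil.disk_usage(db_dir).free / 1e9
    ram_gb = total_ram_gb()
    return {
        "java_on_path": shutil.which("java") is not None,
        "enough_disk": free_gb >= MIN_FREE_DISK_GB,
        # None means "could not determine" rather than "not enough".
        "enough_ram": None if ram_gb is None else ram_gb >= MIN_RAM_GB,
    }


report = check_requirements(".")
for name, ok in report.items():
    print(f"{name}: {ok}")
```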
How to install PVP
• STEP 3: Installation process
  1. Download the distribution file phenomenet-vp-1.0.zip
  2. Download the data files phenomenet-vp-1.0-data.zip
  3. Extract the distribution files phenomenet-vp-1.0.zip
  4. Extract the data files data.tar.gz inside the directory phenomenet-vp-1.0
  5. cd phenomenet-vp-1.0
  6. Run the command: bin/phenomenet-vp to display help and parameters.
Getting the required Databases
• STEP 4: Downloading required databases:
  1. Download the CADD database file and unzip it.
  2. Download and run the script generate.sh (requires TABIX).
  3. Copy the generated files cadd.txt.gz and cadd.txt.gz.tbi to the directory phenomenet-vp-1.0/data/db.
  4. Download the DANN database file and its index file to the directory phenomenet-vp-1.0/data/db.
  5. Rename the above two files as dann.txt.gz and dann.txt.gz.tbi respectively.
PVP Parameters
• --file, -f
  Path to VCF file
• --inh, -i
  Mode of inheritance (dominant, recessive, x-linked, others, or unknown). Default: unknown
• --omim, -o
  OMIM ID of the input VCF file
• --phenotypes, -p
  List of phenotype IDs separated by commas (HPO or MPO terms)
• --all, -a
  Keep all variants for analysis (i.e. do not filter variants based on their annotation type as coding or noncoding variants). Default: false
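When PVP is called from a script, the parameters above can be assembled into an argument list. The sketch below only builds the command and does not run it. `build_pvp_command` is a hypothetical wrapper, not part of PVP, and it uses only the flags listed on this slide.

```python
def build_pvp_command(vcf_path, omim=None, phenotypes=None,
                      inheritance="unknown", keep_all=False,
                      executable="bin/phenomenet-vp"):
    """Build the argument list for a PVP run from the documented parameters."""
    cmd = [executable, "--file", vcf_path, "--inh", inheritance]
    if omim:
        cmd += ["--omim", omim]
    if phenotypes:
        # Phenotype IDs are passed as a single comma-separated value.
        cmd += ["--phenotypes", ",".join(phenotypes)]
    if keep_all:
        cmd.append("--all")
    return cmd


cmd = build_pvp_command("vwd.vcf", omim="OMIM:193400", keep_all=True)
print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` from inside the phenomenet-vp-1.0 directory.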
Tips and Tricks
• Analysis of rare variants:
  • Preprocess the VCF file by filtering out variants with MAF > 1%
  • Can be done using VCFtools:
    vcftools --vcf input_file.vcf --recode --max-maf 0.01 --out filtered
  • Run PVP on the output file filtered.recode.vcf
• Synthetic exomes
  • The synthetic exomes generated for evaluation are available here:
    http://www.cbrc.kaust.edu.sa/onto/pvp/synthetic_exomes/
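For readers without VCFtools at hand, the rare-variant tip above can be approximated in pure Python. This sketch is not equivalent to the vcftools command: it assumes each record carries an allele frequency in an `AF=` entry of the INFO column (an assumption about the input file), and it keeps records without that entry on the view that unannotated variants are novel. `allele_frequencies` and `keep_rare` are hypothetical helpers.

```python
def allele_frequencies(info_field):
    """Return the AF values from a VCF INFO string, or None if absent or unparseable."""
    for entry in info_field.split(";"):
        if entry.startswith("AF="):
            try:
                return [float(x) for x in entry[3:].split(",")]
            except ValueError:
                return None
    return None


def keep_rare(vcf_lines, max_maf=0.01):
    """Keep header lines and records whose minor allele frequency is at most max_maf."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        freqs = allele_frequencies(fields[7]) if len(fields) > 7 else None
        # Records with no usable AF annotation are kept (assumption: treated as novel).
        if freqs is None or all(min(f, 1.0 - f) <= max_maf for f in freqs):
            kept.append(line)
    return kept


# Tiny invented VCF, for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t50\tPASS\tAF=0.002",
    "1\t200\t.\tC\tT\t50\tPASS\tAF=0.25",
    "2\t300\t.\tG\tA\t50\tPASS\tDP=30",
]
filtered = keep_rare(example)
print(len(filtered))
```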
PVP Demo
• Input file: a patient with von Willebrand disorder (PGP: hu432EB5): vwd.vcf
• Input phenotypes: OMIM:193400
• Run command inside phenomenet-vp-1.0 directory:
  bin/phenomenet-vp -f /path/to/file/vwd.vcf -o OMIM:193400 -a
• In your input file directory, you will find a file called vwd.vcf.res containing the ranked list of variants.
Chr  Start     Ref  Alt  GT   Gene      CADD  GWAVA     DANN      Sim_Score  Prediction_Score
12   6143978   C    T    0/1  VWF       35    0.463333  0.999526  0.975058   0.986995
6    1.37E+08  G    T    0/1  BCLAF1    25.4  0.58      0.992042  0.88948    0.840971
2    68607947  G    T    0/1  PLEK      23.9  0.516667  0.998591  0.93108    0.789026
11   64026639  G    A    0/1  PLCB3     26.1  0.383333  0.999368  0.891637   0.75738
8    67592152  T    C    0/1  C8orf44   23.3  0.653333  0.993587  0.920447   0.743274
10   29784072  G    C    1/1  SVIL      24.8  0.273333  0.996674  0.922038   0.699697
22   26853905  C    A    1/1  HPS4      23    0.503333  0.994371  0.834421   0.621787
22   26854441  G    A    1/1  HPS4      26.4  0.49      0.996695  0.834421   0.61149
9    35712003  G    A    0/1  TLN1      22.9  0.43      0.9966    0.93274    0.588055
3    57303684  A    G    0/1  APPL1     23.5  0.45      0.958623  0.88948    0.587411
11   18319180  G    T    0/1  HPS5      23.5  0.383333  0.994     0.862657   0.580477
1    25291010  A    T    1/1  RUNX3     22.5  0.616667  0.969569  0.892147   0.575594
11   60893235  C    T    0/1  CD5       31    0.423333  0.99887   0.892147   0.570217
14   51352345  G    A    0/1  ABHD12B   25.9  0.543333  0.997034  .          0.568456
17   60493386  A    G    0/1  EFCAB3    25.4  0.55      0.997709  .          0.568129
22   30888494  C    T    0/1  SEC14L4   26    0.56      0.999344  .          0.567916
12   52886513  G    C    0/1  KRT6A     24.9  0.566667  0.996862  .          0.567303
6    1.37E+08  C    G    0/1  BCLAF1    23    0.42      0.991787  0.88948    0.565746
19   19413092  C    T    0/1  SUGP1     35    0.566667  0.999531  .          0.565727
Acknowledgement
• Rozaimi Razali
• Maxat Kulmano
• Vladimir Bajic
• Eva Goncalves-Serra
• Nadia Schoenmakers
• Georgios V Gkouto
• Paul N Schofield
• Robert Hoehndorf