Predicting Deleterious Mutations
Capstone Project Presentation
Predicting Deleterious Mutations
Young SP, Radivojac P, Mooney SD
Friday 17th December 2004
Stuart Young
Predicting Deleterious Mutations
Deleterious
“Hurtful or injurious to life or health;
noxious”
(Oxford English Dictionary)
“Tis pity wine should be so deleterious, For tea
and coffee leave us much more serious.”
(BYRON Juan IV, 1821)
SNPs
What is an SNP (single nucleotide
polymorphism)?
Why are SNPs important?
Some SNPs are nonsynonymous
The molecular effects of SNPs vary
widely
MOTIVATION
Improve on the existing deleterious
prediction methods
Use protein sequence, evolution and
structure data combined with machine
learning to identify potentially disease-causing SNPs
SNP data is increasingly available
Over 40 major online databases
dbSNP is the primary SNP database
(contains 5,000,000+ validated human SNPs)
Many databases contain potentially disease-causing SNPs related to a particular disease
Deleterious effects of mutations on
proteins
Function
Stability
Expression
Protein-Protein Interactions
Current Classification Tools
Sequence Approaches
BLOSUM62
An amino acid substitution score matrix
SIFT
Collects sequence homologues in multiple
alignments and identifies non-conservative
changes in amino acids
Ng P & Henikoff S, 'Predicting Deleterious Amino Acid
Substitutions'. Genome Research, 2001, 11:863-874.
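The alignment-based idea can be illustrated with a minimal sketch. This is not the SIFT algorithm itself; it only shows how residue frequencies in one alignment column could flag a non-conservative change. The function names, the toy sequences, and the 0.05 cutoff are all assumptions made for illustration.

```python
from collections import Counter


def column_frequencies(alignment, position):
    """Relative frequency of each residue in one alignment column.

    `alignment` is a list of equal-length aligned sequences and
    `position` is a 0-based column index. Gap characters ('-') are ignored.
    """
    column = [seq[position] for seq in alignment if seq[position] != "-"]
    total = len(column)
    if total == 0:
        return {}
    return {aa: n / total for aa, n in Counter(column).items()}


def is_tolerated(alignment, position, mutant, threshold=0.05):
    """Treat a substitution as tolerated if the mutant residue appears in
    the column at or above `threshold` (an illustrative cutoff)."""
    freqs = column_frequencies(alignment, position)
    return freqs.get(mutant, 0.0) >= threshold


# Toy alignment of hypothetical homologues (made-up sequences).
toy_alignment = [
    "MKVLA",
    "MKILA",
    "MRVLA",
    "MKVLG",
]

# Column 2 contains V, I, V, V: isoleucine is observed, tryptophan is not.
print(is_tolerated(toy_alignment, 2, "I"))
print(is_tolerated(toy_alignment, 2, "W"))
```

A real predictor would also weight sequences and use pseudocounts so that residues unseen in a small alignment are not automatically rejected.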
Current Classification Tools
Structural Approaches
Expert rules
Uses evolutionary and structural data
Sunyaev et al, 'Prediction of deleterious human alleles'. Human
Molecular Genetics, 2001, Vol. 10, No. 6, 593.
Decision Trees
Improved performance based on sequence and
structural data
Produces intuitive rules
Our foundation for the project
Saunders CT & Baker D
‘Evaluation of Structural and Evolutionary
Contributions to Deleterious Mutation
Prediction’
J. Mol. Biol. (2002) 322, 891–901
Structural and evolutionary features
Trained classifiers on two data sets:
experimental mutations and human alleles
S&B Training Sets
Experimental mutations (~5,000)
HIV-1 protease
E. coli Lac repressor
T4 Lysozyme
Human alleles (~350 mutations)
103 ‘hot’ human genes
Why two training sets?
Unbiased human data is hard to get:
Many disease-associated mutations are
discovered through genetic association studies
and may not be causative (i.e., only linked with
the causative allele)
Effect of mutations is hard to measure
Experimental ‘whole gene mutagenesis’
data is considered ‘unbiased’
Features used in S&B Study
SIFT
SIFT + Solvent Accessibility (SA)
SIFT + normalized B-factor
SIFT + Sunyaev expert rules
SIFT + SA + B-factor
Hypothesis
Can we improve on the results
of Saunders and Baker by using
more structural and sequence
properties?
Experimental Design
Classification algorithm
Decision Trees
Support Vector Machines
Neural Nets
Additional Features
Amino acid relative frequencies
Additional structural properties
Structural Property Values
Russ Altman (Stanford) developed a vector
representation of protein structural sites
Spheres (1.875Å → 7.5Å) centered on the C-alpha atom of the mutation position
66 features
Atom/residue counts within sphere and
other features, e.g.:
Solubility
Solvent accessibility
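The sphere-based atom counts can be sketched roughly as follows. This is an illustration, not the S-BLEST implementation: the four evenly spaced shell radii between 1.875 Å and 7.5 Å are an assumption, as are the function name and the toy coordinates.

```python
import math

# Assumed shell boundaries: four evenly spaced radii, in angstroms.
SHELL_RADII = [1.875, 3.75, 5.625, 7.5]


def shell_atom_counts(center, atoms, radii=SHELL_RADII):
    """Count atoms in each concentric shell around `center`.

    `center` is the (x, y, z) position of the C-alpha atom at the mutation
    site; `atoms` is a list of (x, y, z) positions of neighbouring atoms.
    Each atom is assigned to the innermost shell that contains it; atoms
    beyond the largest radius are not counted.
    """
    counts = [0] * len(radii)
    for atom in atoms:
        distance = math.dist(center, atom)
        for i, radius in enumerate(radii):
            if distance <= radius:
                counts[i] += 1
                break
    return counts


# Toy coordinates (made up): one atom per shell plus one outside the sphere.
site = (0.0, 0.0, 0.0)
neighbours = [
    (1.0, 0.0, 0.0),
    (0.0, 3.0, 0.0),
    (0.0, 0.0, 5.0),
    (7.0, 0.0, 0.0),
    (10.0, 0.0, 0.0),
]
print(shell_atom_counts(site, neighbours))
```

In practice the counts would be kept separately per atom or residue type, which is one way a site vector could reach dozens of features.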
Amino Acid Windows
AA frequencies within a window
on either side of the mutation
position
20 AAs = 20 features
LEFT and RIGHT → 40 features
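A minimal sketch of the window features, assuming the mutated residue itself is excluded and that windows are simply truncated at the ends of the sequence. The function name, the window width, and the example sequence are illustrative choices, not taken from the project.

```python
# The 20 standard amino acids, in a fixed order so feature positions are stable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def window_features(sequence, position, width):
    """Return 40 features: amino acid frequencies in the left window followed
    by frequencies in the right window around `position` (0-based)."""
    left = sequence[max(0, position - width):position]
    right = sequence[position + 1:position + 1 + width]
    features = []
    for window in (left, right):
        length = len(window)
        for aa in AMINO_ACIDS:
            # An empty window (mutation at a sequence end) yields all zeros.
            features.append(window.count(aa) / length if length else 0.0)
    return features


# Made-up sequence; the mutation position is the F at index 4.
example = window_features("ACDEFGHIK", position=4, width=2)
print(len(example))
```

Here the left window is "DE" and the right window is "GH", so each of those four residues contributes a frequency of 0.5 in its own half of the vector.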
Tools
Databases
PDB - Protein structure data
S-BLEST - Structural features
Software
Perl 5.8.0
Matlab (NN, PRTools(DT), SVC)
List of Features Used
BLOSUM62, disorder, secondary
structure, molecular weight
Grouped amino acid frequency windows
of varying widths
SIFT
S-BLEST (vector contains four sub-shells
spreading outward from site)
Solvent accessibility (C-beta density, i.e.,
the number of C-beta atoms around the
site)
Comparison with S&B Results
1. Human Data Set
Human allele dataset as train and test set
Ensembles of decision trees for classification
20-fold cross validation
Progressively added features to see their effect
on performance
Because structural data was not available for all
mutation sites, we used a subset of the original
Saunders and Baker training set
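The 20-fold cross-validation split can be sketched as below. This shows only the splitting step, not the decision tree ensembles; the function name, the fixed random seed, and the sample count of 100 are assumptions for illustration.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Indices are shuffled once, then dealt into k folds; each fold serves
    as the test set exactly once while the remaining folds form the
    training set.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Illustrative sizes: 100 mutations split into 20 folds of 5.
splits = list(k_fold_indices(100, 20))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each round would train a classifier on the `train` indices and score it on the `test` indices, and the reported accuracy would be the average over the 20 rounds.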
Best Features
2. Experimental Data Set
Same as human data set but using experimental
mutations for training and testing
Evaluation of S-BLEST Using a
Random Subset of the Experimental
Training Set
3. Cross-classification
Used the same features described above
Trained on one dataset and tested on the other:
Human to experimental
Experimental to human
Experimental gene to exp. gene
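Cross-classification can be illustrated with a toy sketch in which a one-feature threshold rule (a depth-one decision tree, standing in for the ensembles used in the project) is fitted on one data set and scored on the other. The scores and labels below are synthetic, and the function names are hypothetical.

```python
def train_stump(samples):
    """Fit a threshold rule: predict deleterious when score <= threshold.

    `samples` is a list of (score, is_deleterious) pairs. The threshold is
    chosen from the observed scores to maximise training accuracy.
    """
    best_threshold, best_correct = None, -1
    for threshold in sorted({score for score, _ in samples}):
        correct = sum((score <= threshold) == label for score, label in samples)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def accuracy(threshold, samples):
    """Fraction of samples classified correctly by the threshold rule."""
    correct = sum((score <= threshold) == label for score, label in samples)
    return correct / len(samples)


# Synthetic stand-ins for the two data sets: (score, is_deleterious).
set_a = [(0.01, True), (0.02, True), (0.30, False), (0.50, False)]
set_b = [(0.03, True), (0.04, False), (0.60, False), (0.02, True)]

# Train on one set, test on the other, in both directions.
a_to_b = accuracy(train_stump(set_a), set_b)
b_to_a = accuracy(train_stump(set_b), set_a)
print(a_to_b, b_to_a)
```

A gap between within-set accuracy and cross-set accuracy is one sign that a classifier has fitted peculiarities of its training set.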
Summary of Results
Human data set
80% accuracy (up from 70%)
Experimental data set
87% accuracy (up from 79.5%)
Conclusion
Prediction tools CAN identify
deleterious mutations
We believe that further study is
warranted to identify over-fitted
classifiers and so further improve
classification accuracy on real-world
data
Acknowledgements
People
Andrew Campen (CCBB IT, IUPUI)
Brandon Peters (CCBB, IUPUI)
Haixu Tang (Capstone Coordinator, IUB)
Funding
This work was funded by a grant from the Showalter Trust (Sean
Mooney, PI), INGEN, and an IUPUI McNair Scholarship. The
Indiana Genomics Initiative (INGEN) of Indiana University is
supported in part by Lilly Endowment Inc.
Thank You