Predicting Deleterious Mutations


Capstone Project Presentation
Predicting Deleterious Mutations
Young SP, Radivojac P, Mooney SD
Friday 17th December 2004
Stuart Young
Deleterious
“Hurtful or injurious to life or health; noxious”
(Oxford English Dictionary)
“’Tis pity wine should be so deleterious, / For tea and coffee leave us much more serious.”
(Byron, Don Juan IV, 1821)
SNPs
What is an SNP (single nucleotide polymorphism)?
Why are SNPs important?
Some SNPs are nonsynonymous
The molecular effects of SNPs vary widely

MOTIVATION
Improve on the existing methods for predicting deleterious mutations
Use protein sequence, evolution and structure data combined with machine learning to identify potentially disease-causing SNPs

SNP data is increasingly available
Over 40 major online databases
dbSNP is the primary SNP database (contains 5,000,000+ validated human SNPs)
Many databases contain potentially disease-causing SNPs related to a particular disease

Deleterious effects of mutations on proteins
Function
Stability
Expression
Protein-protein interactions

Current Classification Tools
Sequence Approaches

BLOSUM62
An amino acid substitution score matrix

SIFT
Collects sequence homologues in multiple alignments and identifies non-conservative changes in amino acids
Ng P & Henikoff S, ‘Predicting Deleterious Amino Acid Substitutions’. Genome Research, 2001, 11:863-874.
Current Classification Tools
Structural Approaches

Expert rules
Uses evolutionary and structural data
Sunyaev et al, ‘Prediction of deleterious human alleles’. Human Molecular Genetics, 2001, Vol. 10, No. 6, 593.

Decision Trees
Improved performance based on sequence and structural data
Produces intuitive rules
Our foundation for the project
Saunders CT & Baker D, ‘Evaluation of Structural and Evolutionary Contributions to Deleterious Mutation Prediction’. J. Mol. Biol. (2002) 322, 891–901
Structural and evolutionary features
Trained classifiers based on two data sets: experimental mutations and human alleles

S & B - Training Sets

Experimental mutations (~5,000)
HIV-1 protease
E. coli Lac repressor
T4 lysozyme

Human alleles (~350 mutations)
103 ‘hot’ human genes
Why two training sets?

Unbiased human data is hard to get:
Many disease-associated mutations are discovered through genetic association studies and may not be causative (i.e., only linked with the causative allele)
Effect of mutations is hard to measure

Experimental ‘whole gene mutagenesis’ data is considered ‘unbiased’

Features used in S&B Study
SIFT
SIFT + Solvent Accessibility (SA)
SIFT + normalized B-factor
SIFT + Sunyaev expert rules
SIFT + SA + B-factor

Hypothesis
Can we improve on the results of Saunders and Baker by using more structural and sequence properties?
Experimental Design

Classification algorithms
Decision trees
Support vector machines
Neural networks

Additional features
Amino acid relative frequencies
Additional structural properties

Structural Property Values
Russ Altman (Stanford) developed a vector representation of protein structural sites
Spheres (1.875 Å → 7.5 Å) centered on the C-alpha atom of the mutation position
66 features
Atom/residue counts within sphere and other features, e.g.:
Solubility
Solvent accessibility
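
The counting step behind such a site vector can be illustrated with a small sketch. It is not the S-BLEST implementation: the four equal-width shell boundaries (1.875, 3.75, 5.625 and 7.5 Å), the function name `shell_counts` and the toy coordinates are all assumptions made for illustration, and the real vector holds many more properties than raw atom counts.

```python
import math

# Assumed shell boundaries in angstroms: four equal-width shells out to 7.5 A.
SHELL_EDGES = [0.0, 1.875, 3.75, 5.625, 7.5]


def shell_counts(center, atoms, edges=SHELL_EDGES):
    """Count atoms in each concentric shell around `center`.

    `center` and every entry of `atoms` are (x, y, z) tuples.
    Shell i covers distances d with edges[i] < d <= edges[i + 1],
    so the central atom itself (d == 0) is never counted.
    """
    counts = [0] * (len(edges) - 1)
    for atom in atoms:
        d = math.dist(center, atom)
        for i in range(len(counts)):
            if edges[i] < d <= edges[i + 1]:
                counts[i] += 1
                break
    return counts


# Hypothetical coordinates: C-alpha at the origin plus five neighbouring atoms.
c_alpha = (0.0, 0.0, 0.0)
neighbours = [(1.5, 0.0, 0.0), (0.0, 3.0, 0.0), (0.0, 0.0, 5.0),
              (4.0, 4.0, 4.0), (9.0, 0.0, 0.0)]
print(shell_counts(c_alpha, neighbours))
```

The last neighbour lies 9 Å from the centre, beyond the outermost shell, so it contributes to no count.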

Amino Acid Windows
AA frequencies within a window on either side of the mutation position
20 AAs = 20 features
LEFT and RIGHT → 40 features
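
A minimal sketch of how the 40 window features could be computed is given here. The function name `window_frequencies`, the 0-based position convention, the fixed window width and the toy sequence are assumptions; the slides do not state the widths used.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def window_frequencies(sequence, position, width):
    """Return 40 features for the residue at 0-based `position`.

    The first 20 values are relative amino acid frequencies in the window
    of up to `width` residues to the left of the position; the last 20 are
    the same for the window to the right. Windows are truncated at the
    ends of the sequence, and an empty window yields all zeros.
    """
    left = sequence[max(0, position - width):position]
    right = sequence[position + 1:position + 1 + width]
    features = []
    for window in (left, right):
        n = len(window)
        for aa in AMINO_ACIDS:
            features.append(window.count(aa) / n if n else 0.0)
    return features


# Hypothetical example: mutation at the K in a short toy sequence.
features = window_frequencies("AAGKLMCC", position=3, width=3)
print(len(features))
```

Frequencies are relative to the actual window length, so positions near the start or end of a protein are not penalised for having shorter windows. That choice is also an assumption.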
Tools

Databases
PDB - Protein structure data
S-BLEST - Structural features

Software
Perl 5.8.0
MATLAB (NN, PRTools (DT), SVC)

List of Features Used
BLOSUM62, disorder, secondary structure, molecular weight
Grouped amino acid frequency windows of varying widths
SIFT
S-BLEST (vector contains four sub-shells spreading outward from site)
Solvent accessibility (C-beta density, i.e., the number of C-beta atoms around the site)

Comparison with S&B Results
1. Human Data Set
Human allele dataset used as both training and test set
Ensembles of decision trees for classification
20-fold cross validation
Progressively added features to see their effect on performance
Because structural data was not available for all mutation sites, we used a subset of the original Saunders and Baker training set
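
The 20-fold splitting step can be sketched as follows. This covers only the partitioning of samples into folds, not the decision-tree ensembles, which were built in MATLAB. The function name `k_fold_indices`, the fixed seed and the sample count of 350 are illustrative assumptions; the subset actually used was smaller than the full human allele set.

```python
import random


def k_fold_indices(n_samples, k=20, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Sample indices are shuffled once, then dealt round-robin into k folds
    so that fold sizes differ by at most one. Each fold serves as the test
    set exactly once while the remaining folds form the training set.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Illustrative sample count only.
for train, test in k_fold_indices(350, k=20):
    pass  # fit the classifier on `train`, then score it on `test`
```

Averaging the per-fold test accuracies gives a single cross-validated estimate, which is the kind of figure that can be compared as features are added one at a time.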

Best Features
2. Experimental Data Set
Same as human data set but using experimental mutations for training and testing
Evaluation of S-BLEST Using a Random Subset of the Experimental Training Set
3. Cross-classification
Used the same features described above

Trained on one dataset and tested on the other:
Human to experimental
Experimental to human
Experimental gene to experimental gene
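
The train-on-one, test-on-the-other protocol can be sketched with a deliberately tiny stand-in classifier. A single-feature threshold rule (a one-level decision tree) replaces the ensembles used in the project, and the scores and labels are invented toy values, not project data. All function names are hypothetical.

```python
def predict(threshold, score):
    """Predict 1 (deleterious) when the score reaches the threshold."""
    return 1 if score >= threshold else 0


def accuracy(threshold, samples):
    """Fraction of (score, label) pairs classified correctly."""
    correct = sum(1 for score, label in samples
                  if predict(threshold, score) == label)
    return correct / len(samples)


def train_stump(samples):
    """Pick the observed score that maximises training accuracy."""
    candidates = sorted({score for score, _ in samples})
    return max(candidates, key=lambda t: accuracy(t, samples))


# Invented toy data: (feature score, label), label 1 = deleterious.
human = [(0.1, 0), (0.3, 0), (0.6, 1), (0.9, 1)]
experimental = [(0.2, 0), (0.5, 0), (0.7, 1), (0.8, 1)]

for train_name, train_set, test_name, test_set in [
    ("human", human, "experimental", experimental),
    ("experimental", experimental, "human", human),
]:
    threshold = train_stump(train_set)
    print(train_name, "->", test_name, accuracy(threshold, test_set))
```

Even on toy data the two directions need not give the same accuracy, because a threshold fitted to one data set can misplace borderline cases in the other.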

Summary of Results

Human data set
80% accuracy (up from 70%)

Experimental data set
87% accuracy (up from 79.5%)
Conclusion
Prediction tools CAN identify deleterious mutations
We believe that further study is warranted to identify over-fitted classifiers and thereby improve classification accuracy on real-world data

Acknowledgements
People
Andrew Campen (CCBB IT, IUPUI)
Brandon Peters (CCBB, IUPUI)
Haixu Tang (Capstone Coordinator, IUB)
Funding
This work was funded by a grant from the Showalter Trust (Sean Mooney, PI), INGEN, and an IUPUI McNair Scholarship. The Indiana Genomics Initiative (INGEN) of Indiana University is supported in part by Lilly Endowment Inc.
Thank You