17: Poster_U - Computer Science
Identification of Cancer-Causing Mutations in the
Human Genome with Machine Learning Techniques
U, Man Chon (Kevin)
Computer Science Department
The University of Georgia
[email protected]
Introduction
Cancer is a leading cause of death worldwide, and
the total number of cases globally is increasing. The
number of global cancer deaths is projected to
increase by 45% from 2007 to 2030 (from 7.9 million
to 11.5 million deaths). In most developed countries,
cancer is the second largest cause of death after
cardiovascular disease. Therefore, research to
improve our understanding of the causes of cancer
and its most promising therapies is urgently needed.
Discussion
Our experimental results demonstrate that machine
learning techniques can identify cancer-causing
mutations in the human genome with high accuracy.
The small variance in accuracy across the different
machine learning algorithms suggests that our new
features play a significant role in the identification
process. Furthermore, when we limited our
experiments to the Kinase domain, classification
accuracy reached 90.1757%, and when we also
supplied the list of experimentally confirmed drivers
(cancer-causing mutations), we identified the
mutations with 98.5490% accuracy without any
attribute selection or instance selection methods.
Figure 1. Single Nucleotide Polymorphisms
Background
• Single nucleotide polymorphisms (SNPs): DNA
sequence variations that occur when a single
nucleotide (A, T, C, or G) in the genome sequence is
altered, as Figure 1 illustrates.
• Mutations: Driver mutations are responsible for
oncogenicity. Passenger mutations are harmless
mutations.
• Hypothesis: Subsets of the non-synonymous
SNPs (nsSNPs) will help identify the multiple genes
associated with complex ailments such as cancer.
• Motivation: Finding those nsSNPs is extremely
expensive and time-consuming.
• Solution: The aim of this research is to use
machine learning techniques to identify probable
cancer-causing nsSNPs.
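To make the SNP and nsSNP terms above concrete, here is a minimal Python sketch that checks whether a single-base substitution in a codon changes the encoded amino acid. It assumes the standard genetic code and codons written on the DNA coding strand; the names `CODON_TABLE` and `classify_snp` are hypothetical and are not taken from this project, and the sketch says nothing about the features or classifiers actually used here.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order
# (first base varies slowest, third base fastest). '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snp(codon, position, new_base):
    """Classify a single-base substitution inside one codon.

    codon: three-letter DNA codon on the coding strand, e.g. "GAG".
    position: 0-based index (0, 1, or 2) of the altered base.
    new_base: the replacement base (A, T, C, or G).
    Returns (mutated_codon, amino_acid_before, amino_acid_after, kind),
    where kind is "synonymous" or "non-synonymous".
    """
    codon = codon.upper()
    new_base = new_base.upper()
    if codon not in CODON_TABLE:
        raise ValueError(f"not a valid DNA codon: {codon!r}")
    if position not in (0, 1, 2):
        raise ValueError("position must be 0, 1, or 2")
    if new_base not in BASES:
        raise ValueError(f"not a valid DNA base: {new_base!r}")
    if codon[position] == new_base:
        raise ValueError("substitution does not change the base")

    mutated = codon[:position] + new_base + codon[position + 1:]
    before = CODON_TABLE[codon]
    after = CODON_TABLE[mutated]
    kind = "synonymous" if before == after else "non-synonymous"
    return mutated, before, after, kind


if __name__ == "__main__":
    print(classify_snp("GAA", 2, "G"))  # third-base change, same amino acid
    print(classify_snp("GAG", 1, "T"))  # second-base change, different amino acid
```

Only the non-synonymous case alters the protein sequence, which is why the hypothesis above focuses on nsSNPs.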
Ultimate Goal
To identify suspicious mutations that we can assert
with a high degree of certainty to be driver mutations
and build a sophisticated model for this process.
[Figure 2 (image not reproduced). Y-axis: 0.00% to 100.00%. Legend: Exp. 1, not limited to Kinase Domain; Exp. 2, limited to Kinase Domain; Exp. 3, limited to Kinase Domain & Experimentally Confirmed Drivers List.]
Figure 2. Visualization of Classification Results
Contributions
• Applied different machine learning techniques to
identify cancer-causing mutations.
• New features are introduced.
• Provide evidence to biologists for inventing new
therapies for cancer treatment.
Table I. Classification Results
Algorithms         Exp. 1      Exp. 2      Exp. 3
J48 (Tree)         86.8474%    90.1757%    98.5490%
Random Forest      83.4522%    85.4633%    95.8888%
Best First Tree    85.1047%    89.5367%    97.3398%
Functional Tree    83.0243%    87.3003%    97.5816%
Decision Table     83.1098%    85.7029%    92.6239%
DTNB               85.5920%    89.2971%    97.8235%
LWL(J48+KNN)       86.5906%    90.1757%    98.5490%
Bayes Net          81.9686%    86.9808%    96.0097%
Naïve Bayes        80.2568%    84.8243%    93.8331%
SVM                83.8516%    88.9776%    97.0979%
Neural Network     76.0057%    81.9489%    97.5816%
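The accuracies in Table I can be summarized per experiment to see how much the algorithms differ from one another. The Python sketch below transcribes the table values and reports the best and worst accuracy, the spread between them, and the mean for each experiment. The names `TABLE_I`, `EXPERIMENTS`, and `summarize` are illustrative only; this is a reading aid for the table, not part of the original analysis.

```python
import statistics

# Accuracies (%) transcribed from Table I; tuple order is Exp. 1, Exp. 2, Exp. 3.
TABLE_I = {
    "J48 (Tree)": (86.8474, 90.1757, 98.5490),
    "Random Forest": (83.4522, 85.4633, 95.8888),
    "Best First Tree": (85.1047, 89.5367, 97.3398),
    "Functional Tree": (83.0243, 87.3003, 97.5816),
    "Decision Table": (83.1098, 85.7029, 92.6239),
    "DTNB": (85.5920, 89.2971, 97.8235),
    "LWL(J48+KNN)": (86.5906, 90.1757, 98.5490),
    "Bayes Net": (81.9686, 86.9808, 96.0097),
    "Naïve Bayes": (80.2568, 84.8243, 93.8331),
    "SVM": (83.8516, 88.9776, 97.0979),
    "Neural Network": (76.0057, 81.9489, 97.5816),
}
EXPERIMENTS = ("Exp. 1", "Exp. 2", "Exp. 3")


def summarize(table):
    """Return best, worst, spread, and mean accuracy for each experiment."""
    summary = {}
    for column, experiment in enumerate(EXPERIMENTS):
        scores = {algorithm: row[column] for algorithm, row in table.items()}
        best = max(scores.values())
        worst = min(scores.values())
        summary[experiment] = {
            "best": best,
            # Several algorithms can tie for the best accuracy.
            "best_algorithms": [a for a, s in scores.items() if s == best],
            "worst": worst,
            "worst_algorithms": [a for a, s in scores.items() if s == worst],
            "spread": round(best - worst, 4),
            "mean": round(statistics.mean(scores.values()), 4),
        }
    return summary


if __name__ == "__main__":
    for experiment, stats in summarize(TABLE_I).items():
        print(experiment, stats)
```

The spread (best minus worst) is only one simple measure of the variation between algorithms; a standard deviation could be computed from the same dictionary.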
Acknowledgments
I would like to thank Dr. Khaled Rasheed and Dr.
Natarajan Kannan for their guidance in this
project. I would also like to thank Eric Talevich for
help in collecting the data.