talk - Trinity University

Download Report

Transcript talk - Trinity University

A Machine Learning
Method for Prediction of
Osteoporosis-Related
Genes
JACOB M. LUBER1, CATHERINE SHARP2, KB CHOI2,
CHERYL ACKERT-BICKNELL3, MATTHEW A. HIBBS1
1DEPARTMENT
2THE
OF COMPUTER SCIENCE, TRINITY UNIVERSITY, SAN ANTONIO, TX
JACKSON LABORATORY, BAR HARBOR, ME
3UNIVERSITY
OF ROCHESTER MEDICAL CENTER, ROCHESTER, NY
Computational Biology and Functional Genomics
Osteoporosis Background
Scale of “Big Data”
Netflix
Netflix Recommendations
Viewing History
“Star” Rankings
Observe Viewing
Patterns and
Account Activity
•Compare your viewing history to
all other subscribers
Machine Learning
•Find commonalities between user
profiles
•Predict potential movies/tv shows
you might like
Recommendations
Gene Function Predictions
Genome Scale
Data
“Gold Standard”
Viewing History
“Star” Rankings
Observe Viewing
Patterns and
Account Activity
Machine
Learning
Recommendatio
ns
≈
Laboratory
Experiments
Machine
Learning
Predictions
Data

Data Comes from 661 discrete microarray
experiments from hundreds of wet labs in addition
to RNA-seq data generated by our lab

Measurements in microarrays are on a per gene
basis

Aggregation of disparate data into efficient
nested direct access hashmaps
Distance Metrics

Absolute expression levels of genes important, but…

Relationships between genes are biologically key


Proteins often act as a group

Functional “job” of a protein may change depending on the
available interaction partners
Transform data into pair-wise relationships

Distance metrics & correlation profiles
Pearson Correlation
Euclidean Distance
Spearman Correlation
Support Vector Machines
Think of a space with
many dimensions as
microarray datasets (yes,
it’s hard for more than 3).
Each gene-pair
corresponds to a point in
this space.
Genes-pairs with similar
feature profiles will be
close to each other.
Separation of Examples
We want to make a “rule”
or dividing line between
two classes of examples.
But which line is best?
Best Separator
Best separator maximizes
the margin (distance to
the closest example).
The closest examples are
called the support
vectors.
Solution is a quadratic
programming problem
How Well Did We Do?
Training Error – How
well you classified the
training examples
Test Error – How well
you classify new
examples that were not
used for training
Test error is almost
always worse than
Training error
A method generalizes if
the drop from training to
test is small
Evaluation Metrics

TPs, FPs, TNs, FNs

Agnostic to pairs not appearing in standard

ROC curves: Sensitivity-Specificity

PR curves: Precision-Recall
Precision Recall Curves
Ordered
Predictions
1
Precision
TP
TP + FP
Random performance
0
0
Recall
TP
TP + FN
1
ROC Curves
Ordered
Predictions
1
TP
TP + FN
0
0
FP
FP + TN
1
Cross Validation
Over-fitting
Results (optimal model)
More Power!

Node Running this algorithm has 60 3.2Ghz cores
running in parallel connected to a 40TB Raid SAN
with a half TB of RAM.

Size of python list of gene pairs without any
associated data is 30GB

Each core can process sections of data by storing
gene pairs in a threadsafe iterator and each
utilizes ~6 GB ram

Bash >> is threadsafe*
Graphs!
Node for each gene
Edges between all pairs
Graph search algorithms
allow us to find pathways
Edge weight is probability of
two genes being related to
Osteoporosis
Functional
Protein
Association
Network
New Gene Pathways
Whats Next?

International Conference on Intelligent Biology
and Medicine

Gene Predictions will be tested in culture this
winter at the University of Rochester Medical
Center

“Algorithmic Pipeline” has other machine learning
applications; Harvard University Multi-text Project

Case based reasoning from Behavioral
Economics and Game Theory for certain
Biomedical applications

Algorithms for complex feature eliminations
(Exabyte feature space) utilizing heuristics and
greedy search algorithms