Data to Biology
Shankar Subramaniam
University of California at San Diego
“BIOLOGY DRIVEN” CHALLENGES FOR THE STC CS
RESEARCHERS
• KNOWLEDGE EXTRACTION FROM DATA
– DEALING WITH THE COFFEE DRINKERS PROBLEM
– HOW CAN BIOLOGICAL DATA BE INTEGRATED?
– DEFINING THE GRANULARITY OF DATA
– UNBIASED STATISTICAL METHODS
– BIOLOGY-CONSTRAINED METHODS
– INFORMATION METRICS
– HOW DO WE DEAL WITH CONTEXT?
• NOISY DATA
– CAN WE DEFINE HOW MUCH NOISE AND WHAT TYPE OF NOISE CAN BE TOLERATED IN EXTRACTING KNOWLEDGE?
– IS MISSING DATA TANTAMOUNT TO NOISE? IF NOT, HOW DO WE DEAL WITH IT?
• CLASSIFICATION OF MODULARITY FROM DATA
– HOW CAN WE DEFINE MODULES (FUNCTIONAL, SPATIAL, TEMPORAL, ETC.) FROM DATA?
– WHAT IS THE INFORMATION CONTENT IN THE MODULES?
– CAN WE COMPARE MODULES QUANTITATIVELY?
• DEALING WITH DYNAMICAL DATA
– HOW DO WE DEAL WITH TIME SERIES DATA?
– HOW IS INFORMATION PROCESSED IN TIME SERIES DATA?
– WHAT GRANULARITY AND CONTEXT ARE NECESSARY TO ANALYZE THIS DATA?
• The coffee drinkers problem [highly skewed distributions]:
– 90% of people are coffee drinkers
• What does this say about making drink predictions that are 90% accurate?
• Biology is all about highly skewed distributions, which pose significant challenges for methods, measures, and validation
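The question about 90% accuracy can be made concrete with a minimal sketch. It assumes a hypothetical population of 1,000 people, 900 of whom drink coffee and 100 of whom drink tea (the labels and counts are illustrative, not from the slides), and a predictor that always answers "coffee". Balanced accuracy is used here only as one possible skew-aware measure.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, label):
    """Fraction of the true `label` cases that were predicted as `label`."""
    predictions_for_label = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in predictions_for_label) / len(predictions_for_label)


# Hypothetical skewed population: 90% coffee drinkers, 10% tea drinkers.
labels = ["coffee"] * 900 + ["tea"] * 100

# A predictor that ignores its input and always guesses the majority class.
always_coffee = ["coffee"] * len(labels)

acc = accuracy(labels, always_coffee)
coffee_recall = recall(labels, always_coffee, "coffee")
tea_recall = recall(labels, always_coffee, "tea")

# Balanced accuracy averages the per-class recalls, so the majority
# class cannot dominate the score.
balanced_acc = (coffee_recall + tea_recall) / 2

print(f"accuracy:          {acc:.2f}")
print(f"recall (tea):      {tea_recall:.2f}")
print(f"balanced accuracy: {balanced_acc:.2f}")
```

The trivial predictor reaches 90% accuracy while finding none of the tea drinkers, so a method that reports 90% accuracy on such data may have learned nothing beyond the base rate.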
• The coffee drinkers problem – examples:
– 99% of us likely do not have the disease one might be looking for
– 99% of protein interactions are accounted for by 5% of the proteins
– 99% of the known disease-implicated mutations occur in less than 5% of the people
– (all estimates, but largely realistic)
• The coffee drinkers problem:
– Most current techniques in data analysis are rendered useless by such skewed distributions.
– Statistical significance with a meaningful null hypothesis is critical (information content is one of the most commonly used measures even today)
– Simulation-based methods often do not work, so analytical methods are required
– Methods must optimize for these analytical measures of quality
– Validation in the absence of complete data is hard
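One way to see why analytical measures matter is to compare what a simulation can resolve with what a closed-form tail probability gives. The sketch below is an illustration, not the method used by the author: it assumes a hypergeometric null model (drawing a gene set at random without replacement), and the function name `hypergeom_tail` and all the numbers are hypothetical.

```python
from fractions import Fraction
from math import comb


def hypergeom_tail(population, marked, sample, observed):
    """Exact P(X >= observed), where X counts the marked items in a sample
    of size `sample` drawn without replacement from `population` items,
    `marked` of which are marked."""
    total = comb(population, sample)
    upper = min(marked, sample)
    favourable = sum(
        comb(marked, i) * comb(population - marked, sample - i)
        for i in range(observed, upper + 1)
    )
    return Fraction(favourable, total)


# Hypothetical numbers: 20,000 genes, 50 of them disease-implicated,
# and a candidate module of 40 genes that contains 10 implicated ones.
p_value = hypergeom_tail(population=20_000, marked=50, sample=40, observed=10)

# A simulation with n random draws cannot report a non-zero p-value
# smaller than 1/n, so rare events need the analytical tail instead.
n_simulations = 10_000
simulation_floor = Fraction(1, n_simulations)

print(f"analytical p-value: {float(p_value):.3e}")
print(f"smallest non-zero p-value from {n_simulations} simulations: "
      f"{float(simulation_floor):.1e}")
```

Exact rational arithmetic is used so that very small tail probabilities are not lost to floating-point underflow before they are printed.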
• The coffee drinkers problem (real examples):
– When is a module in a network significant?
– When is an observed mutation in a sequenced, phenotype-implicated genome significant?
– When is an alignment of two networks significant?
– When is correlation in time-course microarray data significant?
• Conversely:
– How do we detect the most significant modules in a network?
– How do we identify all phenotype-implicated mutations from a large number of sequenced diseased and normal genomes?
– How do we align networks to obtain the most statistically significant alignments?
– How do we find the most correlated signals and the associated groups of genes?
• The Hidden Terminal Problem:
– Consider a phenotype as reflected in its genetic variants (e.g., the nucleotide-level variations associated with a disease).
– Often, these variations are not consistent (e.g., liver cancer manifests itself in gene mutations that are not all at the same place).
– However, these variations correspond to significantly aligned pathways in the underlying networks (i.e., they disrupt the same function, albeit by altering different genes).
– How do we go from an observable (phenotype/disease) to an abstraction (where the observable has little informative content) to other abstractions (where the observable might have significant information content)?
– More importantly, how do we go backwards (predict observables)?
• The Hidden Terminal Problem: Specific Instance
– Start from observed mutations in a specific disease (liver and breast cancer have significant genomic data available).
– The mutations result from noise, from other phenotypes, and from the specific disease. A simple intersection yields no signal.
– Cross-reference against synthetic lethality data.
– Redefine intersection over pathways.
– Reassess mutations under this definition and quantify the significance of these mutations w.r.t. the observed phenotype.
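The step "redefine intersection over pathways" can be sketched as follows. Everything in this example is hypothetical: the patient identifiers, gene names, pathway names, and the `gene_to_pathways` mapping are placeholders chosen for illustration, and the sketch leaves out the synthetic lethality cross-reference and the significance assessment.

```python
# Hypothetical mutated-gene sets for three patients with the same disease.
patient_mutations = {
    "patient_1": {"GENE_A", "GENE_X"},
    "patient_2": {"GENE_B", "GENE_Y"},
    "patient_3": {"GENE_C"},
}

# Hypothetical gene-to-pathway annotation.
gene_to_pathways = {
    "GENE_A": {"pathway_1"},
    "GENE_B": {"pathway_1"},
    "GENE_C": {"pathway_1", "pathway_2"},
    "GENE_X": {"pathway_3"},
    "GENE_Y": {"pathway_4"},
}


def shared(sets):
    """Elements common to every set in `sets` (empty if no sets are given)."""
    sets = list(sets)
    return set.intersection(*sets) if sets else set()


def pathways_hit(genes, annotation):
    """All pathways that contain at least one of the mutated genes."""
    hit = set()
    for gene in genes:
        hit |= annotation.get(gene, set())
    return hit


# Simple intersection at the gene level: no gene is mutated in all patients.
gene_level = shared(patient_mutations.values())

# Intersection redefined over pathways: map each patient's mutations
# to the pathways they disrupt, then intersect those pathway sets.
pathway_level = shared(
    pathways_hit(genes, gene_to_pathways) for genes in patient_mutations.values()
)

print("shared genes:   ", gene_level)
print("shared pathways:", pathway_level)
```

In this toy data the gene-level intersection is empty, while the pathway-level intersection recovers the one pathway disrupted in every patient through different genes.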