Discovery Challenge
Gene expression datasets
On behalf of Olivier Gandrillon
SAGE data from the Cancer
Anatomy Project
Two datasets (public data on human cells)
– 822 * 74
– 27 679 * 90
Questions to answer
– Can we find synexpression groups?
– Are we able « to group » cell types using gene
expression profiles?
– Can we obtain bi-sets, i.e., sets of genes associated
to sets of cells which denote some relevant biological
associations?
– Can we find invariant genes?
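The bi-set question above can be illustrated with a minimal sketch. It assumes the expression matrix has been Boolean-encoded (1 = gene considered expressed in a cell type); the encoding, the toy data, and the function names are all hypothetical and are not taken from the challenge material. Starting from a seed gene set, it collects the cells where all seed genes are expressed, then the genes expressed in all of those cells, which gives a maximal rectangle of 1s.

```python
# Toy Boolean expression matrix (hypothetical data):
# 1 means the gene is considered "expressed" in that cell type.
EXPRESSION = {
    "gene_a": {"cell_1": 1, "cell_2": 1, "cell_3": 0},
    "gene_b": {"cell_1": 1, "cell_2": 1, "cell_3": 1},
    "gene_c": {"cell_1": 0, "cell_2": 1, "cell_3": 1},
}


def cells_sharing(genes):
    """Cells in which every gene of `genes` is expressed."""
    all_cells = {c for row in EXPRESSION.values() for c in row}
    return {c for c in all_cells if all(EXPRESSION[g][c] for g in genes)}


def genes_sharing(cells):
    """Genes expressed in every cell of `cells`."""
    return {g for g, row in EXPRESSION.items() if all(row[c] for c in cells)}


def close_biset(seed_genes):
    """Grow a seed gene set into a maximal (genes, cells) rectangle of 1s."""
    cells = cells_sharing(seed_genes)
    genes = genes_sharing(cells)
    return genes, cells


genes, cells = close_biset({"gene_a"})
print(sorted(genes), sorted(cells))
# ['gene_a', 'gene_b'] ['cell_1', 'cell_2']
```

Whether such a rectangle denotes a relevant biological association is exactly the open question; the sketch only shows how a candidate bi-set can be computed.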
« Quantitative » feedback
• Increasing number of submissions for analysing
SAGE data
– From 2 in Pisa to 7 in Porto
• 5 on the smaller expression matrix (minimal
transcriptome), 2 on the larger
• Why is the minimal transcriptome preferred? ;-)
Topics (1)
• Association rules (1)
– Gasmi et al.
– Extracting generic bases of association rules from SAGE
data
• Yet another « cover » of association rules, considering the
smaller data set, added-value w.r.t. previous work is unclear
• Class characterization (CBA-like approach) (1)
– Hebert et al.
– Mining delta-strong characterization rules in large SAGE
dataset
• Yet another « cover » of association rules but with class
characterization as the targeted application, some biological
validation of the added-value … which is also expected given
that the « data providers » are involved in the research
Topics (2)
• Clustering
– Martinez et al.
– Exploratory analysis of cancer SAGE data
• Added-value w.r.t. previous work unclear, including the
first attempt to use clustering for global analysis of SAGE
data (2001). Does cleaning improve cluster relevancy
from a biological perspective? Why consider only the
minimal transcriptome?
Topics (3)
• Supervised classification (4)
– Hsuan-Tien Lin et al.
– Analysis of SAGE results with combined learning
techniques
• Using Support Vector Machines on the large SAGE data
set for feature extraction and for discriminating cancer
libraries. Impossible to assess the added-value since the
extracted model is not explicit in the paper.
– Ylirinne
– Analysis of the Gene expression data with 4ft-Miner
• This is an application of the GUHA method (descriptive rules)
to the small SAGE matrix without any insight into the
added-value.
Topics (4)
• Supervised classification (cont.)
– Esseghir et al.
– Localizing compact sets of genes involved in
cancer diseases using an evolutionary
connectionist approach
• Predicting cancer class values from the small SAGE
dataset by means of neural networks and genetic
algorithms. Results on gene selection and classification
accuracy were given, but the data providers were
not able to interpret the concrete results.
Topics (5)
• Supervised classification (cont.)
– Alves et al.
– Predictive analysis of gene expression data from
human SAGE libraries
• Studies the impact of dimensionality reduction techniques
on classification performance for the small dataset. It
leads to an unexpected result: the best classification
performance is obtained when selecting the genes with
relatively low expression and low variation levels. Does
this remain true for the large dataset, where no selection
has been applied beforehand?
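As an illustration of the kind of selection discussed above, here is a minimal sketch that keeps genes whose mean tag count and variance across libraries both stay at or below given thresholds. The counts, thresholds, and function name are hypothetical; this is not the procedure used by Alves et al., only one simple way to express "low expression and low variation".

```python
from statistics import mean, pvariance

# Hypothetical tag counts: gene -> counts across four SAGE libraries.
COUNTS = {
    "gene_low_flat": [2, 3, 2, 3],
    "gene_high_flat": [90, 92, 91, 89],
    "gene_low_noisy": [0, 9, 1, 10],
    "gene_high_noisy": [10, 200, 30, 160],
}


def select_genes(counts, max_mean, max_variance):
    """Keep genes whose mean and population variance are both within the thresholds."""
    selected = []
    for gene, values in counts.items():
        if mean(values) <= max_mean and pvariance(values) <= max_variance:
            selected.append(gene)
    return selected


# Thresholds are arbitrary and chosen only for this toy example.
print(select_genes(COUNTS, max_mean=5, max_variance=1))
# ['gene_low_flat']
```

On real SAGE data the thresholds would have to be justified, for instance relative to library size, which is part of the preprocessing question raised in the conclusion.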
Conclusion
• Much better than last year … and we should
encourage data miners to work on real-life
biological data
– What can be learned from this data … or what should
not be learned
• Typical problem of false positive patterns
• Impact of data preprocessing (feature selection/construction)
needs further research
– Nobody used external sources of
knowledge to support the biological
interpretation … which is actually needed but also
extremely hard
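On the false positive problem mentioned above: when many patterns or genes are tested at once, some will look significant by chance. One standard safeguard, shown here as a sketch and not as something any submission is known to have used, is the Benjamini-Hochberg procedure, which controls the false discovery rate. The p-values below are made up for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return indices of hypotheses rejected at false discovery rate `alpha`."""
    m = len(p_values)
    # Indices sorted by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Largest rank k such that p_(k) <= alpha * k / m.
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# Hypothetical p-values for five candidate patterns.
p_values = [0.001, 0.045, 0.025, 0.2, 0.008]
print(benjamini_hochberg(p_values))
# [0, 2, 4]
```

A naive per-test threshold of 0.05 would accept four of the five patterns here; the procedure keeps three.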
Discussion
• Should we drastically reduce the number of
genes, and especially remove the ones
with low expression?
• Is it reasonable to try to predict cancerous
class values from such datasets?
What to do next?
• What can molecular biologists bring to machine
learning/data mining researchers in the context
of discovery challenges?
– Real data, nice context for e-science, need for
multiple expertise/collaborative research, etc.
• What can machine learning/data mining researchers
bring to molecular biologists in the context of
discovery challenges?
– New methods for data analysis, new methods for
collecting data (e.g., suggestion of relevant wet
biology experiments to optimize the return on
investment), etc.