GENE-CBR - Informatics

geneCBR
a case-based reasoning tool for cancer diagnosis using
microarray datasets
Dr. Florentino Fdez-Riverola
University of Vigo
Computer System of New Generation
Outline
DNA Microarray Technology
characteristics and model operation overview
Bioinformatics and AI
new challenges and emerging research areas
CBR systems
case-based reasoning
GENE-CBR
human genome analysis using CBR systems
Demo
geneCBR in action: cancer diagnosis using microarrays
Microarrays
Bioinformatics-AI
CBR systems
geneCBR
Demo
Microarrays: characteristics
silicon chips that can measure the expression levels of
thousands of genes simultaneously
microarrays are based on a database of over 40,000 gene
fragments called expressed sequence tags (ESTs)
they allow us, for the first time, to obtain a “global” view of the
cells belonging to:
• different individuals
• different time intervals for the same individual
• different tissues of the same individual
gene expression profiles can be used as inputs to large-scale
data analysis, for example:
• as fingerprints to build more accurate molecular classifications
• discovering hidden taxonomies
• increasing our understanding of normal and disease states
Microarrays: model operation overview
how does the chip work?
microarray chips incorporate different dyed genes,
tiled in a grid-like fashion
the individual's DNA to be analyzed is dyed with a
different colour
both sets of labelled DNA strands are allowed to
hybridize (bind)
hybridization events are detected by identifying
fluorescence changes in the strands of DNA
a scanner and the associated software perform
various forms of image analysis to measure and
report raw gene expression values
the scanned intensities show how active the genes
represented by the ESTs are in the cell:
• strong fluorescence indicates that the gene is very
active in the cell
• no fluorescence indicates that the gene is inactive in
the cell
[Figure: scanner, preprocessing, microarray data file]
Available data
bone marrow samples from 43 adult patients with Acute
Myeloid Leukemia (AML) plus 6 healthy individuals
10  patients with Acute Promyelocytic Leukemia          [APL]
 4  patients with Acute Myeloid Leukemia with inv(16)   [AML-inv(16)]
 7  patients with Acute Monocytic Leukemia              [AML-mono]
22  patients with Acute non-Monocytic Leukemia          [AML-other]
 6  samples belonging to healthy individuals            [control samples]
volume of information processed
each microarray contains 22,283 ESTs (genes)
49 microarrays = 1,091,867 gene expression values
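The total above follows directly from the sample counts and the number of ESTs per chip. A quick check, with variable names chosen only for illustration:

```python
# Counts taken from the slide; the variable names are illustrative.
ests_per_microarray = 22_283
microarrays = 10 + 4 + 7 + 22 + 6  # APL, AML-inv(16), AML-mono, AML-other, controls

total_values = ests_per_microarray * microarrays
print(microarrays, total_values)  # prints: 49 1091867
```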
data available today
150 microarrays (Human Genome 133A) + 210 microarrays (Human Genome - plus)
Challenges for microarray Data Mining
three main types of data analysis are needed for
biomedical applications:
gene selection (attribute selection in AI)
• finding the genes most strongly related to a particular class
classification (supervised classification in AI)
• classifying diseases or predicting outcomes based on gene
expression patterns, and perhaps even identifying the best
treatment for a given genetic signature
clustering (unsupervised classification in AI)
• finding new biological classes or refining existing ones
three parallel research areas:
convenient visualization of experiments and results
discovery of biological knowledge (metabolic pathways, etc.)
low-level analysis providing better readouts (preprocessing,
normalization, etc.)
Problems with existing data
the analysis of microarrays presents a number of unique challenges for
Machine Learning and Data Mining techniques, but the capacity of
microarrays for generating enormous amounts of data is also a
handicap:
a great amount of data belongs to each individual (thousands of genes)
• efficiency and memory problems
lack of initial knowledge
• what is the significance level of each gene?
given the difficulty of collecting microarray samples, the number of samples
is likely to remain small in many interesting cases
having so many fields relative to so few samples creates a high likelihood of
finding false positives
these problems increase if we consider the potential errors that can be
present in microarray data (systematic and random errors)
sophisticated data analysis techniques and robust methods are required,
capable of extracting biologically meaningful knowledge from the raw
data
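The false-positive risk mentioned above can be illustrated with a small simulation. This is only a sketch under stated assumptions: every gene is pure Gaussian noise, the 49 samples are split 10 versus 39 (an arbitrary choice that mirrors one class against the rest), 2,000 genes stand in for the full chip to keep the run short, and the function name and the threshold of 2.0 on a Welch-style score are hypothetical.

```python
import random
import statistics


def count_chance_hits(n_genes=2000, n_cases=10, n_others=39, threshold=2.0, seed=0):
    """Count pure-noise genes whose two-group score exceeds the threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_genes):
        # Both groups are drawn from the same distribution: no real signal exists.
        a = [rng.gauss(0.0, 1.0) for _ in range(n_cases)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_others)]
        # Welch-style statistic: difference of means over its standard error.
        std_err = (statistics.variance(a) / n_cases
                   + statistics.variance(b) / n_others) ** 0.5
        score = (statistics.mean(a) - statistics.mean(b)) / std_err
        if abs(score) > threshold:
            hits += 1
    return hits


hits = count_chance_hits()
print(f"{hits} of 2000 noise genes look discriminative purely by chance")
```

Although none of the simulated genes carries any information, a number of them pass the threshold, and that number grows with the number of genes tested.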
CBR systems (Case-Based Reasoning)
Kolodner (1983a, 1983b). Problem-solving paradigm in AI. It can be viewed
as a methodology for reasoning and learning
“reasoning by re-using past cases is a powerful and frequently applied way to solve
problems for humans” Joh (1997)
the memory of the system (case base) stores a certain number of
previously experienced situations
CASE = PROBLEM description + applied SOLUTION [+ RESULT]
a new problem is solved by finding similar past cases and reusing them in
the new problem situation
Riesbeck et al. (1989)
4 cyclical steps are performed when it is necessary to solve a new problem
Kolodner (1993); Aamodt and Plaza (1994); Watson (1997)
case-based reasoning is, in effect, a cyclic and integrated process of
solving a problem, learning from this experience, solving a new problem,
and so on...
The CBR cycle
New problem
(1) RETRIEVE: retrieving one or more previously experienced cases
(the most similar cases) from the MEMORY (CASE BASE)
(2) REUSE: reusing the case(s) in one way or another
to obtain a proposed solution
(3) REVISE: revising the solution based on reusing a previous case(s)
(4) RETAIN: retaining the new experience (the confirmed solution) by
incorporating it into the existing knowledge base (case base)
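The four steps of the cycle can be sketched in a few lines. This is a minimal illustration, not the GENE-CBR implementation: the `Case` structure follows the PROBLEM + SOLUTION [+ RESULT] definition given earlier, while the Euclidean distance, the majority vote and all function names are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Case:
    problem: List[float]          # problem description (feature vector)
    solution: str                 # applied solution, e.g. a diagnosis
    result: Optional[str] = None  # optional outcome of applying the solution


def distance(a, b):
    # Euclidean distance; an assumed similarity measure for this sketch.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def retrieve(case_base, problem, k=3):
    """(1) RETRIEVE the k most similar past cases."""
    return sorted(case_base, key=lambda c: distance(c.problem, problem))[:k]


def reuse(similar_cases):
    """(2) REUSE them: here, a simple majority vote gives the proposed solution."""
    votes = {}
    for case in similar_cases:
        votes[case.solution] = votes.get(case.solution, 0) + 1
    return max(votes, key=votes.get)


def revise(proposed, expert_correction=None):
    """(3) REVISE: an expert may override the proposed solution."""
    return expert_correction if expert_correction is not None else proposed


def retain(case_base, problem, confirmed):
    """(4) RETAIN the confirmed solution as a new case."""
    case_base.append(Case(problem, confirmed, result="confirmed"))


def solve(case_base, problem, expert_correction=None):
    proposed = reuse(retrieve(case_base, problem))
    confirmed = revise(proposed, expert_correction)
    retain(case_base, problem, confirmed)
    return confirmed


case_base = [
    Case([0.1, 0.2], "A"),
    Case([0.2, 0.1], "A"),
    Case([0.9, 0.8], "B"),
    Case([0.8, 0.9], "B"),
]
answer = solve(case_base, [0.15, 0.15])
print(answer, len(case_base))  # prints: A 5
```

Because the solved problem is stored again, the case base grows with every query, which is what makes the process cyclic.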
Main characteristics of CBR systems
adaptive and dynamic systems: the number of cases stored in the memory
of the model changes, allowing the system to adapt to new situations
CBR systems allow the use of general knowledge in the resolution of a
particular problem
CBR systems facilitate the indexing of the available information
CBR systems can use incomplete cases
CBR systems are aware of their limitations (perhaps a problem has no
solution)
CBR systems facilitate the use of representative and flexible data structures
case adaptation helps to discover interconnections and hidden structures in
the available data
CBR systems can be completely automated
GENE-CBR
Goals
GENE-CBR objectives
doctor: develop an effective and reliable system able to diagnose cancer
subtypes based on the analysis of microarray data
• uses a CBR system (Case-Based Reasoning): “Solve new problems (new patient)
based on the previous experience (diagnosed patients)”
research group: implement a flexible tool for designing and testing new
techniques and experiments
• AI techniques: selection, clustering, inference…
programmer: construct an advanced edition module for run-time modification
of coded techniques
• BeanShell programmer interface
Logic architecture
[Figure: logic architecture. Labels: wizard; JCBR; doctor, Diagnostic Mode
(diagnosing); research group, Expert Mode (testing techniques); programmer,
Programming Mode (BeanShell); DFP, CBR, GCS; the CBR cycle around the CASE
BASE: [1] RETRIEVE, [2] REUSE, [3] REVISE, [4] RETAIN]
Model overview

[Figure: gene-CBR model overview. Stages: Gene Selection, Clustering,
Prediction, Knowledge Discovery. Intermediate results: most relevant genes
(= DFP), genetically similar patients, initial prediction, revised prediction
and final diagnosis, reclassification]
GENE-CBR::[i] retrieval
objectives:
perform gene selection without losing information
• extracting simplified fuzzy patterns (FP) for each pathology
possibility of using AI techniques that were initially discarded
main phases:
supervised fuzzy discretisation of gene expression values
• Low, Medium, High and overlapping labels (LM, MH)
supervised gene selection for each pathology
advantages:
independence from the ordering of the data
takes data variability into account
allows new knowledge to be discovered
the results obtained are interpretable
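The discretisation into Low, Medium and High with overlapping labels can be sketched as follows. This is an illustration only: the slide describes a supervised discretisation, whereas this sketch assumes fixed triangular membership functions on expression values scaled to [0, 1] and a hypothetical membership threshold `theta`; the function names are also assumptions.

```python
def memberships(x, low=0.0, mid=0.5, high=1.0):
    """Degree to which a scaled expression value x is Low, Medium and High."""
    x = min(max(x, low), high)  # clamp to the assumed [low, high] range
    if x <= mid:
        return {"L": (mid - x) / (mid - low), "M": (x - low) / (mid - low), "H": 0.0}
    return {"L": 0.0, "M": (high - x) / (high - mid), "H": (x - mid) / (high - mid)}


def fuzzy_label(x, theta=0.4):
    """Return L, M or H, or an overlapping label (LM, MH) when two sets apply."""
    degrees = memberships(x)
    return "".join(name for name in ("L", "M", "H") if degrees[name] >= theta)


labels = [fuzzy_label(v) for v in (0.05, 0.25, 0.5, 0.75, 0.95)]
print(labels)  # prints: ['L', 'LM', 'M', 'MH', 'H']
```

Values that sit between two fuzzy sets receive an overlapping label instead of being forced into one of them, which is one way to discretise without discarding borderline information.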
GENE-CBR::[i] retrieval
[Figure: samples grouped by class: healthy, APL (Acute Promyelocytic
Leukemia), AML-inv(), AML-mono, AML-other]
GENE-CBR::[i] retrieval
[Figure: a fuzzy pattern is shown for each class (FP_healthy, FP_APL,
FP_AML-inv(), FP_AML-monocytic, FP_AML-other), leading to the DFP]
GENE-CBR::[ii] reuse
objectives:
unsupervised identification of genetic similarities between patients
• taking into account only the previously selected genes (DFP)
main phases:
training a DFP-dimensional GCS network
• Growing Cell Structures. Fritzke, B. (1993)
presenting the new patient to the network
classifying using a proportional weighted voting scheme
advantages:
clustering without taking into account the patient class
definition of an indexing and similarity structure between nodes
(relating patients)
generation of clusters that may contain new, previously unknown cancer subtypes
(knowledge discovery)
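The voting step can be illustrated in isolation. The sketch below does not implement a GCS network; it assumes the network has already returned the most similar patients with their distances to the new patient, and it assumes a weight of 1 / (1 + distance), normalised so that the class shares sum to 1. The function name, the weighting formula and the distances are all hypothetical.

```python
from collections import defaultdict


def weighted_vote(neighbours):
    """neighbours: list of (class_label, distance) pairs for retrieved patients.

    Returns the winning class and the share of the total weight per class.
    """
    raw = defaultdict(float)
    for label, dist in neighbours:
        raw[label] += 1.0 / (1.0 + dist)  # closer patients weigh more
    total = sum(raw.values())
    shares = {label: weight / total for label, weight in raw.items()}
    return max(shares, key=shares.get), shares


# Hypothetical retrieved patients, ordered from most to least similar.
retrieved = [("AML-inv()", 0.2), ("AML-inv()", 0.4), ("AML-other", 0.5), ("AML-inv()", 0.9)]
winner, shares = weighted_vote(retrieved)
print(winner, {label: round(share, 2) for label, share in shares.items()})
```

Reporting the shares alongside the winner gives the reviewer a sense of how contested the classification was.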
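GENE-CBR::[ii] reuse
[Figure: the new patient is presented as its DFP gene expression values.
Columns: PAT., gene expression values DFP, CLASS. Patients are ordered from
+ Similarity to - Similarity. Class labels shown: AML-inv(), AML-inv(),
AML-other, AML-inv(), ? and AML-inv()]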
GENE-CBR::[iii] revise
objectives:
provide doctors with meaningful information about the
classification carried out by the system
help in discovering new knowledge
• if-then rules as a decision-making support mechanism
information supplied:
identification of similar patients (from a genetic point of view)
proportional weighted voting and assigned weights
rule generation using See5. Quinlan, J.R. (2000)
• DFP genes belonging to the set of patients retrieved by the GCS
network
advantages:
doctors can supervise the final decision proposed by the system
new knowledge is generated in the form of easily understandable rules
GENE-CBR::[iii] revise
[Figure: retrieved patients with classes AML-inv(), AML-inv(), AML-inv(),
AML-other, AML-inv(), shown with their BIOLOGICAL AND CLINICAL
CHARACTERISTICS and KARYOTYPE]
Rule 6: (45 / 4, lift 1.1)
If X65962 (AFFX-HSAC07/X00351_5_at) is LOW then
If U96781 (AFFX-BioDn-3_at) is LOW-MEDIUM then AML-other
Else If D87845 (AFFX-hum_alu_at) is HIGH then AML-inv() [0.968]
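A rule of this kind can be applied to a patient's discretised gene values as sketched below. This is one possible reading of Rule 6, assuming that the "Else If" branch belongs to the inner condition; the function name, the dictionary layout and the example patients are hypothetical.

```python
def rule_6(labels):
    """labels maps a gene identifier to its fuzzy label for one patient.

    Returns the predicted class, or None when the rule does not fire.
    """
    if labels.get("X65962") == "LOW":
        if labels.get("U96781") == "LOW-MEDIUM":
            return "AML-other"
        elif labels.get("D87845") == "HIGH":
            return "AML-inv()"
    return None


# Hypothetical patients, described only by the three genes used in the rule.
patient_a = {"X65962": "LOW", "U96781": "LOW-MEDIUM", "D87845": "HIGH"}
patient_b = {"X65962": "LOW", "U96781": "HIGH", "D87845": "HIGH"}
patient_c = {"X65962": "HIGH", "U96781": "LOW-MEDIUM", "D87845": "HIGH"}
print(rule_6(patient_a), rule_6(patient_b), rule_6(patient_c))
# prints: AML-other AML-inv() None
```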
GENE-CBR::[iv] retain
objectives:
feed new knowledge back into the system
• new subclassification of existing cancer pathologies
• reclassification of existing patients
• identification of correlated genes
• discovery of new markers able to distinguish new pathologies
• identification of prototypical patients and rare cases
main phases:
update the case base with a new microarray every time a new
classification is generated
modification of the parameters of the model
advantages:
possibility of easily integrating new biological knowledge in the
hybrid system
Applied technologies
100% Java: Swing, BeanShell, Log4j, JFreeChart
Design patterns: Action, Future, MVC, Singleton, Wizard
Unified Modeling Language: Poseidon for UML
Future work
moving towards a plug-in architecture
designing a core where each technique is implemented as a
plug-in => aiBENCH
implementing k-fold cross-validation
automatic generation of multiple training and test cases
supporting standard microarray data formats
MIAME: Minimum Information About a Microarray Experiment
deploying GENE-CBR with Java Web Start
remote and automatic access to the latest versions of the
GENE-CBR project
on-line access to genetic sequence databases
GenBank (http://www.ncbi.nlm.nih.gov/Genbank)
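The cross-validation item in the list above amounts to generating several training/test partitions automatically. A minimal sketch, assuming plain (non-stratified) k-fold splitting over sample indices; the function name and the choice of 7 folds for 49 samples are illustrative only.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_splits(49, 7))
for train, test in splits:
    print(len(train), len(test))  # each line prints: 42 7
```

Every sample appears in exactly one test fold, so each case is used for testing once and for training in all the other rounds.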
Demo:: GENE-CBR in action