sma-talk-oct03 - NUS School of Computing


From Informatics to Bioinformatics:
The Knowledge Discovery
Perspective
Limsoon Wong
Institute for Infocomm Research
Singapore
Copyright 2003 limsoon wong
Plan
• Overview of recent knowledge discovery
successes in bioinformatics
• Risk assignment of childhood ALL
patients to optimize risk-benefit ratio of
therapy
• Recognition of translation initiation sites
from DNA sequences
Copyright 2003 limsoon wong
overview of recent knowledge
discovery successes in
bioinformatics
Copyright 2003 limsoon wong
What is Datamining?
Jonathan’s blocks
Jessica’s blocks
Whose block is this?
Jonathan’s rules: Blue or Circle
Jessica’s rules: All the rest
Copyright 2003 limsoon wong
What is Datamining?
Question: Can you explain how?
Copyright 2003 limsoon wong
What is Bioinformatics?
Copyright 2003 limsoon wong
Bioinformatics brings benefits
To the patient:
Better drug, better treatment
To the pharma:
Save time, save cost, make more $
To the scientist:
Better science
Copyright 2003 limsoon wong
To figure these out,
we bet on...
“solution” =
Data Mgmt + Knowledge Discovery
Data Mgmt =
Integration + Transformation + Cleansing
Knowledge Discovery =
Statistics + Algorithms + Databases
Copyright 2003 limsoon wong
History
8 years of bioinformatics R&D in Singapore
• Integration Technology (Kleisli)
• MHC-Peptide Binding (PREDICT)
• Protein Interactions Extraction (PIES)
• Gene Expression & Medical Record Datamining (PCL)
• Cleansing & Warehousing (FIMM)
• Gene Feature Recognition (Dragon)
• Venom Informatics
• Molecular Connections
• GeneticXchange
• Biobase
[Timeline figure: years 1994, 1996, 1998, 2000, 2002; institutes ISS, KRDL, LIT/I2R]
Copyright 2003 limsoon wong
Predict Epitopes,
Find Vaccine Targets
• Vaccines are often the
only solution for viral
diseases
• Finding & developing
effective vaccine targets
(epitopes) is a slow and
expensive process
Copyright 2003 limsoon wong
Recognize Functional Sites,
Help Scientists
• Effective recognition of
initiation, control, and
termination of biological
processes is crucial to
speeding up and focusing
scientific experiments
• Data mining of bio seqs to
find rules for recognizing &
understanding functional
sites
Dragon’s 10x
reduction of
TSS recognition
false positives
Copyright 2003 limsoon wong
Diagnose Leukaemia,
Benefit Children
• Childhood leukaemia is a
heterogeneous disease
• Treatment is based on subtype
• 3 different tests and 4 different
experts are needed for
diagnosis
 Curable in USA,
 fatal in Indonesia
Copyright 2003 limsoon wong
Understand Proteins,
Fight Diseases
• Understanding the function and role
of a protein needs organised info
on interaction pathways
• Such info is often reported in
scientific papers but is seldom
found in structured databases
• Knowledge extraction
system to process free text
• extract protein names
• extract interactions
Copyright 2003 limsoon wong
risk assignment of
childhood ALL patients to optimize
risk-benefit ratio of therapy
Copyright 2003 limsoon wong
Childhood ALL
Heterogeneous Disease
• Major subtypes are
– T-ALL
– E2A-PBX1
– TEL-AML1
– MLL genome rearrangements
– Hyperdiploid>50
– BCR-ABL
Copyright 2003 limsoon wong
Childhood ALL
Treatment Failure
• Overly intensive treatment leads to
– Development of secondary cancers
– Reduction of IQ
• Insufficiently intensive treatment leads to
– Relapse
Copyright 2003 limsoon wong
Childhood ALL
Risk-Stratified Therapy
• Different subtypes respond differently to
the same treatment intensity
[Figure: subtypes arranged from generally good-risk (lower intensity) to generally high-risk (higher intensity): TEL-AML1, Hyperdiploid>50, T-ALL, E2A-PBX1, BCR-ABL, MLL]
Match patient to optimum treatment
intensity for his subtype & prognosis
Copyright 2003 limsoon wong
Childhood ALL
Risk Assignment
• The major subtypes look similar
• Conventional diagnosis requires
– Immunophenotyping
– Cytogenetics
– Molecular diagnostics
Copyright 2003 limsoon wong
Mission
• Conventional risk assignment procedure
requires difficult, expensive tests and the
collective judgement of multiple
specialists
• Generally available only in major
advanced hospitals
Can we have a single-test easy-to-use
platform instead?
Copyright 2003 limsoon wong
Single-Test Platform of
Microarray & Machine Learning
Copyright 2003 limsoon wong
Overall Strategy
Diagnosis of subtype → Subtype-dependent prognosis → Risk-stratified treatment intensity
• For each subtype, select genes to
develop classification model for
diagnosing that subtype
• For each subtype, select genes to
develop prediction model for
prognosis of that subtype
Copyright 2003 limsoon wong
Childhood ALL
Subtype Diagnosis by PCL
• Gene expression data collection
• Gene selection by χ²
• Classifier training by emerging pattern
• Classifier tuning (optional for some
machine learning methods)
• Apply classifier for diagnosis of future
cases by PCL
Copyright 2003 limsoon wong
Childhood ALL Subtype Diagnosis
Our Workflow
A tree-structured
diagnostic
workflow was
recommended by
our doctor
collaborator
Copyright 2003 limsoon wong
Childhood ALL Subtype Diagnosis
Training and Testing Sets
Copyright 2003 limsoon wong
Childhood ALL Subtype Diagnosis
Signal Selection Basic Idea
• Choose a signal w/ low intra-class distance
• Choose a signal w/ high inter-class distance
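The two criteria above can be folded into a single score per signal. The sketch below uses a simple ratio of the distance between class means to the sum of within-class spreads; this particular ratio, the function name `separation_score`, and the toy expression values are illustrative assumptions, not necessarily the measure used in the talk.

```python
from statistics import mean, pstdev

def separation_score(class_a, class_b):
    """Inter-class distance divided by intra-class spread for one signal."""
    inter = abs(mean(class_a) - mean(class_b))   # high is good
    intra = pstdev(class_a) + pstdev(class_b)    # low is good
    if intra == 0:
        # No spread at all: perfectly separated if the means differ.
        return float("inf") if inter > 0 else 0.0
    return inter / intra

# Hypothetical expression levels of two genes in two classes.
good_gene = separation_score([1.0, 1.2, 0.9], [5.0, 5.1, 4.8])
poor_gene = separation_score([1.0, 4.0, 2.5], [2.0, 5.0, 3.0])
print(good_gene > poor_gene)
```

A signal with tight, well-separated classes scores higher than one whose classes overlap, so ranking signals by this score favours the former.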
Copyright 2003 limsoon wong
Childhood ALL Subtype Diagnosis
Signal Selection by χ²
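As a rough illustration of chi-square scoring, the sketch below computes the standard χ² statistic for one signal against a two-class label. It assumes the signal has already been discretised into "high" and "low" (the discretisation step and the function name `chi_square_2x2` are assumptions made for this example); a larger value means the signal's distribution differs more between the classes.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for a 2x2 table.

    a, b: class-1 samples with the signal high / low
    c, d: class-2 samples with the signal high / low
    """
    n = a + b + c + d
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                stat += (observed[i][j] - expected) ** 2 / expected
    return stat

# A signal that splits the classes perfectly vs. one that carries no information.
print(chi_square_2x2(10, 0, 0, 10), chi_square_2x2(5, 5, 5, 5))
```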
Copyright 2003 limsoon wong
Childhood ALL Subtype Diagnosis
Emerging Patterns
• An emerging pattern is a set of conditions
– usually involving several features
– that most members of a class satisfy
– but none or few of the other class satisfy
• A jumping emerging pattern is an emerging
pattern that
– some members of a class satisfy
– but no members of the other class satisfy
• We use only jumping emerging patterns
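The definition of a jumping emerging pattern can be checked mechanically. The sketch below is a minimal illustration, assuming each sample is a dict from gene name to expression level and each pattern is a dict from gene name to a half-open interval; the gene names, thresholds, and function names are hypothetical.

```python
def matches(sample, pattern):
    """True if the sample satisfies every condition (gene in [lo, hi)) of the pattern."""
    return all(lo <= sample[gene] < hi for gene, (lo, hi) in pattern.items())

def support(pattern, samples):
    """Fraction of samples that satisfy the pattern."""
    if not samples:
        return 0.0
    return sum(matches(s, pattern) for s in samples) / len(samples)

def is_jumping_emerging_pattern(pattern, home_class, other_class):
    """Some members of the home class satisfy it, no member of the other class does."""
    return support(pattern, home_class) > 0 and support(pattern, other_class) == 0

# Hypothetical toy data: two genes, two classes.
class_a = [{"g1": 0.2, "g2": 9.0}, {"g1": 0.3, "g2": 8.5}, {"g1": 2.0, "g2": 9.1}]
class_b = [{"g1": 0.1, "g2": 1.0}, {"g1": 3.0, "g2": 9.5}]
pattern = {"g1": (0.0, 1.0), "g2": (8.0, 10.0)}

print(support(pattern, class_a), is_jumping_emerging_pattern(pattern, class_a, class_b))
```

Note that neither condition alone separates the toy classes; only the combination of the two does, which is why patterns usually involve several features.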
Copyright 2003 limsoon wong
Childhood ALL Subtype Diagnosis
PCL: Prediction by Collective Likelihood
Copyright 2003 limsoon wong
Childhood ALL Subtype Diagnosis
Accuracy of PCL (vs. other classifiers)
The classifiers are all applied to the 20 genes selected
by χ² at each level of the tree
Copyright 2003 limsoon wong
Multidimensional Scaling Plot
Subtype Diagnosis
Copyright 2003 limsoon wong
Multidimensional Scaling Plot
Subtype-Dependent Prognosis
• Similar computational
analysis was carried
out to predict relapse
and/or secondary
AML in a subtype-specific manner
• >97% accuracy
achieved
Copyright 2003 limsoon wong
Childhood ALL
Is there a new subtype?
• Hierarchical
clustering of gene
expression
profiles reveals a
novel subtype of
childhood ALL
Copyright 2003 limsoon wong
Childhood ALL
Cure Rates in ASEAN Countries
• Conventional risk
assignment
procedure requires
difficult, expensive
tests and the collective
judgement of multiple
specialists
Not available in less advanced ASEAN countries
[Bar chart: cure rate (axis 0% to 80%) by country: Cambodia, Vietnam, Thailand, Philippines, Indonesia, Malaysia, Singapore]
Copyright 2003 limsoon wong
Childhood ALL
Treatment Cost
• Treatment for childhood ALL over 2 yrs
– Intermediate intensity: US$60k
– Low intensity: US$36k
– High intensity: US$72k
• Treatment for relapse: US$150k
• Cost for side-effects: Unquantified
Copyright 2003 limsoon wong
Childhood ALL in ASEAN Countries
Current Situation (2000 new cases/yr)
• Intermediate intensity
conventionally applied
in less advanced
ASEAN countries
Over intensive for 50%
of patients, thus more
side effects
Under intensive for
10% of patients, thus
more relapse
5-20% cure rates
• US$120m (US$60k *
2000) for intermediate
intensity treatment
• US$30m (US$150k *
2000 * 10%) for relapse
treatment
• Total US$150m/yr
plus un-quantified
costs for dealing with
side effects
Copyright 2003 limsoon wong
Childhood ALL in ASEAN Countries
Using Our Platform (2000 new cases/yr)
• Low intensity applied
to 50% of patients
• Intermediate intensity
to 40% of patients
• High intensity to 10%
of patients
Reduced side effects
Reduced relapse
75-80% cure rates
• US$36m (US$36k * 2000
* 50%) for low intensity
• US$48m (US$60k * 2000
* 40%) for intermediate
intensity
• US$14.4m (US$72k *
2000 * 10%) for high
intensity
• Total US$98.4m/yr
Save US$51.6m/yr
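The arithmetic behind the two scenarios can be reproduced in a few lines. The sketch below uses only the per-patient costs, case count, and percentages stated on the slides; the variable and function names are made up for the example. As on the slides, the platform scenario counts treatment cost only, with no relapse-treatment term.

```python
NEW_CASES = 2000  # new cases per year
COST_K = {"low": 36, "intermediate": 60, "high": 72, "relapse": 150}  # US$ thousands per patient

def total_cost_k(shares_percent):
    """Total yearly cost in US$ thousands; shares are percentages of new cases."""
    return sum(COST_K[name] * NEW_CASES * pct // 100
               for name, pct in shares_percent.items())

# Current situation: everyone gets intermediate intensity, 10% relapse.
current = total_cost_k({"intermediate": 100, "relapse": 10})
# Using the platform: 50% low, 40% intermediate, 10% high intensity.
platform = total_cost_k({"low": 50, "intermediate": 40, "high": 10})

print(current / 1000, platform / 1000, (current - platform) / 1000)  # US$ millions
```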
Copyright 2003 limsoon wong
Acknowledgements
Copyright 2003 limsoon wong
recognition of translation initiation
sites from DNA sequences
Copyright 2003 limsoon wong
Translation Initiation Site
Copyright 2003 limsoon wong
A Sample mRNA
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
............................................................
................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
(Position labels on the slide: 80, 160, 240.)
What makes the second ATG the translation
initiation site?
Copyright 2003 limsoon wong
Translation Initiation Site Recognition:
Steps of a General Approach
• Training data gathering
• Signal generation
– k-grams, colour, texture, domain know-how, ...
• Signal selection
– Entropy, χ², CFS, t-test, domain know-how, ...
• Signal integration
– SVM, ANN, PCL, CART, C4.5, kNN, ...
Copyright 2003 limsoon wong
Translation Initiation Site Recognition:
Training & Testing Data
• Vertebrate dataset of Pedersen & Nielsen [ISMB’97]
• 3312 sequences
• 13503 ATG sites
• 3312 (24.5%) are TIS
• 10191 (75.5%) are non-TIS
• Used for 3-fold cross-validation experiments
Copyright 2003 limsoon wong
Translation Initiation Site Recognition:
Signal Generation
• K-grams (i.e., k consecutive letters)
– K = 1, 2, 3, 4, 5, …
– Window size vs. fixed position
– Upstream, downstream vs. anywhere in window
– In-frame vs. any frame
[Chart: values (axis 0 to 3) for seq1, seq2, seq3 over the letters A, C, G, T]
Copyright 2003 limsoon wong
Signal Generation:
An Example
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
• Window = 100 bases
• In-frame, downstream
– GCT = 1, TTT = 1, ATG = 1…
• Any-frame, downstream
– GCT = 3, TTT = 2, ATG = 2…
• In-frame, upstream
– GCT = 2, TTT = 0, ATG = 0, ...
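The counts above can be reproduced from the sample mRNA as printed on the slide. The sketch below makes a few assumptions that are not stated explicitly in the talk: the four printed sequence lines are contiguous, the downstream window starts immediately after the TIS codon, and the upstream window is clipped to the 92 bases that precede the TIS. The function name `kgram_counts` is made up for the example.

```python
from collections import Counter

# The four sequence lines from the slide, concatenated.
SEQ = (
    "CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG"
    "CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA"
    "GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA"
    "CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT"
)
WINDOW = 100

# The TIS is the second ATG in the sequence (0-based index of its A).
first_atg = SEQ.find("ATG")
TIS = SEQ.find("ATG", first_atg + 1)

def kgram_counts(region, k=3, step=1, start=0):
    """Count k-grams in region, reading every `step` bases from `start`."""
    return Counter(region[i:i + k]
                   for i in range(start, len(region) - k + 1, step))

downstream = SEQ[TIS + 3:TIS + 3 + WINDOW]
upstream = SEQ[max(0, TIS - WINDOW):TIS]

in_frame_down = kgram_counts(downstream, step=3)
any_frame_down = kgram_counts(downstream, step=1)
# Upstream codons must line up with the TIS, so skip len(upstream) % 3 bases first.
in_frame_up = kgram_counts(upstream, step=3, start=len(upstream) % 3)

for name, counts in [("in-frame downstream", in_frame_down),
                     ("any-frame downstream", any_frame_down),
                     ("in-frame upstream", in_frame_up)]:
    print(name, {g: counts[g] for g in ("GCT", "TTT", "ATG")})
```

Under these assumptions the three sets of counts agree with the figures on the slide.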
Copyright 2003 limsoon wong
Signal Generation:
Too Many Signals
• For each value of k, there are
4^k * 3 * 2 k-grams
• If we use k = 1, 2, 3, 4, 5, we have
4 + 24 + 96 + 384 + 1536 + 6144 = 8188
features!
• This is too many for most machine
learning algorithms
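The growth in the number of signals is easy to tabulate. The sketch below evaluates 4^k * 3 * 2 for k = 1 to 5; reading the factor 3 as the position options (upstream, downstream, anywhere in window) and the factor 2 as the frame options (in-frame, any frame) is an assumption based on the signal-generation slide. With that formula the five terms sum to 8184; the total of 8188 on the slide includes an additional leading term of 4.

```python
def kgram_feature_count(k, positions=3, frames=2):
    """Number of distinct k-gram signals: 4^k strings x position options x frame options."""
    return 4 ** k * positions * frames

per_k = [kgram_feature_count(k) for k in range(1, 6)]
print(per_k, sum(per_k))
```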
Copyright 2003 limsoon wong
Translation Initiation Site Recognition:
Signal Selection (e.g., χ²)
Copyright 2003 limsoon wong
Translation Initiation Site Recognition:
Signal Selection (e.g., CFS)
• Instead of scoring individual signals,
how about scoring a group of signals as
a whole?
• CFS
– Correlation-based Feature Selection
– A good group contains signals that are
highly correlated with the class, and yet
uncorrelated with each other
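The idea of scoring a group as a whole can be made concrete with the merit heuristic commonly given for CFS: for k signals with mean signal-class correlation r_cf and mean signal-signal correlation r_ff, merit = k * r_cf / sqrt(k + k(k-1) * r_ff). The sketch below assumes the correlations have already been computed by some means; the function name and the toy correlation values are hypothetical.

```python
from math import sqrt

def cfs_merit(class_corrs, pair_corrs):
    """Merit of a group of signals.

    class_corrs: correlation of each signal with the class
    pair_corrs:  correlation of each pair of signals in the group
    """
    k = len(class_corrs)
    r_cf = sum(abs(r) for r in class_corrs) / k
    r_ff = sum(abs(r) for r in pair_corrs) / len(pair_corrs) if pair_corrs else 0.0
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)

# Two equally predictive signals: independent of each other vs. fully redundant.
independent = cfs_merit([0.8, 0.8], [0.0])
redundant = cfs_merit([0.8, 0.8], [1.0])
print(independent > redundant)
```

The redundant pair scores no better than a single one of its signals, while the independent pair scores higher, which matches the description of a good group.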
Copyright 2003 limsoon wong
Signal Selection:
Sample k-grams Selected
• Position –3 (Kozak consensus)
• in-frame upstream ATG (leaky scanning)
• in-frame downstream
– TAA, TAG, TGA (stop codon)
– CTG, GAC, GAG, and GCC (codon bias)
Copyright 2003 limsoon wong
Translation Initiation Site Recognition:
Signal Integration
• kNN
Given a test sample, find the k training
samples that are most similar to it. Let the
majority class win.
• SVM
Given a group of training samples from two
classes, determine a separating plane that
maximises the margin between the two classes.
• Naïve Bayes, ANN, C4.5, PCL, ...
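The kNN rule described above fits in a few lines. This is a minimal sketch assuming numeric feature vectors and Euclidean distance as the similarity measure (the talk does not say which measure was used); the function name and toy data are hypothetical.

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label) pairs. Returns the majority label
    among the k training samples closest to the query."""
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical two-feature samples labelled TIS / non-TIS.
train = [((1.0, 1.0), "TIS"), ((1.2, 0.9), "TIS"), ((0.9, 1.1), "TIS"),
         ((5.0, 5.0), "non-TIS"), ((5.2, 4.8), "non-TIS"), ((4.9, 5.1), "non-TIS")]
print(knn_predict(train, (1.1, 1.0)), knn_predict(train, (5.0, 4.9)))
```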
Copyright 2003 limsoon wong
Translation Initiation Site Recognition:
Results (on Pedersen & Nielsen’s mRNA)
Classifier      Sensitivity   Specificity   Precision     Accuracy
                TP/(TP + FN)  TN/(TN + FP)  TP/(TP + FP)
Naïve Bayes     84.3%         86.1%         66.3%         85.7%
SVM             73.9%         93.2%         77.9%         88.5%
Neural Network  77.6%         93.2%         78.8%         89.4%
Decision Tree   74.0%         94.4%         81.1%         89.4%
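The four measures follow directly from the confusion-matrix counts. The sketch below defines them and, as a consistency check, reconstructs approximate counts for the Naïve Bayes row from its rounded sensitivity and specificity together with the dataset sizes (3312 TIS, 10191 non-TIS). The reconstructed counts are an approximation made for illustration, not figures reported in the talk.

```python
def metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, precision and accuracy from confusion counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }

N_TIS, N_NON_TIS = 3312, 10191

# Approximate counts implied by 84.3% sensitivity and 86.1% specificity.
tp = round(0.843 * N_TIS)
tn = round(0.861 * N_NON_TIS)
m = metrics(tp=tp, fn=N_TIS - tp, tn=tn, fp=N_NON_TIS - tn)

print({name: round(100 * value, 1) for name, value in m.items()})
```

The implied precision and accuracy round to the values in the Naïve Bayes row, which also shows why precision is much lower than specificity here: non-TIS sites outnumber TIS sites about three to one.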
Copyright 2003 limsoon wong
Translation Initiation Site Recognition:
mRNA → protein
How about using k-grams from the translation?
[Figure: genetic code mapping codons to amino acids (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y) and stop]
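Generating k-grams from the translation means first converting in-frame codons to amino acids and then counting k-grams over the resulting protein string. The sketch below uses the standard genetic code with DNA letters; the function names are made up for the example, and the test string is the start of the coding region of the sample mRNA followed by a stop codon added for illustration.

```python
from collections import Counter

# Standard genetic code, codons ordered with bases T, C, A, G ('*' = stop).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def translate(dna):
    """Translate DNA codon by codon, ignoring a trailing partial codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def amino_kgrams(protein, k):
    """Count k-grams of amino-acid letters."""
    return Counter(protein[i:i + k] for i in range(len(protein) - k + 1))

protein = translate("ATGGCTTTTGGCTGA")
print(protein, amino_kgrams(protein, 2))
```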
Copyright 2003 limsoon wong
Signal Generation:
Amino-Acid Features
Copyright 2003 limsoon wong
Signal Generation:
Amino-Acid Features
Copyright 2003 limsoon wong
Signal Selection:
Amino Acid K-grams Discovered
Copyright 2003 limsoon wong
Translation Initiation Site Recognition:
Results (based on amino acid features)
Performance based on amino-acid features is better than
performance based on DNA sequence features.
Copyright 2003 limsoon wong
Acknowledgements
• Huiqing Liu
• Jinyan Li
• Roland Yap
• Zeng Fanfan
• A.G. Pedersen
• H. Nielsen
Copyright 2003 limsoon wong
To give this lecture to SMA students.
Date: 28 Oct 2003
Time: 10-11.30am
Venue: Video Conference Room, S15-04-30
Copyright 2003 limsoon wong