sma-talk-oct03 - NUS School of Computing
From Informatics to Bioinformatics:
The Knowledge Discovery
Perspective
Limsoon Wong
Institute for Infocomm Research
Singapore
Copyright 2003 limsoon wong
Plan
• Overview of recent knowledge discovery
successes in bioinformatics
• Risk assignment of childhood ALL
patients to optimize risk-benefit ratio of
therapy
• Recognition of translation initiation sites
from DNA sequences
overview of recent knowledge
discovery successes in
bioinformatics
What is Datamining?
Jonathan’s blocks
Jessica’s blocks
Whose block
is this?
Jonathan’s rules: Blue or Circle
Jessica’s rules: All the rest
What is Datamining?
Question: Can you explain how?
What is Bioinformatics?
Bioinformatics brings benefits
To the patient:
Better drug, better treatment
To the pharma:
Save time, save cost, make more $
To the scientist:
Better science
To figure these out,
we bet on...
“solution” =
Data Mgmt + Knowledge Discovery
Data Mgmt =
Integration + Transformation + Cleansing
Knowledge Discovery =
Statistics + Algorithms + Databases
History
8 years of bioinformatics R&D in Singapore
[Timeline figure: 1994, 1996, 1998, 2000, 2002; institutes ISS, KRDL, LIT/I2R]
• Integration Technology (Kleisli)
• MHC-Peptide Binding (PREDICT)
• Protein Interactions Extraction (PIES)
• Gene Expression & Medical Record Datamining (PCL)
• Cleansing & Warehousing (FIMM)
• Molecular Connections
• Gene Feature Recognition (Dragon)
• Venom Informatics
• GeneticXchange
• Biobase
Predict Epitopes,
Find Vaccine Targets
• Vaccines are often the
only solution for viral
diseases
• Finding & developing
effective vaccine targets
(epitopes) is a slow and
expensive process
Recognize Functional Sites,
Help Scientists
• Effective recognition of
initiation, control, and
termination of biological
processes is crucial to
speeding up and focusing
scientific experiments
• Data mining of bio seqs to
find rules for recognizing &
understanding functional
sites
Dragon’s 10x
reduction of
TSS recognition
false positives
Diagnose Leukaemia,
Benefit Children
• Childhood leukaemia is a
heterogeneous disease
• Treatment is based on subtype
• 3 different tests and 4 different
experts are needed for
diagnosis
Curable in USA,
fatal in Indonesia
Understand Proteins,
Fight Diseases
• Understanding function and role
of protein needs organised info
on interaction pathways
• Such info is often reported in
scientific papers but is seldom
found in structured databases
• Knowledge extraction
system to process free text
• extract protein names
• extract interactions
risk assignment of
childhood ALL patients to optimize
risk-benefit ratio of therapy
Childhood ALL
Heterogeneous Disease
• Major subtypes are
– T-ALL
– E2A-PBX1
– TEL-AML1
– MLL genome rearrangements
– Hyperdiploid>50
– BCR-ABL
Childhood ALL
Treatment Failure
• Overly intensive treatment leads to
– Development of secondary cancers
– Reduction of IQ
• Insufficiently intensive treatment leads to
– Relapse
Childhood ALL
Risk-Stratified Therapy
• Different subtypes respond differently to
the same treatment intensity
Generally good-risk,
lower intensity
TEL-AML1,
Hyperdiploid>50
T-ALL
Generally high-risk,
higher intensity
E2A-PBX1
BCR-ABL,
MLL
Match patient to optimum treatment
intensity for his subtype & prognosis
Childhood ALL
Risk Assignment
• The major subtypes look similar
• Conventional diagnosis requires
– Immunophenotyping
– Cytogenetics
– Molecular diagnostics
Mission
• Conventional risk assignment procedure
requires difficult expensive tests and
collective judgement of multiple
specialists
• Generally available only in major
advanced hospitals
Can we have a single-test easy-to-use
platform instead?
Single-Test Platform of
Microarray & Machine Learning
Overall Strategy
Diagnosis of subtype → Subtype-dependent prognosis → Risk-stratified treatment intensity
• For each subtype, select genes to develop classification
model for diagnosing that subtype
• For each subtype, select genes to develop prediction
model for prognosis of that subtype
Childhood ALL
Subtype Diagnosis by PCL
• Gene expression data collection
• Gene selection by χ²
• Classifier training by emerging pattern
• Classifier tuning (optional for some
machine learning methods)
• Apply classifier for diagnosis of future
cases by PCL
Childhood ALL Subtype Diagnosis
Our Workflow
A tree-structured
diagnostic
workflow was
recommended by
our doctor
collaborator
Childhood ALL Subtype Diagnosis
Training and Testing Sets
Childhood ALL Subtype Diagnosis
Signal Selection Basic Idea
• Choose a signal w/ low intra-class distance
• Choose a signal w/ high inter-class distance
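One way to turn these two criteria into a single number is a t-statistic-like score: the gap between the class means (inter-class distance) divided by the within-class spread (intra-class distance). The sketch below is an illustration under that assumption, not necessarily the scoring used in the talk; `separation_score` and the toy expression values are hypothetical.

```python
from math import sqrt
from statistics import mean, pvariance


def separation_score(class_a, class_b):
    """Gap between class means divided by the combined within-class spread."""
    gap = abs(mean(class_a) - mean(class_b))
    spread = sqrt(pvariance(class_a) / len(class_a)
                  + pvariance(class_b) / len(class_b))
    return gap / spread if spread > 0 else float("inf")


# Made-up expression values of two genes, each measured in two classes.
good_gene = ([1.0, 1.1, 0.9, 1.0], [3.0, 3.1, 2.9, 3.0])  # tight and far apart
poor_gene = ([1.0, 3.0, 0.5, 2.5], [1.2, 2.8, 0.7, 2.9])  # scattered and overlapping

scores = {
    "good_gene": separation_score(*good_gene),
    "poor_gene": separation_score(*poor_gene),
}
```

A signal with a high score satisfies both bullets at once; ranking all genes by such a score and keeping the top few is one simple selection scheme.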
Childhood ALL Subtype Diagnosis
Signal Selection by χ²
Childhood ALL Subtype Diagnosis
Emerging Patterns
• An emerging pattern is a set of conditions
– usually involving several features
– that most members of a class satisfy
– but none or few of the other class satisfy
• A jumping emerging pattern is an emerging
pattern that
– some members of a class satisfy
– but no members of the other class satisfy
• We use only jumping emerging patterns
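The definitions above can be checked mechanically. In this minimal sketch each sample is represented as a set of discretised conditions such as "g1=high"; that representation, the function names, and the toy data are assumptions made for illustration.

```python
def support(pattern, samples):
    """Fraction of samples that satisfy every condition in the pattern."""
    if not samples:
        return 0.0
    return sum(pattern <= sample for sample in samples) / len(samples)


def is_jumping_emerging_pattern(pattern, home_class, other_class):
    """True if some home-class samples match and no other-class sample does."""
    return support(pattern, home_class) > 0 and support(pattern, other_class) == 0


# Hypothetical discretised gene-expression samples.
class_a = [
    {"g1=high", "g2=low"},
    {"g1=high", "g2=low", "g3=high"},
    {"g1=low", "g2=low"},
]
class_b = [
    {"g1=high", "g2=high"},
    {"g1=low", "g2=low"},
]

jumping = frozenset({"g1=high", "g2=low"})   # seen in class_a only
not_jumping = frozenset({"g2=low"})          # also seen in class_b
```

Here `jumping` has support 2/3 in `class_a` and 0 in `class_b`, so it qualifies; `not_jumping` matches a `class_b` sample and does not.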
Childhood ALL Subtype Diagnosis
PCL: Prediction by Collective Likelihood
Childhood ALL Subtype Diagnosis
Accuracy of PCL (vs. other classifiers)
The classifiers are all applied to the 20 genes selected
by χ² at each level of the tree
Multidimensional Scaling Plot
Subtype Diagnosis
Multidimensional Scaling Plot
Subtype-Dependent Prognosis
• Similar computational
analysis was carried
out to predict relapse
and/or secondary
AML in a subtype-specific manner
• >97% accuracy
achieved
Childhood ALL
Is there a new subtype?
• Hierarchical
clustering of gene
expression
profiles reveals a
novel subtype of
childhood ALL
Childhood ALL
Cure Rates in ASEAN Countries
• Conventional risk assignment procedure requires
difficult expensive tests and collective
judgement of multiple specialists
Not available in less advanced ASEAN countries
[Bar chart: cure rate (axis 0% to 80%) for Cambodia, Vietnam, Thailand, Philippines, Indonesia, Malaysia, Singapore]
Childhood ALL
Treatment Cost
• Treatment for childhood ALL over 2 yrs
– Intermediate intensity: US$60k
– Low intensity: US$36k
– High intensity: US$72k
• Treatment for relapse: US$150k
• Cost for side-effects: Unquantified
Childhood ALL in ASEAN Countries
Current Situation (2000 new cases/yr)
• Intermediate intensity
conventionally applied
in less advanced
ASEAN countries
Over intensive for 50%
of patients, thus more
side effects
Under intensive for
10% of patients, thus
more relapse
5-20% cure rates
• US$120m (US$60k *
2000) for intermediate
intensity treatment
• US$30m (US$150k *
2000 * 10%) for relapse
treatment
• Total US$150m/yr
plus un-quantified
costs for dealing with
side effects
Childhood ALL in ASEAN Countries
Using Our Platform (2000 new cases/yr)
• Low intensity applied
to 50% of patients
• Intermediate intensity
to 40% of patients
• High intensity to 10%
of patients
Reduced side effects
Reduced relapse
75-80% cure rates
• US$36m (US$36k * 2000
* 50%) for low intensity
• US$48m (US$60k * 2000
* 40%) for intermediate
intensity
• US$14.4m (US$72k *
2000 * 10%) for high
intensity
• Total US$98.4m/yr
Save US$51.6m/yr
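The cost comparison can be reproduced with a few lines of arithmetic. The sketch below uses only the figures quoted on the cost slides (2000 new cases per year, per-patient treatment costs, 10% relapse under the current regime); the variable names are illustrative. As on the slide, the platform scenario carries no relapse-treatment term.

```python
NEW_CASES = 2000                      # new cases per year
COST = {"low": 36_000, "intermediate": 60_000, "high": 72_000}  # US$ per patient
RELAPSE_COST = 150_000                # US$ per relapsed patient


def patients(percent):
    """Number of patients making up the given percentage of new cases."""
    return NEW_CASES * percent // 100


# Current situation: everyone gets intermediate intensity, 10% relapse.
current_total = (patients(100) * COST["intermediate"]
                 + patients(10) * RELAPSE_COST)

# Using the platform: intensity matched to risk group.
platform_total = (patients(50) * COST["low"]
                  + patients(40) * COST["intermediate"]
                  + patients(10) * COST["high"])

saving = current_total - platform_total
print(f"current  US${current_total / 1e6:.1f}m/yr")
print(f"platform US${platform_total / 1e6:.1f}m/yr")
print(f"saving   US${saving / 1e6:.1f}m/yr")
```

The totals come out at US$150.0m, US$98.4m, and US$51.6m per year, matching the slides.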
Acknowledgements
recognition of translation initiation
sites from DNA sequences
Translation Initiation Site
A Sample mRNA
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
............................................................
................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
What makes the second ATG the translation
initiation site?
Translation Initiation Site Recognition:
Steps of a General Approach
• Training data gathering
• Signal generation
k-grams, colour, texture, domain know-how, ...
• Signal selection
Entropy, χ², CFS, t-test, domain know-how, ...
• Signal integration
SVM, ANN, PCL, CART, C4.5, kNN, ...
Translation Initiation Site Recognition:
Training & Testing Data
• Vertebrate dataset of Pedersen & Nielsen
[ISMB’97]
• 3312 sequences
• 13503 ATG sites
• 3312 (24.5%) are TIS
• 10191 (75.5%) are non-TIS
• Use for 3-fold x-validation expts
Translation Initiation Site Recognition:
Signal Generation
• K-grams (i.e., k consecutive letters)
– K = 1, 2, 3, 4, 5, …
– Window size vs. fixed position
– Upstream, downstream vs. anywhere in window
– In-frame vs. any frame
[Chart: values (axis 0 to 3) for A, C, G, T in seq1, seq2, seq3]
Signal Generation:
An Example
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
• Window = 100 bases
• In-frame, downstream
– GCT = 1, TTT = 1, ATG = 1…
• Any-frame, downstream
– GCT = 3, TTT = 2, ATG = 2…
• In-frame, upstream
– GCT = 2, TTT = 0, ATG = 0, ...
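The counting scheme in this example can be sketched as follows. The function name, its parameters, and the short toy sequence are hypothetical; the toy sequence is not the HSU27655 record, so its counts differ from those on the slide. Positions are 0-based, "in-frame" means starting a multiple of 3 away from the candidate ATG, and the downstream window is taken to start right after the ATG.

```python
from collections import Counter


def kgram_counts(seq, atg_pos, k=3, window=100, direction="down", in_frame=True):
    """Count k-grams in a window up- or downstream of the ATG at atg_pos."""
    if direction == "down":
        start = atg_pos + 3
        end = min(len(seq), start + window)
    else:
        start = max(0, atg_pos - window)
        end = atg_pos
    if in_frame:
        # Shift the start so every k-gram begins on a codon boundary
        # relative to the candidate ATG.
        start += (atg_pos - start) % 3
    step = 3 if in_frame else 1
    return Counter(seq[i:i + k] for i in range(start, end - k + 1, step))


# Toy sequence: 8 upstream bases, the candidate ATG, 15 downstream bases.
toy = "CCGCTGCT" + "ATG" + "GCTTTTATGGCTTGA"
atg = 8

down_in = kgram_counts(toy, atg, direction="down", in_frame=True)
down_any = kgram_counts(toy, atg, direction="down", in_frame=False)
up_in = kgram_counts(toy, atg, direction="up", in_frame=True)
```

For the toy sequence, in-frame downstream gives GCT = 2, TTT = 1, ATG = 1; any-frame downstream raises TTT to 2; in-frame upstream gives GCT = 2 and ATG = 0.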
Signal Generation:
Too Many Signals
• For each value of k, there are
4^k * 3 * 2 k-grams
• If we use k = 1, 2, 3, 4, 5, we have
4 + 24 + 96 + 384 + 1536 + 6144 = 8188
features!
• This is too many for most machine
learning algorithms
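The growth in the number of signals is easy to tabulate. This sketch reads the per-k count as 4 to the power k, times 3, times 2, and simply evaluates it for k = 1 to 5; the function name is illustrative.

```python
def num_kgram_signals(k):
    """Number of k-gram signals for one k: 4**k strings times 3 times 2 variants."""
    return 4 ** k * 3 * 2


per_k = {k: num_kgram_signals(k) for k in range(1, 6)}
total = sum(per_k.values())

for k, count in per_k.items():
    print(f"k = {k}: {count}")
print(f"total: {total}")
```

Evaluated this way the five terms are 24, 96, 384, 1536, and 6144, which sum to 8184. The sum quoted on the slide has one further leading term of 4, which accounts for its total of 8188.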
Translation Initiation Site Recognition:
Signal Selection (e.g., χ²)
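As a rough illustration of chi-square scoring of a single signal, the sketch below computes the Pearson statistic for a contingency table of signal value (rows) against class (columns). The two toy tables are made up, and how the talk discretised its signals is not shown here.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


# Rows: signal present / absent. Columns: TIS / non-TIS. Counts are made up.
informative = [[40, 10],
               [10, 40]]
uninformative = [[25, 25],
                 [25, 25]]
```

A signal whose presence is independent of the class scores 0; the more the observed counts depart from what independence predicts, the higher the score, so signals can be ranked by it.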
Translation Initiation Site Recognition:
Signal Selection (e.g., CFS)
• Instead of scoring individual signals,
how about scoring a group of signals as
a whole?
• CFS
– Correlation-based Feature Selection
– A good group contains signals that are
highly correlated with the class, and yet
uncorrelated with each other
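The "good group" idea is commonly scored with the merit heuristic from Hall's CFS: for k signals, merit = k * r_cf / sqrt(k + k(k-1) * r_ff), where r_cf is the mean signal-class correlation and r_ff the mean signal-signal correlation. The sketch below assumes the correlations are already computed; the input values are made up.

```python
from math import sqrt


def cfs_merit(class_corrs, mean_inter_corr):
    """CFS merit of a group of signals.

    class_corrs: correlation of each signal with the class.
    mean_inter_corr: mean pairwise correlation among the signals.
    """
    k = len(class_corrs)
    r_cf = sum(class_corrs) / k
    return k * r_cf / sqrt(k + k * (k - 1) * mean_inter_corr)


# Three signals, equally predictive of the class (made-up values).
independent_group = cfs_merit([0.6, 0.6, 0.6], mean_inter_corr=0.0)
redundant_group = cfs_merit([0.6, 0.6, 0.6], mean_inter_corr=1.0)
```

With fully redundant signals the merit falls back to that of a single signal (0.6), while uncorrelated signals of the same individual quality score higher, which is the behaviour the bullets describe.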
Signal Selection:
Sample k-grams Selected
• Position –3 (Kozak consensus)
• in-frame upstream ATG (leaky scanning)
• in-frame downstream
– TAA, TAG, TGA (stop codon)
– CTG, GAC, GAG, and GCC (codon bias)
Translation Initiation Site Recognition:
Signal Integration
• kNN
Given a test sample, find the k training
samples that are most similar to it. Let the
majority class win.
• SVM
Given a group of training samples from two
classes, determine a separating plane that
maximises the margin between the classes.
• Naïve Bayes, ANN, C4.5, PCL, ...
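The kNN rule described above fits in a few lines. This is a minimal sketch assuming numeric feature vectors and Euclidean distance as the similarity measure; the training points and labels are made up.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Majority label among the k training samples nearest to the query.

    train: list of (feature_vector, label) pairs.
    """
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up two-feature training samples.
train = [
    ((0.0, 0.0), "non-TIS"), ((0.0, 1.0), "non-TIS"), ((1.0, 0.0), "non-TIS"),
    ((5.0, 5.0), "TIS"), ((5.0, 6.0), "TIS"), ((6.0, 5.0), "TIS"),
]

near_tis = knn_predict(train, (5.2, 5.1))
near_non_tis = knn_predict(train, (0.2, 0.1))
```

Using an odd k for a two-class problem avoids tied votes.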
Translation Initiation Site Recognition:
Results (on Pedersen & Nielsen’s mRNA)
                 TP/(TP + FN)   TN/(TN + FP)   TP/(TP + FP)   Accuracy
Naïve Bayes      84.3%          86.1%          66.3%          85.7%
SVM              73.9%          93.2%          77.9%          88.5%
Neural Network   77.6%          93.2%          78.8%          89.4%
Decision Tree    74.0%          94.4%          81.1%          89.4%
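The four measures reported here are sensitivity TP/(TP + FN), specificity TN/(TN + FP), precision TP/(TP + FP), and accuracy. A small helper shows how each follows from the confusion counts; the counts used below are made up and are not taken from the experiments.

```python
def evaluate(tp, fn, tn, fp):
    """Compute the four reported measures from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),   # TP/(TP + FN)
        "specificity": tn / (tn + fp),   # TN/(TN + FP)
        "precision": tp / (tp + fp),     # TP/(TP + FP)
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Illustrative counts only.
metrics = evaluate(tp=80, fn=20, tn=170, fp=30)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Because non-TIS sites outnumber TIS sites about three to one in this dataset, accuracy alone can look good while sensitivity or precision is poor, which is why all four are worth reporting.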
Translation Initiation Site Recognition:
mRNA → protein
How about using k-grams
from the translation?
[Figure: codon-to-amino-acid translation chart]
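To make the idea concrete, the sketch below translates a sequence codon by codon with the standard genetic code and then counts amino-acid k-grams. The function names and the short example sequence are hypothetical; sequences are written with T rather than U, as in the sample record earlier in the talk.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' is a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, start=0):
    """Translate codon by codon from start; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(start, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def amino_kgrams(protein, k=2):
    """Count k consecutive amino-acid letters in a translated sequence."""
    return Counter(protein[i:i + k] for i in range(len(protein) - k + 1))


protein = translate("ATGGCTTTTTGA")    # ATG GCT TTT TGA
pairs = amino_kgrams("MAFMA", k=2)
```

Translating first collapses synonymous codons into one letter, so an amino-acid k-gram summarises several DNA k-grams at once.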
Signal Generation:
Amino-Acid Features
Signal Selection:
Amino Acid K-grams Discovered
Translation Initiation Site Recognition:
Results (based on amino acid features)
Performance based on amino-acid features
is better than performance based on DNA seq. features.
Acknowledgements
• Huiqing Liu
• Jinyan Li
• Roland Yap
• Zeng Fanfan
• A.G. Pedersen
• H. Nielsen
To give this lecture to SMA students.
Date: 28 Oct 2003
Time: 10-11.30am
Venue: Video Conference Room, S15-04-30