talk in bioitworld2002
From Datamining
to Bioinformatics
Limsoon Wong
Laboratories for Information Technology
Singapore
What is Bioinformatics?
Themes of Bioinformatics
Bioinformatics = Data Mgmt + Knowledge Discovery
Data Mgmt = Integration + Transformation + Cleansing
Knowledge Discovery = Statistics + Algorithms + Databases
Benefits of Bioinformatics
To the patient:
Better drug, better treatment
To the pharma:
Save time, save cost, make more $
To the scientist:
Better science
From Informatics to Bioinformatics
8 years of bioinformatics R&D in Singapore
(Figure: timeline marked 1994, 1996, 1998, 2000, 2002, spanning ISS, KRDL, and LIT. Projects shown: Integration Technology (Kleisli); MHC-Peptide Binding (PREDICT); Protein Interactions Extraction (PIES); Gene Expression & Medical Record Datamining (PCL); Cleansing & Warehousing (FIMM); Gene Feature Recognition (Dragon); Venom Informatics.)
Quick Samplings
Epitope Prediction
TRAP-559AA
MNHLGNVKYLVIVFLIFFDLFLVNGRDVQNNIVDEIKYSE
EVCNDQVDLYLLMDCSGSIRRHNWVNHAVPLAMKLIQQLN
LNDNAIHLYVNVFSNNAKEIIRLHSDASKNKEKALIIIRS
LLSTNLPYGRTNLTDALLQVRKHLNDRINRENANQLVVIL
TDGIPDSIQDSLKESRKLSDRGVKIAVFGIGQGINVAFNR
FLVGCHPSDGKCNLYADSAWENVKNVIGPFMKAVCVEVEK
TASCGVWDEWSPCSVTCGKGTRSRKREILHEGCTSEIQEQ
CEEERCPPKWEPLDVPDEPEDDQPRPRGDNSSVQKPEENI
IDNNPQEPSPNPEEGKDENPNGFDLDENPENPPNPDIPEQ
KPNIPEDSEKEVPSDVPKNPEDDREENFDIPKKPENKHDN
QNNLPNDKSDRNIPYSPLPPKVLDNERKQSDPQSQDNNGN
RHVPNSEDRETRPHGRNNENRSYNRKYNDTPKHPEREEHE
KPDNNKKKGESDNKYKIAGGIAGGLALLACAGLAYKFVVP
GAATPYAGEPAPFDETLGEEDKDLDEPEQFRLPEENEWN
Epitope Prediction Results
Prediction by our ANN model for HLA-A11
29 predictions
22 epitopes
76% specificity
Prediction by BIMAS matrix for HLA-A*1101
(Figure: number of experimental binders by BIMAS rank, with the rank axis marked at 1, 66, and 100. Binder counts shown: 19 (52.8%), 5 (13.9%), and 12 (33.3%).)
Transcription Start Prediction
Transcription Start Prediction Results
Medical Record Analysis
age  sex  chol  ecg   heart  sick
49   M    266   Hyp   171    N
64   M    211   Norm  144    N
58   F    283   Hyp   162    N
58   M    284   Hyp   160    Y
58   M    224   Abn   173    Y
Looking for patterns that are
valid
novel
useful
understandable
Gene Expression Analysis
Classifying gene expression profiles
find stable differentially expressed genes
find significant gene groups
derive coordinated gene expression
Medical Record & Gene Expression Analysis Results
PCL, a novel “emerging pattern” method
Beats C4.5, CBA, LB, NB, TAN in 21 out of 32 UCI benchmarks
Works well for gene expressions
Cancer Cell, March 2002, 1(2)
Behind the Scene
Vladimir Bajic
Vladimir Brusic
Jinyan Li
See-Kiong Ng
Limsoon Wong
Louxin Zhang
Allen Chong
Judice Koh
SPT Krishnan
Huiqing Liu
Seng Hong Seah
Soon Heng Tan
Guanglan Zhang
Zhuo Zhang
and many more:
students, folks from geneticXchange,
MolecularConnections, and other collaborators….
Questions?
A More Detailed Account
What is Datamining?
Jonathan’s blocks
Jessica’s blocks
Whose block is this?
Jonathan’s rules: Blue or Circle
Jessica’s rules: All the rest
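Written as code, the two rules above amount to a single test. This is only an illustrative sketch: the `colour` and `shape` attribute names, the function name `owner`, and the sample blocks are assumptions, not taken from the slides.

```python
def owner(colour: str, shape: str) -> str:
    """Apply the slide's rules: blue or circle -> Jonathan, anything else -> Jessica."""
    if colour == "blue" or shape == "circle":
        return "Jonathan"
    return "Jessica"

# Hypothetical blocks, for illustration only.
for block in [("blue", "square"), ("red", "circle"), ("red", "square")]:
    print(block, "->", owner(*block))
```

Datamining is the reverse problem: given only the labelled blocks, recover a rule like this one.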
What is Datamining?
Question: Can you explain how?
The Steps of Data Mining
Training data gathering
Signal generation
k-grams, colour, texture, domain know-how, ...
Signal selection
Entropy, χ², CFS, t-test, domain know-how, ...
Signal integration
SVM, ANN, PCL, CART, C4.5, kNN, ...
Translation Initiation
Recognition
A Sample cDNA
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
............................................................
................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
What makes the second ATG the translation
initiation site?
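Before asking what distinguishes the true site, it helps to list the candidates. The sketch below scans the first two lines of the sample cDNA above for every ATG and reports each position together with its reading frame relative to the annotated initiation site (the `i` in the annotation track). The function name `find_atgs` is an assumption made for illustration.

```python
# First two lines (120 nt) of the sample cDNA shown above.
CDNA = (
    "CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG"
    "CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA"
)

def find_atgs(seq: str) -> list[int]:
    """Return the 0-based start index of every ATG in seq."""
    return [i for i in range(len(seq) - 2) if seq[i:i + 3] == "ATG"]

candidates = find_atgs(CDNA)
tis = candidates[1]  # the second ATG is the annotated initiation site
for pos in candidates:
    frame = "in-frame" if (pos - tis) % 3 == 0 else "out-of-frame"
    print(f"ATG at position {pos + 1} ({frame} relative to the annotated site)")
```

Every ATG found this way is a candidate; the classifier's job is to tell the true site from the others.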
Signal Generation
k-grams (i.e., k consecutive letters)
k = 1, 2, 3, 4, 5, …
Window size vs. fixed position
Upstream, downstream vs. anywhere in window
In-frame vs. any frame
(Figure: chart comparing seq1, seq2, and seq3 across the letters A, C, G, and T; vertical axis from 0 to 3.)
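The options above can be sketched as a small feature generator. This is a minimal illustration, not the feature set used in the talk: the function names, the window of 33 codons, and the choice of two regions and two frame modes are all assumptions, and the toy sequence is made up.

```python
from collections import Counter

def kgram_counts(seq: str, k: int, in_frame: bool = False) -> Counter:
    """Count k-grams in seq; if in_frame, only at offsets that are multiples of 3."""
    step = 3 if in_frame else 1
    return Counter(seq[i:i + k] for i in range(0, len(seq) - k + 1, step))

def signals(cdna: str, atg: int, k: int, codons: int = 33) -> dict:
    """k-gram count features upstream and downstream of the candidate ATG at index atg."""
    n_up = min(codons, atg // 3)  # whole codons available upstream
    regions = {
        # Both segments start on a codon boundary relative to the candidate ATG.
        "upstream": cdna[atg - 3 * n_up:atg],
        "downstream": cdna[atg + 3:atg + 3 + 3 * codons],
    }
    features = {}
    for region, segment in regions.items():
        for frame, in_frame in (("any-frame", False), ("in-frame", True)):
            for gram, count in kgram_counts(segment, k, in_frame).items():
                features[(region, frame, gram)] = count
    return features

# Made-up toy sequence; the candidate ATG starts at index 8.
features = signals("GGGATGCCATGGCTTAAGG", atg=8, k=3)
print(features[("downstream", "in-frame", "TAA")])
print(("upstream", "in-frame", "ATG") in features)
```

Each key is one signal, so the number of signals grows with the number of regions, frame modes, and possible k-grams.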
Too Many Signals
For each value of k, there are
4^k * 3 * 2 k-grams
If we use k = 1, 2, 3, 4, 5, we have
24 + 96 + 384 + 1536 + 6144 = 8184
features!
This is too many for most machine learning
algorithms
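The growth is easy to check by evaluating the count for each k, reading the per-k figure above as 4^k × 3 × 2. The helper name `n_signals` is an assumption made for illustration.

```python
def n_signals(k: int) -> int:
    """Number of k-gram signals for one value of k: 4^k * 3 * 2."""
    return 4 ** k * 3 * 2

per_k = {k: n_signals(k) for k in range(1, 6)}
total = sum(per_k.values())
for k, n in per_k.items():
    print(f"k = {k}: {n} signals")
print("total:", total)
```

Each extra letter multiplies the count by four, so the largest k dominates the total.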
Signal Selection (Basic Idea)
Choose a signal w/ low intra-class distance
Choose a signal w/ high inter-class distance
Which of the following 3 signals is good?
Signal Selection (eg., t-statistics)
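One common form of the t-statistic for a signal compares its mean in the two classes against its spread within each class. The sketch below uses the unequal-variance (Welch) form, which is an assumption, since the transcript does not preserve the variant shown on the slide. The sample values are made up.

```python
from math import sqrt
from statistics import mean, variance

def t_statistic(class1: list[float], class2: list[float]) -> float:
    """Welch t-statistic of one signal's values in two classes."""
    m1, m2 = mean(class1), mean(class2)
    v1, v2 = variance(class1), variance(class2)  # sample variances
    return (m1 - m2) / sqrt(v1 / len(class1) + v2 / len(class2))

# Made-up values: class means far apart relative to the within-class spread.
print(t_statistic([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
```

A large absolute value marks a signal with high inter-class distance and low intra-class distance.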
Signal Selection (eg., MIT-correlation)
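"MIT correlation" usually refers to the signal-to-noise score used by Golub et al. for gene selection: the difference of the class means divided by the sum of the class standard deviations. A minimal sketch, assuming that definition and sample standard deviations; the function name and the data are made up.

```python
from statistics import mean, stdev

def mit_correlation(class1: list[float], class2: list[float]) -> float:
    """Signal-to-noise score: (mean1 - mean2) / (sd1 + sd2)."""
    return (mean(class1) - mean(class2)) / (stdev(class1) + stdev(class2))

# Made-up values for one signal in two classes.
print(mit_correlation([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
```

Unlike the t-statistic, this score does not shrink the denominator as the number of samples grows.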
Signal Selection (eg., χ²)
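For a discrete signal, the χ² statistic compares the observed class counts for each signal value with the counts expected if the signal were independent of the class. A minimal sketch, assuming the signal has already been discretised into a contingency table with no empty row or column; the tables are made up.

```python
def chi_square(table: list[list[float]]) -> float:
    """Pearson chi-square of a contingency table (rows: signal values, columns: classes)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Made-up tables: a signal that separates the classes, and one that does not.
print(chi_square([[10, 0], [0, 10]]))
print(chi_square([[5, 5], [5, 5]]))
```

The higher the statistic, the more strongly the signal's value depends on the class.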
Signal Selection (eg., CFS)
Instead of scoring individual signals, how
about scoring a group of signals as a whole?
CFS
A good group contains signals that are highly
correlated with the class, and yet uncorrelated
with each other
Homework: find a formula that captures the
key idea of CFS above
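One widely cited formulation is the merit heuristic from Hall's correlation-based feature selection, offered here as a pointer rather than as the answer the slide has in mind. For a group S of k signals, let \overline{r_{cf}} be the mean correlation between a signal and the class, and \overline{r_{ff}} the mean correlation between pairs of signals:

```latex
\mathrm{Merit}_S = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}}
```

The numerator rewards correlation with the class; the denominator penalises redundancy among the signals.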
Sample k-grams Selected
Position -3: Kozak consensus
in-frame upstream ATG: leaky scanning
in-frame downstream TAA, TAG, TGA: stop codon
in-frame downstream CTG, GAC, GAG, and GCC: codon bias
Signal Integration
kNN
Given a test sample, find the k training samples
that are most similar to it. Let the majority class
win.
SVM
Given a group of training samples from two
classes, determine a separating plane that
maximises the margin between the two classes.
Naïve Bayes, ANN, C4.5, ...
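The kNN description above translates almost directly into code. A minimal sketch, assuming Euclidean distance as the similarity measure (the slide only says "most similar"); the function name, feature vectors, and labels are made up.

```python
from collections import Counter
from math import dist

def knn_predict(train, test_point, k=3):
    """train: list of (feature_vector, label). Majority label among the k nearest."""
    nearest = sorted(train, key=lambda item: dist(item[0], test_point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up two-feature training samples.
train = [
    ((0.0, 0.0), "non-TIS"), ((0.1, 0.2), "non-TIS"), ((0.2, 0.1), "non-TIS"),
    ((1.0, 1.0), "TIS"), ((0.9, 1.1), "TIS"), ((1.1, 0.9), "TIS"),
]
print(knn_predict(train, (0.95, 1.0)))
print(knn_predict(train, (0.05, 0.1)))
```

An odd k avoids tied votes when there are two classes.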
Results (on Pedersen & Nielsen’s mRNA)
                TP/(TP + FN)  TN/(TN + FP)  TP/(TP + FP)  Accuracy
Naïve Bayes     84.3%         86.1%         66.3%         85.7%
SVM             73.9%         93.2%         77.9%         88.5%
Neural Network  77.6%         93.2%         78.8%         89.4%
Decision Tree   74.0%         94.4%         81.1%         89.4%
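The four measures reported above can be computed from a confusion matrix as follows. This is a sketch with hypothetical counts, not the counts behind the reported results, and it assumes "Accuracy" means the usual (TP + TN) / total, which the slide does not spell out.

```python
def evaluate(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Standard measures from true/false positive/negative counts."""
    return {
        "sensitivity TP/(TP+FN)": tp / (tp + fn),
        "specificity TN/(TN+FP)": tn / (tn + fp),
        "precision TP/(TP+FP)": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }

# Hypothetical counts, chosen only to illustrate the formulas.
scores = evaluate(tp=80, fn=20, tn=270, fp=30)
for name, value in scores.items():
    print(f"{name}: {value:.1%}")
```

When negatives far outnumber positives, accuracy is dominated by the specificity term, so the three other columns are needed to compare methods.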
Acknowledgements
Roland Yap
Zeng Fanfan
A.G. Pedersen
H. Nielsen
Questions?
Common Mistakes
Self-fulfilling Oracle
Consider this scenario
Given classes C1 and C2 w/ explicit signals
Apply χ² to C1 and C2 to select signals s1, s2, s3
Run 3-fold x-validation on C1 and C2 using s1,
s2, s3 and get accuracy of 90%
Is the accuracy really 90%?
What can be wrong with this?
Phil Long’s Experiment
Let there be classes C1 and C2 w/ 100000
features having randomly generated values
Use χ² to select 20 features
Run k-fold x-validation on C1 and C2 w/
these 20 features
Expect: 50% accuracy
Get: 90% accuracy!
Lesson: choose features at each fold
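The effect can be reproduced on a small scale. The sketch below is a scaled-down stand-in under stated assumptions, not the original experiment: it uses 2000 random features instead of 100000 to keep the run short, ranks features by the absolute difference of class means instead of χ², and classifies with a nearest-centroid rule. All function names are made up.

```python
import random

def top_features(data, labels, rows, n_keep):
    """Rank features by |difference of class means| over the given rows; keep the best."""
    def score(j):
        a = [data[i][j] for i in rows if labels[i] == 0]
        b = [data[i][j] for i in rows if labels[i] == 1]
        return abs(sum(a) / len(a) - sum(b) / len(b))
    return sorted(range(len(data[0])), key=score, reverse=True)[:n_keep]

def centroid_predict(data, labels, train, test_row, feats):
    """Nearest-centroid classifier restricted to the chosen features."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        rows = [i for i in train if labels[i] == label]
        d = sum((data[test_row][j] - sum(data[i][j] for i in rows) / len(rows)) ** 2
                for j in feats)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label

def cv_accuracy(data, labels, folds, n_keep, select_per_fold):
    """k-fold cross-validation, selecting features per fold or once on all samples."""
    rows = list(range(len(data)))
    fixed = top_features(data, labels, rows, n_keep)  # has seen every test sample
    correct = 0
    for f in range(folds):
        test = rows[f::folds]
        train = [i for i in rows if i not in test]
        feats = top_features(data, labels, train, n_keep) if select_per_fold else fixed
        correct += sum(centroid_predict(data, labels, train, i, feats) == labels[i]
                       for i in test)
    return correct / len(rows)

random.seed(0)
n_samples, n_features = 40, 2000
labels = [i % 2 for i in range(n_samples)]  # two balanced classes
data = [[random.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]

leaky = cv_accuracy(data, labels, folds=5, n_keep=20, select_per_fold=False)
honest = cv_accuracy(data, labels, folds=5, n_keep=20, select_per_fold=True)
print(f"selection on all data: {leaky:.2f}, selection per fold: {honest:.2f}")
```

Because the features are pure noise, any accuracy well above chance in the first figure comes from the selection step having already seen the test samples.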
Apples vs Oranges
Consider this scenario:
Fanfan reported 89% accuracy on his TIS
prediction method
Hatzigeorgiou reported 94% accuracy on her
TIS prediction method
So Hatzigeorgiou’s method is better
What is wrong with this conclusion?
Apples vs Oranges
Differences in datasets used:
Fanfan’s expt used Pedersen’s dataset
Hatzigeorgiou’s used her own dataset
Differences in counting:
Fanfan’s expt was on a per ATG basis
Hatzigeorgiou’s expt used the scanning rule
and thus was on a per cDNA basis
When Fanfan ran his method on the same dataset and
counted the same way as Hatzigeorgiou, he also got
94%!
Questions?