Data Mining in Genomics: the dawn of personalized medicine
Gregory Piatetsky-Shapiro
KDnuggets
www.KDnuggets.com/gps.html
Connecticut College, October 15, 2003
Overview
Data Mining and Knowledge Discovery
Genomics and Microarrays
Microarray Data Mining
© 2003 KDnuggets
Trends leading to Data Flood
More data is generated:
  Bank, telecom, other business transactions ...
  Scientific data: astronomy, biology, etc.
  Web, text, and e-commerce
More data is captured:
  Storage technology faster and cheaper
  DBMS capable of handling bigger DB
Knowledge Discovery Process
[Figure: knowledge discovery process diagram with labels Raw Data, Integration, Data Warehouse, Target Data, Transformed Data, Patterns and Rules, Interpretation & Evaluation, Understanding, Knowledge]
Major Data Mining Tasks
Classification: predicting an item class
Clustering: finding clusters in data
Associations: e.g. A & B & C occur frequently
Visualization: to facilitate human discovery
Summarization: describing a group
Estimation: predicting a continuous value
Deviation Detection: finding changes
Link Analysis: finding relationships
Major Application Areas for Data Mining Solutions
Advertising
Bioinformatics
Customer Relationship Management (CRM)
Database Marketing
Fraud Detection
eCommerce
Health Care
Investment/Securities
Manufacturing, Process Control
Sports and Entertainment
Telecommunications
Web
Genome, DNA & Gene Expression
An organism’s genome is the “program” for making the organism, encoded in DNA
Human DNA has about 30,000-35,000 genes
A gene is a segment of DNA that specifies how to make a protein
Cells are different because of differential gene expression
About 40% of human genes are expressed at one time
Microarray devices measure gene expression
Molecular Biology Overview
[Figure: diagram with labels Cell, Nucleus, Chromosome, Gene (DNA), Gene (mRNA), single strand, Gene expression, Protein]
Graphics courtesy of the National Human Genome Research Institute
Affymetrix Microarrays
[Figure: microarray chip with dimension labels 1.28 cm and 50 µm]
~10^7 oligonucleotides: half Perfectly Match mRNA (PM), half have one Mismatch (MM)
Gene expression computed from PM and MM
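The slide does not say how expression is computed from PM and MM. A minimal sketch, assuming an average-difference summary (the mean of PM minus MM over a gene's probe pairs); the function name and the intensities are hypothetical:

```python
from statistics import mean

def average_difference(pm, mm):
    """Mean of (PM - MM) over the probe pairs of one gene.

    Assumption: a simple average-difference summary. The talk does not
    state which formula was actually used.
    """
    if len(pm) != len(mm) or not pm:
        raise ValueError("pm and mm must be non-empty and of equal length")
    return mean(p - m for p, m in zip(pm, mm))

# Hypothetical probe-pair intensities for one gene (not from the talk).
pm = [820, 640, 910, 700]
mm = [310, 300, 405, 295]
expression = average_difference(pm, mm)
```

Under a summary of this kind the value turns negative whenever the mismatch probes are brighter than the perfect-match probes, so negative expression values are possible.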
Affymetrix Microarray Raw Image
[Figure: scanner output with an enlarged section of the raw image]
Raw data:
Gene              Value
D26528_at         193
D26561_cds1_at    -70
D26561_cds2_at    144
D26561_cds3_at    33
D26579_at         318
D26598_at         1764
D26599_at         1537
D26600_at         1204
D28114_at         707
Microarray Potential Applications
New and better molecular diagnostics
New molecular targets for therapy
  few new drugs, large pipeline, …
Outcome depends on genetic signature
  best treatment?
Fundamental Biological Discovery
  finding and refining biological pathways
Personalized medicine ?!
Microarray Data Mining Challenges
Avoiding false positives, due to
  too few records (samples), usually < 100
  too many columns (genes), usually > 1,000
Model needs to be robust in presence of noise
For reliability need large gene sets; for diagnostics or drug targets, need small gene sets
Estimate class probability
Model needs to be explainable to biologists
False Positives in Astronomy
cartoon used with permission
CATs: Clementine Application Templates
CATs: examples of complete data mining processes
Microarray CAT: Preparation, MultiClass, 2-Class, Clustering
Key Ideas
Capture the complete process
Cross-validation loop with feature selection inside
Randomization to select significant genes
Internal iterative feature selection loop
For each class, separate selection of optimal gene sets
Neural nets: robust in presence of noise
Bagging of neural nets
Microarray Classification
[Figure: flow diagram with boxes Data, Train data, Test data, Feature and Parameter Selection, Model Building, Evaluation]
Classification: External X-val
[Figure: flow diagram with boxes Gene Data, Train, Final Test, Data, Train data, Test data, Feature and Parameter Selection, Model Building, Evaluation, Final Model, Final Results]
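The external cross-validation idea (gene selection repeated inside every fold, using training samples only) can be sketched as below. This is an illustration under stated assumptions, not the talk's Clementine stream: a nearest-centroid classifier stands in for the neural nets, genes are ranked by absolute Welch t-value, and all function names and the toy data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def t_value(a, b):
    """Welch t-statistic for the difference of two group means."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0

def top_genes(rows, labels, k):
    """Indices of the k genes with the largest |t| between class 1 and class 2."""
    scores = []
    for g in range(len(rows[0])):
        a = [r[g] for r, y in zip(rows, labels) if y == 1]
        b = [r[g] for r, y in zip(rows, labels) if y == 2]
        scores.append((abs(t_value(a, b)), g))
    return [g for _, g in sorted(scores, reverse=True)[:k]]

def nearest_centroid(rows, labels, genes):
    """Return predict(row), which picks the closest class centroid on the chosen genes."""
    cents = {}
    for c in set(labels):
        members = [r for r, y in zip(rows, labels) if y == c]
        cents[c] = [mean(r[g] for r in members) for g in genes]
    def predict(row):
        return min(cents, key=lambda c: sum((row[g] - m) ** 2
                                            for g, m in zip(genes, cents[c])))
    return predict

def external_cv_error(rows, labels, k_genes, n_folds):
    """Cross-validation error with gene selection redone inside every fold."""
    errors = 0
    for f in range(n_folds):
        test_idx = set(range(f, len(rows), n_folds))
        train = [i for i in range(len(rows)) if i not in test_idx]
        tr_rows = [rows[i] for i in train]
        tr_labels = [labels[i] for i in train]
        genes = top_genes(tr_rows, tr_labels, k_genes)  # training samples only
        predict = nearest_centroid(tr_rows, tr_labels, genes)
        errors += sum(predict(rows[i]) != labels[i] for i in test_idx)
    return errors / len(rows)

# Synthetic toy data: gene 0 separates the classes, genes 1-4 are filler.
labels = [1, 2] * 6
rows = [[(10.0 if y == 1 else 0.0) + 0.1 * (i % 3)] +
        [float((i * (g + 3)) % 7 - 3) for g in range(4)]
        for i, y in enumerate(labels)]
cv_error = external_cv_error(rows, labels, k_genes=1, n_folds=3)
```

Selecting genes on the full data set before splitting would let test samples influence the gene list and make the error estimate optimistic, which is why the selection step sits inside the loop.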
Measuring false positives with randomization
Gene    Class   Rand Class
178     1       2
105     1       1
4174    2       1
7133    2       2
Randomize 500 times
Bottom 1% T-value = -2.08
Select potentially interesting genes at 1%
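A minimal sketch of the randomization idea, assuming class labels are shuffled, t-values are recomputed for every gene, and the bottom and top 1% of the resulting null distribution serve as cutoffs. The use of both tails, the Welch t-statistic, the function names and the synthetic data are assumptions, not details from the talk.

```python
import random
from math import sqrt
from statistics import mean, variance

def t_value(values, labels):
    """Welch t-statistic comparing class 1 against class 2 for one gene."""
    a = [v for v, y in zip(values, labels) if y == 1]
    b = [v for v, y in zip(values, labels) if y == 2]
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0

def null_cutoffs(genes, labels, n_perm=500, tail=0.01, seed=0):
    """Bottom and top `tail` quantiles of t-values under shuffled class labels."""
    rng = random.Random(seed)
    shuffled = list(labels)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any real gene/class association
        null.extend(t_value(g, shuffled) for g in genes)
    null.sort()
    k = int(tail * len(null))
    return null[k], null[-k - 1]

# Synthetic data: 20 noise genes plus one gene with a planted class difference.
rng = random.Random(1)
labels = [1] * 10 + [2] * 10
genes = [[rng.gauss(0, 1) for _ in labels] for _ in range(20)]
genes.append([rng.gauss(10 if y == 1 else 0, 1) for y in labels])

low, high = null_cutoffs(genes, labels)
selected = [i for i, g in enumerate(genes)
            if not low <= t_value(g, labels) <= high]
```

Genes whose real t-value falls outside the shuffled-label cutoffs are the "potentially interesting" ones; the -2.08 on the slide is the bottom cutoff obtained on the talk's own data.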
Gene Reduction improves Classification
Most learning algorithms look for non-linear combinations of features, and can easily find many spurious combinations given a small # of records and a large # of genes
Classification accuracy improves if we first reduce the # of genes by a linear method, e.g. T-values of mean difference
Heuristic: select equal # genes from each class
Then apply a favorite machine learning algorithm
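The heuristic of taking an equal number of genes from each class can be sketched for the two-class case as follows: rank genes by t-value, then take the most positive ones (higher in class 1) and the most negative ones (higher in class 2). The gene names, the values and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def t_value(a, b):
    """Welch t-statistic for the difference of two group means."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0

def select_per_class(expr, labels, n_per_class):
    """expr maps gene name -> list of values aligned with labels (1 or 2).

    Returns (genes higher in class 1, genes higher in class 2),
    each list holding n_per_class names, strongest first.
    """
    ranked = sorted(expr, key=lambda g: t_value(
        [v for v, y in zip(expr[g], labels) if y == 1],
        [v for v, y in zip(expr[g], labels) if y == 2]))
    return ranked[-n_per_class:][::-1], ranked[:n_per_class]

# Hypothetical expression values for six samples, three per class.
labels = [1, 1, 1, 2, 2, 2]
expr = {
    "g_up":   [9, 10, 11, 1, 2, 3],
    "g_down": [1, 2, 3, 9, 10, 11],
    "g_flat": [5, 6, 4, 5, 4, 6],
    "g_weak": [5, 6, 7, 4, 5, 6],
}
class1_genes, class2_genes = select_per_class(expr, labels, n_per_class=1)
```

The reduced gene list would then be passed to whichever learning algorithm is preferred.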
Iterative Wrapper approach to selecting the best gene set
Test models using 1, 2, 3, …, 10, 20, 30, 40, ..., 100 top genes with cross-validation
Heuristic 1: evaluate errors from each class; select the number of genes from each class that minimizes error for that class
For randomized algorithms, average 10+ cross-validation runs!
Select gene set with lowest average error
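A minimal sketch of the final selection step, assuming error rates from repeated cross-validation runs have already been collected for each candidate gene-set size. The error figures are made up for illustration and are not the talk's results; the tie-breaking rule (prefer the smaller set) is also an assumption.

```python
from statistics import mean, stdev

# Candidate sizes as listed on the slide: 1..10, then 20, 30, ..., 100.
sizes = list(range(1, 10)) + list(range(10, 101, 10))

def best_gene_count(errors_by_size):
    """errors_by_size maps genes-per-class -> list of error rates, one per CV run.

    Returns the size with the lowest average error and a summary of
    (mean, standard deviation) per size. Ties go to the smaller gene set.
    """
    summary = {k: (mean(v), stdev(v)) for k, v in errors_by_size.items()}
    best = min(summary, key=lambda k: (summary[k][0], k))
    return best, summary

# Hypothetical error rates from three cross-validation runs per size.
runs = {
    1: [0.20, 0.24, 0.22],
    5: [0.08, 0.10, 0.06],
    10: [0.02, 0.00, 0.04],
    20: [0.04, 0.06, 0.05],
}
best, summary = best_gene_count(runs)
```

Averaging over several runs matters because a single cross-validation run of a randomized learner can rank the sizes differently from one run to the next.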
Clementine stream for subset selection by x-validation
Microarrays: ALL/AML Example
Leukemia: Acute Lymphoblastic (ALL) vs Acute Myeloid (AML), Golub et al., Science, v. 286, 1999
72 examples (38 train, 34 test), about 7,000 genes
Well-studied (CAMDA-2000), good test example
[Figure: ALL and AML sample images]
Visually similar, but genetically very different
Gene subset selection: one cross-validation run
[Figure: average error for 10-fold cross-validation (y-axis, 0% to 30%) vs. genes per class (x-axis: 1, 2, 3, 4, 5, 10, 20, 30, 40), single cross-validation run]
Gene subset selection: multiple cross-validation runs
For ALL/AML data, 10 genes per class had the lowest error (<1%)
[Figure: the point in the center is the average error from 10 cross-validation runs; bars indicate 1 standard deviation above and below]
ALL/AML: Results on the test data
Genes selected and model trained on the Train set ONLY!
Best net with 10 top genes per class (20 overall) was applied to the test data (34 samples):
  33 correct predictions (97% accuracy), 1 error on sample 66
  Actual class AML, net prediction: ALL
Other methods consistently misclassify sample 66. Misclassified by a pathologist?
Pediatric Brain Tumour Data
92 samples, 5 classes (MED, EPD, JPA, MGL, RHB) from U. of Chicago Children’s Hospital
Outer cross-validation with gene selection inside the loop
Ranking by absolute T-test value (selects top positive and negative genes)
Select best genes by adjusted error for each class
Bagging of 100 neural nets
Selecting Best Gene Set
Minimizing combined error for all classes is not optimal
[Figure: average, high and low error rate for all classes]
Error rates for each class
[Figure: error rate (y-axis) vs. genes per class (x-axis)]
Evaluating One Network
Averaged over 100 Networks:
Class   Error rate
MED     2.1%
MGL     17%
RHB     24%
EPD     9%
JPA     19%
*ALL*   8.3%
Bagging 100 Networks
Class   Individual Error Rate   Bag Error Rate   Bag Avg Conf
MED     2.1%                    2% (0)*          98%
MGL     17%                     10%              83%
RHB     24%                     11%              76%
EPD     9%                      0                91%
JPA     19%                     0                81%
*ALL*   8.3%                    3% (2)*          92%
* Note: suspected error on one sample (labeled as MED but consistently classified as RHB)
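A minimal sketch of the bagging step, assuming each network is trained on a bootstrap resample and the bag predicts by majority vote. The talk does not state how "Bag Avg Conf" was computed; treating it as the share of networks voting for the winning class is an assumption, and the votes below are hypothetical.

```python
import random
from collections import Counter

def bootstrap_indices(n, rng):
    """Sample n indices with replacement: the training set for one bagged model."""
    return [rng.randrange(n) for _ in range(n)]

def bag_vote(predictions):
    """Majority class across models and the share of models voting for it."""
    winner, votes = Counter(predictions).most_common(1)[0]
    return winner, votes / len(predictions)

# One bootstrap resample of 92 samples (the size of the brain tumour data set).
idx = bootstrap_indices(92, random.Random(0))

# Hypothetical votes of 100 networks on a single sample.
votes = ["MED"] * 83 + ["RHB"] * 17
label, confidence = bag_vote(votes)
```

A vote can be right even when a sizeable minority of networks is wrong, which is one way a bag error rate can fall below the individual error rate.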
AF1q: New Marker for Medulloblastoma?
AF1Q: ALL1-fused gene from chromosome 1q
Transmembrane protein
Related to leukemia (3 PubMed entries) but not to medulloblastoma
Future directions for Microarray Analysis
Algorithms optimized for small samples
Integration with other data
  biological networks
  medical text
  protein data
Cost-sensitive classification algorithms
  error cost depends on outcome (don’t want to miss treatable cancer), treatment side effects, etc.
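One way to make a classifier cost-sensitive, sketched here under assumptions, is to keep the class probabilities and choose the decision with the lowest expected cost. The cost values, class names and function name are hypothetical and only illustrate the asymmetry the slide describes.

```python
def min_cost_class(probs, cost):
    """Pick the prediction with the lowest expected cost.

    probs: class -> estimated probability
    cost[actual][predicted]: cost of predicting `predicted` when the truth is `actual`
    """
    def expected(pred):
        return sum(p * cost[actual][pred] for actual, p in probs.items())
    return min(probs, key=expected)

# Hypothetical costs: missing a treatable cancer is far worse than a false alarm.
cost = {
    "cancer":  {"cancer": 0, "healthy": 50},
    "healthy": {"cancer": 1, "healthy": 0},
}
probs = {"cancer": 0.1, "healthy": 0.9}
decision = min_cost_class(probs, cost)

# With symmetric costs the same probabilities give the more likely class.
symmetric = {
    "cancer":  {"cancer": 0, "healthy": 1},
    "healthy": {"cancer": 1, "healthy": 0},
}
plain_decision = min_cost_class(probs, symmetric)
```

This relies on the model estimating class probabilities, which the challenges slide already lists as a requirement.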
Acknowledgements
Eric Bremer, Children’s Hospital (Chicago) & Northwestern U.
Greg Cooper, U. Pittsburgh
Tom Khabaza, SPSS
Sridhar Ramaswamy, MIT/Whitehead Institute
Pablo Tamayo, MIT/Whitehead Institute
Thank you
Further resources on Data Mining:
www.KDnuggets.com
Microarrays:
www.KDnuggets.com/websites/microarray.html
Contact:
Gregory Piatetsky-Shapiro:
www.kdnuggets.com/gps.html