Data Mining in Genomics: the dawn of personalized medicine
Gregory Piatetsky-Shapiro
KDnuggets
www.KDnuggets.com/gps.html
Connecticut College, October 15, 2003

Overview
• Data Mining and Knowledge Discovery
• Genomics and Microarrays
• Microarray Data Mining
© 2003 KDnuggets

Trends leading to Data Flood
• More data is generated:
  • Bank, telecom, other business transactions ...
  • Scientific Data: astronomy, biology, etc.
  • Web, text, and e-commerce
• More data is captured:
  • Storage technology faster and cheaper
  • DBMS capable of handling bigger DB

Knowledge Discovery Process
[Figure: process diagram with the labels Raw Data, Integration, Data Warehouse, Target Data, Transformed Data, Patterns and Rules, Interpretation & Evaluation, Knowledge, Understanding]

Major Data Mining Tasks
• Classification: predicting an item class
• Clustering: finding clusters in data
• Associations: e.g. A & B & C occur frequently
• Visualization: to facilitate human discovery
• Summarization: describing a group
• Estimation: predicting a continuous value
• Deviation Detection: finding changes
• Link Analysis: finding relationships

Major Application Areas for Data Mining Solutions
• Advertising
• Bioinformatics
• Customer Relationship Management (CRM)
• Database Marketing
• Fraud Detection
• eCommerce
• Health Care
• Investment/Securities
• Manufacturing, Process Control
• Sports and Entertainment
• Telecommunications
• Web

Genome, DNA & Gene Expression
• An organism’s genome is the “program” for making the organism, encoded in DNA
  • Human DNA has about 30,000-35,000 genes
  • A gene is a segment of DNA that specifies how to make a protein
• Cells are different because of differential gene expression
  • About 40% of human genes are expressed at one time
• Microarray devices measure gene expression

Molecular Biology Overview
[Figure: diagram with the labels Cell, Nucleus, Chromosome, Gene (DNA), Gene (mRNA) single strand, Gene expression, Protein]
Graphics courtesy of the National Human Genome Research Institute

Affymetrix Microarrays
[Figure: Affymetrix chip; dimension labels 1.28 cm and 50 µm]
~10^7 oligonucleotides, half Perfectly Match mRNA (PM), half have one Mismatch (MM)
Gene expression computed from PM and MM
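
The slide says only that expression is computed from PM and MM. One scheme used by early Affymetrix software was an average of the PM minus MM differences over a gene's probe pairs. The sketch below assumes that scheme; the function name and all intensities are made up for illustration. It shows why an expression value can come out negative when the mismatch probes are brighter than the perfect-match probes.

```python
from statistics import mean

def average_difference(pm, mm):
    """Mean of PM - MM over the probe pairs of one gene (illustrative scheme)."""
    if len(pm) != len(mm):
        raise ValueError("each PM probe needs a matching MM probe")
    return mean(p - m for p, m in zip(pm, mm))

# Hypothetical intensities for two genes, four probe pairs each.
expressed = average_difference([900, 1100, 950, 1050], [200, 300, 250, 250])
absent = average_difference([100, 120, 90, 110], [150, 200, 160, 190])
print(expressed, absent)
```

A negative result, as for the second gene here, is usually read as "not detectably expressed" rather than as a real negative quantity.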

Affymetrix Microarray Raw Image
[Figure: scanner; enlarged section of raw image]
Raw data:
Gene              Value
D26528_at         193
D26561_cds1_at    -70
D26561_cds2_at    144
D26561_cds3_at    33
D26579_at         318
D26598_at         1764
D26599_at         1537
D26600_at         1204
D28114_at         707

Microarray Potential Applications
• New and better molecular diagnostics
• New molecular targets for therapy
  • few new drugs, large pipeline, ...
• Outcome depends on genetic signature
  • best treatment?
• Fundamental Biological Discovery
  • finding and refining biological pathways
• Personalized medicine ?!

Microarray Data Mining Challenges
• Avoiding false positives, due to
  • too few records (samples), usually < 100
  • too many columns (genes), usually > 1,000
• Model needs to be robust in presence of noise
• For reliability need large gene sets; for diagnostics or drug targets, need small gene sets
• Estimate class probability
• Model needs to be explainable to biologists
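
To see why few samples and many genes invite false positives, the following sketch generates pure noise for two classes and counts how many "genes" still pass a 1% significance cutoff. The sizes, the cutoff value, and the use of an unequal-variance t statistic are assumptions chosen for illustration, not details from the talk.

```python
import random
from math import sqrt
from statistics import mean, variance

def t_value(a, b):
    """Two-sample t statistic (unequal-variance form)."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

rng = random.Random(0)
n_genes, n_per_class = 2000, 19   # assumed sizes: 38 samples in two classes
threshold = 2.72                  # approx. two-sided 1% cutoff for ~36 degrees of freedom

false_hits = 0
for _ in range(n_genes):
    # Pure noise: both classes are drawn from the same distribution.
    class1 = [rng.gauss(0, 1) for _ in range(n_per_class)]
    class2 = [rng.gauss(0, 1) for _ in range(n_per_class)]
    if abs(t_value(class1, class2)) > threshold:
        false_hits += 1

print(f"{false_hits} of {n_genes} noise genes look 'significant'")
```

Roughly 1% of the noise genes are expected to pass, so with thousands of genes a list of "significant" genes can be entirely spurious.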

False Positives in Astronomy
[Cartoon, used with permission]

CATs: Clementine Application Templates
• CATs - examples of complete data mining processes
• Microarray CAT
[Figure: Microarray CAT with the labels Preparation, MultiClass, 2-Class, Clustering]

Key Ideas
• Capture the complete process
• Cross-validation loop with feature selection inside
• Randomization to select significant genes
• Internal iterative feature selection loop
• For each class, separate selection of optimal gene sets
• Neural nets – robust in presence of noise
• Bagging of neural nets

Microarray Classification
[Figure: flow diagram with the boxes Data, Train data, Test data, Feature and Parameter Selection, Model Building, Evaluation]

Classification: External X-val
[Figure: flow diagram with the boxes Gene Data, Train, Final Test, Train data, Test data, Feature and Parameter Selection, Model Building, Evaluation, Final Model, Final Results]
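
A minimal sketch of cross-validation with gene selection done inside each fold, so the held-out samples never influence which genes are chosen. The nearest-centroid classifier is a simple stand-in for the neural nets used in the talk, and the data, sizes, and function names are all hypothetical.

```python
import random
from math import sqrt
from statistics import mean, variance

def t_value(a, b):
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

def select_genes(X, y, k):
    """Indices of the k genes with the largest |t| between classes 0 and 1."""
    scores = []
    for g in range(len(X[0])):
        a = [row[g] for row, c in zip(X, y) if c == 0]
        b = [row[g] for row, c in zip(X, y) if c == 1]
        scores.append((abs(t_value(a, b)), g))
    return [g for _, g in sorted(scores, reverse=True)[:k]]

def centroid_predict(X, y, genes, sample):
    """Assign the class whose mean profile (on the chosen genes) is closest."""
    best = None
    for c in (0, 1):
        rows = [row for row, label in zip(X, y) if label == c]
        dist = sum((sample[g] - mean(r[g] for r in rows)) ** 2 for g in genes)
        if best is None or dist < best[0]:
            best = (dist, c)
    return best[1]

def cv_error(X, y, k_genes, folds, rng):
    order = list(range(len(X)))
    rng.shuffle(order)
    errors = 0
    for f in range(folds):
        test = order[f::folds]
        train = [i for i in order if i not in test]
        Xtr, ytr = [X[i] for i in train], [y[i] for i in train]
        genes = select_genes(Xtr, ytr, k_genes)  # selection sees training rows only
        errors += sum(centroid_predict(Xtr, ytr, genes, X[i]) != y[i] for i in test)
    return errors / len(X)

# Synthetic data: 30 samples, 200 genes, only the first 5 genes carry signal.
rng = random.Random(1)
y = [i % 2 for i in range(30)]
X = [[rng.gauss(2.0 * c if g < 5 else 0.0, 1.0) for g in range(200)] for c in y]
err = cv_error(X, y, k_genes=5, folds=5, rng=rng)
print(f"cross-validated error: {err:.2f}")
```

Selecting genes on the full data set before splitting would leak test information into the model and give an optimistic error estimate.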

Measuring false positives with randomization
Gene    Class    Rand Class
178     1        2
105     1        1
4174    2        1
7133    2        2
Randomize 500 times
Bottom 1% T-value = -2.08
Select potentially interesting genes at 1%
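
The randomization step can be sketched as follows: shuffle the class labels many times, collect the T-values that arise by chance, and use the 1% tails of that null distribution as cutoffs. The expression matrix here is hypothetical noise, and the sizes (50 genes, 20 samples) are assumptions for illustration.

```python
import random
from math import sqrt

def t_value(values, labels):
    """t statistic of class 1 vs class 2 for one gene."""
    a = [v for v, c in zip(values, labels) if c == 1]
    b = [v for v, c in zip(values, labels) if c == 2]
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((v - ma) ** 2 for v in a) / (len(a) - 1)
    vb = sum((v - mb) ** 2 for v in b) / (len(b) - 1)
    return (ma - mb) / sqrt(va / len(a) + vb / len(b))

rng = random.Random(0)
labels = [1] * 10 + [2] * 10
# Hypothetical expression matrix: 50 genes x 20 samples of pure noise.
genes = [[rng.gauss(0, 1) for _ in labels] for _ in range(50)]

null_t = []
for _ in range(500):                     # randomize the class column 500 times
    shuffled = labels[:]
    rng.shuffle(shuffled)
    null_t.extend(t_value(g, shuffled) for g in genes)

null_t.sort()
low = null_t[int(0.01 * len(null_t))]    # bottom 1% T-value
high = null_t[int(0.99 * len(null_t))]   # top 1% T-value
selected = [i for i, g in enumerate(genes)
            if not low < t_value(g, labels) < high]
print(f"bottom 1%: {low:.2f}, top 1%: {high:.2f}, genes kept: {len(selected)}")
```

Genes whose real T-value falls outside the chance-derived cutoffs are the "potentially interesting" ones; on real data the cutoffs depend on the sample sizes and the data distribution.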

Gene Reduction improves Classification
• Most learning algorithms look for non-linear combinations of features -- can easily find many spurious combinations given small # of records and large # of genes
• Classification accuracy improves if we first reduce # of genes by a linear method, e.g. T-values of mean difference
• Heuristic: select equal # genes from each class
• Then apply a favorite machine learning algorithm
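
One way to read the "equal # genes from each class" heuristic for a two-class problem is to take the genes with the highest T-values (high in the first class) and the same number with the lowest T-values (high in the second class). The sketch below follows that reading; the gene names, values, and function names are invented for illustration.

```python
from math import sqrt
from statistics import mean, variance

def t_value(a, b):
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

def select_per_class(expr, labels, class_a, class_b, per_class):
    """Pick per_class genes that are high in class_a and per_class high in class_b."""
    scores = {}
    for gene, values in expr.items():
        a = [v for v, c in zip(values, labels) if c == class_a]
        b = [v for v, c in zip(values, labels) if c == class_b]
        scores[gene] = t_value(a, b)
    ranked = sorted(scores, key=scores.get)        # most negative T-value first
    return {class_a: ranked[::-1][:per_class],     # highest T: high in class_a
            class_b: ranked[:per_class]}           # lowest T: high in class_b

# Hypothetical expression values for six samples.
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
expr = {
    "g_up_ALL": [10, 11, 12, 1, 2, 3],
    "g_up_AML": [1, 2, 3, 10, 11, 12],
    "g_flat": [5, 6, 5, 6, 5, 6],
    "g_weak_ALL": [6, 7, 8, 5, 6, 7],
}
chosen = select_per_class(expr, labels, "ALL", "AML", per_class=1)
print(chosen)
```

The reduced gene list would then be passed to whatever learning algorithm is preferred.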

Iterative Wrapper approach to selecting the best gene set
• Test models using 1, 2, 3, ..., 10, 20, 30, 40, ..., 100 top genes with x-validation.
• Heuristic 1: evaluate errors from each class; select the number of genes from each class that minimizes error for that class
• For randomized algorithms, average 10+ cross-validation runs!
• Select gene set with lowest average error
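
The outer loop of the wrapper can be sketched independently of the model: for each candidate gene count, average the error of several cross-validation runs and keep the count with the lowest average. In this sketch `toy_cv_error` is a made-up stand-in for a real cross-validation run (it returns a noisy U-shaped curve, not real results), and `pick_gene_count` is a hypothetical helper name.

```python
import random
from statistics import mean, stdev

# Candidate sizes from the slide: 1..10, then 20, 30, ..., 100.
GENE_COUNTS = list(range(1, 11)) + list(range(20, 101, 10))

def pick_gene_count(cv_error, counts, runs=10):
    """Average cv_error(n_genes, seed) over several runs; return best count and summary."""
    summary = {}
    for n in counts:
        errors = [cv_error(n, seed) for seed in range(runs)]
        summary[n] = (mean(errors), stdev(errors))
    best = min(summary, key=lambda n: summary[n][0])
    return best, summary

def toy_cv_error(n_genes, seed):
    """Stand-in for one real cross-validation run (NOT real data)."""
    rng = random.Random(seed * 1000 + n_genes)
    base = 0.02 + 0.002 * abs(n_genes - 10)
    return max(0.0, base + rng.gauss(0, 0.005))

best, summary = pick_gene_count(toy_cv_error, GENE_COUNTS)
avg, sd = summary[best]
print(f"best gene count: {best} (avg error {avg:.3f}, st. dev {sd:.3f})")
```

In practice `cv_error` would wrap the full train, select, and evaluate cycle, with a different random seed per run.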

Clementine stream for subset selection by x-validation
[Figure: Clementine stream]

Microarrays: ALL/AML Example
• Leukemia: Acute Lymphoblastic (ALL) vs Acute Myeloid (AML), Golub et al., Science, v. 286, 1999
• 72 examples (38 train, 34 test), about 7,000 genes
• well-studied (CAMDA-2000), good test example
[Figure: ALL and AML images] Visually similar, but genetically very different

Gene subset selection: one X-validation
[Figure: Single Cross-Validation run. Y axis: Error Avg for 10-fold X-val (0% to 30%); X axis: Genes per Class (1, 2, 3, 4, 5, 10, 20, 30, 40)]

Gene subset selection: multiple cross-validation runs
For ALL/AML data, 10 genes per class had the lowest error (<1%)
[Figure legend: the point in the center is the average error from 10 cross-validation runs; bars indicate 1 st. dev. above and below]

ALL/AML: Results on the test data
• Genes selected and model trained on Train set ONLY!
• Best Net with 10 top genes per class (20 overall) was applied to the test data (34 samples):
  • 33 correct predictions (97% accuracy),
  • 1 error on sample 66
    • Actual Class AML, Net prediction: ALL
    • other methods consistently misclassify sample 66 -- misclassified by a pathologist?

Pediatric Brain Tumour Data
• 92 samples, 5 classes (MED, EPD, JPA, MGL, RHB) from U. of Chicago Children’s Hospital
• Outer cross-validation with gene selection inside the loop
• Ranking by absolute T-test value (selects top positive and negative genes)
• Select best genes by adjusted error for each class
• Bagging of 100 neural nets

Selecting Best Gene Set
• Minimizing Combined Error for all classes is not optimal
[Figure: Average, high and low error rate for all classes]

Error rates for each class
[Figure: Error rate vs. Genes per Class]

Evaluating One Network
Averaged over 100 Networks:
Class    Error rate
MED      2.1%
MGL      17%
RHB      24%
EPD      9%
JPA      19%
*ALL*    8.3%

Bagging 100 Networks
Class    Individual Error Rate    Bag Error rate    Bag Avg Conf
MED      2.1%                     2% (0)*           98%
MGL      17%                      10%               83%
RHB      24%                      11%               76%
EPD      9%                       0                 91%
JPA      19%                      0                 81%
*ALL*    8.3%                     3% (2)*           92%
• Note: suspected error on one sample (labeled as MED but consistently classified as RHB)
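
Bagging can be sketched as training many models on bootstrap resamples and letting them vote. In this illustration the base learner is a nearest-centroid classifier standing in for a neural net, the vote share is used as a rough confidence (the talk does not say how "Bag Avg Conf" was computed), and the two-gene data set is invented.

```python
import random
from collections import Counter

def train_centroids(X, y):
    """Base learner: per-class mean vector (a simple stand-in for a neural net)."""
    cents = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        cents[c] = [sum(col) / len(rows) for col in zip(*rows)]
    return cents

def predict(cents, x):
    return min(cents, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, cents[c])))

def bag_predict(X, y, x, n_models, rng):
    """Train n_models on bootstrap resamples; return majority class and vote share."""
    votes = Counter()
    for _ in range(n_models):
        idx = [rng.randrange(len(X)) for _ in X]   # sample rows with replacement
        model = train_centroids([X[i] for i in idx], [y[i] for i in idx])
        votes[predict(model, x)] += 1
    label, count = votes.most_common(1)[0]
    return label, count / n_models

# Hypothetical two-gene profiles for two tumour classes.
X = [[1.0, 5.0], [1.2, 4.8], [0.9, 5.3], [4.0, 1.0], [4.2, 1.1], [3.9, 0.8]]
y = ["MED"] * 3 + ["RHB"] * 3
label, confidence = bag_predict(X, y, [1.1, 5.0], n_models=100, rng=random.Random(0))
print(label, confidence)
```

Averaging over many resampled models tends to reduce the variance of any single model, which is consistent with the lower bag error rates reported in the table.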

AF1q: New Marker for Medulloblastoma?
• AF1Q ALL1-fused gene from chromosome 1q
• transmembrane protein
• Related to leukemia (3 PUBMED entries) but not to Medulloblastoma

Future directions for Microarray Analysis
• Algorithms optimized for small samples
• Integration with other data
  • biological networks
  • medical text
  • protein data
• Cost-sensitive classification algorithms
  • error cost depends on outcome (don’t want to miss treatable cancer), treatment side effects, etc.
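
A minimal sketch of a cost-sensitive decision: given estimated class probabilities and a cost for each kind of error, predict the class with the lowest expected cost rather than the most probable class. The cost values, probabilities, and function name below are hypothetical.

```python
def min_cost_decision(probs, cost):
    """probs: class -> probability; cost[predicted][actual] -> cost of that outcome."""
    expected = {pred: sum(cost[pred][actual] * p for actual, p in probs.items())
                for pred in cost}
    return min(expected, key=expected.get), expected

# Hypothetical costs: missing a treatable cancer is 10x worse than a false alarm.
cost = {"cancer": {"cancer": 0, "healthy": 1},
        "healthy": {"cancer": 10, "healthy": 0}}
probs = {"cancer": 0.2, "healthy": 0.8}

decision, expected = min_cost_decision(probs, cost)
print(decision, expected)
```

With these numbers the lower-cost decision is "cancer" even though "healthy" is four times more probable, which is why such methods need calibrated class probabilities.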

Acknowledgements
• Eric Bremer, Children’s Hospital (Chicago) & Northwestern U.
• Greg Cooper, U. Pittsburgh
• Tom Khabaza, SPSS
• Sridhar Ramaswamy, MIT/Whitehead Institute
• Pablo Tamayo, MIT/Whitehead Institute

Thank you
Further resources on Data Mining:
www.KDnuggets.com
Microarrays:
www.KDnuggets.com/websites/microarray.html
Contact:
Gregory Piatetsky-Shapiro:
www.kdnuggets.com/gps.html