Transcript Document

Dental Data Mining:
Practical Issues
and
Potential Pitfalls
Stuart A. Gansky
University of California, San Francisco
Center to Address Disparities in Children’s Oral Health
Support: US DHHS/NIH/NIDCR U54 DE14251
What is Knowledge Discovery
and Data Mining (KDD)?
• “Semi-automatic discovery of patterns, associations,
anomalies, and statistically significant structures in data”
– MIT Tech Review (2001)
• Interface of
– Artificial Intelligence
– Computer Science
– Machine Learning
– Engineering
– Statistics
• Association for Computing Machinery Special Interest
Group on Knowledge Discovery and Data Mining
(ACM SIGKDD, which sponsors the KDD Cup)
Data Mining as Alchemy
[Figure: lead (Pb) transmuted into gold (Au)]
Some Potential KDD Applications
in Oral Health Research
• Large surveys (eg NHANES)
• Longitudinal studies (eg VA Aging Study)
• Disease registries (eg SEER)
• Digital diagnostics (radiographic & others)
• Molecular biology (eg PCR, microarrays)
• Health services research / claims data
• Provider and workforce databases
Supervised Learning
• Regression
• k nearest neighbor
• Trees (CART, MART, boosting, bagging)
• Random Forests
• Multivariate Adaptive Regression Splines (MARS)
• Neural Networks
• Support Vector Machines

Unsupervised Learning
• Hierarchical clustering
• k-means
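To make the supervised/unsupervised split above concrete, here is a minimal sketch in Python with scikit-learn on simulated data; the specific choices (k nearest neighbor, k-means) are just two entries from the lists, not anything shown in the talk.

```python
# Minimal sketch: supervised vs. unsupervised learning on simulated data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier   # supervised: k nearest neighbor
from sklearn.cluster import KMeans                    # unsupervised: k-means

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                 # two continuous inputs
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # known labels (supervised target)

# Supervised learning: uses both X and y
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print("k-NN training accuracy:", knn.score(X, y))

# Unsupervised learning: uses X only, no labels
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("k-means cluster sizes:", np.bincount(km.labels_))
```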
KDD Steps
[Flow diagram:]
• Collect & Store
• Pre-Process: Sample, Merge, Warehouse
• Clean: Impute, Transform, Standardize, Register
• Analyze: Supervised, Unsupervised, Visualize
• Validate: Internal (Split Sample, Cross-validate, Bootstrap), External
• Act: Intervene, Set Policy
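The pre-process → analyze → validate flow above can be sketched with a scikit-learn Pipeline; the components below (median imputation, standardization, logistic regression, 5-fold AUC) are illustrative assumptions on simulated data, not the presenter's workflow.

```python
# Minimal sketch of the KDD flow: clean/impute, standardize, analyze, validate.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
X[rng.random(X.shape) < 0.05] = np.nan            # sprinkle missing values to impute
y = rng.binomial(1, 0.2, size=300)                 # ~20% "disease" outcome

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # Clean / Impute
    ("scale", StandardScaler()),                   # Standardize
    ("model", LogisticRegression(max_iter=1000)),  # Analyze (supervised)
])

# Validate: internal cross-validation of the whole pipeline
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print("5-fold CV AUC:", scores.round(3))
```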
Data Quality
Example – Caries
• Predicting disease with traditional logistic
regression can run into modeling difficulties:
nonlinearity (ANN better) and interactions (CART
better) (Kattan et al, Comp Biomed Res, '98)
• Goal: compare the performance of logistic
regression with popular data mining techniques
(tree and artificial neural network models) on
dental caries data
• CART in caries (Stewart & Stamm, JDR, '91)
Example study – child caries
• Background: ~20% of children have ~80% of
caries (tooth decay)
• University of Rochester longitudinal study
(Leverett et al, J Dent Res, 1993)
• 466 1st-2nd graders caries-free at baseline
• Saliva samples & exams every 6 months
• Goal: Predict 24 month caries incidence (output)
18-month Predictors (Inputs)
• Salivary bacteria
– Mutans Streptococci (log10 CFU/ml)
– Lactobacilli (log10 CFU/ml)
• Salivary chemistry
– Fluoride (ppm)
– Calcium (mmol/l)
– Phosphate (ppm)
Modeling Methods
• Logistic Regression
• Neural Networks
• Decision Trees
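A minimal sketch fitting the three model families to simulated stand-ins for the five salivary predictors; the variable names echo the inputs above, but the data, settings, and results are purely illustrative, not the study's code.

```python
# Minimal sketch: fit the three model families to simulated caries-style data.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(2)
X = pd.DataFrame({
    "log10_MS": rng.normal(5, 2, 466),     # mutans streptococci
    "log10_LB": rng.normal(2, 1.5, 466),   # lactobacilli
    "F_ppm": rng.gamma(2, 0.05, 466),      # fluoride
    "Ca_mmol": rng.normal(1.5, 0.3, 466),  # calcium
    "PO4_ppm": rng.normal(150, 30, 466),   # phosphate
})
y = rng.binomial(1, 0.15, 466)             # ~15% caries incidence

models = {
    "logit": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "ann": make_pipeline(StandardScaler(),
                         MLPClassifier(hidden_layer_sizes=(3,),
                                       max_iter=2000, random_state=0)),
}
for name, m in models.items():
    m.fit(X, y)
    print(name, "training accuracy:", round(m.score(X, y), 3))
```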
Logistic Regression Models
[Schematic surface: logit(primary dentition caries) vs fluoride (F, ppm) and log10 mutans streptococci]
Tree Models
[Schematic surface: logit(primary dentition caries) vs fluoride (F, ppm) and log10 mutans streptococci]
Artificial Neural Networks
[Schematic surface: logit(primary dentition caries) vs fluoride (F, ppm) and log10 mutans streptococci]
Artificial Neural Network (p-r-1)
[Network diagram: inputs x1, x2, …, xp feed hidden-layer neurons h1, h2, …, hr through weights wij; the hidden neurons feed the output y through weights wj]
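In formula form, a p-r-1 network of this kind computes the following (assuming the hyperbolic-tangent activation used later in the talk and a logistic output for the binary outcome):

```latex
% p-r-1 multilayer perceptron: p inputs, r hidden tanh neurons, one output
\hat{y} = g\!\Big( w_{0} + \sum_{j=1}^{r} w_{j}\, h_{j} \Big),
\qquad
h_{j} = \tanh\!\Big( w_{0j} + \sum_{i=1}^{p} w_{ij}\, x_{i} \Big),
\qquad
g(u) = \frac{1}{1+e^{-u}} .
```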
Common Mistakes with ANN
(Schwarzer et al, Stat Med, 2000)
• Too many parameters for sample size
• No validation
• No model complexity penalty
(eg Akaike Information Criterion (AIC))
• Incorrect misclassification estimation
• Implausible function
• Incorrectly described network complexity
• Inadequate statistical competitors
• Insufficiently compared to stat competitors
Validation
• Split sample (70% training / 30% validation)
– Validation set gives unbiased misclassification estimates
• K-fold cross-validation
– Mean squared error (Brier score)
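A minimal sketch of the 70/30 split-sample approach with misclassification and the Brier score, on simulated data; only the split ratio comes from the slide.

```python
# Minimal sketch: 70/30 split-sample validation with misclassification and Brier score.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(3)
X = rng.normal(size=(466, 5))
y = rng.binomial(1, 0.15, 466)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.30, random_state=0,
                                          stratify=y)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

p_va = model.predict_proba(X_va)[:, 1]
print("validation misclassification:", round(1 - model.score(X_va, y_va), 3))
print("validation Brier score:", round(brier_score_loss(y_va, p_va), 3))
```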
Why Validate?
Example: Overfitting in 2 Dimensions
[Scatter plot of the data: Response vs Predictor]
Linear Fit to Data
[Plot of linear fit: y = 0.3449x + 1.2802, R² = 0.9081]
High Degree Polynomial Fit to Data
[Plot of 6th-degree polynomial fit: y = −0.0012x⁶ + 0.1196x⁵ − 4.8889x⁴ + 105.05x³ − 1250.4x² + 7811.5x − 19989, R² = 1]
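The contrast between the two fits can be reproduced in spirit with a short simulation: a degree-6 polynomial interpolates seven training points exactly (R² = 1) yet usually predicts new points worse than the straight line. The data below are simulated, not the slide's.

```python
# Minimal sketch: with 7 points, a degree-6 polynomial interpolates the training
# data exactly but usually predicts held-out points worse than a line (overfitting).
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0.5, 3.0, 7)                      # 7 training points
y = 3.5 * x + 1.3 + rng.normal(0, 0.8, x.size)
x_new = np.linspace(0.5, 3.0, 50)                 # held-out grid over the same range
y_new = 3.5 * x_new + 1.3 + rng.normal(0, 0.8, x_new.size)

lin = np.polyfit(x, y, 1)      # straight line
poly = np.polyfit(x, y, 6)     # degree-6 polynomial through all 7 points

for name, coefs in [("linear", lin), ("degree-6", poly)]:
    train_mse = np.mean((np.polyval(coefs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_new) - y_new) ** 2)
    print(f"{name}: training MSE = {train_mse:.3f}, held-out MSE = {test_mse:.3f}")
```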
10-Fold Cross-validation
[Schematic: the data are divided into 10 folds; in each of 10 rounds a different fold is held out for validation while the remaining 9 folds are used for training]
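A minimal sketch of the schematic above: each of the 10 folds is held out once for validation while the other 9 train the model (simulated data, illustrative logistic model and Brier score).

```python
# Minimal sketch: 10-fold cross-validation, each fold held out once for validation.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(5)
X = rng.normal(size=(466, 5))
y = rng.binomial(1, 0.15, 466)

scores = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    p = model.predict_proba(X[test_idx])[:, 1]
    scores.append(brier_score_loss(y[test_idx], p))
print("mean 10-fold Brier score:", round(float(np.mean(scores)), 3))
```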
Caries Example
Model Settings
• Logit
– Stepwise selection
– Alpha=.05 to enter, alpha=.20 to stay
– AIC to judge additional predictors
• Tree
– Splitting criterion: Gini index
– Pruning: Proportion correctly classified
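A hedged sketch of comparable tree settings in scikit-learn: the Gini criterion maps directly, while pruning by "proportion correctly classified" has no exact equivalent there, so cost-complexity pruning chosen by validation accuracy stands in for it below (simulated data; nothing here reproduces the study's tree).

```python
# Minimal sketch: Gini splitting with pruning strength chosen by validation
# accuracy (a stand-in for "proportion correctly classified" pruning).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(6)
X = rng.normal(size=(466, 5))
y = rng.binomial(1, 0.15, 466)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

# Candidate pruning strengths from the cost-complexity pruning path
path = DecisionTreeClassifier(
    criterion="gini", random_state=0
).cost_complexity_pruning_path(X_tr, y_tr)

candidates = [
    DecisionTreeClassifier(criterion="gini", ccp_alpha=a, random_state=0).fit(X_tr, y_tr)
    for a in np.clip(path.ccp_alphas, 0, None)
]
best = max(candidates, key=lambda t: t.score(X_va, y_va))  # proportion correct
print("leaves in chosen tree:", best.get_n_leaves())
print("validation proportion correctly classified:", round(best.score(X_va, y_va), 3))
```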
ANN Settings
• Artificial Neural Network (5-3-1 = 22 df)
– Multilayer perceptron
– 5 preliminary runs
– Levenberg-Marquardt optimization
– No weight decay parameter
– Average error selection
– 3 hidden nodes/neurons
– Activation function: hyperbolic tangent
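These settings can be approximated (not reproduced) with scikit-learn's MLPClassifier: the 5-3-1 architecture and tanh activation map directly, but Levenberg-Marquardt is not offered, so the quasi-Newton L-BFGS solver stands in, and alpha = 0 mimics "no weight decay"; data are simulated.

```python
# Minimal sketch approximating the slide's ANN settings (5 inputs, 3 hidden tanh
# neurons, 1 output; 22 parameters). L-BFGS stands in for Levenberg-Marquardt.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(7)
X = rng.normal(size=(466, 5))
y = rng.binomial(1, 0.15, 466)

ann = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(3,), activation="tanh",
                  solver="lbfgs", alpha=0.0, max_iter=2000, random_state=0),
)
ann.fit(X, y)
print("training accuracy:", round(ann.score(X, y), 3))
```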
ANN Sensitivity Analyses
• Random seeds: 5 values
– No differences
• Weight decay parameters: 0, .001, .005, .01, .25
– Only slight differences for .01 and .25
• Hidden nodes/neurons: 2, 3, 4
– 3 seems best
Tree Model
[Classification tree diagram: N=322 training, N=144 validation; overall primary-caries prevalence 15%. Splits shown with node prevalences: log10 MS < 7.08 (15%) vs ≥ 7.08 (91%); log10 LB < 3.05 (10%) vs ≥ 3.05 (23%); log10 MS < 3.91 (3%) vs ≥ 3.91 (14%); F < .056 (22%) vs ≥ .056 (25%); F < .110 (100%) vs ≥ .110 (0%). Shading marks nodes with prevalence above vs below the overall 15%.]
Receiver Operating Characteristic
(ROC) Curves
Cumulative Captured Response Curves
Lift Chart
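A minimal sketch of how these evaluation summaries are computed from predicted probabilities; the predictions are simulated, and the lift definition (decile event rate divided by the overall rate) is one common convention, not necessarily that of the software behind the slides.

```python
# Minimal sketch: ROC AUC, cumulative captured response, and decile lift
# from predicted probabilities (simulated predictions).
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(8)
y = rng.binomial(1, 0.15, 1000)
p = np.clip(0.15 + 0.3 * (y - 0.15) + rng.normal(0, 0.1, 1000), 0.001, 0.999)

print("AUC:", round(roc_auc_score(y, p), 3))

order = np.argsort(-p)                      # sort subjects by predicted risk
deciles = np.array_split(y[order], 10)      # ten roughly equal groups
overall = y.mean()
captured = 0
for i, d in enumerate(deciles, start=1):
    captured += d.sum()
    print(f"decile {i}: lift = {d.mean() / overall:.2f}, "
          f"cumulative captured response = {captured / y.sum():.2f}")
```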
Logistic Regression
Predictor    Beta    Std Err    Odds Ratio    95% CI
log10 MS     .238    .072       1.27          1.10 – 1.46
log10 LB     .311    .070       1.36          1.19 – 1.57
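As a check on the table, the odds ratios and confidence limits follow from the coefficients by standard logistic-regression algebra:

```latex
% Odds ratio and 95% CI from a logistic regression coefficient and its SE
\mathrm{OR} = e^{\beta}, \qquad
95\%\ \mathrm{CI} = e^{\,\beta \pm 1.96\,\mathrm{SE}}
\;\Longrightarrow\;
e^{0.238} \approx 1.27, \quad
e^{\,0.238 \pm 1.96(0.072)} \approx (1.10,\ 1.46).
```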
MARS – MS at 4 Times
Predicted Quintiles
[Two similar panels: values plotted by predicted quintile; horizontal axis: rank for variable PR_ANN (0–4), vertical axis: −2 to 2]
5-fold CV Results
Model    RMS Error    AUC
Logit    .365         .680
Tree     .363         .553
ANN      .362         .707
Summary
• Data quality and study design are paramount
• Utilize multiple methods
• Be sure to validate
• Graphical displays help interpretations
• KDD methods may provide advantages over
traditional statistical models in dental data
Prediction is only as good as the data and the model.