
GhostMiner Wine example
Włodzisław Duch
Dept. of Informatics,
Nicholas Copernicus University,
Toruń, Poland
http://www.phys.uni.torun.pl/~duch
ISEP Porto, 8-12 July 2002
GhostMiner Philosophy
GhostMiner, data mining tools from our lab.
http://www.fqspl.com.pl/ghostminer/
• Separate the process of model building and
knowledge discovery from model use =>
GhostMiner Developer & GhostMiner Analyzer.
• There is no free lunch – provide different types of tools
for knowledge discovery.
Decision tree, neural, neurofuzzy, similarity-based,
committees.
• Provide tools for visualization of data.
• Support the process of knowledge discovery/model
building and evaluating, organizing it into projects.
GM summary
GhostMiner combines 4 basic tools for predictive data
mining and understanding of data, avoiding too many
parameter choices (such as network structure specifications):
• IncNet ontogenic neural network using Kalman filter
learning separating each class from all other classes;
• Feature Space Mapping neurofuzzy system producing
logical rules of crisp and fuzzy types.
• Separability Split Value decision tree.
• Weighted nearest neighbor method.
• K-classifiers and committees of models.
• MDS visualization
Wine data example
Chemical analysis of wine from grapes grown in
the same region in Italy, but derived from three
different cultivars.
Task: recognize the source of wine sample.
13 quantities measured, continuous features:
• alcohol content
• malic acid content
• ash content
• alkalinity of ash
• magnesium content
• total phenols content
• flavanoids content
• nonanthocyanins phenols content
• proanthocyanins phenols content
• color intensity
• hue
• OD280/D315 of diluted wines
• proline.
Exploration and visualization
Load data (using load icon) and look at
general info about the data.
Exploration: data
Inspect the data itself in the raw form.
Exploration: data statistics
Look at distribution of feature values
Note that Proline has very large values, therefore the data
should be standardized before further processing.
Exploration: data standardized
Standardized data: unit standard deviation, about 2/3 of all
data should fall within [mean-std,mean+std]
Other options: normalize to fit in [-1,+1], or normalize
rejecting some extreme values.
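A minimal sketch of this standardization step in Python (not GhostMiner code; the sample Proline values are only illustrative):

```python
import numpy as np

def standardize(X):
    """Z-score standardization: zero mean and unit standard deviation per feature.

    For roughly normal features, about 2/3 of the standardized values
    fall within [-1, +1], i.e. [mean - std, mean + std] in original units.
    """
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Proline values are in the hundreds-to-thousands range, so without
# standardization this feature would dominate any distance-based method.
proline = np.array([[1065.0], [1050.0], [1185.0], [735.0], [1480.0]])
print(standardize(proline).ravel())
```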
Exploration: 1D histograms
Distribution of feature values in classes
Some features are more useful than others.
Exploration: 1D/3D histograms
Distribution of feature values in
classes, 3D
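One way to reproduce such per-class histograms outside GhostMiner is sketched below (assumes a feature matrix X, class labels y, and matplotlib; names are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

def class_histograms(X, y, feature_index, feature_name, bins=20):
    """Overlay histograms of one feature, one color per wine class."""
    for label in np.unique(y):
        plt.hist(X[y == label, feature_index], bins=bins,
                 alpha=0.5, label=f"class {label}")
    plt.xlabel(feature_name)
    plt.ylabel("count")
    plt.legend()
    plt.show()
```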
Exploration: 2D projections
Projections (cuboids) onto selected 2D feature subspaces.
Visualize data
Relations in more than 3D are hard to imagine.
SOM mappings: popular for visualization, but
rather inaccurate, no measure of distortions.
Measure of topographical distortions: map all Xi
points from Rn to xi points in Rm, m < n, and ask:
How well are Rij = D(Xi, Xj) distances reproduced by
distances rij = d(xi,xj) ?
Use m = 2 for visualization,
use higher m for dimensionality reduction.
Visualize data: MDS
Multidimensional scaling: invented in
psychometry by Torgerson (1952), re-invented
by Sammon (1969) and myself (1994) …
Minimize measure of topographical distortions
moving the x coordinates.
$$S_1(x) = \frac{1}{\sum_{i<j} R_{ij}^2}\,\sum_{i<j}\left(R_{ij}-r_{ij}(x)\right)^2 \qquad \text{(MDS)}$$

$$S_2(x) = \frac{1}{\sum_{i<j} R_{ij}}\,\sum_{i<j}\frac{\left(R_{ij}-r_{ij}(x)\right)^2}{R_{ij}} \qquad \text{(Sammon)}$$

$$S_3(x) = \frac{1}{\sum_{i<j} R_{ij}}\,\sum_{i<j}\left(1-\frac{r_{ij}(x)}{R_{ij}}\right)^2 \qquad \text{(MDS, more local)}$$
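As a concrete illustration, Sammon's measure S2 can be computed as below (a sketch, assuming distinct points so that all R_ij > 0; this is not the GhostMiner optimizer itself):

```python
import numpy as np

def sammon_stress(X_high, x_low):
    """Sammon's topographical distortion S2 between original and mapped points.

    X_high : (n, d) points in the original feature space
    x_low  : (n, m) mapped points, m < d (m = 2 for visualization)
    """
    def pairwise(A):
        diff = A[:, None, :] - A[None, :, :]
        return np.sqrt((diff ** 2).sum(-1))

    R = pairwise(X_high)                 # original distances R_ij
    r = pairwise(x_low)                  # distances r_ij in the projection
    iu = np.triu_indices(len(R), k=1)    # each pair i < j counted once
    return ((R[iu] - r[iu]) ** 2 / R[iu]).sum() / R[iu].sum()

# MDS then means: choose x_low to minimize such a stress, e.g. with
# gradient descent or scipy.optimize.minimize over the flattened coordinates.
```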
Visualize data: Wine
3 clusters are clearly distinguished, 2D is fine.
The green outlier can be identified easily.
Decision trees
Simplest things first:
use decision tree to find logical rules.
Test single attribute, find good point to split the
data, separating vectors from different classes.
DT advantages: fast, simple, easy to understand,
easy to program, many good algorithms.
4 attributes used,
10 errors, 168 correct,
94.4% correct.
Decision borders
Univariate trees:
test the value of a single attribute x < a.
Multivariate trees: test on combinations of
attributes, hyperplanes.
Result: feature space is divided into cuboids.
Wine data: univariate
decision tree borders for
proline and flavanoids
Separability Split Value (SSV)
SSV criterion:
• select attribute and split value that maximizes the number
of correctly separated pairs from different classes;
• if several equivalent split values exist select one that
minimizes the number of pairs split from the same class.
Works on raw data, including symbolic values.
Search for splits using best-first or beam-search method.
Tests are A(x) < T or x ∈ {s_i}.
Create tree that classifies all data correctly.
Use crossvalidation to determine how many nodes to prune or
what the pruning level should be.
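A sketch of the pair-counting idea behind the SSV criterion for a single numeric attribute (Python; the beam search over attributes and the tree-building loop are omitted, and this is not GhostMiner's own implementation):

```python
import numpy as np
from collections import Counter

def ssv_split(values, labels):
    """Pick a threshold on one attribute using the SSV criterion (sketch).

    Score of a candidate split = number of pairs of vectors from different
    classes that fall on opposite sides; ties are broken by minimizing the
    number of same-class pairs that get separated.
    """
    order = np.argsort(values)
    values, labels = np.asarray(values)[order], np.asarray(labels)[order]
    # candidate thresholds: midpoints between consecutive sorted values
    candidates = np.unique((values[:-1] + values[1:]) / 2.0)
    best = None
    for t in candidates:
        left, right = labels[values < t], labels[values >= t]
        lc, rc = Counter(left), Counter(right)
        same = sum(lc[c] * rc[c] for c in set(lc) | set(rc))
        different = len(left) * len(right) - same
        key = (different, -same)          # maximize separated, minimize split
        if best is None or key > best[0]:
            best = (key, t)
    return best[1], best[0][0], -best[0][1]

# toy usage: one attribute, three classes
threshold, separated, same_split = ssv_split(
    [12.1, 13.2, 13.7, 12.4, 14.1, 12.0], [2, 1, 1, 2, 1, 3])
print(threshold, separated, same_split)
```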
Wine – SSV 5 rules
Lower pruning leads to a more complex tree.
7 nodes, corresponding to 5 rules;
10 errors, mostly Class2/3 wines mixed; check the
confusion matrix in “results”.
Wine – SSV optimal rules
What is the optimal complexity of rules?
Use crossvalidation to estimate generalization.
Various solutions may be found, depending on the search:
5 rules with 12 premises, making 6 errors,
6 rules with 16 premises and 3 errors,
8 rules, 25 premises, and 1 error.
if OD280/D315 > 2.505 ∧ proline > 726.5 ∧ color > 3.435 then class 1
if OD280/D315 > 2.505 ∧ proline > 726.5 ∧ color < 3.435 then class 2
if OD280/D315 < 2.505 ∧ hue > 0.875 ∧ malic-acid < 2.82 then class 2
if OD280/D315 > 2.505 ∧ proline < 726.5 then class 2
if OD280/D315 < 2.505 ∧ hue < 0.875 then class 3
if OD280/D315 < 2.505 ∧ hue > 0.875 ∧ malic-acid > 2.82 then class 3
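Written as ordinary control flow, the rules listed above become (a sketch; function and argument names are ours, thresholds are the ones in the rules):

```python
def classify_wine(od280_d315, proline, color, hue, malic_acid):
    """The SSV rules above expressed as plain if/else logic."""
    if od280_d315 > 2.505:
        if proline > 726.5:
            return 1 if color > 3.435 else 2
        return 2
    # od280_d315 below 2.505
    if hue > 0.875:
        return 2 if malic_acid < 2.82 else 3
    return 3
```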
Neurofuzzy systems
MLP: discrimination, finds separating surfaces as
combinations of sigmoidal functions.
Fuzzy approach: define membership functions (MF), replacing
μ(x) ∈ {0,1} (no/yes) by a degree μ(x) ∈ [0,1].
Typically triangular, trapezoidal, Gaussian ... MF are
used.
Membership functions in many dimensions are constructed using
products of one-dimensional MFs; the resulting degree
μ(X) ∈ [0,1] is thresholded at some constant value.
Advantage: easy to add a priori knowledge (proper
bias); may work well for very small datasets!
Feature Space Mapping
Feature Space Mapping (FSM) neurofuzzy system.
Find best network architecture (number of nodes and
feature selection) using an ontogenic network
(growing and shrinking) with one hidden layer.
Use separable rectangular, triangular, Gaussian MF.
$$G(X; P) = \prod_{i=1}^{N} G_i(X_i; P_i)$$
Initialize using clusterization techniques.
Allow for rotation of Gaussian functions.
Describe the joint prob. density p(X,C).
Neural adaptation using RBF-like algorithms.
Good for logical rules and NN predictive models.
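A minimal sketch of one separable, product-form Gaussian node of this kind (illustrative parameter names, not the FSM implementation):

```python
import numpy as np

def fsm_gaussian_node(X, centers, widths):
    """G(X; P) = prod_i G_i(X_i; P_i) with Gaussian 1-D factors.

    Separability means every feature contributes an independent factor,
    which is what makes feature selection and rule extraction easy.
    """
    X, centers, widths = map(np.asarray, (X, centers, widths))
    g_i = np.exp(-((X - centers) ** 2) / (2.0 * widths ** 2))
    return g_i.prod()

# toy usage: a node centered on "class 1-like" proline and flavanoid values
print(fsm_gaussian_node(X=[1.2, 0.8], centers=[1.0, 1.0], widths=[0.5, 0.7]))
```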
Wine – FSM rules
SSV: hierarchical rules
FSM: density estimation with feature selection.
Complexity of rules depends on desired accuracy.
Use rectangular functions for crisp rules.
Optimal accuracy may be evaluated using crossvalidation.
FSM discovers simpler rules, for example:
if proline > 929.5 then class 1
(48 cases, 45 correct, 2 recovered by other rules).
if color < 3.79285 then class 2
(63 cases, 60 correct)
IncNet
Incremental Neural Network (IncNet).
Ontogenic NN with single hidden layer, adding,
removing and merging neurons.
Transfer functions: Gaussians or combination of sigmoids
(bi-central functions).
Training: use Kalman filter approach to estimate network
parameters.
Fast Kalman filter training is usually sufficient.
Always creates one network per class, separating it from
other samples.
Creates predictive models equivalent to fuzzy rules.
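To illustrate how a "combination of sigmoids" yields a localized, bi-central-like response (a sketch; the exact IncNet parametrization may differ):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def window(x, center, width, slope=5.0):
    """Difference of two shifted sigmoids: a soft window around `center`."""
    return (sigmoid(slope * (x - center + width))
            - sigmoid(slope * (x - center - width)))

def window_nd(X, centers, widths, slope=5.0):
    """Product of 1-D windows: a localized response in many dimensions."""
    return float(np.prod([window(x, c, w, slope)
                          for x, c, w in zip(X, centers, widths)]))

print(window_nd([13.0, 2.5], centers=[13.2, 2.3], widths=[0.5, 0.4]))
```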
k-nearest neighbors
Use various similarity functions to evaluate how
similar a new case is to all reference (training)
cases; use p(Ci|X) = k(Ci)/k.
Similarity functions include Minkowski and related functions.
Optimize k, the number of neighbors included.
Optimize the scaling factors of features Wi|Xi-Yi|:
this goes beyond feature selection.
Use search-based techniques to find good scaling
parameters for features.
Notice that:
For k=1, 100% accuracy on the training set is always obtained! To
evaluate accuracy on the training data, use the leave-one-out procedure.
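A sketch of weighted nearest neighbors with leave-one-out evaluation (Python; Manhattan distance, i.e. Minkowski with p=1, is chosen here just for the illustration):

```python
import numpy as np
from collections import Counter

def weighted_knn_predict(X_train, y_train, x_new, k=3, weights=None):
    """k-NN with per-feature scaling weights W_i |X_i - Y_i| (sketch).

    p(C_i | X) is estimated as k(C_i)/k, the fraction of the k nearest
    training vectors that belong to class C_i; here we return the argmax.
    """
    weights = np.ones(X_train.shape[1]) if weights is None else np.asarray(weights)
    dist = (np.abs(X_train - x_new) * weights).sum(axis=1)   # weighted Manhattan
    nearest = y_train[np.argsort(dist)[:k]]
    return Counter(nearest).most_common(1)[0][0]

def loo_accuracy(X, y, k=1, weights=None):
    """Leave-one-out accuracy: the honest way to score k=1 on training data."""
    hits = sum(
        weighted_knn_predict(np.delete(X, i, 0), np.delete(y, i), X[i], k, weights) == y[i]
        for i in range(len(y))
    )
    return hits / len(y)
```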
Committees and K-classifiers
K-classifiers: in K-class problems create K
classifiers, one for each class.
Committees:
combine results from different classification models:
create different models using the same method (for example
a decision tree) on different data samples (bootstrapping);
combine several different models, including other
committees, into one model;
use majority voting to decide on the predicted class.
No rules, but stable and accurate classification models.
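A sketch of the committee idea: bootstrap samples for training the members, majority voting for prediction (the model objects and method names are illustrative, not a GhostMiner API):

```python
import numpy as np
from collections import Counter

def bootstrap_sample(X, y, rng=None):
    """Sampling with replacement: a different training set for each member."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.integers(0, len(y), size=len(y))
    return X[idx], y[idx]

def committee_predict(models, x):
    """Majority vote over any models exposing .predict(x): trees trained on
    different bootstrap samples, k-NN models, other committees, etc."""
    votes = [m.predict(x) for m in models]
    return Counter(votes).most_common(1)[0][0]
```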
Summary
Please get your copy from
http://www.fqspl.com.pl/ghostminer/
GhostMiner combines 4 basic tools for predictive data
mining and understanding of data.
GM includes K-classifiers and committees of models.
GM includes MDS visualization/dimensionality reduction.
Model building is separated from model use.
GM provides tools for easy testing of statistical accuracy.
Many new classification models are coming.