HC-edit: A Hierarchical Clustering Approach to Data Editing

Paul Amalaman and Christoph F. Eick
Department of Computer Science, University of Houston

ISMIS, Oct 21-23, 2015, Lyon, France
Talk Organization
1. Introduction
2. Data Editing
3. Supervised Taxonomy & The STAXAC Algorithm
4. HC-edit
5. Related Work
6. Summary and Conclusion
1. Introduction

k-NN Classifiers

k-Neighborhood, Nk(x)
Given an example x and a training dataset O, the k-neighborhood of x, denoted Nk(x), consists of the k nearest neighbors of x in O.

k-NN rule
Given Nk(x), the k-NN rule for x states:
Assign to x the majority class label in Nk(x).
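
To make the rule concrete, here is a minimal Python sketch; the numeric feature vectors, Euclidean distance, and numpy arrays are assumptions, not part of the slide:

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=3):
    """Assign to x the majority class label among its k nearest neighbors N_k(x)."""
    d = np.linalg.norm(X_train - x, axis=1)   # distance from x to every training example
    nk = np.argsort(d)[:k]                    # indices of the k-neighborhood N_k(x)
    return Counter(y_train[i] for i in nk).most_common(1)[0][0]
```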
k-NN Classifiers

Advantages
- Easy to implement
- Accuracy typically high
- Complex decision boundaries

Disadvantages
- Need to store the training set; the dataset is the model!
- Slow
2. Data Editing

1. Introduction
2. Data Editing
   - Background: Data Editing Algorithms
     - Wilson Editing
     - Multi-Edit
     - Representative-based Supervised Clustering Editing
3. Supervised Taxonomy & The STAXAC Algorithm
4. HC-edit
5. Related Work
6. Summary and Conclusion

2. Data Editing

Problem Definition:
Given a dataset O,
1. Remove "bad" examples from O:
   Oedited ← O \ {"bad" examples}
2. Use Oedited as the model for a k-NN classifier

The goal of the presented paper is to improve the accuracy of k-NN classifiers by developing new data editing approaches.

Background: Some Data Editing Algorithms

Wilson Editing [Wilson 72]
Wilson editing relies on the idea that if an example is erroneously classified using the k-NN rule, it has to be removed from the training set.

Multi-Edit [Devijver 80]
The algorithm repeatedly applies Wilson editing to m random subsets of the original dataset until no more examples are removed.

(Representative-based) Supervised Clustering Editing [Eick 2004]
Use a representative-based supervised clustering approach to cluster the data; delete all non-representative examples.
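
As an illustration, a minimal sketch of Wilson editing in Python, reusing the hypothetical knn_classify helper from the earlier sketch; X and y are assumed to be numpy arrays:

```python
import numpy as np

def wilson_edit(X, y, k=3):
    """Keep only examples that the k-NN rule classifies correctly (leave-one-out)."""
    keep = []
    for i in range(len(X)):
        mask = np.arange(len(X)) != i                      # leave example i out
        if knn_classify(X[i], X[mask], y[mask], k) == y[i]:
            keep.append(i)                                 # i agrees with its neighborhood
    return X[keep], y[keep]
```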
Examples: Wilson Editing

- Developed by Wilson in 1972
- Removes points that do not agree with the majority of their k nearest neighbours

[Figure: two examples (an earlier example and one with overlapping classes), each showing the original data next to the result of Wilson editing with k = 7.]

Background: Data Editing Algorithms

Problem with Wilson editing:
Excessive removal of examples, especially in the decision boundary areas.

[Figure: (a) the original dataset with its natural class boundary; (b) the Wilson editing result, in which the natural boundary is displaced to a new boundary.]

Supervised Clustering

Goal: Obtain class-uniform clusters, i.e., clusters dominated by instances of a single class.

[Figure: traditional clustering vs. supervised clustering of the same dataset.]

Representative-based Supervised Clustering Editing

[Figure: a two-attribute dataset partitioned into four clusters (1-4) around representatives; the clustering maximizes purity.]

Objective: Find a set of objects OR in the dataset O to be clustered, such that the clustering X obtained by using the objects in OR as representatives minimizes q(X); e.g. the following q(X):

q(X) := Σi purity(Ci) · |Ci|^α, with α > 1
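
A small sketch of this objective in Python; alpha = 1.3 is an arbitrary illustrative value, and clusters are assumed to be given as lists of example indices:

```python
from collections import Counter

def q(clusters, y, alpha=1.3):
    """Evaluate q(X) = sum_i purity(C_i) * |C_i|**alpha over a clustering."""
    total = 0.0
    for c in clusters:
        counts = Counter(y[i] for i in c)
        purity = max(counts.values()) / len(c)   # fraction of the majority class in C_i
        total += purity * len(c) ** alpha
    return total
```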
Supervised Taxonomy & The STAXAC Algorithm

1. Introduction
2. Editing Algorithms
3. Supervised Taxonomy & The STAXAC Algorithm
4. HC-edit
5. Related Work
6. Summary and Conclusion

3. Supervised Taxonomy

Supervised Taxonomy (ST)
Supervised taxonomies are generated by considering background information about class labels in addition to distance metrics, and they are capable of capturing class-uniform regions in a dataset.

Uses of Supervised Taxonomies

- Dataset editing (this talk)
- Bioinformatics (e.g. trying to identify interesting subclasses of known diseases)
- Clustering
- Meta learning: create useful background knowledge from datasets by developing algorithms that operate on supervised taxonomies

The STAXAC Algorithm

Algorithm 1: STAXAC (Supervised TAXonomy Agglomerative Clustering)
Input: examples with class labels and their distance matrix D
Output: hierarchical clustering (dendrogram)

1. Start with a clustering X of one-object clusters.
2. ∀C*, C' ∈ X: merge-candidate(C*, C') ⇔ (1-NN_X(C*) = C' or 1-NN_X(C') = C*)
3. WHILE there are merge-candidates (C*, C') left
   BEGIN
   a. Merge the pair of merge-candidates (C*, C') obtaining a new cluster C = C* ∪ C' and a new clustering X' for which Purity(X') has the largest value
   b. Update merge-candidates: ∀C'': merge-candidate(C'', C) ⇔ (merge-candidate(C'', C*) or merge-candidate(C'', C'))
   c. Extend the dendrogram by drawing edges from C* and C' to C
   END
4. Return the constructed dendrogram
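
A minimal runnable sketch of STAXAC in Python. Two details are assumptions the slide leaves open: cluster-to-cluster distance is taken as single-link over the matrix D, and purity ties are broken arbitrarily. The sketch also records per-node purity, which HC-edit consumes later:

```python
import numpy as np
from collections import Counter

def clustering_purity(clusters, y, n):
    """Purity of a clustering: majority-class members summed over all clusters / n."""
    return sum(max(Counter(y[i] for i in c).values()) for c in clusters) / n

def staxac(D, y):
    n = len(y)
    clusters = {i: [i] for i in range(n)}        # cluster id -> member indices
    dendrogram, node_purity = {}, {}             # new id -> (child, child); id -> purity
    def dist(a, b):                              # single-link distance between clusters
        return min(D[i][j] for i in clusters[a] for j in clusters[b])
    def one_nn(a):                               # 1-NN of cluster a among the others
        return min((c for c in clusters if c != a), key=lambda c: dist(a, c))
    # Step 2: initial merge candidates come from the 1-NN graph
    cand = {frozenset((a, one_nn(a))) for a in clusters} if n > 1 else set()
    next_id = n
    while cand:                                  # Step 3
        def purity_after(p):                     # purity of X' if the pair p is merged
            a, b = tuple(p)
            rest = [c for i, c in clusters.items() if i not in p]
            return clustering_purity(rest + [clusters[a] + clusters[b]], y, n)
        best = max(cand, key=purity_after)       # 3a: merge maximizing Purity(X')
        a, b = tuple(best)
        members = clusters.pop(a) + clusters.pop(b)
        clusters[next_id] = members
        dendrogram[next_id] = (a, b)             # 3c: extend the dendrogram
        node_purity[next_id] = max(Counter(y[i] for i in members).values()) / len(members)
        # 3b: former neighbors of C* or C' become neighbors of the new cluster C
        touching = {p for p in cand if p & {a, b}}
        neighbors = {c for p in touching for c in p} - {a, b}
        cand = (cand - touching) | {frozenset((c, next_id)) for c in neighbors}
        next_id += 1
    return dendrogram, node_purity
```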
Properties of STAXAC

1. In contrast to other supervised clustering algorithms, STAXAC is a hierarchical clustering algorithm that maximizes cluster purity.
2. Individual clusterings can be extracted using different purity thresholds; HC-edit, introduced later, relies on this property.
3. In contrast to other hierarchical clustering algorithms, STAXAC conducts a wider search, merging clusters that are neighboring and not necessarily the closest two clusters.
4. STAXAC uses a 1-NN graph to determine which clusters are neighboring.
5. Proximity graphs need only be computed at the beginning of the run.
6. STAXAC could be generalized by using more powerful proximity graphs, such as Gabriel graphs, to conduct a wider search.

Proximity Graphs

Proximity graphs provide various definitions of "neighbour":

NNG ⊆ MST ⊆ RNG ⊆ GG ⊆ DT

NNG = Nearest Neighbour Graph
MST = Minimum Spanning Tree
RNG = Relative Neighbourhood Graph
GG = Gabriel Graph
DT = Delaunay Triangulation (neighbours of a 1-NN classifier)
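
For instance, Gabriel-graph neighbourhood (the generalization of STAXAC mentioned on the previous slide) has a simple test: p and q are Gabriel neighbours iff no third point lies inside the sphere whose diameter is the segment pq. A sketch, assuming the points are rows of a numpy array:

```python
import numpy as np

def gabriel_neighbours(P, i, j):
    """True iff points P[i], P[j] are neighbours in the Gabriel graph of P."""
    d2 = lambda a, b: float(np.sum((P[a] - P[b]) ** 2))
    # r lies inside the diametral sphere of (i, j) iff d(i,j)^2 > d(i,r)^2 + d(j,r)^2
    return all(d2(i, j) <= d2(i, r) + d2(j, r)
               for r in range(len(P)) if r not in (i, j))
```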
HC-edit

1. Introduction
2. Editing Algorithms
3. Supervised Taxonomy & The STAXAC Algorithm
4. HC-edit
5. Related Work
6. Summary and Conclusion

4. The HC-edit Approach
HC-Edit Approach:
1. Create a supervised taxonomy ST for dataset O using STAXAC
2. Extract a clustering from ST for a given purity threshold β
3. Remove minority examples from each cluster
4. Classification: Use k-NN weighted vote rule to classify an example
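
For step 4, the slides do not spell out the weighting scheme; the inverse-distance weighting below is one common choice and should be read as an assumption:

```python
import numpy as np
from collections import defaultdict

def knn_weighted_classify(x, X_edit, y_edit, k=3, eps=1e-12):
    """Weighted-vote k-NN over the edited dataset: closer neighbours vote more."""
    d = np.linalg.norm(X_edit - x, axis=1)
    votes = defaultdict(float)
    for i in np.argsort(d)[:k]:
        votes[y_edit[i]] += 1.0 / (d[i] + eps)   # eps guards against zero distance
    return max(votes, key=votes.get)
```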
HC-Edit Code

Inputs:
T: tree generated by a hierarchical clustering algorithm such as STAXAC (with purity information and the associated examples stored in each node/cluster) for a dataset O
β: cluster purity threshold
Output: Oedited, a dataset that is a subset of the original dataset O

Function EDIT_TREE (T)
BEGIN
  Oedited := ∅;
  EXTRACT_EXAMPLES(T);
  RETURN Oedited;
END

Function EXTRACT_EXAMPLES (T)
BEGIN
  IF T = NULL RETURN
  IF purity(T) ≥ β
    Add the majority-class examples of T to Oedited
  ELSE
    EXTRACT_EXAMPLES(T.right)
    EXTRACT_EXAMPLES(T.left)
END
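
The same logic as a runnable Python sketch; the node fields .left, .right, .purity, and .examples (a list of (example, label) pairs) are assumed names, not from the paper:

```python
from collections import Counter

def edit_tree(T, beta):
    """Return Oedited: majority examples of the maximal clusters with purity >= beta."""
    O_edited = []
    def extract(node):
        if node is None:
            return
        if node.purity >= beta:
            maj = Counter(label for _, label in node.examples).most_common(1)[0][0]
            O_edited.extend(ex for ex in node.examples if ex[1] == maj)
        else:
            extract(node.right)
            extract(node.left)
    extract(T)
    return O_edited
```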
Top Clusters of Size ≥ 20 Extracted from T

[Figures: Boston Housing, class distribution of the clusters of size ≥ 20 extracted from the STAXAC tree T for purity thresholds β = 1, 0.96, 0.95, 0.948, 0.86, and 0.75. Each bar chart plots, for every cluster (x-axis: clusters labeled with their majority class, A or B), the number of majority and the number of minority examples (y-axis: # of examples per cluster). As β decreases, the extracted clusters become fewer and larger and contain more minority examples.]

How to Choose β?

- If β is set to 1, no editing occurs.
- As β decreases, more objects are removed in the editing process, and the obtained clusters contain more instances.
- We also observed that using very low values for β leads to low accuracy, as clusters get too contaminated by examples belonging to other classes.
- Moreover, only a finite subset of β values needs to be considered: a subset of the node purities that occur in the supervised taxonomy. Basically, only a finite number of clusterings can be extracted from the supervised taxonomy, and for many purity values the extracted supervised clusterings are the same.
- Therefore, one could set a lower bound βr for β, use n-fold cross-validation for the purities that occur in the tree in [βr, 1), and choose the purity value that leads to the highest accuracy for the editing of the dataset.
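
A sketch of this selection procedure, reusing the hypothetical edit_tree sketch from above; evaluate_accuracy stands in for n-fold cross-validation of the k-NN classifier on the edited dataset:

```python
def choose_beta(T, beta_r, evaluate_accuracy):
    """Pick the node purity in [beta_r, 1) whose editing gives the best accuracy."""
    purities = set()
    def collect(node):
        if node is None:
            return
        purities.add(node.purity)
        collect(node.left)
        collect(node.right)
    collect(T)
    candidates = sorted(p for p in purities if beta_r <= p < 1)
    return max(candidates, key=lambda b: evaluate_accuracy(edit_tree(T, b)))
```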
Experimental Results

[Tables: prediction accuracy (in %) for k = 1, 3, 5, 7 on the E. Coli, BEV, Vot, Bld, AEV, BEE, and Bos (Boston Housing) datasets, comparing HC-edit-STAXAC, HC-edit-UPGMA, Wilson Probing (WilsonProb), Wilson editing, and plain k-NN. For the HC-edit columns, the number in parentheses is the purity threshold β (in %) for which the accuracy was obtained; numbers in brackets are the percentages of training examples removed by editing.]

For example, the BEV results:

      HC-edit-STAXAC     HC-edit-UPGMA      WilsonProb      Wilson          k-NN
K=1   81.35 (90)[6.00]   82.64 (80)[14.17]  83.6  [19.96]   83.6  [19.96]   79.1
K=3   81.03 (90)[6.00]   81.99 (80)[14.17]  82.64 [16.92]   81.79 [16.75]   81.67
K=5   81   (100)[0.00]   83.28 (80)[14.17]  81.67 [16.50]   79.74 [18.60]   80.39
K=7   81.35 (90)[6.00]   83.28 (80)[14.17]  81.35 [16.25]   80.06 [18.60]   80.39

5. Related Work

1. Devijver, P., Kittler, J.: On the Edited Nearest Neighbor Rule. In: Proc. Pattern Recognition, IEEE, vol. 1, pp. 72-80 (1980) → Multi-Edit
2. Eick, C.F., Zeidat, N., Vilalta, R.: Using Representative-Based Clustering for Nearest Neighbor Dataset Editing. In: ICDM 2004, pp. 375-378 (2004) → RBSCE
3. Wilson, D.L.: Asymptotic Properties of Nearest Neighbor Rules Using Edited Data. IEEE Transactions on Systems, Man, and Cybernetics 2, 408-420 (1972) → Wilson Editing
4. Saitou, N., Nei, M.: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4(4) (1987) → UPGMA
5. Vázquez, F., Sánchez, J., Pla, F.: A stochastic approach to Wilson's editing algorithm. In: LNCS, vol. 3523, pp. 35-42. Springer, Heidelberg (2005) → Wilson Probing

Summary and Conclusion

1. Introduction
2. Editing Algorithms
3. Supervised Taxonomy & The STAXAC Algorithm
4. HC-edit
5. Related Work
6. Summary and Conclusion

6. Summary and Conclusion

- We proposed HC-edit, which identifies high-purity regions in the dataset using supervised clustering and removes minority examples from the identified regions. HC-edit takes as input a hierarchical clustering augmented with purity information in the nodes.
- In general, the supervised taxonomy can be viewed as a "useful" intermediate result which allows multiple supervised clusterings to be inspected before selecting the "best" one.
- Traditional k-nearest neighbor methods use the parameter k for both editing and classification. By allowing two parameters for modeling (purity for editing, and k for classification), HC-edit provides a greater landscape for model selection.
- The experiments reveal that HC-edit improves accuracy for many datasets, usually removing fewer training examples than other methods.

Future Work

- Explore more sophisticated techniques to select purity thresholds for extracting "good" supervised clusterings for editing and meta learning.
- Generalize STAXAC to use more complicated proximity graphs, such as Gabriel graphs.
- Explore the usefulness of supervised taxonomies for other applications.

Multi-Edit

Multi-Edit repeatedly applies Wilson editing:
A. DIFFUSION: Divide the dataset O into N ≥ 3 random subsets S1, ..., SN.
B. CLASSIFICATION: Classify Si using the k-NN rule with S((i+1) mod N) as the training set (i = 1, ..., N).
C. EDITING: Discard all incorrectly classified examples.
D. CONFUSION: Replace O by the subset Or consisting of the union of all remaining examples in S1, ..., SN.
E. TERMINATION: If the last iteration produced no editing, terminate; otherwise go to step A.
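
A compact Python sketch of these steps, reusing the hypothetical knn_classify helper from the introduction sketch; X and y are assumed to be numpy arrays:

```python
import numpy as np

def multi_edit(X, y, N=3, k=1, seed=0):
    rng = np.random.default_rng(seed)
    idx = np.arange(len(X))
    while True:
        rng.shuffle(idx)                          # A: diffusion into N random subsets
        subsets = np.array_split(idx, N)
        keep = []
        for i, S in enumerate(subsets):           # B: classify S_i with S_((i+1) mod N)
            train = subsets[(i + 1) % N]
            keep += [j for j in S                 # C: editing, discard misclassified
                     if knn_classify(X[j], X[train], y[train], k) == y[j]]
        if len(keep) == len(idx):                 # E: terminate when nothing was removed
            return X[keep], y[keep]
        idx = np.array(keep)                      # D: confusion, pool the survivors
```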
Extracting a Clustering from a STAXAC-Generated Tree

Algorithm 3: ExtractClustering (T, β)
1. Inputs: taxonomy tree T; user-defined purity threshold β
2. Output: clustering X
3. Function ExtractClustering (T, β)
4.   IF (T = NULL)
5.     RETURN ∅
6.   IF T.purity ≥ β
7.     RETURN T
8.   ELSE
9.     RETURN ExtractClustering(T.left, β) ∪ ExtractClustering(T.right, β)
10. End Function