
Clustering II
Finite Mixtures
• Model data using a mixture of distributions
– Each distribution represents one cluster
– Each distribution gives probabilities of attribute values in that cluster
• Finite mixtures: a finite number of clusters
• Individual distributions are usually normal
• Distributions are combined using cluster weights
• Each normal distribution can be described in terms of μ (mean) and σ (standard deviation)
• For a single attribute with two clusters
– μA, σA for cluster A and μB, σB for cluster B
– The attribute values are obtained by combining values from cluster A with probability PA and from cluster B with probability PB
– Five parameters, μA, σA, μB, σB and PA (because PA + PB = 1), describe the attribute value distribution
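To make the five-parameter picture concrete, here is a minimal Python sketch of a two-cluster, one-attribute mixture. The parameter values (μA = 50 and so on) are invented for illustration; they are not from the slides.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical parameter values; the five parameters are
# muA, sigmaA, muB, sigmaB and pA (pB = 1 - pA follows automatically).
muA, sigmaA = 50.0, 5.0
muB, sigmaB = 65.0, 2.0
pA = 0.6
pB = 1.0 - pA

def mixture_density(x):
    """Density of the two-component Gaussian mixture at x:
    the two normals combined using the cluster weights."""
    return pA * norm.pdf(x, muA, sigmaA) + pB * norm.pdf(x, muB, sigmaB)

# Sampling from the mixture: pick a cluster by its weight, then draw
# from that cluster's normal distribution.
rng = np.random.default_rng(0)
in_A = rng.random(1000) < pA
samples = np.where(in_A,
                   rng.normal(muA, sigmaA, 1000),
                   rng.normal(muB, sigmaB, 1000))
```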
EM Algorithm
• EM = Expectation–Maximization
– Generalizes k-means to a probabilistic setting
• Input: Collection of instances and the number of clusters, k
• Output: Probabilities with which each instance belongs to each of the k clusters
• Method (sketched in code below):
– Start by guessing values for all the parameters of the k clusters (similar to guessing centroids in k-means)
– Repeat
• E 'Expectation' step: Calculate the cluster probability for each instance
• M 'Maximization' step: Estimate the distribution parameters from the cluster probabilities
• Store the cluster probabilities as instance weights
• Estimate the parameters from the weighted instances
– Until the distribution parameters converge, i.e., they predict the input data well
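The E and M steps can be written down directly for the one-attribute, two-cluster case. This is a minimal sketch, not WEKA's implementation; it runs a fixed number of iterations instead of testing convergence, and the test data is invented.

```python
import numpy as np
from scipy.stats import norm

def em_two_gaussians(x, n_iter=100):
    """EM for a two-cluster, one-attribute Gaussian mixture."""
    # Start by guessing the parameters (cf. guessing centroids in k-means).
    muA, muB = x.min(), x.max()
    sdA = sdB = x.std()
    pA = 0.5
    for _ in range(n_iter):
        # E step: cluster probability for each instance (Bayes' rule).
        densA = pA * norm.pdf(x, muA, sdA)
        densB = (1 - pA) * norm.pdf(x, muB, sdB)
        wA = densA / (densA + densB)        # stored as instance weights
        # M step: re-estimate the parameters from the weighted instances.
        muA = np.average(x, weights=wA)
        muB = np.average(x, weights=1 - wA)
        sdA = np.sqrt(np.average((x - muA) ** 2, weights=wA))
        sdB = np.sqrt(np.average((x - muB) ** 2, weights=1 - wA))
        pA = wA.mean()
    return muA, sdA, muB, sdB, pA

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(50, 5, 600), rng.normal(65, 2, 400)])
print(em_two_gaussians(x))   # recovers roughly (50, 5, 65, 2, 0.6)
```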
Incremental Clustering (Cobweb/Classit)
• Input: Collection of instances
• Output: A hierarchy of clusters
• Method:
– Start with an empty root node of the tree
– Add instances one by one
– If any of the existing leaves is a good 'host' for the incoming instance, then form a cluster with it
• A good host has high category utility (next slide)
– If required, restructure the tree
• Cobweb – for nominal attributes
• Classit – for numerical attributes
Category Utility
• Category utility (sketched in code below):
CU(C₁, C₂, …, Cₖ) = (1/k) Σₗ P[Cₗ] Σᵢ Σⱼ ( P[aᵢ = vᵢⱼ | Cₗ]² − P[aᵢ = vᵢⱼ]² )
• Computes the advantage gained in predicting the values of the attributes of instances in a cluster
– If knowing the cluster of an instance does not help in predicting the values of its attributes, then the cluster isn't worth forming
• The inner difference of squared probabilities, P[aᵢ = vᵢⱼ | Cₗ]² − P[aᵢ = vᵢⱼ]², measures this predictive gain for one attribute value
• The denominator k turns this into a per-cluster measure
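The formula transcribes directly into Python for nominal data. This is a hedged sketch: the representation of instances as dicts mapping attribute name to value is my own choice, not anything the slides prescribe.

```python
from collections import Counter

def category_utility(clusters):
    """CU(C1,...,Ck): `clusters` is a list of clusters, each a non-empty
    list of instances, each instance a dict of attribute -> value."""
    all_instances = [inst for cluster in clusters for inst in cluster]
    n = len(all_instances)
    k = len(clusters)

    def sum_sq_probs(instances):
        # sum over attributes i and values j of P[ai = vij]^2
        total = 0.0
        for a in instances[0]:
            counts = Counter(inst[a] for inst in instances)
            total += sum((c / len(instances)) ** 2 for c in counts.values())
        return total

    baseline = sum_sq_probs(all_instances)     # sum_i sum_j P[ai = vij]^2
    cu = sum(len(c) / n * (sum_sq_probs(c) - baseline) for c in clusters)
    return cu / k
```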
Weather Data with ID
ID | Outlook  | Temperature | Humidity | Windy | Play
a  | sunny    | hot         | high     | false | no
b  | sunny    | hot         | high     | true  | no
c  | overcast | hot         | high     | false | yes
d  | rainy    | mild        | high     | false | yes
e  | rainy    | cool        | normal   | false | yes
f  | rainy    | cool        | normal   | true  | no
g  | overcast | cool        | normal   | true  | yes
h  | sunny    | mild        | high     | false | no
i  | sunny    | cool        | normal   | false | yes
j  | rainy    | mild        | normal   | false | yes
k  | sunny    | mild        | normal   | true  | yes
l  | overcast | mild        | high     | true  | yes
m  | overcast | hot         | normal   | false | yes
n  | rainy    | mild        | high     | true  | no
This is artificial data, so natural clusters cannot be found (in particular, a split into one cluster of yes instances and one of no instances is not possible).
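To connect category utility to this data: assuming the `category_utility` sketch from the previous slide, we can compare two candidate 2-cluster groupings of four of these instances. The shortened attribute keys are my own abbreviation of the column names.

```python
# Four instances from the table above, keyed by their IDs.
weather = {
    'a': {'outlook': 'sunny', 'temp': 'hot',  'humidity': 'high',   'windy': 'false', 'play': 'no'},
    'b': {'outlook': 'sunny', 'temp': 'hot',  'humidity': 'high',   'windy': 'true',  'play': 'no'},
    'e': {'outlook': 'rainy', 'temp': 'cool', 'humidity': 'normal', 'windy': 'false', 'play': 'yes'},
    'f': {'outlook': 'rainy', 'temp': 'cool', 'humidity': 'normal', 'windy': 'true',  'play': 'no'},
}

grouping1 = [[weather['a'], weather['b']], [weather['e'], weather['f']]]
grouping2 = [[weather['a'], weather['e']], [weather['b'], weather['f']]]
print(category_utility(grouping1))  # 0.9375: similar instances together
print(category_utility(grouping2))  # 0.3125: dissimilar instances mixed
```

Knowing the cluster in the first grouping helps predict almost every attribute of an instance, so its CU is higher.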
Trace of Cobweb
[Figure: Cobweb trees after steps 1–3. Steps 1–2: there is no good host for the first five instances, so a:no, b:no, c:yes, d:yes and e:yes become separate leaves under the root. Step 3: e is the best host for f; e and f are similar, so the CU of an e&f cluster is high and the cluster {e:yes, f:no} is formed.]
Trace of Cobweb (Contd)
[Figure: Cobweb trees after steps 4–5. Step 4 (adding g): at the root, the e&f cluster is the best host; within e&f there is no good host, so no new sub-cluster is formed and g:yes is added to the e&f cluster (f and g are similar). Step 5 (adding h): at the root, a is the best host and d is the runner-up; before h is inserted, the runner-up d is evaluated; the CU of an a&d cluster is high, so d is merged with a to form a new cluster; within a&d there is no good host, so h:no is added to the a&d cluster.]
Trace of Cobweb (Contd)
[Figure: the final Cobweb tree over all fourteen instances a–n.]
For large data sets, growing the tree down to individual instances can lead to overfitting. A similarity threshold called the cutoff is used to suppress growth.
Hierarchical Agglomerative Clustering
• Input: Collection of instances
• Output: A hierarchy of clusters
• Method:
– Start with individual instances as clusters
– Repeat
• Merge the ‘closest’ two clusters
– Until only one cluster remains
• Ward's method: the closeness or proximity between two clusters is defined as the increase in squared error that results when the two clusters are merged
• The squared-error measure is used only for the local decision of merging clusters (see the SciPy sketch below)
– There is no global optimization
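A minimal sketch of HAC with Ward's method using SciPy (not the implementation the slides assume); each `linkage` merge is exactly the greedy, local decision described above. The two-blob data is invented for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
# Hypothetical 2-D instances around two centres.
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])

# Ward's method: each merge minimises the increase in total
# within-cluster squared error -- a local, greedy decision.
Z = linkage(X, method='ward')
dendrogram(Z)          # visualize the merge tree
plt.show()
```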
HCE
• A visual knowledge discovery tool for analysing and
understanding multi-dimensional (> 3D) data
• Offers multiple views of
– the input data and the clustered input data
– where the views are coordinated
• Many other similar tools offer only a patchwork of statistics and graphics
• HCE follows two fundamental statistical principles of
exploratory data analysis
– To examine each dimension first and then find relationships
among dimensions
– To try graphical displays first and then find numerical
summaries
GRID Principles
• GRID – graphics, ranking and interaction for
discovery
• Two principles
– Study 1D, study 2D and find features
– Ranking guides insight, statistics confirm
• These principles help users organize their knowledge
discovery process
• Because of GRID, HCE is more than R + Visualization
• GRID can be used to derive scripts that organize exploratory data analysis in R (or a similar statistics package)
Rank-by-Feature Framework
• A user interface framework based on the
GRID Principles
• The framework
– Uses interactive information visualization
techniques combined with
– statistical methods and data mining algorithms
– Enables users to examine input data in an orderly fashion
• HCE implements rank-by-feature framework
– This means
• HCE uses existing statistical and data mining methods to
analyse input data and
• communicates those results using interactive information visualization techniques
Multiple Views in HCE
• Dendrogram
• Colour mosaic
• 1D histograms
• 2D scatterplots
• And more
Dendrogram Display
• Results of HAC are shown
visually using a dendrogram
• A dendrogram is a tree
– with data items at the
terminal (leaf) nodes
– Distance from the root node
represents similarity among
leaf nodes
• Two visual controls
– Minimum similarity bar allows users to adjust the number of clusters
– Detail cut-off bar allows users to reduce clutter
[Figure: a dendrogram over leaf nodes A, B, C, D]
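The minimum similarity bar has a direct SciPy analogue: cutting the linkage tree `Z` from the earlier HAC sketch at a chosen threshold. A small sketch, assuming that `Z` is in scope:

```python
from scipy.cluster.hierarchy import fcluster

# Lowering the cut threshold t is like dragging the minimum similarity
# bar: a smaller t produces more, tighter clusters.
labels = fcluster(Z, t=4.0, criterion='distance')   # Z from the HAC sketch
# Or ask for an exact number of clusters instead of a threshold:
labels2 = fcluster(Z, t=2, criterion='maxclust')
```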
Colour Mosaic
• The input data is shown using this view
• It is a colour-coded visual display of tabular data
• Each cell in the table is painted in a colour that reflects the cell's value
• Two variations
– The layout of the mosaic is similar to the original table
– A transpose of the original layout
• HCE uses the transposed layout because data sets usually have more rows than columns
• A colour mapping control
[Figure: a 2×2 table with cells 1–4 rendered as a colour mosaic in the original layout and in the transposed layout]
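A colour mosaic in the transposed layout can be approximated in a few lines of matplotlib. This is a rough analogue of the HCE view, not its rendering code; the data and colour map are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.random((200, 6))        # 200 instances (rows), 6 attributes

# Transposed layout: table rows run along the x axis, so a data set with
# many more rows than columns still fits on screen.
plt.imshow(data.T, aspect='auto', cmap='viridis')
plt.colorbar(label='cell value')   # a stand-in for the colour mapping control
plt.show()
```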
1D Histogram Ordering
• This data view is part of the rank-by-feature
framework
• Data belonging to one column (variable) is
displayed as a histogram + box plot
– Histogram shows the scale and skewness
– Box plot shows the data distribution, center and
spread
• For the entire data set many such views are
possible
• By studying individual variables in detail
users can select the variables for other
visualizations
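Ranking the 1D views can be sketched directly. Here SciPy's skewness is used as the ranking criterion (one plausible choice among the criteria HCE offers); the column names and data are invented.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
data = {'x1': rng.normal(size=500),       # roughly symmetric
        'x2': rng.exponential(size=500),  # strongly right-skewed
        'x3': rng.uniform(size=500)}      # flat, little skew

# Rank the variables by |skewness| so the most skewed histograms are
# examined first.
ranked = sorted(data, key=lambda name: abs(skew(data[name])), reverse=True)
print(ranked)   # 'x2' should come first
```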
2D Scatter Plot Ordering
• This data view is again part of the rank-by-feature
framework
• Three categories of 2D presentations are possible
– Axes of the plot obtained from Principal Component Analysis
• Linear or non-linear combinations of original variables
– Axes of the plot obtained directly from the original variables
– Parallel coordinates
• HCE uses the second option: plotting pairs of the original variables
• Both 1D and 2D plots can be sorted according to user-selected criteria, such as the number of outliers
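Ranking the scatterplots of original-variable pairs can likewise be sketched. Here |Pearson correlation| stands in for the user-selected criterion; the data and names are invented.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
cols = {'x1': rng.normal(size=300), 'x3': rng.normal(size=300)}
cols['x2'] = 0.8 * cols['x1'] + 0.2 * rng.normal(size=300)  # tied to x1

# Rank all variable pairs by |Pearson correlation|; the most strongly
# related pair is examined first.
pairs = sorted(combinations(cols, 2),
               key=lambda p: abs(np.corrcoef(cols[p[0]], cols[p[1]])[0, 1]),
               reverse=True)
print(pairs[0])   # expected: ('x1', 'x2')
```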