Evaluating a clustering solution: An application in the tourism market

Download Report

Transcript Evaluating a clustering solution: An application in the tourism market

Mining Time-Changing Data
Streams
Advisor: Dr. Hsu
Graduate: Yung-Chu Lin
2002/3/6
IDS Lab Seminar
Outline










Motivation
Objective
Hoeffding Bounds
The VFDT Algorithm
The CVFDT Algorithm
Window Size
Time and Space Complexity
Empirical Study
Conclusion
Opinion
2002/3/6
IDS Lab Seminar
Motivation


The volume and time span of
accumulated data for future use
Real large database is not random
sample drawn from a stationary
2002/3/6
IDS Lab Seminar
Objective

Solving the classification problem:
concept drift
2002/3/6
IDS Lab Seminar
Hoeffding Bounds



n examples
confidence 1-δ
variable r, range R


2002/3/6
probability: R=1
information gain: R=log C
IDS Lab Seminar
Concept of VFDT Algorithm


Initialize the HT
Repeat



2002/3/6
Scan nmin examples (a window size)
Compute Gain(x) of every attribute
Split the leaf node or not
IDS Lab Seminar
The VFDT Algorithm
2002/3/6
IDS Lab Seminar
Example for Algorithms
Day
Outlook
Temperature
Humidity
Wind
PlayTennis
D1
Sunny
Hot
High
Weak
No
D2
Sunny
Hot
High
Strong
No
D3
Overcast
Hot
High
Weak
Yes
D4
Rain
Mild
High
Weak
Yes
D5
Rain
Cool
Normal
Weak
Yes
D6
Rain
Cool
Normal
Strong
No
D7
Overcast
Cool
Normal
Strong
Yes
D8
Sunny
Mild
High
Weak
No
D9
Sunny
Cool
Normal
Weak
Yes
D10
Rain
Mild
Normal
Weak
Yes
D11
Sunny
Mild
Normal
Strong
Yes
D12
Overcast
Mild
High
Strong
Yes
D13
Overcast
Hot
Normal
Weak
Yes
Rain
Mild
Strong
No
D14
2002/3/6
High
IDS Lab Seminar
Concept of CVFDT Algorithm




An extension to VFDT, which adds the ability to
detect and respond to changes in examplegenerating
Not need to learn a new model from scratch every
time a new example arrives
Scan HT and alternate trees periodically  look for
internal nodes whose sufficient statistics indicate
better attribute
When alternate subtree becomes more accurate, the
old subtree is replaced by the new one
2002/3/6
IDS Lab Seminar
The CVFDT Algorithm
2002/3/6
IDS Lab Seminar
The CVFDTGrow Procedure
2002/3/6
IDS Lab Seminar
The ForgetExample and
CheckSplitValidity procedure
2002/3/6
IDS Lab Seminar
Window Size


One windows size w will not be appropriate
for every concept and every type of drift
Shrink w when



many of the nodes in HT become questionable at
once
or in response to a rapid change in data rate
Increase w when

2002/3/6
Few questionable node  concept is stable
IDS Lab Seminar
Time and Space Complexity
VFDT
CVFDT
time
O(lvdvc)
O(lcdvc)
Space
O(ndvc)
n: #nodes in CVFDT’s main tree and alternate trees
d: #attributes
v: max number of values per attribute
c: #class
lv,lc: longest path
2002/3/6
IDS Lab Seminar
Empirical Study


Synthetic data
Web data
2002/3/6
IDS Lab Seminar
Synthetic Data

A hyperplane in d-dimensional space is
the set of points x that satisfy
d
w x
i 1


i i
 w0
where xi is the ith coordinate of x.
d
d
positive:  wi xi  w0 negative:  wi xi  w0
i 1
2002/3/6
i 1
IDS Lab Seminar
Synthetic Data (cont’d)





Initialize weight to .2 except for w0
which is .25d
Substitute its coordinates into the left
hand side of Equation to obtain a sum s
|s|<=.1*w0  positive
|s|<=.2*w0  negative
xi=[0,1]
2002/3/6
IDS Lab Seminar
Synthetic Data (cont’d)




5 million training examples
δ=0.0001
f=20,000
Nmin=300;┬=0.05;w=100,000
2002/3/6
IDS Lab Seminar
Synthetic Data (cont’d)
2002/3/6
IDS Lab Seminar
Synthetic Data (cont’d)



CVFDT took 4.3 times longer than VFDT
VFDT’s average memory allocation over
the course of the run was 23MB while
CVFDT’s was 16.5MB
The average number of nodes in
VFDT’s tree was 2696 and in CVFDT’s
tree was 677(132: alternate tree, 545:
main tree)
2002/3/6
IDS Lab Seminar
Conclusion


Introducing CVFDT, learning accurate
models from the most demanding highspeed, concept-drifting data streams
Maintain a decision-tree up-to-date with
a windows of examples
2002/3/6
IDS Lab Seminar
Opinion


Many techniques have the problem,
concept drift.
So, maybe we could apply the concept
in this paper to our other techniques,
like the association rule mining,
clustering algorithms, and so on.
2002/3/6
IDS Lab Seminar