A dynamic-programming algorithm for hierarchical discretization of continuous attributes
Amit Goyal (15th April 2008)
Department of Computer Science
The University of British Columbia
Reference
Ching-Cheng Shen and Yen-Liang Chen. A dynamic-programming algorithm for hierarchical discretization of continuous attributes. European Journal of Operational Research, 184 (2008), 636-651. Elsevier.
Overview
Motivation
Background
Why do we need discretization?
Related Work
DP Solution
Analysis
Conclusion
Motivation
Situation: The attrition rate for mobile phone customers is around 25-30% per year
Task:
Given customer information for the past N months, predict who is likely to attrite next month
Also estimate customer value and the most cost-effective offer to make to this customer
Customer Attributes:
Age
Gender
Location
Phone bills
Income
Occupation
Pattern Discovery
Transaction data:
t1: Beef, Chicken, Milk
t2: Beef, Cheese
t3: Cheese, Boots
t4: Beef, Chicken, Cheese
t5: Beef, Chicken, Clothes, Cheese, Milk
t6: Chicken, Clothes, Milk
t7: Chicken, Milk, Clothes
Assume:
min_support = 30%
min_confidence = 80%
An example frequent itemset:
{Chicken, Clothes, Milk} [sup = 3/7]
Association rules from the itemset:
Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]
…
…
Clothes, Chicken → Milk [sup = 3/7, conf = 3/3]
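As a quick check of the numbers above, here is a minimal Python sketch (the set-based transaction encoding is an assumption) that recomputes support and confidence for the example itemset and rule:

```python
# Transactions t1-t7 from the slide, encoded as sets of items.
transactions = [
    {"Beef", "Chicken", "Milk"},
    {"Beef", "Cheese"},
    {"Cheese", "Boots"},
    {"Beef", "Chicken", "Cheese"},
    {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},
    {"Chicken", "Clothes", "Milk"},
    {"Chicken", "Milk", "Clothes"},
]

def support(itemset):
    # Fraction of transactions containing every item of the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # Support of the whole rule divided by the support of its antecedent.
    return (sum((antecedent | consequent) <= t for t in transactions)
            / sum(antecedent <= t for t in transactions))

print(support({"Chicken", "Clothes", "Milk"}))       # 3/7 ≈ 0.43
print(confidence({"Clothes"}, {"Milk", "Chicken"}))  # 3/3 = 1.0
```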
Issues with Numeric Attributes
The size of the discretized intervals affects support & confidence:
{Occupation = SE, (Income = $70,000)} → {Attrition = Yes}
{Occupation = SE, (60K ≤ Income ≤ 80K)} → {Attrition = Yes}
{Occupation = SE, (0K ≤ Income ≤ 1B)} → {Attrition = Yes}
If intervals are too small, rules may not have enough support
If intervals are too large, rules may not have enough confidence
Loss of Information (How to minimize?)
Potential solution: use all possible intervals
Too many rules!!!
Background
Discretization
reduce the number of values for a given
continuous attribute by dividing the range of the
attribute into intervals.
Concept hierarchies
reduce the data by collecting and replacing low
level concepts (such as numeric values for the
attribute age) by higher level concepts (such as
young, middle-aged, or senior).
Why do we need discretization?
Data Warehousing and Mining
Data reduction
Association Rule Mining
Sequential Patterns Mining
In some machine learning algorithms like
Bayesian approaches and Decision Trees.
Granular Computing
Related Work
Manual
Equal-Width Partition
Equal-Depth Partition
Chi-Square Partition
Entropy Based Partition
Clustering
Simple Discretization Methods: Binning
Equal-width (distance) partitioning:
It divides the range into N intervals of equal size:
uniform grid
If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B − A)/N.
The most straightforward
Equal-depth (frequency) partitioning:
It divides the range into N intervals, each containing approximately the same number of samples
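A minimal sketch of the two binning schemes (the function names and the choice to return only interior cut points are assumptions):

```python
def equal_width_cuts(values, n_bins):
    # Equal-width: split [A, B] into n_bins intervals of width W = (B - A) / n_bins.
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]

def equal_depth_cuts(values, n_bins):
    # Equal-depth: choose cut points so each bin holds roughly the same number of samples.
    ordered = sorted(values)
    step = len(ordered) / n_bins
    return [ordered[int(i * step)] for i in range(1, n_bins)]
```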
Chi-Square Based Partitioning
Χ2 (chi-square) test:
Χ2 = Σ (Observed − Expected)^2 / Expected
The larger the Χ2 value, the more likely the variables
are related
Merge: Find the best neighboring intervals and merge
them to form larger intervals recursively
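The merge step can be read as a ChiMerge-style bottom-up procedure. Below is a rough sketch under assumptions of my own (each interval is a (lower_boundary, per-class counts) pair; the stopping threshold is a chosen Χ2 cutoff), not the exact method from the slide:

```python
def chi2_adjacent(counts_a, counts_b):
    # Chi-square statistic for two adjacent intervals, given per-class observation counts.
    cols = [a + b for a, b in zip(counts_a, counts_b)]     # class totals
    n = sum(cols)
    chi2 = 0.0
    for row in (counts_a, counts_b):
        row_total = sum(row)
        for observed, col_total in zip(row, cols):
            expected = row_total * col_total / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return chi2

def chimerge(intervals, threshold):
    # Repeatedly merge the adjacent pair with the smallest chi-square value,
    # stopping once even the most similar pair exceeds the threshold.
    intervals = list(intervals)   # items: (lower_boundary, per-class counts)
    while len(intervals) > 1:
        scores = [chi2_adjacent(intervals[i][1], intervals[i + 1][1])
                  for i in range(len(intervals) - 1)]
        best = min(range(len(scores)), key=scores.__getitem__)
        if scores[best] > threshold:
            break
        boundary, left_counts = intervals[best]
        _, right_counts = intervals[best + 1]
        merged = [x + y for x, y in zip(left_counts, right_counts)]
        intervals[best:best + 2] = [(boundary, merged)]
    return intervals
```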
Entropy Based Partition
Given a set of samples S, if S is partitioned into two
intervals S1 and S2 using boundary T, the entropy
after partitioning is
E(S, T) = (|S1| / |S|) · Ent(S1) + (|S2| / |S|) · Ent(S2)
The boundary that minimizes the entropy function
over all possible boundaries is selected as a binary
discretization.
The process is recursively applied to partitions
obtained until some stopping criterion is met
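A minimal sketch of one level of the entropy-based split, assuming labelled samples given as (value, class) pairs and mid-point boundaries:

```python
import math
from collections import Counter

def ent(labels):
    # Class entropy of a set of labels.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_boundary(samples):
    # Return the boundary T minimizing E(S, T) = |S1|/|S| * Ent(S1) + |S2|/|S| * Ent(S2).
    samples = sorted(samples)                      # (value, class) pairs
    n = len(samples)
    best_t, best_e = None, float("inf")
    for cut in range(1, n):
        if samples[cut][0] == samples[cut - 1][0]:
            continue                               # only split between distinct values
        left = [c for _, c in samples[:cut]]
        right = [c for _, c in samples[cut:]]
        e = (len(left) / n) * ent(left) + (len(right) / n) * ent(right)
        if e < best_e:
            best_t = (samples[cut - 1][0] + samples[cut][0]) / 2
            best_e = e
    return best_t, best_e
```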
Clustering
Partition data set into clusters based on similarity, and store
cluster representation (e.g., centroid and diameter) only
Can be very effective if data is clustered but not if data is
“smeared”
Can have hierarchical clustering and be stored in multidimensional index tree structures
There are many choices of clustering definitions and
clustering algorithms
Weaknesses
Seek a locally optimal solution instead of a global optimum
Subject to the constraint that each interval can only be partitioned into a fixed number of sub-intervals
The constructed tree may be unbalanced
Notations
val(i): value of ith data
num(i): number of occurrences of value val(i)
R: depth of the output tree
ub: upper bound on the number of subintervals spawned from an interval
lb: lower bound on that number
Example
R = 2, lb = 2, ub = 3
Problem Definition
Given parameters R, ub, and lb and input data val(1),
val(2), …, val(n) and num(1), num(2), … num(n), our
goal is to build a minimum volume tree subject to the
constraints that all leaf nodes must be at level R and that
the branch degree must be between lb and ub
Distances and Volume
Intra-distance of a node containing data from data i to data j:
intradist(i, j) = Σ_{x=i}^{j} |val(x) − mean(i, j)| · num(x)
Inter-distance between two adjacent siblings, the first node containing data i to u and the second containing data u+1 to j:
interdist(i, j, u) = (val(u+1) − val(u)) · totalnum(i, j)
= (val(u+1) − val(u)) · (totalnum(i, u) + totalnum(u+1, j))
= interdist_L(i, u) + interdist_R(u+1, j)
The volume of a tree is the total intra-distance minus the total inter-distance in the tree
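A small sketch of the two distance measures (0-indexed Python lists; mean(i, j) is assumed to be the num-weighted mean of val(i..j)):

```python
def intradist(vals, nums, i, j):
    # Sum of |val(x) - mean(i, j)| * num(x) over x = i..j.
    total = sum(nums[i:j + 1])
    mean = sum(v * c for v, c in zip(vals[i:j + 1], nums[i:j + 1])) / total
    return sum(abs(v - mean) * c for v, c in zip(vals[i:j + 1], nums[i:j + 1]))

def interdist(vals, nums, i, j, u):
    # Inter-distance between adjacent siblings [i..u] and [u+1..j]:
    # (val(u+1) - val(u)) * totalnum(i, j).
    return (vals[u + 1] - vals[u]) * sum(nums[i:j + 1])
```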
Theorem
The volume of a tree = the intra-distance of the
root node + the volumes of all its sub-trees - the
inter-distances among its children
Notations
T*(i,j,r): the minimum volume tree that
contains data from data i to data j and has
depth r
T(i,j,r,k): the minimum volume tree that
contains data from data i to data j, has depth
r, and whose root has k branches
D*(i,j,r): the volume of T*(i,j,r)
D(i,j,r,k): the volume of T(i,j,r,k)
Notations Cont.
QD(i, j, r, k): min{ Σ_{v=1}^{k} (the volume of the vth node) − Σ_{v=1}^{k−1} (the inter-distance b/w the vth node and the (v+1)th node) }, minimized over all ways of splitting data i..j into k depth-r subtrees
Notations Cont.
QD_M(i, j, r, k): min{ Σ_{v=1}^{k} (the volume of the vth node) − Σ_{v=1}^{k−1} (the inter-distance b/w the vth node and the (v+1)th node) − interdist_R(i, u) }, where u is the right endpoint of the first (leftmost) node
Algorithm
QD_M(i, j, r, k) = min_{i ≤ u < j} { D*(i, u, r) + QD_M(u+1, j, r, k−1) − interdist_R(i, u) − interdist_L(i, u) }
Algorithm Cont.
QD(i, j, r, k) = min_{i ≤ u < j} { D*(i, u, r) + QD_M(u+1, j, r, k−1) − interdist_L(i, u) }
Algorithm Cont.
D(i, j, r, k) = intradist(i, j) + QD(i, j, r−1, k)
The complete DP algorithm
QD_M(i, j, r, k) = min_{i ≤ u < j} { D*(i, u, r) + QD_M(u+1, j, r, k−1) − interdist_R(i, u) − interdist_L(i, u) }
QD(i, j, r, k) = min_{i ≤ u < j} { D*(i, u, r) + QD_M(u+1, j, r, k−1) − interdist_L(i, u) }
D(i, j, r, k) = intradist(i, j) + QD(i, j, r−1, k)
D*(i, j, r) = min_{lb ≤ k ≤ ub} { D(i, j, r, k) }
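A memoized Python sketch of the recurrences above. Everything beyond the formulas is an assumption on my part: 0-indexed data, interdist_R(i, u) taken as (val(i) − val(i−1)) · totalnum(i, u), the base cases QD(i, j, r, 1) = D*(i, j, r), QD_M(i, j, r, 1) = D*(i, j, r) − interdist_R(i, j), and D*(i, j, 1) = intradist(i, j). It returns only the minimum volume, not the tree, and infeasible splits come out as infinity.

```python
from functools import lru_cache

def min_volume(vals, nums, R, lb, ub):
    # vals: sorted distinct values; nums: their occurrence counts.
    n = len(vals)
    INF = float("inf")

    # Prefix sums so totalnum and weighted means are cheap.
    cnt = [0] * (n + 1)
    wsum = [0.0] * (n + 1)
    for x in range(n):
        cnt[x + 1] = cnt[x] + nums[x]
        wsum[x + 1] = wsum[x] + vals[x] * nums[x]

    def totalnum(i, j):
        return cnt[j + 1] - cnt[i]

    def intradist(i, j):
        mean = (wsum[j + 1] - wsum[i]) / totalnum(i, j)
        return sum(abs(vals[x] - mean) * nums[x] for x in range(i, j + 1))

    def interdist_L(i, u):            # gap to the right of data u
        return (vals[u + 1] - vals[u]) * totalnum(i, u)

    def interdist_R(i, u):            # gap to the left of data i (assumed form)
        return (vals[i] - vals[i - 1]) * totalnum(i, u)

    @lru_cache(maxsize=None)
    def Dstar(i, j, r):
        if r == 1:                    # a leaf node: volume = its intra-distance
            return intradist(i, j)
        return min((intradist(i, j) + QD(i, j, r - 1, k)
                    for k in range(lb, ub + 1)), default=INF)

    @lru_cache(maxsize=None)
    def QD(i, j, r, k):
        if k == 1:
            return Dstar(i, j, r)
        return min((Dstar(i, u, r) + QD_M(u + 1, j, r, k - 1) - interdist_L(i, u)
                    for u in range(i, j)), default=INF)

    @lru_cache(maxsize=None)
    def QD_M(i, j, r, k):
        if k == 1:
            return Dstar(i, j, r) - interdist_R(i, j)
        return min((Dstar(i, u, r) + QD_M(u + 1, j, r, k - 1)
                    - interdist_R(i, u) - interdist_L(i, u)
                    for u in range(i, j)), default=INF)

    return Dstar(0, n - 1, R)

# Example call with made-up data and the slide's parameters R = 2, lb = 2, ub = 3:
# min_volume([1, 2, 4, 7, 8, 11], [3, 1, 2, 2, 1, 1], R=2, lb=2, ub=3)
```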
(Figure: Volume of trees constructed)
(Figure: Gain ratios of different algorithms, Money Spent Monthly)
(Figure: Run times of different algorithms)
(Figure: Gain ratios of different algorithms, Monthly Household Income)
Conclusion
Global optima instead of local optima
Each interval is partitioned into the most
appropriate number of subintervals
Trees are balanced
Time complexity is cubic, so it is somewhat slower than the simpler heuristic methods
http://www.cs.ubc.ca/~goyal
([email protected])
Thank you !!!
Gain Ratio
entropy(T) = − Σ_{i=1}^{n} p_i · log2(p_i)
purity(S1, S2, ..., Sr) = Σ_{i=1}^{r} (|Si| / |S|) · entropy(Si)
The information gain due to a particular split of S into Si, i = 1, 2, ..., r:
Gain(S, {S1, S2, ..., Sr}) = purity(S) − purity(S1, S2, ..., Sr)
IntrinsicInfo(S, A) = − Σ_i (|Si| / |S|) · log2(|Si| / |S|)
GainRatio(S, A) = Gain(S, A) / IntrinsicInfo(S, A)
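A short sketch of these quantities, with purity defined exactly as above (the weighted entropy of the split); note that purity of the unsplit set S equals entropy(S):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def purity(parts):
    # Weighted entropy of a split: sum over parts of |Si|/|S| * entropy(Si).
    total = sum(len(p) for p in parts)
    return sum((len(p) / total) * entropy(p) for p in parts)

def gain_ratio(parent_labels, parts):
    # Gain(S, {S1..Sr}) divided by IntrinsicInfo(S, A).
    gain = entropy(parent_labels) - purity(parts)
    total = len(parent_labels)
    intrinsic = -sum((len(p) / total) * math.log2(len(p) / total) for p in parts)
    return gain / intrinsic

# e.g. gain_ratio(["y", "y", "n", "n"], [["y", "y"], ["n", "n"]]) == 1.0
```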
Chi-Square Test Example
              Heads   Tails   Total
Observed        53      47     100
Expected        50      50     100
(O − E)^2        9       9
(O − E)^2/E   0.18    0.18    0.36 = Χ2
To see whether this result is statistically significant, the P-value (the probability of seeing a deviation at least this large by chance alone) must be calculated or looked up in a chart. The P-value is found to be Prob(Χ2 ≥ 0.36) = 0.5485 for 1 degree of freedom. There is thus a probability of about 55% of seeing data that deviates at least this much from the expected results if the coin is indeed fair. Hence, we cannot reject the hypothesis that the coin is fair.
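The arithmetic can be checked with a few lines of Python; for one degree of freedom, Prob(Χ2 ≥ x) = erfc(sqrt(x / 2)):

```python
import math

observed = [53, 47]
expected = [50, 50]

# Chi-square statistic: sum of (O - E)^2 / E over the two cells.
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# P-value for 1 degree of freedom.
p_value = math.erfc(math.sqrt(chi2 / 2))

print(chi2, round(p_value, 4))   # 0.36 0.5485
```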