
Data Discretization Unification
Ruoming Jin
Yuri Breitbart
Chibuike Muoh
Kent State University, Kent, USA
Outline
• Motivation
• Problem Statement
• Prior Art & Our Contributions
• Goodness Function Definition
• Unification
• Parameterized Discretization
• Discretization Algorithm & Its Performance
Motivation
Patients Table

Age   Success   Failure   Total
18    10        1         19
25    5         2         7
41    100       5         105
…     …         …         …
51    250       10        260
52    360       5         365
53    249       10        259
…     …         …         …
Motivation
Possible implication of the table:
– If a person is between 18 and 25, the probability of procedure success is 'much higher' than if the person is between 45 and 55.
– Is that a good rule, or is this one better: if a person is between 18 and 30, the probability of procedure success is 'much higher' than if the person is between 46 and 61?
• What is the best interval?
Motivation
• Without data discretization, some rules would be difficult to establish.
• Several existing data mining systems cannot handle continuous variables without discretization.
• Data discretization significantly improves the quality of the discovered knowledge.
• New methods of discretization are needed for tables with rare events.
• Data discretization significantly improves the performance of data mining algorithms; some studies report a tenfold increase in performance.
However: any discretization process generally leads to a loss of information. Minimizing this loss is the mark of a good discretization method.
Problem Statement
Given an input table:

Intervals     Class 1   Class 2   …   Class J   Row Sum
S1            r11       r12       …   r1J       N1
S2            r21       r22       …   r2J       N2
…             …         …         …   …         …
SI            rI1       rI2       …   rIJ       NI
Column Sum    M1        M2        …   MJ        N (Total)
Problem Statement
Obtain an output table:

Intervals     Class 1   Class 2   …   Class J   Row Sum
S1            C11       C12       …   C1J       N1
S2            C21       C22       …   C2J       N2
…             …         …         …   …         …
SI'           CI'1      CI'2      …   CI'J      NI'
Column Sum    M1        M2        …   MJ        N (Total)

where each Si is the union of k consecutive input intervals. The quality of the discretization is measured by
cost(model) = cost(data/model) + penalty(model)
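To make the interval-merging operation concrete, here is a minimal Python sketch (function and variable names are illustrative, not from the paper) of how consecutive rows of the input table are combined into one output interval by summing their class counts:

```python
# Minimal sketch: merging consecutive rows of a contingency table.
# `table` is a list of per-interval class-count rows r_i = [r_i1, ..., r_iJ];
# `cuts` lists the boundaries of the merged intervals (example values only).

def merge_rows(table, cuts):
    """Sum consecutive rows between successive cut points.

    cuts = [0, k1, k2, ..., I] partitions the I input rows into
    output intervals S_1, ..., S_I'.
    """
    merged = []
    for start, end in zip(cuts[:-1], cuts[1:]):
        row = [sum(col) for col in zip(*table[start:end])]
        merged.append(row)
    return merged

# Example: 6 input intervals, 2 classes (success/failure), merged into 3 intervals.
table = [[10, 1], [5, 2], [100, 5], [250, 10], [360, 5], [249, 10]]
print(merge_rows(table, [0, 2, 3, 6]))   # [[15, 3], [100, 5], [859, 25]]
```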
Prior Art
• Unsupervised Discretization – no class information is provided:
  – Equal-width
  – Equal-frequency
• Supervised Discretization – class information is provided with each attribute value:
  – MDLP
  – Pearson's X2 or Wilks' G2 statistic based methods
• Dougherty and Kohavi (1995) compare the unsupervised and supervised methods of Holte (1993) and the entropy-based methods of Fayyad and Irani (1993), and conclude that supervised methods give fewer classification errors than unsupervised ones, and that supervised methods based on entropy are better than other supervised methods.
Prior Art
• Several recent (2003-2006) papers introduced new discretization algorithms: Yang and Webb; Kurgan and Cios (CAIM); Boulle (Khiops).
• CAIM attempts to minimize the number of discretization intervals and, at the same time, to minimize the information loss.
• Khiops uses Pearson's X2 statistic to select merges of consecutive intervals that minimize the value of X2.
• Yang and Webb studied discretization for naïve Bayesian classifiers. They report that their method generates fewer classification errors than the alternative discretization methods that have appeared in the literature.
Our Results
• There is a strong connection between discretization methods based on statistics and those based on entropy.
• There is a parametric function such that any prior discretization method is derivable from this function by choosing at most two parameters.
• There is an optimal dynamic programming method, derived from our discretization approach, that in most of our experiments outperforms prior discretization methods.
Goodness Function Definition
(Preliminaries)
Intervals     Class 1   Class 2   …   Class J   Row Sum
S1            C11       C12       …   C1J       N1
S2            C21       C22       …   C2J       N2
…             …         …         …   …         …
SI'           CI'1      CI'2      …   CI'J      NI'
Column Sum    M1        M2        …   MJ        N (Total)
Goodness Function Definition
(Preliminaries)
• Entropy of the i-th row of the contingency table:
  H(Si) = − Σ_{j=1..J} (Cij/Ni) * log(Cij/Ni)
• Total entropy of all intervals (entropy of the contingency table):
  SL = Σ_{i=1..I'} Ni * H(Si) = − Σ_{i=1..I'} Σ_{j=1..J} Cij * log(Cij/Ni)
• Binary encoding of a row requires H(Si) binary characters per record.
• Binary encoding of the set of rows requires H(S1, S2, …, SI') binary characters per record, where H(S1, …, SI') = Σ_{i=1..I'} (Ni/N) * H(Si).
• Binary encoding of the table requires SL binary characters:
  N * H(S1, S2, …, SI') = SL
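The encoding lengths above can be computed directly from the counts. A minimal Python sketch (helper names are illustrative; the natural logarithm is used here, whereas "binary characters" would correspond to base 2 — the choice of base only rescales SL):

```python
import math

def row_entropy(row):
    """H(S_i) = -sum_j (C_ij / N_i) * log(C_ij / N_i), with 0 * log 0 = 0."""
    n_i = sum(row)
    return -sum((c / n_i) * math.log(c / n_i) for c in row if c > 0)

def table_encoding_length(table):
    """SL = sum_i N_i * H(S_i) = N * H(S_1, ..., S_I')."""
    return sum(sum(row) * row_entropy(row) for row in table)

table = [[15, 3], [100, 5], [859, 25]]
print(table_encoding_length(table))
```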
Goodness Function Definition
• Cost(Model, T) = Cost(T/Model) + Penalty(Model) (Mannila et al.)
• Cost(T/Model) is the complexity of encoding the table under the given model.
• Penalty(Model) reflects the complexity of the resulting table.
• GF_Model(T) = Cost(Model, T0) − Cost(Model, T), where T0 is the trivial single-interval table, so a larger goodness value means a lower cost for the discretization T.
Goodness Function Definition
Models To Be Considered
• MDLP (Information Theory)
• Statistical Model Selection
• Confidence Level of Rows Independence
• Gini Index
Goodness Function Examples
• Entropy (MDLP):
  Cost(MDLP, T) = SL + (I'−1) * log( N / (I'−1) ) + I'(J−1) * log J
  GF_MDLP(T) = Cost(MDLP, T0) − Cost(MDLP, T)
• Statistical Akaike (AIC):
  Cost(AIC, T) = 2*SL + 2*I'(J−1)
  GF_AIC(T) = Cost(AIC, T0) − Cost(AIC, T)
• Statistical Bayesian (BIC):
  Cost(BIC, T) = 2*SL + I'(J−1) * log N
  GF_BIC(T) = Cost(BIC, T0) − Cost(BIC, T)
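Given SL, N, I', and J, the three cost functions translate directly into code. A minimal sketch with illustrative helper names (the guard for the I' = 1 case of the MDLP penalty is an assumption; the logarithm base matches the previous sketch):

```python
import math

def cost_mdlp(sl, n, i_prime, j):
    # Cost(MDLP, T) = SL + (I'-1) log(N / (I'-1)) + I'(J-1) log J
    penalty = 0.0 if i_prime == 1 else (i_prime - 1) * math.log(n / (i_prime - 1))
    return sl + penalty + i_prime * (j - 1) * math.log(j)

def cost_aic(sl, i_prime, j):
    # Cost(AIC, T) = 2*SL + 2*I'(J-1)
    return 2 * sl + 2 * i_prime * (j - 1)

def cost_bic(sl, n, i_prime, j):
    # Cost(BIC, T) = 2*SL + I'(J-1) log N
    return 2 * sl + i_prime * (j - 1) * math.log(n)

# GF_Model(T) = Cost(Model, T0) - Cost(Model, T): evaluate each cost on the
# reference table T0 and on the candidate discretization T, then subtract.
```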
Goodness Function Examples (Con’t)
• Confidence level based goodness functions:
  – Pearson's X2 statistic: X2 = Σ_{i,j} (Cij − Eij)^2 / Eij, where Eij = Ni * Mj / N
  – Wilks' G2 statistic: G2 = 2 * Σ_{i,j} Cij * log(Cij / Eij)
• The table's degrees of freedom: df = (I'−1)(J−1)
• It is known in statistics that, asymptotically, both Pearson's X2 and Wilks' G2 statistics have a chi-square distribution with df degrees of freedom.
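Both statistics compare the observed counts Cij with the expected counts Eij = Ni*Mj/N under independence of rows and classes. A minimal Python sketch (illustrative names, natural logarithm):

```python
import math

def chi2_and_g2(table):
    """Return (Pearson X^2, Wilks G^2, degrees of freedom) for a table of counts."""
    row_sums = [sum(row) for row in table]            # N_i
    col_sums = [sum(col) for col in zip(*table)]      # M_j
    n = sum(row_sums)                                 # N
    x2 = g2 = 0.0
    for i, row in enumerate(table):
        for j, c in enumerate(row):
            e = row_sums[i] * col_sums[j] / n         # E_ij
            x2 += (c - e) ** 2 / e
            if c > 0:
                g2 += 2 * c * math.log(c / e)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return x2, g2, df

print(chi2_and_g2([[15, 3], [100, 5], [859, 25]]))
```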
Unification
• The following relationship between G2 and the goodness functions for MDLP, AIC, and BIC holds:
  G2 / 2 = N * H(S1 ∪ … ∪ SI') − SL
• Thus, the goodness functions for MDLP, AIC, and BIC can be rewritten as follows:
  GF_MDLP(T) = G2 − 2*df*log J − 2(I'−1) * log( N / (I'−1) )
  GF_AIC(T) = G2 − df
  GF_BIC(T) = G2 − df * (log N)/2
Unification
• The normal deviate of a random variable is the difference between the variable and its mean, divided by the standard deviation.
• Let t be a chi-square distributed random variable with df degrees of freedom, and let u(t) be the normal deviate with the same tail probability.
• The following theorem holds (see Wallace 1959, 1960):
  For all t > df and all df > 0.37, with w(t) = [ t − df − df*log(t/df) ]^(1/2),
  0 < w(t) ≤ u(t) ≤ w(t) + 0.6*df^(−1/2)
Unification
• From this theorem it follows that if df goes to infinity and w(t) >> 0, then u(t)/w(t) → 1.
• Finally, w^2(t) ~ u^2(t) ~ t − df − df*log(t/df), and
  GF_G2(T) = u^2(G2) = G2 − df*(1 + log(G2/df));
  the goodness function GF_X2(T) is asymptotically the same.
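The resulting goodness function needs only G2 and df; a minimal sketch (the handling of the G2 ≤ df case, which the asymptotic formula does not cover, is an assumption):

```python
import math

def gf_g2(g2, df):
    """GF_G2(T) = u^2(G^2) ~= G^2 - df * (1 + log(G^2 / df)), valid for G^2 > df."""
    if g2 <= df:
        return 0.0   # assumption: treat statistically insignificant tables as zero gain
    return g2 - df * (1 + math.log(g2 / df))
```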
Unification
• G2 estimate
  – If G2 > df, then G2 < 2N*log J. This follows from the upper bound log J on H(S1 ∪ … ∪ SI') and the lower bound 0 on the entropy of any specific row of the contingency table.
  – Recall that u^2(G2) = G2 − df*(1 + log(G2/df)).
  – Thus, the penalty term of u^2(G2) is between O(df) and O(df*log N):
    • If G2 ~ c*df and c > 1, then the penalty is O(df).
    • If G2 ~ c*df and N/df ~ N/(I'J) = c, then the penalty is also O(df).
    • If G2 ~ c*df, N → ∞, and N/(I'J) → ∞, then the penalty is O(df*log N).
Unification
• GF_MDLP = G2 − O(df) or G2 − O(df*log N), depending on the ratio N/I'
• GF_AIC = G2 − df
• GF_BIC = G2 − df*(log N)/2
• GF_G2 = G2 − df*[either a constant or log N, depending on the ratio between N and I' and J]
• In general, GF = G2 − df*f(N, I', J)
• To unify the Gini function as one of the cost functions, we resort to a parametric approach to the goodness of discretization.
Gini Based Goodness Function
• Let Si be a row of the contingency table. The Gini index of Si is defined as follows:
  Gini(Si) = 1 − Σ_{j=1..J} (Cij/Ni)^2
• Cost(Gini, T) = Σ_{i=1..I'} Ni * Gini(Si)
• Gini goodness function:
  GF_Gini(T) = Σ_{i=1..I'} Σ_{j=1..J} Cij^2 / Ni − Σ_{j=1..J} Mj^2 / N − 2(I'−1)
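The Gini goodness function above uses only the cell counts, row sums, and column sums; a minimal Python sketch with illustrative names:

```python
def gf_gini(table):
    """GF_Gini(T) = sum_ij C_ij^2 / N_i - sum_j M_j^2 / N - 2*(I' - 1)."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    data_term = sum(c * c / row_sums[i]
                    for i, row in enumerate(table) for c in row)
    base_term = sum(m * m / n for m in col_sums)
    return data_term - base_term - 2 * (len(table) - 1)
```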
Parametrized Discretization
• Parametrized Entropy
• Entropy of the row
• Gini of the row
• Parametrized Data Cost
Parametrized Discretization
• Parametrized Cost of T0
• Parametrized Goodness Function
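The parametrized-entropy formulas of these slides are not preserved in this transcript. As an illustration only, the sketch below assumes the standard generalized-entropy family H_beta(p) = Σ_j p_j*(1 − p_j^beta)/beta, which recovers Shannon entropy as beta → 0 and the Gini index at beta = 1; the paper's actual parameterization (which uses at most two parameters) may differ.

```python
import math

def generalized_row_entropy(row, beta):
    """Assumed form: H_beta(S_i) = sum_j p_j * (1 - p_j**beta) / beta.

    beta -> 0 gives Shannon entropy (-sum p log p); beta = 1 gives the Gini index.
    """
    n_i = sum(row)
    probs = [c / n_i for c in row if c > 0]
    if beta == 0:
        return -sum(p * math.log(p) for p in probs)
    return sum(p * (1 - p ** beta) / beta for p in probs)

def parametrized_data_cost(table, beta):
    """Assumed parametrized analogue of SL: sum_i N_i * H_beta(S_i)."""
    return sum(sum(row) * generalized_row_entropy(row, beta) for row in table)
```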
Parameters for Known Goodness Functions
Parametrized Dynamic Programming
Algorithm
Dynamic Programming Algorithm
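The pseudocode of the algorithm is not reproduced in this transcript. The sketch below is a minimal dynamic-programming formulation of the general idea, under two assumptions: the data cost decomposes into a sum of per-interval costs, and the model penalty depends only on the number of intervals. It computes, for every possible number of intervals, the cheapest way to merge consecutive base intervals, and then picks the interval count that minimizes data cost plus penalty.

```python
def optimal_discretization(rows, interval_cost, penalty):
    """Dynamic programming over cut points.

    rows          : list of base-interval count rows (finest discretization).
    interval_cost : cost of encoding rows[a:b] merged into one interval.
    penalty       : penalty(k) for a model with k intervals.
    Returns (best_total_cost, list_of_cut_points).
    """
    n = len(rows)
    INF = float("inf")
    # best[k][b] = minimal data cost of splitting rows[0:b] into k intervals
    best = [[INF] * (n + 1) for _ in range(n + 1)]
    back = [[0] * (n + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for k in range(1, n + 1):
        for b in range(k, n + 1):
            for a in range(k - 1, b):
                cost = best[k - 1][a] + interval_cost(rows[a:b])
                if cost < best[k][b]:
                    best[k][b], back[k][b] = cost, a
    # choose the number of intervals minimizing data cost + penalty
    k_best = min(range(1, n + 1), key=lambda k: best[k][n] + penalty(k))
    cuts, b = [n], n
    for k in range(k_best, 0, -1):
        b = back[k][b]
        cuts.append(b)
    return best[k_best][n] + penalty(k_best), cuts[::-1]
```

For example, interval_cost could merge the rows and return Ni*H(Si) (or its generalized-entropy analogue from the previous sketch), with penalty(k) taken from the chosen goodness function; minimizing total cost then corresponds to maximizing the goodness function.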
Experiments
Classification errors for the Glass dataset (Naïve Bayesian)
Classification errors for the Iris dataset (C4.5)
Experiments: C4.5 Validation
Experiments: Naïve Bayesian Validation
Conclusions
• We considered several seemingly different approaches to discretization and demonstrated that they can be unified through a notion of generalized entropy.
• Each of the methods discussed in the literature can be derived from generalized entropy by selecting at most two parameters.
• For a given pair of parameters, the dynamic programming algorithm selects an optimal discretization (in terms of the discretization goodness function).
What Remains To be Done
• How can we analytically relate the goodness function of a model to the number of classification errors?
• What is the algorithm for selecting the best parameters for a given dataset?