Y - Knowledge Engineering Group

Download Report

Transcript Y - Knowledge Engineering Group

Djamel A. Zighed and Nicolas Nicoloyannis
ERIC Laboratory
University of Lyon 2 (France)
[email protected]
Prague
Sept. 04
About Computer science dep.
• In Lyon, there are 3 universities, 100000
students
• Lumière university Lyon 2, has 22000 students,
• Lyon 2, is mainly a liberal art university
• The faculty of economic has tree departments,
among them the computer science one
• We belong to this department
• We have Bachelor, Master and PhD programs
for 300 students
ERIC Lab at the University
Faculties
of
university
of Lyon 2
Economic
Research
centers of the
university
Sociology
Linguistic
Law
ERIC
Knowledge Engineering Research Center
- The budget of ERIC doesn’t depend from the university, it’s given par
The national ministry of education
- We have a large autonomy in decision making
ERIC Lab
•
•
•
•
•
Born in 1995,
11 professors (N. Nicoloyannis, director)
15 PhD Students
Grants+contracts+WK+…=200K€/year
Research topics
– Data mining (theory, tools and applications)
– Data warehouse management (T,T,A)
Data Mining (T,T,A)
• Theory
– Induction graphs
– Learning and classification
• Tools
– SIPINA : Plate form for data mining
• Applications
–
–
–
–
Medical fields
Chemical applications
Human science
…
Data mining TTA for complex data
Data mining on complex data
• An example : Breast cancer diagnosis
Motivations
 TXY  Association measure :
Let
X and Y be two attributes
Ω be a set of data
X    x1 ,  , x r 
It measures the strength
of the relationship
between X and Y
Y    y1 ,  , y c 
Contingency
table
TXY
x1
X

xr
Y
y1
n11

yc
n1c
n1.

n r1
n.1
nrc
n.c
nr.
n
Motivations
 TXY  Association measure :
Let
X and Y be two attributes
Ω be a set of data
X    x1 ,  , x r 
It measures the strength
of the relationship
between X and Y
Y    y1 ,  , y c 
Contingency
table
TXY
x1
X

xr
Y
y1
n11

yc
n1c
n1.

n r1
n.1
nrc
n.c
nr.
n
Motivations
 TXY  Association measure :
Let
X and Y be two attributes
Ω be a set of data
X    x1 ,  , x r 
It measures the strength
of the relationship
between X and Y
Y    y1 ,  , y c 
Contingency
table
TXY
x1
X

xr
Y
y1
n11

yc
n1c
n1.

n r1
n.1
nrc
n.c
nr.
n
Motivations
 TXY  Association measure :
Let
X and Y be two attributes
Ω be a set of data
X    x1 ,  , x r 
It measures the strength
of the relationship
between X and Y
Y    y1 ,  , y c 
Contingency
table
TXY
x1
X

xr
Y
y1
n11

yc
n1c
n1.

n r1
n.1
nrc
n.c
nr.
n
According to a specific
association measure, may we
improve the strength of the
relationship by merging some
rows and/or some columns ?
Motivations
 TXY  Association measure :
According to a specific association measure,
may we improve the strength of the relation
ship by merging some rows and/or some
columns ?
It measures the strength
of the relationship
between X and Y
Contingency table
TXY
x1
X

xr
Y
y1
n11

T ' XY
yc
n1c
n1.

n r1
n.1
nrc
n.c
nr.
n
such that :
c'  c and
r '  r and
 T ' XY    TXY 
An example
 TXY   tˆ  0.14
Goal:
Find the groupings that maximize the association between
attributes
For the preceding example
the maximization of the Tschuprow’s t gives
 T ' XY   tˆ'  0.32
tˆ'  tˆ
Yes, we can improve the association by
reducing the size of the contingency
table
Extension
 TXY 
Let
X and Y be two attributes
Ω be a set of data
X    x1 ,  , x r 
According to a specific
association measure, may we
find the optimal reduced
contingency table ?
Y    y1 ,  , y c 
Contingency
table
TXY
x1
X

xr
Y
y1
n11

yc
n1c
*
TXY
n1.

n r1
n.1
nrc
n.c
nr.
n
c*  c
l*  l
*
  max  TXYi 
 TXY
i
Optimal solution (exhaustive search)
Goal : Find the best cross partition on T
P TX : The set of all partitions brought about over X  
P TY : The set of all partitions brought about over Y  
#P TX : the size of the set P TX 
#P TY : the size of the set P TY 
The number of cases we have to check is
l  #P TX  #P TY 
#P TX 
nominal case
ordinal case
Optimal solution (exhaustive search)
#P TX 
nominal case
ordinal case
Optimal solution (exhaustive search)
According to a specific association measure,
may we find the optimal reduced contingency
table ?
Yes, but the solution is intractable in real
word because of the high time complexity
Heuristic
Proceed successively to the grouping of 2
(row or column) values that
maximizes the increase in the association
criteria.
Starting with T 0  Tr ,c the finest categories
The algorithm determines successive ly T k k  1,2 
Stop when  T k    T k 1 
Complexity
Simulation
Goal: How far is the quasi-optimal solution from the true optimum?
Comparison tractable for tables not greater than 6 × 6.
Simulation Design
Randomly generate 200 tables
Analysis of the distribution of the deviations between optima and
quasi-optima.
Generating the Tables
10000 cases distributed in the cxr cells of the table with an uniform distribution
(worst case).
Quasi-optimal solution
Quasi-optimal solution
Conclusion
• Implementation for new approach induction decision
tree.
– Zighed, D.A., Ritschard, G., W. Erray and V.-M. Scuturici (2003),
Abogodaï,a New approach for Decision Trees, in Lavrac, N., D.
Gamberger, L. Todorovski and H. Blockeel (eds), Knowledge
Discovery in databases: PKDD 2003 , LNAI 2838, Berlin:
Springer, 495--506.
– Zighed D. A., Ritschard G., Erray W., Scuturici V.-M. (2003),
Decision tree with optimal join partitioning, To appear in Journal
of Information Intelligent Systems, Kluwer (2004).
• Divisive top-down approach
• Extension to multidimensionnal case