Transcript 4FT Miner

Contributions to MiningMart
Petr Berka
Laboratory for Intelligent Systems
University of Economics, Prague
[email protected]
University of Economics, Prague

LISp - Laboratory for Intelligent Systems

SALOME - Laboratory for Multidisciplinary
Approaches to Decision-making Support in Economics and
Management
MiningMart presentation (c) Petr Berka, LISp, 2001
LISp research

probabilistic methods - decomposable
probability models and Bayesian networks

symbolic ML methods - 4FT association
rules and decision rules

logical calculi for knowledge discovery in
databases
LISp activities

Organized conferences
 ECML’97, PKDD’99

Organized workshops
 Discovery Challenge (PKDD’99, PKDD2000, PKDD2001),
 WUPES’97, WUPES2000

International Projects
 MLNet, Sol-Eu-Net, EUNITE,
 KDNet
 MUM, MGT
SALOME research

Quantitative and AI (pattern recognition,
fuzzy, neural nets) approaches to supporting
decision making in economics and
management
SALOME activities

Organized workshops
 STIPR’97, MME’99

International Projects
 Univ. Salzburg, Univ. Hokkaido, Univ. Cambridge
LISp software

LISp-Miner (data mining system)
 DataSource (for data manipulation)
 4FT Miner (4FT association rules) and
 KEX (decision rules)

experimental software for building
graphical models

preprocessing procedures
 related to KEX
 based on information-theoretic approach

LISp-Miner procedures

DataSource
creating new (virtual) attributes using SQL
equidistant and equifrequent discretization
grouping attribute values
computing attribute-value frequencies
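
The two discretization modes listed for DataSource can be illustrated with a minimal Python sketch. It is not the DataSource implementation; the function names (`equidistant_cuts`, `equifrequent_cuts`, `assign_bin`) and the sample ages are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def equidistant_cuts(values: Sequence[float], bins: int) -> List[float]:
    """Return bins-1 cut points that split [min, max] into equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + i * width for i in range(1, bins)]


def equifrequent_cuts(values: Sequence[float], bins: int) -> List[float]:
    """Return bins-1 cut points so each interval holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // bins] for i in range(1, bins)]


def assign_bin(value: float, cuts: Sequence[float]) -> int:
    """Index of the interval containing value; a value equal to a cut goes to the upper interval."""
    return sum(value >= c for c in cuts)


# Hypothetical sample of ages.
ages = [18, 21, 22, 25, 30, 31, 40, 58]
print(equidistant_cuts(ages, 4))   # equal-width cut points between 18 and 58
print(equifrequent_cuts(ages, 4))  # cut points taken from the sorted sample
print(assign_bin(25, equifrequent_cuts(ages, 4)))
```

Equal-width intervals are easy to interpret but can leave some intervals nearly empty when the data are skewed; equal-frequency intervals avoid that at the cost of less regular boundaries.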
LISp-Miner procedures

4FT-Miner (GUHA procedure)
4FT association rules in the form
Ant ~ Suc / Cond

KEX
weighted decision rules in the form
Ant ⇒ C (weight)
4FT-Miner basic idea

Generate a (potential) rule, e.g.
$\text{COLOUR(red)} \wedge \text{SIZE(small)} \Rightarrow_{0.9,\,20} \text{TEMP(high)}$
$\text{AGE(21-30)} \wedge \text{SALARY(low)} \Rightarrow_{0.85,\,15} \text{PAYMENTS(High)} \wedge \text{LOAN(bad)}$

Verify a rule using four-fold table

        Suc   ¬Suc
Ant      a     b
¬Ant     c     d

$\Rightarrow_{p,B}$ is TRUE iff $\frac{a}{a+b} \ge p \,\wedge\, a \ge B$
$\Leftrightarrow_{p,B}$ is TRUE iff $\frac{a}{a+b+c} \ge p \,\wedge\, a \ge B$
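
The verification step can be sketched in Python as follows. This is an illustration, not the 4FT-Miner code: it assumes the two quantifiers on this slide are founded implication (a/(a+b) ≥ p and a ≥ B) and double founded implication (a/(a+b+c) ≥ p and a ≥ B), and all function names and the data are hypothetical.

```python
from typing import Sequence, Tuple


def four_fold_table(ant: Sequence[bool], suc: Sequence[bool]) -> Tuple[int, int, int, int]:
    """Count the four frequencies (a, b, c, d) over paired Boolean values."""
    a = b = c = d = 0
    for x, y in zip(ant, suc):
        if x and y:
            a += 1      # Ant and Suc
        elif x:
            b += 1      # Ant and not Suc
        elif y:
            c += 1      # not Ant and Suc
        else:
            d += 1      # neither
    return a, b, c, d


def founded_implication(a: int, b: int, p: float, base: int) -> bool:
    """TRUE iff a / (a + b) >= p and a >= base."""
    return a >= base and a + b > 0 and a / (a + b) >= p


def double_founded_implication(a: int, b: int, c: int, p: float, base: int) -> bool:
    """TRUE iff a / (a + b + c) >= p and a >= base."""
    return a >= base and a + b + c > 0 and a / (a + b + c) >= p


# Hypothetical data: 100 objects, 22 satisfy Ant, 25 satisfy Suc.
ant = [True] * 22 + [False] * 78
suc = [True] * 20 + [False] * 2 + [True] * 5 + [False] * 73
a, b, c, d = four_fold_table(ant, suc)
print((a, b, c, d))
print(founded_implication(a, b, p=0.9, base=20))
print(double_founded_implication(a, b, c, p=0.9, base=20))
```

With these counts the rule passes the first test (20 of 22 antecedent cases also satisfy the succedent) but fails the second, because the five objects that satisfy the succedent without the antecedent enter the denominator.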
KEX basic idea

Generate a (potential) rule, e.g.
YEARS-IN-COMPANY(0-3) ∧ AGE(0-25) ⇒ LOAN(GOOD)

If rule refines current set of rules
(validity a/(a+b) differs from weight inferred during consultation)
add into rule base with proper weight
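
The refinement test can be sketched as below. The slide only says that the validity a/(a+b) is compared with the weight inferred during consultation; everything else here is an assumption made for illustration: the PROSPECTOR-like combination operator in `combine`, the fixed `tolerance` standing in for whatever test the system applies, and the way the new weight is solved for. Weights and validity are assumed to lie strictly between 0 and 1.

```python
from typing import Iterable, Optional


def combine(x: float, y: float) -> float:
    """Assumed PROSPECTOR-like combination of two weights in (0, 1)."""
    return x * y / (x * y + (1 - x) * (1 - y))


def consider_rule(validity: float,
                  applicable_weights: Iterable[float],
                  tolerance: float = 0.05) -> Optional[float]:
    """Return the weight to store for a candidate rule, or None if it adds nothing.

    validity is a / (a + b) from the four-fold table; applicable_weights are
    the weights of rules already in the base that fire for the same antecedent.
    """
    inferred = 0.5  # neutral weight: combine(0.5, w) == w
    for w in applicable_weights:
        inferred = combine(inferred, w)
    if abs(validity - inferred) <= tolerance:
        return None  # the current rule base already predicts this validity
    # Choose w so that combine(inferred, w) == validity, solved in odds form.
    odds = (validity / (1 - validity)) / (inferred / (1 - inferred))
    return odds / (1 + odds)


# Hypothetical candidate: validity 0.9, one applicable rule with weight 0.6.
weight = consider_rule(0.9, [0.6])
print(weight)
print(combine(0.6, weight))
```

Under these assumptions a rule is added only when it corrects what the existing rules would infer, which keeps the rule base free of redundant rules.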
LISp-Miner architecture
[Architecture diagram; labels: MetaData (ODBC, ACCESS), Data (ODBC, ACCESS), LM, Windows, Results]
Preprocessing (LISp)

KEX-oriented
 (fuzzy) discretization + grouping of values
 computing the amount of noise in data
 random sampling + balancing of data
 handling missing values

Information theory
 attribute selection
 attribute grouping
… fuzzy discretization
$\frac{N_{Class}(Int)}{N_{Class}} \;<>\; \frac{N(Int)}{N}$
… amount of noise
head  body  smile  holding  jacket  tie
o     r     y      s        r       y
o     r     y      s        r       y
o     r     y      f        y       n
o     r     y      b        y       n
o     r     n      s        r       y

class: + + (only two class values are legible)
Amount of noise: 20%
max. possible accuracy = 80%
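
A minimal sketch of how such a figure could be computed, assuming that noise means the share of examples that disagree with the majority class among examples with identical attribute values. The function name is hypothetical, and the class labels in the sample data are invented for illustration (they are not taken from the slide).

```python
from collections import Counter, defaultdict
from typing import Hashable, Iterable, Sequence, Tuple


def amount_of_noise(examples: Iterable[Tuple[Sequence[Hashable], Hashable]]) -> float:
    """Fraction of examples that no classifier can get right.

    Examples with identical attribute values but different classes are
    contradictory; at best the majority class of each group is predicted.
    """
    groups = defaultdict(Counter)
    total = 0
    for attrs, label in examples:
        groups[tuple(attrs)][label] += 1
        total += 1
    noisy = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return noisy / total


# Attribute values as on the slide (head, body, smile, holding, jacket, tie);
# the class labels are hypothetical.
data = [
    (("o", "r", "y", "s", "r", "y"), "+"),
    (("o", "r", "y", "s", "r", "y"), "-"),
    (("o", "r", "y", "f", "y", "n"), "+"),
    (("o", "r", "y", "b", "y", "n"), "-"),
    (("o", "r", "n", "s", "r", "y"), "-"),
]
noise = amount_of_noise(data)
print(f"amount of noise: {noise:.0%}")
print(f"max. possible accuracy: {1 - noise:.0%}")
```

With these labels the first two examples contradict each other, so one example in five is noisy and the upper bound on accuracy is 80%.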
… data sampling
random split into training and testing set
 select random stratified sample
 balance unbalanced classes

… handling missing values
remove example
 substitute missing with new value
 substitute missing with majority value
 proportional substitution
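
The four strategies can be sketched as follows. The code is illustrative only: rows are plain dictionaries with `None` marking a missing value, the function names are hypothetical, and "proportional substitution" is read here as drawing a replacement at random in proportion to the observed value frequencies, which is an assumption.

```python
import random
from collections import Counter
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def remove_examples(rows: List[Row], col: str) -> List[Row]:
    """Drop every example whose value in col is missing."""
    return [dict(r) for r in rows if r[col] is not None]


def substitute_new_value(rows: List[Row], col: str, new_value: str = "unknown") -> List[Row]:
    """Treat 'missing' as one more attribute value."""
    return [{**r, col: new_value} if r[col] is None else dict(r) for r in rows]


def substitute_majority(rows: List[Row], col: str) -> List[Row]:
    """Replace missing values with the most frequent observed value."""
    counts = Counter(r[col] for r in rows if r[col] is not None)
    majority = counts.most_common(1)[0][0]
    return [{**r, col: majority} if r[col] is None else dict(r) for r in rows]


def substitute_proportional(rows: List[Row], col: str, seed: int = 0) -> List[Row]:
    """Replace each missing value by a random draw weighted by observed frequencies."""
    rng = random.Random(seed)
    counts = Counter(r[col] for r in rows if r[col] is not None)
    values, weights = zip(*counts.items())
    return [{**r, col: rng.choices(values, weights=weights)[0]} if r[col] is None else dict(r)
            for r in rows]


# Hypothetical rows with one missing value.
rows = [{"tie": "y"}, {"tie": "y"}, {"tie": "n"}, {"tie": None}]
print(remove_examples(rows, "tie"))
print(substitute_new_value(rows, "tie"))
print(substitute_majority(rows, "tie"))
print(substitute_proportional(rows, "tie"))
```

Removing examples is the simplest option but shrinks the data; the substitution variants keep all examples and differ in how much they distort the value distribution.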

… information theory

Attribute selection - based on mutual information

Attribute grouping - based on information content
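
Attribute selection by mutual information can be sketched as below. The formula is the standard definition I(X;Y) = Σ p(x,y) log2(p(x,y) / (p(x) p(y))) estimated from counts; the function names and the toy data are hypothetical, and the sketch covers selection only, not the grouping procedure.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def mutual_information(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """Mutual information (in bits) between two discrete variables, from frequencies."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # (c / n) * log2( (c / n) / ((px / n) * (py / n)) ) simplifies to the form below.
    return sum((c / n) * math.log2(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


def rank_attributes(columns: Dict[str, Sequence[Hashable]],
                    labels: Sequence[Hashable]) -> List[str]:
    """Attribute names ordered by decreasing mutual information with the class."""
    return sorted(columns, key=lambda name: mutual_information(columns[name], labels),
                  reverse=True)


# Hypothetical data: 'tie' determines the class, 'head' is constant.
labels = ["+", "+", "-", "-"]
columns = {"head": ["o", "o", "o", "o"], "tie": ["y", "y", "n", "n"]}
for name in rank_attributes(columns, labels):
    print(name, mutual_information(columns[name], labels))
```

An attribute that is constant, like `head` here, carries no information about the class and ends up last in the ranking.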
Preprocessing architecture
[Architecture diagram; labels: Input data (ASCII), procedure, Output data (ASCII); Data (ASCII), procedure, Results (ASCII)]
SALOME software

Feature Selection Toolbox (Multi-Purpose
Tool for Pattern Recognition)
feature selection
 approximation-based modeling
 classification

a consulting system helping to choose the most
suitable method is being developed
Search strategies for FS
Search for a subset maximizing a criterion
function (distance, divergence):

with a priori information
 exhaustive search
 branch and bound based algorithms
 floating search algorithms

without a priori information
 approximation method
 divergence method
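
The simplest of these strategies, exhaustive search, can be sketched in a few lines. This is not code from the Feature Selection Toolbox: the function name is hypothetical, and the criterion below is a toy stand-in (per-feature scores with a penalty for one redundant pair) for a real distance or divergence measure.

```python
from itertools import combinations
from typing import Callable, Sequence, Tuple


def exhaustive_search(features: Sequence[str], size: int,
                      criterion: Callable[[Tuple[str, ...]], float]) -> Tuple[str, ...]:
    """Evaluate every subset of the given size and return the one maximizing the criterion."""
    return max(combinations(features, size), key=criterion)


# Hypothetical per-feature scores; 'smile' and 'tie' are assumed to be redundant.
SCORES = {"head": 0.1, "smile": 0.6, "tie": 0.5, "jacket": 0.4}


def toy_criterion(subset: Tuple[str, ...]) -> float:
    """Sum of individual scores, minus a penalty when both redundant features are chosen."""
    value = sum(SCORES[f] for f in subset)
    if "smile" in subset and "tie" in subset:
        value -= 0.3
    return value


best = exhaustive_search(list(SCORES), 2, toy_criterion)
print(best, toy_criterion(best))
```

The two individually best features are not the best pair here, which is why subset search is needed at all. Exhaustive search evaluates every subset of the requested size, so its cost grows combinatorially with the number of features; that is the motivation for the branch and bound and floating search alternatives listed above.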

FST architecture
[Architecture diagram; labels: Data (ASCII), FST, Windows, Results]
References
LISp-Miner:

Berka,P. - Ivanek,J.: Automated Knowledge Acquisition for
PROSPECTOR-like Expert Systems. In: (Bergadano, deRaedt
eds.) Proc. ECML'94, Springer 1994, 339-342.

Berka,P. - Rauch,J.: Data Mining using GUHA and KEX. In:
(Callaos, Yang, Aguilar eds.) 4th. Int. Conf. on Information
Systems, Analysis and Synthesis ISAS'98, 1998, Vol 2, 238- 244.

Rauch,J.: Classes of Four Fold Table Quantifiers. In: (Zytkow,
Quafafou eds.) Principles of Data Mining and Knowledge
Discovery. Springer 1998, 203 - 211.
References
Preprocessing:

Bruha,I. - Berka,P.: Discretization and Fuzzification of Numerical
Attributes in Attribute-Based Learning. In: Szczepaniak, Lisboa,
Kacprzyk (eds.): Fuzzy Systems in Medicine, Physica Verlag,
2000, 112-138.


Pudil, P., Novovičová J.: Novel Methods for Subset Selection with
Respect to Problem Knowledge, IEEE Transactions on Intelligent
Systems - Special Issue on Feature Transformation and Subset
Selection 1998, 66-74
J. Zvarova and M. Studeny: Information theoretical approach to
constitution and reduction of medical data. International Journal of
Medical Informatics 45 (1997), n. 1-2, pp. 65-74.