Transcript 4FT Miner
Contributions to MiningMart
Petr Berka
Laboratory for Intelligent Systems
University of Economics, Prague
[email protected]
University of Economics, Prague
LISp - Laboratory for Intelligent Systems
SALOME - Laboratory for Multidisciplinary
Approaches to Decision-making Support in Economics and
Management
MiningMart presentation (c) Petr Berka, LISp, 2001
LISp research
probabilistic methods - decomposable
probability models and Bayesian networks
symbolic ML methods - 4FT association
rules and decision rules
logical calculi for knowledge discovery in
databases
LISp activities
Organized conferences
ECML’97, PKDD’99
Organized workshops
Discovery Challenge (PKDD’99, PKDD2000, PKDD2001),
WUPES’97, WUPES2000
International Projects
MLNet, Sol-Eu-Net, EUNITE,
KDNet
MUM, MGT
SALOME research
Quantitative and AI (pattern recognition,
fuzzy, neural nets) approaches to
decision-making support in economics and
management
SALOME activities
Organized workshops
STIPR’97, MME’99
International Projects
Univ. Salzburg, Univ. Hokkaido, Univ. Cambridge
LISp software
LISp-Miner (data mining system)
DataSource (for data manipulation)
4FT Miner (4FT association rules) and
KEX (decision rules)
experimental software for building
graphical models
preprocessing procedures
related to KEX
based on information theoretic approach
LISp-Miner procedures
DataSource
creating new (virtual) attributes using SQL
equidistant and equifrequent discretization
grouping attribute values
computing attribute-value frequencies
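The two discretization modes listed above can be illustrated with a minimal sketch. It is not DataSource code: the function names and the `ages` sample are hypothetical, and ties or empty intervals are not handled.

```python
def equidistant_cuts(values, k):
    """Cut points splitting the value range into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]


def equifrequent_cuts(values, k):
    """Cut points chosen so each interval holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]


def interval_index(value, cuts):
    """Index of the interval that a value falls into (0 = lowest interval)."""
    return sum(value >= c for c in cuts)


# Hypothetical sample of a numeric attribute.
ages = [18, 21, 22, 25, 30, 41, 58, 60]
width_cuts = equidistant_cuts(ages, 3)
freq_cuts = equifrequent_cuts(ages, 4)
```

Equidistant cuts depend only on the minimum and maximum, so outliers can leave most values in one interval; equifrequent cuts follow the data distribution instead.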
LISp-Miner procedures
4FT-Miner (GUHA procedure)
4FT association rules in the form
Ant ~ Suc / Cond
KEX
weighted decision rules in the form
Ant ⇒ C (weight)
4FT-Miner basic idea
Generate a (potential) rule, e.g.
COLOUR(red) ∧ SIZE(small) ⇒(0.9, 20) TEMP(high)
AGE(21-30) ∧ SALARY(low) ⇒(0.85, 15) PAYMENTS(High) / LOAN(bad)
Verify a rule using four-fold table
        Suc   ¬Suc
Ant     a     b
¬Ant    c     d
⇒(p,B) is TRUE iff a/(a+b) ≥ p ∧ a ≥ B
⇔(p,B) is TRUE iff a/(a+b+c) ≥ p ∧ a ≥ B
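The verification step can be sketched as follows, assuming the four-fold table counts a (Ant and Suc), b (Ant, not Suc), c (Suc, not Ant) and d (neither), and the two quantifiers with thresholds p and B. This is an illustration, not 4FT-Miner code; the function names and the sample rows are hypothetical.

```python
def four_fold_table(rows, ant, suc):
    """Count a, b, c, d for antecedent and succedent predicates over the rows."""
    a = b = c = d = 0
    for row in rows:
        has_ant, has_suc = ant(row), suc(row)
        if has_ant and has_suc:
            a += 1
        elif has_ant:
            b += 1
        elif has_suc:
            c += 1
        else:
            d += 1
    return a, b, c, d


def founded_implication(a, b, p, base):
    """TRUE iff a/(a+b) >= p and a >= base."""
    return a + b > 0 and a / (a + b) >= p and a >= base


def double_founded_implication(a, b, c, p, base):
    """TRUE iff a/(a+b+c) >= p and a >= base."""
    return a + b + c > 0 and a / (a + b + c) >= p and a >= base


# Hypothetical data: (colour, size, temp) records.
rows = ([("red", "small", "high")] * 9 + [("red", "small", "low")]
        + [("blue", "large", "high")] * 2 + [("blue", "large", "low")] * 3)
table = four_fold_table(
    rows,
    ant=lambda r: r[0] == "red" and r[1] == "small",
    suc=lambda r: r[2] == "high",
)
a, b, c, d = table
```

On this sample the rule holds under the first quantifier with p = 0.9 and B = 5, but not under the second, because the two records with Suc and without Ant lower the ratio to 9/12.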
KEX basic idea
Generate a (potential) rule, e.g.
YEARS-IN-COMPANY(0-3) ∧ AGE(0-25) ⇒ LOAN(GOOD)
If rule refines current set of rules
(validity a/(a+b) differs from weight inferred during consultation)
add into rule base with proper weight
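A rough sketch of this refinement test follows. It rests on two assumptions that the slide does not state: weights are composed with a PROSPECTOR-like pseudo-Bayesian function, and a fixed tolerance `tol` stands in for whatever test KEX actually applies to decide that validity "differs". All names are hypothetical.

```python
def combine(w1, w2):
    """Assumed PROSPECTOR-like combination of two weights in (0, 1)."""
    return (w1 * w2) / (w1 * w2 + (1 - w1) * (1 - w2))


def weight_for(validity, inferred):
    """Weight w such that combine(w, inferred) equals validity."""
    num = validity * (1 - inferred)
    return num / (num + (1 - validity) * inferred)


def maybe_add_rule(rule_base, ant, cls, a, b, inferred, tol=0.05):
    """Add the rule only if its validity a/(a+b) differs from the inferred weight."""
    validity = a / (a + b)
    if abs(validity - inferred) <= tol:  # tolerance test is an assumption
        return False
    rule_base.append((ant, cls, weight_for(validity, inferred)))
    return True


rule_base = []
added = maybe_add_rule(
    rule_base, ("YEARS-IN-COMPANY(0-3)", "AGE(0-25)"), "LOAN(GOOD)",
    a=30, b=10, inferred=0.5,
)
skipped = maybe_add_rule(rule_base, ("AGE(0-25)",), "LOAN(GOOD)",
                         a=30, b=10, inferred=0.74)
```

The "proper weight" is chosen so that combining it with what the current rules already infer reproduces the validity observed in the data.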
LISp-Miner architecture
[Architecture diagram. Labels: MetaData (ODBC, ACCESS); Data (ODBC, ACCESS); LM; Windows; Results]
Preprocessing (LISp)
KEX-oriented
(fuzzy) discretization + grouping of values
computing the amount of noise in data
random sampling + balancing of data
handling missing values
Information theory
attribute selection
attribute grouping
… fuzzy discretization
NClass(Int) / N(Int)  <>  NClass / N
… amount of noise
head     o  o  o  o  o
body     r  r  r  r  r
smile    y  y  y  y  n
holding  s  s  f  b  s
jacket   r  r  y  y  r
tie      y  y  n  n  y
class    +  +
Amount of noise: 20%
max. possible accuracy = 80%
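One way to read these two figures: examples with identical attribute values but different classes cannot all be classified correctly, so the minority examples in each such group count as noise. The sketch below follows that reading, which is an assumption. The attribute values are taken from the table; the class labels are hypothetical, since most of them did not survive in the transcript.

```python
from collections import Counter, defaultdict


def amount_of_noise(examples):
    """Fraction of examples that contradict the majority class of their attribute vector."""
    classes_by_attrs = defaultdict(Counter)
    for attrs, label in examples:
        classes_by_attrs[attrs][label] += 1
    noisy = sum(sum(cnt.values()) - max(cnt.values())
                for cnt in classes_by_attrs.values())
    return noisy / len(examples)


# (head, body, smile, holding, jacket, tie), class; class labels are hypothetical.
examples = [
    (("o", "r", "y", "s", "r", "y"), "+"),
    (("o", "r", "y", "s", "r", "y"), "-"),
    (("o", "r", "y", "f", "y", "n"), "-"),
    (("o", "r", "y", "b", "y", "n"), "-"),
    (("o", "r", "n", "s", "r", "y"), "+"),
]
noise = amount_of_noise(examples)
max_accuracy = 1 - noise
```

With these labels the first two examples conflict, giving one noisy example out of five.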
… data sampling
random split into training and testing set
select random stratified sample
balance unbalanced classes
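The three operations can be sketched with the Python standard library. The function names and the sample data are hypothetical, and balancing by undersampling the larger classes is an assumption; the slide does not say which balancing method is used.

```python
import random
from collections import defaultdict


def train_test_split(examples, test_fraction, rng):
    """Random split into (training, testing) lists."""
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def group_by_class(examples, label_of):
    groups = defaultdict(list)
    for ex in examples:
        groups[label_of(ex)].append(ex)
    return groups


def stratified_sample(examples, fraction, label_of, rng):
    """Random sample that keeps the class proportions of the input."""
    sample = []
    for group in group_by_class(examples, label_of).values():
        sample.extend(rng.sample(group, round(len(group) * fraction)))
    return sample


def balance(examples, label_of, rng):
    """Undersample every class to the size of the smallest class (assumed method)."""
    groups = group_by_class(examples, label_of)
    size = min(len(g) for g in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, size))
    return balanced


rng = random.Random(0)
label_of = lambda ex: ex[1]
data = [(i, "good") for i in range(80)] + [(i, "bad") for i in range(80, 100)]
train, test = train_test_split(data, 0.3, rng)
sample = stratified_sample(data, 0.5, label_of, rng)
balanced = balance(data, label_of, rng)
```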
… handling missing values
remove example
substitute missing with new value
substitute missing with majority value
proportional substitution
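The four options can be sketched for a single attribute as follows. The names are hypothetical, missing values are assumed to be stored as `None`, and "proportional substitution" is read here as drawing a replacement at random in proportion to the observed value frequencies.

```python
import random
from collections import Counter


def _fill(rows, i, pick):
    """Replace a missing value in column i with the value returned by pick()."""
    return [r[:i] + (pick(),) + r[i + 1:] if r[i] is None else r for r in rows]


def remove_examples(rows, i):
    return [r for r in rows if r[i] is not None]


def fill_new_value(rows, i, token="unknown"):
    return _fill(rows, i, lambda: token)


def fill_majority(rows, i):
    observed = [r[i] for r in rows if r[i] is not None]
    majority = Counter(observed).most_common(1)[0][0]
    return _fill(rows, i, lambda: majority)


def fill_proportional(rows, i, rng):
    # Drawing uniformly from the observed values reproduces their frequencies.
    observed = [r[i] for r in rows if r[i] is not None]
    return _fill(rows, i, lambda: rng.choice(observed))


# Hypothetical (salary, loan) records with two missing salaries.
rows = [("low", "good"), (None, "bad"), ("low", "good"),
        ("high", "bad"), (None, "good")]
removed = remove_examples(rows, 0)
with_token = fill_new_value(rows, 0)
with_majority = fill_majority(rows, 0)
with_draws = fill_proportional(rows, 0, random.Random(0))
```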
… information theory
Attribute selection - based on mutual information
Attribute grouping - based on information content
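Attribute selection by mutual information can be sketched as ranking attributes by I(A; Class) estimated from relative frequencies. The sketch covers selection only, not grouping; the function names and the sample columns are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """I(X; Y) in bits, estimated from the relative frequencies in two columns."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def rank_attributes(columns, labels):
    """Attribute names ordered from most to least informative about the class."""
    scores = {name: mutual_information(col, labels)
              for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical data: SALARY determines the class, COLOUR is independent of it.
labels = ["good", "good", "bad", "bad"]
columns = {
    "COLOUR": ["red", "blue", "red", "blue"],
    "SALARY": ["high", "high", "low", "low"],
}
ranking = rank_attributes(columns, labels)
```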
Preprocessing architecture
[Architecture diagram. Labels: Input data (ASCII), procedure, Output data (ASCII); Data, procedure, Results (ASCII)]
SALOME software
Feature Selection Toolbox (Multi-Purpose
Tool for Pattern Recognition)
feature selection
approximation-based modeling
classification
a consulting system helping to choose the most
suitable method is being developed
Search strategies for FS
Search for a subset maximizing a criterion
function (distance, divergence):
with a priori information
exhaustive search
branch and bound based algorithms
floating search algorithms
without a priori information
approximation method
divergence method
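As an illustration of the floating search idea, here is a minimal sketch of sequential forward floating search: after each forward step it drops features again as long as that beats the best subset of the smaller size found so far. It is not Feature Selection Toolbox code; `floating_search`, `toy_criterion` and the feature values are hypothetical.

```python
def floating_search(features, criterion, d):
    """Sequential forward floating search up to size d (d <= len(features)).

    Returns {size: (score, subset)} with the best subset found for each size.
    """
    selected, best = [], {}
    while len(selected) < d:
        # Forward step: add the feature that raises the criterion most.
        rest = [f for f in features if f not in selected]
        selected = selected + [max(rest, key=lambda f: criterion(selected + [f]))]
        score = criterion(selected)
        if len(selected) not in best or score > best[len(selected)][0]:
            best[len(selected)] = (score, list(selected))
        # Backward steps: drop a feature while that beats the best smaller subset.
        while len(selected) > 2:
            reduced = max(([g for g in selected if g != f] for f in selected),
                          key=criterion)
            if criterion(reduced) <= best[len(reduced)][0]:
                break
            selected = reduced
            best[len(selected)] = (criterion(selected), list(selected))
    return best


# Toy criterion: individual feature values plus a bonus when b and c occur together.
VALUE = {"a": 3, "b": 2, "c": 2, "d": 1}


def toy_criterion(subset):
    bonus = 3 if {"b", "c"} <= set(subset) else 0
    return sum(VALUE[f] for f in subset) + bonus


result = floating_search(list(VALUE), toy_criterion, 3)
```

Plain forward selection would keep {a, b} as its two-feature subset; the backward step lets the search replace it with {b, c}, which scores higher because of the interaction bonus.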
FST architecture
[Architecture diagram. Labels: Data (ASCII); FST; Windows; Results]
References
LISp-Miner:
Berka,P. - Ivanek,J.: Automated Knowledge Acquisition for
PROSPECTOR-like Expert Systems. In: (Bergadano, deRaedt
eds.) Proc. ECML'94, Springer 1994, 339-342.
Berka,P. - Rauch,J.: Data Mining using GUHA and KEX. In:
(Callaos, Yang, Aguilar eds.) 4th Int. Conf. on Information
Systems, Analysis and Synthesis ISAS'98, 1998, Vol 2, 238-244.
Rauch,J.: Classes of Four Fold Table Quantifiers. In: (Zytkow,
Quafafou eds.) Principles of Data Mining and Knowledge
Discovery. Springer 1998, 203-211.
References
Preprocessing:
Bruha,I. - Berka,P.: Discretization and Fuzzification of Numerical
Attributes in Attribute-Based Learning. In: (Szczepaniak, Lisboa,
Kacprzyk eds.) Fuzzy Systems in Medicine, Physica Verlag,
2000, 112-138.
Pudil,P. - Novovičová,J.: Novel Methods for Subset Selection with
Respect to Problem Knowledge, IEEE Transactions on Intelligent
Systems - Special Issue on Feature Transformation and Subset
Selection 1998, 66-74.
Zvarova,J. - Studeny,M.: Information theoretical approach to
constitution and reduction of medical data. International Journal of
Medical Informatics 45 (1997), n. 1-2, 65-74.