kejkula - Knowledge Engineering Group

Self-Organised Data Mining
– 20 Years after GUHA-80
Martin Kejkula
KEG 8th April 2004
http://gama.vse.cz/keg/
Agenda
• Idea of Self-Organised Data Mining
• GUHA-80 revival
• Process of Self-Organised Data Mining
• Key factors for Self-Organised Data Mining
• Metabase, Knowledge Base, etc.
• Proposed EverMiner system for Self-Organised Data Mining
2
Introduction
• Motivation: support X-Miner users
  - Best practices, known problems collection
  - Mueller, Lemke: Self-Organising Data Mining (2000)
• My thesis:
  - Design/test strings of jobs for EverMiner
  - Formalization/using heuristics
3
References (1)
• Hájek, P. – Havránek, T.: GUHA 80: An Application of Artificial Intelligence to Data Analysis. Computers and Artificial Intelligence, Vol. 1, 1982, pp. 107-134
• Hájek, P. – Ivánek, J.: Artificial Intelligence and Data Analysis. Proc. COMPSTAT’82, Wien, Physica Verlag 1982, pp. 54-60
4
References (2)
• Hájek, P. – Havránek, T.: GUHA-80 – An Application of Artificial Intelligence to Data Analysis. Matematické středisko biologických ústavů ČSAV, Praha, 1982
• Jirků, P. – Havránek, T.: On Verbosity Levels in Cognitive Problem Solvers. Proc. Computational Linguistics, 1982, http://acl.eldoc.ub.rug.nl/mirror/C/C82/
5
References (3)
• Rauch, J.: EverMiner – studie projektu [EverMiner: a project study]. Dokumentace projektu LISp-Miner [LISp-Miner project documentation], 2003.
• Mueller, J.-A. – Lemke, F.: Self-Organising Data Mining. Extracting Knowledge from Data. Dresden, Berlin, 2000.
6
GUHA-80: Main Features
• Application of artificial intelligence to exploratory data analysis
• To generate interesting views of given empirical data (recognize interesting logical patterns)
• Views: relevant, useful
7
GUHA-80 Sources (1)
• GUHA
  - Automatically generate all interesting hypotheses
• Lenat’s AM
  - Jobs (tasks)
  - Agenda of jobs
  - Hundreds of heuristic rules
  - Concepts
8
GUHA-80 Sources (2)
• GUHA-80 vs. Lenat’s AM
  - Data
    - Data-processing procedures
• Statistical program packages
  - Effective modules
9
GUHA-80 Paradigm
• Open-ended data analysis
  - To maximize interestingness value
• Hundreds of heuristic rules
  - Guide to define and study next step
• Access potentially relevant rules, find truly relevant rules, follow truly relevant rules
10
Interestingness in GUHA-80
• No explicit definition
• Determined by the interplay of
  - Heuristic rules
  - Weighting mechanisms
  - Testing in practice (adequate behaviour?)
• No algorithm, but constraints
11
Principles of GUHA-80
• Domain dependence (…exploratory data analysis)
• Combine human capabilities with those of the machine
• More heuristics are relevant
• Interactivity with user
• Non-routine (GUHA-80 is not for everyday data processing)
12
GUHA-80 Structure (1)
13
GUHA-80 Structure (2)
• Input empirical data
• Input parameters
  - How “interestingness” is understood
• Effective modules (system’s knowledge)
  - Clustering procedures
  - GUHA procedures
• Agenda of jobs (priority/weight)
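The agenda of jobs ordered by priority/weight can be pictured as a priority queue. The following is only a minimal sketch under stated assumptions: the class names `Job` and `Agenda`, the job names and the numeric weights are all hypothetical and are not taken from GUHA-80 itself.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class Job:
    # heapq is a min-heap, so the weight is stored negated:
    # the job with the largest weight is popped first.
    sort_key: float
    seq: int                          # tie-breaker: earlier jobs first
    name: str = field(compare=False)  # not used for ordering


class Agenda:
    """Hypothetical agenda: a priority queue of pending jobs."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def add(self, name, weight):
        heapq.heappush(self._heap, Job(-weight, next(self._counter), name))

    def pop(self):
        return heapq.heappop(self._heap).name

    def __len__(self):
        return len(self._heap)


# Illustrative jobs and weights (assumed, not from the slides).
agenda = Agenda()
agenda.add("cluster quantities", weight=0.4)
agenda.add("run GUHA procedure", weight=0.9)
agenda.add("compute basic statistics", weight=0.7)

order = [agenda.pop() for _ in range(len(agenda))]
print(order)
```

In such a design a heuristic rule would only need to call `add` with a new weight; the order of execution then follows from the weights alone.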
14
GUHA-80 Structure (3)
• Heuristics: optimal way to realize a job
• Changing system of concepts
• Hierarchy of concepts (applicability)
• Possible unification of heuristics, jobs,…
15
16
17
18
19
GUHA-80 Input
• Data
• Input information
  - Decompositions/orderings of sets of quantities
  - Help understand “interestingness”
20
GUHA-80 Effective modules
• Evaluation of usual statistical characteristics,…
• Complicated procedures
• Synthesis of parameters (“job on job”)
21
GUHA-80
• Hundreds of heuristic rules
• No explicit definition of interestingness (exploration in a space)
• Interactivity with the user
• Non-routine character
22
Process of S-O Data Mining
[Diagram. Labels: Empirical Data; Domain Knowledge,…; Chains of Data & Knowledge Processing Tasks; All Interesting Views, Patterns; DataSource, TimeTransf, SumatraTT, 4ft, KL, CF, …]
23
Process of S-O Data Mining
24
Key Factors of S-O Data Mining
• Data Preparation
• Modeling
• Evaluation
• Knowledge Base
• Domain Knowledge
25
Data Preparation
• Discretization
  - Attribute type dependent:
    - Nominal/Ordinal/Interval/Ratio
  - Type of coefficient dependent
  - Discretization-Modeling Cycle (KL, 4ft, CF,…)
  - Known problem with intervals of categories without values
  - Usually not one target attribute
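The known problem with intervals that contain no values can be illustrated with a small sketch. It assumes equal-width discretization, which the slides do not prescribe; the function name `equal_width_bins` and the sample values are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Count values per equal-width interval over [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; no intervals to build")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum lies on the right edge; clamp it into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


# Illustrative data: one extreme value stretches the range,
# so the middle intervals end up without any values.
values = [20, 22, 25, 27, 29, 100]
edges, counts = equal_width_bins(values, n_bins=4)
empty_intervals = [i for i, c in enumerate(counts) if c == 0]

print(counts)
print(empty_intervals)
```

A single extreme value is enough to leave most intervals empty here, which is one reason discretization and modeling may need to be repeated in a cycle.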
26
Attribute type dependent discretization
• Nominal
  - Classes of values
• Ordinal
  - Extreme/missing values
  - Type of coefficient
  - Usually not one target attribute
27
Intervals of Categories without Values
28
Intervals of Categories without Values
• Solution:
  - Statistics – extreme values
  - 4ft Task: correlations, implications
  - Potentially interesting patterns
29
Extreme/Missing Values
• 4ft: Find associations between extreme/missing values (impl/correl)
• CF, KL: Find patterns with extreme/missing values
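As a rough illustration of the 4ft case, the association between missing values in two columns can be summarised in a fourfold (4ft) contingency table (a, b, c, d). This is a sketch only: the column names `income` and `age`, the sample rows and the helper `fourfold_table` are hypothetical and do not describe the 4ft-Miner implementation.

```python
def fourfold_table(antecedent, succedent):
    """Return (a, b, c, d) for two equally long boolean sequences.

    a: both true, b: antecedent only, c: succedent only, d: neither.
    """
    a = b = c = d = 0
    for x, y in zip(antecedent, succedent):
        if x and y:
            a += 1
        elif x:
            b += 1
        elif y:
            c += 1
        else:
            d += 1
    return a, b, c, d


# Illustrative rows; None stands for a missing value.
rows = [
    {"income": None, "age": None},
    {"income": None, "age": None},
    {"income": None, "age": 40},
    {"income": 1200, "age": 35},
    {"income": 900, "age": None},
    {"income": 1500, "age": 52},
]
income_missing = [r["income"] is None for r in rows]
age_missing = [r["age"] is None for r in rows]

a, b, c, d = fourfold_table(income_missing, age_missing)
# Share of rows with missing income that also have missing age.
confidence = a / (a + b)
print((a, b, c, d), confidence)
```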
30
Data Preparation
• Classes of attributes
  - Partial cedents
  - Associations between attributes in one class
  - Associations between partial cedents
31
Evaluation-Modeling
• Input information for partial cedents
• Mining for Interesting Patterns
  - Exceptions
  - Missing values
  - Extreme values
• Discovered hypotheses
  - Groups of hypotheses
  - Coverage hypotheses/input data
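One possible reading of "coverage hypotheses/input data" is the share of input rows that satisfy at least one discovered hypothesis. The sketch below takes that reading as an assumption; representing a hypothesis as a Python predicate on a row, and the attribute names `smoker` and `bp`, are purely illustrative.

```python
def coverage(rows, hypotheses):
    """Return (share of covered rows, list of uncovered rows).

    Assumption: a row is covered when at least one hypothesis,
    given here as a predicate on the row, holds for it.
    """
    covered = [any(h(r) for h in hypotheses) for r in rows]
    uncovered = [r for r, is_cov in zip(rows, covered) if not is_cov]
    return sum(covered) / len(rows), uncovered


# Illustrative data matrix and a single hypothetical hypothesis.
rows = [
    {"smoker": True, "bp": "high"},
    {"smoker": True, "bp": "high"},
    {"smoker": False, "bp": "normal"},
    {"smoker": False, "bp": "high"},
]
hypotheses = [lambda r: r["smoker"] and r["bp"] == "high"]

share, uncovered = coverage(rows, hypotheses)
print(share, uncovered)
```

The uncovered rows are the natural input for a follow-up job that searches for associations in the part of the data not yet explained.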
32
Heuristic Rules (1)
• Examples:
  - IF more extreme/missing values are found, THEN search for associations with extreme/missing values
  - IF 0 hypotheses are found, THEN set up weaker quantifier (p, Base) values
  - IF a subset of input data is not covered by hypotheses, THEN search for associations covering these data
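The second rule, weakening the quantifier when no hypotheses are found, might look as follows. The sketch assumes the founded implication quantifier, a/(a+b) >= p and a >= Base, evaluated on precomputed (a, b) frequencies; the step sizes, the lower bounds and the candidate names are hypothetical choices, not values from the talk.

```python
def founded_implication(a, b, p, base):
    """True when a / (a + b) >= p and a >= base."""
    return a >= base and a + b > 0 and a / (a + b) >= p


def mine_with_relaxation(candidates, p, base,
                         p_step=0.05, base_step=5,
                         p_min=0.5, base_min=1):
    """Weaken (p, Base) step by step until some hypothesis is found.

    The step sizes and lower bounds are assumed values for illustration.
    """
    while True:
        found = [name for name, (a, b) in candidates.items()
                 if founded_implication(a, b, p, base)]
        if found or (p <= p_min and base <= base_min):
            return found, p, base
        # Rounding keeps repeated float subtraction from drifting.
        p = max(p_min, round(p - p_step, 10))
        base = max(base_min, base - base_step)


# Hypothetical candidate rules with their (a, b) frequencies.
candidates = {"A => B": (30, 10), "C => D": (12, 3), "E => F": (4, 16)}
found, p, base = mine_with_relaxation(candidates, p=0.95, base=40)
print(found, p, base)
```

Stopping at the first non-empty result keeps the thresholds as strong as possible; a real system would likely also record each relaxation step in its knowledge base.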
33
Heuristic Rules (2)
• Examples:
  - IF nominal type of column (input data matrix) AND no associated table for discretization THEN each value is one category (attribute creation)
  - Use “subset” coefficient type for nominal attributes
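The first rule above can be sketched as a small function. The name `categories_for_nominal`, the sample column and the discretization table are hypothetical; the sketch only shows the fallback of one category per distinct value.

```python
def categories_for_nominal(column, discretization_table=None):
    """Map each distinct value of a nominal column to a category.

    Without a discretization table every value becomes its own
    category; missing values (None) are skipped.
    """
    values = sorted({v for v in column if v is not None})
    if discretization_table is None:
        return {v: v for v in values}
    return {v: discretization_table[v] for v in values}


# Illustrative nominal column with one missing value.
colours = ["red", "blue", "red", None, "green"]

default_categories = categories_for_nominal(colours)
grouped_categories = categories_for_nominal(
    colours, {"red": "warm", "blue": "cold", "green": "cold"}
)
print(default_categories)
print(grouped_categories)
```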
34
Metabase, Knowledge Base
• Metadata (Knowledge):
  - Results of Previous X-Miner Tasks
  - Domain Knowledge
  - Interaction with User (learning?)
35
GUHA-80 vs. X-Miner (1)
• Task parameters (partial cedents, …)
• SW, HW
• Experience with LM applications,…
36
GUHA-80 vs. X-Miner (2)
• More complex heuristics
37
EverMiner – Features
• Based on LISp-Miner (X-Miners)
• Agenda of jobs, priority/strings
• Heuristics
• Interaction with user
• Enables repeating the process on new data (“check” vs. new KDD process)
38
EverMiner – where we are
• Experience (medicine, traffic, shares, sociology,…)
• Heuristics collection (www, brainstorming)
• Co-operation with data preparation experts (FEL, SumatraTT)
• Testing “Strings of jobs” (learning)
39
Discussion
40