kejkula - Knowledge Engineering Group
Self-Organised Data Mining – 20 Years after GUHA-80
Martin Kejkula
KEG 8th April 2004
http://gama.vse.cz/keg/
Agenda
Idea of Self-Organised Data Mining
GUHA-80 revival
Process of Self-Organised Data Mining
Key factors for Self-Organised Data Mining
Metabase, Knowledge Base, etc.
Proposed EverMiner system for Self-Organised Data Mining
2
Introduction
Motivation: support X-Miner users
Best practices, known problems collection
Mueller, Lemke: Self-Organising Data Mining (2000)
My thesis:
Designing and testing strings of jobs for EverMiner
Formalizing and using heuristics
3
References (1)
Hájek, P. – Havránek, T.: GUHA 80: An Application of Artificial Intelligence to Data Analysis. Computers and Artificial Intelligence, Vol. 1, 1982, pp. 107-134
Hájek, P. – Ivánek, J.: Artificial Intelligence and Data Analysis. Proc. COMPSTAT’82, Wien, Physica Verlag 1982, pp. 54-60
4
References (2)
Hájek, P. – Havránek, T.: GUHA-80 – An Application of Artificial Intelligence to Data Analysis. Matematické středisko biologických ústavů ČSAV, Praha, 1982
Jirků, P. – Havránek, T.: On Verbosity Levels in Cognitive Problem Solvers. Proc. Computational Linguistics, 1982, http://acl.eldoc.ub.rug.nl/mirror/C/C82/
5
References (3)
Rauch, J.: EverMiner – studie projektu [EverMiner: a project study]. Dokumentace projektu LISp-Miner [LISp-Miner project documentation], 2003.
Mueller, J.-A. – Lemke, F.: Self-Organising Data Mining. Extracting Knowledge from Data. Dresden, Berlin, 2000.
6
GUHA-80: Main Features
Application of artificial intelligence to exploratory data analysis
To generate interesting views onto given empirical data (recognize interesting logical patterns)
Views: relevant, useful
7
GUHA-80 Sources (1)
GUHA
Automatically generate all interesting
hypotheses
Lenat’s AM
Jobs (tasks)
Agenda of jobs
Hundreds of heuristic rules
Concepts
8
GUHA-80 Sources (2)
GUHA-80 vs. Lenat’s AM
Data
• Data-processing procedures
Statistical program packages
Effective modules
9
GUHA-80 Paradigm
Open-ended data analysis
To maximize interestingness value
Hundreds of heuristic rules
Guide to define and study the next step
Access potentially relevant rules,
Find truly relevant rules,
Follow truly relevant rules
10
Interestingness in GUHA-80
No explicit definition
Determined by interplay of:
Heuristic rules
Weighting mechanisms
Testing in practice (adequate behaviour?)
No algorithm, but constraints
11
Principles of GUHA-80
Domain dependence (… exploratory data analysis)
Combine human abilities with machine capabilities
More heuristics are relevant
Interactivity with user
Non-routine (GUHA-80 is not for everyday data processing)
12
GUHA-80 Structure (1)
13
GUHA-80 Structure (2)
Input empirical data
Input parameters
How “interestingness” is understood
Effective modules (system’s knowledge)
Clustering procedures
GUHA procedures
Agenda of jobs (priority/weight)
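
The agenda of jobs ordered by priority/weight can be illustrated with a small priority queue. This is only a minimal sketch: the class name `Agenda`, the job names, and the numeric weights are hypothetical and are not taken from GUHA-80 itself.

```python
import heapq
import itertools


class Agenda:
    """Agenda of jobs: the job with the highest weight is taken first."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker: keeps insertion order

    def add(self, weight, job_name):
        # heapq is a min-heap, so the weight is negated to pop the largest first
        heapq.heappush(self._heap, (-weight, next(self._order), job_name))

    def pop(self):
        neg_weight, _, job_name = heapq.heappop(self._heap)
        return job_name, -neg_weight

    def __len__(self):
        return len(self._heap)


# Hypothetical jobs and weights, for illustration only
agenda = Agenda()
agenda.add(0.4, "run clustering procedure")
agenda.add(0.9, "run GUHA procedure")
agenda.add(0.4, "evaluate basic statistics")

order = []
while agenda:
    order.append(agenda.pop()[0])
```

A heuristic rule would then act on this structure by adding new jobs or changing weights after each job finishes.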
14
GUHA-80 Structure (3)
Heuristics: optimal way to realize a job
Changing system of concepts
Hierarchy of concepts (applicability)
Possible unification of heuristics, jobs,…
15
GUHA-80 Input
Data
Input information
Decompositions/orderings of sets of quantities
Help understand “interestingness”
20
GUHA-80 Effective modules
Evaluation of usual statistical characteristics, …
Complicated procedures
Synthesis of parameters (“job on job”)
21
GUHA-80
Hundreds of heuristic rules
No explicit definition of interestingness (exploration in a space)
Interactivity with the user
Non-routine character
22
Process of S-O Data Mining
Empirical Data
Domain Knowledge, …
Chains of Data & Knowledge Processing Tasks
All Interesting Views, Patterns
DataSource, TimeTransf, SumatraTT, 4ft, KL, CF, …
23
Process of S-O Data Mining
24
Key Factors of S-O Data Mining
Data Preparation
Modeling
Evaluation
Knowledge Base
Domain Knowledge
25
Data Preparation
Discretization
Attribute Type dependent:
• Nominal/Ordinal/Interval/Ratio
Type of coefficient dependent
Discretization-Modeling Cycle (KL, 4ft, CF,…)
Known problem with intervals of categories without values
Usually not one target attribute
26
Attribute type dependent discretization
Nominal
Classes of values
Ordinal
Extreme/missing values
Type of coefficient
Usually not one target attribute
27
Intervals of Categories without Values
28
Intervals of Categories without Values
Solution:
Statistics: extreme values
4ft Task: correlations, implications
Potentially interesting patterns
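
One way the problem can arise is sketched below: with equidistant intervals, a single extreme value stretches the range and leaves intermediate intervals empty. The function names and the sample `ages` column are assumptions made for this illustration; they do not come from LISp-Miner.

```python
def equidistant_counts(values, n_intervals):
    """Count values per interval when [min, max] is cut into equal widths."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_intervals
    counts = [0] * n_intervals
    for v in values:
        # the maximum goes to the last interval; a constant column to the first
        idx = min(int((v - lo) / width), n_intervals - 1) if width else 0
        counts[idx] += 1
    return counts


def empty_intervals(values, n_intervals):
    """Indices of intervals (categories) that contain no value."""
    counts = equidistant_counts(values, n_intervals)
    return [i for i, c in enumerate(counts) if c == 0]


# Hypothetical column with one extreme value (95)
ages = [21, 23, 24, 25, 27, 28, 30, 31, 95]
counts = equidistant_counts(ages, 5)
empty = empty_intervals(ages, 5)
```

Here three of the five categories are empty, which points at the extreme value as the thing to examine first.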
29
Extreme/Missing Values
4ft: Find associations between extreme/missing values (implications/correlations)
CF, KL: Find patterns with extreme/missing values
30
Data Preparation
Classes of attributes
Partial cedents
Associations between attributes in one class
Associations between partial cedents
31
Evaluation-Modeling
Input information for partial cedents
Mining for Interesting Patterns
Exceptions
Missing values
Extreme values
Discovered hypotheses
Groups of hypotheses
Coverage of input data by hypotheses
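
Coverage of the input data by hypotheses could be measured roughly as follows. The slide gives no definition, so this sketch assumes that a row is covered when it satisfies both the antecedent and the succedent of at least one hypothesis; the row format, attribute names, and function names are hypothetical.

```python
def covers(hypothesis, row):
    """A row is covered if it matches every attribute=value pair
    of the antecedent and of the succedent (assumed definition)."""
    antecedent, succedent = hypothesis
    required = {**antecedent, **succedent}
    return all(row.get(attr) == value for attr, value in required.items())


def uncovered_rows(hypotheses, rows):
    """Indices of rows not covered by any hypothesis."""
    return [i for i, row in enumerate(rows)
            if not any(covers(h, row) for h in hypotheses)]


# Hypothetical data matrix and one discovered hypothesis
rows = [
    {"smoker": "yes", "bp": "high"},
    {"smoker": "yes", "bp": "high"},
    {"smoker": "no", "bp": "normal"},
    {"smoker": "no", "bp": "high"},
]
hypotheses = [({"smoker": "yes"}, {"bp": "high"})]

missing = uncovered_rows(hypotheses, rows)
coverage = 1 - len(missing) / len(rows)
```

The uncovered rows are the natural input for a follow-up job that searches for associations on that subset.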
32
Heuristic Rules (1)
Examples:
IF many extreme/missing values are found THEN search for associations with extreme/missing values
IF 0 hypotheses are found THEN set weaker quantifier (p, Base) values
IF a subset of input data is not covered by hypotheses THEN search for associations covering these data
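
The second rule above can be sketched as a loop that weakens p until something is found. The sketch assumes the quantifier is founded implication, which holds when a/(a+b) >= p and a >= Base, where a counts rows satisfying both antecedent and succedent and b counts rows satisfying the antecedent only. The candidate list, the step size, and the lower bound `p_min` are hypothetical.

```python
def founded_implication(a, b, p, base):
    """Founded implication on 4ft frequencies: a/(a+b) >= p and a >= base."""
    return a >= base and (a + b) > 0 and a / (a + b) >= p


def mine(candidates, p, base):
    """Names of candidate hypotheses that pass the quantifier."""
    return [name for name, a, b in candidates
            if founded_implication(a, b, p, base)]


def mine_with_relaxation(candidates, p, base, p_step=0.05, p_min=0.5):
    """IF 0 hypotheses found THEN lower p and try again (down to p_min)."""
    while True:
        found = mine(candidates, p, base)
        if found or p - p_step < p_min:
            return found, p
        p = round(p - p_step, 10)  # rounding avoids floating-point drift


# Hypothetical candidates: (name, a, b)
candidates = [("A => B", 40, 10), ("C => D", 30, 20)]
found, final_p = mine_with_relaxation(candidates, p=0.95, base=20)
```

Base could be relaxed in the same way; only p is varied here to keep the sketch short.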
33
Heuristic Rules (2)
Examples:
IF a column (of the input data matrix) is of nominal type AND there is no associated table for discretization THEN each value is one category (attribute creation)
Use “subset” coefficient type for nominal attributes
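
The attribute-creation rule for nominal columns can be sketched as below. The function name, its parameters, and the dictionary form of the discretization table are assumptions for illustration, not the LISp-Miner interface.

```python
def create_attribute(column_values, column_type, discretization_table=None):
    """Map raw column values to categories.

    IF nominal AND no discretization table THEN each value is one category.
    """
    if discretization_table is not None:
        # an associated table exists: use its value -> category mapping
        return [discretization_table[v] for v in column_values]
    if column_type == "nominal":
        # no table: every distinct value becomes its own category
        return list(column_values)
    raise ValueError("no rule for column type: " + column_type)


# Hypothetical nominal column
city = ["Praha", "Brno", "Praha", "Ostrava"]
attribute = create_attribute(city, "nominal")
categories = sorted(set(attribute))

# The same column with a hypothetical discretization table
table = {"Praha": "capital", "Brno": "other", "Ostrava": "other"}
grouped = create_attribute(city, "nominal", table)
```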
34
Metabase, Knowledge Base
Metadata (Knowledge):
Results of Previous X-Miner Tasks
Domain Knowledge
Interaction with User (learning?)
35
GUHA-80 vs. X-Miner (1)
Task parameters (partial cedents, …)
SW, HW
Experience with LM applications, …
36
GUHA-80 vs. X-Miner (2)
More complex heuristics
37
EverMiner – Features
Based on LISp-Miner (X-Miners)
Agenda of jobs, priority/strings
Heuristics
Interaction with user
Enables repeating the process on new data (“check” vs. new KDD process)
38
EverMiner – where we are
Experience (medicine, traffic, shares, sociology, …)
Heuristics collection (www, brainstorming)
Co-operation with data preparation experts (FEL, SumatraTT)
Testing “Strings of jobs” (learning)
39
Discussion
40