Data Mining & Knowledge
Discovery: A Review of Issues
and a Multi-strategy Approach
Ryszard S. Michalski and Kenneth A. Kaufman
[Overview]
Emergence of a new research area: Data mining & Knowledge discovery
From abundant raw data, using
Machine learning
Pattern recognition
Statistical analysis
Data visualization
Neural nets, etc.
to useful, task-oriented knowledge
[2.1] Introduction
How to extract useful, task-oriented knowledge from abundant
raw data?
Traditional/current methods
Regression analysis
Clustering analysis
Multi-dimensional analysis
Time series analysis
Numerical taxonomy
Stochastic models
Non-linear estimation techniques
etc.
Limitation: primarily oriented toward explaining the quantitative & statistical characteristics of data
Continued
Traditional statistical methods can
explain covariance/correlation between variables in data
explain the central tendency/variance of given factors
fit a curve to a set of datapoints
create a classification of entities
specify a numerical similarity
etc.
But can't
characterize the dependencies at an abstract, conceptual level
produce a causal explanation of the reasons why
develop a justification of these relationships in the form of higher-level, logic-style descriptions
produce a qualitative description of the regularities
determine functions not explicitly provided in the data
draw an analogy between the discovered regularity and one in another domain
hypothesize reasons for the entities being in the same category
Continued
Moreover, traditional methods cannot by themselves take in domain knowledge and automatically generate the relevant attributes.
To overcome this:
Data + Background Knowledge
using machine learning techniques => symbolic reasoning becomes possible
=> Data mining & Knowledge discovery
Goal of the research in this field:
to develop computational models for acquiring knowledge
from data and background knowledge
Continued
• Machine learning and existing traditional methods are applied to derive task-oriented data characteristics and generalizations.
•'Task-oriented' means that different knowledge must be obtainable from the same data, so ultimately a multi-strategy approach is required (since different tasks require different data exploration and knowledge generalization).
•The aim of the multi-strategy approach is to obtain knowledge in a form similar to the data descriptions a human expert would produce.
Such knowledge can take various description forms:
logical, numerical, statistical, graphical, etc.
•Main constraint:
the knowledge description must be easy for a domain expert to understand and interpret,
i.e., it must satisfy the "principle of comprehensibility".
Continued
Distinction between Data mining & Knowledge discovery
D-M: Application of Machine learning and other methods to
the enumeration of patterns over the data
K-D: The whole process of the data analysis lifecycle
Identification of the data analysis goal
Acquisition & organization of raw data
Generation of potentially useful knowledge
Interpretation and testing of the results
[2.2] Machine learning & multi-strategy data exploration
Two points to be explained here
•Relationship between Machine learning methodology & goals
of Data mining and Knowledge discovery
•How methods of symbolic M-L can be used for (semi-)automating the conceptual exploration of data and the generation of task-oriented knowledge from it
[2.2.1] Determining general rules from specific cases
•Multi-strategy data exploration is based on
(1)Examples of decision classes (or classes of relationship)
(2)Problem-relevant knowledge
given as 3 types of descriptions:
Attributional descriptions of entities
Structural descriptions of entities
Relational descriptions of entities
From these, "symbolic inductive learning" hypothesizes a general description of each class in the following forms:
(1)decision rules
(2)decision trees
(3)semantic nets
etc.
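A minimal sketch of this idea, loosely in the spirit of covering rule learners such as AQ (the toy data and the `learn_rule` helper are hypothetical, not from the paper): from attributional descriptions of positive and negative examples, hypothesize a decision rule that covers the positives and excludes the negatives.

```python
# Simplified symbolic inductive learning from attributional
# descriptions: find attribute-value tests that hold for all positive
# examples and for no negative example.

positives = [  # entities in the target decision class
    {"shape": "round", "color": "red"},
    {"shape": "round", "color": "green"},
]
negatives = [  # entities outside the class
    {"shape": "square", "color": "red"},
    {"shape": "square", "color": "blue"},
]

def learn_rule(pos, neg):
    """Return attribute-value conditions true of all pos, no neg."""
    conditions = {}
    for attr in pos[0]:
        values = {e[attr] for e in pos}
        # keep the attribute only if positives agree on one value
        # that never appears among the negatives
        if len(values) == 1 and all(e[attr] not in values for e in neg):
            conditions[attr] = values.pop()
    return conditions

rule = learn_rule(positives, negatives)
```

Here `color` is dropped (positives disagree on it) and the learned rule reduces to the single condition `shape = round`.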
Two types of data exploration operators
(1)Operators for defining general symbolic descriptions of a designated group or groups of entities in a data set.
•Describe the characteristics common to the entities within each group
•Via a mechanism called 'constructive induction', abstract concepts not present in the original data can be used.
Learning "characteristic concept descriptions"
Continued
(2)Operators for defining differences between different groups of entities
Learning “Discriminant concept descriptions”
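The two operator types can be illustrated with a toy sketch (all data and helper functions below are hypothetical): a characteristic description states everything common to a group, while a discriminant description keeps only what separates it from the contrasting group.

```python
# Characteristic vs. discriminant concept descriptions on toy data.

mammals = [
    {"legs": 4, "fur": True, "lays_eggs": False},
    {"legs": 4, "fur": True, "lays_eggs": False},
]
reptiles = [
    {"legs": 4, "fur": False, "lays_eggs": True},
]

def characteristic(group):
    """Attribute-value pairs shared by every entity in the group."""
    return {a: group[0][a] for a in group[0]
            if all(e[a] == group[0][a] for e in group)}

def discriminant(group, others):
    """Shared pairs that also never occur in the contrasting group."""
    shared = characteristic(group)
    return {a: v for a, v in shared.items()
            if all(e.get(a) != v for e in others)}

char = characteristic(mammals)          # includes legs=4
disc = discriminant(mammals, reptiles)  # drops legs=4 (not distinctive)
```

`legs = 4` belongs to the characteristic description of mammals but not to the discriminant one, because it does not distinguish them from the reptiles.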
Basic assumptions in concept learning
•Examples don’t have errors.
•All attributes have specified values in them.
•All examples are located in the same database.
•All concepts have a precise (crisp) description that doesn't change over time.
These assumptions don't hold in real problems:
(1)Incorrect data: errors/noise exist
(2)Incomplete data: values of some attributes are unknown
(3)Distributed data: learning from separate collections of data
(4)Drifting or evolving concepts: unstable, non-static concepts
(5)Data arriving over time: incremental learning
(6)Biased data: the actual distribution of events is not reflected
Continued
•Integrating qualitative & quantitative discovery
: To define sets of equations for a given set of data points, and qualitative conditions for the application of those equations.
•Qualitative prediction
: Find patterns in a sequence/process and use them to qualitatively predict future inputs.
[2.2.2] Conceptual clustering
•Another class of machine learning methods related to D-M & K-D.
•Similar to traditional cluster analysis but quite different.
Inputs:
(1)A set of attributional descriptions of some entities
(2)A description language for characterizing classes of such entities
(3)A classification quality criterion
=> Clustering =>
Outputs:
(1)A classification structure of entities
(2)Symbolic descriptions of the outcome classes
Main difference from classical clustering techniques
Difference between Conceptual & Traditional clustering
•In Traditional clustering: the similarity measure is a function only of the properties (attribute values) of the entities.
Similarity(A,B) = f(properties)
Continued
•In Conceptual clustering: the similarity measure is a function of the properties of the entities and two other factors:
Description language (L)
Environment (E)
Conceptual cohesiveness(A,B) = f(properties, L, E)
[Figure: a scatter of points forming two elongated clusters; points A and B lie close together but on different clusters]
Fig. An illustration of the difference between closeness and conceptual cohesiveness
Two points A and B may be put into the same cluster from the viewpoint of the traditional method, but into different clusters by conceptual clustering.
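The figure's point can be sketched numerically (hypothetical coordinates and a deliberately crude description language of two lines): A and B lie close together by plain distance, yet concept membership assigns them to different clusters.

```python
# Closeness vs. conceptual cohesiveness on toy 2-D data.
import math

A, B = (3.0, 3.2), (3.0, 4.8)  # close together in the plane
cluster_lower = [(0, 0), (1, 1), (2, 2), A]  # points near y = x
cluster_upper = [(0, 2), (1, 3), (2, 4), B]  # points near y = x + 2

distance = math.dist(A, B)  # small Euclidean distance

def fits_line(p, intercept):
    """Crude concept membership: small residual from y = x + intercept."""
    return abs(p[1] - (p[0] + intercept)) < 0.5

# Despite their closeness, A fits the lower-line concept and B the
# upper-line concept, so they land in different conceptual clusters.
same_concept = fits_line(A, 0) and fits_line(B, 0)
```

A distance-only measure would happily merge A and B; the description language (the two line concepts) keeps them apart.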
[2.2.3] Constructive induction
•In learning rules or decision trees from examples, the initially given attributes may be only indirectly relevant, or irrelevant, to the learning problem at hand.
•Advantage of symbolic methods over statistical methods: symbolic methods can determine non-essential attributes more easily than statistical methods.
•How to improve the representation space
(1)Removing less relevant attributes.
(2)Generating new relevant attributes.
(3)Abstracting attributes (i.e., grouping some attribute values).
•"Constructive Induction" consists of two phases:
(1)Construction of the best representation space
(2)Generation of the best hypothesis in the space found above
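The two phases can be sketched on toy data (the shapes data and the ratio attribute are hypothetical): neither original attribute separates the classes by a single threshold, but a constructed attribute does.

```python
# Constructive induction sketch: build a better representation space,
# then find a simple hypothesis in it.

examples = [  # (length, width, class) -- toy rectangle data
    (2.0, 1.0, "elongated"), (6.0, 3.0, "elongated"),
    (2.0, 2.1, "compact"), (5.0, 4.8, "compact"),
]

# Phase 1: improve the representation space with a derived attribute
# (length/width ratio), not present in the original data.
augmented = [(l, w, l / w, cls) for l, w, cls in examples]

# Phase 2: search for a hypothesis in the new space -- a single
# threshold on the constructed attribute now separates the classes.
def classify(length, width):
    return "elongated" if length / width > 1.5 else "compact"
```

On the original attributes the class regions overlap (e.g. both classes contain length 2.0); on the constructed ratio they are linearly separable.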
[2.2.4] Selection of the most representative examples
Usually the database is very large => the process of determining/generating patterns and rules is quite time-consuming.
Therefore, extracting the most representative cases of the given classes is necessary to make the process more efficient.
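One simple (hypothetical) selection criterion, not prescribed by the paper, is to keep per class the single case closest to its class centroid:

```python
# Example selection sketch: keep the most "typical" case per class.
import math

data = {
    "small": [(1.0, 1.2), (0.8, 1.0), (3.0, 3.0)],  # last is atypical
    "large": [(9.0, 9.1), (8.8, 9.0), (6.0, 6.0)],  # last is atypical
}

def representative(points):
    """Return the point nearest to the centroid of its class."""
    cx = sum(p[0] for p in points) / len(points)
    cy = sum(p[1] for p in points) / len(points)
    return min(points, key=lambda p: math.dist(p, (cx, cy)))

selected = {cls: representative(pts) for cls, pts in data.items()}
```

Training on `selected` instead of `data` trades some fidelity for a much smaller, faster-to-process example set.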
[2.2.5] Integration of Qualitative & Quantitative discovery
For databases that include numerical attributes, quantitative discovery can find equations that describe the relationships among those attributes well. Under different qualitative conditions, however, a single fixed quantitative equation cannot explain the data, so a method is needed that determines the quantitative equation according to the qualitative conditions.
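A toy sketch of the idea (the substance data and the pre-fitted equations are invented for illustration): a qualitative condition selects which quantitative equation applies.

```python
# Integrating qualitative & quantitative discovery: one equation per
# qualitative regime, chosen by a qualitative condition.

# Toy observations: (state, temperature, volume) of some substance
observations = [
    ("solid", 10, 100), ("solid", 20, 101),
    ("gas", 10, 500), ("gas", 20, 1000),
]

# Quantitative part: an assumed, pre-fitted equation per regime.
equations = {
    "solid": lambda t: 99 + 0.1 * t,  # near-constant volume
    "gas": lambda t: 50 * t,          # volume grows with temperature
}

def predict(state, temperature):
    # Qualitative condition decides which equation applies.
    return equations[state](temperature)
```

No single fixed equation fits both regimes; the qualitative condition (`state`) determines which quantitative law to use, which is exactly the combination the text calls for.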
[2.2.6] Qualitative prediction
The goal is not to predict a specific value of a variable (as in time series analysis), but to qualitatively describe a plausible future object.
[2.2.7] Summarizing the ML-oriented approach
Traditional statistical methods
•Oriented towards numerical characterization of a data set
•Used for globally characterizing a given class of objects
Machine learning methods
• Primarily oriented towards symbolic logic-style descriptions of data
•Can determine descriptions for predicting the class membership of future objects
But a multi-strategy approach combining the above two is necessary,
since different types of questions require different exploratory strategies.
[2.3] Classification of data exploration tasks
How to use the GDT(General Data Table) to relate Machine learning
techniques to data exploration problems?
(1) Learning rules from examples
Take one discrete attribute as the output attribute and the remaining attributes as inputs; using the given set of rows as training samples, derive the relationships (rules) between them. => This can be applied to any of the attributes.
(2) Determining time-dependent patterns
Detection of temporal patterns in sequences of data arranged along the time dimension in a GDT.
Using
Multi-model method for qualitative prediction
Temporal constructive induction technique
(3) Example selection
Select rows from the table corresponding to the most representative examples of
different classes.
Continued
(4) Attribute selection
Also called feature selection; remove the columns corresponding to the attributes least relevant to the learning task.
Attribute selection measures such as Gain ratio or Promise level are typically used.
(5) Generating new attributes
Using the constructive induction described earlier, generate new relevant attributes from the initially given attributes.
(6) Clustering
Using the conceptual clustering described earlier, partition the rows of the GDT into the desired groups (clusters). => The rules describing the resulting clusters are stored in the knowledge base.
(7) Determining attribute dependencies
Determine relationships (e.g., correlation, causal dependencies, logical dependencies) among attributes (columns) using statistical/logical methods
Continued
(8) Incremental rule update
Update the working knowledge(rules) to accommodate new information
(9) Searching for approximate patterns in the (imperfect) data
Determine the best hypothesis that accounts for most of the available data
(10) Filling in missing data
Determine plausible values for the missing entries through analysis of the currently available data
(11) Determining decision structures for declarative knowledge (decision rules)
Given general decision rules hypothesized for a data set (GDT), it is desirable to convert them into the form of a decision tree (or decision structure) so that they can be used to make predictions for new cases.
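A toy sketch of this conversion (the weather-style rules and attributes are invented): the same declarative rules, and a decision-structure form derived from them in which attributes are tested in a fixed order, which is more convenient for classifying new cases.

```python
# Declarative rules vs. a decision structure derived from them.

# Declarative knowledge: order-independent condition -> decision rules.
rules = [
    ({"outlook": "sunny", "windy": False}, "play"),
    ({"outlook": "sunny", "windy": True}, "stay"),
    ({"outlook": "rainy"}, "stay"),
]

def classify_by_rules(case):
    """Scan the rule set for the first rule whose conditions match."""
    for conditions, decision in rules:
        if all(case.get(a) == v for a, v in conditions.items()):
            return decision
    return None

def classify_by_tree(case):
    """Decision structure: attributes tested in a fixed order."""
    if case["outlook"] == "rainy":
        return "stay"
    # outlook == "sunny": one remaining test decides
    return "stay" if case["windy"] else "play"
```

Both classifiers agree on every case; the tree form simply fixes the testing order, so prediction needs at most two attribute tests instead of a scan over the whole rule set.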