Knowledge Discovery in Databases


Brief introduction to lectures
Lecture 5: Automatic cluster detection
Lecture 6: Artificial neural networks
Lecture 7: Evaluation of discovered knowledge
Transparencies prepared by Ho Tu Bao [JAIST]
Lecture 5: Automatic Cluster Detection
• One of the most widely used KDD techniques for unsupervised data (data without class labels).
• Content of the lecture
1. Introduction
2. Partitioning Clustering
3. Hierarchical Clustering
4. Software and case-studies
• Prerequisite: Nothing special
Partitioning Clustering
A partition of a set of $n$ objects $X = \{x_1, x_2, \ldots, x_n\}$ is a collection of $K$ disjoint non-empty subsets $P_1, P_2, \ldots, P_K$ of $X$ ($K \le n$), often called clusters, satisfying the following conditions:
(1) they are disjoint: $P_i \cap P_j = \emptyset$ for all $P_i$ and $P_j$, $i \ne j$;
(2) their union is $X$: $P_1 \cup P_2 \cup \ldots \cup P_K = X$.
Denote the partition $P = \{P_1, P_2, \ldots, P_K\}$; the $P_i$ are called components of $P$.
• Each cluster must contain at least one object
• Each object must belong to exactly one group
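The two conditions translate directly into code. A minimal sketch, with an illustrative function name and set-based interface (not from the lecture):

```python
def is_partition(P, X):
    """Check conditions (1) and (2): the components of P are non-empty
    and pairwise disjoint, and their union is X."""
    if any(len(p) == 0 for p in P):
        return False  # each cluster must contain at least one object
    union = set().union(*P)
    # pairwise disjoint iff component sizes add up with no overlap
    return sum(len(p) for p in P) == len(union) and union == set(X)

print(is_partition([{1, 4}, {2, 3}], {1, 2, 3, 4}))  # True
print(is_partition([{1, 4}, {2, 4}], {1, 2, 3, 4}))  # False: 4 repeated, 3 missing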
Partitioning Clustering
What is a “good” partitioning clustering?
Key ideas: Objects in the same group are similar, and objects in different groups are dissimilar.
P  {{x1 , x4 , x7 , x9 , x10},{x2 , x7 },{x3 , x5 , x6 }}
  
P1
P2
P3
Minimize the within-group distance and
Maximize the between-group distance.
Notice: There are many ways to define the "within-group distance" (e.g., the average distance to the group's center, or the average distance between all pairs of objects) and the "between-group distance". Finding the optimal clustering is in general computationally infeasible.
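As a concrete illustration, here is a minimal sketch of k-means (Lloyd's algorithm), one standard partitioning method that iteratively reduces the within-group distance; the function name, parameters, and stopping rule are illustrative assumptions, not taken from the lecture.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Lloyd's algorithm: alternately assign each object to its nearest
    center, then move each center to the mean of its cluster."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # assignment step: label each object with the nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: recompute centers (keep old center if a cluster empties)
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # converged: centers stopped moving
        centers = new_centers
    return labels, centers
```

Note that this only finds a local optimum, consistent with the remark above that the globally optimal clustering is in general out of reach.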
Hierarchical Clustering
Partition Q is nested into partition P if every component
of Q is a subset of a component of P.
P  {x1 , x4 , x7 , x9 , x10},{x2 , x8},{x3 , x5 , x6 }
Q  {x1 , x4 , x9 },{x7 , x10},{x2 , x8},{x5},{x3 , x6 }
A hierarchical clustering is a sequence of partitions in
which each partition is nested into the next partition in
the sequence.
(This definition is for bottom-up hierarchical clustering; for top-down hierarchical clustering, "next" becomes "previous".)
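The nestedness test is a direct transcription of the definition; a small sketch using Python sets (names illustrative), checked against the example above with indices standing for the $x_i$:

```python
def is_nested(Q, P):
    """True iff partition Q is nested into partition P: every component
    of Q is a subset of some component of P."""
    return all(any(q <= p for p in P) for q in Q)

P = [{1, 4, 7, 9, 10}, {2, 8}, {3, 5, 6}]
Q = [{1, 4, 9}, {7, 10}, {2, 8}, {5}, {3, 6}]
print(is_nested(Q, P))  # True: Q refines P
```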
Bottom-up Hierarchical Clustering
$\{x_1, x_2, x_3, x_4, x_5, x_6\}$
$\{x_1, x_2, x_3, x_4\}, \{x_5, x_6\}$
$\{x_1, x_2, x_3\}, \{x_4\}, \{x_5, x_6\}$
$\{x_1\}, \{x_2, x_3\}, \{x_4\}, \{x_5\}, \{x_6\}$
$\{x_1\}, \{x_2\}, \{x_3\}, \{x_4\}, \{x_5\}, \{x_6\}$
(read from the bottom up: start with singletons and successively merge clusters)
[Dendrogram over the leaves $x_1, \ldots, x_6$]
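A minimal sketch of the bottom-up procedure with single linkage (cluster distance taken as the closest pair of members); the linkage choice and interface are illustrative assumptions, not the lecture's method:

```python
import math

def bottom_up(points):
    """Agglomerative clustering: start from singletons and repeatedly
    merge the two closest clusters, recording every partition."""
    clusters = [[p] for p in points]
    history = [[list(c) for c in clusters]]  # finest partition first
    while len(clusters) > 1:
        # single linkage: distance between clusters = closest member pair
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: min(
            math.dist(a, b) for a in clusters[ij[0]] for b in clusters[ij[1]]))
        clusters[i] += clusters.pop(j)
        history.append([list(c) for c in clusters])
    return history  # a sequence of partitions, each nested into the next
```

Each partition in history is nested into the next one, exactly as in the definition above.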
Top-Down Hierarchical Clustering
$\{x_1, x_2, x_3, x_4, x_5, x_6\}$
$\{x_1, x_2, x_3, x_4\}, \{x_5, x_6\}$
$\{x_1, x_2, x_3\}, \{x_4\}, \{x_5, x_6\}$
$\{x_1\}, \{x_2, x_3\}, \{x_4\}, \{x_5\}, \{x_6\}$
$\{x_1\}, \{x_2\}, \{x_3\}, \{x_4\}, \{x_5\}, \{x_6\}$
(read from the top down: start with one cluster and successively split clusters)
[Dendrogram over the leaves $x_1, \ldots, x_6$]
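For the top-down direction, one common sketch is bisecting: repeatedly split the largest cluster in two, here reusing the illustrative kmeans function from above (again an assumption, not necessarily the lecture's method):

```python
import numpy as np

def top_down(X, n_clusters):
    """Divisive clustering: start with one cluster and repeatedly
    bisect the largest one with 2-means (assumes n_clusters <= len(X))."""
    clusters = [np.arange(len(X))]  # each cluster is an index set into X
    while len(clusters) < n_clusters:
        big = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        idx = clusters.pop(big)
        labels, _ = kmeans(X[idx], k=2)  # illustrative kmeans sketch from above
        clusters += [idx[labels == 0], idx[labels == 1]]
    return clusters
```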
OSHAM: Hybrid Model
[Screenshot of the OSHAM system on the Wisconsin Breast Cancer data: a concept hierarchy of discovered concepts, brief descriptions of concepts, multiple-inheritance concepts, and the data attributes]
Brief introduction to lectures
Lecture 1: Overview of KDD
Lecture 2: Preparing data
Lecture 3: Decision tree induction
Lecture 4: Mining association rules
Lecture 5: Automatic cluster detection
Lecture 6: Artificial neural networks
Lecture 7: Evaluation of discovered knowledge
Lecture 6: Neural networks
• One of the most widely used KDD
classification techniques.
• Content of the lecture
1. Neural network representation
2. Feed-forward neural networks
3. Using back-propagation algorithm
4. Case-studies
• Prerequisite: Nothing special
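To make items 2 and 3 concrete, here is a minimal feed-forward network trained with back-propagation; the architecture (2-4-1, sigmoid units), the XOR task, and all rates are illustrative assumptions, not from the lecture:

```python
import numpy as np

def train_xor(epochs=10000, lr=0.5, seed=0):
    """Tiny 2-4-1 sigmoid network trained by back-propagation on XOR
    (squared error; with some seeds more epochs may be needed)."""
    rng = np.random.default_rng(seed)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
    y = np.array([[0], [1], [1], [0]], float)
    W1, W2 = rng.normal(size=(2, 4)), rng.normal(size=(4, 1))
    b1, b2 = np.zeros(4), np.zeros(1)
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    for _ in range(epochs):
        # forward pass through the two layers
        h = sig(X @ W1 + b1)
        out = sig(h @ W2 + b2)
        # backward pass: error derivatives, output layer then hidden layer
        d_out = (out - y) * out * (1 - out)
        d_h = (d_out @ W2.T) * h * (1 - h)
        # gradient-descent weight updates
        W2 -= lr * (h.T @ d_out)
        b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * (X.T @ d_h)
        b1 -= lr * d_h.sum(axis=0)
    return out  # predictions should approach [0, 1, 1, 0]
```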
Brief introduction to lectures
Lecture 1: Overview of KDD
Lecture 2: Preparing data
Lecture 3: Decision tree induction
Lecture 4: Mining association rules
Lecture 5: Automatic cluster detection
Lecture 6: Artificial neural networks
Lecture 7: Evaluation of discovered knowledge
Lecture 7: Evaluation of discovered knowledge
• Techniques for estimating the quality of knowledge discovered from data, a key step of the KDD process.
• Content of the lecture
1. Cross validation
2. Bootstrapping
3. Case-studies
• Prerequisite: Nothing special
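A minimal sketch of item 2, bootstrap error estimation in its plain out-of-bag form; fit is an assumed interface that trains on data and returns a prediction function, not anything defined in the lecture:

```python
import numpy as np

def bootstrap_error(X, y, fit, n_rounds=100, seed=0):
    """Bootstrapping: train on a sample drawn with replacement, test on
    the objects left out of that sample, and average over rounds."""
    rng = np.random.default_rng(seed)
    n, errors = len(X), []
    for _ in range(n_rounds):
        idx = rng.integers(0, n, size=n)          # sample with replacement
        oob = np.setdiff1d(np.arange(n), idx)     # "out-of-bag" test objects
        if len(oob) == 0:
            continue  # rare: every object was drawn
        model = fit(X[idx], y[idx])               # any induction method
        errors.append(np.mean(model(X[oob]) != y[oob]))
    return float(np.mean(errors))
```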
Out-of-sample testing
[Flow diagram: Historical Data (warehouse) → sampling method → sample data → sampling method → training data (2/3) and testing data (1/3); induction method + training data → model; model + testing data → error estimation → error]

The quality of the test-sample estimate depends on the number of test cases and on the validity of the independence assumption.
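A sketch of the 2/3-1/3 holdout split in the diagram; fit is the same assumed train-then-predict interface as above:

```python
import numpy as np

def holdout_error(X, y, fit, test_frac=1/3, seed=0):
    """Out-of-sample testing: train on 2/3 of the data and estimate the
    error on the held-out 1/3."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))                 # random split
    n_test = int(len(X) * test_frac)
    test, train = idx[:n_test], idx[n_test:]
    model = fit(X[train], y[train])
    return float(np.mean(model(X[test]) != y[test]))
```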
Cross Validation

[Flow diagram: Historical Data (warehouse) → sampling method → sample data → sampling method → Sample 1, Sample 2, ..., Sample n, mutually exclusive and of equal size; iterate: induction method → model → error estimation → run's error; the runs' errors are combined by a final error estimation]

10-fold cross validation appears adequate (n = 10).
Evaluation: k-fold cross validation (k=3)
Given a data set and a method to be evaluated:
1. Randomly split the data set into 3 subsets of equal size.
2. For each subset in turn, run the method on the other 2 subsets as training data to find knowledge, then test on the remaining subset as testing data to evaluate the accuracy.
3. Average the accuracies as the final evaluation.
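The same procedure as a sketch, written for general k (same assumed fit interface as before):

```python
import numpy as np

def cross_val_accuracy(X, y, fit, k=3, seed=0):
    """k-fold cross validation: split into k equal-size folds, train on
    k - 1 folds, test on the held-out fold, average the accuracies."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    accs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        accs.append(np.mean(model(X[test]) == y[test]))
    return float(np.mean(accs))  # final evaluation: mean fold accuracy
```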
Outline of the presentation
• Objectives, Prerequisite and Content
• Brief Introduction to Lectures
• Discussion and Conclusion

This presentation summarizes the content and organization of the lectures in the module "Knowledge Discovery and Data Mining".