Data Mining by Farzana Forhad

Data Mining
By
Farzana Forhad
CS 157B
Agenda
Decision Tree and ID3
Rough Set Theory
Clustering
Introduction
• Data mining is a component of a wider
process called knowledge discovery from
databases.
• The basic foundations of data mining:
– decision tree
– association rules
– clustering
– other statistical techniques
Decision Tree
• ID3 (Quinlan 1986) represents concepts
as decision trees.
• A decision tree is a classifier in the form of
a tree structure where each node is either:
– a leaf node, indicating a class of instances
OR
– a decision node, which specifies a test to be
carried out on a single attribute value, with
one branch and a sub-tree for each possible
outcome of the test
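The two node kinds described above can be sketched as a small data structure. This is a minimal illustration, not Quinlan's code: the class names Leaf and Decision, the classify helper, and the example tree are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class Leaf:
    label: str  # class assigned to every instance that reaches this leaf


@dataclass
class Decision:
    attribute: str  # the single attribute tested at this node
    # one branch (and sub-tree) per possible outcome of the test
    branches: Dict[str, Union["Leaf", "Decision"]] = field(default_factory=dict)


def classify(node, instance):
    """Follow the branch matching the instance's attribute value until a leaf."""
    while isinstance(node, Decision):
        node = node.branches[instance[node.attribute]]
    return node.label


# Hypothetical two-level tree, for illustration only.
tree = Decision("outlook", {
    "overcast": Leaf("play"),
    "sunny": Decision("humid", {"yes": Leaf("don't play"), "no": Leaf("play")}),
})
```

Calling classify(tree, {"outlook": "sunny", "humid": "no"}) walks two decision nodes and stops at a leaf.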
Decision Tree
• The set of records available for classification is
divided into two disjoint subsets:
– a training set : used for deriving the classifier
– a test set: used to measure the accuracy of the
classifier
• Attributes whose domain is numerical are
called numerical attributes
• Attributes whose domain is not numerical
are called categorical attributes.
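The split into a training set and a test set might look like the sketch below. The function names, the 30% test fraction, and the fixed seed are assumptions chosen for illustration; a record is assumed to be a (features, label) pair when measuring accuracy.

```python
import random


def split_records(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and cut it into two disjoint lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (training set, test set)


def accuracy(classifier, test_set):
    """Fraction of test records whose predicted class equals the true class."""
    correct = sum(1 for features, label in test_set if classifier(features) == label)
    return correct / len(test_set)
```

The classifier is derived from the first list only; the second list is held back so that the accuracy figure reflects records the classifier has not seen.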
Decision Tree
• A decision tree is a tree with the following
properties:
– An inner node represents an attribute
– An edge represents a test on the attribute of
the parent node
– A leaf represents one of the classes
• Construction of a decision tree
– Based on the training data
– Top-Down strategy
Training Dataset
Test Dataset
Decision Tree
RULE 1: If it is sunny and the humidity is not above 75%, then play.
RULE 2: If it is sunny and the humidity is above 75%, then do not play.
RULE 3: If it is overcast, then play.
RULE 4: If it is rainy and not windy, then play.
RULE 5: If it is rainy and windy, then do not play.
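Read as a program, the five rules are one nested test per path from the root to a leaf. A minimal sketch, assuming humidity is given as a percentage and windy as a boolean; the function name and the string labels are hypothetical.

```python
def play_decision(outlook, humidity, windy):
    """Apply rules 1-5; humidity is a percentage, windy is a bool."""
    if outlook == "sunny":
        # rules 1 and 2: the humidity threshold is 75%
        return "play" if humidity <= 75 else "don't play"
    if outlook == "overcast":
        return "play"  # rule 3
    if outlook == "rainy":
        # rules 4 and 5: wind decides
        return "don't play" if windy else "play"
    raise ValueError(f"unknown outlook: {outlook!r}")
```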
Training Dataset
Decision Tree for Zip Code and Age
Iterative Dichotomizer 3 (ID3)
• Quinlan (1986)
• Each node corresponds to a splitting attribute
– Entropy is used to measure how informative a node is.
– The algorithm uses the criterion of information gain to
determine the goodness of a split.
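Entropy and information gain can be computed as in the sketch below. It assumes the usual definitions (entropy as the negative sum of p * log2(p) over the class proportions, gain as the parent's entropy minus the weighted entropy of the subsets produced by the split) and represents each row as a dict of categorical attribute values; the function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy of the labels minus the weighted entropy after splitting on attribute."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder
```

At each node, ID3 picks the attribute with the largest gain as the splitting attribute, then repeats the process on each resulting subset.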
Iterative Dichotomizer 3 (ID3)
Rough Set Theory
– A useful means of discovering patterns,
rules, and knowledge in data
– A rough set is the approximation of a vague
concept by a pair of precise concepts, called
the lower and upper approximations.
Rough Set Theory
– The lower approximation is a description of the
domain objects which are known with certainty
to belong to the subset of interest.
– The upper approximation is a description of
the objects which possibly belong to the
subset.
– Any subset defined through its lower and
upper approximations is called a rough set, if
the boundary region is not empty.
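The two approximations can be sketched in a few lines. This assumes the standard formulation, in which the universe is partitioned into blocks of objects that the available attributes cannot tell apart; the function name and the example partition are hypothetical.

```python
def approximations(equivalence_classes, target):
    """Return (lower, upper) approximations of target under the given partition."""
    lower, upper = set(), set()
    for block in equivalence_classes:
        if block <= target:   # every object in the block certainly belongs
            lower |= block
        if block & target:    # at least one object in the block belongs
            upper |= block
    return lower, upper


# Hypothetical universe {1..5} split into three indiscernible blocks.
blocks = [{1, 2}, {3, 4}, {5}]
lower, upper = approximations(blocks, {1, 2, 3})
boundary = upper - lower  # non-empty here, so the target subset is rough
```

Here objects 3 and 4 are indistinguishable, yet only 3 is in the subset of interest, so the block {3, 4} ends up in the boundary region.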
Lower and Upper Approximations of a Rough Set
Association Rule Mining
• Basket Analysis
Definition of Association Rules
Mining the Rules
Two Steps of Association Rule Mining
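In the usual formulation of basket analysis, a rule X -> Y is judged by its support (the fraction of transactions containing both X and Y) and its confidence (the support of X and Y together divided by the support of X). Mining then proceeds in two steps: find all frequent itemsets, then generate rules from them. The sketch below assumes those standard definitions and uses brute-force enumeration rather than a specific algorithm such as Apriori; the function names, thresholds, and transactions are hypothetical, and min_support is assumed to be greater than zero.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def mine_rules(transactions, min_support, min_confidence):
    items = sorted(set().union(*transactions))
    # Step 1: enumerate every itemset and keep the frequent ones.
    frequent = [
        frozenset(c)
        for size in range(1, len(items) + 1)
        for c in combinations(items, size)
        if support(frozenset(c), transactions) >= min_support
    ]
    # Step 2: split each frequent itemset into lhs -> rhs and test confidence.
    rules = []
    for itemset in frequent:
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                lhs = frozenset(lhs)
                conf = support(itemset, transactions) / support(lhs, transactions)
                if conf >= min_confidence:
                    rules.append((set(lhs), set(itemset - lhs), conf))
    return rules


# Hypothetical market baskets, for illustration only.
baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]
rules = mine_rules(baskets, min_support=0.5, min_confidence=0.6)
```

Enumerating every itemset is exponential in the number of items, which is why practical miners prune candidates in the first step.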
Clustering
• The process of organizing objects into
groups whose members are similar in
some way
• Statistics, machine learning, and
database researchers have studied data
clustering
• Recent emphasis on large datasets
Different Approaches to Clustering
• Two main approaches to clustering:
- partitioning clustering
- hierarchical clustering
• Clustering algorithms differ among
themselves in the following ways:
– in their ability to handle different types of
attributes (numeric and categorical)
– in accuracy of clustering
– in their ability to handle disk-resident data
Problem Statement
• N objects to be grouped in k clusters
• Number of different possibilities:
• The objective is to find a grouping such
that the distances between objects in a
group are minimal
• Several algorithms find near-optimal
solutions
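To see why exhaustive search is impractical, the number of possible groupings can be counted directly. The sketch below assumes that clusters are unlabeled and non-empty, in which case the count is the Stirling number of the second kind, S(N, k) = (1/k!) * sum over i = 0..k of (-1)^i * C(k, i) * (k - i)^N. The function name is hypothetical and math.comb needs Python 3.8 or later.

```python
from math import comb, factorial


def count_groupings(n, k):
    """Ways to split n distinct objects into k non-empty, unlabeled clusters."""
    total = sum((-1) ** i * comb(k, i) * (k - i) ** n for i in range(k + 1))
    return total // factorial(k)  # the sum is always divisible by k!
```

For N = 9 and k = 2 this gives 255 groupings, and the count grows roughly like k^N / k! as N increases, so only heuristic search is feasible for realistic datasets.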
k-Means Algorithm
1. Randomly select k points to be the starting points
for the centroids of the k clusters.
2. Assign each object to the centroid closest to the
object, forming k exclusive clusters of examples.
3. Calculate new centroids of the clusters. Take the
average of all the attribute values of the objects
belonging to the same cluster.
4. Check if the cluster centroids have changed their
coordinates. If yes, repeat from Step 2.
5. If no, cluster detection is finished, and all objects
have their cluster memberships defined.
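The five steps can be sketched as follows. This is a minimal illustration under stated assumptions: points are tuples of numbers, distance is squared Euclidean, an empty cluster keeps its old centroid, and the starting centroids are passed in by the caller (for the random choice in Step 1, random.sample(points, k) would do). The function name, the iteration cap, and the data values are hypothetical.

```python
def k_means(points, initial_centroids, max_iterations=100):
    """Return (centroids, clusters) once the centroids stop moving."""
    centroids = [tuple(c) for c in initial_centroids]  # Step 1 is done by the caller
    clusters = []
    for _ in range(max_iterations):
        # Step 2: assign each object to its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Step 3: new centroid = attribute-wise average of the cluster's objects.
        new_centroids = [
            tuple(sum(col) / len(cluster) for col in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
        # Steps 4 and 5: stop when no centroid changed, otherwise repeat.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical one-dimensional data with N = 9 and k = 2,
# starting from the first two objects as centroids.
data = [(2,), (4,), (10,), (12,), (3,), (20,), (30,), (11,), (25,)]
centroids, clusters = k_means(data, initial_centroids=data[:2])
```

The result depends on the starting centroids, so the algorithm is commonly run from several starting points and the best grouping kept.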
Example
• One-dimensional database with N = 9
• Objects labeled z1…z9
• Let k = 2
• Let us start with z1 and z2 as
the initial centroids
Table: One-dimensional database
Example
Table: New cluster assignments
Example
Table: Reassignment of objects to two clusters
Questions?
Thank You
