Transcript Cecilia157B
DATA MINING
By
Cecilia Parng
CS 157B
Contents
Definition of Data Mining
– Knowledge Discovery in Databases
Classification
– Decision-Tree
Association Rules
– Support
– Confidence
Clustering
Definition of Data Mining
Data Mining: A class of database applications
that look for hidden patterns in a group of
data that can be used to predict future
behavior. For example, data mining software
can help retail companies find customers
with common interests.
Data mining is also popular in scientific
and mathematical fields.
Definition of Data Mining (cont.)
Data Mining, also known as Knowledge Discovery in Databases (KDD)
– The knowledge discovery process includes six phases:
· data selection
· data cleansing
· enrichment
· data transformation or encoding
· data mining
· reporting and displaying of the discovered information
Classification (Decision-Tree)
Classification is the process of learning a
model that describes different classes of
data. The classes are predetermined.
– The decision-tree classifier is a widely used
technique for classification.
Decision-Tree Classifier
A decision tree takes as input an object or
situation described by a set of properties,
and outputs a yes/no decision. Decision
trees therefore represent Boolean functions.
How to build a Decision-Tree
– A decision tree is constructed by looking for
regularities in data.
Data → Decision Tree → Decision Rule (allows us to make predictions on unseen data)
– For example: someone who applies for a
credit card may be classified as a “poor risk” or a
“good risk.”
Example Decision Tree for Credit Card Application
married
  yes: salary
    < 20k: Poor risk
    >= 20k and < 50k: Fair risk
    >= 50k: Good risk
  no: acct balance
    < 5k: Poor risk
    >= 5k: age
      < 25: Fair risk
      >= 25: Good risk
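The example tree can be read as a chain of nested tests. The sketch below is one possible encoding, assuming the branches read as follows: married applicants are split on salary (under 20k poor, 20k up to 50k fair, 50k and above good), and unmarried applicants are split on account balance (under 5k poor) and then on age (under 25 fair, otherwise good). The function name and parameter names are hypothetical.

```python
def classify_credit_risk(married, salary, acct_balance, age):
    """Walk the example tree and return 'poor risk', 'fair risk' or 'good risk'.

    The branch layout is an assumption about how the slide's tree reads.
    """
    if married:
        # Married applicants are split on salary.
        if salary < 20_000:
            return "poor risk"
        if salary < 50_000:
            return "fair risk"
        return "good risk"
    # Unmarried applicants are split on account balance, then on age.
    if acct_balance < 5_000:
        return "poor risk"
    if age < 25:
        return "fair risk"
    return "good risk"


print(classify_credit_risk(married=True, salary=30_000, acct_balance=0, age=40))
print(classify_credit_risk(married=False, salary=0, acct_balance=8_000, age=22))
```

Each path from the root to a leaf corresponds to one decision rule, for example "married and salary below 20k implies poor risk".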
Association Rules
An association rule must have an
associated population:
– The population consists of a set of instances
Rules are used to discover elements that commonly
occur together within a given data set.
Rules have an associated support, as well
as an associated confidence.
Association Rules & Frequent Items
Association rule algorithms typically only identify patterns that
occur in the original form throughout the database. In databases that
contain many small variations in the data, an association rule mining
algorithm may therefore overlook potentially important discoveries.
Customer   Items
1          orange juice, soda
2          milk, orange juice, window cleaner
3          orange juice, detergent
4          orange juice, detergent, soda
5          window cleaner, soda
How does association rule analysis work?
The co-occurrence table contains some simple
patterns:
· OJ and soda are more likely to be purchased together than any other two items.
· Detergent is never purchased with window cleaner or milk.
· Milk is never purchased with soda or detergent.
Items       OJ   Window Cleaner   Milk   Soda   Detergent
OJ          4    1                1      2      1
Cleaner     1    2                1      1      0
Milk        1    1                1      0      0
Soda        2    1                0      3      1
Detergent   1    0                0      1      2
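A co-occurrence table of this kind can be derived by counting, for every basket, each pair of items it contains. The sketch below is a minimal illustration, assuming the five customer baskets listed earlier are the full data set; the names `TRANSACTIONS`, `co_occurrence` and `together` are hypothetical. Run on the baskets as listed, it gives 2 for the OJ/detergent pair (customers 3 and 4), where the table shows 1.

```python
from collections import Counter
from itertools import combinations_with_replacement

# The five baskets from the customer table (OJ = orange juice).
TRANSACTIONS = [
    {"OJ", "soda"},
    {"milk", "OJ", "window cleaner"},
    {"OJ", "detergent"},
    {"OJ", "detergent", "soda"},
    {"window cleaner", "soda"},
]


def co_occurrence(transactions):
    """Count how many baskets contain each pair of items.

    A pair of an item with itself gives the number of baskets
    containing that item (the diagonal of the table).
    """
    counts = Counter()
    for basket in transactions:
        for a, b in combinations_with_replacement(sorted(basket), 2):
            counts[frozenset((a, b))] += 1
    return counts


counts = co_occurrence(TRANSACTIONS)


def together(a, b):
    """Number of baskets containing both a and b."""
    return counts[frozenset((a, b))]


print(together("OJ", "soda"))
print(together("detergent", "window cleaner"))
```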
Association Support
The Support:
– These simple observations are examples of associations
and may suggest a formal rule like: “If a customer
purchases soda, then the customer also purchases orange juice”.
For now, we set aside how such a rule is found automatically. In the data, two of
the five transactions include both soda and orange juice.
These two transactions support the rule. Another way of
expressing this is as a percentage: the support for the rule
is two out of five, or 40 percent.
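The support calculation can be sketched as a simple count. This is a minimal illustration, assuming the five customer baskets listed earlier are the full data set; the names `TRANSACTIONS` and `support` are hypothetical.

```python
# The five baskets from the customer table (OJ = orange juice).
TRANSACTIONS = [
    {"OJ", "soda"},
    {"milk", "OJ", "window cleaner"},
    {"OJ", "detergent"},
    {"OJ", "detergent", "soda"},
    {"window cleaner", "soda"},
]


def support(transactions, items):
    """Fraction of all transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for basket in transactions if items <= basket)
    return hits / len(transactions)


print(support(TRANSACTIONS, {"soda", "OJ"}))  # 0.4, i.e. two out of five
```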
Association Confidence
The Confidence:
Of the three transactions that contain soda, two also contain
orange juice, so the rule “if soda, then orange juice” has a
confidence of two out of three, or about 67 percent. We are less
confident about the inverse rule, “if orange juice, then soda”,
because of the four transactions with orange juice, only two also
have soda. Its confidence, then, is just 50 percent. More formally, confidence
is the ratio of the number of transactions supporting the rule
to the number of transactions where the conditional part of the
rule holds. Another way of saying this is that confidence is the
ratio of the number of transactions with all the items to the
number of transactions with just the “if” items.
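The formal definition translates directly into a ratio of two counts. The sketch below is a minimal illustration, assuming the five customer baskets listed earlier are the full data set; the names `TRANSACTIONS` and `confidence` are hypothetical. Because customer 5's basket contains soda without orange juice, the value computed for "if soda, then orange juice" on these baskets is two out of three.

```python
# The five baskets from the customer table (OJ = orange juice).
TRANSACTIONS = [
    {"OJ", "soda"},
    {"milk", "OJ", "window cleaner"},
    {"OJ", "detergent"},
    {"OJ", "detergent", "soda"},
    {"window cleaner", "soda"},
]


def confidence(transactions, if_items, then_items):
    """Baskets with all items divided by baskets with just the 'if' items."""
    if_items, then_items = set(if_items), set(then_items)
    with_if = [basket for basket in transactions if if_items <= basket]
    if not with_if:
        raise ValueError("no transaction contains the 'if' items")
    with_all = [basket for basket in with_if if then_items <= basket]
    return len(with_all) / len(with_if)


print(confidence(TRANSACTIONS, {"OJ"}, {"soda"}))   # 0.5, two of four OJ baskets
print(confidence(TRANSACTIONS, {"soda"}, {"OJ"}))   # two of three soda baskets
```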
Clustering
The goal of clustering is to place records into
groups, such that records in a group are
similar to each other and dissimilar to
records in other groups. The groups are
usually disjoint.
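The slides do not name a clustering algorithm. As one possible illustration of placing similar records into disjoint groups, the sketch below uses a one-dimensional k-means loop: each value joins the group of its nearest center, then each center moves to the mean of its group. The function name, the starting centers and the salary figures are all hypothetical.

```python
def k_means_1d(values, centers, iterations=10):
    """Assign each value to its nearest center, then move each
    center to the mean of its group; repeat a fixed number of times."""
    groups = []
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Keep the old center if a group ends up empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return groups, centers


# Hypothetical salaries in thousands; not taken from the slides.
salaries = [18, 21, 19, 52, 55, 49]
groups, centers = k_means_1d(salaries, centers=[0.0, 100.0])
print(groups)
print(centers)
```

Every record lands in exactly one group, which matches the statement that the groups are usually disjoint.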
The End