Data Mining by Tracy Juang

Data Mining
By Fu-Chun (Tracy) Juang
What is Data Mining?
► The process of analyzing LARGE databases to find useful patterns.
► Attempts to discover rules and patterns from data.
► Similar to knowledge discovery (in artificial intelligence) or statistical analysis.
► Hence also called knowledge discovery in databases.
Types of Knowledge Discovered
► Classification
► Association Rules
► Clustering
► Others
-- Sequential Patterns
-- Patterns within Time Series
Classification
► Deals with prediction.
► Works from an existing set of events to create a hierarchy of classes.
► Uses this classification hierarchy to predict which “class” a new item belongs to.
Classification (cont.)
► Example:
A credit-card company classifies its population into 4 ranges of creditworthiness (bad, average, good and excellent) based on the payment history of its existing customers.
The company then finds rules relating creditworthiness to other information about the customers, such as their educational history, age and salary.
It uses these classification rules to determine (predict) the creditworthiness of a new applicant.
Classification : Rules
► Some of the rules look like:
∀ person P, P.degree = masters and P.income > 75,000 => P.credit = excellent
∀ person P, P.degree = bachelors or (P.income ≥ 25,000 and P.income ≤ 75,000) => P.credit = good
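The two rules above can be sketched as a small Python function. This is only an illustration: the function name `classify_credit` and the choice to return `None` when no rule applies are assumptions, not part of the slides.

```python
def classify_credit(degree, income):
    """Apply the two example rules; return None if neither rule fires."""
    # Rule 1: masters degree and income above 75,000 => excellent
    if degree == "masters" and income > 75_000:
        return "excellent"
    # Rule 2: bachelors degree, or income between 25,000 and 75,000 => good
    if degree == "bachelors" or 25_000 <= income <= 75_000:
        return "good"
    # Neither example rule covers this person (assumed fallback).
    return None


print(classify_credit("masters", 90_000))    # excellent
print(classify_credit("bachelors", 20_000))  # good
print(classify_credit("none", 10_000))       # None
```

A real classifier would learn many such rules from the payment history of existing customers rather than having them written by hand.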
Classification : Decision-Tree
► A popular technique for classification.
► Each leaf node of the tree represents a class (e.g. good credit or bad credit).
► Each internal node has a function associated with it that determines which child to go to for the new item (e.g. a test on marital status or salary range).
► To place a new item in a class, we traverse the decision tree from the root until we reach a leaf node.
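The traversal described above might look like the following sketch. The `Node` class, the example tree, and the 50,000 salary threshold are all hypothetical and chosen only to show leaf nodes holding classes and internal nodes holding test functions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Node:
    label: Optional[str] = None                    # set only on leaf nodes (the class)
    test: Optional[Callable[[dict], str]] = None   # set only on internal nodes
    children: Optional[Dict[str, "Node"]] = None   # test outcome -> child node


def classify(node, item):
    # Walk down from the root until a leaf node is reached.
    while node.label is None:
        node = node.children[node.test(item)]
    return node.label


# Hypothetical tree: split on marital status, then on a salary threshold.
tree = Node(
    test=lambda p: "married" if p["married"] else "single",
    children={
        "married": Node(label="good"),
        "single": Node(
            test=lambda p: "high" if p["salary"] >= 50_000 else "low",
            children={"high": Node(label="good"), "low": Node(label="bad")},
        ),
    },
)

print(classify(tree, {"married": False, "salary": 30_000}))  # bad
```

The sketch only shows how a finished tree is used; building the tree from training data is a separate step not covered here.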
Decision-Tree
Classification : Regression
► A special application of classification rules.
► Regression deals with the prediction of a value, rather than a class.
► e.g. Given a series of test results for a patient, use a regression rule to predict that patient's probability of survival.
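To make "predicting a value rather than a class" concrete, here is a minimal sketch using ordinary least-squares line fitting. The slides do not name a regression method, so this technique is an assumption, and the data points are made up for illustration (they are not patient data).

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up observations lying on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
predicted = slope * 5.0 + intercept   # predict a numeric value for a new x
print(predicted)
```

A straight line is not bounded to the range 0 to 1, so predicting a probability, as in the survival example, would in practice call for a model designed for that purpose.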
Association Rules
► Retail shops are often interested in associations between the different items that people buy.
► X => Y: if a customer buys X, they are likely to buy Y.
► e.g. A female retail shopper who buys a handbag is likely to buy shoes.
association rule: Handbag => Shoes
► e.g. A person who bought the book Database System Concepts is likely to buy Operating System Concepts.
association rule: DBS Concepts => OS Concepts
Association Rules : Support & Confidence
► Association rules need to have a degree of support and confidence.
► Data miners use the support and confidence of an association rule to determine whether that rule is significant.
Association Rule: Support
► Support is a measure of what fraction of the population satisfies both the LHS and the RHS of the rule.
► That is, how frequently a specific itemset (LHS + RHS) occurs in the database.
► If only 0.001% of all purchases in a store include milk and screwdrivers, then the support of the rule milk => screwdriver is low.
► If 50% of purchases include milk and juice, the support of the rule milk => juice is high.
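Support as defined above can be computed directly from a list of purchases. The sketch below assumes each purchase is represented as a Python set of item names; the `support` function and the four sample purchases are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


# Made-up purchases for illustration.
transactions = [
    {"milk", "juice", "bread"},
    {"milk", "juice"},
    {"bread", "milk"},
    {"screwdriver"},
]

print(support(transactions, {"milk", "juice"}))        # 0.5
print(support(transactions, {"milk", "screwdriver"}))  # 0.0
```

Here milk and juice appear together in 2 of 4 purchases, so the rule milk => juice has 50% support, while milk => screwdriver has none.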
Association Rule: Confidence
► Confidence is a measure of how often the RHS (consequent) is true when the LHS (antecedent) is true.
► e.g. the rule bread => milk has a confidence of 80% if 80% of the purchases that include bread also include milk.
► A rule with low confidence is not meaningful.
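Confidence can be computed the same way, by counting only the purchases that contain the LHS. In this sketch the `confidence` function and the sample purchases are hypothetical, and returning 0.0 when no purchase contains the LHS is an assumed convention (the ratio is undefined in that case).

```python
def confidence(transactions, lhs, rhs):
    """Of the transactions containing lhs, the fraction that also contain rhs."""
    lhs, rhs = set(lhs), set(rhs)
    with_lhs = [t for t in transactions if lhs <= set(t)]
    if not with_lhs:
        return 0.0  # assumed convention: no purchase contains the LHS
    both = sum(1 for t in with_lhs if rhs <= set(t))
    return both / len(with_lhs)


# Made-up purchases: 5 contain bread, and 4 of those also contain milk.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "juice"},
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread"},
    {"milk"},
]

print(confidence(transactions, {"bread"}, {"milk"}))  # 0.8
```

Note that confidence is not symmetric: in this data milk => bread has the same value only because 4 of the 5 milk purchases also contain bread.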
Clustering
► Clustering groups similar points together in a single set.
► In business: groups of customers who have similar buying patterns.
► In medicine: groups of patients who show similar reactions to prescribed drugs.
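One simple way to group similar points is k-means, sketched below for one-dimensional data. The slides do not name a clustering algorithm, so k-means is an assumption here, as are the function name `kmeans_1d`, the naive "first k points" initialization, and the made-up spending figures.

```python
def kmeans_1d(points, k, iterations=10):
    """Group numbers into k clusters by repeatedly assigning each to its nearest center."""
    centers = list(points[:k])  # naive initialization: the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign the point to the cluster whose center is closest.
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return clusters


# Made-up monthly spending for six customers.
spend = [10, 12, 11, 95, 100, 98]
groups = kmeans_1d(spend, k=2)
print(groups)
```

With this data the low spenders and the high spenders end up in separate groups, which is the kind of "similar buying pattern" grouping the business example describes.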
References
► A. Silberschatz, H.F. Korth, S. Sudarshan: Database System Concepts, 5th ed., McGraw-Hill, 2006
► R. Elmasri, S.B. Navathe: Fundamentals of Database Systems, 4th ed., Addison-Wesley, 2003