Data Mining
Tri Nguyen
Agenda
Data Mining As Part of KDD
Decision Tree
Association Rules
Clustering
Amazon Data Mining Examples
Data Mining and KDD
Putting the results
in practical use
What is Data Mining?
“the automated extraction of hidden
predictive information from large databases”
Algorithms produce patterns, rules
Predict future trends/behavior
Used to make business decisions
Classification
Items belong to classes
Given past items’ classification, predict class
of new item
Example: Issuing credit cards
Use information: income, educational background,
age, current debts
Credit worthiness: Bad, good, excellent
Decision Tree Classifiers
Internal Node has predicate
Leaf node is class
To classify instance
Start at root node
Traverse tree until reaching a leaf node
At each internal node, make a decision
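A minimal sketch of the traversal described above, assuming binary predicates at each internal node. The Node class, the attribute names (income, debts) and the thresholds are hypothetical and chosen only to echo the credit card example; they are not taken from the slides' tree.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    # A leaf node carries a class label; an internal node carries a predicate.
    label: Optional[str] = None
    predicate: Optional[Callable[[dict], bool]] = None
    if_true: Optional["Node"] = None
    if_false: Optional["Node"] = None


def classify(node: Node, instance: dict) -> str:
    # Start at the root and follow predicate outcomes until a leaf is reached.
    while node.label is None:
        node = node.if_true if node.predicate(instance) else node.if_false
    return node.label


# Hypothetical credit-risk tree (thresholds are invented for illustration).
tree = Node(
    predicate=lambda x: x["income"] >= 50_000,
    if_true=Node(
        predicate=lambda x: x["debts"] < 10_000,
        if_true=Node(label="excellent"),
        if_false=Node(label="good"),
    ),
    if_false=Node(label="bad"),
)

result = classify(tree, {"income": 60_000, "debts": 5_000})
```

Each call makes one decision per internal node, so the cost of classifying an instance is bounded by the depth of the tree.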
Credit Risk Decision Tree
Decision Tree Construction
Some Definitions
Purity: the more instances in each leaf that
belong to only one class, the higher the purity
Best Split: split giving the maximum
information gain ratio (info gain/info
content)
Choose attribute and condition resulting in
maximum purity
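The gain ratio in the definitions above can be sketched as follows. This assumes that "info content" means the entropy of the branch sizes produced by the split (often called split information); the function names are illustrative only.

```python
import math
from collections import Counter


def entropy(labels):
    # Entropy (in bits) of the class distribution in a list of labels.
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gain_ratio(labels, branches):
    # labels: class labels before the split.
    # branches: one list of class labels per branch after the split.
    total = len(labels)
    remainder = sum(len(b) / total * entropy(b) for b in branches)
    info_gain = entropy(labels) - remainder
    # Assumed meaning of "info content": entropy of the branch sizes.
    info_content = -sum((len(b) / total) * math.log2(len(b) / total)
                        for b in branches if b)
    return info_gain / info_content if info_content > 0 else 0.0


labels = ["good", "good", "bad", "bad"]
pure_split = gain_ratio(labels, [["good", "good"], ["bad", "bad"]])
mixed_split = gain_ratio(labels, [["good", "bad"], ["good", "bad"]])
```

Here the first split yields two pure leaves and scores higher than the second, which leaves both branches as mixed as the original set.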
Association Rules
antecedent → consequent
if → then
beer → diaper (Walmart)
economy bad → higher unemployment
higher unemployment → higher unemployment
benefits cost
Rules associated with population,
support, confidence
Association Rules
Population: instances such as grocery store
purchases
Support
% of population satisfying antecedent and
consequent
Confidence
% consequent true when antecedent true
Association Rules
Population
MS, MSA, MSB, MA, MB, BA
M=Milk, S=Soda, A=Apple, B=Beer
Support (M → S) = 3/6
(MS, MSA, MSB) / (MS, MSA, MSB, MA, MB, BA)
Confidence (M → S) = 3/5
(MS, MSA, MSB) / (MS, MSA, MSB, MA, MB)
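The support and confidence figures for this population can be checked with a short sketch. It reads the rule as milk → soda and treats each purchase as a set of item letters; the function names are illustrative.

```python
# The six purchases from the slide: M=Milk, S=Soda, A=Apple, B=Beer.
baskets = ["MS", "MSA", "MSB", "MA", "MB", "BA"]
transactions = [set(b) for b in baskets]


def support(antecedent, consequent):
    # Fraction of the population containing both antecedent and consequent.
    both = antecedent | consequent
    return sum(both <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent):
    # Of the purchases containing the antecedent, the fraction that
    # also contain the consequent.
    with_antecedent = [t for t in transactions if antecedent <= t]
    return sum(consequent <= t for t in with_antecedent) / len(with_antecedent)


s = support({"M"}, {"S"})      # 3 of 6 purchases
c = confidence({"M"}, {"S"})   # 3 of the 5 purchases containing milk
```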
Clustering
“The process of dividing a dataset into
mutually exclusive groups such that the
members of each group are as "close"
as possible to one another, and different
groups are as "far" as possible from one
another, where distance is measured
with respect to all available variables.”
Clustering
BIRCH Algorithm
points inserted into multidimensional
tree
items guided to leaf nodes "near"
representative internal nodes
nearby points clustered into one leaf
node
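The idea of placing each point with nearby points can be illustrated with a much simplified sketch. This is not BIRCH itself: it omits the multidimensional tree entirely and keeps only a flat list of clusters, each summarized by a point count and a per-dimension sum. The threshold parameter and the sample points are assumptions made for illustration.

```python
import math


def incremental_cluster(points, threshold):
    # Insert points one at a time. A point joins the nearest cluster if that
    # cluster's centroid is within `threshold`; otherwise it starts a new one.
    clusters = []  # each cluster: {"n": point count, "sum": per-dimension sums}
    for p in points:
        best, best_dist = None, math.inf
        for c in clusters:
            centroid = [s / c["n"] for s in c["sum"]]
            d = math.dist(p, centroid)
            if d < best_dist:
                best, best_dist = c, d
        if best is not None and best_dist <= threshold:
            best["n"] += 1
            best["sum"] = [s + x for s, x in zip(best["sum"], p)]
        else:
            clusters.append({"n": 1, "sum": list(p)})
    return clusters


points = [(1.0, 1.0), (1.2, 0.9), (8.0, 8.0), (8.1, 7.9), (0.9, 1.1)]
clusters = incremental_cluster(points, threshold=1.0)
sizes = [c["n"] for c in clusters]
```

Because each cluster is stored as a compact summary rather than a list of points, a new point can be placed without revisiting earlier ones, which is the property that makes a single pass over a large dataset feasible.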
Clustering
Example of Clustering
predict what new movies a person is
interested in
1) a person’s past movie preferences
2) others with similar preferences
3) preferences of those in the pool for new
movies
Clustering
1) cluster people with similar movie
preferences
2) given a new movie goer, find a
cluster of similar movie goers
3) then predict the cluster's new movie
preferences
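Steps 2 and 3 above can be sketched as follows, assuming step 1 has already produced the clusters. The cluster names, the ratings, and the use of Euclidean distance to a cluster's average past ratings are all hypothetical choices made for illustration.

```python
import math
from statistics import mean

# Hypothetical clusters of movie goers. "past" holds ratings (1 to 5) for two
# older movies; "new_movie" holds each member's rating of one new movie.
clusters = {
    "action_fans": [
        {"past": (5, 1), "new_movie": 5},
        {"past": (4, 2), "new_movie": 4},
    ],
    "drama_fans": [
        {"past": (1, 5), "new_movie": 2},
        {"past": (2, 4), "new_movie": 1},
    ],
}


def centroid(members):
    # Average past rating per movie across the cluster's members.
    return tuple(mean(dim) for dim in zip(*(m["past"] for m in members)))


def predict_rating(past_ratings):
    # Step 2: find the cluster whose centroid is closest to the new movie goer.
    nearest = min(
        clusters,
        key=lambda name: math.dist(past_ratings, centroid(clusters[name])),
    )
    # Step 3: predict using that cluster's average rating of the new movie.
    return nearest, mean(m["new_movie"] for m in clusters[nearest])


name, predicted = predict_rating((5, 2))
```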
Amazon Examples
References
http://www.thearling.com/text/dmwhite/dmwhite.htm
http://www.cse.ohio-state.edu/~srini/694Z/part1.ppt
http://www-aig.jpl.nasa.gov/public/kdd95/tutorials/IJCAI95tutorial.html