Transcript Data Mining

Data Mining
Data Mining (DM) / Knowledge Discovery in Databases (KDD):
“The nontrivial extraction of implicit, previously unknown, and potentially useful information from data” [Frawley et al., 1992]
Need for Data Mining
• Increased ability to generate data
  - Remote sensors and satellites
  - Bar codes for commercial products
  - Computerization of businesses
Need for Data Mining
• Increased ability to store data
  - Media: bigger magnetic disks, CD-ROMs
  - Better database management systems
  - Data warehousing technology
[Chart: volume of data growing steadily from 1970 to 2000]
Need for Data Mining
• Examples
  - Wal-Mart records 20,000,000 transactions/day
  - Healthcare transactions yield multi-GB databases
  - Mobil Oil exploration storing 100 terabytes
  - Human Genome Project, multi-GBs and increasing
  - Astronomical object catalogs, terabytes of images
  - NASA EOS, 1 terabyte/day
Something for Everyone
• Bell Atlantic
• MCI
• Land’s End
• Visa
• Bank of New York
• FedEx
Market Analysis and Management
• Customer profiling
  - Data mining can tell you what types of customers buy what products (clustering or classification) or what products are often bought together (association rules).
• Identifying customer requirements
  - Discover relationships between personal characteristics and probability of purchase
  - Discover correlations between purchases
Fraud Detection and Management
• Applications:
  - Widely used in health care, retail, credit card services, telecommunications, etc.
• Approach:
  - Use historical data to build models of fraudulent behavior and use data mining to help identify similar instances.
• Examples:
  - Auto insurance
  - Money laundering
  - Medical insurance
IBM Advertisement
DM step in KDD Process
[Figure: data mining draws on statistics, databases, AI, and hardware]
Mining Association Rules



Association rule mining:
• Finding associations or correlations among a set of items or objects in transaction databases, relational databases, and data warehouses.
Applications:
• Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, etc.
Examples:
• Rule form: “Body → Head [support, confidence]”.
• Buys=Diapers → Buys=Beer [0.5%, 60%]
• Major=CS ∧ Class=DataMining → Grade=A [1%, 75%]
Rule Measures: Support and Confidence
[Venn diagram: customers who buy beer, customers who buy diapers, customers who buy both]
Find all the rules X & Y → Z with minimum confidence and support:
• support, s: probability that a transaction contains {X, Y, Z}
• confidence, c: conditional probability that a transaction having {X, Y} also contains Z

Transaction ID | Items Bought
2000           | A, B, C
1000           | A, C
4000           | A, D
5000           | B, E, F

For minimum support 50% and minimum confidence 50%:
• A → C (50%, 66.6%)
• C → A (50%, 100%)
Association Rule
Given
• a set of items I = {i1, i2, ..., im}
• a set of transactions D
• each transaction T in D is a set of items
An association rule is an implication X → Y where
• X and Y are itemsets: X ⊆ I, Y ⊆ I, X ∩ Y = ∅
• the rule meets a minimum confidence c (c% of the transactions in D that contain X also contain Y)
• a minimum support s is also met
c = |{T ∈ D : X ∪ Y ⊆ T}| / |{T ∈ D : X ⊆ T}|
s = |{T ∈ D : X ∪ Y ⊆ T}| / |D|
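A minimal sketch of these two measures in Python, using the four example transactions from the earlier slide (the data layout and helper names are illustrative, not part of the original material):

# Support and confidence of a rule X -> Y over a small in-memory transaction DB.
transactions = [
    {"A", "B", "C"},   # TID 2000
    {"A", "C"},        # TID 1000
    {"A", "D"},        # TID 4000
    {"B", "E", "F"},   # TID 5000
]

def count_containing(itemset, db):
    # Number of transactions that contain every item in `itemset`
    return sum(1 for t in db if itemset <= t)

def support(itemset, db):
    return count_containing(itemset, db) / len(db)

def confidence(x, y, db):
    # Fraction of transactions containing X that also contain Y
    return count_containing(x | y, db) / count_containing(x, db)

print(support({"A", "C"}, transactions))        # 0.5   -> 50%
print(confidence({"A"}, {"C"}, transactions))   # 0.666 -> 66.6%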
Mining Strong Association Rules in Transaction DBs
Measurement of rule strength in a transaction DB:
A → B [support, confidence]
support = Prob(A ∪ B) = #_of_trans_containing_all_the_items_in_A∪B / total_#_of_trans
confidence = Prob(B | A) = #_of_trans_that_contain_both_A_and_B / #_of_trans_containing_A
We are often interested in only strong associations, i.e. support ≥ min_sup and confidence ≥ min_conf.
Examples:
• milk → bread [5%, 60%]
• tire ∧ auto_accessories → auto_services [2%, 80%]
Methods for Mining Associations
• Apriori
• Partition technique
• Sampling technique
• Anti-Skew
• Multi-level or generalized association
• Constraint-based or query-based association
Apriori (Levelwise)
• Scan the database multiple times
• For the ith scan, find all large itemsets of size i with min support
• Use the large itemsets from scan i as input to scan i+1
• Create candidates: subsets of size i+1 that contain only large itemsets as subsets
• Notation: large k-itemset, Lk; set of candidate large itemsets of size k, Ck
• Note: if {A,B} is not a large itemset, then no superset of it can be either.
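A compact level-wise sketch of this idea, assuming transactions are represented as Python sets (the function and variable names are illustrative, not the algorithm's original pseudocode):

from itertools import combinations

def apriori(transactions, min_support):
    # Level-wise search: large k-itemsets L_k feed candidate generation C_{k+1}.
    n = len(transactions)

    def support_count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # L1: large 1-itemsets
    items = {i for t in transactions for i in t}
    large = {frozenset([i]) for i in items
             if support_count(frozenset([i])) / n >= min_support}
    all_large = {}
    k = 1
    while large:
        all_large.update({s: support_count(s) for s in large})
        # Candidate generation: join L_k with itself, keep the (k+1)-sized unions
        candidates = {a | b for a in large for b in large if len(a | b) == k + 1}
        # Prune: every k-subset of a candidate must itself be large
        candidates = {c for c in candidates
                      if all(frozenset(s) in large for s in combinations(c, k))}
        # Database scan: keep candidates that meet the minimum support
        large = {c for c in candidates if support_count(c) / n >= min_support}
        k += 1
    return all_large   # maps each large itemset to its absolute support count

Run on the four example transactions of the next slide with min_support = 0.5, this returns {A}, {B}, {C}, and {A, C}, matching the frequent-itemset table shown there.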
Mining Association Rules -- Example
Transaction ID | Items Bought
2000           | A, B, C
1000           | A, C
4000           | A, D
5000           | B, E, F

Min. support 50%
Min. confidence 50%

Frequent Itemset | Support
{A}              | 75%
{B}              | 50%
{C}              | 50%
{A,C}            | 50%

For rule A → C:
• support = support({A, C}) = 50%
• confidence = support({A, C}) / support({A}) = 66.6%
Apriori principle: any subset of a frequent itemset must be frequent.
Transaction ID | Items Bought
2000           | A, B, C
1000           | A, C
4000           | A, D
5000           | B, E, F

Minsup = 0.25, Minconf = 0.5
L1 = {(A, 3), (B, 2), (C, 2), (D, 1), (E, 1), (F, 1)}
C2 = {(A,B), (A,C), (A,D), (A,E), (A,F), (B,C), ..., (E,F)}
L2 = {(A,B, 1), (A,C, 2), (A,D, 1), (B,C, 1), (B,E, 1), (B,F, 1), (E,F, 1)}
C3 = {(A,B,C), (A,B,D), (A,C,D), (B,C,E), (B,C,F), (B,E,F)}
L3 = {(A,B,C, 1), (B,E,F, 1)}
C4 = {}, L4 = {}, end of program
Possible Rules
A=>B (c=.33,s=1), B=>A (c=.5,s=1), A=>C (c=.67,s=2), C=>A (c=1.0,s=2)
A=>D (c=.33,s=1), D=>A (c=1.0,s=1), B=>C (c=.5,s=1), C=>B (c=.5,s=1),
B=>E (c=.5,s=1), E=>B(c=1,s=1), B=>F (c=.5,s=1), F=>B(c=1,s=1)
A=>B&C (c=.33,s=1), B=>A&C (c=.5,s=1), C=>A&B (c=.5,s=1),
A&B=>C(c=1,s=1), A&C=>B (c=.5,s=1), B&C=>A (c=1,s=1), B=>E&F (c=.5,s=1),
E=>B&F(c=1,s=1), F=>B&E (c=1,s=1), B&E=>F (c=1,s=1), B&F=>E(c=1,s=1),
E&F=>B (c=1,s=1)
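The rule listing above can be reproduced mechanically: for each frequent itemset of size 2 or more, split it into every non-empty antecedent/consequent pair and keep the rules that meet minimum confidence. A sketch, under the assumption that `frequent` maps each frequent itemset to its absolute support count (as produced by an Apriori-style pass; the names are illustrative):

from itertools import combinations

def generate_rules(frequent, min_conf):
    # `frequent` maps each frequent itemset (frozenset) to its absolute support count.
    rules = []
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                consequent = itemset - antecedent
                conf = count / frequent[antecedent]   # conf(X -> Y) = supp(X u Y) / supp(X)
                if conf >= min_conf:
                    rules.append((antecedent, consequent, conf, count))
    return rules

For instance, {A, C} has count 2 and {A} has count 3, so A => C is reported with confidence 2/3 ≈ 0.67 and support count 2, matching the listing above. By the Apriori principle, every antecedent looked up in `frequent` is itself frequent, so the lookup never fails.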
Example
Partitioning
• Requires only two passes through the external database
• Divide the database into n partitions, each of which fits in main memory
• Scan 1: process one partition in memory at a time, finding local large itemsets
• Candidate large itemsets are the union of all local large itemsets (a superset of the actual large itemsets; it contains false positives)
• Scan 2: calculate support and determine the actual large itemsets (see the sketch after this list)
• If the data is skewed, partitioning may not work well: the chance that a local large itemset is a global large itemset may be small.
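A two-pass sketch of the partition approach, reusing an Apriori-style local miner like the one sketched earlier (the `apriori` helper and the simple contiguous partition boundaries are illustrative assumptions):

def partition_mine(transactions, num_partitions, min_support):
    # Scan 1: mine each partition locally; Scan 2: count candidates globally.
    n = len(transactions)
    size = -(-n // num_partitions)            # ceiling division
    partitions = [transactions[i:i + size] for i in range(0, n, size)]

    # Scan 1: the union of all local large itemsets is the global candidate set
    # (a superset of the true answer, i.e. it may contain false positives).
    candidates = set()
    for part in partitions:
        candidates |= set(apriori(part, min_support))   # local large itemsets

    # Scan 2: compute global support for every candidate and keep the large ones
    def global_support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    return {c: global_support(c) for c in candidates
            if global_support(c) >= min_support}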
Partitioning
Will any large itemsets be missed?
• If l is not a local large itemset in any partition i, then
  t1(l)/|t1| < MS and t2(l)/|t2| < MS and ... and tn(l)/|tn| < MS,
  thus
  t1(l) + t2(l) + ... + tn(l) < MS * (|t1| + |t2| + ... + |tn|),
  so l cannot be large in the full database; every global large itemset must be locally large in at least one partition, and none is missed.
How do run times compare?
[Bar charts: run times of Apriori (A) vs. Partition (P) on datasets D1-D4 at minimum supports 0.5, 0.33, and 0.25]
PlayTennis Training Examples
Day | Outlook  | Temperature | Humidity | Wind   | PlayTennis
D1  | Sunny    | Hot         | High     | Weak   | No
D2  | Sunny    | Hot         | High     | Strong | No
D3  | Overcast | Hot         | High     | Weak   | Yes
D4  | Rain     | Mild        | High     | Weak   | Yes
D5  | Rain     | Cool        | Normal   | Weak   | Yes
D6  | Rain     | Cool        | Normal   | Strong | No
D7  | Overcast | Cool        | Normal   | Strong | Yes
D8  | Sunny    | Mild        | High     | Weak   | No
D9  | Sunny    | Cool        | Normal   | Weak   | Yes
D10 | Rain     | Mild        | Normal   | Weak   | Yes
D11 | Sunny    | Mild        | Normal   | Strong | Yes
D12 | Overcast | Mild        | High     | Strong | Yes
D13 | Overcast | Hot         | Normal   | Weak   | Yes
D14 | Rain     | Mild        | High     | Strong | No
Association Rule Visualization: DBMiner
DBMiner Association Rule Graph
Clementine (UK, bought by SPSS)
The Web Node shows the strength of associations in the data, i.e. how often field values coincide.
Multi-Level Association
• A descendant of an infrequent itemset cannot be frequent
• A transaction database can be encoded by dimensions and levels
[Item hierarchy: food → {milk, bread}; milk → {2%, skim}; bread → {wheat, white}; brands such as Fraser and Sunset at the lowest level]

TID | Items
T1  | {111, 121, 211, 221}
T2  | {111, 211, 222, 323}
T3  | {112, 122, 221, 411}
T4  | {111, 121}
T5  | {111, 122, 211, 221, 413}
Encoding Hierarchical Information in Transaction Database
• A taxonomy for the relevant data items
[Taxonomy: food → {milk, bread, ...}; milk → {2%, ..., chocolate}; brands include Old Mills, Wonder, Dairyland, Foremost]
• Conversion of bar_code into generalized_item_id (GID)

GID | bar_code_set                 | category    | content_spec | brand
112 | {17325, 34823, 91265}        | milk        | 2%           | Foremost
141 | {29394, 77454, 89157}        | milk        | skim         | Dairy_land
171 | {73295, 99184, 79520}        | milk        | chocolate    | Dairy_land
212 | {88452, 35672, 98427, 31205} | bread       | wheat        | Wonder
... | {..., ...}                   | ...         | ...          | ...
711 | {32514, 78152}               | fruit_juice | orange       | Minute_maid
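A toy sketch of that bar_code → GID conversion, assuming a lookup table like the one above (the dictionary contents and function name are illustrative):

# Illustrative lookup table: raw bar codes -> generalized item ids (GIDs),
# whose digits encode taxonomy levels (category / content / brand) as in the table above.
barcode_to_gid = {
    17325: 112, 34823: 112, 91265: 112,                 # milk / 2% / Foremost
    29394: 141, 77454: 141, 89157: 141,                 # milk / skim / Dairy_land
    88452: 212, 35672: 212, 98427: 212, 31205: 212,     # bread / wheat / Wonder
}

def encode_transaction(bar_codes):
    # Replace each bar code by its GID; items that generalize to the same GID collapse.
    return sorted({barcode_to_gid[b] for b in bar_codes if b in barcode_to_gid})

print(encode_transaction([17325, 29394, 35672, 88452]))   # [112, 141, 212]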
Mining Surprising Temporal Patterns
Chakrabarti et al.
1990: Find prevalent rules that hold over large fractions of data
• Useful for promotions and store arrangement
• Intensively researched
[“Milk and cereal sell together!”]
Prevalent != Interesting
• Analysts already know about prevalent rules
• Interesting rules are those that deviate from prior expectation
• Mining’s payoff is in finding surprising phenomena
[1995, 1998: “Milk and cereal sell together!” ... Zzzz...]
Association Rules - Strengths & Weaknesses
• Strengths
  - Understandable and easy to use
  - Useful
• Weaknesses
  - Brute force methods can be expensive (memory and time)
    Apriori is O(CD), where
      C = sum of sizes of candidates (2^n possible, n = #items)
      D = size of the database
  - Association does not necessarily imply correlation
  - Validation?
  - Maintenance?
Clustering
• Group similar items together
  - Example: sorting laundry
• Similar items may have important attributes / functionality in common
  - Group customers together with similar interests and spending patterns
• Form of unsupervised learning
• Cluster objects into classes using the rule: maximize intraclass similarity, minimize interclass similarity
Clustering Techniques
• Partition
  - Enumerate all partitions
  - Score by some criteria
  - K-means
• Hierarchical
• Model based
  - Hypothesize a model for each cluster
  - Find the model that best fits the data
  - AutoClass, Cobweb
Clustering Goal
• Suppose you transmit the coordinates of points drawn randomly from this dataset
• You are only allowed 2 bits/point
• What encoder/decoder will lose the least information?
Idea One
• Break into a grid
• Decode each bit pair as the middle of its grid cell
[Grid with cells labeled 00, 01, 10, 11]
Idea Two
• Break into a grid
• Decode each bit pair as the centroid of all data in the grid cell
[Grid with cells labeled 00, 10, 01, 11]
K Means Clustering
1. Ask the user how many clusters (e.g., k=5)
2. Randomly guess k cluster center locations
3. Each data point finds its closest center
4. Each cluster finds the new centroid of its points
5. Repeat until…
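A minimal sketch of those five steps on 2-D points in plain Python (the random initialization, the squared-distance measure, and the “centers stop moving” termination test are illustrative choices):

import random

def k_means(points, k, max_iters=100):
    # points: list of (x, y) tuples; returns the final k cluster centers.
    centers = random.sample(points, k)                    # step 2: random initial centers
    for _ in range(max_iters):
        # Step 3: each data point finds its closest center
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: (p[0] - centers[c][0]) ** 2
                                                  + (p[1] - centers[c][1]) ** 2)
            clusters[nearest].append(p)
        # Step 4: each cluster finds the new centroid of its points
        new_centers = [(sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
                       if cl else centers[i]
                       for i, cl in enumerate(clusters)]
        # Step 5: repeat until the centers stop moving (one possible termination test)
        if new_centers == centers:
            break
        centers = new_centers
    return centers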
K Means Issues
• Computationally efficient
• Initialization
• Termination condition
• Distance measure
• What should k be?
Hierarchical Clustering
1. Each point is its own cluster
2. Find the most similar pair of clusters
3. Merge it into a parent cluster
4. Repeat
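A small bottom-up (agglomerative) sketch using single-link distance between clusters and a target cluster count as the stopping criterion (both of those choices, and all names, are illustrative assumptions):

def agglomerative(points, target_clusters):
    # Bottom-up (agglomerative) clustering with single-link distance.
    clusters = [[p] for p in points]                 # step 1: each point is its own cluster

    def dist(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    def cluster_dist(c1, c2):
        # Single link: distance between the closest pair of points across the clusters
        return min(dist(p, q) for p in c1 for q in c2)

    # Step 4: repeat until the stopping criterion (here, a target number of clusters)
    while len(clusters) > target_clusters:
        # Step 2: find the most similar (closest) pair of clusters
        i, j = min(((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
                   key=lambda ab: cluster_dist(clusters[ab[0]], clusters[ab[1]]))
        # Step 3: merge them into a parent cluster
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters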
Hierarchical Clustering Issues
• This was bottom-up clustering (agglomerative clustering)
• Can also perform top-down clustering (divisive clustering)
• Define similarity between clusters
• What is the stopping criterion?
Cluster Visualization: IBM
• One row per cluster (# is % size)
• Charts show fields, ordered by influence
• Pie: outer ring shows the distribution for the whole dataset, inner ring for the cluster
• Bar: solid bar for the cluster, transparent bar for the whole dataset
Structure-based Brushing
• User selects the region of interest with the magenta triangle
• User selects the level of detail with the colored polyline
Spatial Associations
FIND SPATIAL ASSOCIATION RULE
DESCRIBING "Golf Course"
FROM Washington_Golf_courses, Washington
WHERE CLOSE_TO(Washington_Golf_courses.Obj, Washington.Obj, "3 km")
  AND Washington.CFCC <> "D81"
IN RELEVANCE TO Washington_Golf_courses.Obj, Washington.Obj, CFCC
SET SUPPORT THRESHOLD 0.5
Partek
Conclusions