What is data mining


Introduction to Data Mining
Mining Association Rules
Reference: Tan, Steinbach, and Kumar, Introduction to Data Mining.
Some slides are adapted from Tan et al.
What is Data Mining?
Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.
What is data mining (cont.)?
Valid: The patterns hold in general.
Novel: We did not know the pattern beforehand.
Useful: We can devise actions from the patterns.
Understandable: We can interpret and comprehend the patterns.
Data mining falls into the broad field of knowledge discovery.
The data mining process
Origins of Data Mining
Draws ideas from machine learning/artificial intelligence, pattern recognition, statistics, and database systems.
Human analysis and traditional techniques may be unsuitable due to:
– Enormity of data
– High dimensionality of data
– Heterogeneous data
– Distributed nature of data
[Figure: data mining at the intersection of statistics/AI, machine learning/pattern recognition, and database systems]
Data Mining Tasks
Prediction Methods
– Use some variables to predict unknown or future
values of other variables.
Description Methods
– Find human-interpretable patterns that describe the
data.
Data Mining Tasks...
Association Rule Discovery [Descriptive]
Clustering [Descriptive]
Classification [Predictive] (for discrete variables)
Sequential Pattern Discovery [Descriptive]
Regression [Predictive] (for continuous variables)
Deviation Detection [Predictive]
Association Rule Discovery: Definition
Given a set of records, each of which contains some number of items from a given collection:
– Produce dependency rules that will predict the occurrence of an item based on occurrences of other items.
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
Association Rule Discovery: Sample Application
Supermarket shelf management.
– Goal: To identify items that are bought together by
sufficiently many customers.
– Approach: Process the point-of-sale data collected
with barcode scanners to find dependencies among
items.
Clustering: Definition
Given a set of data points, each having a set of
attributes, and a similarity measure among them,
find clusters such that
– Data points in one cluster are more similar to one
another.
– Data points in separate clusters are less similar to
one another.
Similarity Measures:
– Euclidean Distance if attributes are continuous.
– Other Problem-specific Measures.
Illustrating Clustering
Euclidean Distance Based Clustering in 3-D space.
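To make the idea concrete, here is a minimal sketch of Euclidean-distance-based clustering (plain k-means) in Python. The 3-D points and the choice of k = 2 are made up for illustration and are not from the slides.

```python
import math
import random

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means: assign each point to its nearest centroid,
    then recompute centroids, until assignments stabilize."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return clusters

# Two well-separated groups of made-up 3-D points.
points = [(1, 1, 1), (1, 2, 1), (2, 1, 2), (8, 8, 9), (9, 8, 8), (8, 9, 9)]
print(kmeans(points, k=2))
```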
Clustering: Sample Application 1
Market Segmentation:
– Goal: subdivide a market into distinct subsets of
customers where any subset may conceivably be
selected as a market target to be reached with a
distinct marketing mix.
– Approach:
• Collect different attributes of customers based on their
geographical and lifestyle related information.
• Find clusters of similar customers.
• Measure the clustering quality by observing buying patterns
of customers in same cluster vs. those from different clusters.
Clustering: Sample Application 2
Document Clustering:
– Goal: To find groups of documents that are similar to
each other based on the important terms appearing in
them.
– Approach: To identify frequently occurring terms in
each document. Form a similarity measure based on
the frequencies of different terms. Use it to cluster.
– Gain: Information Retrieval can utilize the clusters to
relate a new document or search term to clustered
documents.
Classification: Definition
Given a collection of records (training set)
– Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the
model. Usually, the given data set is divided into
training and test sets, with training set used to build
the model and test set used to validate it.
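As a sketch of this train/test workflow (not code from the slides), the snippet below builds a trivial one-nearest-neighbour model on made-up labelled records and validates it on a held-out test set; all attribute values and class labels are invented for illustration.

```python
import math

# Made-up labelled records: (attribute vector, class label).
records = [
    ((25, 40_000), "don't buy"), ((30, 45_000), "don't buy"),
    ((45, 90_000), "buy"),       ((50, 95_000), "buy"),
    ((35, 60_000), "don't buy"), ((48, 88_000), "buy"),
]

# Split into a training set (to build the model) and a test set (to validate it).
train, test = records[:4], records[4:]

def predict(x, training_set):
    """1-nearest-neighbour: copy the class of the closest training record."""
    nearest = min(training_set, key=lambda r: math.dist(x, r[0]))
    return nearest[1]

correct = sum(predict(x, train) == label for x, label in test)
print(f"accuracy on test set: {correct}/{len(test)}")
```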
Classification and Clustering
Classification:
– Classes pre-defined
– Uses training set (thus also known as supervised
learning)
Clustering:
– Classes not defined in advance
– No training set (thus also known as unsupervised
learning)
Classification: Example
Classification: Sample Application 1
Direct Marketing
– Goal: Reduce cost of mailing by targeting a set of
consumers likely to buy a new cell-phone product.
– Approach:
• Use the data for a similar product introduced before.
• We know which customers decided to buy and which decided
otherwise. This {buy, don’t buy} decision forms the class
attribute.
• Collect various demographic, lifestyle, and company-interaction related information about all such customers.
– Type of business, where they stay, how much they earn, etc.
• Use this information as input attributes to learn a classifier
model.
Classification: Sample Application 2
Fraud Detection
– Goal: Predict fraudulent cases in credit card
transactions.
– Approach:
• Use credit card transactions and information on the account holder as attributes.
– When does the customer buy, what do they buy, how often do they pay on time, etc.
• Label past transactions as fraud or fair transactions. This
forms the class attribute.
• Learn a model for the class of the transactions.
• Use this model to detect fraud by observing credit card
transactions on an account.
Sequential Pattern Discovery: Definition
Given a set of objects, with each object associated with its own timeline of
events, find rules that predict strong sequential dependencies among
different events.
Rules are formed by first discovering patterns. Event occurrences in the
patterns are governed by timing constraints.
Regression
Predict a value of a given continuous valued variable
based on the values of other variables, assuming a
linear or nonlinear model of dependency.
Widely studied in statistics and in the neural-network field.
Examples:
– Predicting sales amounts of a new product based on
advertising expenditure.
– Predicting wind velocities as a function of
temperature, humidity, air pressure, etc.
– Time series prediction of stock market indices.
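As a minimal illustration of the linear case, a one-variable model can be fit by ordinary least squares; the advertising/sales numbers below are made up.

```python
# Ordinary least-squares fit of y = a*x + b (made-up advertising/sales data).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # advertising expenditure
ys = [2.1, 3.9, 6.2, 8.0, 9.8]   # sales amount

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x
print(f"model: y = {a:.2f}*x + {b:.2f}")
print("predicted sales at x = 6:", a * 6 + b)
```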
Deviation/Anomaly Detection
Detect significant deviations from normal
behavior
Applications:
– Credit Card Fraud Detection
– Network Intrusion Detection
Mining Association Rules
Association Rule Mining
Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other
items in the transaction
Market-Basket transactions
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Example of Association Rules
{Diaper} → {Beer},
{Milk, Bread} → {Eggs, Coke},
{Beer, Bread} → {Milk}
Implication means co-occurrence!
Definition: Frequent Itemset
• Itemset
– A collection of one or more items
• Example: {Milk, Bread, Diaper}
– k-itemset
• An itemset that contains k items
• Support count (σ)
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2
• Support (s)
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Definition: Association Rule
• Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}
• Rule Evaluation Metrics
– Support (s): fraction of transactions that contain both X and Y
– Confidence (c): measures how often items in Y appear in transactions that contain X

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example: {Milk, Diaper} → {Beer}
s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67
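These metrics are easy to compute by scanning the transactions. The sketch below reproduces the numbers above on the market-basket table; the function names are my own.

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset):
    """sigma(itemset): number of transactions containing the itemset."""
    return sum(itemset <= t for t in transactions)

def support(itemset):
    """s(itemset): fraction of transactions containing the itemset."""
    return support_count(itemset) / len(transactions)

def confidence(X, Y):
    """c(X -> Y) = sigma(X union Y) / sigma(X)."""
    return support_count(X | Y) / support_count(X)

X, Y = {"Milk", "Diaper"}, {"Beer"}
print("s =", support(X | Y))               # 0.4
print("c =", round(confidence(X, Y), 2))   # 0.67
```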
Association Rule Mining Task
Given a set of transactions T, the goal of
association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold
Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
⇒ Computationally prohibitive!
Mining Association Rules
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Example of Rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)
Observations:
• All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but
can have different confidence
• Thus, we may decouple the support and confidence requirements
Mining Association Rules
Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high confidence rules from each frequent itemset,
where each rule is a binary partitioning of a frequent itemset
Frequent itemset generation is still
computationally expensive
Frequent Itemset Generation
[Figure: the lattice of all candidate itemsets over five items A, B, C, D, E, from the null set at the top down to ABCDE at the bottom]
Given d items, there are 2^d possible candidate itemsets.
Frequent Itemset Generation
Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the
database
The N transactions (of maximum width w) are matched against the list of M candidates:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

– Match each transaction against every candidate
– Complexity ~ O(NMw) ⇒ expensive, since M = 2^d !!!
Computational Complexity
Given d unique items:
– Total number of itemsets = 2^d
– Total number of possible association rules:

R = Σ_{k=1}^{d-1} [ C(d,k) × Σ_{j=1}^{d-k} C(d-k,j) ] = 3^d - 2^(d+1) + 1

If d = 6, R = 602 rules.
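The closed form is easy to sanity-check in a few lines of Python; `comb` is the binomial coefficient from the standard library.

```python
from math import comb

d = 6

# Double sum: choose k items for the antecedent, then j of the
# remaining d - k items for the consequent.
R = sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
        for k in range(1, d))

# Closed form from the slide.
closed = 3**d - 2**(d + 1) + 1

print(R, closed)  # 602 602
```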
Frequent Itemset Generation Strategies
Reduce the number of candidates (M)
– Complete search: M = 2^d
– Use pruning techniques to reduce M
Reduce the number of transactions (N)
– Reduce size of N as the size of itemset increases
– Used by DHP and vertical-based mining algorithms
Reduce the number of comparisons (NM)
– Use efficient data structures to store the candidates or
transactions
– No need to match every candidate against every
transaction
Reducing Number of Candidates
Apriori principle:
– If an itemset is frequent, then all of its subsets must also
be frequent
Apriori principle holds due to the following
property of the support measure:
∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)
– Support of an itemset never exceeds the support of its
subsets
– This is known as the anti-monotone property of support
Illustrating Apriori Principle
[Figure: the same itemset lattice over A, B, C, D, E, illustrating support-based pruning: once an itemset (e.g. AB) is found to be infrequent, all of its supersets are pruned from the search space]
Illustrating Apriori Principle
Minimum support count = 3

Items (1-itemsets):

Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets; no need to generate candidates involving Coke or Eggs):

Itemset          Count
{Bread, Milk}    3
{Bread, Beer}    2
{Bread, Diaper}  3
{Milk, Beer}     2
{Milk, Diaper}   3
{Beer, Diaper}   3

Triplets (3-itemsets):

Itemset                Count
{Bread, Milk, Diaper}  2

If every subset is considered: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13 candidates.
Apriori Algorithm
Method:
– Let k=1
– Generate frequent k-itemsets
– Repeat until no new frequent itemsets are identified
• Generate candidate (k+1)-itemsets from frequent k-itemsets
• Prune candidate itemsets containing subsets that are
infrequent
• Count the support of each candidate by scanning the DB
• Eliminate candidates that are infrequent, leaving only those
that are frequent
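A compact Python rendering of this method is sketched below (my own code, not from the slides); it takes the transactions as sets of items and a minimum support count, and returns every frequent itemset with its count. The subset-pruning step is exactly the Apriori principle from the previous slides.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: support count} for all frequent itemsets.

    transactions: list of sets of items; min_count: minimum support count.
    """
    # k = 1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)

    k = 1
    while frequent:
        # Generate candidate (k+1)-itemsets by joining frequent k-itemsets.
        prev = list(frequent)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k + 1:
                    candidates.add(union)
        # Apriori pruning: drop candidates with an infrequent k-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k))}
        # Count support by scanning the DB, then eliminate the infrequent.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        result.update(frequent)
        k += 1
    return result
```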
Example: Apriori method for finding frequent itemsets

Transaction ID  Items
01              milk, bread, cookies, juice
02              milk, bread, juice
03              milk, eggs
04              bread, cookies, coffee

Find itemsets with support >= 50% (i.e. support count >= 2):
1. Frequent 1-itemsets:
{bread}, {cookies}, {juice}, {milk}
2. Generate candidate 2-itemsets:
{bread, cookies}, {bread, juice}, {bread, milk}, {cookies, juice}, {cookies, milk}, {juice, milk}
3. Frequent 2-itemsets:
{bread, cookies}, {bread, juice}, {bread, milk}, {juice, milk}
4. Candidate 3-itemsets:
{bread, cookies, juice}, {bread, cookies, milk}, {bread, juice, milk}
Subset pruning removes the first two, since {cookies, juice} and {cookies, milk} are infrequent; the survivor {bread, juice, milk} is frequent with support 2/4.
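Running the earlier `apriori` sketch on this four-transaction database with a minimum count of 2 (i.e. support >= 50%) reproduces the result found by hand:

```python
transactions = [
    {"milk", "bread", "cookies", "juice"},   # 01
    {"milk", "bread", "juice"},              # 02
    {"milk", "eggs"},                        # 03
    {"bread", "cookies", "coffee"},          # 04
]

frequent = apriori(transactions, min_count=2)
for itemset in sorted(frequent, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), frequent[itemset])
# 1-itemsets: bread 3, cookies 2, juice 2, milk 3
# 2-itemsets: {bread, cookies}, {bread, juice}, {bread, milk}, {juice, milk}, each 2
# 3-itemset:  {bread, juice, milk} 2 (the other two candidates are pruned)
```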
Counting the support
– Scan the database of transactions to determine the
support of each candidate itemset
– To reduce the number of comparisons, store the
candidates in a hash structure
• Instead of matching each transaction against every
candidate, match it against candidates contained in the
hashed buckets (details omitted)
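The slides omit the hash-tree details. As a simplified stand-in, the sketch below buckets candidate k-itemsets by a hash of their contents, so each transaction is compared only against the candidates in the buckets that its own k-subsets hash to; the bucket count of 7 is an arbitrary illustrative choice.

```python
from itertools import combinations

def count_support(transactions, candidates, k):
    """Count candidate k-itemsets (given as frozensets) using a simple
    hash-bucket index, a simplified stand-in for a hash tree."""
    n_buckets = 7
    buckets = [[] for _ in range(n_buckets)]
    for c in candidates:
        buckets[hash(c) % n_buckets].append(c)

    counts = {c: 0 for c in candidates}
    for t in transactions:
        # Hash each k-subset of the transaction; only candidates in the
        # matching bucket need to be compared against it.
        for subset in combinations(sorted(t), k):
            key = frozenset(subset)
            for c in buckets[hash(key) % n_buckets]:
                if c == key:
                    counts[c] += 1
    return counts

# e.g. count_support(transactions, {frozenset({"Bread", "Milk"})}, k=2)
```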
Hash Structure
The N transactions are matched against candidate itemsets stored in a hash structure with k buckets:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Rule Generation for Apriori Algorithm
Lattice of rules
[Figure: the lattice of rules generated from the frequent itemset {A, B, C, D}, from ABCD => {} at the top down to the single-antecedent rules A => BCD, B => ACD, C => ABD, D => ABC at the bottom. If CD => AB turns out to be a low-confidence rule, the rules below it whose antecedents are subsets of CD (D => ABC and C => ABD) are pruned]
Rule Generation for Apriori Algorithm
• A candidate rule is generated by merging two rules that share the same prefix in the rule consequent
– Example: join(CD => AB, BD => AC) would produce the candidate rule D => ABC
• Prune rule D => ABC if its subset AD => BC does not have high confidence
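The whole procedure can be sketched as below (my own rendering, not code from the slides). `support_count` is assumed to be a function returning σ for any itemset, e.g. built from the Apriori output; pruning is valid because the confidence of rules from one itemset can only drop as the consequent grows.

```python
from itertools import combinations

def generate_rules(itemset, support_count, min_conf):
    """Generate rules X -> Y from one frequent itemset, growing the
    consequent Y level by level and pruning by confidence."""
    itemset = frozenset(itemset)
    rules = []
    # Level 1: one-item consequents.
    consequents = [frozenset([i]) for i in itemset]
    while consequents:
        next_level = set()
        for Y in consequents:
            X = itemset - Y
            if not X:
                continue
            conf = support_count(itemset) / support_count(X)
            if conf >= min_conf:
                rules.append((X, Y, conf))
                next_level.add(Y)  # only survivors are merged further
        # Merge surviving consequents that differ in exactly one item.
        survivors = list(next_level)
        consequents = {a | b for i, a in enumerate(survivors)
                       for b in survivors[i + 1:] if len(a | b) == len(a) + 1}
    return rules
```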
Example: generating rules
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Find rules with confidence >= 2/3 from the itemset {Bread, Milk, Diaper}:
1. {Bread, Milk} → {Diaper}   confidence 2/3
   {Bread, Diaper} → {Milk}   confidence 2/3
   {Diaper, Milk} → {Bread}   confidence 2/3
2. {Bread} → {Diaper, Milk}   confidence 2/4
The three rules in step 1 meet the threshold; {Bread} → {Diaper, Milk} (confidence 2/4 < 2/3) is rejected.
Next
• Clustering
• Classification