Transcript Slides
Data Mining,
Frequent-Itemset Mining
Data Mining
• Discovery of useful, possibly unexpected, patterns in data.
• Example patterns:
– "50% of the people who buy hot dogs also buy mustard,“
– "these three individual's pattern of credit-card expenditures indicate that
they are running an illigal activity."
Some mining problems
• Find frequent itemsets in "market-basket" data
• Find "similar" items in a large collection.
– Example applications:
• Find documents on the Web that share a significant amount of common text or
• Find books that have been bought by many of the same Amazon customers.
• Find clusters of data
– Example application
• Find clusters of Web pages by the words they use.
Frequent-Itemset Mining
The Market-Basket Model
• A large set of items, e.g., things sold in a supermarket.
• A large set of baskets, each of which is a small set of the items,
e.g., the things one customer buys on one day.
Fundamental problem
• What sets of items are often bought together?
Application
• If a large number of baskets contain both hot dogs and mustard,
we can use this information in several ways. How?
Hot Dogs and Mustard
1. Apparently, many people walk from where the hot dogs are to
where the mustard is.
•
•
We can put them close together, and put between them other foods
that might also be bought with hot dogs and mustard, e.g., ketchup
or potato chips.
Doing so can generate additional "impulse" sales.
2. The store can run a sale on hot dogs and at the same time raise the
price of mustard.
•
•
•
People will come to the store for the cheap hot dogs, and many will
need mustard too.
It is not worth the trouble to go to another store for cheaper mustard,
so they buy that too.
The store makes back on mustard what it loses on hot dogs, and also
gets more customers into the store.
Beer and Diapers
• What’s the explanation here?
On-Line Purchases
• Amazon.com offers several million different items for sale, and
has several tens of millions of customers.
• Basket = Customer, Item = Book, DVD, etc.
– Motivation: Find out what items are bought together.
• Basket = Book, DVD, etc. Item = Customer
– Motivation: Find out similar customers.
Words and Documents
• Baskets = sentences; items = words in those sentences.
– Lets us find words that appear together unusually frequently, i.e.,
linked concepts.
• Baskets = sentences, items = documents containing those
sentences.
– Items that appear together too often could represent plagiarism.
Genes
• Baskets = people; items = genes or blood-chemistry factors.
– Has been used to detect combinations of genes that result in
diabetes
Support
• Question: find sets of items that appear “frequently” in the baskets.
• Support for itemset I = the number of baskets containing all items
in I.
• Given a support threshold s, sets of items that appear in > s baskets
are called frequent itemsets.
Example: Frequent Itemsets
• Items={milk, coke, pepsi, beer, juice}.
• Support = 3 baskets.
B1 = {m, c, b}
B2 = {m, p, j}
B3 = {m, b}
B4 = {c, j}
B5 = {m, p, b}
B6 = {m, c, b, j}
B7 = {c, b, j}
B8 = {b, c}
• Frequent itemsets: {m}, {c}, {b}, {j},
{m,b}, {b,c} , {c,j}.
Scale of Problem
• WalMart sells 100,000 items and can store billions of baskets.
• The Web has over 100,000,000 words and billions of pages.
Association Rules
• If-then rules about the contents of baskets.
• {i1, i2,…,ik} → j means: “if a basket contains all of i1,…,ik then it
is likely to contain j.”
• Confidence of this association rule is the probability of j given
i1,…,ik.
Example
B1 = {m, c, b}
B3 = {m, b}
B5 = {m, p, b}
B7 = {c, b, j}
B2 = {m, p, j}
B4 = {c, j}
B6 = {m, c, b, j}
B8 = {b, c}
• An association rule: {m, b} → c.
– Confidence = 2/4 = 50%.
Interest
• The interest of an association rule X → Y is the absolute value
of the amount by which the confidence differs from the
probability of Y being in a given basket.
Example
B1 = {m, c, b}
B3 = {m, b}
B5 = {m, p, b}
B7 = {c, b, j}
B2 = {m, p, j}
B4 = {c, j}
B6 = {m, c, b, j}
B8 = {b, c}
• For association rule {m, b} → c, item c appears in 5/8 of the
baskets.
• Interest = |2/4 - 5/8| = 1/8 --- not very interesting.
Finding Association Rules
• A typical question: “find all association rules with support ≥ s
and confidence ≥ c.”
– Note: “support” of an association rule is the support of the set of
items it mentions.
• Hard part: finding the high-support (frequent ) itemsets.
– Checking the confidence of association rules involving those sets
is relatively easy.