Transcript ppt - CS

Data Mining
Jim
Which cow
should I buy??
Jim’s cows
Cows on sale
Name
Milk
Avg.
(MA)
Name
Milk
Avg.
(MA)
AGE
Rating
AGE
Mona
6
5
Good
Phil
5
3
Lisa
4
6
Bad
Collins
3
2
Mary
8
3
Good
Larry
9
5
Quirri
6
5
Bad
Bird
2
5
Paula
2
6
Good
Abdul
10
7
Bad
Which cow
should I buy??
Which cow should I buy??

And suppose I know their:







Behavior
Preferred mating months
Milk production
Nutritional habits
Immune system data
…
Now suppose I have 10,000 cows…
“understanding” data

Looking for patterns is not new:






Hunters seek patterns in animal migration
Politicians seek patterns in voting habits
…
Available data is increasing very fast
(exponentially?)
Greater opportunities to extract valuable
information
But “understanding” the data becomes more
difficult
Data Mining
Data Mining: The process of discovering
patterns in data, usually stored in a Database.
The patterns lead to advantages (economic or
other).
Very fast growing area of research
Because:







Huge databases (Walmart-20 mil transactions/day)
Automatic data capture of transactions (Bar code,
satellites, scanners, cameras, etc.)
Large financial advantage
Evolving analytical methods
Data Mining techniques in some
huji courses
Technique
Course
Decision Trees
Artificial Intelligence
EM, Perceptron, SVM,
PCA…
Neural Networks
Intro. to Machine
Learning
Intro. to information
processing and Learning
Neural Networks 1, 2.
K-Nearest Neighbor
Computational Geometry
Data Mining

Two extremes for the expression of the
patterns:
1.
2.
“Black Box”: “Buy cow Zehava, Petra and
Paulina”
“Transparent Box” (Structural
Patterns): “Buy cows with age<4 and
weight >300 or cows with calm behavior
and >90 liters of milk production per month”
The weather example
Outlook
Temp.
Humidity
Windy
Play
Sunny
Hot
High
False
No
Sunny
Hot
High
True
No
Overcast
Hot
High
False
Yes
Rainy
Mild
High
False
Yes
Rainy
Cool
Normal
False
Yes
Rainy
Cool
Normal
True
No
Overcast
Cool
Normal
True
Yes
Sunny
Mild
High
False
No
Sunny
Cool
Normal
False
Yes
Today is Overcast, mild temperature, high humidity,
and windy. Will we play?
Questions one can ask

A set of rules learned from this data could be
presented in a Decision List:







If outlook=sunny and humidity=high then play=no
ElseIf outlook=rainy and windy=true then play=no
ElseIf outlook=overcast then play=yes
ElseIf humidity=normal then play=yes
Else play=yes
This is an example of Classification Rules
We could also look for Association Rules:


If temperature=cool then humidity=normal
If windy=false and play=no then outlook=sunny and
humidity=high
Example Cont.

The previous example is very simplified.
Real Databases will probably:
1.
2.
3.

Contain Numerical values as well.
Contain “Noise” and errors (stochastic).
Be a lot larger.
And the analysis we are asked to perform
might not be of Association Rules, but
rather Decision Trees, Neural Networks,
etc.
Caution
David Rhine was a parapsychologist in the 1930-1950’s
 He hypothesized that some people have Extra-Sensory
Perception (ESP)
 He asked people to say if 10 hidden cards are red or blue.
 He discovered that almost 1 in every 1000 people has ESP !
 He told these people that they have ESP and called them in
for another test
 He discovered almost all of them had lost their ESP !
 He concluded that…
 You shouldn’t tell people they have ESP, it makes them
loose it.
[Source: J. Ullman]

Another Example





Classic example: Database of purchases in a
supermarket
Such huge DB’s which are saved for long periods
of time are called Data Warehouses
It is extremely valuable for the manager of the
store to extract Association Rules from the Data
Warehouse
It is even more valuable if this information can
be associated with the person buying, hence the
Club Memberships…
Each Shopping Basket is a list of items that
were bought in a single purchase by some
customer
Supermarket Example

Beer and Diapers were found to often be
bought together by men, so they were
placed in the same aisle
[Datamining urban legend]
The Purchases
Relation
Itemset: A set of items
Support of an itemset: the
fraction of transactions that
contain all items in the
itemset.
What is the Support of:
1. {pen}?
2. {pen, ink}?
3. {pen, juice}?
transid
111
111
111
111
112
112
112
113
113
114
114
114
item
pen
ink
milk
juice
pen
ink
milk
pen
milk
pen
ink
juice
Frequent Itemsets




We would like to find items that are purchased
together at high frequency- Frequent Itemsets.
We look for itemsets which have a
support > minSupport.
If minSupport is set to 0.7, then the frequent
itemsets in our example would be:
{pen}, {ink}, {milk}, {pen, ink}, {pen, milk}
The A-Priori property of frequent itemsets:
Every subset of a frequent itemset is also a
frequent itemset.
Algorithm for finding Frequent
itemsets







Suppose we have n items.
The naïve approach: for every subset of items, check if it
is frequent.
Very expensive
Improvement (based on the A-priori property): first
identify frequent itemsets of size 1, then try to expand
them.
Greatly reduces the number of candidate frequent
itemsets.
A single scan of the table is enough to determine which
candidate itemsets, are frequent.
The algorithm terminates when no new frequent
itemsets are found in an iteration.
Algorithm for finding Frequent
itemsets
foreach item, check if it is a frequent itemset;
(appears in >minSupport of the transactions)
k=1;
repeat
foreach new frequent itemset Ik with k items:
Generate all itemsets Ik+1 with k+1 items, such that
Ik is contained in Ik+1.
scan all transactions once and add itemsets that
have support > minSupport.
k++
until no new frequent itemsets are found
transid
111
111
111
111
112
112
112
113
113
114
114
114
item
pen
ink
milk
juice
pen
ink
milk
pen
milk
pen
ink
juice
Finding Frequent itemsets, on table
“Purchases”, with minSupport=0.7





In the first run, the following single itemsets are
found to be frequent: {pen}, {ink}, {milk}.
Now we generate the candidates for k=2: {pen,
ink}, {pen, milk}, {pen, juice}, {ink, milk}, {ink,
juice} and {milk, juice}.
By scanning the relation, we determine that the
following are frequent: {pen, ink}, {pen, milk}.
Now we generate the candidates for k=3: {pen,
ink, milk}, {pen, milk, juice}, {pen, ink, juice}.
By scanning the relation, we determine that
none of these are frequent, and the algorithm
ends with:
{ {pen}, {ink}, {milk}, {pen,
ink}, {pen, milk} }
Algorithm refinement:





One important refinement: after the candidategeneration phase, and before the scan of the relation (Apriori), eliminate candidate itemsets in which there is a
subset which is not frequent. This is due to the A-Priori
property.
In the second iteration, this means we would eliminate
{pen, juice}, {ink, juice} and {milk, juice} as candidates
since {juice} is not frequent. So we only check {pen,
ink}, {pen, milk} and {ink, milk}.
So only {pen, ink, milk} is generated as a candidate, but
it is eliminated before the scan because {ink, milk} is not
frequent.
So we don’t perform the 3rd iteration of the relation.
More complex algorithms use the same tools: iterative
generation and testing of candidate itemsets.
Association Rules


Up until now we discussed identification of frequent
item sets. We now wish to go one step further.
An association rule is of the structure
{pen}=> {ink}



Meaning: “if a pen is purchased in a transaction, it is
likely that ink will also be purchased in that
transaction”.
It describes the data in the DB (past). Extrapolation
to future transactions should be done with caution.
More formally, an Association Rule is LHS=>RHS,
where both LHS and RHS are sets of items, and
implies that if every item in LHS was purchased in a
transaction, it is likely that the items in RHS are
purchased as well.
Measures for Association Rules
1.
2.

Support of “LHS=>RHS” is the support
of the itemset (LHS U RHS). In other
words: the fraction of transactions that
contain all items in (LHS U RHS) .
Confidence of “LHS=>RHS”: Consider
all transactions which contain all items in
LHS. The fraction of these transactions
that also contain all items in RHS, is the
confidence of RHS.
=S(LHS U RHS)/S(LHS)
The confidence of a rule is an indication
of the strength of the rule.
What is the support of {pen}=>{ink}?
And the confidence?
What is the support of {ink}=>{pen}?
And the confidence?
transid
111
111
111
111
112
112
112
113
113
114
114
114
item
pen
ink
milk
juice
pen
ink
milk
pen
milk
pen
ink
juice
Finding Association rules



A user can ask for rules with minimum
support minSup and minimum confidence
minConf.
Firstly, all frequent itemsets with
support>minSup are computed with the
previous Algorithm.
Secondly, rules are generated using the
frequent itemsets, and checked for
minConf.
Finding Association rules


Find all frequent itemsets using the previous alg.
For each frequent itemset X with support S(X):
For each division of X into 2 itemsets, LHS and RHS:
Calculate the Confidence of LHS=>RHS : S(X)/S(LHS).

We computed S(LHS) in the previous algorithm
(because LHS is frequent since X is frequent).
Generalized association rules
transid
111
111
date
1.5.99
1.5.99
item
pen
ink
111
111
112
112
1.5.99
1.5.99
10.5.99
10.5.99
Milk
juice
pen
ink
112
113
113
10.5.99
15.5.99
15.5.99
milk
Pen
milk
114
114
114
1.6.99
1.6.99
1.6.99
Pen
Ink
juice
We would like to know
if the rule
{pen}=>{juice} is
different on the first
day of the month
compared to other
days. How?
What are its support
and confidence
generally?
And on the first days
of the month?
Generalized association rules



By specifying different attributes to group
by (date in the last example), we can
come up with interesting rules which we
would otherwise miss.
Another example would be to group by
location and check if the same rules apply
for customers from Jerusalem compared
to Tel Aviv.
By comparing the support and confidence
of the rules we can observe differences in
the data on different conditions.
Caution in prediction




When we find a pattern in the data, we wish to
use it for prediction (that is in many case the
point).
However, we have to be cautious about this.
For example: suppose {pen}=>{ink} has a high
support and confidence. We might give a
discount on pens in order to increase sales of
pens and therefore also in sales of ink.
However, this assumes a causal link between
{pen} and {ink}.
Caution in prediction






Suppose pens and pencils are sold together a lot
We would then also get the rule {pencil}=>{ink}
with high support and confidence
However, it is clear there is no causal link between
buying pencils and buying ink.
If we promoted pencils it would not cause an
increase in sales of ink, despite high support and
confidence.
The chance to infer “wrong” rules (rules which are
not causal links) decreases as the DB size increases,
but we should keep in mind that such rules do come
up.
Therefore, the generated rules are a only good
starting point for identifying causal links.