SAS Homework - Temple Fox MIS


SAS HOMEWORK 3 REVIEW
ASSOCIATION RULES MINING
MIS2502
Data Analytics
SAS Homework 3 Review
Association Rules
• Using Transactions Data Set
• Reject Store and Quantity – we don’t need them
• Assign the ID role to Transaction (Nominal) – this is our ‘basket’
• Assign the Target role to Product (Nominal) – this is what we’re trying to
determine, but now it’s not a Y/N (binary)
• Step 8 = Transaction!
• Add an Associations node (Model)
• In Properties, set Export Rule by ID = Yes
• Answer some questions regarding the Association Rules
• Evaluate Support, Confidence and Lift
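To see why Transaction is the ID and Product is the Target, it can help to picture the rows being grouped into baskets. A minimal sketch in Python, assuming the data has one row per (transaction, product) once Store and Quantity are rejected. The sample rows and the name `build_baskets` are hypothetical, not part of the homework data or of SAS.

```python
from collections import defaultdict

# Hypothetical rows: (transaction ID, product). Not the real data set.
rows = [
    (1, "pens"), (1, "pencils"),
    (2, "pens"), (2, "markers"), (2, "pencils"),
    (3, "shampoo"),
]

def build_baskets(rows):
    """Group products by transaction ID; each set is one 'basket'."""
    baskets = defaultdict(set)
    for transaction_id, product in rows:
        baskets[transaction_id].add(product)
    return dict(baskets)

baskets = build_baskets(rows)
for transaction_id, items in baskets.items():
    print(transaction_id, sorted(items))
```

Each basket is a set, so a product counts once per transaction no matter how many rows mention it.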
Set Up
• Retail – associations between items purchased from
Health/Beauty and Stationery.
• 400K+ transactions collected from POS
• Products
• bar soap
• bows
• candy bars
• deodorant
• greeting cards
• magazines
• markers
• pain relievers
• pencils
• pens
• perfume
• photo processing
• prescription medications
• shampoo
• toothbrushes
• toothpaste
• wrapping paper
We are using 2
Association Rules - Diagram
• Right-click and Run.
• Then view the results.
Process
• Set rule thresholds
• Define Item Sets
• Read through Item Sets, create list of all possible association rules
(X => Y) for the Item Sets
• Compute Support, Confidence and Lift for each Rule
• Support: frequency of occurrence (count / all transactions) for both the
individual items (X and Y) and for the ItemSet (X,Y)
• Confidence: strength of the association. How often Y appears in baskets
that contain X
• count(X=>Y)/count(X)
• Expected Confidence of X=>Y is the probability that any one basket has Y,
i.e. s(Y)
• Lift = s(X=>Y)/(s(X)*s(Y))
• Or, in SAS, confidence/expected confidence
• Drop those that don’t meet thresholds
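The process above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not how the SAS Associations node is implemented. The baskets, the function names and the threshold values are all assumptions made up for the example, and only two-item rules are considered.

```python
from itertools import permutations

# Hypothetical baskets (not the homework data).
baskets = [
    {"pens", "pencils"},
    {"pens", "pencils", "markers"},
    {"pens", "markers"},
    {"shampoo", "toothpaste"},
    {"pens"},
]

def support(itemset, baskets):
    """Share of all baskets that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def rule_stats(x, y, baskets):
    """Support, confidence, expected confidence and lift for the rule X => Y."""
    s_xy = support({x, y}, baskets)
    s_x = support({x}, baskets)
    s_y = support({y}, baskets)
    confidence = s_xy / s_x          # count(X=>Y) / count(X)
    expected_confidence = s_y        # chance that any basket has Y
    lift = confidence / expected_confidence  # same as s_xy / (s_x * s_y)
    return {"x": x, "y": y, "support": s_xy, "confidence": confidence,
            "expected_confidence": expected_confidence, "lift": lift}

def mine_rules(baskets, min_support=0.2, min_confidence=0.5):
    """List every two-item rule, then drop those below the thresholds."""
    items = sorted(set().union(*baskets))
    rules = []
    for x, y in permutations(items, 2):
        stats = rule_stats(x, y, baskets)
        if stats["support"] >= min_support and stats["confidence"] >= min_confidence:
            rules.append(stats)
    return rules

for rule in mine_rules(baskets):
    print(rule["x"], "==>", rule["y"], round(rule["confidence"], 2), round(rule["lift"], 2))
```

Note that support and lift come out the same for X => Y and Y => X, while confidence depends on which item is on the left.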
Evaluating the Statistics
Support – frequency: % of transactions in which the ItemSet occurs
Confidence – strength: % of baskets with the left-hand side that also contain the right-hand side
Lift – dependence: prob of dependent occurrence / prob of random occurrence (>1 means a positive association)
Support v Confidence plot: blue = 2-variable rules, red = 3-variable rules
Confidence v Expected Confidence plot: rules ordered by lift on the x axis; the gap between the two reflects lift
Confidence plot, Left v Right: red = high; range shown at bottom
Evaluating the Rules Table
View > Rules > Rule Table
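Reading the rule table mostly means finding which rules rank highest on each statistic. A minimal sketch of that lookup follows. The rows are made-up placeholders with generic item names (A, B, C), not the homework results, and `top_rules` is a hypothetical helper, not a SAS function.

```python
# Made-up rows for illustration only; these are NOT the homework results.
rules = [
    {"rule": "A ==> B", "support": 0.30, "confidence": 0.60, "lift": 0.90},
    {"rule": "B ==> A", "support": 0.30, "confidence": 0.45, "lift": 0.90},
    {"rule": "C ==> A", "support": 0.10, "confidence": 0.80, "lift": 1.60},
]

def top_rules(rules, statistic):
    """Return every rule tied for the highest value of the given statistic."""
    best = max(r[statistic] for r in rules)
    return [r["rule"] for r in rules if r[statistic] == best]

for statistic in ("confidence", "support", "lift"):
    print(statistic, top_rules(rules, statistic))
```

Ties matter: a rule and its reverse always share the same support and lift, so they show up together as a "pair".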
In Class
1) Which rule(s) have the highest confidence?
MUSICSTREAM ==> WEBSITE
2) Which rule(s) have the highest support?
WEBSITE ==> PODCAST and PODCAST ==> WEBSITE
3) Which rule(s) have the highest lift?
ARCHIVE ==> WEBSITE and WEBSITE ==> ARCHIVE
4) What are the two rule “pairs” in the list above?
ARCHIVE ==> WEBSITE/WEBSITE ==> ARCHIVE and
WEBSITE ==> PODCAST/PODCAST ==> WEBSITE
5) What other service “goes the most” with visiting the website for general
information (WEBSITE)? In other words, what other service are WEBSITE visitors
most likely to seek out? What statistic did you use to figure this out?
ARCHIVE – LIFT is greater than 1. This implies that this isn’t just random
chance – people are actively seeking out the WEBSITE if they’ve used the
ARCHIVE.
6) What other service seems to “go the least” with visiting the website for
general information (WEBSITE)? In other words, what other service are
WEBSITE visitors least likely to seek out? What statistic did you use to figure
this out?
PODCAST – LIFT is less than 1. This also implies that this isn’t just
random chance – but this time, people who visit the website are
particularly unlikely to also download a podcast.
7) The rule MUSICSTREAM ==> WEBSITE has poor lift (i.e., less than 1), but
the rule has the highest confidence. Explain how this is possible.
It could be that many people use both MUSICSTREAM and WEBSITE so it
appears in visitors’ set of services a lot. However, there can still be a
negative effect of one on the other. For example, I use the website a lot,
and I use music streaming a lot, but I’m still less likely to do one if I’ve
done the other – possibly they are substitutes.
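The answer to question 7 can be checked with a little arithmetic. The counts below are invented purely to show that the situation is possible; they are not taken from the homework output.

```python
# Hypothetical counts, chosen only to illustrate the idea.
total_visitors = 1000
website = 900        # visitors who used WEBSITE
musicstream = 500    # visitors who used MUSICSTREAM
both = 400           # visitors who used both

# Confidence of MUSICSTREAM ==> WEBSITE: share of MUSICSTREAM users who also used WEBSITE.
confidence = both / musicstream               # 0.8, which is high
# Expected confidence: share of ALL visitors who used WEBSITE.
expected_confidence = website / total_visitors  # 0.9, which is even higher
lift = confidence / expected_confidence        # about 0.89, below 1

print(f"confidence={confidence:.2f} expected={expected_confidence:.2f} lift={lift:.2f}")
```

Confidence is high because WEBSITE is popular with everyone. Lift is below 1 because MUSICSTREAM users visit the website slightly less often than visitors in general.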