
Predictive Analytics in SQL and Datalog
UCLA CS240A Course Notes
Rollups, Data Cubes: Descriptive Analytics
In the 1990s, DBMSs were extended with Descriptive Analytics, a major technical and commercial success.
• From the GROUP BY of SQL-2 aggregates to GROUP BY ROLLUP/DATACUBE: an obvious step at the language level.
• The sort-based support of traditional aggregates was extended to supergroups. There were some technical challenges, but the general DBMS framework of big data residing in secondary storage was not challenged.
While the need to go beyond that and support discovery analytics was realized immediately, no satisfactory solution was found.
Memorable Attempts
Apriori in DB2, Sarawagi et al. [SIGMOD 1998]: only cache-mining works at the required speed!
Imielinski and Mannila [CACM 1996]: define declarative language constructs for data mining. The rest will follow, as future research will invent query-optimization techniques that make these constructs very efficient.
Inductive Databases: a research field that explored this idea for a few years and … then gave up.
But recently things have started changing.
MBA (Market Basket Analysis): applicable to many other contexts
Telecommunication: each customer is a transaction containing the set of the customer's phone calls.
Atmospheric phenomena: each time interval (e.g., a day) is a transaction containing the set of observed events (rain, wind, etc.).
Etc.
Association Rules
Express how products/services relate to each other and tend to group together:
“if a customer purchases three-way calling, then they will also purchase call-waiting”
Simple to understand.
Actionable information: bundle three-way calling and call-waiting in a single package.
Useful, trivial, inexplicable
Useful: “On Thursdays, grocery store consumers often purchase diapers and beer together.”
Trivial: “Customers who purchase maintenance agreements are very likely to purchase large appliances.”
Inexplicable: “When a new hardware store opens, one of the most sold items is toilet rings.”
Basic Concepts
Transaction, in two equivalent formats:
  Relational format <Tid, item>:    <1, item1>, <1, item2>, <2, item3>
  Compact format <Tid, itemset>:    <1, {item1, item2}>, <2, {item3}>
Item: single element. Itemset: set of items.
Support of an itemset I: number of transactions containing I.
Minimum support σ: threshold for support.
Frequent itemset: an itemset with support ≥ σ.
Frequent itemsets represent sets of items that are positively correlated.
Frequent Itemsets
Transaction ID   Items Bought
1                dairy, fruit
2                dairy, fruit, vegetable
3                dairy
4                fruit, cereals

Support({dairy}) = 3 (75%)
Support({fruit}) = 3 (75%)
Support({dairy, fruit}) = 2 (50%)
If σ = 60%, then {dairy} and {fruit} are frequent while {dairy, fruit} is not.
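
For concreteness, a minimal Python sketch (ours, not from the notes) of support counting over the compact <Tid, itemset> format; the data and threshold reproduce the table above.

```python
# Minimal sketch (ours, not from the notes): support counting over
# transactions stored in the compact <Tid, itemset> format.
transactions = {
    1: {"dairy", "fruit"},
    2: {"dairy", "fruit", "vegetable"},
    3: {"dairy"},
    4: {"fruit", "cereals"},
}

def support(itemset, db):
    """Number of transactions in `db` containing every item of `itemset`."""
    return sum(1 for t in db.values() if itemset <= t)

print(support({"dairy"}, transactions))           # 3 (75%)
print(support({"fruit"}, transactions))           # 3 (75%)
print(support({"dairy", "fruit"}, transactions))  # 2 (50%)
# With sigma = 60% of 4 transactions, {dairy} and {fruit} are frequent
# (75% >= 60%) but {dairy, fruit} is not (50% < 60%).
```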
Association Rules: Measures
Let A and B be a partition of an itemset I:
  A ⇒ B [s, c]
where A and B are itemsets,
  s = support of A ⇒ B = support(A ∪ B)
  c = confidence of A ⇒ B = support(A ∪ B)/support(A)
Measures for rules:
  minimum support σ
  minimum confidence γ
The rule holds if s ≥ σ and c ≥ γ.
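
Both measures follow directly from support counts. A minimal Python sketch (the function name and data layout are our assumptions, not the notes'):

```python
def rule_measures(A, B, D):
    """Support and confidence of A => B over a list of transactions D;
    A and B are disjoint itemsets (our helper, not from the notes)."""
    sup_ab = sum(1 for t in D if A | B <= t)  # transactions with A and B
    sup_a = sum(1 for t in D if A <= t)       # transactions with A
    s = sup_ab / len(D)                       # support(A ∪ B)
    c = sup_ab / sup_a if sup_a else 0.0      # support(A ∪ B)/support(A)
    return s, c

# Data of the example two slides below: rule A => C has s = 50%, c = 66.6%.
D = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
s, c = rule_measures({"A"}, {"C"}, D)
print(s, round(c, 3))  # 0.5 0.667; the rule holds if s >= σ and c >= γ
```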
Association Rules: Meaning
A ⇒ B [s, c]
Support: denotes the frequency of the rule within transactions. A high value means that the rule involves a large part of the database.
  support(A ⇒ B [s, c]) = p(A & B)
Confidence: denotes the percentage of transactions containing A which also contain B. It is an estimate of the conditional probability p(B|A).
  confidence(A ⇒ B [s, c]) = p(B|A) = p(A & B)/p(A)
Association Rules - Example
Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A,C}              50%

For rule A ⇒ C, with min. support 50% and min. confidence 50%:
  support = support({A, C}) = 50%
  confidence = support({A, C})/support({A}) = 66.6%

The Apriori principle: any subset of a frequent itemset must be frequent.
Problem Statement
The database consists of a set of transactions. Each transaction consists of a transaction ID and a set of items bought in that transaction (as in a market basket).
An association rule is an implication of the form X ⇒ Y, which says that customers who buy item X are also likely to buy item Y.
In practice we are only interested in relationships between high-volume items (aka frequent items).
Confidence: X ⇒ Y holds with confidence C% if C% of the transactions that contain X also contain Y.
Support: X ⇒ Y has support S% if S% of the transactions contain X ∪ Y. Observe that the support level for X is ≥ that for X ∪ Y, and that their inverse ratio is the confidence of X ⇒ Y:
  confidence(X ⇒ Y) = support(X ∪ Y)/support(X)
Algorithms: Apriori
A level-wise, candidate-generation-and-test approach
(Agrawal & Srikant 1994)
Database D (min_sup = 2):

  TID   Items
  10    a, c, d
  20    b, c, e
  30    a, b, c, e
  40    b, e

Scan D to count the 1-candidates:

  Itemset   Sup
  a         2
  b         3
  c         3
  d         1
  e         3

Frequent 1-itemsets (support ≥ 2): a:2, b:3, c:3, e:3.

2-candidates: ab, ac, ae, bc, be, ce. Scan D and count:

  Itemset   Sup
  ab        1
  ac        2
  ae        1
  bc        2
  be        3
  ce        2

Frequent 2-itemsets: ac:2, bc:2, be:3, ce:2.

3-candidates: bce. Scan D: support(bce) = 2, so bce is the only frequent 3-itemset.
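
The level-wise loop can be sketched compactly in Python; this is our illustration of the Agrawal & Srikant scheme, not code from the notes:

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise frequent-itemset mining: a sketch of the Apriori scheme,
    not the notes' implementation. Returns {frozenset: support count}."""
    frequent = {}
    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items}   # 1-candidates
    k = 1
    while level:
        # One scan of D per level to count candidate supports.
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        freq_k = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(freq_k)
        # Build (k+1)-candidates from frequent k-itemsets; the Apriori
        # principle prunes any candidate with an infrequent k-subset.
        k += 1
        level = {a | b for a in freq_k for b in freq_k if len(a | b) == k}
        level = {c for c in level
                 if all(frozenset(s) in freq_k for s in combinations(c, k - 1))}
    return frequent

# The running example above: with min_sup = 2, the final level yields {b,c,e}.
D = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
for itemset, sup in sorted(apriori(D, 2).items(),
                           key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```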
Performance Challenges of Frequent-Set (aka Frequent-Pattern) Mining
Challenges:
  Multiple scans of the transaction database
  Huge number of candidates
  Tedious workload of support counting for candidates
Improving Apriori: many algorithms have been proposed. General ideas:
  Reduce the number of transaction-database scans
  Shrink the number of candidates
  Facilitate support counting of candidates, e.g. with data structures such as hash tables and prefix trees (a prefix-tree sketch follows below)
FP-growth: frequent-pattern mining without candidate generation [Han, Pei, Yin 2000].
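
To illustrate the prefix-tree idea mentioned above, here is a hedged sketch (our construction, not the cited papers' data structure): candidates are stored in a trie keyed by sorted items, so one recursive pass per transaction updates the count of every candidate it contains.

```python
class Node:
    """Prefix-tree node; a path of items from the root spells a candidate."""
    def __init__(self):
        self.children = {}        # item -> Node
        self.count = 0            # support count if a candidate ends here
        self.is_candidate = False

def insert(root, itemset):
    node = root
    for item in sorted(itemset):  # store itemsets in sorted item order
        node = node.children.setdefault(item, Node())
    node.is_candidate = True

def count(node, transaction, start=0):
    """Match a sorted transaction against the tree, updating counts."""
    if node.is_candidate:
        node.count += 1
    for i in range(start, len(transaction)):
        child = node.children.get(transaction[i])
        if child:
            count(child, transaction, i + 1)

def dump(node, prefix=()):
    if node.is_candidate:
        print(set(prefix), node.count)
    for item, child in sorted(node.children.items()):
        dump(child, prefix + (item,))

# The six 2-candidates and four transactions of the running example.
root = Node()
for c in ["ab", "ac", "ae", "bc", "be", "ce"]:
    insert(root, set(c))
for t in [{"a","c","d"}, {"b","c","e"}, {"a","b","c","e"}, {"b","e"}]:
    count(root, sorted(t))
dump(root)  # ab:1, ac:2, ae:1, bc:2, be:3, ce:2, as in the Counting table
```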
Apriori Summary
Scanning the database and counting occurrences.
Pruning the itemsets below the minimum support level. [Particularly after the first step, we might want to prune the database D as well.]
Combining frequent sets of size n into candidate larger sets of size n + 1 [or even larger]; see the sketch after this list.
Monotonicity condition: the support level of a set is never larger than that of any of its subsets.
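
The combining step referenced above is the classic join-and-prune candidate generation; a minimal sketch (the name apriori_gen is conventional, the code is ours):

```python
from itertools import combinations

def apriori_gen(freq_k):
    """Join frequent k-itemsets into (k+1)-candidates, then prune by
    monotonicity: keep a candidate only if all its k-subsets are frequent.
    `freq_k` is a set of frozensets, all of size k."""
    k = len(next(iter(freq_k)))
    joined = {a | b for a in freq_k for b in freq_k if len(a | b) == k + 1}
    return {c for c in joined
            if all(frozenset(s) in freq_k for s in combinations(c, k))}

freq_2 = {frozenset(p) for p in ["ac", "bc", "be", "ce"]}
print(apriori_gen(freq_2))  # only {b,c,e}; abc and ace are pruned
```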
Apriori in DB2
S. Sarawagi, S. Thomas, R. Agrawal: "Integrating
Association Rule Mining with Databases:
Alternatives and Implications", Data Mining and
Knowledge Discovery Journal, 4(2/3), July 2000.
Very difficult to integrate Apriori into a DBMS: the only approach that works is cache-mining, and similar conclusions apply to other KDD algorithms.
Hence the lackluster commercial success of DBMS vendors with predictive analytics, in stark contrast with their success in descriptive analytics, i.e. rollups and data cubes.
Extracting the Rules
Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A,C}              50%

For rule A ⇒ C:
  Support for the rule = support for its set of items = support({A, C}) = 50%
  Confidence = support for the rule over support for its left side = support({A, C})/support({A}) = 66.6%
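
A sketch of the extraction step in Python (ours, not the notes'): enumerate each split of a frequent itemset into antecedent and consequent, keeping rules whose confidence clears the threshold.

```python
from itertools import combinations

def extract_rules(frequent, min_conf):
    """Emit every rule A => B with confidence >= min_conf from a
    {frozenset: support} map of frequent itemsets (a sketch, names ours)."""
    rules = []
    for itemset, sup in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for A in map(frozenset, combinations(itemset, r)):
                conf = sup / frequent[A]          # support(I)/support(A)
                if conf >= min_conf:
                    rules.append((set(A), set(itemset - A), conf))
    return rules

# The table above, with supports as counts out of 4 transactions.
frequent = {frozenset("A"): 3, frozenset("B"): 2,
            frozenset("C"): 2, frozenset("AC"): 2}
for A, B, conf in extract_rules(frequent, 0.5):
    print(A, "=>", B, round(conf, 3))  # A => C (0.667) and C => A (1.0)
```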
Rule Implications
Lemma: If X ⇒ YZ holds, then XY ⇒ Z and XZ ⇒ Y also hold.
This property can be used to limit the number of rules tested.
Example: For frequent itemset ABC: if AB ⇒ C does not hold, then neither A ⇒ BC nor B ⇒ AC can hold. Then we can test BC ⇒ A, and if it holds we also need to test C ⇒ AB.
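
Following the lemma, rules can be tested largest-antecedent-first, growing consequents only from rules that held. A hedged Python sketch (ours), run on the b, c, e data of the Apriori example:

```python
def rules_with_pruning(itemset, support, min_conf):
    """Confident rules from one frequent itemset. By the lemma, if
    XY => Z fails then X => YZ and Y => XZ must fail, so a larger
    consequent is tried only if its sub-consequents all passed."""
    sup_all = support[itemset]
    confident = []
    consequents = [frozenset([i]) for i in itemset]  # 1-item consequents
    while consequents and len(consequents[0]) < len(itemset):
        passed = []
        for B in consequents:
            A = itemset - B
            if sup_all / support[A] >= min_conf:
                passed.append(B)
                confident.append((set(A), set(B)))
        m = len(consequents[0]) + 1   # merge survivors into larger consequents
        consequents = [c for c in {a | b for a in passed for b in passed}
                       if len(c) == m]
    return confident

support = {frozenset(s): n for s, n in
           [("bce", 2), ("bc", 2), ("be", 3), ("ce", 2),
            ("b", 3), ("c", 3), ("e", 3)]}
for A, B in rules_with_pruning(frozenset("bce"), support, 0.8):
    print(A, "=>", B)   # bc => e and ce => b; be => c fails, which also
                        # rules out b => ce and e => bc without testing them
```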
Apriori in Datalog