LN19 - WSU EECS


CPT-S 580-06
Advanced Databases
Yinghui Wu
EME 49
Scalable data mining (a case study)
Data mining

What is data mining? A tentative definition:
– Use of efficient techniques for the analysis of very large collections of data and the extraction of useful and possibly unexpected patterns in data
– Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
– Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
Database Processing vs. Data Mining
• Database query
– Well defined
– SQL, SPARQL, XPath…
– Example: Find all credit applicants with last name of Smith.
– Example: Identify customers who have purchased more than $10,000 in the last month.
– Example: Find all my friends living in Seattle who like French restaurants.

• Data mining query
– Poorly defined
– No precise query language
– Example: Find all credit applicants who are poor credit risks. (classification)
– Example: Identify customers with similar buying habits. (clustering)
– Example: Find all my friends who frequently go to French restaurants if their friends do. (association rules)

• Output
– Database query: precise; a subset of the database
– Data mining: fuzzy; not a subset of the database
Data Mining Models and Tasks
• Predictive tasks: use variables to predict unknown or future values of other variables.
• Descriptive tasks: find human-interpretable patterns that describe the data.
Association rule
Association Rule Discovery: Definition
• Given a set of records, each of which contains some number of items from a given collection:
– Produce dependency rules that predict the occurrence of an item based on occurrences of other items.
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
Association Rule Discovery: Applications
• Marketing and Sales Promotion:
– Let the rule discovered be {Bagels, …} --> {Potato Chips}
– Potato Chips as consequent => can be used to determine what should be done to boost its sales.
– Bagels in the antecedent => can be used to see which products would be affected if the store discontinues selling bagels.
– Bagels in the antecedent and Potato Chips in the consequent => can be used to see what products should be sold with Bagels to promote the sale of Potato Chips.
Definition: Frequent Itemset
• Itemset
– A collection of one or more items
• Example: {Milk, Bread, Diaper}
– k-itemset: an itemset that contains k items
• Support count (σ)
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2
• Support (s)
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Definition: Association Rule
• Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}
• Rule Evaluation Metrics
– Support (s): the fraction of transactions that contain both X and Y
– Confidence (c): measures how often items in Y appear in transactions that contain X

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example: {Milk, Diaper} → {Beer}
s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67
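The same computation as a small sketch (again with our own helper names) reproduces s = 0.4 and c ≈ 0.67:

```python
# Minimal sketch: support and confidence of {Milk, Diaper} -> {Beer}.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = sigma(X | Y) / len(transactions)  # 2/5 = 0.4
c = sigma(X | Y) / sigma(X)           # 2/3 ~ 0.67
print(s, c)
```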
Association Rule Mining Task
• Given a set of transactions T, the goal of association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold
• Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
• Computationally prohibitive!
• Given d unique items:
– Total number of itemsets = 2^d
– Total number of possible association rules:

R = Σ_{k=1}^{d−1} [ C(d, k) × Σ_{j=1}^{d−k} C(d−k, j) ] = 3^d − 2^(d+1) + 1

If d = 6, R = 602 rules
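A quick sanity check of this count, as a sketch: enumerate the LHS/RHS splits directly and compare against the closed form 3^d − 2^(d+1) + 1:

```python
from math import comb

def num_rules(d):
    # choose k of the d items for the LHS, then j of the remaining d-k for the RHS
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

print(num_rules(6))       # 602
print(3**6 - 2**7 + 1)    # 602, via the closed form
```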
Mining Association Rules: Decoupling
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)
Observations:
• All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but
can have different confidence
• Thus, we may decouple the support and confidence requirements
Mining Association Rules
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
• Frequent itemset generation is still computationally expensive
Frequent Itemset Generation
• Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database

Transactions (N):
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

– Match each transaction (of width w) against every candidate in the list of M candidates
– Complexity ~ O(NMw) => expensive, since M = 2^d !!!
Frequent Itemset Generation Strategies
• Reduce the number of candidates (M)
– Complete search: M = 2^d
– Use pruning techniques to reduce M
• Reduce the number of transactions (N)
– Reduce the size of N as the size of the itemset increases
– Use a subsample of N transactions
• Reduce the number of comparisons (NM)
– Use efficient data structures to store the candidates or transactions
– No need to match every candidate against every transaction
Reducing Number of Candidates: Apriori
• Apriori principle:
– If an itemset is frequent, then all of its subsets must also be frequent
• The Apriori principle holds due to the following property of the support measure:

∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)

– The support of an itemset never exceeds the support of its subsets
– This is known as the anti-monotone property of support
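The property is easy to sanity-check on the running example; the sketch below compares each itemset only with its immediate subsets, which suffices by transitivity:

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def s(itemset):  # support as a fraction
    return sum(1 for t in transactions if set(itemset) <= t) / len(transactions)

items = ["Bread", "Coke", "Milk", "Beer", "Diaper", "Eggs"]
for r in range(2, len(items) + 1):
    for Y in combinations(items, r):
        # every immediate subset X of Y must have s(X) >= s(Y)
        assert all(s(X) >= s(Y) for X in combinations(Y, r - 1))
print("s(X) >= s(Y) holds for every X ⊆ Y on this dataset")
```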
Illustrating Apriori Principle
[Itemset lattice over {A, B, C, D, E}, from null to ABCDE: once AB is found to be infrequent, all of its supersets (ABC, ABD, ABE, ABCD, ABCE, ABDE, ABCDE) are pruned.]
Illustrating Apriori Principle
Minimum Support = 3

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets) (no need to generate candidates involving Coke or Eggs):
Itemset         Count
{Bread,Milk}    3
{Bread,Beer}    2
{Bread,Diaper}  3
{Milk,Beer}     2
{Milk,Diaper}   3
{Beer,Diaper}   3

Triplets (3-itemsets):
Itemset              Count
{Bread,Milk,Diaper}  3

If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41
With support-based pruning: 6 + 6 + 1 = 13
Apriori Algorithm
• Method:
– Let k = 1
– Generate frequent itemsets of length 1
– Repeat until no new frequent itemsets are identified:
• Generate length (k+1) candidate itemsets from length-k frequent itemsets
• Prune candidate itemsets containing infrequent subsets of length k
• Count the support of each candidate by scanning the DB
• Eliminate infrequent candidates, leaving only those that are frequent
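As a concrete reference, here is a compact runnable sketch of this loop (our own code; the candidate-generation step joins frequent k-itemsets whose union has k+1 items, one common implementation choice the slide leaves open):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {frequent itemset: support count}; minsup is an absolute count."""
    def count(cands):
        return {c: sum(1 for t in transactions if c <= t) for c in cands}

    items = {frozenset([i]) for t in transactions for i in t}   # k = 1
    frequent = {c: n for c, n in count(items).items() if n >= minsup}
    result, k = dict(frequent), 1
    while frequent:
        # generate length-(k+1) candidates by joining frequent k-itemsets
        cands = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        # prune candidates that contain an infrequent length-k subset
        cands = {c for c in cands
                 if all(frozenset(s) in frequent for s in combinations(c, k))}
        frequent = {c: n for c, n in count(cands).items() if n >= minsup}  # DB scan
        result.update(frequent)
        k += 1
    return result

transactions = [frozenset(t) for t in [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]]
for itemset, n in sorted(apriori(transactions, 3).items(), key=lambda kv: len(kv[0])):
    print(sorted(itemset), n)
```

With minsup = 3 this reproduces the counts on the previous slide, including the single frequent triplet {Bread, Milk, Diaper}.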
Apriori: Reducing Number of Comparisons
• Candidate counting:
– Scan the database of transactions to determine the support of each candidate itemset
– To reduce the number of comparisons, store the candidates in a hash structure
• Instead of matching each transaction against every candidate, match it against candidates contained in the hashed buckets
[Figure: each of the N transactions (the same 5-transaction table as above) is matched against a hash structure with k buckets of candidates.]
Apriori: Implementation Using Hash Tree
Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5},
{3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
We need:
• Hash function
• Max leaf size: max number of itemsets stored in a leaf node
(if number of candidate itemsets exceeds max leaf size, split the node)
Hash function: h(i) = i mod 3, i.e. buckets {1,4,7}, {2,5,8}, {3,6,9}

[Hash tree figure: the 15 candidates are distributed over leaves by hashing the 1st, 2nd, and 3rd item at successive depths, e.g. leaves {2 3 4}{5 6 7}; {1 4 5}{1 3 6}{3 4 5}; {1 2 4}{4 5 7}{1 2 5}{4 5 8}{1 5 9}; {3 5 6}{3 5 7}{6 8 9}{3 6 7}{3 6 8}.]
Apriori: Implementation Using Hash Tree
Transaction: 1 2 3 5 6

[Figure: enumerating the transaction's items level by level (1+ 2356, 2+ 356, 3+ 56; then 12+ 356, 13+ 56, 15+ 6; ...) routes it to only the leaves that can contain its 3-subsets.]

The transaction is matched against only 11 of the 15 candidates.
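The tree structure itself is more elaborate; the sketch below captures the idea with a flat dictionary keyed by the hash path of a candidate's three items (our simplification of the level-by-level tree), so a transaction is compared only against candidates in buckets its own 3-subsets can reach:

```python
from itertools import combinations
from collections import defaultdict

candidates = [(1,4,5), (1,2,4), (4,5,7), (1,2,5), (4,5,8), (1,5,9), (1,3,6),
              (2,3,4), (5,6,7), (3,4,5), (3,5,6), (3,5,7), (6,8,9), (3,6,7), (3,6,8)]

h = lambda i: i % 3  # the slide's hash: buckets {1,4,7}, {2,5,8}, {3,6,9}

buckets = defaultdict(list)  # bucket each sorted candidate by its hash path
for c in candidates:
    buckets[tuple(h(i) for i in c)].append(c)

def supported(transaction):
    """Return the candidates contained in the transaction, probing only
    the buckets reachable from its 3-subsets."""
    found, probed = [], set()
    for sub in combinations(sorted(transaction), 3):
        key = tuple(h(i) for i in sub)
        probed.add(key)
        if sub in buckets[key]:
            found.append(sub)
    return found, probed

found, probed = supported({1, 2, 3, 5, 6})
print(found)        # [(1, 2, 5), (1, 3, 6), (3, 5, 6)]
print(len(probed))  # buckets probed, instead of scanning all 15 candidates
```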
Apriori: Alternative Search Methods
• Traversal of Itemset Lattice
– General-to-specific vs. specific-to-general

[Figure: the frequent itemset border in the lattice from null to {a1,a2,...,an} under (a) general-to-specific, (b) specific-to-general, and (c) bidirectional traversal.]
Apriori: Alternative Search Methods
• Traversal of Itemset Lattice
– Breadth-first vs. depth-first

[Figure: (a) breadth-first and (b) depth-first traversal of the lattice.]
Bottlenecks of Apriori
• Candidate generation can result in huge candidate sets:
– 10^4 frequent 1-itemsets will generate roughly 10^7 candidate 2-itemsets
– To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates.
• Multiple scans of the database:
– Needs (n + 1) scans, where n is the length of the longest pattern
Rule Generation
• Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement
– If {A,B,C,D} is a frequent itemset, the candidate rules are:
ABC → D, ABD → C, ACD → B, BCD → A,
A → BCD, B → ACD, C → ABD, D → ABC,
AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB
• If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L)
Rule Generation
• How to efficiently generate rules from frequent itemsets?
– In general, confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D)
– But the confidence of rules generated from the same itemset does have an anti-monotone property
– E.g., for L = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
• Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule
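Putting the two rule-generation slides together, here is a sketch (our own helper names; it assumes the support counts σ from an Apriori pass are available) that grows RHSs level by level and extends only confident ones, per the anti-monotone property above:

```python
def gen_rules(L, sigma, minconf):
    """All rules f -> L - f from frequent itemset L with conf >= minconf.
    sigma maps frozenset -> support count (for L and all its subsets).
    RHSs grow level by level; only confident RHSs are extended, using
    the anti-monotonicity of confidence w.r.t. the RHS."""
    L = frozenset(L)
    rules, rhs_level = [], [frozenset([i]) for i in L]
    while rhs_level and len(rhs_level[0]) < len(L):
        next_level = set()
        for rhs in rhs_level:
            conf = sigma[L] / sigma[L - rhs]
            if conf >= minconf:
                rules.append((L - rhs, rhs, conf))
                next_level.update(rhs | {i} for i in L - rhs)
        rhs_level = [r for r in next_level if len(r) < len(L)]
    return rules

# Support counts for {Milk, Diaper, Beer} and its subsets (from the table above):
sigma = {frozenset(k): v for k, v in [
    (("Milk", "Diaper", "Beer"), 2), (("Milk", "Diaper"), 3),
    (("Milk", "Beer"), 2), (("Diaper", "Beer"), 3),
    (("Milk",), 4), (("Diaper",), 4), (("Beer",), 3)]}

for lhs, rhs, c in gen_rules({"Milk", "Diaper", "Beer"}, sigma, 0.6):
    print(sorted(lhs), "->", sorted(rhs), round(c, 2))
```

With minconf = 0.6 this recovers exactly the four high-confidence rules from the decoupling slide (c = 0.67, 1.0, 0.67, 0.67).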
Association rules in graph data
Association in social networks

Traditional association rules: X ⇒ Y, defined over itemsets.

In a social network we instead want rules like:

"if customers x and x′ are friends living in the same city c, there are at least 3 French restaurants in c that x and x′ both like, and if x′ visits a newly opened French restaurant y in c, then x may also visit y."

[Pattern figure: friends x and x′ live in the same city c; both like at least 3 French restaurants in c; x′ visits a newly opened French restaurant y.]

Such a rule identifies customers for a new French restaurant: this is association with topological constraints, beyond conventional association rules over item sets.
Association via graph patterns

[Pattern figures: (a) person x linked to an article on Ecuador and to a "Shakira" album: identify customers for a released album; (b) account x1 linked to a "fake" account x2 through blogs with keywords such as "claim a prize": detect fake accounts.]

Question 1: How to define association rules via graph patterns?
Question 2: How to discover interesting rules?
Question 3: How to use the rules to identify customers?

Such rules are more involved than rules for itemsets!
Graph Pattern Association Rules (GPARs)

A graph-pattern association rule (GPAR) R(x, y) has the form
R(x, y): Q(x, y) ⇒ q(x, y)
• Q(x, y): a graph pattern, where x and y are two designated nodes
• q(x, y): an edge labeled q from x to y (a predicate)
• Q and q are the antecedent and consequent of R, respectively

[Pattern figure: friends x and x′ in the same city, both liking at least 3 French restaurants there, with x′ liking a French restaurant ⇒ like(x, French restaurant).]

R(x, French restaurant): Q(x, French restaurant) ⇒ like(x, French restaurant)
Graph Pattern Association Rules (GPARs)

R(x, y): Q(x, y) ⇒ q(x, y)

If there exists a match h that identifies vx and vy as matches of the designated nodes x and y in Q, respectively, then the consequent q(vx, vy) will likely hold.

"if x and x′ are friends living in city c, there are at least 3 French restaurants in c that x and x′ both like, and if x′ visits a newly opened French restaurant y in c, then x may also visit y."

[Pattern figure: the antecedent Q(x, French restaurant) and consequent like(x, French restaurant), as on the previous slide.]
Support and Confidence

Support of R(x, y): Q(x, y) ⇒ q(x, y). How should it be counted: as the number of isomorphic subgraphs in a single graph?

[Figure: matches u1, u2, u3 of x living in New York (city), each liking at least 3 French restaurants, including Le Bernardin and Per Se.]
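To make the question concrete, here is a hedged sketch that counts pattern matches with networkx's subgraph-isomorphism matcher on a toy labeled graph; the GPAR work defines support more carefully (e.g., over distinct matches of the designated node x), so this is only an illustration:

```python
import networkx as nx
from networkx.algorithms import isomorphism

# Toy data graph: two people in New York, both liking two French restaurants.
G = nx.Graph()
G.add_nodes_from([("u1", {"label": "person"}), ("u2", {"label": "person"}),
                  ("ny",  {"label": "city"}),
                  ("leb", {"label": "french_restaurant"}),   # Le Bernardin
                  ("ps",  {"label": "french_restaurant"})])  # Per Se
G.add_edges_from([("u1", "ny"), ("u2", "ny"),
                  ("u1", "leb"), ("u1", "ps"), ("u2", "leb"), ("u2", "ps")])

# Antecedent pattern Q(x, y): person x lives in a city and likes restaurant y.
Q = nx.Graph()
Q.add_nodes_from([("x", {"label": "person"}), ("c", {"label": "city"}),
                  ("y", {"label": "french_restaurant"})])
Q.add_edges_from([("x", "c"), ("x", "y")])

gm = isomorphism.GraphMatcher(
    G, Q, node_match=isomorphism.categorical_node_match("label", None))

pairs = set()
for m in gm.subgraph_isomorphisms_iter():  # m maps G-nodes to Q-nodes
    inv = {v: k for k, v in m.items()}     # Q-node -> G-node
    pairs.add((inv["x"], inv["y"]))

print(len(pairs), "matches of (x, y):", sorted(pairs))
print("distinct matches of x:", {vx for vx, _ in pairs})
```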
Support and Confidence

Confidence of R(x, y): measured under a local closed world assumption.
– Candidates: nodes x that have at least one edge of type q but are not matches for q(x, y).

[Figure: v, who likes a "Shakira" album, is a positive example; v′, who likes only an "MJ" album, is a negative example; v′′, whose hobby is pop music, is unknown.]
Discover GPARs

Mining GPARs for a particular event often leads to the same group of entities. Diversified GPARs address this with a bi-criteria diversification function F(S) that combines rule confidence with a difference function over the entities the rules match.

The diversified mining problem: Given a graph G, a predicate q(x, y), a support bound α, and integers k and d, find a set S of k nontrivial GPARs pertaining to q(x, y) such that (a) F(S) is maximized; and (b) for each GPAR R ∈ S, supp(R, G) ≥ α and r(PR, x) ≤ d.

[Figure: diversified GPARs over French and Asian restaurants (Le Bernardin, Per Se, Patina) in New York and LA.]
Parallel GPAR Discovery

A parallel discovery algorithm:
• A coordinator Sc divides G into n−1 fragments, each assigned to a processor Si
• GPARs are discovered in parallel by bulk synchronous processing in d rounds:
1. Sc posts a set Mi of GPARs to each processor (distribute Mi)
2. Each processor generates GPARs locally by extending those in Mi (locally expand Mi)
3. New GPARs are collected and assembled by Sc into Mi+1 in the barrier synchronization phase; Sc incrementally updates the top-k GPAR set Lk = {R1, R2, …, Rk}

Open questions: Is it parallel scalable? How is the load balanced? What is the communication cost?
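The slide gives only the shape of the algorithm, so the following is a schematic sketch of the BSP round structure; `expand`, `score`, and the string-valued fragments and rules are hypothetical stand-ins, not the actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins: real pattern extension and scoring are not on the slide.
def expand(R, fragment):
    """Worker step: extend GPAR R by one edge against a local fragment."""
    return [R + "+e"]                      # placeholder extension

def score(R):
    return len(R)                          # placeholder for support/conf/diversity

def discover(fragments, seed_rules, k, rounds):
    """Coordinator Sc: d BSP rounds of distribute / expand / synchronize."""
    top_k, M = [], list(seed_rules)
    with ThreadPoolExecutor(max_workers=len(fragments)) as pool:
        for _ in range(rounds):
            # 1. distribute M_i; 2. each worker expands M_i on its fragment
            futures = [pool.submit(lambda f=f, M=M: [R2 for R in M
                                                     for R2 in expand(R, f)])
                       for f in fragments]
            # 3. barrier synchronization: assemble M_{i+1}, update top-k L_k
            M = [R for fut in futures for R in fut.result()]
            top_k = sorted(set(top_k) | set(M), key=score, reverse=True)[:k]
    return top_k

print(discover(fragments=["F1", "F2", "F3"], seed_rules=["Q0"], k=2, rounds=2))
```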
Identifying entities using GPARs
Given a set of GPARs Σ pertaining to the same q(x, y), a graph G, and a confidence threshold η, the set of entities identified by Σ is the set of nodes vx that match the designated node x in the antecedent Q of some R ∈ Σ whose confidence on G is at least η.
Entity Identification algorithm
Scalability of discovery algorithm
On average 3.2 times faster