Transcript slides

CS 590M Fall 2001: Security Issues in Data Mining
Lecture 5: Association Rules, Sequential Associations
Why Association Rules?
• Understand attributes, not entities
• Discover relationships that
– Show some dependency between attributes
– Are “interesting”
• Give an understanding of the data space
Formal Definition
• Data:
– Items={i1,…,in}
– Transactions T={t1,…,tm} where ti = {ij1, …, ijk}
• Support: Given A ⊆ I,
supp(A) = |{t ∈ T | t ⊇ A}| / |T|
• Goal: Find rules A ⇒ B with support ≥ s and
confidence ≥ c where:
– A, B ⊆ I, A ∩ B = ∅
– s = supp(A ∪ B), c = supp(A ∪ B) / supp(A)
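The definitions above read back directly as code. A minimal sketch in Python; the helper name `supp` and the grocery transactions are illustrative, not from the slides:

```python
# Hypothetical transactions; each transaction is a set of items.
transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk"}]

def supp(itemset, T):
    """supp(A) = |{t in T | t contains A}| / |T|."""
    return sum(1 for t in T if itemset <= t) / len(T)

A, B = {"bread"}, {"milk"}
s = supp(A | B, transactions)                          # support of A => B: 0.5
c = supp(A | B, transactions) / supp(A, transactions)  # confidence: 2/3
```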
Sample: Market Basket

     Hardware  Auto  Clothing  Furnishings  Paper goods  Grocery
t0      -       -       +           -            +          -
t1      +       +       -           -            -          -
t2      +       -       -           -            +          -
t3      -       -       +           +            +          -
t4      +       +       -           -            +          -
t5      -       -       -           -            +          +
t6      -       -       +           +            -          -
t7      +       +       -           -            +          -
t8      -       -       +           -            +          +
t9      -       -       -           -            +          -

(Columns: the items I. Rows: the transactions T.)
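The sample table can be sanity-checked in code. Below, each row is encoded as the set of its “+” columns; the supports computed are for illustration:

```python
# The sample market-basket table, one set per transaction ("+" entries only).
T = [
    {"Clothing", "Paper goods"},                 # t0
    {"Hardware", "Auto"},                        # t1
    {"Hardware", "Paper goods"},                 # t2
    {"Clothing", "Furnishings", "Paper goods"},  # t3
    {"Hardware", "Auto", "Paper goods"},         # t4
    {"Paper goods", "Grocery"},                  # t5
    {"Clothing", "Furnishings"},                 # t6
    {"Hardware", "Auto", "Paper goods"},         # t7
    {"Clothing", "Paper goods", "Grocery"},      # t8
    {"Paper goods"},                             # t9
]

def supp(A):
    return sum(1 for t in T if A <= t) / len(T)

s_paper = supp({"Paper goods"})                   # 8 of 10 rows: 0.8
s_hw_auto = supp({"Hardware", "Auto"})            # rows t1, t4, t7: 0.3
conf = s_hw_auto / supp({"Auto"})                 # Auto => Hardware: 1.0
```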
Types of associations
• Machine-learning based: classification /
decision rules
– Entities independent, unordered
– Find rules leading to target class
– To get rule sets, re-run for all classes as targets
• Market-basket
– Collection of related entities with same key
– Can be modeled as independent entities, sparse data
• Sequential
– Like market basket, but group by distance rather
than same key
Historical Association Rule
Learning
• Decision tree converted to rules
– ID3, as discussed in previous lecture
• Direct production of decision rules
– CN2, others
• Problem: Algorithms don’t scale well to
many practical problems
Database community contribution:
Market Basket Association Rules
• Rakesh Agrawal, Tomasz Imielinski, and Arun
Swami. Mining association rules between sets
of items in large databases. In Proc. of the
ACM SIGMOD Conference on Management of
Data, pages 207--216, Washington, D.C., May
1993.
• Rakesh Agrawal, Ramakrishnan Srikant: Fast
Algorithms for Mining Association Rules in
Large Databases. VLDB 1994: 487-499
Database community contribution:
Market Basket Association Rules
• Practical problems often have sparse data
– Many attributes, few items per transaction
• Goal is typically search for high support
– High support = broad impact
– High confidence not crucial (as opposed to
classification)
• Very Large data sets
(main-memory algorithms impractical)
A-Priori Algorithm
• Observation: if A has support s, then
– ∀ i ∈ A, supp(i) ≥ s
• Gives bottom-up algorithm
– Find single items with support ≥ s
– Just look at transaction subsets with those
items for pairs
– Recurse
A-Priori Algorithm
• First, generate all large itemsets
– Sets X ⊆ I such that supp(X) ≥ s (threshold)
– Captures “supp(A ∪ B) ≥ s” part of problem
• Second, find high-confidence rules from each X
– B ⊂ X, A = X − B
– To find confidence, need supp(A)
But A is itself a large itemset (every subset of a large
itemset is large) – don’t need to go back to the database!
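The rule-generation step can be sketched as follows. The `supports` table stands in for the large-itemset supports already collected during the first phase (the values here are illustrative), so no second database pass is needed:

```python
from itertools import combinations

# Hypothetical support table kept from the large-itemset phase.
supports = {
    frozenset({"Hardware"}): 0.4,
    frozenset({"Auto"}): 0.3,
    frozenset({"Hardware", "Auto"}): 0.3,
}

def rules_from(X, min_conf):
    """Emit all rules A => B with A union B = X and confidence >= min_conf."""
    X = frozenset(X)
    rules = []
    for r in range(1, len(X)):               # every nonempty proper subset A
        for A in map(frozenset, combinations(X, r)):
            B = X - A
            conf = supports[X] / supports[A]  # supp(A u B) / supp(A)
            if conf >= min_conf:
                rules.append((set(A), set(B), conf))
    return rules

rules = rules_from({"Hardware", "Auto"}, min_conf=0.9)
# Only Auto => Hardware survives (confidence 0.3 / 0.3 = 1.0);
# Hardware => Auto has confidence 0.3 / 0.4 = 0.75 and is dropped.
```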
A-Priori Algorithm
L1 = {large 1-itemsets};
for ( k = 2; Lk-1 ≠ ∅; k++ )
  Ck = select p.i1, p.i2, …, p.ik-1, q.ik-1
       from Lk-1 p, Lk-1 q
       where p.i1 = q.i1, …, p.ik-2 = q.ik-2, p.ik-1 < q.ik-1
  ∀ transactions t ∈ T
    Ct = subset(Ck, t);  // Candidates contained in t
    ∀ candidates c ∈ Ct: c.count++;
  Lk = {c ∈ Ck | c.count ≥ minsup}
Answer = ∪k Lk;
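The pseudocode translates fairly directly to Python. A sketch, with candidate generation done as a set-union join plus subset pruning rather than the SQL-style join, and minsup given as a fraction of |T|:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Levelwise search for all itemsets with support >= minsup."""
    n = len(transactions)
    # L1: large 1-itemsets
    counts = {}
    for t in transactions:
        for i in t:
            c = frozenset([i])
            counts[c] = counts.get(c, 0) + 1
    L = {c: v / n for c, v in counts.items() if v / n >= minsup}
    answer = dict(L)
    k = 2
    while L:
        prev = list(L)
        Ck = set()
        for p in prev:
            for q in prev:
                u = p | q
                # keep u only if every (k-1)-subset is large (the prune step)
                if len(u) == k and all(frozenset(s) in L
                                       for s in combinations(u, k - 1)):
                    Ck.add(u)
        counts = {c: 0 for c in Ck}
        for t in transactions:             # one database pass per level
            for c in Ck:
                if c <= t:
                    counts[c] += 1
        L = {c: v / n for c, v in counts.items() if v / n >= minsup}
        answer.update(L)
        k += 1
    return answer

T = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
large = apriori(T, minsup=0.5)
```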
Frequent episodes for sequential
associations
• Heikki Mannila, Hannu Toivonen, and A.
Inkeri Verkamo: Discovering Frequent
Episodes in Sequences. In First International
Conference on Knowledge Discovery and Data
Mining (KDD'95), 210 - 215, Montreal,
Canada, August 1995. AAAI Press.
• Instead of transaction, items grouped by sliding
window in time
• Same basic idea as A-Priori
Frequent Episodes: Definition
• Event types E
• Event (A, t) where A ∈ E
• Sequence S = ((A1,t1), …, (An,tn))
• Frequent episode F = (Ai, …, Aj) where
– ∃ tl, tm such that t1 ≤ tl < … < tm ≤ tn,
tm − tl ≤ window:
– count( ((Ai,tl), …, (Aj,tm)) ) ≥ support
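A minimal sketch of window-constrained episode counting, assuming a serial (ordered) episode and greedy in-order matching. The function names and event sequence are illustrative, and this simplified occurrence count differs from the windowed frequency measure used in the paper:

```python
def occurs_at(sequence, episode, start, window):
    """Greedily match episode's event types in order, starting at index
    `start`, with all matched events within `window` of the first."""
    t0 = sequence[start][1]
    pos = start
    for ev in episode:
        while pos < len(sequence):
            etype, t = sequence[pos]
            if t - t0 > window:
                return False          # ran past the window
            if etype == ev:
                pos += 1
                break                 # matched this event type, move on
            pos += 1
        else:
            return False              # sequence exhausted
    return True

def episode_count(sequence, episode, window):
    """Count starting positions from which the episode occurs in-window."""
    return sum(occurs_at(sequence, episode, i, window)
               for i in range(len(sequence))
               if sequence[i][0] == episode[0])

# Illustrative event sequence: (event type, timestamp) pairs, sorted by time.
events = [("A", 1), ("B", 2), ("A", 5), ("C", 6), ("B", 9)]
n3 = episode_count(events, ("A", "B"), window=3)  # only A@1 -> B@2 fits
n5 = episode_count(events, ("A", "B"), window=5)  # A@5 -> B@9 now fits too
```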
Applications/Issues in Security
• Frequent episodes in intrusion detection
data
– What does this tell us?
• Preventing the discovery of associations
– Known items to protect
– What if we don’t know what we want to
protect?