Transcript ppt

A distributed method for mining
association rules
Pham Nguyen Anh Huy*
Department of Information Technology
Vietnam National University of HoChiMinh city
presented by Ho Tu Bao
School of Knowledge Science
Japan Advanced Institute of Science and Technology
(*work done during 3 months of the author JSPS’s fellowship in JAIST)
1
Outline
Introduction
Background
A distributed Apriori algorithm using
mobile agents
Experimental evaluation
Conclusion
2
Introduction
Association analysis is a new and attractive research
area in data mining
Apriori algorithm (R. Agrawal, IBM 1993) is a key
technique for association analysis
Though the apriori principle allows us to considerably
reduce the search space, the technique still requires a
huge computation, particularly for large database
This research proposes a distributed version of Apriori
algorithm using mobile agents. The experiments show
that we can reduce computation time when using
computers in a distributed computing environment.
3
Outline
Introduction
Background


Association rules and Apriori algorithm
Mobile agents and Aglets
A distributed Apriori algorithm using
mobile agents
Experimental evaluation
Conclusion
4
Association rules:
Market basket analysis
Analyzes customer buying habits by finding associations
between the different items that customers place in their
“shopping baskets” (in the form X  Y, where X and Y are sets of
items)
How often people
I = {I1=beer, I2=cake, I3=onigiri}
buy onigiri and beer
together?
Transactional database
TID1:
TID2:
TID3:
TID4:
TID5:
{I1, I2, I3}
{I1, I2}
{I2, I3}
{I2}
{I1, I2}
An association rule {I1}  {I3}
5
Rule measures: Support and Confidence
Customer buys both
Customer
buys beer
Customer
buys onigiri
Transaction ID Items Bought
2000
1000
4000
5000
A,B,C
A,C
A,D
B,E,F
 Association rule X
Y
 support s = probability that a
transaction contains
X and Y
 confidence c = conditional
probability that a transaction
having X also contains Y
 A  C (s=50%, c=66.6%)
 C  A (s=50%, c=100%)
6
Association mining: Apriori algorithm
It is composed of two steps:
1.
2.
Find all frequent itemsets: By definition, each of
these itemsets will occur at least as frequently as a
pre-determined minimum support count
Generate strong association rules from the
frequent itemsets: By definition, these rules must
satisfy minimum support and minimum confidence
(Agrawal, R., 1993)
7
Association mining: Apriori principle
Transaction ID
2000
1000
4000
5000
For rule A  C
Items Bought
A,B,C
A,C
A,D
B,E,F
Min. support 50%
Min. confidence 50%
Frequent Itemset Support
{A}
75%
{B}
50%
{C}
50%
{A,C}
50%
support = support({A and C}) = 50%
confidence = support({A and C})/support({A}) = 66.6%
The Apriori principle:
Any subset of a frequent itemset must be frequent
(if an itemset is not frequent, its supersets are not)
8
The Apriori algorithm: Finding frequent
itemsets using candidate generation
1. Find the frequent itemsets: the sets of items that have
support higher than the minimum support

A subset of a frequent itemset must also be a frequent itemset
i.e., if {AB} is a frequent itemset, both {A} and {B} should be a
frequent itemset

Iteratively find frequent itemsets Lk with cardinality from 1 to k
(k-itemset) by from candidate itemsets Ck (Lk  Ck)
C1  …  Li-1  Ci  Li  Ci+1  …  Lk
2. Use the frequent itemsets to generate association rules.
9
Example (min_sup_count = 2)
Transactional data
TID
List of items_IDs
T100
T200
T300
T400
T500
T600
T700
T800
T900
I1,
I2,
I2,
I1,
I1,
I2,
I1,
I1,
I1,
I2,
I4
I3
I2,
I3
I3
I3
I2,
I2,
I5
I4
I3, I5
I3
Scan D for
count of
each
candidate
Compare
candidate
support count
with minimum
support count
C1
L1
Itemset Sup.Count
Itemset Sup.Count
{I1}
{I2}
{I3}
{I4}
{I5}
6
7
6
2
2
{I1}
{I2}
{I3}
{I4}
{I5}
6
7
6
2
2
10
Example (min_sup_count = 2)
C2
Itemset
Generate
{I1, I2}
candidates {I1, I3}
C2 from L1
{I1, I4}
using
{I1, I5}
Apriori
{I2, I3}
principle
{I2, I4}
{I2, I5}
{I3, I4}
{I3, I5}
{I4, I5}
C2
Scan D for
count of
each
candidate
Generate
candidates
C3 from L2
Itemset
using
Apriori
{I1, I2, I3}
principle
{I1, I2, I5}
Compare
candidate
support
count with
minimum
support
count
Itemset S.count
{I1, I2}
4
{I1, I3}
4
{I1, I4}
1
{I1, I5}
2
{I2, I3}
4
{I2, I4}
2
{I2, I5}
2
{I3, I4}
0
{I3, I5}
1
{I4, I5}
0
Scan D for
count of
each
candidate
C3
Itemset
Sc
{I1, I2, I3} 2
{I1, I2, I5} 2
Compare
candidate
support
count with
minimum
support
count
L2
Itemset S.count
{I1, I2}
4
{I1, I3}
4
{I1, I5}
2
{I2, I3}
4
{I2, I4}
2
{I2, I5}
2
L3
Itemset
Sc
{I1, I2, I3} 2
{I1, I2, I5} 2
11
Agents and Mobile agents
An agent is a
computation entity
that:



Acts on behalf of other
entities in autonomous
fashion.
Performs its actions with
some level of
pro-activity and
re-activeness.
Exhibits some level of
the key attributes of
co-operation.
Mobile network agents are
programs that:
 can migrate from system to
system within a network
environment

Performs some processing at
each host
 Agent decides when and where
to move next
 How does it move?



Save state
Transport saved state to next
system
Resume execution of saved state
12
Distributed Computing using
Mobile Programs
13
Mobile agent tools
No.
1
Name
Concordia
Developer
Language
Application
Framework for agent development Mitsubishi E.I.T.
Java
Mobile computing, Data base
Java Class libraries
IBM, Tokyo
Java
Internet
Agent Tcl
Transportable agent system
R. Gray, U Dart.
Tcl Tk
Information management
Odyssey
Set of Java Class libraries
General Magic
Telescript
Electronic commerce
OAA
Open Agent Architecture
SRI International, AI
C, C-Lisp, Java, VB General purpose
Ara
Agent for Remote Action
U Kaiserslautern
C/C++, Tcl, Java
Partially connected c. D.D.B.
Tacoma
Tromso and Cornel Moving Agent Norway & Cornell
C, UNIX-based,
Client/Server model issues / OS support
Voyager
Platform for distributed applic.
ObjectSpace
Java
Support for agent systems
AgentSpace Agent building platform
Ichiro Sato, O. U.
Java
General purpose
Mole
First Java-Based MA system
Stuttgart U. Germany Java, UNIX-based
General purpose
MOA
Mobile Object and Agents
OpenGroup, UK
Java
General purpose
Kali Scheme Distributed impl. of Scheme
NEC Research I.
Scheme
Distributed data mining, load balancing
The Tube
mobile code system
David Halls, UK
Scheme
Remote execution of Scheme
Ajanta
Network mobile object
Minoseta U.
Java
General purpose
Knowbots
Research infrastructure of MA
CNRI
Python
Distributed systems / Internet
AgentSpace Mobile agent framework
Alberto Sylva
Java
Support for dynamic and dist. Appl.
Plangent
Intelligent Agent system
Toshiba Corporation
Java
Intelligent tasks
JATlite
Java Agent framework dev /KQML Standford U.
Java
Information retrievial, Interface agent
Kafka
Multiagent libraries for Java
Fujitsu Lab. Japan
Java UNIX based
General purpose
UCI
C (Messenger-C)
General purpose
2 Aglet
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
Description
Messengers Autonomous messages

14
What are Aglets ?
Aglets (Agile Applets) are Java objects that
can move from one host on the Internet to
another, and perform arbitrary operations
within the security limits.
When an Aglet moves it takes along its
program code as well as its data.
The Aglets framework is implemented by the
Aglets Software Development Kit (ASDK) from
IBM. It is an environment for programming mobile
Internet Agent in Java.
15
Aglets at Runtime
Currently aglets use the Agent Transfer
Protocol (ATP) as a default implementation
of the communication layer (ATP is
modeled after HTTP)
Used on the Tahiti aglet server
Use the Aglets Server Interface to write
application capable of hosting, receiving
and dispatching aglets
16
Outline
Introduction
Background
A distributed Apriori algorithm
using the mobile agents
Experimental evaluation
Conclusion
17
A distributed Apriori algorithm
1 (1) spawn n slave processes; (2) divide database into partitions
(3) distribute partitions to each slave process
2 Master process
Slave processes
1. send global candidate (k-1)-itemsets
Ck-1 to each slave process
2. receive the global candidate
(k-1)-itemsets Ck-1
4. wait and receive local supports,
count global supports for global
candidate (k-1)-itemsets Ck-1
3. count local supports for global
candidate (k-1)-itemsets Ck-1, and
send local supports to the master
process.
5. compute frequent (k-1)-itemsets Lk-1,
and send clusters of frequent (k-1)itemsets Lk-1 to slave processes
8. wait and receive local candidate
k-itemsets from slave processes
9. unionize local candidate k-itemsets
and prune to form global candidate
k-itemsets.
6. receive frequent (k-1)-itemsets
Lk-1 from the master process
7. generate local candidate
kitemsets and send these local
candidate k-itemsets to the
master process
18
A distributed Apriori algorithm
SEND global
candidate
(k-1) itemsets
Ck-1
…
COUNT and SEND
local supports for
global candidate
(k-1)-itemsets
(counting support
Aglets)
COUNT global
supports for
global candidate
(k-1)-itemsets
Ck-1
JOIN and SEND
local candidate
k-itemsets
(Aprio_gen Aglet)
UNIONIZE local
candidate
k-itemsets and
PRUNE to form
global candidate
k-itemsets Ck
e.g.,{AB}
DB1
DB
...
DB2
DBn
master
slaves
2
3
1
…
DB1
DB
8
FIND and SEND
frequent (k-1)itemsets Lk-1
...
DB2
DB
DBn
master
slaves
master
19
Global support count &
Global candidate itemsets
X is a candidate itemset, global support count of X is
n
X k .GSupp   X k .LSuppi
i 1
The set of global candidate k-itemsets GCk formed by
local candidate k-itemsets
n
GCk  i 1 LCk
GLk formed by Apriori-gen with ID segment (p, q) of GLk-1
GLik  Apriori  gen(GLk 1 , p  q)
GLk = {GCk ‫ ׀‬GCk.G-Supp  G-Min-Supp}
20
Outline
Introduction
Background
A distributed Apriori algorithm using
the mobile agents
Experimental evaluation
Conclusion
21
Experiments: Synthetic datasets
Using synthetic datasets of varying sizes:
Name
|D|
|T|
Size (MB)
D100k.T30
100K
30
3M
D100k.T100
100K
100
10M
D320k.T150
320K
150
48M
|D| Number of transactions
|T| Average amount of items on transactions
22
Experiment environment
Software

Database : Oracle server

Language: Java – JDK1.3-Sun

Mobile agents: Aglet- IBM

Protocol traffic: ATP – Aglet Transfer Protocol

Platform: Windows
Hardware

PC Petium3-300 Mhz, RAM 128MB

15 machines (at Knowledge Science Center, JAIST)
23
Execution time (sec.) with different
minimum support thresholds
Name
D100k.T30
D100k.T100
D320k.T150
1 slave
80,860
155,440
329,532
5 slave
42,558
77,720
147,673
10 slaves 15 slaves
22,079
15,838
41,080
28,062
76,432
53,318
40%
Name
D100k.T30
D100k.T100
D320k.T150
1 slave
4,158
30,005
69,011
5 slave
1,980
15,792
28,425
10 slaves 15 slaves
1,149
988
7,978
5,843
18,349
12,854
50%
Name
D100k.T30
D100k.T100
D320k.T150
1 slave
485
27,012
52,322
5 slave
244
13,506
20,259
10 slaves 15 slaves
218
95
7,047
5,062
11,979
9,112
35%
24
Execution time with min_sup 35%
Time (sec.) with min_sup = 35%
350,000
D100k.T30
300,000
D100k.T100
250,000
D320k.T150
350,000
200,000
150,000
300,000
100,000
250,000
50,000
200,000
D100k.T30
D100k.T100
0
1 slave
5 slave
10 slaves
15 slaves
D320k.T150
150,000
100,000
50,000
0
1 slave
5 slave
10 slaves
15 slaves
25
Execution time with min_sup 40%
Time (sec.) with min_sup = 40%
80,000
D100k.T30
70,000
D100k.T100
60,000
D100k.T150
50,000
80,000
70,000
D100k.T30
40,000
60,000
D100k.T100
50,000
D320k.T150
30,000
20,000
10,000
40,000
0
30,000
1 slave
5 slave
10 slaves
15 slaves
20,000
10,000
0
1 slave
5 slave
10 slaves
15 slaves
26
Execution time with min_sup 50%
Time (sec.) with min_sup 50%
60000
D100k.T30
50000
D100k.T100
40000
D320k.T150
Time (sec.) with min_sup 50%
30000
60000
20000
50000
D100k.T30
D100k.T100
10000
40000
0
30000
1 slave
5 slave
10 slaves
15 slaves
D320k.T150
20000
10000
0
1 slave
5 slave
10 slaves
15 slaves
27
Rate of execution time
Nb of slaves minsup 35% minsup 45% minsup 50%
1
1
1
1
5
1.9
2.1
2
10
3.6
3.6
2.2
15
5.1
4.2
5.1
The rate
between
execution time
and number
of slaves is
nearly linear
minsup avg
1
2
3.1
4.8
average rate
6
5
4
3
2
average rate
1
0
1 slave
5 slaves
10 slaves
15 slaves
28
Conclusion
Proposed a distributed apriori algorithm for
mining association rule
Experimental evaluation show that when the
number of slaves increases the execution
time decreases nearly linear
Future work:


Segment both the master and GLk for support
counts
Develop incremental algorithms for association
analysis using the MA technology
29