Transcript P2PARM
Data Mining Algorithms for
Large-Scale Distributed
Systems
Presenter: Ran Wolff
Joint work with Assaf Schuster
2003
What is Data Mining?
The automatic analysis of large
databases
The discovery of previously
unknown patterns
The generation of a model of the
data
Main Data Mining Problems
Association rules (description): "He who does this and that will usually do some other thing too."
Classification (fraud, churn): "These attributes indicate good behavior – those indicate bad behavior."
Clustering (analysis): "There are three types of entities."
Examples – Classification
Customers purchase items in a store
Each transaction is described in terms
of a vector of features
The owner of the store tries to predict
which transactions are fraudulent
Example: young men who buy small
electronics during rush hours
Solution: do not accept checks
Examples – Associations
Amazon tracks user queries
Suggests to each user additional books he
would usually be interested in
Supermarket finds out “people who buy
diapers also buy beer”
Place diapers and beer at opposite sides of
the supermarket
Examples – Clustering
Resource location
Find the best location for k distribution
centers
Feature selection
Find 1000 concepts which summarize a
whole dictionary
Extract the meaning out of a document by
replacing each word with the appropriate
concept
"Car" for "auto", etc.
Why Mine Data of LSD Systems?
Data mining is good
It is otherwise difficult to monitor an LSD
system: lots of data, spread across the system,
impossible to collect
Many interesting phenomena are inherently
distributed (e.g., DDoS), it is not enough to just
monitor a few nodes
An Example
Peers in the Kazaa network reveal to the
system which files they have on their disks, in
exchange for access to the files of their peers
The result is a database of the recreational
preferences of 2M peers
Mining it, you could discover that Matrix fans
are also keen on Radiohead songs
Promote Radiohead performances in Matrix-Reloaded
Ask Radiohead to write the music for Matrix-IV
What is so special about this
problem?
Huge systems – Huge amounts of data
Dynamic setting
System – join / depart
Data – constant update
Ad-hoc solution
Fast convergence
Our Work
We developed an association rule
mining algorithm that works well in LSD
Systems
Local and therefore scalable
Asynchronous and therefore fast
Dynamic and therefore robust
Accurate – not an approximation
Anytime – you get early results fast
In a Teaspoon
A distributed data mining algorithm can be
described as a series of distributed decisions
Those decisions are reduced to a majority
vote
We developed a majority voting protocol
which has all those good qualities
The outcome is an LSD association rule
mining algorithm (still to come: classification)
Problem Definition –
Association Rule Mining (ARM)
I = {i_1, i_2, ..., i_m} – the set of items
X ⊆ I – an itemset
T ⊆ I – a transaction
DB = {T_1, T_2, ..., T_k} – a database of transactions
Support(X, DB) = |{T ∈ DB : X ⊆ T}|
Freq(X, DB) = Support(X, DB) / |DB|
Conf(X ⇒ Y, DB) = Freq(X ∪ Y, DB) / Freq(X, DB)
Solution to Traditional ARM
Let 0 < MinFreq < 1 and 0 < MinConf < 1
A rule is X ⇒ Y, where X, Y ⊆ I are disjoint itemsets
R_DB = {X ⇒ Y : Freq(X ∪ Y, DB) ≥ MinFreq and Conf(X ⇒ Y, DB) ≥ MinConf}
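As an illustration only (this is not the distributed algorithm of this talk), the following minimal Python sketch computes Support, Freq, and Conf on a toy transaction database and enumerates R_DB by brute force; the database contents and the mine_rules helper are invented for the example.

```python
from itertools import combinations

# Toy transaction database: each transaction is a set of items.
DB = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "bread"},
]

def support(X, DB):
    """Number of transactions that contain the itemset X."""
    return sum(1 for T in DB if X <= T)

def freq(X, DB):
    """Fraction of transactions that contain X."""
    return support(X, DB) / len(DB)

def conf(X, Y, DB):
    """Confidence of the rule X => Y."""
    return freq(X | Y, DB) / freq(X, DB)

def mine_rules(DB, min_freq, min_conf):
    """Brute-force R_DB: every rule X => Y with Freq(X u Y) >= MinFreq and Conf >= MinConf."""
    items = set().union(*DB)
    rules = []
    # Enumerate every sufficiently frequent itemset Z, then every split of Z into X and Y.
    for size in range(2, len(items) + 1):
        for Z in map(set, combinations(items, size)):
            if freq(Z, DB) < min_freq:
                continue
            for k in range(1, len(Z)):
                for X in map(set, combinations(Z, k)):
                    Y = Z - X
                    if conf(X, Y, DB) >= min_conf:
                        rules.append((X, Y))
    return rules

print(mine_rules(DB, min_freq=0.5, min_conf=0.6))
# reports rules such as ({'diapers'}, {'beer'}) that clear both thresholds
```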
Large-Scale Distributed ARM
DB(t) = ∪{DB_u(t) : u ∈ V} – each node u holds its own partial database DB_u(t) at time t
G_u(t) = {v ∈ V : v is reachable from u at time t}
R_u(t) = R(∪{DB_v(t) : v ∈ G_u(t)})
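To make the definition of R_u(t) concrete, here is a small hypothetical sketch: the node names, edge lists and local databases are invented, and the reachable set G_u(t) is computed with a plain BFS. The union it builds is the database whose rules define node u's exact result (e.g., mined with a sketch like the one above).

```python
from collections import deque

# Hypothetical snapshot at time t: each node holds its own partial database,
# and only some communication edges are currently up.
local_db = {
    "u": [{"matrix", "radiohead"}, {"matrix"}],
    "v": [{"matrix", "radiohead"}],
    "w": [{"radiohead"}],          # w is currently unreachable from u
}
edges = {"u": ["v"], "v": ["u"], "w": []}

def reachable(u, edges):
    """G_u(t): all nodes reachable from u over the edges that are up at time t."""
    seen, queue = {u}, deque([u])
    while queue:
        x = queue.popleft()
        for y in edges[x]:
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return seen

# The 'correct' database for u at time t is the union over the reachable nodes;
# R_u(t) is the rule set of this union.
db_u_t = [T for v in reachable("u", edges) for T in local_db[v]]
print(len(db_u_t))  # 3 transactions: w's data is not part of u's target result
```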
Solution of LSD-ARM
No termination
Anytime solution
R̃_u(t) = {X ⇒ Y : X ⇒ Y is currently in the output of u}
Recall = |R̃_u(t) ∩ R_u(t)| / |R_u(t)|
Precision = |R̃_u(t) ∩ R_u(t)| / |R̃_u(t)|
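A tiny sketch of how recall and precision would be measured at one node, using made-up rule sets (rules are written as pairs only for brevity):

```python
# Hypothetical snapshot: R_correct is R_u(t), R_reported is the node's current output.
R_correct  = {("diapers", "beer"), ("matrix", "radiohead"), ("bread", "milk")}
R_reported = {("diapers", "beer"), ("matrix", "radiohead"), ("beer", "wine")}

hits = R_correct & R_reported
recall    = len(hits) / len(R_correct)    # fraction of the correct rules already found
precision = len(hits) / len(R_reported)   # fraction of reported rules that are correct

print(recall, precision)  # 0.666..., 0.666...
```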
Majority Vote in LSD Systems
Unknown number of nodes vote 0 or 1
Nodes may dynamically change their vote
Edges are dynamically added / removed
An infrastructure
detects failures
ensures message integrity
maintains a communication forest
Each node should decide if the global
majority is of 0 or 1
Majority Vote in LSD Systems
– cont.
Because of the dynamic settings, the
algorithm never terminates
Instead we measure the percent of correct
outputs
In static periods that percent ought to
converge to 100%
In stationary periods we will show it
converges to a different percentage
Assume the overall percentage of ones remains
the same, but they are constantly switched
LSD-Majority Algorithm
Nodes communicate by exchanging
messages <s, c>
Node u maintains:
s_u – its vote, c_u – one (for now)
<s_uv, c_uv> – the last <s, c> it sent to v
<s_vu, c_vu> – the last <s, c> it received
from v
LSD-Majority – cont.
Node u calculates:
Δ_u = s_u + Σ_{v ∈ E_u} s_vu − λ·(c_u + Σ_{v ∈ E_u} c_vu)
Captures the current knowledge of u (λ is the majority threshold, ½ for a simple majority)
Δ_uv = s_vu + s_uv − λ·(c_vu + c_uv)
Captures the current agreement between u
and v
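The following Python sketch shows one possible way to hold this per-node state and evaluate Δ_u and Δ_uv; the class and field names are assumptions made for illustration, and λ is passed in as the majority threshold.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """Illustrative LSD-Majority state for one node u (names are assumptions)."""
    s_u: int                      # u's own vote (0 or 1)
    c_u: int = 1                  # u's own count (one, for now)
    sent: dict = field(default_factory=dict)      # v -> (s_uv, c_uv), last message sent to v
    received: dict = field(default_factory=dict)  # v -> (s_vu, c_vu), last message received from v

    def delta_u(self, lam):
        """Current knowledge of u: Δ_u = s_u + Σ s_vu − λ(c_u + Σ c_vu)."""
        s = self.s_u + sum(s for s, _ in self.received.values())
        c = self.c_u + sum(c for _, c in self.received.values())
        return s - lam * c

    def delta_uv(self, v, lam):
        """Current agreement with neighbor v: Δ_uv = s_vu + s_uv − λ(c_vu + c_uv)."""
        s_uv, c_uv = self.sent.get(v, (0, 0))
        s_vu, c_vu = self.received.get(v, (0, 0))
        return (s_vu + s_uv) - lam * (c_vu + c_uv)

u = Node(s_u=1)
u.received["v"] = (0, 1)          # v has reported one zero-vote so far
print(u.delta_u(0.5))             # 1 - 0.5*2 = 0.0  -> a tie, treated as a negative majority
print(u.delta_uv("v", 0.5))       # 0 - 0.5*1 = -0.5
```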
LSD-Majority – Rational
It is OK if the current knowledge of u is
more extreme than what it had agreed
with v
The opposite is not OK
v might assume u supports its decision
more strongly than u actually does
Tie breaking prefers a negative decision
LSD-Majority – The Protocol
If (c_uv + c_vu = 0 and Δ_u > 0), or
(c_uv + c_vu > 0 and either
  (Δ_uv > 0 and Δ_u < Δ_uv)
  or
  (Δ_uv ≤ 0 and Δ_u > Δ_uv))
then send < s_u + Σ_{w ∈ E_u, w ≠ v} s_wu , c_u + Σ_{w ∈ E_u, w ≠ v} c_wu > to v
LSD-Majority – The Protocol
The same test is applied whenever
a message is received
s_u changes
an edge fails or recovers
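A sketch of the send test as reconstructed from the slide above: should_send and message_to are hypothetical helper names, and the same test would be re-run on each of the events just listed.

```python
def should_send(delta_u, delta_uv, c_uv, c_vu):
    """Send test of LSD-Majority, as reconstructed from the slide.

    u must update v whenever the agreement with v is more extreme than
    u's own knowledge (ties are treated as a negative majority)."""
    if c_uv + c_vu == 0:
        return delta_u > 0             # nothing exchanged yet; speak up only for a positive vote
    if delta_uv > 0:
        return delta_u < delta_uv      # agreed positive, but u is less positive than the agreement
    return delta_u > delta_uv          # agreed non-positive, but u is more positive than the agreement

def message_to(v, s_u, c_u, received):
    """The message u sends v: everything u knows except what v itself reported.

    'received' maps each neighbor w to the last (s_wu, c_wu) received from w."""
    s = s_u + sum(s for w, (s, _) in received.items() if w != v)
    c = c_u + sum(c for w, (_, c) in received.items() if w != v)
    return (s, c)

# Example: u voted 1 and has heard <0, 1> from v and <2, 3> from w.
received = {"v": (0, 1), "w": (2, 3)}
print(message_to("v", 1, 1, received))  # (3, 4): u's own vote plus w's report
```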
LSD-Majority – Example
LSD-Majority Results
Proof of Correctness
Will be given in class
Back from Majority to ARM
To decide whether an itemset X is
frequent or not
set λ = MinFreq
set s_u = Support(X, DB_u(t))
set c_u = |DB_u(t)|
run LSD-Majority
Back from Majority to ARM
To decide whether a rule X ⇒ Y is confident or
not
set λ = MinConf
set s_u = Support(X ∪ Y, DB_u(t))
set c_u = Support(X, DB_u(t))
run LSD-Majority
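A small sketch of these two reductions: the helper names are invented, and the functions only show how λ, s_u and c_u would be initialized at a node before running the majority protocol. The reduction works because Freq(X, DB) ≥ MinFreq is equivalent to Σ s_u − MinFreq · Σ c_u ≥ 0, which is exactly the majority question with threshold λ (and likewise for confidence).

```python
def support(X, db):
    """Local support of itemset X in this node's current database."""
    return sum(1 for T in db if X <= T)

def frequency_vote(X, local_db, min_freq):
    """Inputs LSD-Majority needs to decide whether itemset X is globally frequent."""
    lam = min_freq
    s_u = support(X, local_db)
    c_u = len(local_db)
    return lam, s_u, c_u          # then run the majority protocol with these values

def confidence_vote(X, Y, local_db, min_conf):
    """Inputs LSD-Majority needs to decide whether the rule X => Y is globally confident."""
    lam = min_conf
    s_u = support(X | Y, local_db)
    c_u = support(X, local_db)
    return lam, s_u, c_u

local_db = [{"diapers", "beer"}, {"diapers"}, {"beer"}]
print(frequency_vote({"diapers"}, local_db, 0.5))             # (0.5, 2, 3)
print(confidence_vote({"diapers"}, {"beer"}, local_db, 0.7))  # (0.7, 1, 2)
```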
Additionally
Create candidates based on the ad-hoc
solution
Create rules on-the-fly rather than upon
termination
Our algorithm outputs the correct rules
without specifying their global
frequency and confidence
Eventual Results
By the time the database is scanned once, in parallel, the average
node has discovered 95% of the rules, and has less than 10% false
rules.