Transcript P2PARM

Data Mining Algorithms for
Large-Scale Distributed
Systems
Presenter: Ran Wolff
Joint work with Assaf Schuster
2003
What is Data Mining?
The automatic analysis of large databases
The discovery of previously
unknown patterns
The generation of a model of the
data
Main Data Mining Problems
Association rules (description): "He who does this and that will usually do some other thing too."
Classification (fraud, churn): "These attributes indicate good behavior; those indicate bad behavior."
Clustering (analysis): "There are three types of entities."
Examples – Classification
Customers purchase items in a store
Each transaction is described in terms of a vector of features
The owner of the store tries to predict which transactions are fraudulent
Example: young men who buy small electronics during rush hours
Solution: do not accept checks
Examples – Associations
Amazon tracks user queries
  Suggests to each user additional books he would usually be interested in
Supermarket finds out "people who buy diapers also buy beer"
  Place diapers and beer at opposite ends of the supermarket
Examples – Clustering
Resource location
  Find the best location for k distribution centers
Feature selection
  Find 1000 concepts which summarize a whole dictionary
  Extract the meaning out of a document by replacing each word with the appropriate concept
    "Car" for "auto", etc.
Why Mine Data of LSD Systems?
Data mining is good
It is otherwise difficult to monitor a large-scale distributed (LSD) system: lots of data, spread across the system, impossible to collect
Many interesting phenomena are inherently distributed (e.g., DDoS); it is not enough to just monitor a few nodes
An Example
Peers in the Kazaa network reveal to the system which files they have on their disks, in exchange for access to the files of their peers
The result is a database of the recreational preferences of 2M peers
Mining it, you could discover that Matrix fans are also keen on Radiohead songs
  Promote Radiohead performances in Matrix Reloaded
  Ask Radiohead to write the music for Matrix IV
What is so special about this problem?
Huge systems – huge amounts of data
Dynamic setting
  System – nodes join / depart
  Data – constant updates
Ad-hoc solution
Fast convergence
Our Work
We developed an association rule mining algorithm that works well in LSD systems
  Local, and therefore scalable
  Asynchronous, and therefore fast
  Dynamic, and therefore robust
  Accurate – not approximate
  Anytime – you get early results fast
In a Nutshell
A distributed data mining algorithm can be described as a series of distributed decisions
Those decisions are reduced to a majority vote
We developed a majority voting protocol which has all those good qualities
The outcome is an LSD association rule mining algorithm (still to come: classification)
Problem Definition – Association Rule Mining (ARM)
I = {i1, i2, ..., im} – the set of items
X ⊆ I – an itemset
T ⊆ I – a transaction
DB = {T1, T2, ..., Tk} – the database
Support(X, DB) = |{T ∈ DB : X ⊆ T}|
Freq(X, DB) = Support(X, DB) / |DB|
Conf(X → Y, DB) = Freq(X ∪ Y, DB) / Freq(X, DB)
Solution to Traditional ARM
Let 0 < MinFreq < 1 and 0 < MinConf < 1
R(DB) = { X → Y : X ∩ Y = ∅, Freq(X ∪ Y, DB) ≥ MinFreq, and Conf(X → Y, DB) ≥ MinConf }
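A brute-force, centralized computation of R(DB) under these thresholds can be sketched as follows; it is only meant to pin down the definition, not to be efficient (the function name is mine, not the authors'):

```python
from itertools import combinations

def mine_rules(DB, min_freq, min_conf):
    """Brute-force R(DB): every rule X -> Y over disjoint, nonempty
    itemsets whose joint frequency and confidence clear the thresholds."""
    items = sorted(set().union(*DB))
    n = len(DB)
    freq = lambda Z: sum(1 for T in DB if Z <= T) / n
    rules = []
    for k in range(2, len(items) + 1):
        for Z in map(frozenset, combinations(items, k)):
            if freq(Z) < min_freq:
                continue  # X union Y must itself be frequent
            for j in range(1, k):  # split Z into antecedent / consequent
                for X in map(frozenset, combinations(sorted(Z), j)):
                    Y = Z - X
                    if freq(Z) / freq(X) >= min_conf:  # Conf(X -> Y)
                        rules.append((X, Y))
    return rules

DB = [frozenset(T) for T in (
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
)]
for X, Y in mine_rules(DB, min_freq=0.5, min_conf=0.6):
    print(set(X), "->", set(Y))
```

On this toy database only {diapers, beer} is frequent at MinFreq = 0.5, yielding the two rules diapers → beer and beer → diapers.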
Large-Scale Distributed ARM
DB = { DB_t^u : u, t } – each node u holds a partial database DB_t^u at time t
⟨u⟩_t = { v ∈ V : v is reachable from u at time t }
R_t^u = R( ∪_{v ∈ ⟨u⟩_t} DB_t^v ) – the correct result for u at time t
Solution of LSD-ARM
No termination – an anytime solution
R̃_t^u – the ad-hoc set of rules output by node u at time t
Recall = |R̃_t^u ∩ R_t^u| / |R_t^u|
Precision = |R̃_t^u ∩ R_t^u| / |R̃_t^u|
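Recall and precision of the ad-hoc result against the correct rule set reduce to set arithmetic; a minimal sketch (the string rule representation is my assumption):

```python
def recall_precision(ad_hoc, true_rules):
    """Recall = |ad_hoc & true| / |true|; Precision = |ad_hoc & true| / |ad_hoc|."""
    hit = len(ad_hoc & true_rules)
    return hit / len(true_rules), hit / len(ad_hoc)

true_rules = {"a->b", "b->c", "c->d", "d->e"}
ad_hoc = {"a->b", "b->c", "x->y"}  # one false rule, two missed rules
recall, precision = recall_precision(ad_hoc, true_rules)
print(recall, precision)  # 0.5 0.6666666666666666
```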
Majority Vote in LSD Systems
An unknown number of nodes vote 0 or 1
  Nodes may dynamically change their vote
  Edges are dynamically added / removed
An infrastructure
  detects failures
  ensures message integrity
  maintains a communication forest
Each node should decide whether the global majority is 0 or 1
Majority Vote in LSD Systems – cont.
Because of the dynamic setting, the algorithm never terminates
Instead, we measure the percentage of correct outputs
In static periods that percentage ought to converge to 100%
In stationary periods we will show it converges to a different percentage
  Assume the overall percentage of ones remains the same, but individual votes are constantly switched
LSD-Majority Algorithm
Nodes communicate by exchanging messages ⟨s, c⟩
Node u maintains:
  s^u – its vote; c^u – one (for now)
  ⟨s^uv, c^uv⟩ – the last ⟨s, c⟩ it had sent to v
  ⟨s^vu, c^vu⟩ – the last ⟨s, c⟩ it had received from v
LSD-Majority – cont.
Node u calculates:
Δ^u = (s^u + Σ_{vu∈E^u} s^vu) − λ (c^u + Σ_{vu∈E^u} c^vu)
  Captures the current knowledge of u
Δ^uv = (s^vu + s^uv) − λ (c^vu + c^uv)
  Captures the current agreement between u and v
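A sketch of these two quantities in Python (the dictionary layout and names are my assumption; lam stands for the majority ratio λ, which is 1/2 for a plain majority vote):

```python
# Sketch of the node state: s_u, c_u are u's own vote and count,
# recv[v] and sent[v] hold the last <s, c> exchanged with neighbor v.

def delta_u(s_u, c_u, recv, lam):
    """Current knowledge of u: its own vote plus everything received."""
    s = s_u + sum(sv for sv, cv in recv.values())
    c = c_u + sum(cv for sv, cv in recv.values())
    return s - lam * c

def delta_uv(sent, recv, v, lam):
    """Current agreement between u and v: what the two told each other."""
    s = recv[v][0] + sent[v][0]
    c = recv[v][1] + sent[v][1]
    return s - lam * c

recv = {"v": (2, 3)}  # v reported 2 ones out of 3 votes
sent = {"v": (1, 1)}  # u reported 1 one out of 1 vote
print(delta_u(1, 1, recv, 0.5))        # (1+2) - 0.5*(1+3) = 1.0
print(delta_uv(sent, recv, "v", 0.5))  # (2+1) - 0.5*(3+1) = 1.0
```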
LSD-Majority – Rationale
It is OK if the current knowledge of u is more extreme than what it had agreed upon with v
The opposite is not OK
  v might assume u supports its decision more strongly than u actually does
Tie breaking prefers a negative decision
LSD-Majority – The Protocol
If c^vu + c^uv = 0 and Δ^u > 0, or
c^vu + c^uv > 0 and either
  Δ^uv ≤ 0 and Δ^u > Δ^uv, or
  Δ^uv > 0 and Δ^u < Δ^uv,
then send ⟨ s^u + Σ_{wu∈E^u, w≠v} s^wu , c^u + Σ_{wu∈E^u, w≠v} c^wu ⟩ to v
LSD-Majority – The Protocol – cont.
The same decision is applied whenever:
  a message is received
  s^u changes
  an edge fails or recovers
LSD-Majority – Example
LSD-Majority Results
Proof of Correctness
Will be given in class
Back from Majority to ARM
To decide whether an itemset X is frequent:
  set λ = MinFreq
  set s^u = Support(X, DB_t^u)
  set c^u = |DB_t^u|
  run LSD-Majority
Back from Majority to ARM – cont.
To decide whether a rule X → Y is confident:
  set λ = MinConf
  set s^u = Support(X ∪ Y, DB_t^u)
  set c^u = Support(X, DB_t^u)
  run LSD-Majority
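Both reductions only set the local inputs of the majority vote; a sketch of how a node u would derive them from its partial database (function names are illustrative, not the authors'):

```python
# How node u derives its majority-vote inputs (lambda, s^u, c^u)
# from its local partial database DB_u. Names are illustrative.

def inputs_for_frequency(X, DB_u, min_freq):
    """Vote 'is itemset X globally frequent?'."""
    s = sum(1 for T in DB_u if X <= T)  # local support of X
    return min_freq, s, len(DB_u)       # lambda, s^u, c^u

def inputs_for_confidence(X, Y, DB_u, min_conf):
    """Vote 'is the rule X -> Y globally confident?'."""
    supp = lambda Z: sum(1 for T in DB_u if Z <= T)
    return min_conf, supp(X | Y), supp(X)

DB_u = [frozenset(T) for T in (
    {"diapers", "beer"},
    {"diapers"},
    {"milk"},
)]
X, Y = frozenset({"diapers"}), frozenset({"beer"})
print(inputs_for_frequency(X, DB_u, 0.5))      # (0.5, 2, 3)
print(inputs_for_confidence(X, Y, DB_u, 0.6))  # (0.6, 1, 2)
```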
Additionally
Candidates are created based on the ad-hoc solution
Rules are created on the fly rather than upon termination
Our algorithm outputs the correct rules without specifying their global frequency and confidence
Eventual Results
By the time the database is scanned once, in parallel, the average
node has discovered 95% of the rules, and has less than 10% false
rules.