The Road to a Ph.D. - University of Kentucky

Download Report

Transcript The Road to a Ph.D. - University of Kentucky

Distributed Classification in
Peer-to-Peer Networks
Ping Luo, Hui Xiong, Kevin Lü, Zhongzhi Shi
Institute of Computing Technology, Chinese Academy of Sciences
Presentation by: Satya Bulusu
Overview
Introduction
• Building Local Classifiers
• Distributed Plurality Voting
• Experimental Results
• Related Works
• Summary
03/27/2008
Research Motivation
• Widespread use of P2P networks and sensor networks
• Data to be analyzed are distributed on nodes of these
large-scale dynamic networks
• Traditional distributed data mining algorithms must be
extended to fit this new environment
• Motivating Examples
- P2P anti-spam networks
- Automatic organization of web documents in P2P
environments
• A distributed classification algorithm is critical in these
applications.
03/27/2008
Research Motivation contd…
• New Challenges
- highly decentralized peers, do not have the notion
of clients and servers
- including hundreds or thousands of nodes,
impossible global synchronization
- frequent topology changes caused by frequent
failure and recovery of peers
• Algorithm Requirements
- scalability, decentralized in-network processing
- communication efficient, local synchronism
- fault-tolerance
03/27/2008
Problem Formulation
• Given:
 A connected topology graph G(U , E)
 Each peer u  U owns its local training data for classification
 Local neighborhood change is informed to each peer realtimely
• Find:
 Classification paradigm in this setting
 Including how to train and use a global classifier
• Objective:
 Scalability, communication-efficient, decentralized in-network
processing, fault-tolerance
• Constraints:
 Each peer can only communicate with its immediate neighbors
 The network topology changes dynamically
03/27/2008
Contributions from this paper
• An algorithm to build an ensemble classifier for
distributed classification in P2P networks by plurality
voting on all the local classifiers
– Adapt the training paradigm of pasting bites for
building local classifiers
– An algorithm of (restrictive) Distributed Plurality
Voting (DPV) to combine the decisions of local
classifiers
 Correctness
 Optimality
• Extensive Experimental Evaluation
– Communication overhead and convergence time of
DPV
– Accuracy comparison with centralized classification
03/27/2008
Overview
• Introduction
Building Local Classifiers
• Distributed Plurality Voting
• Experimental Results
• Related Works
• Summary
03/27/2008
Building Local Classifiers
• Pasting Bites by Breiman [JML’99]
– Generating small bites of the data by importance
sampling based on the out-of-bag error of classifiers
built so far
– Stopping criteria: difference of errors between two
successive iteration is below a threshold
– Voting uniformly all the classifiers
• The more data on a local node, the more classifiers
generated on it, the more votes it owns.
03/27/2008
Overview
• Introduction
• Building Local Classifiers
Distributed Plurality Voting
• Experimental Results
• Related Works
• Summary
03/27/2008
Problem Formulation Of DPV
• Given:
 A group U of peers in a graph G (U , E ) would like to agree on
one of d options.
 Each peer u  U conveys its preference by initializing a voting
vector Pu   d , where Pu [i] is the number of votes on the i-th
option.

Local neighborhood change is informed to each peer real-timely
• Find:
 The option with the largest number of votes over all peers:
• Objective:
 Scalability, communication-efficient, decentralized in-network
processing, fault-tolerance
• Constraints:
 Each peer can only communicate with its immediate neighbors
 The network topology changes dynamically
03/27/2008
An Example Of DPV
The third option is the answer.
03/27/2008
Comparison Between DPV and Distributed
Majority Voting (DMV, by Wolff et al. [TSMC’04])
• DMV Given:
 A group U of peers in a graph G (U , E )
 Each peer u  U conveys its preference by kainitializing a 2tuple
, where
stands for the number of
the votes for certain option and
stands for the number
of the total vote on this peer.
 The majority ratio
• DMV Find:
 Check whether the voting proportion of the specified option is
above :
• DMV Converted to DPV:
 Replacing the 2-tuple
voting vector
03/27/2008
on each peer with the
Comparison Between DPV and DMV contd…
• DPV vs. DMV
 DPV is a multi-values function while DMV is a binary
predicate.
 DMV can be solved by converting it to DPV.
 However, DMV can only solve 2-option DPV problems. For a
d-option DPV problem, pairwise comparisons among all d
options must be performed by DMV for
times
(Multiple Choice Voting [TSMC’04]).
 DPV finds the maximally supported option directly, and thus
saves a lot of communication overhead and the time for
convergence.
• DPV is the general form of DMV
03/27/2008
Challenges for DPV
• No central server to add all voting vectors, Only
communication between immediate neighbors
• Dynamic change of not only the network topology but
also the local voting vectors
• Supporting not only one-shot query, but also
continuous monitor the current voting result according
to the latest network status
03/27/2008
DPV Protocol Overview
• Assumption:
– it includes a mechanism to maintain an un-directional
spanning tree for the dynamic P2P network. The protocol
performs on this tree (duplicate insensitive).
– A node is informed of changes in the status of adjacent nodes.
• Protocol Overview




Each node performs the same algorithm independently
Specify how nodes initialize and react under different
situations: a message received, neighboring node detached or
joined, the local voting vector changed
When the node status changes under the above situation, the
node notifies this change to the other neighbors only if the
condition for sending messages satisfies.
To guarantee that each node in the network converges toward
the correct plurality
03/27/2008
The Condition
for Sending Messages contd…
Message
Sent
(5,2,1)+(2,0,0)=(7,2,1)
7-2=5
7-1=6
(8,6,1)+(2,0,0)=(10,6,1)
10-4=4
10-1=9
4<5 9>6
The differences between the votes of maximally voted
option and any other option decrease.
03/27/2008
The Condition
for Sending Messages
No Message
Sent
(5,2,1)+(2,0,0)=(7,2,1)
7-2=5
7-1=6
(8,4,1)+(2,0,0)=(10,4,1)
10-4=6
10-1=9
6>5 9>6
The differences between the votes of maximally voted
option and all other options do not decrease.
03/27/2008
The Correctness of DPV Protocol
All the nodes converge to the same result.
The difference between the actual votes of maximally voted
option and any other option is not smaller than what the protocol
have sent.
Then, all the nodes converge to the right result.
03/27/2008
The Optimality of DPV Protocol
C1 is more restrictive than C2, iff, for any input case if C1 is
true then C2 is true.
C1 is strictly more restrictive than C2, iff, C1 is more
restrictive than C2 and there at least exists an input case
such that C1 is false and C2 is true.
is the most restrictive condition for sending
messages to keep the correctness of the DPV protocol.
It is the condition, which is the most difficult to satisfy. In
this sense, it guarantees the optimality in communication
overhead.
03/27/2008
The Extension of DPV Protocol
Restrictive Distributed Plurality Voting:
output the maximally voted option whose proportion to all the votes is
above a user-specified threshold.
It can be used in a classification ensemble in a restrictive manner by
leaving out some uncertain instances.
The new condition for sending messages is based on the spirit
of
.
03/27/2008
Overview
• Introduction
• Building Local Classifiers
• Distributed Plurality Voting
Experimental Results
• Related Works
• Summary
03/27/2008
Accuracy of P2P Classification
Data: covtype (581012*54, 7 classes) from the
UCI database, distributed onto 500 nodes
03/27/2008
The Performance of DPV Protocol
Experimental Parameters
Difference types of network topology: Power-law Graph,
Random Graph, Grid
Number of nodes: 500, 1000, 2000, 4000, 8000, 16000
7-option DPV problems
Experimental Metrics
The average communication overhead for each node
The convergence time of the protocol for one-shot query
03/27/2008
The Performance of DPV Protocol contd…
DPV0 vs. RANK (Multiple Choice Voting)
500 nodes
Averaging 2000 instances of 7-option plurality voting problems
a and b are the largest and
second largest options,
respectively.
03/27/2008
The Performance of DPV Protocol contd…
The Scalability of DPV0
Different number of nodes vs. communication overhead of each node
03/27/2008
Ping Luo KDD 07
The Performance of DPV Protocol (4)
The Local Optimality of DPV0
Communication overhead and convergence time under different
conditions for sending messages
Ping Luo KDD 07
03/27/2008
Overview
• Introduction
• Building Local Classifiers
• Distributed Plurality Voting
• Experimental Results
Related Works
• Summary
03/27/2008
Related Work - Ensemble Classifiers
• Model Combination: (weighted) voting, meta-learning
• For Centralized Data
– applying different learning algorithms with heterogeneous
models
– applying a single learning algorithm to different versions of
the data



Bagging: random sampling with replacement
Boosting: re-weighting of the mis-classified training examples
Pasting Bites: generating small bites of the data by importance
sampling based on the quality of classifiers built so far
• For Distributed Data
– distributed boosting by Lazarevic et al. [sigkdd’01]
– distributed approach to pasting small bites by Chawla et al.
[JMLR’04], which uniformly votes hundreds or thousands of
classifiers built on all distributed data sites
03/27/2008
Related Work - P2P Data Mining
• Primitive Aggregates
–
–
–
–
Average
Count, Sum
Max, Min
Distributed Majority Voting by Wolff et al. [TSMC’04]
• P2P Data Mining Algorithms
–
–
–
–
P2P Association Rule Mining by Wolff et al. [TSMC’04]
P2P K-means clustering by Datta et al. [SDM’06]
P2P L2 threshold monitor by Wolff et al. [SDM’06]
Outlier detection in wireless sensor networks by Branch et al.
[ICDCS’06]
– A classification framework in P2P networks by Siersdorfer et
al. [ECIR’06]

03/27/2008
Limitations: local classifiers’ propagations, experiments on 16
peers, only focusing on the accuracy issue, without involving any
dynamism of P2P networks.
Overview
• Introduction
• Building Local Classifiers
• Distributed Plurality Voting
• Experimental Results
• Related Works
Summary
03/27/2008
Summary
Proposed an ensemble paradigm for distributed
classification in P2P networks
Formalized a generalized Distributed Plurality Voting
(DPV) protocol for P2P networks
The property of DPV0
supporting both one-shot query and continuous
monitor
theoretical local optimality in terms of communication
overhead
outperforms alternative approaches
scale up to large networks
03/27/2008
Acknowledgement
Q. & A.
03/27/2008