Mining Distributed Data: An Overview and an Algorithm for Outlier Detection

Data Mining Over a Large,
Dynamic Network
An Overview and an Algorithm for Outlier Detection
Talk Outline
● Introduction
  – Data mining over a large, dynamic network: what and why?
  – Some recent research on "epidemic-style" algorithms
● Example problem
  – Outlier detection over a large, dynamic network
  – An algorithm and experiments in a wireless sensor network
● Wrap-up summary
Large, Dynamic Networks
● Internet, peer-to-peer file-sharing networks
  – Kazaa: 5M simultaneously on-line users (Jan 2004)
● Wireless sensor networks
  – 800-node WSN built at Berkeley (2001)
  – 10,000+ node WSNs are expected in the near future
Large, Dynamic Networks
● Large number of network nodes, each with a local data set, Di
● Dynamic environment:
  – nodes may enter/leave at will
  – nodes' local data, Di, may change at will
● Distributed (time-varying) dataset: D = ∪ Di
Why Mine a Large, Dynamic Network?
● Goal: learn global properties of the dataset D.
  – Does the fraction of nodes in the WSN reporting a problem exceed a threshold?
  – Download patterns in an Internet P2P file-sharing network.
The Challenges
● Large network size
  – No global operations or synchronization
  – Must be communication-efficient
● Dynamic environment (changing local data, node joins, departures)
  – Algorithms must be robust to data/network change
Epidemic-Style Algorithms
Each node in the network
● maintains an estimate of the desired global property (e.g. global average, majority)
● selectively communicates with neighboring nodes in the network
All nodes' estimates eventually converge on the true global property.
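
To make the epidemic-style idea concrete, here is a minimal sketch of gossip-based averaging, one classic instance of this algorithm family. It is illustrative only and is not the outlier-detection algorithm of this talk; the Node class, gossip_round function, and ring topology are invented for the example.

    import random

    class Node:
        """One network node holding an estimate of the global average."""
        def __init__(self, local_value):
            self.estimate = float(local_value)  # initialize from local data

    def gossip_round(nodes, neighbors):
        """Every node averages its estimate with one random neighbor."""
        for i, node in enumerate(nodes):
            j = random.choice(neighbors[i])           # pick a neighboring node
            avg = (node.estimate + nodes[j].estimate) / 2.0
            node.estimate = nodes[j].estimate = avg   # both adopt the average

    # Ring of 8 nodes with local values 0..7; the true global average is 3.5.
    nodes = [Node(v) for v in range(8)]
    neighbors = {i: [(i - 1) % 8, (i + 1) % 8] for i in range(8)}
    for _ in range(50):
        gossip_round(nodes, neighbors)
    print([round(n.estimate, 2) for n in nodes])  # all estimates near 3.5

Pairwise averaging preserves the sum of the estimates, so purely local exchanges drive every node toward the global average without any coordinator.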
Epidemic-Style Algorithms
A good choice for data mining over large, dynamic networks:
● highly decentralized
● asynchronous
● robust to data/network change
Epidemic-Style Algorithm Examples
● Majority voting
  – A useful basic primitive
● Facility location, decision tree induction
  – All use majority voting as a building block
● Outlier detection
  – This talk.
Outlier Detection Over a Large, Dynamic Network
● Applications:
  – data cleaning in wireless sensor networks
  – detecting misconfigured machines in grid networks
But first, what do I mean by an "outlier"?
Overview of Outlier Detection on Centralized Data
● Outlier: a point in a dataset which appears to be inconsistent with the remainder of the dataset.
● Outlier detection is studied widely in
  – Statistics
  – Machine Learning
  – Data Mining
Centralized Outlier Detection (cont.)
● Outlier detection problem:
  – Given dataset D, score parameter k, and integer n, find the top n scored points in D.
Outlier Detection Over a Large, Dynamic Network
● Problem:
  – Each node Ni has: a local data set Di (time varying), and parameters k and n.
  – Solve the unsupervised outlier detection problem over D = ∪ Di.
  – At the end, each node has O[D], the top n outliers over D.
Algorithm
● Assumptions:
  – Reliable message delivery
  – Reliable immediate-neighborhood knowledge
● Choice of score function:
  – The algorithm works for any scoring function satisfying "anti-monotonicity" and "smoothness"
● Nodes send data points to their immediate neighbors based on events (see the sketch below):
  – Local data change (birth or death of points in Di)
  – Receipt of data points from immediate neighbors
  – Change in immediate neighborhood
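
As a rough sketch of this event-driven structure, each node keeps its local data plus per-neighbor send/receive logs and reacts to the three event types. All names here (Node, handle_event, the event kinds) are invented for illustration, not the paper's API.

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        """Per-node state: local data Di, current neighbors, and per-neighbor
        logs of points sent and received."""
        local_data: set = field(default_factory=set)
        neighbors: list = field(default_factory=list)
        sent: dict = field(default_factory=dict)      # neighbor id -> points sent
        received: dict = field(default_factory=dict)  # neighbor id -> points received

    def handle_event(node, kind, **ev):
        """React to the three event types that trigger communication."""
        if kind == "local_change":            # birth or death of points in Di
            node.local_data = ev["new_data"]
        elif kind == "receive":               # data points from an immediate neighbor
            node.received.setdefault(ev["sender"], set()).update(ev["points"])
        elif kind == "neighborhood_change":   # a neighbor joined or departed
            node.neighbors = ev["new_neighbors"]
        # In the full algorithm, the node would now recompute the "necessary
        # points" for each neighbor (the fixed-point set described below) and
        # send only those points it has not already sent.

    # Example: a node absorbs a point received from neighbor "N2".
    n1 = Node(local_data={(0.0, 0.0)}, neighbors=["N2"])
    handle_event(n1, "receive", sender="N2", points={(5.0, 5.0)})
    print(n1.received)  # {'N2': {(5.0, 5.0)}}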
Some Notation
Given neighboring nodes N1 and N2,
● Let P1→2 denote all the points sent by N1 to N2
● Let P2→1 denote all the points sent by N2 to N1
● Let P1 denote the set of points currently held by N1: P1 = D1 ∪ P2→1
Node Event Response: Basic Idea
● Each node maintains an estimate of O[D], the global outliers.
  – N1 maintains O[P1]; N2 maintains O[P2]
● N1 sends to N2 any point that might change N2's estimate.
  – Call these the necessary points from P1
How does N1 compute its necessary points?
N1 approximates N2's estimate O[P2] as O[P1→2 ∪ P2→1].
The following are necessary points from P1:
1. O[P1] and its k nearest neighbors
2. the k nearest neighbors of O[P1→2 ∪ P2→1]
Z0 = union of 1 and 2.

If Z0 were sent to N2, its estimate might change. N1 would approximate N2's new estimate as O[Z0 ∪ P1→2 ∪ P2→1].
The following would be necessary points from P1:
1. Z0
2. the k-NN of O[Z0 ∪ P1→2 ∪ P2→1]
Z1 = union of 1 and 2.

If Z1 were sent to N2, its estimate might change. N1 would approximate N2's new estimate as O[Z1 ∪ P1→2 ∪ P2→1], giving
Z2 = Z1 ∪ k-NN of O[Z1 ∪ P1→2 ∪ P2→1]
Etc.

Let Z* = Zi+1 = Zi for i as small as possible (a fixed point).
Z* is the set of necessary points from P1.
N1 actually sends Z* \ (P1→2 ∪ P2→1), since N1 knows N2 already has P1→2 ∪ P2→1.
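
A brute-force sketch of this fixed-point computation, assuming the average k-NN distance score from the centralized setting (the talk only requires an anti-monotone, smooth score); all function names here are illustrative:

    def dist(a, b):
        """Euclidean distance between points given as tuples."""
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

    def knn(A, x, k):
        """The k nearest neighbors of x among the other points of A."""
        return sorted((p for p in A if p != x), key=lambda p: dist(p, x))[:k]

    def score(A, x, k):
        """Average distance from x to its k nearest neighbors in A."""
        nn = knn(A, x, k)
        return sum(dist(p, x) for p in nn) / len(nn) if nn else float("inf")

    def outliers(A, k, n):
        """O[A]: the n highest-scoring points of A."""
        return set(sorted(A, key=lambda x: -score(A, x, k))[:n])

    def necessary_points(P1, sent, recd, k, n):
        """What N1 must actually transmit to N2: Z* minus the points
        (P1->2 and P2->1) that N2 already holds."""
        known = sent | recd                               # N2 already has these
        Z = outliers(P1, k, n)                            # O[P1] ...
        Z |= {q for x in Z for q in knn(P1, x, k)}        # ... and its k-NN
        Z |= {q for x in outliers(known, k, n) for q in knn(P1, x, k)}
        while True:                                       # grow Z to the fixed point
            est = outliers(Z | known, k, n)               # N2's estimate if Z were sent
            Z_new = Z | {q for x in est for q in knn(P1, x, k)}
            if Z_new == Z:
                return Z - known                          # never resend known points
            Z = Z_new

    # Example: a tight cluster plus one far point; with n = 1 the far point
    # and its supporting nearest neighbors come out as the necessary points.
    P1 = {(0.0, 0.0), (0.1, 0.1), (0.2, 0.0), (5.0, 5.0)}
    print(necessary_points(P1, sent=set(), recd=set(), k=2, n=1))

Because each iteration only adds points of P1 to Z, the sequence Z0 ⊆ Z1 ⊆ ... is bounded and the loop must terminate at a fixed point.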
Some Key Algorithm Features
● Highly decentralized
● Asynchronous
● Robust to network/data change
● Each node's estimate is guaranteed to converge to O[D] (the global outliers).
Experiments in a Wireless Sensor Network
Wrap-Up Summary
● Data mining over a large, dynamic network
  – Motivation (wireless sensor networks, Internet P2P information-sharing networks)
  – Some recent research on epidemic-style algorithms
● Outlier detection over a large, dynamic network
  – An algorithm and experiments in a wireless sensor network
THE END
Thank you for your attention.
Chris Giannella: [email protected]
Slides Available From:
www.cs.nmsu.edu/~cgiannel/Mining_Dynamic_Network2008.ppt
Selected Research Contributions in Distributed Data Mining
● DM over a large, dynamic network
  – K-means clustering (SDM 2006)
  – Outlier detection (ICDCS 2006)
  – Decision tree induction (Statistical Analysis and Data Mining 2008)
● DDM on astronomy data (NASA funded)
  – Principal component analysis and outlier detection (SDM 2007)
● Privacy-preserving data mining
  – Distance-preserving data perturbation (PKDD 2006)
Mining a Large-Scale, Distributed Dataset
The problem:
– Each node Ni has a data set Di (time varying).
– Compute some function, F, of the global dataset D = ∪ Di.
– Each node learns F(D).
Two Main Centralization Problems
● Communication Cost
  – Centralization creates a communication bottleneck
● Privacy Loss
  – Data from one source may be too sensitive to reveal to others
DDM on Sensitive Data
● Privacy-Preserving Data Mining (PPDM)
  – Example: multiple banks
    ● Detect fraudulent transactions based on all the banks' data
    → Secure multi-party computation (cryptography)
  – Example: census data
    ● Release the data to other organizations for analysis
    → Data transformation
DDM on Non-Sensitive Data
● Problem: data centralization creates a communication bottleneck
  – Static environment (data and network do not change)
    ● Parallel computing
  – Dynamic environment
    ● Remainder of the talk
Some Common PPDM Techniques
● Secure multi-party computation (cryptography)
  – Build complicated distributed algorithms from simple, secure primitives (e.g. secure sum; a toy sketch follows)
● Data transformation
  – Release a transformed dataset
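
For concreteness, a toy sketch of the secure-sum primitive mentioned above, assuming a ring of honest-but-curious parties. The modulus M and function names are invented for the example; real SMC protocols also handle collusion and stronger threat models.

    import random

    M = 10**9  # modulus: must exceed any possible sum

    def secure_sum(private_values):
        """Ring-based secure sum: the initiator adds a random mask, each
        party adds its private value mod M, and the initiator removes the
        mask at the end. Intermediate totals look uniformly random."""
        mask = random.randrange(M)
        running = mask
        for v in private_values:      # in a real protocol, each hop is a message
            running = (running + v) % M
        return (running - mask) % M

    print(secure_sum([12, 7, 30]))  # 49, with no single value revealed in transit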
DDM Introduction Summary
● Basic DDM framework
● DDM analysis of sensitive data (Privacy-Preserving Data Mining)
  – Data transformation vs. cryptographic approaches
● Analysis of non-sensitive data
  – Static environment (parallel computing)
  – Dynamic environment (remainder of the talk)
Centralized Outlier Detection (cont.)
● Define a function, S, such that for any set A of data points in R^n and a data point x in R^n,
  – S(A,x) is the outlier score of x with respect to A
  – The larger the score, the more x is deemed to be inconsistent with A.
● Example: S(A,x) = average Euclidean distance of x to its k nearest neighbours in A (k a fixed constant).
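
A tiny worked instance of this example score, assuming k = 2; the points and values are invented for illustration:

    from math import dist  # Euclidean distance (Python 3.8+)

    # S(A, x) with k = 2: mean distance from x to its 2 nearest neighbours in A.
    A = [(0, 0), (1, 0), (0, 1), (10, 10)]
    x = (10, 10)
    k = 2

    nearest = sorted(dist(x, a) for a in A if a != x)[:k]
    S = sum(nearest) / k
    print(round(S, 2))  # 13.45: x is far from everything, so its score is high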
Epidemic-Style Algorithms
Each node in the network
● maintains an estimate of the desired global property (e.g. global average, majority)
● selectively communicates with immediately neighboring nodes upon detecting an event
  – local data/network change, message receipt
All nodes' estimates eventually converge on the true global property.
Basic Idea
● O[P1] is N1's current estimate of O[D].
● N1 sends to N2 any point that might change N2's outlier estimate, O[P2].
● N1 should send its outlier estimate and their k nearest neighbours
  – B0 = O[P1] ∪ KNN(O[P1])
● Assuming B0 is sent
  – N1 should also send the k nearest neighbours of N2's outlier estimate, O[P2]
  – N1 guesses O[P2] to be O[B0 ∪ P1→2 ∪ P2→1]
  – N1 should send
    ● B1 = B0 ∪ KNN(O[B0 ∪ P1→2 ∪ P2→1])
● Assuming B1 is sent, N1 should send
  – B2 = B0 ∪ KNN(O[B1 ∪ P1→2 ∪ P2→1])
● Assuming B2 is sent, N1 should send
  – B3 = B0 ∪ KNN(O[B2 ∪ P1→2 ∪ P2→1])
Etc.
N1 sends a fixed point B*:
B* = B0 ∪ KNN(O[B* ∪ P1→2 ∪ P2→1])
Selected Research Contributions
● Privacy-preserving data mining
  – Distance-preserving data transformation (PKDD 2006)
  – Outlier detection using secure multi-party computation (in preparation)
● DDM in dynamic environments
  – K-means clustering (SDM 2006)
  – Outlier detection (ICDCS 2006) → remainder of the talk
Theorem
● If the data and network are fixed, then
  – The algorithm terminates
    ● i.e. all nodes stop sending messages
  – Upon termination, O[Pi] = O[D] for every node Ni
    ● i.e. the outlier estimate of every node equals the true outliers.
● The algorithm is guaranteed to converge on the correct result provided enough time elapses between changes.
References
● "K-Means Clustering Over a Large, Dynamic Network", SIAM Conf. on Data Mining (SDM), 2006.
● "In-Network Outlier Detection in Wireless Sensor Networks", Int. Conf. on Distributed Computing Systems (ICDCS), 2006.
● "Distributed Decision Tree Induction in Peer-to-Peer Systems", Statistical Analysis and Data Mining (to appear).
● "Distributed Top-K Outlier Detection from Astronomy Catalogs using the DEMAC System", SIAM Conf. on Data Mining (SDM), 2007.
● "An Attacker's View of Distance Preserving Maps for Privacy Preserving Data Mining", European Conf. on Principles and Practice of Knowledge Discovery in Databases (PKDD), 2006.