The Role of Agents in Distributed Data Mining: Issues and

Download Report

Transcript The Role of Agents in Distributed Data Mining: Issues and

The Role of Agents in
Distributed Data Mining:
Issues and Benefits
Josenildo Costa da Silva 1, Matthias Klusch 1, Stefano Lodi 2,
Gianluca Moro 2, Claudio Sartori 2
1Deduction
and Multiagent Systems,
German Research Center for Artificial Intelligence,
Saarbruecken, Germany
2Department of Electronics, Computer Science and Systems,
University of Bologna,
Bologna, Italy
Distributed Data Mining (DDM)
• Data sets
– Massive
– Inherently distributed
• Networks
– Limited bandwidth
– Limited computing resources at nodes
21/07/2015
• Privacy and security
– Sensitive data
– Share goals, not data
2
AgentLink III: TFG1 IIA4WE, Roma
Centralized solution
21/07/2015
• Apply traditional DM algorithms to data
retrieved from different sources and
stored in a data warehouse
• May be impractical or even impossible
for some business settings
– Autonomy of data sources
– Data privacy
– Scalability (~TB/d)
3
AgentLink III: TFG1 IIA4WE, Roma
Agents and DDM
21/07/2015
• DDM exploits distributed processing and
problem decomposability
• Is there any real added value of using
concepts from agent technology in DDM?
– Few DDM algorithms use agents
– Evidence that cooperation among distributed DM
processes may allow effective mining even without
centralized control
– Autonomy, adaptivity, deliberative reasoning
naturally fit into the DDM framework
4
AgentLink III: TFG1 IIA4WE, Roma
State of the Art
• BODHI
– Mobile agent platform/Framework for collective
DM on heterogeneous sites
• PADMA
– Clustering homogeneous sites
– Agent based text classification/visualization
• JAM
– Metalearning, classifiers
21/07/2015
• Papyrus
– Wide area DDM over clusters
– Move data/models/results to minimize network
load
AgentLink III: TFG1 IIA4WE, Roma
5
Agents for DDM (pros)
21/07/2015
• Autonomy of data sources
• Scalability of DM to massive distributed
data
• Multi-strategy DDM
• Collaborative DM
6
AgentLink III: TFG1 IIA4WE, Roma
Agents for DDM (against)
• Need to enforce minimal privileges at a
data source
21/07/2015
– Unsolicited access to sensitive data
– Eavesdropping
– Data tampering
– Denial of service attacks
7
AgentLink III: TFG1 IIA4WE, Roma
The Inference Problem
21/07/2015
• Work in statistical DB (mid 70’s)
• Integration/aggregation at the summary
level is inherent in DDM
– Infer sensitive data even from partial
integration to a certain extent and with
some probability (inference problem)
– Existing DDM systems are not capable of
coping with the inference problem
8
AgentLink III: TFG1 IIA4WE, Roma
Data Clustering
• Popular problem
– Statistics (cluster analysis)
– Pattern Recognition
– Data Mining
21/07/2015
• Decompose multivariate data set into
groups of objects
– Homogeneity within groups
– Separation between groups
9
AgentLink III: TFG1 IIA4WE, Roma
DE-clustering
• Clustering based on non-parametric
density estimation
21/07/2015
– Construct an estimate of the probability
density function from the data set
– Objects “attracted” by a local maximum of
the estimate belong to the same cluster
10
AgentLink III: TFG1 IIA4WE, Roma
Kernel Density Estimation
21/07/2015
•
• The higher the number of data objects
in the neighbourhood of x, the higher
density at x
• A data object
exerts more influence
on the value of the estimate at x than
any data object farther from x than xi
• The influence of data objects is radial
11
AgentLink III: TFG1 IIA4WE, Roma
Formalizing Density Estimators
21/07/2015
• The density estimate at a space object x is
proportional to a sum of weights
• The sum consists of one weight for every
data object
• Weight is a monotonically decreasing function
(kernel ) of the distance between x and xi
scaled by a factor h (window width )
12
AgentLink III: TFG1 IIA4WE, Roma
Kernel Functions
• Uniform kernel
0.5
0.4
0.3
21/07/2015
0.2
0.1
-3
-2
-1
1
AgentLink III: TFG1 IIA4WE, Roma
2
3
13
Kernel Functions
• Triangular kernel
1
0.8
0.6
21/07/2015
0.4
0.2
-3
-2
-1
1
AgentLink III: TFG1 IIA4WE, Roma
2
3
14
Kernel Functions
• Epanechnikov’s kernel
0.3
0.25
0.2
0.15
21/07/2015
0.1
0.05
-3
-2
-1
1
AgentLink III: TFG1 IIA4WE, Roma
2
3
15
Kernel Functions
• Gaussian kernel
0.4
0.3
0.2
21/07/2015
0.1
-3
-2
-1
1
AgentLink III: TFG1 IIA4WE, Roma
2
3
16
Example (1/2)
• Uniform kernel, h=250
0.0005
0.0004
0.0003
0.0002
21/07/2015
0.0001
2000
4000
6000
8000
10000
17
AgentLink III: TFG1 IIA4WE, Roma
Example (2/2)
• Gaussian kernel, h=250
0.0005
0.0004
0.0003
0.0002
21/07/2015
0.0001
2000
4000
6000
8000
10000
18
AgentLink III: TFG1 IIA4WE, Roma
Distributed Data Clustering (1/2)
• Clustering algorithm A( )
• Homogeneous distributed data
clustering problem for A:
– Data set S
– Sites Lj
– Lj stores data set Dj with
M
D
j
S
21/07/2015
j 1
19
AgentLink III: TFG1 IIA4WE, Roma
Distributed Data Clustering (2/2)
• Problem: find clustering Cj in the data
space of Lj such that:
– Cj agree with A(S) (correctness
21/07/2015
requirement):
– Time/communication costs are minimized
(efficiency requirement)
– The size of data transferred out of the data
space of any Lj is minimized (privacy
requirement)
20
AgentLink III: TFG1 IIA4WE, Roma
Traditional (centralized) solution
21/07/2015
• Gather all local data sets into one
centralized repository (e.g., a data
warehouse)
• Run A( ) on the centralized data set
• Unsatisfied privacy requirement
• Unsatisfied efficiency requirement for
some A( )
21
AgentLink III: TFG1 IIA4WE, Roma
Sampling
• Goal: given some class of functions of type
represent every member as a sampling series
21/07/2015
where:
–
–
is a collection of points of
is some set of suitable expansion functions
22
AgentLink III: TFG1 IIA4WE, Roma
Example
• The class of polynomials of degree 1
– Sampling points
– Expansion functions
21/07/2015
• Finite sum
23
AgentLink III: TFG1 IIA4WE, Roma
Band-limited Functions
21/07/2015
• Function f of one real variable
• Range of frequencies of a function f 
support of the Fourier transform of f
• Any function whose range of
frequencies is confined to a bounded
set B is called band-limited to B (the
band-region)
24
AgentLink III: TFG1 IIA4WE, Roma
Example: sinc function
1
1
0.8
0.8
0.6
0.6
0.4
21/07/2015
0.2
-6
-4
-2
0.4
2
4
6
0.2
-0.2
-0.4
-7.5
-5
-2.5
2.5
5
7.5
25
AgentLink III: TFG1 IIA4WE, Roma
Sampling Theorem
• If f is band-limited with band-region
21/07/2015
then
26
AgentLink III: TFG1 IIA4WE, Roma
Sampling Theorem (scaled
multidimensional version)
• Let
–
–
–
where
is the
-th component of a vector
21/07/2015
• If f is band-limited to B then
27
AgentLink III: TFG1 IIA4WE, Roma
Sampling Density Estimates
(1/4)
21/07/2015
• Additivity of density estimates of a
distributed data set
28
AgentLink III: TFG1 IIA4WE, Roma
Sampling Density Estimates
(2/4)
• The sampling series of the density
estimate is also additive
21/07/2015
where
29
AgentLink III: TFG1 IIA4WE, Roma
Sampling Density Estimates
(3/4)
• Truncation errors
– The support of a kernel function is not
bounded in general
21/07/2015
• Aliasing errors
– The support of the Fourier transform of a
kernel function is not bounded in general

kernel functions are not band-limited
30
AgentLink III: TFG1 IIA4WE, Roma
Sampling Density Estimates
(4/4)
21/07/2015
• The sampling series of a density
estimate can only be approximated
• Trade-off between the number of
samples and accuracy
– Define a minimal multidimensional
rectangle outside which samples are
negligible
– Define a vector of sampling intervals such
that the aliasing error is negligible
31
AgentLink III: TFG1 IIA4WE, Roma
The KDEC scheme
• Every site Lj:
• Helper H:
21/07/2015
– Computes a local
density estimate  [ D j ]
of its data Dj
– Samples  [ D j ]
– Sends the samples to
H
– Waits for the samples
of the global density
estimate
– Reconstructs  [S ]
from its samples
– Applies DE-clustering
to Dj and  [S ]
AgentLink III: TFG1 IIA4WE, Roma
– Waits for the samples
of local density
estimates  [ D j ]
– Orderly sums the
samples
– Sends the summation
back to each Lj
32
The KDEC scheme
Helper
Site1
21/07/2015
Site2
33
AgentLink III: TFG1 IIA4WE, Roma
The KDEC scheme
Helper
2
1.5
-4
1
0.5
0
-2
y
0
0
2
1.5
1
0.5
0
-4
y
0
-2
21/07/2015
2
2
2
0
x
4
x
4
34
AgentLink III: TFG1 IIA4WE, Roma
The KDEC scheme
Helper
2
1.5
1
0.5
0
-4
-5
-5
21/07/2015
y
-2.5
0
y
0
-2
0
2.5
5
5
2
0
x
4
x
35
AgentLink III: TFG1 IIA4WE, Roma
The KDEC scheme
Helper
-5
-5
21/07/2015
y
-5
-2.5
0
0
2.5
5
5
-5
y
-2.5
0
0
x
2.5
5
5
x
36
AgentLink III: TFG1 IIA4WE, Roma
The KDEC scheme
4
Helper
2
-6
0
-4
-2
-2
0
0
2
4
2
6
4
4
-6
-4
2
4
1
-2
-6
2
2
0
1
2
21/07/2015
-6
-6
-1
-4
-4
-5
1
-1
-2
-2
y -2
-3
-4
2
0.15
4 0
0.1
0.05
0
3
-2
0
0 -2.5
0
2
00
0
2
2
4
26
2
4 2.5
4
5
6
6
5
0
x
-2 -5
-2
-1
1
-4
3
2
-2
-1
5
2
2
2
-6
-6
-2
-4
-4
-5
-3
-2
-2
y
-4 0 00
-5
4
0
4
2
2
4
2
6
6
56
4
5
4
2
2
2.5
0
0
0.15
0.1
0.05
-2 -2
-2 -5
0 0
0 -2.5
0
x
37
AgentLink III: TFG1 IIA4WE, Roma
21/07/2015
Properties of the approach
• Communication complexity depends only on
the number of samples
• Data objects are never transmitted over the
network
• Local clusters are close to global clusters
which can be obtained using DE-cluster
• Time complexity does not exceed the time
complexity of centralized DE-clustering
38
AgentLink III: TFG1 IIA4WE, Roma
Window width and sampling
frequency
21/07/2015
• Good estimates when h is not less than
a small multiple of the smallest distance
between objects
• As
, the number of samples
rarely exceeds the number of data
points
39
AgentLink III: TFG1 IIA4WE, Roma
Complexity
• Site j
– Sampling: O(q(N) Sam)
– DE-cluster: O(|Dj|q(Dj))
• Helper
– Summation of samples: O(Sam)
21/07/2015
• Communication
– Time: O(Sam)
– Volume: O(M Sam)
40
AgentLink III: TFG1 IIA4WE, Roma
Complexity
(centralized approach)
• Site j
– Transmission/Reception of data objects:
O(|Dj|)
• Helper
– Global DE-clustering: O(N q(N))
21/07/2015
• Communication:
– Time: O(N)
– Volume: O(N)
41
AgentLink III: TFG1 IIA4WE, Roma
Stationary agent-based KDEC
• The helper engages site
agents to agree on:
21/07/2015
–
–
–
–
Kernel function
Window width
Sampling frequencies
Sampling region
• The global sampled
form of the estimate is
computed in a single
stage
42
AgentLink III: TFG1 IIA4WE, Roma
Mobile agent-based KDEC
• At site Ln the visiting
agent:
21/07/2015
– Negotiates kernel
function, window width,
sampling frequencies,
sampling region
– Carries the sum of
samples collected at Lm,
m<n, in its data space
• The global sampled
form of the estimate is
returned to the
interested agents
43
AgentLink III: TFG1 IIA4WE, Roma
21/07/2015
A Hierarchical Scheme
• Additivity allows to
extend the scheme to
trees of arbitrary arity
• Local sampled density
estimates are
propagated upwards in
partial sums, until the
global sampled DE is
computed at the root
and returned to the
leaves
•May provide more protection
against disclosure of DEs
44
AgentLink III: TFG1 IIA4WE, Roma
Inference and Trustworthiness
• Inference problem for kernel density
estimates
– Goal of inference attacks: exploit information
contained in a density estimate to infer the data
objects
21/07/2015
• Trustworthiness of helpers
– Trustworthy helper no bit of information written
to memory by a process for the Helper procedure
is sent to a system peripheral by a different
process
45
AgentLink III: TFG1 IIA4WE, Roma
Inference Attacks on Kernel
Density Estimates
• Let
21/07/2015
be extensionally equal to a density
estimate:
• For example, g is the reconstructed
density estimate (sampling series)
46
AgentLink III: TFG1 IIA4WE, Roma
Inference Attacks on Kernel
Density Estimates
• Simple strategy: Search the density estimate
or its derivatives for discontinuities
• Example: The kernel is the square pulse
21/07/2015
– For each pair of projections of objects on an axis
there is a pair of projections of discontinuities on
that axis having the same distance as the objects’
projections
– If h is known then the objects can be inferred
easily
• If the kernel has discontinuous derivatives,
then the same technique applies to the
derivatives
AgentLink III: TFG1 IIA4WE, Roma
47
Inference Attacks on Kernel
Density Estimates
• If g is not continuous at x  an object lies at
0.0005
h=250
0.0004
0.0003
21/07/2015
0.0002
0.0001
48
500
1000
AgentLink III: TFG1 IIA4WE, Roma
1500
2000
Inference Attacks on Kernel
Density Estimates
21/07/2015
• If the kernel is infinitely differentiable
the problem is more difficult
• Select
space objects
and
attempt to solve a nonlinear system of
equations
49
AgentLink III: TFG1 IIA4WE, Roma
Attack Scenarios
• Single-site attack
– One of the sites attempts to infer the data objects
from the global density estimate
– Unable to associate a specific data object to a
specific site
21/07/2015
• Site coalition attack
– A coalition computes the sum of the density
estimates of all the other sites as difference
– Special case: the coalition includes all sites but
one  the attack potentially reveals the data
objects at the site
50
AgentLink III: TFG1 IIA4WE, Roma
Untrustworthy Helpers
21/07/2015
• Reputation binary random variable
with probabilities p and 1-p
– p is the probability that the helper is
untrustworthy
– If the agent community supports referrals
about an agent's reputation as a helper,
then an agent might know per agent
probabilities
51
AgentLink III: TFG1 IIA4WE, Roma
Conclusions
• Agent technology in DDM
– Preserves the autonomy of data sources and
scalability of the DM step
– Privacy protection (inherent in most DDM
approaches) may be less effective
21/07/2015
• Agent-based distributed density estimate
computation & clustering
– Scalable
– Implementation based on mobile agents
– Could be vulnerable to inference attacks on
density estimates perpetrated by coalitions
52
AgentLink III: TFG1 IIA4WE, Roma
Work in progress (1)
• Inference attacks on sampled density
estimates by solving the corresponding
system of NL equations
21/07/2015
– Globally convergent inexact Newton
methods with constraints [Bellavia,
Macconi, Morini 2003]
– Gradient method
53
AgentLink III: TFG1 IIA4WE, Roma
Work in progress (2)
• Inference attacks on sampled density
estimates by iteratively reducing the
dimensionality of the system of NL equations
– pointhunt algorithm
– Proceeds iteratively by
21/07/2015
• selecting a point x close to the “border” of the density
estimate
• guessing an object such that the object is the only
contributor to density estimate at x
54
AgentLink III: TFG1 IIA4WE, Roma
Work in progress (3)
21/07/2015
• Ontology and protocol
• Algorithms for deliberative participation based
on trustworthiness referrals
• Probabilities that a local density estimate may
be learned by another agent
– Probability that no other agent learns the density
estimate
– Probability that at least k other agents learn the
density estimate
– Probability that exactly k other agents learn the
density estimate
AgentLink III: TFG1 IIA4WE, Roma
55
Future Work
• Algorithms for the negotiation of
parameters
• Formalization of errors
21/07/2015
– Bounds on aliasing errors
– Clustering errors (e.g., using an index of
partition difference)
56
AgentLink III: TFG1 IIA4WE, Roma
21/07/2015
Thanks!
57
AgentLink III: TFG1 IIA4WE, Roma