The Datacentric Grid and the Public/Private Boundary

David Skillicorn
School of Computing
Queen’s University, Kingston
[email protected]
www.cs.queensu.ca/home/skill
Assumption: Big datasets are immovable
• There’s not enough bandwidth (between storage and network)
• There’s not enough temporary storage space
• Some data may not cross political boundaries
• Some data may not cross social boundaries
and distributed
• They are naturally collected that way
• There are optimal sizes for data repositories
The main implication of immovable data is that code must move to the
data – reversing an assumption that’s deeply embedded in computing,
from the design of processors all the way to distributed computing.
The Datacentric Grid Project at Queen’s is exploring the implications
for data-intensive distributed applications.
Some of the infrastructure of the computational grid carries over,
but much does not (at least directly).
Scale, costs, usage patterns, and interfaces are different.
D.B. Skillicorn and D. Talia, Mining Large Data Sets on Grids: Issues and Prospects,
Computing and Informatics, Special Issue on Grid Computing Vol. 21, No. 4, 1--16, 2002.
1. Architectural implication:
Data repositories should have their own tightly-coupled compute
engines where code can execute.
[Diagram: a dataset held on network-attached storage, with tightly-coupled compute clusters attached to it.]
2. Semantics/programming languages implications:
There’s a spectrum of choices about how much functionality exists at
repositories, and how much arrives in the mobile code.
At one extreme: the mobile code selects from a static library of
operations, e.g. Netsolve, web services.
- The user interface resembles SQL or SAS diagrams.
At the other extreme: the mobile code is fully functional code that
can communicate only locally.
- The user interface resembles Java, Matlab.
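To make the two extremes concrete, here is a minimal Python sketch (all interface names are hypothetical, not taken from any existing system): at one end the mobile code merely names an operation from a fixed library held at the repository; at the other it is an arbitrary function shipped to the repository and run against the local data, communicating only locally.

    # Minimal sketch of the two extremes (hypothetical interfaces).

    # Extreme 1: the repository exposes a fixed library of operations;
    # the "mobile code" is just a request naming one of them.
    REPOSITORY_OPS = {
        "count":   lambda rows: len(rows),
        "mean":    lambda rows: sum(rows) / len(rows),
        "maximum": max,
    }

    def run_static_request(op_name, local_data):
        """Execute a pre-registered operation against the repository's data."""
        return REPOSITORY_OPS[op_name](local_data)

    # Extreme 2: the repository accepts arbitrary code and runs it against
    # its local data; only the result leaves the site.
    def run_mobile_code(code, local_data):
        """Execute user-supplied code locally against the repository's data."""
        return code(local_data)

    if __name__ == "__main__":
        dataset = [3, 1, 4, 1, 5, 9, 2, 6]            # stand-in for a large local dataset
        print(run_static_request("mean", dataset))     # SQL/SAS-like extreme
        print(run_mobile_code(lambda d: sorted(d)[:3], dataset))  # Java/Matlab-like extreme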
3. Data-mining-specific implications:
Caching Results.
The result of applying a DM technique to a dataset is a model.
Such a model may continue to have value (even to other users), so it
makes sense to cache it.
The obvious place to cache it is at one of the data repositories.
So data repositories accrete models, which may themselves play the
role of datasets for other users (e.g. a carefully chosen 10% sample).
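A minimal sketch of repository-side model caching, assuming a simple in-memory cache keyed by dataset, technique, and parameters (all names below are hypothetical):

    # Sketch of a repository-side model cache.  Models are keyed by
    # (dataset id, technique, parameters) so a later user asking for the
    # same model -- or wanting to use it as a dataset -- gets the cached
    # copy instead of triggering a recomputation.
    class ModelCache:
        def __init__(self):
            self._models = {}

        @staticmethod
        def _key(dataset_id, technique, params):
            # params is a dict; freeze it so it can be part of a dict key
            return (dataset_id, technique, tuple(sorted(params.items())))

        def get_or_build(self, dataset_id, technique, params, build):
            key = self._key(dataset_id, technique, params)
            if key not in self._models:
                self._models[key] = build()      # the expensive data-mining run
            return self._models[key]

    # Example: a cached 10% sample later reused as a dataset by another user.
    cache = ModelCache()
    sample = cache.get_or_build("census-2002", "sample", {"fraction": 0.1},
                                build=lambda: ["...10% of the rows..."])
    again = cache.get_or_build("census-2002", "sample", {"fraction": 0.1},
                               build=lambda: ["never called again"])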
Caching execution plans.
Data mining tends to be exploratory – computing one model suggests
refinements to produce a better model. Or the same model is built on
a new day’s data.
The execution plan for the second computation may be very similar to
that for the first.
Rather than recompute the best way to execute the second
computation, tweak the plan used to execute the first.
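A rough sketch of plan reuse, assuming plans are represented as lists of (repository, operation, dataset) steps (a purely illustrative representation):

    # Sketch of execution-plan reuse.  A plan is an ordered list of
    # (repository, operation, dataset) steps; building the same model on a
    # new day's data usually needs only a small tweak to yesterday's plan.
    def plan_for(query, cached_plans):
        """Reuse a cached plan for a similar earlier query when possible."""
        for old_query, old_plan in cached_plans:
            if old_query["technique"] == query["technique"]:
                # Same technique: keep the routing and ordering, swap the dataset.
                return [(repo, op, query["dataset"]) for repo, op, _ in old_plan]
        return None  # fall back to full plan construction (not shown)

    yesterday = {"technique": "association-rules", "dataset": "sales-2002-10-01"}
    yesterdays_plan = [("repoA", "scan", yesterday["dataset"]),
                       ("repoB", "count-itemsets", yesterday["dataset"])]

    today = {"technique": "association-rules", "dataset": "sales-2002-10-02"}
    print(plan_for(today, [(yesterday, yesterdays_plan)]))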
4. Information flow implications:
Three reasons to limit information flow:
Security
Privacy
Confidentiality
across three boundaries
Between user and host
Between user and user
Across the private/public boundary, i.e. between virtual
private grids and public grids
Security: “Each party does exactly what it has agreed to do”.
Computational grids emphasize protecting hosts from malicious users,
but protecting users from malicious hosts is just as important.
a. Hosts must be protected from malicious code:
Authentication (requires an enforcement side channel – cannot
be open)
Sandboxing and playgrounds (still unreliable)
Proof carrying code (cumbersome)
Fully typed code (getting interesting)
C.B. Jay, The constructor calculus.
www-staff.it.uts.edu.au/~cbj/Publications/constructors.ps
b. Users must be protected from malicious hosts:
Obfuscated code (fudge!)
Multiple execution (e.g. SETI, but expensive)
Code instrumentation
Private function evaluation (practical?)
Program transforms (encrypting constants; with multiplication and
addition on encrypted values, polynomials can be evaluated secretly)
Sander and Tschudin, Protecting Mobile Agents Against Malicious Hosts, LNCS, 1998.
c. Users must be protected from other users.
Virtual machines (strong enough?)
Privacy: “The default is that other people don’t know what I’m doing”.
Strong emotions about privacy, but a recent invention, outside
traditional morality.
The ‘denial vs. no comment’ issue.
N.B. how willing people are to surrender privacy for a benefit, e.g.
Google search.
Users must be protected from snoopy hosts, and snoopy fellow users
(also perhaps from snooping in the network).
Most of the threats are from logs – can hosts do without them?
Confidentiality: “If either party reveals what’s being done there are
real losses to one or both parties”
We want to prevent others knowing
what we are doing, and/or
even that we’re doing it.
Users need protection from hosts and from other users.
How important it is to provide these properties, and what costs can
be borne, depend on the grid environment:
1. Virtual private grids, within a single ‘organisation’;
2. Public grids, open to all who pay to use them and piggybacking on
ordinary internet resources;
3. Applications that originate within private grids but that also use
facilities from public grids.
Information flow issues within a virtual private grid:
Not important in theory because of the confluence of the interests
of the organization and the people. Still relatively easy in practice
because of visibility and control.
Security: potential consequences of violation to perpetrator.
Privacy: implicitly surrendered by individuals.
Confidentiality: preserved because information does not flow out of
the VPG.
Information flow issues within a public grid:
Security: difficult to achieve in either direction (implicit assumption
that malicious behavior is readily detectable) – viruses, worms!!
Privacy: protected only by the difficulty of actually discovering what
one individual is doing.
Confidentiality: non-existent.
Scientific research is usually conceived of as appropriate for this
space. This is a dubious assumption – almost all research has some IP
implications. Working in a public space could conceivably be
considered public disclosure for patent purposes.
Information flow issues that cross the boundary of private and public
grids:
We assume that the base of the computation is within a private grid,
but that information (and so computation) in the public grid is also
required.
We want to prevent inference of the goal of the entire computation
from:
what’s being computed in public, and/or
where the public computation is going for its data, how much work
it does at each site, etc. (traffic analysis)
Examples:
A movie distribution company bases next week’s shows on confidential
financial information – and public reviews.
An oil company bases projections of future supply on an in-house
model – and political and weather information.
A biochemical research team uses new techniques on a public dataset,
and publishes descriptive results – but not the names of interesting
genes discovered.
A struggling company looks at its cashflow projections – and federal
law about Chapter 11 and layoffs.
Two helpful technologies that provide partial solutions:
Crowds: A group of users agree to form a crowd. Each request to the
outside world is sent at random to one of the crowd. With a certain
probability, the recipient either sends it to the outside world, or
passes it on to another member of the crowd.
This simple scheme makes it easy to hide the true origin of a
computation, even with partial collusion among members of the crowd.
So crowds can combine different organizations without significant
risk.
Reiter and Rubin, Crowds: Anonymity for Web Transactions, ACM Transactions on
Information and System Security, 1997.
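The forwarding rule is simple enough to sketch directly; the following Python fragment assumes a fixed crowd membership and a forwarding probability p_f (both illustrative):

    # Sketch of the Crowds forwarding rule (after Reiter & Rubin), with an
    # assumed forwarding probability; member names and the request are
    # illustrative.
    import random

    P_FORWARD = 0.75                 # probability of staying inside the crowd
    CROWD = ["alice", "bob", "carol", "dave", "erin"]

    def submit(originator, request):
        """Route a request through the crowd; return (final sender, path)."""
        path = [originator]
        current = random.choice(CROWD)        # first hop is always inside the crowd
        path.append(current)
        while random.random() < P_FORWARD:    # with prob p_f, pass it on again
            current = random.choice(CROWD)
            path.append(current)
        # 'current' now sends the request to the outside world; an observer
        # there sees 'current', not the true originator.
        return current, path

    sender, path = submit("alice", "GET /public-reviews")
    print("outside world sees:", sender, "| internal path:", path)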
Information grazing.
Human-directed data mining is hard; it amounts to knowing what
question to ask.
It’s better if the data-mining system can pose plausible questions and
then filter the answers to find the interesting ones.
But this means that speculative data mining is happening inside an
organization – so that what is visible outside is uncoupled from what’s
going on inside the organization.
cf FUD project at Queen’s.
Ways to protect what’s being modelled:
I want to compute f(x)
In public, I compute g(x)
I need an h such that
* h ∘ g = f, i.e. h(g(x)) = f(x)
* f is hard to figure out knowing g
* g(x) is small relative to x
Since h has no access to x, g(x) has to somehow contain all of the
useful information about x.
1. h selects part of g(x) – so the real intent is hidden inside the
computation of g :
Examples:
a. Association rules – use a lower support than required and
postfilter the rules;
b. Use more dimensions of a truncated matrix decomposition than
required and throw the others away (sketched below);
c. Have g build a classifier with more targets than required and use
unions of g’s classes at deployment – there are an exponential
number of possibilities.
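Strategy 1(b) can be made concrete as follows: the public computation g returns a truncated SVD with more components than are really wanted, and the private step h keeps only the interesting ones (data, dimensions, and component indices below are illustrative; requires numpy):

    # Sketch of strategy 1(b): in public, compute a truncated SVD with more
    # dimensions than really needed (g); in private, h keeps only the
    # components of genuine interest.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 30))        # stands in for the public dataset

    def g_public(X, k_public=10):
        """Public computation: top k_public singular triples of X."""
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        return U[:, :k_public], s[:k_public], Vt[:k_public, :]

    def h_private(svd_result, wanted=(2, 5)):
        """Private selection: keep only the components we actually care about."""
        U, s, Vt = svd_result
        idx = list(wanted)
        return U[:, idx], s[idx], Vt[idx, :]

    U_w, s_w, Vt_w = h_private(g_public(X))
    print("privately retained singular values:", s_w)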
2. In public, compute g1 and g2 in such a way that they are not
connected (e.g. using crowds).
Examples:
a. If g1 and g2 access different rows of datasets, their models can
be combined using techniques from parallel data mining;
b. If g1 and g2 access different columns of datasets, their models
can be combined using techniques from distributed data mining
(random forests show that this can be very powerful);
c. g1 and g2 can be deployed using bagging;
d. Let g1 and g2 use targets such that their intersections are the
real targets;
e. Use unsupervised classifiers in public, then learn the mapping
from the cluster labels to the desired target in private (adapt
techniques from metalearning; sketched below).
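Strategy 2(e) might look like the following sketch, assuming scikit-learn is available: the public host sees only the unlabelled data and the cluster ids it produces; the mapping from clusters to the real targets is learned privately by majority vote (data and sizes are illustrative):

    # Sketch of strategy 2(e): cluster in public with no targets, then learn
    # the mapping from cluster labels to the real (private) targets in private.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(3, 1, (100, 4))])
    y_private = np.array([0] * 100 + [1] * 100)   # targets never leave the organisation

    # Public step: unsupervised clustering; the host sees only X and cluster ids.
    public_labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X)

    # Private step: map each cluster to the majority private target it contains.
    mapping = {c: int(np.bincount(y_private[public_labels == c]).argmax())
               for c in np.unique(public_labels)}
    y_pred = np.array([mapping[c] for c in public_labels])
    print("private mapping:", mapping,
          "| training accuracy:", (y_pred == y_private).mean())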
3. Use private function evaluation + private selection
General techniques for private function evaluation are known, but
they are based on a circuit evaluation ‘trick’. Knowledge of the circuit
leaks, and translating some part of the functionality into an input
doesn’t help in our setting.
Private selection can be built on top of this trick so that the owner of
the dataset doesn’t know which values are being selected (so traffic
analysis is partially prevented).
However, both of these techniques are (at present) computationally
intractable.
The Sander-Tschudin technique seems to be practical, but only works
for functions that can be cast as polynomials (which seems to include
some control flow).
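To illustrate the flavour of evaluating a secret polynomial on someone else’s data, here is a toy sketch using a Paillier-style additively homomorphic scheme (an illustration of the idea only, not the Sander-Tschudin construction itself, and with toy key sizes): the code owner encrypts the coefficients; the host, which knows its own input x in the clear, combines the ciphertexts into an encryption of f(x); only the owner can decrypt the result.

    # Toy illustration: evaluating a secret polynomial f(x) = sum a_i x^i on a
    # host's plaintext x without revealing the coefficients a_i.
    # Uses textbook Paillier with tiny parameters -- NOT secure.
    import math, random

    def small_prime(start):
        """First prime >= start, by trial division (toy sizes only)."""
        n = start
        while True:
            if n > 2 and all(n % d for d in range(2, math.isqrt(n) + 1)):
                return n
            n += 1

    p, q = small_prime(10**6), small_prime(2 * 10**6)
    n, n2 = p * q, (p * q) ** 2
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p-1, q-1)
    g = n + 1
    mu = pow(lam, -1, n)          # valid because L(g^lam mod n^2) = lam mod n

    def encrypt(m):
        r = random.randrange(1, n)
        while math.gcd(r, n) != 1:
            r = random.randrange(1, n)
        return (pow(g, m, n2) * pow(r, n, n2)) % n2

    def decrypt(c):
        return ((pow(c, lam, n2) - 1) // n) * mu % n

    # Owner side: secret polynomial f(x) = 7 + 3x + 2x^2, coefficients encrypted.
    coeffs = [7, 3, 2]
    enc_coeffs = [encrypt(a) for a in coeffs]

    # Host side: knows its own x in the clear, sees only encrypted coefficients.
    x = 11
    enc_result = 1
    for i, c in enumerate(enc_coeffs):
        # E(a)^k = E(a*k) and E(a)*E(b) = E(a+b), so this adds a_i * x^i
        enc_result = (enc_result * pow(c, x ** i, n2)) % n2

    # Owner side: decrypt the value the host returns.
    assert decrypt(enc_result) == 7 + 3 * x + 2 * x ** 2
    print("f(11) =", decrypt(enc_result))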
Summary:
1. Increasingly it’s the data rather than the computation that is
the performance-limiting factor. We need to learn how to move
functionality to data, rather than data to functionality.
2. Large-scale, geographically-distributed datasets bring their own
special problems, and require grids that are quite different from
computational grids.
3. Code that moves to data necessarily executes in environments
that are out of the control of the code’s owner. This makes
problems that were previously almost ignored suddenly critical.
4. The information flow issues require hosts to do what they are
told to do; in particular, they must not (a) do something else, or
(b) learn what the code they are executing is doing.
These are tough, but interesting, problems.
[email protected]
www.cs.queensu.ca/home/skill