History Independent Data Structures



Foundations of Privacy
Informal Lecture
Anti-Persistence or History Independent Data Structures
Lecturer: Moni Naor
Why hide your history?
• Core dumps
• Losing your laptop
– The entire memory representation
of data structures is exposed
• Emailing files
– The editing history may
be exposed (e.g. Word)
• Maintaining lists of people
– Sports teams, party invitees
Election Day



Elections for class president
Each student whispers in Mr. Drew’s ear
Mr. Drew writes down the votes

[Figure: Mr. Drew’s notebook lists the votes in order: Carol, Alice, Alice, Bob]
Problem:
Mr. Drew’s notebook leaks sensitive
information
– First student voted for Carol
– Second student voted for Alice
– …
Learning from history – only what’s
necessary
• A data structure has:
– A “legitimate” interface: the set of operations allowed to
be performed on it
– A memory representation
• The memory representation should reveal no
information that cannot be obtained from the
legitimate interface
History of history independence
The issue has been dealt with in both the cryptographic and
data structures communities
• Micciancio (1997): history independent trees
– Motivation: incremental crypto
– Based on the “shape” of the data structure, not including
memory representation
– Stronger performance model!
• Uniquely represented data structures
– Treaps (Seidel & Aragon), uniquely represented
dictionaries
– Ordered hash tables (Amble & Knuth 1974)
More History
• Persistent Data Structures: possible to reconstruct
all previous states of the data structure (Sarnak and
Tarjan)
– We want the opposite: anti-persistence
• Oblivious RAM (Goldreich and Ostrovsky)
Overview
• Definitions
• History independent open addressing hashing
• History independent dynamic perfect hashing
– Memory Management
• (Union Find)
• Open problems
Precise Definitions
• A data structure is
– history independent: if any two sequences of operations
S1 and S2 that yield the same content induce the same
probability distribution on the memory representation.
– strongly history independent: if given any two sets of
breakpoints along S1 and S2 s.t. corresponding points
have identical contents, S1 and S2 induce the same
probability distributions on memory representation at
those points.
Alternative Definition –
transition probability
Relaxations
• Statistical closeness
• Computational indistinguishability
– Example where helpful: erasing
• Allow some information to be leaked
– Total number of operations
– n-history independent: identical distributions if the last n
operations were identical as well
• Under-defined data structures: the same query can yield several
legitimate answers
– e.g. an approximate priority queue
– Define identical content: there is no suffix T such that the set of
permitted results returned by S1T differs from that returned by S2T
History independence is easy (sort of)
• If it is possible to decide on the (lexicographically)
“first” sequence of operations that produces a
certain content, just store the result of that sequence
• This gives a history independent version of a
huge class of data structures
• Efficiency is the problem…
Dictionaries
• Operations are insert(x), lookup(x) and
possibly delete(x)
• The content of a dictionary is the set of elements
currently inserted (those that have been inserted
but not deleted)
• Elements x ∈ U, some universe
• Size of table/memory: N
Goal
• Find a history independent implementation of
dictionaries with good provable performance.
• Develop general techniques for history
independence
Approaches
• Unique representation
– e.g. array in sorted order
– Yields strong history independence
• Secret randomness
– e.g. array in random order
– yields only (weak) history independence
Both approaches are sketched below.
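A minimal Python sketch of the two approaches, for a dictionary of
integers (class names are mine, not from the lecture):

    import random

    class SortedArrayDict:
        """Unique representation: the memory image is determined solely
        by the current set, giving strong history independence."""
        def __init__(self):
            self.slots = []
        def insert(self, x):
            if x not in self.slots:
                self.slots.append(x)
                self.slots.sort()      # canonical layout for every set
        def lookup(self, x):
            return x in self.slots

    class RandomOrderDict:
        """Secret randomness: every state maps to the same distribution
        over layouts (a uniformly random permutation)."""
        def __init__(self):
            self.slots = []
        def insert(self, x):
            if x not in self.slots:
                # a uniform position among n+1 slots keeps the whole
                # array a uniformly random permutation of the set
                self.slots.insert(random.randrange(len(self.slots) + 1), x)
        def lookup(self, x):
            return x in self.slots

The second version is only weakly history independent: an observer who
sees the memory at two breakpoints can correlate the two layouts.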
Open addressing: traditional version
• Each element x has a probe sequence
h1(x), h2(x), h3(x), ...
– Linear probing: h2(x) = h1(x)+1, h3(x) = h1(x)+2, ...
– Double hashing
– Uniform hashing
• An element is inserted into the first free space in its
probe sequence (see the sketch below)
– A search ends unsuccessfully at a free space
• Efficient space utilization
– Almost all the table can be full
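A sketch of the traditional rule; probe(x, i) standing in for hi+1(x)
is my naming, and the table is assumed never to fill:

    def insert_traditional(table, x, probe):
        # Walk x's probe sequence h1(x), h2(x), ... and place x in the
        # first free cell; this dependence on what was already there is
        # exactly what makes the layout history dependent.
        i = 0
        while table[probe(x, i)] is not None:
            if table[probe(x, i)] == x:
                return                     # already present
            i += 1
        table[probe(x, i)] = x

    def search_traditional(table, x, probe):
        # Search ends unsuccessfully at the first free cell.
        i = 0
        while table[probe(x, i)] is not None:
            if table[probe(x, i)] == x:
                return True
            i += 1
        return False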
Open addressing: traditional version
Not history independent: later-inserted elements move further
along in their probe sequence
[Figure: x arrived before y, so y is moved further along its probe
sequence; when there is no clash, y is inserted directly]
History independent version
• At each cell i, decide elements’ priorities
independently of insertion order
• Call the priority function pi(x,y).
• If there is a clash, move the element of lower
priority
• At each cell, priorities must form a total order
Insertion
[Figure: insertion example; at cell 2, p2(x,y)? No, so move x]
Search
• Same as in the traditional algorithm
• An unsuccessful search can stop as soon as it
finds a lower-priority element (see the combined sketch below)
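A combined sketch of history independent insertion and search, under
assumptions of mine: prio(c, a, b) is True when a has priority over b
at cell c, and cells store (element, step) pairs so that a displaced
element can resume its own probe sequence:

    def insert_hi(table, x, probe, prio):
        i = 0                              # x is at step i of its sequence
        while True:
            c = probe(x, i)
            if table[c] is None:
                table[c] = (x, i)
                return
            y, j = table[c]
            if y == x:
                return                     # already present
            if prio(c, x, y):              # x wins the cell, y moves on
                table[c] = (x, i)
                x, i = y, j + 1
            else:
                i += 1                     # y keeps the cell, x probes on

    def search_hi(table, x, probe, prio):
        i = 0
        while True:
            c = probe(x, i)
            if table[c] is None:
                return False               # free cell: x is absent
            y, _ = table[c]
            if y == x:
                return True
            if prio(c, x, y):
                return False               # x would have displaced y here
            i += 1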
No deletions
• Problematic in open addressing
• A possible way out: clusters
Strong history independence
Claim:
For all hash functions and priority functions, the final
configuration of the table is independent of the
order of insertion.
Conclusion:
Strongly history independent
Proof of history independence
A static insertion algorithm (clearly history independent):
Gather up the rejects and
restart
[Figure: example run of the static algorithm on x1, …, x6, with
captions: “p1(x2,x1), so insert x2”; “p3(x4,x5) and p3(x4,x6):
insert x4 and remove x5”; “p6(x6,x4) and p6(x3,x6), so insert x3”.
A code sketch follows.]
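A sketch of the static algorithm under the same assumptions (probe and
prio as in the earlier sketches); rejects, including displaced
occupants, retry at their next probe step:

    def insert_static(elements, table_size, probe, prio):
        # Assumes priorities form a total order per cell and the table
        # is large enough, so the process terminates.
        table = [None] * table_size
        pending = [(x, 0) for x in set(elements)]   # (element, step)
        while pending:
            rejects = []
            by_cell = {}                   # cell -> contenders this round
            for x, i in pending:
                by_cell.setdefault(probe(x, i), []).append((x, i))
            for c, contenders in by_cell.items():
                if table[c] is not None:
                    contenders.append(table[c])   # occupant competes too
                winner = contenders[0]
                for cand in contenders[1:]:
                    if prio(c, cand[0], winner[0]):
                        winner = cand
                table[c] = winner
                for cand in contenders:
                    if cand is not winner:
                        rejects.append((cand[0], cand[1] + 1))
            pending = rejects
        return table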
Proof of history independence
• Nothing moves further in the static algorithm than
in the dynamic one
– By induction on rounds of the static alg.
• Vice versa
– By induction on the steps in the dynamic alg.
• Strongly history independent
Alternative view [Blelloch-Golovin]:
Stable Matching
Some priority functions
• Global
– A single priority function independent of cell
• Random
– Choose a random order at each cell
• Youth-rules
– Call an element “younger” if it has moved less far along
its probe sequence; younger elements get higher priority
Youth-rules
[Figure: p2(x,y) holds because x has taken fewer steps than y;
use a tie-breaker if the number of steps is the same]
• This is a priority function (a code sketch follows)
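One possible reading of Youth-rules as code; steps_to(x, c) is a
hypothetical helper returning the index i for which probe(x, i) == c:

    def youth_rules(c, a, b):
        # "Younger" = fewer probe steps taken to reach cell c; the
        # younger element has the higher priority.
        sa, sb = steps_to(a, c), steps_to(b, c)
        if sa != sb:
            return sa < sb
        return a < b                       # fixed, arbitrary tie-breaker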
Specifying a scheme
• Priority rule
– Choice of priority functions
– In Youth-rules: determined by the probe sequence
• Probe functions
– How they are chosen, maintained, and computed
Implementing Youth-rules
• Let each hi be chosen from a pairwise
independent collection
– For any two x and y, the random variables hi(x) and hi(y) are
uniform and independent
• Let h1, h2, h3, … be chosen independently
– Example: hi(x) = ((ai·x mod U) + bi) mod N, where U is prime
• Space: two elements per function; only log N functions
are needed (see the sketch below)
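A sketch of one standard way to realize such a family; this uses the
textbook ((a·x + b) mod P) mod N form rather than the slide's exact
variant:

    import random

    def make_probe(num_funcs, table_size, prime):
        # One pairwise independent function per probe level; storing
        # only (ai, bi) costs two words per function.
        coeffs = [(random.randrange(1, prime), random.randrange(prime))
                  for _ in range(num_funcs)]
        def probe(x, i):
            a, b = coeffs[i]
            return ((a * x + b) % prime) % table_size
        return probe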
Performance Analysis
• Based on worst-case insertion sequence
• The important parameter: α, the fraction of the
table that is used (α·N elements)
• Analysis of expected insertion time and
search time (number of probes to the table)
– Have to distinguish successful and unsuccessful
search
Analysis via the Static Algorithm
• For insertions, the total number of probes in static
and dynamic algorithm are identical
– Easier to analyze the static algorithm
• Key point for Youth-rules: in phase i, all
unsettled elements are at the ith probe of their
sequence
– This assures fresh randomness of hi(x)
Performance
For Youth-rules, implemented as specified:
• For any sequence of insertions, the expected
probe-time for insertion is at most 1/(1-α)
• For any sequence of insertions, the expected
probe-time for successful or unsuccessful search
is at most 1/(1-α)
• Analysis based on the static algorithm (see the
calculation below)
α is the fraction of the table that is used
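The bound matches the usual geometric-series intuition (a heuristic
reading, not the paper's proof): if each fresh probe hits an occupied
cell with probability at most α, then

    E[\text{probes}] \le \sum_{i \ge 0} \alpha^i = \frac{1}{1-\alpha}

so at α = 0.9, for instance, about 10 probes are expected.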
Comparison to double hashing
• Analysis of double hashing with truly random
functions [Guibas & Szemeredi, Lueker &
Molodowitch]
• Can be replaced by log n-wise independent
functions [Schmidt & Siegel]
• log n-wise independence is relatively expensive:
– either a lot of space or log n time
Youth-rules is a simple and provably efficient
scheme with very little extra storage
Extra benefit of considering history independence
Other Priority Functions
• [Amble & Knuth]: log(1/(1-α)) for the global rule
– With truly random hash functions
• Experiments show about log(1/(1-α)) for
most priority functions tried
– Performance is for amortized search
Other types of data structures
• Memory management (dealing with pointers)
– Memory Allocation
• Other state-related issues
Dynamic perfect hashing:
FKS scheme, dynamized
[Figure: n elements x1, …, xn are to be inserted. A top-level table
of O(n) space, indexed by hash function h, splits them into buckets
of sizes s0, s1, …, sk; bucket i gets a low-level table of size
about si², indexed by its own hash function hi. The low-level
tables take O(n) space in total.]
The hi are perfect on their respective sets.
Rechoose h or some hi to maintain perfection and linear space.
(A lookup sketch follows.)
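A sketch of the resulting two-probe lookup (the field names subtables,
cells, h and hi are illustrative, not the paper's):

    def fks_lookup(top, x):
        # Probe 1: the top-level hash picks a bucket.
        bucket = top.subtables[top.h(x)]
        if bucket is None:
            return False
        # Probe 2: the bucket's perfect hash picks a cell in a table of
        # size about si^2, so distinct elements never collide.
        return bucket.cells[bucket.hi(x)] == x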
A subtle problem:
the intersection bias problem
Suppose we have:
– a set of states {σ1, σ2, ...}
– a set of objects {h1, h2, ...}
– a way to decide whether hi is “good” for σj
• Keep a “current” h as states change
• Change h only if it is no longer “good”
– Choose uniformly from the “good” ones for σ
Then this is not history independent:
– h is biased towards the intersection of the objects good
for the current σ and for previous states (see the simulation below)
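A toy simulation of the bias (the states, objects and "good" relation
are invented for illustration):

    import random
    from collections import Counter

    good = {'A': {0, 1}, 'B': {1, 2}}      # good[state] = good objects

    def final_h(history):
        # Keep the current h across state changes; rechoose uniformly
        # from the good ones only when h stops being good.
        h = None
        for state in history:
            if h not in good[state]:
                h = random.choice(sorted(good[state]))
        return h

    # Both histories end in state 'B' (same content), yet the
    # distributions differ: ['B'] gives h = 1 or 2 with probability
    # 1/2 each, while ['A', 'B'] gives h = 1 with probability 3/4,
    # biased toward the intersection of good('A') and good('B').
    for hist in (['B'], ['A', 'B']):
        freq = Counter(final_h(hist) for _ in range(100_000))
        print(hist, {h: n / 100_000 for h, n in sorted(freq.items())})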
Dynamized FKS is not history independent
• Does not erase upon deletion
• Uses history-dependent memory allocation
• Hash functions (h, h1, h2, ...) are changed
whenever they cease to be “good”
– Hence they suffer from the intersection bias problem,
since they are biased towards functions that were
“good” for previous sets of elements
– Hence they leak information about past sets of
elements
Making it history independent
• Use history independent memory allocation
• Upon deletion, erase the element and rechoose
the appropriate hi. This solves the low-level
intersection bias problem.
• Some other minor changes
• Solve the top-level intersection bias problem...
Solving the top-level intersection bias
problem
• Can’t afford a top-level rehash on every deletion
• Generate two “potential h”s, ĥ1 and ĥ2, at the
beginning
– Always use the first “good” one
– If neither is good, rehash at every deletion
– If not using ĥ1, keep a top-level table for it for easy
“goodness” checking (likewise for ĥ2)
Proof of history independence
Table’s state is defined by:
• The current set of elements
• Top-level hash functions
– Always the first “good” ĥi, or rechosen at each step
• Low-level hash functions
– Uniformly chosen from perfect functions
• Arrangement of sub-tables in memory
– Use history-independent memory allocation
• Some other history independent things
Performance
• Lookup takes two steps
• Insertion and deletion take expected amortized
O(1) time
– There is a 1/poly chance that they will take more
SHI and Unique Representation
Theorem [Hartline et al]: for a reversible data
structure to be SHI, a canonical (unique)
representation for each state must be determined
during the data structure’s initialization.
SHI with Deletions
• Blelloch and Golovin: a dictionary based on linear
probing
– Goal: search in O(1) time (guaranteed)
– Each cluster has size O(log n)
– Can be obtained using 5-wise independence [Pagh et
al., STOC 2007]
– Needs a ‘random oracle’ for the high-level intersection bias
Open Problems
• Better analysis for Youth-rules, as well as other
priority functions, with no random oracles
• Efficient memory allocation
– ours is O(s log s)
• Separations
– Between strong and weak history independence
[Buchbinder-Petrank]
– Between history independent and traditional versions
• e.g. for Union Find
• Can persistence and (computational) history
independence co-exist efficiently?
References
• Moni Naor and Vanessa Teague, Anti-Persistence: History
Independent Data Structures, STOC 2001
• Hartline, Hong, Mohr, Pentney and Rocke, Characterizing History
Independent Data Structures, Algorithmica 2005
• Buchbinder and Petrank, Lower and Upper Bounds on Obtaining
History Independence, Information and Computation 2006
• Guy Blelloch and Daniel Golovin, Strongly History-Independent
Hashing with Applications, FOCS 2007
• Tal Moran, Moni Naor and Gil Segev, Deterministic History-
Independent Strategies for Storing Information in Write-Once
Memories, ICALP 2007