Transcript Slide 1
Brahms
Byzantine-Resilient
Random Membership Sampling
Bortnikov, Gurevich, Keidar, Kliot, and Shraer
April 2008
1
Edward (Eddie) Bortnikov
Gabriel (Gabi) Kliot
April 2008
Maxim (Max) Gurevich
Idit Keidar
Alexander (Alex) Shraer
2
Why Random Node Sampling
Gossip partners
Random choices make gossip protocols work
Unstructured overlay networks
E.g., among super-peers
Random links provide robustness, expansion
Gathering statistics
Probe random nodes
Choosing cache locations
April 2008
3
The Setting
Many nodes – n
10,000s, 100,000s, 1,000,000s, …
Come and go
Churn
Full network
Like the Internet
Every joining node knows some others
April 2008
(Initial) Connectivity
4
Adversary Attacks
Faulty nodes (portion f of ids)
Attack other nodes
Byzantine failures
May want to bias samples
April 2008
Isolate nodes, DoS nodes
Promote themselves, bias statistics
5
Previous Work
Benign gossip membership
Small (logarithmic) views
Robust to churn and benign failures
Empirical study [Lpbcast,Scamp,Cyclon,PSS]
Analytical study [Allavena et al.]
Never proven uniform samples
Spatial correlation among neighbors’ views [PSS]
Byzantine-resilient gossip
Full views [MMR,MS,Fireflies,Drum,BAR]
Small views, some resilience [SPSS]
We are not aware of any analytical work
April 2008
6
Our Contributions
1. Gossip-based attack-tolerant membership
Linear portion f of failures
The view is
not all bad
O(n1/3)-size partial views
Correct nodes remain connected
Mathematically analyzed, validated in
simulations
Better than
2. Random sampling
April 2008
benign
gossip
Novel memory-efficient approach
Converges to proven independent uniform
samples
7
Brahms
1. Sampling - local component
2. Gossip - distributed component
Gossip
view
Sampler
sample
April 2008
8
Sampler Building Block
Input: data stream, one element at a time
Bias: some values appear more than others
Used with stream of gossiped ids
Output: uniform random sample
April 2008
of unique elements seen thus far
Independent of other Samplers
One element at a time (converging)
next
Sampler
sample
9
Sampler Implementation
Memory: stores one element at a time
Use random hash function h
From min-wise independent family [Broder et al.]
For each set X, and all x X ,
1
Pr(min{h( X )} h( x))
|X|
init
Choose random hash
function
next
Sampler
Keep id with smallest
hash so far
sample
April 2008
10
Component S: Sampling and Validation
id stream
from gossip
init
next
Sampler
using
pings
Sampler
Sampler
Sampler
Validator
Validator
Validator
sample
Validator
S
April 2008
11
Gossip Process
Provides the stream of ids for S
Needs to ensure connectivity
Use a bag of tricks to overcome attacks
April 2008
12
Gossip-Based Membership Primer
Small (sub-linear) local view V
V constantly changes - essential due to churn
Typically, evolves in (unsynchronized) rounds
Push: send my id to some node in V
Reinforce underrepresented nodes
Pull: retrieve view from some node in V
Spread knowledge within the network
[Allavena et al. ‘05]: both are essential
Low probability for partitions and star topologies
April 2008
13
Brahms Gossip Rounds
Each round:
Send pushes, pulls to random nodes from V
Wait to receive pulls, pushes
Update S with all received ids
(Sometimes) re-compute V
April 2008
Tricky! Beware of adversary attacks
14
Problem 1: Push Drowning
A
E
M
M
M
BM
D
M
April 2008
15
Trick 1: Rate-Limit Pushes
Use limited messages to bound faulty
pushes system-wide
Assume at most p of pushes are faulty
E.g., computational puzzles/virtual currency
Faulty nodes can send portion p of them
Views won’t be all bad
April 2008
16
Problem 2: Quick Isolation
A
E
Ha! She’s out!
Now let’s move on
to the next guy!
M
C
D
April 2008
17
Trick 2: Detection & Recovery
Do not re-compute V in rounds when too
many pushes are received
Hey! I’m swamped!
I better ignore all of
‘em pushes…
Slows down isolation; does not prevent it
Isolation takes log (view size) time
April 2008
18
Problem 3: Pull Deterioration
A
C
D
M3 M4
E
F
M5 M6
B M1 M2
M3 E M7 M8
50% faulty ids in views 75% faulty ids in views
April 2008
19
Trick 3: Balance Pulls & Pushes
Control contribution of push - α|V| ids
versus contribution of pull - β|V| ids
Parameters α, β
Pull-only eventually all faulty ids
Push-only quick isolation of attacked
node
Push ensures: system-wide not all bad ids
Pull slows down (does not prevent) isolation
April 2008
20
Trick 4: History Samples
Attacker influences both push and pull
Yoo-hoo, is there any good
process out there?
Feedback γ|V| random ids from S
Parameters α + β + γ = 1
Attacker loses control - samples are
eventually perfectly uniform
April 2008
21
View and Sample Maintenance
Pushed ids
Pulled ids
|V|
|V|
View V
April 2008
|V|
S
Sample
22
Samples Take Time To Help
Judicious use essential (e.g., 10%)
Deal with churn, bootstrap
Need connectivity for uniform
sampling
Rely on sampling for connectivity?
Break cycle: Assume attack starts
when samples are empty
April 2008
Analyze time to partition without samples
Samples become useful at least as fast
23
Key Property
With appropriate parameters
3
E.g., | V || S | ( n )
Time to partition > time to convergence of
1st good sample
Prove lower bound
using tricks 1,2,3
(not using samples yet)
Prove upper bound
until some good sample
persists forever
Self-healing from partitions
April 2008
24
Time to Partition Analysis
Easiest partition to cause – isolate one node
Targeted attack
Analysis of targeted attack:
April 2008
Assume unrealistically strong adversary
Analyze evolution of faulty ids in views as a
random process
Show lower bound on time to isolation
Depends on p, |V|, α, β
Independent of n
25
Scalability of PSP(t) – Probability
of Perfect Sample at Time t
Analysis says:
PSP(t ) (1 e
|V |2 |S |
n
)
For scalability, want small and constant
convergence time
independent of system size, e.g., when
| V || S | ( 3 n )
n
| V | (log n),| S | ( 2 )
log n
April 2008
26
Analysis
1. Sampling - mathematical analysis
2. Connectivity - analysis and simulation
3. Full system simulation
April 2008
27
Connectivity Sampling
Theorem: If overlay remains connected
indefinitely, samples are eventually uniform
April 2008
28
Sampling Connectivity Ever After
Perfect sample of a sampler with hash h:
the id with the lowest h(id) system-wide
If correct, sticks once the sampler sees it
Correct perfect sample self-healing from
partitions ever after
April 2008
29
Convergence to 1st Perfect Sample
n = 1000
f = 0.2
40% unique ids
in stream
April 2008
30
Connectivity Analysis 1:
Balanced Attacks
Attack all nodes the same
Maximizes faulty ids in views system-wide
in any single round
If repeated, system converges to fixed point
ratio of faulty ids in views, which is < 1 if
γ=0 (no history) and p < 1/3 or
History samples are used, any p
There are always good
ids in views!
April 2008
31
Fixed Point Analysis: Push
Local view node 1
Local view node i
i
Time t:
lost push
push
push from
faulty node
Time t+1:
1
x(t) – portion of faulty nodes in views at round t;
portion of faulty pushes to correct nodes :
p / ( p + ( 1 − p )( 1 − x(t) ) )
April 2008
32
Fixed Point Analysis: Pull
Local view node 1
Local view node i
i
Time t:
pull from faulty
pull from i: faulty with probability x(t)
Time t+1:
E[x(t+1)] =
April 2008
p / (p + (1 − p)(1 − x(t)))
+ ( x(t) + (1-x(t))x(t) )
+ γf
33
Faulty Ids in Fixed Point
Assumed perfect in
analysis, real history
in simulations
Perfectly validated
fixed points
and convergence
April 2008
With a few history
samples, any portion
of bad nodes can be
tolerated
34
Convergence to Fixed Point
n = 1000
p = 0.2
α=β=0.5
γ=0
April 2008
35
Connectivity Analysis 2:
Targeted Attack – Roadmap
Step 1: analysis without history samples
Isolation in logarithmic time
… but not too fast, thanks to tricks 1,2,3
Step 2: analysis of history sample convergence
Time-to-perfect-sample < Time-to-Isolation
Step 3: putting it all together
April 2008
Empirical evaluation
No isolation happens
36
Targeted Attack – Step 1
Q: How fast (lower bound) can an attacker
isolate one node from the rest?
Worst-case assumptions
No use of history samples ( = 0)
Unrealistically strong adversary
April 2008
Observes the exact number of correct pushes
and complements it to α|V|
Attacked node not represented initially
Balanced attack on the rest of the system
37
Isolation w/out History Samples
n = 1000
p = 0.2
α=β=0.5
γ=0
Isolation time
for |V|=60
Depend on
α,β,p
E(indegree(t 1))
E(indegree(t))
A2x2
, Ai, j 1
i
E(outdegree(t 1))
E(outdegree(t)
April 2008
38
Step 2: Sample Convergence
n = 1000
p = 0.2
Perfect sample in
2-3 rounds
α=β=0.5, γ=0
40% unique ids
Empirically
verified
April 2008
39
Step 3: Putting It All Together
No Isolation with History Samples
n = 1000
p = 0.2
Works well despite
small PSP
α=β=0.45
γ=0.1
April 2008
40
Sample Convergence (Balanced)
| V || S | 23 n
p = 0.2
α=β=0.45
| V || S | 23 N
γ=0.1
Convergence twice as fast with | V || S | 33 n
April 2008
41
Summary
O(n1/3)-size views
Resist attacks / failures of linear portion
Converge to proven uniform samples
Precise analysis of impact of attacks
April 2008
42