Transcript Slides

Brahms
Byzantine-Resilient Random Membership Sampling
Bortnikov, Gurevich, Keidar, Kliot, and Shraer
April 2008

Edward (Eddie) Bortnikov, Maxim (Max) Gurevich, Idit Keidar,
Gabriel (Gabi) Kliot, Alexander (Alex) Shraer
Why Random Node Sampling
• Gossip partners
  - Random choices make gossip protocols work
• Unstructured overlay networks
  - E.g., among super-peers
  - Random links provide robustness, expansion
• Gathering statistics
  - Probe random nodes
• Choosing cache locations
The Setting
• Many nodes – n
  - 10,000s, 100,000s, 1,000,000s, …
• Come and go
  - Churn
• Every joining node knows some others
  - Connectivity
• Full network
  - Like the Internet
• Byzantine failures
Adversary Attacks
• Faulty nodes (portion f of ids)
  - Attack other nodes
  - Isolate nodes, DoS nodes
• May want to bias samples
  - Promote themselves, bias statistics
Previous Work
• Benign gossip membership
  - Small (logarithmic) views
  - Robust to churn and benign failures
  - Empirical study [Lpbcast, Scamp, Cyclon, PSS]
  - Analytical study [Allavena et al.]
  - Never proven uniform samples
  - Spatial correlation among neighbors' views [PSS]
• Byzantine-resilient gossip
  - Full views [MMR, MS, Fireflies, Drum, BAR]
  - Small views, some resilience [SPSS]
  - We are not aware of any analytical work
Our Contributions
1. Gossip-based attack-tolerant membership
   - Tolerates a linear portion f of failures
   - O(n^(1/3))-size partial views
   - Correct nodes remain connected (the view is not all bad)
   - Mathematically analyzed, validated in simulations
2. Random sampling (better than benign gossip)
   - Novel memory-efficient approach
   - Converges to proven independent uniform samples
Brahms
1. Sampling - local component
2. Gossip - distributed component
[Diagram: the Gossip component maintains the view; the Sampler component produces the sample]
Sampler Building Block
• Input: data stream, one element at a time
  - Bias: some values appear more than others
  - Used with the stream of gossiped ids
• Output: uniform random sample
  - of unique elements seen thus far
  - Independent of other Samplers
  - One element at a time (converging)
[Diagram: Sampler block with a "next" input and a "sample" output]
Sampler Implementation
• Memory: stores one element at a time
• Use a random hash function h
  - From a min-wise independent family [Broder et al.]
  - For each set X and all x ∈ X:  Pr( min{h(X)} = h(x) ) = 1/|X|
[Diagram: init chooses a random hash function; next keeps the id with the smallest hash so far; sample returns it]
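A minimal sketch of this building block in Python. A keyed BLAKE2b hash with a per-Sampler random salt stands in for a truly min-wise independent hash family; that substitution, and the method names, are assumptions for illustration, not the paper's construction.

    import hashlib
    import os

    class Sampler:
        """Keeps the id with the smallest hash seen so far (min-wise sampling sketch)."""

        def init(self):
            # "Choose a random hash function" by drawing a random salt.
            self.salt = os.urandom(16)
            self.elem = None
            self.min_hash = None

        def _h(self, node_id: bytes) -> bytes:
            return hashlib.blake2b(node_id, key=self.salt).digest()

        def next(self, node_id: bytes) -> None:
            # Keep the id with the smallest hash so far.
            h = self._h(node_id)
            if self.min_hash is None or h < self.min_hash:
                self.min_hash = h
                self.elem = node_id

        def sample(self):
            return self.elem

Feeding the same id stream to many independently initialized Samplers yields independent, converging samples of the unique ids seen, which is how component S on the next slide uses them.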
Component S: Sampling and Validation
[Diagram: the id stream from gossip is fed (via next) to an array of Samplers; each Sampler's output passes through a Validator, which uses pings to discard unresponsive ids; together they form component S and produce the sample]
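A sketch of how such a component might be wired, building on the Sampler sketch above. The `is_alive` ping callback is a hypothetical placeholder for the validation mechanism.

    class S:
        """An array of Samplers whose outputs are validated with pings (sketch)."""

        def __init__(self, size: int, is_alive):
            self.samplers = [Sampler() for _ in range(size)]
            for s in self.samplers:
                s.init()
            self.is_alive = is_alive  # hypothetical ping callback: id -> bool

        def next(self, node_id: bytes) -> None:
            # Feed every Sampler the same id stream.
            for s in self.samplers:
                s.next(node_id)

        def sample(self):
            # Re-initialize samplers whose current element fails the ping, then
            # return the currently held (validated) sample.
            for s in self.samplers:
                if s.sample() is not None and not self.is_alive(s.sample()):
                    s.init()
            return [s.sample() for s in self.samplers if s.sample() is not None]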
Gossip Process
• Provides the stream of ids for S
• Needs to ensure connectivity
• Uses a bag of tricks to overcome attacks
Gossip-Based Membership Primer
• Small (sub-linear) local view V
• V constantly changes - essential due to churn
• Typically evolves in (unsynchronized) rounds
• Push: send my id to some node in V
  - Reinforces underrepresented nodes
• Pull: retrieve the view from some node in V
  - Spreads knowledge within the network
• [Allavena et al. '05]: both are essential
  - Low probability for partitions and star topologies
Brahms Gossip Rounds
• Each round (see the sketch below):
  - Send pushes, pulls to random nodes from V
  - Wait to receive pulls, pushes
  - Update S with all received ids
  - (Sometimes) re-compute V (tricky! beware of adversary attacks)
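A high-level sketch of one round, not the paper's exact pseudocode. The `send_push`, `send_pull`, and `receive` network callbacks are hypothetical, and taking the per-round push/pull counts as α|V| and β|V| is an assumption.

    import random

    def gossip_round(my_id, view, S, alpha, beta, send_push, send_pull, receive):
        """One (unsynchronized) Brahms-style round: push, pull, then feed S."""
        # Send pushes and pull requests to random nodes from the view.
        for target in random.sample(view, k=int(alpha * len(view))):
            send_push(target, my_id)
        for target in random.sample(view, k=int(beta * len(view))):
            send_pull(target)

        # Collect this round's traffic: ids pushed to us, and views returned by our pulls.
        pushed_ids, pulled_views = receive()

        # Update the sampling component S with every id received.
        for node_id in pushed_ids:
            S.next(node_id)
        for pulled_view in pulled_views:
            for node_id in pulled_view:
                S.next(node_id)

        # (Sometimes) re-compute V; see the view-maintenance sketch later in the deck.
        return pushed_ids, pulled_views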
Problem 1: Push Drowning
[Diagram: faulty nodes (M) flood a correct node with their own pushes, drowning out the pushes from correct nodes A, B, D, E]
Trick 1: Rate-Limit Pushes
• Use limited messages to bound faulty pushes system-wide
  - E.g., computational puzzles / virtual currency
• Faulty nodes can send a portion p of them
• Views won't be all bad
Problem 2: Quick Isolation
[Diagram: the adversary's nodes concentrate their pushes on one victim at a time: "Ha! She's out! Now let's move on to the next guy!"]
Trick 2: Detection & Recovery
 Do not re-compute V in rounds when too
many pushes are received
 Slows down isolation; does not prevent it
Hey! I’m swamped!
I better ignore all of
‘em pushes…
April 2008
18
Problem 3: Pull Deterioration
[Diagram: correct nodes A–F pull from views that are already half faulty (M1–M8), so the faulty fraction grows]
50% faulty ids in views → 75% faulty ids in views
Trick 3: Balance Pulls & Pushes
• Control the contribution of push (α|V| ids) versus the contribution of pull (β|V| ids)
  - Parameters α, β
• Pull-only → eventually all faulty ids
• Push-only → quick isolation of the attacked node
• Push ensures: system-wide, not all ids are bad
• Pull slows down (but does not prevent) isolation
Trick 4: History Samples
• Attacker influences both push and pull
  ("Yoo-hoo, is there any good process out there?")
• Feed back γ|V| random ids from S into the view
• Parameters: α + β + γ = 1
• Attacker loses control - samples are eventually perfectly uniform
View and Sample Maintenance
[Diagram: the new view V is composed of α|V| pushed ids, β|V| pulled ids, and γ|V| ids fed back from the sample S; all received ids also stream into S; see the sketch below]
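A sketch of the view re-computation step under this α/β/γ split, folding in Trick 2's guard against rounds with too many pushes. The `max_pushes` threshold is a hypothetical parameter, and selecting with replacement via `random.choices` is a simplification, not the paper's exact rule.

    import random

    def recompute_view(view, pushed_ids, pulled_ids, S, alpha, beta, gamma, max_pushes):
        """Re-compute V from alpha|V| pushed ids, beta|V| pulled ids, gamma|V| history samples."""
        size = len(view)

        # Trick 2: if suspiciously many pushes arrived this round, keep the old view.
        if len(pushed_ids) > max_pushes:
            return view

        history = S.sample()  # ids fed back from the sampling component (Trick 4)
        if not pushed_ids or not pulled_ids or not history:
            return view  # nothing useful to rebuild from; keep the old view

        new_view = (
            random.choices(pushed_ids, k=int(alpha * size))   # Trick 3: push contribution
            + random.choices(pulled_ids, k=int(beta * size))  # Trick 3: pull contribution
            + random.choices(history, k=int(gamma * size))    # Trick 4: history samples
        )
        return new_view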
Key Property
• Samples take time to help
  - Assume the attack starts when samples are empty
• With appropriate parameters, e.g., |V| = |S| = Θ(n^(1/3)):
  time to isolation > time to convergence
  - Lower bound on isolation time proved using tricks 1, 2, 3 (not using samples yet)
  - Upper bound proved on the time until some good sample persists forever
• Self-healing from partitions
History Samples: Rationale
• Judicious use is essential
  - Bootstrap, avoid slow convergence
  - Deal with churn
• With a little bit of history samples (10%) we can cope with any adversary
  - Amplification!
Analysis
1. Sampling - mathematical analysis
2. Connectivity - analysis and simulation
3. Full system simulation
Connectivity ⇒ Sampling
• Theorem: If the overlay remains connected indefinitely, samples are eventually uniform
Sampling ⇒ Connectivity Ever After
• Perfect sample of a Sampler with hash h: the id with the lowest h(id) system-wide
• If correct, it sticks once the Sampler sees it
• A correct perfect sample ⇒ self-healing from partitions ever after
• We analyze PSP(t) – the probability of a perfect sample at time t
Convergence to 1st Perfect Sample
[Plot: convergence to the first perfect sample; n = 1000, f = 0.2, 40% unique ids in the stream]
Scalability
• Analysis says:  PSP(t) → 1 − e^( −|V|²·|S| / n )
• For scalability, we want small and constant convergence time
  - independent of system size, e.g., when
    |V| = |S| = Θ(n^(1/3)),  or  |V| = Θ(log n), |S| = Θ(n / log²n)
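A quick numerical check of the bound as reconstructed above (treat the exact constant in the exponent as an assumption): for both suggested scalings the exponent |V|²·|S| / n, and hence the limiting PSP, stays essentially constant as n grows.

    import math

    def limiting_psp(view_size: int, sample_size: int, n: int) -> float:
        """Limiting probability of a perfect sample, per the bound on this slide."""
        return 1.0 - math.exp(-(view_size ** 2) * sample_size / n)

    for n in (10**4, 10**6, 10**8):
        cube = round(n ** (1 / 3))
        log_n = max(1, round(math.log(n)))
        print(n,
              limiting_psp(cube, cube, n),                    # |V| = |S| = n^(1/3)
              limiting_psp(log_n, round(n / log_n ** 2), n))  # |V| = log n, |S| = n / log^2 n

For both scalings the exponent is about 1, so the limiting PSP is roughly 1 − 1/e independently of n.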
Connectivity Analysis 1:
Balanced Attacks
• Attack all nodes the same
• Maximizes faulty ids in views system-wide in any single round
• If repeated, the system converges to a fixed-point ratio of faulty ids in views, which is < 1 if
  - γ = 0 (no history) and p < 1/3, or
  - history samples are used, for any p
• There are always good ids in the views!
Fixed Point Analysis: Push
[Diagram: between time t and t+1, correct node i's pushes may be lost (rate-limited) or displaced by pushes from faulty nodes]
x(t) – portion of faulty nodes in views at round t
Portion of faulty pushes received by correct nodes:  p / ( p + (1 − p)(1 − x(t)) )
Fixed Point Analysis: Pull
[Diagram: between time t and t+1, a pull may come from a faulty node, or from correct node i, whose view entries are faulty with probability x(t)]
E[x(t+1)] = α · p / ( p + (1 − p)(1 − x(t)) )
          + β · ( x(t) + (1 − x(t)) · x(t) )
          + γ · f
(see the iteration sketch below)
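A small sketch that iterates the expectation above to its fixed point. The function names are my own, and the parameter values in the example are simply the settings quoted elsewhere in the deck (p = 0.2, α = β = 0.45, γ = 0.1, f = 0.2).

    def faulty_push_fraction(x: float, p: float) -> float:
        """Fraction of pushes received by correct nodes that come from faulty nodes."""
        return p / (p + (1 - p) * (1 - x))

    def next_x(x: float, alpha: float, beta: float, gamma: float, p: float, f: float) -> float:
        """E[x(t+1)] as on the 'Fixed Point Analysis: Pull' slide."""
        push_term = alpha * faulty_push_fraction(x, p)
        pull_term = beta * (x + (1 - x) * x)
        history_term = gamma * f
        return push_term + pull_term + history_term

    def fixed_point(alpha, beta, gamma, p, f, x0=0.0, rounds=1000):
        x = x0
        for _ in range(rounds):
            x = next_x(x, alpha, beta, gamma, p, f)
        return x

    print(fixed_point(alpha=0.45, beta=0.45, gamma=0.1, p=0.2, f=0.2))

With these settings the iteration settles well below 1, in line with the "always good ids in views" claim.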
Faulty Ids in Fixed Point
[Plot: fraction of faulty ids in views at the fixed point]
• History samples assumed perfect in the analysis; real history used in simulations
• Simulations perfectly validated the fixed points and the convergence
• With a few history samples, any portion of bad nodes can be tolerated
Convergence to Fixed Point
[Plot: convergence to the fixed point; n = 1000, p = 0.2, α = β = 0.5, γ = 0]
Connectivity Analysis 2:
Targeted Attack – Roadmap
• Step 1: analysis without history samples
  - Isolation in logarithmic time
  - … but not too fast, thanks to tricks 1, 2, 3
• Step 2: analysis of history sample convergence
  - Time-to-perfect-sample < time-to-isolation
• Step 3: putting it all together
  - Empirical evaluation
  - No isolation happens
Targeted Attack – Step 1
• Q: How fast (lower bound) can an attacker isolate one node from the rest?
• Worst-case assumptions:
  - No use of history samples (γ = 0)
  - Unrealistically strong adversary
    - Observes the exact number of correct pushes and complements it to α|V|
    - Attacked node not represented initially
    - Balanced attack on the rest of the system
Isolation w/out History Samples
[Plot: expected degree of the attacked node over rounds; n = 1000, p = 0.2, α = β = 0.5, γ = 0; isolation time marked for |V| = 60]
The attacked node's expected degree evolves roughly as
  ( E[indegree(t+1)], E[outdegree(t+1)] ) = A · ( E[indegree(t)], E[outdegree(t)] )
where A is a 2×2 matrix whose entries depend on α, β, p
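A sketch of how such a linear recursion drives the attacked node's expected degree to zero geometrically, which is why isolation takes a logarithmic number of rounds. The entries of A below are made-up illustrative values, not the ones derived from α, β, p in the analysis.

    import numpy as np

    # Hypothetical 2x2 decay matrix; in the analysis its entries are functions of alpha, beta, p.
    A = np.array([[0.7, 0.2],
                  [0.2, 0.6]])

    degree = np.array([30.0, 30.0])  # (E[indegree], E[outdegree]), e.g., |V| = 60 split evenly

    rounds = 0
    while degree.sum() >= 1.0:       # treat "less than one expected edge" as isolated
        degree = A @ degree
        rounds += 1

    print("expected isolation after", rounds, "rounds")  # grows roughly logarithmically with |V|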
Step 2: Sample Convergence
[Plot: sample convergence under the targeted attack; n = 1000, p = 0.2, α = β = 0.5, γ = 0, 40% unique ids]
• A perfect sample is obtained within 2-3 rounds
• Empirically verified
Step 3: Putting It All Together:
No Isolation with History Samples
[Plot: targeted attack with history samples; n = 1000, p = 0.2, α = β = 0.45, γ = 0.1]
• Works well despite a small PSP
Sample Convergence (Balanced)
[Plot: sample convergence under a balanced attack; p = 0.2, α = β = 0.45, γ = 0.1, |V| = |S| = 2·n^(1/3)]
• Convergence is twice as fast with |V| = |S| = 3·n^(1/3)
Summary
• O(n^(1/3))-size views
• Resist attacks / failures of a linear portion
• Converge to proven uniform samples
• Precise analysis of the impact of attacks