Transcript 10937
Worms, Viruses, and Cascading
Failures in networks
D. Towsley
U. Massachusetts
Collaborators: W. Gong, C. Zou (UMass)
A. Ganesh , L. Massoulie (Microsoft)
Internet as enabler of
terrific apps
Internet as enabler of
terrific apps
… but also of malicious
behavior
worms, viruses
Internet as a complex system
critical DNS, BGP
infrastructures
Worms and failures
Code Red worm
SQL Slammer
more than 360,000 infected in less than one day
disrupted parts of BGP infrastructure
less than 15 minutes to infect 75,000 hosts
congested parts of Internet
BGP errors in one network → cascade of
faults in BGP in another network
Goals
what
are appropriate models?
deterministic
stochastic
what
makes worm/virus/failure
virulent?
how does topology affect virulence?
Outline
worms,
deterministic models
cascading failures, stochastic models
summary
Worm spreading behavior
scan
for vulnerable hosts
sequential, random, topological
uniform, local preference
virulence
sensitive to
scanning strategy
host speed, bandwidth
protocol
…
Worm spreading model
address
N
space, size W
vulnerable hosts
scan
rate (per host), h
W
N
Simple worm spreading model
I(t) - number of infected hosts at
time t
I(t)
W
I(t)(N I(t))
with initial condition
I(0)
number infected hosts
Epidemic model:
h
4.E+05
3.E+05
2.E+05
1.E+05
0.E+00
time
Code Red: model
measurements from
two Class A networks
scan rate I(t)
epidemic model
matches increasing
part of observed
Code Red data
(Staniford)
What about decrease?
human
countermeasures
congestion
Zou, etal, 2002
600,000
D. Goldsmith
K. Eichman
500,000
scan rate
400,000
300,000
200,000
100,000
0
04:00
09:00
14:00
19:00
00:00
# of scans
time
600000
500000
# of Scans( Eichman)
Model
400000
300000
200000
100000
0
2
4
6
8
10
12
14
16
18
Assumptions
classic epidemic model
ignore countermeasures
ignore congestion
Code Red parameters
h = 358/min
N = 360,000
uniform scan, W 232
I(0) = 10
100s minutes to spread
400000
350000
300000
No. infected
250000
200000
150000
100000
50000
0
0
100
200
300
400
Time (minutes)
500
600
Worm virulence
increase
h
increase I(0)
decrease W
Worm virulence
increase
h
increase I(0)
decrease W
smarter scanning
The perfect worm
perfect worm
scan vulnerable nodes exactly once
h
I (t )
I (t ) I (t ) min( e ht / N , N )
N
flash worm
(Staniford,…)
uniform scan of vulnerable nodes (W N)
h
I (t ) I (t )(N I (t ))
N
Perfect Code Red worm
I(0) = 10
h = 358/min
N = 360,000
all hosts infected
within 2 sec.
add 2 sec. infection
delay
-> six-fold slowdown
random scan almost
perfect!
4.E+05
No. infected
4.E+05
3.E+05
perfect worm
with delay
flash worm
with delay
3.E+05
2.E+05
2.E+05
1.E+05
5.E+04
0.E+00
0
2
4 6 8 10 12 14
Time (seconds)
Perfect Code Red worm
4.E+05
No. infected
4.E+05
3.E+05
perfect worm
with delay
flash worm
with delay
3.E+05
2.E+05
2.E+05
4.E+05
1.E+05
4.E+05
5.E+04
3.E+05
0.E+00
No. infected
I(0) = 10
h = 358/min
N = 360,000
all hosts infected
within 2 sec.
add 2 sec. infection
delay
-> six-fold slowdown
random scan almost
perfect!
3.E+05
2.E+05
2 4 6 8 10 12 14
02.E+05
1.E+05 Time (seconds)
5.E+04
0.E+00
0
100 200 300 400 500 600
Time (minutes)
Hitlist, routing worms
hitlist
worm
increases I(0)
routing
worm
decreases W
BGP table information:
W = .29 232
– 29% of IP address space
Hitlist, routing worms
Code Red style worm
h = 358/min
N = 360,000
hitlist, I(0) = 10,000
350000
300000
No. infected
routing worm as
effective as hitlist
worm
hitlist/routing worm
extremely virulent
400000
250000
200000
150000
Code Red worm
100000
Hit-list worm
50000
Routing worm
Hitlist routing worm
0
0
100
200
300
400
Time (minutes)
500
600
Local preference worm
K subnetworks
p – probability scan
local subnet
(1-p) – prob. scan
outside local subnet
p
1
1-p
2
…
K
Local preference worm
Nk,
no. vulnerable hosts in subnet k
Ik(t),
no. infected hosts in subnet k
fits
epidemic model for interacting
groups
set of coupled ODEs
Local preference worm
K = 116
Nk = 360,000/K
I1(0) = 10;
Ik(0) = 0, k>1
h = 358/min
400000
350000
No. infected
300000
provides some of the
locality of a routing
worm
250000
200000
150000
Uniform worm
Routing worm
100000
Preference p=0.1
Preference p=0.5
Preference p=0.99
50000
0
0
100
200
300
400
Time (minutes)
500
600
Questions
topological
worms
sequential scan
bandwidth constraints
topology?
failure
recovery?
Topology and fast/slow
recovery
model
description
general
conditions for fast-slow recovery
specific
network topologies
network topologies
complete graphs (BGP routers)
hypercubes (peer-to-peer networks)
power-law graphs (Internet AS graph; Email address book graph)
Susceptible-Infective-Susceptible
(SIS) epidemic model
Also known as contact process; see
[Liggett]
topology:
undirected, finite graph
G=(V,E), connected ;
Xv = 1 if node v down (infected)
Xv = 0 if node v up (healthy)
Model
{Xv
vV} Markov process on {0,1}V with
jump rates:
Xv → 1 with rate w→vXw
Xv → 0 with rate
unique
absorbing state at 0
all other states communicate, 0 is
reachable
Time to absorption
system
eventually recovers
how
long does this take?
T = time to hit 0 (from a given initial
condition)
how
does E[T] depend on , , G?
Example
G
= line segment or ring with n nodes
Fix 1
Theorem (Durrett and Liu): There is
critical c > 0 such that,
if < c , then E[T] = O(log n)
if > c , then log E[T] ≈ na
signature of phase transition in
infinite 1-D lattice.
Fast recovery, spectral radius
- spectral radius of graph adjacency
matrix, A; n=|V| .
Then,
P(X(t) 0) ≤ c n½ exp([ -]t)
Hence, when
< ,
Survival time T satisfies:
E(T) ≤ [log(n)+1]/[ - ]
Coupling proof
Consider “Branching Random Walk”, i.e.
Markov process {Yv}vV
Yv → Yv +1 with rate w~v Yw = (AY)v
Yv → Yv -1 with rate Yv
Can couple processes so that, for all t,
X(t) ≤ Y(t).
Branching random walk bound
By “linearity” of Y,
dE[Y(t)]/dt = ( A - I) Y(t),
so
E[Y(t)] = exp( A - I) Y(0) ;
Use P(X(t) 0) ≤ vV E[Yv(t)]
Slow recovery
Graph isoperimetric constant:
| E (S , S ) |
h min
S :|S | n / 2
|S |
“perimeter”
“area”
S
S
Generalized isoperimetric constant
| E (S , S ) |
hm inf
S :|S |m
|S |
Slow die-out and isoperimetric
constant
Suppose for some m ≤ n/2,
r := [ hm] / > 1
Then, with positive probability,
epidemics survive for time at least rm/[2m]
Hence, if m = na, survival time T satisfies
log (E[T]) = W(na)
Coupling proof
Let |X| = v Xv .
Then |X| dominates process Z on {0,…,m} with
transition rates:
z→ z+1 at rate h z,
z→ z-1 at rate z.
Then study absorption time for Z
Complete graph
Here, = n-1, hm = n-m
By picking m = na, a < 1,
Thresholds:
fast recovery if / < 1/(n-1)
slow recovery if / > 1/(n-na)
Hypercube {0,1}d
Here, d = log2(n) and = d
For m=2k, k < d, hm = d-k
Hence, for k = d,
Thresholds: ,
fast recovery if / < 1/d
slow recovery if / > 1/[d(1-)]
Erdős-Rényi random graph
edge between each pair of nodes present
with probability pn independent of others
dense: dn := npn = Ω(log n)
then ρ ~ h ~ dn with high probability
Star network
spectral radius: n1/2
isoperimetric constant: hm = 1 for all m < n/2
general results not useful
Specialized analysis yields:
for arbitrary constant c > 0, if / < c/n1/2,
fast recovery, E[T] = O(log(n))
if / > na-1/2 , for a > 0,
slow recovery, log(E[T]) = W(na)
Power-law random graph
Power-law graph with exponent :
number of degree k vertices k-
E.g. Internet AS graph with = 2.1
Expected degree PLRG [Chung et al]:
expected degrees w1 > ··· > wn :
edge (i,j) present w.p. wi wj/k wk
particular choice: wi = c1(i+c2)-1/( -1)
Power-law random graph (2)
Spectral radius of PLRG [Chung et al.,03]:
Denote by m max. expected degree (m=w1), and
by d average of expected degrees.
Then:
( m ), > 2.5
(A )
3
(m ), 2 < < 2.5
PLRG, > 2.5
Epidemics on full graph live longer
than on sub-graph.
Look at star induced by node 1:
slow die-out for / > ma-1/2
Compare to spectral radius condition:
Fast die-out for / < m-1/2
Two thresholds differ by ma ; same
gap as for star
PLRG, 2 < < 2.5
Consider top N nodes, for suitable N;
Erdős-Rényi core, with
isoperimetric constant:
h = F()
Gap between thresholds h and :
constant factor, F()
Open problems
gap between upper and lower bounds in
sparse ER graphs
power law random graphs for < 2.5
spectral radius bound tight in examples,
always true?
conditioned on slow recovery, how many nodes
are down at intermediate times?
extensions to other graphs and to SIR
epidemics
Observations
neither
parameter tight
gap for topologies with diverse degrees
spectral radius “seems” to be right
nothing
between log n and exp(na) ?
Hitlist, routing worms
hitlist worm
increase I(0)
routing worm
decrease W
BGP table information: W = .29 232
– 29% of IP address space
/8 aggregation: W = .45
232
– 116 out of 256 possible 8 bit prefixes
0110…0xxx
8
The appearance of phase
transitions
N=200, ks =1, kl=0.01
Mean time to absorption goes down from 1047 ,
to about 0 in a matter of few states
Accuracy of fluid model
300000
250000
200000
150000
Mean
95%
5%
100000
50000
0
1
10
0
20
0
30
0
40
0
50
0
60
0
70
0
80
0
population: 360,000
scan rate h =
N(358/min, 1002)
normal distr.
scanning space: 232
I(0) =1
100 simulations
No. Infected
400000
350000
time (minutes)
Accuracy of fluid model
population:
360,000
scan rate h =
N(358/min, 1002)
normal distr.
scanning space: 232
I(0) =10
100 simulations
400000
Code Red worm
350000
No. Infected
300000
Mean
250000
200000
150000
100000
50000
0
1
100 200 300 400 500 600 700
time (minutes)
Accuracy of fluid model
population: 360,000
scan rate h =
N(358/min, 1002)
normal distr.
scanning space: 232
I(0) =10
100 simulations
Mean
350000
95%
5%
300000
No. Infected
400000
250000
200000
150000
100000
50000
0
1
100 200 300 400 500 600 700
time (minutes)
Local preference worm
Ik t I k (t ) 'I j (t ) Nk I k (t )
j k
- local scan rate
’- global scan rate
initial conditions Ik(0)
Erdős-Rényi random graph
edge between each pair of nodes present
with probability pn independent of others
sparse: pn = c log(n)/n, c > 1.
then ρ ≤ c’ log(n), h ≥ c’’ log(n)
with high probability, for some c’’ < c < c’
dense: dn := npn = Ω(log n)
then ρ ~ h ~ dn with high probability