Transcript 10937

Worms, Viruses, and Cascading
Failures in networks
D. Towsley
U. Massachusetts
Collaborators: W. Gong, C. Zou (UMass)
A. Ganesh , L. Massoulie (Microsoft)

Internet as enabler of
terrific apps
Internet as enabler of
terrific apps
 … but also of malicious
behavior



worms, viruses
Internet as a complex system

critical DNS, BGP
infrastructures
Worms and failures

Code Red worm



SQL Slammer



more than 360,000 infected in less than one day
disrupted parts of BGP infrastructure
less than 15 minutes to infect 75,000 hosts
congested parts of Internet
BGP errors in one network → cascade of
faults in BGP in another network
Goals
 what
are appropriate models?
deterministic
 stochastic

 what
makes worm/virus/failure
virulent?
 how does topology affect virulence?
Outline
 worms,
deterministic models
 cascading failures, stochastic models
 summary
Worm spreading behavior
 scan
for vulnerable hosts
sequential, random, topological
 uniform, local preference

 virulence
sensitive to
scanning strategy
 host speed, bandwidth
 protocol
…

Worm spreading model
 address
N
space, size W
vulnerable hosts
 scan
rate (per host), h
W
N
Simple worm spreading model
I(t) - number of infected hosts at
time t
I(t) 
W
I(t)(N  I(t))
with initial condition
I(0)
number infected hosts
Epidemic model:
h

4.E+05
3.E+05
2.E+05
1.E+05
0.E+00
time
Code Red: model
measurements from
two Class A networks


scan rate  I(t)
epidemic model
matches increasing
part of observed
Code Red data
(Staniford)
What about decrease?
 human
countermeasures
 congestion
Zou, etal, 2002
600,000
D. Goldsmith
K. Eichman
500,000
scan rate

400,000
300,000
200,000
100,000
0
04:00
09:00
14:00
19:00
00:00
# of scans
time
600000
500000
# of Scans( Eichman)
Model
400000
300000
200000
100000
0
2
4
6
8
10
12
14
16
18
Assumptions
classic epidemic model



ignore countermeasures
ignore congestion
Code Red parameters


h = 358/min
N = 360,000
uniform scan, W  232
 I(0) = 10
 100s minutes to spread

400000
350000
300000
No. infected

250000
200000
150000
100000
50000
0
0
100
200
300
400
Time (minutes)
500
600
Worm virulence
 increase
h
 increase I(0)
 decrease W
Worm virulence
 increase
h
 increase I(0)
 decrease W
 smarter scanning
The perfect worm

perfect worm

scan vulnerable nodes exactly once
h

I (t ) 
I (t )  I (t )  min( e ht / N , N )
N

flash worm

(Staniford,…)
uniform scan of vulnerable nodes (W  N)
h

I (t )  I (t )(N  I (t ))
N
Perfect Code Red worm
I(0) = 10
 h = 358/min
 N = 360,000
 all hosts infected
within 2 sec.
 add 2 sec. infection
delay
-> six-fold slowdown
 random scan almost
perfect!

4.E+05
No. infected
4.E+05
3.E+05
perfect worm
with delay
flash worm
with delay
3.E+05
2.E+05
2.E+05
1.E+05
5.E+04
0.E+00
0
2
4 6 8 10 12 14
Time (seconds)
Perfect Code Red worm
4.E+05
No. infected
4.E+05
3.E+05
perfect worm
with delay
flash worm
with delay
3.E+05
2.E+05
2.E+05
4.E+05
1.E+05
4.E+05
5.E+04
3.E+05
0.E+00
No. infected
I(0) = 10
 h = 358/min
 N = 360,000
 all hosts infected
within 2 sec.
 add 2 sec. infection
delay
-> six-fold slowdown
 random scan almost
perfect!

3.E+05
2.E+05
2 4 6 8 10 12 14
02.E+05
1.E+05 Time (seconds)
5.E+04
0.E+00
0
100 200 300 400 500 600
Time (minutes)
Hitlist, routing worms
 hitlist

worm
increases I(0)
 routing
worm
decreases W
 BGP table information:
W = .29  232

– 29% of IP address space
Hitlist, routing worms
Code Red style worm
 h = 358/min
 N = 360,000
 hitlist, I(0) = 10,000

350000
300000
No. infected
routing worm as
effective as hitlist
worm
 hitlist/routing worm
extremely virulent
400000
250000
200000
150000

Code Red worm
100000
Hit-list worm
50000
Routing worm
Hitlist routing worm
0
0
100
200
300
400
Time (minutes)
500
600
Local preference worm
K subnetworks
 p – probability scan
local subnet
 (1-p) – prob. scan
outside local subnet
p

1
1-p
2
…
K
Local preference worm
 Nk,
no. vulnerable hosts in subnet k
 Ik(t),
no. infected hosts in subnet k
 fits
epidemic model for interacting
groups
set of coupled ODEs
Local preference worm
K = 116
 Nk = 360,000/K
 I1(0) = 10;
Ik(0) = 0, k>1
 h = 358/min


400000
350000
No. infected
300000
provides some of the
locality of a routing
worm
250000
200000
150000
Uniform worm
Routing worm
100000
Preference p=0.1
Preference p=0.5
Preference p=0.99
50000
0
0
100
200
300
400
Time (minutes)
500
600
Questions
 topological
worms

sequential scan

bandwidth constraints
 topology?
 failure
recovery?
Topology and fast/slow
recovery
 model
description
 general

conditions for fast-slow recovery
 specific



network topologies
network topologies
complete graphs (BGP routers)
hypercubes (peer-to-peer networks)
power-law graphs (Internet AS graph; Email address book graph)
Susceptible-Infective-Susceptible
(SIS) epidemic model
Also known as contact process; see
[Liggett]
 topology:
undirected, finite graph
G=(V,E), connected ;
 Xv = 1 if node v down (infected)
Xv = 0 if node v up (healthy)
Model
 {Xv
vV} Markov process on {0,1}V with
jump rates:
 Xv → 1 with rate  w→vXw
 Xv → 0 with rate 
 unique
absorbing state at 0
 all other states communicate, 0 is
reachable
Time to absorption
 system
eventually recovers
 how
long does this take?
 T = time to hit 0 (from a given initial
condition)
 how
does E[T] depend on , , G?
Example
G
= line segment or ring with n nodes
 Fix 1
 Theorem (Durrett and Liu): There is
critical c > 0 such that,
 if  < c , then E[T] = O(log n)
 if  > c , then log E[T] ≈ na
 signature of phase transition in
infinite 1-D lattice.
Fast recovery, spectral radius
 - spectral radius of graph adjacency
matrix, A; n=|V| .
Then,
P(X(t)  0) ≤ c n½ exp([  -]t)
Hence, when
  < ,
Survival time T satisfies:
E(T) ≤ [log(n)+1]/[  -  ]
Coupling proof
Consider “Branching Random Walk”, i.e.
Markov process {Yv}vV
Yv → Yv +1 with rate  w~v Yw =  (AY)v
 Yv → Yv -1 with rate  Yv

Can couple processes so that, for all t,
X(t) ≤ Y(t).
Branching random walk bound
By “linearity” of Y,
dE[Y(t)]/dt = ( A -  I) Y(t),
so
E[Y(t)] = exp( A -  I) Y(0) ;
Use P(X(t)  0) ≤ vV E[Yv(t)]
Slow recovery
Graph isoperimetric constant:
| E (S , S ) |
h  min
S :|S |  n / 2
|S |
“perimeter”
“area”
S
S
Generalized isoperimetric constant
| E (S , S ) |
hm  inf
S :|S |m
|S |
Slow die-out and isoperimetric
constant
Suppose for some m ≤ n/2,
r := [ hm] / > 1
Then, with positive probability,
epidemics survive for time at least rm/[2m]
Hence, if m = na, survival time T satisfies
log (E[T]) = W(na)
Coupling proof
Let |X| = v Xv .
Then |X| dominates process Z on {0,…,m} with
transition rates:
z→ z+1 at rate h  z,
z→ z-1 at rate  z.
Then study absorption time for Z
Complete graph
Here,  = n-1, hm = n-m
By picking m = na, a < 1,
Thresholds:
fast recovery if / < 1/(n-1)
slow recovery if / > 1/(n-na)
Hypercube {0,1}d
Here, d = log2(n) and  = d
For m=2k, k < d, hm = d-k
Hence, for k =  d,
Thresholds: ,
fast recovery if / < 1/d
slow recovery if / > 1/[d(1-)]
Erdős-Rényi random graph

edge between each pair of nodes present
with probability pn independent of others
dense: dn := npn = Ω(log n)
 then ρ ~ h ~ dn with high probability

Star network
spectral radius: n1/2
 isoperimetric constant: hm = 1 for all m < n/2
 general results not useful

Specialized analysis yields:
 for arbitrary constant c > 0, if / < c/n1/2,
fast recovery, E[T] = O(log(n))

if / > na-1/2 , for a > 0,
slow recovery, log(E[T]) = W(na)
Power-law random graph
Power-law graph with exponent :
number of degree k vertices  k-
E.g. Internet AS graph with  = 2.1
Expected degree PLRG [Chung et al]:
 expected degrees w1 > ··· > wn :
edge (i,j) present w.p. wi wj/k wk

particular choice: wi = c1(i+c2)-1/( -1)
Power-law random graph (2)
Spectral radius of PLRG [Chung et al.,03]:
Denote by m max. expected degree (m=w1), and
by d average of expected degrees.
Then:
( m ),  > 2.5
 (A )  
3 
(m ), 2 <  < 2.5
PLRG,  > 2.5
Epidemics on full graph live longer
than on sub-graph.
Look at star induced by node 1:
slow die-out for / > ma-1/2
Compare to spectral radius condition:
Fast die-out for / < m-1/2
Two thresholds differ by ma ; same
gap as for star
PLRG, 2 <  < 2.5
Consider top N nodes, for suitable N;
Erdős-Rényi core, with
isoperimetric constant:
h = F() 
Gap between thresholds h and :
constant factor, F()
Open problems
gap between upper and lower bounds in
 sparse ER graphs
 power law random graphs for  < 2.5
 spectral radius bound tight in examples,
always true?
 conditioned on slow recovery, how many nodes
are down at intermediate times?
 extensions to other graphs and to SIR
epidemics

Observations
 neither
parameter tight
 gap for topologies with diverse degrees

spectral radius “seems” to be right
 nothing
between log n and exp(na) ?
Hitlist, routing worms

hitlist worm


increase I(0)
routing worm


decrease W
BGP table information: W = .29  232
– 29% of IP address space

/8 aggregation: W = .45 
232
– 116 out of 256 possible 8 bit prefixes
0110…0xxx
8
The appearance of phase
transitions
N=200, ks =1, kl=0.01
Mean time to absorption goes down from 1047 ,
to about 0 in a matter of few states
Accuracy of fluid model



300000
250000
200000
150000
Mean
95%
5%
100000
50000
0
1
10
0
20
0
30
0
40
0
50
0
60
0
70
0
80
0

population: 360,000
scan rate h =
N(358/min, 1002)
normal distr.
scanning space: 232
I(0) =1
100 simulations
No. Infected

400000
350000
time (minutes)
Accuracy of fluid model
population:
360,000
 scan rate h =
N(358/min, 1002)
normal distr.
 scanning space: 232
 I(0) =10
 100 simulations
400000

Code Red worm
350000
No. Infected
300000
Mean
250000
200000
150000
100000
50000
0
1
100 200 300 400 500 600 700
time (minutes)
Accuracy of fluid model




population: 360,000
scan rate h =
N(358/min, 1002)
normal distr.
scanning space: 232
I(0) =10
100 simulations
Mean
350000
95%
5%
300000
No. Infected

400000
250000
200000
150000
100000
50000
0
1
100 200 300 400 500 600 700
time (minutes)
Local preference worm


Ik t     I k (t )   'I j (t )  Nk  I k (t )
j k



 - local scan rate

’- global scan rate

initial conditions Ik(0)
Erdős-Rényi random graph

edge between each pair of nodes present
with probability pn independent of others
sparse: pn = c log(n)/n, c > 1.
 then ρ ≤ c’ log(n), h ≥ c’’ log(n)

with high probability, for some c’’ < c < c’
dense: dn := npn = Ω(log n)
 then ρ ~ h ~ dn with high probability
