Reverse Hashing for High-speed Network Monitoring

Transcript Reverse Hashing for High-speed Network Monitoring

Reverse Hashing for High-speed
Network Monitoring: Algorithms, Evaluation,
and Applications
Robert Schweller1, Zhichun Li1, Yan Chen1,
Yan Gao1, Ashish Gupta1, Yin Zhang2, Peter
Dinda1, Ming-Yang Kao1, Gokhan Memik1
1
Lab for Internet and Security Technology (LIST), Northwestern Univ.
2
University of Texas at Austin
The Spread of Sapphire/Slammer
Worms
Motivation (online change detection)
• Online network anomaly/intrusion detection
over high speed links
– Small memory usage
– Small # of memory access per packet
– Scalable to large key space size
• Primitives for online anomaly detection
– Heavy hitters (lots of prior work)
– Heavy changes: enabler for aggregate queries
over multiple data streams
• Asymmetric routing demands spatial aggregation
• Time Series Analysis (TSA) need temporal
aggregation
Outline
•
•
•
•
•
•
•
Background on k-ary sketch
Reversible sketch problem
Modular hashing
IP mangling
Reverse hashing
Evaluation
Conclusion
K-ary sketch
[Krishnamurthy, Sen, Zhang, Chen, 2003]
First to detect flow-level heavy changes in
massive data streams at network traffic speeds
0 1
1
…
j
…
H
…
K-1
k-ary sketch
[Krishnamurthy, Sen, Zhang, Chen, 2003]
APIs:
Update (k, u): Tj [ hj(k)] += u (for all j)
0 1
…
h1(k)
K-1
1
…
hj(k)
…
j
hH(k)
H
Estimate v(S, k): sum of updates for key k
median
j
 T j [ h j ( k )]  sum / K 


1

1
/
K


S=COMBINE(a,S1,b,S2):
a
+b
=
Reverse Sketch Problem
• Main problem
– Cannot efficiently report keys with heavy change
INFERENCE(S,t)
– Important function for anomaly detection!
• Our Contribution
– Determine set of keys that have “large” estimates
in a sketch
?
?
Reversible sketch framework
value stored
value
Streaming
data
IP mangling
recording key
Heavy
change
detection
change
threshold
reversible
k-ary
sketch
Modular
hashing
Reverse
hashing
reversible
k-ary
sketch
Reverse
IP mangling
heavy
change
keys
Outline
•
•
•
•
•
•
•
Background on k-ary sketch
Reversible sketch problem
Modular hashing
IP mangling
Reverse hashing
Evaluation
Conclusion
Taking Intersections
• Intersect A1, A2, A3, A4, A5
H=5
K = 212
#keys = 232 (IP addresses)
E[false positives] << 1
The problem with simple intersection
• Each set Ai can be very large !
H=5
K = 212
#keys = 232 (IP addresses)
|A1| = 232 / 212 = 220
The problem with simple intersection
• Each set Ai can be very large !
• Solution:
Modular hashing
Modular hashing reduces the set size
32 bits
10010100
10101011
10010101
8 bits
h()
12 bits
010 110 001 101
10100011
Modular hashing reduces the set size
32 bits
10010100
10101011
10010101
10100011
8 bits
h1()
010
h2()
110
h3()
001
h4()
101
010 110 001 101
Greatly reduces size of reverse mapped sets
Modular hashing reduces the set size
Intersection:
A1: 25 * 25 * 25 * 25
Only 32 elements per word set
1
b1
b2
2
b3
3
4
5
b4
b5
Modular hashing reduces the set size
Intersection:
1
A1: 25 * 25 * 25 * 25
A2: 25 * 25 * 25 * 25
b1
b2
2
b3
3
4
5
b4
b5
Problem: Too many collisions
32 bits
129.105.56.23
129.105.56.28
129.105.56.109
129.105.56.35
129.105.56.98
...
12 bits
7 . 4 . 0 . *
Problem: Too many collisions
32 bits
129.105.56.23
129.105.56.28
129.105.56.109
129.105.56.35
129.105.56.98
...
Solution:
12 bits
7 . 4 . 0 . *
IP Mangling with GF (Galois Extension Field)
IP Mangling: a bijective mapping function
for breaking the key space continuity
Outline
•
•
•
•
•
•
•
Background on k-ary sketch
Reversible sketch problem
Modular hashing
IP mangling
Reverse hashing
Evaluation
Conclusion
Handling Multiple Intersections…
2H different intersections
1
b1
b1
2
3
b2
b3
4
5
b2
b3
b4
b5
b4
b5
Much more difficult – Solution: Reverse Hashing algorithms
• Step 1: Reverse hashing for each module
• Step 2: Infer the whole key through bucket index matching
among candidates from each module
Reverse Hashing for Each Module
Take the first word as an example
1
A12  A12
 A122  A123  A124
Aijw  32
Gi1 candidate set of the first word in Hash table i
1
2
3
4
5
1
1
G11  A11
 A12
 {2,3}  {2,5}  { 2,3,5}
1
1
G21  A21
 A22
 {2,6}  {9,10}  { 2, 6,9,10}
H=5, r=1, K=212
r tolerance level
1
1
G31  A31
 A32
 {0,3}  {2,3}  {0,2,3}
1
1
G41  A41
 A42
 {3,10}  {2,8}  { 2,3,8,10}
1
1
G51  A51
 A52
 {3,7}  {6,9}  {
3,6,7,9}
{2,3}
I 1   r Gi1  {2}
i
All possible values of the first word in the sketch
Bucket Index Matrix of Candidates
H=5, r=1, K=212
192.168.0.1
1
2
3
4
5
192.123.47.62
b11
b12
b21
b31
b22
b32
b41
b51
For each x in I1, we can
get B1(x), a vector of the
heavy bucket sets which
x hashes to.
b42
b52
192.*.*.* hash to the red
heavy buckets
b11, b12 
b

21


1

B (192)  b32


b
,
b
 41 42 
b51, b52 
Prefix Extension Algorithm
Path discovery algorithm
I1
150
47
236
B1
 2,5 
1 


1,4,9


1,3,6 
3 
3 
5 
 
1 
 
1 
 4 
2 
2,3,7 


2,5 


1

3 
I2
72
+
104
B2
1,2 
1,5 


4,9 


5 
3,7,8
1,2 
 2,6 
 
2 
 
1 
3,9 
=
{22,5}  {1,2} 
1{1} 
 {1,5}


<150.72>  
{41,,94,9}  {4,9}
 

{*1,3,6}  {5} 
{33}  {3,7,8} 
2*
25
 
<47.72>
<236.104> *
2
*
1 
3*
* more
than
r=1
Ignore!
Ignore!
Prefix Extension Algorithm
I3
182
+
32
49
B3
1,2 
1 
 
3,4
 
1 
3 
2 
1 
 
9 
 
1,7 
3 
2 
 2,6 
 
2 
 
1 
3 
=
2
1 
<150.72.182>  
4
 
* 
3 
2
1 
<150.72.32>  
9 
 
* 
3 
2
2
<236.104.49>  
2
 
1 
3 
2 

1
<150.72>  
2  4,9 
1,2   
 * 
75 4 3 
 


1


3,5,92
 
2
<236.104>  
2
 
1 
3 
I4
+
B4
=
2
1 
 
<150.72.182.75>  4 
 
* 
3 
2
2
 
<236.104.49.75> * 
 
1 
3 
Recap:
value stored
value
Streaming
data
IP mangling
recording key
Modular
hashing
reversible
k-ary
sketch
log n
 (
)
log log n
Heavy
change
detection
change
threshold
  (n
reversible
k-ary
sketch
1/ loglogn
)
Reverse
hashing
Reverse
IP mangling
n is the size of key space
heavy
change
keys
Outline
•
•
•
•
•
•
•
Background on k-ary sketch
Reversible sketch problem
Modular hashing
IP mangling
Reverse hashing
Evaluation
Conclusion
Evaluation
• Dataset
– A large US ISP (330M Netflow records)
– NU (19M Netflow records)
• Efficient data recording
For the worst case traffic, all 40-byte packets
–
–
–
–
Software: 526Mbps on P4 3.2Ghz PC
Hardware: 16Gbps on a single FPGA broad
Only a few hundred KB to a couple of MB memory used
Only 15 memory access per packet for 48 bit reversible
sketches and 16 per packet for 64 bit reversible sketches
• Efficient heavy change detection and key inference
– 0.34 seconds for 100 changes. 13.33 seconds for 1000
change
Key Inference Accuracy
2.40
100
96
92
False Positive Percentage
• True positives and false positives of 16bit
reversible sketches for 32bit IP addresses
0.25
0.12
0.08
0.06
0.04
50
450
0.25
1
0.12
0.08
0.06
0.04
1600
2000
H=6, r=1
H=6, r=2
H=5, r=1
Deltoid
0.6
H=6, r=1
H=6, r=2
H=5, r=1
Deltoids
88
2.40
850
1200
0.2
1600
Number of heavy changes
2000
50
450
850
1200
Number of heavy changes
[Deltoids]: S.Muthukrishnan and Graham Cormode, What's New: Find
Significant Differences in Network Data Streams. Infocom 2004
More Results
• Stress test with larger dataset still accurate
• Scalable to larger key space size: similar
results for 64bit IP pairs
• Built anomaly/intrusion detection system to
detect, e.g., SYN flooding and port scans
[ICDCS 2006]
Conclusions
Proposed the first reversible sketches which
• Record high speed network streams online
• Detect the heavy changes and infer the keys
online
• Small memory usage, small # of memory
access per packet
• Scalable to large key space size
Backup Slides
Related work
• Compare with [deltoids]
– Accuracy better
– Scalable to large key space better
– # of Memory access less
• [PCF, IMC2004]: not reversible
• [Q. Zhao et al, IMC2005] [S.Venkataraman,
NDSS2005]: unique fan-out (fan-in)
estimation.
Modular Hashing
Optimal Hashing
Reversible sketch problem
However… Not reversible
Lack of an inference API: INFERENCE(S,t)
• Important function for anomaly detection!
• Decouple the recording stage of sketches from the
detection stage to enable efficient combine and
inference.
• Given a threshold t, report keys whose corresponding
sum of updates are larger than the threshold.
Our contribution: an efficient algorithm for inference
?
?
Problem: Too many collisions
32 bits
129.105.56.23
129.105.56.28
129.105.56.109
129.105.56.35
129.105.56.98
...
12 bits
7 . 4 . 0 . *
Solution:
IP Mangling with
IP-mangling
• Use GF (Galois Extension Field) function for attack
resilience
Modular Hashing
Modular Hashing with IP Mangling
Optimal Hashing
Reverse Hashing for Each Module
Take the first word as an example
Aij1 all possible value of the first word for
H=5, r=1, K=212
the No. j heavy bucket in Hash table i
1
Aijw  32
A12  A12
 A122  A123  A124 1
Gi all possible value of the
A11  A  A  A  A
1
11
2
11
3
11
4
11
first word in Hash table i
1
b11
2
b21
3
b31
I1 
b51

i
1
1
G21  A21
 A22
 {*}
1
1
G31  A31
 A32
 {*}
b41
r
b22
b32
4
5
1
1
G11  A11
 A12
 {*}
b12
Gi1  {v 
1
1
G41  A41
 A42
 {*}
b42
b52

i
1
1
G51  A51
 A52
 {*}
Gi1 | v is mapped to heavy buckets in at least (H - r) hash tables}
All possible value of the first word in the sketch
False positive reduction by original sketch verifying
Final result
<150.72.182.75>
Estimate
(<150.72.182.75>, 180)
Verified original
k-ary sketch
Threshold
150
(<150.72.182.75>, 180)
K-ary sketch
[Krishnamurthy, Sen, Zhang, Chen, 2003]
• first to detect flow-level heavy changes in
massive data streams at network traffic
speeds
• APIs
– UPDATE(S,k,u): Tj [ hj(k)] += u (for all j)
– ESTIMATE(S, k): sum of updates for key k
– Linear combination: S=COMBINE(a,S1,b,S2)
a
+b
=

Reverse Hashing for High-speed Network Monitoring

Transcript Reverse Hashing for High-speed Network Monitoring

Directory