Transcript Slide 1

Sampling and Stability in
TCP/IP Workloads
Lisa Hsu, Ali Saidi, Nathan Binkert
Prof. Steven Reinhardt
University of Michigan
June 4, 2005
MoBS 2005

Transcript Slide 2

Background
During networking experiments, some runs would inexplicably get no bandwidth
Searched high and low for what was “wrong”
  Simulator bug?
  Benchmark bug?
  OS bug?
Answer: none of the above

Transcript Slide 3

The Real Answer
Simulation Methodology!?
Tension between speed and accuracy in simulation
Want to capture representative portions of execution WITHOUT running the entire application
Solution: fast functional simulation
So what’s the problem here?

Transcript Slide 4

TCP Tuning
TCP tunes itself to the performance of the underlying system
Sets its send rate based on perceived end-to-end bandwidth
  Performance of the network
  Performance of the receiver
During checkpointing simulation, TCP had tuned to the performance of a meaningless system
After switching to detailed simulation, the dramatic change in underlying system performance disrupted the flow
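
A rough way to see this coupling: TCP's steady-state send rate is bounded by min(cwnd, rwnd) * MSS / RTT, so a window tuned under one system's timing implies a very different rate once the path's RTT changes. The sketch below is an illustration with made-up numbers, not the paper's model:

    # Rough TCP throughput bound: rate <= min(cwnd, rwnd) * MSS / RTT.
    # All values below are illustrative, not measurements from the paper.

    MSS = 1460  # bytes per segment (assumed)

    def send_rate_gbps(cwnd_pkts, rwnd_pkts, rtt_s):
        window_bytes = min(cwnd_pkts, rwnd_pkts) * MSS
        return window_bytes * 8 / rtt_s / 1e9

    # A window tuned while the simulated RTT was tiny...
    print(send_rate_gbps(cwnd_pkts=250, rwnd_pkts=300, rtt_s=50e-6))   # ~58 Gbps
    # ...implies a very different rate once the detailed system's RTT is larger.
    print(send_rate_gbps(cwnd_pkts=250, rwnd_pkts=300, rtt_s=400e-6))  # ~7.3 Gbps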

Transcript Slide 5

Timing Dependence
The degree to which an application’s performance depends upon execution timing (e.g. memory latencies)
Three classes:
  Non-timing dependent (like SPEC2000)
  Weakly timing dependent (like multithreaded applications)
  Strongly timing dependent

Transcript Slide 6

Strongly Timing Dependent
[Diagram: execution path for a packet from the application: perceived bandwidth high → send it now!; perceived bandwidth low → wait until later]
Application execution depends on stored feedback state from the underlying system (like TCP/IP workloads)
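
To make “stored feedback state” concrete, here is a minimal sketch (hypothetical names, not taken from any real TCP stack) of how the execution path itself depends on state carried over from earlier timing:

    # Minimal illustration: the branch taken depends on feedback state
    # (perceived bandwidth) accumulated under the previous system's timing,
    # so the execution path changes when the underlying timing changes.

    class Sender:
        def __init__(self):
            self.perceived_bw_gbps = 0.0   # tuned state carried across the switch
            self.backlog = []

        def on_packet(self, pkt, threshold_gbps=1.0):
            if self.perceived_bw_gbps >= threshold_gbps:
                return f"send {pkt} now"
            self.backlog.append(pkt)       # wait until later
            return f"queued {pkt}"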

Transcript Slide 7

Correctness Issue
[Diagram: the same packet-send decision (perceived bandwidth high → send it now!; perceived bandwidth low → wait until later), with the perceived-bandwidth state labeled MEANINGLESS at the transition from Functional Simulation to Detailed Simulation]

Transcript Slide 8

Need to….
[Diagram: the same packet-send decision, but now the perceived bandwidth reflects that of the configuration under test – safe to take data!]

Transcript Slide 9

Goals
More rigorous characterization of this phenomenon
Determine severity of this tuning problem across a variety of networking workloads
  Network link latency sensitivity?
  Benchmark type sensitivity?
  Functional CPU performance sensitivity?

Transcript Slide 10

M5 Simulator
Network-targeted full-system simulator
Real NIC model
  National Semiconductor DP83820 GigE Ethernet Controller
Boots Linux 2.6
  Uses the Linux 2.6 driver for the DP83820
All systems (and the link) modeled in a single process
  Synchronization between systems managed by a global tick frequency
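
A conceptual sketch of what a single process with a global tick implies (illustrative only, not M5 source code): every simulated system and the link share one event queue ordered by the global tick, so cross-system packet delivery stays causally consistent.

    # Conceptual sketch, not M5 source: all simulated systems and the link share
    # one event queue ordered by a global tick, inside a single process.
    import heapq

    events = []   # entries are (tick, sequence_number, callback)
    seq = 0

    def schedule(tick, callback):
        global seq
        heapq.heappush(events, (tick, seq, callback))
        seq += 1

    def run(until_tick):
        while events and events[0][0] <= until_tick:
            tick, _, cb = heapq.heappop(events)
            cb(tick)

    # A packet sent by the drive system at tick 0 reaches the system under test
    # link_latency ticks later; both systems advance in the same global tick order.
    link_latency = 1000  # illustrative value
    schedule(0, lambda t: schedule(t + link_latency,
                                   lambda t2: print("packet delivered at tick", t2)))
    run(10_000)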

Transcript Slide 11

Operating Modes
Mode                          Use                Wall clock speed      Simulated CPU speed
Pure Functional (PF)          Checkpointing      Fastest (very fast)   1 or 8 IPC, 1-cycle memory
Functional with Caches (FC)   Cache warmup       Fast                  1 IPC + blocking caches → << 1 IPC
Detailed (D)                  Data measurement   Slowest (very slow)   OoO superscalar, non-blocking caches

Transcript Slide 12

Benchmarks
2-system client/server configuration
  Netperf
    Stream – a transmit microbenchmark
    Maerts – a receive microbenchmark
  SPECWeb99
NAT configuration (3-system config)
  Netperf maerts with a NAT gateway between client and server
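
For context (an assumption about the tooling, not stated on the slide): netperf's transmit and receive microbenchmarks correspond to its TCP_STREAM and TCP_MAERTS tests (“maerts” is “stream” reversed). A minimal driver sketch, assuming netperf is installed and a netserver is reachable at a hypothetical host called server-host:

    # Hypothetical driver: run netperf's transmit (TCP_STREAM) and receive
    # (TCP_MAERTS) microbenchmarks against a netserver on "server-host".
    import subprocess

    for test in ("TCP_STREAM", "TCP_MAERTS"):
        subprocess.run(["netperf", "-H", "server-host", "-t", test, "-l", "30"],
                       check=True)   # -l 30: run each test for 30 seconds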

Transcript Slide 13

Experimental Configuration
[Diagram: a System Under Test (receiver/sender) connected by a link to a Drive System (sender/NAT/receiver; x2 if NAT). Across the phases, the system under test runs PF1/PF8 during CHECKPOINTING, FC1 (cache warmup) during CACHE WARMUP, and D during MEASUREMENT, while the drive system runs PF8.]

Transcript Slide 14

“Graph Theory”
Tuning periods after CPU model changes?
How long do they last?
Which graph minimizes Detailed modeling time necessary?
Effects of checkpointing PF width?

Transcript Slide 15

Netperf Maerts
[Two plots: bandwidth (Gbps) vs. millions of cycles for checkpoint widths 1 and 8, with a line marking the known achievable bandwidth; PF checkpoints are loaded by each configuration (D or FC->D). The run loaded directly into Detailed shows a clear tuning period after the switch, while the FC->Detailed run shows no tuning once in Detailed (COV 1.66% / 0.5%): the FC cache warmup bears the brunt of the tuning time.]

Takeaways:
1) A shift from a “high performance” CPU to a lower-performance one causes more drastic tuning periods
2) A shift from lower performance to higher has a more gentle transition
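
Taking COV above as the usual coefficient of variation (standard deviation over mean) of the per-interval bandwidth samples, a minimal way to compute it, with made-up samples:

    # Coefficient of variation (stddev / mean) of per-interval bandwidth samples.
    from statistics import mean, pstdev

    def cov_percent(samples_gbps):
        return 100.0 * pstdev(samples_gbps) / mean(samples_gbps)

    print(cov_percent([7.1, 7.2, 7.0, 7.15, 7.05]))   # made-up samples, prints ~1.0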

Transcript Slide 16

Netperf Stream
[Two plots: bandwidth (Gbps) vs. millions of cycles for Detailed and FC->Detailed, widths 1 and 8 – no tuning periods in either case]
Why no tuning periods?
  Because it is SENDER limited!
  Change in performance is local – no feedback from network or receiver required
  Thus changes in send rate can be immediate

Transcript Slide 17

NAT Netperf Maerts
[Two plots: bandwidth (Gbps) vs. millions of cycles for Detailed and FC->Detailed, widths 1 and 8]
[Diagram: sender – NAT – receiver topology; labels mark the System Under Test and “CPU changes applied here”]
The “pipe” is changing – this feedback takes longer to receive in TCP because it is not explicit → may ruin simulation

Transcript Slide 18

TCP Kernel Parameters
pouts – unACKed packets in flight
cwnds – congestion window (in packets)
  **Reflects state of the network pipe
sndwnds – available receiver buffer space (in bytes)
  **Reflects receiver’s ability to receive

TCP RULES:
  pouts may NOT exceed cwnds
  bytes(pouts) may NOT exceed sndwnds

[Plot: “Detailed Kernel Params” – pouts and cwnds (packets, left axis) and sndwnds (bytes, right axis) vs. millions of cycles]
[Annotation on the plot: solved in the real world by TCP timeouts, but that would take much too long to simulate]
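
The two rules can be restated compactly in code (an illustrative check, not kernel source; the MSS and the example values are assumptions):

    # Illustrative restatement of the two TCP rules above (not kernel code).
    MSS = 1460  # bytes per segment (assumed)

    def may_send_one_more(pouts, cwnd, sndwnd_bytes):
        """True if one more packet may be put in flight."""
        return (pouts + 1 <= cwnd                        # pouts may NOT exceed cwnds
                and (pouts + 1) * MSS <= sndwnd_bytes)   # bytes(pouts) may NOT exceed sndwnds

    print(may_send_one_more(pouts=24, cwnd=260, sndwnd_bytes=37550))   # True
    print(may_send_one_more(pouts=250, cwnd=260, sndwnd_bytes=37550))  # False: receiver-limited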

Transcript Slide 19

SPECWeb99
[Two plots: bandwidth (Gbps) vs. millions of cycles for Detailed and FC->Detailed, widths 1 and 8]
Much more complex than Netperf
Harder to understand fundamental interactions
Speculations in the paper – but understanding this more deeply is definitely future work

Transcript Slide 20

What About Link Delay?
[Two plots: “Maerts Link Delay Comparison” – bandwidth (Gbps) vs. millions of cycles for zero delay and 400us delay; “400us Delay Kernel Parameters” – pouts and cwnds (packets) vs. millions of cycles]
TCP algorithm: cwnd can only increase upon receipt of an ACK packet
Ramp-up of cwnd is therefore limited by RTT
KEY POINT: tuning time is sensitive to RTT
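
A back-of-the-envelope illustration of that key point (made-up numbers, not the paper's data): if cwnd roughly doubles once per RTT during ramp-up, the time to reach a target window scales directly with the RTT.

    # Back-of-the-envelope: time for slow start to ramp cwnd to a target size,
    # doubling the window once per RTT. All numbers are illustrative.
    import math

    def ramp_time_s(target_cwnd_pkts, rtt_s, initial_cwnd=1):
        rtts = math.ceil(math.log2(target_cwnd_pkts / initial_cwnd))
        return rtts * rtt_s

    for rtt_s in (50e-6, 400e-6):   # e.g. a near-zero-delay path vs. a 400us-delay path
        print(f"RTT {rtt_s*1e6:.0f} us -> {ramp_time_s(256, rtt_s)*1e3:.2f} ms to reach cwnd 256")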

Transcript Slide 21

Conclusions
TCP/IP workloads require a tuning period relative to the network RTT when receiver limited
Sender-limited workloads are generally not problematic
Some cases lead to unstable system behavior
Tips for minimizing tuning time:
  “Slow” fast forwarding CPU
  Try different switchover points
  Use fast-ish cache warmup period to bear brunt of transition

Transcript Slide 22

Future Work
Identify other strongly timing dependent workloads (feedback directed optimization?)
Examine SPECWeb behavior further
Further investigate protocol interactions that cause zero bandwidth periods
  Hopefully lead to a more rigorous avoidance method

Transcript Slide 23

Questions?

Transcript Slide 24

Non-Timing Dependent
[Diagram: a memory access may HIT (perfect cache) or MISS (L1 cache); the execution path is unaffected]
Single-threaded, application-only execution (like SPEC2000)

Transcript Slide 25

Weakly Timing Dependent
[Diagram: execution paths for a memory access – perfect cache → continue; L1 miss → idle loop; RAM access → schedule a different thread]
Application execution tied to OS decisions (like multi-threaded apps)

Transcript Slide 26

Basic TCP Overview
Congestion Control Algorithm
  Match send rate to the network’s ability to receive it
Flow Control Algorithm
  Match send rate to the receiver’s ability to receive it
Overall goal:
  Send data as fast as possible without overwhelming the system, which would effectively cause slowdown

Transcript Slide 27

Congestion Control
Feedback in the form of
  Time Outs
  Duplicate ACKs
Feedback dictates Congestion Window parameter
  Limits the number of unACKed packets out at a given time (i.e. send rate)

Transcript Slide 28

Congestion Control cont.
Slow Start
  Congestion window starts at 1; each received ACK grows it further, so the window increases exponentially (roughly doubling per RTT)
Additive Increase, Multiplicative Decrease (AIMD)
  The window grows by about 1 per round trip; losses perceived via duplicate ACKs halve the window
Timeout recovery
  Upon timeout, go back to slow start
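
A minimal sketch of these window updates (simplified and illustrative, not a real TCP stack; the initial ssthresh value is assumed, and real stacks track the window in bytes with many more cases):

    # Simplified congestion-window updates (illustrative only).
    class CongestionControl:
        def __init__(self):
            self.cwnd = 1.0        # congestion window, in packets
            self.ssthresh = 64.0   # slow-start threshold (assumed initial value)

        def on_ack(self):
            if self.cwnd < self.ssthresh:
                self.cwnd += 1.0              # slow start: +1 per ACK, doubles each RTT
            else:
                self.cwnd += 1.0 / self.cwnd  # congestion avoidance: ~+1 per RTT

        def on_dup_acks(self):
            self.ssthresh = max(self.cwnd / 2, 2.0)
            self.cwnd = self.ssthresh         # multiplicative decrease (simplified)

        def on_timeout(self):
            self.ssthresh = max(self.cwnd / 2, 2.0)
            self.cwnd = 1.0                   # back to slow start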

Transcript Slide 29

Flow Control
Feedback in the form of explicit TCP header notifications
  Receiver tells the sender how much kernel buffer space it has available
Feedback dictates the send window parameter
  Limits the amount of unACKed data out at any given time

Transcript Slide 30

Results
Zero Link Delay

Transcript Slide 31

Non Timing Dependent
Single-threaded, application-only simulation (like SPEC2000)
The execution timing does not affect the commit order of instructions
Architectural state generated by a fast functional simulator would be the same as that of a detailed simulator

Transcript Slide 32

Weakly Timing Dependent
Applications whose performance is tied to OS decisions
  Multi-threaded (CMP, SMT, etc.)
Execution timing effects like cache hits and misses, memory latencies, etc. can affect scheduling decisions
However, these execution path variations are all valid and do not pose a correctness problem

Transcript Slide 33

Strongly Timing Dependent
Workloads that explicitly tune themselves to the performance of the underlying system
Tuning to an artificially fast system affects system performance
When switching to detailed simulation, you may get meaningless results