
Memory Access Characteristics of Network
Infrastructure Applications
Dr. Abdul Waheed
Computer Engineering Department
KFUPM
April 7, 2003
Network Infrastructure
Network infrastructure consists of:
- Core devices (examples: switches and routers)
- Edge devices (examples: proxy servers, layer 4 switches, and web servers)

[Figure: a core device handles incoming and outgoing traffic at the network layer between its NICs, while an edge device runs the full stack (application, transport, and network layers plus I/O devices) between its NICs]

Characteristics of applications on core and edge devices:
- Transaction processing systems operating on PDUs
- High transaction throughput needed to match line speeds
Problem Statement
Role of infrastructure device architecture:
- High performance protocol engines
  - High transaction throughput with low latency
  - Throughput should ideally match the network speed
- High memory bandwidth is needed
- Typically, special-purpose architectures are employed → a costly solution
  - Network processors
  - Multiprocessors
  - ASICs
Does employing a general-purpose computer architecture help?
- Does the memory hierarchy help or hinder high throughput?
  - Typical hierarchy: registers → caches → main memory → virtual memory
- Are memory and I/O bandwidth a bottleneck?
- What are the other performance bottlenecks?
Motivation
Commonly known characteristics of network infrastructure:
- Two categories of transaction processing:
  - Those requiring heavy lifting → typically, a core device
  - Those requiring heavy computing → typically, an edge device
- Lack of data locality in PDUs
  - Stands in contrast with scientific/engineering applications
Network processors exploit these characteristics:
- Small data and large instruction caches
Common perception: caches hinder high network throughput
- This seems obvious, but why don't we try to confirm it…
  - To convince ourselves
  - If the above perception is incorrect, we can use general-purpose computer architectures for building network infrastructure → an inexpensive, scalable, and flexible solution
The Approach
Possible approaches for memory performance evaluation:
- Trace-driven simulation
- Measurements
  - Using on-chip counters
  - Benchmarks
Both approaches have limitations:
- Accuracy of the trace-driven approach is questionable
- Measurement-based approaches are (typically) not repeatable
Analytical approach:
- Based on using a general-purpose computing architecture as a server
- A simple latency model
- Four categories of possible data movements for each transaction
- Analyze the overhead for high transaction throughput due to:
  - Caches
  - Limited memory and I/O bandwidth
Typical Server Architecture and Data Flow
These data movement paths are specific to using the server as a network protocol engine.
Four data transfer paths:
- Memory-CPU
- Memory-memory
- Memory-I/O
- Memory-network
Latency Model
Each transaction involves:
- CPU cycles
- Data transfers: one or more of the four identified types
Transaction latency (see the sketch below):
  Ttrans = TCPU + n1·Tm-c + n2·Tm-m + n3·Tm-disk + n4·Tm-net
where:
- TCPU = total CPU time needed for the transaction
- Tm-c = time to transfer the entire PDU from memory to CPU for processing
- Tm-m = latency of a memory-memory copy of a PDU
- Tm-disk = latency of a memory-I/O read/write of a block of data
- Tm-net = latency of a memory-network read/write of a PDU
- ni = number of data movement operations of each type
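To make the model concrete, here is a minimal Python sketch (not part of the original slides) that evaluates Ttrans; all latency values and counts below are illustrative placeholders, not measurements.

# Sketch of the transaction latency model: Ttrans = TCPU + sum(ni * Ti).
def transaction_latency(t_cpu, t_mc, t_mm, t_mdisk, t_mnet, n1, n2, n3, n4):
    """All times in msec; n1..n4 count each type of data movement."""
    return t_cpu + n1 * t_mc + n2 * t_mm + n3 * t_mdisk + n4 * t_mnet

# Example: one memory-CPU pass, one memory-memory copy, no disk I/O,
# and two memory-network transfers (placeholder values).
print(transaction_latency(t_cpu=0.01, t_mc=0.02, t_mm=0.04,
                          t_mdisk=0.0, t_mnet=0.03,
                          n1=1, n2=1, n3=0, n4=2))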
Memory-CPU Transfers
Used for PDU processing:
- CPU accesses PDU words in sequential order (stride = 1)
- Typically, one-way data flow → from memory to CPU via the L1 cache
- Example: checksum computation of a TCP segment
Number of memory stall cycles = (IC)(AR)(MR)(MP), where:
- IC = instruction count per transaction
- AR = number of memory accesses per instruction (AR = 1)
- MR = ratio of cache misses to memory accesses
- MP = miss penalty in clock cycles
Cache miss rate:
- Worst case: MR = 1 while typically MP = 10 → stalls = 10 × IC (see the sketch below)
- However, the situation is not that bad for network applications!
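A minimal sketch of the stall-cycle estimate, with an assumed example instruction count:

# Stall cycles = (IC)(AR)(MR)(MP); the instruction count here is illustrative.
def memory_stall_cycles(ic, ar, mr, mp):
    return ic * ar * mr * mp

# Worst case from the slide: MR = 1, MP = 10  ->  stalls = 10 x IC
print(memory_stall_cycles(ic=1000, ar=1, mr=1.0, mp=10))  # 10000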
Memory-CPU Transfers (Cont’d)
Temporal vs. spatial locality:
- A PDU lacks temporal locality
- Observation: PDU processing exhibits excellent spatial locality
  - Suppose the data cache line is 32 bytes (or 16 words) long
  - Sequential accesses with stride = 1
  - Accessing one word brings in the other 15 words as well
  - Thus, effective MR = 1/16 = 6.2% → better than even scientific apps
Thus, generally MR = W/L (see the sketch below), where:
- W = width of each memory access (in bytes)
- L = length of each cache line (in bytes)
Validation of the above observation:
- Similar spatial locality characteristics were reported via measurements:
  S. Sohoni et al., "A Study of Memory System Performance of Multimedia Applications," in Proc. of ACM SIGMETRICS 2001
- MR for a streaming media player is better than for SPEC benchmark apps!
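A small sketch of the effective miss-rate estimate; the access width below is an assumption chosen to match the slide's 16-words-per-line example.

# MR = W / L: one compulsory miss per cache line of stride-1 data.
def effective_miss_rate(w_bytes, l_bytes):
    return w_bytes / l_bytes

# 16 words per 32-byte line -> only 1 access in 16 misses (~6.2%)
print(effective_miss_rate(w_bytes=2, l_bytes=32))  # 0.0625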
Memory-CPU Transfers (Cont’d)
Determine cache overhead with respect to execution time:
- (Execution time)no-cache = (IC)(CPI)(CC)
- (Execution time)with-cache = (IC)(CPI)(CC){1 + (MR)(MP)}
- Cache overhead = 1 + (MR)(MP)
Cache overhead in various cases (see the sketch below):
- Worst case: MR = 1 and MP = 10
  - Cache results in 11 times higher latency for each transaction!
  - Memory-CPU latency depends on internal bus bandwidth
- Best case: MR = 0 → trivial
- Average case: MR = 0.1 and MP = 10, so (MR)(MP) ≈ 1
  - Latency due to stalls = ideal execution time without stalls
  - Cache overhead can restrict the throughput to ½ of what is possible with a special-purpose network processor
Practical consideration: use the internal bus bandwidth to calculate Tm-c
- Avoid using low-level measurements
- Latency hiding is very effective for such applications
- Tm-c = S/Bi msec, where S is the PDU size and Bi is the internal bus bandwidth in MB/s
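A minimal sketch of the overhead factor and the bandwidth-based estimate of Tm-c; the PDU size and bus bandwidth are assumed example values, and the conversion assumes S is given in bytes.

# Cache overhead factor relative to the no-stall execution time.
def cache_overhead(mr, mp):
    return 1 + mr * mp

# Tm-c = S / Bi, converted to milliseconds (S in bytes, Bi in MB/s).
def t_mem_cpu_ms(s_bytes, bi_mb_per_s):
    return (s_bytes / 1e6) / bi_mb_per_s * 1e3

print(cache_overhead(mr=1.0, mp=10))   # worst case: 11x
print(cache_overhead(mr=0.1, mp=10))   # average case: 2x -> half the throughput
print(t_mem_cpu_ms(s_bytes=1500, bi_mb_per_s=3200))  # ~0.00047 ms per PDU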
Memory-Memory Transfers
Memory-memory transfer:
- Due to a memory copy of the PDU between protocol layers
- Transfers pass through the caches and the CPU
- Stride = 1
- Transfer involves memory→cache→CPU→cache→memory data movement
Latency:
- Dependent on the internal (system) bus bandwidth
- Tm-m = 2S/Bi msec
Memory-I/O and -Network Transfers
Memory-network transfers:
- Pass over the I/O bus
- DMA can be used
- Again, stride = 1
Latency:
- Limiting factor is the I/O bus bandwidth
- Tm-net = S/Be msec (see the sketch below)
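A small sketch of the copy and network-transfer latencies, with assumed values for the PDU size and the bus bandwidths:

# Tm-m = 2S / Bi: the PDU crosses the internal bus twice (read + write).
def t_mem_mem_ms(s_bytes, bi_mb_per_s):
    return 2 * (s_bytes / 1e6) / bi_mb_per_s * 1e3

# Tm-net = S / Be: one pass over the I/O bus (e.g. via DMA).
def t_mem_net_ms(s_bytes, be_mb_per_s):
    return (s_bytes / 1e6) / be_mb_per_s * 1e3

print(t_mem_mem_ms(1500, 3200), t_mem_net_ms(1500, 1066))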
Characteristics of Selected Applications
Three applications:
- IP packet forwarding (a simple application)
- HTTP proxying
- RTP streaming
We need to calculate the theoretical peak throughput:
- Based on application data movement characteristics
- Depending only on architectural capabilities
IP Forwarding
[Figure: IP packets (header + payload) entering and leaving a router]
Each transaction involves the following operations:
- Passing of the incoming IP packet NIC→IP (mem-net)
- Reading of the destination IP address (cache miss)
- Hash function calculation (from cache)
- Table lookup (cache miss or hit; in the long term, a hit)
- IP header update (from cache)
  - Update time to live
  - Recompute header checksum
- Passing of the outgoing IP packet IP→NIC (mem-net)
IP Forwarding (Cont’d)
For an IP forwarding transaction:
- Data cache misses = 1 or 2 per transaction
- Number of memory-network transfers = 2
- Some CPU cycles are consumed for each transaction
Latency:
- Depends on CPU cycles and memory-network transfers
- TIP = TCPU-IP + 2S/Be msec, where TCPU-IP is the CPU time (in msec) to process the transaction
Using a similar analysis for HTTP proxying and RTP streaming (see the sketch below):
- THTTP = TCPU-H + 5S/Bi + S/Be msec
- TRTP = TCPU-R + 4S/Bi + S/Be msec
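A small sketch of the three per-transaction latency estimates; the PDU sizes and bus bandwidths below are assumed example values (S in Kbytes and bandwidths in MB/s, so the results come out in msec).

# TIP = TCPU-IP + 2S/Be
def t_ip(t_cpu_ms, s_kb, be):
    return t_cpu_ms + 2 * s_kb / be

# THTTP = TCPU-H + 5S/Bi + S/Be
def t_http(t_cpu_ms, s_kb, bi, be):
    return t_cpu_ms + 5 * s_kb / bi + s_kb / be

# TRTP = TCPU-R + 4S/Bi + S/Be
def t_rtp(t_cpu_ms, s_kb, bi, be):
    return t_cpu_ms + 4 * s_kb / bi + s_kb / be

# e.g. 1.5 KB packets, 10 KB HTTP responses, Bi = 3200 MB/s, Be = 1066 MB/s
print(t_ip(0.0, 1.5, 1066), t_http(0.0, 10.0, 3200, 1066), t_rtp(0.0, 1.5, 3200, 1066))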
Optimistic Theoretical Peak Throughput
Assume:
- For each transaction, TCPU << min{Tm-m, Tm-net} and can be ignored
  - Meaning, no software overhead
- No bus contention among independent transactions
Throughputs (in Mbytes/sec; see the sketch below):
- (Throughput)IP = Be/2
- (Throughput)HTTP = BiBe/(5Be + Bi)
- (Throughput)RTP = BiBe/(4Be + Bi)
Apply to architectures based on state-of-the-art processors:
- Intel Pentium IV 3 GHz
- AMD Athlon XP 3000+
- MIPS R16000 700 MHz
- Sun UltraSPARC III 900 MHz
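A small sketch of the peak-throughput formulas, assuming example internal and I/O bus bandwidths:

# Peak throughputs in Mbytes/sec, ignoring CPU time and bus contention.
def peak_ip(be):
    return be / 2                      # (Throughput)IP

def peak_http(bi, be):
    return bi * be / (5 * be + bi)     # (Throughput)HTTP

def peak_rtp(bi, be):
    return bi * be / (4 * be + bi)     # (Throughput)RTP

bi, be = 3200, 1066  # assumed internal bus and PCI-X bandwidths (MB/s)
print(peak_ip(be), peak_http(bi, be), peak_rtp(bi, be))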
Optimistic Peak Throughputs
Assumed PCI-X I/O bus:
- 133 MHz, 64 bits wide
- Be = 1066 Mbytes/sec
Consider only the IP forwarding case (see the arithmetic check below):
- Throughput of 4264 Mbps → at least a 12-port, 155 Mbps, full-duplex software router
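A quick arithmetic check of these numbers (a sketch, not from the slides):

be_mb_s = 1066                    # 64-bit PCI-X at ~133 MHz (slide value)
ip_peak_mbps = (be_mb_s / 2) * 8  # (Throughput)IP = Be/2, converted to Mbits/sec
print(ip_peak_mbps)               # 4264.0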
Measurement Based IP Forwarding
[Figure: measured throughput (Mbits/sec) and CPU utilization (%) on Ethernet interface 0 across eight configurations, for 64-byte, 10K, and 64K packets]
Linux IP forwarding platform:
- Pentium IV 2 GHz processor
- PCI-X I/O bus with a 3200 MB/sec internal bus
- 4 Gigabit NICs connecting to four traffic-generating hosts
Peak throughput on any interface: 450 Mbps (full-duplex)
- Mean total throughput = 450 × 2 × 4 = 3600 Mbps → 84% of the theoretical peak (see the check below)
- Difference due to software overhead, bus contention, and interrupts
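A one-line check of the measured-versus-theoretical comparison above:

# 4 NICs, full-duplex, 450 Mbps each, against the optimistic 4264 Mbps peak.
measured_total_mbps = 450 * 2 * 4
print(measured_total_mbps, measured_total_mbps / 4264)  # 3600 Mbps, ~0.84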
Conclusions
Contributions of this work:
- We show that the average cache overhead for network applications is equal to the ideal-case latency for a transaction
  - Excellent spatial locality for PDU processing
- We show that peak throughput is limited by internal and external bus bandwidths rather than by caches
- General-purpose processor based architectures can replace special-purpose network processor based architectures
Open questions:
- What is the contribution of software overheads?
- What is the role of bus contention in limiting the throughput?
- Impact of multithreading combined with superscalar architectures
  - Can latency hiding provide extra mileage?
- Role of network interface and device driver optimizations?
Backup Slides
Memory-Memory Transfer Bandwidth
Extended copy transfer characteristics (stride = 1)
[Figure: memory bandwidth (Mbytes/sec) versus block size (working set), for block sizes from 0.1K to 32M]
HTTP Proxying
Each HTTP proxy transaction results in the following operations:
- Passing of the HTTP request packet NIC→IP (mem-mem)
- Passing of the HTTP request packet IP→TCP (mem-mem)
- Passing of the HTTP request packet TCP→HTTP (mem-mem)
- Formation of the HTTP response header (cache hit)
- Retrieval of the target document HTTP→TCP (mem-mem)
- Passing of the HTTP response TCP→IP (mem-mem + CPU visit)
  - The entire TCP segment needs to visit the CPU (and cache) for checksum computation
- Passing of the HTTP response IP→NIC (mem-mem + CPU visit)
  - Only the IP header checksum is computed (at least 20 bytes)
For all data within a transaction:
- Data cache miss rate = 1/(B/W) + c1
  - Constant number of compulsory misses
- Number of memory-memory transfers = 3 for the request + 3 for the response
  - Request size is only a few bytes while the average response size is around 10 KB
- Context switches = 2
Streaming Media
There are typically two parts:
- Request part
  - One time per transaction
- Response part
  - Streaming of a very large number of packets for each transaction
  - RTP packets over UDP
Ignore the request part and consider only the RTP streaming part
RTP Streaming
Each RTP packet in a transaction's response involves:
- RTP header calculation (compulsory cache miss)
- Passing of RTP→UDP (mem-mem)
- Passing of UDP→IP (mem-mem + CPU visit)
  - Checksum of the entire UDP datagram
- Passing of IP→NIC (mem-mem)
For each RTP packet processed:
- Data cache miss rate = 1/(B/W) + c
- Number of memory-memory transfers = 3
- Number of context switches = 1
- Max. theoretical throughput = ? (given the bus bandwidth)
General Memory Access Model
At level i in the memory hierarchy (see the sketch below):
- Average memory access time = (hit time)i + (miss rate)i × (miss penalty)i
Added complexity when out-of-order execution is possible:
- Miss latency should include only non-overlapped stalls
- Measurements should be consistent for a given processor
- Memory stall cycles/instruction = (misses/instruction) × {total miss latency - overlapped miss latency}
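A minimal sketch of the average-memory-access-time relation, with illustrative (assumed) cycle counts:

# AMAT_i = (hit time)_i + (miss rate)_i * (miss penalty)_i
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

print(amat(hit_time=1, miss_rate=0.0625, miss_penalty=10))  # 1.625 cycles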
General Memory Performance Optimization Techniques
- Reduce miss penalty
- Reduce time to hit in the cache
- Parallelism
  - Use pipelined accesses
  - Prefetching → essentially reduces miss penalty
- Reduce miss rate
  - Typical compiler based optimization technique
How to Get High Throughput?
Bypass the cache:
- Direct memory access from the NIC
- Should decrease the latency by one order of magnitude
- Memory bandwidth can still become a bottleneck
Hide the latency:
- Use multiple threads for independent transactions (see the sketch below)
- Superscalar and pipelined architectures can ensure useful work while one thread stalls
- Throughput is dependent on the number of concurrent threads
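A minimal sketch (assumed, not from the talk) of hiding per-transaction latency by overlapping independent transactions with a thread pool:

from concurrent.futures import ThreadPoolExecutor
import time

def handle_transaction(pdu_id):
    time.sleep(0.001)   # stand-in for a memory or I/O stall
    return pdu_id

# Independent transactions overlap, so total time is far less than 64 x 1 ms.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(handle_transaction, range(64)))
print(len(results))     # 64 transactions processed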
Trace Driven Simulation
[Figure: trace-driven simulation flow relating the application source code, its parse tree, a model generator, instrumented code execution, model execution, and the resulting memory parameters]
Tuning Using Hardware Counters
[Figure: tuning loop in which the application program is executed on the target system, performance counter based measurements feed performance analysis, and the analysis drives source code modifications]
Latency Hiding
We use measurements to compare the memory performance and throughput of the Darwin streaming server and the Windows media server.
Platforms:
- Darwin running under Linux
- WMS running under Windows 2000 Server
Metrics for comparison:
- CPU utilization
- Cache miss rate
- Page fault rate
- Data throughput (Mbits/sec)
Comparison of Streaming Media Servers
[Figure: testbed with one server (Pentium 166 MHz, 128 MB RAM) and four clients (Pentium 166 MHz, 48-64 MB RAM) connected through a 100 Mbps layer 2 switch]
- The server machine runs Darwin Streaming Server under Linux
- The same server machine runs Windows Media Server under Win2K Server
Latency Hiding (Cont’d)
[Figure: four panels comparing DSS and WMS against the number of client requests (4, 100, 400): average throughput in Mbps (single stream, 300k), server CPU utilization % (300k, multiple streams), cache misses/sec (multiple, 300k), and page faults/sec (multiple, 300k)]
Latency Hiding (Cont’d)
Peak throughput:
- Indicated by 100% CPU usage
- Windows Media Server delivers significantly larger throughput at higher load than Darwin Streaming Server
Memory performance:
- WMS shows high cache miss and page fault rates at high loads but still delivers better throughput
- Better exploitation of the latency hiding opportunities offered by the processor, through the OS, compiler, and application
Don't expect more from a freely available media server!!
- Darwin is available in the public domain with source code from www.apple.com/quicktime
Technologies
STREAMS
STREAM: a full-duplex communication channel between a user-level process and a device.
A STREAM consists of:
- a STREAM head that interfaces with the user process
- a driver end that interfaces with the device
- zero or more STREAM modules between them
Each module contains a read queue and a write queue.
Message passing is used to communicate between queues.
The STREAMS Structure
Interrupt-Driven I/O Cycle
Known Application Characteristics
Some applications are compute-intensive:
- Large number of memory accesses
- Challenge: how to keep the CPU busy during every cycle
  - Identification of the problem: simulation/measurement/static analysis
  - Tuning: code transformations/algorithmic modifications
  - Both identification and tuning are hard!!
Some applications allow latency hiding:
- Multithreaded applications
- Example: high-throughput servers that process independent transactions