Transcript ppt
ECE 526
Network Processing Systems
Design
Lecture 3-6: Implementation Models
A rather small set of key concepts is enough.
Only by learning the essence of each topic, and
by carrying along the least amount of mental
baggage at each step, will the student emerge
with a good overall understanding of the subject!
- Carver Mead and Lynn Conway
Ning Weng
ECE 526
2
Goals of Models
• Before we can play with improvements, we must know “the
rules of the game”.
• Algorithmics uses four separate areas: protocols, architecture,
OS and algorithms
• Question: How can a logic designer understand protocols and
how can an algorithm designer understand hardware
• Answer:
Use simple models that have explanatory and predictive power
Leave details to the specialist
Ning Weng
ECE 526
3
Outline of Models
•
•
•
•
Networking protocols
Hardware
Architecture
Operating systems
Ning Weng
ECE 526
4
Transport Models
•
•
Presents illusion of two shared queues in each direction
despite unreliable network using seq # and retransmission
Connection between local queues named ports at IP
addresses
Ning Weng
ECE 526
5
Routing Protocols
•
•
Forwarding: router determines next hop/router based on
forwarding table to send a packet. Speed-critical.
Routing: routers work together to populate/build forwarding
table
RIP (Exchange distance estimates)
OSPF (Link state)
BGP (Path vector, based on policy)
Ning Weng
ECE 526
6
Abstract Model of Protocols
• Interfaces: used by applications or higher level protocols
• Packet (message) format: IP,TCP, UDP
• Typical protocol operations
Data copy: from application to kernel; from input link to
output link via switch
Demultiplexing
Looking up state: longest prefix match
Set timers and receive alarms
Ning Weng
ECE 526
7
Measures
• Throughput: number of jobs per second
Focus of industry, ISPs wish to maximize
• Latency: time to complete a job
E.g. time for a packet to go from one end to
another (RTT)
Ideal interactive time 100ms (today 250 msec across US)
Ning Weng
ECE 526
8
Performance facts
• Backbone link speeds:
Optical carrier (OC)-n → n x 51.8Mbits
OC48 ~ 2.5 Gbps, OC192~10Gbps
• TCP and web dominance : Web (70%) and
TCP (90%)
• Small transfers: 50% of accessed files <
50Kbytes
• Poor locality: 1M concurrent type of packets in
backbone. 5 packets to same destination
• Latencies large, e.g. 100 ms for wide area
• Wire speed forwarding
half of packets 40 bytes, many 1500 bytes
Takes 8 ns to send 40 bytes at 40 Gbps
Ning Weng
ECE 526
9
Case study: iSCSI
• Large data centers connect disks and computers with a SAN
(Storage area network) to enable disk sharing. Expensive
today.
• Single SCSI command can READ: 10Mbytes from a remote
disk
• iSCSI over TCP/IP: must implement TCP in hardware, add
iSCSI header with length.
• SCSI messages can be issued out of order, but TCP does not
allow!!!
Case study: the protocol features affect greatly on application
performance
Ning Weng
ECE 526
10
Models
•
•
•
•
Networking (protocols)
Hardware
Architecture
Operating systems
Ning Weng
ECE 526
11
Hardware
• Combinatorial logic
─ From transistors to gates
─ Timing and power
─ High level building blocks (standard cells)
• Memories
─
─
Registers, SRAM, DRAM
Fancy memory tricks (page mode, interleaved)
Ning Weng
ECE 526
12
Combinatorial logic
• A function from digital inputs to outputs
─ Wires, transistors, resistors, capacitors
─ Gates: AND, NAND, NOT, etc. (60 ps)
─ More complex blocks: adders, multiplexers
• Different designs for same Boolean function offer
different trade-offs.
─
─
─
Time for the output to stabilize
Number of transistors
Power dissipation P=CV2F
Ning Weng
ECE 526
13
Priority encoder timing
• Input I and outputs O are N-bit vectors
such that
─ O[j] = 1 iff I[j] = 1 and I[k] = 0, for all k<j.
─ Find first one, represent in one-hot notation
─
• Design 1: Implement using N AND gates
where each gate takes all 1 to N inputs
─ O(N2) appears to take O(1) time.
Ning Weng
ECE 526
14
Priority encoder timing
• Design 2: Every output bit requires the AND of
compliment of first j-1 bits
─ Construct recursively P[j] = P[j-1]. ~I[j]
─ Using N 2-input AND gates in series
─ O(N) transistors but takes O(N)
Summary: Design 1 fast and fat; Design 2 slow and lean
• Design 3: Tradeoff design.
─ Idea: use balanced binary tree to compute P[j] and combine partial
results.
Ning Weng
ECE 526
15
Reduction using standard cells
• Build four input mux from 2 input muxes.
Ning Weng
ECE 526
16
Crossbar scheduler
• PPE: like a priority encoder but with rotating
priority.
─ Find first 1 beyond P
• Arises in switch arbitration
• Design 1:
─ Use a barrel shifter to shift I to the left by P bits.
─ Apply PE
─ Shift back by P bits.
• Two shifts + PE ~ 3 log N gate delays.
─ Each barrel shift using tree of 2-muxes ~ log (N)
Ning Weng
ECE 526
17
Crossbar scheduler
• Smarter scheme due to Gupta/McKeown
• First check if any bit set at or
after P
• If yes, apply PE on bits after P
• If no, apply PE on bits before P
• Used in Tiny Tera and Abrizio
• Tested using TI libraries
2x fast and 3x less area
For a 32 port router.
Ning Weng
ECE 526
18
Memories
• Router forwarding done using logic but packets
stored in memories.
• Memory are significant bottlenecks
─ Slower than logic delays
• Different subsystems require different types
─ 200ms worth of packet buffering (DRAM)
─ 1Mbyte of fast SRAM for forwarding tables
• Need different models
Ning Weng
ECE 526
19
Registers
• Need to store bit such that it stays indefinitely
(absence of writes/power failures)
• Register is ordered collection of flip-flops
• Access from logic to a register ~ 0.5-1ns
Ning Weng
ECE 526
20
SRAM
• N registers addressed by log N bits Addresses.
• Self refreshing.
• Needs a decoder to translate A to unary so
add decode delay.
• On chip SRAM 1-2ns, Off-chip 5-10ns.
Ning Weng
ECE 526
21
DRAM
• SRAM flip-flop needs 5 transistors.
• DRAM needs one transistor plus capacitor
(so more dense)
• Leakage fixed by refresh.
• Slower because output not driven by the
power supply as in SRAM
• 40-60ns
Ning Weng
ECE 526
22
Memories
• Page mode DRAM (latency 40-60ns)
• Direct decoding hard: use divide and conquer
Decode in stages, row first and column next
Reduces decoder from O(N) to O(√N) gates
• Accesses within rows do not require Row address
strobe.
Ning Weng
ECE 526
23
Interleaved DRAMs
•
•
•
•
Increase DRAM throughput (same latency)
While Bank 1 works, send addresses to Bank 2, etc.
SDRAM uses 2 banks, RDRAM uses 16 banks.
If consecutive accesses to different banks, bandwidth
increased by a factor of B
Ning Weng
ECE 526
24
Example: Pipelined flow id lookup
•
•
•
•
•
Lookup 1M 96-bit packet ID for accounting using binary
tree in 128nsec (OC-48)
SRAM infeasible. DRAM too slow (20 accesses)
Use interleaved memory + pipeline
Direct RDRAM runs at 800MHz, read access 60ns
216 too small. So use RAMBUS page mode to retrieve two
96-bit keys + three 20bit pointers in one 256 bit access.
Ning Weng
ECE 526
25
Pin count limitations matter
• Example: Router with five 10GBps links
• Overall buffering - 200ms * 50Gbps
• Want to use DRAM
Bw required 2 * 50 = 100 Gbits/sec + overhead ~ 200 Gbps
• Use single RDRAM with 16 banks
Peak bw of 1.6 Gbytes or 13Gbps
• Accessing each RDRAM requires 64 pins for data, 25
for address/control ~90.
200 Gbps memory requires 16 RDRAMs ~ 1440 pins
• Chips have pin limitations (typically <1000pins)
• Thus requires at least one more chip.
Ning Weng
ECE 526
26
Real world: chip design process
• Box architect partitions functions between chips.
• Design teams write specifications, logic designers
write RTL using Verilog.
• Synthesize gates to generate circuits.
• Timing checks to change design.
• Finally chip tapes out, manufactured, yield inspected.
• Easy to add features before RTL, second spins are
expensive.
Ning Weng
ECE 526
27
Next Lecture
• Networking (protocols)
• Hardware
• Device architecture
• Operating systems
Ning Weng
ECE 526
28
Architecture Models
• Optimizing network performance requires optimizing
all data paths.
─ Through the internals of source node, sink node and routers
• End node architectures optimized for general
computation.
• Router architectures for Internet communication.
End Node Architecture
•
•
•
Main memory (1GB, 50-80nsec, L2 cache, on chip SRAM
registers)
Direct mapped caches
Cache lines, spatial locality, page mode
Memory mapped I/O, DMA, I/O bus versus processor bus.
Alternative endnode architecture
•
•
•
•
Packets from network Mem2.
Processor Reads Mem1.
Switch can alternate the accesses
E.g., infiniband as a replacement for PCI
What a Router Looks Like
Router Architectures
• Major tasks:
Lookup
Classification
Switching
Queuing
• Less time-critical
Header verification
Checksum
Route computation
Generic Router Architecture
Slide courtesy of Nick McKeown
Generic Router Architecture
Network Processors
• General purposes processors optimized for network
traffic
Motivated by unpredictable router tasks.
• Often just multiple processors that work on different
packets at the same time.
Intel IXP has 6 processors with 5.6ns clock.
• Use multithreading to quickly switch contexts
IXP has 4 contexts.
• Special purpose instructions for address lookup and
other functions.
Why NPUs seem like a good idea
• What makes a CPU appealing for a PC.
Flexibility: Supports many applications
Time to market: Allows quick introduction of
new applications.
Future proof: Supports as-yet unthought of
applications
• No-one would consider using fixed function ASICs for a PC
Why NPUs seem like a good idea
• What makes a NPU appealing
Flexibility: Protocols and standards change.
Time to market: Saves 18months building an ASIC. Code re-use.
Future proof: New protocols emerge.
Less risk: Bugs more easily fixed in s/w.
• Surely no-one would consider using fixed function ASICs for
new networking equipment?
Network Processors: Load-balancing
Incoming packets dispatched to:
1. Idle processor, or
2. Processor dedicated to
packets in this flow (to
prevent mis-sequencing), or
3. Special-purpose processor
for flow, e.g. security,
transcoding, applicationlevel processing.
Network Processors: Pipelining
Processing broken down into (hopefully balanced) steps,
Each processor performs one step of processing.
Models
•
•
•
•
Networking (protocols)
Hardware
Device architecture
Operating systems
Operating System
• Routers usually have a lightweight OS
─ Different from traditional Oses
• End-node performance dependent on OS
• Abstractions
─ Uninterrupted Computation: No Interrupts
─ Infinite Memory: just an illusion.
─ Simple I/O: Avoid dealing directly with devices (simple writes/reads)
Uninterrupted communication
• Underlying mechanisms
Context switching
Scheduling
Protection
• Flavors of “process” - increasing complexity
Interrupt handlers, threads, processes
Receiver livelock in BSD
• Keep processing packets only to discard because
apps never run.
• Latency depends on process
Interrupts (10us), context switch (10-100us)
Infinite memory
•
Via virtual memory
Page mapping (avoids finding contiguous locations).
Demand paging (use more space than memory)
Slow DRAM lookup avoided with fast TLB
Protection by allowing only OS to modify page tables.
•
•
Simple I/O using system calls
•
•
•
•
For abstraction alone, I/O could be libraries.
For security, I/O handled by device drivers.
System calls, trap to kernel protection levels.
More expensive function call, because of privilege
escalation.
Next class…
• Implementation principles.
• Read chapter 3 and 4.
• Quiz1