Connecting the Internet in 2011


Architecture for Network Hub in 2011
David Chinnery
Ben Horowitz
Internet Model

Network time-of-flight latency
– Unavoidable

End point latency
– Limited by cheap solution for users

Latency of internet nodes (hubs, gateways)
– Can provide differentiated services
→ High priority packets
→ Other packets
– If bandwidth insufficient, use multiple chips
→ Send a range of wavelengths to each chip
Internet Visualization
[Figure: worst case packet journey halfway around the world, San Francisco, USA to Perth, Australia]
– 2 end users
– 2 gateways
– ? hubs
– 0.200 s tolerable latency for video conferencing
Maximum Nodes Packet Travels

Average number of nodes traveled = log(number of nodes in internet)
– Journey of 15.7 nodes average in 1996

Estimate one node/person in 2011
– Journey of 22.7 nodes average in 2011
39 nodes worst case in 1996 (1 in 1000)
→ Scaling by ratio of averages gives 56.3 nodes worst case in 2011 (1 in 1000)
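The hop-count estimate above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the slide's "log" is the natural logarithm (that reading reproduces roughly one node per person for 22.7 hops); the function names are illustrative and not from the original talk.

```python
import math

def nodes_from_average_hops(avg_hops):
    """Invert avg_hops = ln(N) to get the implied number of nodes N."""
    return math.exp(avg_hops)

def scale_worst_case(worst_old, avg_old, avg_new):
    """Scale a worst case hop count by the ratio of the average hop counts."""
    return worst_old * avg_new / avg_old

n_1996 = nodes_from_average_hops(15.7)   # several million nodes
n_2011 = nodes_from_average_hops(22.7)   # several billion nodes
worst_2011 = scale_worst_case(39, 15.7, 22.7)

print(f"implied nodes in 1996: {n_1996:.3g}")
print(f"implied nodes in 2011: {n_2011:.3g}")
print(f"worst case hops in 2011: {worst_2011:.1f}")
```

The scaled worst case lands at about 56 nodes, in line with the figure on the slide.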

[Figure: chain of nodes along the worst case journey, numbered 1, 2, 3, 4 through 54, 55, 56]
Time of Flight
Optic fiber delay 5 us/km
→ Restore signal with repeaters every 100 km

– Repeater delay 0.92 us [1999]
Worst case journey length ~20,100 km
→ 20,100 × 5 + 201 × 0.92 ≈ 100,700 us
→ Time of flight delay of 0.101 s
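The time-of-flight arithmetic can be reproduced as below. This is only a sketch of the calculation on the slide; the constant and function names are illustrative.

```python
FIBER_DELAY_US_PER_KM = 5.0    # optic fiber delay from the slide
REPEATER_SPACING_KM = 100.0    # one repeater every 100 km
REPEATER_DELAY_US = 0.92       # repeater delay [1999]

def time_of_flight_us(distance_km):
    """Fiber propagation delay plus the delay of every repeater on the path."""
    repeaters = int(distance_km // REPEATER_SPACING_KM)
    return distance_km * FIBER_DELAY_US_PER_KM + repeaters * REPEATER_DELAY_US

worst_case_us = time_of_flight_us(20_100)
print(f"{worst_case_us:.2f} us = {worst_case_us / 1e6:.3f} s")
```

The repeaters contribute under 200 us in total, so the fiber itself dominates the 0.101 s.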

[Figure: fiber path with a repeater (0.92 us delay) every 100 km and 500 us of fiber delay per 100 km span]
Internet Visualization
[Figure: worst case packet journey halfway around the world, San Francisco, USA to Perth, Australia]
– Time of flight: 0.101 s
– 2 end users: ?
– 2 gateways: ?
– 52 hubs: ?
– 0.200 s tolerable latency for video conferencing
End User Model
Worst case scenario
→ Processing intensive application
– MPEG4 encoding for HDTV2

Limited silicon area, as must be low cost
– Sufficient for 1920×1080 HDTV2 at 30 Hz
→ Processing latency 1/30 s

End user to end user
→ Processing latency doubled: 0.033 s at each end, 0.067 s in total
Internet Visualization
[Figure: worst case packet journey halfway around the world, San Francisco, USA to Perth, Australia]
– Time of flight: 0.101 s
– 2 end users: 0.067 s (0.033 s at each end)
– 2 gateways: ?
– 52 hubs: ?
– 0.200 s tolerable latency for video conferencing
Node Hardware Model
Processing cores are Intel IXP1200 routers
→ Conservative ASIC frequency estimate

– IXP1200 speed of 166 MHz in 0.28 um
– Linearly scale to 0.18 um → speed ×1.56
– Speed ×3.00 from 0.18 um to 0.05 um [ITRS]
→ IXP1200 speed of 775 MHz in 2011
→ Assume across chip speed of 775 MHz
– With custom macros at 10 GHz in 2011

ITRS estimate, across chip speed of 1.5 GHz
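The clock scaling steps above amount to two multiplications. The sketch below follows the slide's inputs; the function name and parameter names are illustrative.

```python
def scaled_clock_mhz(base_mhz, base_um, mid_um, itrs_factor):
    """Scale linearly with feature size to mid_um, then apply the ITRS speed factor."""
    linear_factor = base_um / mid_um      # 0.28 / 0.18, about 1.56
    return base_mhz * linear_factor * itrs_factor

clock_2011_mhz = scaled_clock_mhz(base_mhz=166, base_um=0.28, mid_um=0.18, itrs_factor=3.0)
print(f"estimated IXP1200 clock in 2011: {clock_2011_mhz:.0f} MHz")
```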
Node Router Hardware

For gateways or hubs
– 2011 ASIC: 8 cm2, 811 million transistors/cm2
→ 6500 million transistors

6.5 million transistors for IXP1200
– If 2/3 of chip is memory and wires
→ Up to 333 IXP1200s on same chip
→ Estimate 300 IXP1200s
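The transistor budget can be sketched as follows. The default values are the ones on the slide; the function name is illustrative, and the unrounded total of 6488 million transistors is used rather than the rounded 6500 million.

```python
def ixp1200s_per_chip(area_cm2=8, density_m_per_cm2=811,
                      ixp_transistors_m=6.5, logic_fraction=1 / 3):
    """Number of IXP1200 cores that fit if only logic_fraction of the chip holds cores."""
    total_transistors_m = area_cm2 * density_m_per_cm2   # millions of transistors
    return total_transistors_m * logic_fraction / ixp_transistors_m

cores = ixp1200s_per_chip()
print(f"about {cores:.0f} IXP1200s fit on the chip")
```

Using the unrounded total gives just under 333 cores, which is why 300 is a comfortable estimate.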
Packet Processing at Nodes

Maximum onto chip bandwidth
– 927 pins chip-to-package in 2011
→ 359 Gbit/s at 775 MHz, 695 Gbit/s at 1.5 GHz

Scaling IXP1200 to 2011, each can process 11 million (21 million at 1.5 GHz) packets/second
– 300 IXP1200s can process 3.3 billion packets/s (6.3 billion)

Smallest IP packet is 20 bytes (header size)
– Maximum required processing of 2.2 billion packets/s (4.3 billion)
→ Spare processing power available
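A rough version of this bandwidth versus processing comparison is sketched below. It assumes one bit per pin per clock with half of the pin bandwidth used in each direction, and it takes the 1999 rate of 2.3 million packets/s per IXP1200 from the tables later in the deck; the names are illustrative.

```python
PINS = 927                  # chip-to-package pins in 2011
MIN_PACKET_BITS = 20 * 8    # smallest IP packet: a 20 byte header
IXP_COUNT = 300             # estimated IXP1200s per chip
PPS_1999 = 2.3e6            # packets/s for one IXP1200 at 166 MHz
BASE_CLOCK_MHZ = 166

def packet_budget(clock_mhz):
    """Return (onto-chip Gbit/s, required packets/s, available packets/s)."""
    onto_chip_gbps = PINS * clock_mhz / 1000 / 2      # half the pins' bandwidth each way
    required_pps = onto_chip_gbps * 1e9 / MIN_PACKET_BITS
    capacity_pps = PPS_1999 * clock_mhz / BASE_CLOCK_MHZ * IXP_COUNT
    return onto_chip_gbps, required_pps, capacity_pps

for clock in (775, 1500):
    gbps, need, have = packet_budget(clock)
    print(f"{clock} MHz: {gbps:.0f} Gbit/s, need {need / 1e9:.1f} G packets/s, "
          f"can process {have / 1e9:.1f} G packets/s")
```

At both clock speeds the available packet rate exceeds the worst case of back-to-back 20 byte packets.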
Bus and I/O Overview
[Figure: a 448 bit input bus with 48 bit header detection feeds in queues Q1in to Q20in under Qin control; each queue serves a row of 15 processors (IXP1,1 to IXP20,15) that drain into out queues Q1out to Q20out under Qout control and a 448 bit output bus; the rows are linked by 128 bit and 64 bit control buses and a 32 bit I/O bus]
Header Detection Hardware

Custom header detection macro runs at 13 times chip speed, 10.075 GHz
– 12 cycles for comparison, 1 to send positions

Forty 48-bit comparators (80 at 1.5 GHz)
– Up to 6 bytes detection (Ethernet destination)
– Store last 47 bits from previous 448 bit word
[Figure: the last 47 bits of the previous word (t-1) and the 448 bits of the current word (t) pass through a 1 bit shifter into forty 48 bit comparators]
48-bit Comparators
Set mask for comparison to 0, 1 or X (don't care)
→ Custom comparison circuit

– Signals and their negation are available from registers
– 10 transistors to implement
7 bit counter with each to set header position
→ About 30,000 transistors total
→ Possible 3 packets/448 bits

→ 31 bits of bus to send positions
[Figure: custom comparison circuit for bit i, with signals input_i and mask_i (each with its negation) and care_i]
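The masked comparison can be modeled in software as below. This is an illustrative sketch only: the hardware evaluates its comparators in parallel, while this model slides one 48 bit pattern across the 47 stored bits plus the current 448 bit word. Bits are held as strings of '0' and '1' for readability, and the function names are hypothetical.

```python
from typing import List

def ternary_match(window: str, pattern: str) -> bool:
    """True if every pattern bit is 'X' (don't care) or equals the window bit."""
    return all(p == "X" or p == w for w, p in zip(window, pattern))

def find_headers(prev_tail: str, word: str, pattern: str) -> List[int]:
    """Return every start position where the pattern matches.

    prev_tail holds the last 47 bits of the previous word so that a header
    straddling two 448 bit words is still detected.
    """
    stream = prev_tail + word
    width = len(pattern)
    return [i for i in range(len(stream) - width + 1)
            if ternary_match(stream[i:i + width], pattern)]

pattern = "1" * 8 + "X" * 40                    # 48 bit mask: 8 fixed bits, 40 don't cares
prev_tail = "0" * 47
word = "0" * 100 + "1" * 8 + "0" * 340          # 448 bit word with one marker
positions = find_headers(prev_tail, word, pattern)
print(positions)
```

With a 47 bit tail and a 448 bit word there are 448 candidate start positions, which matches forty comparators each covering 12 shifted positions with some slack.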
Simulator
Other simulators cumbersome for our task
→ Wrote event driven simulator in Java

– Worst case simulations:
→ Can easily process at maximum bandwidth with no additional latency
Worst Case Scenario Results

Worst case scenario
– Minimum packet size is 20 bytes
– 448 bit input bus
→ 3 packets or less per cycle
– IXP1200 time to calculate next destination
→ 75 cycles minimum, 345 cycles average
→ 600 cycles maximum
→ At most 7 packets processed simultaneously on IXP1200
– IXP1200 has 6 micro-engines
→ Load handled easily
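The worst case load per processor can be explored with a small cycle-stepped model. This is not the authors' Java simulator; it is a sketch that assumes 300 IXP1200s, round-robin dispatch of 3 packets per cycle, and every packet taking the 600 cycle maximum. All names are illustrative.

```python
import heapq

def peak_packets_per_ixp(num_ixps=300, packets_per_cycle=3,
                         process_cycles=600, sim_cycles=5000):
    """Return the peak number of packets in flight on any single IXP1200."""
    finish_times = [[] for _ in range(num_ixps)]   # one min-heap of finish cycles per IXP
    peak = 0
    next_ixp = 0
    for cycle in range(sim_cycles):
        for _ in range(packets_per_cycle):
            heap = finish_times[next_ixp]
            while heap and heap[0] <= cycle:       # retire packets that have finished
                heapq.heappop(heap)
            heapq.heappush(heap, cycle + process_cycles)
            peak = max(peak, len(heap))
            next_ixp = (next_ixp + 1) % num_ixps
    return peak

peak = peak_packets_per_ixp()
print(f"peak packets in flight per IXP1200: {peak}")
```

Under these assumptions each processor receives a packet every 100 cycles, so the model stays within the bound of 7 simultaneous packets quoted above.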
Conclusions from Simulation
→ Latency of 605 cycles: 0.78 us at 775 MHz, 0.40 us at 1.5 GHz
→ Largest possible packet that could be sent after started processing is 65,536 bytes
→ Additional 1170 cycles latency: 1.51 us, 0.78 us

Transceiver delay 0.05 us [1999]
→ Additional 0.10 us/hop

Total latency/hop of 2.4 us, 1.3 us
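The per-hop total can be rebuilt from its three parts. The sketch assumes the 0.10 us per hop comes from two 0.05 us transceivers (one in, one out) and that the 1170 cycles are the time to clock a 65,536 byte packet across the 448 bit bus; both readings are inferred from the numbers on the slide, and the names are illustrative.

```python
BUS_BITS = 448

def per_hop_latency_us(clock_mhz, processing_cycles=605,
                       max_packet_bytes=65536, transceiver_us=0.05):
    """Processing latency + wait behind one maximum-size packet + two transceivers."""
    processing_us = processing_cycles / clock_mhz
    blocking_cycles = max_packet_bytes * 8 / BUS_BITS   # about 1170 cycles
    blocking_us = blocking_cycles / clock_mhz
    return processing_us + blocking_us + 2 * transceiver_us

hop_775 = per_hop_latency_us(775)
hop_1500 = per_hop_latency_us(1500)
print(f"{hop_775:.2f} us at 775 MHz, {hop_1500:.2f} us at 1.5 GHz")
```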
Internet Visualization
[Figure: worst case packet journey halfway around the world, San Francisco, USA to Perth, Australia]
– Time of flight: 0.101 s
– 2 end users: 0.067 s (0.033 s at each end)
– 2 gateways: < 0.001 s
– 52 hubs (probability of 1 in 1000): < 0.001 s at 0.0000024 s/hub
– Total 0.169 s: tolerable latency for video conferencing
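The end-to-end budget can be tallied as a quick check against the 0.200 s target. This sketch uses unrounded component values and assumes gateways cost the same 2.4 us per node as hubs; the variable names are illustrative.

```python
TOLERABLE_S = 0.200

time_of_flight_s = 100_700e-6       # worst case fiber and repeater delay
end_users_s = 2 * (1 / 30)          # one frame time of processing at each end
per_node_s = 2.4e-6                 # per-hop latency at 775 MHz
nodes = 2 + 52                      # 2 gateways and 52 hubs

total_s = time_of_flight_s + end_users_s + nodes * per_node_s
print(f"total {total_s:.4f} s, margin {TOLERABLE_S - total_s:.4f} s")
```

With unrounded inputs the sum comes out slightly under the 0.169 s shown on the slide, which may reflect rounding each component up. Either way the node latency is a negligible share of the budget.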

Conclusions
→ Limiting factor is maximum bandwidth
→ Average case simulations done
→ Can easily process at maximum bandwidth with 40 IXP1200 processors (mostly longer packets)
→ Reduce processing power to levels sufficient for bandwidth and model
– Fewer IXP1200s on chip
– Smaller chip size reduces cost
– Reduced processing power increases congestion, and may require high priority packets for some communications
448 Bit Operation Cycles


448 bits onto chip
Up to 48 bit header detection on previous 47 bits, and 401 bits of current 448 bits (48 bit comparators)
– Send header positions in this 448 bit window
Send to high priority and low priority in queues
Packet priority detection (header) in queues
Incorrect priority queue drops packet, in queue controller informed
Remainder of packet sent to appropriate in queue
Process packet header, send packet body to out queue
Process times between 70 and 600 cycles, 345 cycles avg.
Send updated packet header to out queue
Inform out queue controller packet ready to send
Send when output bus available
448 bits off chip
Maximum Throughput
Node Hardware
For gateways or hubs
→ 6.5 million transistors for IXP1200
→ 0.5 million transistors for other applications such as speech codecs, V.42bis, Huffman compression, and 3DES
– ASIC max. transistors in 1999 (millions/cm2) [ITRS]: 20
– ASIC max. transistors in 2011 (millions/cm2) [ITRS]: 811
– ASIC max. chip size in 1999 (cm2) [ITRS]: 8
– ASIC max. chip size in 2011 (cm2) [ITRS]: 8
– ASIC max. number of transistors/chip in 2011 (millions): 6488
– Transistors for backbone IXP1200 + other possible applications (millions), one of each: 7.0
– Ideal possible number of IXP1200 + other possible applications/chip: 931
– Assume 2/3 overhead for memory, routing, and other CPUs; number of IXP1200s et al.: 310
→ Up to 310 IXP1200s on the same chip
Packet Processing at Nodes

Maximum onto chip bandwidth (first value at 775 MHz, value in parentheses at 1.5 GHz)
– Chip-to-package pads in 2011: 927 (927)
– Maximum I/O speed at IXP1200 operating speed with maximum use of pads (Gbit/s): 718 (1391)
– Maximum I/O bandwidth onto chip, must get back off chip as well (Gbit/s): 359 (695)
Smallest IP packet is 20 bytes (header size)
→ Maximum required processing power

927 pins with I/O at clock speed (first value at 775 MHz, value in parentheses at 1.5 GHz)
– Number of bits/s that must be processed on chip (Gbit/s): 359 (695)
– Smallest possible packet size (bytes): 20 (20)
– Worst case number of (20 byte) packets/s that must be processed per IXP1200: 7,230,414 (14,000,372)
– Number of packets that IXP1200 can process/s in 1999: 2,300,000 (2,300,000)
– Number of packets that IXP1200 can process/s in 2011: 10,733,333 (20,783,133)
– Number of packets that can be processed/s in 2011 on our chip: 3,331,317,770 (6,450,486,216)
Hub Cache and Main Memory
Required for IXP1200s
→ Assumed by Scott in IXP1200 simulations:

– 4 MB of DRAM
– 2 MB of SRAM
– DRAM memory per IXP1200 in 1999 (Gbytes): 0.004
– SRAM memory per IXP1200 in 1999 (Gbytes): 0.002
– DRAM memory, or equivalent, required in 2011 for IXP1200s (Gbytes): 1.24
– SRAM memory, or equivalent, required in 2011 for IXP1200s (Gbytes): 0.62
– Area required for DRAM for IXP1200s (cm^2): 0.17
– Area required for SRAM for IXP1200s (cm^2): 0.24
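The memory totals follow from multiplying the per-processor figures by the processor count. The sketch assumes the 310 IXP1200s from the node hardware table; the names are illustrative, and the area figures are not reproduced because the memory densities are not given.

```python
IXP_COUNT = 310                # IXP1200s per chip, from the node hardware table
DRAM_PER_IXP_GBYTES = 0.004    # 4 MB of DRAM each
SRAM_PER_IXP_GBYTES = 0.002    # 2 MB of SRAM each

dram_total_gbytes = IXP_COUNT * DRAM_PER_IXP_GBYTES
sram_total_gbytes = IXP_COUNT * SRAM_PER_IXP_GBYTES
print(f"DRAM {dram_total_gbytes:.2f} Gbytes, SRAM {sram_total_gbytes:.2f} Gbytes")
```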
Hub Register Memory
– Average packet latency in 2011 (s): 0.00000009
– Latency for a single packet in 2011 (s): 0.00000045
– Number of pins for packet I/O (this many each to get on and off): 463.5
– Maximum bandwidth onto chip and back off chip (Gbit/s): 359
– Minimum IPv4 packet size (bytes): 20
– Maximum number of IPv4 packets/s (x10^9): 18
– Maximum number of IPv4 packets to store while a packet is processed: 9668
– Maximum packet size in IPv6 (bytes): 65536
– Average storage capacity required at maximum bandwidth (Gbit): 0.00019
– Number of in queues: 20
– Number of out queues: 20
– Assuming one maximum length packet in each queue, register storage required (Gbit): 0.021
– Area of all the registers (cm^2): 0.028
Average Scenario Information

Assumed normal distribution between 80 and 600 cycles to process a packet
– Average of 340 cycles
– 80 and 600 are two standard deviations from mean
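The processing-time model above can be sampled as follows. Placing 80 and 600 at two standard deviations from a mean of 340 implies a standard deviation of 130 cycles. Rejecting draws outside the 80 to 600 range is an assumption made here to keep samples "between 80 and 600"; the original simulator may have handled the tails differently, and the names are illustrative.

```python
import random

MEAN_CYCLES = 340
LOW_CYCLES, HIGH_CYCLES = 80, 600
SIGMA_CYCLES = (HIGH_CYCLES - LOW_CYCLES) / 4   # bounds sit two sigma either side: 130

def sample_processing_cycles(rng):
    """Draw a processing time, redrawing until it falls within the assumed bounds."""
    while True:
        cycles = rng.gauss(MEAN_CYCLES, SIGMA_CYCLES)
        if LOW_CYCLES <= cycles <= HIGH_CYCLES:
            return round(cycles)

rng = random.Random(0)                          # fixed seed for repeatable runs
samples = [sample_processing_cycles(rng) for _ in range(10_000)]
print(f"mean of samples: {sum(samples) / len(samples):.1f} cycles")
```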

Packet sizes: