Transcript Swiss ICT Task Force
The ongoing evolution from packet-based networks to hybrid networks in Research & Education Networks
31 October 2005
Olivier Martin, CERN
Swiss ICT Task Force (Fribourg)
Presentation Outline
• The demise of conventional packet-based networks in the R&E community
• The advent of community-managed dark fiber networks
• The Grid & its associated Wide Area Networking challenges
• « On-Demand Lambda Grids »
• Ethernet over SONET & new standards
  – WAN-PHY, GFP, VCAT/LCAS, G.709, OTN
Disclaimer: The views expressed herein are not necessarily those of CERN. Furthermore, although I formally remain a CERN staff member until July 31, 2006, I have not worked for CERN since October 3, being on a pre-retirement program.
[Chart: Optical DWDM capacity vs. Internet backbone speed, 1985-2005. System capacity (Mbit/s, log scale) grows from 135/565 Mbit/s and 1.7 Gbit/s single-channel systems to DWDM systems of 2, 4, 8, 16, 32, 160 and 1024 x 10 Gbit/s wavelengths, while backbone links progress from T1/T3 and Ethernet/Fast Ethernet to GigE, OC-3c/OC-12c/OC-48c/OC-192c/OC-768c, 10-GE and 40-GE; I/O rates now equal optical wavelength capacity.]
Some facts
Internet is everywhere.
Ethernet is everywhere.
The advent of next-generation G.709 Optical Transport Networks is very uncertain!
Hence one has to learn how to live best with existing network infrastructures, which may well explain all the “hype” about “on-demand” lambda Grids!
For the first time in the history of the Internet, the commercial and the Research & Education Internet appear to follow different routes.
Will they ever converge again?
Dark-fiber-based, customer-owned long-distance networks are booming!
Users are becoming their own telecom operators.
Is it a good or a bad thing?
Internet Backbone Speeds
[Chart: Internet backbone speed in Mbit/s, 1986-2000, on a logarithmic scale (1 to 10,000,000), showing the progression from T1 lines to T3 lines, ATM VCs, OC3c and OC12c IP backbones.]
High Speed IP Network Transport
Trends
[Diagram: evolution of transport stacks, from B-ISDN (IP over ATM over SONET/SDH over Optical, with multiplexing, protection and management at every layer), to IP over ATM, to IP over SONET/SDH, to IP directly over Optical; each step brings higher speed and lower cost, complexity and overhead.]
Network Exponentials
Network vs. computer performance
– Computer speed doubles every 18 months
– Network speed doubles every 9 months
– Difference = order of magnitude per 5 years
1986 to 2000
– Computers: x 500
– Networks: x 340,000
2001 to 2010
– Computers: x 60
– Networks: x 4000
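As a quick sanity check, the quoted multipliers follow directly from the two doubling times; a minimal Python sketch (the 14-year and 9-year spans and the doubling periods are taken from the slide, the rounding is mine):

```python
# Growth factor implied by a doubling time, applied to the periods above.
def growth(years: float, doubling_months: float) -> float:
    return 2 ** (years * 12 / doubling_months)

print(round(growth(14, 18)))   # computers 1986-2000: ~650 (slide quotes x500)
print(round(growth(14, 9)))    # networks  1986-2000: ~420,000 (slide quotes x340,000)
print(round(growth(9, 18)))    # computers 2001-2010: 64 (slide quotes x60)
print(round(growth(9, 9)))     # networks  2001-2010: 4096 (slide quotes x4000)
```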
Moore’s Law vs. storage improvements vs. optical improvements. Graph from Scientific American (Jan. 2001) by Cleo Vilett, source Vinod Khosla, Kleiner Perkins Caufield & Byers.
(Slide source: Intro to Grid Computing and Globus Toolkit™, October 12, 2001.)
Know the user
[Chart: number of users F(t) vs. bandwidth requirements, with class A around ADSL speeds, class B in between, and class C at GigE LAN speeds.]
A -> Lightweight users: browsing, mailing, home use
B -> Business applications: multicast, streaming, VPNs, mostly LAN
C -> Special scientific applications: computing, data grids, virtual presence
What the users need
[Chart: total bandwidth vs. bandwidth requirements for the same user classes A (ADSL), B and C (GigE LAN).]
A -> Need full Internet routing, one to many
B -> Need VPN services and/or full Internet routing, several to several
C -> Need very fat pipes, limited to a few Virtual Organizations, few to few
So what are the facts?
• The cost of fat pipes (fibers) is about one third of the cost of the equipment needed to light them up
  – at least according to what lambda salesmen told Cees de Laat (University of Amsterdam & SURFnet)
• The cost of optical equipment is about 10% of switching equipment, which is itself about 10% of full routing equipment, for the same throughput
  – a 100-byte packet at 10 Gb/s leaves only 80 ns (800 bits / 10 Gb/s) for a lookup in a 100 MByte routing table (the time light takes from me to you on the back row!)
• Big sciences need fat pipes
• Bottom line: create a hybrid architecture which […]
Utilization trends
[Chart: utilization in Gbit/s (scale 0-30), June 1998 to January 2005, showing the network capacity limit, lightpaths, IP peak and IP average traffic.]
Today’s hierarchical IP network
[Diagram: universities and regional networks connect to their NRENs (A, B, C, D), which connect to a national or pan-national IP network, which in turn peers with other national networks.]
Tomorrow’s peer-to-peer IP network
[Diagram: universities, servers and regional networks attach to NRENs (A, B, C, D) around a national DWDM network; child lightpaths extend from the DWDM core towards end sites and towards the rest of the world.]
Creation of application VPNs
[Diagram: a university department connects directly to discipline-specific networks (a High Energy Physics network including CERN, a bio-informatics network, an eVLBI network) as well as to the commodity Internet via the university research network; the direct connections bypass the campus firewall.]
Production vs Research Campus Networks
> Increasingly, campuses are deploying parallel networks for high-end users
> This reduces costs by providing high-end network capability only to those who need it
> The limitations of the campus firewall and border router are eliminated
> Many issues remain with regard to security, back-door routing, etc.
> Campus networks may follow the same evolution as campus computing
> Discipline-specific networks are being extended into the campus
UCLP intended for projects
like National LambdaRail
CAVEwave acquires a separate wavelength between Seattle and Chicago and wants to manage it as part of its network, including add/drop, routing, partitioning, etc.
[Diagram: the NLR condominium lambda network with the original CAVEwave wavelength.]
GÉANT2 PoP Design
[Diagram: a GÉANT2 PoP built around a Juniper M-160 router and a DWDM system, with Nx10 Gbit/s links to other GÉANT2 PoPs, 2x10 Gbit/s to the local NREN, and dark fibre to other GÉANT2 PoPs.]
LHC Data Grid Hierarchy
[Diagram: the LHC data grid hierarchy. The online system at the experiment produces ~PByte/s, of which ~100-400 MByte/s reach the Tier 0+1 centre at CERN (~700k SI95, ~1 PB disk, tape robot). The CERN/outside resource ratio is ~1:2, and Tier0 : (all Tier1s) : (all Tier2s) is ~1:1:1. Tier 1 centres (IN2P3, INFN, RAL, FNAL with 200k SI95 and 600 TB) connect at 10 Gbit/s; Tier 2 centres connect at 2.5/10 Gbit/s; Tier 3 institutes (~0.25 TIPS, physics data cache) connect at ~2.5 Gbit/s; Tier 4 workstations connect at 0.1-1 Gbit/s. Physicists work on analysis “channels”; each institute has ~10 physicists working on one or more channels.]
Main Networking Challenges
• Fulfill the as yet unproven assertion that the network can be « nearly » transparent to the Grid
• Deploy suitable Wide Area Network infrastructure (50-100 Gb/s)
• Deploy suitable Local Area Network infrastructure (matching or exceeding that of the WAN)
• Seamless interconnection of LAN & WAN infrastructures (firewall?)
• End-to-end issues (transport protocols, PCs (Itanium, Xeon), 10GigE NICs (Intel, S2io)); where are we today:
  memory to memory: 7.5 Gb/s (PCI bus limit)
  memory to disk: 1.2 GByte/s (Windows 2003 Server/NewiSys)
  disk to disk: 400 MByte/s (Linux), 600 MByte/s (Windows)
Main TCP issues
• Does not scale to some environments
  – High speed, high latency
  – Noisy links
• Unfair behaviour with respect to:
  – Round Trip Time (RTT)
  – Frame size (MSS)
  – Access bandwidth
• Widespread use of multiple streams to compensate for inherent TCP/IP limitations (e.g. GridFTP, bbFTP): a bandage rather than a cure (see the sketch below)
• New TCP/IP proposals aim to restore performance in single-stream environments
  – Not clear if/when they will have a real impact
  – In the meantime there is an absolute requirement for backbones with:
    – zero packet losses,
    – and no packet re-ordering,
  which reinforces the case for “lambda Grids”
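The multi-stream workaround mentioned above can be illustrated in a few lines of Python. This is only a sketch of the idea (striping one payload over N parallel TCP connections, so that a loss halves only one stream's congestion window), not how GridFTP or bbFTP are actually implemented; the host, port and stream count are made up:

```python
import socket
from concurrent.futures import ThreadPoolExecutor

HOST, PORT, STREAMS = "receiver.example.org", 5001, 8   # hypothetical endpoint

def send_chunk(chunk: bytes) -> int:
    # Each chunk travels over its own TCP connection with its own congestion window.
    with socket.create_connection((HOST, PORT)) as s:
        s.sendall(chunk)
        return len(chunk)

def parallel_send(payload: bytes, streams: int = STREAMS) -> int:
    # Stripe the payload into roughly equal chunks, one per stream.
    size = -(-len(payload) // streams)                   # ceiling division
    chunks = [payload[i:i + size] for i in range(0, len(payload), size)]
    with ThreadPoolExecutor(max_workers=streams) as pool:
        return sum(pool.map(send_chunk, chunks))

# Example: parallel_send(b"\0" * 10**8) would push a 100 MB test payload over 8 streams.
```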
TCP dynamics
(10 Gbit/s, 100 ms RTT, 1500-byte packets)
Window size (W) = Bandwidth * Round Trip Time
– Wbits = 10 Gbit/s * 100 ms = 1 Gbit
– Wpackets = 1 Gbit / (8 * 1500) = 83,333 packets
Standard Additive Increase Multiplicative Decrease (AIMD) mechanism:
– W = W/2 (halving the congestion window on a loss event)
– W = W + 1 (increasing the congestion window by one packet every RTT)
Time to recover from W/2 to W (congestion avoidance) at 1 packet per RTT:
– RTT * Wpackets/2 = 1.157 hours
– In practice, 1 packet per 2 RTTs because of delayed ACKs, i.e. 2.31 hours
Packets per second:
– Wpackets / RTT = 833,333 packets/s
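These figures can be reproduced in a few lines; a minimal Python sketch using the slide's parameters (10 Gbit/s, 100 ms RTT, 1500-byte packets):

```python
# Reproducing the TCP dynamics numbers above.
BW, RTT, PKT = 10e9, 0.100, 1500                 # bit/s, seconds, bytes

w_bits = BW * RTT                                 # bandwidth-delay product: 1 Gbit
w_packets = w_bits / (8 * PKT)                    # ~83,333 packets in flight
recover = RTT * w_packets / 2                     # +1 packet per RTT, from W/2 back to W
print(w_bits, w_packets)                          # 1e9  83333.3
print(recover / 3600, 2 * recover / 3600)         # ~1.157 h, ~2.31 h with delayed ACKs
print(w_packets / RTT)                            # ~833,333 packets per second
```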
Internet2 land speed record history
(IPv4 & IPv6) period 2000-2004
[Chart: evolution of the Internet2 Land Speed Record from March 2000 to May 2004, with separate series for IPv4 and IPv6, single and multiple streams; the plotted records range from a few hundred Mbit/s (0.35-0.98 Gbit/s) in 2000-2002 up to 4-7.09 Gbit/s in 2003-2004.]
Layer1/2/3 networking (1)
• Conventional layer 3 technology is no longer fashionable because of:
  – High associated costs, e.g. 200-300 kUSD for a 10G router interface
  – The implied use of shared backbones
• The use of layer 1 or layer 2 technology is very attractive because it helps to solve a number of problems, e.g.
  – The 1500-byte Ethernet frame size (layer 1)
  – Protocol transparency (layers 1 & 2)
  – Minimum functionality, hence, in theory, much lower costs (layers 1 & 2)
Layer1/2/3 networking (2)
« On-demand Lambda Grids » are becoming very popular:
• Pros:
  – A circuit-oriented model like the telephone network, hence no need for complex transport protocols
  – Lower equipment costs (i.e. « in theory » a factor of 2 or 3 per layer)
  – The concept of a dedicated end-to-end lightpath is very elegant
• Cons:
  – « End to end » is still very loosely defined: site to site, cluster to cluster, or really host to host?
  – Higher circuit costs, scalability, additional middleware to deal with circuit set-up/tear-down (sketched below), etc.
  – Extending dynamic VLAN functionality is a potential nightmare!
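To make the "additional middleware" point concrete, here is a purely hypothetical sketch of the extra set-up/tear-down step an on-demand lightpath implies; none of these class or method names correspond to a real provisioning API (UCLP, GMPLS signalling and the like all look quite different):

```python
from contextlib import contextmanager

class LightpathBroker:                            # hypothetical broker, not a real API
    def reserve(self, src: str, dst: str, gbps: int) -> str:
        print(f"reserving a {gbps} Gb/s lambda {src} -> {dst}")
        return "circuit-42"                       # pretend circuit identifier
    def release(self, circuit_id: str) -> None:
        print(f"tearing down {circuit_id}")

@contextmanager
def lightpath(broker: LightpathBroker, src: str, dst: str, gbps: int):
    cid = broker.reserve(src, dst, gbps)          # circuit set-up before any data flows
    try:
        yield cid                                 # the application transfers data here
    finally:
        broker.release(cid)                       # the circuit must always be torn down

# with lightpath(LightpathBroker(), "CERN", "Caltech", 10) as cid:
#     ...  # run the transfer over the dedicated circuit
```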
« Lambda Grids »
What does it mean?
• Clearly different things to different people, hence the apparently easy consensus!
• Conservatively, on-demand « site to site » connectivity
  – Where is the innovation?
  – What does it solve in terms of transport protocols?
  – Where are the savings?
    – Fewer interfaces needed (customer) but more standby/idle circuits needed (provider)
    – Economics from the service provider vs the customer perspective?
      » Traditionally, switched services have been very expensive
      » Usage vs flat charge; break-even, switched vs leased, at a few hours/day
      » Why would this change?
  – If there are no savings, why bother?
• More advanced, cluster to cluster
  – Implies even more active circuits in parallel
  – Is it realistic?
• Even more advanced, host to host or even « per flow »
  – All optical
  – Is it really realistic?
Some Challenges
• Real bandwidth estimates, given the chaotic nature of the requirements
• End-to-end performance, given the whole chain involved
  – (disk-bus-memory-bus-network-bus-memory-bus-disk)
• Provisioning over complex network infrastructures (GÉANT, NRENs, etc.)
• Cost model for the options (packet + SLAs, circuit switched, etc.)
• Consistent performance (dealing with firewalls)
• Merging leading-edge research with production networking
Tentative conclusions
There is a very clear trend towards community-managed dark fiber networks.
As a consequence, National Research & Education Networks are evolving into telecom operators. Is this right?
• In the short term, almost certainly YES
• In the longer term, probably NO
• In many countries there is NO other way to have affordable access to multi-Gbit/s networks, therefore this is clearly the right move
The Grid & its associated Wide Area Networking challenges
• « On-demand Lambda Grids » are, in my opinion, extremely doubtful!
Ethernet over SONET & new standards will revolutionize the Internet
• WAN-PHY (IEEE) has, in my opinion, NO future!
• However, GFP, VCAT/LCAS, G.709 and OTN are very likely to have a very bright future.
Single TCP stream performance
under periodic losses
[Chart: effect of packet loss. Bandwidth utilization (%) vs. packet loss frequency (%), from 0.000001% to 10%, for a LAN (RTT = 0.04 ms) and a WAN (RTT = 120 ms), with 1 Gbit/s of available bandwidth. At a loss rate of 0.01%, LAN bandwidth utilization is 99% while WAN utilization is only 1.2%.]
TCP throughput is much more sensitive to packet loss in WANs than in LANs.
TCP’s congestion control algorithm (AIMD) is not well suited to gigabit networks.
The effect of packet loss can be disastrous.
TCP is inefficient in high bandwidth*delay networks.
The future performance outlook for computational grids looks bad if we continue to rely solely on the widely deployed TCP Reno.
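The shape of the curve above can be approximated with the well-known formula of Mathis et al. (throughput ≈ (MSS/RTT) * 1.22/√p). This sketch is only an approximation of the slide's measurements, assuming a 1460-byte MSS:

```python
from math import sqrt

LINK = 1e9                                        # 1 Gbit/s available, as on the slide
MSS = 1460 * 8                                    # segment size in bits (assumed)

def utilization(rtt_s: float, loss: float) -> float:
    # Mathis approximation, capped at the available link rate.
    tput = (MSS / rtt_s) * 1.22 / sqrt(loss)
    return 100 * min(tput, LINK) / LINK

print(utilization(0.04e-3, 1e-4))                 # LAN, 0.01% loss: ~100%
print(utilization(0.120, 1e-4))                   # WAN, 0.01% loss: ~1.2%
```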
Responsiveness
Time to recover from a single packet loss:

  r = C * RTT² / (2 * MSS)

where C is the capacity of the link, RTT the round-trip time and MSS the segment size.

Path                  Bandwidth C   RTT (ms)   MTU (bytes)   Time to recover
LAN                   10 Gb/s       1          1500          430 ms
Geneva-Chicago        10 Gb/s       120        1500          1 hr 32 min
Geneva-Los Angeles    1 Gb/s        180        1500          23 min
Geneva-Los Angeles    10 Gb/s       180        1500          3 hr 51 min
Geneva-Los Angeles    10 Gb/s       180        9000          38 min
Geneva-Los Angeles    10 Gb/s       180        64k (TSO)     5 min
Geneva-Tokyo          1 Gb/s        300        1500          1 hr 04 min
A large MTU accelerates the growth of the congestion window: the time to recover from a packet loss decreases with a larger MTU.
A larger MTU also reduces the overhead per frame (saves CPU cycles, reduces the number of packets).
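The table can be recomputed directly from r = C * RTT² / (2 * MSS); a minimal sketch, using the MTU as the MSS (which is why the results land close to, but not exactly on, the table's figures):

```python
def recovery_s(c_bps: float, rtt_s: float, mtu_bytes: float) -> float:
    # r = C * RTT^2 / (2 * MSS), with the MSS expressed in bits.
    return c_bps * rtt_s ** 2 / (2 * mtu_bytes * 8)

paths = [("LAN",                    10e9, 0.001, 1500),
         ("Geneva-Chicago",         10e9, 0.120, 1500),
         ("Geneva-Los Angeles",     10e9, 0.180, 9000),
         ("Geneva-Tokyo",            1e9, 0.300, 1500)]
for name, c, rtt, mtu in paths:
    r = recovery_s(c, rtt, mtu)
    print(f"{name:22s} {r:8.1f} s  (~{r / 60:.1f} min)")   # ~0.4 s, ~100 min, ~37.5 min, ~62.5 min
```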
Single TCP stream between
Caltech and CERN
[Plot annotations: available PCI-X bandwidth, CPU load = 100%, a single packet loss, a burst of packet losses.]
Bandwidth = 8.5 Gbps; RTT = 250 ms (16,000 km); 9000-byte MTU; 15 min to increase throughput from 3 to 6 Gbps.
Sending station: Tyan S2882 motherboard, 2x Opteron 2.4 GHz, 2 GB DDR.
Receiving station: CERN OpenLab HP rx4640, 4x 1.5 GHz Itanium-2, zx1 chipset, 8 GB memory.
Network adapter: S2io 10 GbE.
High Throughput Disk to Disk Transfers: From 0.1 to 1 GByte/sec
Server hardware (rather than network) bottlenecks:
  Write/read and transmit tasks share the same limited resources: CPU, PCI-X bus, memory, I/O chipset
  PCI-X bus bandwidth: 8.5 Gbps [133 MHz x 64 bit]
Link aggregation (802.3ad): logical interface with two physical interfaces on two independent PCI-X buses.
  LAN test: 11.1 Gbps (memory to memory)
Performance in this range (from 100 MByte/sec up to 1 GByte/sec) is required to build a responsive Grid-based Processing and Analysis System for LHC.
Transferring a TB from Caltech to CERN in 64-bit MS Windows
Latest disk-to-disk result over a 10 Gbps WAN: 4.3 Gbits/sec (536 MB/sec), 8 TCP streams from CERN to Caltech, 1 TB file
3 Supermicro Marvell SATA disk controllers + 24 7200 rpm SATA disks
Local disk I/O: 9.6 Gbits/sec (1.2 GBytes/sec read/write, with <20% CPU utilization)
S2io SR 10GE NIC
10 GE NIC: 7.5 Gbits/sec (memory to memory, with 52% CPU utilization)
2x 10 GE NIC (802.3ad link aggregation): 11.1 Gbits/sec (memory to memory)
The memory-to-memory WAN data flow and the local memory-to-disk read/write flow are not matched when the two operations are combined
Quad Opteron AMD848 2.2 GHz processors with 3 AMD-8131 chipsets: 4 64-bit/133 MHz PCI-X slots
Interrupt Affinity Filter: allows a user to change the CPU affinity of the interrupts in a system
Overcome packet loss with re-connect logic
Proposed Internet2 Terabyte File Transfer Benchmark
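For scale, a short sketch turning the throughputs quoted above into end-to-end times for the proposed one-terabyte benchmark (1 TB taken as 10^12 bytes; the throughput figures are the ones on this slide):

```python
TB = 1e12                                         # bytes in one terabyte

rates = [("disk to disk over the WAN (536 MB/s)", 536e6),
         ("memory to memory, one 10 GE NIC (7.5 Gb/s)", 7.5e9 / 8),
         ("local disk I/O (1.2 GB/s)", 1.2e9)]
for label, bytes_per_s in rates:
    minutes = TB / bytes_per_s / 60
    print(f"{label:45s} ~{minutes:.0f} min")       # ~31, ~18, ~14 minutes
```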