Transcript NEC 2005

The ongoing evolution from packet-based networks to hybrid networks in Research & Education Networks
16 September 2005
Olivier Martin, CERN
NEC’2005 Conference, Varna (Bulgaria)
Presentation Outline
• The demise of conventional packet-based networks in the R&E community
• The advent of community-managed dark fiber networks
• The Grid & its associated Wide Area Networking challenges
• « On-demand Lambda Grids »
• Ethernet over SONET & new standards
  – WAN-PHY, GFP, VCAT/LCAS, G.709, OTN
[Chart: optical DWDM capacity vs. Internet backbone I/O rates, 1985–2005. System capacity (Mbit/s, log scale) grows from 135/565 Mbit/s and 1.7 Gbit/s single-channel systems to DWDM systems carrying 2, 4, 8, 16, 32, 160 and 1024 wavelengths of 10 Gbit/s each, while backbone interface rates progress from T1/T3, Ethernet and Fast Ethernet through OC-3c, OC-12c, OC-48c, GigE and OC-192c/10-GE towards OC-768c/40-GE; I/O rates now equal the optical wavelength capacity.]
Internet Backbone Speeds
[Chart: Internet backbone speed in Mbps (log scale), 1986–2000, growing from T1 lines through T3 lines, OC3c and OC12c ATM VCs to multi-Gbps IP backbones.]
High Speed IP Network Transport
Trends
Multiplexing, protection and management at every layer
[Diagram: protocol stacks evolving from B-ISDN (IP over ATM over SONET/SDH over optical) to IP over SONET/SDH over optical, and finally to IP with its own signalling directly over the optical layer, with higher speed and lower cost, complexity and overhead at each step.]
Network Exponentials

• Network vs. computer performance
  – Computer speed doubles every 18 months
  – Network speed doubles every 9 months
  – Difference = one order of magnitude per 5 years (checked in the sketch below)
• 1986 to 2000
  – Computers: x 500
  – Networks: x 340,000
• 2001 to 2010
  – Computers: x 60
  – Networks: x 4,000
Moore’s Law vs. storage improvements vs. optical improvements. Graph from Scientific American (Jan 2001) by Cleo Vilett, source Vinod Khosla, Kleiner, Caufield and Perkins.
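A minimal sketch (Python, added here; not part of the original slide) checking the doubling-time claim above:

    # Growth factors implied by the doubling times quoted above:
    # 18 months for computers, 9 months for networks.
    def growth(doubling_months: float, years: float) -> float:
        """Multiplicative growth factor over `years` for a given doubling time."""
        return 2 ** (years * 12 / doubling_months)

    cpu_5y = growth(18, 5)   # ~10x
    net_5y = growth(9, 5)    # ~100x
    print(f"computers over 5 years: x{cpu_5y:.0f}")
    print(f"networks  over 5 years: x{net_5y:.0f}")
    print(f"gap accumulated per 5 years: ~x{net_5y / cpu_5y:.0f}")  # ~one order of magnitude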
Know the user
(3 of 12)
[Chart: number of users F(t) vs. bandwidth requirements, ranging from ADSL-class to GigE-LAN-class connectivity, divided into three user classes A, B and C.]
A -> Lightweight users: browsing, mailing, home use
B -> Business applications: multicast, streaming, VPNs, mostly LAN
C -> Special scientific applications: computing, data grids, virtual presence
What the user needs
(4 of 12)
[Chart: total bandwidth consumed vs. per-user bandwidth requirements (ADSL to GigE LAN) for the same three classes A, B and C.]
A -> Need full Internet routing, one to many
B -> Need VPN services and/or full Internet routing, several to several
C -> Need very fat pipes, limited to a few Virtual Organizations, few to few
So what are the facts
(5 of 12)
• Costs of fat pipes (fibers) are one-third of the cost of the equipment to light them up
  – This is what lambda salesmen told Cees de Laat (University of Amsterdam & SURFnet)
• Costs of optical equipment are ~10% of switching equipment, which is ~10% of full routing equipment, for the same throughput
  – A 100-byte packet @ 10 Gb/s leaves 80 ns to look it up in a 100-MByte routing table (light travels roughly from me to you on the back row in that time; see the quick check after this list)
• Big sciences need fat pipes
• Bottom line: create a hybrid architecture which serves all classes of users in a cost-effective way
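A minimal sketch (Python, added here as an illustration; the 16 m figure is my own back-of-the-envelope estimate) of where the 80 ns budget comes from:

    # Time budget per packet for a routing lookup at line rate.
    packet_bits = 100 * 8        # 100-byte packet
    line_rate = 10e9             # 10 Gb/s
    budget_s = packet_bits / line_rate
    print(f"time per packet: {budget_s * 1e9:.0f} ns")          # -> 80 ns

    # In 80 ns, light in fiber (~2e8 m/s) covers only about 16 m,
    # i.e. roughly the distance to the back row of the lecture hall.
    print(f"light-in-fiber distance: {budget_s * 2e8:.0f} m")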
Utilization trends
[Chart: utilization trends, June 1998 – January 2005, in Gbps (0–30): lightpath traffic, IP peak traffic and IP average traffic compared with the network capacity limit.]
Today’s hierarchical IP network
[Diagram: universities and regional networks connect to their NREN (A, B, C, D), which in turn connects to a national or pan-national IP network peering with other national networks.]
Tomorrow’s peer-to-peer IP network
[Diagram: a national DWDM network from which child lightpaths connect universities, servers, regional networks and NRENs (A, B, C, D) directly to each other and to the rest of the world.]
Creation of application VPNs
[Diagram: university departments keep their normal connection to the commodity Internet via the campus and research network, while direct connections that bypass the campus firewall attach them to discipline-specific networks such as a High Energy Physics network (including CERN), a bio-informatics network and an eVLBI network.]
Production vs Research Campus Networks
> Increasingly, campuses are deploying parallel networks for high-end users
> This reduces costs by providing high-end network capability only to those who need it
> The limitations of the campus firewall and border router are eliminated
> Many issues remain with regard to security, back-door routing, etc.
> Campus networks may follow the same evolution as campus computing
> Discipline-specific networks are being extended into the campus
UCLP intended for projects
like National LambdaRail
CAVEwave acquires a separate wavelength between Seattle and Chicago and wants to manage it as part of its own network, including add/drop, routing, partitioning, etc.
[Diagram: the NLR condominium lambda network with the original CAVEwave wavelength.]
GÉANT2 PoP Design
[Diagram: a GÉANT2 PoP built around a Juniper M-160 router with Nx10 Gbps links to other GÉANT2 PoPs and 2x10 Gbps to the local NREN, plus DWDM equipment terminating dark fibre to other GÉANT2 PoPs.]
UltraLight Optical Exchange Point
 Photonic switch (Calient or Glimmerglass photonic cross-connect)
 L1, L2 and L3 services
 Interfaces
   1GE and 10GE
   10GE WAN-PHY (SONET friendly)
 Hybrid packet- and circuit-switched PoP
 Interface between packet- & circuit-switched networks
LHC Data Grid Hierarchy
[Diagram: the LHC data grid tiers. The online system at the experiment produces ~PByte/sec, of which ~100-400 MBytes/sec flow to the Tier 0+1 centre at CERN (~700k SI95, ~1 PB disk, tape robot); the CERN/outside resource ratio is ~1:2 and Tier0:(ΣTier1):(ΣTier2) is ~1:1:1. Tier 1 centres (e.g. IN2P3, INFN, RAL, FNAL with 200k SI95 and 600 TB) connect at 10 Gbps; Tier 2 centres connect to Tier 1 at 2.5/10 Gbps; Tier 3 institute servers (~0.25 TIPS, physics data caches) connect at ~2.5 Gbps; Tier 4 workstations connect at 0.1-1 Gbps. Physicists work on analysis “channels”; each institute has ~10 physicists working on one or more channels.]
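For a feel of what the ~100-400 MBytes/sec sustained rate above amounts to, a quick calculation (added here, not on the slide):

    # Daily data volume at the sustained rates quoted for the Tier 0+1 link.
    for rate_mb_s in (100, 400):
        per_day_tb = rate_mb_s * 1e6 * 86400 / 1e12
        print(f"{rate_mb_s} MB/s sustained -> ~{per_day_tb:.0f} TB/day")
    # 100 MB/s -> ~9 TB/day; 400 MB/s -> ~35 TB/day.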
Deploying the LHC Grid
[Diagram: CERN Tier 0 and the CERN Tier 1 at the centre, national Tier 1 centres (UK, USA, France, Italy, Germany, Japan, Taipei?), Tier 2 centres at universities and labs, Tier 3 physics-department resources and individual desktops, with overlapping grids for a regional group and for a physics study group.]
What you get
[Diagram: the same hierarchy seen by a single physicist: CERN Tier 0/Tier 1, national Tier 1 centres (UK, USA, France, Italy, Germany, Japan), Tier 2 centres, universities, labs and the local physics department, all accessible from the desktop.]
Main Networking Challenges
• Fulfill the as-yet unproven assertion that the network can be « nearly » transparent to the Grid
• Deploy suitable Wide Area Network infrastructure (50-100 Gb/s)
• Deploy suitable Local Area Network infrastructure (matching or exceeding that of the WAN)
• Seamless interconnection of LAN & WAN infrastructures (firewall?)
• End-to-end issues (transport protocols, PCs (Itanium, Xeon), 10GigE NICs (Intel, S2io)); where are we today:
   memory to memory: 7.5 Gb/s (PCI-X bus limit)
   memory to disk: 1.2 GByte/s (Windows 2003 Server/NewiSys)
   disk to disk: 400 MByte/s (Linux), 600 MByte/s (Windows)
Main TCP issues
• Does not scale to some environments
   High speed, high latency
   Noisy
• Unfair behaviour with respect to:
   Round Trip Time (RTT)
   Frame size (MSS)
   Access bandwidth
• Widespread use of multiple streams to compensate for inherent TCP/IP limitations (e.g. GridFTP, bbFTP):
   A bandage rather than a cure
• New TCP/IP proposals aim to restore performance in single-stream environments
   Not clear if/when they will have a real impact
   In the meantime there is an absolute requirement for backbones with:
    – Zero packet loss,
    – And no packet re-ordering
   Which reinforces the case for “lambda Grids”
TCP dynamics
(10 Gbps, 100 ms RTT, 1500-byte packets)
Window size (W) = Bandwidth * Round Trip Time
– Wbits = 10 Gbps * 100 ms = 1 Gb
– Wpackets = 1 Gb / (8*1500) = 83,333 packets
Standard Additive Increase Multiplicative Decrease (AIMD) mechanism:
– W = W/2 (halving the congestion window on a loss event)
– W = W + 1 (increasing the congestion window by one packet every RTT)
Time to recover from W/2 to W (congestion avoidance) at 1 packet per RTT:
– RTT * Wpackets/2 = 1.157 hours
– In practice 1 packet per 2 RTTs because of delayed acks, i.e. 2.31 hours
Packets per second at full window:
– Wpackets / RTT = 833,333 packets/s
(These numbers are reproduced in the sketch below.)
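A minimal Python sketch (added here) reproducing the figures above:

    # Bandwidth-delay product and AIMD recovery time for the example above:
    # 10 Gb/s, 100 ms RTT, 1500-byte packets.
    bandwidth = 10e9          # bits/s
    rtt = 0.100               # seconds
    packet_bits = 1500 * 8

    w_bits = bandwidth * rtt            # 1 Gb window
    w_packets = w_bits / packet_bits    # ~83,333 packets

    # Congestion avoidance adds 1 packet per RTT, so regaining the W/2
    # packets lost after a halving takes (W/2) round trips.
    recovery_s = (w_packets / 2) * rtt
    print(f"window: {w_packets:,.0f} packets")
    print(f"recovery at 1 pkt/RTT:   {recovery_s / 3600:.3f} h")        # ~1.157 h
    print(f"recovery at 1 pkt/2 RTT: {2 * recovery_s / 3600:.2f} h")    # ~2.31 h
    print(f"packet rate at full window: {w_packets / rtt:,.0f} pkt/s")  # ~833,333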
Single TCP stream performance
under periodic losses
[Chart: bandwidth utilization (%) of a single TCP stream vs. packet-loss frequency (10^-6 % to 10 %), for a LAN (RTT = 0.04 ms) and a WAN (RTT = 120 ms), with 1 Gbps of available bandwidth. At a loss rate of 0.01%, LAN utilization is ~99% while WAN utilization drops to ~1.2%.]
 TCP throughput is much more sensitive to packet loss in WANs than in LANs (see the approximation sketched below)
 TCP’s congestion control algorithm (AIMD) is not well suited to gigabit networks; the effect of packet loss can be disastrous
 TCP is inefficient in high bandwidth*delay networks
 The future performance outlook for computational grids looks bad if we continue to rely solely on the widely deployed TCP Reno
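The loss sensitivity above can be approximated with the Mathis et al. (1997) throughput formula, BW ≈ (MSS/RTT)·sqrt(3/2)/sqrt(p). The sketch below is added here (not part of the original slide) and simply saturates at the 1 Gbps link capacity:

    import math

    def mathis_throughput_bps(mss_bytes: float, rtt_s: float, loss_rate: float) -> float:
        """Approximate steady-state TCP Reno throughput (Mathis et al., 1997)."""
        return (mss_bytes * 8 / rtt_s) * math.sqrt(1.5) / math.sqrt(loss_rate)

    loss = 1e-4  # 0.01 % packet loss
    for name, rtt in [("LAN (0.04 ms)", 0.04e-3), ("WAN (120 ms)", 120e-3)]:
        bw = min(mathis_throughput_bps(1460, rtt, loss), 1e9)  # cap at 1 Gbps link
        print(f"{name}: ~{bw / 1e9:.1%} of 1 Gbps")
    # LAN: ~100% (slide: 99%), WAN: ~1.2%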
Responsiveness
Time to recover from a single packet loss:

    r = C * RTT^2 / (2 * MSS)

where C is the capacity of the link.

Path                  Bandwidth   RTT (ms)   MTU (Byte)   Time to recover
LAN                   10 Gb/s       1        1500         430 ms
Geneva–Chicago        10 Gb/s     120        1500         1 hr 32 min
Geneva–Los Angeles     1 Gb/s     180        1500         23 min
Geneva–Los Angeles    10 Gb/s     180        1500         3 hr 51 min
Geneva–Los Angeles    10 Gb/s     180        9000         38 min
Geneva–Los Angeles    10 Gb/s     180        64k (TSO)    5 min
Geneva–Tokyo           1 Gb/s     300        1500         1 hr 04 min
 A large MTU accelerates the growth of the congestion window
 The time to recover from a packet loss decreases with a large MTU
 A larger MTU reduces the per-frame overhead (saves CPU cycles, reduces the number of packets; see the sketch below)
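A minimal sketch (added here) applying the formula above to the paths in the table, assuming MSS = MTU - 40 bytes of headers; most rows come out close to the table values:

    # Recovery time after a single loss: r = C * RTT^2 / (2 * MSS), MSS in bits.
    def recovery_time_s(capacity_bps: float, rtt_s: float, mtu_bytes: float) -> float:
        mss_bits = (mtu_bytes - 40) * 8   # assume 40 bytes of TCP/IP headers
        return capacity_bps * rtt_s ** 2 / (2 * mss_bits)

    paths = [
        ("LAN",                10e9, 0.001, 1500),
        ("Geneva-Chicago",     10e9, 0.120, 1500),
        ("Geneva-Los Angeles",  1e9, 0.180, 1500),
        ("Geneva-Los Angeles", 10e9, 0.180, 1500),
        ("Geneva-Los Angeles", 10e9, 0.180, 9000),
        ("Geneva-Tokyo",        1e9, 0.300, 1500),
    ]
    for name, c, rtt, mtu in paths:
        r = recovery_time_s(c, rtt, mtu)
        print(f"{name:20s} {c/1e9:4.0f} Gb/s  RTT {rtt*1e3:4.0f} ms  "
              f"MTU {mtu:5.0f}: {r/60:7.1f} min")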
Internet2 land speed record history (IPv4 & IPv6), period 2000-2004
[Chart: evolution of the Internet2 Land Speed Record, single- and multiple-stream, IPv4 and IPv6, from Mar-00 to May-04; records grew from a few hundred Mbit/s in 2000-2002 to several Gbit/s in 2003-2004, reaching 7.09 Gbit/s.]
Layer1/2/3 networking (1)
• Conventional layer 3 technology is no longer fashionable because of:
  – the high associated costs, e.g. 200-300 KUSD for a 10G router interface
  – the implied use of shared backbones
• The use of layer 1 or layer 2 technology is very attractive because it helps to solve a number of problems, e.g.
  – the 1500-byte Ethernet frame-size limit (layer 1)
  – protocol transparency (layers 1 & 2)
  – minimum functionality, hence, in theory, much lower costs (layers 1 & 2)
Layer1/2/3 networking (2)
« On-demand Lambda Grids » are becoming very popular:
• Pros:
   Circuit-oriented model like the telephone network, hence no need for complex transport protocols
   Lower equipment costs (i.e. « in theory » a factor 2 or 3 per layer)
   The concept of a dedicated end-to-end lightpath is very elegant
• Cons:
   « End to end » is still very loosely defined: site to site, cluster to cluster, or really host to host?
   Higher circuit costs, scalability, additional middleware to deal with circuit set-up/tear-down, etc.
   Extending dynamic VLAN functionality is a potential nightmare!
« Lambda Grids »
What does it mean?
• Clearly different things to different people, hence the apparently easy consensus!
• Conservatively, on-demand « site to site » connectivity
   Where is the innovation?
   What does it solve in terms of transport protocols?
   Where are the savings?
   Fewer interfaces needed (customer) but more standby/idle circuits needed (provider)
   Economics from the service provider vs. the customer perspective?
    – Traditionally, switched services have been very expensive
      » Usage vs. flat charge
      » Break-even between switched and leased circuits at a few hours/day
      » Why would this change?
   In case there are no savings, why bother?
• More advanced: cluster to cluster
   Implies even more active circuits in parallel
   Is it realistic?
• Even more advanced: host to host, or even « per flow »
   All optical
   Is it really realistic?
Some Challenges
• Real bandwidth estimates, given the chaotic nature of the requirements
• End-to-end performance, given the whole chain involved
  – (disk-bus-memory-bus-network-bus-memory-bus-disk)
• Provisioning over complex network infrastructures (GÉANT, NRENs, etc.)
• Cost model for the options (packet + SLAs, circuit switched, etc.)
• Consistent performance (dealing with firewalls)
• Merging leading-edge research with production networking
Tentative conclusions
 There is a very clear trend towards community-managed dark fiber networks
 As a consequence, National Research & Education Networks are evolving into telecom operators; is this right?
   In the short term, almost certainly YES
   In the longer term, probably NO
• In many countries there is NO other way to have affordable access to multi-Gbit/s networks, therefore this is clearly the right move
• The Grid & its associated Wide Area Networking challenges:
   « On-demand Lambda Grids » are, in my opinion, extremely doubtful!
• Ethernet over SONET & new standards will revolutionize the Internet:
   WAN-PHY (IEEE) has, in my opinion, NO future!
   However, GFP, VCAT/LCAS, G.709 and OTN are very likely to have a very bright future.
Single TCP stream between
Caltech and CERN
[Chart: single-stream TCP throughput vs. time, showing the available PCI-X bandwidth ceiling, periods at 100% CPU load, a single packet loss and a burst of packet losses.]
 Bandwidth = 8.5 Gbps, RTT = 250 ms (16,000 km), 9000-byte MTU
 15 min to increase throughput from 3 to 6 Gbps
 Sending station: Tyan S2882 motherboard, 2x Opteron 2.4 GHz, 2 GB DDR
 Receiving station: CERN OpenLab HP rx4640, 4x 1.5 GHz Itanium-2, zx1 chipset, 8 GB memory
 Network adapter: S2io 10 GbE
High Throughput Disk to Disk
Transfers: From 0.1 to 1GByte/sec
 Server hardware (rather than network) bottlenecks:
   Write/read and transmit tasks share the same limited resources: CPU, PCI-X bus, memory, I/O chipset
   PCI-X bus bandwidth: 8.5 Gbps [133 MHz x 64 bit] (checked in the sketch below)
 Link aggregation (802.3ad): one logical interface over two physical interfaces on two independent PCI-X buses
   LAN test: 11.1 Gbps (memory to memory)
Performance in this range (from 100 MByte/sec up to 1 GByte/sec) is required to build a responsive Grid-based processing and analysis system for the LHC
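A quick check (added here) of the bus-bandwidth figure quoted above:

    # PCI-X theoretical bandwidth: 133 MHz clock, 64-bit wide bus.
    bus_clock_hz = 133e6
    bus_width_bits = 64
    pci_x_bps = bus_clock_hz * bus_width_bits
    print(f"PCI-X theoretical bandwidth: {pci_x_bps / 1e9:.1f} Gbps")  # ~8.5 Gbps

    # Two NICs on two independent PCI-X buses (802.3ad link aggregation)
    # are not limited by a single bus, hence the 11.1 Gbps LAN test result.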
Transferring a TB from Caltech to
CERN in 64-bit MS Windows
 Latest disk-to-disk over a 10 Gbps WAN: 4.3 Gbits/sec (536 MB/sec), 8 TCP streams from CERN to Caltech, 1 TB file (see the time estimate below)
 3 Supermicro Marvell SATA disk controllers + 24 7200 rpm SATA disks
   Local disk I/O: 9.6 Gbits/sec (1.2 GBytes/sec read/write, with <20% CPU utilization)
 S2io SR 10 GE NIC
   10 GE NIC: 7.5 Gbits/sec (memory to memory, with 52% CPU utilization)
   2x 10 GE NIC (802.3ad link aggregation): 11.1 Gbits/sec (memory to memory)
 The memory-to-memory WAN data flow and the local memory-to-disk read/write flow are not matched when combining the two operations
 Quad Opteron AMD848 2.2 GHz processors with 3 AMD-8131 chipsets: 4 64-bit/133 MHz PCI-X slots
 Interrupt Affinity Filter: allows a user to change the CPU affinity of the interrupts in a system
 Packet loss is overcome with re-connect logic
 Proposed Internet2 Terabyte File Transfer Benchmark
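A back-of-the-envelope estimate (added here, not on the original slide) of what these rates mean for a 1 TB transfer:

    # Transfer time for 1 TB at the quoted disk-to-disk rate and at the target rate.
    file_bytes = 1e12             # 1 TB file
    for label, rate_bytes_s in [("536 MB/s (today)", 536e6), ("1 GB/s (target)", 1e9)]:
        minutes = file_bytes / rate_bytes_s / 60
        print(f"1 TB at {label}: ~{minutes:.0f} minutes")
    # ~31 minutes today, ~17 minutes at the 1 GByte/sec target.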
UltraLight: Developing Advanced Network Services for Data-Intensive HEP Applications
 UltraLight: a next-generation hybrid packet- and circuit-switched network infrastructure
   Packet switched: cost-effective solution; requires ultrascale protocols to share 10G capacity efficiently and fairly
   Circuit switched: scheduled or sudden “overflow” demands handled by provisioning additional wavelengths; use path diversity, e.g. across the US, Atlantic, Canada, …
 Extend and augment existing grid computing infrastructures (currently focused on CPU/storage) to include the network as an integral component
 Using MonALISA to monitor and manage global systems
UltraLight MPLS Network
 Compute a path from one given node to another such that the path does not violate any constraints (bandwidth/administrative requirements)
 Ability to set the path the traffic will take through the network (with simple configuration, management and provisioning mechanisms)
   Take advantage of the multiplicity of waves/L2 channels across the US (NLR, HOPI, Ultranet and Abilene/ESnet MPLS services)
Summary
 For many years the Wide Area Network has been the bottleneck; this is no longer the case in many countries, which makes deployment of a data-intensive Grid infrastructure possible!
 Recent I2LSR records show, for the first time ever, that the network can be truly transparent and that throughputs are limited by the end hosts
 The challenge has shifted from getting adequate bandwidth to deploying adequate infrastructure to make effective use of it!
 Some transport protocol issues still need to be resolved; however, there are many encouraging signs that practical solutions may now be in sight
 1 GByte/sec disk-to-disk challenge. Today: 1 TB at 536 MB/sec from CERN to Caltech
   Still in the early stages; expect substantial improvements
 Next-generation network and Grid system: UltraLight
   Deliver the critical missing component for future eScience: the integrated, managed network
   Extend and augment existing grid computing infrastructures
10G DataTAG testbed extension to Telecom World 2003 and Abilene/CENIC
On September 15, 2003, the DataTAG project was the first transatlantic testbed offering direct 10GigE access, using Juniper’s VPN layer2/10GigE emulation.