The California Institute for Telecommunications and Information Technology
The OptIPuter - Implications of Network
Bandwidth on System Design
CHEP’03
University of California, San Diego
March 25, 2003
Dr. Philip M. Papadopoulos
Program Director, Grid and Cluster Computing, SDSC
Co-PI OptIPuter
Agenda
• Some key tech trends in applications, computing and networking
• Description and research goals of OptIPuter
• Initial networking plans
• Key questions on system and network design
• Conclusions
Why Optical Networks Are Emerging
as the 21st Century Driver for the Grid
George Stix,
Scientific American,
January 2001
Parallel lambdas provide the raw capacity to drive the Grid
and change the relationship between computer and network
OptIPuter Inspiration--Node of
a 2009 PetaFLOPS Supercomputer
[Node diagram: VLIW/RISC cores (40 GFLOPS at 10 GHz each) feed 8 MB coherent second-level caches over 24-byte-wide, 240 GB/s paths; a 640 GB/s crossbar connects them to 16 GB of highly interleaved (64/256) DRAM and to a 5 Terabit/s multi-lambda optical network]
Updated From Steve Wallach, Supercomputing 2000 Keynote
Global Architecture of a 2009 COTS
PetaFLOPS System
[System diagram: 64 multi-die, multi-processor boxes (128 die/box, 4 CPUs/die) interconnected through an all-optical switch with I/O to the LAN/WAN; 10 meters of path equals 50 nanoseconds of delay; the systems become Grid enabled]
Source: Steve Wallach, Supercomputing 2000 Keynote
The Biomedical Informatics Research Network:
a Multi-Scale Brain Imaging Federated Repository
BIRN Test-beds:
Multiscale Mouse Models of Disease, Human Brain Morphometrics, and
FIRST BIRN (a 10-site project for fMRI studies of schizophrenia)
NIH Plans to Expand
to Other Organs
and Many Laboratories
GEON’s Data Grid Team
Has Strong Overlap with BIRN and OptIPuter
• Learning From The BIRN Project
– The GEON Grid:
– Heterogeneous Networks, Compute Nodes, Storage
– Deploy Grid And Cluster Software Across GEON
– Peer-to-Peer Information Fabric for Sharing:
– Data, Tools, And Compute Resources
NSF ITR Grant
$11.25M
2002-2007
Two Science “Testbeds”
Broad Range Of Geoscience Data Sets
Source: Chaitan Baru, SDSC, Cal-(IT)2
NSF’s EarthScope
Rollout Over 14 Years Starting
With Existing Broadband Stations
Data Intensive Scientific Applications
Require Experimental Optical Networks
• Large Data Challenges in Neuro and Earth Sciences
– Each Data Object is 3D and Gigabytes in Size
– Data are Generated and Stored in Distributed Archives
– Research is Carried Out on a Federated Repository
• Requirements
– Computing: PC Clusters
– Communications: Dedicated Lambdas Over Fiber
– Data: Large Peer-to-Peer Lambda-Attached Storage
– Visualization: Collaborative Volume Algorithms
• Response
– OptIPuter Research Project
OptIPuter Software Research
• Near-term Goals:
– Build Software To Support Applications With Traditional Models
– High Speed IP Protocol Variations (RBUDP, SABUL, …)
– Switch Control Software For DWDM Management And Dynamic Setup
– Distributed Configuration Management For OptIPuter Systems
• Long-Term Goals:
– System Model Which Supports:
– Grid
– Single System
– Multi-System Views
– Architectures Which Can:
– Harness High Speed DWDM
– Exploit Flexible Dispersion Of Data And Computation
– New Communication Abstractions & Data Services
– Make Lambda-Based Communication Easily Usable
– Use DWDM to Enable Uniform Performance View Of Storage
Source: Andrew Chien, UCSD
Photonic Data Services & OptIPuter
6. Data Intensive Applications (UCI)
5a. Storage (UCSD)
5b. Data Services –
SOAP, DWTP (UIC/LAC)
4. Transport – TCP, UDP, SABUL,… (USC,UIC)
3. IP
2. Photonic Path Serv. – ODIN, THOR,... (NW)
1. Physical
Source: Robert Grossman, UIC/LAC
OptIPuter is Exploring Quanta
as a High Performance Middleware
• Quanta Is A High Performance Networking Toolkit / API
• Quanta Uses Reliable Blast UDP:
– Assumes An Over-Provisioned Or Dedicated Network
– Excellent For Photonic Networks
– Don’t Try This On Commodity Internet!
– It Is Fast!
– It Is Very Predictable
– It Comes With A Prediction Equation To Estimate Performance In Advance
– It Is Most Suited For Transferring Very Large Payloads
• RBUDP, SABUL, and Tsunami Are All Similar Protocols
That Use UDP For Bulk Data Transfer (a minimal sketch of the idea follows this slide)
Source: Jason Leigh, UIC
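The blast-then-repair pattern behind RBUDP (and, in broad strokes, SABUL and Tsunami) can be captured in a few lines. The following is a minimal illustrative sketch, not the Quanta API: the port numbers, chunk size, and the convention of reporting missing sequence numbers over a TCP side channel are assumptions made for the example.

```python
# Minimal sketch of a Reliable Blast UDP style sender (illustrative only; not the
# Quanta/RBUDP API). Assumptions: the receiver listens on UDP port 9000 for data
# and on TCP port 9001 for control, and replies after each blast with a
# newline-terminated, comma-separated list of missing sequence numbers.
import socket, struct

CHUNK = 1400               # payload bytes per datagram (fits a 1500-byte MTU)
HDR = struct.Struct("!I")  # 4-byte sequence number prefixed to each chunk

def rbudp_send(host, payload, udp_port=9000, tcp_port=9001):
    udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    tcp = socket.create_connection((host, tcp_port))
    chunks = [payload[i:i + CHUNK] for i in range(0, len(payload), CHUNK)]
    missing = list(range(len(chunks)))        # initially, everything is "missing"
    while missing:
        # 1. Blast: fire the outstanding chunks over UDP as fast as we can.
        for seq in missing:
            udp.sendto(HDR.pack(seq) + chunks[seq], (host, udp_port))
        # 2. Signal end-of-blast, then collect the receiver's loss report over TCP.
        tcp.sendall(b"DONE\n")
        report = b""
        while not report.endswith(b"\n"):
            report += tcp.recv(4096)
        report = report.strip()
        missing = [int(s) for s in report.split(b",")] if report else []
    tcp.close()
    udp.close()
```

The over-provisioned-network assumption is visible in step 1: nothing throttles the blast, so on the commodity Internet this would simply cause heavy loss, which is exactly the caveat above.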
XCP Is A New Congestion Control Scheme
Which is Good for Gigabit Flows
• Better Than TCP (the core control law is sketched after this slide)
– Almost Never Drops Packets
– Converges To Available Bandwidth Very Quickly, In ~1 Round-Trip Time
– Fair Over Large Variations In Flow Bandwidth and RTT
• Supports existing TCP semantics
– Replaces Only Congestion Control, Reliability Unchanged
– No Change To Application/Network API
• Status
– To Date: Simulations and SIGCOMM Paper (MIT).
– See Dina Katabi, Mark Handley, and Charles Rohrs, "Internet Congestion
Control for Future High Bandwidth-Delay Product Environments." ACM
SIGCOMM 2002, August 2002. http://ana.lcs.mit.edu/dina/XCP/
– Current:
– Developing Protocol, Implementation
– Extending Simulations (ISI)
Source: Aaron Falk, Joe Bannister, ISI USC
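The core of XCP is the router-side efficiency controller from the Katabi, Handley, and Rohrs paper: once per average RTT the router computes an aggregate feedback from the spare bandwidth and the persistent queue, then divides it among packets. The sketch below shows only that aggregate computation; the function names, the example numbers, and the omission of the per-packet apportioning are simplifications for illustration.

```python
# Sketch of XCP's aggregate feedback rule (simplified, for illustration).
# phi = alpha * d * S - beta * Q, where d is the average RTT, S the spare
# bandwidth (capacity minus input rate) and Q the persistent queue, with the
# stability constants alpha = 0.4 and beta = 0.226 from the SIGCOMM 2002 paper.

ALPHA, BETA = 0.4, 0.226

def aggregate_feedback(capacity_bps, input_rate_bps, queue_bytes, avg_rtt_s):
    """Bytes of rate change the router hands out over the next control interval:
    positive means flows may speed up, negative means they must slow down."""
    spare_bps = capacity_bps - input_rate_bps
    return ALPHA * avg_rtt_s * spare_bps / 8.0 - BETA * queue_bytes

if __name__ == "__main__":
    # A 10 Gb/s link carrying 9 Gb/s with a 500 kB standing queue and 60 ms RTT:
    phi = aggregate_feedback(10e9, 9e9, 500e3, 0.060)
    print(f"aggregate feedback: {phi / 1e6:.1f} MB per control interval")
```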
From SuperComputers to SuperNetworks: Changing the Grid Design Point
• The TeraGrid is Optimized for Computing
– 1024-Node IA-64 Linux Cluster
– Assume 1 GigE per Node = 1 Terabit/s of Aggregate I/O
– Grid Optical Connection 4x10Gig Lambdas = 40 Gigabit/s
– Optical Connections are Only 4% of the Bisection Bandwidth
• The OptIPuter is Optimized for Bandwidth
– 32-Node IA-64 Linux Cluster
– Assume 1 GigE per Processor = 32 Gigabit/s of Aggregate I/O
– Grid Optical Connection 4x10GigE = 40 Gigabit/s
– Optical Connections are Over 100% of the Bisection Bandwidth (see the calculation after this slide)
– Grow the Network Capacity to Stay Close to Full Bisection
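The 4% versus over-100% contrast is just the ratio of external optical capacity to aggregate node I/O. A quick check, using the node counts and link speeds from the bullets above:

```python
# Back-of-the-envelope bisection check using the figures on this slide.
def external_fraction(nodes, gige_per_node, external_gbps):
    aggregate_io = nodes * gige_per_node      # Gb/s of aggregate node I/O
    return external_gbps / aggregate_io

# TeraGrid-style design point: 1024 nodes, 1 GigE each, 4 x 10 GigE to the Grid.
print(f"TeraGrid:  {external_fraction(1024, 1, 40):.1%}")   # ~3.9%
# OptIPuter design point: 32 nodes, 1 GigE each, the same 4 x 10 GigE uplink.
print(f"OptIPuter: {external_fraction(32, 1, 40):.0%}")     # 125%
```

Growing the uplink as the cluster grows keeps this ratio near (or above) one, which is the OptIPuter design point.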
Convergence of Networking Fabrics
• Today's Computer Room
– Router For External Communications (WAN)
– Ethernet Switch For Internal Networking (LAN)
– Fibre Channel For Internal Networked Storage (SAN)
• Tomorrow's Grid Room
– A Unified Architecture Of LAN/WAN/SAN Switching
– More Cost Effective
– One Network Element vs. Many
– One Sphere of Scalability
– ALL Resources are GRID Enabled
– Layer 3 Switching and Addressing Throughout
Source: Steve Wallach, Chiaro Networks
Who is OptIPuter?
• Larry Smarr, UCSD CSE, PI
– Mark Ellisman, Co-PI, UCSD School of Medicine (Neuroscience Applications)
– Philip Papadopoulos, Co-PI, San Diego Supercomputer Center (Experimental Systems)
– Tom DeFanti, Co-PI, UIC (All-Optical Exchanges and Wide-Area Networking)
– Jason Leigh, Co-PI, UIC (High-Speed Graphics Systems)
• Other Key Institutions
– University of Southern California, ISI (Network Protocols, Grid Software)
– Joe Bannister, Carl Kesselman
– UCSD/SDSC/SIO (Data, Middleware, Computers in the Arts, Optics, Systems, Security)
– Chaitan Baru, Andrew Chien, Sheldon Brown, Sadik Esener, Shaya Fainman, John Orcutt, Graham Kent, Ron Graham, Greg Hidley, Sid Karin, Paul Siegel, Rozeanne Steckler
– SDSU (GIS Systems)
– Eric Frost
– UIC (Data Systems, Networks, Visualization)
– Bob Grossman, Tom Moher, Alan Verlo
– UCI (Data Systems, Real-Time Computing)
– Kane Kim, Padhraic Smyth
– Northwestern University (Performance Analysis, Networking)
– Joel Mambretti, Valerie Taylor
What is OptIPuter?
• It is a large NSF ITR project funded at $13.5M from 2002 – 2007
• Fundamentally, we ask the question:
– What happens to the structure of machines and programs when the network becomes essentially ‘infinite’?
• The project is coupled tightly with key applications to keep the IT research grounded and focused
• Individual researchers are investigating the software and structure from the physical (photonic) layer, to network protocols, middleware, and applications
• We are building (in phases) two high-capacity networks with associated modest-sized endpoints
– The experimental apparatus allows investigations at various levels
– “Breaking the network” is expected
– Start small (only 4 gigabits per clustered endpoint) and grow to 400 gigabits per clustered endpoint in 2007
– UCSD is building a packet-based (traditional) network (Mod-0)
– UIC is building an all-lambda network (Mod-1)
The OptIPuter 2003
Experimental Network
Wide Array of Vendors
Calient Lambda Switches Now Installed
at StarLight (UIC) and NetherLight
[Network diagram (data and control planes): a 128x128 MEMS optical switch and a 64x64 MEMS optical switch, with “groomers” and switch/routers at StarLight and at NetherLight, 16- and 8-processor clusters attached over 2, 8, and 16 GigE links, and a 10 Gbps OC-192 circuit between StarLight and NetherLight. GigE = Gigabit Ethernet (Gbps connection type)]
Source: Maxine Brown
The UCSD OptIPuter Deployment:
UCSD is building out a high-speed packet-switched network
[Campus map: Phase I (Fall 02) and Phase II (2003) fiber runs linking SDSC and the SDSC Annex, JSOE (Engineering), CRCA (Arts), SOM (Medicine), Chemistry, Physical Sciences/Keck, Preuss High School, Sixth College, an undergraduate college, and SIO (Earth Sciences) to the Chiaro router at the Node M collocation point, with a production router uplink to CENIC; map scale ½ mile]
Source: Phil Papadopoulos, SDSC; Greg Hidley, Cal-(IT)2
OptIPuter LambdaGrid
Enabled by Chiaro Networking Router
www.calit2.net/news/2002/11-18-chiaro.html
[Diagram: switches serving Medical Imaging and Microscopy; Chemistry, Engineering, and the Arts; the San Diego Supercomputer Center; and the Scripps Institution of Oceanography all connect to the Chiaro Enstara router, carrying Cluster – Disk, Disk – Disk, Viz – Disk, DB – Cluster, and Cluster – Cluster traffic]
Image Source: Phil Papadopoulos, SDSC
Nodes and Networks
• Clustered endpoints where each node has a gigabit interface on the OptIPuter network
– Linux RedHat 7.3 managed with the NPACI Rocks clustering toolkit
– Visualization clusters and immersive visualization theaters
– Specialized instruments such as light and electron microscopes
• Nodes are plugged into a “supercheap” Dell 5254 24-port GigE copper switch with 4 fiber uplinks (~$2K). Link aggregation is supported to give us 4 gigabits/site today (see the oversubscription sketch after this slide)
– Target 40 gigabits/site in 2004, off-campus @ 10 gigabits
– 400 gigabits/site in 2006, off-campus @ ?? gigabits
• The “center” of the UCSD network is Serial #1 of the Chiaro Enstara Router
– From our viewpoint it provides unlimited capacity (6 Terabits, if fully provisioned today)
– It can scale to more than 2000 gigabit endpoints, today
– Packets and IP – ability to route at wire speed across hundreds of 10 GigE interfaces or 1000s of standard GigE
– “Gold plated” in terms of expense and size (we got a really good deal from Chiaro)
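To make the “4 gigabits/site today” figure concrete, the arithmetic below works out the uplink oversubscription for an edge switch of this class; the assumption that all 24 copper ports are populated is for illustration, since site sizes vary.

```python
# Oversubscription of the site uplink: node-facing capacity vs. aggregated uplinks.
# Assumes a fully populated 24-port GigE switch with 4 x 1 GigE uplinks bonded
# together (today's configuration); the 2004 target of 40 Gb/s per site would
# push the ratio below 1:1 for a cluster of this size.
nodes, gige_per_node = 24, 1           # assumed full switch, 1 Gb/s per node
uplink_gbps = 4                        # 4 x 1 GigE, link-aggregated

node_capacity = nodes * gige_per_node
print(f"node-facing capacity: {node_capacity} Gb/s")
print(f"uplink capacity:      {uplink_gbps} Gb/s")
print(f"oversubscription:     {node_capacity / uplink_gbps:.0f}:1")   # 6:1
```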
Chiaro Enstara™ Software Summary
• Scalable Capacity
– 6 Tb/s Initial Capacity
– GigE and OC-192 Interfaces
– “Soft” Forwarding Plane With Network Processors For Maximum Flexibility
• Full Protocol Suite
– Unicast: BGP, OSPF, IS-IS
– Multicast: PIM, MBGP, MSDP
– MPLS: RSVP-TE, LDP, FRR
• Stateful Assured Routing (STAR™)
– Provides Service Continuity During Maintenance and Fault Management Actions
– Stateful Protocol Protection Extended to BGP, IS-IS, OSPF, Multicast, and MPLS
• Partitions
– An Abstraction Permitting Multiple Logical Classes Of Routers To Be Managed As If They Were Separate Physical Routers
– Each Partition Has Its Own CLI, SNMP, Security, and Routing Protocol Instances
Where does Chiaro sit in the landscape?
[Chart: switch fabric technologies (the Chiaro optical phased array, MEMS, bubble and lithium niobate switches, and electrical fabrics) positioned by port count (small to large) against switching speed, ranging from lambda switching speeds (ms) to packet switching speeds (ns)]
The Center of the UCSD OptIPuter Network
http://132.239.26.190/view/view.shtml
Optical Phased Array –
Multiple Parallel Optical Waveguides
[Diagram: an input optical fiber, 128 parallel GaAs waveguides (WG #1 through WG #128), and the output fibers]
Chiaro Has a Scalable,
Fully Fault Tolerant Architecture
• Significant Technical Innovation
– OPA Fabric Enables Large Port Count
– Global Arbitration Provides Guaranteed Performance
– Fault-Tolerant Control System Provides Nonstop Performance
• Smart Line Cards
– ASICs With Programmable Network Processors
– Software Downloads For Features And Standards Evolution
[Diagram: line cards with network processors connect through the Chiaro OPA fabric under global arbitration; optical and electrical paths shown]
Chain of Events Leading Up to the OptIPuter All-Hands Meeting Last Month
• The hardest thing in networking – getting fiber in the ground and terminated
– 4-pair single-mode fiber per site
– Fiber terminated on Monday – fiber “polishing” took several days
– First light/first packet Wednesday at about 6pm
– Ran Linpack at about 6:05pm
• Currently only two sites are connected
– Linux cluster at each site as a baseline; sizes/capability/architecture vary as needs (and $$) change
– 4 x 1 GigE @ CSC
– 4 x 1 GigE @ SDSC
– Additional interfaces and sites being added next week (31 March)
• Chiaro
– Serial #1 production router; we have the “tiny” size
– One redundant pair of optical cores – 640 gigabits
– We have alpha/beta GigE blades from Chiaro
– OptIPuter is getting these 3 – 4 months ahead of schedule
– Chose multiple GigE links striped physically because of
– Cost at the site end
– Starting out parallel
– From the endpoint view, Chiaro works just like a standard router
Netperf Numbers
(data at first light)
• 2 streams deliver more than 1 gigabit/s in aggregate (a socket-level measurement sketch follows this slide)
[Chart: SDSC -> CSC netperf (2 x 1 GHz PIII --> 2 x 2.2 GHz PIV); bandwidth (Mb/s) of Stream 1, Stream 2, and the Aggregate versus message size (bytes), for message sizes up to roughly 65,000 bytes]
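For readers who want to reproduce this kind of curve, the sketch below is a rough socket-level stand-in for a netperf TCP stream test: it measures achieved throughput for a chosen message size. The host name, port, and run length are placeholders; it illustrates the methodology, not the exact tool settings used for the plot.

```python
# Rough, socket-level stand-in for a netperf-style TCP stream test (illustrative
# only). Run sink() on the receiving host, then call blast() from the sender for
# each message size you want to plot; host, port, and duration are placeholders.
import socket, time

def sink(port=12865):
    """Receiver: accept one connection and discard everything it sends."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("", port))
    srv.listen(1)
    conn, _ = srv.accept()
    while conn.recv(1 << 16):
        pass
    conn.close()
    srv.close()

def blast(host, port=12865, msg_size=65536, seconds=10):
    """Sender: stream msg_size-byte messages for `seconds`; return Mb/s."""
    s = socket.create_connection((host, port))
    payload = b"\0" * msg_size
    sent, start = 0, time.time()
    while time.time() - start < seconds:
        s.sendall(payload)
        sent += msg_size
    elapsed = time.time() - start
    s.close()
    return sent * 8 / elapsed / 1e6

if __name__ == "__main__":
    for size in (1024, 8192, 65536):   # message sizes along the x-axis
        print(size, "bytes:", round(blast("receiver.example.org", msg_size=size)), "Mb/s")
```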
Linpack Numbers
• 4-processor Linpack
– Run through the local copper GigE switch
– Run through the Chiaro Router
[Chart: Linpack on 4 CPUs/4 nodes (17.6 Gflops peak, four 2.2 GHz Pentium 4s); time (sec) and GigaFlops versus matrix dimension from 1,000 to 20,000, for Local GigE and Chiaro runs. The 17.6 Gflops peak figure is derived in the sketch after this slide.]
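The 17.6 Gflops peak in the plot title is simply clock rate times the two double-precision floating-point results per cycle a Pentium 4 can retire with SSE2, times four processors. The short calculation below reproduces that figure; the “measured” value in it is a placeholder for illustration, not a number from these runs.

```python
# Where the "17.6 Gflops peak" comes from, plus a simple efficiency helper.
# The measured value below is a placeholder, not data from the actual runs.
clock_ghz = 2.2          # Pentium 4 clock
flops_per_cycle = 2      # double-precision flops/cycle with SSE2
cpus = 4

peak_gflops = clock_ghz * flops_per_cycle * cpus
print(f"theoretical peak: {peak_gflops:.1f} Gflops")        # 17.6

measured_gflops = 9.0    # hypothetical HPL result, for illustration only
print(f"efficiency: {measured_gflops / peak_gflops:.0%}")
```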
Some fundamental questions already
showing up
• Lambdas and all-optical switches (e.g. 3D MEMS) are cheaper than the monster router
– ~10X cheaper than current 10GigE routers on a per-interface basis, which makes a compelling case to investigate
– However, because MEMS are mechanical, they act more like switchboards that can effectively be “rewired” at about 10 Hz
– For comparison, Chiaro “rewires” its OPA at ~1 MHz (the amortization sketch after this slide quantifies what the difference means for circuit setup)
• Question – where do you expose lambdas (circuits)?
– Are lambdas only a way to physically multiplex fiber?
– Do we expose lambdas to endpoints and build “lambda NICs” (and hence lambda-aware applications)?
– This implies a hybrid network from the endpoint: circuit-switched and packet-switched modes
– Even with massive bandwidth, larger networks will have some congestion points; circuits can create “expressways” through these choke points
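One way to see why 10 Hz versus ~1 MHz matters: a circuit is only worth setting up if the transfer is long relative to the reconfiguration time. The sketch below asks how much data must move at 10 Gb/s before switch setup falls under a 10% share of total transfer time; the 10% budget and the setup times derived from the quoted rates are illustrative assumptions.

```python
# How large must a transfer be before circuit setup time becomes negligible?
# Setup times are taken from the rates above (10 Hz MEMS ~ 0.1 s, ~1 MHz OPA
# ~ 1 microsecond); the 10% overhead budget is an arbitrary illustration.
LINK_GBPS = 10.0
OVERHEAD_BUDGET = 0.10   # allow setup to be at most 10% of total transfer time

def min_transfer_gbytes(setup_s):
    """Smallest payload for which setup_s is <= 10% of (setup + transfer) time."""
    transfer_s = setup_s * (1 - OVERHEAD_BUDGET) / OVERHEAD_BUDGET
    return LINK_GBPS * transfer_s / 8     # gigabytes moved in that time

for name, setup in (("MEMS switchboard (~10 Hz)", 0.1),
                    ("Chiaro OPA (~1 MHz)", 1e-6)):
    print(f"{name}: >= {min_transfer_gbytes(setup):.3g} GB per reconfiguration")
```

At these numbers a MEMS circuit only pays off for transfers of a gigabyte or more, while the OPA can be reconfigured even for small flows, which is why the circuit-versus-packet question above matters.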
Where the Rubber Meets the Road – Applications
• OptIPuter is a concerted effort to explore the effects of enormous bandwidth on the structure of
– Machines
– Protocols
– Applications
• Distributed applications are generally written to conserve bandwidth/network resources
– Emerging technology moves congestion points from the core to the endpoints
– It will take some time for software engineering processes to factor in this fundamental change
– In the limit, applications should worry about only two things (see the cost-model sketch after this slide):
– Latency to the remote resource
– Capacity of the remote resource to fulfill requests
• Things that once looked immovable (like a terabyte of storage) become practically accessible from a remote site
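The “only two things” claim boils down to a one-line cost model: total time is roughly the round-trip latency to the resource plus payload size divided by path bandwidth. The sketch below applies it to the terabyte example; the latency and bandwidth figures are illustrative assumptions.

```python
# Simple remote-access cost model: time = round-trip latency + size / bandwidth.
# Numbers are illustrative: a 1 TB payload, a 60 ms cross-country RTT, and either
# a congested commodity path or a dedicated multi-gigabit lambda.
def transfer_time_s(size_bytes, rtt_s, bandwidth_gbps):
    return rtt_s + size_bytes * 8 / (bandwidth_gbps * 1e9)

terabyte = 1e12
for label, gbps in (("100 Mb/s shared path", 0.1), ("10 Gb/s lambda", 10)):
    hours = transfer_time_s(terabyte, 0.060, gbps) / 3600
    print(f"{label}: {hours:.2f} hours for 1 TB")
```

At 10 Gb/s the terabyte moves in a couple of hours, which is what makes remote storage “practically accessible” in the sense above.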
Summary
• OptIPuter is a large, 5-year research project funded by NSF, driven by observations of crossing technology exponentials
• We are exploring how to build networks with bandwidth-matched endpoints
– We’re building experimental apparatus that will let us test ideas
• Key driving applications keep the IT research focused; these applications also evolve based on new capability
• There are several key IT research topics within the OptIPuter; each is key to understanding how the structure of computers and computing will change over the next decade