
Server I/O Networks
Past, Present, and Future
Renato Recio
Distinguished Engineer
Chief Architect, IBM eServer I/O
Copyright International Business Machines Corporation, 2003
Legal Notices
All statements regarding future direction and intent for IBM, the InfiniBand Trade Association, the RDMA Consortium, or
any other standard organization mentioned are subject to change or withdrawal without notice, and represent goals and
objectives only. Contact your IBM local Branch Office or IBM Authorized Reseller for the full text of a specific
Statement of General Direction.
IBM may have patents or pending patent applications covering subject matter in this presentation. The furnishing of
this presentation does not give you any license to these patents. Send license inquiries, in writing, to IBM Director of
Licensing, IBM Corporation, 500 Columbus Avenue, Thornwood, NY 10594 USA.
The information contained in this presentation has not been submitted to any formal IBM test and is distributed as is.
While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the
same or similar results will be obtained elsewhere. The use of this information or the implementation of any
techniques described herein is a customer responsibility and depends on the customer's ability to evaluate and integrate
them into the customer's operational environment. Customers attempting to adapt these techniques to their own
environments do so at their own risk.
The following terms are trademarks of International Business Machines Corporation in the United States and/or other
countries: AIX, PowerPC, RS/6000, SP, S/390, AS/400, zSeries, iSeries, pSeries, xSeries, and Remote I/O.
UNIX is a registered trademark in the United States and other countries, licensed exclusively through X/Open
Company, Limited. Ethernet is a registered trademark of Xerox Corporation. TPC-C, TPC-D, and TPC-H are
trademarks of the Transaction Processing Performance Council. InfiniBand is a trademark of the InfiniBand Trade
Association. Other product or company names mentioned herein may be trademarks or registered trademarks of their
respective companies or organizations.
In other words…
Regarding industry trends and directions:
• IBM respects the copyrights and trademarks of other companies, and
• These slides represent my views:
  - They do not imply IBM views or directions.
  - They do not imply the views or directions of the InfiniBand Trade Association, the RDMA Consortium, the PCI-SIG, or any other standards group.
Agenda
• Server I/O Network types: Requirements, Contenders.
• Server I/O Attachment and I/O Networks: PCI family and InfiniBand.
• Network stack offload: Hardware, OS, and Application considerations.
• Local Area Networks.
• Cluster Area Networks: InfiniBand and Ethernet.
• Storage Area Networks: FC and Ethernet.
• Summary.
Purpose of Server I/O Networks
Server I/O networks are used to connect devices and other servers.
[Figure: server block diagram. SMP sub-nodes (processors and caches) connect through memory controllers to I/O expansion networks; I/O attachments host virtual adapters behind bridges and switches; the server connects outward to a cluster network, a local area network, and a storage area network.]
Server I/O Network Requirements
• In the past, servers have placed the following requirements on I/O networks:
  - Standardization, so many different vendors' products can be connected;
  - Performance (scalable throughput and bandwidth; low latency and overhead);
  - High availability, so connectivity is maintained despite failures;
  - Continuous operations, so changes can occur without disrupting availability;
  - Connectivity, so many units can be connected;
  - Distance, both to support scaling and to enable disaster recovery; and
  - Low total cost, which interacts strongly with standardization through volumes and also depends on the amount of infrastructure build-up required.
• More recently, servers have added the following requirements:
  - Virtualization of host, fabric, and devices;
  - Service differentiation (including QoS), to manage fabric utilization peaks; and
  - Adequate security, particularly in multi-host (farm or cluster) situations.
Server I/O Network History
• Historically, many types of fabrics proliferated:
  - Local Area Networks
  - Cluster Networks (a.k.a. HPCN, CAN)
  - Storage Area Networks
  - I/O Expansion Networks, etc.
• Because no single technology satisfied all the above requirements, many link solutions proliferated:
  - Standard: FC for SAN, Ethernet for LAN.
  - Proprietary: a handful of IOENs (IBM's RIO, HP's remote I/O, SGI's XIO, etc.) and a handful of CANs (IBM's Colony, Myricom's Myrinet, Quadrics, etc.).
• Consolidation solutions are now emerging, but the winner is uncertain:
  - PCI family: IOA and IOEN.
  - Ethernet: LAN, SAN, CAN.
  - InfiniBand: CAN, IOEN, and possibly higher-end IOA/SAN.
Recent Server I/O Network
Evolution Timeline
[Timeline figure; key events:]
• Proprietary fabrics (e.g. IBM channels, IBM RIO, IBM STI, IBM Colony, SGI XIO, Tandem/Compaq/HP ServerNet) precede the standards efforts.
• Rattner pitch (2/98); NGIO goes public (11/98); FIO goes public (2/99); NGIO spec available (7/99); FIO spec available (9/99); FIO and NGIO merge into IB (9/99).
• InfiniBand spec releases: 1.0 (10/00), 1.0.a (6/01), 1.1 (11/02), Verb extensions (12/03).
• RDMA over IP begins (6/00); 53rd IETF: ROI BOF calls for an IETF ROI WG (12/01); 54th IETF: RDDP WG chartered (3/02).
• RDMAC announced (5/02); Verbs, RDMA, DDP, MPA 1.0 specs (10/02); SDP, iSER 1.0 specs (4/03).
• PCI: PCI-X 1.0 spec available (9/99); PCI-X 2.0 DDR/QDR announced (2/02); PCI-X 2.0 spec available (7/02).
• 3GIO described at IDF (11/01); PCI-Express 1.0 spec (7/02); AS 1.0 spec (2003).
PCI
• The PCI standard's strategy is:
  - Add evolutionary technology enhancements to the standard that maintain the existing PCI eco-system.
• Within the standard, two contenders are vying for IOA market share:
  - PCI-X
    - 1.0 is shipping now,
    - 2.0 is next and targets the 10 Gig networking generation.
  - PCI-Express
    - Maintains the existing PCI software/firmware programming model,
    - adds new protocol layers, a new physical layer, and associated connectors.
    - Can also be used as an IOEN, but does not satisfy all enterprise class requirements:
      - Enterprise class RAS is optional (e.g. multipathing),
      - Fabric virtualization is missing,
      - A more efficient I/O communication model is missing, ...
    - Will likely be extended to support:
      - Faster speed links,
      - Mandatory enterprise class RAS.
I/O Attachment Comparison
| | PCI-X (1.0, 2.0) | PCI-Express (1.0, 2.0) |
|---|---|---|
| Performance: effective link widths | Parallel 32 bit, 64 bit | Serial 1x, 4x, 8x, 16x |
| Performance: effective link frequency | 33, 66, 100, 133, 266, 533 MHz | 2.5 GHz -> 5 or 6.25 GHz |
| Performance: bandwidth range | 132 MB/s to 4.17 GB/s | 250 MB/s to 4 GB/s |
| Connectivity | Multi-drop bus or point-point | Memory mapped switched fabric |
| Distance | Chip-chip, card-card connector | Chip-chip, card-card connector, cable |
| Self-management: unscheduled outage protection | Interface checks, parity, ECC; no redundant paths | Interface checks, CRC; no redundant paths |
| Self-management: scheduled outage protection | Hot-plug and dynamic discovery | Hot-plug and dynamic discovery |
| Self-management: service level agreement | N/A | Traffic classes, virtual channels |
| Virtualization: host | Performed by host | Performed by host |
| Virtualization: network | None | None |
| Virtualization: I/O | No standard mechanism | No standard mechanism |
| Cost: infrastructure build up | Delta to existing PCI chips | New chip core (macro) |
| Cost: fabric consolidation potential | None | IOEN and I/O Attachment |
InfiniBand
• IB's strategy:
  - Provide a new, very efficient I/O communication model that satisfies enterprise server requirements and can be used for I/O, cluster, and storage.
• IB's model:
  - Enables middleware to communicate across a low-latency, high-bandwidth fabric through message queues that can be accessed directly from user space.
  - But it required a completely new infrastructure (management, software, endpoint hardware, fabric switches, and links).
• The I/O adapter industry viewed IB's model as too complex:
  - So I/O adapter vendors are staying on PCI,
  - though IB may be used to attach high-end I/O to enterprise class servers.
• Given the current I/O attachment reality, enterprise class vendors will likely:
  - Continue extending their proprietary fabric(s), or
  - Tunnel PCI traffic through IB and provide IB-PCI bridges.
I/O Expansion Network Comparison
| | PCI-Express | IB |
|---|---|---|
| Performance: link widths | Serial 1x, 4x, 8x, 16x | Serial 1x, 4x, 12x |
| Performance: link frequency | 2.5 GHz | 2.5 GHz |
| Performance: bandwidth range | 250 MB/s to 4 GB/s | 250 MB/s to 3 GB/s |
| Performance: latency | PIO based synchronous operations (network traversal for PIO Reads) | Native: message based asynchronous operations (Send and RDMA); Tunnel: PIO based synchronous operations |
| Connectivity | Memory mapped switched fabric | Identifier based switched fabric |
| Distance | Chip-chip, card-card connector, cable | Chip-chip, card-card connector, cable |
| Topology | Single host, root tree | Multi-host, general |
| Self-management: unscheduled outage protection | Interface checks, CRC; no native memory access controls; no redundant paths | Interface checks, CRC; memory access controls; redundant paths |
| Self-management: scheduled outage protection | Hot-plug and dynamic discovery | Hot-plug and dynamic discovery |
| Self-management: service level agreement | Traffic classes, virtual channels | Service levels, virtual channels |
I/O Expansion Network Comparison…
Continued
| | PCI-Express | IB |
|---|---|---|
| Cost: infrastructure build up | New chip core (macro) | New infrastructure |
| Cost: fabric consolidation potential | IOEN and I/O Attachment | IOEN, CAN, high-end I/O Attachment |
| Virtualization: host | Performed by host | Standard mechanisms available |
| Virtualization: network | None | End-point partitioning |
| Virtualization: I/O | No standard mechanism | Standard mechanisms available |
| Next steps: higher frequency links | 5 or 6.25 GHz (work in process) | 5 or 6.25 GHz (work in process) |
| Next steps: advanced functions | Mandatory interface checks, CRC | Verb enhancements |
Server Scale-up Topology Options
[Figure: two scale-up topologies. Left, PCI-Express IOEN: SMP sub-nodes (processors and caches) whose memory controllers attach to a PCI-Express IOEN, with PCI-Express switches, PCI-Express bridges, and PCI-X bridges fanning out to adapters. Right, IB or proprietary IOEN: SMP sub-nodes whose memory controllers attach to an IB or proprietary fabric, tunneling PCI through it to PCI-X bridges and adapters.]

For large SMPs, a memory fabric must be used to access I/O that is not local to an SMP sub-node. PCI-Express is SMP only.

Key PCI-Express IOEN value proposition:
• Adapter scaling
• Short-distance remote I/O
• Proprietary based virtualization
• QoS (8 traffic classes, virtual channels)
• Low infrastructure build-up
• Evolutionary compatibility with PCI

Key IB IOEN value proposition:
• Bandwidth scaling
• Long distance remote I/O
• Native, standards based virtualization
• Multipathing for performance and HA
• QoS (16 service levels, virtual lanes)
• CAN and IOEN convergence
Server IOA Outlook
[Graphs: Server I/O Attachment bandwidth (GB/s), 1994-2009, for MCA, PCI/PCI-X, and PCI-Express; and I/O Expansion Network bandwidth (GB/s), 1994-2009, for ServerNet, IBM RIO/STI, SGI XIO, HP, IB (12x), and PCI-E (8/16x).]

I/O Attachment:
• Next steps in the PCI family roadmap: 2003-05: PCI-X 2.0 DDR and QDR; 2005: PCI-Express.
• SHV drivers for PCI-Express are: AGP replacement on clients (16x); CPU chipsets on clients and servers (8x or 16x).
• Servers will likely pursue PCI-Express: it satisfies low-end requirements, but not all enterprise class requirements.
• IB as an IOA: complexity and eco-system issues will limit IB to a small portion of high-end IOA.

I/O Expansion:
• Key options for scale-up servers:
  - Migrate to IB and tunnel PCI I/O through it.
  - Continue upgrading the proprietary IOEN.
  - Migrate to PCI-Express.
Problems with Sockets over TCP/IP
[Charts: % CPU utilization for a standard NIC with no offload, for 1 KB and 64 KB transfers, broken down into socket library, interrupt processing, NIC, IP, TCP, copy/data management, and cycles available for the application; and the receive-side server memory bandwidth to link bandwidth ratio.]

• Network intensive applications consume a large percentage of the CPU cycles:
  - Small 1 KB transfers spend 40% of the time in TCP/IP and 18% in copy/buffer management.
  - Large 64 KB transfers spend 25% of the time in TCP/IP and 49% in copy/buffer management.
• Network stack processing consumes a significant amount of the available server memory bandwidth (3x the link rate on receives).

* Note: 1 KB results are based on Erich Nahum's Tuxedo on Linux run (1 KB files, 512 clients), adding 0.5 instructions per byte for copy; 64 KB results are based on the same run with 64 KB files.
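To ground where the copy/buffer-management and interrupt cycles in the chart come from, here is a minimal sketch (mine, not from the presentation) of the conventional sockets receive path being measured: the kernel TCP/IP stack handles every segment, and recv() then copies the payload from kernel socket buffers into the application buffer. The port number is an arbitrary placeholder and error handling is omitted for brevity.

```c
/* Hedged sketch of the no-offload receive path measured above: all
 * TCP/IP protocol work and interrupt handling happen in the kernel,
 * and recv() performs a kernel-to-user copy of every payload byte. */
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(5001);       /* arbitrary example port */
    bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(lfd, 1);

    int cfd = accept(lfd, NULL, NULL);
    char buf[64 * 1024];                      /* 64 KB application buffer */
    ssize_t n;
    /* Each recv() copies data the kernel TCP/IP stack has already
     * checksummed, reassembled, and buffered; for 64 KB transfers that
     * copy/buffer management is roughly half the CPU cost shown above. */
    while ((n = recv(cfd, buf, sizeof(buf), 0)) > 0) {
        /* application consumes buf[0..n) */
    }
    close(cfd);
    close(lfd);
    return 0;
}
```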
Network Offload – Basic Mechanisms
• Successful network stack offload requires five basic mechanisms (a hedged code sketch follows this list):
  - Direct user space access to a send/receive Queue Pair (QP) on the offload adapter.
    Allows middleware to directly send and receive data through the adapter.
  - Registration of virtual to physical address translations with the offload adapter.
    Allows the hardware adapter to directly access user space memory.
  - Access controls between registered memory resources and work queues.
    Allows privileged code to associate adapter resources (memory registrations, QPs, and Shared Receive Queues) with a combination of: OS image, process, and, if desired, thread.
  - Remote direct data placement (a.k.a. Remote Direct Memory Access - RDMA).
    Allows the adapter to directly place incoming data into a user space buffer.
  - Efficient implementation of the offloaded network stack.
    Otherwise offload may not yield the desired performance benefits.
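The verbs-style interfaces that grew out of this model expose these mechanisms directly to user space. Below is a hedged sketch using the later OpenFabrics libibverbs API as a stand-in for the verbs discussed in these slides; it assumes a Linux host with libibverbs and an RDMA-capable adapter, and it shows only the first three mechanisms (QP access, memory registration, and protection-domain scoped access control). Connection setup and most error handling are trimmed.

```c
/* Hedged sketch, assuming the modern libibverbs API (not part of the
 * original slides): open an offload adapter, register a user buffer so
 * the adapter can DMA it, and create a Queue Pair under a protection
 * domain that enforces the access controls described above. */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no RDMA device\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);        /* protection domain */

    /* Mechanism 2: register virtual-to-physical translations so the
     * adapter can access this user-space buffer directly. */
    size_t len = 64 * 1024;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);

    /* Mechanisms 1 and 3: a send/receive QP, tied to a completion queue
     * and to the protection domain that scopes what memory it may touch. */
    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);
    struct ibv_qp_init_attr qpa = {
        .send_cq = cq, .recv_cq = cq,
        .cap = { .max_send_wr = 16, .max_recv_wr = 16,
                 .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,                    /* reliable connected */
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &qpa);
    printf("QP %u ready; buffer lkey 0x%x rkey 0x%x\n",
           qp->qp_num, mr->lkey, mr->rkey);

    /* Connection setup (path resolution, QP state transitions) and the
     * data transfer verbs are omitted here. */
    ibv_destroy_qp(qp); ibv_destroy_cq(cq);
    ibv_dereg_mr(mr); free(buf);
    ibv_dealloc_pd(pd); ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```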
Network Stack Offload – InfiniBand
Host Channel Adapter Overview
[Figure: the verb consumer sits above the verbs interface; the verbs are implemented by the HCA driver/library and the HCA. The HCA data engine layer implements the IB transport and network layers and maintains the QP Context (QPC) and the memory Translation and Protection Table (TPT).]

• Verb consumer – software that uses the HCA to communicate with other nodes.
• Communication is through verbs, which:
  - Manage connection state.
  - Manage memory and queue access.
  - Submit work to the HCA.
  - Retrieve work and events from the HCA.
• The Channel Interface (CI) performs work on behalf of the consumer. The CI consists of:
  - Driver – performs privileged functions.
  - Library – performs user space functions.
  - HCA – the hardware adapter.

Legend: QP – Queue Pair (QP = SQ + RQ); SQ – Send Queue; RQ – Receive Queue; SRQ – Shared RQ; CQ – Completion Queue.
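To make "submit work to the HCA" and "retrieve work and events from the HCA" concrete, here is a short continuation of the earlier libibverbs sketch (an illustration of the verbs model, not code from the presentation); it assumes the QP, CQ, and memory region created above and a QP that has already been connected to a peer.

```c
/* Hedged continuation of the previous sketch: submit a send work
 * request through the verbs, then retrieve its completion from the
 * Completion Queue. */
#include <stdint.h>
#include <infiniband/verbs.h>

static int send_and_wait(struct ibv_qp *qp, struct ibv_cq *cq,
                         struct ibv_mr *mr, void *buf, size_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,    /* registered user-space buffer */
        .length = (uint32_t)len,
        .lkey   = mr->lkey,          /* proves the buffer was registered */
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,        /* plain two-sided send */
        .send_flags = IBV_SEND_SIGNALED,  /* generate a completion */
    };
    struct ibv_send_wr *bad = NULL;

    /* "Submit work to the HCA": hand the work request to the adapter. */
    if (ibv_post_send(qp, &wr, &bad))
        return -1;

    /* "Retrieve work and events from the HCA": poll the CQ until the
     * adapter reports that the send has completed. */
    struct ibv_wc wc;
    int n;
    do {
        n = ibv_poll_cq(cq, 1, &wc);
    } while (n == 0);
    return (n < 0 || wc.status != IBV_WC_SUCCESS) ? -1 : 0;
}
```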
Network Stack Offload – iONICs
[Figure: a host running applications over the sockets interface, with three paths into an iONIC: sockets over the Ethernet Link Service (host transport/network stack plus NIC management driver), sockets over the TOE Service (TOE service library and TOE driver), and sockets over the RDMA Service (RDMA service library and RNIC driver). The iONIC implements Ethernet, IP, TCP, and RDMA/DDP/MPA; only the Ethernet Link, TOE, and RDMA services are shown.]

• iONIC – an internet Offload Network Interface Controller. Supports one or more internet protocol suite offload services.
• RDMA enabled NIC (RNIC) – an iONIC that supports the RDMA Service.
• IP suite offload services include, but are not limited to:
  - TCP/IP Offload Engine (TOE) Service
  - Remote Direct Memory Access (RDMA) Service
  - iSCSI Service
  - iSCSI Extensions for RDMA (iSER) Service
  - IPSec Service
Network Stack Offload – iONIC
RDMA Service Overview
[Figure: analogous to the HCA figure, the verb consumer sits above the verbs interface; the verbs are implemented by the RNIC driver/library and the iONIC RDMA Service. The data engine layer implements RDMA/DDP/MPA/TCP/IP and maintains the QP Context (QPC) and the memory Translation and Protection Table (TPT).]

• Verb consumer – software that uses the RDMA Service to communicate with other nodes.
• Communication is through verbs, which:
  - Manage connection state.
  - Manage memory and queue access.
  - Submit work to the iONIC.
  - Retrieve work and events from the iONIC.
• The RDMA Service Interface (RI) performs work on behalf of the consumer. The RI consists of:
  - Driver – performs privileged functions.
  - Library – performs user space functions.
  - RNIC – the hardware adapter.

Legend: QP – Queue Pair (QP = SQ + RQ); SQ – Send Queue; RQ – Receive Queue; SRQ – Shared RQ; CQ – Completion Queue.
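The RDMAC verbs intentionally mirror the IB verbs, and in today's OpenFabrics stack iWARP RNICs are driven through the same libibverbs interface. As an illustration of the one-sided operation both adapter types rely on, here is a hedged sketch of an RDMA Write; it assumes a connected QP and that the peer has advertised, out of band, the virtual address and rkey of a buffer it registered.

```c
/* Hedged sketch (not from the slides) of a one-sided RDMA Write using
 * libibverbs, which today fronts both IB HCAs and iWARP RNICs. */
#include <stdint.h>
#include <infiniband/verbs.h>

static int rdma_write(struct ibv_qp *qp, struct ibv_mr *local_mr,
                      void *local_buf, size_t len,
                      uint64_t remote_addr, uint32_t remote_rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = (uint32_t)len,
        .lkey   = local_mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 2,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,   /* adapter places data directly */
        .send_flags = IBV_SEND_SIGNALED,
        .wr.rdma.remote_addr = remote_addr, /* placement info from the peer */
        .wr.rdma.rkey        = remote_rkey,
    };
    struct ibv_send_wr *bad = NULL;
    /* The target's CPU is not involved: its RNIC/HCA validates the rkey
     * against the TPT and writes the payload straight into the
     * registered user buffer, which is the "remote direct data
     * placement" mechanism listed earlier. */
    return ibv_post_send(qp, &wr, &bad);
}
```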
Network I/O Transaction Efficiency
Send and Receive pair
[Graph: CPU instructions per byte versus transfer size (1 byte to 100 KB) for ELS, TOE 1, TOE 2, SDP, and RDMA.]

• The graph shows a complete transaction:
  - Send and Receive for TOE.
  - Combined Send + RDMA Write and Receive for RDMA.
Network Offload Benefits
Middleware View
• The benefit of network stack offload (IB or iONIC) depends on the ratio of application/middleware (App) instructions to network stack instructions.

Multi-tier server environment (note: all tiers are logical; they can potentially run on the same server OS instance(s)):
[Figure: user client tier (browser) -> web server tier (presentation server, presentation data) -> application server tier (web application server, DB client & replication, application data) -> business function tier (OLTP & BI DB, HPC, business data). The back-end tiers traditionally use a cluster network.]

Legend:
• NC not useful at present due to XML and Java overheads.
• Sockets-level NC support beneficial:
  - 5 to 6% performance gain for communication between the App tier and the business function tier.
  - 0 to 90% performance gain for communication between the browser and the web server.
• Low-level (uDAPL, ICSC) support most beneficial:
  - 4 to 50% performance gain for the business function tier.
• iSCSI, DAFS support beneficial:
  - 5 to 50% gain for NFS/RDMA compared to NFS performance.
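As a rough back-of-the-envelope illustration of that ratio (my arithmetic, not a figure from the slides): the earlier chart showed roughly 58% of CPU cycles (40% TCP/IP plus 18% copy/buffer management) going to the network stack for 1 KB transfers. If offload removed nearly all of that work, Amdahl's law caps the gain at about 1 / (1 - 0.58), or roughly 2.4x, for that workload; as application and middleware instructions grow relative to network stack instructions the achievable gain shrinks, which is why the per-tier gains listed above span such a wide range.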
TCP/IP/Ethernet Are King of LANs
• Ethernet is standard and widely deployed as a LAN:
  - Long distance links (from card-card to 40 km).
  - High availability through session, adapter, or port level switchover.
  - Dynamic congestion management when combined with IP transports.
  - Scalable security levels.
  - Sufficient performance for LAN.
  - Good enough performance for many clusters, and high performance when combined with TCP offload.
• The strategy is to extend Ethernet's role in Wide Area Networks, Cluster, and Storage through a combination of:
  - Faster link speed (10 Gb/s) at competitive costs:
    - No additional cost for copper (XAUI).
    - $150 to $2000 transceiver cost for fiber.
  - internet Offload Network Interface Controllers (iONICs):
    - Multiple services.
  - Lower latency switches:
    - Sub-microsecond latencies for data-center (cluster and storage) networks.
Market Value of Ethernet
[Charts: cumulative shipments of switched Ethernet ports (10, 100, 1000, and 10000 Mb/s), 1993-2004, with roughly 250 million Ethernet ports installed to date; and server Ethernet NIC prices, Jan-96 through Jan-05, for 10 Mb/s, 100 Mb/s, 1 Gb/s, 1 Gb/s iONIC, 10 Gb/s copper and fiber, 10 Gb/s copper and fiber iONICs, and IB 4x adapters.]
LAN Switch Trends
Nishan switch example:
- IBM 0.18 um ASIC
- 25 million transistors
- 15 mm x 15 mm die size
- 928 signal pins
- Less than 2 us latency

• The traditional LAN switch IHV business model has been to pursue higher level protocols and functions, with less attention to latency.
• iONICs and 10 GigE are expected to increase the role Ethernet will play in Cluster and Storage Networks.
• iONICs and 10 GigE provide an additional business model for switch vendors, focused on satisfying the performance needs of Cluster and Storage Networks. Some Ethernet switch vendors (e.g. Nishan) are pursuing this new model.

Switch latencies:

| | General purpose switch | Data Center (e.g. iSCSI) focused switch |
|---|---|---|
| 1997 | 20-100 us range | - |
| 2002 | 10-20 us range | <2 us range |
| 2006 | 3-5 us range | <1 us range |
LAN Outlook
[Graph: Local Area Network bandwidth (GB/s), 1974-2004, for Ethernet, Token Ring, ATM, and FDDI; LAN bandwidth has grown 2x every 16 months over the past 10 years.]

• Ethernet vendors are gearing up for storage and higher performance cluster networks:
  - 10 GigE to provide higher bandwidths;
  - iONICs to solve CPU and memory overhead problems; and
  - lower latency switches to satisfy end-to-end process latency requirements.
• Given the above comes to pass, how well will Ethernet play in the:
  - Cluster market?
  - Storage market?
Cluster Network Contenders
• Proprietary networks
  - Strategy:
    - Provide advantage over standard networks: lower latency/overhead, higher link performance, and advanced functions.
    - Eco-system completely supplied by one vendor.
  - Two approaches:
    - Multi-server – can be used on servers from more than one vendor.
    - Single-server – only available on servers from one vendor.
• Standard networks
  - IB
    - Strategy is to provide almost all the advantages available on proprietary networks, thereby, eventually, displacing proprietary fabrics.
  - 10 Gb/s Ethernet internet Offload NIC (iONIC)
    - Strategy is to increase Ethernet's share of the Cluster network pie by providing lower CPU/memory overhead and advanced functions, though at a lower bandwidth than IB and proprietary networks.
  - Note: PCI-Express doesn't play, because it is missing many functions.
Cluster Network Usage in HPC
[Chart: cluster interconnect technology for the Top 500 supercomputers (top 100, next 100, last 100), for June 2000, June 2002, and November 2002, split into standards based, multi platform, and single platform interconnects.]

• Use of proprietary cluster networks for high-end clusters will continue to decline.
• Multi-platform cluster networks have already begun to gain significant share.
• Standards-based cluster networks will become the dominant form.
* Source: Top 500 study by Tom Heller.
Reduction in Process-Process Latencies
256 B and 8 KB Block
[Charts: LAN process-process latencies (absolute and normalized) for GbEnet, 10GbE, 10GbE optimized, and IB-12x, broken into link time, switch time (5 hops), and network stack plus adapter driver/library time. For 256 B blocks the latency is 1.2x, 4.6x, and 8.4x lower than GbEnet for 10GbE, 10GbE optimized, and IB-12x respectively; for 8 KB blocks it is 2.5x, 3.0x, and 3.9x lower. For scale: 100 MFLOP corresponds to 19 us at 1 GigE latency and 6 us at 10 GigE or IB latency.]
HPC Cluster Network Outlook
[Graph: cluster network bandwidth (GB/s), 1985-2009, for Ethernet, Token Ring, ATM, FDDI, FCS, IB, HiPPI, ServerNet, Mem. Channel, SGI GIGAchannel, IBM SP/RIO/STI, Synfinity, and high performance standard links (IB/Ethernet).]

• Proprietary fabrics will be displaced by Ethernet and IB.
• Servers with the most stringent performance requirements will use IB.
• Cluster networks will continue to be predominately Ethernet.
• iONICs and low-latency switches will increase Ethernet's participation in the cluster network market.
Current SAN and NAS Positioning
Overview
[Figure: a SAN providing block (LUN/LBA) access, and a LAN providing file (NFS, HTTP, etc.) access to NAS.]

• Current SAN differentiators:
  - Block level I/O access.
  - High performance I/O: low latency, high bandwidth.
  - Vendor unique fabric management protocols: learning curve for IT folks.
  - Homogeneous access to I/O.
• Current NAS differentiators:
  - File level I/O access.
  - LAN level performance: high latency, lower bandwidth.
  - Standard fabric management protocols: low/zero learning curve for IT folks.
  - Heterogeneous platform access to files.
• Commonalities:
  - Robust remote recovery and storage management requires special tools for both.
  - Each can optimize disk access, though SAN does require virtualization to do it.
• Contenders:
  - SAN: FC and Ethernet.
  - NAS: Ethernet.
Storage Models for IP
[Figure: three storage stacks. Parallel SCSI or FC: application, FS API, FS/LVM, storage driver, adapter driver, storage adapter. iSCSI Service in the host CPU: application, FS API, FS/LVM, iSCSI, TCP/IP, NIC driver, partial-offload Ethernet NIC. iSCSI Service in an iONIC: application, FS API, FS/LVM, storage driver, iSCSI HBA (iSCSI, TCP/IP, and Ethernet in the adapter). Accompanying graph: CPU instructions per byte versus transfer size for Parallel SCSI, the iSCSI Service in the host, and the iSCSI Service in an iONIC.]

• Parallel SCSI and FC have a very efficient path through the O/S:
  - The existing driver to hardware interface has been tuned for many years.
  - An efficient driver-HW interface model has been a key iSCSI adoption issue.
• Next steps in iSCSI development:
  - Offload TCP/IP processing to the host bus adapter,
  - Provide switches that satisfy SAN latency requirements,
  - Improve read and write processing overhead at the initiator and target.
Storage Models for IP
[Figure: two NAS stacks. NFS over an ELS NIC: application, NFS API, NFS, TCP/IP, NIC driver, partial-offload Ethernet NIC. NFS Extensions for RDMA over the RDMA Service in an iONIC: application, NFS API, NFS, NIC driver, RNIC (RDMA/DDP/MPA/TCP and IP/Ethernet in the adapter). Accompanying graph: CPU instructions per byte versus transfer size for NFS over an ELS NIC, NFS over an RNIC, and Parallel SCSI.]

• RDMA will significantly improve NAS server performance.
• Host network stack processing will be offloaded to an iONIC:
  - Removes TCP/IP processing from the host path.
  - Allows zero copy.
• NAS (NFS with RDMA Extensions) protocols will exploit RDMA:
  - RDMA allows a file level access device to approach block level access device performance levels,
  - creating a performance discontinuity for storage.
Storage I/O Network Outlook
[Graphs: storage link bandwidth (GB/s), 1990-2005, for SCSI, FC, disk head, and iSCSI/Ethernet; and single adapter/controller throughput (K IOPS), 1994-2008.*]

Storage network outlook:
• Link bandwidth trends will continue:
  - Paced by optic technology enhancements.
• Adapter throughput trends will continue:
  - Paced by higher frequency circuits, higher performance microprocessors, and larger fast-write and read cache memory.
• SANs will gradually transition from FC to IP/Ethernet networks:
  - Motivated by TCO/complexity reduction.
  - Paced by the availability of iSCSI with efficient TOE (possibly RNIC) and lower latency switches.
• NAS will be more competitive against SAN:
  - Paced by RNIC availability.

* Sources: product literature from 14 companies. These typically use a workload that is 100% reads of 512 byte data; not a good measure of overall sustained performance, but a good measure of adapter/controller front-end throughput capability.
Summary
• Server I/O adapters will likely attach through the PCI family, because of PCI's low cost and simplicity of implementation.
• I/O expansion networks will likely use:
  - Proprietary or IB (with PCI tunneling) links that satisfy enterprise requirements, and
  - PCI-Express on Standard High Volume, low-end servers.
• Cluster networks will likely use:
  - Ethernet networks for the high-volume portion of the market, and
  - InfiniBand when performance (latency, bandwidth, throughput) is required.
• Storage area networks will likely continue using Fibre Channel, but gradually migrate to iSCSI over Ethernet.
• LANs: Ethernet is King.
• Several network stack offload design approaches will be attempted:
  - From all firmware on slow microprocessors, to heavy state machine usage, to all points in between.
  - After weed design approaches are rooted out of the market, iONICs will eventually become a prevalent feature on low-end to high-end servers.