Storage Systems CSE 598D, Spring 2007


Storage Systems
CSE 598d, Spring 2007
Lecture 15: Consistency Semantics, Introduction to
Network-attached Storage
March 27, 2007
Agenda
• Last class
  – Consistency models: Brief Overview
• Next
  – More details on consistency models
  – Network storage introduction
    • NAS vs SAN
    • DAFS
    • Some relevant technology and systems innovations
      – FC, Smart NICs, RDMA, …
  – A variety of topics on file systems (and other storage-related software)
    • Log-structured file systems
    • Databases and file systems compared
    • Mobile/poorly connected systems, highly distributed & P2P storage
    • NFS, Google file system
    • Asynchronous I/O
    • Flash-based storage
    • Active disks, object-based storage devices (OSD)
    • Archival and secure storage
    • Storage virtualization and QoS
  – Reliability, (emerging) miniature storage devices
Problem Background and Definition
• Consistency issues were first studied in the context of shared-memory multiprocessors and we will start our discussion in the same context
– Ideas generalize to any distributed system with shared storage
• Memory consistency model (MCM) of an SMP provides a formal
specification of how the memory system will appear to the programmer
– Places restrictions on the values that can be returned by a read in a shared-memory program execution
– An MCM is a contract between the memory and the programmer
• Why different models?
– Trade-offs involved between “strictness” of consistency guarantees,
implementation efforts (hardware, compiler, programmer), system performance
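To make the "contract" concrete, the following is a minimal store-buffering litmus test in C with POSIX threads (an illustrative sketch, not from the slides). Under sequential consistency the outcome r1 == 0 && r2 == 0 is impossible; compilers and hardware that implement weaker models may occasionally produce it.

/* Store-buffering litmus test: illustrates why an MCM is a contract.
 * Under sequential consistency, r1 == 0 && r2 == 0 cannot happen;
 * weaker models (and real hardware) may allow it.
 * Build with: gcc -O2 -pthread litmus.c */
#include <pthread.h>
#include <stdio.h>

volatile int x, y, r1, r2;

static void *t1(void *arg) { (void)arg; x = 1; r1 = y; return NULL; }
static void *t2(void *arg) { (void)arg; y = 1; r2 = x; return NULL; }

int main(void) {
    for (int i = 0; i < 10000; i++) {
        x = y = 0;
        pthread_t a, b;
        pthread_create(&a, NULL, t1, NULL);
        pthread_create(&b, NULL, t2, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        if (r1 == 0 && r2 == 0)
            printf("iteration %d: r1 == r2 == 0 (not sequentially consistent)\n", i);
    }
    return 0;
}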
Atomic/Strict Consistency
• Most intuitive, naturally appealing
• Any read to a memory location x returns the value stored by the most recent
write operation to x
• Defined w.r.t. a “global” clock
  – That is the only way “most recent” can be defined unambiguously
• Uni-processors typically observe such consistency
  – A programmer on a uni-processor naturally assumes this behavior
  – E.g., as a programmer, one would not expect the following code segment to print 1 or any value other than 2
    • A = 1; A = 2; print (A);
  – Still possible for compiler and hardware to improve throughput by re-ordering instructions
    • Atomic consistency can be achieved as long as data and control dependencies are adhered to
• Often considered a base model (for evaluating MCMs that we will see next)
Atomic/Strict Consistency
• What happens on a multi-processor?
– Even on the smallest and fastest multi-processor, global time can not be achieved!
– Achieving atomic consistency not possible
– But not a hindrance, since programmers manage quite well with something weaker
than atomic consistency
– What behavior do we expect when we program on a multi-processor?
• What we DO NOT expect: a global clock
• What we expect:
– Operations from a process will execute sequentially
» Again: A = 1; A = 2; print (A) should not print 1
• And then we can use Critical section/Mutual exclusion mechanisms to enforce desired
order among instructions coming from different processors
– So we expect a MCM less strict than atomic consistency. What is this consistency
model, what are its properties, and what does the hardware/software (compiler)
have to do to provide it?
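As a small illustration of the last point (a sketch, not from the slides): a POSIX mutex is one way to impose a desired order between operations issued by different threads/processors.

/* Mutual exclusion to order operations from different threads: a minimal sketch. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int shared = 0;

static void *writer(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);
    shared = 42;                         /* write inside the critical section */
    pthread_mutex_unlock(&lock);
    return NULL;
}

static void *reader(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);
    printf("reader sees %d\n", shared);  /* sees either 0 or 42, never a partial update */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t w, r;
    pthread_create(&w, NULL, writer, NULL);
    pthread_create(&r, NULL, reader, NULL);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return 0;
}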
Sequential Consistency
• What we typically expect from a shared-memory multi-processor system is
captured by sequential consistency
– Lamport [1979]: A multi-processor is sequentially consistent if the result of any
execution is the same as if
• The operations of all the processors were executed in some sequential order
– That is, memory accesses occur atomically w.r.t. other memory accesses
• The operations of each individual processor appear in this sequence in the order
specified by its program
– Equivalently, any valid interleaving is acceptable as long as all processes see the same
ordering of memory references
– Programmer’s view: processors P1 … Pn all accessing a single shared memory (figure)
Example: Sequential Consistency
P1: W(x)1
P2:              W(y)2
P3:       R(y)2         R(x)0  R(x)1
• Not atomically consistent because:
– R(y)2 by P3 reads a value that has not been written yet
– W(x)1 and W(y)2 appear commuted at P3
• But sequentially consistent
– SC doesn’t have the notion of global clock
Example: Sequential Consistency
P1: W(x)1
P2:              W(y)2
P3:       R(y)2         R(x)0  R(x)1
• Not atomically consistent because:
– R(y)2 by P3 reads a value that has not been written yet
– W(x)1 and W(y)2 appear commuted at P3
• But sequentially consistent
• What about?
P1: W(x)1
P2:              W(y)2
P3:       R(y)2  R(x)0  R(x)1
P4:       R(y)2  R(x)0  R(x)1
Example: Sequential Consistency
P1: W(x)1
P2:              W(y)2
P3:       R(y)2         R(x)0  R(x)1
• Not atomically consistent because:
– R(y)2 by P3 reads a value that has not been written yet
– W(x)1 and W(y)2 appear commuted at P3
• But sequentially consistent
• What about?
P1: W(x)1
P2:              W(y)2
P3:       R(y)2  R(x)0  R(x)1
P4:       R(x)1  R(y)0  R(y)2
Causal Consistency
• Hutto and Ahamad, 1990
• Each operation is either “causally related” or “concurrent” with another
– When a processor performs a read followed later by a write, the two operations are said to be
causally related because the value stored by the write may have been dependent upon the result
of the read
– A read operation is causally related to an earlier write that stored the data retrieved by the read
– Transitivity applies
– Operations that are not causally related are said to be concurrent.
• A memory is causally consistent if all processors agree on the order of causally
related writes
– Weaker than SC, which requires all writes to be seen in the same order
P1: W(x)1                      W(x)3
P2:        R(x)1  W(x)2
P3:        R(x)1                       R(x)3  R(x)2
P4:        R(x)1                       R(x)2  R(x)3
W(x)1 and W(x)2 are causally related
W(x)2 and W(x)3 are not causally related (concurrent)!
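One standard way to mechanize the "causally related vs. concurrent" test is to compare vector timestamps. The sketch below is illustrative only; the slides define causality directly in terms of reads and writes, and the timestamps assigned to W(x)1, W(x)2 and W(x)3 here are assumptions chosen to match the example.

/* Vector-clock comparison: classify two operations as causally ordered or concurrent.
 * Illustrative sketch; N is the number of processors. */
#include <stdio.h>

#define N 4

typedef enum { BEFORE, AFTER, CONCURRENT, EQUAL } order_t;

static order_t compare(const int a[N], const int b[N]) {
    int a_le_b = 1, b_le_a = 1;
    for (int i = 0; i < N; i++) {
        if (a[i] > b[i]) a_le_b = 0;
        if (b[i] > a[i]) b_le_a = 0;
    }
    if (a_le_b && b_le_a) return EQUAL;
    if (a_le_b) return BEFORE;          /* a happened-before b: causally related */
    if (b_le_a) return AFTER;
    return CONCURRENT;                  /* neither dominates: concurrent */
}

int main(void) {
    static const char *name[] = { "BEFORE", "AFTER", "CONCURRENT", "EQUAL" };
    int w1[N] = {1, 0, 0, 0};   /* assumed timestamp of W(x)1 at P1              */
    int w2[N] = {1, 1, 0, 0};   /* W(x)2 at P2, issued after P2 read x = 1       */
    int w3[N] = {2, 0, 0, 0};   /* W(x)3 at P1, issued without having seen W(x)2 */
    printf("W(x)1 vs W(x)2: %s (causally related)\n", name[compare(w1, w2)]);
    printf("W(x)2 vs W(x)3: %s\n", name[compare(w2, w3)]);
    return 0;
}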
Summary: Uniform MCMs
• Atomic consistency
• Sequential consistency
• Causal consistency
• Processor consistency
• Cache consistency
• PRAM consistency
• Slow memory
UNIX and session semantics
• UNIX file sharing semantics on a uni-processor system
– When a read follows a write, the read returns the value just written
– When two writes happen in quick succession, followed by a read, the
value read is that stored by the last write
• Problematic for a distributed system
– Theoretically achievable if single file server and no client caching
• Session semantics
– Writes made visible to others only upon the closing of a file
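A sketch of what session (close-to-open) semantics promises, written with ordinary POSIX calls; the path /shared/log is made up, and the guarantee shown is only what a session-semantics file system provides, not what every file system does.

/* Close-to-open (session) semantics, illustrated.  On such a file system,
 * Client B is only guaranteed to see A's data after A has closed the file
 * and B has (re)opened it.  Hypothetical path; error checks omitted. */
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int main(void) {
    /* --- Client A --- */
    int fd_a = open("/shared/log", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    write(fd_a, "hello\n", 6);   /* not yet guaranteed visible to other clients */
    close(fd_a);                 /* session ends: the write becomes visible     */

    /* --- Client B (possibly on another machine) --- */
    char buf[16];
    int fd_b = open("/shared/log", O_RDONLY);   /* opened after A's close       */
    ssize_t n = read(fd_b, buf, sizeof buf);    /* guaranteed to see "hello\n"  */
    close(fd_b);
    printf("client B read %zd bytes\n", n);
    return 0;
}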
Delta Consistency
• Any write will become visible within at most Delta time units
– Barring network latency
– Meanwhile … all bets are off!
– Push versus pull
– Compare with sequential, causal, etc. in terms of valid orderings of operations
• Related: Mutual consistency with parameter Delta
– A given set of “objects” are within Delta time units of each other at all
times as seen by a client
– Note that it is OK to be stale with respect to the server by more than
Delta!
– Generally, specify two parameters
• Delta1: Freshness w.r.t. server
• Delta2: Mutual consistency of related objects
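A minimal sketch of how a client might enforce a Delta freshness bound with a per-object timestamp (essentially a lease/TTL check); the cache_entry struct and fetch_from_server helper are hypothetical.

/* Delta-consistency-style freshness check: a cached copy is refreshed from the
 * server once it may be more than DELTA seconds old.  Hypothetical sketch. */
#include <stdio.h>
#include <string.h>
#include <time.h>

#define DELTA 30   /* freshness bound in seconds (illustrative) */

struct cache_entry {
    char   data[256];
    time_t fetched_at;          /* when this copy was obtained from the server */
};

static void fetch_from_server(struct cache_entry *e) {
    strcpy(e->data, "fresh value");     /* stand-in for a real RPC */
    e->fetched_at = time(NULL);
}

static const char *read_entry(struct cache_entry *e) {
    if (time(NULL) - e->fetched_at > DELTA)
        fetch_from_server(e);           /* stale beyond Delta: refresh */
    return e->data;
}

int main(void) {
    struct cache_entry e = { "", 0 };
    printf("%s\n", read_entry(&e));     /* the first read forces a fetch */
    return 0;
}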
File Systems Consistency Semantics
• What is involved in providing these semantics?
• UNIX semantics: easy to implement on a uni-processor
• Session semantics: session state at the server
• Delta consistency: timeouts, leases
• Meta-data consistency
– Some techniques we have seen
• Journaling, LFS, Meta-data journaling: ext3
• Synchronous writes
• NVRAM: expensive, unavailable
– Disk scheduler enforced ordering!
• File system passes sequencing restrictions to the disk scheduler
• Problem: Disk scheduler can not enforce an ordering among requests not yet visible to it
– Soft updates
• Dependency information is maintained for meta-data blocks in the write-back cache at a per-field and/or per-pointer granularity
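As a user-level illustration of the "synchronous writes" idea above (a sketch with made-up file names, not actual file-system code), an update sequence can force ordering by waiting for the first write to reach stable storage before issuing the dependent one.

/* Ordering two dependent updates with synchronous writes: the object must be
 * durable before the record that points to it is written.  Hypothetical names;
 * error checks omitted. */
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    int obj = open("object.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    write(obj, "payload", 7);
    fsync(obj);                        /* wait: payload is on stable storage   */
    close(obj);

    int idx = open("index.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);
    write(idx, "object.dat\n", 11);    /* only now publish a reference to it   */
    fsync(idx);
    close(idx);
    return 0;
}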
Network-attached Storage
• Introduction to important ideas and technologies
• Lots of slides, will cover some in class, post all on Angel
• Subsequent classes will cover some topics in depth
Direct Attached Storage
• Problems/shortcomings in enterprise/commercial settings
– Sharing of data difficult
– Programming and client access inconvenient
– Wastage of storage capacity
– More?
“Remote” Storage
• Idea: Separate storage from the clients and application servers and locate it on the other side of a scalable networking infrastructure
  – Variants on this idea that we will see soon
• Advantages
  – Reduction in wasted capacity by pooling devices and consolidating unused capacity formerly spread over many directly-attached storage devices
  – Reduced time to deploy new storage
  – Backup made more convenient
    • Application server involvement removed
  – Management simplified by centralizing storage under a consolidated manager interface
  – Availability improved (potentially)
    • Client software is designed to tolerate dynamic changes in network resources but not the changing of local storage configurations while the client is operating
    • All software and hardware is specifically developed and tested to run together
• Disadvantages
  – Complexity, more expertise needed
    • Implies more set-up and management cost
Network Attached Storage
• File interface exported to the rest of the network
Storage Area Network (SAN)
• Block interface exported to the rest of the network
SAN versus NAS (figure)
Source: Communications of the ACM, Vol. 43, No. 11, November 2000
Differences between NAS and SAN
• NAS
  – TCP/IP or UDP/IP protocols and Ethernet networks
  – High-level requests and responses for files
  – NAS devices translate file requests into operations on disk blocks
  – Cheaper
• SAN
  – Fibre Channel and SCSI
  – More scalable
  – Clients translate file accesses into operations on specific disk blocks
  – Data block level
  – Expensive
  – Separation of storage traffic from general network traffic
    • Beneficial for security and performance
NAS File Servers
• Pre-configured file servers
• Consist of one or more internal servers with pre-configured capacity
• Have a stripped-down OS; any component not associated with file services is discarded
• Connected via Ethernet to the LAN
• OS stripping makes it more efficient than a general-purpose OS
• Have plug-and-play functionality
Source: Storage Networks Explained: Basics and Application of Fibre Channel SAN, NAS iSCSI and InfiniBand
by Ulf Troppens,Rainer Erkens,Wolfgang Mueller
NAS Network Performance
• NAS and traditional network file systems use IP-based protocols over NIC devices.
• A consequence of this deployment is poor network performance.
• The main culprits often cited include:
  - Protocol processing in network stacks
  - Memory copying
  - Kernel overhead, including system calls and context switches
NAS Network Performance
Figure depicting sources of TCP/IP overhead
NAS Network Performance
Protocol Processing
• Data transmission involves the OS services for memory and process management, the TCP/IP protocol stack, and the network device and its device driver.
• The network per-packet costs include the overhead to execute the TCP/IP protocol code, allocate and release memory buffers, and handle device interrupts for packet arrival and transmit completion.
• The per-byte costs include the overheads to move data within the end-to-end system and to compute checksums to detect data corruption in the network.
NAS Network Performance
Memory Copy
Current implementations of data transmission require the same data to be copied at several stages.
NAS Network Performance
• An NFS client requesting data stored on a NAS server with an internal SCSI disk would involve:
  - Hard disk to RAM transfer using the SCSI, PCI and system buses
  - RAM to NIC transfer using the system and PCI buses
• For traditional NFS this would further involve a transfer from the application memory to the kernel buffer cache of the transmitting computer before forwarding to the network card.
Accelerating Performance
• Two starting points to accelerate network file system performance are:
  - The underlying communication protocol
    TCP/IP was designed to provide a reliable framework for data exchange over an unreliable network. The TCP/IP stack is complex and CPU-intensive.
    Example alternative: VIA/RDMA
  - The network file system
    Development of new network file systems that assume a reliable network connection.
    Network file systems could be modified to use thinner communication protocols.
    Example alternative: DAFS
Proposed Solutions
TCP/IP Offload Engines (TOEs)
• An increasing number of network adapters are able to compute the Internet checksum
• Some adapters can now perform TCP or UDP protocol processing
Copy Avoidance
• Several buffer management schemes have been proposed to either reduce or eliminate data copying
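One widely used copy-avoidance mechanism on the send path is sendfile(), which moves file data from the page cache to a socket without staging it in a user-space buffer. The sketch below is Linux-specific and illustrative; a socketpair stands in for a real network connection and error handling is omitted.

/* Copy avoidance with sendfile(2): no read()/write() loop through user space. */
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int main(void) {
    int sv[2];
    socketpair(AF_UNIX, SOCK_STREAM, 0, sv);   /* stand-in for a TCP connection */

    int fd = open("/etc/hostname", O_RDONLY);  /* any small regular file */
    struct stat st;
    fstat(fd, &st);

    off_t offset = 0;
    ssize_t sent = sendfile(sv[0], fd, &offset, st.st_size);  /* kernel-side transfer */
    printf("sent %zd bytes with no user-level copy\n", sent);

    close(fd);
    close(sv[0]);
    close(sv[1]);
    return 0;
}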
Proposed Solutions
Fibre Channel
• Fibre Channel reduces communication overhead by offloading transport processing to the NIC instead of using the host processor
• Zero copy is facilitated by direct communication between host memory and the NIC device
Direct-Access Transport
• Requires NIC support for remote DMA
• User-level networking is made possible by a user-mode process interacting directly with the NIC to send or receive messages with minimal kernel intervention
• Reliable message transport network
Proposed Solutions
NIC Support Mechanism
• The NIC device exposes an array of connection descriptors in the system’s physical address space
• At connection setup time, the network device driver maps a free descriptor into the user virtual address space
• This grants the user process direct and safe access to the NIC’s buffers and registers
• This facilitates user-level networking and copy avoidance
Proposed Solutions
User-Level File System
• Kernel policies for file system caching and prefetching do not favor some applications
• Migrating OS functions into user-level libraries allows user applications more control and specialization
• Clients would run in user mode as libraries linked directly with applications; this reduces the overhead due to system calls
• Clients may evolve independently of the operating system
• Clients could also run on any OS, with no special kernel support except the NIC device driver
Virtual Interface And RDMA
• The Virtual Interface Architecture (VIA) facilitates fast and efficient data exchange between applications running on different machines
• VIA reduces complexity by allowing applications (VI consumers) to communicate directly with the network card (VI NIC) via common memory areas, bypassing the operating system
• The VI provider is the NIC and its device driver
• RDMA is a communication model supported on VIA which allows applications to read and write memory areas of processes running on different computers
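The VIA model described above survives in today's InfiniBand "verbs" interface. The fragment below is a hedged sketch using the libibverbs API rather than the original VIA API; it assumes a protection domain pd, an already-connected queue pair qp, and a peer-supplied remote_addr/remote_rkey obtained during (omitted) connection setup.

/* RDMA write sketch with libibverbs: register a local buffer, then ask the NIC
 * to place its contents directly into the peer's registered memory, bypassing
 * the remote CPU.  Connection setup and completion polling are omitted. */
#include <infiniband/verbs.h>
#include <stdint.h>

int post_rdma_write(struct ibv_pd *pd, struct ibv_qp *qp,
                    uint64_t remote_addr, uint32_t remote_rkey)
{
    static char buf[4096] = "payload";

    /* Pin the buffer and hand its virtual-to-physical translation to the NIC. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, sizeof buf, IBV_ACCESS_LOCAL_WRITE);
    if (!mr)
        return -1;

    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = sizeof buf,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .opcode     = IBV_WR_RDMA_WRITE,
        .sg_list    = &sge,
        .num_sge    = 1,
        .send_flags = IBV_SEND_SIGNALED,
    };
    wr.wr.rdma.remote_addr = remote_addr;   /* where to write on the peer   */
    wr.wr.rdma.rkey        = remote_rkey;   /* peer's registration key      */

    struct ibv_send_wr *bad;
    return ibv_post_send(qp, &wr, &bad);    /* completion arrives on the CQ */
}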
VI Architecture and RDMA
Source: Storage Networks Explained: Basics and Application of Fibre Channel SAN, NAS iSCSI and InfiniBand
by Ulf Troppens,Rainer Erkens,Wolfgang Mueller
Remote DMA (RDMA): VIA Model
(Figure: VIA/RDMA transfer between two hosts with Myrinet NICs (LANai) — a send doorbell and send descriptor on the sending side, data packets staged in NIC memory, and a receive doorbell, receive descriptor and receive buffer on the receiving side; numbered steps 1–10 trace the transfer from the sender’s user address space to the receiver’s user address space.)
InfiniBand
• “Infinite Bandwidth”
• A switch-based I/O interconnect architecture
• Low pin count, serial architecture
• The InfiniBand Architecture (IBA) defines a System Area Network (SAN)
  – An IBA SAN is a communications and management infrastructure for I/O and IPC
• IBA defines a switched communications fabric
  – High bandwidth and low latency
• Backed by top companies in the industry: Compaq, Dell, Hewlett-Packard, IBM, Intel, Microsoft and Sun
Limits of the PCI Bus
• Peripheral Component Interconnect (PCI)
  – Introduced in 1992
  – Has become the standard bus architecture for servers
  – PCI bus
    • 32-bit/33 MHz -> 64-bit/66 MHz
  – PCI-X
    • The latest versions run 64 bits wide at PCI-X 66, PCI-X 133, PCI-X 266 and PCI-X 533 [4.3 GB/s]
  – Other PCI concerns include
    • Bus sharing
    • Bus speed
    • Scalability
    • Fault tolerance
PCI Express
• High-speed point-to-point architecture that is essentially a serialized, packetized version of PCI
• General-purpose serial I/O bus for chip-to-chip communication, USB 2.0 / IEEE 1394b interconnects, and high-end graphics
  – Viable AGP replacement
• Bandwidth: 4 Gigabit/second full duplex per lane
  – Up to 32 separate lanes
  – 128 Gigabit/second
• Software-compatible with the PCI device driver model
• Expected to coexist with, and not displace, technologies like PCI-X for the foreseeable future
Benefits of IBA
• Bandwidths
• An open and industry-inclusive standard
• Improved connection flexibility and scalability
• Improved reliability
• Offloads communications processing from the OS and CPU
• Wide access to a variety of storage systems
• Simultaneous device communication
• Built-in security, quality of service
• Support for Internet Protocol version 6 (IPv6)
• Fewer and better-managed system interrupts
• Support for up to 64,000 addressable devices
• Support for copper cable and optical fiber
InfiniBand Components
• Host Channel Adapter (HCA)
– An interface to a host and supports all software Verbs
• Target Channel Adapter (TCA)
– Provides the connection to an I/O device from InfiniBand
• Switch
– Fundamental component of an IB fabric
– Allows many HCAs and TCAs to connect to it and handles
network traffic.
• Router
– Forwards data packets from a local network to other external
subnets
• Subnet Manager
– An application responsible for configuring the local subnet and
ensuring its continued operation
An IBA SAN
InfiniBand Layers
• Physical Layer
Link | Pin Count | Signaling Rate | Data Rate | Full-Duplex Data Rate
1x   | 4         | 2.5 Gb/s       | 2 Gb/s    | 4 Gb/s (500 MB/s)
4x   | 16        | 10 Gb/s        | 8 Gb/s    | 16 Gb/s (2 GB/s)
12x  | 48        | 30 Gb/s        | 24 Gb/s   | 48 Gb/s (6 GB/s)
InfiniBand Layers
• Link Layer
  – Central to the IBA; includes packet layout, point-to-point link instructions, switching within a local subnet, and data integrity
  – Packets
    • Data and management packets
  – Switching
    • Data forwarding within a local subnet
  – QoS
    • Supported by virtual lanes
    • A virtual lane is a unique logical communication link that shares a single physical link
    • Up to 15 virtual lanes per physical link (VL0 – VL15)
    • Each packet is assigned a priority
  – Credit-Based Flow Control
    • Used to manage data flow between two point-to-point links
  – Integrity check using CRC
InfiniBand Layers
• Networking Layer
  – Responsible for routing packets from one subnet to another
  – The global route header (GRH) located within a packet includes an IPv6 address for the source and destination of each packet
• Transport Layer
  – Handles the order of packet delivery as well as partitioning, multiplexing, and the transport services that determine reliable connections
Infiniband Architecture
• The Queue Pair Abstraction
  – Two queues of communication metadata (send & recv)
  – Registered buffers from which to send / into which to receive
“Architectural Interactions of I/O Networks and Inter-networks”, Philip Buonadonna, Intel Research & University of California, Berkeley
Direct Access File System
• A new network file system derived from NFS version 4
• Tailored to use remote DMA (RDMA), which requires the Virtual Interface (VI) framework
• Introduced to combine the low overhead of SAN products with the generality of NAS file servers
• Communication between a DAFS server and client is done through RDMA
• Client-side caching of locks for easier subsequent access to the same file
• Clients can be implemented as a shared library in user space or in the kernel
DAFS Architecture
Source: Storage Networks Explained: Basics and Application of Fibre Channel SAN, NAS iSCSI and InfiniBand
by Ulf Troppens,Rainer Erkens,Wolfgang Mueller
Direct Access File System
DAFS Protocol
• Defined as a set of send and request formats and their semantics
• Defines recommended procedural APIs to access DAFS services from a client program
• Assumes a reliable network transport and offers server-directed command flow
• Each operation is a separate request, but request chaining is also supported
• Defines features for session recovery and locking primitives
Direct Access File System
Direct Access Data Transfer
• Supports direct variants of data transfer operations such as read, write, setattr, etc.
• Direct transfer operations move data to and from client-provided memory using RDMA read and write operations
• The client registers each memory region with the local kernel before requesting direct I/O on the region
• The API defines register and unregister primitives for memory region management; register returns a region descriptor
• Registration issues a system call to pin buffer regions in physical memory, then loads page translations for the region into a lookup table on the NIC
Direct Access File System
RDMA Operations
• RDMA operations for direct I/O are initiated by the server
• A client write request to the server includes a region token for the buffer containing the data
• The server then issues an RDMA read to fetch the data from the client and responds with a write-request response after RDMA completion
Direct Access File System
Asynchronous I/O and Prefetching
• Supports a fully asynchronous API, which enables clients to pipeline I/O operations and overlap them with application processing
• Event notification mechanisms deliver asynchronous completions, and a client may create several completion groups
• DAFS can be implemented as a user library to be linked with applications or within the kernel
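DAFS defines its own asynchronous API; as a stand-in, the sketch below shows the same pipelining idea with POSIX AIO (aio_read): issue the I/O, overlap it with computation, then collect the completion. This is illustrative and is not the DAFS interface.

/* Overlapping I/O with computation using POSIX AIO (link with -lrt on older
 * systems).  Illustrative sketch; error checks omitted. */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    static char buf[4096];
    int fd = open("/etc/hostname", O_RDONLY);    /* any readable file */

    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof buf;
    cb.aio_offset = 0;

    aio_read(&cb);                    /* issue the I/O and return immediately */

    long sum = 0;
    for (long i = 0; i < 1000000; i++)         /* overlap: application work   */
        sum += i;

    while (aio_error(&cb) == EINPROGRESS)      /* poll (signals also possible) */
        usleep(1000);

    ssize_t n = aio_return(&cb);               /* collect the completion      */
    printf("read %zd bytes while computing sum = %ld\n", n, sum);
    close(fd);
    return 0;
}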
Direct Access File System
Figure depicting DAFS and NFS Client Architectures
Source: http://www.eecs.harvard.edu/~vino/fs-perf/dafs.html
Direct Access File System
Server Design and
Implementation
• The kernel server design is fashioned after an event-driven state transition diagram
• The main events triggering state transitions are: recv_done, send_done and bio_done
Figure 1. An event-driven DAFS server
Source: http://www.eecs.harvard.edu/~vino/fs-perf/dafs.html
Direct Access File System
Event Handlers
• Each network or disk event is associated with a handler routine
• recv_done – A client-initiated transfer is complete. This signal is asserted by the NIC and initiates the processing of an incoming RPC request
• send_done – A server-initiated transfer is complete. The handler for this signal releases all the locks involved in the RDMA operation and returns an RPC response
• bio_done – A block I/O request from disk is complete. This signal is raised by the disk controller and wakes up any thread that is blocking on a previous disk I/O
Direct Access File System
Server Design and Implementation
• The server performs disk I/O using the zero-copy buffer cache interface
• This interface facilitates the locking of pages and their mappings
• Buffers involved in RDMA need to be locked for the entire transfer duration
• Transfers are initiated using RPC handlers and processing is asynchronous
• The kernel buffer cache manager registers and de-registers buffer mappings with the NIC on the fly, as physical pages are returned to or removed from the buffers
Direct Access File System
Server Design and Implementation
• The server creates multiple kernel threads to facilitate I/O concurrency
• A single listener thread monitors for new transport connections; other worker threads handle data transfer
• Arriving messages generate a recv_done interrupt, which is processed by a single handler for the completion group
• The handler queues up incoming RPC requests and invokes a worker thread to start data processing
• A thread locks all the necessary file pages in the buffer cache, creates RDMA descriptors, and issues RDMA operations
• After RDMA completion, a send_done signal is sent, which initiates the clean-up and release of all resources associated with the completed operation
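A hedged sketch of the event-driven structure described above: one loop dequeues completion events (recv_done, send_done, bio_done) and dispatches to per-event handlers. The next_event() function and the handler bodies are placeholders, not the actual DAFS server code.

/* Event-driven server skeleton in the style described above.  next_event()
 * would block on the NIC and disk completion queues in a real server. */
#include <stdio.h>

enum event_type { RECV_DONE, SEND_DONE, BIO_DONE, SHUTDOWN };

struct event { enum event_type type; void *ctx; };

static struct event next_event(void) {          /* placeholder event source */
    static int n;
    static const enum event_type seq[] = { RECV_DONE, BIO_DONE, SEND_DONE, SHUTDOWN };
    return (struct event){ seq[n++ % 4], NULL };
}

static void handle_recv_done(struct event *e) { (void)e; puts("parse RPC, start disk I/O or RDMA"); }
static void handle_send_done(struct event *e) { (void)e; puts("release RDMA locks, send RPC response"); }
static void handle_bio_done(struct event *e)  { (void)e; puts("disk block ready, wake waiting request"); }

int main(void) {
    for (;;) {
        struct event e = next_event();
        switch (e.type) {               /* state transition on each completion */
        case RECV_DONE: handle_recv_done(&e); break;
        case SEND_DONE: handle_send_done(&e); break;
        case BIO_DONE:  handle_bio_done(&e);  break;
        case SHUTDOWN:  return 0;
        }
    }
}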
Communication Alternatives
Source: Storage Networks Explained: Basics and Application of Fibre Channel SAN, NAS iSCSI and InfiniBand
by Ulf Troppens,Rainer Erkens,Wolfgang Mueller
Experimental Setup
Source: http://www.eecs.harvard.edu/~vino/fs-perf/dafs.html
Experimental Setup
System Configuration
• Pentium III 800 MHz clients and servers
• Server cache 1 GB, 133 MHz memory bus
• 9 GB disks, 10K RPM Seagate Cheetah, 64-bit/33 MHz PCI bus
• VI over Giganet cLAN 1000 adapter (DAFS)
• UDP/IP over Gigabit Ethernet, Alteon Tigon-II adapters (NFS)
Experimental Setup
• NFS block I/O transfer size is set at mount time
• Packets are sent as fragmented UDP packets
• Interrupt coalescing is set to high on the Tigon-II
• Checksum offloading is enabled on the Tigon-II
• NFS-nocopy required modifying the Tigon-II firmware, IP fragmentation code, file cache code, VM system and Tigon-II driver, to facilitate header splitting and page remapping
Experimental Results
The table below shows the results for one-byte round-trip latency and bandwidth. The higher latency with the Tigon-II was due to the datapath crossing the kernel UDP/IP stack.
Experimental Results
Bandwidth and Overhead
• Server pre-warmed with a 768 MB dataset
• Designed to stress network data transfer
• Hence client caching is not considered
Sequential Configuration
• DAFS client utilized the asynchronous I/O API
• NFS had read-ahead enabled
Random Configuration
• NFS tuned for best-case performance at each request size by selecting a matching NFS transfer size
Experimental Results (figures)
TPIE Merge
• The sequential record merge program combines n sorted input files of x y-bytes each into a single sorted output file
• Depicts raw sequential I/O performance with varying amounts of processing
• Performance is limited by the client CPU
Experimental Results (figure)
Experimental Results
PostMark
• A synthetic benchmark used to measure file system performance over workloads composed of many short-lived, relatively small files
• Creates a pool of files with random sizes, followed by a sequence of file operations
Experimental Results
Berkeley DB
• Synthetic workload composed of read-only transactions, each processing one small record chosen at random from a B-tree
Disk Storage Interfaces
• Parallel ATA (IDE, E-IDE)
• Serial ATA (SATA)
• Small Computer System Interface (SCSI)
• Serial Attached SCSI (SAS)
• Fibre Channel (FC)
"It's More Than the Interface" by Gordy Lutz of Seagate, August 2002.
Parallel ATA
• 16-bit bus
• Two bytes per bus transaction
• 40-pin connector
• Master/slave shared bus
• Bandwidth
    25 MHz strobe
    x 2 for double data rate clocking
    x 16 bits per edge
    / 8 bits per byte
    = 100 MBytes/sec
Serial ATA (SATA)
• 7-pin connector
• Point-to-point connections for dedicated bandwidth
• Bit-by-bit transmission
  – One signal path for data transmission
  – The other signal path for acknowledgement
• Bandwidth
    1500 MHz embedded clock
    x 1 bit per clock
    x 80% for 8b/10b encoding
    / 8 bits per byte
    = 150 MBytes/sec
• 2002 -> 150 MB/sec
• 2004 -> 300 MB/sec
• 2007 -> 600 MB/sec
8b10b encoding
• IBM patent
• Used in SATA, SAS, FC and InfiniBand
• Converts 8 data bits into 10-bit codes
• Provides better synchronization than Manchester encoding
Small Computer Systems Interface
(SCSI)
• SCSI targets the high-performance storage market
• SCSI-1 proposed in 1986
• Parallel interface
• Maximum cabling distance is 12 meters
• Terminators required
• Bus width is 8 bits (narrow)
• 16 devices per bus
• A device with a higher priority gets the bus
SCSI (cont’d)
• Peer-to-peer connection (channel)
• 50/68 pins
• Hot repair not provided
• Multiple buses needed beyond 16 devices
• Low bandwidth
• Distance limitation
SCSI Roadmap
• Wide SCSI (16-bit bus)
• Fast SCSI (double data rate)
Serial Attached SCSI (SAS)
• ANSI standard in 2003
• Interoperability with SATA
• Full-duplex
• Dual-port
• 128 devices
• 10 meters
Dual port
• ATA, SCSI and SATA support a single port
• Controller is a single point of failure
• SAS and FC support dual port
SAS Roadmap
http://www.scsita.org/aboutscsi/sas/SAS_roadmap2004.html
Fibre Channel (FC)
• Developed as a backbone technology for LANs
• The name is a misnomer
  – Runs on copper also
  – 4-wire cable or fiber optic
• 10 km or less per link
• 126 devices per loop
• No terminators
• Installed base of Fibre Channel devices*
  – $2.45 billion in FC HBAs in 2005
  – $5.4 billion in FC switches in 2005
*Source: Gartner, Dec 13, 2001
FC (cont’d)
• Advantages
  – High bandwidth
  – Secure
  – Zero-copy send and receive
  – Low host CPU utilization
  – FCP (Fibre Channel Protocol)
• Disadvantages
  – Not a wide-area network
  – Separate physical network infrastructure
  – Expensive
  – Different management mechanisms
  – Interoperability issues across different vendors
Fiber Channel Topologies
Ulf Troppens, Rainer Erkens and Wolfgang Muller, Storage Networks Explained
Fiber Channel Ports
• N-Port: Node port
• F-Port: Fabric port
• L-Port: Loop port
– Only connect to AL
• E-Port: Expansion port
– Connect two switches
• G-Port: Generic port
• B-Port: Bridge port
– Bridge to other networks (IP, ATM, etc)
• NL-Port: Node_Loop_port
– Can connect both in fabric and in AL
• FL-Port: Fabric_Loop_port
– Makes a fabric to connect to a loop
Ulf Troppens, Rainer Erkens and Wolfgang Muller, Storage Networks Explained
Arbitrated Loop in FC
Ulf Troppens, Rainer Erkens and Wolfgang Muller, Storage Networks Explained
Arbitrated Loop in FC
Ulf Troppens, Rainer Erkens and Wolfgang Muller, Storage Networks Explained
Routing mechanisms in switch
• Store-forward routing
• Cut-through routing
William James Dally and Brian Towles, Principles and practices of Interconnection networks, chapter 13
Fibre Channel Hub and
Switch
• Switch
– Thousands of
connections
– Bandwidth per device is
nearly constant
– Aggregate bandwidth
increases with increased
connectivity
– Deterministic latency
• Hub
– 126 Devices
– Bandwidth per device
diminished with increased
connectivity
– Aggregate bandwidth is
constant with increased
connectivity
– Latency increases as the
number of devices
increases
Fibre Channel Structure
Fibre Channel Bandwidth
• Clock rate is 1.0625GHz
• 1.0625[Gbps] x 2048[payload]/2168[payload+overhead] x
0.8[8b10b]/8[bits] = 100.369 MB/s
Cable types in FC
FC Roadmap
Product Naming | Throughput (MB/s) | T11 Spec Completed (Year) | Market Availability (Year)
1GFC           | 200               | 1996                      | 1997
2GFC           | 400               | 2000                      | 2001
4GFC           | 800               | 2003                      | 2005
8GFC           | 1,600             | 2006                      | 2008
16GFC          | 3,200             | 2009                      | 2011
32GFC          | 6,400             | 2012                      | Market Demand
64GFC          | 12,800            | 2016                      | Market Demand
128GFC         | 25,600            | 2020                      | Market Demand
http://www.fibrechannel.org/OVERVIEW/Roadmap.html
Interface Comparison
Market Segments
It’s more than interface, Seagate, 2003
Interface Trends - Previous
It’s more than interface, Seagate, 2003
Interface Trends – Today and
Tomorrow
It’s more than interface, Seagate, 2003
IP Storage
IP Storage (cont’d)
• TCP/IP is used as a storage interconnect to transfer block level
data.
• IETF working group, the IP Storage (IPS)
• iSCSI, iFCP, and FCIP protocols
• Cheaper
• Provides one technology for a client to connect to servers and
storage devices
• Increases operating distances
• Improves availability of storage systems
• Can utilize network management tools
It’s more than interface, Seagate, 2003
iSCSI (Internet SCSI)
• iSCSI is a Transport for SCSI Commands
– iSCSI is an end-to-end protocol
– iSCSI can be implemented on desktops, laptops and servers
– iSCSI can be implemented with current TCP/IP stacks
– iSCSI can be implemented completely in an HBA
• Overcomes the distance limitation
• Cost-effective
Protocol Stack - iSCSI
Packet and Bandwidth - iSCSI
• iSCSI overhead: 78 bytes
  – 14 (Ethernet) + 20 (IP) + 20 (TCP) + 4 (CRC) + 20 (interframe gap)
  – The iSCSI header adds 48 bytes per SCSI command
• 1.25[Gbps] x 1460[payload]/1538[payload+overhead] x
0.8[8b10b]/8[bits] = 113.16 MB/s
• Bi-Directional Payload Bandwidth: 220.31 MB/s
Problems with iSCSI
• Limited performance because
  – Protocol overhead in TCP/IP
  – Interrupts are generated for each network packet
  – Extra copies when sending and receiving data
iSCSI Adapter
Implementations
• Software approach
– Show the best performance
– This approach is very competitive due to fast modern CPUs
• Hardware Approaches
– Relatively slow CPU compared to host CPU
– Development speed is also slower than that in host CPU
– Performance improvement is limited without superior advances in
embedded CPU
– Can show performance improvement in highly-loaded systems
Prasenjit Sarkar, Sandeep Utamchandani, Kaladhar Voruganti, Storage over IP: When Does Hardware Support help?, FAST 2003
iFCP (Internet Fiber Channel
Protocol)
• iFCP is a gateway-to-gateway protocol for the implementation of a Fibre Channel fabric over a TCP/IP transport
• Allows users to interconnect FC devices over a TCP/IP network at any distance
• Traffic between Fibre Channel devices is routed and switched by the TCP/IP network
• iFCP maps each FC address to an IP address and each FC session to a TCP session
• FC messaging and routing services are terminated at the gateways so that the fabrics are not merged
• Data backup and replication
• mFCP uses UDP/IP
How does iFCP work?
Types of iFCP
communication
FCIP (Fiber Channel over IP)
• TCP/IP-based tunneling protocol to encapsulate fibre
channel packets
• Allow users to interconnect FC devices over a TCP/IP
network at any distance (same as iFCP)
• Merges connected SANs into a single FC fabric
• Data backup and replication
• Gateways
  – Used to interconnect Fibre Channel SANs to the IP network
  – Set up connections between SANs or between Fibre Channel devices and SANs
FCIP (Fiber Channel over IP)
Comparison between FCIP and
iFCP
IP Storage Protocols: iSCSI, iFCP
and FCIP
RAS
• Reliability
– The basic InfiniBand link connection is comprised of only four
signal wires
– IBA accommodates multiple ports for each I/O unit
– IBA provides multiple CRCs
• Availability
– An IBA fabric is inherently redundant, with multiple paths to sources assuring data delivery
– IBA allows the network to heal itself if a link fails or is reporting
errors
– IBA has a many-to-many server-to-I/O relationship
• Serviceability
– Hot-pluggable
Feature               | InfiniBand       | Fibre Channel | 1 Gb & 10 Gb Ethernet | PCI-X
Bandwidth             | 2.5, 10, 30 Gb/s | 1, 2.1 Gb/s   | 1, 10 Gb/s            | 8.51 Gb/s
Bandwidth Full-Duplex | 5, 20, 60 Gb/s   | 2.1, 4.2 Gb/s | 2, 20 Gb/s            | N/A
Pin Count             | 4, 16, 48        | 4             | 4/8                   | 90
Media                 | Copper/Fiber     | Copper/Fiber  | Copper/Fiber          | PCB
Max Length Copper     | 250 / 125 m      | 13 m          | 100 m                 | inches
Max Length Fiber      | 10 km            | km            | km                    | N/A
Partitioning          | X                | X             | X                     | N/A
Scalable Link Width   | X                | N/A           | N/A                   | N/A
Max Payload           | 4 KB             | 2 KB          | 1.5 KB                | No packets
A classification of storage systems
(warning - not comprehensive)
• Isolated
  – E.g., a laptop/PC with a local file system
  – We know how these work
  – File systems were first developed for centralized computer systems as an OS facility providing a convenient programming interface to (disk) storage
  – Subsequently acquired features like access control and file locking that made them useful for sharing of data and programs
• Distributed
  – Why?
    • Sharing, scalability, mobility, fault tolerance, …
  – “Basic” distributed file system
    • Gives the illusion of local storage when the data is spread across a network (usually a LAN) to clients running on multiple computers
    • Supports the sharing of information in the form of files, and of hardware resources in the form of persistent storage, throughout an intranet
  – Enhancements in various domains for “real-time” performance (multimedia), high failure resistance, high scalability (P2P), security, longevity (archival systems), mobility/disconnections, …
  – Remote objects to support distributed object-oriented programming
Storage systems and their properties
                          | Sharing | Persistence | Caching/replication | Consistency maintenance | Example
Main memory               | No      | No          | No                  | Strict one-copy         | RAM
File system               | No      | Yes         | No                  | Strict one-copy         | UNIX FS
Distributed file system   | Yes     | Yes         | Yes                 | Yes (approx.)           | NFS
Web                       | Yes     | Yes         | Yes                 | Very approx. / No       | Web server
Distributed shared memory | Yes     | No          | Yes                 | Yes (approx.)           | Ivy
Remote objects (RMI/ORB)  | Yes     | No          | No                  | Strict one-copy         | CORBA
Persistent object store   | Yes     | Yes         | No                  | Strict one-copy         | CORBA persistent state service
P2P storage system        | Yes     | Yes         | Yes                 | Very approx.            | OceanStore