Transcript

HPTS2005
Harnessing Petabytes of
Online Storage Effectively
2005/09/27
Jun Nitta ([email protected])
Hitachi, Ltd.
Copyright © Hitachi, Ltd. 2005. All rights reserved.
1. Introduction: where are we today?
2. Configuring mass online storage
3. Defining distribution of intelligence
4. Miscellaneous topics
5. Summary: beyond 10 petabytes
1. Introduction: where are we today?
1-1
Looking into the latest specifications of HDDs…

model                         disk size  capacity/disks  rotational speed (seek/latency)  interface (sustained data rate)  data buffer
for high performance OLTP     3.5"       147GB/5         15,000rpm (3.7ms/2.0ms)          4Gb/s FC-AL (n/a-93.3MB/s)       16MB
for most other applications   3.5"       300GB/5         10,025rpm (4.7ms/3.0ms)          2Gb/s FC-AL (46.8-89.3MB/s)      16MB
for large volume archives     3.5"       500GB/5         7,200rpm (8.5ms/4.2ms)           3Gb/s SATA-II (31-64.8MB/s)      16MB
for small form factor         2.5"       100GB/2         7,200rpm (10ms/4.2ms)            1.5Gb/s SATA (n/a-n/a)           8MB
for portable audio player?    1.0"       8GB/1           3,600rpm (12ms/8.3ms)            CE-ATA (5.1-10.0MB/s)            128KB
* based on HGST catalogues as of Sep. 2005
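A quick back-of-the-envelope consequence of these figures, sketched in Python (assuming the catalogue sustained rates above and no other bottleneck): reading a whole drive end to end already takes hours for the large SATA models, which is what stretches rebuild and backup windows as drives keep growing.

```python
# Time to read one drive end to end at its catalogue sustained data rate
# (assumes the drive streams at that rate with no other bottleneck).
def full_read_hours(capacity_gb: float, mb_per_s: float) -> float:
    return capacity_gb * 1000 / mb_per_s / 3600

print(f"500GB SATA-II at 64.8MB/s: {full_read_hours(500, 64.8):.1f} h")   # ~2.1 h
print(f"500GB SATA-II at 31MB/s:   {full_read_hours(500, 31):.1f} h")     # ~4.5 h
print(f"147GB FC at 93.3MB/s:      {full_read_hours(147, 93.3):.1f} h")   # ~0.4 h
```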
1-2
… and storage subsystems (RAID controllers)
roughly: 1 rack = 200 disks (3.5”) = 100TB (500GB drive)
              enterprise            enterprise        midrange         workgroup
HDDs          1152 (5 cabinets)     240               225              105
raw capacity  332TB (FC)            72TB (FC)         88.5TB (SATA)    40.5TB (SATA)
LUNs          16,384                16,384            2,048            512
FC ports      192                   48                4                4
cache         128GB                 64GB              8GB              4GB
* based on HDS catalogues as of Sep. 2005
1-3
Sheer number of HDDs matters practically

HDDs      capacity   practicality
O(10^0)   500GB -    a piece of cake (even possible personally)
O(10^1)   5TB -
O(10^2)   50TB -     today's enterprise mainstream
O(10^3)   500TB -
O(10^4)   5PB -      challenging but still feasible  <- practical limit for most datacenters
O(10^5)   50PB -     getting impractical
O(10^6)   500PB -    almost prohibitive
major inhibitors besides $$$: none at the low end; storage management, power & cooling,
and disk failure* as the count grows
* MTBF of a high-end FC HDD is 10^6h by catalogue spec. (=114yrs, actual number may vary by an order of magnitude)
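To see why disk failure climbs the inhibitor list as the fleet grows, a minimal sketch in Python (assuming independent failures at a constant rate of 1/MTBF and the catalogue figure from the footnote; field failure rates are typically worse by some margin):

```python
# Rough expected-failure estimate for a fleet of HDDs, assuming independent
# failures at a constant rate of 1/MTBF (catalogue MTBF = 10^6 hours).
MTBF_HOURS = 1_000_000          # catalogue spec for a high-end FC HDD
HOURS_PER_YEAR = 24 * 365

for drives in (100, 1_000, 10_000, 100_000):
    failures_per_year = drives * HOURS_PER_YEAR / MTBF_HOURS
    print(f"{drives:>7} drives -> ~{failures_per_year:6.1f} expected failures/year "
          f"(one every ~{MTBF_HOURS / drives / 24:.1f} days)")
```

At O(10^4) drives this already works out to roughly one failed drive every few days, so rebuilds become routine background work rather than exceptional events.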
2. Configuring mass online storage: array of nodes or disks?
2-1
Two alternatives to configure online storage
array-of-nodes (stack of self-contained boxes, each node combining CPU, memory, and HDDs,
connected by a network)
- Very cost effective for some kinds of applications
  - secondary data management (especially search)
  - can utilize the cheapest components

array-of-disks (storage separate from servers: a diskless server farm connected to a
storage farm over a storage network, FC or IP)
- Versatile for various mixes of applications
  - OLTP, ERP, DWH, email, …
  - cost is steadily going down
2-2
Rationale for array-of-disks model
 It is reasonable to separate mechanical components
  - HDD is the only mechanical component besides a cooling fan
  - It makes it much easier to implement hot-swap mechanisms
- It is reasonable to have external storage subsystems
  - Disks can be shared among clusters of servers
  - Spare disks can be shared within a storage subsystem
(figure: HDDs 1-5 form RAID-5 (4D+1P) group 1, HDDs 6-10 form RAID-5 (4D+1P) group 2,
 and HDD 11 is a hot-spare disk shared by both groups; sketched in code below)
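As an illustration of the 4D+1P grouping above, a minimal RAID-5 sketch in Python (a toy model, not any controller's actual implementation): parity is the XOR of the four data blocks, its position rotates across the five member disks, and a failed member can be rebuilt, for example onto the shared hot spare, by XORing the survivors.

```python
# Minimal RAID-5 (4D+1P) striping sketch: four data blocks plus one XOR parity
# block per stripe, with the parity position rotating across the five disks.
from functools import reduce

DISKS = 5          # 4 data + 1 parity per stripe

def xor_blocks(blocks):
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def make_stripe(stripe_no: int, data_blocks: list) -> list:
    """Place 4 data blocks and their XOR parity onto 5 disks."""
    assert len(data_blocks) == DISKS - 1
    parity = xor_blocks(data_blocks)
    parity_disk = (DISKS - 1 - stripe_no) % DISKS   # rotate parity location
    stripe, it = [], iter(data_blocks)
    for disk in range(DISKS):
        stripe.append(parity if disk == parity_disk else next(it))
    return stripe

def rebuild(stripe: list, failed_disk: int) -> bytes:
    # Rebuild a failed member (e.g. onto a shared hot spare): XOR of the survivors.
    return xor_blocks([b for i, b in enumerate(stripe) if i != failed_disk])

stripe = make_stripe(0, [b"AAAA", b"BBBB", b"CCCC", b"DDDD"])
assert rebuild(stripe, failed_disk=2) == stripe[2]
```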
2-3
Additional discussion for array-of-disks model
 It makes data management easier*
  - Various data protection techniques can be employed, including
    third-party backup and D2D replication
    (figure: an application server and a backup server attached to a RAID subsystem,
     with a tape library behind the backup server)
  - For the array-of-nodes configuration, replication between nodes is almost the only
    viable solution for data protection (conventional backup is difficult to employ
    effectively)
* Actually backup is one of the most compelling reasons to consolidate scattered
  storage into an external RAID box
2-4
But does this dichotomy have a meaning?
- Nonetheless we need a storage "controller" for array-of-disks
  - A "controller" is just another name for a special-purpose server whose restricted
    operating environment some users prefer
  - The two configurations differ essentially in their CPU-to-HDD ratio, which is
    determined by the intelligence the storage farm requires
Which is most promising?

basic building block                                 petabytes configuration
special-purpose controller with a lot of disks       O(10^0-10^2) of clustered subsystems
general-purpose server with a couple of disks        O(10^3-10^4) of clustered nodes
even a HDD has CPU and memory (device controller)    O(10^3-10^4) of clustered disks
3. Defining distribution of intelligence: protocol and interface
3-1
Distribution of intelligence among farms
3 reasons some functions are better placed at storage side
- It is naturally implemented using CPU and memory near HDDs
- It requires operations with durable state
- It lets multiple servers share data objects
 3 reasons some functions are better placed at server side
- It is better implemented using CPU and memory near applications
- It requires more powerful and economical CPU / memory
- It handles multiple controllers
(figure: server-side intelligence lives in the server farm and storage-side intelligence
 in the storage farm, connected by a storage network, FC or IP)
3-2
Alternative way to place intelligence
Some intelligence could be placed on the network
- But a closer look reveals that most of those “intelligent network
components” are not genuine network core components
- Rather they are placed on the boundary between network and server/storage, which is
  not a clear-cut edge but a blurred region
(figure: network-edge intelligence sits on the blurred boundary between the server farm
 and the network core, and again between the network core and the storage farm; is such
 a component part of the network or of a farm?)
3-3
Placement of functions: an example
 Here is an example of intelligence distribution scheme
assuming array-of-disks configuration
storage side intelligence:
  - basic RAID control / LUN management
  - remote filesystem
  - local replication including snapshots (copy-on-write; sketched in code after this list)
  - volume migration transparent to servers
server side intelligence:
  - local filesystem
  - volume migration among multiple controllers
  - multi-path management (load balancing & fail over)
  - content search / indexing
  - block aggregation (a.k.a. logical volume management)
intelligence on both sides:
  - remote replication
  - backup
  - data encryption
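To make the copy-on-write item above concrete, a minimal snapshot sketch in Python (a toy in-memory model, not a controller's actual implementation): the snapshot shares all blocks with the primary volume, and a block is preserved only when the primary first overwrites it.

```python
# Toy copy-on-write snapshot: the snapshot initially shares every block with
# the primary volume; a block is copied aside only the first time the primary
# overwrites it after the snapshot was taken.
class Volume:
    def __init__(self, nblocks: int):
        self.blocks = [b"\0"] * nblocks
        self.snapshot = None            # dict: block no. -> preserved old data

    def take_snapshot(self):
        self.snapshot = {}              # nothing copied yet, everything shared

    def write(self, blkno: int, data: bytes):
        if self.snapshot is not None and blkno not in self.snapshot:
            self.snapshot[blkno] = self.blocks[blkno]   # copy on first write
        self.blocks[blkno] = data

    def read_snapshot(self, blkno: int) -> bytes:
        # preserved copy if the block changed since the snapshot, else the shared block
        return self.snapshot.get(blkno, self.blocks[blkno])

vol = Volume(8)
vol.write(3, b"old")
vol.take_snapshot()
vol.write(3, b"new")                    # triggers the copy of the old block
assert vol.read_snapshot(3) == b"old" and vol.blocks[3] == b"new"
```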
3-4
Which interface & protocol should we adopt?
 There are 3 well-established I/O interfaces: block, file, SQL
- None of them is optimal for today's server/storage farm environment
- Though file may be the most promising for its balanced features
- But I/O interfaces are stubborn to change (very conservative)
- Thus multi-interface/protocol support is a practical solution
block: protocol SCSI-3 (transport: FC or IP)
  - strength: low latency; strong standard protocol
  - weakness: layers away from the application; not network-friendly
file: protocol NFS/CIFS-SMB (transport: TCP/IP)
  - strength: broad application; strong standard protocol
  - weakness: performance and scalability (especially for DBMS)
SQL: proprietary protocol (transport: mostly TCP/IP)
  - strength: high level enough to encapsulate physical properties
  - weakness: limited application; no standard protocol
4. Miscellaneous topics for managing petabytes of online storage
4-1
Virtualization: simply too many mappings
(figure: an AP-recognizable volume is built by server-level virtualization (HBA/device
 driver, OS/LVM, DBMS) doing RAID/block aggregation over the LUs it recognizes; those LUs
 may be exported by switch-level virtualization doing its own RAID/block aggregation over
 LUs from the controllers; each controller in turn does controller-level virtualization,
 i.e. RAID/block aggregation over its HDDs; a toy mapping chain is sketched below)
- "Virtualization" itself is a powerful technology to hide complexity if used properly
  - It can even expand and shrink dynamically
  - But the current situation is too confusing
- Operating systems and DBMSs should be aware that a storage volume is a logical network
  resource
  - There may be more than 100,000 volumes on the network (most OSes can recognize only
    up to about 1,000 volumes)
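A toy mapping chain in Python to make the layering above concrete (all names, sizes, and layouts here are invented for illustration): each level only translates an address into the namespace of the level below it, so one application block number passes through three mappings before it lands on a physical disk.

```python
# Toy model of stacked block-address mappings (server LVM -> switch virtualization
# -> controller LU -> RAID group -> HDD). Names and layout are illustrative only.
def lvm_map(lv_block: int):
    # server level: a logical volume concatenated from two switch-exported LUs of 1000 blocks
    return ("switch-LU0", lv_block) if lv_block < 1000 else ("switch-LU1", lv_block - 1000)

def switch_map(lu: str, block: int):
    # switch level: each exported LU is backed 1:1 by a controller LU
    return {"switch-LU0": "ctrl-LU7", "switch-LU1": "ctrl-LU9"}[lu], block

def controller_map(lu: str, block: int):
    # controller level: RAID-5 (4D+1P) striping over 5 HDDs, ignoring parity rotation
    stripe, pos = divmod(block, 4)
    return f"{lu}-HDD{pos}", stripe

addr = 1234                          # block number as the application sees it
lu, blk = lvm_map(addr)
lu, blk = switch_map(lu, blk)
hdd, blk = controller_map(lu, blk)
print(f"AP block {addr} -> {hdd}, physical block {blk}")
```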
4-2
Data protection: disk plays the protagonist
 You have to go to disks at least for the first step to make
backup workable for > 10TB of data
  - Eventually those data may go to tape (D2D2T)
(figure: typical backup scenario for a large amount of data, driven by a data protection
 manager through agents: 1) make the AP/DBMS on server1 quiescent, 2) take a consistent
 copy-on-write snapshot of the primary volume inside the controller, 3) resume the
 AP/DBMS, 4) mount the snapshot on server2, 5) back up to a VTL (MT emulation on another
 RAID controller) or replicate to disks; this sequence is sketched in code below)
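A minimal orchestration sketch of the numbered sequence above, in Python; every object and method here is a hypothetical placeholder for the corresponding vendor agent or CLI, not a real API.

```python
# Hypothetical orchestration of the D2D(2T) backup sequence above.
# None of these helpers are real APIs; they stand in for vendor agents/CLIs.
def backup_cycle(app, primary_volume, mount_host, vtl):
    app.quiesce()                                   # 1) make AP/DBMS quiescent
    try:
        snap = primary_volume.take_snapshot()       # 2) consistent copy-on-write snapshot
    finally:
        app.resume()                                # 3) resume production work quickly
    mount_point = mount_host.mount(snap)            # 4) mount the snapshot on the backup server
    try:
        vtl.backup(mount_point)                     # 5) stream to VTL (or replicate to disks)
    finally:
        mount_host.unmount(mount_point)
        snap.release()                              # free copy-on-write space when done
```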
4-3
Data migration: latent cost of online storage
 Since data always outlives its container, you should migrate
data from one subsystem to another several times
  - Non-disruptiveness to the upper layers is desirable, which requires some form of
    address mapping
  - A durable address mapping for storage is not well standardized at either the block
    or the file level (cf. URL -[DNS]-> IP address -> MAC address); a toy indirection of
    this kind is sketched below
(figure: the mapping that keeps the server-visible address invariant can live at the
 server level, the switch level, or the storage level; the higher the invariant, e.g. a
 server path (more flexible), the more scattered and long the data movement; the lower
 the invariant, e.g. a SCSI LUN (less flexible), the more localized and short the
 movement, with yet another mapping layer added at every hop from the AP/DBMS through the
 switches down to the old and new controllers and their HDDs)
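A minimal sketch of the indirection idea, the DNS-like step in the analogy above, in Python (the catalog and names are invented for illustration): as long as servers address a volume by a stable name, migration only has to copy the data and flip the mapping, without touching the upper layers.

```python
# Toy durable-name indirection for volumes, analogous to URL -> (DNS) -> IP address:
# servers keep using a stable volume name while migration updates where it points.
catalog = {"payroll-db-vol": ("old-controller", "LUN 12")}   # name -> current location

def resolve(volume_name: str):
    """What servers consult at open/mount time instead of hard-coding a LUN path."""
    return catalog[volume_name]

def migrate(volume_name: str, new_location):
    # 1) copy the data to the new controller (not shown), 2) flip the mapping
    catalog[volume_name] = new_location

print(resolve("payroll-db-vol"))                 # ('old-controller', 'LUN 12')
migrate("payroll-db-vol", ("new-controller", "LUN 3"))
print(resolve("payroll-db-vol"))                 # ('new-controller', 'LUN 3')
```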
4-4
Security: as always matters
 And of course there are a lot of security concerns storage
subsystems have to take care of
  - Data-at-rest protection is much more challenging than data-in-flight protection
    because of long-term key management (a key-wrapping sketch follows below)
(figure: a storage subsystem at the primary site, with a link to a secondary site, has to
 cover management port security (user authentication, access control, data-in-flight
 protection, toward the storage administrator and management server), data port security
 (device authentication, access control, data-in-flight protection, toward the
 application servers), and other subsystem security (data-at-rest protection, audit
 logging))
4-5
Storage resource management: spreadsheet?
 Even the basic discovery-and-reporting is still a pain in the
neck for most administrators
  - The most widely used management tool today is a spreadsheet
  - But can they continue using it for a PB environment?
  - The SNIA SMI-S standard seems promising because of its set-oriented query capability
    (SNMP has already broken down for storage management)
  - Yet most commercial tools are not proven at PB scale
4-6
Applications: will they use DBMS?
- What kind of applications will use petabytes of online storage?
  - email/IM, voice, video archive, …
  - stream data from sensor networks (including RFID)
  - geoscience, bioscience, medical, …
- How will those data be managed?
  - Most bulk data may not be stored in RDBMSs but in filesystems (with a global name
    space)
  - An XML native store may engulf a lot of data (structured and semi-structured) once
    well established
(figure: today's typical PB system: a front-end application with a cache on the
 application server, a contents manager and DBMS holding the metadata DB (O(10GB-100GB))
 on the contents server, and an HSM with staging disks on the file server, backed by
 HDDs and an MT library holding O(1PB) of content, e.g. 100MB × 10^7 files)
5. Summary: beyond 10 petabytes
5-1
Beyond 10 petabytes of data
 Continuing capacity growth of HDD enables >10PB online
storage to come within the reach of most IT organizations in 5 years
  - HDDs with perpendicular magnetic recording technology are emerging
  - The declining $/GB trend shows no sign of stopping
- The server farm – network – storage farm configuration will continue to dominate
  enterprise data centers
  - It is the most cost-effective and flexible way to configure online storage for a
    variety of applications
- The protocol and interface between server and storage should evolve to become more
  network-conscious
  - But the old guard will never die in the foreseeable future
- An XML data store may come to play a significant role in addition to the filesystem
  and the RDBMS
  - Who knows!