Transcript Document

Storage Bricks
Jim Gray
Microsoft Research
http://Research.Microsoft.com/~Gray/talks
FAST 2002
Monterey, CA, 29 Jan 2002
Acknowledgements:
Dave Patterson explained this to me long ago
Leonard Chung, Kim Keeton, Erik Riedel, and Catharine Van Ingen helped me sharpen these arguments
First Disk 1956
• IBM 305 RAMAC
• 4 MB
• 50x24” disks
• 1200 rpm
• 100 ms access
• 35k$/y rent
• Included computer & accounting software (tubes, not transistors)
[Photo: the unit, 1.6 meters tall]

10 years later
Disk Evolution
[Chart scale: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta]
• Capacity: 100x in 10 years
  – 1 TB 3.5” drive in 2005
  – 20 GB 1” micro-drive
• System on a chip
• High-speed SAN
• Disk replacing tape
• Disk is a super computer!
Disks are becoming computers
• Smart drives
• Camera with micro-drive
• Replay / Tivo / Ultimate TV
• Phone with micro-drive
• MP3 players
• Tablet
• Xbox
• Many more…
[Diagram: Applications (Web, DBMS, Files) on an OS, on a disk controller with a 1 GHz cpu + 1 GB RAM; Comm: Infiniband, Ethernet, radio…]
Data Gravity: Processing Moves to Transducers
smart displays, microphones, printers, NICs, disks
• Processing decentralized:
  – Moving to data sources
  – Moving to power sources
  – Moving to sheet metal
• ? The end of computers ?
[Diagram: Storage, Network, and Display transducers, each with an ASIC. Today: P = 50 mips, M = 2 MB. In a few years: P = 500 mips, M = 256 MB.]
It’s Already True of Printers
Peripheral = CyberBrick
• You buy a printer
• You get:
  – several network interfaces
  – a Postscript engine
    • cpu
    • memory
    • software
    • a spooler (soon)
  – and… a print engine.
The Absurd Design?
• Segregate processing from storage
• Poor locality
• Much useless data movement
• Amdahl’s laws: bus: 10 B/ips, io: 1 b/ips (see the check below)
[Diagram: Processors ~1 Tips, RAM ~1 TB, Disks ~100 TB]
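
A quick check of that Amdahl arithmetic, as a minimal sketch (the 10 bytes/ips bus rule and 1 bit/ips I/O rule are the slide's; everything else is simple arithmetic):

# Amdahl's balanced-system rules applied to the "absurd" centralized design:
# ~1 Tips of processing, ~1 TB of RAM, ~100 TB of disk behind one boundary.

INSTRUCTIONS_PER_SEC = 1e12          # ~1 Tips of aggregate processing
BUS_BYTES_PER_IPS    = 10            # slide: ~10 bytes of bus per instruction/s
IO_BITS_PER_IPS      = 1             # slide: ~1 bit of I/O per instruction/s

bus_bw_bytes = INSTRUCTIONS_PER_SEC * BUS_BYTES_PER_IPS      # bytes/s of bus bandwidth
io_bw_bytes  = INSTRUCTIONS_PER_SEC * IO_BITS_PER_IPS / 8    # bytes/s of I/O bandwidth

print(f"bus bandwidth needed: {bus_bw_bytes/1e12:.0f} TB/s")
print(f"I/O bandwidth needed: {io_bw_bytes/1e9:.0f} GB/s")
# A centralized design must carry all of this across the processor/storage
# boundary, which is the "useless data movement" the slide objects to.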
The “Absurd” Disk
(1 TB, 100 MB/s, 200 Kaps, 200$)
• 2.5 hr scan time (poor sequential access)
• 1 aps / 5 GB (VERY cold data)
• It’s a tape!
• Optimizations:
  – Reduce management costs
  – Caching
  – Sequential 100x faster than random
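
A back-of-the-envelope check of those figures, reading "Kaps" as kilobyte accesses per second (capacity, bandwidth, and access rate are the slide's numbers):

# The "absurd" 1 TB drive: 100 MB/s sequential, ~200 random accesses per second.
capacity_mb = 1_000_000          # 1 TB
seq_mb_s    = 100                # sequential bandwidth
accesses_s  = 200                # random (kilobyte) accesses per second

scan_hours = capacity_mb / seq_mb_s / 3600
aps_per_gb = accesses_s / (capacity_mb / 1000)

print(f"full scan: {scan_hours:.1f} hours")        # ~2.8 h, the "2.5 hr" scan
print(f"{aps_per_gb:.1f} accesses/s per GB")       # 0.2, i.e. 1 aps per 5 GB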
Disk = Node
• magnetic storage (1TB)
• processor + RAM + LAN
• Management interface (HTTP + SOAP; see the sketch below)
• Application execution environment
• Application
  – File
  – DB2/Oracle/SQL
  – Notes/Exchange/TeamServer
  – SAP/Siebel/…
  – Quickbooks / Tivo / PC…
[Software stack: Applications, Services, DBMS, RPC ..., File System, LAN driver, Disk driver, OS Kernel]
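
A minimal sketch of what a per-drive management call over that HTTP + SOAP interface might look like; the endpoint, envelope, and GetStatus action are hypothetical, since the slide specifies only the protocols:

import urllib.request

# Hypothetical SOAP-over-HTTP status query to one disk node. The envelope and
# the GetStatus action are made up for illustration; only "HTTP + SOAP" comes
# from the slide.
ENVELOPE = """<?xml version="1.0"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body><GetStatus/></soap:Body>
</soap:Envelope>"""

request = urllib.request.Request(
    url="http://disk-node-17.local/mgmt",          # hypothetical node address
    data=ENVELOPE.encode("utf-8"),
    headers={"Content-Type": "text/xml; charset=utf-8", "SOAPAction": "GetStatus"},
)
with urllib.request.urlopen(request) as response:
    print(response.read().decode("utf-8"))         # node replies with status XML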
Implications
Conventional:
• Offload device handling to NIC/HBA
• higher level protocols: I2O, NASD, VIA, IP, TCP…
• SMP and Cluster parallelism is important.
[Diagram: Central Processor & Memory]
Radical:
• Move app to NIC/device controller
• higher-higher level protocols: SOAP/DCOM/RMI…
• Cluster parallelism is VERY important.
[Diagram: Terabyte/s Backplane]
Intermediate Step: Shared Logic
• Brick with 8-12 disk drives
• 200 mips/arm (or more)
• 2 x Gbps Ethernet
• General purpose OS
• 10k$/TB to 50k$/TB
• Shared:
  – Sheet metal
  – Power
  – Support/Config
  – Security
  – Network ports
• These bricks could run applications (e.g. SQL or Mail or…)
Examples: Snap ~1 TB (12 x 80 GB NAS), NetApp ~0.5 TB (8 x 70 GB NAS), Maxtor ~2 TB (12 x 160 GB NAS)
Example
• Homogeneous machines lead to quick response through reallocation
• HP desktop machines: 320 MB RAM, 3U high, 4 x 100 GB IDE drives
• $4k/TB (street)
• 2.5 processors/TB, 1 GB RAM/TB
• JIT storage & processing: 3 weeks from order to deploy
(Slide courtesy of Brewster Kahle, @ Archive.org)
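
The per-TB ratios follow from the per-box specs; a small check, assuming one processor per desktop box (the slide does not state the CPU count explicitly):

# One brick per the slide: an HP desktop with 4 x 100 GB IDE drives and 320 MB RAM.
drives_per_box = 4
gb_per_drive   = 100
cpus_per_box   = 1        # assumption: one CPU per desktop box
ram_gb_per_box = 0.32     # 320 MB

tb_per_box = drives_per_box * gb_per_drive / 1000           # 0.4 TB per box
print(f"{cpus_per_box / tb_per_box:.1f} processors/TB")     # 2.5
print(f"{ram_gb_per_box / tb_per_box:.1f} GB RAM/TB")       # 0.8, roughly the 1 GB/TB quoted
# At the quoted $4k/TB street price, each 0.4 TB box works out to about $1,600.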
What if Disk Replaces Tape?
How does it work?
• Backup/Restore
  – RAID (among the federation)
  – Snapshot copies (in most OSs)
  – remote replicas (standard in DBMS and FS)
• Archive
  – Use “cold” 95% of disk space
• Interchange
  – Send computers, not disks.
It’s Hard to Archive a Petabyte
It takes a LONG time to restore it.
• At 1 GBps it takes 12 days!
• Store it in two (or more) places online (a geo-plex)
• Scrub it continuously (look for errors)
• On failure:
  – use other copy until failure repaired,
  – refresh lost copy from safe copy.
• Can organize the two copies differently (e.g.: one by time, one by space)
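
The 12-day figure is just bandwidth arithmetic:

# How long to restore a petabyte over a 1 GB/s pipe?
petabyte_gb = 1_000_000
gb_per_sec  = 1
seconds = petabyte_gb / gb_per_sec
print(f"{seconds/86400:.1f} days")   # ~11.6 days, call it 12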
Archive to Disk
100 TB for 0.5M$ + 1.5 “free” petabytes
• If you have 100 TB active,
  you need 10,000 mirrored disk arms (see TPC-C)
• So you have 1.6 PB of (mirrored) storage (160 GB drives)
• Use the “empty” 95% for archive storage.
• No extra space or extra power cost.
• Very fast access (milliseconds vs hours).
• Snapshot is read-only (software enforced)
• Makes admin easy (saves people costs)
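
The capacity arithmetic behind the "1.5 free petabytes", using the slide's 10,000 arms and 160 GB drives:

# 100 TB of active data needs ~10,000 mirrored disk arms (per the slide's TPC-C sizing).
arms        = 10_000
gb_per_disk = 160
active_tb   = 100

raw_pb    = arms * gb_per_disk / 1e6            # raw (mirrored) capacity
active_pb = 2 * active_tb / 1000                # active data kept on both mirrors
free_pb   = raw_pb - active_pb

print(f"raw capacity: {raw_pb:.1f} PB")          # 1.6 PB
print(f"active (mirrored): {active_pb:.1f} PB")  # 0.2 PB, ~12% of capacity
print(f"'free' for archive: {free_pb:.1f} PB")   # ~1.4 PB; ~1.5 PB counting one copy only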
Disk as Tape Archive
Slide courtesy of Brewster Kahle, @ Archive.org
• Tape is unreliable, specialized, slow, low density, not improving fast, and expensive
• Using removable hard drives to replace tape’s function has been successful
• When a “tape” is needed, the drive is put in a machine and it is online. No need to copy from tape before it is used.
• Portable, durable, fast, media cost = raw tapes, dense. Unknown longevity: suspected good.
Disk as Tape Interchange
• Tape interchange is frustrating (often unreadable)
• Beyond 1-10 GB, send media, not data
– FTP takes too long (hour/GB)
– Bandwidth still very expensive (1$/GB)
• Writing DVD not much faster than Internet
• New technology could change this
– 100 GB DVD @ 10MBps would be competitive.
• Write 1TB disk in 2.5 hrs (at 100MBps)
• But, how does interchange work?
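
The numbers behind "send media, not data", using the slide's rough rates (an hour per GB over FTP, 1 $/GB of bandwidth, 100 MBps to write the drive):

# Moving a terabyte: network transfer vs. writing a disk and shipping it.
tb_gb = 1000

ftp_hours_per_gb = 1          # slide: FTP takes about an hour per GB
net_cost_per_gb  = 1          # slide: bandwidth ~1 $/GB
disk_mb_per_s    = 100        # write speed of the shipped drive

print(f"FTP a TB: {ftp_hours_per_gb * tb_gb / 24:.0f} days")          # ~42 days
print(f"network cost: ${net_cost_per_gb * tb_gb:,.0f}/TB")            # ~$1,000
print(f"write the disk: {tb_gb*1000/disk_mb_per_s/3600:.1f} hours")   # ~2.8 h, then ship it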
Disk As Tape Interchange: What format?
• Today I send 160 GB NTFS/SQL disks.
• But that is not a good format for Linux/DB2 users.
• Solution: Ship NFS/CIFS/ODBC servers (not disks)
• Plug the “disk” into the LAN.
  – DHCP, then file or DB server via a standard interface.
  – “pull” data from the server (see the sketch below).
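
A minimal sketch of the "pull" step, assuming the shipped brick DHCPs onto the LAN and happens to expose its data over plain HTTP; the hostname and path below are hypothetical, and a real brick would more likely speak NFS, CIFS, or ODBC as the slide says:

import urllib.request

# Hypothetical: the brick registered itself via DHCP/DNS as "brick-0042.local"
# and serves its file tree over HTTP. Pull one file from it.
BRICK_URL = "http://brick-0042.local/export/data/part-0001.csv"   # hypothetical name/path

with urllib.request.urlopen(BRICK_URL) as response:
    payload = response.read()

print(f"pulled {len(payload)} bytes from the shipped brick")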
Some Questions
• What is the product?
• How do I manage 10,000 nodes (disks)?
• How do I program 10,000 nodes (disks)?
• How does RAID work?
• How do I backup a PB?
• How do I restore a PB?
What is the Product?
• Concept: Plug it in and it works!
• Music/Video/Photo appliance (home)
• Game appliance
• “PC”
• File server appliance
• Data archive/interchange appliance
• Web server appliance
• DB server
• eMail appliance
• Application appliance
[Diagram: an appliance with only network and power connections]
How Does Scale Out Work?
• Files: well known designs:
  – rooted tree partitioned across nodes
  – Automatic cooling (migration)
  – Mirrors or chained declustering (placement sketch below)
  – Snapshots for backup/archive
• Databases: well known designs
  – Partitioning, remote replication similar to files
  – distributed query processing.
• Applications: (hypothetical)
  – Must be designed as mobile objects
  – Middleware provides an object migration system
    • Objects externalize methods to migrate (== backup/restore/archive)
    • Web services seem to have the key ideas (XML representation)
  – Example: eMail object is a mailbox
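
A simplified sketch of the chained-declustering placement mentioned in the Files list above; the node count and routing rule are illustrative, the point being that each partition's primary and backup live on adjacent nodes, so a failed node's work shifts along the chain rather than requiring a dedicated mirror pair:

# Chained declustering over N brick nodes: primary copy of partition i on node
# i mod N, backup copy on the next node in the chain.
N_NODES = 8   # illustrative cluster size

def placement(partition: int, n_nodes: int = N_NODES) -> tuple[int, int]:
    primary = partition % n_nodes
    backup = (primary + 1) % n_nodes
    return primary, backup

def read_node(partition: int, failed: set[int]) -> int:
    """Route a read to the primary, or to the chained backup if the primary is down."""
    primary, backup = placement(partition)
    if primary not in failed:
        return primary
    if backup not in failed:
        return backup
    raise RuntimeError("both copies unavailable")

# With node 3 down, its partitions are served by node 4; all other nodes keep their own load.
for p in range(8):
    print(f"partition {p}: served by node {read_node(p, failed={3})}")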
Auto Manage Storage
• 1980 rule of thumb:
– A DataAdmin per 10GB, SysAdmin per mips
• 2000 rule of thumb
– A DataAdmin per 5TB
– SysAdmin per 100 clones (varies with app).
• Problem:
– 5TB is 50k$ today, 5k$ in a few years.
– Admin cost >> storage cost !!!!
• Challenge:
– Automate ALL storage admin tasks
Admin: TB and “guessed” $/TB
(does not include cost of application, overhead, not “substance”)
  Google:    1 : 100 TB    5k$/TB/y
  Yahoo!:    1 : 50 TB     20k$/TB/y
  DB:        1 : 5 TB      60k$/TB/y
  Wall St.:  1 : 1 TB      400k$/TB/y (reported)
• hardware dominant cost only @ Google.
• How can we waste hardware to save people cost?
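
Those $/TB/y guesses are roughly what the people ratios imply; a sketch assuming a loaded cost of about $200k per administrator per year (that salary figure is an assumption, not from the slide, and the slide's totals presumably fold in hardware and overhead too):

# Rough people-cost per TB-year implied by the admin:TB ratios above,
# assuming ~$200k/y loaded cost per administrator (an assumption).
ADMIN_COST_PER_YEAR = 200_000

tb_per_admin = {"Google": 100, "Yahoo!": 50, "DB": 5, "Wall St.": 1}

for site, tb in tb_per_admin.items():
    print(f"{site:9s} ~{ADMIN_COST_PER_YEAR / tb / 1000:.0f}k$/TB/y in people alone")
# Google ~2k, Yahoo! ~4k, DB ~40k, Wall St. ~200k.
# Only at Google-like ratios does hardware (a few k$/TB) dominate the people cost.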
How do I manage 10,000 nodes?
• You can’t manage 10,000 x (for any x).
• They manage themselves.
– You manage exceptional exceptions.
• Auto Manage
  – Plug & Play hardware
  – Auto load-balance & placement of storage & processing
  – Simple parallel programming model
  – Fault masking
How do I program 10,000 nodes?
• You can’t program 10,000 x (for any x).
• They program themselves.
– You write embarrassingly parallel programs
– Examples: SQL, Web, Google, Inktomi, HotMail,….
– PVM and MPI prove it must be automatic (unless you have a PhD)!
• Auto Parallelism is ESSENTIAL
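
A minimal sketch of that programming model: the programmer writes one per-partition function and the system fans it out; here multiprocessing on one machine stands in for 10,000 bricks each scanning their own disks:

from multiprocessing import Pool

# Embarrassingly parallel: the same function runs independently on every brick's
# local data; results are combined with a trivial reduction. No explicit message
# passing (the PVM/MPI style the slide argues against) is exposed to the programmer.

def count_matches(partition_id: int) -> int:
    """Stand-in for a per-brick scan of its local data (here: fake local rows)."""
    local_rows = range(partition_id * 1000, (partition_id + 1) * 1000)
    return sum(1 for row in local_rows if row % 7 == 0)

if __name__ == "__main__":
    with Pool(processes=8) as pool:                 # 8 workers stand in for 10,000 bricks
        per_brick = pool.map(count_matches, range(64))
    print("total matches:", sum(per_brick))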
Summary
• Disks will become supercomputers, so:
  – Lots of computing to optimize the arm
  – Can put the app close to the data (better modularity, locality)
  – Storage appliances (self-organizing)
• The arm/capacity tradeoff: “waste” space to save access.
  – Compression (saves bandwidth)
  – Mirrors
  – Online backup/restore
  – Online archive (vault to other drives or a geoplex if possible)
• Not disks replace tapes:
  storage appliances replace tapes.
• Self-organizing storage servers (file systems)
  (prototypes of this software exist)