Storage Systems
CSE 598d, Spring 2007
Lecture ?: Rules of thumb in data engineering
Paper by Jim Gray and Prashant Shenoy
Feb 15, 2007
Contents
• Examination of rules-of-thumb in data engineering
– Moore’s law
– Amdahl’s rules
– Gilder’s law
• Technological trends and how/whether existing
rules-of-thumb need to be re-thought
Moore’s Law
• Circuit densities grow at 4x every 3 years
– 100x increase in a decade
– More generally: Ax every B years
– Originally meant for RAM
• Implies an extra bit of addressing every 18 months (worked out in the sketch below)
• From 16-bit addressing in the 1970s (around a megabyte of memory on large machines) to 64-bit addressing today (several GB)
– Extended to CPU and storage
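A minimal arithmetic sketch of the address-bit claim above, using only the growth numbers stated on this slide (4x every 3 years):

    # Moore's law as stated above: circuit density grows 4x every 3 years.
    # 4x = 2 extra address bits, hence one extra bit roughly every 18 months.
    import math

    growth_factor = 4                                   # 4x ...
    period_years = 3                                    # ... every 3 years
    bits_per_period = math.log2(growth_factor)          # 2 extra address bits
    print(period_years * 12 / bits_per_period)          # 18.0 months per extra bit

    print(growth_factor ** (10 / period_years))         # ~100x growth per decade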
Disk parameters over time
Moore’s law applied to HDD
• Disk capacity has increased more than 100x in
the last decade!
– Areal density up from 20 Mbit/in² to 35 Gbit/in²
• However, data rate has only increased about 30x
– Capacity / accesses-per-second ratio growing ~10x per decade
– Capacity / bandwidth ratio growing ~10x per decade (scan-time sketch at the end of this slide)
• Implications:
– Disk accesses becoming more precious
– Disk data becoming “cooler”
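A rough scan-time sketch of the capacity/bandwidth trend. The 1 GB and 80 GB capacities echo figures used later in this lecture; the ~5 MB/s and ~60 MB/s transfer rates are assumed, illustrative values, not numbers from the slide:

    # Time to sequentially scan an entire disk, then vs. now.
    old_capacity_mb, old_rate_mb_s = 1 * 1024, 5.0      # ~1 GB disk at ~5 MB/s (assumed)
    new_capacity_mb, new_rate_mb_s = 80 * 1024, 60.0    # ~80 GB disk at ~60 MB/s (assumed)

    old_scan_s = old_capacity_mb / old_rate_mb_s        # ~205 s  (a few minutes)
    new_scan_s = new_capacity_mb / new_rate_mb_s        # ~1365 s (tens of minutes)
    print(new_scan_s / old_scan_s)                      # several-fold longer per decade

Even with generous assumed bandwidths, scanning a full disk takes several times longer than it did a decade ago, which is what the growing capacity/bandwidth ratio means in practice.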
Closer look at the implications
• Discussion
– Does the increase in disk capacity mean applications are also
using correspondingly large stores?
– Are disk accesses per second keeping up with capacity?
• Recall these have grown much slower than areal density
• 10 years ago: 30 kaps (kilobyte accesses per second) for 1 GB of data
• Today: 120 kaps for 80 GB of data
– That is, only 1.5 kaps per GB (worked out below)
– HDD data needs to be 10-100x cooler than it was 10 years ago
– Remedy: use large main memories (caching)
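The "cooler data" requirement follows directly from the kaps figures above; a short arithmetic check:

    # Accesses per second available per gigabyte, then vs. now (slide's numbers).
    old_kaps, old_gb = 30, 1       # ~30 KB-accesses/sec on a 1 GB disk
    new_kaps, new_gb = 120, 80     # ~120 KB-accesses/sec on an 80 GB disk

    print(old_kaps / old_gb)                          # 30 kaps per GB
    print(new_kaps / new_gb)                          # 1.5 kaps per GB
    print((old_kaps / old_gb) / (new_kaps / new_gb))  # 20x fewer accesses per GB

Each gigabyte now gets roughly 20x fewer accesses per second than a decade ago, so data stored on disk must be referenced far less often, or the hot part must live in a RAM cache.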
Costly disk accesses have led to…
• Preferring a few large transfers over many small ones (amortization sketch at the end of this slide)
• Preferring sequential transfers
– Log-structured file systems
• Mirroring rather than other forms of redundancy
– Parity schemes such as RAID-5 save capacity but cost extra small IOs per write
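Why few, large, sequential transfers win: each request pays a fixed positioning cost (seek plus rotation) before any data moves, so bigger requests amortize that cost. A sketch with assumed, illustrative drive parameters (not taken from the slide):

    # Effective throughput vs. request size for an assumed disk:
    # ~8 ms average positioning time, ~60 MB/s media transfer rate.
    seek_s, rate_mb_s = 0.008, 60.0

    def effective_mb_per_s(request_kb):
        transfer_s = (request_kb / 1024.0) / rate_mb_s
        return (request_kb / 1024.0) / (seek_s + transfer_s)

    for kb in (8, 64, 1024, 8192):
        print(kb, round(effective_mb_per_s(kb), 1))
    # 8 KB requests deliver ~1 MB/s; multi-megabyte requests approach the full 60 MB/s.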
Cost trends
• Historically
– Tape:HDD:RAM has been 1:10:1000
• Calculation for a modern system gives 1:3:300
– Disk prices are approaching tape prices
• Disks are replacing tapes in several domains
– Cost/MB of RAM declines 100x in a decade
• What is economical to put on disk today may be economical to put in RAM in 10 years (see the sketch at the end of this slide)
– RAM is taking over a lot of the role of the HDD; the HDD is taking over a lot of the role of tape
• Storage management costs exceed device costs
• Admins required to manage more and more data
– Automation, self-manageability becoming crucial
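A small sketch of the "disk today, RAM in ten years" argument, combining the 1:3:300 media price ratio with the ~100x-per-decade decline in RAM cost (both figures from this slide; the rest is arithmetic):

    import math

    # RAM is ~100x the per-MB price of disk today (300 / 3 from the 1:3:300 ratio).
    ram_to_disk_price_ratio = 300 / 3
    ram_decline_per_decade = 100          # RAM cost/MB falls ~100x per decade

    decades_to_parity = math.log(ram_to_disk_price_ratio) / math.log(ram_decline_per_decade)
    print(decades_to_parity * 10)         # ~10 years until RAM hits today's disk price/MB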
Amdahl’s System Balance Rules
• Parallelism law
– Expresses the maximum achievable speedup in terms of the fraction of a computation that can be parallelized
• Balanced system law
– A system needs 1 bit of IO per second per instruction per second
• i.e., IO bits/sec = instructions/sec (about 8 MIPS per MB/s of IO)
• Memory law
– MB/MIPS ratio in a balanced system is 1
• IO law
– Programs do one IO per 50,000 instructions
• How have these rules changed over time?
• Methodology
– Rely on well-regarded benchmarks TPC-C (random) and TPC-H (sequential)
• Revisions to Amdahl’s laws (balance-check sketch below)
– Balanced system law: measure the instruction rate and IO rate on the relevant workload
– Memory law: MB/MIPS ratio rising from 1 to 4
• Reiterates the growth in RAM as disk IOs become expensive
– IO law: workload dependent
• 50,000 instructions per IO was geared toward random IO
• Increased sequentiality in disk accesses (discussed earlier) means more instructions per IO
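A sketch tying the rules together: Amdahl's speedup bound plus a balance check against the balanced-system and (revised) memory laws. The example machine's MIPS, IO bandwidth, and memory size are assumed, purely illustrative figures:

    # Amdahl's parallelism law: speedup <= 1 / ((1 - p) + p / n)
    def amdahl_speedup(p, n):
        return 1.0 / ((1.0 - p) + p / n)

    print(amdahl_speedup(p=0.9, n=16))         # ~6.4x even on 16 processors

    # Balance check for a hypothetical machine (all figures assumed).
    mips, io_mb_per_s, memory_mb = 1000, 125, 4000
    io_bits_per_s = io_mb_per_s * 8e6          # 1 MB/s = 8 Mbit/s
    instructions_per_s = mips * 1e6
    print(io_bits_per_s / instructions_per_s)  # ~1 bit of IO per instruction -> balanced
    print(memory_mb / mips)                    # MB/MIPS = 4, the revised ratio above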
Gilder’s Law
• Network bandwidth would triple every year for the next 25 years
(prediction in 1995)
• Link bandwidth triples every four years
• Network messages used to cost more instructions per message, and more instructions per byte, than disk IOs
– Network protocol processing overheads
– These overheads have been reduced by smarter NICs
• Cost comparison
– Moving data over a WAN is much more expensive than reading it from a local disk or moving it over a LAN
• Related: the cost of shipping large disk arrays or entire computers is comparable to the cost of transferring the data over the Internet (rough comparison at the end of this slide)
– However, this price gap is expected to shrink, with bandwidth becoming plentiful within a decade
• Implication: Local disks could then be used as caches (or pre-fetch
buffers) with the main data store being remote
– Save on local storage management costs
– The managed data center model is already visible today!
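A rough comparison behind the "ship the disks" remark. Every number here (dataset size, link speed, shipping time) is an assumed, illustrative value:

    # Moving 1 TB over an assumed 10 Mbit/s effective WAN link vs. ~1 day of shipping.
    dataset_bits = 1e12 * 8          # 1 TB
    link_bits_per_s = 10e6           # assumed effective WAN bandwidth

    print(dataset_bits / link_bits_per_s / 86400)   # ~9 days of continuous transfer
    # Overnight-shipping a disk array moves the same terabyte in about a day,
    # which is why bulk data sometimes still travels by courier.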
Caching
• 5 minute rule for random workloads (break-even calculation sketched below)
• 1 minute rule for sequential workloads
• Web caches
– Cache everything!
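The 5 minute rule comes from a break-even calculation: keep a page in RAM if the RAM needed to hold it costs less than the disk arm time it saves. The break-even form below follows Gray's argument; the price and performance numbers are assumed placeholders, not values from the slide:

    # Break-even reference interval (seconds) =
    #   (pages per MB of RAM / disk accesses per second) * (disk price / RAM price per MB)
    pages_per_mb_ram = 128        # assuming 8 KB pages
    accesses_per_sec = 120        # accesses/sec of one disk (assumed)
    disk_price = 200.0            # $ per disk drive (assumed)
    ram_price_per_mb = 1.0        # $ per MB of RAM (assumed)

    break_even_s = (pages_per_mb_ram / accesses_per_sec) * (disk_price / ram_price_per_mb)
    print(break_even_s / 60)      # ~3.6 minutes -> "about five minutes" for random pages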