Transcript ppt
“The Five-Minute Rule Ten Years Later,
and Other Computer Storage Rules of Thumb”
Authors: Jim Gray, Goetz Graefe
Reviewed by: Nagapramod Mandagere
Biplob Debnath
Outline
Problem Statement
Motivation
Importance and Relevance
Main Contributions and Validation
Key Ideas
Illustrations
New Metrics
Assumptions
Re-write Today
Questions
Problem Statement
Broader Problem: Examine developments over a
long period of time and extract important
technology trends.
Specific Instance: Inferring rules of thumb for
buffer replacement policies in a number of
settings, including RAID environments.
Given: Trends over time for parameters such as
memory cost, disk cost, tape cost
Find: Rules of thumb for deciding where to store the
data and when to evict data from the memory buffer
Objectives: Simple rules, extensible rules
Constraints: Hierarchical Storage Model
Typical Database Administrator's Dilemma
Should I cache on the client?
Should I store data back on disk (local or network disk)?
Should I cache this data in memory?
Should I move data to tape?
The performance isn't good. Am I doing something wrong?
Importance & Relevance
Different parameters change at very different rates
• Seeks/second & disk capacity – 10x to 100x
• Disk MB/K$ & DRAM MB/K$ – ~1000x
Importance & Relevance
The location of data is very important
Main Memory: Very Fast, Expensive, limited size
Disk Storage: A lot slower than main memory,
inexpensive, nearly unlimited size
Tape Storage: Slowest, dirt cheap, unlimited capacity
How can one decide what data resides where?
Either the system learns from data access patterns and
adapts (admins hate to give up control),
or the administrator controls data locality using
experience or historical performance info (rules of
thumb)
Main Contributions & Validation
The Five minute rule
Randomly accessed buffer pages can be replaced if unused for
more than 5 minutes.
Sequentially accessed buffer pages can be replaced if unused
for more than 1 minute.
Metrics for storage performance characterization
Cost/Access
Maps: Megabyte accesses per second
Scan: Time it takes to sequentially read or write all the data in
the device
Validation Methodology - Examples
Examples
Random access
One pass sort
Two pass sort
Trends observed over a period of time
Key Ideas
Tradeoff between the cost of RAM and the cost
of disk accesses.
The tradeoff is that caching pages in the extra
memory can save disk IOs.
The break-even point is reached when the rent on the
extra memory to cache a page ($/page/sec) exactly
matches the savings from avoided disk accesses
($/disk_access/sec).
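The break-even reference interval can be written out using the parameter
names of the 1997 illustration that follows:

    BreakEvenIntervalSeconds =
        (PagesPerMBofRAM / AccessesPerSecondPerDisk)
      * (PricePerDiskDrive / PricePerMBofDRAM)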
Illustration – Typical System in
1997
For a system with following characteristics
PagesPerMBofRAM = 128 pages/MB (8KB pages)
AccessesPerSecondPerDisk = 64 access/sec/disk
PricePerDiskDrive = 2000 $/disk (9GB + controller)
PricePerMBofDRAM = 15 $/MB_DRAM
The break-even inter-reference interval is ~266
seconds, i.e. about 5 minutes
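A minimal Python sketch reproducing the slide's arithmetic with the 1997
parameters listed above:

    # Break-even reference interval, 1997 parameters from the slide
    pages_per_mb_ram = 128            # 8 KB pages
    accesses_per_sec_per_disk = 64    # random accesses/sec/disk
    price_per_disk_drive = 2000       # $ (9 GB + controller)
    price_per_mb_dram = 15            # $/MB

    break_even_s = (pages_per_mb_ram / accesses_per_sec_per_disk) \
                   * (price_per_disk_drive / price_per_mb_dram)
    print(break_even_s)               # ~266.7 s, i.e. about 5 minutes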
Illustration
One pass algorithms
read the data once and never reference it again,
so there is no need to cache the data in RAM.
The system needs only enough buffer memory to
allow data to stream from disk to main
memory.
Typically, two or three one-track buffers (~100
KB) are adequate per disk to buffer disk
operations and allow the device to stream
data to the application.
Illustration
Two pass algorithms
are sequential operations that read a large
dataset and then revisit parts of the data.
Examples: database join, cube, rollup, and sort
operators.
Sorting uses two passes if memory is
smaller than the data set.
The inter-reference time is typically about a minute
(sequential data access)
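The same formula gives the shorter sequential break-even interval once the
(much higher) sequential page rate replaces the random-access rate; a
minimal sketch, where the sequential rate is an assumed illustrative value,
not a figure from the paper:

    # Same break-even formula, applied to sequentially accessed pages.
    # ASSUMPTION: seq_pages_per_sec is an illustrative value, not taken
    # from the paper; the slide only states the result is about a minute.
    def break_even_seconds(pages_per_mb_ram, page_rate_per_disk,
                           price_per_disk, price_per_mb_dram):
        return (pages_per_mb_ram / page_rate_per_disk) \
               * (price_per_disk / price_per_mb_dram)

    seq_pages_per_sec = 300   # hypothetical sequential 8 KB page rate
    print(break_even_seconds(128, seq_pages_per_sec, 2000, 15))  # ~57 s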
Illustration – Two Pass Sort
• One pass sort needs a larger amount of memory
• The memory needed grows faster with the size of the input file
• For files bigger than memory, a two pass sort is the only option
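A hedged sketch of the memory comparison, using the standard external
merge sort bound (M pages of memory sorts roughly M*M pages in two
passes); the bound is textbook background, not a number from the slides:

    # One-pass vs two-pass sort memory requirement (rough sketch).
    # Standard external merge sort bound: M pages of memory can sort
    # about M*M pages in two passes (M runs of M pages, merged once).
    import math

    PAGE_BYTES = 8 * 1024            # 8 KB pages, as in the 1997 slide

    def one_pass_memory_bytes(file_bytes):
        # a one-pass sort must hold the whole file in memory
        return file_bytes

    def two_pass_memory_bytes(file_bytes):
        # needs about sqrt(file_pages) pages of memory
        file_pages = math.ceil(file_bytes / PAGE_BYTES)
        return math.ceil(math.sqrt(file_pages)) * PAGE_BYTES

    GB = 1 << 30
    print(one_pass_memory_bytes(100 * GB) / GB)   # 100 GB of RAM
    print(two_pass_memory_bytes(100 * GB) / GB)   # ~0.028 GB (~28 MB)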
Disk vs Tape Tradeoff
• Tape has a much larger penalty (much slower access, but the lowest cost)
• Consequence – a much larger break-even point and a much bigger page size
New Metrics
Data-flow applications that stream huge
amounts of data, such as data mining and
multimedia applications, motivate new metrics
New Metrics
Kaps
Kilobyte accesses per second
Maps
Megabyte accesses per second
Scan
Time taken to sequentially read or write all data on a device
These metrics combined with rent costs provide
a price/performance metric
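A minimal sketch of how these metrics might be computed for a device from
its average access time, transfer rate, and capacity; the device parameters
below are hypothetical, not taken from the paper:

    # Kaps / Maps / Scan for a hypothetical storage device.
    avg_access_time_s = 0.008        # positioning time (seek + rotate), ~8 ms
    transfer_mb_per_s = 10.0         # sequential transfer rate
    capacity_mb = 9_000              # ~9 GB disk

    def accesses_per_second(request_mb):
        # one random access = positioning time + transfer time
        return 1.0 / (avg_access_time_s + request_mb / transfer_mb_per_s)

    kaps = accesses_per_second(1 / 1024)       # 1 KB accesses per second
    maps = accesses_per_second(1.0)            # 1 MB accesses per second
    scan_s = capacity_mb / transfer_mb_per_s   # time to read the whole device

    print(kaps, maps, scan_s)        # ~123 Kaps, ~9.3 Maps, 900 s scan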
Assumptions
All disk storage is assumed to have the same
cost/performance characteristics, i.e. the disk
storage system is homogeneous; the more recent shift
towards hierarchical/heterogeneous storage systems is
not considered.
The tradeoff considers only the performance aspect;
security and fault tolerance are assumed to be
uniform throughout.
Re-write
Re-evaluate the rules of thumb
considering more recent costs and the
more recent trends in storage systems, such as
heterogeneous/hierarchical storage
Take SAN and NAS characteristics into account
Questions???
Does the five minute rule still hold good today???
No (With Reservations)
If one changes the page size to the megabyte range, the five
minute rule still applies.
Pages/MB of RAM = 16 (64 KB pages)
Access/sec/disk = 64
Price/disk drive = $400
Price/MB of RAM = $0.1
Break even point ~ 1000s
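The same break-even computation with this slide's newer numbers:

    # Break-even interval with the newer parameters from this slide
    pages_per_mb_ram = 16             # 64 KB pages
    accesses_per_sec_per_disk = 64
    price_per_disk_drive = 400        # $
    price_per_mb_dram = 0.1           # $/MB

    break_even_s = (pages_per_mb_ram / accesses_per_sec_per_disk) \
                   * (price_per_disk_drive / price_per_mb_dram)
    print(break_even_s)               # 1000 s, roughly 17 minutes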
Further Evidence - Jim Gray (Keynote in FAST 2004)
http://www.usenix.org/events/fast05/