Transcript Document

Fragmentation in Large Object Repositories
Russell Sears
Catharine van Ingen
CIDR 2007
This work was performed at Microsoft Research San Francisco with input from the NTFS and SQL Server teams
Background
• Web services store large objects for users
– e.g., Wikipedia, Flickr, YouTube, GFS, Hotmail
• Replicate BLOBs or files
– No update-in-place
• Benchmark before deployment
– Then, encounter storage performance problems
• We set out to make some sense of this
[Figure: system architecture: clients → application servers → metadata DB and object stores; replication / data scrubbing across object stores]
Problems with partial updates
• Multiple changes per application request
– Atomicity (distributed transactions)
• Most updates change object size
– Must fragment or relocate data
• Reading / writing the entire object addresses these issues (sketch below)
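To make the get/put alternative concrete, here is a minimal Python sketch of a whole-object put, assuming a local-filesystem store (the ObjectStore class and its layout are illustrative inventions, not the systems measured in this talk). Writing a complete new copy and renaming it into place means readers never observe a partial update, and no distributed transaction is needed for a single object:

    import os
    import tempfile

    class ObjectStore:
        """Minimal whole-object get/put store (illustrative only)."""

        def __init__(self, root: str):
            self.root = root
            os.makedirs(root, exist_ok=True)

        def put(self, key: str, data: bytes) -> None:
            # Write the entire object to a temporary file on the same volume...
            fd, tmp = tempfile.mkstemp(dir=self.root)
            try:
                with os.fdopen(fd, "wb") as f:
                    f.write(data)
                    f.flush()
                    os.fsync(f.fileno())
                # ...then atomically replace any previous version. Readers
                # see either the old object or the new one, never a mix.
                os.replace(tmp, os.path.join(self.root, key))
            except BaseException:
                os.unlink(tmp)
                raise

        def get(self, key: str) -> bytes:
            with open(os.path.join(self.root, key), "rb") as f:
                return f.read()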
Experimental Setup
• Single storage node
• Compared filesystem, database
– NTFS on Windows Server 2003 R2
– SQL Server 2005 beta
• Repeatedly update (free, reallocate) objects (sketch below)
– Randomly chose object sizes and which objects to update
– Unrealistic, but easy to understand
• Measured throughput, fragmentation
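A sketch of that update workload, reusing the hypothetical ObjectStore above (the object counts and sizes are placeholders, not the exact parameters of the experiments):

    import random

    def run_benchmark(store, num_objects=1000, updates_per_object=10,
                      sizes=(256 * 1024, 512 * 1024, 1024 * 1024)):
        # Populate the volume with an initial set of objects.
        for i in range(num_objects):
            store.put("obj-%d" % i, b"x" * random.choice(sizes))
        # Each update frees a randomly chosen object and reallocates it
        # with a randomly chosen size; on average every object is
        # rewritten updates_per_object times.
        for _ in range(num_objects * updates_per_object):
            key = "obj-%d" % random.randrange(num_objects)
            store.put(key, b"x" * random.choice(sizes))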
Reasoning about time
• Existing metrics
– Wall clock time: requires a trace to be meaningful; cannot compare different workloads
– Updates per volume: coupled to volume size
Storage Age:
Average number of updates per object (see the sketch below)
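The metric itself is just a ratio; a one-function sketch (the function name and signature are mine):

    def storage_age(total_updates: int, live_objects: int) -> float:
        """Average number of updates per object. A volume holding 5,000
        objects that has absorbed 20,000 updates has storage age 4.0,
        independent of volume size, object size, or wall-clock time."""
        return total_updates / live_objects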
Read performance
• Clean system
– SQL good small object performance (inexpensive opens)
– NTFS significantly faster with objects >> 1 MB
• SQL degraded quickly
• NTFS small object performance was low, but constant
[Figure: read throughput (MB/s) vs. updates per object; NTFS and SQL curves for 256 KB and 1 MB objects]
10 MB object fragmentation
• NTFS approaching asymptote
• SQL Server degrades linearly
– No BLOB defragmenter
[Figure: fragments/object (0-40) vs. storage age (0-10); SQL Server vs. NTFS]
Rules of Thumb
• Classic pitfalls (sanity-check sketch below)
– Low free space (< 10%)
– Repeated allocation and deallocation (high storage age)
• One new problem
– Small volumes (< 100-1000x object size)
• Implicit tuning knobs
– Size of write requests
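A hypothetical helper that encodes these rules of thumb as checks; the exact thresholds (10% free space, 100x object size, a storage-age cutoff of 4) are illustrative readings of the slide, not values prescribed by the paper:

    def check_volume(free_bytes: int, volume_bytes: int,
                     max_object_bytes: int, storage_age: float) -> list:
        warnings = []
        if free_bytes < 0.10 * volume_bytes:
            warnings.append("low free space (< 10%)")
        if storage_age > 4:  # assumed cutoff; degradation is gradual
            warnings.append("high storage age from repeated alloc/dealloc")
        if volume_bytes < 100 * max_object_bytes:  # slide suggests 100-1000x
            warnings.append("small volume relative to object size")
        return warnings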
Append is expensive!
• Neither system can take advantage of the final object size during allocation
• Both APIs provide "append" (sketch below)
– Leave gaps for future appends
– Place objects without knowing length
• Observe the same behavior with a single object size and with random object sizes
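A sketch of the contrast, assuming plain file I/O (the 64 KB chunk size is an arbitrary stand-in for an application write buffer). Whether an allocator actually exploits a size hint is filesystem-specific; the point is that the append path gives it nothing to work with:

    CHUNK = 64 * 1024  # assumed application write-buffer size

    def put_by_append(path, data):
        # Append-style write: the allocator sees the file grow one chunk
        # at a time and must place data without knowing the final size.
        with open(path, "wb") as f:
            for off in range(0, len(data), CHUNK):
                f.write(data[off:off + CHUNK])

    def put_with_known_size(path, data):
        # Size-aware write: declare the final length before writing, so
        # the allocator has a chance to reserve one contiguous extent.
        with open(path, "wb") as f:
            f.truncate(len(data))
            for off in range(0, len(data), CHUNK):
                f.write(data[off:off + CHUNK])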
Conclusions
• Get/put storage is important in practice
• Storage age
– Metric for comparing implementations and workloads
– Fragmentation behaviors vary significantly
• Append leads to poor layout
----BACKUP SLIDES----
Theory vs. Practice
• Theory focuses on contiguous layout of objects of known size
• Objects that are allocated in groups are freed in groups
– Good allocation algorithms exploit this
– Generally ignored for average-case results
– Leads to pathological behavior in some cases
Small volumes
• Small objects / Large volumes
– Percent free space is the limiting factor
• Large objects / Small volumes
– Number of free objects is the limiting factor
Efficient Get/Put
• No update-in-place
– Partial updates complicate apps
– Objects change size
• Pipeline requests (sketch below)
– Small write buffers, I/O parallelism
[Figure: application server streaming numbered request chunks through a pipeline]
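A minimal sketch of the pipelining idea: a bounded queue lets the next chunk be read while the previous one is written, so a large object streams through a small buffer (src/dst are assumed file-like objects; the chunk and queue sizes are illustrative):

    import queue
    import threading

    BUF = 64 * 1024  # assumed small write buffer

    def pipelined_put(src, dst, depth=4):
        q = queue.Queue(maxsize=depth)

        def writer():
            while True:
                chunk = q.get()
                if not chunk:  # empty chunk signals end of stream
                    return
                dst.write(chunk)

        t = threading.Thread(target=writer)
        t.start()
        while True:
            chunk = src.read(BUF)
            q.put(chunk)  # blocks when the pipeline is full
            if not chunk:
                break
        t.join()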
Lessons learned
• Target systems avoid update-in-place
– No use for database data models
• Quantified fragmentation behavior
– Across implementations, workloads
• Common APIs complicate allocation
– Filesystem / BLOB API is too expressive
Example systems
• SharePoint
– Everything in the database, one copy per version
• Wikipedia
– One blob per document version; images are files
• Flickr / YouTube
• GFS
– Scalable append; chunk data into 64MB files
• Hotmail
– Each mailbox is stored as a single opaque BLOB
The folklore is accurate, so why do application designers…
…benchmark, then deploy the "wrong" technology?
…switch to the "right one" a year later?
…then switch back?!?
Performance problems crop up over time
Conclusions
• Existing systems vary widely
– Measuring clean systems is inadequate, but standard practice
• Support for append is expensive
• Unpredictable storage is difficult to reliably scale and manage
– See paper for more information about predicting and managing fragmentation in existing systems
Comparing data layout strategies
• Study the impact of
– Volume size
– Object size
– Workload
– Update strategies
– Maintenance tasks
– System implementation
• Need a metric that is independent of these factors
Related work
• Theoretical results
– Worst-case performance is unacceptable
– Average case is good for certain workloads
– Structure in deallocation requests leads to poor real-world performance
• Buddy system (sketch below)
– Places structural limitations on file layout
– Bounds fragmentation, but fails on large files
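For reference, the structural limitation a buddy system imposes, in a few lines (illustrative, not any specific allocator): every block is a power-of-two size, so a freed block can always coalesce with its buddy, but large objects pay heavy internal fragmentation:

    def buddy_round(size, min_block=4096):
        # Round a request up to the next power-of-two block, as a buddy
        # allocator does; a freed block merges with the buddy at addr ^ size.
        block = min_block
        while block < size:
            block *= 2
        return block

    # Why this fails on large files: a 9 MB object occupies a 16 MB
    # block, wasting 7 MB (internal fragmentation approaches 50%).
    assert buddy_round(9 * 2**20) == 16 * 2**20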
Introduction
• Content-rich web services require large, predictable and reliable storage
• Characterizing fragmentation behavior
• Opportunities for improvement
Data intensive web applications
• Simple data model (BLOBs)
– Hotmail: user mailbox
– Flickr: photograph(s)
• Replication
– Instead of backup
– Load balancing
– Scalability
[Figure: system architecture: clients → application servers → metadata DB and object stores; replication / data scrubbing across object stores]
Databases vs. Filesystems
• Manageability should be the primary concern
– No need for advanced storage features
– Disk bound
• Folklore
– File opens are slow
– Database interfaces stream data poorly
Clean system performance
• Single node
– Used network APIs
• Random workload
– Get/put one object at a time
• Large objects lead to sequential I/O
[Figure: read throughput (MB/sec) vs. object size (256K, 512K, 1M); SQL Server vs. NTFS]
Revisiting Fragmentation
• Data intensive web services
– Long term predictability
– Simple data model: get/put opaque objects
• Performance of existing systems
• Opportunities for improvement
Introduction
• Large object updates and web services
– Replication for scalability, reliability
– Get / put vs. partial updates
• Storage age
– Characterizing fragmentation behavior
– Comparing multiple approaches
• State-of-the-art approach:
– Lay out data without knowing final object size
– Change the interface?