Petal: Distributed Virtual Disks
Koh-i-Noor
Mark Manasse, Chandu Thekkath
Microsoft Research
Silicon Valley
Motivation
Large-scale storage systems can be
expensive to build and maintain:
Cost of managing the storage,
Cost of provisioning the storage.
Total cost of ownership is dominated by
management cost,
but provisioning 10,000 spindles worth of useful
data is expensive, especially with redundant
storage.
Rationale / Mathematics I
Hotmail has storage needs
approaching a petabyte.
Currently built from difficult-to-manage components.
1,000 10-disk RAID arrays
Requires significant backup
effort, incessant replacement
of failed drives, etc.
Reliability could be improved
by mirroring.
But 10,000 mirrored drives is
a lot of work to manage.
For either system, difficult to
expand gracefully, since the
storage is in small, isolated
clumps.
The Galois Field of order p^k
(for p prime) is formed by
considering polynomials in
(Z/pZ)[x] modulo a primitive
polynomial of degree k.
Facts
Any primitive polynomial will
do; all the resulting fields are
isomorphic.
We write GF(p^k) to denote one
such field.
x is a generator of the field.
Everything you know about
algebra is still true.
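As a concrete illustration of the construction, here is a minimal sketch that builds GF(2^3) as polynomials over Z/2Z reduced modulo the primitive polynomial x^3 + x + 1, and checks that x generates every non-zero element. The choice of p = 2, k = 3, and that particular polynomial is illustrative only, not taken from the talk.

```python
# Sketch: GF(2^3) as polynomials over Z/2Z modulo a primitive polynomial.
# Illustrative assumptions: p = 2, k = 3, primitive polynomial x^3 + x + 1.
# A polynomial is stored as a bit mask: bit i is the coefficient of x^i.

P = 0b1011           # x^3 + x + 1 (primitive, degree k = 3)
K = 3
FIELD_SIZE = 1 << K  # 2^3 = 8 elements

def gf_mul(a, b):
    """Multiply two field elements: polynomial product, reduced modulo P as we go."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # addition of coefficients mod 2 is XOR
        b >>= 1
        a <<= 1
        if a & FIELD_SIZE:       # degree reached k: subtract (XOR) the modulus
            a ^= P
    return result

# x (the element 0b010) should generate every non-zero element of the field.
powers, e = [], 1
for _ in range(FIELD_SIZE - 1):
    powers.append(e)
    e = gf_mul(e, 0b010)

assert sorted(powers) == list(range(1, FIELD_SIZE))   # all 7 non-zero elements appear
print("powers of x:", powers)                         # [1, 2, 4, 3, 6, 7, 5]
```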
Reducing the Total Cost of Ownership
Reduce management cost:
Use automatic management, automatic load balancing,
incremental system expansion, fast backup.
(Focus of earlier research and the Sepal project.)
Reduce provisioning cost:
Use redundancy schemes that tolerate multiple failed
components without the cost of duplicating them as is
done in mirroring or triplexing.
Rationale / Mathematics II
There are existing products that
improve storage management.
It would be difficult to use them
to build petabyte storage
systems that can support
something like Hotmail.
Need thousands of these
components and even with
squadrons of Perl and Java
programmers, it would be
difficult to configure and
manage such a system.
These products lack formal
management interfaces that can
be used by external programs
or each other to reliably manage
the system.
A Vandermonde matrix is of
the form

$$V = \begin{pmatrix}
1 & 1 & 1 & \cdots & 1\\
a & b & c & \cdots & z\\
a^2 & b^2 & c^2 & \cdots & z^2\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
a^k & b^k & c^k & \cdots & z^k
\end{pmatrix}$$

The determinant of a
Vandermonde matrix is
(a-b)(a-c)···(a-z) (b-c)···(b-z) ··· (y-z).
Large-Scale Mirroring
Assume disk mean-time-to-failure of 50 years.
Mirroring needs 20,000 disks for 10,000 data disks.
Expect 8 disks to fail every week.
Assume 1 day to repair a disk by copying from
mirror.
Mean time before data is unavailable (because
of a double failure) is 45 years.
Cost of storage is double what is actually
needed.
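These figures are quick arithmetic checks. The sketch below reproduces the roughly 8 failures per week and the ~45-year double-failure horizon, using the 1-day repair time assumed above and the k = 2, j = 2 case of the approximation formula derived later in Rationale IV.

```python
# Quick arithmetic check of the mirroring figures above.
MTTF_DAYS = 50 * 365     # per-disk mean time to failure: 50 years
MTTR_DAYS = 1            # repair by copying from the mirror: 1 day
N = 20_000               # total disks: 10,000 data disks, mirrored

# Expected disk failures per week across the whole installation.
failures_per_week = N * 7 / MTTF_DAYS
print(f"failures per week: {failures_per_week:.1f}")          # ~7.7, i.e. about 8

# Mean time to data unavailability: a pair is lost when the second disk fails
# during the 1-day repair of the first (k = 2, j = 2 in the Rationale IV formula:
# 2! * MTTF^2 / (N * (2 * MTTR)^(2-1))).
system_mttf_days = 2 * MTTF_DAYS**2 / (N * 2 * MTTR_DAYS)
print(f"time to data unavailability: {system_mttf_days / 365:.1f} years")   # ~45.6
```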
Rationale / Mathematics III
Goal: build a distributed block-level storage system that is
highly-available, scalable, and
easy to manage.
Design each component of the
system to automate or
eliminate many management
decisions that are today made
by human beings.
Automatically tolerate and
recover from exceptional
conditions such as component
failures that traditionally
require human intervention.
Facts about determinants
det(AB) = det(A) det(B)
det(A | ka) = k det(A | a)
Facts about GF(256)
a + b = a XOR b.
Every element other than 0 is x^k,
for some 0 ≤ k ≤ 255.
If a = x^k and b = x^j, then ab = x^(k+j).
With a 512-byte table of
logarithms and a 1024-byte table
of anti-logarithms, multiplication
becomes trivial.
A byte of data, viewed as an
element of GF(256), multiplied
and added with other bytes still
occupies a byte of storage.
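Here is a minimal sketch of the log/anti-log multiplication trick just described. The specific primitive polynomial (x^8 + x^4 + x^3 + x^2 + 1, i.e. 0x11D) is an illustrative assumption, since the talk only requires some primitive polynomial, and the table sizes used here (256-entry log, 510-entry anti-log) differ slightly from the byte counts quoted above.

```python
# Sketch of GF(256) multiplication via log/anti-log tables.
# Illustrative assumption: the primitive polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11D).

LOG = [0] * 256       # LOG[a] = k such that a = x^k (LOG[0] is unused)
ANTILOG = [0] * 510   # ANTILOG[k] = x^k for 0 <= k <= 508, so log sums need no reduction

e = 1
for k in range(255):
    LOG[e] = k
    ANTILOG[k] = e
    ANTILOG[k + 255] = e   # duplicate the cycle so LOG[a] + LOG[b] indexes directly
    e <<= 1                # multiply by x ...
    if e & 0x100:
        e ^= 0x11D         # ... and reduce modulo the primitive polynomial

def gf_mul(a, b):
    """Multiply two bytes viewed as elements of GF(256): add logs, take the anti-log."""
    if a == 0 or b == 0:
        return 0
    return ANTILOG[LOG[a] + LOG[b]]

# Addition is just XOR, and a product of bytes is still a byte.
assert gf_mul(0x02, 0x80) == 0x1D                               # x * x^7 = x^8
assert gf_mul(0x53, 0xCA) ^ 0x53 == gf_mul(0x53, 0xCA ^ 0x01)   # distributivity spot check
print("GF(256) multiplication via table lookups checks out.")
```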
Improving Time to Data Unavailability
Consider 10,000 disks, each with 50 years MTTF.
~ 50 years with mirroring at 2x provisioning cost.
May be too short a period at too high a price.
~ ½M years with triplexing at 3x provisioning cost.
Very long at very high price.
An alternative: use Reed-Solomon Erasure Codes.
40 clusters of 256 disks each.
Can tolerate triple (or more) failures.
~50K years before data unavailability.
~1.03x provisioning cost.
Rationale IV
Suppose the instantaneous probability of a disk being in a failed state is p
(computed as MTTR/MTTF).
The probability of k given disks all being failed is p^k.
The probability of finding k disks failed out of j is
(j C k) p^k = q.
Cluster MTTF = MTTR/q; with N total disks,
System MTTF ≈ k! MTTF^k / (N (j MTTR)^(k-1)) (if j >> k).
RAID sets
N=10,000, k=2, j=10, MTTF=50 years, MTTR=1 day, System
MTTF = 50^2 · 365^1 / 50,000 ≈ 9 years.
Duplexing
N=20,000, k=2, j=2, System MTTF = 50^2 · 365^1 / 20,000 years ≈ 45 years.
Triplexing
N=30,000, k=3, j=3, System MTTF = 50^3 · 365^2 / 30,000 years ≈ 555,000 years.
Erasure codes
N=10,000, k=4, j=256, System MTTF ≈ 50^4 · 365^3 · 24 / (256^3 · 10,000) years ≈ 43,500
years. Is that enough?
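A minimal numeric sketch of the approximation above, applied to the erasure-coded configuration; it reproduces the roughly 43,500-year figure. Nothing here goes beyond the formula and the parameters already listed, and the other rows drop out of the same function.

```python
from math import factorial

# System MTTF ≈ k! * MTTF^k / (N * (j * MTTR)^(k-1)), as stated above.
def system_mttf_years(n_disks, k, j, mttf_years=50, mttr_days=1):
    mttf_days = mttf_years * 365
    sys_days = factorial(k) * mttf_days**k / (n_disks * (j * mttr_days) ** (k - 1))
    return sys_days / 365

# Erasure codes: data is lost only when k = 4 disks of one 256-disk cluster are down at once.
print(f"{system_mttf_years(10_000, k=4, j=256):,.0f} years")   # ~43,500 years
```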
Inductive step proving the determinant of a Vandermonde matrix is the product of
differences: subtract x_{k+1} times each row from the row below it. The last column
becomes (1, 0, ..., 0)^T, so the determinant picked up there is 1. Expand on the last
column; after removing the common factors from the remaining columns, what's left is V_k.

$$\det(V_{k+1}) = \det\begin{pmatrix}
1 & 1 & \cdots & 1 & 1\\
x_1 - x_{k+1} & x_2 - x_{k+1} & \cdots & x_k - x_{k+1} & 0\\
x_1^2 - x_1 x_{k+1} & x_2^2 - x_2 x_{k+1} & \cdots & x_k^2 - x_k x_{k+1} & 0\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
x_1^k - x_1^{k-1} x_{k+1} & x_2^k - x_2^{k-1} x_{k+1} & \cdots & x_k^k - x_k^{k-1} x_{k+1} & 0
\end{pmatrix}$$

$$= \det\begin{pmatrix}
1 & 1 & \cdots & 1 & 1\\
(x_1 - x_{k+1}) \cdot 1 & (x_2 - x_{k+1}) \cdot 1 & \cdots & (x_k - x_{k+1}) \cdot 1 & 0\\
(x_1 - x_{k+1})\, x_1 & (x_2 - x_{k+1})\, x_2 & \cdots & (x_k - x_{k+1})\, x_k & 0\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
(x_1 - x_{k+1})\, x_1^{k-1} & (x_2 - x_{k+1})\, x_2^{k-1} & \cdots & (x_k - x_{k+1})\, x_k^{k-1} & 0
\end{pmatrix}$$

$$= (x_1 - x_{k+1})(x_2 - x_{k+1}) \cdots (x_k - x_{k+1})\, \det(V_k)
\quad \text{(up to sign, which never matters for invertibility).}$$
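The recursion can be spot-checked numerically. This is a minimal sketch with arbitrary illustrative integer nodes (2, 3, 5, 7) and a brute-force Leibniz determinant; none of the specifics come from the talk.

```python
from itertools import permutations
from math import prod

# Spot check: det(Vandermonde) equals the product of pairwise differences.
nodes = [2, 3, 5, 7]                              # any distinct values work
k = len(nodes)
V = [[x**r for x in nodes] for r in range(k)]     # rows are powers 0..k-1, as above

def sign(perm):
    """Parity of a permutation, counted by inversions."""
    s = 1
    for i in range(len(perm)):
        for j in range(i + 1, len(perm)):
            if perm[i] > perm[j]:
                s = -s
    return s

# Leibniz expansion of the determinant (fine for a 4x4 example).
det = sum(sign(p) * prod(V[r][p[r]] for r in range(k)) for p in permutations(range(k)))
diffs = prod(nodes[j] - nodes[i] for i in range(k) for j in range(i + 1, k))

assert det == diffs
print(det, diffs)                                 # both 240 for nodes 2, 3, 5, 7
```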
Reed-Solomon Erasure Codes
1. We use an n×(n+k) coding matrix to store data on n data disks and k check disks
(k = 3 in our example):

$$\begin{pmatrix}
1 & 0 & 0 & \cdots & 0\\
0 & 1 & 0 & \cdots & 0\\
0 & 0 & 1 & \cdots & 0\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
0 & 0 & 0 & \cdots & 1\\
1 & 1 & 1 & \cdots & 1\\
1 & x & x^2 & \cdots & x^n\\
1 & x^2 & x^4 & \cdots & x^{2n}
\end{pmatrix}
\begin{pmatrix} d_1\\ d_2\\ d_3\\ \vdots\\ d_n \end{pmatrix}
=
\begin{pmatrix} d_1\\ d_2\\ d_3\\ \vdots\\ d_n\\ c_1\\ c_2\\ c_3 \end{pmatrix}$$

2. Suppose data disks 2, 3 and check disk 3 fail.

3. Omitting failed rows, we get an invertible n×n matrix R:

$$\begin{pmatrix}
1 & 0 & 0 & 0 & 0 & \cdots & 0\\
0 & 0 & 0 & 1 & 0 & \cdots & 0\\
0 & 0 & 0 & 0 & 1 & \cdots & 0\\
\vdots & & & & & \ddots & \vdots\\
0 & 0 & 0 & 0 & 0 & \cdots & 1\\
1 & 1 & 1 & 1 & 1 & \cdots & 1\\
1 & x & x^2 & x^3 & x^4 & \cdots & x^n
\end{pmatrix}
\begin{pmatrix} d_1\\ d_2\\ d_3\\ \vdots\\ d_n \end{pmatrix}
=
\begin{pmatrix} d_1\\ d_4\\ d_5\\ \vdots\\ d_n\\ c_1\\ c_2 \end{pmatrix}$$

4. Multiplying both sides by R⁻¹, we recover all the data:

$$\begin{pmatrix} d_1\\ d_2\\ d_3\\ \vdots\\ d_n \end{pmatrix}
=
\begin{pmatrix}
1 & 0 & 0 & 0 & 0 & \cdots & 0\\
0 & 0 & 0 & 1 & 0 & \cdots & 0\\
0 & 0 & 0 & 0 & 1 & \cdots & 0\\
\vdots & & & & & \ddots & \vdots\\
0 & 0 & 0 & 0 & 0 & \cdots & 1\\
1 & 1 & 1 & 1 & 1 & \cdots & 1\\
1 & x & x^2 & x^3 & x^4 & \cdots & x^n
\end{pmatrix}^{-1}
\begin{pmatrix} d_1\\ d_4\\ d_5\\ \vdots\\ d_n\\ c_1\\ c_2 \end{pmatrix}$$
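The four steps above can be exercised end to end in a few dozen lines. This is a minimal sketch rather than the talk's implementation: it assumes GF(256) with primitive polynomial 0x11D, n = 5 data disks, k = 3 check disks, one byte per disk (a real system would apply the same arithmetic byte-wise across whole blocks), the x^(t·j) check coefficients of the k = 3 code described in Rationale Vb, and a brute-force field inverse.

```python
# Sketch of the Reed-Solomon erasure-code walk-through above, over GF(256).
# Illustrative assumptions: primitive polynomial 0x11D, n = 5 data disks,
# k = 3 check disks, one byte per disk.

def gf_mul(a, b, poly=0x11D):
    """Multiply two bytes as elements of GF(256)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= poly
    return r

def gf_inv(a):
    """Multiplicative inverse, by brute force (fine for a sketch)."""
    return next(b for b in range(1, 256) if gf_mul(a, b) == 1)

def mat_vec(M, v):
    """Matrix-vector product over GF(256); addition is XOR."""
    out = []
    for row in M:
        acc = 0
        for coef, val in zip(row, v):
            acc ^= gf_mul(coef, val)
        out.append(acc)
    return out

def mat_inv(M):
    """Invert a square matrix over GF(256) by Gauss-Jordan elimination."""
    n = len(M)
    A = [row[:] + [int(i == j) for j in range(n)] for i, row in enumerate(M)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        scale = gf_inv(A[col][col])
        A[col] = [gf_mul(scale, v) for v in A[col]]
        for r in range(n):
            if r != col and A[r][col]:
                f = A[r][col]
                A[r] = [v ^ gf_mul(f, w) for v, w in zip(A[r], A[col])]
    return [row[n:] for row in A]

n, k, x = 5, 3, 0x02

def x_pow(e):
    v = 1
    for _ in range(e):
        v = gf_mul(v, x)
    return v

# Step 1: the coding matrix is the identity over the data rows plus k = 3 check rows
# whose entry for data disk j is x^(t*j) for check row t (i.e. 1, x^j, x^(2j)).
G = [[int(i == j) for j in range(n)] for i in range(n)]
G += [[x_pow(t * j) for j in range(n)] for t in range(k)]

data = [0x01, 0x55, 0xAA, 0x10, 0xFE]      # arbitrary example bytes, one per data disk
stored = mat_vec(G, data)                   # d_1..d_n followed by c_1..c_3

# Step 2: suppose data disks 2, 3 and check disk 3 fail (1-indexed, as on the slide).
surviving = [0, 3, 4, n + 0, n + 1]
R = [G[i] for i in surviving]               # Step 3: an invertible n-by-n matrix
survivors = [stored[i] for i in surviving]

# Step 4: multiplying by R^-1 recovers all the data.
recovered = mat_vec(mat_inv(R), survivors)
assert recovered == data
print("recovered:", [hex(b) for b in recovered])
```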
Rationale / Mathematics Va
The matrices were transposed
so that the data would be in
column vectors, which fits
better on the slide.
The correction matrix in the
example uses my special
erasure code for k = 3.
The correction rows are taken
from a Vandermonde matrix.
As you might recognize, had
you been reading the other
half of the slides.
For general k, we take a
Vandermonde matrix using
elements 0, 1, …, 255.
The product of element
differences is non-zero,
because all the individual
element differences are nonzero.
Make it tall enough for all the
data disks.
Diagonalize the square part,
and use the remaining
columns for check disks.
Row operations change the
determinant in simple ways.
Rationale / Mathematics Vb
Computing check digits is easy,
and well-parallelized.
To update a block on data disk j,
compute the XOR of the old block
with the new block.
Multicast the log of the XOR
together with j to the check disks
(and have them start reading the
block off disk).
Each check disk multiplies the
XOR’ed data block by the j’th
entry in their column, XOR’s that
with the old check value, and
writes the new check block.
Rotate which disks are check
disks, making the load uniform.
No difference in latency compared
to RAID-5 (but double the disk
bandwidth).
For k=3, we can just take the
identity matrix plus three columns
of Vandermonde, using 1, x, x2.
Computation of check digits is
faster: the logarithm for the k’th
check disk and j’th data disk is kj.
We omit non-failed data-disks in
computing determinants.
What remains is a small minor of
the whole matrix.
1x1 works because all entries are
non-zero.
3x3 works because it’s a
transposed Vandermonde matrix.
2x2 works because it’s either a
Vandermonde matrix, or a
Vandermonde matrix with columns
multiplied by powers of x.
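The update path above is easy to express directly. A minimal sketch, assuming 4 data disks, single-byte blocks, GF(256) with primitive polynomial 0x11D, and the x^(kj) check coefficients mentioned above; the multicast and the disk reads and writes are elided, and the from-scratch recomputation is included only to verify the shortcut.

```python
# Sketch of the incremental check-block update described above.

def gf_mul(a, b, poly=0x11D):
    """Multiply two bytes as elements of GF(256); 0x11D is an illustrative choice."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= poly
    return r

def x_pow(e, x=0x02):
    v = 1
    for _ in range(e):
        v = gf_mul(v, x)
    return v

n, k = 4, 3
coeff = [[x_pow(t * j) for j in range(n)] for t in range(k)]   # check row t, data disk j

def compute_checks(data):
    """Recompute every check block from scratch (used here only to verify the shortcut)."""
    checks = [0] * k
    for t in range(k):
        for j, d in enumerate(data):
            checks[t] ^= gf_mul(coeff[t][j], d)
    return checks

data = [0x11, 0x22, 0x33, 0x44]
checks = compute_checks(data)

# Update data disk j = 2 with a new block.
j, new_block = 2, 0x99
delta = data[j] ^ new_block                  # XOR of the old block with the new block
data[j] = new_block
for t in range(k):                           # each check disk works independently ...
    checks[t] ^= gf_mul(coeff[t][j], delta)  # ... scaling delta and folding it into its old value

assert checks == compute_checks(data)        # same result as re-reading every data disk
print("incrementally updated checks:", [hex(c) for c in checks])
```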
Parallel Reconstruction of a Failed Disk
Given the inverse matrix (as above), a failed
disk is the sum (exclusive or) of products of the
individual bytes on data and check disks with
the elements of the matrix.
The example below shows reconstruction of
disk 2, in a system with 4 data disks and 2
check disks; disks 2 and 3 failed.
D2 ← XOR
XOR
D1 × (1+x+x2+x3+x7)
XOR
D4 × x
C1 × (1+x2+x4+x5+x6+x7)
C2 × (x+x3+x4+x5+x6)
Rationale / Mathematics VI
The reconstruction on the
previous slide requires:
Reading all of the surviving
disks.
Parallel multiplication by
precomputed entries from the
inverse matrix.
XOR of pairs of disks entries
working up a binary tree.
A network of small, cheap
switches provides the
necessary connectivity.
Put hot spare CPUs and
drives into network; skip up
tree at nodes with no useful
partner.
Given the nature of the matrix, the
inverse matrix can be
constructed via Gaussian
elimination very quickly for
small values of k.
With more complicated erasure
coding (2D parity, for
example), it’s harder to see
how to wire the network to
accommodate hot-spares
without requiring more from the
interconnect.
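The XOR-of-pairs aggregation described above is easy to sketch. The block contents below are arbitrary placeholders; in the real reconstruction they would be the surviving disks' blocks already multiplied by the precomputed inverse-matrix entries.

```python
from functools import reduce

# Sketch of combining partial products pairwise up a binary tree of switches/CPUs.

def xor_blocks(a, b):
    """XOR two equal-length blocks, as one tree node would."""
    return bytes(p ^ q for p, q in zip(a, b))

def tree_xor(blocks):
    """Combine blocks two at a time, level by level; an unpaired node skips up a level."""
    level = list(blocks)
    while len(level) > 1:
        nxt = [xor_blocks(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])     # no useful partner at this level: pass it straight up
        level = nxt
    return level[0]

# Five placeholder contributions of a 4-byte block each (e.g. scaled reads from
# D1, D4, C1, C2 plus a hot spare's partial result).
contributions = [bytes([0x01, 0x02, 0x03, 0x04]),
                 bytes([0x10, 0x20, 0x30, 0x40]),
                 bytes([0xAA, 0xBB, 0xCC, 0xDD]),
                 bytes([0x0F, 0xF0, 0x0F, 0xF0]),
                 bytes([0x55, 0x55, 0x55, 0x55])]

assert tree_xor(contributions) == reduce(xor_blocks, contributions)   # same XOR, any order
print(tree_xor(contributions).hex())
```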
Related Work
Reed-Solomon error correcting codes
have been used for single disks and CDs.
Reed-Solomon and Even-Odd erasure
codes have been proposed:
For software RAID [ABC],
For wide area storage systems [LANL].
Our usage of the code is different.
Parallel reconstruction is novel.
Patents in progress on both.
Potential deficiencies
This could be a bad idea, depending on what you’re
trying to do:
The bandwidth into the top of the tree isn't enough to keep all
the disk arms busy with small reads and writes.
This could be a bad disk for building databases.
The total disk bandwidth at least doubles during degraded
operation, which might affect many more disks than with
mirroring or standard RAID-5.
Failures may not be independent, invalidating predicted
reliability.
CPUs (which, if you attach multiple disks, should serve disks in
different pods) may fail too often, in ways that rebooting doesn’t
fix.
Power supplies, cables, switches, … may fail, which we haven’t
accounted for.
Writes are going to be relatively expensive.
Comparisons and future work
Other people have already built small
erasure-coded arrays. Is there enough
new here? We think so.
Real systems “needs”:
Backup,
Geographical redundancy.