Petal: Distributed Virtual Disks


Koh-i-Noor
Mark Manasse, Chandu Thekkath
Microsoft Research
Silicon Valley
3/26/2016
Motivation
Large-scale storage systems can be expensive to build and maintain:
Cost of managing the storage,
Cost of provisioning the storage.
Total cost of ownership is dominated by management cost, but provisioning 10,000 spindles worth of useful data is expensive, especially with redundant storage.
Rationale / Mathematics I
 Hotmail has storage needs
approaching a petabyte.
 Currently built from difficult-to-manage components.
 1,000 10-disk RAID arrays
 Requires significant backup
effort, incessant replacement
of failed drives, etc.
 Reliability could be improved
by mirroring.
 But 10,000 mirrored drives is
a lot of work to manage.
 For either system, difficult to
expand gracefully, since the
storage is in small, isolated
clumps.
 The Galois field of order p^k
(for p prime) is formed by
considering polynomials in
(Z/pZ)[x] modulo a primitive
polynomial of degree k.
 Facts
 Any primitive polynomial will
do; all the resulting fields are
isomorphic.
 We write GF(p^k) to denote one
such field.
 x is a generator of the field.
 Everything you know about
algebra is still true.
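
As a concrete illustration of this construction, here is a minimal Python sketch (my addition; the slides contain no code). It builds GF(2^3) from 3-bit polynomials over Z/2Z reduced modulo the primitive polynomial x^3 + x + 1 (an arbitrary valid choice) and checks that x generates every non-zero element.

```python
# Minimal sketch: GF(2^3) as polynomials over Z/2Z modulo x^3 + x + 1.
# The primitive polynomial (0b1011) is an assumption for illustration;
# any primitive polynomial of degree 3 gives an isomorphic field.

P = 0b1011        # x^3 + x + 1
DEG = 3

def gf8_mul(a, b):
    """Carry-less polynomial multiplication followed by reduction mod P."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & (1 << DEG):   # degree reached 3: subtract (XOR) the modulus
            a ^= P
    return result

# x (the polynomial "x", encoded as 0b010) generates all 7 non-zero elements.
powers, v = [], 1
for _ in range(7):
    powers.append(v)
    v = gf8_mul(v, 0b010)
assert sorted(powers) == list(range(1, 8))
assert v == 1                     # x^7 == 1: the multiplicative order of x is 7
print("powers of x:", powers)
```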
Reducing the Total Cost of Ownership
 Reduce management cost:
Use automatic management, automatic load balancing,
incremental system expansion, fast backup.
 (Focus of earlier research and the Sepal project.)
 Reduce provisioning cost:
Use redundancy schemes that tolerate multiple failed
components without the cost of duplicating them as is
done in mirroring or triplexing.
Rationale / Mathematics II
 There are existing products that
improve storage management.
 It would be difficult to use them
to build petabyte storage
systems that can support
something like Hotmail.
 Need thousands of these
components, and even with
squadrons of Perl and Java
programmers, it would be
difficult to configure and
manage such a system.
 These products lack formal
management interfaces that can
be used by external programs
or each other to reliably manage
the system.
 A Vandermonde matrix is of
the form

$$\begin{pmatrix}
1 & 1 & 1 & \cdots & 1\\
a & b & c & \cdots & z\\
a^2 & b^2 & c^2 & \cdots & z^2\\
\vdots & \vdots & \vdots & & \vdots\\
a^k & b^k & c^k & \cdots & z^k
\end{pmatrix}$$

 The determinant of a
Vandermonde matrix is
(a-b)(a-c)···(a-z)(b-c)···(b-z)···(y-z).
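
A quick numeric sanity check of the product-of-differences formula, as a Python sketch (the 4×4 example and its values are arbitrary, chosen only for illustration):

```python
from itertools import permutations
from math import prod

def det(m):
    """Determinant by the Leibniz formula (fine for small matrices)."""
    n = len(m)
    total = 0
    for perm in permutations(range(n)):
        sign = 1
        for i in range(n):                 # count inversions to get the sign
            for j in range(i + 1, n):
                if perm[i] > perm[j]:
                    sign = -sign
        total += sign * prod(m[i][perm[i]] for i in range(n))
    return total

xs = [2, 3, 5, 7]                                    # arbitrary distinct a, b, c, d
V = [[x ** r for x in xs] for r in range(len(xs))]   # rows 1, x, x^2, x^3

# Product of pairwise differences, matching the slide's (a-b)(a-c)...(y-z) up to sign.
diffs = prod(xs[i] - xs[j] for i in range(len(xs)) for j in range(i + 1, len(xs)))
assert abs(det(V)) == abs(diffs)
print(det(V), diffs)
```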
Large-Scale Mirroring
 Assume disk mean-time-to-failure of 50 years.
Mirroring needs 20,000 disks for 10,000 data disks.
Expect 8 disks to fail every week.
 Assume 1 day to repair a disk by copying from
mirror.
 Mean time before data is unavailable (because
of a double failure) is 45 years.
 Cost of storage is double what is actually
needed.
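
The arithmetic behind these bullets, as a small Python sketch (the 50-year MTTF, 1-day repair window, and independent-failure assumptions are the slide's):

```python
# Back-of-the-envelope check of the mirroring numbers on this slide.
disks = 20_000            # 10,000 data disks, each mirrored
mttf_years = 50.0         # per-disk mean time to failure
mttr_years = 1.0 / 365.0  # one day to rebuild from the mirror

failures_per_week = disks / mttf_years / 52
print(f"expected failures per week: {failures_per_week:.1f}")   # ~7.7, i.e. about 8

# A pair loses data when a disk fails (rate 2/MTTF) and its mirror also fails
# during the repair window (probability ~MTTR/MTTF, assuming independence).
pair_mttdl = mttf_years ** 2 / (2 * mttr_years)
system_mttdl = pair_mttdl / (disks / 2)    # 10,000 independent mirrored pairs
print(f"mean time to data unavailability: {system_mttdl:.0f} years")  # ~46
```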
Rationale / Mathematics III
 Goal: build a distributed
block-level storage system that
is highly available, scalable,
and easy to manage.
 Design each component of the
system to automate or
eliminate many management
decisions that are today made
by human beings.
 Automatically tolerate and
recover from exceptional
conditions such as component
failures that traditionally
require human intervention.
 Facts about determinants
 det(AB) = det(A) det(B)
 det(A | ka) = k det(A | a), i.e.,
scaling one column scales the
determinant by the same factor.
 Facts about GF(256)
 a + b = a XOR b.
 Every element other than 0 is x^k,
for some 0≤k≤255.
 If a=x^k and b=x^j, then ab=x^(k+j).
 With a 512-byte table of
logarithms and a 1024-byte table
of anti-logarithms, multiplication
becomes trivial.
 A byte of data, viewed as an
element of GF(256), multiplied
and added with other bytes still
occupies a byte of storage.
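
A sketch of the table-driven GF(256) arithmetic described above, in Python. The primitive polynomial 0x11d is an illustrative assumption (the slide does not name one), and the tables here are plain Python lists rather than the packed 512- and 1024-byte tables mentioned above.

```python
# GF(256) arithmetic with log/antilog tables, as described above.
# Assumption: reduce modulo the primitive polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11d);
# any primitive polynomial of degree 8 gives an isomorphic field.

EXP = [0] * 512          # antilog table, doubled so LOG[a] + LOG[b] never needs a mod
LOG = [0] * 256
v = 1
for i in range(255):
    EXP[i] = v
    LOG[v] = i
    v <<= 1              # multiply by x
    if v & 0x100:
        v ^= 0x11D       # reduce modulo the primitive polynomial
for i in range(255, 512):
    EXP[i] = EXP[i - 255]

def gf_add(a, b):
    return a ^ b         # addition (and subtraction) is XOR

def gf_mul(a, b):
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]   # x^i * x^j = x^(i+j)

# Spot checks: multiplication distributes over XOR, and every non-zero byte is a power of x.
assert gf_mul(0x57, gf_add(0x13, 0x42)) == gf_add(gf_mul(0x57, 0x13), gf_mul(0x57, 0x42))
assert sorted(EXP[:255]) == list(range(1, 256))
print(gf_mul(0x57, 0x13))
```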
Improving Time to Data Unavailability
 Consider 10,000 disks, each with 50 years MTTF.
 ~ 50 years with mirroring at 2x provisioning cost.
 May be too short a period at too high a price.
 ~ ½M years with triplexing at 3x provisioning cost.
 Very long at very high price.
 An alternative: use Reed-Solomon Erasure Codes.
 40 clusters of 256 disks each.
 Can tolerate triple (or more) failures.
 ~50K years before data unavailability.
 ~1.03x provisioning cost.
Rationale IV
 Suppose the instantaneous probability of a disk being in a failed state is p
(computed as MTTR/MTTF).
 The probability of k specific disks all being failed is p^k.
 The probability of finding k disks failed out of j is
C(j,k)·p^k = q.
 Cluster MTTF = MTTR/q; with N total disks,
System MTTF ≈ k!·MTTF^k / (N·(j·MTTR)^(k-1)) (if j >> k).
 RAID sets
 N=10,000, k=2, j=10, MTTF=50 years, MTTR=1 day; System MTTF = 50^2·365^1/50,000 ≈ 9 years.
 Duplexing
 N=20,000, k=2, j=2; System MTTF = 50^2·365^1/20,000 years ≈ 45 years.
 Triplexing
 N=30,000, k=3, j=3; System MTTF = 50^3·365^2/30,000 years ≈ 555,000 years.
 Erasure codes
 N=10,000, k=4, j=256; System MTTF ≈ 50^4·365^3·24/(256^3·10,000) years ≈ 43,500
years. Is that enough?
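
These approximations are easy to reproduce; a Python sketch follows (the MTTF and MTTR values are the slide's; a couple of the rows above quote slightly different figures, presumably computed with a slightly different simplification, so the printed values should be read as the formula's direct output):

```python
import math

def system_mttf_years(N, k, j, mttf=50.0, mttr_days=1.0):
    """Approximate mean time (years) until some group of j disks has k concurrent
    failures, with N disks in total: k! * MTTF^k / (N * (j * MTTR)^(k-1)),
    valid when j >> k and failures are independent."""
    mttr = mttr_days / 365.0
    return math.factorial(k) * mttf ** k / (N * (j * mttr) ** (k - 1))

print(system_mttf_years(10_000, k=2, j=10))    # RAID sets of 10:   ~18 years
print(system_mttf_years(20_000, k=2, j=2))     # duplexing:         ~46 years
print(system_mttf_years(30_000, k=3, j=3))     # triplexing:        ~370,000 years
print(system_mttf_years(10_000, k=4, j=256))   # 256-disk clusters: ~43,500 years
```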
 1

  xk 1
 0

det(Vk 1 )  det  .
 .

 .

 0
0
1
 xk 1
.
.
.
0
.
.
.
.
.
.
.
.
.
.
.
.
.
.
1
1
.


x2  xk 1
.
 x1  xk 1
 x2  x x
x22  x2 xk 1 .
1
1 k 1
 3
det  x1  x12 xk 1
x23  x22 xk 1 .

.
.
.


.
.
.
 k 1 k
k 1
k
 x1  x1 xk 1 x2  x2 xk 1 .
1
1


( x2  xk 1 )1
 ( x1  xk 1 )1
 (x  x )x (x  x )x
2
k 1
2
 1 k 1 1
 det 
.
.

.
.


.
.

k
k
 ( x1  xk 1 ) x1 ( x2  xk 1 ) x2
.
0
.
0
.
0
.
.
.
.
.
1
.  xk 1
.
.
.
.
.
.
.
.
0

0
0

.  det(Vk 1 ) 
. 
.

1
1
xk  xk 1
xk2  xk xk 1
xk3  xk2 xk 1
. .
.
. .
.
. . xkk 1  xkk xk 1
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
1
. ( xk  xk 1 )1
. ( xk  xk 1 ) xk
.
.
.
.
.
.
. ( xk  xk 1 ) xkk
1


xk 1  xk 1 
xk21  xk 1 xk 1 

3
2
xk 1  xk 1 xk 1 

.


.

xkk11  xkk1 xk 1 
Inductive step proving
the determinant of a
Vandermonde matrix
is the product of
differences.
Determinant here is 1.
Expand on last
column, after
removing common
factors, what’s left if
Vk.
1

0
0

.   ( x1  xk 1 )( x2  xk 1 )...( xk  xk 1 ) det(Vk )
. 
.
10

0
Reed-Solomon Erasure Codes
1. We use an n×(n+k) coding matrix to store data on n data disks and k check disks (k = 3 in our example):

$$\begin{pmatrix}
1 & 0 & 0 & \cdots & 0\\
0 & 1 & 0 & \cdots & 0\\
0 & 0 & 1 & \cdots & 0\\
\vdots & & & \ddots & \vdots\\
0 & 0 & 0 & \cdots & 1\\
1 & 1 & 1 & \cdots & 1\\
1 & x & x^2 & \cdots & x^{n}\\
1 & x^2 & x^4 & \cdots & x^{2n}
\end{pmatrix}
\begin{pmatrix} d_1\\ d_2\\ d_3\\ \vdots\\ d_n \end{pmatrix}
=
\begin{pmatrix} d_1\\ d_2\\ d_3\\ \vdots\\ d_n\\ c_1\\ c_2\\ c_3 \end{pmatrix}$$

2. Suppose data disks 2, 3 and check disk 3 fail.

3. Omitting the failed rows, we get an invertible n×n matrix R:

$$\begin{pmatrix}
1 & 0 & 0 & 0 & 0 & \cdots & 0\\
0 & 0 & 0 & 1 & 0 & \cdots & 0\\
0 & 0 & 0 & 0 & 1 & \cdots & 0\\
\vdots & & & & & \ddots & \vdots\\
0 & 0 & 0 & 0 & 0 & \cdots & 1\\
1 & 1 & 1 & 1 & 1 & \cdots & 1\\
1 & x & x^2 & x^3 & x^4 & \cdots & x^{n}
\end{pmatrix}
\begin{pmatrix} d_1\\ d_2\\ d_3\\ \vdots\\ d_n \end{pmatrix}
=
\begin{pmatrix} d_1\\ d_4\\ d_5\\ \vdots\\ d_n\\ c_1\\ c_2 \end{pmatrix}$$

4. Multiplying both sides by R⁻¹, we recover all the data:

$$\begin{pmatrix} d_1\\ d_2\\ d_3\\ \vdots\\ d_n \end{pmatrix}
=
\begin{pmatrix}
1 & 0 & 0 & 0 & 0 & \cdots & 0\\
0 & 0 & 0 & 1 & 0 & \cdots & 0\\
0 & 0 & 0 & 0 & 1 & \cdots & 0\\
\vdots & & & & & \ddots & \vdots\\
0 & 0 & 0 & 0 & 0 & \cdots & 1\\
1 & 1 & 1 & 1 & 1 & \cdots & 1\\
1 & x & x^2 & x^3 & x^4 & \cdots & x^{n}
\end{pmatrix}^{-1}
\begin{pmatrix} d_1\\ d_4\\ d_5\\ \vdots\\ d_n\\ c_1\\ c_2 \end{pmatrix}$$
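
A worked sketch of steps 1–4 in Python. Everything not on the slide is an illustrative assumption: the primitive polynomial 0x11d, check row m using coefficient x^(m·j) for data disk j (0-based), n = 4 data disks, and one byte per disk.

```python
# Sketch of the encode / erase / recover cycle above, over GF(256).

EXP, LOG = [0] * 512, [0] * 256
v = 1
for i in range(255):
    EXP[i], LOG[v] = v, i
    v <<= 1
    if v & 0x100:
        v ^= 0x11D                         # assumed primitive polynomial
for i in range(255, 512):
    EXP[i] = EXP[i - 255]

def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def gf_inv(a):
    return EXP[255 - LOG[a]]

def reduce_xor(values):
    acc = 0
    for x in values:
        acc ^= x
    return acc

def mat_vec(m, x):
    return [reduce_xor(gf_mul(row[c], x[c]) for c in range(len(x))) for row in m]

def invert(m):
    """Gauss-Jordan inversion over GF(256)."""
    n = len(m)
    aug = [row[:] + [int(i == j) for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        piv = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[piv] = aug[piv], aug[col]
        scale = gf_inv(aug[col][col])
        aug[col] = [gf_mul(scale, e) for e in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [e ^ gf_mul(f, aug[col][c]) for c, e in enumerate(aug[r])]
    return [row[n:] for row in aug]

n, k = 4, 3
# 1. Coding matrix: identity on top, then check row m with entries x^(m*j).
G = [[int(r == c) for c in range(n)] for r in range(n)]
G += [[EXP[(m * j) % 255] for j in range(n)] for m in range(k)]

data = [0x12, 0x34, 0x56, 0x78]
stored = mat_vec(G, data)              # n data bytes followed by k check bytes

# 2. Suppose data disks 2, 3 and check disk 3 fail (1-based, as on the slide).
failed_rows = {1, 2, n + 2}            # 0-based row indices into G / stored
survivors = [r for r in range(n + k) if r not in failed_rows][:n]

# 3. Omitting failed rows gives an invertible n x n matrix R.
R = [G[r] for r in survivors]
y = [stored[r] for r in survivors]

# 4. Multiplying by R^-1 recovers all the data.
recovered = mat_vec(invert(R), y)
assert recovered == data
print("recovered:", [hex(b) for b in recovered])
```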
Rationale / Mathematics Va
 The matrices were transposed
so that the data would be in
column vectors, which fits
better on the slide.
 The correction matrix in the
example uses my special
erasure code for k = 3.
 The correction rows are taken
from a Vandermonde matrix.
 As you might recognize, had
you been reading the other
half of the slides.
 For general k, we take a
Vandermonde matrix using
elements 0, 1, …, 255.
 The product of element
differences is non-zero,
because all the individual
element differences are nonzero.
 Make it tall enough for all the
data disks.
 Diagonalize the square part,
and use the remaining
columns for check disks.
 Row operations change the
determinant in simple ways.
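
A sketch of that general-k construction in Python (the primitive polynomial 0x11d and the choice of evaluation elements 0 … n+k−1 out of 0 … 255 are illustrative assumptions):

```python
# Sketch: build an n x (n+k) systematic coding matrix by taking n rows of a
# Vandermonde matrix over GF(256) and diagonalizing the first n x n block.

EXP, LOG = [0] * 512, [0] * 256
v = 1
for i in range(255):
    EXP[i], LOG[v] = v, i
    v <<= 1
    if v & 0x100:
        v ^= 0x11D                         # assumed primitive polynomial
for i in range(255, 512):
    EXP[i] = EXP[i - 255]

def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def gf_inv(a):
    return EXP[255 - LOG[a]]

def gf_pow(a, e):
    if e == 0:
        return 1
    return 0 if a == 0 else EXP[(LOG[a] * e) % 255]

def coding_matrix(n, k):
    # Vandermonde rows: row r holds element_c^r for each of the n+k elements.
    m = [[gf_pow(c, r) for c in range(n + k)] for r in range(n)]
    # Row-reduce so the left n x n block becomes the identity; row operations
    # scale every n x n column-submatrix determinant by the same non-zero
    # factor, so "any n surviving columns are invertible" is preserved.
    for col in range(n):
        piv = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[piv] = m[piv], m[col]
        scale = gf_inv(m[col][col])
        m[col] = [gf_mul(scale, e) for e in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                f = m[r][col]
                m[r] = [e ^ gf_mul(f, m[col][c]) for c, e in enumerate(m[r])]
    return m

M = coding_matrix(n=4, k=3)
for row in M:
    print(row)   # left 4x4 block is the identity; the last 3 columns are check coefficients
```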
Rationale / Mathematics Vb
 Computing check digits is easy
and parallelizes well.
 To update a block on data disk j,
compute the XOR of the old block
with the new block.
 Multicast the log of the XOR
together with j to the check disks
(and have them start reading the
block off disk).
 Each check disk multiplies the
XOR’ed data block by the j’th
entry in their column, XOR’s that
with the old check value, and
writes the new check block.
 Rotate which disks are check
disks, making the load uniform.
 No difference in latency compared
to RAID-5 (but double the disk
bandwidth).
 For k=3, we can just take the
identity matrix plus three columns
of Vandermonde, using 1, x, x^2.
 Computation of check digits is
faster: the logarithm for the k’th
check disk and j’th data disk is kj.
 We omit non-failed data-disks in
computing determinants.
 What remains is a small minor of
the whole matrix.
 1x1 works because all entries are
non-zero.
 3x3 works because it's a
transposed Vandermonde matrix.
 2x2 works because it’s either a
Vandermonde matrix, or a
Vandermonde matrix with columns
multiplied by powers of x.
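
The incremental update path described at left, as a Python sketch (same illustrative assumptions as before: primitive polynomial 0x11d, check coefficients x^(m·j), one byte per block):

```python
# Sketch of the incremental check update: on a write to data disk j, ship
# delta = old XOR new to each check disk, which scales it by its own
# coefficient for disk j and folds it into the stored check block.

EXP, LOG = [0] * 512, [0] * 256
v = 1
for i in range(255):
    EXP[i], LOG[v] = v, i
    v <<= 1
    if v & 0x100:
        v ^= 0x11D                         # assumed primitive polynomial
for i in range(255, 512):
    EXP[i] = EXP[i - 255]

def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def check_coeff(m, j):
    return EXP[(m * j) % 255]              # x^(m*j), the assumed coefficient

def full_check(m, data):
    acc = 0
    for j, d in enumerate(data):
        acc ^= gf_mul(check_coeff(m, j), d)
    return acc

def apply_update(check_value, m, j, old_byte, new_byte):
    delta = old_byte ^ new_byte            # computed once at the data disk
    return check_value ^ gf_mul(check_coeff(m, j), delta)

data = [0x10, 0x20, 0x30, 0x40]
checks = [full_check(m, data) for m in range(3)]

# Overwrite data disk 2 and update the checks incrementally.
j, new_byte = 2, 0x99
checks = [apply_update(checks[m], m, j, data[j], new_byte) for m in range(3)]
data[j] = new_byte
assert checks == [full_check(m, data) for m in range(3)]
print("checks after update:", [hex(c) for c in checks])
```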
Parallel Reconstruction of a Failed Disk
 Given the inverse matrix (as above), a failed
disk is the sum (exclusive or) of products of the
individual bytes on data and check disks with
the elements of the matrix.
 The example below shows reconstruction of
disk 2, in a system with 4 data disks and 2
check disks; disks 2 and 3 failed.
D2 ← D1 × (1+x+x^2+x^3+x^7)
  XOR D4 × x
  XOR C1 × (1+x^2+x^4+x^5+x^6+x^7)
  XOR C2 × (x+x^3+x^4+x^5+x^6)
Rationale / Mathematics VI
 The reconstruction on the
previous slide requires:
 Reading all of the surviving
disks.
 Parallel multiplication by
precomputed entries from the
inverse matrix.
 XOR of pairs of disk entries,
working up a binary tree.
 A network of small, cheap
switches provides the
necessary connectivity.
 Put hot-spare CPUs and
drives into the network; skip up
the tree at nodes with no useful
partner.
 Given nature of matrix, the
inverse matrix can be
constructed via Gaussian
elimination very quickly for
small values of k.
 With more complicated erasure
coding (2D parity, for
example), it’s harder to see
how to wire the network to
accommodate hot-spares
without requiring more from the
interconnect.
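
A sketch of the pairwise-XOR reduction up the tree in Python (purely illustrative of the dataflow; the leaf blocks stand in for surviving disks' contributions already multiplied by their precomputed inverse-matrix entries):

```python
# Sketch of the binary-tree XOR reduction described at left: each switch level
# XORs pairs of child contributions; a node with no useful partner (e.g. a hot
# spare, or an odd child out) just passes its value up unchanged.

def xor_blocks(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def tree_xor(blocks):
    level = list(blocks)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(xor_blocks(level[i], level[i + 1]))
        if len(level) % 2:                 # no partner at this node: skip up a level
            nxt.append(level[-1])
        level = nxt
    return level[0]

# Example: five 3-byte contributions; the tree result equals the flat XOR.
contributions = [bytes([i, 2 * i, 3 * i]) for i in range(1, 6)]
flat = bytes(a ^ b ^ c ^ d ^ e for a, b, c, d, e in zip(*contributions))
assert tree_xor(contributions) == flat
print(tree_xor(contributions))
```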
Related Work
Reed-Solomon error correcting codes
have been used for single disks and CDs.
Reed-Solomon and Even-Odd erasure
codes have been proposed:
For software RAID [ABC],
For wide area storage systems [LANL].
Our usage of the code is different.
Parallel reconstruction is novel.
Patents in progress on both.
Potential deficiencies
 This could be a bad idea, depending on what you’re
trying to do:
 The bandwidth into the top of the tree isn't enough to keep all
the disk arms busy with small reads and writes.
 This could be a bad disk for building databases.
 The total disk bandwidth at least doubles during degraded
operation, which might affect many more disks than with
mirroring or standard RAID-5.
 Failures may not be independent, invalidating predicted
reliability.
 CPUs (which, if you attach multiple disks, should serve disks in
different pods) may fail too often, in ways that rebooting doesn’t
fix.
 Power supplies, cables, switches, … may fail, which we haven’t
accounted for.
 Writes are going to be relatively expensive.
Comparisons and future work
Other people have already built small
erasure-coded arrays. Is there enough
new here? We think so.
Real systems' “needs”:
Backup,
Geographical redundancy.