
Monte Carlo Linear Algebra
Techniques and Their
Parallelization
Ashok Srinivasan
Computer Science
Florida State University
www.cs.fsu.edu/~asriniva
Outline
• Background
– Monte Carlo Matrix-vector multiplication
– MC linear solvers
• Non-diagonal splitting
– Dense implementation
– Sparse implementation
• Parallelization
• Conclusions and future work
Background
• MC linear solvers are old!
– von Neumann and Ulam (1950)
– Were not competitive with deterministic techniques
• Advantages of MC
– Can give approximate solutions fast
• Feasible in applications such as preconditioning, graph
partitioning, information retrieval, pattern recognition, etc
– Can yield selected components fast
– Are very latency tolerant
Matrix-vector multiplication
• Compute C^j h, C ∈ R^(n×n)
– Choose probability and weight matrices such that
• C_ij = P_ij W_ij and h_i = p_i w_i
– Take a random walk, based on these probabilities
– Define random variables X_i
• X_0 = w_k0, and X_i = X_{i-1} W_{ki ki-1}
• E(X_j δ_{i kj}) = (C^j h)_i
– Each random walk can be used to estimate the k_j-th component of C^j h
• Convergence rate independent of n
[Figure: a random walk k_0 → k_1 → k_2 → … → k_j, started with probability p_k0, continued with transition probabilities P_{k1 k0}, P_{k2 k1}, …, and finally updating (C^j h)_kj]
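The walk-based estimator above can be sketched in a few lines. This is an illustrative Python sketch, not the talk's implementation: the function name mc_matvec_power is invented, probabilities are taken proportional to the magnitudes of the entries (one common choice), and C is assumed to have no zero columns.

```python
import numpy as np

def mc_matvec_power(C, h, j, n_walks=200000, rng=None):
    """Monte Carlo estimate of C^j h via random walks.

    Probabilities are proportional to |C_ij| within each column
    (a transition k_{i-1} -> k_i uses column k_{i-1} of P), and to
    |h_i| for the starting state, matching C_ij = P_ij * W_ij and
    h_i = p_i * w_i from the slides.  Assumes no zero columns in C.
    """
    rng = np.random.default_rng(rng)
    n = len(h)
    p0 = np.abs(h) / np.abs(h).sum()   # starting distribution: h_i = p_i * w_i
    P = np.abs(C) / np.abs(C).sum(axis=0)   # P[:, k] sums to 1
    est = np.zeros(n)
    for _ in range(n_walks):
        k = rng.choice(n, p=p0)
        x = h[k] / p0[k]               # X_0 = w_{k_0}
        for _ in range(j):
            k_new = rng.choice(n, p=P[:, k])
            x *= C[k_new, k] / P[k_new, k]   # X_i = X_{i-1} * W_{k_i k_{i-1}}
            k = k_new
        est[k] += x                    # E(X_j * delta_{i,k_j}) = (C^j h)_i
    return est / n_walks
```

With probabilities exactly proportional to the entry magnitudes, every walk through a given sequence of columns carries a comparable weight, which keeps the variance of the estimate modest.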
Matrix-vector multiplication ... continued
• Σ_{j=0}^m C^j h too can be similarly estimated
• Σ_{j=0}^m (BC)^j B h will be needed by us
– It can be estimated using probabilities on both matrices, B and C
– Length of random walk is twice that for the previous case
[Figure: a random walk k_0 → k_1 → k_2 → k_3 → … → k_{2m+1}, with transition probabilities P_{k1 k0}, P_{k2 k1}, …, updating the k_{2j+1}-th component after every second step]
MC linear solvers
• Solve Ax = b
– Split A as: A = N – M
– Write the fixed point iteration
• x_{m+1} = N^{-1} M x_m + N^{-1} b = C x_m + h
– If we choose x_0 = h, then we get
• x_m = Σ_{j=0}^m C^j h
– Estimate the above using the Markov chain technique mentioned earlier
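The equivalence between the fixed point iterates and the partial Neumann sums can be checked numerically. A small Python sketch, using the Jacobi splitting N = diag(A) purely as a concrete example (the matrices here are made up for illustration):

```python
import numpy as np

# A = N - M, C = N^{-1} M, h = N^{-1} b.  With x_0 = h, the iterate
# x_m equals the truncated Neumann series sum_{j=0}^{m} C^j h, which
# is what the Markov chain estimator targets.
A = np.array([[4.0, 1.0], [2.0, 5.0]])
b = np.array([1.0, 2.0])
N = np.diag(np.diag(A))           # Jacobi splitting, as an example
M = N - A
C = np.linalg.solve(N, M)
h = np.linalg.solve(N, b)

x = h.copy()                      # x_0 = h
partial = h.copy()                # running sum of C^j h
term = h.copy()
for _ in range(50):
    x = C @ x + h                 # x_{m+1} = C x_m + h
    term = C @ term
    partial += term               # sum_{j=0}^{m+1} C^j h
assert np.allclose(x, partial)            # the two forms agree
assert np.allclose(A @ x, b, atol=1e-8)   # converges to A x = b
```

Convergence requires the spectral radius of C to be below 1, which this example satisfies.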
Current techniques
• Choose N = a diagonal matrix
– Ensures efficient computation of C
– C is sparse when A is sparse
– Example: N = diagonal of A yields the Jacobi iteration, and the corresponding MC estimate
Properties of MC linear solvers
• MC techniques estimate the result of a
stationary iteration
– Errors from the iterative process
– Errors from MC
• Reduce the error by
– Variance reduction techniques
– Residual correction
– Choose a better iterative scheme!
Non-diagonal splitting
• Observations
– It is possible to construct an efficient MC
technique for specific splittings, even if
explicit construction of C were
computationally expensive
– It may be possible to implicitly represent C
sparsely, even if C is not actually sparse
Our example
• Choose N to be the diagonal and subdiagonal of A
– N is lower bidiagonal, with diagonal entries d_1, …, d_n and subdiagonal entries s_2, …, s_n
– N^{-1} is then lower triangular and, in general, dense
• Computing C = N^{-1} M explicitly is too expensive
– Compute x_m = Σ_{j=0}^m (N^{-1} M)^j N^{-1} b instead
Computing N^{-1}
• Using O(n) storage and precomputation time, any element of N^{-1} can be computed in constant time
– Define T(1) = 1, T(i+1) = T(i) s_{i+1}/d_{i+1}
– N^{-1}_{ij} =
• 0, if i < j
• 1/d_i, if i = j
• (-1)^{i-j}/d_j · T(i)/T(j), otherwise
• The entire N^{-1}, if needed, can be computed in O(n^2) time
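The constant-time lookup follows directly from these formulas. An illustrative Python sketch (the helper name make_ninv_entry is invented; 0-based indices are used, and all s_i are assumed nonzero, since the slides note that s_i = 0 needs minor modifications):

```python
import numpy as np

def make_ninv_entry(d, s):
    """Constant-time access to entries of N^{-1} for lower-bidiagonal N.

    d[0..n-1] is the diagonal and s[1..n-1] the subdiagonal (s[0] is
    unused), following the slide's formulas: T(1) = 1,
    T(i+1) = T(i) * s_{i+1}/d_{i+1}, and for i > j
    N^{-1}_{ij} = (-1)^{i-j}/d_j * T(i)/T(j).
    O(n) precomputation, O(1) per element; assumes all s_i != 0.
    """
    n = len(d)
    T = np.ones(n)
    for i in range(1, n):
        T[i] = T[i - 1] * s[i] / d[i]

    def ninv(i, j):               # 0-based indices
        if i < j:
            return 0.0
        if i == j:
            return 1.0 / d[i]
        return (-1.0) ** (i - j) / d[j] * T[i] / T[j]

    return ninv

# Check the formula against a dense inverse on a small random example
n = 5
rng = np.random.default_rng(0)
d = rng.uniform(1.0, 2.0, n)
s = rng.uniform(0.5, 1.0, n)      # s[0] unused
N = np.diag(d) + np.diag(s[1:], k=-1)
ninv = make_ninv_entry(d, s)
approx = np.array([[ninv(i, j) for j in range(n)] for i in range(n)])
assert np.allclose(approx, np.linalg.inv(N))
```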
Dense implementation
• Compute N^{-1} and store in O(n^2) space
• Choose probabilities proportional to the weight of the elements
• Use the alias method to sample
– Precomputation time proportional to the number of elements
– Constant time to generate each sample
• Estimate Σ_{j=0}^m (N^{-1} M)^j N^{-1} b
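The alias method referenced here is Walker's standard technique: O(K) table construction, then O(1) per sample. A minimal Python sketch, with invented helper names build_alias and alias_sample:

```python
import numpy as np

def build_alias(p):
    """Build Walker's alias tables for a discrete distribution p (O(K) setup)."""
    K = len(p)
    prob = np.array(p, dtype=float) * K / np.sum(p)   # scaled so mean is 1
    alias = np.zeros(K, dtype=int)
    small = [i for i in range(K) if prob[i] < 1.0]
    large = [i for i in range(K) if prob[i] >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = l                     # s's leftover mass is served by l
        prob[l] -= 1.0 - prob[s]
        (small if prob[l] < 1.0 else large).append(l)
    return prob, alias

def alias_sample(prob, alias, rng):
    """O(1) sample: pick a column uniformly, then flip a biased coin."""
    i = rng.integers(len(prob))
    return i if rng.random() < prob[i] else alias[i]
```

A quick usage check: sampling many times from p = [0.1, 0.2, 0.3, 0.4] reproduces those frequencies, while each draw costs only two random numbers and two table lookups.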
Experimental results
[Figure: plot of experimental results; walk length = 2]
Sparse implementation
• We cannot use O(n^2) space or time!
• Sparse implementation for M is simple
• Sparse representation of N^{-1}
– Choose P_ij =
• 0, if i < j
• 1/(n-j+1), otherwise (sampled from the uniform distribution)
– Choose W_ij = N^{-1}_{ij} / P_ij
• Constant time to determine any W_ij
– Minor modifications needed when s_i = 0
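The reweighting W_ij = N^{-1}_{ij} / P_ij can be illustrated on a single column: draw a row uniformly from the allowed range and multiply by the weight, giving an unbiased estimate without ever forming N^{-1} densely. An illustrative Python sketch, with a random lower-triangular matrix standing in for N^{-1} (in the real solver each entry would come from the constant-time formula):

```python
import numpy as np

# For column j of a lower-triangular matrix, sample i uniformly from
# {j, ..., n-1} (probability 1/(n-j), the 0-based analogue of the
# slide's 1/(n-j+1)) and reweight by W_ij = L_ij / P_ij.  The
# reweighted draw is an unbiased estimator of the column sum.
rng = np.random.default_rng(1)
n = 6
L = np.tril(rng.uniform(0.5, 1.0, (n, n)))   # stand-in for N^{-1}
j = 2
draws = rng.integers(j, n, size=200000)       # i ~ Uniform{j, ..., n-1}
weights = (n - j) * L[draws, j]               # W_ij = L_ij / (1/(n-j))
assert abs(weights.mean() - L[:, j].sum()) < 0.02
```

Because the uniform probabilities ignore the actual decay of the entries, the weights vary more than in the dense case, which is what the geometric sampling below improves.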
Experimental results
[Figure: plot of experimental results; walk length = 2]
Geometric sampling
• Probabilities are better approximated by geometric sampling
– Choose P_ij =
• 0, if i < j
• approximately p_j (1-p_j)^(i-j), otherwise
– If x ~ Uniform(0,1), then
• ceil[log(1-x)/log(1-p)] - 1 ~ Geometric(p), taking values in {0, 1, 2, …}
– Choose W_ij = N^{-1}_{ij} / P_ij
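The inverse-CDF transform on the slide becomes a one-line sampler. A Python sketch (the name geometric0 is invented; the max(…, 0) guard handles the measure-zero edge case x = 0):

```python
import math
import random

def geometric0(p, rng=random):
    """Sample G ~ Geometric(p) on {0, 1, 2, ...}, P(G = k) = p (1-p)^k,
    by inverting the CDF as on the slide:
    G = ceil(log(1-x)/log(1-p)) - 1 for x ~ Uniform(0,1)."""
    x = rng.random()
    # log1p(-t) computes log(1-t) accurately; guard the x = 0 edge case
    return max(0, math.ceil(math.log1p(-x) / math.log1p(-p)) - 1)

# Usage check: the sample mean should approach E[G] = (1-p)/p
random.seed(0)
samples = [geometric0(0.3) for _ in range(100000)]
mean = sum(samples) / len(samples)
assert abs(mean - (1 - 0.3) / 0.3) < 0.05
```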
Experimental results
[Figure: plot of experimental results; walk length = 2]
Parallelization
[Figure: three processors (Proc 1–3), each with its own random number generator (RNG 1–3), producing independent estimates that are combined into the final result]
• MC is “embarrassingly” parallel
• Identical algorithms are run independently on each processor, with the random number sequences alone being different
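The replicated scheme can be sketched with one independent random stream per processor. In this Python illustration, NumPy's SeedSequence.spawn provides the independent streams, the per-processor task is a toy estimator rather than the solver itself, and the workers run serially here (MPI or threads would run them in parallel in practice):

```python
import numpy as np

def worker(seed, n_samples=50000):
    """One 'processor': the same MC estimator, its own random stream.
    Toy task: estimate the mean (2.0) of an exponential distribution."""
    rng = np.random.default_rng(seed)
    return rng.exponential(scale=2.0, size=n_samples).mean()

# One statistically independent stream per processor
streams = np.random.SeedSequence(12345).spawn(4)
estimates = [worker(s) for s in streams]   # identical code, different randomness
combined = np.mean(estimates)              # combine the independent estimates
```

Averaging the replicated estimates reduces the variance by the number of processors, with no communication until the final reduction, which is why MC is so latency tolerant.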
MPI vs OpenMP on Origin 2000
• Cache misses cause poor performance
of the OpenMP parallelization
Conclusions and future work
• Demonstrated that it is possible to have effective MC implementations with non-diagonal splittings too
• Need to extend this to better iterative schemes
• Non-replicated parallelization needs to be considered