ppt - NCSU COE People

Download Report

Transcript ppt - NCSU COE People

Fast Parallel Algorithms for
Universal Lossless Source Coding
Dror Baron
CSL & ECE Department, UIUC
[email protected]
Ph.D. Defense – February 18, 2003
Overview
• Motivation, applications, and goals
• Background:
– Source models
– Lossless source coding + universality
– Semi-predictive methods
• An O(N) semi-predictive universal encoder
• Two-part codes
– Rigorous analysis of their compression quality
– Application to parallel compression of Bernoulli sequences
• Parallel semi-predictive (PSP) coding
– Achieving a work-efficient algorithm
– Theoretical results
• Summary
2
Motivation
• Lossless compression: text files, facsimiles, software
executables, medical, financial, etc.
• What do we want in a compression algorithm?
–
–
–
–
–
–
Universality: adaptive to a large class of sources
Good compression quality
Speed: low computational complexity
Simple implementation
Low memory use
Sequential vs. offline
3
Why Parallel Compression ?
• Some applications require high data rates:
–
–
–
–
Compressed pages in virtual memory
Remote archiving + fast communication links
Real-time compression in storage systems
Power reduction for interconnects on a circuit board
• Serial compression is limited by the clock rate
4
Room for Improvement and Goals
• Previous Art:
– Serial universal source coding methods have reached the
bounds on compression quality [Willems1998,Rissanen1999]
– Parallel source coding algorithms have high complexity and/or
poor compression quality
• Naïve parallelization compresses poorly
• Parallel dictionary compression [Franszek et. al.1996]
• Parallel context tree weighting [Stassen&Tjalkens2001,Willems2000]
• Research Goals: “good” parallel compression algorithm
– Work-efficient: O(N/B) time with B computational units
– Compression quality: as good as best serial methods (almost!)
5
Main Contributions
• BWT-MDL (O(N) universal encoder):
– An O(N) algorithm that achieves Rissanen’s redundancy bounds on best
achievable compression
– Combines efficient prefix tree construction with semi-predictive
approach to universal coding
• Fast Suffix Sorting (not in this talk):
– Core algorithm is very simple (can be implemented in VLSI)
– Worst-case complexity O(N log0.5(N))
– Competitive with other suffix sorting methods in practice
• Two-Part Codes:
– Rigorous analysis of their compression quality
– Application to distributed/parallel compression
– Optimal two-part codes
• Parallel Compression Algorithm (not in this talk):
– Work-efficient O(N/B) algorithm
– Compression loss is roughly B log(N/B) bits
6
Source Models
• Binary alphabet X={0,1},
sequence x  XN
• Bernoulli Model:
• i.i.d. model
• p(xi=1)=
• Order-K Markov Model:
•
•
•
•
Previous K symbols called context
Context-dependent conditional probability for next symbol
More flexible than Bernoulli
Exponentially many states
7
Context Tree Sources
P(xn+1=1|0)=0.8
 More flexible than Bernoulli
 More compact than Markov
 Particularly good for text
 Works for M-ary alphabet
0
0 root
leaf
P(xn+1=1|01)=0.4
01
11
1
0
1
internal
node
P(xn+1=1|11)=0.9
• State = context + conditional probabilities
• Example: N=11, x=01011111111
8
Review of Lossless Source Coding
• Stationary ergodic sources
• Entropy rate H=limN H(x)/N
• Asymptotically, H is the lowest attainable per-symbol
rate
• Arithmetic coding:
– Probability assignment p(x)
– Coding length l(x)=-log(p(x))+O(1)
– Can achieve entropy + O(1) bits
9
Universal Source Coding
• Source statistics are unknown
Need probability assignment p(x)
Need to estimate source model
Need to describe estimated source (explicitly or implicitly)
• Redundancy: excess coding length above entropy
(x)=l(x)-NH
10
Redundancy Bounds
• Rissanen’s bound (K unknown parameters):
E[(x)] > (K/2) [log(N)+O(1)]
• Worst-case redundancy for Bernoulli sequences
(K=1): (x*)=maxxXN {(x)}  0.5 log(N/2)
• Asymptotically, (x)/N 0
• In practice, e.g., text, the number of parameters
scales almost linearly with N
Low redundancy is still essential
11
Semi-Predictive Approach
• Semi-predictive methods describe x in two phases:
– Phase I: find a “good” tree source structure S and describe it
using codelength lS
– Phase II: encode x using S with probability assignment pS(x)
Phase I
x
MDL
Estimator
Phase II
S*
y
Encoder
• Phase I: estimate minimum description length (MDL)
tree source model S*=arg min {lS –log(pS(x))}
12
Semi-Predictive Approach - Phase II
• Sequential encoding of x given S*
– Determine which state s of S* generated symbol xi
– Assign xi a conditional probability p(xi|s)
– Arithmetic encoding
S*
xi
Determine s
s
Assign p(xi|s)
p(xi|s)
Arithmetic
Encoder
y
• p(xi|s) can be based on previously processed
portion of x, quantized probability estimates, etc.
13
Context Trees
• We will provide an O(N) semi-predictive algorithm
by estimating S* using context trees
root
node 0
• Context trees arrange x in a tree
 Each node corresponds to
sequence of appended arc
labels on path to root
node 01
 Internal nodes correspond
to repeating contexts in x
unique
 Leaves correspond to unique contexts
 Sentinel symbol x0=$ makes sure
sentinel
symbols have different contexts
node 11
14
Context Tree Pruning
(“To prune or not to prune…”)
s
MDL structure for state s is
• The MDL structure for state s
yields the shortest description
for symbols generated by s
• When processing state s:
– Estimate MDL structures for states
0s and 1s
– Decide whether to keep 0s and 1s or
prune them into state s
– Base decision on coding lengths
0s
00s
s
or
1s
10s
additional
structure
MDL
structure
for 1s
15
Phase I with Atomic Context Trees
• Atomic context tree:
 Arc labels are atomic (single symbol)
 Internal nodes are not necessarily branching
 Has up to O(N2) nodes
1
0
1
0
1
0
0
1
• The coding length minimization of Phase I
processes each node of the context tree [Nohre94]
• With atomic context trees, the worst-case
complexity is at least O(N2) ☹
16
Compact Context Trees
• Compact context tree:
111
0
Arc labels not necessarily atomic
1
0
Internal node are branching
O(N) nodes
Compact representation of the same tree
• Depth-first traversal of compact context tree provides
O(N) complexity 
• Theorem: Phase I of BWT-MDL requires O(N)
operations performed with O(log(N)) bits of precision
17
Phase II of BWT-MDL
S*
xi
Determine s
s
Assign p(xi|s)
p(xi|s)
Arithmetic
Encoder
y
• We determine the generator state using a novel
algorithm that is based on properties of the Burrows
Wheeler transform (BWT)
• Theorem: The BWT-MDL encoder requires O(N)
operations performed with O(log(N)) bits of precision
• Theorem: [Willems et. al. 2000]: redundancy w.r.t.
any tree source S is at most |S|[0.5 log(N)+O(1)] bits
18
Distributed/Parallel Compression of
Bernoulli Sequences
• Splitter partitions x into B blocks x(1),…,x(B)
• Encoder j{1,…,B} compresses x(j); it assigns probabilities
p(xi(j)=1)= and p(xi(j)=0)=1-
 The total probability assigned to x is identical to that in a serial
compression system
x(1) xi(1) Assign p(x (1))
i
Assign p(xi(B))
p(xi(B))
Arithmetic Encoder B
Encoder B
y(1)
…
x(B) xi(B)
Encoder 1
Arithmetic Encoder 1
…
…
x Splitter
p(xi(1))
y(B)
• This structure assumes that  is known; our goal is to provide a
universal parallel compression algorithm for Bernoulli sequences
19
Two-Part Codes
• Two-part codes use a semi-predictive approach to
describe Bernoulli sequences:
– First part of code:
• Determine the maximum likelihood (ML) parameter estimate
ML(x)=n1/(n0+n1)
• Quantize ML(x) to rk, one of K representation levels
• Describe the bin index k with log(K) bits
– Second part of code encodes x using rk
x
Determine ML(x)
ML(x)
Quantizer
k{1,…,K}
rk
Encoder
y
• In distributed systems:
– Sequential compressors require O(N) internal communications
– Two-part codes need only communicate {n0(j),n1(j)}j{1,…,B}
20
Requires O(B log(K)) internal communications
Jeffreys Two-Part Code
• Quantize ML(x)
Bin edges
Representation levels
bk=sin2(k/2K)
rk=sin2((2k-1)/4K)
r1 r2
b0 b1 b2
rk
bk-1
rK
bk
bK
• Use K 1.772N0.5 bins
• Source description:
– log(K) bits for describing the bin index k
– Need –n1 log(ML(x))-n0log(1-ML(x)) for encoding x
21
Redundancy of Jeffreys Code for
Bernoulli Sequences
• Redundancy:
 log(K) bits for
describing k
 N D(ML(x)||rk) bits for
encoding x using
imprecise model
• D(a||b) is Kullback Leibler
divergence
• In bin k, l(x)=-ML(x)log(rk )-[1-ML(x)] log(1-rk )
 l( ML (x)) is poly-line
• Redundancy = log(K)+ l(ML(x))– N H(ML(x))  log(K) + L
 Use quantizers that have small L distance between the entropy
22
function and the induced poly-line fit
Redundancy Properties
• For x s.t. ML(x) is quantized to rk, the worst-case redundancy is
log(K)+N max{D(bk||rk),D(bk-1||rk)}
• D(bk||rk) and D(bk-1||rk)
• Largest in initial or end bins
• Similar in the middle bins
• Difference reduced over wider range of k for larger N (larger K)
 Can construct a near-optimal quantizer by modifying the initial
and end bins of the Jeffreys quantizer
23
Redundancy Results
• Theorem: The worst-case redundancy of the Jeffreys code
is 1.221+O(1/N) bits above Rissanen’s bound
• Theorem: The worst-case redundancy of the optimal twopart code is 1.047+O(1) bits above Rissanen’s bound 24
Parallel Universal Compression for
Bernoulli Sequences
x(1)
x(1)
Determine
n0(1) and n1(1)
Determine
n0(B) and n1(B)
n0(1), n1(1)
ML(x) [j n0(j)] /
[j n0(j)+j n1(j)]
ML(x)
Quantizer
rk
n0(B), n1(B)
Encoder B
x(B)
y(1)
…
…
x(B)
Encoder 1
• Phase I:
• Parallel units (PUs) compute symbol counts for the
B blocks
• Coordinating unit (CU) computes and quantizes the
MDL parameter estimate ML(x) and describes k
• Phase II: B PUs encode the B blocks based on rk
25
y(B)
k
Why do we need
Parallel Semi-Predictive Coding?
x(1)
y(1)
…
…
Splitter
…
x
Compressor 1
• Naïve parallelization:
x(B)
Compressor B
y(B)
• Partition x into B blocks
• Compress blocks independently
• The redundancy for a length-N/B block is O(log(N/B))
Total redundancy is O(B log(N/B))
• Rissanen’s bound is O(log(N))
The redundancy with naïve parallelization is excessive!
26
Parallel Semi-Predictive (PSP) Concept
Phase I
x(1)
Statistics
Accumulator 1
Statistics
Accumulator B
x(1)
Compressor 1
y(1)
symbol counts 1
Coordinating
Unit
symbol counts B
x(B)
S*
…
…
x(B)
Phase II
Compressor B
y(B)
S*
• Phase I:
• B parallel units (PUs) accumulate statistics (symbol
counts) on the B blocks
• Coordinating unit (CU) computes the MDL tree
source estimate S*
• Phase II:-- B PUs compress the B blocks based on S* 27
Source Description in PSP
Coordinating Unit
Describe S*
structure
Describe bin
indices {ks}sS*
S*
xi(b)
Determine s
{ks}sS*
s
Determine p(xi(b)|s) Arithmetic
p(xi(b)|s)
Encoder
y(b)
Parallel unit b
p(xi(b)|s)1{xi(b)=1}rks+1{xi(b)=0}(1-rks )
• Phase I: the CU describes the structure of S* and the
quantized ML parameter estimates {ks}sS*
• Phase II: each of B PUs compresses block x(b) just like
Phase II of the (serial) semi-predictive approach
28
Complexity of Phase I
• Phase I processes each node of the context tree [Nohre94]
• The CU processes the states of a full atomic context tree
of depth-Dmax, where Dmax log(N/B)
• Processing a node:
• Internal node: requires
D
O(1) time
• Leaf: CU adds up block
symbol counts to compute
2
=O(N/B)
each symbol count, i.e., ns=b ns(b), where  {0,1}
The CU processes a leaf node in O(B) time
With O(N/B) leaves, the aggregate complexity is O(N),
which is excessive
29
max
Dmax
Phase I in O(N/B) Time
• We want to compute ns=b ns(b) faster
• An adder tree incurs O(log(B)) delay for adding up B
block symbol counts
• Pipelining enables us to generate a result every O(1) time
O(N/B) nodes, each requiring O(1) time
30
Phase II in O(N/B) Time
Parallel unit b
xi(b)
Determine s
S*
s
Determine p(xi(b)|s) Arithmetic
p(xi(b)|s)
Encoder
y(b)
{ks}sS*
• The challenging part in Phase II is determining s:
• Define the context index for a length-Dmax context s
preceding xi(b) as the binary number that represents s
• The length-2Dmax generator table g satisfies gj=sS* if
s is a suffix of the context whose context index is j
• We can construct g in O(N/B) time (far from trivial!)
• Compute context indices for all symbols of x(b) and
determine the generating states via the generator table g
31
Decoder
x(1)
…
…
…
bus y
D
E
M
U
X
*
Reconstruct S ,{ks}sS*
S*,{ks}sS*
Decoding Unit 1
y(1)
y(B)
Decoding Unit B
x(B)
• An input bus is demultiplexed to multiple units
• The MDL source and quantized ML parameters are
reconstructed
• The B compressed blocks y(B) are decompressed on
B decoding units
32
Theoretical Results
• Theorem: With computations performed with 2 log(N)
bits of precision defined as O(1) time:
– Phase I of PSP approximates the MDL coding length within
O(1) of the true optimum
– The PSP algorithm requires O(N/B) time
• Theorem: The PSP algorithm uses a total of O(N)
words of memory = a total of O(N log(N)) bits
• Theorem: The pointwise redundancy of PSP w.r.t. S*
is (x) < B[log(N/B)+O(1)]+|S|*[log(N)/2+O(1)]
parallelization overhead
33
Main Contributions
• BWT-MDL (O(N) universal encoder):
 An O(N) algorithm that achieves Rissanen’s redundancy bounds on best
achievable compression
 Combines efficient prefix tree construction with semi-predictive
approach to universal coding
• Fast Suffix Sorting (not in this talk):
– Core algorithm is very simple (can be implemented in VLSI)
– Worst-case complexity O(N log0.5(N))
– Competitive with other suffix sorting methods in practice
• Two-Part Codes:
 Rigorous analysis of their compression quality
 Application to distributed/parallel compression
– Optimal two-part codes
• Parallel Compression Algorithm (not in this talk):
 Work-efficient O(N/B) algorithm
 Compression loss is roughly B log(N/B) bits
34
More…
• Results have been extended to |X|-ary alphabet
• Future research can concentrate on:
– Processing broader classes of tree sources
– Problems in statistical inference
• Universal classification
• Channel decoding
• Prediction
– Characterize the design space for parallel
compression algorithms
35
Generic Phase I
if (s is a leaf) {
Count symbol appearances ns0 and ns1
MDLs length(ns0, ns1)
} else { /* s is an internal node */
Recursively compute MDL length and counts for 0s and 1s
ns0  n0s0+n1s0, ns1  n0s1+n1s1
MDLs length(ns0, ns1)
if (MDLs >MDL0s +MDL1s )
Keep 0s and 1s
} else {
Prune 0s and 1s, keep s
}
}
36