Network coding on the GPU
Péter Vingelmann
Supervisor: Frank H.P. Fitzek

What is network coding?

Traditional routing in packet networks: data is simply forwarded by the intermediate nodes
Network coding breaks with this principle: nodes may recombine several input packets into one or several output packets
Linear network coding: form linear combinations of incoming packets

What are the benefits of network coding?

Throughput
[Figure: the butterfly network. The source S multicasts packets b1 and b2 through intermediate nodes 1-4 to the sinks T1 and T2; the bottleneck link carries b1 xor b2, so each sink can recover both packets simultaneously.]
Robustness
Each encoded packet is "equally important"
Complexity
Less complex protocols (e.g. for content distribution)
Security
It is more difficult to "overhear" anything that makes sense

What is the problem?

The computational overhead introduced by network coding operations is not negligible
There is no dedicated network coding hardware yet
A possible solution: use the Graphics Processing Unit (GPU) to perform the necessary calculations

Overview of network coding

Definition: coding at a node in a packet network
All operations are performed over a Galois field GF(2^s); packets are divided into s-bit-long symbols
[Figure: a K-bit packet split into K/s symbols (Symbol 1, Symbol 2, ..., Symbol K/s) of s bits each]
The process of network coding can be divided into two separate parts:
[Figure: original packets -> ENCODER (using coding coefficients) -> encoded packets -> DECODER -> original packets]

Encoding

Encoded packets are linear combinations of the original packets, where addition and multiplication are performed over GF(2^s)
We can use random coding coefficients
X = C · B, where B is the N×L matrix of original packets (N packets of L symbols each), C is the N×N matrix of coding coefficients (one N-coefficient vector per encoded packet), and X is the N×L matrix of encoded data

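To make the encoding step concrete, here is a minimal CPU-side sketch of X = C · B over GF(2^8) (i.e. s = 8, the usual choice). The exp/log-table construction, the primitive polynomial 0x11D, and the names gf_init, gf_mul and encode are illustrative choices of this write-up, not details taken from the presented implementation.

```cuda
#include <cstdint>

// GF(2^8) built from a primitive polynomial; 0x11D (x^8+x^4+x^3+x^2+1) is an
// assumed, common choice for which x = 2 generates the multiplicative group.
static uint8_t gf_exp[512], gf_log[256];

void gf_init() {
    uint16_t x = 1;
    for (int i = 0; i < 255; ++i) {
        gf_exp[i] = (uint8_t)x;
        gf_log[x] = (uint8_t)i;
        x <<= 1;                        // multiply by the generator (0x02)
        if (x & 0x100) x ^= 0x11D;      // reduce modulo the field polynomial
    }
    for (int i = 255; i < 512; ++i)     // duplicate so gf_mul needs no modulo
        gf_exp[i] = gf_exp[i - 255];
}

uint8_t gf_mul(uint8_t a, uint8_t b) {
    if (a == 0 || b == 0) return 0;
    return gf_exp[gf_log[a] + gf_log[b]];
}

// X = C * B over GF(2^8): C is N x N, B and X are N x L (row-major).
void encode(const uint8_t* C, const uint8_t* B, uint8_t* X, int N, int L) {
    for (int i = 0; i < N; ++i)          // one encoded packet per row
        for (int j = 0; j < L; ++j) {
            uint8_t acc = 0;             // addition over GF(2^s) is XOR
            for (int k = 0; k < N; ++k)
                acc ^= gf_mul(C[i * N + k], B[k * L + j]);
            X[i * L + j] = acc;
        }
}
```

Filling C with random bytes yields the random coding coefficients mentioned above.
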
Decoding

Assume a node has received M encoded packets (together with their coefficients)
This gives a linear system with M equations and N unknowns
We need M ≥ N to have a chance of solving this system of equations using standard Gaussian elimination
At least N linearly independent encoded packets must be received in order to recover all the original data packets

CPU implementation

A simple C++ console application with some customizable parameters:
L: packet length
N: generation size
Object-oriented implementation: Encoder and Decoder classes
Addition and subtraction over the Galois field are simply XOR operations on the CPU
Galois multiplication and division tables are pre-calculated and stored in arrays: both operations can be performed by array lookups
Gauss-Jordan elimination is used for decoding: an "on-the-fly" version of the standard Gaussian elimination (a sketch follows below)
It is used as the reference implementation

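Below is a minimal sketch of one such "on-the-fly" Gauss-Jordan step, reusing gf_exp/gf_log and gf_mul from the encoding sketch above (the presented implementation uses full pre-calculated multiplication and division tables instead; the names gf_div and decode_step are again illustrative). Each arriving packet is reduced against the rows stored so far, normalized by its pivot, and back-substituted.

```cuda
uint8_t gf_div(uint8_t a, uint8_t b) {       // b != 0 is assumed
    if (a == 0) return 0;
    return gf_exp[gf_log[a] + 255 - gf_log[b]];
}

// One on-the-fly Gauss-Jordan step. coef holds the packet's N coefficients,
// data its L symbols; Crows/Drows store the rows decoded so far, pivot[i]
// marks whether a row with leading 1 in column i is already present.
// Returns true if the packet was innovative (linearly independent).
bool decode_step(uint8_t* coef, uint8_t* data,
                 uint8_t* Crows, uint8_t* Drows, bool* pivot, int N, int L) {
    // 1) forward substitution: reduce the new packet by the existing rows
    for (int i = 0; i < N; ++i)
        if (pivot[i] && coef[i]) {
            uint8_t f = coef[i];
            for (int k = 0; k < N; ++k) coef[k] ^= gf_mul(f, Crows[i * N + k]);
            for (int k = 0; k < L; ++k) data[k] ^= gf_mul(f, Drows[i * L + k]);
        }
    // 2) find the pivot: the first non-zero coefficient
    int p = 0;
    while (p < N && coef[p] == 0) ++p;
    if (p == N) return false;                // not innovative, discard it
    uint8_t inv = coef[p];                   // normalize so the pivot is 1
    for (int k = 0; k < N; ++k) coef[k] = gf_div(coef[k], inv);
    for (int k = 0; k < L; ++k) data[k] = gf_div(data[k], inv);
    // 3) backward substitution into the existing rows
    for (int i = 0; i < N; ++i)
        if (pivot[i] && Crows[i * N + p]) {
            uint8_t f = Crows[i * N + p];
            for (int k = 0; k < N; ++k) Crows[i * N + k] ^= gf_mul(f, coef[k]);
            for (int k = 0; k < L; ++k) Drows[i * L + k] ^= gf_mul(f, data[k]);
        }
    for (int k = 0; k < N; ++k) Crows[p * N + k] = coef[k];   // store the row
    for (int k = 0; k < L; ++k) Drows[p * L + k] = data[k];
    pivot[p] = true;
    return true;
}
```

Once all N pivots are present, Drows contains the original packets.
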
Graphics card

Originally designed for real-time rendering of 3D graphics
The past: the fixed-function pipeline
They evolved into programmable parallel processors with enormous computing power
The present: the programmable pipeline
Now they can even perform general-purpose computations, with some restrictions
The future: General Purpose Graphics Processing Unit (GPGPU)

OpenGL & CG implementation

OpenGL is a standard cross-platform API for computer graphics
It cannot be used on its own; a shader language is also necessary to implement custom algorithms
A shader is a short program which is used to program certain stages of the rendering pipeline
I chose NVIDIA's Cg toolkit as the shader language
The developer is forced to think in terms of the traditional concepts of 3D graphics (e.g. vertices, pixels, triangles, lines and points)

Encoder shader in CG

A regular bitmap image serves as input data
Coefficients and data packets are stored in textures (2D arrays of bytes in graphics memory that can be accessed efficiently)
The XOR operation and Galois multiplication are also implemented by texture look-ups: a 256x256 black & white texture is necessary for each
The encoded packets are rendered (computed) line-by-line onto the screen and then saved into a texture

Decoder shaders in CG

The decoding algorithm is more complex
It must be decomposed into 3 different shaders
These shaders correspond to the 3 consecutive phases of the Gauss-Jordan elimination:
1. Forward substitution: reduce the new packet by the existing rows
2. Finding the pivot element in the reduced packet
3. Backward substitution: substitute the reduced and normalized packet into the existing rows

NVIDIA's CUDA toolkit

Compute Unified Device Architecture (CUDA)
Parallel computing applications in the C language
Modern GPUs have many processor cores, and they can launch thousands of threads with zero scheduling overhead
Terminology:
host = CPU
device = GPU
kernel = a function executed on the GPU
A kernel is executed in the Single Program Multiple Data (SPMD) model, meaning that a user-specified number of threads execute the same program (a minimal example follows below)

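As a minimal illustration of these terms (a toy example of this write-up, not part of the thesis code): the host allocates device memory, launches a kernel over about a million threads, and waits for the device.

```cuda
#include <cuda_runtime.h>

// kernel = a function executed on the device (GPU), here by one thread per byte
__global__ void xor_kernel(unsigned char* dst, const unsigned char* src, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) dst[i] ^= src[i];
}

int main() {
    const int n = 1 << 20;
    unsigned char *d_dst, *d_src;        // buffers in device (GPU) memory
    cudaMalloc(&d_dst, n);
    cudaMalloc(&d_src, n);
    cudaMemset(d_dst, 0x00, n);
    cudaMemset(d_src, 0xFF, n);
    // the host (CPU) launches n threads, grouped into blocks of 256
    xor_kernel<<<(n + 255) / 256, 256>>>(d_dst, d_src, n);
    cudaDeviceSynchronize();             // wait until the device finishes
    cudaFree(d_dst);
    cudaFree(d_src);
    return 0;
}
```
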
CUDA implementation

A CUDA-capable device is required! (NVIDIA GeForce 8 series at minimum)
This is a more native approach; we have fewer restrictions
A large number of threads must be launched to achieve the GPU's peak performance
All data structures are stored in CUDA arrays, which are bound to texture references if necessary (a sketch follows below)
Computations are visualized using an OpenGL GUI

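A sketch of that binding step, using the legacy texture-reference API that was current when this work was done (it has since been deprecated and removed in favour of texture objects; the names texData and upload_packets are illustrative):

```cuda
#include <cuda_runtime.h>

// legacy texture reference; kernels read the bound array via tex2D(texData, x, y)
texture<unsigned char, 2, cudaReadModeElementType> texData;

void upload_packets(const unsigned char* host_data, int L, int N) {
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<unsigned char>();
    cudaArray* arr;
    cudaMallocArray(&arr, &desc, L, N);               // an L x N CUDA array
    cudaMemcpy2DToArray(arr, 0, 0, host_data,
                        L, L, N, cudaMemcpyHostToDevice);
    cudaBindTextureToArray(texData, arr);             // bind for texture fetches
}
```
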
Encoder kernel in CUDA

Encoding is a matrix multiplication in the GF domain, and can be considered a highly parallel computation problem
We can achieve a very fine granularity by launching a thread for every single byte to be computed
Galois multiplication is implemented by array look-ups, but we have a native XOR operator
The encoder kernel is quite simple (a sketch follows below)

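A sketch of such a kernel, one thread per byte of the encoded data (for readability this version reads plain device arrays rather than textures; mul_table stands for the pre-calculated 256x256 Galois multiplication table, and all names are illustrative):

```cuda
// One thread computes one byte X[i][j] of the encoded data:
//   X[i][j] = XOR over k of mul_table[C[i][k]][B[k][j]]   (GF(2^8))
__global__ void encode_kernel(const unsigned char* C,         // N x N coefficients
                              const unsigned char* B,         // N x L original data
                              unsigned char* X,               // N x L encoded data
                              const unsigned char* mul_table, // 256 x 256 lookup
                              int N, int L) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // byte within the packet
    int i = blockIdx.y;                              // which encoded packet
    if (j >= L) return;
    unsigned char acc = 0;
    for (int k = 0; k < N; ++k)                      // addition is a native XOR
        acc ^= mul_table[C[i * N + k] * 256 + B[k * L + j]];
    X[i * L + j] = acc;
}

// launch: dim3 grid((L + 255) / 256, N);
//         encode_kernel<<<grid, 256>>>(C, B, X, mul_table, N, L);
```
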
Decoder kernels in CUDA

Gauss-Jordan elimination means that the decoding of each coded packet can only start after the decoding of the previous coded packets has finished => we have a sequential algorithm!
Parallelization is only possible within the decoding of the current coded packet
We need 2 separate kernels for forward and backward substitution
A search for the first non-zero element must be performed on the CPU side, because synchronization is not possible between all GPU threads => the CPU must assist the GPU! (a sketch follows below)

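The control flow can be sketched as follows, reusing the naming of the earlier sketches (Crows/Drows are the rows decoded so far, mul the 256x256 multiplication table; the backward-substitution kernel is analogous and omitted here):

```cuda
#include <cuda_runtime.h>

// Forward substitution: one thread per byte of the incoming packet (its N
// coefficients followed by its L data symbols), reduced by all stored rows.
// Writing to separate output buffers avoids a race on coef[].
__global__ void forward_kernel(const unsigned char* coef,
                               const unsigned char* data,
                               unsigned char* coef_out, unsigned char* data_out,
                               const unsigned char* Crows,
                               const unsigned char* Drows,
                               const bool* pivot, const unsigned char* mul,
                               int N, int L) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= N + L) return;
    unsigned char v = (t < N) ? coef[t] : data[t - N];
    for (int i = 0; i < N; ++i)
        if (pivot[i]) {
            // the stored rows are fully reduced, so each reduction factor is
            // simply the packet's original coefficient in that pivot column
            unsigned char f = coef[i];
            unsigned char b = (t < N) ? Crows[i * N + t] : Drows[i * L + t - N];
            v ^= mul[f * 256 + b];
        }
    if (t < N) coef_out[t] = v; else data_out[t - N] = v;
}

// Host side: launch the kernel, then find the pivot on the CPU.
int decode_step_gpu(const unsigned char* d_coef, const unsigned char* d_data,
                    unsigned char* d_coef_out, unsigned char* d_data_out,
                    const unsigned char* d_Crows, const unsigned char* d_Drows,
                    const bool* d_pivot, const unsigned char* d_mul,
                    unsigned char* h_coef, int N, int L) {
    int threads = 256, blocks = (N + L + threads - 1) / threads;
    forward_kernel<<<blocks, threads>>>(d_coef, d_data, d_coef_out, d_data_out,
                                        d_Crows, d_Drows, d_pivot, d_mul, N, L);
    // no global synchronization is possible inside a kernel => read back the
    // reduced coefficients and search for the first non-zero element here
    cudaMemcpy(h_coef, d_coef_out, N, cudaMemcpyDeviceToHost);
    int p = 0;
    while (p < N && h_coef[p] == 0) ++p;
    if (p == N) return -1;   // linearly dependent packet, discard it
    // a second kernel (omitted) normalizes the row by h_coef[p],
    // backward-substitutes it into the stored rows, and stores it as row p
    return p;
}
```
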
Graphical User Interface

[Screenshot of the application's OpenGL GUI]

Performance evaluation

It is difficult to compare the actual performance of these implementations
A lot of factors have to be taken into consideration:
Shader/kernel execution times
Memory transfers between host and device memory
Shader/kernel initialization & parameter setup
CPU-GPU synchronization
Measurement results are not uniform, because we cannot have exclusive control over the GPU: other applications may have a negative impact

CPU implementation

[Chart (Desktop PC: Intel Core2 Quad CPU Q6600 @ 2.40 GHz): encoding and decoding throughput [KB/s] versus generation size N (16, 32, 64, 128, 256).]

OpenGL & CG implementation

[Chart (OpenGL & Cg: NVidia GeForce 9600 GT): encoding and decoding throughput [KB/s] versus generation size N (16, 32, 64, 128, 256).]

CUDA implementation

[Chart (CUDA: NVidia GeForce 9600 GT): encoding and decoding throughput [KB/s] versus generation size N (16, 32, 64, 128, 256).]

Achievements

It has been shown that the GPU is capable of performing network coding calculations
What's more, it can outperform the CPU by a significant margin in some cases
We have a submitted and accepted paper at European Wireless '09, titled: Implementation of Random Linear Network Coding on OpenGL-enabled Graphics Cards

Demonstration

CPU implementation
OpenGL & CG implementation
CUDA implementation

Thank you for your kind attention!
Questions???