Network Coding: Theory and Applications
PhD Course, Part IV
Tuesday 9.15-12.15, 18.6.2013
Muriel Médard (MIT), Frank H. P. Fitzek (AAU), Daniel E. Lucani (AAU), Morten V. Pedersen (AAU)
Plan
- Hello World!
- Intra-flow network coding: Complexity, Overhead, Energy, Analog, Core, Project Assignment
- Theory: Multicast and Co., Distributed Storage, Wash-Up
- Inter-flow network coding
- KODO + Simulator, KODO + Exercises
- Group Work (three sessions)
Recoding Packets
A node that holds two coded packets f and e (whose headers carry the coding coefficients) can generate a new linear network coded packet (CP) without decoding: it picks local coefficients d1, d2, forms the new coded data d1·f + d2·e, and writes the new header coefficients
  X1 = d1·C11 + d2·C21  and  X2 = d1·C12 + d2·C22.
Recall that f = C11·P1 + C12·P2 and e = C21·P1 + C22·P2. Thus
  d1·f + d2·e = d1·C11·P1 + d1·C12·P2 + d2·C21·P1 + d2·C22·P2 = X1·P1 + X2·P2,
so the recoded packet is again a valid linear combination of the original packets P1 and P2.
[Figure: recoding two coded packets f and e into a new coded packet with header coefficients X1, X2.]
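A minimal C++ sketch of this recoding step, assuming arithmetic in GF(2^8) with reduction polynomial 0x11D (the field size, polynomial, and packet layout are assumptions, not taken from the slides):

```cpp
#include <cstdint>
#include <vector>
#include <random>

// GF(2^8) multiplication by shift-and-reduce; 0x11D (x^8+x^4+x^3+x^2+1)
// is an assumed reduction polynomial.
uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t p = 0;
    while (b) {
        if (b & 1) p ^= a;
        bool carry = a & 0x80;
        a <<= 1;
        if (carry) a ^= 0x1D;
        b >>= 1;
    }
    return p;
}

struct CodedPacket {
    std::vector<uint8_t> coeff;  // header: coding coefficients (C11, C12, ...)
    std::vector<uint8_t> data;   // coded payload (f or e)
};

// Recode: new packet = d1*f + d2*e. The new header coefficients are
// X_j = d1*C1j + d2*C2j, so the result stays a linear combination of
// the original packets.
CodedPacket recode(const CodedPacket& f, const CodedPacket& e, std::mt19937& rng) {
    std::uniform_int_distribution<int> dist(0, 255);
    uint8_t d1 = static_cast<uint8_t>(dist(rng));
    uint8_t d2 = static_cast<uint8_t>(dist(rng));
    CodedPacket out;
    out.coeff.resize(f.coeff.size());
    out.data.resize(f.data.size());
    for (size_t j = 0; j < out.coeff.size(); ++j)
        out.coeff[j] = gf_mul(d1, f.coeff[j]) ^ gf_mul(d2, e.coeff[j]);
    for (size_t j = 0; j < out.data.size(); ++j)
        out.data[j] = gf_mul(d1, f.data[j]) ^ gf_mul(d2, e.data[j]);
    return out;
}
```

Because the header coefficients are updated alongside the payload, a downstream node can treat the recoded packet exactly like a freshly encoded one.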
Systematic Coding: Complexity
In a systematic generation of M packets, D packets arrive uncoded (identity rows) and the rest arrive coded:

  CP1     1                              P1
  CP2        1                           P2
  CP3  =        1                     x  P3
  CP4     a41 a42 a43 a44 a45 a46        P4
  CP5     a51 a52 a53 a54 a55 a56        P5
  CP6     a61 a62 a63 a64 a65 a66        P6
               (M x M)

Operations:
- First elimination (product): substituting the D received uncoded packets into the coded rows costs D(M - D) operations and leaves an n x n system in the missing packets (here the 3 x 3 block a44 ... a66).
- Gaussian elimination on that n x n matrix, n = M - D, requires An^3 + Bn^2 + Cn operations.
- The distribution of D determines the average number of operations and is linked to the channel model.
- For i.i.d. erasures Be(Pe), on average n ≈ M·Pe, so the cost is A(M·Pe)^3 + B(M·Pe)^2 + C(M·Pe) -> O(M^3·Pe^3).
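A small C++ sketch of this average-cost computation, assuming n = M - D follows a Binomial(M, Pe) distribution and using purely illustrative constants A, B, C (neither the constants nor this exact procedure appear in the slides):

```cpp
#include <cmath>
#include <cstdio>

// Expected Gaussian-elimination cost for systematic coding with
// generation size M over an i.i.d. erasure channel Be(Pe), where
// n = number of erased systematic packets ~ Binomial(M, Pe).
double expected_ops(int M, double Pe, double A, double B, double C) {
    double total = 0.0;
    for (int n = 0; n <= M; ++n) {
        // Binomial pmf P(n erasures out of M), via log-gamma for stability.
        double pmf = std::exp(std::lgamma(M + 1) - std::lgamma(n + 1) -
                              std::lgamma(M - n + 1)) *
                     std::pow(Pe, n) * std::pow(1.0 - Pe, M - n);
        total += pmf * (A * n * n * n + B * n * n + C * n);
    }
    return total;
}

int main() {
    // Illustrative constants only; the true A, B, C depend on the implementation.
    printf("E[ops]  = %.1f\n", expected_ops(64, 0.1, 1.0, 1.0, 1.0));
    // The slide's approximation A(M*Pe)^3 + B(M*Pe)^2 + C(M*Pe):
    double n = 64 * 0.1;
    printf("approx  = %.1f\n", n * n * n + n * n + n);
}
```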
S60 Implementation: RLNC (2007)
Network Coding GF(2)
Systematic Network Coding GF(2)
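A minimal C++ sketch of the kind of systematic GF(2) encoder benchmarked here: the first M packets are sent uncoded, after which every packet is a random XOR of the generation (packet layout and RNG choice are assumptions):

```cpp
#include <cstdint>
#include <vector>
#include <random>

// Systematic encoder over GF(2): addition is XOR and every
// coding coefficient is a single bit.
struct SystematicEncoder {
    std::vector<std::vector<uint8_t>> generation; // M source packets
    size_t next_uncoded = 0;
    std::mt19937 rng{42};

    // coeff_out receives one bit per source packet (the coding header).
    std::vector<uint8_t> next_packet(std::vector<uint8_t>& coeff_out) {
        size_t M = generation.size();
        coeff_out.assign(M, 0);
        if (next_uncoded < M) {                  // systematic phase: send uncoded
            coeff_out[next_uncoded] = 1;
            return generation[next_uncoded++];
        }
        std::vector<uint8_t> out(generation[0].size(), 0);
        std::uniform_int_distribution<int> bit(0, 1);
        for (size_t i = 0; i < M; ++i) {         // coded phase: random XOR
            if (bit(rng)) {
                coeff_out[i] = 1;
                for (size_t b = 0; b < out.size(); ++b)
                    out[b] ^= generation[i][b];
            }
        }
        return out;  // a (rare) all-zero draw yields a non-innovative packet
    }
};
```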
Coding throughput on a laptop: Lenovo T61p, 2.53 GHz Intel Core2Duo, 2 GB RAM, Kubuntu 8.10 64-bit
Coding throughput on Nokia N95: Nokia N95-8GB, ARM 11 332 MHz CPU, 128 MB RAM, Symbian OS 9.2
Energy Consumption
Current coding speeds in 2012
[Figure: coding speed in MByte/s versus field size (2, 8, 16, OPF) and generation size (16, 32, 64, 128).]
... and in 2007 we had 2 kByte/s for a generation size of 5!
IMPLEMENTATION OF RANDOM LINEAR NETWORK CODING ON OPENGL-ENABLED GRAPHICS CARDS
Main Motivation
- Mobile devices have co-processors or accelerators for specific tasks, for speed and energy consumption
- Examples: voice codec, video codec, gaming
- But no network coding support (yet)
Example: Video on N95 (Spiderman 3)
- Software approach: DivX player 1.59 W; Display 0.4 W, Audio 0.1 W, CPU @ 88% 0.675 W
- Hardware accelerator: built-in player 0.94 W; Display 0.4 W, Audio 0.1 W, CPU @ 31% 0.35 W, Accelerator 0.2 W
CPU implementation
- A simple C++ console application with some customizable parameters: L = packet length, N = generation size
- Object-oriented implementation: Encoder and Decoder classes
- Addition and subtraction over the Galois field are simply XOR operations on the CPU
- Galois multiplication and division tables are pre-calculated and stored in arrays: both operations can be performed by array lookups
- Gauss-Jordan elimination is used for decoding: an on-the-fly version of the standard Gaussian elimination (see the sketch below)
- It is used as the reference implementation
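A sketch of such an on-the-fly Gauss-Jordan decoder; to keep it short it works over GF(2), where addition is XOR and every non-zero coefficient is 1, whereas the reference implementation above also supports larger fields via its lookup tables:

```cpp
#include <cstdint>
#include <vector>

// On-the-fly Gauss-Jordan decoder over GF(2): each incoming packet is
// immediately reduced against the rows decoded so far, mirroring the
// phases: forward elimination, pivot search, backward substitution.
struct Decoder {
    size_t M;                                   // generation size
    std::vector<std::vector<uint8_t>> coeff;    // coeff[i]: row with pivot i (empty if absent)
    std::vector<std::vector<uint8_t>> data;     // matching payloads
    size_t rank = 0;

    explicit Decoder(size_t generation_size)
        : M(generation_size), coeff(M), data(M) {}

    // Returns true if the packet was innovative; once rank == M,
    // data[i] holds the decoded source packet P_i.
    bool consume(std::vector<uint8_t> c, std::vector<uint8_t> d) {
        // 1. Forward elimination: reduce by the existing pivot rows.
        for (size_t j = 0; j < M; ++j)
            if (c[j] && !coeff[j].empty()) {
                for (size_t k = 0; k < M; ++k) c[k] ^= coeff[j][k];
                for (size_t k = 0; k < d.size(); ++k) d[k] ^= data[j][k];
            }
        // 2. Pivot search: first non-zero coefficient.
        size_t p = 0;
        while (p < M && !c[p]) ++p;
        if (p == M) return false;               // linearly dependent, discard
        // 3. Backward substitution into the existing rows.
        for (size_t j = 0; j < M; ++j)
            if (!coeff[j].empty() && coeff[j][p]) {
                for (size_t k = 0; k < M; ++k) coeff[j][k] ^= c[k];
                for (size_t k = 0; k < d.size(); ++k) data[j][k] ^= d[k];
            }
        coeff[p] = std::move(c);
        data[p] = std::move(d);
        ++rank;
        return true;
    }
};
```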
Graphics card
- Originally designed for real-time rendering of 3D graphics
- The past: fixed-function pipeline
- They evolved into programmable parallel processors with enormous computing power
- The present: programmable pipeline
- Now they can even perform general-purpose computations, with some restrictions
- The future: General-Purpose Graphics Processing Unit (GPGPU)
Platform of choice
- NVIDIA GeForce 9600 GT
- NVIDIA GeForce 9200M GS
OpenGL & Cg implementation
- OpenGL is a standard cross-platform API for computer graphics
- It cannot be used on its own; a shader language is also necessary to implement custom algorithms
- A shader is a short program used to program certain stages of the rendering pipeline
- We chose NVIDIA's Cg toolkit as the shader language
- The developer is forced to think in the traditional concepts of 3D graphics (e.g. vertices, pixels, triangles, lines and points)
Encoder shader in Cg
- A regular bitmap image serves as the input data
- Coefficients and data packets are stored in textures (2D arrays of bytes in graphics memory that can be accessed efficiently)
- The XOR operation and Galois multiplication are also implemented by texture look-ups: a 256x256-sized black&white texture is necessary for each (a sketch of filling these tables follows below)
- The encoded packets are rendered (computed) line-by-line onto the screen and saved into a texture
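A C++ sketch of how those two 256x256 lookup tables could be filled before being uploaded as textures; the primitive polynomial 0x11D and the log/antilog construction are assumptions, not details given in the slides:

```cpp
#include <cstdint>

// 256x256 lookup tables for GF(2^8) addition (XOR) and multiplication,
// of the kind that would be uploaded as two black&white textures.
static uint8_t xor_table[256][256];
static uint8_t mul_table[256][256];

void build_tables() {
    // log/antilog tables over GF(2^8); 0x11D is an assumed primitive
    // polynomial, with x (= 2) as the generator element.
    uint8_t exp_t[512], log_t[256] = {0};
    int x = 1;
    for (int i = 0; i < 255; ++i) {
        exp_t[i] = static_cast<uint8_t>(x);
        log_t[x] = static_cast<uint8_t>(i);
        x <<= 1;
        if (x & 0x100) x ^= 0x11D;
    }
    for (int i = 255; i < 512; ++i) exp_t[i] = exp_t[i - 255];

    for (int a = 0; a < 256; ++a)
        for (int b = 0; b < 256; ++b) {
            xor_table[a][b] = static_cast<uint8_t>(a ^ b);
            // a*b = exp(log a + log b), with 0 handled separately.
            mul_table[a][b] = (a && b) ? exp_t[log_t[a] + log_t[b]] : 0;
        }
}
```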
Decoder shaders in Cg
- The decoding algorithm is more complex: it must be decomposed into 3 different shaders
- These shaders correspond to the 3 consecutive phases of the Gauss-Jordan elimination:
  1. Forward elimination: reduce the new packet by the existing rows
  2. Pivot search: find the pivot element in the reduced packet
  3. Backward substitution: substitute the reduced and normalized packet into the existing rows
NVIDIA's CUDA toolkit
- Compute Unified Device Architecture (CUDA): parallel computing applications in the C language
- Modern GPUs have many processor cores and can launch thousands of threads with zero scheduling overhead
- Terminology: host = CPU, device = GPU, kernel = a function executed on the GPU
- A kernel is executed in the Single Program Multiple Data (SPMD) model, meaning that a user-specified number of threads execute the same program
CUDA implementation
- A CUDA-capable device is required: NVIDIA GeForce 8 series at minimum
- This is a more native approach; we have fewer restrictions
- A large number of threads must be launched to achieve the GPU's peak performance
- All data structures are stored in CUDA arrays, which are bound to texture references if necessary
- Computations are visualized using an OpenGL GUI
Encoder kernel in CUDA
- Encoding is a matrix multiplication in the GF domain, and can be considered a highly parallel computation problem
- We can achieve a very fine granularity by launching a thread for every single byte to be computed (see the per-byte sketch below)
- Galois multiplication is implemented by array look-ups, but we have a native XOR operator
- The encoder kernel is quite simple
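The kernel itself is not reproduced in the slides; this plain C++ sketch shows the work a single thread would do for one output byte (row, col), with mul_table being the 256x256 multiplication table from the earlier sketch:

```cpp
#include <cstdint>

// Work of one (hypothetical) encoder thread: compute one byte of one
// coded packet as a GF(2^8) dot product of a coefficient row with the
// corresponding data column. In CUDA, blockIdx/threadIdx would select
// (row, col); here they are plain parameters.
uint8_t encode_byte(const uint8_t* coeff,        // N x N coefficient matrix
                    const uint8_t* data,         // N packets of L bytes each
                    const uint8_t mul_table[256][256],
                    int N, int L, int row, int col) {
    uint8_t acc = 0;
    for (int g = 0; g < N; ++g)                  // addition in GF(2^m) is XOR
        acc ^= mul_table[coeff[row * N + g]][data[g * L + col]];
    return acc;
}
```

Launching one such thread per output byte gives the very fine granularity mentioned above: N x L fully independent computations per batch of coded packets.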
Decoder kernels in CUDA
- Gauss-Jordan elimination means that the decoding of each coded packet can only start after the decoding of the previous coded packets has finished => we have a sequential algorithm
- Parallelization is only possible within the decoding of the current coded packet
- We need 2 separate kernels, for forward and backward substitution
- The search for the first non-zero element must be performed on the CPU side, because synchronization is not possible between all GPU threads => the CPU must assist the GPU (see the control-flow sketch below)
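A C++ control-flow sketch of this CPU-assisted loop; the function names are purely illustrative stand-ins for the two kernel launches and the device-to-host copy, none of which are shown in the slides:

```cpp
#include <cstdint>
#include <vector>

// Placeholders for the two decoder kernels and the device->host copy;
// the real versions (CUDA kernel launches plus a cudaMemcpy) are not
// shown in the slides, so these names are hypothetical.
static std::vector<std::vector<uint8_t>> reduced; // stand-in for device memory

void forward_elimination_on_gpu(int /*packet_index*/) {}  // would launch kernel 1
void backward_substitution_on_gpu(int /*pivot*/) {}       // would launch kernel 2
std::vector<uint8_t> read_reduced_coefficients(int packet_index) {
    return reduced.at(packet_index);                      // would be a device->host copy
}

// CPU-assisted decoding: the GPU reduces the packet in parallel, then
// the CPU scans for the pivot, since GPU threads cannot all synchronize.
bool decode_packet(int packet_index, int N) {
    forward_elimination_on_gpu(packet_index);             // parallel on device
    std::vector<uint8_t> c = read_reduced_coefficients(packet_index);
    int pivot = 0;
    while (pivot < N && c[pivot] == 0) ++pivot;           // sequential on host
    if (pivot == N) return false;                         // non-innovative, discard
    backward_substitution_on_gpu(pivot);                  // parallel on device
    return true;
}
```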
[Figure: visualization panels for (A) the OpenGL and (B) the CPU implementation — random coefficient matrix, original image, encoding, ongoing decoding, and final decoding.]
Performance evaluation
- It is difficult to compare the actual performance of these implementations
- A lot of factors have to be taken into consideration: shader/kernel execution times, memory transfers between host and device memory, shader/kernel initialization & parameter setup, CPU-GPU synchronization
- Measurement results are not uniform, because we cannot have exclusive control over the GPU: other applications may have a negative impact
CPU implementation [results figure]
OpenGL & Cg implementation [results figure]
CUDA implementation [results figure]
Sparse Code Structures
- What does sparsity mean? A large fraction of the coefficients are zero
- Why use sparse structures? Efficient decoders (less complexity)
- What are we giving up? Performance: we need to transmit more coded packets
- What are the challenges? Decoders, the complexity-performance trade-off, and the fact that re-coding may destroy the sparse structure (a sparse coding-vector sketch follows below)
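A minimal C++ sketch of drawing a sparse coding vector, assuming each coefficient is non-zero independently with probability p (the density parameter and GF(2^8) coefficients are assumptions):

```cpp
#include <cstdint>
#include <vector>
#include <random>

// Draw a sparse GF(2^8) coding vector of length M: each coefficient is
// zero with probability 1 - p, otherwise uniform over {1, ..., 255}.
std::vector<uint8_t> sparse_coding_vector(size_t M, double p, std::mt19937& rng) {
    std::bernoulli_distribution nonzero(p);
    std::uniform_int_distribution<int> value(1, 255);
    std::vector<uint8_t> c(M, 0);
    for (size_t j = 0; j < M; ++j)
        if (nonzero(rng)) c[j] = static_cast<uint8_t>(value(rng));
    return c;  // expected density: p*M non-zero coefficients
}
```

Lowering p cuts encoding and decoding work, at the price of more linearly dependent packets, which is exactly the complexity-performance trade-off named above.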
Sparse Code Structures
- Decoders: the forward pass of Gaussian elimination is the dominant effect on complexity, and it can introduce spurious coefficients in unprocessed coded packets: the sparse structure is lost
- Re-coding: if not careful, it can increase density
[Figure: a 6x6 sparse coefficient matrix over P1 ... P6, including the row P1 + 2P3 + 3P5 + 5P6; it starts with 14 non-zero coefficients and, after some elimination steps, fill-in raises this to 16.]