Network Coding: Theory and Applications


Network Coding: Theory and Applications. PhD Course, Part IV, Tuesday 9.15-12.15, 18.6.2013. Muriel Médard (MIT), Frank H. P. Fitzek (AAU), Daniel E. Lucani (AAU), Morten V. Pedersen (AAU)

Plan:
- Hello World!
- Intra-flow network coding
- Intra-flow: complexity, overhead, energy
- Analog
- Core project assignment
- Theory
- Multicast and co.
- Distributed storage
- Wash up
- Inter-flow network coding
- KODO + simulator
- KODO + exercises
- Group work (three sessions)

Recoding Packets: generating a new linear network coded packet (CP) from previously received coded packets. [Figure: two coded packets f (coefficients C11, C12) and e (coefficients C21, C22) are combined with local coefficients d1 and d2 into a new coded packet whose header carries the coefficients X1, X2.]

X1 = d1 C11 + d2 C21 and X2 = d1 C12 + d2 C22. Recall that f = C11 P1 + C12 P2 and e = C21 P1 + C22 P2. Thus, d1 f + d2 e = d1 C11 P1 + d1 C12 P2 + d2 C21 P1 + d2 C22 P2 = X1 P1 + X2 P2.
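To make the recoding bookkeeping concrete, here is a minimal C++ sketch over GF(2), where each local coefficient d_i is a single bit, so multiplying by d_i means including or excluding a packet and addition is XOR. All names (CodedPacket, recode) are illustrative, not the course's KODO API.

```
// Minimal recoding sketch over GF(2) (hypothetical names).
// A recoder buffers previously received coded packets; it emits a
// fresh coded packet by XOR-ing a random subset, and XORs the
// coefficient vectors the same way, so the new header (X1, X2, ...)
// stays consistent with the payload.
#include <cstdint>
#include <random>
#include <vector>

struct CodedPacket {
    std::vector<uint8_t> coeffs;  // one GF(2) coefficient per original packet
    std::vector<uint8_t> data;    // payload bytes
};

CodedPacket recode(const std::vector<CodedPacket>& buffer, std::mt19937& rng) {
    // Assumes a non-empty buffer of equal-size packets.
    const size_t gen = buffer.front().coeffs.size();
    const size_t len = buffer.front().data.size();
    CodedPacket out{std::vector<uint8_t>(gen, 0), std::vector<uint8_t>(len, 0)};
    std::bernoulli_distribution pick(0.5);  // each d_i drawn uniformly from GF(2)
    for (const CodedPacket& cp : buffer) {
        if (!pick(rng)) continue;           // d_i = 0: this packet is excluded
        for (size_t i = 0; i < gen; ++i) out.coeffs[i] ^= cp.coeffs[i];
        for (size_t i = 0; i < len; ++i) out.data[i] ^= cp.data[i];
    }
    return out;  // may be all-zero; a real recoder would redraw in that case
}
```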

Systematic Coding: Complexity. [Figure: an M x M coding matrix for a generation of M = 6 packets P1...P6. The first D = 3 rows are uncoded packets (CP1 = P1, CP2 = P2, CP3 = P3); the remaining rows are coded packets with random coefficients, e.g. CP4 = a41 P1 + a42 P2 + ... + a46 P6. Substituting the D known packets into the coded rows leaves an n x n system in the coefficients a44...a66.]

Operations for the first elimination (the product step): D(M - D). Gaussian elimination of the remaining n x n matrix, n = M - D, requires An^3 + Bn^2 + Cn operations. The distribution of D determines the average number of operations and is linked to the channel model. For IID erasures Be(Pe), each of the M systematic packets is lost independently with probability Pe, so on average n is about M*Pe, giving A(M*Pe)^3 + B(M*Pe)^2 + C(M*Pe), i.e. O(M^3 Pe^3).

S60 Implementation: RLNC (2007)

Network Coding GF(2)

Systematic Network Coding GF(2)

Coding throughput on a laptop: Lenovo T61p, 2.53 GHz Intel Core 2 Duo, 2 GB RAM, Kubuntu 8.10 64-bit

Coding throughput on a Nokia N95: Nokia N95-8GB, ARM11 332 MHz CPU, 128 MB RAM, Symbian OS 9.2

Energy Consumption

Current coding speeds in 2012. [Plot: coding speed in MByte/s (log scale) versus field size (2, 8, 16, OPF) for generation sizes 16, 32, 64, 128.] ... and in 2007 we had 2 kByte/s for a generation size of 5!

IMPLEMENTATION OF RANDOM LINEAR NETWORK CODING ON OPENGL-ENABLED GRAPHICS CARDS

Main Motivation: Mobile devices have co-processors or accelerators for specific tasks, for the sake of speed and energy consumption. Examples: voice codec, video codec, gaming. But there is no network coding support (yet).

Example: Video on the N95 (Spiderman 3)

Software approach (DivX player): 1.59 W
- Display: 0.4 W
- Audio: 0.1 W
- CPU @ 88%: 0.675 W

Hardware accelerator (built-in player): 0.94 W
- Display: 0.4 W
- Audio: 0.1 W
- CPU @ 31%: 0.35 W
- Accelerator: 0.2 W

CPU implementation: A simple C++ console application with some customizable parameters: L (packet length) and N (generation size). Object-oriented implementation with Encoder and Decoder classes. Addition and subtraction over the Galois field are simply XOR operations on the CPU. Galois multiplication and division tables are pre-calculated and stored in arrays, so both operations can be performed by array lookups. Gauss-Jordan elimination is used for decoding: an on-the-fly version of the standard Gaussian elimination. This serves as the reference implementation.
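The lookup-table trick deserves a short illustration. In this minimal sketch (the table layout, names, and choice of field polynomial are assumptions, not the authors' exact code), multiplication in GF(2^8) reduces to two log-table lookups, one index addition, and one exp-table lookup:

```
// Hypothetical table-based GF(2^8) arithmetic, in the style of the
// reference CPU implementation. Tables are built once from the
// primitive polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11D), for which
// x (= 2) generates the multiplicative group.
#include <cstdint>

static uint8_t gf_exp[512];   // doubled so gf_exp[i + j] needs no mod 255
static uint8_t gf_log[256];

void gf_init() {
    uint16_t x = 1;
    for (int i = 0; i < 255; ++i) {
        gf_exp[i] = static_cast<uint8_t>(x);
        gf_log[x] = static_cast<uint8_t>(i);
        x <<= 1;                       // multiply by the generator 2
        if (x & 0x100) x ^= 0x11D;     // reduce modulo the field polynomial
    }
    for (int i = 255; i < 512; ++i) gf_exp[i] = gf_exp[i - 255];
}

inline uint8_t gf_add(uint8_t a, uint8_t b) { return a ^ b; }  // add = XOR

inline uint8_t gf_mul(uint8_t a, uint8_t b) {
    if (a == 0 || b == 0) return 0;
    return gf_exp[gf_log[a] + gf_log[b]];      // pure array lookups
}

inline uint8_t gf_div(uint8_t a, uint8_t b) {  // caller guarantees b != 0
    if (a == 0) return 0;
    return gf_exp[gf_log[a] + 255 - gf_log[b]];
}
```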

Graphics cards: Originally designed for real-time rendering of 3D graphics (the past: the fixed-function pipeline). They evolved into programmable parallel processors with enormous computing power (the present: the programmable pipeline). Now they can even perform general-purpose computations, with some restrictions (the future: the General-Purpose Graphics Processing Unit, GPGPU).

Platform of choice: NVIDIA GeForce 9600 GT and NVIDIA GeForce 9200M GS

OpenGL & Cg implementation: OpenGL is a standard cross-platform API for computer graphics. It cannot be used on its own; a shader language is also necessary to implement custom algorithms. A shader is a short program used to program certain stages of the rendering pipeline. We chose NVIDIA's Cg toolkit as the shader language. The developer is forced to think in the traditional concepts of 3D graphics (e.g. vertices, pixels, triangles, lines and points).

Encoder shader in Cg: A regular bitmap image serves as input data. Coefficients and data packets are stored in textures (2D arrays of bytes in graphics memory that can be accessed efficiently). The XOR operation and Galois multiplication are also implemented by texture lookups: a 256x256 black-and-white texture is necessary for each. The encoded packets are rendered (computed) line by line onto the screen and saved into a texture.

Decoder shaders in Cg: The decoding algorithm is more complex and must be decomposed into 3 different shaders. These shaders correspond to the 3 consecutive phases of the Gauss-Jordan elimination:
1. Forward elimination: reduce the new packet by the existing rows
2. Find the pivot element in the reduced packet
3. Backward-substitute the reduced and normalized packet into the existing rows

NVIDIA's CUDA toolkit: Compute Unified Device Architecture (CUDA) enables parallel computing applications in the C language. Modern GPUs have many processor cores and can launch thousands of threads with zero scheduling overhead. Terminology: host = CPU, device = GPU, kernel = a function executed on the GPU. A kernel is executed in the Single Program Multiple Data (SPMD) model, meaning that a user-specified number of threads execute the same program.
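As a minimal illustration of the SPMD model (a generic sketch, not from the paper): every launched thread executes the same kernel and derives which data element it owns from its block and thread indices.

```
// Minimal CUDA SPMD sketch: n threads run the same kernel body,
// distinguished only by their computed global index.
#include <cuda_runtime.h>

__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
    if (i < n) data[i] *= factor;                   // guard the last block
}

int main() {
    const int n = 1 << 20;
    float* d;
    cudaMalloc(&d, n * sizeof(float));
    // The host (CPU) launches the kernel on the device (GPU):
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```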

CUDA implementation: A CUDA-capable device is required (NVIDIA GeForce 8 series at minimum). This is a more native approach, with fewer restrictions. A large number of threads must be launched to achieve the GPU's peak performance. All data structures are stored in CUDA arrays, which are bound to texture references where necessary. Computations are visualized using an OpenGL GUI.

Encoder kernel in CUDA: Encoding is a matrix multiplication in the GF domain and can be considered a highly parallel computation problem. We can achieve very fine granularity by launching a thread for every single byte to be computed. Galois multiplication is implemented by array lookups, but we have a native XOR operator. The encoder kernel is quite simple.
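A sketch of what such a one-thread-per-output-byte encoder kernel can look like (hypothetical names; the actual implementation stores its data in CUDA arrays bound to textures rather than plain global arrays, and the log/exp tables mirror the host-side tables sketched earlier):

```
// Hypothetical RLNC encoder kernel: thread (p, b) computes byte b of
// coded packet p as a GF(2^8) dot product of coefficient row p with
// column b of the generation.
// Launch: grid = (ceil(L / 256), number_of_packets), block = 256.
#include <cstdint>

__device__ uint8_t d_exp[512];
__device__ uint8_t d_log[256];

__device__ inline uint8_t gf_mul(uint8_t a, uint8_t b) {
    if (a == 0 || b == 0) return 0;
    return d_exp[d_log[a] + d_log[b]];   // lookup-based multiply
}

__global__ void encode(const uint8_t* __restrict__ coeffs, // [packets x N]
                       const uint8_t* __restrict__ data,   // [N x L]
                       uint8_t* __restrict__ out,          // [packets x L]
                       int N, int L) {
    int p = blockIdx.y;                              // coded packet index
    int b = blockIdx.x * blockDim.x + threadIdx.x;   // byte index in packet
    if (b >= L) return;
    uint8_t acc = 0;
    for (int g = 0; g < N; ++g)                      // GF(2^8) dot product:
        acc ^= gf_mul(coeffs[p * N + g], data[g * L + b]);  // add = native XOR
    out[p * L + b] = acc;
}
```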

Decoder kernels in CUDA: Gauss-Jordan elimination means that the decoding of each coded packet can only start after the decoding of the previous coded packets has finished, so we have a sequential algorithm. Parallelization is only possible within the decoding of the current coded packet. We need 2 separate kernels, for forward and backward substitution. The search for the first non-zero element must be performed on the CPU side, because synchronization is not possible between all GPU threads: the CPU must assist the GPU.
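The resulting CPU/GPU cooperation per incoming packet can be sketched as follows (function names are illustrative; the two kernels would resemble the encoder kernel above and are only declared here):

```
// Hypothetical per-packet decoding loop showing why the CPU must
// assist: the pivot search runs on the host between the two kernels.
#include <cstdint>
#include <cuda_runtime.h>

// Assumed kernels, defined elsewhere: reduce the incoming packet by the
// decoded rows / substitute the pivoted row back into the existing rows.
__global__ void forward_eliminate(uint8_t* m, uint8_t* p, int N, int L);
__global__ void backward_substitute(uint8_t* m, uint8_t* p, int pivot,
                                    int N, int L);

void decode_packet(uint8_t* d_matrix, uint8_t* d_packet,
                   uint8_t* h_coeffs, int N, int L) {
    // 1. Forward elimination on the GPU: reduce the new packet by all
    //    previously decoded rows (parallel over the packet's bytes).
    forward_eliminate<<<(L + 255) / 256, 256>>>(d_matrix, d_packet, N, L);

    // 2. Copy the reduced coefficient row to the host and find the
    //    pivot on the CPU (no global sync across all GPU threads).
    cudaMemcpy(h_coeffs, d_packet, N, cudaMemcpyDeviceToHost);
    int pivot = -1;
    for (int i = 0; i < N; ++i)
        if (h_coeffs[i] != 0) { pivot = i; break; }
    if (pivot < 0) return;  // linearly dependent packet: discard

    // 3. Normalize and backward-substitute into existing rows on the GPU.
    backward_substitute<<<(L + 255) / 256, 256>>>(d_matrix, d_packet,
                                                  pivot, N, L);
}
```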

[Screenshots: random coefficient matrix; encoding (A), ongoing decoding, and final decoding with OpenGL; the original image; encoding (B), ongoing decoding, and final decoding on the CPU.]

Performance evaluation: It is difficult to compare the actual performance of these implementations; a lot of factors have to be taken into consideration: shader/kernel execution times, memory transfers between host and device memory, shader/kernel initialization and parameter setup, and CPU-GPU synchronization. Measurement results are not uniform, because we cannot have exclusive control over the GPU: other applications may have a negative impact.

CPU implementation

OpenGL & Cg implementation

CUDA implementation

Sparse Code Structures. What does sparsity mean? A large fraction of the coefficients are zero. Why use sparse structures? Efficient decoders (less complexity). What are we giving up? Performance: we need to transmit more coded packets. What are the challenges? Decoders, the complexity-performance trade-off, and the fact that recoding may destroy the sparse structure.

Sparse Code Structures. Decoders: the forward pass of Gaussian elimination is the dominant effect on complexity, and it can introduce spurious coefficients in unprocessed coded packets, so the sparse structure is lost. Recoding: if not careful, it can also increase density. [Example: eliminating the row P1 + 2P3 + 3P5 + 5P6 against a sparse 6x6 coefficient matrix over packets P1...P6; start: 14 non-zero coefficients, after some steps: 16 non-zero coefficients.]
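To see the fill-in effect concretely, here is a small illustrative C++ snippet (not from the slides) that counts non-zero coefficients before and after the forward pass of Gaussian elimination over GF(2); with sparse random rows, the count typically grows as elimination proceeds.

```
// Illustrative fill-in demo over GF(2): forward elimination can add
// non-zero coefficients to not-yet-processed sparse rows.
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

using Row = std::vector<uint8_t>;  // one GF(2) coefficient per packet

int nonzeros(const std::vector<Row>& m) {
    int c = 0;
    for (const Row& r : m) for (uint8_t v : r) c += (v != 0);
    return c;
}

int main() {
    const int n = 64;
    std::mt19937 rng(42);
    std::bernoulli_distribution sparse(0.1);  // ~10% initial density
    std::vector<Row> m(n, Row(n, 0));
    for (Row& r : m) for (uint8_t& v : r) v = sparse(rng);
    std::cout << "before: " << nonzeros(m) << " non-zeros\n";

    // Forward pass of Gaussian elimination over GF(2): row add = XOR.
    int row = 0;
    for (int col = 0; col < n && row < n; ++col) {
        int pivot = -1;
        for (int r = row; r < n; ++r)
            if (m[r][col]) { pivot = r; break; }
        if (pivot < 0) continue;
        std::swap(m[row], m[pivot]);
        for (int r = row + 1; r < n; ++r)   // eliminate below the pivot;
            if (m[r][col])                  // each XOR can create fill-in
                for (int c = 0; c < n; ++c) m[r][c] ^= m[row][c];
        ++row;
    }
    std::cout << "after:  " << nonzeros(m) << " non-zeros\n";
    return 0;
}
```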