Accelerating System Simulations

Accelerating System Simulations
김용정 (Manager), Senior Applications Engineer
© 2013 The MathWorks, Inc.

Why simulation acceleration?
- From algorithm exploration to system design:
  - Size and complexity of models increase
  - Time needed for a single simulation increases
  - Number of test cases increases
  - Test cases become larger
- Need to reduce:
  - simulation time during design
  - simulation time for large-scale testing during prototyping

MATLAB is quite fast
- Optimized and widely used libraries
  - BLAS: Basic Linear Algebra Subprograms (multithreaded)
  - LAPACK: Linear Algebra Package
- JIT (Just-In-Time) acceleration
  - On-the-fly multithreaded code generation for increased speed
- Built-in support for vector and matrix operations

Application
- LTE Physical Downlink Control Channel (PDCCH)

Workflow
- Start with a baseline algorithm
- Profile it to introduce a performance yardstick
- Introduce the following optimizations:
  - Better MATLAB serial programming techniques
  - Using System objects
  - MATLAB to C code generation (MEX)
  - Parallel Computing
  - GPU-optimized System objects
  - Rapid Accelerator mode of simulation in Simulink

Simulation acceleration options in MATLAB
[Diagram: options for accelerating the user's code]
- Better MATLAB code
- System objects
- MATLAB to C
- Parallel Computing
- GPU processing

Profiling MATLAB algorithms
- Profiler summarizes MATLAB code execution:
  - total time spent within each function
  - which lines of code use the most processing time
- Helps identify algorithm bottlenecks
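The profiling step can also be driven from the command line. The following is a minimal sketch, assuming a made-up workload (`work` is a hypothetical stand-in for the algorithm under test, not part of the presentation); it collects profiler data and prints the functions with the largest total time.

```matlab
% Hypothetical workload, used only for illustration.
work = @() sum(svd(rand(200)));

profile on                  % start collecting execution statistics
work();
profile off                 % stop collecting
stats = profile('info');    % structure containing a FunctionTable field

% Print up to three functions with the largest total time.
[~, order] = sort([stats.FunctionTable.TotalTime], 'descend');
for k = order(1:min(3, numel(order)))
    fprintf('%-40s %8.4f s\n', stats.FunctionTable(k).FunctionName, ...
            stats.FunctionTable(k).TotalTime);
end
```

Running `profile viewer` after `profile off` opens the same data in the graphical Profiler report, which also shows the line-by-line times.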

Effective MATLAB programming techniques

Example of pre-allocation

Before (the array grows inside the loop):

    y=[];
    for n=1:len/tx
        G=[u(idx1(n)) u(idx2(n));...
           -conj(u(idx2(n))) conj(u(idx1(n)))];
        y=[y;G];
    end

After (pre-allocated and vectorized):

    y=complex(zeros(len,tx));
    y(idx1,1)=u(idx1);
    y(idx1,2)=u(idx2);
    y(idx2,1)=-conj(u(idx2));
    y(idx2,2)=conj(u(idx1));

- Pre-allocation
  - Initialize an array using its final size
  - Helps avoid dynamically resizing arrays in a loop
- Vectorization
  - Convert code from using scalar loops to using matrix/vector operations
  - Helps MATLAB leverage processor-optimized libraries for vector processing
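The two versions on this slide can be checked against each other. The sketch below assumes small test values that do not appear in the presentation (`len = 8`, `tx = 2`, odd/even index vectors for `idx1` and `idx2`, random complex input `u`) and confirms that the loop and the vectorized form produce the same matrix.

```matlab
% Assumed test values (not from the slides).
len  = 8;
tx   = 2;
u    = complex(randn(len,1), randn(len,1));
idx1 = (1:2:len).';     % odd sample positions
idx2 = (2:2:len).';     % even sample positions

% Version 1: the output array grows on every iteration.
y1 = [];
for n = 1:len/tx
    G = [ u(idx1(n))        u(idx2(n)); ...
         -conj(u(idx2(n)))  conj(u(idx1(n)))];
    y1 = [y1; G]; %#ok<AGROW>
end

% Version 2: pre-allocated output filled with vectorized assignments.
y2 = complex(zeros(len,tx));
y2(idx1,1) =  u(idx1);
y2(idx1,2) =  u(idx2);
y2(idx2,1) = -conj(u(idx2));
y2(idx2,2) =  conj(u(idx1));

% Both versions should agree exactly.
assert(isequal(y1, y2));
```

For realistic sizes, wrapping each version in `tic`/`toc` gives a quick measure of how much the pre-allocated form saves.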

Using System objects of DSP & Communications System Toolboxes
- System objects facilitate stream processing
- Can accelerate simulation because they:
  - Decouple declaration from the execution of the algorithms
  - Reduce overhead of parameter handling in the loop
  - Are mostly implemented as MATLAB executables (MEX)

Example of System objects

    function s = Alamouti_DecoderS(u,H) %#codegen
    % STBC Combiner
    persistent htddec
    if isempty(htddec)
        htddec = comm.OSTBCCombiner(...
            'NumTransmitAntennas',2,'NumReceiveAntennas',2);
    end
    s = step(htddec, u, H);

MATLAB to C code generation
- MATLAB Coder
- Automatically generate a MEX function
- Call the generated MEX file within the testbench
- Verify same numerical results
- Assess the baseline function and the generated MEX function for speed
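The steps above can be sketched as a short testbench. This is an illustration under stated assumptions, not the presenter's script: it assumes the `Alamouti_DecoderS` function from the System objects slide is saved as `Alamouti_DecoderS.m` on the MATLAB path, that MATLAB Coder and Communications System Toolbox are installed, and that the input sizes (`T = 100` symbol periods, 2 transmit and 2 receive antennas) are acceptable example values.

```matlab
% Assumptions: Alamouti_DecoderS.m is on the path; MATLAB Coder and
% Communications System Toolbox are installed; sizes are example values.
T = 100;                                   % number of symbol periods (even)
u = complex(randn(T,2),   randn(T,2));     % received signal, T x Nr
H = complex(randn(T,2,2), randn(T,2,2));   % channel estimate, T x Nt x Nr

% Generate Alamouti_DecoderS_mex from the example inputs.
codegen Alamouti_DecoderS -args {u, H}

% Verify that the MEX function returns the same numerical results.
sRef = Alamouti_DecoderS(u, H);
sMex = Alamouti_DecoderS_mex(u, H);
assert(max(abs(sRef(:) - sMex(:))) < 1e-10);

% Assess both versions for speed.
nRuns = 1000;
tic; for k = 1:nRuns, Alamouti_DecoderS(u, H);     end; tBase = toc;
tic; for k = 1:nRuns, Alamouti_DecoderS_mex(u, H); end; tMex  = toc;
fprintf('baseline %.3f s, MEX %.3f s\n', tBase, tMex);
```

Because the MEX function is generated from fixed-size example inputs, it accepts only inputs of the same size and type as `u` and `H`.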

Parallel Simulation Runs
[Diagram: MATLAB with toolboxes and blocksets distributing Task 1 to Task 4 across four workers, with time axes for the serial and parallel runs]
>> Demo

Summary
- matlabpool: available workers
- No modification of the algorithm
- Use a parfor loop instead of a for loop
- Parallel computation or simulation leads to further acceleration
- More cores = more speed
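A minimal sketch of the for-to-parfor change, assuming independent test cases. The function handle `simulateCase` is a hypothetical placeholder for one simulation run and is not part of the presentation. Note that in MATLAB releases from R2013b onward, `parpool` replaces `matlabpool` for opening a pool of workers.

```matlab
% Hypothetical stand-in for one independent simulation run.
simulateCase = @(k) sum((1:k).^2);

nCases = 8;
result = zeros(1, nCases);

% Each iteration is independent, so the loop body is unchanged;
% only the keyword differs from an ordinary for loop.
parfor k = 1:nCases
    result(k) = simulateCase(k);
end

disp(result);
```

The loop qualifies for `parfor` because no iteration reads a value written by another iteration; `result` is indexed only by the loop variable.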


What is a Graphics Processing Unit (GPU)?
- Originally for graphics acceleration, now also used for scientific calculations
- Massively parallel array of integer and floating-point processors
  - Typically hundreds of processors per card
  - GPU cores complement CPU cores
- Dedicated high-speed memory

Why would you want to use a GPU?
- Speed up execution of computationally intensive simulations
- For example: [Chart: performance of A\b with double precision]

Options for Targeting GPUs
(ordered from greatest ease of use to greatest control)
1) Use GPU with MATLAB built-in functions
2) Execute MATLAB functions elementwise on the GPU
3) Create kernels from existing CUDA code and PTX files

Data Transfer between MATLAB and GPU

    % Push data from CPU to GPU memory
    Agpu = gpuArray(A)
    % Bring results from GPU memory back to CPU
    B = gather(Bgpu)

GPU Processing with Communications System Toolbox
- Alternative implementations of many System objects take advantage of GPU processing
- Use Parallel Computing Toolbox to execute many communications algorithms directly on the GPU
- GPU System objects:
  - comm.gpu.TurboDecoder
  - comm.gpu.ViterbiDecoder
  - comm.gpu.LDPCDecoder
  - comm.gpu.PSKDemodulator
  - comm.gpu.AWGNChannel
- Easy-to-use syntax
- Dramatically accelerate simulations

Example: Turbo Coding
- Impressive coding gain
- High computational complexity
- Bit-error rate performance as a function of number of iterations

Code fragment (truncated on the slide):

    ... = comm.TurboDecoder('NumIterations', numIter, ...

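Acceleration with GPU System objects

    Version              Elapsed time   Acceleration
    CPU                  8 hours         1.0
    1 GPU                40 minutes     12.0
    Cluster of 4 GPUs    11 minutes     43.0

- Same numerical results
- Code change: replace the CPU System objects with their GPU counterparts

    ... = comm.TurboDecoder('NumIterations', N, ...    becomes    ... = comm.gpu.TurboDecoder('NumIterations', N, ...
    ... = comm.AWGNChannel(...    becomes    ... = comm.gpu.AWGNChannel(...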
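The acceleration column follows from dividing the CPU time by each elapsed time. A small worked check, using only the elapsed times quoted in the table:

```matlab
% Elapsed times from the table, converted to minutes.
tCPU  = 8 * 60;     % 8 hours
t1GPU = 40;         % 1 GPU
t4GPU = 11;         % cluster of 4 GPUs

speedup1 = tCPU / t1GPU;    % 480/40
speedup4 = tCPU / t4GPU;    % 480/11

fprintf('1 GPU:  %.1fx\n', speedup1);
fprintf('4 GPUs: %.1fx\n', speedup4);
```

The single-GPU figure is exactly 12. The four-GPU ratio comes out near 43.6 rather than the quoted 43.0, so the elapsed times on the slide are presumably rounded.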

Key Operations in Turbo Coding Function: CPU vs. GPU Version 1

CPU:

    % Turbo Encoder
    hTEnc = comm.TurboEncoder('TrellisStructure',poly2trellis(4, [13 15], 13),...
        'InterleaverIndices', intrlvrIndices);
    % AWG Noise
    hAWGN = comm.AWGNChannel('NoiseMethod', 'Variance');
    % BER measurement
    hBER = comm.ErrorRate;
    % Turbo Decoder
    hTDec = comm.TurboDecoder('TrellisStructure',poly2trellis(4, [13 15], 13),...
        'InterleaverIndices', intrlvrIndices,'NumIterations', numIter);
    ber = zeros(3,1); % initialize BER output
    %% Processing loop
    % ber = [error rate; number of errors; number of bits compared]
    while ( ber(2) < MaxNumErrs && ber(3) < MaxNumBits)
        data = randn(blkLength, 1)>0.5;
        % Encode random data bits
        yEnc = step(hTEnc, data);
        % Modulate, add noise to real bipolar data
        modOut = 1-2*yEnc;
        rData = step(hAWGN, modOut);
        % Convert to log-likelihood ratios for decoding
        llrData = (-2/noiseVar).*rData;
        % Turbo Decode
        decData = step(hTDec, llrData);
        % Calculate errors
        ber = step(hBER, data, decData);
    end

GPU Version 1 (only the decoder object changes):

    % Turbo Encoder
    hTEnc = comm.TurboEncoder('TrellisStructure',poly2trellis(4, [13 15], 13),...
        'InterleaverIndices', intrlvrIndices);
    % AWG Noise
    hAWGN = comm.AWGNChannel('NoiseMethod', 'Variance');
    % BER measurement
    hBER = comm.ErrorRate;
    % Turbo Decoder
    hTDec = comm.gpu.TurboDecoder('TrellisStructure',poly2trellis(4, [13 15], 13),...
        'InterleaverIndices', intrlvrIndices,'NumIterations', numIter);
    ber = zeros(3,1); % initialize BER output
    %% Processing loop
    % ber = [error rate; number of errors; number of bits compared]
    while ( ber(2) < MaxNumErrs && ber(3) < MaxNumBits)
        data = randn(blkLength, 1)>0.5;
        % Encode random data bits
        yEnc = step(hTEnc, data);
        % Modulate, add noise to real bipolar data
        modOut = 1-2*yEnc;
        rData = step(hAWGN, modOut);
        % Convert to log-likelihood ratios for decoding
        llrData = (-2/noiseVar).*rData;
        % Turbo Decode
        decData = step(hTDec, llrData);
        % Calculate errors
        ber = step(hBER, data, decData);
    end

Profile results in Turbo Coding Function: CPU vs. GPU Version 1

Profiler time per statement (s):

    Statement                                          CPU       GPU Version 1
    hTEnc = comm.TurboEncoder(...)                     <0.01     <0.01
    hAWGN = comm.AWGNChannel(...)                      <0.01     <0.01
    hBER = comm.ErrorRate                              <0.01     <0.01
    hTDec = comm.TurboDecoder(...) (CPU)               <0.01     -
    hTDec = comm.gpu.TurboDecoder(...) (GPU)           -         0.02
    ber = zeros(3,1)                                   <0.01     <0.01
    data = randn(blkLength, 1)>0.5                     0.30      0.28
    yEnc = step(hTEnc, data)                           2.33      2.38
    modOut = 1-2*yEnc                                  0.05      0.05
    rData = step(hAWGN, modOut)                        1.50      1.45
    llrData = (-2/noiseVar).*rData                     0.03      0.04
    decData = step(hTDec, llrData)                     330.54    98.18
    ber = step(hBER, data, decData)                    0.17      0.17

Key Operations in Turbo Coding Function: CPU vs. GPU Version 2

CPU:

    % Turbo Encoder
    hTEnc = comm.TurboEncoder('TrellisStructure',poly2trellis(4, [13 15], 13),...
        'InterleaverIndices', intrlvrIndices);
    % AWG Noise
    hAWGN = comm.AWGNChannel('NoiseMethod', 'Variance');
    % BER measurement
    hBER = comm.ErrorRate;
    % Turbo Decoder
    hTDec = comm.TurboDecoder('TrellisStructure',poly2trellis(4, [13 15], 13),...
        'InterleaverIndices', intrlvrIndices,'NumIterations', numIter);
    %% Processing loop
    % ber = [error rate; number of errors; number of bits compared]
    while ( ber(2) < MaxNumErrs && ber(3) < MaxNumBits)
        data = randn(blkLength, 1)>0.5;
        % Encode random data bits
        yEnc = step(hTEnc, data);
        % Modulate, add noise to real bipolar data
        modOut = 1-2*yEnc;
        rData = step(hAWGN, modOut);
        % Convert to log-likelihood ratios for decoding
        llrData = (-2/noiseVar).*rData;
        % Turbo Decode
        decData = step(hTDec, llrData);
        % Calculate errors
        ber = step(hBER, data, decData);
    end

GPU Version 2 (GPU noise channel, multi-frame decoding, data kept on the GPU):

    % Turbo Encoder
    hTEnc = comm.TurboEncoder('TrellisStructure',poly2trellis(4, [13 15], 13),...
        'InterleaverIndices', intrlvrIndices);
    % AWG Noise
    hAWGN = comm.gpu.AWGNChannel('NoiseMethod', 'Variance');
    % BER measurement
    hBER = comm.ErrorRate;
    % Turbo Decoder - setup for multi-frame or multi-user processing
    numFrames = 30;
    hTDec = comm.gpu.TurboDecoder('TrellisStructure',poly2trellis(4, [13 15], 13),...
        'InterleaverIndices', intrlvrIndices,'NumIterations',numIter,...
        'NumFrames',numFrames);
    %% Processing loop
    % ber = [error rate; number of errors; number of bits compared]
    while ( ber(2) < MaxNumErrs && ber(3) < MaxNumBits)
        data = randn(numFrames*blkLength, 1)>0.5;
        % Encode random data bits
        % (multiframeStep is a helper function that is not shown on the slide)
        yEnc = gpuArray(multiframeStep(hTEnc, data, numFrames));
        % Modulate, add noise to real bipolar data
        modOut = 1-2*yEnc;
        rData = step(hAWGN, modOut);
        % Convert to log-likelihood ratios for decoding
        llrData = (-2/noiseVar).*rData;
        % Turbo Decode
        decData = step(hTDec, llrData);
        % Calculate errors
        ber = step(hBER, data, gather(decData));
    end

Profile results in Turbo Coding Function: CPU vs. GPU Version 2

Profiler time per statement (s):

    Statement                                                     CPU       GPU Version 2
    hTEnc = comm.TurboEncoder(...)                                <0.01     <0.01
    hAWGN = comm.AWGNChannel(...) (CPU)                           <0.01     -
    hAWGN = comm.gpu.AWGNChannel(...) (GPU)                       -         0.03
    hBER = comm.ErrorRate                                         <0.01     <0.01
    numFrames = 30                                                -         0.01
    hTDec = comm.TurboDecoder(...) (CPU)                          <0.01     -
    hTDec = comm.gpu.TurboDecoder(..., 'NumFrames',numFrames)     -         0.01
    data = randn(...)>0.5                                         0.30      0.22
    yEnc = step(hTEnc, data) (CPU)                                2.33      -
    yEnc = gpuArray(multiframeStep(hTEnc, data, numFrames))       -         2.45
    modOut = 1-2*yEnc                                             0.05      0.02
    rData = step(hAWGN, modOut)                                   1.50      0.31
    llrData = (-2/noiseVar).*rData                                0.03      0.01
    decData = step(hTDec, llrData)                                330.54    20.89
    ber = step(hBER, data, decData) (CPU)                         0.17      -
    ber = step(hBER, data, gather(decData)) (GPU)                 -         0.09

Things to note when targeting GPU
- Minimize data transfer between CPU and GPU.
- Using the GPU only makes sense if the data size is large.
- Some functions in MATLAB are optimized and can be faster than the GPU equivalent (e.g. FFT).
- Use arrayfun to explicitly specify elementwise operations.
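The first two points can be illustrated with a short sketch: transfer a large array to the GPU once, chain several operations there, and gather only the small final result. The array size and `noiseVar` value are assumptions chosen for illustration, and the GPU path requires Parallel Computing Toolbox with a supported GPU; the sketch falls back to the CPU otherwise.

```matlab
% Assumed values (not from the slides).
noiseVar = 0.5;
x = rand(1e6, 1);                       % large input generated on the CPU

% Use the GPU only if the toolbox function exists and a device is present.
useGPU = ~isempty(which('gpuDeviceCount')) && gpuDeviceCount > 0;
if useGPU
    x = gpuArray(x);                    % single CPU-to-GPU transfer
end

% These operations run on the GPU when x is a gpuArray;
% the intermediate results never leave device memory.
modOut = 1 - 2*x;
llr    = (-2/noiseVar) .* modOut;
m      = mean(llr);

if useGPU
    m = gather(m);                      % single transfer back, one scalar only
end
fprintf('mean LLR = %.4f\n', m);
```

Gathering after every statement instead would add a GPU-to-CPU transfer per step, which is the overhead the first bullet warns about.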

Summary

Acceleration methodologies in MATLAB & Simulink, with the technology or product used for each:

1. Best Practices in Programming
   - Vectorization & pre-allocation
   - Environment tools (e.g. Profiler, Code Analyzer)
   Technology / Product: MATLAB, Toolboxes, System Toolboxes

2. Better Algorithms
   - Ideal environment for algorithm exploration
   - Rich set of functionality (e.g. System objects)
   Technology / Product: MATLAB, Toolboxes, System Toolboxes

3. More Processors or Cores
   - High-level parallel constructs (e.g. parfor, matlabpool)
   - Utilize clusters, clouds, and grids
   Technology / Product: Parallel Computing Toolbox, MATLAB Distributed Computing Server

4. Refactoring the Implementation
   - Compiled code (MEX)
   - GPUs, FPGA-in-the-Loop
   Technology / Product: MATLAB, MATLAB Coder, Parallel Computing Toolbox

Thank You
Q & A