Accelerating System Simulations

김용정 (Manager), Senior Applications Engineer
© 2013 The MathWorks, Inc.

Why simulation acceleration?
- From algorithm exploration to system design
  - Size and complexity of models increase
  - Time needed for a single simulation increases
  - Number of test cases increases
  - Test cases become larger
- Need to reduce
  - simulation time during design
  - simulation time for large-scale testing during prototyping

MATLAB is quite fast
- Optimized and widely used libraries
  - BLAS: Basic Linear Algebra Subprograms (multithreaded)
  - LAPACK: Linear Algebra Package
- JIT (Just-In-Time) acceleration
  - On-the-fly multithreaded code generation for increased speed
- Built-in support for vector and matrix operations

Application
- LTE Physical Downlink Control Channel (PDCCH)

Workflow
- Start with a baseline algorithm
- Profile it to introduce a performance yardstick
- Introduce the following optimizations:
  - Better MATLAB serial programming techniques
  - Using System objects
  - MATLAB to C code generation (MEX)
  - Parallel Computing
  - GPU-optimized System objects
  - Rapid Accelerator mode of simulation in Simulink

Simulation acceleration options in MATLAB
Options for accelerating the user's code:
- Better MATLAB code
- System objects
- MATLAB to C
- Parallel Computing
- GPU processing

Profiling MATLAB algorithms
- Profiler summarizes MATLAB code execution
  - total time spent within each function
  - which lines of code use the most processing time
- Helps identify algorithm bottlenecks
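
A minimal sketch of a profiling session. The `workload` handle is a hypothetical stand-in for the baseline algorithm and is not part of the original slides; `profile` and `tic`/`toc` are standard MATLAB functions.

```matlab
% Hypothetical workload standing in for the baseline algorithm.
workload = @() sum(svd(rand(300)));

profile on                  % start collecting execution statistics
workload();
profile off                 % stop collecting
stats = profile('info');    % structure holding the collected timing data

% Cross-check the total run time with tic/toc.
t0 = tic;
workload();
elapsed = toc(t0);
fprintf('Elapsed time: %.3f s\n', elapsed);

% profile viewer            % uncomment to open the interactive report
```

The tic/toc figure gives the yardstick for the whole run, while the Profiler report shows where inside the run the time goes.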

Effective MATLAB programming techniques

Example of pre-allocation and vectorization

Before (the array grows inside a scalar loop):

    y = [];
    for n = 1:len/tx
        G = [u(idx1(n)), u(idx2(n)); ...
             -conj(u(idx2(n))), conj(u(idx1(n)))];
        y = [y; G];
    end

After (pre-allocated and vectorized):

    y = complex(zeros(len,tx));
    y(idx1,1) = u(idx1);
    y(idx1,2) = u(idx2);
    y(idx2,1) = -conj(u(idx2));
    y(idx2,2) = conj(u(idx1));

- Pre-allocation
  - Initialize an array using its final size
  - Helps avoid dynamically resizing arrays in a loop
- Vectorization
  - Convert code from using scalar loops to using matrix/vector operations
  - Helps MATLAB leverage processor-optimized libraries for vector processing
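
The two versions on this slide can be checked against each other with a short script. This is a sketch under assumptions: the test data, `len = 1000`, `tx = 2`, and the definitions of `idx1` and `idx2` as odd and even symbol positions are not given in the slides.

```matlab
len  = 1000;                                  % number of input symbols (assumed even)
tx   = 2;                                     % number of transmit antennas
u    = complex(randn(len,1), randn(len,1));   % random complex test symbols
idx1 = (1:tx:len).';                          % assumed: odd symbol positions
idx2 = (2:tx:len).';                          % assumed: even symbol positions

% Version 1: scalar loop, output array grows on every iteration.
yLoop = [];
for n = 1:len/tx
    G = [u(idx1(n)), u(idx2(n)); ...
         -conj(u(idx2(n))), conj(u(idx1(n)))];
    yLoop = [yLoop; G];
end

% Version 2: pre-allocated output, filled with vector assignments.
yVec = complex(zeros(len, tx));
yVec(idx1,1) = u(idx1);
yVec(idx1,2) = u(idx2);
yVec(idx2,1) = -conj(u(idx2));
yVec(idx2,2) = conj(u(idx1));

% Both versions should produce identical results.
maxDiff = max(abs(yLoop(:) - yVec(:)));
fprintf('Maximum difference: %g\n', maxDiff);
```

Wrapping each version in `tic`/`toc` and increasing `len` is a simple way to see how the cost of the growing array scales.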

Using System objects of DSP & Communications System Toolboxes
- System objects facilitate stream processing
- They can accelerate simulation because they
  - decouple declaration from the execution of the algorithms
  - reduce overhead of parameter handling in the loop
  - are mostly implemented as MATLAB executables (MEX)

Example of System objects:

    function s = Alamouti_DecoderS(u,H) %#codegen
    % STBC Combiner
    persistent htddec
    if isempty(htddec)
        htddec = comm.OSTBCCombiner(...
            'NumTransmitAntennas',2,'NumReceiveAntennas',2);
    end
    s = step(htddec, u, H);

MATLAB to C code generation: MATLAB Coder
- Automatically generate a MEX function
- Call the generated MEX file within the testbench
- Verify same numerical results
- Assess the baseline function and the generated MEX function for speed
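
The four steps above can be sketched as follows. This assumes that `Alamouti_DecoderS.m` from the System objects slide is on the path and that MATLAB Coder and Communications System Toolbox are installed. The input sizes (100 time samples, 2 transmit and 2 receive antennas) are illustrative assumptions, not values from the slides.

```matlab
% Assumed example inputs: T x Nr received signal, T x Nt x Nr channel estimate.
T = 100;
u = complex(randn(T,2),   randn(T,2));
H = complex(randn(T,2,2), randn(T,2,2));

% 1) Generate a MEX function (creates Alamouti_DecoderS_mex).
codegen Alamouti_DecoderS -args {u,H}

% 2) Call the baseline and the generated MEX function from the testbench.
sRef = Alamouti_DecoderS(u, H);
sMex = Alamouti_DecoderS_mex(u, H);

% 3) Verify that both return the same numerical results.
maxErr = max(abs(sRef(:) - sMex(:)));
fprintf('Maximum difference: %g\n', maxErr);

% 4) Assess both versions for speed.
tRef = timeit(@() Alamouti_DecoderS(u, H));
tMex = timeit(@() Alamouti_DecoderS_mex(u, H));
fprintf('MATLAB: %.4g s, MEX: %.4g s\n', tRef, tMex);
```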

Parallel Simulation Runs
[Figure: MATLAB client (toolboxes, blocksets) distributing Task 1 to Task 4 across four workers, plotted against time]
>> Demo

Summary
- matlabpool: available workers
- No modification of algorithm
- Use parfor loop instead of for loop
- Parallel computation or simulation leads to further acceleration
- More cores = more speed
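
A minimal sketch of swapping `for` for `parfor`, using a hypothetical BPSK-over-AWGN bit-error-rate sweep that is not taken from the slides. Each Eb/N0 point is independent of the others, so the iterations can run on separate workers.

```matlab
% The deck opens workers with matlabpool; later MATLAB releases use parpool.
% Without an open pool the loop still runs, only serially.
snrdB   = 0:2:10;                 % assumed Eb/N0 points in dB
numBits = 1e5;                    % assumed bits per point
ber     = zeros(size(snrdB));     % pre-allocated, sliced output

parfor k = 1:numel(snrdB)
    bits  = rand(numBits,1) > 0.5;               % random data bits
    txSig = 1 - 2*bits;                          % bit 0 -> +1, bit 1 -> -1
    sigma = sqrt(1/(2*10^(snrdB(k)/10)));        % noise std for unit bit energy
    rxSig = txSig + sigma*randn(numBits,1);      % AWGN channel
    ber(k) = mean((rxSig < 0) ~= bits);          % hard decision and error rate
end

disp([snrdB; ber]);
```

The loop body is unchanged from what a serial `for` loop would contain; only the loop keyword differs.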


What is a Graphics Processing Unit (GPU)?
- Originally for graphics acceleration, now also used for scientific calculations
- Massively parallel array of integer and floating-point processors
  - Typically hundreds of processors per card
  - GPU cores complement CPU cores
- Dedicated high-speed memory

Why would you want to use a GPU?
- Speed up execution of computationally intensive simulations
- For example:
[Figure: Performance of A\b with double precision]

Options for Targeting GPUs
Ordered from greatest ease of use to greatest control:
1) Use GPU with MATLAB built-in functions
2) Execute MATLAB functions elementwise on the GPU
3) Create kernels from existing CUDA code and PTX files

Data Transfer between MATLAB and GPU

    % Push data from CPU to GPU memory
    Agpu = gpuArray(A)
    % Bring results from GPU memory back to CPU
    B = gather(Bgpu)
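
A short sketch of the full round trip, using the `A\b` solve mentioned on the earlier performance slide. The matrix size and the CPU fallback branch are assumptions added so the script also runs on a machine without a supported GPU or without Parallel Computing Toolbox.

```matlab
n = 500;
A = rand(n) + n*eye(n);     % assumed well-conditioned test matrix
b = rand(n,1);

% Use the GPU only if the function exists and a device is present.
hasGpu = exist('gpuDeviceCount') > 0 && gpuDeviceCount > 0;

if hasGpu
    Agpu = gpuArray(A);     % push data from CPU to GPU memory
    bgpu = gpuArray(b);
    xgpu = Agpu \ bgpu;     % built-in operator runs on the GPU
    x = gather(xgpu);       % bring the result back to CPU memory
else
    x = A \ b;              % CPU fallback
end

relResidual = norm(A*x - b) / norm(b);
fprintf('Relative residual: %g\n', relResidual);
```

Only the two `gpuArray` calls and the final `gather` differ from the CPU code; the solve itself is written the same way.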

GPU Processing with Communications System Toolbox
- Alternative implementations of many System objects take advantage of GPU processing
- Use Parallel Computing Toolbox to execute many communications algorithms directly on the GPU
- GPU System objects
  - comm.gpu.TurboDecoder
  - comm.gpu.ViterbiDecoder
  - comm.gpu.LDPCDecoder
  - comm.gpu.PSKDemodulator
  - comm.gpu.AWGNChannel
- Easy-to-use syntax
- Dramatically accelerate simulations

Example: Turbo Coding
- Impressive coding gain
- High computational complexity
[Figure: Bit-error rate performance as a function of number of iterations]

    ... = comm.TurboDecoder('NumIterations', numIter, ...

Acceleration with GPU System objects

    Version              Elapsed time   Acceleration
    CPU                  8 hours        1.0
    1 GPU                40 minutes     12.0
    Cluster of 4 GPUs    11 minutes     43.0

Same numerical results.

Code changes required:

    ... = comm.TurboDecoder(     becomes     ... = comm.gpu.TurboDecoder(
              'NumIterations', N, ...
    ... = comm.AWGNChannel(      becomes     ... = comm.gpu.AWGNChannel(

Key Operations in Turbo Coding: CPU vs. GPU Version 1

CPU setup:

    % Turbo encoder
    htenc = comm.TurboEncoder('TrellisStructure', poly2trellis(4, [13 15], 13), ...
        'InterleaverIndices', intrlvrindices);
    % AWG noise
    hawgn = comm.AWGNChannel('NoiseMethod', 'Variance');
    % BER measurement
    hber = comm.ErrorRate;
    % Turbo decoder
    htdec = comm.TurboDecoder('TrellisStructure', poly2trellis(4, [13 15], 13), ...
        'InterleaverIndices', intrlvrindices, 'NumIterations', numIter);

GPU Version 1 setup: identical to the CPU setup except for the decoder.

    % Turbo decoder
    htdec = comm.gpu.TurboDecoder('TrellisStructure', poly2trellis(4, [13 15], 13), ...
        'InterleaverIndices', intrlvrindices, 'NumIterations', numIter);

Processing loop (the same code for both versions):

    ber = zeros(3,1);   % initialize BER output: [error rate; errors; bits]
    %% Processing loop
    while (ber(2) < MaxNumErrs && ber(3) < MaxNumBits)
        data = randn(blklength, 1) > 0.5;
        % Encode random data bits
        yEnc = step(htenc, data);
        % Modulate, add noise to real bipolar data
        modout = 1 - 2*yEnc;
        rData = step(hawgn, modout);
        % Convert to log-likelihood ratios for decoding
        llrdata = (-2/noiseVar).*rData;
        % Turbo decode
        decdata = step(htdec, llrdata);
        % Calculate errors
        ber = step(hber, data, decdata);
    end

Profile results in Turbo Coding: CPU vs. GPU Version 1

Profiler time (s) per statement:

    Statement                                      CPU       GPU v1
    Setup
      htenc = comm.TurboEncoder(...)               <0.01     <0.01
      hawgn = comm.AWGNChannel(...)                <0.01     <0.01
      hber = comm.ErrorRate                        <0.01     <0.01
      htdec = comm.TurboDecoder(...)               <0.01     -
      htdec = comm.gpu.TurboDecoder(...)           -         0.02
      ber = zeros(3,1)                             <0.01     <0.01
    Processing loop
      data = randn(blklength, 1) > 0.5             0.30      0.28
      yEnc = step(htenc, data)                     2.33      2.38
      modout = 1 - 2*yEnc                          0.05      0.05
      rData = step(hawgn, modout)                  1.50      1.45
      llrdata = (-2/noiseVar).*rData               0.03      0.04
      decdata = step(htdec, llrdata)               330.54    98.18
      ber = step(hber, data, decdata)              0.17      0.17

Key Operations in Turbo Coding: CPU vs. GPU Version 2

CPU version:

    % Turbo encoder
    htenc = comm.TurboEncoder('TrellisStructure', poly2trellis(4, [13 15], 13), ...
        'InterleaverIndices', intrlvrindices);
    % AWG noise
    hawgn = comm.AWGNChannel('NoiseMethod', 'Variance');
    % BER measurement
    hber = comm.ErrorRate;
    % Turbo decoder
    htdec = comm.TurboDecoder('TrellisStructure', poly2trellis(4, [13 15], 13), ...
        'InterleaverIndices', intrlvrindices, 'NumIterations', numIter);
    ber = zeros(3,1);   % initialize BER output: [error rate; errors; bits]
    %% Processing loop
    while (ber(2) < MaxNumErrs && ber(3) < MaxNumBits)
        data = randn(blklength, 1) > 0.5;
        % Encode random data bits
        yEnc = step(htenc, data);
        % Modulate, add noise to real bipolar data
        modout = 1 - 2*yEnc;
        rData = step(hawgn, modout);
        % Convert to log-likelihood ratios for decoding
        llrdata = (-2/noiseVar).*rData;
        % Turbo decode
        decdata = step(htdec, llrdata);
        % Calculate errors
        ber = step(hber, data, decdata);
    end

GPU Version 2:

    % Turbo encoder
    htenc = comm.TurboEncoder('TrellisStructure', poly2trellis(4, [13 15], 13), ...
        'InterleaverIndices', intrlvrindices);
    % AWG noise
    hawgn = comm.gpu.AWGNChannel('NoiseMethod', 'Variance');
    % BER measurement
    hber = comm.ErrorRate;
    % Turbo decoder: setup for multi-frame or multi-user processing
    numFrames = 30;
    htdec = comm.gpu.TurboDecoder('TrellisStructure', poly2trellis(4, [13 15], 13), ...
        'InterleaverIndices', intrlvrindices, 'NumIterations', numIter, ...
        'NumFrames', numFrames);
    ber = zeros(3,1);   % initialize BER output: [error rate; errors; bits]
    %% Processing loop
    while (ber(2) < MaxNumErrs && ber(3) < MaxNumBits)
        data = randn(numFrames*blklength, 1) > 0.5;
        % Encode random data bits, then move the result to the GPU
        yEnc = gpuArray(multiframestep(htenc, data, numFrames));
        % Modulate, add noise to real bipolar data
        modout = 1 - 2*yEnc;
        rData = step(hawgn, modout);
        % Convert to log-likelihood ratios for decoding
        llrdata = (-2/noiseVar).*rData;
        % Turbo decode
        decdata = step(htdec, llrdata);
        % Calculate errors after bringing the decoded bits back to the CPU
        ber = step(hber, data, gather(decdata));
    end

Profile results in Turbo Coding: CPU vs. GPU Version 2

Profiler time (s) per statement:

    Statement                                              CPU       GPU v2
    Setup
      htenc = comm.TurboEncoder(...)                       <0.01     <0.01
      hawgn = comm.AWGNChannel(...)                        <0.01     -
      hawgn = comm.gpu.AWGNChannel(...)                    -         0.03
      hber = comm.ErrorRate                                <0.01     <0.01
      numFrames = 30                                       -         0.01
      htdec = comm.TurboDecoder(...)                       <0.01     -
      htdec = comm.gpu.TurboDecoder(..., 'NumFrames', ...) -         0.01
    Processing loop
      data = randn(blklength, 1) > 0.5                     0.30      -
      data = randn(numFrames*blklength, 1) > 0.5           -         0.22
      yEnc = step(htenc, data)                             2.33      -
      yEnc = gpuArray(multiframestep(htenc, data, ...))    -         2.45
      modout = 1 - 2*yEnc                                  0.05      0.02
      rData = step(hawgn, modout)                          1.50      0.31
      llrdata = (-2/noiseVar).*rData                       0.03      0.01
      decdata = step(htdec, llrdata)                       330.54    20.89
      ber = step(hber, data, decdata)                      0.17      -
      ber = step(hber, data, gather(decdata))              -         0.09

Things to note when targeting GPU
- Minimize data transfer between CPU and GPU.
- Using the GPU only makes sense if the data size is large.
- Some functions in MATLAB are optimized and can be faster than the GPU equivalent (e.g. FFT).
- Use arrayfun to explicitly specify elementwise operations.
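
A minimal sketch of the `arrayfun` point, using a hypothetical elementwise function that maps log-likelihood ratios to probabilities. Neither the function nor the test data comes from the slides. The fallback branch is an assumption added so the script also runs without a GPU.

```matlab
% Hypothetical elementwise function: LLR value -> probability.
llr2prob = @(l) 1 ./ (1 + exp(-l));
llr = 4*randn(1e4, 1);              % assumed test data

% Use the GPU only if the function exists and a device is present.
hasGpu = exist('gpuDeviceCount') > 0 && gpuDeviceCount > 0;

if hasGpu
    llrGpu = gpuArray(llr);              % one transfer to the GPU
    pGpu = arrayfun(llr2prob, llrGpu);   % function applied elementwise on the GPU
    p = gather(pGpu);                    % one transfer back
else
    p = llr2prob(llr);                   % vectorized CPU fallback
end

fprintf('Mean probability: %.3f\n', mean(p));
```

Keeping the data on the GPU between the transfer in and the final `gather` follows the first note on this slide: each extra transfer adds overhead.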

Summary: Acceleration methodologies in MATLAB & Simulink

1. Best Practices in Programming
   - Vectorization & pre-allocation
   - Environment tools (e.g. Profiler, Code Analyzer)
   Technology / Product: MATLAB, Toolboxes, System Toolboxes

2. Better Algorithms
   - Ideal environment for algorithm exploration
   - Rich set of functionality (e.g. System objects)
   Technology / Product: MATLAB, Toolboxes, System Toolboxes

3. More Processors or Cores
   - High-level parallel constructs (e.g. parfor, matlabpool)
   - Utilize clusters, clouds, and grids
   Technology / Product: Parallel Computing Toolbox, MATLAB Distributed Computing Server

4. Refactoring the Implementation
   - Compiled code (MEX)
   - GPUs, FPGA-in-the-Loop
   Technology / Product: MATLAB, MATLAB Coder, Parallel Computing Toolbox

Thank You
Q & A