Incremental Risk Charge With cuFFT: A Case Study of Enabling Multi-Dimensional Gain With Few GPUs

1 Incremental Risk Charge With cuFFT: A Case Study of Enabling Multi-Dimensional Gain With Few GPUs. Amit Kalele and Manoj Nambiar. April 21,

2 Optimization & Parallelization COE: Center of Excellence for Optimization & Parallelization. Application domains/sectors: CFD, finance, life sciences, pharma, utilities, etc. Performance optimization; porting of applications; parallel solutions/application development.

3 Incremental Risk Charge: Incremental Risk Charge (IRC) is included in the new regulation. Minimum trading book capital: VaR, IRC, stressed VaR. Credit risk calculation. A typical report definition consists of nodes. Proprietary algorithm(s); only the computational bottleneck was exposed.

4 IRC Calculation Flow: IRC calculation; loss given default (LGD); credit rating; ultimate issuer; product type, etc.

5 IRC Computational Flow: random credit movement paths; default loss distribution with FFTs; IRC calculation.

6 FFTs Computation: The FFT computation uses 1D FFTs, offloaded to a grid of 50 workstations. Time for 150 scenarios: 41 min. One scenario consists of 150,000 arrays; each consists of elements.

7 Problem Definition: Compute FFTs for 150 scenarios and optimize for computation time and for the energy required for the computation.

8 Experimental Setup & Procedure: Host: Xeon E socket (6 core x 2), 126GB RAM. GPUs: K20 x 4 (in x16 slot), 5GB RAM. Create 150,000 arrays, each of elements. Each array is filled with random numbers between 0 and 1. Transfer batches of arrays to the GPU. Compute the FFT and copy back the results.

9 cuFFT: An efficient library. Create a 1D plan using cufftPlan1d. Compute transforms using cufftExecR2C/cufftExecD2Z. Execution time for 150 scenarios: 1 GPU, 66.7 min; 50 workstations, 41 min.
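
To make the R2C/D2Z transform type concrete, here is a minimal pure-Python sketch of what a real-to-complex transform returns for one input array: only the first n/2 + 1 complex coefficients, since the remaining ones are redundant by Hermitian symmetry. This is a naive O(n^2) DFT for illustration only and is not cuFFT; the function name `rfft_naive` is hypothetical, and the sign and scaling follow the usual unnormalized forward-transform convention.

```python
import cmath


def rfft_naive(x):
    """Naive real-to-complex DFT (illustrative, O(n^2)).

    Returns the n // 2 + 1 non-redundant coefficients of the
    unnormalized forward transform of the real sequence x.
    """
    n = len(x)
    out = []
    for k in range(n // 2 + 1):
        acc = 0j
        for j, xj in enumerate(x):
            # Forward transform: negative sign in the exponent, no scaling.
            acc += xj * cmath.exp(-2j * cmath.pi * k * j / n)
        out.append(acc)
    return out
```

For a real input of length n, the output holds n // 2 + 1 complex values, so an output buffer of double-complex values is roughly the same size in bytes as the double-precision input.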

10 Challenge: cuFFT is fast GFLOPs *. The area of concern is the data transfer: over 90% of the time is spent in data transfer. One scenario consists of 150,000 arrays; each consists of elements (doubles). Data for each scenario: 150,000 x x GB.
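
The per-scenario data volume follows from simple arithmetic, sketched below. The per-array element count did not survive in the text above, so `n_elements` is left as a parameter; the helper names are hypothetical, and the output size assumes a double-precision real-to-complex transform that returns n // 2 + 1 complex values per array.

```python
BYTES_PER_DOUBLE = 8           # one double-precision real input value
BYTES_PER_DOUBLE_COMPLEX = 16  # one double-precision complex output value


def scenario_bytes(n_arrays, n_elements):
    """Return (input_bytes, output_bytes) for one scenario.

    n_elements is an assumption supplied by the caller; it is not a
    value taken from the slides.
    """
    input_bytes = n_arrays * n_elements * BYTES_PER_DOUBLE
    output_bytes = n_arrays * (n_elements // 2 + 1) * BYTES_PER_DOUBLE_COMPLEX
    return input_bytes, output_bytes


def to_gb(n_bytes):
    """Convert bytes to decimal gigabytes (10^9 bytes)."""
    return n_bytes / 1e9
```

Calling `scenario_bytes(150_000, n)` with the actual element count gives the volume that has to cross the PCIe bus in each direction per scenario, which is why the transfer dominates the run time.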

11 Optimizing Data Transfer, Pinned Memory: Using pinned memory: cudaHostAlloc, cudaMallocHost. Never paged out. Required for enabling multiple streams. Faster data transfer compared to pageable memory. Enabled 2x gain.

12 Optimizing Data Transfer, Streams: Multiple-stream computation. Hiding latencies by overlapping: compute/data-transfer overlap, H2D and D2H overlap. Requires pinned memory on the host side. Asynchronous memcpy.

13 Stream Computation [Diagram: Stream 1 and Stream 2 each show a cudaMemcpy H2D (CPU to GPU), cufftExecD2Z, and a cudaMemcpy D2H (GPU to CPU).]
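
The benefit of overlapping can be estimated with a small timing model, sketched here. It assumes each batch goes through three stages (H2D copy, FFT, D2H copy), that each stage is a separate resource able to run concurrently with the other two, and that enough streams and buffers exist to keep every stage busy. The function names and the per-batch timings are hypothetical, not measurements from the slides.

```python
def serial_time(n_batches, h2d, fft, d2h):
    """Total time when every batch runs copy-in, FFT, copy-out back to back."""
    return n_batches * (h2d + fft + d2h)


def overlapped_time(n_batches, h2d, fft, d2h):
    """Total time when the three stages are pipelined across batches.

    Each stage processes batches in order; a batch enters a stage once
    the previous stage has finished it and the stage itself is free.
    """
    h2d_done = fft_done = d2h_done = 0.0
    for _ in range(n_batches):
        h2d_done = h2d_done + h2d
        fft_done = max(fft_done, h2d_done) + fft
        d2h_done = max(d2h_done, fft_done) + d2h
    return d2h_done
```

With many batches, the overlapped time approaches the number of batches times the slowest stage. When transfers dominate, as reported above, the kernel time is hidden almost entirely behind the copies.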

14 Multi-Stream Computation

15 Performance: Overall time was reduced to min on 1 GPU. [Chart: time for 150 scenarios, execution time in min, split into kernel time and data transfer time, for baseline, pinned memory, and pinned memory with multiple streams.]

16 Computing with Multi-GPU: 4 GPUs, almost linear scale-up. [Chart: time for 150 IRCs in min, workstations (#50) vs. GPUs (1 server + 4 GPUs); surviving data label: 4.5.]
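
One simple way to spread a scenario over several GPUs is to give each device a contiguous, near-equal share of the arrays, as sketched below. The slides do not say how the work was actually divided, so this partitioning scheme and the name `split_evenly` are assumptions for illustration.

```python
def split_evenly(n_items, n_parts):
    """Split range(n_items) into n_parts contiguous (start, stop) ranges.

    Range sizes differ by at most one; the first ranges take the remainder.
    """
    base, extra = divmod(n_items, n_parts)
    ranges = []
    start = 0
    for part in range(n_parts):
        size = base + (1 if part < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges
```

For 150,000 arrays and 4 GPUs, each device receives 37,500 arrays. Because the arrays are independent 1D transforms, no communication between devices is needed, which is consistent with the near-linear scale-up reported.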

17 Energy Consumption: Estimated energy savings, assuming 100 W per server and 225 W per GPU. [Chart: energy in kWh, workstations (#50) vs. GPUs; surviving data labels: 250, 4.95.]
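
The energy estimate is power multiplied by run time. The sketch below uses the stated assumptions of 100 W per server and 225 W per GPU as defaults; the run time is a parameter supplied by the caller, and the function names are hypothetical. It does not attempt to reproduce the chart values.

```python
def power_watts(n_servers, n_gpus, watts_per_server=100.0, watts_per_gpu=225.0):
    """Total power draw under the stated per-server and per-GPU assumptions."""
    return n_servers * watts_per_server + n_gpus * watts_per_gpu


def energy_kwh(watts, minutes):
    """Energy in kWh for a constant power draw over a run time in minutes."""
    return watts * (minutes / 60.0) / 1000.0
```

Under these assumptions, one server with four GPUs draws 100 + 4 x 225 = 1000 W, so each hour of computation costs 1 kWh.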

18 Concluding Remarks: 1D FFT computation is extremely well suited to GPUs. Optimized library: cuFFT. Multiple streams allow hiding latencies. Multi-dimensional gains: energy, space, hardware footprint, and time. Huge reduction in computation cost.

19 Acknowledgements: We are grateful to Vinay Deshpande, NVIDIA, Pune, India, for enabling the benchmarks on K20 GPUs.

20 Thank You
