Building Real-Time Professional Visualization Solutions on GPUs
Kristof Denolf, Samuel Maroy, Ronny Dewaele

3 Outline
- Barco's professional visualization solutions
- The need for performance portability
- Real PCIe data rates to/from GPU
  - Transfer only (e.g. the bandwidth test)
  - Transfers with parallel GPU compute/rendering
  - Comparison of CUDA, OpenCL, OpenGL and GPUdirect for video data rates
- The cost of OpenGL/CL (or CUDA) interoperability
- Towards partial transfers to reduce the latency
- Conclusions

4 Company structure
Four core divisions, five wholly-owned ventures
- Core divisions: Entertainment; Healthcare; Control rooms & Simulation; Defense & Aerospace
- Ventures: Digital signage; Lighting; LED; ATM software; Design services

5 Healthcare
Supporting healthcare professionals a billion times a year

6 Control Rooms
Helping over 2.5 billion commuters get home safely every day

7 Media & Entertainment
Setting the scene for over 2,500 gigs and shows every year

8 Professional Visualization
- High quality
- High resolutions
- Multiple sources
- True colours
- Low latency
- Perfect calibration
- Synchronization

9 OpenCL as Initial Answer for Portability
- OpenCL for GPU and multi-core CPU programming of image processing chains
- OpenCL for GPU accelerated prototypes of new algorithms

10 Portability also Towards FPGA Design
[Desh Singh, presented at DATE 2011 and FPGA 2011 Pre-Conference Workshop]
[Altera news: San Jose, Calif., November 15, 2011]

12 Ideal Data Transfer has Highest Rate and Virtually no Compute Impact
- Per frame n: CPU-to-GPU transfer in (n), GPU processing GPUproc (n), GPU-to-CPU transfer out (n)
- Asynchronous transfers from pinned CPU memory: highest rate
- Transfers in parallel with GPU work: in (n+1), GPUproc (n) and out (n-1) overlap, to maximize GPU compute and transfer time
[Figure: CPU with its DRAM, connected over the PCIe bus to a Quadro graphics card with its DRAM and two copy engines]
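The overlap of in (n+1), GPUproc (n) and out (n-1) can be illustrated with a small schedule sketch. This is only an illustration of the idea on the slide, not the authors' implementation: the function name `pipeline_schedule` and the assumption that each stage takes exactly one time slot are hypothetical.

```python
def pipeline_schedule(num_frames):
    """Return, per time slot, the frame index handled by
    (upload engine, compute, download engine); None means idle.

    Assumes, for illustration only, that every stage takes one slot.
    """
    slots = []
    for t in range(num_frames + 2):  # two extra slots drain the pipeline
        upload = t if t < num_frames else None
        compute = t - 1 if 0 <= t - 1 < num_frames else None
        download = t - 2 if 0 <= t - 2 < num_frames else None
        slots.append((upload, compute, download))
    return slots


for t, (up, comp, down) in enumerate(pipeline_schedule(4)):
    print(f"slot {t}: in={up} GPUproc={comp} out={down}")
```

In the steady state (for example slot 2 above) all three units are busy with different frames, which is the situation the slide calls ideal.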

13 (Over) Peak Data Rates Highest for Direct Transfers from/to Pinned Host Memory
[Figure: two plots of transfer rate (MBps) versus transfer size (MB). Left: oclbandwidthtest, with Cpu2Gpu and Gpu2Cpu curves for pinned/direct, pinned/mapped and paged/direct transfers. Right: testtransferspeed (OpenCL), with Cpu2Gpu and Gpu2Cpu curves labelled pinned/direct, pinned/mapped and pinned/paged.]
All tests done on Q3000M on PCIe x16 Gen2 (GPUdirect on Q4000)

14 Other Transfers Sustain a Similar Rate
- CL buffers, images and GL interoperable variants similar
- Choose most appropriate CL memory type
Efficiency:
- > 4 GBps from 480p (1.3 MB)
- > 4.8 GBps from 720p (3.5 MB)
- All numbers for RGBA
- Write to GPU: p60
- Read from GPU: p60
[Figure: OpenCL transfer rate (MBps) versus transfer size (MB), with Cpu2Gpu and Gpu2Cpu curves for buffer, image, buffergl and imagegl]
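The transfer sizes quoted for 480p and 720p can be checked with a short calculation. The sketch below assumes 720x480 and 1280x720 frames, 4 bytes per RGBA pixel, and that "MB" on the slide means 2^20 bytes; none of these assumptions is stated in the slides.

```python
BYTES_PER_PIXEL = 4  # RGBA at 8 bits per channel (assumption)


def frame_mib(width, height):
    """Size of one uncompressed RGBA frame in MiB (2**20 bytes)."""
    return width * height * BYTES_PER_PIXEL / 2**20


print(round(frame_mib(720, 480), 1))   # 480p, prints 1.3
print(round(frame_mib(1280, 720), 1))  # 720p, prints 3.5
```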

15 OpenCL/CUDA Transfers with Parallel Compute (Transfer Dominated)
- GPU dual copy engines working
- GPU compute in parallel with data transfers
- Still some gaps present
[Figure: OpenCL and CUDA panels showing in (n+1), GPUproc (n) and out (n-1)]

16 Throughput Impact Related to Kernel Duration
Efficiency (OpenCL):
- > 3.2 GBps from 480p (1.3 MB)
- > 3.4 GBps from 720p (3.5 MB)
- All numbers for RGBA
- Write to GPU: p60
- Read from GPU: p60
Note that maximizing the GPU compute time is also hampered.
[Figure: OpenCL transfer rate (MBps) versus transfer size (MB); curves: Peak transfer, GPU parallel]

17 CUDA and GPUdirect Achieve Highest Peak Rate
Efficiency:
- CUDA transfers boost to 6 GBps
- DVP read from GPU up to 7.5 GBps for very large transfers
- Other: all around 5.2 GBps
- How to get 6 GBps for all programming models?
[Figure: transfer rate (MBps) versus transfer size (MB), with Cpu2Gpu and Gpu2Cpu curves for OpenCL, OpenGL, GPUdirect and CUDA]

19 CL/GL Interoperability Hampers Parallelism
[Figure: OpenCL transfer rate (MBps) versus transfer size (MB); curves: Peak transfer, GPU parallel]

20 CUDA / GL Interoperability not Trivial
[Figure: two panels, "No GL rendering" and "With GL rendering"]

21 Return to OpenGL, render on full HD screen (1/2)
[Figure: OpenGL transfer rate (MBps) versus transfer size (MB); curves: Peak transfer, GPU parallel]

22 Return to OpenGL, Readback to CPU Memory (2/2)
[Figure: OpenGL transfer rate (MBps) versus transfer size (MB); curves: Peak transfer, GPU parallel]

23 to Avoid Interoperability Issue
- 9 HD 1080p streams in at 60 fps: 4.5 GBps
- Parallel rendering
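The 4.5 GBps figure for nine 1080p inputs at 60 fps is consistent with uncompressed RGBA video, as the sketch below shows. It assumes 1920x1080 frames, 4 bytes per pixel and decimal gigabytes (10^9 bytes); the function name `aggregate_gbps` is hypothetical.

```python
BYTES_PER_PIXEL = 4  # RGBA at 8 bits per channel (assumption)


def aggregate_gbps(width, height, fps, streams):
    """Aggregate data rate of uncompressed streams in GB/s (10**9 bytes/s)."""
    bytes_per_second = width * height * BYTES_PER_PIXEL * fps * streams
    return bytes_per_second / 1e9


print(round(aggregate_gbps(1920, 1080, 60, 9), 1))  # prints 4.5
```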

25 Partial Image Transfers for Low Latency
- A 1/8 HD (1 MB) transfer size has a reasonable rate (certainly for CUDA)
- Concurrent partial updates of the same image?
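A rough calculation shows why smaller transfers reduce latency: processing can only start once the transferred block has arrived. The sketch below assumes a 1920x1080 RGBA frame and a hypothetical sustained rate of 4 GB/s; that rate is an illustrative assumption, not a value measured in this presentation.

```python
BYTES_PER_PIXEL = 4     # RGBA at 8 bits per channel (assumption)
ASSUMED_RATE_BPS = 4e9  # hypothetical sustained rate in bytes/s, not measured


def transfer_ms(num_bytes, rate_bps=ASSUMED_RATE_BPS):
    """Time in milliseconds to move num_bytes at a constant rate."""
    return num_bytes / rate_bps * 1e3


full_frame = 1920 * 1080 * BYTES_PER_PIXEL  # bytes in one full HD frame
one_eighth = full_frame // 8                # bytes in a 1/8 HD slice

print(round(one_eighth / 2**20, 2))         # slice size in MiB, close to 1
print(round(transfer_ms(full_frame), 2))    # wait before a full frame is usable
print(round(transfer_ms(one_eighth), 2))    # wait before the first slice is usable
```

Under these assumptions the first slice is available eight times sooner than the full frame, provided the smaller transfer still sustains a comparable rate.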

26 Conclusions
- Barco's professional visualization requires high quality, high resolution and multiple sources
- Barco's professional visualization desires portability
- DMA-enabled and fully parallel data transfers are essential
- Mind the gap: peak data rates cannot be achieved continuously
- CL (or CUDA) / GL interoperability is difficult
