Optimization Case Study for Kepler K20 GPUs: Synthetic Aperture Radar Backprojection

Thomas M. Benson (1), Daniel P. Campbell (1), David Tarjan (2), Justin Luitjens (2)
(1) Georgia Tech Research Institute, {thomas.benson,dan.campbell}@gtri.gatech.edu
(2) NVIDIA Corporation, {dtarjan,jluitjens}@nvidia.com

GPU Technology Conference, Session S3274, March 19, 2013

SAR Backprojection Overview

- Synthetic aperture radar (SAR) is a radar-based imaging modality.
- Backprojection (BP) is one form of SAR image formation; it requires O(N^3) operations for N pulses and an N x N image.

(Image credit: https://www.sdms.afrl.af.mil/index.php?collection=ccd_challenge)

Backprojection Kernel with Linear Interpolation

     1: for all voxels v do
     2:   I_v = 0                                % Initialize complex voxel to 0
     3:   for all pulses p do
     4:     R = ||p_vox - p_plat||               % Distance from platform to voxel
     5:     bin = floor((R - R_0) / dR)          % Range bin (integer)
     6:     if bin in [0, L-2] then
     7:       w = (R - R_0) / dR - bin           % Interpolation weight
              % Phase history data sampled using linear interpolation
     8:       s = (1 - w) * X[bin, p] + w * X[bin + 1, p]
              % exp(j * 2 * k_u * R) represents the ideal reflector response
     9:       I_v += s * exp(j * 2 * k_u * R)
    10:     end if
    11:   end for
    12: end for
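As a point of reference, a minimal CUDA sketch of this loop structure, one thread per pixel with FP64 throughout (V1-style). All parameter names and the pulse-major layout of the phase-history data X are assumptions for illustration; the optimized versions discussed below restructure this considerably.

    // Hypothetical baseline kernel: one thread per pixel, FP64 intermediates.
    __global__ void backproject(float2* img, const double2* X,
                                const double3* plat,  // platform position per pulse
                                double x0, double y0, double dx, double dy,
                                double R0, double dR, double k2,  // k2 = 2*k_u
                                int npulses, int L, int nx, int ny)
    {
        int ix = blockIdx.x * blockDim.x + threadIdx.x;
        int iy = blockIdx.y * blockDim.y + threadIdx.y;
        if (ix >= nx || iy >= ny) return;

        double px = x0 + ix * dx, py = y0 + iy * dy;  // voxel position (z = 0)
        double acc_re = 0.0, acc_im = 0.0;

        for (int p = 0; p < npulses; ++p) {
            double3 q = plat[p];
            double ddx = px - q.x, ddy = py - q.y, ddz = -q.z;
            double R = sqrt(ddx * ddx + ddy * ddy + ddz * ddz);
            double rbin = (R - R0) / dR;
            int bin = (int)rbin;
            if (bin >= 0 && bin <= L - 2) {
                double w = rbin - bin;
                double2 a = X[p * L + bin], b = X[p * L + bin + 1];
                double s_re = (1.0 - w) * a.x + w * b.x;  // linear interpolation
                double s_im = (1.0 - w) * a.y + w * b.y;
                double ph_s, ph_c;
                sincos(k2 * R, &ph_s, &ph_c);             // ideal reflector response
                acc_re += s_re * ph_c - s_im * ph_s;      // complex multiply-accumulate
                acc_im += s_re * ph_s + s_im * ph_c;
            }
        }
        img[iy * nx + ix] = make_float2((float)acc_re, (float)acc_im);
    }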

What is Good Enough?

- Double precision (FP64)? Single precision (FP32)? Mixed precision?
- Intrinsics? Approximations? Texture sampling?
- BP optimization involves mixed precision and approximations.
- We do not focus on numerical requirements here, but we note that it has been widely reported that the range calculation requires double precision.
- The sine and cosine requirements are more lax, given an accurate argument.

Error Metrics

We use a dB-scale signal-to-error ratio to judge numerical approximations:

    SER_dB = 10 log10( sum_i |g_i|^2 / sum_i |g_i - t_i|^2 )

where g is the double-precision reference image and t is the test image. We have also evaluated the results qualitatively, and we look for SER_dB values higher than 50 dB.
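As a concrete reading of the metric, a host-side sketch; the interleaved (re, im) array layout and the function name are assumptions, not part of the talk.

    #include <cmath>
    #include <cstddef>

    // Hypothetical helper: SER in dB between a double-precision reference
    // image g and a test image t, each with n complex samples interleaved.
    double ser_db(const double* g, const double* t, std::size_t n)
    {
        double sig = 0.0, err = 0.0;
        for (std::size_t i = 0; i < 2 * n; i += 2) {
            double dre = g[i] - t[i], dim = g[i + 1] - t[i + 1];
            sig += g[i] * g[i] + g[i + 1] * g[i + 1];  // sum_i |g_i|^2
            err += dre * dre + dim * dim;              // sum_i |g_i - t_i|^2
        }
        return 10.0 * std::log10(sig / err);
    }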

Optimization Phases

- Algorithmic and numerical optimization
  - High-level algorithmic optimization
  - Numerical approximations
- Architecture-specific optimization
  - Incorporate architecture-specific instructions
  - Exploit the memory hierarchy
  - Profiling, occupancy and register utilization, loop unrolling, autotuning, etc.

The two phases are inter-dependent: architecture features guide the appropriate algorithmic and numerical optimizations. We focus on the architecture-specific phase for this talk.

Methodology

- Start with a double-precision implementation and apply incremental optimizations.
- Track the impact of successive optimizations. This is not perfect:
  - Optimizations are inter-dependent, so ordering matters.
  - We autotune at certain stages, but that finds local rather than global optima.
- We use CUDA 5.0 and driver 310.32 for all experiments.
- We report all results in giga-backprojections per second (GBP/s).

Algorithmic and Numerical Optimizations

- V1: Baseline. FP64 for all intermediate calculations.
- V2: Mixed precision. FP64 for the range calculation; FP32 for linear interpolation and accumulation.
- V3: Incremental phase calculations [1]. High-fidelity phase lookup table and intrinsic sincos instead of FP64 sincos.
- V4: Two-step Newton-Raphson (NR) for the square root.
- V5: One-step NR with pulse blocking for the square root.

          K20c GBP/s   C2050 GBP/s   SER
    V1    5.2          2.1           --
    V2    5.9          2.3           118.7 dB
    V3    9.2          3.8           112.1 dB
    V4    10.7         4.6           77.7 dB
    V5    11.0         5.4           62.9 dB

[1] T. M. Benson, D. P. Campbell, D. A. Cook, "Gigapixel Spotlight Synthetic Aperture Radar Backprojection Using Clusters of GPUs and CUDA," 2012 IEEE Radar Conference, pp. 853-858.
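A sketch of the V3 idea under assumed names: split the phase into a coarse part served by a high-fidelity table and a small residual handled by the fast FP32 intrinsic, then combine the two via the angle-addition identity. The LUT layout and function name are illustrative assumptions.

    // Hypothetical sketch of incremental phase calculation (V3-style).
    // phase_lut holds (cos, sin) pairs for coarse phase steps; the residual
    // is small enough for the FP32 intrinsic to remain accurate.
    __device__ float2 phasor(const float2* phase_lut, int lut_index,
                             float phase_residual)
    {
        float s, c;
        __sincosf(phase_residual, &s, &c);    // fast FP32 sin/cos intrinsic
        float2 coarse = phase_lut[lut_index]; // high-fidelity table entry
        // e^{j(a+b)} = e^{ja} * e^{jb}: complex multiply of the two parts
        return make_float2(coarse.x * c - coarse.y * s,
                           coarse.x * s + coarse.y * c);
    }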

Read-Only Data Cache

With compute capability 3.5, we can directly access the read-only data cache without using textures. The compiler may use such reads for const __restrict__ pointers, but we directly use the __ldg() intrinsic. For example,

    const float2 lutentry = __ldg(lutptr + index);

instead of

    const float2 lutentry = lutptr[index];

This is a minimal code change, so empirical evaluation is easy.
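For comparison, a sketch of the qualifier-based route the slide mentions, where the compiler may choose the read-only path on its own (kernel and parameter names assumed):

    // With const __restrict__, the CC 3.5 compiler may route these loads
    // through the read-only data cache without an explicit __ldg().
    __global__ void copy_lut(float2* out, const float2* __restrict__ lutptr,
                             int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = lutptr[i];  // eligible for the read-only load path
    }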

Read-Only Data Cache Results

          Baseline   X      LUT    Plat   LUT/Plat
    V1    5.2        5.3    5.2    5.7    5.7
    V2    5.9        5.8    5.9    6.0    6.0
    V3    9.3        9.1    9.2    9.3    9.2
    V4    10.7       10.6   10.6   12.0   11.5
    V5    11.0       11.0   11.7   12.5   12.9

All results in GBP/s. Columns indicate which arrays are read through the read-only cache: X := phase history data, Plat := platform positions.

- V5 has the lowest arithmetic intensity, so memory optimization matters most there.
- We will ultimately use a combination of constant, shared, and texture memory, but quickly evaluating the read-only cache impact is very useful.

Texture Sampling

- Backprojection includes linear interpolation, so we can leverage hardware texture sampling (V6).
- Texture sampling uses reduced-precision interpolation, but the data can be upsampled (O(N^2 log N)) prior to backprojection (O(N^3)) to increase accuracy.

          K20c GBP/s   C2050 GBP/s   SER
    V5    11.0         5.4           62.9 dB
    V6    14.7         7.5           59.0 dB
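One way to set this up is with a texture object configured for linear filtering, so the two loads and the weighted sum on line 8 of the pseudocode collapse into a single fetch. A sketch under assumed names, with range bins along x and pulses along y; the talk does not specify whether texture objects or texture references were used.

    // Hypothetical setup: bind the phase-history data (float2, L bins x
    // npulses, stored in a cudaArray) with hardware linear interpolation.
    cudaTextureObject_t make_phase_history_tex(cudaArray_t arr)
    {
        cudaResourceDesc res = {};
        res.resType = cudaResourceTypeArray;
        res.res.array.array = arr;

        cudaTextureDesc td = {};
        td.filterMode     = cudaFilterModeLinear;   // hardware interpolation
        td.addressMode[0] = cudaAddressModeClamp;
        td.addressMode[1] = cudaAddressModeClamp;
        td.readMode       = cudaReadModeElementType;

        cudaTextureObject_t tex = 0;
        cudaCreateTextureObject(&tex, &res, &td, NULL);
        return tex;
    }

    // In the kernel, the interpolation weight w is folded into the
    // fractional x coordinate:
    //   float2 s = tex2D<float2>(tex, rbin + 0.5f, pulse + 0.5f);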

Constant and Shared Memory

- V7: Constant memory. Platform positions (24 B/pulse) are placed in constant memory.
- V8: Shared memory. The incremental phase calculation LUT is staged in shared memory. The LUT can be large, so we first calculate the portion needed for the image chip being processed by a given block and load only the relevant entries.

          K20c GBP/s   C2050 GBP/s   SER
    V6    14.7         7.5           59.0 dB
    V7    18.9         8.2           59.0 dB
    V8    19.9         8.5           59.0 dB
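A sketch of both placements; the capacity limits and all names are illustrative assumptions, not values from the talk.

    // Hypothetical V7/V8-style declarations.
    #define MAX_PULSES    2048   // 2048 * 24 B fits in 64 KB constant memory
    #define CHIP_LUT_SIZE 256    // per-block portion of the phase LUT

    __constant__ double3 c_platpos[MAX_PULSES];   // 24 B per pulse (V7)

    __global__ void bp_v8(const float2* __restrict__ g_lut,
                          int lut_offset, int lut_count /* , ... */)
    {
        // Stage only the LUT entries this block's image chip needs (V8).
        __shared__ float2 s_lut[CHIP_LUT_SIZE];
        for (int i = threadIdx.x; i < lut_count; i += blockDim.x)
            s_lut[i] = g_lut[lut_offset + i];
        __syncthreads();

        // ... per-pixel pulse loop reading c_platpos[p] and s_lut[...] ...
    }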

Source-Level Optimizations

Workflow: inspect the PTX for missed opportunities, check the SASS to confirm issues, modify the code; lather, rinse, repeat.

Example: Newton-Raphson update.

    x1 = x0 - (x0 * x0 - alpha) * (0.5 / x0)

    // Outside of loop -- common subexpression elimination
    mul.f64  %fd5,  %fd3,  %fd3;   // [x0 * x0]
    // Inner loop
    sub.f64  %fd34, %fd5,  %fd33;  // [x0 * x0 - alpha]
    mul.f64  %fd35, %fd34, %fd4;   // [(x0 * x0 - alpha) * (0.5 / x0)]
    sub.f64  %fd36, %fd3,  %fd35;  // [x0 - (x0 * x0 - alpha) * (0.5 / x0)]

Missed opportunity: we are not using fused multiply-add (FMA) instructions for this calculation.

Source-Level Optimizations (FMA)

Rewrite a - b * c as either a + (-b) * c or a + b * (-c). Revised Newton-Raphson update:

    x1 = x0 + (x0 * x0 - alpha) * (-0.5 / x0)

    // Outside of loop -- common subexpression elimination
    mul.f64     %fd6,  %fd4,  %fd4;        // [x0 * x0]
    // Inner loop
    sub.f64     %fd33, %fd6,  %fd32;       // [x0 * x0 - alpha]
    fma.rn.f64  %fd34, %fd33, %fd5, %fd4;  // [x0 + (x0 * x0 - alpha) * (-0.5 / x0)]

We applied this rewrite to several cases. For example,

    (a - b_const) * c_const  ->  a * c_const + (-b_const * c_const)
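In CUDA source the two forms differ only in where the sign lives. A sketch of the FMA-friendly step, assuming alpha is the squared range, x0 an initial square-root estimate, and the negated reciprocal hoisted out of the pulse loop; the function name is hypothetical.

    // Hypothetical one-step Newton-Raphson sqrt refinement (V4/V5-style).
    // half_neg_rcp = -0.5 / x0 is computed once, outside the inner loop.
    __device__ __forceinline__ double nr_sqrt_step(double x0, double alpha,
                                                   double half_neg_rcp)
    {
        // x1 = x0 + (x0*x0 - alpha) * (-0.5/x0): the subtract now feeds an
        // fma.rn.f64 instead of a mul/sub pair.
        return fma(x0 * x0 - alpha, half_neg_rcp, x0);
    }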

Source-Level Optimizations (Type Conversions)

Examining the PTX also revealed some avoidable type conversions:

    for (int pulse = 0; pulse < N; ++pulse) {
        ...
        tex2D(..., pulse + 0.5f);  // pulse converted to float every iteration
        ...
    }

which can be eliminated:

    float pulse_f = 0.5f;
    for (int pulse = 0; pulse < N; ++pulse) {
        ...
        tex2D(..., pulse_f);
        pulse_f += 1.0f;
        ...
    }

          K20c GBP/s   C2050 GBP/s   SER
    V8    19.9         8.5           59.0 dB
    V9    21.9         9.4           59.0 dB

Multiple Pixels Per Thread

- We can amortize some redundant costs by processing multiple pixels per thread.
- Compact groups of pixels have locality benefits. (See the sketch after the table.)

    Group   K20c GBP/s   Reg   C2050 GBP/s   Reg   SER
    1x1     21.9         47    9.4           56    59.0 dB
    2x1     26.1         51    11.0          56    57.3 dB
    3x1     25.1         54    11.5          62    56.9 dB
    4x1     25.1         61    8.8           63    53.7 dB
    2x2     27.0         59    11.9          63    57.3 dB

The Reg columns indicate kernel register usage. SER_dB varies because the initial estimates for the Newton-Raphson square-root solves differ across groupings.
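A sketch of the 2x1 grouping under assumed names: the per-pulse platform read and any pulse-dependent setup are paid once and reused for both pixels. The accumulator update is a stand-in for the real interpolate-and-phase step.

    // Hypothetical 2x1 pixel group per thread.
    __global__ void bp_2x1(double2* img, const double3* platpos,
                           double x0, double y0, double dx, double dy,
                           int npulses, int nx, int ny)
    {
        int ix = 2 * (blockIdx.x * blockDim.x + threadIdx.x); // two pixels in x
        int iy = blockIdx.y * blockDim.y + threadIdx.y;
        if (ix + 1 >= nx || iy >= ny) return;

        double px[2] = { x0 + ix * dx, x0 + (ix + 1) * dx };
        double py    = y0 + iy * dy;
        double2 acc[2] = { {0.0, 0.0}, {0.0, 0.0} };

        for (int p = 0; p < npulses; ++p) {
            double3 q = platpos[p];          // read once per pulse, used twice
            #pragma unroll
            for (int k = 0; k < 2; ++k) {
                double ddx = px[k] - q.x, ddy = py - q.y, ddz = -q.z;
                double R = sqrt(ddx * ddx + ddy * ddy + ddz * ddz);
                acc[k].x += R;               // stand-in for the real update
            }
        }
        img[iy * nx + ix]     = acc[0];
        img[iy * nx + ix + 1] = acc[1];
    }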

Autotuning: Loop Unrolling

- Previous results did not use #pragma unroll to specify an unrolling factor (but the compiler still unrolls).
- We autotune by sweeping from #pragma unroll 1 to #pragma unroll 12.

[Plots: K20c and C2050 loop-unrolling results; GBP/s versus unroll factor (default, 1-12) for pixel groups 1x1, 2x1, 3x1, 4x1, 2x2, and 3x3.]

Autotuning slightly improves results on both Fermi and Kepler.
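One way to sweep the factor without hand-editing the kernel is to make it a template parameter, a sketch; the talk does not say how its sweep was implemented.

    // Hypothetical autotuning hook: instantiate one kernel per unroll
    // factor and benchmark each configuration.
    template <int UNROLL>
    __global__ void bp_unrolled(float* out, const float* in, int npulses)
    {
        float acc = 0.0f;
        #pragma unroll UNROLL
        for (int p = 0; p < npulses; ++p)
            acc += in[p];                  // stand-in for the per-pulse update
        out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
    }
    // Sweep: bp_unrolled<1><<<g,b>>>(...); ... bp_unrolled<12><<<g,b>>>(...);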

Autotuning: Max Register Count (Motivation)

- We can specify the maximum register count (-maxrregcount).
- Assume 128 threads per SM(X) (more autotuning!).

Kepler (64K registers/SMX, 255 max registers/thread):

    46 registers/thread -> 11 max blocks/SMX
    51 registers/thread -> 10 max blocks/SMX
    56 registers/thread ->  9 max blocks/SMX
    64 registers/thread ->  8 max blocks/SMX
    73 registers/thread ->  7 max blocks/SMX

Fermi (32K registers/SM, 63 max registers/thread):

    42 registers/thread -> 6 max blocks/SM
    51 registers/thread -> 5 max blocks/SM
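These block counts follow directly from the register file size. For example, for 46 registers/thread on Kepler, floor(65536 / (128 x 46)) = floor(11.13) = 11 blocks. A sketch of the calculation, ignoring the hardware's register-allocation granularity:

    // Occupancy arithmetic sketch: max resident blocks per SM as limited
    // by registers alone.
    int max_blocks_by_regs(int regfile_size,       // e.g., 65536 on Kepler SMX
                           int regs_per_thread,    // e.g., 46
                           int threads_per_block)  // e.g., 128
    {
        return regfile_size / (regs_per_thread * threads_per_block);
    }
    // max_blocks_by_regs(65536, 46, 128) == 11
    // max_blocks_by_regs(32768, 42, 128) == 6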

Autotuning: Max Register Count (Results)

K20c autotuning results:

    Max Reg   Used Reg   Max Perf (GBP/s)   Unroll factor   Group
    46        46         27.2               2               2x2
    51        51         27.6               1               2x2
    56        56         27.9               unspecified     2x2
    64        56         27.8               2               2x2
    73        58         27.1               2               2x2
    128       56         27.9               2               2x2

C2050 autotuning results:

    Max Reg   Used Reg   Max Perf (GBP/s)   Unroll factor   Group
    36        36         12.0               3               2x2
    42        42         11.6               1               2x2
    51        50         11.9               4               2x2
    63        61         12.0               3               2x2

In cases where multiple configurations achieved the highest performance, the minimum register utilization is reported.

These Go to 11

The maximum supported clock frequency exceeds the default application clock:

    shell> nvidia-smi -q
    ...
    Applications Clocks
        Graphics : 705 MHz
        Memory   : 2600 MHz
    Max Clocks
        Graphics : 758 MHz
        SM       : 758 MHz
        Memory   : 2600 MHz
    ...

    shell> nvidia-smi --applications-clocks=2600,758
    ...
    Applications Clocks
        Graphics : 758 MHz
        Memory   : 2600 MHz
    ...

Summary of Results

          K20c (705 MHz)   K20c (758 MHz)   C2050   SER
    V1    5.2              5.7              2.1     --
    V2    5.9              6.3              2.3     118.7 dB
    V3    9.2              10.0             3.8     112.1 dB
    V4    10.7             11.4             4.6     77.7 dB
    V5    11.0             11.8             5.4     62.9 dB
    V6    14.7             15.8             7.5     59.0 dB
    V7    18.9             20.0             8.2     59.0 dB
    V8    19.9             21.4             8.5     59.0 dB
    V9    21.9             23.4             9.4     59.0 dB
    V10   27.9             29.9             12.0    57.3 dB

Summary of results for all implementations in GBP/s. V10 corresponds to the best achieved results when autotuning pixels/thread, loop unrolling factors, and maximum registers/thread.

Optimization Effectiveness for Fermi and Kepler

[Plot: improvement factor (0.8-1.8) for each step V1->V2 through V9->V10, C2050 versus K20c.]

- V2: Replaced FP64 with FP32 -> Kepler wins (barely)
- V3, V4, V5: Reduced total arithmetic operations -> Fermi wins
- V6: Reduced arithmetic operations and exploited the texture cache (improved L1 cache hit rate on Fermi) -> Fermi wins
- V7, V8: Exploited constant and shared memory -> Kepler wins
- V9: Reduced instruction count -> Fermi wins
- V10: More work per thread (high register utilization) -> equal win

Comparison to Previously Published Results

- Context for the achieved performance: prior published results.
- Only showing optimized implementations on modern hardware.
- Caveat: run-time performance is not directly comparable without also considering the achieved accuracy; we are working on an apples-to-apples comparison.

    Hardware                  GBP/s   Peak GFLOP/s (FP32/FP64)   Reference
    Dual Intel Xeon E5-2670   7.4     664 / 332                  [1]
    Tesla C2050               11.7    1030 / 515                 this work
    Intel Xeon Phi (*)        14.0    1920 / 960                 [1]
    Tesla K20c                29.9    3783 / 1261                this work

(*) Evaluation card with 60 cores at 1.0 GHz (vs. 1.053 GHz for a 5110P).

[1] J. Park, P. Tang, M. Smelyanskiy, T. Benson, "Efficient backprojection-based synthetic aperture radar computation with many-core processors," Supercomputing 2012.

Summary and Conclusions

- Determining the required accuracy for floating-point algorithms is critical for effective optimization.
  - Metrics play a vital role, but they rarely exist a priori.
  - Evaluating correctness requires domain expertise.
  - Who determines the required accuracy: the algorithm designer or the HPC programmer?
- Optimization is an iterative process.
  - Over 5x performance improvement through many optimizations.
  - Improved previously published C2050 results by over 2x.
- Autotuning is your friend; optimal parameters are not obvious.

Acknowledgements

We would like to thank NVIDIA for supplying early access to K20 hardware in order to carry out this performance evaluation.

This work was supported in part by DARPA under contracts HR0011-10-C-0145 and HR0011-10-9-0008. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the U.S. Government.

Thank You

Thank you! Questions / Comments?