Interval arithmetic on graphics processing units


Interval arithmetic on graphics processing units
Sylvain Collange*, Jorge Flórez** and David Defour*
RNC'8, July 7-9, 2008
* ELIAUS, Université de Perpignan Via Domitia
** GILab, Universitat de Girona

How to draw implicit surfaces
Applications in computer graphics. Several approaches:
- Polygonal approximation
  - Computationally efficient (implementation on GPUs)
  - Inaccurate

How to draw implicit surfaces
Applications in computer graphics. Several approaches:
- Polygonal approximation
- Ray tracing, not guaranteed (91 seconds)
  - Search of function roots using heuristics
  - More accurate, but may still miss solutions

How to draw implicit surfaces
Applications in computer graphics. Several approaches:
- Polygonal approximation
- Ray tracing, not guaranteed (91 seconds)
- Search of function roots using an interval Newton method
  - Interval operations required: +, -, ×, /, √, x^i, e^x
  - Ray tracing, guaranteed (137 seconds)

Outline
- Interval arithmetic
- Graphics Processing Units (GPUs)
- Implementing directed rounding
- Further optimizations

Interval arithmetic
- Redefinition of basic operators +, -, ×, /, ... on intervals, e.g. [1, 5] + [-2, 3] = [-1, 8]
- Inclusion property: for intervals X, Y and an operator ◊, Z = X ◊ Y means that for all x ∈ X and y ∈ Y, z = x ◊ y satisfies z ∈ Z
  - Need to take rounding errors into account: may return a wider interval (see the sketch below)
- Loss of variable dependency: evaluating X(1 - X) yields a wider interval than the equivalent expression 1/4 - (X - 1/2)^2
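To make the inclusion property concrete, here is a minimal host-side C++ sketch (ours, not the library presented in this talk) of an interval addition that widens each bound outward by one ulp with std::nextafterf. Correctly rounded directed operations would give tighter bounds; what matters for the inclusion property is that the enclosure can only grow:

    #include <cmath>
    #include <cstdio>

    struct interval { float lo, hi; };

    // Widen each bound outward so that the rounding error of the
    // underlying float addition cannot violate the inclusion property.
    interval add(interval x, interval y) {
        interval r;
        r.lo = std::nextafterf(x.lo + y.lo, -INFINITY);  // round down
        r.hi = std::nextafterf(x.hi + y.hi, +INFINITY);  // round up
        return r;
    }

    int main() {
        interval a = {1.0f, 5.0f}, b = {-2.0f, 3.0f};
        interval c = add(a, b);   // encloses [-1, 8]
        std::printf("[%g, %g]\n", c.lo, c.hi);
        return 0;
    }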

Outline
- Interval arithmetic
- Graphics Processing Units (GPUs)
- Implementing directed rounding
- Further optimizations
- Concluding remarks

What is a GPU?
[Block diagram of the NVIDIA GeForce 8800: eight clusters of multiprocessors, each with computational units, registers, shared memory, constant memory and a memory access unit, connected through memory controllers to the graphics memory.]

Single-precision FP arithmetic on GPUs
NVIDIA GeForce 8000/9000/200 families, AMD Radeon HD 2000:
- Multiplication and addition correctly rounded to nearest
- With CUDA, can round toward zero
- No directed rounding (toward +/- infinity)
- Division and square root accurate to 2 units in the last place (ulp)
- Multiply-add, with the multiply rounded toward zero, on NVIDIA
AMD Radeon HD 3000, HD 4000:
- Division correctly rounded to nearest

SIMD execution
- The same instruction is run on multiple threads
- When a branch instruction occurs:
  - If all threads follow the same path, do a real branch
  - Otherwise, execute both paths and use predication
- Diverging branches are expensive (see the sketch below)
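A small CUDA sketch (our illustration, not from the talk) of a data-dependent branch: when the 32 threads of a warp disagree on the condition, the hardware executes both paths under predication, so the cost is roughly the sum of both paths:

    // Hypothetical kernel: the expf and sqrtf paths both execute
    // whenever a warp's threads disagree on the sign of x[i].
    __global__ void eval(float *out, const float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (x[i] < 0.0f)
            out[i] = expf(x[i]);   // path A
        else
            out[i] = sqrtf(x[i]);  // path B
    }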

How to program GPUs
- NVIDIA CUDA
  - For GeForce 8000/9000/200 GPUs only
  - Subset of the C++ language; templates not officially supported as of CUDA 2.0, but their use is encouraged [Harris07]: in practice, most C++ features are supported
- AMD CAL/Brook+
- Graphics API + shader language
  - Limited languages and features
  - Remains the only portable way for low-level arithmetic

Outline
- Interval arithmetic
- GPU architectural issues
- Implementing directed rounding
- Further optimizations

We developed two libraries
- CUDA version
  - In C++
  - Adapted from the Boost Interval library
  - Provides various levels of consistency checking (NaNs, Infs...) and precision
- Cg version
  - Needs GPU-specific routines
We focus on the CUDA version in this talk.

Implementing directed rounding
- We want round-down and round-up
- CUDA provides round toward zero
  - Equivalent to either round-down or round-up, depending on the sign
[Diagram: on the number line, rounding toward zero coincides with rounding up for negative numbers and rounding down for positive numbers.]

Implementing directed rounding
For the other direction:
- Add one ulp to the magnitude of the round-toward-zero value
  - An integer addition on the FP representation does this, except in case of underflow/overflow
- Or multiply by (1 + EPS)
  - No special cases for overflows/NaNs
  - EPS = 2^-23 for single precision
[Diagram: the corrective term moves the round-toward-zero result one ulp away from zero, turning it into round-up for positive numbers and round-down for negative numbers. Both tricks are sketched below.]
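A minimal CUDA sketch of both tricks (our illustration, not the library's exact code; __fmul_rz, __float_as_int and __int_as_float are actual CUDA intrinsics, the helper names are ours, and the underflow/overflow cases mentioned above are left unhandled):

    #define EPS 1.1920929e-7f   // 2^-23 for single precision

    // Move |x| one ulp away from zero. Works for both signs because
    // IEEE-754 floats are sign-magnitude: incrementing the raw bit
    // pattern always increases the magnitude (breaks at 0/inf/NaN).
    __device__ float next_away(float x) {
        return __int_as_float(__float_as_int(x) + 1);
    }

    // Round-up multiplication: truncation toward zero already gives the
    // upper bound for negative results; fix positive ones up by one ulp.
    __device__ float mul_up(float a, float b) {
        float p = __fmul_rz(a, b);
        return (p > 0.0f) ? next_away(p) : p;
    }

    // Round-down multiplication, symmetrically. The alternative
    // (1 + EPS) variant, p * (1.0f + EPS), also moves p away from zero
    // and needs no special cases for overflows/NaNs.
    __device__ float mul_down(float a, float b) {
        float p = __fmul_rz(a, b);
        return (p < 0.0f) ? next_away(p) : p;
    }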

Porting the Boost Interval library
CUDA on an NVIDIA GeForce 8500 GT; measured throughput in cycles/warp (1 warp = 32 threads)

    Operation           Cycles
    Scalar add          4
    Interval add        24
    Interval multiply   33 to 55 (*)
    Interval square     20 to 25 (*)
    Interval x^5        141 to 148 (*)

(*) Without divergent branching

Outline
- Interval arithmetic
- GPU architectural issues
- Implementing directed rounding
- Further optimizations

Multiplication: existing algorithm

    if (interval_lib::user::is_neg(xl))
        if (interval_lib::user::is_pos(xu))
            if (interval_lib::user::is_neg(yl))
                if (interval_lib::user::is_pos(yu))       // M * M
                    return I(::min(rnd.mul_down_neg(xl, yu), rnd.mul_down_neg(xu, yl)),
                             ::max(rnd.mul_up_pos(xl, yl), rnd.mul_up_pos(xu, yu)), true);
                else                                      // M * N
                    return I(rnd.mul_down_neg(xu, yl), rnd.mul_up_pos(xl, yl), true);
            else if (interval_lib::user::is_pos(yu))      // M * P
                return I(rnd.mul_down_neg(xl, yu), rnd.mul_up_pos(xu, yu), true);
            else                                          // M * Z
                return I(static_cast<T>(0), static_cast<T>(0), true);
        else if (interval_lib::user::is_neg(yl))
            if (interval_lib::user::is_pos(yu))           // N * M
                return I(rnd.mul_down_neg(xl, yu), rnd.mul_up_pos(xl, yl), true);
            else                                          // N * N
                return I(rnd.mul_down_pos(xu, yu), rnd.mul_up_pos(xl, yl), true);
        else if (interval_lib::user::is_pos(yu))          // N * P
            return I(rnd.mul_down_neg(xl, yu), rnd.mul_up_neg(xu, yl), true);
        else                                              // N * Z
            return I(static_cast<T>(0), static_cast<T>(0), true);
    else if (interval_lib::user::is_pos(xu))
        if (interval_lib::user::is_neg(yl))
            if (interval_lib::user::is_pos(yu))           // P * M
                return I(rnd.mul_down_neg(xu, yl), rnd.mul_up_pos(xu, yu), true);
            else                                          // P * N
                return I(rnd.mul_down_neg(xu, yl), rnd.mul_up_neg(xl, yu), true);
        else if (interval_lib::user::is_pos(yu))          // P * P
            return I(rnd.mul_down_pos(xl, yl), rnd.mul_up_pos(xu, yu), true);
        else                                              // P * Z
            return I(static_cast<T>(0), static_cast<T>(0), true);
    else                                                  // Z * ?
        return I(static_cast<T>(0), static_cast<T>(0), true);

Rewriting multiplication
- We do not want branches
- Starting from the general formula:
  [a, b] × [c, d] = [min(ac, ad, bc, bd), max(ac, ad, bc, bd)]
- We need to round all subproducts both up and down
  - With t the product rounded toward zero, down = min(t, t(1+EPS)) and up = max(t, t(1+EPS))
- Cost: 4 × (2 mul + 1 min + 1 max) + 3 min + 3 max = 22 flops
- Or do we?

What is the sign of [a, b] × [c, d]?
- ac and bd are always positive when used
- ad and bc are always negative when used
- We know the rounding direction in advance: no need to select it dynamically
- Now 12 flops (one possible realization is sketched below)
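One way to reach the quoted 12 flops, given as our reconstruction rather than the authors' exact code: since ad and bc need the same downward correction whenever they matter for the lower bound, a single corrective multiplication applied to their minimum suffices, and symmetrically for the maximum. A CUDA sketch, ignoring zeros, infinities, NaNs and underflow:

    #define K 1.00000011920928955078125f   // 1 + 2^-23, exact in float

    struct interval { float lo, hi; };

    __device__ interval mul(interval x, interval y) {
        // 4 multiplications rounded toward zero
        float ac = __fmul_rz(x.lo, y.lo), ad = __fmul_rz(x.lo, y.hi);
        float bc = __fmul_rz(x.hi, y.lo), bd = __fmul_rz(x.hi, y.hi);
        interval r;
        // ad and bc are negative whenever they realize the lower bound,
        // so truncation rounded them up: one multiply by K pushes their
        // minimum back down. ac and bd are positive when they realize
        // it, so truncation already rounded them down.
        r.lo = fminf(fminf(ad, bc) * K, fminf(ac, bd));
        // Symmetrically, ac and bd are positive when they realize the
        // upper bound and need the upward correction.
        r.hi = fmaxf(fmaxf(ac, bd) * K, fmaxf(ad, bc));
        return r;   // 6 mul + 3 min + 3 max = 12 flops
    }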

Power to an integer: existing algorithm

    T pow_dn(const T& x_, int pwr, Rounding& rnd) // x and pwr are positive
    {
        T x = x_;
        T y = (pwr & 1) ? x_ : 1;
        pwr >>= 1;
        while (pwr > 0) {
            x = rnd.mul_down(x, x);
            if (pwr & 1)
                y = rnd.mul_down(x, y);
            pwr >>= 1;
        }
        return y;
    }

    T pow_up(const T& x_, int pwr, Rounding& rnd) // x and pwr are positive
    {
        [idem using mul_up]
    }

    interval<T, Policies> pow(const interval<T, Policies>& x, int pwr)
    {
        [...]
        if (interval_lib::user::is_neg(x.upper())) {          // [-2,-1]
            T yl = pow_dn(-x.upper(), pwr, rnd);
            T yu = pow_up(-x.lower(), pwr, rnd);
            if (pwr & 1)                                      // [-2,-1]^1
                return I(-yu, -yl, true);
            else                                              // [-2,-1]^2
                return I(yl, yu, true);
        } else if (interval_lib::user::is_neg(x.lower())) {   // [-1,1]
            if (pwr & 1) {                                    // [-1,1]^1
                return I(-pow_up(-x.lower(), pwr, rnd),
                         pow_up(x.upper(), pwr, rnd), true);
            } else {                                          // [-1,1]^2
                return I(0, pow_up(::max(-x.lower(), x.upper()), pwr, rnd), true);
            }
        } else {                                              // [1,2]
            return I(pow_dn(x.lower(), pwr, rnd),
                     pow_up(x.upper(), pwr, rnd), true);
        }
    }

Improvements
- Exponent is constant: we can unroll the loop
  - CUDA 1.0 cannot do loop unrolling
  - CUDA 1.1 can, but without constant propagation and dead code removal
  - CUDA 2.0 beta ignores the directive
  - We use template metaprogramming instead (see the sketch below)
- We can compute pow_up from pow_dn
  - Add a bound on the error: multiply by (1 + n·EPS)
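A minimal sketch of the template-metaprogramming idea (our illustration; the helper names are hypothetical, and pow_dn assumes x > 0 as in the slide above, so truncation toward zero rounds down):

    #define EPS 1.1920929e-7f   // 2^-23

    // Square-and-multiply with a compile-time exponent: the recursion
    // is resolved at compile time, yielding fully unrolled
    // straight-line code with no loop and no dead branches.
    template<int PWR>
    __device__ float pow_dn(float x) {
        float h = pow_dn<PWR / 2>(__fmul_rz(x, x));
        return (PWR & 1) ? __fmul_rz(x, h) : h;
    }
    template<> __device__ float pow_dn<1>(float x) { return x; }
    template<> __device__ float pow_dn<0>(float)  { return 1.0f; }

    // Compile-time count of the multiplications performed above.
    template<int PWR> struct nmuls {
        static const int value = 1 + (PWR & 1) + nmuls<PWR / 2>::value;
    };
    template<> struct nmuls<1> { static const int value = 0; };
    template<> struct nmuls<0> { static const int value = 0; };

    // pow_up derived from pow_dn: n down-rounded products lose at most
    // about n*EPS relatively, bounded by one corrective multiplication
    // (one extra EPS to absorb the rounding of that multiply).
    template<int PWR>
    __device__ float pow_up(float x) {
        return pow_dn<PWR>(x) * (1.0f + (nmuls<PWR>::value + 1) * EPS);
    }

For instance, pow_dn<5>(x) compiles down to three truncated multiplications: x², x⁴ and x⁴·x.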

Performance results
CUDA on an NVIDIA GeForce 8500 GT; measured throughput in cycles/warp

    Operation        Original        Optimized
    Addition         24              24
    Multiplication   33 to 55 (*)    36
    x^5              141 to 148 (*)  23

(*) Without divergent branching

Multiplication now runs in constant time; x^5 is as fast as an addition.

Application to ray-tracing
Xeon 3 GHz vs GeForce 8800 GTX

    Surface          CPU (s)   GPU (s)
    Sphere           300       2
    Kusner-Schmitt   720       2
    Tangle           900       3
    Gumdrop Torus    1080      3

Future directions
- The NVIDIA GeForce GTX 280 now supports double precision
  - In all 4 IEEE rounding modes
  - With no switching overhead
  - More efficient for interval arithmetic than current CPUs!
  - Still not available for single precision
- Could use double precision for the last convergence steps of the interval Newton method

Thank you for your attention

Earlier GPUs
Non-compliant rounding for basic operations:
- NVIDIA GeForce 7, multiplication: faithful rounding
- AMD ATI Radeon X1x00, multiplication: not toward zero, not faithful

Memory access issues
- Memory access units are shared among a SIMD block
- Fast memory accesses must obey specific patterns (illustrated below):
  - Broadcast: all threads access the same address
  - Coalescing: all threads access consecutive addresses
- In CUDA:
  - On-chip shared memory, accessible in both modes
  - On-chip constant memory, in broadcast mode
  - Off-chip global memory, with coalescing capabilities
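A minimal CUDA sketch (ours, not from the talk) contrasting the two global-memory patterns: when thread i of a warp reads in[i], the 32 loads coalesce into a few memory transactions, while a strided read scatters them into separate transactions:

    // Consecutive addresses across the warp: coalesced access.
    __global__ void copy_coalesced(float *out, const float *in) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];
    }

    // A stride greater than 1 breaks coalescing: each thread's load
    // falls in a different memory segment.
    __global__ void copy_strided(float *out, const float *in, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i * stride];
    }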