LDetector: A low overhead data race detector for GPU programs

LDetector: A Low Overhead Data Race Detector for GPU Programs
Pengcheng Li, Chen Ding, Xiaoyu Hu, Tolga Soyata
University of Rochester

Introduction & Contribution

Data races in GPU programs:
- Impact correctness, as bugs
- Impact determinism
- Both inter-warp and intra-warp data races

Contribution:
- A two-pass approach to detecting data races in GPU programs, intended as a debugging tool
- Simple, yet effective and low overhead
- Zero per-memory-access monitoring
- Further optimizations to reduce the two-pass cost

A data race example

__global__ void Jacobi(int *data) {
    extern __shared__ int A[];   // size 64
    int tid = threadIdx.x;
    ...
    if (tid < BLOCK_SIZE-1 && tid > 0)
        A[tid] = (A[tid-1] + A[tid] + A[tid+1]) / 3;
    __syncthreads();
}

Array size: 64, BLOCK_SIZE: 64, WARP_SIZE: 32, # of warps: 2, # of threads: 64
(Figure: read set and write set over array elements 0..63; values flow across the warp boundary around elements 31 and 32.)

A data race example (continued)

Application scenario:
- Inter-warp data races
- Shared-array (GPU shared memory) oriented
- No atomics in the code regions

(Same Jacobi kernel and configuration as above: array size 64, BLOCK_SIZE 64, WARP_SIZE 32, 2 warps, 64 threads.)

Shared-array-oriented data race detection

Synchronized code block (API: __syncthreads()):
- A code region between two adjacent synchronization statements is called a synchronized block (see the kernel sketch below).
- Inter-warp data races on a shared array can happen only within the same synchronized block.

A two-pass approach:
- 1st pass: write-write data race detection
- 2nd pass: read-write data race detection
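To make the synchronized-block notion concrete, here is a minimal, self-contained CUDA sketch (not from the paper; the kernel name and sizes are made up) in which the racy Jacobi update forms its own synchronized block delimited by __syncthreads():

#include <cstdio>
#include <cuda_runtime.h>

#define N 64   // as in the example: 64 elements, 2 warps of 32 threads

__global__ void two_blocks(int *out) {
    __shared__ int A[N];
    int tid = threadIdx.x;

    A[tid] = tid;                                      // synchronized block 0: initialization
    __syncthreads();                                   // block boundary

    if (tid > 0 && tid < N - 1)                        // synchronized block 1: the racy update;
        A[tid] = (A[tid-1] + A[tid] + A[tid+1]) / 3;   // threads 31 and 32 are in different warps,
    __syncthreads();                                   // so their reads and writes may interleave

    out[tid] = A[tid];
}

int main() {
    int *d_out, h_out[N];
    cudaMalloc(&d_out, N * sizeof(int));
    two_blocks<<<1, N>>>(d_out);
    cudaMemcpy(h_out, d_out, N * sizeof(int), cudaMemcpyDeviceToHost);
    printf("A[31]=%d A[32]=%d\n", h_out[31], h_out[32]);   // values depend on warp ordering
    cudaFree(d_out);
    return 0;
}

Each synchronized block is checked independently; in this sketch only the second block can race across warps, since block 0 writes only A[tid].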

First pass: write-write race detection

(Figure: timeline of initialization, first pass, and finalization. The shared array's original copy is preserved, and each warp (warp 0, warp 1) executes the block on its own private copy. Diffing a private copy against the original copy yields the elements that warp wrote. Case 1: the written regions overlap (a conflict part), recorded in a temporary map; a w-w race is reported and execution aborts. Case 2: no overlap; the writes are merged into a union copy. A host-side sketch of this check follows below.)
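A minimal host-side sketch of the first-pass check (a hypothetical helper, not LDetector's actual runtime): after each warp has executed the block on its own private copy, diffing that copy against the original copy yields the warp's write set, and any element written by more than one warp is a write-write race.

#include <cstddef>
#include <vector>

// priv holds one private copy per warp; orig is the array's state before the block.
bool write_write_race(const std::vector<int> &orig,
                      const std::vector<std::vector<int>> &priv) {
    std::vector<int> writers(orig.size(), 0);      // temporary map of writer counts
    for (const auto &copy : priv)
        for (std::size_t i = 0; i < orig.size(); ++i)
            if (copy[i] != orig[i])                // value-based diff: this element was written
                ++writers[i];
    for (int w : writers)
        if (w > 1) return true;                    // overlapping write sets: w-w race
    return false;
}

Note that a purely value-based diff, as sketched here, cannot observe a write that stores the value already present.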

Second pass: read-write race detection

(Figure: case 2 continues; timeline of initialization, second pass, and finalization. The block is re-executed on the combined state, and the shared result is compared with the private results of the first execution: an r-w race is reported if there is any difference; otherwise, no race. A sketch of the combine-and-compare step follows below.)
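Continuing the same hypothetical host-side sketch (the paper's exact protocol may differ), the second pass first combines (unions) the warps' pass-1 writes into the shared array, re-executes the block on that state, and compares the two runs. Given that pass 1 found no write-write race, any difference must stem from a read-write race.

#include <cstddef>
#include <vector>

// Union of all warps' writes into the shared array, which starts as a copy of orig
// (no overlapping writes, guaranteed by pass 1).
void combine(std::vector<int> &shared, const std::vector<int> &orig,
             const std::vector<std::vector<int>> &priv) {
    for (const auto &copy : priv)
        for (std::size_t i = 0; i < shared.size(); ++i)
            if (copy[i] != orig[i]) shared[i] = copy[i];
}

// Element-wise comparison of the two executions' results.
bool read_write_race(const std::vector<int> &first_run,
                     const std::vector<int> &second_run) {
    for (std::size_t i = 0; i < first_run.size(); ++i)
        if (first_run[i] != second_run[i]) return true;   // any difference implies an r-w race
    return false;
}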

Correctness

Correctness of write-write race detection:
- Value-based check: overlapping written regions mean write-write conflicts.

Correctness of read-write race detection:
- Holds on the condition that no write-write races exist.
- If the outputs of the two executions differ, the difference must come from read-write races.

Convergence

An efficient, lightweight runtime and compiler support

Four operations: diff, union, combine, compare
- Massive parallelism gives high efficiency for these bulk operations.

Take diff as an example (original copy vs. private copy):
- Byte-wise operations
- Two-level parallelism: warp-level parallelism (32 warps) and intra-warp thread-level parallelism (32 threads)
- Scales well (a kernel sketch follows below)

Compiler support:
- Data duplication
- Data structure promotion
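A minimal CUDA sketch of such a byte-wise diff using the two levels of parallelism described above: warps split the buffer into chunks, and the 32 lanes of each warp sweep their chunk with a stride of 32. Kernel and buffer names are illustrative, not LDetector's actual runtime.

// Assumes blockDim.x is a multiple of 32.
__global__ void diff_kernel(const unsigned char *orig,
                            const unsigned char *priv,
                            unsigned char *write_mask,
                            int nbytes) {
    int warp_id = threadIdx.x / 32;                 // warp-level parallelism
    int lane    = threadIdx.x % 32;                 // intra-warp thread-level parallelism
    int nwarps  = blockDim.x / 32;
    int chunk   = (nbytes + nwarps - 1) / nwarps;
    int begin   = warp_id * chunk;
    int end     = min(begin + chunk, nbytes);

    for (int i = begin + lane; i < end; i += 32)    // byte-wise comparison
        write_mask[i] = (orig[i] != priv[i]) ? 1 : 0;
}

The resulting write mask is what the union, combine, and compare steps would then operate on.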

Using speculation to reduce two-pass overhead

- After the first pass, the main kernel launches a child kernel that diffs the writes and checks for w-w races.
- The main kernel speculatively continues execution, so the two executions overlap.
- The second pass ends with an output comparison that checks for r-w races.

(Figure: timeline of the main kernel, the child-kernel launch, the diff of the writes, and the output comparison. A structural sketch follows below.)
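A structural sketch of the speculation idea using CUDA dynamic parallelism, which the Kepler K20c used in the evaluation supports (compute capability 3.5, compiled with -rdc=true). All names are illustrative, not the paper's implementation.

// Child kernel: diff two warps' private copies against the original and flag any
// element written by both warps as a write-write race.
__global__ void diff_child(const int *orig, const int *priv0, const int *priv1,
                           int n, int *race_flag) {
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        if (priv0[i] != orig[i] && priv1[i] != orig[i])
            *race_flag = 1;
}

__global__ void main_kernel(const int *orig, int *priv0, int *priv1,
                            int n, int *race_flag) {
    // ... the first pass of one synchronized block has filled priv0 and priv1 ...
    __syncthreads();

    if (threadIdx.x == 0 && blockIdx.x == 0)
        diff_child<<<1, 256>>>(orig, priv0, priv1, n, race_flag);   // runs concurrently

    // The main kernel speculatively continues with the next synchronized block
    // instead of waiting; if *race_flag is later found set, the speculative work
    // is discarded and the block is re-run.
    // ...
}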

Evaluation: platform settings

- Host machine: Xeon E5-2643 at 3.3 GHz, 16 GB memory
- Device: one NVIDIA Tesla K20c GPU (Kepler GK110 architecture, 2496 cores, 208 GB/s memory bandwidth, 5 GB memory), supporting dynamic parallelism

Two data-mining benchmarks: EM and co-clustering
Compared with GRace [Zheng et al., PPoPP '11]

Performance overhead

                  EM                    Co-clustering
native            60.27 (1x)            28.91 (1x)
LDetector         250.15 (4.15x)        99.46 (3.44x)
LDetector-spec    177.42 (2.94x)        107.04 (3.70x)
GRace-stmt        104445.9 (1733x)      1643781.88 (56858x)
GRace-addr        18866.83 (313x)       7236.8 (250x)

Measured time in milliseconds (slowdown in parentheses).

Memory overhead

                  EM               Co-clustering
LDetector         16K, 2.6M        16K, 2M
LDetector-spec    0K, 2.7M         0K, 2.2M
GRace-stmt        1.1K, 1000M*     1.1K, 1000M*
GRace-addr        1.1K, 18M        1.1K, 9M

Shared memory usage, global memory usage.
* GRace-stmt instruments only as many memory accesses as fit before its memory overhead reaches 1 GB.

Conclusion

- A two-pass approach to detecting data races in shared memory
- Zero per-memory-access monitoring
- A highly efficient, lightweight runtime
- Speculative parallelization as an optimization

Future work

- Support atomics and locks
- Solutions for data races in global memory
- Address the memory consumption explosion

Thanks! Q & A


Data structure expansion
Data privatization techniques (a sketch follows below)
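A minimal sketch of what such privatization could look like for the Jacobi example (illustrative only, not the actual compiler output): the shared array is expanded so that each warp updates its own full private copy, which the runtime later diffs and merges. Here the copies live in shared memory for brevity; they could equally be placed in global memory.

#define N 64
#define WARPS_PER_BLOCK 2

__global__ void jacobi_privatized(const int *in, int *out) {
    __shared__ int A_priv[WARPS_PER_BLOCK][N];   // one full private copy per warp
    int tid  = threadIdx.x;
    int w    = tid / 32;                         // this thread's warp
    int lane = tid % 32;

    for (int i = lane; i < N; i += 32)           // each warp initializes its own copy
        A_priv[w][i] = in[i];
    __syncthreads();

    if (tid > 0 && tid < N - 1)                  // the original update, now isolated per warp
        A_priv[w][tid] = (A_priv[w][tid-1] + A_priv[w][tid] + A_priv[w][tid+1]) / 3;
    __syncthreads();

    out[tid] = A_priv[w][tid];                   // the runtime then diffs and merges the copies
}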

Target: our contribution

- Shared-array oriented
- No atomics in the code regions
- Inter-warp data races

Using speculation to reduce two-pass overhead

(Figure: the original kernel versus the compiled kernel. For each to-be-detected synchronized block, the compiler inserts operations into the main kernel: the union of the detected candidate array, environment setup for the child kernel, and the child-kernel launch. The child kernel diffs the writes and performs the race detection; if a w-w race is found, the main kernel is killed and the block is re-run. The second run ends with a comparison between the two runs; if an r-w race is found, the main kernel is killed.)

A two-pass approach

(Figure: combined timeline of both passes. Initialization, first execution on per-warp private copies, diff against the original copy, and the union copy; overlapping writes are recorded in a temporary map and reported as a write-write race. Then initialization, second execution, and finalization with the comparison: a read-write race is reported if there is any difference; otherwise, no race.)