Sequential Consistency for Heterogeneous-Race-Free

Sequential Consistency for Heterogeneous-Race-Free DEREK R. HOWER, BRADFORD M. BECKMANN, BENEDICT R. GASTER, BLAKE A. HECHTMAN, MARK D. HILL, STEVEN K. REINHARDT, DAVID A. WOOD JUNE 12, 2013

EXECUTIVE SUMMARY Existing GPU memory models are ambiguous and dense for programmers. CPUs have Sequential Consistency for Data-Race-Free (SC for DRF): relaxed HW, precise semantics, programmer-friendly. Problem: GPUs use scoped synchronization. This talk: Sequential Consistency for Heterogeneous-Race-Free (SC for HRF) = SC for DRF + scopes. HRF0, basic scope synchronization: two threads that communicate must use identical scope synchronization. It works well with existing, regular GPU codes. Beyond HRF0? There are limits.

OUTLINE Background and setup; HRF0: basic scope synchronization; future directions.

BACKGROUND AND SETUP CPU programmers use a memory model to understand memory behavior. Sequential Consistency (SC) [1979]: threads interleave as on a multitasking uniprocessor. The HW/compiler actually implements TSO [1991] or a more relaxed model. Java TM [2005] and C++ [2008] ensure SC for data-race-free (DRF) programs. Programmers need a GPU memory model for abstraction and portability, but current GPU models expose ad hoc HW mechanisms. SC for DRF is a start, BUT consider the OpenCL TM execution model.

SEQUENTIAL CONSISTENCY FOR DATA-RACE-FREE Two memory accesses participate in a data race if they access the same location, at least one access is a store, and they can occur simultaneously, i.e. appear as adjacent operations in the interleaving. A program is data-race-free if no possible execution results in a data race. The contract: sequential consistency for data-race-free programs; avoid everything else. For GPUs: not good enough!
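The SC-for-DRF contract above is the same one C++11 atomics expose to CPU programmers. As a minimal sketch (ordinary host C++, not GPU code; the names are mine, not from the talk), here is a data-race-free publish/consume pattern whose outcome the model guarantees:

```cpp
#include <atomic>
#include <thread>

// Illustrative only: a data-race-free C++11 program. The release store and
// acquire load on `flag` synchronize the two threads, so the ordinary
// accesses to `data` never race, and SC for DRF lets the programmer reason
// about this program as if the machine were sequentially consistent.
static int data = 0;
static std::atomic<bool> flag{false};

int run_message_passing() {
    std::thread producer([] {
        data = 42;                                    // ordinary store
        flag.store(true, std::memory_order_release);  // publish
    });
    int seen = 0;
    std::thread consumer([&seen] {
        while (!flag.load(std::memory_order_acquire)) {}  // wait for publish
        seen = data;  // race-free: ordered after the store to data
    });
    producer.join();
    consumer.join();
    return seen;  // guaranteed 42 under SC for DRF
}
```

On a relaxed CPU, dropping the release/acquire pair would reintroduce a race and leave `seen` undefined; the slide's point is that scoped GPU synchronization complicates exactly this guarantee.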

DATA-RACE-FREE IS NOT ENOUGH Two ordinary memory accesses participate in a data race if they access the same location, at least one is a store, and they can occur simultaneously. Consider (t1, t2 in Workgroup #1-2; t3, t4 in Workgroup #3-4):
t1: ST X = 1; GroupLock.Release()
t2: GroupLock.Acquire(); ... GlobalLock.Release()
t3: GlobalLock.Acquire(); LD X (??)
This is not a data race. But is it SC?

WHAT WILL HAPPEN? CUDA TM threadfence{_block} definition: waits until all memory accesses made by the calling thread prior to [{Group, Global}Lock.Release()] are visible to all threads in the {group, device}.
t1: ST X = 1; GroupLock.Release()
t2: GroupLock.Acquire(); ... GlobalLock.Release()
t3: GlobalLock.Acquire(); LD X (??)

WHAT WILL HAPPEN? A heterogeneous race!
t1: ST X = 1; GroupLock.Release() (makes X = 1 visible to t2)
t2: GroupLock.Acquire(); ... GlobalLock.Release() (makes ... visible to t3)
t3: GlobalLock.Acquire(); LD X (??)

SEQUENTIAL CONSISTENCY FOR HETEROGENEOUS-RACE-FREE Two memory accesses participate in a heterogeneous race if they access the same location, at least one access is a store, they can occur simultaneously, i.e. appear as adjacent operations in the interleaving, and they are not synchronized with enough scope. A program is heterogeneous-race-free if no possible execution results in a heterogeneous race. The contract: sequential consistency for heterogeneous-race-free programs; avoid everything else.

A FIRST CUT. HRF0: basic scope synchronization. "Enough scope" = both threads synchronize using the identical scope. Recall the example:
t1: ST X = 1; GroupLock.Release()
t2: GroupLock.Acquire(); ... GlobalLock.Release()
t3: GlobalLock.Acquire(); LD X (??)
It contains a heterogeneous race in HRF0. HRF0 conclusion: this is bad. Don't do it.
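The "enough scope" rule can be made mechanical. Below is a toy checker, simplifying HRF0 to the rule that every synchronization edge on the chain between two conflicting accesses must use one and the same scope; the scope names and structure are invented for illustration, not any real API:

```cpp
#include <vector>

// Toy model of HRF0's "identical scope" requirement (illustrative only).
enum class Scope { Wavefront, Workgroup, Device };

// Under this simplified HRF0, two conflicting accesses are ordered only if
// the whole synchronization chain between them uses a single scope; a chain
// that mixes scopes does not count, and the pair is a heterogeneous race.
bool hrf0_ordered(const std::vector<Scope>& chain) {
    if (chain.empty()) return false;   // no synchronization at all: race
    for (Scope s : chain)
        if (s != chain.front()) return false;  // mixed scopes: race
    return true;
}
```

Applied to the slide's example, the chain from t1's store to t3's load goes through a Workgroup-scope edge (t1 to t2) and then a Device-scope edge (t2 to t3); since the chain mixes scopes, `hrf0_ordered` returns false and the program is a heterogeneous race, exactly the HRF0 verdict above.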

IS HRF0 USEFUL? Want: for performance, use the smallest scope possible. What is safe in HRF0? The HRF0 scope-selection guideline: use the smallest scope that includes all producers/consumers of the shared data. Implication: producers/consumers must be known at synchronization time. Is that a valid assumption?

REGULAR GPGPU WORKLOADS Define the problem space, partition it hierarchically, communicate locally (N times), communicate globally (M times). Well-defined (regular) data partitioning + well-defined (regular) synchronization pattern = producers/consumers are always known. Generally: HRF0 works well with existing regular GPGPU workloads.

IRREGULAR GPGPU WORKLOADS In HRF0, the earlier example is a race: the Group-scope synchronization must be upgraded to Global. Yet on current hardware, LD X will in fact see the value (1)!
t1: ST X = 1; GroupLock.Release()
t2: GroupLock.Acquire(); ... GlobalLock.Release()
t3: GlobalLock.Acquire(); LD X (??)

HRFX: THE NEXT ITERATION HRF0 is overly conservative on existing HW and makes fast irregular parallelism hard. Other HRF definitions are possible, e.g. defining the behavior when different scopes interact. What are the gotchas? (There will be many...) The answer: ???

CONCLUSIONS & FUTURE DIRECTIONS GPUs currently expose ad hoc scoped synchronization. From the CPU world: mask low-level details with SC for DRF. GPU scope synchronization is incompatible with SC for DRF, so GPUs need SC for HRF. HRF0, basic scope synchronization: (+) reasonably easy to define and understand, (+) a safe interpretation of existing models, (+) permits most HW optimizations; (-) prohibits some SW optimizations on current hardware. HRFx: models to exploit hierarchy. What happens when different scopes interact? Let's not wait 30 years this time.

THANKS! QUESTIONS?

BACKUP

BASIC APPLICATION EXAMPLE: SEQUENCE ALIGNMENT [Figure: 2D alignment grid in which each cell depends on its N, W, and NW neighbors. Smallest box: wavefront; intermediate box: workgroup; full box: NDRange.] Well-defined (regular) synchronization pattern + well-defined (regular) data partitioning = producers/consumers are always known. Generally: HRF0 works well with existing regular GPGPU workloads.

SCOPED SYNCHRONIZATION Some operations are not global; their visibility is limited to the workgroup or device. HSAIL: st_{rel, part_rel}, ld_{acq, part_acq}, etc. CUDA TM: threadfence_{block, system}, syncthreads, etc. PTX: membar.{cta, gl, sys}, bar, etc. [Figure: two-level scope hierarchy; S_Global spans t1-t4, S_12 spans t1-t2, S_34 spans t3-t4.]

EXECUTIVE SUMMARY GPUs have a streaming memory system and a hierarchical programming model, and use partial (scoped) synchronization. Existing GPU memory models are ambiguous and dense for programmers. From the CPU world: SC for DRF, with relaxed HW, precise semantics, programmer-friendly. Let's not wait 30 years this time; this work applies the same principle to the GPU world: Sequential Consistency for Heterogeneous-Race-Free (SC for HRF). HRF0, basic scope synchronization: two threads that communicate must use identical scope synchronization. It works well with existing GPU codes. Others are possible; HRF0 is difficult to use efficiently with irregular synchronization, and overly conservative for current implementations.

INSERTING SCOPED SYNCHRONIZATION Acquire_S?; cell[id] = fn(cell[n], cell[nw], cell[w]); Release_S? Synchronization scope: a WF on the N or W edge of its WG uses a global acquire; a WF on the S or E edge uses a global release; all other synchronization is local.
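The scope-selection rule above can be sketched as a pair of predicates over a wavefront's position in its workgroup tile (the struct, names, and tiling parameters are hypothetical, not from any real runtime):

```cpp
// Sketch of the slide's scope-selection rule for the 2D stencil
// cell[id] = fn(cell[n], cell[nw], cell[w]).
// A wavefront on the north or west edge of its workgroup tile reads cells
// owned by a neighboring workgroup, so its acquire must be global; one on
// the south or east edge publishes results to a neighboring workgroup, so
// its release must be global. All other synchronization stays local.
struct WavefrontPos {
    int row, col;            // position of this wavefront inside the tile
    int tileRows, tileCols;  // workgroup tile dimensions
};

bool needsGlobalAcquire(const WavefrontPos& p) {
    return p.row == 0 || p.col == 0;  // north or west edge
}

bool needsGlobalRelease(const WavefrontPos& p) {
    return p.row == p.tileRows - 1 || p.col == p.tileCols - 1;  // south or east edge
}
```

An interior wavefront satisfies neither predicate and can use workgroup-scope acquire/release only, which is the performance win HRF0's guideline is after.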

EXAMPLE AMBIGUITY What value of X does t3 see? The answer is not obvious; it depends on the system.
t1: ST X = 1; Release_S12
t2: Acquire_S12; LD X (1); Release_S_Global
t3: Acquire_S_Global; LD X (??)

EXAMPLE: CONVENTIONAL BASE SYSTEM Conventional write-through/write-combining cache hierarchy: a local release flushes stores from the write queue; a local acquire stalls until the queue is empty; a global acquire invalidates all valid locations in the L1 cache; a global release flushes all dirty locations in the L1 cache.
wf1: ST X = 1; Release_S12
wf2: Acquire_S12; LD X (1); Release_S_Global
wf3: Acquire_S_Global; LD X (??)
WF3 ALWAYS sees (1).

EXAMPLE: OPTIMIZED BASE SYSTEM Optimized write-combining cache hierarchy: a local release flushes stores from the write queue; a local acquire stalls until the queue is empty; a global acquire invalidates only the locations read by the acquiring WF in L1; a global release flushes only the locations written by the releasing WF in L1.
wf1: ST X = 1; Release_S12
wf2: Acquire_S12; LD X (1); Release_S_Global
wf3: Acquire_S_Global; LD X (??)
WF3 sees (0) or (1).
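The difference between the two designs can be replayed in a toy two-level cache model (entirely illustrative; it compresses the slides' four-thread example down to the one global release/acquire pair that matters, and all names are invented):

```cpp
#include <map>
#include <set>

// Toy model: per-workgroup L1s over a shared L2. Stores land in the
// writer's L1; a global release writes lines back to L2; a global acquire
// invalidates lines in the reader's L1. "Conventional" flushes every dirty
// line; "optimized" flushes only the lines the releasing WF itself wrote.
struct Machine {
    std::map<int, int> L2;                 // addr -> value
    std::map<int, std::map<int, int>> L1;  // workgroup -> (addr -> value)

    void store(int wg, int addr, int v) { L1[wg][addr] = v; }

    int load(int wg, int addr) {           // L1 hit, else fall through to L2
        auto it = L1[wg].find(addr);
        return it != L1[wg].end() ? it->second : L2[addr];
    }

    // Conventional: global release flushes all dirty L1 lines of the WG.
    void globalReleaseAll(int wg) {
        for (auto& [a, v] : L1[wg]) L2[a] = v;
    }
    // Optimized: flush only the lines the releasing WF wrote.
    void globalReleaseOnly(int wg, const std::set<int>& writtenByWF) {
        for (int a : writtenByWF)
            if (L1[wg].count(a)) L2[a] = L1[wg][a];
    }
    void globalAcquireAll(int wg) { L1[wg].clear(); }
};

// Replay the slide: wf1 (workgroup 1) stores X = 1, but the global release
// is issued by wf2, which never wrote X itself; wf3 (workgroup 2) then
// performs a global acquire and loads X.
int wf3_sees(bool optimized) {
    const int X = 0;
    Machine m;
    m.L2[X] = 0;
    m.store(/*wg=*/1, X, 1);                          // wf1: ST X = 1
    if (optimized) m.globalReleaseOnly(1, /*wf2 wrote=*/{});
    else           m.globalReleaseAll(1);             // wf2: Release_S_Global
    m.globalAcquireAll(2);                            // wf3: Acquire_S_Global
    return m.load(2, X);                              // conventional: 1; optimized: 0
}
```

The optimized machine leaves wf1's store stranded in workgroup 1's L1 because the releasing wavefront's write set does not include X, which is exactly why WF3 can legally see (0) there while the conventional machine always delivers (1).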

ASSUMPTIONS/SIMPLIFICATIONS 1. Ignore scratchpad (local/group/shared) memory, i.e. all memory is in a single, global shared address space. 2. Use scoped acquire/release synchronization; this generalizes to other forms of synchronization. 3. Examples use a two-level scope hierarchy with simple threads: S_Global spans t1-t4, S_12 spans t1-t2, S_34 spans t3-t4.

APPLICATION EXAMPLE: TASK-PARALLEL RUNTIME Hierarchical task queue: WFs produce/consume tasks independently and use a local queue until the local queue is empty (pull from global) or full (push to global). Observation: push/pull is performed on behalf of the whole workgroup. HRF0: either all synchronization has to be global, or WFs must coordinate to push/pull. HRF1: a single WF can push/pull independently. Beware: HRF1 consumes significant brain power.
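The hierarchical queue above can be sketched single-threaded (the struct, names, and capacity threshold are invented for illustration); the spill and refill paths mark exactly the points where synchronization scope would have to widen in a real runtime:

```cpp
#include <deque>
#include <optional>

// Toy hierarchical task queue: a wavefront works out of a small local
// queue and only touches the shared global queue when the local queue
// overflows (push path) or runs dry (pop path). In an HRF setting the
// local-queue operations could use workgroup scope, while the two global
// transfers would need globally scoped synchronization.
struct TaskQueues {
    std::deque<int> global;  // shared across workgroups
    std::deque<int> local;   // private to this wavefront's workgroup
    std::size_t localCap = 4;

    void push(int task) {
        if (local.size() >= localCap) {      // local full: spill oldest to global
            global.push_back(local.front());
            local.pop_front();
        }
        local.push_back(task);
    }

    std::optional<int> pop() {
        if (local.empty()) {                 // local empty: refill from global
            if (global.empty()) return std::nullopt;
            local.push_back(global.front());
            global.pop_front();
        }
        int t = local.back();
        local.pop_back();
        return t;
    }
};
```

Under HRF0, even the common local-only fast path would be forced to global scope (or to workgroup-wide coordination) because the occasional spill communicates beyond the workgroup; the HRF1 idea on the slide is to let a single WF take the global slow path independently.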

SCOPES Scope: a subset of threads. Scoped synchronization: synchronization with respect to a scope. OpenCL TM: mem_fence. HSAIL: st_{rel, part_rel}, ld_{acq, part_acq}, etc. CUDA TM: threadfence_{block, system}, syncthreads, etc. PTX: membar.{cta, gl, sys}, bar, etc. Scopes introduce a new class of races: what happens when threads (wavefronts/warps) use different scopes? [Figure: two-level hierarchy; S_12 covers WF1-WF2 sharing one L1, S_34 covers WF3-WF4 sharing another L1, S_Global covers all WFs through the shared L2.]

DISCLAIMER & ATTRIBUTION The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.