Parallel Methods for Verifying the Consistency of Weakly-Ordered Architectures
Adam McLaughlin, Duane Merrill, Michael Garland, and David A. Bader
GTC 2016, San Jose, CA

Challenges of Design Verification
Contemporary hardware designs require millions of lines of RTL code
More lines of code written for verification than for the implementation itself
Tradeoff between performance and design complexity: speculative execution, shared caches, instruction reordering
Performance wins out

Performance vs. Design Complexity
Programmer burden: requires correct usage of synchronization
Time to market: earlier remediation of bugs is less costly; re-spins on tapeout are expensive
Significant time is spent on verification
Verification techniques are often NP-complete

Memory Consistency Models
A contract between SW and HW regarding the semantics of memory operations
Classic example: Sequential Consistency (SC): all processors observe the same ordering of operations serviced by memory; too strict for modern optimizations/architectures
Nomenclature: ST[A] 1 means a value of 1 was written to location A; LD[B] 2 means a value of 2 was read from location B
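To make the sketches on later slides concrete, here is a minimal trace-entry representation in C++/CUDA style. It is only an assumption for illustration; the field names (InstType, Inst, vproc) are not taken from the presented tool.

    #include <cstdint>

    // Hypothetical trace-entry representation used by the sketches below.
    enum class InstType { Load, Store, Barrier };

    struct Inst {
        InstType type;    // LD, ST, or DMB
        int      cpu;     // physical processor that issued the instruction
        int      vproc;   // virtual processor it is later assigned to
        uint64_t addr;    // memory location, e.g. A or B in the slides
        uint64_t data;    // value read or written (stores write unique values)
    };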

ARM Idiosyncrasies
Our focus: ARMv8
Speculative execution is allowed
Threads can reorder reads and writes, assuming no dependency exists
Writes are not guaranteed to be simultaneously visible to other cores

Problem Setup
Given an instruction trace from a simulator, RTL, or silicon:
1. Construct an initial graph: vertices represent load, store, and barrier instructions; edges represent memory ordering, based on architectural rules
2. Iteratively infer additional edges based on existing relationships
3. Check for cycles; if one exists: contradiction!
Example trace: CPU 0: ST[B] 90, ST[B] 92, LD[A] 2, LD[B] 92; CPU 1: LD[B] 92, LD[B] 93
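A minimal host-side sketch of this three-step pipeline, using the Inst struct above and a simple adjacency-list graph. buildInitialGraph and inferEdges are hypothetical placeholders for the architectural rules and the Rule 6/7 inference described later; only the standard DFS cycle check is spelled out.

    #include <vector>

    // Hypothetical adjacency-list graph over trace instructions (illustrative only).
    struct Graph {
        int n = 0;                              // number of instructions (vertices)
        std::vector<std::vector<int>> adj;      // adj[u] = successors of u
    };

    // Placeholders for the architecture-rule edges and the Rule 6/7 inference
    // described on later slides; signatures are assumptions, not the tool's API.
    Graph buildInitialGraph(const std::vector<Inst>& trace);
    bool  inferEdges(Graph& g, const std::vector<Inst>& trace);

    // Standard DFS cycle check (white/grey/black coloring). Recursive for brevity;
    // production code would use an explicit stack for multi-million-vertex graphs.
    bool dfsCycle(const Graph& g, int u, std::vector<int>& color) {
        color[u] = 1;                           // grey: on the current DFS path
        for (int v : g.adj[u]) {
            if (color[v] == 1) return true;     // back edge => cycle
            if (color[v] == 0 && dfsCycle(g, v, color)) return true;
        }
        color[u] = 2;                           // black: fully explored
        return false;
    }

    bool hasCycle(const Graph& g) {
        std::vector<int> color(g.n, 0);
        for (int s = 0; s < g.n; ++s)
            if (color[s] == 0 && dfsCycle(g, s, color)) return true;
        return false;
    }

    // Overall flow: build the graph, infer edges to a fixed point, check for cycles.
    bool traceAppearsConsistent(const std::vector<Inst>& trace) {
        Graph g = buildInitialGraph(trace);
        while (inferEdges(g, trace)) { /* repeat until no new edges are added */ }
        return !hasCycle(g);                    // a cycle means the trace is inconsistent
    }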

TSOtool (Hangal et al., ISCA '04)
Designed for SPARC, but portable to ARM
Each store writes a unique value to memory; easily map a load to the store that wrote its data
Tradeoff between accuracy and runtime: polynomial time, but false positives are possible
If a cycle is found, a bug indeed exists
If no cycles are found, execution appears consistent
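Because every store writes a unique value, each load can be mapped back to its source store with a simple lookup keyed by (address, value). The following is a hedged sketch of that idea using the Inst struct above; the key packing and function names are illustrative assumptions, not TSOtool's actual data structures.

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    // Hypothetical helper: pack (address, value) into one key. Assumes values fit
    // in 32 bits for illustration; a real tool would use a proper composite key.
    static uint64_t key(uint64_t addr, uint64_t data) { return (addr << 32) ^ data; }

    // For each load, find the index of the store that wrote the value it observed.
    // Returns -1 for loads that read the initial (never-written) value.
    std::vector<int> mapLoadsToStores(const std::vector<Inst>& trace) {
        std::unordered_map<uint64_t, int> writer;        // (addr, value) -> store index
        for (int i = 0; i < (int)trace.size(); ++i)
            if (trace[i].type == InstType::Store)
                writer[key(trace[i].addr, trace[i].data)] = i;

        std::vector<int> src(trace.size(), -1);
        for (int i = 0; i < (int)trace.size(); ++i)
            if (trace[i].type == InstType::Load) {
                auto it = writer.find(key(trace[i].addr, trace[i].data));
                if (it != writer.end()) src[i] = it->second;
            }
        return src;
    }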

Need for Scalability
Must run many tests to maximize coverage, stressing different portions of the memory subsystem
Longer tests put supporting logic in more interesting states (many instructions are required to build history in an LRU cache, for instance)
Using a CPU cluster does not suffice: the results of one set of tests dictate the structure of the ensuing tests
Faster tests help with interactivity!
Solution: efficient algorithms and parallelism

Inferred Edge Insertions (Rule 6)
Given: S can reach X, and X does not load its data from S
Infer: S comes before W, the node that stored X's data
Example: S: ST[A] 1, W: ST[A] 2, X: LD[A] 2; add an edge from S to W

Inferred Edge Insertions (Rule 7)
Given: S can reach X, and loads read their data from S, not from X
Infer: those loads came before X
Example: S: ST[A] 1, L: LD[A] 1, M: LD[A] 1, X: ST[A] 2; add edges from L and M to X

Initial Algorithm for Inferring Edges

    for_each(store vertex S) {
      for_each(reachable vertex X from S) {            // Getting this set is expensive!
        if (location[S] == location[X]) {
          if ((type[X] == LD) && (data[S] != data[X])) {
            // Add Rule 6 edge from S to W, the store that X read from
          } else if (type[X] == ST) {
            for_each(load vertex L that reads data from S) {
              // Add Rule 7 edge from L to X
            }
          } // End if instruction type is store
        }   // End if location matches
      }     // End for each reachable vertex
    }       // End for each store

Virtual Processors (vprocs)
Split instructions from physical to virtual processors; each vproc is sequentially consistent
Example: CPU 0 issues, in program order, ST[B] 91, ST[A] 1, LD[A] 2, ST[B] 92
After splitting, the memory order is: VPROC 0: ST[B] 91, ST[B] 92; VPROC 1: ST[A] 1; VPROC 2: LD[A] 2
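One splitting that is consistent with the example on this slide: stores from the same CPU to the same address stay on one vproc (so their order is preserved), while every other instruction gets a singleton vproc. This is only an illustrative sketch; the splitting actually used depends on which orderings the architecture guarantees.

    #include <cstdint>
    #include <map>
    #include <utility>
    #include <vector>

    // Assign a vproc id to every instruction. Stores from the same CPU to the same
    // address share a vproc; every other instruction gets a fresh singleton vproc.
    // Illustrative only; the real splitting follows the architecture's rules.
    int assignVprocs(std::vector<Inst>& trace) {
        std::map<std::pair<int, uint64_t>, int> storeVproc;   // (cpu, addr) -> vproc
        int next = 0;
        for (Inst& ins : trace) {
            if (ins.type == InstType::Store) {
                auto k = std::make_pair(ins.cpu, ins.addr);
                auto it = storeVproc.find(k);
                if (it == storeVproc.end()) it = storeVproc.emplace(k, next++).first;
                ins.vproc = it->second;
            } else {
                ins.vproc = next++;           // loads and barriers: singleton vproc
            }
        }
        return next;                          // total number of vprocs, p
    }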

Reverse Time Vector Clocks (RTVC)
Track the earliest successor from each vertex to each vproc; this captures transitivity
Example (trace: CPU 0: ST[B] 90, ST[B] 92, LD[A] 2, LD[B] 92; CPU 1: LD[B] 92, LD[B] 93): the RTVC of ST[B] 90 is (ST[B] 92, NULL, LD[B] 92, LD[B] 92), one entry per vproc
Complexity of inferring edges: O(n^2 p^2 d_max)
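A minimal sketch of how an RTVC might be stored, assuming p vprocs: one entry per (vertex, vproc) pair holding the earliest known successor on that vproc, or a sentinel if none. The flat layout and the names RTVCTable, NO_SUCC, and posInVproc are assumptions for illustration.

    #include <cstddef>
    #include <vector>

    constexpr int NO_SUCC = -1;               // sentinel: no successor on that vproc

    // entry[v * p + q] = earliest successor of vertex v on vproc q (or NO_SUCC).
    // "Earliest" means smallest position within that vproc's sequential order.
    struct RTVCTable {
        int n = 0, p = 0;
        std::vector<int> entry;

        RTVCTable(int n_, int p_) : n(n_), p(p_), entry(size_t(n_) * p_, NO_SUCC) {}

        int  get(int v, int q) const { return entry[size_t(v) * p + q]; }
        void set(int v, int q, int succ) { entry[size_t(v) * p + q] = succ; }

        // Keep the earlier of the current entry and a newly discovered successor,
        // comparing by the successor's position within vproc q (posInVproc is an
        // assumed lookup: index of the instruction within its vproc).
        void tighten(int v, int q, int succ, const std::vector<int>& posInVproc) {
            int cur = get(v, q);
            if (cur == NO_SUCC || posInVproc[succ] < posInVproc[cur]) set(v, q, succ);
        }
    };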

Updating RTVCs
Computing RTVCs once is fast: process vertices in the reverse order of a topological sort, checking direct neighbors first and then their RTVCs
Every time a new edge is inserted, RTVC values need to change; the number of edge insertions can be as large as m
TSOtool implements both vprocs and RTVCs
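A sketch of the one-shot RTVC computation described here, built on the Graph and RTVCTable sketches above: visit vertices in reverse topological order and, for each vertex, fold in its direct successors and then those successors' RTVC entries. topoOrder, vproc, and posInVproc are assumed to be precomputed.

    #include <vector>

    // Compute RTVCs by processing vertices in the reverse order of a topological
    // sort of the (acyclic) graph. vproc[v] and posInVproc[v] give each vertex's
    // vproc and its position within that vproc.
    void computeRTVCs(const Graph& g, const std::vector<int>& topoOrder,
                      const std::vector<int>& vproc, const std::vector<int>& posInVproc,
                      RTVCTable& rtvc) {
        for (auto it = topoOrder.rbegin(); it != topoOrder.rend(); ++it) {
            int v = *it;
            for (int s : g.adj[v]) {
                // A direct successor is a candidate earliest successor on its own vproc...
                rtvc.tighten(v, vproc[s], s, posInVproc);
                // ...and its RTVC entries are candidates on every vproc (transitivity).
                for (int q = 0; q < rtvc.p; ++q) {
                    int t = rtvc.get(s, q);
                    if (t != NO_SUCC) rtvc.tighten(v, q, t, posInVproc);
                }
            }
        }
    }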

Facilitating Parallelism
Repeatedly updating RTVCs is expensive: for k edge insertions, RTVC updates take O(kpn) time, where k = O(n^2) but is usually a small multiple of n
Idea: update RTVCs once per iteration rather than once per edge insertion
For i iterations, RTVC updates take O(ipn) time, with i ≪ k (less than 10 for all test cases)
Less communication between threads
Complexity of inferring edges: O(n^2 p)
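A hedged sketch of the lazy schedule: each iteration infers edges using the RTVCs as they stood at the start of the iteration, batches the insertions, then rebuilds the RTVCs once. inferEdgesOnce and topologicalOrder are assumed helpers standing in for one Rule 6/7 pass over all stores and a standard topological sort; the graph is assumed acyclic here, with the cycle check running separately.

    #include <utility>
    #include <vector>

    // One pass of Rule 6/7 inference using (possibly stale) RTVC values; returns
    // the edges to add this iteration. Placeholder for the per-store logic above.
    std::vector<std::pair<int, int>> inferEdgesOnce(const Graph& g,
                                                    const std::vector<Inst>& trace,
                                                    const RTVCTable& rtvc);

    // Assumed helper: topological ordering of the current (acyclic) graph.
    std::vector<int> topologicalOrder(const Graph& g);

    // Lazy RTVC schedule: apply all insertions found in an iteration, then
    // recompute the RTVCs once, instead of updating after every single edge.
    void inferToFixedPoint(Graph& g, const std::vector<Inst>& trace,
                           const std::vector<int>& vproc, const std::vector<int>& posInVproc,
                           RTVCTable& rtvc) {
        bool changed = true;
        while (changed) {
            auto newEdges = inferEdgesOnce(g, trace, rtvc);  // start-of-iteration RTVCs
            changed = !newEdges.empty();
            for (auto [u, v] : newEdges) g.adj[u].push_back(v);
            if (changed) {
                std::vector<int> topo = topologicalOrder(g);
                rtvc = RTVCTable(rtvc.n, rtvc.p);            // reset and recompute once
                computeRTVCs(g, topo, vproc, posInVproc, rtvc);
            }
        }
    }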

Correctness
The inferred edges found by our approach will not be the same as the edges found by TSOtool
We might not infer an edge that TSOtool does: TSOtool's RTVCs can change mid-iteration
We might infer an edge that TSOtool does not: our approach will have stale RTVC values
Both approaches make forward progress; the number of edges monotonically increases
Any edge inserted by our approach could have been inserted by the naïve approach [Thm 1]
If TSOtool finds a cycle, we will also find a cycle [Thm 2]

Parallel Implementations
OpenMP: each thread keeps its own partition of added edges; after each iteration of inferring edges, reduce
CUDA: assign threads to each store instruction; threads independently traverse the vprocs of their store and atomically add edges to a preallocated array in global memory
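A hedged CUDA sketch of the kernel side, assuming flattened device arrays (type, addr, data, RTVC entries) and a preallocated global edge buffer with an atomic counter. All names, layouts, and the simplified Rule 6 check are illustrative assumptions, not the presentation's actual code.

    #include <cuda_runtime.h>

    // Illustrative device-side layout (structure of arrays); pointers are device memory.
    struct DeviceTrace {
        int*                type;     // 0 = LD, 1 = ST, 2 = DMB
        unsigned long long* addr;
        unsigned long long* data;
        int*                rtvc;     // n * p entries: earliest successor per vproc (-1 = none)
        int                 n, p;
    };

    // One thread per store: walk the store's RTVC entries (one candidate per vproc),
    // apply a simplified Rule 6 check, and append inferred edges to a global buffer.
    __global__ void inferEdgesKernel(DeviceTrace t, const int* storeIds, int numStores,
                                     int2* edgeOut, int* edgeCount, int edgeCapacity) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= numStores) return;
        int s = storeIds[i];

        for (int q = 0; q < t.p; ++q) {
            int x = t.rtvc[(size_t)s * t.p + q];               // earliest successor on vproc q
            if (x < 0 || t.addr[x] != t.addr[s]) continue;     // different location: skip
            if (t.type[x] == 0 && t.data[x] != t.data[s]) {    // Rule 6 shape: load of another store's value
                int slot = atomicAdd(edgeCount, 1);            // reserve a slot in the edge buffer
                if (slot < edgeCapacity)
                    edgeOut[slot] = make_int2(s, x);           // real code would record S -> W instead
            }
            // Rule 7 (store successor) omitted for brevity.
        }
    }

Host code would launch one thread per store and, after the kernel, compact the first min(*edgeCount, edgeCapacity) entries of edgeOut into the graph before the next iteration.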

Experimental Setup
CPU: Intel Core i7-2600K, quad core, 3.4 GHz, 8 MB LLC, 16 GB DRAM
GPU: NVIDIA GeForce GTX Titan, 14 SMs, 837 MHz base clock, 6 GB DRAM
ARM system under test: Cortex-A57, quad core
Instruction graphs range from n = 2^18 to n = 2^22 vertices, with m = O(n): sparse, high-diameter, low-degree
Tests vary by their distribution of LD/ST/DMB instructions, number of vprocs, and instruction dependencies

Importance of Scaling (plot: 512K instructions per core, 2M total instructions)

Speedup over TSOtool (Application)

Graph Size     # of tests   Lazy RTVC   OMP 2   OMP 4    GPU
64K*4 = 256K   27           5.64x       7.62x   9.43x    10.79x
128K*4 = 512K  27           5.31x       7.12x   8.90x    10.76x
256K*4 = 1M    23           6.30x       9.05x   12.13x   15.47x
512K*4 = 2M    10           3.68x       6.41x   10.81x   24.55x
1M*4 = 4M      2            3.05x       5.58x   9.97x    37.64x

The GPU is always best and scales much better to larger tests
Extreme case: 9 hours using TSOtool vs. under 10 minutes using our GPU approach
Avg. parallel speedups over our improved sequential approach: 1.92x (OMP 2), 3.53x (OMP 4), 5.05x (GPU)

Summary
Relaxing the updates to RTVCs led to a better sequential approach and facilitated parallel implementations, trading off redundant work for parallelism
Faster execution leads to interactive bug-finding
The GPU scales well to larger problem instances, which is helpful for corner-case bugs that slip through pre-silicon verification
For the twelve largest test cases, our GPU implementation achieves a 26.36x average application speedup

Acknowledgments
Thanks to Shankar Govindaraju and Tom Hart for their help in understanding NVIDIA's implementation of TSOtool for ARM

Questions
"To raise new questions, new possibilities, to regard old problems from a new angle, requires creative imagination and marks real advance in science." (Albert Einstein)

Backup

Sequential Consistency Examples
Valid: P1: ST[x] 1; P2: LD[x] 1, LD[x] 2; P3: LD[x] 1, LD[x] 2; P4: ST[x] 2 (ST[x] 1 is handled before ST[x] 2)
Invalid: P1: ST[x] 1; P2: LD[x] 1, LD[x] 2; P3: LD[x] 2, LD[x] 1; P4: ST[x] 2 (writes propagate to P2 and P3 in a different order; valid for weaker memory models)

Weaker Models
SC is intuitive but too strict: it prevents common compiler/architecture optimizations
Commercial products use weaker models: x86 uses Total Store Order (TSO); Power/ARM use Relaxed Memory Ordering (RMO)
Weaker models allow for greater optimization opportunities, at the cost of more complicated semantics

Initial Algorithm: Weaknesses
Expensive to compute: O(n^3), assuming edges can be inserted in O(1) time, repeated iteratively until a fixed point is reached; requires the transitive closure of the graph
Expensive to store: captures n^2 relationships (does vertex i reach vertex j?)
Adds lots of redundant edges; should leverage transitivity when possible (e.g., an edge from A to C is redundant when edges A to B and B to C already exist)

Reverse Time Vector Clocks (RTVCs)
vprocs provide implicit orderings
Track the earliest successor from each vertex to each vproc; this captures transitivity
Traverse vprocs rather than the graph itself: no need to check every reachable vertex
Bounds the number of reachable edges to be inspected by p, the number of vprocs
No need to compute or store the transitive closure!

Superfluous Work?
Our approach tends to add more edges than TSOtool, some of which are redundant
Worst case: 36% additional edges
The redundancy is well worth the performance benefits

Test Info

n = |V|     m = |E|: TSOtool / Inferred   Iterations   ST/LD/BAR (%)
2,097,963   3,799,254 / 4,487,224         5            76/24/0
2,098,219   3,686,624 / 4,411,887         4            79/21/0
1,977,832   4,453,340 / 5,179,108         5            46/53/1
2,097,741   3,875,831 / 4,635,852         7            77/23/0
1,936,321   5,109,990 / 5,236,671         5            44/54/2
2,098,321   2,491,062 / 4,257,077         6            80/20/0
2,097,809   4,321,793 / 4,404,753         7            78/21/1
1,871,831   3,660,617 / 4,861,044         6            44/54/2
2,097,809   4,434,120 / 4,418,555         5            80/20/0
4,195,405   6,934,725 / 9,338,902         7            76/23/1
4,194,961   7,960,567 / 8,963,281         6            78/22/0

Speedup over TSOtool (Inferring Edges)

Graph Size     # of tests   Lazy RTVC   OMP 2    OMP 4    GPU
64K*4 = 256K   27           15.09x      29.31x   53.45x   57.90x
128K*4 = 512K  27           16.41x      31.49x   57.34x   76.98x
256K*4 = 1M    23           14.51x      27.98x   51.68x   72.32x
512K*4 = 2M    10           4.01x       7.52x    14.19x   42.90x
1M*4 = 4M      2            3.08x       5.70x    10.39x   45.16x

The number of tests decreases with test size because of industrial time constraints; this is the motivation for this work
Avg. parallel speedups over our improved sequential approach: 1.92x (OMP 2), 3.53x (OMP 4), 5.05x (GPU)


Importance of Scaling (plot: 128K instructions per core, 512K total instructions)

Importance of Scaling (plot: 256K instructions per core, 1M total instructions)