Parallel Methods for Verifying the Consistency of Weakly-Ordered Architectures. Adam McLaughlin, Duane Merrill, Michael Garland, and David A. Bader
1 Parallel Methods for Verifying the Consistency of Weakly-Ordered Architectures Adam McLaughlin, Duane Merrill, Michael Garland, and David A. Bader
2 Challenges of Design Verification
- Contemporary hardware designs require millions of lines of RTL code
  - More lines of code are written for verification than for the implementation itself
- Tradeoff between performance and design complexity
  - Speculative execution, shared caches, instruction reordering
  - Performance wins out
GTC 2016, San Jose, CA
3 Performance vs. Design Complexity
- Programmer burden
  - Requires correct usage of synchronization
- Time to market
  - Earlier remediation of bugs is less costly
  - Re-spins on tapeout are expensive
- Significant time is spent on verification
  - Verification techniques are often NP-complete
4 Memory Consistency Models
- Contract between SW and HW regarding the semantics of memory operations
- Classic example: Sequential Consistency (SC)
  - All processors observe the same ordering of operations serviced by memory
  - Too strict for modern optimizations/architectures
- Nomenclature
  - ST[A] 1: wrote a value of 1 to location A
  - LD[B] 2: read a value of 2 from location B
5 ARM Idiosyncrasies
- Our focus: ARMv8
- Speculative execution is allowed
- Threads can reorder reads and writes
  - Assuming no dependency exists
- Writes are not guaranteed to be simultaneously visible to other cores
6 Problem Setup
- Given an instruction trace from a simulator, RTL, or silicon:
  1. Construct an initial graph
     - Vertices represent load, store, and barrier instructions
     - Edges represent memory ordering, based on architectural rules
  2. Iteratively infer additional edges to the graph, based on existing relationships
  3. Check for cycles
     - If one exists: contradiction!
- Example trace
  - CPU 0: ST[B] 90, ST[B] 92, LD[A] 2, LD[B] 92
  - CPU 1: LD[B] 92, LD[B] 93
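Step 3 of the setup above is an ordinary cycle check on a directed graph. A minimal sketch, assuming vertices are numbered 0..n-1 and edges are (source, destination) pairs; the function name `has_cycle` and the toy graphs are illustrative and not taken from the talk. It uses Kahn's topological sort, which is one standard way to detect cycles and may not be the method the authors used.

```python
from collections import deque

def has_cycle(num_vertices, edges):
    """Return True if the directed graph contains a cycle (Kahn's algorithm)."""
    successors = [[] for _ in range(num_vertices)]
    in_degree = [0] * num_vertices
    for src, dst in edges:
        successors[src].append(dst)
        in_degree[dst] += 1

    # Start from vertices that nothing is ordered before.
    ready = deque(v for v in range(num_vertices) if in_degree[v] == 0)
    visited = 0
    while ready:
        v = ready.popleft()
        visited += 1
        for w in successors[v]:
            in_degree[w] -= 1
            if in_degree[w] == 0:
                ready.append(w)

    # Vertices on a cycle never reach in-degree 0, so they are never visited.
    return visited < num_vertices

# Hypothetical ordering graphs: a simple chain, then the same chain with a back edge.
consistent = [(0, 1), (1, 2), (2, 3)]
contradictory = consistent + [(3, 1)]
```

A cycle means some operation would have to be ordered before itself, which is the contradiction the slide refers to.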
7 TSOtool
- Hangal et al., ISCA '04
- Designed for SPARC, but portable to ARM
- Each store writes a unique value to memory
  - Easily map a load to the store that wrote its data
- Tradeoff between accuracy and runtime
  - Polynomial time, but false positives are possible
  - If a cycle is found, a bug indeed exists
  - If no cycles are found, execution appears consistent
8 Need for Scalability
- Must run many tests to maximize coverage
  - Stress different portions of the memory subsystem
- Longer tests put supporting logic in more interesting states
  - Many instructions are required to build history in an LRU cache, for instance
- Using a CPU cluster does not suffice
  - The results of one set of tests dictate the structure of the ensuing tests
  - Faster tests help with interactivity!
- Solution: efficient algorithms and parallelism
9 Inferred Edge Insertions (Rule 6)
- S can reach X
- X does not load data from S
- Vertices: S: ST[A] 1; W: ST[A] 2; X: LD[A] 2
10 Inferred Edge Insertions (Rule 6)
- S can reach X
- X does not load data from S
- Therefore S comes before the node that stored X's data
- Vertices: S: ST[A] 1; X: LD[A] 2; W: ST[A] 2
11 Inferred Edge Insertions (Rule 7)
- S can reach X
- Loads read data from S, not X
- Vertices: S: ST[A] 1; L: LD[A] 1; M: LD[A] 1; X: ST[A] 2
12 Inferred Edge Insertions (Rule 7)
- S can reach X
- Loads read data from S, not X
- Therefore the loads came before X
- Vertices: S: ST[A] 1; L: LD[A] 1; M: LD[A] 1; X: ST[A] 2
13 Initial Algorithm for Inferring Edges
for_each(store vertex S) {
  for_each(vertex X reachable from S) { // Getting this set is expensive!
    if (location[S] == location[X]) {
      if ((type[X] == LD) && (data[S] != data[X])) {
        // Add Rule 6 edge from S to W, the store that X read from
      } else if (type[X] == ST) {
        for_each(load vertex L that reads data from S) {
          // Add Rule 7 edge from L to X
        }
      } // End if instruction type is store
    } // End if location
  } // End for each reachable vertex
} // End for each store
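The pseudocode above can be turned into a small runnable sketch. The version below is an illustration under stated assumptions, not the authors' implementation: vertices are (kind, location, value) tuples, every store writes a unique value (as on the TSOtool slide), and the reachable set is found with a plain breadth-first search, which is the expensive step the slide warns about. The names `infer_edges` and `reachable_from` are hypothetical.

```python
from collections import deque

LD, ST = "LD", "ST"

def reachable_from(start, successors):
    """All vertices reachable from start, excluding start itself (BFS)."""
    seen = {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in successors[v]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    seen.discard(start)
    return seen

def infer_edges(vertices, edges):
    """One pass of Rule 6 / Rule 7 inference; returns only edges not already present."""
    successors = [[] for _ in vertices]
    for src, dst in edges:
        successors[src].append(dst)

    # Unique store values let us map (location, value) back to the writing store.
    writer = {(loc, val): i for i, (kind, loc, val) in enumerate(vertices) if kind == ST}
    readers = {}  # store index -> loads that read its data
    for i, (kind, loc, val) in enumerate(vertices):
        if kind == LD and (loc, val) in writer:
            readers.setdefault(writer[(loc, val)], []).append(i)

    new_edges = set()
    for s, (kind_s, loc_s, val_s) in enumerate(vertices):
        if kind_s != ST:
            continue
        for x in reachable_from(s, successors):
            kind_x, loc_x, val_x = vertices[x]
            if loc_x != loc_s:
                continue
            if kind_x == LD and val_x != val_s:
                w = writer.get((loc_x, val_x))
                if w is not None:
                    new_edges.add((s, w))      # Rule 6: S before the store X read from
            elif kind_x == ST:
                for load in readers.get(s, []):
                    new_edges.add((load, x))   # Rule 7: loads of S before X
    return new_edges - set(edges)

# Rule 6 example from the slides: S=0, W=1, X=2.
rule6 = infer_edges([(ST, "A", 1), (ST, "A", 2), (LD, "A", 2)], {(0, 2), (1, 2)})
# Rule 7 example from the slides: S=0, L=1, M=2, X=3.
rule7 = infer_edges([(ST, "A", 1), (LD, "A", 1), (LD, "A", 1), (ST, "A", 2)],
                    {(0, 1), (0, 2), (0, 3)})
```

The starting edges in both examples are assumed for illustration; the slides only state that S can reach X.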
14 Virtual Processors (vprocs)
- Split instructions from physical to virtual processors
- Each vproc is sequentially consistent
- Example (figure legend: program order, memory order)
  - CPU 0: ST[B] 91, ST[A] 1, LD[A] 2, ST[B] 92
  - VPROC 0: ST[B] 91, ST[B] 92
  - VPROC 1: ST[A] 1
  - VPROC 2: LD[A] 2
15 Reverse Time Vector Clocks (RTVC)
- Track the earliest successor from each vertex to each vproc
- Captures transitivity
- Example trace
  - CPU 0: ST[B] 90, ST[B] 92, LD[A] 2, LD[B] 92
  - CPU 1: LD[B] 92, LD[B] 93
- Consider the RTVC of ST[B] = 90
  - Purple: ST[B] = 92
  - Blue: NULL
  - Green: LD[B] = 92
  - Orange: LD[B] = 92
- Complexity of inferring edges: $O(n^2 p^2 d_{\max})$
16 Updating RTVCs
- Computing RTVCs once is fast
  - Process vertices in the reverse order of a topological sort
  - Check neighbors directly, then their RTVCs
- Every time a new edge is inserted, RTVC values need to change
  - # of edge insertions m
- TSOtool implements both vprocs and RTVCs
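The one-shot RTVC computation described above (reverse topological order, neighbors first, then the neighbors' RTVCs) might look like the following sketch. It assumes the graph is acyclic, that each vertex carries a vproc id and a position within that vproc (smaller means earlier), and it stores infinity where the slides show NULL. The names `compute_rtvcs`, `vproc_of`, and `position` are illustrative, not from the talk.

```python
from collections import deque

INF = float("inf")  # stands in for the NULL entry shown on the RTVC slide

def topological_order(n, successors):
    """Kahn's algorithm; assumes the graph is acyclic."""
    in_degree = [0] * n
    for v in range(n):
        for w in successors[v]:
            in_degree[w] += 1
    ready = deque(v for v in range(n) if in_degree[v] == 0)
    order = []
    while ready:
        v = ready.popleft()
        order.append(v)
        for w in successors[v]:
            in_degree[w] -= 1
            if in_degree[w] == 0:
                ready.append(w)
    return order

def compute_rtvcs(vproc_of, position, edges, num_vprocs):
    """rtvc[v][p] = earliest position in vproc p that is reachable from v."""
    n = len(vproc_of)
    successors = [[] for _ in range(n)]
    for src, dst in edges:
        successors[src].append(dst)

    rtvc = [[INF] * num_vprocs for _ in range(n)]
    # Reverse topological order guarantees every successor is finished first.
    for v in reversed(topological_order(n, successors)):
        for w in successors[v]:
            p = vproc_of[w]
            rtvc[v][p] = min(rtvc[v][p], position[w])      # the neighbor itself
            for q in range(num_vprocs):                    # then the neighbor's RTVC
                rtvc[v][q] = min(rtvc[v][q], rtvc[w][q])
    return rtvc

# Hypothetical graph: vproc 0 holds vertices 0,1 and vproc 1 holds vertices 2,3.
rtvc = compute_rtvcs(vproc_of=[0, 0, 1, 1], position=[0, 1, 0, 1],
                     edges=[(0, 1), (2, 3), (1, 3)], num_vprocs=2)
```

Each vertex stores only p entries, which is why the transitive closure never has to be materialized.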
17 Facilitating Parallelism
- Repeatedly updating RTVCs is expensive
  - For $k$ edge insertions, RTVC updates take $O(kpn)$ time
  - $k = O(n^2)$, but is usually a small multiple of $n$
- Idea: update RTVCs once per iteration rather than once per edge insertion
  - For $i$ iterations, RTVC updates take $O(ipn)$ time
  - $i \ll k$ (less than 10 for all test cases)
  - Less communication between threads
- Complexity of inferring edges: $O(n^2 p)$
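The batching idea above can be shown with a generic fixed-point driver: the per-iteration summary (RTVCs in the talk) is rebuilt once per pass, and all edges found in that pass are inserted together. This is a structural sketch only. The summary and inference functions are deliberately simple stand-ins (a successor map and one-step transitive inference), not the RTVC-based Rule 6/7 inference from the talk, and every name here is hypothetical.

```python
def infer_to_fixed_point(edges, recompute_summary, infer_batch):
    """Rebuild the summary once per iteration, then insert the whole batch of new edges."""
    edges = set(edges)
    productive_iterations = 0
    while True:
        summary = recompute_summary(edges)            # once per iteration, not per edge
        new_edges = infer_batch(edges, summary) - edges
        if not new_edges:                             # fixed point reached
            return edges, productive_iterations
        edges |= new_edges                            # summary is stale until the next pass
        productive_iterations += 1

# Stand-in summary: successor sets (a real implementation would build RTVCs here).
def build_successors(edges):
    successors = {}
    for a, b in edges:
        successors.setdefault(a, set()).add(b)
    return successors

# Stand-in inference: if a -> b and b -> c, propose a -> c.
def one_step_closure(edges, successors):
    return {(a, c) for a, b in edges for c in successors.get(b, ())}

chain = [(0, 1), (1, 2), (2, 3), (3, 4)]
final_edges, iterations = infer_to_fixed_point(chain, build_successors, one_step_closure)
```

In this toy run six edges are inserted, yet the summary is rebuilt only three times (two productive passes plus the final check), which mirrors the $i \ll k$ argument on the slide.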
18 Correctness
- Inferred edges found by our approach will not be the same as the edges found by TSOtool
  - Might not infer an edge that TSOtool does: the RTVC for TSOtool can change mid-iteration
  - Might infer an edge that TSOtool does not: our approach will have stale RTVC values
- Both approaches make forward progress
  - Number of edges monotonically increases
- Any edge inserted by our approach could have been inserted by the naïve approach [Thm 1]
- If TSOtool finds a cycle, we will also find a cycle [Thm 2]
19 Parallel Implementations
- OpenMP
  - Each thread keeps its own partition of added edges
  - After each iteration of inferring edges, reduce
- CUDA
  - Assign threads to each store instruction
  - Threads independently traverse the vprocs of this store
  - Atomically add edges to a preallocated array in global memory
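The OpenMP pattern above (private per-thread edge lists, then a reduction after the iteration) can be illustrated structurally in Python. This sketch swaps in `concurrent.futures.ThreadPoolExecutor` for OpenMP threads, so it shows the data flow rather than the speedup; CPython threads will not run this work in parallel. `parallel_infer` and `infer_for_store` are hypothetical names, and the per-store work function is a placeholder for the vproc traversal.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_infer(stores, infer_for_store, num_workers=4):
    """Each worker fills a private edge list; the lists are merged after the pass."""
    # Static round-robin partition of the store vertices across workers.
    chunks = [stores[w::num_workers] for w in range(num_workers)]

    def worker(chunk):
        local_edges = []                 # private partition: no locking needed
        for s in chunk:
            local_edges.extend(infer_for_store(s))
        return local_edges

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partitions = list(pool.map(worker, chunks))

    merged = set()                       # reduction step after the iteration
    for part in partitions:
        merged.update(part)
    return merged

# Placeholder work function: pretend every store s infers one edge (s, s + 1).
merged_edges = parallel_infer(list(range(10)), lambda s: [(s, s + 1)])
```

The CUDA variant on the slide differs in one respect: threads append to a single preallocated global array with atomic operations instead of keeping private lists.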
20 Experimental Setup
- Intel Core i7-2600K CPU
  - Quad core, 3.4 GHz, 8 MB LLC, 16 GB DRAM
- NVIDIA GeForce GTX Titan
  - 14 SMs, 837 MHz base clock, 6 GB DRAM
- ARM system under test
  - Cortex-A57, quad core
- Instruction graphs range from $n = 2^{18}$ to $n = 2^{22}$ vertices, $n \approx m$
  - Sparse, high-diameter, low-degree
- Tests vary by their distribution of LD/ST/DMB instructions, # of vprocs, and instruction dependencies
21 Importance of Scaling
- 512K instructions per core, 2M total instructions
22 Speedup over TSOtool (Application)
| Graph Size | # of tests | Lazy RTVC | OMP 2 | OMP 4 | GPU |
|---|---|---|---|---|---|
| 64K*4 = 256K | | | 7.62x | 9.43x | 10.79x |
| 128K*4 = 512K | | | 7.12x | 8.90x | 10.76x |
| 256K*4 = 1M | | | 9.05x | 12.13x | 15.47x |
| 512K*4 = 2M | | | 6.41x | 10.81x | 24.55x |
| 1M*4 = 4M | | | 5.58x | 9.97x | 37.64x |
- GPU is always best; scales much better to larger tests
- Extreme case: 9 hours using TSOtool vs. under 10 minutes using our GPU approach
- Avg. parallel speedups over our improved sequential approach: 1.92x (OMP 2), 3.53x (OMP 4), 5.05x (GPU)
23 Summary
- Relaxing the updates to RTVCs led to a better sequential approach and facilitated parallel implementations
  - Tradeoff between redundant work and parallelism
- Faster execution leads to interactive bug-finding
- The GPU scales well to larger problem instances
  - Helpful for corner-case bugs that slip through pre-silicon verification
- For the twelve largest test cases, our GPU implementation achieves a 26.36x average application speedup
24 Acknowledgments
- Shankar Govindaraju and Tom Hart, for their help in understanding NVIDIA's implementation of TSOtool for ARM
25 Questions
"To raise new questions, new possibilities, to regard old problems from a new angle, requires creative imagination and marks real advance in science." Albert Einstein
26 Backup
27 Sequential Consistency Examples
- Valid: ST[x] 1 handled before ST[x] 2 (time axis: t=0, t=1, t=2)
  - P1: ST[x] 1
  - P2: LD[x] 1, LD[x] 2
  - P3: LD[x] 1, LD[x] 2
  - P4: ST[x] 2
- Invalid: writes propagate to P2 and P3 in a different order (time axis: t=0, t=1, t=2)
  - P1: ST[x] 1
  - P2: LD[x] 1, LD[x] 2
  - P3: LD[x] 2, LD[x] 1
  - P4: ST[x] 2
  - Valid for weaker memory models
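The two examples above can be checked mechanically for a single location. The sketch below is a simplified pairwise test, assuming each observer's loads are given in program order and each store writes a unique value; it flags two observers that see a pair of writes in opposite orders. It is a necessary condition only (it would miss disagreements that need three or more values to expose), and the name `observers_agree` is illustrative.

```python
def observers_agree(observations):
    """observations: one list per processor of values loaded from a single location,
    in program order. Returns False if two values are ever seen in both orders."""
    seen_before = set()
    for loads in observations:
        for i, earlier in enumerate(loads):
            for later in loads[i + 1:]:
                if earlier != later:
                    seen_before.add((earlier, later))
    # A pair observed in both orders cannot fit one global write order.
    return all((b, a) not in seen_before for a, b in seen_before)

valid = observers_agree([[1, 2], [1, 2]])     # P2 and P3 from the valid example
invalid = observers_agree([[1, 2], [2, 1]])   # P2 and P3 from the invalid example
```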
28 Weaker Models
- SC is intuitive, but is too strict
  - Prevents common compiler/architecture optimizations
- Commercial products use weaker models
  - x86: Total Store Order (TSO)
  - Power/ARM: Relaxed Memory Ordering (RMO)
- Weaker models allow for greater optimization opportunities
  - Cost: more complicated semantics
29 Initial Algorithm: Weaknesses
- Expensive to compute
  - $O(n^3)$, assuming edges can be inserted in $O(1)$ time
  - Repeated iteratively until a fixed point is reached
  - Requires the transitive closure of the graph
- Expensive to store
  - Capturing $n^2$ relationships (does vertex i reach vertex j?)
- Adds lots of redundant edges
  - Should leverage transitivity when possible
  - Figure: vertices A, B, C
30 Reverse Time Vector Clocks (RTVCs)
- vprocs provide implicit orderings
- Figure: ST[A] 1, ST[B] 91, ST[B] 92
31 Reverse Time Vector Clocks (RTVCs)
- vprocs provide implicit orderings
- Figure: ST[A] 1, ST[B] 91, ST[B] 92, with the Reverse Time Vector Clock annotated
- Track the earliest successor from each vertex to each vproc
- Bounds the number of reachable edges to be inspected by p, the number of vprocs
- No need to compute or store the transitive closure!
32 Reverse Time Vector Clocks (RTVCs)
- Track the earliest successor from each vertex to each vproc
  - Captures transitivity
- Traverse vprocs rather than the graph itself
  - No need to check every reachable vertex
- Bounds the number of reachable edges to be inspected by p, the number of vprocs
- No need to compute or store the transitive closure!
33 Superfluous work?
- Our approach tends to add more edges than TSOtool, some of which are redundant
  - Worst case: 36% additional edges
- The redundancy is well worth the performance benefits
34 Test Info
Columns: n = |V|, m = |E|, TSOtool, Inferred, Iterations, ST/LD/BAR (%)
2,097,963 3,799,254 4,487, /24/0
2,098,219 3,686,624 4,411, /21/0
1,977,832 4,453,340 5,179, /53/1
2,097,741 3,875,831 4,635, /23/0
1,936,321 5,109,990 5,236, /54/2
2,098,321 2,491,062 4,257, /20/0
2,097,809 4,321,793 4,404, /21/1
1,871,831 3,660,617 4,861, /54/2
2,097,809 4,434,120 4,418, /20/0
4,195,405 6,934,725 9,338, /23/1
4,194,961 7,960,567 8,963, /22/0
35 Speedup over TSOtool (Inferring edges)
| Graph Size | # of tests | Lazy RTVC | OMP 2 | OMP 4 | GPU |
|---|---|---|---|---|---|
| 64K*4 = 256K | | | 29.31x | 53.45x | 57.90x |
| 128K*4 = 512K | | | 31.49x | 57.34x | 76.98x |
| 256K*4 = 1M | | | 27.98x | 51.68x | 72.32x |
| 512K*4 = 2M | | | 7.52x | 14.19x | 42.90x |
| 1M*4 = 4M | | | 5.70x | 10.39x | 45.16x |
- Number of tests decreases with test size because of industrial time constraints
  - Motivation for this work
- Avg. parallel speedups over our improved sequential approach: 1.92x (OMP 2), 3.53x (OMP 4), 5.05x (GPU)
36 Problem Setup
- Given an instruction trace from a simulator, RTL, or silicon:
  1. Construct an initial graph
     - Vertices represent load, store, and barrier instructions
     - Edges represent memory ordering, based on architectural rules
  2. Iteratively infer additional edges to the graph, based on existing relationships
  3. Check for cycles
     - If one exists: contradiction!
- Example trace
  - CPU 0: ST[B] 90, ST[B] 92, LD[A] 2, LD[B] 92
  - CPU 1: LD[B] 92, LD[B] 93
37 Importance of Scaling
- 128K instructions per core, 512K total instructions
38 Importance of Scaling
- 256K instructions per core, 1M total instructions
Optimizing Energy Consumption and Parallel Performance for Static and Dynamic Betweenness Centrality using GPUs Adam McLaughlin, Jason Riedy, and
Optimizing Energy Consumption and Parallel Performance for Static and Dynamic Betweenness Centrality using GPUs Adam McLaughlin, Jason Riedy, and David A. Bader Motivation Real world graphs are challenging
More informationCoordinating More Than 3 Million CUDA Threads for Social Network Analysis. Adam McLaughlin
Coordinating More Than 3 Million CUDA Threads for Social Network Analysis Adam McLaughlin Applications of interest Computational biology Social network analysis Urban planning Epidemiology Hardware verification
More informationTSOtool: A Program for Verifying Memory Systems Using the Memory Consistency Model
TSOtool: A Program for Verifying Memory Systems Using the Memory Consistency Model Sudheendra Hangal, Durgam Vahia, Chaiyasit Manovit, Joseph Lu and Sridhar Narayanan tsotool@sun.com ISCA-2004 Sun Microsystems
More informationAn Energy-Efficient Abstraction for Simultaneous Breadth-First Searches. Adam McLaughlin, Jason Riedy, and David A. Bader
An Energy-Efficient Abstraction for Simultaneous Breadth-First Searches Adam McLaughlin, Jason Riedy, and David A. Bader Problem Data is unstructured, heterogeneous, and vast Serious opportunities for
More informationLecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the
More informationA New Parallel Algorithm for Connected Components in Dynamic Graphs. Robert McColl Oded Green David Bader
A New Parallel Algorithm for Connected Components in Dynamic Graphs Robert McColl Oded Green David Bader Overview The Problem Target Datasets Prior Work Parent-Neighbor Subgraph Results Conclusions Problem
More informationA POWER CHARACTERIZATION AND MANAGEMENT OF GPU GRAPH TRAVERSAL
A POWER CHARACTERIZATION AND MANAGEMENT OF GPU GRAPH TRAVERSAL ADAM MCLAUGHLIN *, INDRANI PAUL, JOSEPH GREATHOUSE, SRILATHA MANNE, AND SUDHKAHAR YALAMANCHILI * * GEORGIA INSTITUTE OF TECHNOLOGY AMD RESEARCH
More informationHardware models: inventing a usable abstraction for Power/ARM. Friday, 11 January 13
Hardware models: inventing a usable abstraction for Power/ARM 1 Hardware models: inventing a usable abstraction for Power/ARM Disclaimer: 1. ARM MM is analogous to Power MM all this is your next phone!
More informationAccelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors
Accelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors Michael Boyer, David Tarjan, Scott T. Acton, and Kevin Skadron University of Virginia IPDPS 2009 Outline Leukocyte
More informationTSO-CC: Consistency-directed Coherence for TSO. Vijay Nagarajan
TSO-CC: Consistency-directed Coherence for TSO Vijay Nagarajan 1 People Marco Elver (Edinburgh) Bharghava Rajaram (Edinburgh) Changhui Lin (Samsung) Rajiv Gupta (UCR) Susmit Sarkar (St Andrews) 2 Multicores
More informationWarped parallel nearest neighbor searches using kd-trees
Warped parallel nearest neighbor searches using kd-trees Roman Sokolov, Andrei Tchouprakov D4D Technologies Kd-trees Binary space partitioning tree Used for nearest-neighbor search, range search Application:
More informationGPU Sparse Graph Traversal
GPU Sparse Graph Traversal Duane Merrill (NVIDIA) Michael Garland (NVIDIA) Andrew Grimshaw (Univ. of Virginia) UNIVERSITY of VIRGINIA Breadth-first search (BFS) 1. Pick a source node 2. Rank every vertex
More informationScalable Multi Agent Simulation on the GPU. Avi Bleiweiss NVIDIA Corporation San Jose, 2009
Scalable Multi Agent Simulation on the GPU Avi Bleiweiss NVIDIA Corporation San Jose, 2009 Reasoning Explicit State machine, serial Implicit Compute intensive Fits SIMT well Collision avoidance Motivation
More informationAlgorithm Engineering with PRAM Algorithms
Algorithm Engineering with PRAM Algorithms Bernard M.E. Moret moret@cs.unm.edu Department of Computer Science University of New Mexico Albuquerque, NM 87131 Rome School on Alg. Eng. p.1/29 Measuring and
More informationPresenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs
Presenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs A paper comparing modern architectures Joakim Skarding Christian Chavez Motivation Continue scaling of performance
More informationAn introduction to weak memory consistency and the out-of-thin-air problem
An introduction to weak memory consistency and the out-of-thin-air problem Viktor Vafeiadis Max Planck Institute for Software Systems (MPI-SWS) CONCUR, 7 September 2017 Sequential consistency 2 Sequential
More informationcustinger - Supporting Dynamic Graph Algorithms for GPUs Oded Green & David Bader
custinger - Supporting Dynamic Graph Algorithms for GPUs Oded Green & David Bader What we will see today The first dynamic graph data structure for the GPU. Scalable in size Supports the same functionality
More informationScalable GPU Graph Traversal!
Scalable GPU Graph Traversal Duane Merrill, Michael Garland, and Andrew Grimshaw PPoPP '12 Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming Benwen Zhang
More informationAutomatic Scaling Iterative Computations. Aug. 7 th, 2012
Automatic Scaling Iterative Computations Guozhang Wang Cornell University Aug. 7 th, 2012 1 What are Non-Iterative Computations? Non-iterative computation flow Directed Acyclic Examples Batch style analytics
More informationSequential Consistency & TSO. Subtitle
Sequential Consistency & TSO Subtitle Core C1 Core C2 data = 0, 1lag SET S1: store data = NEW S2: store 1lag = SET L1: load r1 = 1lag B1: if (r1 SET) goto L1 L2: load r2 = data; Will r2 always be set to
More informationParallel Computer Architecture Spring Memory Consistency. Nikos Bellas
Parallel Computer Architecture Spring 2018 Memory Consistency Nikos Bellas Computer and Communications Engineering Department University of Thessaly Parallel Computer Architecture 1 Coherence vs Consistency
More informationRelaxed Memory-Consistency Models
Relaxed Memory-Consistency Models [ 9.1] In Lecture 13, we saw a number of relaxed memoryconsistency models. In this lecture, we will cover some of them in more detail. Why isn t sequential consistency
More informationS WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS. Jakob Progsch, Mathias Wagner GTC 2018
S8630 - WHAT THE PROFILER IS TELLING YOU: OPTIMIZING GPU KERNELS Jakob Progsch, Mathias Wagner GTC 2018 1. Know your hardware BEFORE YOU START What are the target machines, how many nodes? Machine-specific
More informationTIME TRAVELING HARDWARE AND SOFTWARE SYSTEMS. Xiangyao Yu, Srini Devadas CSAIL, MIT
TIME TRAVELING HARDWARE AND SOFTWARE SYSTEMS Xiangyao Yu, Srini Devadas CSAIL, MIT FOR FIFTY YEARS, WE HAVE RIDDEN MOORE S LAW Moore s Law and the scaling of clock frequency = printing press for the currency
More informationACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS
ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS Ferdinando Alessi Annalisa Massini Roberto Basili INGV Introduction The simulation of wave propagation
More informationMemory Consistency and Multiprocessor Performance. Adapted from UCB CS252 S01, Copyright 2001 USB
Memory Consistency and Multiprocessor Performance Adapted from UCB CS252 S01, Copyright 2001 USB 1 Memory Consistency Model Define memory correctness for parallel execution Execution appears to the that
More informationCMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago
CMSC 22200 Computer Architecture Lecture 15: Memory Consistency and Synchronization Prof. Yanjing Li University of Chicago Administrative Stuff! Lab 5 (multi-core) " Basic requirements: out later today
More informationModule 15: "Memory Consistency Models" Lecture 34: "Sequential Consistency and Relaxed Models" Memory Consistency Models. Memory consistency
Memory Consistency Models Memory consistency SC SC in MIPS R10000 Relaxed models Total store ordering PC and PSO TSO, PC, PSO Weak ordering (WO) [From Chapters 9 and 11 of Culler, Singh, Gupta] [Additional
More informationUnderstanding POWER multiprocessors
Understanding POWER multiprocessors Susmit Sarkar 1 Peter Sewell 1 Jade Alglave 2,3 Luc Maranget 3 Derek Williams 4 1 University of Cambridge 2 Oxford University 3 INRIA 4 IBM June 2011 Programming shared-memory
More informationCSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller
Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,
More informationHybrid Implementation of 3D Kirchhoff Migration
Hybrid Implementation of 3D Kirchhoff Migration Max Grossman, Mauricio Araya-Polo, Gladys Gonzalez GTC, San Jose March 19, 2013 Agenda 1. Motivation 2. The Problem at Hand 3. Solution Strategy 4. GPU Implementation
More informationCPU-GPU Heterogeneous Computing
CPU-GPU Heterogeneous Computing Advanced Seminar "Computer Engineering Winter-Term 2015/16 Steffen Lammel 1 Content Introduction Motivation Characteristics of CPUs and GPUs Heterogeneous Computing Systems
More informationDesigning Memory Consistency Models for. Shared-Memory Multiprocessors. Sarita V. Adve
Designing Memory Consistency Models for Shared-Memory Multiprocessors Sarita V. Adve Computer Sciences Department University of Wisconsin-Madison The Big Picture Assumptions Parallel processing important
More informationLecture 1: Gentle Introduction to GPUs
CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming Lecture 1: Gentle Introduction to GPUs Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Who Am I? Mohamed
More informationOn Fast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs
On Fast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs Sungpack Hong 2, Nicole C. Rodia 1, and Kunle Olukotun 1 1 Pervasive Parallelism Laboratory, Stanford University
More informationMulti Agent Navigation on GPU. Avi Bleiweiss
Multi Agent Navigation on GPU Avi Bleiweiss Reasoning Explicit Implicit Script, storytelling State machine, serial Compute intensive Fits SIMT architecture well Navigation planning Collision avoidance
More informationMemory Consistency and Multiprocessor Performance
Memory Consistency Model Memory Consistency and Multiprocessor Performance Define memory correctness for parallel execution Execution appears to the that of some correct execution of some theoretical parallel
More informationShared Memory Consistency Models: A Tutorial
Shared Memory Consistency Models: A Tutorial By Sarita Adve, Kourosh Gharachorloo WRL Research Report, 1995 Presentation: Vince Schuster Contents Overview Uniprocessor Review Sequential Consistency Relaxed
More informationRelaxed Memory Consistency
Relaxed Memory Consistency Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE3054: Multicore Systems, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)
More informationLinux multi-core scalability
Linux multi-core scalability Oct 2009 Andi Kleen Intel Corporation andi@firstfloor.org Overview Scalability theory Linux history Some common scalability trouble-spots Application workarounds Motivation
More informationPortland State University ECE 588/688. Graphics Processors
Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly
More informationRed Fox: An Execution Environment for Relational Query Processing on GPUs
Red Fox: An Execution Environment for Relational Query Processing on GPUs Haicheng Wu 1, Gregory Diamos 2, Tim Sheard 3, Molham Aref 4, Sean Baxter 2, Michael Garland 2, Sudhakar Yalamanchili 1 1. Georgia
More informationMaximizing Face Detection Performance
Maximizing Face Detection Performance Paulius Micikevicius Developer Technology Engineer, NVIDIA GTC 2015 1 Outline Very brief review of cascaded-classifiers Parallelization choices Reducing the amount
More informationCafeGPI. Single-Sided Communication for Scalable Deep Learning
CafeGPI Single-Sided Communication for Scalable Deep Learning Janis Keuper itwm.fraunhofer.de/ml Competence Center High Performance Computing Fraunhofer ITWM, Kaiserslautern, Germany Deep Neural Networks
More informationN-Body Simulation using CUDA. CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo
N-Body Simulation using CUDA CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo Project plan Develop a program to simulate gravitational
More informationAlternative GPU friendly assignment algorithms. Paul Richmond and Peter Heywood Department of Computer Science The University of Sheffield
Alternative GPU friendly assignment algorithms Paul Richmond and Peter Heywood Department of Computer Science The University of Sheffield Graphics Processing Units (GPUs) Context: GPU Performance Accelerated
More informationPortland State University ECE 588/688. Memory Consistency Models
Portland State University ECE 588/688 Memory Consistency Models Copyright by Alaa Alameldeen 2018 Memory Consistency Models Formal specification of how the memory system will appear to the programmer Places
More informationIntroduction. Coherency vs Consistency. Lec-11. Multi-Threading Concepts: Coherency, Consistency, and Synchronization
Lec-11 Multi-Threading Concepts: Coherency, Consistency, and Synchronization Coherency vs Consistency Memory coherency and consistency are major concerns in the design of shared-memory systems. Consistency
More informationHardware Support for NVM Programming
Hardware Support for NVM Programming 1 Outline Ordering Transactions Write endurance 2 Volatile Memory Ordering Write-back caching Improves performance Reorders writes to DRAM STORE A STORE B CPU CPU B
More informationUsing Relaxed Consistency Models
Using Relaxed Consistency Models CS&G discuss relaxed consistency models from two standpoints. The system specification, which tells how a consistency model works and what guarantees of ordering it provides.
More information740: Computer Architecture Memory Consistency. Prof. Onur Mutlu Carnegie Mellon University
740: Computer Architecture Memory Consistency Prof. Onur Mutlu Carnegie Mellon University Readings: Memory Consistency Required Lamport, How to Make a Multiprocessor Computer That Correctly Executes Multiprocess
More informationLecture 24: Multiprocessing Computer Architecture and Systems Programming ( )
Systems Group Department of Computer Science ETH Zürich Lecture 24: Multiprocessing Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Most of the rest of this
More informationImproving the Practicality of Transactional Memory
Improving the Practicality of Transactional Memory Woongki Baek Electrical Engineering Stanford University Programming Multiprocessors Multiprocessor systems are now everywhere From embedded to datacenter
More informationImplementation of Parallel Path Finding in a Shared Memory Architecture
Implementation of Parallel Path Finding in a Shared Memory Architecture David Cohen and Matthew Dallas Department of Computer Science Rensselaer Polytechnic Institute Troy, NY 12180 Email: {cohend4, dallam}
More informationAn Introduction to Parallel Programming
An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe
More informationFast BVH Construction on GPUs
Fast BVH Construction on GPUs Published in EUROGRAGHICS, (2009) C. Lauterbach, M. Garland, S. Sengupta, D. Luebke, D. Manocha University of North Carolina at Chapel Hill NVIDIA University of California
More informationMULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance
More informationGPU Programming. Lecture 1: Introduction. Miaoqing Huang University of Arkansas 1 / 27
1 / 27 GPU Programming Lecture 1: Introduction Miaoqing Huang University of Arkansas 2 / 27 Outline Course Introduction GPUs as Parallel Computers Trend and Design Philosophies Programming and Execution
More informationVisual Analysis of Lagrangian Particle Data from Combustion Simulations
Visual Analysis of Lagrangian Particle Data from Combustion Simulations Hongfeng Yu Sandia National Laboratories, CA Ultrascale Visualization Workshop, SC11 Nov 13 2011, Seattle, WA Joint work with Jishang
More informationMULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance
More informationX10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management
X10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management Hideyuki Shamoto, Tatsuhiro Chiba, Mikio Takeuchi Tokyo Institute of Technology IBM Research Tokyo Programming for large
More informationDistributed Operating Systems Memory Consistency