Memory access patterns. 5KK73 Cedric Nugteren
1 Memory access patterns 5KK73 Cedric Nugteren
2 Today's topics
1. The importance of memory access patterns
   a. Vectorisation and access patterns
   b. Strided accesses on GPUs
   c. Data re-use on GPUs and FPGAs
2. Classifying memory access patterns
   a. Berkeley's 7 dwarfs
   b. Algorithmic species
   c. Algorithmic skeletons
3. Algorithmic skeletons for accelerators (after the break, Mark Wijtvliet)
3 Vector-SIMD execution
Source loop:
    for (i=0; i<n; i++)
        c[i] = a[i] + b[i];
Scalar code, N iterations:
    ld   r1, addr1
    ld   r2, addr2
    add  r3, r1, r2
    st   r3, addr3
Vector code, N/4 iterations:
    ldv  vr1, addr1
    ldv  vr2, addr2
    addv vr3, vr1, vr2
    stv  vr3, addr3
SIMD processes multiple scalar operations concurrently.
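The difference between N and N/4 iterations can be made concrete in plain C. The following is a minimal sketch, not what a compiler actually emits: `add_strip_mined` and `VLEN` are hypothetical names, and the inner lane loop only stands in for the work that one `ldv`/`ldv`/`addv`/`stv` group would do in hardware.

```c
#include <assert.h>
#include <stddef.h>

#define VLEN 4  /* assumed vector width in elements (4 floats = 128 bits) */

/* Scalar reference: one element per iteration, n iterations in total. */
void add_scalar(float *c, const float *a, const float *b, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Strip-mined version: each outer iteration covers VLEN elements.
 * Returns the number of "vector" iterations that were executed. */
size_t add_strip_mined(float *c, const float *a, const float *b, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    size_t vector_iters = 0;
    size_t i = 0;
    for (; i + VLEN <= n; i += VLEN) {
        /* stands in for one SIMD load/load/add/store group */
        for (size_t lane = 0; lane < VLEN; lane++)
            c[i + lane] = a[i + lane] + b[i + lane];
        vector_iters++;
    }
    for (; i < n; i++)  /* scalar remainder when n is not a multiple of VLEN */
        c[i] = a[i] + b[i];
    return vector_iters;
}
```

For n = 10 the strip-mined loop runs two vector iterations and leaves two elements to the scalar remainder loop.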
4 Vector-SIMD execution
A single instruction being executed:
- By multiple processing engines (ALUs, PEs, cores, nodes)
- Concurrently in lockstep (no synchronization)
- On multiple data elements
Present in a wide range of architectures: SIMD, GPU, AVX, SSE, NEON, Xetal, etc.
Type of parallelism that is easy and cheap to implement:
- No coherence problem
- No lock problem
Caveat: hard to program and/or easy to lose many factors of performance
[Slides taken from P. Sadayappan]
5 How to use SIMD instructions?
Pick your favourite:
1. Vectorising compiler (ICC, latest GCCs)
2. Macros or intrinsics
3. Assembly
Source loop (input for a vectorising compiler):
    for (i=0; i<n; i++)
        c[i] = a[i] + b[i];
Intrinsics:
    __m128 ra, rb, rc;
    for (int i = 0; i < N; i+=4) {
        ra = _mm_load_ps(&a[i]);
        rb = _mm_load_ps(&b[i]);
        rc = _mm_add_ps(ra, rb);
        _mm_store_ps(&c[i], rc);
    }
Assembly:
    ..B8.5:
        movaps a(,%rdx,4), %xmm0
        addps  b(,%rdx,4), %xmm0
        movaps %xmm0, c(,%rdx,4)
        addq   $4, %rdx
        cmpq   %rdi, %rdx
        jl     ..B8.5
[Slides taken from P. Sadayappan]
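The intrinsics variant can be turned into a complete function along the following lines. This is a sketch under a few assumptions: `add_sse` is a hypothetical name, the unaligned `_mm_loadu_ps`/`_mm_storeu_ps` intrinsics are swapped in for `_mm_load_ps`/`_mm_store_ps` so that the arrays need not be 16-byte aligned, and a scalar loop handles both the remainder and builds without SSE support.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_add_ps, ... */
#endif

/* c[i] = a[i] + b[i], four floats at a time where SSE is available. */
void add_sse(float *c, const float *a, const float *b, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    size_t i = 0;
#if defined(__SSE__)
    for (; i + 4 <= n; i += 4) {
        __m128 ra = _mm_loadu_ps(&a[i]);          /* load 4 floats, any alignment */
        __m128 rb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(ra, rb)); /* 4 adds in one instruction */
    }
#endif
    for (; i < n; i++)  /* remainder, or the whole loop without SSE */
        c[i] = a[i] + b[i];
}
```

The aligned variants used on the slide require 16-byte aligned addresses; the unaligned ones trade that requirement for possibly slower accesses on some processors.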
6 What is the performance impact?
    for (i=0; i<n; i++)
        a[i] = a[i] + 1;
Properties of the example:
- Stride-1 accesses to array a
- Inner loop has independent operations (no loop-carried dependences)
- Array a resides in L1 cache (12.5 KB)
Performance on a 128-bit wide CPU, SSE speed-up over the original loop:
              char (16)   int (4)   float (4)   double (2)
    speed-up  20.1x       5.0x      3.9x        2.0x
[Slides taken from P. Sadayappan]
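The numbers in brackets behind each data type are the elements that fit in one 128-bit register, which gives a first-order estimate of the speed-up to expect. A small sketch of that arithmetic, with `simd_lanes` as a hypothetical helper name:

```c
#include <assert.h>
#include <stddef.h>

/* Number of elements of elem_bytes bytes that fit in one vector register. */
size_t simd_lanes(size_t vector_bits, size_t elem_bytes)
{
    assert(elem_bytes > 0 && vector_bits % (8 * elem_bytes) == 0);
    return vector_bits / (8 * elem_bytes);
}
```

With 128-bit registers this gives 16 lanes for 1-byte chars, 4 for 4-byte ints and floats, and 2 for 8-byte doubles. Measured speed-ups can deviate from the lane count in either direction, as the char column shows.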
7 Strided accesses (1/2)
    for (i=0; i<n; i+=16)
        a[i] = a[i] + 1;
Properties of the example:
- Stride-16 accesses to array a
- Inner loop has independent operations
- Array a resides in L1 cache
Why no performance gain?
- Operands are not contiguous in memory
- Multiple loads/stores, vector pack/unpack
- No auto-vectorisation in GCC
- ICC vectorises, but no gains
Performance on a 128-bit wide CPU, SSE speed-up over the original loop (only one absolute value, 2.7 GOPS for SSE, survives):
              char (16)   int (4)   float (4)   double (2)
    speed-up  1.0x        1.0x      1.0x        1.0x
[Slides taken from P. Sadayappan]
8 Strided accesses (2/2)
    for (i=0; i<n; i+=stride)
        a[i] = a[i] + 1;
Generalised example (still L1 resident)
[Table: performance in GOPS on a 128-bit wide CPU per stride, for char (16), int (4), float (4) and double (2)]
[Slides taken from P. Sadayappan]
9 Dependent operations
    for (i=1; i<n; i++)
        a[i] = a[i-1] + 1;
Properties of the example:
- Stride-1 accesses to array a
- Inner loop has dependent operations
- Array a resides in L1 cache
Why no performance gain?
- Iteration i depends on iteration i-1
- Inner loop cannot be parallelised
Performance on a 128-bit wide CPU, SSE speed-up over the original loop:
              char (16)   int (4)   float (4)   double (2)
    speed-up  1.0x        1.0x      1.0x        1.0x
[Slides taken from P. Sadayappan]
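To see what the loop-carried dependence means in code, the dependent loop can be compared with an algebraically equivalent form that has no dependence. This is an illustration only, using a rewrite that is not on the slide: the function names are hypothetical, and the closed form works for this particular recurrence because a[i] always equals a[0] + i.

```c
#include <assert.h>
#include <stddef.h>

/* Loop-carried dependence: iteration i needs the result of iteration i-1. */
void chain_dependent(int *a, size_t n)
{
    assert(a != NULL);
    for (size_t i = 1; i < n; i++)
        a[i] = a[i - 1] + 1;
}

/* Same result without the dependence: every iteration only reads a[0],
 * so the iterations are independent and could be vectorised. */
void chain_independent(int *a, size_t n)
{
    assert(a != NULL);
    if (n == 0)
        return;
    const int base = a[0];
    for (size_t i = 1; i < n; i++)
        a[i] = base + (int)i;
}
```

A vectorising compiler cannot be expected to find such a rewrite by itself; in general a recurrence like this keeps the loop sequential.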
10 L1 versus main memory
    for (i=0; i<10000*n; i++)
        a[i] = a[i] + 1;
Properties of the example:
- Stride-1 accesses to array a
- Inner loop has independent operations
- Array a resides in main memory (DRAM)
Why is performance limited?
- Code has become memory bandwidth bound
- Explained by the roofline model
[Table: performance in GOPS on a 128-bit wide CPU, SSE with a in L1 versus SSE with a in DRAM, for char (16), int (4), float (4) and double (2)]
[Slides taken from P. Sadayappan]
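The roofline model bounds attainable performance by the minimum of the compute peak and the product of memory bandwidth and operational intensity. A minimal sketch of that calculation follows; the function names are hypothetical, and the peak and bandwidth figures in the usage note are made-up illustration values, not measurements from the slides.

```c
#include <assert.h>

/* Roofline bound: attainable GOPS = min(peak, bandwidth * intensity). */
double roofline_gops(double peak_gops, double bandwidth_gbs, double ops_per_byte)
{
    assert(peak_gops > 0.0 && bandwidth_gbs > 0.0 && ops_per_byte > 0.0);
    double memory_bound = bandwidth_gbs * ops_per_byte;
    return memory_bound < peak_gops ? memory_bound : peak_gops;
}

/* Operational intensity of a[i] = a[i] + 1: one add per element,
 * with each element read once and written once. */
double increment_intensity(double elem_bytes)
{
    assert(elem_bytes > 0.0);
    return 1.0 / (2.0 * elem_bytes);
}
```

For floats the intensity is 1/8 = 0.125 operations per byte. On an assumed machine with a 40 GOPS peak and 10 GB/s of DRAM bandwidth the bound is 1.25 GOPS, far below the peak, which is what "memory bandwidth bound" means.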
11 Multi-core scaling
L1-resident loop:
    #pragma omp parallel for
    for (i=0; i<n; i++)
        a[i] = a[i] + 1;
DRAM-resident loop:
    #pragma omp parallel for
    for (i=0; i<10000*n; i++)
        a[i] = a[i] + 1;
Speed-up from multi-threading for the SSE code (surviving columns):
              int (4)   float (4)   double (2)
    SSE L1    3.0x      3.0x        3.0x
    SSE DRAM  1.0x      1.0x        1.0x
[Slides taken from P. Sadayappan]
12 Lessons learned from vectorisation
Vectorisation and parallelisation are important:
- Significant speed-ups can be obtained...
- ...depending on the memory access patterns!
Performance depends on the memory access pattern:
- Strided accesses
- Dependent / independent operations
- Size of data structures
Performance / implementation will differ per architecture:
- Vector width and data types
- L1 resident or not (L1 cache size, DRAM bandwidth, etc.)
Bottom line: let's take a closer look at memory access patterns
13 Strided accesses on GPUs
    __global__ void stride_copy(float* out, float* in) {
        int id = blockIdx.x*blockDim.x + threadIdx.x;
        out[id*STRIDE] = in[id*STRIDE];
    }
[Table: performance in GB/s on a Tesla C2050 for increasing strides (S=1, S=2, ..., S=15, ...)]
Why is performance deteriorating?
- Memory accesses are no longer coalesced
- Not all data in cache lines are used
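The effect of the stride on coalescing can be estimated on the host by counting how many memory lines one group of threads touches. The sketch below is a simplified model written in C rather than CUDA: the warp size of 32, the 128-byte line size, the aligned base address and the function names are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define WARP_SIZE  32   /* threads that issue a memory request together */
#define LINE_BYTES 128  /* assumed cache-line / transaction size */

/* Distinct lines touched when thread id accesses element id*stride
 * of an array that starts on a line boundary. */
size_t lines_touched(size_t stride, size_t elem_bytes)
{
    assert(stride > 0 && elem_bytes > 0);
    size_t count = 0;
    size_t prev_line = 0;
    for (size_t id = 0; id < WARP_SIZE; id++) {
        size_t line = (id * stride * elem_bytes) / LINE_BYTES;
        /* addresses only increase, so a new line never reappears later */
        if (id == 0 || line != prev_line)
            count++;
        prev_line = line;
    }
    return count;
}

/* Fraction of the fetched bytes that the warp actually uses. */
double bus_efficiency(size_t stride, size_t elem_bytes)
{
    double useful  = (double)(WARP_SIZE * elem_bytes);
    double fetched = (double)(lines_touched(stride, elem_bytes) * LINE_BYTES);
    return useful / fetched;
}
```

In this model a stride of 1 with 4-byte floats needs a single line per warp, a stride of 2 needs two lines of which half the bytes are unused, and a stride of 32 needs one line per thread.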
14 Data-reuse on GPUs
    __global__ void filter(float* out, float* in) {
        int id = blockIdx.x*blockDim.x + threadIdx.x;
        out[id] = 0.33 * (in[id-1] + in[id] + in[id+1]);
    }
Properties of the example:
- Each data element is used 3 times (data-reuse)
- Memory bandwidth is the limiting performance factor
- Use the GPU's scratchpad memory (shared) to benefit from reuse
- Newer GPUs use caches to benefit automatically
Expected performance gain: up to 2x
[Figure: data reuse between threads id and id+1, reading in[] and writing out[]]
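The benefit of staging data in a scratchpad can be illustrated on the host by counting reads from the input array. The following C sketch is a model of the idea, not GPU code: `tile` plays the role of shared memory, `TILE` is an assumed tile size, and the function names and the returned load counters are hypothetical additions for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 8  /* assumed number of outputs computed per tile */

/* Naive filter: every output reads its three inputs from in[].
 * Returns the number of reads from in[]. */
size_t filter_naive(float *out, const float *in, size_t n)
{
    assert(out != NULL && in != NULL);
    size_t loads = 0;
    for (size_t i = 1; i + 1 < n; i++) {
        out[i] = 0.33f * (in[i - 1] + in[i] + in[i + 1]);
        loads += 3;
    }
    return loads;
}

/* Tiled filter: copy TILE + 2 inputs into a local buffer once,
 * then compute TILE outputs from that buffer. */
size_t filter_tiled(float *out, const float *in, size_t n)
{
    assert(out != NULL && in != NULL);
    size_t loads = 0;
    float tile[TILE + 2];
    for (size_t start = 1; start + 1 < n; start += TILE) {
        size_t end = start + TILE;            /* exclusive end of this tile */
        if (end > n - 1)
            end = n - 1;
        size_t len = end - start;
        for (size_t k = 0; k < len + 2; k++) {  /* load tile plus halo */
            tile[k] = in[start - 1 + k];
            loads++;
        }
        for (size_t k = 0; k < len; k++)
            out[start + k] = 0.33f * (tile[k] + tile[k + 1] + tile[k + 2]);
    }
    return loads;
}
```

For n = 18 the naive version performs 48 reads from in[] and the tiled version 20, while both produce the same outputs. Larger tiles push the ratio towards the reuse factor of 3.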
15 Data-reuse on FPGAs
Implementing an erosion filter on an FPGA:
- Manually (VHDL)
- Automatically from C (HLS)
- Automatically from C using memory access pattern information
17 Classifying program code
Berkeley's 7 dwarves of computation (the list below also includes the six that were added later, 13 in total):
- Dense Linear Algebra
- Sparse Linear Algebra
- Spectral Methods
- N-Body Methods
- Structured Grids
- Unstructured Grids
- MapReduce
- Combinational Logic
- Graph Traversal
- Dynamic Programming
- Backtrack and Branch-and-Bound
- Graphical Models
- Finite State Machines
More information: "A View From Berkeley"
18 Classifying memory access patterns
Berkeley's dwarves are:
- High-level and intuitive, but...
- ...don't capture all relevant details of memory access patterns
- Not formalised nor exact: classes are based on a textual description
Can we do better?
Introducing algorithmic species: a classification of code based on memory access patterns
19 Algorithmic species examples (1/3)
    for(i=0; i<64; i++) {
        for(j=0; j<128; j++) {
            R[i][j] = 2 * M[i][j];
        }
    }
Basic forall matrix copy:
- Each i,j iteration one data element is read from M
- Each i,j iteration one data element is written to R
Species: M[0:63,0:127]|element → R[0:63,0:127]|element
20 Algorithmic species examples (2/3)
    for(i=0; i<64; i++) {
        r[i] = 0;
        for(j=0; j<128; j++) {
            r[i] += M[i][j] * v[j];
        }
    }
Matrix-vector multiplication:
- Each i iteration a row is read from M and the full vector v
- Each i iteration one element of the vector r is produced
Species: M[0:63,0:127]|chunk(-,0:127) + v[0:127]|full → r[0:63]|element
21 Algorithmic species examples (3/3)
    for(i=1; i<128-1; i++) {
        m[i] = 0.33 * (a[i-1]+a[i]+a[i+1]);
    }
Filter with data-reuse:
- Each i iteration three neighbouring elements from a are read
- Each i iteration one element of m is produced
Species: a[1:126]|neighbourhood(-1:1) → m[1:126]|element
22 Can't we capture more details?
    for(i=0; i<4; i++) {
        Q[i] = P[2*i] + P[2*i + 1];
    }
Characterisation of the accesses:
    P[2*i]:      (P,r,[0..6],1,2)
    P[2*i + 1]:  (P,r,[1..7],1,2)
    combined:    (P,r,[0..7],2,2)
    Q[i]:        (Q,w,[0..3],1,1)
Characterise based on:
- Array name
- Type (read or write)
- Domain
- Number of elements
- Step
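The five-field characterisation can be derived mechanically for accesses of the form array[coef*i + offset]. The sketch below shows one possible way to do so. The `Access` struct, the function names and the merge rule in `combine` are assumptions made for illustration; the real classification applies more conditions before two accesses may be combined.

```c
#include <assert.h>

/* One access characterisation: (name, type, [lo..hi], count, step). */
typedef struct {
    char name;    /* array name */
    char type;    /* 'r' = read, 'w' = write */
    int  lo, hi;  /* domain: lowest and highest index touched */
    int  count;   /* number of elements accessed per iteration */
    int  step;    /* index distance between consecutive iterations */
} Access;

/* Characterise array[coef*i + offset] for i = i_lo .. i_hi (inclusive). */
Access characterise(char name, char type, int coef, int offset,
                    int i_lo, int i_hi)
{
    assert(coef > 0 && i_lo <= i_hi);
    Access acc = { name, type,
                   coef * i_lo + offset, coef * i_hi + offset,
                   1, coef };
    return acc;
}

/* Simplified merge of two accesses to the same array with the same step:
 * the domain becomes the union, the element counts add up. */
Access combine(Access x, Access y)
{
    assert(x.name == y.name && x.type == y.type && x.step == y.step);
    Access acc = x;
    acc.lo = x.lo < y.lo ? x.lo : y.lo;
    acc.hi = x.hi > y.hi ? x.hi : y.hi;
    acc.count = x.count + y.count;
    return acc;
}
```

Applied to the loop on this slide, `characterise('P', 'r', 2, 0, 0, 3)` and `characterise('P', 'r', 2, 1, 0, 3)` give the two single-element reads, and `combine` merges them into (P,r,[0..7],2,2).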
23 How can we use a classification?
    __global__ void filter(float* out, float* in) {
        int id = blockIdx.x*blockDim.x + threadIdx.x;
        out[id] = 0.33 * (in[id-1] + in[id] + in[id+1]);
    }
Consider the earlier GPU filter example:
- Each data element is used 3 times (data-reuse)
- Use the GPU's scratchpad memory (shared) to benefit from reuse
What if we had an optimised, pre-implemented skeleton (template) for such neighbourhood-type computations?
[Figure: data reuse between threads id and id+1, reading in[] and writing out[]]
24 Using algorithmic skeletons
User input:
    <args>        = float* out, float* in
    <computation> = 0.33 * (in[i-1] + in[i] + in[i+1])
    <input>       = in
    <output>      = out
    <type>        = float
Simplified skeleton:
    __global__ void neighbourhood_skeleton(<args>) {
        int id = blockIdx.x*blockDim.x + threadIdx.x;
        int sid = threadIdx.x;
        // Load into local (shared) memory
        __shared__ <type> smem[512];
        smem[sid] = <input>[id];
        __syncthreads();
        // Perform the computation
        <type> res = <computation>;
        <output>[id] = res;
    }
User input + simplified skeleton = instantiated skeleton:
    __global__ void filter(float* out, float* in) {
        int id = blockIdx.x*blockDim.x + threadIdx.x;
        int sid = threadIdx.x;
        // Load into local (shared) memory
        __shared__ float smem[512];
        smem[sid] = in[id];
        __syncthreads();
        // Perform the computation
        float res = 0.33*(smem[sid-1]+smem[sid]+smem[sid+1]);
        out[id] = res;
    }
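The instantiation step can be imitated in plain C with the preprocessor. This is a loose analogy to the CUDA skeleton, not the tooling used in the lecture: the macro `NEIGHBOURHOOD_SKELETON`, the window pointer `win` and the two instantiated functions are hypothetical names, and the small `local` buffer only stands in for shared memory.

```c
#include <assert.h>
#include <stddef.h>

/* Skeleton for 3-point neighbourhood computations. It generates a function
 * NAME that evaluates COMPUTATION for every interior element. Inside
 * COMPUTATION, win[-1], win[0] and win[1] are the three neighbours. */
#define NEIGHBOURHOOD_SKELETON(NAME, TYPE, COMPUTATION)                \
    void NAME(TYPE *out, const TYPE *in, size_t n)                     \
    {                                                                  \
        assert(out != NULL && in != NULL);                             \
        for (size_t i = 1; i + 1 < n; i++) {                           \
            TYPE local[3] = { in[i - 1], in[i], in[i + 1] };           \
            const TYPE *win = &local[1];                               \
            out[i] = (COMPUTATION);                                    \
        }                                                              \
    }

/* Instantiation 1: the filter from the slides. */
NEIGHBOURHOOD_SKELETON(filter3, float, 0.33f * (win[-1] + win[0] + win[1]))

/* Instantiation 2: an integer sum over the same neighbourhood. */
NEIGHBOURHOOD_SKELETON(sum3, int, win[-1] + win[0] + win[1])
```

The user supplies only the type and the computation; the loading of the neighbourhood and the loop structure come from the skeleton, so an optimised skeleton benefits every instantiation.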
26 Further reading
Compiler vectorisation:
- Auto-vectorization of interleaved data for SIMD (paper). D. Nuzman, I. Rosen, A. Zaks, 2006
Roofline model:
- Roofline: an insightful visual performance model for multicore architectures (paper). S. Williams, A. Waterman, D. Patterson, Communications of the ACM, 2009
Memory access patterns:
- Patterns for parallel programming (book). T.G. Mattson, B.A. Sanders, B.L. Massingill, 2004
- The landscape of parallel computing research: A view from Berkeley (paper). K. Asanovic, R. Bodik, B.C. Catanzaro, et al., 2006
- Algorithmic species revisited: A program code classification based on array references (paper). C. Nugteren, R. Corvino, H. Corporaal
More informationVectorization. V. Ruggiero Roma, 19 July 2017 SuperComputing Applications and Innovation Department
Vectorization V. Ruggiero (v.ruggiero@cineca.it) Roma, 19 July 2017 SuperComputing Applications and Innovation Department Outline Topics Introduction Data Dependencies Overcoming limitations to SIMD-Vectorization
More information1/25/12. Administrative
Administrative L3: Memory Hierarchy Optimization I, Locality and Data Placement Next assignment due Friday, 5 PM Use handin program on CADE machines handin CS6235 lab1 TA: Preethi Kotari - Email:
More informationComputer Architecture
Jens Teubner Computer Architecture Summer 2017 1 Computer Architecture Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de Summer 2017 Jens Teubner Computer Architecture Summer 2017 34 Part II Graphics
More informationOverview: Graphics Processing Units
advent of GPUs GPU architecture Overview: Graphics Processing Units the NVIDIA Fermi processor the CUDA programming model simple example, threads organization, memory model case study: matrix multiply
More informationOptimizations of BLIS Library for AMD ZEN Core
Optimizations of BLIS Library for AMD ZEN Core 1 Introduction BLIS [1] is a portable software framework for instantiating high-performance BLAS-like dense linear algebra libraries [2] The framework was
More informationA Tutorial on CUDA Performance Optimizations
A Tutorial on CUDA Performance Optimizations Amit Kalele Prasad Pawar Parallelization & Optimization CoE TCS Pune 1 Outline Overview of GPU architecture Optimization Part I Block and Grid size Shared memory
More informationLecture 16 SSE vectorprocessing SIMD MultimediaExtensions
Lecture 16 SSE vectorprocessing SIMD MultimediaExtensions Improving performance with SSE We ve seen how we can apply multithreading to speed up the cardiac simulator But there is another kind of parallelism
More informationWarps and Reduction Algorithms
Warps and Reduction Algorithms 1 more on Thread Execution block partitioning into warps single-instruction, multiple-thread, and divergence 2 Parallel Reduction Algorithms computing the sum or the maximum
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationCS 61C: Great Ideas in Computer Architecture. The Flynn Taxonomy, Data Level Parallelism
CS 61C: Great Ideas in Computer Architecture The Flynn Taxonomy, Data Level Parallelism Instructor: Justin Hsia 7/11/2012 Summer 2012 Lecture #14 1 Review of Last Lecture Performance programming When possible,
More informationCS 152 Computer Architecture and Engineering. Lecture 16: Graphics Processing Units (GPUs) John Wawrzynek. EECS, University of California at Berkeley
CS 152 Computer Architecture and Engineering Lecture 16: Graphics Processing Units (GPUs) John Wawrzynek EECS, University of California at Berkeley http://inst.eecs.berkeley.edu/~cs152 Administrivia Lab
More informationParallelism III. MPI, Vectorization, OpenACC, OpenCL. John Cavazos,Tristan Vanderbruggen, and Will Killian
Parallelism III MPI, Vectorization, OpenACC, OpenCL John Cavazos,Tristan Vanderbruggen, and Will Killian Dept of Computer & Information Sciences University of Delaware 1 Lecture Overview Introduction MPI
More informationScientific Computing on GPUs: GPU Architecture Overview
Scientific Computing on GPUs: GPU Architecture Overview Dominik Göddeke, Jakub Kurzak, Jan-Philipp Weiß, André Heidekrüger and Tim Schröder PPAM 2011 Tutorial Toruń, Poland, September 11 http://gpgpu.org/ppam11
More informationTesla Architecture, CUDA and Optimization Strategies
Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization
More informationScheduling Image Processing Pipelines
Lecture 14: Scheduling Image Processing Pipelines Visual Computing Systems Simple image processing kernel int WIDTH = 1024; int HEIGHT = 1024; float input[width * HEIGHT]; float output[width * HEIGHT];
More informationMartin Kruliš, v
Martin Kruliš 1 Optimizations in General Code And Compilation Memory Considerations Parallelism Profiling And Optimization Examples 2 Premature optimization is the root of all evil. -- D. Knuth Our goal
More information