Finite Element Integration and Assembly on Modern Multi and Many-core Processors
|
|
- Myles Mills
- 5 years ago
- Views:
Transcription
1 Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, Kraków, Poland Filip Krużel Cracow University of Technology, Warszawska 24, Kraków, Poland ParEng 2017 K. Banaś et al. Finite Elements on Many-core Processors 1/34
2 Outline Motivation - processor architectures 1 Motivation - processor architectures 2 3 ParEng 2017 K. Banaś et al. Finite Elements on Many-core Processors 2/34
3 Motivation Motivation - processor architectures Current computer architectures: clusters with homogeneous or heterogeneous nodes multicore processors with vector capabilities manycore accelerators (including GPUs) ParEng 2017 K. Banaś et al. Finite Elements on Many-core Processors 3/34
4 Motivation Motivation - processor architectures The principal question: what is the best processor architecture for running FEM codes? are the new architectures (GPUs, MIC) worth investments? CPU core GPU core ParEng 2017 K. Banaś et al. Finite Elements on Many-core Processors 4/34
5 Motivation Motivation - processor architectures Disadvantages of new architectures: more complex than traditional programming model different optimization strategies price ParEng 2017 K. Banaś et al. Finite Elements on Many-core Processors 5/34
6 Motivation Motivation - processor architectures GPU (and MIC) advantages: higher floating point performance higher memory bandwidth ParEng 2017 K. Banaś et al. Finite Elements on Many-core Processors 6/34
7 Architecture comparison Architecture Kepler / Pascal Xeon Phi Xeon (E5) Processor GK110 / GP P / / 2699v3 Year of introduction 2013 / / / 2014 Number of multiprocessors/cores 13 / (59) / 64 6 / 18 Number of SP SIMD lanes 2496(x2) / 3584(x2) 960(x2) / 2048(x2) 48 / 288(x2) Number of DP SIMD lanes 832(x2) / 1792(x2) 480(x2) / 1024(x2) 24 / 144(x2) Fast global memory size [GB] 4.8 / 12 (16) 8 /16 (384) 384 / 768 LLC memory size [MB] 1.5 / 4.0 [L2] 30 / 32 [L2] 15 / 45 [L3] Frequency [GHz] 0.7 / / / 2.3 Performance characteristics Peak SP performance [TFlops] 3.52 / / / 1.3 Peak DP performance [TFlops] 1.17 / / / 0.66 Benchmark (DGEMM) performance 1.10 /??? 0.84 / / 0.48 Peak memory bandwidth [GB/s] 208 / 549 (732) 320 / > / 68 Benchmark (STREAM) bandwidth 144 /??? 171 / 480 (85) 33 / 58 Machine balance [DP flops/access] 45 / 58 (44) 25 / <43 18 / 77 Benchmark machine balance 61 /??? 39 / / 66 ParEng 2017 K. Banaś et al. Finite Elements on Many-core Processors 7/34
8 Architecture comparison Architecture Kepler / Pascal Xeon Phi Xeon (E5) Processor GK110 / GP P / / 2699v3 Multiprocessor/core characteristics Number of SP SIMD lanes 192 (x2) / 64 (x2) 16 (x2) / 32 (x2) 8 / 16 (x2) Number of DP SIMD lanes 64 (x2) / 32 (x2) 8 (x2) / 16 (x2) 4 / 8 (x2) Shared memory / L2 size [KB] 16 or 48 / / / 256 L1 cache memory size [KB] 48 or 16 / 24(?) 32 / / 32 Number of 32 bit registers / >2048 / > / 1680 Resources per single SP SIMD lane (+latency hiding?) Number of SP registers 341 / (x4) / 32 (x4) 16 (x2) / 16 (x2) Number of SP entries in SM/L1 64 / / / 512 Number of SP entries in L2 cache 131 / / / 4096 Resources per single DP SIMD lane (+latency hiding?) Number of DP registers 512 / (x4) / 32 (x4) 16 (x2) / 16 (x2) Number of DP entries in SM/L1 96 / / / 512 Number of DP entries in L2 196 / / / 4096 ParEng 2017 K. Banaś et al. Finite Elements on Many-core Processors 8/34
9 Two phases: creation of the system of linear equations (integration and assembly) linear system solution (direct or iterative) ParEng 2017 K. Banaś et al. Finite Elements on Many-core Processors 9/34
10 The creation of FEM systems of linear equations Finite element integration and assembly ParEng 2017 K. Banaś et al. Finite Elements on Many-core Processors 10/34
11 Finite element integration and assembly 1 for e = 1 to N E do 2 - read input data and initialize output arrays A e and b e 3 for i Q = 1 to N Q do 4 - compute auxiliary terms (vol) and arrays (φ, c and d ) 5 for i S = 1 to N S do 6 for j S = 1 to N S do 7 update A e [i S ][j S ] using vol[i Q ], c[i Q ], φ[i S ][i Q ], φ[j S ][i Q ] 8 if (i S == j S ) then 9 update b e [i S ] using vol[i Q ] d[i D ][i Q ], φ[i S ][i Q ] 10 end if 11 end for 12 end for 13 end for 14 - assemble A e and b e into the global arrays 15 end for ParEng 2017 K. Banaś et al. Finite Elements on Many-core Processors 11/34
12 Computational complexity: the dependence on the order of approximation an example for discontinuous Galerkin approximation Execution time [s] Degree of approximation elements integration faces integration preconditioner set-up iterations total time elements integration faces integration preconditioner set-up iterations Number of operations [Mflop] ParEng 2017 K. Banaś et al. Finite Elements on Many-core Processors 12/34
13 Finite element integration and assembly Size of input, output and auxiliary arrays for numerical integration The size of arrays Degree of approximation p ξ Q, w Q φ max c, d ξ Total x, det( x ) ξ Total φ Total max c, d A e, b e ParEng 2017 K. Banaś et al. Finite Elements on Many-core Processors 13/34
14 Finite element integration and assembly Computational and memory complexity: the dependence on the order of approximation an example for discontinuous Galerkin approximation ParEng 2017 K. Banaś et al. Finite Elements on Many-core Processors 14/34
15 Offloading model Motivation - processor architectures OpenCL devices with PCIe APU with unified memory ParEng 2017 K. Banaś et al. Finite Elements on Many-core Processors 15/34
16 Available resources for application programmers: Processing environment: multi-threading for multi-core standard - Xeon massive (also for latency hiding) - GPUs, Xeon Phi vectorization wide SIMD units - Xeon (8, 16) wider SIMD units - GPUs (32), Xeon Phi (16, 32) different communication and synchronization mechanisms ParEng 2017 K. Banaś et al. Finite Elements on Many-core Processors 16/34
17 Available resources for application programmers: Memory hierarchy: large, latency optimized DRAM memory below 150 GB/s (for 2-socket configurations) fast global memory (MCDRAM, HBM) - GPU and MIC above 400 GB/s for single processor shared memory, caches registers smaller, explicitly managed - GPUs with CUDA, OpenCL larger, implicitly managed - Xeon, Xeon Phi large number, explicitly managed - GPUs with CUDA, OpenCL implicitly managed - Xeon, Xeon Phi ParEng 2017 K. Banaś et al. Finite Elements on Many-core Processors 17/34
18 Programming models: Standard - OpenMP: parallel loops implicit management of data placement GPU oriented - CUDA, OpenCL (can be used as well for x86) explicit and implicit thread organization (warps, threadblocks, grid) memory hierarchy (registers, shared, global) ParEng 2017 K. Banaś et al. Finite Elements on Many-core Processors 18/34
19 Parallelization strategies One element one thread loop over elements parallelized small resource requirements for low orders of approximation possible explicit placement of some arrays in the shared memory to speed up calculations for CUDA and OpenCL large resource requirements for higher orders of approximation can be handled by flexible memory hierarchy of Xeons prevent OpenCL kernels from executing on GPUs ParEng 2017 K. Banaś et al. Finite Elements on Many-core Processors 19/34
20 Parallelization strategies One element several threads loop over elements parallelized additionally loops over the entries of output arrays parallelized in domain decomposition manner, with no dependencies the number of threads usually up to the size of threadblocks for CUDA and OpenCL sufficient shared memory resources for GPUs for high orders even several threadblocks can operate on a single element (or additional loop over parts of the output arrays is introduced) serial fraction associated with some auxiliary calculations at interation points ParEng 2017 K. Banaś et al. Finite Elements on Many-core Processors 20/34
21 Parallelization strategies One element two kernels strategy kernel 1: auxiliary terms calculations loop over elements parallelized loop over integration points parallelized auxiliary terms calculated and stored in global memory kernel 2: actual calculations of element arrays loop over elements parallelized additionally loops over the entries of output arrays parallelized in the same way as for one element several threads strategy no serial fraction ParEng 2017 K. Banaś et al. Finite Elements on Many-core Processors 21/34
22 Computational example: elasticity problem Higher order approximation numerical integration for several degrees p (from 2 to 7) parallelization for single element based on data decomposition for output arrays parallelization of double loop over shape functions different options for placement of auxiliary arrays Linux, C, OpenCL several processor architectures (with the same, portable OpenCL kernels): Intel Xeon E AMD Radeon HD5870 (Cypress) and HD7950 (Tahiti PRO) Nvidia GeForce GTX580 (Fermi) and Tesla M2075 (Fermi) ParEng 2017 K. Banaś et al. Finite Elements on Many-core Processors 22/34
23 Performance results for elasticity problem and p=3 ParEng 2017 K. Banaś et al. Finite Elements on Many-core Processors 23/34
24 Performance results for elasticity problem and p=3 ParEng 2017 K. Banaś et al. Finite Elements on Many-core Processors 24/34
25 Performance results for elasticity problem and p=5 ParEng 2017 K. Banaś et al. Finite Elements on Many-core Processors 25/34
26 Performance results for elasticity problem and p=5 ParEng 2017 K. Banaś et al. Finite Elements on Many-core Processors 26/34
27 Computational example: convection-diffusion problem First order approximation simple Poisson (Laplace) problem and more computationally intensive conv-diff problem prismatic and tetrahedral elements one element one thread parallelization strategy different options for placement of auxiliary arrays Linux, C, OpenCL several processor architectures (with the same, portable OpenCL kernels): Intel Xeon E (in dual socket configuration) Intel Xeon Phi 5110P (Knights Corner) Nvidia Tesla K20M (Kepler) ParEng 2017 K. Banaś et al. Finite Elements on Many-core Processors 27/34
28 Performance results for convection diffusion problem Execution time [ns] Processor: Tesla K20m Xeon Phi 5110P Xeon E Poisson - Tetra Poisson - Prism Conv-Diff - Tetra Conv-Diff - Prism ParEng 2017 K. Banaś et al. Finite Elements on Many-core Processors 28/34
29 Performance results for convection diffusion problem Performance [GB/s] Poisson - Tetra Processor: Tesla K20m Xeon Phi 5110P Xeon E Conv-Diff - Tetra Performance [GFLOPS] Processor: Tesla K20m Xeon Phi 5110P Xeon E Poisson - Prism Conv-Diff - Prism ParEng 2017 K. Banaś et al. Finite Elements on Many-core Processors 29/34
30 Performance results for convection diffusion problem Processor: Tesla K20m Xeon Phi 5110P Xeon E Performance as percentage of the benchmark maximum Poisson - Tetra Poisson - Prism Conv-Diff - Tetra Conv-Diff - Prism ParEng 2017 K. Banaś et al. Finite Elements on Many-core Processors 30/34
31 Computational example: convection-diffusion problem Code optimizations for GPUs classical optimizations (automatic) loop invariant code motion, common subexpression elimination, loop unrolling, induction variable simplification, etc. compiler directives based parameter tuning placing different arrays in different levels of memory hierarchy large number of options automatic testing of the parameter space ParEng 2017 K. Banaś et al. Finite Elements on Many-core Processors 31/34
32 Parameter based performance tuning ParEng 2017 K. Banaś et al. Finite Elements on Many-core Processors 32/34
33 Parameter based performance tuning ParEng 2017 K. Banaś et al. Finite Elements on Many-core Processors 33/34
34 Thank you. ParEng 2017 K. Banaś et al. Finite Elements on Many-core Processors 34/34
On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators
On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators Karl Rupp, Barry Smith rupp@mcs.anl.gov Mathematics and Computer Science Division Argonne National Laboratory FEMTEC
More informationReal-Time Rendering Architectures
Real-Time Rendering Architectures Mike Houston, AMD Part 1: throughput processing Three key concepts behind how modern GPU processing cores run code Knowing these concepts will help you: 1. Understand
More informationA TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE
A TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE Abu Asaduzzaman, Deepthi Gummadi, and Chok M. Yip Department of Electrical Engineering and Computer Science Wichita State University Wichita, Kansas, USA
More informationCS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST
CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8 Markus Hadwiger, KAUST Reading Assignment #5 (until March 12) Read (required): Programming Massively Parallel Processors book, Chapter
More informationGPU Architecture. Alan Gray EPCC The University of Edinburgh
GPU Architecture Alan Gray EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? Architectural reasons for accelerator performance advantages Latest GPU Products From
More informationn N c CIni.o ewsrg.au
@NCInews NCI and Raijin National Computational Infrastructure 2 Our Partners General purpose, highly parallel processors High FLOPs/watt and FLOPs/$ Unit of execution Kernel Separate memory subsystem GPGPU
More informationarxiv: v1 [physics.comp-ph] 4 Nov 2013
arxiv:1311.0590v1 [physics.comp-ph] 4 Nov 2013 Performance of Kepler GTX Titan GPUs and Xeon Phi System, Weonjong Lee, and Jeonghwan Pak Lattice Gauge Theory Research Center, CTP, and FPRD, Department
More informationCME 213 S PRING Eric Darve
CME 213 S PRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and
More informationIntroduction to GPU hardware and to CUDA
Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 35 Course outline Introduction to GPU hardware
More informationAccelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include
3.1 Overview Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include GPUs (Graphics Processing Units) AMD/ATI
More informationFrom Shader Code to a Teraflop: How GPU Shader Cores Work. Jonathan Ragan- Kelley (Slides by Kayvon Fatahalian)
From Shader Code to a Teraflop: How GPU Shader Cores Work Jonathan Ragan- Kelley (Slides by Kayvon Fatahalian) 1 This talk Three major ideas that make GPU processing cores run fast Closer look at real
More informationCSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University
CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand
More informationMathematical computations with GPUs
Master Educational Program Information technology in applications Mathematical computations with GPUs GPU architecture Alexey A. Romanenko arom@ccfit.nsu.ru Novosibirsk State University GPU Graphical Processing
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationNVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield
NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host
More informationNVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU
NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU GPGPU opens the door for co-design HPC, moreover middleware-support embedded system designs to harness the power of GPUaccelerated
More informationWhat is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms
CS 590: High Performance Computing GPU Architectures and CUDA Concepts/Terms Fengguang Song Department of Computer & Information Science IUPUI What is GPU? Conventional GPUs are used to generate 2D, 3D
More informationParallel Computing. November 20, W.Homberg
Mitglied der Helmholtz-Gemeinschaft Parallel Computing November 20, 2017 W.Homberg Why go parallel? Problem too large for single node Job requires more memory Shorter time to solution essential Better
More informationAntonio R. Miele Marco D. Santambrogio
Advanced Topics on Heterogeneous System Architectures GPU Politecnico di Milano Seminar Room A. Alario 18 November, 2015 Antonio R. Miele Marco D. Santambrogio Politecnico di Milano 2 Introduction First
More informationGeneral Purpose GPU Computing in Partial Wave Analysis
JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data
More informationEarly Experiences Writing Performance Portable OpenMP 4 Codes
Early Experiences Writing Performance Portable OpenMP 4 Codes Verónica G. Vergara Larrea Wayne Joubert M. Graham Lopez Oscar Hernandez Oak Ridge National Laboratory Problem statement APU FPGA neuromorphic
More informationGpuWrapper: A Portable API for Heterogeneous Programming at CGG
GpuWrapper: A Portable API for Heterogeneous Programming at CGG Victor Arslan, Jean-Yves Blanc, Gina Sitaraman, Marc Tchiboukdjian, Guillaume Thomas-Collignon March 2 nd, 2016 GpuWrapper: Objectives &
More informationHybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS
+ Hybrid Computing @ KAUST Many Cores and OpenACC Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS + Agenda Hybrid Computing n Hybrid Computing n From Multi-Physics
More informationIntroduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes
Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel
More informationINTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER. Adrian
INTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER Adrian Jackson adrianj@epcc.ed.ac.uk @adrianjhpc Processors The power used by a CPU core is proportional to Clock Frequency x Voltage 2 In the past, computers
More informationIntroduction to GPGPU and GPU-architectures
Introduction to GPGPU and GPU-architectures Henk Corporaal Gert-Jan van den Braak http://www.es.ele.tue.nl/ Contents 1. What is a GPU 2. Programming a GPU 3. GPU thread scheduling 4. GPU performance bottlenecks
More informationIN FINITE element simulations usually two computation
Proceedings of the International Multiconference on Computer Science and Information Technology pp. 337 342 Higher order FEM numerical integration on GPUs with OpenCL 1 Przemysław Płaszewski, 12 Krzysztof
More informationOpenACC. Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer
OpenACC Introduction and Evolutions Sebastien Deldon, GPU Compiler engineer 3 WAYS TO ACCELERATE APPLICATIONS Applications Libraries Compiler Directives Programming Languages Easy to use Most Performance
More informationSupport Tools for Porting Legacy Applications to Multicore. Natsuki Kawai, Yuri Ardila, Takashi Nakamura, Yosuke Tamura
Support Tools for Porting Legacy Applications to Multicore Natsuki Kawai, Yuri Ardila, Takashi Nakamura, Yosuke Tamura Agenda Introduction PEMAP: Performance Estimator for MAny core Processors The overview
More informationTHE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT HARDWARE PLATFORMS
Computer Science 14 (4) 2013 http://dx.doi.org/10.7494/csci.2013.14.4.679 Dominik Żurek Marcin Pietroń Maciej Wielgosz Kazimierz Wiatr THE COMPARISON OF PARALLEL SORTING ALGORITHMS IMPLEMENTED ON DIFFERENT
More informationHigh Performance Computing with Accelerators
High Performance Computing with Accelerators Volodymyr Kindratenko Innovative Systems Laboratory @ NCSA Institute for Advanced Computing Applications and Technologies (IACAT) National Center for Supercomputing
More informationIntroduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620
Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved
More informationParallel Accelerators
Parallel Accelerators Přemysl Šůcha ``Parallel algorithms'', 2017/2018 CTU/FEL 1 Topic Overview Graphical Processing Units (GPU) and CUDA Vector addition on CUDA Intel Xeon Phi Matrix equations on Xeon
More informationCSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller
Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,
More informationParticle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA
Particle-in-Cell Simulations on Modern Computing Platforms Viktor K. Decyk and Tajendra V. Singh UCLA Outline of Presentation Abstraction of future computer hardware PIC on GPUs OpenCL and Cuda Fortran
More informationData Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions
Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing
More informationLecture 8: GPU Programming. CSE599G1: Spring 2017
Lecture 8: GPU Programming CSE599G1: Spring 2017 Announcements Project proposal due on Thursday (4/28) 5pm. Assignment 2 will be out today, due in two weeks. Implement GPU kernels and use cublas library
More informationFrom Shader Code to a Teraflop: How Shader Cores Work
From Shader Code to a Teraflop: How Shader Cores Work Kayvon Fatahalian Stanford University This talk 1. Three major ideas that make GPU processing cores run fast 2. Closer look at real GPU designs NVIDIA
More informationGPU ARCHITECTURE Chris Schultz, June 2017
GPU ARCHITECTURE Chris Schultz, June 2017 MISC All of the opinions expressed in this presentation are my own and do not reflect any held by NVIDIA 2 OUTLINE CPU versus GPU Why are they different? CUDA
More informationINTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER. Adrian
INTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER Adrian Jackson a.jackson@epcc.ed.ac.uk @adrianjhpc Processors The power used by a CPU core is proportional to Clock Frequency x Voltage 2 In the past,
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors
ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors Weifeng Liu and Brian Vinter Niels Bohr Institute University of Copenhagen Denmark {weifeng, vinter}@nbi.dk March 1, 2014 Weifeng
More informationOpenACC programming for GPGPUs: Rotor wake simulation
DLR.de Chart 1 OpenACC programming for GPGPUs: Rotor wake simulation Melven Röhrig-Zöllner, Achim Basermann Simulations- und Softwaretechnik DLR.de Chart 2 Outline Hardware-Architecture (CPU+GPU) GPU computing
More informationHiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes.
HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes Ian Glendinning Outline NVIDIA GPU cards CUDA & OpenCL Parallel Implementation
More informationCS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS
CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in
More informationIntel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins
Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Outline History & Motivation Architecture Core architecture Network Topology Memory hierarchy Brief comparison to GPU & Tilera Programming Applications
More informationScientific Computing on GPUs: GPU Architecture Overview
Scientific Computing on GPUs: GPU Architecture Overview Dominik Göddeke, Jakub Kurzak, Jan-Philipp Weiß, André Heidekrüger and Tim Schröder PPAM 2011 Tutorial Toruń, Poland, September 11 http://gpgpu.org/ppam11
More informationIntel Xeon Phi Coprocessors
Intel Xeon Phi Coprocessors Reference: Parallel Programming and Optimization with Intel Xeon Phi Coprocessors, by A. Vladimirov and V. Karpusenko, 2013 Ring Bus on Intel Xeon Phi Example with 8 cores Xeon
More informationHigh-Order Finite-Element Earthquake Modeling on very Large Clusters of CPUs or GPUs
High-Order Finite-Element Earthquake Modeling on very Large Clusters of CPUs or GPUs Gordon Erlebacher Department of Scientific Computing Sept. 28, 2012 with Dimitri Komatitsch (Pau,France) David Michea
More informationCUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav
CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture
More informationChapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved.
Chapter 6 Parallel Processors from Client to Cloud FIGURE 6.1 Hardware/software categorization and examples of application perspective on concurrency versus hardware perspective on parallelism. 2 FIGURE
More informationCUDA and GPU Performance Tuning Fundamentals: A hands-on introduction. Francesco Rossi University of Bologna and INFN
CUDA and GPU Performance Tuning Fundamentals: A hands-on introduction Francesco Rossi University of Bologna and INFN * Using this terminology since you ve already heard of SIMD and SPMD at this school
More informationCOSC 6385 Computer Architecture - Data Level Parallelism (III) The Intel Larrabee, Intel Xeon Phi and IBM Cell processors
COSC 6385 Computer Architecture - Data Level Parallelism (III) The Intel Larrabee, Intel Xeon Phi and IBM Cell processors Edgar Gabriel Fall 2018 References Intel Larrabee: [1] L. Seiler, D. Carmean, E.
More informationGPU Performance Optimisation. Alan Gray EPCC The University of Edinburgh
GPU Performance Optimisation EPCC The University of Edinburgh Hardware NVIDIA accelerated system: Memory Memory GPU vs CPU: Theoretical Peak capabilities NVIDIA Fermi AMD Magny-Cours (6172) Cores 448 (1.15GHz)
More informationNumerical Algorithms on Multi-GPU Architectures
Numerical Algorithms on Multi-GPU Architectures Dr.-Ing. Harald Köstler 2 nd International Workshops on Advances in Computational Mechanics Yokohama, Japan 30.3.2010 2 3 Contents Motivation: Applications
More informationDirected Optimization On Stencil-based Computational Fluid Dynamics Application(s)
Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s) Islam Harb 08/21/2015 Agenda Motivation Research Challenges Contributions & Approach Results Conclusion Future Work 2
More informationAccelerating the Implicit Integration of Stiff Chemical Systems with Emerging Multi-core Technologies
Accelerating the Implicit Integration of Stiff Chemical Systems with Emerging Multi-core Technologies John C. Linford John Michalakes Manish Vachharajani Adrian Sandu IMAGe TOY 2009 Workshop 2 Virginia
More informationFlux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters
Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences,
More informationComputing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany
Computing on GPUs Prof. Dr. Uli Göhner DYNAmore GmbH Stuttgart, Germany Summary: The increasing power of GPUs has led to the intent to transfer computing load from CPUs to GPUs. A first example has been
More informationMAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures
MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Stan Tomov Innovative Computing Laboratory University of Tennessee, Knoxville OLCF Seminar Series, ORNL June 16, 2010
More informationTesla Architecture, CUDA and Optimization Strategies
Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization
More informationWhen MPPDB Meets GPU:
When MPPDB Meets GPU: An Extendible Framework for Acceleration Laura Chen, Le Cai, Yongyan Wang Background: Heterogeneous Computing Hardware Trend stops growing with Moore s Law Fast development of GPU
More informationNVIDIA s Compute Unified Device Architecture (CUDA)
NVIDIA s Compute Unified Device Architecture (CUDA) Mike Bailey mjb@cs.oregonstate.edu Reaching the Promised Land NVIDIA GPUs CUDA Knights Corner Speed Intel CPUs General Programmability 1 History of GPU
More informationNVIDIA s Compute Unified Device Architecture (CUDA)
NVIDIA s Compute Unified Device Architecture (CUDA) Mike Bailey mjb@cs.oregonstate.edu Reaching the Promised Land NVIDIA GPUs CUDA Knights Corner Speed Intel CPUs General Programmability History of GPU
More informationA GPU based brute force de-dispersion algorithm for LOFAR
A GPU based brute force de-dispersion algorithm for LOFAR W. Armour, M. Giles, A. Karastergiou and C. Williams. University of Oxford. 8 th May 2012 1 GPUs Why use GPUs? Latest Kepler/Fermi based cards
More informationPortland State University ECE 588/688. Graphics Processors
Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly
More informationGPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC
GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of
More informationCUDA Experiences: Over-Optimization and Future HPC
CUDA Experiences: Over-Optimization and Future HPC Carl Pearson 1, Simon Garcia De Gonzalo 2 Ph.D. candidates, Electrical and Computer Engineering 1 / Computer Science 2, University of Illinois Urbana-Champaign
More informationCUDA Programming Model
CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming
More informationGPU ARCHITECTURE Chris Schultz, June 2017
Chris Schultz, June 2017 MISC All of the opinions expressed in this presentation are my own and do not reflect any held by NVIDIA 2 OUTLINE Problems Solved Over Time versus Why are they different? Complex
More informationDIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka
USE OF FOR Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka Faculty of Nuclear Sciences and Physical Engineering Czech Technical University in Prague Mini workshop on advanced numerical methods
More informationProfiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency
Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu
More informationAccelerating Financial Applications on the GPU
Accelerating Financial Applications on the GPU Scott Grauer-Gray Robert Searles William Killian John Cavazos Department of Computer and Information Science University of Delaware Sixth Workshop on General
More informationEfficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs
Efficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs Markus Geveler, Dirk Ribbrock, Dominik Göddeke, Peter Zajac, Stefan Turek Institut für Angewandte Mathematik TU Dortmund,
More informationParallel and Distributed Programming Introduction. Kenjiro Taura
Parallel and Distributed Programming Introduction Kenjiro Taura 1 / 21 Contents 1 Why Parallel Programming? 2 What Parallel Machines Look Like, and Where Performance Come From? 3 How to Program Parallel
More informationIntroduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model
Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA Part 1: Hardware design and programming model Dirk Ribbrock Faculty of Mathematics, TU dortmund 2016 Table of Contents Why parallel
More informationCS 179 Lecture 4. GPU Compute Architecture
CS 179 Lecture 4 GPU Compute Architecture 1 This is my first lecture ever Tell me if I m not speaking loud enough, going too fast/slow, etc. Also feel free to give me lecture feedback over email or at
More informationTHE USE of graphics processing units (GPUs) in scientific
Proceedings of the International Multiconference on Computer Science and Information Technology pp. 285 291 ISBN 978-83-60810-14-9 ISSN 1896-7094 Testing Tesla Architecture for Scientific Computing: the
More informationGPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)
GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization
More informationHigh Performance Computing on GPUs using NVIDIA CUDA
High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and
More informationAuto-Generation and Auto-Tuning of 3D Stencil Codes on GPU Clusters
Auto-Generation and Auto-Tuning of 3D Stencil s on GPU Clusters Yongpeng Zhang, Frank Mueller North Carolina State University CGO 2012 Outline Motivation DSL front-end and Benchmarks Framework Experimental
More informationLecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1
Lecture 15: Introduction to GPU programming Lecture 15: Introduction to GPU programming p. 1 Overview Hardware features of GPGPU Principles of GPU programming A good reference: David B. Kirk and Wen-mei
More informationLarge scale Imaging on Current Many- Core Platforms
Large scale Imaging on Current Many- Core Platforms SIAM Conf. on Imaging Science 2012 May 20, 2012 Dr. Harald Köstler Chair for System Simulation Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen,
More informationHigh performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli
High performance 2D Discrete Fourier Transform on Heterogeneous Platforms Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli Motivation Fourier Transform widely used in Physics, Astronomy, Engineering
More informationAutomated Finite Element Computations in the FEniCS Framework using GPUs
Automated Finite Element Computations in the FEniCS Framework using GPUs Florian Rathgeber (f.rathgeber10@imperial.ac.uk) Advanced Modelling and Computation Group (AMCG) Department of Earth Science & Engineering
More informationGPU COMPUTING AND THE FUTURE OF HPC. Timothy Lanfear, NVIDIA
GPU COMPUTING AND THE FUTURE OF HPC Timothy Lanfear, NVIDIA ~1 W ~3 W ~100 W ~30 W 1 kw 100 kw 20 MW Power-constrained Computers 2 EXASCALE COMPUTING WILL ENABLE TRANSFORMATIONAL SCIENCE RESULTS First-principles
More informationIntroduction to Parallel Computing with CUDA. Oswald Haan
Introduction to Parallel Computing with CUDA Oswald Haan ohaan@gwdg.de Schedule Introduction to Parallel Computing with CUDA Using CUDA CUDA Application Examples Using Multiple GPUs CUDA Application Libraries
More informationEvaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices
Evaluation of Asynchronous Offloading Capabilities of Accelerator Programming Models for Multiple Devices Jonas Hahnfeld 1, Christian Terboven 1, James Price 2, Hans Joachim Pflug 1, Matthias S. Müller
More informationCUDA Accelerated Linpack on Clusters. E. Phillips, NVIDIA Corporation
CUDA Accelerated Linpack on Clusters E. Phillips, NVIDIA Corporation Outline Linpack benchmark CUDA Acceleration Strategy Fermi DGEMM Optimization / Performance Linpack Results Conclusions LINPACK Benchmark
More informationPerformance of deal.ii on a node
Performance of deal.ii on a node Bruno Turcksin Texas A&M University, Dept. of Mathematics Bruno Turcksin Deal.II on a node 1/37 Outline 1 Introduction 2 Architecture 3 Paralution 4 Other Libraries 5 Conclusions
More informationExperts in Application Acceleration Synective Labs AB
Experts in Application Acceleration 1 2009 Synective Labs AB Magnus Peterson Synective Labs Synective Labs quick facts Expert company within software acceleration Based in Sweden with offices in Gothenburg
More informationGPUfs: Integrating a file system with GPUs
GPUfs: Integrating a file system with GPUs Mark Silberstein (UT Austin/Technion) Bryan Ford (Yale), Idit Keidar (Technion) Emmett Witchel (UT Austin) 1 Building systems with GPUs is hard. Why? 2 Goal of
More informationCPU-GPU Heterogeneous Computing
CPU-GPU Heterogeneous Computing Advanced Seminar "Computer Engineering Winter-Term 2015/16 Steffen Lammel 1 Content Introduction Motivation Characteristics of CPUs and GPUs Heterogeneous Computing Systems
More informationTesla GPU Computing A Revolution in High Performance Computing
Tesla GPU Computing A Revolution in High Performance Computing Mark Harris, NVIDIA Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory
More informationResources Current and Future Systems. Timothy H. Kaiser, Ph.D.
Resources Current and Future Systems Timothy H. Kaiser, Ph.D. tkaiser@mines.edu 1 Most likely talk to be out of date History of Top 500 Issues with building bigger machines Current and near future academic
More informationHPC with Multicore and GPUs
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville COSC 594 Lecture Notes March 22, 2017 1/20 Outline Introduction - Hardware
More informationPorting a parallel rotor wake simulation to GPGPU accelerators using OpenACC
DLR.de Chart 1 Porting a parallel rotor wake simulation to GPGPU accelerators using OpenACC Melven Röhrig-Zöllner DLR, Simulations- und Softwaretechnik DLR.de Chart 2 Outline Hardware-Architecture (CPU+GPU)
More informationThe Era of Heterogeneous Computing
The Era of Heterogeneous Computing EU-US Summer School on High Performance Computing New York, NY, USA June 28, 2013 Lars Koesterke: Research Staff @ TACC Nomenclature Architecture Model -------------------------------------------------------
More informationThe Art of Parallel Processing
The Art of Parallel Processing Ahmad Siavashi April 2017 The Software Crisis As long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a
More informationGPUfs: Integrating a file system with GPUs
GPUfs: Integrating a file system with GPUs Mark Silberstein (UT Austin/Technion) Bryan Ford (Yale), Idit Keidar (Technion) Emmett Witchel (UT Austin) 1 Traditional System Architecture Applications OS CPU
More information