On the efficiency of the Accelerated Processing Unit for scientific computing
1 24th High Performance Computing Symposium, Pasadena, April 5th 2016. On the efficiency of the Accelerated Processing Unit for scientific computing. I. Said, P. Fortin, J.-L. Lamotte, R. Dolbeau, H. Calandra. Contact: isaid@uh.edu
2 HPC ecosystem: The expanding demands of GPUs. I. Said, 24th High Performance Computing Symposium, 04/05/2016, 1/34. Graphics Processing Units (GPUs) are widely used for scientific computing: linear algebra, numerical simulations and iterative methods, signal processing, etc. However: applications with heavy CPU-GPU communication can be bottlenecked by the PCI Express bus, and CPU + discrete GPU systems (GPUs are not standalone) require large amounts of energy.
3 HPC ecosystem: Towards unifying CPUs and GPUs. [Block diagram: a CPU + discrete GPU system, in which a multi-core CPU (L1/L2/L3 caches) and a GPU (compute units CU 0..N-1 with PEs, register files, local memory and L1/L2 caches) communicate over the PCI Express bus between system memory and GPU main memory; versus the Accelerated Processing Unit (APU), in which a quad-core CPU module and an integrated GPU module share the system memory through the UNB and the ONION and GARLIC buses.]
4 HPC ecosystem: Why use APUs? Strengths: no PCI Express bus; integrated GPUs can address the entire memory; low-power processors (95 W TDP at most, versus up to 150 W TDP for CPUs and up to 250 W for GPUs). Weaknesses: low compute power compared to discrete GPUs (Kaveri APU (A K): 730 GFlop/s on the integrated GPU; Phenom CPU (X6 1055T): 130 GFlop/s; Tahiti GPU (HD 7970): 3700 GFlop/s), and an order of magnitude less memory bandwidth than GPUs (APU: up to 25 GB/s; GPU: up to 300 GB/s).
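A quick sanity check on these figures is the theoretical peak per TDP watt (a minimal sketch using only the peak GFlop/s and TDP values quoted on this slide; note that measured whole-system efficiency, reported later in the talk, can rank the devices differently):

```python
# Theoretical peak performance (GFlop/s) and TDP (W) as quoted on this slide.
DEVICES = {
    "Kaveri APU (integrated GPU)": (730, 95),
    "Phenom X6 1055T CPU": (130, 150),
    "Tahiti HD 7970 GPU": (3700, 250),
}

def peak_per_tdp_watt(gflops, watts):
    """Upper bound on power efficiency: theoretical peak divided by TDP."""
    return gflops / watts

for name, (gflops, watts) in DEVICES.items():
    print(f"{name}: {peak_per_tdp_watt(gflops, watts):.2f} GFlop/s/W (theoretical)")
```

By this purely theoretical metric the discrete GPU looks the most efficient; the measured system-level results on the power-efficiency slides reverse that ordering, which is the point of the study.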
5 HPC ecosystem: Motivations and context. Can we find a range of applications (with appropriate problem sizes) for which APUs may be suitable and/or more power efficient than discrete GPUs? In the scope of this work, we only consider using the integrated GPU of the APU, since it represents the major share of the compute power (Kaveri: 87%).
6 Outline: Data placement strategies; Benchmarks: details and OpenCL implementations; Floating point performance evaluation; Power efficiency evaluation; Conclusion.
8 Understanding the memory system: there are multiple memory locations, so data placement on the APU matters; software manipulations are needed to ensure zero-copy.
9 The APU memory subsystem. Onion: coherent bus (slow). Garlic: non-coherent bus (full memory bandwidth).
10 The APU memory subsystem. c: regular CPU memory (size depends on the RAM).
11 The APU memory subsystem. g: fixed size (512 MB to 4 GB). cg: explicit copy from CPU memory to GPU memory. gc: explicit copy from GPU memory to CPU memory.
12 The APU memory subsystem. u: zero-copy and non-coherent (read-only accesses from the GPU cores); fixed and limited size (up to 1 GB).
13 The APU memory subsystem. z: zero-copy and coherent memory.
14 The APU memory subsystem. p: zero-copy memory that lies in the GPU memory; limited size (up to 512 MB).
15 Data placement strategies on the APU. An OpenCL data copy kernel from buffer A to buffer B: store buffers A and B in different memory locations and evaluate the different combinations, for example cggc (explicit copy) and zz (zero-copy).
16 Data placement benchmark on the APU. [Figure: time (ms) per phase (init, iwrite, kernel, oread, obackup) for the cggc, zgc, ugc, zz, uz, up and pp strategies.] Using zero-copy yields at most 60% of the maximum sustained bandwidth. We select the most relevant strategies: cggc, ugc, up and zz.
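The same bandwidth accounting can be reproduced on any machine with a host-side copy (a hedged NumPy sketch standing in for the OpenCL copy kernel of the talk; the buffer size is an illustrative choice):

```python
import time
import numpy as np

def sustained_copy_bandwidth(nbytes=64 * 1024 * 1024, repeats=5):
    """Time a buffer-to-buffer copy and report sustained bandwidth in GB/s.

    The talk's copy kernel performs the same A -> B transfer on the GPU for
    each data placement strategy; here a NumPy copy stands in for it.
    """
    a = np.ones(nbytes, dtype=np.uint8)
    b = np.empty_like(a)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        np.copyto(b, a)
        best = min(best, time.perf_counter() - t0)
    # Each copy reads A once and writes B once: 2 * nbytes of traffic.
    return 2 * nbytes / best / 1e9

print(f"sustained copy bandwidth: {sustained_copy_bandwidth():.1f} GB/s")
```

Dividing such a measured figure by the platform's peak bandwidth gives the "fraction of maximum sustained bandwidth" criterion used to rank the strategies.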
17 Outline: Data placement strategies; Benchmarks: details and OpenCL implementations; Floating point performance evaluation; Power efficiency evaluation; Conclusion.
18 Applicative benchmarks on the APU. Matrix multiplication: a building block in linear algebra (e.g. the LINPACK benchmark); a compute-bound algorithm; evaluates the sustained compute gap between GPUs and APUs. 8th-order 3D finite difference stencil: a building block of seismic workflows (e.g. Reverse Time Migration); a memory-bound algorithm; evaluates the APU memory performance and the impact of the data placement strategies on the APU performance.
19 Matrix multiplication: SGEMM (BLAS). C = βC + αAB, with α = β = 1. A, B and C are square matrices of dimension N×N, N ∈ [64, 4096]. Compute complexity O(N³), storage complexity O(N²). We include the possible CPU-GPU data transfers and study their impact on the application performance.
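The benchmarked operation and its flop count can be pinned down with a short reference implementation (a NumPy sketch of the SGEMM definition above, not the OpenCL kernels actually benchmarked):

```python
import numpy as np

def sgemm(alpha, A, B, beta, C):
    """C := beta*C + alpha*A@B, the BLAS SGEMM operation benchmarked here."""
    return beta * C + alpha * (A @ B)

def sgemm_gflops(n, seconds):
    """2*N^3 flops (N^3 multiplies + N^3 adds) for square N x N matrices."""
    return 2 * n**3 / seconds / 1e9

n = 64
A = np.random.rand(n, n).astype(np.float32)
B = np.random.rand(n, n).astype(np.float32)
C = np.random.rand(n, n).astype(np.float32)
out = sgemm(1.0, A, B, 1.0, C)   # alpha = beta = 1, as in the talk
```

The 2N³ flop count divided by the measured kernel time is what the GFlop/s axes on the following slides report.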
20 Matrix multiplication: OpenCL deployment. A 2D work-item grid over the 2D square matrices (N ∈ [64, 4096]), with X elements of C per work-item (X = 2 or X = 4 in practice). Implementations: scalar (global memory; natural blocking thanks to OpenCL), local scalar (A and B are partitioned using the local memory), vectorized (explicit vectorization), local vectorized (local memory + explicit vectorization), image (cache-friendly tiled layout format using the texture memory).
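The "local" variants tile A and B so that each work-group reuses a block loaded into fast on-chip memory. The blocking pattern (not the OpenCL syntax) can be sketched as follows:

```python
import numpy as np

def blocked_matmul(A, B, tile=16):
    """Tiled matrix multiply: each (i, j) tile of C accumulates products of
    tile-sized blocks of A and B, mirroring how the 'local' OpenCL kernels
    stage blocks of A and B in local (on-chip) memory before using them."""
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                # In OpenCL these two blocks would be copied into local memory
                # once per work-group and reused by all of its work-items.
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

A = np.random.rand(64, 64)
B = np.random.rand(64, 64)
assert np.allclose(blocked_matmul(A, B), A @ B)
```

Each element of A and B is read from slow memory n/tile times instead of n times, which is why partitioning pays off on both the GPU and the APU.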
21 Finite difference stencil: a linear combination of neighboring values weighted by coefficients: U_{i,j,k} = Σ_{l=-p/2}^{p/2} a_l U_{i+l,j,k} + Σ_{l=-p/2}^{p/2} a_l U_{i,j+l,k} + Σ_{l=-p/2}^{p/2} a_l U_{i,j,k+l}, with p = 8. Problem sizes N×N×32, N ∈ [64, 1024]. Compute complexity O(N³), storage complexity O(N³). Data snapshotting every K iterations (K ∈ [1, 10]).
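As a concrete instance of the formula above, here is the p = 8 stencil applied along one axis (a sketch: the coefficients a_l are taken to be the standard 8th-order central weights for a second derivative, which may differ from the ones used in the talk):

```python
import numpy as np

# Standard 8th-order central coefficients for d2/dx2 on a unit grid;
# assumed here for illustration, the talk's a_l coefficients may differ.
COEFFS = {0: -205.0/72.0, 1: 8.0/5.0, 2: -1.0/5.0, 3: 8.0/315.0, 4: -1.0/560.0}

def stencil_1d(u):
    """Apply the p = 8 stencil along a 1-D array, interior points only."""
    p = 8
    out = np.zeros(len(u) - p)
    for l in range(-p // 2, p // 2 + 1):
        a_l = COEFFS[abs(l)]           # symmetric weights: a_{-l} = a_l
        out += a_l * u[p // 2 + l : len(u) - p // 2 + l]
    return out

# Sanity check: the second derivative of x^2 is 2 everywhere, and the
# scheme is exact on low-degree polynomials.
x = np.arange(32, dtype=float)
print(stencil_1d(x**2))  # ~2.0 at every interior point
```

The 3D kernel of the talk applies this same 1-D combination along each of the three axes and sums the results, giving 25 input points per output point.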
22 Finite difference stencil: OpenCL deployment. A 2D work-item grid over the 3D domain, with X columns along the Z axis per work-item (X = 2 or X = 4 in practice) and register blocking when traversing the Z dimension. Implementations: scalar (global memory), local scalar (local memory to exploit memory access redundancies), vectorized (global memory + explicit vectorization), local vectorized (local memory + explicit vectorization).
23 Outline: Data placement strategies; Benchmarks: details and OpenCL implementations; Floating point performance evaluation; Power efficiency evaluation; Conclusion.
24 Matrix multiplication: CPU performance. [Figure: GFlop/s versus N (higher is better), with the theoretical peak performance and the scalar, vectorized, local vectorized, OpenMP and GotoBLAS versions.] OpenCL > OpenMP thanks to the OpenCL natural blocking; vectorized > scalar thanks to SSE; local vectorized > vectorized thanks to partitioning A and B; GotoBLAS is the best thanks to close-to-hardware optimizations.
25 Matrix multiplication: GPU performance. [Figure: GFlop/s versus N (higher is better), with the theoretical peak performance and the scalar, vectorized, image, local scalar and local vectorized versions.] local vectorized (up to TFlop/s) > vectorized; OpenCL images offer only a 7% enhancement (GCN); vectorized versions >> scalar versions (not expected).
26 Matrix multiplication: APU performance. [Figure: GFlop/s versus N (higher is better), with the theoretical peak performance and the scalar, vectorized, image, local scalar and local vectorized versions.] Similarly to the GPU, vectorized versions > scalar versions, and the local memory enhances the performance; OpenCL images improve the performance by 25%.
27 Matrix multiplication: APU performance and data placement strategies. We now include the timing of the CPU-GPU interactions, take the best OpenCL implementations (vectorized, local vectorized) and combine them with the data placement strategies cggc, ugc, up and zz. [Figure: GFlop/s versus N (higher is better) for each implementation-strategy pair, with the theoretical peak performance.] Best: local vectorized coupled with the cggc data placement strategy; local vectorized-zz is only 3% lower than local vectorized-cggc (an enhancement of the Onion bandwidth as compared to older APUs).
28 Matrix multiplication: performance comparison. [Figure: GFlop/s versus N (higher is better) for the CPU (GotoBLAS), APU, GPU, and a performance projection APU(Onion=Garlic*), where * mimics upcoming APUs with fully unified memory.] CPU > APU for N ≤ 100 (small matrices); APU > GPU for N ≤ 700 (medium-sized matrices); GPU > APU for N > 700 (large matrices, for which the transfer times are small compared to the computation).
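The crossover can be reproduced with a simple cost model (the sustained rates and PCI Express bandwidth below are illustrative assumptions, not the measured values behind the figure):

```python
def matmul_time(n, gflops, pcie_gbps=None):
    """Seconds for one N x N SGEMM: 2*N^3 flops of compute, plus, for a
    discrete GPU, transferring A and B in and C out over PCI Express
    (3 * N^2 single-precision floats)."""
    t = 2 * n**3 / (gflops * 1e9)
    if pcie_gbps is not None:
        t += 3 * n * n * 4 / (pcie_gbps * 1e9)
    return t

# Illustrative sustained rates (assumptions, not the paper's measurements).
APU_GFLOPS, GPU_GFLOPS, PCIE_GBPS = 400, 1500, 6

for n in (128, 512, 2048):
    apu = matmul_time(n, APU_GFLOPS)
    gpu = matmul_time(n, GPU_GFLOPS, PCIE_GBPS)
    print(f"N={n}: APU {apu*1e3:.2f} ms, GPU {gpu*1e3:.2f} ms")
```

Because the transfer term grows as N² while the compute term grows as N³, the transfer-free APU wins below some N and the faster GPU wins above it; with these toy numbers the crossover lands near N ≈ 550, in the same regime as the measured N ≈ 700.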
29 Finite difference stencil: CPU performance. [Figure: GFlop/s versus N×N×32 (higher is better) for the scalar, vectorized, local vectorized and OpenMP versions.] Explicit vectorization (SSE) helped deliver the best performance.
30 Finite difference stencil: GPU performance. [Figure: GFlop/s versus N×N×32 (higher is better) for the scalar, vectorized, local scalar and local vectorized versions.] Scalar matches vectorized thanks to GCN (Graphics Core Next); scalar code + OpenCL local memory offered the best performance.
31 Finite difference stencil: APU performance. [Figure: GFlop/s versus N×N×32 (higher is better) for the scalar, vectorized, local scalar and local vectorized versions.] local scalar gives the best performance numbers for N ≥ 128; vectorization is not needed thanks to GCN.
32 Finite difference stencil: APU performance and data placement strategies. Fixed problem size 1024x1024x32 (128 MB); one snapshot every K computations (1 ≤ K ≤ 10); we select the best OpenCL implementations (scalar, local scalar) and combine them with the data placement strategies cggc, ugc, up and zz. [Figure: GFlop/s versus K computations + 1 snapshot (higher is better) for each implementation-strategy pair.] Best: local scalar with zz for 1 ≤ K ≤ 3 and with cggc for 3 < K ≤ 10. Kaveri is the first APU (compared to older ones) that enables performance gains when using zero-copy buffers.
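The K-dependence is a plain amortization effect: the snapshot transfer cost is paid once per K stencil iterations, so effective throughput rises with K. A small model makes this explicit (all timings below are illustrative placeholders, not the paper's measurements):

```python
def effective_gflops(k, t_iter=0.025, t_snapshot=0.02, gflop_per_iter=2.0):
    """Effective GFlop/s when one snapshot follows every K stencil iterations.

    t_iter (s per iteration), t_snapshot (s per snapshot transfer) and
    gflop_per_iter are assumed illustrative values, not measured ones.
    """
    return k * gflop_per_iter / (k * t_iter + t_snapshot)

for k in (1, 2, 5, 10):
    print(f"K={k}: {effective_gflops(k):.1f} GFlop/s effective")
```

As K grows the snapshot cost is amortized away and the effective rate approaches the compute-only rate (here 2.0/0.025 = 80 GFlop/s); at K = 1 the transfer dominates, which is exactly the regime where zero-copy placement pays off.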
33 Finite difference stencil: performance comparison. [Figure: GFlop/s versus K computations + 1 snapshot (higher is better) for the CPU, APU, GPU, and a performance projection APU(Onion=Garlic*), where * mimics upcoming APUs with fully unified memory.] APU > CPU for all K; GPU > APU for 2 ≤ K ≤ 10; APU > GPU when performing one snapshot after each iteration (K = 1).
34 APU performance evaluation: conclusions. The APU can be an attractive solution for a high rate of data snapshotting (finite difference) and for medium-sized problems (matrix multiplication). In the other cases there is a 3x to 4x practical GPU/APU performance gap (against a 5x to 10x theoretical gap). Power is gaining interest in the HPC community (Green500, the power wall and Exascale): what about power consumption?
35 Outline: Data placement strategies; Benchmarks: details and OpenCL implementations; Floating point performance evaluation; Power efficiency evaluation; Conclusion.
36 Power efficiency: methodology for power measurement. Tools and metric: a Raritan PX (DPXR8A-16) PDU monitors the power consumption; the performance-per-Watt (PPW) metric is used. Methodology: we measure the power drawn by the system as a whole, with the same functional hardware components for the three architectures (CPU+GPU for the GPU-based solutions); the electric efficiency of the Power Supply Units (PSUs) matters.
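The PPW metric itself is straightforward to compute from the PDU samples (a hedged sketch; the sample format and the run below are hypothetical):

```python
def performance_per_watt(total_gflop, elapsed_s, power_samples_w):
    """GFlop/s/W: sustained compute rate divided by the mean whole-system
    power draw, as sampled by the PDU over the run."""
    mean_power = sum(power_samples_w) / len(power_samples_w)
    return (total_gflop / elapsed_s) / mean_power

# Hypothetical run: 1000 GFlop in 2 s while the PDU reports around 150 W.
print(performance_per_watt(1000, 2.0, [148, 151, 150, 149, 152]))  # ~3.33
```

Because the whole system is measured, idle components, the memory system and PSU losses all count against each architecture, which is what makes the comparison fair across CPU, GPU and APU boxes.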
37 Power efficiency: matrix multiplication. [Figure: GFlop/s/W versus N (higher is better) for cpu, tahiti and kaveri; measured system power draws up to 55 W, 197 W and 145 W.] The CPU offers a very low power efficiency (about 1 GFlop/s/W); the APU is 20% more power efficient than the GPU.
38 Power efficiency: finite difference stencil. [Figure: GFlop/s/W versus K computations + 1 snapshot (higher is better) for the CPU, GPU and APU; problem size 1024x1024x32 (128 MB); measured system power draws up to 62 W, 222 W and 159 W.] The CPU offers a very low power efficiency (0.08 GFlop/s/W); the APU is 13% more power efficient than the GPU. The gain is higher for the compute-bound algorithm (matrix multiplication): flops consume less power than memory accesses.
39 Outline: Data placement strategies; Benchmarks: details and OpenCL implementations; Floating point performance evaluation; Power efficiency evaluation; Conclusion.
40 Conclusions and perspectives. Conclusions: the APU (almost) always outperforms the CPU; the APU can match or outperform discrete GPUs for some medium-sized problems and for problems with high communication requirements (snapshotting). Performance + power consumption: despite a 3.3-fold performance difference, APUs are more power efficient than GPUs. Perspectives: promising features with upcoming APUs: full memory unification (at the hardware level); HBM (High Bandwidth Memory) and an increased compute unit count; OpenPOWER and NVLink.
Collaborators: Richard T. Mills, Argonne National Laboratory Sarat Sreepathi, Oak Ridge National Laboratory Forrest M. Hoffman, Oak Ridge National Laboratory Jitendra Kumar, Oak Ridge National Laboratory
More informationEarly Experiences Writing Performance Portable OpenMP 4 Codes
Early Experiences Writing Performance Portable OpenMP 4 Codes Verónica G. Vergara Larrea Wayne Joubert M. Graham Lopez Oscar Hernandez Oak Ridge National Laboratory Problem statement APU FPGA neuromorphic
More informationPedraforca: a First ARM + GPU Cluster for HPC
www.bsc.es Pedraforca: a First ARM + GPU Cluster for HPC Nikola Puzovic, Alex Ramirez We ve hit the power wall ALL computers are limited by power consumption Energy-efficient approaches Multi-core Fujitsu
More informationFPGA-based Supercomputing: New Opportunities and Challenges
FPGA-based Supercomputing: New Opportunities and Challenges Naoya Maruyama (RIKEN AICS)* 5 th ADAC Workshop Feb 15, 2018 * Current Main affiliation is Lawrence Livermore National Laboratory SIAM PP18:
More informationAccelerating Molecular Modeling Applications with Graphics Processors
Accelerating Molecular Modeling Applications with Graphics Processors John Stone Theoretical and Computational Biophysics Group University of Illinois at Urbana-Champaign Research/gpu/ SIAM Conference
More informationWhat is GPU? CS 590: High Performance Computing. GPU Architectures and CUDA Concepts/Terms
CS 590: High Performance Computing GPU Architectures and CUDA Concepts/Terms Fengguang Song Department of Computer & Information Science IUPUI What is GPU? Conventional GPUs are used to generate 2D, 3D
More informationPerformance of Multicore LUP Decomposition
Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations
More informationOpenStaPLE, an OpenACC Lattice QCD Application
OpenStaPLE, an OpenACC Lattice QCD Application Enrico Calore Postdoctoral Researcher Università degli Studi di Ferrara INFN Ferrara Italy GTC Europe, October 10 th, 2018 E. Calore (Univ. and INFN Ferrara)
More informationEfficient and Scalable Shading for Many Lights
Efficient and Scalable Shading for Many Lights 1. GPU Overview 2. Shading recap 3. Forward Shading 4. Deferred Shading 5. Tiled Deferred Shading 6. And more! First GPU Shaders Unified Shaders CUDA OpenCL
More informationSDA: Software-Defined Accelerator for Large- Scale DNN Systems
SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, 1 Yong Wang, 1 Bo Yu, 1 Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A
More informationIntroduction to Multicore Programming
Introduction to Multicore Programming Minsoo Ryu Department of Computer Science and Engineering 2 1 Multithreaded Programming 2 Automatic Parallelization and OpenMP 3 GPGPU 2 Multithreaded Programming
More informationSolving Dense Linear Systems on Graphics Processors
Solving Dense Linear Systems on Graphics Processors Sergio Barrachina Maribel Castillo Francisco Igual Rafael Mayo Enrique S. Quintana-Ortí High Performance Computing & Architectures Group Universidad
More informationHPCC Results. Nathan Wichmann Benchmark Engineer
HPCC Results Nathan Wichmann Benchmark Engineer Outline What is HPCC? Results Comparing current machines Conclusions May 04 2 HPCChallenge Project Goals To examine the performance of HPC architectures
More informationPyFR: Heterogeneous Computing on Mixed Unstructured Grids with Python. F.D. Witherden, M. Klemm, P.E. Vincent
PyFR: Heterogeneous Computing on Mixed Unstructured Grids with Python F.D. Witherden, M. Klemm, P.E. Vincent 1 Overview Motivation. Accelerators and Modern Hardware Python and PyFR. Summary. Motivation
More informationLecture 3: Intro to parallel machines and models
Lecture 3: Intro to parallel machines and models David Bindel 1 Sep 2011 Logistics Remember: http://www.cs.cornell.edu/~bindel/class/cs5220-f11/ http://www.piazza.com/cornell/cs5220 Note: the entire class
More informationArm Processor Technology Update and Roadmap
Arm Processor Technology Update and Roadmap ARM Processor Technology Update and Roadmap Cavium: Giri Chukkapalli is a Distinguished Engineer in the Data Center Group (DCG) Introduction to ARM Architecture
More informationPorting a parallel rotor wake simulation to GPGPU accelerators using OpenACC
DLR.de Chart 1 Porting a parallel rotor wake simulation to GPGPU accelerators using OpenACC Melven Röhrig-Zöllner DLR, Simulations- und Softwaretechnik DLR.de Chart 2 Outline Hardware-Architecture (CPU+GPU)
More informationChallenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery
Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark by Nkemdirim Dockery High Performance Computing Workloads Core-memory sized Floating point intensive Well-structured
More informationHigh performance Computing and O&G Challenges
High performance Computing and O&G Challenges 2 Seismic exploration challenges High Performance Computing and O&G challenges Worldwide Context Seismic,sub-surface imaging Computing Power needs Accelerating
More informationThe Many-Core Revolution Understanding Change. Alejandro Cabrera January 29, 2009
The Many-Core Revolution Understanding Change Alejandro Cabrera cpp.cabrera@gmail.com January 29, 2009 Disclaimer This presentation currently contains several claims requiring proper citations and a few
More informationImplicit Low-Order Unstructured Finite-Element Multiple Simulation Enhanced by Dense Computation using OpenACC
Fourth Workshop on Accelerator Programming Using Directives (WACCPD), Nov. 13, 2017 Implicit Low-Order Unstructured Finite-Element Multiple Simulation Enhanced by Dense Computation using OpenACC Takuma
More informationThe GPU as a co-processor in FEM-based simulations. Preliminary results. Dipl.-Inform. Dominik Göddeke.
The GPU as a co-processor in FEM-based simulations Preliminary results Dipl.-Inform. Dominik Göddeke dominik.goeddeke@mathematik.uni-dortmund.de Institute of Applied Mathematics University of Dortmund
More informationLecture 8: GPU Programming. CSE599G1: Spring 2017
Lecture 8: GPU Programming CSE599G1: Spring 2017 Announcements Project proposal due on Thursday (4/28) 5pm. Assignment 2 will be out today, due in two weeks. Implement GPU kernels and use cublas library
More informationLECTURE ON PASCAL GPU ARCHITECTURE. Jiri Kraus, November 14 th 2016
LECTURE ON PASCAL GPU ARCHITECTURE Jiri Kraus, November 14 th 2016 ACCELERATED COMPUTING CPU Optimized for Serial Tasks GPU Accelerator Optimized for Parallel Tasks 2 ACCELERATED COMPUTING CPU Optimized
More informationPreparing seismic codes for GPUs and other
Preparing seismic codes for GPUs and other many-core architectures Paulius Micikevicius Developer Technology Engineer, NVIDIA 2010 SEG Post-convention Workshop (W-3) High Performance Implementations of
More informationNVidia s GPU Microarchitectures. By Stephen Lucas and Gerald Kotas
NVidia s GPU Microarchitectures By Stephen Lucas and Gerald Kotas Intro Discussion Points - Difference between CPU and GPU - Use s of GPUS - Brie f History - Te sla Archite cture - Fermi Architecture -
More informationEfficiency and Programmability: Enablers for ExaScale. Bill Dally Chief Scientist and SVP, Research NVIDIA Professor (Research), EE&CS, Stanford
Efficiency and Programmability: Enablers for ExaScale Bill Dally Chief Scientist and SVP, Research NVIDIA Professor (Research), EE&CS, Stanford Scientific Discovery and Business Analytics Driving an Insatiable
More informationProfiling GPU Code. Jeremy Appleyard, February 2016
Profiling GPU Code Jeremy Appleyard, February 2016 What is Profiling? Measuring Performance Measuring application performance Usually the aim is to reduce runtime Simple profiling: How long does an operation
More informationParticle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA
Particle-in-Cell Simulations on Modern Computing Platforms Viktor K. Decyk and Tajendra V. Singh UCLA Outline of Presentation Abstraction of future computer hardware PIC on GPUs OpenCL and Cuda Fortran
More information"On the Capability and Achievable Performance of FPGAs for HPC Applications"
"On the Capability and Achievable Performance of FPGAs for HPC Applications" Wim Vanderbauwhede School of Computing Science, University of Glasgow, UK Or in other words "How Fast Can Those FPGA Thingies
More informationHETEROGENEOUS SYSTEM ARCHITECTURE: PLATFORM FOR THE FUTURE
HETEROGENEOUS SYSTEM ARCHITECTURE: PLATFORM FOR THE FUTURE Haibo Xie, Ph.D. Chief HSA Evangelist AMD China OUTLINE: The Challenges with Computing Today Introducing Heterogeneous System Architecture (HSA)
More informationEmbedded real-time stereo estimation via Semi-Global Matching on the GPU
Embedded real-time stereo estimation via Semi-Global Matching on the GPU Daniel Hernández Juárez, Alejandro Chacón, Antonio Espinosa, David Vázquez, Juan Carlos Moure and Antonio M. López Computer Architecture
More informationParallel FFT Program Optimizations on Heterogeneous Computers
Parallel FFT Program Optimizations on Heterogeneous Computers Shuo Chen, Xiaoming Li Department of Electrical and Computer Engineering University of Delaware, Newark, DE 19716 Outline Part I: A Hybrid
More informationUsing GPUs for unstructured grid CFD
Using GPUs for unstructured grid CFD Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford e-research Centre Schlumberger Abingdon Technology Centre, February 17th, 2011
More information