Finite Element Integration and Assembly on Modern Multi and Many-core Processors

Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń
AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków, Poland
Filip Krużel
Cracow University of Technology, Warszawska 24, 31-155 Kraków, Poland
pobanas@cyf-kr.edu.pl, janbielanski@agh.edu.pl, chlon@agh.edu.pl, fkruzel@pk.edu.pl
ParEng 2017

Outline
1. Motivation - processor architectures
2.
3.

Motivation
Current computer architectures:
- clusters with homogeneous or heterogeneous nodes
- multicore processors with vector capabilities
- manycore accelerators (including GPUs)

Motivation
The principal questions:
- what is the best processor architecture for running FEM codes?
- are the new architectures (GPUs, MIC) worth the investment?
[Figure: CPU core vs. GPU core]

Motivation
Disadvantages of the new architectures:
- programming model more complex than the traditional one
- different optimization strategies
- price

Motivation
GPU (and MIC) advantages:
- higher floating point performance
- higher memory bandwidth

Architecture comparison

Architecture                           | Kepler / Pascal        | Xeon Phi              | Xeon (E5)
Processor                              | GK110 / GP100          | 5110P / 7230          | 2620 / 2699v3
Year of introduction                   | 2013 / 2016            | 2012 / 2016           | 2012 / 2014
Number of multiprocessors/cores        | 13 / 56                | 60 (59) / 64          | 6 / 18
Number of SP SIMD lanes                | 2496 (x2) / 3584 (x2)  | 960 (x2) / 2048 (x2)  | 48 / 288 (x2)
Number of DP SIMD lanes                | 832 (x2) / 1792 (x2)   | 480 (x2) / 1024 (x2)  | 24 / 144 (x2)
Fast global memory size [GB]           | 4.8 / 12 (16)          | 8 / 16 (384)          | 384 / 768
LLC memory size [MB]                   | 1.5 / 4.0 [L2]         | 30 / 32 [L2]          | 15 / 45 [L3]
Frequency [GHz]                        | 0.7 / 1.1              | 1.0 / 1.3             | 2.5 / 2.3

Performance characteristics
Peak SP performance [TFlops]           | 3.52 / 8               | 2.02 / 5.3            | 0.24 / 1.3
Peak DP performance [TFlops]           | 1.17 / 4               | 1.01 / 2.6            | 0.10 / 0.66
Benchmark (DGEMM) performance [TFlops] | 1.10 / ???             | 0.84 / 1.9            | 0.09 / 0.48
Peak memory bandwidth [GB/s]           | 208 / 549 (732)        | 320 / > 400           | 42.6 / 68
Benchmark (STREAM) bandwidth [GB/s]    | 144 / ???              | 171 / 480 (85)        | 33 / 58
Machine balance [DP flops/access]      | 45 / 58 (44)           | 25 / <43              | 18 / 77
Benchmark machine balance              | 61 / ???               | 39 / 32               | 21 / 66
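The machine balance rows can be reproduced from the peak values above; as a sanity check (this is our reading of the table, assuming one memory access means one 8-byte double, not a formula quoted from the slides):

```latex
% Machine balance: peak DP flop rate divided by the peak rate of
% 8-byte (double precision) memory accesses.
B \;=\; \frac{F^{\mathrm{DP}}_{\mathrm{peak}}}{M_{\mathrm{peak}} / 8\,\mathrm{B}}
% Kepler GK110:   1.17 TFlops / (208 GB/s / 8 B) = 45 flops per DP access
% Xeon E5-2699v3: 0.66 TFlops / (68 GB/s / 8 B) ~ 77 flops per DP access
```

Using the DGEMM and STREAM numbers instead of the peaks reproduces the benchmark machine balance row in the same way (e.g. 1.10 TFlops / (144 GB/s / 8 B) ~ 61 for GK110).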

Architecture comparison

Architecture                           | Kepler / Pascal      | Xeon Phi           | Xeon (E5)
Processor                              | GK110 / GP100        | 5110P / 7230       | 2620 / 2699v3

Multiprocessor/core characteristics
Number of SP SIMD lanes                | 192 (x2) / 64 (x2)   | 16 (x2) / 32 (x2)  | 8 / 16 (x2)
Number of DP SIMD lanes                | 64 (x2) / 32 (x2)    | 8 (x2) / 16 (x2)   | 4 / 8 (x2)
Shared memory / L2 size [KB]           | 16 or 48 / 64        | 512 / 512          | 256 / 256
L1 cache memory size [KB]              | 48 or 16 / 24 (?)    | 32 / 32            | 32 / 32
Number of 32-bit registers             | 65536 / 65536        | >2048 / >2560      | 1472 / 1680

Resources per single SP SIMD lane (+ latency hiding?)
Number of SP registers                 | 341 / 1024           | 32 (x4) / 32 (x4)  | 16 (x2) / 16 (x2)
Number of SP entries in SM/L1          | 64 / 256             | 512 / 256          | 1024 / 512
Number of SP entries in L2 cache       | 131 / 292            | 8192 / 4096        | 8192 / 4096

Resources per single DP SIMD lane (+ latency hiding?)
Number of DP registers                 | 512 / 1024           | 32 (x4) / 32 (x4)  | 16 (x2) / 16 (x2)
Number of DP entries in SM/L1          | 96 / 256             | 512 / 256          | 1024 / 512
Number of DP entries in L2             | 196 / 292            | 8192 / 4096        | 8192 / 4096

Two phases:
- creation of the system of linear equations (integration and assembly)
- linear system solution (direct or iterative)

The creation of FEM systems of linear equations
Finite element integration and assembly

Finite element integration and assembly

for e = 1 to N_E do
    read input data and initialize output arrays A_e and b_e
    for i_Q = 1 to N_Q do
        compute auxiliary terms (vol) and arrays (φ, c and d)
        for i_S = 1 to N_S do
            for j_S = 1 to N_S do
                update A_e[i_S][j_S] using vol[i_Q], c[i_Q], φ[i_S][i_Q], φ[j_S][i_Q]
                if (i_S == j_S) then
                    update b_e[i_S] using vol[i_Q], d[i_D][i_Q], φ[i_S][i_Q]
                end if
            end for
        end for
    end for
    assemble A_e and b_e into the global arrays
end for
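A minimal serial C sketch of this loop structure (the generic weak form with coefficients c and d, the fixed sizes and the data layout are illustrative assumptions, not the authors' actual code):

```c
/* Serial sketch of the integration/assembly loop above. Array names,
 * fixed sizes and the simplified weak form are illustrative assumptions. */
#define N_S 4   /* shape functions per element (assumed) */
#define N_Q 4   /* quadrature points per element (assumed) */

void integrate_and_assemble(int n_elems,
                            const double vol[][N_Q],      /* wq * det(J)      */
                            const double phi[][N_S][N_Q], /* shape fn values  */
                            const double c[][N_Q],        /* PDE coefficient  */
                            const double d[][N_Q],        /* source term      */
                            double *A_global, double *b_global,
                            const int dof_map[][N_S], int n_dofs)
{
    for (int e = 0; e < n_elems; ++e) {
        double A_e[N_S][N_S] = {{0.0}};
        double b_e[N_S] = {0.0};

        /* loop over integration points */
        for (int iq = 0; iq < N_Q; ++iq) {
            for (int is = 0; is < N_S; ++is) {
                for (int js = 0; js < N_S; ++js)
                    A_e[is][js] += vol[e][iq] * c[e][iq]
                                 * phi[e][is][iq] * phi[e][js][iq];
                b_e[is] += vol[e][iq] * d[e][iq] * phi[e][is][iq];
            }
        }
        /* assembly of A_e and b_e into the global arrays */
        for (int is = 0; is < N_S; ++is) {
            b_global[dof_map[e][is]] += b_e[is];
            for (int js = 0; js < N_S; ++js)
                A_global[dof_map[e][is] * n_dofs + dof_map[e][js]] += A_e[is][js];
        }
    }
}
```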

Computational complexity: the dependence on the order of approximation
An example for discontinuous Galerkin approximation.
[Figure: execution time [s] and number of operations [Mflop] versus the degree of approximation (1-6), for element integration, face integration, preconditioner set-up, iterations and total time.]

Finite element integration and assembly
Size of input, output and auxiliary arrays for numerical integration

The size of arrays           | p=1  | p=2  | p=3  | p=4   | p=5
ξ_Q, w_Q                     | 24   | 72   | 192  | 320   | 600
φ                            | 24   | 72   | 160  | 300   | 504
max c, d                     | 20   | 20   | 20   | 20    | 20
Total ∂x/∂ξ, det(∂x/∂ξ)      | 60   | 180  | 480  | 800   | 1500
Total φ                      | 144  | 1296 | 7680 | 24000 | 75600
Total max c, d               | 120  | 360  | 960  | 1600  | 3000
A_e, b_e                     | 42   | 342  | 1640 | 5700  | 16002

Finite element integration and assembly
Computational and memory complexity: the dependence on the order of approximation
An example for discontinuous Galerkin approximation.

Offloading model
- OpenCL devices connected via PCIe
- APU with unified memory

Available resources for application programmers
Processing environment:
- multi-threading for multi-core:
  - standard - Xeon
  - massive (also for latency hiding) - GPUs, Xeon Phi
- vectorization (see the sketch below):
  - wide SIMD units - Xeon (8, 16)
  - wider SIMD units - GPUs (32), Xeon Phi (16, 32)
- different communication and synchronization mechanisms
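To make the vectorization item concrete, a minimal sketch of the kind of innermost shape-function loop that maps onto the SIMD units listed above (the pragma and names are generic illustrations, not taken from the presented code):

```c
/* Vectorization sketch: the innermost loop over shape functions is a
 * natural candidate for the wide SIMD units of Xeon, Xeon Phi and GPUs.
 * The pragma and array names are generic, not the authors' code. */
void update_row(double *restrict A_row, const double *restrict phi,
                double scale, int n_shape)
{
    #pragma omp simd
    for (int js = 0; js < n_shape; ++js)
        A_row[js] += scale * phi[js];
}
```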

Available resources for application programmers
Memory hierarchy:
- large, latency-optimized DRAM memory - below 150 GB/s (for 2-socket configurations)
- fast global memory (MCDRAM, HBM) - GPUs and MIC - above 400 GB/s for a single processor
- shared memory, caches:
  - smaller, explicitly managed - GPUs with CUDA, OpenCL
  - larger, implicitly managed - Xeon, Xeon Phi
- registers:
  - large number, explicitly managed - GPUs with CUDA, OpenCL
  - implicitly managed - Xeon, Xeon Phi

Programming models:
- standard - OpenMP:
  - parallel loops
  - implicit management of data placement
- GPU oriented - CUDA, OpenCL (can be used as well for x86):
  - explicit and implicit thread organization (warps, threadblocks, grid)
  - memory hierarchy (registers, shared, global)
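A hedged sketch of the OpenMP style for the element loop; the routine names are placeholders standing for the integration and assembly steps shown earlier, not the authors' API:

```c
/* OpenMP style: one parallel loop over elements, data placement handled
 * implicitly by the runtime and the cache hierarchy. */
#include <omp.h>

/* Placeholder prototypes for the integration and assembly steps. */
void integrate_element(int e, double A_e[4][4], double b_e[4]);
void assemble_element(int e, const double A_e[4][4], const double b_e[4]);

void create_system_openmp(int n_elems)
{
    #pragma omp parallel for schedule(dynamic)
    for (int e = 0; e < n_elems; ++e) {
        double A_e[4][4], b_e[4];        /* private element arrays */
        integrate_element(e, A_e, b_e);
        assemble_element(e, A_e, b_e);   /* assembly needs atomics or
                                            element coloring to avoid
                                            concurrent-write conflicts */
    }
}
```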

Parallelization strategies
One element - one thread:
- the loop over elements is parallelized
- small resource requirements for low orders of approximation
  - possible explicit placement of some arrays in the shared memory to speed up calculations for CUDA and OpenCL
- large resource requirements for higher orders of approximation
  - can be handled by the flexible memory hierarchy of Xeons
  - prevents OpenCL kernels from executing on GPUs
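A minimal OpenCL C sketch of the one element - one thread strategy described above; the fixed sizes, buffer layout and simplified weak form are assumptions for illustration, not the authors' kernels:

```c
/* OpenCL C: one work-item computes the whole element matrix.
 * Sizes, buffer layout and the simplified weak form are illustrative. */
#pragma OPENCL EXTENSION cl_khr_fp64 : enable  /* doubles on OpenCL 1.x */

#define N_S 4
#define N_Q 4

__kernel void integrate_one_elem_one_thread(
    __global const double *vol,   /* n_elems * N_Q       */
    __global const double *phi,   /* n_elems * N_S * N_Q */
    __global const double *coeff, /* n_elems * N_Q       */
    __global double *A_e,         /* n_elems * N_S * N_S */
    const int n_elems)
{
    const int e = get_global_id(0);
    if (e >= n_elems) return;

    double A[N_S][N_S] = {{0.0}};   /* private (register) storage */

    for (int iq = 0; iq < N_Q; ++iq) {
        const double w = vol[e * N_Q + iq] * coeff[e * N_Q + iq];
        for (int is = 0; is < N_S; ++is)
            for (int js = 0; js < N_S; ++js)
                A[is][js] += w * phi[(e * N_S + is) * N_Q + iq]
                               * phi[(e * N_S + js) * N_Q + iq];
    }
    for (int is = 0; is < N_S; ++is)
        for (int js = 0; js < N_S; ++js)
            A_e[(e * N_S + is) * N_S + js] = A[is][js];
}
```

For higher orders of approximation the private arrays grow quickly, which is exactly the resource pressure mentioned in the list above.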

Parallelization strategies
One element - several threads:
- the loop over elements is parallelized
- additionally, the loops over the entries of the output arrays are parallelized in a domain decomposition manner, with no dependencies
- the number of threads usually up to the size of a threadblock for CUDA and OpenCL
- sufficient shared memory resources for GPUs
- for high orders even several threadblocks can operate on a single element (or an additional loop over parts of the output arrays is introduced)
- serial fraction associated with some auxiliary calculations at integration points
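A hedged OpenCL C sketch of the one element - several threads variant: one work-group per element, the element data staged in local memory, and each work-item owning one entry of A_e. This particular decomposition and the sizes are one possible choice, not necessarily the authors':

```c
/* OpenCL C: one work-group per element, launched with local size
 * N_S*N_S = 64; each work-item owns one A_e entry. PDE coefficients
 * are omitted for brevity; layout and sizes are illustrative. */
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

#define N_S 8
#define N_Q 8

__kernel void integrate_one_elem_many_threads(
    __global const double *vol,   /* n_elems * N_Q       */
    __global const double *phi,   /* n_elems * N_S * N_Q */
    __global double *A_e)         /* n_elems * N_S * N_S */
{
    const int e   = get_group_id(0);
    const int tid = get_local_id(0);            /* 0 .. N_S*N_S-1 */
    const int is  = tid / N_S, js = tid % N_S;

    __local double phi_l[N_S][N_Q];             /* staged shape functions */
    __local double vol_l[N_Q];

    /* cooperative load of per-element data into local memory */
    for (int k = tid; k < N_S * N_Q; k += get_local_size(0))
        phi_l[k / N_Q][k % N_Q] = phi[e * N_S * N_Q + k];
    if (tid < N_Q) vol_l[tid] = vol[e * N_Q + tid];
    barrier(CLK_LOCAL_MEM_FENCE);

    double a = 0.0;                             /* one output entry per thread */
    for (int iq = 0; iq < N_Q; ++iq)
        a += vol_l[iq] * phi_l[is][iq] * phi_l[js][iq];
    A_e[(e * N_S + is) * N_S + js] = a;
}
```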

Parallelization strategies
One element - two kernels strategy:
- kernel 1: auxiliary terms calculations
  - the loop over elements is parallelized
  - the loop over integration points is parallelized
  - auxiliary terms are calculated and stored in global memory
- kernel 2: actual calculation of the element arrays
  - the loop over elements is parallelized
  - additionally, the loops over the entries of the output arrays are parallelized in the same way as for the one element - several threads strategy
- no serial fraction
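On the host side, the two-kernel strategy amounts to launching the auxiliary-terms kernel over (element, integration point) pairs and then the element-array kernel over elements. A minimal sketch with placeholder kernel handles, assuming an in-order command queue so that kernel 2 sees the results of kernel 1:

```c
/* Host-side sketch of the two-kernel strategy: kernel 1 precomputes the
 * auxiliary quadrature-point terms into a global buffer, kernel 2 reads
 * them back to build the element arrays. Names are placeholders. */
#include <CL/cl.h>

void run_two_kernel_strategy(cl_command_queue q,
                             cl_kernel k_aux, cl_kernel k_elem,
                             size_t n_elems, size_t n_quad,
                             size_t local_size)
{
    /* kernel 1: one work-item per (element, integration point) pair */
    size_t gws1 = n_elems * n_quad;
    clEnqueueNDRangeKernel(q, k_aux, 1, NULL, &gws1, NULL, 0, NULL, NULL);

    /* kernel 2: one work-group per element, threads split the output entries */
    size_t gws2 = n_elems * local_size;
    clEnqueueNDRangeKernel(q, k_elem, 1, NULL, &gws2, &local_size,
                           0, NULL, NULL);
    clFinish(q);
}
```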

Computational example: elasticity problem
Higher order approximation:
- numerical integration for several degrees p (from 2 to 7)
- parallelization for a single element, based on data decomposition for the output arrays
  - parallelization of the double loop over shape functions
- different options for the placement of auxiliary arrays
- Linux, C, OpenCL
- several processor architectures (with the same, portable OpenCL kernels):
  - Intel Xeon E5-2670
  - AMD Radeon HD5870 (Cypress) and HD7950 (Tahiti PRO)
  - Nvidia GeForce GTX580 (Fermi) and Tesla M2075 (Fermi)

Performance results for the elasticity problem, p = 3

Performance results for the elasticity problem, p = 5

Computational example: convection-diffusion problem
First order approximation:
- a simple Poisson (Laplace) problem and a more computationally intensive convection-diffusion problem
- prismatic and tetrahedral elements
- one element - one thread parallelization strategy
- different options for the placement of auxiliary arrays
- Linux, C, OpenCL
- several processor architectures (with the same, portable OpenCL kernels):
  - Intel Xeon E5-2620 (in dual socket configuration)
  - Intel Xeon Phi 5110P (Knights Corner)
  - Nvidia Tesla K20M (Kepler)

Performance results for the convection-diffusion problem
[Figure: execution time [ns] for Tesla K20m, Xeon Phi 5110P and Xeon E5-2620, for the Poisson and convection-diffusion test cases on tetrahedral and prismatic elements.]

Performance results for the convection-diffusion problem
[Figure: memory performance [GB/s] and arithmetic performance [GFLOPS] for Tesla K20m, Xeon Phi 5110P and Xeon E5-2620, for the Poisson and convection-diffusion test cases on tetrahedral and prismatic elements.]

Performance results for the convection-diffusion problem
[Figure: performance as a percentage of the benchmark maximum for Tesla K20m, Xeon Phi 5110P and Xeon E5-2620, for the Poisson and convection-diffusion test cases on tetrahedral and prismatic elements.]

Computational example: convection-diffusion problem
Code optimizations for GPUs:
- classical optimizations (automatic): loop invariant code motion, common subexpression elimination, loop unrolling, induction variable simplification, etc.
- compiler directive based parameter tuning (see the sketch below):
  - placing different arrays in different levels of the memory hierarchy
  - large number of options
  - automatic testing of the parameter space
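One common way to realise such parameter tuning is to compile the same kernel source with different preprocessor options (for example, selecting where each auxiliary array is placed) and to time each variant. The sketch below illustrates the idea with hypothetical option strings and a hypothetical kernel name; it is not the authors' tuning framework:

```c
/* Host-side sketch of automatic parameter-space testing: build the same
 * OpenCL kernel source with different -D options and keep the fastest
 * variant. Option strings and the kernel name are placeholders. */
#include <CL/cl.h>
#include <stdio.h>

static const char *options[] = {
    "-D PHI_IN_LOCAL=1 -D GEO_IN_LOCAL=1",
    "-D PHI_IN_LOCAL=1 -D GEO_IN_LOCAL=0",
    "-D PHI_IN_LOCAL=0 -D GEO_IN_LOCAL=1",
    "-D PHI_IN_LOCAL=0 -D GEO_IN_LOCAL=0",
};

cl_kernel tune_kernel(cl_context ctx, cl_device_id dev, cl_command_queue q,
                      const char *src,
                      double (*time_variant)(cl_command_queue, cl_kernel))
{
    cl_kernel best = NULL;
    double best_t = 1e30;
    for (size_t i = 0; i < sizeof options / sizeof options[0]; ++i) {
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        if (clBuildProgram(prog, 1, &dev, options[i], NULL, NULL) != CL_SUCCESS)
            continue;                    /* variant may not fit in local memory */
        cl_kernel k = clCreateKernel(prog, "integrate_elem", NULL);
        double t = time_variant(q, k);   /* run and time this variant */
        printf("%-40s %.3f ms\n", options[i], t);
        if (t < best_t) { best_t = t; best = k; }
    }
    return best;
}
```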

Parameter based performance tuning

Thank you.