Introducing a Cache-Oblivious Blocking Approach for the Lattice Boltzmann Method


G. Wellein, T. Zeiser, G. Hager (HPC Services, Regional Computing Center)
A. Nitsure, K. Iglberger, U. Rüde (Chair for System Simulation, Department of Computer Science)
Friedrich-Alexander-University Erlangen-Nuremberg
International Conference for Mesoscopic Methods in Engineering and Science (ICMMES), 2006

Outline
- The party is over: trends in high performance computing
- Implementing iterative LBM: achieving spatial locality
- Cache-oblivious blocking approach for the LBM: improving temporal locality
- Summary

The party is over: ever-growing processor speed and Moore's Law
- 1965: G. Moore (co-founder of Intel) claimed that the number of transistors on a processor chip doubles every 12-24 months (Intel Corp.)
- Processor speed grew at roughly the same rate
- My computer: 350 MHz (1998) -> 3,000 MHz (2004); growth rate of 43% per year, i.e. doubling every 24 months
- Does this trend continue?

The party is over: multi-core, the workhorse of numerical simulation
- Problem: Moore's Law is still valid, but increasing clock speed hits a technical wall (heat)
- Solution: reduce the clock speed of the processor, but put 2 (or more) processors (cores) on a single silicon die
- We will have to use many, less powerful processors in the future (cf. Intel Tera-Scale Computing Research Program, 2006)

The party is over: why will multi-core succeed? (By courtesy of D. Vrsalovic, Intel)
Power ~ Frequency^3. Relative performance and power of one core at different clock speeds, and of a dual-core chip:
- Max frequency: performance 1.00x, power 1.00x
- Over-clocked (+20%): performance 1.13x, power 1.73x
- Under-clocked (-20%): performance 0.87x, power 0.51x
- Dual-core, under-clocked (-20%): performance 1.73x, power 1.02x
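The power column follows the cube law exactly, while the performance column reflects sub-linear scaling of application performance with frequency; the dual-core row is simply twice the under-clocked single core. A quick arithmetic check (ours, not on the slide):

$P \propto f^3:\quad 1.2^3 \approx 1.73, \qquad 0.8^3 \approx 0.51$

Dual-core at $0.8f$: performance $\approx 2 \times 0.87 = 1.73\times$, power $\approx 2 \times 0.51 = 1.02\times$.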

Cache-Oblivious Blocking Approach: multi-core memory hierarchies
[Figure: cache-based microprocessor vs. cache-based multi-core processor (Intel Xeon 5100 / Woodcrest). In both, the arithmetic units and floating-point registers of each core are fed by an L1 cache backed by an L2 cache on the processor chip; on the multi-core chip, two cores with private L1 caches share the L2 cache. The gap between the processor chip and main memory (the DRAM gap) remains in both cases.]

The party is over: multi-core, lessons to be learned
- Single-core performance will remain constant or decrease in the future
- Several cores on a silicon die will share resources, e.g. caches
- Main memory bandwidth will not scale with the number of cores
- Heterogeneous cores on a silicon die (see IBM Cell)
Lessons to be learned:
- Reduce bandwidth requirements (cache blocking to improve spatial and temporal locality)
- Parallelization will be mandatory for most applications in the future
- Hybrid programming approaches for large-scale simulations?!

The party is over: other directions, the IBM Cell processor
- To be used in the Sony PlayStation 3
- 221 mm^2 die size, 234 million transistors
- Eight synergistic processor elements (SPEs) plus one Power processor element
- Clock speed ~4 GHz
- Peak performance (single precision) ~256 GFlop/s
- Peak performance (double precision) ~26 GFlop/s
- Roundoff = cutoff
- Programming model?
[Figure: Cell die layout with the PPU, eight SPUs, and the memory and bus interface controllers.]

The party is over: other directions, acceleration cards
ClearSpeed acceleration board:
- 50 GFlop/s DGEMM at 25 Watt
- Memory bandwidth is the limit
- Use of highly optimized offload libraries
- Built for special purposes (e.g. long-range MD simulation)
Field Programmable Gate Arrays (FPGAs):
- Configurable processor (at moderate speed, 200-500 MHz)
- Can provide massive parallelism (hundreds of bit operations per cycle)
- Not useful for DP floating-point operations
- Memory bandwidth is the limit
- Not a conventional programming approach

Implementing iterative LBM: achieving spatial locality

Cache-Oblivious Blocking Approach: discretization of the LBM

Boltzmann equation (single relaxation time, BGK):
$\partial_t f + \xi \cdot \nabla f = -\frac{1}{\lambda}\left[f - f^{(0)}\right]$
where $\xi$ is the particle velocity, $f^{(0)}$ the equilibrium distribution function, and $\lambda$ the relaxation time.

Discretization of particle velocity space (finite set of discrete velocities $\xi_\alpha$, determined by the discretization scheme):
$\partial_t f_\alpha + \xi_\alpha \cdot \nabla f_\alpha = -\frac{1}{\lambda}\left[f_\alpha - f_\alpha^{(eq)}\right]$
with $f_\alpha(x, t) = f(x, \xi_\alpha, t)$ and $f_\alpha^{(eq)}(x, t) = f^{(0)}(x, \xi_\alpha, t)$.

Cache-Oblivious Blocking Approach: discretization schemes for the LBM
Different discretization schemes in 3D (D3Q15, D3Q19, D3Q27) trade numerical accuracy and stability against computational speed and simplicity. We choose D3Q19 because of its good balance between stability and computational efficiency.
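The slide leaves the discrete velocity set implicit; for concreteness, here are the standard D3Q19 velocities and lattice weights as C tables (textbook values, not transcribed from the talk):

    /* D3Q19: 1 rest velocity, 6 axis neighbors, 12 edge diagonals */
    static const int c[19][3] = {
        { 0, 0, 0},                                     /* rest          */
        { 1, 0, 0}, {-1, 0, 0}, { 0, 1, 0}, { 0,-1, 0},
        { 0, 0, 1}, { 0, 0,-1},                         /* 6 axes        */
        { 1, 1, 0}, {-1,-1, 0}, { 1,-1, 0}, {-1, 1, 0},
        { 1, 0, 1}, {-1, 0,-1}, { 1, 0,-1}, {-1, 0, 1},
        { 0, 1, 1}, { 0,-1,-1}, { 0, 1,-1}, { 0,-1, 1}  /* 12 diagonals  */
    };
    static const double w[19] = {
        1.0/3.0,                                        /* rest          */
        1.0/18.0, 1.0/18.0, 1.0/18.0, 1.0/18.0, 1.0/18.0, 1.0/18.0,
        1.0/36.0, 1.0/36.0, 1.0/36.0, 1.0/36.0, 1.0/36.0, 1.0/36.0,
        1.0/36.0, 1.0/36.0, 1.0/36.0, 1.0/36.0, 1.0/36.0, 1.0/36.0
    };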

Cache-Oblivious Blocking Approach: stream and collide steps

Discretization in space $x$ and time $t$:
collision step (source array): $\tilde{f}_\alpha(x_i, t) = f_\alpha(x_i, t) - \omega\left[f_\alpha(x_i, t) - f_\alpha^{(eq)}(x_i, t)\right]$
streaming step (destination array): $f_\alpha(x_i + e_\alpha\,\delta t,\, t + \delta t) = \tilde{f}_\alpha(x_i, t)$

Stream-Collide (pull method): get the distributions from the neighboring cells in the source array and store the relaxed values to one cell in the destination array.
Collide-Stream (push method): take the distributions from one cell in the source array and store the relaxed values to the neighboring cells in the destination array.
We choose Collide-Stream in what follows and perform both steps in a single loop, as sketched below.
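To make the push scheme concrete, here is a minimal fused collide-stream sweep in C, written for a D2Q9 lattice rather than the talk's D3Q19 to keep it short; the layout (array-of-structures with a one-cell ghost layer) and all names are our illustrative assumptions, not the authors' code:

    #define NX 100
    #define NY 100
    #define Q  9

    static const int    cx[Q] = { 0, 1, 0,-1, 0, 1,-1,-1, 1 };
    static const int    cy[Q] = { 0, 0, 1, 0,-1, 1, 1,-1,-1 };
    static const double w[Q]  = { 4./9., 1./9., 1./9., 1./9., 1./9.,
                                  1./36., 1./36., 1./36., 1./36. };

    /* one fused collide-stream (push) sweep over the interior cells */
    void collide_stream(double src[NY+2][NX+2][Q],
                        double dst[NY+2][NX+2][Q], double omega)
    {
        for (int y = 1; y <= NY; ++y)
            for (int x = 1; x <= NX; ++x) {
                /* macroscopic density and velocity from one cell of src */
                double rho = 0.0, ux = 0.0, uy = 0.0;
                for (int i = 0; i < Q; ++i) {
                    rho += src[y][x][i];
                    ux  += cx[i] * src[y][x][i];
                    uy  += cy[i] * src[y][x][i];
                }
                ux /= rho; uy /= rho;
                double usq = ux * ux + uy * uy;

                /* relax towards equilibrium, push results to the
                   neighboring cells of the destination array */
                for (int i = 0; i < Q; ++i) {
                    double cu  = cx[i] * ux + cy[i] * uy;
                    double feq = w[i] * rho
                               * (1.0 + 3.0*cu + 4.5*cu*cu - 1.5*usq);
                    dst[y + cy[i]][x + cx[i]][i] =
                        src[y][x][i] - omega * (src[y][x][i] - feq);
                }
            }
    }

The same structure carries over to D3Q19 with 19 velocities and a third spatial loop.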

Cache-Oblivious Blocking Approach: spatial and temporal blocking basics (full matrix)
- Spatial blocking: once a cache line is in the cache, all its entries should be used!
- Investigate the data layout: F(i,x,y,z,t) vs. F(x,y,z,i,t)
- Implement spatial blocking
- Iterative LBM: each time step is performed on all cells of the full domain
[1] G. Wellein, T. Zeiser, G. Hager, and S. Donath, Comp. & Fluids, Vol. 35 (2006)
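In the Fortran notation of the slide the leftmost index varies fastest, so F(i,x,y,z,t) stores the 19 distributions of one cell contiguously (collision-friendly), while F(x,y,z,i,t) stores all cells of one velocity contiguously (propagation-friendly). A sketch of the two linearizations as C index macros (our illustration; Q, NX, NY, NZ assumed):

    /* F(i,x,y,z): i fastest -> the Q distributions of a cell are adjacent */
    #define IDX_CELL(i, x, y, z) ((((z)*NY + (y))*NX + (x))*Q + (i))

    /* F(x,y,z,i): x fastest -> all cells of one velocity i are adjacent */
    #define IDX_VEL(i, x, y, z)  ((((i)*NZ + (z))*NY + (y))*NX + (x))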

Cache-Oblivious Blocking Approach: spatial and temporal blocking basics (sparse LBM)
[Figure: performance of sparse LBM implementations (aos, aos_split3, aos_split5, soa, soa_split3, soa_split5) on an AMD Opteron 270 (4 cores, 2 GHz), for a square channel and for a porous medium (sphere packing).]

Cache-Oblivious Blocking Approach for the LBM: improving temporal locality

Cache-Oblivious Blocking Approach: spatial and temporal blocking basics
- Temporal blocking: load small blocks into cache and perform several time steps on them before loading the next block
- Choose appropriate block sizes
- Optimize the kernel for cache performance
- Time-blocked LBM applications: a fixed number of time steps is performed on the domain
- Implement the idea of Frigo et al. for the LBM
[Figure: a block of 8 sites (site index i = 1..8) of a long 1D chain, on which time steps t = 1..4 are performed before the next block is loaded.]

Cache-Oblivious Blocking Approach: temporal blocking of the LBM using Frigo's method
Standard temporal blocking approaches: appropriate block sizes and time step counts must be determined for each cache size and cache hierarchy.
The cache-oblivious blocking approach was introduced by Frigo, Prokop, et al. for matrix transpose / FFT / sorting:
[2] Harald Prokop. Cache-Oblivious Algorithms. Master's thesis, MIT, 1999.
[3] M. Frigo, C.E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In Proceedings of the 40th IEEE Symposium on Foundations of Computer Science (FOCS '99), pp. 285-297, 1999.
Cache-oblivious blocking (recursive) approach (COBRA):
- Independent of hardware parameters, e.g. cache size and cache-line length
- The algorithm chooses an optimal amount of work and moves data optimally among multiple levels of cache
- Can easily be extended to stencils of higher order

Cache-Oblivious Blocking Approach: basic idea
A recursive algorithm with space and time cuts defines spacetime domains which fit into cache and allow several time steps to be performed.
Example in one spatial dimension (1D): a spacetime region over [x0, x1] x [t0, t1] is either split along a slanted line in space into two trapezoids A1 and A2 (space cut), or split at the half time into a lower trapezoid A1 and an upper trapezoid A2 (time cut); A1 is processed before A2 in both cases.

Cache-Oblivious Blocking Approach (COBRA): structure of the recursive algorithm

    void walk1(int t0, int t1, int x0, int ix0, int x1, int ix1)
    {
        int dt = t1 - t0;
        if (dt == 1) {
            /* base case: one time step on every site of the interval;
               kernel(t, x) is the per-site LBM update (solve kernel) */
            for (int x = x0; x < x1; ++x)
                kernel(t0, x);
        } else if (dt > 1) {
            if (2 * (x1 - x0) + (ix1 - ix0) * dt >= 4 * dt) {
                /* space cut: split along a slanted line through xm */
                int xm = (2 * (x0 + x1) + (2 + ix0 + ix1) * dt) / 4;
                walk1(t0, t1, x0, ix0, xm, -1);
                walk1(t0, t1, xm, -1, x1, ix1);
            } else {
                /* time cut: split at half time */
                int s = dt / 2;
                walk1(t0, t0 + s, x0, ix0, x1, ix1);
                walk1(t0 + s, t1, x0 + ix0 * s, ix0, x1 + ix1 * s, ix1);
            }
        }
    }

Here (x0, x1) is the spatial interval at time t0 and ix0, ix1 are the slopes with which its boundaries move per time step.
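A minimal driver for walk1 (our sketch, not part of the talk), replacing the LBM update by a 1D three-point Jacobi kernel with fixed boundary values and two time-alternating arrays:

    #include <stdio.h>

    #define N 1000          /* number of sites      */
    #define T 500           /* number of time steps */

    static double u[2][N];  /* u[t%2] holds time level t */

    /* update site x from time level t to t+1 */
    static void kernel(int t, int x)
    {
        u[(t + 1) & 1][x] =
            (u[t & 1][x - 1] + u[t & 1][x] + u[t & 1][x + 1]) / 3.0;
    }

    void walk1(int t0, int t1, int x0, int ix0, int x1, int ix1); /* above */

    int main(void)
    {
        for (int x = 0; x < N; ++x)
            u[0][x] = u[1][x] = (x == N / 2) ? 1.0 : 0.0;  /* initial pulse */
        /* interior sites only; boundary sites 0 and N-1 stay fixed */
        walk1(0, T, 1, 0, N - 1, 0);
        printf("center value after %d steps: %g\n", T, u[T & 1][N / 2]);
        return 0;
    }

The recursion order respects the stencil's true dependencies, so the two-array storage is safe even though several time levels are in flight at once.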

Cache-Oblivious Blocking Approach (COBRA): single-processor performance
[Figure: single-processor performance, COBRA vs. the iterative implementation.]

Cache-Oblivious Blocking Approach (COBRA): parallelization
- Simple wavefront in spacetime
- Use COBRA to cut spacetime into blocks of size t_B x t_B with t_B << t_max
- Blocks on diagonals can be processed in parallel
- Maximum length of a diagonal: min(t_max / t_B, D / (2 t_B))
[Figure: 2D spacetime projection of a 3D domain (100^3) with t_max = 100 and t_B = 10.]

Cache-Oblivious Blocking Approach: parallelization, implementation
- Use a shared-memory approach
- Standard OpenMP does not yet fully support nested parallelism
- Use Intel's #pragma intel omp taskq
- Use OpenMP locking functions to synchronize threads before starting the next wavefront (see the sketch below)
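A sketch of the wavefront scheme: the taskq pragma is an Intel-specific extension, so the following uses standard OpenMP 3.0 tasks (which appeared after this talk) as a stand-in; process_block and blocks_on_diagonal are hypothetical helpers, not the authors' code:

    #include <omp.h>

    void process_block(int d, int b);   /* assumed: COBRA walk on one
                                           t_B x t_B spacetime block  */
    int  blocks_on_diagonal(int d);     /* assumed: length of diagonal d */

    void run_wavefronts(int num_diagonals)
    {
        #pragma omp parallel
        #pragma omp single
        for (int d = 0; d < num_diagonals; ++d) {
            /* all blocks on one spacetime diagonal are independent */
            for (int b = 0; b < blocks_on_diagonal(d); ++b) {
                #pragma omp task firstprivate(d, b)
                process_block(d, b);
            }
            /* wait for the whole diagonal before the next wavefront */
            #pragma omp taskwait
        }
    }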

Cache-Oblivious Blocking Approach: parallelization, parallel performance and scalability
Parameters: t_B, t_max, number of threads. For D = 100 and 4 threads:
- t_B = 10 -> 5 diagonals
- t_B = 20 -> 2.5 diagonals
- t_B = 25 -> 2 diagonals
Maximum length of a diagonal: min(t_max / t_B, D / (2 t_B)).

Cache-Oblivious Blocking Approach (COBRA): parallelization, parallel performance and scalability
Parameters: block size t_B, t_max, number of threads.
Theoretical performance limit of the iterative method = bandwidth [MByte/s] / 456 [Byte/FLUP].
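The 456 Byte/FLUP figure matches a D3Q19 double-precision update on separate source and destination arrays with write-allocate traffic (our reading, not spelled out on the slide):

$19 \times 8\,\text{Byte} \times (1_{\text{load}} + 1_{\text{store}} + 1_{\text{write-allocate}}) = 456\ \text{Byte/FLUP}$

so, for example, a hypothetical 6.4 GByte/s memory bus would cap the iterative kernel at roughly $6400 / 456 \approx 14$ MFLUP/s.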

Cache-Oblivious Blocking Approach (COBRA): parallelization, parallel performance and scalability
[Figure: parallel performance and scalability for varying block size t_B, t_max, and number of threads.]

Summary & Outlook
Iterative LBM:
- Efficient implementation strategies for simple (full matrix) and complex (sparse LBM) geometries are available
- Easy parallelization through domain decomposition (preferably MPI)
Time-blocked LBM:
- The Cache-Oblivious Blocking Approach (COBRA) has high potential to overcome the bandwidth limitations for simple geometries
- Shared-memory parallelization through a task-queue model
- Use on complex geometries (sparse LBM)?
- Pure MPI parallelization: large overhead through multiple ghost layers!
- Hybrid parallelization approach (OpenMP within a node + MPI between nodes)?

Thank you! http://www.rrze.uni-erlangen.de/hpc/
Acknowledgements: R. Vogelsang, R. Wolff (SGI); W. Oed (CRAY); Th. Schoenemeyer (NEC); M. Brehm et al. (LRZ); U. Küster, P. Lammers (HLRS); H. Cornelius, H. Bast, A. Semin (Intel)