Auto-tuning Multigrid with PetaBricks

Cy Chan
Joint work with: Jason Ansel, Yee Lok Wong, Saman Amarasinghe, Alan Edelman
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology

Algorithmic Choice in Sorting
[Five figure-only slides]

Variable Accuracy Algorithms
- Lots of algorithms where the accuracy of output can be tuned:
  - Iterative algorithms (e.g. solvers, optimization)
  - Signal processing (e.g. images, sound)
  - Approximation algorithms
- Can trade accuracy for speed
- All the user wants: solve to a certain accuracy as fast as possible, using whatever algorithms necessary!

The PetaBricks Language
- General-purpose language and auto-tuner
- Support for algorithmic choices and variable accuracy built into the language
  - Specify multiple algorithms and accuracy levels
  - Auto-tune parameters (e.g. number of iterations) to produce programs of different accuracy
- Multigrid is a prime target:
  - Iterative linear solver algorithm
  - Lots of choices!

Outline
- Auto-tuning with PetaBricks
- Tuning the V-Cycle
- Extension to Auto-tuning Full Multigrid Cycles
- Performance Results

PetaBricks Language Example: Sort

    transform Sort
    from A[n]
    to B[n]
    {
      // Recursive case, merge sort
      to(B b) from(A a) {
        (a1, a2) = Split(a);
        b1 = Sort(a1);
        b2 = Sort(a2);
        b = Merge(b1, b2);
      }
      OR
      // Base case, insertion sort
      to(B b) from(A a) {
        b = InsertionSort(a);
      }
    }

[Figure labels: merge sort at sizes 4096, 2048, and 1024; insertion sort]
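
The same idea can be sketched outside PetaBricks. The Python below is only an illustration of algorithmic choice, not the PetaBricks compiler or its auto-tuner: `hybrid_sort`, its `cutoff` parameter, and `tune_cutoff` are hypothetical names, and the "tuner" simply times a few candidate cutoffs on random inputs.

```python
import random
import time


def insertion_sort(a):
    """Base case: insertion sort, returning a new sorted list."""
    b = list(a)
    for i in range(1, len(b)):
        x = b[i]
        j = i - 1
        while j >= 0 and b[j] > x:
            b[j + 1] = b[j]
            j -= 1
        b[j + 1] = x
    return b


def merge(b1, b2):
    """Merge two sorted lists into one sorted list."""
    out, i, j = [], 0, 0
    while i < len(b1) and j < len(b2):
        if b1[i] <= b2[j]:
            out.append(b1[i])
            i += 1
        else:
            out.append(b2[j])
            j += 1
    out.extend(b1[i:])
    out.extend(b2[j:])
    return out


def hybrid_sort(a, cutoff=64):
    """Merge sort above `cutoff` elements, insertion sort at or below it.

    `cutoff` is a hypothetical tunable standing in for the choice an
    auto-tuner would make between the two rules.
    """
    if len(a) <= max(cutoff, 1):
        return insertion_sort(a)
    mid = len(a) // 2
    return merge(hybrid_sort(a[:mid], cutoff), hybrid_sort(a[mid:], cutoff))


def tune_cutoff(candidates, n=2000, trials=3):
    """Time each candidate cutoff on random inputs and return the fastest."""
    data = [[random.random() for _ in range(n)] for _ in range(trials)]
    best, best_time = None, float("inf")
    for cutoff in candidates:
        start = time.perf_counter()
        for a in data:
            hybrid_sort(a, cutoff)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best, best_time = cutoff, elapsed
    return best
```

The selected cutoff depends on the machine and the input size, which is the motivation for tuning by measurement rather than by a fixed rule.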

Modeling Costs
- Algorithmic complexity, compiler complexity, memory system complexity, and processor complexity all impact performance
- No simultaneous model for all of these!
- Solution: use learning!

PetaBricks Work Flow
[Figure: work-flow diagram with the stages PetaBricks Source, PetaBricks Compiler, Tunable Executable, Configuration File, Static Executable]

A Very Brief Multigrid Intro
- Used to iteratively solve PDEs over a gridded domain
- Relaxations update points using neighboring values (stencil computations)
- Restrictions and interpolations compute a new grid with coarser or finer discretization
[Figure: resolution plotted against compute time, with the steps "relax on current grid", "restrict to coarser grid", and "interpolate to finer grid"]
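
As an illustration of a relaxation as a stencil computation, here is a minimal sketch of one weighted Jacobi sweep for the 2D Poisson problem -Δu = f on a square grid with fixed boundary values. The function name `relax_2d`, the list-of-lists grid layout, and the default damping factor `omega=0.8` are assumptions made for this example, not details taken from the talk.

```python
def relax_2d(u, f, h, omega=0.8):
    """One weighted Jacobi sweep with the 5-point stencil.

    u and f are square lists of lists; h is the grid spacing.
    Boundary values of u are left unchanged.
    """
    n = len(u)
    new = [row[:] for row in u]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            # Jacobi update: average of the four neighbors plus the source term.
            jacobi = 0.25 * (u[i - 1][j] + u[i + 1][j]
                             + u[i][j - 1] + u[i][j + 1]
                             + h * h * f[i][j])
            # Damp the update by blending with the old value.
            new[i][j] = (1.0 - omega) * u[i][j] + omega * jacobi
    return new
```

Each interior point is updated only from its four neighbors, which is why such sweeps remove oscillatory error quickly but smooth error slowly, and why coarser grids are brought in.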

Multigrid Cycles
- Standard approaches: V-Cycle, W-Cycle, Full MG V-Cycle
- How coarse do we go?
- Relaxation operator?
- How many iterations?

Multigrid Cycles
- Generalize the idea of what a multigrid cycle can look like
- Example: [figure of a cycle annotated with relaxation steps and a direct or iterative shortcut]
- Goal: auto-tune cycle shape for specific usage

Algorithmic Choice in Multigrid
- Need framework to make fair comparisons
- Perspective of a specific grid resolution
- How to get from A to B?
  - Direct
  - Iterative
  - Recursive: restrict, ?, interpolate

Algorithmic Choice in Multigrid
- Tuning cycle shape!
- Examples of recursive options for getting from A to B:
  - Standard V-cycle
  - Take a shortcut at a coarser resolution
  - Iterating with shortcuts
- Once we pick a recursive option, how many times do we iterate?
  [Figure: successive iterations from A through B, C, D reach higher accuracy]
- Number of iterations depends on what accuracy we want at the current grid resolution!

Comparing Cycle Shapes
- Different convergence AND execution rates
- Need a way to make fair comparisons
- Measure accuracy: reduction of RMS error
  - Example: a cycle has accuracy level 10^3 if the RMS error of the guess is reduced by a factor of 10^3
  - Must train on representative data
  - Imperfect metric: ignores error frequency
- Use accuracy AND time to make comparisons between cycle shapes
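
The accuracy measure described above can be written down directly. This is a minimal sketch, assuming the exact solution is available for the training input; the function names `rms` and `accuracy_level` are hypothetical and are not the `Poisson2D_metric` used in the talk.

```python
import math


def rms(values):
    """Root-mean-square of a flat list of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def accuracy_level(guess, output, exact):
    """Ratio of the RMS error of the initial guess to that of the output.

    A return value of 1e3 means the cycle reduced the RMS error by a
    factor of 10^3. Returns infinity if the output is exact.
    """
    err_in = rms([g - e for g, e in zip(guess, exact)])
    err_out = rms([o - e for o, e in zip(output, exact)])
    return math.inf if err_out == 0.0 else err_in / err_out
```

Because the metric is a single number, two cycles with the same accuracy level may still leave errors of very different frequency content, which is the imperfection the slide notes.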

Optimal Subproblems
- Plot all cycle shapes for a given grid resolution
- Keep only the optimal ones!
- Idea: maintain a family of optimal algorithms for each grid resolution
[Figure: plot of cycle shapes with the "better" direction marked]

The Discrete Solution
- Problem: too many optimal cycle shapes to remember
- Solution: remember the fastest algorithms for a discrete set of accuracies
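
A minimal sketch of the two selection steps, assuming each candidate cycle shape has already been measured as a `(name, accuracy, seconds)` tuple. The function names and the tuple layout are hypothetical; the talk does not show how the auto-tuner stores its measurements.

```python
def pareto_front(candidates):
    """Keep candidates that no other candidate beats on both accuracy and time."""
    front = []
    for name, acc, secs in candidates:
        dominated = any(
            a2 >= acc and s2 <= secs and (a2 > acc or s2 < secs)
            for _, a2, s2 in candidates
        )
        if not dominated:
            front.append((name, acc, secs))
    return front


def fastest_per_bin(candidates, bins):
    """For each accuracy target, keep the fastest candidate that meets it."""
    best = {}
    for target in bins:
        meets = [c for c in candidates if c[1] >= target]
        best[target] = min(meets, key=lambda c: c[2]) if meets else None
    return best
```

The first function corresponds to keeping only the optimal shapes; the second reduces that family to one entry per accuracy bin, so the amount remembered per grid resolution stays fixed.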

Use Dynamic Programming to Manage Auto-tuning Search
- Only search cycle shapes that utilize optimized sub-cycles in recursive calls
- Build optimized algorithms from the bottom up
- Allow shortcuts to stop recursion early
- Allow multiple iterations of sub-cycles to explore time vs. accuracy space
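
The bottom-up search can be sketched as a loop over grid levels, coarsest first. This is an assumption-laden outline rather than the PetaBricks auto-tuner: `tune_bottom_up`, the choice tuples, and the `measure` callback (which would run a candidate and report its accuracy and time) are all hypothetical.

```python
def tune_bottom_up(num_levels, bins, max_iters, measure):
    """Build table[(level, bin)] -> choice, from the coarsest level up.

    measure(level, choice, table) is a hypothetical callback returning
    (accuracy, seconds) for running `choice` at `level`. Recursive choices
    can consult `table`, which already holds the tuned coarser levels.
    """
    table = {}
    for level in range(1, num_levels + 1):
        # Shortcut choices stop the recursion at this level.
        candidates = [("direct",), ("iterate",)]
        if level > 1:
            # Recursive choices: call the tuned sub-cycle for some accuracy
            # bin at the next coarser level, some number of times.
            candidates += [("recurse", sub_bin, iters)
                           for sub_bin in bins
                           for iters in range(1, max_iters + 1)]
        timed = [(c,) + tuple(measure(level, c, table)) for c in candidates]
        for target in bins:
            meets = [t for t in timed if t[1] >= target]
            table[(level, target)] = (
                min(meets, key=lambda t: t[2])[0] if meets else None
            )
    return table
```

Since each level only considers sub-cycles that were already selected for the level below, the number of candidates per level grows with the number of bins and iteration counts rather than with the number of possible cycle shapes.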

Auto-tuning the V-cycle

    transform Multigrid_k
    from X[n,n], B[n,n]
    to Y[n,n]
    {
      // Base case
      // Direct solve
      OR
      // Base case
      // Iterative solve at current resolution
      OR
      // Recursive case
      // For some number of iterations
      //   Relax
      //   Compute residual and restrict
      //   Call Multigrid_i for some i
      //   Interpolate and correct
      //   Relax
    }

- Algorithmic choice
  - Shortcut base cases
  - Recursively call some optimized sub-cycle
- Iterations and recursive accuracy let us explore accuracy versus performance space
- Only remember best versions
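
To make the three rules concrete, here is a minimal runnable sketch for the 1D model problem -u'' = f with zero boundary values (1D is used only to keep the example short; the talk tunes a 2D solver). Everything here is an assumption for illustration: the function names, the weighted Jacobi smoother, and the `table` that maps `(level, accuracy_bin)` to a choice, which stands in for a tuned configuration.

```python
def relax(u, f, h, sweeps, omega=2.0 / 3.0):
    """Weighted Jacobi sweeps for -u'' = f with fixed end points."""
    for _ in range(sweeps):
        new = u[:]
        for i in range(1, len(u) - 1):
            jacobi = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])
            new[i] = (1.0 - omega) * u[i] + omega * jacobi
        u = new
    return u


def residual(u, f, h):
    """r = f - A u at interior points, zero on the boundary."""
    r = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        r[i] = f[i] - (2.0 * u[i] - u[i - 1] - u[i + 1]) / (h * h)
    return r


def restrict(r):
    """Full weighting onto a grid with half as many intervals."""
    nc = (len(r) - 1) // 2
    rc = [0.0] * (nc + 1)
    for j in range(1, nc):
        rc[j] = 0.25 * r[2 * j - 1] + 0.5 * r[2 * j] + 0.25 * r[2 * j + 1]
    return rc


def interpolate(ec):
    """Linear interpolation onto a grid with twice as many intervals."""
    ef = [0.0] * (2 * (len(ec) - 1) + 1)
    for j in range(len(ec) - 1):
        ef[2 * j] = ec[j]
        ef[2 * j + 1] = 0.5 * (ec[j] + ec[j + 1])
    ef[-1] = ec[-1]
    return ef


def direct_solve(f, h):
    """Thomas algorithm for (2 u_i - u_{i-1} - u_{i+1}) / h^2 = f_i."""
    n = len(f) - 2  # number of interior unknowns
    cp, dp = [0.0] * n, [0.0] * n
    for i in range(n):
        prev_c = cp[i - 1] if i > 0 else 0.0
        prev_d = dp[i - 1] if i > 0 else 0.0
        denom = 2.0 + prev_c
        cp[i] = -1.0 / denom
        dp[i] = (h * h * f[i + 1] + prev_d) / denom
    u = [0.0] * (n + 2)
    for i in range(n - 1, -1, -1):
        u[i + 1] = dp[i] - cp[i] * u[i + 2]
    return u


def tuned_cycle(u, f, h, level, acc, table):
    """Run the choice recorded in table[(level, acc)]; level k has 2^k + 1 points."""
    choice = table[(level, acc)]
    if choice[0] == "direct":            # shortcut: direct solve
        return direct_solve(f, h)
    if choice[0] == "iterate":           # shortcut: iterate at this resolution
        return relax(u, f, h, sweeps=choice[1])
    _, sub_acc, iterations = choice      # recursive case
    for _ in range(iterations):
        u = relax(u, f, h, 1)
        rc = restrict(residual(u, f, h))
        ec = tuned_cycle([0.0] * len(rc), rc, 2.0 * h, level - 1, sub_acc, table)
        u = [ui + ei for ui, ei in zip(u, interpolate(ec))]
        u = relax(u, f, h, 1)
    return u
```

A table that always recurses once and solves directly on the coarsest grid reproduces a standard V-cycle; other entries give the shortcut and repeated sub-cycle shapes discussed above.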

Variable Accuracy Keywords
- accuracy_variable: tunable variable
- accuracy_metric: returns accuracy of output
- accuracy_bins: set of discrete accuracy bins
- generator: creates random inputs for accuracy measurement

    transform Multigrid_k
    from X[n,n], B[n,n]
    to Y[n,n]
    accuracy_variable numiterations
    accuracy_metric Poisson2D_metric
    accuracy_bins 1e1 1e3 1e5 1e7
    generator Poisson2D_Generator
    {

Training the Discrete Solution
[Figures: for each of accuracy levels 1 to 4, resolution i is already optimized while resolution i+1 is being trained, after which resolution i+1 is optimized as well. A final diagram marks the tuning order between coarser and finer grids and the possible choices, with 1x and 2x labels (shortcuts not shown).]

Example: Auto-tuned 2D Poisson's Equation Solver
[Figure: tuned cycles for accuracy levels 10, 10^3, and 10^7, drawn between finer and coarser grids]

Auto-tuned Cycles for 2D Poisson Solver
[Figure: cycle shapes for accuracy levels a) 10, b) 10^3, c) 10^5, d) 10^7]
- Optimized substructures visible in cycle shapes

Extension to Full Multigrid
- Build auto-tuned Full Multigrid cycles out of auto-tuned V-cycles
- Two phases:
  - Estimation phase: restrict and recursively call auto-tuned Full Multigrid at coarser grid resolution
  - Solve phase: interpolate and run auto-tuned V-cycle at current grid resolution
- Choose accuracy level of each phase independently
- Use dynamic programming
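
The two-phase structure can be sketched without any numerics by tracing which grid levels a cycle visits. In this hypothetical sketch, level 1 is the coarsest grid, `v_table` and `full_table` stand in for tuned configurations keyed by `(level, accuracy_bin)`, and the functions return the sequence of levels visited rather than solving anything.

```python
def v_shape(level, acc, v_table):
    """Levels visited by the tuned V-type cycle chosen for (level, acc)."""
    choice = v_table[(level, acc)]
    if choice[0] != "recurse":
        return [level]                  # shortcut: solve at this level
    _, sub_acc, iterations = choice
    path = [level]
    for _ in range(iterations):
        # Go down through the tuned sub-cycle, then come back up.
        path += v_shape(level - 1, sub_acc, v_table) + [level]
    return path


def full_shape(level, acc, full_table, v_table):
    """Levels visited by a full cycle built from tuned V-type cycles."""
    choice = full_table[(level, acc)]
    if choice[0] == "direct":
        return [level]
    _, estimate_acc, solve_acc = choice
    # Estimation phase: restrict, then run the tuned full cycle one level coarser.
    path = [level] + full_shape(level - 1, estimate_acc, full_table, v_table)
    # Solve phase: interpolate, then run the tuned V-type cycle at this level.
    return path + v_shape(level, solve_acc, v_table)
```

With tables that always recurse once, `full_shape` reproduces the familiar full multigrid pattern; because `estimate_acc` and `solve_acc` are separate entries, the two phases can use different accuracy levels.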

Auto-tuned Full Multigrid Cycles for 2D Poisson Solver
[Figure: cycle shapes for accuracy levels a) 10, b) 10^3, c) 10^5, d) 10^7]

Benchmark Application: Solving 2D Poisson's Equation
- Solve 2D Poisson's equation on random data (uniform over [-2^32, 2^32]) for problems of size 2^n for n = 2, 3, ..., 12
- Reference algorithms (also in PetaBricks):
  - Reference Multigrid: iterate using V-cycle until accuracy target is reached
  - Reference Full Multigrid: estimate using a standard Full Multigrid iteration, then iterate using V-cycle until accuracy target is reached

Performance Testbed
- Shared memory machines:
  - Intel Harpertown: two quad-core 3.2 GHz Xeons
  - AMD Barcelona: two quad-core 2.4 GHz Opterons
  - Sun Niagara: one quad-core 1.2 GHz T1
- PetaBricks compiler still under development
  - Some low-level optimizations not yet supported (no explicit pre-fetching or SIMD vectorization)
  - Focus on tuning and comparing cycle shapes

Impact of Auto-tuning Intel Harpertown (2 Sockets, 8 Cores)

Impact of Auto-tuning AMD Barcelona (2 Sockets, 8 Cores)

Impact of Auto-tuning Sun Niagara (4 Cores, 32 Threads)

Tuned Cycles Across Architectures
- Tuned cycles to achieve accuracy 10^5 at resolution 2^11
[Figure: cycle shapes for i) Intel Harpertown, ii) AMD Barcelona, iii) Sun Niagara]

Selected Related Work
- Auto-tuning software:
  - FFTW: Fast Fourier Transform
  - ATLAS, FLAME: Linear Algebra
  - SPARSITY, OSKI: Sparse Matrices
  - STAPL: Template Framework Library
  - SPL: Digital Signal Processing
- Tuning multigrid:
  - SuperSolvers: Composite Linear Solver
  - Sellappa and Chatterjee (2004), Rivera and Tseng (2000): Cache-Aware Multigrid
  - Thekale, Gradl, Klamroth, Rüde (2009): Optimizing Iterations of V-Cycles in Full Multigrid

Future Work
- Add support for auto-tuning other aspects of multigrid
  - Tuning of relaxation, interpolation, and restriction operators
  - Low-level optimizations: explicit prefetch and vectorization
- Add support for tuning data movement in AMR
  - Parameterize tuned subproblems by data location in addition to size and accuracy
  - Try different data layouts during recursion

General PetaBricks Future Work
- Dynamic choices during execution
- Support for other parallel architectures
  - Distributed memory machines
  - Heterogeneous clusters (e.g. CPU + GPGPU)
- Sparse matrix support
  - Auto-tune sparse matrix storage format, e.g. CSR, CSC, COO, ELLPACK
  - Register block sizes, cache block sizes

Conclusion
- Auto-tuning with PetaBricks
  - Algorithmic choice
  - Variable accuracy
- Auto-tuning Multigrid Cycles
  - Construct more efficient multigrid solvers
  - Use dynamic programming
  - Speedup shown over reference algorithms

Thanks!
http://projects.csail.mit.edu/petabricks/