
1. Automatically Tuned Linear Algebra Software (ATLAS)
R. Clint Whaley
Innovative Computing Laboratory, University of Tennessee

2. Outline
- Pre-intro: BLAS
- Motivation
- What is ATLAS
- Present release
- How ATLAS works
- Performance results
- Future work (short term)
- Open sourcing ATLAS

3. Basic Linear Algebra Subprograms (BLAS)
- Level 3: matrix-matrix operations - gemm, symm, hemm, syrk, herk, syr2k, her2k, trmm, trsm
- Level 2: matrix-vector operations - gemv, hemv, symv, trmv, trsv; ger, geru, gerc, her, her2, syr, syr2
- Level 1: vector-vector operations - swap, scal, copy, axpy, dot, nrm2, asum, iamax
- Packed and banded variants
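For readers who have not called the BLAS directly, here is a minimal example of invoking the Level 3 routine dgemm through the C interface (cblas) that ATLAS supplies; the 2x2 sizes are arbitrary, and linking flags vary by installation (typically -lcblas -latlas).

```c
/* Minimal cblas_dgemm example: C = alpha*A*B + beta*C, row-major 2x2. */
#include <stdio.h>
#include <cblas.h>

int main(void)
{
    double A[4] = {1.0, 2.0,
                   3.0, 4.0};
    double B[4] = {5.0, 6.0,
                   7.0, 8.0};
    double C[4] = {0.0, 0.0,
                   0.0, 0.0};

    /* C = 1.0 * A * B + 0.0 * C */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2,          /* M, N, K          */
                1.0, A, 2,        /* alpha, A, lda    */
                B, 2,             /* B, ldb           */
                0.0, C, 2);       /* beta, C, ldc     */

    printf("C = [%g %g; %g %g]\n", C[0], C[1], C[2], C[3]);
    return 0;
}
```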

4. The Problem
- For many operations, there is no such thing as enough compute power
- We therefore need to extract near-peak performance even as hardware changes at the breakneck pace of Moore's Law
- Extracting near-optimal performance is tedious, time consuming, and requires expertise in many fields
- Optimization is not portable

5. Solution, Part A: Create Libraries
- Isolate time-critical sections of code; define and agree on an API (BLAS)
- Get experts in all needed fields (type of computation, hardware platform, and programming environment) to optimize it
PROBLEMS:
- Demand for experts far outstrips supply
- Even with experts, by the time a library is fully optimized, the target architecture is well on its way towards obsolescence

6. Solution, Part B: AEOS
AEOS: Automated Empirical Optimization of Software
- KEY IDEA: Automate the tuning process so it can be done by computer, rather than by a team of experts
- GOAL: An optimized, portable library available for a new platform in minutes or hours rather than months or years

7. What is ATLAS
A package that adapts to differing architectures via AEOS techniques
- Initially, supplies the BLAS
Automated Empirical Optimization of Software (AEOS):
- Machine searches the optimization space
- Finds the application-apparent architecture
AEOS requires (see the sketch after this list):
- A method of code variation
  » Parameterization
  » Multiple implementation
  » Code generation
- Sophisticated timers
- A robust search heuristic
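As a toy illustration of the AEOS loop (not ATLAS's actual search, which is far more sophisticated): produce a parameterized code variant, time it, and keep the best point in the search space. Here the variant is a matrix multiply blocked by nb, and the "search" simply sweeps a few candidate block sizes; run_variant and candidates are illustrative names, not ATLAS code.

```c
/* Toy AEOS sketch: time one parameterized variant at several candidate
 * block sizes and keep the fastest. */
#include <stdio.h>
#include <time.h>

#define N 512
static double A[N][N], B[N][N], C[N][N];

/* One point in the optimization space: multiply blocked by nb. */
static void run_variant(int nb)
{
    for (int ii = 0; ii < N; ii += nb)
        for (int jj = 0; jj < N; jj += nb)
            for (int kk = 0; kk < N; kk += nb)
                for (int i = ii; i < ii + nb && i < N; i++)
                    for (int j = jj; j < jj + nb && j < N; j++)
                        for (int k = kk; k < kk + nb && k < N; k++)
                            C[i][j] += A[i][k] * B[k][j];
}

int main(void)
{
    const int candidates[] = {8, 16, 32, 64, 128};
    const int ncand = sizeof candidates / sizeof candidates[0];
    int best_nb = candidates[0];
    double best_t = 1e30;

    for (int c = 0; c < ncand; c++) {
        clock_t t0 = clock();
        run_variant(candidates[c]);
        double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("NB = %3d: %.3f s\n", candidates[c], t);
        if (t < best_t) { best_t = t; best_nb = candidates[c]; }
    }
    printf("best NB = %d\n", best_nb);
    return 0;
}
```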

8. ATLAS, Present Release
- ANSI/ISO C, BSD-style license (no advertising clause)
- Optimized dense Level 3 BLAS
  » Performance from GEMM kernel: code generator + parameterization
- Optimized dense Level 2 BLAS
  » GEMV & GER kernels: multiple implementation + parameterization
- Reference Level 1, banded, and packed BLAS
- Recursive LU & Cholesky factorizations (LAPACK)
- C and F77 interfaces for all routines

9. Algorithmic Approach for Matrix Multiply
- The only generated code is the on-chip multiply
- All BLAS operations are written in terms of the generated on-chip multiply
- All transpose cases are coerced through data copy to one case of the on-chip multiply
  » Only one case generated per platform
[Diagram: M x N matrix C computed as the M x K matrix A times the K x N matrix B, partitioned into NB x NB blocks]
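A minimal sketch of this layering, assuming square row-major matrices whose dimension is a multiple of NB; on_chip_mm, copy_block, and gemm_blocked are illustrative names, not ATLAS routines. The point of the copy is that the on-chip kernel always sees one contiguous, fixed-format operand, so only that one case ever needs to be generated. (Real ATLAS also amortizes each copy across every block multiply that reuses the data, rather than recopying as done here.)

```c
/* Sketch of ATLAS's GEMM layering: copy blocks to contiguous L1-sized
 * tiles, then apply a single on-chip kernel. */
#define NB 40   /* L1-cache block size; ATLAS finds this empirically */

/* The on-chip multiply: C-block += a*b on contiguous NB x NB tiles,
 * with C accessed in place through its leading dimension ldc. */
static void on_chip_mm(const double *a, const double *b, double *c, int ldc)
{
    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++) {
            double s = c[i*ldc + j];
            for (int k = 0; k < NB; k++)
                s += a[i*NB + k] * b[k*NB + j];
            c[i*ldc + j] = s;
        }
}

/* Copy block (bi, bj) of an n x n row-major matrix into a contiguous tile. */
static void copy_block(int n, const double *m, int bi, int bj, double *tile)
{
    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++)
            tile[i*NB + j] = m[(bi*NB + i)*n + (bj*NB + j)];
}

/* C += A*B, all n x n with n a multiple of NB. */
void gemm_blocked(int n, const double *A, const double *B, double *C)
{
    double ta[NB*NB], tb[NB*NB];
    for (int i = 0; i < n/NB; i++)
        for (int j = 0; j < n/NB; j++)
            for (int k = 0; k < n/NB; k++) {
                copy_block(n, A, i, k, ta);
                copy_block(n, B, k, j, tb);
                on_chip_mm(ta, tb, &C[(i*NB)*n + j*NB], n);
            }
}
```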

10. Code Generation Strategy
Code is iteratively generated and timed until the optimal case is found. We try:
- Differing NBs
- Breaking false dependencies
- M, N, and K loop unrolling
The on-chip multiply optimizes for:
- TLB access
- L1 cache reuse
- FP unit usage
- Memory fetch
- Register reuse
- Loop overhead minimization
(A sketch of the kind of kernel this produces follows.)
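To make "register reuse" and "loop unrolling" concrete, here is a hand-written sketch of the kind of inner kernel the generator emits: a 2x2 register block on C with the K loop unrolled by two, so each loaded element of a and b feeds multiple multiply-adds from registers. mm_kernel_2x2 is an illustrative name (not generated ATLAS output), and nb is assumed even; ATLAS times many such variants with different unroll factors and block shapes and keeps the fastest.

```c
/* 2x2 register-blocked, K-unrolled-by-2 kernel: c += a*b on nb x nb
 * row-major tiles, nb even. */
void mm_kernel_2x2(int nb, const double *a, const double *b, double *c)
{
    for (int i = 0; i < nb; i += 2)
        for (int j = 0; j < nb; j += 2) {
            double c00 = 0, c01 = 0, c10 = 0, c11 = 0;
            for (int k = 0; k < nb; k += 2) {
                /* first k iteration: 4 loads feed 8 multiply-adds */
                double a0 = a[i*nb + k],   a1 = a[(i+1)*nb + k];
                double b0 = b[k*nb + j],   b1 = b[k*nb + j + 1];
                c00 += a0*b0; c01 += a0*b1;
                c10 += a1*b0; c11 += a1*b1;
                /* second (unrolled) k iteration */
                a0 = a[i*nb + k + 1];   a1 = a[(i+1)*nb + k + 1];
                b0 = b[(k+1)*nb + j];   b1 = b[(k+1)*nb + j + 1];
                c00 += a0*b0; c01 += a0*b1;
                c10 += a1*b0; c11 += a1*b1;
            }
            c[i*nb + j]         += c00;
            c[i*nb + j + 1]     += c01;
            c[(i+1)*nb + j]     += c10;
            c[(i+1)*nb + j + 1] += c11;
        }
}
```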

11. 500x500 DGEMM Across Various Architectures
[Chart: MFLOPS of Vendor BLAS vs. ATLAS BLAS vs. F77 BLAS on AMD Athlon-600, DEC ev, DEC ev6-500, HP9000/735/135, IBM PPC, IBM Power2-160, IBM Power3-200, Pentium Pro-200, Pentium II-266, Pentium III-550, SGI R10000ip, SGI R12000ip, and Sun UltraSparc2-200]

12. 500x500 Double Precision RB LU Factorization
[Chart: MFLOPS of Vendor BLAS vs. ATLAS BLAS vs. F77 BLAS across the same architectures as the DGEMM chart above]

13. 500x500 Recursive BLAS on UltraSparc
[Chart: MFLOPS of Vendor BLAS vs. ATLAS BLAS vs. Reference BLAS for DGEMM, DSYMM, DSYRK, DSYR2K, DTRMM, and DTRSM]

14. ATLAS, Next Release
Definite:
- Beefed-up config
- SMP support via pthreads
- Support for user contribution
Playing with:
- Packed (banded) support, including extension to Level 3
- Level 1 optimizations
- More user control over levels of optimization
- Sparse support
- Further Level 2 optimization
  » Addition of code generation

15. Open Sourcing ATLAS
- Developers can scratch their own itch: optimize only the operation/architecture they need, and help the whole community
- Must standardize and document multiple-implementation testing/timing so users can supply machine-specific kernels
- Allows for machine-specific optimizations that cannot be done in a portable language such as C:
  - Assembly GEMM for ev5/6 -- Kazushige Goto
  - SSE & 3DNow! assembly -- Camm Maguire
  - UltraSparc kernel -- Peter Strazdins & Viet Nguyen

16. Open Source: Status
- Developer release:
  - ey/atlas/os
- Developer mailing list:
  - atlas-comm@cs.utk.edu
  - Archived at:
    » comm
- Level 2 GER/GEMV kernel contribution
- GEMM kernel contribution (multiple implementation)
- GEMM replacement
- STILL NEED: support for user-contributed GEMM cleanup

17. ATLAS Team
- Jack Dongarra, Director of ICL
- Antoine Petitet
- R. Clint Whaley
- You:
  - Kazushige Goto
  - Camm Maguire
  - Viet Nguyen
  - Peter Strazdins

18. Algorithmic Approach for Level 3 BLAS
- Recur down to the L1 cache block size
- Need a kernel at the bottom of the recursion
  » Use a GEMM-based kernel for portability
[Diagram: recursive TRMM, splitting the triangular matrix with its zero block]
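A sketch of this recursion for one case, B := L*B with L lower triangular (left side, no transpose), in row-major storage: split L into [L11 0; L21 L22] and recur until the triangular block is small, where a kernel finishes the job. NB_L1 and rec_trmm_llnn are illustrative names, and the base case simply calls cblas_dtrmm where ATLAS would use its tuned kernel; the GEMM call performs the bulk of the flops, which is why a GEMM-based kernel gives portable performance.

```c
/* Recursive TRMM sketch: B := L*B, L lower triangular, row-major. */
#include <cblas.h>

#define NB_L1 64   /* recursion cutoff; ATLAS would tune this */

void rec_trmm_llnn(int m, int n, const double *L, int ldl, double *B, int ldb)
{
    if (m <= NB_L1) {  /* base case: small triangular multiply kernel */
        cblas_dtrmm(CblasRowMajor, CblasLeft, CblasLower, CblasNoTrans,
                    CblasNonUnit, m, n, 1.0, L, ldl, B, ldb);
        return;
    }
    int m1 = m / 2, m2 = m - m1;
    const double *L11 = L;
    const double *L21 = L + m1 * ldl;        /* block row 2, block col 1 */
    const double *L22 = L + m1 * ldl + m1;
    double *B1 = B;
    double *B2 = B + m1 * ldb;

    /* B2 := L22*B2 first, then B2 += L21*B1 while B1 still holds its
     * old value, then B1 := L11*B1. */
    rec_trmm_llnn(m2, n, L22, ldl, B2, ldb);
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m2, n, m1, 1.0, L21, ldl, B1, ldb, 1.0, B2, ldb);
    rec_trmm_llnn(m1, n, L11, ldl, B1, ldb);
}
```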
