Automatically Tuned Linear Algebra Software (ATLAS) R. Clint Whaley Innovative Computing Laboratory University of Tennessee.
1 Automatically Tuned Linear Algebra Software (ATLAS) R. Clint Whaley Innovative Computing Laboratory University of Tennessee
2 Outline
- Pre-intro: BLAS
- Motivation
- What is ATLAS
- Present release
- How ATLAS works
- Performance results
- Future work (short term)
- Open sourcing ATLAS
3 Basic Linear Algebra Subprograms (BLAS)
- Level 3: matrix-matrix operations
  - gemm, symm, hemm, syrk, herk, syr2k, her2k, trmm, trsm
- Level 2: matrix-vector operations
  - gemv, hemv, symv, trmv, trsv
  - ger, geru, gerc, her, her2, syr2
- Level 1: vector-vector operations
  - swap, scal, copy, axpy, dot, nrm2, asum, iamax
- Packed, banded
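To make the three levels concrete, here is a minimal pure-Python sketch of one representative operation per level (axpy, gemv, gemm). The function names mirror the BLAS routine names, but the signatures are simplified assumptions for illustration; the real BLAS interfaces also take strides, leading dimensions, transpose flags, and a precision prefix.

```python
def axpy(alpha, x, y):
    """Level 1 (vector-vector): return alpha*x + y. O(n) work on O(n) data."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def gemv(alpha, A, x, beta, y):
    """Level 2 (matrix-vector): return alpha*A*x + beta*y. O(n^2) work on O(n^2) data."""
    return [alpha * sum(aij * xj for aij, xj in zip(row, x)) + beta * yi
            for row, yi in zip(A, y)]


def gemm(alpha, A, B, beta, C):
    """Level 3 (matrix-matrix): return alpha*A*B + beta*C. O(n^3) work on O(n^2) data."""
    inner = len(B)
    n_cols = len(B[0])
    return [[alpha * sum(A[i][k] * B[k][j] for k in range(inner)) + beta * C[i][j]
             for j in range(n_cols)]
            for i in range(len(A))]
```

Only Level 3 does asymptotically more arithmetic than memory traffic, which is why cache and register reuse pay off most there.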
4 The Problem
- For many operations, there is no such thing as enough compute power
- Therefore, need to extract near-peak performance even as hardware changes at the breakneck pace of Moore's Law
- Extracting near-optimal performance is tedious, time consuming, and requires expertise in many fields
- Optimization is not portable
5 Solution, Part A: Create libraries
- Isolate time-critical sections of code, define and agree on API (BLAS)
  - Get experts in all needed fields (type of computation, hardware platform, and programming environment) to optimize
- PROBLEMS:
  - Demand for experts far outstrips supply
  - Even with experts, by the time a library is fully optimized, the target architecture is well on its way towards obsolescence
6 Solution, Part B: AEOS
- AEOS: Automated Empirical Optimization of Software
- KEY IDEA: Automate the tuning process so it can be done by computer, rather than a team of experts
- GOAL: Optimized, portable library available for a new platform in minutes or hours rather than months or years
7 What is ATLAS
- A package that adapts to differing architectures via AEOS techniques
  - Initially, supply BLAS
- Automated Empirical Optimization of Software (AEOS)
  - Machine searches optimization space
  - Finds application-apparent architecture
- AEOS requires:
  - Method of code variation
    » Parameterization
    » Multiple implementation
    » Code generation
  - Sophisticated timers
  - Robust search heuristic
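The "multiple implementation" route can be illustrated with a small sketch: several hand-written variants of the same operation are checked against a reference result, timed, and the fastest one is kept. Everything here (`dot_indexed`, `dot_zipped`, `dot_two_accumulators`, `select_fastest`, the best-of-three timer) is a hypothetical stand-in chosen for illustration, not ATLAS code, and the exhaustive "search" is the simplest possible heuristic.

```python
import time


def dot_indexed(x, y):
    """Variant 1: plain indexed loop."""
    acc = 0.0
    for i in range(len(x)):
        acc += x[i] * y[i]
    return acc


def dot_zipped(x, y):
    """Variant 2: generator expression over paired elements."""
    return sum(a * b for a, b in zip(x, y))


def dot_two_accumulators(x, y):
    """Variant 3: two independent partial sums, with cleanup for odd lengths."""
    even = odd = 0.0
    n = len(x)
    for i in range(0, n - 1, 2):
        even += x[i] * y[i]
        odd += x[i + 1] * y[i + 1]
    if n % 2:
        even += x[n - 1] * y[n - 1]
    return even + odd


def time_call(fn, x, y, repeats=3):
    """Timer: best-of-`repeats` wall-clock time, to damp noise from other processes."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(x, y)
        best = min(best, time.perf_counter() - t0)
    return best


def select_fastest(implementations, x, y):
    """Test each variant against the first one, time it, and return the fastest."""
    reference = implementations[0](x, y)
    timings = {}
    for impl in implementations:
        result = impl(x, y)
        if abs(result - reference) > 1e-9 * max(1.0, abs(reference)):
            raise ValueError(f"{impl.__name__} disagrees with the reference result")
        timings[impl.__name__] = time_call(impl, x, y)
    fastest = min(implementations, key=lambda impl: timings[impl.__name__])
    return fastest, timings
```

Which variant wins depends on the machine and the interpreter, which is the point: the choice is made by measurement on the target platform rather than by prediction.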
8 ATLAS, Present Release
- ANSI/ISO C
  - BSD-style license (no advertising clause)
- Optimized dense Level 3 BLAS
  - Performance from GEMM kernel
    » code generator + parameterization
- Optimized dense Level 2
  - GEMV & GER kernels
    » multiple implementation + parameterization
- Reference Level 1, banded and packed BLAS
- Recursive LU & Cholesky factorizations (LAPACK)
- C and F77 interfaces for all routines
9 Algorithmic Approach for Matrix Multiply
- Only generated code is on-chip multiply
- All BLAS operations written in terms of generated on-chip multiply
- All transpose cases coerced through data copy to 1 case of on-chip multiply
  - Only 1 case generated per platform
[Diagram: C (M x N) = A (M x K) * B (K x N), with blocking factor NB]
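The idea of coercing every transpose case to a single kernel case through data copy can be sketched as follows. The kernel `on_chip_multiply` handles exactly one layout (an assumption made here: the A block in row-major order and the B block stored transposed), and `copy_block` rearranges whichever operands the caller supplied into that layout. All names, and the choice to copy blocks on the fly rather than once up front, are illustrative assumptions and not ATLAS's actual implementation.

```python
def copy_block(M, trans, r0, c0, nr, nc):
    """Copy op(M)[r0:r0+nr, c0:c0+nc] into a fresh row-major block.

    op(M) is M's transpose when `trans` is true, otherwise M itself.
    """
    return [[(M[c][r] if trans else M[r][c]) for c in range(c0, c0 + nc)]
            for r in range(r0, r0 + nr)]


def on_chip_multiply(Ab, Bt, Cb):
    """The single kernel case: Cb += Ab * Bt^T, reading both operands row by row."""
    for i, a_row in enumerate(Ab):
        for j, b_row in enumerate(Bt):
            Cb[i][j] += sum(a * b for a, b in zip(a_row, b_row))


def gemm_blocked(A, B, trans_a=False, trans_b=False, nb=2):
    """Return C = op(A) * op(B), built from NB-sized calls to the one kernel."""
    m = len(A[0]) if trans_a else len(A)
    k = len(A) if trans_a else len(A[0])
    n = len(B) if trans_b else len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, nb):
        mb = min(nb, m - i0)
        for j0 in range(0, n, nb):
            nbj = min(nb, n - j0)
            Cb = [[0.0] * nbj for _ in range(mb)]
            for k0 in range(0, k, nb):
                kb = min(nb, k - k0)
                # Block of op(A), rows i0.., columns k0..
                Ab = copy_block(A, trans_a, i0, k0, mb, kb)
                # Transposed block of op(B): rows j0.., columns k0.. of op(B)^T
                Bt = copy_block(B, not trans_b, j0, k0, nbj, kb)
                on_chip_multiply(Ab, Bt, Cb)
            for i in range(mb):
                for j in range(nbj):
                    C[i0 + i][j0 + j] = Cb[i][j]
    return C
```

Because the copy absorbs the transpose flags, only `on_chip_multiply` would need to be tuned per platform.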
10 Code generation strategy
- Code is iteratively generated & timed until optimal case is found. We try:
  - Differing NBs
  - Breaking false dependencies
  - M, N and K loop unrolling
- On-chip multiply optimizes for:
  - TLB access
  - L1 cache reuse
  - FP unit usage
  - Memory fetch
  - Register reuse
  - Loop overhead minimization
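A toy version of the generate-and-time loop is sketched below. It emits source for a dot-product kernel at several unroll factors, compiles each candidate, times it, and keeps the fastest. The generator (`generate_dot_kernel`), the single tuning parameter, and the exhaustive search are assumptions made for illustration; ATLAS's generator emits C for the on-chip multiply and varies more parameters than this.

```python
import time


def generate_dot_kernel(unroll):
    """Emit Python source for a dot product whose loop is unrolled `unroll` times."""
    terms = " + ".join(f"x[k + {u}] * y[k + {u}]" for u in range(unroll))
    lines = [
        "def kernel(x, y):",
        "    n = len(x)",
        "    acc = 0.0",
        f"    main = n - n % {unroll}",
        f"    for k in range(0, main, {unroll}):",
        f"        acc += {terms}",
        "    for k in range(main, n):  # cleanup loop for leftover elements",
        "        acc += x[k] * y[k]",
        "    return acc",
    ]
    return "\n".join(lines)


def compile_kernel(source):
    """Compile generated source and return the resulting `kernel` function."""
    namespace = {}
    exec(source, namespace)
    return namespace["kernel"]


def time_once(fn, x, y):
    """Wall-clock time of a single call."""
    t0 = time.perf_counter()
    fn(x, y)
    return time.perf_counter() - t0


def tune_unroll(candidates, n=2000, repeats=3):
    """Generate, compile, and time one kernel per candidate; return the fastest unroll."""
    x = [1.0] * n
    y = [2.0] * n
    best_unroll, best_time = None, float("inf")
    for unroll in candidates:
        kernel = compile_kernel(generate_dot_kernel(unroll))
        elapsed = min(time_once(kernel, x, y) for _ in range(repeats))
        if elapsed < best_time:
            best_unroll, best_time = unroll, elapsed
    return best_unroll
```

For example, `tune_unroll([1, 2, 4, 8])` returns whichever unroll factor ran fastest on the current machine; the answer is not expected to be the same everywhere.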
11 500x500 DGEMM Across Various Architectures
[Bar chart: MFLOPS by architecture for Vendor BLAS, ATLAS BLAS, and F77 BLAS. Architectures: AMD Athlon-600, DEC ev, DEC ev6-500, HP9000/735/135, IBM PPC, IBM Power2-160, IBM Power3-200, Pentium Pro-200, Pentium II-266, Pentium III-550, SGI R10000ip, SGI R12000ip, Sun UltraSparc2-200]
12 500 x 500 Double Precision RB LU factorization
[Bar chart: MFLOPS by architecture for Vendor BLAS, ATLAS BLAS, and F77 BLAS. Architectures: AMD Athlon-600, DEC ev, DEC ev6-500, HP9000/735/135, IBM PPC, IBM Power2-160, IBM Power3-200, Pentium Pro-200, Pentium II-266, Pentium III-550, SGI R10000ip, SGI R12000ip, Sun UltraSparc2-200]
13 500x500 Recursive BLAS on UltraSparc
[Bar chart: MFLOPS by BLAS routine (DGEMM, DSYMM, DSYRK, DSYR2K, DTRMM, DTRSM) for Vendor BLAS, ATLAS BLAS, and Reference BLAS]
14 ATLAS, Next Release
- Definite:
  - Beefed up config
  - SMP support via pthreads
  - Support for user contribution
- Playing with:
  - Packed (banded) support, including extension to Level 3
  - Level 1 optimizations
  - More user control over levels of optimization
  - Sparse support
  - Further Level 2 optimization
    » addition of code generation
15 Open sourcing ATLAS
- Developers can "scratch their own itch": optimize only the operation/architecture they need, and help the whole community
- Must standardize and document multiple implementation testing/timing so users can supply machine-specific kernels
- Allows for machine-specific optimizations that cannot be done in a portable language such as C:
  - Assembly GEMM for ev5/6 -- Kazushige Goto
  - SSE & 3DNow! assembler -- Camm Maguire
  - UltraSparc kernel -- Peter Strazdins & Viet Nguyen
16 Open Source: Status
- Developer release:
  - ey/atlas/os
- Developer mailing list:
  - atlas-comm@cs.utk.edu
  - Archived at:
    » comm
- Level 2 GER/GEMV kernel contribution
- GEMM kernel contribution (mult. implementation)
- GEMM replacement
- STILL NEED:
  - Support for user-contributed GEMM cleanup
17 ATLAS Team
- Jack Dongarra, Director of ICL
- Antoine Petitet
- R. Clint Whaley
- You
  - Kazushige Goto
  - Camm Maguire
  - Viet Nguyen
  - Peter Strazdins
18 Algorithmic approach for Level 3 BLAS
- Recur down to L1 cache block size
- Need kernel at bottom of recursion
  - Use gemm-based kernel for portability
[Figure: Recursive TRMM]
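A recursive TRMM can be sketched as below for the case B := L * B with L lower triangular. The problem is split in half until it is no larger than a block size `nb`, and the off-diagonal update is a plain matrix multiply. This is a sketch under assumptions: the helper names are hypothetical, `nb` stands in for the L1 cache block size, and the base case is a simple triangular loop rather than the gemm-based kernel the slide calls for.

```python
def gemm_acc(L, B, r0, r1, c0, c1):
    """GEMM update: B[r0:r1, :] += L[r0:r1, c0:c1] * B[c0:c1, :], in place."""
    m = len(B[0])
    for i in range(r0, r1):
        for j in range(m):
            B[i][j] += sum(L[i][k] * B[k][j] for k in range(c0, c1))


def trmm_base(L, B, lo, hi):
    """Base case: B[lo:hi, :] := L[lo:hi, lo:hi] * B[lo:hi, :], in place.

    Rows are processed bottom-up so each row still sees the old values above it.
    """
    m = len(B[0])
    for i in range(hi - 1, lo - 1, -1):
        for j in range(m):
            B[i][j] = sum(L[i][k] * B[k][j] for k in range(lo, i + 1))


def trmm_recursive(L, B, lo=0, hi=None, nb=2):
    """In-place B := L * B for lower-triangular L, recursing down to size nb."""
    if hi is None:
        hi = len(L)
    if hi - lo <= nb:
        trmm_base(L, B, lo, hi)
        return
    mid = (lo + hi) // 2
    trmm_recursive(L, B, mid, hi, nb)   # B2 := L22 * B2
    gemm_acc(L, B, mid, hi, lo, mid)    # B2 += L21 * B1 (B1 not yet overwritten)
    trmm_recursive(L, B, lo, mid, nb)   # B1 := L11 * B1
```

The bottom half is updated first because it needs the original top half; reversing the order would feed already-overwritten rows into the GEMM update.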