
1. Automatically Tuned Linear Algebra Software (ATLAS)
R. Clint Whaley
Innovative Computing Laboratory, University of Tennessee

2. Outline
- Pre-intro: BLAS
- Motivation
- What is ATLAS
- Present release
- How ATLAS works
- Performance results
- Future work (short term)
- Open sourcing ATLAS

3. Basic Linear Algebra Subprograms (BLAS)
- Level 3: matrix-matrix operations - gemm, symm, hemm, syrk, herk, syr2k, her2k, trmm, trsm
- Level 2: matrix-vector operations - gemv, hemv, symv, trmv, trsv; ger, geru, gerc, her, her2, syr, syr2
- Level 1: vector-vector operations - swap, scal, copy, axpy, dot, nrm2, asum, iamax
- Packed and banded variants
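For readers who have not called the BLAS directly, here is a minimal example of invoking the Level 3 routine dgemm through the C interface (cblas) that ATLAS supplies; the 2x2 sizes are arbitrary, and linking flags vary by installation (typically -lcblas -latlas).

```c
/* Minimal cblas_dgemm example: C = alpha*A*B + beta*C, row-major 2x2. */
#include <stdio.h>
#include <cblas.h>

int main(void)
{
    double A[4] = {1.0, 2.0,
                   3.0, 4.0};
    double B[4] = {5.0, 6.0,
                   7.0, 8.0};
    double C[4] = {0.0, 0.0,
                   0.0, 0.0};

    /* C = 1.0 * A * B + 0.0 * C */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2,          /* M, N, K          */
                1.0, A, 2,        /* alpha, A, lda    */
                B, 2,             /* B, ldb           */
                0.0, C, 2);       /* beta, C, ldc     */

    printf("C = [%g %g; %g %g]\n", C[0], C[1], C[2], C[3]);
    return 0;
}
```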

4. The Problem
- For many operations, there is no such thing as enough compute power
- We therefore need to extract near-peak performance even as hardware changes at the breakneck pace of Moore's Law
- Extracting near-optimal performance is tedious, time consuming, and requires expertise in many fields
- Optimization is not portable

5. Solution, Part A: Create Libraries
- Isolate time-critical sections of code; define and agree on an API (BLAS)
- Get experts in all needed fields (type of computation, hardware platform, and programming environment) to optimize it
PROBLEMS:
- Demand for experts far outstrips supply
- Even with experts, by the time a library is fully optimized, the target architecture is well on its way towards obsolescence

6. Solution, Part B: AEOS
AEOS: Automated Empirical Optimization of Software
- KEY IDEA: Automate the tuning process so it can be done by computer, rather than by a team of experts
- GOAL: An optimized, portable library available for a new platform in minutes or hours rather than months or years

7. What is ATLAS
A package that adapts to differing architectures via AEOS techniques
- Initially, supplies the BLAS
Automated Empirical Optimization of Software (AEOS):
- Machine searches the optimization space
- Finds the application-apparent architecture
AEOS requires (see the sketch after this list):
- A method of code variation
  » Parameterization
  » Multiple implementation
  » Code generation
- Sophisticated timers
- A robust search heuristic
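As a toy illustration of the AEOS loop (not ATLAS's actual search, which is far more sophisticated): produce a parameterized code variant, time it, and keep the best point in the search space. Here the variant is a matrix multiply blocked by nb, and the "search" simply sweeps a few candidate block sizes; run_variant and candidates are illustrative names, not ATLAS code.

```c
/* Toy AEOS sketch: time one parameterized variant at several candidate
 * block sizes and keep the fastest. */
#include <stdio.h>
#include <time.h>

#define N 512
static double A[N][N], B[N][N], C[N][N];

/* One point in the optimization space: multiply blocked by nb. */
static void run_variant(int nb)
{
    for (int ii = 0; ii < N; ii += nb)
        for (int jj = 0; jj < N; jj += nb)
            for (int kk = 0; kk < N; kk += nb)
                for (int i = ii; i < ii + nb && i < N; i++)
                    for (int j = jj; j < jj + nb && j < N; j++)
                        for (int k = kk; k < kk + nb && k < N; k++)
                            C[i][j] += A[i][k] * B[k][j];
}

int main(void)
{
    const int candidates[] = {8, 16, 32, 64, 128};
    const int ncand = sizeof candidates / sizeof candidates[0];
    int best_nb = candidates[0];
    double best_t = 1e30;

    for (int c = 0; c < ncand; c++) {
        clock_t t0 = clock();
        run_variant(candidates[c]);
        double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("NB = %3d: %.3f s\n", candidates[c], t);
        if (t < best_t) { best_t = t; best_nb = candidates[c]; }
    }
    printf("best NB = %d\n", best_nb);
    return 0;
}
```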

8. ATLAS, Present Release
- ANSI/ISO C, BSD-style license (no advertising clause)
- Optimized dense Level 3 BLAS
  » Performance from GEMM kernel: code generator + parameterization
- Optimized dense Level 2 BLAS
  » GEMV & GER kernels: multiple implementation + parameterization
- Reference Level 1, banded, and packed BLAS
- Recursive LU & Cholesky factorizations (LAPACK)
- C and F77 interfaces for all routines

9. Algorithmic Approach for Matrix Multiply
- The only generated code is the on-chip multiply
- All BLAS operations are written in terms of the generated on-chip multiply
- All transpose cases are coerced through data copy to one case of the on-chip multiply
  » Only one case generated per platform
[Diagram: M x N matrix C computed as the M x K matrix A times the K x N matrix B, partitioned into NB x NB blocks]
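A minimal sketch of this layering, assuming square row-major matrices whose dimension is a multiple of NB; on_chip_mm, copy_block, and gemm_blocked are illustrative names, not ATLAS routines. The point of the copy is that the on-chip kernel always sees one contiguous, fixed-format operand, so only that one case ever needs to be generated. (Real ATLAS also amortizes each copy across every block multiply that reuses the data, rather than recopying as done here.)

```c
/* Sketch of ATLAS's GEMM layering: copy blocks to contiguous L1-sized
 * tiles, then apply a single on-chip kernel. */
#define NB 40   /* L1-cache block size; ATLAS finds this empirically */

/* The on-chip multiply: C-block += a*b on contiguous NB x NB tiles,
 * with C accessed in place through its leading dimension ldc. */
static void on_chip_mm(const double *a, const double *b, double *c, int ldc)
{
    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++) {
            double s = c[i*ldc + j];
            for (int k = 0; k < NB; k++)
                s += a[i*NB + k] * b[k*NB + j];
            c[i*ldc + j] = s;
        }
}

/* Copy block (bi, bj) of an n x n row-major matrix into a contiguous tile. */
static void copy_block(int n, const double *m, int bi, int bj, double *tile)
{
    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++)
            tile[i*NB + j] = m[(bi*NB + i)*n + (bj*NB + j)];
}

/* C += A*B, all n x n with n a multiple of NB. */
void gemm_blocked(int n, const double *A, const double *B, double *C)
{
    double ta[NB*NB], tb[NB*NB];
    for (int i = 0; i < n/NB; i++)
        for (int j = 0; j < n/NB; j++)
            for (int k = 0; k < n/NB; k++) {
                copy_block(n, A, i, k, ta);
                copy_block(n, B, k, j, tb);
                on_chip_mm(ta, tb, &C[(i*NB)*n + j*NB], n);
            }
}
```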

10. Code Generation Strategy
Code is iteratively generated and timed until the optimal case is found. We try:
- Differing NBs
- Breaking false dependencies
- M, N, and K loop unrolling
The on-chip multiply optimizes for:
- TLB access
- L1 cache reuse
- FP unit usage
- Memory fetch
- Register reuse
- Loop overhead minimization
(A sketch of the kind of kernel this produces follows.)
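To make "register reuse" and "loop unrolling" concrete, here is a hand-written sketch of the kind of inner kernel the generator emits: a 2x2 register block on C with the K loop unrolled by two, so each loaded element of a and b feeds multiple multiply-adds from registers. mm_kernel_2x2 is an illustrative name (not generated ATLAS output), and nb is assumed even; ATLAS times many such variants with different unroll factors and block shapes and keeps the fastest.

```c
/* 2x2 register-blocked, K-unrolled-by-2 kernel: c += a*b on nb x nb
 * row-major tiles, nb even. */
void mm_kernel_2x2(int nb, const double *a, const double *b, double *c)
{
    for (int i = 0; i < nb; i += 2)
        for (int j = 0; j < nb; j += 2) {
            double c00 = 0, c01 = 0, c10 = 0, c11 = 0;
            for (int k = 0; k < nb; k += 2) {
                /* first k iteration: 4 loads feed 8 multiply-adds */
                double a0 = a[i*nb + k],   a1 = a[(i+1)*nb + k];
                double b0 = b[k*nb + j],   b1 = b[k*nb + j + 1];
                c00 += a0*b0; c01 += a0*b1;
                c10 += a1*b0; c11 += a1*b1;
                /* second (unrolled) k iteration */
                a0 = a[i*nb + k + 1];   a1 = a[(i+1)*nb + k + 1];
                b0 = b[(k+1)*nb + j];   b1 = b[(k+1)*nb + j + 1];
                c00 += a0*b0; c01 += a0*b1;
                c10 += a1*b0; c11 += a1*b1;
            }
            c[i*nb + j]         += c00;
            c[i*nb + j + 1]     += c01;
            c[(i+1)*nb + j]     += c10;
            c[(i+1)*nb + j + 1] += c11;
        }
}
```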

11. 500x500 DGEMM Across Various Architectures
[Chart: MFLOPS of Vendor BLAS vs. ATLAS BLAS vs. F77 BLAS on AMD Athlon-600, DEC ev, DEC ev6-500, HP9000/735/135, IBM PPC, IBM Power2-160, IBM Power3-200, Pentium Pro-200, Pentium II-266, Pentium III-550, SGI R10000ip, SGI R12000ip, and Sun UltraSparc2-200]

12. 500x500 Double Precision RB LU Factorization
[Chart: MFLOPS of Vendor BLAS vs. ATLAS BLAS vs. F77 BLAS across the same architectures as the DGEMM chart above]

13. 500x500 Recursive BLAS on UltraSparc
[Chart: MFLOPS of Vendor BLAS vs. ATLAS BLAS vs. Reference BLAS for DGEMM, DSYMM, DSYRK, DSYR2K, DTRMM, and DTRSM]

14. ATLAS, Next Release
Definite:
- Beefed-up config
- SMP support via pthreads
- Support for user contribution
Playing with:
- Packed (banded) support, including extension to Level 3
- Level 1 optimizations
- More user control over levels of optimization
- Sparse support
- Further Level 2 optimization
  » Addition of code generation

15. Open Sourcing ATLAS
- Developers can scratch their own itch: optimize only the operation/architecture they need, and help the whole community
- Must standardize and document multiple-implementation testing/timing so users can supply machine-specific kernels
- Allows for machine-specific optimizations that cannot be done in a portable language such as C:
  - Assembly GEMM for ev5/6 -- Kazushige Goto
  - SSE & 3DNow! assembly -- Camm Maguire
  - UltraSparc kernel -- Peter Strazdins & Viet Nguyen

16. Open Source: Status
- Developer release:
  - ey/atlas/os
- Developer mailing list:
  - atlas-comm@cs.utk.edu
  - Archived at:
    » comm
- Level 2 GER/GEMV kernel contribution
- GEMM kernel contribution (multiple implementation)
- GEMM replacement
- STILL NEED: support for user-contributed GEMM cleanup

17. ATLAS Team
- Jack Dongarra, Director of ICL
- Antoine Petitet
- R. Clint Whaley
- You:
  - Kazushige Goto
  - Camm Maguire
  - Viet Nguyen
  - Peter Strazdins

18. Algorithmic Approach for Level 3 BLAS
- Recur down to the L1 cache block size
- Need a kernel at the bottom of the recursion
  » Use a GEMM-based kernel for portability
[Diagram: recursive TRMM, splitting the triangular matrix with its zero block]
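A sketch of this recursion for one case, B := L*B with L lower triangular (left side, no transpose), in row-major storage: split L into [L11 0; L21 L22] and recur until the triangular block is small, where a kernel finishes the job. NB_L1 and rec_trmm_llnn are illustrative names, and the base case simply calls cblas_dtrmm where ATLAS would use its tuned kernel; the GEMM call performs the bulk of the flops, which is why a GEMM-based kernel gives portable performance.

```c
/* Recursive TRMM sketch: B := L*B, L lower triangular, row-major. */
#include <cblas.h>

#define NB_L1 64   /* recursion cutoff; ATLAS would tune this */

void rec_trmm_llnn(int m, int n, const double *L, int ldl, double *B, int ldb)
{
    if (m <= NB_L1) {  /* base case: small triangular multiply kernel */
        cblas_dtrmm(CblasRowMajor, CblasLeft, CblasLower, CblasNoTrans,
                    CblasNonUnit, m, n, 1.0, L, ldl, B, ldb);
        return;
    }
    int m1 = m / 2, m2 = m - m1;
    const double *L11 = L;
    const double *L21 = L + m1 * ldl;        /* block row 2, block col 1 */
    const double *L22 = L + m1 * ldl + m1;
    double *B1 = B;
    double *B2 = B + m1 * ldb;

    /* B2 := L22*B2 first, then B2 += L21*B1 while B1 still holds its
     * old value, then B1 := L11*B1. */
    rec_trmm_llnn(m2, n, L22, ldl, B2, ldb);
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m2, n, m1, 1.0, L21, ldl, B1, ldb, 1.0, B2, ldb);
    rec_trmm_llnn(m1, n, L11, ldl, B1, ldb);
}
```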
