Adaptive Scientific Software Libraries

Similar documents
Advanced Computing Research Laboratory. Adaptive Scientific Software Libraries

An Adaptive Framework for Scientific Software Libraries. Ayaz Ali Lennart Johnsson Dept of Computer Science University of Houston

Grid Computing: Application Development

Research in High-Performance Grid Software

Introduction to HPC. Lecture 21

Input parameters System specifics, user options. Input parameters size, dim,... FFT Code Generator. Initialization Select fastest execution plan

Scheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok

Grid Application Development Software

Empirical Auto-tuning Code Generator for FFT and Trigonometric Transforms

Several Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining

Intel Enterprise Processors Technology

Algorithms and Computation in Signal Processing

Memory Systems IRAM. Principle of IRAM

My 2 hours today: 1. Efficient arithmetic in finite fields minute break 3. Elliptic curves. My 2 hours tomorrow:

Statistical Evaluation of a Self-Tuning Vectorized Library for the Walsh Hadamard Transform

Vector IRAM: A Microprocessor Architecture for Media Processing

Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor

The Mont-Blanc approach towards Exascale

Outline. How Fast is -fast? Performance Analysis of KKD Applications using Hardware Performance Counters on UltraSPARC-III

CS 152 Computer Architecture and Engineering. Lecture 8 - Memory Hierarchy-III

Formal Loop Merging for Signal Transforms

Performance Analysis of KDD Applications using Hardware Event Counters. CAP Theme 2.

System Demonstration of Spiral: Generator for High-Performance Linear Transform Libraries

Automatic Performance Tuning. Jeremy Johnson Dept. of Computer Science Drexel University

Node Hardware. Performance Convergence

Advanced Processor Architecture. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

High-Performance Linear Algebra Processor using FPGA

Advanced d Processor Architecture. Computer Systems Laboratory Sungkyunkwan University

Chapter 6 Caches. Computer System. Alpha Chip Photo. Topics. Memory Hierarchy Locality of Reference SRAM Caches Direct Mapped Associative

Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( )

Advanced Processor Architecture

Microarchitecture Overview. Performance

Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop

Microarchitecture Overview. Performance

The Alpha Microprocessor: Out-of-Order Execution at 600 Mhz. R. E. Kessler COMPAQ Computer Corporation Shrewsbury, MA

Performance Issues and Query Optimization in Monet

Optimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology

Computer Architecture. Introduction. Lynn Choi Korea University

Effect of memory latency

SMP and ccnuma Multiprocessor Systems. Sharing of Resources in Parallel and Distributed Computing Systems

Evolution of Computers & Microprocessors. Dr. Cahit Karakuş

Affordable and power efficient computing for high energy physics: CPU and FFT benchmarks of ARM processors

Java Performance Analysis for Scientific Computing

Intel released new technology call P6P

Today. SMP architecture. SMP architecture. Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( )

Scientific Computing. Some slides from James Lambers, Stanford

Computer Systems Architecture I. CSE 560M Lecture 19 Prof. Patrick Crowley

Algorithms and Computation in Signal Processing

PERFORMANCE MEASUREMENT

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

MIMD Overview. Intel Paragon XP/S Overview. XP/S Usage. XP/S Nodes and Interconnection. ! Distributed-memory MIMD multicomputer

Application Performance on Dual Processor Cluster Nodes

Supercomputing with Commodity CPUs: Are Mobile SoCs Ready for HPC?

CS 426 Parallel Computing. Parallel Computing Platforms

The Memory Hierarchy. Cache, Main Memory, and Virtual Memory (Part 2)

Parallel FFT Program Optimizations on Heterogeneous Computers

Outline. 1 Reiteration. 2 Cache performance optimization. 3 Bandwidth increase. 4 Reduce hit time. 5 Reduce miss penalty. 6 Reduce miss rate

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

Optimizing MMM & ATLAS Library Generator

Itanium 2 Impact Software / Systems MSC.Software. Jay Clark Director, Business Development High Performance Computing

CS 152 Computer Architecture and Engineering. Lecture 8 - Memory Hierarchy-III

Itanium 2 Processor Microarchitecture Overview

The Alpha Microprocessor: Out-of-Order Execution at 600 MHz. Some Highlights

Halfway! Sequoia. A Point of View. Sequoia. First half of the course is over. Now start the second half. CS315B Lecture 9

EITF20: Computer Architecture Part4.1.1: Cache - 2

UMBC. Rubini and Corbet, Linux Device Drivers, 2nd Edition, O Reilly. Systems Design and Programming

Bindel, Fall 2011 Applications of Parallel Computers (CS 5220) Tuning on a single core

Vector Architectures Vs. Superscalar and VLIW for Embedded Media Benchmarks

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip

Advanced School in High Performance and GRID Computing November Mathematical Libraries. Part I

Performance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply

7/28/ Prentice-Hall, Inc Prentice-Hall, Inc Prentice-Hall, Inc Prentice-Hall, Inc Prentice-Hall, Inc.

CS 152 Computer Architecture and Engineering. Lecture 7 - Memory Hierarchy-II

Adaptable benchmarks for register blocked sparse matrix-vector multiplication

Advance CPU Design. MMX technology. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ. ! Basic concepts

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Computer Architecture. Fall Dongkun Shin, SKKU

Technology Trends Presentation For Power Symposium

Benchmarking CPU Performance

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

XPU A Programmable FPGA Accelerator for Diverse Workloads

Parallelism in Spiral

Chapter 1: Introduction to the Microprocessor and Computer 1 1 A HISTORICAL BACKGROUND

Why Parallel Architecture

Mainstream Computer System Components CPU Core 2 GHz GHz 4-way Superscaler (RISC or RISC-core (x86): Dynamic scheduling, Hardware speculation

Instruction Level Parallelism

Accurate Cache and TLB Characterization Using Hardware Counters

Mainstream Computer System Components

Cache Justification for Digital Signal Processors

HW Trends and Architectures

Composite Metrics for System Throughput in HPC

Robert Jamieson. Robs Techie PP Everything in this presentation is at your own risk!

Alternate definition: Instruction Set Architecture (ISA) What is Computer Architecture? Computer Organization. Computer structure: Von Neumann model

Algorithms and Computation in Signal Processing

Computer Architecture s Changing Definition

Memory latency: Affects cache miss penalty. Measured by:

Memory latency: Affects cache miss penalty. Measured by:

Gedae cwcembedded.com. The CHAMP-AV6 VPX-REDI. Digital Signal Processing Card. Maximizing Performance with Minimal Porting Effort

Main Memory. EECC551 - Shaaban. Memory latency: Affects cache miss penalty. Measured by:

Transcription:

Adaptive Scientific Software Libraries Lennart Johnsson Advanced Computing Research Laboratory Department of Computer Science University of Houston

Challenges Diversity of execution environments Growing complexity of modern microprocessors. Deep memory hierarchies Out-of-order execution Instruction level parallelism Growing diversity of platform characteristics SMPs Clusters (employing a range of interconnect technologies) Grids (heterogeneity, wide range of characteristics) Wide range of application needs Dimensionality and sizes Data structures and data types Languages and programming paradigms

Challenges Algorithmic High arithmetic efficiency low floating-point v.s. load/store ratio Unfavorable data access patterns (big 2 n strides) Application owns the datastructures/layout Additions/multiplications unbalanced Version explosion Verification Maintenance

Opportunities Multiple algorithms with comparable numerical properties for many functions Improved software techniques and hardware performance Integrated performance monitors, models and data bases Run-time code construction

Approach Automatic algorithm selection polyalgorithmic functions (CMSSL, FFTW, ATLAS, SPIRAL,..) Exploit multiple precision options Code generation from high-level descriptions (WASSEM, CMSSL, CM-Convolution-Compiler, FFTW, UHFFT,..) Integrated performance monitoring, modeling and analysis Judicious choice between compile-time and run-time analysis and code construction Automated installation process

The UHFFT Program preparation at installation (platform dependent) Integrated performance models (in progress) and data bases Algorithm selection at run-time from set defined at installation Automatic multiple precision constant generation Program construction at run-time based on application and performance predictions

Performance Tuning Methodology Input Parameters System specifics, User options Input Parameters Size, dim., UHFFT Code generator Initialization Select best plan (factorization) Library of FFT modules Execution Calculate one or more FFTs Performance database Installation Performance Monitoring Database update Run-time

The UHFFT Software Architecture UHFFT Library Library of FFT Modules Initialization Routines Execution Routines Utilities FFT Code Generator Mixed-Radix (Cooly-Tukey) Prime Factor Algorithm Split-Radix Algorithm Rader's Algorithm Unparser Optimizer Scheduler Initializer (Algorithm Abstraction) Key: Fixed library code Generated code Code generator

The UHFFT: Code Generation Structure Algorithm abstraction Optimization Generation of a DAG Scheduling of instructions Unparsing Implementation Code generator is written in C Speed, portability and installation tuning Highly optimized straight line C code Generates FFT codelets of arbitrary size, direction, and rotation

The UHFFT: Code Generation (cont d) Basic structure is an Expression Constant, variable, sum, product, sign change, Basic functions Expression sum, product, assign, sign change, Derived structures Expression vectors, matrices and lists Higher level functions Matrix vector operations FFT specific operations Algorithms currently supported Rader (two versions), PFA, Split-radix, Mixed-radix

Equation: W Is implemented as: The UHFFT: Code Generation Mixed-Radix Algorithm n (Wr Im )Dr,m(I r Wm )Πn,r /* * FFTMixedRadix() Mixed-radix splitting. * Input: * r radix, * dir, rot direction and rotation of the transform, * u input expression vector. */ ExprVec *FFTMixedRadix(int r, int dir, int rot, ExprVec *u) { int m, n = u->n, *p; } m = n/r; p = ModRSortPermutation(n, r); u = FFTxI(r, m, dir, rot, TwiddleMult(r, m, dir, rot, IxFFT(r, m, dir, rot, PermuteExprVec(u, p)))); free(p); return u;

The UHFFT: Performance Modeling Analytic models Cache influence on library codes Performance measuring tools (PCL, PAPI) Prediction of composed code performance Updated from execution experience Data base Library codes. Recorded at installation time Composed codes. Recorded and updated for each execution.

The UHFFT: Execution Plan Generation Optimal plan search options Exhaustive Recursive Empirical Algorithms used Rader (FFTW, UHFFT) PFA (UHFFT) Split-radix (UHFFT) Mixed-radix (FFTW, SPIRAL, UHFFT)

Characteristics of Some Processors Processor Clock frequency Peak Performance Cache structure Intel Pentium IV 1.8 GHz 1.8 GFlops L1: 8K+8K, L2: 256K AMD Athlon 1.4 GHz 1.4 GFlops L1: 64K+64K, L2: 256K PowerPC G4 867 MHz 867 MFlops L1: 32K+32K L2: 256K, L3: 1-2M Intel Itanium 800 Mhz 3.2 GFlops L1: 16K+16K L2: 92K, L3: 2-4M IBM Power3/4 375 MHz 1.5 GFlops L1: 64K+32K, L2: 1-16M HP PA 8x00 750 MHz 3 GFlops L1: 1.5M + 0.75M Alpha EV67/68 833 MHz 1.66 GFlops L1: 64K+64K, L2: 4M MIPS R1x000 500 MHz 1 GFlop L1: 32K+32K, L2: 4M

Codelet efficiency Intel PIV 1.8 GHz AMD Athlon 1.4 GHz PowerPC G4 867 MHz

Radix-4 codelet efficiency Intel PIV 1.8 GHz AMD Athlon 1.4 GHz PowerPC G4 867 MHz

Radix-8 codelet efficiency Intel PIV 1.8 GHz AMD Athlon 1.4 GHz PowerPC G4 867 MHz

Plan Performance, 32- bit Architectures

Power3 plan performance 350 300 250 200 150 100 MFLOPS 50 0 16 2 8 4 4 8 2 2 2 4 2 4 2 4 2 2 2 2 2 2 Plan 222 MHz 888 Mflops

Power3 plan performance PFA sizes 800 Mflops peak

Itanium.. Processor Clock frequency Peak Performance Cache structure Intel Itanium 800 Mhz 3.2 GFlops Intel Itanium 2 900 Mhz 3.6 GFlops Intel Itanium 2 1000 Mhz 4 GFlops Sun UltraSparc-III 750 Mhz 1.5 GFlops Sun UltraSparc-III 1050 Mhz 2.1 GFlops L1: 16K+16K (Data+Instruction) L2: 92K, L3: 2-4M (off-die) L1: 16K+16K (Data+Instruction) L2: 256K, L3: 1.5M (on-die) L1: 16K+16K (Data+Instruction) L2: 256K, L3: 3M (on-die) L1: 64K+32K+2K+2K (Data+Instruction+Pre-fetch+Write) L2: up to 8M (off-die) L1: 64K+32K+2K+2K (Data+Instruction+Pre-fetch+Write) L2: up to 8M (off-die) Tested configuration

Memory Hierarchy Itanium-2 (McKinley) Itanium L1I and L1D Size: Line size/associativity: Latency: Write Policies: 16KB + 16KB 64B/4-way 1 cycle Write through, No write allocate 16KB + 16KB 32B/4-way 1 cycle Write through, No write allocate Size: 256KB 96K B Unified L2 Line size/associativity: Integer Latency: FP Latency: 128B/8-way Min 5 cycles Min 6 cycles 64B/6-way Min 6 cycles Min 9 cycles Write Policies: Write back, write allocate Write back, write allocate Size: 3MB or 1.5MB on chip 4MB or 2MB off chip Unified L3 Line size/associativity: Integer Latency: FP Latency: 128B/12-way Min 12 cycles Min 13 cycles 64B/4-way Min 21 cycles Min 24 cycles Bandwith: 32B/cycle 16B/cycle

Itanium Comparison Workstation HP i2000 HP zx2000 Processor 800 MHz Intel Itanium 900 MHz Intel Itanium 2 (McKinley) Bus Speed 133 MHZ 400 MHz Bus Width 64 bit 128 bit Chipset Intel 82460GX HP zx1 Memory 2 GB SDRAM (133 MHz) 2 GB DDR SDRAM (266 MHz) OS 64-bit Red Hat Linux 7.1 HP version of the 64-bit RH Linux 7.2 Compiler Intel 6.0 Intel 6.0

HP zx1 Chipset 2-way block diagram Features: 2-way and 4-way Low latency connection to the DDR memory (112 ns) Directly (112 ns latency) Through (up to 12 ) scalable memory expanders (+25 ns latency) Up to 64 GB of DDR today (256 in the future) AGP 4x today (8x in the future versions) 1-8 I/O adapters supporting PCI, PCI-X, AGP

UHFFT Codelet Performance

UHFFT Codelet Performance

UHFFT Codelet Performance

Codelet Performance Radix-2

Codelet Performance Radix-3

Codelet Performance Radix-4

Codelet Performance Radix-5

Codelet Performance Radix-6

Codelet Performance Radix-7

Codelet Performance Radix-8

Codelet Performance Radix-9

Codelet Performance Radix-10

Codelet Performance Radix-11

Codelet Performance Radix-12

Codelet Performance Radix-13

Codelet Performance Radix-14

Codelet Performance Radix-15

Codelet Performance Radix-16

Codelet Performance Radix-17

Codelet Performance Radix-18

Codelet Performance Radix-19

Codelet Performance Radix-20

Codelet Performance Radix-21

Codelet Performance Radix-22

Codelet Performance Radix-23

Codelet Performance Radix-24

Codelet Performance Radix-25

Codelet Performance Radix-32

Codelet Performance Radix-36

Codelet Performance Radix-45

Codelet Performance Radix-64

The UHFFT: Summary Code generator written in C Code is generated at installation Codelet library is tuned to the underlying architecture The whole library can be easily customized through parameter specification No need for laborious manual changes in the source Existing code generation infrastructure allows easy library extensions Future: Inclusion of vector/streaming instruction set extension for various architectures Implementation of new scheduling/optimization algorithms New codelet types and better execution routines Unified algorithm specification on all levels

New Tools for Library Code Development Generalization of the tools developed for the UHFFT library CODELAB: A Developers' Tool for Efficient Code Generation and Optimization Combination of High-level scripting language Code generator Performance measurement tools Visualization Under development Several test examples show very promising results

CODELAB IDE STRUCTURE CODELAB IDE Script Interpreter Code Generator Visualization Performance measurement User Input Library code Support Code Compiler Execution Operating System Application

CODELAB Structure Application consists of Library code Support Code Application Library code: Automatically generated and optimized collection of subroutines Supporting code: Code that binds the library routines together It could be hand-written or automatically generated Application is instrumented for performance measurements automatically

Script Interpreter User Input CODELAB Structure Supporting Code Application User writes: Simple script that produces the code generator or supporting code Supporting code for the application The code generator should be able to produce a large variety of code depending on a few input parameters (otherwise it is simpler to write the code by hand) Example: A single code generator for FFT codelets of different size, type, direction, rotation, Script Interpreter: Simplifies construction of the code generator Very restricted set of commands at the moment

CODELAB Code Generator Structure Library code Support Code Application Visualization Code generator C program that generates the application and supporting code Uses abstract expression algebra for code generation Several layers of software: Basic expressions Complex expressions algebra Vector and matrix algebra Polynomial algebra The generated code can be instrumented for performance measurements The initial expression list is transformed into DAG and optimized: Simplification of expressions Folding of constants User can get a variety of information about the generated code: Number of arithmetic ops DAG graph, etc

CODELAB Structure Visualization Compiler Performance measurements Execution Operating System Performance measurements The application can be compiled and executed from within the IDE The performance data are collected and visualized User can modify the code and repeat the process until a satisfactory performance is obtained Detailed performance information by using PAPI library interface

CODELAB Applications FFT and DSP Libraries Efficient multiple precision arithmetic Finite Element Methods Linear Algebra Other well structured applications that allow for simple parameterization few parameters define a large variety of code

Overview UHFFT Performance on some new architectures Intel Itanium 800 MHz, Intel Itanium 2 (McKinley) 900 MHz Sun UltraSparc-III 750 MHz New Tools for Library Code Development CODELAB Integrated Development Environment (IDE) Introduction Structure of the CODELAB IDE Applications

The UHFFT: An Adaptive FFT Library UHFFT employs more ways of combining codelets for execution than any other library Better coverage of the space of possible algorithms The PFA algorithm yields good performance where the Mixed-Radix algorithm (MR) performs poorly PFA algorithm requires less FP operations than MR Data access pattern in PFA is more complex than in MR, but large 2 n strides can be avoided Example IBM Power3 Good: 128-way set associative L1 data and instruction caches Bad: Direct mapped L2 cache very vulnerable to cache trashing despite the large cache size PFA execution model works better for large FFT sizes

Acknowledgements Contributors Dragan Mirkovic Rishad Mahasoom Fredrick Mwandia Nils Smeds Funding: Alliance (NSF) LACSI (DoE)