Automated Timer Generation for Empirical Tuning

Josh Magee, Qing Yi, R. Clint Whaley
University of Texas at San Antonio
SMART'10

Propositions

- How do we measure success for tuning?
  - The performance of the tuned code, of course.
  - But what about tuning time? How long are users willing to wait? Given 3 more hours, how much can we improve program efficiency?
- Auto-tuning libraries have been successful and widely used: ATLAS, PHiPAC, FFTW, SPIRAL, ...
  - Critical routines are tuned because they are invoked many, many times.
- What happens when tuning whole applications?
  - Whole applications are what end users need and what compilers expect to see.
  - But applications are often extremely large and time-consuming to run.
  - We do not want to rerun an entire application to try out each optimization configuration.

Observations

- The performance of full applications depends critically on a few computation- and data-intensive routines.
  - These routines are often small but are invoked a large number of times.
  - Performance analysis tools (e.g., HPCToolkit) can be used to identify them.
- Tuning these routines can significantly improve the overall performance of the whole application while reducing tuning time.
  - In some SPEC benchmarks, running the whole application takes about 175,000 times longer than running a single critical routine.
- The problem: setting up the execution environment of the routines.
  - A driver is required to set up parameters and global variables properly and to measure the runtime of each routine invocation accurately.
  - The cache and memory state of the machine is very important (Whaley and Castaldo, SPE'08).
  - This is not as trivial a problem as one may think.
- Overall goal: reduce tuning time without sacrificing tuning accuracy.

Empirical tuning approach

- Instrumentation library
  - Collects details of routine execution within whole applications.
  - Invoked after HPCToolkit is used to identify critical routines.
- POET timer generator
  - Input: routine specification + cache configuration + output configuration.
  - Output: a timing driver with an accurately replicated execution environment.
  - Supports a checkpointing approach for routines operating on irregular data.
- Empirical tuning system (sketched below)
  - Applies optimizations to produce different routine implementations.
  - Links each routine implementation with the timing driver and collects performance feedback.
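
The last component is essentially a generate-compile-measure loop. Below is a minimal C sketch of that loop under stated assumptions: the generate_variant script, the file names, and the one-number output format are hypothetical placeholders, not the system's actual interface.

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical outer tuning loop: produce a variant of the routine,
       link it with the auto-generated timing driver, run it, and keep
       the best-performing configuration. */
    int main(void)
    {
        double best = 0.0;
        int best_cfg = -1;

        for (int cfg = 0; cfg < 16; cfg++) {
            char cmd[256];
            /* generate_variant, timer.c, and result.txt are assumptions */
            snprintf(cmd, sizeof cmd,
                     "./generate_variant %d > variant.c && "
                     "cc -O2 variant.c timer.c -o timer && "
                     "./timer > result.txt", cfg);
            if (system(cmd) != 0)
                continue;            /* skip variants that fail to build or run */

            double mflops = 0.0;
            FILE *f = fopen("result.txt", "r");
            if (f && fscanf(f, "%lf", &mflops) == 1 && mflops > best) {
                best = mflops;
                best_cfg = cfg;
            }
            if (f) fclose(f);
        }
        printf("best configuration: %d (%.1f MFLOPS)\n", best_cfg, best);
        return 0;
    }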

Replicating the Environment of Routine Invocations

- Goal: ensure proper input values and operand workspaces.
  - Should reflect common usage patterns of the routine.
  - Should not result in abnormal execution.
- Data-insensitive routines
  - The amount of computation is determined by integer parameters controlling the problem size.
  - Performance is not noticeably affected by the values stored in the input.
  - Example: dense matrix multiplication.
- Data-sensitive routines
  - The amount of computation depends on the values and positioning of the data.
  - Examples: sorting algorithms, complex pointer-chasing algorithms.
- Replicating the routine invocation environment (see the sketch below)
  - For data-insensitive routines: replicate the problem size and use randomly generated values.
  - For data-sensitive routines: use the checkpointing approach.
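
For the data-insensitive case, here is a minimal C sketch of random initialization plus cache flushing before timing; the flush-buffer size and the 64-byte cache-line stride are assumptions, not values from the paper.

    #include <stddef.h>
    #include <stdlib.h>

    /* Hypothetical flush buffer, assumed large enough to evict the caches. */
    #define FLUSH_BYTES (32u * 1024u * 1024u)

    static void flush_cache(void)
    {
        static volatile unsigned char buf[FLUSH_BYTES];
        for (size_t i = 0; i < FLUSH_BYTES; i += 64)  /* touch each cache line */
            buf[i]++;
    }

    /* Fill a column-major m-by-n matrix with random values: for a
       data-insensitive kernel such as GEMM, only the problem size matters,
       so random data is as representative as real data. */
    static void rand_matrix(double *a, int m, int n, int ld)
    {
        for (int j = 0; j < n; j++)
            for (int i = 0; i < m; i++)
                a[i + (size_t)j * ld] = (double)rand() / RAND_MAX;
    }

Flushing between repetitions matters because a routine that always finds its operands in cache reports unrealistically good times relative to its behavior inside the full application (Whaley and Castaldo, SPE'08).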

The Default Timing Approach (for data-insensitive routines)

Routine specification for a matrix multiplication kernel:

    routine = void ATL_USERMM(const int M, const int N, const int K,
                              const double alpha, const double* A, const int lda,
                              const double* B, const int ldb,
                              const double beta, double* C, const int ldc);
    init = {
        M = Macro(MS, 72); N = Macro(NS, 72); K = Macro(KS, 72);
        lda = MS; ldb = KS; ldc = MS;
        alpha = 1; beta = 1;
        A = Matrix(double, M, K, RANDOM, flush align(16));
        B = Matrix(double, K, N, RANDOM, flush align(16));
        C = Matrix(double, M, N, RANDOM, flush align(16));
    };
    flop = "2*M*N*K + M*N";

Template of the auto-generated timing driver:

    for each routine parameter s in R:
        if s is a pointer or array variable, allocate memory for s
    for each repetition of the timing:
        for each routine parameter s in R:
            if s needs to be initialized, initialize the memory of s
        if cache flushing is enabled, flush the cache
        time_start <- current time
        call R
        time_end <- current time
        time_spent <- time_end - time_start
    calculate the min, max, and average of time_spent
    if flop is defined, calculate the max and average MFLOPS
    print all timings
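
To make the template concrete, here is a hand-written C sketch of the kind of driver it yields for the ATL_USERMM specification above. The repetition count and the use of clock() (which is coarse-grained) are assumptions, and the 16-byte alignment and cache flush are omitted for brevity.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define MS 72
    #define NS 72
    #define KS 72
    #define NREP 100                /* repetition count: an assumption */

    void ATL_USERMM(const int M, const int N, const int K,
                    const double alpha, const double *A, const int lda,
                    const double *B, const int ldb,
                    const double beta, double *C, const int ldc);

    int main(void)
    {
        double *A = malloc(sizeof *A * (size_t)MS * KS);
        double *B = malloc(sizeof *B * (size_t)KS * NS);
        double *C = malloc(sizeof *C * (size_t)MS * NS);
        double tmin = 1e30, tmax = 0.0, tsum = 0.0;

        for (int r = 0; r < NREP; r++) {
            /* re-initialize the operands with random values each repetition */
            for (size_t i = 0; i < (size_t)MS * KS; i++) A[i] = (double)rand() / RAND_MAX;
            for (size_t i = 0; i < (size_t)KS * NS; i++) B[i] = (double)rand() / RAND_MAX;
            for (size_t i = 0; i < (size_t)MS * NS; i++) C[i] = (double)rand() / RAND_MAX;

            clock_t t0 = clock();
            ATL_USERMM(MS, NS, KS, 1.0, A, MS, B, KS, 1.0, C, MS);
            clock_t t1 = clock();

            double t = (double)(t1 - t0) / CLOCKS_PER_SEC;
            if (t < tmin) tmin = t;
            if (t > tmax) tmax = t;
            tsum += t;
        }

        /* flop count from the specification: 2*M*N*K + M*N */
        double flop = 2.0 * MS * NS * KS + (double)MS * NS;
        printf("min %.6e s  max %.6e s  avg %.6e s\n", tmin, tmax, tsum / NREP);
        printf("max MFLOPS %.1f  avg MFLOPS %.1f\n",
               flop / tmin / 1e6, flop * NREP / tsum / 1e6);
        free(A); free(B); free(C);
        return 0;
    }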

The Checkpointing Approach (for data-sensitive routines)

Instrumentation in the original application:

    enter_checkpoint(checkpointing_image_name);
    ...
    starttime = getwalltime();
    retval = maingtu(i1, i2, block, quadrant, nblock, budget);
    endtime = getwalltime();
    ...
    stop_checkpoint();

- The checkpoint image includes:
  - all the data in memory before the call to enter_checkpoint;
  - all the instructions between enter_checkpoint and stop_checkpoint.
- The checkpoint image is saved to a file.
- Auto-generated timers can invoke the checkpoint image via a call to restart_checkpoint.
- Built on the Berkeley Lab Checkpoint/Restart (BLCR) library.
- Delayed checkpointing: call enter_checkpoint several instructions or loop iterations ahead of the routine, so that replaying the image restores the cache state (see the sketch below).
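
A sketch of delayed checkpointing in C, using the wrapper names from the slide (assumed to be built on BLCR); the WARMUP constant, do_iteration, and timed_routine are hypothetical stand-ins for the application's surrounding work and the routine under tuning.

    #include <stdio.h>

    /* Wrapper API from the slide (assumed to sit on top of BLCR): */
    void enter_checkpoint(const char *image_name);
    void stop_checkpoint(void);
    double getwalltime(void);

    /* Hypothetical stand-ins: */
    void do_iteration(int iter);
    int  timed_routine(void);

    #define WARMUP 8   /* iterations replayed before the timed call: an assumption */

    void time_with_delayed_checkpoint(int target_iter, int niters)
    {
        double starttime = 0.0, endtime = 0.0;
        for (int iter = 0; iter < niters; iter++) {
            /* Take the checkpoint WARMUP iterations early: on restart, these
               iterations are re-executed, restoring a realistic cache state
               before the routine is timed. */
            if (iter == target_iter - WARMUP)
                enter_checkpoint("routine.img");

            do_iteration(iter);            /* surrounding application work */

            if (iter == target_iter) {
                starttime = getwalltime();
                (void)timed_routine();
                endtime = getwalltime();
                stop_checkpoint();         /* end of the region in the image */
            }
        }
        printf("routine time: %f s\n", endtime - starttime);
    }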

The POET Language

- A language for expressing parameterized program transformations
  - Parameterized code transformations plus a configuration space.
  - Transformations are controlled by tuning parameters.
  - Configuration space: the parameters and the constraints on their values.
  - Interpreted by a search engine and a transformation engine.
- Language capabilities:
  - Can parse, transform, and output arbitrary languages.
    - Subsets of C/C++, Cobol, and Java have been tried; Fortran support is planned.
  - Can express arbitrary program transformations (illustrated below).
    - Supports optimizations written by compilers or by developers.
    - A large collection of compiler optimizations has been implemented.
    - Has achieved performance comparable to ATLAS (LCSD'07).
  - Can easily compose different transformations.
    - Allows transformations to be easily reordered.
    - Enables empirical tuning of the transformation ordering (LCPC'08).
  - Parameterization is built in and well supported.
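
For intuition only (this is plain C before/after, not POET syntax): the kind of parameterized transformation such scripts express. The unroll factor UF is a tuning parameter; the transformation engine emits one variant per candidate value.

    /* Original loop. */
    for (i = 0; i < N; i++)
        c[i] = a[i] * b[i];

    /* The same loop unrolled with UF = 4, cleanup loop included;
       other values of UF yield the corresponding variants. */
    for (i = 0; i + 3 < N; i += 4) {
        c[i]     = a[i]     * b[i];
        c[i + 1] = a[i + 1] * b[i + 1];
        c[i + 2] = a[i + 2] * b[i + 2];
        c[i + 3] = a[i + 3] * b[i + 3];
    }
    for (; i < N; i++)
        c[i] = a[i] * b[i];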

Experimental Evaluation

- Goal: verify that POET-generated timers can
  - significantly reduce tuning time for large applications;
  - accurately reproduce the performance of the tuned routines.
- Methodology
  - Compare POET-generated timers with the ATLAS timers, using gemm kernels optimized in different ways by POET.
  - Compare POET-generated timers with profiling results from running the whole applications, for both data-insensitive and data-sensitive routines.
  - Verify both the default timing approach and the checkpointing approach.
- Evaluation platforms
  - Two multicore platforms: a 3.0 GHz dual-core AMD Opteron 2222 and a 3.0 GHz quad-core Intel Xeon Mac Pro.
  - Timings obtained in serial mode using a single core of each machine.

Reduction in tuning time

    Routine            Full application   Delayed checkpoint   Immediate checkpoint   Default timing via POET
    mult_su3_mat_vec   877,430 ms         3,502 ms             3,510 ms               5 ms
    maingtu            45,765 ms          2,019 ms             1,975 ms               4 ms
    scan_for_patterns  90,460 ms          6,218 ms             5,930 ms               n/a

Comparing to ATLAS [performance figure not transcribed]

Tuning Data-Insensitive Routine [performance figure not transcribed]

Tuning Data-Sensitive Routine [performance figure not transcribed]

Summary and Ongoing Work

- Goal: reduce the tuning time of large scientific applications.
  - Independently measure and tune the performance of critical routines.
  - Accurately replicate the execution environment of those routines.
- Solutions
  - Libraries to profile and collect the execution environment of critical routines.
  - Use POET to automatically generate timing drivers.
  - Immediate and delayed checkpointing approaches.
- Ongoing work
  - Further reduce tuning time through the right search strategies.
  - Automate the tuning process by integrating POET with advanced compiler technologies.