Code Auto-Tuning with the Periscope Tuning Framework


Code Auto-Tuning with the Periscope Tuning Framework Renato Miceli, SENAI CIMATEC renato.miceli@fieb.org.br Isaías A. Comprés, TUM compresu@in.tum.de

Project Participants
Michael Gerndt, TUM: Coordinator
Renato Miceli, SENAI CIMATEC: WP5 Evaluation Leader at ICHEC
Isaías A. Comprés, TUM: Periscope Plugin Developer
Siegfried Benkner, University of Vienna: WP2, WP4 Work Package Leader

Tutorial Overview
For users of PTF: Motivation; The AutoTune Project; Periscope Tuning Framework (PTF): Architecture, Plugins; Installation & Setup; Demo; Case Study & Best Practices: NPB, SeisSol, LULESH; AutoTune Demonstration Center
For plugin developers: see the developer session

Background
Given: multicore processors, accelerators, HPC clusters; many programming models and applications.
Problem: how to tune an application for a given architecture?
Targeted users: HPC application developers (improve performance), HPC application users (faster tuning), supercomputing centers (reduce energy cost).

Motivation
Why tune applications? Increasing complexity of HPC architectures, frequently changing HPC hardware, and compute time is a valuable resource.
Why not manually? A large set of tuning parameters with intricate dependencies, a diverse set of performance analysis tools, and several iterations between analysis and improvements.

Tuning Stages
Measure: application tuning runs, performance data collection, identify metrics
Analyze: paradigm and programming model, search space strategies
Optimize: apply identified optimizations, user knowledge
Test: re-evaluate
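
The measure/analyze/optimize/test cycle can be sketched as a minimal auto-tuning loop. This is a hypothetical illustration, not PTF's actual API; the run_experiment cost model is invented for the example.

```python
# Minimal sketch of the measure -> analyze -> optimize -> test cycle.
# run_experiment stands in for one instrumented application tuning run.

def run_experiment(scenario):
    """Pretend to run the application and return execution time in seconds.
    Invented cost model: higher -O helps, loop unrolling saves a fixed cost."""
    opt_level, unroll = scenario
    return 100.0 / (1 + opt_level) + (0.0 if unroll else 5.0)

def autotune(search_space):
    best_scenario, best_time = None, float("inf")
    for scenario in search_space:       # Measure: one tuning run per scenario
        t = run_experiment(scenario)
        if t < best_time:               # Analyze: compare against best so far
            best_scenario, best_time = scenario, t
    return best_scenario, best_time     # Optimize: report the winning scenario

# Test: re-evaluate the whole space of (opt level, unroll) combinations.
space = [(o, u) for o in (1, 2, 3) for u in (False, True)]
best, t = autotune(space)
print(best, round(t, 1))
```

In a real PTF run, run_experiment corresponds to executing the instrumented phase region and collecting measurements through the agent network.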

An Ideal Solution
Productivity: removes the burden of tuning from developers
Portability: portable tuning techniques across different environments
Reusability: the same techniques applied across different applications
Flexibility: re-evaluate optimizations for different scenarios
Performance: provides performance improvements

Periscope Tuning Framework
Objective: a single tool for performance analysis and tuning.
Extends Periscope with a tuning framework: tuning plugins for performance and energy efficiency tuning, online tuning, and the combination of multiple tuning techniques.

AutoTune Consortium
Technische Universität München, Germany
Universität Wien, Austria
CAPS Entreprise, France
Universitat Autònoma de Barcelona, Spain
Leibniz Supercomputing Centre, Germany
Irish Centre for High-End Computing, Ireland

AutoTune Goals
Automatic application tuning, for both performance and energy
Create a scalable and distributed framework
Evaluate the alternatives online
Tune for multicore and GPU-accelerated systems
Support a variety of parallel paradigms: MPI, OpenMP, OpenCL, parallel patterns, HMPP

Review
Tuning applications: automatic tuning is necessary because manual tuning is a time-consuming and cumbersome process.
Ideal tool: should perform tuning automatically with improved productivity, flexibility, reusability and performance.
AutoTune Project: the AutoTune project is addressing these requirements.

Progress
Motivation; AutoTune; Periscope Tuning Framework (PTF): Architecture, Plugins; Installation & Setup; Case Study & Best Practices: NPB, SeisSol, LULESH; AutoTune Demonstration Center

Periscope Performance Analysis Toolkit
Online: no need to store trace files
Distributed: reduced network utilization
Scalable: up to 100,000s of CPUs
Multi-scenario analysis: single-node performance, MPI communication, OpenMP
Portable: Fortran and C with MPI & OpenMP; Intel Itanium2 and x86-based systems, IBM Power6, BlueGene/P, Cray
[Architecture diagram: PTF user interface and frontend with static program information, a performance analysis agent network, and monitors attached to the application]

AutoTune Approach
AutoTune follows the Periscope principles: predefined tuning strategies combining performance analysis and tuning, online search, and distributed processing.
Plugins (online and semi-online): compiler-based optimization, HMPP tuning for GPUs, parallel pattern tuning, MPI tuning, energy efficiency tuning, and user-defined plugins (more info in the developer session).

Periscope Framework
Extension of Periscope: online tuning process, application phase-based
Extensible via tuning plugins: a single tuning aspect, or a combination of multiple tuning aspects
Rich framework for plugin implementation: automatic and parallel experiment execution
[Architecture diagram: PTF user interface and frontend with tuning plugins, search algorithms, experiment execution and static program information, a performance analysis agent network, and monitors attached to the application]

Tuning Strategy
A plugin strategy and an analysis strategy drive the plugin lifecycle: static analysis and instrumentation, analysis using the Periscope frontend, and generation of the tuning report.


Plugin Overview
Compiler Flag Selection (CFS): targets compiler flag combinations; objective: optimize performance
Dynamic Voltage and Frequency Scaling (DVFS): targets CPU frequency and power states; objective: reduce energy consumption with minimal performance impact
High-Level Pipeline Patterns: targets pipeline stages; objective: optimize throughput
MPI Runtime: targets MPI parameters; objective: optimize SPMD MPI application performance
Master-Worker MPI: targets workload imbalance in MPI applications; objective: balance application load by optimizing communication
OpenCL Plugin: targets kernel tuning based on NDRange; objective: improve utilization of many-core architectures

Compiler Flag Selection (CFS)
Goal: optimize application performance by guiding machine code generation through compiler flags.
Tuning technique: selection of compiler flags and corresponding values (e.g. -O1, -O2, -floop-unroll); search through combinations of flags.

Dynamic Voltage and Frequency Scaling (DVFS)
Goal: reduce energy consumption with the objective of an optimal performance-to-energy ratio.
Tuning technique: select the CPU governor and frequency as well as processor-specific power states; an energy prediction model avoids evaluating all frequency/governor combinations.
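
The idea behind the energy prediction model can be sketched as follows: predicted energy is predicted power times predicted runtime, and the plugin picks the frequency minimizing that product instead of measuring every frequency/governor combination. The power and runtime models below are invented for illustration, not the plugin's actual fitted models.

```python
# Sketch: pick the CPU frequency minimizing predicted energy = power * time.
# Both models are invented (cubic dynamic power, partially frequency-bound
# runtime); the real DVFS plugin derives its model from measurements.

FREQS_GHZ = [1.2, 1.5, 1.8, 2.1, 2.4]

def predicted_time(f, t_ref=100.0, f_ref=2.4, compute_fraction=0.3):
    """Runtime: only the compute-bound fraction scales with frequency."""
    return t_ref * ((1 - compute_fraction) + compute_fraction * f_ref / f)

def predicted_power(f, p_static=40.0, c=4.0):
    """Power: a static part plus a dynamic part growing roughly as f^3."""
    return p_static + c * f ** 3

def best_frequency(freqs):
    return min(freqs, key=lambda f: predicted_power(f) * predicted_time(f))

print(best_frequency(FREQS_GHZ))
```

With these invented constants the lowest available frequency wins, trading a longer runtime for lower power, which mirrors the trade-off reported for SeisSol later in the tutorial.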

High-Level Pipeline Patterns
Goal: optimize the throughput of pipeline patterns by exploiting CPUs and GPUs effectively.
Tuning technique: selection of the stage replication factor, the sizes of intermediate buffers, and the choice of hardware for executing a task.

MPI Runtime
Goal: tune MPI parameters to improve SPMD code performance.
Tuning technique: MPI runtime parameters such as application mapping and buffer/protocol settings; MPI communication parameters such as the eager limit and buffer limit.

Master-Worker MPI
Goal: optimize execution time by balancing application load and decreasing communication between workers.
Tuning technique: evaluate the impact of varying the number of workers and the partition size per worker.
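
The trade-off this plugin explores can be sketched with a toy model (all constants invented): small partitions balance load across workers but pay a per-message communication cost, while large partitions reduce messages but risk leaving some workers idle.

```python
# Toy model of master-worker tuning: choose a partition (chunk) size that
# balances per-message overhead against load imbalance. Constants invented.

def makespan(total_tasks, n_workers, chunk, t_task=1.0, t_msg=5.0):
    """Time until the most-loaded worker finishes, round-robin chunks."""
    n_chunks = -(-total_tasks // chunk)            # ceiling division
    chunks_per_worker = -(-n_chunks // n_workers)  # worst-loaded worker
    work = chunks_per_worker * chunk * t_task      # compute time
    comm = chunks_per_worker * t_msg               # one message per chunk
    return work + comm

def best_chunk(total_tasks, n_workers, candidates):
    return min(candidates, key=lambda c: makespan(total_tasks, n_workers, c))

print(best_chunk(1000, 8, [1, 10, 25, 50, 125, 500]))
```

Here 1000 tasks on 8 workers favor a chunk of 125: each worker gets exactly one partition, so there is no imbalance and only one message per worker.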

Review
Periscope Tool: Periscope performs the performance analysis.
Periscope Tuning Framework (PTF): PTF extends Periscope to use the performance data and allows developers to write tuning plugins.
Tuning Plugins: tuning plugins use the PTF infrastructure and automate the tuning of applications using one or more tuning techniques.


Setting up PTF
periscope.in.tum.de
Version: 1.0 (1.1 coming in April 2015)
Supported systems: x86-based clusters, BlueGene, IBM Power
License: New BSD

Setting up PTF
Download PTF: http://periscope.in.tum.de/releases/latest/tar/ptf-latest.tar.bz2
Installation of PTF uses Autotools.
Third-party library requirements: ACE (version >= 6.0), Boost (version >= 1.47), Xerces (version >= 2.7), PAPI (version >= 5.1)

Setting up PTF
Plugin-specific libraries: the Enopt library for the DVFS plugin, the Vpattern library for the Patterns plugin
Doxygen-documented code, plugin-specific user guides, and a sample applications repository

Using PTF
Workflow: Setting up PTF, Find Phase Region, Instrument Application, Execute Application, Interpret Results.
Identifying a phase region: a code region with a repetitive, computationally intensive part; PTF measurements focus on the phase region. This step is optional; the entire application is the default phase region.
Instrumenting the application: use the psc_instrument command.
Executing the application: use the Periscope frontend via the psc_frontend command.
Interpreting the results.

Finding the Phase Region
If not known, find the computationally intensive part using the output of gprof or an equivalent tool, e.g. for the serial NPB BT-MZ benchmark.

Finding the Phase Region
Annotate the code within the loop using pragmas (BT-MZ/bt.f).

Adding Instrumentation
Change the makefile to add the instrumentation information.

Executing the Application
Prepare a config file, e.g. cfs_config.cfg.

Executing the Application
The psc_frontend command is used to execute PTF; --tune=compilerflags selects the plugin:
$ psc_frontend --apprun="./bt-mz.c.x" \
    --starter=fastinteractive --delay=2 \
    --mpinumprocs=1 --tune=compilerflags \
    --force-localhost \
    --debug=2 --selective-debug=autotuneall \
    --sir=bt-mz.c.x.sir

Interpreting the Results
The flag combination executing in minimal time is reported as the optimal scenario.

Demo

Review
Installing the Periscope Tuning Framework
Using the Periscope Tuning Framework
Instrumenting and executing a sample application


Tuning Results
Comparison of manual tuning against automatic tuning (execution time in seconds): Baseline (with -O2): 872.4; Manual Tuning: 797.8; PTF Suggested Flags: 797.5

Selecting a Search Strategy
The search strategy influences the number of scenarios executed and the total execution time of the plugin. Scenario counts: Exhaustive: 47; Individual k=3: 14; Individual k=2: 10; Individual k=1: 6.
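
The gap between the strategies can be sketched by counting scenarios over a hypothetical flag set (the flags and values below are illustrative, not the plugin's actual search space): exhaustive search grows multiplicatively with the number of flags, while tuning one flag at a time grows only additively.

```python
from itertools import product

# Hypothetical tuning space: each flag has a few candidate values.
flags = {
    "-O": ["1", "2", "3"],
    "-funroll-loops": ["off", "on"],
    "-fomit-frame-pointer": ["off", "on"],
}

# Exhaustive search: every combination of values is one scenario.
exhaustive = len(list(product(*flags.values())))

# Individual search: tune one flag at a time, fixing each winner before
# moving on, so the counts add instead of multiply.
individual = sum(len(values) for values in flags.values())

print(exhaustive, individual)  # 12 vs 7 scenarios
```

Individual search runs far fewer scenarios but can miss combinations whose benefit only appears jointly, which is the trade-off visible in the measured results above.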

Tuning Results
Improvement in the PTF execution time to find the optimal flags for BT-MZ class C: from 42,299.7 s down to 2,790.63 s, 2,070.3 s, 285.6 s, and finally 182.6 s. [Bar chart; the configuration labels were not preserved in the transcription]


SeisSol Use Case
Developed by the Department of Earth and Environmental Sciences at Ludwig-Maximilians-Universität München.
The application simulates realistic earthquake scenarios and the propagation of seismic waves, including viscoelastic attenuation, strong material heterogeneities, and anisotropy.
Code details: Fortran code, MPI.

SeisSol Use Case
Execution of SeisSol with the DVFS plugin: the energy prediction model predicted the optimal frequency as 1.2 GHz. The plugin tests frequencies in +/-100 MHz steps around the prediction; here, 1.2 GHz is the minimum frequency supported by the hardware. About 26% of the energy is saved at the cost of roughly 10% longer execution time.


LULESH Use Case
Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH) models hydrodynamics and is one of the DOE proxy applications.
Code details: C++ code; serial, OpenMP, MPI, and MPI+OpenMP variants.

LULESH Use Case
CFS tuning results (execution time in seconds): Baseline: 50.6; PTF: 45.8

LULESH Use Case
Execution results with the DVFS plugin for 1 iteration and for 1000 iterations.

Meta-Plugins
Fixed Sequence Plugin: executes a selected set of plugins in the given order.
Adaptive Sequence Plugin: uses the optimal scenario found by the previous plugin.
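
The adaptive-sequence idea can be sketched as follows; the plugin functions and the dictionary-based configuration are invented interfaces for illustration, not PTF's plugin API.

```python
# Sketch of an adaptive sequence meta-plugin: run plugins in order and feed
# each plugin's optimal scenario into the next one. Interfaces are invented.

def cfs_plugin(config):
    # Pretend the CFS plugin found the best compiler flags.
    return {**config, "flags": "-O3"}

def dvfs_plugin(config):
    # The frequency choice may depend on the code the chosen flags produced,
    # which is why the adaptive sequence runs DVFS after CFS.
    freq = 1.2 if config.get("flags") == "-O3" else 2.4
    return {**config, "freq_ghz": freq}

def adaptive_sequence(plugins, config=None):
    config = dict(config or {})
    for plugin in plugins:   # each plugin starts from the previous optimum
        config = plugin(config)
    return config

best = adaptive_sequence([cfs_plugin, dvfs_plugin])
print(best)
```

A fixed sequence would run the same plugins in the same order but without threading the previous optimum through; the adaptive variant is what lets later plugins exploit earlier results.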


AutoTune Demonstration Center
The AutoTune Demonstration Centre (ADC) was established by the AutoTune partners (AutoTune Deliverable D6.4). Figure 22 depicts the various components of the ADC: centered around the Periscope Tuning Framework are the hardware, software, compilers and licenses, provided by BADW-LRZ and optionally by NUI Galway. With these resources the partners perform their activities and deliver services to their users and customers.
The objectives of the ADC are to:
Spread the use of the AutoTune software
Further exploit energy saving and power steering
Disseminate Best Practice Guides
Provide developers with a platform to test and validate
Become a platform for the exchange of information
Extend the PTF developer community
The ADC offers the following services: online documentation and Best Practice Guides, discussion forums, AutoTune training events, individual support for applications, education and training of students, the AutoTune website as a central hub, and a service desk with issue tracking.
Structure of the AutoTune Demonstration Center (Figure 23): the ADC is organized as a community effort with lean governance. BADW-LRZ provides the underlying software and hardware; in addition, all partners of the AutoTune consortium are welcome to add their own additional software. The ADC is coordinated by BADW-LRZ in close cooperation with TUM, with TUM providing a stable version of the Periscope Tuning Framework and BADW-LRZ managing access and the additional software and licenses. Contact persons at the partners are responsible for the individual exploitation aspects and organize the internal and external activities of the Center. Partners meet twice a year to discuss the operation and direction of the ADC.

Conclusion
Automatic tuning: motivation and AutoTune
Periscope Tuning Framework (PTF)
Tuning plugins
Instrumenting and executing a sample application
Demos and best practices: NPB, SeisSol, LULESH
AutoTune Demonstration Center