AutoTune Workshop
Michael Gerndt, Technische Universität München

AutoTune Project
Automatic Online Tuning of HPC Applications
Target groups: HPC application developers (performance), compute centers (energy efficiency)
Partners: Technische Universität München (coordinator), Universität Wien, Universitat Autonoma de Barcelona, Leibniz Rechenzentrum, University of Galway / ICHEC
Duration: October 2011 - April 2015

Project Participants Today
Michael Gerndt, TUM (coordinator)
Michael Lysaght, ICHEC (WP5: evaluation)
Yury Oleynik, TUM (ECS-Group)

Goals of the Meeting
Present the Periscope Tuning Framework
Present post-project exploitation plans: AutoTune Demonstration Center, ECS-Group spin-off
Collect feedback on post-project exploitation

Agenda for Today
11:00-12:30  Introduction to the project
Lunch
14:00-14:30  AutoTune Demonstration Center
14:30-15:00  ECS-Group
15:00-15:45  Discussion, questionnaire
Coffee

Today's Parallel Architectures
A zoo of architectures: multicore processors, NUMA multiprocessors, accelerators, clusters, supercomputers
and of programming models: OpenMP, PGAS, TBB, MPI, CUDA, OpenCL, OpenACC

Multicore Processors
Ingredients: superscalar cores, SIMD units, multithreading, multiple cores, memory hierarchy
(Figure: Intel Sandy Bridge processor)
Challenges: long-latency, low-bandwidth memory access; multiple levels of parallelism (ILP, SIMD, thread); compilation, vectorization, parallelization with OpenMP (see the sketch below)
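
As an illustration of the last challenge, a minimal sketch (not from the slides) of exploiting thread- and SIMD-level parallelism in C++ with OpenMP; the function and array names are made up for the example:

    #include <vector>

    // Scale-and-add over a large array: the pragma distributes iterations
    // across threads and asks the compiler to vectorize each thread's chunk
    // (compile e.g. with -fopenmp -O2).
    void saxpy(std::vector<float>& y, const std::vector<float>& x, float a) {
        #pragma omp parallel for simd
        for (std::size_t i = 0; i < y.size(); ++i)
            y[i] += a * x[i];
    }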

NUMA Multiprocessors
Ingredients: multiple multicore processors, shared address space
(Figure: SuperMUC board)
Challenges: different memory latencies for local vs. remote access; contention for memory and synchronization; OS support for thread affinity (see the affinity example below)
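
A hedged sketch of how thread affinity and data placement are typically handled on NUMA nodes, using standard OpenMP 4.0 controls (not a Periscope feature); the routine name is illustrative:

    #include <vector>

    // First-touch initialization: each thread touches the pages it will later
    // work on, so the OS places them in that thread's local NUMA node.
    // Thread pinning itself is requested via standard environment variables,
    // e.g.  OMP_PLACES=cores OMP_PROC_BIND=close ./a.out
    void init_first_touch(std::vector<double>& data) {
        #pragma omp parallel for schedule(static)
        for (std::size_t i = 0; i < data.size(); ++i)
            data[i] = 0.0;
    }

Using the same static schedule in the later compute loops keeps each thread working on its locally placed pages.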

Accelerators
Ingredients: many cores, wide SIMD, specialized hardware
Challenges: special programming models (CUDA, OpenCL, OpenACC); computation and data offload (see the OpenACC sketch below); application tuning
(Figures: Nvidia Kepler, Intel Xeon Phi)
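
A minimal sketch of computation and data offload with OpenACC directives, assuming an OpenACC-capable compiler; the function is invented for illustration:

    // Offload a vector update to an accelerator: copyin moves x to the device,
    // copy moves y there and back, and the compiler generates the device kernel
    // (compile e.g. with nvc++ -acc or pgc++ -acc).
    void scale_add(float* y, const float* x, float a, int n) {
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] += a * x[i];
    }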

Clusters
Ingredients: many nodes, scalable high-performance network, distributed memory
Challenges: explicit data transfers, global parallelization with MPI (see the halo-exchange sketch below), scalability, energy consumption
(Figure: SuperMUC @ LRZ)
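
A small sketch of the explicit data transfers typical of cluster codes, here a 1D halo exchange between neighbouring MPI ranks; the decomposition and names are assumptions for the example:

    #include <mpi.h>
    #include <vector>

    // Exchange boundary ("halo") values with the left and right neighbour
    // ranks. u holds one ghost cell at each end of the local subdomain.
    void halo_exchange(std::vector<double>& u, MPI_Comm comm) {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
        int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;
        int n = static_cast<int>(u.size());

        // Send the first interior value left, receive the right ghost cell.
        MPI_Sendrecv(&u[1],     1, MPI_DOUBLE, left,  0,
                     &u[n - 1], 1, MPI_DOUBLE, right, 0,
                     comm, MPI_STATUS_IGNORE);
        // Send the last interior value right, receive the left ghost cell.
        MPI_Sendrecv(&u[n - 2], 1, MPI_DOUBLE, right, 1,
                     &u[0],     1, MPI_DOUBLE, left,  1,
                     comm, MPI_STATUS_IGNORE);
    }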

Development Cycle
Coding, measurement, performance analysis, refinement, program tuning, ranking, production

Performance Analysis
Goal: identify inefficient regions and their root cause
Analysis techniques: time profiles; single core: counter profiles; multiple cores: overhead profiles; accelerators: counter and overhead profiles; clusters: communication profiles and traces
Measurement techniques: sampling, instrumentation

Performance Analysis Tools
Profiling tools: single core, accelerators, multiple cores, clusters; IDE integration
Tracing tools: multiple cores, clusters; visualization of dynamic behavior
Automatic tools: go beyond raw measurements

Knowledge Involved
(Diagram of the automatic analysis loop: current hypotheses are refined into analysis requirements, which drive instrumentation and an experiment; the resulting performance data yields detected problems.)

Design-Time and Runtime Tuning
Application tuning objectives: performance, cost, energy
When to tune: at design time (coding, performance analysis, tuning), at installation, or at runtime during production runs (online performance analysis and tuning)

VI-HPS: Virtual Institute for High Productivity Supercomputing
Aims at developing tools and training users to enable higher productivity in using supercomputers
Partners: FZJ, GRS, RWTH Aachen, Uni Stuttgart, ZIH Dresden, TUM, Uni Oregon, Uni Tennessee

Periscope Performance Analysis
On-line: no need to store trace files
Distributed: reduced network utilization
Scales to 100,000s of CPUs
Analyzes: single-node performance, MPI communication, OpenMP
Supports: Fortran, C; Intel Itanium-based systems, IBM Power, BlueGene/P, x86-based systems
periscope.in.tum.de

Performance Properties
APART specification language: a performance property consists of a condition, a severity, and a confidence
Agent analysis strategies determine the hypothesis refinement: region nesting, property hierarchy-based refinement; one experiment per phase execution
(See the illustrative property sketch below.)
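
Purely as an illustration of the property structure (this is neither the APART syntax nor Periscope's actual classes), a property could be sketched in C++ like this:

    #include <string>

    // Illustrative sketch: a performance property is a named hypothesis with a
    // condition that decides whether it holds, a severity quantifying the lost
    // time, and a confidence in the underlying measurement.
    struct StallCyclesProperty {
        double stall_cycles;   // measured for one code region in one phase
        double total_cycles;
        double phase_time;     // duration of the phase execution in seconds

        bool   condition()  const { return stall_cycles / total_cycles > 0.1; }
        double severity()   const { return stall_cycles / total_cycles * phase_time; }
        double confidence() const { return 1.0; }  // exact counters; sampling would lower this
        std::string name()  const { return "High stall cycle rate"; }
    };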

Agent Search Strategies
An application phase is a period of the program's execution
Phase region: the full program or a single user region, assumed to be repetitive; phase boundaries have to be global (SPMD programs)
Search strategies determine the hypothesis refinement: region nesting, property hierarchy-based refinement

Periscope Tuning Framework
Building blocks: Periscope, Pathway, tuning plugins

AutoTune Project
Automatic tuning based on expert knowledge: when is it necessary to tune? Where to tune? What are the tuning parameters? How can the search space be shrunk?
Goes beyond auto-tuners like Active Harmony
Tuning plugins for Periscope: dynamically loaded, typically fewer than a few hundred lines of code

Periscope Tuning Framework (PTF)
Extension of Periscope: online, application phase-based tuning process
Extensible via tuning plugins: a plugin covers a single tuning aspect; multiple tuning aspects can be combined
Rich framework for plugin implementation: automatic and parallel experiment execution
(Architecture diagram: PTF frontend with user interface, search algorithms, tuning plugins, static program info and experiment execution; agent network with monitors attached to the application; performance analysis)
(See the hypothetical plugin-interface sketch below.)
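
To make the plugin idea concrete, a hypothetical sketch of what a tuning-plugin interface could look like; the class and method names below are illustrative only and are not PTF's actual C++ API:

    // Hypothetical plugin interface: a plugin defines a search space, proposes
    // scenarios, evaluates the experiments PTF runs, and finally reports advice.
    class TuningPlugin {
    public:
        virtual ~TuningPlugin() {}
        // Declare the tuning parameters and their ranges (the search space).
        virtual void defineSearchSpace() = 0;
        // Produce the next set of scenarios (parameter combinations) to try.
        virtual void createScenarios() = 0;
        // Inspect the measurements of the finished experiments.
        virtual void processResults() = 0;
        // True once the search strategy has converged.
        virtual bool searchFinished() const = 0;
        // Report the best scenario found as tuning advice.
        virtual void reportAdvice() const = 0;
    };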

Tuning Plugins
MPI parameters (UAB, TUM): eager limit, buffer space, collective algorithms; applied via application restart or the MPI_T tools interface
DVFS (LRZ): frequency tuning for the energy delay product; model-based prediction of the frequency from a pre-analysis; region-level tuning (see the EDP sketch below)
Parallelism capping (TUM): thread-number tuning for the energy delay product; exhaustive search and curve-fitting-based prediction
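
A small sketch, under assumed data structures, of how a DVFS or parallelism-capping plugin could rank its scenarios by energy delay product (EDP = energy x runtime); the struct and values are placeholders, not PTF code:

    #include <vector>

    // Each measured scenario carries its CPU frequency plus the energy and
    // runtime observed in the corresponding experiment.
    struct Scenario { double frequency_ghz; double energy_j; double time_s; };

    // Return the scenario with the smallest energy delay product.
    // Assumes at least one measured scenario.
    Scenario best_by_edp(const std::vector<Scenario>& measured) {
        Scenario best = measured.front();
        for (const Scenario& s : measured)
            if (s.energy_j * s.time_s < best.energy_j * best.time_s)
                best = s;
        return best;
    }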

Tuning Plugins (continued)
Master/worker (UAB): partition factor and number of workers; prediction through a performance model based on data measured in a pre-analysis
Parallel pattern (Uni Vienna): tuning replication and buffering between pipeline stages; based on component distribution via StarPU
OpenHMPP/OpenACC (CAPS): tuning parameters based on CAPS tune directives; optimizing data transfer and kernel execution time

Tuning Plugins (continued)
MPI-IO (TUM): tuning data sieving and the number of aggregators; exhaustive and model-based search (see the hints sketch below)
Compiler flag selection (TUM): automatic recompilation and execution; selective recompilation based on a pre-analysis; exhaustive and individual search; scenario analysis for significant routines; combination with Pathway; machine learning support
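
For orientation, a sketch of the kind of knobs the MPI-IO plugin searches over, set here by hand as ROMIO hints: the number of collective-buffering aggregators and data sieving for writes. The chosen values are placeholders, not recommendations:

    #include <mpi.h>

    // Open a file for writing with explicit MPI-IO hints.
    MPI_File open_tuned(MPI_Comm comm, const char* path) {
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "cb_nodes", "8");             // number of aggregators
        MPI_Info_set(info, "romio_ds_write", "disable"); // data sieving for writes

        MPI_File fh;
        MPI_File_open(comm, path, MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
        MPI_Info_free(&info);
        return fh;
    }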

Pathway Goals
Formalize performance engineering workflows
Automate the usage of analysis and tuning tools
Remove "monkey work" from code developers
Support the overall tuning effort by bookkeeping

Pathway Features
Formal, graphical view of workflows; pre-defined workflows available
Transparently serves different HPC systems: connection parameters are defined per system, and Pathway automatically generates batch scripts for different schedulers
Automatic tool invocation, experiment browser, notebook

Scalability Workflow

Individual Tuning NPB (C)

k-nn training error

DVFS Tuning Plugin
SeisSol application, code-region optimization; percentage of energy savings reported for the outermost region (Region ID 1).
Energy consumption in Joules per region and core frequency:
Region ID    1.6 GHz    1.7 GHz    1.8 GHz
1            1132       1143       1172
2            185        185        186
3            912        928        971

MPI Parameters Tuning (fish school simulation, 64k fishes)
Parameter            Min        Max          Best Value
eager_limit          4196 B     65560 B      65560
use_bulk_xfer        Yes        No           Yes
bulk_min_msg_size    4096 B     1048576 B    268288
task_affinity        CORE       MCM          CORE
pe_affinity          Yes        No           Yes
cc_scratch_buf       Yes        No           No
Execution time (distribution) with the best values: 0.587957 (0.831273 with default values)

Periscope Tuning Framework
periscope.in.tum.de
Version: 1.0 (1.1 coming in April)
Supported systems: x86-based clusters, BlueGene, IBM Power
License: New BSD

Summary
Manual and automatic performance engineering are important.
Periscope supports automatic online analysis.
The Periscope Tuning Framework adds automatic tuning via plugins.
Pathway formalizes and supports the overall performance engineering process.
Future work: integration with Score-P.