AUTOMATIC SMT THREADING FOR OPENMP APPLICATIONS ON THE INTEL XEON PHI CO-PROCESSOR

Wim Heirman (1,2), Trevor E. Carlson (1), Kenzo Van Craeynest (1), Ibrahim Hur (2), Aamer Jaleel (2), Lieven Eeckhout (1)
(1) Ghent University  (2) Intel Corporation
ROSS 2014, Munich

INTEL EXASCALE LABS EUROPE
Strong commitment to advance computing at the leading edge: Intel collaborates with the HPC community and European researchers. Four labs in Europe, with exascale computing as the central topic:
- ExaScale Computing Research Lab, Paris
- ExaCluster Lab, Jülich
- ExaScience Lab, Leuven
- Intel and BSC Exascale Lab, Barcelona
Topics across the labs: performance and scalability of exascale applications, tools for performance characterization, exascale cluster scalability and reliability, life-science applications, architectural simulation, scalable kernels, scalable runtime systems and tools, and new algorithms.

EXASCIENCE LIFE LAB: HIGH PERFORMANCE COMPUTING IN LIFE SCIENCES
- WU0: Management, business models & infrastructure
- WU1: High performance biostatistics applications
- WU2: Platforms and flexible performance for DNA/RNA sequencing, visualization and analysis
- WU3: Data intensive parallel programming models
- WU4: Architectural simulation
(© IMEC 2013)

OVERVIEW
- Introduction: Intel Xeon Phi architecture
- Motivation: effects of per-core thread count on performance
- Dynamic threading: algorithm and implementation of automatic thread-count adaptation
- Results & conclusion

XEON PHI ARCHITECTURE
- Up to 4 SMT threads per core, ~60 cores
- Per-core private L1+L2 caches, kept coherent
- Ring-based interconnect
(Figure: ~60 cores, each with its private cache, connected by a ring to main memory)

HOW MANY SMT THREADS? IT VARIES!
- Core: at least two threads are needed to keep the pipeline fully occupied; more threads can overlap more latency
- Cache: L1-I/D, L2 and the TLBs are shared, so threads can evict each other's data
- Memory: more threads keep more requests outstanding, but there are no further gains once bandwidth is saturated
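
To make the trade-off concrete, the following minimal sketch (not from the talk) times one OpenMP section at 1 to 4 threads per core; the core count, array sizes and streaming kernel are illustrative assumptions.

```c
#include <omp.h>
#include <stdio.h>

#define NCORES 60                 /* physical cores on the coprocessor */
#define N (1 << 24)

static double a[N], b[N];

/* Stand-in kernel: a streaming update whose best thread count
 * depends on cache and bandwidth behavior. */
static void kernel(void) {
    #pragma omp for
    for (long i = 0; i < N; i++)
        a[i] = 2.0 * a[i] + b[i];
}

int main(void) {
    for (int tpc = 1; tpc <= 4; tpc++) {
        omp_set_num_threads(tpc * NCORES);
        double t0 = omp_get_wtime();
        #pragma omp parallel
        kernel();
        printf("%d thread(s)/core: %.3f s\n", tpc, omp_get_wtime() - t0);
    }
    return 0;
}
```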

THREAD COUNT ACROSS APPLICATIONS
(Chart: performance versus threads per core, per application. For some applications more threads is better; for others, more threads is worse.)

THREAD COUNT ACROSS INPUT SETS (BT)
- Small input set: cache-fitting at low thread counts, cache-thrashing at high thread counts
- Large input set: many threads are needed to overlap memory latency

THREAD COUNT ACROSS INPUT SETS Best vs. per-application setting based on B 9

CLUSTER-AWARE UNDERSUBSCRIBED SCHEDULING OF THREADS (CRUST)
- Moves the decision from the programmer to the runtime
  - Integrated into the OpenMP runtime library: no application changes
- Dynamic undersubscription
  - Automatically finds the best setting per workload and input set, during the application run
  - Exploits iterations / time steps in the workloads
  - Adapts to each #pragma omp parallel individually
  - Recognizes workload phase behavior
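
The kind of unmodified code CRUST targets looks like the sketch below (illustrative, not from the talk): the outer time-step loop re-enters the same parallel region every step, giving the runtime repeated occurrences of one section to calibrate against.

```c
#include <omp.h>

enum { N = 1 << 20, STEPS = 1000 };
static double grid[N];

static void compute_step(void) {
    #pragma omp parallel for       /* one section, tuned individually */
    for (long i = 0; i < N; i++)
        grid[i] = grid[i] * 0.99 + 1.0;
}

int main(void) {
    for (int t = 0; t < STEPS; t++)   /* time-step loop CRUST exploits */
        compute_step();               /* same section id every step */
    return 0;
}
```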

PER-SECTION TUNING
- Calibration phase, for each section occurrence:
  - Measure hardware performance counters: clock cycles, instruction count, L2 misses (more would have been nice, but KNC does not support it)
  - Use user-mode rdpmc for low overhead
  - Try all 4 thread counts in subsequent occurrences
  - Pick the best cycle count; the instruction count is not stable because idle threads spin
- Stable phase:
  - Keep watching the L2 miss rate and recalibrate when it changes: a shift signifies data-dependent behavior
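
A compact sketch of this calibration state machine (illustrative C, not the actual libcrust source; the factor-of-two recalibration band is an assumption):

```c
#include <stdint.h>

typedef struct {
    int      calibrating;   /* still sweeping thread counts? */
    int      next_tpc;      /* threads/core used for the current run */
    int      best_tpc;      /* best setting found so far */
    uint64_t best_cycles;
    uint64_t stable_l2;     /* baseline L2 misses in the stable phase */
} section_t;

#define SECTION_INIT { 1, 1, 1, UINT64_MAX, 0 }

/* Called at the end of one occurrence of a section; returns the
 * threads/core to run its next occurrence with. */
int crust_section_end(section_t *s, uint64_t cycles, uint64_t l2)
{
    if (s->calibrating) {
        /* rank by cycle count, not instruction count: idle SMT
         * threads spin in the runtime and inflate instructions */
        if (cycles < s->best_cycles) {
            s->best_cycles = cycles;
            s->best_tpc = s->next_tpc;
        }
        if (s->next_tpc == 4) {             /* all 4 settings tried */
            s->calibrating = 0;
            return s->best_tpc;
        }
        return ++s->next_tpc;
    }
    if (s->stable_l2 == 0) {                /* first stable occurrence */
        s->stable_l2 = l2;
    } else if (l2 > 2 * s->stable_l2 || 2 * l2 < s->stable_l2) {
        /* L2 miss rate shifted: data-dependent phase change */
        *s = (section_t)SECTION_INIT;       /* recalibrate from scratch */
        return 1;
    }
    return s->best_tpc;
}
```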

DYNAMIC AGGREGATION
- Large sections: data reuse inside a phase
- Small sections: data reuse across phases
  - No good to keep changing the thread count / data partitioning
- Dynamically aggregate small sections
  - Up to 50 million clock cycles (~50 ms)
  - The aggregate forms one new section id
  - Compared using IPC rather than runtime
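
A sketch of how such aggregation might look (the 50M-cycle budget is from the slide; the function names and the id-hash scheme are illustrative assumptions):

```c
#include <stdint.h>

#define AGG_BUDGET 50000000ULL        /* ~50M cycles, roughly 50 ms */

typedef struct {
    uint64_t cycles, instrs;
    uint64_t id;                      /* identity of the aggregate */
} aggregate_t;

static aggregate_t agg;

/* Accumulate one small section; returns 1 when the aggregate is
 * complete and its IPC can be fed to the tuner. IPC is used instead
 * of raw runtime because the aggregate's composition varies. */
int aggregate_add(uint64_t section_id, uint64_t cycles, uint64_t instrs,
                  double *ipc_out)
{
    agg.cycles += cycles;
    agg.instrs += instrs;
    agg.id = agg.id * 31 + section_id;    /* aggregate = new section id */
    if (agg.cycles < AGG_BUDGET)
        return 0;
    *ipc_out = (double)agg.instrs / (double)agg.cycles;
    agg.cycles = agg.instrs = agg.id = 0;
    return 1;
}
```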

EFFICIENT PMU COLLECTION
- Use the Linux perf interface for counter setup
- Reading counters through the read system call triggers lots of inter-processor interrupts: many millions of clock cycles of overhead
- Alternative: the user-mode rdpmc instruction reads the local counter
  - Have each OpenMP worker thread read its own counters and communicate the results through shared memory
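
A sketch of the per-thread rdpmc read (the instruction and its register interface are real; the counter index, thread count and shared array are illustrative; the counter must already have been programmed, e.g. via perf_event_open, and user-mode rdpmc access must be enabled):

```c
#include <stdint.h>

/* rdpmc returns the counter selected by ECX in EDX:EAX. */
static inline uint64_t read_pmc(uint32_t counter)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdpmc" : "=a"(lo), "=d"(hi) : "c"(counter));
    return ((uint64_t)hi << 32) | lo;
}

/* One slot per worker, padded to a cache line so that workers
 * publishing their counts do not false-share.
 * 244 = 61 cores x 4 threads on KNC (an assumption here). */
static struct { uint64_t value; char pad[56]; } pmc_slot[244];

void worker_read_own_counter(int tid, uint32_t counter)
{
    pmc_slot[tid].value = read_pmc(counter);   /* local read, no IPI */
}
```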

IMPLEMENTATION
- Based on the open-source Intel OpenMP library at http://www.openmprtl.org
- Added 4 hooks, recompiled into a new libiomp5.so (no application changes)
- The dynamic threading functionality lives in a separate libcrust.so
- Easily portable to other parallel runtimes (TBB, Cilk, ...)
Hook placement around a #pragma omp parallel region:
- Main thread, before the fork (kmpc_fork_call): CRUST-section-begin, which calls omp_set_num_threads
- Each worker, before the parallel code (kmp_run_before_invoked_task): CRUST-section-begin-thread, which reads rdpmc
- Each worker, after the parallel code (kmp_run_after_invoked_task): CRUST-section-end-thread, which reads rdpmc
- Main thread, after the join: CRUST-section-end, which collects the statistics
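
A sketch of how the four hooks could fit together (the hook names are from the slide; choose_threads, collect_statistics and the counter plumbing are illustrative placeholders, with read_pmc as in the previous sketch):

```c
#include <omp.h>
#include <stdint.h>

#define MAX_THREADS 244

extern uint64_t read_pmc(uint32_t counter);       /* previous sketch */
extern int  choose_threads(void *region_id);      /* tuner: next setting */
extern void collect_statistics(void *region_id,   /* tuner: consume counts */
                               const uint64_t *cycles, int nthreads);

static uint64_t start[MAX_THREADS], delta[MAX_THREADS];

void crust_section_begin(void *region_id)     /* main thread, before fork */
{
    omp_set_num_threads(choose_threads(region_id));
}

void crust_section_begin_thread(int tid)      /* each worker, pre-region */
{
    start[tid] = read_pmc(0);                 /* counter 0: illustrative */
}

void crust_section_end_thread(int tid)        /* each worker, post-region */
{
    delta[tid] = read_pmc(0) - start[tid];
}

void crust_section_end(void *region_id)       /* main thread, after join */
{
    collect_statistics(region_id, delta, omp_get_max_threads());
}
```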

RESULTS
- CRUST is able to find the best thread count for all applications
- These benchmarks are too simple to benefit from per-region tuning

THROUGH-TIME BEHAVIOR
(Charts: threads per core (top) and performance in cycles/section (bottom) through time; the calibration phase occupies the first half of the run, the stable phase the second half.)

COMPLEX APPLICATIONS (CoMD)
- Multiple distinct phases (chart shading):
  - Light: compute bound, high thread count
  - Dark: memory bound, low thread count
- 2% speedup over the best static setting

CONCLUSIONS
- The per-core thread count on Xeon Phi affects performance in multiple ways: core, cache and main-memory effects
- The best thread count varies across applications, input sets and workload phases
- Dynamically determining the best thread count:
  - Per application phase (#pragma omp parallel)
  - Aggregating small phases
  - Using performance counters for more insight
- Prototyped in the Intel OpenMP runtime library