Piecewise Holistic Autotuning of Compiler and Runtime Parameters

Size: px
Start display at page:

Download "Piecewise Holistic Autotuning of Compiler and Runtime Parameters"

Transcription

1 Piecewise Holistic Autotuning of Compiler and Runtime Parameters Mihail Popov, Chadi Akel, William Jalby, Pablo de Oliveira Castro University of Versailles Exascale Computing Research August 2016 C E R E

2 Context Architecture, system, and application complexities increase System provides default good enough parameter configurations Compiler optimizations: -O2, -O3 Thread affinity: scatter Outperforming default parameters leads to substantial benefits but is a costly process Execution driven studies test different configurations Applications have redundancies Executing an application is time consuming The search space is huge Studies reduce the exploration cost by smartly navigating through the search space 1 / 23

3 Piecewise Exploration Codelet Extractor and REplayer (CERE) decomposes applications into small pieces called Codelets Each codelet maps a loop or a parallel region and is a standalone executable Extract codelets once Replay codelets instead of applications with different configurations to avoid redundancies 2 / 23

4 IS Motivating Example int main() { create_seq() for(i=0;i<11;i++) rank() } IS benchmark IS create seq covers 40% of the execution time IS rank sorting algorithm performs 11 invocations with the same execution time Piecewise exploration benefits Avoid create seq execution Evaluate a single invocation of rank IS rank and create seq are not sensitive to the same optimizations 3 / 23

5 Outline Codelet Extractor and Replayer (CERE) Prediction Model Thread and Compiler Tuning 4 / 23

6 CERE Workflow Applications LLVM IR Region outlining Working set and cache capture Invocation & Codelet subsetting Region Capture Change: number of threads, affinity, runtime parameters Working sets memory dump Codelet Replay Fast performance prediction Warmup + Replay Generate codelets wrapper Retarget for: different architectures different optimizations CERE can extract codelets from: Hot Loops OpenMP non-nested parallel regions 5 / 23

7 Codelet Capture and Replay Codelets are extracted at the LLVM Intermediate Representation level The user can recompile each codelet and replay it while changing compile options, runtime parameters, or the target system Performance accurate replay requires to capture the cache state Semantically accurate replay requires to capture the memory 6 / 23

8 Memory Page Capture region to capture memory allocation a = malloc(256); memory access a[i]++; protect static and currently allocated process memory (/proc/self/maps) intercept memory allocation functions with LD_PRELOAD 1 allocate memory 2 protect memory and return to user program segmentation fault handler 1 dump accessed memory to disk 2 unlock accessed page and return to user program Capture access at page granularity: coarse but fast Small dump footprint: only touched pages are saved 7 / 23

9 Cache State Capture Cold Do not capture cache effects Working Set Warms all the working set during replay (Optimistic) Page Trace Before replay warms the last N pages accessed to restore a cache state close to the original 8 / 23

10 CERE Cache Warmup for (i=0; i < size; i++) a[i] += b[i]; array a[] pages { array b[] pages { memory FIFO (most recently unprotected) 22 pages addresses Reprotect 20 warmup page trace / 23

11 OpenMP Regions Support void main() { #pragma omp parallel { int p = omp_get_thread_num(); printf("%d",p); } } Clang OpenMP front end define { entry:... } define internal { entry: %p = alloca i32, align 4 %call = call store i32 %call, i32* %p, align 4 %1 = load i32* %p, align 4 } C code LLVM simplified IR Thread execution model 10 / 23

12 Selecting Representative Invocations A region can have thousand of invocations Performance differs due to different working sets Cluster to select representative invocations Cycles 8e+07 6e+07 4e+07 2e+07 0e invocation replay Figure: SPEC tonto make ft@shell2.f90:1133 execution trace. 90% of NAS codelets can be reduced to four or less representatives. 11 / 23

13 Performance Classes Across Parameters original trace replay megacycles O0 + 4 threads O3 + 2 threads invocation MG resid invocations execution time Use three invocations to predict the application execution time Parameters do not change the performance classes 12 / 23

14 NUMA Aware Warmup First touch policy: threads allocate the pages that they are the first to touch on their NUMA domain Detect the first thread that touches the memory pages During warmup the recorded NUMA-domains are restored Original Single Thread Warmup NUMA Warmup 1 NUMA domain (compact) 2 NUMA domains (scatter) 4e+10 Cycles 3e+10 2e+10 1e+10 0e thread number Figure: BT xsolve replay 13 / 23

15 Test Architectures and Applications NAS SER and NPB OpenMP 3.0 C version CLASS A Blackscholes from the PARSEC benchmarks Reverse Time Migration (RTM) proto-application Compiler LLVM 3.4 Sandy Bridge Ivy Bridge CPU E5 i Frequency (GHz) Sockets 2 1 Cores per socket 8 4 Threads per core 2 2 L1 cache (KB) L2 cache (KB) L3 cache (MB) 20 8 Ram (GB) Figure: Test architectures 14 / 23

16 Blackscholes Thread Affinities Exploration Different thread affinities to evaluate sn: n scatter threads cn: n compact threads without hyper threading hn: n compact threads with hyper threading Original Replay 3e+07 Cycles 2e+07 1e+07 0e+00 1 s2 c2 h2 s4 c4 h4 s8 c8 h8 s16 h16 h32 thread configuration Figure: PARSEC Blackscholes thread configurations search 15 / 23

17 Outperforming Default Thread Configuration original replay Speed up over standard (s16) hyperthread.h32 compact.c8 BT CG EP FT IS LU MG SP BT CG EP FT IS LU MG SP Figure: NAS thread configurations tuning 16 / 23

18 Autotuning LLVM Middle End Optimizations LLVM middle end offers more than 50 optimization passes Codelet replay enable per-region fast optimization tuning Cycles (Ivybridge 3.4GHz) 1.2e e e e+07 original replay O Id of LLVM middle end optimization passes combination Figure: SP ysolve codelet schedules of random passes combinations explored based on O3 passes. CERE 149 cheaper than running the full benchmark ( 27 cheaper when tuning codelets covering 75% of SP) 17 / 23

19 Hybridization Application Hotspots extraction into codelets O3 sse O2... avx Assign to each codelet the flag that gave the best performance avx sse sse Replay codelets with selected flags to test avx Flags selection Hybridization Application Compile files with their respective best flag Hybrid optimized binary Extract hotspots into new files 18 / 23

20 Hybrid Compilation over the NAS Four parallel regions of SP cover 93% of the execution time No single sequence is the best for all the regions Codelets explore parameters for each region separately Produce an hybrid where each region is compiled using its best sequence gigacycles compact hybrid O2 rhs zsolve xsolve+ysolve total 60 O3 rhs best z best hybrid O2 O3 rhs best z best hybrid O2 compiler optimizations O3 rhs best z best hybrid O2 O3 rhs best z best Figure: Hybrid compilation speeds up SP OpenMP / 23

21 Piecewise Exploration Benefits Speedup over O BT IS SP Benchmarks hybrid (original exploration) hybrid (replay exploration) monolithic best standard cost of piecewise exploration overhead of monolithic exploration zsolve ysolve xsolve BT IS SP zsolve rank ysolve xsolve fullverify createseq Compiler optimization sequences Figure: Piecewise exploration of the NAS SER 20 / 23

22 Codelets Tuning Results Compiler passes Thread affinity #Regions Accuracy Acceleration #Regions Accuracy Acceleration BT CG FT IS SP LU EP MG AVG NAS SER and OpenMP benchmarks average speedup of 1.08 Tuning a single codelet is 13 faster than full applications Codelet average accuracy is 94.6% RTM tuning through a codelet is 200 faster and achieves a speedup of / 23

23 Related Works Kulkarni et al. Improving Both the Performance Benefits and Speed of Optimization Phase Sequence Searches (ACM Sigplan Notices 2010) Fursin et al. Quick and practical run-time evaluation of multiple program optimizations (HiPEAC 2007) Fursin et al. Milepost gcc: Machine learning enabled self-tuning compiler (Int. J. Parallel Prog. 2011) Purini et al. Finding good optimization sequences covering program space (TACO 2013) 22 / 23

24 Conclusion Piecewise tuning with codelets Accelerate the exploration process Improve the benefits Discussion Some regions are not independent: LU jacu and jacld Piecewise tuning sensitivity to the data set Future Work Combine codelets tuning with GA Use a clustering approach over codelets Improve the parallel warmup strategy 23 / 23

Piecewise Holistic Autotuning of Compiler and Runtime Parameters

Piecewise Holistic Autotuning of Compiler and Runtime Parameters Piecewise Holistic Autotuning of Compiler and Runtime Parameters Mihail Popov 1, Chadi Akel 2, William Jalby 1, and Pablo de Oliveira Castro 1 1 Université de Versailles Saint-Quentin-en-Yvelines, Université

More information

PCERE: Fine-grained Parallel Benchmark Decomposition for Scalability Prediction

PCERE: Fine-grained Parallel Benchmark Decomposition for Scalability Prediction PCERE: Fine-grained Parallel Benchmark Decomposition for Scalability Prediction Mihail Popov, Chadi kel, Florent Conti, William Jalby, Pablo de Oliveira Castro UVSQ - PRiSM - ECR Mai 28, 2015 Introduction

More information

1 CERE: LLVM based Codelet Extractor and REplayer for Piecewise Benchmarking and Optimization

1 CERE: LLVM based Codelet Extractor and REplayer for Piecewise Benchmarking and Optimization 1 CERE: LLVM based Codelet Extractor and REplayer for Piecewise Benchmarking and Optimization PABLO DE OLIVEIRA CASTRO, Université de Versailles Saint-Quentin-en-Yvelines and Exascale Computing Research

More information

OS impact on performance

OS impact on performance PhD student CEA, DAM, DIF, F-91297, Arpajon, France Advisor : William Jalby CEA supervisor : Marc Pérache 1 Plan Remind goal of OS Reproducibility Conclusion 2 OS : between applications and hardware 3

More information

Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster

Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster G. Jost*, H. Jin*, D. an Mey**,F. Hatay*** *NASA Ames Research Center **Center for Computing and Communication, University of

More information

OpenMP on the FDSM software distributed shared memory. Hiroya Matsuba Yutaka Ishikawa

OpenMP on the FDSM software distributed shared memory. Hiroya Matsuba Yutaka Ishikawa OpenMP on the FDSM software distributed shared memory Hiroya Matsuba Yutaka Ishikawa 1 2 Software DSM OpenMP programs usually run on the shared memory computers OpenMP programs work on the distributed

More information

Introduction Contech s Task Graph Representation Parallel Program Instrumentation (Break) Analysis and Usage of a Contech Task Graph Hands-on

Introduction Contech s Task Graph Representation Parallel Program Instrumentation (Break) Analysis and Usage of a Contech Task Graph Hands-on Introduction Contech s Task Graph Representation Parallel Program Instrumentation (Break) Analysis and Usage of a Contech Task Graph Hands-on Exercises 2 Contech is An LLVM compiler pass to instrument

More information

IMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM

IMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM IMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM I5 AND I7 PROCESSORS Juan M. Cebrián 1 Lasse Natvig 1 Jan Christian Meyer 2 1 Depart. of Computer and Information

More information

NUMA-aware OpenMP Programming

NUMA-aware OpenMP Programming NUMA-aware OpenMP Programming Dirk Schmidl IT Center, RWTH Aachen University Member of the HPC Group schmidl@itc.rwth-aachen.de Christian Terboven IT Center, RWTH Aachen University Deputy lead of the HPC

More information

Improving Virtual Machine Scheduling in NUMA Multicore Systems

Improving Virtual Machine Scheduling in NUMA Multicore Systems Improving Virtual Machine Scheduling in NUMA Multicore Systems Jia Rao, Xiaobo Zhou University of Colorado, Colorado Springs Kun Wang, Cheng-Zhong Xu Wayne State University http://cs.uccs.edu/~jrao/ Multicore

More information

Adaptive Power Profiling for Many-Core HPC Architectures

Adaptive Power Profiling for Many-Core HPC Architectures Adaptive Power Profiling for Many-Core HPC Architectures Jaimie Kelley, Christopher Stewart The Ohio State University Devesh Tiwari, Saurabh Gupta Oak Ridge National Laboratory State-of-the-Art Schedulers

More information

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads... OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.

More information

Dealing with Heterogeneous Multicores

Dealing with Heterogeneous Multicores Dealing with Heterogeneous Multicores François Bodin INRIA-UIUC, June 12 th, 2009 Introduction Main stream applications will rely on new multicore / manycore architectures It is about performance not parallelism

More information

On the scalability of tracing mechanisms 1

On the scalability of tracing mechanisms 1 On the scalability of tracing mechanisms 1 Felix Freitag, Jordi Caubet, Jesus Labarta Departament d Arquitectura de Computadors (DAC) European Center for Parallelism of Barcelona (CEPBA) Universitat Politècnica

More information

Performance Analysis and Optimization MAQAO Tool

Performance Analysis and Optimization MAQAO Tool Performance Analysis and Optimization MAQAO Tool Andrés S. CHARIF-RUBIAL Emmanuel OSERET {achar,emmanuel.oseret}@exascale-computing.eu Exascale Computing Research 11th VI-HPS Tuning Workshop MAQAO Tool

More information

Multigrain Parallelism: Bridging Coarse- Grain Parallel Languages and Fine-Grain Event-Driven Multithreading

Multigrain Parallelism: Bridging Coarse- Grain Parallel Languages and Fine-Grain Event-Driven Multithreading Department of Electrical and Computer Engineering Computer Architecture and Parallel Systems Laboratory - CAPSL Multigrain Parallelism: Bridging Coarse- Grain Parallel Languages and Fine-Grain Event-Driven

More information

Runtime Address Space Computation for SDSM Systems

Runtime Address Space Computation for SDSM Systems Runtime Address Space Computation for SDSM Systems Jairo Balart Outline Introduction Inspector/executor model Implementation Evaluation Conclusions & future work 2 Outline Introduction Inspector/executor

More information

Computing architectures Part 2 TMA4280 Introduction to Supercomputing

Computing architectures Part 2 TMA4280 Introduction to Supercomputing Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:

More information

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads... OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.

More information

Runtime Support for Scalable Task-parallel Programs

Runtime Support for Scalable Task-parallel Programs Runtime Support for Scalable Task-parallel Programs Pacific Northwest National Lab xsig workshop May 2018 http://hpc.pnl.gov/people/sriram/ Single Program Multiple Data int main () {... } 2 Task Parallelism

More information

A Case for High Performance Computing with Virtual Machines

A Case for High Performance Computing with Virtual Machines A Case for High Performance Computing with Virtual Machines Wei Huang*, Jiuxing Liu +, Bulent Abali +, and Dhabaleswar K. Panda* *The Ohio State University +IBM T. J. Waston Research Center Presentation

More information

OpenMP at Sun. EWOMP 2000, Edinburgh September 14-15, 2000 Larry Meadows Sun Microsystems

OpenMP at Sun. EWOMP 2000, Edinburgh September 14-15, 2000 Larry Meadows Sun Microsystems OpenMP at Sun EWOMP 2000, Edinburgh September 14-15, 2000 Larry Meadows Sun Microsystems Outline Sun and Parallelism Implementation Compiler Runtime Performance Analyzer Collection of data Data analysis

More information

Fujitsu s Approach to Application Centric Petascale Computing

Fujitsu s Approach to Application Centric Petascale Computing Fujitsu s Approach to Application Centric Petascale Computing 2 nd Nov. 2010 Motoi Okuda Fujitsu Ltd. Agenda Japanese Next-Generation Supercomputer, K Computer Project Overview Design Targets System Overview

More information

Progress Report on QDP-JIT

Progress Report on QDP-JIT Progress Report on QDP-JIT F. T. Winter Thomas Jefferson National Accelerator Facility USQCD Software Meeting 14 April 16-17, 14 at Jefferson Lab F. Winter (Jefferson Lab) QDP-JIT USQCD-Software 14 1 /

More information

Towards a Holistic Approach to Auto-Parallelization

Towards a Holistic Approach to Auto-Parallelization Towards a Holistic Approach to Auto-Parallelization Integrating Profile-Driven Parallelism Detection and Machine-Learning Based Mapping Georgios Tournavitis, Zheng Wang, Björn Franke and Michael F.P. O

More information

Potentials and Limitations for Energy Efficiency Auto-Tuning

Potentials and Limitations for Energy Efficiency Auto-Tuning Center for Information Services and High Performance Computing (ZIH) Potentials and Limitations for Energy Efficiency Auto-Tuning Parco Symposium Application Autotuning for HPC (Architectures) Robert Schöne

More information

Programming Models for Multi- Threading. Brian Marshall, Advanced Research Computing

Programming Models for Multi- Threading. Brian Marshall, Advanced Research Computing Programming Models for Multi- Threading Brian Marshall, Advanced Research Computing Why Do Parallel Computing? Limits of single CPU computing performance available memory I/O rates Parallel computing allows

More information

Jackson Marusarz Intel Corporation

Jackson Marusarz Intel Corporation Jackson Marusarz Intel Corporation Intel VTune Amplifier Quick Introduction Get the Data You Need Hotspot (Statistical call tree), Call counts (Statistical) Thread Profiling Concurrency and Lock & Waits

More information

Kevin O Leary, Intel Technical Consulting Engineer

Kevin O Leary, Intel Technical Consulting Engineer Kevin O Leary, Intel Technical Consulting Engineer Moore s Law Is Going Strong Hardware performance continues to grow exponentially We think we can continue Moore's Law for at least another 10 years."

More information

Getting Performance from OpenMP Programs on NUMA Architectures

Getting Performance from OpenMP Programs on NUMA Architectures Getting Performance from OpenMP Programs on NUMA Architectures Christian Terboven, RWTH Aachen University terboven@itc.rwth-aachen.de EU H2020 Centre of Excellence (CoE) 1 October 2015 31 March 2018 Grant

More information

Compiling for HSA accelerators with GCC

Compiling for HSA accelerators with GCC Compiling for HSA accelerators with GCC Martin Jambor SUSE Labs 8th August 2015 Outline HSA branch: svn://gcc.gnu.org/svn/gcc/branches/hsa Table of contents: Very Brief Overview of HSA Generating HSAIL

More information

ARE WE OPTIMIZING HARDWARE FOR

ARE WE OPTIMIZING HARDWARE FOR ARE WE OPTIMIZING HARDWARE FOR NON-OPTIMIZED APPLICATIONS? PARSEC S VECTORIZATION EFFECTS ON ENERGY EFFICIENCY AND ARCHITECTURAL REQUIREMENTS Juan M. Cebrián 1 1 Depart. of Computer and Information Science

More information

Predictive Runtime Code Scheduling for Heterogeneous Architectures

Predictive Runtime Code Scheduling for Heterogeneous Architectures Predictive Runtime Code Scheduling for Heterogeneous Architectures Víctor Jiménez, Lluís Vilanova, Isaac Gelado Marisa Gil, Grigori Fursin, Nacho Navarro HiPEAC 2009 January, 26th, 2009 1 Outline Motivation

More information

JURECA Tuning for the platform

JURECA Tuning for the platform JURECA Tuning for the platform Usage of ParaStation MPI 2017-11-23 Outline ParaStation MPI Compiling your program Running your program Tuning parameters Resources 2 ParaStation MPI Based on MPICH (3.2)

More information

Exploring the Throughput-Fairness Trade-off on Asymmetric Multicore Systems

Exploring the Throughput-Fairness Trade-off on Asymmetric Multicore Systems Exploring the Throughput-Fairness Trade-off on Asymmetric Multicore Systems J.C. Sáez, A. Pousa, F. Castro, D. Chaver y M. Prieto Complutense University of Madrid, Universidad Nacional de la Plata-LIDI

More information

Multicore Cache Coherence Control by a Parallelizing Compiler

Multicore Cache Coherence Control by a Parallelizing Compiler Multicore Cache Coherence Control by a Parallelizing Compiler Hironori Kasahara, Boma A. Adhi, Yohei Kishimoto, Keiji Kimura, Yuhei Hosokawa Masayoshi Mase Department of Computer Science and Engineering

More information

Intel Cluster Toolkit Compiler Edition 3.2 for Linux* or Windows HPC Server 2008*

Intel Cluster Toolkit Compiler Edition 3.2 for Linux* or Windows HPC Server 2008* Intel Cluster Toolkit Compiler Edition. for Linux* or Windows HPC Server 8* Product Overview High-performance scaling to thousands of processors. Performance leadership Intel software development products

More information

COMP4510 Introduction to Parallel Computation. Shared Memory and OpenMP. Outline (cont d) Shared Memory and OpenMP

COMP4510 Introduction to Parallel Computation. Shared Memory and OpenMP. Outline (cont d) Shared Memory and OpenMP COMP4510 Introduction to Parallel Computation Shared Memory and OpenMP Thanks to Jon Aronsson (UofM HPC consultant) for some of the material in these notes. Outline (cont d) Shared Memory and OpenMP Including

More information

Performance Profiling

Performance Profiling Performance Profiling Minsoo Ryu Real-Time Computing and Communications Lab. Hanyang University msryu@hanyang.ac.kr Outline History Understanding Profiling Understanding Performance Understanding Performance

More information

arxiv: v2 [cs.dc] 2 May 2017

arxiv: v2 [cs.dc] 2 May 2017 Unimem: Runtime Data Management on Non-Volatile Memory-based Heterogeneous Main Memory Kai Wu University of California, Merced kwu42@ucmerced.edu Yingchao Huang University of California, Merced yhuang46@ucmerced.edu

More information

Performance Estimation of High Performance Computing Systems with Energy Efficient Ethernet Technology

Performance Estimation of High Performance Computing Systems with Energy Efficient Ethernet Technology Performance Estimation of High Performance Computing Systems with Energy Efficient Ethernet Technology Shinobu Miwa Sho Aita Hiroshi Nakamura The University of Tokyo {miwa, aita, nakamura}@hal.ipc.i.u-tokyo.ac.jp

More information

Separating Access Control Policy, Enforcement, and Functionality in Extensible Systems. Robert Grimm University of Washington

Separating Access Control Policy, Enforcement, and Functionality in Extensible Systems. Robert Grimm University of Washington Separating Access Control Policy, Enforcement, and Functionality in Extensible Systems Robert Grimm University of Washington Extensions Added to running system Interact through low-latency interfaces Form

More information

arxiv: v2 [cs.dc] 2 May 2017

arxiv: v2 [cs.dc] 2 May 2017 High Performance Data Persistence in Non-Volatile Memory for Resilient High Performance Computing Yingchao Huang University of California, Merced yhuang46@ucmerced.edu Kai Wu University of California,

More information

Multicore Performance and Tools. Part 1: Topology, affinity, clock speed

Multicore Performance and Tools. Part 1: Topology, affinity, clock speed Multicore Performance and Tools Part 1: Topology, affinity, clock speed Tools for Node-level Performance Engineering Gather Node Information hwloc, likwid-topology, likwid-powermeter Affinity control and

More information

ECE 574 Cluster Computing Lecture 10

ECE 574 Cluster Computing Lecture 10 ECE 574 Cluster Computing Lecture 10 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 1 October 2015 Announcements Homework #4 will be posted eventually 1 HW#4 Notes How granular

More information

Operating System Support for Shared-ISA Asymmetric Multi-core Architectures

Operating System Support for Shared-ISA Asymmetric Multi-core Architectures Operating System Support for Shared-ISA Asymmetric Multi-core Architectures Tong Li, Paul Brett, Barbara Hohlt, Rob Knauerhase, Sean McElderry, Scott Hahn Intel Corporation Contact: tong.n.li@intel.com

More information

MPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016

MPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016 MPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016 Message passing vs. Shared memory Client Client Client Client send(msg) recv(msg) send(msg) recv(msg) MSG MSG MSG IPC Shared

More information

Runtime Function Instrumentation with EZTrace

Runtime Function Instrumentation with EZTrace Runtime Function Instrumentation with EZTrace Charles Aulagnon, Damien Martin-Guillerez, François Rué and François Trahay 5 th Workshop on Productivity and Performance (PROPER 2012) INTRODUCTION Modern

More information

OpenMP Device Offloading to FPGA Accelerators. Lukas Sommer, Jens Korinth, Andreas Koch

OpenMP Device Offloading to FPGA Accelerators. Lukas Sommer, Jens Korinth, Andreas Koch OpenMP Device Offloading to FPGA Accelerators Lukas Sommer, Jens Korinth, Andreas Koch Motivation Increasing use of heterogeneous systems to overcome CPU power limitations 2017-07-12 OpenMP FPGA Device

More information

AutoTune Workshop. Michael Gerndt Technische Universität München

AutoTune Workshop. Michael Gerndt Technische Universität München AutoTune Workshop Michael Gerndt Technische Universität München AutoTune Project Automatic Online Tuning of HPC Applications High PERFORMANCE Computing HPC application developers Compute centers: Energy

More information

Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters

Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters Reducing Network Contention with Mixed Workloads on Modern Multicore Clusters Matthew Koop 1 Miao Luo D. K. Panda matthew.koop@nasa.gov {luom, panda}@cse.ohio-state.edu 1 NASA Center for Computational

More information

Cartoon parallel architectures; CPUs and GPUs

Cartoon parallel architectures; CPUs and GPUs Cartoon parallel architectures; CPUs and GPUs CSE 6230, Fall 2014 Th Sep 11! Thanks to Jee Choi (a senior PhD student) for a big assist 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ~ socket 14 ~ core 14 ~ HWMT+SIMD

More information

Parallel Programming and Optimization with GCC. Diego Novillo

Parallel Programming and Optimization with GCC. Diego Novillo Parallel Programming and Optimization with GCC Diego Novillo dnovillo@google.com Outline Parallelism models Architectural overview Parallelism features in GCC Optimizing large programs Whole program mode

More information

What s in this talk? Quick Introduction. Programming in Parallel

What s in this talk? Quick Introduction. Programming in Parallel What s in this talk? Parallel programming methodologies - why MPI? Where can I use MPI? MPI in action Getting MPI to work at Warwick Examples MPI: Parallel Programming for Extreme Machines Si Hammond,

More information

Agenda. Optimization Notice Copyright 2017, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.

Agenda. Optimization Notice Copyright 2017, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Agenda VTune Amplifier XE OpenMP* Analysis: answering on customers questions about performance in the same language a program was written in Concepts, metrics and technology inside VTune Amplifier XE OpenMP

More information

A Characterization of Shared Data Access Patterns in UPC Programs

A Characterization of Shared Data Access Patterns in UPC Programs IBM T.J. Watson Research Center A Characterization of Shared Data Access Patterns in UPC Programs Christopher Barton, Calin Cascaval, Jose Nelson Amaral LCPC `06 November 2, 2006 Outline Motivation Overview

More information

Approaches to Performance Evaluation On Shared Memory and Cluster Architectures

Approaches to Performance Evaluation On Shared Memory and Cluster Architectures Approaches to Performance Evaluation On Shared Memory and Cluster Architectures Peter Strazdins (and the CC-NUMA Team), CC-NUMA Project, Department of Computer Science, The Australian National University

More information

Compilers and Compiler-based Tools for HPC

Compilers and Compiler-based Tools for HPC Compilers and Compiler-based Tools for HPC John Mellor-Crummey Department of Computer Science Rice University http://lacsi.rice.edu/review/2004/slides/compilers-tools.pdf High Performance Computing Algorithms

More information

Slurm Configuration Impact on Benchmarking

Slurm Configuration Impact on Benchmarking Slurm Configuration Impact on Benchmarking José A. Moríñigo, Manuel Rodríguez-Pascual, Rafael Mayo-García CIEMAT - Dept. Technology Avda. Complutense 40, Madrid 28040, SPAIN Slurm User Group Meeting 16

More information

Warming up Storage-level Caches with Bonfire

Warming up Storage-level Caches with Bonfire Warming up Storage-level Caches with Bonfire Yiying Zhang Gokul Soundararajan Mark W. Storer Lakshmi N. Bairavasundaram Sethuraman Subbiah Andrea C. Arpaci-Dusseau Remzi H. Arpaci-Dusseau 2 Does on-demand

More information

Addressing Heterogeneity in Manycore Applications

Addressing Heterogeneity in Manycore Applications Addressing Heterogeneity in Manycore Applications RTM Simulation Use Case stephane.bihan@caps-entreprise.com Oil&Gas HPC Workshop Rice University, Houston, March 2008 www.caps-entreprise.com Introduction

More information

Décomposition automatique des programmes parallèles pour l optimisation et la prédiction de performance.

Décomposition automatique des programmes parallèles pour l optimisation et la prédiction de performance. Décomposition automatique des programmes parallèles pour l optimisation et la prédiction de performance. Mihail Popov To cite this version: Mihail Popov. Décomposition automatique des programmes parallèles

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms http://sudalab.is.s.u-tokyo.ac.jp/~reiji/pna16/ [ 9 ] Shared Memory Performance Parallel Numerical Algorithms / IST / UTokyo 1 PNA16 Lecture Plan General Topics 1. Architecture

More information

Performance Evaluation of OpenMP Applications on Virtualized Multicore Machines

Performance Evaluation of OpenMP Applications on Virtualized Multicore Machines Performance Evaluation of OpenMP Applications on Virtualized Multicore Machines Jie Tao 1 Karl Fuerlinger 2 Holger Marten 1 jie.tao@kit.edu karl.fuerlinger@nm.ifi.lmu.de holger.marten@kit.edu 1 : Steinbuch

More information

KernelGen a toolchain for automatic GPU-centric applications porting. Nicolas Lihogrud Dmitry Mikushin Andrew Adinets

KernelGen a toolchain for automatic GPU-centric applications porting. Nicolas Lihogrud Dmitry Mikushin Andrew Adinets P A R A L L E L C O M P U T A T I O N A L T E C H N O L O G I E S ' 2 0 1 2 KernelGen a toolchain for automatic GPU-centric applications porting Nicolas Lihogrud Dmitry Mikushin Andrew Adinets Contents

More information

A Trace-Scaling Agent for Parallel Application Tracing 1

A Trace-Scaling Agent for Parallel Application Tracing 1 A Trace-Scaling Agent for Parallel Application Tracing 1 Felix Freitag, Jordi Caubet, Jesus Labarta Computer Architecture Department (DAC) European Center for Parallelism of Barcelona (CEPBA) Universitat

More information

Specializing Code for FFT Libraries. Minhaj Ahmad Khan Henri Pierre Charles University of Versailles Saint-Quentin-en-Yvelines, France

Specializing Code for FFT Libraries. Minhaj Ahmad Khan Henri Pierre Charles University of Versailles Saint-Quentin-en-Yvelines, France Specializing Code for FFT Libraries Minhaj Ahmad Khan Henri Pierre Charles University of Versailles Saint-Quentin-en-Yvelines, France Outline Specialization Issues for FFT Libraries Limited Code Specialization

More information

MPI: Parallel Programming for Extreme Machines. Si Hammond, High Performance Systems Group

MPI: Parallel Programming for Extreme Machines. Si Hammond, High Performance Systems Group MPI: Parallel Programming for Extreme Machines Si Hammond, High Performance Systems Group Quick Introduction Si Hammond, (sdh@dcs.warwick.ac.uk) WPRF/PhD Research student, High Performance Systems Group,

More information

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA Particle-in-Cell Simulations on Modern Computing Platforms Viktor K. Decyk and Tajendra V. Singh UCLA Outline of Presentation Abstraction of future computer hardware PIC on GPUs OpenCL and Cuda Fortran

More information

Communication Characteristics in the NAS Parallel Benchmarks

Communication Characteristics in the NAS Parallel Benchmarks Communication Characteristics in the NAS Parallel Benchmarks Ahmad Faraj Xin Yuan Department of Computer Science, Florida State University, Tallahassee, FL 32306 {faraj, xyuan}@cs.fsu.edu Abstract In this

More information

Benchmark results on Knight Landing (KNL) architecture

Benchmark results on Knight Landing (KNL) architecture Benchmark results on Knight Landing (KNL) architecture Domenico Guida, CINECA SCAI (Bologna) Giorgio Amati, CINECA SCAI (Roma) Roma 23/10/2017 KNL, BDW, SKL A1 BDW A2 KNL A3 SKL cores per node 2 x 18 @2.3

More information

AAlign: A SIMD Framework for Pairwise Sequence Alignment on x86-based Multi- and Many-core Processors

AAlign: A SIMD Framework for Pairwise Sequence Alignment on x86-based Multi- and Many-core Processors AAlign: A SIMD Framework for Pairwise Sequence Alignment on x86-based Multi- and Many-core Processors Kaixi Hou, Hao Wang, Wu-chun Feng {kaixihou,hwang121,wfeng}@vt.edu Pairwise Sequence Alignment Algorithms

More information

Cross-Layer Memory Management to Reduce DRAM Power Consumption

Cross-Layer Memory Management to Reduce DRAM Power Consumption Cross-Layer Memory Management to Reduce DRAM Power Consumption Michael Jantz Assistant Professor University of Tennessee, Knoxville 1 Introduction Assistant Professor at UT since August 2014 Before UT

More information

Kampala August, Agner Fog

Kampala August, Agner Fog Advanced microprocessor optimization Kampala August, 2007 Agner Fog www.agner.org Agenda Intel and AMD microprocessors Out Of Order execution Branch prediction Platform, 32 or 64 bits Choice of compiler

More information

ADAPTIVE TASK SCHEDULING USING LOW-LEVEL RUNTIME APIs AND MACHINE LEARNING

ADAPTIVE TASK SCHEDULING USING LOW-LEVEL RUNTIME APIs AND MACHINE LEARNING ADAPTIVE TASK SCHEDULING USING LOW-LEVEL RUNTIME APIs AND MACHINE LEARNING Keynote, ADVCOMP 2017 November, 2017, Barcelona, Spain Prepared by: Ahmad Qawasmeh Assistant Professor The Hashemite University,

More information

The Performance Analysis of Portable Parallel Programming Interface MpC for SDSM and pthread

The Performance Analysis of Portable Parallel Programming Interface MpC for SDSM and pthread The Performance Analysis of Portable Parallel Programming Interface MpC for SDSM and pthread Workshop DSM2005 CCGrig2005 Seikei University Tokyo, Japan midori@st.seikei.ac.jp Hiroko Midorikawa 1 Outline

More information

OmpCloud: Bridging the Gap between OpenMP and Cloud Computing

OmpCloud: Bridging the Gap between OpenMP and Cloud Computing OmpCloud: Bridging the Gap between OpenMP and Cloud Computing Hervé Yviquel, Marcio Pereira and Guido Araújo University of Campinas (UNICAMP), Brazil A bit of background qguido Araujo, PhD Princeton University

More information

Parallel Applications on Distributed Memory Systems. Le Yan HPC User LSU

Parallel Applications on Distributed Memory Systems. Le Yan HPC User LSU Parallel Applications on Distributed Memory Systems Le Yan HPC User Services @ LSU Outline Distributed memory systems Message Passing Interface (MPI) Parallel applications 6/3/2015 LONI Parallel Programming

More information

Bei Wang, Dmitry Prohorov and Carlos Rosales

Bei Wang, Dmitry Prohorov and Carlos Rosales Bei Wang, Dmitry Prohorov and Carlos Rosales Aspects of Application Performance What are the Aspects of Performance Intel Hardware Features Omni-Path Architecture MCDRAM 3D XPoint Many-core Xeon Phi AVX-512

More information

Turbo Boost Up, AVX Clock Down: Complications for Scaling Tests

Turbo Boost Up, AVX Clock Down: Complications for Scaling Tests Turbo Boost Up, AVX Clock Down: Complications for Scaling Tests Steve Lantz 12/8/2017 1 What Is CPU Turbo? (Sandy Bridge) = nominal frequency http://www.hotchips.org/wp-content/uploads/hc_archives/hc23/hc23.19.9-desktop-cpus/hc23.19.921.sandybridge_power_10-rotem-intel.pdf

More information

Compiling for Scalable Computing Systems the Merit of SIMD. Ayal Zaks Intel Corporation Acknowledgements: too many to list

Compiling for Scalable Computing Systems the Merit of SIMD. Ayal Zaks Intel Corporation Acknowledgements: too many to list Compiling for Scalable Computing Systems the Merit of SIMD Ayal Zaks Intel Corporation Acknowledgements: too many to list Takeaways 1. SIMD is mainstream and ubiquitous in HW 2. Compiler support for SIMD

More information

Computational Interdisciplinary Modelling High Performance Parallel & Distributed Computing Our Research

Computational Interdisciplinary Modelling High Performance Parallel & Distributed Computing Our Research Insieme Insieme-an Optimization System for OpenMP, MPI and OpenCLPrograms Institute of Computer Science University of Innsbruck Thomas Fahringer, Ivan Grasso, Klaus Kofler, Herbert Jordan, Hans Moritsch,

More information

davidklee.net gplus.to/kleegeek linked.com/a/davidaklee

davidklee.net gplus.to/kleegeek linked.com/a/davidaklee @kleegeek davidklee.net gplus.to/kleegeek linked.com/a/davidaklee Specialties / Focus Areas / Passions: Performance Tuning & Troubleshooting Virtualization Cloud Enablement Infrastructure Architecture

More information

Benchmarking computers for seismic processing and imaging

Benchmarking computers for seismic processing and imaging Benchmarking computers for seismic processing and imaging Evgeny Kurin ekurin@geo-lab.ru Outline O&G HPC status and trends Benchmarking: goals and tools GeoBenchmark: modules vs. subsystems Basic tests

More information

Hybrid Cache Architecture (HCA) with Disparate Memory Technologies

Hybrid Cache Architecture (HCA) with Disparate Memory Technologies Hybrid Cache Architecture (HCA) with Disparate Memory Technologies Xiaoxia Wu, Jian Li, Lixin Zhang, Evan Speight, Ram Rajamony, Yuan Xie Pennsylvania State University IBM Austin Research Laboratory Acknowledgement:

More information

Distributed recovery for senddeterministic. Tatiana V. Martsinkevich, Thomas Ropars, Amina Guermouche, Franck Cappello

Distributed recovery for senddeterministic. Tatiana V. Martsinkevich, Thomas Ropars, Amina Guermouche, Franck Cappello Distributed recovery for senddeterministic HPC applications Tatiana V. Martsinkevich, Thomas Ropars, Amina Guermouche, Franck Cappello 1 Fault-tolerance in HPC applications Number of cores on one CPU and

More information

CSE 262 Lecture 13. Communication overlap Continued Heterogeneous processing

CSE 262 Lecture 13. Communication overlap Continued Heterogeneous processing CSE 262 Lecture 13 Communication overlap Continued Heterogeneous processing Final presentations Announcements Friday March 13 th, 10:00 AM to 1:00PM Note time change. Room 3217, CSE Building (EBU3B) Scott

More information

Parallel Computing. November 20, W.Homberg

Parallel Computing. November 20, W.Homberg Mitglied der Helmholtz-Gemeinschaft Parallel Computing November 20, 2017 W.Homberg Why go parallel? Problem too large for single node Job requires more memory Shorter time to solution essential Better

More information

Shoal: smart allocation and replication of memory for parallel programs

Shoal: smart allocation and replication of memory for parallel programs Shoal: smart allocation and replication of memory for parallel programs Stefan Kaestle, Reto Achermann, Timothy Roscoe, Tim Harris $ ETH Zurich $ Oracle Labs Cambridge, UK Problem Congestion on interconnect

More information

Affine Loop Optimization using Modulo Unrolling in CHAPEL

Affine Loop Optimization using Modulo Unrolling in CHAPEL Affine Loop Optimization using Modulo Unrolling in CHAPEL Aroon Sharma, Joshua Koehler, Rajeev Barua LTS POC: Michael Ferguson 2 Overall Goal Improve the runtime of certain types of parallel computers

More information

Hierarchical PLABs, CLABs, TLABs in Hotspot

Hierarchical PLABs, CLABs, TLABs in Hotspot Hierarchical s, CLABs, s in Hotspot Christoph M. Kirsch ck@cs.uni-salzburg.at Hannes Payer hpayer@cs.uni-salzburg.at Harald Röck hroeck@cs.uni-salzburg.at Abstract Thread-local allocation buffers (s) are

More information

How to scale Nested OpenMP Applications on the ScaleMP vsmp Architecture

How to scale Nested OpenMP Applications on the ScaleMP vsmp Architecture How to scale Nested OpenMP Applications on the ScaleMP vsmp Architecture Dirk Schmidl, Christian Terboven, Andreas Wolf, Dieter an Mey, Christian Bischof IEEE Cluster 2010 / Heraklion September 21, 2010

More information

DISP: Optimizations Towards Scalable MPI Startup

DISP: Optimizations Towards Scalable MPI Startup DISP: Optimizations Towards Scalable MPI Startup Huansong Fu, Swaroop Pophale*, Manjunath Gorentla Venkata*, Weikuan Yu Florida State University *Oak Ridge National Laboratory Outline Background and motivation

More information

PRACE Autumn School Basic Programming Models

PRACE Autumn School Basic Programming Models PRACE Autumn School 2010 Basic Programming Models Basic Programming Models - Outline Introduction Key concepts Architectures Programming models Programming languages Compilers Operating system & libraries

More information

xsim The Extreme-Scale Simulator

xsim The Extreme-Scale Simulator www.bsc.es xsim The Extreme-Scale Simulator Janko Strassburg Severo Ochoa Seminar @ BSC, 28 Feb 2014 Motivation Future exascale systems are predicted to have hundreds of thousands of nodes, thousands of

More information

Performance of deal.ii on a node

Performance of deal.ii on a node Performance of deal.ii on a node Bruno Turcksin Texas A&M University, Dept. of Mathematics Bruno Turcksin Deal.II on a node 1/37 Outline 1 Introduction 2 Architecture 3 Paralution 4 Other Libraries 5 Conclusions

More information

Tools for Intel Xeon Phi: VTune & Advisor Dr. Fabio Baruffa - LRZ,

Tools for Intel Xeon Phi: VTune & Advisor Dr. Fabio Baruffa - LRZ, Tools for Intel Xeon Phi: VTune & Advisor Dr. Fabio Baruffa - fabio.baruffa@lrz.de LRZ, 27.6.- 29.6.2016 Architecture Overview Intel Xeon Processor Intel Xeon Phi Coprocessor, 1st generation Intel Xeon

More information

Lecture 5. Performance programming for stencil methods Vectorization Computing with GPUs

Lecture 5. Performance programming for stencil methods Vectorization Computing with GPUs Lecture 5 Performance programming for stencil methods Vectorization Computing with GPUs Announcements Forge accounts: set up ssh public key, tcsh Turnin was enabled for Programming Lab #1: due at 9pm today,

More information

XPU A Programmable FPGA Accelerator for Diverse Workloads

XPU A Programmable FPGA Accelerator for Diverse Workloads XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for

More information

A Dynamic Optimization Framework for OpenMP

A Dynamic Optimization Framework for OpenMP A Dynamic Optimization Framework for OpenMP Besar Wicaksono, Ramachandra C. Nanjegowda, and Barbara Chapman University of Houston, Computer Science Dept, Houston, Texas http://www2.cs.uh.edu/~hpctools

More information