Performance Evaluation of OpenMP Applications on Virtualized Multicore Machines

1 Performance Evaluation of OpenMP Applications on Virtualized Multicore Machines
Jie Tao (1), Karl Fuerlinger (2), Holger Marten (1)
jie.tao@kit.edu, karl.fuerlinger@nm.ifi.lmu.de, holger.marten@kit.edu
(1) Steinbuch Center for Computing, Karlsruhe Institute of Technology (KIT), Germany
(2) MNM-Team, Department of Computer Science, LMU München, Germany

2 Outline
- Introduction
  - Virtualization and the impact on performance
- Experimental Setup
  - NAS parallel benchmarks, SPEC OpenMP, microbenchmarks
- Study of SP (NAS Parallel Benchmarks)
  - Initial performance
  - Analysis using ompp
  - Optimization results and microbenchmark study
- Conclusions

3 Virtualization
Running multiple OSs on the same hardware
[Figure: a conventional stack (Application, Operating System, Hardware) next to a virtualized host machine on which VM 1 to VM 4, each with its own guest OS, run on top of a hypervisor]
Concepts
- Hypervisor (Xen, KVM, VMware)
- Full virtualization vs. para-virtualization
Adopted for
- Server consolidation
- Cloud computing: on-demand resource provisioning
Performance impact

4 Performance Impact of Virtualization
- Has been studied before, e.g., Keith Jackson et al., "Performance of HPC Applications on the Amazon Web Services Cloud"
- Here: the performance impact of virtualization on OpenMP applications

5 Experimental Setup
Benchmarks
- NAS OpenMP (size A)
- SPEC OpenMP (reference dataset)
- EPCC OpenMP Microbenchmarks
Host machine
- AMD Opteron 2376 ("Shanghai"), 2.3 GHz, 2 sockets, quad-core
- Scientific Linux
- Virtualized with Xen
Virtual machines
- Hypervisor: Xen
- OS: Debian
- Compiler: gcc
- #cores: 1-8
- Memory: 4 GB

6 NAS Parallel Benchmarks

7 NAS Parallel Benchmarks (2)

8 SPEC OpenMP Benchmarks

9 SPEC OpenMP Benchmarks (2)

10 Execution time of NAS SP
What is going on here?

11 OpenMP Performance Analysis with ompp
ompp: OpenMP profiling tool
- Based on source code instrumentation
- Independent of the compiler and runtime used
- Supports HW counters through PAPI
- Uses the source code instrumenter Opari from the KOJAK/Scalasca toolset
- Available for download (GPL)
Workflow
- Source code
- Automatic instrumentation of OpenMP constructs, manual region instrumentation
- Linking with the ompp library produces the executable
- Settings (env. vars): HW counters, output format, ...
- Execution on parallel machine
- Profiling report

12 Source to Source Instrumentation with Opari
Preprocessor instrumentation
- Example: instrumenting OpenMP constructs with Opari
- Preprocessor operation: original source code -> preprocessor -> modified (instrumented) source code
Example: instrumentation of a parallel region

Original source code:

    #pragma omp parallel
    {
        /* user code in parallel region */
    }

Instrumented source code (the POMP_* calls are the instrumentation added by Opari):

    POMP_Parallel_fork [master]
    #pragma omp parallel
    {
        POMP_Parallel_begin [team]
        /* user code in parallel region */
        POMP_Barrier_enter [team]
        #pragma omp barrier
        POMP_Barrier_exit [team]
        POMP_Parallel_end [team]
    }
    POMP_Parallel_join [master]
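The transformation on this slide can be imitated by hand. The following is a minimal sketch, not Opari's actual output: the `pomp_*` functions are hypothetical stand-ins that only count events, whereas the calls Opari generates take region descriptors and feed a measurement library. It shows where each hook sits relative to the parallel region and the explicit barrier.

```c
#include <stdio.h>

/* Hypothetical event counters; a real POMP library records timestamps. */
static int n_fork, n_join, n_begin, n_end, n_bar_enter, n_bar_exit;

/* Increment a counter safely when called by several threads at once. */
static void count_event(int *counter) {
    #pragma omp atomic
    (*counter)++;
}

/* Hypothetical stand-ins for the hooks named on the slide. */
static void pomp_parallel_fork(void)  { count_event(&n_fork); }
static void pomp_parallel_join(void)  { count_event(&n_join); }
static void pomp_parallel_begin(void) { count_event(&n_begin); }
static void pomp_parallel_end(void)   { count_event(&n_end); }
static void pomp_barrier_enter(void)  { count_event(&n_bar_enter); }
static void pomp_barrier_exit(void)   { count_event(&n_bar_exit); }

/* One parallel region, instrumented in the pattern shown on the slide. */
void run_instrumented_region(void) {
    pomp_parallel_fork();                /* [master] */
    #pragma omp parallel
    {
        pomp_parallel_begin();           /* [team] */
        /* user code in parallel region */
        pomp_barrier_enter();            /* [team] */
        #pragma omp barrier
        pomp_barrier_exit();             /* [team] */
        pomp_parallel_end();             /* [team] */
    }
    pomp_parallel_join();                /* [master] */
}

/* Print how often each hook fired. */
void print_event_counts(void) {
    printf("fork=%d join=%d begin=%d end=%d bar_enter=%d bar_exit=%d\n",
           n_fork, n_join, n_begin, n_end, n_bar_enter, n_bar_exit);
}
```

Calling `run_instrumented_region()` once and then `print_event_counts()` should report one fork and one join, and one begin, end, barrier-enter, and barrier-exit event per team member. Compiled without OpenMP support, the pragmas are ignored and the team has a single thread.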

13 ompp's Profiling Data
Example code section and performance profile

Code:

    #pragma omp parallel
    {
        #pragma omp critical
        {
            sleep(1.0);
        }
    }

Profile:

    R00002 main.c (34-37) (default) CRITICAL
    TID  exect  execc  bodyt  entert  exitt
    SUM

Components:
- Source code location and type of region
- Timing data and execution counts, depending on the particular construct
- One line per thread, last line sums over all threads
- Hardware counter data (if PAPI is available and HW counters are selected)
- Data is exact (measured, not based on sampling)

14 ompp Overhead Analysis (1)
Certain timing categories reported by ompp can be classified as overheads.
- Example: entert in a critical section. Threads wait to enter the critical section (synchronization overhead).
Four overhead categories are defined in ompp:
- Imbalance: waiting time incurred due to an imbalanced amount of work in a worksharing or parallel region
- Synchronization: overhead that arises because threads have to synchronize their activity, e.g. a barrier call
- Limited Parallelism: idle threads because the program does not expose enough parallelism
- Thread management: overhead for the creation and destruction of threads, and for signaling critical sections and locks as available

15 ompp Overhead Analysis (2)
- S: Synchronization overhead
- M: Thread management overhead
- I: Imbalance overhead
- L: Limited Parallelism overhead
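The classification described on the two overhead slides can be sketched as a lookup from (construct, timing category) to one of the four overhead classes. This is an illustration only. The critical/entert and barrier entries follow the slides directly; the remaining entries, the `classify_overhead` and `overhead_percent` names, and the normalization by wallclock time times thread count are assumptions, not ompp's actual implementation.

```c
#include <string.h>

typedef enum {
    OVHD_NONE,         /* time counted as useful work */
    OVHD_SYNC,         /* S: synchronization */
    OVHD_IMBALANCE,    /* I: imbalance */
    OVHD_LIMITED_PAR,  /* L: limited parallelism */
    OVHD_MGMT          /* M: thread management */
} overhead_t;

/* Map a construct type and timing category to an overhead class. */
overhead_t classify_overhead(const char *region, const char *timer) {
    /* Stated on the slide: waiting to enter a critical section. */
    if (!strcmp(region, "critical") && !strcmp(timer, "entert"))
        return OVHD_SYNC;
    /* Stated on the slide: "e.g. barrier call". */
    if (!strcmp(region, "barrier") && !strcmp(timer, "exect"))
        return OVHD_SYNC;
    /* Waiting at the end of a worksharing or parallel region. */
    if ((!strcmp(region, "loop") || !strcmp(region, "parallel")) &&
        !strcmp(timer, "exitbart"))
        return OVHD_IMBALANCE;
    /* Assumption: threads idling while one thread runs a single block. */
    if (!strcmp(region, "single") && !strcmp(timer, "exitbart"))
        return OVHD_LIMITED_PAR;
    /* Assumption: signaling on leaving a critical section. */
    if (!strcmp(region, "critical") && !strcmp(timer, "exitt"))
        return OVHD_MGMT;
    return OVHD_NONE;
}

/* Overhead as a percentage of accumulated execution time.
   Assumption: accumulated time = wallclock seconds * number of threads. */
double overhead_percent(double overhead_sum, double wallclock, int nthreads) {
    if (wallclock <= 0.0 || nthreads <= 0)
        return 0.0;
    return 100.0 * overhead_sum / (wallclock * nthreads);
}
```

For example, 8 seconds of summed waiting time in a 10-second run on 4 threads would be reported as 20% overhead under this normalization.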

16 Overhead Analysis for the NAS Benchmarks
Rows: BT-host, BT-full, BT-para, FT-host, FT-full, FT-para, CG-host, CG-full, CG-para, EP-host, EP-full, EP-para, SP-host, SP-full, SP-para
Columns: Total, Overhead (%), Synch, Imbal, Limpar, Mgmt
Values: (06.48) (11.47) (11.65) (35.44) (34.53) (36.34) 1.55 (08.95) 4.87 (23.59) 6.37 (26.49) 1.08 (01.17) 1.24 (01.37) (22.13) (33.03) (86.89) (77.68)

17 OpenMP Constructs in the NAS Parallel Benchmarks
[Table: OpenMP constructs (Parallel, Loop, Single, Barrier, Critical, Master) in BT, FT, CG, EP, and SP]

18 ompp Profile for SP
ompp Profiling Report for sp.c (lines )
[Table: columns TID, exect, execc, bodyt, exitbart (para-virtualized), exitbart (native host); one row per thread plus SUM]

19 exitbart in Parallel Loops
- Opari transforms the implicit barrier at the end of a parallel loop into an explicit barrier
- Event sequence per thread: Loop_enter, Barrier_enter, Barrier_exit, Loop_exit
- [Figure: worst-case load imbalance scenario, with thread i delaying the barrier by t seconds]
- Thread i can induce at most t seconds of exitbart time in each other thread

20
[Table: ompp profile with columns TID, exect, execc, bodyt, exitbart; rows for TID 0, TID 1, and SUM]
- exitbart should be max. ~80 seconds
- Barrier that takes a really long time
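The reasoning behind the "~80 seconds" limit can be written down as a small check. This is a sketch under one assumption: the t in the worst-case scenario is read as the time a thread spends in the loop body (bodyt), so load imbalance alone cannot make a thread wait longer at the barrier than the largest bodyt among the other threads. The function names and the numbers in the usage note are hypothetical, not values from the profile.

```c
#include <stddef.h>

/* Largest bodyt among all threads other than j: the most barrier waiting
   time that load imbalance alone could impose on thread j. */
double imbalance_bound(const double *bodyt, size_t nthreads, size_t j) {
    double bound = 0.0;
    for (size_t i = 0; i < nthreads; i++) {
        if (i != j && bodyt[i] > bound)
            bound = bodyt[i];
    }
    return bound;
}

/* Return 1 if some thread waited at the exit barrier longer than load
   imbalance can explain, i.e. the barrier itself must be slow. */
int barrier_time_suspicious(const double *bodyt, const double *exitbart,
                            size_t nthreads) {
    for (size_t j = 0; j < nthreads; j++) {
        if (exitbart[j] > imbalance_bound(bodyt, nthreads, j))
            return 1;
    }
    return 0;
}
```

With made-up values `bodyt = {80.0, 78.0}` and `exitbart = {2.0, 300.0}`, thread 1 waits far longer than the 80 seconds thread 0 could have caused, so `barrier_time_suspicious` returns 1. That is the pattern the slide points out as a barrier that takes a really long time.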

21 Optimization
Move parallelization to the outermost loop

Original:

    for (j = 1; j <= grid_points[1]-2; j++) {
      for (k = 1; k <= grid_points[2]-2; k++) {
    #pragma omp for
        for (i = 0; i <= grid_points[0]-1; i++) {
          ru1 = c3c4*rho_i[i][j][k];
          cv[i] = us[i][j][k];
          rhon[i] = max(dx2+con43*ru1,
                        max(dx5+c1c5*ru1,
                            max(dxmax+ru1, dx1)));
        }
    #pragma omp for
        for (i = 1; i <= grid_points[0]-2; i++) {
          lhs[0][i][j][k] = 0.0;
          lhs[1][i][j][k] = - dttx2 * cv[i-1] - dttx1 * rhon[i-1];
          lhs[2][i][j][k] = c2dttx1 * rhon[i];
          lhs[3][i][j][k] = dttx2 * cv[i+1] - dttx1 * rhon[i+1];
          lhs[4][i][j][k] = 0.0;
        }
      }
    }

Optimized:

    #pragma omp for
    for (j = 1; j <= grid_points[1]-2; j++) {
      for (k = 1; k <= grid_points[2]-2; k++) {
        for (i = 0; i <= grid_points[0]-1; i++) {
          ru1 = c3c4*rho_i[i][j][k];
          cv[i] = us[i][j][k];
          rhon[i] = max(dx2+con43*ru1,
                        max(dx5+c1c5*ru1,
                            max(dxmax+ru1, dx1)));
        }
        for (i = 1; i <= grid_points[0]-2; i++) {
          lhs[0][i][j][k] = 0.0;
          lhs[1][i][j][k] = - dttx2 * cv[i-1] - dttx1 * rhon[i-1];
          lhs[2][i][j][k] = c2dttx1 * rhon[i];
          lhs[3][i][j][k] = dttx2 * cv[i+1] - dttx1 * rhon[i+1];
          lhs[4][i][j][k] = 0.0;
        }
      }
    }
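The effect of this transformation can be tried on a reduced kernel. The sketch below is not the SP source: the grid size `N`, the simplified arithmetic, and all function names are hypothetical. It also makes one assumption the slide does not show, namely that the scratch arrays `cv` and `rhon` are private to each thread once the `j` loop is parallelized; with shared scratch arrays the outer-loop version would race.

```c
#define N 12   /* hypothetical grid size, chosen only to keep the example small */

static double us[N][N][N], rho_i[N][N][N];
static double lhs_a[N][N][N], lhs_b[N][N][N];

/* Fill the input fields with deterministic values. */
static void init_fields(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++) {
                us[i][j][k]    = 0.01 * (i + 2 * j + 3 * k);
                rho_i[i][j][k] = 1.0 + 0.001 * (i * j + k);
            }
}

/* Innermost-loop worksharing: each "omp for" ends in an implicit barrier,
   so the team synchronizes about 2*(N-2)*(N-2) times per call. */
static void lhs_inner(double out[N][N][N]) {
    static double cv[N], rhon[N];          /* scratch shared by the team */
    #pragma omp parallel
    {
        for (int j = 1; j <= N - 2; j++) {
            for (int k = 1; k <= N - 2; k++) {
                #pragma omp for
                for (int i = 0; i <= N - 1; i++) {
                    cv[i]   = us[i][j][k];
                    rhon[i] = 2.0 * rho_i[i][j][k];
                }
                #pragma omp for
                for (int i = 1; i <= N - 2; i++)
                    out[i][j][k] = cv[i + 1] - cv[i - 1] + rhon[i];
            }
        }
    }
}

/* Outermost-loop worksharing: one barrier at the end of the j loop.
   cv and rhon are declared inside the loop body, so each thread has its own. */
static void lhs_outer(double out[N][N][N]) {
    #pragma omp parallel for
    for (int j = 1; j <= N - 2; j++) {
        double cv[N], rhon[N];             /* assumption: private scratch */
        for (int k = 1; k <= N - 2; k++) {
            for (int i = 0; i <= N - 1; i++) {
                cv[i]   = us[i][j][k];
                rhon[i] = 2.0 * rho_i[i][j][k];
            }
            for (int i = 1; i <= N - 2; i++)
                out[i][j][k] = cv[i + 1] - cv[i - 1] + rhon[i];
        }
    }
}

/* Run both variants and return 1 if they produce the same result. */
int variants_agree(void) {
    init_fields();
    lhs_inner(lhs_a);
    lhs_outer(lhs_b);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++) {
                double d = lhs_a[i][j][k] - lhs_b[i][j][k];
                if (d < 0.0) d = -d;
                if (d > 1e-12) return 0;
            }
    return 1;
}
```

Both variants compute the same values; the difference is only in how often the team meets at a barrier, which is the cost the profile attributes to exitbart.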

22 Optimization Results

23 EPCC Microbenchmarks
There is significant overhead in fine-grained constructs related to thread scheduling and reduction operations.

24 Conclusion and Future Work
- Virtualization introduces application-dependent overheads
- Following good practice advice (outermost, coarse-grained parallelization) is even more important
- Hypercalls are very expensive
Future work
- Investigate this behavior with Xen tracing tools
- Other OpenMP runtimes
- Busy wait vs. yielding
- Virtualization-aware runtime

Performance Profiling for OpenMP Tasks

Performance Profiling for OpenMP Tasks Performance Profiling for OpenMP Tasks Karl Fürlinger 1 and David Skinner 2 1 Computer Science Division, EECS Department University of California at Berkeley Soda Hall 593, Berkeley CA 94720, U.S.A. fuerling@eecs.berkeley.edu

More information

OpenMP Application Profiling State of the Art and Directions for the Future

OpenMP Application Profiling State of the Art and Directions for the Future Procedia Computer Science Procedia Computer Science00 1 (2010) (2012) 1 8 2107 2114 www.elsevier.com/locate/procedia International Conference on Computational Science, ICCS 2010 OpenMP Application Profiling

More information

ompp: A Profiling Tool for OpenMP

ompp: A Profiling Tool for OpenMP ompp: A Profiling Tool for OpenMP Karl Fürlinger Michael Gerndt {fuerling, gerndt}@in.tum.de Technische Universität München Performance Analysis of OpenMP Applications Platform specific tools SUN Studio

More information

Detection and Analysis of Iterative Behavior in Parallel Applications

Detection and Analysis of Iterative Behavior in Parallel Applications Detection and Analysis of Iterative Behavior in Parallel Applications Karl Fürlinger and Shirley Moore Innovative Computing Laboratory, Department of Electrical Engineering and Computer Science, University

More information

The OpenMP Profiler ompp: User Guide and Manual Version 0.7.0, March 2009

The OpenMP Profiler ompp: User Guide and Manual Version 0.7.0, March 2009 The OpenMP Profiler ompp: User Guide and Manual Version 0.7.0, March 2009 Contents Karl Fuerlinger Innovative Computing Laboratory Department of Computer Science University of Tennessee karl at cs utk

More information

Binding Nested OpenMP Programs on Hierarchical Memory Architectures

Binding Nested OpenMP Programs on Hierarchical Memory Architectures Binding Nested OpenMP Programs on Hierarchical Memory Architectures Dirk Schmidl, Christian Terboven, Dieter an Mey, and Martin Bücker {schmidl, terboven, anmey}@rz.rwth-aachen.de buecker@sc.rwth-aachen.de

More information

OpenMP at Sun. EWOMP 2000, Edinburgh September 14-15, 2000 Larry Meadows Sun Microsystems

OpenMP at Sun. EWOMP 2000, Edinburgh September 14-15, 2000 Larry Meadows Sun Microsystems OpenMP at Sun EWOMP 2000, Edinburgh September 14-15, 2000 Larry Meadows Sun Microsystems Outline Sun and Parallelism Implementation Compiler Runtime Performance Analyzer Collection of data Data analysis

More information

Barbara Chapman, Gabriele Jost, Ruud van der Pas

Barbara Chapman, Gabriele Jost, Ruud van der Pas Using OpenMP Portable Shared Memory Parallel Programming Barbara Chapman, Gabriele Jost, Ruud van der Pas The MIT Press Cambridge, Massachusetts London, England c 2008 Massachusetts Institute of Technology

More information

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads... OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.

More information

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads... OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.

More information

Towards Fair and Efficient SMP Virtual Machine Scheduling

Towards Fair and Efficient SMP Virtual Machine Scheduling Towards Fair and Efficient SMP Virtual Machine Scheduling Jia Rao and Xiaobo Zhou University of Colorado, Colorado Springs http://cs.uccs.edu/~jrao/ Executive Summary Problem: unfairness and inefficiency

More information

Improving Virtual Machine Scheduling in NUMA Multicore Systems

Improving Virtual Machine Scheduling in NUMA Multicore Systems Improving Virtual Machine Scheduling in NUMA Multicore Systems Jia Rao, Xiaobo Zhou University of Colorado, Colorado Springs Kun Wang, Cheng-Zhong Xu Wayne State University http://cs.uccs.edu/~jrao/ Multicore

More information

Chapter 5 C. Virtual machines

Chapter 5 C. Virtual machines Chapter 5 C Virtual machines Virtual Machines Host computer emulates guest operating system and machine resources Improved isolation of multiple guests Avoids security and reliability problems Aids sharing

More information

Runtime Address Space Computation for SDSM Systems

Runtime Address Space Computation for SDSM Systems Runtime Address Space Computation for SDSM Systems Jairo Balart Outline Introduction Inspector/executor model Implementation Evaluation Conclusions & future work 2 Outline Introduction Inspector/executor

More information

Scalability Analysis of the SPEC OpenMP Benchmarks on Large-Scale Shared Memory Multiprocessors

Scalability Analysis of the SPEC OpenMP Benchmarks on Large-Scale Shared Memory Multiprocessors Scalability Analysis of the SPEC OpenMP Benchmarks on Large-Scale Shared Memory Multiprocessors Karl Fürlinger 1,2, Michael Gerndt 1, and Jack Dongarra 2 1 Lehrstuhl für Rechnertechnik und Rechnerorganisation,

More information

Introduction to Cloud Computing and Virtualization. Mayank Mishra Sujesha Sudevalayam PhD Students CSE, IIT Bombay

Introduction to Cloud Computing and Virtualization. Mayank Mishra Sujesha Sudevalayam PhD Students CSE, IIT Bombay Introduction to Cloud Computing and Virtualization By Mayank Mishra Sujesha Sudevalayam PhD Students CSE, IIT Bombay Talk Layout Cloud Computing Need Features Feasibility Virtualization of Machines What

More information

Spring 2017 :: CSE 506. Introduction to. Virtual Machines. Nima Honarmand

Spring 2017 :: CSE 506. Introduction to. Virtual Machines. Nima Honarmand Introduction to Virtual Machines Nima Honarmand Virtual Machines & Hypervisors Virtual Machine: an abstraction of a complete compute environment through the combined virtualization of the processor, memory,

More information

Optimize HPC - Application Efficiency on Many Core Systems

Optimize HPC - Application Efficiency on Many Core Systems Meet the experts Optimize HPC - Application Efficiency on Many Core Systems 2018 Arm Limited Florent Lebeau 27 March 2018 2 2018 Arm Limited Speedup Multithreading and scalability I wrote my program to

More information

A Case for High Performance Computing with Virtual Machines

A Case for High Performance Computing with Virtual Machines A Case for High Performance Computing with Virtual Machines Wei Huang*, Jiuxing Liu +, Bulent Abali +, and Dhabaleswar K. Panda* *The Ohio State University +IBM T. J. Waston Research Center Presentation

More information

The Challenges of X86 Hardware Virtualization. GCC- Virtualization: Rajeev Wankar 36

The Challenges of X86 Hardware Virtualization. GCC- Virtualization: Rajeev Wankar 36 The Challenges of X86 Hardware Virtualization GCC- Virtualization: Rajeev Wankar 36 The Challenges of X86 Hardware Virtualization X86 operating systems are designed to run directly on the bare-metal hardware,

More information

What is KVM? KVM patch. Modern hypervisors must do many things that are already done by OSs Scheduler, Memory management, I/O stacks

What is KVM? KVM patch. Modern hypervisors must do many things that are already done by OSs Scheduler, Memory management, I/O stacks LINUX-KVM The need for KVM x86 originally virtualization unfriendly No hardware provisions Instructions behave differently depending on privilege context(popf) Performance suffered on trap-and-emulate

More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Fall 2017 Lecture 27 Virtualization Slides based on Various sources 1 1 Virtualization Why we need virtualization? The concepts and

More information

Parallel Programming: OpenMP

Parallel Programming: OpenMP Parallel Programming: OpenMP Xianyi Zeng xzeng@utep.edu Department of Mathematical Sciences The University of Texas at El Paso. November 10, 2016. An Overview of OpenMP OpenMP: Open Multi-Processing An

More information

Introduction to OpenMP

Introduction to OpenMP Introduction to OpenMP Lecture 9: Performance tuning Sources of overhead There are 6 main causes of poor performance in shared memory parallel programs: sequential code communication load imbalance synchronisation

More information

A Lightweight OpenMP Runtime

A Lightweight OpenMP Runtime Alexandre Eichenberger - Kevin O Brien 6/26/ A Lightweight OpenMP Runtime -- OpenMP for Exascale Architectures -- T.J. Watson, IBM Research Goals Thread-rich computing environments are becoming more prevalent

More information

EE/CSCI 451: Parallel and Distributed Computation

EE/CSCI 451: Parallel and Distributed Computation EE/CSCI 451: Parallel and Distributed Computation Lecture #7 2/5/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline From last class

More information

HANDLING LOAD IMBALANCE IN DISTRIBUTED & SHARED MEMORY

HANDLING LOAD IMBALANCE IN DISTRIBUTED & SHARED MEMORY HANDLING LOAD IMBALANCE IN DISTRIBUTED & SHARED MEMORY Presenters: Harshitha Menon, Seonmyeong Bak PPL Group Phil Miller, Sam White, Nitin Bhat, Tom Quinn, Jim Phillips, Laxmikant Kale MOTIVATION INTEGRATED

More information

A Parallelizing Compiler for Multicore Systems

A Parallelizing Compiler for Multicore Systems A Parallelizing Compiler for Multicore Systems José M. Andión, Manuel Arenaz, Gabriel Rodríguez and Juan Touriño 17th International Workshop on Software and Compilers for Embedded Systems (SCOPES 2014)

More information

Exercise: OpenMP Programming

Exercise: OpenMP Programming Exercise: OpenMP Programming Multicore programming with OpenMP 19.04.2016 A. Marongiu - amarongiu@iis.ee.ethz.ch D. Palossi dpalossi@iis.ee.ethz.ch ETH zürich Odroid Board Board Specs Exynos5 Octa Cortex

More information

Parallel Algorithm Engineering

Parallel Algorithm Engineering Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework and numa control Examples

More information

An Introduction to the SPEC High Performance Group and their Benchmark Suites

An Introduction to the SPEC High Performance Group and their Benchmark Suites An Introduction to the SPEC High Performance Group and their Benchmark Suites Robert Henschel Manager, Scientific Applications and Performance Tuning Secretary, SPEC High Performance Group Research Technologies

More information

Xen Summit Spring 2007

Xen Summit Spring 2007 Xen Summit Spring 2007 Platform Virtualization with XenEnterprise Rich Persaud 4/20/07 Copyright 2005-2006, XenSource, Inc. All rights reserved. 1 Xen, XenSource and XenEnterprise

More information

Capturing and Analyzing the Execution Control Flow of OpenMP Applications

Capturing and Analyzing the Execution Control Flow of OpenMP Applications Int J Parallel Prog (2009) 37:266 276 DOI 10.1007/s10766-009-0100-2 Capturing and Analyzing the Execution Control Flow of OpenMP Applications Karl Fürlinger Shirley Moore Received: 9 March 2009 / Accepted:

More information

Exploring the Throughput-Fairness Trade-off on Asymmetric Multicore Systems

Exploring the Throughput-Fairness Trade-off on Asymmetric Multicore Systems Exploring the Throughput-Fairness Trade-off on Asymmetric Multicore Systems J.C. Sáez, A. Pousa, F. Castro, D. Chaver y M. Prieto Complutense University of Madrid, Universidad Nacional de la Plata-LIDI

More information

Agenda. Optimization Notice Copyright 2017, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.

Agenda. Optimization Notice Copyright 2017, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Agenda VTune Amplifier XE OpenMP* Analysis: answering on customers questions about performance in the same language a program was written in Concepts, metrics and technology inside VTune Amplifier XE OpenMP

More information

Oversubscription on Multicore Processors

Oversubscription on Multicore Processors Oversubscription on Multicore Processors ostin Iancu, teven Hofmeyr, Filip lagojević, Yili Zheng Lawrence erkeley National Laboratory Parallel & Dtributed Processing (IPDP), / Motivation Increasingly parallel

More information

Shared Memory programming paradigm: openmp

Shared Memory programming paradigm: openmp IPM School of Physics Workshop on High Performance Computing - HPC08 Shared Memory programming paradigm: openmp Luca Heltai Stefano Cozzini SISSA - Democritos/INFM

More information

DPHPC: Introduction to OpenMP Recitation session

DPHPC: Introduction to OpenMP Recitation session SALVATORE DI GIROLAMO DPHPC: Introduction to OpenMP Recitation session Based on http://openmp.org/mp-documents/intro_to_openmp_mattson.pdf OpenMP An Introduction What is it? A set of compiler directives

More information

A brief introduction to OpenMP

A brief introduction to OpenMP A brief introduction to OpenMP Alejandro Duran Barcelona Supercomputing Center Outline 1 Introduction 2 Writing OpenMP programs 3 Data-sharing attributes 4 Synchronization 5 Worksharings 6 Task parallelism

More information

Interrupt Coalescing in Xen

Interrupt Coalescing in Xen Interrupt Coalescing in Xen with Scheduler Awareness Michael Peirce & Kevin Boos Outline Background Hypothesis vic-style Interrupt Coalescing Adding Scheduler Awareness Evaluation 2 Background Xen split

More information

Nested Virtualization and Server Consolidation

Nested Virtualization and Server Consolidation Nested Virtualization and Server Consolidation Vara Varavithya Department of Electrical Engineering, KMUTNB varavithya@gmail.com 1 Outline Virtualization & Background Nested Virtualization Hybrid-Nested

More information

COMPUTER ARCHITECTURE. Virtualization and Memory Hierarchy

COMPUTER ARCHITECTURE. Virtualization and Memory Hierarchy COMPUTER ARCHITECTURE Virtualization and Memory Hierarchy 2 Contents Virtual memory. Policies and strategies. Page tables. Virtual machines. Requirements of virtual machines and ISA support. Virtual machines:

More information

Standard promoted by main manufacturers Fortran. Structure: Directives, clauses and run time calls

Standard promoted by main manufacturers   Fortran. Structure: Directives, clauses and run time calls OpenMP Introducción Directivas Regiones paralelas Worksharing sincronizaciones Visibilidad datos Implementación OpenMP: introduction Standard promoted by main manufacturers http://www.openmp.org, http://www.compunity.org

More information

CellSs Making it easier to program the Cell Broadband Engine processor

CellSs Making it easier to program the Cell Broadband Engine processor Perez, Bellens, Badia, and Labarta CellSs Making it easier to program the Cell Broadband Engine processor Presented by: Mujahed Eleyat Outline Motivation Architecture of the cell processor Challenges of

More information

HPMMAP: Lightweight Memory Management for Commodity Operating Systems. University of Pittsburgh

HPMMAP: Lightweight Memory Management for Commodity Operating Systems. University of Pittsburgh HPMMAP: Lightweight Memory Management for Commodity Operating Systems Brian Kocoloski Jack Lange University of Pittsburgh Lightweight Experience in a Consolidated Environment HPC applications need lightweight

More information

Cooperative VM Migration for a virtualized HPC Cluster with VMM-bypass I/O devices

Cooperative VM Migration for a virtualized HPC Cluster with VMM-bypass I/O devices Cooperative VM Migration for a virtualized HPC Cluster with VMM-bypass I/O devices Ryousei Takano, Hidemoto Nakada, Takahiro Hirofuchi, Yoshio Tanaka, and Tomohiro Kudoh Information Technology Research

More information

The Architecture of Virtual Machines Lecture for the Embedded Systems Course CSD, University of Crete (April 29, 2014)

The Architecture of Virtual Machines Lecture for the Embedded Systems Course CSD, University of Crete (April 29, 2014) The Architecture of Virtual Machines Lecture for the Embedded Systems Course CSD, University of Crete (April 29, 2014) ManolisMarazakis (maraz@ics.forth.gr) Institute of Computer Science (ICS) Foundation

More information

MultiLanes: Providing Virtualized Storage for OS-level Virtualization on Many Cores

MultiLanes: Providing Virtualized Storage for OS-level Virtualization on Many Cores MultiLanes: Providing Virtualized Storage for OS-level Virtualization on Many Cores Junbin Kang, Benlong Zhang, Tianyu Wo, Chunming Hu, and Jinpeng Huai Beihang University 夏飞 20140904 1 Outline Background

More information

Task-based Execution of Nested OpenMP Loops

Task-based Execution of Nested OpenMP Loops Task-based Execution of Nested OpenMP Loops Spiros N. Agathos Panagiotis E. Hadjidoukas Vassilios V. Dimakopoulos Department of Computer Science UNIVERSITY OF IOANNINA Ioannina, Greece Presentation Layout

More information

A Survey on Performance Tools for OpenMP

A Survey on Performance Tools for OpenMP A Survey on Performance Tools for OpenMP Mubrak S. Mohsen, Rosni Abdullah, and Yong M. Teo Abstract Advances in processors architecture, such as multicore, increase the size of complexity of parallel computer

More information

Piecewise Holistic Autotuning of Compiler and Runtime Parameters

Piecewise Holistic Autotuning of Compiler and Runtime Parameters Piecewise Holistic Autotuning of Compiler and Runtime Parameters Mihail Popov, Chadi Akel, William Jalby, Pablo de Oliveira Castro University of Versailles Exascale Computing Research August 2016 C E R

More information

OpenMP dynamic loops. Paolo Burgio.

OpenMP dynamic loops. Paolo Burgio. OpenMP dynamic loops Paolo Burgio paolo.burgio@unimore.it Outline Expressing parallelism Understanding parallel threads Memory Data management Data clauses Synchronization Barriers, locks, critical sections

More information

OpenMP, Part 2. EAS 520 High Performance Scientific Computing. University of Massachusetts Dartmouth. Spring 2015

OpenMP, Part 2. EAS 520 High Performance Scientific Computing. University of Massachusetts Dartmouth. Spring 2015 OpenMP, Part 2 EAS 520 High Performance Scientific Computing University of Massachusetts Dartmouth Spring 2015 References This presentation is almost an exact copy of Dartmouth College's openmp tutorial.

More information

Real-Time Cache Management for Multi-Core Virtualization

Real-Time Cache Management for Multi-Core Virtualization Real-Time Cache Management for Multi-Core Virtualization Hyoseung Kim 1,2 Raj Rajkumar 2 1 University of Riverside, California 2 Carnegie Mellon University Benefits of Multi-Core Processors Consolidation

More information

POWER-AWARE SOFTWARE ON ARM. Paul Fox

POWER-AWARE SOFTWARE ON ARM. Paul Fox POWER-AWARE SOFTWARE ON ARM Paul Fox OUTLINE MOTIVATION LINUX POWER MANAGEMENT INTERFACES A UNIFIED POWER MANAGEMENT SYSTEM EXPERIMENTAL RESULTS AND FUTURE WORK 2 MOTIVATION MOTIVATION» ARM SoCs designed

More information

Design Principles for End-to-End Multicore Schedulers

Design Principles for End-to-End Multicore Schedulers c Systems Group Department of Computer Science ETH Zürich HotPar 10 Design Principles for End-to-End Multicore Schedulers Simon Peter Adrian Schüpbach Paul Barham Andrew Baumann Rebecca Isaacs Tim Harris

More information

Performance & Scalability Testing in Virtual Environment Hemant Gaidhani, Senior Technical Marketing Manager, VMware

Performance & Scalability Testing in Virtual Environment Hemant Gaidhani, Senior Technical Marketing Manager, VMware Performance & Scalability Testing in Virtual Environment Hemant Gaidhani, Senior Technical Marketing Manager, VMware 2010 VMware Inc. All rights reserved About the Speaker Hemant Gaidhani Senior Technical

More information

Lecture 16: Recapitulations. Lecture 16: Recapitulations p. 1

Lecture 16: Recapitulations. Lecture 16: Recapitulations p. 1 Lecture 16: Recapitulations Lecture 16: Recapitulations p. 1 Parallel computing and programming in general Parallel computing a form of parallel processing by utilizing multiple computing units concurrently

More information

Compiling for GPUs. Adarsh Yoga Madhav Ramesh

Compiling for GPUs. Adarsh Yoga Madhav Ramesh Compiling for GPUs Adarsh Yoga Madhav Ramesh Agenda Introduction to GPUs Compute Unified Device Architecture (CUDA) Control Structure Optimization Technique for GPGPU Compiler Framework for Automatic Translation

More information

Parallel Programming

Parallel Programming Parallel Programming Lecture delivered by: Venkatanatha Sarma Y Assistant Professor MSRSAS-Bangalore 1 Session Objectives To understand the parallelization in terms of computational solutions. To understand

More information

Intel Threading Tools

Intel Threading Tools Intel Threading Tools Paul Petersen, Intel -1- INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS,
