page migration Implementation and Evaluation of Dynamic Load Balancing Using Runtime Performance Monitoring on Omni/SCASH

Size: px
Start display at page:

Download "page migration Implementation and Evaluation of Dynamic Load Balancing Using Runtime Performance Monitoring on Omni/SCASH"

Transcription

1 Omni/SCASH heterogeneity Omni/SCASH page migration Implementation and Evaluation of Dynamic Load Balancing Using Runtime Performance Monitoring on Omni/SCASH Yoshiaki Sakae, 1 Satoshi Matsuoka, 2 Mitsuhisa Sato 3 and Hiroshi Harada 4 In the commodity cluster environment, there may be performance heterogeneity between nodes due to several reasons, from which the application programmer will suffer as load imbalances. To overcome these problems, some dynamic load balancing mechanisms are needed. In this paper, we report our ongoing work on dynamic load balancing extension to Omni/SCASH. Using our dynamic load balancing mechanisms, we expect that programmers can have load imbalances adjusted automatically by the runtime system without explicit definition of data and task placements in a commodity cluster environment with possibly heterogeneous performance nodes. The results of evaluation indicates that, our loop re-partitioning scheme which is one of the dynamic load balancing extension works well, and also it is important to combine loop re-partition with dynamic page migration. 1. HPC SMP 9) 1 Tokyo Institute of Technology 2 / JST Tokyo Institute of Technology / JST 3 Tsukuba University 4 Hewlett-Packard Japan,Ltd. heterogeneity heterogeneity OpenMP 1

2 SDSM SCASH OpenMP Omni/SCASH 7) Omni/SCASH SCASH page fault counting Omni OpenMP Omni OpenMP OpenMP C translator C-front F-front C Fortran Xobject Xobject AST (Abstract Syntax Tree) Java Exc Java tools Exc Java tools Xobject Java OpenMP C Exc Java tools C 2.2 SCASH SCASH 3) PM 8) OS Software Distributed Shared Memory System SCASH Eager Release Consistency (ERC) invalidate update SCASH home base home base home base home home 2.3 Translation of OpenMP programs to SCASH OpenMP SCASH OpenMP SCASH shemem OpenMP SCASH OpenMP parallel fork-join parallel fork Omni/SCASH nested parallelism SCASH OpenMP OpenMP e.g. SCASH 3. 1 : OpenMP dynamic guided chunk queue 2

3 Example: Num Proc = 3, Perf Ratio = 3:1:2 schedule(profiled) #pragma parallel omp for schedule(profiled) for (i = 0; i < N; i++) { LOOP_BODY; 1 3n n 2n schedule(profiled, 2) schedule(profiled, 2, 3, 1) : executed on proc 0 : executed on proc 1 1 : executed on proc 2 : profiling loop Example of profiled scheduling profiled chunk profiled OpenMP schedule clause schedule(profiled[, chunk_size[, eval_size[, eval_skip]]]) profiled eval skip eval size chunk size eval skip, eval size, chunk size eval skip 0 eval size eval size = 1 chunk size 0 1 profiled 3 3 : 1 : 2 profiled Omni Omni 2 PAPI 2) static void ompc_func(void ** ompc_args){ int i; { int lb, ub, step; double start_time = 0.0, stop_time = 0.0; lb = 0, ub = N, step = 1; _ompc_profiled_sched_init(lb, ub, step, 1); while (_ompc_profiled_sched_next(&lb, &ub, start_time, stop_time)) { _ompc_profiled_get_time(&start_time); for (i = lb; i < ub; i += step) { LOOP_BODY; _ompc_profiled_get_time(&stop_time); 2 profiled CPU ompc profiled sched init() ompc profiled sched next() ompc profiled sched next() ompc profiled get time() 3 ompc profiled sched next() chunk queue dynamic, guided 4. profiled NAS Parallel Benchmarks (NPB-2.3) EP, CG OpenMP Fortran OpenMP RWCP EP EP 2n 3

4 Execution Time of EP Class S on Homogeneous Settings < = > /? static dynamic profiled Time [sec] "! # < = > /? Number of Nodes Execution Time of EP Class S on Homogeneous Settings (Pentium III nodes only ) ( ) * +, -. / ( "7;8: 1 3 $ % & ' ( ) * +, -. / ( 0 1 < 09A = = BDC = E F ( ) * +, -. / ( "798: : Performance Heterogeneous Cluster Fast nodes Slow node CPU Pentium III 500MHz Celeron 300MHz Cache 512KB 128KB Chipset Intel 440BX Same as Fast Memory SDRAM 512MB Same as Fast NIC Myrinet M2M-PCI32C Same as Fast Embarassingly Parallel profiled EP CG CG Conjugate Gradient CG unstructured grid computation CG profiled CPU Pentium III 500MHz Celeron 300MHz Fast, Slow OS RedHat 7.2 SCore gcc O profiled chunk size, eval size, eval skip 4.2 profiled EP 4 dynamic chunk profiled static static, dynamic, profiled EP class S Fast static EP static EP EP chunk queue dynamic static profiled EP 4

5 Execution Time of EP Class S on Heterogeneous Settings static (Slow x 1 + Fast) dynamic (Slow x 1 + Fast) profiled (Slow x 1 + Fast) static (Fast x 4) 2 Memory Behavior of the CG Class A on Celeron 1 + Pentium III 3 settings for static/profiled scheduling static profiled L2 miss ratio 29.6% 31.1% Page Fault at SCASH Barrier Time [sec] Time [sec] Number of Nodes Execution Time of EP Class S on Heterogeneous Settings (one Celeron node + Pentium III nodes) Execution Time of CG Class A on Heterogeneous Settings Number of Nodes static (Slow x 1 + Fast) profiled (Slow x 1 + Fast) Execution Time of CG Class A on Heterogeneous Settings (one Celeron node + Pentium III nodes) chunk vector EP profilied 6 CG class A static profiled CG kernel parallel for affinity CG static profiled profiled 4.4 profiled 2 Slow node 1 Fast node 3 4 CG Class A static profiled L2 cache miss ratio SCASH page fault L2 cache miss ratio PAPI : PAPI L2 TCM L2 cache miss ratio = PAPI L1 DCM 100 L2 cache miss ratio static, profiled SCASH page fault profiled static profiled affinity profiled : profiled page migration page migration page migration stable 5. Ongoing Work: Page Fault Conting Page Migration CG SDSM 5

6 OpenMP page reference counter 6) 1),4),5) SDSM SDSM page fault SPMD 6. Software Distributed Shared Memory, SCASH OpenMP Omni/SCASH heterogeneity page fault conting EP profiled static CG SDSM profiled SDSM page fault counting Omni/SCASH profiled 1) J. Bircsak, P. Craig, R. Crowell, J. Harris, C. A. Nelson, and C. D. Offner. Extending OpenMP for NUMA Machines: The Language. In Proceedings of Workshop on OpenMP Applications and Tool (WOMPAT 2000), July San Diego, USA. 2) S. Browne, J. Dongarra, N. Garner, G. Ho, and P. Mucci. A Portable Programming Interface for Performance Evaluation on Modern Processors. The International Journal of High Performance Computing Applications, 14(3): , Fall ) H. Harada, H. Tezuka, A. Hori, S. Sumimoto, T. Takahashi, and Y. Ishikawa. SCASH: Software DSM using High Performance Network on Commodity Hardware and Software. In Proceedings of Eighth Workshop on Scalable Shared-memory Multiprocessors, pages ACM, May ) A. Hasegawa, M. Sato, Y. Ishikawa, and H. Harada. Optimization and Performance Evaluation of NPB on Omni OpenMP Compiler for SCASH, Software Distributed Memory System (in Japanese). In IPSJ SIG Notes, 2001-ARC-142, 2001-HPC-85, pages , Mar ) J. Merlin. Distributed OpenMP: Extensions to OpenMP for SMP Clusters. In Invited Talk. Second European Workshop on OpenMP (EWOMP 00), Oct Edinburgh, Scotland. 6) D. S. Nikolopoulos, T. S. Papatheodorou, C. D. Polychronopoulos, J. Labarta, and E. Ayguadé. Is Data Distribution Necessary in OpenMP? In Proc. of Supercomputing 2000, Nov Dallas, TX. 7) M. Sato, H. Harada, and Y. Ishikawa. OpenMP compiler for Software Distributed Shared Memory System SCASH. In Proceedings of Workshop on OpenMP Applications and Tool (WOMPAT 2000), July San Diego, USA. 8) H. Tezuka, A. Hori, Y. Ishikawa, and M. Sato. PM: An Operating System Coordinated High Performance Communication Library. In P. Sloot and B. Hertzberger, editors, High-Performance Computing and Networking 97, volume 1225, pages Lecture Notes in Computer Science, Apr ) 6

Preliminary Evaluation of Dynamic Load Balancing Using Loop Re-partitioning on Omni/SCASH

Preliminary Evaluation of Dynamic Load Balancing Using Loop Re-partitioning on Omni/SCASH Preliminary Evaluation of Dynamic Load Balancing Using Loop Re-partitioning on Omni/SCASH Yoshiaki Sakae Tokyo Institute of Technology, Japan sakae@is.titech.ac.jp Mitsuhisa Sato Tsukuba University, Japan

More information

Cluster-enabled OpenMP: An OpenMP compiler for the SCASH software distributed shared memory system

Cluster-enabled OpenMP: An OpenMP compiler for the SCASH software distributed shared memory system 123 Cluster-enabled OpenMP: An OpenMP compiler for the SCASH software distributed shared memory system Mitsuhisa Sato a, Hiroshi Harada a, Atsushi Hasegawa b and Yutaka Ishikawa a a Real World Computing

More information

OpenMP on the FDSM software distributed shared memory. Hiroya Matsuba Yutaka Ishikawa

OpenMP on the FDSM software distributed shared memory. Hiroya Matsuba Yutaka Ishikawa OpenMP on the FDSM software distributed shared memory Hiroya Matsuba Yutaka Ishikawa 1 2 Software DSM OpenMP programs usually run on the shared memory computers OpenMP programs work on the distributed

More information

PM2: High Performance Communication Middleware for Heterogeneous Network Environments

PM2: High Performance Communication Middleware for Heterogeneous Network Environments PM2: High Performance Communication Middleware for Heterogeneous Network Environments Toshiyuki Takahashi, Shinji Sumimoto, Atsushi Hori, Hiroshi Harada, and Yutaka Ishikawa Real World Computing Partnership,

More information

Improving Geographical Locality of Data for Shared Memory Implementations of PDE Solvers

Improving Geographical Locality of Data for Shared Memory Implementations of PDE Solvers Improving Geographical Locality of Data for Shared Memory Implementations of PDE Solvers Henrik Löf, Markus Nordén, and Sverker Holmgren Uppsala University, Department of Information Technology P.O. Box

More information

The Performance Analysis of Portable Parallel Programming Interface MpC for SDSM and pthread

The Performance Analysis of Portable Parallel Programming Interface MpC for SDSM and pthread The Performance Analysis of Portable Parallel Programming Interface MpC for SDSM and pthread Workshop DSM2005 CCGrig2005 Seikei University Tokyo, Japan midori@st.seikei.ac.jp Hiroko Midorikawa 1 Outline

More information

Fig. 1. Omni OpenMP compiler

Fig. 1. Omni OpenMP compiler Performance Evaluation of the Omni OpenMP Compiler Kazuhiro Kusano, Shigehisa Satoh and Mitsuhisa Sato RWCP Tsukuba Research Center, Real World Computing Partnership 1-6-1, Takezono, Tsukuba-shi, Ibaraki,

More information

Omni OpenMP compiler. C++ Frontend. C- Front. F77 Frontend. Intermediate representation (Xobject) Exc Java tool. Exc Tool

Omni OpenMP compiler. C++ Frontend. C- Front. F77 Frontend. Intermediate representation (Xobject) Exc Java tool. Exc Tool Design of OpenMP Compiler for an SMP Cluster Mitsuhisa Sato, Shigehisa Satoh, Kazuhiro Kusano and Yoshio Tanaka Real World Computing Partnership, Tsukuba, Ibaraki 305-0032, Japan E-mail:fmsato,sh-sato,kusano,yoshiog@trc.rwcp.or.jp

More information

Recently, symmetric multiprocessor systems have become

Recently, symmetric multiprocessor systems have become Global Broadcast Argy Krikelis Aspex Microsystems Ltd. Brunel University Uxbridge, Middlesex, UK argy.krikelis@aspex.co.uk COMPaS: a PC-based SMP cluster Mitsuhisa Sato, Real World Computing Partnership,

More information

OpenMP on Distributed Memory via Global Arrays

OpenMP on Distributed Memory via Global Arrays 1 OpenMP on Distributed Memory via Global Arrays Lei Huang a, Barbara Chapman a, and Ricky A. Kall b a Dept. of Computer Science, University of Houston, Texas. {leihuang,chapman}@cs.uh.edu b Scalable Computing

More information

Data Distribution, Migration and Replication on a cc-numa Architecture

Data Distribution, Migration and Replication on a cc-numa Architecture Data Distribution, Migration and Replication on a cc-numa Architecture J. Mark Bull and Chris Johnson EPCC, The King s Buildings, The University of Edinburgh, Mayfield Road, Edinburgh EH9 3JZ, Scotland,

More information

Improving Scalability of OpenMP Applications on Multi-core Systems Using Large Page Support

Improving Scalability of OpenMP Applications on Multi-core Systems Using Large Page Support Improving Scalability of OpenMP Applications on Multi-core Systems Using Large Page Support Ranjit Noronha and D.K. Panda Department of Computer Science and Engineering The Ohio State University Columbus,

More information

OmniRPC: a Grid RPC facility for Cluster and Global Computing in OpenMP

OmniRPC: a Grid RPC facility for Cluster and Global Computing in OpenMP OmniRPC: a Grid RPC facility for Cluster and Global Computing in OpenMP (extended abstract) Mitsuhisa Sato 1, Motonari Hirano 2, Yoshio Tanaka 2 and Satoshi Sekiguchi 2 1 Real World Computing Partnership,

More information

Towards a Lightweight HPF Compiler

Towards a Lightweight HPF Compiler Towards a Lightweight HPF Compiler Hidetoshi Iwashita 1, Kohichiro Hotta 1, Sachio Kamiya 1, and Matthijs van Waveren 2 1 Strategy and Technology Division, Software Group Fujitsu Ltd. 140 Miyamoto, Numazu-shi,

More information

Detection and Analysis of Iterative Behavior in Parallel Applications

Detection and Analysis of Iterative Behavior in Parallel Applications Detection and Analysis of Iterative Behavior in Parallel Applications Karl Fürlinger and Shirley Moore Innovative Computing Laboratory, Department of Electrical Engineering and Computer Science, University

More information

Barbara Chapman, Gabriele Jost, Ruud van der Pas

Barbara Chapman, Gabriele Jost, Ruud van der Pas Using OpenMP Portable Shared Memory Parallel Programming Barbara Chapman, Gabriele Jost, Ruud van der Pas The MIT Press Cambridge, Massachusetts London, England c 2008 Massachusetts Institute of Technology

More information

Omni Compiler and XcodeML: An Infrastructure for Source-to- Source Transformation

Omni Compiler and XcodeML: An Infrastructure for Source-to- Source Transformation http://omni compiler.org/ Omni Compiler and XcodeML: An Infrastructure for Source-to- Source Transformation MS03 Code Generation Techniques for HPC Earth Science Applications Mitsuhisa Sato (RIKEN / Advanced

More information

Approaches to Performance Evaluation On Shared Memory and Cluster Architectures

Approaches to Performance Evaluation On Shared Memory and Cluster Architectures Approaches to Performance Evaluation On Shared Memory and Cluster Architectures Peter Strazdins (and the CC-NUMA Team), CC-NUMA Project, Department of Computer Science, The Australian National University

More information

Exploiting Memory Affinity in OpenMP through Schedule Reuse

Exploiting Memory Affinity in OpenMP through Schedule Reuse Exploiting Memory Affinity in OpenMP through Schedule Reuse D.S. Nikolopoulos Coordinated Sciences Laboratory University of Illinois at Urbana-Champaign 1308 W. Main Street, MC-228 Urbana, IL, 61801, USA

More information

Shared Memory Parallel Programming. Shared Memory Systems Introduction to OpenMP

Shared Memory Parallel Programming. Shared Memory Systems Introduction to OpenMP Shared Memory Parallel Programming Shared Memory Systems Introduction to OpenMP Parallel Architectures Distributed Memory Machine (DMP) Shared Memory Machine (SMP) DMP Multicomputer Architecture SMP Multiprocessor

More information

Nanos Mercurium: a Research Compiler for OpenMP

Nanos Mercurium: a Research Compiler for OpenMP Nanos Mercurium: a Research Compiler for OpenMP J. Balart, A. Duran, M. Gonzàlez, X. Martorell, E. Ayguadé and J. Labarta Computer Architecture Department, Technical University of Catalonia, cr. Jordi

More information

OpenMP for Clusters. Lei Huang*, Barbara Chapman*, Ricky Kendall +

OpenMP for Clusters. Lei Huang*, Barbara Chapman*, Ricky Kendall + for Clusters Lei Huang*, Barbara Chapman*, Ricky Kendall + *Dept. of Computer Science, University of Houston Texas + Scalable Computing Laboratory, Ames Laboratory, Iowa {leihuang, chapman}@cs.uh.edu,

More information

Switch. Switch. PU: Pentium Pro 200MHz Memory: 128MB Myricom Myrinet 100Base-T Ethernet

Switch. Switch. PU: Pentium Pro 200MHz Memory: 128MB Myricom Myrinet 100Base-T Ethernet COMPaS: A Pentium Pro PC-based SMP Cluster and its Experience Yoshio Tanaka 1, Motohiko Matsuda 1, Makoto Ando 1, Kazuto Kubota and Mitsuhisa Sato 1 Real World Computing Partnership fyoshio,matu,ando,kazuto,msatog@trc.rwcp.or.jp

More information

COMP Parallel Computing. SMM (2) OpenMP Programming Model

COMP Parallel Computing. SMM (2) OpenMP Programming Model COMP 633 - Parallel Computing Lecture 7 September 12, 2017 SMM (2) OpenMP Programming Model Reading for next time look through sections 7-9 of the Open MP tutorial Topics OpenMP shared-memory parallel

More information

A brief introduction to OpenMP

A brief introduction to OpenMP A brief introduction to OpenMP Alejandro Duran Barcelona Supercomputing Center Outline 1 Introduction 2 Writing OpenMP programs 3 Data-sharing attributes 4 Synchronization 5 Worksharings 6 Task parallelism

More information

The Performance Analysis of Portable Parallel Programming Interface MpC for SDSM and pthread

The Performance Analysis of Portable Parallel Programming Interface MpC for SDSM and pthread The Performance Analysis of Portable Parallel Programming Interface MpC for SDSM and pthread Hiroko Midorikawa Department of Computer and Information Science Seikei University -- Kichijouji kita-machi,

More information

Software Distributed Shared Memory with High Bandwidth Network: Production and Evaluation

Software Distributed Shared Memory with High Bandwidth Network: Production and Evaluation ,,.,, InfiniBand PCI Express,,,. Software Distributed Shared Memory with High Bandwidth Network: Production and Evaluation Akira Nishida, The recent development of commodity hardware technologies makes

More information

Multicore Performance and Tools. Part 1: Topology, affinity, clock speed

Multicore Performance and Tools. Part 1: Topology, affinity, clock speed Multicore Performance and Tools Part 1: Topology, affinity, clock speed Tools for Node-level Performance Engineering Gather Node Information hwloc, likwid-topology, likwid-powermeter Affinity control and

More information

RWC PC Cluster II and SCore Cluster System Software High Performance Linux Cluster

RWC PC Cluster II and SCore Cluster System Software High Performance Linux Cluster RWC PC Cluster II and SCore Cluster System Software High Performance Linux Cluster Yutaka Ishikawa Hiroshi Tezuka Atsushi Hori Shinji Sumimoto Toshiyuki Takahashi Francis O Carroll Hiroshi Harada Real

More information

Agenda. Optimization Notice Copyright 2017, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.

Agenda. Optimization Notice Copyright 2017, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Agenda VTune Amplifier XE OpenMP* Analysis: answering on customers questions about performance in the same language a program was written in Concepts, metrics and technology inside VTune Amplifier XE OpenMP

More information

Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing

Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing NTNU, IMF February 16. 2018 1 Recap: Distributed memory programming model Parallelism with MPI. An MPI execution is started

More information

Improving the Performance of OpenMP by Array Privatization

Improving the Performance of OpenMP by Array Privatization Improving the Performance of OpenMP by Array Privatization Zhenying Liu, Barbara Chapman, Tien-Hsiung Weng, Oscar Hernandez Department of Computer Science, University of Houston { zliu, chapman, thweng,

More information

Efficient Dynamic Parallelism with OpenMP on Linux SMPs

Efficient Dynamic Parallelism with OpenMP on Linux SMPs Efficient Dynamic Parallelism with OpenMP on Linux SMPs Christos D. Antonopoulos, Ioannis E. Venetis, Dimitrios S. Nikolopoulos and Theodore S. Papatheodorou High Performance Information Systems Laboratory

More information

Lecture 4: OpenMP Open Multi-Processing

Lecture 4: OpenMP Open Multi-Processing CS 4230: Parallel Programming Lecture 4: OpenMP Open Multi-Processing January 23, 2017 01/23/2017 CS4230 1 Outline OpenMP another approach for thread parallel programming Fork-Join execution model OpenMP

More information

ECE 574 Cluster Computing Lecture 10

ECE 574 Cluster Computing Lecture 10 ECE 574 Cluster Computing Lecture 10 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 1 October 2015 Announcements Homework #4 will be posted eventually 1 HW#4 Notes How granular

More information

A Multiprogramming Aware OpenMP Implementation

A Multiprogramming Aware OpenMP Implementation Scientific Programming 11 (2003) 133 141 133 IOS Press A Multiprogramming Aware OpenMP Implementation Vasileios K. Barekas, Panagiotis E. Hadjidoukas, Eleftherios D. Polychronopoulos and Theodore S. Papatheodorou

More information

1 of 6 Lecture 7: March 4. CISC 879 Software Support for Multicore Architectures Spring Lecture 7: March 4, 2008

1 of 6 Lecture 7: March 4. CISC 879 Software Support for Multicore Architectures Spring Lecture 7: March 4, 2008 1 of 6 Lecture 7: March 4 CISC 879 Software Support for Multicore Architectures Spring 2008 Lecture 7: March 4, 2008 Lecturer: Lori Pollock Scribe: Navreet Virk Open MP Programming Topics covered 1. Introduction

More information

Scientific Programming in C XIV. Parallel programming

Scientific Programming in C XIV. Parallel programming Scientific Programming in C XIV. Parallel programming Susi Lehtola 11 December 2012 Introduction The development of microchips will soon reach the fundamental physical limits of operation quantum coherence

More information

Code Auto-Tuning with the Periscope Tuning Framework

Code Auto-Tuning with the Periscope Tuning Framework Code Auto-Tuning with the Periscope Tuning Framework Renato Miceli, SENAI CIMATEC renato.miceli@fieb.org.br Isaías A. Comprés, TUM compresu@in.tum.de Project Participants Michael Gerndt, TUM Coordinator

More information

An Introduction to OpenMP

An Introduction to OpenMP An Introduction to OpenMP U N C L A S S I F I E D Slide 1 What Is OpenMP? OpenMP Is: An Application Program Interface (API) that may be used to explicitly direct multi-threaded, shared memory parallelism

More information

Parallel applications controllable by the resource manager. Operating System Scheduler

Parallel applications controllable by the resource manager. Operating System Scheduler An OpenMP Implementation for Multiprogrammed SMPs Vasileios K. Barekas, Panagiotis E. Hadjidoukas, Eleftherios D. Polychronopoulos and Theodore S. Papatheodorou High Performance Information Systems Laboratory,

More information

Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster

Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster G. Jost*, H. Jin*, D. an Mey**,F. Hatay*** *NASA Ames Research Center **Center for Computing and Communication, University of

More information

CS4961 Parallel Programming. Lecture 5: More OpenMP, Introduction to Data Parallel Algorithms 9/5/12. Administrative. Mary Hall September 4, 2012

CS4961 Parallel Programming. Lecture 5: More OpenMP, Introduction to Data Parallel Algorithms 9/5/12. Administrative. Mary Hall September 4, 2012 CS4961 Parallel Programming Lecture 5: More OpenMP, Introduction to Data Parallel Algorithms Administrative Mailing list set up, everyone should be on it - You should have received a test mail last night

More information

Slurm Configuration Impact on Benchmarking

Slurm Configuration Impact on Benchmarking Slurm Configuration Impact on Benchmarking José A. Moríñigo, Manuel Rodríguez-Pascual, Rafael Mayo-García CIEMAT - Dept. Technology Avda. Complutense 40, Madrid 28040, SPAIN Slurm User Group Meeting 16

More information

Runtime Address Space Computation for SDSM Systems

Runtime Address Space Computation for SDSM Systems Runtime Address Space Computation for SDSM Systems Jairo Balart Outline Introduction Inspector/executor model Implementation Evaluation Conclusions & future work 2 Outline Introduction Inspector/executor

More information

Cluster Computing. Performance and Debugging Issues in OpenMP. Topics. Factors impacting performance. Scalable Speedup

Cluster Computing. Performance and Debugging Issues in OpenMP. Topics. Factors impacting performance. Scalable Speedup Topics Scalable Speedup and Data Locality Parallelizing Sequential Programs Breaking data dependencies Avoiding synchronization overheads Performance and Debugging Issues in OpenMP Achieving Cache and

More information

OpenMP Tutorial. Seung-Jai Min. School of Electrical and Computer Engineering Purdue University, West Lafayette, IN

OpenMP Tutorial. Seung-Jai Min. School of Electrical and Computer Engineering Purdue University, West Lafayette, IN OpenMP Tutorial Seung-Jai Min (smin@purdue.edu) School of Electrical and Computer Engineering Purdue University, West Lafayette, IN 1 Parallel Programming Standards Thread Libraries - Win32 API / Posix

More information

Shared Memory Programming with OpenMP (3)

Shared Memory Programming with OpenMP (3) Shared Memory Programming with OpenMP (3) 2014 Spring Jinkyu Jeong (jinkyu@skku.edu) 1 SCHEDULING LOOPS 2 Scheduling Loops (2) parallel for directive Basic partitioning policy block partitioning Iteration

More information

Introduction to OpenMP

Introduction to OpenMP Introduction to OpenMP Lecture 4: Work sharing directives Work sharing directives Directives which appear inside a parallel region and indicate how work should be shared out between threads Parallel do/for

More information

CMSC 714 Lecture 4 OpenMP and UPC. Chau-Wen Tseng (from A. Sussman)

CMSC 714 Lecture 4 OpenMP and UPC. Chau-Wen Tseng (from A. Sussman) CMSC 714 Lecture 4 OpenMP and UPC Chau-Wen Tseng (from A. Sussman) Programming Model Overview Message passing (MPI, PVM) Separate address spaces Explicit messages to access shared data Send / receive (MPI

More information

Update of Post-K Development Yutaka Ishikawa RIKEN AICS

Update of Post-K Development Yutaka Ishikawa RIKEN AICS Update of Post-K Development Yutaka Ishikawa RIKEN AICS 11:20AM 11:40AM, 2 nd of November, 2017 FLAGSHIP2020 Project Missions Building the Japanese national flagship supercomputer, post K, and Developing

More information

Workload Characterization using the TAU Performance System

Workload Characterization using the TAU Performance System Workload Characterization using the TAU Performance System Sameer Shende, Allen D. Malony, and Alan Morris Performance Research Laboratory, Department of Computer and Information Science University of

More information

CSE 392/CS 378: High-performance Computing - Principles and Practice

CSE 392/CS 378: High-performance Computing - Principles and Practice CSE 392/CS 378: High-performance Computing - Principles and Practice Parallel Computer Architectures A Conceptual Introduction for Software Developers Jim Browne browne@cs.utexas.edu Parallel Computer

More information

Introduction to OpenMP

Introduction to OpenMP Introduction to OpenMP Lecture 9: Performance tuning Sources of overhead There are 6 main causes of poor performance in shared memory parallel programs: sequential code communication load imbalance synchronisation

More information

Shared Memory Programming with OpenMP

Shared Memory Programming with OpenMP Shared Memory Programming with OpenMP (An UHeM Training) Süha Tuna Informatics Institute, Istanbul Technical University February 12th, 2016 2 Outline - I Shared Memory Systems Threaded Programming Model

More information

HPC Practical Course Part 3.1 Open Multi-Processing (OpenMP)

HPC Practical Course Part 3.1 Open Multi-Processing (OpenMP) HPC Practical Course Part 3.1 Open Multi-Processing (OpenMP) V. Akishina, I. Kisel, G. Kozlov, I. Kulakov, M. Pugach, M. Zyzak Goethe University of Frankfurt am Main 2015 Task Parallelism Parallelization

More information

Compiler optimization for OpenMP programs

Compiler optimization for OpenMP programs 131 Compiler optimization techniques for OpenMP programs Shigehisa Satoh a,, Kazuhiro Kusano b and Mitsuhisa Sato c Tsukuba Research Center, Real World Computing Partnership, 1-6-1 Takezono, Tsukuba, Ibaraki

More information

MegaProto/E: Power-Aware High-Performance Cluster with Commodity Technology

MegaProto/E: Power-Aware High-Performance Cluster with Commodity Technology MegaProto/E: Power-Aware High-Performance Cluster with Commodity Technology Λ y z x Taisuke Boku Λ Mitsuhisa Sato Λ Daisuke Takahashi Λ Hiroshi Nakashima Hiroshi Nakamura Satoshi Matsuoka x Yoshihiko Hotta

More information

Introduction to OpenMP. OpenMP basics OpenMP directives, clauses, and library routines

Introduction to OpenMP. OpenMP basics OpenMP directives, clauses, and library routines Introduction to OpenMP Introduction OpenMP basics OpenMP directives, clauses, and library routines What is OpenMP? What does OpenMP stands for? What does OpenMP stands for? Open specifications for Multi

More information

A Short Introduction to OpenMP. Mark Bull, EPCC, University of Edinburgh

A Short Introduction to OpenMP. Mark Bull, EPCC, University of Edinburgh A Short Introduction to OpenMP Mark Bull, EPCC, University of Edinburgh Overview Shared memory systems Basic Concepts in Threaded Programming Basics of OpenMP Parallel regions Parallel loops 2 Shared memory

More information

Module 10: Open Multi-Processing Lecture 19: What is Parallelization? The Lecture Contains: What is Parallelization? Perfectly Load-Balanced Program

Module 10: Open Multi-Processing Lecture 19: What is Parallelization? The Lecture Contains: What is Parallelization? Perfectly Load-Balanced Program The Lecture Contains: What is Parallelization? Perfectly Load-Balanced Program Amdahl's Law About Data What is Data Race? Overview to OpenMP Components of OpenMP OpenMP Programming Model OpenMP Directives

More information

Optimizing OpenMP Programs on Software Distributed Shared Memory Systems

Optimizing OpenMP Programs on Software Distributed Shared Memory Systems Optimizing OpenMP Programs on Software Distributed Shared Memory Systems Seung-Jai Min, Ayon Basumallik, Rudolf Eigenmann School of Electrical and Computer Engineering Purdue University, West Lafayette,

More information

15-418, Spring 2008 OpenMP: A Short Introduction

15-418, Spring 2008 OpenMP: A Short Introduction 15-418, Spring 2008 OpenMP: A Short Introduction This is a short introduction to OpenMP, an API (Application Program Interface) that supports multithreaded, shared address space (aka shared memory) parallelism.

More information

Execution model of three parallel languages: OpenMP, UPC and CAF

Execution model of three parallel languages: OpenMP, UPC and CAF Scientific Programming 13 (2005) 127 135 127 IOS Press Execution model of three parallel languages: OpenMP, UPC and CAF Ami Marowka Software Engineering Department, Shenkar College of Engineering and Design,

More information

COSC 6374 Parallel Computation. Introduction to OpenMP(I) Some slides based on material by Barbara Chapman (UH) and Tim Mattson (Intel)

COSC 6374 Parallel Computation. Introduction to OpenMP(I) Some slides based on material by Barbara Chapman (UH) and Tim Mattson (Intel) COSC 6374 Parallel Computation Introduction to OpenMP(I) Some slides based on material by Barbara Chapman (UH) and Tim Mattson (Intel) Edgar Gabriel Fall 2014 Introduction Threads vs. processes Recap of

More information

OpenMP 4.0/4.5. Mark Bull, EPCC

OpenMP 4.0/4.5. Mark Bull, EPCC OpenMP 4.0/4.5 Mark Bull, EPCC OpenMP 4.0/4.5 Version 4.0 was released in July 2013 Now available in most production version compilers support for device offloading not in all compilers, and not for all

More information

A Source-to-Source OpenMP Compiler

A Source-to-Source OpenMP Compiler A Source-to-Source OpenMP Compiler Mario Soukup and Tarek S. Abdelrahman The Edward S. Rogers Sr. Department of Electrical and Computer Engineering University of Toronto Toronto, Ontario, Canada M5S 3G4

More information

LAB. Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers

LAB. Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers LAB Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers Dan Stanzione, Lars Koesterke, Bill Barth, Kent Milfeld dan/lars/bbarth/milfeld@tacc.utexas.edu XSEDE 12 July 16, 2012 1 Discovery

More information

OpenMP 4.0. Mark Bull, EPCC

OpenMP 4.0. Mark Bull, EPCC OpenMP 4.0 Mark Bull, EPCC OpenMP 4.0 Version 4.0 was released in July 2013 Now available in most production version compilers support for device offloading not in all compilers, and not for all devices!

More information

PROGRAMOVÁNÍ V C++ CVIČENÍ. Michal Brabec

PROGRAMOVÁNÍ V C++ CVIČENÍ. Michal Brabec PROGRAMOVÁNÍ V C++ CVIČENÍ Michal Brabec PARALLELISM CATEGORIES CPU? SSE Multiprocessor SIMT - GPU 2 / 17 PARALLELISM V C++ Weak support in the language itself, powerful libraries Many different parallelization

More information

Advanced OpenMP. Lecture 11: OpenMP 4.0

Advanced OpenMP. Lecture 11: OpenMP 4.0 Advanced OpenMP Lecture 11: OpenMP 4.0 OpenMP 4.0 Version 4.0 was released in July 2013 Starting to make an appearance in production compilers What s new in 4.0 User defined reductions Construct cancellation

More information

MPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016

MPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016 MPI and OpenMP (Lecture 25, cs262a) Ion Stoica, UC Berkeley November 19, 2016 Message passing vs. Shared memory Client Client Client Client send(msg) recv(msg) send(msg) recv(msg) MSG MSG MSG IPC Shared

More information

PROGRAMMING MODELS FOR HIGH PERFORMANCE COMPUTING

PROGRAMMING MODELS FOR HIGH PERFORMANCE COMPUTING PROGRAMMING MODELS FOR HIGH PERFORMANCE COMPUTING Barbara Chapman, University of Houston and University of Southampton, chapman@cs.uh.edu Abstract High performance computing (HPC), or supercomputing, refers

More information

The Design of MPI Based Distributed Shared Memory Systems to Support OpenMP on Clusters

The Design of MPI Based Distributed Shared Memory Systems to Support OpenMP on Clusters The Design of MPI Based Distributed Shared Memory Systems to Support OpenMP on Clusters IEEE Cluster 2007, Austin, Texas, September 17-20 H sien Jin Wong Department of Computer Science The Australian National

More information

Towards OpenMP for Java

Towards OpenMP for Java Towards OpenMP for Java Mark Bull and Martin Westhead EPCC, University of Edinburgh, UK Mark Kambites Dept. of Mathematics, University of York, UK Jan Obdrzalek Masaryk University, Brno, Czech Rebublic

More information

GPU GPU CPU. Raymond Namyst 3 Samuel Thibault 3 Olivier Aumage 3

GPU GPU CPU. Raymond Namyst 3 Samuel Thibault 3 Olivier Aumage 3 /CPU,a),2,2 2,2 Raymond Namyst 3 Samuel Thibault 3 Olivier Aumage 3 XMP XMP-dev CPU XMP-dev/StarPU XMP-dev XMP CPU StarPU CPU /CPU XMP-dev/StarPU N /CPU CPU. Graphics Processing Unit GP General-Purpose

More information

EPL372 Lab Exercise 5: Introduction to OpenMP

EPL372 Lab Exercise 5: Introduction to OpenMP EPL372 Lab Exercise 5: Introduction to OpenMP References: https://computing.llnl.gov/tutorials/openmp/ http://openmp.org/wp/openmp-specifications/ http://openmp.org/mp-documents/openmp-4.0-c.pdf http://openmp.org/mp-documents/openmp4.0.0.examples.pdf

More information

2

2 A Study of High Performance Communication Using a Commodity Network of Parallel Computers 2000 Shinji Sumimoto 2 THE SUMMARY OF PH.D DISSERTATION 3 The increasing demands of information processing requires

More information

Our new HPC-Cluster An overview

Our new HPC-Cluster An overview Our new HPC-Cluster An overview Christian Hagen Universität Regensburg Regensburg, 15.05.2009 Outline 1 Layout 2 Hardware 3 Software 4 Getting an account 5 Compiling 6 Queueing system 7 Parallelization

More information

COMP4510 Introduction to Parallel Computation. Shared Memory and OpenMP. Outline (cont d) Shared Memory and OpenMP

COMP4510 Introduction to Parallel Computation. Shared Memory and OpenMP. Outline (cont d) Shared Memory and OpenMP COMP4510 Introduction to Parallel Computation Shared Memory and OpenMP Thanks to Jon Aronsson (UofM HPC consultant) for some of the material in these notes. Outline (cont d) Shared Memory and OpenMP Including

More information

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads... OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.

More information

Effective Cross-Platform, Multilevel Parallelism via Dynamic Adaptive Execution

Effective Cross-Platform, Multilevel Parallelism via Dynamic Adaptive Execution Effective Cross-Platform, Multilevel Parallelism via Dynamic Adaptive Execution Walden Ko, Mark Yankelevsky, Dimitrios S. Nikolopoulos, and Constantine D. Polychronopoulos Center for Supercomputing Research

More information

Parallel Computing. Hwansoo Han (SKKU)

Parallel Computing. Hwansoo Han (SKKU) Parallel Computing Hwansoo Han (SKKU) Unicore Limitations Performance scaling stopped due to Power consumption Wire delay DRAM latency Limitation in ILP 10000 SPEC CINT2000 2 cores/chip Xeon 3.0GHz Core2duo

More information

Yasuo Okabe. Hitoshi Murai. 1. Introduction. 2. Evaluation. Elapsed Time (sec) Number of Processors

Yasuo Okabe. Hitoshi Murai. 1. Introduction. 2. Evaluation. Elapsed Time (sec) Number of Processors Performance Evaluation of Large-scale Parallel Simulation Codes and Designing New Language Features on the (High Performance Fortran) Data-Parallel Programming Environment Project Representative Yasuo

More information

Topics. Introduction. Shared Memory Parallelization. Example. Lecture 11. OpenMP Execution Model Fork-Join model 5/15/2012. Introduction OpenMP

Topics. Introduction. Shared Memory Parallelization. Example. Lecture 11. OpenMP Execution Model Fork-Join model 5/15/2012. Introduction OpenMP Topics Lecture 11 Introduction OpenMP Some Examples Library functions Environment variables 1 2 Introduction Shared Memory Parallelization OpenMP is: a standard for parallel programming in C, C++, and

More information

Parallel Programming with OpenMP. CS240A, T. Yang, 2013 Modified from Demmel/Yelick s and Mary Hall s Slides

Parallel Programming with OpenMP. CS240A, T. Yang, 2013 Modified from Demmel/Yelick s and Mary Hall s Slides Parallel Programming with OpenMP CS240A, T. Yang, 203 Modified from Demmel/Yelick s and Mary Hall s Slides Introduction to OpenMP What is OpenMP? Open specification for Multi-Processing Standard API for

More information

EE/CSCI 451: Parallel and Distributed Computation

EE/CSCI 451: Parallel and Distributed Computation EE/CSCI 451: Parallel and Distributed Computation Lecture #7 2/5/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline From last class

More information

A Characterization of Shared Data Access Patterns in UPC Programs

A Characterization of Shared Data Access Patterns in UPC Programs IBM T.J. Watson Research Center A Characterization of Shared Data Access Patterns in UPC Programs Christopher Barton, Calin Cascaval, Jose Nelson Amaral LCPC `06 November 2, 2006 Outline Motivation Overview

More information

CS420: Operating Systems

CS420: Operating Systems Threads James Moscola Department of Physical Sciences York College of Pennsylvania Based on Operating System Concepts, 9th Edition by Silberschatz, Galvin, Gagne Threads A thread is a basic unit of processing

More information

The MOSIX Scalable Cluster Computing for Linux. mosix.org

The MOSIX Scalable Cluster Computing for Linux.  mosix.org The MOSIX Scalable Cluster Computing for Linux Prof. Amnon Barak Computer Science Hebrew University http://www. mosix.org 1 Presentation overview Part I : Why computing clusters (slide 3-7) Part II : What

More information

CS550. TA: TBA Office: xxx Office hours: TBA. Blackboard:

CS550. TA: TBA   Office: xxx Office hours: TBA. Blackboard: CS550 Advanced Operating Systems (Distributed Operating Systems) Instructor: Xian-He Sun Email: sun@iit.edu, Phone: (312) 567-5260 Office hours: 1:30pm-2:30pm Tuesday, Thursday at SB229C, or by appointment

More information

CSE 160 Lecture 9. Load balancing and Scheduling Some finer points of synchronization NUMA

CSE 160 Lecture 9. Load balancing and Scheduling Some finer points of synchronization NUMA CSE 160 Lecture 9 Load balancing and Scheduling Some finer points of synchronization NUMA Announcements The Midterm: Tuesday Nov 5 th in this room Covers everything in course through Thu. 11/1 Closed book;

More information

Lecture 9: MIMD Architectures

Lecture 9: MIMD Architectures Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction A set of general purpose processors is connected together.

More information

Parallel C++ Programming System on Cluster of Heterogeneous Computers

Parallel C++ Programming System on Cluster of Heterogeneous Computers Parallel C++ Programming System on Cluster of Heterogeneous Computers Yutaka Ishikawa, Atsushi Hori, Hiroshi Tezuka, Shinji Sumimoto, Toshiyuki Takahashi, and Hiroshi Harada Real World Computing Partnership

More information

Efficient Execution of Parallel Java Applications

Efficient Execution of Parallel Java Applications Efficient Execution of Parallel Java Applications Jordi Guitart, Xavier Martorell, Jordi Torres and Eduard Ayguadé European Center for Parallelism of Barcelona (CEPBA) Computer Architecture Department,

More information

MegaProto/E: Power-Aware High-Performance Cluster with Commodity Technology

MegaProto/E: Power-Aware High-Performance Cluster with Commodity Technology MegaProto/E: Power-Aware High-Performance Cluster with Commodity Technology Λ y z x Taisuke Boku Λ Mitsuhisa Sato Λ Daisuke Takahashi Λ Hiroshi Nakashima Hiroshi Nakamura Satoshi Matsuoka x Yoshihiko Hotta

More information

Lecture 9: MIMD Architecture

Lecture 9: MIMD Architecture Lecture 9: MIMD Architecture Introduction and classification Symmetric multiprocessors NUMA architecture Cluster machines Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is

More information

Research on the Implementation of MPI on Multicore Architectures

Research on the Implementation of MPI on Multicore Architectures Research on the Implementation of MPI on Multicore Architectures Pengqi Cheng Department of Computer Science & Technology, Tshinghua University, Beijing, China chengpq@gmail.com Yan Gu Department of Computer

More information

Parallel Numerical Algorithms

Parallel Numerical Algorithms Parallel Numerical Algorithms http://sudalab.is.s.u-tokyo.ac.jp/~reiji/pna16/ [ 9 ] Shared Memory Performance Parallel Numerical Algorithms / IST / UTokyo 1 PNA16 Lecture Plan General Topics 1. Architecture

More information

OpenMP - Introduction

OpenMP - Introduction OpenMP - Introduction Süha TUNA Bilişim Enstitüsü UHeM Yaz Çalıştayı - 21.06.2012 Outline What is OpenMP? Introduction (Code Structure, Directives, Threads etc.) Limitations Data Scope Clauses Shared,

More information