OpenMP on the FDSM software distributed shared memory. Hiroya Matsuba, Yutaka Ishikawa

OpenMP on the FDSM software distributed shared memory Hiroya Matsuba Yutaka Ishikawa 1

2 Software DSM OpenMP programs usually run on shared memory computers. To make OpenMP programs work on distributed memory parallel computers, distributed shared memory may be used.

3 Software DSM DSM has been studied for a long time, but it has a performance problem: many DSM systems create twin & diff copies to determine the communication pattern, which causes high overhead and false sharing.
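To make that overhead concrete, the following is a minimal sketch of the classic twin & diff technique used by many page-based software DSMs (a generic illustration, not FDSM code): before the first write to a page, a copy of it (the twin) is kept; at synchronization, the page is compared byte by byte with its twin to find what changed.

    #include <stdlib.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    typedef struct { size_t offset; unsigned char value; } diff_entry;

    /* Snapshot a page before it is written (the "twin"). */
    unsigned char *make_twin(const unsigned char *page)
    {
        unsigned char *twin = malloc(PAGE_SIZE);
        memcpy(twin, page, PAGE_SIZE);
        return twin;
    }

    /* At synchronization, compare the page with its twin; every byte that
     * differs becomes a diff entry to be sent to the other copies of the
     * page. This byte-by-byte scan over every written page is the kind of
     * overhead FDSM tries to avoid. */
    size_t compute_diff(const unsigned char *page, const unsigned char *twin,
                        diff_entry *out)
    {
        size_t n = 0;
        for (size_t i = 0; i < PAGE_SIZE; i++)
            if (page[i] != twin[i])
                out[n++] = (diff_entry){ i, page[i] };
        return n;
    }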

4 FDSM FDSM is a new software DSM system. Its goal is to reduce the overhead of the software DSM system.

5 Approach OpenMP is often used for numerical analysis. Loops are used, the work in a loop is shared among the processors, and communication is required at every iteration.

6 Approach An important feature of numerical analysis is that the access pattern to the shared memory is fixed in every iteration, so the communication pattern is also fixed.
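As a generic illustration (not taken from the talk), consider an iterative stencil: every outer time step runs the same work-shared loops over the same arrays, so the addresses each processor reads and writes, and hence the communication pattern, are identical in every iteration.

    #include <omp.h>

    #define N 1024
    static double a[N], b[N];

    void solve(int steps)
    {
        for (int t = 0; t < steps; t++) {          /* outer time-step loop     */
            #pragma omp parallel for                /* same work sharing each step */
            for (int i = 1; i < N - 1; i++)
                a[i] = 0.5 * (b[i - 1] + b[i + 1]);
            #pragma omp parallel for
            for (int i = 1; i < N - 1; i++)
                b[i] = a[i];
        }
    }

    int main(void) { solve(10); return 0; }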

7 Approach In the first iteration, FDSM obtains the access pattern. In later iterations, FDSM communicates using that pattern, so the overhead of determining the communication pattern is reduced. (Figure: Process 0 and Process 1 each write data and remote-copy it to the other process before it is read there.)

8 Approach Synchronization: the writers can send the data directly to the readers. That is all that is required.

9 Access pattern analysis The main issue in FDSM is how to obtain the access pattern. Granularity matters: two processors may write to one page, so FDSM needs to obtain the access pattern with single-byte granularity.

10 Access pattern analysis The page protection mechanism is used: 1. FDSM forbids all access to the shared memory. 2. When the memory is accessed, a page fault occurs. 3. The accessed address is recorded in the exception handler in the kernel. A problem arises when returning to the user program.

11 Access pattern analysis Problem 1: an exception handler normally removes the cause of the fault, i.e. the page fault handler removes the access protection.

12 Access pattern analysis Problem 1: if the exception handler enables access to the page, the next access to this page will not cause a page fault, and FDSM can no longer obtain the access pattern.

13 Access pattern analysis Problem 2: if the exception handler does not enable access to the page, the same instruction will cause the page fault again, and execution cannot continue.

14 Access pattern analysis Solution: the exception handler emulates the faulting instruction and increments the PC. The memory access is completed and the faulting instruction is not executed again; the accessed address is recorded and execution continues.

15 Access pattern analysis Solution: the exception handler does not enable access to the page, so the next access will cause another page fault and accesses to this page can be detected again.
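FDSM records the address and emulates the faulting instruction inside the kernel, which is hard to show briefly; the following user-level sketch (hypothetical code, not FDSM's) illustrates only the detection step using mprotect and a SIGSEGV handler. For simplicity it re-enables the page after recording the first access, which is exactly the limitation (Problem 1 above) that FDSM's in-kernel instruction emulation avoids.

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define REGION_SIZE (1 << 20)

    static char *region;                 /* stand-in for the shared area  */
    static size_t page_size;
    static long recorded[16];            /* offsets of detected accesses  */
    static int nrecorded;

    /* Record the faulting address, then unprotect the page so the access
     * can complete. FDSM instead keeps the page protected and emulates the
     * instruction, so later accesses to the same page are detected too. */
    static void on_fault(int sig, siginfo_t *info, void *ctx)
    {
        (void)sig; (void)ctx;
        char *addr = (char *)info->si_addr;
        if (nrecorded < 16)
            recorded[nrecorded++] = (long)(addr - region);
        char *page = (char *)((uintptr_t)addr & ~(uintptr_t)(page_size - 1));
        mprotect(page, page_size, PROT_READ | PROT_WRITE);
    }

    int main(void)
    {
        page_size = (size_t)sysconf(_SC_PAGESIZE);
        region = mmap(NULL, REGION_SIZE, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = on_fault;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        region[42] = 1;                  /* first access: fault recorded  */
        region[43] = 2;                  /* same page: no fault any more  */

        for (int i = 0; i < nrecorded; i++)
            printf("access detected at offset %ld\n", recorded[i]);
        return 0;
    }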

16 Software architecture The page fault handler is modified. Most of FDSM, including the instruction emulator, is implemented as a kernel module on Linux 2.4. (Figure: software stack with the application, the FDSM user-level library, the kernel module with the modified page fault handler, and the PM driver, layered over the processor and Myrinet; the user/kernel boundary is marked.)

17 Software architecture The user-level library provides the APIs and performs the communication. The PM communication library is used; PM provides user-level communication over Myrinet.

18 Omni OpenMP Compiler Modification The run-time system obtains the access pattern, so the compiler can stay simple. Omni OpenMP is developed at the University of Tsukuba.
OpenMP source:
    #pragma omp for
    for (i = 0; i < N; i++) {
        a[i] = b[i] * b[i];
    }
Generated code:
    {
        int t1 = 0;
        int t2 = N;
        fdsm_adjust_loop(&t1, &t2);
        fdsm_before_loop();
        for (i = t1; i < t2; i++) {
            a[i] = b[i] * b[i];
        }
        fdsm_after_loop();
    }

19 Omni OpenMP Compiler Modification In the generated code, fdsm_adjust_loop modifies the first and last values of the loop counter, fdsm_before_loop tells the run-time system to begin the access pattern analysis, and fdsm_after_loop performs the barrier.
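The slides do not show the run-time functions themselves. As a rough sketch, assuming a simple block distribution and hypothetical globals fdsm_rank and fdsm_nprocs (these are not FDSM's real internals), fdsm_adjust_loop might split the iteration space like this:

    /* Hypothetical sketch of fdsm_adjust_loop: assign each process a
     * contiguous block of the iteration space [*first, *last). */
    static int fdsm_rank;     /* this process's rank (assumed global)   */
    static int fdsm_nprocs;   /* total process count (assumed global)   */

    void fdsm_adjust_loop(int *first, int *last)
    {
        int n     = *last - *first;
        int chunk = (n + fdsm_nprocs - 1) / fdsm_nprocs;   /* ceiling   */
        int begin = *first + fdsm_rank * chunk;
        int end   = begin + chunk;
        if (begin > *last) begin = *last;
        if (end   > *last) end   = *last;
        *first = begin;
        *last  = end;
    }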

20 Evaluation Hardware environment: 16-node cluster, Pentium III 500 MHz and 512 MB of memory on each node, Myrinet at 1.25 Gbps. Software environment: Linux 2.4 (modified) and the SCore cluster system software.

21 Evaluation Benchmark application: CG from the NAS Parallel Benchmarks, OpenMP version, class A. The benchmark was hand compiled, because we have not implemented the compiler yet.

22 Evaluation Comparison: Omni/SCASH, a software DSM system implemented at the user level using twin & diff, which runs on SCore, and MPI, the MPI version of CG.

23 Result CG result (Mops):
    # of PE   1     2     4     8     16
    FDSM      30.1  51.4  88.3  134   160
    MPI       30.1  58.9  107   200   334
    SCASH     30.1  51.4  80.7  95.5  84.8
(Figure: speedup of MPI, FDSM, and SCASH versus # of PE.)
FDSM scales up to 8 processors and is close to MPI. FDSM is faster than SCASH.

24 Result (Figure: speedup of MPI, FDSM, and SCASH versus # of PE.) FDSM does not perform well on 16 nodes.

25 Discussion In the current implementation, single-byte granularity is used only for write accesses; the pattern of read accesses is obtained with page-size granularity. This causes unnecessary communication, which may be the reason why FDSM is slow on 16 nodes. For example, if processor 0 reads and writes one part of a page while processor 1 writes another part of the same page, communication from 1 to 0 occurs even though it is not necessary.

26 Conclusion FDSM succeeds in speeding up the execution of an OpenMP application on a cluster by obtaining the access pattern of the application; the performance is close to MPI up to 8 nodes. FDSM does not require analysis by a compiler, and it can also work as a normal DSM, so all OpenMP directives can be supported.


28 Nested loop detection (1/2) Each parallel loop is assigned an ID (1, 2, 3, 4 in this example). (Figure: four #pragma omp for loops; loops 1 and 2 precede a barrier, loops 3 and 4 follow it.) The generated code passes the loop ID to fdsm_before_loop:
    {
        int t1 = 0;
        int t2 = N;
        fdsm_adjust_loop(&t1, &t2);
        fdsm_before_loop(1);
        for (i = t1; i < t2; i++) {
            a[i] = b[i] * b[i];
        }
        barrier();
    }

29 Nested loop detection (2/2) Example of the nested loop detection: the arguments passed to fdsm_before_loop form the sequence 1, 2, 1, 2, 3, 4, 3, 4, 1, 2, 1, 2, 1, 2, 3, 4, 3, 4, 1, 2, ... (Figure: the repeated subsequences are grouped into patterns: P1 covers loops 1 and 2, P2 covers loops 3 and 4, and P3 groups P1 and P2.)