The Public Shared Objects Run-Time System

Stefan Lüpke, Jürgen W. Quittek, Torsten Wiese
E-mail: wiese@tu-harburg.d400.de
Arbeitsbereich Technische Informatik 2, Technische Universität Hamburg-Harburg, 21071 Hamburg, Germany
ZEUS 95, Linköping, Sweden

Abstract

Public Shared Objects (PSO) is an attempt to offer the advantages of Virtual Shared Memory on distributed memory parallel computers without a significant loss of performance. Shared data structures are distributed via the network to the processing nodes and may exceed the capacity of local memory. The provided access objects hide access latency and reduce the communication bandwidth requirement. PSO is a portable software solution extending the C++ programming language. It has been implemented for the PARIX software environment and is available for parallel computers based on Transputer or PowerPC processors.

1 Introduction

Message Passing and Virtual Shared Memory are different programming models for parallel computers with distributed memory. Generally, Virtual Shared Memory requires less programming expenditure. Its main disadvantage is the lack of performance caused by high access latency, so in practice Message Passing is commonly preferred despite the higher complexity of program design.

Public Shared Objects (PSO) is an attempt to offer the advantages of Virtual Shared Memory with only a small, acceptable loss of performance. Instead of emulating a shared address space, PSO provides a shared symbol space (similar to CC++ [1]). Symbolic names can be shared by all processes of a parallel program. Shared data structures are distributed via the network to the processing nodes and may exceed the capacity of local memory. PSO is a portable software solution and requires a Message Passing system such as PVM [2], MPI [3], or PARIX [4]. It is defined as an extension of C++: making use of data abstraction and operator overloading, the only necessary extension is the new storage class specifier shared. Data structures declared as shared are accessible to all processes and are referenced like any other data structure.

Fig. 1: VSM and PSO. With Virtual Shared Memory, code and data reside in one global memory; with PSO, code and distributed data reside in the local memories of the processors.

PSO divides shared data structures into several blocks called base objects. A base object (BO) is stored at a single processing node. BOs are distributed among all processing nodes to reduce access conflicts and to allow the storage of arrays too large for a single processing node. To support future extensions of PSO, migrating BOs are provided. These distributed shared data structures are referred to as DS². Since the optimal size and location of the BOs depend on the application, each DS² may be provided with a different distribution strategy specifying size, composition, and initial placement of the BOs. To control data distribution in an easy and flexible way, the programmer may define his own strategies.

During initialization a reference object for each declared DS² is created on all processing nodes. The reference objects determine a unique administration node for the DS² by applying a functional projection to the symbolic name. According to the applied distribution strategy, the administration node is responsible for creating and initializing the DS² including its BOs. All reference objects request information about the DS² from the administration node. This information is called the characteristic information. All characteristic information stored at the administration nodes constitutes a distributed data base.

To realize data access, the reference objects determine the corresponding BOs using the characteristic information. To allow efficient caching, this information contains only static data. Since the position of a BO may change, it is not part of the characteristic information. The reference objects determine the position of a BO using a fault-tolerant mechanism provided by the PSO run-time system. Consequently, they use two different caches, one for characteristic information and one for BO locations.
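The two-stage lookup described above can be pictured with a short sketch. The fragment below is only an illustration of the mechanism, not the PSO implementation; all names (ReferenceObject, CharInfo, BOLocation, requestCharInfo, lookupLocation) and the use of a flat element index are assumptions made for this example.

#include <map>
#include <utility>

// Illustrative sketch of the two caches kept by a reference object.
struct CharInfo {          // static characteristic information of a DS² (cacheable)
    int elementSize;       // size of one element in bytes
    int elementsPerBO;     // number of elements stored in one base object
};

struct BOLocation {        // current location of a BO; may change over time
    int node;
};

class ReferenceObject {
    std::map<int, CharInfo>   charCache;  // cache for characteristic information (per DSID)
    std::map<int, BOLocation> locCache;   // cache for BO locations (per BO ID)
public:
    // Resolve a flat element index of a DS² to (BO ID, node currently holding the BO).
    std::pair<int, int> resolve(int dsid, int index) {
        if (charCache.find(dsid) == charCache.end())
            charCache[dsid] = requestCharInfo(dsid);  // ask the administration node once
        const CharInfo& ci = charCache[dsid];
        int boId = index / ci.elementsPerBO;          // static mapping, safe to cache
        if (locCache.find(boId) == locCache.end())
            locCache[boId] = lookupLocation(boId);    // fault-tolerant location service
        return std::make_pair(boId, locCache[boId].node);
    }
private:
    // Placeholders for the messages exchanged by the run-time system (not modelled here).
    CharInfo   requestCharInfo(int /*dsid*/) { return CharInfo{8, 1000}; }
    BOLocation lookupLocation(int boId)      { return BOLocation{boId % 4}; }
};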

2 The structure of PSO

The following example illustrates the use of PSO:

// Parallel multiplication of two (DIM,DIM) matrices
// by DIM * DIM processors.

// Shared declaration of the matrices
shared double a[DIM][DIM], b[DIM][DIM];
shared double c[DIM][DIM];
...
// myprocid: processor number
int row = myprocid % DIM;
int col = myprocid / DIM;
for (int k = 0; k < DIM; k++)
    c[row][col] += a[row][k] * b[k][col];

A precompiler has been built which translates declarations and references of DS²s into object calls. The example demonstrates that PSO is suitable for C++ programmers as well as C programmers. However, a C++ compiler is required to translate the code generated by the precompiler. The language C++ was chosen as the base of PSO because it offers operator overloading and object-oriented programming as well as dynamic memory management.

Fig. 2: Model of PSO. Source code is translated by the PSO precompiler and then by a C++ compiler; at run time the application uses the PSO run-time system, consisting of an administration layer, an interface, and a communication layer on top of a Message Passing system.

The access of an application to a DS² is handled by the PSO administration layer. This layer determines the target of the current access. The access to a specific BO is handled by the communication layer which, if necessary, sends requests to other processing nodes using the underlying Message Passing system. Unlimited scalability is achieved by using distributed administration strategies: all necessary internal information is stored in distributed data bases. PSO is not intended to be a sole programming model; the application can use the Message Passing system directly.

2.1 Precompiler

The concept of data abstraction and operator overloading makes C++ suitable for implementing PSO with a minimum of syntax extensions. The precompiler transforms each declaration using the new storage class specifier shared into a constructor call of the class DSVMem (Data Structure Virtual Memory):

shared double a[DIM][DIM];

is transformed into

DSVMem a(1, sizeof(double), T_DOUBLE, DIM, DIM);

This class serves to handle any access to a single DS². The constructor parameters are evaluated by the precompiler: 1 is the internal unique identifier of the DS², called the DSID (Data Structure ID); the precompiler scans all source files and assigns a single-valued number to each object of type DSVMem. sizeof(double) is the memory required by one element of array a. T_DOUBLE is a symbolic type identifier referring to a basic or user-defined data type. DIM, DIM are the dimensions of the array.
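To make the effect of this transformation concrete, the fragment below sketches how a class like DSVMem could map indexed accesses onto the run-time system via operator overloading. It is a minimal, hypothetical sketch: the proxy classes and the readElement/writeElement calls are assumptions for this illustration, not the actual DSVMem interface.

#include <cstddef>

// Hypothetical sketch: how operator overloading can turn a[i][j] into
// calls of the PSO run-time system (names and signatures are assumed).
class DSVMem {
public:
    DSVMem(int dsid, std::size_t elemSize, int typeId, int dim0, int dim1)
        : dsid_(dsid), elemSize_(elemSize), typeId_(typeId), dim0_(dim0), dim1_(dim1) {}

    // Proxy returned by operator[]; a second operator[] selects the column.
    class Row {
    public:
        Row(DSVMem& ds, int row) : ds_(ds), row_(row) {}
        class Element {
        public:
            Element(DSVMem& ds, int row, int col) : ds_(ds), row_(row), col_(col) {}
            operator double() const { return ds_.readElement(row_, col_); }        // read access
            Element& operator=(double v) { ds_.writeElement(row_, col_, v); return *this; }
            Element& operator+=(double v) { return *this = double(*this) + v; }
        private:
            DSVMem& ds_; int row_; int col_;
        };
        Element operator[](int col) { return Element(ds_, row_, col); }
    private:
        DSVMem& ds_; int row_;
    };
    Row operator[](int row) { return Row(*this, row); }

private:
    // Placeholders for the administration and communication layers (not modelled here).
    double readElement(int row, int col) const { (void)row; (void)col; return 0.0; }
    void   writeElement(int row, int col, double v) { (void)row; (void)col; (void)v; }

    int dsid_; std::size_t elemSize_; int typeId_; int dim0_; int dim1_;
};

With such proxies, a statement like c[row][col] += a[row][k] * b[k][col] from the example above expands into read and write calls of the run-time system without any further syntax extension.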

2.2 Layers of PSO

2.2.1 Administration Layer

Each shared variable or shared array is represented on each processing node by an object of class DSVMem. The constructor call generated by the precompiler (see above) creates a local object that represents array a. First of all, a modulo function f_DS→PID(DSID, nproc), where nproc is the number of available processors, is evaluated to determine the administrating node of the DS². Although each processing node creates a local object a, only the administrating node generates the characteristic information of array a and acts as server to all other nodes. The administrating node of an object does not change during the lifetime of the DS².

The characteristic information contains the DSID (the ID of the data structure), the size of one element in bytes, the dimension(s) of the DS², the number of necessary BOs, the number of elements per BO, and the distribution strategy used.

Fig. 3: Distributing a DS². The administrating node, determined by f_DS→PID, acts as server for the client reference objects on the other nodes; the functions f_DS and f_ID determine the BOs and their IDs.

A function f_DS determines the splitting of the DS² into BOs. A function f_ID provides each BO with a unique ID, which has a functional relationship to the ID of the processing node responsible for that BO. In the current implementation the BO is always stored on this node; the introduction of an administrating node for each BO is useful because in future releases of PSO the BOs are intended to migrate to other nodes. Together, the functions f_DS and f_ID form the distribution strategy of the corresponding DS².
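As an illustration of these definitions, the sketch below implements one conceivable row-block distribution strategy for a two-dimensional DS². The concrete formulas (rows per BO, BO ID equal to the block number, nodes chosen by modulo functions) are assumptions made for this example only; the paper deliberately leaves the choice of strategy to the programmer.

#include <algorithm>

// Hypothetical row-block distribution strategy for a (rows x cols) DS².
struct BlockStrategy {
    int rows, cols, nproc;

    // f_DS→PID: administrating node of the whole DS², selected by a modulo function.
    int administrationNode(int dsid) const { return dsid % nproc; }

    // f_DS: split the DS² into one BO per block of rows (at most nproc BOs).
    int rowsPerBO() const { return std::max(1, (rows + nproc - 1) / nproc); }
    int numberOfBOs() const { return (rows + rowsPerBO() - 1) / rowsPerBO(); }

    // f_ID: BO ID of the element (row, col); here simply the block number.
    int boId(int row, int /*col*/) const { return row / rowsPerBO(); }

    // Node responsible for a BO: functional relationship between BO ID and node ID.
    int nodeOfBO(int boId) const { return boId % nproc; }
};

// Example: with rows = cols = 1000 and nproc = 64, rowsPerBO() = 16, so
// element (row 500, col 7) lies in BO 31, which is stored on node 31.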

2.2.2 Interface

The interface between the communication layer and the administration layer is BO-based. Both layers communicate following the client-server concept: above the interface, all BOs appear to be stored in one huge pool. Instances of the class DSVMem are the clients, whereas the interface acts as a single server handling all BOs. The interface can handle the following calls (a sketch of such a request follows below):

DS_Get: Demand the characteristic information of a DS² from the specified processing node.
BO_Create: Allocate memory for a BO on the specified processing node.
BO_Read: Fetch a part of a BO from the specified processing node.
BO_Write: Update a part of a BO on the specified processing node.
BO_Lock: Lock a BO on the specified processing node.
BO_UnLock: Demand the status of a BO from the specified processing node.
BO_Delete: Delete a BO on the specified processing node.
BO_Move: Move a BO from one specified processing node to another specified processing node.
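One way to picture this BO-based interface is as a tagged request record that the administration layer hands to the communication layer. The following sketch is hypothetical; the paper does not give the actual message layout, so the field names and types used here are assumptions.

#include <cstddef>
#include <vector>

// Hypothetical request record for the BO-based interface described above.
enum class Call {
    DS_Get,     // characteristic information of a DS²
    BO_Create,  // allocate memory for a BO
    BO_Read,    // fetch a part of a BO
    BO_Write,   // update a part of a BO
    BO_Lock,    // lock a BO
    BO_UnLock,  // demand the status of a BO
    BO_Delete,  // delete a BO
    BO_Move     // move a BO to another node
};

struct Request {
    Call        call;        // which interface call is requested
    int         targetNode;  // the specified processing node
    int         dsid;        // DS² identifier (DS_Get)
    int         boId;        // BO identifier (all BO_* calls)
    std::size_t offset;      // first byte within the BO (BO_Read / BO_Write)
    std::vector<char> data;  // payload (BO_Write) or result buffer (BO_Read)
};

// The communication layer would translate such a record into a message of the
// underlying Message Passing system and queue it in its Work List (see 2.2.3).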

2.2.3 Communication Layer

The communication layer (Fig. 4) is the machine-dependent part of PSO. Serving as a BO server, it translates calls from the administration layer into Message Passing commands. Consequently, porting PSO to another Message Passing system only requires modification of the communication layer. The functionality of this layer is explained by its components.

Two Mailboxes store the requests and responses arriving from other processing nodes and implement an asynchronous communication mode within a CSP-based message passing system. Nearly all run-time systems offer synchronous and asynchronous communication models, but they do not use the same syntax. Since PSO should be as portable as possible, we implemented genuine mailbox communication routines.

The Work List is implemented as a FIFO buffer and serializes the incoming requests. It serves as command queue for the Server.

The Server is the crucial module of this layer. It removes requests from the Work List and executes them in interaction with the administration layer on the same node or with a BO server somewhere in the network. From the Server's point of view all BOs are stored in one distributed BO pool; the Server controls the local part of this pool directly. If a BO is locked (access is temporarily denied), the corresponding requests are queued in the Wait Lock List. The mechanism of locking and unlocking BOs is implemented analogously to the UNIX file locking mechanism: the programmer decides whether an access should wait until the BO is unlocked or whether an "access failed" signal should be returned.

All requests to BO servers on other nodes are queued in the Send List. The Requester takes entries from the Send List and sends the requests to the other processing nodes. The Solver waits for the acknowledgements of pending requests and removes the matching entries from the Send List.

Fig. 4: Structure of the communication layer (Mailbox, Work List, Server, BO pool, Wait Lock List, Send List, Requester, Responder, Solver).

Fig. 5: Access to a single BO located in the context of another processor (application, administration layer, Server, Requester, Responder).
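The interplay of the Work List and the Wait Lock List for locked BOs can be sketched as follows. This is a simplified single-threaded model with invented names (ServerSketch, Req, step); the paper does not specify the server loop, and in the real system requests arrive via the mailboxes.

#include <deque>
#include <map>

// Simplified sketch of the Server's handling of locked BOs (hypothetical).
struct Req { int boId; bool failIfLocked; /* plus call type, payload, ... */ };

class ServerSketch {
    std::deque<Req>                workList;  // FIFO of incoming requests
    std::map<int, std::deque<Req>> waitLock;  // requests waiting per locked BO
    std::map<int, bool>            locked;    // lock state of the local BOs

public:
    void post(const Req& r) { workList.push_back(r); }   // filled by the mailboxes

    void step() {                       // executed repeatedly by the Server
        if (workList.empty()) return;
        Req r = workList.front();
        workList.pop_front();
        if (locked[r.boId]) {
            if (r.failIfLocked) reply(r, /*accessFailed=*/true);  // non-blocking, UNIX-like
            else waitLock[r.boId].push_back(r);                   // queue until unlocked
        } else {
            execute(r);                 // BO_Read, BO_Write, BO_Lock, ...
        }
    }

    void unlock(int boId) {             // on BO_UnLock: resume the waiting requests
        locked[boId] = false;
        for (const Req& w : waitLock[boId]) workList.push_back(w);
        waitLock[boId].clear();
    }

private:
    void execute(const Req&) { /* interaction with the administration layer */ }
    void reply(const Req&, bool) { /* response sent back via the Responder (Fig. 4) */ }
};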

3 Access Objects

In order to improve data access performance, acceleration methods have been developed. One method is the use of access objects [5], which act as agents between the application program and the distributed data structures. The basic concept of access objects is to aggregate data from the distributed data structures in local memory before the data is required by the application program, and to write data aggregated in local memory back to the distributed data structures. This is possible for applications with precalculable access patterns, for example matrix computation algorithms, which often have structured access patterns.

The basic functionality of an access object is that the application program can access the locally aggregated data and request a data exchange between the locally aggregated data and the distributed data structure (Fig. 6). This data exchange is performed asynchronously by the access object. There are two basic types of access objects, buffers and queues. Buffers are intended for data that are used several times (matrix multiplication, for example) and for applications that allow the precalculation of the required subset of the distributed data structure but not of the access sequence (quicksort, for example). Queues are intended for data that are required only once (vector scalar product) and for data that may be processed by the algorithm in an unspecified sequence (e.g. the sum of a row or column of a matrix). A simplified sketch of a read queue follows below.

Fig. 6: Basic functionality of the access objects. Application threads on the processing nodes access the distributed arrays through local access objects.
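The following fragment sketches the queue idea: the elements of one column are requested asynchronously and handed to the application one by one. It is a rough model using std::async; the class name ReadQueueSketch and the function fetchColumn are invented for this illustration and simplify the real access objects, which stream data element-wise rather than waiting for the whole column.

#include <cstddef>
#include <future>
#include <vector>

// Hypothetical sketch of a read queue that hides access latency by
// overlapping the fetch of a whole column with local computation.
class ReadQueueSketch {
    std::future<std::vector<float>> pending;  // asynchronous fetch in progress
    std::vector<float> local;                 // locally aggregated copy
    std::size_t next = 0;                     // next element handed to the application
public:
    // Request one column of the distributed array asynchronously.
    void selectAndRead(int column) {
        next = 0;
        pending = std::async(std::launch::async,
                             [column] { return ReadQueueSketch::fetchColumn(column); });
    }
    // Return the next element; blocks only if the data has not arrived yet.
    float pop() {
        if (pending.valid()) local = pending.get();
        return local[next++];
    }
private:
    static std::vector<float> fetchColumn(int column) {
        // Placeholder for the BO_Read requests to the owning nodes (not modelled).
        return std::vector<float>(1000, static_cast<float>(column));
    }
};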

This concept provides some advantages over the usual acceleration methods used for Virtual Shared Memory systems:

Reduced Overhead for Providing Coherence

Virtual Shared Memory systems have to provide cache coherence at all times during program execution, which results in a huge overhead. Using access objects reduces the overhead for coherence, because the coherence requirements themselves may be reduced: the application program specifies of which elements copies may exist, and it may therefore also specify where coherence of the copies is required and where it is not.

Hiding Access Latency

The access latency of non-local elements of distributed data structures may be hidden using access objects, because access objects allow the simultaneous execution of many accesses to non-local data without requiring more than one application program thread per processor. That means the application problem does not have to be partitioned into more threads than processors are used. This reduces programming expense and task switching overhead.

Prefetch

Using access objects, the application program itself specifies which elements of the distributed data structures have to be prefetched. This avoids prefetching needless data.

4 Examples and Results using Access Objects

The example below shows a parallel matrix multiplication. In order to keep the example simple, the algorithm itself is not optimized with regard to the locality of the data access patterns; it is intended to show the possibilities of using access objects for a given algorithm. The program code is executed on each processor. The memory of a single processing node is supposed to be too small to hold a full copy of matrix a or b.

4.1 Implementation using PSO without Access Objects

shared float a[1000][1000];
shared float b[1000][1000];
shared float c[1000][1000];
extern int nproc;
extern int PID;

for (int i = PID; i < 1000; i += nproc)
    for (int j = 0; j < 1000; j++) {
        float sum = 0;
        for (int k = 0; k < 1000; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }

4.2 Implementation using PSO and Access Objects

// Declaration of distributed arrays;
// the keyword shared is a syntax extension of PSO.
// nproc: number of processors
// PID:   unique processor ID; range [0, nproc-1]
shared float a[1000][1000];
shared float b[1000][1000];
shared float c[1000][1000];

AccessBuffer<float> A;     // A buffer to access array a
AccessReadQueue<float> B;  // A queue to read data from array b
AccessWriteQueue<float> C; // A queue to write the results to array c
AccessSelector S;          // Required to select substructures

// Calculate the number of rows of array c to be computed by the local processor
int nrows = 1000 / nproc + (1000 % nproc > PID ? 1 : 0);

// Load all required elements of array a asynchronously
// into the local buffer area of A
A.Associate(a);
A.Select(S.Rows(myProcId, nrows, nproc));
A.ReadAll();

// The local memory of the processing node is too small to hold a complete copy
// of the distributed array b, therefore the data are read for each column
// separately. So only the association is required here.
B.Associate(b);

// Associate the queue C with the distributed array c and select those rows of c
// which are computed on the local processor.
C.Associate(c);
C.Select(S.Rows(myProcId, nrows, nproc));

for (int i = 0; i < nrows; i++)
    for (int j = 0; j < 1000; j++) {
        float sum = 0;
        // Start to read a column
        B.SelectAndRead(S.Column(j));
        for (int k = 0; k < 1000; k++)
            // Get one element from the local buffer area of A
            // and one element from the front of B.
            // Wait if reading the required element of a or b has not been finished.
            sum += A(i, k) * B.Pop();
        // Write the result asynchronously to the distributed array c.
        C.PushAndWrite(sum);
    }

4.3 Result of Using the Access Objects

In the example program shown above, using access objects yields the following advantages:

The required elements of the distributed array a are read into local memory only once. All required elements of array a are requested simultaneously, so the communication bandwidth rather than the communication latency becomes the major performance factor.

The elements of the distributed array b are read by requesting a full column. The program may continue execution after the first element has been read; the succeeding elements are read asynchronously. If the communication bandwidth is large enough, this may reduce the influence of latency by a factor that equals the number of elements in a column (in this case 1000).

The computed elements of the distributed array c are written asynchronously, so that latency becomes unimportant.

5 Conclusions

PSO, a portable software solution providing global distributed data structures on distributed memory systems, has been implemented for the PARIX software environment and tested on two different parallel architectures:

Parsytec SuperCluster with 128 processing nodes (1 INMOS T805 Transputer per node)

Parsytec GC/PowerPlus with 64 processing nodes (2 PowerPC 601 processors and 4 INMOS T805 Transputers per node)

In order to achieve acceptable performance, novel acceleration methods have been designed. Using information provided by the programmer, the access objects hide latency and reduce the communication bandwidth requirement. For this purpose the access objects aggregate local copies of data and perform data prefetch and asynchronous write operations. Asynchronous arithmetic operations are provided as well. The access objects avoid the major disadvantages of standard acceleration methods: coherence requirements are reduced, prefetching of unnecessary data is avoided, and only one application thread per processor is needed. As a further novel acceleration method, self-organizing data structures are planned, which will provide automatic rearrangement of the data distribution at run time.

PSO is not intended to be a sole programming model, but an addition to Message Passing. Consequently, time-critical parts of a parallel program may be optimized using Message Passing, while the major parts of the program are implemented using shared data structures. This leads to efficient programs combined with low programming expenditure. For several applications, PSO has proven to facilitate and shorten the process of program development, including the development of programs designed for shared memory architectures.

6 References

[1] K. Mani Chandy and Carl Kesselman, The CC++ Language Definition, Technical Report Caltech-CS-TR-92-02, California Institute of Technology, 1992.
[2] Al Geist et al., PVM: Parallel Virtual Machine - A Users' Guide and Tutorial for Networked Parallel Computing, MIT Press, Cambridge, Massachusetts, 1994.
[3] Message Passing Interface Forum (Ed.), MPI: A Message-Passing Interface Standard, Message Passing Interface Forum, 1994.
[4] PARSYTEC GmbH, PARIX Reference Manual, PARSYTEC GmbH, 1993.
[5] Stefan Lüpke, Accelerated Access to Shared Distributed Arrays on Distributed Memory Systems by Access Objects, in B. Buchberger and J. Volkert (Eds.), Parallel Processing: CONPAR 94 - VAPP VI, pp. 449-460, Springer-Verlag, Berlin, 1994.
[6] K. Li, Shared Virtual Memory on Loosely Coupled Multiprocessors, Ph.D. Thesis, Yale University, 1986.