The Public Shared Objects Run-Time System
Stefan Lüpke, Jürgen W. Quittek, Torsten Wiese
Arbeitsbereich Technische Informatik 2, Technische Universität Hamburg-Harburg, Hamburg, Germany

Abstract

Public Shared Objects (PSO) is an attempt to offer the advantages of Virtual Shared Memory for distributed memory parallel computers without a significant loss of performance. Shared data structures are distributed via the network to the processing nodes and may exceed the capacity of local memory. The provided access objects hide access latency and reduce the communication bandwidth requirement. PSO is a portable software solution extending the C++ programming language. It has been implemented for the PARIX software environment and is available for parallel computers based on Transputers or PowerPC processors.

1 Introduction

Message Passing and Virtual Shared Memory are different programming models for parallel computers with distributed memory. Generally, Virtual Shared Memory requires less programming expenditure. Its main disadvantage is the lack of performance caused by high access latency. In practice, therefore, Message Passing is commonly preferred despite the higher complexity of program design. Public Shared Objects (PSO) is an attempt to offer the advantages of Virtual Shared Memory with only a small, acceptable loss of performance. Instead of emulating a shared address space, PSO provides a shared symbol space (similar to CC++ [1]). Symbolic names can be shared by all processes of a parallel program. Shared data structures are distributed via the network to the processing nodes and may exceed the capacity of local memory. PSO is a portable software solution and requires a Message Passing system such as PVM [2], MPI [3], or PARIX [4]. It is defined as an extension of C++. By making use of data abstraction and operator overloading, the only necessary extension is the new storage class specifier shared.
Data structures declared as shared are accessible to all processes and are referenced like any other data structure.

Fig. 1: VSM and PSO (in VSM, code and data reside in a global memory; in PSO, code and shared data are distributed over the local memories of the processors)

PSO divides shared data structures into several blocks called base objects. A base object (SO) is stored at a single processing node. SOs are distributed among all processing nodes

The Public Shared Objects Run-Time System: ZEUS 95, Linköping, Sweden
to reduce access conflicts and to allow the storage of arrays too large for a single processing node. To support future extensions of PSO, migrating SOs are provided. These distributed shared data structures are referred to as DS². Since the optimal size and location of the SOs depend on the application, each DS² may be provided with a different distribution strategy specifying the size, composition, and initial placement of its SOs. In order to control data distribution in an easy and flexible way, the programmer may define his own strategies.

During initialization, a reference object for each declared DS² is created on all processing nodes. The reference objects determine a unique administration node for the DS² by applying a functional projection to the symbolic name. According to the applied distribution strategy, the administration node is responsible for creating and initializing the DS² including its SOs. All reference objects request information about the DS² from the administration node. This information is called the characteristic information. All characteristic information stored at the administration nodes constitutes a distributed database. To realize data access, the reference objects determine the corresponding SOs making use of the characteristic information. In order to allow efficient caching, this information contains only static data. Since the position of an SO may change, it is not part of the characteristic information. The reference objects determine the position of an SO using a fault-tolerant mechanism provided by the PSO run-time system. Consequently, they use two different caches: one for characteristic information and one for SO locations.

2 The structure of PSO

The following example illustrates the use of PSO:

    // Parallel multiplication of two (DIM,DIM) matrices
    // by DIM * DIM processors.

    // shared declaration of the matrices
    shared double a[DIM][DIM], b[DIM][DIM];
    shared double c[DIM][DIM];
    ...
    // myprocid: processor number
    int row = myprocid % DIM;
    int col = myprocid / DIM;
    for (int k = 0; k < DIM; k++)
        c[row][col] += a[row][k] * b[k][col];

A precompiler has been built which translates declarations and references of DS²s into object calls. The example demonstrates that PSO is suitable for C programmers as well as C++ programmers. However, a C++ compiler is required to translate the code generated by the precompiler. The language C++ was chosen as the base of PSO because it offers operator overloading and object-oriented programming as well as dynamic memory management.

Fig. 2: Model of PSO (source code is translated by the PSO precompiler and then by a C++ compiler; the application uses the administration, interface, and communication layers of the PSO run-time system on top of a Message Passing system)

The access of an application to a DS² is handled by the PSO administration layer. This layer determines the target of the current access. The access to a specific SO is handled by the communication layer which, if necessary, sends requests to other processing nodes using the underlying Message Passing system. Unlimited scalability is achieved by using distributed administration strategies. All necessary internal information is stored in distributed databases. PSO is not intended to be a sole programming model; the application can use the Message Passing system directly.

2.1 Precompiler

The concepts of data abstraction and operator overloading make C++ suitable for implementing PSO with a minimum of syntax extensions. The precompiler transforms each declaration using the new storage class specifier shared into a constructor call of class DSVMem (Data Structure Virtual Memory):

    shared double a[DIM][DIM];
    DSVMem a(1, sizeof(double), T_DOUBLE, DIM, DIM);

This class serves to handle any access to a single DS². The constructor parameters are evaluated by the precompiler. 1 is the internal unique identifier for the DS² and is called the DSID (Data Structure ID); the precompiler scans all source files and assigns a single-valued number to each object of type DSVMem. sizeof(double) is the memory required by one element of array a. T_DOUBLE is a symbolic type identifier referring to a basic or user-defined data type. DIM, DIM are the dimensions of the array.

2.2 Layers of PSO

Administration Layer

Each shared variable or shared array is represented on each processing node by an object of class DSVMem. The constructor call generated by the precompiler (see above) creates a local object that represents array a. First of all, a modulo function f_DSPID(DSID, nproc) (nproc = number of available processors) is executed to determine the administrating node of the DS². Although each processing node creates a local object a, only the administrating node generates the characteristic information of array a and acts as a server to all other nodes. The administrating node of an object does not change during the lifetime of the DS². The characteristic information contains the DSID (the ID of the data structure), the size of one element in bytes, the dimension(s) of the DS², the number of necessary SOs, the number of elements per SO, and the distribution strategy used.
Fig. 3: Distributing a DS² (client nodes locate the administrator node of the distributed data structure via f_DSPID; the administrator distributes the SOs to the server nodes using f_DS and f_ID)
A function f_DS determines the splitting of the DS² into SOs. A function f_ID is used to provide each SO with a unique ID which has a functional relationship to the ID of the processing node responsible for that SO. In the current implementation the SO is always stored on this node. The introduction of an administrating node for each SO is useful since in future releases of PSO the SOs should be able to migrate to other nodes. Both functions f_DS and f_ID form the distribution strategy of the corresponding DS².

Interface

The interface between the communication layer and the administration layer is SO-based. Both layers use SO semantics to communicate with each other, matching the client-server concept. Above the interface, all SOs seem to be stored in one huge pool. Instances of the class DSVMem are clients, whereas the interface works as a single server handling all the SOs. The interface can handle the following calls:

DS_Get: Demand the characteristic information of a DS² from the specified processing node.
SO_Create: Allocate memory for an SO on the specified processing node.
SO_Read: Fetch a part of an SO from the specified processing node.
SO_Write: Update a part of an SO on the specified processing node.
SO_Lock: Lock an SO on the specified processing node.
SO_UnLock: Unlock an SO on the specified processing node.
SO_Delete: Delete an SO on the specified processing node.
SO_Move: Move an SO from one specified processing node to another specified processing node.

Communication Layer

The communication layer (Fig. 4) is the machine-dependent part of PSO. Serving as an SO server, it translates calls from the administration layer into Message Passing commands. Consequently, porting PSO to other Message Passing systems only requires modification of the communication layer. The functionality of this layer is explained by its components. Two Mailboxes store requests and responses from other processing nodes and implement an asynchronous communication mode within a CSP-based message passing system.
Nearly all run-time systems offer synchronous and asynchronous communication models, but they do not use the same syntax. However, we want PSO to be as portable as possible and therefore implemented genuine mailbox communication routines. The Work List is implemented as a FIFO buffer and serializes the incoming requests. It serves as a command queue for the Server. The Server is the crucial module in this layer. It removes requests from the Work List and executes the SOs in interaction with the administration layer on the same node or with an SO server somewhere in the network. From the Server's point of view, all SOs are stored in a distributed SO-Pool. The Server controls the local part of this pool directly. If an SO is locked (access is temporarily denied), the corresponding requests are queued in the Wait Lock List. The mechanism of locking and unlocking SOs is implemented analogously to the UNIX file locking mechanism: the programmer has to decide whether the access should wait until the SO is unlocked or whether an access-failed signal should be returned. All requests to Servers on other nodes are queued in the Send List. The Requester takes entries from the Send List and performs SOs on other processing nodes. The SO-Solver waits for acknowledgements of pending SOs and removes the matching entries from the Send List.

Fig. 4: Structure of the communication layer (PSO Mailbox, Work List, Server, SO-Pool, Wait Lock List, Send List, Requester, Responder, and SO-Solver; requests and responses flow to and from other nodes)

Fig. 5: Access to a single SO (the application on one processor reaches the SO server of processor n+1 via the Requester and Responder)
3 Access Objects

In order to improve data access performance, acceleration methods have been developed. One method is the use of access objects [5] acting as agents between the application program and the distributed data structures. The basic concept of access objects is to aggregate data from the distributed data structures in local memory before the data is required by the application program, and to write data aggregated in local memory back to the distributed data structures. This is possible for applications with precalculable access patterns, for example matrix computation algorithms, which often have structured access patterns. The basic functionality of an access object is that the application program can access the locally aggregated data and request a data exchange between the locally aggregated data and the distributed data structure (Fig. 6). This data exchange is performed asynchronously by the access object. There are two basic types of access objects, i.e. buffers and queues. Buffers are intended for data that are used several times (matrix multiplication, for example) and for applications that allow the precalculation of the required subset of the distributed data structure but not the precalculation of the access sequence (quicksort, for example). Queues are intended for data that are required only once (vector scalar product) and for data which may be processed by the algorithm in an unspecified sequence (e.g. the sum of a row/column of a matrix).

Fig. 6: Basic functionality of the access objects (threads on the processing nodes access the distributed arrays through access objects)

This concept provides some advantages over the usual acceleration methods used for Virtual Shared Memory systems:

Reduced Overhead for Providing Coherence

Virtual Shared Memory systems have to provide cache coherence at all times during program execution. This results in a huge overhead.
Using access objects reduces the overhead for coherence, because the coherence requirements may be reduced. This is possible because the application program specifies of which elements copies may exist. Therefore the application program may also specify where coherence of the copies is required and where not.

Hiding Access Latency

The access latency of non-local elements of distributed data structures may be hidden using access objects, because access objects allow the simultaneous execution of many accesses to non-local data without requiring more than one application program thread per processor. That means the application problem does not have to be partitioned into more threads than processors are used. This reduces programming expense and task switching overhead.

Prefetch

Using access objects, the application program itself specifies which elements of the distributed data structures have to be prefetched. This avoids prefetching needless data.

4 Examples and Results Using Access Objects

The example below shows a parallel matrix multiplication. In order to keep the example simple, the algorithm itself is not optimized with regard to the locality of the data access patterns. The example is intended to show the possibilities of using access objects for a given algorithm. The program code is executed on each processor. The memory of a single processing node is supposed to be too small to hold a full copy of matrix a or b.
4.1 Implementation using PSO without Access Objects

    // Declaration of distributed arrays
    // Keyword shared is a syntax extension of PSO
    shared float a[1000][1000];
    shared float b[1000][1000];
    shared float c[1000][1000];

    extern int nproc;   // Number of processors
    extern int PID;     // Unique processor ID; range [0, nproc-1]

    for (int i = PID; i < 1000; i += nproc)
        for (int j = 0; j < 1000; j++) {
            float sum = 0;
            for (int k = 0; k < 1000; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        };

4.2 Implementation using PSO and Access Objects

    shared float a[1000][1000];
    shared float b[1000][1000];
    shared float c[1000][1000];

    AccessBuffer<float> A;     // A buffer to access array a
    AccessReadQueue<float> B;  // A queue to read data from array b
    AccessWriteQueue<float> C; // A queue to write the results to array c
    AccessSelector S;          // Required to select substructures

    // Calculate the number of rows of array c to be computed by the local processor
    int nrows = 1000 / nproc + (1000 % nproc > PID ? 1 : 0);
    // Load all required elements of array a asynchronously into the local buffer area of A
    A.Associate(a);
    A.Select(S.Rows(myProcId, nrows, nproc));
    A.ReadAll();

    // The local memory of the processing node is too small to hold a complete copy
    // of the distributed array b, therefore the data are read for each column separately.
    // So only association is required here.
    B.Associate(b);

    // Associate the queue C with the distributed array c and select those rows of c
    // which are computed on the local processor.
    C.Associate(c);
    C.Select(S.Rows(myProcId, nrows, nproc));

    for (int i = 0; i < nrows; i++)
        for (int j = 0; j < 1000; j++) {
            float sum = 0;
            // Start to read a column
            B.SelectAndRead(S.Column(j));
            for (int k = 0; k < 1000; k++)
                // Get one element from the local buffer area of A
                // and one element from the front of B.
                // Wait if reading the required element of a or b has not been finished.
                sum += A(i, k) * B.Pop();
            // Write the result asynchronously to the distributed array c.
            C.PushAndWrite(sum);
        };

4.3 Result of Using the Access Objects

In the example program shown above, using access objects induces the following advantages: The required elements of the distributed array a are read into local memory only once. All required elements of array a are requested simultaneously, so the communication bandwidth instead of the communication latency becomes the major performance factor. The elements of the distributed array b are read by requesting a full column. The program may continue execution after the first element has been read; the succeeding elements are read asynchronously. If the communication bandwidth is large enough, this may reduce the influence of latency by a factor which equals the number of elements in a column (in this case 1000). The computed elements of the distributed array c are written asynchronously, so that latency becomes unimportant.
5 Conclusions

PSO, a portable software solution providing global distributed data structures on distributed memory systems, has been implemented for the PARIX software environment and tested on two different parallel architectures:

Parsytec SuperCluster with 128 processing nodes (1 INMOS Transputer 805 per node)
Parsytec GC/PowerPlus with 64 processing nodes (2 PowerPC 601 processors and 4 INMOS Transputers 805 per node)
In order to achieve acceptable performance, novel acceleration methods have been designed. Using information provided by the programmer, the access objects hide latency and reduce the communication bandwidth requirement. For this purpose the access objects aggregate local copies of data and perform data prefetch and asynchronous write operations. Asynchronous arithmetic operations are also provided. The access objects avoid the major disadvantages of standard acceleration methods: for example, coherence requirements are reduced, prefetching of unnecessary data is avoided, and only one application thread per processor is needed. As a further novel acceleration method, self-organizing data structures are planned, which provide automatic rearrangement of the data distribution at run-time.

PSO is not intended to be a sole programming model, but an enhancement in addition to Message Passing. Consequently, time-critical parts of a parallel program may be optimized using Message Passing, while the major parts of the program are implemented using shared data structures. This leads to efficient programs combined with low programming expenditure. For several applications PSO has proven to facilitate and shorten the process of program development, among them programs originally designed for shared memory architectures.

6 References

[1] K. Mani Chandy and Carl Kesselman, The CC++ Language Definition, Technical Report Caltech-CS-TR-92-02, California Institute of Technology, 1992
[2] Al Geist et al., PVM: Parallel Virtual Machine - A User's Guide and Tutorial for Networked Parallel Computing, MIT Press, Cambridge, Massachusetts
[3] Message Passing Interface Forum (Ed.), MPI: A Message-Passing Interface Standard, Message Passing Interface Forum, 1994
[4] PARSYTEC GmbH, PARIX Reference Manual, PARSYTEC GmbH, 1993
[5] Stefan Lüpke, Accelerated Access to Shared Distributed Arrays on Distributed Memory Systems by Access Objects, in B. Buchberger and J. Volkert (Eds.), Parallel Processing: CONPAR 94 - VAPP VI, Springer-Verlag, Berlin, 1994
[6] K. Li, Shared Virtual Memory on Loosely Coupled Multiprocessors, Ph.D. Thesis, Yale University, 1986
Optimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology EE382C: Embedded Software Systems Final Report David Brunke Young Cho Applied Research Laboratories:
More informationOverview. SMD149 - Operating Systems - Multiprocessing. Multiprocessing architecture. Introduction SISD. Flynn s taxonomy
Overview SMD149 - Operating Systems - Multiprocessing Roland Parviainen Multiprocessor systems Multiprocessor, operating system and memory organizations December 1, 2005 1/55 2/55 Multiprocessor system
More informationDistributed OS and Algorithms
Distributed OS and Algorithms Fundamental concepts OS definition in general: OS is a collection of software modules to an extended machine for the users viewpoint, and it is a resource manager from the
More informationQuestion 13 1: (Solution, p 4) Describe the inputs and outputs of a (1-way) demultiplexer, and how they relate.
Questions 1 Question 13 1: (Solution, p ) Describe the inputs and outputs of a (1-way) demultiplexer, and how they relate. Question 13 : (Solution, p ) In implementing HYMN s control unit, the fetch cycle
More informationCache Optimisation. sometime he thought that there must be a better way
Cache sometime he thought that there must be a better way 2 Cache 1. Reduce miss rate a) Increase block size b) Increase cache size c) Higher associativity d) compiler optimisation e) Parallelism f) prefetching
More informationMapReduce: A Programming Model for Large-Scale Distributed Computation
CSC 258/458 MapReduce: A Programming Model for Large-Scale Distributed Computation University of Rochester Department of Computer Science Shantonu Hossain April 18, 2011 Outline Motivation MapReduce Overview
More informationTHE AUSTRALIAN NATIONAL UNIVERSITY First Semester Examination June 2011 COMP4300/6430. Parallel Systems
THE AUSTRALIAN NATIONAL UNIVERSITY First Semester Examination June 2011 COMP4300/6430 Parallel Systems Study Period: 15 minutes Time Allowed: 3 hours Permitted Materials: Non-Programmable Calculator This
More informationMULTIPROCESSORS AND THREAD LEVEL PARALLELISM
UNIT III MULTIPROCESSORS AND THREAD LEVEL PARALLELISM 1. Symmetric Shared Memory Architectures: The Symmetric Shared Memory Architecture consists of several processors with a single physical memory shared
More informationCS 471 Operating Systems. Yue Cheng. George Mason University Fall 2017
CS 471 Operating Systems Yue Cheng George Mason University Fall 2017 Outline o Process concept o Process creation o Process states and scheduling o Preemption and context switch o Inter-process communication
More informationComparing the Parix and PVM parallel programming environments
Comparing the Parix and PVM parallel programming environments A.G. Hoekstra, P.M.A. Sloot, and L.O. Hertzberger Parallel Scientific Computing & Simulation Group, Computer Systems Department, Faculty of
More informationECE 454 Computer Systems Programming
ECE 454 Computer Systems Programming The Edward S. Rogers Sr. Department of Electrical and Computer Engineering Final Examination Fall 2011 Name Student # Professor Greg Steffan Answer all questions. Write
More informationComputer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More informationDonn Morrison Department of Computer Science. TDT4255 Memory hierarchies
TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,
More informationTutorial 11. Final Exam Review
Tutorial 11 Final Exam Review Introduction Instruction Set Architecture: contract between programmer and designers (e.g.: IA-32, IA-64, X86-64) Computer organization: describe the functional units, cache
More informationReview: Creating a Parallel Program. Programming for Performance
Review: Creating a Parallel Program Can be done by programmer, compiler, run-time system or OS Steps for creating parallel program Decomposition Assignment of tasks to processes Orchestration Mapping (C)
More informationOperating system Dr. Shroouq J.
2.2.2 DMA Structure In a simple terminal-input driver, when a line is to be read from the terminal, the first character typed is sent to the computer. When that character is received, the asynchronous-communication
More informationIntroduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano
Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology
More informationComputer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per
More informationAdapted from David Patterson s slides on graduate computer architecture
Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Ten Advanced Optimizations of Cache Performance Memory Technology and Optimizations Virtual Memory and Virtual
More informationSystem Models for Distributed Systems
System Models for Distributed Systems INF5040/9040 Autumn 2015 Lecturer: Amir Taherkordi (ifi/uio) August 31, 2015 Outline 1. Introduction 2. Physical Models 4. Fundamental Models 2 INF5040 1 System Models
More informationFirst-In-First-Out (FIFO) Algorithm
First-In-First-Out (FIFO) Algorithm Reference string: 7,0,1,2,0,3,0,4,2,3,0,3,0,3,2,1,2,0,1,7,0,1 3 frames (3 pages can be in memory at a time per process) 15 page faults Can vary by reference string:
More informationUltra Large-Scale FFT Processing on Graphics Processor Arrays. Author: J.B. Glenn-Anderson, PhD, CTO enparallel, Inc.
Abstract Ultra Large-Scale FFT Processing on Graphics Processor Arrays Author: J.B. Glenn-Anderson, PhD, CTO enparallel, Inc. Graphics Processor Unit (GPU) technology has been shown well-suited to efficient
More informationLast class. Caches. Direct mapped
Memory Hierarchy II Last class Caches Direct mapped E=1 (One cache line per set) Each main memory address can be placed in exactly one place in the cache Conflict misses if two addresses map to same place
More informationCS377P Programming for Performance Multicore Performance Cache Coherence
CS377P Programming for Performance Multicore Performance Cache Coherence Sreepathi Pai UTCS October 26, 2015 Outline 1 Cache Coherence 2 Cache Coherence Awareness 3 Scalable Lock Design 4 Transactional
More informationA Comparison of the Iserver-Occam, Parix, Express, and PVM Programming Environments on a Parsytec GCel
A Comparison of the Iserver-Occam, Parix, Express, and PVM Programming Environments on a Parsytec GCel P.M.A. Sloot, A.G. Hoekstra, and L.O. Hertzberger Parallel Scientific Computing & Simulation Group,
More informationParallel Programming with OpenMP. CS240A, T. Yang
Parallel Programming with OpenMP CS240A, T. Yang 1 A Programmer s View of OpenMP What is OpenMP? Open specification for Multi-Processing Standard API for defining multi-threaded shared-memory programs
More informationCSE Traditional Operating Systems deal with typical system software designed to be:
CSE 6431 Traditional Operating Systems deal with typical system software designed to be: general purpose running on single processor machines Advanced Operating Systems are designed for either a special
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More information211: Computer Architecture Summer 2016
211: Computer Architecture Summer 2016 Liu Liu Topic: Assembly Programming Storage - Assembly Programming: Recap - Call-chain - Factorial - Storage: - RAM - Caching - Direct - Mapping Rutgers University
More informationMemory Management Outline. Operating Systems. Motivation. Paging Implementation. Accessing Invalid Pages. Performance of Demand Paging
Memory Management Outline Operating Systems Processes (done) Memory Management Basic (done) Paging (done) Virtual memory Virtual Memory (Chapter.) Motivation Logical address space larger than physical
More informationChapter 2: Memory Hierarchy Design Part 2
Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental
More informationLike scalar processor Processes individual data items Item may be single integer or floating point number. - 1 of 15 - Superscalar Architectures
Superscalar Architectures Have looked at examined basic architecture concepts Starting with simple machines Introduced concepts underlying RISC machines From characteristics of RISC instructions Found
More informationProcesses and Threads
TDDI04 Concurrent Programming, Operating Systems, and Real-time Operating Systems Processes and Threads [SGG7] Chapters 3 and 4 Copyright Notice: The lecture notes are mainly based on Silberschatz s, Galvin
More informationModule 5 Introduction to Parallel Processing Systems
Module 5 Introduction to Parallel Processing Systems 1. What is the difference between pipelining and parallelism? In general, parallelism is simply multiple operations being done at the same time.this
More informationDEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING UNIT I
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING Year and Semester : II / IV Subject Code : CS6401 Subject Name : Operating System Degree and Branch : B.E CSE UNIT I 1. Define system process 2. What is an
More informationCSE544 Database Architecture
CSE544 Database Architecture Tuesday, February 1 st, 2011 Slides courtesy of Magda Balazinska 1 Where We Are What we have already seen Overview of the relational model Motivation and where model came from
More informationDenison University. Cache Memories. CS-281: Introduction to Computer Systems. Instructor: Thomas C. Bressoud
Cache Memories CS-281: Introduction to Computer Systems Instructor: Thomas C. Bressoud 1 Random-Access Memory (RAM) Key features RAM is traditionally packaged as a chip. Basic storage unit is normally
More informationIBM Cell Processor. Gilbert Hendry Mark Kretschmann
IBM Cell Processor Gilbert Hendry Mark Kretschmann Architectural components Architectural security Programming Models Compiler Applications Performance Power and Cost Conclusion Outline Cell Architecture:
More informationMultiprocessor Systems
White Paper: Virtex-II Series R WP162 (v1.1) April 10, 2003 Multiprocessor Systems By: Jeremy Kowalczyk With the availability of the Virtex-II Pro devices containing more than one Power PC processor and
More informationEE/CSCI 451: Parallel and Distributed Computation
EE/CSCI 451: Parallel and Distributed Computation Lecture #7 2/5/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline From last class
More informationTDP3471 Distributed and Parallel Computing
TDP3471 Distributed and Parallel Computing Lecture 1 Dr. Ian Chai ianchai@mmu.edu.my FIT Building: Room BR1024 Office : 03-8312-5379 Schedule for Dr. Ian (including consultation hours) available at http://pesona.mmu.edu.my/~ianchai/schedule.pdf
More informationParallel and High Performance Computing CSE 745
Parallel and High Performance Computing CSE 745 1 Outline Introduction to HPC computing Overview Parallel Computer Memory Architectures Parallel Programming Models Designing Parallel Programs Parallel
More informationMultiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types
Chapter 5 Multiprocessor Cache Coherence Thread-Level Parallelism 1: read 2: read 3: write??? 1 4 From ILP to TLP Memory System is Coherent If... ILP became inefficient in terms of Power consumption Silicon
More informationCor Meenderinck, Ben Juurlink Nexus: hardware support for task-based programming
Cor Meenderinck, Ben Juurlink Nexus: hardware support for task-based programming Conference Object, Postprint version This version is available at http://dx.doi.org/0.479/depositonce-577. Suggested Citation
More informationExperiences in building Cosy - an Operating System for Highly Parallel Computers
Experiences in building Cosy - an Operating System for Highly Parallel Computers R. Butenuth a, W. Burke b, C. De Rose b, S. Gilles b and R. Weber b a Group for Operating Systems and Distributed Systems,
More informationThe University of Adelaide, School of Computer Science 13 September 2018
Computer Architecture A Quantitative Approach, Sixth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per
More informationCS 61C: Great Ideas in Computer Architecture (Machine Structures) Thread-Level Parallelism (TLP) and OpenMP
CS 61C: Great Ideas in Computer Architecture (Machine Structures) Thread-Level Parallelism (TLP) and OpenMP Instructors: John Wawrzynek & Vladimir Stojanovic http://inst.eecs.berkeley.edu/~cs61c/ Review
More information1. Define algorithm complexity 2. What is called out of order in detail? 3. Define Hardware prefetching. 4. Define software prefetching. 5. Define wor
CS6801-MULTICORE ARCHECTURES AND PROGRAMMING UN I 1. Difference between Symmetric Memory Architecture and Distributed Memory Architecture. 2. What is Vector Instruction? 3. What are the factor to increasing
More information16.10 Exercises. 372 Chapter 16 Code Improvement. be translated as
372 Chapter 16 Code Improvement 16.10 Exercises 16.1 In Section 16.2 we suggested replacing the instruction r1 := r2 / 2 with the instruction r1 := r2 >> 1, and noted that the replacement may not be correct
More information6.189 IAP Lecture 5. Parallel Programming Concepts. Dr. Rodric Rabbah, IBM IAP 2007 MIT
6.189 IAP 2007 Lecture 5 Parallel Programming Concepts 1 6.189 IAP 2007 MIT Recap Two primary patterns of multicore architecture design Shared memory Ex: Intel Core 2 Duo/Quad One copy of data shared among
More informationSri Vidya College of Engineering & Technology
UNIT I INTRODUCTION TO OOP AND FUNDAMENTALS OF JAVA 1. Define OOP. Part A Object-Oriented Programming (OOP) is a methodology or paradigm to design a program using classes and objects. It simplifies the
More informationLecture 12. Memory Design & Caches, part 2. Christos Kozyrakis Stanford University
Lecture 12 Memory Design & Caches, part 2 Christos Kozyrakis Stanford University http://eeclass.stanford.edu/ee108b 1 Announcements HW3 is due today PA2 is available on-line today Part 1 is due on 2/27
More informationThe MOSIX Scalable Cluster Computing for Linux. mosix.org
The MOSIX Scalable Cluster Computing for Linux Prof. Amnon Barak Computer Science Hebrew University http://www. mosix.org 1 Presentation overview Part I : Why computing clusters (slide 3-7) Part II : What
More informationChapter 6 Caches. Computer System. Alpha Chip Photo. Topics. Memory Hierarchy Locality of Reference SRAM Caches Direct Mapped Associative
Chapter 6 s Topics Memory Hierarchy Locality of Reference SRAM s Direct Mapped Associative Computer System Processor interrupt On-chip cache s s Memory-I/O bus bus Net cache Row cache Disk cache Memory
More information