The Public Shared Objects Run-Time System


Stefan Lüpke, Jürgen W. Quittek, Torsten Wiese
Arbeitsbereich Technische Informatik 2, Technische Universität Hamburg-Harburg, Hamburg, Germany
ZEUS 95, Linköping, Sweden

Abstract

Public Shared Objects (PSO) is an attempt to offer the advantages of Virtual Shared Memory on distributed memory parallel computers without a significant loss of performance. Shared data structures are distributed via the network to the processing nodes and may exceed the capacity of local memory. The provided access objects hide access latency and reduce the communication bandwidth requirement. PSO is a portable software solution extending the C++ programming language. It has been implemented for the PARIX software environment and is available for parallel computers based on Transputers or PowerPC processors.

1 Introduction

Message Passing and Virtual Shared Memory are different programming models for parallel computers with distributed memory. Generally, Virtual Shared Memory requires less programming expenditure. Its main disadvantage is the loss of performance caused by high access latency, so in practice Message Passing is commonly preferred despite the higher complexity of program design.

Public Shared Objects (PSO) is an attempt to offer the advantages of Virtual Shared Memory with only a small, acceptable loss of performance. Instead of emulating a shared address space, PSO provides a shared symbol space (similar to CC++ [1]). Symbolic names can be shared by all processes of a parallel program. Shared data structures are distributed via the network to the processing nodes and may exceed the capacity of local memory.

PSO is a portable software solution and requires a Message Passing system such as PVM [2], MPI [3], or PARIX [4]. It is defined as an extension of C++. Making use of data abstraction and operator overloading, the only necessary language extension is the new storage class specifier shared. Data structures declared as shared are accessible to all processes and are referenced like any other data structure.

Fig. 1: VSM and PSO (VSM presents code and data in one global memory to the processors; under PSO, code and data reside in the local memories of the processors)

PSO divides shared data structures into several blocks called base objects. A base object (BO) is stored at a single processing node. BOs are distributed among all processing nodes to reduce access conflicts and to allow the storage of arrays too large for a single processing node. To support future extensions of PSO, migrating BOs are provided. These Distributed Shared Data Structures are referred to as DS². Since the optimal size and location of the BOs depend on the application, each DS² may be provided with a different distribution strategy specifying the size, composition, and initial placement of its BOs. In order to control the data distribution in an easy and flexible way, the programmer may define his own strategies.
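To make the notion of a distribution strategy concrete, the following is a minimal sketch, not taken from PSO itself: it assumes a hypothetical row-block strategy that splits a dim x dim array into one BO of contiguous rows per node and maps a row index to the node storing it. The function name and the layout are made up for this illustration.

    // Hypothetical row-block distribution strategy (illustration only):
    // a dim x dim array is split into nproc BOs of contiguous rows,
    // BO i being placed on processing node i.
    int rowBlockNode(int row, int dim, int nproc) {
        int rowsPerBO = (dim + nproc - 1) / nproc;  // ceiling division
        return row / rowsPerBO;                     // index of BO and node
    }

A real PSO strategy fixes exactly these parameters, the size and composition of each BO and its initial placement, as described above.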

During initialization, a reference object for each declared DS² is created on all processing nodes. The reference objects determine a unique administration node for the DS² by applying a functional projection to the symbolic name. According to the applied distribution strategy, the administration node is responsible for creating and initializing the DS², including its BOs. All reference objects request information about the DS² from the administration node. This information is called the characteristic information. All characteristic information stored at the administration nodes constitutes a distributed data base.

To realize a data access, the reference objects determine the corresponding BOs making use of the characteristic information. In order to allow efficient caching, this information contains only static data. Since the position of a BO may change, it is not part of the characteristic information; the reference objects determine the position of a BO using a fault-tolerant mechanism provided by the PSO run-time system. Consequently, they use two different caches, one for characteristic information and one for BO locations.

2 The structure of PSO

The following example illustrates the use of PSO:

    // Parallel multiplication of two (DIM,DIM) matrices
    // by DIM * DIM processors.

    // shared declaration of matrices
    shared double a[DIM][DIM], b[DIM][DIM];
    shared double c[DIM][DIM];
    ...
    // myProcId: processor number
    int row = myProcId % DIM;
    int col = myProcId / DIM;
    for (int k = 0; k < DIM; k++)
        c[row][col] += a[row][k] * b[k][col];

A precompiler has been built which translates declarations and references of DS²s into object calls. The example demonstrates that PSO is suitable for C++ programmers as well as for C programmers. However, a C++ compiler is required to translate the code generated by the precompiler. C++ was chosen as the base of PSO because it offers operator overloading and object-oriented programming as well as dynamic memory management.

Fig. 2: Model of PSO (source code passes through the PSO precompiler and a C++ compiler; the application runs on the PSO run-time system, consisting of an administration layer, a BO interface, and a communication layer on top of Message Passing)

The access of an application to a DS² is handled by the PSO administration layer. This layer determines the target BO of the current access. The access to a specific BO is handled by the communication layer which, if necessary, sends requests to other processing nodes using the underlying Message Passing system. Unlimited scalability is achieved by using distributed administration strategies. All necessary internal information is stored in distributed data bases. PSO is not intended to be a sole programming model; the application can also use the Message Passing system directly.
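To make the cooperation of the two layers concrete, here is a minimal sketch of a client-side read, under assumptions stated here rather than taken from PSO's code: the two caches of the reference objects are plain maps, the functional projection onto the administration node is assumed to be the DSID modulo nproc, and local and remote BO accesses are delegated to assumed primitives. All identifiers are hypothetical.

    #include <map>
    #include <utility>

    // Hypothetical client-side read path of a PSO reference object (sketch).
    struct CharInfo {                    // the static "characteristic information"
        int elemsPerBO;                  // number of elements per BO
        int boOfElement(int i) const { return i / elemsPerBO; }
    };

    extern int myProcId, nproc;                     // assumed to be given
    std::map<int, CharInfo> charCache;              // DSID -> characteristic information
    std::map<std::pair<int, int>, int> boLocCache;  // (DSID, BO) -> node holding the BO

    extern CharInfo fetchCharInfo(int adminNode, int dsid);  // a DS_Get request (assumed)
    extern double localRead(int bo, int i);                  // local BO pool access (assumed)
    extern double remoteRead(int node, int bo, int i);       // remote BO_Read (assumed)

    double readElement(int dsid, int i) {
        if (!charCache.count(dsid))      // cache miss: ask the administration node,
            charCache[dsid] =            // assuming a modulo projection of the DSID
                fetchCharInfo(dsid % nproc, dsid);
        CharInfo info = charCache[dsid]; // static data, cacheable forever
        int bo = info.boOfElement(i);    // which BO holds element i
        int node = boLocCache[{dsid, bo}];  // cached BO position (population omitted)
        return node == myProcId ? localRead(bo, i) : remoteRead(node, bo, i);
    }

The asymmetry is the point: the characteristic information is static and may be cached indefinitely, whereas a cached BO location is merely a hint that the fault-tolerant location mechanism of the run-time system must be able to correct.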

2.1 Precompiler

The concept of data abstraction and operator overloading makes C++ suitable for implementing PSO with a minimum of syntax extensions. The precompiler transforms each declaration using the new storage class specifier shared into a constructor call of the class DSVMem (Data Structure Virtual Memory):

    shared double a[DIM][DIM];

becomes

    DSVMem a(1, sizeof(double), T_DOUBLE, DIM, DIM);

This class serves to handle any access to a single DS². The constructor parameters are evaluated by the precompiler. 1 is the internal unique identifier of the DS², called the DSID (Data Structure ID); the precompiler scans all source files and assigns a unique number to each object of type DSVMem. sizeof(double) is the memory required by one element of array a. T_DOUBLE is a symbolic type identifier referring to a basic or user-defined data type. DIM, DIM are the dimensions of the array.

2.2 Layers of PSO

Administration Layer

Each shared variable or shared array is represented on each processing node by an object of class DSVMem. The constructor call generated by the precompiler (see above) creates a local object that represents array a. First of all, a modulo function f_DS→PID(DSID, nproc), where nproc is the number of available processors, is evaluated to determine the administrating node of the DS². Although each processing node creates a local object a, only the administrating node generates the characteristic information of array a and acts as a server to all other nodes. The administrating node of an object does not change during the lifetime of the DS². The characteristic information contains the DSID, the size of one element in bytes, the dimension(s) of the DS², the number of necessary BOs, the number of elements per BO, and the distribution strategy used.

Fig. 3: Distributing a DS² (the function f_DS→PID maps a DS² to its administrator node; the functions f_DS and f_ID map it to the BOs on the server nodes accessed by the clients)

A function f_DS determines the splitting of the DS² into BOs. A function f_ID provides each BO with a unique ID which has a functional relationship to the ID of the processing node responsible for that BO. In the current implementation the BO is always stored on this node; introducing an administrating node for each BO is nevertheless useful, since in future releases of PSO the BOs will be able to migrate to other nodes. Together, the functions f_DS and f_ID form the distribution strategy of the corresponding DS².

BO Interface

The interface between the communication layer and the administration layer is BO-based. Both layers use BO semantics to communicate with each other, matching the client-server concept. Above the interface, all BOs appear to be stored in one huge pool. Instances of the class DSVMem are clients, whereas the BO interface works as a single server handling all the BOs. The interface can handle the following calls (a possible encoding is sketched after the list):

DS_Get: Demand the characteristic information of a DS² from the specified processing node.
BO_Create: Allocate memory for a BO on the specified processing node.
BO_Read: Fetch a part of a BO from the specified processing node.
BO_Write: Update a part of a BO on the specified processing node.
BO_Lock: Lock a BO on the specified processing node.
BO_UnLock: Unlock a BO on the specified processing node.
BO_Delete: Delete a BO on the specified processing node.
BO_Move: Move a BO from one specified processing node to another specified processing node.
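Seen from the code, the interface above is a small request protocol between the two layers. The following sketch shows one possible encoding as a tagged request record; the actual PSO encoding is not given in the paper, so every detail here is illustrative.

    // Hypothetical encoding of the BO interface calls (sketch).
    enum BOCall {
        DS_Get,                       // fetch the characteristic information of a DS²
        BO_Create, BO_Read, BO_Write,
        BO_Lock, BO_UnLock,
        BO_Delete, BO_Move
    };

    struct BORequest {
        BOCall call;      // which operation to perform
        int    node;      // target processing node
        int    boId;      // BO identifier (unused for DS_Get)
        int    offset;    // first affected byte, for BO_Read / BO_Write
        int    length;    // number of affected bytes
        int    destNode;  // destination node, for BO_Move only
    };

Since every request names its target node explicitly, the layer above the interface can treat the distributed BO pool as one flat server, exactly as described.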

Communication Layer

The communication layer (Fig. 4) is the machine-dependent part of PSO. Serving as a BO server, it translates BO calls from the administration layer into Message Passing commands. Consequently, porting PSO to another Message Passing system only requires a modification of the communication layer. The functionality of this layer is explained by its components.

Two mailboxes store the requests and responses arriving from other processing nodes and implement an asynchronous communication mode within a CSP-based message passing system. Nearly all run-time systems offer synchronous and asynchronous communication models, but they do not use the same syntax; since we want PSO to be as portable as possible, we implemented genuine mailbox communication routines.

The Work List is implemented as a FIFO buffer and serializes the incoming requests. It serves as the command queue of the BO Server. The BO Server is the crucial module of this layer: it removes requests from the Work List and executes the BO calls in interaction with the administration layer on the same node or with a BO Server somewhere in the network. From the BO Server's point of view, all BOs are stored in one distributed BO pool, and the BO Server controls the local part of this pool directly. If a BO is locked (access is temporarily denied), the corresponding requests are queued in the Wait Lock List. The mechanism of locking and unlocking BOs is implemented analogously to the UNIX file locking mechanism: the programmer has to decide whether the access should wait until the BO is unlocked or whether an access-failed signal should be returned.
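This wait-or-fail decision mirrors the blocking and non-blocking variants of UNIX file locking (fcntl with F_SETLKW versus F_SETLK). The following is a minimal sketch of how a lock call with such a policy flag could look; the function performLockCall and the policy names are assumptions of this sketch, not PSO's actual API.

    // Hypothetical BO lock acquisition with a wait/fail policy (sketch).
    enum LockPolicy { LOCK_WAIT, LOCK_FAIL };

    // Assumed primitive: performs one BO_Lock call and returns true on
    // success or false on the access-failed signal.
    extern bool performLockCall(int node, int boId, LockPolicy policy);

    bool lockBO(int node, int boId, LockPolicy policy) {
        // With LOCK_WAIT the remote BO Server parks the request in its
        // Wait Lock List until the BO is unlocked; with LOCK_FAIL it
        // immediately answers a locked BO with an access-failed signal.
        return performLockCall(node, boId, policy);
    }

    // Typical use: try a non-blocking lock first and do other work on failure.
    // if (lockBO(node, bo, LOCK_FAIL)) { /* update the BO, then unlock */ }
    // else                             { /* do useful work, retry later */ }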

All requests to BO Servers on other nodes are queued in the Send List. The Requester takes entries from the Send List and performs the BO calls on other processing nodes. The BO Solver waits for the acknowledgements of pending BO calls and removes the matching entries from the Send List.

Fig. 4: Structure of the communication layer (the mailboxes for incoming requests and responses, the Work List feeding the BO Server, the local part of the Base Object Pool, the Wait Lock List, and the Send List served by the Requester, the Responder, and the BO Solver)

Fig. 5: Access to a single BO (a request passes from the application through the administration layer and the BO interface; the Requester forwards it to the BO Server and Responder in the context of another processor)
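The cooperation of the Requester and the BO Solver amounts to a classic pending-request table. A minimal sketch under assumed data structures; the sequence numbers, the transport routine, and all names are hypothetical:

    #include <list>

    // Hypothetical entry of the Send List (sketch).
    struct PendingCall {
        int  seqNo;    // matched against the acknowledgement later
        int  node;     // remote BO Server to contact
        int  call;     // BO_Read, BO_Write, ...
        bool sent;     // set by the Requester after transmission
    };

    std::list<PendingCall> sendList;

    extern void transmit(const PendingCall& p);  // Message Passing send (assumed)

    // Requester: transmit all not-yet-sent calls to the other nodes.
    void requesterStep() {
        for (PendingCall& p : sendList)
            if (!p.sent) { transmit(p); p.sent = true; }
    }

    // BO Solver: an incoming acknowledgement removes the matching entry.
    void solverStep(int ackSeqNo) {
        sendList.remove_if([&](const PendingCall& p) { return p.seqNo == ackSeqNo; });
    }

An entry thus lives in the Send List from the moment the administration layer issues a remote call until the corresponding acknowledgement arrives, as described above.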

3 Access Objects

In order to improve the data access performance, acceleration methods have been developed. One method is the use of access objects [5], which act as agents between the application program and the distributed data structures. The basic concept of access objects is to aggregate data from the distributed data structures in local memory before the data is required by the application program, and to write data aggregated in local memory back to the distributed data structures. This is possible for applications with precalculable access patterns, for example matrix computation algorithms, which often have structured access patterns.

The basic functionality of an access object is that the application program can access the locally aggregated data and can request a data exchange between the locally aggregated data and the distributed data structure (Fig. 6). This data exchange is performed asynchronously by the access object. There are two basic types of access objects, buffers and queues. Buffers are intended for data that are used several times (matrix multiplication, for example) and for applications that allow the precalculation of the required subset of the distributed data structure but not of the access sequence (quicksort, for example). Queues are intended for data that are required only once (vector scalar product) and for data which may be processed by the algorithm in an unspecified sequence (e.g. the sum of a row or column of a matrix). A short usage sketch of both types follows Fig. 6.

Fig. 6: Basic Functionality of the Access Objects (application threads on the processing nodes access the distributed arrays through access objects)
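The distinction between the two types can be made concrete with a small usage sketch. The class and method names follow those of the example in Section 4 below (AccessBuffer, AccessReadQueue, AccessSelector, Associate, Select, Pop); the concrete selector arguments and the surrounding declarations are assumptions of this sketch.

    // Usage sketch; a, b, myProcId, and nproc as in the example of Section 4.
    AccessSelector S;                        // selects substructures

    AccessBuffer<float> A;                   // buffer: random, repeated access
    A.Associate(a);                          // bind to the distributed array a
    A.Select(S.Rows(myProcId, 10, nproc));   // precalculated subset: 10 rows
    A.ReadAll();                             // aggregate them locally, asynchronously
    float x = A(3, 5) * A(3, 5);             // elements may be read again and again

    AccessReadQueue<float> B;                // queue: single-use, streaming access
    B.Associate(b);
    B.SelectAndRead(S.Column(0));            // stream one column of b
    float y = B.Pop();                       // each element is delivered exactly once

The choice between the two types thus depends only on whether the algorithm re-reads elements, not on where the data lives.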

This concept provides some advantages over the usual acceleration methods of Virtual Shared Memory systems:

Reduced Overhead for Providing Coherence

Virtual Shared Memory systems have to provide cache coherence at all times during program execution, which results in a huge overhead. Access objects reduce the overhead for coherence, because the coherence requirements may be reduced: the application program specifies of which elements copies may exist, and therefore it may also specify where coherence of the copies is required and where it is not.

Hiding Access Latency

The access latency of non-local elements of distributed data structures may be hidden using access objects, because access objects allow the simultaneous execution of many accesses to non-local data without requiring more than one application program thread per processor. In other words, the application problem does not have to be partitioned into more threads than processors are used. This reduces the programming expense and the task switching overhead.

Prefetch

Using access objects, the application program itself specifies which elements of the distributed data structures have to be prefetched. This avoids prefetching needless data.

4 Examples and Results using Access Objects

The example below shows a parallel matrix multiplication. In order to keep the example simple, the algorithm itself is not optimized with regard to the locality of the data access patterns; it is intended to show the possibilities of using access objects for a given algorithm. The program code is executed on each processor. The memory of a single processing node is supposed to be too small to hold a full copy of matrix a or b.

4.1 Implementation using PSO without Access Objects

    // Declaration of distributed arrays.
    // The keyword shared is a syntax extension of PSO.
    shared float a[1000][1000];
    shared float b[1000][1000];
    shared float c[1000][1000];

    extern int nproc;  // Number of processors
    extern int PID;    // Unique processor ID; range [0, nproc-1]

    for (int i = PID; i < 1000; i += nproc)
        for (int j = 0; j < 1000; j++) {
            float sum = 0;
            for (int k = 0; k < 1000; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
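A back-of-the-envelope estimate, not taken from the paper, shows why this naive version is latency-bound: each processor computes about 1000/nproc rows of c, and each of its 1000 * 1000 * 1000/nproc multiply-add steps reads one element of a and one element of b, almost all of which reside in remote BOs. With a per-access latency t_lat, the run time is therefore on the order of (2 * 10^9 / nproc) * t_lat if the accesses are performed one at a time, which dwarfs the arithmetic cost. The access objects in the next version attack exactly this factor.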

4.2 Implementation using PSO and Access Objects

    shared float a[1000][1000];
    shared float b[1000][1000];
    shared float c[1000][1000];

    AccessBuffer<float> A;      // A buffer to access array a
    AccessReadQueue<float> B;   // A queue to read data from array b
    AccessWriteQueue<float> C;  // A queue to write the results to array c
    AccessSelector S;           // Required to select substructures

    // Calculate the number of rows of array c to be computed by the local processor.
    int nrows = 1000 / nproc + (1000 % nproc > PID ? 1 : 0);

    // Load all required elements of array a asynchronously into the local buffer area of A.
    A.Associate(a);
    A.Select(S.Rows(myProcId, nrows, nproc));
    A.ReadAll();

    // The local memory of the processing node is too small to hold a complete copy
    // of the distributed array b; the data are therefore read for each column
    // separately, so only the association is required here.
    B.Associate(b);

    // Associate the queue C with the distributed array c and select those rows of c
    // which are computed on the local processor.
    C.Associate(c);
    C.Select(S.Rows(myProcId, nrows, nproc));

    for (int i = 0; i < nrows; i++)
        for (int j = 0; j < 1000; j++) {
            float sum = 0;
            // Start to read a column.
            B.SelectAndRead(S.Column(j));
            for (int k = 0; k < 1000; k++)
                // Get one element from the local buffer area of A
                // and one element from the front of B. Wait if reading the
                // required element of a or b has not been finished yet.
                sum += A(i, k) * B.Pop();
            // Write the result asynchronously to the distributed array c.
            C.PushAndWrite(sum);
        }

4.3 Results of Using the Access Objects

In the example program shown above, the use of access objects yields the following advantages:

- The required elements of the distributed array a are read into local memory only once, and all of them are requested simultaneously. Thus the communication bandwidth instead of the communication latency becomes the major performance factor.
- The elements of the distributed array b are read by requesting a full column. The program may continue execution after the first element has been read; the succeeding elements are read asynchronously. If the communication bandwidth is big enough, this may reduce the influence of latency by a factor which equals the number of elements in a column (in this case 1000). A variant that also overlaps successive columns is sketched after this list.
- The computed elements of the distributed array c are written asynchronously, so that their latency becomes unimportant.
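Going one step further than the example, the column reads of b can also be overlapped across column boundaries. The following double-buffered variant is a sketch that reuses the hypothetical reading of the API above; the helper consume() stands for the per-element work and is assumed here, as is the legality of re-selecting a drained queue.

    // Double-buffered column streaming (illustrative variant, not from the paper).
    extern void consume(float x);         // stands for the per-element work

    AccessReadQueue<float> B0, B1;        // two queues on the distributed array b
    B0.Associate(b);
    B1.Associate(b);
    B0.SelectAndRead(S.Column(0));        // prefetch the first column
    for (int j = 0; j < 1000; j++) {
        AccessReadQueue<float>& cur = (j % 2 == 0) ? B0 : B1;
        AccessReadQueue<float>& nxt = (j % 2 == 0) ? B1 : B0;
        if (j + 1 < 1000)
            nxt.SelectAndRead(S.Column(j + 1));  // overlaps with the consumption below
        for (int k = 0; k < 1000; k++)
            consume(cur.Pop());           // Pop only waits if the data lag behind
    }

Whether this pays off depends on the available communication bandwidth; the single-queue version already hides most of the per-element latency.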

5 Conclusions

PSO, a portable software solution providing global distributed data structures on distributed memory systems, has been implemented for the PARIX software environment and tested on two different parallel architectures:

- Parsytec SuperCluster with 128 processing nodes (1 INMOS Transputer 805 per node)
- Parsytec GC/PowerPlus with 64 processing nodes (2 PowerPC 601 processors and 4 INMOS Transputers 805 per node)

In order to achieve an acceptable performance, novel acceleration methods have been designed. Using information provided by the programmer, the access objects hide latency and reduce the communication bandwidth requirement. To this end, the access objects aggregate local copies of data and perform data prefetch and asynchronous write operations; asynchronous arithmetic operations are provided as well. The access objects avoid the major disadvantages of standard acceleration methods: the coherence requirements are reduced, the prefetch of unnecessary data is avoided, and only one application thread per processor is needed. As a further novel acceleration method, self-organizing data structures are planned, which provide an automatic rearrangement of the data distribution at run-time.

PSO is not intended to be a sole programming model, but an enlargement used in addition to Message Passing. Consequently, time-critical parts of a parallel program may be optimized using Message Passing, while the major parts of the program are implemented using shared data structures. This leads to efficient programs combined with a low programming expenditure. For several applications, PSO has been proven to facilitate and shorten the process of program development, including the porting of programs designed for shared memory architectures.

6 References

[1] K. Mani Chandy and Carl Kesselman, The CC++ Language Definition, Technical Report Caltech-CS-TR-92-02, California Institute of Technology, 1992.
[2] Al Geist et al., PVM: Parallel Virtual Machine - A Users' Guide and Tutorial for Networked Parallel Computing, MIT Press, Cambridge, Massachusetts, 1994.
[3] Message Passing Interface Forum (Ed.), MPI: A Message Passing Interface Standard, Message Passing Interface Forum, 1994.
[4] PARSYTEC GmbH, PARIX Reference Manual, PARSYTEC GmbH, 1993.
[5] Stefan Lüpke, Accelerated Access to Shared Distributed Arrays on Distributed Memory Systems by Access Objects, in B. Buchberger and J. Volkert (Eds.), Parallel Processing: CONPAR 94 - VAPP VI, Springer-Verlag, Berlin, 1994.
[6] K. Li, Shared Virtual Memory on Loosely Coupled Multiprocessors, Ph.D. Thesis, Yale University, 1986.
