Mass Storage Support for a Parallelizing Compilation System

Peter Brezany (a), Thomas A. Mueck (b), Erich Schikuta (c)

(a) Institute for Software Technology and Parallel Systems, University of Vienna, Liechtensteinstrasse 22, A-1092 Vienna, Austria
(b) Institute for Computer Science, University of Potsdam, Am Neuen Palais 10, D-14469 Potsdam, Germany
(c) Department of Data Engineering, University of Vienna, Rathausstrasse 19/4, A-1010 Vienna, Austria

Keywords: parallel input/output, high performance mass storage system, high performance languages, compilation techniques, data administration

1 Introduction

Languages like HPF and Vienna Fortran [5] and their compilers have been designed to improve the practical applicability of massively parallel systems. To accelerate the transition of these systems into fully operational environments, it is also necessary to develop appropriate language constructs and software tools supporting application programmers in the development of large I/O intensive applications [1,3].

This paper focuses on mass storage support for the Vienna Fortran Compilation System (VFCS) to enable efficient execution of parallel I/O operations and operations on out-of-core (OOC) structures. The use of OOC structures implies I/O operations: due to main memory constraints, some parts of these data structures (e.g., large arrays) must be swapped to disk.

The approach outlined in this paper is based on two main concepts:

(i) Vienna Fortran language extensions and compilation techniques. We propose constructs to specify OOC structures and I/O operations for distributed data structures in the context of Vienna Fortran. These operations can be used by the programmer to provide information that helps the compiler and the runtime environment to operate the underlying I/O subsystem efficiently.

(ii) Integrated advanced runtime support.
The modules of VFCS that process I/O operations and handle OOC structures are coupled to a mass storage oriented runtime system called VIPIOS (Vienna Parallel I/O System).

The objective of the proposed integrated compile-time and runtime optimizations is to minimize the number of disk accesses for file I/O and OOC processing. A central issue in this context is to increase the main memory buffer hit ratio.

2 Language and Compiler Support

2.1 Processing Explicit I/O Operations

Distributed data structures are stored in the parallel I/O subsystem as parallel files. The file layout may be optimized by VIPIOS to achieve efficient I/O data transfer. In the context of an OPEN or WRITE statement, the user may give a hint to the compilation system that data in the file will be written to or read from an array of a given distribution. The hint specification is provided by a new optional specifier IO_DIST in the OPEN or WRITE statement. An intended distribution (or a class of distributions) can be bound to a name by means of an I/O distribution type definition. Furthermore, an I/O distribution type definition may have arguments to allow parameterization.

    S1: PROCESSORS P1D(64)
    S2: REAL A(40000) DIST (BLOCK) TO P1D

        IO_DTYPE REG1(M,N1,N2,K1,K2)
          TARGET_PROCESSORS P2D(M,M)
          ELM_TYPE REAL
          TARGET_ARRAY A(N1,N2)
          A DIST (CYCLIC(K1), CYCLIC(K2)) TO P2D
        END IO_DTYPE REG1

    O1: OPEN (u1 = 8, FILE = 'exam1.dat', MODE = 'PF', STATUS = 'NEW')
        WRITE (u1, IO_DIST = REG1(8,400,100,4,2)) A
        ...

Fig. 1. Opening and Writing to a Parallel File: Examples

According to line O1 in Fig. 1, unit u1 is connected to the parallel file 'exam1.dat'. The elements of the distributed array A (its BLOCK distribution onto a one-dimensional processor array is specified in lines S1 and S2) are written to this file so as to optimize reading them into real arrays that have the shape (400,100) and are distributed as (CYCLIC(4), CYCLIC(2)) onto a grid of processors having the shape (8,8). This I/O distribution can be changed by a REORGANIZE statement.

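
To make the two distributions in Fig. 1 concrete, the following minimal sketch computes which processor owns a given array element. It is an illustration only, not VFCS code: the function names are hypothetical, indices are 0-based (Fortran arrays are 1-based), and BLOCK is assumed to assign contiguous blocks of ceil(n/p) elements per processor.

```python
def block_owner(i, n, p):
    """Processor owning element i (0-based) of an n-element array under BLOCK.

    Assumption: each processor holds one contiguous block of ceil(n / p) elements.
    """
    block = -(-n // p)  # ceil(n / p) using integer arithmetic
    return i // block


def cyclic_owner(i, k, p):
    """Processor owning element i (0-based) under CYCLIC(k) on p processors.

    Chunks of k consecutive elements are dealt out to the processors round-robin.
    """
    return (i // k) % p


def reg1_owner(i, j, m=8, k1=4, k2=2):
    """Grid coordinates of the owner of element (i, j) under the layout
    REG1(8,400,100,4,2): (CYCLIC(4), CYCLIC(2)) onto an m x m processor grid."""
    return (cyclic_owner(i, k1, m), cyclic_owner(j, k2, m))


# Array A(40000) distributed BLOCK onto P1D(64), as in lines S1 and S2.
print(block_owner(0, 40000, 64), block_owner(39999, 40000, 64))

# The same data viewed through the target layout used in the WRITE statement.
print(reg1_owner(0, 0), reg1_owner(37, 5))
```

The point of the hint is visible here: the processor that writes an element under BLOCK is generally not the one that will later read it under the cyclic layout, so the file layout can be chosen with the reader in mind.
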
At compile time, the translation of parallel I/O operations conceptually consists of two phases: basic compilation, which extracts parameters about data distributions and file access patterns from the VF program and passes this information to the VIPIOS primitives, and advanced optimizations, which include code restructuring based on program analysis.

2.2 Processing Out-of-Core Programs

From the scientific application programmer's point of view, there are no significant differences between the proposed OOC programming model and the traditional in-core model (1). The goal is to preserve for the programmer the model of unlimited main memory. It is assumed that an in-core version of the VF program is converted into the appropriate OOC VF form.

[Figure 2 labels: Vienna Fortran In-Core Program; Hardware and System Software Parameters; Application Programmer; Vienna Fortran Out-of-Core Program]

Fig. 2. Out-of-core Programming Model

Figure 2 sketches how the conversion can be done. The programmer analyzes the VF program and predicts the memory requirements of the program after its parallelization. If there is not enough main memory for in-core (IC) execution on the given target architecture, the programmer annotates those large data structures that have to be processed using OOC techniques. The programmer's decision is also based on knowledge of the target system's hardware (memory capacity) and software (memory requirements).

All computations are performed on the data in local main memories. VFCS restructures the source out-of-core program in such a way that, during computation, sections of the array are fetched from disks into the local memory, the new values are computed, and the updated sections are written back to disks if necessary. The computation is performed in phases, where each phase operates on a different part of the array called a slab. Loop iterations are partitioned so that data of a fixed slab size can be processed in each phase. Each local main memory has access to the individual slabs through a "window" referred to as the in-core portion of the array. VFCS obtains the information about which arrays are out-of-core, and about the shape and size of the corresponding in-core portions, from OOC annotations.

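
The slab-by-slab execution described above can be sketched as follows. This is a single-process illustration under simplifying assumptions, not the code VFCS generates: the array lives in an ordinary binary file, the in-core "window" is a Python array of at most slab_size elements, and the names process_out_of_core and demo are hypothetical.

```python
import os
import tempfile
from array import array


def process_out_of_core(path, n_elements, slab_size, update):
    """Apply `update` to every element of a file-resident array of doubles,
    keeping at most `slab_size` elements in memory at any time.

    Returns the number of slabs (phases) that were processed.
    """
    item = array("d").itemsize
    slabs = 0
    with open(path, "r+b") as f:
        for start in range(0, n_elements, slab_size):
            count = min(slab_size, n_elements - start)
            window = array("d")            # the in-core portion
            f.seek(start * item)
            window.fromfile(f, count)      # fetch one slab from disk
            for i in range(count):         # compute on data in main memory
                window[i] = update(window[i])
            f.seek(start * item)
            window.tofile(f)               # write the updated slab back
            slabs += 1
    return slabs


def demo():
    """Double every element of a 10-element array using slabs of 4 elements."""
    n, slab = 10, 4
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "wb") as f:
            array("d", range(n)).tofile(f)
        slabs = process_out_of_core(path, n, slab, lambda x: 2.0 * x)
        with open(path, "rb") as f:
            result = array("d")
            result.fromfile(f, n)
        return slabs, list(result)
    finally:
        os.remove(path)


print(demo())
```

A larger slab_size means fewer phases and therefore fewer disk accesses, at the cost of a larger in-core portion.
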
The OOC array annotation is of the following form:

    REAL ad_1, ..., ad_r dist_spec, OUT_OF_CORE [, IN_MEM (ic_portion)]

where ad_i, 1 ≤ i ≤ r, specify array identifiers B_i and their index domains, and dist_spec represents a Vienna Fortran distribution-specification annotation. The keyword OUT_OF_CORE indicates that all B_i are out-of-core arrays. In the optional part, the keyword IN_MEM indicates that only the array portions corresponding to ic_portion are allowed to be kept in memory. The larger the in-core portion the better, as it reduces the number of disk accesses.

(1) Note that when developing in-core programs the programmer has to specify only the data distribution and, in some cases, also the work distribution.

The process of transforming a Vienna Fortran out-of-core program into the out-of-core SPMD program can be conceptually divided into five major steps:

(i) Distribution of each out-of-core array among the available processors. Array elements that are assigned to a processor according to the data distribution are initially stored on disks. Further, the resulting mapping determines the work distribution. Based on the IN_MEM specification, memory for in-core array portions is allocated.

(ii) Distribution of the computation among the processors. The work distribution step determines for each processor the execution set, i.e., the set of loop iterations to be executed by this processor. The main criterion is to operate on data associated with the "nearest" disks and to optimize the load balance. In most cases the "owner-computes rule" is applied: the processor that owns the data element updated in an iteration performs the computation.

(iii) Splitting execution sets into tiles. The computation assigned to a processor is performed in stages called tiles, where each stage operates on a different slab. Loop iterations are partitioned so that one slab can be processed in each phase.

(iv) Insertion of I/O and communication statements. Depending on the data and work distribution, the compiler determines whether the data needed is in the local or a remote in-core portion or on a disk, and then determines the type of communication and I/O operation required.

(v) Generation of a Section Access Graph (SAG) as support for efficient software-controlled prefetching [4]. I/O latency can be partially reduced by executing prefetch operations that move data close to the processor before it is actually needed. In our approach, the compile-time knowledge about the I/O requirements of the program parts is represented by the SAG. This graph is incrementally constructed in the program database of VFCS during the compilation process and written to a file at its end. The SAG is used by VIPIOS to optimize prefetching.
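
Steps (ii) and (iii) can be illustrated with a small sketch. It assumes a one-dimensional BLOCK distribution with blocks of ceil(n/nprocs) elements, 0-based indices, and a loop whose iteration i updates element i; the function names are hypothetical and do not come from VFCS.

```python
def execution_set(p, n, nprocs):
    """Iterations executed by processor p under the owner-computes rule,
    for an n-element array distributed BLOCK over nprocs processors."""
    block = -(-n // nprocs)        # ceil(n / nprocs)
    lo = min(p * block, n)
    hi = min(lo + block, n)
    return range(lo, hi)


def split_into_tiles(iterations, slab_size):
    """Split an execution set into tiles of at most slab_size iterations,
    so that each tile touches exactly one slab of the local array part."""
    return [iterations[s:s + slab_size]
            for s in range(0, len(iterations), slab_size)]


# Processor 1 of 64, array of 40000 elements, in-core portion of 256 elements.
for tile in split_into_tiles(execution_set(1, 40000, 64), 256):
    print(tile.start, tile.stop)
```

In step (iv), each tile would then be bracketed by the read of its slab and, if the slab was modified, the write-back.
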

3 Advanced Runtime Support

The goal of the proposed advanced runtime system is to provide an efficient parallel mass storage I/O framework [2] for parallel I/O operations and out-of-core data structures of the VFCS. The central component of the framework is a novel runtime module referred to as VIPIOS (VIenna Parallel Input/Output System).

The framework distinguishes between two types of processes: application processes and VIPIOS servers. The application processes are created by the VFCS. According to the SPMD paradigm, each processor executes the same program on different parts of the data space. The VIPIOS servers run independently on all nodes or on a number of dedicated nodes and serve the data requests of the application processes. The number and the location of the VIPIOS servers are defined during the VIPIOS system start-up phase, which is generally part of the boot process of the machine. The default configuration is based on the properties of the hardware system. At runtime, a VIPIOS supervisor server process, which administrates all other VIPIOS processes, can change the configuration according to the requirements of the application processes. In summary, the configuration depends on the underlying hardware architecture (disk arrays, local disks, specialized I/O nodes, etc.), the system configuration (number and types of available nodes, etc.), the VIPIOS system administration (number of serviced nodes, disks, application processes, etc.), and user needs (I/O characteristics, regular or irregular problems, etc.).

The VIPIOS servers are similar to data server processes in database systems. Exactly one VIPIOS server is assigned to each application process and handles its data requests, but one VIPIOS server can serve a number of application processes. In other words, one-to-one or one-to-many relationships exist between servers and application processes. For each application process, all data requests are transparently caught by the assigned VIPIOS processes.
These processes access locally or remotely stored data and ensure that each application process has access to its requested data items.

The VFCS provides information about the problem-specific data distribution, the stride size of the slabs of the out-of-core data structures, and the presumed data access profile. Based on this information, the VIPIOS organizes the data and tries to ensure high performance for data access operations. Additional data distribution and usage information can be provided by the Vienna Fortran programmer using new language constructs. This type of information allows the VFCS/VIPIOS system to parallelize read and write operations by selecting a well-suited data organization in the files.

An important advantage of the proposed framework is its support for a wide spectrum of mass storage architectures, e.g., global disk systems connected via a fast bus (like HIPPI) or local disks connected directly to nodes. In any case, the architecture is transparent to the application programmer as well as to the VF compiler developer.

3.1 Data Locality

The design principle by which VIPIOS achieves high data access performance is data locality. This means that the data requested by an application process should be read from or written to the best-suited disk. Generally, the choice of the disks, and of the servers administrating them, is based on the data distribution of the application problem. We distinguish between logical and physical data locality.

Logical data locality means choosing the best-suited VIPIOS server for an application process. This server is defined by the topological distance and/or the process characteristics. Special process characteristics, such as available memory or the best disk list (see the next paragraph), can also influence the VIPIOS server performance. Therefore, a remote VIPIOS server may provide better performance than a closer one. In any case, only one specific VIPIOS server is chosen for each application process, and it handles the respective requests. This server is called the buddy server of the process, while all other servers are called foe servers of this process.

The physical data locality principle aims to determine the disk set providing the best (mostly the fastest) data access. For each node, an ordered sequence of the accessible disks of the system is defined (the best disk list, BDL), which orders the disks according to their access behavior. Disks with good access characteristics precede disks with poor ones in this list. The order can be defined by technical disk characteristics, such as seek time and transfer rate, and/or by the location in the system architecture. The VIPIOS server thus chooses from the BDL the actual disk administrating the data of a specific application process. In most cases it will choose the disk(s) according to both the BDL of the node it is executing on and the physical restrictions of the disks (memory requirements, workload, etc.).

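
The two locality principles can be sketched as follows. The ranking metric, the field names, and the server names are assumptions made for this illustration; the paper does not prescribe a concrete metric. Here the buddy server is simply the server with the smallest topological distance, and the BDL orders disks by seek time and then by transfer rate.

```python
from dataclasses import dataclass


@dataclass
class Disk:
    name: str
    seek_ms: float      # average seek time in milliseconds
    rate_mb_s: float    # transfer rate in MB/s
    free_mb: int        # remaining capacity in MB


def best_disk_list(disks):
    """Order disks from best to worst access behavior.

    Hypothetical metric: lower seek time first, ties broken by higher rate.
    """
    return sorted(disks, key=lambda d: (d.seek_ms, -d.rate_mb_s))


def choose_disk(bdl, required_mb):
    """Physical locality: first disk in the BDL that meets the space restriction."""
    for disk in bdl:
        if disk.free_mb >= required_mb:
            return disk
    return None


def choose_buddy(distances):
    """Logical locality: server with the smallest topological distance."""
    return min(distances, key=distances.get)


disks = [Disk("Disk1", 9.0, 20.0, 100),
         Disk("Disk2", 12.0, 10.0, 4000),
         Disk("Disk3", 9.0, 40.0, 50)]
bdl = best_disk_list(disks)
print([d.name for d in bdl])
print(choose_disk(bdl, 500).name)
print(choose_buddy({"server_b": 0, "server_f": 2}))
```

The example shows why the BDL alone is not sufficient: the two fastest disks are rejected because they lack space, so the slowest disk is chosen. A fuller model would also weigh server characteristics such as available memory, which is why a remote server can be preferable to a closer one.
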
Other criteria that are not hardware oriented, such as the size of the stored data structure or data security, may also influence this decision.

[Figure 3: three nodes (Node 1, Node 2, Node 3), each running an application process (AP) linked with the VIPIOS call interface (VI); VIPIOS server b is buddy to application process x, VIPIOS server f is foe to x; disks Disk1, Disk2, Disk3 and the best disk list (BDL) of each node]

Fig. 3. Process model of application processes and VIPIOS servers

The process model is depicted in Figure 3. The VIPIOS call interface VI, which is linked with the application process AP, handles the communication with the assigned VIPIOS server VS.

3.2 Two-Phase Data Administration Process

The data administration process of a VIPIOS server can be divided into two phases, the preparation phase and the administration phase (see Figure 4).

The preparation phase prepares the administrated data according to the data layout of the data structure, the presumed access profile, and the physical restrictions of the system (available main memory, disk space, etc.). This phase is performed during the compilation process and the costly system start-up phase, and it precedes the execution of the application process. In this phase the physical data layout schemas are defined, and the actual VIPIOS server process for the application process and the disks for the stored data are chosen according to the locality principles. Further, the data storage areas are prepared, the necessary main memory buffers are allocated, etc.

The administration phase serves the I/O requests of the application processes. The preparation phase is therefore the basis for good I/O performance: all optimizations are performed in this phase.

[Figure 4 labels: Compilation and Start-Up: Vienna Fortran Program, VIPIOS preparation phase; Execution: Executing OOC Program, VIPIOS administration phase]

Fig. 4. Two-phase data administration process

4 Conclusions

As mentioned in the preceding sections, high performance languages generally lack efficient parallel I/O support. A possible approach is the development of an integrated runtime subsystem that is optimized for HPF language systems. As a main goal, physical data distributions should adapt to the requirements of the problem characteristics specified in the application program.

References

[1] R.R. Bordawekar, A.N. Choudhary, Language and Compiler Support for Parallel I/O, Proc. IFIP Working Conf. Prog. Env. for Massively Parallel Dist. Systems (Swiss, 1994)
[2] P. Brezany, T.A. Mueck, E. Schikuta, Language, Compiler and Parallel Database Support for I/O Intensive Applications, Proc. High Performance Computing and Networking 1995 Europe (Milano, 1995) 14-20
[3] D. Kotz, Disk-Directed I/O for MIMD Multiprocessors, Proc. First USENIX Symp. on Operating Systems Design and Implementation (Monterey, CA, 1994) 61-74
[4] R.H. Patterson, G.A. Gibson, E. Ginting, D. Stodolsky, J. Zelenka, Informed Prefetching and Caching, Tech. Rep. Carnegie Mellon Univ., CMU-CS (1995)
[5] H. Zima, P. Brezany, B. Chapman, P. Mehrotra, A. Schwald, Vienna Fortran - a language specification, ACPC Technical Report Series, University of Vienna (1992), also available as ICASE Interim Report 21, MS 132c, NASA, Hampton VA 23681

DBMS Environment. Application Running in DMS. Source of data. Utilization of data. Standard files. Parallel files. Input. File. Output.

DBMS Environment. Application Running in DMS. Source of data. Utilization of data. Standard files. Parallel files. Input. File. Output. Language, Compiler and Parallel Database Support for I/O Intensive Applications? Peter Brezany a, Thomas A. Mueck b and Erich Schikuta b University of Vienna a Inst. for Softw. Technology and Parallel

More information

Language and Compiler Support for Out-of-Core Irregular Applications on Distributed-Memory Multiprocessors

Language and Compiler Support for Out-of-Core Irregular Applications on Distributed-Memory Multiprocessors Language and Compiler Support for Out-of-Core Irregular Applications on Distributed-Memory Multiprocessors Peter Brezany 1, Alok Choudhary 2, and Minh Dang 1 1 Institute for Software Technology and Parallel

More information

PASSION Runtime Library for Parallel I/O. Rajeev Thakur Rajesh Bordawekar Alok Choudhary. Ravi Ponnusamy Tarvinder Singh

PASSION Runtime Library for Parallel I/O. Rajeev Thakur Rajesh Bordawekar Alok Choudhary. Ravi Ponnusamy Tarvinder Singh Scalable Parallel Libraries Conference, Oct. 1994 PASSION Runtime Library for Parallel I/O Rajeev Thakur Rajesh Bordawekar Alok Choudhary Ravi Ponnusamy Tarvinder Singh Dept. of Electrical and Computer

More information

Data Access Reorganizations in Compiling Out-of-Core Data Parallel Programs on Distributed Memory Machines

Data Access Reorganizations in Compiling Out-of-Core Data Parallel Programs on Distributed Memory Machines 1063 7133/97 $10 1997 IEEE Proceedings of the 11th International Parallel Processing Symposium (IPPS '97) 1063-7133/97 $10 1997 IEEE Data Access Reorganizations in Compiling Out-of-Core Data Parallel Programs

More information

ViPIOS VIenna Parallel Input Output System

ViPIOS VIenna Parallel Input Output System arxiv:1808.166v1 [cs.dc] 3 Aug 28 ViPIOS VIenna Parallel Input Output System Language, Compiler and Advanced Data Structure Support for Parallel I/O Operations Project Deliverable Partially funded by FWF

More information

DYNAMIC DATA DISTRIBUTIONS IN VIENNA FORTRAN. Hans Zima a. Institute for Software Technology and Parallel Systems,

DYNAMIC DATA DISTRIBUTIONS IN VIENNA FORTRAN. Hans Zima a. Institute for Software Technology and Parallel Systems, DYNAMIC DATA DISTRIBUTIONS IN VIENNA FORTRAN Barbara Chapman a Piyush Mehrotra b Hans Moritsch a Hans Zima a a Institute for Software Technology and Parallel Systems, University of Vienna, Brunner Strasse

More information

Compiler and Runtime Support for Programming in Adaptive. Parallel Environments 1. Guy Edjlali, Gagan Agrawal, and Joel Saltz

Compiler and Runtime Support for Programming in Adaptive. Parallel Environments 1. Guy Edjlali, Gagan Agrawal, and Joel Saltz Compiler and Runtime Support for Programming in Adaptive Parallel Environments 1 Guy Edjlali, Gagan Agrawal, Alan Sussman, Jim Humphries, and Joel Saltz UMIACS and Dept. of Computer Science University

More information

Technische Universitat Munchen. Institut fur Informatik. D Munchen.

Technische Universitat Munchen. Institut fur Informatik. D Munchen. Developing Applications for Multicomputer Systems on Workstation Clusters Georg Stellner, Arndt Bode, Stefan Lamberts and Thomas Ludwig? Technische Universitat Munchen Institut fur Informatik Lehrstuhl

More information

proposed. In Sect. 3, the environment used for the automatic generation of data parallel programs is briey described, together with the proler tool pr

proposed. In Sect. 3, the environment used for the automatic generation of data parallel programs is briey described, together with the proler tool pr Performance Evaluation of Automatically Generated Data-Parallel Programs L. Massari Y. Maheo DIS IRISA Universita dipavia Campus de Beaulieu via Ferrata 1 Avenue du General Leclerc 27100 Pavia, ITALIA

More information

Ecient Implementation of Sorting Algorithms on Asynchronous Distributed-Memory Machines

Ecient Implementation of Sorting Algorithms on Asynchronous Distributed-Memory Machines Ecient Implementation of Sorting Algorithms on Asynchronous Distributed-Memory Machines Zhou B. B., Brent R. P. and Tridgell A. y Computer Sciences Laboratory The Australian National University Canberra,

More information

Ecient Implementation of Sorting Algorithms on Asynchronous Distributed-Memory Machines

Ecient Implementation of Sorting Algorithms on Asynchronous Distributed-Memory Machines Ecient Implementation of Sorting Algorithms on Asynchronous Distributed-Memory Machines B. B. Zhou, R. P. Brent and A. Tridgell Computer Sciences Laboratory The Australian National University Canberra,

More information

Compiling FORTRAN for Massively Parallel Architectures. Peter Brezany. University of Vienna

Compiling FORTRAN for Massively Parallel Architectures. Peter Brezany. University of Vienna Compiling FORTRAN for Massively Parallel Architectures Peter Brezany University of Vienna Institute for Software Technology and Parallel Systems Brunnerstrasse 72, A-1210 Vienna, Austria 1 Introduction

More information

clients (compute nodes) servers (I/O nodes)

clients (compute nodes) servers (I/O nodes) Parallel I/O on Networks of Workstations: Performance Improvement by Careful Placement of I/O Servers Yong Cho 1, Marianne Winslett 1, Szu-wen Kuo 1, Ying Chen, Jonghyun Lee 1, Krishna Motukuri 1 1 Department

More information

160 M. Nadjarbashi, S.M. Fakhraie and A. Kaviani Figure 2. LUTB structure. each block-level track can be arbitrarily connected to each of 16 4-LUT inp

160 M. Nadjarbashi, S.M. Fakhraie and A. Kaviani Figure 2. LUTB structure. each block-level track can be arbitrarily connected to each of 16 4-LUT inp Scientia Iranica, Vol. 11, No. 3, pp 159{164 c Sharif University of Technology, July 2004 On Routing Architecture for Hybrid FPGA M. Nadjarbashi, S.M. Fakhraie 1 and A. Kaviani 2 In this paper, the routing

More information

Optimizing Irregular HPF Applications Using Halos Siegfried Benkner C&C Research Laboratories NEC Europe Ltd. Rathausallee 10, D St. Augustin, G

Optimizing Irregular HPF Applications Using Halos Siegfried Benkner C&C Research Laboratories NEC Europe Ltd. Rathausallee 10, D St. Augustin, G Optimizing Irregular HPF Applications Using Halos Siegfried Benkner C&C Research Laboratories NEC Europe Ltd. Rathausallee 10, D-53757 St. Augustin, Germany Abstract. This paper presents language features

More information

director executor user program user program signal, breakpoint function call communication channel client library directing server

director executor user program user program signal, breakpoint function call communication channel client library directing server (appeared in Computing Systems, Vol. 8, 2, pp.107-134, MIT Press, Spring 1995.) The Dynascope Directing Server: Design and Implementation 1 Rok Sosic School of Computing and Information Technology Grith

More information

Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System

Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System Implementation and Evaluation of Prefetching in the Intel Paragon Parallel File System Meenakshi Arunachalam Alok Choudhary Brad Rullman y ECE and CIS Link Hall Syracuse University Syracuse, NY 344 E-mail:

More information

The Compositional C++ Language. Denition. Abstract. This document gives a concise denition of the syntax and semantics

The Compositional C++ Language. Denition. Abstract. This document gives a concise denition of the syntax and semantics The Compositional C++ Language Denition Peter Carlin Mani Chandy Carl Kesselman March 12, 1993 Revision 0.95 3/12/93, Comments welcome. Abstract This document gives a concise denition of the syntax and

More information

On Estimating the Useful Work Distribution of. Thomas Fahringer. University of Vienna. Abstract

On Estimating the Useful Work Distribution of. Thomas Fahringer. University of Vienna. Abstract On Estimating the Useful Work Distribution of Parallel Programs under the P 3 T: A Static Performance Estimator Thomas Fahringer Institute for Software Technology and Parallel Systems University of Vienna

More information



More information

SVM Support in the Vienna Fortran Compilation System. Michael Gerndt. Research Centre Julich(KFA)

SVM Support in the Vienna Fortran Compilation System. Michael Gerndt. Research Centre Julich(KFA) SVM Support in the Vienna Fortran Compilation System Peter Brezany University of Vienna Michael Gerndt Research Centre Julich(KFA) Viera Sipkova University

More information

Memory hierarchy. 1. Module structure. 2. Basic cache memory. J. Daniel García Sánchez (coordinator) David Expósito Singh Javier García Blas

Memory hierarchy. 1. Module structure. 2. Basic cache memory. J. Daniel García Sánchez (coordinator) David Expósito Singh Javier García Blas Memory hierarchy J. Daniel García Sánchez (coordinator) David Expósito Singh Javier García Blas Computer Architecture ARCOS Group Computer Science and Engineering Department University Carlos III of Madrid

More information

FORSCHUNGSZENTRUM J ULICH GmbH Zentralinstitut f ur Angewandte Mathematik D J ulich, Tel. (02461)

FORSCHUNGSZENTRUM J ULICH GmbH Zentralinstitut f ur Angewandte Mathematik D J ulich, Tel. (02461) FORSCHUNGSZENTRUM J ULICH GmbH Zentralinstitut f ur Angewandte Mathematik D-52425 J ulich, Tel. (02461) 61-6402 Interner Bericht SVM Support in the Vienna Fortran Compiling System Peter Brezany*, Michael

More information

Network. Department of Statistics. University of California, Berkeley. January, Abstract

Network. Department of Statistics. University of California, Berkeley. January, Abstract Parallelizing CART Using a Workstation Network Phil Spector Leo Breiman Department of Statistics University of California, Berkeley January, 1995 Abstract The CART (Classication and Regression Trees) program,

More information

Real-Time Scalability of Nested Spin Locks. Hiroaki Takada and Ken Sakamura. Faculty of Science, University of Tokyo

Real-Time Scalability of Nested Spin Locks. Hiroaki Takada and Ken Sakamura. Faculty of Science, University of Tokyo Real-Time Scalability of Nested Spin Locks Hiroaki Takada and Ken Sakamura Department of Information Science, Faculty of Science, University of Tokyo 7-3-1, Hongo, Bunkyo-ku, Tokyo 113, Japan Abstract

More information

on Current and Future Architectures Purdue University January 20, 1997 Abstract

on Current and Future Architectures Purdue University January 20, 1997 Abstract Performance Forecasting: Characterization of Applications on Current and Future Architectures Brian Armstrong Rudolf Eigenmann Purdue University January 20, 1997 Abstract A common approach to studying

More information

Tarek S. Abdelrahman and Thomas N. Wong. University oftoronto. Toronto, Ontario, M5S 1A4. Canada

Tarek S. Abdelrahman and Thomas N. Wong. University oftoronto. Toronto, Ontario, M5S 1A4. Canada Distributed Array Data Management on NUMA Multiprocessors Tarek S. Abdelrahman and Thomas N. Wong Department of Electrical and Computer Engineering University oftoronto Toronto, Ontario, M5S 1A Canada

More information

1 Introduction Data format converters (DFCs) are used to permute the data from one format to another in signal processing and image processing applica

1 Introduction Data format converters (DFCs) are used to permute the data from one format to another in signal processing and image processing applica A New Register Allocation Scheme for Low Power Data Format Converters Kala Srivatsan, Chaitali Chakrabarti Lori E. Lucke Department of Electrical Engineering Minnetronix, Inc. Arizona State University

More information

Comparing SIMD and MIMD Programming Modes Ravikanth Ganesan, Kannan Govindarajan, and Min-You Wu Department of Computer Science State University of Ne

Comparing SIMD and MIMD Programming Modes Ravikanth Ganesan, Kannan Govindarajan, and Min-You Wu Department of Computer Science State University of Ne Comparing SIMD and MIMD Programming Modes Ravikanth Ganesan, Kannan Govindarajan, and Min-You Wu Department of Computer Science State University of New York Bualo, NY 14260 Abstract The Connection Machine

More information

Compilation Issues for High Performance Computers: A Comparative. Overview of a General Model and the Unied Model. Brian J.

Compilation Issues for High Performance Computers: A Comparative. Overview of a General Model and the Unied Model. Brian J. Compilation Issues for High Performance Computers: A Comparative Overview of a General Model and the Unied Model Abstract This paper presents a comparison of two models suitable for use in a compiler for

More information

Khoral Research, Inc. Khoros is a powerful, integrated system which allows users to perform a variety
