Mass Storage Support for a Parallelizing Compilation System

Peter Brezany (a), Thomas A. Mueck (b), Erich Schikuta (c)

(a) Institute for Software Technology and Parallel Systems, University of Vienna, Liechtensteinstrasse 22, A-1092 Vienna, Austria, brezany@par.univie.ac.at
(b) Institute for Computer Science, University of Potsdam, Am Neuen Palais 10, D-14469 Potsdam, Germany, mueck@samuel.cs.uni-potsdam.de
(c) Department of Data Engineering, University of Vienna, Rathausstrasse 19/4, A-1010 Vienna, Austria, schiki@ifs.univie.ac.at

Keywords: parallel input/output, high performance mass storage system, high performance languages, compilation techniques, data administration

1 Introduction

Languages like HPF and Vienna Fortran [5] and their compilers have been designed to improve the practical applicability of massively parallel systems. To accelerate the transition of these systems into fully operational environments, it is also necessary to develop appropriate language constructs and software tools that support application programmers in the development of large I/O-intensive applications [1,3]. This paper focuses on mass storage support for the Vienna Fortran Compilation System (VFCS) to enable efficient execution of parallel I/O operations and of operations on out-of-core (OOC) structures. The use of OOC structures implies I/O operations: due to main memory constraints, some parts of these data structures (e.g., large arrays) must be swapped to disk.

The approach outlined in this paper is based on two main concepts:

(i) Vienna Fortran language extensions and compilation techniques. We propose constructs to specify OOC structures and I/O operations for distributed data structures in the context of Vienna Fortran. These operations can be used by the programmer to provide information that helps the compiler and the runtime environment operate the underlying I/O subsystem in an efficient way.

(ii) Integrated advanced runtime support.
The modules of VFCS that process I/O operations and handle OOC structures are coupled to a mass-storage-oriented runtime system called VIPIOS (Vienna Parallel I/O System). The objective of the proposed integrated compile-time and runtime optimizations is to minimize the number of disk accesses for file I/O and OOC processing. A central issue in this context is to increase the main memory buffer hit ratio.

2 Language and Compiler Support

2.1 Processing Explicit I/O Operations

Distributed data structures are stored in the parallel I/O subsystem as parallel files. The file layout may be optimized by VIPIOS to achieve efficient I/O data transfer. In the context of an OPEN or WRITE statement, the user may give a hint to the compilation system that data in the file will be written to or read from an array of a given distribution. The hint specification is provided by a new optional specifier IO DIST in the OPEN or WRITE statement. An intended distribution (or a class of distributions) can be bound to a name by means of an I/O distribution type definition. Furthermore, an I/O distribution type definition may have arguments to allow parameterization.

S1: PROCESSORS P1D(64)
S2: REAL A(40000) DIST (BLOCK) TO P1D

    IO DTYPE REG1(M,N1,N2,K1,K2)
      TARGET PROCESSORS P2D(M,M)
      ELM TYPE REAL
      TARGET ARRAY A(N1,N2)
      A DIST (CYCLIC(K1), CYCLIC(K2)) TO P2D
    END IO DTYPE REG1

O1: OPEN (u1 = 8, FILE = 'exam1.dat', MODE = 'PF', STATUS = 'NEW')
    WRITE (u1, IO DIST = REG1(8,400,100,4,2)) A ...

Fig. 1. Opening and Writing to a Parallel File - Examples

According to line O1 in Fig. 1, unit u1 is connected to the parallel file 'exam1.dat'. The elements of the distributed array A (the BLOCK-type distribution onto a one-dimensional processor array is specified in lines S1 and S2) are written to this file so as to optimize reading them into real arrays which have the shape (400,100) and are distributed as (CYCLIC(4), CYCLIC(2)) onto a grid of processors having the shape (8,8). This I/O distribution can be changed by a REORGANIZE statement.
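To make the distribution hints in Fig. 1 concrete, the following sketch (plain Python, not VFCS or Vienna Fortran code; the function names are ours) computes which processor owns a given array element under the two regular distributions used in the example: BLOCK, which gives each processor one contiguous chunk, and CYCLIC(k), which deals out blocks of k consecutive elements round-robin.

```python
# Sketch (not VFCS code): element ownership under BLOCK and CYCLIC(k)
# distributions, per dimension, as used in the Fig. 1 example.

def block_owner(i, n, p):
    """Owner of element i (0-based) of an n-element array that is
    BLOCK-distributed over p processors."""
    block = (n + p - 1) // p          # ceiling(n / p) elements per processor
    return i // block

def cyclic_owner(i, k, p):
    """Owner of element i under a CYCLIC(k) distribution over p processors:
    blocks of k consecutive elements are dealt out round-robin."""
    return (i // k) % p

# A(40000) DIST (BLOCK) TO P1D(64): 625 contiguous elements per processor.
print(block_owner(0, 40000, 64))      # 0
print(block_owner(39999, 40000, 64))  # 63

# First dimension of the target array: CYCLIC(4) over the 8 processors of
# one grid dimension.
print(cyclic_owner(0, 4, 8))   # 0
print(cyclic_owner(4, 4, 8))   # 1
print(cyclic_owner(32, 4, 8))  # 0  (wraps around after 8 blocks of 4)
```

The IO DIST hint tells VIPIOS the target mapping in advance, so it can lay the file out such that each target processor's CYCLIC blocks can be read with few seeks.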
At compile time, the translation of parallel I/O operations conceptually consists of two phases: basic compilation, which extracts parameters about data distributions and file access patterns from the VF program and passes this information to the VIPIOS primitives, and advanced optimizations, including code restructuring based on program analysis.
2.2 Processing Out-of-Core Programs

From the scientific application programmer's point of view there are no significant differences between the proposed OOC programming model and the traditional in-core model (see footnote 1). The goal is to preserve for the programmer the model of unlimited main memory. It is assumed that an in-core version of the VF program is converted into the appropriate OOC VF form.

Fig. 2. Out-of-Core Programming Model (the application programmer derives the Vienna Fortran out-of-core program from the Vienna Fortran in-core program, guided by hardware and system software parameters)

A graphical sketch of how the conversion can be done is depicted in Figure 2. The programmer analyzes the VF program and predicts the memory requirements of the program after its parallelization. If there is not enough main memory for in-core (IC) execution on the given target architecture, the programmer annotates some large data structures that have to be processed using OOC techniques. The programmer's decision is also based on knowledge of the features of the target system hardware (memory capacity) and software (memory requirements). All computations are performed on the data in local main memories. VFCS restructures the source out-of-core program in such a way that during computation, sections of the array are fetched from disks into the local memory, the new values are computed, and the updated sections are written back to disks if necessary. The computation is performed in phases, where each phase operates on a different part of the array called a slab. Loop iterations are partitioned so that data of a fixed slab size can be processed in each phase. Each local main memory has access to the individual slabs through a "window" referred to as the in-core portion of the array. VFCS has to obtain the information about which arrays are out-of-core, and about the shape and size of the corresponding in-core portions, in the form of an OOC annotation.
Footnote 1: Note that when developing in-core programs the programmer has to specify only the data distribution and, in some cases, also the work distribution.

The OOC array annotation is of the following form:
REAL ad_1, ..., ad_r  dist_spec, OUT OF CORE [, IN MEM (ic_portion)]

where the ad_i, 1 <= i <= r, specify array identifiers B_i and their index domains, and dist_spec represents a Vienna Fortran distribution-specification annotation. The keyword OUT OF CORE indicates that all B_i are out-of-core arrays. In the optional part, the keyword IN MEM indicates that only the array portions corresponding to ic_portion are allowed to be kept in memory. The larger the IC portion, the better, as a larger portion reduces the number of disk accesses.

The process of transforming a Vienna Fortran out-of-core program into the out-of-core SPMD program can be conceptually divided into five major steps:

(i) Distribution of each out-of-core array among the available processors. Array elements that are assigned to a processor according to the data distribution are initially stored on disks. Further, the resulting mapping determines the work distribution. Based on the IN MEM specification, memory for the in-core array portions is allocated.

(ii) Distribution of the computation among the processors. The work distribution step determines for each processor the execution set, i.e., the set of loop iterations to be executed by this processor. The main criterion is to operate on data associated with the "nearest" disks and to optimize the load balance. In most cases the "owner-computes" strategy is applied: the processor which owns the data element that is updated in an iteration performs the computation.

(iii) Splitting execution sets into tiles. The computation assigned to a processor is performed in stages called tiles, where each stage operates on a different slab. Loop iterations are partitioned so that one slab can be processed in each phase.

(iv) Insertion of I/O and communication statements. Depending on the data and work distribution, determine whether the data needed is in the local or remote in-core portion or on a disk, and then detect the type of communication and I/O operation required.
(v) Generation of a Section Access Graph (SAG) as the support for efficient software-controlled prefetching [4]. I/O latency can be partially reduced by executing prefetch operations that move data close to the processor before it is actually needed. In our approach, the compile-time knowledge about the I/O requirements of the program parts is represented by a Section Access Graph. This graph is incrementally constructed in the program database of VFCS during the compilation process and written to a file at the end of compilation. The SAG is used by VIPIOS in the optimization of prefetching.
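The slab-by-slab execution scheme that steps (i)-(iv) produce can be sketched as follows (a minimal Python illustration under our own assumptions, not generated VFCS code; `disk` stands for the disk-resident out-of-core array, and only one slab is ever held in the in-core window):

```python
# Sketch (illustrative only): out-of-core update of a 1-D array in slabs.
# Each loop trip is one "tile": fetch a slab into the in-core window,
# compute on it, and write the updated slab back to disk.

def ooc_scale(disk, slab_size, factor):
    """Multiply every element of the disk-resident array by `factor`,
    processing it slab by slab through a window of `slab_size` elements."""
    n = len(disk)
    for start in range(0, n, slab_size):
        window = disk[start:start + slab_size]   # fetch slab from "disk"
        window = [x * factor for x in window]    # compute on in-core portion
        disk[start:start + slab_size] = window   # write updated slab back
    return disk

data = list(range(10))
print(ooc_scale(data, slab_size=4, factor=2))  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

Knowing the fetch/write-back points statically is what lets the compiler record each slab access in the SAG, so that VIPIOS can prefetch the next slab while the current one is being processed.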
3 Advanced Runtime Support

The goal of the proposed advanced runtime system is to provide an efficient parallel mass storage I/O framework [2] for the parallel I/O operations and out-of-core data structures of the VFCS. The central component of the framework is a novel runtime module referred to as VIPIOS (VIenna Parallel Input/Output System).

The framework distinguishes between two types of processes: application processes and VIPIOS servers. The application processes are created by the VFCS; according to the SPMD paradigm, each process executes the same program on different parts of the data space. The VIPIOS servers run independently on all or on a number of dedicated nodes and service the data requests of the application processes. The number and the location of the VIPIOS servers are defined during the VIPIOS system start-up phase, which is generally part of the boot process of the machine. The default configuration is based on the properties of the hardware system. During runtime it is possible to change the configuration according to the requirements of the application processes via a VIPIOS supervisor server process, which administrates all other VIPIOS processes. Summing up, the configuration depends on the underlying hardware architecture (disk arrays, local disks, specialized I/O nodes, etc.), the system configuration (number and types of available nodes, etc.), the VIPIOS system administration (number of serviced nodes, disks, application processes, etc.), and user needs (I/O characteristics, regular or irregular problems, etc.).

The VIPIOS servers are similar to data server processes in database systems. Each application process is assigned exactly one VIPIOS server, which accomplishes its data requests, but one VIPIOS server can serve a number of application processes; in other words, one-to-one or one-to-many relationships exist. For each application process, all data requests are transparently caught by the assigned VIPIOS processes.
These server processes access locally or remotely stored data and ensure that each application process has access to its requested data items. The VFCS provides information about the problem-specific data distribution, the stride size of the slabs of the out-of-core data structures, and the presumed data access profile. Based on this information, VIPIOS organizes the data and tries to ensure high performance for data access operations. Additional data distribution and usage information can be provided by the Vienna Fortran programmer using new language constructs. This type of information allows the VFCS/VIPIOS system to parallelize read and write operations by selecting a well-suited data organization in the files. An important advantage of the proposed framework is the support of a wide spectrum of mass storage architectures, e.g., global disk systems connected via a fast bus (like HIPPI) or local disks connected directly to nodes. In any case, the architecture is transparent to the application programmer as well as to the VF compiler developer.
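The one-to-one or one-to-many relationship between application processes and servers can be pictured with a small sketch (our own illustration, not the VIPIOS interface; the function name and the round-robin policy are assumptions, since the paper leaves the assignment policy to the configuration):

```python
# Sketch (illustrative only): a many-to-one map from application processes
# to VIPIOS servers.  Every application process gets exactly one server;
# one server may serve several application processes.

def assign_servers(app_procs, servers):
    """Round-robin assignment of application processes to servers."""
    return {ap: servers[i % len(servers)] for i, ap in enumerate(app_procs)}

mapping = assign_servers(["ap0", "ap1", "ap2", "ap3"], ["vs0", "vs1"])
print(mapping)  # {'ap0': 'vs0', 'ap1': 'vs1', 'ap2': 'vs0', 'ap3': 'vs1'}
```

In the actual system the choice would follow the data locality principles of Section 3.1 rather than a fixed round-robin.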
3.1 Data Locality

The design principle of VIPIOS for achieving high data access performance is data locality: the data requested by an application process should be read from or written to the 'best-suited' disk. Generally, the choice of the disks, and correspondingly of the administrating servers, is based on the data distribution of the application problem. We distinguish between logical and physical data locality.

Logical data locality denotes the choice of the best-suited VIPIOS server for an application process. This server is determined by the topological distance and/or the process characteristics. Special process characteristics, such as available memory or the best disk list (see the next paragraph), can also influence the VIPIOS server performance; therefore a remote VIPIOS server could provide better performance than a closer one. At any rate, exactly one specific VIPIOS server is chosen for each application process to handle the respective requests. This process is called the buddy server, while all other servers are called foe servers with respect to this process.

The physical data locality principle aims to determine the disk set providing the best (mostly the fastest) data access. For each node an ordered sequence of the accessible disks of the system is defined (the best disk list, BDL), which ranks the disks according to their access behavior. Disks with good access characteristics precede disks with bad ones in this list. The ranking can be determined by technical disk characteristics, like seek time, transfer rate, etc., and/or by the location in the system architecture. Thus the VIPIOS server chooses from the BDL the actual disk administrating the data of a specific application process. In most cases it will choose the disk(s) according to both the BDL of the node it is executing on and the physical restrictions of the disks (memory requirements, workload, etc.).
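A minimal sketch of the BDL selection just described (our own illustration, not VIPIOS code; the function name is hypothetical, and the "physical restrictions" are simplified to a single free-capacity check):

```python
# Sketch (illustrative only): choosing a disk from a node's Best Disk List.
# The BDL is already ordered best-first by access characteristics (seek
# time, transfer rate, location); the server takes the first disk that
# also satisfies the physical restrictions.

def choose_disk(bdl, free_space, required):
    """Return the best-ranked disk with at least `required` free space,
    or None if no accessible disk qualifies."""
    for disk in bdl:
        if free_space[disk] >= required:
            return disk
    return None

bdl = ["disk1", "disk2", "disk3"]               # best-first order for this node
free = {"disk1": 10, "disk2": 500, "disk3": 500}
print(choose_disk(bdl, free, required=100))     # disk2: disk1 ranks higher
                                                # but is too full
```

The real decision also weighs workload and, as noted below, non-hardware criteria such as data size and security.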
Other criteria that are not hardware-oriented, such as the size of the stored data structure, data security, etc., can also influence this decision.

Fig. 3. Process model of application processes and VIPIOS servers (each application process AP on a node is linked with the VIPIOS call interface VI; one VIPIOS server acts as buddy to a given process and the other servers as foes; each node's BDL ranks the accessible disks)

The process model is depicted in Figure 3. The VIPIOS call interface VI, which is linked with the application process AP, handles the communication with the assigned VIPIOS server VS.

3.2 Two-Phase Data Administration Process

The data administration process of a VIPIOS server can be divided into two phases, the preparation phase and the administration phase (see Figure 4). The preparation phase prepares the administrated data according to the data layout of the data structure, the presumed access profile, and the physical restrictions of the system (available main memory, disk space, etc.). This phase is performed during the compilation process and the costly system start-up phase, and it precedes the execution of the application process. In this phase the physical data layout schemas are defined, and the actual VIPIOS server process for the application process and the disks for the stored data are chosen according to the locality principles. Further, the data storage areas are prepared, the necessary main memory buffers are allocated, etc. The administration phase accomplishes the I/O requests of the application processes. It is obvious that the preparation phase is the basis for good I/O performance; all optimizations are performed in this phase.

Fig. 4. Two-phase data administration process (compilation and start-up of the Vienna Fortran program drive the VIPIOS preparation phase; execution of the out-of-core program is served by the VIPIOS administration phase)

4 Conclusions

As mentioned in the preceding sections, high performance languages generally lack efficient parallel I/O support. A possible approach is the development of an integrated runtime subsystem which is optimized for HPF language systems. As a main goal, physical data distributions should adapt to the requirements of the problem characteristics specified in the application program.
References

[1] R.R. Bordawekar, A.N. Choudhary, Language and Compiler Support for Parallel I/O, Proc. IFIP Working Conf. Prog. Env. for Massively Parallel Dist. Systems (Switzerland, 1994).

[2] P. Brezany, T.A. Mueck, E. Schikuta, Language, Compiler and Parallel Database Support for I/O Intensive Applications, Proc. High Performance Computing and Networking 1995 Europe (Milano, 1995) 14-20.

[3] D. Kotz, Disk-Directed I/O for MIMD Multiprocessors, Proc. First USENIX Symp. on Operating Systems Design and Implementation (Monterey, CA, 1994) 61-74.

[4] R.H. Patterson, G.A. Gibson, E. Ginting, D. Stodolsky, J. Zelenka, Informed Prefetching and Caching, Tech. Rep., Carnegie Mellon Univ., CMU-CS (1995).

[5] H. Zima, P. Brezany, B. Chapman, P. Mehrotra, A. Schwald, Vienna Fortran - A Language Specification, ACPC Technical Report Series, University of Vienna (1992); also available as ICASE Interim Report 21, MS 132c, NASA, Hampton, VA 23681.
More informationPick a time window size w. In time span w, are there, Multiple References, to nearby addresses: Spatial Locality
Pick a time window size w. In time span w, are there, Multiple References, to nearby addresses: Spatial Locality Repeated References, to a set of locations: Temporal Locality Take advantage of behavior
More informationA Component-based Programming Model for Composite, Distributed Applications
NASA/CR-2001-210873 ICASE Report No. 2001-15 A Component-based Programming Model for Composite, Distributed Applications Thomas M. Eidson ICASE, Hampton, Virginia ICASE NASA Langley Research Center Hampton,
More informationAutomatic Array Alignment for. Mitsuru Ikei. Hitachi Chemical Company Ltd. Michael Wolfe. Oregon Graduate Institute of Science & Technology
Automatic Array Alignment for Distributed Memory Multicomputers Mitsuru Ikei Hitachi Chemical Company Ltd. Michael Wolfe Oregon Graduate Institute of Science & Technology P.O. Box 91000 Portland OR 97291
More informationThe Global Standard for Mobility (GSM) (see, e.g., [6], [4], [5]) yields a
Preprint 0 (2000)?{? 1 Approximation of a direction of N d in bounded coordinates Jean-Christophe Novelli a Gilles Schaeer b Florent Hivert a a Universite Paris 7 { LIAFA 2, place Jussieu - 75251 Paris
More informationINTRODUCTION Introduction This document describes the MPC++ programming language Version. with comments on the design. MPC++ introduces a computationa
TR-944 The MPC++ Programming Language V. Specication with Commentary Document Version. Yutaka Ishikawa 3 ishikawa@rwcp.or.jp Received 9 June 994 Tsukuba Research Center, Real World Computing Partnership
More informationMulti-Process Prefetching and Caching. Andrew Tomkins R. Hugo Patterson Garth Gibson. September, 1996 CMU-CS Carnegie Mellon University
A Trace-Driven Comparison of Algorithms for Multi-Process Prefetching and Caching Andrew Tomkins R. Hugo Patterson Garth Gibson September, 1996 CMU-CS-96-174 School of Computer Science Carnegie Mellon
More informationTable-Lookup Approach for Compiling Two-Level Data-Processor Mappings in HPF Kuei-Ping Shih y, Jang-Ping Sheu y, and Chua-Huang Huang z y Department o
Table-Lookup Approach for Compiling Two-Level Data-Processor Mappings in HPF Kuei-Ping Shih y, Jang-Ping Sheu y, and Chua-Huang Huang z y Department of Computer Science and Information Engineering National
More informationFrank Miller, George Apostolopoulos, and Satish Tripathi. University of Maryland. College Park, MD ffwmiller, georgeap,
Simple Input/Output Streaming in the Operating System Frank Miller, George Apostolopoulos, and Satish Tripathi Mobile Computing and Multimedia Laboratory Department of Computer Science University of Maryland
More informationNew article Data Producer. Logical data structure
Quality of Service and Electronic Newspaper: The Etel Solution Valerie Issarny, Michel Ban^atre, Boris Charpiot, Jean-Marc Menaud INRIA IRISA, Campus de Beaulieu, 35042 Rennes Cedex, France fissarny,banatre,jmenaudg@irisa.fr
More informationCOEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence
1 COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence Cristinel Ababei Dept. of Electrical and Computer Engineering Marquette University Credits: Slides adapted from presentations
More informationA Hierarchical Approach to Workload. M. Calzarossa 1, G. Haring 2, G. Kotsis 2,A.Merlo 1,D.Tessera 1
A Hierarchical Approach to Workload Characterization for Parallel Systems? M. Calzarossa 1, G. Haring 2, G. Kotsis 2,A.Merlo 1,D.Tessera 1 1 Dipartimento di Informatica e Sistemistica, Universita dipavia,
More information2 J. Karvo et al. / Blocking of dynamic multicast connections Figure 1. Point to point (top) vs. point to multipoint, or multicast connections (bottom
Telecommunication Systems 0 (1998)?? 1 Blocking of dynamic multicast connections Jouni Karvo a;, Jorma Virtamo b, Samuli Aalto b and Olli Martikainen a a Helsinki University of Technology, Laboratory of
More informationUMIACS-TR December, CS-TR-3192 Revised April, William Pugh. Dept. of Computer Science. Univ. of Maryland, College Park, MD 20742
UMIACS-TR-93-133 December, 1992 CS-TR-3192 Revised April, 1993 Denitions of Dependence Distance William Pugh Institute for Advanced Computer Studies Dept. of Computer Science Univ. of Maryland, College
More informationHigh Performance Fortran http://www-jics.cs.utk.edu jics@cs.utk.edu Kwai Lam Wong 1 Overview HPF : High Performance FORTRAN A language specification standard by High Performance FORTRAN Forum (HPFF), a
More informationUniversity of Malaga. Image Template Matching on Distributed Memory and Vector Multiprocessors
Image Template Matching on Distributed Memory and Vector Multiprocessors V. Blanco M. Martin D.B. Heras O. Plata F.F. Rivera September 995 Technical Report No: UMA-DAC-95/20 Published in: 5th Int l. Conf.
More informationEfficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero
Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero The Nineteenth International Conference on Parallel Architectures and Compilation Techniques (PACT) 11-15
More informationLimitations of parallel processing
Your professor du jour: Steve Gribble gribble@cs.washington.edu 323B Sieg Hall all material in this lecture in Henessey and Patterson, Chapter 8 635-640 645, 646 654-665 11/8/00 CSE 471 Multiprocessors
More informationHiroshi Nakashima Yasutaka Takeda Katsuto Nakajima. Hideki Andou Kiyohiro Furutani. typing and dereference are the most unique features of
A Pipelined Microprocessor for Logic Programming Languages Hiroshi Nakashima Yasutaka Takeda Katsuto Nakajima Hideki Andou Kiyohiro Furutani Mitsubishi Electric Corporation Abstract In the Japanese Fifth
More information2 3. Syllabus Time Event 9:00{10:00 morning lecture 10:00{10:30 morning break 10:30{12:30 morning practical session 12:30{1:30 lunch break 1:30{2:00 a
1 Syllabus for the Advanced 3 Day Fortran 90 Course AC Marshall cuniversity of Liverpool, 1997 Abstract The course is scheduled for 3 days. The timetable allows for two sessions a day each with a one hour
More informationReview: Creating a Parallel Program. Programming for Performance
Review: Creating a Parallel Program Can be done by programmer, compiler, run-time system or OS Steps for creating parallel program Decomposition Assignment of tasks to processes Orchestration Mapping (C)
More informationNew Programming Paradigms: Partitioned Global Address Space Languages
Raul E. Silvera -- IBM Canada Lab rauls@ca.ibm.com ECMWF Briefing - April 2010 New Programming Paradigms: Partitioned Global Address Space Languages 2009 IBM Corporation Outline Overview of the PGAS programming
More information(Preliminary Version 2 ) Jai-Hoon Kim Nitin H. Vaidya. Department of Computer Science. Texas A&M University. College Station, TX
Towards an Adaptive Distributed Shared Memory (Preliminary Version ) Jai-Hoon Kim Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3 E-mail: fjhkim,vaidyag@cs.tamu.edu
More informationProcessors. recv(n/p) Time. Processors. send(n/2-m) recv(n/2-m) recv(n/4 -m/2) gap(n/4-m/2) Time
LogP Modelling of List Algorithms W. Amme, P. Braun, W. Lowe 1, and E. Zehendner Fakultat fur Mathematik und Informatik, Friedrich-Schiller-Universitat, 774 Jena, Germany. E-mail: famme,braunpet,nezg@idec2.inf.uni-jena.de
More informationThe driving motivation behind the design of the Janus framework is to provide application-oriented, easy-to-use and ecient abstractions for the above
Janus a C++ Template Library for Parallel Dynamic Mesh Applications Jens Gerlach, Mitsuhisa Sato, and Yutaka Ishikawa fjens,msato,ishikawag@trc.rwcp.or.jp Tsukuba Research Center of the Real World Computing
More informationIntroduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano
Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed
More informationKeywords: networks-of-workstations, distributed-shared memory, compiler optimizations, locality
Informatica 17 page xxx{yyy 1 Overlap of Computation and Communication on Shared-Memory Networks-of-Workstations Tarek S. Abdelrahman and Gary Liu Department of Electrical and Computer Engineering The
More informationinstruction fetch memory interface signal unit priority manager instruction decode stack register sets address PC2 PC3 PC4 instructions extern signals
Performance Evaluations of a Multithreaded Java Microcontroller J. Kreuzinger, M. Pfeer A. Schulz, Th. Ungerer Institute for Computer Design and Fault Tolerance University of Karlsruhe, Germany U. Brinkschulte,
More informationESE 545 Computer Architecture Symmetric Multiprocessors and Snoopy Cache Coherence Protocols CA SMP and cache coherence
Computer Architecture ESE 545 Computer Architecture Symmetric Multiprocessors and Snoopy Cache Coherence Protocols 1 Shared Memory Multiprocessor Memory Bus P 1 Snoopy Cache Physical Memory P 2 Snoopy
More information2 Data Reduction Techniques The granularity of reducible information is one of the main criteria for classifying the reduction techniques. While the t
Data Reduction - an Adaptation Technique for Mobile Environments A. Heuer, A. Lubinski Computer Science Dept., University of Rostock, Germany Keywords. Reduction. Mobile Database Systems, Data Abstract.
More informationOptimizing Out-of-Core Nearest Neighbor Problems on Multi-GPU Systems Using NVLink
Optimizing Out-of-Core Nearest Neighbor Problems on Multi-GPU Systems Using NVLink Rajesh Bordawekar IBM T. J. Watson Research Center bordaw@us.ibm.com Pidad D Souza IBM Systems pidsouza@in.ibm.com 1 Outline
More informationPerformance of the Decoupled ACRI-1. Architecture: the Perfect Club. University of Edinburgh, The King's Buildings, Mayeld Road, Edinburgh EH9 3JZ,
Performance of the Decoupled ACRI-1 Architecture: the Perfect Club Nigel Topham 1;y and Kenneth McDougall 2;3 1 Department of Computer Science, University of Edinburgh, The King's Buildings, Mayeld Road,
More informationENGN 2910A Homework 03 (140 points) Due Date: Oct 3rd 2013
ENGN 2910A Homework 03 (140 points) Due Date: Oct 3rd 2013 Professor: Sherief Reda School of Engineering, Brown University 1. [from Debois et al. 30 points] Consider the non-pipelined implementation of
More informationLECTURE 11. Memory Hierarchy
LECTURE 11 Memory Hierarchy MEMORY HIERARCHY When it comes to memory, there are two universally desirable properties: Large Size: ideally, we want to never have to worry about running out of memory. Speed
More informationOpenMP for next generation heterogeneous clusters
OpenMP for next generation heterogeneous clusters Jens Breitbart Research Group Programming Languages / Methodologies, Universität Kassel, jbreitbart@uni-kassel.de Abstract The last years have seen great
More informationOptimal Matrix Transposition and Bit Reversal on. Hypercubes: All{to{All Personalized Communication. Alan Edelman. University of California
Optimal Matrix Transposition and Bit Reversal on Hypercubes: All{to{All Personalized Communication Alan Edelman Department of Mathematics University of California Berkeley, CA 94720 Key words and phrases:
More informationParallel Algorithm Design
Chapter Parallel Algorithm Design Debugging is twice as hard as writing the code in the rst place. Therefore, if you write the code as cleverly as possible, you are, by denition, not smart enough to debug
More informationNils Nieuwejaar, David Kotz. Most current multiprocessor le systems are designed to use multiple disks
The Galley Parallel File System Nils Nieuwejaar, David Kotz fnils,dfkg@cs.dartmouth.edu Department of Computer Science, Dartmouth College, Hanover, NH 3755-351 Most current multiprocessor le systems are
More informationMemory Management. Memory Management
Memory Management Chapter 7 1 Memory Management Subdividing memory to accommodate multiple processes Memory needs to be allocated efficiently to pack as many processes into memory as possible 2 1 Memory
More informationclients (compute nodes) servers (I/O nodes)
Collective I/O on a SGI Cray Origin : Strategy and Performance Y. Cho, M. Winslett, J. Lee, Y. Chen, S. Kuo, K. Motukuri Department of Computer Science, University of Illinois Urbana, IL, U.S.A. Abstract
More information15-740/ Computer Architecture
15-740/18-740 Computer Architecture Lecture 19: Caching II Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/31/2011 Announcements Milestone II Due November 4, Friday Please talk with us if you
More informationStorage System. Distributor. Network. Drive. Drive. Storage System. Controller. Controller. Disk. Disk
HRaid: a Flexible Storage-system Simulator Toni Cortes Jesus Labarta Universitat Politecnica de Catalunya - Barcelona ftoni, jesusg@ac.upc.es - http://www.ac.upc.es/hpc Abstract Clusters of workstations
More informationCache performance Outline
Cache performance 1 Outline Metrics Performance characterization Cache optimization techniques 2 Page 1 Cache Performance metrics (1) Miss rate: Neglects cycle time implications Average memory access time
More informationAn Initial Comparison of Implicit and Explicit Programming Styles for Distributed Memory Multiprocessors
Proceedings of the 28th Annual Hmvaii Intemottonol Conference on System Sciences - 1995 An Initial Comparison of Implicit and Explicit Programming Styles for Distributed Memory Multiprocessors Matthew
More informationIntroduction to Multiprocessors (Part II) Cristina Silvano Politecnico di Milano
Introduction to Multiprocessors (Part II) Cristina Silvano Politecnico di Milano Outline The problem of cache coherence Snooping protocols Directory-based protocols Prof. Cristina Silvano, Politecnico
More informationCOSC 6385 Computer Architecture - Multi Processor Systems
COSC 6385 Computer Architecture - Multi Processor Systems Fall 2006 Classification of Parallel Architectures Flynn s Taxonomy SISD: Single instruction single data Classical von Neumann architecture SIMD:
More informationHenning Koch. Dept. of Computer Science. University of Darmstadt. Alexanderstr. 10. D Darmstadt. Germany. Keywords:
Embedding Protocols for Scalable Replication Management 1 Henning Koch Dept. of Computer Science University of Darmstadt Alexanderstr. 10 D-64283 Darmstadt Germany koch@isa.informatik.th-darmstadt.de Keywords:
More informationEnumeration of Full Graphs: Onset of the Asymptotic Region. Department of Mathematics. Massachusetts Institute of Technology. Cambridge, MA 02139
Enumeration of Full Graphs: Onset of the Asymptotic Region L. J. Cowen D. J. Kleitman y F. Lasaga D. E. Sussman Department of Mathematics Massachusetts Institute of Technology Cambridge, MA 02139 Abstract
More information