Mass Storage Support for a Parallelizing Compilation System

Peter Brezany (a), Thomas A. Mueck (b), Erich Schikuta (c)

(a) Institute for Software Technology and Parallel Systems, University of Vienna, Liechtensteinstrasse 22, A-1092 Vienna, Austria, brezany@par.univie.ac.at
(b) Institute for Computer Science, University of Potsdam, Am Neuen Palais 10, D-14469 Potsdam, Germany, mueck@samuel.cs.uni-potsdam.de
(c) Department of Data Engineering, University of Vienna, Rathausstrasse 19/4, A-1010 Vienna, Austria, schiki@ifs.univie.ac.at

Keywords: parallel input/output, high performance mass storage system, high performance languages, compilation techniques, data administration

1 Introduction

Languages like HPF and Vienna Fortran [5] and their compilers have been designed to improve the practical applicability of massively parallel systems. To accelerate the transition of these systems into fully operational environments, it is also necessary to develop appropriate language constructs and software tools that support application programmers in the development of large I/O-intensive applications [1,3].

This paper focuses on mass storage support for the Vienna Fortran Compilation System (VFCS) to enable efficient execution of parallel I/O operations and of operations on out-of-core (OOC) structures. The use of OOC structures implies I/O operations: due to main memory constraints, some parts of these data structures (e.g., large arrays) must be swapped to disk. The approach outlined in this paper is based on two main concepts:

(i) Vienna Fortran language extensions and compilation techniques. We propose constructs to specify OOC structures and I/O operations for distributed data structures in the context of Vienna Fortran. These operations can be used by the programmer to provide information that helps the compiler and the runtime environment to operate the underlying I/O subsystem efficiently.

(ii) Integrated advanced runtime support. The modules of VFCS that process I/O operations and handle OOC structures are coupled to a mass storage oriented runtime system called VIPIOS (Vienna Parallel I/O System).

The objective of the proposed integrated compile-time and runtime optimizations is to minimize the number of disk accesses for file I/O and OOC processing. A central issue in this context is to increase the main memory buffer hit ratio.

2 Language and Compiler Support

2.1 Processing Explicit I/O Operations

Distributed data structures are stored in the parallel I/O subsystem as parallel files. The file layout may be optimized by VIPIOS to achieve efficient I/O data transfer. In the context of an OPEN or WRITE statement, the user may give a hint to the compilation system that data in the file will be written to or read from an array of a given distribution. The hint specification is provided by a new optional specifier IO_DIST in the OPEN or WRITE statement. An intended distribution (or a class of distributions) can be bound to a name by means of an I/O distribution type definition. Furthermore, an I/O distribution type definition may have arguments to allow parameterization.

    S1:   PROCESSORS P1D(64)
    S2:   REAL A(40000) DIST (BLOCK) TO P1D

          IO_DTYPE REG1(M, N1, N2, K1, K2)
            TARGET_PROCESSORS P2D(M,M)
            ELM_TYPE REAL
            TARGET_ARRAY A(N1,N2)
            A DIST (CYCLIC(K1), CYCLIC(K2)) TO P2D
          END IO_DTYPE REG1

    O1:   OPEN (u1 = 8, FILE = 'exam1.dat', MODE = 'PF', STATUS = 'NEW')
          WRITE (u1, IO_DIST = REG1(8,400,100,4,2)) A
          ...

Fig. 1. Opening and writing to a parallel file: examples

According to line O1 in Fig. 1, unit u1 is connected to the parallel file 'exam1.dat'. The elements of the distributed array A (the BLOCK type distribution onto a one-dimensional processor array is specified in lines S1 and S2) are written to this file so as to optimize reading them into real arrays which have the shape (400,100) and are distributed as (CYCLIC(4), CYCLIC(2)) onto a grid of processors having the shape (8,8). This I/O distribution can be changed by a REORGANIZE statement.

At compile time, the translation of parallel I/O operations conceptually consists of two phases: basic compilation, which extracts parameters about data distributions and file access patterns from the VF program and passes this information to the VIPIOS primitives, and advanced optimizations, including code restructuring based on program analysis.
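To make the first phase more concrete, the following minimal sketch shows the kind of call the basic compilation phase might emit for the WRITE statement of Fig. 1. The routine name VIPIOS_SET_IO_DIST and its argument order are invented here purely for illustration; they are not part of the actual VIPIOS primitive interface.

    ! Hypothetical lowering of the IO_DIST hint of Fig. 1 (all names invented).
    ! The compiler extracts the parameters of REG1(8,400,100,4,2), i.e. the
    ! target processor grid (8,8), the target array shape (400,100), and the
    ! CYCLIC block lengths (4,2), and hands them to the runtime system before
    ! the data transfer takes place.
    CALL VIPIOS_SET_IO_DIST(u1, 8, 8, 400, 100, 4, 2)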

2.2 Processing Out-of-Core Programs

From the scientific application programmer's point of view there are no significant differences between the proposed OOC programming model and the traditional in-core model (note that when developing in-core programs the programmer has to specify only the data distribution and, in some cases, also the work distribution). The goal is to preserve for the programmer the model of unlimited main memory. It is assumed that an in-core version of the VF program is converted into the appropriate OOC VF form.

Fig. 2. Out-of-core programming model: the application programmer converts a Vienna Fortran in-core program, guided by hardware and system software parameters, into a Vienna Fortran out-of-core program.

A graphical sketch of how the conversion can be done is depicted in Figure 2. The programmer analyzes the VF program and predicts the memory requirements of the program after its parallelization. If there is not enough main memory for in-core (IC) execution on the given target architecture, the programmer annotates some large data structures that have to be processed using OOC techniques. The programmer's decision is also based on knowledge of the features of the target system hardware (memory capacity) and software (memory requirements).

All computations are performed on the data in local main memories. VFCS restructures the source out-of-core program in such a way that, during computation, sections of the array are fetched from disks into the local memory, the new values are computed, and the updated sections are written back to disks if necessary. The computation is performed in phases, where each phase operates on a different part of the array called a slab. Loop iterations are partitioned so that data of a fixed slab size can be processed in each phase. Each local main memory accesses the individual slabs through a "window" referred to as the in-core portion of the array.
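This restructuring can be pictured with a minimal Fortran 90 sketch. The routines read_slab and write_slab merely stand for the I/O operations inserted by the compiler; they, and all sizes used below, are invented for illustration and are not part of VFCS or VIPIOS.

    ! Slab-wise processing of a large array through an in-core window
    ! (illustrative sketch only; all names and sizes are invented).
    integer, parameter :: n    = 1000000     ! logical size of the OOC array
    integer, parameter :: slab = 10000       ! size of the in-core portion
    real    :: window(slab)                  ! the in-core "window" on the array
    integer :: lo, nel

    do lo = 1, n, slab                       ! one phase per slab
       nel = min(slab, n - lo + 1)
       call read_slab(window, lo, nel)       ! fetch the slab from disk
       window(1:nel) = 2.0 * window(1:nel)   ! compute new values on the slab
       call write_slab(window, lo, nel)      ! write the updated slab back
    end do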

VFCS has to be told which arrays are out-of-core and what the shape and size of the corresponding in-core portions are. This information is given in the form of an OOC annotation:

    REAL ad_1, ..., ad_r  dist_spec, OUT_OF_CORE [, IN_MEM (ic_portion)]

where ad_i, 1 ≤ i ≤ r, specify array identifiers B_i and their index domains, and dist_spec represents a Vienna Fortran distribution-specification annotation. The keyword OUT_OF_CORE indicates that all B_i are out-of-core arrays. In the optional part, the keyword IN_MEM indicates that only the array portions corresponding to ic_portion are allowed to be kept in memory. The larger the IC portion the better, as it reduces the number of disk accesses. (A concrete declaration following this form is sketched at the end of this subsection.)

The process of transforming a Vienna Fortran out-of-core program into the out-of-core SPMD program can be conceptually divided into five major steps:

(i) Distribution of each out-of-core array among the available processors. Array elements that are assigned to a processor according to the data distribution are initially stored on disks. Furthermore, the resulting mapping determines the work distribution. Based on the IN_MEM specification, memory for in-core array portions is allocated.

(ii) Distribution of the computation among the processors. The work distribution step determines for each processor the execution set, i.e., the set of loop iterations to be executed by this processor. The main criteria are to operate on data associated with the "nearest" disks and to optimize the load balance. In most cases the owner-computes-rule strategy is applied: the processor that owns the data element updated in an iteration performs the computation.

(iii) Splitting execution sets into tiles. The computation assigned to a processor is performed in stages called tiles, where each stage operates on a different slab. Loop iterations are partitioned so that one slab can be processed in each phase.

(iv) Insertion of I/O and communication statements. Depending on the data and work distribution, the compiler determines whether the data needed is in the local or remote in-core portion or on a disk, and then detects the type of communication and I/O operation required.

(v) Generation of a Section Access Graph (SAG) as the support for efficient software-controlled prefetching [4]. I/O latency can be partially reduced by executing prefetch operations that move data close to the processor before it is actually needed. In our approach, the compile-time knowledge about the I/O requirements of the program parts is represented by the Section Access Graph. This graph is incrementally constructed in the program database of VFCS during the compilation process and written to a file at the end of compilation. The SAG is used by VIPIOS in the optimization of prefetching.
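As a concrete, hypothetical instance of the annotation introduced above, the following declaration marks a one-dimensional array as out-of-core and restricts the portion that may be kept in memory; the array name, sizes, processor arrangement, and in-core section are chosen here only for illustration and are not taken from the paper's examples.

    PROCESSORS P1D(64)
    REAL B(8000000) DIST (BLOCK) TO P1D, OUT_OF_CORE, IN_MEM (B(1:50000))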

3 Advanced Runtime Support

The goal of the proposed advanced runtime system is to provide an efficient parallel mass storage I/O framework [2] for parallel I/O operations and out-of-core data structures of the VFCS. The central component of the framework is a novel runtime module referred to as VIPIOS (VIenna Parallel Input/Output System).

The framework distinguishes between two types of processes: application processes and VIPIOS servers. The application processes are created by the VFCS; according to the SPMD paradigm, each of them executes the same program on a different part of the data space. The VIPIOS servers run independently on all or on a number of dedicated nodes and perform the data requests of the application processes.

The number and the location of the VIPIOS servers are defined during the VIPIOS system start-up phase, which is generally part of the boot process of the machine. The default configuration is based on the properties of the hardware system. During runtime it is possible to change the configuration according to the requirements of the application processes via a VIPIOS supervisor server process, which administers all other VIPIOS processes. Summing up, the configuration depends on the underlying hardware architecture (disk arrays, local disks, specialized I/O nodes, etc.), the system configuration (number and types of available nodes, etc.), the VIPIOS system administration (number of serviced nodes, disks, application processes, etc.), and user needs (I/O characteristics, regular or irregular problems, etc.).

The VIPIOS servers are similar to data server processes in database systems. Each application process is assigned exactly one VIPIOS server, which accomplishes its data requests, but one VIPIOS server can serve a number of application processes; in other words, one-to-one or one-to-many relationships exist. All data requests of an application process are transparently caught by its assigned VIPIOS server. The servers access locally or remotely stored data and ensure that each application process has access to its requested data items.

The VFCS provides information about the problem-specific data distribution, the stride size of the slabs of the out-of-core data structures, and the presumed data access profile. Based on this information, VIPIOS organizes the data and tries to ensure high performance for data access operations. Additional data distribution and usage information can be provided by the Vienna Fortran programmer using new language constructs. This type of information allows the VFCS/VIPIOS system to parallelize read and write operations by selecting a well-suited data organization in the files.

An important advantage of the proposed framework is the support of a wide spectrum of mass storage architectures, e.g., global disk systems connected via a fast bus (like HIPPI) or local disks connected directly to nodes. In any case, the architecture is transparent to the application programmer as well as to the VF compiler developer.

3.1 Data Locality

The design principle of VIPIOS for achieving high data access performance is data locality. This means that the data requested by an application process should be read from, or written to, the best-suited disk. Generally the choice of the disks, and correspondingly of the administrating servers, is based on the data distribution of the application problem. We distinguish between logical and physical data locality.

Logical data locality denotes the choice of the best-suited VIPIOS server for an application process. This server is determined by the topological distance and/or the process characteristics. Special process characteristics, such as available memory or the best disk list (see the next paragraph), can also influence VIPIOS server performance; therefore a remote VIPIOS server may even provide better performance than a closer one. At any rate, exactly one specific VIPIOS server is chosen for each application process and handles its requests. This server is called the buddy server of the process, while all other servers are called foe servers with respect to this process.

The physical data locality principle aims to determine the disk set providing the best (usually the fastest) data access. For each node an ordered sequence of the accessible disks of the system is defined (the best disk list, BDL), which orders the disks according to their access behavior. Disks with good access characteristics precede disks with bad ones in this list. The ordering can be defined by technical disk characteristics, such as seek time, transfer rate, etc., and/or by the location of the disks in the system architecture. The VIPIOS server thus chooses from the BDL the actual disk administrating the data of a specific application process. In most cases it will choose the disk(s) both according to the BDL of the node it is executing on and according to the physical restrictions of the disks (memory requirements, workload, etc.). Other criteria that are not hardware oriented, such as the size of the stored data structure or data security, may also influence this decision. (A minimal sketch of such a selection appears at the end of this subsection.)

Fig. 3. Process model of application processes and VIPIOS servers: on each node an application process AP is linked with the VIPIOS call interface VI; one VIPIOS server acts as buddy (b) to a given process and the other servers as foes (f); each node has a best disk list (BDL) over Disk1, Disk2, and Disk3.

The process model is depicted in Figure 3. The VIPIOS call interface VI, which is linked with the application process AP, handles the communication with the assigned VIPIOS server.
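The following minimal sketch illustrates the physical-locality decision described above. The function name, the arguments, and the selection criterion (enough free capacity) are invented here for illustration only; they do not reproduce the actual VIPIOS selection logic.

    ! Illustrative sketch only: pick the first disk in the best disk list (BDL)
    ! of a node that satisfies a simple physical restriction.
    integer function choose_disk(n, bdl, free, need)
       integer, intent(in) :: n          ! number of disks in the BDL
       integer, intent(in) :: bdl(n)     ! disk ids, ordered best to worst
       integer, intent(in) :: free(*)    ! free capacity per disk id
       integer, intent(in) :: need       ! capacity required by the data
       integer :: i
       choose_disk = -1                  ! -1 signals "no suitable disk"
       do i = 1, n
          if (free(bdl(i)) >= need) then
             choose_disk = bdl(i)        ! first acceptable disk in BDL order
             return
          end if
       end do
    end function choose_disk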

3.2 Two-Phase Data Administration Process

The data administration process of a VIPIOS server can be divided into two phases, the preparation phase and the administration phase (see Figure 4).

The preparation phase prepares the administered data according to the data layout of the data structure, the presumed access profile, and the physical restrictions of the system (available main memory, disk space, etc.). This phase is performed during the compilation process and the costly system start-up phase, and precedes the execution of the application process. In this phase the physical data layout schemas are defined, and the actual VIPIOS server process for the application process as well as the disks for the stored data are chosen according to the locality principles. Furthermore, the data storage areas are prepared, the necessary main memory buffers are allocated, etc.

The administration phase accomplishes the I/O requests of the application processes. It is obvious that the preparation phase is the basis for good I/O performance; all optimizations are performed in this phase.

Fig. 4. Two-phase data administration process: during compilation and start-up, the Vienna Fortran program drives the VIPIOS preparation phase; during execution, the executing OOC program is served by the VIPIOS administration phase.

4 Conclusions

As mentioned in the preceding sections, high performance languages generally lack efficient parallel I/O support. A possible approach is the development of an integrated runtime subsystem that is optimized for HPF language systems. As a main goal, physical data distributions should adapt to the requirements of the problem characteristics specified in the application program.

References

[1] R.R. Bordawekar, A.N. Choudhary, Language and Compiler Support for Parallel I/O, Proc. IFIP Working Conf. on Programming Environments for Massively Parallel Distributed Systems (Switzerland, 1994).

[2] P. Brezany, T.A. Mueck, E. Schikuta, Language, Compiler and Parallel Database Support for I/O Intensive Applications, Proc. High Performance Computing and Networking Europe 1995 (Milano, 1995) 14-20.

[3] D. Kotz, Disk-Directed I/O for MIMD Multiprocessors, Proc. First USENIX Symp. on Operating Systems Design and Implementation (Monterey, CA, 1994) 61-74.

[4] R.H. Patterson, G.A. Gibson, E. Ginting, D. Stodolsky, J. Zelenka, Informed Prefetching and Caching, Tech. Rep. CMU-CS-95-134, Carnegie Mellon University (1995).

[5] H. Zima, P. Brezany, B. Chapman, P. Mehrotra, A. Schwald, Vienna Fortran: A Language Specification, ACPC Technical Report Series, University of Vienna (1992); also available as ICASE Interim Report 21, MS 132c, NASA, Hampton, VA 23681.