Mass Storage Support for a Parallelizing Compilation System

Peter Brezany (a), Thomas A. Mueck (b), Erich Schikuta (c)

(a) Institute for Software Technology and Parallel Systems, University of Vienna, Liechtensteinstrasse 22, A-1092 Vienna, Austria, brezany@par.univie.ac.at
(b) Institute for Computer Science, University of Potsdam, Am Neuen Palais 10, D-14469 Potsdam, Germany, mueck@samuel.cs.uni-potsdam.de
(c) Department of Data Engineering, University of Vienna, Rathausstrasse 19/4, A-1010 Vienna, Austria, schiki@ifs.univie.ac.at

Keywords: parallel input/output, high performance mass storage system, high performance languages, compilation techniques, data administration

1 Introduction

Languages like HPF and Vienna Fortran [5] and their compilers have been designed to improve the practical applicability of massively parallel systems. To accelerate the transition of these systems into fully operational environments, it is also necessary to develop appropriate language constructs and software tools that support application programmers in the development of large I/O-intensive applications [1,3].

This paper focuses on mass storage support for the Vienna Fortran Compilation System (VFCS) to enable efficient execution of parallel I/O operations and of operations on out-of-core (OOC) structures. The use of OOC structures implies I/O operations: due to main memory constraints, some parts of these data structures (e.g., large arrays) must be swapped to disk. The approach outlined in this paper is based on two main concepts:

(i) Vienna Fortran language extensions and compilation techniques. We propose constructs to specify OOC structures and I/O operations for distributed data structures in the context of Vienna Fortran. These operations can be used by the programmer to provide information that helps the compiler and the runtime environment to operate the underlying I/O subsystem efficiently.

(ii) Integrated advanced runtime support. The modules of VFCS that process I/O operations and handle OOC structures are coupled to a mass storage oriented runtime system called VIPIOS (Vienna Parallel I/O System).

The objective of the proposed integrated compile-time and runtime optimizations is to minimize the number of disk accesses for file I/O and OOC processing. A central issue in this context is to increase the main memory buffer hit ratio.

2 Language and Compiler Support

2.1 Processing Explicit I/O Operations

Distributed data structures are stored in the parallel I/O subsystem as parallel files. The file layout may be optimized by VIPIOS to achieve efficient I/O data transfer. In the context of an OPEN or WRITE statement, the user may give a hint to the compilation system that data in the file will be written to or read from an array of a given distribution. The hint specification is provided by a new optional specifier IO_DIST in the OPEN or WRITE statement. An intended distribution (or a class of distributions) can be bound to a name by means of an I/O distribution type definition. Furthermore, an I/O distribution type definition may have arguments to allow parameterization.

    S1:   PROCESSORS P1D(64)
    S2:   REAL A(40000) DIST (BLOCK) TO P1D

          IO_DTYPE REG1(M, N1, N2, K1, K2)
            TARGET_PROCESSORS P2D(M,M)
            ELM_TYPE REAL
            TARGET_ARRAY A(N1,N2)
            A DIST (CYCLIC(K1), CYCLIC(K2)) TO P2D
          END IO_DTYPE REG1

    O1:   OPEN (u1 = 8, FILE = 'exam1.dat', MODE = 'PF', STATUS = 'NEW')
          WRITE (u1, IO_DIST = REG1(8,400,100,4,2)) A
          ...

Fig. 1. Opening and writing to a parallel file: examples

According to line O1 in Fig. 1, unit u1 is connected to the parallel file 'exam1.dat'. The elements of the distributed array A (the BLOCK type distribution onto a one-dimensional processor array is specified in lines S1 and S2) are written to this file so as to optimize reading them into real arrays which have the shape (400,100) and are distributed as (CYCLIC(4), CYCLIC(2)) onto a grid of processors having the shape (8,8). This I/O distribution can be changed by a REORGANIZE statement.

At compile time, the translation of parallel I/O operations conceptually consists of two phases: basic compilation, which extracts parameters about data distributions and file access patterns from the VF program and passes this information to the VIPIOS primitives, and advanced optimizations, including code restructuring based on program analysis.
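To make the first phase more concrete, the following minimal sketch shows the kind of call the basic compilation phase might emit for the WRITE statement of Fig. 1. The routine name VIPIOS_SET_IO_DIST and its argument order are invented here purely for illustration; they are not part of the actual VIPIOS primitive interface.

    ! Hypothetical lowering of the IO_DIST hint of Fig. 1 (all names invented).
    ! The compiler extracts the parameters of REG1(8,400,100,4,2), i.e. the
    ! target processor grid (8,8), the target array shape (400,100), and the
    ! CYCLIC block lengths (4,2), and hands them to the runtime system before
    ! the data transfer takes place.
    CALL VIPIOS_SET_IO_DIST(u1, 8, 8, 400, 100, 4, 2)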

2.2 Processing Out-of-Core Programs

From the scientific application programmer's point of view there are no significant differences between the proposed OOC programming model and the traditional in-core model (note that when developing in-core programs the programmer has to specify only the data distribution and, in some cases, also the work distribution). The goal is to preserve for the programmer the model of unlimited main memory. It is assumed that an in-core version of the VF program is converted into the appropriate OOC VF form.

Fig. 2. Out-of-core programming model: the application programmer converts a Vienna Fortran in-core program, guided by hardware and system software parameters, into a Vienna Fortran out-of-core program.

A graphical sketch of how the conversion can be done is depicted in Figure 2. The programmer analyzes the VF program and predicts the memory requirements of the program after its parallelization. If there is not enough main memory for in-core (IC) execution on the given target architecture, the programmer annotates some large data structures that have to be processed using OOC techniques. The programmer's decision is also based on knowledge of the features of the target system hardware (memory capacity) and software (memory requirements).

All computations are performed on the data in local main memories. VFCS restructures the source out-of-core program in such a way that, during computation, sections of the array are fetched from disks into the local memory, the new values are computed, and the updated sections are written back to disks if necessary. The computation is performed in phases, where each phase operates on a different part of the array called a slab. Loop iterations are partitioned so that data of a fixed slab size can be processed in each phase. Each local main memory accesses the individual slabs through a "window" referred to as the in-core portion of the array.
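This restructuring can be pictured with a minimal Fortran 90 sketch. The routines read_slab and write_slab merely stand for the I/O operations inserted by the compiler; they, and all sizes used below, are invented for illustration and are not part of VFCS or VIPIOS.

    ! Slab-wise processing of a large array through an in-core window
    ! (illustrative sketch only; all names and sizes are invented).
    integer, parameter :: n    = 1000000     ! logical size of the OOC array
    integer, parameter :: slab = 10000       ! size of the in-core portion
    real    :: window(slab)                  ! the in-core "window" on the array
    integer :: lo, nel

    do lo = 1, n, slab                       ! one phase per slab
       nel = min(slab, n - lo + 1)
       call read_slab(window, lo, nel)       ! fetch the slab from disk
       window(1:nel) = 2.0 * window(1:nel)   ! compute new values on the slab
       call write_slab(window, lo, nel)      ! write the updated slab back
    end do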

VFCS has to be told which arrays are out-of-core and what the shape and size of the corresponding in-core portions are. This information is given in the form of an OOC annotation:

    REAL ad_1, ..., ad_r  dist_spec, OUT_OF_CORE [, IN_MEM (ic_portion)]

where ad_i, 1 ≤ i ≤ r, specify array identifiers B_i and their index domains, and dist_spec represents a Vienna Fortran distribution-specification annotation. The keyword OUT_OF_CORE indicates that all B_i are out-of-core arrays. In the optional part, the keyword IN_MEM indicates that only the array portions corresponding to ic_portion are allowed to be kept in memory. The larger the IC portion the better, as it reduces the number of disk accesses. (A concrete declaration following this form is sketched at the end of this subsection.)

The process of transforming a Vienna Fortran out-of-core program into the out-of-core SPMD program can be conceptually divided into five major steps:

(i) Distribution of each out-of-core array among the available processors. Array elements that are assigned to a processor according to the data distribution are initially stored on disks. Furthermore, the resulting mapping determines the work distribution. Based on the IN_MEM specification, memory for in-core array portions is allocated.

(ii) Distribution of the computation among the processors. The work distribution step determines for each processor the execution set, i.e., the set of loop iterations to be executed by this processor. The main criteria are to operate on data associated with the "nearest" disks and to optimize the load balance. In most cases the owner-computes-rule strategy is applied: the processor that owns the data element updated in an iteration performs the computation.

(iii) Splitting execution sets into tiles. The computation assigned to a processor is performed in stages called tiles, where each stage operates on a different slab. Loop iterations are partitioned so that one slab can be processed in each phase.

(iv) Insertion of I/O and communication statements. Depending on the data and work distribution, the compiler determines whether the data needed is in the local or remote in-core portion or on a disk, and then detects the type of communication and I/O operation required.

(v) Generation of a Section Access Graph (SAG) as the support for efficient software-controlled prefetching [4]. I/O latency can be partially reduced by executing prefetch operations that move data close to the processor before it is actually needed. In our approach, the compile-time knowledge about the I/O requirements of the program parts is represented by the Section Access Graph. This graph is incrementally constructed in the program database of VFCS during the compilation process and written to a file at the end of compilation. The SAG is used by VIPIOS in the optimization of prefetching.
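As a concrete, hypothetical instance of the annotation introduced above, the following declaration marks a one-dimensional array as out-of-core and restricts the portion that may be kept in memory; the array name, sizes, processor arrangement, and in-core section are chosen here only for illustration and are not taken from the paper's examples.

    PROCESSORS P1D(64)
    REAL B(8000000) DIST (BLOCK) TO P1D, OUT_OF_CORE, IN_MEM (B(1:50000))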

3 Advanced Runtime Support

The goal of the proposed advanced runtime system is to provide an efficient parallel mass storage I/O framework [2] for parallel I/O operations and out-of-core data structures of the VFCS. The central component of the framework is a novel runtime module referred to as VIPIOS (VIenna Parallel Input/Output System).

The framework distinguishes between two types of processes: application processes and VIPIOS servers. The application processes are created by the VFCS; according to the SPMD paradigm, each of them executes the same program on a different part of the data space. The VIPIOS servers run independently on all or on a number of dedicated nodes and perform the data requests of the application processes.

The number and the location of the VIPIOS servers are defined during the VIPIOS system start-up phase, which is generally part of the boot process of the machine. The default configuration is based on the properties of the hardware system. During runtime it is possible to change the configuration according to the requirements of the application processes via a VIPIOS supervisor server process, which administers all other VIPIOS processes. Summing up, the configuration depends on the underlying hardware architecture (disk arrays, local disks, specialized I/O nodes, etc.), the system configuration (number and types of available nodes, etc.), the VIPIOS system administration (number of serviced nodes, disks, application processes, etc.), and user needs (I/O characteristics, regular or irregular problems, etc.).

The VIPIOS servers are similar to data server processes in database systems. Each application process is assigned exactly one VIPIOS server, which accomplishes its data requests, but one VIPIOS server can serve a number of application processes; in other words, one-to-one or one-to-many relationships exist. All data requests of an application process are transparently caught by its assigned VIPIOS server. The servers access locally or remotely stored data and ensure that each application process has access to its requested data items.

The VFCS provides information about the problem-specific data distribution, the stride size of the slabs of the out-of-core data structures, and the presumed data access profile. Based on this information, VIPIOS organizes the data and tries to ensure high performance for data access operations. Additional data distribution and usage information can be provided by the Vienna Fortran programmer using new language constructs. This type of information allows the VFCS/VIPIOS system to parallelize read and write operations by selecting a well-suited data organization in the files.

An important advantage of the proposed framework is the support of a wide spectrum of mass storage architectures, e.g., global disk systems connected via a fast bus (like HIPPI) or local disks connected directly to nodes. In any case, the architecture is transparent to the application programmer as well as to the VF compiler developer.

3.1 Data Locality

The design principle of VIPIOS for achieving high data access performance is data locality. This means that the data requested by an application process should be read from, or written to, the best-suited disk. Generally the choice of the disks, and correspondingly of the administrating servers, is based on the data distribution of the application problem. We distinguish between logical and physical data locality.

Logical data locality denotes the choice of the best-suited VIPIOS server for an application process. This server is determined by the topological distance and/or the process characteristics. Special process characteristics, such as available memory or the best disk list (see the next paragraph), can also influence VIPIOS server performance; therefore a remote VIPIOS server may even provide better performance than a closer one. At any rate, exactly one specific VIPIOS server is chosen for each application process and handles its requests. This server is called the buddy server of the process, while all other servers are called foe servers with respect to this process.

The physical data locality principle aims to determine the disk set providing the best (usually the fastest) data access. For each node an ordered sequence of the accessible disks of the system is defined (the best disk list, BDL), which orders the disks according to their access behavior. Disks with good access characteristics precede disks with bad ones in this list. The ordering can be defined by technical disk characteristics, such as seek time, transfer rate, etc., and/or by the location of the disks in the system architecture. The VIPIOS server thus chooses from the BDL the actual disk administrating the data of a specific application process. In most cases it will choose the disk(s) both according to the BDL of the node it is executing on and according to the physical restrictions of the disks (memory requirements, workload, etc.). Other criteria that are not hardware oriented, such as the size of the stored data structure or data security, may also influence this decision. (A minimal sketch of such a selection appears at the end of this subsection.)

Fig. 3. Process model of application processes and VIPIOS servers: on each node an application process AP is linked with the VIPIOS call interface VI; one VIPIOS server acts as buddy (b) to a given process and the other servers as foes (f); each node has a best disk list (BDL) over Disk1, Disk2, and Disk3.

The process model is depicted in Figure 3. The VIPIOS call interface VI, which is linked with the application process AP, handles the communication with the assigned VIPIOS server.
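The following minimal sketch illustrates the physical-locality decision described above. The function name, the arguments, and the selection criterion (enough free capacity) are invented here for illustration only; they do not reproduce the actual VIPIOS selection logic.

    ! Illustrative sketch only: pick the first disk in the best disk list (BDL)
    ! of a node that satisfies a simple physical restriction.
    integer function choose_disk(n, bdl, free, need)
       integer, intent(in) :: n          ! number of disks in the BDL
       integer, intent(in) :: bdl(n)     ! disk ids, ordered best to worst
       integer, intent(in) :: free(*)    ! free capacity per disk id
       integer, intent(in) :: need       ! capacity required by the data
       integer :: i
       choose_disk = -1                  ! -1 signals "no suitable disk"
       do i = 1, n
          if (free(bdl(i)) >= need) then
             choose_disk = bdl(i)        ! first acceptable disk in BDL order
             return
          end if
       end do
    end function choose_disk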

3.2 Two-Phase Data Administration Process

The data administration process of a VIPIOS server can be divided into two phases, the preparation phase and the administration phase (see Figure 4).

The preparation phase prepares the administered data according to the data layout of the data structure, the presumed access profile, and the physical restrictions of the system (available main memory, disk space, etc.). This phase is performed during the compilation process and the costly system start-up phase, and precedes the execution of the application process. In this phase the physical data layout schemas are defined, and the actual VIPIOS server process for the application process as well as the disks for the stored data are chosen according to the locality principles. Furthermore, the data storage areas are prepared, the necessary main memory buffers are allocated, etc.

The administration phase accomplishes the I/O requests of the application processes. It is obvious that the preparation phase is the basis for good I/O performance; all optimizations are performed in this phase.

Fig. 4. Two-phase data administration process: during compilation and start-up, the Vienna Fortran program drives the VIPIOS preparation phase; during execution, the executing OOC program is served by the VIPIOS administration phase.

4 Conclusions

As mentioned in the preceding sections, high performance languages generally lack efficient parallel I/O support. A possible approach is the development of an integrated runtime subsystem that is optimized for HPF language systems. As a main goal, physical data distributions should adapt to the requirements of the problem characteristics specified in the application program.

References

[1] R.R. Bordawekar, A.N. Choudhary, Language and Compiler Support for Parallel I/O, Proc. IFIP Working Conf. on Programming Environments for Massively Parallel Distributed Systems (Switzerland, 1994).

[2] P. Brezany, T.A. Mueck, E. Schikuta, Language, Compiler and Parallel Database Support for I/O Intensive Applications, Proc. High Performance Computing and Networking Europe 1995 (Milano, 1995) 14-20.

[3] D. Kotz, Disk-Directed I/O for MIMD Multiprocessors, Proc. First USENIX Symp. on Operating Systems Design and Implementation (Monterey, CA, 1994) 61-74.

[4] R.H. Patterson, G.A. Gibson, E. Ginting, D. Stodolsky, J. Zelenka, Informed Prefetching and Caching, Tech. Rep. CMU-CS-95-134, Carnegie Mellon University (1995).

[5] H. Zima, P. Brezany, B. Chapman, P. Mehrotra, A. Schwald, Vienna Fortran: A Language Specification, ACPC Technical Report Series, University of Vienna (1992); also available as ICASE Interim Report 21, MS 132c, NASA, Hampton, VA 23681.