Transactions on Information and Communications Technologies vol 3, 1993, WIT Press

The implementation of a general purpose FORTRAN harness for an arbitrary network of transputers for computational fluid dynamics

J. Mushtaq, A.J. Davies and D.J. Morgan

ABSTRACT

Many Computational Fluid Dynamics (CFD) problems of interest require far greater computational power than is available on any sequential machine. In CFD problems, where a large number of similar operations are performed, a parallel machine can be utilised to exploit the inherent parallelism of the algorithm. Distributed memory machines, although requiring extra programming effort, can provide truly scalable performance at a fraction of the cost of current vector supercomputers. At the University of Hertfordshire, part of the current work programme involves the parallel implementation of a sequential 2-D Navier-Stokes multiblock aerofoil code. In order to utilise an arbitrary network of transputers, it is necessary to have software which can effect the communication of data between processors and also schedule this data for processing. This paper is concerned with the development and implementation of a general purpose FORTRAN harness for a distributed memory machine with an arbitrary number of processors and an arbitrary hardware configuration. The harness is therefore not confined to the CFD work, but applies to any problem where a large number of similar operations is performed. The harness, which comprises many concurrently executing processes, is replicated to all the transputers in the network. The data is sorted into order and distributed to the network so that, as nearly as possible, each transputer is responsible for performing the same amount of work. This ensures that the distribution of computational load is even, thereby preventing the transputer with the most work from holding up the others.

To illustrate the design and performance of the harness, a simple five-point solution to the potential problem is considered in this paper.

INTRODUCTION

The harness, written in Parallel 3L Fortran [1], is ultimately intended for porting the multiblock, two-dimensional, Navier-Stokes aerofoil code onto a network of transputers. For the purposes of debugging and verifying the harness independently of the aerofoil code, a simple model which simulates the data processed by the aerofoil code was constructed. Laplace's potential equation, which simply uses the mean value of the four adjacent cells to update a cell, is used for the 'worker' process. This paper describes the design and performance of a multiblock Laplace solver on a network of T800 transputers.

It is essentially the multiblock algorithm that makes the problem suitable for parallelisation. The discretised domain of computation is divided into subdomains, called blocks, thus creating internal boundaries between blocks. Before a block can be updated, a transfer of data, called halo data, across internal boundaries is necessary. This in turn necessitates the communication of halo data between transputers. The purpose of the harness is to schedule blocks for processing and to effect the communication of halo data.

The harness, which comprises many concurrently executing processes, is replicated to all the transputers in the network. The blocks are sorted into order on the basis of their size and then distributed to the network so that, as nearly as possible, each transputer is responsible for updating the same number of cells. This ensures that the distribution of the computational load is even, thereby preventing the transputer with the most work from holding up all the others.
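The distribution strategy just described, sort the blocks by size and hand each to the least-loaded transputer, is easy to state in code. The following is a minimal sketch in modern Fortran; the block sizes, the counts and all names are illustrative, not taken from the harness itself.

```fortran
! Sketch of the block distribution strategy: blocks are sorted by size
! and each is assigned to the least-loaded transputer (a standard
! largest-processing-time heuristic). Illustrative only.
program distribute_blocks
  implicit none
  integer, parameter :: nblocks = 12, nprocs = 3
  integer :: cells(nblocks)     ! cells per block (assumed values)
  integer :: owner(nblocks)     ! transputer assigned to each block
  integer :: load(nprocs)       ! running cell count per transputer
  integer :: order(nblocks)
  integer :: i, j, k, p

  cells = (/100, 80, 100, 60, 100, 100, 80, 60, 100, 100, 80, 60/)

  ! Sort block indices by size, largest first (insertion sort).
  do i = 1, nblocks
     order(i) = i
  end do
  do i = 2, nblocks
     k = order(i)
     j = i - 1
     do while (j >= 1)
        if (cells(order(j)) >= cells(k)) exit
        order(j+1) = order(j)
        j = j - 1
     end do
     order(j+1) = k
  end do

  ! Assign each block to the currently least-loaded transputer.
  load = 0
  do i = 1, nblocks
     p = minloc(load, dim=1)
     owner(order(i)) = p
     load(p) = load(p) + cells(order(i))
  end do

  do p = 1, nprocs
     print '(a,i2,a,i6,a)', 'transputer ', p, ' updates ', load(p), ' cells'
  end do
end program distribute_blocks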

TRANSPUTERS AND MIMD ARCHITECTURES

Transputers belong to a class of parallel machines known as Multiple Instruction Multiple Data (MIMD). Each transputer is directly connected to its own local memory, holding both data and source code. Thus different transputers in the network may hold different data and perform quite different operations on that data. In CFD problems, however, the transputer network is usually programmed as Single Instruction Multiple Data (SIMD), where the same governing equations are solved over each block. Moreover, in the case of the harness, where the same code is replicated to all processors in the network, the transputer is coded as Single Program Multiple Data (SPMD).

The T800 transputer is a high performance chip with four communication links, a 32-bit Central Processing Unit (CPU) and a 64-bit Floating Point Unit (FPU). The most important feature of the T800 is that the CPU, FPU and each communication link can operate in parallel within each transputer. This allows the message-passing overhead to be hidden to a large extent, since communication can be overlapped with local computation. All transputers in the network execute locally resident source code and process locally resident data. During communication between processors, data is explicitly sent from one processor to another via the transputers' serial links. If the processors are not directly connected then intermediate processors are necessary to store and forward the message through the network towards its destination transputer.

PARALLELISM VIA GEOMETRIC DECOMPOSITION

Geometric or data parallelism is the most natural sub-division of the workload for calculations over a region of space. The data is divided into sub-domains, called blocks. The shape of the blocks has an important effect on the communication-to-computation ratio. For example, a rectangular block with p.q points has 2(p+q) boundary points, while a square block with the same p.q points has 4√(p.q) boundary points. Now 2(p+q) ≥ 4√(p.q), so square blocks are favoured, since there is less halo data to communicate.
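The inequality is an instance of the arithmetic-geometric mean (AM-GM) inequality; a one-line check (not in the original, but standard):

```latex
% A p-by-q block has 2(p+q) boundary points; a square block with the
% same p.q points has 4*sqrt(pq). By AM-GM,
\[
\frac{p+q}{2} \;\ge\; \sqrt{pq}
\quad\Longrightarrow\quad
2(p+q) \;\ge\; 4\sqrt{pq},
\]
% with equality only when p = q, so square blocks minimise halo data.
```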

BLOCK HALO DATA

Once a multiblock domain has been established, calculations on each block can begin in parallel provided the block's boundary conditions are known. A boundary may be either a physical boundary or an internal boundary arising from the domain decomposition. Physical boundaries are handled by the source code. An internal boundary requires boundary data from its neighbour, which may reside on a different processor. This data is provided by allocating a buffer on the boundary of each block which stores a copy of the corresponding overlap, or halo, data.

[Figure 1: buffer cells for halo data; a ring of buffer cells for boundary conditions surrounds the internal cells of the block]

Once halo data has been received for all sides of the block, an update of the block, using the sequential algorithm, can commence. On completion of the algorithm, the current block's boundary data is sent to the neighbouring blocks. The current block then waits for its halo data to be refreshed so that the next update can commence. The communication of the halo data and the scheduling of the blocks for processing are effected by the harness (a sketch of one such update sweep is given at the end of this section, after the list of harness processes).

STRUCTURE OF THE HARNESS

The harness [6] comprises many concurrent processes, namely:

1. the control process,
2. the worker process,
3. the first-in first-out buffers (FIFOs),
4. the transporter processes,
5. the dynamic memory allocator.
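As promised above, a minimal sketch of one block update in modern Fortran: the block carries a one-cell halo, and each interior cell is replaced by the mean of its four neighbours. The array names and the Jacobi-style temporary are assumptions of this sketch, not the harness's actual worker code.

```fortran
! One update sweep of a block stored with a one-cell halo: interior
! cells are updated from the mean of their four neighbours once the
! halo has been refreshed.
subroutine update_block(phi, p, q)
  implicit none
  integer, intent(in) :: p, q
  real, intent(inout) :: phi(0:p+1, 0:q+1)   ! rows 0 and p+1, columns 0 and q+1 are halo
  real :: new(p, q)
  integer :: i, j

  ! Jacobi sweep: every interior cell becomes the mean of its neighbours.
  do j = 1, q
     do i = 1, p
        new(i, j) = 0.25 * (phi(i-1, j) + phi(i+1, j) &
                          + phi(i, j-1) + phi(i, j+1))
     end do
  end do
  phi(1:p, 1:q) = new

  ! After this sweep the edge rows and columns phi(1,1:q), phi(p,1:q),
  ! phi(1:p,1) and phi(1:p,q) would be sent to the neighbouring blocks
  ! as their halo data.
end subroutine update_block
```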

[Figure 2: FIFO and Control interconnection on a transputer; legend: hard link channel, memory mapped channel]

[Figure 3: detail structure diagram for one FIFO, link guardian, worker, memory allocator and control, showing memory mapped channels and the direction of data flow; allocator traffic carries free/alloc requests with a pointer and a slice length n]

Figure 2 illustrates the process interconnection within the harness package, showing how the transporter processes input and output halo data, via the FIFOs, to and from the control process. Figure 3 shows a structure graph, which illustrates the actual access connections between tasks, together with the data flow between them. As well as giving some information about the sequencing of interactions, the structure graph represents a static picture of the structure of the system, including both control and data interactions.

The structure graph shows how the transporter, worker and control processes may communicate with the memory allocator to request or to free heap space. The channels connecting the worker and control tasks allow the worker to request from control a block number to process, and control to acknowledge with the number of a block that has had all the necessary halo data communicated to it. Before a block can be updated, halo data from physically adjacent blocks, at an appropriate time level, is required. This necessitates the communication of halo data between transputers, which is performed by the control process and the transporter tasks, via the FIFOs.

The harness is implemented for an arbitrary network of transputers. This is achieved by allowing the transputers to investigate their interconnection: for the efficient communication of halo data, each transputer needs to know the shortest path to all other transputers.
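The paper does not give the routing algorithm itself, so purely as an illustration, here is one standard way a table of shortest hop counts could be computed from the link topology: a Floyd-Warshall pass over an adjacency matrix. The 4-transputer chain is an assumed example, not the network used in the paper.

```fortran
! Derive shortest paths through an arbitrary link topology by relaxing
! routes through every intermediate transputer (Floyd-Warshall).
program shortest_paths
  implicit none
  integer, parameter :: n = 4, inf = 999999
  integer :: dist(n, n)
  integer :: i, j, k

  ! Hop counts: 0 on the diagonal, 1 for a direct link, "infinite" otherwise.
  dist = inf
  do i = 1, n
     dist(i, i) = 0
  end do
  do i = 1, n - 1            ! assumed chain topology: 1-2-3-4
     dist(i, i+1) = 1
     dist(i+1, i) = 1
  end do

  ! Relax paths through every intermediate transputer k.
  do k = 1, n
     do i = 1, n
        do j = 1, n
           if (dist(i, k) + dist(k, j) < dist(i, j)) &
              dist(i, j) = dist(i, k) + dist(k, j)
        end do
     end do
  end do

  print '(4i4)', ((dist(i, j), j = 1, n), i = 1, n)
end program shortest_paths
```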

DISTRIBUTION OF INITIAL DATA

The initial data distributed to the network of transputers at start-up consists mainly of two types:

1. the data required by the worker process to perform an iteration of Laplace's potential equation (block data). This consists of grid data, flow data and control data, which are separated into 'packets' representing the data for each block of the multiblock computational domain;

2. the local and global data required by the harness (in the form of lists). These lists are essentially look-up tables which describe the distribution strategy and the topology of the network.

MEMORY ALLOCATOR

The memory allocator allows concurrent tasks to use the same array, the heap, in a checked manner. 3L FORTRAN has nothing to compare directly with the pointer types of PASCAL or ADA. Therefore, in order to make use of dynamic structures in FORTRAN, the best one can do is to simulate dynamic storage by assigning large arrays for the purpose and to use INTEGER values for links [5]. Each element of the array 'points' to the next: recorded with each element is an 'arrow' in the LINK array giving where the next element can be found. The heap is stored in a COMMON block, so that all concurrent processes within the harness package have access to this memory block. This requires that control over this shared memory be handed over very carefully from task to task. Utilising the memory allocator as a concurrent task, as opposed to a package, enforces mutual exclusion by providing coordinated sharing.

The allocator, whose structure graph is shown below, performs services in response to calls from a number of user tasks. The allocator never calls, nor has control over, other tasks and always accepts calls immediately, subject only to the usual constraint of accepting one caller at a time.

[Figure 4: structure graph illustrating the memory allocator (entries guarded by spacefull = FALSE)]

Control, the worker and the transporters are each provided with a channel link for communication with the memory allocator. Any task requiring a slice of the heap in order to store data may obtain one by sending a request command and the slice length required; the memory allocator then acknowledges with a pointer to the start of the slice. Similarly, a process may free a slice of heap by sending a free command and the pointer to the beginning of the slice.
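A minimal sketch of the LINK-array idea in modern Fortran, assuming a free list threaded through the array; exhaustion checks, the acknowledgement protocol and the COMMON-block packaging of the real harness are omitted, and all names are illustrative.

```fortran
! Dynamic storage simulated with a large array and INTEGER links:
! HEAP holds the data, LINK(i) points to the next element of a slice,
! and FREEHEAD threads the free list through the array.
module heap_sim
  implicit none
  integer, parameter :: heapsize = 1000
  real    :: heap(heapsize)     ! the shared data store
  integer :: link(heapsize)
  integer :: freehead
contains
  subroutine heap_init()
    integer :: i
    do i = 1, heapsize - 1
       link(i) = i + 1          ! each free element points to the next
    end do
    link(heapsize) = 0
    freehead = 1
  end subroutine heap_init

  ! Allocate a slice of n elements; returns the index of the first one.
  function heap_alloc(n) result(ptr)
    integer, intent(in) :: n
    integer :: ptr, cur, i
    ptr = freehead
    cur = ptr
    do i = 1, n - 1             ! walk n elements off the free list
       cur = link(cur)
    end do
    freehead = link(cur)
    link(cur) = 0               ! terminate the slice
  end function heap_alloc

  ! Free a slice by splicing it back onto the front of the free list.
  subroutine heap_free(ptr)
    integer, intent(in) :: ptr
    integer :: cur
    cur = ptr
    do while (link(cur) /= 0)
       cur = link(cur)
    end do
    link(cur) = freehead
    freehead = ptr
  end subroutine heap_free
end module heap_sim
```

Note that after repeated allocation and freeing, a slice need not occupy consecutive array positions; this is exactly the interleaved-list overhead discussed in the conclusions.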

TRANSPORTERS

The transporters, which run at high priority, input or output data via the transputer serial links. Since input and output can be performed in parallel with computation, having four separate input and four separate output transporter tasks helps to streamline communication. All communication to and from the transporter tasks takes the form of an INTEGER value specifying the number of words in the slice that follows, and then the slice itself. Input transporters receive data from other transputers, which is copied to the heap via communication with the memory allocator. Output transporters send data to neighbouring transputers and then free the heap of this slice of data, also via communication with the memory allocator. The transporter tasks and the memory allocator highlight the efficiency of using dynamic memory allocation, since only the pointer and the slice length need to be communicated between tasks within the harness, as opposed to the entire slice.

FIRST-IN FIRST-OUT BUFFERS (FIFOs)

For systems with high levels of interaction activity, a buffer task is provided between the sender and the target which can absorb any temporary excess of items produced by the sender. This prevents temporary communication deadlocks and the consequent poor data throughput. The FIFOs, located between control and each transporter, decouple the control and transporter tasks by accepting communication even if the target task is not ready to receive, thus preventing congestion and latency delays. The FIFOs are clearly vital to an efficient communication system. The structure graph is shown in figure 5, and the bookkeeping is sketched below.
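The FIFO tasks need only queue (pointer, slice length) pairs, as noted above. A minimal circular-buffer sketch in modern Fortran follows; the depth and the names are illustrative, and in the harness each FIFO would be a separate task communicating over channels rather than a module.

```fortran
! Circular buffer of (pointer, slice-length) pairs: the sender's item
! is accepted even when the target task is busy, up to a fixed depth.
module fifo_buf
  implicit none
  integer, parameter :: depth = 32
  integer :: ptrs(depth), lens(depth)
  integer :: head = 1, tail = 1, count = 0
contains
  logical function fifo_put(ptr, slicelength)
    integer, intent(in) :: ptr, slicelength
    fifo_put = count < depth        ! refuse only when completely full
    if (fifo_put) then
       ptrs(tail) = ptr
       lens(tail) = slicelength
       tail = mod(tail, depth) + 1
       count = count + 1
    end if
  end function fifo_put

  logical function fifo_get(ptr, slicelength)
    integer, intent(out) :: ptr, slicelength
    ptr = 0
    slicelength = 0
    fifo_get = count > 0
    if (fifo_get) then
       ptr = ptrs(head)
       slicelength = lens(head)
       head = mod(head, depth) + 1
       count = count - 1
    end if
  end function fifo_get
end module fifo_buf
```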

[Figure 5: structure graph illustrating the FIFO buffer (guard spaceFull = FALSE; items are ptr, slicelength pairs)]

CONTROL

The purpose of the control process is to direct the communication of halo data to its destination transputers and to schedule blocks for processing by the worker process. The control process acts as an acceptor rather than a caller in its interactions with other tasks, because a call to another task risks a congestion delay. Control has five guarded inputs: four from the input transporters, and a request channel from the worker process.

Control holds two local lists, an active list and an inactive list. The active list holds the numbers of blocks that have had all the necessary halo data communicated to them and which therefore await processing by the worker process. The distribution strategy may be such that there are many blocks per transputer, in which case the required halo data may reside on the same transputer. For physically adjacent blocks on different transputers, it is possible for one block to be one iteration ahead of the other. The inactive list therefore has halo data entries for two time levels, t and t+1, which allows control to hold halo data pointers received for a block which is already active.
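A sketch of the activation bookkeeping implied by this description: a block becomes active once the expected number of halo packets has arrived, and worker requests pop block numbers off the active list. This simplification ignores the two-time-level inactive list, and all names are illustrative.

```fortran
! Control-process bookkeeping: count halo arrivals per block and
! activate a block when all its halo data is present.
module control_lists
  implicit none
  integer, parameter :: maxblocks = 48
  integer :: needed(maxblocks)       ! halo packets required per block,
                                     ! filled at start-up from the topology lists
  integer :: arrived(maxblocks) = 0  ! halo packets received so far
  integer :: active(maxblocks)       ! simple stack of runnable blocks
  integer :: nactive = 0
contains
  subroutine halo_arrived(blockno)
    integer, intent(in) :: blockno
    arrived(blockno) = arrived(blockno) + 1
    if (arrived(blockno) == needed(blockno)) then
       nactive = nactive + 1         ! all halo data present: activate
       active(nactive) = blockno
       arrived(blockno) = 0          ! reset for the next time level
    end if
  end subroutine halo_arrived

  ! Serve a worker request; returns 0 if no block is currently runnable.
  function next_block() result(blockno)
    integer :: blockno
    if (nactive > 0) then
       blockno = active(nactive)
       nactive = nactive - 1
    else
       blockno = 0
    end if
  end function next_block
end module control_lists
```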

[Figure 6: structure graph illustrating the control process (guard activelistempty = FALSE; inputs of ptr, slicelength pairs via the FIFOs; block numbers exchanged with the worker; outputs to the allocator and via the FIFOs)]

WORKER

The worker process performs an iteration of Laplace's potential equation on a block, viz.

u(i,j) = [ u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) ] / 4

The worker initiates a rendezvous with the control process by requesting an active block number to process. When acknowledged with a block, the worker extracts the main block data from the heap. The main block data contains pointers to where the halo data is located in the heap, and this too is extracted. Since the heap is a large one-dimensional array, the extracted data must be converted to multi-dimensional arrays, containing both block data and halo data, in the form required by the Laplace solver; this unpacking step is sketched at the end of this section. On completion of an iteration, the worker process sets up the halo data for communication to the destination transputers. The iteration count is incremented by one and the next active block is requested, at which point the halo data is communicated to the destination transputers by control, on de-activation.

CONFIDENCE TESTING

A simple test case, the multiblock grid shown in figure 7, was constructed which, although geometrically uncomplicated, contains most of the essential features of the general case with regard to halo data communication. The data corresponding to figure 7 is distributed to the network, where each processor holds a copy of the harness, and the functionality of the harness is then examined.
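Returning to the worker's unpacking step: a minimal sketch of copying a heap slice into the two-dimensional array form used by the solver. It assumes the slice is contiguous and packed column by column; as the conclusions note, a slice in the real heap may be interleaved and must then be followed through the LINK array instead.

```fortran
! Copy a slice of the one-dimensional heap into the two-dimensional
! block array (including its halo frame) required by the Laplace solver.
subroutine unpack_block(heap, ptr, p, q, phi)
  implicit none
  real,    intent(in)  :: heap(*)
  integer, intent(in)  :: ptr, p, q       ! start of slice, block extents
  real,    intent(out) :: phi(0:p+1, 0:q+1)
  integer :: i, j, k

  k = ptr
  do j = 0, q + 1
     do i = 0, p + 1
        phi(i, j) = heap(k)               ! consecutive heap elements
        k = k + 1                         ! fill the block column by column
     end do
  end do
end subroutine unpack_block
```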

The results from the test case on a single transputer verified the functionality of the communication and control processes. The same test case then required verification on a network of transputers, as well as confirmation of performance capabilities, by allowing parameters of interest, such as the number of blocks per processor, the block shape and the block size, to be varied. It was found that as the worker process becomes computationally intensive, the communication of the halo data becomes less critical.

PERFORMANCE OF THE COMPLETE PROGRAM

The computation time was determined for the multiblock grid shown in figure 7 (i.e. 12 blocks) for one, three, six and twelve transputers, whilst varying the computation per block. The worker process was made computationally more intensive by repeating the calculation on each block before the halo data is exchanged. Execution times for 25 iterations are tabulated below.

[Table 1: Harness execution times for figure 7(a). Columns: loop, no. of procs., CPU time (seconds), speed-up, efficiency. The numeric entries have not survived transcription.]

By increasing the 'loop' variable in table 1, the worker process was made computationally more intensive, i.e. the ratio of computation to communication was increased. The results indicate that for a computationally intensive algorithm, because the harness exchanges halo data less frequently relative to the computation, the network runs more efficiently. For the case where loop = 10000, the harness is 95.3% efficient on twelve processors.
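Since the numeric entries of table 1 are lost, it is worth recording the (presumed standard) definitions behind the speed-up and efficiency columns:

```latex
% With T_1 the time on one transputer and T_n the time on n transputers:
\[
S_n = \frac{T_1}{T_n}, \qquad E_n = \frac{S_n}{n}.
\]
% The quoted 95.3% efficiency on twelve transputers thus corresponds to
% a speed-up of S_{12} = 0.953 \times 12 \approx 11.4.
```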

The effect of altering the shape of the blocks, and of sub-dividing the blocks further whilst maintaining the same number of cells per block, was also investigated. Each block of figure 7(a) was modified to the shapes shown in figures 7(b) and 7(c). Table 2 gives the CPU times (and efficiencies) obtained for the different block shapes of figures 7(a), 7(b) and 7(c).

[Table 2: CPU times in seconds, with efficiency in parentheses, for block shape and number of blocks (loop = 250, 25 iterations). Only one row survives transcription: figure 7(a) (square): (0.889); figure 7(b) (rectangular): 32.2 (0.875); figure 7(c) (sub-divided): 31.0 (0.952). The processor counts and the remaining CPU times are lost.]

Square blocks have less halo data to communicate than rectangular blocks, so square blocks were expected to produce faster CPU times. However, the results in table 2 show that there is no significant advantage in having square blocks. Sub-dividing each block into four blocks, as shown in figure 7(c), did however give a notable increase in efficiency, to 95.2%. By allowing many blocks per processor it is possible to reduce the transputer idle time, i.e. to update one block while another block's halo data has not yet arrived. This observation is also noted in table 1.

MULTIBLOCK NAVIER-STOKES AEROFOIL CODE

The current work programme of the research is to implement a parallel version of the British Aerospace aerofoil code MB2DV14 [7], programming the algorithm on a network of transputers using the communications harness. The code, initially developed for a Cray Y-MP, is used to simulate the viscous flow around a two-dimensional aerofoil, based on the Jameson algorithm [8], and consists of approximately [number lost in transcription] lines of FORTRAN.

Essentially, the Laplace potential problem in the worker process must be replaced by the multiblock Navier-Stokes code. For the harness to run efficiently, one requirement is that the algorithm be computationally intensive, which is satisfied by the multiblock Navier-Stokes code. Furthermore, the test cases of figure 7 are such that each processor is perfectly load balanced, i.e. each processor is responsible for updating the same number of cells. Typically, a multiblock aerofoil mesh used to solve a viscous flow will not have blocks with equal numbers of cells, and the blocks may also be of irregular shape. Table 2 shows that irregular block shapes do not significantly affect the CPU time. The network will not be perfectly load balanced, however, and all processors will be held up by the one with the most work to perform. The data will therefore be distributed such that all processors have a similar work load.

CONCLUSIONS

The multiblock Laplace solver is a highly parallel application and is amenable to an efficient implementation on a network of transputers. Maximum efficiency is achieved by ensuring perfect load balancing, i.e. distributing the blocks to the network of transputers such that each transputer is responsible for updating the same number of cells.

An overhead may be expected from use of the heap, which is a simulation of dynamic memory storage. This is because the lists within the heap are essentially interleaved lists, which means that once a list is deleted, the 'simply linked' structure is destroyed. The next logical item in a list may therefore not reside in the next position of the array, so the elements of a list from the heap must be scanned using the LINK array before the list can be used.

Many small blocks increase the congestion delay due to entry queuing. Fewer, larger blocks tend to increase the structural delays resulting from either a fixed order of acceptance or conditional acceptance. It follows that there is an optimum block size.

Performance evaluations were carried out for 12 and 48 block topologies, on one, three, six and twelve transputers. A speed-up of up to 11.4 (corresponding to the quoted efficiency of 95.3% on twelve transputers) was attained with increasing computation.

Block shape was shown not to affect the CPU time significantly, whereas increasing the number of blocks per processor reduced the processor idle time and therefore increased efficiency. Since the Navier-Stokes algorithm is computationally intensive, it is expected that the results from the Laplace solver will carry over to the parallel Navier-Stokes solver, provided the load balancing requirement is satisfied.

ACKNOWLEDGEMENT

This research was supported by the Science and Engineering Research Council (SERC) and by British Aerospace.

REFERENCES

[1] 3L Ltd, Parallel FORTRAN Compiler, Reference Manual, V2.1.3.
[2] Notes for the Short Course, Parallel Processing Using 3L Parallel FORTRAN, National Transputer Support Centre, Sheffield City Polytechnic.
[3] D. Pountain and D. May, A Tutorial Introduction to OCCAM Programming, Inmos.
[4] R.J.A. Buhr, System Design with Ada, Prentice-Hall, 1984.
[5] P.D. Terry, FORTRAN from PASCAL, Addison-Wesley.
[6] Multiblock Euler Solver on an Array of Transputers, British Aerospace.
[7] J. Benton, Two Dimensional, Multiblock Navier-Stokes Aerofoil Code, MB2DV14, British Aerospace.
[8] A. Jameson, W. Schmidt and E. Turkel, Numerical Solution of the Euler Equations by Finite Volume Methods Using Runge-Kutta Time Stepping Schemes, AIAA Paper 81-1259, 1981.

[Figure 7: the 12-block test topology; (a) blocks of 10x10 cells (square), (b) blocks of 5x20 cells (rectangular), (c) each block sub-divided into four 5x5 blocks]
