University of Malaga. Image Template Matching on Distributed Memory and Vector Multiprocessors

Image Template Matching on Distributed Memory and Vector Multiprocessors

V. Blanco, M. Martin, D.B. Heras, O. Plata, F.F. Rivera

September 1995
Technical Report No: UMA-DAC-95/20

Published in: 5th Int'l. Conf. on Parallel Computing (ParCo 95), Gent, Belgium, September 19-22, 1995

University of Malaga, Department of Computer Architecture, C. Tecnologico, PO Box 44 E Malaga Spain

Image template matching on distributed memory and vector multiprocessors

V. Blanco, M. Martín, D.B. Heras, O. Plata and F.F. Rivera
Dept. Electrónica y Computación, Fac. Física, Univ. Santiago de Compostela
elvicente@usc.es, elfran@usc.es
6th December 1995

1 Introduction

In this work we present a study of the trade-off between temporal and spatial parallelism when executing highly parallel algorithms. We have selected the image template matching algorithm as representative of the different computational structures that can be executed efficiently on both vector and distributed memory systems [2, 3, 4]. The computational body of the template matching that we consider is the computation of the cross-correlation coefficient. This coefficient is given in terms of the cross-correlation function as:

$$C(i,j) = \frac{\sum_{k=0}^{M_r-1} \sum_{l=0}^{M_c-1} P(k+i,\, l+j)\, T(k,l)}{\left\{ \sum_{k=0}^{M_r-1} \sum_{l=0}^{M_c-1} P^2(k+i,\, l+j) \right\}^{1/2}} \qquad (1)$$

where $P$ and $T$ are the image and the template, with $N_r \times N_c$ and $M_r \times M_c$ pixels respectively. Note that the calculation of this coefficient presents high spatial and temporal locality. The program has four independent nested loops corresponding to the rows and columns of the image and the template. The loops associated with indexes $i$ and $j$ can be executed as a doall structure, and the ones associated with $k$ and $l$ constitute summations that can be executed as a typical efficient reduction structure.

We have executed this code on the Fujitsu AP1000 system as representative of distributed memory multiprocessors. The programming strategy we used is based on the SPMD paradigm exploiting data parallelism. It consists of determining the most adequate distribution of the data, dividing it into subspaces, one for each processor. Then a mapping must be defined that assigns computations to each of these subspaces and explicitly establishes the communications required [5]. In this way, the parallel code keeps the same general structure as its sequential counterpart, with the necessary routing statements added.

At this point, different transformations of the local code can be applied that exploit the vector capabilities of each node. To study them we have used the Fujitsu VP2400/10 vector computer.

This work was supported in part by the CICYT under grant TIC C03-03 and Xunta de Galicia under grant XUGA20606B93. The authors wish to acknowledge the help offered by Fujitsu Labs Ltd. for the use of their systems.
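The loop structure described above (two doall loops over $i$ and $j$, two reduction loops over $k$ and $l$) can be sketched sequentially as follows. This is a minimal illustration of equation (1), not the authors' code: the function name `cross_correlation` is hypothetical, the restriction to positions where the template fits entirely inside the image is an assumption (the text does not say how borders are handled), and returning 0.0 for an all-zero window is likewise an assumption.

```python
import math


def cross_correlation(image, template):
    """Evaluate C(i, j) of equation (1) for every position where the
    template fits inside the image (border handling is an assumption)."""
    n_r, n_c = len(image), len(image[0])
    m_r, m_c = len(template), len(template[0])
    result = []
    for i in range(n_r - m_r + 1):        # doall loop over image rows
        row = []
        for j in range(n_c - m_c + 1):    # doall loop over image columns
            num = 0.0                     # reduction: sum of P * T
            energy = 0.0                  # reduction: sum of P ** 2
            for k in range(m_r):
                for l in range(m_c):
                    p = image[i + k][j + l]
                    num += p * template[k][l]
                    energy += p * p
            # Assumption: an all-zero window yields 0.0 instead of dividing by zero.
            row.append(num / math.sqrt(energy) if energy > 0 else 0.0)
        result.append(row)
    return result


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
template = [[1, 0], [0, 1]]
c = cross_correlation(image, template)
```

Every `(i, j)` iteration is independent of the others, which is what makes the outer loops a doall structure; only `num` and `energy` carry a dependence, and that dependence is a plain sum.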

[Figure 1: Access scheme of template and image. The $N_r \times N_c$ image is partitioned over a mesh of nodes labelled (0,0) to (4,4); each node holds a local image of $n_r \times n_c$ pixels, and the $M_r \times M_c$ template reaches a region of $(M_r + n_r - 1) \times (M_c + n_c - 1)$ pixels.]

2 Exploiting spatial parallelism

The parallel implementation of the cross-correlation coefficient is not direct, because the bounds of the summations over indexes $k$ and $l$ depend on the indexes $i$ and $j$ of the external loops. The most efficient strategy to distribute the data is based on replicating the template in every node. The template is usually small, so the memory cost of this approach is low in practice, and the amount of communication it saves justifies it. The image is stored in the local memories using a block distribution scheme, mapping the two-dimensional matrix onto the two-dimensional mesh of processors in a straightforward way.

Each processor computes the products in equation (1) starting from position $(0, 0)$ of the image in lexicographic order, and from position $(M_r, M_c)$ of the template in reverse lexicographic order. In this way, each processor executes every computation that involves only local memory accesses. The routing operations needed to compute the global result can be mapped efficiently onto the mesh network, through rows and columns. Figure 1 displays the access scheme of the template and the local image in each node. In this example, node (3,3) has to send local results to every node that has some shaded zone: nodes (1,1), (1,2), (1,3), (2,1), (2,2), (2,3), (3,1) and (3,2). Finally, in order to minimize communication costs, we pack the individual messages into buffers that are sent in a single routing operation.

Figure 2 presents the efficiencies of the parallel algorithm executed on the AP1000, which is a general-purpose system with a MIMD configuration, distributed memory and a two-dimensional torus network topology. Note that the efficiency is high even when 512 processors are used. Moreover, in some cases we found superlinear speedups when the memory hierarchy (especially the cache) operates efficiently.
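The distribution strategy of this section can be illustrated with a serial simulation: each node owns one block of the image and a full copy of the template, accumulates partial sums using only its own pixels, and packs the partial sums destined for another node into one buffer per destination. The sketch below is written under several assumptions and is not the authors' AP1000 code: the name `distributed_cross_correlation` is hypothetical, the owner of $C(i,j)$ is assumed to be the node holding pixel $(i,j)$, only positions where the template fits inside the image are computed, and the "routing" step is a plain loop rather than real message passing.

```python
import math
from collections import defaultdict


def distributed_cross_correlation(image, template, n_r, n_c):
    """Serially simulate block distribution with a replicated template.

    Returns the coefficient matrix and the sorted list of
    (source node, destination node) pairs that exchange a buffer.
    """
    rows, cols = len(image), len(image[0])
    m_r, m_c = len(template), len(template[0])
    out_r, out_c = rows - m_r + 1, cols - m_c + 1
    # One buffer per (source, destination): {(i, j): [partial num, partial energy]}
    buffers = defaultdict(dict)

    for br in range(0, rows, n_r):            # loop over the mesh of nodes
        for bc in range(0, cols, n_c):
            src = (br // n_r, bc // n_c)
            for r in range(br, min(br + n_r, rows)):
                for c in range(bc, min(bc + n_c, cols)):
                    p = image[r][c]           # local pixel, no remote access
                    for k in range(m_r):
                        for l in range(m_c):
                            i, j = r - k, c - l   # coefficient this product feeds
                            if not (0 <= i < out_r and 0 <= j < out_c):
                                continue
                            dst = (i // n_r, j // n_c)   # assumed owner of C(i, j)
                            part = buffers[(src, dst)].setdefault((i, j), [0.0, 0.0])
                            part[0] += p * template[k][l]
                            part[1] += p * p

    # "Routing" phase: each buffer is delivered whole and added by its owner.
    num = [[0.0] * out_c for _ in range(out_r)]
    energy = [[0.0] * out_c for _ in range(out_r)]
    for buffer in buffers.values():
        for (i, j), (part_num, part_energy) in buffer.items():
            num[i][j] += part_num
            energy[i][j] += part_energy

    result = [[num[i][j] / math.sqrt(energy[i][j]) if energy[i][j] > 0 else 0.0
               for j in range(out_c)] for i in range(out_r)]
    remote = sorted(pair for pair in buffers if pair[0] != pair[1])
    return result, remote


image = [[r * 4 + c + 1 for c in range(4)] for r in range(4)]
template = [[1, 2], [3, 4]]
result, remote = distributed_cross_correlation(image, template, 2, 2)
```

With a 4 x 4 image split into 2 x 2 blocks, node (1,1) ends up with buffers for nodes (0,0), (0,1) and (1,0), which mirrors the pattern in Figure 1 where a node sends results only to neighbours with lower row or column indexes.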

[Figure 2: Efficiency of the cross-correlation on the AP1000. One panel per template size; axes: efficiency versus number of PEs.]

[Table 1: Run-times on the VP2400/10. Columns: three local image sizes, each with template sizes 8 x 8 and 4 x 4; rows: Scalar, Automatic, Optimized and Speedup.]

3 Exploiting temporal parallelism

The local program associated with each node computes a local cross-correlation, so, if we assume vector capabilities in the processors, temporal parallelism at a finer grain can be exploited. We have implemented a vector code for the local program, using the Fujitsu VP2400/10 vector computer as a tool to evaluate the vectorization possibilities of this algorithm. In order to make the best use of the hardware, we have systematically applied to the algorithm different transformations that exploit the vector capabilities of the system [1]. In particular, we have considered the following: vectorization over the longest loop, minimization of memory conflicts, loop fusion, use of scalar variables in reduction operations, unrolling and blocking.

Table 1 shows the run-times in milliseconds for different sizes of the local image and the template. Note that automatic compilation does not offer good performance because it vectorizes the innermost loop, which corresponds to the rows of the template, a small quantity.

4 Conclusions

The cross-correlation coefficient and other related computations are the computational kernel of codes in the field of image processing, and in particular of the image template matching problem. In this paper we focus on the computational features that make this kind of loop-structured code suitable for parallel and vector machines. We found that a block distribution of the image and a replication of the template in every processor produce a high efficiency in the parallel algorithm on distributed memory systems, and in particular on systems with a mesh interconnection topology. On the other hand, we found that, because of the communication costs, vectorization is a more efficient solution than spatial parallelization for increasing the processing speed of this kind of code. The best solution should be to combine both approaches in a distributed memory system with vector capabilities.

References

[1] W. Cowell and C. Thompson. Transforming Fortran DO loops to improve performance on vector architectures. ACM Transactions on Mathematical Software, 12(4):326-353, 1986.
[2] Z. Fang, X. Li, and L. Ni. Parallel algorithms for image template matching on hypercube SIMD computers. IEEE Transactions on Pattern Anal. Mach. Intell., PAMI-9(6):835-841, Nov.
[3] V. Kumar and V. Krishnan. Efficient image template matching on hypercube SIMD arrays. IEEE Transactions on Pattern Anal. Mach. Intell., PAMI-11(6):665-669, 1989.
[4] E. Zapata, J. Benavides, O. Plata, and F. Rivera. Image template matching on hypercube SIMD computers. Signal Processing, 2:49-60, 1990.
[5] E. Zapata, F. Rivera, and O. Plata. On the partition of algorithms into hypercubes. Advances in Parallel Computing, :49-7,
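As an illustration of the "vectorization over the longest loop" transformation listed in Section 3, the sketch below interchanges the loops so that the innermost loop runs over the image columns (long) instead of the template rows or columns (short). It is only a schematic stand-in: the name `cross_correlation_long_inner` is hypothetical, pure Python does not vectorize anything, and the code is not the authors' VP2400/10 implementation. It shows the loop order that would give a vectorizing compiler long vectors to work with, under the assumption that only positions where the template fits inside the image are computed.

```python
import math


def cross_correlation_long_inner(image, template):
    """Same result as the direct four-loop form, with the long loop innermost."""
    n_r, n_c = len(image), len(image[0])
    m_r, m_c = len(template), len(template[0])
    out_r, out_c = n_r - m_r + 1, n_c - m_c + 1
    result = []
    for i in range(out_r):
        num = [0.0] * out_c       # one accumulator per output column
        energy = [0.0] * out_c
        for k in range(m_r):      # short template loops moved outwards
            src = image[i + k]
            for l in range(m_c):
                t = template[k][l]         # scalar reused across the long loop
                for j in range(out_c):     # long loop: candidate vector operation
                    p = src[j + l]
                    num[j] += p * t
                    energy[j] += p * p
        result.append([n / math.sqrt(e) if e > 0 else 0.0
                       for n, e in zip(num, energy)])
    return result


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
template = [[1, 0], [0, 1]]
c = cross_correlation_long_inner(image, template)
```

The vector length here is the number of output columns rather than the template size, which is the property the text identifies as missing from the automatically vectorized version.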


Online Course Evaluation. What we will do in the last week? Online Course Evaluation Please fill in the online form The link will expire on April 30 (next Monday) So far 10 students have filled in the online form Thank you if you completed it. 1 What we will do

More information

University of Malaga. Sparse Matrix Block-Cyclic Redistribution. Department of Computer Architecture C. Tecnologico PO Box 4114 E Malaga Spain

University of Malaga. Sparse Matrix Block-Cyclic Redistribution. Department of Computer Architecture C. Tecnologico PO Box 4114 E Malaga Spain Sparse Matrix Block-Cyclic Redistribution G. Bandera E.L. Zapata April 999 Technical Report No: UMA-DAC-99/5 Published in: IEEE Int l. Parallel Processing Symposium (IPPS 99) San Juan, Puerto Rico, April

More information

Application Programmer. Vienna Fortran Out-of-Core Program

Application Programmer. Vienna Fortran Out-of-Core Program Mass Storage Support for a Parallelizing Compilation System b a Peter Brezany a, Thomas A. Mueck b, Erich Schikuta c Institute for Software Technology and Parallel Systems, University of Vienna, Liechtensteinstrasse

More information

Zeki Bozkus, Sanjay Ranka and Georey Fox , Center for Science and Technology. Syracuse University

Zeki Bozkus, Sanjay Ranka and Georey Fox , Center for Science and Technology. Syracuse University Modeling the CM-5 multicomputer 1 Zeki Bozkus, Sanjay Ranka and Georey Fox School of Computer Science 4-116, Center for Science and Technology Syracuse University Syracuse, NY 13244-4100 zbozkus@npac.syr.edu

More information

Linear Loop Transformations for Locality Enhancement

Linear Loop Transformations for Locality Enhancement Linear Loop Transformations for Locality Enhancement 1 Story so far Cache performance can be improved by tiling and permutation Permutation of perfectly nested loop can be modeled as a linear transformation

More information

Edge detection based on single layer CNN simulator using RK6(4)

Edge detection based on single layer CNN simulator using RK6(4) Edge detection based on single layer CNN simulator using RK64) Osama H. Abdelwahed 1, and M. El-Sayed Wahed 1 Mathematics Department, Faculty of Science, Suez Canal University, Egypt Department of Computer

More information

Mapping Vector Codes to a Stream Processor (Imagine)

Mapping Vector Codes to a Stream Processor (Imagine) Mapping Vector Codes to a Stream Processor (Imagine) Mehdi Baradaran Tahoori and Paul Wang Lee {mtahoori,paulwlee}@stanford.edu Abstract: We examined some basic problems in mapping vector codes to stream

More information

Program Transformations for the Memory Hierarchy

Program Transformations for the Memory Hierarchy Program Transformations for the Memory Hierarchy Locality Analysis and Reuse Copyright 214, Pedro C. Diniz, all rights reserved. Students enrolled in the Compilers class at the University of Southern California

More information

Issues in Multiprocessors

Issues in Multiprocessors Issues in Multiprocessors Which programming model for interprocessor communication shared memory regular loads & stores SPARCCenter, SGI Challenge, Cray T3D, Convex Exemplar, KSR-1&2, today s CMPs message

More information

Optimal Communication Channel Utilization for Matrix Transposition and Related Permutations on Binary Cubes

Optimal Communication Channel Utilization for Matrix Transposition and Related Permutations on Binary Cubes Optimal Communication Channel Utilization for Matrix Transposition and Related Permutations on Binary Cubes The Harvard community has made this article openly available. Please share how this access benefits

More information

CS 475: Parallel Programming Introduction

CS 475: Parallel Programming Introduction CS 475: Parallel Programming Introduction Wim Bohm, Sanjay Rajopadhye Colorado State University Fall 2014 Course Organization n Let s make a tour of the course website. n Main pages Home, front page. Syllabus.

More information

Overpartioning with the Rice dhpf Compiler

Overpartioning with the Rice dhpf Compiler Overpartioning with the Rice dhpf Compiler Strategies for Achieving High Performance in High Performance Fortran Ken Kennedy Rice University http://www.cs.rice.edu/~ken/presentations/hug00overpartioning.pdf

More information

Parallel Algorithms. COMP 215 Lecture 22

Parallel Algorithms. COMP 215 Lecture 22 Parallel Algorithms COMP 215 Lecture 22 Terminology SIMD single instruction, multiple data stream. Each processor must perform exactly the same operation at each time step, only the data differs. MIMD

More information

Ecube Planar adaptive Turn model (west-first non-minimal)

Ecube Planar adaptive Turn model (west-first non-minimal) Proc. of the International Parallel Processing Symposium (IPPS '95), Apr. 1995, pp. 652-659. Global Reduction in Wormhole k-ary n-cube Networks with Multidestination Exchange Worms Dhabaleswar K. Panda

More information

SMD149 - Operating Systems - Multiprocessing

SMD149 - Operating Systems - Multiprocessing SMD149 - Operating Systems - Multiprocessing Roland Parviainen December 1, 2005 1 / 55 Overview Introduction Multiprocessor systems Multiprocessor, operating system and memory organizations 2 / 55 Introduction

More information

Overview. SMD149 - Operating Systems - Multiprocessing. Multiprocessing architecture. Introduction SISD. Flynn s taxonomy

Overview. SMD149 - Operating Systems - Multiprocessing. Multiprocessing architecture. Introduction SISD. Flynn s taxonomy Overview SMD149 - Operating Systems - Multiprocessing Roland Parviainen Multiprocessor systems Multiprocessor, operating system and memory organizations December 1, 2005 1/55 2/55 Multiprocessor system

More information

A Quantitative Algorithm for Data. IRISA, University of Rennes. Christine Eisenbeis INRIA. Abstract

A Quantitative Algorithm for Data. IRISA, University of Rennes. Christine Eisenbeis INRIA. Abstract A Quantitative Algorithm for Data Locality Optimization Francois Bodin, William Jalby, Daniel Windheiser IRISA, University of Rennes Rennes, FRANCE Christine Eisenbeis INRIA Rocquencourt, FRANCE Abstract

More information

Principles of Computer Architecture. Chapter 10: Trends in Computer. Principles of Computer Architecture by M. Murdocca and V.

Principles of Computer Architecture. Chapter 10: Trends in Computer. Principles of Computer Architecture by M. Murdocca and V. 10-1 Principles of Computer Architecture Miles Murdocca and Vincent Heuring Chapter 10: Trends in Computer Architecture 10-2 Chapter Contents 10.1 Quantitative Analyses of Program Execution 10.2 From CISC

More information

Null space basis: mxz. zxz I

Null space basis: mxz. zxz I Loop Transformations Linear Locality Enhancement for ache performance can be improved by tiling and permutation Permutation of perfectly nested loop can be modeled as a matrix of the loop nest. dependence

More information

Principle of Polyhedral model for loop optimization. cschen 陳鍾樞

Principle of Polyhedral model for loop optimization. cschen 陳鍾樞 Principle of Polyhedral model for loop optimization cschen 陳鍾樞 Outline Abstract model Affine expression, Polygon space Polyhedron space, Affine Accesses Data reuse Data locality Tiling Space partition

More information

Introduction to Parallel and Distributed Systems - INZ0277Wcl 5 ECTS. Teacher: Jan Kwiatkowski, Office 201/15, D-2

Introduction to Parallel and Distributed Systems - INZ0277Wcl 5 ECTS. Teacher: Jan Kwiatkowski, Office 201/15, D-2 Introduction to Parallel and Distributed Systems - INZ0277Wcl 5 ECTS Teacher: Jan Kwiatkowski, Office 201/15, D-2 COMMUNICATION For questions, email to jan.kwiatkowski@pwr.edu.pl with 'Subject=your name.

More information

Massively Parallel Computation for Three-Dimensional Monte Carlo Semiconductor Device Simulation

Massively Parallel Computation for Three-Dimensional Monte Carlo Semiconductor Device Simulation L SIMULATION OF SEMICONDUCTOR DEVICES AND PROCESSES Vol. 4 Edited by W. Fichtner, D. Aemmer - Zurich (Switzerland) September 12-14,1991 - Hartung-Gorre Massively Parallel Computation for Three-Dimensional

More information

Introduction Distributed-memory parallel computers dominate today's parallel computing arena. These machines, such as the Kendall Square KSR-, Intel P

Introduction Distributed-memory parallel computers dominate today's parallel computing arena. These machines, such as the Kendall Square KSR-, Intel P Performance Comparison of a Set of Periodic and Non-Periodic Tridiagonal Solvers on SP2 and Paragon Parallel Computers Xian-He Sun Stuti Moitra Department of Computer Science Scientic Applications Branch

More information

Concurrent Programming Introduction

Concurrent Programming Introduction Concurrent Programming Introduction Frédéric Haziza Department of Computer Systems Uppsala University Ericsson - Fall 2007 Outline 1 Good to know 2 Scenario 3 Definitions 4 Hardware 5 Classical

More information

FIELA: A Fast Image Encryption with Lorenz Attractor using Hybrid Computing

FIELA: A Fast Image Encryption with Lorenz Attractor using Hybrid Computing FIELA: A Fast Image Encryption with Lorenz Attractor using Hybrid Computing P Kranthi Kumar, B V Nagendra Prasad, Gelli MBSS Kumar, V. Chandrasekaran, P.K.Baruah Sri Sathya Sai Institute of Higher Learning,

More information

Computer Science Technical Report

Computer Science Technical Report Computer Science Technical Report Using Large Neural Networks as an Efficient Indexing Method for ATR Template Matching y Mark R. Stevens Charles W. Anderson J. Ross Beveridge Department of Computer Science

More information

A Crash Course in Compilers for Parallel Computing. Mary Hall Fall, L2: Transforms, Reuse, Locality

A Crash Course in Compilers for Parallel Computing. Mary Hall Fall, L2: Transforms, Reuse, Locality A Crash Course in Compilers for Parallel Computing Mary Hall Fall, 2008 1 Overview of Crash Course L1: Data Dependence Analysis and Parallelization (Oct. 30) L2 & L3: Loop Reordering Transformations, Reuse

More information

Issues in Multiprocessors

Issues in Multiprocessors Issues in Multiprocessors Which programming model for interprocessor communication shared memory regular loads & stores message passing explicit sends & receives Which execution model control parallel

More information

Performance Comparison of Processor Scheduling Strategies in a Distributed-Memory Multicomputer System

Performance Comparison of Processor Scheduling Strategies in a Distributed-Memory Multicomputer System Performance Comparison of Processor Scheduling Strategies in a Distributed-Memory Multicomputer System Yuet-Ning Chan, Sivarama P. Dandamudi School of Computer Science Carleton University Ottawa, Ontario

More information

The driving motivation behind the design of the Janus framework is to provide application-oriented, easy-to-use and ecient abstractions for the above

The driving motivation behind the design of the Janus framework is to provide application-oriented, easy-to-use and ecient abstractions for the above Janus a C++ Template Library for Parallel Dynamic Mesh Applications Jens Gerlach, Mitsuhisa Sato, and Yutaka Ishikawa fjens,msato,ishikawag@trc.rwcp.or.jp Tsukuba Research Center of the Real World Computing

More information

ECE 669 Parallel Computer Architecture

ECE 669 Parallel Computer Architecture ECE 669 Parallel Computer Architecture Lecture 23 Parallel Compilation Parallel Compilation Two approaches to compilation Parallelize a program manually Sequential code converted to parallel code Develop

More information

UNIVERSITY OF PITTSBURGH FACULTY OF ARTS AND SCIENCES This dissertation was presented by Xin Yuan It was defended on August, 1998 and approved by Prof

UNIVERSITY OF PITTSBURGH FACULTY OF ARTS AND SCIENCES This dissertation was presented by Xin Yuan It was defended on August, 1998 and approved by Prof Dynamic and Compiled Communication in Optical Time{Division{Multiplexed Point{to{Point Networks by Xin Yuan B.S., Shanghai Jiaotong University, 1989 M.S., Shanghai Jiaotong University, 1992 M.S., University

More information

Optimizing Aggregate Array Computations in Loops

Optimizing Aggregate Array Computations in Loops Optimizing Aggregate Array Computations in Loops Yanhong A. Liu Scott D. Stoller Ning Li Tom Rothamel Abstract An aggregate array computation is a loop that computes accumulated quantities over array elements.

More information

Akhilesh Kumar and Laxmi N. Bhuyan. Department of Computer Science. Texas A&M University.

Akhilesh Kumar and Laxmi N. Bhuyan. Department of Computer Science. Texas A&M University. Evaluating Virtual Channels for Cache-Coherent Shared-Memory Multiprocessors Akhilesh Kumar and Laxmi N. Bhuyan Department of Computer Science Texas A&M University College Station, TX 77-11, USA. E-mail:

More information