University of Malaga. Image Template Matching on Distributed Memory and Vector Multiprocessors
Image Template Matching on Distributed Memory and Vector Multiprocessors
V. Blanco, M. Martín, D.B. Heras, O. Plata, F.F. Rivera
September 1995
Technical Report No: UMA-DAC-95/20
Published in: 5th Int'l. Conf. on Parallel Computing (ParCo 95), Gent, Belgium, September 19-22, 1995
University of Malaga, Department of Computer Architecture, C. Tecnologico, PO Box 4114, Malaga, Spain
Image template matching on distributed memory and vector multiprocessors

V. Blanco, M. Martín, D.B. Heras, O. Plata and F.F. Rivera
Dept. Electrónica y Computación, Fac. Física, Univ. Santiago de Compostela
elvicente@usc.es, elfran@usc.es

6th December 1995

1 Introduction

In this work we present a study of the trade-off between temporal and spatial parallelism in highly parallel algorithms. We have selected the image template matching algorithm as representative of the different computational structures that can be executed efficiently on both vector and distributed memory systems [2, 3, 4]. The computational body of the template matching that we will consider is the computation of the cross-correlation coefficient. This coefficient is given in terms of the cross-correlation function as:

    C(i,j) = \frac{\sum_{k=0}^{M_r-1} \sum_{l=0}^{M_c-1} P(k+i,\, l+j)\, T(k,l)}{\left\{ \sum_{k=0}^{M_r-1} \sum_{l=0}^{M_c-1} P^2(k+i,\, l+j) \right\}^{1/2}}    (1)

where P and T are the image and the template, with Nr x Nc and Mr x Mc pixels respectively. Note that the calculation of this coefficient presents high spatial and temporal locality. The program has four nested loops corresponding to the rows and columns of the image and the template. The loops associated with indexes i and j can be executed as a doall structure, and the ones associated with k and l constitute summations that can be executed as a typical efficient reduction structure.

We have executed this code on the Fujitsu AP1000 system as representative of distributed memory multiprocessors. The programming strategy we used is based on the SPMD paradigm exploiting data parallelism, and it consists of determining the most adequate distribution of the data, dividing it into subspaces, one for each processor. Then a mapping must be done in order to assign computations to each of these subspaces, explicitly establishing the required communications [5].
In this way, the parallel code keeps the same general structure as its sequential counterpart, introducing the necessary routing statements. At this point, different transformations of the local code can be applied that exploit the vector capabilities of each node. To do so we have used the Fujitsu VP2400/10 vector computer.

This work was supported in part by the CICYT under grant TIC C03-03 and by the Xunta de Galicia under grant XUGA20606B93. The authors wish to acknowledge the help offered by Fujitsu Labs Ltd. for the use of their systems.
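The computation of equation (1) and its loop structure can be sketched as follows. This is a minimal pure-Python illustration (the paper gives no source code, so all names are illustrative): the i, j loops are independent (doall), while the k, l loops are reductions.

```python
def cross_correlation(P, T):
    """Cross-correlation coefficient C(i, j) of equation (1).

    P: image as a list of rows (Nr x Nc); T: template (Mr x Mc).
    """
    Nr, Nc = len(P), len(P[0])
    Mr, Mc = len(T), len(T[0])
    C = [[0.0] * (Nc - Mc + 1) for _ in range(Nr - Mr + 1)]
    for i in range(Nr - Mr + 1):          # doall over image rows
        for j in range(Nc - Mc + 1):      # doall over image columns
            num = 0.0                     # reductions over the template
            den = 0.0
            for k in range(Mr):
                for l in range(Mc):
                    num += P[k + i][l + j] * T[k][l]
                    den += P[k + i][l + j] ** 2
            C[i][j] = num / den ** 0.5 if den > 0 else 0.0
    return C
```

On a distributed memory machine each node would run these loops only over its local image block, which is the strategy developed in the next section.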
Figure 1: Access scheme of template and image (a 5 x 5 mesh of nodes; each node holds an nr x nc local image block of the Nr x Nc image, and the Mr x Mc template makes a node's products span a region of (Mr + nr - 1) x (Mc + nc - 1) pixels)

2 Exploiting spatial parallelism

The parallel implementation of the cross-correlation coefficient is not direct, due to the dependences between the bounds of the summations over indexes k and l and the indexes of the external loops, i and j. The most efficient strategy to distribute the data is based on the replication of the template in every node. The size of the template is usually small, so the memory cost associated with this approach is not too high in practice, and, on the other hand, the amount of communication it saves justifies it. The image is stored in the local memories using a block distribution scheme, mapping the two-dimensional matrix onto the two-dimensional mesh of processors in a straightforward way.

Each processor computes the products in the equation from position (0, 0) of the image in lexicographic order, and from position (Mr, Mc) of the template in reverse lexicographic order. In this way, each processor executes every computation that involves only local memory accesses. The routing operations needed to compute the global result can be mapped efficiently onto the mesh network, through rows and columns. In figure 1 we display the access scheme of the template and the local image in each node. In this example, node (3, 3) has to send local results to every node that has some shaded zone: nodes (1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3), (3, 1), (3, 2) and (3, 3). Finally, in order to minimize communication costs, we pack individual messages into buffers that are sent in a single routing operation.
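The routing pattern of the example above can be sketched with a hypothetical helper (not from the paper) that lists the mesh nodes whose coefficients receive partial sums from a given node, assuming the nr x nc local blocks and Mr x Mc template of figure 1: a template anchored up to Mr - 1 rows above and Mc - 1 columns to the left of a node's block still overlaps it, so partial results travel at most ceil((Mr - 1) / nr) rows up and ceil((Mc - 1) / nc) columns left.

```python
import math

def contribution_targets(p, q, nr, nc, Mr, Mc):
    """Mesh nodes whose local coefficients receive partial sums
    from node (p, q), including the node itself.

    Hypothetical sketch under the block distribution of the text:
    nr x nc is the local image block, Mr x Mc the template size.
    """
    dr = math.ceil((Mr - 1) / nr)   # rows of neighbours reached upwards
    dc = math.ceil((Mc - 1) / nc)   # columns of neighbours reached leftwards
    return [(p - a, q - b)
            for a in range(dr + 1)
            for b in range(dc + 1)
            if p - a >= 0 and q - b >= 0]
```

In practice all messages for one target would be packed into a single buffer, as the text describes, so each node issues at most (dr + 1)(dc + 1) - 1 routing operations.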
In figure 2 we present the efficiencies of the parallel algorithm executed on the AP1000, a general purpose system with a MIMD configuration, distributed memory and a two-dimensional torus topology network. Note that the efficiency remains high even when 512 processors are used. Moreover, in some cases we found superlinear speedups, when the memory hierarchy (especially cache accesses) operates efficiently.
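The efficiency plotted in figure 2 is the usual speedup-per-processor measure; a minimal sketch, with hypothetical run-times not taken from the paper:

```python
def efficiency(t_serial, t_parallel, num_pes):
    """Parallel efficiency: speedup divided by the number of PEs.

    Values above 1.0 correspond to the superlinear speedups mentioned
    in the text, e.g. when the per-node working set fits in cache.
    """
    speedup = t_serial / t_parallel
    return speedup / num_pes
```

For example, a code that runs in 100 s serially and 20 s on 4 PEs has speedup 5 and efficiency 1.25, i.e. a superlinear result.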
Figure 2: Efficiency of the cross-correlation on the AP-1000 (efficiency vs. number of PEs, for 4 x 4 and 8 x 8 templates)

Table 1: Run-times on the VP2400/10 for local images from 1024 x 1024 down to 256 x 256 and templates of 4 x 4 and 8 x 8 (scalar, automatically vectorized and optimized codes, with speedups)

3 Exploiting temporal parallelism

The local program associated with each node computes a local cross-correlation, so, if we assume vector capabilities in the processors, temporal parallelism can be exploited at a finer grain. We have implemented a vector code for the local program. We have used the VP2400/10 vector computer from Fujitsu as a tool for the evaluation of the vectorization possibilities of this algorithm. In order to make the best use of the hardware of the system we have systematically applied to the algorithm different transformations that exploit the vector capabilities of the system [1]. In particular, we have considered the following: vectorization over the longest loop, minimization of memory conflicts, loop fusion, use of scalar variables in reduction operations, unrolling and blocking. In table 1, run-times in milliseconds are shown for different sizes of the local image and the template. Note that automatic compilation does not offer good performance because it vectorizes the innermost loop, which corresponds to the rows
of the template, a small number of iterations.

4 Conclusions

The cross-correlation coefficient and other related computations are the computational kernel of codes in the field of image processing, and in particular of the image template matching problem. In this paper we focus on the computational features that make this kind of loop-structured code suitable for parallel and vector machines. We found that a block distribution of the image and a replication of the template in every processor produce high efficiency in the parallel algorithm on distributed memory systems, and in particular on systems with a mesh interconnection topology. On the other hand, we found that vectorization is a more efficient solution than spatial parallelization for increasing the processing speed of this kind of code, due to the communication costs. The best solution would be to combine both approaches in a distributed memory system with vector capabilities.

References

[1] W. Cowell and C. Thompson. Transforming Fortran DO loops to improve performance on vector architectures. ACM Transactions on Mathematical Software, 12(4):326-353, 1986.
[2] Z. Fang, X. Li, and L. Ni. Parallel algorithms for image template matching on hypercube SIMD computers. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-9(6):835-841, Nov. 1987.
[3] V. Kumar and V. Krishnan. Efficient image template matching on hypercube SIMD arrays. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-11(6):665-669, 1989.
[4] E. Zapata, J. Benavides, O. Plata, and F. Rivera. Image template matching on hypercube SIMD computers. Signal Processing, 21:49-60, 1990.
[5] E. Zapata, F. Rivera, and O. Plata. On the partition of algorithms into hypercubes. Advances in Parallel Computing, 1:49-71.
More informationIntroduction to Parallel and Distributed Systems - INZ0277Wcl 5 ECTS. Teacher: Jan Kwiatkowski, Office 201/15, D-2
Introduction to Parallel and Distributed Systems - INZ0277Wcl 5 ECTS Teacher: Jan Kwiatkowski, Office 201/15, D-2 COMMUNICATION For questions, email to jan.kwiatkowski@pwr.edu.pl with 'Subject=your name.
More informationMassively Parallel Computation for Three-Dimensional Monte Carlo Semiconductor Device Simulation
L SIMULATION OF SEMICONDUCTOR DEVICES AND PROCESSES Vol. 4 Edited by W. Fichtner, D. Aemmer - Zurich (Switzerland) September 12-14,1991 - Hartung-Gorre Massively Parallel Computation for Three-Dimensional
More informationIntroduction Distributed-memory parallel computers dominate today's parallel computing arena. These machines, such as the Kendall Square KSR-, Intel P
Performance Comparison of a Set of Periodic and Non-Periodic Tridiagonal Solvers on SP2 and Paragon Parallel Computers Xian-He Sun Stuti Moitra Department of Computer Science Scientic Applications Branch
More informationConcurrent Programming Introduction
Concurrent Programming Introduction Frédéric Haziza Department of Computer Systems Uppsala University Ericsson - Fall 2007 Outline 1 Good to know 2 Scenario 3 Definitions 4 Hardware 5 Classical
More informationFIELA: A Fast Image Encryption with Lorenz Attractor using Hybrid Computing
FIELA: A Fast Image Encryption with Lorenz Attractor using Hybrid Computing P Kranthi Kumar, B V Nagendra Prasad, Gelli MBSS Kumar, V. Chandrasekaran, P.K.Baruah Sri Sathya Sai Institute of Higher Learning,
More informationComputer Science Technical Report
Computer Science Technical Report Using Large Neural Networks as an Efficient Indexing Method for ATR Template Matching y Mark R. Stevens Charles W. Anderson J. Ross Beveridge Department of Computer Science
More informationA Crash Course in Compilers for Parallel Computing. Mary Hall Fall, L2: Transforms, Reuse, Locality
A Crash Course in Compilers for Parallel Computing Mary Hall Fall, 2008 1 Overview of Crash Course L1: Data Dependence Analysis and Parallelization (Oct. 30) L2 & L3: Loop Reordering Transformations, Reuse
More informationIssues in Multiprocessors
Issues in Multiprocessors Which programming model for interprocessor communication shared memory regular loads & stores message passing explicit sends & receives Which execution model control parallel
More informationPerformance Comparison of Processor Scheduling Strategies in a Distributed-Memory Multicomputer System
Performance Comparison of Processor Scheduling Strategies in a Distributed-Memory Multicomputer System Yuet-Ning Chan, Sivarama P. Dandamudi School of Computer Science Carleton University Ottawa, Ontario
More informationThe driving motivation behind the design of the Janus framework is to provide application-oriented, easy-to-use and ecient abstractions for the above
Janus a C++ Template Library for Parallel Dynamic Mesh Applications Jens Gerlach, Mitsuhisa Sato, and Yutaka Ishikawa fjens,msato,ishikawag@trc.rwcp.or.jp Tsukuba Research Center of the Real World Computing
More informationECE 669 Parallel Computer Architecture
ECE 669 Parallel Computer Architecture Lecture 23 Parallel Compilation Parallel Compilation Two approaches to compilation Parallelize a program manually Sequential code converted to parallel code Develop
More informationUNIVERSITY OF PITTSBURGH FACULTY OF ARTS AND SCIENCES This dissertation was presented by Xin Yuan It was defended on August, 1998 and approved by Prof
Dynamic and Compiled Communication in Optical Time{Division{Multiplexed Point{to{Point Networks by Xin Yuan B.S., Shanghai Jiaotong University, 1989 M.S., Shanghai Jiaotong University, 1992 M.S., University
More informationOptimizing Aggregate Array Computations in Loops
Optimizing Aggregate Array Computations in Loops Yanhong A. Liu Scott D. Stoller Ning Li Tom Rothamel Abstract An aggregate array computation is a loop that computes accumulated quantities over array elements.
More informationAkhilesh Kumar and Laxmi N. Bhuyan. Department of Computer Science. Texas A&M University.
Evaluating Virtual Channels for Cache-Coherent Shared-Memory Multiprocessors Akhilesh Kumar and Laxmi N. Bhuyan Department of Computer Science Texas A&M University College Station, TX 77-11, USA. E-mail:
More information