Solving Tridiagonal Systems on the T3E: a Message Passing RD algorithm.


A. Bevilacqua
Dipartimento di Fisica, Università di Bologna; INFN, Sezione di Bologna

G. Spaletta
Dipartimento di Matematica, Università di Bologna

Abstract

This work completes the analysis of the Recursive Decoupling (RD) solver for tridiagonal linear systems. The method is intrinsically parallel and scalable, and maps naturally onto CRAY architectures of the MPP generation, such as the T3E. Recently, requests have arisen to use this solver in large application problems related to the engineering of microwave circuits and to the reconstruction and restoration of medical images. The need for an RD version portable to other distributed systems motivated a revision of the RD algorithm, to make a Message Passing (MP) implementation feasible. Thanks to the CINECA grant, it has been possible to develop a first PVM version of the RD solver for the CRAY T3E, which also allows a consistent comparison between the current MP implementation and the previous shared memory versions on the CRAY T3D. The performance of the RD method on the CRAY T3E is then evaluated in terms of scalability and accuracy of the results.

1. The algorithm

Here we summarize the concepts on which the RD technique is based; more details can be found in [1, 2]. Let $A$ be a diagonally dominant tridiagonal matrix of dimension $n = k \times 2^q$ ($q$, $k$ integers). The related square system

(1) $A u = d$

is solved by partitioning $A$ into an easily invertible block diagonal matrix $J$ and a sum of $m-1$ rank-one matrices $x^{(j)} y^{(j)T}$, where $m = n/2$. A first approximation to the solution vector $u$ is obtained by solving

(2) $J u = d$

At the same time, the following $m-1$ systems are solved:

(3) $J g^{(j)} = x^{(j)}$, $j = 1, \dots, m-1$

The partitioning of $A$ is mirrored in the vectors $u$ and $g^{(j)}$: these vectors are partitioned into subvectors whose size is that of the diagonal blocks of $J$. The recursive use of the Sherman-Morrison formula then yields the solution of the original tridiagonal system (1). The $n$-dimensional vectors $g^{(j)}$ are recursively updated and used to update $u$, which becomes the required solution when the final step is reached.

RD is a direct method, so the number of steps is finite; it equals $\log_2(m)$, owing to the sparsity of the matrices involved, the dimension $n$ being a power of 2, and the fan-in pattern of the updating. Furthermore, the sparsity and dimension features of the problem make it possible to store all vectors $u$, $g^{(j)}$ in $\log_2(n)$ auxiliary vectors $v^{(k)}$, each of dimension $n$. All the updating operations can be performed in parallel, maintaining the data partition chosen during the solution of (2) and (3), with a minimal need for communication among tasks. In fact, systems (2) and (3) can also be solved in parallel, in partitioned form.
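The decoupling idea can be illustrated with a small sequential sketch. It is not the authors' implementation: it applies the Sherman-Morrison updates one rank-one term at a time, without the fan-in ordering, the auxiliary vectors $v^{(k)}$ or any parallelism, and it is written in Python only for brevity. The function names (`solve_block_diagonal`, `rd_solve`), the array layout, and the particular choice of $x^{(j)}$ (the two coupling entries between neighbouring 2x2 blocks) and $y^{(j)}$ (ones in the same two positions) are assumptions made for illustration.

```python
def solve_block_diagonal(jdiag, lower, upper, rhs):
    """Solve J z = rhs, where J consists of independent 2x2 diagonal blocks.

    Assumed layout: lower[i] = A[i][i-1], upper[i] = A[i][i+1];
    jdiag is the (modified) diagonal of J.
    """
    n = len(jdiag)
    z = [0.0] * n
    for i in range(0, n, 2):
        a11, a12 = jdiag[i], upper[i]
        a21, a22 = lower[i + 1], jdiag[i + 1]
        det = a11 * a22 - a12 * a21
        z[i] = (a22 * rhs[i] - a12 * rhs[i + 1]) / det
        z[i + 1] = (a11 * rhs[i + 1] - a21 * rhs[i]) / det
    return z


def rd_solve(lower, diag, upper, d):
    """Solve a tridiagonal system via A = J + sum_j x_j y_j^T (sequential sketch)."""
    n = len(diag)
    if n % 2 != 0:
        raise ValueError("n must be even")
    m = n // 2

    # Build J and the m-1 vectors x_j. Row r closes one 2x2 block and
    # row r+1 opens the next; y_j has ones at positions r and r+1.
    jdiag = list(diag)
    couplings = []
    for j in range(1, m):
        r = 2 * j - 1
        x = [0.0] * n
        x[r] = upper[r]          # A[r][r+1]
        x[r + 1] = lower[r + 1]  # A[r+1][r]
        jdiag[r] -= x[r]         # compensate the diagonal terms of x_j y_j^T
        jdiag[r + 1] -= x[r + 1]
        couplings.append((r, x))

    # Systems (2) and (3): J u = d and J g_j = x_j.
    u = solve_block_diagonal(jdiag, lower, upper, d)
    gs = [solve_block_diagonal(jdiag, lower, upper, x) for _, x in couplings]

    # Sherman-Morrison: add the rank-one terms back one at a time,
    # updating u and the g vectors that have not been used yet.
    for j, (r, _) in enumerate(couplings):
        g = gs[j]
        denom = 1.0 + g[r] + g[r + 1]            # 1 + y_j^T g_j
        alpha = (u[r] + u[r + 1]) / denom        # y_j^T u / denom
        u = [ui - alpha * gi for ui, gi in zip(u, g)]
        for i in range(j + 1, len(gs)):
            beta = (gs[i][r] + gs[i][r + 1]) / denom
            gs[i] = [a - beta * b for a, b in zip(gs[i], g)]
    return u
```

With this choice of $x^{(j)}$ and $y^{(j)}$, strict diagonal dominance of $A$ carries over to $J$ and to every intermediate matrix, so the 2x2 determinants and the Sherman-Morrison denominators are nonzero.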

The procedure outlined gives rise to an intrinsically parallel and scalable algorithm. The current version uses the C programming language and the Message Passing programming model, in response to the request for a wider use of this solver. Results of numerical testing and an analysis of the performance obtained are given next, including a qualitative comparison with analogous tests of the CRAY T3D shared memory implementation.

2. PVM Recursive Decoupling

The C implementation of the RD method uses the routines of the PVM library for communication among PEs (Processing Elements). Besides portability, the aims of this implementation are those sought by every parallel program: efficiency, speed-up, scalability and workload balancing. The SPMD programming model, required on the T3E, consists of one main program running on each processor; the master-slave paradigm is put into effect by ending the main program with the following instruction:

    if (PE_id == 0) then <master code> else <slave code>

Data assignment (splitting) is crucial to the message passing version of the RD algorithm. All structures involved, that is, the auxiliary vectors $v^{(k)}$, are partitioned into subvectors $v_i^{(k)}$ of equal length (given by the ratio between the problem dimension $n$ and the number npes of processors used), each local to one PE. The vectors $v^{(k)}$ are then updated by each PE updating its own substructure. The need to communicate partial scalar results among groups of PEs justifies the choice of the master-slave programming model, with one PE dedicated to the master code, to avoid the overhead due to synchronization among such groups.
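The equal-length splitting and the role selection described above can be sketched as follows. This is only an illustration in Python, not the PVM code: `role_of`, `local_range` and `partition` are hypothetical helper names, and `npes` here simply denotes the number of PEs that hold data (the text does not say whether the master is counted).

```python
def role_of(pe_id):
    """Mirror the closing branch of the SPMD main program: PE 0 is the master."""
    return "master" if pe_id == 0 else "slave"


def local_range(part_index, n, npes):
    """Return the half-open index range [start, stop) of one subvector.

    Every subvector has the same length n / npes, as in the text.
    """
    if n % npes != 0:
        raise ValueError("n must be divisible by npes")
    length = n // npes
    start = part_index * length
    return start, start + length


def partition(vector, npes):
    """Split a vector into npes contiguous subvectors of equal length."""
    n = len(vector)
    parts = []
    for i in range(npes):
        start, stop = local_range(i, n, npes)
        parts.append(vector[start:stop])
    return parts
```

For example, with $n = 2^{17}$ and 64 data-holding PEs, each subvector has 2048 entries, and `local_range(3, 2**17, 64)` gives the range starting at index 6144.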
The master code has the following tasks:
- reads the input data A and d from an external file;
- subdivides A and d, and sends their parts to the slaves;
- collects the local results from each group of processors;
- computes and sends the global results to each group of slaves;
- receives the final result and writes it to an external file.

The kernel of the algorithm, corresponding to the updating procedure, basically consists of three nested loops; denoting by the index k each step (level) of this procedure, the following scheme holds for each slave task:

    for each level k, k = 1 .. log2(m)
        if all data needed for the updating are local
            for each remaining level j, j = k+1 .. log2(m)
                update v_i^(k)
        else
            for each remaining level j
                compute the partial scalar result
                send it to the master
                receive the global scalar result
                use it to update v_i^(k), i = 1 .. n/npes

Dedicating one PE to the master controls implies that at most 64 slaves can be used, so the 128 PEs available on the T3E cannot all be exploited. We considered a model in which every PE performs the master controls as well as the slave task. Such a model would have led to a less efficient implementation, mainly because, at each level k, different partial scalar results have to be shared among different pools of PEs; having one master PE minimizes the occurrence of idle processors. The uniform partitioning of structures described above achieves an even load balance.

3. Performance analysis

As a source of test problems for our numerical experiments, we consider tridiagonal linear systems of large dimension $n = 2^q$, whose randomly generated entries are chosen so as to guarantee diagonal dominance. The exponent q ranges from 17 to 23 in the double precision version (d.p.), while q reaches 24 in the single precision implementation (s.p.). The accuracy results in single precision agree with those obtained by the previous version on the CRAY T3D (see [1], [2] for a more detailed description). The improved accuracy of double precision is paid for by a smaller maximum problem size, as the ranges of q above show. If the solution is only required to be accurate in the first few significant digits, the RD routine can also be used to solve general tridiagonal linear systems.

Timing results, used for the evaluation of speed-up and efficiency, are measured in seconds and shown in the following three tables.
Table 1 refers to the computational time required by the RD solver on a problem of dimension $n = 2^{17}$, both using PVM on the T3E (single and double precision) and using CRAFT on the T3D (single precision). Such a comparison must take into account the enhanced hardware; the time gain observed reflects the increased speed of the T3E processor.

Tables 2 and 3 collect all the computational timings of the PVM implementation of the RD method on the T3E. Empty fields mean that the memory required to run a $2^q \times 2^q$ problem on the chosen number of PEs was too high.

The computational complexity of the PVM implementation is $O(n (\log_2 n)^2)$, which is confirmed by the observed timings. This is slightly less satisfactory than the theoretical complexity of $O(n \log_2 n)$ and calls for further improvement in the PVM restructuring of the algorithm. Communication timings are not shown explicitly, since they always account for less than 10% of the overall computational time. As a consequence, the workload is almost perfectly balanced.

Speed-up and efficiency are good, reaching the optimal value in most cases (in a few cases we obtain superlinear speed-ups). This good behavior fades when 64 processors are used on a problem whose dimension is not large enough to give each processor a significant amount of work, enough to outweigh the communication and synchronization overhead. This is shown in Figure 1, in which the Kuck function is plotted (against the number of processors, in logarithmic scale) for problems of dimension $2^{17}$, $2^{18}$ and $2^{19}$, respectively. The Kuck function gives compound information, being the square of the geometric mean of speed-up and efficiency (obtained with a fixed number of processors).

The scaling properties of the method are also confirmed by the timing results; the scaling factor, in its best instance, reaches the value of 0.88 (close to the theoretical optimal value 1).
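As a small worked illustration of these metrics: taking the usual definitions of speed-up $S = T_1 / T_p$ and efficiency $E = S / p$, the square of their geometric mean is $(\sqrt{S E})^2 = S E = S^2 / p$. The Python sketch below computes the three quantities; the function names and the timings used in the example are hypothetical and are not taken from the tables, and whether $p$ should count the master PE is left open here.

```python
def speedup(t_serial, t_parallel):
    """Speed-up S = T_1 / T_p."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, npes):
    """Efficiency E = S / p."""
    return speedup(t_serial, t_parallel) / npes


def kuck(t_serial, t_parallel, npes):
    """Kuck function: square of the geometric mean of S and E, i.e. S * E."""
    s = speedup(t_serial, t_parallel)
    e = efficiency(t_serial, t_parallel, npes)
    return s * e


if __name__ == "__main__":
    # Hypothetical timings in seconds, for illustration only.
    t1, tp, p = 8.0, 2.0, 8
    print(speedup(t1, tp), efficiency(t1, tp, p), kuck(t1, tp, p))
```

With ideal speed-up ($S = p$, $E = 1$) the function equals $p$, so a curve that flattens or drops as $p$ grows signals that the added processors are no longer being used effectively.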

Table 1. MP-T3E vs SM-T3D timing comparison: problem dimension is q = 17.
Columns: PEs, T3E s.p., T3E d.p., T3D [timing values lost in transcription]

Table 2. MP-T3E timings (double precision): problem dimension given by q.
Columns: PEs, followed by one column per value of q [values of q and timings lost in transcription]

Table 3. MP-T3E timings (single precision): problem dimension given by q.
Columns: PEs, followed by one column per value of q [values of q and timings lost in transcription]

4. Conclusion and future work

PVM ensures flexibility and portability of code, which is a very important requirement in all branches of applied sciences. This could allow the use of the RD routine, on its own or as part of a more general application solver, on a heterogeneous cluster of computers or on any other MPP architecture. Because of the intrinsic parallelism of this problem, the workload is perfectly balanced; accuracy and scaling features are maintained and confirmed by the PVM implementation of the RD algorithm. The communication and synchronization overhead, already quite small, might be further decreased; there is room for further improvement in the current message passing implementation, as suggested both by the Kuck function and by the observed complexity. In parallel with such improvements to the PVM version, an MPI implementation is also being developed.

Acknowledgments

We wish to thank Dr. Bassini, Dr. Voli and their colleagues at CINECA for their kind availability. Computational resources provided by the CINECA Supercomputing Center, under grant n.97/335-5, are gratefully acknowledged.

Bibliography

[1] G. Spaletta, The Recursive Decoupling Solver for Tridiagonal Linear Systems on the CRAY T3D, Parallel Computing: State-of-the-Art and Perspectives, Elsevier, 1996, pp.
[2] G. Spaletta, Recursive Decoupling on the CRAY T3D, Science and Supercomputing at CINECA, 1995 Report, pp.
[3] A. Bevilacqua, G. Spaletta, Solving Systems by Recursive Decoupling with PVM and MPI, in preparation.


More information

Task Assignment Problem in Camera Networks

Task Assignment Problem in Camera Networks Task Assignment Problem in Camera Networks Federico Cerruti Mirko Fabbro Chiara Masiero Corso di laurea in Ingegneria dell Automazione Università degli Studi di Padova Progettazione di sistemi di controllo

More information

Architettura Database Oracle

Architettura Database Oracle Architettura Database Oracle Shared Pool La shared pool consiste di: Data dictionary: cache che contiene informazioni relative agli oggetti del databse, lo storage ed i privilegi Library cache: contiene

More information

Block Lanczos-Montgomery method over large prime fields with GPU accelerated dense operations

Block Lanczos-Montgomery method over large prime fields with GPU accelerated dense operations Block Lanczos-Montgomery method over large prime fields with GPU accelerated dense operations Nikolai Zamarashkin and Dmitry Zheltkov INM RAS, Gubkina 8, Moscow, Russia {nikolai.zamarashkin,dmitry.zheltkov}@gmail.com

More information

Fondamenti di Informatica

Fondamenti di Informatica Fondamenti di Informatica Getting started with Matlab and basic concept of programming Prof. Emiliano Casalicchio http://www.ce.uniroma2.it/courses/foi/ Avvisi n L esame sarà in lingua italiana n L esame

More information

Parallel Techniques. Embarrassingly Parallel Computations. Partitioning and Divide-and-Conquer Strategies

Parallel Techniques. Embarrassingly Parallel Computations. Partitioning and Divide-and-Conquer Strategies slides3-1 Parallel Techniques Embarrassingly Parallel Computations Partitioning and Divide-and-Conquer Strategies Pipelined Computations Synchronous Computations Asynchronous Computations Load Balancing

More information

Communication in Distributed Systems

Communication in Distributed Systems Communication in Distributed Systems Distributed Systems Sistemi Distribuiti Andrea Omicini andrea.omicini@unibo.it Dipartimento di Informatica Scienza e Ingegneria (DISI) Alma Mater Studiorum Università

More information

CORSO MOC10215: Implementing and Managing Microsoft Server Virtualization. CEGEKA Education corsi di formazione professionale

CORSO MOC10215: Implementing and Managing Microsoft Server Virtualization. CEGEKA Education corsi di formazione professionale CORSO MOC10215: Implementing and Managing Microsoft Server Virtualization CEGEKA Education corsi di formazione professionale Implementing and Managing Microsoft Server Virtualization This five-day, instructor-led

More information

Distributed Systems CS /640

Distributed Systems CS /640 Distributed Systems CS 15-440/640 Programming Models Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud andvinay Kolar 1 Objectives Discussion on Programming Models

More information

Advanced Topics UNIT 2 PERFORMANCE EVALUATIONS

Advanced Topics UNIT 2 PERFORMANCE EVALUATIONS Advanced Topics UNIT 2 PERFORMANCE EVALUATIONS Structure Page Nos. 2.0 Introduction 4 2. Objectives 5 2.2 Metrics for Performance Evaluation 5 2.2. Running Time 2.2.2 Speed Up 2.2.3 Efficiency 2.3 Factors

More information

Algorithm Design Techniques part I

Algorithm Design Techniques part I Algorithm Design Techniques part I Divide-and-Conquer. Dynamic Programming DSA - lecture 8 - T.U.Cluj-Napoca - M. Joldos 1 Some Algorithm Design Techniques Top-Down Algorithms: Divide-and-Conquer Bottom-Up

More information

Surfing Ada for ESP Part 2

Surfing Ada for ESP Part 2 Surfing Ada for ESP Part 2 C. Montangero Dipartimento d Informatica Corso di ESperienze di Programmazione a.a. 2012/13 1 Table of contents 2 CHAPTER 9: Packages Packages allow the programmer to define

More information

Lecture V: Introduction to parallel programming with Fortran coarrays

Lecture V: Introduction to parallel programming with Fortran coarrays Lecture V: Introduction to parallel programming with Fortran coarrays What is parallel computing? Serial computing Single processing unit (core) is used for solving a problem One task processed at a time

More information

A Truncated Newton Method in an Augmented Lagrangian Framework for Nonlinear Programming

A Truncated Newton Method in an Augmented Lagrangian Framework for Nonlinear Programming A Truncated Newton Method in an Augmented Lagrangian Framework for Nonlinear Programming Gianni Di Pillo (dipillo@dis.uniroma1.it) Giampaolo Liuzzi (liuzzi@iasi.cnr.it) Stefano Lucidi (lucidi@dis.uniroma1.it)

More information

A Global Operating System for HPC Clusters

A Global Operating System for HPC Clusters A Global Operating System Emiliano Betti 1 Marco Cesati 1 Roberto Gioiosa 2 Francesco Piermaria 1 1 System Programming Research Group, University of Rome Tor Vergata 2 BlueGene Software Division, IBM TJ

More information

Spectral Graph Sparsification: overview of theory and practical methods. Yiannis Koutis. University of Puerto Rico - Rio Piedras

Spectral Graph Sparsification: overview of theory and practical methods. Yiannis Koutis. University of Puerto Rico - Rio Piedras Spectral Graph Sparsification: overview of theory and practical methods Yiannis Koutis University of Puerto Rico - Rio Piedras Graph Sparsification or Sketching Compute a smaller graph that preserves some

More information

COSC160: Data Structures Hashing Structures. Jeremy Bolton, PhD Assistant Teaching Professor

COSC160: Data Structures Hashing Structures. Jeremy Bolton, PhD Assistant Teaching Professor COSC160: Data Structures Hashing Structures Jeremy Bolton, PhD Assistant Teaching Professor Outline I. Hashing Structures I. Motivation and Review II. Hash Functions III. HashTables I. Implementations

More information

L3.4. Data Management Techniques. Frederic Desprez Benjamin Isnard Johan Montagnat

L3.4. Data Management Techniques. Frederic Desprez Benjamin Isnard Johan Montagnat Grid Workflow Efficient Enactment for Data Intensive Applications L3.4 Data Management Techniques Authors : Eddy Caron Frederic Desprez Benjamin Isnard Johan Montagnat Summary : This document presents

More information

Introduction to Distributed Systems

Introduction to Distributed Systems Introduction to Distributed Systems Distributed Systems L-A Sistemi Distribuiti L-A Andrea Omicini andrea.omicini@unibo.it Ingegneria Due Alma Mater Studiorum Università di Bologna a Cesena Academic Year

More information

Ingegneria del Software Corso di Laurea in Informatica per il Management

Ingegneria del Software Corso di Laurea in Informatica per il Management Ingegneria del Software Corso di Laurea in Informatica per il Management UML: State machine diagram Davide Rossi Dipartimento di Informatica Università di Bologna State machine A behavioral state machine

More information

The Public Shared Objects Run-Time System

The Public Shared Objects Run-Time System The Public Shared Objects Run-Time System Stefan Lüpke, Jürgen W. Quittek, Torsten Wiese E-mail: wiese@tu-harburg.d400.de Arbeitsbereich Technische Informatik 2, Technische Universität Hamburg-Harburg

More information

Principle Of Parallel Algorithm Design (cont.) Alexandre David B2-206

Principle Of Parallel Algorithm Design (cont.) Alexandre David B2-206 Principle Of Parallel Algorithm Design (cont.) Alexandre David B2-206 1 Today Characteristics of Tasks and Interactions (3.3). Mapping Techniques for Load Balancing (3.4). Methods for Containing Interaction

More information

Flexible Batched Sparse Matrix-Vector Product on GPUs

Flexible Batched Sparse Matrix-Vector Product on GPUs ScalA'17: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems November 13, 217 Flexible Batched Sparse Matrix-Vector Product on GPUs Hartwig Anzt, Gary Collins, Jack Dongarra,

More information

Integrated Math I. IM1.1.3 Understand and use the distributive, associative, and commutative properties.

Integrated Math I. IM1.1.3 Understand and use the distributive, associative, and commutative properties. Standard 1: Number Sense and Computation Students simplify and compare expressions. They use rational exponents and simplify square roots. IM1.1.1 Compare real number expressions. IM1.1.2 Simplify square

More information

Voronoi Region. K-means method for Signal Compression: Vector Quantization. Compression Formula 11/20/2013

Voronoi Region. K-means method for Signal Compression: Vector Quantization. Compression Formula 11/20/2013 Voronoi Region K-means method for Signal Compression: Vector Quantization Blocks of signals: A sequence of audio. A block of image pixels. Formally: vector example: (0.2, 0.3, 0.5, 0.1) A vector quantizer

More information

Parallelization Principles. Sathish Vadhiyar

Parallelization Principles. Sathish Vadhiyar Parallelization Principles Sathish Vadhiyar Parallel Programming and Challenges Recall the advantages and motivation of parallelism But parallel programs incur overheads not seen in sequential programs

More information

Parallel quicksort algorithms with isoefficiency analysis. Parallel quicksort algorithmswith isoefficiency analysis p. 1

Parallel quicksort algorithms with isoefficiency analysis. Parallel quicksort algorithmswith isoefficiency analysis p. 1 Parallel quicksort algorithms with isoefficiency analysis Parallel quicksort algorithmswith isoefficiency analysis p. 1 Overview Sequential quicksort algorithm Three parallel quicksort algorithms Isoefficiency

More information

Hierarchical Multi level Approach to graph clustering

Hierarchical Multi level Approach to graph clustering Hierarchical Multi level Approach to graph clustering by: Neda Shahidi neda@cs.utexas.edu Cesar mantilla, cesar.mantilla@mail.utexas.edu Advisor: Dr. Inderjit Dhillon Introduction Data sets can be presented

More information

Introduction to Parallel Programming

Introduction to Parallel Programming Introduction to Parallel Programming Linda Woodard CAC 19 May 2010 Introduction to Parallel Computing on Ranger 5/18/2010 www.cac.cornell.edu 1 y What is Parallel Programming? Using more than one processor

More information

Automatic Creation of Define.xml for ADaM

Automatic Creation of Define.xml for ADaM Automatic Creation of Define.xml for ADaM Alessia Sacco, Statistical Programmer www.valos.it info@valos.it 1 Indice Define.xml Pinnacle 21 Community Valos ADaM Metadata 2 Define.xml Cos è: Case Report

More information

Design of Parallel Algorithms. Course Introduction

Design of Parallel Algorithms. Course Introduction + Design of Parallel Algorithms Course Introduction + CSE 4163/6163 Parallel Algorithm Analysis & Design! Course Web Site: http://www.cse.msstate.edu/~luke/courses/fl17/cse4163! Instructor: Ed Luke! Office:

More information

Link Analysis and Web Search

Link Analysis and Web Search Link Analysis and Web Search Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna http://www.moreno.marzolla.name/ based on material by prof. Bing Liu http://www.cs.uic.edu/~liub/webminingbook.html

More information

Chapter 3:- Divide and Conquer. Compiled By:- Sanjay Patel Assistant Professor, SVBIT.

Chapter 3:- Divide and Conquer. Compiled By:- Sanjay Patel Assistant Professor, SVBIT. Chapter 3:- Divide and Conquer Compiled By:- Assistant Professor, SVBIT. Outline Introduction Multiplying large Integers Problem Problem Solving using divide and conquer algorithm - Binary Search Sorting

More information

A tree-search algorithm for ML decoding in underdetermined MIMO systems

A tree-search algorithm for ML decoding in underdetermined MIMO systems A tree-search algorithm for ML decoding in underdetermined MIMO systems Gianmarco Romano #1, Francesco Palmieri #2, Pierluigi Salvo Rossi #3, Davide Mattera 4 # Dipartimento di Ingegneria dell Informazione,

More information

CellSs Making it easier to program the Cell Broadband Engine processor

CellSs Making it easier to program the Cell Broadband Engine processor Perez, Bellens, Badia, and Labarta CellSs Making it easier to program the Cell Broadband Engine processor Presented by: Mujahed Eleyat Outline Motivation Architecture of the cell processor Challenges of

More information

Optimal Decision Trees Generation from OR-Decision Tables

Optimal Decision Trees Generation from OR-Decision Tables Optimal Decision Trees Generation from OR-Decision Tables Costantino Grana, Manuela Montangero, Daniele Borghesani, and Rita Cucchiara Dipartimento di Ingegneria dell Informazione Università degli Studi

More information

Analysis of high dimensional data via Topology. Louis Xiang. Oak Ridge National Laboratory. Oak Ridge, Tennessee

Analysis of high dimensional data via Topology. Louis Xiang. Oak Ridge National Laboratory. Oak Ridge, Tennessee Analysis of high dimensional data via Topology Louis Xiang Oak Ridge National Laboratory Oak Ridge, Tennessee Contents Abstract iii 1 Overview 1 2 Data Set 1 3 Simplicial Complex 5 4 Computation of homology

More information

L-Systems and Affine Transformations

L-Systems and Affine Transformations L-Systems and Affine Transformations Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna http://www.moreno.marzolla.name/ Copyright 2014, Moreno Marzolla, Università di

More information

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm

Seminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Seminar on A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Mohammad Iftakher Uddin & Mohammad Mahfuzur Rahman Matrikel Nr: 9003357 Matrikel Nr : 9003358 Masters of

More information

DANTE CONTROL SYSTEM DATA FLOW

DANTE CONTROL SYSTEM DATA FLOW K K DAΦNE TECHNICAL NOTE INFN - LNF, Accelerator Division Frascati, February 14, 1994 Note: C-9 DANTE CONTROL SYSTEM DATA FLOW M. Verola 1. Introduction The DANTE (DAΦNE New Tools Environment) Control

More information