Solving Tridiagonal Systems on the T3E: a Message Passing RD algorithm.
A. Bevilacqua
Dipartimento di Fisica, Università di Bologna; INFN, Sezione di Bologna

G. Spaletta
Dipartimento di Matematica, Università di Bologna

Abstract

This work completes the analysis of the Recursive Decoupling (RD) solver for tridiagonal linear systems. The method, intrinsically parallel and scalable, finds a natural implementation on CRAY architectures of the MPP generation, such as the T3E. Recently, a request has arisen to use this technique in the solution of large applied problems, related to the engineering of microwave circuits and to the reconstruction and restoration of medical images.

The need for an RD version portable to other distributed systems justified remodeling the RD algorithm to make a Message Passing (MP) implementation feasible. Thanks to the CINECA grant, it has been possible to develop
a PVM version of the RD solver for the CRAY T3E, which also allows an interesting and consistent comparison between the current MP implementation and the previous CRAY T3D shared memory versions. The performance of the RD method on the CRAY T3E is then evaluated in terms of scalability and accuracy of the results.

1. The algorithm

Here we summarize the concepts on which the RD technique is based; more details can be found in [1, 2]. Let $A$ be a diagonally dominant tridiagonal matrix of dimension $n = k \times 2^q$ ($q$, $k$ integers). The related square system

(1) $A u = d$

is solved by partitioning $A$ into an easily invertible block diagonal matrix $J$ and a sum of $m-1$ rank-one matrices $x^{(j)} y^{(j)T}$, having set $m = n/2$. A first approximation to the solution vector $u$ is given by solving

(2) $J u = d$

At the same time the following $m-1$ systems are solved

(3) $J g^{(j)} = x^{(j)}$

The partitioning of $A$ is mirrored in the vectors $u$ and $g^{(j)}$, i.e. these vectors are partitioned into subvectors whose size is that of the diagonal blocks in $J$. The recursive use of the Sherman-Morrison formula then yields the solution of the original tridiagonal system (1). The $n$-dimensional vectors $g^{(j)}$ are recursively updated and used to update $u$, providing the wanted solution when the final step is reached.

RD is a direct method, thus the number of steps is finite; this number is equal to $\log_2(m)$, due to the sparsity of the matrices involved, the dimension $n$ being a power of 2 and the fan-in pattern of the updating. Furthermore, the sparsity and dimension features of the problem allow all vectors $u$, $g^{(j)}$ to be held in $\log_2(n)$ auxiliary vectors $v^{(k)}$, each of dimension $n$. All the updating operations can be performed in parallel, maintaining the partition of data chosen during the solution of (2) and (3), with a minimal need for communication among tasks. In fact, systems (2) and (3) can also be solved in parallel partitioned form.
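To make a single updating step concrete, the following is a minimal C sketch of one Sherman-Morrison correction: given $u = J^{-1} d$ from (2) and $g = J^{-1} x$ from (3), the solution of $(J + x y^T) u = d$ is $u - \frac{y^T u}{1 + y^T g}\, g$. The function names, the storage of $J$ as three arrays and the use of 2x2 diagonal blocks are assumptions made for illustration, not the authors' code; the full RD method applies $m-1$ such corrections in a fan-in pattern, which this sketch does not attempt.

```c
#include <assert.h>
#include <stddef.h>

/* Solve J z = rhs for a block diagonal J made of 2x2 blocks
 *     [ diag[i]    sup[i]    ]
 *     [ sub[i+1]   diag[i+1] ]      for i = 0, 2, 4, ...
 * (hypothetical storage layout). Each block is independent of the
 * others, so the loop iterations could be executed in parallel. */
void solve_block_diagonal(size_t n, const double *sub, const double *diag,
                          const double *sup, const double *rhs, double *z)
{
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
        /* Cramer's rule on one 2x2 block */
        double det = diag[i] * diag[i + 1] - sup[i] * sub[i + 1];
        assert(det != 0.0);
        z[i]     = (rhs[i] * diag[i + 1] - sup[i] * rhs[i + 1]) / det;
        z[i + 1] = (diag[i] * rhs[i + 1] - sub[i + 1] * rhs[i]) / det;
    }
}

/* Plain dot product of two vectors of length n. */
static double dot(size_t n, const double *a, const double *b)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

/* One Sherman-Morrison step: on entry u = J^{-1} d and g = J^{-1} x;
 * on exit u solves (J + x y^T) u = d. */
void sherman_morrison_update(size_t n, double *u, const double *g,
                             const double *y)
{
    double factor = dot(n, y, u) / (1.0 + dot(n, y, g));
    for (size_t i = 0; i < n; ++i)
        u[i] -= factor * g[i];
}
```

As a usage example, take the 4x4 tridiagonal matrix with 4 on the diagonal and 1 on both off-diagonals. One splitting that satisfies $A = J + x y^T$ (an assumption, the text does not spell out its own) uses $x = y = (0, 1, 1, 0)^T$ and a matrix $J$ with diagonal $(4, 3, 3, 4)$ and in-block off-diagonal entries equal to 1. With $d = (5, 6, 6, 5)^T$, solving (2) and (3) with `solve_block_diagonal` and then calling `sherman_morrison_update` returns $u = (1, 1, 1, 1)^T$ up to rounding.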
The procedure outlined gives rise to an intrinsically parallel and scalable algorithm. The current version uses the C programming language and the Message Passing programming model, to answer the request for a wider use of this solver. Results of numerical testing and the analysis of the performance obtained are given next, including a qualitative comparison with analogous testing of the CRAY T3D shared memory implementation.

2. PVM Recursive Decoupling

The C implementation of the RD method uses the routines of the PVM library devoted to the communication among PEs (Processing Elements). Besides portability, the aims of this implementation are those sought by every parallel program: efficiency, speed-up, scalability and workload balancing. The SPMD programming model, required on the T3E, consists of one main program running on each processor; the master-slave paradigm is put into effect by appending the following instruction as the closing one in the main program:

    if (PE_id == 0) then <master code> else <slave code>

Data assignment (splitting) is crucial to the message passing version of the RD algorithm. All structures involved, that is to say the auxiliary vectors $v^{(k)}$, are partitioned into subvectors $v_i^{(k)}$ of equal length (given by the ratio between the problem dimension $n$ and the number npes of processors used), which are local to each PE. The updating of the vectors $v^{(k)}$ is then obtained by each PE updating its own substructure. The need for some communication of partial scalar results among groups of PEs justifies the choice of the master-slave programming model, with one PE specialized to execute the master code, to avoid the overhead due to synchronization among such groups.
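The data splitting and the master-slave branch just described can be sketched in plain C as follows. This is a serial illustration only: no PVM calls are made, the names `Slice`, `local_slice`, `role_of`, `partial_dot` and `combine_partials` are hypothetical, and the divisor is taken to be the number of PEs that hold data, which is an assumption about how npes is counted. The partial dot product stands in for the "partial scalar results" that each PE would send to the master.

```c
#include <assert.h>
#include <stddef.h>

/* Contiguous piece of a length-n vector owned by one PE. */
typedef struct {
    size_t first;   /* index of the first owned component */
    size_t length;  /* number of owned components, n / npes */
} Slice;

/* Equal-length partition: PE number `pe` (0-based, among the `npes`
 * PEs that hold data) owns components [first, first + length). */
Slice local_slice(size_t n, size_t npes, size_t pe)
{
    assert(npes > 0 && n % npes == 0 && pe < npes);
    Slice s;
    s.length = n / npes;
    s.first = pe * s.length;
    return s;
}

typedef enum { ROLE_MASTER, ROLE_SLAVE } Role;

/* SPMD branch: every PE runs the same program and picks its role. */
Role role_of(int pe_id)
{
    return pe_id == 0 ? ROLE_MASTER : ROLE_SLAVE;
}

/* Slave side: scalar contribution computed from local data only. */
double partial_dot(const double *a, const double *b, Slice s)
{
    double sum = 0.0;
    for (size_t i = s.first; i < s.first + s.length; ++i)
        sum += a[i] * b[i];
    return sum;
}

/* Master side: combine the partial scalars received from the slaves
 * into the global scalar that is sent back to them. */
double combine_partials(const double *partials, size_t count)
{
    double sum = 0.0;
    for (size_t i = 0; i < count; ++i)
        sum += partials[i];
    return sum;
}
```

In a message passing version each call to `partial_dot` would run on a different PE, and the arguments of `combine_partials` would arrive at the master as messages; only one scalar per PE travels, which is why the communication volume stays small.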
The master code has the following tasks:
- reads the input data A and d from an external file;
- subdivides A and d, and sends their parts to the slaves;
- collects the local results from each group of processors;
- computes and sends the global results to each group of slaves;
- receives the final result and writes it to an external file.

The kernel of the algorithm, corresponding to the updating procedure, basically consists of three nested loops; denoting by the index k each step (level) of this procedure, the following scheme holds for each slave task:
    for each level k                                      k = 1..log2(m)
        if all data needed for the updating are local
            for each remaining level j                    j = k+1..log2(m)
                updates vi(k)
        else
            for each remaining level j
                computes its partial scalar result
                sends it to the master
                receives the global scalar result
                uses it to update vi(k)                   i = 1..n/npes

Dedicating one PE to the master role implies that at most 64 slaves can be used, so that not all 128 PEs available on the T3E can be exploited. We considered a model in which every PE performs the master controls as well as the slave task. Such a model would have led to a less efficient implementation, mainly because, at each level k, different partial scalar results have to be shared among different pools of PEs; having one master PE minimizes the occurrence of idle processors. The uniform partitioning of structures, as described above, achieves an even load balance.

3. Performance analysis

As a source of test problems for our numerical experiments, we consider a tridiagonal linear system of large dimension $n = 2^q$, whose randomly generated entries are chosen so as to guarantee diagonal dominance. The exponent q ranges from 17 to 23 in the double precision version (d.p.), while q reaches up to 24 in the single precision implementation (s.p.). The accuracy results obtained in single precision agree with those obtained by the previous version on the CRAY T3D (see [1], [2] for a more detailed description). The double precision improvement is paid for by the inability to solve systems of dimension greater than [...]. If the solution is only required to be accurate in the first few significant digits, then the RD routine can also be used to solve general tridiagonal linear systems.

Timing results, used for the evaluation of speed-up and efficiency, are measured in seconds and shown in the following three tables.
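A test problem of the kind described at the start of this section, with random entries that guarantee diagonal dominance, could be generated along the following lines. This is a sketch under stated assumptions: the text does not say how the entries were drawn, so the uniform distribution on [-1, 1], the margin added to the diagonal and the function names are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Absolute value, written out to avoid depending on the math library. */
static double abs_value(double v)
{
    return v < 0.0 ? -v : v;
}

/* Pseudo-random value in [-1, 1] from the C library generator. */
static double uniform_pm1(void)
{
    return 2.0 * ((double)rand() / (double)RAND_MAX) - 1.0;
}

/* Fill sub-, main and super-diagonal of an n x n tridiagonal matrix so
 * that every row is strictly diagonally dominant.
 * sub[0] and sup[n-1] lie outside the matrix and are set to zero. */
void random_dd_tridiagonal(size_t n, unsigned seed,
                           double *sub, double *diag, double *sup)
{
    assert(n > 0);
    srand(seed);
    for (size_t i = 0; i < n; ++i) {
        sub[i] = (i > 0) ? uniform_pm1() : 0.0;
        sup[i] = (i + 1 < n) ? uniform_pm1() : 0.0;
        /* margin in [1, 2] keeps |diag| strictly above the row sum */
        double margin = 1.0 + (double)rand() / (double)RAND_MAX;
        diag[i] = abs_value(sub[i]) + abs_value(sup[i]) + margin;
    }
}

/* Return 1 if every row satisfies |diag| > |sub| + |sup|, else 0. */
int is_diagonally_dominant(size_t n, const double *sub,
                           const double *diag, const double *sup)
{
    for (size_t i = 0; i < n; ++i) {
        if (abs_value(diag[i]) <= abs_value(sub[i]) + abs_value(sup[i]))
            return 0;
    }
    return 1;
}
```

Fixing the seed makes a run reproducible, which helps when timings for different numbers of PEs are to be compared on the same system.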
Table 1 refers to the computational time required by the RD solver on a problem of dimension $n = 2^{17}$, both using PVM on the T3E (single and double precision) and using CRAFT on the T3D (single precision). Such a comparison must take into account the enhanced resources; the time gain observed reflects the increased speed of the T3E processor.
Tables 2 and 3 gather all the timings required by the PVM implementation of the RD method on the T3E. Empty fields mean that the memory required to run a $2^q \times 2^q$ problem on the chosen number of PEs was too high. The computational complexity of the PVM implementation is $O(n (\log_2 n)^2)$, which is confirmed by the timings observed. This is slightly less satisfactory than the theoretical complexity of $O(n \log_2 n)$ and calls for further improvement in the PVM restructuring of the algorithm.

Communication timings are not explicitly shown, since their share of the overall computational time is always lower than 10%. As a consequence, the workload is almost perfectly balanced. Speed-ups and efficiency are good, reaching the optimal value in most cases (in a few cases, we obtain superlinear speed-ups); this good behavior fades when 64 processors are used on a problem whose dimension is not large enough to give each processor a significant amount of work, enough to overcome the communication and synchronization overhead. This is shown in Figure 1, in which the Kuck function is plotted (against the number of processors, in logarithmic scale) for problems of dimension $2^{17}$, $2^{18}$ and $2^{19}$, respectively. The Kuck function gives compound information, being the square of the geometric mean of speed-up and efficiency (obtained with a fixed number of processors). The scaling properties of the method are also confirmed by the timing results; the scaling factor, in its best instance, reaches the value of 0.88 (close to the theoretical optimal value 1).
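The performance measures used above can be written down in a few lines of C. With speed-up $S(p)$ and efficiency $E(p) = S(p)/p$, the square of their geometric mean is $K(p) = S(p)\,E(p) = S(p)^2/p$. The sketch below assumes that speed-up is the ratio between a reference time and the time on p processors; the function names and the choice of reference time are assumptions, since the text does not state which baseline was used.

```c
#include <assert.h>

/* Speed-up: reference time divided by the time on p processors. */
double speedup(double t_ref, double t_p)
{
    assert(t_p > 0.0);
    return t_ref / t_p;
}

/* Efficiency: speed-up per processor; 1.0 is the ideal value. */
double efficiency(double t_ref, double t_p, int p)
{
    assert(p > 0);
    return speedup(t_ref, t_p) / (double)p;
}

/* Kuck function: square of the geometric mean of speed-up and
 * efficiency, i.e. their product. */
double kuck(double t_ref, double t_p, int p)
{
    return speedup(t_ref, t_p) * efficiency(t_ref, t_p, p);
}
```

With purely illustrative timings (not taken from the tables), a reference time of 8 s and a time of 2 s on 4 processors give a speed-up of 4, an efficiency of 1 and a Kuck value of 4; if the time on 4 processors were 4 s instead, the Kuck value would drop to 1, which shows how the function penalizes lost efficiency more strongly than speed-up alone does.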
Table 1. MP-T3E vs SM-T3D timing comparison: problem dimension is q = 17
PEs    T3E s.p.    T3E d.p.    T3D

Table 2. MP-T3E timings (double precision): problem dimension given by q
PEs    q =    q =    q =    q =    q =    q =

Table 3. MP-T3E timings (single precision): problem dimension given by q
PEs    q =    q =    q =    q =    q =    q =    q =

4. Conclusion and future work

PVM ensures flexibility and portability of code, which is a very important requirement in all branches of applied sciences. This could allow the use of the RD routine, on its own or as part of a more general application solver, on a heterogeneous cluster of computers or any other MPP architecture. Because of the intrinsic parallelism of this problem the workload is perfectly balanced; accuracy and scaling features are maintained and confirmed by the PVM implementation of the RD algorithm. The communication and synchronization overhead, already quite small, might be further decreased; there is room for further improvement in the current message passing implementation, as suggested both by the Kuck function and by the observed complexity. In parallel with this improvement of the PVM version, an MPI implementation is also being developed.
Acknowledgments

We wish to thank Dr. Bassini, Dr. Voli and their colleagues at CINECA for their kind availability. Computational resources provided by the CINECA Supercomputing Center, under grant n.97/335-5, are gratefully acknowledged.

Bibliography

[1] G. Spaletta, The Recursive Decoupling Solver for Tridiagonal Linear Systems on the CRAY T3D, Parallel Computing: State-of-the-Art and Perspectives, Elsevier, 1996, pp.
[2] G. Spaletta, Recursive Decoupling on the CRAY T3D, Science and Supercomputing at CINECA, 1995 Report, pp.
[3] A. Bevilacqua, G. Spaletta, Solving Systems by Recursive Decoupling with PVM and MPI, in preparation.
COSC160: Data Structures Hashing Structures Jeremy Bolton, PhD Assistant Teaching Professor Outline I. Hashing Structures I. Motivation and Review II. Hash Functions III. HashTables I. Implementations
More informationL3.4. Data Management Techniques. Frederic Desprez Benjamin Isnard Johan Montagnat
Grid Workflow Efficient Enactment for Data Intensive Applications L3.4 Data Management Techniques Authors : Eddy Caron Frederic Desprez Benjamin Isnard Johan Montagnat Summary : This document presents
More informationIntroduction to Distributed Systems
Introduction to Distributed Systems Distributed Systems L-A Sistemi Distribuiti L-A Andrea Omicini andrea.omicini@unibo.it Ingegneria Due Alma Mater Studiorum Università di Bologna a Cesena Academic Year
More informationIngegneria del Software Corso di Laurea in Informatica per il Management
Ingegneria del Software Corso di Laurea in Informatica per il Management UML: State machine diagram Davide Rossi Dipartimento di Informatica Università di Bologna State machine A behavioral state machine
More informationThe Public Shared Objects Run-Time System
The Public Shared Objects Run-Time System Stefan Lüpke, Jürgen W. Quittek, Torsten Wiese E-mail: wiese@tu-harburg.d400.de Arbeitsbereich Technische Informatik 2, Technische Universität Hamburg-Harburg
More informationPrinciple Of Parallel Algorithm Design (cont.) Alexandre David B2-206
Principle Of Parallel Algorithm Design (cont.) Alexandre David B2-206 1 Today Characteristics of Tasks and Interactions (3.3). Mapping Techniques for Load Balancing (3.4). Methods for Containing Interaction
More informationFlexible Batched Sparse Matrix-Vector Product on GPUs
ScalA'17: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems November 13, 217 Flexible Batched Sparse Matrix-Vector Product on GPUs Hartwig Anzt, Gary Collins, Jack Dongarra,
More informationIntegrated Math I. IM1.1.3 Understand and use the distributive, associative, and commutative properties.
Standard 1: Number Sense and Computation Students simplify and compare expressions. They use rational exponents and simplify square roots. IM1.1.1 Compare real number expressions. IM1.1.2 Simplify square
More informationVoronoi Region. K-means method for Signal Compression: Vector Quantization. Compression Formula 11/20/2013
Voronoi Region K-means method for Signal Compression: Vector Quantization Blocks of signals: A sequence of audio. A block of image pixels. Formally: vector example: (0.2, 0.3, 0.5, 0.1) A vector quantizer
More informationParallelization Principles. Sathish Vadhiyar
Parallelization Principles Sathish Vadhiyar Parallel Programming and Challenges Recall the advantages and motivation of parallelism But parallel programs incur overheads not seen in sequential programs
More informationParallel quicksort algorithms with isoefficiency analysis. Parallel quicksort algorithmswith isoefficiency analysis p. 1
Parallel quicksort algorithms with isoefficiency analysis Parallel quicksort algorithmswith isoefficiency analysis p. 1 Overview Sequential quicksort algorithm Three parallel quicksort algorithms Isoefficiency
More informationHierarchical Multi level Approach to graph clustering
Hierarchical Multi level Approach to graph clustering by: Neda Shahidi neda@cs.utexas.edu Cesar mantilla, cesar.mantilla@mail.utexas.edu Advisor: Dr. Inderjit Dhillon Introduction Data sets can be presented
More informationIntroduction to Parallel Programming
Introduction to Parallel Programming Linda Woodard CAC 19 May 2010 Introduction to Parallel Computing on Ranger 5/18/2010 www.cac.cornell.edu 1 y What is Parallel Programming? Using more than one processor
More informationAutomatic Creation of Define.xml for ADaM
Automatic Creation of Define.xml for ADaM Alessia Sacco, Statistical Programmer www.valos.it info@valos.it 1 Indice Define.xml Pinnacle 21 Community Valos ADaM Metadata 2 Define.xml Cos è: Case Report
More informationDesign of Parallel Algorithms. Course Introduction
+ Design of Parallel Algorithms Course Introduction + CSE 4163/6163 Parallel Algorithm Analysis & Design! Course Web Site: http://www.cse.msstate.edu/~luke/courses/fl17/cse4163! Instructor: Ed Luke! Office:
More informationLink Analysis and Web Search
Link Analysis and Web Search Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna http://www.moreno.marzolla.name/ based on material by prof. Bing Liu http://www.cs.uic.edu/~liub/webminingbook.html
More informationChapter 3:- Divide and Conquer. Compiled By:- Sanjay Patel Assistant Professor, SVBIT.
Chapter 3:- Divide and Conquer Compiled By:- Assistant Professor, SVBIT. Outline Introduction Multiplying large Integers Problem Problem Solving using divide and conquer algorithm - Binary Search Sorting
More informationA tree-search algorithm for ML decoding in underdetermined MIMO systems
A tree-search algorithm for ML decoding in underdetermined MIMO systems Gianmarco Romano #1, Francesco Palmieri #2, Pierluigi Salvo Rossi #3, Davide Mattera 4 # Dipartimento di Ingegneria dell Informazione,
More informationCellSs Making it easier to program the Cell Broadband Engine processor
Perez, Bellens, Badia, and Labarta CellSs Making it easier to program the Cell Broadband Engine processor Presented by: Mujahed Eleyat Outline Motivation Architecture of the cell processor Challenges of
More informationOptimal Decision Trees Generation from OR-Decision Tables
Optimal Decision Trees Generation from OR-Decision Tables Costantino Grana, Manuela Montangero, Daniele Borghesani, and Rita Cucchiara Dipartimento di Ingegneria dell Informazione Università degli Studi
More informationAnalysis of high dimensional data via Topology. Louis Xiang. Oak Ridge National Laboratory. Oak Ridge, Tennessee
Analysis of high dimensional data via Topology Louis Xiang Oak Ridge National Laboratory Oak Ridge, Tennessee Contents Abstract iii 1 Overview 1 2 Data Set 1 3 Simplicial Complex 5 4 Computation of homology
More informationL-Systems and Affine Transformations
L-Systems and Affine Transformations Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna http://www.moreno.marzolla.name/ Copyright 2014, Moreno Marzolla, Università di
More informationSeminar on. A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm
Seminar on A Coarse-Grain Parallel Formulation of Multilevel k-way Graph Partitioning Algorithm Mohammad Iftakher Uddin & Mohammad Mahfuzur Rahman Matrikel Nr: 9003357 Matrikel Nr : 9003358 Masters of
More informationDANTE CONTROL SYSTEM DATA FLOW
K K DAΦNE TECHNICAL NOTE INFN - LNF, Accelerator Division Frascati, February 14, 1994 Note: C-9 DANTE CONTROL SYSTEM DATA FLOW M. Verola 1. Introduction The DANTE (DAΦNE New Tools Environment) Control
More information