Parallel Algorithms on Clusters of Multicores: Comparing Message Passing vs Hybrid Programming
Fabiana Leibovich, Laura De Giusti, and Marcelo Naiouf
Instituto de Investigación en Informática LIDI (III-LIDI), Facultad de Informática, Universidad Nacional de La Plata, 50 y 120 2do piso, La Plata, Argentina.
{fleibovich, ldgiusti, mnaiouf}@lidi.info.unlp.edu.ar

Abstract - Given the technological progress of current processors and the appearance of the multi-core cluster architecture, it becomes important to assess the parallel programming techniques that exploit the new memory hierarchy this architecture provides. The purpose of this paper is to carry out a comparative analysis of two parallel programming paradigms: message passing and hybrid programming (where message passing and shared memory are combined). The testing architecture used for the experimental analysis is a multi-core cluster formed by 16 nodes (blades), each blade with 2 quad-core processors (128 cores in total). The study case chosen was square matrix multiplication, analyzing scalability by increasing the size of the problem and the number of processing cores used.

Keywords: hybrid programming, cluster, multi-core, message passing, shared memory, parallel architectures.

1 Introduction

Parallel architectures have evolved to offer better response times for applications. This evolution went from clusters to multi-cores and, currently, to multi-core cluster architectures. The latter are basically a collection of multi-core processors interconnected through a network. Multi-core clusters combine the most distinctive features of clusters (use of message passing in distributed memory) and multi-cores (use of shared memory). They also modify the memory hierarchy and further increase the capacity and power of computer systems. Given the popularity of this architecture, it is important to study new parallel programming techniques that exploit its power efficiently, considering hybrid systems in which shared memory and distributed memory are combined [1].

As previously mentioned, a multi-core cluster is a set of multi-core processors interconnected through a network, where they work cooperatively as a single computational resource. That is, it is similar to a traditional cluster, but each node has a processor with several cores instead of a monoprocessor.

When implementing a parallel algorithm, it is very important to consider the available memory hierarchy, since it directly affects algorithm performance. Memory hierarchy performance is determined by two hardware parameters: memory latency (the time elapsed between the moment a piece of data is requested and the moment it becomes available) and memory bandwidth (the rate at which data are sent from memory to the processor). Figure 1 shows a representation of the memory hierarchy in the different architectures.

Figure (1). Memory hierarchy

In traditional clusters (both homogeneous and heterogeneous), each processor has its own memory levels (processor registers and cache levels L1 and L2), and a new level is added: network-distributed memory.
In a multi-core architecture there are, in addition to the registers and L1 cache of each core, two more memory levels: cache memory shared by pairs of cores (L2) and memory shared among all the cores of the multi-core processor [2].

Multi-core clusters, in particular, add one more level to this hierarchy. In addition to the cache memory shared between pairs of cores and the memory shared among all cores within the same physical processor, there is the distributed memory that is accessed through the network.

There is a large number of parallel applications in different areas. One of the most traditional and widely studied in parallel computing, and the one used in this paper, is matrix multiplication. This application (widely tested and assessed) was chosen because it allows exploiting data parallelism and analyzing algorithm scalability by increasing matrix size [3]. Thus, the solutions using message passing can be compared with those using shared memory and message passing (hybrid).

This paper is organized as follows: Section 2 details the contribution of this paper, and Section 3 describes the features of hybrid programming. Section 4 details the study case, and Section 5 analyzes the solutions implemented and the architecture used to generate the results shown in Section 6. Finally, Section 7 presents the conclusions and future lines of work.

2 Contribution

The main contribution of this paper is a comparative analysis of the performance that can be achieved with hybrid programming in a multi-core cluster architecture versus a traditional parallel programming model (distributed memory). The analysis is based on the running time and efficiency of the hybrid solution as the size of the problem and the number of cores used increase, and the results are compared with solutions that use only message passing.
3 Hybrid Programming

Traditionally, parallel processing has been divided into two large models: shared memory and message passing [1][4].

Shared memory: the data accessed by the application are in a global memory that is accessible to all parallel processors. This means that processors can read and store data at any memory position independently of each other. The model is characterized by the need for synchronization in order to preserve the integrity of shared data structures.

Message passing: data are seen as being associated with a specific processor. Thus, messages must be exchanged among processors to access remote data. In this model, the send and receive primitives are responsible for handling synchronization.

With the appearance of multi-core cluster architectures, a new hybrid programming model emerges that combines both strategies. Communication among processes belonging to the same physical processor can be done through shared memory (micro level), whereas communication among physical processors (macro level) can be done by message passing. The purpose of the hybrid model is to exploit the advantages of each strategy according to the needs of the application.

This is an area of current research interest; among the libraries used for hybrid programming are Pthreads for shared memory and MPI for message passing.

Pthreads is a library that implements the thread interface of the POSIX (Portable Operating System Interface) standard defined by IEEE. It consists of a set of C types and procedure calls, provided through a header file and a thread library that may be part, for example, of the libc library. It is used for programming parallel applications that use shared memory [5].

MPI, on the other hand, is a message passing interface created to provide portability. It is a library for developing programs that use message passing (distributed memory), and it can be used from C or Fortran. The MPI standard defines both the syntax and the semantics of the set of routines that can be used to implement message passing programs [6].

4 Study Case

Given two matrices A and B, matrix multiplication consists in obtaining matrix C, as indicated in equation 1.

\[ C = A \times B \tag{1} \]

If matrix A has m * p elements and matrix B has p * n elements, matrix C will have m * n elements. Each position of matrix C is calculated by applying equation 2.

\[ C_{i,j} = \sum_{k=1}^{p} A_{i,k} \, B_{k,j} \tag{2} \]
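As a quick check of equation 2, consider a 2 x 2 case. The numbers below are illustrative only and are not taken from the paper.

```latex
A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}, \quad
B = \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}, \quad
C_{1,2} = A_{1,1} B_{1,2} + A_{1,2} B_{2,2} = 1 \cdot 6 + 2 \cdot 8 = 22, \quad
C = A \times B = \begin{pmatrix} 19 & 22 \\ 43 & 50 \end{pmatrix}
```

Each of the m * n positions of C needs p multiplications, so for square matrices of dimension n the classical algorithm performs n^3 multiplications. Every position of C can be computed independently of the others, which is what makes the problem suitable for data parallelism.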
5 Implemented Solutions and Architecture Used

Experimental tests were based on implementations of the classical matrix multiplication algorithm: one sequential and two parallel, the latter using different programming models (message passing, and a hybrid combination of message passing and shared memory). All three solutions were developed in C. The parallel solution that uses message passing as its process communication mechanism uses the OpenMPI library [6]. The hybrid solution uses the Pthreads library [5] for shared memory and OpenMPI for message passing.

This initial phase of the investigation consists in an experimental analysis of the behavior of a hybrid application in a multi-core cluster architecture from the point of view of programming models [7][8][9]. The results shown focus on two aspects of the hybrid solution:

1. Analyzing its behavior when the size of the problem and the number of cores increase (scalability) [7][8]. In this case, square matrices of 1024, 2048, 4096 and 8192 rows and columns were processed.
2. Comparing its running times and efficiency with those obtained with the message passing solution.

The hardware used to carry out the tests was a blade system with 16 servers (blades). Each blade has 2 quad-core Intel Xeon processors, 2 GB of RAM (shared between both processors), and 2 x 6 MB of L2 cache per processor, each cache shared by a pair of cores. The operating system used is Fedora 12, 64 bits [10][11].

The solutions implemented are described in the following paragraphs. In all cases, matrix A is stored by rows and matrix B by columns, so that data are accessed sequentially and the local cache memory of the architecture on which the algorithms were run is used effectively.
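The shared-memory part of this scheme can be sketched in C with Pthreads. What follows is a minimal sketch and not the authors' code: the names `band_task`, `multiply_band` and `multiply_block`, the use of `double` elements, and the `MAX_THREADS` limit are all assumptions made for illustration. It shows one process multiplying its block of rows of A by a B that is stored by columns, with the rows split among threads. The message passing step that delivers the rows to each process is left out.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical task descriptor: one contiguous band of rows per thread. */
typedef struct {
    const double *a;   /* block of A, stored by rows (rows x n)            */
    const double *bt;  /* B stored by columns: column j starts at bt[j*n]  */
    double *c;         /* block of C, stored by rows (rows x n)            */
    size_t n;          /* matrix dimension                                 */
    size_t row_begin;  /* first row of the band (inclusive)                */
    size_t row_end;    /* last row of the band (exclusive)                 */
} band_task;

/* Thread body: computes every position of C in the band with equation 2.
 * Both a_row and b_col are walked sequentially, which is the point of
 * storing B by columns. */
static void *multiply_band(void *arg)
{
    band_task *t = arg;
    for (size_t i = t->row_begin; i < t->row_end; ++i) {
        const double *a_row = t->a + i * t->n;
        for (size_t j = 0; j < t->n; ++j) {
            const double *b_col = t->bt + j * t->n;
            double sum = 0.0;
            for (size_t k = 0; k < t->n; ++k)
                sum += a_row[k] * b_col[k];
            t->c[i * t->n + j] = sum;
        }
    }
    return NULL;
}

/* Splits `rows` rows among `nthreads` workers. The calling thread works on
 * band 0 itself and creates nthreads - 1 helper threads for the rest.
 * Returns 0 on success, -1 if nthreads is out of range. */
int multiply_block(const double *a, const double *bt, double *c,
                   size_t rows, size_t n, size_t nthreads)
{
    enum { MAX_THREADS = 64 };          /* arbitrary limit for the sketch */
    pthread_t tid[MAX_THREADS];
    band_task task[MAX_THREADS];
    int spawned[MAX_THREADS] = {0};

    if (nthreads == 0 || nthreads > MAX_THREADS)
        return -1;

    /* Bands differ by at most one row when rows is not a multiple. */
    size_t base = rows / nthreads, extra = rows % nthreads, next = 0;
    for (size_t t = 0; t < nthreads; ++t) {
        size_t len = base + (t < extra ? 1 : 0);
        task[t] = (band_task){ a, bt, c, n, next, next + len };
        next += len;
    }

    for (size_t t = 1; t < nthreads; ++t)
        spawned[t] = (pthread_create(&tid[t], NULL, multiply_band, &task[t]) == 0);

    multiply_band(&task[0]);            /* the process also acts as a worker */

    for (size_t t = 1; t < nthreads; ++t) {
        if (spawned[t])
            pthread_join(tid[t], NULL);
        else
            multiply_band(&task[t]);    /* fallback if thread creation failed */
    }
    return 0;
}
```

With `nthreads` set to 8 this matches the arrangement of Section 5.3, where the process creates 7 threads and also works itself. The threads write to disjoint rows of C and only read A and B, so no lock is needed; joining the threads is the only synchronization. All threads read the same copy of B, which is the property the conclusions rely on.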
5.1 Sequential Solution

Each position of C is calculated as established in equation 2.

5.2 Message Passing Solution

In this case, processing is divided into blocks of rows, which are assigned equally to each process. If p is the number of processes and n * n is the dimension of matrices A and B, the number of rows of matrix C calculated by each process is n/p.

The algorithm uses a hierarchical master/worker structure. A general master divides the rows to be processed among the blades and sends the corresponding rows to the master of each blade. It then behaves as one of the second-level masters described below. Finally, it receives the results obtained by all application workers.

There is also one master in each blade (second-level masters), responsible for receiving the rows that will be solved by the processes in its blade and distributing them among its workers. It then processes its own share, thus also acting as a worker.

It should be noted that each process must store the rows of matrix A that it processes, all of matrix B, and the rows of matrix C that it generates as a result.

5.3 Hybrid Solution

In this solution, there is one process per blade. Each process creates 7 threads to carry out processing and also works itself, so that there is one thread per core. A master/worker structure is used, with one of the processes acting as master and dividing the rows equally among all processes. Once this is done, the master generates its processing threads and acts as a worker. The other worker processes act in a similar way and send their results to the master process.

The algorithm can be summarized as follows:

Master process:
1. It divides the matrix into blocks of n / (number of blades used for processing) rows.
2. It communicates the corresponding rows of matrix A and all of matrix B to the worker processes.
3. It generates the threads and processes its own block.
4. It receives the results from the worker processes.

Worker processes:
1. They receive the corresponding rows of matrix A and all of matrix B.
2. They generate the threads to process the data.
3. They communicate the results to the master process.

6 Results obtained

The results obtained in the experimental tests are presented in the following paragraphs.

Table 1 shows running times for the sequential solution (Seq.), the message passing solution using 16, 32, and 64 cores (MP16, MP32 and MP64), and the hybrid solution with 16, 32, and 64 cores (H16, H32 and H64). For the tests, both the dimension of the matrix and the number of cores are scaled. Figure 2 shows the speedups obtained for these tests.

The running times obtained by the hybrid solution are always lower than those obtained by the message passing solution. Also, as problem size increases, the time difference between both solutions grows in favor of the hybrid solution.
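The paper reports speedup and efficiency without restating how they are computed. Assuming the standard definitions, with T_seq the running time of the sequential solution and T_p the running time of a parallel solution on p cores (these symbols are introduced here and are not the paper's notation), they can be written as:

```latex
S(p) = \frac{T_{seq}}{T_p}, \qquad
E(p) = \frac{S(p)}{p} = \frac{T_{seq}}{p \, T_p}
```

Under these definitions, an efficiency of 1 corresponds to a parallel running time of exactly T_seq / p, and any time spent on communication or synchronization pushes E(p) below 1.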
Table (1). Running times (columns: Size, Seq., MP16, MP32, MP64, H16, H32, H64)

Figure (2). Speedup

6.1 Comparison of Results

Table 2 shows the efficiency achieved by the different testing alternatives, and Figure 3 presents the same information as a comparative chart.

Based on the results obtained, two observations can be made. First, the efficiency achieved by the hybrid solution is in all cases higher than the one achieved by the message passing solution. Second, as problem size increases (for the same number of processing units), efficiency also increases. However, as is to be expected, efficiency decreases when the number of processing units increases, due to the increased volume of communication and synchronization among processes.

It should also be mentioned that the efficiency achieved by the message passing solution for 8192 * 8192 elements is significantly degraded in comparison with the other sizes. This is due to the limited main memory available in each blade, which, for large sizes, causes the necessary data structures to be swapped.

Table (2). Efficiency (columns: Size, MP16, MP32, MP64, H16, H32, H64)

Figure (3). Efficiency

7 Conclusions and future work

As regards scalability, the results obtained show that the hybrid solution is scalable and that an increase in problem size also increases the efficiency achieved by the algorithm.

When comparing the message passing solution with the hybrid solution, it can be seen that the latter offers better running times. This improvement comes from the hybrid solution taking advantage of the characteristics of both the problem and the architecture used. Using shared memory makes it unnecessary to replicate data within each blade. For the problem chosen as study case, matrix B does not have to be replicated in each of the workers. This does not happen with the message passing solution, since each worker handles its own memory space and therefore requires a copy of matrix B.
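The cost of this replication can be estimated with some simple arithmetic. The sketch below rests on assumptions the paper does not state: elements are taken to be 8-byte doubles, and the message passing solution is taken to run 8 processes per blade (one per core). The function names are hypothetical.

```c
#include <stdint.h>

/* Bytes needed to hold one n x n matrix whose elements take elem_size bytes. */
uint64_t matrix_bytes(uint64_t n, uint64_t elem_size)
{
    return n * n * elem_size;
}

/* Bytes used on one blade by matrix B alone when `copies` separate
 * address spaces each hold their own copy of it. */
uint64_t replicated_b_bytes(uint64_t n, uint64_t elem_size, uint64_t copies)
{
    return copies * matrix_bytes(n, elem_size);
}
```

Under these assumptions, `matrix_bytes(8192, 8)` is 536,870,912 bytes (512 MiB), so 8 copies of B take 4 GiB, which is twice the 2 GB of RAM in a blade before counting A, C, or the operating system. A single copy shared by all threads of one process takes 512 MiB. If the elements were 4-byte floats instead, 8 copies would still take 2 GiB.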
The effect of this replication is visible in the running times obtained in the tests using matrices of 8192 * 8192 elements. In the message passing solution, running time and efficiency are significantly degraded because, with one copy of matrix B per process, the memory available in the testing architecture becomes insufficient. The required structures are then swapped to disk, which significantly degrades algorithm performance.

In the future, the behavior with even larger matrix sizes will be studied, together with other parallelization strategies whose main goal is to avoid data replication.

8 References

[1] Dongarra J., Foster I., Fox G., Gropp W., Kennedy K., Torczon L., White A. Sourcebook of Parallel Computing. Morgan Kaufmann Publishers. (Chapter 3).
[2] Burger T. Intel Multi-Core Processors: Quick Reference Guide. (2010).
[3] Andrews G. Foundations of Multithreaded, Parallel and Distributed Programming. Addison Wesley Higher Education.
[4] Grama A., Gupta A., Karypis G., Kumar V. Introduction to Parallel Computing. Pearson Addison Wesley. Second Edition (Chapter 3).
[5] (2010)
[6] (2010)
[7] Kumar V., Gupta A. Analyzing Scalability of Parallel Algorithms and Architectures. Journal of Parallel and Distributed Computing. Vol. 22, No. 1.
[8] Leopold C. Parallel and Distributed Computing. A Survey of Models, Paradigms and Approaches. Wiley. (Chapters 1, 2 and 3).
[9] Chapman B. The Multicore Programming Challenge. Advanced Parallel Processing Technologies; 7th International Symposium (7th APPT'07), Lecture Notes in Computer Science (LNCS), Vol. 4847, p. 3, Springer-Verlag (New York), November.
[10] HP, "HP BladeSystem". (2011).
[11] HP, "HP BladeSystem c-class architecture". (2011).
Advances of parallel computing Kirill Bogachev May 2016 Demands in Simulations Field development relies more and more on static and dynamic modeling of the reservoirs that has come a long way from being
More informationDr Tay Seng Chuan Tel: Office: S16-02, Dean s s Office at Level 2 URL:
Self Introduction Dr Tay Seng Chuan Tel: Email: scitaysc@nus.edu.sg Office: S-0, Dean s s Office at Level URL: http://www.physics.nus.edu.sg/~phytaysc I have been working in NUS since 0, and I teach mainly
More informationA Study of Performance Scalability by Parallelizing Loop Iterations on Multi-core SMPs
A Study of Performance Scalability by Parallelizing Loop Iterations on Multi-core SMPs Prakash Raghavendra, Akshay Kumar Behki, K Hariprasad, Madhav Mohan, Praveen Jain, Srivatsa S Bhat, VM Thejus, Vishnumurthy
More informationParallel Exact Inference on the Cell Broadband Engine Processor
Parallel Exact Inference on the Cell Broadband Engine Processor Yinglong Xia and Viktor K. Prasanna {yinglonx, prasanna}@usc.edu University of Southern California http://ceng.usc.edu/~prasanna/ SC 08 Overview
More informationObjective. We will study software systems that permit applications programs to exploit the power of modern high-performance computers.
CS 612 Software Design for High-performance Architectures 1 computers. CS 412 is desirable but not high-performance essential. Course Organization Lecturer:Paul Stodghill, stodghil@cs.cornell.edu, Rhodes
More informationParallelizing NEC s Equation Solver Algorithm with OpenMP
Parallelizing NEC s Equation Solver Algorithm with OpenMP Mario Trangoni 1 Victor Rosales 2 Argentina Software Design Center (Intel) 1 mario.trangoni@intel.com 2 victor.h.rosales@intel.com HPCLatAm 2012
More informationOptimization of thread affinity and memory affinity for remote core locking synchronization in multithreaded programs for multicore computer systems
Optimization of thread affinity and memory affinity for remote core locking synchronization in multithreaded programs for multicore computer systems Alexey Paznikov Saint Petersburg Electrotechnical University
More informationContents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11
Preface xvii Acknowledgments xix CHAPTER 1 Introduction to Parallel Computing 1 1.1 Motivating Parallelism 2 1.1.1 The Computational Power Argument from Transistors to FLOPS 2 1.1.2 The Memory/Disk Speed
More informationPerformance Analysis and Optimal Utilization of Inter-Process Communications on Commodity Clusters
Performance Analysis and Optimal Utilization of Inter-Process Communications on Commodity Yili TSENG Department of Computer Systems Technology North Carolina A & T State University Greensboro, NC 27411,
More informationInternational Journal of Digital Application & Contemporary research Website: (Volume 1, Issue 7, February 2013)
Performance Analysis of GA and PSO over Economic Load Dispatch Problem Sakshi Rajpoot sakshirajpoot1988@gmail.com Dr. Sandeep Bhongade sandeepbhongade@rediffmail.com Abstract Economic Load dispatch problem
More informationParallel Architecture & Programing Models for Face Recognition
Parallel Architecture & Programing Models for Face Recognition Submitted by Sagar Kukreja Computer Engineering Department Rochester Institute of Technology Agenda Introduction to face recognition Feature
More informationLecture 28: Introduction to the Message Passing Interface (MPI) (Start of Module 3 on Distribution and Locality)
COMP 322: Fundamentals of Parallel Programming Lecture 28: Introduction to the Message Passing Interface (MPI) (Start of Module 3 on Distribution and Locality) Mack Joyner and Zoran Budimlić {mjoyner,
More informationA Test Suite for High-Performance Parallel Java
page 1 A Test Suite for High-Performance Parallel Java Jochem Häuser, Thorsten Ludewig, Roy D. Williams, Ralf Winkelmann, Torsten Gollnick, Sharon Brunett, Jean Muylaert presented at 5th National Symposium
More informationAn Extension of the StarSs Programming Model for Platforms with Multiple GPUs
An Extension of the StarSs Programming Model for Platforms with Multiple GPUs Eduard Ayguadé 2 Rosa M. Badia 2 Francisco Igual 1 Jesús Labarta 2 Rafael Mayo 1 Enrique S. Quintana-Ortí 1 1 Departamento
More informationComputer Architecture. R. Poss
Computer Architecture R. Poss 1 ca01-10 september 2015 Course & organization 2 ca01-10 september 2015 Aims of this course The aims of this course are: to highlight current trends to introduce the notion
More informationShared-memory Parallel Programming with Cilk Plus
Shared-memory Parallel Programming with Cilk Plus John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 4 19 January 2017 Outline for Today Threaded programming
More informationANALYSIS OF A PARALLEL LEXICAL-TREE-BASED SPEECH DECODER FOR MULTI-CORE PROCESSORS
17th European Signal Processing Conference (EUSIPCO 2009) Glasgow, Scotland, August 24-28, 2009 ANALYSIS OF A PARALLEL LEXICAL-TREE-BASED SPEECH DECODER FOR MULTI-CORE PROCESSORS Naveen Parihar Dept. of
More informationChap. 4 Multiprocessors and Thread-Level Parallelism
Chap. 4 Multiprocessors and Thread-Level Parallelism Uniprocessor performance Performance (vs. VAX-11/780) 10000 1000 100 10 From Hennessy and Patterson, Computer Architecture: A Quantitative Approach,
More informationA Chromium Based Viewer for CUMULVS
A Chromium Based Viewer for CUMULVS Submitted to PDPTA 06 Dan Bennett Corresponding Author Department of Mathematics and Computer Science Edinboro University of PA Edinboro, Pennsylvania 16444 Phone: (814)
More informationDesigning Evolvable Location Models for Ubiquitous Applications
Designing Evolvable Location Models for Ubiquitous Applications Silvia Gordillo, Javier Bazzocco, Gustavo Rossi and Robert Laurini 2 Lifia. Facultad de Informatica. Universidad Nacional de La Plata, Argentina
More informationSerial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing
CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.
More informationHPC with Multicore and GPUs
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville COSC 594 Lecture Notes March 22, 2017 1/20 Outline Introduction - Hardware
More informationPerformance Evaluations for Parallel Image Filter on Multi - Core Computer using Java Threads
Performance Evaluations for Parallel Image Filter on Multi - Core Computer using Java s Devrim Akgün Computer Engineering of Technology Faculty, Duzce University, Duzce,Turkey ABSTRACT Developing multi
More informationOPEN MP-BASED PARALLEL AND SCALABLE GENETIC SEQUENCE ALIGNMENT
OPEN MP-BASED PARALLEL AND SCALABLE GENETIC SEQUENCE ALIGNMENT Asif Ali Khan*, Laiq Hassan*, Salim Ullah* ABSTRACT: In bioinformatics, sequence alignment is a common and insistent task. Biologists align
More informationEvaluating Algorithms for Shared File Pointer Operations in MPI I/O
Evaluating Algorithms for Shared File Pointer Operations in MPI I/O Ketan Kulkarni and Edgar Gabriel Parallel Software Technologies Laboratory, Department of Computer Science, University of Houston {knkulkarni,gabriel}@cs.uh.edu
More informationA dedicated kernel named TORO. Matias Vara Larsen
A dedicated kernel named TORO Matias Vara Larsen Who am I? Electronic Engineer from Universidad Nacional de La Plata, Buenos Aires, Argentina. Argentina PhD in Computer Science at INRIA / CNRS, Nice, France
More informationOverview of research activities Toward portability of performance
Overview of research activities Toward portability of performance Do dynamically what can t be done statically Understand evolution of architectures Enable new programming models Put intelligence into
More informationData Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions
Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing
More informationBig Data Analytics Performance for Large Out-Of- Core Matrix Solvers on Advanced Hybrid Architectures
Procedia Computer Science Volume 51, 2015, Pages 2774 2778 ICCS 2015 International Conference On Computational Science Big Data Analytics Performance for Large Out-Of- Core Matrix Solvers on Advanced Hybrid
More informationAnalysis of Parallelization Techniques and Tools
International Journal of Information and Computation Technology. ISSN 97-2239 Volume 3, Number 5 (213), pp. 71-7 International Research Publications House http://www. irphouse.com /ijict.htm Analysis of
More informationCompilation for Heterogeneous Platforms
Compilation for Heterogeneous Platforms Grid in a Box and on a Chip Ken Kennedy Rice University http://www.cs.rice.edu/~ken/presentations/heterogeneous.pdf Senior Researchers Ken Kennedy John Mellor-Crummey
More informationIntroduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1
Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip
More informationBarbara Chapman, Gabriele Jost, Ruud van der Pas
Using OpenMP Portable Shared Memory Parallel Programming Barbara Chapman, Gabriele Jost, Ruud van der Pas The MIT Press Cambridge, Massachusetts London, England c 2008 Massachusetts Institute of Technology
More informationMeshlization of Irregular Grid Resource Topologies by Heuristic Square-Packing Methods
Meshlization of Irregular Grid Resource Topologies by Heuristic Square-Packing Methods Uei-Ren Chen 1, Chin-Chi Wu 2, and Woei Lin 3 1 Department of Electronic Engineering, Hsiuping Institute of Technology
More informationA Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004
A Study of High Performance Computing and the Cray SV1 Supercomputer Michael Sullivan TJHSST Class of 2004 June 2004 0.1 Introduction A supercomputer is a device for turning compute-bound problems into
More informationConcurrent Programming Introduction
Concurrent Programming Introduction Frédéric Haziza Department of Computer Systems Uppsala University Ericsson - Fall 2007 Outline 1 Good to know 2 Scenario 3 Definitions 4 Hardware 5 Classical
More informationSparse Matrix Operations on Multi-core Architectures
Sparse Matrix Operations on Multi-core Architectures Carsten Trinitis 1, Tilman Küstner 1, Josef Weidendorfer 1, and Jasmin Smajic 2 1 Lehrstuhl für Rechnertechnik und Rechnerorganisation Institut für
More informationProgramming Models for Multi- Threading. Brian Marshall, Advanced Research Computing
Programming Models for Multi- Threading Brian Marshall, Advanced Research Computing Why Do Parallel Computing? Limits of single CPU computing performance available memory I/O rates Parallel computing allows
More informationParallel Implementaton of the Weibull
Journal of Environmental Protection and Ecology 15, No 1, 287 292 (2014) Computer applications on environmental information system Parallel Implementaton of the Weibull Distribution Parameters Estimator
More information