Analysis of Matrix Multiplication Computational Methods
European Journal of Scientific Research, Vol. 121, No. 3, 2014

Khaled Matrouk (Corresponding Author), Department of Computer Engineering, Faculty of Engineering, khaled.matrouk@ahu.edu.jo
Abdullah Al-Hasanat, Department of Computer Engineering, Faculty of Engineering
Haitham Alasha'ary, Department of Computer Engineering, Faculty of Engineering
Ziad Al-Qadi, Prof., Department of Computer Engineering, Faculty of Engineering
Hasan Al-Shalabi, Prof., Department of Computer Engineering, Faculty of Engineering

Abstract

Matrix multiplication is a basic concept used in engineering applications such as digital image processing, digital signal processing and graph problem solving. Multiplication of huge matrices requires a lot of computation time, as its complexity is O(n^3). Because most engineering applications require high computational throughput with minimum time, many sequential and parallel algorithms have been developed. In this paper, methods of matrix multiplication are chosen, implemented, and analyzed. A performance analysis is carried out, and recommendations are given on when to use the OpenMP and MPI methods of parallel computing.

Keywords: OpenMP, MPI, Processing Time, Speedup, Efficiency

1. Introduction

With the advent of parallel hardware and software technologies, users face the challenge of choosing the programming paradigm best suited to the underlying computer architecture (Alqadi and Abu-Jazzar, 2005a; Alqadi and Abu-Jazzar, 2005b; Alqadi et al, 2008). With the current trend in parallel computer architectures towards clusters of shared memory symmetric multi-processors (SMP), parallel programming techniques have evolved to support parallelism beyond a single level (Choi et al, 1994).
Parallel programming within one SMP node can take advantage of the globally shared address space. Compilers for shared memory architectures usually support multi-threaded execution of a program. Loop-level parallelism can be exploited by using compiler directives such as those defined in the OpenMP standard (Dongarra et al, 1994; Alpatov et al, 1997). OpenMP provides a fork-and-join execution model in which a program begins execution as a single thread. This thread executes sequentially until a parallelization directive for a parallel region is found (Alpatov et al, 1997; Anderson et al, 1987). At this point, the thread creates a team of threads and becomes the master thread of the new team (Chtchelkanova et al, 1995; Barnett et al, 1994; Choi et al, 1992). All threads execute the statements until the end of the parallel region, and work-sharing directives are provided to divide the execution of the enclosed code region among the threads. All threads must synchronize at the end of parallel constructs. The advantage of OpenMP (web ref.) is that existing code can be easily parallelized by placing OpenMP directives around time-consuming loops that do not contain data dependences, leaving the source code otherwise unchanged. The disadvantage is that it is not easy for the user to optimize workflow and memory access.

On an SMP cluster, the message passing programming paradigm can be employed both within and across nodes. MPI (web ref.) is a widely accepted standard for writing message passing programs (web ref.; Rabenseifner, 2003). MPI provides the user with a programming model where processes communicate with other processes by calling library routines to send and receive messages. The advantage of the MPI programming model is that the user has complete control over data distribution and process synchronization, permitting the optimization of data locality and workflow distribution. The disadvantage is that existing sequential applications require a fair amount of restructuring for a parallelization based on MPI.

1.1. Serial Matrix Multiplication

Matrix multiplication involves two matrices A and B such that the number of columns of A equals the number of rows of B. When carried out sequentially, it takes O(n^3) time. The algorithm for ordinary matrix multiplication is:

    for i = 1 to n
        for j = 1 to n
            c(i,j) = 0
            for k = 1 to n
                c(i,j) = c(i,j) + a(i,k) * b(k,j)
            end
        end
    end

1.2. Parallel Matrix Multiplication Using OpenMP

The master thread forks the outer loop among the slave threads, so each thread performs matrix multiplication on its own subset of rows of the first matrix. When the threads are done, the master thread joins the partial results into the total matrix product.
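As an illustration of this fork-and-join scheme, the following minimal C sketch (an assumed implementation, not the authors' original code; the name matmul_omp is hypothetical) parallelizes the outer row loop with an OpenMP work-sharing directive:

    #include <omp.h>

    /* Multiply the n x n matrices a and b into c, stored row-major.
     * The outer (row) loop is divided among the team of threads, so
     * each thread computes a block of rows of c, as described above. */
    void matmul_omp(int n, const double *a, const double *b, double *c)
    {
        int i, j, k;
        #pragma omp parallel for private(j, k)
        for (i = 0; i < n; i++) {
            for (j = 0; j < n; j++) {
                double sum = 0.0;
                for (k = 0; k < n; k++)
                    sum += a[i * n + k] * b[k * n + j];
                c[i * n + j] = sum;
            }
        }
    }

The implicit barrier at the end of the parallel loop plays the role of the join: the master thread continues only after all row blocks of the product are complete.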
1.3. Parallel Matrix Multiplication Using MPI

The procedure for implementing the sequential algorithm in parallel using MPI can be divided into the following steps:

- The master processor splits the first matrix row-wise among the different processors.
- The second matrix is broadcast to all processors.
- Each processor multiplies its part of the first matrix by the second matrix.
- Each processor sends its partial product back to the master processor.

In outline, the implementation proceeds as follows (a code sketch is given after the list):

- Master (processor 0) reads the data.
- Master sends the size of the data to the slaves.
- Slaves allocate memory.
- Master broadcasts the second matrix to all other processors.
- Master sends the respective parts of the first matrix to all other processors.
- Every processor performs its local multiplication.
- All slave processors send their results back to the master.
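The steps above map naturally onto MPI collectives. The following minimal C sketch (an assumption of how the procedure could be coded, not the authors' original program; matmul_mpi is a hypothetical name) uses MPI_Scatter and MPI_Bcast for the distribution and MPI_Gather for collecting the partial products, assuming n is divisible by the number of processes:

    #include <mpi.h>
    #include <stdlib.h>

    /* Row-wise parallel multiply of n x n matrices. On the master
     * (rank 0), a and c point to the full n x n arrays; b must hold
     * n x n elements on every rank. */
    void matmul_mpi(int n, double *a, double *b, double *c)
    {
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        int rows = n / nprocs;  /* rows of the first matrix per process */
        double *a_part = malloc((size_t)rows * n * sizeof(double));
        double *c_part = malloc((size_t)rows * n * sizeof(double));

        /* Master splits the first matrix row-wise; the second matrix is
         * broadcast to all processors. */
        MPI_Scatter(a, rows * n, MPI_DOUBLE,
                    a_part, rows * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        MPI_Bcast(b, n * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* Each processor multiplies its part of the first matrix by the
         * second matrix. */
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += a_part[i * n + k] * b[k * n + j];
                c_part[i * n + j] = sum;
            }

        /* Partial products are sent back to the master. */
        MPI_Gather(c_part, rows * n, MPI_DOUBLE,
                   c, rows * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        free(a_part);
        free(c_part);
    }

Here the collective MPI_Scatter/MPI_Bcast/MPI_Gather calls stand in for the explicit send and receive pairs of the outline; on most clusters the collectives are at least as efficient as hand-written point-to-point loops.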
2. Methods and Tools

One station with a Pentium i5 processor (2.5 GHz, 4 GB memory) is used for serial matrix multiplication. Visual Studio 2008 with the OpenMP library is used as the environment for building, executing and testing the matrix multiplication program, and the program is tested on the same Pentium i5 machine. For MPI, a distributed processing system with a varying number of processors is used; each processor is a 4-core processor with 2.5 GHz and 4 GB memory, and the processors are connected through Visual Studio 2008 with the MPI environment.

3. Experimental Part

Different sets of two matrices are chosen (differing in size and data type), each pair is multiplied serially and in parallel using both the OpenMP and MPI environments, and the average multiplication time is taken.

3.1. Experiment 1

The sequential matrix multiplication program is tested on matrices of different sizes. Matrices with different data types (integer, float, double, and complex) are chosen; 100 pairs of matrices with different data types and sizes are multiplied. Table 1 shows the average results obtained in this experiment.

Table 1: Experiment 1 Results (average multiplication time in seconds for matrix sizes from 10x10 upward) [numeric values not recoverable from the source]

3.2. Experiment 2

The matrix multiplication program is tested on small matrices. Matrices with different data types (integer, float, double, and complex) are chosen; 200 pairs of matrices with different data types and sizes are multiplied using the OpenMP environment. Table 2 shows the average results obtained in this experiment.

Table 2: Experiment 2 Results (multiplication time in seconds versus number of threads, for matrix sizes 10x10, 20x20, 40x40, 100x100 and 200x200) [numeric values not recoverable from the source]

3.3. Experiment 3

The matrix multiplication program is tested on big matrices. Matrices with different data types (integer, float, double, and complex) are chosen; 200 pairs of matrices with different data types and sizes are multiplied using the OpenMP environment with 8 threads. Table 3 shows the average results obtained in this experiment.

Table 3: Experiment 3 Results (multiplication time in seconds for matrix sizes from 1000x1000 upward) [numeric values not recoverable from the source]

3.4. Experiment 4

The matrix multiplication program is tested on big matrices. Matrices with different data types (integer, float, double, and complex) are chosen; 200 pairs of matrices with different data types and sizes are multiplied using the MPI environment with different numbers of processors. Table 4 shows the average results obtained in this experiment.

Table 4: Experiment 4 Results (multiplication time in seconds versus number of processors, for 1000x1000, 5000x5000 and 10000x10000 matrices) [numeric values not recoverable from the source]

4. Results Discussion

From the results obtained in the previous section we can categorize the matrices into three groups:

- Small matrices, with size less than 1000x1000
- Mid-size matrices, with 1000x1000 <= size <= 5000x5000
- Huge matrices, with size greater than 5000x5000

The following recommendations can be made:

- For small matrices, it is preferable to use sequential matrix multiplication.
- For mid-size matrices, it is preferable to use parallel matrix multiplication with OpenMP.
- For huge matrices, it is preferable to use parallel matrix multiplication with MPI.
- For huge matrices it is also recommended to use hybrid parallel systems (MPI with OpenMP).

From the results obtained in Table 2 we can see that the speedup of OpenMP is limited by the number of physical cores actually available in the computer system, as shown in Table 5 and Fig. 1, where

    Speedup (times) = execution time with 1 thread / parallel execution time

Table 5: Speedup results of using OpenMP (1-thread time, 8-thread time and speedup for matrix sizes from 300x300 upward) [numeric values not recoverable from the source]

Figure 1: Maximum performance of using OpenMP

From the results obtained in Tables 1 and 2 we can see that the multiplication time increases rapidly as the matrix size increases, as shown in Figs 2 and 3.
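The timings behind Table 5 and Figs 1-3 can be collected with a simple wall-clock harness. The sketch below (an assumed measurement setup, not the authors' code) runs the hypothetical matmul_omp kernel from Section 1.2 once with one thread and once with eight, then prints the speedup ratio defined above:

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    /* Kernel as sketched in Section 1.2 (assumed available). */
    void matmul_omp(int n, const double *a, const double *b, double *c);

    /* Time one multiplication with the requested number of threads. */
    static double timed_run(int threads, int n,
                            const double *a, const double *b, double *c)
    {
        omp_set_num_threads(threads);
        double t0 = omp_get_wtime();
        matmul_omp(n, a, b, c);
        return omp_get_wtime() - t0;
    }

    int main(void)
    {
        int n = 1000;  /* example size; vary to reproduce a table row */
        double *a = malloc((size_t)n * n * sizeof(double));
        double *b = malloc((size_t)n * n * sizeof(double));
        double *c = malloc((size_t)n * n * sizeof(double));
        for (int i = 0; i < n * n; i++) { a[i] = 1.0; b[i] = 2.0; }

        double t1 = timed_run(1, n, a, b, c);   /* sequential baseline */
        double t8 = timed_run(8, n, a, b, c);   /* 8 threads */
        printf("speedup = %.2f\n", t1 / t8);    /* the Table 5 metric */

        free(a); free(b); free(c);
        return 0;
    }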
Figure 2: Comparison between the 1-thread and 8-thread results (time in seconds versus matrix size n, for n x n matrices)

Figure 3: Relationship between the speedup and the matrix size (speedup versus matrix size n x n, saturating at the maximum number of cores)

From the results obtained in Table 4 we can calculate the speedup of using MPI and the system efficiency:

    Efficiency = speedup / number of processors

The calculation results are shown in Table 6.
Table 6: Speedup and efficiency of using MPI (speedup and efficiency versus number of processors, for 1000x1000, 5000x5000 and 10000x10000 matrices) [numeric values not recoverable from the source]

From Table 6 we can see that increasing the number of processors in an MPI environment enhances the speedup of matrix multiplication, but it also leads to poor system efficiency, as shown in Figs 4, 5 and 6.

Figure 4: Running times for parallel multiplication of two 10000x10000 matrices (time in seconds versus number of processors)
Figure 5: Speedup of multiplication for 10000x10000 matrices (speedup versus number of processors)

Figure 6: System efficiency of matrix multiplication (efficiency versus number of processors, for 1000x1000, 5000x5000 and 10000x10000 matrices)

5. Conclusions

Based on the results obtained and shown above, the following conclusions can be drawn:

- Sequential matrix multiplication is preferable for small matrices.
- OpenMP is a good environment for parallel multiplication of mid-size matrices; here the speedup is limited by the number of available physical cores.
- MPI is a good environment for parallel multiplication of huge matrices; here the speedup can be increased by adding processors, but at the cost of system efficiency.
- To avoid the problems noted in the two previous points, a hybrid parallel system (MPI with OpenMP) can be recommended; a sketch follows.
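A hybrid version can combine the two sketches given earlier: MPI distributes row blocks across the nodes as in Section 1.3, while OpenMP threads share each node's local block as in Section 1.2. The following minimal C sketch (an assumption, with the hypothetical name matmul_hybrid; it again assumes n divisible by the number of MPI processes) shows the idea:

    #include <mpi.h>
    #include <omp.h>
    #include <stdlib.h>

    /* Hybrid row-wise multiply: one MPI process per node, a team of
     * OpenMP threads inside each process. b must hold n x n elements
     * on every rank; a and c are the full matrices on rank 0. */
    void matmul_hybrid(int n, double *a, double *b, double *c)
    {
        int nprocs;
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        int rows = n / nprocs;
        double *a_part = malloc((size_t)rows * n * sizeof(double));
        double *c_part = malloc((size_t)rows * n * sizeof(double));

        MPI_Scatter(a, rows * n, MPI_DOUBLE,
                    a_part, rows * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        MPI_Bcast(b, n * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* The node-local block is shared by the OpenMP thread team. */
        #pragma omp parallel for
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += a_part[i * n + k] * b[k * n + j];
                c_part[i * n + j] = sum;
            }

        MPI_Gather(c_part, rows * n, MPI_DOUBLE,
                   c, rows * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        free(a_part);
        free(c_part);
    }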
References

[1] Alqadi Z., and Abu-Jazzar A., "Program Methods Used for Optimizing Matrix Multiplication", Journal of Engineering 15(1), 2005.
[2] Alqadi Z., and Abu-Jazzar A., "Analysis of Program Methods Used for Optimizing Matrix Multiplication", Journal of Engineering 15(1), 2005.
[3] Alqadi Z., Aqel M., and El Emary I. M. M., "Performance Analysis and Evaluation of Parallel Matrix Multiplication Algorithms", World Applied Sciences Journal 5(2), 2008.
[4] Dongarra J. J., Van de Geijn R. A., and Walker D. W., "Scalability Issues Affecting the Design of a Dense Linear Algebra Library", Journal of Parallel and Distributed Computing 22(3), 1994.
[5] Alpatov P., Baker G., Edwards C., Gunnels J., and Geng P., "Parallel Matrix Distributions: Parallel Linear Algebra Package", Proceedings of the SIAM Parallel Processing Conference, The University of Texas, Austin, 1997.
[6] Choi J., Dongarra J. J., and Walker D. W., "PUMMA: Parallel Universal Matrix Multiplication Algorithms on Distributed Memory Concurrent Computers", Concurrency: Practice and Experience 6(7), 1994.
[7] Chtchelkanova A., Gunnels J., and Morrow G., "Parallel Implementation of BLAS: General Techniques for Level 3 BLAS", Tech. Report TR-95-40, Department of Computer Sciences, University of Texas at Austin, 1995.
[8] Barnett M., Gupta S., Payne D., and Shuler L., "Interprocessor Collective Communication Library (InterCom)", Proceedings of the Scalable High Performance Computing Conference, 1994.
[9] Anderson E., Bai Z., Bischof C., and Demmel J., "Solving Problems on Concurrent Processors: Matrix Algorithms", Proceedings of Supercomputing '90, IEEE, Vol. 1.
[10] Choi J., Dongarra J. J., Pozo R., and Walker D. W., "ScaLAPACK: A Scalable Linear Algebra Library for Distributed Memory Concurrent Computers", Proceedings of the Fourth Symposium on the Frontiers of Massively Parallel Computation, IEEE Computer Society Press, 1992.
[11] MPI 1.1 Standard (web ref.).
[12] OpenMP Fortran Application Program Interface (web ref.).
[13] Rabenseifner R., "Hybrid Parallel Programming: Performance Problems and Chances", Proceedings of the 45th Cray User Group Conference, Ohio, May 12-16, 2003.