Summer 2009 REU: Introduction to Some Advanced Topics in Computational Mathematics

Moysey Brio & Paul Dostert
July 4, 2009

Sparse Matrices

In many areas of applied mathematics and modeling, one comes across sparse matrix problems, where we need to solve Ax = b with A containing primarily zeros.

Storage of an n × n linear system with m nonzero entries:
- Full matrix storage: n^2 elements. For n = 10,000, a double precision matrix takes approximately 763 MB.
- Matlab-like (i, j, A_ij) storage: 3m elements. For n = 10,000 and m = 50,000, a double precision matrix takes approximately 1.2 MB.

An example problem would be solving an elliptic PDE on a 2D mesh. Using a 2D mesh and simple finite differences, one would want an approximation to the solution of the PDE at each point in the mesh. For a 1000 × 1000 mesh, which is quite small in practical problems, we have a resulting 1,000,000 × 1,000,000 matrix. Having 5,000,000 nonzeros would result in approximately 76 MB for (i, j, A_ij) storage.
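As a rough illustration of these numbers, here is a minimal sketch (assuming standard Matlab with sparse support; the stencil and mesh size are just the 1000 × 1000 example above) that builds the 2D finite-difference matrix in sparse form and inspects its storage:

    % Sparse 5-point finite-difference matrix on an N x N mesh.
    N = 1000;                                   % mesh is N x N, so A is N^2 x N^2
    e = ones(N, 1);
    T = spdiags([-e 2*e -e], -1:1, N, N);       % 1D second-difference matrix
    A = kron(speye(N), T) + kron(T, speye(N));  % 2D operator, about 5*N^2 nonzeros
    fprintf('n = %d, nonzeros = %d\n', size(A, 1), nnz(A));
    whos A                                      % a few tens of MB in sparse form,
                                                % versus terabytes stored as a full matrix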

CSR & CSC

For (i, j, A_ij) storage, generally all the i = 1 values come first, then i = 2, and so on. Instead of saving each i value, we save how many entries come before we transition from row i to row i + 1. This is the idea behind compressed sparse row (column) format.

Consider storing the following matrix:

    6 1 0 1 0 0 2 0
    1 6 0 1 0 1 0 0
    1 1 6 1 0 0 1 0
    0 0 0 6 0 3 0 0
    5 0 0 1 6 0 0 1
    2 0 0 3 0 6 0 0
    1 0 0 0 0 0 6 0
    0 0 0 0 0 0 1 6

We keep

    a = [6 1 1 2  1 6 1 1  1 1 6 1 1  6 3  5 1 6 1  2 3 6  1 6  1 6]
    J = [1 2 4 7  1 2 4 6  1 2 3 4 7  4 6  1 4 5 8  1 4 6  1 7  7 8]
    I = [1 5 9 14 16 20 23 25 27]

where a holds the nonzero values row by row, J the corresponding column indices, and I(i) the position in a where row i begins. This is only 2m + n + 1 entries.
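A minimal sketch of how these three arrays are used (our own helper function, not from the original slides; 1-based indexing as in Matlab) to form the matrix-vector product z = Ay:

    function z = csr_matvec(a, J, I, y)
    % Multiply a CSR-stored matrix by a vector y.
    %   a : nonzero values, row by row
    %   J : column index of each nonzero
    %   I : I(i) is where row i starts in a; I(end) = m + 1
    n = length(I) - 1;
    z = zeros(n, 1);
    for i = 1:n
        for k = I(i):I(i+1) - 1              % walk the nonzeros of row i
            z(i) = z(i) + a(k) * y(J(k));
        end
    end
    end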

CSR & CSC

Storage of an n × n linear system with m nonzero entries:
- Full matrix storage: n^2 elements. For n = 1,000,000, a double precision matrix takes approximately 7 TB.
- Matlab-like (i, j, A_ij) storage: 3m elements. For n = 1,000,000 and m = 5,000,000, a double precision matrix takes approximately 76 MB.
- CSR & CSC storage: 2m + n + 1 elements. For n = 1,000,000 and m = 5,000,000, a double precision matrix takes approximately 61 MB.

Issues:
- CSR does not save a huge amount of space over (i, j, A_ij) storage, but in applications beyond 2D, matrix sizes are well into the trillions of entries, so this could be a savings of a few terabytes.
- CSR does not take advantage of symmetry or repeated patterns.
- Matrix-vector and matrix-matrix multiplication routines must be rewritten to use CSR.

In practice, a full matrix is stored only if completely necessary. Often, we only need to know how matrices act, rather than every element of a matrix. For example, to use a Conjugate Gradient routine one only needs to know how to compute Ax for any x.
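To illustrate the last point, Matlab's pcg accepts a function handle in place of A, so Conjugate Gradient can run without the matrix ever being stored. A minimal sketch (the operator here is the tridiagonal matrix used in the examples below; the tolerance and iteration limit are arbitrary choices of ours):

    % Matrix-free Conjugate Gradient: pcg only ever asks for A*y.
    n    = 1000;
    afun = @(y) [2*y(1) + y(2);
                 y(1:end-2) + 2*y(2:end-1) + y(3:end);
                 y(end-1) + 2*y(end)];       % applies tridiagonal A without storing it
    b    = ones(n, 1);
    [x, flag] = pcg(afun, b, 1e-8, 10*n);    % flag = 0 means the tolerance was met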

Parallel Computing Architectures

Some problems simply cannot be done on a single computer or single processor. For these problems, we must use some special techniques. First, let us discuss the different types of parallel computing environments.

In a general environment, we have a set of processors, a set of memory banks, and some way to connect them. The way this is set up defines the different types of parallel machines.

[Diagram: processors connected through an interconnect to several memory banks.]

Shared Memory

Examples of shared memory computers would be your multicore desktop, some SGI Power Challenge servers, and some Cray machines. Properties often include:
- A single memory bank, to which each processor has relatively fast access.
- Few processors or cores: usually, at most, 1000 or so processors.
- Very expensive: fast interconnects between processors are extremely expensive.

[Diagram: several processors attached to a single shared memory.]

Distributed Memory

An example of a distributed memory computer would be a traditional Beowulf cluster. Properties often include:
- Each processor has its own local memory bank.
- Relatively slow interconnect between processors.
- Inexpensive: in its simplest form, this can simply be PCs linked together over a network.
- Can have many processors (the IBM Roadrunner has 19,440 processors, with 129,600 cores).

[Diagram: processors, each with its own local memory, connected by a network.]

Realistic and Modern Designs

Nearly all powerful clusters are a combination of shared and distributed memory. Essentially, think of this as a series of shared memory machines linked to each other in a distributed network.

[Diagram: shared-memory multiprocessor nodes, each with its own memory, connected by a network.]

Parallelization of Linear Systems

Let us consider a matrix problem Ax = b where A is a tridiagonal matrix, with 2 on the diagonal and 1 on the upper and lower diagonals. Suppose we are given a vector y and wish to compute Ay = z. (Note: in a practical application, we would never store this matrix, since its matrix-vector product can easily be computed without keeping each entry of A.) We must compute:

    [ 2 1          ] [ y_1     ]   [ 2y_1 + y_2               ]
    [ 1 2 1        ] [ y_2     ]   [ y_1 + 2y_2 + y_3         ]
    [   .  .  .    ] [  ...    ] = [  ...                     ]
    [      1 2 1   ] [ y_{n-1} ]   [ y_{n-2} + 2y_{n-1} + y_n ]
    [        1 2   ] [ y_n     ]   [ y_{n-1} + 2y_n           ]

Suppose that we would like to compute the result using two processors, by letting the first processor compute the first half of z and the second processor the second half.

Linear Systems on Shared Memory

On a shared memory machine, each processor has direct access to every part of A and y. We simply let each processor do the required work. On the slide, the work done by the first processor is shown in red and the work done by the second in green; the first processor handles the top half of the rows and the second the bottom half:

    [ 2 1          ] [ y_1     ]   [ 2y_1 + y_2               ]
    [ 1 2 1        ] [ y_2     ]   [ y_1 + 2y_2 + y_3         ]
    [   .  .  .    ] [  ...    ] = [  ...                     ]
    [      1 2 1   ] [ y_{n-1} ]   [ y_{n-2} + 2y_{n-1} + y_n ]
    [        1 2   ] [ y_n     ]   [ y_{n-1} + 2y_n           ]

We can easily extend this idea to many more processors. Note that we can split the matrix any way desired, as long as we balance the work given to each processor evenly.
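A minimal shared-memory sketch of this two-way split (assuming the Parallel Computing Toolbox with an open pool; the padding trick and chunk layout are our own choices). Each worker reads whatever entries of y it needs directly, since the whole vector is visible to everyone:

    n  = 2^20;                              % any even n
    y  = rand(n, 1);
    yp = [0; y; 0];                         % pad so every row "sees" two neighbors
    zparts = cell(1, 2);
    parfor p = 1:2                          % worker p computes its half of z
        rows = (1 + (p-1)*n/2) : (p*n/2);
        zparts{p} = yp(rows) + 2*yp(rows+1) + yp(rows+2);   % y_{i-1} + 2y_i + y_{i+1}
    end
    z = vertcat(zparts{:});                 % z = A*y for the tridiagonal A above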

Linear Systems on Distributed Memory

On a distributed memory machine, different parts of the matrix and vectors are stored locally on different memory/processor locations. For example, we could multiply using the coloring on the slide, where the entries in cyan (the values y_{n/2-1} and y_{n/2} near the split) are needed by both processors (we ignore zeros):

    [ 2 1                ] [ y_1       ]   [ 2y_1 + y_2                       ]
    [ 1 2 1              ] [  ...      ]   [  ...                             ]
    [   .  .  .          ] [ y_{n/2-1} ]   [ y_{n/2-2} + 2y_{n/2-1} + y_{n/2} ]
    [      1 2 1         ] [ y_{n/2}   ] = [ y_{n/2-1} + 2y_{n/2} + y_{n/2+1} ]
    [        .  .  .     ] [  ...      ]   [  ...                             ]
    [             1 2    ] [ y_n       ]   [ y_{n-1} + 2y_n                   ]
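The same product can be sketched on distributed memory with spmd, where each worker owns only its half of y and the shared boundary value has to be passed explicitly as a message (a sketch assuming a pool of exactly 2 workers; labSendReceive is the Parallel Computing Toolbox call for a paired send/receive, and the local sizes are illustrative):

    spmd                                     % run on a pool of 2 workers
        nloc = 8;                            % rows owned by this worker (illustrative)
        yloc = rand(nloc, 1);                % this worker's piece of y
        if labindex == 1                     % worker 1 owns the top half
            ghost = labSendReceive(2, 2, yloc(end));  % send my last value, get worker 2's first
            zloc  = [0; yloc(1:end-1)] + 2*yloc + [yloc(2:end); ghost];
        else                                 % worker 2 owns the bottom half
            ghost = labSendReceive(1, 1, yloc(1));    % send my first value, get worker 1's last
            zloc  = [ghost; yloc(1:end-1)] + 2*yloc + [yloc(2:end); 0];
        end
    end

Each worker computes only its own piece zloc; the single value it does not own arrives as a message, which is exactly the passing described below.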

Linear Systems on Distributed Memory

Issues in using distributed memory systems:
- BOTH CPUs need y_{n/2} and y_{n/2-1}. Usually, CPU 1 would store y_1 to y_{n/2-1} while CPU 2 stores the rest. This means that for CPU 1 to use y_{n/2}, the value needs to be PASSED from CPU 2 to CPU 1.
- Instead of splitting the matrix as above, what happens if each odd row is stored on CPU 1 and each even row on CPU 2?
- It is harder to write good code for distributed memory machines, but the same code would run very well on shared memory as well.

Application programming interfaces (APIs) for parallel machines:
- For distributed memory machines, the Message Passing Interface (MPI) and Parallel Virtual Machine (PVM) are the most used.
- For shared memory machines, there are many choices:
  - OpenMP: often works together with MPI on practical machines.
  - Intel Threading Building Blocks (TBB): relatively new.
  - POSIX Threads (Pthreads): usually C only.

Parallelization in Matlab

Many Matlab functions are already built to use multicore processors:
- » maxNumCompThreads(n) sets the number of cores to use.
- Enabling parallel computations usually comes through function options: options = optimset('UseParallel', 'always');

There is also a Parallel Computing Toolbox allowing you to parallelize your own codes:
- Use matlabpool open n to enable n machines/cores (if there are n available).
- Change a for to a parfor to indicate a loop can be done in any order.
- Use matlabpool close to indicate the parallel code is done.

Arrays can be split for distributed clusters using the codistributed command:

    » A = rand(80, 1000); D = codistributed(A, 'convert');

This distributes the array to each process in the Matlab pool, as D. This requires everyone to already have A stored, so the above does not save on memory. To do that, we use something like:

    » L = magic(5) + labindex; A = codistributed(L, codistributor());
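Putting these pieces together, a minimal end-to-end sketch using the 2009-era syntax above (in current releases matlabpool has been replaced by parpool; the loop body is just a stand-in computation):

    matlabpool open 4                 % start 4 workers, if 4 are available
    s = zeros(1, 100);
    parfor k = 1:100                  % iterations may run in any order, on any worker
        s(k) = max(abs(eig(rand(200))));
    end
    matlabpool close                  % done with parallel work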

Mesh Partitioning

Often a nonzero entry A_ij implies a geometric relationship between i and j. This can be used to distribute loads evenly to CPUs; the process is often called mesh partitioning. Consider an 8 × 8 grid where each point is an entry in x. This results in a 64 × 64 matrix A. Assume that row i contains an element in the j-th column iff i and j are connected in the mesh.

[Figure: an 8 × 8 grid of mesh points, with corners labeled (1,1) and (8,8).]
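A quick sketch of that 64 × 64 connectivity structure (our own construction, assuming 4-point north/south/east/west connections between grid neighbors):

    % (i,j) is nonzero iff mesh points i and j coincide or are grid neighbors.
    N = 8;
    e = ones(N, 1);
    P = spdiags([e e e], -1:1, N, N);            % neighbors within one grid line
    A = kron(speye(N), P) + kron(P, speye(N));   % connectivity of the full 8 x 8 grid
    spy(A)                                       % view the 64 x 64 sparsity pattern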

Mesh Partitioning - 2 Processors

There are many ways to decompose an 8 × 8 mesh into two parts with an equal number of unknowns. After a tiny bit of thinking, it is clear that splitting through the middle horizontally or vertically will produce the configuration requiring the smallest amount of communication.

[Figure: the 8 × 8 mesh split down the middle into two equal halves.]

Mesh Partitioning - 4 Processors

For four processors, things get slightly more complicated. Each of the following is a possibility, and either situation may be optimal, depending on the processor configuration.

[Figure: possible decompositions of the 8 × 8 mesh into four equal parts.]

Mesh Partitioning - General Case

Partitioning a general mesh so that the load given to each processor is approximately the same, while still minimizing communication, is an extremely difficult task and an active area of research.

Future Directions

A fairly recent development in parallel computing is the use of GPUs (graphics processing units):
- GPUs were developed to run advanced graphics and video games, and they have a huge number of processing threads.
- GPUs are hundreds to thousands of times faster at certain calculations than a CPU, but struggle with general computing.
- Offloading certain calculations to a GPU can greatly speed up your computations.
- Special coding environments (OpenCL and CUDA) must be used, and codes need to be written very carefully.
- An Intel Core i7 965 Extreme provides approximately 60 GFLOPS and costs $1000; an AMD/ATI FireStream 9270 SP provides 270 GFLOPS and costs $1250.

Bottom line: multicore processors are here to stay, and you will have to learn how to use them properly. If you are interested in computing, make sure to take a basic parallel programming class. A faster algorithm won't necessarily be better computationally, since it may be very difficult to parallelize.