Parallel matrix-vector multiplication


Appendix A: Parallel matrix-vector multiplication

The reduced transition matrix of the three-dimensional cage model for gel electrophoresis, described in section 3.2, becomes excessively large for polymer lengths of more than L = 12. Parallel machines often have more memory than commonly used sequential machines such as workstations or PCs, and this memory can be used to solve larger problems. Our task is then to distribute the matrix over the processors such that the problem can be solved as efficiently as possible, hopefully also improving the performance by a factor close to the number of processors used.

A.1 BSP

A bulk synchronous parallel (BSP) program operates by alternating between a phase where all processors simultaneously compute local results and a phase where they communicate with each other. A superstep in a BSP algorithm consists of a computation phase followed by a communication phase. Before and after each communication phase a global synchronization is carried out. The BSPlib library (for the programming language C) [78, 79] consists of only 20 primitives and is based on one-sided communications. One-sided communications, as opposed to two-sided communications, cannot create deadlock situations. The communication mechanisms built into the BSP library are remote write, remote read and bulk synchronous message passing. In all three cases the remote processor is, at least conceptually, passive in the current superstep. The basic communication primitives are summarized below.

Remote write: the processor that executes a put statement copies a block of memory to a remote memory address at the time of the next synchronization.

Remote read: the processor that executes a get statement copies a block of memory from a remote memory address at the time of the next synchronization.

Bulk synchronous message passing: the processor that executes a send statement sends a message, consisting of a tag and a payload part, to the buffer of a remote processor at the time of the next synchronization. The messages can be read from the buffer by a move operation after the next synchronization.

The BSP cost model consists of four parameters: the number of processors p, the speed of the processors s, the communication time g and the synchronization time l. The speed of the processors is measured as the number of floating-point operations per second. The communication time is measured as the average time taken to communicate a single word to a remote processor when all the processors are communicating simultaneously; the unit of time is the time per floating-point operation (flop). The synchronization time is the amount of time needed for all processors to synchronize, also measured in flop time.

As mentioned earlier, a BSP program is either in a computation phase or in a communication phase. This makes predicting the performance of algorithms much easier than in the case of parallel programming models where computation and communication are interleaved in a less structured fashion. The analysis of the cost of a superstep is relatively simple. For each processor $i$ we count the number of flops $w_i$, the number of words sent to other processors $h_i^{(s)}$ and the number of words received $h_i^{(r)}$. The time taken by processor $i$ for computation is $w_i$ and for communication is $h_i = \max(h_i^{(s)}, h_i^{(r)})$. The cost of the superstep is $\max_i(w_i) + \max_i(h_i)\,g + l$. This shows that optimally we should divide the problem to be solved into equal parts, in the sense that the calculations and communications are evenly distributed over the available processors. Of course, we should also take care to reduce the total amount of communication.
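As an illustration of these primitives, the following is a minimal sketch (not taken from the thesis code) of a single superstep in C, assuming the standard BSPlib header bsp.h and its put/sync primitives [78, 79]: every processor writes one double into a registered variable on its right-hand neighbour with bsp_put, and the transfer only becomes visible after the bsp_sync that ends the superstep. The variable names are illustrative.

    #include <stdio.h>
    #include <bsp.h>                      /* BSPlib primitives                  */

    int main(int argc, char **argv)
    {
        bsp_begin(bsp_nprocs());          /* start the SPMD part on all processors */

        int p   = bsp_nprocs();           /* number of processors               */
        int pid = bsp_pid();              /* my processor number                */

        double my_val = (double) pid;     /* local value to communicate         */
        double neighbour_val = -1.0;      /* receives the left neighbour's value */

        /* Make the destination visible for remote writes (one superstep). */
        bsp_push_reg(&neighbour_val, sizeof(double));
        bsp_sync();

        /* Superstep: computation phase (trivial here), then communication phase.
           The put is only guaranteed to have arrived after the next bsp_sync(). */
        bsp_put((pid + 1) % p, &my_val, &neighbour_val, 0, sizeof(double));
        bsp_sync();

        printf("processor %d received %g\n", pid, neighbour_val);

        bsp_pop_reg(&neighbour_val);
        bsp_end();
        return 0;
    }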

A.2 Matrix distribution

A good way to distribute an $n \times n$ dense matrix over p = MN processors is a generalized $M \times N$ block/cyclic distribution: the rows are divided into p row blocks of equal size and the columns into N column blocks of equal size; then the matrix elements $a_{ij}$ are assigned to the processors as follows:
\[
  \phi_0(i) = \Bigl( i \;\mathrm{div}\; \tfrac{n}{p} \Bigr) \bmod M, \qquad
  \phi_1(j) = j \;\mathrm{div}\; \tfrac{n}{N}, \qquad
  a_{ij} \;\longmapsto\; P\bigl(\phi_0(i) + M\,\phi_1(j)\bigr), \tag{A.1}
\]
as shown in figure A.1. The vector elements are best distributed to the same processor as the diagonal of the matrix. Note that for each generalized block/cyclic distribution: all processors have an equally large part of the matrix; each column is distributed over M processors; each row is distributed over N processors; each processor has the same number of submatrices; and each processor has the same number of diagonal elements. This scheme fits within the general Cartesian framework of the work of Bisseling and McColl [80]; it is similar but not identical to the block/cyclic distribution.

Figure A.1: $M \times N$ generalized block/cyclic distribution for matrices on p = MN = 6 processors. The rows have a block-cyclic distribution, with p blocks which are cyclically numbered 0, 1, ..., M-1, 0, 1, ..., and the columns have a block distribution, with N blocks numbered 0, 1, ..., N-1. From left to right: M = 6, N = 1; M = 3, N = 2; M = 2, N = 3; and M = 1, N = 6.

The approach of Bisseling and McColl to the matrix-vector product $r = Ax$ can be divided into four stages:

- fan-out: the elements $x_j$ are communicated to the processors containing the values $a_{ij}$;

- local matrix-vector multiplications: the partial results $u_{it} = \sum_j a_{ij} x_j$ are computed, with the sum taken over only the local values of $a_{ij}$, which all have the same $t = \phi_1(j)$;

- fan-in: the partial results $u_{i,\phi_1(j)}$ of the processors are sent to the processor that possesses the corresponding element $r_i$;

- summation of the partial results: $r_i = \sum_{t=0}^{N-1} u_{it}$.
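The assignment of Eq. (A.1) is easy to express in code. The sketch below is illustrative, not the thesis implementation; it computes the processor that owns element $a_{ij}$, assuming n is divisible by p and by N as in figure A.1.

    #include <stdio.h>

    /* Owner of matrix element a(i,j) under the generalized M x N
       block/cyclic distribution of Eq. (A.1), with p = M*N processors
       and an n x n matrix; n is assumed divisible by p and by N. */
    static int owner(int i, int j, int n, int M, int N)
    {
        int p    = M * N;
        int phi0 = (i / (n / p)) % M;   /* cyclically numbered row block     */
        int phi1 =  j / (n / N);        /* column block                      */
        return phi0 + M * phi1;         /* element goes to P(phi0 + M*phi1)  */
    }

    int main(void)
    {
        /* Print the owner map for M = 3, N = 2, n = 6; this should
           reproduce the second panel of figure A.1. */
        int n = 6, M = 3, N = 2;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++)
                printf("%d ", owner(i, j, n, M, N));
            printf("\n");
        }
        return 0;
    }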

If the matrix is divided into rows (which is the special case N = 1 of our generalized block/cyclic distribution), the fan-in and the summation of partial sums are avoided; this saves some communication, but all processors then have to communicate with all other processors in the fan-out part. On the other hand, if the matrix is divided into columns (M = 1), then the fan-out communication is avoided and the fan-in communication is an all-to-all operation. For the general $M \times N$ distribution, the fan-out is an M-to-M communication and the fan-in an N-to-N communication. The communication then takes $O((M+N)\frac{n}{p})\,g$ time, instead of $O(MN\frac{n}{p})\,g$. The communication is minimal if $M = N = \sqrt{p}$ is used.

For a sparse matrix, the algorithm is adapted to avoid computations and communications involving zero elements: elements $x_j$ are only sent if the corresponding $a_{ij} \neq 0$; partial sums are only computed using products $a_{ij} x_j$ with $a_{ij} \neq 0$; and the partial sums are only sent and summed if they are nonzero.
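For illustration, a minimal sketch of the local multiplication stage with this sparsity filtering is given below; it is hypothetical code, assuming the local nonzeros are stored as triplets with local row and column indices, and the fan-out and fan-in communication supersteps are not shown.

    #include <stddef.h>

    /* One locally stored nonzero a(i,j), with i and j given as local indices
       into the parts of r and x held on (or fanned out to) this processor. */
    typedef struct {
        int    i;    /* local row index    */
        int    j;    /* local column index */
        double a;    /* nonzero value a_ij */
    } nonzero;

    /* Local stage of the sparse matrix-vector product: accumulate the partial
       sums u_i = sum_j a_ij * x_j over the local nonzeros only. */
    void local_sparse_matvec(const nonzero *A, size_t nnz,
                             const double *x, double *u, size_t nrows_local)
    {
        for (size_t i = 0; i < nrows_local; i++)
            u[i] = 0.0;
        for (size_t k = 0; k < nnz; k++)
            u[A[k].i] += A[k].a * x[A[k].j];   /* only nonzero a_ij contribute */
    }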

The next section shows how advantage is taken of the specific sparsity structure of the matrix.

A.3 Exploiting the sparsity structure

In our problem, for L > 12, we cannot afford to store the complete matrix on a single processor, so we need to distribute it over a number of processors. The matrix we have to deal with is sparse and we exploit this in our computations, since we only handle nonzero elements $A_{ij}$. The standard approach to communicating a subset of the elements of a vector is to gather all elements and their global indices in separate arrays, and then to send those arrays to the processors that need them. The overhead of repeatedly sending the same arrays of indices may be removed by sending them only the first time the matrix-vector multiplication is performed, but the overhead of repeatedly packing and unpacking the vector elements cannot be removed in general.

Our transition matrix has a particular structure, with patches containing many nonzero elements. We exploit this to make communications faster by sending contiguous subvectors, avoiding the packing and unpacking overhead. Consider a rectangular patch (i.e., a contiguous submatrix). A value $x_j$ must be sent to the owner of the patch if an element $A_{ij}$ in column j of the patch is nonzero. It is likely that most columns of the patch have at least one nonzero, so we might as well send all $x_j$ for that patch. This makes it possible to send a contiguous subvector of x, which is more efficient than sending separate components; this comes at the expense of a few unnecessary communications. The trade-off can be shifted by increasing or decreasing the patch size.

To find suitable patches, we first divide the state vector into contiguous subvectors. We use a heuristic to partition the matrix into blocks of rows with approximately the same number of nonzeros. If we use P processors, and we want each processor to have K subvectors, we have to divide the vector into KP subvectors. (The factor K is the overpartitioning factor.) This initial division tries to minimize the computation time. Next, we adjust the divisions to reduce communication: a suitable patch in the matrix corresponds to an input subvector of kink representations in which only the last few bits differ, and also to an output subvector with that property. Therefore, we search for a pair of adjacent kink representations that differ in a bit as far to the left as possible; this is a suitable place to split. We try to keep the distance from the starting point as small as possible.
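A possible form of this adjustment step is sketched below. This is illustrative code, assuming the kink representations are available as unsigned integers in state-vector order; the search window and the tie-breaking rule are assumptions, not details taken from the thesis.

    #include <stddef.h>

    /* Position (counted from bit 0) of the highest bit in which a and b
       differ; returns -1 if a == b. */
    static int highest_differing_bit(unsigned long a, unsigned long b)
    {
        unsigned long d = a ^ b;
        int bit = -1;
        while (d != 0) { bit++; d >>= 1; }
        return bit;
    }

    /* Adjust an initial split position `start` (from the nonzero-balancing
       heuristic) to a nearby position where the adjacent kink representations
       differ in a bit as far to the left as possible.  Candidates further than
       `window` positions away are not considered, so the computational balance
       is only perturbed slightly; ties favour the smallest distance. */
    static size_t adjust_split(const unsigned long *kink, size_t nstates,
                               size_t start, size_t window)
    {
        size_t best = start;
        int    best_bit = -1;

        for (size_t q = (start > window ? start - window : 1);
             q <= start + window && q < nstates; q++) {
            int bit = highest_differing_bit(kink[q - 1], kink[q]);
            size_t dist      = (q > start) ? q - start : start - q;
            size_t best_dist = (best > start) ? best - start : start - best;
            if (bit > best_bit || (bit == best_bit && dist < best_dist)) {
                best_bit = bit;
                best     = q;
            }
        }
        return best;   /* the vector is split between states best-1 and best */
    }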

As an example of the structure of the reduced transition matrices and the division into submatrices, we show the nonzero structure of the matrix for L = 5 in figure A.2 and its corresponding communication matrix in figure A.3 (left). The communication matrix is built from the partitioned transition matrix by considering each submatrix as a single element. It is a sparse matrix of much smaller size, which determines the communication requirements. Our communication matrix for L = 13 is given in figure A.3 (right).

Figure A.2: Reduced transition matrix for polymer length L = 5. The size of the matrix is $37 \times 37$ and it has 233 nonzero elements, shown as black squares. To the left of each row is the corresponding kink representation written as a binary number, with black circles denoting 1 and open ones 0. The horizontal lines on the left show the initial division of the reduced state vector into eight contiguous parts, optimized to balance the number of nonzeros in the corresponding matrix rows. The jumps of these lines indicate slight adjustments to make the division fit the nonzero structure of the matrix. The resulting vector division induces a division of the rows and columns of the matrix, and hence a partitioning into 64 submatrices, shown by the gray checkerboard pattern. Complete submatrices are now assigned to the processors of a parallel computer.

Figure A.3: Communication matrix for L = 5 (left) and L = 13 (right). Note that the matrix for L = 5 can be obtained by replacing each nonempty submatrix in figure A.2 by a single nonzero element. The communication matrix for L = 13, of size $320 \times 320$, is distributed over 16 processors in a row distribution.

A.4 Timings

Our computations were performed on a Cray T3E computer. The peak performance of a single node of the Cray T3E is 600 Mflop/s for computations. The bsp_probe benchmark shows a performance of 47 Mflop/s per node [78]. The peak interprocessor bandwidth is 500 Mbyte/s (bidirectional). The bsp_probe benchmark shows a sustained bidirectional performance of 94 Mbyte/s per processor when all 64 processors communicate at the same time. This is equivalent to a BSP parameter g = 3.8, where g is the cost, in flop time units, of one 64-bit word leaving or entering a processor. The measured global synchronization time for 64 processors is 48 µs, which is equivalent to l = 2259 flop time units.
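These flop-unit values follow, up to rounding, from the measured rates. The small check below is illustrative and uses only the rounded numbers quoted above, so it reproduces g and l approximately rather than exactly.

    #include <stdio.h>

    int main(void)
    {
        double s         = 47.0e6;    /* measured computation rate, flop/s       */
        double bandwidth = 94.0e6;    /* sustained rate per processor, byte/s    */
        double word      = 8.0;       /* one 64-bit word, in bytes               */
        double sync_time = 48.0e-6;   /* measured global synchronization time, s */

        double g = s * word / bandwidth;   /* flops that fit in the time of one word */
        double l = s * sync_time;          /* flops that fit in one synchronization  */

        /* Prints roughly g = 4.0 and l = 2256, close to the quoted
           g = 3.8 and l = 2259 (the quoted values were measured directly). */
        printf("g = %.1f flop units, l = %.0f flop units\n", g, l);
        return 0;
    }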

Table A.1 presents the execution time of one iteration of the algorithm in two forms: the BSP cost a + bg + cl counts the flops and the communications and thus gives the time on an arbitrary computer with BSP parameters g and l, whereas the time in milliseconds gives the measured time on this particular architecture, split into computation and communication time. (The total measured synchronization time is negligible.)

     L    P   BSP cost                          time (ms)    efficiency   speedup
    12    8     545 156 +    64 716 g + 2l       47 +   4.3      85%         6.8
    13   16   1 002 824 +   187 347 g + 2l       89 +  13        81%        13.0
    14   32   1 836 920 +   425 152 g + 2l      169 +  44        73%        23.4
    15   64   3 452 776 + 1 380 415 g + 2l      330 + 112        67%        42.9

Table A.1: BSP cost, time, efficiency, and speedup for one matrix-vector multiplication.

The BSP cost can be used to predict the run time of our algorithm on different architectures. Table A.1 also gives the efficiency and speedup relative to a sequential program. Peak computation performance is often only reached for dense matrix-matrix multiplication; the performance for sparse matrix-vector multiplication is always much lower. Comparing the flop count and the measured computation time for the largest problem, L = 15, we see that we achieve about 10.5 Mflop/s per processor. Comparing the communication count with the measured communication time, we obtain a g-value of 0.081 µs per word (or g = 3.8 flop units; see above). This means that we attain the maximum sustainable communication speed. This is due to the design of our algorithm, which communicates contiguous subvectors instead of single components. Furthermore, the results show that our choice to optimize mainly the computation (by choosing a row distribution) is justified for this architecture: the communication time is always less than a third of the total time. For a different machine, with a higher value of g, more emphasis must be placed on optimizing the communication, leading to a two-dimensional distribution.

Each iteration of our computation contains one matrix-vector multiplication. The number of iterations needed for convergence depends on the length of the polymer and on the applied electric field. The iteration was stopped when either the accuracy was better than $10^{-10}$, or the number of iterations exceeded 100 000. In the latter case, the accuracy was computed at termination. Typically, for L = 15 and a low electric field strength, 50 000 iterations are needed, taking about 6 hours per data point. Only computed values with accuracy $10^{-4}$ or better are shown in figure 5.3. For L = 12, we compared the output of the parallel program with that of the sequential program and found the difference to be within rounding errors. The total speedup for L = 15, compared to a naive implementation (for which one would need 38.5 Tbyte of memory), is a factor $1.5 \times 10^6$: a factor of 17 248 by using a reduced state space, a factor of 2 by shifting the eigenvalues of the reduced transition matrix, and a factor of 42.9 by using a parallel program on 64 processors.
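As a worked example of reading Table A.1, the snippet below (illustrative only) converts the BSP cost for L = 15 into milliseconds using the rates quoted above: the achieved sparse matrix-vector rate of 10.5 Mflop/s per processor for the computation term and 0.081 µs per word for the communication term. This reproduces the measured 330 ms of computation and 112 ms of communication, and confirms that the synchronization term is negligible.

    #include <stdio.h>

    int main(void)
    {
        /* BSP cost of one matrix-vector multiplication for L = 15 (Table A.1):
           a + b*g + c*l with a = 3 452 776 flops, b = 1 380 415 words, c = 2. */
        double a = 3452776.0, b = 1380415.0, c = 2.0;

        double flop_rate = 10.5e6;     /* achieved sparse matvec rate, flop/s */
        double word_time = 0.081e-6;   /* measured time per 64-bit word, s    */
        double sync_time = 48.0e-6;    /* global synchronization time, s      */

        double t_comp = a / flop_rate;    /* ~0.33 s, cf. 330 ms measured      */
        double t_comm = b * word_time;    /* ~0.11 s, cf. 112 ms measured      */
        double t_sync = c * sync_time;    /* ~0.1 ms, negligible               */

        printf("computation     %.0f ms\n", 1e3 * t_comp);
        printf("communication   %.0f ms\n", 1e3 * t_comm);
        printf("synchronization %.2f ms\n", 1e3 * t_sync);
        return 0;
    }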