COSC 6374 Parallel Computation. Dense Matrix Operations

Edgar Gabriel, Fall

Terminology
- Dense matrix: all elements of the matrix contain relevant values; typically stored as a 2-D array (e.g. double a[16][16]).
- Sparse matrix: most elements of the matrix are zero. Optimized storage techniques:
  - Band matrices: store only the relevant diagonals of the matrix.
  - Highly irregular sparse matrices: store the coordinates of every non-zero element together with the content.
  - Boeing-Harwell format: exploit certain regularities (e.g. a nearly constant number of entries per row or column).
  - Jagged Diagonal storage format: see the Boeing-Harwell format.
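As a small illustration of the dense vs. sparse storage distinction above, the following sketch converts a dense 4x4 array into coordinate storage; the struct name, array sizes and values are illustrative assumptions, not a prescribed format:

    #include <stdio.h>

    /* Dense storage: every element is kept, even the zeros. */
    double dense[4][4] = {
        {1.0, 0.0, 0.0, 2.0},
        {0.0, 3.0, 0.0, 0.0},
        {0.0, 0.0, 4.0, 0.0},
        {5.0, 0.0, 0.0, 6.0},
    };

    /* Coordinate storage for highly irregular sparse matrices:
       only the non-zero entries, each with its (row, col) index. */
    typedef struct { int row, col; double val; } coo_entry;

    int main(void)
    {
        coo_entry coo[16];
        int nnz = 0;

        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++)
                if (dense[i][j] != 0.0) {
                    coo[nnz].row = i;
                    coo[nnz].col = j;
                    coo[nnz].val = dense[i][j];
                    nnz++;
                }

        printf("dense storage: 16 doubles, coordinate storage: %d entries\n", nnz);
        return 0;
    }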

Replication vs. Communication
- Large data items are typically distributed across multiple processes. What is "large"?
- Small data items can be replicated on all processes or communicated whenever required.
- Costs for communication: ~ network latency.
- Costs for replication: ~ memory consumption + ~ repeated computation operations.

Matrix operations: B = c * A (multiplying a matrix with a constant)
- The constant is definitely small and is thus replicated on all processes, e.g. compiled into the code or read from a configuration file.
- The operation does not require any communication to be performed: trivially parallel.
- The operation can be performed independently of the way the matrix has been distributed across the processes.
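A minimal sketch of this trivially parallel case, assuming each process already holds some local portion of A; the function name, the local block size and the constant are illustrative assumptions:

    #include <mpi.h>

    #define NLOCAL 4   /* illustrative size of the local portion */

    /* Each process scales whatever part of A it owns; no communication,
       and the result is correct regardless of how A was distributed. */
    void scale_local(double a_local[NLOCAL][NLOCAL],
                     double b_local[NLOCAL][NLOCAL], double c)
    {
        for (int i = 0; i < NLOCAL; i++)
            for (int j = 0; j < NLOCAL; j++)
                b_local[i][j] = c * a_local[i][j];
    }

    int main(int argc, char **argv)
    {
        double a[NLOCAL][NLOCAL] = {{0}}, b[NLOCAL][NLOCAL];

        MPI_Init(&argc, &argv);
        scale_local(a, b, 2.5);   /* constant replicated on all processes */
        MPI_Finalize();
        return 0;
    }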

Matrix operations: B = A^T (transpose a matrix)
- Often not necessary, since the operations (e.g. matrix-vector multiply) can be (easily) reformulated as matrix-transpose-vector multiply operations and avoid the data transpose.
- Operations requiring the transpose: multi-dimensional FFT.
- Assumption: matrices A, B are square.
- Element B[x][y] should end up on the same process as element A[x][y] -> requires communication across the processes.

B = A^T: one element per process
- Initial data distribution: one element of the matrix per process (processes 0 .. 15 arranged as a 4x4 grid).
- The process with coordinates (x,y) needs to send its data item to the process with coordinates (y,x) and receive its data item from (y,x).

B = A^T: one element per process

    // Assumptions:
    // newcomm has been created using MPI_Cart_create
    // double a, b are the elements of the matrices owned by
    // each process; a is already set.
    int coords[2];      // my coordinates in the 2-D topology
    int rem_coords[2];  // coordinates of my counterpart
    MPI_Request req[2];
    MPI_Status stats[2];

    // Determine my own rank in newcomm
    MPI_Comm_rank (newcomm, &rank);
    // Determine my own coordinates in newcomm
    MPI_Cart_coords (newcomm, rank, ndims, coords);
    // Determine the coordinates of my counterpart
    rem_coords[0] = coords[1];
    rem_coords[1] = coords[0];

    // Determine the rank of my counterpart using his coordinates
    MPI_Cart_rank ( newcomm, rem_coords, &rem_rank);
    // Initiate non-blocking communication to send a
    MPI_Isend ( &a, 1, MPI_DOUBLE, rem_rank, 0, newcomm, &req[0]);
    // Initiate non-blocking communication to receive b
    MPI_Irecv ( &b, 1, MPI_DOUBLE, rem_rank, 0, newcomm, &req[1]);
    // Wait on both non-blocking operations to finish
    MPI_Waitall ( 2, req, stats);

Notes:
- Using non-blocking communication avoids having to schedule the messages to avoid deadlock.
- Processes on the main diagonal send a message to themselves.

B = A^T: column-wise data distribution
- One column per process.
- Element a[i] needs to be sent to process i.
- Element b[i] will be received from process i.

    MPI_Request *reqs;
    MPI_Status *stats;
    int rank, size;
    double a[N], b[N];

    // Determine the number of processes working on the problem
    // and my rank in the communicator
    MPI_Comm_size ( comm, &size);
    MPI_Comm_rank ( comm, &rank);

    // Allocate the required number of Requests and Statuses. Since
    // the code is supposed to work for arbitrary numbers of
    // processors, you can not use static arrays for reqs and stats
    reqs  = (MPI_Request *) malloc ( 2*size*sizeof(MPI_Request) );
    stats = (MPI_Status *)  malloc ( 2*size*sizeof(MPI_Status) );

B = A^T: column-wise data distribution

    // Start all non-blocking communication operations
    for (i=0; i<size; i++ ) {
        MPI_Isend (&a[i], 1, MPI_DOUBLE, i, 0, comm, &reqs[2*i]);
        MPI_Irecv (&b[i], 1, MPI_DOUBLE, i, 0, comm, &reqs[2*i+1]);
    }
    // Wait for all non-blocking operations to finish
    MPI_Waitall ( 2*size, reqs, stats);

Notes:
- Identical approach and code for the row-wise data distribution, as long as the local portions of both a and b are stored as one-dimensional arrays.
- Number of messages: N^2 = np^2.

B = A^T: block column-wise data distribution
- Each process holds N_local columns of each matrix, with N = sum_{i=0}^{np-1} N_local(i), assuming N can be divided evenly onto the np processes (so N = np * N_local).

B = A^T: block column-wise data distribution
- Element a[i][j] has to become element b[j][i] (i, j are global indexes).
- Variable declarations on each process:
    double a[N][N_local];
    double b[N][N_local];
- a[i][j] is located on the process with rank r = j/N_local and has the local indexes [i1][j1] with i1 = i and j1 = j%N_local.
- b[j][i] is located on the process with rank s = i/N_local and has the local indexes [j2][i2] with j2 = j and i2 = i%N_local.

    // Code fragment for the communication
    for ( j1=0; j1<N_local; j1++) {
        for (i=0; i<N; i++ ) {
            dest = i / N_local;
            MPI_Isend ( &(a[i][j1]), 1, MPI_DOUBLE, dest, 0, comm, &reqs[ ]);
        }
    }
    for ( j=0; j<N; j++ ) {
        for ( i2=0; i2<N_local; i2++ ) {
            src = j / N_local;
            MPI_Irecv ( &(b[j][i2]), 1, MPI_DOUBLE, src, 0, comm, &reqs[ ]);
        }
    }
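The owner and local-index mapping stated above can be captured in two small helpers; the function names and the example values are illustrative assumptions:

    #include <stdio.h>

    /* Owner rank and local column index of a global column index
       under a block column-wise distribution with n_local columns per rank. */
    static int owner_rank(int col, int n_local) { return col / n_local; }
    static int local_col(int col, int n_local)  { return col % n_local; }

    int main(void)
    {
        int n_local = 4;      /* illustrative: 4 columns per process */
        int i = 5, j = 10;    /* a global element a[i][j] */

        /* a[i][j] is held by rank j/n_local as a[i][j % n_local] ... */
        printf("a[%d][%d] -> rank %d, local index [%d][%d]\n",
               i, j, owner_rank(j, n_local), i, local_col(j, n_local));
        /* ... and its transposed element b[j][i] by rank i/n_local
           as b[j][i % n_local]. */
        printf("b[%d][%d] -> rank %d, local index [%d][%d]\n",
               j, i, owner_rank(i, n_local), j, local_col(i, n_local));
        return 0;
    }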

B = A^T: block column-wise data distribution
- The algorithm on the previous slide is good because it doesn't require any additional temporary storage.
- The algorithm on the previous slide is bad because:
  - it sends N^2 messages, with N >> np, and the cost of each message is proportional to the network latency for short messages;
  - the matrix has to be traversed in a non-contiguous manner: C stores multi-dimensional arrays in row-major order, so accessing a[0][0] and then a[1][0] means that we jump in main memory and cause a large number of cache misses.

Memory layout of multi-dimensional arrays
[Figure: a 2-D matrix stored row by row (memory layout in C) vs. column by column (memory layout in Fortran).]
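A short, self-contained illustration of the row-major argument (array size and names are arbitrary): consecutive elements of a row are adjacent in memory, while consecutive elements of a column are a full row apart.

    #include <stdio.h>

    #define N 4

    int main(void)
    {
        double a[N][N];
        char *base = (char *) &a[0][0];

        /* In C, a[i][j] and a[i][j+1] are adjacent in memory (row-major),
           while a[i][j] and a[i+1][j] are a full row of N doubles apart. */
        printf("offset of a[0][1]: %zu bytes\n",
               (size_t)((char *) &a[0][1] - base));
        printf("offset of a[1][0]: %zu bytes\n",
               (size_t)((char *) &a[1][0] - base));
        return 0;
    }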

B = A^T: block column-wise data distribution, alternative algorithm
- Each process in reality sends N_local*N_local elements to every other process.
- So send an entire block of N_local*N_local elements at once.
- The block has to be transposed either at the sender or at the receiver.

    // Send the matrix block-by-block; the destination rank and the
    // request index are derived from the block number i/N_local
    for ( i=0; i<N; i+=N_local ) {
        MPI_Isend ( &(a[i][0]), N_local*N_local, MPI_DOUBLE, i/N_local, 0,
                    comm, &reqs[2*(i/N_local)]);
        MPI_Irecv ( &(b[i][0]), N_local*N_local, MPI_DOUBLE, i/N_local, 0,
                    comm, &reqs[2*(i/N_local)+1]);
    }
    MPI_Waitall ( 2*size, reqs, stats);

    // Now transpose each block in place
    for ( i=0; i<N; i+=N_local ) {
        for ( k=0; k<N_local; k++ ) {
            for ( j=k; j<N_local; j++ ) {
                temp      = b[i+k][j];
                b[i+k][j] = b[i+j][k];
                b[i+j][k] = temp;
            }
        }
    }

B = A^T: other 1-D data distributions
- Block row-wise data distribution: algorithm very similar to the block column-wise data distribution.
- Cyclic column-wise data distribution: the process with rank r gets the columns r, r+np, r+2*np, etc. Advantage: none for the matrix transpose operation; for some other operations this data distribution often leads to a better load balance than the block column-wise distribution (a small sketch follows below).
- Cyclic row-wise data distribution.
- Block-cyclic column-wise data distribution.
- Block-cyclic row-wise data distribution.

B = A^T: 2-D data distribution
- Each process holds a block of N_local*N_local elements.
- The 2-D distribution avoids skinny matrices; it is often easier to create load balance than with a 1-D block column/row distribution.
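The sketch referenced above prints which global columns each rank owns under the cyclic column-wise distribution; N, np and the variable names are illustrative assumptions:

    #include <stdio.h>

    /* Rank r owns columns r, r+np, r+2*np, ... under a cyclic
       column-wise distribution. */
    int main(void)
    {
        int N = 12, np = 4;

        for (int r = 0; r < np; r++) {
            printf("rank %d owns columns:", r);
            for (int col = r; col < N; col += np)
                printf(" %d", col);
            printf("\n");
        }
        return 0;
    }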

B = A^T: 2-D data distribution
- Assumption: using a 2-D Cartesian communicator.
- Algorithm (a code sketch follows after the matrix-vector example below):
  - determine your rank using MPI_Comm_rank;
  - determine your coordinates using MPI_Cart_coords;
  - determine the coordinates of your communication partner by reverting the x and y coordinates of your own coordinates;
  - determine the rank of your communication partner using MPI_Cart_rank;
  - send a block of N_local*N_local elements to the communication partner;
  - receive a block of N_local*N_local elements from the communication partner;
  - transpose the block that has been received.
- The algorithm combines techniques from the one-element-per-process distribution and the block column-wise distribution.

c = A * b: block row-wise distribution, replicating the vector b

    double a[nlocal][n], b[n];
    double c[nlocal], cglobal[n];
    int i, j;

    for (i=0; i<nlocal; i++) {
        for ( j=0; j<n; j++ ) {
            c[i] = c[i] + a[i][j]*b[j];
        }
    }
    MPI_Allgather( c, nlocal, MPI_DOUBLE, cglobal, nlocal,
                   MPI_DOUBLE, MPI_COMM_WORLD );
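The following is a minimal sketch of the 2-D distributed transpose outlined in the algorithm list above. It assumes the number of processes forms a square 2-D grid and that each process holds one NLOCAL x NLOCAL block, and it uses MPI_Sendrecv in place of the Isend/Irecv pair for brevity, so it is an illustration rather than the exact code of the slides.

    #include <mpi.h>

    #define NLOCAL 4   /* illustrative local block size */

    /* Exchange the local block with the process at the mirrored (y,x)
       coordinates and transpose the received block locally. */
    void transpose_block_2d(double a[NLOCAL][NLOCAL], double b[NLOCAL][NLOCAL],
                            MPI_Comm cartcomm)
    {
        int rank, rem_rank, coords[2], rem_coords[2];
        double temp[NLOCAL][NLOCAL];

        MPI_Comm_rank(cartcomm, &rank);              /* own rank */
        MPI_Cart_coords(cartcomm, rank, 2, coords);  /* own coordinates */

        rem_coords[0] = coords[1];                   /* partner = (y,x) */
        rem_coords[1] = coords[0];
        MPI_Cart_rank(cartcomm, rem_coords, &rem_rank);

        /* Send my block, receive the partner's block */
        MPI_Sendrecv(a, NLOCAL*NLOCAL, MPI_DOUBLE, rem_rank, 0,
                     temp, NLOCAL*NLOCAL, MPI_DOUBLE, rem_rank, 0,
                     cartcomm, MPI_STATUS_IGNORE);

        /* Transpose the received block */
        for (int i = 0; i < NLOCAL; i++)
            for (int j = 0; j < NLOCAL; j++)
                b[i][j] = temp[j][i];
    }

    int main(int argc, char **argv)
    {
        int size, dims[2] = {0, 0}, periods[2] = {0, 0};
        MPI_Comm cartcomm;
        double a[NLOCAL][NLOCAL] = {{0}}, b[NLOCAL][NLOCAL];

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Dims_create(size, 2, dims);   /* assumes size is a perfect square */
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cartcomm);

        transpose_block_2d(a, b, cartcomm);

        MPI_Comm_free(&cartcomm);
        MPI_Finalize();
        return 0;
    }

The MPI_Sendrecv call plays the same role as the Isend/Irecv/Waitall sequence used in the one-element-per-process version; processes on the main diagonal again exchange data with themselves.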

c = A * b: block row-wise distribution
Why replicate the vector?
- The memory requirement is O(N), with N being the size of the vector, in contrast to O(N^2) for the matrix or other higher-dimensional arrays.
- It increases the performance of the matrix-vector multiply operation.
Why do we need the Allgather at the end?
- Most applications require uniform treatment of similar objects: e.g. if one vector is replicated, all vectors should be replicated.
- If the result vector is used in a subsequent operation, you would otherwise need different implementations in the code depending on whether the vector is distributed or replicated.

c = A * b: block column-wise distribution

    int main( int argc, char **argv) {
        double a[n][nlocal], b[nlocal];
        double c[n], ct[n];
        int i, j;

        for (i=0; i<n; i++) {
            for ( j=0; j<nlocal; j++ ) {
                ct[i] = ct[i] + a[i][j]*b[j];
            }
        }
        MPI_Allreduce ( ct, c, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD );
    }

c = A * b: block column-wise distribution
Why not replicate the vector in this distribution?
- There is no benefit in doing that for this operation.
- There might be other operations in the code that mandate it, though.
But the result vector is replicated?
- Sure, the algorithm mandates that.
- You can still drop the elements that you don't need afterwards.

C = A * B: matrix-matrix multiply, block column-wise data distribution
- Example for 2 processes and a 4x4 matrix.
- The example uses global indices for the matrices.
[Figure: matrices A, B and C, each split into a block of columns for rank=0 and a block for rank=1.]

C = A * B: block column-wise distribution
1st step: each process calculates a part of the result elements and stores it in matrix C.
[Figure: partial sums for all 16 elements of the 4x4 result matrix.]

C = A * B: block column-wise distribution
2nd step: processes 0 and 1 swap their portions of matrix A; matrices B and C are unchanged.
[Figure: the two block columns of matrix A after the swap between rank=0 and rank=1.]

C = A * B: block column-wise distribution
Finish the matrix multiply operation.
[Figure: the completed sums for all 16 elements of the 4x4 result matrix.]

C = A * B: block column-wise distribution
Generalization for np processes.
[Figure: A, B and C split into block columns 0 .. np-1; in iteration it = 0, 1, ..., np-1 each process multiplies the block column of A it currently holds with its local columns of B and accumulates into its local columns of C, while the block columns of A rotate among the processes.]

C = A * B: block column-wise distribution
- np-1 steps are required to give every process access to the entire matrix A.
- The algorithm does not require a process to hold the entire matrix A at any point in time.
- A final shift operation is required in order for each process to get its original portion of matrix A back.
- Communication between the processes often uses a ring, e.g. process x sends to x-1 and receives from x+1, with a special case for processes 0 and np-1.
- A temporary buffer is needed if the matrix is sent and received simultaneously.

    // nlocalcols: no. of columns held by a process
    // nrows:      no. of rows of the matrix
    // np:         number of processes
    // rank:       rank of this process
    sendto   = rank-1;
    recvfrom = rank+1;
    if ( rank == 0 )    sendto   = np-1;
    if ( rank == np-1 ) recvfrom = 0;

    MPI_Isend ( a,    nrows*nlocalcols, MPI_DOUBLE, sendto,   0, comm, &req[0]);
    MPI_Irecv ( temp, nrows*nlocalcols, MPI_DOUBLE, recvfrom, 0, comm, &req[1]);
    MPI_Waitall ( 2, req, statuses );

    // Copy the data from the temporary buffer back into a
    memcpy (a, temp, nrows*nlocalcols*sizeof(double));

C = A * B: block column-wise distribution
- A mapping of global to local indices is required, since a C data structure can not start at an arbitrary value but has to start at index 0.
- We need to know from which process the currently held portion of A originates in order to know which elements of matrix B to use.
- The mapping depends on the direction of the ring communication.

    // nlocalcols: no. of columns held by a process
    // nrows:      no. of rows of the matrix
    // np:         number of processes
    // rank:       rank of this process
    for ( it=0; it < np; it++ ) {
        offset = ((rank+it)%np) * nlocalcols;
        for (i=0; i<nrows; i++) {
            for ( j=0; j<nlocalcols; j++ ) {
                for (k=0; k<nlocalcols; k++) {
                    C[i][j] += a[i][k] * b[offset+k][j];
                }
            }
        }
        // Communication as shown on the previous slides
    }

C = A * B: block column-wise distribution
- Alternative communication pattern for the block column-wise distribution: in iteration it, the process with rank=it broadcasts its portion of matrix A to all processes.
- The mapping of global to local indices is a bit simpler.
- The communication costs are higher than for the ring communication.

C = A * B: block row-wise distribution
- Similar algorithm as for the block column-wise distribution, e.g. in the 1st step each process computes partial sums from its local rows.
[Figure: partial sums for the 4x4 example, split between rank=0 and rank=1.]
- The 2nd step is omitted here; the only difference to the block column-wise distribution is that matrix B is rotated among the processes, and the mapping of local to global indices is relevant for matrix A.
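A sketch of the broadcast-based variant, written as a code fragment in the same style as the ring version above; it assumes square matrices stored row-major as 1-D arrays with nrows = np * nlocalcols, and the function and variable names are illustrative assumptions:

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    /* Broadcast-based block column-wise matrix-matrix multiply sketch.
       a, b, c are the local block columns (nrows x nlocalcols each),
       stored row-major as 1-D arrays; atmp holds the broadcast block. */
    void matmul_bcast(const double *a, const double *b, double *c,
                      int nrows, int nlocalcols, int np, int rank, MPI_Comm comm)
    {
        double *atmp = malloc((size_t)nrows * nlocalcols * sizeof(double));

        for (int it = 0; it < np; it++) {
            /* Rank 'it' provides its portion of A to everybody */
            if (rank == it)
                memcpy(atmp, a, (size_t)nrows * nlocalcols * sizeof(double));
            MPI_Bcast(atmp, nrows * nlocalcols, MPI_DOUBLE, it, comm);

            /* The broadcast block holds the global columns starting at
               it*nlocalcols, which select the matching global rows of B */
            int offset = it * nlocalcols;
            for (int i = 0; i < nrows; i++)
                for (int j = 0; j < nlocalcols; j++)
                    for (int k = 0; k < nlocalcols; k++)
                        c[i*nlocalcols + j] +=
                            atmp[i*nlocalcols + k] * b[(offset + k)*nlocalcols + j];
        }
        free(atmp);
    }

Note how the global-to-local mapping becomes simpler: the offset depends only on the broadcasting rank it, not on the direction of a ring.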

C = A * B: 2-D data distribution
- Both matrices need to be rotated among the processes:
  - only processes holding a portion of the same rows of matrix A need to rotate amongst each other;
  - only processes holding a portion of the same columns of matrix B need to rotate amongst each other.

C = A * B: 2-D data distribution
- E.g. for the 2nd step, assuming a 2-D process topology:
  - matrix A is communicated in a ring to the left neighbor;
  - matrix B is communicated in a ring to the upper neighbor.

C = A * B: 2-D data distribution
Cannon's algorithm for square matrices:
- set up a 2-D process topology;
- determine nlocalcols and nlocalrows for each process;
- perform an initial shift operation such that each process can multiply its local submatrices (see next slide);
- for i = 0; i < number of processes in a row (or column); i++:
  - calculate the local part of the matrix-matrix multiply operation;
  - send the local portion of A to the left neighbor, receive the next portion of A from the right neighbor;
  - send the local portion of B to the upper neighbor, receive the next portion of B from the lower neighbor.

Initial assignment of matrices A and B (blocks labeled by their global block indices):

    A:  (0,0) (0,1) (0,2) (0,3)        B:  (0,0) (0,1) (0,2) (0,3)
        (1,0) (1,1) (1,2) (1,3)            (1,0) (1,1) (1,2) (1,3)
        (2,0) (2,1) (2,2) (2,3)            (2,0) (2,1) (2,2) (2,3)
        (3,0) (3,1) (3,2) (3,3)            (3,0) (3,1) (3,2) (3,3)

Initial shift of matrices A and B such that:
- matrix A is shifted by i processes to the left for processes in the i-th row of the process topology;
- matrix B is shifted by j processes up for processes in the j-th column of the process topology.

    A:  (0,0) (0,1) (0,2) (0,3)        B:  (0,0) (1,1) (2,2) (3,3)
        (1,1) (1,2) (1,3) (1,0)            (1,0) (2,1) (3,2) (0,3)
        (2,2) (2,3) (2,0) (2,1)            (2,0) (3,1) (0,2) (1,3)
        (3,3) (3,0) (3,1) (3,2)            (3,0) (0,1) (1,2) (2,3)
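A skeleton of Cannon's algorithm following the steps above, written with MPI_Cart_shift and MPI_Sendrecv_replace; the assumption of a square process count, one NLOCAL x NLOCAL block per process, and a periodic Cartesian grid are mine, so this is an illustrative sketch rather than the slides' code:

    #include <mpi.h>

    #define NLOCAL 4   /* illustrative local block size */

    /* Cannon's algorithm skeleton: one NLOCAL x NLOCAL block of A, B and C
       per process on a square q x q Cartesian grid with periodic boundaries. */
    void cannon(double a[NLOCAL][NLOCAL], double b[NLOCAL][NLOCAL],
                double c[NLOCAL][NLOCAL], MPI_Comm grid, int q)
    {
        int rank, coords[2], src, dst;

        MPI_Comm_rank(grid, &rank);
        MPI_Cart_coords(grid, rank, 2, coords);

        /* Initial skew: row i of the grid shifts its A block i steps left,
           column j of the grid shifts its B block j steps up */
        if (coords[0] != 0) {
            MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
            MPI_Sendrecv_replace(a, NLOCAL*NLOCAL, MPI_DOUBLE, dst, 0, src, 0,
                                 grid, MPI_STATUS_IGNORE);
        }
        if (coords[1] != 0) {
            MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
            MPI_Sendrecv_replace(b, NLOCAL*NLOCAL, MPI_DOUBLE, dst, 0, src, 0,
                                 grid, MPI_STATUS_IGNORE);
        }

        for (int it = 0; it < q; it++) {
            /* Local part of the matrix-matrix multiply */
            for (int i = 0; i < NLOCAL; i++)
                for (int j = 0; j < NLOCAL; j++)
                    for (int k = 0; k < NLOCAL; k++)
                        c[i][j] += a[i][k] * b[k][j];

            /* Send A one step left / receive from the right,
               send B one step up / receive from below */
            MPI_Cart_shift(grid, 1, -1, &src, &dst);
            MPI_Sendrecv_replace(a, NLOCAL*NLOCAL, MPI_DOUBLE, dst, 0, src, 0,
                                 grid, MPI_STATUS_IGNORE);
            MPI_Cart_shift(grid, 0, -1, &src, &dst);
            MPI_Sendrecv_replace(b, NLOCAL*NLOCAL, MPI_DOUBLE, dst, 0, src, 0,
                                 grid, MPI_STATUS_IGNORE);
        }
    }

    int main(int argc, char **argv)
    {
        int size, dims[2] = {0, 0}, periods[2] = {1, 1};
        MPI_Comm grid;
        double a[NLOCAL][NLOCAL] = {{0}}, b[NLOCAL][NLOCAL] = {{0}},
               c[NLOCAL][NLOCAL] = {{0}};

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Dims_create(size, 2, dims);   /* assumes a square process count */
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &grid);

        cannon(a, b, c, grid, dims[0]);

        MPI_Comm_free(&grid);
        MPI_Finalize();
        return 0;
    }

Compared with the 1-D ring algorithm, each process only ever communicates within its row of the grid (for A) and within its column (for B), which is exactly the restriction stated on the previous slide.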