COSC 6374 Parallel Computation: Communication Performance Modeling (II). Edgar Gabriel, Fall 2015.


Overview

- Impact of communication costs on speedup
- Cartesian stencil communication
- All-to-all communication
- Impact of collective communication operations
- Use case: broadcast-based matrix multiply operations for dense matrices
- Double buffering and non-blocking collective operations
- Implementation aspects

Cartesian stencil communication

Assumptions:
- 2-D Cartesian grid; all processes have 4 communication partners
- Communication occurs sequentially: 4 sends + 4 receives
- Problem size: N*N; message length to each neighbor: N/sqrt(p)

Using Hockney's model, with latency l and bandwidth b:

    T_comm(N, p) = 8 * ( l + N / (b * sqrt(p)) )

Cartesian stencil communication (II)

Speedup: S = T(1) / T(p), where p is the number of processes and T(p) = T_comp + T_comm. T_comp scales with the number of processes, and T(1) = T_comp, since a single process performs no communication. Thus:

    S_stencil = T_comp / ( T_comp / p + 8 * ( l + N / (b * sqrt(p)) ) )
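The speedup formula above is easy to evaluate numerically. Below is a minimal sketch (not part of the original lecture) that plugs network parameters into the stencil model; the problem size, the T_comp value, and the function name speedup_stencil are assumptions made for this illustration. The network parameters match the table in the next section.

    #include <math.h>
    #include <stdio.h>

    /* Modeled stencil speedup: S = T_comp / (T_comp/p + 8*(l + N/(b*sqrt(p))))
       with latency l in seconds and bandwidth b in bytes per second. */
    static double speedup_stencil(double t_comp, double N, double p,
                                  double l, double b)
    {
        return t_comp / (t_comp / p + 8.0 * (l + N / (b * sqrt(p))));
    }

    int main(void)
    {
        const double N = 1.0e6;      /* assumed message-volume parameter    */
        const double t_comp = 100.0; /* assumed sequential compute time [s] */

        for (double p = 4.0; p <= 4096.0; p *= 4.0) {
            printf("p = %4.0f   GigE: %8.1f   QDR IB: %8.1f\n", p,
                   speedup_stencil(t_comp, N, p, 50.0e-6, 5.0e6),
                   speedup_stencil(t_comp, N, p, 1.0e-6, 5.0e9));
        }
        return 0;
    }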

2-D stencil communication

              Gigabit Ethernet   QDR InfiniBand
  Latency     50 µs              1 µs
  Bandwidth   5 MB/s             5 GB/s

[Graph omitted: modeled stencil speedup over the process count for both networks, assuming a fixed T_comp and one stencil communication every 2 s.]

All-to-all communication

Assumptions:
- p processes
- Problem size: N; problem size per process: N/p
- Message length per process pair: N/p^2
- Each process sends p messages and receives p messages, sequentially
- Congestion and concurrency of messages are not considered

Communication costs using Hockney's model:

    T_comm(N, p) = 2p * ( l + N / (b * p^2) ) = 2*p*l + 2*N / (b*p)

All-to-all communication (II)

Speedup: S = T(1) / T(p), where p is the number of processes and T(p) = T_comp + T_comm. As before, T_comp scales with the number of processes and T(1) = T_comp. Thus:

    S_alltoall = T_comp / ( T_comp / p + 2*p*l + 2*N / (b*p) )

              Gigabit Ethernet   QDR InfiniBand
  Latency     50 µs              1 µs
  Bandwidth   5 MB/s             5 GB/s

[Graph omitted: modeled all-to-all speedup over the process count for both networks, assuming a fixed T_comp and one all-to-all communication every 2 s.]
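The all-to-all model can be evaluated the same way; here is a one-function companion to the stencil sketch above, under the same assumed parameter conventions. Note that the 2*p*l latency term grows with p, so the modeled speedup peaks and then declines:

    /* Modeled all-to-all speedup:
       S = T_comp / (T_comp/p + 2*p*l + 2*N/(b*p)) */
    static double speedup_alltoall(double t_comp, double N, double p,
                                   double l, double b)
    {
        return t_comp / (t_comp / p + 2.0 * p * l + 2.0 * N / (b * p));
    }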

Conclusions

- Communication costs limit the speedup: dramatically for all-to-all, moderately for stencil communication.
- The impact of stencil communication gets worse when increasing the number of operations per second.
- How to increase the scalability of the application? Communication costs have to go to zero -> non-blocking communication.
- Non-blocking collective operations were introduced in MPI-3: a high-level abstraction separating functionality from implementation, whose non-blocking execution allows hiding communication costs.

Collective operations

- Offer a higher-level abstraction for frequently occurring communication patterns.
- Separate the desired data movement from its actual implementation.
- Allow for numerous optimizations internally:
  - E.g., O(p) vs. O(log(p)) algorithms; O(p) linear algorithms are often found in applications that do not use collective operations, simply because they are easy to write.
  - Hardware-topology-based algorithms: minimize utilization of the lowest-performing network connections.
  - Hardware-based optimizations: some network cards have built-in support for some collective operations.

O(p) vs. O(log(p)) collective algorithms

[Graph omitted: estimated execution time of a Bcast operation with 1 MB message length, using the QDR InfiniBand parameters.]

Conclusions

- Collective operations are essential for the scalability of applications at large process counts.
- They simplify code maintenance and readability.
- They reduce communication costs compared to the (trivial) linear algorithms often used by applications that do not use collective operations.
- But communication costs are still not zero.
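To make the O(p) vs. O(log(p)) difference concrete, here is a back-of-the-envelope estimate using Hockney's model with the QDR InfiniBand parameters from above (the process count of 1024 is an assumed example, not a value from the lecture). Sending one 1 MB message costs roughly l + m/b = 1 µs + 1 MB / (5 GB/s) ≈ 201 µs. A linear broadcast, in which the root sends to each of the p - 1 = 1023 other processes in turn, therefore takes about 1023 * 201 µs ≈ 206 ms, while a binomial-tree broadcast completes in ceil(log2(1024)) = 10 rounds, i.e. about 10 * 201 µs ≈ 2 ms: a difference of roughly two orders of magnitude.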

Non-blocking broadcast operation

    MPI_Ibcast(void *buf, int cnt, MPI_Datatype dtype, int root,
               MPI_Comm comm, MPI_Request *req);

- The process with rank root distributes the data stored in buf to all other processes in the communicator comm.
- Initiates the collective communication; completion has to be enforced separately using MPI_Test or MPI_Wait.
- The MPI_Request handle allows to uniquely identify the currently ongoing operation.

Non-blocking collective operations

- The completion operation only indicates local completion; e.g., you do not know whether another process has finished that operation.
- Completion of a non-blocking collective operation does not imply completion of another non-blocking individual or collective operation posted earlier.
- Multiple non-blocking collective operations can be outstanding on a single communicator.
- Unlike point-to-point operations, non-blocking collective operations do not match with blocking collective operations.
- All processes must call collective operations (blocking and non-blocking) in the same order per communicator.
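A minimal, self-contained usage sketch of MPI_Ibcast (not from the slides); the buffer contents and the placeholder comment for overlapped work are assumptions for the illustration:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, buf[4] = {0, 0, 0, 0};
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {                      /* root fills the buffer */
            for (int i = 0; i < 4; i++) buf[i] = i + 1;
        }

        /* Initiate the broadcast; the call returns immediately on all ranks. */
        MPI_Ibcast(buf, 4, MPI_INT, 0, MPI_COMM_WORLD, &req);

        /* ... computation that does not touch buf may overlap here ... */

        /* Enforce (local) completion before reading buf. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);

        printf("rank %d: buf[3] = %d\n", rank, buf[3]);
        MPI_Finalize();
        return 0;
    }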

Non-blocking operations: progression

Problem: how do we ensure that non-blocking operations (individual or collective) continue execution in the background? Two options:
- Using a progress thread: a separate thread executes the non-blocking operation in a blocking manner.
  - Problem: the number of threads can become large (e.g., 1,000 non-blocking send operations to different processes).
  - Problem: communication through a separate thread increases the network latency (thread synchronization etc.).
- Regularly invoking the progress engine: send a large message chunk by chunk but avoid blocking; progress is only made inside the MPI library.

Use case: dense matrix multiply

- Block-column-wise data distribution (height * lwidth elements per process).
- Assuming square matrices and an even data distribution.
- Example: 4x4 matrices on 2 processes.

[Figure omitted: matrices A, B and C of height 4, each split into two block columns of width lwidth, owned by rank 0 and rank 1.]

Broadcast-based parallel dense matrix multiply

In iteration i, the process with rank i broadcasts its portion of the matrix A to all processes.

[Figure omitted: iterations 0 and 1 of the 4x4 example; in iteration 0 rank 0's block column of A is broadcast to both ranks, in iteration 1 rank 1's.]

Overall code structure:

    double A[height][lwidth], B[height][lwidth];
    double C[height][lwidth], tmp[height][lwidth];
    int i, size, rank;

    MPI_Comm_size(comm, &size);
    MPI_Comm_rank(comm, &rank);

    for (i = 0; i < size; i++) {
        if (rank == i) {
            memcpy(tmp, A, lwidth * height * sizeof(double));
        }
        MPI_Bcast(tmp, lwidth * height, MPI_DOUBLE, i, comm);
        localmatmul(tmp, B, C, i);
    }

Use case: dense matrix multiply (II)

Broadcast-based algorithm: in iteration it, the process with rank it broadcasts its portion of matrix A to all processes; every process then multiplies its portion of matrix B with the current block of matrix A:

    void localmatmul(double **X, double **Y, double **Z, int it)
    {
        int i, j, k;
        for (i = 0; i < height; i++)
            for (j = 0; j < lwidth; j++)
                for (k = 0; k < lwidth; k++) {
                    Z[i][j] += X[i][k] * Y[it*lwidth + k][j];
                }
    }

Overall code structure:
- Alternating sequences of communication and computation.
- The MPI_Bcast in every iteration has a different root process.
- Communication uses collective operations, which allows for internal optimization.
- For a fixed problem size, the speedup will be limited by the communication costs.
- Not shown: initialization of C to zero.

Double buffering

- Use two buffers for the same variable: one used for computation, one used for communication.
- Allows overlapping communication and computation; note that you are not allowed to touch the communication buffer while a non-blocking operation is ongoing.
- Buffer pointers are used to alternate the communication and the computation buffer in each iteration.
- Avoids an additional memcpy operation.

Revised code structure:

    double tmp[height][lwidth], tmp2[height][lwidth];
    double **commbuf = tmp, **compbuf = tmp2;
    MPI_Request req;

    if (rank == 0) {
        memcpy(compbuf, A, lwidth * height * sizeof(double));
    }
    MPI_Bcast(compbuf, lwidth * height, MPI_DOUBLE, 0, comm);

    for (i = 1; i < size; i++) {
        if (rank == i) {
            memcpy(commbuf, A, lwidth * height * sizeof(double));
        }
        MPI_Ibcast(commbuf, lwidth * height, MPI_DOUBLE, i, comm, &req);

        localmatmul(compbuf, B, C, i - 1);    // Perform computations

        MPI_Wait(&req, MPI_STATUS_IGNORE);    // Wait for completion

        double **tpoint = commbuf;            // SWAP(commbuf, compbuf):
        commbuf = compbuf;                    // swap pointers
        compbuf = tpoint;
    }
    localmatmul(compbuf, B, C, i - 1);        // computation of the last iteration

Revised code structure (II)

Double buffering:
- Overlaps two iterations in the execution.
- Requires an additional buffer.
- Complicates the code (marginally).
- Communication of iteration 0 can be blocking: we cannot continue before it finishes anyway.
- Computation is always performed on the data of iteration i-1; the computation of the last iteration happens after the loop.

Remember the overlap problem? We need to modify localmatmul as well, so that the MPI library can make progress on the outstanding broadcast:

    void localmatmul(double **X, double **Y, double **Z, int it,
                     MPI_Request *req)
    {
        int i, j, k, flag;
        for (i = 0; i < height; i++) {
            for (j = 0; j < lwidth; j++) {
                for (k = 0; k < lwidth; k++) {
                    Z[i][j] += X[i][k] * Y[it*lwidth + k][j];
                }
            }
            if ((i % 4) == 0)
                MPI_Test(req, &flag, MPI_STATUS_IGNORE);
        }
    }

Performance results

- Comparison of the blocking and the non-blocking version for two different matrix sizes.
- Tests performed on the crill cluster at the University of Houston (16 nodes, 48 cores per node, 2 DDR IB HCAs per node).

[Graphs omitted: execution times of the blocking and non-blocking versions for both matrix sizes.]

Implementation aspects: LibNBC

- Implements non-blocking versions of all MPI collective operations.
- Schedule-based design: a process-local schedule of point-to-point operations is created.

[Figure omitted: binomial broadcast tree over seven processes (ranks 0-6), rooted at rank 0; rank 1 receives from rank 0 and then forwards to ranks 3 and 5.]

Pseudocode for the schedule at rank 1:

    NBC_Sched_recv(buf, cnt, dt, 0, sched);
    NBC_Sched_barr(sched);
    NBC_Sched_send(buf, cnt, dt, 3, sched);
    NBC_Sched_barr(sched);
    NBC_Sched_send(buf, cnt, dt, 5, sched);

See http://www.unixer.de/publications/img/hoefler-hlrs-nbc.pdf for more details.
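The NBC_Sched_* calls above append entries to a process-local schedule that is later executed round by round. The following sketch only illustrates that idea: the struct layout, the enum, and the function sched_run are invented for this example and are not LibNBC's actual internals (a real implementation advances the state machine incrementally from NBC_Test/NBC_Wait instead of running to completion):

    #include <mpi.h>

    enum op_kind { OP_SEND, OP_RECV, OP_BARR };

    struct sched_entry {
        enum op_kind kind;
        int peer;                /* partner rank; unused for OP_BARR */
    };

    /* Execute a schedule. Entries between OP_BARR markers form a round;
     * all operations of a round are started non-blocking, and the barrier
     * is crossed only once they have completed locally. */
    static void sched_run(const struct sched_entry *sched, int n,
                          void *buf, int cnt, MPI_Datatype dt, MPI_Comm comm)
    {
        MPI_Request reqs[16];    /* assumed upper bound per round */
        int i, nreq = 0;

        for (i = 0; i <= n; i++) {
            if (i == n || sched[i].kind == OP_BARR) {
                MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
                nreq = 0;
            } else if (sched[i].kind == OP_SEND) {
                MPI_Isend(buf, cnt, dt, sched[i].peer, 0, comm, &reqs[nreq++]);
            } else {
                MPI_Irecv(buf, cnt, dt, sched[i].peer, 0, comm, &reqs[nreq++]);
            }
        }
    }

With this encoding, the schedule shown above for rank 1 is simply { {OP_RECV, 0}, {OP_BARR, 0}, {OP_SEND, 3}, {OP_BARR, 0}, {OP_SEND, 5} }.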

Implementation aspects: LibNBC (II)

- The schedule execution is represented as a state machine.
- State and schedule are attached to every request.
- Schedules might be cached and reused.
- Progress is most important for efficient overlap; progression happens in NBC_Test/NBC_Wait.

Other non-blocking collective operations

- Virtually all collective communication operations from MPI-1 have a non-blocking counterpart in MPI-3, e.g. MPI_Iscatter, MPI_Igather, MPI_Ireduce, MPI_Iallreduce, MPI_Ialltoall.
- New topology collective communications, e.g. MPI_Ineighbor_alltoall, MPI_Ineighbor_allgather (see the sketch below).
- Some communicator constructors, e.g. MPI_Comm_idup.
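As an illustration of the neighborhood collectives, here is a minimal non-blocking halo exchange on a 2-D periodic Cartesian grid (not from the slides); the grid setup and the one-double-per-neighbor message size are assumptions for the example:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int size, dims[2] = {0, 0}, periods[2] = {1, 1};
        double sendbuf[4] = {0}, recvbuf[4];   /* one value per neighbor */
        MPI_Comm cart;
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Dims_create(size, 2, dims);
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);

        /* Exchange one double with each of the 4 Cartesian neighbors. */
        MPI_Ineighbor_alltoall(sendbuf, 1, MPI_DOUBLE,
                               recvbuf, 1, MPI_DOUBLE, cart, &req);

        /* ... computation independent of the halo may overlap here ... */

        MPI_Wait(&req, MPI_STATUS_IGNORE);

        MPI_Comm_free(&cart);
        MPI_Finalize();
        return 0;
    }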

Other non-blocking collective operations (II)

Collective operations which do not have a non-blocking counterpart:
- Some communicator creators: there is no MPI_Comm_icreate or MPI_Comm_isplit, and no MPI_Cart_icreate or MPI_Dist_graph_icreate.

Collective file I/O operations, on the other hand, do have non-blocking variants: there is MPI_File_iread_all, MPI_File_iwrite_all, etc.
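Overlapping a collective write follows the same request/wait pattern as the other non-blocking collectives. Here is a minimal sketch using the explicit-offset variant MPI_File_iwrite_at_all, so that each rank writes its own disjoint file region; the file name and buffer size are assumptions for the illustration:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        double buf[1024] = {0};      /* assumed data to write */
        MPI_File fh;
        MPI_Request req;
        MPI_Offset off;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_File_open(MPI_COMM_WORLD, "out.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Each rank writes 1024 doubles at a rank-dependent offset. */
        off = (MPI_Offset)rank * 1024 * sizeof(double);
        MPI_File_iwrite_at_all(fh, off, buf, 1024, MPI_DOUBLE, &req);

        /* ... independent computation ... */

        MPI_Wait(&req, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }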