Introduction to Lab Series DMS & MPI


Kelly Robbins
6 years ago

TDDC 78 Labs: Memory-based Taxonomy

Introduction to Lab Series DMS & MPI
Mikhail Chalabine
Linköping University

Memory        Lab(s)   Use
Distributed   1        MPI
Shared        2, 3     Posix threads, OpenMP
Distributed   4        MPI

LAB 5 (tools): use at every stage. It saves your time.

Goals

- Shared- and distributed-memory systems
- Programming parallelism (typical problems)
- Approach and solve:
    - Partitioning
        - Domain decomposition
        - Functional decomposition
    - Communication
    - Agglomeration
    - Mapping
    - Load balancing

Compendium

- Your primary source of information
- Comprehensive:
    - Environment description
    - Lab specification
    - Step-by-step instructions

MPI intro (1)

- Message Passing Interface (MPI): a standard for programming parallel processors under the message-passing paradigm
- Processors exchange messages
    - Point-to-point and collective communication
- Low-level (explicit) programming of parallelism
    - Efficient, but complex and error-prone to implement
    - Communication details
    - Processor topologies
- Portable and language independent
- Widely used in practice
    - Accepted by industry
    - Available on virtually all platforms

MPI intro (2)

Running a program:

    mpirun -np <nn> <program> <args>

Structure:

    ierr = MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &nproc );
    MPI_Comm_rank( MPI_COMM_WORLD, &iproc );
    // Program code
    printf( "Hello from proc %d\n", iproc );
    MPI_Finalize();


Learn about MPI

LAB-1:
- Define types
- Send / Receive
- Broadcast
- Scatter / Gather

LAB-4:
- Use virtual topologies
- MPI_Issend / MPI_Probe / MPI_Reduce
- Sending larger pieces of data
- Synchronize / MPI_Barrier

Lab-1 TDDC78: Image Filters with MPI

Blur & Threshold

- See the compendium for details
- Your goal is to understand:
    - Define types
    - Send / Receive
    - Broadcast
    - Scatter / Gather
- For syntax and examples refer to the lecture slides and the examples below
- Decompose domains
- Apply the filter in parallel

MPI Types Example

    typedef struct {
        int id;
        double data[10];
    } buf_t;                                  // Composite type

    buf_t item;                               // Element of the type
    MPI_Datatype buf_t_mpi;                   // MPI type to commit
    int block_lengths[] = { 1, 10 };          // Lengths of type elements
    MPI_Datatype block_types[] = { MPI_INT, MPI_DOUBLE };  // Set types
    MPI_Aint start, displ[2];

    MPI_Address( &item, &start );
    MPI_Address( &item.id, &displ[0] );
    MPI_Address( &item.data[0], &displ[1] );

    displ[0] -= start;   // Displacement relative to address of start
    displ[1] -= start;   // Displacement relative to address of start

    MPI_Type_struct( 2, block_lengths, displ, block_types, &buf_t_mpi );
    MPI_Type_commit( &buf_t_mpi );

Note: MPI_Address and MPI_Type_struct were removed in MPI-3.0. Newer MPI installations provide MPI_Get_address and MPI_Type_create_struct, which take the same arguments.

MPI Send-Receive

    message_t message = create_message( my_id );
    MPI_Send( &message, sizeof( message_t ), MPI_BYTE,
              (my_id == 0) ? 1 : 0, 0, MPI_COMM_WORLD );

    MPI_Status status;
    MPI_Recv( &message, sizeof( message_t ), MPI_BYTE,
              (my_id == 0) ? 1 : 0, 0, MPI_COMM_WORLD, &status );
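The displacement arithmetic in the types example can be cross-checked without MPI. The following is a minimal sketch in plain C, assuming the buf_t layout from the slide; the helper name buf_t_displacements is hypothetical and not part of the lab code. It uses the standard offsetof macro to compute the same byte offsets that the example obtains by subtracting addresses.

```c
#include <assert.h>
#include <stddef.h>

/* Composite type from the slide. */
typedef struct {
    int id;
    double data[10];
} buf_t;

/* Hypothetical helper: fills displ[0] and displ[1] with the byte offsets of
   the two blocks (id, data) relative to the start of a buf_t. These are the
   values the slide computes with displ[i] -= start. */
void buf_t_displacements(ptrdiff_t displ[2])
{
    assert(displ != NULL);
    displ[0] = (ptrdiff_t)offsetof(buf_t, id);
    displ[1] = (ptrdiff_t)offsetof(buf_t, data);
}
```

The first offset is always 0. The second depends on the compiler's padding after id, which is why the displacements are computed rather than hard-coded.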

Send-Receive Modes (1)

SEND          BLOCKING     NON-BLOCKING
Standard      MPI_Send     MPI_Isend
Synchronous   MPI_Ssend    MPI_Issend
Buffered      MPI_Bsend    MPI_Ibsend
Ready         MPI_Rsend    MPI_Irsend

RECEIVE       BLOCKING     NON-BLOCKING
              MPI_Recv     MPI_Irecv

Send-Receive Modes (2)

- Blocking send
    - Returns after the message is safely stored away
    - The sender is then free to access and overwrite the send buffer
    - MPI may copy the message into
        - the matching receive buffer, or
        - a temporary system buffer
    - Can complete as soon as the message is buffered
- Receive
    - May complete before the send completes
    - May complete only after the send has started
- Message buffering
    - Decouples send and receive
    - Can be expensive: additional memory-to-memory copying

Send-Receive Modes (3): Standard mode

- MPI decides whether outgoing messages are buffered
- MPI may buffer outgoing messages
    - The send may complete before a matching receive is posted
- Obs! Buffer space may be unavailable, or MPI may choose not to buffer for performance
    - The send will not complete until a matching receive is posted
- This makes the behavior implementation dependent (it depends on the run-time system)
- Starts whether or not a matching receive is posted
- Is non-local
    - Completion may depend on the occurrence of a matching receive
- Core rationale: correct programs do not rely on buffering in standard mode

Send-Receive Modes (4): Buffered mode

- MPI buffers sent messages
- Starts whether or not a matching receive is posted
- May complete before a matching receive is posted
- Sends messages in a non-blocking mode
- Unlike the standard mode, it is local
    - Completion is independent of a matching receive
- If no matching receive is posted, MPI must buffer
    - Error if there is insufficient buffer space
- Buffering may improve performance but does not affect the result


Send-Receive Modes (5): Synchronous mode

- Synchronous communication semantics (non-local)
    - If both send and receive are blocking, communication completes at both ends only after both processes rendezvous at the communication
- Control flow
    - The sender sends a request to send
    - The receiver permits it when a matching receive is posted
- Starts whether or not a matching receive is posted
- Completes if
    - a matching receive is posted, and
    - the receive operation has begun executing
- Completion indicates that
    - the send buffer can be reused
    - the receiver has reached a certain control point: it is executing the matching receive

Send-Receive Modes (6): Ready mode

- The message is sent as soon as possible
- Same semantics as
    - the standard send operation, or
    - the synchronous send operation
- The sender provides additional information to the system (namely, that a matching receive is already posted), which saves overhead
- In a correct program it can be replaced by a standard send
- On some systems
    - Removes the need for a hand-shake operation
    - Improves performance
- Starts only if a matching receive is posted
    - Otherwise it is erroneous and the outcome is undefined
- Completion is independent of the matching receive
    - It merely indicates that the send buffer can be reused

SR Modes: sense the difference (1)

    message_t message = create_message(iproc);
    message_t incoming;    // Separate receive buffer: the send buffer must
                           // not be touched until MPI_Wait returns
    MPI_Request request;
    MPI_Isend( &message, sizeof(message_t), MPI_BYTE,
               (iproc == 0) ? 1 : 0, 0, MPI_COMM_WORLD,
               &request );                     // Non-blocking send

    MPI_Status status;
    MPI_Recv( &incoming, sizeof(message_t), MPI_BYTE,
              (iproc == 0) ? 1 : 0, 0, MPI_COMM_WORLD,
              &status );                       // Receive

    // Synchronize sender & receiver
    MPI_Wait( &request, &status );

Non-blocking SEND: returns even if the message data have not been safely stored away, i.e., the message is neither buffered nor read.

SR Modes: sense the difference (2)

    message_t message = create_message(iproc);
    message_t incoming;    // Separate receive buffer
    MPI_Request request;   // MPI_Irecv takes a request, not a status
    MPI_Status status;
    MPI_Irecv( &incoming, sizeof(message_t), MPI_BYTE,
               (iproc == 0) ? 1 : 0, 0, MPI_COMM_WORLD,
               &request );                     // Non-blocking receive

    MPI_Send( &message, sizeof(message_t), MPI_BYTE,
              (iproc == 0) ? 1 : 0, 0, MPI_COMM_WORLD );

    MPI_Wait( &request, &status );             // Complete the receive

Typical Master-Slave (1)

    // The root sends jobs (synchronous mode)
    task_t task[nproc-1];
    MPI_Request request[nproc-1];
    for (int i = 1; i < nproc; i++) {
        MPI_Issend( &task[i-1], sizeof(task_t), MPI_BYTE,
                    i, 0, MPI_COMM_WORLD, &request[i-1] );  // One request per send
    }
    MPI_Status status[nproc-1];
    MPI_Waitall( nproc-1, request, status );

Typical Master-Slave (2)

    // Each CPU receives data to process
    result_t result[nproc-1];
    for (int j = 1; j < nproc; j++) {
        MPI_Status rstat;
        MPI_Probe( MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &rstat );
        int from = rstat.MPI_SOURCE;           // Who sent the pending message
        int data_size;
        MPI_Get_count( &rstat, MPI_DOUBLE, &data_size );
        result[from-1].buf = new double[data_size];
        MPI_Recv( result[from-1].buf, data_size, MPI_DOUBLE,
                  from, 0, MPI_COMM_WORLD, &rstat );
    }

Collective Communication (CC)

    // One processor
    MPI_Send( &message, sizeof(message_t), ... );
    // All the others
    MPI_Recv( &message, sizeof(message_t), ... );

All processors:

    MPI_Bcast( &message, sizeof(message_t), MPI_BYTE, 0, MPI_COMM_WORLD );

CC: Scatter & Gather

Distributing (unevenly sized) chunks of data:

    sendbuf = (int *) malloc( nproc * stride * sizeof(int) );
    displs  = (int *) malloc( nproc * sizeof(int) );
    scounts = (int *) malloc( nproc * sizeof(int) );
    for (i = 0; i < nproc; ++i) {
        displs[i]  = ...;
        scounts[i] = ...;
    }
    MPI_Scatterv( sendbuf, scounts, displs, MPI_INT,
                  rbuf, 100, MPI_INT, root, comm );
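The Scatterv slide leaves the loop body that fills displs and scounts open. One common way to fill them, shown here as a sketch and not as the lab's prescribed solution, is to give every rank the same number of rows and hand the leftover rows to the first ranks. The helper name fill_scatterv_layout and the parameter rows are assumptions made for this illustration; stride is taken to be the number of ints per row. The code is plain C with no MPI calls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: splits `rows` rows of `stride` ints each across
   `nproc` ranks. The first (rows % nproc) ranks get one extra row.
   Counts and displacements are expressed in elements (ints), which is the
   unit MPI_Scatterv uses when the send type is MPI_INT. */
void fill_scatterv_layout(int rows, int stride, int nproc,
                          int *scounts, int *displs)
{
    assert(rows >= 0 && stride > 0 && nproc > 0);
    assert(scounts != NULL && displs != NULL);

    int base = rows / nproc;    /* rows every rank gets */
    int extra = rows % nproc;   /* leftover rows, one each to the first ranks */
    int offset = 0;             /* running start position in sendbuf */

    for (int i = 0; i < nproc; ++i) {
        int my_rows = base + (i < extra ? 1 : 0);
        scounts[i] = my_rows * stride;
        displs[i] = offset;
        offset += scounts[i];
    }
}
```

For example, 10 rows of 4 ints on 3 ranks give counts of 16, 12, 12 and displacements of 0, 16, 28. On the receiving side, each rank's receive buffer must be large enough to hold scounts[rank] elements.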

Lab-4: Particles

- Moving particles
- Validate the pressure law
- Dynamic interaction patterns
    - The number of particles that fly across borders is not static
- You need advanced domain decomposition
    - Motivate your choice!

Process Topologies (0)

- By default processors are arranged into 1-dimensional arrays
- Processor ranks are computed accordingly
- What if processors need to communicate in 2 dimensions or more?
- Use virtual topologies to achieve a 2D instead of a 1D arrangement of processors, with convenient ranking schemes

Process Topologies (1)

    int dims[2];               // 2D matrix / grid
    dims[0] = 2;               // 2 rows
    dims[1] = 3;               // 3 columns
    MPI_Dims_create( nproc, 2, dims );

    int periods[2];
    periods[0] = 1;            // Row-periodic
    periods[1] = 0;            // Column-non-periodic
    int reorder = 1;           // Re-order allowed
    MPI_Comm grid_comm;
    MPI_Cart_create( MPI_COMM_WORLD, 2, dims, periods,
                     reorder, &grid_comm );
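To get a feeling for what a balanced 2D arrangement of processors looks like, the sketch below picks the factor pair of nproc that is closest to a square. This is only an illustration in plain C: it is not the algorithm an MPI library uses inside MPI_Dims_create, and the function name balanced_dims_2d is hypothetical. The result is ordered largest factor first, which matches the non-increasing order in which MPI_Dims_create returns its dimensions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: chooses dims[0] * dims[1] == nproc with the two
   factors as close to each other as possible, larger factor first. */
void balanced_dims_2d(int nproc, int dims[2])
{
    assert(nproc > 0 && dims != NULL);

    int best = 1;  /* largest divisor of nproc not exceeding sqrt(nproc) */
    for (int f = 1; f * f <= nproc; ++f) {
        if (nproc % f == 0) {
            best = f;
        }
    }
    dims[0] = nproc / best;
    dims[1] = best;
}
```

With 6 processors this yields a 3 x 2 grid, and with 16 a 4 x 4 grid. A prime count such as 7 degenerates to 7 x 1, which is one reason to choose the number of processes with the decomposition in mind.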

Process Topologies (2)

    int my_coords[2];          // Cartesian process coordinates
    int my_rank;               // Process rank
    int right_nbr[2];
    int right_nbr_rank;

    MPI_Cart_get( grid_comm, 2, dims, periods, my_coords );
    MPI_Cart_rank( grid_comm, my_coords, &my_rank );

    right_nbr[0] = my_coords[0] + 1;
    right_nbr[1] = my_coords[1];
    MPI_Cart_rank( grid_comm, right_nbr, &right_nbr_rank );

Learning goals

- Point-to-point communication
    - Probing / non-blocking send (choose)
- Barriers & Wait = synchronization
- Derived data types
- Collective communications
- Virtual topologies

Send/Receive modes

- Use with care to keep your code portable
- "It works there but not here!"

Summary

Labs at home? No problem.
- Simple to install
- Simple to use
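The neighbor lookup above relies on two properties of Cartesian topologies: ranks are numbered in row-major order, and an out-of-range coordinate wraps around only in a periodic dimension. The sketch below mimics that mapping in plain C so it can be tried without MPI. The function name grid_rank is hypothetical, and returning -1 for an invalid coordinate is a convention chosen for this illustration; in MPI itself, passing an out-of-range coordinate in a non-periodic dimension to MPI_Cart_rank is erroneous.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: row-major rank of coords in a dims[0] x dims[1] grid.
   An out-of-range coordinate is wrapped if its dimension is periodic;
   otherwise -1 is returned (illustrative convention, not MPI behavior). */
int grid_rank(const int dims[2], const int periods[2], const int coords[2])
{
    assert(dims != NULL && periods != NULL && coords != NULL);
    assert(dims[0] > 0 && dims[1] > 0);

    int c[2];
    for (int d = 0; d < 2; ++d) {
        c[d] = coords[d];
        if (c[d] < 0 || c[d] >= dims[d]) {
            if (!periods[d]) {
                return -1;
            }
            /* Wrap into [0, dims[d]), also for negative coordinates. */
            c[d] = ((c[d] % dims[d]) + dims[d]) % dims[d];
        }
    }
    return c[0] * dims[1] + c[1];
}
```

With the 2 x 3 grid and periods {1, 0} from the slides, the process at coordinates (1, 2) has rank 5, and its neighbor at (1 + 1, 2) wraps to (0, 2), which is rank 2. In real MPI code, MPI_Cart_shift returns such neighbor ranks directly.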
