COSC 6374 Parallel Computation. Remote Direct Memory Access

COSC 6374 Parallel Computation
Remote Direct Memory Access
Edgar Gabriel, Fall 2015

Communication Models
[Figure: three ways of moving a data item A on process P0 into a variable B on process P1 - Message Passing Model (P0 sends, P1 receives), Shared Memory Model (the assignment between A and B happens directly in the shared memory), Remote Memory Access (P0 puts A directly into the memory of P1).]

Data Movement
[Figure: memory, CPU and NIC on two nodes. In the Message Passing Model (two-sided communication) the CPUs on both sides are involved in moving the data; with Remote Memory Access (one-sided communication) the data moves from memory to memory through the NICs.]

Remote Direct Memory Access
Direct Memory Access (DMA) allows data to be moved directly between an attached device and the memory on the computer's motherboard. The CPU is freed from involvement with the data transfer, thus speeding up overall computer operation.
Remote Direct Memory Access (RDMA): two or more computers communicate directly from the main memory of one system to the main memory of another.

One-sided communication in MPI
MPI-2 defines one-sided communication:
- A process can put data into the main memory of another process (MPI_Put)
- A process can get data from the main memory of another process (MPI_Get)
- A process can perform an operation on a data item in the main memory of another process (MPI_Accumulate)
The target process is not actively involved in the communication.

RDMA in MPI
Problems to solve:
- How can a process define which parts of its main memory are available for RDMA?
- How can a process define when this part of its main memory is available for RDMA?
- How can a process define who is allowed to access its memory?
- How can a process define which elements in a remote memory it wants to access?

The window concept of MPI-2 (I)
MPI_Win_create (void *base, MPI_Aint size, int disp_unit, MPI_Info info, MPI_Comm comm, MPI_Win *win);
An MPI_Win defines the group of processes allowed to access a certain memory area.
Arguments:
- base: starting address of the public memory region
- size: size of the public memory area in bytes
- disp_unit: displacement unit in bytes; target displacements in RMA operations are scaled by this factor
- info: hint to the MPI library on how the window will be used (e.g. only reading or only writing)
- comm: communicator defining the group of processes allowed to access the memory window

The window concept of MPI-2 (II)
Definition of a temporal window:
- Access Epoch: time slot in which a process accesses remote memory of another process
- Exposure Epoch: time slot in which a process allows access to its memory window by other processes
Does a process have control over when other processes access its memory window?
- yes: active target communication
- no: passive target communication

Active Target Communication (I)
MPI_Win_fence (int assert, MPI_Win win);
- Synchronizes all operations within a window; collective across all processes of win
- No difference between access and exposure epoch
- Starts and/or closes an access and exposure epoch
Arguments:
- assert: hint to the library on the usage (default: 0)

Data exchange (I)
MPI_Put (void *oaddr, int ocount, MPI_Datatype otype, int rank, MPI_Aint disp, int tcount, MPI_Datatype ttype, MPI_Win win);
A single process controls the data parameters of both processes:
- Puts the data described by (oaddr, ocount, otype) into the main memory of the process with rank rank in the window win at the position (base + disp*disp_unit, tcount, ttype)
- base and disp_unit have been defined in MPI_Win_create
- The values of base and disp_unit are not known by the process calling MPI_Put!
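A minimal, self-contained sketch (not part of the original slides) of an active-target epoch: every process exposes one double through a window, and rank 0 puts a value into rank 1's window, with two MPI_Win_fence calls opening and closing the epoch. The variable names are invented for illustration.

    #include <mpi.h>
    #include <stdio.h>

    int main (int argc, char **argv)
    {
        int rank, size;
        double local = 0.0, value = 42.0;
        MPI_Win win;

        MPI_Init (&argc, &argv);
        MPI_Comm_rank (MPI_COMM_WORLD, &rank);
        MPI_Comm_size (MPI_COMM_WORLD, &size);
        if (size < 2) {                        /* needs at least two processes */
            MPI_Finalize ();
            return 1;
        }

        /* every process exposes one double; disp_unit = sizeof(double),
           so target displacements are counted in doubles */
        MPI_Win_create (&local, sizeof(double), sizeof(double),
                        MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence (0, win);                /* open access/exposure epoch */
        if (rank == 0)
            MPI_Put (&value, 1, MPI_DOUBLE,    /* origin buffer */
                     1, 0, 1, MPI_DOUBLE,      /* target rank 1, displacement 0 */
                     win);
        MPI_Win_fence (0, win);                /* close epoch: data now visible */

        if (rank == 1)
            printf ("rank 1 received %f\n", local);

        MPI_Win_free (&win);
        MPI_Finalize ();
        return 0;
    }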

Example: Ghost-cell update
Parallel matrix-vector multiply for band matrices.
[Figure: a banded system A x = rhs distributed over two processes. Process 0 owns x1, x2 and rhs1, rhs2; process 1 owns x3, x4 and rhs3, rhs4. Because of the off-diagonal band entries, Process 0 needs x3 and Process 1 needs x2 to compute its local rows.]

Example: Ghost-cell update (II)
Ghost cells: (read-only) copy of elements held by another process.
[Figure: process 0 holds x1, x2 plus a ghost copy of x3; process 1 holds x3, x4 plus a ghost copy of x2.]
Ghost cells for 2-D matrices: an additional row of data per neighbor.
[Figure: processes 0, 1 and 2 each hold a block of nxlocal rows of length ny, plus ghost rows adjacent to the neighboring blocks.]

Example: Ghost-cell update (III)
Data structure: u[i][j] is stored in a matrix
- nxlocal: number of data points in x direction
- ny: number of data points in y direction
Extent of the variable u: u[nxlocal+2][ny], with u[1:nxlocal][0:ny-1] containing the local data (rows 0 and nxlocal+1 are the ghost rows).

Example: Ghost-cell update (IV)

    /* disp_unit = 1: target displacements are given in bytes */
    MPI_Win_create (u, (nxlocal+2)*ny*sizeof(double), 1,
                    MPI_INFO_NULL, comm, &win);

    MPI_Win_fence (0, win);
    /* copy own first data row into neighbor rank-1's ghost row (row nxlocal+1) */
    MPI_Put (&u[1][0], ny, MPI_DOUBLE, rank-1,
             (nxlocal+1)*ny*sizeof(double), ny, MPI_DOUBLE, win);
    /* copy own last data row into neighbor rank+1's ghost row (row 0) */
    MPI_Put (&u[nxlocal][0], ny, MPI_DOUBLE, rank+1,
             0, ny, MPI_DOUBLE, win);
    MPI_Win_fence (0, win);

    MPI_Win_free (&win);

Comments on the example
- Modifications to the data items might only be visible after closing the corresponding epochs
- No guarantee whether the data item is actually transferred during MPI_Put or during MPI_Win_fence
- If multiple processes modify the very same memory address on the very same process, no guarantees are given on which data item will be visible. It is the user's responsibility to get this right.

Passive Target Communication
MPI_Win_lock (int lock_type, int rank, int assert, MPI_Win win);
MPI_Win_unlock (int rank, MPI_Win win);
- MPI_Win_lock starts an access epoch for accessing the main memory of the process with rank rank
- All RDMA operations between a lock/unlock appear atomic
- lock_type: MPI_LOCK_EXCLUSIVE or MPI_LOCK_SHARED
- Updates to the local memory exposed through the MPI window should also happen using MPI_Win_lock/MPI_Put; otherwise the access order between the local update and the RDMA access is undefined (race condition)
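A small sketch (not from the slides) of that last point: the owning process takes an exclusive lock on its own rank and updates its exposed memory through MPI_Put, so the update cannot race with incoming RDMA accesses. The function name and parameters are invented; win is assumed to be a window whose local base holds at least one double.

    #include <mpi.h>

    /* update one double at the start of this process's own window memory */
    static void update_local_window (MPI_Win win, int myrank, double new_value)
    {
        MPI_Win_lock (MPI_LOCK_EXCLUSIVE, myrank, 0, win);
        MPI_Put (&new_value, 1, MPI_DOUBLE,    /* origin buffer */
                 myrank, 0, 1, MPI_DOUBLE,     /* target: own window, displacement 0 */
                 win);
        MPI_Win_unlock (myrank, win);          /* update visible after unlock */
    }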

Example: Ghost-cell update (V)

    /* disp_unit = 1: target displacements are given in bytes */
    MPI_Win_create (u, (nxlocal+2)*ny*sizeof(double), 1,
                    MPI_INFO_NULL, comm, &win);

    MPI_Win_lock (MPI_LOCK_EXCLUSIVE, rank-1, 0, win);
    MPI_Put (&u[1][0], ny, MPI_DOUBLE, rank-1,
             (nxlocal+1)*ny*sizeof(double), ny, MPI_DOUBLE, win);
    MPI_Win_unlock (rank-1, win);

    MPI_Win_lock (MPI_LOCK_EXCLUSIVE, rank+1, 0, win);
    MPI_Put (&u[nxlocal][0], ny, MPI_DOUBLE, rank+1,
             0, ny, MPI_DOUBLE, win);
    MPI_Win_unlock (rank+1, win);

One-sided vs. Two-sided communication
- One-sided communication doesn't need message matching or unexpected-message queues
- Uses only one processor: potentially faster!
- One-sided communication in MPI can potentially optimize multiple transactions between multiple processes

Limitations of the MPI-2 model
- Synchronization costs (e.g. MPI_Win_fence) can be significant
- Static model: the size of a memory window cannot be altered after creating an MPI_Win; difficult to support dynamic data structures such as a linked list
- Passive target model has limited usability, but that is what most other RDMA libraries focus on
In MPI-3:
- Introduction of dynamic windows
- Extended functionality for passive target operations

Use case: distributed linked list
A linked list maintained across multiple processes, e.g. after a global sort operation of all elements, or with fixed rules for the keys (rank 0: keys which start with a to d; rank 1: keys which start with e to h; ...).
[Figure: list elements distributed across Rank 0, Rank 1 and Rank 2, linked across process boundaries.]

Use case: Distributed linked list

    typedef struct {
        char key[MAX_KEY_SIZE];
        char value[MAX_VALUE_SIZE];
        MPI_Aint next_disp;   /* next_disp and next_rank together are the   */
        int next_rank;        /* equivalent of the next pointer in a        */
                              /* non-distributed linked list                */
        void *next_local;     /* next local element */
    } ListElem;

    /* Create an MPI data type ListElem_type describing this structure
       using MPI_Type_create_struct; not shown here for brevity. */

Traversing a distributed linked list

    ListElem local_copy, *current;
    ListElem *head;                 /* assumed to be already set */

    current = head;
    /* get a shared (read-only) lock to all processes that are part of win */
    MPI_Win_lock_all (0, win);
    while (1) {
        if (current->next_rank != myrank) {
            MPI_Get (&local_copy, 1, ListElem_type,
                     current->next_rank, current->next_disp,
                     1, ListElem_type, win);
            /* enforce completion of all pending operations to that process
               without having to release the lock(s) */
            MPI_Win_flush (current->next_rank, win);
            current = &local_copy;
        } else {
            current = current->next_local;
        }
        if (strcmp (current->key, key) == 0)  /* key: search key, assumed set */
            break;
    }
    MPI_Win_unlock_all (win);

Inserting elements into a linked list
- Assuming that only the local process is allowed to insert an element (e.g. after a global sort operation); remote processes are only allowed to read elements on other processes
- Requires dynamically allocating memory and extending a memory region

MPI_Win_create_dynamic (MPI_Info info, MPI_Comm comm, MPI_Win *win);
MPI_Win_attach (MPI_Win win, void *base, MPI_Aint size);
- A dynamic window defines only the participating group of processes
- More than one memory region can be attached to a single window

Inserting elements into a linked list (II)

    /* create the window instance once */
    MPI_Win_create_dynamic (MPI_INFO_NULL, comm, &win);

    /* insert each new element into the memory window */
    t = (ListElem *) malloc (sizeof(ListElem));
    strcpy (t->key, key);
    strcpy (t->value, value);

    current = find_prev_element (head, key, value);
    t2 = current->next_local;
    current->next_local = t;     /* similarly, update next_rank and */
    t->next_local = t2;          /* next_disp on current and t      */
    MPI_Win_attach (win, t, sizeof(ListElem));

    /* add another element */
    t = (ListElem *) malloc (sizeof(ListElem));
    MPI_Win_attach (win, t, sizeof(ListElem));

    MPI_Barrier (comm);
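One step the slide leaves implicit: for a dynamic window, the target displacement that remote processes use in MPI_Get is the absolute address of the attached memory, obtained with MPI_Get_address on the owning process. A small sketch (an assumption, not shown on the slides) of how next_disp and next_rank could be filled in after attaching a new element t:

    /* after MPI_Win_attach(win, t, sizeof(ListElem)):
       compute the displacement remote processes must use to reach t */
    MPI_Aint t_disp;
    MPI_Get_address (t, &t_disp);

    /* record the distributed "next pointer" in the predecessor;
       here the predecessor is local, otherwise this update would have
       to be done with one-sided operations as well */
    current->next_disp = t_disp;
    current->next_rank = myrank;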
