CISC 879 Software Support for Multicore Architectures Spring Student Presentation 6: April 8. Presenter: Pujan Kafle, Deephan Mohan


CISC 879 Software Support for Multicore Architectures, Spring 2008
Student Presentation 6: April 8
Presenters: Pujan Kafle, Deephan Mohan
Scribe: Kanik Sem

The following two papers were presented:

A Synchronous Mode MPI Implementation on the Cell BE Architecture: Presented by Pujan Kafle

MPI Design

In implementing MPI, it is tempting to view the main memory as the shared memory of an SMP and to hide the local store from the application. This would alleviate the difficulty of programming the Cell. However, there are several challenges to overcome. Some of these require new features in the compiler and in the Linux implementation on the Cell, which have recently become available.

MPI Communication Modes

MPI provides different options for the communication mode chosen for the basic blocking point-to-point operations, MPI_Send and MPI_Recv. Implementations can use either the buffered mode or the synchronous mode. A safe application should not make any assumption about the choice made by the implementation.

In the buffered mode, the message to be sent is copied into a buffer, and then the call can return. Thus, the send operation is local and can complete before the matching receive operation has been posted. Implementations often use this mode for small messages. For large messages, they avoid the overhead of the extra buffer copy by using the synchronous mode. Here, the send can complete only after the matching receive has been posted.

MPI Initialization

A user can run an MPI application, provided it uses only features that have currently been implemented, by compiling the application for the SPE and executing the following command on the PPE:

mpirun -n <N> executable arguments

where <N> is the number of SPEs on which the code is to be run. The mpirun process spawns the desired number of threads on the SPEs. Note that only one thread can be spawned on an SPE, and so <N> cannot exceed eight on a single processor or sixteen on a blade.

Communication Architecture

Associated with each message is meta-data that contains the following information about the message: the address of the memory location that contains the data, the sender's rank, the tag, the message size, the datatype ID, the MPI communicator ID, and an error field. For each pair of SPE threads, we allocate space for two meta-data entries, one in each of the two SPE local stores, for a total of N(N-1) entries, with (N-1) entries in each SPE local store. Entry Bij is used to store meta-data for a message from process i to process j, i != j.
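
As a rough illustration of this bookkeeping, the Python sketch below models one meta-data entry and the N(N-1) slots. The field names, the `flag` field (which the receive protocol mentions but the field list above does not), and the choice to keep Bij in the receiver's local store are assumptions made for illustration; the notes do not specify the actual layout.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class MetaData:
    """Hypothetical layout of one meta-data entry; field names are illustrative."""
    address: int = 0          # location of the message data
    sender_rank: int = 0
    tag: int = 0
    size: int = 0             # message size in bytes
    datatype_id: int = 0
    communicator_id: int = 0
    error: int = 0
    flag: bool = False        # assumed "entry is valid" marker polled by the receiver


def allocate_entries(n: int) -> Dict[Tuple[int, int], MetaData]:
    """Create one entry B[i, j] for every ordered pair of distinct ranks."""
    return {(i, j): MetaData() for i in range(n) for j in range(n) if i != j}


def entries_in_local_store(entries: Dict[Tuple[int, int], MetaData], j: int) -> List[Tuple[int, int]]:
    """Entries assumed to live in SPE j's local store: messages sent to j."""
    return [key for key in entries if key[1] == j]
```

With N = 8 this gives 56 entries in total and 7 per local store, which agrees with the N(N-1) and (N-1) counts stated above.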

Send Protocol

The send operation from Pi to Pj proceeds as follows. The send operation first puts the meta-data entry into buffer Bij through a DMA operation. It then waits for a signal from Pj notifying it that Pj has copied the message. The signal obtained from Pj contains the error value, which is set to MPI_SUCCESS on successful completion of the receive operation and to the appropriate error value on failure. In the synchronous mode, an SPE waits for the acknowledgment of exactly one send at any given point in time, and so all the bits of the receive signal register can be used.

Receive Protocol

The receive operation has four flavors:

(i) It can receive a message with a specific tag from a specific source.
(ii) It can receive a message with any tag (MPI_ANY_TAG) from a specific source.
(iii) It can receive a message with a specific tag from any source (MPI_ANY_SOURCE).
(iv) It can receive a message with any tag from any source.

In the first case, the meta-data entry in Bij is polled continuously until the flag field is set. The tag value in the meta-data entry is then checked. If the application truly did not assume any particular communication mode, then this tag should match, and the check is superfluous. The receive call then transfers the data from the source SPE's application buffer to its own buffer and writes to Pi's signal register to indicate that the data has been copied.

The second case is handled in a manner similar to the first, except that any tag matches.

The third and fourth cases are similar to the first two, respectively, as far as tags are concerned. However, messages from any source can match. So the receive operation checks the meta-data entry flags for each sender repeatedly, in round-robin fashion, to avoid starvation, even though the MPI standard does not guarantee against starvation.

Performance Evaluation

Latency was measured for point-to-point communication on the pingpong test, in the presence and absence of congestion. The congested test involved dividing the SPEs into pairs and having each pair exchange messages. Throughput was measured with the same tests for large messages, where the overhead of exchanging meta-data and signals can be expected to be relatively insignificant. The maximum throughput observed is around 6 GB/s.

Conclusion

The authors have described an efficient implementation of synchronous mode MPI communication on the Cell processor. It is part of an MPI implementation which demonstrates that efficient MPI is possible on the Cell processor using just the SPEs, even though they are not full-featured cores.
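
The four receive flavors from the Receive Protocol can be sketched as a single matching routine. This is a simplified model rather than the authors' code: `ANY_SOURCE`, `ANY_TAG`, `tag_matches`, and `poll_once` are hypothetical names, each slot holds only a tag (or `None` when the flag is not set), and one call performs a single scan where the real receive would keep polling.

```python
ANY_SOURCE = -1  # stand-ins for MPI_ANY_SOURCE / MPI_ANY_TAG; the values are illustrative
ANY_TAG = -1


def tag_matches(entry_tag, wanted_tag):
    """A wildcard tag matches everything; otherwise the tags must be equal."""
    return wanted_tag == ANY_TAG or entry_tag == wanted_tag


def poll_once(slots, source, tag, start=0):
    """Scan the per-sender meta-data slots once.

    slots[i] is None while sender i's flag is not set, otherwise the tag of
    the pending message from sender i.
    Returns (sender, next_start) for the first match, or None if nothing matches.
    """
    n = len(slots)
    if source != ANY_SOURCE:
        candidates = [source]
    else:
        # Round robin: start from the given position so that one busy sender
        # cannot starve the others.
        candidates = [(start + k) % n for k in range(n)]
    for i in candidates:
        if slots[i] is not None and tag_matches(slots[i], tag):
            return i, (i + 1) % n
    return None
```

Passing the returned `next_start` back in on the following wildcard receive is what makes the scan round-robin across senders.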

Limitations of PlayStation 3 for High Performance Cluster Computing: Presented by Deephan Mohan

Cell Evaluation

Cell addresses the following two main problems:

* Memory Wall: The Cell Broadband Engine's SPEs use two mechanisms to deal with long main-memory latencies: (a) a 3-level memory structure (main storage, local stores in each SPE, and large register files in each SPE), and (b) asynchronous DMA transfers between main storage and local stores. These features allow programmers to schedule simultaneous data and code transfers to cover long latencies effectively.
* Frequency Wall: By specializing the PPE and the SPEs for control and compute-intensive tasks, respectively, the Cell Broadband Engine Architecture, on which the Cell Broadband Engine is based, allows both the PPE and the SPEs to be designed for high frequency without excessive overhead. The PPE achieves efficiency primarily by executing two threads simultaneously rather than by optimizing single-thread performance.

Cell Application Features

* Vectorize: The SPEs are vector units. This means that, in code that is not vectorized, every scalar operation must be promoted to a vector operation, which results in a considerable performance loss.
* Keep data aligned: In order to achieve the best transfer rates, data accesses must be aligned both in main memory and in the SPEs' local memories. Alignment provides better exploitation of the memory banks and better performance of the DMA engine.
* Implement double buffering: In order to hide the cost of latencies and memory transfers, DMA transfers can be overlapped with SPE local computations. If these local computations are more expensive than a single data transfer, the communication phase can be completely hidden. This technique is known as double buffering.
* Improve data reuse: To reduce the number of memory transfers, it is important to arrange the instructions so as to maximize the reuse of data once it has been brought into the SPEs' local memories.
* Explicitly unroll: Due to the high number of registers on the SPEs and to the simplicity of the SPE architecture (no register renaming, no speculative execution, no dynamic branch prediction, etc.), explicit unrolling provides considerable improvements in performance.
* Reduce branches in the code: SPEs can only do static branch prediction. Since these prediction schemes are rather inefficient on programs that have a complex execution flow, reducing the number of branches in the code usually provides performance improvements.
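
The double buffering claim in the list above can be checked with a small timing model. This is a back-of-the-envelope sketch, not a measurement: it assumes every block has the same transfer time `t_comm` and compute time `t_comp`, and that exactly one transfer can overlap one computation. The function names are hypothetical.

```python
def serial_time(n_blocks, t_comm, t_comp):
    """Each block is transferred, then computed on, with no overlap."""
    return n_blocks * (t_comm + t_comp)


def double_buffered_time(n_blocks, t_comm, t_comp):
    """The transfer of block k+1 overlaps the computation on block k.

    Only the first transfer and the last computation are fully exposed;
    every step in between costs the slower of the two phases.
    """
    if n_blocks == 0:
        return 0
    return t_comm + (n_blocks - 1) * max(t_comm, t_comp) + t_comp
```

For 10 blocks with `t_comm = 1` and `t_comp = 3`, the serial schedule takes 40 time units and the double-buffered one takes 31: apart from the first transfer, communication is hidden behind computation, as the text states for the case where computation is the more expensive phase.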

A PlayStation 3 cluster: hardware/software details

Hardware:

* Cell BE processor: The Cell processor on each node has only six of its eight SPEs accessible to the user.
* Dual-channel Rambus Extreme Data Rate (XDR) memory: For all practical purposes the memory can provide a bandwidth of 25.6 GB/s to the SPEs through the EIB, provided that accesses are distributed evenly across all 16 banks.
* Built-in Gigabit Ethernet network card: The network card has a dedicated DMA unit, which permits data transfer without the PPE's intervention.

Software:

The Linux operating system runs on the PS3 on top of a virtualization layer (also called a hypervisor) that Sony refers to as Game OS. This means that all the hardware is accessible only through hypervisor calls. The hardware signals the kernel through virtualized interrupts. The interrupts are used to implement callbacks for non-blocking system calls.

Parallel Matrix-Matrix Product

Matrix-matrix product is one of the most important linear algebra kernels, since it is the most computationally intensive operation in many applications. The Cannon, PUMMA, and SUMMA algorithms are still extensively adopted in many high performance linear algebra applications run on parallel architectures.

Algorithm SUMMA

for i = 1 to n/nb do
    if I own A*i then
        Copy A*i in buf1
        Bcast buf1 to my proc row
    end if
    if I own Bi* then
        Copy Bi* in buf2
        Bcast buf2 to my proc column
    end if
    C = C + buf1 * buf2
end for

Performance Evaluation

Double Buffering:

* Performance can be improved by overlapping communications and computations
* Computations are offloaded to the SPEs
* Communications are taken care of by the PPE
* Data for step k+1 is broadcast at step k

Performance Result:

* The cost of local computations is small
* The surface-to-volume effect shows up for large problem sizes
* Memory limitations inhibit best performance

Conclusion

Limitations:

* Main memory access rate: floating-point execution is faster than the peak transfer rate
* Network interconnect speed: the capacity of the interconnect is out of balance
* Main memory size: a serious limitation
* Double precision performance: lower than single precision performance
* Programming paradigm: requires writing low-level code
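
The SUMMA loop from the Parallel Matrix-Matrix Product section can be checked with a serial sketch. The Python below is a single-process model, not the parallel implementation discussed in the presentation: no broadcasts take place, `buf1` and `buf2` simply stand in for the broadcast buffers, the matrices are assumed square, and the panel width `nb` is assumed to divide n. The name `summa_serial` is hypothetical.

```python
def summa_serial(A, B, nb):
    """Accumulate C = A * B one panel at a time, as in the SUMMA loop.

    A and B are n x n lists of lists; nb is the panel width.
    """
    n = len(A)
    assert n % nb == 0, "this sketch assumes nb divides n"
    C = [[0.0] * n for _ in range(n)]
    for i in range(n // nb):
        lo, hi = i * nb, (i + 1) * nb
        buf1 = [row[lo:hi] for row in A]   # column panel A*i (n x nb)
        buf2 = B[lo:hi]                    # row panel Bi* (nb x n)
        # C = C + buf1 * buf2
        for r in range(n):
            for c in range(n):
                C[r][c] += sum(buf1[r][k] * buf2[k][c] for k in range(nb))
    return C
```

In the parallel version, each process would hold only its own block of C and receive `buf1` and `buf2` through the row and column broadcasts. Broadcasting the panels for step k+1 while step k is being computed is the overlap described under Double Buffering.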
