Distributed Simulation in CUBINlab John Papandriopoulos http://www.cubinlab.ee.mu.oz.au/ ARC Special Research Centre for Ultra-Broadband Information Networks
Outline Clustering overview The CUBINlab cluster Paradigms for Distributed Simulation Message Passing Interface Parallelising MATLAB Simulation example Wrap up
Clustering Speedup through parallelism Motivated by: Cheap commodity hardware Cheap/fast network interconnects Various architectures: Dedicated nodes, separate network Reuse of existing network and workstations
Cluster Architectures Beowulf Network of workstations [diagrams of the two architectures]
How can CUBIN benefit? Many problems have inherent parallelism: Optimization problems (e.g. shortest path) Network simulation Monte Carlo simulations Numerical and iterative calculations (e.g. large systems: matrix inverse, FFT, PDEs) Clustering can exploit this: the CUBINlab Distributed Simulation Cluster (DSC)
Composition of the DSC 9 machines, each: 550MHz to 1.5GHz 128MB to 256MB RAM ~10GB HDD scratch space 100Mbps Ethernet Mandrake Linux Identical software Low maintenance http://www.cubinlab.ee.mu.oz.au/cluster/
Utilisation of the Cluster Which machines should I use? Resource allocation Load balancing Each machine runs an SNMP daemon Use this to get stats on each machine at 5-minute intervals: CPU load Memory use Scratch space use LAN network use (rate & volume) http://www.cubinlab.ee.mu.oz.au/netstat/
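Assuming the stock net-snmp command-line tools and the standard UCD-SNMP-MIB (the hostname, community string and value shown are made up), a node's 1-minute load average can be queried like so:

  $ snmpget -v 2c -c public cluster1 UCD-SNMP-MIB::laLoad.1
  UCD-SNMP-MIB::laLoad.1 = STRING: 0.42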
CPU Load Average History [graph: per-node CPU load average over time]
Paradigms for Distributed Simulation
Paradigms for Distributed Simulation Distributed vs. parallel: loose vs. tight coupling Single Instruction Multiple Data (SIMD): vector (super-)computers Multiple Instruction Multiple Data (MIMD): parallel/distributed computers Synchronous vs. asynchronous design Shared memory vs. message passing
Embarrassingly Parallel Applications Totally independent subtasks Assign each subtask to a processor Collect results at the end No task dependencies Master-Slave model: [diagram: master dispatching a bag of subtasks to slaves #1 to #p]
Load Balancing When do we allocate tasks, and to whom? At the start? As we go along? Equal division? Communications overhead? Concurrency is maximised when all processors are in use With an unequal distribution of work, or different processing speeds, some processors will be idle (see the master-slave sketch below)
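A minimal C sketch (not from the talk) of the master-slave pattern with dynamic allocation: the master hands out one subtask at a time, so faster slaves naturally receive more work; the subtask payload is a plain double for brevity.

  #include <stdio.h>
  #include "mpi.h"

  #define NTASKS   1000
  #define TAG_WORK 1
  #define TAG_STOP 2

  int main(int argc, char* argv[])
  {
      int rank, size;
      MPI_Status status;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      if (rank == 0) {                  /* master: deal from the bag of subtasks */
          int sent = 0, done = 0, w;
          double task = 0.0, result;
          for (w = 1; w < size && sent < NTASKS; w++) {  /* prime each slave */
              task = sent++;
              MPI_Send(&task, 1, MPI_DOUBLE, w, TAG_WORK, MPI_COMM_WORLD);
          }
          while (done < sent) {         /* hand out the next task as results return */
              MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                       MPI_COMM_WORLD, &status);
              done++;
              if (sent < NTASKS) {
                  task = sent++;
                  MPI_Send(&task, 1, MPI_DOUBLE, status.MPI_SOURCE,
                           TAG_WORK, MPI_COMM_WORLD);
              }
          }
          for (w = 1; w < size; w++)    /* tell all slaves to stop */
              MPI_Send(&task, 1, MPI_DOUBLE, w, TAG_STOP, MPI_COMM_WORLD);
      } else {                          /* slave: compute until told to stop */
          double task, result;
          while (1) {
              MPI_Recv(&task, 1, MPI_DOUBLE, 0, MPI_ANY_TAG,
                       MPI_COMM_WORLD, &status);
              if (status.MPI_TAG == TAG_STOP) break;
              result = task * task;     /* stand-in for the real subtask */
              MPI_Send(&result, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
          }
      }
      MPI_Finalize();
      return 0;
  }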
Message Passing Interface (MPI)
Overview of MPI(CH) Standard for message passing Portable and free implementation High performance Object-oriented look and feel C, Fortran and C++ bindings available
MPICH Architecture Layered: MPI on top of the Abstract Device Interface (ADI), on top of the Channel Interface, on top of TCP/IP
MPI layer: the interface providing the message-passing constructs (MPI_Send, MPI_Recv, etc.); implemented entirely on the ADI, which has a smaller set of constructs
ADI layer: sends/receives messages; moves data between MPI and the Channel Interface; manages pending messages; provides basic environment information (e.g. how many tasks are running)
Channel Interface: formats and packs bytes for the wire; very simple (only 5 functions); three delivery mechanisms: Eager (immediate delivery), Rendezvous (once the receiver wants it), Get (shared memory/DMA hardware)
Process Structure Groups and communicators: How many processors are there? Who am I? (Who are my neighbours?) Virtual topologies: Cartesian for grid computation [diagram: ranks (0)-(8) mapped onto a 3x3 grid of coordinates (0,0)-(2,2)]
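A sketch of the 3x3 mapping above using MPI's Cartesian topology calls (run with -np 9):

  #include <stdio.h>
  #include "mpi.h"

  int main(int argc, char* argv[])
  {
      int rank, coords[2];
      int dims[2]    = {3, 3};   /* 3x3 process grid, as in the diagram */
      int periods[2] = {0, 0};   /* no wrap-around at the grid edges */
      MPI_Comm grid;

      MPI_Init(&argc, &argv);
      /* map the linear ranks (0)..(8) onto grid coordinates (0,0)..(2,2) */
      MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid);
      MPI_Comm_rank(grid, &rank);
      MPI_Cart_coords(grid, rank, 2, coords);
      printf("rank %d -> (%d,%d)\n", rank, coords[0], coords[1]);
      MPI_Finalize();
      return 0;
  }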
Communicators MPI_COMM_WORLD holds all processes, ranks (0)-(5) Processes can be partitioned into groups: Group A (world ranks 0-2) and Group B (world ranks 3-5) Each group gets its own communicator, COMM_A and COMM_B, with local ranks renumbered (0)-(2) [animated diagram of the split]
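The split pictured above can be written with MPI_Comm_split; a sketch assuming six processes (-np 6), with colour selecting the group:

  #include <stdio.h>
  #include "mpi.h"

  int main(int argc, char* argv[])
  {
      int world_rank, sub_rank, colour;
      MPI_Comm sub;   /* becomes COMM_A on ranks 0-2, COMM_B on ranks 3-5 */

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

      colour = (world_rank < 3) ? 0 : 1;   /* group A or group B */
      /* processes with the same colour land in the same new communicator,
         ranked 0..2 within it (ordered here by their world rank) */
      MPI_Comm_split(MPI_COMM_WORLD, colour, world_rank, &sub);
      MPI_Comm_rank(sub, &sub_rank);
      printf("world rank %d -> group %c rank %d\n",
             world_rank, colour ? 'B' : 'A', sub_rank);

      MPI_Comm_free(&sub);
      MPI_Finalize();
      return 0;
  }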
Point-to-Point: sending data
MPI_SEND(buf, count, datatype, dest, tag, comm)
  IN  buf       initial address of send buffer (choice)
  IN  count     number of elements in send buffer (non-negative integer)
  IN  datatype  datatype of each send buffer element (handle)
  IN  dest      rank of destination (integer)
  IN  tag       message tag (integer)
  IN  comm      communicator (handle)
Point-to-Point: receiving data
MPI_RECV(buf, count, datatype, source, tag, comm, status)
  OUT buf       initial address of receive buffer (choice)
  IN  count     number of elements in receive buffer (integer)
  IN  datatype  datatype of each receive buffer element (handle)
  IN  source    rank of source (integer)
  IN  tag       message tag (integer)
  IN  comm      communicator (handle)
  OUT status    status object (Status)
Simple Example: hello world

  #include <stdio.h>
  #include <string.h>
  #include "mpi.h"

  int main(int argc, char* argv[])
  {
      int rank, size, tag, i;
      char message[20];
      MPI_Status status;

      MPI_Init(&argc, &argv);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      tag = 100;
      if (rank == 0) {
          strcpy(message, "Hello world!");
          for (i = 1; i < size; i++)
              MPI_Send(message, 13, MPI_CHAR, i, tag, MPI_COMM_WORLD);
      } else {
          MPI_Recv(message, 13, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &status);
      }
      printf("node %d: %s\n", rank, message);

      MPI_Finalize();
      return 0;
  }
Simple Example: hello world
Compiling:
  $ mpicc -o simple simple.c
Running on 12 machines (?!):
  $ mpirun -np 12 simple
  node 0: Hello world!
  node 3: Hello world!
  node 7: Hello world!
  node 6: Hello world!
  node 8: Hello world!
  node 2: Hello world!
  node 5: Hello world!
  node 4: Hello world!
  node 1: Hello world!
  node 11: Hello world!
  node 9: Hello world!
  node 10: Hello world!
  $
MPI Data Types Standard types:
  MPI data type       C data type
  MPI_CHAR            signed char
  MPI_SHORT           signed short int
  MPI_INT             signed int
  MPI_LONG            signed long int
  MPI_UNSIGNED_CHAR   unsigned char
  MPI_UNSIGNED_SHORT  unsigned short int
  MPI_UNSIGNED        unsigned int
  MPI_UNSIGNED_LONG   unsigned long int
  MPI_FLOAT           float
  MPI_DOUBLE          double
  MPI_LONG_DOUBLE     long double
Can also define your own
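For instance, a user-defined type built with MPI_Type_contiguous (row_type is a made-up name); a whole row of 10 doubles can then travel as one element:

  #include "mpi.h"

  int main(int argc, char* argv[])
  {
      double row[10];
      MPI_Datatype row_type;

      MPI_Init(&argc, &argv);
      /* describe 10 contiguous doubles as one unit; commit before use */
      MPI_Type_contiguous(10, MPI_DOUBLE, &row_type);
      MPI_Type_commit(&row_type);
      /* the row now moves as a single element, e.g.
         MPI_Send(row, 1, row_type, dest, tag, MPI_COMM_WORLD); */
      MPI_Type_free(&row_type);
      MPI_Finalize();
      return 0;
  }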
Collective Communication Barrier synchronization Broadcast from one to many Gather from many to one Scatter data from one to many Scatter/Gather (complete exchange, all-to-all) and many more (a combined sketch follows the diagrams below)
Collective Communication [diagrams: Broadcast, Scatter and Gather, All Gather, and All to All operations]
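A sketch (not from the original slides) combining three of the operations pictured above: rank 0 broadcasts a scale factor, scatters one element of work to each process, then gathers the partial results back; it assumes the data length equals the number of processes.

  #include <stdio.h>
  #include <stdlib.h>
  #include "mpi.h"

  int main(int argc, char* argv[])
  {
      int rank, size, i;
      double scale = 0.0, slice, partial;
      double *data = NULL, *results = NULL;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      if (rank == 0) {
          scale   = 2.0;
          data    = malloc(size * sizeof(double));
          results = malloc(size * sizeof(double));
          for (i = 0; i < size; i++) data[i] = i;
      }
      /* everyone learns the scale factor from rank 0 */
      MPI_Bcast(&scale, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
      /* one element of data goes to each rank */
      MPI_Scatter(data, 1, MPI_DOUBLE, &slice, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
      partial = scale * slice;
      /* partial results come back to rank 0 */
      MPI_Gather(&partial, 1, MPI_DOUBLE, results, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
      if (rank == 0) {
          for (i = 0; i < size; i++) printf("results[%d] = %g\n", i, results[i]);
          free(data); free(results);
      }
      MPI_Finalize();
      return 0;
  }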
Profiling Parallel Code MPI has profiling built in: mpicc -o simple -mpilog simple.c Generates a log file of all calls during execution A log-file visualiser, Jumpshot-3 (Java), is available
Parallelising MATLAB
Speeding up MATLAB Code Use vector/matrix form of evaluations:
  inner = A'*B;
rather than
  inner = 0;
  for i = 1:size(A,1)
    inner = inner + A(i).*B(i);
  end
Pre-allocate matrices:
  S = zeros(n, K);
Clear unused variables to avoid swapping:
  clear S;
Re-write bottlenecks in C with MEX (see the sketch below) Parallelize!
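As a sketch of the MEX point above (fastdot is a made-up name, not a real toolbox function): a C inner product callable from MATLAB as inner = fastdot(A, B) after compiling with mex fastdot.c.

  /* fastdot.c: inner product of two real vectors, as a MATLAB MEX function */
  #include "mex.h"

  void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[])
  {
      double *a, *b, sum = 0.0;
      mwSize i, n;

      if (nrhs != 2)
          mexErrMsgTxt("Usage: inner = fastdot(A, B)");
      n = mxGetNumberOfElements(prhs[0]);
      if (n != mxGetNumberOfElements(prhs[1]))
          mexErrMsgTxt("A and B must have the same length");

      a = mxGetPr(prhs[0]);
      b = mxGetPr(prhs[1]);
      for (i = 0; i < n; i++)    /* the tight loop that was slow in MATLAB */
          sum += a[i] * b[i];

      plhs[0] = mxCreateDoubleScalar(sum);
  }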
Why not a Parallel MATLAB? Parallelism is a good idea! Why doesn't MathWorks provide support? MathWorks says [1]: Communication bottlenecks and problem size Memory model and architecture (Only) useful for outer-loop parallelism Most time is spent in the parser, interpreter and graphics Business situation
Parallelising MATLAB: overview Parallelism through high-level toolboxes Medium-coarse granularity Good for outer-loop parallelisation Uses message-passing APIs: MPI PVM Socket based File IO based
Toolboxes Available
  Status  Toolbox                         Origins
  X $     MultiMATLAB                     Cornell
  S       DP-Toolbox                      U of Rostock, Germany
  B       MPITB/PVMTB                     U of Granada, Spain
  $       MATmarks                        U of Illinois
  S       MatlabMPI                       MIT
  S       MATLAB*P                        MIT
  S       Matlab Parallelization Toolkit  Einar Heiberg
  ?       Parallel Toolbox for MATLAB     Unknown
  (X: no source, $: commercial, B: binaries only, S: open source, ?: unknown)
MatlabMPI Limited subset of MPI commands Very primitive handling of basic operations No barrier function for synchronization Communication based on file IO: very portable (no MEX), but incurs file system, disc and NFS overhead Spin locks for blocking reads Uses RSH/SSH to spawn remote MATLAB instances Machine list supplied in m-files
MatlabMPI API
  MPI_Run        Runs a MATLAB script in parallel
  MPI_Init       Initialises the toolbox
  MPI_Finalize   Cleans up at the end
  MPI_Comm_size  Gets number of processors in a communicator
  MPI_Comm_rank  Gets rank of current processor within a communicator
  MPI_Send       Sends a message to a processor (non-blocking)
  MPI_Recv       Receives a message from a processor (blocking)
  MPI_Abort      Kills all MATLAB jobs started by MatlabMPI
  MPI_Bcast      Broadcasts a message (blocking)
  MPI_Probe      Returns a list of all incoming messages
MatlabMPI Performance Mainly used for embarrassingly parallel applications Matches native MPI performance at 1MByte message sizes [12] Spin locks and busy waiting caused a DDoS attack on our NFS
Simulation Example
Iterative Power Control Iterative solution to a Power Control problem: each iteration of P_i depends on all P_j, j ≠ i We want to simulate over 1,000 runs with different user signature sequences S
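The slide's update equation did not survive extraction; a representative iteration of this form (an assumption: the standard SIR-balancing, Foschini-Miljanic style update with target SIR gamma_i for user i) is, in LaTeX:

  P_i(n+1) = \frac{\gamma_i}{\operatorname{SIR}_i(P(n))}\, P_i(n),
  \qquad
  \operatorname{SIR}_i(P(n)) = \frac{g_{ii}\, P_i(n)}{\sum_{j \neq i} g_{ij}\, P_j(n) + \sigma_i^2}

Since SIR_i depends on every P_j(n), j ≠ i, each iteration couples all users, which is what forces the per-iteration synchronization discussed next.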
Inner Loop Parallelism Split the iteration P_i(n+1) over p processors Synchronize at the end of each iteration (a sketch follows) Is there a faster way?
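A C sketch of that synchronization pattern (an illustration only: one user per process, and a stand-in update rule rather than the talk's actual one); MPI_Allgather doubles as the end-of-iteration barrier.

  #include <stdio.h>
  #include <stdlib.h>
  #include "mpi.h"

  #define NITER 50

  int main(int argc, char* argv[])
  {
      int rank, size, i, n;
      double *P, mine, interference;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      P = malloc(size * sizeof(double));
      for (i = 0; i < size; i++) P[i] = 1.0;   /* initial powers */

      for (n = 0; n < NITER; n++) {
          /* local update of my P_i from everyone else's current power
             (a stand-in update rule, not the talk's actual one) */
          interference = 0.0;
          for (i = 0; i < size; i++)
              if (i != rank) interference += P[i];
          mine = 0.1 * (1.0 + interference);
          /* synchronize: all processes exchange their new P_i before the
             next iteration begins (acts as the per-iteration barrier) */
          MPI_Allgather(&mine, 1, MPI_DOUBLE, P, 1, MPI_DOUBLE, MPI_COMM_WORLD);
      }
      if (rank == 0) printf("P[0] after %d iterations: %g\n", NITER, P[0]);
      free(P);
      MPI_Finalize();
      return 0;
  }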
Outer Loop Parallelism Split the 1,000 runs over p processors No need for explicit synchronization Each simulation run is independent Embarrassingly parallel problem Master/slave model will work quite well
Simulation Results Sorry, no results due to the MatlabMPI DDoS attack!
Wrap Up
Conclusions Clustering provides a cheap way to increase computational power through parallelism Parallelism is present in many problems, to a degree Message passing is one method of unifying computation amongst distributed processors MATLAB can be used for coarse-grain parallel applications
References-1
1. Why There Isn't a Parallel MATLAB, C. Moler, http://www.mathworks.com/company/newsletter/pdf/spr95cleve.pdf
2. Parallel Matlab Survey, R. Choy, http://supertech.lcs.mit.edu/~cly/survey.html
3. MultiMatlab: Integrating MATLAB with High-Performance Parallel Computing, V. Menon and A. Trefethen, Cornell University
4. Matpar: Parallel Extensions for MATLAB, P. Springer, Jet Propulsion Laboratory, CalTech
5. Message Passing under MATLAB, J. Baldomero, U of Granada, Spain
6. Performance of Message-Passing MATLAB Toolboxes, J. Fernandez, A. Canas, A. Diaz, J. Gonzalez, J. Ortega and A. Prieto, U of Granada, Spain
References-2
7. Parallel and Distributed Computation, D. Bertsekas and J. Tsitsiklis, Prentice Hall, NJ, 1989
8. Message Passing Interface Forum, http://www.mpi-forum.org/
9. MPICH: A Portable Implementation of MPI, http://www-unix.mcs.anl.gov/mpi/mpich/
10. A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard, W. Gropp and E. Lusk, Argonne National Laboratory
11. Tutorial on MPI: The Message Passing Interface, W. Gropp, http://www-unix.mcs.anl.gov/mpi/tutorial/gropp/talk.html
12. Parallel Programming with MatlabMPI, J. Kepner, MIT Lincoln Laboratory, 2001