SHARCNET Workshop on Parallel Computing
Hugh Merz, Laurentian University
May 2008
What is Parallel Computing?
A computational method that utilizes multiple processing elements to solve a problem in tandem.
Implementing parallelism requires modification of a sequentially executing algorithm such that the workload is decomposed into independent tasks. This may be trivial or laborious.
Rationale for using SHARCNET
- SHARCNET provides access to, and substantial user support for, large-scale parallel computing systems
- These HPC systems are prohibitively expensive for individual groups/departments to operate
- The majority of SHARCNET computing projects implement parallelism to achieve faster and/or more significant results
Why implement parallelism?
- Faster computation (more processors): strong scaling; the workload per processor is inversely proportional to the number of processors
- Larger problems (more memory): weak scaling; constant workload per processor; high-density memory is expensive
Parallelism is an essential feature in HPC...
Parallel Hardware
Parallelism is abundant in computer hardware:
- increases throughput
- exists on many, if not all, levels
- technological trend (e.g. multi-core)
Examples:
- parallel instruction scheduling
- parallel data operations (vector units)
- multiple processing cores per network node (computer)
- multiple computers (nodes) connected in a network (cluster)
Memory Locality
A major constraint on developing parallel algorithms is how processing elements view the global address space and communicate with one another:
- shared memory (SMP): multiple cores on the same computer; all of the cores can directly access all of the memory
- distributed memory (cluster): multiple computers on a network; each core can only directly access its local memory
It's possible to have combinations and multiple levels of shared/distributed memory, and technologies exist that can bridge the gap, e.g. MPI-2, distributed shared memory.
SMP Programming
Shared-memory systems are usually programmed to take advantage of multiple processing cores by threading applications with:
- OpenMP
- POSIX threads (pthreads)
Both are application programming interfaces (APIs). OpenMP is easier to implement and is generally supported by the compiler, whereas pthreads is more complex (lower-level) and is used as an external library. We will focus only on OpenMP.
Processes and Threads
A process is an instance of a computer program that is being executed:
- includes binary instructions, different memory storage areas, and hidden kernel data
- performs serial execution of instructions
It may be possible to perform many operations simultaneously (when instructions are independent). To process these instructions in parallel, the process can spawn a number of threads.
Processes and Threads
A thread is a light-weight process connected to the parent process, sharing its memory space:
- threads each have their own list of instructions and independent stack memory
The process of splitting up the execution stream is typically called the fork-join model: execution progresses serially, then a team of threads is created, works in parallel, and returns execution flow to the parent process.
Distributed Memory Programming
On distributed-memory systems one uses a team of processes that run simultaneously on a network of computers, passing data to one another in the form of messages. This is facilitated by the Message Passing Interface (MPI). MPI is a portable API that includes an external library and runtime programs that are used to compile and execute MPI programs. One can run MPI programs on shared-memory systems, or even mix the two approaches (MPI & OpenMP/pthreads).
Network Properties
Message passing performance depends on the time to send data across the network (latency) and the amount of data that can be passed in a given amount of time (bandwidth).
Examples:
- gigabit ethernet: 3.0x10^-5 s latency, 1 Gb/s bandwidth
- InfiniBand: 1.5x10^-6 s latency, 20 Gb/s bandwidth
Issues to be aware of:
- programs may not scale on lower-performance networks
- most algorithms are latency limited as opposed to bandwidth limited (writing to disk is slower than the network!)
- network topology can dramatically affect performance (multiple hops: increased latency; shared links: decreased bandwidth)
Amdahl's Law
In a parallel computational process, the degree of speedup is limited by the fraction of code that can be parallelized:
Tparallel(NP) = { (1 - P) + P/NP } * Tserial + overhead
where:
- Tparallel == time to run the application in parallel
- P == fraction of serial runtime that can be parallelized
- NP == number of parallel elements (cores, nodes, threads, etc.)
- Tserial == time to run the application serially
- overhead == overhead introduced by adding the parallelization
If P is small, the application will not scale to large NP!
Parallel Scaling
To compare the performance of a parallelized code to the serial version, one uses the term scaling, which is just:
Scaling(NP) = Tserial / Tparallel(NP)
This is only meaningful for strong scaling. In the weak scaling case, one attempts to maintain a constant runtime while using additional compute elements to solve a larger problem. Perfect scaling over a range of NP is referred to as being linear.
Parallelization Overhead
- Parallelization may involve additional computational overhead as well as communications overhead
- Optimal performance may require tuning for a particular system
- It is usually best to maximize the ratio of computation to communication
- Asynchronous message passing can hide communications overhead
Graininess of Parallelism
Fine-grained vs. coarse-grained parallelism refers to the amount of work that each compute element does before a synchronization point. Fine-grained parallelism (frequent communication) requires low latency!
Decomposition Approach
There are three main approaches to decomposing a problem so that it can run in parallel:
1. Serial farming (embarrassingly or perfectly parallel)
2. Data parallelism
3. Task parallelism
The approach to use depends on the algorithm, and can be a combination of the three!
Serial Farming
- Algorithms that can be decomposed with little or no communication, e.g. parameter-space algorithms with independent results
- No change required to the serial algorithm: instead of running cases sequentially, perform all permutations simultaneously
- Can still suffer from I/O bottlenecks!
Data Parallelism
- Algorithms that can be decomposed into distinct work units, e.g. mesh-based solvers, linear algebra
- Each compute element only works with its local data; may require sending boundary data, with issues around synchronization and load balancing
- Susceptible to large amounts of communications overhead, which impacts scaling performance
Task Parallelism
- Algorithms that can be decomposed into distinct tasks which can be operated on in parallel, e.g. a real-time signal correlator: one thread does I/O, one does message passing, and other threads process data
- The communication strategy may not be as straightforward or structured as in data parallelism
Important Consideration
Don't over-complicate things (the KISS rule):
- avoid implementing parallelism unnecessarily (use the best tool for the job)
- complexity may result in slower performance, worse scalability, and an increased likelihood of job failures
- a simple, modular solution is elegant (e.g. mathematical proofs) and easy to understand and maintain
The Parallel Development Process
Initially one must profile and understand the code. If P isn't ~1, is parallelization worthwhile? A major code re-design may be necessary to implement parallelism: data structures may have to change, and new methods of solving problems may have to be implemented. Once these changes are made, and for more straightforward problems, the parallelization process follows the same cycle used to optimize a program.
The Parallel Development Cycle
1. Identify hot-spots (functions/routines that use the most cycles); this requires profiling or instrumenting the code with timers
2. Modify the code to parallelize the hot-spots; not all operations can be parallelized
3. Check for accuracy and correctness; debug any program errors that are introduced
4. Tune for performance, and repeat the cycle as necessary
OpenMP: Open Multi-Processing
- Programmed by utilizing special compiler directives and functions in the OpenMP runtime library
- In simple cases (loop-level parallelism), one doesn't have to make any code modifications, only add these directives
- The compiler will only consider the directives if it is told to do so at compile time
- It's easy to incrementally add parallelization!
OpenMP
- Blocks of code that are to be executed in parallel are considered parallel regions, and must have only one entry and one exit point
- Threaded programs are executed in the same fashion as serial programs
- Use the OMP_NUM_THREADS environment variable to set the number of threads, or else hardwire it in your code
Data Scoping
Scoping refers to how different variables should be accessed by the threads. The basic scoping types are:
- private: each thread gets a copy of the variable, stored on its private stack; values in private variables are not returned to the parent process
- shared: all threads access the same variable; if variables are read-only, it is safe to declare them as shared
- reduction: like private, but the value is reduced (sum, max, min, etc.) and returned to the parent process
Simple OpenMP Example

    #pragma omp parallel default(shared) private(i,x) reduction(+:pi_int)
    {
    #ifdef _OPENMP
        printf("thread %d of %d started\n",
               omp_get_thread_num(), omp_get_num_threads());
    #endif
        #pragma omp for
        for (i = 1; i <= n; ++i) {
            x = h * ( i - 0.5 );        // center of interval
            pi_int += 4.0 / ( 1.0 + pow(x,2) );
        }
    }
MPI: Message Passing Interface
- Programmed using functions/routines in the MPI library
- Requires more work than implementing OpenMP: involves new data structures and increased algorithmic complexity
- Portable: runs on both shared and distributed memory
- MPI programs are executed on top of an underlying MPI framework; typically each computing element (network node) runs an MPI daemon that facilitates the passing of messages
- Usually executed with a special mpirun command, which takes the number of processes to be used as an argument
MPI: The 6 Basic Functions
MPI essentially boils down to only 6 different functions:
- MPI_Init(&argc, &argv)
- MPI_Comm_size(MPI_COMM_WORLD, &numprocs)
- MPI_Comm_rank(MPI_COMM_WORLD, &rank)
- MPI_Send(buffer, count, MPI_TYPE, to_rank, tag, mpi_comm)
- MPI_Recv(buffer, count, MPI_TYPE, from_rank, tag, mpi_comm, &status)
- MPI_Finalize()
All other communication patterns fundamentally consist of sends and receives.
MPI Hello World

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char *argv[]) {
        int numprocs, myrank;
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
        printf("Hello World from rank %d of %d\n", myrank, numprocs);
        MPI_Finalize();
        return 0;
    }
Parallel Libraries
Many libraries use threading and MPI internally:
- Intel MKL, AMD ACML, etc.: threaded FFTs, linear algebra, random number generators
- FFTW: threaded and MPI-parallel FFT routines
- ScaLAPACK: MPI routines for linear algebra
These can relieve a great deal of the burden of implementing parallel programming: use them! Be careful not to oversubscribe processors by calling threaded library routines from inside parallel regions.
Debugging
- Commercial parallel debuggers (SHARCNET supports DDT) can provide an efficient way to debug problems with MPI code
- Most contemporary debuggers support debugging threaded applications
- One can also run a separate debugger for each MPI process, which is sometimes helpful for capturing stack traces, etc.
- If all else fails, use a lot of print statements
- Many MPI problems result from incorrect synchronization, e.g. deadlocking (a blocking receive posted with no matching send)
- Beware of libraries that are not thread-safe