Overview. 2 Introduction to parallel computing. The control structure. Parallel computers


Overview

2 Introduction to parallel computing
Robert Mullins

Parallel computing platforms
  Approaches to building parallel computers
  Today's chip-multiprocessor architectures
Approaches to parallel programming
  Programming with threads and shared memory
  Message-passing libraries
  PGAS languages
  High-level parallel languages

Parallel computers

How might we exploit multiple processing elements and memories in order to complete a large computation quickly?
  How many processing elements, how powerful?
  How do they communicate and cooperate?
  How are memories and processing elements interconnected?
  How is the memory hierarchy organised?
How might we program such a machine?

The control structure

How are the processing elements controlled?
  Centrally from a single control unit, or can they work independently?
Flynn's taxonomy:
  Single Instruction Multiple Data (SIMD)
  Multiple Instruction Multiple Data (MIMD)


The control structure

SIMD
  The scalar pipelines execute in lockstep
  Data-independent logic is shared
  Efficient for highly data parallel applications
  Much simpler instruction fetch and supply mechanism
  SIMD hardware can support a SPMD model if the individual threads follow similar control flow
  Masked execution

[Figure: A Generic Streaming Multiprocessor (for graphics applications). Reproduced from "Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow", W. W. L. Fung et al.]

Other possible organisations:
  Dataflow
  Systolic array
  Single-Program Multiple-Data (SPMD)
    Same program runs on all processors
    Commonly seen in MPI (message-passing) or PGAS programs

The communication model

A clear distinction is made between two common communication models:

1. Shared-address-space platforms
  All processors have access to a shared data space accessed via a shared address space
  All communication takes place via a shared memory
  Each processing element may also have an area of memory that is private

2. Message-passing platforms
  Each processing element has its own exclusive address space
  Communication is achieved by sending explicit messages between processing elements
  The sending and receiving of messages can be used both to communicate between and to synchronize the actions of multiple processing elements
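The masked execution mentioned for SIMD hardware can be illustrated in plain C. The sketch below emulates a small SIMD unit running a data-dependent branch: every lane steps through both sides of the branch, and a per-lane mask decides which lanes commit a result. The lane count `LANES` and the function name `simd_abs_or_double` are hypothetical and chosen only for illustration; real hardware does this in the pipeline, not in software loops.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define LANES 4  /* hypothetical SIMD width, for illustration only */

/* Emulates one SIMD unit executing, in lockstep on all lanes:
 *     if (x < 0) x = -x; else x = 2 * x;
 * Both sides of the branch are issued to every lane; the mask
 * selects which lanes are allowed to commit each side. */
void simd_abs_or_double(int x[LANES])
{
    bool mask[LANES];

    assert(x != NULL);

    /* Step 1: every lane evaluates the branch condition. */
    for (int lane = 0; lane < LANES; lane++)
        mask[lane] = (x[lane] < 0);

    /* Step 2: "then" side, committed only where the mask is set. */
    for (int lane = 0; lane < LANES; lane++)
        if (mask[lane])
            x[lane] = -x[lane];

    /* Step 3: "else" side, committed only where the mask is clear. */
    for (int lane = 0; lane < LANES; lane++)
        if (!mask[lane])
            x[lane] = 2 * x[lane];
}
```

When lanes disagree on the branch, the unit spends time on both sides, which is why SIMD hardware supports an SPMD model efficiently only if the threads follow similar control flow.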


[Figure: Multi-core SMP multiprocessor. Figure courtesy of Tim Harris, MSR]

[Figure: NUMA multiprocessor. Figure courtesy of Tim Harris, MSR]

Message-passing platforms

Many early message-passing machines provided hardware primitives that were close to the send/receive user-level communication commands
  e.g. a pair of processors may be interconnected with a hardware FIFO queue
The network topology restricted which processors could be named in a send or receive operation (e.g. only neighbours could communicate in a mesh network)

[Culler, Figure 1.22]


Message-passing platforms

The Transputer (1984)
  The result of an earlier foray into the world of parallel computing!
  Transputer contained integrated serial links for building multiprocessors
  IN/OUT instructions in ISA for sending and receiving messages
  Programmed in OCCAM (based on CSP)
IBM Victor V256 (1991)
  16x16 array of transputers
  The processors could be partitioned dynamically between different users

Message-passing platforms

Recently some chip multiprocessors have taken a similar approach (RAW/Tilera and XMOS)
  Message queues (or communication channels) may be register mapped or accessed via special instructions
  The processor stalls when reading an empty input queue or when trying to write to a full output buffer

[Figure: A wireless application mapped to the RAW processor. Data is streamed from one core to another over a statically scheduled network. Network input and output is register mapped.]

(See also the iWarp paper on wiki)

Message-passing platforms

For larger message-passing machines (typically scientific supercomputers) direct FIFO designs were soon replaced by designs that built message-passing upon remote memory copies (supported by DMA or a more general communication assist processor)
The interconnection networks also became more powerful, supporting the automatic routing of messages between arbitrary nodes
  No restrictions on programmer or software support required
Hardware and software evolution meant there was a general convergence of parallel machine organisations

Message-passing platforms

The most fundamental communication primitives in a message-passing machine are synchronous send and receive operations
  Here data movement must be specified at both ends of the communication; this is known as two-sided communication, e.g. MPI_Send and MPI_Recv*
Non-blocking versions of send and receive are also often provided to allow computation and communication to be overlapped

*Message Passing Interface (MPI) is a portable message-passing system that is supported by a very wide range of parallel machines.
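The stall-on-empty and stall-on-full behaviour of the message queues described above can be mimicked in software. The sketch below is a minimal analogue, not MPI and not any real machine's interface: two POSIX threads stand in for processing elements, and a bounded FIFO built from a mutex and two condition variables stands in for the hardware queue. The names `channel`, `chan_send`, `chan_recv`, `run_stream_demo` and the depth `CHAN_CAPACITY` are all hypothetical.

```c
#include <assert.h>
#include <pthread.h>

#define CHAN_CAPACITY 4  /* hypothetical queue depth */

typedef struct {
    int buf[CHAN_CAPACITY];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
} channel;

void chan_init(channel *c)
{
    c->head = c->tail = c->count = 0;
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->not_empty, NULL);
    pthread_cond_init(&c->not_full, NULL);
}

/* Blocks while the queue is full (the analogue of a stalled writer). */
void chan_send(channel *c, int value)
{
    pthread_mutex_lock(&c->lock);
    while (c->count == CHAN_CAPACITY)
        pthread_cond_wait(&c->not_full, &c->lock);
    c->buf[c->tail] = value;
    c->tail = (c->tail + 1) % CHAN_CAPACITY;
    c->count++;
    pthread_cond_signal(&c->not_empty);
    pthread_mutex_unlock(&c->lock);
}

/* Blocks while the queue is empty (the analogue of a stalled reader). */
int chan_recv(channel *c)
{
    pthread_mutex_lock(&c->lock);
    while (c->count == 0)
        pthread_cond_wait(&c->not_empty, &c->lock);
    int value = c->buf[c->head];
    c->head = (c->head + 1) % CHAN_CAPACITY;
    c->count--;
    pthread_cond_signal(&c->not_full);
    pthread_mutex_unlock(&c->lock);
    return value;
}

/* Producer "processing element": streams the values 1..10. */
static void *producer(void *arg)
{
    channel *c = arg;
    for (int i = 1; i <= 10; i++)
        chan_send(c, i);
    return NULL;
}

/* Runs the producer against a consumer and returns the sum received. */
int run_stream_demo(void)
{
    channel c;
    pthread_t t;
    int sum = 0;

    chan_init(&c);
    int rc = pthread_create(&t, NULL, producer, &c);
    assert(rc == 0);
    for (int i = 0; i < 10; i++)
        sum += chan_recv(&c);
    pthread_join(t, NULL);
    return sum;
}
```

Because the queue holds only four entries, the producer is forced to wait for the consumer, so the exchange both moves data and synchronizes the two sides.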


One-sided communication

SHMEM
  Provides routines to access the memory of a remote processing element without any assistance from the remote process, e.g.:
    shmem_put (target_addr, source_addr, length, remote_pe)
    shmem_get, shmem_barrier etc.
One-sided communication may be used to reduce synchronization, simplify programming and reduce data movement

The communication model

From a hardware perspective we would like to keep the machine simple (message-passing)
But we inevitably need to simplify the programmer's and compiler's task
  Efficiently support shared-memory programming
  Add support for transactional memory?
  Create a simple but high-performance target
Trade-offs between hardware complexity and complexity of software and compiler.

Today's chip multiprocessors

Intel Nehalem-EX (2009)
  8 cores
    2-way hyperthreaded (SMT)
    16 hardware threads
  L1I 32KB, L1D 32KB
  256KB L2 (private)
  24MB L3 (shared)
    8 banks
    Inclusive L3

[Figure: Nehalem-EX memory hierarchy: L1, L2, shared L3, memory]


Today's chip multiprocessors

IBM Power 7 (2010)
  8 core (dual-chip module to hold 16 cores)
  32MB shared eDRAM L3 cache
  2-channel DDR3 controllers
  Individual cores
    4-thread SMT per core
    6 ops/cycle
    4GHz

Today's chip multiprocessors

Sun Niagara T1 (2005)
  Each core has its own level 1 cache (16KB for instructions, 8KB for data).
  The level 2 caches are 3MB in total and are effectively 12-way associative. They are interleaved by 64-byte cache lines.

Oracle M7 Processor (2014)
  32 core
    Dual-issue, OOO
    Dynamic multithreading, 1-8 threads/core
  256KB I&D L2 caches shared by groups of 4 cores
  64MB L3
  Technology: 20nm, 13 metal layers
  16 DDR channels
    160GB/s (vs. ~20GB/s for T1)
  >10B transistors!

Manycore designs: Tilera

Tilera (now Mellanox)
  Evolution of MIT RAW
  100 cores
    Grid of identical tiles
    Low-power 3-way VLIW cores
    Cores interconnected by a selection of static and dynamic on-chip networks

Manycore designs: Celerity (2017)

Tiered Accelerator Fabric
  General-purpose tier: 5 Rocket RISC-V cores
  Massively parallel tier: 496 5-stage RISC-V cores, 16x31 tiled mesh array
  Specialised tier: Binarized Neural Network accelerator

GPUs

TESLA P100
  56 streaming multiprocessors x 64 cores = 3584 cores or lanes
  732GB/s memory bandwidth
  4MB L2 cache
  15.3 billion transistors

The NVIDIA GeForce 8800 GPU, Hot Chips

Communication latencies

Chip multiprocessor
  Some have very fast core-to-core communication, as low as 1-3 cycles
  Opportunities to add dedicated core-to-core links
  Typical L1-to-L1 communication latencies may be around 10-100 cycles
Other types of parallel machine:
  Shared memory multiprocessor ~500
  Cluster/supercomputer ~


Approaches to parallel programming

Principles of Parallel Programming, Calvin Lin and Lawrence Snyder, Pearson, 2009
  This book provides a good overview of the different approaches to parallel programming
There is also a significant amount of information on the course wiki
Try some examples!

Approaches to parallel programming

Programming with threads and shared memory
Message-passing libraries
PGAS languages
High-level parallel languages

A thread, or thread of execution, is a unit of parallelism
  It consists of everything necessary to execute a sequential stream of instructions: program code, a call stack, set of registers (incl. a single program counter)
  It shares memory with other threads
Threads cooperate and coordinate their actions by reading and writing to shared variables
Special atomic operations are provided by the multiprocessor for synchronization

How might we express threads in our code?

fork/join

Fork/Join keywords can appear anywhere in code
General, but unstructured
A forked procedure runs in parallel with main thread

    p1
    fork(p5)    ; start p5 in parallel
    p2
    fork(p3)
    p4
    join(p5)    ; wait for p5 to complete
    p6
    join(p3)
    p7
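The fork/join sequence above maps fairly directly onto POSIX threads, with `pthread_create` playing the role of fork and `pthread_join` the role of join. The sketch below is one possible rendering under stated assumptions: the procedures p1..p7 do no real work and simply record that they ran, and the helper names `run_proc`, `position_of` and `run_graph` are hypothetical.

```c
#include <assert.h>
#include <pthread.h>

static int log_buf[7];   /* order in which procedures ran */
static int log_len = 0;
static pthread_mutex_t log_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for real work: records that procedure `id` ran. */
static void run_proc(int id)
{
    pthread_mutex_lock(&log_lock);
    log_buf[log_len++] = id;
    pthread_mutex_unlock(&log_lock);
}

static void *proc_thread(void *arg)
{
    run_proc(*(int *)arg);
    return NULL;
}

/* Returns the position at which procedure `id` was logged, or -1. */
int position_of(int id)
{
    for (int i = 0; i < log_len; i++)
        if (log_buf[i] == id)
            return i;
    return -1;
}

void run_graph(void)
{
    pthread_t t5, t3;
    int id5 = 5, id3 = 3;
    int rc;

    log_len = 0;
    run_proc(1);                                         /* p1       */
    rc = pthread_create(&t5, NULL, proc_thread, &id5);   /* fork(p5) */
    assert(rc == 0);
    run_proc(2);                                         /* p2       */
    rc = pthread_create(&t3, NULL, proc_thread, &id3);   /* fork(p3) */
    assert(rc == 0);
    run_proc(4);                                         /* p4       */
    pthread_join(t5, NULL);                              /* join(p5) */
    run_proc(6);                                         /* p6       */
    pthread_join(t3, NULL);                              /* join(p3) */
    run_proc(7);                                         /* p7       */
}
```

The recorded order varies from run to run, but the joins guarantee that p5 is logged before p6 and p3 before p7, which is exactly the dependency structure the fork/join listing expresses.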

fork/join using the pthreads library

    #include <pthread.h>

    /* assumed layout of thread_args; not shown on the slide */
    typedef struct { int input; int output; } thread_args;

    void *thread_func(void *ptr) {
        int i = ((thread_args *) ptr)->input;
        ((thread_args *) ptr)->output = fib(i);
        return NULL;
    }

    args.input = n - 1;
    // create and start first thread
    status = pthread_create(&thread, NULL, thread_func, (void *) &args);
    // calc. fib(n - 2) in parallel
    result = fib(n - 2);
    // join
    pthread_join(thread, NULL);

Limitations to bare metal thread programming?

parbegin/parend (cobegin/coend)

Simple and structured, but not as general as fork/join, e.g. we cannot represent the graph on the previous slide.
Even though parbegin..parend can only represent properly nested dependency graphs, it is usually adequate

    p1
    parbegin
      p5
      begin
        p2
        parbegin
          p3
          p4
        parend
      end
    parend
    p6
    p7

Cilk style spawn/sync

spawn: indicates that the procedure call can safely proceed in parallel
sync: wait until all previously spawned procedures have returned their results

    cilk int fib (int n) {
      if (n < 2) return n;
      else {
        int x, y;
        x = spawn fib (n - 1);
        y = spawn fib (n - 2);
        sync;
        return (x+y);
      }
    }

forall (doall, parfor)

Simply allows a programmer to indicate that each iteration of the loop is independent and may be run in parallel

OpenMP example:

    #pragma omp parallel for
    for (i=first; i<n; i+=prime) marked[i]=1;
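The OpenMP loop above can be wrapped into a small self-contained function. This is a sketch under assumptions: the function name `mark_multiples` and its parameter types are hypothetical, and the surrounding sieve is not shown. The loop is safe to parallelise because each iteration writes a different element of `marked`.

```c
#include <assert.h>
#include <stddef.h>

/* Sets marked[i] = 1 for i = first, first + prime, first + 2*prime, ...
 * while i < n. Iterations touch distinct elements, so they are
 * independent and may run in parallel. */
void mark_multiples(char *marked, long first, long n, long prime)
{
    assert(marked != NULL && prime > 0);

    #pragma omp parallel for
    for (long i = first; i < n; i += prime)
        marked[i] = 1;
}
```

With GCC or Clang the pragma takes effect when compiling with `-fopenmp`; without that flag the pragma is ignored and the loop runs sequentially with the same result.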

Futures

Future <expr>
  Evaluate the expression concurrently with calling program. An asynchronous function call
  If a thread requires the value of a future that has not been computed, stall the thread until it is available

    y = future (fn(x))
    z = y + 1;

The incremental garbage collection of processes, Baker/Hewitt

Synchronization and coordination

In addition to creating threads, we also need to be able to control the way threads interact.
Often involves identifying critical sections
Mechanisms
  Locks and barriers
  Mutexes and monitors
  Condition Variables (wait/signal)
  Transactional memory
See reading group papers and examples

Message-passing

Simple (perhaps primitive) programming model
Programmer must distribute and explicitly move data
The fact that the interactions are explicit can be seen as both an advantage and a disadvantage
Potentially simple hardware implementation
Processes communicate and synchronize by sending messages
Message Passing Interface (MPI) standard
  Widely used on High-Performance Computing (HPC) platforms
  Programs tend to be portable
  Usually written in a Single-Program Multiple-Data (SPMD) style

PGAS languages

Partitioned Global Address Space Languages
Aimed at large-scale distributed memory machines
Aim to improve on MPI
PGAS languages overlay a global address space on the virtual memories of the distributed machines
  No expectation that memories will be coherent
  The programmer distinguishes between local and non-local data
  The compiler generates the necessary communication calls in response to non-local references
  Compiler exploits one-sided communication primitives rather than message-passing
Co-Array Fortran, Unified Parallel C, Titanium (Ti) (Titanium extends Java)
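The future construct described above (`y = future(fn(x)); z = y + 1`) can be approximated with POSIX threads: starting the future creates a thread, and asking for its value joins that thread, which stalls the caller until the result exists. This is a minimal sketch, not a general futures library; the names `future`, `future_start`, `future_get`, `square` and `future_demo` are hypothetical, and each future is assumed to be read exactly once.

```c
#include <assert.h>
#include <pthread.h>

typedef struct {
    pthread_t thread;
    int (*fn)(int);   /* function being evaluated      */
    int arg;          /* its argument                  */
    int value;        /* result, valid after the join  */
} future;

static void *future_run(void *p)
{
    future *f = p;
    f->value = f->fn(f->arg);
    return NULL;
}

/* Starts evaluating fn(arg) concurrently with the caller. */
void future_start(future *f, int (*fn)(int), int arg)
{
    f->fn = fn;
    f->arg = arg;
    int rc = pthread_create(&f->thread, NULL, future_run, f);
    assert(rc == 0);
}

/* Stalls the caller until the value is available, then returns it.
 * Must be called exactly once per started future. */
int future_get(future *f)
{
    pthread_join(f->thread, NULL);
    return f->value;
}

/* Hypothetical stand-in for fn(x). */
int square(int x)
{
    return x * x;
}

int future_demo(int x)
{
    future y;
    future_start(&y, square, x);   /* y = future(fn(x))             */
    /* the caller is free to do unrelated work here */
    return future_get(&y) + 1;     /* z = y + 1, stalls if needed   */
}
```

For example, `future_demo(6)` evaluates `square(6)` on another thread and returns 37 once the result has been joined.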

High-level parallel languages

Global view of computation
  Raise level of abstraction
  Hide low-level details of communication and synchronization
  Take a global view and describe the algorithm rather than per-task behavior
    e.g. ZPL forces programmer to think in parallel style using array operations (reference to neighboring elements, flood, remap, reduction,...)
  Compiler, runtime and libraries will manage implementation details
Interesting examples:
  ZPL: array programming language
  NESL, Data Parallel Haskell (see wiki)
See also Cray Chapel, IBM X10, Sun Fortress languages (DARPA HPCS project)

2 Introduction to parallel computing. Chip Multiprocessors (ACS MPhil) Robert Mullins
