Introduction to parallel computers and parallel programming
Content
- A quick overview of modern parallel hardware
  - Parallelism within a chip: pipelining, superscalar execution, SIMD, multiple cores
  - Parallelism within a node: multiple sockets, UMA vs. NUMA
  - Parallelism across multiple nodes
- A very quick overview of parallel programming
Note: The textbook [Quinn 2004] is outdated on parallel hardware; please read instead https://computing.llnl.gov/tutorials/parallel_comp/
First things first
- CPU (central processing unit) is the brain of a computer
- The CPU processes instructions, many of which require data transfers from/to the computer's memory
- A CPU integrates many components (registers, FPUs, caches, ...)
- A CPU has a clock, which at each clock cycle synchronizes the logic units within the CPU to process instructions
An example of a CPU
[Figure: block diagram of an Intel Xeon Woodcrest CPU (core)]
Instruction pipelining
- Suppose every instruction has five stages, each needing one clock cycle
- Without instruction pipelining, a new instruction can start only after the previous one has finished all five stages
- With instruction pipelining, the stages of consecutive instructions overlap; once the pipeline is full, one instruction completes per cycle
Superscalar execution
- Multiple execution units allow more than one instruction to finish per cycle
- An enhanced form of instruction-level parallelism
Data
- Data are stored in computer memory as sequences of 0s and 1s
- Each 0 or 1 occupies one bit; 8 bits constitute one byte
- Normally, in the C language:
  char: 1 byte
  int: 4 bytes
  float: 4 bytes
  double: 8 bytes
- Bandwidth (the speed of data transfer) is measured as the number of bytes transferred per second
SIMD
- SISD: single instruction stream, single data stream
- SIMD: single instruction stream, multiple data streams
An example of a floating-point unit
[Figure: the FP unit of a Xeon CPU]
Multicore processor
- Modern hardware technology can put several independent CPUs (called cores) on the same chip: a multicore processor
[Figure: Intel Xeon Nehalem quad-core processor]
Multi-threading
- Modern CPU cores often have threading capability: hardware support for executing multiple threads within a core
- However, the threads have to share the resources of the core:
  - computing units
  - caches
  - translation lookaside buffer (TLB)
Vector processor
- Another approach, different from multicore (and multi-threading)
- Massive SIMD
- Vector registers
- Direct pipes into main memory with high bandwidth
- Used to dominate high-performance computing hardware, but is now only a niche technology
Multi-socket
- Socket: a connection on the motherboard into which a processor is plugged
- Modern computers often have several sockets, each holding a multicore processor
- Example: Nehalem-EP (2 sockets, quad-core CPUs, 8 cores in total)
- For historical reasons, the OS maps each core to its own concept of a "CPU"
Shared memory
- Shared memory: all CPU cores can access all memory as one global address space
- Such a machine is traditionally called a multiprocessor
- Two common definitions of SMP (depending on context):
  1. SMP (shared-memory processing): multiple CPUs (or CPU cores) share access, not necessarily at the same access speed, to the entire memory available on a single system
  2. SMP (symmetric multi-processing): multiple CPUs (or CPU cores) have the same speed and latency to a shared memory
UMA
- UMA: uniform memory access, one type of shared memory
- Another name for symmetric multi-processing
[Figure: dual-socket Xeon Clovertown CPUs]
NUMA
- NUMA: non-uniform memory access, another type of shared memory
- Several symmetric multi-processing units are linked together
- Each core should, as much as possible, access its closest memory unit
[Figure: dual-socket Xeon Nehalem CPUs]
Cache coherence
- Important for shared-memory systems
- If one CPU core updates a value in its private cache, all the other cores must be made aware of the update
- Cache coherence is accomplished by hardware
Competition among the cores
- Within a multi-socket multicore computer, some resources are shared
- Within a socket, the cores share the last-level cache
- The memory bandwidth is also shared to a great extent

  # cores:   1          2          4          6          8
  BW:        3.42 GB/s  4.56 GB/s  4.57 GB/s  4.32 GB/s  5.28 GB/s

  Actual memory bandwidth measured on a 2-socket quad-core Xeon Harpertown
Distributed memory
- The entire memory consists of several disjoint parts, with a communication network in between
- There is no single global memory space
- A CPU (core) can directly access its own local memory, but cannot directly access a remote memory
- A distributed-memory machine is traditionally called a multicomputer
Comparing shared memory and distributed memory
Shared memory:
- User-friendly programming
- Data sharing between processors
- Not cost effective
- Synchronization needed
Distributed memory:
- Memory is scalable with the number of processors
- Cost effective
- Programmer responsible for data communication
Hybrid memory system
[Figure: several compute nodes, each containing multicore CPUs with caches sharing a memory over a bus, connected to each other by an interconnect network]
Different ways of parallel programming
- Threads model using OpenMP
  - Easy to program (inserting a few OpenMP directives)
  - Parallelism happens "behind the scenes" (little user control)
  - Difficult to scale to many CPUs (NUMA, cache coherence)
- Message-passing model
  - Many programming details (MPI or PVM)
  - Better user control (data & work decomposition)
  - Larger systems and better performance
- Stream-based programming (for using GPUs)
- Some special parallel languages: Co-Array Fortran, Unified Parallel C, Titanium
- Hybrid parallel programming
Designing parallel programs
- Determine whether or not the problem is parallelizable
- Identify the hotspots: where are most of the computations? Parallelization should focus on the hotspots
- Partition the problem
- Insert collaboration (unless the problem is embarrassingly parallel)
Partitioning
- Break the problem into chunks
- Domain decomposition (data decomposition)
- Functional decomposition
Examples of domain decomposition
[Figures: examples of decomposing a data domain among processes]
Collaboration
- Communication
  - The overhead depends on both the number and the size of messages
  - Overlap communication with computation, if possible
  - Different types of communication (one-to-one, collective)
- Synchronization
  - Barrier
  - Lock & semaphore
  - Synchronous communication operations
Load balancing
- Objective: minimize idle time
- Important for parallel performance
- Equal partitioning of work (and/or data)
- Dynamic work assignment may be necessary
Granularity
- Computations are typically separated from communications by synchronization events
- Granularity: the ratio of computation to communication
- Fine-grain parallelism
  - Individual tasks are relatively small
  - More overhead incurred
  - Might be easier for load balancing
- Coarse-grain parallelism
  - Individual tasks are relatively large
  - Advantageous for performance due to lower overhead
  - Might be harder for load balancing
Exercises
- Problem 1 of INF3380 written exam V10
- Problem 1 of INF3380 written exam V11
- Exercise 1.5 from the textbook [Quinn 2004]
- Denote by T_pipe the time, found in the above exercise, required to process n tasks using an m-stage pipeline, and denote by T_seq the time needed to process the n tasks without pipelining. How large should n be in order to get a speedup of p?
- Exercise 1.10 from the textbook [Quinn 2004]
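One possible way to set up the T_pipe exercise, assuming each of the m stages takes one cycle of length $\tau$ (a sketch of the approach, not the official solution):

```latex
\[
T_{\text{seq}} = n\,m\,\tau, \qquad
T_{\text{pipe}} = (m + n - 1)\,\tau, \qquad
S(n) = \frac{T_{\text{seq}}}{T_{\text{pipe}}} = \frac{n\,m}{m + n - 1}.
\]
Requiring $S(n) \ge p$ (which is possible only when $p < m$) gives
\[
n\,m \ge p\,(m + n - 1)
\;\Longrightarrow\;
n \ge \frac{p\,(m - 1)}{m - p}.
\]
```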