Chapter 2 Parallel Hardware


Benjamin Hodge
5 years ago

Part I. Preliminaries
  Chapter 1. What Is Parallel Computing?
  Chapter 2. Parallel Hardware
  Chapter 3. Parallel Software
  Chapter 4. Parallel Applications
  Chapter 5. Supercomputers
Part II. Tightly Coupled Multicore
Part III. Loosely Coupled Cluster
Part IV. GPU Acceleration
Part V. Map-Reduce
Appendices

BIG CPU, BIG DATA

Parallel computing hardware is characterized along three dimensions (Figure 2.1): nodes, cores, and accelerators. Different parallel computers fall at different locations along these three dimensions.

A node refers to an independent computer with its own CPU chip or chips, its own main memory, and its own network interface. A parallel computer can consist of a single node, or of multiple nodes. A multinode parallel computer is called a cluster.

A core refers to the hardware on a node that executes a stream of machine instructions. A node can have a single core, or multiple cores. Each core can also be hyperthreaded, meaning that the core can execute more than one stream of machine instructions.

In addition to the CPU, a node can have zero or more accelerators. These are separate processors that can perform computations alongside the CPU. Several kinds of accelerators are available:

A graphics processing unit (GPU) accelerator typically has a large number of cores that are mainly intended for graphics rendering but that can do general calculations.

A manycore accelerator is similar to a general purpose CPU, except it typically has more cores than a CPU, and it omits all the peripheral interface circuitry (disk, network, display, USB, and so on) not needed for doing pure calculation.

A tensor processing unit (TPU) accelerator has a large number of cores designed to do tensor (multidimensional array) calculations at very high speed. Tensor calculations are at the heart of many large-scale machine intelligence algorithms, such as deep neural networks and convolutional neural networks.

A field programmable gate array (FPGA) accelerator is a digital chip whose logic circuitry can be reconfigured during operation, letting you create customized high-speed processors.

An application specific integrated circuit (ASIC) accelerator uses specialized chips hardwired to perform specific computations at very high speed. One example is bitcoin mining (bitcoin.org). You can buy ASICs that compute the SHA-256 cryptographic hash function, part of the bitcoin protocol, at very high speed.

Having examined each hardware dimension by itself, let's look at examples of parallel computers and see where they fall along the dimensions.
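
As a rough illustration of the three dimensions described above, the following Python sketch records where a machine falls along each one. The class name `ParallelComputer`, its field names, and the wording of the labels are assumptions made for this example; they do not come from the book or from any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelComputer:
    """Hypothetical record placing a machine along the three hardware dimensions."""
    nodes: int = 1                  # more than one node means a cluster
    cores_per_node: int = 1         # cores that execute machine instructions
    threads_per_core: int = 1       # more than one means hyperthreaded cores
    accelerators_per_node: int = 0  # GPUs or other accelerators beside the CPU

    def describe(self) -> str:
        """Return a label such as 'multinode, multicore, accelerated'."""
        node_dim = "single node" if self.nodes == 1 else "multinode"
        core_dim = "single core" if self.cores_per_node == 1 else "multicore"
        if self.threads_per_core > 1:
            core_dim += " hyperthreaded"
        accel_dim = "accelerated" if self.accelerators_per_node > 0 else "unaccelerated"
        return f"{node_dim}, {core_dim}, {accel_dim}"


# Illustrative machines only; the counts are made up for the example.
for machine in (
    ParallelComputer(),
    ParallelComputer(cores_per_node=4, threads_per_core=2),
    ParallelComputer(nodes=8, cores_per_node=16, accelerators_per_node=2),
):
    print(machine.describe())
```

Each call to `describe()` yields one label per dimension, which is how the example computers in the rest of this chapter are named.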


Figure 2.1. Hardware dimensions of the parallel landscape

Figure 2.2. Single node single core computer

Single-Core Computer

Most computers back in the twentieth century were single node, single core, and unaccelerated (Figure 2.2). In such a computer, the core has an instruction unit that reads the program's machine instructions from memory, decodes them, and executes them; a number of functional units, such as adders, shifters, multipliers, and so on, that carry out the machine instructions; and a number of high-speed registers that hold intermediate results. With one instruction unit, the core can execute only one stream of instructions (one thread) at a time. To execute more than one thread, the computer's operating system must do a context switch from one thread to another every so often.

The computer has a main memory that stores program instructions and data. Because the main memory's circuitry is much slower than the core's circuitry, a level 2 (L2) cache sits between the core and the main memory.

The L2 cache is faster than the main memory, but smaller (typically a few megabytes). When the core reads a memory location into the instruction unit or into a register, an entire cache line, typically 64 or 128 bytes, is fetched from main memory and stored in the L2 cache; from there, the requested data goes into the core. If the core reads an adjacent memory location, the data is already in the L2 cache (a cache hit) and can be read quickly without waiting for the slow main memory.

Still, the L2 cache's circuitry is not as fast as the core's circuitry; so another cache, the level 1 (L1) cache, sits between the core and the L2 cache. The L1 cache is nearly as fast as the core but is smaller than the L2 cache. On its way to the core, data read from main memory is cached in L2 and in L1. If the instructions and data the program is accessing can fit entirely in L2, or better still L1, the core can run at nearly full speed. Otherwise, the core will have to pause while new data is fetched from the slow main memory, reducing the processing speed.

To keep the core running as fast as possible, the L1 and L2 caches are made as large as possible while still fitting on the CPU chip along with the rest of the core circuitry. Often, a large fraction of a CPU chip's area is devoted to the caches. As CPU circuit speeds continued to increase relative to main memory circuit speeds, additional levels of cache were added between L2 and main memory. A modern CPU might have three or even four levels of cache (L3, L4).

Multicore Computer

To increase the processing power while not increasing the clock frequency, computers switched from single core to multicore (Figure 2.3). Now there are two or more cores, each with its own instruction unit, functional units, registers, and L1 cache. The cores all share the L2 cache and the main memory. The operating system can run multiple threads simultaneously, one in each core, without needing to context switch.
The threads are running truly in parallel. Theoretically, a K-core computer ought to run K times faster than a single-core computer. (This does not always happen in practice, though.)

Hyperthreaded Computer

Replicating the instruction unit, functional units, registers, and L1 cache to get multiple cores requires a lot of transistors and chip area. To run more threads without needing quite so much area, multicore computers became hyperthreaded (Figure 2.4). In a hyperthreaded core, the instruction units are replicated, but not the functional units, registers, or L1 cache. A dual-hyperthreaded core can run two threads simultaneously without needing to context switch.

There's a catch, though. As long as the two threads are using different functional units and registers, both threads can run at full speed. But if the threads both need to use the same functional unit or register at the same time, the hardware makes the threads take turns. While one thread is accessing the functional unit or register, the other thread stalls. This reduces the effective rate at which the threads can execute instructions. In my experience, a dual-hyperthreaded core typically does not run as fast as two regular cores; but it does run faster than one regular core.

Figure 2.3. Single node multicore computer

Figure 2.4. Single node multicore hyperthreaded computer

Multicore Accelerated Computer

Lately, parallel computers are including accelerators alongside the multicore hyperthreaded CPUs. A GPU accelerator (Figure 2.5) repurposes a graphics card to do general calculations. A GPU card has numerous cores, anywhere from dozens to thousands of them, as well as its own main memory. The GPU's main memory is linked with the CPU's main memory over a high-speed bus, allowing data to be transferred between the CPU and GPU. Standard GPU programming libraries, such as Nvidia Corporation's CUDA and the vendor-independent OpenCL, make it almost as easy to write GPU programs as it is to write CPU programs.

Both the CPU and the GPU have to deal with their main memory's large latency (access time) relative to the high speed of their cores. The CPU deals with it by interposing L1 and L2 caches between the cores and the memory, devoting much of the chip area to cache, leaving room for only a few cores. The GPU deals with it by reducing or even eliminating the cache, devoting the bulk of the chip area to a large number of cores. The GPU then runs large numbers of threads simultaneously on its cores. When one thread stalls waiting to access main memory, another thread instantly takes its place. There are enough threads so that whenever a core goes idle, a thread is available that has completed its previous memory access and is ready to run again. In this way, the cores stay busy all the time. This technique is called latency hiding.
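
To see why having many threads keeps a core busy, here is a minimal Python sketch of latency hiding on a single simulated core. It is a toy model and not a description of how a real GPU schedules threads: the function name `core_utilization`, the one-cycle compute step, and the fixed memory latency are all assumptions made for the illustration.

```python
def core_utilization(num_threads: int, memory_latency: int, work_items: int = 1000) -> float:
    """Simulate one core and return the fraction of cycles it spends computing.

    Toy model (an assumption, not real GPU behavior): each thread repeatedly
    computes for one cycle, then waits `memory_latency` cycles for its next
    memory access to complete before it can compute again.
    """
    if num_threads < 1 or work_items < 1:
        raise ValueError("num_threads and work_items must be at least 1")

    ready_at = [0] * num_threads            # cycle at which each thread can run again
    remaining = [work_items] * num_threads  # compute steps each thread still has to do
    cycle = 0
    busy_cycles = 0

    while any(remaining):
        # Find a thread that still has work and whose memory access has completed.
        runnable = [t for t in range(num_threads)
                    if remaining[t] and ready_at[t] <= cycle]
        if runnable:
            t = runnable[0]
            remaining[t] -= 1
            ready_at[t] = cycle + 1 + memory_latency  # thread now stalls on memory
            busy_cycles += 1
        # If no thread is runnable, the core sits idle for this cycle.
        cycle += 1

    return busy_cycles / cycle


for threads in (1, 5, 10):
    print(threads, "threads:", round(core_utilization(threads, memory_latency=9), 3))
```

In this model, with a latency of nine cycles, a single thread keeps the core busy only about a tenth of the time, while ten threads are enough to keep it busy on every cycle.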
Depending on the nature of the problem, a GPU can perform calculations dozens or hundreds of times faster than a CPU. With GPUs incorporated into just about every modern computer, it's important to know how to write GPU parallel programs, to take advantage of the GPU's additional processing power. BIG CPU, BIG DATA is intended to teach you how to write parallel programs for GPU accelerated parallel computers. (I'm not going to cover manycore, TPU, FPGA, or ASIC accelerators in this book.)

Figure 2.5. Single node multicore hyperthreaded GPU accelerated computer

Single-Core Cluster Computer

Turning away from single-node parallel computers, we come to multinode parallel computers, or clusters (Figure 2.6). The cluster has some number of backend nodes that carry out computations. Each node has (or had, back in the 1980s and 1990s) a single core plus its own main memory. The cluster is said to have a distributed memory; the cluster's memory is distributed across the nodes instead of concentrated in a single node. As we will see, the distributed memory has profound implications for the design of cluster parallel programs.

The cluster has a dedicated high-speed backend network that allows the backend nodes to communicate with each other. The backend network might use commodity Ethernet hardware, or it might use specialized faster technology such as InfiniBand, Myrinet, or Scalable Coherent Interface (SCI), or it might use a proprietary interconnect. The cluster usually also has a frontend node, connected to the Internet to let users log in and run parallel programs on the cluster, and connected to the backend network to control and monitor the backend nodes.

Figure 2.6. Multinode single core cluster computer

Multicore Cluster Computer

However, nowadays it's virtually impossible to buy a single-core node. So a modern cluster parallel computer consists of multiple multicore backend nodes (Figure 2.7). The cores might or might not be hyperthreaded.

Multicore Accelerated Cluster Computer

In addition, the nodes of a modern cluster parallel computer might include accelerators (Figure 2.8). Summit, the world's fastest supercomputer at the time of writing, looks like Figure 2.8; it has 4,356 nodes, each with 44 CPU cores (on IBM Power9 chips) and six Nvidia Tesla V100 GPU/TPU accelerators, and it uses a high-speed InfiniBand interconnect for its backend network.

Points to Remember

Parallel computing hardware can be characterized along three dimensions: single node or multinode; single core, multicore, or hyperthreaded; and unaccelerated or accelerated.

A parallel computer can consist of a single node, or multiple nodes (a cluster).

Nowadays, each node of a parallel computer has multiple cores, permitting multiple threads to run simultaneously.

Each node of a parallel computer might also have one or more GPU accelerators with many cores, permitting even more threads to run simultaneously.

Several other kinds of accelerators are also available, including manycore accelerators, TPUs, FPGAs, and ASICs.
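
As a small worked check of the Summit figures quoted above, the per-node counts can be multiplied out to cluster-wide totals. The helper name `cluster_totals` is hypothetical; the input numbers are the ones given in the text.

```python
def cluster_totals(nodes: int, cpu_cores_per_node: int, accelerators_per_node: int):
    """Return (total CPU cores, total accelerators) for a cluster of identical nodes."""
    return nodes * cpu_cores_per_node, nodes * accelerators_per_node


# Figures from the text: 4,356 nodes, 44 CPU cores and six accelerators per node.
total_cores, total_accelerators = cluster_totals(4356, 44, 6)
print(f"{total_cores:,} CPU cores and {total_accelerators:,} accelerators")
```

With those inputs, the totals come to 191,664 CPU cores and 26,136 accelerators.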

Figure 2.7. Multinode multicore cluster computer

Figure 2.8. Multinode multicore GPU accelerated cluster computer

