Chapter 5 Supercomputers


Part I. Preliminaries
  Chapter 1. What Is Parallel Computing?
  Chapter 2. Parallel Hardware
  Chapter 3. Parallel Software
  Chapter 4. Parallel Applications
  Chapter 5. Supercomputers
Part II. Tightly Coupled Multicore
Part III. Loosely Coupled Cluster
Part IV. GPU Acceleration
Part V. Map-Reduce
Appendices

Many readers of this book will never write a large-scale parallel scientific or engineering application that needs to run on a supercomputer to achieve acceptable performance. However, it's important to understand how supercomputer performance is evaluated. Therein lie lessons pertinent to any parallel application.

Since 1993, the Top500 List has been publishing a list of the 500 fastest supercomputers in the world. Top500 invites anyone who owns a supercomputer to run a standard benchmark: the Highly Parallel Computing Benchmark. This is one program from the Linpack Benchmark suite of programs. The program solves a system of simultaneous linear equations, expressed in matrix form as Mx = b, where M is a dense matrix (one where most or all of the elements are nonzero). The size of the matrix (the number of simultaneous equations) is chosen to be as large as possible while still fitting in the computer's memory. Given the size of the matrix, the number of 64-bit floating point operations (additions, multiplications, reciprocals, and so on) needed to solve the matrix equation is determined. The program is executed on the supercomputer, and the program's running time is measured. The performance metric is the rate at which the program executes floating point operations, calculated as the total number of floating point operations divided by the running time in seconds. The metric's units are floating point operations per second, or flops.

Supercomputer owners all over the world run the Highly Parallel Computing Benchmark on their supercomputers and send the measured flops metrics, along with information about their machines, to Top500. Twice a year, in June and November, Top500 publishes a list of the top 500 supercomputers: the 500 fastest supercomputers, in descending order of flops. Supercomputers nowadays are so fast that their performance is expressed in teraflops or petaflops rather than just flops. One teraflops equals one trillion flops, or 10^12 flops. One petaflops equals 10^15 flops. Here are the top five supercomputers on the June 2018 Top500 List, along with their speeds on the Highly Parallel Computing Benchmark:

1. Summit, United States: 122.3 petaflops
2. Sunway TaihuLight, China: 93.0 petaflops
3. Sierra, United States: 71.6 petaflops
4. Tianhe-2A, China: 61.4 petaflops
5. AI Bridging Cloud Infrastructure (AIBCI), Japan: 19.9 petaflops

For comparison, a desktop PC's floating point performance is in the single gigaflops range (10^9 flops), about 100 million times slower than Summit.

Besides flops rates, Top500 publishes information about the supercomputers themselves: the number of cores, the CPU chips, the accelerators if any, the backend network, and so on.
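To make the flops metric concrete, here is a minimal sketch of the calculation in Python. It assumes the standard operation count of (2/3)n^3 + 2n^2 for solving an n-by-n dense system by LU factorization; the matrix size and running time below are made-up illustrative values, not measurements from any actual machine.

    # Sketch of the Top500 flops metric. Assumes the standard
    # (2/3)n^3 + 2n^2 operation count for an LU-based dense solve;
    # n and seconds are made-up illustrative values.

    def linpack_flops(n, seconds):
        """Flops rate for solving an n-by-n dense system in `seconds`."""
        operations = (2.0 / 3.0) * n ** 3 + 2.0 * n ** 2
        return operations / seconds

    rate = linpack_flops(n=10_000_000, seconds=3600.0)
    print(f"{rate / 1e15:.1f} petaflops")   # about 185.2 petaflops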

Among the machines themselves there is a wide variety. Considering just the top five supercomputers, for example:

- The number of cores ranges from 391,680 (AIBCI) to 2,282,544 (Summit).
- Summit, Sierra, Tianhe-2A, and AIBCI use commodity CPU chips; Sunway TaihuLight uses proprietary CPU chips.
- Summit, Sierra, and AIBCI use GPU accelerators; Sunway TaihuLight and Tianhe-2A do not use accelerators.
- Summit, Sierra, and AIBCI use a commodity backend network; Sunway TaihuLight and Tianhe-2A use a proprietary backend network.

The only thing these five supercomputers have in common is that they are very large clusters of multicore nodes. In fact, all the supercomputers on the Top500 List are variously sized clusters of multicore nodes.

The Top500 List ranks supercomputers based on their performance on solving dense systems of linear equations. Such calculations were typical of scientific and engineering applications in previous decades. In recent years, however, other kinds of calculations have become prevalent. A supercomputer that executes dense matrix calculations quickly will not necessarily execute other kinds of algorithms at the same speed. To measure supercomputer performance on these newer calculations, other benchmarks and their associated top lists have arisen.

The Graph500 List operates like the Top500 List, except it uses different benchmarks and a different performance metric. The Graph500 benchmarks are programs that calculate with graphs, rather than dense matrices. A graph is a collection of vertices and edges connecting pairs of vertices. Graphs are often used in big data analytics applications, such as social network analytics. For example, Facebook maintains an enormous graph where the vertices are Facebook users and the edges are Facebook friend relationships. Facebook wants to know who your friends are, who the friends of your friends are, who the friends of the friends of your friends are, and so on, so they can recommend new friends for you, thereby (once you friend them) increasing the number of posts in your news feed, increasing the number of ads they show you, and increasing their ad revenue, their ultimate goal. Graph models are also used to study metabolic pathways in organisms, the spread of infectious disease through a population, the spread of malware through computer networks, and other scientific applications.

Graphs can be represented as matrices. But unlike the dense matrices in the Top500 benchmark, graph matrices are typically large and sparse. A graph's matrix might have billions of rows and columns, but the nonzero elements might be only a minuscule fraction of the total elements; almost all of the elements are zero. It makes no sense to allocate storage for all those zero elements; only the nonzero elements would be held in the computer's memory.
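As a sketch of what storing only the nonzero elements can look like, here is a tiny compressed sparse row (CSR) style adjacency structure in Python. The four-vertex graph is invented for illustration; real Graph500 inputs have billions of vertices.

    # Compressed sparse row (CSR) style storage for a graph's adjacency
    # matrix: only the nonzero elements (the edges) are stored.
    # Tiny invented graph: edges 0-1, 0-2, 1-2, 2-3 (undirected, so
    # each edge appears in both endpoints' neighbor lists).

    row_start = [0, 2, 4, 7, 8]            # where each vertex's neighbors begin
    adjacent = [1, 2, 0, 2, 0, 1, 3, 2]    # concatenated neighbor lists

    def neighbors(v):
        """The vertices adjacent to vertex v."""
        return adjacent[row_start[v]:row_start[v + 1]]

    print(neighbors(2))   # prints [0, 1, 3]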

But this means that algorithms that operate on graphs are fundamentally different from algorithms that operate on dense matrices. Compared to a matrix program, a graph program spends little or no time on floating point operations. Rather, the bulk of the time consists of traversing the graph: going from one vertex to another along the edges.

The Graph500 benchmark consists of three programs: a program that generates a very large, sparse graph; a program that does a breadth first traversal from selected vertices in that graph; and a program that finds the shortest paths through the graph from selected vertices to every other vertex. (Facebook's analysis mentioned above is based on a breadth first traversal.) The programs are executed on the supercomputer, the programs count the number of edges traversed during execution, and the programs' running times are measured. The performance metric is the rate at which the programs traverse edges, calculated as the total number of edge traversals divided by the running time in seconds. The metric's units are traversed edges per second, or teps.

Since 2010, the Graph500 List has been published twice a year, at the same time as the Top500 List. Here are the top five supercomputers on the June 2018 Graph500 List, along with their speeds on the breadth first search benchmark. The speeds are in units of terateps (10^12 teps).

1. K Computer, Japan: 38.6 terateps
2. Sunway TaihuLight, China: 23.8 terateps
3. Sequoia, United States: 23.8 terateps
4. Mira, United States: 15.0 terateps
5. JUQUEEN, Germany: 5.8 terateps

Note that four of the top five supercomputers on the Top500 List are not among the top five on the Graph500 List. A computer that does dense matrix calculations quickly does not necessarily do graph calculations quickly, and vice versa.
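Here is a minimal sketch, in Python, of how a teps figure is derived from a breadth first traversal: count every edge scanned, then divide by the running time. It is a plain serial BFS, not the actual Graph500 benchmark code; `neighbors` is any function that returns a vertex's adjacent vertices, such as the CSR lookup sketched earlier.

    # Minimal serial sketch of the teps metric: a breadth first traversal
    # that counts traversed edges and divides by elapsed time. Not the
    # actual Graph500 code, which runs in parallel on enormous graphs.
    import time
    from collections import deque

    def bfs_teps(neighbors, source, n_vertices):
        visited = [False] * n_vertices
        visited[source] = True
        queue = deque([source])
        edges_traversed = 0
        start = time.perf_counter()
        while queue:
            v = queue.popleft()
            for w in neighbors(v):
                edges_traversed += 1        # every edge scan counts
                if not visited[w]:
                    visited[w] = True
                    queue.append(w)
        elapsed = time.perf_counter() - start
        return edges_traversed / elapsed    # traversed edges per second

Calling bfs_teps(neighbors, 0, 4) on the toy CSR graph above returns a meaninglessly large figure, since the whole graph fits in cache; the metric only becomes interesting on graphs far too large for cache, for the reasons discussed next.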

Why are the top five dense matrix benchmark rates so much faster than the top five graph benchmark rates: petaflops versus only terateps? Much of the difference arises because dense matrix programs tend to be cache friendly, while graph programs tend not to be. A dense matrix's elements are stored in adjacent locations in the computer's main memory, and a dense matrix program tends to access the matrix's elements in the order they are stored in memory. When the program reads a certain element, the entire cache line containing that element is loaded from main memory into the L2 and L1 caches. When the program reads the next element (in an adjacent memory location), much of the time that element is already present in the caches, and so the element can be retrieved from the caches at nearly the full speed of the CPU. In other words, most of the memory accesses in a dense matrix program are fast cache hits.

While a sparse matrix's elements (the nonzero elements) are still stored in adjacent memory locations, a graph program tends not to access the matrix's elements in the order they are stored in memory. Rather, when the graph program traverses an edge from one vertex (matrix element) to another, the target element is often not in an adjacent memory location, and therefore is often not in the cache. The graph program continually experiences cache misses, forcing elements to be loaded from the slow main memory rather than the cache. As a result, the CPU has to spend much of its time waiting for data to arrive from the main memory, leading to much smaller teps rates compared to flops rates. In the context of data-intensive applications like graph analytics, folks have started to refer to the memory wall as a prominent killer of supercomputer performance. It is as though the CPU is a car speeding down a race track, but every 50 feet it runs into a brick wall. The car is not going to finish the race very quickly.

The Top500 List measures supercomputer performance on solving linear systems expressed as dense matrices. Another category of scientific and engineering computation works with partial differential equations (PDEs). Fluid dynamics problems, such as aircraft design, weather forecasting, and climate modeling, are commonly expressed as PDEs. PDE solving programs perform intensive floating point calculations (like dense matrix programs), but on sparse matrices (like graph programs). To rank supercomputer speeds on these kinds of computations, a new top list based on a new benchmark, the High Performance Conjugate Gradient (HPCG) Benchmark, has been published twice a year since 2014. Conjugate gradient refers to a particular technique for solving a PDE. The HPCG benchmark program measures supercomputer performance in flops, like the Top500 benchmark. Here are the top five supercomputers on the June 2018 HPCG List, along with their speeds on the HPCG benchmark in petaflops.

1. Summit, United States: 2.93 petaflops
2. Sierra, United States: 1.80 petaflops
3. K Computer, Japan: 0.60 petaflops
4. Trinity, United States: 0.55 petaflops
5. Piz Daint, Switzerland: 0.49 petaflops

The HPCG flops rates are considerably slower than the Top500 flops rates, again largely due to the memory wall encountered with sparse matrix computations.
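The memory wall can be glimpsed even on a desktop PC. The sketch below sums the same array twice, once in storage order and once in a shuffled order; the second pass tends to run noticeably slower because of cache misses. The exact ratio is machine dependent, and pure Python interpreter overhead mutes the effect compared to what a C program would show.

    # Same work, different access patterns: sequential order is cache
    # friendly, shuffled order is not. Timings are machine dependent.
    import random
    import time

    N = 10_000_000
    data = list(range(N))
    seq_order = list(range(N))
    rnd_order = seq_order[:]
    random.shuffle(rnd_order)       # same indices, cache-hostile order

    def timed_sum(order):
        start = time.perf_counter()
        total = 0
        for i in order:
            total += data[i]
        return time.perf_counter() - start

    print("sequential:", timed_sum(seq_order), "seconds")
    print("shuffled:  ", timed_sum(rnd_order), "seconds")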

Besides raw computational speed, folks are becoming increasingly concerned with supercomputers' energy efficiency. When Summit runs the Top500 benchmark program at a rate of 122.3 petaflops, it consumes electrical energy at a rate of 8.8 megawatts, equivalent to the energy consumption of nearly seven thousand average United States households. The fossil fuels burned to generate the electricity for a supercomputer release carbon into the atmosphere, contributing to global warming. The supercomputer's cooling system pumps heat into the environment, also contributing to global warming. Folks might be willing to trade off a slower computation rate, leading to an increased time to finish a program, to gain a reduction in energy usage. From this point of view, the important question is: How many floating point operations can a supercomputer perform for every unit of energy consumed? A higher number indicates a more energy efficient supercomputer. To quantify energy efficiency, the machine's computation rate in flops is divided by its energy consumption rate in watts, yielding an energy efficiency metric in units of flops per watt.

Since 2013, Top500 has published an additional list, the Green500 List, which ranks supercomputers based on energy efficiency. The benchmark is the same as the Top500 List's, the solution of a dense system of linear equations, but the metric is flops per watt. Here are the top five supercomputers on the June 2018 Green500 List, along with their energy efficiencies in gigaflops per watt (10^9 flops/watt):

1. Shoubu System B, Japan: 18.4 gigaflops/watt
2. Suiren-2, Japan: 16.8 gigaflops/watt
3. Sakura, Japan: 16.7 gigaflops/watt
4. DGX SaturnV Volta, United States: 15.1 gigaflops/watt (1.07 petaflops)
5. Summit, United States: 13.9 gigaflops/watt (122.3 petaflops)

Only one machine, Summit, is in the top five both for raw computational speed and for energy efficiency. Summit, number one on the Top500 List, is number five on the Green500 List. Shoubu System B, number one on the Green500 List, is number 359 on the Top500 List. The fastest supercomputer is not the most energy efficient supercomputer, and vice versa.
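As a quick check that the metric is just a ratio, Summit's Green500 figure follows directly from the two Summit numbers quoted above:

    # Energy efficiency = computation rate / power draw.
    flops = 122.3e15                 # 122.3 petaflops
    watts = 8.8e6                    # 8.8 megawatts
    print(flops / watts / 1e9)       # about 13.9 gigaflops per watt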

Descending from the rarefied heights of supercomputer performance, we can discern three lessons for anyone writing parallel programs.

First lesson: Performance matters. It's not enough merely to write a program that solves the computational problem. The program must also get the answer in as little time as possible. Certain software design choices might lead to smaller running times, even when running on a parallel computer. Other design choices might lead to larger running times; such design choices are to be avoided. In later chapters we will see examples of how different design choices affect program performance.

This leads to the second lesson: Assess performance using measured running time data from actual programs. We will be doing exactly that in subsequent chapters. Insights gained from performance measurements can guide design choices, and we will see examples of those as well.

Third lesson: Memory matters. Both the amount of memory a program requires and the pattern in which the program accesses the memory affect the program's performance. To the extent practical, data structures that use less memory and fewer CPU cycles are preferable. Later chapters will include examples of memory-lean data structures.

Now we're ready to start learning how to write parallel programs.

Points to Remember

- Various benchmark programs are used to assess supercomputer performance on different kinds of calculations.
- Various metrics are used to assess supercomputers, including floating point operations per second (flops), traversed edges per second (teps), and flops per watt.
- The Top500 List rates supercomputer performance in flops on dense matrix calculations. All of the Top500 supercomputers are clusters of multicore nodes.
- The Graph500 List rates supercomputer performance in teps on graph algorithms.
- The HPCG List rates supercomputer performance in flops on partial differential equation (PDE) solving programs.
- The Green500 List rates supercomputer energy efficiency in flops per watt on dense matrix calculations.
- Performance matters.
- Assess performance using measured running time data from actual programs.
- Memory matters.

