It's a Multicore World. John Urbanic, Pittsburgh Supercomputing Center, Parallel Computing Scientist


1 It's a Multicore World. John Urbanic, Pittsburgh Supercomputing Center, Parallel Computing Scientist

2 Waiting for Moore's Law to save your serial code started getting bleak in 2004. Source: published SPECInt data

3 Moore's Law is not at all dead. Intel process technology capabilities: High Volume Manufacturing feature sizes of 90nm, 65nm, 45nm, 32nm, 22nm, 16nm, 11nm, and 8nm, with integration capacity (billions of transistors) climbing at each step. For scale, a transistor from the 90nm process (Source: Intel) is shown next to an influenza virus (Source: CDC).

4 That Power and Clock Inflection Point in 2004 didn't get better. Fun fact: at 100+ Watts and <1 V, currents are beginning to exceed 100 A at the point of load! Source: Kogge and Shalf, IEEE CISE. Courtesy Horst Simon, LBNL

5 Not a new problem, just a new scale. Cray-2 with cooling tower in foreground, circa 1985

6 How to get the same number of transistors to give us more performance without cranking up the power? The key is that performance scales roughly with the square root of core area: a small core at 1/4 the area of a big core draws about 1/4 the power yet still delivers about 1/2 the performance. Trading one big core (plus cache) for several small cores (C1, C2, C3, C4, plus cache) therefore buys more total performance from the same transistor and power budget.
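A back-of-the-envelope sketch of this tradeoff in C, assuming the slide's rule of thumb that performance scales roughly with the square root of core area while power scales with area; the relative units are illustrative, not measurements.

/* Big core vs. small cores at the same power, assuming perf ~ sqrt(area)
 * and power ~ area.  Relative units only. */
#include <stdio.h>
#include <math.h>

int main(void) {
    double big_perf  = sqrt(1.0);            /* one big core: area 1, perf 1, power 1 */
    double big_power = 1.0;

    double small_area  = 0.25;               /* a small core at 1/4 the area          */
    double small_perf  = sqrt(small_area);   /* ~1/2 the performance                  */
    double small_power = small_area;         /* ~1/4 the power                        */

    printf("1 big core   : perf %.2f  power %.2f\n", big_perf, big_power);
    printf("1 small core : perf %.2f  power %.2f\n", small_perf, small_power);
    printf("4 small cores: perf %.2f  power %.2f (if the work parallelizes)\n",
           4 * small_perf, 4 * small_power);
    return 0;
}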

7 And how to get more performance from more transistors with the same power.
Rule of thumb: a 15% reduction in voltage yields roughly a 15% frequency reduction, a 45% power reduction, and a 10% performance reduction.
Single core: Area = 1, Voltage = 1, Freq = 1, Power = 1, Perf = 1
Dual core: Area = 2, Voltage = 0.85, Freq = 0.85, Power = 1, Perf = ~1.8
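A minimal sketch of that arithmetic, using the slide's rule-of-thumb 45% power and 10% performance reductions per slowed-down core (dynamic power scales roughly as V^2 * f, which is where the steep power saving comes from):

/* Two cores at 0.85x voltage/frequency vs. one core at full speed.
 * The 45% and 10% figures are the slide's rule of thumb, not measurements. */
#include <stdio.h>

int main(void) {
    double power_per_core = 1.0 - 0.45;   /* ~45% power reduction per core */
    double perf_per_core  = 1.0 - 0.10;   /* ~10% performance reduction    */

    printf("dual-core power: %.2f x single core\n", 2 * power_per_core); /* ~1.1x */
    printf("dual-core perf : %.2f x single core\n", 2 * perf_per_core);  /* ~1.8x */
    return 0;
}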

8 Single Socket Parallelism: vector width, SP FLOPs per core per cycle, and core count multiply into FLOPs per cycle per socket.
Pentium III (1999): SSE, 128-bit
Pentium IV (2001): SSE, 128-bit
Core (2006): SSE, 128-bit
Nehalem (2008): SSE, 128-bit
Sandybridge (2011): AVX, 256-bit
Haswell (2013): AVX, 256-bit
KNC (2012): 512-bit vectors
KNL (2016): AVX-512, 512-bit
Skylake (2017): AVX-512, 512-bit
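To see how the per-socket figures are assembled, here is a hedged sketch: peak single-precision FLOP/s is cores x SP vector lanes x FLOPs per lane per cycle x clock. The core count, clock, and AVX2/FMA details below are illustrative assumptions (roughly Haswell-class), not values from the table.

/* Theoretical peak SP FLOP/s for one socket from first principles.
 * All hardware parameters below are assumptions for illustration. */
#include <stdio.h>

int main(void) {
    int    cores          = 18;               /* cores per socket (assumed)    */
    int    vector_bits    = 256;              /* AVX2-style 256-bit vectors    */
    int    sp_lanes       = vector_bits / 32; /* 8 single-precision lanes      */
    int    flops_per_lane = 4;                /* 2 FMA units x 2 FLOPs per FMA */
    double clock_ghz      = 2.3;

    double flops_per_cycle = (double)cores * sp_lanes * flops_per_lane;
    printf("SP FLOPs/cycle per socket: %.0f\n", flops_per_cycle);
    printf("theoretical peak: %.0f SP GFLOP/s\n", flops_per_cycle * clock_ghz);
    return 0;
}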

9 Prototypical Application: Serial Weather Model (a single processor and its MEMORY)

10 First Parallel Weather Modeling Algorithm: Richardson in 1917 Courtesy John Burkhardt, Virginia Tech

11 Weather Model: Shared Memory (OpenMP). Four cores sharing one MEMORY: four meteorologists in the same room sharing the map.
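A minimal OpenMP sketch of this picture, with all threads sharing one set of arrays in a single address space; the grid size and smoothing update are placeholders, not a real weather model.

/* Shared-memory version: compile with e.g. gcc -fopenmp.  Each thread
 * updates its own slice of the shared grid. */
#include <stdio.h>
#include <omp.h>

#define NX 1000

int main(void) {
    static double t_old[NX], t_new[NX];      /* one shared "map" in memory    */

    #pragma omp parallel for                 /* split the loop across threads */
    for (int i = 1; i < NX - 1; i++)
        t_new[i] = 0.5 * t_old[i] + 0.25 * (t_old[i - 1] + t_old[i + 1]);

    printf("updated with up to %d threads\n", omp_get_max_threads());
    return 0;
}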

12 Weather Model: Accelerator (OpenACC). A GPU across the PCI Bus: one meteorologist coordinating 1000 savants using tin cans and a string.
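The same kind of loop offloaded with OpenACC, as a sketch of the accelerator picture: the data clauses move the arrays across the PCI bus and the GPU's many threads do the arithmetic. The compiler flags (e.g. nvc -acc) and the update rule are assumptions for illustration.

/* Accelerator version: the loop body runs on the GPU. */
#include <stdio.h>

#define NX 1000

int main(void) {
    static double t_old[NX], t_new[NX];

    #pragma acc parallel loop copyin(t_old) copy(t_new)   /* data over the PCI bus */
    for (int i = 1; i < NX - 1; i++)
        t_new[i] = 0.5 * t_old[i] + 0.25 * (t_old[i - 1] + t_old[i + 1]);

    printf("t_new[%d] = %f\n", NX / 2, t_new[NX / 2]);
    return 0;
}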

13 Weather Model: Distributed Memory (MPI). Each processor with its own MEMORY: 50 meteorologists using telegraphs.
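A minimal MPI sketch of the distributed picture: each rank owns its own memory, and the "telegraph" is an explicit message, here a halo-style exchange of one boundary value with each neighbor. Illustrative only, assuming the usual mpicc/mpirun toolchain.

/* Distributed-memory version: every rank has a private copy of its data
 * and must exchange boundary values by message. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double my_edge = (double)rank;           /* boundary value this rank owns */
    double right_edge = -1.0, left_edge = -1.0;

    if (rank + 1 < size)                     /* exchange with right neighbor  */
        MPI_Sendrecv(&my_edge, 1, MPI_DOUBLE, rank + 1, 0,
                     &right_edge, 1, MPI_DOUBLE, rank + 1, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    if (rank > 0)                            /* exchange with left neighbor   */
        MPI_Sendrecv(&my_edge, 1, MPI_DOUBLE, rank - 1, 0,
                     &left_edge, 1, MPI_DOUBLE, rank - 1, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d of %d: left %.0f right %.0f\n", rank, size, left_edge, right_edge);
    MPI_Finalize();
    return 0;
}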

14 The pieces fit like this: MPI between nodes, OpenMP across the cores within a node, and OpenACC out to the accelerator.

15 Many Levels and Types of Parallelism
Vector (SIMD) and Instruction Level (ILP): instruction pipelining, superscalar (multiple instruction units), out-of-order execution, register renaming, speculative execution, branch prediction. Compiler (not your problem).
Multi-Core (Threads) and SMP/Multi-socket: OpenMP
Accelerators (GPU, MIC): OpenACC
Clusters and MPPs: MPI
Also important: ASIC/FPGA/DSP, RAID/IO

16 Cores, Nodes, Processors, PEs? The most unambiguous way to refer to the smallest useful computing device is as a Processing Element, or PE. This is usually the same as a single core. Processors usually have more than one core, as per the previous list. Node commonly refers to an actual physical unit, most commonly a circuit board or blade with a network connection; these often have multiple processors. I will try to use the term PE consistently here, but I may slip up myself. Get used to it, as you will quite often hear all of the above terms used interchangeably where they shouldn't be.

17 Shared-Memory Processing at Extreme Scale
Programming: OpenMP, Pthreads, Shmem
Examples: all multi-socket motherboards; SGI UV (Blacklight!): Intel Xeon 8-core processors linked by the UV interconnect, 4096 cores sharing 32 TB of memory. As big as it gets right now.

18 MPPs (Massively Parallel Processors)
Distributed memory at the largest scale; often shared memory at a lower level.
Titan (ORNL): AMD Opteron 6274 processors (Interlagos), 560,640 cores, Gemini interconnect (3-D torus), accelerated node design using NVIDIA multi-core accelerators, 20+ PFlops peak system performance.
Sequoia (LLNL): IBM Blue Gene/Q, 98,304 compute nodes, 1.6 million processor cores, 1.6 PB of memory, 17.2 petaflops Rmax and 20.1 petaflops Rpeak.

19 GPU Architecture - GK110 Kepler From a document you should read if you are interested in this: Architecture-Whitepaper.pdf

20 Top 10 Systems as of November 2017 (cores; Rmax and Rpeak in Tflops; power in MW)
1. Sunway TaihuLight, National Supercomputing Center in Wuxi, China (NRCPC). Sunway SW26010 260C 1.45 GHz. 10,649,600 cores; Rmax 93,015; Rpeak 125,436; 15.4 MW. OpenACC is a first class API!
2. Tianhe-2 (MilkyWay-2), National Super Computer Center in Guangzhou, China (NUDT). Intel Xeon E5-2692 2.2 GHz, TH Express-2, Intel Xeon Phi 31S1P. 3,120,000 cores; Rmax 33,862; Rpeak 54,902; 17.8 MW.
3. Piz Daint, Swiss National Supercomputing Centre (CSCS), Switzerland (Cray). Cray XC50, Xeon E5-2690v3 2.6 GHz, Aries, NVIDIA P100. 361,760 cores; Rmax 19,590; Rpeak 25,326; 2.3 MW.
4. Gyoukou, Japan Agency for Marine-Earth Science and Technology, Japan (ExaScaler). Xeon D-1571 1.3 GHz, Infiniband EDR, PEZY-SC2. 19,860,000 cores; Rmax 19,135; Rpeak 28,192; 1.4 MW.
5. Titan, DOE/SC/Oak Ridge National Laboratory, United States (Cray). Cray XK7, Opteron 6274 2.2 GHz, Gemini, NVIDIA K20x. 560,640 cores; Rmax 17,590; Rpeak 27,113; 8.2 MW.
6. Sequoia, DOE/NNSA/LLNL, United States (IBM). BlueGene/Q, Power BQC 1.6 GHz, custom interconnect. 1,572,864 cores; Rmax 17,173; Rpeak 20,133; 7.9 MW.
7. Trinity, DOE/NNSA/LANL/SNL, United States (Cray). Cray XC40, Xeon E5-2698v3 2.3 GHz, Aries, Intel Xeon Phi 7250. 979,968 cores; Rmax 14,137; Rpeak 43,903; 3.8 MW.
8. Cori, DOE/SC/LBNL/NERSC, United States (Cray). Cray XC40, Aries, Intel Xeon Phi 7250. 622,336 cores; Rmax 14,014; Rpeak 27,881; 3.9 MW.
9. Oakforest-PACS, Joint Center for Advanced High Performance Computing, Japan (Fujitsu). PRIMERGY, Intel Omni-Path, Intel Xeon Phi 7250. 556,104 cores; Rmax 13,554; Rpeak 24,913; 2.7 MW.
10. K Computer, RIKEN Advanced Institute for Computational Science, Japan (Fujitsu). SPARC64 VIIIfx 2.0 GHz, Tofu interconnect. 705,024 cores; Rmax 10,510; Rpeak 11,280; 12.7 MW.

21 Projected Performance Development: Top500 performance over time (Sum, N=1, and N=500 trend lines), from 100 Mflop/s up toward 1 Eflop/s. Courtesy Horst Simon, LBNL

22 Trends with ends. Source: Kogge and Shalf, IEEE CISE Courtesy Horst Simon, LBNL

23 Amdahl's Law
If there is x% of serial component, speedup cannot be better than 100/x.
If you decompose a problem into many parts, then the parallel time cannot be less than the largest of the parts.
If the critical path through a computation is T, you cannot complete in less time than T, no matter how many processors you use.
Amdahl's law used to be cited by the knowledgeable as a limitation. These days it is mostly raised by the uninformed. Massive scaling is commonplace: science literature, the web (map-reduce everywhere), data centers (Spark, etc.), machine learning (GPUs and others).
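A small sketch of the first bullet's formula: with serial fraction s, speedup on N PEs is 1 / (s + (1 - s)/N), which can never exceed 1/s (that is, 100/x for an x% serial component). The 5% serial fraction is just an example value.

/* Amdahl's law speedup for a few PE counts. */
#include <stdio.h>

int main(void) {
    double s = 0.05;                              /* 5% serial component (example) */
    int npes[] = { 1, 10, 100, 1000, 10000 };

    for (int i = 0; i < 5; i++)
        printf("N = %5d  speedup = %7.2f  (limit %.0f)\n",
               npes[i], 1.0 / (s + (1.0 - s) / npes[i]), 1.0 / s);
    return 0;
}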

24 In Conclusion: OpenACC, MPI, OpenMP
