It's a Multicore World. John Urbanic, Pittsburgh Supercomputing Center

1 It's a Multicore World. John Urbanic, Pittsburgh Supercomputing Center

2 Waiting for Moore's Law to save your serial code started getting bleak in 2004. Source: published SPECInt data

3 Moore's Law is not at all dead. Intel process technology capabilities: High Volume Manufacturing; Feature Size 90nm, 65nm, 45nm, 32nm, 22nm, 16nm, 11nm, 8nm; Integration Capacity (Billions of Transistors). [Figure: transistor for the 90nm process (Source: Intel) shown next to an influenza virus (Source: CDC)]

4 That Power and Clock Inflection Point in 2004 didn't get better. Source: Kogge and Shalf, IEEE CISE. Courtesy Horst Simon, LBNL

5 Not a new problem, just a new scale. [Figure: CPU power (W)] Cray-2 with cooling tower in foreground, circa 1985

6 How to get the same number of transistors to give us more performance without cranking up power?
Key is that Performance ∝ √area.
Small core relative to the big core: Power = ¼, Performance = 1/2
[Figure: one big core with cache compared with several small cores (C1, C2, C3, ...) with cache, with power and performance bars]
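The small-core figures on this slide (power ¼, performance 1/2) can be checked with a short sketch. It assumes the square-root relation those numbers imply, that power scales linearly with core area, and that the workload is perfectly parallel. The function name and the choice of four small cores are illustrative, not taken from the slide.

```python
import math

def core_performance(area):
    """Relative performance of a core, assuming performance ~ sqrt(area)."""
    return math.sqrt(area)

# Baseline: one big core with area 1 (so power 1 and performance 1).
big_perf = core_performance(1.0)

# Assumption: split the same area into four small cores of area 1/4 each,
# with power proportional to area.
small_area = 0.25
small_perf = core_performance(small_area)   # per-core performance
total_power = 4 * small_area                # same power budget as the big core
total_perf = 4 * small_perf                 # only if the work is fully parallel

print(f"small core: power = {small_area}, performance = {small_perf}")
print(f"four small cores: power = {total_power}, performance = {total_perf}")
```

Under these assumptions the four small cores fit the big core's power budget while doubling throughput, which is the argument for many simple cores, provided the code can actually use them.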

7 And how to get more performance from more transistors with the same power.
RULE OF THUMB: A 15% Reduction In Voltage Yields
Frequency Reduction: 15% | Power Reduction: 45% | Performance Reduction: 10%
SINGLE CORE: Area = 1 | Voltage = 1 | Freq = 1 | Power = 1 | Perf = 1
DUAL CORE: Area = 2 | Voltage = 0.85 | Freq = 0.85 | Power = 1 | Perf = ~1.8
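The dual-core numbers follow directly from the rule of thumb. A minimal sketch of the arithmetic, using only the slide's percentages; the function name is illustrative and the workload is assumed to be perfectly parallel:

```python
# Rule-of-thumb factors from the slide: a 15% voltage reduction costs each
# core about 45% of its power and about 10% of its performance.
POWER_FACTOR = 1.0 - 0.45
PERF_FACTOR = 1.0 - 0.10

def scaled_chip(cores, power_factor=POWER_FACTOR, perf_factor=PERF_FACTOR):
    """Return (power, performance) relative to one full-voltage core.

    Assumes every core runs at the reduced voltage and that the work
    parallelizes perfectly across cores.
    """
    return cores * power_factor, cores * perf_factor

dual_power, dual_perf = scaled_chip(2)
print(f"dual core: power = {dual_power:.2f}, perf = {dual_perf:.2f}")
```

Two down-volted cores land at roughly the original power (1.1 here, rounded to 1 on the slide) while delivering about 1.8 times the performance.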

8 Cores, Nodes, Processors, PEs? The most unambiguous way to refer to the smallest useful computing device is as a Processing Element, or PE. This is usually the same as a single core. Processors usually have more than one core, as per the previous list. Node is commonly used to refer to an actual physical unit, most commonly a circuit board or blade with a network connection. These often have multiple processors. I will try to use the term PE consistently here, but I may slip up myself. Get used to it, as you will quite often hear all of the above terms used interchangeably where they shouldn't be.

9 Multi-socket Motherboards Dual and Quad socket boards are very common in the enterprise and HPC world. Less desirable in consumer world.

10 Shared-Memory Processing at Extreme Scale
Programming: OpenMP, Pthreads, Shmem
Examples: All multi-socket motherboards; SGI UV (Blacklight!)
Blacklight: Intel Xeon 8 dual core processors linked by the UV interconnect; 4096 cores sharing 32 TB of memory. As big as it gets right now.

11 MPPs (Massively Parallel Processors)
Distributed memory at largest scale. Often shared memory at lower level.
Titan (ORNL): AMD Opteron 6274 processors (Interlagos); 560,640 cores; Gemini interconnect (3-D Torus); accelerated node design using NVIDIA multi-core accelerators; 20+ PFlops peak system performance
Sequoia (LLNL): [...] petaflops Rmax and [...] petaflops Rpeak; IBM Blue Gene/Q; 98,304 compute nodes; 1.6 million processor cores; 1.6 PB of memory

12 GPU Architecture Kepler: Streaming Multiprocessor (SMX)
192 SP CUDA Cores per SMX: 192 fp32 ops/clock, 192 int32 ops/clock
64 DP CUDA Cores per SMX: 64 fp64 ops/clock
4 warp schedulers
Up to 2048 threads concurrently
32 special-function units
64KB shared mem + L1 cache
48KB Read-Only Data cache
64K 32-bit registers
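A few per-SMX ratios fall out of these figures. The sketch below is plain arithmetic on the slide's numbers, with one outside assumption: a warp size of 32 threads, which is the usual value for NVIDIA GPUs but is not stated on the slide. The constant names are illustrative.

```python
# Figures taken from the slide.
MAX_THREADS_PER_SMX = 2048
WARP_SCHEDULERS = 4
REGISTERS_PER_SMX = 64 * 1024   # "64K 32-bit registers"

# Assumption (not on the slide): threads are scheduled in warps of 32.
WARP_SIZE = 32

warps_per_smx = MAX_THREADS_PER_SMX // WARP_SIZE
warps_per_scheduler = warps_per_smx // WARP_SCHEDULERS
# Register budget per thread when all 2048 threads are resident.
registers_per_thread = REGISTERS_PER_SMX // MAX_THREADS_PER_SMX

print(f"resident warps per SMX: {warps_per_smx}")
print(f"warps per scheduler: {warps_per_scheduler}")
print(f"registers per thread at full occupancy: {registers_per_thread}")
```

The last ratio is the practical one: a kernel that needs more registers per thread than this cannot keep the full 2048 threads resident on an SMX.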

13 Top 10 Systems as of November 2015
# | Site | Manufacturer | Computer | CPU, Interconnect, [Accelerator] | Cores | Rmax (Tflops) | Rpeak (Tflops) | Power (MW)
1 | National Super Computer Center in Guangzhou, China | NUDT | Tianhe-2 (MilkyWay) | Intel Xeon E[...] GHz, TH Express-2, Intel Xeon Phi 31S1P | 3,120,000 | 33,862 | 54,[...] | [...]
2 | DOE/SC/Oak Ridge National Laboratory, United States | Cray | Titan | Cray XK7, Opteron [...] GHz, Gemini, NVIDIA K20x | 560,640 | 17,590 | 27,[...] | [...]
3 | DOE/NNSA/LLNL, United States | IBM | Sequoia | BlueGene/Q, Power BQC 1.6 GHz, Custom | 1,572,864 | 17,173 | 20,[...] | [...]
4 | RIKEN Advanced Institute for Computational Science (AICS), Japan | Fujitsu | K Computer | SPARC64 VIIIfx 2.0 GHz, Tofu | 705,024 | 10,510 | 11,[...] | [...]
5 | DOE/SC/Argonne National Laboratory, United States | IBM | Mira | BlueGene/Q, Power BQC 1.6 GHz, Custom | 786,432 | 8,586 | 10,[...] | [...]
6 | DOE/NNSA/LANL/SNL, United States | Cray | Trinity | Cray XC40, Xeon E5-2698v3 2.3 GHz, Aries | 301,056 | 8,100 | 11,078 | [...]
7 | Swiss National Supercomputing Centre (CSCS), Switzerland | Cray | Piz Daint | Cray XC30, Xeon E[...] GHz, Aries, NVIDIA K20x | 115,984 | 6,271 | 7,[...] | [...]
8 | HLRS, Germany | Cray | Hazel Hen | Cray XC40, Xeon E[...] GHz, Aries | 185,088 | 5,640 | 7,403 | [...]
9 | King Abdullah University of Science and Technology, Saudi Arabia | Cray | Shaheen II | Cray XC40, Xeon E5-2698v3 2.3 GHz, Aries | 196,608 | 5,537 | 7,[...] | [...]
10 | Texas Advanced Computing [...] | Dell | Stampede | Xeon E[...] GHz | 462,462 | 5,168 | 8,[...] | [...]
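Rmax is the measured benchmark result and Rpeak the theoretical maximum, so their ratio indicates how much of the machine the benchmark actually used. A small sketch using only the two rows whose Rmax and Rpeak both survive intact in this table (Trinity and Hazel Hen); the `efficiency` helper is illustrative:

```python
# (Rmax, Rpeak) in Tflops, copied from the Trinity and Hazel Hen rows.
systems = {
    "Trinity": (8100, 11078),
    "Hazel Hen": (5640, 7403),
}

def efficiency(rmax, rpeak):
    """Fraction of theoretical peak achieved on the benchmark."""
    return rmax / rpeak

for name, (rmax, rpeak) in systems.items():
    print(f"{name}: {efficiency(rmax, rpeak):.1%} of peak")
```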

14 Projected Performance Development. [Figure: projected TOP500 performance on a log scale from Mflop/s to 1 Eflop/s, with series SUM, N=1 and N=[...]] Courtesy Horst Simon, LBNL. Data from: TOP500 November 2012

15 Trends with ends. Source: Kogge and Shalf, IEEE CISE Courtesy Horst Simon, LBNL

16 Amdahl's Law
If x% of a computation is serial, speedup cannot be better than 100/x.
If you decompose a problem into many parts, then the parallel time cannot be less than the largest of the parts.
If the critical path through a computation is T, you cannot complete in less time than T, no matter how many processors you use.
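The 100/x bound can be seen numerically. This is a minimal sketch of the standard Amdahl's Law formula, speedup = 1 / (s + (1 - s) / p) for serial fraction s on p processors; the function name and the 5% example are illustrative, not from the slides.

```python
def amdahl_speedup(serial_fraction, processors):
    """Ideal speedup on `processors` PEs when `serial_fraction` cannot be parallelized."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)

# Hypothetical code that is 5% serial: the bound is 100/5 = 20x.
for p in (1, 16, 1024, 1_000_000):
    print(f"{p:>9} PEs: speedup = {amdahl_speedup(0.05, p):.2f}")
```

However many PEs are added, the speedup approaches 20 and never reaches it, which is why shrinking the serial part matters more than adding cores once the count is large.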

17 Why you should be (extra) motivated. This parallel computing thing is no fad. The laws of physics are drawing this roadmap. If you get on board (the right bus), you can ride this trend for a long, exciting trip. Let's learn how to use multicore machines!
