Programming Techniques for Supercomputers


1 Programming Techniques for Supercomputers
Prof. Dr. G. Wellein (a,b), Dr. G. Hager (a), Dr.-Ing. M. Wittmann (a)
(a) HPC Services, Regionales Rechenzentrum Erlangen
(b) Department für Informatik, University Erlangen-Nürnberg
Summer semester 2018

2 Audience & Contact
Audience: Computational Engineering, Computer Science, Computational & Applied Mathematics, Physics, Engineering, Materials Science, Chemistry
Contact: Gerhard Wellein, Georg Hager, Markus Wittmann

3 Organization & Format
The lecture and tutorial are completely documented in our moodle LMS; see also the PTfS univis entry.
Please enroll in the lecture and specify your matriculation number!
Homework assignments, announcements etc. are all handled via moodle.

4 Organization & Format
4 hours of lecture: Monday 16:15-17:45 in H10 AND Thursday 16:15-17:45 in H4
(Wednesday 12:15-13:45 cancelled?! Due to a conflict.)
DON'T BE SHY, ASK QUESTIONS!

5 Organization and Format
2 hours of tutorial: Monday 14:15-15:45 OR Wednesday 10:15-11:45
Tutorial "sheets" (homework) are available every Monday in moodle.
Tutorials start next week. You also need CIP pool accounts (ask the CIP admins!).
First tutorials (next week): introduction to handling the RRZE cluster systems (logging in via SSH, X forwarding, using compilers, batch jobs).

6 Format of course
Lecture only: 5 ECTS. Material covered in the lecture. Register in meincampus. Written exam: 60 minutes.
Lecture & Exercises: ( ) ECTS. Material covered in lecture AND tutorial. Register for lecture AND exercise in meincampus. Written exam: 90 minutes.
No supporting material is allowed in the exam.
PTFS-CAM students: please contact me by e-mail or in person after the lecture.

7 Format of the course
Prerequisites for the exercises: basic programming knowledge in C/C++ or Fortran; experience using Linux / UNIX environments (including ssh).
Recommended: first experience with parallel programming, though we will introduce the necessary basics.

8 Supporting material
Books:
G. Hager and G. Wellein: Introduction to High Performance Computing for Scientists and Engineers. CRC Computational Science Series, 2010. See moodle for a very early version; 10 copies are available in the library; for discounted copies, ask us.
J. Hennessy and D. Patterson: Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, Elsevier.
W. Schönauer: Scientific Supercomputing.

9 Supporting material
Documentation: the big ones, and more useful HPC-related information.

10 Related teaching activities
Regular seminar on "Efficient numerical simulation on multicore processors" (MuCoSim): 5 ECTS, 2 hrs per week, 2 talks + written summary.
Topics from code optimization, code parallelization and code benchmarking on the latest multicore / manycore CPUs and GPUs.
This semester: Tuesday 16:00-17:30, RRZE (2.049).

11 SCOPE OF THE LECTURE

12 Scope of the lecture
Goal: the ability to write hardware-efficient serial and parallel programs for (super)computers.
Hardware coverage:
  Single-core + multi-core: Intel Core i, Intel Xeon E5-2xyz
  Many-core / GPU: Intel Xeon Phi / NVIDIA
  Shared memory nodes: single node (RRZE)
  Distributed memory computers: compute clusters (RRZE) and MPP (IBM BlueGene, CRAY series)
Identify basic hardware concepts and how to use them efficiently.
Shared memory parallel programming: OpenMP
Distributed memory parallel programming: MPI
(Hybrid programming: MPI+OpenMP)
Performance analysis & modeling throughout all topics.
April 12, 2018 PTfS

13 Scope of the lecture
Introduction
  Performance: basics, measuring & reporting
  Benchmarks: kernels & more
Modern processors
  Single core: basics, pipelining, superscalarity, SIMD
  Memory hierarchy
  Multicore: technology & basics
  Manycore / GPU (*)
Parallel computers: shared memory
  Shared-memory system architectures: UMA, ccNUMA
  OpenMP basics
Performance modelling / engineering: Roofline model
  Case studies: dense & sparse matrix-vector multiplication / stencils
Shared memory in depth
  Advanced OpenMP, pitfalls, data placement
Parallel computers: distributed memory
  Architecture & communication networks
  MPI in a nutshell
Hardware performance monitoring and model validation (*)
Performance analysis and modeling accompanies all of these topics.

14 Scope of the lecture
Example kernel:

    !$omp PARALLEL DO
    do k = 1, Nk
      do j = 1, Nj; do i = 1, Ni
        y(i,j,k) = b*( x(i-1,j,k) + x(i+1,j,k) + x(i,j-1,k) + x(i,j+1,k) + x(i,j,k-1) + x(i,j,k+1) )
      enddo; enddo
    enddo
    !$omp END PARALLEL DO

Steps: establish the limit with a simple performance model; optimize single-core performance; parallelize.
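As a minimal, unoptimized sketch of what the loop nest on this slide computes, the following pure-Python version applies the same six-neighbor update to a small 3D grid. The function name `stencil_sweep`, the nested-list layout, and the convention that the outermost layer of cells acts as a boundary are assumptions made for illustration only; the lecture itself works with Fortran and OpenMP.

```python
def stencil_sweep(x, b):
    """Return y with y[i][j][k] = b * (sum of the six nearest neighbors of x[i][j][k]).

    x is a nested list of shape (ni, nj, nk). Boundary cells of y stay 0.0
    (hypothetical convention chosen for this sketch).
    """
    ni, nj, nk = len(x), len(x[0]), len(x[0][0])
    y = [[[0.0] * nk for _ in range(nj)] for _ in range(ni)]
    # Only interior points are updated, so every neighbor index is valid.
    for k in range(1, nk - 1):
        for j in range(1, nj - 1):
            for i in range(1, ni - 1):
                y[i][j][k] = b * (
                    x[i - 1][j][k] + x[i + 1][j][k]
                    + x[i][j - 1][k] + x[i][j + 1][k]
                    + x[i][j][k - 1] + x[i][j][k + 1]
                )
    return y


# Tiny usage example: a 4x4x4 grid filled with ones.
n = 4
x = [[[1.0] * n for _ in range(n)] for _ in range(n)]
y = stencil_sweep(x, b=0.5)
print(y[1][1][1])  # interior point: 0.5 * 6 = 3.0
```

Each interior update reads six values and writes one, which is the kind of operation count a simple performance model starts from.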

15 Introduction
Supercomputers: the big ones and the workhorses

16 Supercomputer: a good definition?!
"Supercomputer is a computer that is only one generation behind what large-scale users want." (Neil Lincoln, architect for the CDC Cyber 205 and others)
"A supercomputer does not fit under the desktop!" (and you cannot plug it into a standard power line)
Absolute, raw compute power is not a reasonable measure.
Assume: the computer is being used for numerical simulation.
The compute power of a system is measured by floating point operations (MULT, ADD) for a specific numeric benchmark: the TOP500 list.

17 Most powerful computers in the world: TOP500
TOP500: survey of the 500 most powerful supercomputers.
Benchmark: solve a large dense system of linear equations, A x = b ("LINPACK").
Published twice a year (ISC in Germany, SC in the USA).
Established in 1993 (CM5/1024): 60 GFlop/s (TOP1).
Since Nov (Sunway/China): 93,000,000 GFlop/s (TOP1).
Performance increase: 81 % p.a.
Performance measure: MFlop/s, GFlop/s, TFlop/s, PFlop/s, EFlop/s, i.e. the number of FLOATING POINT operations per second.
FLOATING POINT operations: double precision (64 bit) add & mult operations.
10^6: MFlop/s; 10^9: GFlop/s; 10^12: TFlop/s; 10^15: PFlop/s; 10^18: EFlop/s
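The quoted growth rate can be checked with a compound-growth calculation. The sketch below uses the two TOP1 figures from the slide; the 24-year span (1993 to 2017) is an assumption, since the year range did not survive in this transcript, and the helper name `annual_growth` is made up for illustration.

```python
def annual_growth(start, end, years):
    """Compound annual growth rate: (end / start) ** (1 / years) - 1."""
    return (end / start) ** (1.0 / years) - 1.0


top1_1993_gflops = 60.0            # CM5/1024, from the slide
top1_sunway_gflops = 93_000_000.0  # Sunway, from the slide
years = 24                         # ASSUMPTION: 1993 to 2017, not stated in the transcript

rate = annual_growth(top1_1993_gflops, top1_sunway_gflops, years)
print(f"{rate:.1%} per year")
```

Under that assumed span, the result is close to the 81 % p.a. given on the slide; a shorter span would give a somewhat higher rate.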

18 TOP5 as of November 2017 (table not preserved)
R_max: LINPACK performance; R_peak: peak performance; power measured while running LINPACK.

19 TOP6-10 as of November 2017 (table not preserved)
R_max: LINPACK performance; R_peak: peak performance; power measured while running LINPACK.

20 TOP16-20 as of November 2017 (table with columns Cores, Peak, LINPACK, Power not preserved)
Non-standard hardware continues until rank 15!
Trends:
  Extreme degree of parallelism (TOP1: 10,000,000 cores)
  Many CPUs at low clock speed (1.3 GHz - 1.5 GHz)
  Use of non-standard CPUs: NVIDIA GPGPUs, Chinese / Japanese processors
  Power range of TOP20 systems: 1 MW, ..., 18 MW

21 TOP5: Why GPUs & special purpose?
Energy efficiency: R_max / Power in GF/J (chart not preserved; values shown: 6.1, 8.6, 14.2, 4.0, 1.6, 2.0).

22 Performance Trend & Projection
ExaFlop/s machine by the end of this decade?
Basic trend: the slope changes; the performance increase slows down.

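23 Question?
Current GPGPU (CPU) technology: approx. 20 GF/s per W (5 GF/s per W).
How much power does an ExaFlop/s (EF/s) machine consume?
$1\,\mathrm{EF/s} = 10^{9}\,\mathrm{GF/s}$
ExaFlop GPGPU machine: $\frac{10^{9}\,\mathrm{GF/s}}{20\,\mathrm{GF/s/W}} = 5 \times 10^{7}\,\mathrm{W} = 50\,\mathrm{MW}$
ExaFlop CPU machine: $\frac{10^{9}\,\mathrm{GF/s}}{5\,\mathrm{GF/s/W}} = 2 \times 10^{8}\,\mathrm{W} = 200\,\mathrm{MW}$
Power cost at 15 ct/kWh: 1 MW ≈ €1,300,000 p.a.
Energy consumption is a major issue for centers and users!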
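The arithmetic on this slide can be reproduced in a few lines. This is a sketch under stated assumptions: the efficiencies of 20 and 5 GF/s per W and the price of 0.15 per kWh come from the slide, while the constant full load over 8760 hours per year and the helper names `power_megawatt` and `annual_cost` are assumptions introduced here.

```python
EXAFLOP_IN_GFLOPS = 1e9  # 1 EF/s = 10^9 GF/s


def power_megawatt(perf_gflops, efficiency_gflops_per_watt):
    """Electrical power in MW needed to sustain perf_gflops."""
    watts = perf_gflops / efficiency_gflops_per_watt
    return watts / 1e6


def annual_cost(power_mw, price_per_kwh=0.15, hours_per_year=8760):
    """Yearly electricity cost for a constant load of power_mw.

    ASSUMPTION: the machine draws the same power all year (365 days * 24 h).
    """
    kilowatts = power_mw * 1000.0
    return kilowatts * hours_per_year * price_per_kwh


gpgpu_mw = power_megawatt(EXAFLOP_IN_GFLOPS, 20.0)  # GPGPU efficiency from the slide
cpu_mw = power_megawatt(EXAFLOP_IN_GFLOPS, 5.0)     # CPU efficiency from the slide
cost_per_mw = annual_cost(1.0)

print(f"GPGPU machine: {gpgpu_mw:.0f} MW")
print(f"CPU machine:   {cpu_mw:.0f} MW")
print(f"Cost of 1 MW:  {cost_per_mw:,.0f} per year")
```

The cost per MW comes out slightly above the rounded 1,300,000 on the slide because 8760 h at 0.15 per kWh gives about 1,314,000.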

24 HPC Centers in Germany: a view from Erlangen
  Jülich Supercomputing Center (FZ Jülich): BlueGene/Q, 5.8 PFlop/s
  HLRS Stuttgart: 7.4 PF (CRAY XC40)
  LRZ München: IBM cluster, 2*3 PF
  Erlangen/Nürnberg: 0.5 PF/s
  Hannover, Berlin

25 SuperMUC, LRZ Garching: TOP 4 (June 2012)
Thin nodes: 18 islands with 512 nodes each; 2 Intel Xeon processors (8 cores, 2.7 GHz baseline) per node; 147,456 cores; 3.2 PF/s peak; 2.9 PF/s LINPACK.
Fat nodes: 1 island with 205 nodes; 4 Intel Xeon E7-4870 (10 cores each) per node; 256 GB per node.
Total power consumption: 2.5 MW - 3 MW.
Upgrade to 3+3 PF/s (peak) in 2014 (with Intel Haswell processors).
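Two derived figures help put the SuperMUC numbers in context: the fraction of peak reached by LINPACK, and a rough energy efficiency in GF/J comparable to the R_max / Power metric used earlier in the deck. The sketch below uses only the slide's values; charging the total power range of 2.5 to 3 MW entirely to the thin-node LINPACK run is a simplifying assumption, and both function names are hypothetical.

```python
def linpack_efficiency(r_max_pflops, r_peak_pflops):
    """Fraction of the theoretical peak that LINPACK actually reaches."""
    return r_max_pflops / r_peak_pflops


def gflops_per_watt(r_max_pflops, power_mw):
    """Energy efficiency in GF/s per W (equivalently GF/J).

    1 PF/s = 1e6 GF/s and 1 MW = 1e6 W, so the conversion factors cancel.
    """
    return (r_max_pflops * 1e6) / (power_mw * 1e6)


# Figures from the slide (thin nodes).
thin_cores = 18 * 512 * 2 * 8  # islands * nodes per island * sockets * cores
efficiency = linpack_efficiency(2.9, 3.2)

# ASSUMPTION: the total power range is charged entirely to the LINPACK run.
best_case = gflops_per_watt(2.9, 2.5)
worst_case = gflops_per_watt(2.9, 3.0)

print(thin_cores)
print(f"LINPACK efficiency: {efficiency:.1%}")
print(f"Energy efficiency: {worst_case:.2f} to {best_case:.2f} GF/J")
```

The core count matches the 147,456 quoted on the slide, and the resulting efficiency of roughly 1 GF/J is only an order-of-magnitude estimate given the assumption above.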

26 RRZE: "Meggie" cluster
728 compute nodes (14,560 cores) with 2 Intel Xeon E5-2630 v4 (Broadwell) 2.2 GHz (10 cores): 20 cores/node + SMT cores
64 GB main memory, NO local disks
Peak performance: R_peak = 0.5 PF/s (floating point operations per second)
#346 in TOP500 / Nov: R_max = 0.48 PF/s
Intel OmniPath network: up to 100 Gbit/s
Price: €2.5 million
Power consumption: from 120 kW (depending on workload)
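The peak performance quoted for this cluster follows from nodes, cores per node, clock speed, and floating point operations per cycle. In the sketch below the first three values are taken from the slide; the figure of 16 double-precision flops per cycle per core (two fused multiply-add units, each operating on four 64-bit lanes) is an assumption about the Broadwell core that the slide does not state, and `peak_pflops` is a hypothetical helper name.

```python
def peak_pflops(nodes, cores_per_node, clock_ghz, flops_per_cycle):
    """Theoretical peak in PFlop/s = cores * clock * flops per cycle."""
    cores = nodes * cores_per_node
    gflops = cores * clock_ghz * flops_per_cycle  # GHz * flops/cycle = GFlop/s per core
    return gflops / 1e6                           # 1 PFlop/s = 1e6 GFlop/s


nodes = 728            # from the slide
cores_per_node = 20    # from the slide
clock_ghz = 2.2        # from the slide
flops_per_cycle = 16   # ASSUMPTION: 2 FMA units * 4 double-precision lanes * 2 flops

total_cores = nodes * cores_per_node
r_peak = peak_pflops(nodes, cores_per_node, clock_ghz, flops_per_cycle)

print(total_cores)
print(f"R_peak = {r_peak:.2f} PF/s")
```

With that assumption, the estimate lands close to the 0.5 PF/s on the slide; SMT does not add floating point units, so it is not counted here.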

27 RRZE: "Emmy" cluster
544 compute nodes with 2 Intel Xeon E5-2660v2 (Ivy Bridge) 2.2 GHz (10 cores): 20 cores/node + SMT cores
64 GB main memory, NO local disks
16 accelerator nodes with the same CPUs: 8 nodes with 2 x NVIDIA K20 GPGPUs, 8 nodes with 2 x Intel Xeon Phi
Vendor: NEC (Dual-Twin Supermicro)
Power consumption: ~160 kW (backdoor heat exchanger)
Full Quad Data Rate InfiniBand fat tree: bandwidth ~3 GB/s per direction and < 2 µs latency
Parallel filesystem: 400 TB+ (max. 7 GB/s)
Operating system: Linux
Peak performance: 234 TFlop/s (all devices); 191 TFlop/s LINPACK (CPUs)
#210 in TOP500 / Nov

28 Power consumption @ RRZE (chart not preserved)
Clusters shown: 260 nodes @ 2.66 GHz (2009); 540 nodes @ 2.2 GHz (2013); 720 nodes (2016)
Trends:
  Clock speed decreases / stagnates
  (Minor) energy efficiency improvements
  Power consumption depends on workload
Not shown: electric power for cooling is approx. 50% of the power drawn by the clusters.

29 Prepare computer access
Send an e-mail containing your name, IDM account, and matriculation number (Matrikelnummer).
Tour through the computer room.
