

1 Programming Techniques for Supercomputers
Prof. Dr. G. Wellein (a,b), Dr. G. Hager (a), J. Habich (a)
(a) HPC Services, Regionales Rechenzentrum Erlangen
(b) Department für Informatik, University Erlangen-Nürnberg
Summer semester 2011

2 Audience
- Computational Engineering, Computer Science
- Physics, Engineering, Materials Science, ...
- Contact: Gerhard Wellein, Georg Hager, Johannes Habich
- Lecture is completely documented in our Moodle LMS: erlangen de/moodle/course/view php?id=145
- Please enroll in the lecture and specify your matriculation number
- Homework assignments, submissions, credits etc. are all conducted in Moodle

3 Format of course
Lectures/tutorials:
- 4 hours of lecture:
  - Wednesday 12:15-13:45 in E1.11
  - Thursday 10:00-11:30
  - Slides of each lecture are available in Moodle
- 2 hours of tutorial:
  - Thursday 12:15-13:45 (CIP pool)
  - Wednesday 10:00-11:30 in 0.01
- Exercise "sheets" (homework) are available every Wednesday in Moodle, to be turned in one week later before the tutorial
- You also need CIP pool accounts (ask the CIP admins!)
- First tutorial (May 5th): intro to system handling (logging in via SSH, X forwarding, using compilers, batch jobs) on the RRZE cluster
Please interrupt and ask questions!

4 Format of course
Schein (course certificate):
- Lecture only: 5 ECTS
  - Oral examination
  - Register in meincampus
- Lecture & exercises: (5 + 2.5) ECTS
  - At least 50% of the credits in the exercises
  - Oral examination
  - Register for both in meincampus
Prerequisites for exercises:
- Basic programming knowledge in C/C++ or Fortran
- Using Linux / UNIX OS environments

5 Scope of course
Ability to write (simple & efficient) parallel programs for modern parallel supercomputers
- Introduction to architecture of:
  - Single core/processor: x86_64-based architectures
  - Multi-core processors: x86_64 dual-/quad-/octo-core
  - Shared memory nodes: typical cluster nodes (RRZE)
  - Distributed memory computers: compute clusters (RRZE) and MPP (IBM BlueGene, CRAY XT)
  - GPU: nvidia
- Efficient programming and optimization strategies
- Concepts, potentials & pitfalls of parallel computing
  - Shared memory parallel programming: OpenMP
  - Distributed memory parallel programming: MPI
  - Hybrid programming: MPI+OpenMP
- Performance analysis & modeling throughout all topics

6 Scope of the course
- Introduction: colored slides, performance (measuring & reporting), standard benchmarks (kernels & more)
- Single core: architecture and optimization strategies (pipelining, caches, blocking, ...)
- Foundations of parallel processing
- Parallel processing (1): multi-core, "parallel processing for the masses". Shared-memory system architectures & programming techniques (multi-core, multi-socket, multi-everything, UMA, ccNUMA, ...)
- Parallel processing (2): distributed-memory system architectures & programming techniques (networks, clusters, MPI, ...)
- Parallel processing (3): hybrid programming techniques: MPI + OpenMP
- Parallel processing (4): GPU: nvidia & CUDA
- Throughout: performance analysis and modeling

7 Supporting material
Books:
- G. Hager and G. Wellein: Introduction to High Performance Computing for Scientists and Engineers. CRC Computational Science Series.
  - See Moodle for a very early version
  - 10 copies are available in the library
  - Discounted copies: ask us
- J. Hennessy and D. Patterson: Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, Elsevier.
- W. Schönauer: Scientific Supercomputing.

8 Supporting material
- Documentation
- "The big ones" and more useful HPC-related information

9 Introduction (1)
Why supercomputers? The "big ones" and the workhorses

10 Computational Science drives the need for supercomputing
The Two (Three) Principles of Science:
- Theory: mathematical models, differential equations, Newton
- Experiments: observation and prototypes, empirical sciences
- Computational Science: simulation, optimization, (quantitative) virtual reality

11 Motivation: Supercomputers in everyday life
- Engineering (automotive): crash simulations, aerodynamics
- Meteorology: weather forecasts, hurricane warnings
- Did you know? Roasting Pringles, designing engines for chain saws, drug design

12 Motivation: MD simulation of HIV protease dynamics
- Time step: 1 fs (10^-15 s)
- Real time: 10 ns (10^-8 s)
- Compute time: ... CPU-hrs (8 CPUs, 90 days)
Courtesy: Prof. Sticht, Bio-Informatics, Emil-Fischer Center, FAU
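The time scales on this slide fix the number of integration steps, which can be checked with a few lines of Python. This is only a sketch: the function names are illustrative, and the CPU-hour product assumes that all 8 CPUs were busy around the clock for the full 90 days (the slide's own CPU-hour figure did not survive extraction).

```python
# Step count and CPU-hour estimate for the MD run on the slide.
FS_PER_NS = 10**6  # 1 ns = 10^6 fs


def md_time_steps(simulated_ns: int, step_fs: int) -> int:
    """Number of integration steps needed to cover the simulated time."""
    return simulated_ns * FS_PER_NS // step_fs


def cpu_hours(cpus: int, days: int) -> int:
    """CPU-hours, assuming every CPU is busy 24 hours a day (assumption)."""
    return cpus * days * 24


print("time steps:", md_time_steps(simulated_ns=10, step_fs=1))
print("CPU-hours :", cpu_hours(cpus=8, days=90))
```

Integer arithmetic in femtoseconds avoids rounding issues that dividing 1e-8 by 1e-15 in floating point could introduce.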

13 Motivation: Lattice Boltzmann flow solvers
Figures by courtesy of LS CAB-Braunschweig, Thomas Zeiser, N. Thuerey

14 Motivation: Industrial usage of HLRS systems
[Figure: costs and CPU hours over the product cycle; users shown: Porsche, T-Systems, others; example: Porsche Panamera]
Michael M. Resch, High Performance Computing Center Stuttgart

15 Supercomputer: a good definition?!
- "Supercomputer is a computer that is only one generation behind what large-scale users want." Neil Lincoln, architect for the CDC Cyber 205 and others
- "A supercomputer does not fit under the desktop!"
- Absolute, raw compute power is not a reasonable measure
- Assume: the computer is being used for numerical simulation
- Compute power of a system is measured by floating point operations (MULT, ADD) for a specific numeric benchmark: TOP500 list

16 Most powerful computers in the world: TOP500
- Top 500: survey of the 500 most powerful supercomputers
- Benchmark: solve a large system of linear equations, A x = b ("LINPACK")
- Published twice a year (ISC in Germany, SC in USA)
- Established in 1993 (CM5/1024): 60 GFlop/s (Top1)
- Nov 2010 (Tianhe-1A): ... GFlop/s (Top1)
- Performance increase: 87 % p.a. over almost 2 decades!
- Annotation: today's laptop
Performance measure: MFlop/s, GFlop/s, TFlop/s, PFlop/s
- Number of FLOATING POINT operations per second
- FLOATING POINT operations: double precision (64 bit) add & mult ops
- 10^6: MFlop/s; 10^9: GFlop/s; 10^12: TFlop/s; 10^15: PFlop/s
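To get a feeling for what 87 % growth per year means, the compounding can be sketched in Python. The 17-year span (1993 to 2010) is an assumption made here, and the function and dictionary names are illustrative; the result is a projection from the quoted growth rate, not the measured Top1 value.

```python
# Compound growth of Top1 performance, using the rate quoted on the slide.
UNIT = {"MFlop/s": 1e6, "GFlop/s": 1e9, "TFlop/s": 1e12, "PFlop/s": 1e15}


def projected_flops(start_flops: float, growth_per_year: float, years: float) -> float:
    """Performance after `years` of constant relative growth per year."""
    return start_flops * (1.0 + growth_per_year) ** years


def in_unit(flops: float, unit: str) -> float:
    """Convert raw Flop/s into one of the units listed on the slide."""
    return flops / UNIT[unit]


start = 60 * UNIT["GFlop/s"]  # Top1 in 1993 (CM5/1024)
end = projected_flops(start, growth_per_year=0.87, years=17)  # 17 years: assumption
print(f"growth factor: {end / start:.0f}")
print(f"projected    : {in_unit(end, 'PFlop/s'):.2f} PFlop/s")
```

Under these assumptions the projection lands in the petaflop range, four to five orders of magnitude above the 1993 value.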

17 TOP10 as of November 2010
[Table: TOP10 systems with power consumption annotations, e.g. 4.0 MW; the remaining values are not recoverable]

18 Top500 list as of November 2010
Clusters, clusters, clusters with GBit interconnect

19 TOP500 is going massively parallel (Nov. 99/04/09)
[Figure annotations: 8x, 16x]

20 TOP500 is going massively parallel

21 Top500: Power becomes a real problem
[Figure: absolute power levels (kW) versus TOP500 ranking; annotation: electricity costs ~1.5 million p.a.]
By courtesy of H. Meuer

22 The next step? ExaFLOP dreams, visions and fears
[Figure: performance projections on a scale from 100 Mflop/s to 10 Eflop/s; curves include SUM, N=1 and Notebook; annotated gaps: 6-8 years and 8-10 years]
Source: Intel 6th EMEA HPC Roundtable

23 TOP1: Tianhe-1A
- 7168 compute nodes, each with:
  - 2 x 6-core Intel Xeon/Westmere 2.93 GHz
  - 1 x nvidia Tesla M2050 GPU ("Fermi")
- High speed interconnect
- Overall compute capacity:
  - Intel CPUs: 1.0 PFlop/s
  - nvidia GPUs: 3.6 PFlop/s
- Overall power consumption (LINPACK test): 4 MW, the heating power for > 200 homes
National Supercomputing Center Tianjin
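The CPU share of the compute capacity can be reproduced from the node count and clock rate. The sketch below assumes 4 double-precision floating point operations per cycle and core (2 adds plus 2 multiplies via SSE), which is not stated on the slide; the function name is illustrative.

```python
# Peak performance of the CPU part of Tianhe-1A from the slide's node data.
def cpu_peak_gflops(nodes: int, sockets_per_node: int, cores_per_socket: int,
                    clock_ghz: float, flops_per_cycle: int = 4) -> float:
    """Theoretical peak in GFlop/s; flops_per_cycle=4 is an assumption."""
    cores = nodes * sockets_per_node * cores_per_socket
    return cores * clock_ghz * flops_per_cycle


cpu_pflops = cpu_peak_gflops(7168, 2, 6, 2.93) / 1e6
gpu_pflops = 3.6  # taken from the slide, not recomputed here
print(f"CPU peak : {cpu_pflops:.2f} PFlop/s")
print(f"GPU share: {gpu_pflops / (cpu_pflops + gpu_pflops):.0%}")
```

With these inputs the CPU peak comes out close to the 1.0 PFlop/s quoted above, and the GPUs account for roughly three quarters of the total.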

24 TOP2: CRAY XT5, PetaFLOPs and beyond
- Cray-1 = 160 MFlop/s
- 1 PFlop/s = 6,250,000 times as much!
Cray Inc. Proprietary

25 The TOP2 system: CRAY XT5
- 5th generation of CRAY MPP systems (1 node = 2 QC chips)
- Successor of CRAY T3E
- System is designed to scale to large CPU counts
- ~6 μs MPI latency
- Oak Ridge: "Jaguar" (TOP1)
- Original development: 40 TFlop/s "Red Storm"
- OS: Linux micro kernel
[Figure: node diagram; 2-32 GB memory; 25.6 GB/s direct connect memory; 6.4 GB/s direct connect HyperTransport; Cray SeaStar2+ interconnect with 9.6 GB/s links]
By courtesy of W. Oed, CRAY

26 Roadrunner at Los Alamos: first PetaFLOP system
- June 2008: 10^15 FLOP/s for the first time! (Nov. 1997: ASCI Red (Intel Paragon / P6) breaks the TFlop/s barrier!)
- Mankind (7 secs per FLOP per person): > 1 year to do ... FLOP
- It's not only the first PetaFLOP system, it's heterogeneous!
- Opteron dual-core + IBM PowerXCell ("Playstation 3"!)

27 RoadRunner node ("triblade") structure
Tri-Blade characteristics:
- 2 x QS way host
- 400 GFlop/s peak
- Single 4x DDR IB
- Host: dual-core Opteron
- ccNUMA within QS22 blade
- Each SPE operates on a 256 kByte local store holding data and instruction code!
- Data needs to be transferred explicitly between local store and memory via DMA transfers!

28 IBM Blue Gene/P

29 IBM Blue Gene/P: up to 1 Petaflop performance
Blue Gene/P continues Blue Gene's leadership performance in a space-saving, power-efficient package for the most demanding and scalable high-performance computing applications.
System hierarchy:
- Chip: 4 processors; 13.6 GF/s, 8 MB EDRAM
- Compute card: 1 chip, 20 DRAMs; 13.6 GF/s, 2.0 (or 4.0) GB DDR; supports 4-way SMP
- Node card: 32 chips (4x4x2), 32 compute and 0-1 IO cards; 435 GF/s, 64 GB
- Rack: 32 node cards, 1024 chips, 4096 procs; 14 TF/s, 2 TB
- System: 72 racks, cabled 8x8x16; 1 PF/s, 144 TB
- Front end node / service node: JS21 / Power5, Linux SLES10
- HPC SW: compilers, GPFS, ESSL, Loadleveler
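The numbers in the Blue Gene/P hierarchy follow from the per-chip values by simple multiplication. A minimal Python sketch, assuming the 2.0 GB memory option per compute card and binary units (1 TB = 1024 GB); the names `LEVELS` and `aggregate` are illustrative:

```python
# Aggregate peak performance and memory up the Blue Gene/P hierarchy.
CHIP_GFLOPS = 13.6   # per chip, from the slide
CHIP_MEM_GB = 2.0    # 2.0 GB option per compute card (assumption: not 4.0 GB)

# (level name, number of units of the previous level it contains)
LEVELS = [("node card", 32), ("rack", 32), ("system", 72)]


def aggregate(chip_gflops: float, chip_mem_gb: float, levels):
    """Return (name, GFlop/s, GB) for each level of the hierarchy."""
    rows = []
    gflops, mem_gb = chip_gflops, chip_mem_gb
    for name, factor in levels:
        gflops *= factor
        mem_gb *= factor
        rows.append((name, gflops, mem_gb))
    return rows


for name, gflops, mem_gb in aggregate(CHIP_GFLOPS, CHIP_MEM_GB, LEVELS):
    print(f"{name:9s}: {gflops / 1e3:10.1f} TFlop/s, {mem_gb / 1024:7.2f} TB")
```

The rack value comes out slightly below the rounded 14 TF/s shown on the slide, and the full 72-rack system slightly above 1 PF/s.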

30 IBM BlueGene/P: single node
[Figure: compute card with IBM PowerPC 450 (4 cores, 8 MB), Cu heatsink, SDRAM-DDR2 (1 of 20 sites)]

31 HPC Centers in Germany: a view from Erlangen
- Jülich Supercomputing Center: BlueGene/P, ... TFlop/s
- HLRS Stuttgart: 20 TFlop/s NEC SX9; to be replaced: 1 PF (2011/12)
- LRZ München: SGI Altix (62 TFlop/s); to be replaced: 3 PF (2012)
[Map labels: Hannover, Berlin, FZ Jülich, Erlangen/Nürnberg, HLRS-Stuttgart, LRZ-München]

32 HLRB Munich: SGI Altix 4700 / 9728 cores
By courtesy of LRZ

33 HLRB-II
- 2D torus between 19 compute nodes, 51.2 GByte/s per direction
- Each compute node: 512 processors, 2000 GByte main memory
By courtesy of LRZ

34 RRZE: "LiMa" cluster
- 500 compute nodes (6,000 cores), each with:
  - 2 Intel Westmere 2.66 GHz hexacores
  - 12 cores/node + SMT cores
  - 24 GB main memory
  - NO local disks
- Vendor: NEC (Dual-Twin Supermicro)
- Power consumption: ~160 kW
- Closed racks with cold water heat exchanger inside
- Full quad data rate (QDR) InfiniBand fat tree interconnect: bandwidth ~3 GB/s per direction and < 2 µs latency
- Parallel filesystem: 130 TB+, accessible with 3 GB/s
- Operating system: Linux
- Peak performance: 63.8 TFlop/s at 2.66 GHz
- LINPACK (Rmax): 57.3 TFlop/s (#130 in TOP500 / Nov. 2010)
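The ratio of LINPACK result to peak performance is a common efficiency figure, and both inputs are on this slide. A short Python sketch; the peak formula assumes 4 double-precision operations per cycle and core, which the slide does not state, and the function names are illustrative.

```python
# LINPACK efficiency of LiMa: measured Rmax relative to theoretical peak.
def peak_tflops(nodes: int, cores_per_node: int, clock_ghz: float,
                flops_per_cycle: int = 4) -> float:
    """Theoretical peak in TFlop/s; flops_per_cycle=4 is an assumption."""
    return nodes * cores_per_node * clock_ghz * flops_per_cycle / 1e3


def linpack_efficiency(rmax_tflops: float, rpeak_tflops: float) -> float:
    """Fraction of the peak performance reached in the LINPACK run."""
    return rmax_tflops / rpeak_tflops


rpeak = peak_tflops(nodes=500, cores_per_node=12, clock_ghz=2.66)
print(f"Rpeak     : {rpeak:.2f} TFlop/s")
print(f"efficiency: {linpack_efficiency(57.3, rpeak):.1%}")
```

Under the stated assumption the computed peak agrees with the 63.8 TFlop/s on the slide, and LINPACK reaches roughly nine tenths of it.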

35 RRZE: "Woody" cluster
- 860 Intel Xeon 5160 processor cores
  - Core2Duo architecture
  - 3.0 GHz, 12 GFlop/s per core
  - 4 cores per compute node
- Installation: November 2006
- Peak performance: ... GFlop/s
- Main memory: 2 GByte per core, 1720 GByte in total
- InfiniBand network: Voltaire DDRx, 216 ports, ... GBit/s+ per node & direction
- Top500 rank 329
- OS: SuSE Linux SLES9 (Nov 2007)
- Parallel filesystem: 15 TByte
- NFS filesystem: 15 TByte
- Power consumption: > 100 kW

36 RRZE: "TinyBlue" cluster
- 84 nodes, each with:
  - Dual-socket Nehalem X5550, 2.66 GHz, SMT (8 physical cores + 8 SMT)
  - 12 GB RAM (DDR3-1333)
  - 250 GB disk
- QDR InfiniBand (fully non-blocking)
- 7.1 TFlop/s peak (> 65% of Woody!)
- Application performance vs. Woody
[Figure: node block diagram with 8 processor cores (P), caches (C), two memory interfaces (MI) and two memories: ccNUMA!]
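The "> 65% of Woody" remark can be cross-checked against the per-core figures given for the two clusters. A rough Python sketch: Woody's peak is taken as 860 cores times 12 GFlop/s per core, and TinyBlue's peak assumes 4 double-precision operations per cycle and physical core (an assumption, SMT threads not counted).

```python
# Peak performance of TinyBlue relative to Woody, from the slide data.
def woody_peak_gflops(cores: int = 860, gflops_per_core: float = 12.0) -> float:
    """Woody peak from core count and per-core peak given on its slide."""
    return cores * gflops_per_core


def tinyblue_peak_gflops(nodes: int = 84, cores_per_node: int = 8,
                         clock_ghz: float = 2.66, flops_per_cycle: int = 4) -> float:
    """TinyBlue peak; flops_per_cycle=4 per physical core is an assumption."""
    return nodes * cores_per_node * clock_ghz * flops_per_cycle


ratio = tinyblue_peak_gflops() / woody_peak_gflops()
print(f"TinyBlue: {tinyblue_peak_gflops() / 1e3:.2f} TFlop/s")
print(f"Woody   : {woody_peak_gflops() / 1e3:.2f} TFlop/s")
print(f"ratio   : {ratio:.0%}")
```

The comparison concerns peak numbers only and says nothing about application performance, which also depends on memory bandwidth and the interconnect.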

37 RRZE: "TinyGPU" cluster
- 8 dual-socket, dual-GPU nodes, each with:
  - 2 x Intel Xeon X5550 (2.66 GHz)
  - 24 GB RAM (DDR3-1333)
  - 2 NVIDIA Tesla M1060, passive cooling
- DDR InfiniBand
- Tesla M1060 GPU:
  - 30 multiprocessors
  - 78 GFlop/s (dp), 933 GFlop/s (sp)
  - 4 GB memory (102 GB/s)
- Overall compute power:
  - Double precision: 0.68 TFlop/s (Xeons), ... TFlop/s (Teslas)
  - Single precision: 1.36 TFlop/s (Xeons), ... TFlop/s (Teslas)
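The aggregate numbers for this cluster can be sketched from the per-device figures. The Python below assumes quad-core X5550 CPUs with 4 double-precision operations per cycle and core, and twice that in single precision; the GPU totals are plain products of the per-GPU peaks on the slide, not values quoted by the authors.

```python
# Aggregate CPU and GPU peak performance of the TinyGPU cluster.
NODES = 8
GPUS_PER_NODE = 2
GPU_DP_GFLOPS = 78    # per Tesla M1060, from the slide
GPU_SP_GFLOPS = 933   # per Tesla M1060, from the slide


def cpu_peak_gflops(flops_per_cycle: int, sockets: int = 2,
                    cores_per_socket: int = 4, clock_ghz: float = 2.66) -> float:
    """Cluster-wide CPU peak; cores_per_socket=4 is an assumption."""
    return NODES * sockets * cores_per_socket * clock_ghz * flops_per_cycle


def gpu_peak_gflops(per_gpu_gflops: int) -> int:
    """Cluster-wide GPU peak as a plain sum over all GPUs."""
    return NODES * GPUS_PER_NODE * per_gpu_gflops


print(f"dp: CPUs {cpu_peak_gflops(4) / 1e3:.2f} TFlop/s, "
      f"GPUs {gpu_peak_gflops(GPU_DP_GFLOPS) / 1e3:.2f} TFlop/s")
print(f"sp: CPUs {cpu_peak_gflops(8) / 1e3:.2f} TFlop/s, "
      f"GPUs {gpu_peak_gflops(GPU_SP_GFLOPS) / 1e3:.2f} TFlop/s")
```

The CPU values reproduce the 0.68 and 1.36 TFlop/s on the slide. In this calculation the GPU advantage over the CPUs is much larger in single precision than in double precision.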

38 RRZE: Windows cluster
- 16 nodes with dual-socket AMD Opteron 2435 (2.6 GHz, "Istanbul" hexacore)
- 32 GB RAM
- 160 GB HDD
- 2 TFlop/s peak
- GBit
- Operating system: Windows HPC Server
