Directions in HPC Technology

Osborn Caldwell

1 Directions in HPC Technology

2 PRACE evaluates Technologies for Multi-Petaflop/s Systems
This should lead to the integration of 3 to 5 Tier-0 world-class systems in Europe from 2010 on. It implies:
- New hardware technology
- Scalability to > $10^5$ cores (also new!)
- New programming paradigms

3 Some workpackages
- WP6: Software for Petascale systems. Includes: scaling, benchmarking, example code selection.
- WP7: Petaflop/s systems for 2009/2010. Vendor update meetings (together with WP8, 4 meetings so far). Selection of prototype systems (6 sites, ranging from extension of existing systems to PGAS languages).
- WP8: Future Pflop/s technology. Technology Watch (see meetings above). Selection of 8 prototype systems: accelerator-based (Cell, GPUs, PetaPath, FPGAs), PGAS languages, High Performance I/O. Prototype for energy efficiency pending.

4 PRACE Workpackage 8 (WP8) surveys new technology
WP8 has intimate connections with:
- WP6 (software for Pflop/s systems)
- WP7 (Pflop/s systems to be installed 2010)
- STRATOS. Originated from WP8 (Workpackage 8.1, AHTP). Continuous relations with system & component suppliers.
STRATOS has commercial members and 3 working groups:
- Technology watch
- Green IT
- Peta- to Exa-scale software

5 HPC enemy No. 1: The memory wall
Speed of memory does not grow in proportion to CPU speed. Actual situation is worse. Example: a Nehalem X5570 CPU at 2.93 GHz with 1333 MHz DDR3 memory. Mismatch factor 11.

6 HPC enemy No. 1: The memory wall
[Table: types of memory, a quick overview (A best, F worst)]

7 HPC enemy No. 1: The memory wall
Most probable near-future contenders:
- Z-RAM (Hynix, AMD). Hurdles: reliability, durability, needs Silicon on Insulator fab.
- STT-MRAM (Hynix, Samsung, actual plans for 2011). Hurdle: needs new, expensive fab.

8 HPC enemy No. 1: The memory wall
Future non-volatile memory may be based on graphene nanoribbons (GNRs). GNRs are conductors or semiconductors depending on size and shape.
Tim. J. et al, Nonvolatile Switching in Graphene Field-Effect Devices, IEEE Electron Device Letters 29, 952 (200 8

9 HPC enemy No. 1: The memory wall
Increasing memory speed means shifting the bottleneck to bandwidth. Solution: 3-D stacking. Simulation shows 2.… times speedup on many codes.
G.H. Loh, 3D-stacked Memory Architectures for Multi-Core Processors, ACM/IEEE Intl. Conf. on Comp. Arch., June

10 In HPC speed is never high enough
Until 2005 this was partly solved by increasing the clock frequency. Energy consumption is $E = N C V^{2} f$ ($N$ = number of devices per unit surface, $C$ = capacitance, $V$ = voltage, $f$ = frequency), so the choice was made to increase $N$, lower $V$, and keep $f$ as low as possible. Technology shrinks make more cores per chip possible with acceptable energy consumption: dual-core, quad-core and 6-core CPUs, on to many-core.
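The trade-off in the energy relation can be illustrated with a few lines of arithmetic. This is a minimal sketch: the function name and all numeric values are hypothetical, chosen only to show how doubling the device count at a lower voltage and frequency can leave energy consumption roughly unchanged.

```python
def dynamic_energy(n_devices, capacitance, voltage, frequency):
    """Relative energy consumption, E = N * C * V^2 * f (arbitrary units)."""
    return n_devices * capacitance * voltage ** 2 * frequency

# Hypothetical baseline: one core at nominal voltage and frequency.
baseline = dynamic_energy(n_devices=1, capacitance=1.0, voltage=1.0, frequency=1.0)

# Hypothetical alternative: twice the devices, 20% lower voltage, 25% lower frequency.
dual = dynamic_energy(n_devices=2, capacitance=1.0, voltage=0.8, frequency=0.75)

print(f"baseline energy:  {baseline:.2f}")
print(f"dual-core energy: {dual:.2f}")
print(f"aggregate clock rate of the dual-core case: {2 * 0.75:.2f}x")
```

With these assumed values the two-core configuration uses slightly less energy than the baseline while offering 1.5 times the aggregate clock rate, provided the workload can use both cores.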

11 Combatting lack of speed
Vector units (SSEx, AVX, etc.) have been introduced. Originally meant for graphics, they also help with simple, regular arithmetic operations.

12 Combatting lack of speed
External accelerators are a logical extension. Presently four types are available:
- Cell-based systems (IBM): BSC, FZJ prototypes
- FPGA-based (CPUTech, Mitrion, Convey, Kuberre, SRC, ...): EPCC prototype
- GPU-based (NVIDIA, ATI/AMD): GENCI prototype
- ClearSpeed CSX700-based (PetaPath): CINES, LRZ, NCF prototypes

13 Combatting lack of speed
Convey, Kuberre, and SRC systems hide FPGAs from the user through special compilers and libraries. Normal x86 processor-based hosts act as the intermediaries between users and the FPGAs. FPGAs can take on any personality at the cost of reconfiguration time.
[Figure: SRC-7 system]

14 Combatting lack of speed
GPU-based accelerators rely on:
- Many parallel processor streams (240/GPU in the Tesla S1070)
- At a moderate clock frequency (1.… GHz)
8-byte precision arithmetic is much slower than 4-byte precision (< 10% of its speed). Data transport to/from the host is a main issue. No error correction!

15 Combatting lack of speed
PetaPath accelerator systems use ClearSpeed CSX700 processors. Inherently 8-byte precision. 192 (2 × 96) processors at 250 MHz (extremely low power: 8.1 Watt).

16 Combatting lack of speed The Empire Strikes Back: Intel will launch the many-core Larrabee in

17 Combatting lack of speed
Accelerators can help, but:
- Transport of data/results to/from host systems can be a challenge.
- Programmability is another issue:
  - Cell: RapidMind (LRZ WP8 prototype), OpenCL could help
  - FPGAs: Handel-C, Mitrion C; OpenFPGA could help (also proprietary compilers from Convey, Kuberre, SRC)
  - GPUs: CUDA, Brook+; CAPS HMPP (GENCI/CEA WP8 prototype), OpenCL, RapidMind could help
  - PetaPath: Cn; OpenCL could help
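Why data transport matters can be seen from a simple cost model. The sketch below is an illustration under stated assumptions, not a measured result: it assumes the same amount of data travels to and from the accelerator, that transfer and compute do not overlap, and all function names and numbers are hypothetical.

```python
def offload_time(data_bytes, link_bytes_per_s, host_compute_s, accel_speedup):
    """Time for an offloaded kernel: transfer in and out plus accelerated compute."""
    transfer_s = 2 * data_bytes / link_bytes_per_s  # to the accelerator and back
    return transfer_s + host_compute_s / accel_speedup

def worth_offloading(data_bytes, link_bytes_per_s, host_compute_s, accel_speedup):
    """True when offloading beats running the kernel on the host."""
    total = offload_time(data_bytes, link_bytes_per_s, host_compute_s, accel_speedup)
    return total < host_compute_s

# Hypothetical case: 1 GB of data, 4 GB/s link, accelerator 10x faster than the host.
long_kernel = worth_offloading(1e9, 4e9, host_compute_s=1.0, accel_speedup=10)
short_kernel = worth_offloading(1e9, 4e9, host_compute_s=0.5, accel_speedup=10)

print("1.0 s kernel worth offloading:", long_kernel)
print("0.5 s kernel worth offloading:", short_kernel)
```

In this model the transfer costs 0.5 s regardless of the accelerator's speed, so a kernel that takes only 0.5 s on the host cannot gain from offloading, however fast the accelerator is.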

18 Interconnect issues in HPC
[Table: approximate current status of interconnects]
Remarks:
1) IB bandwidth is theoretical maximum.
2) Proprietary interconnects' BW is measured (MPI P-to-P).
3) Latencies are all measured.

19 Interconnect issues in HPC
Bandwidth and latencies of open interconnects no longer differ much from proprietary ones. MPI implementations and special provisions (e.g., shmem) might. An important differentiator for multi-Pflop/s systems is the topology. The presently prevalent ones have drawbacks at very large scales:

20 Interconnect issues in HPC
Fat Tree: becomes unwieldy and expensive, with high power consumption. Possible solution: Thinned Fat Tree.
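How quickly a fat tree grows can be estimated with a short calculation. The slide does not specify a construction, so this sketch assumes the common three-level fat tree built from identical k-port switches with full bisection bandwidth; the function name is hypothetical.

```python
def fat_tree_size(k):
    """Hosts and switches in a three-level fat tree built from k-port switches."""
    if k % 2:
        raise ValueError("k must be even")
    hosts = k ** 3 // 4
    # k pods, each holding k/2 edge switches and k/2 aggregation switches
    edge_and_aggregation = k * k
    core = (k // 2) ** 2
    return hosts, edge_and_aggregation + core

for k in (24, 48, 64):
    hosts, switches = fat_tree_size(k)
    print(f"k={k}: {hosts} hosts need {switches} switches")
```

Under this assumption even 64-port switches reach only 65536 end points and already require several thousand switches, which gives a feel for the cost and power concern. A thinned tree reduces the number of upper-level switches at the price of lower bisection bandwidth.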

21 Interconnect issues in HPC
3-D Torus: becomes unreliable; one link failure affects many nodes. Possible solution: higher-dimension torus.
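The effect of raising the torus dimension can be sketched numerically. The example below compares two hypothetical tori with the same node count; the side lengths and function names are assumptions for illustration, and the link count assumes every side length is at least 3.

```python
from math import prod

def torus_diameter(dims):
    """Maximum hop count between two nodes of a torus with the given side lengths."""
    return sum(d // 2 for d in dims)

def links_per_node(dims):
    """Links attached to each node, assuming every side length is at least 3."""
    return 2 * len(dims)

for dims in ((16, 16, 16), (4, 4, 4, 4, 4, 4)):
    print(f"{len(dims)}-D torus, {prod(dims)} nodes: "
          f"diameter {torus_diameter(dims)}, {links_per_node(dims)} links per node")
```

For the same 4096 nodes the 6-D torus halves the diameter and doubles the links per node, so more alternative routes exist around a failed link.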

22 Software issues in HPC
Present programming models (MPI, OpenMP, hybrid MPI/OpenMP) are still usable, but barely. PGAS (Partitioned Global Address Space) languages may be a way out. Example (UPC):
    shared double a[n], b[n];
    shared double sum = 0.0;
    int i;
    ...
    /* fourth expression: iteration i runs on the thread with affinity to i */
    upc_forall (i = 0; i < n; i++; i)
        sum += a[i]*b[i];  /* simplified: updates to the shared sum are not synchronized here */
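The ownership rule behind the upc_forall loop can be mimicked in a few lines of Python. This is a serial sketch, not UPC: with an integer affinity expression i, iteration i belongs to thread i modulo the thread count. Keeping per-thread partial sums and reducing them at the end is one common way to avoid concurrent updates to a shared sum; it is an assumption here, not something shown on the slide.

```python
def dot_with_affinity(a, b, threads):
    """Dot product where iteration i is owned by thread i % threads."""
    partial = [0.0] * threads
    for t in range(threads):                    # each pass plays the role of one thread
        for i in range(t, len(a), threads):     # indices with affinity to thread t
            partial[t] += a[i] * b[i]
    return sum(partial)                         # reduce the per-thread partial sums

a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [5.0, 4.0, 3.0, 2.0, 1.0]
result = dot_with_affinity(a, b, threads=2)
print(result)
```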

23 Software issues in HPC
Presently UPC (Unified Parallel C) and CAF (Co-Array Fortran) are available (though not very mature). Advantages of PGAS languages are:
- Simpler implementation
- Some control over data distribution
- Some control over shared/private character of variables

24 Software issues in HPC
Next-generation PGAS languages are Chapel (Cray) and X10 (IBM). Both were developed in the US DARPA High Productivity Computing Systems program. Both have memory locality awareness and support for Transactional Memory. Example (Chapel):
    const problem_space: domain(1) distributed(block) = [1..n];
    var a, b: [problem_space] elem_type;
    var dot_a_b: elem_type;
    ...
    dot_a_b = 0.0;
    atomic( dot_a_b += a*b );

25 Power issues in HPC
For Multi-Petaflop/s to Exaflop/s systems a real disruption in technology is needed. With the current type of technology (even with progress factored in) we would have the following situation:
    Pflop/s:  9.3 MW for IT,   14 MW total,  12.3 M€/y
    Eflop/s:  154 MW for IT,  213 MW total,   187 M€/y
Assuming: PUE = 1.5, €0.10/kWh.
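The yearly cost figures follow from simple arithmetic on total power and electricity price. The sketch below is a minimal illustration, assuming continuous operation for 8760 hours per year (an assumption not stated on the slide); the function names are hypothetical, and the total power values are taken from the slide as given.

```python
HOURS_PER_YEAR = 24 * 365  # assumes the system runs all year

def facility_power_mw(it_power_mw, pue):
    """Total facility power: IT power scaled by the Power Usage Effectiveness."""
    return it_power_mw * pue

def annual_cost_millions(total_power_mw, price_per_kwh):
    """Yearly electricity cost in millions of the currency used for the price."""
    kwh_per_year = total_power_mw * 1000 * HOURS_PER_YEAR
    return kwh_per_year * price_per_kwh / 1e6

print(f"9.3 MW IT at PUE 1.5: {facility_power_mw(9.3, 1.5):.2f} MW total")
print(f"14 MW total:  {annual_cost_millions(14, 0.10):.1f} M per year")
print(f"213 MW total: {annual_cost_millions(213, 0.10):.1f} M per year")
```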

26 Power issues in HPC
What we need at least:
- Non-volatile, 3-D stacked memory (graphene?, memristor?)
- Faster device switching times. Carbon nanotubes? Rapid Single Flux Quantum devices? These have switching times in picoseconds, but need liquid helium temperatures.
- High-density storage (holographic storage?)
- Photonics where appropriate

27 Going to interesting times
Thanks to: Herbert Huber, Jean-Philippe Nominé, Jean-Marie Normand, François Robin and many other PRACE workers, and to you for your attention!
