FUJITSU HPC and the Development of the Post-K Supercomputer

Toshiyuki Shimizu
Vice President, System Development Division, Next Generation Technical Computing Unit
November 16th, 2016

Post-K is currently under development. Information in these slides is subject to change without notice.

FUJITSU HPC and Post-K Development
  Introduction of HPC solutions, HPC product portfolio
  High-end HPC supercomputer development
  Performance of high-end machines preceding the Post-K
  Post-K goals and approaches
  Post-K hardware
  Post-K software
  Performance balance
  Summary

FUJITSU HPC Solutions to Satisfy Customer Demands
  High-end supercomputers, both Fujitsu-developed CPUs and x86 cluster systems
  Single system image operation w/ FUJITSU system software
  High performance, high availability, and high reliability

  Product portfolio (segments: High-end, Divisional, Departmental, Work Group)
    Supercomputer: K computer (co-developed with RIKEN), PRIMEHPC FX10, PRIMEHPC FX100
      High scalability with Fujitsu-developed CPU and interconnect
    x86 cluster: CX400/CX600 (KNL), BX900/BX400, RX2530/RX2540
      PRIMERGY x86 cluster systems support the latest CPUs and accelerators
    Large-scale SMP system: RX900

FUJITSU High-end Supercomputer Development
  Timeline, 2011 to 2020
    FUJITSU
      PRIMEHPC FX10: 1.8x CPU perf. of K, easier installation
      PRIMEHPC FX100: 4x (DP) / 8x (SP) CPU perf. of K, Tofu2, high-density pkg & lower energy
      Technical Computing Suite (TCS): handles millions of parallel jobs
        OS: lower OS jitter w/ assistant core
        FEFS: super scalable file system
        MPI: ultra scalable collective communication libraries
    Japan's national projects
      Development and operation of K computer
      HPCI strategic apps program
      App. review, FS projects
      Post-K computer development

  K computer and PRIMEHPC FX10/FX100 in operation
    The CPU and interconnect of FX10/FX100 supercomputers inherit the K computer's architectural concept, featuring state-of-the-art technologies
    System software TCS supports FUJITSU supercomputers
    Many applications are currently running on these machines and being developed for science and various industries
  Post-K supercomputer
    RIKEN and FUJITSU are working together to provide a successor to the K computer with application R&D teams using a co-design approach

Achievements with the K computer
  Prestigious benchmark awards
    TOP500: 10.5 Pflops, 93% efficiency, No. 1
    HPCG: 602 Tflops, 5.3% efficiency, No. 1
    Graph500: 38.6 TTEPS, No. 1
    HPC Challenge Class 1: No. 1 in all categories
      (1) Global HPL, (2) Global RandomAccess, (3) EP STREAM, (4) Global FFT
  Gordon Bell Prize awards
    "First principles calculation of electronic states of a silicon nanowire with 100,000 atoms on the K computer" (2011)
    "4.45 Pflops Astrophysical N-Body Simulation on K Computer: The Gravitational Trillion-Body Problem" (2012)
    At SC16, 6 years from the shipment, nominated as finalist:
    "Simulations of Below-Ground Dynamics of Fungi: 1.184 Pflops Attained by Automated Generation and Autotuning of Temporal Blocking Codes" (2016 finalist)

Performance of FUJITSU High-end Machines
  FUJITSU's custom CPUs steadily increase their FP performance
  Uncompromised data bandwidth for the best use of applications
  With the FX100, we introduced the SMaC concept, followed by the assistant core (AC), and CMG structure

                                FX100           FX10         K computer
  Available year                CY2015          CY2012       CY2010
  Double flops / CPU            1 TF            235 GF       128 GF
  Single flops / CPU            2 TF            235 GF       128 GF
  SIMD width                    256 bit         128 bit      128 bit
  # of CMG (# of cores/CMG*)    2 (16+1xAC*)    1 (16)       1 (8)
  Memory BW                     480 GB/s        83.5 GB/s    64 GB/s
  Byte per flop                 0.4 ~ 0.5 (all three machines)

  *AC (Assistant Core) for OS jitter reduction by processing IO operations and async. MPI handling
  *CMG (Core Memory Group) is a group of CPU cores sharing L2 and memory for efficient BW
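The "Byte per flop" row can be reproduced from the other rows as memory bandwidth divided by peak double-precision performance. The following is a minimal sketch of that arithmetic; the function name bytes_per_flop is an assumption made for illustration and does not come from the slides.

```c
#include <assert.h>

/* Bytes per flop (B/F): memory bandwidth in GB/s divided by peak
   double-precision performance in Gflops.
   The function name is hypothetical, chosen for this sketch. */
double bytes_per_flop(double mem_bw_gbs, double peak_gflops)
{
    assert(peak_gflops > 0.0);
    return mem_bw_gbs / peak_gflops;
}
```

With the table values, the K computer gives 64 / 128 = 0.5 and the FX100 gives 480 / 1000 = 0.48. The FX10 figures (83.5 GB/s, 235 GF) give about 0.36, slightly below the quoted range, so that range is best read as approximate.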

SMaC (Scalable Many Core) Concept & Approach
  Many-core-oriented, long-lasting architecture
    Scalable performance improvement by increasing the number of cores
    Increasing the number of cores would be safe, even in the post-Moore's Law era, by using 3D stacking and alternative, newer technologies
  Core: middle-sized, general purpose, out-of-order, superscalar processor core
    Good performance for variety of apps
    Low power by optimal balance of resources & perf.
  Assistant core
    OS jitter reduction by processing IO op, async MPI
    Highly scalable performance by low system noise
  Core Memory Group (CMG): many-core building block, ccNUMA between CMGs
    Hierarchical structure for hybrid parallel model
    Optimized area and performance

  [Block diagram: FX100 CPU implementation. Two CMGs, each with cores, an assistant core, and an L2 cache (shared L2 cache & memories); each CMG has a MAC and memory interface; Tofu2 controller and Tofu2 interface; PCI controller and PCI interface]

Post-K Goals and Approaches
  Post-K goals
    High application performance and good power efficiency
    Keeping application compatibility while advancing from predecessors
    Good usability and better accessibility for users
  Our approaches
    Developing high performance and scalable, custom CPU cores
      Performance: wider SIMD & high memory BW, mathematical acc. primitives
      Scalability: SMaC (scalable many core), zero OS jitter (assistant core)
      Power efficiency: the best device tech, power control functions, optimal resources
    Maintaining performance balance and supporting advanced features
      High memory BW, Tofu interconnect, and RIKEN advanced system software
    Adopting ARM standard architecture
      Co-operation with ARM/Linux community and utilization of open source software
      Getting involved in the ARM HPC ecosystem

Post-K Powered by FUJITSU-designed CPU Cores & Tofu
  FUJITSU CPU cores support ARM SVE ISA
    FUJITSU, as a lead partner in ARM SVE development, contributes to specification of ARM SVE (Scalable Vector Extension), for application performance
    FUJITSU ARM core incorporates FUJITSU's proven supercomputer microarchitecture
  ARM SVE, plus optional functions and Tofu, maintain programming models and performance balance
  Post-K complies with ARM's standard frameworks (SBSA, etc.), for compatibility among platforms

  Functions for perf. (row groups: SVE incorporated, optional functions)
                          Post-K     FX100      FX10       K computer
    SIMD                  512 bit    256 bit    128 bit    128 bit
    FMA4
    Math. acc. prim.*     Enhanced
    Inter-core barrier
    Sector cache          Enhanced
    Prefetch modes        Enhanced
    Interconnect          Tofu, Enhanced

  *Mathematical acceleration primitives include trigonometric functions, sine & cosines, and exponential...

System Software for Post-K
  Currently in development, based on co-design scheme with application developers, including system hardware

  Software stack (FUJITSU Technical Computing Suite / RIKEN Advanced System Software)
    Post-K applications
    Management software
      System management for highly available & power saving operation
      Job management for higher system utilization & power efficiency
    Hierarchical file I/O software
      Lustre-based distributed file system FEFS
    Programming environment
      MPI (Open MPI, ), OpenMP, COARRAY, Math Libs.
      Compilers (C, C++, Fortran)
      Debugging and tuning tools
    Linux OS / McKernel (lightweight kernel)
    Post-K system hardware

FUJITSU Compiler for Post-K
  Maximizes the execution performance of HPC applications
    Covers a wide range of applications, including those in which integer calculations are dominant
    Targets 512bit-wide vectorization as well as vector-length-agnostic code
      Fixed vector length facilitates optimizations such as constant folding
  Inherits options/features of K computer, PRIMEHPC FX10 and FX100
  Language standard support
    Fully supported: Fortran 2008, C11, C++14, OpenMP 4.5
    Partially supported: Fortran 2015, C++1z, OpenMP 5.0
  Supports ARM C Language Extensions (ACLE) for SVE
    ACLE allows programmers to use SVE instructions as C intrinsic functions

    // C intrinsics in ACLE for SVE
    svfloat64_t z0 = svld1_f64(p0, &x[i]);
    svfloat64_t z1 = svld1_f64(p0, &y[i]);
    svfloat64_t z2 = svadd_f64_x(p0, z0, z1);
    svst1_f64(p0, &z[i], z2);

    // SVE assembler
    ld1d z1.d, p0/z, [x19, x3, lsl #3]
    ld1d z0.d, p0/z, [x20, x3, lsl #3]
    fadd z1.d, p0/m, z1.d, z0.d
    st1d z1.d, p0, [x21, x3, lsl #3]
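The ACLE fragment on this slide shows only a loop body. As a minimal sketch, the body might sit inside a vector-length-agnostic loop as follows. The function name vec_add, the loop structure, and the scalar fallback are assumptions made for illustration; this is not output of the FUJITSU compiler. The intrinsics svcntd and svwhilelt_b64 are part of ACLE for SVE but do not appear on the slide.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* z[i] = x[i] + y[i] for 0 <= i < n (hypothetical helper for this sketch). */
void vec_add(const double *x, const double *y, double *z, int64_t n)
{
    assert(n >= 0);
#if defined(__ARM_FEATURE_SVE)
    /* svcntd() gives the number of 64-bit lanes per vector at run time,
       so the same binary works for any hardware vector length. */
    for (int64_t i = 0; i < n; i += (int64_t)svcntd()) {
        /* Predicate: lane k is active while i + k < n (handles the tail). */
        svbool_t p0 = svwhilelt_b64(i, n);
        svfloat64_t z0 = svld1_f64(p0, &x[i]);
        svfloat64_t z1 = svld1_f64(p0, &y[i]);
        svfloat64_t z2 = svadd_f64_x(p0, z0, z1);
        svst1_f64(p0, &z[i], z2);
    }
#else
    /* Scalar fallback so the sketch also builds without SVE support. */
    for (int64_t i = 0; i < n; i++)
        z[i] = x[i] + y[i];
#endif
}
```

The predicate removes the need for a separate remainder loop, which is one reason a vector-length-agnostic loop can stay this short. On a compiler without SVE support only the scalar branch is built.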

Vectorization by FUJITSU Compiler
  Dynamic instruction counts of representative loops of NPB 3.3-SER
    [Bar chart: # of executed instruction ratio (y-axis, 0 to 1.2) for MG, BT, SP, LU; series: 256b SIMD (FX100), 512b SIMD (Post-K), 512b SIMD (estimated from FX100 result); bar labels: 75% SIMD, 69% SIMD, 69% SIMD, 72% SIMD]

  Vectorized loops in TSVC* (Fortran and C)
    TSVC (total)      FX100    Post-K
    Fortran (135)     96       111
    C (151)           106      121

    // Sample of vectorized loop by SVE
    // s482
    for (int i = 0; i < LEN; i++) {
        a[i] += b[i] * c[i];
        if (c[i] > b[i]) break;
    }

  *[Fortran] D. Callahan, J. Dongarra, and D. Levine. Vectorizing compilers: a test suite and results. In Supercomputing '88, pp. 98-105.
   [C] S. Maleki, Y. Gao, M. J. Garzarán, T. Wong, and D. A. Padua, "An Evaluation of Vectorizing Compilers," PACT 2011, pp. 372-382.
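The s482 loop is hard to vectorize because of its data-dependent break. One way to picture a predicated approach is to process the arrays in chunks, find the first lane whose exit condition holds, and apply the update only to lanes up to and including that lane. The following is a scalar emulation of that idea, not SVE code and not the FUJITSU compiler's actual code generation; the names s482_scalar, s482_chunked, and VL are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical lane count: 512 bits / 64-bit doubles. */
#define VL ((size_t)8)

/* Reference: the s482 loop as written on the slide. */
void s482_scalar(double *a, const double *b, const double *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++) {
        a[i] += b[i] * c[i];
        if (c[i] > b[i]) break;
    }
}

/* Chunked emulation of a predicated early-exit loop. */
void s482_chunked(double *a, const double *b, const double *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t base = 0; base < n; base += VL) {
        size_t lanes = (n - base < VL) ? n - base : VL;

        /* Step 1: build the "predicate" by locating the first exiting lane. */
        size_t active = lanes;
        int exit_found = 0;
        for (size_t l = 0; l < lanes; l++) {
            if (c[base + l] > b[base + l]) {
                active = l + 1; /* the exiting lane still performs its update */
                exit_found = 1;
                break;
            }
        }

        /* Step 2: update only the active lanes. */
        for (size_t l = 0; l < active; l++)
            a[base + l] += b[base + l] * c[base + l];

        if (exit_found) break;
    }
}
```

Because the update precedes the test in the original loop, the lane that triggers the break must still be updated, which is why the active range includes it. This emulation reads only in-bounds elements; real vector hardware needs a safe way to load speculatively past the exit point, which the sketch does not model.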

Discussion on the Perf. Balance for Applications
  Effectiveness for the meteorology application, IFS*, was evaluated
    Good performance balance w/ wider SIMD and memory bandwidth from K to FX100 realizes an IFS performance improvement
    Keeping the performance balance throughout the generations toward Post-K is expected to provide scalable speed-up

                     K computer    FX100
    Flops / CPU      128 Gflops    1 Tflops
    SIMD width       128 bit       256 bit
    Memory BW        64 GB/s       480 GB/s
    Byte per flop    0.4 ~ 0.5

  [Chart: Speedup of IFS CNT4 (TL159, 96 cores); configurations: .5x B/F, 2x B/F w/ narrow SIMD, balanced 2x B/F; annotations: (1) BW limits performance, (2) narrower SIMD limits performance, 1.5x by doubling SIMD, (3) insufficient gain by 2x B/F]

  *The Integrated Forecasting System (IFS) is developed by ECMWF
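The balance argument can be illustrated with a simple roofline-style bound, in which attainable performance is the smaller of peak compute and memory bandwidth times the application's flops per byte. This model is not taken from the slides and does not reproduce the IFS measurements; the function name attainable_gflops and the intensity of 1 flop/byte used in the note below are assumptions made for illustration.

```c
#include <assert.h>

/* Roofline-style upper bound on performance in Gflops.
   flops_per_byte is the application's arithmetic intensity.
   The function name is hypothetical, chosen for this sketch. */
double attainable_gflops(double peak_gflops, double mem_bw_gbs,
                         double flops_per_byte)
{
    assert(peak_gflops > 0.0 && mem_bw_gbs > 0.0 && flops_per_byte > 0.0);
    double bw_bound = mem_bw_gbs * flops_per_byte;
    return bw_bound < peak_gflops ? bw_bound : peak_gflops;
}
```

For a hypothetical code at 1 flop/byte, the K computer figures (128 Gflops, 64 GB/s) give a bound of 64 Gflops. Raising only the peak to 1 Tflops leaves the bound at 64 Gflops, and raising only the bandwidth to 480 GB/s caps it at 128 Gflops. Raising both, as on the FX100, lifts the bound to 480 Gflops, which matches the slide's point that compute and bandwidth need to grow together.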

Summary of Post-K Development
  Developing high performance, scalable, custom CPU cores
    SMaC architecture with an assistant core for scalable performance
    ARM instruction set architecture, SVE, as a standard architecture
    ARM standard frameworks, SBSA, etc., for compatibility among platforms
  Keeping performance balanced and advancing preceding machines
    Higher performance and higher data bandwidth
  Advanced system software and applications
    Co-design scheme with application developers
    FUJITSU optimizing compilers for Post-K
  Performance balance is a key for application speedup
  Post-K will meet requirements & be valuable for science and industries

Post-K: Succeeding the K computer Heritage
  Prestigious benchmark awards
    TOP500: 10.5 Pflops, 93% efficiency, No. 1
    HPCG: 602 Tflops, 5.3% efficiency, No. 1
    Graph500: 38.6 TTEPS, No. 1
    HPC Challenge Class 1: No. 1 in all categories
      (1) Global HPL, (2) Global RandomAccess, (3) EP STREAM, (4) Global FFT
  Gordon Bell Prize awards
    "First principles calculation of electronic states of a silicon nanowire with 100,000 atoms on the K computer" (2011)
    "4.45 Pflops Astrophysical N-Body Simulation on K Computer: The Gravitational Trillion-Body Problem" (2012)
    At SC16, 6 years from the shipment, nominated as finalist:
    "Simulations of Below-Ground Dynamics of Fungi: 1.184 Pflops Attained by Automated Generation and Autotuning of Temporal Blocking Codes" (2016 finalist)