FUJITSU HPC and the Development of the Post-K Supercomputer

Toshiyuki Shimizu, Vice President, System Development Division, Next Generation Technical Computing Unit
November 16th, 2016
Post-K is currently under development. Information in these slides is subject to change without notice.

FUJITSU HPC and Post-K Development
  Introduction of HPC solutions, HPC product portfolio
  High-end HPC supercomputer development
  Performance of high-end machines preceding the Post-K
  Post-K goals and approaches
  Post-K hardware
  Post-K software
  Performance balance
  Summary

FUJITSU HPC Solutions to Satisfy Customer Demands
  High-end supercomputers, both Fujitsu-developed CPUs and x86 cluster systems
  Single system image operation w/ FUJITSU system software
  High performance, high availability, and high reliability

  Supercomputer PRIMEHPC: K computer (co-developed with RIKEN), PRIMEHPC FX10, PRIMEHPC FX100
    High scalability with Fujitsu-developed CPU and interconnect
  x86 Cluster: CX400/CX600 (KNL), BX900/BX400, RX2530/RX2540
    PRIMERGY x86 cluster systems support the latest CPUs and accelerators
  Large-Scale SMP System: RX900
  Target segments: High-end, Divisional, Departmental, Work Group

FUJITSU High-end Supercomputer Development
  Timeline, 2011 to 2020
    FUJITSU
      PRIMEHPC FX10: 1.8x CPU perf. of K, easier installation
      PRIMEHPC FX100: 4x (DP) / 8x (SP) CPU perf. of K, Tofu2, high-density pkg & lower energy
      Technical Computing Suite (TCS): handles millions of parallel jobs
        OS: lower OS jitter w/ assistant core
        FEFS: super scalable file system
        MPI: ultra scalable collective communication libraries
    Japan's National Projects
      Development and operation of K computer
      HPCI strategic apps program
      App. review, FS projects
      Post-K computer development

  K computer and PRIMEHPC FX10/FX100 in Operation
    The CPU and interconnect of the FX10/FX100 supercomputers inherit the K computer's architectural concept, featuring state-of-the-art technologies
    System software TCS supports FUJITSU supercomputers
    Many applications are currently running on these machines and being developed for science and various industries

  Post-K Supercomputer
    RIKEN and FUJITSU are working together to provide a successor to the K computer with application R&D teams using a co-design approach

Achievements with the K computer
  Prestigious Benchmark Awards
    TOP500: 10.5 Pflops, 93% efficiency
    HPCG: 602 Tflops, 5.3% efficiency, No. 1
    Graph500: 38.6 TTEPS, No. 1
    HPC Challenge Class 1: No. 1 in all categories: (1) Global HPL, (2) Global RandomAccess, (3) EP STREAM, (4) Global FFT
  Gordon Bell Prize Awards
    First principles calculation of electronic states of a silicon nanowire with 100,000 atoms on the K computer (2011)
    4.45 Pflops Astrophysical N-Body Simulation on K Computer: The Gravitational Trillion-Body Problem (2012)
  Nominated as finalist at SC16, 6 years from the shipment
    Simulations of Below-Ground Dynamics of Fungi: 1.184 Pflops Attained by Automated Generation and Autotuning of Temporal Blocking Codes (2016 finalist)

Performance of FUJITSU High-end Machines
  FUJITSU's custom CPUs steadily increase their FP performance
  Uncompromised data bandwidth for the best use of applications
  With the FX100, we introduced the SMaC concept, followed by the assistant core (AC) and CMG structure

                               FX100          FX10         K computer
  Available year               CY2015         CY2012       CY2010
  Double Flops / CPU           1 TF           235 GF       128 GF
  Single Flops / CPU           2 TF           235 GF       128 GF
  SIMD width                   256 bit        128 bit      128 bit
  # of CMG (# of cores/CMG*)   2 (16+1xAC*)   1 (16)       1 (8)
  Memory BW                    480 GB/s       83.5 GB/s    64 GB/s
  Byte per flop                0.4 ~ 0.5 (all three machines)

  *AC (Assistant Core): for OS jitter reduction by processing IO operations and async. MPI handling
  *CMG (Core Memory Group): a group of CPU cores sharing L2 and memory for efficient BW
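The "Byte per flop" row can be checked against the other rows: it is memory bandwidth divided by double-precision peak. The sketch below is only an illustration. The struct and function names are hypothetical, and the figures are the rounded table values, so the results are approximate.

```c
/* One entry per machine, using the rounded values from the table above. */
struct machine {
    const char *name;
    double dp_gflops;   /* double-precision peak per CPU, in Gflops */
    double mem_bw_gbs;  /* memory bandwidth per CPU, in GB/s */
};

static const struct machine machines[] = {
    { "FX100",      1000.0, 480.0 },
    { "FX10",        235.0,  83.5 },
    { "K computer",  128.0,  64.0 },
};

/* Bytes of memory traffic available per double-precision flop. */
double byte_per_flop(const struct machine *m)
{
    return m->mem_bw_gbs / m->dp_gflops;
}
```

With these rounded inputs, FX100 evaluates to 0.48 and the K computer to 0.5. FX10 comes out near 0.36, so the slide's 0.4 ~ 0.5 range should be read as approximate.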

SMaC (Scalable Many Core) Concept & Approach
  Many core-oriented, long-lasting architecture
    Scalable performance improvement by increasing the number of cores
    Increasing the number of cores would be safe, even in the post-Moore's Law era, by using 3D stacking and alternative, newer technologies
  Core: middle-sized, general purpose, out-of-order, superscalar processor
    Good performance for a variety of apps
    Low power by optimal balance of resources & perf.
  Assistant core
    OS jitter reduction by processing IO op, async MPI
    Highly scalable performance by low system noise
  Core Memory Group (CMG): many-core building block, ccNUMA between CMGs
    Hierarchical structure for hybrid parallel model
    Optimized area and performance

  [Figure: FX100 CPU implementation. Two CMGs, each with cores, an assistant core, a shared L2 cache, and a MAC with memory interface (shared L2 cache & memories); Tofu2 controller and Tofu2 interface; PCI controller and PCI interface.]

Post-K Goals and Approaches
  Post-K Goals
    High application performance and good power efficiency
    Keeping application compatibility while advancing from predecessors
    Good usability and better accessibility for users
  Our Approaches
    Developing high performance and scalable, custom CPU cores
      Performance: wider SIMD & high memory BW, mathematical acc. primitives
      Scalability: SMaC (scalable many core), zero OS jitter (assistant core)
      Power efficiency: the best device tech, power control functions, optimal resources
    Maintaining performance balance and supporting advanced features
      High memory BW, Tofu interconnect, and RIKEN advanced system software
    Adopting ARM standard architecture
      Co-operation with the ARM/Linux community and utilization of open source software
      Getting involved in the ARM HPC ecosystem

Post-K Powered by FUJITSU-designed CPU Cores & Tofu
  FUJITSU CPU cores support ARM SVE ISA
    FUJITSU, as a lead partner in ARM SVE development, contributes to the specification of ARM SVE (Scalable Vector Extension) for application performance
    The FUJITSU ARM core incorporates FUJITSU's proven supercomputer microarchitecture
  ARM SVE, plus optional functions and Tofu, maintain programming models and performance balance
  Post-K complies with ARM's standard frameworks (SBSA, etc.) for compatibility among platforms

  [Table: Functions for Perf. across Post-K, FX100, FX10, and K computer, grouped into "SVE incorporated" and "Optional functions". Only the SIMD row keeps its per-machine values in this transcript.]
    SIMD: Post-K 512bit, FX100 256bit, FX10 128bit, K computer 128bit
    FMA4
    Math. acc. prim.*: Enhanced
    Inter-core barrier
    Sector cache: Enhanced
    Prefetch modes: Enhanced
    Interconnect: Tofu, Enhanced

  *Mathematical acceleration primitives include trigonometric functions, sine & cosines, and exponential...

System Software for Post-K
  Currently in development, based on a co-design scheme with application developers, including system hardware

  [Figure: software stack, top to bottom]
    Post-K Applications
    FUJITSU Technical Computing Suite / RIKEN Advanced System Software
      Management Software
        System management for highly available & power saving operation
        Job management for higher system utilization & power efficiency
      Hierarchical File I/O Software
        Lustre-based distributed file system FEFS
      Programming Environment
        MPI (Open MPI, ), OpenMP, COARRAY, Math Libs.
        Compilers (C, C++, Fortran)
        Debugging and tuning tools
    Linux OS / McKernel (Lightweight Kernel)
    Post-K System Hardware

FUJITSU Compiler for Post-K
  Maximizes the execution performance of HPC applications
    Covers a wide range of applications, including those in which integer calculations are dominant
    Targets 512bit-wide vectorization as well as vector-length-agnostic code
      Fixed vector length facilitates optimizations such as constant folding
    Inherits options/features of the K computer, PRIMEHPC FX10 and FX100
  Language Standard Support
    Fully supported: Fortran 2008, C11, C++14, OpenMP 4.5
    Partially supported: Fortran 2015, C++1z, OpenMP 5.0
  Supports ARM C Language Extensions (ACLE) for SVE
    ACLE allows programmers to use SVE instructions as C intrinsic functions

    // C intrinsics in ACLE for SVE
    svfloat64_t z0 = svld1_f64(p0, &x[i]);
    svfloat64_t z1 = svld1_f64(p0, &y[i]);
    svfloat64_t z2 = svadd_f64_x(p0, z0, z1);
    svst1_f64(p0, &z[i], z2);

    // SVE assembler
    ld1d z1.d, p0/z, [x19, x3, lsl #3]
    ld1d z0.d, p0/z, [x20, x3, lsl #3]
    fadd z1.d, p0/m, z1.d, z0.d
    st1d z1.d, p0, [x21, x3, lsl #3]
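The intrinsics on this slide show only the loop body. A minimal sketch of a complete vector-length-agnostic loop around them could look like the following. The function name vec_add and the scalar fallback are assumptions added for illustration. The SVE path is compiled only when the compiler defines __ARM_FEATURE_SVE, so the same source also builds on non-SVE targets.

```c
#include <stdint.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* z[i] = x[i] + y[i] for i in [0, n). vec_add is a hypothetical name. */
void vec_add(const double *x, const double *y, double *z, int64_t n)
{
#if defined(__ARM_FEATURE_SVE)
    /* svcntd() is the number of 64-bit lanes per vector, known only at
       run time, so the loop makes no assumption about the vector length. */
    for (int64_t i = 0; i < n; i += (int64_t)svcntd()) {
        /* p0 enables only the lanes with i + lane < n, so the final
           partial vector needs no separate scalar tail loop. */
        svbool_t p0 = svwhilelt_b64(i, n);
        svfloat64_t z0 = svld1_f64(p0, &x[i]);
        svfloat64_t z1 = svld1_f64(p0, &y[i]);
        svfloat64_t z2 = svadd_f64_x(p0, z0, z1);
        svst1_f64(p0, &z[i], z2);
    }
#else
    /* Scalar fallback for targets without SVE. */
    for (int64_t i = 0; i < n; i++)
        z[i] = x[i] + y[i];
#endif
}
```

On an SVE-capable compiler the first branch is selected, for example with an -march setting that enables SVE. A build fixed at 512 bits, as the slide describes, would additionally let the compiler treat the lane count as a constant.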

Vectorization by FUJITSU Compiler
  [Figure: Dynamic instruction counts of representative loops of NPB 3.3-SER. Y axis: # of executed instruction ratio (0 to 1.2). X axis: MG, BT, SP, LU. Series: 256b SIMD (FX100), 512b SIMD (Post-K), 512b SIMD (estimated from FX100 result). Bar labels: 75% SIMD, 69% SIMD, 69% SIMD, 72% SIMD.]

  Vectorized loops in TSVC* (Fortran and C)

    TSVC (total)     FX100    Post-K
    Fortran (135)    96       111
    C (151)          106      121

    // Sample of vectorized loop by SVE
    // s482
    for (int i = 0; i < LEN; i++) {
        a[i] += b[i] * c[i];
        if (c[i] > b[i]) break;
    }

  *[Fortran] D. Callahan, J. Dongarra, and D. Levine, "Vectorizing compilers: a test suite and results," in Supercomputing '88, pp. 98-105.
   [C] S. Maleki, Y. Gao, M. J. Garzarán, T. Wong, and D. A. Padua, "An Evaluation of Vectorizing Compilers," PACT 2011, pp. 372-382.
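The s482 loop is hard to vectorize because its trip count depends on the data. One way to picture how a predicated vector ISA can handle it is to work in chunks: find the first lane that triggers the exit, update the lanes up to and including it, then stop. The sketch below emulates this with scalar code. It is not the code the FUJITSU compiler generates, and VL and both function names are hypothetical.

```c
#include <stddef.h>

#define VL 8  /* hypothetical lane count per vector, not taken from the slides */

/* Scalar reference: the s482 loop as written on the slide.
   Returns the number of elements that were updated. */
size_t s482_scalar(float *a, const float *b, const float *c, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        a[i] += b[i] * c[i];
        if (c[i] > b[i])
            return i + 1;
    }
    return len;
}

/* Chunked emulation: per chunk, build the set of active lanes first,
   then apply the update only to those lanes. */
size_t s482_chunked(float *a, const float *b, const float *c, size_t len)
{
    for (size_t base = 0; base < len; base += VL) {
        size_t lanes = (len - base < VL) ? len - base : VL;
        size_t active = lanes;  /* lanes that will be updated */
        int hit = 0;            /* set when the exit condition fires */

        /* The exit test reads only b and c, so it can be evaluated for
           the whole chunk before any element of a is written. */
        for (size_t k = 0; k < lanes; k++) {
            if (c[base + k] > b[base + k]) {
                active = k + 1;  /* the breaking lane is still updated */
                hit = 1;
                break;
            }
        }
        for (size_t k = 0; k < active; k++)
            a[base + k] += b[base + k] * c[base + k];
        if (hit)
            return base + active;
    }
    return len;
}
```

Both functions leave a in the same state and return the same count. In SVE the active-lane set would be held in a predicate register rather than computed by a scalar loop.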

Discussion on the Perf. Balance for Applications
  Effectiveness for the meteorology application, IFS*, was evaluated
  Good performance balance w/ wider SIMD and memory bandwidth from K to FX100 realizes an IFS performance improvement
  Keeping the performance balance throughout the generations toward Post-K is expected to provide scalable speed-up

                   K computer    FX100
  Flops / CPU      128 Gflops    1 Tflops
  SIMD width       128 bit       256 bit
  Memory BW        64 GB/s       480 GB/s
  Byte per flop    0.4 ~ 0.5 (both machines)

  [Figure: Speedup of IFS CNT4 (TL159, 96 cores). Configurations: .5x B/F; 2x B/F w/ narrow SIMD; Balanced 2x B/F. Annotations: (1) BW limits performance; (2) Narrower SIMD limits performance, 1.5x by doubling SIMD; (3) Insufficient gain by 2x B/F.]

  *The Integrated Forecasting System (IFS) is developed by ECMWF
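The balance argument on this slide can be illustrated with a simple roofline-style bound, in which attainable performance is the smaller of the compute peak and memory bandwidth times arithmetic intensity. This is a standard textbook model, substituted here for illustration. It is not the method used for the IFS evaluation, and the function name and any intensity value passed to it are assumptions.

```c
/* Roofline-style upper bound on performance, in Gflops.
   peak_gflops:    compute peak of the CPU
   mem_bw_gbs:     memory bandwidth in GB/s
   flops_per_byte: arithmetic intensity of the kernel (assumed input) */
double attainable_gflops(double peak_gflops, double mem_bw_gbs,
                         double flops_per_byte)
{
    double bw_bound = mem_bw_gbs * flops_per_byte;
    return bw_bound < peak_gflops ? bw_bound : peak_gflops;
}
```

For a hypothetical kernel at 0.25 flop/byte, the bound is 16 Gflops with the K computer figures (128 Gflops, 64 GB/s) and 120 Gflops with the FX100 figures (1 Tflops, 480 GB/s). In this bandwidth-bound regime a higher compute peak alone does not raise the bound.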

Summary of Post-K Development
  Developing high performance, scalable, custom CPU cores
    SMaC architecture with an assistant core for scalable performance
    ARM instruction set architecture, SVE, as a standard architecture
    ARM standard frameworks, SBSA, etc., for compatibility among platforms
  Keeping performance balanced and advancing preceding machines
    Higher performance and higher data bandwidth
  Advanced system software and applications
    Co-design scheme with application developers
    FUJITSU optimizing compilers for Post-K
  Performance balance is a key for application speedup
  Post-K will meet requirements & be valuable for science and industries

Post-K: Succeeding the K computer Heritage
  Prestigious Benchmark Awards
    TOP500: 10.5 Pflops, 93% efficiency
    HPCG: 602 Tflops, 5.3% efficiency, No. 1
    Graph500: 38.6 TTEPS, No. 1
    HPC Challenge Class 1: No. 1 in all categories: (1) Global HPL, (2) Global RandomAccess, (3) EP STREAM, (4) Global FFT
  Gordon Bell Prize Awards
    First principles calculation of electronic states of a silicon nanowire with 100,000 atoms on the K computer (2011)
    4.45 Pflops Astrophysical N-Body Simulation on K Computer: The Gravitational Trillion-Body Problem (2012)
  Nominated as finalist at SC16, 6 years from the shipment
    Simulations of Below-Ground Dynamics of Fungi: 1.184 Pflops Attained by Automated Generation and Autotuning of Temporal Blocking Codes (2016 finalist)