Update of Post-K Development
Yutaka Ishikawa, RIKEN AICS
11:20AM-11:40AM, 2nd of November, 2017
FLAGSHIP2020 Project
Missions
- Building the Japanese national flagship supercomputer, post-K
- Developing a wide range of HPC applications that run on post-K, in order to solve social and scientific issues in Japan
Project organization
- Post-K computer development
  - RIKEN AICS is in charge of development; Fujitsu is the vendor partner
  - International collaborations: DOE, JLESC, CEA
- Applications
  - The government selected 9 social and scientific priority issues and their R&D organizations, plus 4 exploratory issues
[System diagram: I/O network, maintenance servers, portal servers, login servers, hierarchical storage system]
Status
- The basic design has been finalized; we are now in the design and implementation phase
- We have chosen ARMv8 with SVE as the ISA for the post-K manycore processor
- Detailed evaluation with simulators and compilers is under way
2017/11/02
CPU Architecture
- ARMv8-A + SVE (Scalable Vector Extension); FP64/FP32/FP16
  https://developer.arm.com/products/architecture/a-profile/docs
- Fujitsu's extensions
  - Inter-core barrier, reducing fork/join synchronization cost
  - Sector cache
  - Hardware prefetch assist
Ease of Use
- Ease of use is one of our KPIs (Key Performance Indicators)
Linux distributions
- Fujitsu: providing a wide range of applications, tools, libraries, and compilers
  - Languages: Fortran, C11/C++, OpenMP, Java, etc.
  - Math libraries
  - Communication and I/O: MPI (Open MPI), etc.
  - OS kernel
- RIKEN Linux for post-K
  - Languages and domain-specific languages: XMP, FDPS, Formura
  - Communication and I/O: MPICH, LLC, PiP, DTF, FTAR
  - OS kernel: IHK/McKernel
Linaro HPC SIG (https://www.linaro.org/sig/hpc/): drives the adoption of ARM in HPC through the creation of a data-center ecosystem. It is a collaborative project comprised of members and an advisory board; current members include ARM, HiSilicon, Qualcomm, Fujitsu, Cavium, Red Hat, and HPE, with CERN and RIKEN on the advisory board.
OpenHPC: a Linux Foundation Collaborative Project whose mission is to provide a reference collection of open-source HPC software components and best practices, lowering barriers to deployment, advancement, and use of modern HPC methods and tools.
McKernel, developed at RIKEN
- Partitions resources (CPU cores, memory)
  - Full Linux kernel on some cores: system daemons, in-situ non-HPC applications, device drivers
  - Lightweight kernel (LWK), McKernel, on the other cores: HPC applications
- Daemons and kernel activities are the source of OS noise
- McKernel is a loadable module of Linux
  - No changes to Linux
  - New features are easily implemented; a customized OS is dynamically launched for a user
- McKernel supports the Linux API (glibc, /sys/, /proc/)
  - The same binary that runs on Linux runs on McKernel
- Thin LWK: very simple memory management, simple process/thread management, a no-jitter environment
- McKernel currently runs on Intel Xeon and Xeon Phi, and on Fujitsu FX10 and FX100 (experiments)
- McKernel is being ported to ARM by Fujitsu; enhancement for ARM will start in FY2018
[Diagram: Linux partition (system daemons, TCP stack, device drivers, VFS, file systems, complex memory management, general scheduler) alongside a McKernel partition (HPC applications, thin LWK) sharing memory across cores]
How to deploy McKernel
- Linux kernel + loadable LWK (McKernel)
- The Linux kernel is resident; daemons for the job scheduler etc. run on Linux
- McKernel is dynamically reloaded (rebooted) for each application, with no hardware reboot
- Example sequence: App A, requiring an LWK without a scheduler, is invoked and finishes; App B, requiring an LWK with a scheduler, is invoked and finishes; App C, using full Linux capability, is invoked and finishes
OS Noise is harmful in large-scale HPC
- FWQ (Fixed Work Quanta) benchmark: measurement of jitter on a single node
[Plots: per-quantum timing jitter on McKernel vs. Linux]
GeoFEM (University of Tokyo)
- ICCG with additive Schwarz domain decomposition; weak scaling
- Up to 18% improvement over Linux
- Results obtained with the same binary
[Chart: figure of merit (solved problem size normalized to execution time), higher is better, Linux vs. IHK/McKernel, 1024 to 128k physical cores]
- Measured on the Oakforest-PACS supercomputer (25 PF peak) at JCAHPC, organized by the University of Tsukuba and the University of Tokyo
- Acknowledgement: Kengo Nakajima, University of Tokyo, for providing GeoFEM
CCS-QCD (University of Tsukuba)
- Lattice quantum chromodynamics code; weak scaling
- Up to 38% improvement over Linux
- Results obtained with the same binary
[Chart: MFlop/sec/node, higher is better, Linux vs. IHK/McKernel, 1024 to 128k physical cores]
- Measured on the Oakforest-PACS supercomputer (25 PF peak) at JCAHPC, organized by the University of Tsukuba and the University of Tokyo
- Acknowledgement: Ken-ichi Ishikawa, Hiroshima University, for providing CCS-QCD
miniFE (CORAL benchmark suite)
- Conjugate gradient; strong scaling
- Up to 3.5x improvement (Linux falls over)
- Results obtained with the same binary
[Chart: total CG MFlops, higher is better, Linux vs. IHK/McKernel, 1024 to 64k physical cores]
- Measured on the Oakforest-PACS supercomputer (25 PF peak) at JCAHPC, organized by the University of Tsukuba and the University of Tokyo
Concluding Remarks
- The system software stack for post-K is being designed and implemented leveraging international collaborations: CEA, DOE labs, and JLESC (NCSA, ANL, UTK, JSC, BSC, INRIA, RIKEN)
  - DOE-MEXT: optimized memory management, efficient MPI for exascale, dynamic execution runtime, storage architectures, metadata and active storage, storage as a service, parallel I/O libraries, mini-apps for exascale co-design, performance models for proxy apps, OpenMP/XMP runtime, programming models for heterogeneity, LLVM for vectorization, power monitoring and control, power steering, resilience API, shared fault data, etc.
  - CEA: programming language, runtime environment, energy-aware batch job scheduler, large DFT calculations and QM/MM, application of high-performance computing to earthquake-related issues of nuclear power plant facilities, KPIs (Key Performance Indicators)
  - JLESC (Joint Laboratory on Extreme-Scale Computing): simplified Sustained System Performance benchmark; developer tools for porting and tuning parallel applications on extreme-scale parallel systems; HPC libraries for solving dense symmetric eigenvalue problems; DTF+SZ, using lossy data compression for direct data transfer in multicomponent applications; Process-in-Process, techniques for practical address-space sharing
- The software stack developed at RIKEN is open source