Site Update for Oakforest-PACS at JCAHPC

Site Update for Oakforest-PACS at JCAHPC
Taisuke Boku, Vice Director, JCAHPC
Center for Computational Sciences, University of Tsukuba

Towards Exascale Computing: tier-1 and tier-2 supercomputers form HPCI and move forward to exascale computing like two wheels. [Chart: peak performance (PF, log scale 1-1000) vs. year (2008-2020), showing T2K (U. of Tsukuba, U. of Tokyo, Kyoto U.), Tokyo Tech. TSUBAME2.0, OFP at JCAHPC (U. Tsukuba and U. Tokyo), and the future exascale Post-K Computer at RIKEN R-CCS (formerly AICS).]

Deployment plan of nine supercomputing centers (as of Feb. 2017). [Chart: roadmap for FY2014-2025 of the systems at Hokkaido, Tohoku, Tsukuba, Tokyo, Tokyo Tech., Nagoya, Kyoto, Osaka, and Kyushu, including Oakforest-PACS (25 PF, 4.2 MW), PACS-X (10 PF, 2 MW), TSUBAME 3.0 (12 PF, 2.0 MW), Reedbush (1.8-1.9 PF), the K Computer (10 PF, 13 MW), and the Post-K Computer. Power consumption indicates the maximum power supply, including the cooling facility.]

JCAHPC: Joint Center for Advanced High Performance Computing (http://jcahpc.jp). A very tight collaboration between the two universities for the post-T2K system: for the main supercomputer resources, a uniform specification for a single shared system. Each university is financially responsible for introducing the machine and for its operation -> unified procurement toward a single system of the largest scale in Japan. To manage everything smoothly, a joint organization was established -> JCAHPC.

Machine location: Kashiwa Campus of U. Tokyo. [Map: U. Tsukuba, Kashiwa Campus of U. Tokyo, Hongo Campus of U. Tokyo.]

Oakforest-PACS (OFP): the name combines the U. Tokyo convention (Oakforest) and the U. Tsukuba convention (PACS). Don't call it just Oakforest! OFP is much better. 25 PFLOPS peak, 8,208 KNL CPUs, full-bisection-bandwidth fat-tree by Omni-Path. HPL 13.55 PFLOPS: #1 in Japan, #6 then #7 worldwide; HPCG: #3 then #5; Green500: #6 then #21. Full operation started in Dec. 2016, and the official program started in April 2017.

Computation node and chassis. [Photos: computation node, chassis, water-cooling wheel and pipe.] The chassis holds 8 nodes in a 2U size. Each computation node (Fujitsu next-generation PRIMERGY) has a single-chip Intel Xeon Phi (Knights Landing, 3+ TFLOPS) and an Intel Omni-Path Architecture card (100 Gbps).

Water cooling pipes and rear-panel radiator: direct water cooling pipes for the CPU, rear-panel indirect water cooling for other components.

Specification of Oakforest-PACS
Total peak performance: 25 PFLOPS
Total number of compute nodes: 8,208
Compute node: Product: Fujitsu next-generation PRIMERGY server for HPC (under development); Processor: Intel Xeon Phi (Knights Landing), Xeon Phi 7250 (1.4 GHz TDP) with 68 cores; Memory: high BW 16 GB, > 400 GB/sec (MCDRAM, effective rate); low BW 96 GB, 115.2 GB/sec (DDR4-2400 x 6 ch, peak rate)
Interconnect: Product: Intel Omni-Path Architecture; Link speed: 100 Gbps; Topology: fat-tree with full-bisection bandwidth
Login node: # of servers: 20; Product: Fujitsu PRIMERGY RX2530 M2 server; Processor: Intel Xeon E5-2690v4 (2.6 GHz, 14 cores x 2 sockets); Memory: 256 GB, 153 GB/sec (DDR4-2400 x 4 ch x 2 sockets)
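As a quick cross-check of the headline number (an illustrative calculation, not taken from the slides), the 25 PFLOPS peak follows from the per-node KNL figures, assuming the usual 32 double-precision FLOP per cycle per core (two AVX-512 FMA units) at the nominal 1.4 GHz:

\[
68 \times 1.4\,\mathrm{GHz} \times 32\,\tfrac{\mathrm{FLOP}}{\mathrm{cycle}} \approx 3046.4\,\mathrm{GFLOPS\ per\ node},
\qquad
8208 \times 3046.4\,\mathrm{GFLOPS} \approx 25{,}004.9\,\mathrm{TFLOPS} \approx 25\,\mathrm{PFLOPS}.
\]

This matches the Rpeak reported for Oakforest-PACS in the TOP500 table later in this update.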

Specification of Oakforest-PACS (I/O)
Parallel file system: Type: Lustre file system; Total capacity: 26.2 PB; Metadata: Product: DataDirect Networks MDS server + SFA7700X, # of MDS: 4 servers x 3 sets, MDT: 7.7 TB (SAS SSD) x 3 sets; Object storage: Product: DataDirect Networks SFA14KE, # of OSS (nodes): 10 (20), Aggregate BW: ~500 GB/sec
Fast file cache system: Type: burst buffer, Infinite Memory Engine (by DDN); Total capacity: 940 TB (NVMe SSD, including parity data by erasure coding); Product: DataDirect Networks IME14K; # of servers (nodes): 25 (50); Aggregate BW: ~1,560 GB/sec

Full-bisection-bandwidth fat-tree by Intel Omni-Path Architecture: 12 x 768-port Director Switches (source: Intel) and 362 x 48-port Edge Switches; each edge switch has 24 uplinks and 24 downlinks. Connected endpoints: compute nodes 8,208; login nodes 20; parallel FS 64; IME 300; management, etc. 8; total 8,600. Initially, to reduce switches and cables, we considered connecting all nodes within subgroups by an FBB fat-tree and connecting the subgroups to each other with >20% of FBB. However, the hardware quantity is not so different from a globally FBB configuration, and global FBB is preferred for flexible job management.
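A quick port-count check, using only the figures on this slide, shows why this layout gives full bisection bandwidth:

\[
362 \times 24 = 8688 \ge 8600 \quad\text{(edge downlinks vs. endpoints)},
\qquad
362 \times 24 = 8688 \le 12 \times 768 = 9216 \quad\text{(edge uplinks vs. director ports)}.
\]

Since every edge switch has as many uplinks as downlinks, no level of the tree is oversubscribed.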

Facility of the Oakforest-PACS system
Power consumption: 4.2 MW (including cooling); actually around 3.0 MW
# of racks: 102
Cooling system: Compute node: Type: warm-water cooling, direct cooling (CPU) and rear-door cooling (except CPU); Facility: cooling tower & chiller. Others: Type: air cooling; Facility: PAC

Software of Oakforest-PACS
OS: compute node: CentOS 7, McKernel; login node: Red Hat Enterprise Linux 7
Compiler: gcc, Intel compiler (C, C++, Fortran)
MPI: Intel MPI, MVAPICH2
Library: Intel MKL; LAPACK, FFTW, SuperLU, PETSc, METIS, Scotch, ScaLAPACK, GNU Scientific Library, NetCDF, Parallel netCDF, Xabclib, ppOpen-HPC, ppOpen-AT, MassiveThreads
Application: mpijava, XcalableMP, OpenFOAM, ABINIT-MP, PHASE system, FrontFlow/blue, FrontISTR, REVOCAP, OpenMX, xTAPP, AkaiKKR, MODYLAS, ALPS, feram, GROMACS, BLAST, R packages, Bioconductor, BioPerl, BioRuby
Distributed FS: Globus Toolkit, Gfarm
Job scheduler: Fujitsu Technical Computing Suite
Debugger: Allinea DDT
Profiler: Intel VTune Amplifier, Intel Trace Analyzer & Collector
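As a small illustration of this environment, a minimal hybrid MPI + OpenMP check is sketched below; the build line in the comment is an assumption about typical Intel-toolchain usage on KNL, not something stated on the slide.

```c
/* Minimal hybrid MPI + OpenMP check for the OFP-style toolchain.
 * Hypothetical build line (assumption, not from the slides):
 *   mpiicc -qopenmp -xMIC-AVX512 hello.c -o hello
 */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    MPI_Get_processor_name(name, &len);

    #pragma omp parallel
    {
        #pragma omp single
        printf("rank %d of %d on %s: %d OpenMP threads\n",
               rank, nranks, name, omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```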

TOP500 list of Nov. 2017 (50th edition); columns: rank, machine, architecture, country, Rmax (TFLOPS), Rpeak (TFLOPS), MFLOPS/W
1. TaihuLight, NSCW: MPP (Sunway, SW26010), China, 93,014.6, 125,435.9, 6,051.3
2. Tianhe-2 (MilkyWay-2), NSCG: Cluster (NUDT, CPU + KNC), China, 33,862.7, 54,902.4, 1,901.5
3. Piz Daint, CSCS: MPP (Cray, XC50: CPU + GPU), Switzerland, 19,590.0, 25,326.3, 10,398.0
4. Gyoukou, JAMSTEC: MPP (ExaScaler, PEZY-SC2), Japan, 19,125.8, 28,192.0, 14,167.3
5. Titan, ORNL: MPP (Cray, XK7: CPU + GPU), United States, 17,590.0, 27,112.5, 2,142.8
6. Sequoia, LLNL: MPP (IBM, BlueGene/Q), United States, 17,173.2, 20,132.7, 2,176.6
7. Trinity, NNSA/LANL/SNL: MPP (Cray, XC40: MIC), United States, 14,137.3, 43,902.6, 3,667.8
8. Cori, NERSC-LBNL: MPP (Cray, XC40: KNL), United States, 14,014.7, 27,880.7, 3,556.7
9. Oakforest-PACS, JCAHPC: Cluster (Fujitsu, KNL), Japan, 13,554.6, 25,004.9, 4,985.1
10. K Computer, RIKEN AICS: MPP (Fujitsu), Japan, 10,510.0, 11,280.4, 830.2

Post-K Computer and OFP: OFP fills the gap between the K Computer and the Post-K Computer. The Post-K Computer is planned to be installed in the 2020-2021 time frame, and the K Computer will be shut down around 2018-2019(?). Two pieces of system software developed at RIKEN AICS for the Post-K Computer:
McKernel: an OS for the many-core era, for a large number of thin cores, without OS jitter and with core binding; it is the primary OS (based on Linux) on Post-K, and application development goes ahead on OFP.
XcalableMP (XMP) (in collaboration with U. Tsukuba): a parallel programming language for directive-based, easy coding on distributed-memory systems, unlike explicit message passing with MPI.
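To illustrate the directive-based style that the slide contrasts with explicit MPI, here is a minimal global-view sketch assuming standard XcalableMP syntax for C; the example itself is not from the slides, and an XMP compiler (e.g. the Omni XcalableMP compiler) is assumed.

```c
/* Minimal XcalableMP global-view sketch: a block-distributed array and a
 * parallel loop with a reduction, written with directives instead of
 * explicit MPI calls. Assumes an XMP compiler such as Omni XcalableMP. */
#include <stdio.h>
#define N 1024

#pragma xmp nodes p(4)
#pragma xmp template t(0:N-1)
#pragma xmp distribute t(block) onto p

double a[N];
#pragma xmp align a[i] with t(i)

int main(void)
{
    int i;
    double sum = 0.0;

    #pragma xmp loop on t(i)
    for (i = 0; i < N; i++)
        a[i] = 2.0 * i;              /* each node touches only its block */

    #pragma xmp loop on t(i) reduction(+:sum)
    for (i = 0; i < N; i++)
        sum += a[i];                 /* partial sums combined automatically */

    #pragma xmp task on p(1)
    printf("sum = %f\n", sum);
    return 0;
}
```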

OFP resource sharing program (nation-wide)
JCAHPC (20%): HPCI, the HPC Infrastructure program in Japan for sharing all major supercomputers (free!), including the "Big Challenge" special use (full system size): the opportunity for a single project to use the entire 8,208 CPUs for 24 hours at the end of every month.
U. Tsukuba (23.5%): Interdisciplinary Academic Program (free!); large-scale general use.
U. Tokyo (56.5%): general use, industrial trial use, educational use, Young & Female special use.
An ordinary job can use up to 2,048 nodes.

Research area breakdown based on CPU hours on Oakforest-PACS in FY2017 (tentative: 2017.4-2017.9). [Pie chart: major applications include Lattice QCD, NICAM-COCO, GHYDRA, Seism3D, and SALMON/ARTED; research areas include Engineering, Earth/Space, Material, Energy/Physics, Information Sci., Education, Industry, Bio, Social Sci. & Economics, and Data.]

Performance variation between nodes (lower is faster). [Bar chart: normalized elapsed time per iteration for the best and worst node, stacked into Hamiltonian, Current, Misc. computation, and Communication; segment values shown: 0.095, 0.109, 0.126, 0.127, 0.764, 0.925, normalized to the best case.] Most of the time is consumed by the Hamiltonian calculation, not including communication time; the domain size is equal for all nodes; this is the root cause of strong-scaling saturation, and the performance gap appears for any material. It is non-algorithmic load imbalance: dynamic clock adjustment (DVFS) under turbo boost is applied individually on each processor; it is observed even under identical node conditions; KNL is more sensitive than Xeon; this causes serious performance degradation on a synchronized large-scale system.
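The node-to-node variation described here can be exposed with a very small test: every rank runs an identical fixed workload, and the slowest and fastest ranks are compared. The sketch below is illustrative only; it is not the SALMON/ARTED measurement shown on the slide.

```c
/* Illustrative check for node-to-node performance variation: identical
 * per-rank workload (same "domain size" everywhere), then compare the
 * fastest and slowest ranks. Not the measurement used on the slide. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double x = 0.0, t0, t, tmin, tmax;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    t0 = MPI_Wtime();
    for (long i = 1; i <= 200000000L; i++)   /* fixed compute-bound loop */
        x += 1.0 / (double)i;
    t = MPI_Wtime() - t0;

    MPI_Reduce(&t, &tmin, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("best %.3f s, worst %.3f s, worst/best = %.3f (checksum %g)\n",
               tmin, tmax, tmax / tmin, x);

    MPI_Finalize();
    return 0;
}
```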

Operation summary
Memory model: basically a 50:50 split between cache and flat modes; we have started to watch the queue condition to gently change the ratio by ~15%, and plan to introduce dynamic on-demand switching on a job-by-job basis.
KNL CPU: almost good; the failure rate is well under Fujitsu's estimate, with enough stability to support jobs of up to 2,048 nodes.
OPA network: at first there was a problem at boot-up time, but it is now almost fixed (it was the main reason against dynamic memory-mode changes); hundreds of links have been replaced due to initial failures, but the network is now stable.
Special operation: every month, 24-hour operation for just one project occupying the entire system.
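For the flat-mode half of this split, application code has to place its bandwidth-critical data into MCDRAM explicitly (in cache mode no change is needed). A minimal sketch is shown below, assuming the memkind/hbwmalloc library is available on the compute nodes; that availability is an assumption, not something stated on the slide.

```c
/* Sketch of explicit MCDRAM allocation in flat mode via memkind/hbwmalloc
 * (assumption: memkind is installed; link with -lmemkind). In cache mode,
 * MCDRAM acts as a transparent cache and no code change is required. */
#include <hbwmalloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const size_t n = 1UL << 28;   /* 2 GiB of doubles, well under 16 GB MCDRAM */
    int have_hbw = (hbw_check_available() == 0);
    double *a = have_hbw ? hbw_malloc(n * sizeof(double))   /* MCDRAM */
                         : malloc(n * sizeof(double));      /* DDR4 fallback */
    if (!a) return 1;

    for (size_t i = 0; i < n; i++)
        a[i] = (double)i;          /* bandwidth-bound access */

    printf("%s allocation of %zu doubles done\n",
           have_hbw ? "MCDRAM" : "DDR4", n);

    if (have_hbw) hbw_free(a); else free(a);
    return 0;
}
```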

New machine planned at CCS, U. Tsukuba: PACS-X with GPU+FPGA.

CCS at the University of Tsukuba: Center for Computational Sciences. Established in 1992; it spent 12 years as the Center for Computational Physics and was reorganized as the Center for Computational Sciences in 2004. Daily collaborative research is carried out between two kinds of faculty researchers (about 35 in total): computational scientists who have NEEDS (applications) and computer scientists who have SEEDS (systems & solutions).

PAX (PACS) series history at U. Tsukuba: started in 1977 (by Hoshino and Kawai); 1st generation PACS in 1978 with 9 CPUs; the 6th generation CP-PACS was awarded #1 in the TOP500. [Timeline photos: 1st PACS-9 (1978), 2nd PAX-32 (1980), 5th QCDPAX (1989), 6th CP-PACS (1996, #1 in the world), 7th PACS-CS (2006, first PC cluster solution), 8th HA-PACS (2012-2013, introducing GPU/FPGA).]
Year, name, performance:
1978: PACS-9, 7 KFLOPS
1980: PACS-32, 500 KFLOPS
1983: PAX-128, 4 MFLOPS
1984: PAX-32J, 3 MFLOPS
1989: QCDPAX, 14 GFLOPS
1996: CP-PACS, 614 GFLOPS
2006: PACS-CS, 14.3 TFLOPS
2012-13: HA-PACS (PACS-VIII), 1.166 PFLOPS
2014: COMA (PACS-IX), 1.001 PFLOPS

AiS: Accelerator in Switch. The FPGA is used not only for computation offloading but also for communication, combining computation offloading and communication among FPGAs for ultra-low-latency FPGA computing. It is especially effective for communication-related small/medium computation (such as collective communication), and covers computation unsuited to the GPU. OpenCL-enabled programming is provided for application users. [Diagram: CPU connected via PCIe to a GPU and an FPGA (computation + communication); the FPGA attaches directly to the high-speed interconnect.]

AiS computation model. [Diagram: the CPU invokes GPU/FPGA kernels and transfers data via PCIe (the GPU can also be invoked from the FPGA); each FPGA (computation + communication) connects through QSFP28 links and an Ethernet switch, performing collective or specialized communication combined with computation.]
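As a conceptual sketch of this model in OpenCL C (assuming the Intel FPGA SDK for OpenCL channels extension; the board-specific I/O channel to the QSFP28 link is omitted, and none of this code comes from the slides): a compute kernel streams its results to a communication kernel through an on-chip channel, so data can move toward the interconnect without a round trip through the host.

```c
// Conceptual AiS-style OpenCL C sketch: computation and communication
// kernels coupled by an on-chip channel. Assumes the Intel FPGA SDK for
// OpenCL channels extension; the network-facing I/O channel is board-
// specific and omitted here. Illustrative only.
#pragma OPENCL EXTENSION cl_intel_channels : enable

channel float4 comp_to_comm;

__kernel void compute(__global const float4 *restrict in, int n)
{
    for (int i = 0; i < n; i++) {
        float4 v = in[i] * 2.0f;               // placeholder computation
        write_channel_intel(comp_to_comm, v);  // hand off on-chip
    }
}

__kernel void communicate(__global float4 *restrict out, int n)
{
    for (int i = 0; i < n; i++) {
        float4 v = read_channel_intel(comp_to_comm);
        out[i] = v;   // a real AiS design would feed the QSFP28 link here
    }
}
```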

Evaluation test-bed: Pre-PACS-X (PPX) at CCS, U. Tsukuba, the PACS-X prototype. [Diagram: compute node with CPU: Xeon E5-2660 v4 x2, GPU: NVIDIA P100 x2, FPGA: Bittware A10PL4, HCA: Mellanox IB/EDR (100 Gbps); FPGA QSFP+ links: 40 Gbps x2.]

Timeline
Feb. 2018: Request for Information.
Apr. 2018: Request for Comment (the following are just the requirements). Basic specification: AiS-based large cluster with up to 256 nodes; V100-class GPU x2; Stratix 10 or UltraScale-class FPGA x1 (on 25% of the total node count); OPA x2 or InfiniBand HDR-class interconnect.
Aug. 2018: Request for Proposal; bidding closed at the beginning of Sep. 2018.
Mar. 2019: Deployment.
Apr. 2019: Start of official operation.