Oakforest-PACS (OFP): Japan's Fastest Supercomputer
1 Oakforest-PACS (OFP): Japan's Fastest Supercomputer. Taisuke Boku, Deputy Director, Center for Computational Sciences, University of Tsukuba (with courtesy of JCAHPC members). Center for Computational Sciences, Univ. of Tsukuba
2 Supercomputer deployment plan in Japan / What is JCAHPC? / Supercomputer procurement in JCAHPC / Oakforest-PACS System / Records and evaluation so far / Summary
3 Towards Exascale Computing: Tier-1 and tier-2 supercomputers form HPCI and move forward to Exascale computing like two wheels. [Roadmap figure: T2K (U. of Tsukuba, U. of Tokyo, Kyoto U.) and TSUBAME2.0 (Tokyo Tech.) lead to OFP (JCAHPC: U. Tsukuba and U. Tokyo), then to the Post-K Computer (RIKEN AICS) and future Exascale systems.]
4 Deployment plan of 9 supercomputing centers (Feb. 2017): Hokkaido, Tohoku, Tsukuba, Tokyo, Tokyo Tech., Nagoya, Kyoto, Osaka, and Kyushu. [Timeline figure by fiscal year listing each center's current and planned systems, e.g. HITACHI SR16000/M1 (172 TF, 22 TB), SX-ACE (707 TF, 160 TB, 655 TB/s), HA-PACS (1166 TF), COMA (PACS-IX, 1001 TF), Reedbush (1 PFlops, 150 TiB, 408 TB/s, 0.7 MW), Fujitsu FX10, TSUBAME 2.5 (5.7 PF, 110+ TB, 1160 TB/s, 1.4 MW), Fujitsu FX100 (2.9 PF, 81 TiB), Cray XE6 + GB8K + XC30 (983 TF), Oakforest-PACS 25 PF (UCC + TPF, 4.2 MW), PACS-X 10 PF (TPF, 2 MW), TSUBAME 4.0 (100+ PF, >10 PB/s, ~2.0 MW), K Computer (10 PF, 13 MW), and the Post-K Computer. Power consumption indicates the maximum of the power supply (including the cooling facility).]
5 T2K Open Supercomputer Systems: same timing of procurement for next-generation supercomputers in three universities; academic leadership for computational science/engineering in research/education/grid use on the same platform. Open hardware architecture with commodity devices & technologies; open software stack with open-source middleware & tools; open to users' needs not only in the FP & HPC field but also the INT world. Kyoto Univ.: 416 nodes (61.2 TF) / 13 TB; Linpack: Rpeak = 61.2 TF (416 nodes), Rmax = 50.5 TF. Univ. Tokyo: 952 nodes (140.1 TF) / 31 TB; Linpack: Rpeak = 113.1 TF, Rmax = 83.0 TF. Univ. Tsukuba: 648 nodes (95.4 TF) / 20 TB; Linpack: Rpeak = 92.0 TF (625 nodes), Rmax = 76.5 TF.
6 From T2K to Post-T2K. Effect of the T2K Alliance: three supercomputers were introduced at the same time, sharing wide knowledge of system construction and commodity technology, followed by academic research collaboration among these players. After T2K, the three universities had different timings for new system procurement: Kyoto U. followed a four-year procurement period; U. Tsukuba pursued accelerated computing; U. Tokyo ran T2K plus the Fujitsu FX10 and other systems. For Post-T2K (with two T's), in 2013 U. Tsukuba and U. Tokyo collaborated again on new supercomputer procurement in a much tighter framework.
7 JCAHPC: Joint Center for Advanced High Performance Computing. Very tight collaboration for Post-T2K between the two universities. For the main supercomputer resources, a uniform specification for a single shared system. Each university is financially responsible for introducing the machine and its operation -> unified procurement toward a single system with the largest scale in Japan. To manage everything smoothly, a joint organization was established -> JCAHPC.
8 Procurement policies of JCAHPC. Based on the spirit of T2K, introducing open advanced technology: a massively parallel PC cluster; an advanced processor for HPC; easy-to-use and efficient interconnection; a large-scale shared file system flatly shared by all nodes. Joint procurement by two universities: the largest class of budget for a national universities' supercomputer in Japan; the largest system scale as a PC cluster in Japan. No accelerator, to support a wide variety of users and application fields -> not chasing absolute peak performance, and inheriting traditional application codes (basically). Goodness of a single system: scale-merit by merging budgets -> largest in Japan; ultra-large-scale single-job execution on special occasions such as the Gordon Bell Prize Challenge. -> Oakforest-PACS (OFP)
9 Oakforest-PACS (OFP): "Oakforest" from the U. Tokyo convention, "PACS" from the U. Tsukuba convention. Don't call it just "Oakforest"! "OFP" is much better. 25 PFLOPS peak; 8208 KNL CPUs; FBB fat-tree by Omni-Path. HPL: #1 in Japan, #6 in the world. HPCG: #3 in the world. Green500: #6 in the world. Full operation started in December; the official program started in April.
10 Computation node & chassis; water cooling wheel & pipe. Chassis with 8 nodes, 2U size. Computation node (Fujitsu next-generation PRIMERGY) with a single-chip Intel Xeon Phi (Knights Landing, 3+ TFLOPS) and an Intel Omni-Path Architecture card (100 Gbps).
11 Water cooling pipes and IME (burst buffer).
12 Specification of Oakforest-PACS. Total peak performance: 25 PFLOPS. Total number of compute nodes: 8,208.
Compute node — Product: Fujitsu next-generation PRIMERGY server for HPC (under development). Processor: Intel Xeon Phi (Knights Landing), Xeon Phi 7250 (68 cores, 1.4 GHz). Memory: high BW 16 GB, > 400 GB/sec (MCDRAM, effective rate); low BW 96 GB (DDR4 x 6ch, peak rate).
Interconnect — Product: Intel Omni-Path Architecture. Link speed: 100 Gbps. Topology: fat-tree with full-bisection bandwidth.
Login node — Product: Fujitsu PRIMERGY RX2530 M2 server; # of servers: 20. Processor: Intel Xeon E5-2690v4 (2.6 GHz, 14 cores x 2 sockets). Memory: 256 GB, 153 GB/sec (DDR4 x 4ch x 2 sockets).
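The 25 PFLOPS peak on this slide can be cross-checked from the per-node figures. A minimal sketch: the 68-core / 1.4 GHz numbers come from the slide, while the 32 DP flops per cycle per core (two AVX-512 FMA units on Knights Landing) is an assumption added here, not stated on the slide.

```python
# Cross-check the system peak from the per-node KNL figures.
cores = 68                 # from the slide
ghz = 1.4                  # from the slide
flops_per_cycle = 32       # assumption: 2 x AVX-512 FMA, double precision

node_gflops = cores * ghz * flops_per_cycle     # ~3046.4 GFLOPS per node
system_pflops = 8208 * node_gflops / 1e6        # 8,208 nodes -> ~25.0 PFLOPS
print(round(node_gflops, 1), round(system_pflops, 1))
```

The product lands almost exactly on the advertised 25 PFLOPS, which is consistent with the total being derived from the node peak times the node count.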
13 Specification of Oakforest-PACS (I/O).
Parallel File System — Type: Lustre File System; total capacity: 26.2 PB. Metadata: DataDirect Networks MDS server + SFA7700X; # of MDS: 4 servers x 3 sets; MDT: 7.7 TB (SAS SSD) x 3 sets. Object storage: DataDirect Networks SFA14KE; # of OSS (nodes): 10 (20); aggregate BW: ~500 GB/sec.
Fast File Cache System — Type: Burst Buffer, Infinite Memory Engine (by DDN); total capacity: 940 TB (NVMe SSD, including parity data by erasure coding). Product: DataDirect Networks IME14K; # of servers (nodes): 25 (50); aggregate BW: ~1,560 GB/sec.
14 Full-bisection-bandwidth fat-tree by Intel Omni-Path Architecture: 12 Director Switches of 768 ports each (source: Intel) and 362 Edge Switches of 48 ports each; per edge switch, uplink: 24, downlink: 24. Firstly, to reduce switches & cables, we considered: all the nodes in subgroups connected with an FBB fat-tree, and subgroups connected with each other with >20% of FBB. But the HW quantity is not so different from a globally FBB network, and globally FBB is preferred for flexible job management. Endpoints: compute nodes 8208, login nodes 20, parallel FS 64, IME 300, mgmt, etc. 8.
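The port arithmetic behind this two-level fat-tree can be sanity-checked. A minimal sketch using the switch and endpoint counts from the slide; the half-down/half-up split of each 48-port edge switch is the standard full-bisection arrangement and is assumed here, not spelled out on the slide.

```python
# Sanity-check of the two-level full-bisection fat-tree port counts.
EDGE_PORTS = 48
DOWN = UP = EDGE_PORTS // 2        # full bisection: half down, half up (assumed)

edge_switches = 362
director_switches = 12
director_ports = 768

# Endpoint counts from the slide.
endpoints = {"compute": 8208, "login": 20, "parallel_fs": 64,
             "ime": 300, "mgmt_etc": 8}

total_endpoints = sum(endpoints.values())               # 8600
edge_downlinks = edge_switches * DOWN                   # ports facing nodes
edge_uplinks = edge_switches * UP                       # links toward directors
director_capacity = director_switches * director_ports  # director ports available

assert total_endpoints <= edge_downlinks
assert edge_uplinks <= director_capacity
print(total_endpoints, edge_downlinks, director_capacity)
```

With these numbers the 362 edge switches offer 8,688 downlinks for 8,600 endpoints, and the 12 directors have enough ports for all 8,688 uplinks, consistent with a globally full-bisection design.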
15 Facility of the Oakforest-PACS system. Power consumption: 4.2 MW (including cooling). # of racks: 102. Cooling system — compute nodes: warm-water cooling, direct cooling (CPU) and rear-door cooling (except CPU), with cooling tower & chiller facility; others: air cooling, PAC facility.
16 Software of Oakforest-PACS.
OS: CentOS 7 + McKernel (compute node); Red Hat Enterprise Linux 7 (login node).
Compiler: gcc, Intel compiler (C, C++, Fortran).
MPI: Intel MPI, MVAPICH2.
Library: Intel MKL; LAPACK, FFTW, SuperLU, PETSc, METIS, Scotch, ScaLAPACK, GNU Scientific Library, NetCDF, Parallel netCDF, Xabclib, ppOpen-HPC, ppOpen-AT, MassiveThreads.
Application: mpijava, XcalableMP, OpenFOAM, ABINIT-MP, PHASE system, FrontFlow/blue, FrontISTR, REVOCAP, OpenMX, xTAPP, AkaiKKR, MODYLAS, ALPS, feram, GROMACS, BLAST, R packages, Bioconductor, BioPerl, BioRuby.
Distributed FS: Globus Toolkit, Gfarm.
Job scheduler: Fujitsu Technical Computing Suite.
Debugger: Allinea DDT.
Profiler: Intel VTune Amplifier, Trace Analyzer & Collector.
17 TOP500 list in Nov. 2016:
1. TaihuLight, NSCW — MPP (Sunway, SW26010), China, Rmax 93,014.6 TFLOPS
2. Tianhe-2 (MilkyWay-2), NSCG — Cluster (NUDT, CPU + KNC), China, Rmax 33,862.7 TFLOPS
3. Titan, ORNL — MPP (Cray, XK7: CPU + GPU), United States, Rmax 17,590.0 TFLOPS
4. Sequoia, LLNL — MPP (IBM, BlueGene/Q), United States, Rmax 17,173.2 TFLOPS
5. Cori, NERSC-LBNL — MPP (Cray, XC40: KNL), United States, Rmax 14,014.7 TFLOPS
6. Oakforest-PACS, JCAHPC — Cluster (Fujitsu, KNL), Japan, Rmax 13,554.6 TFLOPS
7. K Computer, RIKEN AICS — MPP (Fujitsu), Japan, Rmax 10,510.0 TFLOPS
8. Piz Daint, CSCS — MPP (Cray, XC50: CPU + GPU), Switzerland, Rmax 9,779.0 TFLOPS
9. Mira, ANL — MPP (IBM, BlueGene/Q), United States, Rmax 8,586.6 TFLOPS
10. Trinity, NNSA/LABNL/SNL — MPP (Cray, XC40: MIC), United States, Rmax 8,100.9 TFLOPS
18 Green500 list in Nov. 2016 (Green500 rank / HPL rank where recoverable / machine — architecture, country):
1 (HPL #28) DGX SaturnV — GPU cluster (NVIDIA DGX-1), USA
2 (HPL #8) Piz Daint, CSCS — MPP (Cray, XC50: CPU + GPU), Switzerland
3 Shoubu — PEZY ZettaScaler-1, Japan
4 TaihuLight — MPP (Sunway SW26010), China
5 QPACE3 — Cluster (Fujitsu, KNL), Germany
6 Oakforest-PACS, JCAHPC — Cluster (Fujitsu, KNL), Japan
7 Theta — MPP (Cray XC40, KNL), USA
8 XStream — MPP (Cray CS-Storm, GPU), USA
9 Camphor 2 — MPP (Cray XC40, KNL), Japan
10 SciPhi XVI — Cluster (KNL), USA
19 HPCG list in Nov. 2016. [Results figure: Oakforest-PACS is #3 in the world.]
20 McKernel support. McKernel (a special lightweight kernel for many-core architecture) was developed at U. Tokyo and now at AICS, RIKEN (led by Y. Ishikawa). The KNL-ready version is almost complete. It can be loaded as a kernel module into Linux. The batch scheduler is notified to use McKernel by the user's script and then applies it; the McKernel module is detached after job execution.
21 XcalableMP (XMP) support. XcalableMP: a massively parallel description language based on the PGAS model and user directives, originally developed at U. Tsukuba and now at AICS, RIKEN (led by M. Sato). The KNL-ready version is under evaluation and tuning. It will be open for users as a (relatively) easy way to write large-scale parallelization as well as performance tuning.
22 Memory model (currently planned). Our challenge: semi-dynamic switching between CACHE and FLAT modes. Initially, the nodes in the system are configured with a certain mixture ratio (half and half) of Cache and Flat modes. The batch scheduler is notified of the required memory configuration from the user's script and tries to find appropriate nodes without reconfiguration. If there are not enough such nodes, some of them are rebooted with the other memory configuration. Rebooting is done by warm reboot, in groups of ~100 nodes. A size limitation (max. # of nodes) may be applied. NUMA model: currently quadrant mode only; (perhaps) we will not dynamically change it.
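The node-selection policy above can be sketched in a few lines. This is an illustrative toy model, not JCAHPC's actual scheduler: prefer nodes already in the requested MCDRAM mode, and warm-reboot others in groups of ~100 only when too few match; the function name and pool representation are invented for the example.

```python
# Toy sketch of semi-dynamic Cache/Flat node selection (hypothetical, not
# the real batch scheduler). pool maps node_id -> 'cache' or 'flat'.
def select_nodes(pool, requested_mode, needed, reboot_group=100):
    """Return (chosen_nodes, rebooted_nodes) for a job request."""
    matching = [n for n, m in pool.items() if m == requested_mode]
    if len(matching) >= needed:
        return matching[:needed], []          # no reconfiguration needed
    # Not enough matching nodes: warm-reboot others into the requested
    # mode, rounded up to whole groups of ~reboot_group as on the slide.
    others = [n for n, m in pool.items() if m != requested_mode]
    shortfall = needed - len(matching)
    groups = -(-shortfall // reboot_group)    # ceiling division
    to_reboot = others[: groups * reboot_group]
    for n in to_reboot:
        pool[n] = requested_mode              # model the mode change
    return matching + to_reboot[:shortfall], to_reboot

# Half-and-half initial mixture, as on the slide (400 nodes here for brevity).
pool = {f"n{i:04d}": ("cache" if i % 2 == 0 else "flat") for i in range(400)}
chosen, rebooted = select_nodes(pool, "flat", 250)
print(len(chosen), len(rebooted))   # 250 nodes chosen, 100 warm-rebooted
```

Note how a shortfall of 50 nodes still triggers a reboot of a full group of 100, mirroring the slide's coarse-grained warm-reboot granularity.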
23 System operation outline. Regular operation: both universities share the CPU time based on the budget ratio; the system hardware is not split, but the CPU time is split for flexible operation (except several specially dedicated partitions); there is a single system entry for the HPCI program, and each university's own programs run under CPU-time sharing. Special operation (limited period): massively large-scale operation -> effectively using the largest-class resource in Japan for special occasions (e.g. the Gordon Bell Challenge). Power-saving operation: a power-capping feature for energy saving; the scheduling feature reacts to power-saving requirements (e.g. summer time).
24 OFP resource sharing programs (nation-wide). JCAHPC (20%): HPCI — the HPC Infrastructure program in Japan to share all supercomputers (free!); big-challenge special use (full system size). U. Tsukuba (23.5%): Interdisciplinary Academic Program (free!); large-scale general use. U. Tokyo (56.5%): general use; industrial trial use; educational use; young & female special use.
25 Machine location: Kashiwa Campus of U. Tokyo. [Map: U. Tsukuba, Kashiwa Campus of U. Tokyo, Hongo Campus of U. Tokyo.]
26 Xeon Phi tuning on ARTED (with Y. Hirokawa, under collaboration with Prof. K. Yabana, CCS). ARTED: Ab-initio Real-Time Electron Dynamics simulator, a multi-scale simulator based on RTRSDFT (Real-Time Real-Space Density Functional Theory), developed in CCS, U. Tsukuba, to be used for electron dynamics simulation. RSDFT: basic state of electrons (no movement of electrons); RTRSDFT: electron state under an external force. In RTRSDFT, RSDFT is used for the ground state. RSDFT: large-scale simulation with many atoms (e.g. on the K Computer); RTRSDFT: calculates a number of unit cells with 10-100 atoms. [Figure: macroscopic grids (vacuum, solids, electric field along x) and microscopic grids (atoms).]
27 Computation domain and amount. Parameters for wave function expression: k-points (NK), band number (NB), 3-D lattice points (NL); variables are double-precision complex in a matrix of (NK, NB, NL); for stencil computation, a calculation of size NL is performed NK x NB times. Parameters used in this research (two models): SiO2: (4^3, 48, 36,000 = (20, 36, 50)) -> not large enough; Si: (24^3, 32, 4096 = (16, 16, 16)) -> larger parallelism on threads. NK is parallelized by MPI, then NK x NB is parallelized in OpenMP; the domain of each process is (NK/NP, NB, NL) (NP = number of processes). The space domain is not decomposed, to minimize MPI communication.
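The decomposition arithmetic above can be made concrete. A minimal sketch of the per-process domain (NK/NP, NB, NL) and the total stencil work NK x NB, using the two model sizes from the slide; the function itself is only an illustration of the counting, not ARTED code.

```python
# Decomposition and work counts for a (NK, NB, NL) wavefunction array:
# NK split over MPI ranks, NK x NB loop parallelized by OpenMP per rank,
# space domain (NL) never decomposed.
def decompose(NK, NB, NL, NP):
    assert NK % NP == 0, "sketch assumes NK divides evenly over processes"
    local = (NK // NP, NB, NL)   # per-process domain (NK/NP, NB, NL)
    stencil_calls = NK * NB      # size-NL stencil performed NK*NB times
    return local, stencil_calls

si = (24**3, 32, 16 * 16 * 16)    # Si model from the slide
sio2 = (4**3, 48, 20 * 36 * 50)   # SiO2 model from the slide

print(decompose(*si, NP=16))      # 16 MPI ranks, illustrative
print(decompose(*sio2, NP=4))     # NK=64 limits how far MPI can scale
```

The Si model exposes 24^3 x 32 = 442,368 independent stencil instances, while SiO2 has only 64 k-points to split over MPI, which is exactly why the slide calls it "not large enough".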
28 Stencil computation (3D). [Bar charts for the Si and SiO2 cases comparing performance in GFLOPS of the original code, compiler vectorization, and explicit vectorization (with and without software pipelining) on KNC x2 and KNL.] KNL is 3x faster than KNC (at maximum).
29 KNL vs GPU, for the Si and SiO2 cases. [Tables of ARTED kernel GFLOPS and % of peak on Xeon E5-2670v2 x2 (IVB), Xeon Phi 7110P x2 (KNC), OFP Xeon Phi 7250/7210 (KNL), Tesla K40 x2 (Kepler), and Tesla P100 (Pascal).] Peak performance (DP) and actual B/F: Xeon Phi 7110P (KNC): 1074 GFLOPS, B/F 0.16; Xeon Phi 7250 (KNL): 2998 GFLOPS, B/F 0.15; Tesla K40 (Kepler): 1430 GFLOPS, B/F 0.13; Tesla P100 (Pascal): 5300 GFLOPS, B/F 0.10. GPU (Pascal) performance is by courtesy of NVIDIA.
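The "actual B/F" column is simply measured memory bandwidth divided by peak DP FLOPS. A tiny sketch of that ratio: the 2998 GFLOPS figure is from the table above, but the ~450 GB/s effective MCDRAM bandwidth is an assumed value for demonstration (the bandwidth figures did not survive transcription).

```python
# Bytes-per-flop: how many bytes of memory traffic are available per
# double-precision flop at peak. Low B/F means the chip is compute-rich
# relative to its memory system.
def bytes_per_flop(bandwidth_gb_s, gflops):
    return bandwidth_gb_s / gflops

# Assumed ~450 GB/s effective MCDRAM bandwidth vs the listed 2998 GFLOPS
# DP peak of the Xeon Phi 7250:
print(round(bytes_per_flop(450.0, 2998.0), 2))
```

With these inputs the ratio comes out at the 0.15 B/F listed for KNL, showing how the table's last column is derived from the other two.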
30 Summary. JCAHPC is a joint resource center for advanced HPC by U. Tokyo and U. Tsukuba, the first case in Japan. Oakforest-PACS (OFP), with 25 PFLOPS peak, is ranked #1 in Japan and #6 in the world, with Intel Xeon Phi (KNL) and OPA. Under JCAHPC, both universities run nation-wide resource-sharing programs including HPCI. JCAHPC is not just an organization to manage the resource but also a basic community for advanced HPC research. OFP is used not only for HPCI and other resource-sharing programs but also as a testbed for the McKernel and XcalableMP system software to support Post-K development.
More informationFujitsu s Technologies Leading to Practical Petascale Computing: K computer, PRIMEHPC FX10 and the Future
Fujitsu s Technologies Leading to Practical Petascale Computing: K computer, PRIMEHPC FX10 and the Future November 16 th, 2011 Motoi Okuda Technical Computing Solution Unit Fujitsu Limited Agenda Achievements
More informationNVIDIA Update and Directions on GPU Acceleration for Earth System Models
NVIDIA Update and Directions on GPU Acceleration for Earth System Models Stan Posey, HPC Program Manager, ESM and CFD, NVIDIA, Santa Clara, CA, USA Carl Ponder, PhD, Applications Software Engineer, NVIDIA,
More informationLoad balanced Parallel Prime Number Generator with Sieve of Eratosthenes on Cluster Computers *
Load balaced Parallel Prime umber Geerator with Sieve of Eratosthees o luster omputers * Soowook Hwag*, Kyusik hug**, ad Dogseug Kim* *Departmet of Electrical Egieerig Korea Uiversity Seoul, -, Rep. of
More informationINTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER. Adrian
INTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER Adrian Jackson a.jackson@epcc.ed.ac.uk @adrianjhpc Processors The power used by a CPU core is proportional to Clock Frequency x Voltage 2 In the past,
More informationGPUMP: a Multiple-Precision Integer Library for GPUs
GPUMP: a Multiple-Precisio Iteger Library for GPUs Kaiyog Zhao ad Xiaowe Chu Departmet of Computer Sciece, Hog Kog Baptist Uiversity Hog Kog, P. R. Chia Email: {kyzhao, chxw}@comp.hkbu.edu.hk Abstract
More informationThe Architecture and the Application Performance of the Earth Simulator
The Architecture and the Application Performance of the Earth Simulator Ken ichi Itakura (JAMSTEC) http://www.jamstec.go.jp 15 Dec., 2011 ICTS-TIFR Discussion Meeting-2011 1 Location of Earth Simulator
More informationSupercomputers in Nagoya University
Supercomputers in Nagoya University KATSUYA ISHII The Information Technology Center (ITC) of Nagoya University was originally established as the Computing Center of Nagoya University, which all researchers
More informationMapping MPI+X Applications to Multi-GPU Architectures
Mapping MPI+X Applications to Multi-GPU Architectures A Performance-Portable Approach Edgar A. León Computer Scientist San Jose, CA March 28, 2018 GPU Technology Conference This work was performed under
More informationOverview of Tianhe-2
Overview of Tianhe-2 (MilkyWay-2) Supercomputer Yutong Lu School of Computer Science, National University of Defense Technology; State Key Laboratory of High Performance Computing, China ytlu@nudt.edu.cn
More informationRECENT TRENDS IN GPU ARCHITECTURES. Perspectives of GPU computing in Science, 26 th Sept 2016
RECENT TRENDS IN GPU ARCHITECTURES Perspectives of GPU computing in Science, 26 th Sept 2016 NVIDIA THE AI COMPUTING COMPANY GPU Computing Computer Graphics Artificial Intelligence 2 NVIDIA POWERS WORLD
More informationINTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER. Adrian
INTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER Adrian Jackson adrianj@epcc.ed.ac.uk @adrianjhpc Processors The power used by a CPU core is proportional to Clock Frequency x Voltage 2 In the past, computers
More informationTruVu 360 User Community. SpectroCare. Enterprise Fluid Intelligence for Predictive Maintenance. TruVu 360 Product Information
TruVu 360 User Commuity Cotiuous educatio is importat for a successful o-site lubricat program. With ever growig articles, videos, ad structured learig modules, TruVu 360 user commuity is a digital commuity
More information19. prosince 2018 CIIRC Praha. Milan Král, IBM Radek Špimr
19. prosince 2018 CIIRC Praha Milan Král, IBM Radek Špimr CORAL CORAL 2 CORAL Installation at ORNL CORAL Installation at LLNL Order of Magnitude Leap in Computational Power Real, Accelerated Science ACME
More informationCS2410 Computer Architecture. Flynn s Taxonomy
CS2410 Computer Architecture Dept. of Computer Sciece Uiversity of Pittsburgh http://www.cs.pitt.edu/~melhem/courses/2410p/idex.html 1 Fly s Taxoomy SISD Sigle istructio stream Sigle data stream (SIMD)
More informationKengo Nakajima Information Technology Center, The University of Tokyo. SC15, November 16-20, 2015 Austin, Texas, USA
ppopen-hpc Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications on Post-Peta Scale Supercomputers with Automatic Tuning (AT) Kengo Nakajima Information Technology
More informationCommunication-Computation Overlapping with Dynamic Loop Scheduling for Preconditioned Parallel Iterative Solvers on Multicore/Manycore Clusters
Communication-Computation Overlapping with Dynamic Loop Scheduling for Preconditioned Parallel Iterative Solvers on Multicore/Manycore Clusters Kengo Nakajima, Toshihiro Hanawa Information Technology Center,
More informationIHK/McKernel: A Lightweight Multi-kernel Operating System for Extreme-Scale Supercomputing
: A Lightweight Multi-kernel Operating System for Extreme-Scale Supercomputing Balazs Gerofi Exascale System Software Team, RIKEN Center for Computational Science 218/Nov/15 SC 18 Intel Extreme Computing
More informationHPC Saudi Jeffrey A. Nichols Associate Laboratory Director Computing and Computational Sciences. Presented to: March 14, 2017
Creating an Exascale Ecosystem for Science Presented to: HPC Saudi 2017 Jeffrey A. Nichols Associate Laboratory Director Computing and Computational Sciences March 14, 2017 ORNL is managed by UT-Battelle
More informationComputer Architecture
Computer Architecture Overview Prof. Tie-Fu Che Dept. of Computer Sciece Natioal Chug Cheg Uiv Sprig 2002 Overview- Computer Architecture Course Focus Uderstadig the desig techiques, machie structures,
More informationEE University of Minnesota. Midterm Exam #1. Prof. Matthew O'Keefe TA: Eric Seppanen. Department of Electrical and Computer Engineering
EE 4363 1 Uiversity of Miesota Midterm Exam #1 Prof. Matthew O'Keefe TA: Eric Seppae Departmet of Electrical ad Computer Egieerig Uiversity of Miesota Twi Cities Campus EE 4363 Itroductio to Microprocessors
More informationCori (2016) and Beyond Ensuring NERSC Users Stay Productive
Cori (2016) and Beyond Ensuring NERSC Users Stay Productive Nicholas J. Wright! Advanced Technologies Group Lead! Heterogeneous Mul-- Core 4 Workshop 17 September 2014-1 - NERSC Systems Today Edison: 2.39PF,
More informationIntel Xeon Phi архитектура, модели программирования, оптимизация.
Нижний Новгород, 2016 Intel Xeon Phi архитектура, модели программирования, оптимизация. Дмитрий Прохоров, Intel Agenda What and Why Intel Xeon Phi Top 500 insights, roadmap, architecture How Programming
More informationResources Current and Future Systems. Timothy H. Kaiser, Ph.D.
Resources Current and Future Systems Timothy H. Kaiser, Ph.D. tkaiser@mines.edu 1 Most likely talk to be out of date History of Top 500 Issues with building bigger machines Current and near future academic
More informationΤεχνολογία Λογισμικού
ΕΘΝΙΚΟ ΜΕΤΣΟΒΙΟ ΠΟΛΥΤΕΧΝΕΙΟ Σχολή Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών Τεχνολογία Λογισμικού, 7ο/9ο εξάμηνο 2018-2019 Τεχνολογία Λογισμικού Ν.Παπασπύρου, Αν.Καθ. ΣΗΜΜΥ, ickie@softlab.tua,gr
More informationPerformance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA
Performance Evaluation of a Vector Supercomputer SX-Aurora TSUBASA Kazuhiko Komatsu, S. Momose, Y. Isobe, O. Watanabe, A. Musa, M. Yokokawa, T. Aoyama, M. Sato, H. Kobayashi Tohoku University 14 November,
More informationPiz Daint: Application driven co-design of a supercomputer based on Cray s adaptive system design
Piz Daint: Application driven co-design of a supercomputer based on Cray s adaptive system design Sadaf Alam & Thomas Schulthess CSCS & ETHzürich CUG 2014 * Timelines & releases are not precise Top 500
More informationPost-K: Building the Arm HPC Ecosystem
Post-K: Building the Arm HPC Ecosystem Toshiyuki Shimizu FUJITSU LIMITED Nov. 14th, 2017 Exhibitor Forum, SC17, Nov. 14, 2017 0 Post-K: Building up Arm HPC Ecosystem Fujitsu s approach for HPC Approach
More informationReliable Transmission. Spring 2018 CS 438 Staff - University of Illinois 1
Reliable Trasmissio Sprig 2018 CS 438 Staff - Uiversity of Illiois 1 Reliable Trasmissio Hello! My computer s ame is Alice. Alice Bob Hello! Alice. Sprig 2018 CS 438 Staff - Uiversity of Illiois 2 Reliable
More informationMorgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5
Morga Kaufma Publishers 26 February, 28 COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter 5 Set-Associative Cache Architecture Performace Summary Whe CPU performace icreases:
More informationOverview. CS 472 Concurrent & Parallel Programming University of Evansville
Overview CS 472 Concurrent & Parallel Programming University of Evansville Selection of slides from CIS 410/510 Introduction to Parallel Computing Department of Computer and Information Science, University
More informationTransforming Irregular Algorithms for Heterogeneous Computing - Case Studies in Bioinformatics
Trasformig Irregular lgorithms for Heterogeeous omputig - ase Studies i ioiformatics Jig Zhag dvisor: Dr. Wu Feg ollaborator: Hao Wag syergy.cs.vt.edu Irregular lgorithms haracterized by Operate o irregular
More informationTask scenarios Outline. Scenarios in Knowledge Extraction. Proposed Framework for Scenario to Design Diagram Transformation
6-0-0 Kowledge Trasformatio from Task Scearios to View-based Desig Diagrams Nima Dezhkam Kamra Sartipi {dezhka, sartipi}@mcmaster.ca Departmet of Computig ad Software McMaster Uiversity CANADA SEKE 08
More informationOn Nonblocking Folded-Clos Networks in Computer Communication Environments
O Noblockig Folded-Clos Networks i Computer Commuicatio Eviromets Xi Yua Departmet of Computer Sciece, Florida State Uiversity, Tallahassee, FL 3306 xyua@cs.fsu.edu Abstract Folded-Clos etworks, also referred
More informationMorgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5.
Morga Kaufma Publishers 26 February, 208 COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter 5 Virtual Memory Review: The Memory Hierarchy Take advatage of the priciple
More information