SSD Based First Layer File System for the Next Generation Supercomputer
Shinji Sumimoto, Ph.D.
Next Generation Technical Computing Unit, FUJITSU LIMITED
Sept. 24th, 2018
Outline of This Talk
- A64FX: High Performance Arm CPU
- SSD Based First Layer File System for the Next Generation Supercomputer
- Current Status of Lustre Based File System Development
A64FX: High Performance Arm CPU
From presentation slides of Hot Chips 30 and Cluster 2018.
Inheriting Fujitsu HPC CPU technologies with a commodity standard ISA.
A64FX Chip Overview
Architecture features:
- Armv8.2-A (AArch64 only), SVE 512-bit wide SIMD
- 48 computing cores + 4 assistant cores (all cores are identical)
- CMG specification: 13 cores, L2$ 8MiB, Mem 8GiB at 256GB/s
- On chip: Network on Chip connecting the CMGs, Tofu controller, PCIe controller; HBM2 32GiB in total
- Tofu interconnect (TofuD): 6D Mesh/Torus, 28Gbps x 2 lanes x 10 ports
- I/O: PCIe Gen3 16 lanes
- Process: 7nm FinFET, 8,786M transistors, 594 package signal pins
- Peak performance (efficiency): >2.7TFLOPS (>90% @DGEMM)
- Memory B/W: 1024GB/s (>80% @Stream Triad)

Comparison with the previous generation:
                      A64FX (Post-K)    SPARC64 XIfx (PRIMEHPC FX100)
  ISA (Base)          Armv8.2-A         SPARC-V9
  ISA (Extension)     SVE               HPC-ACE2
  Process Node        7nm               20nm
  Peak Performance    >2.7TFLOPS        1.1TFLOPS
  SIMD                512-bit           256-bit
  # of Cores          48+4              32+2
  Memory              HBM2              HMC
  Memory Peak B/W     1024GB/s          240GB/s x2 (in/out)
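As a quick sanity check on the figures above, the peak-performance number follows from the core and SIMD counts on this slide. The clock frequency is not stated here, so the last line only derives the rate implied by >2.7 TFLOPS; this is a back-of-the-envelope sketch, not an official specification.

```python
# Flops per cycle from the slide's figures:
# 48 computing cores x 2 FMA pipes x 8 FP64 lanes (512-bit SIMD) x 2 flops per FMA.
cores, fma_pipes, fp64_lanes, flops_per_fma = 48, 2, 8, 2
flops_per_cycle = cores * fma_pipes * fp64_lanes * flops_per_fma
print(flops_per_cycle)            # 1536 FP64 flops per cycle

# Clock rate implied by the quoted >2.7 TFLOPS peak (derived, not quoted).
implied_ghz = 2.7e12 / flops_per_cycle / 1e9
print(round(implied_ghz, 2))      # ~1.76 GHz or higher
```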
A64FX Memory System
Extremely high bandwidth: out-of-order processing in the cores, caches and memory controllers maximizes the capability of each layer's bandwidth.
- CMG: 12x computing cores + 1x assistant core; chip performance >2.7TFLOPS
- Core: 512-bit wide SIMD, 2x FMAs; L1D 64KiB, 4-way
- L1 cache: >11.0TB/s in total (BF = 4); >230 GB/s (load) / >115 GB/s (store) per core
- L2 cache: 8MiB, 16-way per CMG; >3.6TB/s in total (BF = 1.3); >115 GB/s (load) / >57 GB/s (store) per core
- Memory: HBM2 8GiB per CMG at 256 GB/s; 1024GB/s in total (BF =~0.37)
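The byte-per-flop (B/F) ratios quoted above can be recomputed directly from the aggregate bandwidths and the >2.7 TFLOPS peak on this slide; the memory figure comes out slightly above 0.37 because the actual peak is a bit higher than the 2.7 used here.

```python
# B/F = bytes per second of bandwidth per floating-point operation per second.
peak_flops = 2.7e12  # >2.7 TFLOPS from the slide

def bf(bw_bytes_per_s):
    return bw_bytes_per_s / peak_flops

print(round(bf(11.0e12), 1))   # L1 cache: ~4   (slide: BF = 4)
print(round(bf(3.6e12), 1))    # L2 cache: ~1.3 (slide: BF = 1.3)
print(round(bf(1024e9), 2))    # HBM2:     ~0.38 (slide: BF =~ 0.37)
```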
A64FX Core Features
Optimizing the SVE architecture together with Arm for a wide range of applications, including the AI area with FP16 and INT16/INT8 dot product. Developing the A64FX core micro-architecture to increase application performance.

                A64FX (Post-K)     SPARC64 XIfx (PRIMEHPC FX100)   SPARC64 VIIIfx (K computer)
  ISA           Armv8.2-A + SVE    SPARC-V9 + HPC-ACE2             SPARC-V9 + HPC-ACE
  SIMD Width    512-bit            256-bit                         128-bit

Enhanced or further enhanced over the previous generations: four-operand FMA, gather/scatter, predicated operations, math acceleration, compress, first fault load, and HW barrier / sector cache (utilizing AArch64 implementation-defined system registers). New in A64FX: FP16 and INT16/INT8 dot product.
A64FX Chip Level Application Performance
Boosting application performance through micro-architectural enhancements, 512-bit wide SIMD, HBM2 and semiconductor process technologies: >2.5x faster in HPC/AI benchmarks than SPARC64 XIfx, tuned by the Fujitsu compiler for the A64FX micro-architecture and SVE.
[Chart: A64FX kernel benchmark performance (preliminary results), normalized to SPARC64 XIfx. Throughput: DGEMM 2.5TF, Stream Triad 830 GB/s; HPC application kernels: fluid dynamics 3.0x, atmosphere 2.8x, seismic wave propagation 3.4x; AI: convolution FP32 2.5x, convolution low precision 9.4x (estimated). Gains attributed to memory B/W, 512-bit SIMD, combined gather, L1$/L2$ B/W and INT8 dot product.]
A64FX TofuD Overview
- Halved off-chip channels (Tofu network router: 2 lanes x 10 ports vs. Tofu2's 4 lanes x 10 ports) for power and cost reduction
- Increased communication resources: TNIs per node from 4 to 6, more Tofu barrier resources
- Reduced communication latency: simplified multi-lane PCS
- Increased communication reliability: dynamic packet slicing (split and duplicate)

                                  Tofu (K computer)   Tofu2 (FX100)   TofuD
  Data rate (Gbps)                6.25                25.78           28.05
  # of signal lanes per link      8                   4               2
  Link bandwidth (GB/s)           5.0                 12.5            6.8
  # of TNIs per node              4                   4               6
  Injection B/W per node (GB/s)   20                  50              40.8
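The bandwidth rows of the table above are internally consistent: link bandwidth follows from the per-lane data rate and lane count, and injection bandwidth is simply link bandwidth times the number of TNIs. One assumption here is mine, not on the slide: that Tofu uses 8b/10b line coding while Tofu2/TofuD use 64b/66b, which is what makes the numbers come out exactly.

```python
# Recompute link and injection bandwidth for each Tofu generation.
def link_bw_gbytes(data_rate_gbps, lanes, coding):
    # Payload fraction of the line coding (assumed, see lead-in).
    payload = {"8b10b": 8 / 10, "64b66b": 64 / 66}[coding]
    return data_rate_gbps * lanes * payload / 8  # Gbit/s -> GB/s

tofu  = link_bw_gbytes(6.25,  8, "8b10b")    # 5.0 GB/s
tofu2 = link_bw_gbytes(25.78, 4, "64b66b")   # ~12.5 GB/s
tofud = link_bw_gbytes(28.05, 2, "64b66b")   # ~6.8 GB/s
print(round(tofu, 1), round(tofu2, 1), round(tofud, 1))

# Injection bandwidth per node = link bandwidth x number of TNIs.
print(round(tofu * 4, 1), round(tofu2 * 4, 1), round(tofud * 6, 1))
```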
TofuD: Put Latencies, Throughput & Injection Rate
TofuD was evaluated by hardware emulators using the production RTL codes; the simulation model was system-level, including multiple nodes.

Put latency by communication settings:
  Tofu    Descriptor on main memory   1.15 µs
  Tofu    Direct Descriptor           0.91 µs
  Tofu2   Cache injection OFF         0.87 µs
  Tofu2   Cache injection ON          0.71 µs
  TofuD   To/From far CMGs            0.54 µs
  TofuD   To/From near CMGs           0.49 µs

         Put throughput     Injection rate
  Tofu   4.76 GB/s (95%)    15.0 GB/s (77%)
  Tofu2  11.46 GB/s (92%)   45.8 GB/s (92%)
  TofuD  6.35 GB/s (93%)    38.1 GB/s (93%)
Next Generation File System Design
- Next generation file system structure and design
- Next-gen 1st layer file system overview
K computer: Pre-Staging-In/Post-Staging-Out Method
Pros: stable application performance for jobs.
Cons:
- Requires three times the amount of storage that a job needs
- Pre-defining the file names for stage-in/out processing lacks usability
- Data-intensive applications drag system utilization down because of waiting for pre-staging-in/out processing
[Diagram: application on a computing node running Linux, with a loopback local file system using FEFS; job control node, global file system, login node, users.]
Next-Gen File System Requirements and Issues
Requirements:
- 10 times higher access performance
- 100 times larger file system capacity
- Lower power and footprint
Issue: how to realize 10 times faster and 100 times larger file access at the same time?
Next-Gen. File System Design
K computer file system design: how should we realize high speed and redundancy together? We introduced an integrated two-layered file system.
The next-gen. file system/storage design adds further trade-off targets: power, capacity and footprint. It is difficult to realize a single Exabyte- and 10TB/s-class file system within the limited power consumption and footprint, so an additional third storage layer for capacity is needed.
[Diagram: "The Next Integrated Layered File System Architecture for Post-peta scale System" (Feasibility Study 2012-2013) — compute nodes with Lustre-based and ext[34]-based application-specific object-based local file systems for application speed; a Lustre-based shared file system with transparent data access serving thousands of users via login servers and the job scheduler; and an HSM / other shared FS / grid- or cloud-based /data layer for high capacity, redundancy and interoperability with other organizations and systems.]
Next Gen. File System Design
Introducing three-level hierarchical storage:
- 1st level storage: accelerates application file I/O performance (local file system)
- 2nd level storage: shares data using a Lustre-based file system (global file system)
- 3rd level storage: archive storage (archive system)
The 1st level storage is accessed both as a file cache of the global file system and as local storage. The file cache in computing node memory is used as well as the 1st level storage.
[Diagram: computing node (application, Linux, SSD-based 1st level storage), job control node, global file system (Lustre-based, on 2nd level storage), login node, users, archive storage as the 3rd level.]
Next Gen. Layered File System Requirements
Application views:
- Local file system: application-oriented file accesses (higher metadata & data I/O)
- Global file system: transparent file access
- Archive system: indirect access, or transparent file access (HSM)
Transparent file access to the global file system is needed because the local file system capacity is not large enough to hold the whole data of the global file system. File caches in node memory and on the local file system accelerate application performance.
[Table comparing the local, global and archive file systems on metadata performance, data bandwidths, capacity, scalability, data sharing within a job, and data sharing among jobs.]
Next-Gen 1st Layer File System Overview
Goal: maximizing application file I/O performance.
Features:
- Easy access to user data: file cache of the global file system
- Higher data access performance: temporary local FS (within a process)
- Higher data sharing performance: temporary shared FS (among processes)
Now developing the LLIO (Lightweight Layered IO-Accelerator) prototype, which handles I/O with the assistant cores so that the compute cores stay scalable.
[Diagram: jobs A and B across nodes; 1st level: file1 in a local cache, file2 in an SSD cache, file3 in a shared temporary cache on the local file systems; 2nd level: file3 and file4 on the global file system (Lustre-based).]
LLIO Prototype Implementation
Two types of computing nodes:
- Burst Buffer Computing Node (BBCN): provides the burst buffer function with an SSD device
- Computing Node (CN): burst buffer client that sends file access requests to a BBCN acting as the burst buffer server
[Diagram: CNs (arm/x86) send IO/meta requests over the interconnect to the BBCNs and their SSDs; the BBCNs perform background data flushing and on-demand data staging against the 2nd layer file system.]
File Access Sequences using LLIO (Cache Mode)
Metadata requests such as open(file) pass through the CN's LLIO client to the 2nd layer file system metadata server. Data requests such as write(fd, buf, sz) go to the LLIO server on the BBCN, which writes them to the local file system on the SSD and flushes them in the background to the 2nd layer file system I/O server (HDD).
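The cache-mode flow above can be sketched as a tiny write-back cache: metadata operations go straight to the global file system, while write data lands on a local SSD directory and a background thread flushes it to the global directory. This is a minimal illustrative sketch of the pattern, not the LLIO implementation; the directory names stand in for the real SSD and 2nd layer mounts.

```python
import os
import queue
import threading

class CacheModeClient:
    """Write-back cache sketch: metadata passes through, data is flushed lazily."""

    def __init__(self, ssd_dir, gfs_dir):
        self.ssd_dir, self.gfs_dir = ssd_dir, gfs_dir
        self.flush_q = queue.Queue()
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def open(self, name):
        # Metadata request: pass through to the global FS (2nd layer),
        # so the file becomes visible there immediately.
        open(os.path.join(self.gfs_dir, name), "a").close()
        return name

    def write(self, name, data):
        # Data request: land it on the local SSD cache and return;
        # the background flusher moves it to the global FS later.
        with open(os.path.join(self.ssd_dir, name), "ab") as f:
            f.write(data)
        self.flush_q.put((name, data))

    def _flush_loop(self):
        # Background flushing to the 2nd layer file system.
        while True:
            name, data = self.flush_q.get()
            with open(os.path.join(self.gfs_dir, name), "ab") as f:
                f.write(data)
            self.flush_q.task_done()

    def sync(self):
        # Wait until all cached writes have reached the global FS.
        self.flush_q.join()
```

In use, `write()` returns as soon as the data is on the SSD directory, and `sync()` blocks until the background flush has caught up; the real system additionally handles read-back, eviction and failure cases that this sketch omits.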
LLIO Prototype I/O Performance
[Charts: write and read I/O bandwidth vs. the number of IOR streams, compared against the raw device performance.]
Higher I/O performance than NFS or Lustre, evaluated on IA servers using an Intel P3608: LLIO utilizes the maximum physical performance of the I/O device.
Current Status of Lustre Based File System Development, etc.
Current Status of Lustre Based File System
The next-gen. Lustre-based file system, FEFS, is planned to be developed on a Lustre 2.10.x base. We are now testing a Lustre 2.10.x-based FEFS and have found several problems; we plan to fix the bugs and report the fixes.
Contribution of DL-SNAP
What is DL-SNAP? Presented at LUG2016 and LAD'16 (http://cdn.opensfs.org/wp-content/uploads/2016/04/lug2016d2_dl-snap_sumimoto.pdf).
- DL-SNAP is designed for user- and directory-level file backups.
- Users can create a snapshot of a directory, like a directory copy, using the lfs command with the snapshot and create options.
- A user can create multiple snapshots of a directory and manage them, including merging snapshots.
- DL-SNAP also supports quota to limit users' storage usage.
Issue of contribution: we planned the DL-SNAP contribution in 2018, but we do not have enough human resources to port it to the latest Lustre version.
Our strategy for contributing DL-SNAP:
- We are ready to contribute our current DL-SNAP code for Lustre 2.6
- We will create an LU ticket for DL-SNAP (by the end of Oct. 2018)
- We need help to port DL-SNAP to the latest Lustre