File Systems for Next-Generation Supercomputers

1 Gfarm Symposium 2018: File Systems for Next-Generation Supercomputers
  Shinji Sumimoto, Ph.D.
  Next Generation Technical Computing Unit, FUJITSU LIMITED
  Oct. 26th

2 Outline of This Talk
  A64FX: High Performance Arm CPU
  Next Generation File System Design

3 A64FX: High Performance Arm CPU
  From the presentation slides of Hot Chips 30 and Cluster 2018
  Inheriting Fujitsu HPC CPU technologies with a commodity standard ISA

4 A64FX Chip Overview
  Architecture features:
    Armv8.2-A (AArch64 only)
    SVE 512-bit wide SIMD
    48 computing cores + 4 assistant cores (all the cores are identical)
    HBM2 32GiB
    TofuD: 6D Mesh/Torus, 28Gbps x 2 lanes x 10 ports
    PCIe Gen3 16 lanes
  CMG specification: 13 cores, L2$ 8MiB, Mem 8GiB, 256GB/s
  7nm FinFET: 8,786M transistors, 594 package signal pins
  Peak performance (efficiency): >2.7TFLOPS (>90% @ DGEMM)
  Memory B/W: 1024GB/s (>80% @ Stream Triad)
  [Block diagram <A64FX>: HBM2 x 4, Network on Chip, Tofu controller (28Gbps, 2 lanes, 10 ports), PCIe controller (Gen3, 16 lanes)]

  Comparison: A64FX (Post-K) / SPARC64 XIfx (PRIMEHPC FX100)
    ISA (Base): Armv8.2-A / SPARC-V9
    ISA (Extension): SVE / HPC-ACE2
    Process node: 7nm / 20nm
    Peak performance: >2.7TFLOPS / 1.1TFLOPS
    SIMD: 512-bit / 256-bit
    # of cores: [missing] / [missing]
    Memory: HBM2 / HMC
    Memory peak B/W: 1024GB/s / 240GB/s x2 (in/out)

5 A64FX Memory System
  Extremely high bandwidth
    Out-of-order processing in cores, caches and memory controllers
    Maximizing the capability of each layer's bandwidth
  CMG: 12x computing cores + 1x assistant core
  Performance: >2.7TFLOPS (512-bit wide SIMD, 2x FMAs)
  L1 cache: >11.0TB/s (BF = 4); L1D 64KiB, 4-way; per-core links >230 GB/s and >115 GB/s
  L2 cache: >3.6TB/s (BF = 1.3); 8MiB, 16-way; per-core links >115 GB/s and >57 GB/s
  Memory: 1024GB/s (BF =~0.37); HBM2 8GiB at 256 GB/s per CMG
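
The BF figures on this slide read as bytes-per-FLOP ratios: each layer's bandwidth divided by the chip's peak floating-point rate. The sketch below redoes that division with the slide's numbers. The function name is illustrative, and using exactly 2.7 TFLOPS as the peak is an assumption (the slide says ">2.7"), which is why the memory ratio comes out slightly above the slide's ~0.37.

```python
# Bytes-per-FLOP (B/F) ratios from the figures quoted on the slide.
# Assumption: peak is taken as exactly 2.7 TFLOPS; the slide only says ">2.7".
PEAK_TFLOPS = 2.7

# Aggregate bandwidth per layer in TB/s, as quoted on the slide.
BANDWIDTH_TB_PER_S = {
    "L1 cache": 11.0,
    "L2 cache": 3.6,
    "Memory (HBM2)": 1.024,
}


def bytes_per_flop(bandwidth_tb_per_s: float, peak_tflops: float) -> float:
    """Return the B/F ratio: bytes moved per floating-point operation."""
    return bandwidth_tb_per_s / peak_tflops


if __name__ == "__main__":
    for layer, bandwidth in BANDWIDTH_TB_PER_S.items():
        ratio = bytes_per_flop(bandwidth, PEAK_TFLOPS)
        print(f"{layer}: {ratio:.2f} B/F")
```

The ratios fall by roughly a factor of three at each step away from the core, which is the usual argument for keeping working sets in cache.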

6 A64FX Core Features
  Optimizing the SVE architecture for a wide range of applications with Arm, including the AI area through FP16 and INT16/INT8 dot product
  Developing the A64FX core micro-architecture to increase application performance

  Comparison: A64FX (Post-K) / SPARC64 XIfx (PRIMEHPC FX100) / SPARC64 VIIIfx (K computer)
    ISA: Armv8.2-A + SVE / SPARC-V9 + HPC-ACE2 / SPARC-V9 + HPC-ACE
    SIMD width: 512-bit / 256-bit / 128-bit
    Feature rows (column alignment lost): Four-operand FMA, Enhanced, Gather/Scatter, Enhanced, Predicated Operations, Enhanced, Math. Acceleration, Further enhanced, Enhanced, Compress, Enhanced, First Fault Load, New, FP16, New, INT16/INT8 Dot Product, New, HW Barrier* / Sector Cache*, Further enhanced, Enhanced
  * Utilizing AArch64 implementation-defined system registers

7 A64FX Chip Level Application Performance
  Boosting application performance through micro-architectural enhancements, 512-bit wide SIMD, HBM2 and semiconductor process technologies
  More than 2.5x faster in HPC/AI benchmarks than SPARC64 XIfx, tuned by the Fujitsu compiler for the A64FX micro-architecture and SVE
  [Figure: A64FX Kernel Benchmark Performance (preliminary results), normalized to SPARC64 XIfx (baseline). Groups: Throughput (DGEMM, Stream Triad); HPC Application Kernel (Fluid dynamics, Atmosphere, Seismic wave propagation); AI (Convolution FP32, Convolution Low Precision (Estimated)). Annotations: Memory B/W, 512-bit SIMD, Combined Gather, L1$ B/W, L2$ B/W, INT8 dot product. Values shown: TF, 830 GB/s, 3.0x, 2.8x, 3.4x, 2.5x, 9.4x]

8 A64FX TofuD Overview
  Halved off-chip channels
    Power and cost reduction
  Increased communication resources
    TNIs from 2 to 4
    Tofu barrier resources
  Reduced communication latency
    Simplified multi-lane PCS
  Increased communication reliability
    Dynamic packet slicing: split and duplicate
  Table, Tofu (K computer) / Tofu2 (FX100) / TofuD: data rate (Gbps); # of signal lanes per link; link bandwidth (GB/s); # of TNIs per node; injection bandwidth per node (GB/s). [cell values missing]
  [Diagram: SPARC64 XIfx with CMGs, HMCs, PCIe, TNI0 to TNI3 and a Tofu2 network router (4 lanes, 10 ports); A64FX with CMGs, HBM2, NOC, PCIe, TNI0 to TNI5 and a TofuD network router (2 lanes, 10 ports)]

9 TofuD: 6D Mesh/Torus Network
  Six coordinates: (X, Y, Z) x (A, B, C)
    X, Y and Z: sizes depend on the system size
    A, B and C: sizes are fixed to 2, 3 and 2, respectively
  Tofu stands for "torus fusion"
  [Diagram: X, Y, Z axes and A, B, C axes]
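
To make the six-coordinate scheme concrete, the sketch below counts the nodes in a system of a given X, Y, Z size and maps a coordinate to a linear index. Only the fixed A, B, C sizes (2, 3, 2) come from the slide. The function names and the row-major numbering are assumptions made for illustration and are not the numbering the real system software uses.

```python
# Fixed sizes of the A, B and C axes, as stated on the slide.
ABC_SIZES = (2, 3, 2)


def node_count(x: int, y: int, z: int) -> int:
    """Number of nodes in a system whose X, Y, Z axes have the given sizes."""
    a, b, c = ABC_SIZES
    return x * y * z * a * b * c


def coord_to_index(coord, xyz_sizes) -> int:
    """Map (x, y, z, a, b, c) to a linear index.

    Hypothetical row-major numbering, chosen only for this example.
    """
    sizes = (*xyz_sizes, *ABC_SIZES)
    if len(coord) != len(sizes):
        raise ValueError("expected six coordinates")
    index = 0
    for value, size in zip(coord, sizes):
        if not 0 <= value < size:
            raise ValueError(f"coordinate {value} out of range 0..{size - 1}")
        index = index * size + value
    return index


if __name__ == "__main__":
    print(node_count(1, 1, 1))
    print(coord_to_index((0, 0, 0, 1, 2, 1), (1, 1, 1)))
```

Every unit of the X, Y, Z grid contributes 2 x 3 x 2 = 12 nodes, so system sizes are always multiples of 12.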

10 TofuD: Packaging, CPU Memory Unit
  Two CPUs connected with the C-axis
    X x Y x Z x A x B x C = 1 x 1 x 1 x 1 x 1 x 2
  Two or three active optical cable cages on board
  Each cable is shared by two CPUs
  [Diagram: two CPUs with AOC (X), AOC (Y), AOC (Z) and two further AOC cages]

11 TofuD: Packaging, Rack Structure
  Rack: 8 shelves, 192 CMUs or 384 CPUs
  Shelf: 24 CMUs or 48 CPUs
    X x Y x Z x A x B x C = [values missing]
  Top or bottom half of rack: 4 shelves
    X x Y x Z x A x B x C = [values missing]
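
The packaging counts on the CMU and rack slides are mutually consistent, which the short check below confirms. The constants are taken from the slides; the helper names are illustrative. Treating one CPU as one network node is an assumption drawn from the CMU slide, where two CPUs occupy two positions along the C-axis.

```python
# Packaging constants quoted on the slides.
CPUS_PER_CMU = 2       # "Two CPUs connected with C-axis"
CMUS_PER_SHELF = 24
SHELVES_PER_RACK = 8

# Nodes in one A x B x C group (2 x 3 x 2), from the 6D mesh/torus slide.
NODES_PER_ABC_GROUP = 2 * 3 * 2


def cpus_per_shelf() -> int:
    return CPUS_PER_CMU * CMUS_PER_SHELF


def cmus_per_rack() -> int:
    return CMUS_PER_SHELF * SHELVES_PER_RACK


def cpus_per_rack() -> int:
    return cpus_per_shelf() * SHELVES_PER_RACK


if __name__ == "__main__":
    print(cpus_per_shelf(), cmus_per_rack(), cpus_per_rack())
    # Assumption: one CPU is one node, so a rack holds whole A x B x C groups.
    print(cpus_per_rack() // NODES_PER_ABC_GROUP)
```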

12 TofuD: Put Latencies, Throughput and Injection Rate
  TofuD: evaluated by hardware emulators using the production RTL codes
  Simulation model: system-level, including multiple nodes

  Put latency by communication setting:
    Tofu, descriptor on main memory: 1.15 µs
    Tofu, direct descriptor: 0.91 µs
    Tofu2, cache injection OFF: 0.87 µs
    Tofu2, cache injection ON: 0.71 µs
    TofuD, to/from far CMGs: 0.54 µs
    TofuD, to/from near CMGs: 0.49 µs

  Put throughput / injection rate (efficiency in parentheses):
    Tofu: 4.76 GB/s (95%) / 15.0 GB/s (77%)
    Tofu2: [value missing] GB/s (92%) / 45.8 GB/s (92%)
    TofuD: 6.35 GB/s (93%) / 38.1 GB/s (93%)
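
Two quantities can be derived from the throughput table: the peak bandwidth implied by each measured value and its efficiency, and the ratio of injection rate to single-put throughput. The sketch below computes both for the rows whose values survive. The function names are illustrative, and reading the percentages as measured divided by peak is an assumption.

```python
# (put throughput GB/s, efficiency), (injection rate GB/s, efficiency)
# for the rows of the slide's table that are fully preserved.
MEASURED = {
    "Tofu": ((4.76, 0.95), (15.0, 0.77)),
    "TofuD": ((6.35, 0.93), (38.1, 0.93)),
}


def implied_peak(measured_gb_s: float, efficiency: float) -> float:
    """Peak bandwidth, assuming efficiency = measured / peak."""
    return measured_gb_s / efficiency


def injection_to_put_ratio(name: str) -> float:
    """How many single-put streams the injection rate corresponds to."""
    (put, _), (injection, _) = MEASURED[name]
    return injection / put


if __name__ == "__main__":
    for name, ((put, put_eff), (inj, inj_eff)) in MEASURED.items():
        print(name,
              f"link peak {implied_peak(put, put_eff):.2f} GB/s,",
              f"injection peak {implied_peak(inj, inj_eff):.2f} GB/s,",
              f"ratio {injection_to_put_ratio(name):.2f}")
```

For TofuD the ratio is 6, which matches the six interfaces (TNI0 to TNI5) drawn in the TofuD overview diagram.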

13 Next Generation File System Design
  File System Design for the K computer
  Next Generation File System Structure and Design
  Next-Gen 1st Layer File System Overview

14 Overview of FEFS for K computer
  Goals: to realize a world top class capacity and performance file system (100PB, 1TB/s)
  Based on the Lustre file system with several extensions
    These extensions are now going to be contributed to the Lustre community.
  Introducing a layered file system matched to the characteristics of each file layer
    Temporary fast scratch FS (Local) and persistent shared FS (Global)
    The staging function, which transfers files between the Local FS and the Global FS, is controlled by the batch scheduler
  [Figure: Configuration of FEFS for K computer. Local File System (work, temporary): for performance. Global File System (data, persistent): for easy use, capacity and reliability. File servers in each layer form the FEFS cluster file system, with staging between the two layers.]

15 Job Execution and File System Accesses on K computer
  82,944 compute nodes
  Local FS: 11PB, 1 MDS + 2,592 OSSes (5,184 OSTs), 3.2TB/s read, 1.4TB/s write
  Global FS: 30PB, 5 MDS + 90 OSSes (2,880 OSTs), 0.2TB/s read/write
  Users use /home and /data on the Global File System
  Automatic stage-in and stage-out by the batch scheduler
  Steps labeled in the figure: job dispatch (job dispatching on a login node), 4. stage-in processing, 5. job processing, 6. stage-out processing
  Job execution environment: stage-in files, program, stage-out files
  [Figure: portal server, GW, login nodes, batch scheduler, I/O, Local File System, Global File System (/home, /data)]
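
Dividing the aggregate figures by the server counts gives a feel for what each object storage server (OSS) contributes in the two layers. This is a back-of-the-envelope sketch: the dictionary layout and function names are illustrative, 1 TB is taken as 1000 GB, and an even spread of load across servers is assumed.

```python
# Server counts and aggregate bandwidths quoted on the slide.
FILE_SYSTEMS = {
    "Local FS": {"oss": 2592, "ost": 5184, "read_tb_s": 3.2, "write_tb_s": 1.4},
    "Global FS": {"oss": 90, "ost": 2880, "read_tb_s": 0.2, "write_tb_s": 0.2},
}


def per_oss_gb_s(total_tb_s: float, oss_count: int) -> float:
    """Average bandwidth per OSS in GB/s (assumes 1 TB = 1000 GB, even load)."""
    return total_tb_s * 1000.0 / oss_count


def osts_per_oss(fs: dict) -> float:
    """Average number of object storage targets behind each OSS."""
    return fs["ost"] / fs["oss"]


if __name__ == "__main__":
    for name, fs in FILE_SYSTEMS.items():
        print(name,
              f"read {per_oss_gb_s(fs['read_tb_s'], fs['oss']):.2f} GB/s per OSS,",
              f"write {per_oss_gb_s(fs['write_tb_s'], fs['oss']):.2f} GB/s per OSS,",
              f"{osts_per_oss(fs):.0f} OSTs per OSS")
```

The local layer reaches its bandwidth through many lightly loaded servers with few targets each, while the global layer puts many targets behind each of far fewer servers, in line with the performance versus capacity split described for FEFS.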

16 K computer: Pre-Staging-In/Post-Staging-Out Method
  Pros:
    Stable application performance for jobs
  Cons:
    Requires three times the amount of storage that a job needs
    Having to pre-define the file names for stage-in/out processing lacks usability
    Data-intensive applications lower system utilization because jobs wait for pre-staging-in/out processing
  [Diagram: users, login node, job control node and Global File System; stage-in/out to the computing node (application, Linux), which reaches the Local File System using FEFS over loopback]
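
The pre-staging-in/post-staging-out flow can be sketched with ordinary directories standing in for the two file systems. This is a toy model and not the K computer's batch scheduler: the function names, file names and the trivial job body are all made up for illustration. It does show the usability cost named above, since the file lists must be declared before the job runs.

```python
import shutil
import tempfile
from pathlib import Path


def stage_in(global_fs: Path, local_fs: Path, names) -> None:
    """Copy the pre-declared input files to the local (scratch) file system."""
    for name in names:
        shutil.copy2(global_fs / name, local_fs / name)


def stage_out(local_fs: Path, global_fs: Path, names) -> None:
    """Copy the pre-declared output files back to the global file system."""
    for name in names:
        shutil.copy2(local_fs / name, global_fs / name)


def run_job(local_fs: Path) -> None:
    """Stand-in for the application: it only touches the local file system."""
    text = (local_fs / "input.dat").read_text()
    (local_fs / "output.dat").write_text(text.upper())


def run_with_staging(global_fs: Path, inputs, outputs) -> None:
    """Stage in, run, stage out; the scratch area is discarded afterwards."""
    with tempfile.TemporaryDirectory() as scratch:
        local_fs = Path(scratch)
        stage_in(global_fs, local_fs, inputs)
        run_job(local_fs)
        stage_out(local_fs, global_fs, outputs)


def demo() -> str:
    with tempfile.TemporaryDirectory() as shared:
        global_fs = Path(shared)
        (global_fs / "input.dat").write_text("hello")
        run_with_staging(global_fs, ["input.dat"], ["output.dat"])
        return (global_fs / "output.dat").read_text()


if __name__ == "__main__":
    print(demo())
```

While the job runs, each input exists in both file systems, and each output is duplicated again at stage-out, which illustrates why staging multiplies the storage a job needs.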

17 Next-Gen File System Requirement and Issues
  Requirements:
    10 times higher access performance
    100 times larger file system capacity
    Lower power and footprint
  Issues:
    How can 10 times faster and 100 times larger file access be realized at the same time?

18 Next-Gen. File System Design
  K computer file system design
    How should we realize high speed and redundancy together?
    Introduced an integrated two-layered file system.
  Next-Gen. file system/storage design
    Additional trade-off targets: power, capacity, footprint
    It is difficult to realize a single exabyte and 10TB/s class file system within limited power consumption and footprint.
    An additional third storage layer for capacity is needed.
  [Figure: The Next Integrated Layered File System Architecture for Post-peta scale System (Feasibility Study). Compute nodes use high-speed, application-specific file systems (Lustre based, Ext[34] based, object based, existing FS). A Lustre-based shared layer gives usability to thousands of users through the job scheduler and login server, with transparent data access. A layer for high capacity, redundancy and interoperability (HSM, other shared FS, grid or cloud based, /data) connects to other organizations and other systems.]

19 Next Gen. File System Design
  Introducing three-level hierarchical storage:
    1st level storage: accelerating application file I/O performance (Local File System)
    2nd level storage: sharing data using a Lustre-based file system (Global File System)
    3rd level storage: archive storage (Archive System)
  The 1st level storage is accessed both as a file cache of the global file system and as local storage
    The file cache on the computing node is used as well as the 1st level storage
  [Diagram: computing node (application, Linux) with SSD-based 1st level storage; job control node; Global File System (Lustre-based file system on 2nd level storage); login node and users; archive storage for 3rd level storage]

20 Scopes of File Usages for Post-K File System Design
  File lifetime:
    Persistent files: input files, output files
    Temporary files: input files, output files
  File access pattern:
    Distributed files: one file for each process
    Shared file: partial access, or concentrated access to the same data
    File I/O master: the master process does the whole file I/O
  Data sharing:
    Within a job
    Among multiple jobs (under design)
  [Diagrams: processes and files for Distributed Files, Shared File (1), Shared File (2) and I/O Master; sharing within a job and among multiple jobs]

21 File Lifetime for Effective SSD Use
  Persistent files in a job are located on SSD as file cache
    Asynchronous data transfer is used between SSD and the global file system
  Temporary files in a job should be located on SSD to eliminate global file system accesses
  But how should the persistent file cache on SSD be used?
    It depends on the file access patterns

22 Application's Access Pattern and SSD Cache Effects
  Comparison of effective patterns for SSD-based storage
  Patterns compared: Distributed Files, Shared File (1), Shared File (2), I/O Master
  [Diagrams: processes reading a file and processes writing a file, for each pattern]
  File read effects, per pattern: rereading case and non-rereading case [ratings missing]
  File write effects, per pattern: rewriting case and non-rewriting case [ratings missing]

23 Data Sharing in a Job
  Write-read in a process and among processes are effective uses of SSD
  For persistent files: the file cache of the global file system should be shared among processes
  For temporary files: two types of temporary file systems are effective uses of SSD
    Temporary Local System (in a process)
    Temporary Shared File System (among processes)
  [Diagrams: write-then-read sequences, numbered (1) and (2), within one process and among several processes sharing files]

24 Data Sharing among Multiple Jobs
  Write-read among multiple jobs is an effective use of SSD
  How to share the file cache of the global file system and the temporary shared file system data is still to be designed

25 SSD Lifetime Issue
  Current SSDs mainly use NAND-based cells and have a limited number of lifetime writes (DWPD)
    Consumer products cannot be used because their DWPD is too low
    Enterprise products must be used
  The operating period of Post-K is planned to be at least 5 years
    The DWPD required by the most I/O-intensive target application is 7.1TB/day
    The Intel P3700 is the best choice among these products

  Product (class): capacity, warranty period, MTBF, AFR, DWPD
    Intel P3700 (enterprise): 800GB, 5 years, 2.0M, 0.44%, 8TB/day
    Intel P3608 (enterprise): 1.6TB, 5 years, 1.0M, 0.87%, 4.8TB/day
    Intel 750 (consumer): 1.2TB, 5 years, 1.2M, 0.73%, 70GB/day
    Intel 600p (consumer): 1TB, 5 years, 1.5M, 0.54%, 40GB/day
    Samsung 950 Pro (consumer): 512GB, 5 years, 1.5M, 0.58%, 210GB/day
    Samsung 960 Pro (consumer): 1TB, 5 years, 1.5M, 0.58%, 430GB/day
    Samsung 960 EVO (consumer): 1TB, 3 years, 1.5M, 0.58%, 360GB/day
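
The selection argument on this slide can be replayed as a filter: keep only the products whose rated daily writes cover the 7.1TB/day workload, and estimate the total volume written over a 5-year operating period. The sketch also relates MTBF to AFR under a constant-failure-rate (exponential) model. That model, the reading of "M" as millions of hours, and all function names are assumptions that the slide does not state.

```python
import math

HOURS_PER_YEAR = 8760

# (product, rated writes in TB/day, MTBF in hours), copied from the slide's table.
# Assumption: "2.0M" etc. means millions of hours.
PRODUCTS = [
    ("Intel P3700", 8.0, 2.0e6),
    ("Intel P3608", 4.8, 1.0e6),
    ("Intel 750", 0.070, 1.2e6),
    ("Intel 600p", 0.040, 1.5e6),
    ("Samsung 950 Pro", 0.210, 1.5e6),
    ("Samsung 960 Pro", 0.430, 1.5e6),
    ("Samsung 960 EVO", 0.360, 1.5e6),
]


def products_meeting(required_tb_per_day: float):
    """Names of products whose rated daily writes cover the requirement."""
    return [name for name, rated, _ in PRODUCTS if rated >= required_tb_per_day]


def lifetime_writes_tb(tb_per_day: float, years: int) -> float:
    """Total data written over the operating period (365-day years)."""
    return tb_per_day * 365 * years


def afr_percent(mtbf_hours: float) -> float:
    """Annualized failure rate under an assumed exponential failure model."""
    return (1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)) * 100.0


if __name__ == "__main__":
    print(products_meeting(7.1))
    print(f"{lifetime_writes_tb(7.1, 5):.1f} TB written in 5 years")
    print(f"AFR at 2.0M hours MTBF: {afr_percent(2.0e6):.2f}%")
```

With the slide's numbers only one product passes the filter, which agrees with the slide's conclusion.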

26 How about Intel Optane Products?
  Product (class): capacity, read perf., write perf., K IOPS (R/W), latency (R/W), warranty, MTBF, AFR, DWPD
    Intel P3700 (enterprise): 800GB, 2.7GB/s, 1.9GB/s, 460/90, 20/20us, 5 years, 2.0M, 0.44%, 8TB/day
    Intel P3608 (enterprise): 1.6TB, 5.0GB/s, 2.0GB/s, 850/[missing], 20/20us, 5 years, 1.0M, 0.87%, 4.8TB/day
    Intel P4600 (enterprise): 1.6TB, 3.3GB/s, 1.4GB/s, [missing]/[missing], 79/34us, 5 years, 2.0M, 0.44%, 4.7TB/day
    Intel P4500 (enterprise): 1TB, 3.3GB/s, 0.6GB/s, [missing]/32, 80/29us, 5 years, 2.0M, 0.44%, 0.72TB/day
    Intel Optane P4800X (enterprise): 375GB, 2.4GB/s, 2.0GB/s, 550/[missing], 10/10us, 5 years, 2.0M, 0.44%, 11.2TB/day
    Intel Optane 900P (enthusiast): 480GB, 2.5GB/s, 2.0GB/s, [missing]/500, 10/10us, 5 years, 1.6M, 0.54%, 4.7TB/day
  Intel Optane: solid-state-drives/data-center-ssds.html
  Write IOPS is 2.7 times higher than that of the P4600, but the 375GB capacity is too small to use
  The DWPD of 11.2TB/day is not higher than expected (3 times better than the P3700 800GB), but the actual number of cells should be investigated
  Current cost is 30% higher than that of the P[...] GB (Amazon.com)
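
The remark that the Optane P4800X is "3 times better" than the P3700 makes sense when the daily write rating is divided by capacity, giving full drive writes per day. The sketch below does that normalization with the table's figures. Reading the remark this way is an interpretation and not something the slide spells out, and the function name is illustrative.

```python
def drive_writes_per_day(rated_tb_per_day: float, capacity_tb: float) -> float:
    """Rated daily writes expressed as multiples of the drive's capacity."""
    return rated_tb_per_day / capacity_tb


# Figures from the slide's table: (rated TB/day, capacity in TB).
P3700 = (8.0, 0.8)
P4800X = (11.2, 0.375)

if __name__ == "__main__":
    p3700 = drive_writes_per_day(*P3700)
    p4800x = drive_writes_per_day(*P4800X)
    print(f"P3700: {p3700:.1f} drive writes/day")
    print(f"P4800X: {p4800x:.1f} drive writes/day")
    print(f"ratio: {p4800x / p3700:.1f}x")
```

In absolute terms the P4800X rating is only 1.4 times that of the P3700, so the small capacity is what limits its usefulness here.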

27 Next-Gen. File System Design: How should SSD-based storage be used?
  Lifetime:
    Persistent files in a job are located on SSD as file cache
    Temporary files in a job should be located on SSD to eliminate global file system accesses
  Application's access pattern:
    Files that are read once and not reused should not use SSD-based storage
  Data sharing in a job:
    Write-read in a process and among processes are effective uses of SSD
    For persistent files: the file cache of the global file system should be shared among processes
    For temporary files: two types of temporary file systems are effective uses of SSD
      Temporary Local System (in a process)
      Temporary Shared File System (among processes)
  Data sharing among multiple jobs:
    Write-read among multiple jobs is an effective use of SSD
    How to share the file cache of the global file system and the temporary shared file system data is still to be designed
  SSD lifetime issue:
    An enterprise SSD with a higher DWPD than that required by any application will be selected
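
The placement rules summarized on this slide can be written down as a small decision function. This is a minimal sketch of the guidance as stated here and not LLIO's actual policy code: the class, the field names and the returned labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileUse:
    """How a job uses one file (hypothetical description for this sketch)."""
    lifetime: str          # "persistent" or "temporary"
    shared: bool           # accessed by several processes of the job
    reused: bool           # reread or rewritten within the job
    is_write: bool = False


def choose_placement(use: FileUse) -> str:
    """Pick a storage target following the slide's guidance."""
    if use.lifetime == "temporary":
        # Temporary files stay on SSD so they never touch the global FS.
        if use.shared:
            return "temporary shared FS on SSD"
        return "temporary local FS on SSD"
    if use.lifetime != "persistent":
        raise ValueError(f"unknown lifetime: {use.lifetime!r}")
    if not use.is_write and not use.reused:
        # A file read once gains nothing from an SSD copy.
        return "global file system, bypassing SSD"
    # Persistent data is cached on SSD and transferred asynchronously.
    return "SSD file cache of the global file system"


if __name__ == "__main__":
    print(choose_placement(FileUse("temporary", shared=True, reused=True)))
    print(choose_placement(FileUse("persistent", shared=False, reused=False)))
```

Keeping the rules in one function makes the open question on the slide easy to locate: sharing among multiple jobs would need another input that this sketch does not model.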

28 Next-Gen 1st Layer File System Overview
  Goal: maximizing application file I/O performance
  Features:
    Easy access to user data: file cache of the Global File System
    Higher data access performance: Temporary Local FS (in a process)
    Higher data sharing performance: Temporary Shared FS (among processes)
  Now developing the LLIO (Lightweight Layered IO-Accelerator) prototype
  [Diagram: nodes running applications for Job A and Job B on scalable compute cores, with I/O handled by assistant cores. 1st level: local cache, SSD cache and shared temporary cache (local file systems) holding file1, file2 and file3. 2nd level: Global File System (Lustre based) holding file3 and file4.]

29 LLIO Prototype Implementation
  Two types of computing nodes:
    Burst Buffer Computing Node (BBCN): burst buffer system function with an SSD device
    Computing Node (CN): burst buffer client; sends file access requests to the BBCN, which acts as the burst buffer server
  [Diagram: in the computing node cluster, CNs (arm/x86) send IO/meta requests over the interconnect to BBCNs with SSDs; the BBCNs perform background data flushing and on-demand data staging with the 2nd layer file system]

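30 File Access Sequences using LLIO (Cache Mode)
  Participants:
    CN: App, /gfs, LLIO
    BBCN: LLIO, LFS on SSD, 2nd layer FS client
    2nd layer file system: meta server, I/O server, 2nd layer FS server, HDD
  Meta requests: passed through to the 2nd layer
    open(file) goes to the meta server
  write(fd, buf, sz) from the App goes through LLIO on the CN to LLIO on the BBCN
  Background flushing: the BBCN flushes the data with write(fd, buf, sz) to the I/O server of the 2nd layer file system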
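
The cache-mode sequence, together with the background flushing and on-demand staging named on the prototype slide, amounts to a write-back cache in front of the 2nd layer file system. The sketch below models that behavior with in-memory dictionaries and a flusher thread. It is a toy illustration under those assumptions and not LLIO code; the class and method names are invented.

```python
import queue
import threading


class BurstBufferSketch:
    """Toy write-back cache: writes land on 'SSD', a thread flushes them."""

    def __init__(self):
        self.ssd = {}            # stands in for the SSD-backed cache on a BBCN
        self.second_layer = {}   # stands in for the 2nd layer file system
        self._dirty = queue.Queue()
        worker = threading.Thread(target=self._flush_loop, daemon=True)
        worker.start()

    def write(self, path: str, data: bytes) -> None:
        """Append to the cached file and return without waiting for a flush."""
        self.ssd[path] = self.ssd.get(path, b"") + data
        self._dirty.put(path)

    def read(self, path: str) -> bytes:
        """Serve from the cache, staging from the 2nd layer on a miss."""
        if path not in self.ssd:
            self.ssd[path] = self.second_layer[path]
        return self.ssd[path]

    def _flush_loop(self) -> None:
        # Background flushing: copy each dirty file to the 2nd layer.
        while True:
            path = self._dirty.get()
            self.second_layer[path] = self.ssd[path]
            self._dirty.task_done()

    def wait_for_flush(self) -> None:
        """Block until every queued write has reached the 2nd layer."""
        self._dirty.join()


if __name__ == "__main__":
    bb = BurstBufferSketch()
    bb.write("/gfs/result.dat", b"step1;")
    bb.write("/gfs/result.dat", b"step2;")
    bb.wait_for_flush()
    print(bb.second_layer["/gfs/result.dat"])
```

The application-facing `write` returns as soon as the data is in the cache, which is the property that lets a burst buffer absorb I/O bursts at SSD speed while the slower layer catches up.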

31 LLIO Prototype I/O Performance
  Higher I/O performance than NFS and Lustre
    Evaluated on IA servers using Intel P3608
    LLIO utilizes the maximum physical I/O device performance
  [Figures: write performance and read performance, I/O bandwidth versus # of IOR streams, with the device performance marked]
