SSD Based First Layer File System for the Next Generation Supercomputer


SSD Based First Layer File System for the Next Generation Supercomputer
Shinji Sumimoto, Ph.D.
Next Generation Technical Computing Unit, FUJITSU LIMITED
Sept. 24th, 2018

Outline of This Talk
- A64FX: High Performance Arm CPU
- SSD Based First Layer File System for the Next Generation Supercomputer
- Current Status of Lustre Based File System Development

A64FX: High Performance Arm CPU

From the presentation slides of Hot Chips 30 and Cluster 2018. A64FX inherits Fujitsu's HPC CPU technologies while adopting a commodity standard ISA.

A64FX Chip Overview

Architecture features (A64FX):
- Armv8.2-A (AArch64 only), SVE 512-bit wide SIMD
- 48 computing cores + 4 assistant cores (all cores are identical)
- CMG specification: 13 cores, L2$ 8MiB, Mem 8GiB at 256GB/s
- Tofu controller: TofuD, 6D Mesh/Torus, 28Gbps x 2 lanes x 10 ports
- PCIe controller: PCIe Gen3, 16 lanes
- HBM2: 32GiB (4 x 8GiB), connected via a network on chip
- 7nm FinFET: 8,786M transistors, 594 package signal pins
- Peak performance (efficiency): >2.7TFLOPS (>90% @ DGEMM)
- Memory B/W: 1024GB/s (>80% @ Stream Triad)

Comparison with the previous generation:

                    A64FX (Post-K)    SPARC64 XIfx (PRIMEHPC FX100)
  ISA (base)        Armv8.2-A         SPARC-V9
  ISA (extension)   SVE               HPC-ACE2
  Process node      7nm               20nm
  Peak performance  >2.7TFLOPS        1.1TFLOPS
  SIMD              512-bit           256-bit
  # of cores        48+4              32+2
  Memory            HBM2              HMC
  Memory peak B/W   1024GB/s          240GB/s x2 (in/out)
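The >2.7TFLOPS peak figure is consistent with 48 cores each driving two 512-bit FMA pipes. A quick sanity check, assuming a core clock of about 1.8 GHz (the clock is not stated on the slide):

```python
# Sanity check of the A64FX peak FP64 figure from the slide.
# Assumption (not on the slide): a core clock of about 1.8 GHz.
cores = 48              # computing cores (assistant cores excluded)
simd_lanes = 512 // 64  # 512-bit SIMD -> 8 FP64 lanes
fma_pipes = 2           # 2x FMAs per core
flops_per_fma = 2       # multiply + add
clock_ghz = 1.8         # assumed, not from the slide

peak_tflops = cores * simd_lanes * fma_pipes * flops_per_fma * clock_ghz / 1000
print(f"peak ~ {peak_tflops:.2f} TFLOPS")  # ~2.76, matching ">2.7TFLOPS"
```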

A64FX Memory System

Extremely high bandwidth:
- Out-of-order processing in cores, caches and memory controllers
- Maximizing the capability of each layer's bandwidth

Per CMG (12x computing cores + 1x assistant core); chip performance >2.7TFLOPS:
- L1D cache (64KiB, 4-way): >11.0TB/s total (BF = 4); per core >230 GB/s / >115 GB/s, 512-bit wide SIMD, 2x FMAs
- L2 cache (8MiB, 16-way): >3.6TB/s total (BF = 1.3); per core >115 GB/s / >57 GB/s
- Memory (HBM2, 8GiB per CMG): 1024GB/s total (BF ~ 0.37); 256 GB/s per CMG
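The byte-per-flop (BF) ratios quoted on the slide follow directly from the per-layer bandwidths and the chip peak; a small check:

```python
# Recompute the bytes-per-flop (BF) ratios quoted on the slide
# from the per-layer bandwidths and the ~2.76 TFLOPS chip peak.
peak_flops = 2.76e12            # flop/s (slide: >2.7 TFLOPS)
bandwidths = {                  # bytes/s
    "L1 cache": 11.0e12,        # >11.0 TB/s (slide: BF = 4)
    "L2 cache": 3.6e12,         # >3.6 TB/s  (slide: BF = 1.3)
    "memory":   1.024e12,       # 1024 GB/s  (slide: BF ~ 0.37)
}
for layer, bw in bandwidths.items():
    print(f"{layer}: BF = {bw / peak_flops:.2f}")
```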

A64FX Core Features

- Optimizing the SVE architecture for a wide range of applications with Arm, including the AI area with FP16 and INT16/INT8 dot product
- Developing the A64FX core micro-architecture to increase application performance

Feature comparison (A64FX (Post-K) / SPARC64 XIfx (PRIMEHPC FX100) / SPARC64 VIIIfx (K computer)):
- ISA: Armv8.2-A + SVE / SPARC-V9 + HPC-ACE2 / SPARC-V9 + HPC-ACE
- SIMD width: 512-bit / 256-bit / 128-bit
- Four-operand FMA: enhanced
- Gather/Scatter: enhanced
- Predicated operations: enhanced
- Math. acceleration: further enhanced / enhanced
- Compress: enhanced
- First fault load: new
- FP16: new
- INT16/INT8 dot product: new
- HW barrier* / sector cache*: further enhanced / enhanced

* Utilizing AArch64 implementation-defined system registers

A64FX Chip Level Application Performance

- Boosting application performance by micro-architectural enhancements, 512-bit wide SIMD, HBM2 and semiconductor process technologies
- >2.5x faster in HPC/AI benchmarks than SPARC64 XIfx, tuned by the Fujitsu compiler for the A64FX micro-architecture and SVE

[Chart: A64FX kernel benchmark performance (preliminary results), normalized to SPARC64 XIfx. Throughput: DGEMM (2.5TF) and Stream Triad (830 GB/s); HPC application kernels: fluid dynamics, atmosphere, seismic wave propagation; AI: convolution FP32 and convolution low precision (estimated). Speedups of roughly 2.5x-3.4x, with 9.4x for low-precision convolution via the INT8 dot product; contributing factors annotated: memory B/W, 512-bit SIMD, combined gather, L1$ B/W, L2$ B/W, INT8 dot product.]

A64FX TofuD Overview

- Halved off-chip channels: power and cost reduction
- Increased communication resources: TNIs increased from four to six, more Tofu barrier resources
- Reduced communication latency: simplified multi-lane PCS
- Increased communication reliability: dynamic packet slicing (split and duplicate)

                                       Tofu (K computer)   Tofu2 (FX100)   TofuD
  Data rate (Gbps)                     6.25                25.78           28.05
  # of signal lanes per link           8                   4               2
  Link bandwidth (GB/s)                5.0                 12.5            6.8
  # of TNIs per node                   4                   4               6
  Injection bandwidth per node (GB/s)  20                  50              40.8

[Diagram: SPARC64 XIfx with Tofu2 (TNI0-3, network router with 4 lanes x 10 ports, HMC memory) vs. A64FX with the integrated TofuD (TNI0-5, network router with 2 lanes x 10 ports, HBM2 memory).]
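The injection-bandwidth row of the table is simply link bandwidth times the number of TNIs per node, which can be checked mechanically:

```python
# Verify the "injection bandwidth per node" row of the Tofu table:
# it equals link bandwidth (GB/s) x number of TNIs per node.
generations = {
    # name: (link_bw_gbs, tnis_per_node, quoted_injection_bw_gbs)
    "Tofu (K computer)": (5.0, 4, 20.0),
    "Tofu2":             (12.5, 4, 50.0),
    "TofuD":             (6.8, 6, 40.8),
}
for name, (link_bw, tnis, quoted) in generations.items():
    computed = link_bw * tnis
    assert abs(computed - quoted) < 1e-9, name
    print(f"{name}: {link_bw} GB/s x {tnis} TNIs = {quoted} GB/s")
```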

TofuD: Put Latencies, Throughput & Injection Rate

TofuD was evaluated on hardware emulators using the production RTL codes. Simulation model: system level, including multiple nodes.

Put latency by communication settings:
- Tofu: descriptor on main memory 1.15 µs; direct descriptor 0.91 µs
- Tofu2: cache injection OFF 0.87 µs; cache injection ON 0.71 µs
- TofuD: to/from far CMGs 0.54 µs; to/from near CMGs 0.49 µs

Put throughput / injection rate:
- Tofu: 4.76 GB/s (95%) / 15.0 GB/s (77%)
- Tofu2: 11.46 GB/s (92%) / 45.8 GB/s (92%)
- TofuD: 6.35 GB/s (93%) / 38.1 GB/s (93%)

Next Generation File System Design
- Next generation file system structure and design
- Next-gen 1st layer file system overview

K computer: Pre-Staging-In/Post-Staging-Out Method

Pros: stable application performance for jobs.
Cons:
- Requires three times the amount of storage that a job needs
- Pre-defining the file names for stage-in/out processing lacks usability
- Data-intensive applications degrade system utilization because jobs wait for pre-staging-in/out processing

[Diagram: computing node running the application on Linux, with stage-in/out to a loopback local file system using FEFS; job control node; global file system; login node; users.]

Next-Gen File System Requirements and Issues

Requirements:
- 10 times higher access performance
- 100 times larger file system capacity
- Lower power and smaller footprint

Issue: how do we realize 10 times faster and 100 times larger file access at the same time?

Next-Gen. File System Design

K computer file system design: how should we realize high speed and redundancy together? We introduced an integrated two-layered file system.

Next-gen. file system/storage design: additional trade-off targets are power, capacity and footprint. It is difficult to realize a single Exabyte-class, 10TB/s-class file system within the limited power consumption and footprint, so an additional third storage layer for capacity is needed.

[Diagram: the next integrated layered file system architecture for the post-peta-scale system (Feasibility Study 2012-2013). Compute nodes use application-specific object-based file systems or existing ext[34]/Lustre-based file systems for high application speed; a shared Lustre-based layer (/data) provides transparent data access and usability for thousands of users via the job scheduler and login server; an HSM, other shared FS, or grid/cloud-based layer provides high capacity, redundancy and interoperability with other organizations and systems.]

Next Gen. File System Design

Introducing three-level hierarchical storage:
- 1st level storage: accelerates application file I/O performance (local file system)
- 2nd level storage: shares data using a Lustre based file system (global file system)
- 3rd level storage: archive storage (archive system)

The 1st level storage is accessed both as a file cache of the global file system and as local storage. The file cache on computing node memory is used as well as the 1st level storage.

[Diagram: computing node (application, Linux, SSD based 1st level storage); job control node; global file system (Lustre based file system on the 2nd level storage); login node; users; archive storage as the 3rd level storage.]

Next Gen. Layered File System Requirements

Application views:
- Local file system: application-oriented file accesses (higher metadata & data I/O performance)
- Global file system: transparent file access
- Archive system: indirect access or transparent file access (HSM)

Transparent file access to the global file system is required because the local file system capacity is not large enough to hold the whole data of the global file system. A file cache on node memory plus the local file system makes it possible to accelerate application performance.

[Table comparing the local file system, global file system and archive system on metadata performance, data bandwidths, capacity, scalability, data sharing in a job, and data sharing among jobs.]

Next-Gen 1st Layer File System Overview

Goal: maximizing application file I/O performance.

Features:
- Easy access to user data: file cache of the global file system
- Higher data access performance: temporary local FS (in a process)
- Higher data sharing performance: temporary shared FS (among processes)

Now developing the LLIO (Lightweight Layered IO-Accelerator) prototype.

[Diagram: scalable compute cores issue I/O via assistant cores. The 1st level SSD holds file1 and file2 as local caches and file3 as a shared temporary cache (local file systems for jobs A and B); the 2nd level global file system (Lustre based) holds file3 and file4.]

LLIO Prototype Implementation

Two types of computing nodes:
- Burst Buffer Computing Node (BBCN): provides the burst buffer system function with an SSD device
- Computing Node (CN): runs the burst buffer client, which sends file access requests to the BBCN acting as the burst buffer server

[Diagram: a computing node cluster (arm/x86) in which CNs send I/O and metadata requests over the interconnect to BBCNs with SSDs; the BBCNs perform background data flushing and on-demand data staging to/from the 2nd layer file system.]

File Access Sequences Using LLIO (Cache Mode)

Metadata requests pass through to the 2nd layer: an open(file) from the application on the CN is forwarded by LLIO to the 2nd layer metadata server. A write(fd, buf, sz) goes through the LLIO client to the LLIO server on the BBCN, which writes the data to the local file system (LFS) on the SSD; background flushing later moves the data to the 2nd layer file system server (I/O server, HDD).
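The cache-mode flow above (metadata passed through, writes absorbed by the BBCN's SSD and flushed to the second layer in the background) can be sketched as a toy model; every class and method name here is illustrative, not LLIO's actual interface:

```python
# Toy model of the LLIO cache-mode write path: writes land in an
# SSD-backed cache on the burst buffer node (BBCN) and are flushed
# to the 2nd layer (Lustre based) file system in the background.
# All names are illustrative, not LLIO's real API.

class SecondLayerFS:
    """Stands in for the global (Lustre based) file system."""
    def __init__(self):
        self.files = {}                 # path -> bytes

    def open_meta(self, path):          # metadata requests pass through
        self.files.setdefault(path, b"")

    def flush(self, path, data):        # receives background-flushed data
        self.files[path] = data


class BurstBufferNode:
    """Stands in for a BBCN: caches writes on SSD, flushes later."""
    def __init__(self, backend):
        self.backend = backend
        self.ssd_cache = {}             # path -> bytes pending flush

    def open(self, path):
        self.backend.open_meta(path)    # open() goes to the 2nd layer meta server
        return path

    def write(self, path, buf):
        # write() is absorbed by the local file system on the SSD
        self.ssd_cache[path] = self.ssd_cache.get(path, b"") + buf

    def background_flush(self):
        # background flushing drains the SSD cache to the 2nd layer
        for path, data in self.ssd_cache.items():
            self.backend.flush(path, data)
        self.ssd_cache.clear()


gfs = SecondLayerFS()
bbcn = BurstBufferNode(gfs)
fd = bbcn.open("/gfs/file1")
bbcn.write(fd, b"checkpoint data")      # fast path: hits the SSD cache
bbcn.background_flush()                 # later: data reaches the global FS
print(gfs.files["/gfs/file1"])          # b'checkpoint data'
```

The design point this illustrates is that the application's write latency is decoupled from the second-layer file system: only the background flush touches the slower shared storage.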

LLIO Prototype I/O Performance

[Charts: write and read I/O bandwidth vs. the number of IOR streams, compared against the raw device performance.]

- Higher I/O performance than that of NFS or Lustre
- Evaluated on IA servers using an Intel P3608 SSD
- LLIO utilizes the maximum physical I/O device performance

Current Status of Lustre Based File System Development, etc.

Current Status of Lustre Based File System

Next-gen. Lustre based file system: FEFS
- Planning to develop on a Lustre 2.10.x base
- Now testing the Lustre 2.10.x based FEFS; several problems have been found
- Planning to fix the bugs and report the fixes

Contribution of DL-SNAP

What is DL-SNAP? Presented at LUG2016 and LAD16 (http://cdn.opensfs.org/wp-content/uploads/2016/04/lug2016d2_dl-snap_sumimoto.pdf).
- DL-SNAP is designed for user- and directory-level file backups
- Users can create a snapshot of a directory using the lfs command with the snapshot and create options, like a directory copy
- A user can create multiple snapshots of a directory and manage them, including merging snapshots
- DL-SNAP also supports quota to limit users' storage usage

Issue of contribution: we planned the DL-SNAP contribution in 2018, but we do not have enough human resources to port it to the latest Lustre version.

Our strategy for contributing DL-SNAP:
- We are ready to contribute our current DL-SNAP code for Lustre 2.6
- We will create an LU ticket for DL-SNAP (by the end of Oct. 2018)
- We need help porting DL-SNAP to the latest Lustre
