THOUGHTS ABOUT THE FUTURE OF I/O

1 THOUGHTS ABOUT THE FUTURE OF I/O
Dagstuhl Seminar: Challenges and Opportunities of User-Level File Systems for HPC
Franz-Josef Pfreundt, May 2017
- Deep Learning I/O Challenges
- Memory Centric Computing: The Machine
- Low latency Non Volatile Memory
Fraunhofer ITWM 2017 pfreundt

2 Fraunhofer ITWM HPC Department
Research and Development of Industry Applications
- Parallel programming models
- Parallel file system
- Large scale visualization
- IoT in the energy market
- Development of parallel industry applications
- Performance Engineering

3 The one slide about BeeGFS
- Scalable IOPS
- Excellent N:1 performance, shared file I/O
[Chart: IOPS (random 4k writes), up to 20 servers, 160 client procs; IOPS vs. # storage servers]
[Chart: Sequential I/O, 1 shared file, 600k block size, up to 20 servers, 192 client procs; write and read in MB/s vs. # servers]
- Server components are in user space, client in kernel space
- x86, ARM, POWER
- Low latency implementation
- No dependency on Linux kernel or Linux distribution; any local FS (ZFS, EXT, XFS, BTRFS, tmpfs)
- Very efficient multithreaded implementation -> hyperconverged solution

4 The slide about BeeGFS on Demand - BeeOND = burst buffer
Cray CS400 at Alfred Wegener Institut
- Broadwell CPU
- Omni-Path interconnect
- 0.5 TB SSD in each node

BeeOND IOR 50TB    Stripe size 1, local    Stripe 4      Stripe size 1, any
308 nodes write    160 GB/sec              161 GB/sec    160 GB/sec
308 nodes read     167 GB/sec              164 GB/sec    167 GB/sec

TSUBAME 3.0 plans to run the compute-node (CN) attached NVMe with BeeOND on 1 PByte of NVMe
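The aggregate IOR figures above can be put into per-node terms. The following is a minimal sketch that only divides the numbers from the slide by the node count; the function and variable names are illustrative and not taken from the talk, and 1 GB is assumed to be 1000 MB.

```python
# Aggregate BeeOND IOR bandwidth from the slide, in GB/s, measured on 308 nodes.
NODES = 308
aggregate_gb_s = {
    "stripe size 1, local": {"write": 160, "read": 167},
    "stripe 4": {"write": 161, "read": 164},
    "stripe size 1, any": {"write": 160, "read": 167},
}


def per_node_mb_s(aggregate_gb_per_s: float, nodes: int = NODES) -> float:
    """Average bandwidth contributed by each node, in MB/s (assumes 1 GB = 1000 MB)."""
    return aggregate_gb_per_s * 1000 / nodes


for layout, rates in aggregate_gb_s.items():
    write = per_node_mb_s(rates["write"])
    read = per_node_mb_s(rates["read"])
    print(f"{layout}: write {write:.0f} MB/s per node, read {read:.0f} MB/s per node")
```

Under these assumptions every layout works out to roughly 520 to 540 MB/s per node, so the stripe setting makes little difference to the average node's share.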

5 Deep Learning I/O Challenge (example: ImageNet)
- Single node I/O into a Lustre PFS (single GPU, FDR IB)
- IBM Minsky: 4 P100, NVLink
- Needs multiple SSDs in RAID 0 to allow scalability across 4 GPUs
- During training the data has to be read 100 times (120 million file reads using standard Caffe)
YFCC100m - a new public data set for multimedia research (2D, 2D+time)
- By Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, Li-Jia Li; Communications of the ACM, Vol. 59 No. 2
- 99,206,564 photos and 793,436 videos from 581,099 different photographers
- Size: about 15 TByte

6 Our Solution: BeeOND
- Build a temporary parallel file system across nodes (on demand, per user) using BeeOND
- Every compute node can become an MDS
- Combine all files into one large binary with fixed offsets
- Or: keep everything in memory (rewrite the I/O layer in the DL framework)
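The idea of combining many small files into one large binary with fixed offsets can be sketched as follows. This is one possible interpretation, not the layout used in the talk: every sample is stored in a slot of equal size, so the offset of sample i is simply i times the slot size. The 4-byte length header, the slot size and all names are assumptions made for illustration.

```python
import os
import struct
import tempfile

HEADER = struct.Struct("<I")  # hypothetical header: 4-byte little-endian payload length


def pack(samples, path, slot_size):
    """Write every sample into its own fixed-size slot of a single file."""
    with open(path, "wb") as out:
        for data in samples:
            if HEADER.size + len(data) > slot_size:
                raise ValueError("sample does not fit into a slot")
            record = HEADER.pack(len(data)) + data
            out.write(record.ljust(slot_size, b"\0"))  # pad the slot with zero bytes


def read_sample(path, index, slot_size):
    """Fetch sample `index` with one seek and one read."""
    with open(path, "rb") as f:
        f.seek(index * slot_size)  # the offset is computed, not looked up
        slot = f.read(slot_size)
    (length,) = HEADER.unpack_from(slot)
    return slot[HEADER.size:HEADER.size + length]


samples = [b"cat-image-bytes", b"dog", b"a-longer-bird-image"]
with tempfile.TemporaryDirectory() as tmp:
    blob = os.path.join(tmp, "train.bin")
    pack(samples, blob, slot_size=64)
    restored = [read_sample(blob, i, 64) for i in range(len(samples))]
    blob_size = os.path.getsize(blob)
```

The trade-off in this sketch is wasted padding in exchange for replacing millions of per-file open and metadata operations with a single open file and computed offsets.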

7 Data Management: One key problem in parallel computing
- It's too complicated for application developers
- We usually see nasty I/O patterns and bad performance
Memory and storage hierarchy (diagram):
- CPU
- Caches
- High Bandwidth Memory
- DRAM
- High bandwidth, low latency communication
- Non Volatile Memory, µs latency
- Flash Storage
- Spinning Discs, Tape
- (Parallel) File I/O

8 GPI-2: Global Address Space Communication Interface
- Partitioned global address space
- Explicit one-sided communication with notification
- Every thread can communicate
- Multiple memory segments, zero copy data transfer
- Standardized API (GASPI)
- Developed and used at Fraunhofer since 2006
- Complete replacement for MPI in industry applications
- GPLv3

9 GPI: Global Address Space Programming Interface
1) Map pinned memory in a global address space
2) Hide latency by asynchronous one-sided communication: RDMA
3) Every CPU core can communicate and does not spend cycles on communication

10 GPI-Space: Our approach to memory centric computing (2009)
1. Memory virtualization using GPI (diagram: Virtual Global Memory, Interconnect)
- Application independent memory space
- Can keep data without an application running
- Applications are local to a node - Tasks
- Data exchange between application and VMEM through a shared memory segment
- Data transfer between nodes with GPI
- Allows coupling tasks written in different languages
2. Concurrency and task management
- Carl Adam Petri, 1962: description language for asynchronous and concurrent systems
- In order to add resources to running jobs
- Simple, graphical representation
- (Physical) properties:
  - locality (no global state)
  - concurrency (no total order given, just data dependencies)
  - reversibility (calculate cause from effect)
  - based on states, not events (separate activation from execution)
- Plus some extensions (named ports, type safety, expressions, ...)
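The Petri net properties listed above (no global state, ordering only through data dependencies, a transition is activated by the state of its input places) can be illustrated with a very small token-based model. This is a generic textbook-style sketch, not GPI-Space's workflow format or API; the class, place and transition names are invented for the example.

```python
from collections import Counter


class PetriNet:
    """Minimal place/transition net: places hold tokens, transitions move them."""

    def __init__(self, marking):
        self.marking = Counter(marking)  # place name -> number of tokens
        self.transitions = {}            # transition name -> (inputs, outputs)

    def add_transition(self, name, inputs, outputs):
        self.transitions[name] = (Counter(inputs), Counter(outputs))

    def enabled(self):
        """Transitions whose input places all hold enough tokens."""
        return [name for name, (ins, _) in self.transitions.items()
                if all(self.marking[place] >= n for place, n in ins.items())]

    def fire(self, name):
        ins, outs = self.transitions[name]
        if any(self.marking[place] < n for place, n in ins.items()):
            raise RuntimeError(f"{name} is not enabled")
        self.marking.subtract(ins)  # consume input tokens
        self.marking.update(outs)   # produce output tokens


net = PetriNet({"raw_a": 1, "raw_b": 1})
net.add_transition("load_a", ["raw_a"], ["loaded_a"])
net.add_transition("load_b", ["raw_b"], ["loaded_b"])
net.add_transition("merge", ["loaded_a", "loaded_b"], ["result"])

initially_enabled = net.enabled()  # both loads are enabled, merge is not
net.fire("load_b")                 # the two loads may fire in either order
net.fire("load_a")
net.fire("merge")                  # enabled only once both inputs exist
```

The two load transitions have no ordering between them, while merge is constrained purely by the tokens it needs, which is the "no total order, just data dependencies" property from the slide.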

11 GPI-Space is becoming a distributed OS
3. DRTS: Distributed Runtime System
- Debugging by on-the-fly modification of the Petri net
- Step
- Failure tolerant
- JIT compilation and execution of the Petri net
- Resources have capabilities: GPU, CPU, I/O
- Coscheduling of multi-node tasks (MPI)
- Preemptive scheduling of data transfers (if the information is provided by the task)

12 GPI-Space + Domain Knowledge
- Compile high level workflows into Petri nets
- Dataflow model in seismic data processing

13 Example SPLOTCH: Visualization in Astrophysics
- LRZ Munich
- MPI program with problems
- Rewrite in GPI-Space in 3 weeks
- Throughput: time to solution 10x

14 Deep Learning on demand
The development of new DNNs requires a lot of test runs - how can I do this cheaply?
- Auto scaling
- Fail-safe: auto recovery, restarting
- Exploit the AWS spot market
- Automatic meta-parameter search
- Automatic data management
- Supports original DL model descriptors, e.g. Caffe & TensorFlow
- Arbitrary hardware nodes: GPU, CPU
- Developed in a few weeks

15 Deep Learning on Demand - Architecture

16 Data Management: One key problem in parallel computing
- It's too complicated for application developers
- We usually see nasty I/O patterns and bad performance
Memory and storage hierarchy (diagram):
- CPU
- Caches
- High Bandwidth Memory
- DRAM
- High bandwidth, low latency communication
- Non Volatile Memory, µs latency
- Flash Storage
- Spinning Discs, Tape
- (Parallel) File I/O

17 Directory/Cache API to support VMEM
Multilevel Abstract Data Representation
Diagram levels: Allocation and global range; Logical segment; Physical segment (labels: Server knowledge, Segment knowledge)
- Physical segment: segment type and hardware dependent distribution
- Logical segment: linear view on physical segment
- Allocation: linear view on (distributed) part(s) of a segment
- Global range: subrange of an allocation
The directory/cache unifies access to segments and abstracts distributed hardware:
- no knowledge about data dependencies
- no knowledge about the runtime system behavior
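The four levels defined above can be modelled in a few lines to show how a global range is translated down to pieces on individual nodes. This is a hypothetical sketch and not the Directory/Cache API: the class names mirror the slide's terms, but the fields, the block distribution (each node holds an equal contiguous share) and the `resolve` method are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicalSegment:
    """Assumed block distribution: every node holds `bytes_per_node` bytes."""
    nodes: int
    bytes_per_node: int


@dataclass(frozen=True)
class Allocation:
    """Linear view on a part of a segment, addressed in logical segment offsets."""
    segment: PhysicalSegment
    offset: int
    size: int


@dataclass(frozen=True)
class GlobalRange:
    """Subrange of an allocation."""
    allocation: Allocation
    offset: int
    size: int

    def resolve(self):
        """Translate the range into (node, local_offset, length) pieces."""
        alloc = self.allocation
        if self.offset < 0 or self.offset + self.size > alloc.size:
            raise ValueError("range exceeds allocation")
        pos = alloc.offset + self.offset  # position in the logical (linear) segment
        remaining = self.size
        per_node = alloc.segment.bytes_per_node
        pieces = []
        while remaining > 0:
            node, local = divmod(pos, per_node)  # logical -> physical mapping
            length = min(remaining, per_node - local)
            pieces.append((node, local, length))
            pos += length
            remaining -= length
        return pieces


segment = PhysicalSegment(nodes=4, bytes_per_node=1024)
alloc = Allocation(segment, offset=512, size=2048)
pieces = GlobalRange(alloc, offset=256, size=1024).resolve()
```

In this example the range starts 768 bytes into node 0 and spills over into node 1, so `pieces` holds two entries; the caller only ever sees linear offsets.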

18 VMEM Directory/Cache: Client Server Architecture
- The original data is stored in one or more segments across several nodes.
- Copies of global memory regions are stored in local caches.
- A local server may create and manage multiple local caches.
- Multiple clients may share local caches.
- External programs can connect to an already running directory/cache service.
- Tolerant to client failures (provided the clients are started in different processes)

19 Goal: Support the task based runtimes with an open source implementation
Stack (diagram):
- OmpSs Runtime, StarPU Runtime, GPI-Space
- Directory/Cache API
- GASPI Segment, MPI Segment, BeeGFS/BeeOND Segment
Features:
- Keep some data in non-volatile memory
- Automate data transfer from storage to memory
- The API provides functions that may be used for taking scheduling decisions:
  - transfer costs associated with a list of operations
  - data locality information
The VMEM will become non-volatile and data will survive the application
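How a task runtime might use transfer costs and locality information for a scheduling decision can be sketched as below. None of these functions belong to the Directory/Cache API; the cost model (bytes that are not yet local to a node) and all names and sizes are assumptions for illustration.

```python
def transfer_cost(node, inputs, location, sizes):
    """Bytes that must be moved to `node` so that all `inputs` are local (assumed cost model)."""
    return sum(sizes[name] for name in inputs if node not in location[name])


def cheapest_node(nodes, inputs, location, sizes):
    """Pick the node with the lowest transfer cost for the given inputs."""
    return min(nodes, key=lambda n: transfer_cost(n, inputs, location, sizes))


# Hypothetical locality information: which nodes hold a copy of each data range.
sizes = {"velocity_model": 8_000, "shot_gather": 2_000, "config": 10}
location = {
    "velocity_model": {"node1"},
    "shot_gather": {"node0", "node2"},
    "config": {"node0", "node1", "node2"},
}
nodes = ["node0", "node1", "node2"]
task_inputs = ["velocity_model", "shot_gather", "config"]

best = cheapest_node(nodes, task_inputs, location, sizes)
```

Here the task lands on the node that already holds the largest input, so only the smaller range has to be transferred.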

20 Moving on to byte addressable SCM
Diagram labels: Legacy Code (MPI); GPI-Space Task World; App (x7); POSIX I/O; BeeGFS Client; MD Server (x2); Key-Value Store; VMEM/API; Storage Server (x2); Translate POSIX into Memory Operation; PC Cluster, FDR IB; SCM, PGAS; Object Storage

21 Questions?
- Our plan until 2021
- J. Keuper at Rice O&G Conference: "Scaling Deep Learning Applications"
