Toward portable I/O performance by leveraging system abstractions of deep memory and interconnect hierarchies

François Tessier, Venkatram Vishwanath, Paul Gressier
Argonne National Laboratory, USA
Wednesday 23rd August, 2017
Data Movement at Scale

Computational science simulations in scientific domains such as materials, high-energy physics, and engineering have large I/O needs. Typically around 10% to 20% of the wall time is spent in I/O.

Table: Example of I/O from large simulations

Scientific domain      Simulation    Data size
Cosmology              Q Continuum   2 PB / simulation
High-Energy Physics    Higgs Boson   10 PB / year
Climate / Weather      Hurricane     240 TB / simulation

There is an increasing disparity between computing power and I/O performance in the largest supercomputers.

[Figure: Ratio of I/O (TB/s) to Flops (TF/s), in percent, for the #1 system in the Top500, 1997-2017; the ratio declines steadily over the period.]
Complex Interconnect Hierarchies

On BG/Q, data movement needs to fully exploit the 5D torus topology for improved performance. Additionally, we need to exploit the placement of the I/O nodes. Cray supercomputers face similar challenges with their dragonfly-based interconnects and the placement of LNET nodes for I/O.

[Figure: Mira's I/O path. Compute nodes (PowerPC A2, 16 cores, 16 GB of DDR3) are grouped in Psets of 128 nodes on a 5D torus network (2 GBps per link); two bridge nodes per I/O node forward data over 4 GBps links through a QDR InfiniBand switch to I/O nodes running an I/O forwarding daemon and a GPFS client, which reach the GPFS storage.]

Mira: 49,152 nodes / 786,432 cores; 768 TB of memory; 27 PB of storage at 330 GB/s (GPFS); 5D torus network; peak performance: 10 PetaFLOPS.
Deep Memory Hierarchies and Filesystem Characteristics

We need to exploit the deep memory hierarchy tiers for improved performance. This includes effective ways to seamlessly use HBM, DRAM, NVRAM, burst buffers, etc.; one way to place a buffer in high-bandwidth memory is sketched after this slide. We also need to leverage filesystem-specific features such as OSTs and striping in Lustre, among others.

[Figure: Memory and network hierarchy of a Cray XC40. Each compute node has an Intel KNL 7250 (36 tiles of 2 cores with shared L2), 16 GB of MCDRAM (level 1), 192 GB of DDR4 (level 2), and a 128 GB local SSD. Four nodes connect to an Aries router; 96 routers per 2-cabinet group (16 x 6 routers) form a 2D all-to-all structure, and the 9 groups (18 cabinets) are linked all-to-all by the dragonfly network (electrical links: 14 GBps; optical links: 12.5 GBps). Service nodes (LNET, gateway; irregular mapping) reach the Lustre filesystem (level 3) over IB FDR at 56 Gbps.]
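As an illustration of seamlessly using one of these tiers, the sketch below places an aggregation buffer in KNL MCDRAM through the memkind library's hbwmalloc interface, falling back to DDR when no high-bandwidth memory is available. The buffer size and the fallback policy are illustrative assumptions, not part of TAPIOCA.

#include <hbwmalloc.h>   /* memkind's high-bandwidth-memory interface */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t n = (size_t)64 << 20;                 /* 64 MiB staging buffer */
    int on_hbm = (hbw_check_available() == 0);   /* 0 means HBM is present */
    char *buf = on_hbm ? hbw_malloc(n) : malloc(n);

    if (!buf) return 1;
    printf("buffer placed in %s\n", on_hbm ? "MCDRAM" : "DDR");

    /* ... fill buf and use it as an I/O staging area ... */

    if (on_hbm) hbw_free(buf);
    else        free(buf);
    return 0;
}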
TAPIOCA

Library based on the two-phase I/O scheme for topology-aware data aggregation at scale, on IBM BG/Q with GPFS and Cray XC40 with Lustre (Cluster'17).

Topology-aware aggregator placement taking into account:
- The topology of the architecture
- The data access pattern

Captures the data model and data layout to optimize the I/O scheduling.

Pipelining (RMA, non-blocking calls) of aggregation and I/O phases; a sketch of this pipelining follows.

Interconnect architecture abstraction.

[Figure: Processes 0-3 each hold X, Y, Z data. In successive rounds, RMA operations move data into aggregator buffers while the aggregator issues non-blocking MPI calls to write the previous round's buffer to the file.]
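The pipelining described above can be sketched with plain MPI: processes deposit each round's data into an RMA window exposed by the aggregator, while the aggregator flushes the previous round with a non-blocking file write. A minimal sketch, assuming a single aggregator (rank 0), two alternating buffers, and a buffer size evenly divided by the number of ranks; this is not TAPIOCA's actual code.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define BUF_SIZE (1 << 20)   /* one aggregation buffer: 1 MiB */
#define ROUNDS   4

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Two buffers exposed by the aggregator (rank 0), so that round r+1
     * can be aggregated while round r is still being written to the file. */
    char *agg[2];
    MPI_Win win[2];
    for (int b = 0; b < 2; b++)
        MPI_Win_allocate(rank == 0 ? BUF_SIZE : 0, 1, MPI_INFO_NULL,
                         MPI_COMM_WORLD, &agg[b], &win[b]);

    int chunk = BUF_SIZE / nprocs;        /* per-process share of a round */
    char *mydata = malloc(chunk);
    memset(mydata, rank, chunk);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Request wreq = MPI_REQUEST_NULL;
    for (int r = 0; r < ROUNDS; r++) {
        int b = r % 2;                    /* alternate aggregation buffers */

        /* Aggregation phase: every process PUTs its chunk into buffer b. */
        MPI_Win_fence(0, win[b]);
        MPI_Put(mydata, chunk, MPI_BYTE, 0, (MPI_Aint)rank * chunk,
                chunk, MPI_BYTE, win[b]);
        MPI_Win_fence(0, win[b]);

        /* I/O phase: the aggregator waits for the previous round's write,
         * then writes buffer b without blocking the next aggregation. */
        if (rank == 0) {
            MPI_Wait(&wreq, MPI_STATUS_IGNORE);
            MPI_File_iwrite_at(fh, (MPI_Offset)r * BUF_SIZE, agg[b],
                               BUF_SIZE, MPI_BYTE, &wreq);
        }
    }
    if (rank == 0) MPI_Wait(&wreq, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Win_free(&win[0]);
    MPI_Win_free(&win[1]);
    free(mydata);
    MPI_Finalize();
    return 0;
}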
Abstractions for Interconnect Topology

Topology characteristics include:
- Spatial coordinates
- Distance between nodes: number of hops, routing policy
- I/O node locations, depending on the filesystem (bridge nodes, LNET, ...)
- Network performance: latency, bandwidth

We also need to model some unknowns and uncertainties, such as routing and contention. A minimal sketch of such an abstraction follows.

Figure: 5D torus on BG/Q and intra-chassis dragonfly network on Cray XC30 (Credit: LLNL / LBNL)
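One concrete form such an abstraction can take is a small descriptor holding the characteristics listed above, plus a distance metric. The sketch below models hop distance on an n-dimensional torus with wraparound; the struct and function names are ours for illustration, not TAPIOCA's actual API.

#include <stdlib.h>

#define MAX_DIMS 5   /* e.g., the BG/Q 5D torus */

typedef struct {
    int ndims;
    int size[MAX_DIMS];    /* torus extent in each dimension */
    double latency;        /* link latency l                 */
    double bandwidth;      /* per-link bandwidth             */
} topology_t;

/* Number of hops between two nodes, taking torus wraparound into account. */
int hop_distance(const topology_t *t,
                 const int a[MAX_DIMS], const int b[MAX_DIMS])
{
    int hops = 0;
    for (int d = 0; d < t->ndims; d++) {
        int dist = abs(a[d] - b[d]);
        int wrap = t->size[d] - dist;   /* going the other way around */
        hops += (dist < wrap) ? dist : wrap;
    }
    return hops;
}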
Abstractions for Interconnect Topology - Our current approach

TAPIOCA features a topology-aware aggregator placement. This approach is based on quantitative information that is easy to gather: latency, bandwidth, and distance between nodes.

- ω(u, v): amount of data exchanged between nodes u and v
- d(u, v): number of hops from node u to node v
- l: the interconnect latency
- B_{i→j}: the bandwidth from node i to node j

$$C_1 = \sum_{i \in V_C,\, i \neq A} \left( l \times d(i, A) + \frac{\omega(i, A)}{B_{i \rightarrow A}} \right)$$

$$C_2 = l \times d(A, IO) + \frac{\omega(A, IO)}{B_{A \rightarrow IO}}$$

$$TopoAware(A) = \min_{A \in V_C} \left( C_1 + C_2 \right)$$

where V_C is the set of compute nodes, IO the I/O node, and A the aggregator; C_1 is the cost of aggregating the data on A and C_2 the cost of moving the aggregated data from A to the I/O node. A C sketch of this cost function follows.

The algorithm is contention-aware: it copes with static and dynamic routing policies, as well as with information vendors do not disclose, such as the routing policy or the data distribution among I/O nodes.
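The sketch below implements the objective function above, reusing topology_t and hop_distance from the previous sketch. For simplicity it assumes a single uniform bandwidth in place of the per-pair terms B_{i→A} and B_{A→IO}; omega[] and the array layout are illustrative placeholders.

/* C1: cost of gathering data from all compute nodes onto candidate
 * aggregator A; C2: cost of shipping the aggregated data to the I/O node. */
double placement_cost(const topology_t *t,
                      int coords[][MAX_DIMS],  /* node coordinates         */
                      const double *omega,     /* omega(i, A): bytes from i */
                      int nnodes, int A, int io_node)
{
    double c1 = 0.0;
    for (int i = 0; i < nnodes; i++) {
        if (i == A) continue;
        c1 += t->latency * hop_distance(t, coords[i], coords[A])
            + omega[i] / t->bandwidth;          /* omega(i,A) / B_{i->A} */
    }

    double total = 0.0;                         /* omega(A, IO) */
    for (int i = 0; i < nnodes; i++) total += omega[i];

    double c2 = t->latency * hop_distance(t, coords[A], coords[io_node])
              + total / t->bandwidth;           /* omega(A,IO) / B_{A->IO} */
    return c1 + c2;
}

/* TopoAware: pick the aggregator A in V_C minimizing C1 + C2. */
int topo_aware_placement(const topology_t *t, int coords[][MAX_DIMS],
                         const double *omega, int nnodes, int io_node)
{
    int best = 0;
    double best_cost = placement_cost(t, coords, omega, nnodes, 0, io_node);
    for (int A = 1; A < nnodes; A++) {
        double c = placement_cost(t, coords, omega, nnodes, A, io_node);
        if (c < best_cost) { best_cost = c; best = A; }
    }
    return best;
}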
TAPIOCA outperforms MPI I/O on the I/O kernel of HACC, with two data layouts, on a Cray XC40 + Lustre and a BG/Q + GPFS.

HACC: large-scale simulation of the mass evolution of the universe with particle-mesh techniques (a particle is defined by 9 variables). 1024 nodes, 16 ranks per node.

Best PFS configuration for MPI I/O:
- Lustre: 48 OSTs, 8 MB stripe size, 192 aggregators
- GPFS: 16 aggregators per Pset (128 aggregators), 16 MB buffer size

[Figure: Array-of-structures data layout. I/O bandwidth (GBps) vs. data size per process (0.5-4 MB) for TAPIOCA and MPI-IO, reads and writes. (a) Cray XC40 + Lustre: up to 2.7x faster reads and 3.8x faster writes. (b) BG/Q + GPFS: up to 6.7x faster reads and 10.6x faster writes.]
TAPIOCA outperforms MPI I/O on the I/O kernel of HACC, with two data layouts, on a Cray XC40 + Lustre and a BG/Q + GPFS.

Same HACC configuration as on the previous slide: 1024 nodes, 16 ranks per node, and the best PFS configuration for MPI I/O (Lustre: 48 OSTs, 8 MB stripe size, 192 aggregators; GPFS: 16 aggregators per Pset, i.e., 128 aggregators, 16 MB buffer size).

[Figure: Structure-of-arrays data layout. I/O bandwidth (GBps) vs. data size per process (0.5-4 MB) for TAPIOCA and MPI-IO, reads and writes. (a) Cray XC40 + Lustre: up to 2.3x faster reads and 3.6x faster writes. (b) BG/Q + GPFS: up to 1.3x faster reads and 1.2x faster writes.]
TAPIOCA - Ongoing research

Move toward a generic data-movement library for data-intensive applications, exploiting deep memory/storage hierarchies as well as the interconnect, to facilitate I/O, in-transit analysis, data transformation, data/code coupling, workflows, ...

[Figure: Data path from application to target. Application processes P0-P3 hold X, Y, Z data in memory (DRAM, MCDRAM, ...); the data moves over the network (dragonfly, torus, ...) to aggregators buffering in DRAM, MCDRAM, NVRAM, burst buffers, ...; a second network stage delivers the reorganized data to the target tier (DRAM, MCDRAM, NVRAM, PFS, burst buffer, ...).]
Abstractions for Memory and Storage

- Topology characteristics, including spatial location and distance
- Performance characteristics: bandwidth, latency, capacity
- Access characteristics, such as byte/block-based access and concurrency
- Persistence

Application → Memory API (alloc, store, load, free, ...) → Abstraction layer (mmap, pmem, ...) → DRAM / HBM / NVRAM / PFS / BB

Listing 1: Function prototypes for memory/storage data movements

void memAlloc (void* buff, int64_t buffSize, mem_t mem);
void memFree  (void* buff, mem_t mem);
int  mem{Write,Store} (void* srcBuffer, int64_t srcSize,
                       void* destBuffer, mem_t mem, int64_t offset);
int  mem{Read,Load}   (void* srcBuffer, int64_t srcSize,
                       void* destBuffer, mem_t mem, int64_t offset);
void memFlush (void* buff, mem_t mem);

Work in progress with open questions:
- Blurring boundary between memory and storage (MCDRAM, 3D XPoint memory, ...)
- Some data movements need one or more processes involved at the destination (RMA window, flushing thread, ...)
- Scope of memory/storage tiers (PFS vs. node-local SSD)
- Data partitioning to take advantage of fast memories with smaller capacities
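To make the intended call sequence of Listing 1 concrete, the sketch below stages a block of data on node-local NVRAM and flushes it for persistence. The mem_t tier identifiers and the pointer semantics of memAlloc (taken here as void** so the allocation can be returned) are assumptions for illustration, not the released API.

#include <stdint.h>

typedef enum { DDR, HBM, NVR, BB } mem_t;   /* assumed tier identifiers */

/* Prototypes as interpreted from Listing 1 (assumed, see above). */
void memAlloc(void **buff, int64_t buffSize, mem_t mem);
void memFree (void  *buff, mem_t mem);
int  memWrite(void *srcBuffer, int64_t srcSize,
              void *destBuffer, mem_t mem, int64_t offset);
void memFlush(void *buff, mem_t mem);

void stage_block(void *data, int64_t size)
{
    void *staging;
    memAlloc(&staging, size, NVR);         /* place the buffer on NVRAM  */
    memWrite(data, size, staging, NVR, 0); /* copy the block at offset 0 */
    memFlush(staging, NVR);                /* make the data persistent   */
    memFree(staging, NVR);
}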
Conclusion

TAPIOCA, an optimized data-movement library incorporating:
- Topology-aware aggregator placement
- Optimized data movement with I/O scheduling and pipelining
- A hardware abstraction ensuring performance portability

Performance portability on two leadership-class supercomputers, Mira (IBM BG/Q + GPFS) and Theta (Cray XC40 + Lustre):
- Same application code running on both platforms
- Same optimization algorithms, using an interconnect abstraction

Promising preliminary results with the memory/storage abstraction, though an appropriate level of abstraction is hard to define:
- A specific abstraction for every system, covering the architecture, the filesystems, every phase of deployment, relevant software versions, ...
- A generalized abstraction that maps to current and expected future deep memory hierarchies and interconnects

Future work: devise a model that helps make smart decisions about data movement.

Acknowledgments: Argonne Leadership Computing Facility at Argonne National Laboratory; DOE Office of Science, ASCR; Proactive Data Containers (PDC) project.
Thank you for your attention!
ftessier@anl.gov
MPI-IO and TAPIOCA - Data layout

[Figure: Comparison of aggregation with MPI I/O vs. TAPIOCA for two processes, P0 and P1, each holding X, Y, Z data. MPI I/O's aggregators reorder the data by file offset, grouping blocks of the same variable, whereas TAPIOCA's aggregation takes the data layout into account, preserving the X Y Z ordering from the aggregators into the file.]