I/O and Scheduling aspects in DEEP-EST


I/O and Scheduling aspects in DEEP-EST. Norbert Eicker, Jülich Supercomputing Centre & University of Wuppertal. The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreements no. 287530 and no. 610476.

Collaborative R&D in the DEEP projects (DEEP + DEEP-ER + DEEP-EST): EU Exascale projects with 27 partners, total budget 44 M€, EU funding 30 M€, running Nov 2011 to Jun 2020. www.deep-projects.eu

EIUG Workshop Hamburg, September 2017.

Science Campus Jülich: 5,700 staff members. Budget (2015): €558 million; institutional funding: €320 million; third-party funding: €238 million; project management: €1.6 billion. Teaching: ~900 PhD students (Campus Jülich), ~350 trainees. Research for the future: key technologies of the next generation.

Past: supercomputer evolution @ JSC
IBM Power 4+ JUMP, 9 TFlop/s
IBM Blue Gene/L JUBL, 45 TFlop/s
IBM Power 6 JUMP, 9 TFlop/s
IBM Blue Gene/P JUGENE, 1 PFlop/s
JUROPA, 200 TFlop/s + HPC-FF, 100 TFlop/s
IBM Blue Gene/Q JUQUEEN, 5.9 PFlop/s
JURECA, 2 PFlop/s + Booster, ~5 PFlop/s
JUQUEEN successor: Modular System, Highly Scalable
File servers: Lustre, GPFS

Research field usage 11/2015 - 04/2017 for the Leadership-Class System and the General-Purpose Supercomputer (granting periods 05/2016 - 04/2017 and 11/2015 - 10/2016).

Applications' scalability: only a few applications are capable of scaling to O(450k) cores, namely sparse matrix-vector codes with highly regular communication patterns, well suited for BG/Q. Most applications are more complex: less regular control flow and memory access, complicated communication patterns, and less ability to exploit accelerators. How do we map these different requirements to the most suitable hardware? Heterogeneity might be beneficial. Do we need better programming models?

Heterogeneous clusters: flat InfiniBand topology and simple management of resources, but static assignment of CPUs to GPUs; the accelerators are not capable of acting autonomously.

Alternative integration: go for more capable accelerators (e.g. MIC) and attach all nodes to a low-latency fabric, so that all nodes can act autonomously. This enables dynamic assignment of cluster nodes and accelerators. InfiniBand can be assumed to be as fast as PCIe, except for latency. The ability to off-load more complex (including parallel) kernels means communication between CPU and accelerator becomes less frequent and uses larger messages, i.e. it is less sensitive to latency.

Hardware Architecture

CyI Climate Simulation

Application-driven approach. DEEP + DEEP-ER applications:
Brain simulation (EPFL)
Space weather simulation (KU Leuven)
Climate simulation (Cyprus Institute)
Computational fluid engineering (CERFACS)
High-temperature superconductivity (CINECA)
Seismic imaging (CGG)
Human exposure to electromagnetic fields (INRIA)
Geoscience (LRZ Munich)
Radio astronomy (Astron)
Oil exploration (BSC)
Lattice QCD (University of Regensburg)
Goals: co-design and evaluation of the architecture and its programmability; analysis of the I/O and resiliency requirements of HPC codes.

Software Architecture

Application startup: the application's main() part runs on Cluster nodes only, while the highly scalable code parts (HSCP) utilize multiple Booster nodes (BN). The actual spawn is done via global MPI and is a collective operation of the Cluster processes; OmpSs acts as an abstraction layer on top of the global MPI resource management.

MPI process creation: the inter-communicator contains all parents on one side and all children on the other side. It is returned by MPI_Comm_spawn for the parents and by MPI_Comm_get_parent for the children. Rank numbers are the same as in the corresponding intra-communicators, i.e. the Cluster's MPI_COMM_WORLD (A) and the Booster's MPI_COMM_WORLD (B).
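As a minimal sketch of this mechanism (not the DEEP runtime itself, which hides the spawn behind OmpSs), a Cluster-side program could create Booster processes and obtain the inter-communicator roughly as follows; the number of spawned processes is an arbitrary example value:

#include <mpi.h>
#include <stdio.h>

/* Hedged sketch: collective spawn of highly scalable code parts (HSCP)
 * and retrieval of the inter-communicator, as described on the slide.
 * The program spawns itself; in the DEEP software stack this step is
 * performed by the runtime, not hand-written like this. */
int main(int argc, char **argv)
{
    MPI_Comm parent, intercomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* Cluster side (MPI_COMM_WORLD (A)): all processes call
         * MPI_Comm_spawn collectively; 4 children is an example value. */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 4, MPI_INFO_NULL, 0,
                       MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
        /* intercomm: local group = parents (Cluster),
         *            remote group = children (Booster) */
    } else {
        /* Booster side (MPI_COMM_WORLD (B)): ranks here equal the ranks
         * the parents see in the remote group of the inter-communicator. */
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("spawned HSCP process, rank %d\n", rank);
    }

    MPI_Finalize();
    return 0;
}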

Programming the Booster interface: ParaStation global MPI spans Cluster (InfiniBand) and Booster (Extoll) via the Cluster-Booster protocol; Booster processes are created with MPI_Comm_spawn.

Application running on DEEP: the source code is annotated with OmpSs directives such as #pragma omp task in(...) out(...) onto(com, size*rank+1); the compiler generates the application binaries, which are executed by the DEEP runtime.

Modular Supercomputing: a generalization of the Cluster-Booster concept. Module 1: Storage; Module 2: Cluster; Module 3: Many-core Booster (BN); Module 4: Memory Booster (Network Attached Memory, NAM); Module 5: Data Analytics; Module 6: Graphics Booster (GN).

Modular Supercomputing: different workloads (Workload 1, 2, 3) map onto different combinations of modules.

Modular Supercomputing

SLURM: JSC decided to go for SLURM with the start of JURECA in 2015, in close collaboration with ParTec for deep integration with PS-MPI. Currently waiting for job packs. We expect to extend the scheduling capabilities to support complex job requirements, e.g. workflows without communication via the filesystem.

Scalable I/O in DEEP-ER: improve I/O scalability on all usage levels; also used for checkpointing.

BeeGFS and caching: two instances, a global file system on HDD servers and a cache file system on node-local NVM, plus an API for cache domain handling in a synchronous and an asynchronous version.

Asynchronous Cache API
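To illustrate the intended usage pattern of such a cache API, here is a minimal sketch. The function names cache_flush_range() and cache_flush_wait() are hypothetical placeholders standing in for the real DEEP-ER cache-domain calls, which are not reproduced here:

#include <stdio.h>

/* Hypothetical prototypes, declared only to keep the sketch self-contained;
 * the actual DEEP-ER cache API provides equivalent (but differently named)
 * synchronous and asynchronous calls. */
int cache_flush_range(const char *path, long offset, long size); /* async flush */
int cache_flush_wait(const char *path);                          /* wait for it */

void checkpoint_phase(const char *cache_path, const void *buf, size_t len)
{
    /* 1. Fast write into the cache FS on node-local NVM */
    FILE *f = fopen(cache_path, "wb");
    fwrite(buf, 1, len, f);
    fclose(f);

    /* 2. Trigger an asynchronous flush towards the global BeeGFS instance */
    cache_flush_range(cache_path, 0, (long)len);

    /* 3. ... continue computing while the flush proceeds in the background ... */

    /* 4. Block only when the data must be visible in the global file system */
    cache_flush_wait(cache_path);
}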

Cache integration in ROMIO: new MPI-IO hints, developed in DEEP-ER and tested on the DEEP Cluster: e10_cache, e10_cache_path, e10_cache_flush_flag, e10_cache_discard_flag, e10_cache_threads. The DEEP-ER cache is implemented as a common layer underneath ROMIO's ADIO (Abstract Device IO) drivers (Lustre, GPFS, UFS, BeeGFS), so both collective I/O (with its global sync group over MPI_COMM_WORLD, aggregators and collective buffers) and independent I/O pass through the cache before reaching the parallel file system.
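The hints above would typically be set through an MPI_Info object at file-open time. A hedged sketch follows; the hint names are taken from the slide, but the hint values and the cache path are assumptions rather than documented E10 settings:

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_File fh;
    MPI_Info info;

    MPI_Init(&argc, &argv);
    MPI_Info_create(&info);
    MPI_Info_set(info, "e10_cache", "enable");            /* assumed value */
    MPI_Info_set(info, "e10_cache_path", "/nvme/cache");  /* assumed path  */
    MPI_Info_set(info, "e10_cache_flush_flag", "flush");  /* assumed value */
    MPI_Info_set(info, "e10_cache_discard_flag", "discard"); /* assumed    */
    MPI_Info_set(info, "e10_cache_threads", "1");         /* assumed value */

    /* The cache layer sits below the ADIO driver, so the application only
     * sees regular MPI-IO calls with extra hints. */
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    /* ... collective writes via MPI_File_write_all() ... */
    MPI_File_close(&fh);

    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}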

MPIWRAP support library: intercepts MPI_Init, MPI_Finalize, MPI_File_open and MPI_File_close. MPI-IO hints are defined in a config file and injected by libmpiwrap into the middleware. This gives users deeper and more flexible control of MPI-IO functionality, provides transparent integration of E10 functionality into applications, and works with any high-level library (e.g. pHDF5).
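The interception itself can be pictured with the standard PMPI profiling interface. The sketch below is not libmpiwrap's actual implementation: it reads a single assumed environment variable instead of parsing a config file, and only augments MPI_File_open:

#include <mpi.h>
#include <stdlib.h>

/* Hedged sketch of a hint-injecting wrapper in the spirit of libmpiwrap,
 * built as a shared library and preloaded in front of the MPI library. */
int MPI_File_open(MPI_Comm comm, const char *filename, int amode,
                  MPI_Info info, MPI_File *fh)
{
    MPI_Info my_info;
    const char *path = getenv("E10_CACHE_PATH");  /* assumed variable name */

    if (info == MPI_INFO_NULL)
        MPI_Info_create(&my_info);
    else
        MPI_Info_dup(info, &my_info);

    MPI_Info_set(my_info, "e10_cache", "enable");  /* hint names from slide */
    if (path)
        MPI_Info_set(my_info, "e10_cache_path", path);

    /* Forward to the real implementation with the augmented hints */
    int ret = PMPI_File_open(comm, filename, amode, my_info, fh);
    MPI_Info_free(&my_info);
    return ret;
}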

Effects of the cache. Notation: S(k) is the amount of data written to the file at phase k, Tc(k) the time to write S(k) to the cache, Ts(k) the time to sync S(k) with the parallel file system, and C(k) the compute time at phase k.
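One way to make the intended overlap explicit, as a sketch rather than a formula taken from the slide, is to assume that the sync of phase k can be hidden behind the compute of phase k+1:

$$ T_{\mathrm{total}} \approx \sum_{k=1}^{K} \Big[ \max\big(C(k),\, T_s(k-1)\big) + T_c(k) \Big] + T_s(K), \qquad T_s(0) := 0. $$

Under this assumption each phase pays only the fast cache write Tc(k) plus whatever part of the previous sync is not hidden by compute, and only the final sync Ts(K) remains fully exposed, which matches the IOR observation on the next slide.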

Experimental results (IOR): 512 processes (8 per node) writing a 32 GB shared file, varying the number of aggregators and the collective buffer size. TBW represents the maximum theoretical bandwidth achievable when writing to the cache without flushing it to the parallel file system. In IOR the last write/sync phase is not overlapped with any computation and thus affects the overall bandwidth (Congiu, G., et al., 2016, IEEE).


SIONlib: the API resembles logical task-local files, allowing simple integration into application code, while internally mapping the data to a single or a few large physical files, which reduces the load on the metadata servers. Application tasks t1..tn write logical task-local files that SIONlib maps onto a physical multi-file on the parallel file system.

Write buddy checkpoint (SIONlib, MPI): each node holds a local checkpoint file and a buddy file containing the data of its associated buddy node, both on the global file system.
Open: sid = sion_paropen_mpi(..., "bw,buddy", MPI_COMM_WORLD, lcom, ...)
Write: sion_coll_write_mpi(data, size, n, sid)
Close: sion_parclose(sid)
The write call writes the data first to the local chunk and then sends it to the associated buddy, which writes the data to a second file.
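Expanded into a hedged code sketch: the exact sion_paropen_mpi parameter list is an assumption based on the general SIONlib API and may differ between versions, and the arguments elided on the slide are filled in here purely for illustration.

#include <mpi.h>
#include <sion.h>

/* Hedged sketch of a buddy-checkpoint write following the calls named on
 * the slide; treat it as near-C pseudocode, not DEEP-ER reference code. */
void write_buddy_checkpoint(void *data, size_t size, size_t nitems,
                            MPI_Comm lcom /* node-local communicator */)
{
    int numfiles = 1;                      /* number of physical files      */
    sion_int64 chunksize = size * nitems;  /* per-task chunk size           */
    sion_int32 fsblksize = -1;             /* use file-system block size    */
    int globalrank;
    MPI_Comm_rank(MPI_COMM_WORLD, &globalrank);

    /* "bw,buddy": write mode with buddy checkpointing, as on the slide */
    int sid = sion_paropen_mpi("checkpoint.sion", "bw,buddy", &numfiles,
                               MPI_COMM_WORLD, &lcom, &chunksize,
                               &fsblksize, &globalrank, NULL, NULL);

    /* Data goes to the local chunk first, then to the associated buddy,
     * which writes it to a second file. */
    sion_coll_write_mpi(data, size, nitems, sid);

    sion_parclose_mpi(sid);   /* slide abbreviates this as sion_parclose() */
}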

JURECA Cluster. Hardware: T-Platforms V210 blade server solution, 28 racks, 1884 nodes with Intel Xeon Haswell CPUs (24 cores) and 128 / 256 / 512 GiB DDR4; InfiniBand EDR (100 Gbps) with fat-tree topology; NVIDIA GPUs: 75 nodes with 2 K80 and 12 nodes with 2 K40. Installed at JSC since July 2015. Peak performance: 1.8 PF (CPUs) + 0.4 PF (GPUs); main memory: 281 TiB. Software: CentOS 7 Linux, SLURM batch system, ParaStation Cluster Management, GPFS file system.

JURECA Booster: a collaboration of JSC, Intel (+ Dell) and ParTec; installation to be completed in 2017. Hardware: 1640 nodes with Intel Xeon Phi KNL 7250-F, 96 GB DDR4, 16 GB MCDRAM and a 200 GB local SSD; Intel Omni-Path (OPA) fabric with fat-tree topology. Fully integrated with the JURECA Cluster: a network bridge between EDR and OPA (Cluster-Booster), 198 OPA-to-EDR bridges connecting to the Cluster, and the same login nodes. Software: SLURM orchestrating Cluster and Booster jointly (management of heterogeneous resources in SLURM), ParaStation Cluster Management, GPFS file system. (The picture is not the real Booster, just to give an impression.)

Summary: The DEEP projects bring a new view on heterogeneity with the Cluster-Booster architecture; hardware, software and applications are jointly developed and strongly co-design driven. DEEP-ER explored future directions of I/O on the file-system level (BeeGFS), on the MPI-IO level (E10) and on the POSIX-optimization level (SIONlib), testing and combining the approaches. The step into production is in preparation: the Booster will be attached to the JURECA Cluster in the coming months. Future: Modular Supercomputing, with more modules to come.

Future: modular design principle. First step: JURECA will be enhanced by a highly scalable module.