I/O Monitoring at JSC, SIONlib & Resiliency
1 I/O Monitoring at JSC, SIONlib & Resiliency
- Update: I/O @ JSC
- Update: Monitoring with LLview (I/O, memory, load)
- I/O workloads on JURECA
- SIONlib: Task-Local I/O & Buddy Checkpointing
Wolfgang Frings, Kay Thust {W.Frings,K.Thust}@fz-juelich.de, Jülich Supercomputing Centre
HPC I/O in the Data Center (HPC-IODC), ISC, Frankfurt, June 23rd, 2016
2 HPC @ JSC: Dual Architecture Strategy
- General-purpose cluster line: IBM Power 4+ JUMP (9 TFlop/s), IBM Power 6 JUMP (9 TFlop/s), Intel Nehalem JUROPA (300 TFlop/s), Intel Haswell JURECA (~2.2 PFlop/s) + Booster (~10 PFlop/s)
- Highly scalable line: IBM Blue Gene/L JUBL (45 TFlop/s), IBM Blue Gene/P JUGENE (1 PFlop/s), IBM Blue Gene/Q JUQUEEN (5.9 PFlop/s), JUQUEEN successor (~50 PFlop/s)
- File servers: Lustre, GPFS
3 JURECA: Jülich Research on Exascale Cluster Architectures
- 2 Intel Haswell 12-core processors per node, 2.5 GHz, SMT, 128 GB main memory
- 1,884 compute nodes (45,216 cores), of which 75 nodes have 2 NVIDIA K80 graphics cards each and 12 visualization nodes have 512 GB main memory and 2 NVIDIA K40 graphics cards each
- ~2.2 Petaflop/s peak (with K80 graphics cards); Linpack from CPUs out of 1.693 Petaflop/s CPU peak
- 281 TByte memory
- Mellanox InfiniBand EDR
- Connected to the GPFS file system on JUST via IB FDR/40GigE gateway switches
4 Parallel I/O Hardware at JSC (JUST4, GSS)
- Jülich Storage Cluster (JUST): GPFS Storage Server (GSS/ESS) with GPFS Native RAID, end-to-end integrity, fast rebuild time on disk replacement
- GPFS + TSM backup + HSM; serves JUQUEEN and JURECA (JUST4-GSS)
- Capacity: 20.3 PByte; I/O bandwidth: up to 220 GB/s
- Hardware: IBM System x GPFS Storage Server solution, 31 building blocks, each with 2 x3650 M4 servers, 232 NL-SAS disks (2 TB) and 6 SSDs
5 LLview: User-Level Monitoring
- Efficient supervision of node usage, running jobs, statistics, history
- Prediction of system usage
- Monitoring of energy consumption, load, memory usage, I/O usage
- Interactive and mouse-sensitive display
- Main data sources: batch scheduler and runtime system; no interaction with compute nodes
- Fully customizable, fast and portable client-server application; integrated into Eclipse/PTP
- Support for various resource managers, incl. LoadLeveler, IBM Blue Gene, Cray ALPS, PBSpro, Torque, SLURM, Grid Engine and LSF
- LLview download: (open source)
6 LLview Architecture & I/O Monitoring Config
(Architecture diagram: the LLview client and Eclipse PTP request LML data via HTTP from a WWW server or via SSH/SCP from the HPC system; on the front-end and service nodes, LML_da with adapters for SLURM, Torque, LoadLeveler, LSF, Moab, OpenMPI, ... collects scheduler data; GPFS mmpmon metrics are gathered on the I/O nodes of the parallel file system and node-based metrics (load, memory) on the compute nodes; the collected data is stored in a database (DB2) and converted to LML by lml2llview.)
7 LLview: Node & File System Metrics
- Job-based metrics: bandwidth (last minute), bytes written/read since job start, open/close operations
- Job-based history (I/O, load, memory, ...)
- Node-based mapping of I/O load (color-coded)
- System-based history (I/O, load, memory, ...)
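To make the "bandwidth (last minute)" metric concrete, here is a minimal sketch of deriving a per-interval rate from cumulative byte counters such as those reported by GPFS mmpmon. This is not the LLview implementation; the struct and field names are illustrative.

#include <stdio.h>

/* Cumulative I/O counters for one job, e.g. sampled once per minute
 * from GPFS mmpmon (names illustrative). */
typedef struct {
    double             t;              /* sample time in seconds       */
    unsigned long long bytes_read;     /* bytes read since job start   */
    unsigned long long bytes_written;  /* bytes written since job start*/
} io_sample;

/* Bandwidth over the last sampling interval, in MiB/s. */
static double interval_bw_mib(const io_sample *prev, const io_sample *cur)
{
    double dt = cur->t - prev->t;
    if (dt <= 0.0)
        return 0.0;
    unsigned long long dbytes = (cur->bytes_read    - prev->bytes_read)
                              + (cur->bytes_written - prev->bytes_written);
    return (double)dbytes / dt / (1024.0 * 1024.0);
}

int main(void)
{
    io_sample a = {  0.0, 0ULL,        0ULL        };
    io_sample b = { 60.0, 3ULL << 30,  1ULL << 30  };  /* one minute later */
    printf("last-minute bandwidth: %.1f MiB/s\n", interval_bw_mib(&a, &b));
    return 0;
}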
8 LLview: Node & File System Metrics
- Job-based monitoring of a selected job: using 86 nodes, transferring ~31 TiB
- Shown: I/O activity of the application, avg. load per node, avg. memory usage per node
9 LLview: Offline Analysis
- Accumulated information: load & memory
- Slight increase over time -> memory leak?
- Load fluctuates -> why?
- Example application: ODE for ~ components, right-hand side distributed over 64 nodes
10 LLview: Offline Analysis
- Node-specific information: load
- Reveals bad load balancing
- Example application: ODE for ~ components, right-hand side distributed over 64 nodes
11 I/O Workload Analysis: Number of Jobs
- Example: histogram of the number of jobs versus GiB read and GiB written
12 I/O Workload Analysis: Transfer Size
- Example: total GiB transferred, broken down into GiB read and GiB written
13 I/O Workload Analysis: Next Steps
- Classification (e.g. by research topic, group, account)
- Additional metrics: bandwidth, open/close calls, type of I/O (continuous, burst, ...), parallel I/O or single writer/reader, ...
- More?
- Report at job end, containing information and timelines for I/O activity, memory usage and load
14 SIONlib: Shared Files for Task-Local Data of Parallel Applications
- I/O software stack: HDF5, NetCDF, MPI-I/O and POSIX I/O on top of the parallel file system
- Task-local I/O: each task t_1 ... t_n writes its own file (./checkpoint/file.0001 ... ./checkpoint/file.nnnn) to the parallel file system
15 SIONlib: Shared Files for Task-Local Data of Parallel Applications
- SIONlib sits between the application (or HDF5/NetCDF/MPI-I/O) and POSIX I/O on the parallel file system
- The n logical task-local files of tasks t_1 ... t_n are mapped onto a small number of physical multi-files (#files: O(10))
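For contrast, a minimal sketch of the naive task-local pattern that SIONlib replaces: every MPI rank opens its own file, so each checkpoint creates n files and n metadata operations on the parallel file system. The filenames follow the slide; the rest is illustrative.

#include <mpi.h>
#include <stdio.h>

/* Naive task-local I/O: one file per rank (illustrative only).
 * Assumes the directory ./checkpoint already exists. */
int main(int argc, char **argv)
{
    int rank;
    char fname[256];
    double data[1024] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each task creates its own checkpoint file -> n files, n creates. */
    snprintf(fname, sizeof(fname), "./checkpoint/file.%04d", rank);
    FILE *fp = fopen(fname, "wb");
    if (fp) {
        fwrite(data, sizeof(double), 1024, fp);
        fclose(fp);
    }

    MPI_Finalize();
    return 0;
}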
16 SIONlib: Architecture & Example
- Layers: SION MPI, OpenMP, Hybrid and Generic APIs for parallel applications and parallel tools (via callbacks), a serial API for serial tools, plus a parallel generic layer and a serial layer on top of ANSI C / POSIX I/O
- Extension of the I/O API (ANSI C or POSIX); C and Fortran bindings, implementation language C
- Current versions available under an open source license
- Usage pattern (MPI):
  /* fopen() */                   sid = sion_paropen_mpi(filename, "bw", &numfiles, &chunksize, gcom, &lcom, &fileptr, ...);
  /* fwrite(bin,1,nbytes,fileptr) */  sion_fwrite(bin, 1, nbytes, sid);
  /* fclose() */                  sion_parclose_mpi(sid);
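A self-contained sketch of this pattern follows. The call sequence and the "bw" mode are taken from the slide; the full argument list of sion_paropen_mpi (fsblksize, globalrank, newfname) is an assumption based on the SIONlib documentation as I recall it and should be checked against the installed SIONlib version.

#include <mpi.h>
#include <stdio.h>
#include "sion.h"

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char buf[4096] = {0};                 /* task-local data to write        */
    int numfiles = 1;                     /* number of physical multi-files  */
    sion_int64 chunksize = sizeof(buf);   /* per-task chunk size             */
    sion_int32 fsblksize = -1;            /* -1: use file-system block size  */
    int globalrank = rank;
    FILE *fileptr  = NULL;
    char *newfname = NULL;
    MPI_Comm lcom  = MPI_COMM_NULL;       /* local communicator (assumption) */

    /* Collective open of one shared SION file ("bw" = binary write). */
    int sid = sion_paropen_mpi("data.sion", "bw", &numfiles,
                               MPI_COMM_WORLD, &lcom,
                               &chunksize, &fsblksize,
                               &globalrank, &fileptr, &newfname);

    /* Task-local write into this task's chunk, fwrite-like interface. */
    sion_fwrite(buf, 1, sizeof(buf), sid);

    /* Collective close. */
    sion_parclose_mpi(sid);

    MPI_Finalize();
    return 0;
}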
17 DEEP-ER: The Project
- DEEP Extended Reach: EU-funded Exascale research project
- Budget: 10 M€, EU funding: 6.4 M€
- Start: Oct 2013, duration: 42 months
- The DEEP-ER consortium: coordinator JSC (Jülich Supercomputing Centre), 14 partners, 3 PRACE hosting members, 4 industry partners, 7 European countries
18 DEEP-ER: Architecture Innovation
- Xeon and self-booting Xeon Phi nodes connected by a simplified interconnect
- On-node non-volatile memory (NVM)
- Network Attached Memory
19 DEEP-ER: Resiliency
- Integration of SCR, SIONlib and BeeGFS
20 SIONlib: Local Checkpointing
- Tasks t_1 ... t_n write their checkpoint data through the SIONlib MPI layer to node-local storage
- Mapping: each task writes to the storage of its own node
- Transparent access from the application when files are migrated to the global file system
21 SIONlib: Buddy Checkpointing, Logical Mapping
- Mapping 1:1: a task writes to its local storage and to the storage of one buddy node
- Mapping 1:x: a task writes to its local storage and to the storage of x buddy nodes
- Data exchange with the buddy node is done via the SIONlib MPI/OpenMP layer
- Collective checkpoint calls are required
22 SIONlib: Write Buddy Checkpoint
- Open:  sid = sion_paropen_mpi(..., "bw,buddy", MPI_COMM_WORLD, &lcom, ...)
- Write: sion_coll_write_mpi(..., size, n, sid)
- Close: sion_parclose_mpi(sid)
- The write call first writes the data to the local chunk and then sends it to the associated buddy, which writes the data to a second file
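A minimal sketch of a buddy-checkpoint write, following the calls named on the slide; the exact argument lists of sion_paropen_mpi and sion_coll_write_mpi are assumptions modelled on the regular SIONlib MPI API and should be verified against the SIONlib version that provides the buddy feature.

#include <mpi.h>
#include <stdio.h>
#include "sion.h"

/* Write one buddy checkpoint: the "bw,buddy" mode asks SIONlib to mirror
 * each task's chunk to the storage of its buddy node. */
void write_buddy_checkpoint(const void *data, sion_int64 nbytes)
{
    int numfiles = 1;
    sion_int64 chunksize = nbytes;
    sion_int32 fsblksize = -1;
    int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int globalrank = rank;
    FILE *fileptr = NULL;
    char *newfname = NULL;
    MPI_Comm lcom = MPI_COMM_NULL;

    int sid = sion_paropen_mpi("checkpoint.sion", "bw,buddy", &numfiles,
                               MPI_COMM_WORLD, &lcom, &chunksize, &fsblksize,
                               &globalrank, &fileptr, &newfname);

    /* Collective write: data goes to the local chunk and is forwarded
     * to the buddy node, which writes it to a second file. */
    sion_coll_write_mpi(data, 1, nbytes, sid);

    sion_parclose_mpi(sid);
}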
23 SIONlib: Restore Checkpoint After Failure
- Open:  sid = sion_paropen_mpi(..., "br,buddy", MPI_COMM_WORLD, &lcom, ...)
- The open first tries the normal (local) file and falls back to the buddy copy if that open fails
- Read:  sion_coll_read_mpi(..., size, n, sid)
- Close: sion_parclose_mpi(sid)
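The matching restore step, again following the slide's call names; argument lists mirror the write-side sketch and are assumptions to be checked against the actual SIONlib headers.

#include <mpi.h>
#include <stdio.h>
#include "sion.h"

/* Restore a buddy checkpoint: "br,buddy" first tries the task's own
 * local file and falls back to the buddy copy if the local open fails. */
void read_buddy_checkpoint(void *data, sion_int64 nbytes)
{
    int numfiles = 1;
    sion_int64 chunksize = nbytes;
    sion_int32 fsblksize = -1;
    int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int globalrank = rank;
    FILE *fileptr = NULL;
    char *newfname = NULL;
    MPI_Comm lcom = MPI_COMM_NULL;

    int sid = sion_paropen_mpi("checkpoint.sion", "br,buddy", &numfiles,
                               MPI_COMM_WORLD, &lcom, &chunksize, &fsblksize,
                               &globalrank, &fileptr, &newfname);

    /* Collective read of this task's chunk (from the local or buddy copy). */
    sion_coll_read_mpi(data, 1, nbytes, sid);

    sion_parclose_mpi(sid);
}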
24 SIONlib and SCR: Example
- SCR_Start_checkpt()
- SCR_Route_file(fn, fn_scr): fn = "check1", fn_scr = "/abspath/check1"
- sid = sion_paropen_mpi(fn_scr, "wb,buddy", ...) creates, e.g., /abspath/check1 on node0 and node1 plus the buddy copies /abspath/check1_buddy_... on the partner nodes
- info = sion_get_io_info(sid): list of file names opened on this task and bytes written
- sion_parclose_mpi(sid)
- SCR_update_filename(nfiles, info.names, info.sizes, info.roles)
- SCR_Complete_checkpt()
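Put together, the integration looks roughly like the sketch below. The SCR call names (SCR_Start_checkpt, SCR_update_filename, SCR_Complete_checkpt) and sion_get_io_info are taken verbatim from the slide and appear to be the DEEP-ER-modified SCR/SIONlib interface rather than stock SCR; all argument types, the info struct layout and its field names are assumptions for illustration only.

#include <mpi.h>
#include <stdio.h>
#include "scr.h"
#include "sion.h"

/* Multi-level checkpointing: SCR decides where the checkpoint goes,
 * SIONlib writes it (with a buddy copy), and the resulting file names
 * and sizes are handed back to SCR. Hypothetical types/arguments. */
void scr_sionlib_checkpoint(const void *data, sion_int64 nbytes)
{
    char fn[] = "check1";
    char fn_scr[SCR_MAX_FILENAME];

    SCR_Start_checkpt();                  /* begin an SCR checkpoint phase  */
    SCR_Route_file(fn, fn_scr);           /* SCR returns the path to write  */

    int numfiles = 1;
    sion_int64 chunksize = nbytes;
    sion_int32 fsblksize = -1;
    int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int globalrank = rank;
    FILE *fileptr = NULL; char *newfname = NULL;
    MPI_Comm lcom = MPI_COMM_NULL;

    int sid = sion_paropen_mpi(fn_scr, "wb,buddy", &numfiles,
                               MPI_COMM_WORLD, &lcom, &chunksize, &fsblksize,
                               &globalrank, &fileptr, &newfname);
    sion_fwrite(data, 1, nbytes, sid);

    /* Which files did this task actually write, and how many bytes?
     * (struct name and fields are assumptions) */
    sion_io_stat_t *info = sion_get_io_info(sid);
    sion_parclose_mpi(sid);

    /* Register the written files (including buddy copies) with SCR. */
    SCR_update_filename(info->nfiles, info->names, info->sizes, info->roles);

    SCR_Complete_checkpt();               /* end of SCR checkpoint phase    */
}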
25 Conclusion
- I/O monitoring: combination of information from different sources (LLview + GPFS mmpmon), on-line and off-line analysis
- I/O workload analysis: concepts for analysis of job-based mmpmon data, automatic job reports after job end
- DEEP-ER project: SIONlib support for resiliency, multi-version buddy checkpointing, support for multi-level checkpointing with SCR
More information