ANL Site Update. ScicomP/SPXXL Summer Meeting Ray Loy Ben Allen Gordon McPheeters Jack O'Connell William Scullin Tim Williams

Size: px

Start display at page:

Download "ANL Site Update. ScicomP/SPXXL Summer Meeting Ray Loy Ben Allen Gordon McPheeters Jack O'Connell William Scullin Tim Williams"

Arlene Reynolds
5 years ago
Views:

1 ANL Site Update ScicomP/SPXXL Summer Meeting 2016 Ray Loy Ben Allen Gordon McPheeters Jack O'Connell William Scullin Tim Williams

2 Site Update System overview Support topics GPFS AFM Burst Buffer GHI deployment Performance Portability 2

3 ALCF Resources 3

4 ALCF Roadmap CORAL Theta interim system (late 2016) Cray/Intel KNL >=2500 nodes, >= 60 cores/node (8.5PF) Peak comparable to Mira Aurora late 2018 ALCF-3, successor to Mira Cray/Intel KNH >=50K nodes Blue Gene will continue to be a primary resource through the installation of Aurora Both HW and SW considerations for extended life 4

5 Support/Requirements 5

6 GPFS on BG/Q Current: ESS GPFS ; DDNs, BG/Q GPFS 3.5 Cannot upgrade ESS without BG/Q GPFS >=4.1 End-of-service for GPFS 3.5 coming up in 2017 BG/Q will continue in service well beyond that Support for GPFS 4.1 (or 4.2) on BG/Q IONs would be very helpful Concern that fixes for GPFS and AFM at GPFS 4.2 will continue to be back-ported to GPFS in a timely manner ANL supports SPXXL 2016 requirement (Ticket 21) 6

7 Compilers IBM XL Compilers for Blue Gene/Q XL Fortran 14.1 Current BG/Q version August 2015 AIX, Linux last update April/May 2016 XL C/C Current BG/Q version August 2015 AIX, Linux last update April 2016 Hoping for BG/Q update 7

8 PMRs PMR getdents/getdents64 alias conflict with toolchain [Apr 2016] Success - Fixed in V1R2M4 PMR No way to list contents of node's /dev/shmem or /dev/persistent from a CNK program (Can only access from off-node using CDTI) Negotiating with IBM on a solution (stalled on ANL) 8

9 Implementing GPFS AFM as a Burst Buffer 9

10 ALCF Operational File System Configuration There are three main production level GPFS file systems: mira-fs0 19 PiB mira-fs1-7 PiB mira-home 1.1 PiB These are based on DDN 12K-E (Embedded NSD servers) storage. These file systems are mounted to all the BG IO servers, a Cray visualization cluster, Globus DTN nodes and HPSS data movers. symlinks to the project filesets are used to mask the underlying file system from the users. /project contains links to mira-fs0 and mirafs1 filesets. 10

11 The big picture Current Step 1 AFM Step 2 GHI A,B,C,D,E,F,G,H 8PB, 400 GB/s fs2 (30 ESS) A,B,C,D,E,F,G,H 8PB, 400 GB/s fs2 (30 ESS) A,C,D,G,H 21PB,240 GB/s fs0 (16 DDN) B,E,F 7PB, 90 GB/s fs1 (6 DDN) A,C,D,G,H 21PB,240 GB/s fs0 (16 DDN) B,E,F 7PB, 90 GB/s fs1 (6 DDN) A,C,D,G,H 21PB,240 GB/s fs0 (16 DDN) B,E,F 7PB, 90 GB/s fs1 (6 DDN) HPSS via GHI 11

12 ESS 50,000 foot view Terminology: AFM: Active File Management cache fs2, non-permanent location home fs0, fs1, permanent location; not /home, which is confusing fs2 is a cache; Like any other cache, files can be evicted; AFM ensures there is a good copy in home before eviction. Eviction is policy driven; Our policies will probably be fairly simple (high water / lower water and LRU), but most any file system parameter can be checked) AFM is constantly syncing files between the cache and home Default is every 15 seconds, but is alterable Basically replays events from the cache on home (metadata updates, block changes, etc.) We can influence priorities of new writes vs. replication with tuning parameters, but true QoS is not here yet (it is in the roadmap) Fundamental assumption: Our file systems are big enough that recalls will be rare 12

13 Why? Overall, better experience for both the users and the admins No single point of failure We got this by adding fs1 to the existing fs0 file system All users see the same performance There is not uniform bw between fs0 and fs1, but fs2 and AFM give it back A cache miss might cause run to run variation, but all users are equally subject to this and given our cache size, the assumption is this will be rare. Minimal user/admin intervention required Ideally policies will do the right thing at the right time Example: Project data will automatically disappear, via GHI, after a project completes; Storage team doesn t need to do anything. Should trivially (single command) be able to cause the right behavior Probably based on extended attribute and policy Minimal (ideally zero) manual copying of files All handled by the file system 13

14 Implementation Overview Current State: Internal testing 2 ESS node pairs deleted from ESS and used to form a test home cluster. File system called: fm-fs1. Using AFM mode that allows for a GPFS home file system. mira-fs0 and mira-fs1 will be remotely mounted to ESS system mira-home will not be AFM managed Home file system defined as: gpfs:/// Null server list which signifies remotely mounted to cache cluster 14

15 Implementation Challenges This cluster was originally based on the GSS product which was xseries (Intel) based. Cluster was migrated from xseries to pseries (ESS) to address support concerns. Kernel Panics in mpt2sas kernel module Redhat bug: (bug) / (back port to 7.1 zstream) Difficult PD/recreate path. Much of the work was done by ALCF dealing directly with Red Hat and Avago. Up to 3 weeks between recreates. Fixed in RH 7.2 stream but since ESS won t support RH 7.2 in calendar year 2016 we needed to ask Red Hat for a port to RH 7.1 zstream. This fix is under final tests now at Red Hat. As soon as available this will be incorporated in our ESS cluster. Fixed in: kernel el7 15

16 Implementation Challenges Quotas: No communication between home and cache WRT to quota settings. Data is sync ed to home as root and root does not error on over quota Found best to turn off auto-migration on cache filesets. Prevents over-running home cache hard limits Bug found after file evicted and subsequently re-read from cache cluster it was not becoming resident again in cache. Efix under construction. 16

17 HSI deployment 17

18 File System Backup Overview Ø Mira-home is intended to store executable files and configura[on files. Ø User quota limits are enforced. Ø Data is fully protected thru GPFS metadata and data replica[on, GPFS snapshots and nightly backups. Ø Mira-fs0 and mira-fs1 are intended as intermediate-term storage for Mira/Cetus job output such as checkpoint datasets. Ø Project associated fileset quota limits are enforced. Ø Only metadata replica[on is enabled. Ø The user is responsible for archiving their own data. Ø Aaer project expira[on, quota limits are reduced, data is archived and the fileset is removed.

19 Current archiving (fs0, fs1) Users manually archive files via HSI or Globus/GridFTP When user project expires, script invokes HSI Copy to tape, shrink disk quota 90 days after expiration unmount (but still on disk) 180 days delete from disk, copy retained on tape No other system-related archiving 19

20 GHI value added Instead of scripting archiving directly, generate GPFS policy files to tell GHI when/how to migrate expired projects Improved disaster recovery Migrate everything (leave in place on disk) Optional: Implement threshold policy e.g. when fs reaches 90% start migration down to 80% (punch holes in files leaving metadata) GHI image backup create image of the entire fs only as "punched out" files (metadata) Initial restore of entire fs quickly, retrieve remaining contents on demand until caught up 20

21 GHI Deployment Status In progress (~6 mo) GHI installed and enabled on a test fs Some delays due to PMR (snapshots not deleted, deletion times out due to other processes) Other sites report issue resolved Expecting deployment this year 21

22 High Level GHI Integration View GPFS Cluster HPSS GPFS NSD Nodes GHI Session Node Data Mover Data Mover DDN Storage GHI IOM Nodes Disk Cache Tape 1GbE 10GbE QDR IB FC 4 x 6Gb/s SAS

23 Portability/Interoperability 23

Coordinating SC Centers Efforts Meetings of cross-labs applications

Joubert) Cross-lab training committee (F.

export control challenges CORAL partners, APEX partners (NERSC &

representa[ves September 2014 Mee[ng Apps Portability Apps readiness

24 Coordinating SC Centers Efforts Meetings of cross-labs applications readiness staff Tools and libraries working group (W. Joubert) Cross-lab training committee (F. Foerttner) Shared calendar Shared training events Manage nondisclosure, export control challenges CORAL partners, APEX partners (NERSC & Trinity) Portability March 2014 Mee[ng Apps readiness coordina[on ~15 representa[ves September 2014 Mee[ng Apps Portability Apps readiness coordina[on ~25 representa[ves January 2015 Mee[ng Next-gen hardware & soaware Portable programming ~40 center staff ~10 vendor reps 24

25 OpenMP 4.5 Source Code Portability double x[128], y[128]; #pragma omp target data map(to:x[0:64]) map(tofrom:y[0;64]) { #pragma omp target { // y computed on device } } double x[128], y[128]; #pragma omp for simd aligned(x, y: 32) for (int i=0; i<128; i++ ) { // thread s iterates à SIMD lanes } #pragma omp target data map(to:x) { #pragma omp target map(tofrom:y) #pragma omp teams #pragma omp distribute #pragma omp parallel for for (int i = 0; i < n; ++i) { y[i] = a*x[i] + y[i]; } } Host (CPU) Device (accelerator) GPGPU Intel Mic omp_get_num_devices() map maybe shared location simd standard vectorization #ifdef MANYCORE #else GPU #endif <code> 25

26 Performance Portability When directives-based approach performs: OpenMP 4.5 Use libraries or frameworks to encapsulate node level parallelism OpenMP 4.0 Libraries App Modules CUDA Vector intrinsics shmem Charm++ ADLB PETSc Chombo Trilinos MADNESS Kokkos RAJA 26

27 Effort Continuing Coordination between ESP, CAAR, NESAP. ALCC allocation at ANL, OLCF, NESAP for portability work Shared training including live/videocon Bringing NNSA labs into the discussion HPCOR (9/2015), COEPP (4/2016) SC15 Workshop on Portability Among HPC Architectures for Scientific Applications New effort on kernels/mini-apps: optimize on 2 platforms, then rework with OMP 4.5 or OpenACC, compare NekBone (ALCF), BoxLib (NERSC), DSL-based library for MD (OLCF) 27

28 Training - ATPESC Argonne Training Program on Extreme Scale Computing Intensive 2-week course held off-site Audience: doctoral students, postdocs, and computational scientists Presenters are leaders in all major areas of HPC Planning in progress for the 4 th session in August

29 The End 29

GPFS Experiences from the Argonne Leadership Computing Facility (ALCF) William (Bill) E. Allcock ALCF Director of Operations

GPFS Experiences from the Argonne Leadership Computing Facility (ALCF) William (Bill) E. Allcock ALCF Director of Operations Argonne National Laboratory Argonne National Laboratory is located on 1,500