8.5 End-to-End Demonstration - Exascale Fast Forward Storage Team - June 30th, 2014


1 8.5 End-to-End Demonstration - Exascale Fast Forward Storage Team - June 30th, 2014
NOTICE: THIS MANUSCRIPT HAS BEEN AUTHORED BY INTEL, THE HDF GROUP, AND EMC UNDER INTEL'S SUBCONTRACT WITH LAWRENCE LIVERMORE NATIONAL SECURITY, LLC, WHO IS THE OPERATOR AND MANAGER OF LAWRENCE LIVERMORE NATIONAL LABORATORY UNDER CONTRACT NO. DE-AC52-07NA27344 WITH THE U.S. DEPARTMENT OF ENERGY. THE UNITED STATES GOVERNMENT RETAINS AND THE PUBLISHER, BY ACCEPTING THE ARTICLE FOR PUBLICATION, ACKNOWLEDGES THAT THE UNITED STATES GOVERNMENT RETAINS A NON-EXCLUSIVE, PAID-UP, IRREVOCABLE, WORLD-WIDE LICENSE TO PUBLISH OR REPRODUCE THE PUBLISHED FORM OF THIS MANUSCRIPT, OR ALLOW OTHERS TO DO SO, FOR UNITED STATES GOVERNMENT PURPOSES. THE VIEWS AND OPINIONS OF AUTHORS EXPRESSED HEREIN DO NOT NECESSARILY REFLECT THOSE OF THE UNITED STATES GOVERNMENT OR LAWRENCE LIVERMORE NATIONAL SECURITY, LLC.
Fast Forward Project

2 Statement of Work - 8.5 Deliverables
8.5 End-to-End Demonstration with Final Design Documentation and Report
a) The Subcontractor shall develop a final version of the design documentation that is an updated version of the Milestone 4.1 Design Document, updated to include technical lessons learned during the research and development phases of the project. The resulting design will include the following sections for the topics listed in Milestones 3.1 and 4.1:
   - Description of the solution elements/implementation components and how they are expected to address the research goals
   - Explanation of why/how the proposed solution will address the project requirements
   - Identification of risks or unknowns with the proposed approach
b) The Subcontractor shall complete a final report that describes the research methodology, key findings, and recommendations for future research. This report will also identify work that did not result in usable functionality, so that others may avoid those paths in the future.
c) As part of this reporting process the Subcontractor shall complete a final end-to-end demonstration representative of work completed during the project. The specific demonstration criteria will be described and mutually agreed to by the Subcontractor and Technical Representative in the Solution Architecture document and further refined and finalized during the quarter prior to the demonstration (Project Quarter 7).
This milestone is deemed complete when the Final Report, Final Design Document, and demonstration results have been presented to and approved by the Technical Representative.

3 Final Report and Updated Design Documents
Milestone / Final Report Ref. / Document:
- M8.5 FF-Storage Final Report
- 3.1 D1 DAOS 3.1 Reduction Network Discovery Design Document
- 4.1 D2 DAOS 4.1 Server Collectives Design Document
- 3.1 D3 DAOS 3.1 Versioning Object Storage Device (VOSD) Design Document
- D4 DAOS API and DAOS POSIX Design Document
- 4.1 D5 DAOS 4.1 Epoch Recovery
- D6 DAOS 4.1 Lustre Restructuring and Protocol Changes
- DAOS 4.1 Client Health and Global Eviction Design Document
- I1 IOD Solution Architecture
- I2 IOD API
- 3.1 I3 IOD 3.1 Design Document
- I4 IOD KV Design Document
- I5 IOD Object Storage on DAOS Document
- HDF 3.1 HDF5 IOD VOL Design
- HDF 4.4 HDF5 Data Integrity Report
- POSIX Function Shipping Design Document
- H1 The Design and Implementation of FastForward Features in HDF5
- H2 User's Guide to FastForward Features in HDF5
- 5.6, 5.7 H3 HDF5 Data in IOD Containers Layout Specification
- 5.1 H4 HDF Function Shipper Design
- 7.3 H5 Burst Buffer Space Management - Prototype and Production
- H6 Deep Dive - Transactions - Presentation, Sept.
- H7 Mercury Design Document
- H8 AXE Design Document
- A1 ACG Solution Architecture
- 4.1 A2 ACG Computation
- ACG Software Install Guide
- EFF1 End to End Data Integrity in the Intel/EMC/HDF Exascale IO Presentation, Sept.
- HDF 3.1 Dynamic Data Structure Support

4 Demonstration Goals
- Simulate a producer/consumer workflow
- Run applications at larger scale
- Demonstrate resilience of the stack to failures
- Early feedback on stack behavior/performance
[Diagram: Compute Nodes (CNs), I/O Nodes (IONs), DAOS-Lustre MetaData Server (MDS), DAOS-Lustre Object Storage Servers (OSSs)]

5 LANL Test Cluster: Buffy

6 Test Environment
- 64 compute nodes: Cray, SLES, Aries interconnect
- 14 I/O nodes: CentOS 6.4, connected to both the Aries & InfiniBand networks
- 3 Lustre file systems exported by the abba appliance, cross-mounted on all IONs
- 5 DAOS servers (1 MDS + 4 OSSs): CentOS, with each OST using 5x RAID6+2 LUNs
- SFA12K has writeback cache enabled

7 Workflow Simulation
Container layout: root group "/" with an attribute; groups Step#0, Step#1, Step#2, each holding a 1D dataset (DT0, DT1, DT2).
- Split the communicator in half: Producers and Consumers.
- Producers create N timesteps (runtime argument); each timestep is written with one transaction, and every R timesteps are persisted.
- Consumers read and verify R timesteps once the Producers have completed the Persist of those timesteps; Producers concurrently write the next R timesteps while the Consumers are reading and verifying.
- After an application crash/relaunch, Consumers first verify all the timesteps written up to the latest readable transaction on DAOS; then the above behavior resumes.
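To make the workflow concrete, here is a minimal sketch of the producer/consumer split using a plain MPI communicator split. It is an illustration only, not the VPIC-IO code used in the demo; the constants NUM_TIMESTEPS and PERSIST_INTERVAL and the produce_timestep()/verify_timestep() helpers are hypothetical placeholders for the N and R runtime arguments and for the actual transaction write/verify logic.

```c
/* Sketch of the producer/consumer workflow split; all helper names and
 * constants are illustrative placeholders, not the actual VPIC-IO code. */
#include <mpi.h>

#define NUM_TIMESTEPS    16   /* N: timesteps created by the producers */
#define PERSIST_INTERVAL 4    /* R: persist (and verify) every R steps */

static void produce_timestep(int step, MPI_Comm comm) { /* write one transaction */ }
static void verify_timestep(int step, MPI_Comm comm)  { /* read back and check   */ }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Split the world communicator in half: lower ranks produce, upper ranks consume. */
    int is_producer = (rank < size / 2);
    MPI_Comm half;
    MPI_Comm_split(MPI_COMM_WORLD, is_producer, rank, &half);

    for (int step = 0; step < NUM_TIMESTEPS; step++) {
        if (is_producer)
            produce_timestep(step, half);          /* one transaction per timestep */

        if ((step + 1) % PERSIST_INTERVAL == 0) {
            /* Producers persist the last R transactions; consumers then verify
             * them while producers move on to the next R timesteps. */
            if (!is_producer)
                for (int s = step + 1 - PERSIST_INTERVAL; s <= step; s++)
                    verify_timestep(s, half);
        }
    }

    MPI_Comm_free(&half);
    MPI_Finalize();
    return 0;
}
```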

8 Three failure scenarios
1. Fail the CN
   - Purge the BBs and restart the app
   - Recover from the last persisted checkpoint on DAOS
   - Future work: recover from the ION, since the data is still there
2. Fail the ION
   - Wait for VPIC to fail
   - Remount the needed mount points on the ION after recovery
   - Same recovery as above
   - Future work: recover from the ION if the ION data is resilient
3. Fail the OSS
   - Fail and then recover an OSS
   - VPIC will fail a persist, wait for the OSS, and then retry the persist
   - Future work: VPIC continues computing and committing to IOD while waiting for the OSS; IOD internally retries the persist

9 Compute Node Failure Simulation
Let's start the demo:
- Run the VPIC application on CNs & IONs
- Kill the application running on the CNs
- Clear all burst buffers
- Run VPIC again, which restarts from DAOS
Expected result: the application resumes from the last state persisted to DAOS and completes successfully

10 I/O Node Failure Simulation
Let's start the demo:
- Run the VPIC application on CNs & IONs
- Power cycle an ION (ion01)
- Wait for VPIC to fail
- Remount the BB filesystems & DAOS once the ION is back online
- Clear all burst buffers
- Run VPIC again
Expected result: the application resumes from the last state persisted to DAOS and completes successfully

11 DAOS Server Failure Simulation
Let's start the demo:
- Run the VPIC application on CNs & IONs
- Power cycle an OSS
- Remount the OSTs once the OSS is back online
- Wait for VPIC to complete
Expected result:
- The application continues running until it needs to communicate with an OST that is down
- The application waits for the OST to come back online
- The Persist() call might fail if the OST has lost I/Os on disk (HDF5 will retry)
- The application resumes and completes successfully

12 VPIC-IO & Backend Activity
[Chart: DAOS & BB activity with 100% of transactions persisted; throughput (MB/s) over time (s) for BB reads, BB writes, and DAOS writes.]

13 VPIC-IO & Persist Frequency
[Chart: runtime (s) as a function of persist rate (% of transactions persisted, up to 100%).]

14 Performance Evaluation
- An IOR driver was developed for each layer of the stack; each driver uses a single transaction, with checksums disabled
- -a DAOS: new driver using asynchronous DAOS I/O submission
- -a IOD: new driver supporting all IOD object types (blobs, arrays & KVs); supports purge/prefetch/persist/cksum and variable array cell size
- -a HDF5: original HDF5 driver extended to support the EFF extensions; only supports datasets (mapped to IOD array objects), with a single dataset shared by all tasks
- -a PLFS: for checking the baseline performance of the Lustre cross-mounts on the BBs and the overhead of IOD, which layers above PLFS for BB storage
- LANL fs_test was also augmented with an IOD module (-b iod)
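For readers unfamiliar with IOR's access patterns, the sketch below shows the N:1 (single shared file) pattern that the drivers sweep over different transfer sizes, written with plain MPI-IO. It is not one of the EFF IOR backends; the file name, transfer size, and transfer count are arbitrary placeholders.

```c
/* Illustration only: the N:1 (single shared file) access pattern that the
 * IOR runs sweep over different I/O sizes, sketched with plain MPI-IO. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int xfer_size = 256 * 1024;             /* one point of the 4KB..1MB sweep */
    const int nxfers    = 64;                     /* transfers per task              */
    char *buf = malloc((size_t)xfer_size);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "ior_shared.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes its own contiguous region of the shared file (N:1). */
    for (int i = 0; i < nxfers; i++) {
        MPI_Offset off = ((MPI_Offset)rank * nxfers + i) * xfer_size;
        MPI_File_write_at(fh, off, buf, xfer_size, MPI_BYTE, MPI_STATUS_IGNORE);
    }

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
```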

15 Performance - DAOS
Several important VOSD performance optimizations:
- Enabled VIL zero-copy
  - Disabled in the M7 demo due to problems in the ZFS patch; the ZFS patch has been almost entirely rewritten since
  - Fully overwritten blocks (128KB) are migrated from the VIL to the final DAOS object; partially-modified blocks are still copied
  - Fixed a bug causing zero-copy blocks to be read from disk during flattening
- Detach aggregation: detach blocks from the VIL in bulk just before the transaction commits
- Fixed a bug in the DAOS writeback cache that caused poor results in the M5 demo

16 DAOS/IOR - Write
[Chart: write bandwidth (MB/s) versus I/O size (4KB-1MB) for DAOS Buffered HCE+1, DAOS DIO HCE+1, DAOS DIO VIL, POSIX 1:1, and POSIX N:1.]

17 DAOS/IOR - Read
[Chart: read bandwidth (MB/s) versus I/O size (4KB-1MB) for DAOS DIO, POSIX N:1, and POSIX 1:1.]

18 Performance - IOD
Several important performance optimizations:
- Reduced overly-eager PLFS index merging
  - PLFS will extend index entries indefinitely for contiguous data and has one checksum per index entry, so small reads within a giant checksum chunk are very slow
- Better aggregation and larger writes to DAOS during persists
- Use of <stdio.h> FILE * I/O instead of <fcntl.h> I/O, which provides client-side user-space buffering and caching (see the sketch below)
- Many performance tuning parameters in iodrc and plfsrc: threadpool sizes, stripe sizes, checksum chunk sizes, memory consumption during persist, and DAOS shards used per DAOS storage target
- Incremental KV persist
- Optimized sorting in range queries of data across multiple TIDs
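The <stdio.h> change can be illustrated with a small stand-alone sketch (not IOD code): many small writes through fwrite() are coalesced in a user-space buffer, while the same writes through write() each become a separate system call. File names and sizes below are arbitrary.

```c
/* Why buffered <stdio.h> I/O helps for small writes compared with
 * unbuffered <fcntl.h>/write(): fwrite() coalesces data in user space
 * and hits the kernel with much larger writes. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>

int main(void)
{
    char chunk[128];
    memset(chunk, 'x', sizeof(chunk));

    /* Unbuffered: every write() is a 128-byte system call. */
    int fd = open("unbuffered.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    for (int i = 0; i < 100000; i++)
        if (write(fd, chunk, sizeof(chunk)) < 0)
            break;
    close(fd);

    /* Buffered: fwrite() fills a user-space buffer (enlarged here with
     * setvbuf) and flushes it as much larger writes. */
    FILE *fp = fopen("buffered.dat", "w");
    setvbuf(fp, NULL, _IOFBF, 4 * 1024 * 1024);   /* 4MB stdio buffer */
    for (int i = 0; i < 100000; i++)
        fwrite(chunk, 1, sizeof(chunk), fp);
    fclose(fp);

    return 0;
}
```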

19 IOD/IOR Burst Buffer Write
[Chart: write bandwidth (MB/s) versus I/O size (4KB-1GB) for POSIX 1:1, IOD Blob 1:1, IOD Blob N:1, IOD Blob Noglib, and IOD Array N:1.]

20 Small IOs to a Single Object on Lola
[Chart: write bandwidth (GB/s) versus IO size (4K-1G) for PLFS File, IOD Blob, and IOD Array.]
Note that these tests set the array cell size equal to the IO size. In other IOR measurements (not shown), we set the array cells to a constant 8 bytes regardless of the IO size and observed that this did not affect performance.

21 IOD IOPS Scaling by NP (64K 128-length keys per task)
Two problems:
1. Array small IO: fixable with faster internal transformation to blobs.
2. KV inserts: fixable with the newer MDHIM, which uses LevelDB instead of PBL-ISAM.

22 IOStore and Debug Impact on PLFS on Lola
[Chart: write bandwidth (GB/s) versus IO size (4K-1G) for IOStore=Glib/Mlog=Err, IOStore=Glib/Mlog=Debug, IOStore=Posix/Mlog=Err, and IOStore=Posix/Mlog=Debug.]
INSIGHT: Applications should either do large IO or have client-side buffering.

23 IOD/IOR Burst Buffer Read
[Chart: read bandwidth (MB/s) versus I/O size (4KB-1GB) for POSIX 1:1, IOD Blob 1:1, IOD Blob N:1, and IOD Array N:1.]

24 IOD/IOR Persist
[Chart: persist bandwidth (MB/s) versus I/O size (4KB-1MB) for IOD Blob 1:1 and IOD Blob N:1.]

25 IOD IOPS Persist Scaling by NP (64K 128-length keys per task)
INSIGHT: Applications need to understand how KV objects are sharded, as shown by the very poor performance of the KV using decimal keys. Alternatively, new functionality is needed from IOD, e.g. "Hey IOD: don't use sorted ranges; hash the keys instead." A minimal sketch of key hashing for shard assignment follows.
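The sketch below illustrates the hashing idea under stated assumptions: FNV-1a and an eight-way shard count are arbitrary choices, and the code is not the IOD/MDHIM implementation. A batch of consecutive decimal keys would land in a single sorted range (one shard); hashing each key first spreads the same batch across all shards.

```c
/* "Hash the keys instead of using sorted ranges": sequential decimal keys
 * fall into one sorted range (one shard) at a time, so inserts serialize;
 * hashing each key spreads a batch of inserts across all shards. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define NUM_SHARDS 8

/* FNV-1a hash over the key bytes. */
static uint64_t fnv1a(const char *key, size_t len)
{
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)key[i];
        h *= 1099511628211ULL;
    }
    return h;
}

int main(void)
{
    int hits[NUM_SHARDS] = {0};
    char key[32];

    /* A batch of 1024 consecutive decimal keys, as a producer might insert. */
    for (int i = 0; i < 1024; i++) {
        snprintf(key, sizeof(key), "%020d", i);
        hits[fnv1a(key, strlen(key)) % NUM_SHARDS]++;   /* hashed shard choice */
    }

    /* With sorted-range sharding this whole batch would land in one shard;
     * with hashing it is spread roughly evenly. */
    for (int s = 0; s < NUM_SHARDS; s++)
        printf("shard %d: %d keys\n", s, hits[s]);
    return 0;
}
```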

26 IOD/IOR Persist - DAOS RPC Size
[Chart: DAOS RPC size distribution (512KB vs. 1MB RPCs, as a percentage) for each I/O size from 4KB to 1MB.]

27 IOD/IOR Persist Storage Activity (32K IO)
[Chart: bandwidth (MB/s) over time (s) for BB reads, BB writes, and DAOS writes.]

28 Performance - HDF5
Performance results for HDF5 were gathered in two modes:
- Mercury (non co-resident) mode: the IOR application/clients run on the CNs and connect to the HDF5 VOL servers on the IONs, so the results include the cost of data transfer through Mercury.
- Co-resident mode: the IOR application/clients are executed on the IONs and also act as the HDF5 VOL server processes, eliminating the data-transfer overhead.
Checksums add a considerable overhead and were disabled for most runs.

29 HDF5 and IOD IOR Read & Write
[Chart: bandwidth (MB/s) versus I/O size (up to 1GB) for IOD Array write/read, HDF5 Co-resident write/read, and HDF5 Mercury write/read.]

30 HDF5 Co-resident IOR - Impact of IOD Checksums (with the slower crc64)
[Chart: bandwidth (MB/s) versus I/O size (up to 1MB) for write and read, with and without checksums.]

31 ACG Retrieval of Sub-objects
[Chart: HDFS vs. EFF.]
The EFF stack is much faster and more consistent than HDFS for an important ACG workload: retrieving sub-objects.

32 Performance - Buffy
- MPICH with GNI support: native support for the Aries interconnect, low latency & RDMA support
- Some simple IOD operations, like transaction finish, are significantly slower with MPICH/GNI than with MPICH/TCP
  - Might be related to multi-thread support, with one thread polling inside MPI (see the sketch below); the problem is still under investigation
- All performance tests were run with MPICH/TCP
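The suspected pattern can be sketched as follows, assuming MPI_THREAD_MULTIPLE: one thread blocks inside MPI while another thread issues its own MPI traffic. This is only an illustration of the usage pattern, not IOD or MDHIM code; the progress thread, tag, and collective are hypothetical.

```c
/* Usage pattern suspected to trigger the MPICH/GNI slowdown: one thread
 * blocks inside MPI while other threads also make MPI calls. */
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

static void *progress_thread(void *arg)
{
    /* Blocks inside MPI waiting for a message; with the GNI netmod this
     * blocking call can starve other threads of the progress engine. */
    int msg;
    MPI_Recv(&msg, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    return NULL;
}

int main(int argc, char **argv)
{
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        MPI_Abort(MPI_COMM_WORLD, 1);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    pthread_t tid;
    pthread_create(&tid, NULL, progress_thread, NULL);

    /* Meanwhile the main thread issues its own MPI traffic
     * (e.g. the collective behind a transaction finish). */
    int value = rank, sum = 0;
    MPI_Allreduce(&value, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    /* Unblock the progress thread so the sketch terminates cleanly. */
    int wake = 0;
    MPI_Send(&wake, 1, MPI_INT, rank, 0, MPI_COMM_WORLD);
    pthread_join(tid, NULL);

    if (rank == 0) printf("sum = %d\n", sum);
    MPI_Finalize();
    return 0;
}
```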

33 IOD Operations - MPI/TCP vs MPI/GNI
[Chart: time (s) per IOD operation for Lola eth, Buffy gni, and Buffy eth.]
Note: Buffy gni outperforms Buffy ethernet by about 3x on the standard Intel MPI Benchmarks; however, IOD, and MDHIM within it, use MPI in non-standard ways, such as heavy use of threads.

34 Intel MPI Benchmark - Alltoall

35 Intel MPI Benchmark - Pingpong

36 Performance - Checksum
- We noticed a significant impact of checksums on performance
- The default checksum algorithm is crc64, which is not the most optimal choice; e.g. adler32 usually performs better
- We should also test with crc32c, which has been supported in hardware since SSE4.2 (see the sketch below)
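As a rough illustration of the comparison, the sketch below times zlib's adler32 against the SSE4.2 crc32c instruction on a large buffer. It is not the checksum code used in the EFF stack; buffer size and iteration count are arbitrary, and it must be built with -msse4.2 and linked with -lz.

```c
/* Rough micro-benchmark: zlib adler32 vs. SSE4.2 hardware crc32c.
 * Build with: gcc -O2 -msse4.2 cksum_bench.c -lz */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>
#include <zlib.h>        /* adler32() */
#include <nmmintrin.h>   /* _mm_crc32_u64 (SSE4.2) */

#define BUF_SIZE (64 * 1024 * 1024)
#define ITERS    16

static uint32_t crc32c_sse42(const void *buf, size_t len)
{
    const uint64_t *p = buf;
    uint64_t crc = ~0ULL;
    for (size_t i = 0; i < len / 8; i++)
        crc = _mm_crc32_u64(crc, p[i]);      /* 8 bytes per instruction */
    return (uint32_t)~crc;
}

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    unsigned char *buf = malloc(BUF_SIZE);
    for (size_t i = 0; i < BUF_SIZE; i++) buf[i] = (unsigned char)i;

    double t0 = seconds();
    uLong a = adler32(0L, Z_NULL, 0);
    for (int i = 0; i < ITERS; i++)
        a = adler32(a, buf, BUF_SIZE);
    double t1 = seconds();
    uint32_t c = 0;
    for (int i = 0; i < ITERS; i++)
        c = crc32c_sse42(buf, BUF_SIZE);
    double t2 = seconds();

    double gb = (double)BUF_SIZE * ITERS / 1e9;
    printf("adler32: %.2f GB/s (checksum 0x%lx)\n", gb / (t1 - t0), (unsigned long)a);
    printf("crc32c : %.2f GB/s (checksum 0x%x)\n", gb / (t2 - t1), (unsigned)c);
    free(buf);
    return 0;
}
```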

37 Checksum Algorithms versus memset
[Chart: throughput of memset, adler32, and crc64.]

38 Performance Summary (1/2) - Full IOR Write Comparison
[Chart: write bandwidth (MB/s) versus I/O size (4KB-1GB) for PLFS N-1, POSIX 1:1, IOD Blob 1:1, IOD Blob N:1, IOD Array N:1, HDF5 Co-resident, HDF5 Mercury, DAOS Buffered HCE+1, IOD Blob 1:1 Persist, and IOD Blob N:1 Persist.]

39 Performance Summary (2/2) - Full IOR Read Comparison
[Chart: read bandwidth (MB/s) versus I/O size (4KB-1GB) for POSIX 1:1, IOD Blob 1:1, PLFS N-1, IOD Blob N:1, IOD Array N:1, HDF5 Co-resident, DAOS DIO, and HDF5 Mercury.]

40 Performance - Next Steps (1/2)
All / ACG / Cray:
- Checksums: transition from crc64 to adler32
- Identify whether performance issues are design or implementation
- Scale tests and more performance benchmarks
- Efficiently extending the on-demand sizes of data structures
- More efficiency for variable-length structures (common in natural graphs)
- Multi-threaded MPI on Aries
DAOS:
- Zero-copy VIL performance improvement for large datasets
- More performance benchmarks: many objects in a single shard; persist to epochs > HCE+1

41 Performance - Next Steps (2/2)
HDF5:
- Mercury plugins that support the native network protocol as well as true RDMA (as opposed to the current MPI emulation)
- More testing of access patterns from the HDF5 level, to understand the tradeoffs between various data representations, prefetch options, and read/write/persist granularities
IOD:
- Arrays (especially small IOs)
- KVs: update to the new MDHIM with LevelDB instead of PBL-ISAM; implement fetch_next_list from key K in addition to the current fetch_next_list from the Nth key
- More testing of reads (reads after purge and reads after fetch); some was done here and some in 8.3 and 8.4, but the larger focus was on writes. What do you expect? The PLFS motto is: checkpoints.

42 Fast Forward Project

43 IOD/IOR Persist Storage Activity (1MB IO)
[Chart: bandwidth (MB/s) over time (s) for BB reads, BB writes, and DAOS writes.]

44 Howard Pritchard about Buffy/Gemini: It appears the issue is one of fairness within the nemesis progress engine when one thread is blocking within MPI - i.e. has made some kind of blocking MPI call. Although there are places within the progress engine where the thread yields the lock, it's not sufficient to prevent hindering the progress of other threads. The tcp netmod is less susceptible to this, I suspect, because once a thread starts reading data off of a socket, it's doing so with the big lock held. With ugni (and likely also for ib), since there aren't these "blocking" calls (note the mpich tcp uses non-blocking sockets but once there's data to read it keeps reading till there's no more data in the sock buffer at that time), they are more likely to exit the progress engine and return to the application without having completed a transfer. There may be additional effects of this long blocking barrier on other threads making progress.

45 IOD Scaling by NP (8MB IO, 4GB per task) 45

46 Adler32 Checksums in IOD - Writes
The drop-off for large IO is probably due to an fs_test timing measurement bug which includes buffer allocation and setup in the open time.

47 Adler32 Checksums in IOD - Reads
The drop-off for large IO is probably due to the same fs_test timing measurement bug.

48 Adler32 Checksums in IOD - Persists
The drop-off for large IO is probably due to the same fs_test timing measurement bug.

49 Demo Performance

50 Demo Performance

51 Performance - VPIC-IO
[Chart: VPIC-IO performance across configurations: DAOS Lustre, IOD aggregation, Buffy mpi-eth, IOD <stdio.h>, VOSD zero-copy.]
