End-to-End Data Integrity in the Intel/EMC/HDF Group Exascale IO DOE Fast Forward Project

1 End-to-End Data Integrity in the Intel/EMC/HDF Group Exascale IO DOE Fast Forward Project
As presented by John Bent, EMC, and Quincey Koziol, The HDF Group

2 Truly End-to-End
- App provides checksum buffer to HDF5
  - Input checksum on write, return value on read
  - Optional: the app can disable checksums or ask HDF5 to compute them
- HDF5 passes the checksum to IOD
  - IOD does the necessary recomputation when the request is unaligned
- The buck stops at IOD; DAOS doesn't (yet) participate
  - IOD checksums are stored as regular DAOS data
  - DAOS, ZFS, and Lustre all do their own checksumming as well
- Note: we only prevent data corruption from going undetected; we don't (yet) repair it

3 Checksum Support in the APIs
- Every IO buffer will have a checksum
- A checksum function is provided to upper layers
  - E.g. iod_checksum_t iod_checksum(buffer)
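As a rough illustration of the helper this slide describes, the sketch below exposes one checksum function to upper layers. CRC-32 from Python's zlib is an assumption made for convenience; the slides do not say which algorithm iod_checksum uses, and the Python function only mirrors the name of the C-style signature.

```python
import zlib


def iod_checksum(buffer: bytes) -> int:
    """Hypothetical stand-in for iod_checksum(): CRC-32 of the whole buffer."""
    # Mask to 32 bits so the result is an unsigned value.
    return zlib.crc32(buffer) & 0xFFFFFFFF


# Usage: checksum a buffer, then simulate a single flipped bit.
data = b"exascale"
stored = iod_checksum(data)
flipped = bytes([data[0] ^ 0x01]) + data[1:]
unchanged_matches = iod_checksum(data) == stored
corruption_detected = iod_checksum(flipped) != stored
```

Any layer holding a buffer and its checksum can repeat the comparison to confirm the buffer has not changed since the checksum was taken.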

4 Data Integrity in the Stack during Writes
1. The HDF5 API allows checksums to be optionally passed.
2. If the app doesn't pass a checksum, one will be added somewhere in the HDF5/IOD VOL layers.
3. The function shipper just does a passthrough.
4. IOD does two writes into DAOS. IOD can actually create a DAOS (virtual) shard that is optimized for small IOPS to store checksums and a DAOS (virtual) shard optimized for bandwidth to store data.
5. DAOS stores the data (which is actually metadata and data for the above layer).
Call stack (from the diagram):
  Application: H5Dwrite(data,(checksum))
  HDF / IOD VOL: iod_obj_write(data,checksum)
  Function Shipper: iod_obj_write(data,checksum)
  IOD: daos_shard_write(data), daos_shard_write(checksum)
  DAOS
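The five write steps can be sketched as a chain of function calls. Everything here is illustrative: FakeDaos is an in-memory stand-in, the Python names only echo the calls in the diagram, CRC-32 is an assumed algorithm, and having IOD verify the incoming checksum before storing is an assumption about where validation happens.

```python
import zlib
from typing import Optional


def checksum(buf: bytes) -> int:
    return zlib.crc32(buf) & 0xFFFFFFFF


class FakeDaos:
    """In-memory stand-in for DAOS with separate data and checksum shards."""

    def __init__(self) -> None:
        self.data_shard = {}
        self.cksum_shard = {}


def iod_obj_write(daos: FakeDaos, oid: int, data: bytes, cksum: int) -> None:
    # Assumption: IOD checks the buffer against the checksum it was handed.
    if checksum(data) != cksum:
        raise IOError("checksum mismatch on write")
    # Step 4: two writes, one per (virtual) shard.
    daos.data_shard[oid] = data
    daos.cksum_shard[oid] = cksum


def function_shipper(daos: FakeDaos, oid: int, data: bytes, cksum: int) -> None:
    # Step 3: pure passthrough.
    iod_obj_write(daos, oid, data, cksum)


def h5d_write(daos: FakeDaos, oid: int, data: bytes,
              cksum: Optional[int] = None) -> None:
    # Steps 1 and 2: the checksum is optional; add one if the app gave none.
    if cksum is None:
        cksum = checksum(data)
    function_shipper(daos, oid, data, cksum)
```

Calling h5d_write(daos, 7, b"payload") stores the payload and its checksum separately, while passing a checksum that does not match the buffer raises before anything is stored.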

5 Data Integrity in the Stack during Reads
1. The read request goes down the stack and gets to IOD.
2. IOD reads the data and checksum from DAOS. This may require multiple reads of multiple buffers, their verification, and recomputation of a new checksum if the read is unaligned, all while being careful to avoid race conditions.
3. IOD returns the data and the checksum up the stack.
   1. Hints can disable this.
Call stack (from the diagram):
  Application: H5Dread(data,(&checksum?))
  HDF / IOD VOL: iod_obj_read(data,&checksum)
  Function Shipper: iod_obj_read(data,checksum)
  IOD: daos_shard_read(data), daos_shard_read(checksum)
  DAOS
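For the simple aligned case, the read path might look like the sketch below. The dictionaries stand in for the data and checksum shards, the verify flag is a hypothetical stand-in for the hint mechanism (read here as disabling verification), and CRC-32 is an assumed algorithm.

```python
import zlib
from typing import Dict, Tuple


def checksum(buf: bytes) -> int:
    return zlib.crc32(buf) & 0xFFFFFFFF


def iod_obj_read(data_shard: Dict[int, bytes], cksum_shard: Dict[int, int],
                 oid: int, verify: bool = True) -> Tuple[bytes, int]:
    data = data_shard[oid]      # stands in for daos_shard_read(data)
    stored = cksum_shard[oid]   # stands in for daos_shard_read(checksum)
    if verify and checksum(data) != stored:
        raise IOError("stored checksum does not match data for object %d" % oid)
    # Both values travel back up the stack so upper layers can re-verify.
    return data, stored
```

With verification disabled the caller still receives the stored checksum, so integrity can be checked later at a layer where the cost is acceptable.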

6 An HDF5 Dataset is stored in a logical IOD Array and nicely striped across a set of DAOS shards. Each cell has its own checksum.
[Figure labels: DAOS Storage Target, DAOS Shard]

7 An HDF5 Dataset is stored in a logical IOD Array and nicely striped across a set of DAOS shards. Each cell has its own checksum.
- An aligned full cell read is easy! Just return the cell and its checksum.
[Figure labels: DAOS Storage Target, DAOS Shard]

8 An HDF5 Dataset is stored in a logical IOD Array and nicely striped across a set of DAOS shards. Each cell has its own checksum.
- An unaligned read is hard! Imagine reading this bright pink rectangle.
- Many checksum computations are required and race conditions must be carefully avoided.
- [Hints can disable this if performance is paramount over integrity.]
[Figure labels: DAOS Storage Target, DAOS Shard]

9 Race Conditions, Contiguous Blocks
- Imagine a read straddling checksummed blocks
  [Figure labels: {cksum1} {ret_cksum} {cksum2}]
- Create ret_cksum, then verify existing cksums, then copy
- KEY: Create must happen first
  - If we verify and then create, corruption can occur in between
- KEY: Create ret_cksum from the existing data, not from the copy
  - If we create from the copy, the copy might already be corrupted
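The ordering on this slide (create, verify, copy) can be sketched as follows. The flat bytearray store, the fixed block size, the list of stored per-block checksums, and CRC-32 are all assumptions made for illustration.

```python
import zlib
from typing import List, Tuple


def cksum(buf) -> int:
    return zlib.crc32(buf) & 0xFFFFFFFF


def read_straddling(store: bytearray, block_size: int, stored: List[int],
                    start: int, length: int) -> Tuple[bytes, int]:
    """Read [start, start+length), which may straddle blocks (length > 0)."""
    src = memoryview(store)
    region = src[start:start + length]
    # 1. Create ret_cksum first, from the existing data and not from a copy.
    ret_cksum = cksum(region)
    # 2. Verify the stored checksum of every block the read touches.
    first = start // block_size
    last = (start + length - 1) // block_size
    for i in range(first, last + 1):
        block = src[i * block_size:(i + 1) * block_size]
        if cksum(block) != stored[i]:
            raise IOError("block %d failed verification" % i)
    # 3. Only now copy. If the copy is damaged, it no longer matches
    #    ret_cksum, so the caller can detect it.
    return bytes(region), ret_cksum
```

Because ret_cksum is taken before verification, corruption that strikes between the two steps makes the verified blocks disagree with their stored checksums rather than slipping into a freshly computed checksum.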


10 Race Conditions, Non-Contiguous
- Imagine a read straddling checksummed non-contiguous blocks
  [Figure labels: {cksum1} {tmp1} {tmp2} {cksum2}]
- Creating ret_cksum cannot be done in one operation
  - Especially if regions come from more than one storage node
- In this case, each region must be cksum'd, then copied.
- Then create ret_cksum on the return buffer.
- Then cksum each copied region and compare to the temporary cksums on the source regions.
- Then, finally, verify the original cksums on the blocks.

11 Non-contiguous reads
[Figure labels: {ret_cksum} {tmp1.1} {tmp2.1} {cksum1} {tmp1} {tmp2} {cksum2}]
- First create tmp1 and tmp2 cksums
- Then copy regions
- Then create ret_cksum from copy
- Then create tmp1.1 and tmp2.1 and compare to tmp1 and tmp2
- Then verify cksum1 and cksum2
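The five steps for a non-contiguous read can be sketched like this. The flat bytearray store, the (offset, length) region tuples, the per-block checksum list, and CRC-32 are assumptions; a real implementation would gather regions from several storage nodes.

```python
import zlib
from typing import List, Tuple


def cksum(buf) -> int:
    return zlib.crc32(buf) & 0xFFFFFFFF


def read_noncontiguous(store: bytearray, block_size: int, stored: List[int],
                       regions: List[Tuple[int, int]]) -> Tuple[bytes, int]:
    src = memoryview(store)
    # 1. Temporary checksums (tmp1, tmp2, ...) over each source region.
    tmp = [cksum(src[off:off + n]) for off, n in regions]
    # 2. Copy the regions into the return buffer, remembering where each lands.
    out = bytearray()
    spans = []
    for off, n in regions:
        spans.append((len(out), n))
        out += bytes(src[off:off + n])
    # 3. ret_cksum is created from the copy.
    ret_cksum = cksum(out)
    # 4. Re-checksum each copied region (tmp1.1, tmp2.1) and compare.
    for (pos, n), expected in zip(spans, tmp):
        if cksum(out[pos:pos + n]) != expected:
            raise IOError("copied region does not match its source")
    # 5. Finally verify the original checksums of the touched blocks.
    for off, n in regions:
        for i in range(off // block_size, (off + n - 1) // block_size + 1):
            if cksum(src[i * block_size:(i + 1) * block_size]) != stored[i]:
                raise IOError("block %d failed verification" % i)
    return bytes(out), ret_cksum
```

Step 4 ties the copy back to the source regions, and step 5 ties the source regions back to the on-disk checksums, so a corruption at any stage breaks one of the comparisons.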


12 Why not Verify First?
[Figure labels: {ret_cksum} {tmp1.1} {tmp2.1} {cksum1} {tmp1} {tmp2} {cksum2}]
- First verify cksum1 and cksum2
- Then create tmp1 and tmp2 cksums
- Then copy regions
- Then create ret_cksum from copy
- Then create tmp1.1 and tmp2.1 and compare to tmp1 and tmp2

13 What about writes? Just like reads but in reverse

14 Read Pseudo-code
iod_obj_read(offset=o, length=l, checksum=&c, buffer=&b) {
    regions = find_all_data(o, l)
    foreach region and its checksum (r, v) in regions
        # the checksum will be on disk for whole regions
        # or recomputed if a partial region
        checksums[r] = v        # save the checksum
        copy r into b
    *c = checksum(b)            # checksum entire output buffer
    foreach region r in regions
        # verify copied region within buffer matches original's checksum
        copied = checksum(r within b)
        assert(copied == checksums[r])
}

15 Three IOD Object Types
- Blobs
  - Just as has been described
- Arrays
  - When stored, they are unrolled into a blob
- KV Stores
  - Store checksum(s) as a header in the value
  - Other metadata, such as the value length, may be stored here as well
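For the KV case, a value with a checksum header might be laid out as in the sketch below. The header format (a 4-byte CRC-32 followed by an 8-byte length, little-endian) is purely hypothetical; the slides only say that a checksum and possibly the value length are stored in a header.

```python
import struct
import zlib

# Hypothetical header: 32-bit checksum, then 64-bit value length.
HEADER = struct.Struct("<IQ")


def pack_value(value: bytes) -> bytes:
    crc = zlib.crc32(value) & 0xFFFFFFFF
    return HEADER.pack(crc, len(value)) + value


def unpack_value(blob: bytes) -> bytes:
    crc, length = HEADER.unpack_from(blob)
    value = blob[HEADER.size:HEADER.size + length]
    if len(value) != length:
        raise IOError("value is shorter than the length in its header")
    if zlib.crc32(value) & 0xFFFFFFFF != crc:
        raise IOError("value does not match the checksum in its header")
    return value
```

Keeping the checksum inside the value means a single KV read returns everything needed to verify the value.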

16 Storing Data
- In ION, we will store the data in traditional PLFS-style logfiles
  - Pro: fast writes
  - Con: large amount of IOD/PLFS index metadata, but it's short-lived
- In DAOS, we will store the data in a flattened view as round-robined stripes across shards
  - Pro: minimized IOD metadata
  - Con: potentially slower migration due to the need to scatter/gather, but it's on a fast interconnect

17 Storing Checksums
- In ION, we will store the checksum in a new field in the PLFS index entry for each range
  - Pro: very easy to implement
  - Con: can't do pattern compression on the PLFS index
- In DAOS, we will store the checksum in a virtual checksum shard, in the object with the same number as the corresponding object in a virtual data shard

18 Storing IOD Objects Onto DAOS
- Split each DAOS shard into four virtual shards
  - Metadata (00)
  - Checksum (01)
  - Data (10)
  - Reserved (11) [in future for DAOS HA?]
- An IOD object ID, OID, is 62 bits
  - To read its metadata, read DAOS object {00}{OID}
  - To read its checksum, read DAOS object {01}{OID}
  - To read its data, read DAOS object {10}{OID}
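One way to read the {prefix}{OID} notation is as a 64-bit DAOS object ID whose top two bits select the virtual shard. That bit placement is an assumption made for this sketch, as are the constant and function names.

```python
METADATA, CHECKSUM, DATA, RESERVED = 0b00, 0b01, 0b10, 0b11
OID_BITS = 62
OID_MASK = (1 << OID_BITS) - 1


def daos_object_id(kind: int, oid: int) -> int:
    """Combine a 2-bit virtual-shard prefix with a 62-bit IOD object ID."""
    if not 0 <= oid <= OID_MASK:
        raise ValueError("IOD object ID must fit in 62 bits")
    if kind not in (METADATA, CHECKSUM, DATA, RESERVED):
        raise ValueError("unknown virtual shard prefix")
    return (kind << OID_BITS) | oid


def split_daos_object_id(daos_id: int):
    """Recover (prefix, OID) from a combined 64-bit ID."""
    return daos_id >> OID_BITS, daos_id & OID_MASK
```

With this encoding the metadata, checksum, and data objects for one IOD object differ only in their top two bits, so no lookup table is needed to move between them.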

19 IOD Metadata
- Found by getting the list of shards from the container
  - Hash(OID) modulo the number of shards in the list gives the target shard
  - Read {00}{OID} on that shard to get the metadata
- Metadata is very small
  - A list of shards across which this object is striped
  - Checksum unit size
    - Tunable via hints or explicit parameters or IOD observation of usage while in ION
  - Stripe size
    - Multiple of the checksum unit size
  - The last offset of the object
  - The dimensionality info for array objects
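The lookup and the metadata record can be sketched as below. The field names, the dataclass, and the default identity hash are hypothetical; only the list of fields and the rule that the stripe size is a multiple of the checksum unit size come from the slide.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class IodMetadata:
    shards: List[int]            # shards this object is striped across
    checksum_unit: int           # bytes covered by one checksum
    stripe_size: int             # must be a multiple of checksum_unit
    last_offset: int             # last offset of the object
    dims: Tuple[int, ...] = ()   # dimensionality, for array objects only

    def __post_init__(self) -> None:
        if self.stripe_size % self.checksum_unit != 0:
            raise ValueError("stripe size must be a multiple of the checksum unit")


def metadata_shard(oid: int, container_shards: List[int],
                   hash_fn: Callable[[int], int] = lambda x: x) -> int:
    """Pick the shard holding {00}{OID}: Hash(OID) modulo the shard count."""
    return container_shards[hash_fn(oid) % len(container_shards)]
```

Since the record holds only a handful of integers and a short shard list, one small read on the home shard is enough to locate every stripe of the object.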

20 IOD Data
- Data for IOD object OID is striped in a round-robin fashion in object {10}{OID} across a set of shards
- Since DAOS is very good at sparse objects and can flatten overwrites nicely with transactions, we place all data at the same physical offset as the logical target offset (this reduces our metadata)

21 IOD Checksums
- The data for block B of object OID is stored at {10}{OID} in the appropriate shard given the stripe size and list of shards for OID (as explained in the previous slide)
- Therefore, the checksum for block B is stored at {01}{OID} in that same shard.
- Even though DAOS is good at sparse objects, putting each checksum at the same offset as the block it describes would produce very small IOs, which we might want to avoid.
- Therefore, we may instead store an array of checksums for each stripe, squished together at the front of the corresponding stripe of the checksum shard

22 Storing OID 3 on DAOS Shards
- 3 % 2 = 1, so the metadata is at shard 1 in {00}{3}
- Data is striped across objects {10}{3} on each shard, starting at shard 1
- Checksums for each stripe are at the corresponding location in objects {01}{3}
[Figure legend: {meta} {cksums} {data} {empty}]
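The OID 3 example can be worked through in a few lines. The identity hash is inferred from the "3 % 2 = 1" arithmetic on the slide, and the 1 MiB stripe size and the function name are assumptions chosen for illustration.

```python
MIB = 1 << 20


def shard_for_offset(oid: int, offset: int, stripe_size: int,
                     num_shards: int) -> int:
    """Round-robin stripe placement, starting at the object's home shard."""
    home = oid % num_shards          # identity hash, as in 3 % 2 = 1
    stripe = offset // stripe_size   # which stripe the offset falls in
    return (home + stripe) % num_shards


# OID 3 on two shards with an assumed 1 MiB stripe size.
placement = [shard_for_offset(3, i * MIB, MIB, 2) for i in range(4)]
```

Here the first stripe lands on shard 1 alongside the metadata, and later stripes alternate between the two shards. The checksum for each stripe lives in {01}{3} on the same shard as its data in {10}{3}.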

23 Zooming in on Checksum Block
[Figure legend: {meta} {empty} {cksums} {data}]

24 Zooming in on Checksum Block
- This shows one data block and one checksum block holding one single checksum.
- But actually each data block is split into multiple checksum units.
- Each checksum could be at the same offset as its checksum unit.
  - But this is too sparse (e.g. 64 bits every MB).
- Instead, create an array of checksums, one per checksum unit, at the front of the checksum block.
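The two layouts can be compared with a small offset calculation. The 8-byte checksum width follows the "64 bits every MB" example; the function names, and the assumption that the checksum object uses the same stripe boundaries as the data object, are illustrative.

```python
CKSUM_BYTES = 8  # 64-bit checksums, as in the slide's example


def sparse_checksum_offset(offset: int, unit: int) -> int:
    """Sparse layout: each checksum sits at the start offset of its unit."""
    return (offset // unit) * unit


def packed_checksum_offset(offset: int, stripe_size: int, unit: int) -> int:
    """Packed layout: checksums form an array at the front of the stripe."""
    if stripe_size % unit != 0:
        raise ValueError("stripe size must be a multiple of the checksum unit")
    stripe_start = (offset // stripe_size) * stripe_size
    unit_index = (offset - stripe_start) // unit
    return stripe_start + unit_index * CKSUM_BYTES
```

With a 4 MiB stripe and 1 MiB checksum units, the packed layout places a stripe's four checksums in 32 contiguous bytes, so one small read fetches all of them instead of four reads spaced 1 MiB apart.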

25 More info
- Optimizations for sub-chunking checksum regions are described in this white paper by Andreas Dilger: Integrity pdf ( )

26 HDF5 Metadata End-to-End Integrity

41 HDF5 Raw Data End-to-End Integrity
