IME (Infinite Memory Engine) Technical Overview
[Chart: bandwidth and IOPS of a single NVMe drive]
What does flash mean for storage?
- It is a new fundamental device for storing bits; we must treat it differently from HDD.
- We have to manage data placement across tiers at a larger scale.
- It brings new opportunities for novel developments around scaling, data security, and performance protection.
DDN IME Application I/O Workflow
Compute tier: diverse, high-concurrency applications. Fast data tier: NVM and SSD. Persistent data tier: disk.
1. The lightweight IME client intercepts application I/O and places fragments into buffers, plus parity.
2. The IME client sends fragments to the IME servers.
3. The IME servers write buffers to NVM and manage internal metadata.
4. The IME servers write aligned, sequential I/O to the SFA backend.
5. The parallel file system operates at maximum efficiency.
Distributed Hash Table + Log-Structured Filesystem
A hash function maps data keys (File1, File3, File4, File6) to hash values (DFCD3455, 52ED789E, 46042D43, DC355CE) that determine placement across the peers of the distributed network. The DHT provides the foundation for:
- Network parallelism
- Node-level fault tolerance
- Distributed metadata
A log-structured filesystem is used at the storage device level: new data is added at the log head, space is reclaimed at the log tail, and the log wraps over time. This gives high device throughput on NAND flash while maintaining device longevity.
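The DHT placement idea can be sketched as follows. This is a minimal illustration, not IME's actual hash function, and the peer names are invented; the point is that every client computes the same placement independently, which is what makes distributed metadata possible.

```python
import hashlib

def place(key, servers):
    """Hash a data key and map it onto one of the peers.
    Any client hashing the same key picks the same peer,
    with no central metadata server in the path."""
    digest = hashlib.sha1(key.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]

servers = ["ime0", "ime1", "ime2", "ime3"]  # hypothetical peer names
target = place("File1", servers)
```

Two clients calling `place("File1", servers)` always agree on the target peer, so lookups need no coordination.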
DDN IME Data Flow in the Client
On the compute node, the application issues POSIX or MPI-IO I/O:
1. I/O is issued by the application.
2. IME places fragments into data buffers via an accumulator.
3. Parity buffers are built simultaneously.
4. Metadata requests (file open, file close, stat) are passed through to the PFS client (Lustre).
5. Once full, buffers are sent to the IME server layer.
DDN IME Erasure Coding
Data protection against IME server or SSD failure is optional (the lost data is "just cache"). Erasure coding is calculated at the client:
- Great scaling with extremely high client counts
- Servers don't get clogged up
IME erasure coding does reduce usable client bandwidth and usable IME capacity:
- 3+1: 56 Gb -> 42 Gb
- 5+1: 56 Gb -> 47 Gb
- 7+1: 56 Gb -> 49 Gb
- 8+1: 56 Gb -> 50 Gb
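The figures above follow directly from the stripe geometry: with k data buffers per parity buffer, only k/(k+1) of the raw 56 Gb carries user data. A quick check (the function name is ours, not an IME API):

```python
def usable_bandwidth(raw_gb, data_frags, parity_frags=1):
    # Only the data fragments of each k+p stripe carry user payload,
    # so usable bandwidth is raw * k / (k + p).
    return raw_gb * data_frags / (data_frags + parity_frags)

for k in (3, 5, 7, 8):
    print(f"{k}+1: 56Gb -> {usable_bandwidth(56, k):.0f}Gb")
# 3+1: 56Gb -> 42Gb
# 5+1: 56Gb -> 47Gb
# 7+1: 56Gb -> 49Gb
# 8+1: 56Gb -> 50Gb
```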
DDN IME Data Residency Control
Two ratios control how data moves between IME and the PFS:
- flush_threshold_ratio [0%..100%]: the maximum percentage of dirty data resident in IME before the data is automatically synchronized to the PFS. Once synchronized, the data is marked clean.
- min_free_space_ratio [0%..100%]: clean data is kept in IME until this ratio is reached, then purged.
In short: sync dirty data until the flush_threshold_ratio is satisfied; purge clean data until the min_free_space_ratio is satisfied.
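The two-threshold policy can be sketched like this. The function name, default values, and return format are illustrative only, not IME's configuration API:

```python
def residency_actions(dirty_ratio, free_ratio,
                      flush_threshold_ratio=0.80,
                      min_free_space_ratio=0.20):
    """Decide what the IME layer should do next.
    dirty_ratio: fraction of capacity holding unsynchronized data
    free_ratio:  fraction of capacity that is free"""
    actions = []
    if dirty_ratio > flush_threshold_ratio:
        actions.append("sync dirty data to PFS")  # dirty -> clean
    if free_ratio < min_free_space_ratio:
        actions.append("purge clean data")        # clean -> free
    return actions
```

Note that the two mechanisms are independent: a sync changes dirty data to clean without freeing space, while a purge frees space by evicting only data already safe on the PFS.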
Use of Log Structuring in IME
Consider two different application write patterns: sequential and non-sequential. Burst buffer blocks (BBBs) are really just buffers generated at the client, and the contents of a BBB can be aligned or not. The same storage method is used for both kinds of block, despite the qualitative difference in their contents!
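The idea can be sketched as a toy log-structured store (class and method names are invented for illustration): whether the application writes sequentially or with a stride, every fragment is appended at the log head, so the device only ever sees sequential writes.

```python
class LogStore:
    def __init__(self):
        self.log = []     # append-only device log, in time order
        self.index = {}   # (file, offset) -> position in the log

    def write(self, file, offset, data):
        # Append at the log head regardless of the logical offset.
        self.index[(file, offset)] = len(self.log)
        self.log.append(data)

    def read(self, file, offset):
        # The index maps a logical location back to its log entry.
        return self.log[self.index[(file, offset)]]

store = LogStore()
store.write("ckpt", 0, b"a")      # sequential write
store.write("ckpt", 40960, b"b")  # strided write: same append path
```

Both writes land in arrival order on the "device"; only the index differs, which is why the output pattern barely affects device throughput.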
Use of Log Structuring in IME
What does this give us? Near line-rate performance regardless of output pattern.
[Chart: 3 IOR checkpoints to IME at ~50 GB/s (4k strided, shared file)]
Non-Deterministic Data Placement
- Deterministic approach: IME clients use a hash function to choose the host for each piece of data.
- Non-deterministic approach: IME clients learn and observe the load of the IME servers (pending request queue lengths) and route write requests to avoid highly loaded servers.
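The non-deterministic policy can be sketched as queue-aware routing. The pending-queue model is a simplification of whatever load signal the IME servers actually expose, and the function name is ours:

```python
def pick_server(pending_queue_lengths):
    """Route the next write to the server with the shortest
    pending-request queue, steering around busy or degraded servers."""
    return min(range(len(pending_queue_lengths)),
               key=lambda i: pending_queue_lengths[i])

# A degraded server (index 2) accumulates a long queue and is avoided.
queues = [3, 2, 57, 4]
target = pick_server(queues)
```

Contrast this with the deterministic hash: a hash must send its share of writes to the slow server no matter what, while queue-aware routing lets the healthy servers absorb the load.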
Aggregate IME Adaptive vs. Non-Adaptive WRITE Performance
[Chart comparing three cases: an ideal, healthy system; one degraded IME server with adaptive placement; and one degraded IME server without adaptive placement. Amdahl's Law in action!]
IME v1.0 Mount Points
The FUSE client provides an IME POSIX mount point:

# df -h /ime/gsfs
Filesystem  Size  Used  Avail  Use%  Mounted on
imefs        26T  3.9T    22T   16%  /ime/gsfs

Compare the underlying filesystem mount point:

# df -h /dev/gsfs/
Filesystem  Size  Used  Avail  Use%  Mounted on
/dev/gsfs    26T  3.9T    22T   16%  /gsfs
Parallel File System: Shared File Performance
[Diagram: filesystem locking]
20 IME vs Parallel File System: Shared File Performance
Rack Performance: IME
[Charts: IOR file-per-process bandwidth (GB/s) and 4k random IOPS, write and read]
~550 GB/s read and write; ~50 million IOPS.
Benchmark Data
[Charts: POSIX single-shared-file IOR with segments on 20 IME nodes; write and read bandwidth (GB/s) and IOPS across transfer sizes from 4k to 1024k]
IME Burst Buffer
Productizing many years of research and development. Yields a huge percentage of peak bandwidth:
- No server-side erasure-code overhead; fewer memory copies and visits
- Flexible data placement: allows servers to avoid slow or oversubscribed components
- Log-structured writes: utilize NVM devices in the most performant manner
- Completely declustered RAID: rebuilds of devices in MINUTES, not days
Thank You! Keep in touch with us:
sales@
9351 Deering Avenue, Chatsworth, CA 91311
@ddn_limitless
1.800.837.2298 / 1.818.700.4000
company/datadirect-networks