HPC Storage Use Cases & Future Trends
Massively-Scalable Platforms and Solutions Engineered for the Big Data and Cloud Era
Atul Vidwansa
Email: atul@
Oct 2014
DDN: About Us
DDN is a leader in massively scalable platforms and solutions for big data and cloud applications.
- Main Office: Santa Clara, California, USA
- Installed Base: 1,000+ customers in 50 countries
- Go-To-Market: partner & reseller assisted, direct
- DDN is the world's largest private storage company
- World-renowned & award-winning
DDN: The Technology Behind the World's Leading Data-Driven Organizations
- HPC & Big Data Analysis
- Cloud & Web Infrastructure
- Professional Media
- Security
Biggest of the HPC Customers
DDN powers the fastest HPC centers in the world:
- #1 on 5 continents and 61 of the Top 100
- Delivering more GB/s than all others combined
- Fastest parallel filesystem
Representative systems: 32 PB @ 1 TB/s, 20 PB @ 250 GB/s, 26 PB @ 250 GB/s, 10 PB @ 200 GB/s, 10 PB @ 150 GB/s
Supercomputer Storage in Asia-Pacific
Best suited for CPU- or accelerator-based clusters.

Raijin @ ANU-NCI
- 1.1 PF CPU-only cluster
- 10 PB DDN storage: SFA12K with FDR InfiniBand
- Mix of 10K RPM SAS & 7.2K RPM nearline SAS
- DDN-supplied and -supported Lustre parallel filesystem
- 150 GB/s throughput

Tsubame2 @ Titech, Japan
- 2.2 PF CPU+GPU cluster
- 7.2 PB DDN storage: SFA10K with QDR InfiniBand
- Mix of 10K RPM SAS & 7.2K RPM nearline SAS
- DDN-supplied and -supported Lustre + GPFS parallel filesystems
- 70 GB/s throughput
Life Science Systems, Asia-Pacific
- 2012, Kazusa DNA Research Institute: 30 GB/s; 4.5 PB, several storage tiers
- 2013, Kyoto University CiRA: 4 PB capacity
- 2012/2014, National Institute of Genetics: 300 GB/s aggregate throughput; ~10 PB disk capacity
- 2014, Medical Megabank Project: 500 GB/s aggregate throughput; 10-50 PB disk capacity
- 2014, Peking University: 60 GB/s
- 2015, Tokyo University, Human Genome Center (HGC): 22 PB disk capacity; 200 TB flash, 4.5 million random-read IOPS; tens to hundreds of petabytes of tape library
Snapshot of DDN Customers: Singapore & India
- 2 PB @ 20 GB/s at CSIR C-MMACS
- 1.5 PB @ 15 GB/s at TIFR-NCBS
- 500 TB @ 25 GB/s at IIT Kanpur
- 1 PB @ 8 GB/s at IUCAA Pune
- 1 PB @ 8 GB/s at IITM Pune
- 300 TB @ 5 GB/s at NIBMG
Use Cases for Petaflop- to Exaflop-Class Machines
Storage Use Cases for 100-500 PF Supercomputers
Taken from RFPs of Trinity, NERSC-8, ORNL, and LLNL.

Priority 1: Defensive I/O (checkpoint-restart)
- Users demand the ability to dump 100% of memory in a very short period of time, and a reduction of I/O wait times.
- Trinity requires 4.4-17.8 TB/s of bandwidth to handle these scenarios, whereas NERSC-8 requires 2.2-8.9 TB/s.

Priority 2: Handle the concurrency of 50 million CPU cores
- Handle the average concurrency of an application running on 50 million CPU cores, where each core generates data.
- Handle, on average, 1 million file creates/s of metadata performance.

Priority 3: On-the-fly visualization with quick drain & pre-staging of data
- Start visualization as soon as data is generated by the simulation, but before it is written to the parallel filesystem. Requires an average of 500 GB/s write throughput followed by 500 GB/s read throughput.
- Users demand a reduction in the time to start jobs that need to pre-stage multiple petabytes of data.
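As a back-of-the-envelope illustration of Priority 1, the RFP bandwidths translate directly into full-memory dump times. The memory size below is an assumed example, not a figure from the RFPs:

```python
def checkpoint_time_s(memory_tb, bandwidth_tb_per_s):
    """Time (seconds) to dump the full memory image at a sustained bandwidth."""
    return memory_tb / bandwidth_tb_per_s

# Assume a hypothetical 2 PB (2048 TB) of aggregate system memory.
memory_tb = 2048
print(checkpoint_time_s(memory_tb, 4.4))   # Trinity low end:  ~465 s
print(checkpoint_time_s(memory_tb, 17.8))  # Trinity high end: ~115 s
```

The spread between the low and high ends of the requirement is the difference between a checkpoint that blocks the machine for minutes versus one that finishes in under two.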
Continued
Priority 4: Accelerate reads for common files
- Load user profiles, shell environments, libraries, configuration files, and data files for thousands of users on a multidisciplinary HPC system.
- Also applicable to the chip-design industry for silicon verification, pinning a reference genome in bioinformatics, etc.

Priority 5: Fast data lookup
- Scale-out pattern matching for defense/intelligence use cases
- Google-scale data lookups: Hadoop jobs
- Fraud detection based on embarrassingly parallel databases

Priority 6: Application-level power capping
- Ability to dynamically allocate faster CPU, network, and storage components to power-hungry applications before power capping starts.
Common Requirements from Storage
New supercomputers require main-memory (RAM) performance at the cost of disk (PFS) performance:
- An extremely fast cache to absorb checkpoint-restart data
- Must be able to cope with bursts of random data
- Has to be an extension of the parallel filesystem, as data ultimately needs to go to long-term persistent storage
- Also needs to integrate with the caching mechanism of the underlying storage infrastructure
- Must not require recompiling applications (has to provide POSIX or MPI-IO semantics)
- Must be power efficient & space efficient
What is a Burst Buffer & Why Do You Need It?
Need for a burst buffer:
- Exascale applications need extreme throughput, on the order of 5x-10x what a PFS can provide.
- Large systems do not want to invest in disk-only technology; it's too costly & power hungry.
- The capacity required for extreme throughput is usually only 2-5 times the system memory size; such a small PFS cannot deliver the performance.
- PFS storage is often not used to 100% of its capability.
- Small random I/O is still a challenge for all PFSs.

Analysis of Argonne's LCF production storage system:
- 99% of the time, storage BW utilization is < 33% of max
- 70% of the time, storage BW utilization is < 5% of max

Trend: burst buffers will demand smaller, robust parallel file systems that sustain very high bandwidth efficiency; the SFA value proposition remains strong in a burst-buffer world.
DDN Confidential - NDA Required, Roadmap Subject to Change
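The core idea can be sketched in a few lines (all figures below are illustrative, not DDN specifications): a fast tier absorbs the write burst at compute speed, then drains to the PFS in the background while the application goes back to computing.

```python
def burst_buffer_timeline(burst_gb, absorb_gbps, drain_gbps):
    """Return (seconds the app is blocked on I/O, seconds until data lands on the PFS)."""
    app_blocked = burst_gb / absorb_gbps   # the app waits only for the fast tier
    drain = burst_gb / drain_gbps          # background drain to the PFS
    return app_blocked, app_blocked + drain

# Illustrative numbers: a 1,000 GB checkpoint, buffer at 80 GB/s, PFS at 10 GB/s.
blocked, landed = burst_buffer_timeline(1000, 80, 10)
print(blocked)  # 12.5 s of application I/O wait, versus 100 s writing directly to the PFS
```

This is why the utilization statistics above matter: the PFS is idle most of the time, so a small fast tier in front of it can soak up the bursts and keep the disks streaming at a steady, efficient rate.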
Move to NVRAM (SSDs) is a MUST!
NVRAM is a viable HPC tier today, and its performance gains outpace HDD.

Flash vs. HDD metrics:
- $/MB/s: $0.29 vs. $3.19 (91% less)
- $/GB: $1 vs. $0.10 (10x the cost)
- W/GB/s: 0.003 vs. 0.13 (98% less)

[Chart: CPU operations vs. HDD and flash IOPS, 2000-2018, with the CPU-to-HDD gap widening from 10^6 to 10^8.]
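The percentage reductions quoted above follow directly from the paired metrics; a quick arithmetic check:

```python
def pct_less(new, old):
    """Percent reduction going from the old (HDD) value to the new (flash) one."""
    return round((1 - new / old) * 100)

print(pct_less(0.29, 3.19))   # $/MB/s: 91 (% less than HDD)
print(pct_less(0.003, 0.13))  # W/GB/s: 98 (% less than HDD)
```

Flash still costs roughly 10x more per GB, which is exactly why it is positioned as a small performance tier rather than a capacity replacement.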
What is Infinite Memory Engine (IME)?
A DDN-developed burst buffer implementation using a patent-protected distributed hash table algorithm to manage distributed, non-volatile memory devices:
- High bandwidth
- Low latency for reads & writes, large and small, aligned or random
- Data integrity & protection
- Massive scalability
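The slide does not disclose IME's actual algorithm, but the general idea of hash-based placement can be sketched as follows (the server names and hashing scheme are illustrative, not IME's design): every client computes the same deterministic mapping from data fragment to server, so no central metadata lookup sits on the read or write path.

```python
import hashlib

SERVERS = ["ime00", "ime01", "ime02", "ime03"]  # hypothetical IME server nodes

def place(fragment_id: str, servers=SERVERS) -> str:
    """Deterministically map a data fragment to a server by hashing its ID.

    Because the mapping is a pure function of the fragment ID, any client
    can locate a fragment without coordinating with the others."""
    h = int(hashlib.sha256(fragment_id.encode()).hexdigest(), 16)
    return servers[h % len(servers)]

# Any client, anywhere in the cluster, resolves the same server:
print(place("checkpoint.0007/block.42"))
```

Decentralized placement like this is what lets a burst buffer scale bandwidth linearly with server count: there is no metadata server to saturate.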
Where Does Infinite Memory Engine Fit?
HPC burst buffer and DDN's exascale/webscale architecture. Elements of the DDN exascale/webscale stack:
- Massively parallel computational platform: massively parallel processing platform + high-performance network fabric
- DDN IME burst buffer tier (NEW): file system buffer cache software and NVRAM appliances
- DDN SFA persistent storage tier: high-performance, high-capacity, and reliable file storage appliances
- DDN WOS archival storage tier: cost-effective, cloud-enabled, object-based archive
IME Design Decisions
1. Decouples storage performance from capacity (SSD vs. spinning disk)
2. Speeds up apps by moving I/O next to compute (bandwidth & IOPS, read & write, small & large)
3. Shrinks cluster idle time with I/O provisioning (you bought a $100 cluster but are using only $25; IME gives back the other $75)
The Infinite Memory Engine: How It Works
How IME Works, Continued
Demo Comparative Testing: Shared Writes
IME accelerates parallel file systems by ~2,000x.

Cluster-level testing:
- 6,225 concurrent write requests: DDN GRIDScaler 49 GB/s; IME (overall) 49 GB/s
- 12,250,000 concurrent write requests: DDN GRIDScaler 17 MB/s; IME 49 GB/s (linear cluster scaling)

Disk-level testing:
- 62.5 concurrent write requests: DDN GRIDScaler 438 MB/s per SSD; IME 500 MB/s per SSD
- 125,000 concurrent write requests: DDN GRIDScaler 170 KB/s per SSD; IME 500 MB/s per SSD

SSDs behind a PFS don't help; IME stays at line rate and scales with SSD rates.
Average 2018 Top500 cluster concurrency: 57,772,000 cores (est.)
IME Demo at ISC 2014: Summary
Content:
- Write to, and read from, IME with IOR
- Write S3D application data to IME
- Purge data to the underlying parallel file system
Interface: MPI-IO driver for IME
Testbed hardware:
- DDN IME 2U servers, 24 SATA SSDs per IME server
- GRIDScaler (GPFS) with SFA7700
S3D app performance (MPI-IO): <50 MB/s
[Diagram: compute cluster → burst buffer at 80 GB/s; burst buffer → parallel filesystem at 3 GB/s]
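The gap between the demo's absorb rate (80 GB/s into the buffer) and drain rate (3 GB/s out to the PFS) determines how much data the SSD tier must hold at the end of a burst. A rough sketch, where the burst size is an assumed example rather than a figure from the demo:

```python
def required_buffer_tb(burst_tb, absorb_gbps, drain_gbps):
    """Data still in the buffer when the burst ends: what the SSD tier must hold."""
    burst_seconds = burst_tb * 1000 / absorb_gbps   # time spent absorbing the burst
    drained_tb = drain_gbps * burst_seconds / 1000  # data drained concurrently
    return burst_tb - drained_tb

# A hypothetical 10 TB burst absorbed at 80 GB/s while draining to the PFS at 3 GB/s:
print(required_buffer_tb(10, 80, 3))  # 9.625 TB must fit in the buffer
```

Since almost nothing drains during the burst itself, the buffer is sized for the burst, and the 3 GB/s drain runs quietly between bursts.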
IME Roadmap: Various Degrees of DDN Hardware
- Product appliance: IME client SW + DDN SSDs & server SW
- Software only: IME client and server SW
- Clustered systems: IME client and server SW
IME Phase 1 System: Available for POC Now
Single-rack clustered solution (IME client and server SW):
- 20 2U IME server nodes
- Dual-rail IB FDR switching
- 250 TB of SSD capacity
- Extends the PFS interface to the cluster
- 50M IOPS and 200 GB/s throughput
This rack can support ~1,600 compute nodes, or can be operated as a stand-alone data-intensive cluster.
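Dividing the rack-level figures evenly across the 20 server nodes gives the implied per-node contribution (a simple even split for illustration; the slide does not state actual per-node performance):

```python
def per_node(rack_total, nodes=20):
    """Evenly split a rack-level figure across the IME server nodes."""
    return rack_total / nodes

print(per_node(200))         # GB/s per node: 10.0
print(per_node(50_000_000))  # IOPS per node: 2,500,000
print(per_node(250))         # TB of SSD per node: 12.5
```

At 24 SATA SSDs per 2U server (per the ISC 2014 testbed), 10 GB/s per node works out to a little over 400 MB/s per SSD, consistent with the ~500 MB/s per-SSD line rate cited in the comparative testing.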
IME Ushers In Tremendous Value to HPC Centers!
Clear benefits at any level of scale:
- >90% fewer I/O & router nodes
- >90% fewer spinning disks
- >90% less storage networking
- >90% fewer data center racks
- >90% less data center power
Thank You