The Japanese Extreme Big Data Workshop, February 26, 2014
System Software for Big Data and Post-Petascale Computing
Osamu Tatebe, University of Tsukuba
I/O performance requirements for exascale applications
  Computational science (climate, CFD, ...)
    - Read initial data (100 TB to 1 PB)
    - Write snapshot data (100 TB to 1 PB) periodically
  Data-intensive science (particle physics, astrophysics, life science, ...)
    - Data analysis of 10 PB to 1 EB of experiment data
Scalable performance requirement for parallel file systems

  Year   FLOPS   #cores   IO BW      IOPS      Systems
  2008   1P      100K     100 GB/s   O(1K)     Jaguar, BG/P
  2011   10P     1M       1 TB/s     O(10K)    K, BG/Q
  2016   100P    10M      10 TB/s    O(100K)
  2020   1E      100M     100 TB/s   O(1M)

  Performance target: IO BW and IOPS are expected to scale out in
  proportion to the number of cores or nodes.
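A quick back-of-the-envelope check of the targets above (a sketch assuming the table's round numbers) shows that the required bandwidth per core stays roughly constant at about 1 MB/s, so the aggregate I/O bandwidth has to grow linearly with the number of cores:

    # Rough per-core I/O bandwidth implied by the table above (round numbers assumed).
    targets = {
        2008: (100e3, 100e9),   # (#cores, IO bandwidth in bytes/s): 100K cores, 100 GB/s
        2011: (1e6,   1e12),    # 1M cores, 1 TB/s
        2016: (10e6,  10e12),   # 10M cores, 10 TB/s
        2020: (100e6, 100e12),  # 100M cores, 100 TB/s
    }

    for year, (cores, bw) in targets.items():
        print(f"{year}: {bw / cores / 1e6:.1f} MB/s per core")
    # Every generation needs ~1 MB/s per core, so aggregate I/O bandwidth
    # must scale out with the number of cores rather than per-core speed.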
Technology trend
  HDD: performance does not increase much
    - ~300 MB/s, 5 W per drive projected in 2020
    - Reaching 100 TB/s with HDDs alone implies O(2M) W of drive power
  Flash, storage class memory
    - ~1 GB/s, 0.1 W per device projected in 2020
    - Issues: cost, limited number of updates (write endurance)
  Interconnects
    - 62 GB/s (InfiniBand 4x HDR)
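To make the power argument explicit, here is a minimal sketch using the projected 2020 per-device figures above (the device numbers are the slide's projections, not measurements):

    # Power needed to reach 100 TB/s with the projected 2020 device figures above.
    target_bw = 100e12                # 100 TB/s aggregate bandwidth

    hdd_bw, hdd_w = 300e6, 5.0        # ~300 MB/s, 5 W per HDD
    flash_bw, flash_w = 1e9, 0.1      # ~1 GB/s, 0.1 W per flash/SCM device

    for name, bw, w in [("HDD", hdd_bw, hdd_w), ("Flash/SCM", flash_bw, flash_w)]:
        devices = target_bw / bw
        print(f"{name}: {devices:,.0f} devices, {devices * w / 1e3:,.0f} kW")
    # HDD:       333,333 devices, ~1,667 kW  (order of a few MW, the slide's "O(2M) W")
    # Flash/SCM: 100,000 devices,     10 kW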
Current parallel file system
  - Central storage array
  - Separate installation of compute nodes and storage
  - Network bandwidth between compute nodes and storage must be scaled up
    in order to scale out the I/O performance
  [Figure: compute nodes (clients) access the MDS and central storage over
   the network; the network bandwidth is the limitation]
Remember memory architecture
  [Figure: shared memory (CPUs attached to one memory) vs. distributed
   memory (each CPU with its own memory)]
Scaled-out parallel file system
  - Distributed storage in the compute nodes
  - I/O performance scales out by accessing nearby storage, unless metadata
    performance becomes the bottleneck
  - Access to nearby storage mitigates the network bandwidth requirement
  - Performance may be non-uniform
  [Figure: an MDS cluster and compute nodes (clients) with storage
   distributed across the compute nodes]
Example of scale-out storage architecture (a snapshot projected ~3 years out)
  Per I/O node: CPU (2 sockets x 2.0 GHz x 16 cores x 32 FPU), 62 GB/s chipset,
    InfiniBand HDR, memory, and 16 x 1 TB local storage over 12 Gbps SAS x 16
    => 19.2 GB/s, 16 TB per node
  x 500 nodes => 9.6 TB/s, 8 PB per group
  x 10 groups => 96 TB/s, 80 PB in total (5,000 I/O nodes, 10 metadata servers)
  - Non-uniform but scale-out storage
  - R&D of system software stacks is required to achieve maximum I/O
    performance for data-intensive science
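The aggregate figures follow directly from the per-node numbers; a small sketch of the arithmetic, assuming the slide's per-node figures of 19.2 GB/s and 16 TB:

    # Aggregate figures for the example architecture above.
    per_node_bw = 19.2e9     # 16 lanes of 12 Gbps SAS at ~1.2 GB/s effective each (slide figure)
    per_node_cap = 16e12     # 16 x 1 TB local storage per I/O node

    nodes_per_group, groups = 500, 10

    group_bw = per_node_bw * nodes_per_group      # 9.6 TB/s
    group_cap = per_node_cap * nodes_per_group    # 8 PB
    total_bw = group_bw * groups                  # 96 TB/s
    total_cap = group_cap * groups                # 80 PB

    print(f"per group: {group_bw/1e12:.1f} TB/s, {group_cap/1e15:.0f} PB")
    print(f"total ({nodes_per_group*groups} I/O nodes): "
          f"{total_bw/1e12:.0f} TB/s, {total_cap/1e15:.0f} PB")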
Challenges
  File system (object store)
    - From a central storage cluster to a distributed storage cluster
    - Scaled-out parallel file system up to O(1M) clients
    - Scaled-out MDS performance
  Compute node OS
    - Reduction of OS noise
    - Cooperative cache
  Runtime system
    - Optimization for non-uniform storage access (NUSA)
  Global storage
    - Data sharing of exabyte-scale data among machines
Scaled-out parallel file system
  Federate local storage in the compute nodes
  Special purpose
    - Google File System [SOSP 03]
    - Hadoop Distributed File System (HDFS)
  POSIX(-like)
    - Gfarm file system [CCGrid 02, NGC 10]
Scaled-out MDS
  GIGA+ [Swapnil Patil et al., FAST 11]
    - Incremental directory partitioning
    - Independent locking in each partition
  skyfs [Jing Xing et al., SC 09]
    - Performance improvement during directory partitioning over GIGA+
  Lustre
    - MT scalability in 2.x
  Proposed clustered MDS: PPMDS [our JST CREST R&D]
    - Shared-nothing KV stores
    - Nonblocking software transactional memory (no locks)

  System       IOPS (file creates/sec)   #MDS (#cores)
  GIGA+        98K                       32 (256)
  skyfs        100K                      32 (512)
  Lustre 2.4   80K                       1 (16)
  PPMDS        270K                      15 (240)
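As a rough illustration of the shared-nothing approach (a minimal sketch, not PPMDS itself; all names below are illustrative), directory entries can be hashed across independent key-value stores so that file creates in one huge directory spread over all metadata servers:

    # Sketch: spread directory entries across shared-nothing KV stores.
    # Each entry is keyed by (parent directory ID, name) and hashed to one
    # metadata server, so creates in the same directory hit different servers.
    import hashlib

    NUM_MDS = 15                            # number of metadata servers (illustrative)
    mds_kv = [{} for _ in range(NUM_MDS)]   # one in-memory KV store per server

    def mds_of(parent_id: int, name: str) -> int:
        """Pick the metadata server responsible for this directory entry."""
        h = hashlib.sha1(f"{parent_id}/{name}".encode()).digest()
        return int.from_bytes(h[:4], "big") % NUM_MDS

    def create(parent_id: int, name: str, inode: dict) -> None:
        """A file create touches only one server; no cross-server lock is needed."""
        server = mds_of(parent_id, name)
        mds_kv[server][(parent_id, name)] = inode

    # Creates within one huge directory are spread across all servers.
    for i in range(10000):
        create(parent_id=1, name=f"file{i}", inode={"size": 0})
    print([len(kv) for kv in mds_kv])       # roughly balanced entry counts

Operations that span several servers (e.g., rename) still need coordination across KV stores, which PPMDS handles with nonblocking software transactional memory instead of locks.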
Development of Pwrake: a data-intensive workflow system
  Pwrake = workflow system based on Rake (Ruby make), dispatching tasks over SSH
  I/O-aware task scheduling (see the sketch below):
    - Locality-aware scheduling: compute nodes are selected by the location
      of the input files
    - Buffer-cache-aware scheduling: modified LIFO ordering to ease the
      trailing task problem
  [Figure: processes on compute nodes accessing file1, file2, file3 in the
   Gfarm file system]
  [Graph: workflow elapsed time (sec) with 900 GB of file I/O on 10 nodes,
   comparing naïve, locality-aware, and locality- plus cache-aware scheduling]
    - Locality-aware scheduling: 42% speedup
    - Cache-aware scheduling: 23% speedup
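A minimal sketch of the locality-aware selection idea (not Pwrake's actual API; the file-location lookup and all names below are illustrative): dispatch a task to the compute node that already stores most of its input data.

    # Sketch: prefer the compute node that already holds the task's input files.
    # `file_hosts` stands in for a file-location query (e.g., against Gfarm).
    from collections import Counter

    def choose_node(input_files, file_hosts, candidate_nodes):
        """Return the candidate node storing the most input bytes (or any node)."""
        local_bytes = Counter()
        for path, (host, size) in file_hosts.items():
            if path in input_files and host in candidate_nodes:
                local_bytes[host] += size
        if local_bytes:
            return local_bytes.most_common(1)[0][0]
        return candidate_nodes[0]            # fall back to any free node

    # Example: file locations as reported by the file system (single replica each).
    file_hosts = {"a.fits": ("node1", 2_000_000_000),
                  "b.fits": ("node2",   500_000_000)}
    print(choose_node({"a.fits", "b.fits"}, file_hosts, ["node1", "node2", "node3"]))
    # -> node1 (holds 2 GB of the task's input locally)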
Maximize locality using multi-constraint graph partitioning (MCGP)
[Tanaka, CCGrid 2012]
  - Task scheduling based on MCGP can minimize data movement
  - Applied to the Pwrake workflow system and evaluated on the Montage workflow
  [Figure: with simple graph partitioning, parallel tasks are unbalanced among
   nodes; multi-constraint graph partitioning balances every parallel stage]
  - Data movement reduced by 86%
  - Execution time improved by 31%
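To illustrate the multi-constraint idea (a simplified stand-in, not the actual MCGP formulation): each task carries a weight vector with one component per workflow stage, and the assignment must balance every component across nodes. The greedy loop below shows only this balancing constraint; the real method also hands the data-flow edges to a graph partitioner so that the edge cut, i.e., data movement, is minimized.

    # Simplified stand-in for MCGP-style task assignment: balance every
    # workflow stage across nodes (edge-cut minimization is omitted here).
    def assign_tasks(tasks, num_nodes, num_stages):
        """tasks: list of (task_id, stage).  Returns {task_id: node}."""
        load = [[0] * num_stages for _ in range(num_nodes)]   # per-node, per-stage load
        assignment = {}
        for task_id, stage in tasks:
            # Pick the node with the least load in this task's stage, so the
            # parallel tasks of each stage stay spread across all nodes.
            node = min(range(num_nodes), key=lambda n: load[n][stage])
            load[node][stage] += 1
            assignment[task_id] = node
        return assignment

    # Example: two stages with 8 tasks each, spread over 4 nodes.
    tasks = [(f"s0_t{i}", 0) for i in range(8)] + [(f"s1_t{i}", 1) for i in range(8)]
    print(assign_tasks(tasks, num_nodes=4, num_stages=2))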
HPCI Shared Storage
  HPCI = High Performance Computing Infrastructure
    K, Hokkaido, Tohoku, Tsukuba, Tokyo, Titech, Nagoya, Kyoto, Osaka,
    Kyushu, RIKEN, JAMSTEC, AIST
  - A 20 PB Gfarm distributed file system consisting of East and West sites
  - Grid Security Infrastructure (GSI) for user identification
  - Parallel file replication among sites
  - Parallel file staging to/from each center
  [Figure: West site (AICS), 10 PB on 40 servers, and East site (U Tokyo),
   11.5 PB on 60 servers, each with redundant MDSs, connected by 10 (~40) Gbps
   links; picture courtesy of Hiroshi Harada (U Tokyo)]
Storage structure of HPCI Shared Storage
  Local file system
    - How to use: mv/cp, file staging
    - Objective: temporary space, I/O performance; no backup
  Global file system
    - How to use: mv/cp, file staging
    - Objective: persistent storage, capacity and reliability;
      backup copy kept on tape or disk
    - File system: Lustre, Panasas, GPFS, ...
  Wide-area distributed file system
    - How to use: mv/cp, file staging, Web I/F
    - Objective: data sharing, capacity and reliability, secured communication,
      fair share and easy to use; no backup, but files can be replicated
    - File system: Gfarm file system
  [Figure: remote clients and the HPCI Shared Storage]
Initial performance result
  File copy performance of 300 x 1 GB files to the HPCI Shared Storage,
  I/O bandwidth [MB/s]:
    Hokkaido   898
    Kyoto      847
    Tokyo      1,107
    AICS       1,073
Related systems
  XSEDE-Wide File System (GPFS)
    - Planned, but not in operation yet
  DEISA Global File System
    - Multicluster GPFS: RZG, LRZ, BSC, JSC, EPSS, HLRS, ...
    - Site name is included in the path name (no location transparency)
    - Files cannot be replicated across sites
  PRACE does not provide a global file system
    - Only a limited set of operating systems can mount it
    - PRACE does not assume the use of multiple sites
Summary
  Application I/O requirements
    - Computational science: scaled-out I/O performance up to O(1M) nodes
      (100 TB to 1 PB per hour)
    - Data-intensive science: data processing for 10 PB to 1 EB of data
      (>100 TB/s)
  File system, object store, OS, and runtime
    - R&D for a scale-out storage architecture
      - From a central storage cluster to a distributed storage cluster
      - Network-wide RAID
      - Scaled-out MDS
    - Runtime system for non-uniform storage access (NUSA)
      - Locality-aware process scheduling
    - Global file system