System Software for Big Data and Post Petascale Computing


The Japanese Extreme Big Data Workshop, February 26, 2014
System Software for Big Data and Post Petascale Computing
Osamu Tatebe, University of Tsukuba

I/O performance requirements for exascale applications
- Computational Science (Climate, CFD, ...): read initial data (100 TB to PB scale); write snapshot data (100 TB to PB scale) periodically
- Data Intensive Science (Particle Physics, Astrophysics, Life Science, ...): data analysis of 10 PB to EB scale experiment data

Scalable performance requirements for the parallel file system

  Year  FLOPS  #cores  IO BW     IOPS     Systems
  2008  1P     100K    100 GB/s  O(1K)    Jaguar, BG/P
  2011  10P    1M      1 TB/s    O(10K)   K, BG/Q
  2016  100P   10M     10 TB/s   O(100K)
  2020  1E     100M    100 TB/s  O(1M)

Performance target: IO bandwidth and IOPS are expected to scale out with the number of cores or nodes.

Technology trend
- HDD: performance does not improve much; roughly 300 MB/s and 5 W per drive in 2020, so reaching 100 TB/s with HDDs alone would cost on the order of 2 MW
- Flash and storage class memory: roughly 1 GB/s and 0.1 W per device in 2020, but cost and the limited number of write/erase cycles are concerns
- Interconnects: 62 GB/s (InfiniBand 4x HDR)
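As a back-of-the-envelope check of the power figure, the per-device numbers above imply roughly the following (a sketch only; the 2020 device parameters are the slide's projections):

```python
# Rough estimate of how many devices, and how much power, are needed to
# reach the 2020 target of 100 TB/s aggregate bandwidth, using the
# per-device projections quoted on the slide.
TARGET_BW = 100e12  # bytes/s (100 TB/s)

devices = {
    "HDD":       {"bw_bytes": 300e6, "watts": 5.0},   # 300 MB/s, 5 W
    "Flash/SCM": {"bw_bytes": 1e9,   "watts": 0.1},   # 1 GB/s, 0.1 W
}

for name, d in devices.items():
    count = TARGET_BW / d["bw_bytes"]   # devices needed for bandwidth alone
    power_mw = count * d["watts"] / 1e6
    print(f"{name:10s}: ~{count:,.0f} devices, ~{power_mw:.2f} MW")

# HDD:        ~333,333 devices, ~1.67 MW  -> the "O(2 MW)" figure
# Flash/SCM:  ~100,000 devices, ~0.01 MW
```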

Current parallel file system
- A central storage array, installed separately from the compute nodes
- The network bandwidth between the compute nodes and the storage must be scaled up in order to scale out the I/O performance
(Figure: MDS and storage array serving the compute nodes (clients) across a network bandwidth limitation)

Remember the memory architectures
(Figure: shared memory, where CPUs share one memory, versus distributed memory, where each CPU has its own memory)

Scaled-out parallel file system
- Storage is distributed in the compute nodes
- I/O performance scales out by accessing nearby storage, unless metadata performance becomes the bottleneck
- Access to nearby storage mitigates the network bandwidth requirement
- The performance may be non-uniform
(Figure: MDS cluster and compute nodes (clients) with storage attached to each compute node)

Example of a scale-out storage architecture (a snapshot of roughly 3 years later)
- Per IO node: CPU (2 sockets x 2.0 GHz x 16 cores x 32 FPU), memory, a 62 GB/s chipset, InfiniBand HDR, and 16 x 1 TB local storage devices over 12 Gbps SAS x 16, i.e. 19.2 GB/s and 16 TB per node
- x 500 nodes: 9.6 TB/s and 8 PB; x 10 groups: 96 TB/s and 80 PB, with 5,000 IO nodes and 10 MDSs
- Non-uniform but scale-out storage; R&D of the system software stack is required to achieve maximum I/O performance for data-intensive science
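The aggregate figures follow from the per-node numbers; a small sketch of the scaling arithmetic (the 1.2 GB/s effective rate per 12 Gbps SAS lane is inferred from the slide's own 19.2 GB/s per-node figure):

```python
# Scaling arithmetic for the example architecture: per-node bandwidth and
# capacity are aggregated over 500-node groups and then over 10 groups,
# reproducing the slide's 96 TB/s and 80 PB system totals.
LANE_GBS = 1.2        # effective GB/s per 12 Gbps SAS lane (per the slide)
LANES_PER_NODE = 16
DRIVE_TB = 1          # 1 TB local storage device
DRIVES_PER_NODE = 16

node_bw_gbs = LANE_GBS * LANES_PER_NODE        # 19.2 GB/s per node
node_cap_tb = DRIVE_TB * DRIVES_PER_NODE       # 16 TB per node

NODES_PER_GROUP = 500
GROUPS = 10

group_bw_tbs = node_bw_gbs * NODES_PER_GROUP / 1000    # 9.6 TB/s
group_cap_pb = node_cap_tb * NODES_PER_GROUP / 1000    # 8 PB

print(f"per node : {node_bw_gbs:.1f} GB/s, {node_cap_tb} TB")
print(f"per group: {group_bw_tbs:.1f} TB/s, {group_cap_pb:.0f} PB")
print(f"system   : {group_bw_tbs * GROUPS:.0f} TB/s, "
      f"{group_cap_pb * GROUPS:.0f} PB across {NODES_PER_GROUP * GROUPS} IO nodes")
```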

Challenges
- File system (object store): from a central storage cluster to a distributed storage cluster; a parallel file system scaled out to O(1M) clients; scaled-out MDS performance
- Compute node OS: reduction of OS noise; cooperative caching
- Runtime system: optimization for non-uniform storage access (NUSA)
- Global storage for sharing exabyte-scale data among machines

Scaled-out parallel file systems that federate the local storage of compute nodes
- Special purpose: Google file system [SOSP 03], Hadoop file system (HDFS)
- POSIX(-like): Gfarm file system [CCGrid 02, NGC 10]

Scaled-out MDS
- GIGA+ [Swapnil Patil et al., FAST 11]: incremental directory partitioning; independent locking in each partition
- skyfs [Jing Xing et al., SC 09]: improves performance during directory partitioning compared with GIGA+
- Lustre: MT scalability in 2.x
- Proposed clustered MDS, PPMDS [our JST CREST R&D]: shared-nothing KV stores; nonblocking software transactional memory (no locks)

  System      IOPS (file creates/sec)  #MDS (#cores)
  GIGA+       98K                      32 (256)
  skyfs       100K                     32 (512)
  Lustre 2.4  80K                      1 (16)
  PPMDS       270K                     15 (240)
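A minimal sketch of the nonblocking approach (not the PPMDS implementation itself): metadata is hash-partitioned over shared-nothing key-value servers, and updates commit with an optimistic, versioned compare-and-swap retry loop instead of taking locks. All names below are illustrative, and a real multi-server transaction that touches both a directory entry and an inode is correspondingly more involved.

```python
# Illustrative sketch of lock-free metadata updates over shared-nothing
# key-value servers, in the spirit of PPMDS's nonblocking software
# transactional memory. The classes here are hypothetical stand-ins.
import hashlib

class KVServer:
    """One shared-nothing metadata server: a versioned key-value store."""
    def __init__(self):
        self.store = {}                    # key -> (version, value)

    def read(self, key):
        return self.store.get(key, (0, None))

    def cas(self, key, expected_version, value):
        """Atomic compare-and-swap on the version; True on success."""
        version, _ = self.store.get(key, (0, None))
        if version != expected_version:
            return False                   # another client committed first
        self.store[key] = (version + 1, value)
        return True

def server_for(servers, key):
    """Hash-partition keys across the shared-nothing servers."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return servers[h % len(servers)]

def create_file(servers, dir_path, name):
    """Optimistically add `name` to the directory, retrying on conflict."""
    srv = server_for(servers, dir_path)
    while True:                            # nonblocking: retry instead of locking
        version, entries = srv.read(dir_path)
        entries = list(entries or [])
        if name in entries:
            return False                   # already exists
        if srv.cas(dir_path, version, entries + [name]):
            return True                    # update committed

servers = [KVServer() for _ in range(4)]
print(create_file(servers, "/data", "snapshot-000.dat"))  # True
```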

Development of Pwrake: a data-intensive workflow system
- Pwrake = workflow system based on Rake (Ruby make); worker processes on the compute nodes are started over SSH and access data on the Gfarm file system
- I/O-aware task scheduling:
  - Locality-aware scheduling: compute nodes are selected according to the location of the input files
  - Buffer cache-aware scheduling: a modified LIFO order to ease the trailing task problem
- Workflow elapsed time with 900 GB of file I/O on 10 nodes, comparing naive, locality-aware, and locality- plus cache-aware scheduling: locality awareness gives a 42% speedup, and cache awareness an additional 23% speedup
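A simplified sketch of the locality-aware selection idea (not Pwrake's actual scheduler): prefer a node that already holds a replica of the task's input file, falling back to the least-loaded node. The `replica_nodes` mapping stands in for a Gfarm replica-location query.

```python
# Sketch of locality-aware task dispatch: prefer a compute node that
# already holds a replica of the task's input file, otherwise use the
# least-loaded node. `replica_nodes` is a hypothetical stand-in for a
# Gfarm replica-location lookup.
from collections import defaultdict

def pick_node(input_file, nodes, running, replica_nodes):
    """Choose a compute node for a task that reads `input_file`.

    nodes         -- candidate compute-node hostnames
    running       -- dict: hostname -> number of tasks currently assigned
    replica_nodes -- dict: file path -> hostnames holding a replica
    """
    local = [n for n in replica_nodes.get(input_file, []) if n in nodes]
    candidates = local or nodes                 # no local replica: any node
    return min(candidates, key=lambda n: running.get(n, 0))

# Example: 3 of 10 nodes hold a replica of the input, so the task goes
# to a lightly loaded one of those 3.
nodes = [f"node{i:02d}" for i in range(10)]
running = defaultdict(int, {"node03": 2})
replicas = {"gfarm:/data/file1": ["node03", "node05", "node07"]}
print(pick_node("gfarm:/data/file1", nodes, running, replicas))  # node05
```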

Maximizing locality using multi-constraint graph partitioning (MCGP) [Tanaka, CCGrid 2012]
- Task scheduling based on MCGP can minimize data movement
- Applied to the Pwrake workflow system and evaluated on the Montage workflow
- With simple graph partitioning, parallel tasks are unbalanced among nodes; multi-constraint graph partitioning keeps the tasks of each stage balanced across nodes
- Data movement reduced by 86%; execution time improved by 31%
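A sketch of what "multi-constraint" means here: each task carries one weight per workflow stage, and the partition must balance every component of the summed weight vector while cutting few data-dependency edges. The greedy pass below only illustrates that constraint; it is not the partitioner used in the paper.

```python
# Toy illustration of multi-constraint graph partitioning for workflow
# tasks: each task has a weight vector (one entry per stage), every
# per-stage load must stay balanced across compute nodes, and data
# dependencies are kept within a node where possible.

def partition(tasks, edges, n_parts):
    """tasks: dict task_id -> weight vector (one entry per stage)
    edges: list of (task_a, task_b, data_size) dependencies
    Returns dict task_id -> partition index."""
    n_stages = len(next(iter(tasks.values())))
    load = [[0] * n_stages for _ in range(n_parts)]   # per-part, per-stage load
    assign = {}
    neighbors = {}
    for a, b, w in edges:
        neighbors.setdefault(a, []).append((b, w))
        neighbors.setdefault(b, []).append((a, w))

    for t, wvec in tasks.items():
        def cost(p):
            # Per-stage imbalance if t joins part p, minus the weight of
            # dependencies already placed on p (locality gain).
            imbalance = max(load[p][s] + wvec[s] for s in range(n_stages))
            gain = sum(w for nb, w in neighbors.get(t, []) if assign.get(nb) == p)
            return imbalance - gain
        best = min(range(n_parts), key=cost)
        assign[t] = best
        for s in range(n_stages):
            load[best][s] += wvec[s]
    return assign

# Two stages (e.g. projection then co-addition in Montage), four tasks,
# two nodes: each node gets one task of each stage, with no cut edges.
tasks = {"p1": [1, 0], "p2": [1, 0], "a1": [0, 1], "a2": [0, 1]}
edges = [("p1", "a1", 10), ("p2", "a2", 10)]
print(partition(tasks, edges, 2))   # {'p1': 0, 'p2': 1, 'a1': 0, 'a2': 1}
```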

HPCI Shared Storage
- HPCI (High Performance Computing Infrastructure): K computer, Hokkaido, Tohoku, Tsukuba, Tokyo, Titech, Nagoya, Kyoto, Osaka, Kyushu, RIKEN, JAMSTEC, AIST
- A 20 PB Gfarm distributed file system consisting of East and West sites
- Grid Security Infrastructure (GSI) for user identification
- Parallel file replication among the sites
- Parallel file staging to and from each center
- West site (AICS): MDSs and 10 PB of storage (40 servers); East site (U Tokyo): MDSs and 11.5 PB of storage (60 servers); connected by 10 (~40) Gbps
(Picture courtesy of Hiroshi Harada, U Tokyo)
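The inter-site replication is driven in parallel over many files; a toy sketch of that pattern follows (the helper and file paths are hypothetical, standing in for Gfarm's actual replication mechanism):

```python
# Toy sketch of parallel file replication between the two sites: many
# independent per-file transfers are issued concurrently so the aggregate
# can fill the inter-site link. `replicate_one` is a placeholder for the
# real transfer mechanism (e.g. Gfarm file replication).
from concurrent.futures import ThreadPoolExecutor

def replicate_one(path, dest_site):
    # Placeholder: a real implementation would invoke Gfarm's replication
    # machinery for `path`; here we only record what would be transferred.
    return f"replicated {path} -> {dest_site}"

def replicate_all(paths, dest_site, parallelism=16):
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return list(pool.map(lambda p: replicate_one(p, dest_site), paths))

results = replicate_all([f"/hpci/data/run-{i:03d}.dat" for i in range(300)], "east")
print(len(results), "files replicated")
```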

Storage structure of HPCI Shared Storage
- Local file system: temporary space aimed at I/O performance; no backup
- Global file system (Lustre, Panasas, GPFS, ...): persistent storage aimed at capacity and reliability; backup copies are kept on tape or disk; data moves to and from the local file system by mv/cp or file staging
- Wide-area distributed file system (Gfarm file system, i.e. the HPCI Shared Storage): data sharing with capacity and reliability, secured communication, fair share, and ease of use; no backup, but files can be replicated; data moves to and from the global file system by mv/cp, file staging, or a Web interface, and remote clients can also access it

Initial performance result
File copy performance of 300 x 1 GB files to the HPCI Shared Storage (I/O bandwidth in MB/s): Hokkaido 898, Kyoto 847, Tokyo 1,107, AICS 1,073

Related systems
- XSEDE-Wide File System (GPFS): planned, but not yet in operation
- DEISA Global File System: multicluster GPFS across RZG, LRZ, BSC, JSC, EPSS, HLRS, ...; the site name is included in the path name, so there is no location transparency and files cannot be replicated across sites
- PRACE does not provide a global file system: there are limitations on the operating systems that can mount one, and PRACE does not assume the use of multiple sites

Summary
- Application I/O requirements
  - Computational Science: scaled-out I/O performance up to O(1M) nodes (100 TB to 1 PB per hour)
  - Data Intensive Science: data processing of 10 PB to 1 EB of data (>100 TB/s)
- File system, object store, OS, and runtime: R&D for a scale-out storage architecture
  - From a central storage cluster to a distributed storage cluster
  - Network-wide RAID
  - Scaled-out MDS
  - Runtime system for non-uniform storage access (NUSA): locality-aware process scheduling
  - Global file system