LCOC: Lustre Cache on Client

1 LCOC: Lustre Cache on Client
Xue Wei, The National Supercomputing Center, Wuxi, China
Li Xi, DDN Storage

2 NSCC-Wuxi and the Sunway Machine Family
Sunway-I: CMA service, 1998; commercial chip; 0.384 Tflops; 48th in the TOP500
Sunway BlueLight: NSCC-Jinan, 2011; 16-core processor; 1 Pflops; 14th in the TOP500
Sunway TaihuLight: NSCC-Wuxi, 2016; 260-core processor; 125 Pflops; 1st in the TOP500
The LCOC project is a collaboration between NSCC-Wuxi and DDN.

3 Sunway TaihuLight in NSCC-Wuxi: a 10M-Core System
[Figure: system hierarchy from cores to core-groups to 260-core chips to racks; total number of cores = 40 racks x 1,024 chips per rack x 260 cores per chip; 163,840 processes x 65 threads]

4 I/O Architecture of Sunway TaihuLight
A cache on the I/O forwarding nodes (the Lustre clients) should be helpful.

5 Why an SSD cache on the Lustre client?
Less overhead visible to applications:
o Less network latency
o No LDLM locking or other Lustre overhead
Easier to optimize for the best performance:
o The I/O stack is much simpler
o No interfering I/Os from other clients
Fewer requirements on hardware:
o Any kind of SSD can be used as the cache device
Reduces the pressure on the OSTs:
o Small or random I/Os are regularized into big sequential I/Os
o Temporary files do not need to be flushed to the OSTs
Relatively easier than a server-side implementation:
o Write support for an SSD cache on the server side is very difficult
o Problems for a write cache on the server side: visibility when a failover happens, consistency when corruption happens

6 Design of LCOC (1)
LCOC provides a group of local caches:
o Each client has its own local cache based on an SSD
o No global namespace is provided by LCOC; data in a local cache cannot be accessed by other clients directly
o A local file system is used to manage the data in the local caches
o Cached I/O is directed to the local file system, while normal I/O is directed to the OSTs
LCOC uses HSM for data synchronization:
o The HSM copytool restores files from the local caches to the Lustre OSTs
o Remote access from another Lustre client triggers this data synchronization
o Each LCOC client runs a copytool instance with a unique archive number
o If a client with LCOC goes offline, its cached data becomes temporarily inaccessible to other clients; this is fine, since it is a local cache
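
To make the per-client copytool and archive-number scheme concrete, here is a minimal, hypothetical setup sketch in Python. It assumes the HSM coordinator has already been enabled on the MDS (lctl set_param mdt.*.hsm_control=enabled) and uses the stock lhsmtool_posix copytool as a stand-in for LCOC's own copytool; the mount point, cache directory, environment variable, and archive-ID assignment are illustrative, not taken from the talk.

    import os
    import subprocess

    LUSTRE_MOUNT = "/mnt/lustre"   # assumed Lustre client mount point
    CACHE_ROOT = "/lcoc/cache"     # assumed ext4 file system on the local SSD
    # Each LCOC client needs a unique archive number (Lustre supports up to 32);
    # here it is simply taken from an environment variable set per node.
    ARCHIVE_ID = int(os.environ.get("LCOC_ARCHIVE_ID", "1"))

    def start_copytool():
        # The real LCOC copytool moves data between the local cache and the OSTs;
        # the standard POSIX copytool is used here only to illustrate the wiring.
        return subprocess.Popen([
            "lhsmtool_posix", "--daemon",
            "--hsm_root", CACHE_ROOT,
            "--archive=" + str(ARCHIVE_ID),
            LUSTRE_MOUNT,
        ])

    if __name__ == "__main__":
        start_copytool()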

7 Design of LCOC (2)
When a file is created on LCOC:
o A normal file is created on the MDT
o An empty mirror file is created in the local cache
o The HSM status of the Lustre file is set to archived and released
o The archive number is set to the proper value
When a file is prefetched to LCOC:
o A mirror file is copied into the local cache
o The HSM status of the Lustre file is set to archived and released
o The archive number is set to the proper value
When a file is accessed from LCOC:
o Data is read directly from the local cache
o Metadata is read from the MDT, except for the file size
o The file size is obtained from the local cache
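
As an illustration of the creation flow just described, the sketch below creates the Lustre file, creates an empty mirror in the local cache, and then marks the Lustre copy as archived under this client's archive number before releasing it. The path-based mirror layout and the helper names are assumptions; only the lfs hsm_set and lfs hsm_release commands are real Lustre tooling.

    import os
    import subprocess

    LUSTRE_MOUNT = "/mnt/lustre"
    CACHE_ROOT = "/lcoc/cache"
    ARCHIVE_ID = 1                      # this client's unique archive number

    def cache_path(lustre_path):
        # Mirror layout inside the local ext4 cache (an assumption; LCOC may
        # key the mirror by FID rather than by path).
        rel = os.path.relpath(lustre_path, LUSTRE_MOUNT)
        return os.path.join(CACHE_ROOT, rel)

    def create_on_lcoc(lustre_path):
        open(lustre_path, "w").close()                 # normal file on the MDT
        mirror = cache_path(lustre_path)
        os.makedirs(os.path.dirname(mirror), exist_ok=True)
        open(mirror, "w").close()                      # empty mirror in the local cache
        # Mark the Lustre file as archived under our archive ID, then release it,
        # exactly as the slide describes.
        subprocess.run(["lfs", "hsm_set", "--exists", "--archived",
                        "--archive-id", str(ARCHIVE_ID), lustre_path], check=True)
        subprocess.run(["lfs", "hsm_release", lustre_path], check=True)
        return mirror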

8 Architecture of LCOC
[Figure: the policy engine distributes cache policies to an LCOC switcher on each client; the switcher directs cached I/O to the local SSD and normal I/O through the OSCs to the OSTs; each client runs its own copytool (#1, #2, #3) handling prefetch and restore between the SSD cache and the OSTs; control flow and data flow are shown separately]
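
A rough sketch of the switcher decision follows: I/O is routed to the local mirror only when the Lustre file is released into this client's archive, and otherwise follows the normal Lustre path. Parsing the output of lfs hsm_state for the "released" flag and the "archive_id" field relies on the usual output format and should be treated as an assumption; the constants are illustrative.

    import os
    import subprocess

    LUSTRE_MOUNT = "/mnt/lustre"
    CACHE_ROOT = "/lcoc/cache"
    ARCHIVE_ID = 1

    def is_cached_here(lustre_path):
        # "lfs hsm_state" prints the HSM flags and the archive number of a file.
        out = subprocess.run(["lfs", "hsm_state", lustre_path],
                             capture_output=True, text=True, check=True).stdout
        return "released" in out and ("archive_id:%d" % ARCHIVE_ID) in out

    def route(lustre_path):
        # Cached I/O -> the ext4 mirror on the local SSD; normal I/O -> the OSTs.
        if is_cached_here(lustre_path):
            return os.path.join(CACHE_ROOT, os.path.relpath(lustre_path, LUSTRE_MOUNT))
        return lustre_path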

9 Data management of LCOC
The policy engine manages the data movement from the local caches to the OSTs.
The policy engine prefetches data if necessary. Possible conditions to prefetch a file:
o High access heat is detected on that file
o The file is going to be accessed soon (e.g. a job is starting)
o An explicit hint is given by applications or users (e.g. lfs ladvise)
The policy engine performs an HSM restore to flush data according to the defined policies. Possible conditions to shrink a file from the cache:
o The cache is becoming full
o The file size has grown too big to be cached
o Low access heat is detected on the file in the cache
o The file will not be accessed any more for some time (e.g. a job is stopping)
o An explicit hint is given by applications or users (e.g. lfs ladvise)
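
The sketch below shows one hypothetical pass of such a policy engine: it prefetches files whose access heat is high while the cache has room, and flushes cold files (or makes room) with an HSM restore, which makes the copytool copy the data back to the OSTs. The heat metric, thresholds, and prefetch callback are assumptions; lfs hsm_restore is the real command.

    import shutil
    import subprocess

    CACHE_ROOT = "/lcoc/cache"
    HEAT_PREFETCH = 100        # accesses per interval above which a file is prefetched
    HEAT_EVICT = 5             # accesses per interval below which a file is evicted
    MIN_FREE_BYTES = 10 << 30  # keep at least 10 GiB free in the local cache

    def policy_pass(file_heat, prefetch_to_cache):
        """file_heat: {lustre_path: access_heat}; prefetch_to_cache: callable."""
        free = shutil.disk_usage(CACHE_ROOT).free
        for path, heat in file_heat.items():
            if heat >= HEAT_PREFETCH and free > MIN_FREE_BYTES:
                # Copy into the SSD cache and mark the file archived + released.
                prefetch_to_cache(path)
            elif heat <= HEAT_EVICT or free <= MIN_FREE_BYTES:
                # Flushing is an HSM restore: the copytool copies the data from
                # the local cache back to the OSTs, after which the cache copy
                # can be dropped.
                subprocess.run(["lfs", "hsm_restore", path], check=True)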

10 Limitations
Not all applications can be accelerated by LCOC:
o Application I/O must have locality: applications should not access a cached file through multiple clients
o However, no inconsistency occurs even if an application writes the cached file from a remote client
The capacity of each local cache is limited:
o The size of a cached file is limited by the available space of the local cache
o The total amount of cached data on a single client is limited
Files cannot be partly cached:
o Partial caching could be implemented if HSM supported partial archive/restore
The number of LCOC clients is limited to 32:
o Only 32 different archive numbers are supported by Lustre
o This upper limit can be raised in the future

11 Extension: Read-only replicas
Read-only replicas are cached on multiple local caches:
o A replica on LCOC is identical to the data on the OSTs
o A new global flag, lcoc_cached, indicates whether any local replica of a file exists
o Replicas of files without the lcoc_cached flag are cleared
I/O on a client with an LCOC replica:
o Read: the file data comes from the cache if lcoc_cached is set, or from the OSTs if lcoc_cached is cleared
o Write: the modification is applied directly to the data on the OSTs, and the lcoc_cached flag is cleared
I/O on a client without an LCOC replica:
o Read: data is read from the OSTs directly
o Write: the lcoc_cached flag is cleared
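
Purely for illustration, the sketch below encodes these read/write rules as if the lcoc_cached flag were visible to user space as an extended attribute named user.lcoc_cached; in the proposed design it is a global Lustre flag, so the xattr name and the helper functions here are assumptions.

    import os

    LCOC_XATTR = "user.lcoc_cached"

    def lcoc_cached(path):
        try:
            return os.getxattr(path, LCOC_XATTR) == b"1"
        except OSError:
            return False

    def read_path(lustre_path, local_replica):
        # Read from the local replica only while the global flag says it is valid.
        return local_replica if lcoc_cached(lustre_path) else lustre_path

    def write_through(lustre_path, data):
        # Writes always go straight to the OSTs and invalidate every replica
        # by clearing the flag.
        with open(lustre_path, "r+b") as f:
            f.write(data)
        try:
            os.removexattr(lustre_path, LCOC_XATTR)
        except OSError:
            pass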

12 I/O Pattern Detector and Job Scheduler for LCOC
The I/O pattern detector finds applications suitable for LCOC:
o The jobstat ID is used to distinguish I/O from different jobs
o The type, timestamp, size, offset, FID, and job ID of I/Os are recorded on each client and sent to a global detector
o The global detector finds FIDs with cross-client I/O and sends them back to the I/O monitors on all clients
o The detector generates a description of the I/O patterns of each job
The LCOC-aware scheduler:
o Considers LCOC usage as part of the constraints when scheduling jobs, so that concurrent jobs do not cause contention for or exhaustion of LCOC
o Gives hints for LCOC cache management: which files should be prefetched into the cache, whether a newly created file should be cached, which client should cache the file, and when a file should be swapped out of the cache
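
A toy version of the global detector is sketched below: it aggregates the per-client I/O records (with the fields listed above) and flags any FID seen from more than one client as unsuitable for LCOC. The record and function names are illustrative.

    from collections import defaultdict
    from dataclasses import dataclass

    @dataclass
    class IORecord:
        client: str
        job_id: str
        fid: str
        io_type: str      # "read" or "write"
        offset: int
        size: int
        timestamp: float

    def find_unsuitable_fids(records):
        clients_per_fid = defaultdict(set)
        for rec in records:
            clients_per_fid[rec.fid].add(rec.client)
        # A FID touched from multiple clients breaks LCOC's locality assumption.
        return {fid for fid, clients in clients_per_fid.items() if len(clients) > 1}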

13 I/O Pattern Detector and Job Scheduler for LCOC
[Figure: I/O monitors on the Lustre clients feed the I/O pattern detector, which returns the list of accessed FIDs and the list of unsuitable FIDs; job descriptions, manual optimization, job information from the job scheduler, and cache information all feed the policy engine of LCOC]
Example job description:
# General information of estimated cache size needed
GENERAL: NEED 2GB ON rank0, 2GB ON rank1
# When the job starts, file a should be fetched to rank 0, needing 2GB of cache
RULE 1: IF job_starts, FETCH a ON rank0, SIZE 2GB;
# When file b is generated, file a should be swapped out
RULE 2: IF b_exists, SHRINK a, SIZE 2GB;
# When file c is generated on rank 1, it should be cached
RULE 3: CACHE c ON rank1, SIZE 2GB;
# If file d is generated, RULE 3 should be disabled
RULE 4: IF d_exists, DISABLE 3;
# If file e is generated, RULE 3 should be enabled
RULE 5: IF e_exists, ENABLE 3;
# When the job finishes, files a and c should be swapped out
RULE 6: IF job_ends, SHRINK a, SIZE 2GB;
RULE 7: IF job_ends, SHRINK c, SIZE 2GB;
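
The rule syntax above is only an example in the talk; a toy parser for it might look like the following, where the assumed grammar (RULE n: [IF event,] ACTION target [ON rank][, SIZE size];) is inferred from the examples rather than taken from any documented LCOC format.

    import re

    RULE_RE = re.compile(
        r"RULE\s+(?P<num>\d+):\s*(?:IF\s+(?P<event>\w+),\s*)?"
        r"(?P<action>FETCH|SHRINK|CACHE|DISABLE|ENABLE)\s+(?P<arg>\w+)"
        r"(?:\s+ON\s+(?P<rank>\w+\s*\d*))?(?:,\s*SIZE\s+(?P<size>\w+))?;?"
    )

    def parse_rule(line):
        # Returns a dict of the rule's fields, or None for non-rule lines
        # (e.g. the GENERAL line or comments).
        m = RULE_RE.match(line.strip())
        return m.groupdict() if m else None

    # Example:
    # parse_rule("RULE 1: IF job_starts, FETCH a ON rank0, SIZE 2GB;")
    # -> {'num': '1', 'event': 'job_starts', 'action': 'FETCH', 'arg': 'a',
    #     'rank': 'rank0', 'size': '2GB'}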

14 Benchmark results
Setup:
o LCOC uses Ext4 (Samsung SSD 850 EVO 500GB) as the local cache
o The Lustre OST is based on a single SSD (Intel 535 Series)
o The network is Gigabit Ethernet
o Benchmark: dd is used to write/read 32GB of data with different I/O sizes, running the same command against the different levels of the storage (Ext4 of LCOC, LCOC, Lustre client, OST ldiskfs)
[Figure: write and read throughput in MB/s (up to 600) versus I/O size from 128 bytes to 4 MB for Ext4 of LCOC, LCOC, Lustre client, and OST ldiskfs; the overhead of LCOC over raw Ext4 is minimal for writes; for reads the speedup of LCOC is obvious (about 4x), and the latency of the network is significant for the plain Lustre client]
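
A sketch of the benchmark driver follows: it runs dd to write and then read 32GB of data at each I/O size against one target path (the ext4 cache, an LCOC-cached Lustre file, a plain Lustre file, or the OST's ldiskfs). The target paths are assumptions; conv=fsync is used so that write rates include flushing to the device.

    import subprocess

    TOTAL_BYTES = 32 * 1024**3
    IO_SIZES = [128, 256, 512] + [2**k * 1024 for k in range(0, 13)]   # 128 B .. 4 MB

    def run_write(target_file, bs):
        count = TOTAL_BYTES // bs
        subprocess.run(["dd", "if=/dev/zero", "of=" + target_file,
                        "bs=%d" % bs, "count=%d" % count, "conv=fsync"], check=True)

    def run_read(target_file, bs):
        # In a real run the page cache should be dropped before reading, so the
        # data actually comes from the storage level under test.
        count = TOTAL_BYTES // bs
        subprocess.run(["dd", "if=" + target_file, "of=/dev/null",
                        "bs=%d" % bs, "count=%d" % count], check=True)

    for bs in IO_SIZES:
        run_write("/mnt/lustre/lcoc_bench.dat", bs)   # assumed target path
        run_read("/mnt/lustre/lcoc_bench.dat", bs)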

15 Summary
We designed and implemented a novel client-side cache (LCOC) for Sunway TaihuLight.
Small-scale benchmarks show that LCOC is able to accelerate I/O.
Large-scale benchmarks and tests will be carried out at NSCC-Wuxi soon.

16 Thank you!