Toward An Integrated Cluster File System

Similar documents
File System Internals. Jo, Heeseung

Kerrighed: A SSI Cluster OS Running OpenMP

an Object-Based File System for Large-Scale Federated IT Infrastructures

Computer Systems Laboratory Sungkyunkwan University

Filesystem. Disclaimer: some slides are adopted from book authors slides with permission

An Overview of Fujitsu s Lustre Based File System

File System Internals. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Current Topics in OS Research. So, what s hot?

W4118 Operating Systems. Instructor: Junfeng Yang

File System Case Studies. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Structuring PLFS for Extensibility

Single System Image OS for Clusters: Kerrighed Approach

CA485 Ray Walshe Google File System

Feedback on BeeGFS. A Parallel File System for High Performance Computing

Distributed File Systems II

Object based Storage Cluster File Systems & Parallel I/O

Chapter 11: Implementing File Systems

COS 318: Operating Systems. NSF, Snapshot, Dedup and Review

NPTEL Course Jan K. Gopinath Indian Institute of Science

Chapter 12: File System Implementation

HPC File Systems and Storage. Irena Johnson University of Notre Dame Center for Research Computing

DELL EMC ISILON F800 AND H600 I/O PERFORMANCE

Red Hat Enterprise 7 Beta File Systems

Distributed File Systems

OPERATING SYSTEM. Chapter 12: File System Implementation

University of Wisconsin-Madison

System that permanently stores data Usually layered on top of a lower-level physical storage medium Divided into logical units called files

NPTEL Course Jan K. Gopinath Indian Institute of Science

Operating Systems. Lecture File system implementation. Master of Computer Science PUF - Hồ Chí Minh 2016/2017

Chapter 11: Implementing File

Chapter 11: Implementing File Systems. Operating System Concepts 9 9h Edition

An Overview of The Global File System

Introduction to High Performance Parallel I/O

Intel Enterprise Edition Lustre (IEEL-2.3) [DNE-1 enabled] on Dell MD Storage

Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand

Filesystem. Disclaimer: some slides are adopted from book authors slides with permission 1

Da-Wei Chang CSIE.NCKU. Professor Hao-Ren Ke, National Chiao Tung University Professor Hsung-Pin Chang, National Chung Hsing University

Data Management. Parallel Filesystems. Dr David Henty HPC Training and Support

Mass-Storage Structure

The Google File System

A GPFS Primer October 2005

Secure Block Storage (SBS) FAQ

XtreemStore A SCALABLE STORAGE MANAGEMENT SOFTWARE WITHOUT LIMITS YOUR DATA. YOUR CONTROL

Technical Computing Suite supporting the hybrid system

COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters

Cloud Computing CS

CS 4284 Systems Capstone

IBM V7000 Unified R1.4.2 Asynchronous Replication Performance Reference Guide

Chapter 12: File System Implementation

The Google File System

CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart

OCFS2 Mark Fasheh Oracle

A Global Operating System for HPC Clusters

Chapter 11: File System Implementation. Objectives

The MOSIX Scalable Cluster Computing for Linux. mosix.org

AN OVERVIEW OF DISTRIBUTED FILE SYSTEM Aditi Khazanchi, Akshay Kanwar, Lovenish Saluja

Distributed Storage with GlusterFS

Parallel File Systems Compared

Lustre A Platform for Intelligent Scale-Out Storage

Write a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical

Ceph. The link between file systems and octopuses. Udo Seidel. Linuxtag 2012

Distributed System. Gang Wu. Spring,2018

Discriminating Hierarchical Storage (DHIS)

File System Case Studies. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Overview Job Management OpenVZ Conclusions. XtreemOS. Surbhi Chitre. IRISA, Rennes, France. July 7, Surbhi Chitre XtreemOS 1 / 55

COS 318: Operating Systems. Journaling, NFS and WAFL

Red Hat Global File System

XtreemFS: highperformance network file system clients and servers in userspace. Minor Gordon, NEC Deutschland GmbH

Shared Parallel Filesystems in Heterogeneous Linux Multi-Cluster Environments

Chapter 10: File System Implementation

Finding a Needle in a Haystack. Facebook s Photo Storage Jack Hartner

ZEST Snapshot Service. A Highly Parallel Production File System by the PSC Advanced Systems Group Pittsburgh Supercomputing Center 1

CSE 451: Operating Systems Winter Redundant Arrays of Inexpensive Disks (RAID) and OS structure. Gary Kimura

Is Virtualization Killing SSI Research? Jérôme Gallard Kerrighed Summit Paris February 2008

XtreemFS a case for object-based storage in Grid data management. Jan Stender, Zuse Institute Berlin

Parallel File Systems for HPC

What is a file system

Experience the GRID Today with Oracle9i RAC

1Z0-433

The Google File System

GFS: The Google File System

COMP520-12C Final Report. NomadFS A block migrating distributed file system

Enhancing Lustre Performance and Usability

Lustre * Features In Development Fan Yong High Performance Data Division, Intel CLUG

The File Systems Evolution. Christian Bandulet, Sun Microsystems

Today: Coda, xfs! Brief overview of other file systems. Distributed File System Requirements!

Planning & Installing a RAC Database

File Systems 1. File Systems

File Systems 1. File Systems

Cost-Effective Virtual Petabytes Storage Pools using MARS. LCA 2018 Presentation by Thomas Schöbel-Theuer

libhio: Optimizing IO on Cray XC Systems With DataWarp

CSE 153 Design of Operating Systems

Coordinating Parallel HSM in Object-based Cluster Filesystems

An Introduction to GPFS

INTEGRATING HPFS IN A CLOUD COMPUTING ENVIRONMENT

OPERATING SYSTEMS II DPL. ING. CIPRIAN PUNGILĂ, PHD.

DISTRIBUTED SYSTEMS [COMP9243] Lecture 9b: Distributed File Systems INTRODUCTION. Transparency: Flexibility: Slide 1. Slide 3.

Chapter 12: File System Implementation. Operating System Concepts 9 th Edition

Mission-Critical Lustre at Santos. Adam Fox, Lustre User Group 2016

Object Storage: Redefining Bandwidth for Linux Clusters

Transcription:

Toward An Integrated Cluster File System Adrien Lebre February 1 st, 2008 XtreemOS IP project is funded by the European Commission under contract IST-FP6-033576

Outline Context Kerrighed and root file system Parallel file system vs Symmetric file system kdfs, kernel Distributed File System Building a distributed FS upon kddm mechanisms Architecture overview Performance kdfs, integration with other cluster services Scheduling policies, checkpoint/restart, hotplug,... kdfs, conclusion, future work kdfs, A Storage Building Block, February 1, 2008 1/12

Kerrighed and The Root File System Background A cluster, generally based on the historical model : compute vs storage nodes Lot of works have been done (Parallel FS, NAS, SAN,... ) Inefficient use of disk capabilities (space and throughput) SMP 1 SMP 2 I/O server 1 Compute nodes SMP 3 interconnection I/O server 2 Storage nodes SMP 4 I/O server p SMP n kdfs, A Storage Building Block, February 1, 2008 2/12

Kerrighed and The Root File System Background A cluster, generally based on the historical model : compute vs storage nodes Lot of works have been done (Parallel FS, NAS, SAN,... ) Inefficient use of disk capabilities (space and throughput) SMP 1 SMP 2 Compute nodes SMP 3 interconnection I/O server (NFS) Storage node SMP 4 SMP n SSI kdfs, A Storage Building Block, February 1, 2008 2/12

Kerrighed and The Root File System Background A cluster, generally based on the historical model : compute vs storage nodes Lot of works have been done (Parallel FS, NAS, SAN,... ) Inefficient use of disk capabilities (space and throughput) SMP 1 SMP 2 Compute nodes SMP 3 interconnection I/O server (NFS) Storage node SMP 4 Storage space: 1 TB Maximum throughput: 125 MB/sec SMP n Available storage space : > 23 TB Available throughput : 13 GB/sec SSI Grid5000 Rennes site kdfs, A Storage Building Block, February 1, 2008 2/12

Kerrighed and The Root File System Background A cluster, generally based on the historical model : compute vs storage nodes Lot of works have been done (Parallel FS, NAS, SAN,... ) Inefficient use of disk capabilities (space and throughput) SMP 1 SMP 2 Objectives Federate available hard drives : aggregate storage spaces Storage and Compute nodes Fine and efficient use of disk throughput : data striping, distributed I/O scheduling, redundancy SMP 3 SMP 4 SMP n interconnection Transparency from both application and resource usage point of views SSI kdfs, A Storage Building Block, February 1, 2008 2/12

Parallel File System vs Symmetric File System Parallel File System One meta-server and several I/O servers SMP 1 Meta server Meta server (failover) Single Point Of Failure failover server Scalability issue Storage and Compute nodes SMP 2 SMP 3 interconnection Performance and reliability FS services present on each nodes! SMP 4 SMP n I/O servers Symmetric FS No SPOF, better load-balancing Design and implementation much more complex (consistency) : Several proposals but no real implementation (xfs, serverless FS) How to take into account application requirements? (CPU / memory /...) kdfs, A Storage Building Block, February 1, 2008 3/12

Toward an Integrated Cluster File System for HPC First objective : a symmetric file system Federate available hard drives : aggregate storage spaces Fine, transparent and efficient use of disk throughput : data striping, distributed I/O scheduling, redundancy Storage and Compute nodes SMP 1 SMP 2 SMP 3 SMP 4 SMP n interconnection Performance / transparency / reliability Symmetric Kernel FS! Second objective : integration (Kerrighed philosophy ;) ) Integrate the cluster file system with other cluster services : scheduler, checkpointing, hotplug,... Take advantage of services complementarity Provide fine mechanisms to continue to improve cluster usage kdfs, A Storage Building Block, February 1, 2008 4/12

kdfs and Kerrighed (1/2) Kerrighed, one of the most complete SSI Developed during 6 years in the PARIS project-team Since 2006, developed as an open source project (Kerlabs, XtreemOS,...) Lot of features : C/R, live migration, load-balancing, hotplug,... All of them based on the Kernel Distributed Data Manager Clusterwide data-sharing at kernel level [Lottiaux01] System Service System Service Interface Linker Interface Linker kddm Resource Manager Resource Manager Memory Memory Node A Node B kdfs, A Storage Building Block, February 1, 2008 5/12

kdfs and Kerrighed (2/2) Building a DFS upon kddm mechanisms (and only!) kdfs, Kernel/Kerrighed Distribued File System : Clusterwide file system at kernel level Based on cooperative caching mechanisms 1 n user kernel kdfs 1 n user kernel kdfs Native FS (ext2) kddm Inodes / Open files / Data Data/Inode Data/Inode Native FS (ext2) Node A Node B From kerfs (2004-2005) More a proof of concept Nodes have to participate in the physical structure of the file system Meta-data replicated on all nodes : overhead to maintain consistency To kdfs (2006-2010) 4 years to work on an integrated FS Nodes can access to kdfs files without providing storage spaces Meta-data fully distributed : performance, reliability (later) kdfs keeps several kerfs proposals but developed from scratch! kdfs, A Storage Building Block, February 1, 2008 6/12

kdfs, Architecture Overview 4 kinds of kddm-set to provide a cooperative cache Inode management, Content Management (directories and files), Dentry management P0 P0 Pq user space user space kernel space kernel space Dentry cache Inode cache Directories cache Files cache Dentry I/O linker Dentry I/O linker Inode Dir content File content Inode Dir content File content Node A Node B kdfs, A Storage Building Block, February 1, 2008 7/12

kdfs, Performance (1/2) Caching mechanisms Exploit local Linux mechanisms at kdfs low level (1) (read-ahead and write-back) 1 n 1 n user kernel user kernel kdfs kdfs Native FS (ext2) (1) Data/Inode kddm Inodes / Open files / Data Data/Inode Native FS (ext2) Node A Node B kdfs, A Storage Building Block, February 1, 2008 8/12

kdfs, Performance (1/2) Caching mechanisms Exploit local Linux mechanisms at kdfs low level (1) (read-ahead and write-back) Exploit local Linux mechanisms at kdfs high level?? (2) read-ahead : usefull / useless? write-back : reduce network traffic but from tolerance point of view? write-through : impact of keeping data synchronized 1 n 1 n user kernel user kernel (2) kdfs kdfs Native FS (ext2) (1) kddm Inodes / Open files / Data Data/Inode Data/Inode Native FS (ext2) Node A Node B kdfs, A Storage Building Block, February 1, 2008 8/12

kdfs, Performance (2/2) Striping, two modes Transparent (implicit) : data are written locally Parallel programs are the best to discover suited striping parameters User (explicit) : users provide parameters on a directory/file basis Redundancy Users should notify (RAID 1) Reduce impact on non-tolerant applications User mode requires to extend POSIX calls kdfs, A Storage Building Block, February 1, 2008 9/12

kdfs, Integration with Other SSI Services (1/2) kdfs and SSI scheduler Interaction between kdfs and SSI scheduler to improve data locality (processes are launched where required files are stored) Exploit I/O probes to improve : Scheduling decision (load-balancing, reducing network traffic,...) File distribution (which are the best nodes for storing data) kdfs, SSI scheduler and migration mechanisms IOR benchmark from LLNL (MPI Parallel I/O application), two phases : 1./ For each process : Reads particular data from one common file (according to its MPI rank), Processes them, Writes results in a second common file 2./ Permutation between processes is made Restart step 1./ from last result file. Instead of making remote access, migrate processes to the right nodes. kdfs, A Storage Building Block, February 1, 2008 10/12

kdfs, Integration with Other SSI Services (1/2) 1./ Each mpi process reads and then writes (let s say 1GB) Node 1 MPI P0 Node 2 Node q MPI P2 Interconnection Node 3 network Node q+1 MPI Node 4 Node p Node n kdfs, A Storage Building Block, February 1, 2008 10/12

kdfs, Integration with Other SSI Services (1/2) 2./ After permutation, each mpi process accesses remote data (3 * 1GB) Node 1 MPI P0 1GB Node 2 Node q MPI P2 1GB Interconnection Node 3 network Node q+1 MPI 1GB Node 4 Node p Node n kdfs, A Storage Building Block, February 1, 2008 10/12

kdfs, Integration with Other SSI Services (1/2) 2./ Detect remote accesses and exploit migration mechanisms (some MBs) Node 1 MPI P2 Node 2 Node q MPI Interconnection Node 3 network Node q+1 MPI P0 Node 4 Node p Reduces network traffic, improves performance Node n Proof of concept almost done with NFS client cache mechanisms kdfs, A Storage Building Block, February 1, 2008 10/12

kdfs, Integration with Other SSI Services (2/2) kdfs and hotplug Manage human nodes addition/removals Transfert meta-data and content files (size issue) Exploit a particular mode where some files are not reachable (Notifiy SSI scheduler to sleep impacted processes) kdfs and checkpoint mechanisms Extend the VFS to provide incremental snapshot for a specified file. Define the best node to save the checkpoint (reduce system noise ) kdfs, A Storage Building Block, February 1, 2008 11/12

Conclusion kdfs, towards an integrated cluster file system for HPC Build a symmetric kernel file system : Based on cooperative caching strategies Without applying intrusive kernel patch Focus as soon as possible on the integration with other services Presentation based on my current work (XtreemOS/Kerrighed framework) kdfs, roadmap for next months kdfs, an alpha version available since the 15th october ( proof of concept ) Fix page cache management issue ( out-of-memory in the alpha version) Me, efficiency features (mainly striping and scheduling) 2 master students : Pierre Riteau - File checkpoint mechanisms Marko Novak - I/O probes and scheduling coordination kdfs, 3000 LOC, it takes times : look for some volunteers :) kdfs, A Storage Building Block, February 1, 2008 12/12

Toward An Integrated Cluster File System Adrien Lebre February 1 st, 2008 kdfs, A Storage Building Block, February 1, 2008 questions

kdfs, Implementation Overview 4 kinds of kddm-set to provide a cooperative cache Inode management (INODE LINKER) Content management : meta-data (DIR LINKER) and file content (FILE LINKER) Dentry management, (DENTRY LINKER) P0 P0 Pq user space user space kernel space kernel space Node A Node B kdfs, A Storage Building Block, February 1, 2008 questions

kdfs, Implementation Overview 4 kinds of kddm-set to provide a cooperative cache Inode management (INODE LINKER) Content management : meta-data (DIR LINKER) and file content (FILE LINKER) Dentry management, (DENTRY LINKER) P0 P0 Pq user space user space kernel space kernel space Inode kddm management Created during mount process and filled dynamically Inode Inode Node A Node B kdfs, A Storage Building Block, February 1, 2008 questions

kdfs, Implementation Overview 4 kinds of kddm-set to provide a cooperative cache Inode management (INODE LINKER) Content management : meta-data (DIR LINKER) and file content (FILE LINKER) Dentry management, (DENTRY LINKER) P0 P0 Pq user space kernel space user space kernel space Inode kddm management Created during mount process and filled dynamically dir content / Created and filled dynamically (one for each directory/file) Inode Dir content Inode Dir content Node A Node B kdfs, A Storage Building Block, February 1, 2008 questions

kdfs, Implementation Overview 4 kinds of kddm-set to provide a cooperative cache Inode management (INODE LINKER) Content management : meta-data (DIR LINKER) and file content (FILE LINKER) Dentry management, (DENTRY LINKER) P0 P0 Pq user space kernel space user space kernel space Inode kddm management Created during mount process and filled dynamically dir content / dir content /etc/ dir content /usr/ dir content /usr/src/linux/ Created and filled dynamically (one for each directory/file) Inode Dir content Inode Dir content Node A Node B kdfs, A Storage Building Block, February 1, 2008 questions

kdfs, Implementation Overview 4 kinds of kddm-set to provide a cooperative cache Inode management (INODE LINKER) Content management : meta-data (DIR LINKER) and file content (FILE LINKER) Dentry management, (DENTRY LINKER) P0 P0 Pq user space kernel space user space kernel space Inode kddm management Created during mount process and filled dynamically dir content / dir content /etc/ dir content /usr/ dir content /usr/src/linux/ Created and filled dynamically (one for each directory/file) file content /etc/xtreemos.conf file content /usr/src/linux/.config file content /home/alebre/presentation.pdf Inode Dir content File content Inode Dir content File content Node A Node B kdfs, A Storage Building Block, February 1, 2008 questions

kdfs, Implementation Overview 4 kinds of kddm-set to provide a cooperative cache Inode management (INODE LINKER) Content management : meta-data (DIR LINKER) and file content (FILE LINKER) Dentry management, (DENTRY LINKER) P0 P0 Pq user space kernel space user space kernel space Dentry kddm management Inode kddm management Created during mount process and filled dynamically dir content / dir content /etc/ dir content /usr/ dir content /usr/src/linux/ Created and filled dynamically (one for each directory/file) file content /etc/xtreemos.conf file content /usr/src/linux/.config file content /home/alebre/presentation.pdf Dentry I/O linker Dentry I/O linker Inode Dir content File content Inode Dir content File content Node A Node B kdfs, A Storage Building Block, February 1, 2008 questions

kdfs, Formating and mounting a kdfs Structure Formatting a kdfs partition mkfs.kdfs DIRECTORY PATHNAME ROOT NODEID Create the kdfs superbloc file for the node (kdfs bitmap for inode id allocation, reference to ROOT NODEID) if NODEID equals ROOT NODEID, create root meta-file Mounting/Accessing a kdfs system mount -t kdfs ALLOCATED DIRECTORY none MOUNT POINT Current limitations : Only one kdfs MOUNT POINT per node, none mode has to be finalized (few days) Advanced version : Import local and network file systems inside kdfs Add some QoS parameters (such as storage space size) kdfs, A Storage Building Block, February 1, 2008 questions

kdfs, Meta-data Management Structure of a kdfs partition Independent from native file system KDFS DIR/... superbloc file KDFS DIR/0-99/, KDFS DIR/100-199/,..., meta-data files, content files on ROOT NODEID KDFS DIR/0-99/1 correspond to the kdfs / Meta-files are stored in a binary mode kdfs inode id and meta-files 32 bits, 8 for node id, 24 for local inode id (now a scalability limitation ) If possible creation is done locally : get an inode id (nodeid + free id from local bitmap) Create corresponding meta-file mkdir /foo./0-99/2 Two kinds of meta files : directory meta-file stores directory structure (directory entries) file meta-file stores file distribution (based on an object approach) kdfs, A Storage Building Block, February 1, 2008 questions

kdfs, Performance I/O scheduling Block I/O schedulers are not sufficient (on a block basis, no global view) Manage I/O requests incoming for all applications Node 1 Scheduling is done independently on each node 2 applications run on the cluster Appli 1 req(a1,10) Node 2 req(a1,5) Node q Node 3 Interconnection network req (A2,5) Node q+1 Appli 2 req (A2,5) Node 4 Node p Loss of informations How to define an efficient schedule? Node n kdfs, A Storage Building Block, February 1, 2008 questions

kdfs, Performance I/O scheduling Block I/O schedulers are not sufficient (on a block basis, no global view) Manage I/O requests incoming for all applications Processed Waiting queue Node A2 A1 A1 A2 q t0 t1 t0 t1 Node q+1 Requests arrival t Requests arrival t FIFO scheduling FIFO scheduling A1 waits 15 units of time A2 waits 10 units Node A2 A1 A1 A2 q T = 5 T = 10 T = 5 T = 5 FIFO scheduling but taking into account parallel accesses Node q+1 A1 still waits 15 A2 waits 5 A2 T = 5 A1 T = 10 Node q A2 T = 5 A1 T = 5 Node q+1 kdfs, A Storage Building Block, February 1, 2008 questions

Looking one step ahead (1/2) Hierarchical Storage kdfs during the execution of application (Distributed cache, cooperation with cluster services,...) Concurrent applications Several kdfs kdfs, A Storage Building Block, February 1, 2008 questions

Looking one step ahead (1/2) Hierarchical Storage kdfs during the execution of application (Distributed cache, cooperation with cluster services,...) Concurrent applications Several kdfs kdfs, A Storage Building Block, February 1, 2008 questions

Looking one step ahead (1/2) kdfs as a Grid FS building block Grid Data Mgmt system is built on kdfs Coordination between Grid Data Mgmt and other Grid services (Grid scheduler, probes,...) Concurrent applications Several Grid Data Mgmt System kdfs, A Storage Building Block, February 1, 2008 questions

Looking one step ahead (1/2) kdfs as a Grid FS building block Grid Data Mgmt system is built on kdfs Coordination between Grid Data Mgmt and other Grid services (Grid scheduler, probes,...) Concurrent applications Several Grid Data Mgmt System kdfs, A Storage Building Block, February 1, 2008 questions