Toward An Integrated Cluster File System Adrien Lebre February 1 st, 2008 XtreemOS IP project is funded by the European Commission under contract IST-FP6-033576
Outline Context Kerrighed and root file system Parallel file system vs Symmetric file system kdfs, kernel Distributed File System Building a distributed FS upon kddm mechanisms Architecture overview Performance kdfs, integration with other cluster services Scheduling policies, checkpoint/restart, hotplug,... kdfs, conclusion, future work kdfs, A Storage Building Block, February 1, 2008 1/12
Kerrighed and The Root File System Background A cluster, generally based on the historical model : compute vs storage nodes Lot of works have been done (Parallel FS, NAS, SAN,... ) Inefficient use of disk capabilities (space and throughput) SMP 1 SMP 2 I/O server 1 Compute nodes SMP 3 interconnection I/O server 2 Storage nodes SMP 4 I/O server p SMP n kdfs, A Storage Building Block, February 1, 2008 2/12
Kerrighed and The Root File System Background A cluster, generally based on the historical model : compute vs storage nodes Lot of works have been done (Parallel FS, NAS, SAN,... ) Inefficient use of disk capabilities (space and throughput) SMP 1 SMP 2 Compute nodes SMP 3 interconnection I/O server (NFS) Storage node SMP 4 SMP n SSI kdfs, A Storage Building Block, February 1, 2008 2/12
Kerrighed and The Root File System Background A cluster, generally based on the historical model : compute vs storage nodes Lot of works have been done (Parallel FS, NAS, SAN,... ) Inefficient use of disk capabilities (space and throughput) SMP 1 SMP 2 Compute nodes SMP 3 interconnection I/O server (NFS) Storage node SMP 4 Storage space: 1 TB Maximum throughput: 125 MB/sec SMP n Available storage space : > 23 TB Available throughput : 13 GB/sec SSI Grid5000 Rennes site kdfs, A Storage Building Block, February 1, 2008 2/12
Kerrighed and The Root File System Background A cluster, generally based on the historical model : compute vs storage nodes Lot of works have been done (Parallel FS, NAS, SAN,... ) Inefficient use of disk capabilities (space and throughput) SMP 1 SMP 2 Objectives Federate available hard drives : aggregate storage spaces Storage and Compute nodes Fine and efficient use of disk throughput : data striping, distributed I/O scheduling, redundancy SMP 3 SMP 4 SMP n interconnection Transparency from both application and resource usage point of views SSI kdfs, A Storage Building Block, February 1, 2008 2/12
Parallel File System vs Symmetric File System Parallel File System One meta-server and several I/O servers SMP 1 Meta server Meta server (failover) Single Point Of Failure failover server Scalability issue Storage and Compute nodes SMP 2 SMP 3 interconnection Performance and reliability FS services present on each nodes! SMP 4 SMP n I/O servers Symmetric FS No SPOF, better load-balancing Design and implementation much more complex (consistency) : Several proposals but no real implementation (xfs, serverless FS) How to take into account application requirements? (CPU / memory /...) kdfs, A Storage Building Block, February 1, 2008 3/12
Toward an Integrated Cluster File System for HPC First objective : a symmetric file system Federate available hard drives : aggregate storage spaces Fine, transparent and efficient use of disk throughput : data striping, distributed I/O scheduling, redundancy Storage and Compute nodes SMP 1 SMP 2 SMP 3 SMP 4 SMP n interconnection Performance / transparency / reliability Symmetric Kernel FS! Second objective : integration (Kerrighed philosophy ;) ) Integrate the cluster file system with other cluster services : scheduler, checkpointing, hotplug,... Take advantage of services complementarity Provide fine mechanisms to continue to improve cluster usage kdfs, A Storage Building Block, February 1, 2008 4/12
kdfs and Kerrighed (1/2) Kerrighed, one of the most complete SSI Developed during 6 years in the PARIS project-team Since 2006, developed as an open source project (Kerlabs, XtreemOS,...) Lot of features : C/R, live migration, load-balancing, hotplug,... All of them based on the Kernel Distributed Data Manager Clusterwide data-sharing at kernel level [Lottiaux01] System Service System Service Interface Linker Interface Linker kddm Resource Manager Resource Manager Memory Memory Node A Node B kdfs, A Storage Building Block, February 1, 2008 5/12
kdfs and Kerrighed (2/2) Building a DFS upon kddm mechanisms (and only!) kdfs, Kernel/Kerrighed Distribued File System : Clusterwide file system at kernel level Based on cooperative caching mechanisms 1 n user kernel kdfs 1 n user kernel kdfs Native FS (ext2) kddm Inodes / Open files / Data Data/Inode Data/Inode Native FS (ext2) Node A Node B From kerfs (2004-2005) More a proof of concept Nodes have to participate in the physical structure of the file system Meta-data replicated on all nodes : overhead to maintain consistency To kdfs (2006-2010) 4 years to work on an integrated FS Nodes can access to kdfs files without providing storage spaces Meta-data fully distributed : performance, reliability (later) kdfs keeps several kerfs proposals but developed from scratch! kdfs, A Storage Building Block, February 1, 2008 6/12
kdfs, Architecture Overview 4 kinds of kddm-set to provide a cooperative cache Inode management, Content Management (directories and files), Dentry management P0 P0 Pq user space user space kernel space kernel space Dentry cache Inode cache Directories cache Files cache Dentry I/O linker Dentry I/O linker Inode Dir content File content Inode Dir content File content Node A Node B kdfs, A Storage Building Block, February 1, 2008 7/12
kdfs, Performance (1/2) Caching mechanisms Exploit local Linux mechanisms at kdfs low level (1) (read-ahead and write-back) 1 n 1 n user kernel user kernel kdfs kdfs Native FS (ext2) (1) Data/Inode kddm Inodes / Open files / Data Data/Inode Native FS (ext2) Node A Node B kdfs, A Storage Building Block, February 1, 2008 8/12
kdfs, Performance (1/2) Caching mechanisms Exploit local Linux mechanisms at kdfs low level (1) (read-ahead and write-back) Exploit local Linux mechanisms at kdfs high level?? (2) read-ahead : usefull / useless? write-back : reduce network traffic but from tolerance point of view? write-through : impact of keeping data synchronized 1 n 1 n user kernel user kernel (2) kdfs kdfs Native FS (ext2) (1) kddm Inodes / Open files / Data Data/Inode Data/Inode Native FS (ext2) Node A Node B kdfs, A Storage Building Block, February 1, 2008 8/12
kdfs, Performance (2/2) Striping, two modes Transparent (implicit) : data are written locally Parallel programs are the best to discover suited striping parameters User (explicit) : users provide parameters on a directory/file basis Redundancy Users should notify (RAID 1) Reduce impact on non-tolerant applications User mode requires to extend POSIX calls kdfs, A Storage Building Block, February 1, 2008 9/12
kdfs, Integration with Other SSI Services (1/2) kdfs and SSI scheduler Interaction between kdfs and SSI scheduler to improve data locality (processes are launched where required files are stored) Exploit I/O probes to improve : Scheduling decision (load-balancing, reducing network traffic,...) File distribution (which are the best nodes for storing data) kdfs, SSI scheduler and migration mechanisms IOR benchmark from LLNL (MPI Parallel I/O application), two phases : 1./ For each process : Reads particular data from one common file (according to its MPI rank), Processes them, Writes results in a second common file 2./ Permutation between processes is made Restart step 1./ from last result file. Instead of making remote access, migrate processes to the right nodes. kdfs, A Storage Building Block, February 1, 2008 10/12
kdfs, Integration with Other SSI Services (1/2) 1./ Each mpi process reads and then writes (let s say 1GB) Node 1 MPI P0 Node 2 Node q MPI P2 Interconnection Node 3 network Node q+1 MPI Node 4 Node p Node n kdfs, A Storage Building Block, February 1, 2008 10/12
kdfs, Integration with Other SSI Services (1/2) 2./ After permutation, each mpi process accesses remote data (3 * 1GB) Node 1 MPI P0 1GB Node 2 Node q MPI P2 1GB Interconnection Node 3 network Node q+1 MPI 1GB Node 4 Node p Node n kdfs, A Storage Building Block, February 1, 2008 10/12
kdfs, Integration with Other SSI Services (1/2) 2./ Detect remote accesses and exploit migration mechanisms (some MBs) Node 1 MPI P2 Node 2 Node q MPI Interconnection Node 3 network Node q+1 MPI P0 Node 4 Node p Reduces network traffic, improves performance Node n Proof of concept almost done with NFS client cache mechanisms kdfs, A Storage Building Block, February 1, 2008 10/12
kdfs, Integration with Other SSI Services (2/2) kdfs and hotplug Manage human nodes addition/removals Transfert meta-data and content files (size issue) Exploit a particular mode where some files are not reachable (Notifiy SSI scheduler to sleep impacted processes) kdfs and checkpoint mechanisms Extend the VFS to provide incremental snapshot for a specified file. Define the best node to save the checkpoint (reduce system noise ) kdfs, A Storage Building Block, February 1, 2008 11/12
Conclusion kdfs, towards an integrated cluster file system for HPC Build a symmetric kernel file system : Based on cooperative caching strategies Without applying intrusive kernel patch Focus as soon as possible on the integration with other services Presentation based on my current work (XtreemOS/Kerrighed framework) kdfs, roadmap for next months kdfs, an alpha version available since the 15th october ( proof of concept ) Fix page cache management issue ( out-of-memory in the alpha version) Me, efficiency features (mainly striping and scheduling) 2 master students : Pierre Riteau - File checkpoint mechanisms Marko Novak - I/O probes and scheduling coordination kdfs, 3000 LOC, it takes times : look for some volunteers :) kdfs, A Storage Building Block, February 1, 2008 12/12
Toward An Integrated Cluster File System Adrien Lebre February 1 st, 2008 kdfs, A Storage Building Block, February 1, 2008 questions
kdfs, Implementation Overview 4 kinds of kddm-set to provide a cooperative cache Inode management (INODE LINKER) Content management : meta-data (DIR LINKER) and file content (FILE LINKER) Dentry management, (DENTRY LINKER) P0 P0 Pq user space user space kernel space kernel space Node A Node B kdfs, A Storage Building Block, February 1, 2008 questions
kdfs, Implementation Overview 4 kinds of kddm-set to provide a cooperative cache Inode management (INODE LINKER) Content management : meta-data (DIR LINKER) and file content (FILE LINKER) Dentry management, (DENTRY LINKER) P0 P0 Pq user space user space kernel space kernel space Inode kddm management Created during mount process and filled dynamically Inode Inode Node A Node B kdfs, A Storage Building Block, February 1, 2008 questions
kdfs, Implementation Overview 4 kinds of kddm-set to provide a cooperative cache Inode management (INODE LINKER) Content management : meta-data (DIR LINKER) and file content (FILE LINKER) Dentry management, (DENTRY LINKER) P0 P0 Pq user space kernel space user space kernel space Inode kddm management Created during mount process and filled dynamically dir content / Created and filled dynamically (one for each directory/file) Inode Dir content Inode Dir content Node A Node B kdfs, A Storage Building Block, February 1, 2008 questions
kdfs, Implementation Overview 4 kinds of kddm-set to provide a cooperative cache Inode management (INODE LINKER) Content management : meta-data (DIR LINKER) and file content (FILE LINKER) Dentry management, (DENTRY LINKER) P0 P0 Pq user space kernel space user space kernel space Inode kddm management Created during mount process and filled dynamically dir content / dir content /etc/ dir content /usr/ dir content /usr/src/linux/ Created and filled dynamically (one for each directory/file) Inode Dir content Inode Dir content Node A Node B kdfs, A Storage Building Block, February 1, 2008 questions
kdfs, Implementation Overview 4 kinds of kddm-set to provide a cooperative cache Inode management (INODE LINKER) Content management : meta-data (DIR LINKER) and file content (FILE LINKER) Dentry management, (DENTRY LINKER) P0 P0 Pq user space kernel space user space kernel space Inode kddm management Created during mount process and filled dynamically dir content / dir content /etc/ dir content /usr/ dir content /usr/src/linux/ Created and filled dynamically (one for each directory/file) file content /etc/xtreemos.conf file content /usr/src/linux/.config file content /home/alebre/presentation.pdf Inode Dir content File content Inode Dir content File content Node A Node B kdfs, A Storage Building Block, February 1, 2008 questions
kdfs, Implementation Overview 4 kinds of kddm-set to provide a cooperative cache Inode management (INODE LINKER) Content management : meta-data (DIR LINKER) and file content (FILE LINKER) Dentry management, (DENTRY LINKER) P0 P0 Pq user space kernel space user space kernel space Dentry kddm management Inode kddm management Created during mount process and filled dynamically dir content / dir content /etc/ dir content /usr/ dir content /usr/src/linux/ Created and filled dynamically (one for each directory/file) file content /etc/xtreemos.conf file content /usr/src/linux/.config file content /home/alebre/presentation.pdf Dentry I/O linker Dentry I/O linker Inode Dir content File content Inode Dir content File content Node A Node B kdfs, A Storage Building Block, February 1, 2008 questions
kdfs, Formating and mounting a kdfs Structure Formatting a kdfs partition mkfs.kdfs DIRECTORY PATHNAME ROOT NODEID Create the kdfs superbloc file for the node (kdfs bitmap for inode id allocation, reference to ROOT NODEID) if NODEID equals ROOT NODEID, create root meta-file Mounting/Accessing a kdfs system mount -t kdfs ALLOCATED DIRECTORY none MOUNT POINT Current limitations : Only one kdfs MOUNT POINT per node, none mode has to be finalized (few days) Advanced version : Import local and network file systems inside kdfs Add some QoS parameters (such as storage space size) kdfs, A Storage Building Block, February 1, 2008 questions
kdfs, Meta-data Management Structure of a kdfs partition Independent from native file system KDFS DIR/... superbloc file KDFS DIR/0-99/, KDFS DIR/100-199/,..., meta-data files, content files on ROOT NODEID KDFS DIR/0-99/1 correspond to the kdfs / Meta-files are stored in a binary mode kdfs inode id and meta-files 32 bits, 8 for node id, 24 for local inode id (now a scalability limitation ) If possible creation is done locally : get an inode id (nodeid + free id from local bitmap) Create corresponding meta-file mkdir /foo./0-99/2 Two kinds of meta files : directory meta-file stores directory structure (directory entries) file meta-file stores file distribution (based on an object approach) kdfs, A Storage Building Block, February 1, 2008 questions
kdfs, Performance I/O scheduling Block I/O schedulers are not sufficient (on a block basis, no global view) Manage I/O requests incoming for all applications Node 1 Scheduling is done independently on each node 2 applications run on the cluster Appli 1 req(a1,10) Node 2 req(a1,5) Node q Node 3 Interconnection network req (A2,5) Node q+1 Appli 2 req (A2,5) Node 4 Node p Loss of informations How to define an efficient schedule? Node n kdfs, A Storage Building Block, February 1, 2008 questions
kdfs, Performance I/O scheduling Block I/O schedulers are not sufficient (on a block basis, no global view) Manage I/O requests incoming for all applications Processed Waiting queue Node A2 A1 A1 A2 q t0 t1 t0 t1 Node q+1 Requests arrival t Requests arrival t FIFO scheduling FIFO scheduling A1 waits 15 units of time A2 waits 10 units Node A2 A1 A1 A2 q T = 5 T = 10 T = 5 T = 5 FIFO scheduling but taking into account parallel accesses Node q+1 A1 still waits 15 A2 waits 5 A2 T = 5 A1 T = 10 Node q A2 T = 5 A1 T = 5 Node q+1 kdfs, A Storage Building Block, February 1, 2008 questions
Looking one step ahead (1/2) Hierarchical Storage kdfs during the execution of application (Distributed cache, cooperation with cluster services,...) Concurrent applications Several kdfs kdfs, A Storage Building Block, February 1, 2008 questions
Looking one step ahead (1/2) Hierarchical Storage kdfs during the execution of application (Distributed cache, cooperation with cluster services,...) Concurrent applications Several kdfs kdfs, A Storage Building Block, February 1, 2008 questions
Looking one step ahead (1/2) kdfs as a Grid FS building block Grid Data Mgmt system is built on kdfs Coordination between Grid Data Mgmt and other Grid services (Grid scheduler, probes,...) Concurrent applications Several Grid Data Mgmt System kdfs, A Storage Building Block, February 1, 2008 questions
Looking one step ahead (1/2) kdfs as a Grid FS building block Grid Data Mgmt system is built on kdfs Coordination between Grid Data Mgmt and other Grid services (Grid scheduler, probes,...) Concurrent applications Several Grid Data Mgmt System kdfs, A Storage Building Block, February 1, 2008 questions