Intel Enterprise Edition Lustre (IEEL-2.3) [DNE-1 enabled] on Dell MD Storage

Evaluation of Lustre File System software enhancements for improved Metadata performance

Wojciech Turek, Paul Calleja, John Talyor (University of Cambridge HPCS)
Quy Ta, Onur Celebioglu (Dell HPC Engineering)
Contents

Abstract
1. Introduction to Lustre Metadata Performance and Scalability
2. Lustre Distributed Namespace [DNE-1]
3. MDT reply reconstruction improvement
4. Test System Reference Specification
5. Benchmarking Methodology and Tools
6. Benchmarks
7. Conclusion
8. References
Abstract

The Lustre filesystem is well known for its ability to handle the large sequential I/O patterns typically seen in HPC workloads, which is the main reason why Lustre is used in many of the top HPC centers in the world today. Lustre has steadily gained popularity and has become the first-choice parallel filesystem in many academic and industry HPC environments. In recent years the requirement for I/O performance has increased dramatically, with growing demand not only for traditional HPC I/O bandwidth but also for higher IOPS and metadata performance. Traditionally Lustre has not been known for high IOPS performance, due to the serial nature of the Lustre metadata server architecture. This legacy position has now changed thanks to significant improvements to the Lustre software in terms of metadata performance and scalability. In this paper we investigate the implementation of scalable Lustre metadata servers, called Distributed Namespace phase 1 (DNE-1). The HPC team within the University of Cambridge undertook a detailed study and optimisation of Lustre metadata performance in partnership with the Dell HPC engineering team at Dell Austin. The base design of the testbed utilised a new Dell/Intel DNE-1 Lustre blueprint design, which will be published at the end of 2015/start of 2016. The work presented here has also been extended to look at how single-client multithreaded metadata performance can be improved by further Lustre code enhancements soon to be released in IEEL. The paper demonstrates that, together, the removal of metadata server serialization and of the serialization of multithreaded single-client metadata transactions significantly improves both the performance and scalability of Lustre metadata activity, thereby removing the traditional metadata performance limitations of Lustre.
The paper looks ahead to the IEEL release in 2016, where these software features will be combined with DNE-2, and describes future work to investigate further hardware optimisations of MDS processor and memory configuration, combined with new Intel non-volatile memory technologies that could provide yet another large boost to metadata performance.
1. Introduction to Lustre Metadata Performance and Scalability

Today's large-scale data processing systems can consist of thousands of compute nodes (Lustre clients) running tens of thousands of concurrent processes. Up to now, within a Lustre filesystem it has been possible to scale the number of Lustre Object Storage Servers and so increase the I/O throughput of the cluster, but it has only been possible to have one Metadata Target (MDT) per filesystem. There have been many development efforts to improve Lustre metadata performance; however, as client numbers increase, the single MDS represents a fundamental bottleneck limiting the scalability of metadata transactions within the Lustre filesystem. The Distributed Namespace (DNE) development addresses this fundamental limit by distributing the filesystem metadata over multiple metadata servers and metadata targets. Phase 1 of the development project focuses on distributing the Lustre namespace by allowing directory entries to reference sub-directories on different metadata targets (MDTs). In the DNE-1 implementation this is referred to as Remote Directories, and it allows metadata workloads distributed across multiple directories to scale. DNE-2 will introduce shared directories, in which the directory entries of a single directory are striped over multiple MDTs, allowing metadata performance within a single directory to scale. This is often referred to as Striped Directories. As of today, DNE-1 has been fully implemented and is available in production releases of Lustre. DNE-2 has been available as a preview feature since Lustre 2.7, and the full implementation will be available in the community release 2.8, which is planned for the end of 2015 (it is expected in Intel Enterprise Edition for Lustre 3.0 at the start of 2016).
2. Lustre Distributed Namespace [DNE-1]

Removing serialization of metadata servers

Remote Directories: Lustre sub-directories are distributed over multiple metadata targets (MDTs). Sub-directory distribution is defined by an administrator using a Lustre-specific mkdir command.

Figure 1 DNE-1

The Lustre architecture with DNE is shown in Figure 2, where the metadata can now be distributed across multiple MDT devices and multiple MDS servers. This new architecture enables the metadata component to scale in the same manner as the object storage component.

Figure 2 Lustre System Architecture
To enable DNE on a Lustre filesystem, one simply formats another MDT device with the same fsname as the existing MDT, using the next available MDT index, and then mounts it on the MDS. If using IEEL-2.3+, the easiest way to enable DNE is from the Intel Manager for Lustre (IML) web interface, as shown in Figure 3. Creation of remote directories requires administrative privileges. An administrator can allocate a sub-directory to a given MDT using the command:

client# lfs mkdir -i <mdt_index> /mount_point/remote_dir

This command allocates the sub-directory remote_dir onto the MDT with index <mdt_index>. Only an administrator can create remote sub-directories allocated to separate MDTs. Creating remote sub-directories in parent directories not hosted on MDT0 is not recommended, because the failure of the parent MDT will leave the namespace below it inaccessible. For this reason, by default it is only possible to create remote sub-directories within MDT0. To relax this restriction and enable remote sub-directories within any MDT, an administrator must issue the command:

lctl set_param mdd.*.enable_remote_dir=1

Figure 3 Intel Manager for Lustre DNE enablement
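The command-line steps above can be sketched as follows. This is a sketch, not a transcript from the testbed: the device path, fsname, MGS NID, and mount points are illustrative placeholders, and the commands require a live Lustre installation.

```shell
# On the new MDS: format a second MDT for the existing filesystem "lustre"
# (--index selects the next free MDT index; device path is an example)
mkfs.lustre --mdt --fsname=lustre --index=1 --mgsnode=mgs@o2ib /dev/mapper/mdt1
mkdir -p /lustre/mdt1
mount -t lustre /dev/mapper/mdt1 /lustre/mdt1

# On a client (as root): allow remote directories below any MDT,
# then create a sub-directory allocated to MDT index 1
lctl set_param mdd.*.enable_remote_dir=1
lfs mkdir -i 1 /mnt/lustre/remote_dir1
```

Once created, files and sub-directories under /mnt/lustre/remote_dir1 have their metadata served by MDT1 rather than MDT0, which is how the per-directory workload distribution in the benchmarks below is achieved.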
3. MDT reply reconstruction improvement

Removing serialization of client metadata transactions

Currently, the MDT cannot handle more than one filesystem-modifying RPC at a time from a given client, because there is only one slot per client in the MDT last_rcvd file. Consequently, filesystem-modifying MDC requests are serialized, leading to poor metadata performance scaling on a single Lustre client. The Intel Lustre support ticket LU-5319 tracks the work on implementing support for multiple slots per client for reply reconstruction of filesystem-modifying MDT requests, in order to improve the metadata performance of a single client running multithreaded workloads. Until a fix for this issue is implemented within IEEL, a workaround is to create multiple mount points per client and to structure the workload so that it is distributed across these multiple mount points. In production workloads this approach would not be practical. However, it is useful to implement the workaround here on our Intel EE for Lustre testbed, to understand what level of improvement can be achieved once the serialization problem is resolved and the fix is available in the production version of Lustre (currently the fix is available for community Lustre 2.8, and it will be available in the Intel EE for Lustre 3.0 release early next year).
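The multiple-mount workaround can be sketched as follows. The mount count (16, matching the per-node thread count used later in this paper), the MGS NID, and the paths are illustrative assumptions; the commands require a live Lustre filesystem.

```shell
# Workaround sketch: mount the same Lustre filesystem 16 times on one client,
# so each benchmark thread can use its own MDC import and avoid the
# single last_rcvd reply slot per client
for i in $(seq 0 15); do
    mkdir -p /mnt/lustre_${i}
    mount -t lustre mgs@o2ib:/lustre /mnt/lustre_${i}
done
# Thread t then performs its metadata work under /mnt/lustre_$((t % 16))
```

Because each mount point has its own client import, the MDS sees 16 independent clients and can process their modifying RPCs in parallel, approximating the behaviour of the LU-5319 multi-slot fix.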
4. Test System Reference Specification

This section describes the base test system configuration in detail. The main system (shown in Figure 4) consists of two MDS servers and two OSS servers configured as failover pairs. The system runs Intel Enterprise Edition for Lustre (IEEL) 2.3, which enables use of DNE-1. The Lustre system is managed by the Intel Manager for Lustre (IML) server, which is a single point for configuration and monitoring of Lustre filesystems. The metadata storage is provided by two Dell PowerVault MD3420 RAID enclosures. Each enclosure is fully populated with 22 SAS 15K HDDs and 2 x 146GB SSDs. The enclosures are connected to both MDS servers, allowing shared access to all disks from both servers. This enables failover functionality for the MDS servers and better load balancing of MDT devices across the available MDS servers. Similarly, the OST targets are provided by a Dell PowerVault MD3460 RAID enclosure. Both the MD3460 and MD3420 disk enclosures contain two active-active RAID controllers. This provides full redundancy for accessing the disks, but also improves performance and load-balances workloads. The RAID disk enclosures and servers are connected via direct 12Gbps SAS connections. All servers are interconnected by a high-speed, low-latency InfiniBand FDR network. We find that the Dell hardware in this configuration is a very good match for Lustre filesystem performance capabilities and provides a balanced storage system architecture.

Figure 4 Test system
IML - Intel Manager for Lustre
  Platform: R430
  CPU: 1 x Xeon E5-2620 6C 2.4GHz
  RAM: 32GB
  Boot disk: 2 x 100GB SSD
  Disk: 2 x 1TB NL-SAS
  LOM network: 2 x 1GbE

MDS - Metadata Server
  Platform: R630
  CPU: 2 x Xeon E5-2630 v3 8C 2.4GHz
  RAM: 128GB
  Disk: 2 x 100GB SSD
  Fast network: FDR InfiniBand
  LOM network: 2 x 1GbE

MDT - RAID Storage Enclosure
  Platform: MD3420
  Disks: 24 x SAS 15K HDDs
  Controller cache: dual RAID controllers, 8GB of cache per controller

OSS - Object Storage Server
  Platform: R430
  CPU: 2 x Xeon E5-2620 6C 2.4GHz
  RAM: 64GB
  Boot disk: 2 x 100GB SSD
  Fast network: FDR InfiniBand
  MGT network: 2 x 1GbE

OST - RAID Storage Enclosure
  Platform: MD3460
  Disks: 60 x 4TB NL-SAS HDDs, 4 x 200GB SSDs
  Controller: dual RAID controllers

Table 1 Hardware Specification
5. Benchmarking Methodology and Tools

Metadata performance was measured using the MDTEST benchmark tool. MDTEST measures the rate of the most common metadata operations, such as directory and file creation/deletion and stat operations. The benchmark uses MPI to coordinate threads across the test nodes, making it suitable for testing single-client performance as well as hundreds or thousands of clients. The MDTEST benchmark was run using a maximum of 64 Lustre client nodes and up to 1024 threads, and a minimum of one million files and directories were used per test. The following parameters were used:

mdtest -n <number of files/directories per test directory> -i 3 -y -N 1 -t -u -d <test_directories>

-n: every process will create/stat/remove # directories and files
-i: number of iterations the test will run
-y: sync file after writing
-N: stride # between neighbor tasks for file/dir stat (local=0)
-t: time unique working directory overhead
-u: unique working directory for each task
-d: the directory in which the tests will run
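A concrete launch for the largest case described above could look as follows. The launcher syntax is Open MPI style, and the hostfile name and per-process file count are illustrative assumptions (1024 processes x 1024 files per process gives just over one million files, matching the stated minimum).

```shell
# 64 client nodes x 16 threads per node = 1024 MPI ranks, ~1M files total
# (clients.txt lists the 64 client hostnames; paths are examples)
mpirun -np 1024 --hostfile clients.txt \
    mdtest -n 1024 -i 3 -y -N 1 -t -u -d /mnt/lustre/testdir
```

With -u each rank works in its own unique directory, which is what allows DNE-1 Remote Directories to spread the load across MDTs in the multi-MDT test cases.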
Table 2 lists the test cases performed on the test system. The tests were selected to highlight the DNE-1 metadata performance improvements, but also to show the major reply-reconstruction metadata bottleneck affecting single-client performance. The tests provide a preview of how performance will improve once the bottleneck is removed and parallel MDS servers are deployed.

IEEL-2.3-SCD-1MDT: Results from a single Lustre client test in default mode: no tuning, no multi mounts, single MDT
IEEL-2.3-SCO-1MDT: Results from a single Lustre client test in optimised mode: small-files Lustre client tuning, multi mounts, single MDT
IEEL-2.3-MCO-1MDT: Results from a multi Lustre client test in optimised mode: small-files tuning, multi mounts, single MDT
IEEL-2.3-MCO-2MDT: Results from a multi Lustre client test in optimised mode: small-files tuning, multi mounts, 2 x MDT
IEEL-2.3-MCO-4MDT: Results from a multi Lustre client test in optimised mode: small-files tuning, multi mounts, 4 x MDT
DNE Comparison - Create ops: Charts for File Create operations to help visualise deltas between distributed metadata configurations
DNE Comparison - Stat ops: Charts for File Stat operations to help visualise deltas between distributed metadata configurations
DNE Comparison - Remove ops: Charts for File Remove operations to help visualise deltas between distributed metadata configurations

Table 2 Test Cases
6. Benchmarks

Single MDT tests

Figure 5 Single MDT Configuration

In the single-MDT series of tests, a single Dell PowerVault MD3420 was used as the Lustre metadata target. The RAID enclosure is populated with 22 SAS 15K disks, all configured as a RAID10 disk group. The RAID group is then mapped as a single virtual disk to the metadata servers. Only one RAID controller is active in this configuration; the second controller is in standby-failover mode.
Single Client Default - IEEL-2.3-SCD-1MDT

This test case looks at the performance of a typical Lustre client with one Lustre filesystem mount point and with no Lustre client-side specific tuning. It uses only a single MDT and therefore does not use the DNE-1 feature, i.e. this is the baseline Lustre performance.

Figure 6 IEEL-2.3-SCD-1MDT - File Operations (operations per second vs. number of threads, for File Create, File Stat and File Remove)

Figure 7 IEEL-2.3-SCD-1MDT - Directory Operations (operations per second vs. number of threads, for Directory Create, Directory Stat and Directory Remove)
The metadata performance of a single client only scales for stat operations. It is very clear that the create and remove operations do not scale well with an increasing number of threads; the operation rates stop scaling at just 4 threads. This is caused by the serialization of the filesystem-modifying MDC requests (there is 1 outstanding MDC RPC call for each MDT, versus 8 OSC RPC calls for each OST). The fix for this serialization has been developed and implemented in the latest community Lustre 2.8; details can be found in LU-5319. The next test case shows metadata performance after applying the workaround, allowing the Lustre metadata server to handle multiple MDC requests in parallel.

Single Client Optimised - IEEL-2.3-SCO-1MDT

Figure 8 and Figure 9 show results for the optimised single-client test case. The Lustre client side has been optimised for small-file performance and for high transactional workloads; see Table 3 for the parameters and their values. In order to mitigate the serialization of modifying MDC requests, Lustre has been mounted 16 times per node and each thread uses a different mount point. This way we avoid serialization of the modifying RPC requests on the metadata server. The results show a significant improvement for all types of operations. Figure 10 and Figure 11 show a before-and-after comparison of the optimisation. The most significant change is for file create and remove operations: performance scales almost linearly with an increasing number of threads. This is a major improvement to Lustre's metadata capability, and it enables Lustre to perform much better for multithreaded or farm workloads on a single compute node.

Linux Client Tuning
Parameter name       Default  Tuned
MAX_RPCS_IN_FLIGHT   32       256
MAX_DIRTY_MB         256      1024
MAX_PAGES_PER_RPC    256      1024
Table 3 Client Side Tuning
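The Table 3 tuning would be applied on each client with lctl, along the following lines. This is a sketch: the paper does not state which subsystem (osc, mdc, or both) each parameter was applied to, so the osc.* form shown here is an assumption.

```shell
# Client-side tuning corresponding to Table 3 (values from the paper;
# the osc.* target is an assumption, not stated in the original text)
lctl set_param osc.*.max_rpcs_in_flight=256
lctl set_param osc.*.max_dirty_mb=1024
lctl set_param osc.*.max_pages_per_rpc=1024
```

Raising max_rpcs_in_flight allows more concurrent RPCs per target, and the larger dirty-page and RPC-size limits help small-file and high-transaction-rate workloads; lctl set_param changes are not persistent across remounts unless set with -P on the MGS.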
Figure 8 IEEL-2.3-SCO-1MDT - File Operations (operations per second vs. number of threads, for File Create, File Stat and File Remove)

Figure 9 IEEL-2.3-SCO-1MDT - Directory Operations (operations per second vs. number of threads, for Directory Create, Directory Stat and Directory Remove)
Comparison of IEEL-2.3-SCD-1MDT vs IEEL-2.3-SCO-1MDT

Figure 10 IEEL-2.3-SCD-1MDT vs IEEL-2.3-SCO-1MDT - File Ops (operations per second at 1, 2, 4, 8 and 16 threads, for File Create, File Stat and File Remove in each mode)

Figure 11 IEEL-2.3-SCD-1MDT vs IEEL-2.3-SCO-1MDT - Dir Ops (operations per second at 1, 2, 4, 8 and 16 threads, for Directory Create, Directory Stat and Directory Remove in each mode)
Multithreaded 64 Client Optimised - IEEL-2.3-MCO-1MDT

The metadata performance continues to scale beyond a single client, but create and remove operations saturate quickly, at just 32 threads. The stat operations scale much better and reach peak performance at 512 threads.

Figure 12 Multi Node Optimised - 1MDT - File Ops (operations per second vs. number of threads, for File Create, File Stat and File Remove)

Figure 13 Multi Node Optimised - 1MDT - Dir Ops (operations per second vs. number of threads, for Directory Create, Directory Stat and Directory Remove)
Two MDTs Tests

Figure 14 Two MDTs Configuration

In this configuration two Dell PowerVault MD3420 RAID disk enclosures are used. Both systems are populated with 22 x SAS 15K HDDs and 2 x SSDs, and all SAS disks in each enclosure are configured as a single RAID10 disk group. Only one RAID controller in each disk enclosure is active in this configuration; the second controller is in standby-failover mode.

Multithreaded 64 Clients Optimised - IEEL-2.3-MCO-2MDT

After adding an additional MDT, the test results show a significant improvement for file operations, although scaling stops at 128 threads for create and remove operations. The directory operations have also improved, but not as much as the file operations.

Figure 15 Multi Client Optimised - IEEL-2.3-MCO-2MDT (file operations per second vs. number of threads, for File Create, File Stat and File Remove)
Figure 16 Multi Client Optimised - IEEL-2.3-MCO-2MDT (directory operations per second vs. number of threads, for Dir Create, Dir Stat and Dir Remove)

Four MDTs Tests

Figure 17 Four MDTs Configuration

In this configuration the same two Dell PowerVault MD3420 disk enclosures are used, fully populated with 22 x SAS 15K HDDs and 2 x SSDs. This time, the SAS disks in each enclosure are split into two RAID10 disk groups, one per controller, so that the hardware presents four MDTs in total and both RAID controllers in each enclosure are active.

Multithreaded 64 Clients Optimised - IEEL-2.3-MCO-4MDT

Adding more MDTs continues to improve metadata performance scaling. In this test case we use the same hardware as in the 2-MDT case, but configured to present 4 MDTs, which has a very positive impact on the recorded performance. It is therefore beneficial to split the disks into two smaller RAID disk groups, one for each controller. This is most likely because the disk enclosure has two controllers, and creating two RAID groups per enclosure allows both controllers to be used at the same time.
Figure 18 64 Nodes Multithreaded - IEEL-2.3-MCO-4MDT (file operations per second vs. number of threads, for File Create, File Stat and File Remove)

Figure 19 64 Nodes Multithreaded - IEEL-2.3-MCO-4MDT (directory operations per second vs. number of threads, for Dir Create, Dir Stat and Dir Remove)
DNE Scaling

Figure 20 DNE Scaling Comparison - File Creates (file create operations per second at 1 to 1024 threads, for 1 MDT, 2 MDTs and 4 MDTs)

Figure 21 MDT Scaling Comparison - File Stats (file stat operations per second at 1 to 1024 threads, for 1 MDT, 2 MDTs and 4 MDTs)
Figure 22 MDT Scaling Comparison - File Removes (file remove operations per second at 1 to 1024 threads, for 1 MDT, 2 MDTs and 4 MDTs)

Comparing the results of the DNE scaling test cases (Figure 20, Figure 21 and Figure 22) clearly shows that performance scales with additional MDTs. It also shows that single-MDT performance is quite limited and can be saturated with 32 threads. As MDTs are added to the system, the peak performance increases with the number of threads. The test results suggest that scaling will continue with additional metadata targets.

7. Conclusion

This paper clearly shows that the two metadata performance enhancements, (1) DNE-1 (parallelisation of metadata servers) and (2) the MDT reply reconstruction improvement (parallelisation of client metadata transactions at the MDS), have a very large positive effect on Lustre metadata performance. With these enhancements the metadata performance of Lustre has been transformed, in terms of both its multithreaded single-node performance and its multi-node metadata scalability. These enhancements resolve Lustre's legacy metadata performance issues, allowing industry-leading parallel filesystem performance in terms of both I/O throughput and metadata performance. The next release of Intel EE for Lustre, due in early 2016, will include the MDT reply reconstruction improvement and also an implementation of DNE-2, which allows the striping of a single directory across multiple MDTs. This will bring together all the features required to fully unlock the scalable metadata performance required by the data-intensive workloads that we now find in modern HPC data centers. This next set of Intel EE for Lustre metadata software improvements will also enable Lustre to take advantage of new Intel advancements in non-volatile memory technologies to boost metadata performance even further.
The next paper in this series will investigate these new Intel EE for Lustre software features as expressed on Dell storage hardware, in combination with hardware optimisations using non-volatile RAM technologies and MDS server processor and memory configurations, to provide a definitive review of modern Lustre metadata performance possibilities.
8. References

[1] Lustre Manual: http://build.whamcloud.com/job/lustre-manual/lastsuccessfulbuild/artifact/lustre_manual.xhtml
[2] LU-5319: https://jira.hpdd.intel.com/browse/lu-5319
[3] DNE-1: https://wiki.hpdd.intel.com/display/pub/dne+1+remote+directories+high+level+design
[4] Lustre tuning paper: https://www.dell.com/downloads/global/products/pvaul/en/powervault-md32i-performancetuning-white-paper.pdf