MELLANOX MTD2000 NFS-RDMA SDK PERFORMANCE TEST REPORT
Mellanox Technologies, July 2007

This document describes performance testing performed on the Mellanox OFED 1.2 GA NFS-RDMA distribution.

Test Cluster
The figure above illustrates the setup used for testing the NFS-RDMA server. The switch is a Flextronics 24-port DDR InfiniBand switch. The NFS Filer consists of a Mellanox MTD2000 head-end and a Mellanox MTD2000E JBOD expander, providing a total of 32 15K-RPM 36 GB SAS drives. The drives were configured as two RAID0 volumes: one two-disk volume for the operating system, and one 30-disk volume for the NFS export. SLES 10 was installed on the server and both volumes were formatted with XFS. NFS-RDMA with OFED 1.2 GA was then installed and configured.

The clients consisted of:

- Two dual-core, dual-processor 64-bit 3.46 GHz Xeon machines with 4 GB of memory,
- One dual-core, single-processor 32-bit 3.40 GHz Xeon machine with 1 GB of memory, and
- One single-core, dual-processor 64-bit 1.8 GHz AMD Athlon machine with 2 GB of memory.

All clients contained a Mellanox dual-ported DDR InfiniBand adapter. Two of the clients were installed with RHEL 5, and two were installed with SLES 10. All clients ran the Mellanox NFS-RDMA distribution based on OFED 1.2 GA.

Test Description

All tests were performed using version 3.283 of the iozone filesystem performance benchmarking tool (available for download at http://www.iozone.org). The tool was built and installed individually on each machine.

Cluster Testing

In order to test the scalability characteristics of the server, the cluster testing mode of the iozone tool was used. This mode allows multiple clients to participate in a test: a master node communicates with a subordinate iozone agent running on each client node, and the agents coordinate with the master to ensure that all tests start concurrently and that the reported test duration reflects the time for all participating nodes to complete. A separate machine on a separate network was used as the master node to ensure that cluster control traffic did not perturb NFS traffic on the InfiniBand network.

Client Cache

The NFS client uses the Linux buffer cache to improve performance. When an application performs a read or write to a file, the I/O is satisfied through this
buffer cache. When performing a write, the data goes to the buffer cache and is asynchronously flushed to the backing store as memory pressure builds, or in response to a synchronization request (e.g. close). Similarly, when an application reads, the NFS client checks whether the data is already in the buffer cache. If it is, the read is satisfied from the cache; if it is not, a read is issued to the backing store and the data is placed in the buffer cache. For our purposes, in both cases the backing store is the NFS Filer. In order to evaluate the performance of the NFS Filer, therefore, the operation of the client-side buffer cache must be considered.

Write Testing

When performing write testing, it is important that the time it takes the NFS client to flush dirty buffer cache pages to the NFS Filer be included in the performance results. To do this, the close option (-c) was specified to iozone. This instructs iozone to include the time it takes to close the file in the performance calculation. Since close will not return until all pages have been flushed to the NFS Filer, this provides an accurate assessment of NFS Filer performance. Normally, iozone deletes its temporary files after completing a test, so the no-unlink option (-w) was specified to keep the generated file for read testing, as described below.

Read Testing

When performing read testing, it is important to ensure that the data is not already in the client's buffer cache because, as discussed above, if a page is already present the NFS Filer is not involved in the operation. If we are performing a read test, the file must already exist, and if it exists its data may well be in the client buffer cache. To ensure that it is not, the filesystem is unmounted and remounted between tests, effectively invalidating all buffer cache pages for the filesystem.
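As a concrete illustration, a single-node write pass using the options discussed above might be constructed as in the sketch below. The mount point, file name, and record/file sizes are assumptions for illustration only; the exact command lines used in the tests are in bigfile_master.sh and recsize_master.sh.

```shell
#!/bin/sh
# Hypothetical single-node sequential write test.
#   -i 0       run the write/rewrite test only
#   -r 128k    record size
#   -s 1g      file size
#   -c         include close() time, so flushing dirty pages to the
#              NFS Filer counts toward the measured throughput
#   -w         do not unlink the file, so a later read test can reuse it
# MNT is an assumed mount point for the NFS-RDMA export.
MNT=/mnt/nfsrdma
WRITE_CMD="iozone -i 0 -r 128k -s 1g -c -w -f $MNT/iozone.tmp"
echo "$WRITE_CMD"   # a real run would execute the command rather than print it
```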
In addition, to ensure that unrelated dirty pages do not inadvertently affect the result, a filesystem sync operation is performed between read tests.

Server Cache

The NFS Filer uses the local VFS as its backing store for exported volumes. For our tests, the local filesystem is XFS, backed by a 30-disk striped RAID0 volume. The goal of the testing, however, is to evaluate InfiniBand and NFS, not the XFS filesystem. For this reason, it is desirable that file data be served from the buffer cache whenever possible. For read processing, this is accomplished by syncing between tests so that the newly generated read data has the maximum amount of memory available and is not polluted with data from earlier tests or unrelated data on the server.
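The unmount/remount and sync steps performed between read tests can be sketched as a small helper. The server name, export path, and mount invocation are placeholders, not taken from the report; in particular, the NFS-RDMA mount option syntax varied across kernels of this era and is omitted here.

```shell
#!/bin/sh
# Sketch of the client cache-invalidation step performed between read tests.
# SERVER, EXPORT, and MNT are illustrative placeholders.
SERVER=filer
EXPORT=/export
MNT=/mnt/nfsrdma

invalidate_client_cache() {
    umount "$MNT"                          # drop every cached page for the fs
    mount -t nfs "$SERVER:$EXPORT" "$MNT"  # remount (RDMA options omitted)
    sync                                   # flush unrelated dirty pages
}
```

Unmounting is a heavier hammer than dropping individual pages, but it guarantees that every subsequent read must travel to the NFS Filer.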
This holds for write testing as well, so that any data flushed to disk by the buffer cache belongs to the current test and not to an earlier test or unrelated process.

Test Description

The iozone performance tool was used to test NFS sequential read and write performance across a range of record and file sizes. Two scripts were written: bigfile_master.sh and recsize_master.sh. The bigfile_master.sh script was used to test performance across file sizes from 64 MB to 64 GB; for this test, a constant record size of 128 KB was used. The recsize_master.sh script was used to test performance across a range of record sizes from 4 KB to 512 KB; for these tests, a constant file size of 1 GB was used. Cells marked with an X in the table below indicate the record/file size combinations that were specifically tested.

File      Record Size
Size      4K   8K   16K  32K  64K  128K 256K 512K
64M       .    .    .    .    .    X    .    .
128M      .    .    .    .    .    X    .    .
256M      .    .    .    .    .    X    .    .
512M      .    .    .    .    .    X    .    .
1G        X    X    X    X    X    X    X    X
2G        .    .    .    .    .    X    .    .
4G        .    .    .    .    .    X    .    .
8G        .    .    .    .    .    X    .    .
16G       .    .    .    .    .    X    .    .
32G       .    .    .    .    .    X    .    .
64G       .    .    .    .    .    X    .    .

All tests were run with 1, 2, 3, and 4 participating nodes to evaluate how performance scaled as node count increased. Note that for any given file size, the amount of data being read or written from the perspective of the NFS Filer is cumulative: if 4 nodes each write a 1 GB file, the server sees 4 GB of data. This observation is important when interpreting the performance results, because the tipping point where file size overflows the server-side buffer cache occurs earlier (relative to client file size) for larger node counts. See the script files bigfile_master.sh and recsize_master.sh for detail on the exact iozone command syntax used.
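Although the exact contents of bigfile_master.sh and recsize_master.sh are not reproduced here, a multi-node run of this shape can be driven from the master with iozone's -+m (distributed/throughput) mode, roughly as follows. The host names, paths, and the four-entry client list are assumptions for illustration.

```shell
#!/bin/sh
# Sketch of a 4-node iozone cluster (throughput-mode) run.
# Each line of the client list names a node, its test directory, and the
# path to the iozone binary on that node.
cat > clients.txt <<'EOF'
node1 /mnt/nfsrdma /usr/local/bin/iozone
node2 /mnt/nfsrdma /usr/local/bin/iozone
node3 /mnt/nfsrdma /usr/local/bin/iozone
node4 /mnt/nfsrdma /usr/local/bin/iozone
EOF
# -+m clients.txt  distributed mode using the client list above
# -t 4             four concurrent clients
# -i 0 / -i 1      write then read phases; -c and -w as discussed earlier
echo iozone -+m clients.txt -t 4 -r 128k -s 1g -i 0 -i 1 -c -w
```

The echo stands in for the real invocation; an actual run also requires a remote shell (rsh/ssh) configured so the master can start the iozone agents on each node.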
Results

Read Performance

[Figure: Read Throughput by File Size. Y-axis: throughput, 0 to 1,400,000; X-axis: file size in megabytes, 64 to 65,536; series: 1, 2, 3, and 4 nodes.]

NFS Filer performance is very good when serving data from cache. Up to the point where the cumulative amount of data exceeds the NFS Filer buffer cache, the server is able to maintain wire rate across four nodes. Note that throughput at larger node counts fell off sharply at a file size of about 2 GB, or 4 nodes × 2 GB = 8 GB cumulative, which is the total amount of memory in the server.
[Figure: Read Throughput by Record Size. Y-axis: throughput, 0 to 1,400,000; X-axis: record size in kilobytes, 4 to 512; series: 1, 2, 3, and 4 nodes.]

This result strongly implies that record size is not a significant read performance factor.
Write Performance

[Figure: Write Throughput by File Size. Y-axis: throughput, 0 to 700,000; X-axis: file size in megabytes, 64 to 65,536; series: 1, 2, 3, and 4 nodes.]

Write performance rises with node count until the aggregate data size reaches about 40% of the server's memory capacity. Beyond this point, performance trends down toward the backing store's random I/O performance limit.
[Figure: Write Throughput by Record Size. Y-axis: throughput, 0 to 400,000; X-axis: record size in kilobytes, 4 to 512; series: 1, 2, 3, and 4 nodes.]

As for read, record size does not appear to be a significant factor with respect to write performance.