Z RESEARCH, Inc. Commoditizing Supercomputing and Superstorage. Massive Distributed Storage over InfiniBand RDMA

Z RESEARCH, Inc. Commoditizing Supercomputing and Superstorage Massive Distributed Storage over InfiniBand RDMA

What is GlusterFS? GlusterFS is a Cluster File System that aggregates multiple storage bricks over InfiniBand RDMA into one large parallel network file system GlusterFS is MORE than making data available over a network or the organization of data on disk storage. Typical clustered file systems work to aggregate storage and provide unified views but. - scalability comes with increased cost, reduced reliability, difficult management, increased maintenance and recovery time. - limited reliability means volume sizes are kept small. - capacity and i/o performance can be limited.. GlusterFS allows scaling of capacity and I/O using industry standard inexpensive modules!

GlusterFS Features 1. Fully POSIX compliant! 2. Unified VFS! 3. More flexible volume management (stackable features)! 4. Application specific scheduling / load balancing roundrobin; adaptive least usage; non-uniform file access (NUFA)! 5. Automatic file replication (AFR); Snapshot! and Undelete! 6. Striping for performance! 7. Self-heal! No fsck!!!! 8. Pluggable transport modules (IB verbs, IB-SDP)! 9. I/O accelerators - I/O threads, I/O cache, read ahead and write behind! 10. Policy driven - user group/directory level quotas, access control lists (ACL)

GlusterFS Design Client Side Storage Clients Cluster of Clients (Supercomputer, Data Center) GLFS Client GLFS Client Clustered Vol Manager GLFS Client Clustered Vol Manager Clustered Clustered Vol Manager I/O Scheduler Clustered I/O Scheduler Clustered I/O Scheduler GLFS Client GLFS Client Clustered Vol Manager GLFS Client Clustered Vol Manager Clustered Clustered Vol Manager I/O Scheduler Clustered I/O Scheduler Clustered I/O Scheduler Compatibility with MS Windows and other Unices GigE Storage Gateway Storage Storage NFS/Samba Gateway Gateway NFS/Samba NFS/Samba GLFS Client GLFS Client GLFS Client NFS / SAMBA over TCP/IP RDMA RDMA InfiniBand / GigE / 10GigE Server Side GlusterFS Clustered Filesystem on x86-64 platform Storage Brick 1 GLFSD Volume Storage Brick 2 GLFSD Volume RDMA Storage Brick 3 GLFSD Volume Storage Brick N Storage Brick GLFSD4 GLFSD Volume GLFSD Volume Volume

Stackable Design GlusterFS Client VFS I/O Cache Read Ahead Unify GlusterFS Client GlusterFS Server Client TCPIP GigE, 10GigE / InfiniBand - RDMA Client GlusterFS Server Server POSIX GlusterFS Server Server POSIX GlusterFS Server Server POSIX Ext3 Ext3 Ext3

GlusterFS Function - unify Client View (unify/roundrobin)../files/aaa../files/bbb../files/ccc Server/Head Node 1../files/aaa Server/Head Node 2../files/bbb Server/Head Node 3../files/ccc

GlusterFS Function unify+afr Client View (unify/roundrobin+afr)../files/aaa../files/bbb../files/ccc Server/Head Node 1../files/aaa../files/ccc Server/Head Node 2../files/bbb../files/aaa Server/Head Node 3../files/ccc../files/bbb

GlusterFS Function - stripe Client View (stripe)../files/aaa../files/bbb../files/ccc Server/Head Node 1../files/aaa../files/bbb../files/ccc Server/Head Node 2../files/aaa../files/bbb../files/ccc Server/Head Node 3../files/aaa../files/bbb../files/ccc

I/O Scheduling 1. Round robin 2. Adaptive least usage (ALU) 3. NUFA 4. Random 5. Custom volume bricks type cluster/unify subvolumes ss1c ss2c ss3c ss4c option scheduler alu option alu.limits.min-free-disk 60GB option alu.limits.max-open-files 10000 option alu.order disk-usage:read-usage:write-usage:open-files-usage:disk-speed-usage option alu.disk-usage.entry-threshold 2GB # Units in KB, MB and GB are allowed option alu.disk-usage.exit-threshold 60MB # Units in KB, MB and GB are allowed option alu.open-files-usage.entry-threshold 1024 option alu.open-files-usage.exit-threshold 32 option alu.stat-refresh.interval 10sec end-volume

Benchmarks

GlusterFS Throughput & Scaling Benchmarks Benchmark Environment Method: Multiple 'dd' of varying blocks are read and written from multiple clients simultaneously. GlusterFS Brick Configuration (16 bricks) Processor - Dual Intel(R) Xeon(R) CPU 5160 @ 3.00GHz RAM - 8GB FB-DIMM Linux Kernel - 2.6.18-5+em64t+ofed111 (Debian) Disk - SATA-II 500GB HCA - Mellanox MHGS18-XT/S InfiniBand HCA Client Configuration (64 clients) RAM - 4GB DDR2 (533 Mhz) Processor - Single Intel(R) Pentium(R) D CPU 3.40GHz Linux Kernel - 2.6.18-5+em64t+ofed111 (Debian) Disk - SATA-II 500GB HCA - Mellanox MHGS18-XT/S InfiniBand HCA Interconnect Switch: Voltaire port InfiniBand Switch (14U) GlusterFS version 1.3.pre0-BENKI

GlusterFS Performance gregated I/O Benchmark on 16 bricks(servers) and 64 clients over IB Verbs transpor 13GBps! Peak aggregated read read throughput was was 13 13 GBps. After a particular threshold, write write performance plateaus because of of disk disk I/O I/O bottleneck. System memory greater than than the the peak peak load load will will ensure best best possible performance. ib-verbs transport driver driver is is about about 30% 30% faster faster than than ib-sdp transport driver.

Scalability Performance improves when the number of bricks are increased hroughput increases with corresponding increased in servers from 1 to 16

GlusterFS Value Proposition A single solution for 10's of Terabytes to Petabytes No single point of failure completely distributed - no centralized meta-data Non-stop Storage can withstand hardware failures, sel healing, snap-shots Data easily recovered even without GlusterFS Customizable schedulers User Friendly - Installs and upgrades in minutes Operating system agnostic! Extremely cost effective deployed on any x86-64 hardware!

http://www.zresearch.com http://www.gluster.org Thank You!

Backup Slides

GlusterFS Vs Lustre Benchmark Benchmark Environment Brick Config (10 bricks) Processor - 2 x AMD Dual-Core Opteron Model 275 processors RAM - 6 GB Interconnect - InfiniBand 20 Gb/s - Mellanox MT25208 InfiniHost III Ex Hard disk - Western Digital Corp. WD1200JB-00REA0, ATA DISK drive Client Config (20 clients) Processor - 2 x AMD Dual-Core Opteron Model 275 processors RAM - 6 GB Interconnect - InfiniBand 20 Gb/s - Mellanox MT25208 InfiniHost III Ex Hard disk - Western Digital Corp. WD1200JB-00REA0, ATA DISK drive Software Version Operation System - Redhat Enterprise GNU/Linux 4 (Update 3) Linux version - 2.6.9-42 Lustre version - 1.4.9.1 GlusterFS version - 1.3-pre2.3

Directory Listing Benchmark GlusterFS vs Lustre - Directory Listing benchmark 1.8 1.6 1.4 1.7 1.5 Time in Seconds 1.2 1 0.8 0.6 0.4 Lower is better Lustre GlusterFS 0.2 0 Directory Listing $ find /mnt/glusterfs "find" command navigates across the directory tree structure and prints them to console. In this case, there were thirteen thousand binary files. Note: Commands are same for both GlusterFS and Lustre, except the directory part.

Copy Local to Cluster File System Time in Seconds 37.5 37 35 32.5 30 27.5 25 22.5 20 17.5 15 GlusterFS vs Lustre - Copy Local to Cluster File System 12.5 10 7.5 5 2.5 0 Copy Local to Cluster File System 26 Lustre Lower is better GlusterFS $ cp -r /local/* /mnt/glusterfs/ cp utility is used to copy files and directories. Copy 12039 files (595 MB) were copied into the cluster file system.

Copy Local from Cluster File System 45 GlusterFS vs Lustre - Copy Local from Cluster File System 45 40 35 30 Time in Seconds 25 20 15 10 18 Lower is better Lustre GlusterFS 5 0 Copy Local from Cluster File System $ cp -r /mnt/glusterfs/ /local/* cp utility is used to copy files and directories. Copy 12039 files (595 MB) were copied from the cluster file system.

Checksum GlusterFS vs Lustre - Checksum 50 45 40 45.1 44.4 Time in Seconds 35 30 25 20 15 10 5 Lower is better Lustre GlusterFS 0 Checksum Perform md5sum calculation for all files across your file system. In this case, there were thirteen thousand binary files. $ find. -type f -exec md5sum {} \;

Base64 Conversion GlusterFS vs Lustre - Base64 Conversion 27.5 25 25.7 25.1 22.5 Time in Seconds 20 17.5 15 12.5 10 7.5 Lower is better Lustre GlusterFS 5 2.5 0 Base64 Conversion Base64 is an algorithm for encoding binary to ASCII and vice-versa. This benchmark was performed on a 640 MB binary file. $ base64 --encode big-file big-file.base64

Pattern Search GlusterFS vs Lustre - Pattern Search 55 54.3 52.1 50 45 Time in Seconds 40 35 30 25 20 15 Lower is better Lustre GlusterFS 10 5 0 Pattern Search grep utility searches for a PATTERN on a file and prints the matching lines to console. This benchmark used 1GB ASCII BASE-64 file. $ grep GNU big-file.base64

Data Compression GlusterFS vs Lustre - Data Compression 20 18 16 18.3 14.8 16.5 14 Time in Seconds 12 10 8 6 4 2 Lower is better 10.1 Lower is better Lustre GlusterFS 0 Compression Decomression GNU gzip utility compresses files using Lempel-Ziv coding. This benchmark was performed on 1GB TAR binary file. $ gzip big-file.tar $ gunzip big-file.tar.gz

Apache Web Server GlusterFS vs Lustre - Apache Web Server 3.25 3.17 3 2.75 2.5 Time in Minutes 2.25 2 1.75 1.5 1.25 1 0.75 0.5 0.25 Lustre Failed to execute** Lower is better Lustre GlusterFS 0 Apache web server Apache served 12039 files (595 MB) over HTTP protocol. wget client fetched the files recursively. **Lustre failed after downloading 33 MB out of 585 MB in 11 mins.

Archiving GlusterFS vs Lustre - Archiving Time in Seconds 45 40 35 30 25 20 15 10 5 0 41 25 Archive ceration Lower is better Lustre 43 GlusterFS Lustre Failed to execute** Archive extraction Archive Creation tar utility is used for archiving filesystem data. $ tar czf benchmark.tar.gz /mnt/glusterfs 'tar utility created an archive of 12039 files (595 MB) served through GlusterFS. Archive Extraction 'tar utility extracted the archive on to GlusterFS filesystem. $ tar xzf benchmark.tar.gz **Lustre Falied to Execute:- Tar extraction failed under Lustre with no space left on the device error. Disk-free (df -h) utility revealed lot of free space. It appears like I/O scheduler performed a poor job in distributing the data evenly tar: lib/libtdss.so.1: Cannot create symlink to `libtdss.so.1.0.0': No space left on device tar: lib/libg.a: Cannot open: No space left on device tar: Error exit delayed from previous errors

Aggregated Read Throughput Throughput in Giga Bytes per second GlusterFS vs Lustre - Aggregated Read Throughput 12 11.415 11.424 11 10 9 8 7 6 5.782 5 4 3 2 1.796 1 Higher is Better Lustre GlusterFS 0 Data Sizes 4KB 16KB Multiple dd utility were executed simultaneously with different block sizes to read from GlusterFS filesystem.

Aggregated Write Throughput Throughput in Giga Bytes per second GlusterFS vs Lustre - Aggregated Write Throughput 2.25 2.191 2 1.886 1.75 1.613 1.5 1.25 1 0.969 0.75 0.5 0.25 Higher is Better Lustre GlusterFS 0 Data Sizes 4KB 16KB Multiple dd utility were executed simultaneously with different block sizes to write to GlusterFS Filesystem.