Z RESEARCH, Inc. - Commoditizing Supercomputing and Superstorage
Massive Distributed Storage over InfiniBand RDMA

What is GlusterFS? GlusterFS is a cluster file system that aggregates multiple storage bricks over InfiniBand RDMA into one large parallel network file system. GlusterFS is more than making data available over a network or organizing data on disk storage. Typical clustered file systems aggregate storage and provide unified views, but:
- scalability comes with increased cost, reduced reliability, difficult management, and longer maintenance and recovery times
- limited reliability means volume sizes are kept small
- capacity and I/O performance can be limited
GlusterFS allows capacity and I/O to scale using industry-standard, inexpensive modules.

GlusterFS Features
1. Fully POSIX compliant
2. Unified VFS
3. More flexible volume management (stackable features)
4. Application-specific scheduling / load balancing: round-robin, adaptive least usage (ALU), non-uniform file access (NUFA)
5. Automatic file replication (AFR), snapshot, and undelete
6. Striping for performance
7. Self-heal - no fsck needed
8. Pluggable transport modules (IB verbs, IB-SDP)
9. I/O accelerators: I/O threads, I/O cache, read-ahead, and write-behind
10. Policy driven: user/group/directory-level quotas, access control lists (ACLs)
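To make the stackable design concrete, here is a minimal sketch of a client-side volume file in the 1.3-era configuration syntax. The host names, volume names, and translator choices below are illustrative assumptions, not taken from this deck, and exact option spellings vary between releases.

volume brick1
  type protocol/client
  option transport-type ib-verbs       # pluggable transport: ib-verbs, ib-sdp or tcp
  option remote-host storage1          # hypothetical brick host
  option remote-subvolume posix1       # name exported by the server-side spec
end-volume

volume brick2
  type protocol/client
  option transport-type ib-verbs
  option remote-host storage2
  option remote-subvolume posix1
end-volume

volume unify0
  type cluster/unify                   # aggregate both bricks into one namespace
  subvolumes brick1 brick2
  option scheduler rr                  # round-robin; alu or nufa are alternatives
  # some releases also require an 'option namespace <volume>' line here
end-volume

volume readahead0
  type performance/read-ahead          # I/O accelerator translators stack on top
  subvolumes unify0
end-volume

volume iocache0
  type performance/io-cache
  subvolumes readahead0
end-volume

A client would mount such a spec with the glusterfs client program, e.g. glusterfs -f client.vol /mnt/glusterfs (invocation details depend on the release).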

GlusterFS Design (architecture diagram). Client side: a cluster of clients (supercomputer or data center) runs the GLFS client, each instance carrying a clustered volume manager and a clustered I/O scheduler. Compatibility with MS Windows and other Unices is provided through NFS/Samba storage gateways, which re-export the volume over TCP/IP (GigE). Server side: the GlusterFS clustered file system runs on the x86-64 platform as storage bricks 1 through N, each exporting a volume through GLFSD. Clients and bricks communicate over InfiniBand / GigE / 10GigE using RDMA.

Stackable Design (diagram). On the client, translators are stacked above the VFS entry point: I/O cache, read-ahead, and unify sit on top of the client transport. The client connects to the GlusterFS servers over GigE/10GigE TCP/IP or InfiniBand RDMA. On each server, a protocol server translator sits on top of a POSIX storage translator backed by an Ext3 file system.
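A matching server-side volume file can be sketched in the same syntax; the backing directory and names below are illustrative assumptions, and option spellings differ between releases.

volume posix1
  type storage/posix                   # export a directory on the local Ext3 file system
  option directory /export/brick       # hypothetical backing directory
end-volume

volume server
  type protocol/server
  option transport-type ib-verbs       # or ib-sdp / tcp, matching the client side
  subvolumes posix1
  option auth.ip.posix1.allow *        # access control; the option key varies by release
end-volume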

GlusterFS Function - unify. Client view (unify/round-robin): /files/aaa, /files/bbb, /files/ccc. With unify, each file is stored whole on exactly one brick: server/head node 1 holds /files/aaa, node 2 holds /files/bbb, and node 3 holds /files/ccc.

GlusterFS Function - unify + AFR. Client view (unify/round-robin + AFR): /files/aaa, /files/bbb, /files/ccc. With AFR added, each file exists on two bricks: server/head node 1 holds /files/aaa and /files/ccc, node 2 holds /files/bbb and /files/aaa, and node 3 holds /files/ccc and /files/bbb.
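In configuration terms, the replication shown above is just another translator layered beneath unify; a minimal sketch (volume names are illustrative, carried over from the earlier client-spec example):

volume afr0
  type cluster/afr                     # automatic file replication across its subvolumes
  subvolumes brick1 brick2
end-volume

volume unify0
  type cluster/unify                   # unify then aggregates replica sets into one namespace
  subvolumes afr0 afr1                 # afr1 would be defined analogously over two more bricks
  option scheduler rr
end-volume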

GlusterFS Function - stripe. Client view (stripe): /files/aaa, /files/bbb, /files/ccc. With stripe, each file is split into blocks spread across all bricks: server/head nodes 1, 2, and 3 each hold a stripe of /files/aaa, /files/bbb, and /files/ccc.
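Striping is expressed the same way; a minimal sketch with an illustrative block size (the option syntax differs slightly between releases, some of which use a pattern form such as '*:128KB'):

volume stripe0
  type cluster/stripe                  # split each file into blocks spread across the bricks
  option block-size 128KB
  subvolumes brick1 brick2 brick3
end-volume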

I/O Scheduling
1. Round-robin
2. Adaptive least usage (ALU)
3. NUFA
4. Random
5. Custom

Example unify volume using the ALU scheduler:

volume bricks
  type cluster/unify
  subvolumes ss1c ss2c ss3c ss4c
  option scheduler alu
  option alu.limits.min-free-disk 60GB
  option alu.limits.max-open-files 10000
  option alu.order disk-usage:read-usage:write-usage:open-files-usage:disk-speed-usage
  option alu.disk-usage.entry-threshold 2GB    # Units in KB, MB and GB are allowed
  option alu.disk-usage.exit-threshold 60MB    # Units in KB, MB and GB are allowed
  option alu.open-files-usage.entry-threshold 1024
  option alu.open-files-usage.exit-threshold 32
  option alu.stat-refresh.interval 10sec
end-volume
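Switching schedulers only changes the option lines on the unify volume; as a hedged sketch, the same bricks volume with plain round-robin instead of ALU:

volume bricks
  type cluster/unify
  subvolumes ss1c ss2c ss3c ss4c
  option scheduler rr                  # or nufa / random, each with its own tuning options
end-volume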

Benchmarks

GlusterFS Throughput & Scaling Benchmarks
Benchmark Environment
Method: multiple 'dd' processes with varying block sizes read and write from multiple clients simultaneously.
GlusterFS brick configuration (16 bricks):
  Processor - Dual Intel(R) Xeon(R) CPU 5160 @ 3.00GHz
  RAM - 8GB FB-DIMM
  Linux kernel - 2.6.18-5+em64t+ofed111 (Debian)
  Disk - SATA-II 500GB
  HCA - Mellanox MHGS18-XT/S InfiniBand HCA
Client configuration (64 clients):
  Processor - Single Intel(R) Pentium(R) D CPU 3.40GHz
  RAM - 4GB DDR2 (533 MHz)
  Linux kernel - 2.6.18-5+em64t+ofed111 (Debian)
  Disk - SATA-II 500GB
  HCA - Mellanox MHGS18-XT/S InfiniBand HCA
Interconnect switch: Voltaire port InfiniBand switch (14U)
GlusterFS version: 1.3.pre0-BENKI
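The 'multiple dd' method can be reproduced with commands along these lines; the file names, sizes, and block size below are illustrative, not the exact parameters used in this benchmark.

# write test: each client streams a large file onto the mounted GlusterFS volume
$ dd if=/dev/zero of=/mnt/glusterfs/out-$(hostname) bs=1M count=4096

# read test: stream the file back, discarding the data
$ dd if=/mnt/glusterfs/out-$(hostname) of=/dev/null bs=1M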

GlusterFS Performance - Aggregated I/O benchmark on 16 bricks (servers) and 64 clients over the IB verbs transport: 13 GB/s! Peak aggregated read throughput was 13 GB/s. Beyond a certain threshold, write performance plateaus because of the disk I/O bottleneck. Provisioning system memory greater than the peak load ensures the best possible performance. The ib-verbs transport driver is about 30% faster than the ib-sdp transport driver.

Scalability. Performance improves as the number of bricks increases; throughput rises correspondingly as the server count grows from 1 to 16.

GlusterFS Value Proposition
- A single solution for tens of terabytes to petabytes
- No single point of failure: completely distributed, no centralized metadata
- Non-stop storage: withstands hardware failures, self-healing, snapshots
- Data easily recovered even without GlusterFS
- Customizable schedulers
- User friendly: installs and upgrades in minutes
- Operating system agnostic
- Extremely cost effective: deploys on any x86-64 hardware

http://www.zresearch.com http://www.gluster.org Thank You!

Backup Slides

GlusterFS vs Lustre Benchmark
Benchmark Environment
Brick configuration (10 bricks):
  Processor - 2 x AMD Dual-Core Opteron Model 275
  RAM - 6 GB
  Interconnect - InfiniBand 20 Gb/s, Mellanox MT25208 InfiniHost III Ex
  Hard disk - Western Digital WD1200JB-00REA0 ATA disk
Client configuration (20 clients):
  Processor - 2 x AMD Dual-Core Opteron Model 275
  RAM - 6 GB
  Interconnect - InfiniBand 20 Gb/s, Mellanox MT25208 InfiniHost III Ex
  Hard disk - Western Digital WD1200JB-00REA0 ATA disk
Software versions:
  Operating system - Red Hat Enterprise GNU/Linux 4 (Update 3)
  Linux kernel - 2.6.9-42
  Lustre - 1.4.9.1
  GlusterFS - 1.3-pre2.3

Directory Listing Benchmark (lower is better): Lustre 1.7 s, GlusterFS 1.5 s.
$ find /mnt/glusterfs
The find command walks the directory tree and prints every entry to the console. In this case there were thirteen thousand binary files. Note: commands are the same for both GlusterFS and Lustre, except for the mount-point path.

Copy Local to Cluster File System (lower is better): Lustre 37 s, GlusterFS 26 s.
$ cp -r /local/* /mnt/glusterfs/
The cp utility copies files and directories; 12039 files (595 MB) were copied into the cluster file system.

Copy Local from Cluster File System (lower is better): Lustre 45 s, GlusterFS 18 s.
$ cp -r /mnt/glusterfs/* /local/
The cp utility copies files and directories; 12039 files (595 MB) were copied out of the cluster file system.

Checksum (lower is better): Lustre 45.1 s, GlusterFS 44.4 s.
$ find . -type f -exec md5sum {} \;
md5sum checksums were computed for every file on the file system. In this case there were thirteen thousand binary files.

Base64 Conversion (lower is better): Lustre 25.7 s, GlusterFS 25.1 s.
$ base64 --encode big-file big-file.base64
Base64 is an algorithm for encoding binary data to ASCII and vice versa. This benchmark was performed on a 640 MB binary file.

Pattern Search (lower is better): Lustre 54.3 s, GlusterFS 52.1 s.
$ grep GNU big-file.base64
The grep utility searches a file for a PATTERN and prints the matching lines to the console. This benchmark used a 1 GB ASCII Base64 file.

Data Compression (lower is better). Compression: Lustre 18.3 s, GlusterFS 14.8 s. Decompression: Lustre 16.5 s, GlusterFS 10.1 s.
$ gzip big-file.tar
$ gunzip big-file.tar.gz
GNU gzip compresses files using Lempel-Ziv coding. This benchmark was performed on a 1 GB tar archive.

Apache Web Server (lower is better): GlusterFS 3.17 minutes; Lustre failed to execute**.
Apache served 12039 files (595 MB) over HTTP; a wget client fetched the files recursively.
**Lustre failed after downloading 33 MB out of 585 MB in 11 minutes.

Archiving (lower is better). Archive creation: Lustre 41 s, GlusterFS 25 s. Archive extraction: GlusterFS 43 s; Lustre failed to execute**.
Archive creation: the tar utility created an archive of 12039 files (595 MB) served through GlusterFS.
$ tar czf benchmark.tar.gz /mnt/glusterfs
Archive extraction: the tar utility extracted the archive onto the GlusterFS file system.
$ tar xzf benchmark.tar.gz
**Lustre failed to execute: tar extraction failed under Lustre with a "no space left on device" error, even though df -h showed plenty of free space. It appears the I/O scheduler did a poor job of distributing the data evenly.
tar: lib/libtdss.so.1: Cannot create symlink to `libtdss.so.1.0.0': No space left on device
tar: lib/libg.a: Cannot open: No space left on device
tar: Error exit delayed from previous errors

Aggregated Read Throughput (higher is better), Lustre vs GlusterFS at 4 KB and 16 KB block sizes; the four plotted values were 1.796, 5.782, 11.415, and 11.424 GB/s. Multiple dd processes were executed simultaneously with different block sizes to read from the GlusterFS file system.

Aggregated Write Throughput (higher is better), Lustre vs GlusterFS at 4 KB and 16 KB block sizes; the four plotted values were 0.969, 1.613, 1.886, and 2.191 GB/s. Multiple dd processes were executed simultaneously with different block sizes to write to the GlusterFS file system.