Red Hat Storage (RHS) Performance. Ben England Principal S/W Engr., Red Hat

What this session covers:
- What's new in Red Hat Storage performance since Summit 2013
- How to monitor RHS performance
- Future directions for RHS performance

Related presentations:
- Other RHS presentations at Summit 2014: http://www.redhat.com/summit/sessions/topics/red-hat-storageserver.html
- Previous RHS performance presentations (2012, 2013): http://rhsummit.files.wordpress.com/2013/07/england_th_0450_rhs_perf_practices-4_neependra.pdf and http://rhsummit.files.wordpress.com/2012/03/england-rhs-performance.pdf

Improvements since last year:
- libgfapi removes the FUSE bottleneck for high-IOPS applications
- FUSE Linux caching can now be utilized with the fopen-keep-cache feature (see the mount sketch below)
- Brick configuration alternatives for small-file, random-I/O-centric configurations
- SSD where it helps, and configuration options
- Swift: RHS is converging with OpenStack Icehouse, and is scalable and stable
- NFS: large sequential writes are more stable, with a 16x larger RPC size limit
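
To illustrate the fopen-keep-cache item above, here is a minimal mount sketch; the server and volume names (server1, myvolume) and the mount point are hypothetical, and the option is passed straight through to the glusterfs FUSE client:

    # mount a Gluster volume over FUSE, asking the kernel to keep cached
    # pages across open() calls instead of invalidating them each time
    mount -t glusterfs -o fopen-keep-cache server1:/myvolume /mnt/myvolume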

Anatomy of Red Hat Storage (RHS): [architecture diagram] Access paths into an RHS server: Windows clients via smbd (CIFS), HTTP clients via Swift, NFSv3 clients, and applications such as qemu (KVM), Hadoop, Swift, and smbd using libgfapi; other clients use the glusterfs client process, which goes through the Linux kernel VFS and FUSE. On the server side, glusterfsd brick server processes serve the bricks and communicate with the other RHS servers.

Libgfapi development and significance:
- Python and Java bindings
- Integration with libvirt and SMB
- Gluster performs better than Ceph for Cinder volumes (Principled Technologies)
- Lowered context-switching overhead
- A distributed libgfapi benchmark is available at https://github.com/bengland2/parallel-libgfapi
- The fio benchmark has a libgfapi engine: https://github.com/rootfs/fio/blob/master/engines/glusterfs.c (example below)
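
A minimal sketch of driving a Gluster volume directly through fio's libgfapi engine; the volume name (testvol) and server name (rhs-server1) are hypothetical, and the engine option names (volume, brick) should be checked against the fio build in use:

    # 16-KB random writes against a Gluster volume, bypassing FUSE entirely
    fio --name=gfapi-randwrite \
        --ioengine=gfapi --volume=testvol --brick=rhs-server1 \
        --rw=randwrite --bs=16k --size=1g --numjobs=4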

Libgfapi tips:
- For each Gluster volume: gluster volume set your-volume allow-insecure on, and gluster volume set your-volume stat-prefetch on (if it is currently set to off)
- For each Gluster server: add "option rpc-auth-allow-insecure on" to /etc/glusterfs/glusterd.vol, then restart glusterd
- Watch TCP port consumption: on the client, TCP ports = bricks/volume x libgfapi instances; on the server, TCP ports = bricks/server x clients' libgfapi instances. Control the number of bricks per server.
(A consolidated command sketch follows.)
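
The same tips as commands, assuming a volume named your-volume and an RHS 2.x/RHEL 6 server where glusterd is restarted via the service command:

    # per volume: allow libgfapi clients connecting from unprivileged ports,
    # and make sure stat-prefetch is enabled
    gluster volume set your-volume allow-insecure on
    gluster volume set your-volume stat-prefetch on

    # per server: add the following line to /etc/glusterfs/glusterd.vol ...
    #   option rpc-auth-allow-insecure on
    # ... then restart the management daemon
    service glusterd restart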

libgfapi - performance for single client & server with 40-Gbps IB connection, Nytro SSD caching

Libgfapi gets rid of the client-side bottleneck: [chart] libgfapi vs. FUSE random-write IOPS; 64 threads on 1 client, 16-KB transfer size, 4 servers, 6 bricks/server, 1 GB/file, 4096 requests/file; y-axis: client IOPS (requests/sec), 0-60000.

Libgfapi and small files: 4 servers, 10-GbE with jumbo frames, 2-socket Westmere, 48 GB RAM, 12 7200-RPM drives, 6 RAID1 LUNs, RHS 2.1U2. 1024 libgfapi processes spread across the 4 servers; each process writes/reads 5000 1-MB files (1000 files/dir). Aggregate write rate: 155 MB/s/server. Aggregate read rate: 1500 MB/s/server at 3000 disk IOPS/server. On reads, the disks are the bottleneck!

Libgfapi vs. Ceph for OpenStack Cinder: Principled Technologies compared the two for Havana RDO Cinder. The results show Gluster outperformed Ceph significantly in some cases: sequential large-file I/O and small-file I/O (also sequential). Ceph outperformed Gluster in the pure-random-I/O case; why? RAID6 appears to be the root cause, since we saw high average wait times at the block device.
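
That kind of block-device wait is easy to spot with plain iostat on the RHS servers; a minimal sketch (the device name sdb is hypothetical):

    # extended per-device statistics every 5 seconds; a large await value on
    # the RAID6 LUN during random writes points at the parity-update penalty
    iostat -dmx 5 /dev/sdb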

Gluster libgfapi vs. Ceph for OpenStack Cinder, large files: example with 4 servers, 4-16 guests, 4 threads/guest, 64-KB file size. [charts] 4-node write and 4-node read throughput (MB/s) vs. number of VMs (4, 8, 12, 16), Ceph vs. Gluster.

Libgfapi with OpenStack Cinder, small files: Gluster % improvement over Ceph for Cinder-volume small-file throughput; average file size 64 KB, 32K files/thread, 4 threads/guest. [chart] % improvement relative to Ceph (about -20% to +60%) for append, create, and read, across configurations of (compute nodes, total guests, total threads): (1,1,4), (2,2,8), (4,4,16), (1,4,16), (2,8,32), (4,16,64).

Random I/O on RHEL OSP with RHS Cinder volumes: guests were rate-limited for sizing, and response time was measured. [charts] OSP on RAID6 RHS and OSP on JBOD RHS, random-write IOPS vs. latency; OSP 4.0, 1 thread/instance, 4 computes, 16-GB file size, 64-KB record size, direct I/O; RAID6 uses 4 RHS servers (2-replica), JBOD uses 3 RHS servers (3-replica). Total IOPS and response time (msec) are plotted against guest count (1 to 128) for per-guest limits of 25, 50, and 100 IOPS.

RHS libgfapi, sequential I/O to Cinder volumes: OSP on RHS, scaling of large-file sequential I/O, RAID6 (2-replica) vs. JBOD (3-replica); OSP 4.0, 4 computes, 1 thread/instance, 4-GB file size, 64-KB record size. [chart] total throughput in MB/sec/server (0-900) for RAID6 writes, JBOD writes, RAID6 reads, and JBOD reads, vs. guest count (1 to 128).

Cinder volumes on libgfapi, small files: increasing concurrency on RAID6 decreases system throughput.

Brick configuration alternatives, metrics: how do you evaluate the optimality of storage when both Gluster replication and RAID are involved?
- Disks/LUN: more bricks = more CPU power
- Physical:usable TB: what fraction of the physical space is available? For example, a 12-disk RAID6 LUN (10 data disks) in a 2-replica volume gives (12/10) x 2 = 2.4.
- Disk loss tolerance: how many disks can you lose without data loss?
- Physical:logical write IOPS: how many physical disk IOPS per user I/O? Both Gluster replication and RAID affect this.

Brick configuration, how do the alternatives look? Small-file workload, 8 clients, 4 threads/client, 100K files/thread, 64 KB/file; 2 servers in all cases except the JBOD 3-way case.

RAID type    disks/LUN   physical:usable TB ratio   loss tolerance (disks)   physical:logical random-I/O ratio   create files/sec   read files/sec
RAID6        12          2.4                        4                        12                                  234                676
JBOD         1           2.0                        1                        2                                   975                874
RAID1        2           4.0                        2                        4                                   670                773
RAID10       12          4.0                        2                        4                                   308                631
JBOD 3-way   1           3.0                        2                        3                                   500                393?

Brick configuration, relevance of SSDs.
Where SSDs don't matter:
- Sequential large-file read/write: RAID6 can drive near 1 GB/sec of throughput
- Reads cacheable by Linux: RAM is faster than SSD
Where SSDs do matter:
- Random reads with a working set larger than RAM but smaller than the SSD
- Random writes
- Concurrent sequential file writes, which are slightly random at the block device
- Small-file workloads, which are random at the block device because of metadata

Brick configuration, 3 ways to utilize SSDs:
- SSDs as bricks; configuration guideline: multiple bricks per SSD
- RAID controller supplies the SSD caching layer (example: LSI Nytro); transparent to RHEL and Gluster, but no control over the caching policy
- Linux-based SSD caching layer: dm-cache (RHEL7); greater control over caching policy, etc.; not supported on RHEL6, so not supported in Denali (RHS 3.0)
(A dm-cache-style setup sketch follows.)
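
For the third option, one way to put dm-cache under an XFS brick on RHEL 7 is through LVM's cache support; this is a hedged sketch with hypothetical device names (/dev/sdb1 on SSD, /dev/sdc1 on the RAID6 LUN), volume-group/LV names, and sizes, not the exact procedure used in these tests:

    # one volume group spanning the SSD and the spinning RAID6 LUN
    pvcreate /dev/sdb1 /dev/sdc1
    vgcreate vg_brick /dev/sdb1 /dev/sdc1

    # data LV on the spinning LUN, cache pool on the SSD
    lvcreate -n brick_lv -L 2T vg_brick /dev/sdc1
    lvcreate -n cache_lv -L 200G --type cache-pool vg_brick /dev/sdb1

    # attach the cache pool to the data LV, then build the brick filesystem
    lvconvert --type cache --cachepool vg_brick/cache_lv vg_brick/brick_lv
    mkfs.xfs -i size=512 /dev/vg_brick/brick_lv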

Example of SSD performance gain, pure SSD vs. write-back cache for small files. smallfile workload: 1,000,000 files total, fsync-on-close, hash-into-dirs, 100 files/dir, 10 sub-directories/dir, exponential file sizes. [chart] files/sec (0-70000) for SSD vs. write-back caching, by operation type (create, read), thread count (1 or 16), and file size (4 or 64 KB).

SSDs, multiple bricks per SSD: distributes CPU load across glusterfsd brick server processes; the workload here is random I/O.

SSD caching by LSI Nytro RAID controller

dm-cache in RHEL7 can help with small-file reads

SSD conclusions:
- You can add SSD to Gluster volumes today in any of the above ways
- Use more SSD than server RAM to see the benefit
- SSD is not really needed for sequential large-file I/O workloads
- Each of the 3 alternatives has weaknesses: an SSD brick is very expensive and requires too many bricks in the Gluster volume; firmware caching is very conservative in using SSD blocks; dm-cache is RHEL7-only and not really writeback caching
- Could Gluster become SSD-aware? That could reduce the latency of metadata access such as change logs, .glusterfs, the journal, etc.

Swift example test configuration: [diagram] 4 clients (Client-1 through Client-4) and 8 servers (server-1 through server-8) connected through a Cisco Nexus 7010 10-GbE network switch; magenta = private replication traffic subnet, green = public HTTP traffic subnet.

Swift-on-RHS, large objects: no tuning except 10-GbE with 2 NICs, MTU=9000 (jumbo frames), and the rhs-high-throughput tuned profile; 150-MB average object size, 4 clients, 8 servers. Writes: 1.6 GB/sec over HTTP = 400 MB/sec/client at 30% network utilization. Reads: 2.8 GB/s over HTTP = 700 MB/s/client at 60% network utilization. (Tuning sketch below.)
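
A hedged sketch of the two host-level settings named above; the interface name (eth2) is hypothetical, and in practice the MTU would be made persistent in the interface's ifcfg file rather than set ad hoc:

    # apply the RHS-supplied high-throughput tuned profile
    tuned-adm profile rhs-high-throughput

    # enable jumbo frames on the 10-GbE data interface
    ip link set dev eth2 mtu 9000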

Swift-on-RHS, small objects: Catalyst workload, 28.68 million objects, PUT and GET; 8 physical clients, 64 threads/client, 1 container/client, file sizes from 5 KB up to 30 MB, average 30 KB. [chart] PUT throughput (files/s, 0-160) for three PUT runs vs. fileset size (in units of 10K files, 0-500).

RHS performance monitoring:
- Start with the network
- Are there hot spots on the servers or clients?
- gluster volume profile your-volume [ start | info ]
- Performance Co-Pilot (PCP) plugin
- gluster volume top your-volume [ read | write | opendir | readdir ]
- Performance Co-Pilot demo
(Command examples below.)
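
A minimal sketch of the gluster monitoring commands listed above, for a hypothetical volume named your-volume:

    # start collecting per-brick FOP latency and throughput statistics,
    # then print what has accumulated so far
    gluster volume profile your-volume start
    gluster volume profile your-volume info

    # show the busiest files/directories per brick for selected operations
    gluster volume top your-volume read
    gluster volume top your-volume write
    gluster volume top your-volume opendir
    gluster volume top your-volume readdir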

Recorded demo of RHS perf.

Future (upstream) directions:
- Thin provisioning (LVM): allows Gluster volumes with different users and policies to easily co-exist on the same physical storage
- Snapshots (LVM): for online backup/geo-replication of live files
- Small-file optimizations in system calls
- Erasure coding: lowering cost/TB and write traffic on the network
- Scalability: larger server and brick counts in a volume
- RDMA + libgfapi: bypassing the kernel for the fast path
- NFS-Ganesha and pNFS: how they complement glusterfs

LVM thin-provisioning + snapshot testing: [diagram] the stack under test is an XFS file system on an LVM thin-provisioned volume (with snapshots Snapshot[1] ... Snapshot[n]), carved from an LVM pool built from an LVM PV on a 12-drive RAID6 volume behind an LSI MegaRAID controller. (Setup sketch below.)
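
A hedged sketch of building that stack with stock LVM tools; the device name (/dev/sdb), volume-group and LV names, and sizes are hypothetical placeholders:

    # RAID6 virtual drive exported by the MegaRAID controller
    pvcreate /dev/sdb
    vgcreate vg_rhs /dev/sdb

    # thin pool, then a thin-provisioned volume carved out of it
    lvcreate -L 9T --thinpool thinpool vg_rhs
    lvcreate -V 9T --thin -n brick1 vg_rhs/thinpool

    # XFS filesystem on the thin LV
    mkfs.xfs -i size=512 /dev/vg_rhs/brick1

    # point-in-time snapshots of the thin LV need no preallocated space
    lvcreate -s -n brick1_snap1 vg_rhs/brick1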

Avoiding LVM snapshot fragmentation: an LVM enhancement, bio-sorting on writes; tested with just XFS over an LVM thin-provisioned volume snapshot.

Scalability testing using virtual RHS servers: [diagram] up to 2n clients (Client[1] ... Client[2n]) connected through a 10-GbE switch to N KVM hosts, each hosting up to 8 RHSS VMs with 3 GB RAM and 2 cores apiece.

Scaling of throughput with virtual Gluster servers for sequential multi-stream I/O: glusterfs, up to 8 bare-metal clients, up to 4 bare-metal KVM hosts, up to 8 VMs/host, 2-replica volume, read-hash-mode 2, 1 brick/VM, read-ahead 4096 KB on /dev/vdb, deadline scheduler; 2 iozone threads/server, 4 GB/thread, threads spread evenly across the 8 clients. [chart] read and write throughput (MB/s, 0-3500) vs. server count (0-35). (Block-device tuning sketch below.)
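
A hedged sketch of how the per-VM block-device settings named above could be applied inside each RHSS VM; note that blockdev --setra counts 512-byte sectors, so 8192 sectors = 4096 KB of read-ahead:

    # read-ahead on the brick block device
    blockdev --setra 8192 /dev/vdb

    # switch that device's I/O scheduler to deadline
    echo deadline > /sys/block/vdb/queue/scheduler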

Optimizing small-file creates: test done with dm-cache on RHEL7 with an XFS filesystem, no Gluster, 180 GB of total data; 100,000 files/thread, 64 KB/file, 16 threads, 30 files/dir, 5 subdirs/dir. [chart] small-file create throughput (files/sec, 0-10000) for the four combinations of fsync-before-close (Y/N) and hash-into-directories (Y/N).
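
For reference, a create run with those parameters might be driven by the smallfile benchmark roughly as follows; the option names here are assumptions based on the benchmark's usual usage and should be verified against smallfile_cli.py --help, and the target directory is hypothetical:

    # hypothetical smallfile invocation approximating the test above
    python smallfile_cli.py --operation create \
        --threads 16 --files 100000 --file-size 64 \
        --files-per-dir 30 --dirs-per-dir 5 \
        --hash-into-dirs Y --fsync Y \
        --top /mnt/dmcache-xfs/smf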

Avoiding the FSYNC fop in AFR: patch at http://review.gluster.org/#/c/5501/, in Gluster 3.5; it reverses the loss in small-file performance from RHS 2.0 to 2.1. [chart] effect of the no-fsync-after-append enhancement on RHS throughput (files/sec, 0-700), with and without the change, by threads/server, files/thread, and file size (KB).

Check out other Red Hat Storage activities at the Summit. Enter the raffle to win a $500 gift card or a trip to LegoLand! Entry cards are available in all storage sessions; the more you attend, the more chances you have to win. Talk to storage experts at the Red Hat booth (#211): Infrastructure, Infrastructure-as-a-Service, Storage; and at the Partner Solutions booth (#605): upstream Gluster projects, Developer Lounge. Follow us on Twitter and Facebook: @RedHatStorage