Is Open Source good enough? A deep study of Swift and Ceph performance. 11/2013


Is Open Source good enough? A deep study of Swift and Ceph performance
Jiangang.duan@intel.com
11/2013

Agenda
- Self introduction
- Ceph Block service performance
- Swift Object Storage Service performance
- Summary

Self introduction
- Jiangang Duan (段建钢), Open Source Cloud Software team at Intel Shanghai
- Works on OpenStack-related performance
- Blog: http://software.intel.com/en-us/user/335367/track
- This is team work with Chendi Xue, Chen Zhang, Jian Zhang, Thomas Barnes, Xinxin Shu, Xiaoxi Chen, Yuan Zhou, Ziyuan Wang, and others

Agenda
- Self introduction
- Ceph Block service performance
- Swift Object Storage Service performance
- Summary

Ceph Test Environment
[Diagram: five compute nodes (Compute1..Compute5), each hosting VMs (VM1..VMn), connected over 10Gb NICs to four OSD nodes (OSD1..OSD4), each with 10x 1TB 7200rpm HDD and 3x 200GB DC3700 SSD]
- Full 10Gb network
- 4 Xeon UP servers for the Ceph cluster, 16GB memory each
- 10x 1TB SATA HDDs per node for data, attached through an LSI 9205 HBA (JBOD); each disk carries a single partition for its OSD daemon
- 3x SSDs per node for journals, directly connected to the SATA controller, with 20GB of journal per OSD

Ceph Test-Bed Configuration: Software
- Ceph cluster: OS Ubuntu 12.10, kernel 3.6.6, Ceph Cuttlefish 0.61.2, XFS filesystem
- Client host: OS Ubuntu 12.10, kernel 3.5.0-22, OpenStack Folsom (2012.2)
- Client VM: OS Ubuntu 12.10, kernel 3.5.0-22
- Network tuning: enable jumbo frames (1496 bytes -> 8000 bytes)
- XFS tuning: osd mount options xfs = rw,noatime,inode64,logbsize=256k,delaylog; osd mkfs options xfs = -f -i size=2048
- Ceph tuning: default replication setting (2 replicas), 4096 PGs; enlarge read-ahead size to 2048 KB for the OSD disks (see the sketch below); max sync interval 5s -> 10s; large queue and large bytes; OSD op threads 2 -> 20; turn off the FileStore flusher; sync flush false -> true (to avoid an OSD crash bug)
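
One of the tunings above, enlarging the OSD disks' read-ahead window, is applied through sysfs. A minimal Python sketch follows; the device names are placeholders, and only the 2048 KB value comes from the slide.

    #!/usr/bin/env python3
    # Sketch: raise the block-layer read-ahead window to 2048 KB on the OSD data
    # disks, matching the "enlarge read ahead size" tuning listed above.
    # Device names are placeholders for the 10 data HDDs per node; run as root.
    from pathlib import Path

    OSD_DATA_DISKS = ["sdb", "sdc", "sdd"]   # hypothetical device names
    READ_AHEAD_KB = 2048                     # value quoted in the tuning list

    for dev in OSD_DATA_DISKS:
        knob = Path(f"/sys/block/{dev}/queue/read_ahead_kb")
        knob.write_text(f"{READ_AHEAD_KB}\n")
        print(dev, "read_ahead_kb ->", knob.read_text().strip())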

Methodology
- Storage interface: an RBD volume is attached to one virtual machine via the QEMU RBD driver
- Workload: FIO simulating 4 IO patterns: 64KB sequential read/write (SR and SW) with a 60MB/s cap per VM, and 4KB random read/write (RR and RW) with a 100 IOPS cap per VM
- Run rules: each volume is pre-allocated with dd before testing; OSD page caches are dropped before each measurement; 400s warm-up, 600s data collection; 60GB per volume; the number of stress VMs is increased to trace the full load line
- QoS requirements (see the sketch below): less than 20ms latency for random read/write; per-VM throughput above 90% of the predefined threshold, i.e. 54MB/s for SR/SW and 90 IOPS for RR/RW
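
A minimal sketch of the QoS check described above. The thresholds come from the slide (per-VM throughput at least 90% of its cap, and under 20 ms latency for random IO); the function and argument names are just illustrative.

    # Sketch of the QoS criteria above; thresholds are from the slide,
    # everything else is illustrative.
    CAPS = {"SR": 60.0, "SW": 60.0,     # sequential caps, MB/s per VM
            "RR": 100.0, "RW": 100.0}   # random caps, IOPS per VM

    def meets_qos(pattern, per_vm_throughput, latency_ms):
        if per_vm_throughput < 0.9 * CAPS[pattern]:      # 54 MB/s or 90 IOPS
            return False
        if pattern in ("RR", "RW") and latency_ms >= 20.0:
            return False
        return True

    # Example: at 40 VMs the random-read chart shows ~84 IOPS per VM, which fails
    # the 90 IOPS floor regardless of latency (the latency value is illustrative).
    print(meets_qos("RR", 84, 12.0))   # False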

Random read performance
[Chart: 4KB random read with CAP=100 IOPS, QD=2; IOPS per VM and total IOPS vs. the number of VMs (1 to 80)]
- Maximum measured throughput is ~4,600 total IOPS
- When scaling to 30 VMs, per-VM IOPS drops to 95

Random write performance
[Chart: 4KB random write with CAP=100 IOPS, QD=2; throughput (IOPS per VM) and latency vs. the number of VMs (1 to 80)]
- Peak throughput is ~3,291 total IOPS
- When scaling to 40 VMs, per-VM throughput drops to 82 IOPS

Sequential read performance
[Chart: 64KB sequential read with QD=64, CAP=60MB/s; per-VM bandwidth and total bandwidth (MB/s) vs. the number of VMs (1 to 80)]
- Maximum measured throughput is ~2,759 MB/s total
- When scaling to 40 VMs, per-VM bandwidth drops to 57 MB/s

Sequential write performance
[Chart: 64KB sequential write with QD=64, CAP=60MB/s; per-VM bandwidth and total bandwidth (MB/s) vs. the number of VMs (1 to 80)]
- Maximum measured throughput is 1,487 MB/s total
- When scaling to 30 VMs, per-VM bandwidth drops to 45 MB/s

Latency analysis
[Chart: latency under load, with 20ms and 4ms reference points]
- Latency is strongly related to the queue depth (see the Little's law sketch below)
- Random IO latency is good enough; sequential latency still has room for improvement
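
The queue-depth/latency relationship follows Little's law (outstanding IOs = throughput x latency). The sketch below only evaluates the expected latency floor at the per-VM caps used in these tests; it is an illustration, not measured data.

    # Sketch: expected per-VM latency floor from Little's law,
    # latency = queue_depth / throughput, at the per-VM caps above.
    def expected_latency_ms(queue_depth, iops):
        return 1000.0 * queue_depth / iops

    # Random IO: QD=2 at the 100 IOPS cap -> 20 ms floor, right at the QoS bound.
    print(expected_latency_ms(2, 100))              # 20.0
    # Sequential IO: QD=64 at 60 MB/s of 64KB ops (960 IOPS) -> ~67 ms floor.
    print(expected_latency_ms(64, 60 * 1024 / 64))  # ~66.7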

IO trace analysis for sequential read
- Ceph converts logically sequential access into random disk access
- Read-ahead can help, but not enough

Full SSD performance
- 4x 200GB SSDs with 8 OSDs (60GB data + 10GB journal each)
- Full-SSD performance is good enough: ~55K IOPS at 1ms latency, ~80K at 2ms, and a peak of 85K at 3ms

Ceph Data Summary

workload pattern              | max measured throughput | throughput considering QoS | disk throughput in theory | network throughput in theory | efficiency
4KB random read, CAP=100      | 4582 IOPS               | 2858 IOPS                  | 3600 IOPS                 | 31250 IOPS                   | 79%
4KB random write, CAP=100     | 3357 IOPS               | 3000 IOPS                  | 4000 IOPS                 | 31250 IOPS                   | 75%
64KB seq read, QD=64, CAP=60  | 2759 MB/s               | 2263 MB/s                  | 6400 MB/s                 | 4000 MB/s                    | 57%
64KB seq write, QD=64, CAP=60 | 1487 MB/s               | 1197 MB/s                  | 3200 MB/s                 | 4000 MB/s                    | 37%

Assumptions: 160MB/s per disk for SR/SW, 90 IOPS per disk for RR, 200 IOPS per disk for RW, and a 40Gb network bandwidth limit (derivation sketched below).

HW                           | total capacity (GB) | total throughput (IOPS) | 60GB volumes by capacity | 60GB volumes by performance
40x 1TB HDD with replica=2   | 20000               | 2800                    | 333.3                    | 30
3x 200GB SSD with replica=2  | 150                 | 85000                   | 2.5                      | 850
(Random IO only)

- Random performance is pretty good; sequential performance has room to do better.
- Challenges and opportunities: throttle both bandwidth and IOPS to better support multi-tenant cases; mix SSD and HDD together to achieve balanced capacity/performance.
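
A sketch of how the in-theory disk column and the efficiency column appear to be derived from the per-disk assumptions stated above (writes pay the 2x replication cost; efficiency compares the QoS-constrained throughput against the tighter of the disk and network limits). The network limits are taken from the table as-is.

    # Sketch: reproduce the "in theory" disk column and the efficiency column
    # from the stated assumptions (160 MB/s seq, 90 IOPS random read, 200 IOPS
    # random write per HDD; 40 HDDs; writes pay the 2x replica cost).
    DISKS, REPLICAS = 40, 2

    def disk_theory(per_disk, is_write):
        return DISKS * per_disk / (REPLICAS if is_write else 1)

    rows = {  # pattern: (QoS throughput, per-disk limit, is_write, network limit)
        "4KB random read":  (2858,  90, False, 31250),
        "4KB random write": (3000, 200, True,  31250),
        "64KB seq read":    (2263, 160, False,  4000),
        "64KB seq write":   (1197, 160, True,   4000),
    }
    for name, (qos, per_disk, is_write, net) in rows.items():
        disk = disk_theory(per_disk, is_write)
        print(f"{name}: disk theory = {disk:.0f}, "
              f"efficiency = {qos / min(disk, net):.0%}")
    # -> 79%, 75%, 57%, 37%, matching the summary table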

Agenda
- Self introduction
- Ceph Block service performance
- Swift Object Storage Service performance
- Summary

H/W Configuration
- Proxy node x2: CPU = 2x Intel Xeon E5 2.6GHz (8C/16T); memory = 64GB (8x 8GB DDR3 1333MHz); NIC = Intel 82599 dual-port 10GbE; OS = CentOS 6.4 (2.6.32-358.14.1); BIOS: Turbo on, EIST on, CPU C-states on
- Storage node x10: CPU = 1x Intel Xeon E3 3.5GHz (4C/8T); memory = 16GB (4x 4GB DDR3 1333MHz); NIC = 1x Intel 82576 quad-port 1GbE (4x 1Gb); disks = 12x 1TB 7200rpm SATA plus 1x 64GB SSD (3Gb/s) for the account/container zone; OS = CentOS 6.4 (2.6.32-358.14.1)
- Network: 1x 10Gb in/out of the proxy; 4x 1Gb for each storage node

Software Configuration
- Swift 1.9.0
- Swift proxy server: log-level = warning, workers = 64, account/container-recheck = 80640s
- Swift account/container server: log-level = warning, workers = 16
- Swift object server: log-level = warning, workers = 32
- Swift replicator/auditor/updater: defaults
- Memcached server: cache-size = 2048MB, max-concurrent-connections = 20480

Test Methodology
COSBench is an Intel-developed benchmark for measuring cloud object storage service performance: https://github.com/intel-cloud/cosbench
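
The figures on the next slide are worker count, average response time, and throughput in operations per second. The sketch below is not COSBench; it is a minimal illustration of how such metrics can be collected against a Swift endpoint with python-swiftclient, and the auth URL, credentials, and container name are placeholders.

    # Sketch: time a batch of Swift PUTs and report COSBench-style metrics
    # (average response time, op/s). Not COSBench itself; the endpoint,
    # credentials, and names below are placeholders.
    import time
    from swiftclient.client import Connection

    conn = Connection(authurl="http://proxy:8080/auth/v1.0",
                      user="test:tester", key="testing", auth_version="1")
    conn.put_container("bench")
    payload = b"x" * 128 * 1024          # 128KB objects, as in the small-object runs

    samples = []
    start = time.time()
    for i in range(1000):
        t0 = time.time()
        conn.put_object("bench", f"obj-{i}", contents=payload)
        samples.append(time.time() - t0)
    elapsed = time.time() - start

    print(f"avg-restime = {1000 * sum(samples) / len(samples):.2f} ms, "
          f"throughput = {len(samples) / elapsed:.2f} op/s")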

Swift Performance Overview

Small scale (100 containers x 100 objects):
#con x #obj | object size | RW mode | workers | avg resp time (ms) | throughput (op/s) | bottleneck
100x100     | 128KB       | Read    | 640     | 51.77              | 12359.84          | Proxy CPU = 100%, SN CPU = 72%
100x100     | 128KB       | Write   | 640     | 213.12             | 3001.65           | Proxy CPU = 47%, SN CPU = 96%
100x100     | 10MB        | Read    | 160     | 699.3              | 228.78            | Proxy CPU = 61%, SN CPU = 55%, proxy NIC 9.6Gb/s
100x100     | 10MB        | Write   | 80      | 1020.8             | 78.37             | Proxy CPU = 15%, SN CPU = 59%, proxy NIC 9.6Gb/s

Large scale:
#con x #obj | object size | RW mode | workers | avg resp time (ms) | throughput (op/s) | bottleneck
10000x10000 | 128KB       | Read    | 640     | 248.56             | 2574.78           | Proxy CPU = 25%, SN CPU = 52%, disk latency 22ms
10000x10000 | 128KB       | Write   | 80      | 62.75              | 1272.97           | Proxy CPU = 21%, SN CPU = 57%, disk latency 18ms
10000x100   | 10MB        | Read    | 160     | 699.17             | 228.83            | Proxy CPU = 56%, SN CPU = 62%, proxy NIC 9.6Gb/s
10000x100   | 10MB        | Write   | 320     | 4096.3             | 78.12             | Proxy CPU = 16%, SN CPU = 81%, proxy NIC 9.6Gb/s

- For large objects, the bottleneck is the front-end network.
- The many-small-objects case shows 80% (read) and 58% (write) performance degradation.

Problem: the large-scale test
- 128KB reads drop by 80+% at large scale
- Different IO pattern: the large-scale run averages 34KB per op vs. 122KB per op at small scale
- blktrace shows a lot of IO generated by the XFS filesystem itself
- Metadata cannot be cached with 10k x 10k objects, so that information requires frequent disk access
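
A rough back-of-the-envelope for why the metadata no longer fits in memory at this scale; the per-inode cache cost is an assumption, chosen to be consistent with the "100M objects with 3 replicas need ~300GB" estimate on the next slide.

    # Sketch: estimate the XFS inode/dentry working set at large scale.
    # The 1 KB per cached inode is an assumption consistent with the ~300GB
    # figure quoted on the next slide for 100M objects with 3 replicas.
    objects   = 10_000 * 10_000      # 10k containers x 10k objects
    replicas  = 3                    # Swift's default object replica count
    per_inode = 1024                 # bytes of cache per on-disk file (assumed)

    working_set_gb = objects * replicas * per_inode / 2**30
    cluster_ram_gb = 10 * 16         # 10 storage nodes x 16 GB each

    print(f"metadata working set ~{working_set_gb:.0f} GB "
          f"vs {cluster_ram_gb} GB of storage-node RAM")
    # -> ~286 GB vs 160 GB: the inode cache cannot hold it, hence the extra disk IO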

Option 1: use larger memory to cache inodes
- Set vfs_cache_pressure = 1 and run ls -R /srv/node to preload the cache (sketched below)
- Closes the gap to less than 10%, although this is an expensive option: 100M objects with 3 replicas need roughly 300GB of memory
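
A minimal sketch of the same recipe in Python rather than ls -R; the sysctl value and the /srv/node path are as quoted above, and it would be run as root on each storage node.

    # Sketch: keep inodes/dentries cached and warm the cache, mirroring the
    # "vfs_cache_pressure=1 plus ls -R /srv/node" recipe above.
    # /srv/node is Swift's usual devices directory; run as root.
    import os
    from pathlib import Path

    # Strongly prefer keeping inode/dentry caches over page cache.
    Path("/proc/sys/vm/vfs_cache_pressure").write_text("1\n")

    # Touch every object file once so its inode is pulled into the cache.
    count = 0
    for root, dirs, files in os.walk("/srv/node"):
        for name in files:
            os.stat(os.path.join(root, name))
            count += 1
    print("preloaded", count, "inodes")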

Option 2: use SSD to cache inodes
- Use an SSD to cache inodes and metadata with flashcache
- 50%-100% performance boost
- Still a large gap (~70%) compared to the small-scale baseline

Alternative solutions and next steps
- Use a smaller inode size in the latest XFS?
- Use a different filesystem (e.g., reiserfs, btrfs)?
- Adopt NoSQL?
- Use customized indexing and object formats, e.g. Haystack (Facebook) and TFS (Taobao FS)
- More work on SSD as cache: write-back vs. write-through; where does the latency go?; the data-loss issue: mirror the SSD, or combine it with the container/OS disk?

Summary
- Many emerging open source storage projects aim to satisfy big-data requirements
- Ceph and Swift are widely used in OpenStack environments
- Good results, but there are still areas for improvement
- Call to action: let's work on them together

Disclaimers

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

This document contains information on products in the design phase of development.

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. Copyright 2012 Intel Corporation. All rights reserved.