
Yuan Zhou (yuan.zhou@intel.com), Chendi Xue (Chendi.xue@intel.com), Jian Zhang (jian.zhang@intel.com), 02/2017

Agenda
- Introduction
- Hyper Converged Cache
- Hyper Converged Cache Architecture
  - Overview
  - Design details
  - Performance overview
  - Current progress and roadmap
- Hyper Converged Cache with Optane technology
- Summary

Introduction
Intel Cloud Computing and Big Data Engineering Team
- Open source work on Spark, Hadoop, OpenStack, Ceph, NoSQL, etc.
- Working closely with the community and end customers
- Technology and innovation oriented: real-time, in-memory, complex analytics; structured and unstructured data; agility, multi-tenancy, scalability and elasticity
- Bridging advanced research and real-world applications


Hyper Converged Cache
- There is strong demand for SSD caching in Ceph clusters, but Ceph's existing SSD caching performance has gaps
  - Cache tiering and Flashcache/bcache do not work well
  - Long tail latency is a big issue for workloads such as OLTP
  - A caching layer is needed to reduce the I/O path's dependency on the network
(Ceph architecture diagram: applications and hosts/VMs/clients access RGW, a web services gateway for object storage; RBD, a reliable, fully distributed block device; and CephFS, a distributed file system with POSIX semantics; all built on LIBRADOS, a library allowing apps to directly access RADOS, a software-based, reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes and lightweight monitors.)

Ceph caching solutions on SSDs
(Diagram of where NVM caches can sit for bare-metal, VM and container Ceph clients)
- Client-side cache: bare metal (user app over kernel RBD with an NVM cache), VM (guest app over Qemu/VirtIO and librbd with an NVM cache in the hypervisor), container (container app over kernel RBD with an NVM cache in the host)
- Ceph node, Filestore backend (production): NVM used for the OSD journal and as a filesystem read cache in front of the data and metadata on disk
- Ceph node, BlueStore backend (tech preview): NVM used for RocksDB (via BlueRocksEnv/BlueFS), the journal and a read cache
All clients reach the Ceph nodes over an IP network via RADOS.

Hyper Converged Cache Overview
- Client-side cache: caching on the compute node; local read cache and distributed write cache
- Extensible framework: pluggable design/cache policies; general caching interfaces with a Memcached-like API (see the sketch below)
- Data services: deduplication and compression when flushing to HDD; value-add features designed for Optane devices
- Log-structured object store for the write cache
- SSD caching tier in front of the HDD capacity tier
(Diagram: each compute node runs VMs over a local read cache plus a write cache with metadata on local SSDs; the write cache is distributed across compute nodes, and data is deduplicated/compressed when flushed to the OSDs on the HDD capacity tier.)
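The Memcached-like caching interface is not spelled out on the slide; as a rough sketch of what such an API could look like, the C++ interface below is a hypothetical illustration (class and method names are assumptions, not the project's actual API):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sketch of a Memcached-like caching interface for the
// hyper-converged cache. Names and signatures are illustrative only.
class CacheService {
public:
  virtual ~CacheService() = default;

  // Store an object (e.g. an RBD block or a small RGW object) in the cache.
  virtual int put(const std::string& key,
                  const std::vector<uint8_t>& value,
                  bool dirty /* written by the client, must be flushed later */) = 0;

  // Read an object back; returns false on a cache miss so the caller
  // can fall through to the RADOS backend.
  virtual bool get(const std::string& key,
                   std::vector<uint8_t>* value) = 0;

  // Drop an object, e.g. after it has been flushed and evicted.
  virtual int remove(const std::string& key) = 0;

  // Push dirty entries back to the capacity tier; dedup/compression
  // would be applied on this flush path.
  virtual int flush() = 0;
};
```

A caller such as the librbd hook shown on a later slide would go through put()/get() and fall back to RADOS on a miss.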


Hyper Converged Cache: General architecture
- Building a hyper-converged cache solution for the cloud, starting with Ceph*: block cache, object cache and file cache
- Replication architecture
- Extensible framework: pluggable design and cache policies, with support for third-party caching software (see the sketch below)
- Write-caching and read-caching paths with dedup and compression, backed by a memory pool with persistence, in front of Ceph
- Advanced data services: compression, deduplication, QoS
- Value-added features for future SCM devices
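As one way to picture the pluggable cache-policy idea, here is a minimal sketch of a policy interface; the names, thresholds and the ratio-based example policy are illustrative assumptions, loosely echoing the cache_ratio_* / cache_dirty_ratio_* parameters listed in the test configuration later in the deck:

```cpp
#include <cstdint>
#include <string>

// Hypothetical sketch of the pluggable cache-policy interface mentioned
// above; the real plug-in API may differ.
struct CacheEntryInfo {
  std::string key;
  uint64_t size = 0;
  bool dirty = false;    // not yet flushed to the capacity tier
  bool recency = false;  // touched since the last reclaim pass
};

class CachePolicy {
public:
  virtual ~CachePolicy() = default;

  // Called on every hit/insert so the policy can update its bookkeeping.
  virtual void on_access(const CacheEntryInfo& entry) = 0;

  // Should this entry be written back to the backend now?
  virtual bool should_flush(const CacheEntryInfo& entry,
                            double dirty_ratio) const = 0;

  // Should this entry be evicted to make room for new data?
  virtual bool should_evict(const CacheEntryInfo& entry,
                            double used_ratio) const = 0;
};

// A third-party caching engine could be wired in behind the same
// interface; a simple ratio-based policy might look like this:
class RatioPolicy : public CachePolicy {
public:
  void on_access(const CacheEntryInfo&) override {}
  bool should_flush(const CacheEntryInfo& e, double dirty_ratio) const override {
    return e.dirty && dirty_ratio > 0.1;   // cf. cache_dirty_ratio_min
  }
  bool should_evict(const CacheEntryInfo& e, double used_ratio) const override {
    return !e.dirty && used_ratio > 0.7;   // cf. cache_ratio_max
  }
};
```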

Hyper Converged Cache: Design details
- API layer: generic interfaces for RBD, RGW and CephFS
- Caching layer with a master/slave architecture: two hosts are required in order to provide physical redundancy (HA), with an active master and a standby slave kept in sync over the network
- Write I/O goes to the write cache and read I/O to the read cache on each node, both backed by a local store
- Advanced services: dedup, compression, QoS, optimized with caching semantics
- Capacity layer: the Ceph OSDs

Hyper Converged Cache: API layer
- RBD: hooks in librbd, caching for small writes (see the sketch below)
- RGW: caching over HTTP, for metadata and small data
- CephFS: extended POSIX API, caching for metadata and small writes
(The slide repeats the Ceph architecture diagram: applications over RGW, RBD and CephFS, built on LIBRADOS and RADOS.)
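To make the RBD case concrete, here is a minimal sketch of what a small-write hook could do, assuming a simple size threshold and a cache key derived from the image and block offset; this illustrates the idea only and is not librbd's actual hook interface:

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical illustration of an RBD write hook: small writes are
// absorbed by the hyper-converged write cache, larger writes go straight
// to RADOS.
class RbdWriteHook {
public:
  using WriteFn = std::function<int(const std::string& key,
                                    const std::vector<uint8_t>& data)>;

  RbdWriteHook(WriteFn cache_put, WriteFn rados_write, uint64_t small_io_threshold)
      : cache_put_(std::move(cache_put)),
        rados_write_(std::move(rados_write)),
        threshold_(small_io_threshold) {}

  int write(const std::string& image_id, uint64_t offset,
            const std::vector<uint8_t>& data) {
    // One cache object per 64 KiB block of the image (illustrative choice).
    const std::string key = image_id + "." + std::to_string(offset / (64 * 1024));
    if (data.size() <= threshold_) {
      return cache_put_(key, data);    // small write: goes to the SSD cache
    }
    return rados_write_(key, data);    // large write: normal librbd/RADOS path
  }

private:
  WriteFn cache_put_;
  WriteFn rados_write_;
  uint64_t threshold_;
};
```

Reads would follow the reverse path: check the cache first, then fall through to RADOS on a miss.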

Hyper Converged Cache: Master/Slave replication
- Each host runs two processes: a master that accepts local reads/writes and replicates them, and a slave that accepts replication writes
- Master/slave pairing is configurable: via a static configuration file today, and dynamically through the HA service layer later
- The cache adapter sends reads to the master only
- The adapter sends writes to the master; the master replicates to the slave; the client is ACKed only when both writes have finished (see the sketch below)
- Per-process internals: shared-memory and network transports, a message queue, I/O thread pool, transactions, an agent thread pool, and separate metadata and data stores, with the backend tier behind the cache
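A minimal sketch of that write path, assuming the master commits locally and forwards the write to its standby peer before acknowledging; class and method names are hypothetical:

```cpp
#include <future>
#include <string>
#include <vector>

struct WriteResult { bool ok; };

// Hypothetical cache node; the real implementation would append to the
// log-structured store on SSD and ship writes over the network.
class CacheNode {
public:
  WriteResult commit_local(const std::string&, const std::vector<uint8_t>&) {
    // Placeholder: append to the local log-structured store on SSD.
    return {true};
  }
  std::future<WriteResult> replicate(const std::string& k,
                                     const std::vector<uint8_t>& d) {
    // Placeholder: ship the write to the standby peer over the network.
    return std::async(std::launch::async,
                      [this, k, d] { return commit_local(k, d); });
  }
};

// Adapter-side write: returns only when master *and* slave have committed,
// so an ACK implies two durable copies of the dirty block.
bool replicated_write(CacheNode& master,
                      const std::string& key,
                      const std::vector<uint8_t>& data) {
  std::future<WriteResult> remote = master.replicate(key, data);  // to slave
  WriteResult local = master.commit_local(key, data);             // on master
  return local.ok && remote.get().ok;
}
```

Acknowledging only after both commits is what gives the write cache two durable copies before any data reaches the capacity tier.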

Hyper Converged Cache: Storage backend
- Writes arrive through the API, are staged in a RAM buffer, and are appended (each with a header) to an append-only log on SSD laid out as a super block plus segments (see the sketch below)
- An in-memory index records where each object lives in the log
- A flusher / write-back daemon copies dirty data to the backend tier; garbage collection runs as part of write-back
- An evict daemon reclaims cold segments
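A toy sketch of the append-only log plus in-memory index, assuming fixed-size segments and leaving out the on-SSD super block and segment headers as well as the background flusher, write-back, GC and evict daemons:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Where an object landed inside the append-only log.
struct LogLocation {
  uint32_t segment = 0;   // which segment in the log
  uint32_t offset = 0;    // byte offset inside that segment
  uint32_t length = 0;
};

class AppendOnlyLog {
public:
  explicit AppendOnlyLog(uint32_t segment_size) : segment_size_(segment_size) {}

  // Append value under key; returns where it landed so the in-memory
  // index stays consistent. Data is buffered in RAM before hitting SSD.
  LogLocation append(const std::string& key, const std::vector<uint8_t>& value) {
    if (head_offset_ + value.size() > segment_size_) {
      ++head_segment_;        // seal the current segment, open a new one
      head_offset_ = 0;
    }
    LogLocation loc{head_segment_, head_offset_,
                    static_cast<uint32_t>(value.size())};
    head_offset_ += static_cast<uint32_t>(value.size());
    index_[key] = loc;                    // in-memory index: key -> location
    ram_buffer_.push_back({key, value});  // flushed to SSD by a background thread
    return loc;
  }

  bool lookup(const std::string& key, LogLocation* loc) const {
    auto it = index_.find(key);
    if (it == index_.end()) return false;
    *loc = it->second;
    return true;
  }

private:
  uint32_t segment_size_;
  uint32_t head_segment_ = 0;
  uint32_t head_offset_ = 0;
  std::unordered_map<std::string, LogLocation> index_;
  std::vector<std::pair<std::string, std::vector<uint8_t>>> ram_buffer_;
};
```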

Hyper Converged Cache: Storage backend with caching semantics
- Lookup: hash(key) selects a RAM-resident data header; each header carries a key digest, recency bit, dirty bit, data size, data offset and a next pointer, and points at a header+value record inside an on-SSD segment
- 1. Segments are reclaimed for new items in a FIFO way
- 2. For items whose recency bit is set, reinsert them and clear their recency bits; for items whose dirty bit is set, also flush them to the backend store (see the sketch below)
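Those two reclaim rules might look roughly like the sketch below; the types and in-memory layout are illustrative and ignore the on-SSD format and the hash-indexed headers:

```cpp
#include <cstdint>
#include <deque>
#include <utility>
#include <vector>

// Sketch of FIFO segment reclaim with recency/dirty bits. The oldest
// segment is reclaimed; recently used items are reinserted at the head
// of the log, and dirty items are flushed to the backend first.
struct Item {
  std::vector<uint8_t> value;
  bool recency = false;  // set on access, cleared on reinsert
  bool dirty = false;    // set on write, cleared after flush
};

struct Segment {
  std::vector<Item> items;
};

class SegmentReclaimer {
public:
  // Reclaim the oldest segment so its space can hold new items.
  void reclaim_one(std::deque<Segment>& log) {
    if (log.empty()) return;
    Segment victim = std::move(log.front());
    log.pop_front();                       // FIFO: oldest segment first
    for (Item& it : victim.items) {
      if (it.dirty) flush_to_backend(it);  // write back before dropping
      if (it.recency) {
        it.recency = false;                // clear and give it another life
        it.dirty = false;                  // a clean copy now lives in the log
        reinsert(log, std::move(it));
      }
      // Items with neither bit set are simply dropped (evicted).
    }
  }

private:
  void flush_to_backend(const Item&) { /* write to the Ceph capacity tier */ }
  void reinsert(std::deque<Segment>& log, Item&& it) {
    if (log.empty()) log.emplace_back();
    log.back().items.push_back(std::move(it));  // append at the log head
  }
};
```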

Hyper Converged Cache: Performance
(Chart: 4K random-write IOPS and latency for plain RBD, RBD with cache tiering, RBD with caching on a DC S3700 SSD, and RBD with caching on a DC P3700 NVMe SSD)
- With zipf 4K random writes, the hyper-converged cache provides ~7x higher performance, and latency also decreases by ~92%
- With NVMe disk caching, performance improves by roughly 20x
- Compared with cache tiering, performance improves ~5x, and the code path is much simpler

Hyper Converged Cache: Tail latency
(Chart: average and 99.99% latency for plain RBD vs. RBD with P3700 caching, at 50 RBD volumes / 10K IOPS and 100 RBD volumes / 20K IOPS; e.g. 22.18 ms vs. 15.14 ms at the 99.99th percentile)
- With SSD caching, the hyper-converged cache reduces tail latency by ~30% under the specified load
- This makes it much easier to control and meet QoS/SLA requirements

Upstream status and Roadmap
- Upstream blueprint: crash-consistent ordered write-back caching extension
  - A new librbd read cache to support LBA-based caching with DRAM / non-volatile storage backends
  - An ordered write-back cache that maintains checkpoints internally (or is structured as a data journal), such that writes flushed back to the cluster are always crash consistent: even if the client cache were lost entirely, the disk image still holds a valid file system that is just a little bit stale [1] (see the sketch below)
  - Should have durability characteristics similar to async replication if done right
- External caching plug-in interface, for both kernel and user mode
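One way to picture the ordered, checkpoint-based write-back idea is the toy sketch below, where writes are grouped into checkpoints and only whole checkpoints are flushed, in order; this is an illustration of the concept, not the blueprint's actual design:

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Illustrative sketch of checkpoint-based ordered write-back: dirty writes
// are grouped into numbered checkpoints, and only whole checkpoints are
// flushed, in order, so the image in the cluster always corresponds to
// some consistent (possibly slightly stale) point in time.
struct CachedWrite {
  uint64_t offset;
  std::vector<uint8_t> data;
};

class OrderedWriteBackCache {
public:
  // Record a write into the currently open checkpoint.
  void write(uint64_t offset, std::vector<uint8_t> data) {
    checkpoints_[open_checkpoint_].push_back({offset, std::move(data)});
  }

  // Close the current checkpoint (e.g. on a flush barrier from the guest)
  // and open a new one; closed checkpoints become eligible for write-back.
  void seal_checkpoint() { ++open_checkpoint_; }

  // Flush the oldest sealed checkpoint in its entirety, in order.
  bool flush_oldest(void (*write_to_cluster)(const CachedWrite&)) {
    auto it = checkpoints_.begin();
    if (it == checkpoints_.end() || it->first == open_checkpoint_) return false;
    for (const CachedWrite& w : it->second) write_to_cluster(w);
    checkpoints_.erase(it);   // checkpoint fully persisted in the cluster
    return true;
  }

private:
  uint64_t open_checkpoint_ = 0;
  std::map<uint64_t, std::vector<CachedWrite>> checkpoints_;  // ordered by id
};
```

If the client crashes mid-flush, the cluster image still corresponds to the last fully flushed checkpoint, which is the "valid but slightly stale" property described above.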


Intel investment: two technologies
- Intel Optane Technology: higher performance
- Intel 3D NAND: lower cost and higher density

Intel Optane Technology: size and latency specification comparison
- SRAM (memory): latency 1X, size of data 1X
- DRAM (memory): latency ~10X, size of data ~100X
- Intel Optane Technology: latency ~100X, size of data ~1,000X
- NAND SSD (storage): latency ~100,000X, size of data ~1,000X
- HDD (storage): latency ~10 million X, size of data ~10,000X
Lower latency is faster. Technology claims are based on comparisons of latency, density and write cycling metrics amongst memory technologies recorded on published specifications of in-market memory products against internal Intel specifications.

http://www.flashmemorysummit.com/english/collaterals/proceedings/2016/20160810_k21_zhang_zhang_zhou.pdf

Hyper Converged Cache: caching on Optane?
(Diagram: caches can sit at three layers: (1) on the storage servers, as a page cache / block buffer and an independent cache layer in front of the OSDs; (2) at the hypervisor layer on each compute node, as a write cache, page cache / read cache, local store and block buffer; (3) inside the VMs, as per-VM caches over RBD.)
1. Using an Intel Optane device as the block buffer cache device
2. Using an Intel Optane device as the page caching device
3. Using a 3D XPoint™ device as OS L2 memory?

Summary
- With client-side SSD caching, RBD random write improved ~5x, and both average latency and tail latency (99.99%) improved significantly
- With emerging new media like Optane, the caching benefit will be even higher
- Next steps: finish the coding work (80% done) and open-source the project; run tests on object and filesystem caching

Q&A

Legal Notices and Disclaimers
Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer.
No computer system can be absolutely secure.
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.
Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.
This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
Statements in this document that refer to Intel's plans and expectations for the quarter, the year, and the future, are forward-looking statements that involve a number of risks and uncertainties. A detailed discussion of the factors that could affect Intel's results and plans is included in Intel's SEC filings, including the annual report on Form 10-K.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.
Intel, Xeon and the Intel logo are trademarks of Intel Corporation in the United States and other countries.
*Other names and brands may be claimed as the property of others.
© 2015 Intel Corporation.

Legal Information: Benchmark and Performance Claims Disclaimers
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase.
Test and System Configurations: See Back up for details.
For more complete information about performance and benchmark results, visit http://www.intel.com/performance.

Risk Factors
The above statements and any others in this document that refer to plans and expectations for the first quarter, the year and the future are forward-looking statements that involve a number of risks and uncertainties. Words such as "anticipates," "expects," "intends," "plans," "believes," "seeks," "estimates," "may," "will," "should" and their variations identify forward-looking statements. Statements that refer to or are based on projections, uncertain events or assumptions also identify forward-looking statements. Many factors could affect Intel's actual results, and variances from Intel's current expectations regarding such factors could cause actual results to differ materially from those expressed in these forward-looking statements. Intel presently considers the following to be important factors that could cause actual results to differ materially from the company's expectations.
Demand for Intel's products is highly variable and could differ from expectations due to factors including changes in the business and economic conditions; consumer confidence or income levels; customer acceptance of Intel's and competitors' products; competitive and pricing pressures, including actions taken by competitors; supply constraints and other disruptions affecting customers; changes in customer order patterns including order cancellations; and changes in the level of inventory at customers.
Intel's gross margin percentage could vary significantly from expectations based on capacity utilization; variations in inventory valuation, including variations related to the timing of qualifying products for sale; changes in revenue levels; segment product mix; the timing and execution of the manufacturing ramp and associated costs; excess or obsolete inventory; changes in unit costs; defects or disruptions in the supply of materials or resources; and product manufacturing quality/yields. Variations in gross margin may also be caused by the timing of Intel product introductions and related expenses, including marketing expenses, and Intel's ability to respond quickly to technological developments and to introduce new features into existing products, which may result in restructuring and asset impairment charges.
Intel's results could be affected by adverse economic, social, political and physical/infrastructure conditions in countries where Intel, its customers or its suppliers operate, including military conflict and other security risks, natural disasters, infrastructure disruptions, health concerns and fluctuations in currency exchange rates. Results may also be affected by the formal or informal imposition by countries of new or revised export and/or import and doing-business regulations, which could be changed without prior notice.
Intel operates in highly competitive industries and its operations have high costs that are either fixed or difficult to reduce in the short term.
The amount, timing and execution of Intel's stock repurchase program and dividend program could be affected by changes in Intel's priorities for the use of cash, such as operational spending, capital spending, acquisitions, and as a result of changes to Intel's cash flows and changes in tax laws.
Product defects or errata (deviations from published specifications) may adversely impact our expenses, revenues and reputation.
Intel's results could be affected by litigation or regulatory matters involving intellectual property, stockholder, consumer, antitrust, disclosure and other issues. An unfavorable ruling could include monetary damages or an injunction prohibiting Intel from manufacturing or selling one or more products, precluding particular business practices, impacting Intel's ability to design its products, or requiring other remedies such as compulsory licensing of intellectual property.
Intel's results may be affected by the timing of closing of acquisitions, divestitures and other significant transactions.
A detailed discussion of these and other factors that could affect Intel's results is included in Intel's SEC filings, including the company's most recent reports on Form 10-Q, Form 10-K and earnings release.
Rev. 1/15/15


H/W Configuration
Client cluster:
- CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.80GHz
- Memory: 96 GB
- NIC: 10Gb
- Disks: 1 HDD for OS, 1x 400GB Intel DC S3700 SSD for cache
Ceph cluster (OSD nodes):
- CPU: Intel(R) Xeon(R) CPU E31280 @ 3.50GHz
- Memory: 32 GB
- NIC: 10GbE
- Disks: 2 x 400GB SSD (journal), 8 x 1TB HDD (storage)
Topology: 2-host Ceph cluster (OSD1, which also hosts the MON, and OSD2), each host with 8 x 1TB HDD as OSDs and 2x Intel DC S3700 SSDs for journals; 1 client with 1x 400GB Intel DC S3700 SSD as the cache device; all nodes connected over 10Gb NICs.

S/W Configuration
- Ceph* version: 10.2.2 (Jewel)
- Replica size: 2
- Data pool: 16 OSDs (8 OSDs and 2 journal SSDs per node)
- OSD size: 1TB x 8; journal size: 40G x 8
- Cache: 1 x 400G Intel DC S3700
- FIO volume size: 10G
- Benchmark: Cetune driving fio + librbd (https://github.com/01org/cetune)
*Other names and brands may be claimed as the property of others.

Testing Configuration
Test cases:
- Operation: 4K random write with fio (zipf=1.2)
- Detail cases, cache size < volume size (w/ zipf), cache size 10G:
  - w/o flush & evict
  - w/ flush, w/o evict
  - w/ flush & evict
- Hot data = volume size * zipf 1.2 (5%), runtime = 4 hours
Runtime:
- Base: 200s ramp up, 14400s run
Caching parameters:
- DataStoreDev=/dev/sde
- cache_total_size=10g
- cacheservice_threads_num=128
- agent_threads_num=32
- object_size=4096
- cache_flush_queue_depth=256
- cache_ratio_max=0.7
- cache_ratio_health=0.5
- cache_dirty_ratio_min=0.1
- cache_dirty_ratio_max=0.95
- cache_flush_interval=3
- cache_evict_interval=5