Build a Key Value Flash Disk Based Storage System. Flash Memory Summit 2017 Santa Clara, CA 1

Similar documents
Understanding IO patterns of SSDs

AvalonMiner Raspberry Pi Configuration Guide. AvalonMiner 树莓派配置教程 AvalonMiner Raspberry Pi Configuration Guide

Logitech G302 Daedalus Prime Setup Guide 设置指南

Oracle 一体化创新云技术 助力智慧政府信息化战略. Copyright* *2014*Oracle*and/or*its*affiliates.*All*rights*reserved.** *

绝佳的并行处理 - FPGA 加速的根本基石

PCU50 的整盘备份. 本文只针对操作系统为 Windows XP 版本的 PCU50 PCU50 启动硬件自检完后, 出现下面文字时, 按向下光标键 光标条停在 SINUMERIK 下方的空白处, 如下图, 按回车键 PCU50 会进入到服务画面, 如下图

新一代 ODA X5-2 低调 奢华 有内涵

Chapter 1 (Part 2) Introduction to Operating System

Virtual Memory Management for Main-Memory KV Database Using Solid State Disk *

libde265 HEVC 性能测试报告

Wireless Presentation Pod

Chapter 11 SHANDONG UNIVERSITY 1

Safe Memory-Leak Fixing for C Programs

Chapter 7: Deadlocks. Operating System Concepts 9 th Edition

梁永健. W K Leung. 华为企业业务 BG 解决方案销售部 CTO Chief Technology Officer, Solution Sales, Huawei

Microsemi - Leading Innovation for China s Hyperscale Data Centers

如何查看 Cache Engine 缓存中有哪些网站 /URL

Microsoft RemoteFX: USB 和设备重定向 姓名 : 张天民 职务 : 高级讲师 公司 : 东方瑞通 ( 北京 ) 咨询服务有限公司

ICP Enablon User Manual Factory ICP Enablon 用户手册 工厂 Version th Jul 2012 版本 年 7 月 16 日. Content 内容

Previous on Computer Networks Class 18. ICMP: Internet Control Message Protocol IP Protocol Actually a IP packet

Triangle - Delaunay Triangulator

OTAD Application Note

第二小题 : 逻辑隔离 (10 分 ) OpenFlow Switch1 (PC-A/Netfpga) OpenFlow Switch2 (PC-B/Netfpga) ServerB PC-2. Switching Hub

密级 : 博士学位论文. 论文题目基于 ScratchPad Memory 的嵌入式系统优化研究

VAS 5054A FAQ ( 所有 5054A 整合, 中英对照 )

计算机组成原理第二讲 第二章 : 运算方法和运算器 数据与文字的表示方法 (1) 整数的表示方法. 授课老师 : 王浩宇

EqualLogic Best Practices for SQL Server Deployments

云计算入门 Introduction to Cloud Computing GESC1001

我们应该做什么? 告知性分析 未来会发生什么? 预测性分析 为什么会发生 诊断性分析 过去发生了什么? 描述性分析 高级分析 传统 BI. Source: Gartner

Multiprotocol Label Switching The future of IP Backbone Technology

#MDCC Swift 链式语法应 用 陈乘

Presentation Title. By Author The MathWorks, Inc. 1

EBD EBD. end

Apache OpenWhisk + Kubernetes:

北 京 忆 恒 创 源 科 技 有 限 公 司 16

测试 SFTP 的 问题在归档配置页的 MediaSense

测试基础架构 演进之路. 茹炳晟 (Robin Ru) ebay 中国研发中心

ngx_openresty: an Nginx ecosystem glued by Lua

Command Dictionary CUSTOM

HAWQ. MPP SQL for HDFS of Hadoop 基于 Hadoop 原生 HDFS 的大规模并行 SQL

三 依赖注入 (dependency injection) 的学习

组播路由 - MSDP 和 PIM 通过走

Table of Contents. DS159-ZH LUXEON XR-3020 Product Datasheet Lumileds Holding B.V. All rights reserved.

Murrelektronik Connectivity Interface Part I Product range MSDD, cable entry panels MSDD 系列, 电缆穿线板

1. DWR 1.1 DWR 基础 概念 使用使用 DWR 的步骤. 1 什么是 DWR? Direct Web Remote, 直接 Web 远程 是一个 Ajax 的框架

Computer Networks. Wenzhong Li. Nanjing University

mod_callcenter callcenter.conf.xml 范例 odbc-dsn

IBM 企业业务连续性方案建议书. System x3850m2+ds4700/ds5000

S 1.6V 3.3V. S Windows 2000 Windows XP Windows Vista S USB S RGB LED (PORT1 PORT2 PORT3) S I 2 C. + 表示无铅 (Pb) 并符合 RoHS 标准 JU10 JU14, JU24, JU25

Skill-building Courses Business Analysis Lesson 3 Problem Solving

CloudStack 4.3 API 开发指南!

2. Introduction to Digital Media Format

<properties> <jdk.version>1.8</jdk.version> <project.build.sourceencoding>utf-8</project.build.sourceencoding> </properties>

Tizen v2.3 Graphics & UI Frameworks. Embedded Software SKKU

NyearBluetoothPrint SDK. Development Document--Android

<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <!--- global properties --> <property>

Technology: Anti-social Networking 科技 : 反社交网络

XML allows your content to be created in one workflow, at one cost, to reach all your readers XML 的优势 : 只需一次加工和投入, 到达所有读者的手中

Apache Kafka 源码编译 Spark 大数据博客 -

上海泛腾电子科技有限公司徐鹤军 上海张江高科技园区碧波路 500 号 306 室. Tel :

深入了解 Redis. 宋传胜

失Answer for homework assignment 4

Air Speaker. Getting started with Logitech UE Air Speaker. 快速入门罗技 UE Air Speaker. Wireless speaker with AirPlay. 无线音箱 (AirPlay 技术 )

iphone 4s 尾插排线更换放方法 FIXBAR 此教程只适用于更换 iphone4s 损坏的尾插排线 Written By: xiaohang 2018 bengg.dozuki.com/ Page 1 of 15

: Operating System 计算机原理与设计

Flash-based Database Systems

Support for Title 21 CFR Part 11 and Annex 11 compliance: Agilent OpenLAB CDS version 2.1

朱晔和你聊 Spring 系列 S1E2: SpringBoot 并不神秘

B c e k c h k o h f o f f

Green Computing Cloud Computing LSD Tech Co., Ltd SSD server & SSD Storage Cloud SSD Supercomputer LSD Tech Co., LTD

基于 Davinci 平台的视频应用开发 沈燕飞

IPC 的 Proxy-Stub 设计模式 ( c)

Congestion Control Mechanisms for Ad-hoc Social Networks 自组织社会网络中的拥塞控制机制

PTZ PRO 2. Setup Guide 设置指南

DEV Office 客户端开发增强

White Paper 3 System Level ESD Part II: Implementation of Effective ESD Robust Designs. Industry Council on ESD Target Levels

2.8 Megapixel industrial camera for extreme environments

China Next Generation Internet (CNGI) project and its impact. MA Yan Beijing University of Posts and Telecommunications 2009/08/06.

5.1 Megapixel machine vision camera with GigE interface

Company Overview.

Logitech ConferenceCam CC3000e Camera 罗技 ConferenceCam CC3000e Camera Setup Guide 设置指南

The Design of Everyday Things

学习沉淀成长分享 EIGRP. 红茶三杯 ( 朱 SIR) 微博 : Latest update:

DPDK Summit China 2017

在数据中心中加速 AI - Xilinx 机器学习套件 (Xilinx ML Suite )

1. Spring 整合 Jdbc 进行持久层开发

上汽通用汽车供应商门户网站项目 (SGMSP) User Guide 用户手册 上汽通用汽车有限公司 2014 上汽通用汽车有限公司未经授权, 不得以任何形式使用本文档所包括的任何部分

FLIGHT INSTRUMENT PANEL

Nvidia GPU Support on Mesos: Bridging Mesos Containerizer and Docker Containerizer

A Benchmark For Stroke Extraction of Chinese Characters

IPC 的 Proxy-Stub 设计模式 (b)

CHAPTER 5 NEW INTERNET APPLICATIONS

NVIDIA CUDA 超大规模并行. 邓仰东 清华大学微电子学研究所

nbns-list netbios-type network next-server option reset dhcp server conflict 1-34

Software Engineering. Zheng Li( 李征 ) Jing Wan( 万静 )

TW5.0 如何使用 SSL 认证. 先使用 openssl 工具 1 生成 CA 私钥和自签名根证书 (1) 生成 CA 私钥 openssl genrsa -out ca-key.pem 1024

操作系统原理与设计. 第 13 章 IO Systems(IO 管理 ) 陈香兰 2009 年 09 月 01 日 中国科学技术大学计算机学院

云计算入门 Introduction to Cloud Computing GESC1001

Chap1 Introduction. Outline. An Example System. 1.1 Overview. Computer organization and architecture. Computer organization and architecture

WSV 让网站更加安全的几个小 妙招 徐栋 北京中达金桥技术服务有限公司

Transcription:

Build a Key Value Flash Disk Based Storage System Flash Memory Summit 2017 Santa Clara, CA 1

Outline Ø Introduction,What s Key Value Disk Ø A Evolution to Key Value Flash Disk Based Storage System Ø Three Technologies to Setup a Flash Disk Based Stroage System Flash Memory Summit 2017 Santa Clara, CA 2

What s Key Value Drive q Drive Key Value is a drive level interface which can be adopted by multiple high-level Key Value storage applications. q Drive Key Value is a transportation agnostic interface and can run over Fabric(FC, RoCE, iwarp, TCP, et al) and PCIe architecture. q Drive Key Value supersedes and replaces the traditional block interface in some areas. Board Drive JBOD

A Typical Key Value Scenario - Object Storage qserver SAN based solutions, like ØScale out object storage (like Amazon Cloud Storage S3). ØKey Value Databases (like Redis, RocksDB) ØIn Memory DBs (IMDB, like Gemfire) ØDistributed caching (memcached) ØObject Storage Device Ceph Dsware All use key-value as a basic interface. Medical Financial Industry DC (Huawei Cloud)

Ecology Some Vendors Have Launched NVMe Key Value Samsung Key Value SSD SSOD Key Value SSD SSOD

Storage Architecture Evolution I:Server side KV Advantage: q Less CPU consumption ² Get rid of complexity translation between KV and Block. ² Get rid of redundant useless block protocol. ² SLKV and Disk Drive translation is more efficient. q Less high-speed memory ² Get rid of most of metadata. q Low Latency/High Performance

Storage Architecture Evolution II:Disaggregated Scale-out Object Storage Advantage: p Decouple KV Disk Drive and Server, Reduce the servers, Cost down. p Flexible architecture, dynamic configuration to address the changing storage requirement. p Small Fault Recovery Range.

Storage Architecture Evolution III:Server-less Scale-out Object Storage More Advantages besides Disaggregated architecture: q No Storage Sever required. Cost, space and power savings. Stronger Scale-out feature. q Workload offload. Fine-granularity(Disk Drive Level) parallel operation. q Small/Flexible Fault Recovery Range.

Key Technology innovation-i :Protocol Optimization Protocol Optimization Previous Drive Protocol Stack New Drive Protocol Stack

Key Technology innovation-ii: FTL Optimization Beyond protocol stack/transfer mechanism optimization, a new FTL is required if NVMe Key Value is adopted. A more complex FTL layer is the downside for NVMe Key Value. The new FTL is a key component to achieving high performance.

Key Technology innovation-iii: Standardized Key Value Interface p New KV NVM Command/New data replacement. ü New bit should be added to indicate that key-value or block data is transferred. ü New parameter should be added to indicate the formation of key-value if key-value is supported. For instance, u Need a parameter to indicate whether one key value pair or multiple key value pairs are transferred in a command. ü If multiple key value pairs are transferred in one command, need a parameter to indicate Key and Value stored interleaved with each other, or Key and Value stored separated in the command. ü Should we support arbitrary length keys and values? If yes, parameters should be added in the command to indicate the length of Key and Value. p New Transfer Mechanism ü One method would be to transfer one or multiple key value pairs via one command if the key value pair do not larger than the max size of the command. ü Another alternative to have the key transferred in the NVMe command and the value transferred thru DMA/RDMA. ü However with variable length keys the previous method may not work. Another approach would be to create a metadata structure that fully describes the key-value operation including references to the key and value. This structure could then transfer both the key and value via DMA/RDMA. The foregoing modification should be addressed by NVMe Key Value Standard.

Solution Verification Step1 RocksDB Optimization 应用优化 优化 RocksDB 的复杂的 KV 存储流程, 去掉文件系统层, 直接对接 KV Lib KV 库 提供 KV API 给应用使用, 透传 KV 的访问通道, 对接 KV Driver NVMe 驱动适配 控制器适配 开源 NVMe 驱动适配 KV 接口的盘, 提供 KV 设备的初始化能力 控制器实现的 KV->PBA 转换, 提供 KV 接口能力 后续可以提供更加高级的 KV 特性接口 :scrubbing GC SCM cache 压缩等特性

Solution Verification Step1 - Ceph Bluestore Optimization NEW Thick Software Stack Thin RocksDB 1 KV lib 2 4 KV Drive KV Drive KV Drive 3 Bluestore 背景 : Ceph 后端因为 filestore 在写数据前需要先写 journal, 会有一倍的写放大, 并且 filestore 一开始只是对于机械盘进行设计的, 没有专 门针对 SSD 做优化考虑,Bluestore 初衷就是为了减少写放大, 并针对 SSD 做优化 Bluestore 存在的缺陷 : 1 Bluestore 本身沿用了 RocksDB, 还需要存在文件系统层, 未从根本解决复杂软件堆栈 2 RocksDB 后端对接的还是 Block 接口, 需要 KVàLBAàPB A 的转化流程, 增加了元数据的管理的复杂度 ; 3 RocksDB 在 Bluestore 的主要线程之一 _kv_sync_thread 线程中的 CPU 占比超过 90% Complex MetaData Structure Bluestore New 架构的方案 : 1 Bluestore 采用 Thin RocksDB, 去掉文件系统层, 从根本解决复杂软件堆栈, 减少内存 copy 消耗, 降低内存成本 2 RocksDB 后端直接对接 KV 接口, 简化元数据的管理及软件处理开销, 降低 CPU 消耗, 大幅提升 CEPH 性能 CPU OverHead is High

Thank you Questions? Flash Memory Summit 2017 Santa Clara, CA 14