Flash-based Database Systems

Flash-based Database Systems: Experiences from the FlashDB Project
Xiaofeng Meng, Renmin University of China
National Database Conference of China, Hefei, 2012-10-13

Outline
1. New Storage
2. Flash-based DBMSs
3. SSD Hybrid Systems
4. Future Work

Performance Improvement: Disk vs. CPU Over the Last 30 Years
[Figure: bar chart, log scale, of times increase by technology type: CPU, disk density, transfer rate, RAID transfer rate, disk RPMs, seek+latency read, seek+latency write. Data labels, largest to smallest: 2.82 million, 3750, 266.67, 29.5, 4.1, 3.12, 2.89.]

SSD Will Bring Storage Performance Back in Line with CPU Performance

Flash Memory Chip
Flash memory: NOR flash is used for program code, NAND flash for data.
NAND flash density increases: SLC (1 bit), MLC (2 bits), TLC (3 bits per cell).
Lifetime decreases: 100,000 (SLC), 10,000 (MLC), 5,000 (TLC) erase cycles.
Flash chip layout and structure:
  Larger blocks (32 -> 256 pages)
  Larger pages: 512 B (old SLC) -> 16 KB (future TLC)

Solid State Disk (SSD)
Flash Translation Layer (FTL)
Interface: SATA, SAS, PCIe
Manufacturers: Intel, OCZ, Samsung
Capacity/Price
[Figure labels: controller, flash chips]

Applications of Flash Devices
Mobile devices, personal computers, aerospace, data centers, embedded devices

SSD vs. HDD
Metric                     Enterprise SSD   Ratio    Enterprise HDD
IOPS (4 KB)                -                150x     -
Seq. read BW (MB/s)        >450             3x       >150
Random 80/20 BW (MB/s)     >450             22x      20
Avg. random I/O latency    -                >1000x   -
Active power               10 W             60%      17 W
Typical capacity           300 GB           2/3      450 GB

Research Motivation (1)
Data read/write speed is 60~150x faster (vs. 7200 RPM HDD).
Query processing time improves only up to 10x.
Sang-Won Lee et al. Design of Flash-Based DBMS: An In-Page Logging Approach. SIGMOD '07.

Research Motivation (2)
Transaction processing performance improves 2~10x (TPS: transactions per second).
The fast access performance of flash memory is not fully exploited by existing database algorithms!
Sang-Won Lee et al. A Case for Flash Memory SSD in Enterprise Database Applications. SIGMOD '08.

Research Goal of the FlashDB Project
To boost database performance by exploiting the unique I/O characteristics of flash.
[Diagram: a flash-based DBMS on flash disks, which bring unique I/O features and new cost models.
Query processing: query optimization, query evaluation, access methods (index management), buffer management, storage management.
Transaction processing: transaction management, logging and recovery, concurrency control.]

FlashDB Project - Background
National key project funded by NSFC
  To investigate novel database technologies for flash-memory-based DBMSs
  To explore applications for flash-based DBMSs
Funding period: 2009-2012
Participating institutions:
  Renmin University of China (leading)
  University of Science & Technology of China
  Hong Kong Baptist University

FlashDB Meetings
Biannual meetings (one per semester, 2009-2012):
  Progress reporting
  Experience sharing
  Brainstorming new ideas
Participation from academia and industry (IBM, Baidu, Huawei)
International workshop FlashDB (Hong Kong 2011, Busan 2012)

Flash is Coming
The age of flash-based DBMSs is coming.
Oracle's TPC-C benchmark result (2010), using Exadata: Oracle + Sun flash storage
  Total cost: $49M
  Storage: $23M
    Sun Flash Array: $22M
    720 x 2 TB 7.2K RPM HDDs: $0.7M
IBM proposed an SSD buffer (VLDB '10).
And MS SQL Server at the Jim Gray Lab...

Flash Organization
Page size: 2/4/8 KB
Block = 64 ~ 128 pages: 128/256/512 KB
A page has a data area and a spare area:
  Data area: for mass data storage
  Spare area: for storing metadata such as ECC and LBA

Characteristics of NAND Flash
Asymmetric read/write speed (by pages)
Fast random reads (no mechanical latency)
Erase-before-overwrite (by blocks)
Out-of-place update

Out-of-Place Update in Flash
Flash Translation Layer (FTL): address mapping, garbage collection, wear leveling
[Figure: updating sector #3045, which the map table initially points to (block #125, page #29).
1. An update request arrives.
2. The new data is written to block 237, page 4; block 125, page 29 becomes obsolete.
3. The map table entry is updated to (#237, #4).]
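The three steps on this slide can be sketched as a toy page-level FTL. This is only an illustration under stated assumptions: the class name `SimpleFTL`, its fields, and the in-order free-page allocation are hypothetical, and garbage collection and wear leveling are not modeled.

```python
class SimpleFTL:
    """Toy page-level FTL: logical sector -> (block, page), with out-of-place updates."""

    def __init__(self, num_blocks, pages_per_block):
        self.map_table = {}    # logical sector -> (block, page)
        self.flash = {}        # (block, page) -> data
        self.obsolete = set()  # physical pages waiting for garbage collection
        # Assumption: free pages are handed out in order; a real FTL would
        # choose blocks with wear leveling in mind.
        self.free_pages = [(b, p) for b in range(num_blocks)
                           for p in range(pages_per_block)]

    def write(self, sector, data):
        """Handle an update request without overwriting the old page."""
        if not self.free_pages:
            raise RuntimeError("no free pages; garbage collection is not modeled")
        new_loc = self.free_pages.pop(0)
        self.flash[new_loc] = data          # step 2: write new data to a free page
        old_loc = self.map_table.get(sector)
        if old_loc is not None:
            self.obsolete.add(old_loc)      # the old copy stays in place, now obsolete
        self.map_table[sector] = new_loc    # step 3: update the map table
        return new_loc

    def read(self, sector):
        return self.flash[self.map_table[sector]]


ftl = SimpleFTL(num_blocks=2, pages_per_block=4)
first_loc = ftl.write(3045, b"v1")
second_loc = ftl.write(3045, b"v2")
```

After the second write, the sector maps to a new physical page and the first page is only marked obsolete; reclaiming it would be the job of garbage collection.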

Research on FlashDB
Components: FlashStorage, FlashBuffer, FlashIndex, FlashRecovery, FlashQuery, FlashBench

Buffer Management
LRU/CLOCK assumption: Cost_read = Cost_write

Motivation
I/O operation costs are asymmetric.
The degree of asymmetry differs across SSDs.
Assumption: Cost_read << Cost_write
[Figure legend, flash devices compared: 1 = MCAQE32G8APP-0XA, 2 = K9WAG08U1A, 3 = K9XXG08UXM, 4 = K9F1208R0B, 5 = K9GAG08B0M, 6 = Hynix HY27SA1G1M, 7 = K9K1208U0A, 8 = K9F2808Q0B, 9 = MCAQE32G5APP]
Characteristics of NAND: asymmetric read/write, fast random reads, out-of-place update, energy efficient

ACR [MDM 2010]: Adaptive Cost-aware Replacement
To account for the different read and write costs of flash, ACR maintains two LRU queues in memory: a clean queue (Lc) and a dirty queue (Ld).

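ACR [MDM 2010]: Replacement Policy
ACR adjusts the lengths of Lc and Ld according to the SSD's read and write costs.
The length of Lc (Ld) corresponds to β, the ratio of the replacement cost of Lc (Ld) to the replacement cost of the whole buffer:
  β = (replacement cost of Lc) / (replacement cost of the whole buffer)
If Lc is too short, the page at the LRU position of Ld is evicted.
If Lc is too long, the page at the LRU position of Lc is evicted.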
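The victim-selection rule can be illustrated with a small two-queue buffer. This is a simplified sketch, not the published ACR algorithm: `beta` is passed in as a fixed target share for the clean queue (how ACR derives and adapts β is not shown on the slide), and the class name `TwoQueueBuffer` and the `flash_writes` counter are hypothetical.

```python
from collections import OrderedDict


class TwoQueueBuffer:
    """Clean (Lc) and dirty (Ld) LRU queues with a cost-aware victim choice."""

    def __init__(self, capacity, beta):
        self.capacity = capacity
        self.beta = beta            # assumed fixed target fraction of the buffer for Lc
        self.clean = OrderedDict()  # Lc: first item is the LRU page
        self.dirty = OrderedDict()  # Ld: first item is the LRU page
        self.flash_writes = 0       # evicting a dirty page costs a flash write

    def _evict(self):
        target_clean = self.beta * self.capacity
        if self.clean and (len(self.clean) > target_clean or not self.dirty):
            self.clean.popitem(last=False)   # Lc too long: evict the LRU page of Lc
        else:
            self.dirty.popitem(last=False)   # Lc too short: evict the LRU page of Ld
            self.flash_writes += 1

    def access(self, page_id, write=False):
        """Touch a page; returns True on a buffer hit."""
        was_dirty = page_id in self.dirty
        hit = was_dirty or page_id in self.clean
        if hit:
            self.clean.pop(page_id, None)
            self.dirty.pop(page_id, None)
        elif len(self.clean) + len(self.dirty) >= self.capacity:
            self._evict()
        # A page stays dirty until it is evicted and written back.
        queue = self.dirty if (write or was_dirty) else self.clean
        queue[page_id] = True                # insert at the MRU position
        return hit


buf = TwoQueueBuffer(capacity=3, beta=0.5)
buf.access(1)
buf.access(2)
buf.access(3, write=True)
buf.access(4)              # Lc holds 2 pages, target is 1.5: clean page 1 is evicted
buf.access(5, write=True)  # Lc still too long: clean page 2 is evicted
buf.access(6)              # Lc is now short: dirty page 3 is evicted and written back
```

A larger `beta` protects dirty pages less and clean pages more; on a device where writes are much more expensive than reads, a policy of this kind would be expected to keep Ld relatively long.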

ACR [MDM 2010]
Replacement cost is computed per queue.
ACR takes the read/write asymmetry of different flash devices into account, so it applies to different types of flash.
"This paper targets at an interesting and important topic by considering developing a new buffer cache replacement algorithm on flash disks to solve this problem" (comment by Prof. Xiaodong Zhang, The Ohio State University)

Transaction Recovery
Write-Ahead Log (WAL)
Basic ideas:
  An update can be written only after it has been logged.
  Log records are forced to disk before a commit finishes.
  Undo/redo operations are performed during abort or recovery.
Disadvantage: frequent write operations
  May not be preferable for write-expensive flash-based DBMSs

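Log Design Ideas for Flash Databases
Asymmetric read/write speed: reads are faster than writes, so consider using more reads to reduce writes.
Out-of-place update: flash must be erased before it is rewritten, so historical versions of the data exist naturally; log the addresses of those historical versions.
No mechanical latency: random and sequential access speeds are similar, so random reads can replace sequential reads; change the log file from a sequential structure to a linked-list structure.
Limited erase count: flash lifetime is limited and blocks cannot be erased indefinitely; minimize writes to indirectly reduce erases.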

Performance Evaluation
The design was implemented in Berkeley DB, and its log recovery time was compared with that of the traditional database.
[Figure: recovery time (ms, 0 to 16,000) vs. update operations (thousands, 10 to 100) for HDD (traditional), HDD (HV-recovery), SSD (traditional), and SSD (HV-recovery). The chart carries an "8x" annotation.]

Transaction Recovery
Write-Ahead Log (WAL)
Shadow Paging [TODS, 1977]
Basic idea: out-of-place update
  Data pages are accessed through a page mapping table.
  Update a page: allocate a shadow page and change the current mapping.
  Commit: force the current mappings of updated pages to disk.
  Abort: discard the shadow page and the current mapping.
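The update, commit, and abort rules above can be sketched in memory as follows. This is a minimal illustration, assuming one transaction at a time; the names `ShadowPagedStore` and `Transaction` are hypothetical, and swapping a Python dict stands in for forcing the current mappings to disk.

```python
class ShadowPagedStore:
    """In-memory stand-in for a page store accessed through a mapping table."""

    def __init__(self):
        self.pages = {}          # physical page id -> contents
        self.committed_map = {}  # logical page -> physical page (the durable mapping)
        self.next_physical = 0

    def allocate(self, data):
        pid = self.next_physical
        self.next_physical += 1
        self.pages[pid] = data
        return pid

    def begin(self):
        return Transaction(self)


class Transaction:
    def __init__(self, store):
        self.store = store
        self.current_map = dict(store.committed_map)  # private current mapping
        self.shadow_pages = []

    def read(self, logical):
        return self.store.pages[self.current_map[logical]]

    def write(self, logical, data):
        # Out-of-place update: the previously committed page is never overwritten.
        pid = self.store.allocate(data)
        self.shadow_pages.append(pid)
        self.current_map[logical] = pid

    def commit(self):
        # Stand-in for forcing the current mappings to disk in one atomic step.
        self.store.committed_map = self.current_map

    def abort(self):
        # Discard the shadow pages; the private current mapping is simply dropped.
        for pid in self.shadow_pages:
            del self.store.pages[pid]


store = ShadowPagedStore()
t1 = store.begin()
t1.write("A", "v1")
t1.commit()

t2 = store.begin()
t2.write("A", "v2")
seen_inside_t2 = t2.read("A")
t2.abort()

seen_after_abort = store.begin().read("A")
```

No log records are written: an abort only drops pages that were never reachable from the committed mapping. Reclaiming pages made obsolete by committed updates is not modeled here.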

Shadow Paging [TODS, 1977]
Basic idea: out-of-place update
Advantage: no need to write log records
Disadvantages on hard disk:
  1. Maintaining the page mapping table
  2. Reclaiming obsolete pages
  3. High commit overhead of flushing the current page mapping
  4. Shadow pages may be scattered over the disk
Most of these disadvantages are no longer issues on flash disks!

Shadow Paging for SSD
Two mapping tables in the FTL:
  Direct mapping table
  Inverse mapping table
[Figure labels: RAM, flash memory]

FlagCommit [TKDE, 2011]
Inverse mapping table (spare area): extended to keep transaction states
Flag-based protocols:
  Commit-based Flag Commit (CFC)
  Abort-based Flag Commit (AFC)
[Figure: pages P2, P3, P6 with their flags. (a) An in-progress / aborted transaction: FALSE, TRUE, TRUE. (b) A committed transaction: TRUE, TRUE, TRUE.]

Shadow Paging > WAL?
HDD: Shadow Paging < WAL
  Bad performance of random operations
  Random reads for recovery
  Random writes when committing
SSD: Shadow Paging > WAL
  Good performance of random data access
  Out-of-place update: multi-version data
  Garbage collection

Observation on Sort-Merge Joins
SELECT * FROM CUSTOMER X, ORDERS Y WHERE X.C_CUSTKEY = Y.C_CUSTKEY;
[Figure: two join plans.
Sort-merge join: sort Table X and Table Y, then scan and merge the sorted tables.
Alternative join: extract a digest table <key, tid> from each of Table X and Table Y, join the digest tables, then reload the full tuples.]

DigestJoin [MDM 2009]
Random reads are no longer an issue on flash disks, but writing intermediate results is expensive.
Idea of DigestJoin:
  1st phase: generate digest tables <key, tid> and then join them
  2nd phase: reload the full tuples based on the digest join results

DigestJoin [MDM '09, DASFAA '10]: Example
Table R (tid: A, B, C): R1: 1, 6, 3; R2: 2, 2, 4; R3: 3, 5, 5
Table S (tid: B, D, E): S1: 1, 2, 3; S2: 2, 2, 4; S3: 3, 5, 5; S4: 6, 7, 9
Extract: digest of R (tid, B): (R2, 2), (R3, 5), (R1, 6); digest of S (tid, B): (S1, 1), (S2, 2), (S3, 3), (S4, 6)
Join: digest join result (Tid_R, Tid_S, B): (R2, S2, 2), (R1, S4, 6)
Fetch: reload the full tuples, using either a sort-based fetch or a graph-based fetch
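The two phases can be traced on the slide's own tables R and S. The sketch below is an in-memory illustration only: the function name `digest_join` and the dict-based tables are hypothetical, and the fetch phase is a plain lookup rather than the sort-based or graph-based fetch strategies.

```python
def digest_join(r_table, s_table, r_key, s_key):
    """Join two tables on a key using <key, tid> digests, then reload full tuples."""
    # Phase 1a: extract digest tables and sort them by the join key.
    r_digest = sorted((row[r_key], tid) for tid, row in r_table.items())
    s_digest = sorted((row[s_key], tid) for tid, row in s_table.items())

    # Phase 1b: merge-join the digests; only keys and tids are touched here.
    digest_result = []
    i = j = 0
    while i < len(r_digest) and j < len(s_digest):
        rk, sk = r_digest[i][0], s_digest[j][0]
        if rk < sk:
            i += 1
        elif rk > sk:
            j += 1
        else:
            k = j
            while k < len(s_digest) and s_digest[k][0] == rk:
                digest_result.append((r_digest[i][1], s_digest[k][1], rk))
                k += 1
            i += 1

    # Phase 2: reload the full tuples named by the digest join result.
    joined = [(r_table[r_tid], s_table[s_tid]) for r_tid, s_tid, _ in digest_result]
    return digest_result, joined


R = {"R1": {"A": 1, "B": 6, "C": 3},
     "R2": {"A": 2, "B": 2, "C": 4},
     "R3": {"A": 3, "B": 5, "C": 5}}
S = {"S1": {"B": 1, "D": 2, "E": 3},
     "S2": {"B": 2, "D": 2, "E": 4},
     "S3": {"B": 3, "D": 5, "E": 5},
     "S4": {"B": 6, "D": 7, "E": 9}}

digest_result, joined = digest_join(R, S, "B", "B")
# digest_result == [("R2", "S2", 2), ("R1", "S4", 6)], matching the slide
```

The intermediate data that would be written is only the small digest; the price is the random reads in phase 2, which the slides argue are cheap on flash.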

DigestJoin [MDM '09, DASFAA '10]: Fetch Phase
[Figure: the same example tables R and S, with the tuples laid out on pages (Page 1, Page 2) and 2 buffer pages available. For the digest join result (R2, S2), (R1, S4), the figure contrasts a fetch schedule that loads 4 pages with one that loads 2 pages.]

Performance Evaluation
PostgreSQL with DigestJoin

Research on FlashDB
Components: FlashStorage, FlashBuffer, FlashIndex, FlashRecovery, FlashQuery (cost model, optimization, ...), FlashBench

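Flash Devices
Flash board:
  Design verification of high-speed SSDs
  Verification of low-level FTL algorithms
  Verification of multi-chip parallel access
  Verification of inside-SSD cache algorithms
FlashDBSim:
  A unified, configurable, easy-to-use flash device simulation platform
  Can simulate the characteristics of different flash types (SLC/MLC/NOR)
  Provides I/O statistics (write/read/erase)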

Benchmark Environment for Flash Database Systems
[Figure: tpmC of PostgreSQL on SSD and HDD (warehouses = 10); x-axis: users (1 to 95), y-axis: tpmC (0 to 7000).]
[Figure: TPS of MySQL on SSD and HDD (users = 200); x-axis: time (10-minute intervals, 1 to 21), y-axis: TPS in thousands (0 to 20).]

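Flash Database Systems: New Challenges
[Diagram: the layers of a flash database system and the challenges attached to them.
Layers: transaction manager, query processor, file and index management, buffer management, storage management, NAND flash.
Challenges: multi-version log recovery; cost models and query optimization; failure recovery and concurrency control; index evaluation criteria and index structures; lock granularity and snapshot methods; swap-in/swap-out cost; space allocation, wear leveling, and compressed storage.
NAND flash properties: asymmetric read/write/erase, out-of-place update, limited flash lifetime.]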

SSD Applications
Widely used in various IT companies:
  Massive data volume
  High throughput
  Low latency
  ...

Comparison of Data Storage Approaches (1 TB of data)

Hierarchical Storage
Conventional DBMS: RAM 40 GB = 3.2x, disk drive 1000 GB = 1x; 1 TB data store = 4.2x
In-memory DBMS: RAM 1000 GB = 80x, SSD 1000 GB = 16x; 1 TB data store = 96x
Hybrid flash DBMS: RAM 40 GB = 3.2x, SSD 256 GB = 4x, hard disk drive 768 GB = 0.75x; 1 TB data store = 8x
I/O results (share of I/O vs. disk space):
  94% of I/O uses 30% of data (20% of cylinders)
  85% of I/O uses 15% of data (10% of cylinders)
  43% of I/O uses 1.5% of data (1% of cylinders)
Is all of your data worth 25x?
Source: Teradata, "The Future of Data Warehousing", 6th XLDB Conference, Workshop & Tutorials.

View of the Near Future: Leverage SSD for Big Data

Hybrid FlashDB
SSD hybrid storage topics: data placement, metadata management, endurance, data migration, energy efficiency

Endurance
[Figure: SSD write-lifetime (TB, 0 to 350). Ideal: 60 GB x 5000 = 300 TB. Micron C200: 40 TB (60 GB drive). Intel X-25M: 37 TB (80 GB drive). Source: Extending SSD Lifetimes with Disk-Based Write Caches, FAST '10.]
We must reduce the number of erase operations on the SSD as much as possible.
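The "ideal" bar is plain arithmetic: capacity multiplied by the number of erase cycles each block tolerates. A short worked version follows, assuming decimal units (1 TB = 1000 GB) and perfect wear leveling; the function name is hypothetical.

```python
def ideal_write_lifetime_tb(capacity_gb, erase_cycles):
    """Total data that can be written if every block is worn evenly (decimal TB)."""
    return capacity_gb * erase_cycles / 1000


ideal_tb = ideal_write_lifetime_tb(60, 5000)  # 60 GB x 5000 cycles
measured_tb = 40                              # 60 GB drive, as labeled in the figure
fraction_of_ideal = measured_tb / ideal_tb
```

Here `ideal_tb` is 300.0, so the 40 TB figure for the 60 GB drive is roughly 13% of the ideal. That gap is the motivation for reducing erase operations.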

Extend SSD Lifetime [FlashDB 2011]
Motivation:
  Small writes: storage utilization declines
  Random writes: frequent erases
Key idea:
[Figure: DBMS buffer management (shared memory holding a write buffer), DBMS storage management, and the flash device. Inserts and updates go to the write buffer and are written append-only to the storage space; data records are read from the storage space.]

Extend SSD Lifetime [FlashDB 2011]
Experiment setup:
  Flexible simulator
  Page size: 2 KB
  Block size: 128 KB
Performance

Energy Efficiency
The number of server installations is rapidly increasing.
Spending on power and cooling exceeds server purchase cost.
To manage extremely large amounts of data efficiently, we should balance performance improvement and energy consumption.

Energy Efficiency: Hardware Optimization
[Figure: power (%, watts) vs. system utilization (%), comparing observed power@utilization with ideal energy-proportional behavior; the gaps are labeled hardware optimization and software optimization.]
SSD can reduce the use of energy-inefficient RAM-based memory without compromising overall system performance.

Energy Efficiency: Software Optimization
Buffer management: "Trading Memory for Performance and Energy" by Yi Ou [FlashDB 2011].
[Figure: power (%, watts) vs. system utilization (%), comparing observed power@utilization with ideal energy-proportional behavior; the gaps are labeled hardware optimization and software optimization.]

Big Data is So Hot!
Google Trends for "Big Data"
Big Data Across the Federal Government (USA, March 2012)

Yinxu ruins, Anyang (c. 1300 BC, 3,300 years ago)
This is big data!
An oracle bone pit holding more than 17,000 pieces

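DB (Database) vs. BD (Big Data)
Small data, Very Large Database (VLDB):
  MB scale, structured data, operational systems, closed data sources
  Treats data as an object and solves its storage and management problems
  Data engineering
Big Data, Extremely Large Database (XLDB):
  > PB scale, unstructured data, sensing-based systems, open data sources
  Treats data as a resource for solving problems in many domains
  Data thinking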

Big Analytics
Many situations need the result of an analysis immediately.
Parallelism:
  Parallelism across nodes in a cluster
  Parallelism within a single node
Cloud computing
New hardware: SSD, PCM

Storage Class Memory (SCM)
A new class of data storage/memory devices; many technologies compete to be the best SCM.
SCM blurs the distinction between memory (fast, expensive, volatile) and storage (slow, cheap, non-volatile).
SCM features:
  Non-volatile
  Short access times (~ DRAM-like)
  Low cost per bit (disk-like by 2020)
  Solid state, no moving parts

Phase Change Memory
Phase change memory (PCM) is the leading contender for the first true SCM.
At least 18 companies are working on PCM, such as IBM, Samsung, Intel, and Micron.
PCM is an electronic device that uses two distinct solid phases of a metal alloy to store a bit.

The Impacts of PCM on DBMSs
[Figure: DBMS architecture. Applications (read & write); access methods (B+ tree index, hash index); lock and transaction; data page & log file; buffer pool (LRU, Clock); log; HDD.]

The Impacts of PCM on DBMSs
Can the in-memory buffer pool be eliminated, or at least the read buffer?
What about logging? Is logging still necessary?
An opportunity to rethink the data structures used to implement database systems, such as B+ trees and record organization.
Even an opportunity to rethink database machines.

Conclusion
Flash devices open a new world for DBMSs:
  Buffer (ACR), join (DigestJoin), shadow paging, ...
Many research topics still lie ahead (at least for the coming 5 years), so there is room to join flash-based database research.
New storage (SSD, PCM, etc.) brings new opportunities for big data management.

SELECT thanks FROM me
SELECT questions FROM you
Tape is Dead
Disk is Tape
Flash is Disk
--Jim Gray, 1998

Acknowledgments
The work in this report was supported by the Key Program of the National Natural Science Foundation of China, "Research on Flash-based Database Technology" (60833005).

About Our Lab
Innovative Data Management Research
Http://idke.ruc.edu.cn
Google: wamdm