System and Algorithmic Adaptation for Flash


System and Algorithmic Adaptation for Flash: The FAWN Perspective
David G. Andersen, Vijay Vasudevan, Michael Kaminsky*, Amar Phanishayee, Jason Franklin, Iulian Moraru, Lawrence Tan
Carnegie Mellon University and *Intel Labs

Context: Datacenter Energy
[Slide image: a hydroelectric dam.]

Approaches to saving power:
- Infrastructure efficiency: power generation, power distribution, cooling
- Dynamic power scaling: sleeping when idle, rate adaptation, VM consolidation
- Computational efficiency: FAWN
Goal of computational efficiency: reduce the amount of energy needed to do useful work.

FAWN: Fast Array of Wimpy Nodes. Improve the computational efficiency of data-intensive computing using an array of well-balanced, low-power systems.
[Slide figure (diagram labels garbled in transcription): an array of wimpy nodes, each pairing a low-power CPU and a small amount of DRAM with flash storage. Node hardware shown: 1.6 GHz single/dual-core Intel Pineview Atom with 2 GB DRAM; AMD Geode with 256 MB DRAM and 4 GB CompactFlash; Intel X25-M/E SSD.]

Towards balanced systems.
[Slide figure: latency in nanoseconds (log scale) of a disk seek, a DRAM access, and a CPU cycle from 1980 to 2005; the widening gap means wasted resources. Rebalancing options: today's CPUs + array of fastest disks; slower CPUs + fast storage; slow CPUs + today's disks.]

Targeting the sweet spot in efficiency:
- The fastest processors exhibit superlinear power usage.
- Fixed power costs can dominate efficiency for slow processors.
- FAWN targets the sweet spot in system efficiency when fixed costs are included.
[Slide figure: speed vs. efficiency, plotting instructions/sec/W (millions) against instructions/sec (millions, including a 0.1 W power overhead) for a custom ARM mote, an 800 MHz XScale, an Atom Z500, and a Xeon 7350.]

Targeting the sweet spot in efficiency (continued).
[Slide figure: the same speed-vs-efficiency plot, annotated with the rebalancing options (today's CPU + array of fastest disks, slower CPU + fast storage, slow CPU + today's disk) and with FAWN marked at the more efficient end.]

Case 1: A high-performance, persistent key-value store.
- ~20-byte keys, 10-1000 byte values
- Very small writes
- Irregularly sized objects
- Very random access
- The FTL is not our friend.

Using Berkeley DB on CompactFlash. Platform: 500 MHz AMD Geode, 256 MB DRAM, 4 GB CompactFlash card. Workload: insert 7 million 200-byte entries into the DB.

  BDB:      0.07 MB/s
  FAWN-KV:  20 MB/s

Using flash for key-value storage:
- Write sequentially within an erase block.
- Can do this concurrently to several erase blocks, iff the FTL lets you (duplication with the filesystem...).
- Use system memory efficiently; otherwise, why use flash at all? :)
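
A minimal sketch, in C, of the append path described above (names such as log_append and the on-flash entry layout are illustrative assumptions, not the actual FAWN-DS code): every put goes to the tail of an on-flash log, so the device sees only sequential writes, and the returned offset is what the in-memory index records.

/* Sketch of an append-only log write; error handling omitted. */
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

struct log_entry_hdr {
    uint8_t  key[20];   /* ~20-byte keys, per the workload slide */
    uint32_t len;       /* value length, 10-1000 bytes */
};

/* Append one <key, len, data> entry at the log tail and return its offset,
 * which the DRAM hash index will store. All writes land at the tail, so
 * flash only ever sees sequential I/O within this log. */
static off_t log_append(int fd, off_t *tail, const uint8_t key[20],
                        const void *data, uint32_t len)
{
    struct log_entry_hdr hdr;
    memcpy(hdr.key, key, sizeof(hdr.key));
    hdr.len = len;

    off_t entry_off = *tail;
    pwrite(fd, &hdr, sizeof(hdr), entry_off);
    pwrite(fd, data, len, entry_off + (off_t)sizeof(hdr));
    *tail = entry_off + (off_t)sizeof(hdr) + len;
    return entry_off;
}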

From key to value.
[Slide figure: a 160-bit key is hashed into a DRAM hashtable; each index entry holds a KeyFrag, a valid bit, and an offset (12 bytes per entry) pointing into the flash data region, where a log entry stores the key, length, and data.]
- KeyFrag != Key, so collisions are possible.
- Low probability of needing multiple flash reads (see the sketch below).
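
To make the lookup path concrete, here is a hedged C sketch of what the diagram describes (field widths, names such as fawnds_get, and the probing strategy are assumptions, not the exact FAWN-DS layout): the DRAM entry stores only a key fragment and an offset, so a fragment match may still be a collision, and the full key read from flash must be compared before returning the value.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Illustrative 12-byte DRAM index entry (the slide cites 12 bytes/entry). */
struct index_entry {
    uint32_t keyfrag;   /* fragment of the 160-bit key */
    uint32_t valid;     /* entry in use? */
    uint32_t offset;    /* byte offset of the log entry on flash */
};

struct log_entry_hdr {  /* same illustrative on-flash layout as above */
    uint8_t  key[20];
    uint32_t len;
};

/* Lookup: index the table by a hash of the key, filter on the stored key
 * fragment in DRAM, then read the log entry from flash and verify the full
 * key, since a matching fragment does not imply a matching key. A real
 * design would probe further entries on a mismatch; that is omitted here. */
static bool fawnds_get(int fd, const struct index_entry *table, size_t nslots,
                       const uint8_t key[20], void *val, uint32_t *len)
{
    uint32_t frag;
    memcpy(&frag, key, sizeof(frag));
    const struct index_entry *e = &table[frag % nslots];

    if (!e->valid || e->keyfrag != frag)
        return false;                       /* definitely not this entry */

    struct log_entry_hdr hdr;               /* flash read for the header */
    pread(fd, &hdr, sizeof(hdr), e->offset);
    if (memcmp(hdr.key, key, sizeof(hdr.key)) != 0)
        return false;                       /* KeyFrag collision */

    pread(fd, val, hdr.len, e->offset + (off_t)sizeof(hdr));
    *len = hdr.len;
    return true;
}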

Just one log is painful. With flash, we may not be restricted to a single log.
[Slide figure: write speed (MB/s) versus number of FAWN-DS files (log scale, 1 to 256) for the SanDisk Extreme IV, Memoright GT, Mtron Mobi, Intel X25-M, and Intel X25-E.]
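
One way to picture several concurrent logs is the sketch below (hypothetical names; not FAWN-DS code): writes are partitioned across N open log files, and each file individually still sees only appends, which is the semi-random pattern the graph above measures as the file count grows.

#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define NLOGS 8   /* e.g., one of the file counts on the x-axis */

struct multilog {
    int   fd[NLOGS];     /* one append-only log file per slot */
    off_t tail[NLOGS];   /* current tail offset of each log */
};

/* Route each write to one log by key hash and append at that log's tail.
 * The device sees NLOGS sequential write streams rather than one. */
static off_t multilog_append(struct multilog *ml, uint32_t keyhash,
                             const void *buf, size_t len)
{
    int i = keyhash % NLOGS;
    off_t off = ml->tail[i];
    pwrite(ml->fd[i], buf, len, off);
    ml->tail[i] += (off_t)len;
    return off;
}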

FAWN-DS lookups: our FAWN-based system is over 6x more efficient than 2008-era traditional systems.

  System                   QPS    Watts   QPS/Watt
  Alix3c2 / Sandisk (CF)   1298    3.75   346
  Desktop / Mobi (SSD)     4289   83       51.7
  MacBook Pro / HD           66   29        2.3
  Desktop / HD              171   87        1.96

Ongoing work: DRAM limits the amount of flash that can be used.
- FAWN-KV uses 12 bytes of DRAM per entry.
- Our ongoing work gets this down to ~1 byte of DRAM per key-value entry (but the data must be re-written once), or 3 bits if we can read flash on a table miss.
- BufferHash (NSDI 2010) provides similar benefits, though it wastes 50% of the flash space.
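
As a rough, back-of-the-envelope illustration of why the per-entry cost matters (the 256-byte average object size is an assumption, not a number from the talk), consider a node with 2 GB of DRAM devoted to the index:

  at 12 bytes/entry:   2 GB / 12 B    ~ 180 million keys  (~45 GB of flash at 256 B/object)
  at  1 byte/entry:    2 GB / 1 B     ~ 2.1 billion keys  (~550 GB of flash)
  at  3 bits/entry:    2 GB / 0.375 B ~ 5.7 billion keys  (~1.4 TB of flash)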

And then we moved to Atom + SSD. Test platforms:
- 1.6 GHz single-core Pineview Atom, 2 GB DRAM, Intel X25-M SSD
- 2.8 GHz 4-core Core i7, 2 GB DRAM, 6x Intel X25-M SSDs
- dual 2.8 GHz 4-core Xeon, 8 GB DRAM, FusionIO

512-byte random reads:

  Platform                     Reads/second
  2x 4-core Xeon + FusionIO    ~150 K
  i7 + single X25-M            ~60 K
  i7 + 4x X25-M                ~115 K
  1-core Atom + X25-M          ~23 K

SATA... need I say more? We couldn't get more than ~120K IOPS over the onboard SATA bus, no matter what we tried.

Slow wimpies. Prior results showed the wimpies dominating in efficiency, so what's happening here (23K vs. 60K reads/second)?
- add_disk_randomness(rq->rq_disk);
- 23,000 interrupts/second
- the tester program called gettimeofday
Fixed these, plus new interrupt coalescing: 37K and rising.
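
A hedged illustration of the gettimeofday point (this is not the actual tester; names and structure are assumptions): timestamping every 512-byte read adds a syscall per I/O, which is expensive on a wimpy core, whereas timestamping once per batch amortizes the cost.

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

#define BATCH 1024

/* Issue BATCH random 512-byte reads and time the whole batch with two
 * gettimeofday() calls, instead of two per read. */
static void timed_batch(int fd, char *buf, off_t devsize)
{
    struct timeval start, end;
    gettimeofday(&start, NULL);                 /* one timestamp per batch */
    for (int i = 0; i < BATCH; i++) {
        off_t off = ((off_t)rand() % (devsize / 512)) * 512;
        pread(fd, buf, 512, off);               /* ...instead of per read */
    }
    gettimeofday(&end, NULL);
    double us = (end.tv_sec - start.tv_sec) * 1e6 +
                (end.tv_usec - start.tv_usec);
    printf("%.0f us for %d reads (%.2f us/read)\n", us, BATCH, us / BATCH);
}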

Sorting: similar results using NSort, but a flash-aware sort can clobber NSort (talk offline).
[Slide figure: sort efficiency comparison in MB sorted per joule for Atom + X25-E, i7 desktop + 4x X25-E, i7 server + FusionIO, and 2x Xeon + FusionIO.]

Data structures.
- One idea you've seen: mutable bits through re-programming (Rivest's punch-card work from '82; Grupp, Yaakobi, Mitz, and more).
- We can do even better for particular data types...
- Flash should be an ideal add-only Bloom filter (set membership with one-sided error: it will tell you if X is in the set, but may lie and say it is).
- Caching works poorly for Bloom filters (random access); they are very important for data mining, etc.
- But all of these need bit-level access to flash... (see the sketch below)
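
A minimal in-memory sketch of the add-only Bloom filter idea (standard textbook structure in C; not code from the talk): inserts only ever flip bits in one direction, which is the same one-directional change that flash re-programming without an erase allows (possibly with the bit sense inverted on the device), so with bit-level access the filter could in principle live directly on flash.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define M_BITS (1u << 20)   /* filter size: 1 Mbit (illustrative) */
#define K_HASH 4            /* number of hash functions (illustrative) */

static uint8_t filter[M_BITS / 8];

/* Simple salted FNV-1a-style hash; any k independent hashes would do. */
static uint32_t bloom_hash(const void *key, size_t len, uint32_t salt)
{
    const uint8_t *p = key;
    uint32_t h = 2166136261u ^ salt;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h % M_BITS;
}

/* Add: set k bits. Setting bits is the only mutation ever needed. */
static void bloom_add(const void *key, size_t len)
{
    for (uint32_t i = 0; i < K_HASH; i++) {
        uint32_t b = bloom_hash(key, len, i);
        filter[b / 8] |= (uint8_t)(1u << (b % 8));
    }
}

/* Query: "definitely not in the set" or "probably in the set". */
static bool bloom_maybe_contains(const void *key, size_t len)
{
    for (uint32_t i = 0; i < K_HASH; i++) {
        uint32_t b = bloom_hash(key, len, i);
        if (!(filter[b / 8] & (1u << (b % 8))))
            return false;
    }
    return true;
}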

Where we're going (?): PCM (??)
- Better: bandwidth, latency, power
- Much worse: capacity
- Requires even more memory-efficient systems.

The FAWN perspective:
- Pretending flash is disk or DRAM misses opportunities; making flash look like disk or DRAM hides opportunities.
- Today's kernels handle high block IOPS poorly (... and we need to fix this).
- Algorithms that exploit re-programmability and semi-random writes can win big.
- But we want to leave the system usable and the abstractions manageable.