Moneta: A High-Performance Storage Architecture for Next-generation, Non-volatile Memories


Moneta: A High-Performance Storage Architecture for Next-generation, Non-volatile Memories
Adrian M. Caulfield, Arup De, Joel Coburn, Todor I. Mollov, Rajesh K. Gupta, Steven Swanson
Non-Volatile Systems Laboratory, Department of Computer Science and Engineering, University of California, San Diego

The Future of Storage

                       Hard Drives   PCIe-Flash (2007)   PCIe-NVM (2013?)
    Latency            7.1 ms        68 us               12 us
    Bandwidth          2.6 MB/s      250 MB/s            1.7 GB/s
    Latency speedup    1x            104x                591x  (= 2.89x/yr)
    Bandwidth speedup  1x            96x                 669x  (= 2.95x/yr)

*Random 4KB reads from user space. (The per-year rates are checked below.)
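As a quick arithmetic aside (assuming the roughly six-year span implied by the 2007 and 2013 labels), the per-year rates follow from taking the sixth root of the cumulative speedups:

    \[ \sqrt[6]{591} \approx 2.89 \qquad\text{and}\qquad \sqrt[6]{669} \approx 2.95 \]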

Software Latency Costs
[Chart: log-scale request latency (us), from 1 to 10^7, for Disk, Flash, and PCM, broken down into file system, operating system, and hardware components.]

Architecting a High-Performance SSD
- Hardware architecture and software layers limit performance
- The HW/SW interface is critical to good performance
- Careful co-design provides significant benefits: increased bandwidth, decreased latencies, increased concurrency

Overview
- Motivation
- Moneta Architecture
- Optimizing Moneta
- Performance Analysis
- Conclusion

Moneta Architecture
[Block diagram: the host issues commands via PIO to a request queue; a scoreboard, ring control, transfer buffers, DMA control, and tag status registers sit on a 4 GB/s internal ring connecting four 16 GB PCM banks; data moves to and from the host via DMA.]

Moneta Architecture Overview
[Stack diagram: Application -> File System -> OS IO Stack -> Moneta Driver -> PCIe 1.1 x8 (2 GB/s full duplex) -> Moneta hardware.]

Advanced Memory Technology Characteristics
- Works with any fast NVM: DRAM-like speed, DRAM-like interface
- Phase-change memory is coming soon
- Simple wear leveling: Start-Gap [Micro 2009] (see the sketch below)
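As an aside, here is a minimal sketch of the Start-Gap scheme the slide cites [Qureshi et al., Micro 2009]. The sizes and names are illustrative, not Moneta's actual implementation: one spare "gap" line rotates through the array so that writes to hot logical lines migrate across all physical lines.

    /* Start-Gap wear leveling sketch: N logical lines map onto N+1 physical
     * lines; the unused "gap" line rotates through the array. */
    #include <stdint.h>

    #define N_LINES 1024u            /* logical line count (illustrative) */

    static uint32_t start = 0;       /* rotation offset */
    static uint32_t gap = N_LINES;   /* physical index of the unused line */

    /* Device-level line copy; stubbed out for this sketch. */
    static void copy_line(uint32_t dst, uint32_t src) { (void)dst; (void)src; }

    /* Logical-to-physical mapping: rotate by 'start', then skip the gap. */
    uint32_t map_line(uint32_t logical)
    {
        uint32_t phys = (logical + start) % N_LINES;
        if (phys >= gap)
            phys++;                  /* hop over the gap line */
        return phys;
    }

    /* Invoked once every psi writes: move the gap down one line, copying the
     * displaced data; when it reaches the bottom, wrap and advance 'start'. */
    void gap_move(void)
    {
        if (gap == 0) {
            gap = N_LINES;
            start = (start + 1) % N_LINES;
        } else {
            copy_line(gap, gap - 1);
            gap--;
        }
    }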

Moneta: Modeling Advanced NVMs
- Built on the RAMP BEE3 board; PCIe 1.1 x8 host connection; 250 MHz design
- DDR2 DRAM emulates NVMs, with timings adjusted to match PCM: RAS-CAS delay for reads, precharge latency for writes

    Projected latency:   Read 48 ns   Write 150 ns
    (latency projections from [B.C. Lee, ISCA '09])

Overview
- Motivation
- Moneta Architecture
- Optimizing Moneta
  - Interface & Software
  - Microarchitecture
- Performance Analysis
- Conclusion

Optimized Software is Critical
- Baseline 4KB latency: 8.2 us hardware + 13.4 us software
- Both hardware and software must be optimized
[Chart: baseline latency breakdown (us, 0-25) into hardware costs (PCM, Ring, DMA) plus Wait, Interrupt, Issue, Copy, Schedule, and OS/User software components.]

Removing the IO Scheduler
- Reduces executed code: IO scheduler code and kernel request queuing disappear
- Improves concurrency: requests are not serialized through the IO scheduler, and multiple threads can run in the driver
- 10% software latency savings (a sketch of the bypass follows below)
[Chart: latency breakdown (us, 0-25) with the IO scheduler removed.]
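A hedged sketch of how a block driver of that era could bypass the IO scheduler: Linux 2.6-style drivers could supply their own make_request function (as the brd ramdisk does), so bios go straight to the driver instead of through the elevator and kernel request queue. moneta_issue_bio is a hypothetical hook standing in for the paper's actual code.

    /* Sketch only: bypassing the Linux IO scheduler, 2.6-era block API. */
    #include <linux/blkdev.h>
    #include <linux/bio.h>

    extern void moneta_issue_bio(struct bio *bio);  /* hypothetical hook */

    /* Called directly on bio submission: no elevator, no request queueing,
     * and any number of threads may enter concurrently. */
    static int moneta_make_request(struct request_queue *q, struct bio *bio)
    {
        moneta_issue_bio(bio);
        return 0;
    }

    static void moneta_setup_queue(struct request_queue *q)
    {
        blk_queue_make_request(q, moneta_make_request);
    }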

Lock-Free Tag Pool
- Compare-and-swap operations on tag-indexed data structures
- Processor ID used as a hint for where to search in the tag structure
- Reduces concurrency conflicts and cache-line misses
(see the sketch below)
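A minimal user-space sketch of the idea using C11 atomics; the pool size, hint formula, and function names are illustrative (the real pool lives in the kernel driver):

    /* Lock-free tag allocation: claim slots with compare-and-swap, starting
     * the search at a per-CPU offset so cores rarely contend on one line. */
    #define _GNU_SOURCE
    #include <stdatomic.h>
    #include <sched.h>               /* sched_getcpu() (glibc) */

    #define NTAGS 64

    static atomic_int tag_busy[NTAGS];   /* 0 = free, 1 = in use */

    /* Returns a tag index, or -1 if the pool is momentarily exhausted. */
    int tag_alloc(void)
    {
        int cpu = sched_getcpu();
        if (cpu < 0)
            cpu = 0;                 /* fall back if the hint is unavailable */
        int start = (cpu * 8) % NTAGS;   /* per-CPU search hint */
        for (int i = 0; i < NTAGS; i++) {
            int t = (start + i) % NTAGS;
            int expected = 0;
            /* CAS: claim the slot only if it is still free. */
            if (atomic_compare_exchange_strong(&tag_busy[t], &expected, 1))
                return t;
        }
        return -1;
    }

    void tag_free(int t)
    {
        atomic_store(&tag_busy[t], 0);
    }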

Co-designing the HW/SW Interface
- Baseline uses multiple PIO writes to send a command
- Reduce the command word to 64 bits: a single PIO write issues a request, so no locks are needed to protect request issue
- Remove the DMA address from the command: buffers are pre-allocated during initialization, with a static buffer address for each tag
(see the sketch below)
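To make the mechanism concrete, a hedged sketch of a 64-bit command word; the field widths and register name below are invented for illustration and are not Moneta's actual encoding:

    /* Pack an entire request into one 64-bit word so a single PIO store
     * issues it atomically; the tag implies a pre-allocated DMA buffer. */
    #include <stdint.h>

    static inline uint64_t make_cmd(uint64_t is_write, uint64_t tag,
                                    uint64_t len_sectors, uint64_t sector)
    {
        return (is_write    << 63) |          /*  1 bit: read/write        */
               (tag         << 56) |          /*  7 bits: tag index        */
               (len_sectors << 40) |          /* 16 bits: length (sectors) */
               (sector & 0xFFFFFFFFFFULL);    /* 40 bits: start sector     */
    }

    /* One 64-bit MMIO store is atomic on x86-64, so concurrent threads can
     * issue requests without taking a lock around the command register. */
    static inline void issue(volatile uint64_t *cmd_reg, uint64_t cmd)
    {
        *cmd_reg = cmd;
    }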

Atomic Operations
- Lock-free data structures
- Atomic HW/SW interface
- Increased concurrency; 10% less latency vs. NoSched
[Chart: latency breakdown (us, 0-25) for the Atomic configuration.]

Add Spin-Waits
- Spin-waits vs. sleeping: spin for requests < 4KB, sleep for larger requests
- 5 us of software latency: 54% less SW latency vs. Atomic, 62% less vs. Base
(see the sketch below)
[Chart: latency breakdown (us, 0-25) with spin-waits.]
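A sketch of the policy, with hypothetical helpers (tag_status, wait_for_interrupt) standing in for the driver's real completion machinery:

    /* Completion-wait policy: small requests finish so quickly that
     * busy-polling beats a sleep/interrupt round trip. */
    #include <stddef.h>
    #include <stdint.h>

    #define SPIN_THRESHOLD (4 * 1024)          /* spin below 4 KB */

    extern volatile uint64_t tag_status[];     /* per-tag completion bits */
    extern void wait_for_interrupt(int tag);   /* blocks until IRQ fires  */

    void wait_for_completion(int tag, size_t req_bytes)
    {
        if (req_bytes < SPIN_THRESHOLD) {
            /* Busy-poll the tag's status bit: no scheduler, no IRQ cost. */
            while ((tag_status[tag / 64] & (1ULL << (tag % 64))) == 0)
                ;
        } else {
            wait_for_interrupt(tag);           /* sleep; wake on interrupt */
        }
    }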

Moneta Bandwidth
[Chart: random write bandwidth (GB/s, 0-2.0) vs. access size (0.5 KB to 512 KB) for the Base, NoSched, Atomic, and SpinWait configurations against Ideal; the optimized design sustains 1M IOPS.]

CPU Utilization
[Chart: CPUs consumed per GB/s of random 4KB read/write bandwidth (0-14) for Baseline, NoSched, Atomic, and Spin.]

Zero-Copy
- Trade-off between multiple PIO writes and zero-copy IO
- Currently it is faster to copy in the driver than to write multiple words to the hardware to issue a request
- 128-bit or cache-line-sized PIO writes would be needed to make zero-copy win

Overview
- Motivation
- Moneta Architecture
- Optimizing Moneta
  - Interface & Software
  - Microarchitecture
- Performance Analysis
- Conclusion

Balancing Bandwidth Usage
- Full-duplex PCIe should see better read/write bandwidth
- Smarter HW request scheduling = more bandwidth
- Two request queues, one for reads and one for writes; alternate between the queues on each buffer allocation
[Chart: random R/W bandwidth (GB/s, 0-4.0) vs. access size (0.5 KB to 128 KB) for Single Queue vs. Ideal.]

Moneta Architecture Changes
[Block diagram: as before, but the single request queue is split into separate read and write request queues feeding the scoreboard, ring control, transfer buffers, DMA control, and tag status registers on the 4 GB/s ring with four 16 GB PCM banks.]

Balancing Bandwidth Usage (continued)
[Chart: the same random R/W bandwidth plot with the RW Queues configuration added alongside Single Queue and Ideal.]

Round-Robin Scheduling
- Prevents requests from starving other requests: allocate one buffer, then put the request at the back of the queue
- Attains much better small-request throughput in the presence of large requests
- 12x improvement in small-request bandwidth
(see the sketch below)
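A small sketch of the requeue-at-the-back policy; the types and buffer size are illustrative, and start_transfer stands in for the hardware's buffer-fill path:

    /* Round-robin buffer scheduling: a large request receives one transfer
     * buffer at a time and then rotates to the tail of its queue, so small
     * requests interleave instead of waiting behind it. */
    #include <stddef.h>

    #define BUF_BYTES 8192           /* one transfer buffer (illustrative) */

    struct request {
        size_t remaining;            /* bytes still to transfer */
        struct request *next;
    };

    struct queue { struct request *head, *tail; };

    static struct request *dequeue(struct queue *q)
    {
        struct request *r = q->head;
        if (r && !(q->head = r->next))
            q->tail = NULL;
        return r;
    }

    static void enqueue(struct queue *q, struct request *r)
    {
        r->next = NULL;
        if (q->tail) q->tail->next = r; else q->head = r;
        q->tail = r;
    }

    static void start_transfer(struct request *r, size_t n) { (void)r; (void)n; }

    /* Run whenever a transfer buffer frees up. */
    void on_buffer_free(struct queue *q)
    {
        struct request *r = dequeue(q);
        if (!r)
            return;
        size_t chunk = r->remaining < BUF_BYTES ? r->remaining : BUF_BYTES;
        start_transfer(r, chunk);
        r->remaining -= chunk;
        if (r->remaining > 0)
            enqueue(q, r);           /* back of the line: others go first */
    }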

Moneta Architecture Changes (continued)
[The same block diagram as above, repeated.]

Effects of Memory Latency
- Moneta tolerates memory latencies up to 1 us without bandwidth loss
- Increased parallelism hides extra memory latency
[Chart: bandwidth (GB/s, 0-3.0) vs. memory latency (4 ns to 128 us) for 1, 2, 4, and 8 memory controllers.]

NVMs for Storage vs. DRAM Replacement
- Write coalescing: storage must guarantee durability by closing the row, while DRAM leaves the row open to enable coalescing
- Row buffer size should match the access size: large accesses for storage, cache-line-sized for memory
- Peak memory activity is limited by PCIe bandwidth
- Storage and DRAM replacement are different and should be optimized differently

Overview
- Motivation
- Moneta Architecture
- Optimizing Moneta
- Performance Analysis
- Conclusion

System Overview

    System      Memory / Device                 Interconnect   Capacity
    FusionIO    IODrive, SLC NAND Flash         PCIe 4x        80 GB
    SSD-RAID    SW RAID-0, SATA 2 controller    PCIe 4x        128 GB
    DISK-RAID   HW RAID-0, RAID controller      PCIe 4x        4 TB
    Moneta      Emulated PCM                    PCIe 8x        64 GB
    Moneta-4x   Emulated PCM                    PCIe 4x        64 GB

Workloads

    Name                       Footprint   Description
    Basic IO benchmarks:
      XDD NoFS                 64 GB       Low-level IO performance without a file system
      XDD XFS                  64 GB       XFS file system performance
    Database applications:
      Berkeley-DB Btree        16 GB       Transactional updates to a B-tree key/value store
      Berkeley-DB HashTable    16 GB       Transactional updates to a hash-table key/value store
      BiologicalNetworks       35 GB       Biological database queried for properties of genes and biological networks
      PTF                      50 GB       Palomar Transient Factory sky-survey queries
    Memory-hungry applications:
      DGEMM                    21 GB       Matrix multiply with 30,000 x 30,000 matrices
      NAS Parallel Benchmarks  8-35 GB     7 apps from the NPB suite modeling scientific workloads

XDD Bandwidth Comparison
[Chart: 50/50 read/write random-access bandwidth (GB/s, 0-3.5) vs. access size (0.5 KB to 512 KB) for Disk, SSD, FusionIO, Moneta, and Moneta-4x.]

Bandwidth w/o File System
[Chart: read and write bandwidth (GB/s, 0-2.0) for large (4MB) and small (4KB) accesses across Moneta, FusionIO, SSD-RAID, and DISK-RAID.]

Bandwidth with XFS
[Chart: read and write bandwidth (GB/s, 0-2.0) through XFS vs. raw IO, for large (4MB) and small (4KB) accesses, across Moneta, FusionIO, SSD-RAID, and DISK-RAID.]

Database & App Performance: An Opportunity for Leveraging Hardware Optimizations
[Chart: log-scale speedup vs. DISK-RAID (0.1x to 10,000x) for XDD 4KB R/W, HashTable, Btree, Bio, and PTF (database workloads), and for XDD 4KB R/W, DGEMM, lu, bt, sp, and is (memory-hungry workloads).]

Conclusion
- Fast, advanced NVMs are coming soon
- We built Moneta to understand the impact of NVMs
- Both the interface and the microarchitecture are critical to getting excellent performance
- Many opportunities exist to move software optimizations into hardware
- Many open questions remain about the architecture of fast SSDs and the systems they interface with

Thank You! Any Questions?