Moneta: A High-Performance Storage Architecture for Next-Generation, Non-Volatile Memories
Adrian M. Caulfield, Arup De, Joel Coburn, Todor I. Mollov, Rajesh K. Gupta, Steven Swanson
Non-Volatile Systems Laboratory, Department of Computer Science and Engineering, University of California, San Diego
The Future of Storage

                 Hard Drives (2007)   PCIe-Flash (2007)   PCIe-NVM (2013?)
Latency*         7.1 ms   (1x)        68 us   (104x)      12 us    (591x)   = 2.89x/yr
Bandwidth*       2.6 MB/s (1x)        250 MB/s (96x)      1.7 GB/s (669x)   = 2.95x/yr

*Random 4KB reads from user space
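The x/yr figures in the table compound the total improvement over the six-year span. A quick check of the arithmetic (the 591x and 669x totals are from the slide; the span assumes hard drives as the 1x baseline in 2007 and projected PCIe-NVM in 2013):

```python
# Annualized improvement: total = rate ** years, so rate = total ** (1/years).
# Totals (591x latency, 669x bandwidth) are taken from the table above.
years = 2013 - 2007

latency_rate = 591 ** (1 / years)     # ~2.9x per year
bandwidth_rate = 669 ** (1 / years)   # ~2.96x per year

print(f"latency improves {latency_rate:.2f}x/yr, "
      f"bandwidth improves {bandwidth_rate:.2f}x/yr")
```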
Software Latency Costs
[Chart: log-scale request latency (us, 1 to 10^7) for Disk, Flash, and PCM, broken down into file system, operating system, and hardware components.]
Architecting a High-Performance SSD
- Hardware architecture and software layers limit performance
- The HW/SW interface is critical to good performance
- Careful co-design provides significant benefits: increased bandwidth, decreased latency, increased concurrency
Overview
- Motivation
- Moneta Architecture
- Optimizing Moneta
- Performance Analysis
- Conclusion
Moneta Architecture
[Diagram: the host issues commands via PIO into a request queue backed by a scoreboard and tag status registers, and moves data via DMA through transfer buffers under DMA control; a 4 GB/s ring connects the ring controller to four 16 GB PCM banks.]
Moneta Architecture Overview
[Diagram: IO stack - Application → File System → OS IO Stack → Moneta Driver → PCIe 1.1 x8 (2 GB/s full duplex) → Moneta.]
Advanced Memory Technology Characteristics
- Works with any fast NVM that has DRAM-like speed and a DRAM-like interface
- Phase-change memory (PCM): coming soon
- Simple wear leveling: Start-Gap [Micro 2009]
Moneta: Modeling Advanced NVMs
- Built on RAMP's BEE3 board
- PCIe 1.1 x8 host connection
- 250 MHz design
- DDR2 DRAM emulates NVMs, with timings adjusted to match PCM: RAS-CAS delay models reads, precharge latency models writes

Projected latency: read 48 ns, write 150 ns (latency projections from [B.C. Lee, ISCA 09])
Overview
- Motivation
- Moneta Architecture
- Optimizing Moneta
  - Interface & Software
  - Microarchitecture
- Performance Analysis
- Conclusion
Optimized Software is Critical
- Baseline 4KB latency - hardware: 8.2 us, software: 13.4 us
- Optimize both hardware and software
[Chart: latency breakdown (us) for the baseline, split into PCM, ring, DMA, wait, interrupt, issue, copy, schedule, and OS/user costs.]
Removing the IO Scheduler
- Reduces executed code: IO scheduler code and kernel request queuing are eliminated
- Improves concurrency: requests are no longer serialized through the IO scheduler, and multiple threads can run in the driver
- 10% software latency savings
[Chart: latency breakdown (us) with the IO scheduler removed.]
Lock-Free Tag Pool
- Compare-and-swap operations on tag-indexed data structures
- Uses the processor ID as a hint for where to search in the tag structure
- Reduces concurrency conflicts and cache-line misses
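The slide leaves the search strategy implicit; a minimal Python model follows. The real driver uses compare-and-swap on the tag structures, and the pool size and core count here are made-up values for illustration:

```python
NUM_CPUS = 4   # assumed core count for the sketch
NUM_TAGS = 8   # assumed pool size

class TagPool:
    """Model of a lock-free tag pool searched from a per-CPU hint.

    A plain list stands in for the tag-indexed structure; the line
    marked CAS would be a single atomic compare-and-swap in the real
    driver. Starting each CPU's search at a different slot keeps CPUs
    out of each other's cache lines and reduces CAS conflicts.
    """

    def __init__(self, num_tags=NUM_TAGS):
        self.free = [True] * num_tags

    def alloc(self, cpu_id):
        n = len(self.free)
        start = (cpu_id * n // NUM_CPUS) % n   # processor-ID hint
        for i in range(n):
            tag = (start + i) % n
            if self.free[tag]:                 # CAS: atomically claim the slot
                self.free[tag] = False
                return tag
        return None                            # pool exhausted

    def release(self, tag):
        self.free[tag] = True
```

With these parameters, CPUs 0 and 2 start probing at tags 0 and 4 respectively, so they only contend once their regions of the pool fill up.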
Co-design HW/SW Interface
- The baseline uses multiple PIO writes to send a command
- Reducing the command word to 64 bits allows a single PIO write to issue a request, so no locks are needed to protect request issue
- The DMA address is removed from the command: buffers are pre-allocated during initialization, with a static buffer address for each tag
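With the DMA address gone, one 64-bit word can describe a request. The field layout below is purely illustrative (the slide says only that the command fits in 64 bits); the point is that packing plus one atomic store replaces a locked multi-word PIO sequence:

```python
# Hypothetical 64-bit command layout (illustrative - the slide gives no
# field widths):
#   [63:56] opcode | [55:48] tag | [47:16] 32-bit sector | [15:0] length
# No DMA address is carried: each tag maps to a buffer pre-allocated at init.

def pack_command(opcode, tag, sector, length):
    assert opcode < (1 << 8) and tag < (1 << 8)
    assert sector < (1 << 32) and length < (1 << 16)
    return (opcode << 56) | (tag << 48) | (sector << 16) | length

def unpack_command(word):
    return ((word >> 56) & 0xFF, (word >> 48) & 0xFF,
            (word >> 16) & 0xFFFFFFFF, word & 0xFFFF)
```

Because the whole command is a single aligned 64-bit store, each thread issues its own request with one uncontended PIO write rather than holding a lock across several writes.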
Atomic Operations
- Lock-free data structures
- Atomic HW/SW interface
- Increased concurrency
- 10% less latency vs. NoSched
[Chart: latency breakdown (us) for the Atomic configuration.]
Add Spin-Waits
- Spin-waits vs. sleeping: spin for requests < 4KB, sleep for larger requests
- 5 us of software latency: 54% less software latency vs. Atomic, 62% less vs. Base
[Chart: latency breakdown (us) with spin-waits.]
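The completion-wait policy can be sketched as follows; the 4KB cut-off is from the slide, while the callback-based shape is an assumption made for testability:

```python
SPIN_THRESHOLD = 4096  # bytes; slide: spin for requests under 4KB

def wait_for_completion(request_bytes, is_done, yield_cpu):
    """Spin for small requests, sleep (yield) for large ones.

    Small requests complete in a few microseconds, so a context switch
    plus an interrupt-driven wakeup costs more than busy-waiting.
    `is_done` polls the hardware status; `yield_cpu` stands in for
    sleeping until the completion interrupt fires.
    """
    while not is_done():
        if request_bytes >= SPIN_THRESHOLD:
            yield_cpu()   # large request: give up the CPU
        # small request: keep spinning (a real driver would cpu_relax())
```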
Moneta Bandwidth
[Chart: random write bandwidth (GB/s, 0 to 2.0) vs. access size (0.5 KB to 512 KB) for Base, NoSched, Atomic, SpinWait, and Ideal, with an annotation marking 1M IOPS.]
CPU Utilization
[Chart: CPUs consumed per GB/s of bandwidth for random 4KB read/write accesses, comparing Baseline, NoSched, Atomic, and Spin.]
Zero-Copy
- Trade-off between multiple PIO writes and zero-copy IO
- It is currently faster for us to copy in the driver than to write multiple words to the hardware to issue a request
- 128-bit or cache-line-sized PIO writes would be needed to make zero-copy pay off
Overview
- Motivation
- Moneta Architecture
- Optimizing Moneta
  - Interface & Software
  - Microarchitecture
- Performance Analysis
- Conclusion
Balancing Bandwidth Usage
- Full-duplex PCIe should see better read/write bandwidth
- Smarter hardware request scheduling means more bandwidth
- Two request queues: one for reads, one for writes
- Alternate between the queues on each buffer allocation
[Chart: random read/write bandwidth (GB/s, 0 to 4.0) vs. access size (0.5 KB to 128 KB) for Single Queue and Ideal.]
Moneta Architecture Changes
[Diagram: the single request queue is split into separate read and write request queues; the rest of the design (scoreboard, ring control, transfer buffers, DMA control, tag status registers, four 16 GB PCM banks on a 4 GB/s ring) is unchanged.]
Balancing Bandwidth Usage (cont.)
[Chart: random read/write bandwidth (GB/s, 0 to 4.0) vs. access size (0.5 KB to 128 KB), now including the RW Queues configuration alongside Single Queue and Ideal.]
Round-Robin Scheduling
- Prevents requests from starving other requests: allocate one buffer, then put the request at the back of the queue
- Attains much better small-request throughput in the presence of large requests: 12x improvement in small-request bandwidth
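A small model of the allocator described above (the buffer granularity and request mix are made-up; deques stand in for the hardware queues):

```python
from collections import deque

BUFFER_BYTES = 4096  # assumed transfer-buffer granularity

def schedule(read_q, write_q):
    """One run of the round-robin buffer allocator.

    Alternates between the read and write queues on each buffer
    allocation; a request that still needs more buffers is re-queued
    at the back, so a large transfer cannot starve small requests.
    Requests are (name, bytes_remaining) pairs; returns the order in
    which buffers were granted.
    """
    order = []
    queues = deque([read_q, write_q])
    while any(queues):
        q = queues[0]
        queues.rotate(-1)            # alternate between read and write queues
        if not q:
            continue
        name, remaining = q.popleft()
        order.append(name)           # grant one buffer's worth of transfer
        if remaining > BUFFER_BYTES:
            q.append((name, remaining - BUFFER_BYTES))  # back of the queue
    return order
```

In the test below, a 16 KB read needs four buffers, but the 4 KB requests each get a buffer after the large read's first allocation instead of waiting behind all four.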
Effects of Memory Latency
- Moneta tolerates memory latencies up to 1 us without bandwidth loss
- Increased parallelism hides the extra memory latency
[Chart: bandwidth (GB/s, 0 to 3.0) vs. memory latency (4 ns to 128 us) for 1, 2, 4, and 8 memory controllers.]
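Why parallelism hides latency follows from Little's law: to sustain a given bandwidth at memory latency L, bandwidth x L bytes must be in flight, which is exactly what extra controllers provide. A sketch with illustrative numbers (the ~2.8 GB/s figure approximates the 8-controller plateau on the chart, not a quoted measurement):

```python
# Little's law: bytes in flight = bandwidth x latency. More memory
# controllers allow more independent requests in flight, which is why
# the 8-controller design holds its bandwidth out to ~1 us of latency.
def bytes_in_flight(bandwidth_bytes_per_s, latency_s):
    return bandwidth_bytes_per_s * latency_s

GBPS = 2.8e9  # illustrative ring bandwidth, not a quoted measurement
for latency in (48e-9, 1e-6, 16e-6):
    kib = bytes_in_flight(GBPS, latency) / 1024
    print(f"{latency * 1e9:9.0f} ns latency -> {kib:9.1f} KiB in flight")
```

At the projected 48 ns PCM read latency a single outstanding 4KB request more than covers the link; at 1 us roughly 2.7 KiB must be outstanding, i.e. several accesses proceeding in parallel.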
NVMs for Storage vs. DRAM Replacement
- Write coalescing: storage must guarantee durability by closing the row, while DRAM leaves the row open to enable coalescing
- Row buffer size should match access size: large accesses for storage, cache-line-sized for memory
- Peak memory activity is limited by PCIe bandwidth
- Storage and DRAM replacement are different use cases and should be optimized differently
Overview
- Motivation
- Moneta Architecture
- Optimizing Moneta
- Performance Analysis
- Conclusion
System Overview

Memory and Device       Interconnect                 Capacity
Fusion-IO ioDrive       PCIe x4                      80 GB SLC NAND flash
SSD-RAID (SW RAID-0)    PCIe x4, SATA 2 controller   128 GB
DISK-RAID (HW RAID-0)   PCIe x4, RAID controller     4 TB
Moneta                  PCIe x8                      64 GB
Moneta-4x               PCIe x4                      64 GB
Workloads

Name                      Footprint   Description
Basic IO benchmarks:
XDD NoFS                  64 GB       Low-level IO performance without a file system
XDD XFS                   64 GB       XFS file system performance
Database applications:
Berkeley-DB Btree         16 GB       Transactional updates to a btree key/value store
Berkeley-DB HashTable     16 GB       Transactional updates to a hash table key/value store
BiologicalNetworks        35 GB       Biological database queried for properties of genes and biological networks
PTF                       50 GB       Palomar Transient Factory sky survey queries
Memory-hungry applications:
DGEMM                     21 GB       Matrix multiply with 30,000 x 30,000 matrices
NAS Parallel Benchmarks   8-35 GB     7 apps from the NPB suite modeling scientific workloads
XDD Bandwidth Comparison
[Chart: 50/50 read/write random-access bandwidth (GB/s, 0 to 3.5) vs. access size (0.5 KB to 512 KB) for Disk, SSD, FusionIO, Moneta, and Moneta-4x.]
Bandwidth w/o File System
[Chart: read and write bandwidth (GB/s, 0 to 2.0) for large (4 MB) and small (4 KB) accesses on Moneta, FusionIO, SSD-RAID, and DISK-RAID.]
Bandwidth with XFS
[Chart: read and write bandwidth (GB/s, 0 to 2.0) through XFS for large (4 MB) and small (4 KB) accesses on Moneta, FusionIO, SSD-RAID, and DISK-RAID, alongside the raw IO results.]
Database & App Performance
- An opportunity for leveraging hardware optimizations
[Chart: log-scale speedup vs. DISK-RAID - left panel: XDD 4KB R/W, HashTable, Btree, PTF, and Bio (database workloads); right panel: XDD 4KB R/W, DGEMM, and the NPB kernels lu, bt, sp, and is (memory-hungry workloads).]
Conclusion
- Fast, advanced NVMs are coming soon
- We built Moneta to understand the impact of NVMs
- Both the interface and the microarchitecture are critical to achieving excellent performance
- There are many opportunities to move software optimizations into hardware
- Many open questions remain about the architecture of fast SSDs and the systems they interface with
Thank You! Any Questions?