Architecture Exploration of High-Performance PCs with a Solid-State Disk

Similar documents
HARD Disk Drives (HDDs) have been widely used as mass

LETTER Solid-State Disk with Double Data Rate DRAM Interface for High-Performance PCs

SFS: Random Write Considered Harmful in Solid State Drives

The Many Flavors of NAND and More to Come

WHITE PAPER. Know Your SSDs. Why SSDs? How do you make SSDs cost-effective? How do you get the most out of SSDs?

Computer Architecture 计算机体系结构. Lecture 6. Data Storage and I/O 第六讲 数据存储和输入输出. Chao Li, PhD. 李超博士

Mainstream Computer System Components CPU Core 2 GHz GHz 4-way Superscaler (RISC or RISC-core (x86): Dynamic scheduling, Hardware speculation

The Drive Interface Progress Cycle

NAND Flash Memory. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

16:30 18:00, June 20 (Monday), 2011 # (even student IDs) # (odd student IDs) Scope

CS356: Discussion #9 Memory Hierarchy and Caches. Marco Paolieri Illustrations from CS:APP3e textbook

Mainstream Computer System Components

CS 261 Fall Mike Lam, Professor. Memory

A Buffer Replacement Algorithm Exploiting Multi-Chip Parallelism in Solid State Disks

Pseudo SLC. Comparison of SLC, MLC and p-slc structures. pslc

Solid State Drives (SSDs) Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Near-Data Processing for Differentiable Machine Learning Models

CS311 Lecture 21: SRAM/DRAM/FLASH

Key Points. Rotational delay vs seek delay Disks are slow. Techniques for making disks faster. Flash and SSDs

PC-based data acquisition II

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Preface. Fig. 1 Solid-State-Drive block diagram

Computer System Components

I/O CANNOT BE IGNORED

NAND Flash-based Storage. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Using Transparent Compression to Improve SSD-based I/O Caches

Storage Systems : Disks and SSDs. Manu Awasthi July 6 th 2018 Computer Architecture Summer School 2018

HLNAND: A New Standard for High Performance Flash Memory

Comparing Performance of Solid State Devices and Mechanical Disks

Large and Fast: Exploiting Memory Hierarchy

Advanced Memory Organizations

CS429: Computer Organization and Architecture

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 5. Large and Fast: Exploiting Memory Hierarchy

NAND Flash-based Storage. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

A Caching-Oriented FTL Design for Multi-Chipped Solid-State Disks. Yuan-Hao Chang, Wei-Lun Lu, Po-Chun Huang, Lue-Jane Lee, and Tei-Wei Kuo

Moneta: A High-performance Storage Array Architecture for Nextgeneration, Micro 2010

NAND Flash-based Storage. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Memory Technology. Chapter 5. Principle of Locality. Chapter 5 Large and Fast: Exploiting Memory Hierarchy 1

Reducing Solid-State Storage Device Write Stress Through Opportunistic In-Place Delta Compression

Memory latency: Affects cache miss penalty. Measured by:

NVMe: The Protocol for Future SSDs

Phase-change RAM (PRAM)- based Main Memory

Memory latency: Affects cache miss penalty. Measured by:

Virtual Memory. Reading. Sections 5.4, 5.5, 5.6, 5.8, 5.10 (2) Lecture notes from MKP and S. Yalamanchili

Intel s s Memory Strategy for the Wireless Phone

SSDs vs HDDs for DBMS by Glen Berseth York University, Toronto

A Memory Management Scheme for Hybrid Memory Architecture in Mission Critical Computers

NAND Flash-based Storage. Computer Systems Laboratory Sungkyunkwan University

FLASH MEMORY SUMMIT Adoption of Caching & Hybrid Solutions

Foundations of Computer Systems

Main Memory. EECC551 - Shaaban. Memory latency: Affects cache miss penalty. Measured by:

Storage Architecture and Software Support for SLC/MLC Combined Flash Memory

Flash Memory Based Storage System

CS 261 Fall Mike Lam, Professor. Memory

SSD Applications in the Enterprise Area

ELEC 5200/6200 Computer Architecture and Design Spring 2017 Lecture 7: Memory Organization Part II

Storage Systems : Disks and SSDs. Manu Awasthi CASS 2018

Memory Industry Report T A B L E O F C O N T E N T S

Memory Systems for Embedded Applications. Chapter 4 (Sections )

SSDs Driving Greater Efficiency in Data Centers

Memory Industry Dynamics & Market Drivers Memory Market Segments Memory Products Dynamics Memory Status 2007 Key Challenges for 2008 Future Trends

Annual Update on Flash Memory for Non-Technologists

Storage Technologies and the Memory Hierarchy

CS429: Computer Organization and Architecture

Business of NAND: Trends, Forecasts & Challenges

Datasheet rev.a01. CORE Series SATA II SOLID STATE DRIVE OCZSSD2-1C32G OCZSSD2-1C64G OCZSSD2-1C128G

DI-SSD: Desymmetrized Interconnection Architecture and Dynamic Timing Calibration for Solid-State Drives

Solving the I/O bottleneck with Flash

Flash Trends: Challenges and Future

+ Random-Access Memory (RAM)

OSSD: A Case for Object-based Solid State Drives

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy

Memory Hierarchy Y. K. Malaiya

EE414 Embedded Systems Ch 5. Memory Part 2/2

Computer Organization and Structure. Bing-Yu Chen National Taiwan University

HP Z Turbo Drive G2 PCIe SSD

CSE502: Computer Architecture CSE 502: Computer Architecture

CSCI-UA.0201 Computer Systems Organization Memory Hierarchy

Exploring System Challenges of Ultra-Low Latency Solid State Drives

High Performance SSD & Benefit for Server Application

Design of Flash-Based DBMS: An In-Page Logging Approach

Toward SLO Complying SSDs Through OPS Isolation

Fundamentals of Computer Systems

Presented by: Nafiseh Mahmoudi Spring 2017

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Your Laptop s Midlife Crisis - How to Help It Cope

3D Xpoint Status and Forecast 2017

Map3D V58 - Multi-Processor Version

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies

Quiz for Chapter 6 Storage and Other I/O Topics 3.10

PCIe Storage Beyond SSDs

Five Key Steps to High-Speed NAND Flash Performance and Reliability

Markets for 3D-Xpoint Applications, Performance and Revenue

Storage Systems. Storage Systems

3U CompactPCI Intel SBCs F14, F15, F17, F18, F19P

islc Claiming the Middle Ground of the High-end Industrial SSD Market

Mainstream Computer System Components

Non-Volatile Memory Cache Enhancements: Turbo-Charging Client Platform Performance

VIA ProSavageDDR KM266 Chipset

Emerging NVM Memory Technologies

Transcription:

Architecture Exploration of High-Performance PCs with a Solid-State Disk D. Kim, K. Bang, E.-Y. Chung School of EE, Yonsei University S. Yoon School of EE, Korea University April 21, 2010 1/53

Outline Introduction Related Works Motivation The Proposed Techniques PC Architecture Exploration Experimental Results Conclusion and Future Work Summary April 21, 2010 2/53

SSD The Inevitable Tide HDD Mass storage device for the last several decades SSD An electronic storage device Using non-volatile memory elements High-performance Small form factor Light weight Low power consumption Shock resistance Advantageous for harsh and rugged environment April 21, 2010 3/53

From Extravagance to Necessity The only downside of SSD The higher bit cost than HDD Samsung s 256GB SSD is as much as 860,000 won Seagate s 250GB HDD is just 44,000 won The increasing density of NAND flash memory It becomes double every 12 months The price gap keeps narrowing and narrowing Eventually, it will become negligible in 2012 Forecast by IDC (International Data Corporation) April 21, 2010 4/53

The Increasing Density of NAND Multi Level Cell SLC DLC TLC QLC 3D Stacking Source: Toshiba 2008 April 21, 2010 5/53

Average Selling Price Comparison Source : IDC 2008 April 21, 2010 6/53

CPU / Memory Performance Gap Source: SUN Microsystems 2007 Source: MSDN 2009 Multi / many-core processors enlarge the gap Intel dual core / quad core Nvidia CUDA ARM Cortex High-performance SSD is Strongly Required! April 21, 2010 7/53

Outline Introduction Related Works Motivation The Proposed Techniques PC Architecture Exploration Experimental Results Conclusion and Future Work Summary April 21, 2010 8/53

SSD Internals Memory Hierarchy Smart buffer cache [Lee et al. 05] Enhanced exploitation of spatial / temporal locality High performance and low power consumption Energy-aware demand paging [Park et al. 04] Minimizes the number of write or erase operations April 21, 2010 10/53

SSD Internals Hybrid Systems SLC / MLC hybrid SSD [Chang et al. 08] Trade-off performance and cost SLC as a cache block FRAM / NAND hybrid SSD [Yoon et al. 07] Meta-data is maintained in a small FRAM Exploiting non-volatility of FRAM PRAM / NAND hybrid SSD [Kim et al. 08] PRAM is used for meta-data Firstly under mass production among universal RAMs April 21, 2010 11/53

Interaction between Host and SSD Robson Architecture by Intel A non-volatile memory layer as a cache for disk Reduces data transfer time and power consumption PCIe SSD by FusionIO Resolves bottleneck due to traditional slow I/F Much higher bandwidth (520MB/s) NAND-based storage nodes [Lee et al. 08] Several thousands of nodes to build clusters Plugged into ethernet-style backplane April 21, 2010 12/53

Outline Introduction Related Works Motivation The Proposed Techniques PC Architecture Exploration Experimental Results Conclusion and Future Work Summary April 21, 2010 13/53

Traditional Architecture of SSD April 21, 2010 14/53

SSD Benchmark Chart Drive Read Write Intel X-25E 250MB/s 170MB/s Kingston E 250MB/s 170MB/s Intel X-25M 250MB/s 70MB/s Kingston M 250MB/s 70MB/s OCZ Apex 230MB/s 160MB/s G.Skill Titan 230MB/s 160MB/s OCZ Vertex 200MB/s 160MB/s Patriot Warp V2 175MB/s 100MB/s OCZ Core V2 170MB/s 100MB/s G.Skill FM 155MB/s 90MB/s OCZ Solid 155MB/s 90MB/s RiData CO4MPN 152MB/s 96MB/s SuperTalent Masterdrive OX 150MB/s 100MB/s Transcend TS 145MB/s 92MB/s RiData CO3M 118MB/s 74MB/s OCZ SSD 100MB/s 80MB/s G.Skill FS 100MB/s 80MB/s Samsung SSD 100MB/s 80MB/s Source: www.ssdbenchmark.com 2009 April 21, 2010 15/53

Asynchronous DDR NAND Flash The most crucial bottleneck of SSD is the performance of NAND flash device New DDR type NAND flash devices offer tremendous performance improvements Toggle-mode NAND from Samsung & Denali ONFi from Hynix, Intel, Micron, etc. April 21, 2010 16/53

The 3rd Generation High-speed I/F SSDs are close to saturating the SATA 2.0 3 Gbit/s (300 MB/s) limit SATA 3 / USB 3 / PCIe 3 SATA 3.0 will offer 6 Gbit/s (600MB/s) USB 3.0 SuperSpeed will provides 4.8 Gbit/s (572MB/s) PCIe 3.0 will add a Gen3-signalling mode, at 1 GB/s DDR 3 DDR3-1600 shows 12.5GB/s by 64-bit data width April 21, 2010 17/53

Conventional PC Architecture April 21, 2010 18/53

DMA in Conventional PC DMA write (consecutive) DMA read (pipelined) April 21, 2010 19/53

Limitations of Conventional PC / SSD Simply replaces a HDD The same interface protocol for compatibility Maximum bandwidth of ATA is only 133 ~ 300 MB/s May become a bottleneck in the near future Data go through both North & South bridges A single data request must be arbitrated twice Both bridges are not designated for SSD SSD is connected together with slow peripherals Page fault transfer must be serialized DMA read for a new page must wait until completion of DMA write for a victim page April 21, 2010 20/53

Architecture Exploration Aspects Host interface scheme Location of SSD in PC From the conventional south bridge to north bridge Interface Protocol Using huge bandwidth of DDR offered by north bridge DDR 2 is widely used in PC at the time Data transfer concurrency Minimize the conflicts between CPU-to-Mem and Mem-to-SSD A dual port DRAM and a dual port SSD April 21, 2010 21/53

Outline Introduction Related Works Motivation The Proposed Techniques PC Architecture Exploration Experimental Results Conclusion and Future Work Summary April 21, 2010 22/53

SSD with DDR DRAM Interface The fastest DDR DRAM interface 64-bit bus width is widely used Using both rising and falling edge for data transfer High frequency for communication with processors operating with several GHz Fixed access latency Latencies such as Column Address Strobe are fixed SSD cannot guarantee internal maximum latency Cache buffer hit vs. cache buffer miss Needs a signal for data readiness DQS pin is used to support arbitrary latencies April 21, 2010 23/53

Timing Diagram of DDR I/F SSD Cache buffer hit Cache buffer miss April 21, 2010 24/53

Dual Port SSD Internals April 21, 2010 25/53

DMA Command Pack for Dual DMA April 21, 2010 26/53

Outline Introduction Related Works Motivation The Proposed Techniques PC Architecture Exploration Experimental Results Conclusion and Future Work Summary April 21, 2010 27/53

4 Architectural Choices Direct Path No Yes South Bridge SBSP Architecture Conventional Architecture SBDP Architecture 1. Dual-port DRAM supported DRAM controller in NB 2. DMA command packing supported OS Location of the SSD North Bridge NBSP Architecture 1. DMA in DRAM controller in NB 2. DQS scheme supported DRAM controller in NB NBDP Architecture 1. DMA in DRAM controller in NB 2. DQS scheme supported DRAM controller in NB 3. Dual-port DRAM supported DRAM controller in NB 4. DMA command packing supported OS April 21, 2010 28/53

Conventional SBSP Architecture April 21, 2010 29/53

NBSP Architecture April 21, 2010 30/53

SBDP Architecture April 21, 2010 31/53

NBDP Architecture April 21, 2010 32/53

DMA in Conventional SSD (Revisit) DMA write (consecutive) DMA read (pipelined) April 21, 2010 33/53

DMA in Dual Port SSD DMA write (consecutive) DMA read (pipelined) April 21, 2010 34/53

Outline Introduction Related Works Motivation The Proposed Techniques PC Architecture Exploration Experimental Results Conclusion and Future Work Summary April 21, 2010 35/53

Experimental Setup Modeling transaction-level PC architecture Using high-level SystemC language Cycle accurate modeling, especially SATA, DDR, and PCIe protocols The main specification Based on Intel s 965 chipset for North Bridge and ICH8 for South Bridge Bridge internals and externals are linked by PCIe DDR2-800 (6.4GB/s) is selected for NB All other peripherals are not considered except CPU, NB, SB, DRAM, and SSD E.g. Graphics, Sound, Mouse, Keyboard, etc. April 21, 2010 36/53

DMA Read with Cache Miss Total Time Excluding SSD internal April 21, 2010 37/53

Contribution Breakdown in $ Miss SB elimination DRAM I/F Direct path NBSP 1.89% 98.11% NBDP 2.52% 11.85% 85.85% Improvement by SB elimination is marginal The reduction of transfer time is hidden by long SSD internal latency DRAM interface is dominant for NBSP Direct path is dominant for NBDP April 21, 2010 38/53

DMA Read with Cache Hit Total Time Excluding SSD internal April 21, 2010 39/53

Contribution Breakdown in $ Hit SB elimination DRAM I/F Direct path NBSP 27.48% 72.52% NBDP 30.04% 34.79% 35.16% Contribution of SB elimination is not marginal No hidden effect by SSD internal latency DRAM interface is still stronger for NBSP Overall even contribution for NBDP April 21, 2010 40/53

DMA Write Total Time Excluding SSD internal April 21, 2010 41/53

Contribution Breakdown in Write SB elimination DRAM I/F Direct path NBSP 37.92% 62.08% NBDP 23.00% 7.03% 69.97% DRAM interface is stronger for NBSP Similar to DMA read cache hit Direct path is dominant for NBDP Similar to DMA read cache miss April 21, 2010 42/53

Page Fault Scenario Total Time Excluding SSD internal April 21, 2010 43/53

Contribution Breakdown in P.F. SB elimination DRAM I/F Direct path NBSP 42.23% 57.77% NBDP 21.31% 10.60% 68.09% Similar to DMA write for both NBSP and NBDP Victim page write is dominant and new page read is marginal for page fault April 21, 2010 44/53

Network Download Scenario Total Time Excluding SSD internal April 21, 2010 45/53

Contribution Breakdown in N.D. SB elimination DRAM I/F Direct path NBSP 38.52% 61.48% NBDP 100% DRAM interface is stronger for NBSP Similar to DMA read cache hit or DMA write The communication between main memory and SSD occurs solely via the direct path April 21, 2010 46/53

Real Traces About 20% Hit Ratio April 21, 2010 47/53

Max. Performance Improve Ratio Sub-Op. SBSP NBSP SBDP NBDP DMA Read miss 1 1.31 1.60 2.29 DMA Read hit 1 8.70 2.13 28.56 DMA Write 1 1.59 1.83 2.75 Page Fault 1 1.37 1.74 2.37 Network Download 1 1.31 2.08 2.19 Real Trace 1 1 1.50 1.74 2.60 Real Trace 2 1 1.54 1.79 2.67 Real Trace 3 1 1.38 1.65 2.40 Real Trace 4 1 1.50 1.75 2.60 Cache hit ratio is the most important Ideally, all the data are read from high-speed DRAM Cache buffer should be carefully designed! April 21, 2010 48/53

Outline Introduction Related Works Motivation The Proposed Techniques PC Architecture Exploration Experimental Results Conclusion and Future Work Summary April 21, 2010 49/53

Conclusion and Future Work How to make an extreme-performance SSD SSD should be located quite close to main memory The number of path(or bandwidth) is as important as high-speed interface Make 100% cache hit by all means We DREAM a solution to satisfy the conditions A single general memory device having DRAM and NAND at the same memory hierarchy Replace traditional main memory and disks Not only high-performance SSD internal arch. But also paradigm shift on PC architecture and OS April 21, 2010 50/53

Outline Introduction Related Works Motivation The Proposed Techniques PC Architecture Exploration Experimental Results Conclusion and Future Work Summary April 21, 2010 51/53

Summary SSD is the solution to overcome CPU/IO gap Technical directions to next generation SSD SSD internal architecture SSD interface scheme NAND flash interface scheme System-level architecture exploration Architectural improvement will be driven by IO device in this Exa-byte era April 21, 2010 52/53

Thank you! Q & A April 21, 2010 53/53