BIBIM: A Prototype Multi-Partition Aware Heterogeneous New Memory


1 HotStorage '18. BIBIM: A Prototype Multi-Partition Aware Heterogeneous New Memory. Gyuyoung Park (1), Miryeong Kwon (1), Pratyush Mahapatra (2), Michael Swift (2), and Myoungsoo Jung (1). (1) Yonsei University, Computer Architecture and MEmory Systems Lab

2 Outline. 1. Demands for Hybrid Memory; 2. Preview of BIBIM; 3. Design Considerations of BIBIM; 4. Example of BIBIM Operation; 5. Evaluation

3 Increasing Demands for Bigger Memory. More cores require more memory capacity. However, there is a memory capacity gap: core count doubles roughly every 2 years, while memory (DRAM) capacity doubles roughly every 3 years (source: Lim et al., ISCA). We need a bigger memory! Moreover, there are many data-intensive applications, e.g. memory caching (Memcached, Redis) and in-memory databases (DynamoDB, HANA). [Figure: relative capacity of cores versus memory (DRAM), by year.]

4 Comparison of Technology Scaling. DRAM and NAND are approaching a scaling limitation (marked at managed DRAM and 3D NAND on the chart), so new memory (PRAM, STT-MRAM, ReRAM) is suitable as the next memory device. [Figure: cost reduction rate of DRAM, NAND, and new memory versus time and technology node. Source: SK Hynix.]

5 Comparison of Programming Latency. PRAM can achieve high capacity but, unfortunately, takes a long time to program. [Figure: write time (s) versus memory capacity (bit) for SRAM, DRAM, FeRAM, MRAM, PRAM, NOR flash memory, and NAND flash memory, grouped into volatile RAM, NV-RAM without erase-before-write, and NV-RAM with erase-before-write. Courtesy: Motoyuki Ooishi.]

6 Why Does PRAM Show Long Write Latency? How Significant? PRAM exploits the unique behavior of chalcogenide glass, Ge2Sb2Te5 (GST). It stores data by switching the state of the GST. [Figure: PRAM cell consisting of a top electrode, GST, a heater, and a bottom electrode.]

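7 Programming Method of PRAM. The GST state is changed by heating the material. Amorphous (data 0) is reached by heating above T_melt (~600 °C); crystalline (data 1) is reached by heating above T_crys (~300 °C). Moreover, compared with DRAM, PRAM has asymmetric latency. [Figure: GST temperature versus time, rising from T_room towards T_crys and T_melt, with pulse durations T_reset, T_read, and T_set and the annotation "Long write latency".]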


8 General Assumptions of PRAM Latency. Many previous works (HPCA '18, ISCA '09, HPCA '13) assume that PRAM's write latency is similar to or slightly worse than DRAM's (1.5x). [Figure: WRITE latency (ns), DRAM versus assumed PRAM.]

9 PRAM Latency Measurement. Our performance measurement on a real 3x nm PRAM shows a far more expensive write latency than DRAM's (190x). [Figure: WRITE latency (ns), DRAM versus real 3x nm PRAM.]

10 Then, How Can PRAM's Long Write Latency Be Mitigated? Let's borrow the concept of "Bibim". Bibim means "mix" in Korean: mix various ingredients for a better taste.

11 Hybrid Memory Can Help Us. Put fast DRAM and slow PRAM together. NOTE) DRAM is used only as a write-only inclusive cache of the slow PRAM.

12 BTW, How Can Hybrid Memory Be Used in a Real System? We design a new memory controller, BIBIM, for non-volatile hybrid memory.


14 Demo (Track and Field). Performance comparison of DRAM, PRAM, and BIBIM (memory accesses with a synthetic, read-write inter-mixed trace).

15 Demo (Slow Version). DRAM, PRAM, and BIBIM. The host sends 1M memory requests: (1) each memory services the incoming memory requests; (2) PRAM is still servicing requests.


17 What Are the Considerations in Designing a Hybrid Memory (DRAM+PRAM) Controller? To gain insight into controller design, let's understand the details of DRAM and PRAM. [Figure: memory controller connected through a PHY to DRAM and PRAM.]

18 DRAM's Multi-bank Architecture. Multiple banks serve multiple memory requests in parallel. Each bank has a single row buffer. [Figure: banks 0 to N, each with a row decoder, a row buffer, and a column decoder.]

19 Does PRAM Have the Same Internal Architecture as DRAM? Challenge 1: PRAM's write latency is long, so PRAM employs multiple row buffers. Challenge 2: PRAM's asymmetric latency incurs many bank conflicts, so PRAM uses a multi-partition architecture.

20 PRAM's Multi-partition Architecture. Multiple partitions within the bank provide partition-level parallelism. Multiple row buffers mitigate the long write latency. [Figure: a PRAM bank with a decoder, RAB, RDB, sense amplifiers, and a cell array divided into partitions 0 to 15.]

21 Can a Conventional DRAM Controller Be Aware of Multi-Partition (Inside of a Bank)? (Revisited) A conventional DRAM scheduler utilizes only bank-level parallelism; it cannot see inside a bank! Partition-level parallelism should be supported.

22 How Can Memory Requests Be Scheduled by Exploiting the Multi-Partition Architecture? Limitation of PRAM: in the PRAM design, a WRITE request blocks the whole PRAM bank. [Figure: a WRITE request enters the decoder of a bank with partitions 0 to 15; the sense amplifier is marked "Blocked!".]

23 How Can Memory Requests Be Scheduled by Exploiting the Multi-Partition Architecture? Key insight: while a WRITE is in progress, another WRITE cannot be serviced, but READs can be serviced if their partition number is different.

24 Non-Blocking Read Service (NBRS). Solution: add a register to the BIBIM controller that stores the partition number of the WRITE, and compare it with the partition number of each incoming READ request. A READ to the same partition cannot be serviced! [Figure: BIBIM controller holding the WRITE partition number next to a bank with a decoder, a sense amplifier, and partitions 0 to 15.]
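The register-and-compare check on this slide can be sketched in a few lines of Python. This is only an illustrative model, not the BIBIM hardware: the names `Request`, `NbrsBank`, `start_write`, `finish_write`, and `can_service` are hypothetical, and the rule that a second WRITE waits for the first is an assumption drawn from the statement that a WRITE blocks the bank.

```python
from dataclasses import dataclass
from typing import Optional

NUM_PARTITIONS = 16  # partitions 0..15, as drawn on the slide


@dataclass
class Request:
    op: str         # "READ" or "WRITE"
    partition: int  # target partition inside the bank


class NbrsBank:
    """One PRAM bank guarded by a single write-partition register."""

    def __init__(self) -> None:
        # Partition number of the in-flight WRITE; None while the bank is idle.
        self.write_partition: Optional[int] = None

    def start_write(self, partition: int) -> None:
        if not 0 <= partition < NUM_PARTITIONS:
            raise ValueError("partition out of range")
        self.write_partition = partition

    def finish_write(self) -> None:
        self.write_partition = None

    def can_service(self, req: Request) -> bool:
        if self.write_partition is None:
            return True   # no WRITE in flight: nothing to compare against
        if req.op == "WRITE":
            return False  # assumption: a second WRITE waits for the first one
        # READ: serviceable only when it targets a different partition.
        return req.partition != self.write_partition


bank = NbrsBank()
bank.start_write(0)
blocked = bank.can_service(Request("READ", 0))    # same partition as the WRITE
allowed = bank.can_service(Request("READ", 15))   # different partition
```

In this model a READ to partition 15 goes ahead while the WRITE to partition 0 is still programming, whereas a READ to partition 0 has to wait until `finish_write()` clears the register.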

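25 BIBIM Design 1: Scheduling Support Module for a PRAM-aware New Scheduling Scheme. [Figure: BIBIM block diagram with the Request Scheduling Module filled in; the remaining blocks between it, the PHY, DRAM, and PRAM are still shown as "?".]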

26 Now Requests Are Scheduled. Then, How Can They Be Served to the Hybrid Memory? First, as is generally known, LPDDR2 is a JEDEC-standard low-power memory interface (used for DRAM).

27 DRAM's Timing (LPDDR2 by JEDEC). 1) Activation: activate the target row and write its data to the row buffer. 2) Read/Write: access the row buffer with the column address. 3) Precharge: charge the bit-lines to half voltage. [Figure: banks 0 to N with a row decoder and a column decoder.]

28 Does PRAM Have the Same Memory Interface (LPDDR2) as DRAM? (Revisited) PRAM has a different architecture from DRAM, with multiple row buffers and a much larger capacity. A different interface is required.

29 PRAM's Timing. PRAM requires a different timing model from DRAM because the NVM memory space is much larger than DRAM's. 3-phase addressing (LPDDR2-NVM by JEDEC): ❶ Pre-active: the upper row address is stored in a Row Address Buffer (RAB). ❷ Activate: the lower row address is combined with the upper row address, and the selected row of the PRAM array is loaded into a Row Data Buffer (RDB). ❸ READ: the column address selects the data out of the RDB. Both the RAB and the RDB are selected by the memory controller.
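The three phases can be modelled as a small Python sketch. It is a simplified illustration rather than the LPDDR2-NVM command protocol: the class `PramDevice`, the helper `read_word`, and the constants `LOWER_BITS`, `NUM_BUFFERS`, and `ROW_WORDS` are hypothetical, and the chosen widths are arbitrary values picked for the example.

```python
LOWER_BITS = 4    # hypothetical width of the lower row address
NUM_BUFFERS = 4   # hypothetical number of RAB/RDB pairs
ROW_WORDS = 8     # hypothetical number of words per row


class PramDevice:
    def __init__(self, array):
        self.array = array                 # full row address -> list of words
        self.rab = [None] * NUM_BUFFERS    # Row Address Buffers
        self.rdb = [None] * NUM_BUFFERS    # Row Data Buffers

    def pre_active(self, buf, upper_row):
        # Phase 1: latch the upper row address in the selected RAB.
        self.rab[buf] = upper_row

    def activate(self, buf, lower_row):
        # Phase 2: combine upper and lower parts, load the row into the RDB.
        if self.rab[buf] is None:
            raise RuntimeError("activate issued before pre-active")
        row = (self.rab[buf] << LOWER_BITS) | lower_row
        self.rdb[buf] = list(self.array.get(row, [0] * ROW_WORDS))

    def read(self, buf, column):
        # Phase 3: the column address selects one word out of the RDB.
        if self.rdb[buf] is None:
            raise RuntimeError("read issued before activate")
        return self.rdb[buf][column]


def read_word(dev, buf, row, column):
    """Controller side: split the row address and issue the three phases."""
    upper = row >> LOWER_BITS
    lower = row & ((1 << LOWER_BITS) - 1)
    dev.pre_active(buf, upper)
    dev.activate(buf, lower)
    return dev.read(buf, column)


device = PramDevice({0x123: list(range(100, 108))})
word = read_word(device, buf=0, row=0x123, column=3)
```

Because the controller picks the buffer index itself, it can keep several rows open in different RAB/RDB pairs at the same time, which a single-row-buffer DRAM bank cannot do.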

30 BIBIM Design 2: Heterogeneity Support Module for Both LPDDR2 and LPDDR2-NVM. Our own new physical layer (PHY, 400 MHz) is implemented. [Figure: BIBIM block diagram with the Request Scheduling Module and the Heterogeneity Support Module filled in; the remaining blocks are still shown as "?".]

31 Don't Forget DRAM Is for Cache. Then, How Can Caching Be Supported? Solution: keep track of which data exist in DRAM (caching info) in a lookup table. Moreover, like a conventional cache, the controller should have algorithms for DRAM dataline update, eviction, and finding an empty dataline.

32 Hardware Support for Lookup Table. The lookup table does not include data values; it includes only address information. The request address is split into tag, index, and offset, and each entry holds count, valid, and tag fields. With multiple comparators, the 4 ways can be compared in parallel; a mux then produces the DRAM hit/miss signal and the way number. Only 512 KB of BRAM is used for the lookup table.
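A software sketch of such a 4-way lookup table is given below. It is a minimal model under stated assumptions: the field widths `OFFSET_BITS` and `INDEX_BITS` are arbitrary, the class `LookupTable` and its methods are hypothetical names, and the `count` field is assumed to serve as an LRU timestamp (the slides name LRU as the eviction policy but do not say how `count` is used).

```python
OFFSET_BITS = 6   # hypothetical offset width
INDEX_BITS = 4    # hypothetical index width
NUM_WAYS = 4      # four ways, as on the slide


class LookupTable:
    """Tag store only: it tracks addresses, never data values."""

    def __init__(self):
        sets = 1 << INDEX_BITS
        self.valid = [[False] * NUM_WAYS for _ in range(sets)]
        self.tag = [[0] * NUM_WAYS for _ in range(sets)]
        self.count = [[0] * NUM_WAYS for _ in range(sets)]  # assumed LRU stamp
        self.clock = 0

    @staticmethod
    def split(addr):
        index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
        tag = addr >> (OFFSET_BITS + INDEX_BITS)
        return tag, index

    def _touch(self, index, way):
        self.clock += 1
        self.count[index][way] = self.clock

    def lookup(self, addr):
        """Return the hitting way, or None on a miss."""
        tag, index = self.split(addr)
        for way in range(NUM_WAYS):  # hardware compares all ways in parallel
            if self.valid[index][way] and self.tag[index][way] == tag:
                self._touch(index, way)
                return way
        return None

    def fill(self, addr):
        """Install addr; return (way, victim address or None)."""
        tag, index = self.split(addr)
        for way in range(NUM_WAYS):
            if not self.valid[index][way]:   # an empty dataline exists
                victim = None
                break
        else:                                # set is full: evict the LRU way
            way = min(range(NUM_WAYS), key=lambda w: self.count[index][w])
            victim = ((self.tag[index][way] << INDEX_BITS) | index) << OFFSET_BITS
        self.valid[index][way] = True
        self.tag[index][way] = tag
        self._touch(index, way)
        return way, victim


table = LookupTable()
set_stride = 1 << (OFFSET_BITS + INDEX_BITS)   # addresses one tag apart, same set
for t in (1, 2, 3, 4):
    table.fill(t * set_stride)
table.lookup(1 * set_stride)                   # refresh tag 1, so tag 2 is now LRU
way, victim = table.fill(5 * set_stride)
```

The loop in `lookup` is sequential only because this is software; the slide's point is that four comparators evaluate the same condition at once.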

33 BIBIM Design 3: Caching Support Module to Use DRAM as an Inclusive Cache of PRAM. [Figure: BIBIM block diagram with the Request Scheduling Module, the Heterogeneity Support Module, and the DRAM Caching Support Module filled in; the remaining block is still shown as "?".]

34 BTW, How Can the Non-Volatility of PRAM Be Maintained Although DRAM Is Integrated (Hybrid)? Challenge of hybrid memory: data in DRAM will disappear when there is a power failure. [Figure: data travels from the CPU through the memory controller to DRAM and PRAM; a power failure strikes while the data is in DRAM.]

35 FLUSH Operation. Solution: provide a Flush operation, which moves DRAM data to PRAM. The memory controller generates a PRAM write request corresponding to the target DRAM row. NOTE) The PRAM write is stored in the command queue inside the memory controller, and the DRAM dataline is invalidated. [Figure: the CPU issues Flush; the memory controller generates a PRAM WRITE request that carries the data from DRAM to PRAM.]
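The Flush behaviour described above can be sketched as follows. This is an illustrative model only: `FlushController`, `write`, `flush`, and `drain` are hypothetical names, dictionaries stand in for DRAM and PRAM, and the `drain` step that empties the command queue is an assumption added so the example can show data arriving in PRAM.

```python
from collections import deque


class FlushController:
    def __init__(self):
        self.dram = {}                # address -> data cached in DRAM
        self.pram = {}                # address -> data persisted in PRAM
        self.command_queue = deque()  # pending PRAM commands in the controller

    def write(self, addr, data):
        # Host writes land in DRAM, which acts as the cache of PRAM.
        self.dram[addr] = data

    def flush(self, addr):
        """Move one DRAM dataline towards PRAM; return False if not cached."""
        if addr not in self.dram:
            return False
        data = self.dram.pop(addr)                        # invalidate the dataline
        self.command_queue.append(("WRITE", addr, data))  # queue the PRAM write
        return True

    def drain(self):
        # Assumption: queued PRAM writes are eventually issued in FIFO order.
        while self.command_queue:
            _, addr, data = self.command_queue.popleft()
            self.pram[addr] = data


ctrl = FlushController()
ctrl.write(0x40, "A")
flushed = ctrl.flush(0x40)
queued = list(ctrl.command_queue)
ctrl.drain()
```

After `flush`, the data no longer lives in DRAM; it sits in the command queue until the PRAM write is actually issued.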

36 Okay, Data Delivery Is Guaranteed. Is It Good Enough? Challenge of flush: the user believes the data has the latest value, but the memory controller can reorder memory requests. [Figure: Process 1 issues 1) Data=1 and Process 2 issues 2) Data=2 from the CPU; the memory controller reorders the two requests on the way to DRAM and PRAM.]

37 FENCE Operation. Solution: provide a Fence operation to enforce the delivery order of memory requests. The memory controller can simply add a fence flag to check whether requests are fenced or not. [Figure: the CPU issues 1) Data=1, then Fence, then 2) Data=2; the memory controller holds the fence flag.]
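The effect of a fence on a reordering scheduler can be illustrated with a short Python sketch. The modelling is an assumption, not the BIBIM design: requests carry a hypothetical `priority` that the scheduler uses to reorder them, and a fence is modelled as an epoch boundary that younger requests cannot cross. `FencedQueue` and its methods are invented names.

```python
class FencedQueue:
    """Scheduler queue that may reorder requests, but never across a fence."""

    def __init__(self):
        self.epochs = [[]]  # one list of pending requests per fence epoch

    def push(self, priority, name):
        self.epochs[-1].append((priority, name))

    def fence(self):
        # Requests pushed later must wait for everything pushed so far.
        self.epochs.append([])

    @property
    def fence_flag(self):
        # True while some fence still separates pending requests.
        return len(self.epochs) > 1

    def pop(self):
        """Issue the best request of the oldest non-empty epoch."""
        while len(self.epochs) > 1 and not self.epochs[0]:
            self.epochs.pop(0)           # oldest epoch drained: fence retires
        if not self.epochs[0]:
            return None
        best = min(self.epochs[0])       # lower priority value is issued first
        self.epochs[0].remove(best)
        return best[1]


# Without a fence the scheduler reorders the two writes.
unfenced = FencedQueue()
unfenced.push(2, "Data=1")
unfenced.push(1, "Data=2")
unfenced_order = [unfenced.pop(), unfenced.pop()]

# With a fence between them, program order is preserved.
fenced = FencedQueue()
fenced.push(2, "Data=1")
fenced.fence()
fenced.push(1, "Data=2")
fenced_order = [fenced.pop(), fenced.pop()]
```

In the unfenced case the later write (Data=2) is issued first, so the older value could end up as the final contents; the fence rules that ordering out.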

38 BIBIM Design 4: Persistent Support Module to Guarantee Data Delivery and Delivery Order. [Figure: BIBIM block diagram with the Request Scheduling Module, the Heterogeneity Support Module, the DRAM Caching Support Module, and the Persistent Support Module, connected through the PHY to DRAM and PRAM.]


40 We Designed Four Modules in BIBIM. BTW, How Do They Work Together? [Figure: Request Scheduling Module, Caching Support Module, Heterogeneity Support Module, and Persistent Support Module, connected through the PHY to DRAM and PRAM.]

41 Specific Example: Evict DRAM Dataline (1/3). [Host] Sends a WRITE request to the hybrid memory. [CACHING] Checks whether the DRAM dataline hits or not. [CACHING] (If it is a miss) Finds a victim DRAM dataline based on the LRU policy and returns the victim DRAM address to the SCHEDULING module.

42 Specific Example: Evict DRAM Dataline (2/3). [SCHEDULING] Sends the DRAM address to the HETEROGENEITY module. [HETEROGENEITY] Generates DRAM commands. [DRAM] Returns the DRAM dataline to the SCHEDULING module.

43 Specific Example: Evict DRAM Dataline (3/3). [SCHEDULING] Asks the PERSISTENT module whether the fence flag is set. [PERSISTENT] Replies that the fence flag is set. [SCHEDULING] Does not serve the PRAM WRITE (the eviction of the victim DRAM dataline).
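The eviction walk-through across these three slides can be traced with a small Python function. It is a sketch under stated assumptions: the function name `evict_on_write_miss` and its parameters are hypothetical, a dictionary stands in for DRAM, and the branch for a clear fence flag (held writes are issued in order) is an assumption, since the slides only show the case where the flag is set.

```python
def evict_on_write_miss(dram, victim_addr, fence_flag_set, held_writes):
    """Trace one eviction; return (PRAM WRITEs issued now, message trace)."""
    trace = ["CACHING: miss, victim chosen by LRU"]
    trace.append("SCHEDULING -> HETEROGENEITY: victim DRAM address")
    trace.append("HETEROGENEITY -> DRAM: DRAM commands")
    dataline = dram.pop(victim_addr)   # DRAM returns the victim dataline
    trace.append("DRAM -> SCHEDULING: DRAM dataline")
    held_writes.append((victim_addr, dataline))
    if fence_flag_set:
        # Case shown on the slide: the eviction write is not served yet.
        trace.append("PERSISTENT: fence flag set, PRAM WRITE not served")
        return [], trace
    # Assumption: with no fence pending, held writes are issued in order.
    issued = list(held_writes)
    held_writes.clear()
    trace.append("PERSISTENT: fence flag clear, PRAM WRITE served")
    return issued, trace


held = []
issued_fenced, trace_fenced = evict_on_write_miss({0x100: "old"}, 0x100, True, held)
issued_clear, trace_clear = evict_on_write_miss({0x200: "x"}, 0x200, False, held)
```

The first call leaves the eviction write parked in `held`; the second call, made with the flag clear, releases both writes in their original order.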


45 Persistent Memory Workloads. WHISPER: Wisconsin-HP Labs Suite for Persistence

46 Wisconsin-HP Labs Suite for Persistence (WHISPER)

Benchmark  Brief description
echo       Scalable, multi-version key-value store
ycsb       H-store-like DB; undo logs for consistency
tpcc       H-store-like DB; undo logs for consistency
ctree      Micro-benchmark for simulations
hashmap    Micro-benchmark for simulations
vacation   Online travel reservation system

Slide annotation: write-intensive (82% of total requests).

47 Latency Comparisons (DRAM-only vs. PRAM-only vs. hybrid memory with BIBIM)

48 Average Request Latency. Lower is better! Performance order: DRAM > BIBIM > PRAM. [Figure: average request latency (us) of DRAM, PRAM, and BIBIM (15%) for echo, tpcc, ycsb, ctree, hashmap, and vacation; the chart labels include 9.8 us and 0.87 us.]

49 Average Request Latency. BIBIM reduces the long latency of PRAM by 87%. BIBIM degrades performance relative to DRAM by only 44%. Best performance benefit: vacation, which has many cache hits (high write fraction). Small performance benefit: tpcc and ycsb, where persistent operations (flush) generate many PRAM writes. [Figure: average request latency (us) of DRAM, PRAM, and BIBIM (15%) for echo, tpcc, ycsb, ctree, hashmap, and vacation.]

50 Non-blocking Read Service (NBRS) Analysis. Tested on PRAM-only (PRAM).

51 RAW Latency of Non-NBRS PRAM. Read-after-write (RAW): Req1 targets partition 0 (cache eviction / persistent operation), Req2 targets partition 1, and Req3 targets partition 15. In the non-NBRS PRAM bank the sense amplifier is blocked, so the later requests wait. RAW latency of non-NBRS: 19.8 us on average (≈ PRAM write latency). [Figure: average read-after-write latency (us), non-NBRS versus NBRS, for echo, tpcc, ycsb, ctree, hashmap, and vacation.]

52 RAW Latency of NBRS PRAM. Read-after-write (RAW): Req1 targets partition 0 (cache eviction / persistent operation), Req2 targets partition 1, and Req3 targets partition 15. In the NBRS PRAM bank the sense amplifier is not blocked for the other partitions. RAW latency of NBRS: reduced by 52% on average. [Figure: average read-after-write latency (us), non-NBRS versus NBRS, for echo, tpcc, ycsb, ctree, hashmap, and vacation, annotated with "Many partition conflicts" and "Many writes".]

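53 Demo (Normal Version). Performance comparison of DRAM, PRAM, and BIBIM (memory accesses with a synthetic, read-write inter-mixed trace). [Figure: completion times in seconds and finishing order for DRAM, PRAM, and BIBIM; the one legible value is 1.27 sec.]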

54 Conclusion. BIBIM can achieve both high capacity and short latency. Our main contributions: 1) a multi-partition aware scheduler (NBRS); 2) a new physical layer (PHY); 3) persistent support for hybrid memory. [Figure: average latency (us), PRAM-only versus BIBIM, with the label 0.87.]
