Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation Kevin Hsieh
1 Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation. Kevin Hsieh, Samira Khan, Nandita Vijaykumar, Kevin K. Chang, Amirali Boroumand, Saugata Ghose, Onur Mutlu
2 Executive Summary. Our Goal: Accelerating pointer chasing inside main memory. Challenges: Parallelism challenge and Address translation challenge. Our Solution: In-Memory PoInter Chasing Accelerator (IMPICA). Address-access decoupling: enabling parallelism in the accelerator with low cost. IMPICA page table: low-cost page table structure. Key Results: 1.2X to 1.9X speedup for pointer chasing operations, +16% database throughput, 6%-41% reduction in energy consumption
3 Linked Data Structures. Linked data structures are widely used in many important applications, e.g., databases (B-Tree) and key-value stores (Hash Table)
9 The Problem: Pointer Chasing. Traversing linked data structures requires chasing pointers: Find(A) walks the tree H -> E -> A, sending Addr(H), Addr(E), Addr(A) to memory and waiting for Data(H), Data(E), Data(A) in turn. Serialized and irregular access pattern; 6X cycles per instruction in real workloads
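The serialized pattern on this slide can be sketched in a few lines of Python; the tree and keys follow the slides' Find(A) example, while the code itself is an illustrative model rather than anything from the paper. Each iteration's load address is known only after the previous load completes, which is exactly what serializes the accesses.

```python
# Illustrative model of the Find(A) traversal: every node load is a
# dependent load, because the next "address" (pointer) comes out of the
# data returned by the previous load.

class Node:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left      # pointer to be chased on "key < node.key"
        self.right = right    # pointer to be chased otherwise

def find(root, key):
    """Chase pointers from the root; count the serialized memory hops."""
    node = root
    hops = 0
    while node is not None:
        hops += 1                       # one memory access per hop
        if key == node.key:
            return node, hops
        # The next address is known only after this node's data arrives.
        node = node.left if key < node.key else node.right
    return None, hops

# The example tree from the slides: H at the root, E/Q below, A under E.
root = Node("H",
            left=Node("E", left=Node("A"), right=Node("F")),
            right=Node("Q", left=Node("M")))
node, hops = find(root, "A")   # walks H -> E -> A: three dependent loads
```

No amount of out-of-order hardware can overlap these three loads within one traversal; each one feeds the next.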
15 Our Goal. Accelerating pointer chasing inside main memory: in a 3D-stacked memory, DRAM layers sit on top of a logic layer; the CPU sends Find(A) to the memory, the logic layer traverses the structure (H -> E -> A), and only Data(A) is returned
22 Outline Motivation and Our Approach Parallelism Challenge IMPICA Core Architecture Address Translation Challenge IMPICA Page Table Evaluation Conclusion 22
23 Parallelism Challenge. Timeline view: a single CPU core interleaves computation with memory accesses, and so does a single in-memory accelerator. With multiple CPU cores, the cores' memory accesses overlap in time, while one in-memory accelerator must issue its accesses serially, one traversal at a time
33 Parallelism Challenge and Opportunity. A simple in-memory accelerator can still be slower than multiple CPU cores. Opportunity: a pointer-chasing accelerator spends a long time waiting for memory access (10-15X the address computation time)
37 Our Solution: Address-Access Decoupling. Split the accelerator into an Address Engine, which computes the next pointer, and an Access Engine, which performs the memory accesses. While one traversal waits on memory, the Address Engine works on another traversal, so the accesses of multiple traversals overlap in time, just as accesses from multiple CPU cores do
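A toy timing model can show why the decoupling helps. The latencies here (1 cycle of address computation, 10 cycles per memory access) and the round-robin scheduling policy are assumptions chosen for illustration, not the paper's parameters.

```python
# Toy timing model of address-access decoupling (illustrative only).
T_ADDR, T_MEM = 1, 10      # assumed: address computation vs. memory access (cycles)
HOPS, TRAVERSALS = 4, 3    # assumed: hops per traversal, concurrent traversals

def serial_time():
    # A single in-order engine: every hop of every traversal runs back to back.
    return TRAVERSALS * HOPS * (T_ADDR + T_MEM)

def decoupled_time():
    # The address engine round-robins across traversals; while the access
    # engine waits T_MEM cycles for one traversal, the address engine issues
    # work for the others, overlapping the memory stalls.
    addr_free = 0                   # cycle at which the address engine is free
    ready = [0] * TRAVERSALS        # cycle at which each traversal's data is ready
    for _ in range(HOPS):
        for t in range(TRAVERSALS):
            start = max(addr_free, ready[t])   # need engine AND this traversal's data
            addr_free = start + T_ADDR         # compute the next address
            ready[t] = addr_free + T_MEM       # memory access completes later
    return max(ready)

print(serial_time(), decoupled_time())   # decoupled finishes far earlier
```

With these numbers the serialized design takes 132 cycles while the decoupled one takes 46: the T_MEM stalls of the three traversals now overlap instead of adding up.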
47 IMPICA Core Architecture. The 3D-stacked DRAM dies (DRAM Layers) sit above the Logic Layer. On the logic layer, a Request Queue carries work to/from the CPU into the Address Engine; the Address Engine and the Access Engine communicate through an Access Queue and a Response Queue; the Access Engine has an IMPICA Cache and talks to the Memory Controller. The Address Engine interleaves multiple in-flight traversals (Traversal 1, Traversal 2)
55 Outline Motivation and Our Approach Parallelism Challenge IMPICA Core Architecture Address Translation Challenge IMPICA Page Table Evaluation Conclusion 55
56 Address Translation Challenge. A pointer is a virtual address, so every dereference must go through the TLB/MMU to obtain the physical address (PA); on a TLB miss, this requires a page table walk (PTW)
59 Our Solution: IMPICA Page Table. Completely decouple the page table of IMPICA from the page table of the CPUs. The CPU page table maps virtual pages to physical pages as usual; a contiguous IMPICA Region of the virtual address space is instead mapped to physical pages through a separate IMPICA page table
67 IMPICA Page Table: Mechanism. The virtual address is split into four fields: Bit [47:41] indexes the Region Table; Bit [40:21] indexes a Flat Page Table (2MB); Bit [20:12] indexes a Small Page Table (4KB); Bit [11:0] is the page offset added to form the Physical Address
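The bit layout above can be exercised with a hypothetical walk. The field widths come straight from the slide; the table contents and the names `region_table`, `flat_tables`, and `small_tables` are invented example mappings, not real IMPICA entries.

```python
# Illustrative walk over the IMPICA page table bit layout from the slide.
# Bit [47:41] -> region table, [40:21] -> flat table, [20:12] -> small
# table, [11:0] -> page offset. All table contents below are made up.

def split_va(va):
    region = (va >> 41) & 0x7F       # 7 bits:  [47:41]
    flat   = (va >> 21) & 0xFFFFF    # 20 bits: [40:21]
    small  = (va >> 12) & 0x1FF      # 9 bits:  [20:12]
    offset = va & 0xFFF              # 12 bits: [11:0]
    return region, flat, small, offset

region_table = {0: "flat0"}                       # region index -> flat table
flat_tables  = {("flat0", 3): ("small0", None)}   # -> small table (4KB pages)
small_tables = {("small0", 5): 0x1234000}         # -> 4KB physical frame base

def translate(va):
    r, f, s, off = split_va(va)
    flat_id = region_table[r]                     # level 1: region table
    small_id, big_frame = flat_tables[(flat_id, f)]   # level 2: flat table
    if big_frame is not None:                     # entry maps a 2MB page directly
        return big_frame | (va & 0x1FFFFF)
    frame = small_tables[(small_id, s)]           # level 3: small table
    return frame | off

va = (0 << 41) | (3 << 21) | (5 << 12) | 0x42
pa = translate(va)
```

Because IMPICA regions are contiguous, this flat layout needs at most these few lookups instead of a general multi-level radix walk.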
70 Outline Motivation and Our Approach Parallelism Challenge IMPICA Core Architecture Address Translation Challenge IMPICA Page Table Evaluation Conclusion 70
71 Evaluated Workloads Microbenchmarks Linked list (from Olden benchmark) Hash table (from Memcached) B-tree (from DBx1000) Application DBx1000 (with TPC-C benchmark) 71
72 Evaluation Methodology. Simulator: gem5. System Configuration: CPU: 4 OoO cores, 2GHz; Cache: 32KB L1, 1MB L2. IMPICA: 1 core, 500MHz, 32KB cache. Memory bandwidth: 12.8 GB/s for CPU, 51.2 GB/s for IMPICA. Our simulator code will be released in Dec.
73 Result: Microbenchmark Performance. [Bar chart: speedup of the baseline with an extra 128KB L2 and of IMPICA, for Linked List, Hash Table, and B-Tree; IMPICA's speedups range from 1.2X to 1.9X across the three microbenchmarks]
77 Result: Database Performance. [Bar charts comparing baseline + extra 128KB L2, baseline + extra 1MB L2, and IMPICA.] Database throughput: +5% with an extra 1MB L2, +16% with IMPICA. Database latency: -4% with an extra 1MB L2, -13% with IMPICA
84 Energy Consumption. Normalized energy of IMPICA relative to baseline + extra 128KB L2: Linked List -41%, Hash Table -24%, B-Tree -10%, DBx1000 -6%
87 More in the Paper Interface and design considerations CPU interface and programming model Page table management Cache coherence Area and power overhead analysis Sensitivity to IMPICA page table design 87
88 Conclusion. Performing pointer chasing inside main memory can greatly speed up the traversal of linked data structures. Challenges: Parallelism challenge and Address translation challenge. Our Solution: In-Memory PoInter Chasing Accelerator (IMPICA). Address-access decoupling: enabling parallelism with low cost. IMPICA page table: low-cost page table structure. Key Results: 1.2X to 1.9X speedup for pointer chasing operations, +16% database throughput, 6%-41% reduction in energy consumption. Our solution can be applied to a broad class of in-memory accelerators
89 Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation. Kevin Hsieh, Samira Khan, Nandita Vijaykumar, Kevin K. Chang, Amirali Boroumand, Saugata Ghose, Onur Mutlu
90 Microarchitecture Metrics. [Charts: TLB MPKI and normalized cache miss latency for Linked List, Hash Table, and B-Tree]
95 Address Translation Speedup. [Chart: sensitivity to IMPICA TLB size and page table design (32 TLB, 64 TLB, 32 TLB + RPT, 64 TLB + RPT) for Linked List, Hash Table, B-Tree, and DBx1000]
96 Full IMPICA Core Architecture DRAM Layers DRAM DRAM Dies Logic Layer Data RAM Request Queue Inst RAM Address Engine IMPICA Cache Access Queue Controller Access Engine To/From CPU Response Queue 96
97 CPU Interface. We use a packet-based interface between the CPU and IMPICA. Execution steps: the CPU sends a function call and its parameters to IMPICA; the packet is written to IMPICA's data RAM; IMPICA loads the function into its inst RAM; IMPICA writes results to the data RAM, from which the CPU polls the results.
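The four execution steps might look like the following sketch. The queue, the packet layout, and the `data_ram` slot scheme are stand-ins, since the slide does not specify the actual hardware interface.

```python
# Hypothetical model of the packet-based CPU <-> IMPICA offload protocol.
# The names (request_q, data_ram, slot) are illustrative, not the paper's.
import queue
import threading

request_q = queue.Queue()   # stands in for the request queue to IMPICA
data_ram = {}               # stands in for IMPICA's data RAM

def impica_worker():
    # IMPICA side: take a function-call packet from the queue, "execute"
    # the function, and write the result into data RAM for the CPU.
    func, arg, slot = request_q.get()
    data_ram[slot] = func(arg)     # stand-in for running code from inst RAM

def cpu_offload(func, arg, slot):
    # CPU side: send the call packet, then wait for the result slot.
    t = threading.Thread(target=impica_worker)
    t.start()
    request_q.put((func, arg, slot))
    t.join()                       # real hardware would poll data RAM instead
    return data_ram[slot]

result = cpu_offload(lambda x: x * 2, 21, slot=0)   # -> 42
```

The key property the sketch preserves is that the CPU only sees the final result in data RAM, never the intermediate pointer chases.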
98 Programming Model. An IMPICA program is written as a function in the application code, marked with a compiler directive. The compiler compiles these functions into IMPICA instructions and wraps the function calls with communication code.
99 Page Table Management. The application allocates the memory for its linked data structures with a special API. The OS reserves a portion of the virtual address space as IMPICA regions. The OS maintains coherence between the CPU page table and the IMPICA page table in the page fault handler.
100 IMPICA Page Table Size. Region Table: 4 entries (covers a 2TB memory range), 68 B. Flat page table (each): 2^20 entries, 8 MB. Small page table (each): 2^9 entries, 4 KB.
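The flat and small page table totals on this slide are consistent with 8-byte entries; a quick check (the 8-byte entry size is an assumption inferred from the totals, not stated on the slide):

```python
# Sanity-checking the page table sizes, assuming 8-byte entries.
ENTRY = 8                            # assumed bytes per page table entry
flat_size  = (2 ** 20) * ENTRY       # 2^20 entries (bits [40:21]) -> 8 MB
small_size = (2 ** 9) * ENTRY        # 2^9 entries (bits [20:12])  -> 4 KB
```

The entry counts also match the mechanism slide's bit fields: 20 index bits for the flat table and 9 for the small table.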
101 Handling of Multiple Memory Stacks. The OS knows the IMPICA region because of our page table management. The OS always maps the IMPICA region of the same application into the same memory stack, including the corresponding IMPICA page table.
102 Cache Coherence. We execute every function that operates on the IMPICA regions in the accelerator. This can be extended with more advanced cache coherence mechanisms.
103 Limit of Parallelism. The parallelism of IMPICA is limited by: the data RAM size (for call stacks); memory access time vs. address computation time; the size of the queues. Each IMPICA core can easily parallelize pointer chasing requests.
104 Area and Power Overhead. CPU (Cortex-A57): 5.85 mm^2 per core. L2 Cache: 5 mm^2 per MB. Memory Controller: 10 mm^2. IMPICA (+32KB cache): 0.45 mm^2. Power overhead: average power increases by 5.6%.
More informationMemory Management Outline. Operating Systems. Motivation. Paging Implementation. Accessing Invalid Pages. Performance of Demand Paging
Memory Management Outline Operating Systems Processes (done) Memory Management Basic (done) Paging (done) Virtual memory Virtual Memory (Chapter.) Motivation Logical address space larger than physical
More informationVirtual Memory. Samira Khan Apr 27, 2017
Virtual Memory Samira Khan Apr 27, 27 Virtual Memory Idea: Give the programmer the illusion of a large address space while having a small physical memory So that the programmer does not worry about managing
More informationMain Memory (Fig. 7.13) Main Memory
Main Memory (Fig. 7.13) CPU CPU CPU Cache Multiplexor Cache Cache Bus Bus Bus Memory Memory bank 0 Memory bank 1 Memory bank 2 Memory bank 3 Memory b. Wide memory organization c. Interleaved memory organization
More informationRxNetty vs Tomcat Performance Results
RxNetty vs Tomcat Performance Results Brendan Gregg; Performance and Reliability Engineering Nitesh Kant, Ben Christensen; Edge Engineering updated: Apr 2015 Results based on The Hello Netflix benchmark
More informationChapter 5A. Large and Fast: Exploiting Memory Hierarchy
Chapter 5A Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) Fast, expensive Dynamic RAM (DRAM) In between Magnetic disk Slow, inexpensive Ideal memory Access time of SRAM
More informationImpact of Cache Coherence Protocols on the Processing of Network Traffic
Impact of Cache Coherence Protocols on the Processing of Network Traffic Amit Kumar and Ram Huggahalli Communication Technology Lab Corporate Technology Group Intel Corporation 12/3/2007 Outline Background
More informationVirtual Memory Virtual memory first used to relive programmers from the burden of managing overlays.
CSE420 Virtual Memory Prof. Mokhtar Aboelaze York University Based on Slides by Prof. L. Bhuyan (UCR) Prof. M. Shaaban (RIT) Virtual Memory Virtual memory first used to relive programmers from the burden
More informationPractical Near-Data Processing for In-Memory Analytics Frameworks
Practical Near-Data Processing for In-Memory Analytics Frameworks Mingyu Gao, Grant Ayers, Christos Kozyrakis Stanford University http://mast.stanford.edu PACT Oct 19, 2015 Motivating Trends End of Dennard
More informationEE 457 Unit 7b. Main Memory Organization
1 EE 457 Unit 7b Main Memory Organization 2 Motivation Organize main memory to Facilitate byte-addressability while maintaining Efficient fetching of the words in a cache block Low order interleaving (L.O.I)
More informationMain Memory. EECC551 - Shaaban. Memory latency: Affects cache miss penalty. Measured by:
Main Memory Main memory generally utilizes Dynamic RAM (DRAM), which use a single transistor to store a bit, but require a periodic data refresh by reading every row (~every 8 msec). Static RAM may be
More informationPerformance of Commercial Database Benchmarks on a CC-NUMA Computer System
STiNG Revisited: Performance of Commercial Database Benchmarks on a CC-NUMA Computer System Russell M. Clapp IBM NUMA-Q rclapp@us.ibm.com January 21st, 2001 1 Background STiNG was the code name of an engineering
More informationMemory latency: Affects cache miss penalty. Measured by:
Main Memory Main memory generally utilizes Dynamic RAM (DRAM), which use a single transistor to store a bit, but require a periodic data refresh by reading every row. Static RAM may be used for main memory
More informationCOEN-4730 Computer Architecture Lecture 3 Review of Caches and Virtual Memory
1 COEN-4730 Computer Architecture Lecture 3 Review of Caches and Virtual Memory Cristinel Ababei Dept. of Electrical and Computer Engineering Marquette University Credits: Slides adapted from presentations
More informationOutline. Low-Level Optimizations in the PowerPC/Linux Kernels. PowerPC Architecture. PowerPC Architecture
Low-Level Optimizations in the PowerPC/Linux Kernels Dr. Paul Mackerras Senior Technical Staff Member IBM Linux Technology Center OzLabs Canberra, Australia paulus@samba.org paulus@au1.ibm.com Introduction
More informationPerformance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference
The 2017 IEEE International Symposium on Workload Characterization Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference Shin-Ying Lee
More informationVirtual Memory. Motivation:
Virtual Memory Motivation:! Each process would like to see its own, full, address space! Clearly impossible to provide full physical memory for all processes! Processes may define a large address space
More informationLECTURE 5: MEMORY HIERARCHY DESIGN
LECTURE 5: MEMORY HIERARCHY DESIGN Abridged version of Hennessy & Patterson (2012):Ch.2 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive
More informationCache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance
6.823, L11--1 Cache Performance and Memory Management: From Absolute Addresses to Demand Paging Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Cache Performance 6.823,
More information14 May 2012 Virtual Memory. Definition: A process is an instance of a running program
Virtual Memory (VM) Overview and motivation VM as tool for caching VM as tool for memory management VM as tool for memory protection Address translation 4 May 22 Virtual Memory Processes Definition: A
More informationMEMORY: SWAPPING. Shivaram Venkataraman CS 537, Spring 2019
MEMORY: SWAPPING Shivaram Venkataraman CS 537, Spring 2019 ADMINISTRIVIA - Project 2b is out. Due Feb 27 th, 11:59 - Project 1b grades are out Lessons from p2a? 1. Start early! 2. Sketch out a design?
More informationComputer Systems Laboratory Sungkyunkwan University
I/O System Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Introduction (1) I/O devices can be characterized by Behavior: input, output, storage
More informationQuantifying the Energy Cost of Data Movement for Emerging Smartphone Workloads on Mobile Platforms
Quantifying the Energy Cost of Data Movement for Emerging Smartphone Workloads on Mobile Platforms Arizona State University Dhinakaran Pandiyan(dpandiya@asu.edu) and Carole-Jean Wu(carole-jean.wu@asu.edu
More informationMemshare: a Dynamic Multi-tenant Key-value Cache
Memshare: a Dynamic Multi-tenant Key-value Cache ASAF CIDON*, DANIEL RUSHTON, STEPHEN M. RUMBLE, RYAN STUTSMAN *STANFORD UNIVERSITY, UNIVERSITY OF UTAH, GOOGLE INC. 1 Cache is 100X Faster Than Database
More informationArrakis: The Operating System is the Control Plane
Arrakis: The Operating System is the Control Plane Simon Peter, Jialin Li, Irene Zhang, Dan Ports, Doug Woos, Arvind Krishnamurthy, Tom Anderson University of Washington Timothy Roscoe ETH Zurich Building
More informationModule Outline. CPU Memory interaction Organization of memory modules Cache memory Mapping and replacement policies.
M6 Memory Hierarchy Module Outline CPU Memory interaction Organization of memory modules Cache memory Mapping and replacement policies. Events on a Cache Miss Events on a Cache Miss Stall the pipeline.
More informationEbbRT: A Framework for Building Per-Application Library Operating Systems
EbbRT: A Framework for Building Per-Application Library Operating Systems Overview Motivation Objectives System design Implementation Evaluation Conclusion Motivation Emphasis on CPU performance and software
More informationOperating Systems. Paging... Memory Management 2 Overview. Lecture 6 Memory management 2. Paging (contd.)
Operating Systems Lecture 6 Memory management 2 Memory Management 2 Overview Paging (contd.) Structure of page table Shared memory Segmentation Segmentation with paging Virtual memory Just to remind you...
More informationV. Primary & Secondary Memory!
V. Primary & Secondary Memory! Computer Architecture and Operating Systems & Operating Systems: 725G84 Ahmed Rezine 1 Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM)
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology
More informationSE-292 High Performance Computing. Memory Hierarchy. R. Govindarajan
SE-292 High Performance Computing Memory Hierarchy R. Govindarajan govind@serc Reality Check Question 1: Are real caches built to work on virtual addresses or physical addresses? Question 2: What about
More informationComputer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per
More informationRISCV with Sanctum Enclaves. Victor Costan, Ilia Lebedev, Srini Devadas
RISCV with Sanctum Enclaves Victor Costan, Ilia Lebedev, Srini Devadas Today, privilege implies trust (1/3) If computing remotely, what is the TCB? Priviledge CPU HW Hypervisor trusted computing base OS
More information15-740/ Computer Architecture Lecture 7: Pipelining. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 9/26/2011
15-740/18-740 Computer Architecture Lecture 7: Pipelining Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 9/26/2011 Review of Last Lecture More ISA Tradeoffs Programmer vs. microarchitect Transactional
More informationVirtual Memory. Reading. Sections 5.4, 5.5, 5.6, 5.8, 5.10 (2) Lecture notes from MKP and S. Yalamanchili
Virtual Memory Lecture notes from MKP and S. Yalamanchili Sections 5.4, 5.5, 5.6, 5.8, 5.10 Reading (2) 1 The Memory Hierarchy ALU registers Cache Memory Memory Memory Managed by the compiler Memory Managed
More informationManaging GPU Concurrency in Heterogeneous Architectures
Managing Concurrency in Heterogeneous Architectures Onur Kayıran, Nachiappan CN, Adwait Jog, Rachata Ausavarungnirun, Mahmut T. Kandemir, Gabriel H. Loh, Onur Mutlu, Chita R. Das Era of Heterogeneous Architectures
More information