PageForge: A Near-Memory Content-Aware Page-Merging Architecture
1 PageForge: A Near-Memory Content-Aware Page-Merging Architecture. Dimitrios Skarlatos, Nam Sung Kim, and Josep Torrellas, University of Illinois at Urbana-Champaign. Boston.
2-3 Motivation: Server Consolidation in the Cloud. How to consolidate main memory?
4-7 Content-Aware Page Merging or Page Deduplication. Hypervisors search the address space and merge identical pages. Figure (before and after merging): for VM-0 and VM-1, Guest Virtual pages (Page 0, Page 1) map to Guest Physical pages (Page 0, Page 1), which map to Host Physical pages.
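The merging step on this slide can be sketched in a few lines. This is a toy model under stated assumptions, not the hypervisor's implementation: the `Host` class, its `frames` and `mapping` dictionaries, and `merge_identical` are hypothetical names, and copy-on-write handling of later writes to a merged page is left out.

```python
PAGE_SIZE = 4096


class Host:
    """Toy host memory (hypothetical): frames plus a guest-to-host page map."""

    def __init__(self):
        self.frames = {}    # host physical page number -> page contents
        self.mapping = {}   # (vm name, guest physical page) -> host page number
        self.next_ppn = 0

    def map_page(self, vm, guest_page, data):
        """Back one guest physical page with a fresh host frame."""
        assert len(data) == PAGE_SIZE
        ppn = self.next_ppn
        self.next_ppn += 1
        self.frames[ppn] = data
        self.mapping[(vm, guest_page)] = ppn

    def merge_identical(self):
        """Keep one frame per distinct content; return the number of frames freed."""
        keeper_for = {}   # contents -> host page number that is kept
        remap = {}        # duplicate host page number -> kept host page number
        for ppn in sorted(self.frames):
            keeper = keeper_for.setdefault(self.frames[ppn], ppn)
            if keeper != ppn:
                remap[ppn] = keeper
        for key, ppn in list(self.mapping.items()):
            if ppn in remap:
                self.mapping[key] = remap[ppn]
        for ppn in remap:
            del self.frames[ppn]
        return len(remap)


host = Host()
host.map_page("VM-0", 0, b"A" * PAGE_SIZE)
host.map_page("VM-0", 1, b"B" * PAGE_SIZE)
host.map_page("VM-1", 0, b"A" * PAGE_SIZE)   # same contents as VM-0 page 0
host.map_page("VM-1", 1, b"C" * PAGE_SIZE)
freed = host.merge_identical()
```

In this example the two pages filled with `A` end up sharing one host frame, so four guest pages are backed by three host frames.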
8 Problem: Overhead of Software Page Merging. Search hundreds of millions of pages. Latency-sensitive applications get disrupted. Red Hat's software page merging (KSM) has execution overhead: average mean latency overhead 68%; average tail latency overhead 136%.
9 Contribution: PageForge. First solution for hardware-assisted content-aware page merging. General, effective, minimal hypervisor involvement and hardware modifications. Reduced overhead vs. state-of-the-art software: mean latency 68% → 10%; tail latency 136% → 11%. Same memory savings as software: 48% → twice the number of VMs. Novel use of ECC for page content characterization.
10 Content Duplication in the Cloud
11-17 Background: Cloud. Figure (stack, bottom to top): Infrastructure, Host Operating System, Hypervisor, then three VMs, each with its own Guest OS, Libs, and Data.
18-23 Background: Content Duplication. Figure: the same stack, with content duplication marked across the Data, Libs, and Guest OS of the three VMs, and page merging marked alongside the hypervisor.
24 Our Proposal: Content Deduplication in HW. Figure: the same stack, with content duplication across the VMs and page merging performed in hardware.
25 Software-Based Content-Aware Page Merging
26-31 Software-Based Content-Aware Page Merging. Figure: a Pool of Pages to Scan and a Pool of Stable Pages, each holding 4KB pages; a Candidate Page is checked against a Compare Page. Figure (memory hierarchy): both 4KB pages are read from Main Memory through the Memory Controller and the L3, L2, and L1 caches in 64B lines, and the Core compares them (labels: 64B, 1B, ==).
32 Why Is Page Merging Expensive? Search hundreds of millions of pages. Many core cycles consumed. Caches get polluted. Latency-sensitive applications get disrupted.
33 Software-Based Content-Aware Page Merging. Optimization: identify whether the candidate page has changed recently. If a page changes too often → not a good candidate. Figure: Candidate Page (4KB) → 1KB → JHash → Hash Key.
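The change-detection optimization can be sketched as follows. This is a minimal illustration, not KSM's code: `zlib.crc32` stands in for JHash, hashing the first 1KB of the page is an assumption (the slide only says the key covers 1KB), and `page_key` and `changed_since_last_scan` are hypothetical names.

```python
import zlib
from typing import Optional, Tuple

PAGE_SIZE = 4096
KEY_BYTES = 1024  # the slide states that the key is computed over 1KB of the page


def page_key(page: bytes) -> int:
    """Stand-in for the JHash key: CRC32 over the first 1KB (an assumption)."""
    return zlib.crc32(page[:KEY_BYTES])


def changed_since_last_scan(page: bytes, old_key: Optional[int]) -> Tuple[bool, int]:
    """Return (changed, new_key); a page seen for the first time counts as changed."""
    new_key = page_key(page)
    return (old_key is None or new_key != old_key), new_key


page = bytes(PAGE_SIZE)
first_seen, key = changed_since_last_scan(page, None)      # first scan: no old key yet
unchanged = changed_since_last_scan(page, key)              # same contents on the next scan
head_write = changed_since_last_scan(b"\x01" + page[1:], key)    # write inside the hashed 1KB
tail_write = changed_since_last_scan(page[:-1] + b"\x01", key)   # write outside the hashed 1KB
```

The last case shows why the key is only a hint: a write outside the hashed 1KB leaves the key unchanged, so a full page comparison is still needed before merging.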
34 PageForge: Hardware-Assisted Page Merging
35-36 PageForge Main Idea. Hardware engine in the memory controller: compares pages, generates hashes. Advantages: ✓ no core utilization; ✓ no cache pollution. Figure: multicore chip with CPUs, L2 and L3 caches, an interconnect, and memory controllers attached to memory; PageForge sits in a memory controller.
37 The Memory Controller. Figure: memory controller between the On-Chip Network and the Off-Chip Memory Interface, with Read Req Buffer, Write Req Buffer, Scheduler (FR-FCFS), Command Generation, Read Data Buffer, Write Data Buffer, ECC Dec, and ECC Enc; the PageForge block adds Control, Comparator, and Scan Table.
38-40 PageForge Operation. The operating system loads into the Scan Table: the candidate page PPN, and the PPNs of the pages to compare to the candidate. The PageForge hardware autonomously performs: the sequence of comparisons, and the generation of the hash key for the candidate page.
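The division of labor on this slide can be sketched as a small software model. The field names (`candidate_ppn`, `compare_ppns`, `match_ppn`, `hash_key`, `done`) and the `run_engine` function are hypothetical; the slides say only that the OS loads the PPNs and that the hardware performs the comparisons and generates the key.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class ScanTable:
    # Written by the OS
    candidate_ppn: int
    compare_ppns: List[int]
    # Written by the engine (hypothetical result fields)
    match_ppn: Optional[int] = None
    hash_key: Optional[int] = None
    done: bool = False


def run_engine(table: ScanTable, memory: Dict[int, bytes],
               key_fn: Callable[[bytes], int]) -> None:
    """Software model of the engine: compare the candidate to each listed page,
    stop at the first match, and generate the candidate's hash key."""
    candidate = memory[table.candidate_ppn]
    for ppn in table.compare_ppns:
        if memory[ppn] == candidate:
            table.match_ppn = ppn
            break
    table.hash_key = key_fn(candidate)
    table.done = True


memory = {0: b"a" * 4096, 1: b"b" * 4096, 2: b"a" * 4096}
table = ScanTable(candidate_ppn=0, compare_ppns=[1, 2])
run_engine(table, memory, key_fn=lambda page: page[0])  # trivial key, for illustration only
```

After the run, the OS would read `match_ppn` and `hash_key` back from the table and decide whether to merge.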
41-47 PageForge Operation (flowchart; part of the chart is labeled In Hardware). Pick Candidate Page. Search Pool of Stable Pages. Match Found? Yes → OS Merges Pages. No → Old Key = New Key? No → Pick Candidate Page. Yes → Search Pool of Unprotected Pages. Match Found? Yes → OS Merges Pages. No → Insert Candidate in Unprotected Pool.
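The flowchart can be written as one decision function. This is a sketch under assumptions: the pools are modeled as Python sets of page contents, the function name and the returned action strings are hypothetical, and moving a page to the stable pool after a match in the unprotected pool follows KSM's usual behavior rather than anything stated on the slide.

```python
from typing import Set


def process_candidate(page: bytes, old_key: int, new_key: int,
                      stable: Set[bytes], unprotected: Set[bytes]) -> str:
    """Return the action the flowchart takes for one candidate page."""
    if page in stable:                 # Search Pool of Stable Pages: match found
        return "merge"
    if old_key != new_key:             # page changed since the last scan
        return "skip"                  # pick the next candidate
    if page in unprotected:            # Search Pool of Unprotected Pages: match found
        unprotected.discard(page)
        stable.add(page)               # assumption: the merged page becomes stable
        return "merge"
    unprotected.add(page)              # no match anywhere: remember the candidate
    return "inserted"


stable = {b"zero"}
unprotected: Set[bytes] = set()
actions = [
    process_candidate(b"zero", 1, 1, stable, unprotected),   # already in the stable pool
    process_candidate(b"busy", 1, 2, stable, unprotected),   # key changed
    process_candidate(b"libc", 5, 5, stable, unprotected),   # first sighting
    process_candidate(b"libc", 5, 5, stable, unprotected),   # second copy, same contents
]
```

In PageForge the two pool searches and the key generation are the parts offloaded to the hardware engine; the merge itself stays with the OS.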
48 Eliminating the Cost of Hash Keys
49-52 Hash Key for Page Content Characterization. If a page changed recently, its key may be different. KSM: JHash → serial computation + 1KB of data. PageForge: ECC → parallel computation + 256B of data.
54-62 Proposal: ECC for Page Content Characterization. Break a 4KB page into 4 segments. Pick a random cache line (64B of data, 8B of ECC) within each segment. Select the 8-bit ECC of the least significant QWORD of that line. The four 8-bit ECC codes form the 4B hash key. Data touched per key: software JHash = 1KB; ECC hash = 256 bytes (4 cache lines of 64B). Result: 75% memory footprint reduction for hash key generation.
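The key construction can be sketched as below. Several details are assumptions made for illustration: the 8-bit code is a textbook extended Hamming (72,64) code with a hypothetical bit layout (real memory controllers use their own SECDED matrices and compute ECC in hardware), the per-segment line offsets are a fixed tuple so that keys from different scans are comparable, and the least significant QWORD is taken to be the first 8 bytes of the line in little-endian order.

```python
LINE_BYTES = 64
LINES_PER_SEGMENT = 16   # 4KB page / 4 segments / 64B lines


def ecc8(qword: int) -> int:
    """8 check bits for a 64-bit word: 7 Hamming bits plus one overall parity bit.
    The bit layout is illustrative, not a specific controller's code."""
    syndrome = 0
    data_parity = 0
    pos = 0
    for i in range(64):
        pos += 1
        while (pos & (pos - 1)) == 0:   # positions 1, 2, 4, ... hold check bits
            pos += 1
        if (qword >> i) & 1:
            syndrome ^= pos
            data_parity ^= 1
    overall = data_parity ^ (bin(syndrome).count("1") & 1)
    return (overall << 7) | syndrome


def ecc_page_key(page: bytes, line_offsets=(3, 7, 11, 5)) -> int:
    """Build a 32-bit key from one 8-bit ECC per 1KB segment.
    line_offsets (hypothetical values) selects one cache line in each segment."""
    key = 0
    for segment, offset in enumerate(line_offsets):
        start = (segment * LINES_PER_SEGMENT + offset) * LINE_BYTES
        qword = int.from_bytes(page[start:start + 8], "little")
        key = (key << 8) | ecc8(qword)
    return key


zero_page = bytes(4096)
touched = bytearray(zero_page)
touched[3 * LINE_BYTES] = 1            # flip one bit in the sampled line of segment 0
bytes_read = 4 * LINE_BYTES            # 256B, versus 1KB for the software key
reduction = 1 - bytes_read / 1024      # 0.75
```

Only the four sampled lines contribute to the key, which is where the 256B figure and the 75% reduction relative to the 1KB software key come from.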
63 Evaluation
64 Simulation Setup. 2GHz, 10-core, 16GB DRAM. One VM per core. Ubuntu host, Ubuntu Cloud guest. In-house simulator: Simics + SST + DRAMSim2. TailBench suite benchmarks.
65 Configurations. Baseline: page merging is disabled. KSM: Red Hat's implementation shipped with current Linux. PageForge: hardware-assisted page merging.
66-68 Memory Savings w/o vs w/ Page Merging. Figure: pages as a percentage of the count without page merging, shown without and with page merging, for Img-Dnn, Masstree, Moses, Silo, Sphinx, and Average. PageForge achieves memory savings of 48% → twice the number of VMs.
69-71 Memory Savings w/o vs w/ Page Merging. Figure: the same benchmarks, with pages broken down into Unmergeable, Mergeable Zero, and Mergeable Non-Zero, without and with page merging; 48% savings on average.
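The step from 48% savings to roughly twice the number of VMs is simple arithmetic, sketched here under the assumption that every VM shrinks by the average savings and that memory is the only limit on consolidation; `vm_capacity_factor` is a hypothetical helper name.

```python
def vm_capacity_factor(savings: float) -> float:
    """How many times more VMs fit in the same memory if each VM's
    footprint shrinks by the given fraction (assumes uniform savings)."""
    return 1.0 / (1.0 - savings)


factor = vm_capacity_factor(0.48)   # about 1.92, which the slides round to "twice"
```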
72-76 Mean Latency of Requests. Figure: normalized mean latency for Baseline, KSM, and PageForge across Img-Dnn, Masstree, Moses, Silo, Sphinx, and Average. PageForge reduces mean latency overhead from 68% down to 10%.
77-78 Tail Latency of Requests. Figure: normalized tail (percentile) latency for Baseline, KSM, and PageForge across Img-Dnn, Masstree, Moses, Silo, Sphinx, and Average. PageForge reduces tail latency overhead from 136% down to 11%.
79 Also in the Paper. Software interface to PageForge. Interaction with the cache coherence protocol. Alternative designs. Bandwidth analysis. ECC vs. JHash → ECC keys add negligible collisions. Power and area at 22nm are negligible.
80 Takeaway: PageForge. First solution for hardware-assisted content-aware page merging. General, effective, minimal hypervisor involvement and hardware modifications. Reduced overhead vs. state-of-the-art software: mean latency 68% → 10%; tail latency 136% → 11%. Same memory savings as software: 48% → twice the number of VMs. Novel use of ECC for content characterization → 75% less memory footprint.
More informationIntel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins
Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Outline History & Motivation Architecture Core architecture Network Topology Memory hierarchy Brief comparison to GPU & Tilera Programming Applications
More informationReducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip
Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off
More informationCS 537 Lecture 6 Fast Translation - TLBs
CS 537 Lecture 6 Fast Translation - TLBs Michael Swift 9/26/7 2004-2007 Ed Lazowska, Hank Levy, Andrea and Remzi Arpaci-Dussea, Michael Swift Faster with TLBS Questions answered in this lecture: Review
More informationBeyond Block I/O: Rethinking
Beyond Block I/O: Rethinking Traditional Storage Primitives Xiangyong Ouyang *, David Nellans, Robert Wipfel, David idflynn, D. K. Panda * * The Ohio State University Fusion io Agenda Introduction and
More informationSmartNICs: Giving Rise To Smarter Offload at The Edge and In The Data Center
SmartNICs: Giving Rise To Smarter Offload at The Edge and In The Data Center Jeff Defilippi Senior Product Manager Arm #Arm Tech Symposia The Cloud to Edge Infrastructure Foundation for a World of 1T Intelligent
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology
More informationChunkStash: Speeding Up Storage Deduplication using Flash Memory
ChunkStash: Speeding Up Storage Deduplication using Flash Memory Biplob Debnath +, Sudipta Sengupta *, Jin Li * * Microsoft Research, Redmond (USA) + Univ. of Minnesota, Twin Cities (USA) Deduplication
More informationComputer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per
More informationMeasuring zseries System Performance. Dr. Chu J. Jong School of Information Technology Illinois State University 06/11/2012
Measuring zseries System Performance Dr. Chu J. Jong School of Information Technology Illinois State University 06/11/2012 Outline Computer System Performance Performance Factors and Measurements zseries
More informationNear Memory Key/Value Lookup Acceleration MemSys 2017
Near Key/Value Lookup Acceleration MemSys 2017 October 3, 2017 Scott Lloyd, Maya Gokhale Center for Applied Scientific Computing This work was performed under the auspices of the U.S. Department of Energy
More informationCSEE W4824 Computer Architecture Fall 2012
CSEE W4824 Computer Architecture Fall 2012 Lecture 8 Memory Hierarchy Design: Memory Technologies and the Basics of Caches Luca Carloni Department of Computer Science Columbia University in the City of
More informationMemshare: a Dynamic Multi-tenant Key-value Cache
Memshare: a Dynamic Multi-tenant Key-value Cache ASAF CIDON*, DANIEL RUSHTON, STEPHEN M. RUMBLE, RYAN STUTSMAN *STANFORD UNIVERSITY, UNIVERSITY OF UTAH, GOOGLE INC. 1 Cache is 100X Faster Than Database
More informationDonn Morrison Department of Computer Science. TDT4255 Memory hierarchies
TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,
More informationChapter 5B. Large and Fast: Exploiting Memory Hierarchy
Chapter 5B Large and Fast: Exploiting Memory Hierarchy One Transistor Dynamic RAM 1-T DRAM Cell word access transistor V REF TiN top electrode (V REF ) Ta 2 O 5 dielectric bit Storage capacitor (FET gate,
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More informationTDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading
Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5
More informationPOSH: A TLS Compiler that Exploits Program Structure
POSH: A TLS Compiler that Exploits Program Structure Wei Liu, James Tuck, Luis Ceze, Wonsun Ahn, Karin Strauss, Jose Renau and Josep Torrellas Department of Computer Science University of Illinois at Urbana-Champaign
More informationSEESAW: Set Enhanced Superpage Aware caching
SEESAW: Set Enhanced Superpage Aware caching http://synergy.ece.gatech.edu/ Set Associativity Mayank Parasar, Abhishek Bhattacharjee Ω, Tushar Krishna School of Electrical and Computer Engineering Georgia
More informationIsoStack Highly Efficient Network Processing on Dedicated Cores
IsoStack Highly Efficient Network Processing on Dedicated Cores Leah Shalev Eran Borovik, Julian Satran, Muli Ben-Yehuda Outline Motivation IsoStack architecture Prototype TCP/IP over 10GE on a single
More informationAuthenticated Storage Using Small Trusted Hardware Hsin-Jung Yang, Victor Costan, Nickolai Zeldovich, and Srini Devadas
Authenticated Storage Using Small Trusted Hardware Hsin-Jung Yang, Victor Costan, Nickolai Zeldovich, and Srini Devadas Massachusetts Institute of Technology November 8th, CCSW 2013 Cloud Storage Model
More informationCS162 - Operating Systems and Systems Programming. Address Translation => Paging"
CS162 - Operating Systems and Systems Programming Address Translation => Paging" David E. Culler! http://cs162.eecs.berkeley.edu/! Lecture #15! Oct 3, 2014!! Reading: A&D 8.1-2, 8.3.1. 9.7 HW 3 out (due
More informationMARACAS: A Real-Time Multicore VCPU Scheduling Framework
: A Real-Time Framework Computer Science Department Boston University Overview 1 2 3 4 5 6 7 Motivation platforms are gaining popularity in embedded and real-time systems concurrent workload support less
More informationEFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE- STACKED DRAM CACHES
EFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE- STACKED DRAM CACHES MICRO 2011 @ Porte Alegre, Brazil Gabriel H. Loh [1] and Mark D. Hill [2][1] December 2011 [1] AMD Research [2] University
More informationPower Efficiency of Hypervisor and Container-based Virtualization
Power Efficiency of Hypervisor and Container-based Virtualization University of Amsterdam MSc. System & Network Engineering Research Project II Jeroen van Kessel 02-02-2016 Supervised by: dr. ir. Arie
More informationAddendum to Efficiently Enabling Conventional Block Sizes for Very Large Die-stacked DRAM Caches
Addendum to Efficiently Enabling Conventional Block Sizes for Very Large Die-stacked DRAM Caches Gabriel H. Loh Mark D. Hill AMD Research Department of Computer Sciences Advanced Micro Devices, Inc. gabe.loh@amd.com
More informationVMWare. Inc. 발표자 : 박찬호. Memory Resource Management in VMWare ESX Server
Author : Carl A. Waldspurger VMWare. Inc. 발표자 : 박찬호 Memory Resource Management in VMWare ESX Server Contents Background Motivation i and Overview Memory Virtualization Reclamation Mechanism Sharing Memory
More information