SmartMD: A High Performance Deduplication Engine with Mixed Pages


SmartMD: A High Performance Deduplication Engine with Mixed Pages
Fan Guo (1), Yongkun Li (1), Yinlong Xu (1), Song Jiang (2), John C. S. Lui (3)
(1) University of Science and Technology of China; (2) University of Texas at Arlington; (3) The Chinese University of Hong Kong

Overview

- A TLB (Translation Lookaside Buffer) miss carries a high penalty because of the page-table accesses it triggers, e.g., a four-level address mapping on an x86-64 system with 4KB pages.
- Virtualization increases the TLB miss penalty: a 2D page-table walk (GVA -> GPA -> HPA) takes up to 24 memory references.
- The uneven growth of TLB size versus memory size exacerbates the problem.
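The 24-reference figure follows from simple counting: each of the guest's page-table entries sits at a guest-physical address that itself costs a full host walk to translate, plus one reference to read the entry, and the final guest-physical address of the data page needs one more host walk. A minimal sketch of this accounting (the function name is illustrative):

```python
def nested_walk_refs(guest_levels=4, host_levels=4):
    """Worst-case memory references for a 2D (nested) page-table walk,
    ignoring any caching of intermediate translations."""
    n, m = guest_levels, host_levels
    # n guest PTE reads, each preceded by an m-step host walk to
    # translate the PTE's guest-physical address, plus a final host
    # walk for the data page's guest-physical address.
    return n * (m + 1) + m

print(nested_walk_refs())  # 4 * 5 + 4 = 24 on x86-64 with 4KB pages
```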

Large Pages

Large pages improve memory access performance:
- Fewer page table entries (1/512 of the 4KB-page count)
- Larger TLB coverage

[Figure: address translation with a large page — a single TLB entry, or a single page-table entry on a TLB miss, maps a virtual address to a 2MB physical memory region]

Benefits of Large Pages

Enabling large pages in both guest and host can improve memory access performance by up to 68%:

Benchmark   Host: Base    Host: Large   Host: Large
            Guest: Large  Guest: Base   Guest: Large
SPECjbb     1.06          1.12          1.30
Graph500    1.26          1.34          1.68
Liblinear   1.13          1.14          1.37
Sysbench    1.07          1.09          1.20
Biobench    1.02          1.18          1.37

Deduplication with Large Pages

- Redundant data is very common among VMs: many base pages (4KB) share the same content.
- Large pages reduce the deduplication opportunities: very few large pages (2MB) are exactly the same.
  - Deduplication with large pages saves only 0.8% ~ 5.9% of memory, versus 13.7% ~ 47.9% with base pages.
- The common remedy: aggressively split large pages into base pages.

Motivation

Base pages vs. large pages: there exists a tradeoff between access performance and deduplication rate.

                    Access Performance   Deduplication Rate
Base pages (4KB)    Low                  High
Large pages (2MB)   High                 Low

Question: can we enjoy both benefits, high access performance and a high deduplication rate, simultaneously?

Our Solution

SmartMD: an adaptive management scheme with mixed pages
- Monitors page information (access frequency, repetition rate)
- Adaptively splits/reconstructs large pages, managing memory with a mix of page sizes

Observation: many large pages have high access frequency but few duplicate subpages.

High-level Idea of SmartMD

- A lightweight scheme to monitor page information: access frequency and repetition rate
- An adaptive scheme to selectively split/reconstruct large pages:
  - Split into base pages: cold pages with a high repetition rate, for a high deduplication rate
  - Keep as large pages: hot pages with a low repetition rate, for high access performance
  - Reconstruct into large pages: hot pages, for high access performance

Key Issues

- Access frequency monitoring
- Repetition rate monitoring
- Adaptive conversion

Monitor Access Frequency

Scan pages periodically. In each scan interval (e.g., 6s):
1. Reset the access bits of all pages
2. Sleep (e.g., 2.6s)
3. Check the access bits and update each page's access frequency (+/- by one)
4. Sleep until the scan interval ends

A counter keeps the access frequency of each large page.
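The scan loop above can be sketched as follows (Page and scan_once are illustrative names; in the kernel the "workload" step is simply a timed sleep while the VMs run):

```python
class Page:
    def __init__(self):
        self.accessed = False  # stands in for the hardware access bit
        self.freq = 0          # per-large-page access-frequency counter

def scan_once(pages, run_workload=lambda: None):
    """One scan interval: reset access bits, let the workload run,
    then move each page's counter up or down by one."""
    for p in pages:
        p.accessed = False           # step 1: reset access bits
    run_workload()                   # step 2: e.g., sleep 2.6s while VMs execute
    for p in pages:                  # step 3: +/- 1 based on the access bit
        p.freq = p.freq + 1 if p.accessed else max(0, p.freq - 1)
```

Pages touched during the interval climb toward "hot", while untouched pages decay toward "cold"; this counter later drives the split/reconstruct decision.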

Monitor Repetition Rate

Scan pages periodically; for each large page:
- Check each of its subpages and label it if it is a duplicate
- Use a counter to record the repetition rate

Counting bloom filter for duplicate detection:
- Number of entries: 8 x number of base pages
- Each entry: a 3-bit counter
- 3 hash functions for indexing

Sampling: sample only 25% of the subpages for pages that have been checked before and were not modified in the last interval.
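The duplicate-labeling structure can be sketched as a small counting bloom filter with the parameters quoted above (class and method names are illustrative; a real implementation would hash page contents in the kernel):

```python
import hashlib

class CountingBloomFilter:
    """Counting bloom filter sized per the talk: 8 counters per tracked
    base page, 3-bit saturating counters, 3 hash functions."""

    def __init__(self, num_pages, counters_per_page=8, num_hashes=3):
        self.size = counters_per_page * num_pages
        self.k = num_hashes
        self.counters = [0] * self.size

    def _slots(self, fingerprint):
        # Derive k independent slot indices from one fingerprint.
        for i in range(self.k):
            h = hashlib.blake2b(f"{i}:{fingerprint}".encode()).digest()
            yield int.from_bytes(h[:8], "little") % self.size

    def add(self, fingerprint):
        for s in self._slots(fingerprint):
            self.counters[s] = min(7, self.counters[s] + 1)  # 3-bit cap

    def remove(self, fingerprint):
        # Counters (unlike plain bloom bits) support deletion.
        for s in self._slots(fingerprint):
            if self.counters[s] > 0:
                self.counters[s] -= 1

    def maybe_duplicate(self, fingerprint):
        # True: content probably seen before (false positives possible,
        # no false negatives short of counter saturation).
        return all(self.counters[s] > 0 for s in self._slots(fingerprint))
```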

Adaptive Conversion

Selectively split/reconstruct, adjusting the parameters based on memory utilization:
- Split: Acc. Freq. < Thres_cold and Rep. Rate > Thres_repet
- Reconstruct: Acc. Freq. > Thres_hot

Implementation:
- Split: already well supported by Linux
- Reconstruct:
  - Gather the subpages: migrate remapped subpages, break shared subpages
  - Recreate the page descriptor
  - Update the page table
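The conversion policy reduces to a small decision function. A sketch (the threshold values here are made-up placeholders; SmartMD tunes them according to memory utilization, and "reconstruct" applies to regions that were previously split):

```python
def conversion_action(acc_freq, rep_rate,
                      thres_cold=2, thres_repet=0.4, thres_hot=5):
    """Decide a 2MB region's fate from its monitored statistics.
    Returns 'split', 'reconstruct', or 'keep'."""
    if acc_freq < thres_cold and rep_rate > thres_repet:
        return "split"        # cold + many duplicate subpages: dedup wins
    if acc_freq > thres_hot:
        return "reconstruct"  # hot: rebuild the 2MB mapping for TLB reach
    return "keep"             # neither condition holds: leave it alone
```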

Deduplication

Deduplication thread:
- Modifies KSM's deduplication algorithm to merge duplicate pages
- Uses two red-black trees to manage pages

With duplicate labels, SmartMD improves deduplication efficiency by comparing only pages that carry duplicate labels:
- The number of candidate pages for comparison is reduced
- The height of the red-black trees is reduced
- The number of comparisons needed to merge a page is reduced
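The effect of duplicate labels can be sketched with a sorted list standing in for KSM's red-black tree (names are illustrative): unlabeled pages never enter the search structure, so both the structure and the per-merge comparison count shrink.

```python
import bisect

def merge_candidates(pages):
    """Sketch of label-filtered merging.  `pages` is a list of
    (content, has_duplicate_label) pairs; returns (merged, tree_size)."""
    tree = []      # sorted list standing in for KSM's red-black tree
    merged = 0
    for content, labeled in pages:
        if not labeled:
            continue            # unlabeled pages skip the search entirely
        i = bisect.bisect_left(tree, content)
        if i < len(tree) and tree[i] == content:
            merged += 1         # byte-identical match: share one copy
        else:
            tree.insert(i, content)
    return merged, len(tree)
```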

Evaluation

Experiment setting:
- Host: two Intel Xeon E5-2650 v4 2.20GHz processors, 64GB RAM
- Guest: QEMU/KVM; 4 VMs booted on one physical CPU, each VM assigned one VCPU and 4GB RAM
- Both guest and host OSes run Ubuntu 14.04

Workloads and memory demands without deduplication:

Graph-500  SPECjbb  Liblinear  Sysbench  Biobench
2.7GB      1.7GB    4.0GB      2.93GB    3.42GB

Overhead of SmartMD

- SmartMD reduces overall CPU consumption even though it spends extra cycles on monitoring (average CPU utilization sampled every second)
- SmartMD introduces negligible memory overhead: 3/2^12 of memory for storing the counting bloom filter and 1/2^16 for keeping access frequencies & repetition rates, i.e., tens of MB for 16GB of memory
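A back-of-the-envelope check of those fractions for a 16GB host (reading the slide's ratios as 3/2^12 and 1/2^16 of total memory):

```python
GIB = 2**30
mem = 16 * GIB                 # total host memory in bytes
cbf = mem * 3 // 2**12         # counting bloom filter: 3/2^12 of memory
counters = mem // 2**16        # frequency & repetition-rate counters: 1/2^16
print(cbf // 2**20, "MiB +", counters // 2**10, "KiB")  # 12 MiB + 256 KiB
```

On the order of ten MB, consistent with the slide's "tens of MB" figure.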

Performance of SmartMD

Deduplication schemes compared:
- KSM: aggressively splits large pages that contain duplicate subpages; already supported in Linux; achieves the best memory saving
- No-splitting: deduplicates memory in units of 2MB pages, without splitting any large page; achieves the best access performance
- Ingens (OSDI '16): splits large pages with low access frequency, without considering the repetition rate

Tradeoff

KSM and no-splitting stand at the two extreme points of the tradeoff curve (best memory saving vs. best access performance). SmartMD achieves access performance similar to no-splitting and memory saving similar to KSM, taking both benefits simultaneously.

Comparison with Ingens

Memory saving: SmartMD saves 30% to 2.5x more memory than Ingens while delivering similar access performance.

Performance in Overcommitted Systems

- Overcommitment levels (ratio of the memory demand of all VMs to usable memory size): 0.8, 1.1, 1.4
- The host's usable memory is limited by running an in-memory file system (hugetlbfs)
- SmartMD achieves up to 38.6% performance improvement over the other schemes

Performance on a NUMA Machine

- Setting: 2 VMs on one physical CPU and two on a different CPU
- Baseline: no-splitting (best access performance)
- The NUMA effect is very small: the extra performance reduction on the NUMA machine is < 2% compared to the single-CPU setting

Conclusions

- Tradeoff: large pages improve memory access performance but reduce deduplication opportunities
- Observation: many large pages have high access frequency but few duplicate subpages
- We propose SmartMD, an adaptive scheme that manages memory with mixed pages:
  - Split: cold pages with a high repetition rate
  - Reconstruct: hot pages
- SmartMD takes both benefits simultaneously: high memory performance (by accessing with large pages) and high memory saving (by deduplicating with base pages)

Thanks! Q&A