SRM-Buffer: An OS Buffer Management Technique to Prevent Last Level Cache from Thrashing in Multicores

SRM-Buffer: An OS Buffer Management Technique to Prevent Last Level Cache from Thrashing in Multicores. Xiaoning Ding, The Ohio State University, dingxn@cse.ohio-state.edu; Kaibo Wang, The Ohio State University, wangka@cse.ohio-state.edu; Xiaodong Zhang, The Ohio State University, zhang@cse.ohio-state.edu. ACM EuroSys '11, April 10-13, 2011, Salzburg, Austria. 2012/1/4

Outline: Introduction; Related Work; Proposed Selected Region Mapping (SRM) Buffer; Prototype Implementation; Conclusion

Introduction (1/3): The CPU cache and the operating system buffer cache are two critical layers for narrowing the speed gap between the processor and the disk. We need good cooperation between the multicore architecture and the increasingly large capacity of main memory; otherwise, severe performance degradation may be incurred.

Introduction (2/3): Buffer data usually have much weaker temporal locality than VM data. VM-intensive (computation-intensive) workloads: scientific applications. File-intensive workloads: grep, tar.

Introduction (3/3): A thread accessing a large set of data cached in the OS buffer may significantly slow down its co-runners, because it can easily pollute the shared hardware cache(s) in the processor. The last level cache (e.g., the L3 on a Core i7, behind per-core L1 and L2 caches) is shared among the cores of a multicore processor. Cache pollution: a process loads data into the CPU cache that has few opportunities to be reused. [Slide diagram: cores with private L1/L2 caches sharing an L3 last level cache; VM pages and buffer pages in main memory, backed by disk.]

Related Work (1/4): Cache-memory address mapping. Application (virtual address): virtual page number + page offset. OS (physical address): physical page number + page offset. Hardware (cache address): cache tag + cache index + block offset.

Related Work (2/4): Physical address in the page coloring technique: 1. Cache color: the bits shared between the cache set index and the physical page number. 2. Cache sets with the same color value form a cache region.

Related Work (3/4): For example, there are 64 different cache colors on an Intel Xeon 5355 processor with a page size of 4KB (2^12 bytes). 1. The colors evenly divide the last level cache into 64 non-overlapping regions. 2. They also divide physical pages into 64 disjoint groups. If the physical memory size is 4GB (2^32 bytes), each color has 16384 pages (2^14), i.e., 64MB of physical memory with 4KB pages. [Slide diagram: the cache divided into 64 colored regions, and memory divided into 64 groups of 16384 pages each.]
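
As a concrete illustration (a minimal sketch, not code from the paper), the cache color of a physical page is given by the low-order bits of the page frame number that overlap with the cache set index; the constants below assume 64 colors and 4KB pages, as in the Xeon 5355 example above.

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12      /* 4 KB pages                          */
#define NUM_COLORS 64      /* e.g., the Xeon 5355 example above   */

/* Color = low-order bits of the physical page frame number that
 * overlap with the cache set index. */
static unsigned page_color(uint64_t phys_addr)
{
    uint64_t pfn = phys_addr >> PAGE_SHIFT;   /* physical page frame number     */
    return (unsigned)(pfn % NUM_COLORS);      /* 6 overlapping bits -> 64 colors */
}

int main(void)
{
    /* Pages that are 64 page frames apart fall into the same color,
     * i.e., they map to the same last level cache region. */
    printf("%u %u\n",
           page_color(0x12345000ULL),
           page_color(0x12345000ULL + 64ULL * 4096ULL));
    return 0;
}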

Related Work (4/4): Cache pollution can be restricted by allocating physical pages of the same color. In a conventional OS buffer cache: 1. File blocks are allocated physical pages of random colors. 2. Accessing the OS buffer cache therefore pollutes the LLC very quickly; the whole cache is polluted within a short period of time.

Selected Region Mapping (SRM) Buffer (1/7): To control cache pollution, minimize the number of colors of the physical pages assigned to OS buffer pages. In an SRM-buffer, an application accesses a batch of buffer pages of the same color. Related blocks are usually accessed in sequence, so the SRM-buffer identifies groups of related blocks. An appropriate sequence length is 256 pages.

Selected Region Mapping Buffer (2/7): Identify sequences (streams of file blocks). Blocks in the same sequence are mapped to the same cache region (i.e., they receive the same color when they are loaded into the OS buffer). Colors are changed dynamically for different sequences of blocks.

Selected Region Mapping Buffer (3/7): How are sequences determined (i.e., which blocks are accessed together)? 1. Same-file heuristic: blocks in the same file. 2. Same-application heuristic: blocks consecutively loaded by the same process. A sketch of these heuristics follows.
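
A rough sketch (under assumed names, not the authors' code) of how the two heuristics could decide whether a newly loaded block extends the current sequence. The block_req fields (inode, pid) and the helper are hypothetical; the 256-page cap comes from the sequence length suggested above.

#include <stdbool.h>
#include <stddef.h>

#define SEQ_MAX_LEN 256   /* suggested sequence length (in pages) */

/* Hypothetical descriptor of a block being loaded into the buffer cache. */
struct block_req {
    unsigned long inode;  /* file the block belongs to */
    int           pid;    /* process issuing the read  */
};

/* State of the sequence currently being colored. */
struct sequence {
    unsigned long inode;
    int           pid;
    size_t        length; /* pages already placed in this sequence */
};

/* Return true if the block should join the current sequence (and thus
 * receive a page of the same color), false if a new sequence starts. */
bool same_sequence(const struct sequence *seq, const struct block_req *req)
{
    if (seq->length >= SEQ_MAX_LEN)
        return false;                 /* sequence long enough: switch color */
    if (req->inode == seq->inode)
        return true;                  /* same-file heuristic */
    if (req->pid == seq->pid)
        return true;                  /* same-application heuristic */
    return false;
}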

Selected Region Mapping Buffer (4/7): How do we coordinate the different requirements that buffer management and virtual memory management place on physical page allocation, while retaining a high hit ratio in the OS page cache? Two risks: 1. hurting the hit ratio of the page cache; 2. an uneven color distribution among the physical pages available to virtual memory.

Selected Region Mapping Buffer (5/7): Structure: 1. Normal zone: managed by conventional OS buffer replacement and contains the LRU lists (e.g., the active and inactive lists in Linux). 2. Colored zone: free pages and a small number of inactive pages. Pages in the colored zone are organized into multiple lists, and each list links the pages of the same color. If a page in the colored zone is hit, it is moved to the normal zone. On page faults or OS buffer misses, SRM-buffer reclaims pages from the colored zone.
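
A minimal data-structure sketch of the two zones, assuming one singly linked list per cache color in the colored zone; the names and layout are illustrative, not the kernel's or the paper's.

#include <stddef.h>

#define NUM_COLORS 64

/* Minimal page descriptor for this sketch. */
struct page {
    struct page *next;
    unsigned     color;   /* cache color of this physical page */
};

/* Colored zone: free pages and some inactive pages, one list per color. */
struct colored_zone {
    struct page *lists[NUM_COLORS];   /* each list links pages of one color */
    size_t       count[NUM_COLORS];   /* length of each color list          */
};

/* Normal zone: conventional LRU lists (e.g., Linux active/inactive lists),
 * kept abstract here. A colored-zone page that is hit is promoted here. */
struct normal_zone {
    struct page *active;
    struct page *inactive;
};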

Selected Region Mapping Buffer (6/7): On a buffer miss, physical pages of a single color are allocated to the file blocks loaded in a sequence; the color is changed dynamically after the number of pages allocated in that color reaches a threshold. On VM page faults, pages are allocated uniformly from the different color lists to hold VM pages. Page hits are handled automatically by the normal zone. A sketch of this allocation policy follows.
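
A sketch of the allocation policy described above, with assumed thresholds and helper names (not the paper's implementation): buffer misses keep drawing pages from one color list until the sequence limit is reached or the list runs dry, then a new, sufficiently long list is chosen; VM page faults draw pages round-robin across colors so the color distribution stays even.

#include <stddef.h>

#define NUM_COLORS      64
#define SEQ_MAX_LEN     256  /* pages per sequence (from the slides)          */
#define COLOR_THRESHOLD 64   /* assumed: minimum list length to pick a color  */

struct page { struct page *next; };

struct colored_zone {                 /* as in the previous sketch */
    struct page *lists[NUM_COLORS];
    size_t       count[NUM_COLORS];
};

static struct page *pop(struct colored_zone *cz, unsigned color)
{
    struct page *p = cz->lists[color];
    if (p) {
        cz->lists[color] = p->next;
        cz->count[color]--;
    }
    return p;
}

/* Buffer miss: keep allocating from one color list for the current
 * sequence; switch colors when the sequence or the list runs out. */
struct page *alloc_buffer_page(struct colored_zone *cz,
                               unsigned *cur_color, size_t *seq_pages)
{
    if (*seq_pages >= SEQ_MAX_LEN || cz->count[*cur_color] == 0) {
        for (unsigned c = 0; c < NUM_COLORS; c++) {
            if (cz->count[c] > COLOR_THRESHOLD) {   /* pick a long-enough list */
                *cur_color = c;
                *seq_pages = 0;
                break;
            }
        }
        /* If no list is long enough, the colored zone would be refilled here. */
    }
    (*seq_pages)++;
    return pop(cz, *cur_color);
}

/* VM page fault: spread allocations evenly across the color lists. */
struct page *alloc_vm_page(struct colored_zone *cz, unsigned *rr_color)
{
    *rr_color = (*rr_color + 1) % NUM_COLORS;
    return pop(cz, *rr_color);
}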

Selected Region Mapping Buffer (7/7): Buffer miss (a sequence): page refill. Another sequence: when the threshold is reached, the color is changed dynamically. Page allocation on a VM page fault: 1. For the first block in a sequence, SRM-buffer selects a list with more pages than a threshold and reclaims a page from that list. 2. When such a list cannot be found, or a list becomes empty, SRM-buffer refills the colored zone.

Performance Evaluation (1/6): Experiment setup: 1. A Dell PowerEdge 1900 workstation with two 2.66GHz quad-core Xeon X5355 processors and 16GB of RAM. 2. A Dell Precision T1500 workstation with an Intel Core i7 860 processor and 8GB of RAM. 3. The operating system is 64-bit Red Hat Enterprise Linux AS release 5. 4. The file system is ext3. 5. The Linux kernel version is 2.6.30. We used pfmon [HP Corp. 2010] to collect performance statistics such as last level cache misses.

Performance Evaluation (2/6): We test SRM-buffer with a PostgreSQL database server [PostgreSQL 2008] supporting data warehouse workloads. The size of the fact table is about 4GiB. The queries use hash joins and sequential scans. Slowdown is measured relative to a solo run: slowdown = (T2 - T1) / T1, where T1 is the solo-run execution time and T2 is the co-run execution time.
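
For instance (illustrative numbers, not results from the paper): if a query takes T1 = 100 seconds when running alone and T2 = 130 seconds when co-running, the slowdown is (130 - 100) / 100 = 30%.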

Performance Evaluation (3/6): TPC-H benchmarks on PostgreSQL. 1. First group: Q6, Q15 (sequential scans). 2. Second group: Q5, Q7, Q8, Q10, Q11, Q18 (mixed features: join, scan, sort).

Performance Evaluation (5/6): With different workloads: file-intensive and VM-intensive.

Performance Evaluation (4/6)

Performance Evaluation (6/6): Two applications access the same set of data with different access patterns.

Conclusion: On a multicore system, a thread can slow down its co-running threads by flushing their to-be-reused data out of the shared last level cache. SRM-buffer addresses this by enhancing the page allocation policies of the OS buffer so that cache pollution is limited to the corresponding cache regions. SRM-buffer detects block sequences and allocates physical pages accordingly. Our evaluation with a prototype shows that it improves application performance and decreases execution times.