Gaining Insights into Multicore Cache Partitioning: Bridging the Gap between Simulation and Real Systems


Presented by Hadeel Alabandi

Introduction and Motivation
- Cache partitioning and sharing is a serious issue for the effective utilization of multicore processors.
- Existing studies evaluated cache partitioning by simulation, which has several limitations:
  - Excessive simulation time
  - Absence of OS activities
  - Proneness to simulation inaccuracy

Introduction and Motivation (cont.)
- This paper uses a software approach.
- It supports static and dynamic cache partitioning through memory address mapping.
- It emulates a hardware partitioning mechanism in order to examine cache partitioning policies on real systems.
- Three metrics are used in the evaluation, one per optimization objective:
  - Performance
  - Fairness
  - QoS

Cache Partitioning for Multicore Processors
- Cache partitioning has two interdependent parts:
  - Mechanism: enforces cache partitioning and provides the inputs that the partitioning policy needs
  - Policy: decides how much of the cache to allocate to each program for a given optimization objective

Adopted Evaluation Metrics in the Study
- Performance metrics
  - Throughput (IPCs): the absolute number of IPCs
  - Combined miss rates: summarizes the miss rates of the co-scheduled programs
  - Combined misses: summarizes the number of cache misses
- QoS metrics
  - QoS constraints are assumed never to be violated in this study
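The performance metrics above can be sketched in a few lines. This is a minimal illustration that assumes each combined metric is a plain sum over the co-scheduled programs; the slide does not give the formulas, and the function names are hypothetical.

```python
from typing import Sequence


def throughput(ipcs: Sequence[float]) -> float:
    """Throughput: total IPC over all co-scheduled programs (assumed to be a sum)."""
    return sum(ipcs)


def combined_miss_rate(miss_rates: Sequence[float]) -> float:
    """Combined miss rate: per-program miss rates added together (assumption)."""
    return sum(miss_rates)


def combined_misses(misses: Sequence[int]) -> int:
    """Combined misses: total number of cache misses of all programs (assumption)."""
    return sum(misses)


if __name__ == "__main__":
    # Two co-scheduled programs with made-up measurements.
    print(throughput([1.0, 0.5]))
    print(combined_misses([1200, 300]))
```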

Adopted Evaluation Metrics in the Study (cont.)
- Fairness metrics
  - Miss rates
  - The number of misses
  - The slowdown of each co-scheduled program should be identical after cache partitioning
  - In the study, fairness metrics are defined relative to single-core execution with a dedicated L2 cache
- Correlation between a policy metric and the evaluation metric
  - The data required for the policy metric and the evaluation metric were acquired by running a workload with different cache partitionings
  - The resulting value is in the range -1 to 1
  - If the result is 1, the correlation between the two metrics is perfect
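A correlation value in the range -1 to 1 can be computed as follows. The slide does not name the coefficient, so this sketch assumes the Pearson correlation coefficient; the sample values are made up, with one value per cache partitioning of the same workload.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation of two equally long series; the result lies in [-1, 1]."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical policy-metric and evaluation-metric values measured
    # under four different cache partitionings.
    policy_metric = [1.0, 2.0, 3.0, 4.0]
    evaluation_metric = [2.0, 4.0, 6.0, 8.0]
    print(pearson(policy_metric, evaluation_metric))
```

A value near 1 would suggest that the policy metric is a good stand-in for the evaluation metric; a value near 0 would suggest that it is not.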

Static OS-based Cache Partitioning
- A static cache partitioning policy predetermines the number of cache blocks allocated to each program at the beginning of its execution.
- Page coloring is used as the partitioning mechanism:
  - In the physical address, several bits are common to the cache index and the physical page number
  - These bits are used as the page color
  - A physically addressed cache is divided into non-intersecting regions by page color
  - Pages with the same color are mapped to the same cache region
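The mapping from a physical address to a page color can be sketched as below. The cache size and associativity match the L2 cache described in the experimental platform (4 MB, 16-way); the 4 KB page size and 64-byte cache line are assumptions, and the number of colors the study actually used is not stated in these slides.

```python
CACHE_SIZE = 4 * 1024 * 1024  # shared L2 size from the experimental platform
WAYS = 16                     # associativity from the experimental platform
PAGE_SIZE = 4096              # assumption: 4 KB pages
LINE_SIZE = 64                # assumption: 64-byte cache lines


def num_colors(cache_size: int = CACHE_SIZE, ways: int = WAYS,
               page_size: int = PAGE_SIZE) -> int:
    """Number of page colors: bytes covered by one way divided by the page size."""
    return max(1, (cache_size // ways) // page_size)


def page_color(phys_addr: int) -> int:
    """Color of the page holding phys_addr: the low bits of the physical page
    number that also belong to the cache set index."""
    return (phys_addr // PAGE_SIZE) % num_colors()


def cache_set(phys_addr: int) -> int:
    """Cache set index of phys_addr."""
    sets = CACHE_SIZE // (WAYS * LINE_SIZE)
    return (phys_addr // LINE_SIZE) % sets


if __name__ == "__main__":
    sets_per_color = (CACHE_SIZE // (WAYS * LINE_SIZE)) // num_colors()
    addr = 5 * PAGE_SIZE + 128
    # Every line of a page falls into the set range owned by the page's color.
    print(num_colors(), page_color(addr), cache_set(addr) // sets_per_color)
```

Under these assumptions, giving a process only pages of certain colors confines it to the matching slice of cache sets, which is what makes the regions non-intersecting.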

Cache Partitioning: Page Coloring (two figure slides)

Dynamic OS-based Cache Partitioning
- Cache quotas are adjusted among processes dynamically.
- Page recoloring procedure, used when the cache resources of a process (i.e., the number of colors it uses) are increased:
  - The kernel rearranges the virtual memory mapping of the process
  - It allocates physical pages of the new color
  - It copies the memory contents
  - It frees the old pages
- Remapping virtual pages causes performance overhead:
  - The overall overhead can be reduced by lowering the frequency of cache allocation adjustment
  - Another option is lazy page migration, in which the content of a colored page is moved only when it is accessed
- The average overhead of dynamic partitioning was reduced to 2%.
- The highest migration overhead observed was 7%.
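The allocate, copy, free, and remap steps, combined with lazy migration, can be illustrated with a toy user-space model. This is not kernel code and not the authors' implementation: `ToyMemory` and `ToyProcess` are hypothetical names, a frame's color is simply its frame number modulo the number of colors, and the model only covers a page that sits on a color the process no longer owns.

```python
from collections import defaultdict


class ToyMemory:
    """Physical frames grouped by color (frame number modulo num_colors)."""

    def __init__(self, num_frames: int, num_colors: int):
        self.num_colors = num_colors
        self.free = defaultdict(list)
        self.data = {}
        for frame in range(num_frames):
            self.free[frame % num_colors].append(frame)

    def alloc(self, colors):
        """Return a free frame of one of the allowed colors (lowest color first)."""
        for color in sorted(colors):
            if self.free[color]:
                return self.free[color].pop()
        raise MemoryError("no free frame of an allowed color")

    def release(self, frame: int):
        self.data.pop(frame, None)
        self.free[frame % self.num_colors].append(frame)


class ToyProcess:
    def __init__(self, mem: ToyMemory, colors):
        self.mem = mem
        self.colors = set(colors)
        self.page_table = {}  # virtual page -> physical frame
        self.migrations = 0

    def set_colors(self, colors):
        """Change the color quota. Nothing is copied here: migration is lazy."""
        self.colors = set(colors)

    def touch(self, vpage: int, value=None):
        """Access a page, migrating it first if it sits on a disallowed color."""
        frame = self.page_table.get(vpage)
        if frame is None:
            frame = self.mem.alloc(self.colors)
            self.page_table[vpage] = frame
        elif frame % self.mem.num_colors not in self.colors:
            new_frame = self.mem.alloc(self.colors)             # allocate new color
            self.mem.data[new_frame] = self.mem.data.get(frame)  # copy contents
            self.mem.release(frame)                              # free old page
            self.page_table[vpage] = new_frame                   # remap
            frame = new_frame
            self.migrations += 1
        if value is not None:
            self.mem.data[frame] = value
        return self.mem.data.get(frame)
```

In this model a page that is never touched after `set_colors` is never copied, which is the reason lazy migration can lower the overhead.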

Page Recoloring (figure slide)

Dynamic Cache Partitioning Policies
- The policies adjust the cache partitioning periodically, at the end of each epoch.
- Dynamic cache partitioning policy for performance
  - Adjusts the cache partitioning dynamically
  - Metrics:
    - Throughput (IPCs)
    - Combined miss rate
    - Combined misses
    - Fair speedup
- Dynamic cache partitioning policy for fairness
  - Two dynamic policies were implemented, based on FM0 and FM4
  - FM0 is the evaluation metric (i.e., the ratio of the current cumulative IPC over the baseline IPC)
  - FM4 is based on the cache miss rates
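One way to picture an epoch-based adjustment is a simple hill climb over the number of colors given to one of two programs. The slides do not describe the actual decision rule, so the rule below (keep moving one color in the same direction while the metric does not get worse, otherwise reverse) is an assumption made for illustration, as are the function names and the synthetic metric.

```python
from typing import Callable


def next_colors(current: int, previous: int, metric_now: float,
                metric_prev: float, total_colors: int) -> int:
    """Colors for program A in the next epoch. Lower metric is better
    (for example a combined miss rate). Program B gets the remainder."""
    direction = 1 if current >= previous else -1
    if metric_now > metric_prev:  # last move made things worse: reverse
        direction = -direction
    # Keep at least one color for each of the two programs.
    return max(1, min(total_colors - 1, current + direction))


def run_epochs(measure: Callable[[int], float], start: int,
               total_colors: int, epochs: int) -> int:
    """Apply the rule at the end of every epoch and return the final quota."""
    previous, current = start, start
    metric_prev = measure(previous)
    for _ in range(epochs):
        metric_now = measure(current)  # measured during this epoch
        chosen = next_colors(current, previous, metric_now,
                             metric_prev, total_colors)
        previous, metric_prev, current = current, metric_now, chosen
    return current


if __name__ == "__main__":
    # Synthetic metric with its minimum at 10 colors for program A.
    print(run_epochs(lambda a: (a - 10) ** 2, start=4, total_colors=16, epochs=20))
```

A rule like this keeps oscillating by one color around the best partition, so each epoch can trigger recoloring; lowering the adjustment frequency reduces that cost.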

Dynamic Cache Partitioning Policies (cont.)
- Dynamic cache partitioning policy for QoS
  - Two-core workload of two programs:
    - The first is the target program
    - The second is the partner program
  - QoS guarantee:
    - Ensure that the performance of the target program is greater than or equal to X% of a baseline, which is the execution of a homogeneous workload on a dual-core processor with half of the cache capacity allocated to each program
    - Increase the performance of the partner program
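The QoS condition itself is a simple threshold test, sketched below together with one possible per-epoch reaction. The one-color-at-a-time adjustment is an assumption for illustration, not the rule from the paper, and the function names are hypothetical.

```python
def qos_satisfied(target_ipc: float, baseline_ipc: float, x_percent: float) -> bool:
    """True if the target program reaches at least X% of its baseline IPC."""
    return target_ipc >= baseline_ipc * x_percent / 100.0


def adjust_for_qos(target_colors: int, total_colors: int, target_ipc: float,
                   baseline_ipc: float, x_percent: float) -> int:
    """Colors for the target program in the next epoch (assumed rule).

    If the guarantee is violated, the target gets one more color; otherwise
    one color is handed to the partner to raise its performance. Each program
    always keeps at least one color.
    """
    if qos_satisfied(target_ipc, baseline_ipc, x_percent):
        return max(1, target_colors - 1)
    return min(total_colors - 1, target_colors + 1)


if __name__ == "__main__":
    # Target runs at 85% of baseline while 90% is required: grow its share.
    print(adjust_for_qos(8, 16, target_ipc=0.85, baseline_ipc=1.0, x_percent=90))
```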

Experimental Methodology
- Hardware and software platform
  - Dell PowerEdge 1950
  - Two dual-core, 3.0 GHz Intel Xeon 5160 processors and 8 GB of fully buffered DIMM (FB-DIMM) main memory
  - Shared, 4 MB, 16-way set-associative L2 cache
  - Each core has a private 32 KB instruction cache and a private 32 KB data cache
  - Red Hat Enterprise Linux 4.0
  - Kernel linux-2.6.20.3
  - Performance data collected using pfmon

Evaluation Results
- The results show the improvement of the best static partitioning of each workload over a shared cache.

Performance: Static and Dynamic (figure slide)

Fairness: Correlation between Evaluation Metrics and Policy Metrics (figure slide)

QoS: Static and Dynamic (figure slide)

Related Work
- Cache partitioning for multicore processors
- Page coloring

Summary
- An OS-based cache partitioning mechanism for multicore processors was designed and implemented.
- It was used to study different cache partitioning policies.
- Some findings of simulation-based studies were confirmed; however, this approach also reveals new insights that simulation has not shown.
- Future work
  - Reduce the cache partitioning overhead
  - Add an easy-to-use user interface
  - Conduct partitioning research at the compiler level for both multiprogrammed and multithreaded applications

Discussion
- Did the OS-based approach provide new insights and observations that simulation could not show, or failed to show?

References
- Gaining Insights into Multicore Cache Partitioning: Bridging the Gap between Simulation and Real Systems
- http://www.contrib.andrew.cmu.edu/~hyoseunk/pdf/ecrts13-hyos-slides.pdf
- http://ftp.cs.rochester.edu/~xiao/eurosys09/euro061-zhang.pdf