Coordinated Bank and Cache Coloring for Temporal Protection of Memory Accesses

1 Noriaki Suzuki, 2 Hyoseung Kim, 2 Dionisio de Niz, 2 Bjorn Andersson, 2 Lutz Wrage, 2 Mark Klein, and 2 Ragunathan (Raj) Rajkumar
n-suzuki@ha.jp.nec.com, hyoseung@cmu.edu, {dionisio, bandersson, lwrage, mk}@sei.cmu.edu, raj@ece.cmu.edu
1 NEC Corporation, Japan   2 Carnegie Mellon University, USA

Abstract: In commercial-off-the-shelf (COTS) multi-core systems, the execution times of tasks become hard to predict because of contention on shared resources in the memory hierarchy. In particular, a task running on one processor core can delay the execution of another task running on another processor core. This is due to the fact that tasks can access data in the same cache set shared among processor cores or in the same memory bank in the DRAM memory (or both). Such cache and bank interference effects have motivated the need to create isolation mechanisms for resources accessed by more than one task. One popular isolation mechanism is cache coloring, which divides the cache into multiple partitions. With cache coloring, each task can be assigned exclusive cache partitions, thereby preventing cache interference from other tasks. Similarly, bank coloring allows assigning exclusive bank partitions to tasks. While cache coloring and some bank coloring mechanisms have been studied separately, interactions between the two schemes have not been studied. Specifically, while memory accesses to two different bank colors do not interfere with each other at the bank level, they may interact at the cache level. Similarly, two different cache colors avoid cache interference but may not prevent bank interference. It is therefore necessary to coordinate cache and bank coloring approaches. In this paper, we present a coordinated cache and bank coloring scheme that is designed to prevent cache and bank interference simultaneously. We also developed color allocation algorithms for configuring a virtual memory system to support our scheme, which has been implemented in the Linux kernel.
In our experiments, we observed that the execution time can increase by 60% due to inter-task interference when we use only cache coloring. Our coordinated approach can reduce this figure down to 12% (an 80% reduction).

I. INTRODUCTION

In multi-core systems, the execution of one task on one processor core can depend on the execution of a task on another processor core, which is a major concern for hard real-time systems. This dependency is mostly caused by the following effects:

E1. Eviction of cache blocks in a cache memory shared between processor cores.
E2. Contention on the bus between the memory controller and the DRAM modules.
E3. Eviction of the currently open row in the memory bank in DRAM modules.
E4. Reordering of memory requests in the queue of the memory controller, caused by the memory controller favoring memory accesses to the currently open row in a memory bank.

Consequently, the research community has developed (i) methods for analyzing the impact of these dependencies and (ii) run-time mechanisms that protect the execution time of one task from these effects. Table I summarizes the state of the art. We have heard from software practitioners a strong preference for COTS multicore systems and therefore we center our discussion on those.

                           Works with       Deals with effect
                           COTS multicore   E1    E2    E3    E4
Analysis method   [16]     No               No    Yes   No    No
Mechanism         [15]     No               No    Yes   No    No
                  [16]     No               No    Yes   No    No
                  [14]     No               No    Yes   Yes   Yes
                  [5]      No               Yes   Yes   Yes   Yes
Analysis method   [4]      Yes              No    Yes   No    No
                  [10]     Yes              No    Yes   No    No
                  [13]     Yes              No    Yes   No    No
Mechanism         [9]      Yes              No    No    Yes   Yes
                  [11]     Yes              Yes   No    No    No
                  [17]     Yes              No    Yes   No    No
                  Our work Yes              Yes   No    Yes   Yes

TABLE I: Comparison with previous work.

This material is based upon work funded and supported by the Department of Defense under Contract No. FA8721-05-C-0003 with Carnegie Mellon University for the operation of the Software Engineering Institute, a federally funded research and development center. This material has been approved for public release and unlimited distribution. DM-0000537
The queueing discipline used for resolving multiple requests to use the memory bus is often undocumented in COTS multicore systems and, consequently, knowing exactly which request will be served at any given time is difficult. Therefore, E2 has been dealt with [4], [13] by developing analysis techniques that assume the arbitration for the memory bus is work-conserving but make no other assumption on the arbitration. E2 has also been dealt with [17] by developing mechanisms where a task is assigned a certain number of bus accesses that it is allowed to generate. At run-time, the number of cache misses the task generates is periodically monitored (using performance monitoring counters), and if this number reaches a high value (close to the allowed number), the task is suspended because it has generated a large number of memory transfers on the bus.

Cache coloring is an often-favored [11] technique for providing protection against E1. This technique sets up the virtual-to-physical address translation so that no two tasks access the same cache set in the shared cache, and hence one task cannot evict a cache block that another task has fetched into the shared cache. The same idea has been used for dealing with E3 and E4, and is referred to as bank coloring [9]. Software practitioners clearly benefit from using both cache coloring and bank coloring, but while the virtual address translation is used for both of them, it is not clear how to configure the translation
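The budget-based mechanism attributed to [17] can be illustrated with a small sketch. This is a conceptual model only, with simulated counter values, not the cited implementation or a real performance-counter interface:

```python
# Conceptual sketch of a memory-access budget mechanism in the style
# described above: a task is allowed a budget of bus accesses
# (approximated by its last-level cache miss count), the miss count is
# sampled periodically, and the task is suspended once the count comes
# close to the budget. The samples below are made up for illustration.
def enforce_budget(miss_samples, budget, threshold=0.9):
    """Return the index of the monitoring sample at which the task is
    suspended, or None if it never approaches its budget."""
    for k, misses in enumerate(miss_samples):   # periodic monitoring
        if misses >= threshold * budget:        # close to the allowance
            return k                            # suspend the task here
    return None

print(enforce_budget([100, 500, 950, 1200], budget=1000))  # -> 2
```

With a budget of 1000 misses and a 90% threshold, the task is suspended at the third sample, before it can exceed its allowance.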

to achieve both at the same time, and the research literature provides no guidance on how to achieve such coordinated bank and cache coloring. Therefore, in this paper, we present coordinated bank and cache coloring. Specifically, we make the following contributions:

- We create the first cache-bank color conflict model that describes the conflicts that a cache coloring scheme has with bank coloring schemes in generic commercial multi-core processors.
- We develop the first operating system (OS) mechanism that provides coordinated cache and bank coloring.
- We create the first algorithm that coordinates the allocation of tasks to cores and of bank and cache colors to tasks.

The remainder of the paper is organized as follows. Section 2 gives a background on memory hierarchies. Section 3 gives an introduction to cache and bank coloring. Section 4 presents algorithms for allocating processor cores, cache colors, and bank colors to tasks. Section 5 gives conclusions.

II. BANK AND CACHE INTERFERENCE

In this section we discuss the nature of the task interference related to caches and memory banks. We start by providing a general introduction to the memory hierarchy to set the context. Then we discuss cache and bank interference separately, followed by the interactions between them.

A. Memory Hierarchy Background

To cope with the stark difference between processor and memory access speeds, today's processors implement a memory hierarchy that combines memory elements of different speeds. These elements, ordered in decreasing access speed and increasing memory capacity, start with the processor registers, followed by multiple levels of cache, and end with memory banks. As a stream of instructions executes in the processor, its data (and the instructions themselves) is loaded first into the last-level cache, then into each of the cache levels closer to the processor, and finally into the processor registers. Repeated uses of the same data are sped up by accessing it in the fastest memory where it is loaded.
Given that the set of data accessed by a program can exceed the size of the memory element used, old data is evicted and replaced by the newly accessed data. This eviction induces a longer access time the next time the data is accessed, given that it will need to be fetched from a slower memory element (down to the main memory). While some evictions are triggered by the same task that loaded the data, they can also be caused by another task running on the same or other cores in the processor. In this paper we study the evictions across tasks in the shared cache and in memory banks. However, the nature of these two types of evictions and the interference they induce are different. In the following we explain these two types in detail.

1) Cache Eviction and Interference: The cache hierarchy takes advantage of the spatial and temporal locality of memory accesses to preserve the data that is more likely to be accessed in the fastest cache possible. Specifically, it is well known that when a program accesses a memory location, it is highly likely that the next few accesses will be close to the same location. This closeness of locations is known as spatial locality, for which caches are organized in cache lines of multiple bytes (typically 64) that are loaded together. That is, when a memory location is accessed and loaded into the cache, the surrounding memory locations in the same cache line are loaded as well. In this way, the next location accessed by the program will likely already be loaded in the cache line and its access will be much faster. Temporal locality, on the other hand, relates to the fact that memory locations accessed in the past are likely to be accessed again in the future. Given the lack of an oracle that can tell us the future, cache mechanisms assume that a recently-accessed location is more likely to be accessed again than a less recently-accessed one.
This assumption is implemented in cache-replacement policies that select some approximation of the least-recently-used cache line to be evicted when no empty cache line is available, since it would be the least likely to be accessed again. When two (or more) tasks access the same cache in an interleaved fashion, they break each other's temporal and spatial locality, inducing costly evictions. When both tasks are running on the same core, these interleavings are the result of context switches between tasks and happen only when the scheduler decides to perform a context switch. The effect of this eviction, known as Cache-Related Preemption Delay (CRPD), has been studied in the past but presents a significant schedulability penalty with limited predictability. When the tasks are on different cores, the interleaving is continuous as they run in true parallelism, and it can produce evictions that are perhaps more costly and less predictable. The cache is partitioned into sets restricted to contain words from different regions (memory ranges) of main memory. This set partitioning prevents memory locations restricted to be loaded into one cache set from evicting data loaded from memory addresses that map to another cache set. We will use this property to implement our partitioning scheme, to be discussed shortly.

2) Memory Bank Evictions and Interference: Main memory is divided into memory banks with the main purpose of parallelizing the access to the physical devices and reducing the speed difference with the processor. These banks are organized into ranks, which bundle multiple banks together, and ranks are grouped into channels. Different banks, ranks, and channels can be accessed in parallel with very little interference between each other. Memory banks are internally organized in rows and columns. A row is a sequence of memory locations that are divided into columns. A column within a row is the minimum amount of memory transferred from/to the banks. To access a memory address, the memory controller first identifies the corresponding channel, rank, bank, row, and column where the memory address is located.
Then, it transfers the row where the address is located to the bank row buffer. Finally, it accesses the column where the address is located. Subsequent accesses to columns from the same row can be performed directly from

the row buffer, saving the time to load such a buffer. However, if an address mapped to a different row is accessed, then the row buffer needs to be reloaded and the previous data is evicted. Row buffers implement a caching strategy similar to the cache line, matching the spatial locality of the program. The temporal locality in memory banks, however, has a different flavor from the locality of the cache. That is, while in the cache memory temporal locality is implemented as a replacement policy (e.g., least-recently used) that predicts the future based on the past, to access locations in memory banks the memory controller has more information. In particular, memory access requests to main memory arrive at the memory controller from multiple cores at the same time. Given that access to the memory banks is much slower than the cache, the memory controller keeps a queue of requests waiting to be served. This queue in fact contains the sequence of future accesses to the banks, and hence the memory controller does not need to guess this future. More importantly, the memory controller is able to change the future through the reordering of the requests in this queue. Specifically, memory controllers reorder the request queue to avoid accesses to other rows getting in the middle of a sequence of requests to the same row. This reordering reduces the row buffer evictions and improves the throughput of the whole system and of the program whose accesses were favored in the reordering. Unfortunately, the reordering can also enlarge the worst-case execution time of the tasks. However, as happens with different cache lines, accesses to different banks do not interfere with each other. Furthermore, it is worth noting that the reordering effect is only significant between accesses from different cores, given that the effect that happens in the context switch between two tasks running on the same core is negligible.

III. COORDINATED CACHE AND BANK MEMORY COLORING

The mapping of a memory location to specific cache sets or memory banks is determined by its address.
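The open-row behavior described above can be illustrated with a small model. The bank geometry and the access sequences below are illustrative assumptions, not measurements from the paper:

```python
# Minimal model of a DRAM bank's row buffer: an access to the currently
# open row is served from the buffer (a hit); an access to a different
# row evicts the open row and reloads the buffer (a row-buffer
# conflict, the source of bank interference effect E3).
def row_buffer_events(row_accesses):
    open_row = None
    hits = conflicts = 0
    for row in row_accesses:
        if row == open_row:
            hits += 1           # served from the open row buffer
        else:
            if open_row is not None:
                conflicts += 1  # previous row evicted, buffer reloaded
            open_row = row
    return hits, conflicts

# Task A alone streams through row 7; interleaving task B's accesses to
# row 3 of the same bank turns every one of A's hits into a conflict.
alone = row_buffer_events([7, 7, 7, 7])
interleaved = row_buffer_events([7, 3, 7, 3, 7])
print(alone, interleaved)  # -> (3, 0) (0, 4)
```

The same accesses that hit the row buffer when a task runs alone become conflicts once a co-running task touches another row of the bank, which is exactly what bank coloring prevents.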
For cache sets, a subset of the address bits is used as the cache set index. This index uniquely identifies the cache set where the address can be loaded. This means that if we restrict different tasks to use memory addresses with different cache indices, then we avoid cache interference between them. Memory banks, on the other hand, are selected with a subset of the address bits that identify the channel, rank, and bank numbers. In the same fashion as for the cache, restricting different tasks to use different bank numbers eliminates bank interference. Both bank and cache coloring are implemented by allocating physical pages with different cache-set indices (or bank numbers) to different tasks, and hence this is also known as page coloring.

A. Cache-Color Address Bits

In order to implement cache coloring it is necessary to first identify the address bits that form the cache index. To do this it is possible to obtain the cache parameters from the processor's specification. In particular, it is necessary to obtain: (1) the cache size, (2) the cache line size, and (3) the set associativity. With these data it is possible to calculate the number of address bits needed to identify the cache set (C) of an address with the formula C = log2(S / (W · L)), where S is the cache size, W the number of ways of the cache, and L the cache line size. To locate the starting bit of the cache index we use the size of the cache line (L) and apply log base 2 (log2(L)). As an example, consider the specification of the Intel Core i7-2600 processor presented in Table II. In this case the number of cache set index bits is calculated as C = log2(8MB / (64 · 64)) = 11.¹ Then the starting bit can be located at log2(64) = 6, extending up to bit 16.

Last-Level Cache Size (S)   8 MB
Number of Ways (W)          16 × 4¹
Cache Line Size (L)         64 bytes

TABLE II: Intel Core i7-2600 Processor Cache Specification

Rows    Rank   Bank    Col 1   Channel   Col 2   Byte
31-18   17     16-14   13-7    6         5-3     2-0

Fig. 1: Typical Bank Address Layout
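The computation above can be sketched in a few lines, instantiated with the Table II numbers:

```python
import math

# Sketch of the cache-index calculation of Section III-A: the number of
# set-index bits C = log2(S / (W * L)), starting right after the
# log2(L) line-offset bits.
def cache_index_bits(size_bytes, ways, line_bytes):
    """Return (start_bit, num_bits) of the cache set index field."""
    num_sets = size_bytes // (ways * line_bytes)
    c = int(math.log2(num_sets))        # C = log2(S / (W * L))
    start = int(math.log2(line_bytes))  # index begins after line offset
    return start, c

# Intel Core i7-2600: 8 MB LLC, 16 ways per slice x 4 slices, 64-byte
# lines, so the set index occupies bits 6..16.
start, c = cache_index_bits(8 * 2**20, 16 * 4, 64)
print(start, c)  # -> 6 11
```

With 11 index bits starting at bit 6, the index ends at bit 16, matching the example in the text.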
While all of the cache index address bits could potentially be used to create different colors, the virtual memory system only allows us to control the address bits that are included in the page frame number. In other words, a memory address is divided into a page number and a byte offset inside the page. The virtual memory system works by replacing virtual page numbers with physical page numbers, a.k.a. frame numbers. As a result, only the address bits that are included in the identification of the frame number are available to the virtual memory manager. This depends on the page size, which is typically configured as 4KB by the OS. When the page size is 4KB, the address bits for the page frame number start from bit 12. As a result, cache coloring can control only the 5 bits that range from bit 12 to bit 16.

¹ The shared cache of the Intel Core i7 processor consists of four cache slices, which makes the number of ways of each cache slice be multiplied by four. More details on this can be found in [7].

B. Bank-Color Address Bits

The address bits used to indicate individual memory banks can be found in a similar fashion to the cache bits. Figure 1 presents a typical layout of the address bits for banks [6]. There are three features in this layout that are worth noting. First, the channel is located in the low-order bits (bit 6). This is aimed at interleaving the memory accesses across channels to allow parallel access to consecutive cache lines in a pipelined fashion. Secondly, the bank bits are in lower-order bits than the rows. This has the same motivation as the channel: favoring the interleaving of accesses to banks to parallelize memory accesses across banks for the same row. Finally, the rows are the most significant bits in the layout. This has the intention of minimizing the likelihood of changing rows when accessing consecutive memory. All of these features are aimed at improving the memory access time for a single task. Unfortunately, they do not help prevent the interference across tasks, which is our goal.

In modern processors, the bank address bit layout in Figure 1 is augmented with a randomization strategy to minimize the likelihood of bank collisions. This strategy is implemented by XORing row bits with the bank bits to produce the final bank bits. For instance, Figure 2 presents the XOR strategy of the Intel Core i7-2600 memory controller for four single-sided DIMMs of 2GB each. It is worth noting that modern processors do not publish the bank address bits. As a result, previous work has been published to discover these address bits in an experimental fashion [9]. We took this work and created an improved procedure (not presented due to space constraints). As happens with cache coloring, bank coloring can only take advantage of the bits that can be controlled by the virtual memory manager, which in our case rules out the channel bit (bit 6).

[Figure: the bank bits (16-14) are produced by XORing with higher-order row bits; bit 6 selects the channel.]
Fig. 2: Randomization Bank Bit Strategy (XOR) of the Intel Core i7-2600

C. Bank and Cache Color Interaction Model

A page color is a collection of pages that do not interfere with each other. When pages are colored to avoid cache interference, they guarantee that pages of different colors will not evict cache lines from each other. Similarly, when they are designed based on memory banks, they guarantee that they do not map to the same bank and hence do not induce row buffer evictions or reorder memory requests. In this paper we aim at avoiding both cache and bank interference and, hence, we require a coloring scheme that avoids both types of interference. Unfortunately, not only do cache and bank colors intersect, but neither subsumes the other. This means that it is not possible to build a color hierarchy where one type of color is subsumed into another, e.g., cache colors within bank colors. The bank and cache color intersection is caused by the intersection of the address bits that identify these colors.

3-Bit Value   Bank Color   Cache Color
000           00           00
001           00           01
010           01           10
011           01           11
100           10           00
101           10           01
110           11           10
111           11           11

TABLE III: Bank and Cache Colors Intersection Example

              Bank color
Cache color   00   01   10   11
    00        X         X
    01        X         X
    10             X         X
    11             X         X

Fig. 3: Bank and Cache Color Intersection Matrix
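As a concrete illustration of the address-bit fields of Section III-B, the typical layout of Figure 1 can be decoded with shifts and masks. This sketch assumes exactly the bit positions shown there; real controllers vary, are often undocumented, and may additionally XOR the bank bits as described above:

```python
# Decode a 32-bit physical address per the typical layout of Fig. 1:
# rows 31-18, rank 17, bank 16-14, col-1 13-7, channel 6, col-2 5-3,
# byte 2-0. Illustrative only; actual mappings differ per controller.
def decode_bank_address(addr):
    return {
        "row":     (addr >> 18) & 0x3FFF,  # 14 bits
        "rank":    (addr >> 17) & 0x1,
        "bank":    (addr >> 14) & 0x7,     # 3 bits
        "col1":    (addr >> 7)  & 0x7F,    # 7 bits
        "channel": (addr >> 6)  & 0x1,
        "col2":    (addr >> 3)  & 0x7,
        "byte":    addr         & 0x7,
    }

# Build an address from known field values and decode it back.
addr = (5 << 18) | (1 << 17) | (3 << 14) | (42 << 7) | (1 << 6) | (6 << 3) | 5
print(decode_bank_address(addr))
```

Decoding the constructed address recovers row 5, bank 3, channel 1, and so on, confirming the field boundaries.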
To illustrate this intersection, consider a memory system where both the cache and the bank colors are identified by two address bits and one of the bits is shared between the two. Specifically, bits i and i+1 identify the bank colors while bits i+1 and i+2 identify the cache colors. Both cache and bank colors can be represented by all possible values of these three bits, leading to the intersection of colors depicted in Table III. Note that the intersection is only partial. For instance, the bank color 00 in our example intersects with cache colors 00 and 01 but not with 10 or 11. Similarly, cache color 00 only intersects with bank colors 00 and 10. In order to better understand the impact of the color intersection we use an intersection matrix where columns represent bank colors and rows cache colors. Figure 3 shows the matrix for the color scheme of Table III, with an X marking the colors that intersect. The matrix in Figure 3 simplifies the visualization of both the color intersections and the intersection gaps. This allows us to highlight two additional facts. First, the collection of all cells with an X in the intersection matrix represents the whole memory space.² Secondly, each cell with an X represents a subset of pages that do not intersect with other subsets of pages in other cells, either through the cache or through memory banks. The intersection of bank and cache bits reduces the number of independent cells in the intersection matrix. This can be seen as a reduction of the number of cache colors available whenever a bank color is selected. Specifically, the number of bank colors is B = 2^numBankBits, and the number of cache colors per bank is H = 2^(numCacheBits - numIntersectingBits).

1) Bank Address Bit Randomization: The bank-bit randomization scheme discussed in Section III-B has the effect of removing the gaps in the intersection matrix. This is because it is possible to set the common bits to whatever value is needed to produce a particular cache bit and adjust the corresponding XORed bits accordingly.
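The three-bit example can be checked mechanically. The sketch below rebuilds the Table III intersection and then shows that XORing the bank bits with freely chosen row bits removes the gaps:

```python
# Rebuild the Table III color intersection: a 3-bit value written left
# to right as (bit i, bit i+1, bit i+2); bits (i, i+1) give the bank
# color and bits (i+1, i+2) the cache color, sharing bit i+1.
def colors(v):
    s = f"{v:03b}"          # left to right: bit i, bit i+1, bit i+2
    return s[0:2], s[1:3]   # (bank color, cache color)

matrix = {}                 # cache color -> intersecting bank colors
for v in range(8):
    bank, cache = colors(v)
    matrix.setdefault(cache, set()).add(bank)
print(matrix)               # cache color 00 meets only bank colors 00, 10

# With the XOR randomization, the final bank color is the raw bank
# bits XORed with row bits we are free to choose, so every bank color
# becomes reachable for any cache color: the gaps disappear.
def xor_reachable(cache_color):
    reachable = set()
    for v in range(8):
        bank, cache = colors(v)
        if cache != cache_color:
            continue
        for row in range(4):                 # any row-bit value works
            reachable.add(int(bank, 2) ^ row)
    return {f"{b:02b}" for b in reachable}

print(xor_reachable("10"))  # all four bank colors
```

The first part reproduces the partial intersection of Figure 3; the second reproduces the gap-free matrix that the XOR scheme yields.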
Figure 4 presents the variation of the intersection matrix from Figure 3 when the XORing mechanism is added. In this case, for a given two-bit value of the cache color, we can select a two-bit bank value that has the same shared-bit value, along with the corresponding row bits that produce the desired bank color. For instance, if we select the cache bits 10 then, to obtain the bank color 00 (non-existent in Figure 3), we can use bank bits 11 along with row bits 11 to produce bank color 11 ⊕ 11 = 00. All row and bank bit combinations are shown in the bank-color headings of Figure 4. For this case, if we calculate the number of bank colors as B = 2^numBankBits, then the number of cache colors per bank is calculated as H = 2^numCacheBits. Note that in this case numIntersectingBits does not play a role in computing H.

[Figure: each bank-color column heading lists the row/bank bit combinations that produce it through row ⊕ bank (e.g., bank color 00 results from the row/bank pairs 00/00, 01/01, 10/10, and 11/11), and every cell of the matrix is marked with an X.]
Fig. 4: Intersection Matrix with XOR.

D. Memory Interference Model

We now define our new memory interference model. We define a system as

S = (τ = {τ_1, ..., τ_n}, π = {π_1, ..., π_m}, B, H)

where

- τ_i is a task defined as τ_i = (T_i, {C_i^1, ..., C_i^k}, M_i) with a period T_i, a set of execution times where each C_i^j represents the worst-case computation time without interference when task τ_i is assigned j cache colors, and a memory requirement M_i that represents the number of memory cells as shown in Figure 4.²
- π_j represents a processor core.
- B is the number of bank colors.
- H is the number of cache colors.

We will define a concrete allocation and an abstract allocation, where the former states which specific set of colors is assigned to a task and the latter specifies only the number of colors assigned. A concrete allocation is defined as CA = {ca_i = (P_i, B_i, H_i)} where

- P_i ∈ [1, m] identifies the processor where τ_i is allocated. For convenience we use the value 0 (zero) when a task is not assigned to any processor.
- B_i is the set of bank colors assigned to task τ_i.
- H_i is the set of cache colors assigned to task τ_i.

An abstract allocation is defined as AA = {aa_i = (P_i, B_i, H_i)} where

- P_i ∈ [1, m] identifies the processor where τ_i is allocated.
- B_i is the number of bank colors assigned to task τ_i.
- H_i is the number of cache colors assigned to task τ_i.

We will now define memory interference. For this purpose, we introduce the two predicates below.

Definition 1: conflict(CA) = ∃ ca_i, ca_j ∈ CA (i ≠ j) : ((B_i ∩ B_j ≠ ∅ ∧ P_i ≠ P_j) ∨ (H_i ∩ H_j ≠ ∅))

This encodes that a conflict exists if a bank color is shared across cores or a cache color is shared across tasks.

Definition 2: conflict(AA) = ((Σ_{π_j ∈ π} max_{i ∈ [1,n], P_i = j} B_i) > B) ∨ ((Σ_{i ∈ [1,n]} H_i) > H)

In this case a conflict exists if the sum over cores of the number of bank colors assigned to each core exceeds the number of available bank colors, or the total number of allocated cache colors exceeds the available cache colors. The bank colors assigned to a core can be shared among all tasks assigned to this core without causing interference.

² In the rest of this paper we refer to these cells as the memory cells, and we will use them to discuss our memory interference model and allocation algorithms.
[Figure: a two-level structure in which each bank color list hangs off a cache color header, so that each (cache color, bank color) pair heads a list of pages, the memory cell; circled numbers show the round-robin allocation order described below.]
Fig. 5: Page management mechanism for combined cache and bank coloring

With these predicates, we define valid (meaning valid allocation) as follows.

Definition 3: valid(CA) = (∀ π_j ∈ π : Σ_{i ∈ [1,n], P_i = j} C_i^{|H_i|} / T_i ≤ 1) ∧ ¬conflict(CA)

Definition 4: valid(AA) = (∀ π_j ∈ π : Σ_{i ∈ [1,n], P_i = j} C_i^{H_i} / T_i ≤ 1) ∧ ¬conflict(AA)

The two previous definitions express that a valid allocation does not over-utilize any core³ and has no conflict. It is easy to see that any given abstract allocation can be transformed into a concrete allocation. Hence, our allocation algorithms (presented in Section IV) generate abstract allocations only.

E. OS Support for Combined Cache and Bank Coloring

In order to maintain physical pages according to their cache and bank colors, we developed a virtual page management mechanism with a two-level hierarchical list structure, as shown in Figure 5. Pages are first categorized according to their cache colors, and then they are sub-categorized according to their bank colors. The pages with the same cache and bank colors are linked in a list, which represents a memory cell. The header of this list contains its cache color x and bank color y as (x, y). The header also maintains its owner task ID. Once a task is assigned its cache and bank colors, the task is restricted to use only physical pages in its own memory cells, thereby preventing cache and bank interference from other tasks. When a task is assigned more than one cache or bank color, we use a round-robin scheme to allocate pages to the task from its memory cells. This approach evenly distributes the task's page usage across multiple memory cells. In Figure 5, the circled numbers represent the page allocation order for a task that is assigned cache colors 1 and 3, and bank colors 1 and 2. Pages are allocated to this task from the memory cells (1, 1), (3, 1), (1, 2), and (3, 2) in round-robin order.
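Definitions 2 and 4 translate directly into code. The sketch below assumes EDF scheduling (utilization bound 1, per the footnote) and uses made-up task parameters for illustration:

```python
# Check an abstract allocation against Definitions 2 and 4: per-core
# bank demand is the max over the core's tasks (bank colors are shared
# within a core) and must fit in B in total; cache colors are summed
# over all tasks and must fit in H; each core's EDF utilization
# sum of C_i^{H_i} / T_i must not exceed 1.
def aa_conflict(aa, B, H):
    cores = {a["P"] for a in aa if a["P"] != 0}
    bank_demand = sum(max(a["B"] for a in aa if a["P"] == j) for j in cores)
    cache_demand = sum(a["H"] for a in aa)
    return bank_demand > B or cache_demand > H

def aa_valid(aa, tasks, B, H):
    cores = {a["P"] for a in aa if a["P"] != 0}
    for j in cores:
        util = sum(t["C"][a["H"]] / t["T"]
                   for t, a in zip(tasks, aa) if a["P"] == j)
        if util > 1:
            return False
    return not aa_conflict(aa, B, H)

# Two tasks on core 1; "C" maps a number of cache colors to the WCET
# with that many colors (illustrative values).
tasks = [{"T": 10, "C": {1: 6, 2: 4}}, {"T": 20, "C": {1: 12, 2: 8}}]
aa = [{"P": 1, "B": 2, "H": 2}, {"P": 1, "B": 2, "H": 2}]
print(aa_valid(aa, tasks, B=4, H=4))  # -> True (utilization 0.8, no conflict)
```

Shrinking the available cache colors to H = 3 makes the same allocation conflict, since the two tasks together request four cache colors.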
We have implemented our page management mechanism as part of the memory reservation scheme [8] in Linux/RK [12], which is based on the Linux 2.6.38.8 kernel. Memory reservation allows an application task to reserve a portion of the physical memory for its exclusive use. Memory reservation maintains a global physical page pool, and our page management mechanism is applied to this page pool. When a taskset is given, each task is assigned its cache and bank colors. Then, physical pages are reserved for each task from the page pool based on their assigned cache and bank color indices.

³ In this paper we assume EDF scheduling.

Fig. 6: Bank Interference Protection (normalized execution times, 80%–180%, of PARSEC benchmarks PS.canneal, PS.ferret, PS.streamcluster, PS.fluidanimate, PS.facesim, PS.freqmine, PS.x264 and SPEC benchmarks SPEC.leslie3d, SPEC.mcf, SPEC.milc, SPEC.sphinx3, comparing no bank coloring (cache coloring only) against combined cache and bank coloring).

F. Effect of Combined Cache and Bank Coloring

In this section, we evaluate how much temporal protection of memory accesses can be achieved by our combined cache and bank coloring approach. As a baseline, we use a subset of the PARSEC [3] benchmarks (running one benchmark in each task), measuring their execution times when each task runs alone in the system. Then, we measured their execution times when a memory-intensive background task is co-executed on a different core. We compare the execution time increase due to the co-running task under our approach and under the no-bank-protection (cache coloring only) approach.

The target system is equipped with an Intel Core i7-2600 3.4GHz quad-core processor. The system is configured with 4KB page frames and a 4GB memory reservation page pool. With cache and bank coloring, the system provides 32 cache colors and 16 bank colors. During this experiment, each task is assigned four exclusive cache partitions so that tasks do not experience cache interference. Under our approach, each task is also assigned eight exclusive bank partitions. Under the no-bank-protection approach, the benchmark task and the background task share memory banks.

Figure 6 shows the execution time of each task when it runs with the background task. The execution times are normalized to the execution time when the task runs alone in the system. Overall, the task execution time increase due to the background task is significantly lower under our approach. For instance, the execution time of PS.streamcluster increases by 60% under the no-bank-protection approach, but the increase is only 12% under our combined cache and bank coloring approach.
We suspect that this 12% increase is caused by contention in other components of the DRAM system, such as the DRAM controller and the DRAM bus. The contention on such components can be mitigated by a memory bandwidth reservation mechanism [17] [2], but this is beyond the scope of this work. We plan to study this issue as part of our future work.

IV. COORDINATED CACHE AND BANK COLOR ALLOCATION ALGORITHMS

In this section, we present two algorithms for allocating tasks to processor cores and to non-conflicting cache and bank partitions. The first is based on solving a Mixed-Integer Linear Program (MILP). It has the advantage of being optimal, but the drawback that its running time can be large. The second is based on solving a variant of the knapsack problem. It has the advantage of running faster, but the drawback that there are problem instances that MILP can solve and the knapsack algorithm cannot.

A. Mixed-Integer Linear Programming Algorithm

From the previous discussion, it can be seen that our problem is to assign tasks to processors, cache colors to tasks, and bank colors to processors so that certain constraints are fulfilled. It is therefore natural to phrase this as a constraint satisfaction problem. We will now present notation that we find useful and then present such constraints.

Let PCS_i denote the set of cache-color counts that it is possible to assign to task τ_i. Consider for example a task τ_1 for which assigning 1 cache color leads to the execution time 3.5, assigning 2 cache colors leads to the execution time 3.4, assigning 6 cache colors leads to the execution time 3.1, and assigning 7 cache colors leads to the execution time 3.0. This gives us PCS_1 = {1, 2, 6, 7} and C_1^1 = 3.5, C_1^2 = 3.4, C_1^6 = 3.1, C_1^7 = 3.0.
We compute PC_i as: PC_i = { t : t ∈ PCS_i ∧ (t ≤ M_i) ∧ (t ≤ H) }

Then, finding an assignment of tasks to processors, an assignment of cache colors to tasks, and an assignment of bank colors to processors can be done by finding an assignment of values to variables such that the constraints in Figure 7 are satisfied. Here, x_{i,p} = 1 indicates that task τ_i is assigned to processor p; otherwise x_{i,p} = 0. Also, y_{i,t} = 1 indicates that task τ_i is assigned t cache colors; otherwise y_{i,t} = 0. In addition, z_{i,p,t} = 1 indicates that task τ_i is assigned to processor p and assigned t cache colors; otherwise z_{i,p,t} = 0. Note that, for each task τ_i, we let c_{i,p} indicate the amount of execution that task τ_i is assigned on processor p. Clearly, since a given task τ_i can only be assigned to a single processor, there is only one processor p such that c_{i,p} > 0; for the other processors, c_{i,p} = 0.

The constraints in Figure 7 use implication and equivalence operators. These can be rewritten as linear inequalities (using standard techniques), and hence the above is a Mixed-Integer Linear Program (MILP), a problem for which many solvers are available (we use Gurobi [1]).

B. Knapsack Heuristic Algorithm

The main idea of our new knapsack heuristic algorithm is to (i) generate all possible assignments of banks to processors (line 3 in Algorithm 1), (ii) then generate a task-to-processor assignment (see function TaskKnapsack), and (iii) finally, evaluate the feasibility of the current allocation against the capacity of the cache (line 6 in Algorithm 1). On each iteration k, the call to TaskKnapsack returns an assignment candidate CTA_k. This candidate is composed of three sets: the bank-to-core assignment set (BC_k), the task-to-core assignment set (TC_k), and the cache-to-task assignment set (HT_k). The set BC_k has one variable B_{π_j}^k per core π_j with the number of banks assigned to that core. The set TC_k contains one variable P_i^k per task τ_i with the index of the core the task is assigned to. Finally, the set HT_k contains one variable H_i^k per task τ_i with the number of cache partitions assigned to task τ_i. Algorithm 2 proceeds as follows.
Constraints
∀p ∈ {1, …, m} : Σ_{i=1..n} ((1/T_i) · c_{i,p}) ≤ 1
∀i ∈ {1, …, n} : Σ_{p=1..m} x_{i,p} = 1
∀i ∈ {1, …, n}, ∀p ∈ {1, …, m} : (x_{i,p} = 1) → (c_{i,p} = c_i)
∀i ∈ {1, …, n}, ∀p ∈ {1, …, m} : (x_{i,p} = 0) → (c_{i,p} = 0)
∀i ∈ {1, …, n} : Σ_{t ∈ PC_i} y_{i,t} = 1
∀i ∈ {1, …, n} : c_i = Σ_{t ∈ PC_i} (C_i^t · y_{i,t})
Σ_{p=1..m} nCacheColorsOfProc_p ≤ H
Σ_{p=1..m} nBankColorsOfProc_p ≤ B
∀p ∈ {1, …, m} : nCacheColorsOfProc_p = Σ_{i=1..n} Σ_{t ∈ PC_i} (t · z_{i,p,t})
∀i ∈ {1, …, n}, ∀p ∈ {1, …, m}, ∀t ∈ PC_i : (z_{i,p,t} = 1) ↔ ((x_{i,p} = 1) ∧ (y_{i,t} = 1))
∀i ∈ {1, …, n}, ∀p ∈ {1, …, m}, ∀t ∈ PC_i : (z_{i,p,t} = 1) → (M_i ≤ nBankColorsOfProc_p · t)

Domains of variables
∀i ∈ {1, …, n}, ∀p ∈ {1, …, m} : x_{i,p} ∈ {0, 1}
∀i ∈ {1, …, n} : c_i is a real number ≥ 0
∀i ∈ {1, …, n}, ∀p ∈ {1, …, m} : c_{i,p} is a real number ≥ 0
∀i ∈ {1, …, n}, ∀t ∈ PC_i : y_{i,t} ∈ {0, 1}
∀p ∈ {1, …, m} : nCacheColorsOfProc_p is an integer in [1, H]
∀p ∈ {1, …, m} : nBankColorsOfProc_p is an integer in [1, B]
∀i ∈ {1, …, n}, ∀p ∈ {1, …, m}, ∀t ∈ PC_i : z_{i,p,t} ∈ {0, 1}

Fig. 7: A MILP formulation for coordinated task, cache and bank allocation.

Algorithm 1 KnapsackCoreAndMemoryAssignment(S)
1: CTA_k := (BC_k = {B_{π_j}^k | π_j ∈ π}, TC_k = {P_i^k | τ_i ∈ τ}, HT_k = {H_i^k | τ_i ∈ τ})
2: /* CTA is an assignment candidate */
3: BA ← the set of all possible candidate banks-to-core assignments
4: for each assignment BA_k in BA do
5:   CTA_k ← TaskKnapsack(τ, BA_k)
6:   if Σ_{τ_i ∈ τ} H_i^k ≤ H then return CTA_k end if
7: end for
8: return null

Cores are traversed in non-increasing order of the number of assigned banks. In each iteration, we try to maximize the total number of memory cells (required by the tasks) fitted into the memory grid. To do this, we consider this total number of cells as the value of the objects (tasks) packed in the knapsack. We consider the total number of cache partitions as the size of the knapsack, and the cost of an object as the number of cache partitions required by each task τ_i (H_i). We calculate the cache partitions required by a task taking into account the bank partitions available on its core, i.e., H_i = ⌈M_i / B_{π_j}⌉. Then, given this requirement and the total utilization of the tasks assigned to a core, the algorithm searches for the best task-to-core allocations. The task-to-core assignment exploration saves candidate assignments (to the current core) in a vector V with H + 1 entries.
Each entry v_k keeps the total utilization U_k of the current core, the total number of memory cells fitted, RM_k (the sum of the memory assignments of all tasks assigned to the core), the set of task-to-core assignments TC_k, and the set of cache-to-task assignments HT_k. The array V represents possible packings for a knapsack of size up to H (indexed from 1 to H), with a special entry 0. The entry indexed by k represents a candidate that requires a total of k cache partitions. For example, when a candidate assignment Z consists of τ_x and τ_y, and both H_x and H_y are 2, the candidate assignment Z is held in the entry with index 2 + 2 = 4 (v_4). The entry indexed by 0 represents an entry with no cache assignments.

The main loop (line 3) traverses all the cores. The next nested loop (line 8) traverses all the tasks that have not yet been assigned to a core. Finally, the innermost loop (line 10) traverses all candidate packing entries that require k cache partitions. At iteration k, the algorithm tries to build a new candidate packing from the previous candidate packing in entry v_k by adding τ_i to the packing and checking its feasibility. The new cache requirement is calculated by adding the cache requirement of τ_i (H_i^c) to the old requirement k. This new packing is stored at the index of the new assignment, k + H_i^c, if we can verify that this requirement is less than or equal to H (i.e., it fits the knapsack). Similarly, the schedulability of the taskset is verified by checking that U_k + C_i^{H_i^c}/T_i ≤ 1. Finally, if the new total number of cells is larger than the previous RM_{k+H_i^c}, this new packing is stored in the entry v_{k+H_i^c}, overwriting the previous contents; otherwise the candidate is discarded due to its lower value (fewer fitted memory cells).

Once all tasks have been considered, we select the entry v_m in the vector V that maximizes the total number of cells RM_m deployed in the memory grid. From this entry, we take the tasks assigned to core π_j and save these core assignments along with the assignments of cache partitions to the selected tasks. Then, the core loop continues until all cores are traversed.

Algorithm 2 TaskKnapsack(τ, BA)
1: CTA_r := (BC_r = {B_{π_j}^r | π_j ∈ π}, TC_r = {P_i^r | τ_i ∈ τ}, HT_r = {H_i^r | τ_i ∈ τ})
2: BC_r ← BA
3: for all π_j ∈ π in non-increasing order of B_{π_j} do
4:   // Vector of utilizations (U_k), memory requirements (RM_k), task-to-core (TC_k) and
5:   // cache-to-task (HT_k) assignments, indexed by the number of cache partitions assigned, k
6:   V := {v_h = (U_h, RM_h, TC_h, HT_h) | 0 ≤ h ≤ H}
7:   ∀h ∈ [0, …, H] : U_h ← 0, RM_h ← 0
8:   for all τ_i ∈ τ with P_i = 0 do
9:     H_i^c ← ⌈M_i / B_{π_j}⌉
10:    for all k ∈ [0, …, H] do
11:      if ((k = 0) ∨ (k ≠ 0 ∧ U_k ≠ 0)) ∧ (k + H_i^c ≤ H) ∧ (U_k + C_i^{H_i^c}/T_i ≤ 1) then
12:        /* This candidate is feasible */
13:        if RM_{k+H_i^c} < RM_k + M_i then
14:          // The new candidate can fit more memory cells
15:          U_{k+H_i^c} ← U_k + C_i^{H_i^c}/T_i
16:          RM_{k+H_i^c} ← RM_k + M_i
17:          ∀P_x^k ∈ TC_k : P_x^{k+H_i^c} ← P_x^k
18:          P_i^{k+H_i^c} ← j
19:          ∀H_x^k ∈ HT_k : H_x^{k+H_i^c} ← H_x^k
20:          H_i^{k+H_i^c} ← H_i^c
21:        end if
22:      end if
23:    end for
24:  end for
25:  v_m ← argmax_{v_h ∈ V} (RM_h)
26:  TC_r ← TC_r ∪ {P_x^m ∈ TC_m | P_x^m = j}
27:  HT_r ← HT_r ∪ {H_x^m ∈ HT_m}
28: end for
29: return CTA_r = (BC_r, TC_r, HT_r)
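The per-core packing step can be approximated by a standard 0/1 knapsack over cache partitions, with fitted memory cells as the value. This is a simplified sketch under assumptions of our own: a fixed WCET per task rather than the C_i^t profiles, and a downward index iteration in place of the entry-validity flags of Algorithm 2.

```python
import math

def task_knapsack_core(tasks, H, banks_on_core):
    """Simplified per-core knapsack (cf. Algorithm 2). tasks is a list of
    (M_i, C_i, T_i) tuples; each task needs H_c = ceil(M_i / banks) cache
    partitions. V[k] is the best packing using exactly k partitions, kept
    as (utilization, memory cells fitted, chosen task indices)."""
    V = [None] * (H + 1)
    V[0] = (0.0, 0, ())
    for i, (M, C, T) in enumerate(tasks):
        H_c = math.ceil(M / banks_on_core)
        # Iterate k downward so each task is packed at most once.
        for k in range(H - H_c, -1, -1):
            if V[k] is None:
                continue
            U, RM, chosen = V[k]
            if U + C / T > 1.0:          # EDF schedulability on this core
                continue
            cand = (U + C / T, RM + M, chosen + (i,))
            # Keep the packing that fits more memory cells (higher value).
            if V[k + H_c] is None or cand[1] > V[k + H_c][1]:
                V[k + H_c] = cand
    return max((v for v in V if v is not None), key=lambda v: v[1])
```

For example, with tasks (M, C, T) = (4, 1.0, 4.0) and (8, 1.0, 2.0), H = 4 and two banks on the core, the tasks need 2 and 4 partitions respectively and cannot both fit; the second alone fits more memory cells, so it is selected.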

Fig. 8: Knapsack Augmentation Experiment Results (percentage of cases, 0%–120%, fitted at each augmentation percentage, 0%–30%).

C. Allocation Algorithms Evaluation

We generated problem instances randomly (see the Appendix) and evaluated both algorithms.

1) Scalability: We conducted experiments with the following default setup: B=64, H=32 and m=4. Then we varied each of these parameters as follows. First, we varied m over {4, 8, 16, 32}. Then, we varied H over {16, 32, 64, 128}. Finally, we varied B over {8, 16, 32, 64}. For very large systems (B=64, H ≥ 32, m=4), MILP tended not to finish within an hour. For the other systems we evaluated, MILP tended to succeed.

2) Knapsack Augmentation Evaluation: Given that the knapsack algorithm can fail to allocate difficult cases, we created an experiment that incrementally increases the size of the memory configuration and the number of cores (augments the resources) until the algorithm succeeds. We ran 100 experiments with 4 cores, 16 cache colors, 32 bank colors, and 16 tasks. The results are presented in Figure 8. In this figure, we can observe that in over 95% of the cases we can fit the taskset without augmentation, and the remaining cases can be fitted with only 10% augmentation.

V. CONCLUSIONS

We have presented a coordinated cache and bank coloring scheme that is designed to prevent cache and bank interference simultaneously. We also gave color allocation algorithms for configuring a virtual memory system to support our scheme, which has been implemented in the Linux kernel.

REFERENCES
[1] Gurobi. http://www.gurobi.com.
[2] F. Bellosa. Process cruise control: Throttling memory access in a soft real-time environment. Technical report TR-I4-97-02, University of Erlangen, Germany, 1997.
[3] C. Bienia, S. Kumar, J. P. Singh, and K. Li. The PARSEC benchmark suite: characterization and architectural implications. In PACT'08.
[4] D. Dasari, B. Andersson, V. Nélis, S. M. Petters, A. Easwaran, and J. Lee.
Response Time Analysis of COTS-Based Multicores Considering the Contention on the Shared Memory Bus. In ICESS'11.
[5] K. Goossens, A. Azevedo, K. Chandrasekar, M. D. Gomony, S. Goossens, M. Koedam, Y. Li, D. Mirzoyan, A. Molnos, A. B. Nejad, A. Nelson, and S. Sinha. Virtual Platforms for Mixed-Time-Criticality Applications: The CompSOC Architecture and Design Flow. In CRTS'12.
[6] B. Jacob, S. Ng, and D. Wang. Memory Systems: Cache, DRAM, Disk. Elsevier Science, 2010.
[7] H. Kim, A. Kandhalu, and R. Rajkumar. A Coordinated Approach for Practical OS-Level Cache Management in Multi-Core Real-Time Systems. In ECRTS'13.
[8] H. Kim and R. Rajkumar. Shared-page management for improving the temporal isolation of memory reservations in resource kernels. In RTCSA'12.
[9] L. Liu, Z. Cui, M. Xing, Y. Bao, M. Chen, and C. Wu. A software memory partition approach for eliminating bank-level interference in multicore systems. In PACT'12.
[10] M. Lv, G. Nan, W. Yi, and G. Yu. Combining Abstract Interpretation with Model Checking for Timing Analysis of Multicore Software. In RTSS'10.
[11] R. Mancuso, R. Dudko, E. Betti, M. Cesati, M. Caccamo, and R. Pellizzoni. Real-Time Cache Management Framework for Multi-core Architectures. In RTAS'13.
[12] S. Oikawa and R. Rajkumar. Linux/RK: A portable resource kernel in Linux. In RTSS-WIP'98.
[13] R. Pellizzoni, A. Schranzhofer, J.-J. Chen, M. Caccamo, and L. Thiele. Worst case delay analysis for memory interference in multicore systems. In DATE'10.
[14] J. Reineke, I. Liu, H. D. Patel, S. Kim, and E. A. Lee. PRET DRAM Controller: Bank Privatization for Predictability and Temporal Isolation. In CODES+ISSS'11.
[15] J. Rosén, A. Andrei, P. Eles, and Z. Peng. Bus Access Optimization for Predictable Implementation of Real-Time Applications on Multiprocessor Systems-on-Chip. In RTSS'07.
[16] L. Steffens, M. Agarwal, and P. van der Wolf. Real-Time Analysis for Memory Access in Media Processing SoCs: A Practical Approach. In ECRTS'08.
[17] H. Yun, G. Yao, R. Pellizzoni, M. Caccamo, and L. Sha.
Memory Access Control in Multiprocessors for Real-time Systems with Mixed Criticality. In ECRTS'12.

APPENDIX

We generate the corresponding random tasksets as follows.
1) First, generate a memory grid of H × B memory cells.
2) Then, generate two sets of memory-division lines, one set to divide the banks and the other set to divide the caches. Each set has m − 1 lines, generated randomly from a uniform distribution. These lines create m memory-grid rectangles g_k on the diagonal. These rectangles are the non-intersecting regions to be used by each of the m cores. Each rectangle has a number of cache colors, ccolors(g_k), that represents its size in the cache dimension of the grid, and a number of bank colors, bcolors(g_k), that represents its size in the bank dimension of the grid.
3) For each core k, generate a taskset of size N_k, ensuring that N_k ≤ ccolors(g_k).
4) For each core k, generate N_k − 1 cache-division lines to divide the core rectangle g_k.
5) Assign a random number of memory cells h_{i,k} to each task τ_{i,k} on core k such that h_{i,k} ≤ bcolors(g_k) · ccolors(g_k).
6) For each task τ_i, randomly generate the period T_i from a uniform distribution over [100, 2000].
7) For each task τ_i, generate C_i^t as C_i^t = 0.99 · (T_i/N_k) · (1 − r_i) + 0.99 · (T_i/N_k) · (r_i/t), where r_i is a randomly generated number in the range [0, 0.5].
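The generation procedure above can be sketched as follows. This is a simplified reading of steps 1–7 (integer division lines, and only the t = 1 execution time recorded); the function and field names are our own.

```python
import random

def generate_problem(H, B, m, seed=0):
    """Random taskset per the appendix: split the H x B memory grid into
    m diagonal rectangles, then generate tasks per core (sketch)."""
    rng = random.Random(seed)
    # Steps 1-2: m-1 uniform division lines per dimension -> m rectangles.
    c_cuts = sorted(rng.sample(range(1, H), m - 1))
    b_cuts = sorted(rng.sample(range(1, B), m - 1))
    ccolors = [hi - lo for lo, hi in zip([0] + c_cuts, c_cuts + [H])]
    bcolors = [hi - lo for lo, hi in zip([0] + b_cuts, b_cuts + [B])]
    problem = []
    for k in range(m):
        N_k = rng.randint(1, ccolors[k])          # step 3: N_k <= ccolors(g_k)
        tasks = []
        for _ in range(N_k):
            h = rng.randint(1, bcolors[k] * ccolors[k])  # step 5: memory cells
            T = rng.uniform(100, 2000)                   # step 6: period
            r = rng.uniform(0, 0.5)
            # Step 7 with t = 1: C^1 = 0.99*(T/N_k)*(1-r) + 0.99*(T/N_k)*r
            C1 = 0.99 * (T / N_k) * (1 - r) + 0.99 * (T / N_k) * r
            tasks.append({"cells": h, "T": T, "C1": C1})
        problem.append(tasks)
    return problem
```

Note that at t = 1 the two terms of step 7 collapse to 0.99 · T_i/N_k regardless of r_i; larger t reduces only the r_i-weighted term, modeling the shrinking benefit of extra cache colors.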