Caches in Real-Time Systems. Instruction Cache vs. Data Cache

Caches in Real-Time Systems
[Xavier Vera, Bjorn Lisper, Jingling Xue, "Data Caches in Multitasking Hard Real-Time Systems", RTSS 2003.]
R. Bettati

Schedulability analysis needs the WCET of each task, and the WCET in turn depends on the WCMP (worst-case memory performance). Simple platforms (e.g., without caches) make the WCMP easy to bound, but ignoring the cache leads to significant resource under-utilization.
Q: How to appropriately account for the cache?

Instruction Cache vs. Data Cache
- Computation of WCET with an instruction cache is established for non-preemptive systems (e.g., Static Cache Simulation).
- Extension: computation of WCET with an instruction cache in preemptive systems.
- Analysis of the data cache is harder:
  - A single instruction can refer to multiple memory locations (see the sketch below).
  - Locality of reference is harder to capture for data accesses.
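To make the contrast concrete, here is a minimal illustrative fragment (the function and array names are made up): the load implementing a[i] sits at one fixed code address, so its instruction fetches are easy to analyze, but it reads a different data address on every iteration.

    /* Sketch: one load instruction, many data locations.  The fetch
     * address of the load in the loop body never changes (easy for
     * I-cache analysis), but the load itself reads 1000 distinct
     * data addresses (hard for D-cache analysis).                  */
    double sum(const double *a)
    {
        double s = 0.0;
        for (int i = 0; i < 1000; i++)
            s += a[i];   /* same instruction, new address each time */
        return s;
    }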

WCET Analysis in the Presence of Data Caches (I)

Static Analysis
- Attempts to statically classify the different memory accesses as hits or misses.
- Typically does not consider preemptive systems.
- Limited to code free of data-dependent constructs.

Cache Preemption Delay
- Incorporate the cache preemption cost as context-switch overhead into the schedulability analysis.
- Assume a cold-started cache after preemption? This might be unsafe on processors with out-of-order instruction scheduling, where a cache hit can under some circumstances be more expensive than a miss.

Program Model
- Programs consist of subroutines, calls, arbitrarily nested but well-structured loops, and assignments, possibly guarded by IF conditionals. Extensions to unstructured code are possible.
- In this paper, all programs are in C; thus all arrays are assumed to be stored in row-major order.
- Static analysis is possible with additional constraints (see the sketch below):
  - Calls are non-recursive.
  - Bounds of all loops are known and affine.
  - The IF conditionals are analyzable at compile time.
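As an illustration of these constraints, the following hypothetical fragment (names and sizes are made up) stays within the program model: both bounds are known at compile time, the inner bound is affine in the outer index, and the conditional depends only on loop indices.

    #define DIM 64

    double a[DIM][DIM];                       /* row-major, as in C */

    void triangular_scale(void)
    {
        for (int i = 0; i < DIM; i++)         /* known constant bound */
            for (int j = 0; j <= i; j++) {    /* affine bound: j <= i */
                if (i == j)                   /* analyzable IF: depends
                                                 only on loop indices  */
                    a[i][j] = 1.0;
                else
                    a[i][j] = 0.5 * a[i][j];
            }
    }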

How can Caches help?

Cache Locking
- Available on many microprocessors (e.g., PowerPC 604e and the 405 and 440 families, Intel i960, some Intel x86, Motorola MPC7400).
- Static locking: the cache is loaded and locked at system start.
- Dynamic locking: the state of the cache is allowed to change during system execution.

Cache Partitioning
- Eliminate inter-task conflicts by giving reserved portions of the cache to certain tasks.
- May give rise to fragmentation, which translates into a loss of performance.

Cache Model
- Uniprocessor with a two-level memory hierarchy: a virtually-indexed, K-way set-associative data cache using LRU replacement, and main memory.
- K-way set-associative cache: each cache set contains K cache lines. Let C be the cache size and L the line size, both in bytes. The total number of cache sets is thus C/(L*K). A cache is called direct-mapped when K = 1 and fully-associative when K = C/L. (A sketch of this geometry follows below.)
- Cache locking: the locking mechanism allows a single cache line to be locked.
- Pre-fetch / invalidate: processors can load and invalidate cache lines selectively. (This can be emulated in software.)
- Cache partitioning: implemented either in hardware or in software; the partition unit is a cache set.
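A minimal sketch of this geometry, assuming illustrative parameter values (a 16 KB, 4-way cache with 32-byte lines): it derives the number of sets from C/(L*K) and shows how an address selects a set.

    #include <stdint.h>
    #include <stdio.h>

    #define CACHE_SIZE 16384u   /* C: cache size in bytes (assumed) */
    #define LINE_SIZE     32u   /* L: line size in bytes  (assumed) */
    #define WAYS           4u   /* K: associativity       (assumed) */

    #define NUM_SETS (CACHE_SIZE / (LINE_SIZE * WAYS))   /* C/(L*K) */

    /* A virtually-indexed cache selects the set from the virtual
     * address: line number modulo the number of sets.             */
    static unsigned set_index(uintptr_t vaddr)
    {
        return (unsigned)((vaddr / LINE_SIZE) % NUM_SETS);
    }

    int main(void)
    {
        printf("%u sets\n", NUM_SETS);   /* 128 here; K = 1 would be
                                            direct-mapped, K = C/L
                                            fully-associative       */
        printf("set(0x1234) = %u\n", set_index(0x1234));
        return 0;
    }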

Approach (Overview)

Summary: We need a method for obtaining exact and safe WCMPs of tasks in multitasking systems with data caches, so that current schedulability analyses can be applied without modification.
- Use cache partitioning to eliminate inter-task conflicts. This allows us to compute the WCMP of each task in isolation. Compensate the performance loss through compiler cache optimizations (such as tiling and padding).
- Use static analysis to compute the WCMP of a task. Transform the program by issuing lock/unlock instructions to ensure a tight WCMP estimate at analysis time. Add cache pre-fetching where necessary to improve performance.

Cache Partitioning
- Inter-task interference occurs when cache lines from different tasks conflict in the cache, which causes unpredictability.
- Partitioning: divide the cache into disjoint partitions, assigned to tasks in such a way that inter-task conflicts are removed.
- Create n + 1 partitions: one for each of the n real-time tasks and another one shared among the non-real-time tasks. Each task is only allowed to access its own partition, thus removing inter-task conflicts.
- Tasks with the same priority can share a partition: they are only preempted by tasks with higher priority, so the predictability of the cache behavior is not affected. (Therefore, p partitions are sufficient, where p is the number of distinct priorities; see the sketch below.)

Partition size
- The size of the partitions impacts performance. The optimal partitioning depends on the priorities and the reuse patterns of the tasks; equally-sized partitions already give a significant improvement.
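A minimal sketch of the set-based partitioning idea, under assumed numbers (128 sets, p = 3 priority levels): each priority level gets a contiguous, equally-sized range of cache sets, plus one shared range for the non-real-time tasks. A software implementation would then place each task's data only at addresses that map into its range.

    #define NUM_SETS   128u                 /* assumed cache geometry   */
    #define PRIORITIES   3u                 /* p distinct RT priorities */
    #define PARTITIONS (PRIORITIES + 1u)    /* + 1 shared partition     */

    struct set_range { unsigned first, count; };

    /* Levels 0..PRIORITIES-1 are real-time partitions; passing
     * PRIORITIES selects the shared non-real-time range.          */
    static struct set_range partition_for(unsigned level)
    {
        unsigned per = NUM_SETS / PARTITIONS;   /* equally sized */
        struct set_range r = { level * per, per };
        return r;
    }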

Predictable Cache Behavior

Unpredictability is caused by path merging and data-dependent memory accesses.

Path Merging
- To reduce the overhead of analyzing loop constructs with multiple paths inside (data-dependent conditionals, loops with unknown bounds), the analysis merges paths.
- The cache state at the end of a merged path is unknown.

Data-Dependent Memory Accesses
- Indirection arrays (e.g., a[b[i]], where b[i] is not statically known).
- Dynamically allocated variables (e.g., malloc) and pointer accesses that cannot be determined statically.
- Nonlinear array references that are not handled by the static analyzer (e.g., a[i*j]).
- Library and operating-system calls.

Solution: lock the cache during unpredictable regions of code.

Cache Locking: Example

Original code:

    int a[100], b[100];
    int c[100], k = 0;
    int i, N;

    for (i = 0; i < 100; i++) {
        a[i] = random(i);
        c[i] = b[a[i]] + c[i];       /* data-dependent access */
    }
    N = random(i) * 100;             /* merging construct: unknown bound */
    for (i = 0; i < N; i++) {
        if (c[i] > 15) k++;          /* merging construct: data-dependent branch */
        c[i] = 0;
    }

Lock/unlock placement:

    int a[100], b[100];
    int c[100], k = 0;
    int i, N;

    for (i = 0; i < 100; i++) {
        a[i] = random(i);
        lock();                      /* Region 1 */
        c[i] = b[a[i]] + c[i];
        unlock();
    }
    N = random(i) * 100;
    lock();                          /* Region 2 */
    for (i = 0; i < N; i++) {
        register int temp = (c[i] > 15);
        lock();                      /* Region 2.1 */
        if (temp) k++;
        unlock();
        c[i] = 0;
    }
    unlock();

Optimizing Lock Placement

Rule 1. Lock/unlock instructions that lock the whole loop body (including the test) are placed outside the loop:
    loop; lock; S; unlock; endloop  =>  lock; loop; S; endloop; unlock
Rule 2. Remove nested lock regions:
    lock; lock; S; unlock; unlock  =>  lock; S; unlock
Rule 3. Fuse two consecutive locked regions:
    lock; S1; unlock; lock; S2; unlock  =>  lock; S1; S2; unlock
Rule 4 (*). Move a statement past a lock instruction:
    S1; lock; S2; unlock  =>  lock; S1; S2; unlock
Rule 5 (*). Move an unlock instruction past a statement:
    lock; S1; unlock; S2  =>  lock; S1; S2; unlock
(*) May affect cache behavior.

Optimizing Lock Placement: Example

Applying the rules to the lock/unlock placement above, and adding pre-fetches (IssueLoads), yields the final version:

    int a[100], b[100];
    int c[100], k = 0;
    int i, N;

    for (i = 0; i < 100; i++) {
        a[i] = random(i);
        IssueLoads(c);               /* pre-fetch before locking */
        IssueLoads(b);
        lock();                      /* Region 1 */
        c[i] = b[a[i]] + c[i];
        unlock();
    }
    N = random(i) * 100;
    lock();                          /* Region 2 */
    for (i = 0; i < N; i++) {
        register int temp = (c[i] > 15);
        if (temp) k++;               /* nested Region 2.1 removed (Rule 2) */
        c[i] = 0;
    }
    unlock();

Overview of System Test Workload

Table 1. Benchmarks used.

                                                         Workload  WCMP        Period      Period
  Name  Description                                      (bytes)   (no cache)  (Normal)    (HP)

  Large task set:
  MM    Multiplication of two 100x100 int matrices       120000    153140000   117800000   102093333
  SRT   Bubblesort of a 1000-element double array        8000      113925998   159496397   227851996
  FIB   Computation of the first 30 Fibonacci numbers    16        7790        155800000   3895
  FFT   Fast Fourier transform of 512 complex numbers    8192      1655808     152334336   3311616

  Medium task set:
  CNT   Counting and sum of values in a 100x100 int matrix  40000  1140000     570000      285000
  SQRT  Computation of the square root of 1384           16        5360        241200      2680
  ST    Computation of sum, mean, var (1000 doubles)     16000     532000      266000      266000
  NDES  Encryption and decryption of 64 bits             960       220938      331407      110469

Performance: Effect of Partitioned Cache
[figure not transcribed]

Performance: Static vs. Dynamic Locking
[figure not transcribed]

Worst-Case Performance
Compares utilization levels.
[figure not transcribed]