Sireesha R Basavaraju Embedded Systems Group, Technical University of Kaiserslautern


Outline (9/6/2015)
- Introduction
- WCET of program: ILP formulation
- Requirement
- SPM allocation for code
- SPM allocation for data
- Conclusion
- References

- Real-time (RT) systems: embedded systems whose correctness depends on the timeliness of results
- RT system applications: automotive engine control, aircraft flight control
- Worst-case execution time (WCET) analysis is used to guarantee that the safety requirements of an RT system are met
- On-chip memories, typically caches, are used to bridge the gap between processor speed and memory speed
- But caches are unpredictable in nature
- Hence scratchpad memories (SPMs) are used for RT systems

- Scratchpad memories (SPMs) are small on-chip memories that are mapped into the address space of the processor
- Advantages of SPMs:
  - Smaller in size compared to caches
  - Energy efficient
  - Predictable access latency
- When SPMs are used, the decision of where to place data and code objects lies in the hands of the compiler

WCET of program

- A program P's WCET is the maximal time that an execution of P can ever take
- WCET analysis of a program is performed on the control flow graph (CFG) of the program
- Goal: find the WCET of the program
- Start from the functions in the program
- Within each function, the body of an innermost loop is acyclic
- Obtain the WCET of the innermost loop body and multiply it by the loop's worst-case iteration count

- The cost of each basic block $b_i$ is given by:
  $c_i = C^i_{main} \cdot (1 - x_i) + C^i_{spm} \cdot x_i$
- The allocation variable $x_i$ is defined as:
  $x_i = 1$ if basic block $b_i$ is allocated to the SPM, $x_i = 0$ otherwise
- Start from the basic block at the exit of the loop and move upwards to the basic block at the entry of the loop
- The WCET of the basic block at the exit is the cost of the basic block itself:
  $w^L_{exit} = c^L_{exit}$
- The WCET of any other basic block $b_i$ is constrained by each of its successors:
  $w_i \geq w_{succ} + c_i$
[Figure: loop body with basic blocks $b^L_{entry}$, $b_i$, $b_j$, $b^L_{exit}$]

- The WCET over all paths of the loop body of $L$ is modeled by the WCET of the basic block at the entry of the loop: $w^L_{entry}$
- Replace the innermost loop by a single node in the graph
- Repeat the procedure iteratively
- The main function is the unique entry point of the entire program
- Thus the WCET of the entire program is $w^{main}_{entry}$
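The bottom-up computation described above can be sketched in a few lines of Python. This is only an illustration: the CFG, the cycle counts, the allocation and the loop bound are made-up values, and the per-block constraint is resolved by taking the maximum over all successors, which is what a minimizing solver would do with the inequality.

```python
def block_cost(c_main, c_spm, x):
    """c_i = C_main^i * (1 - x_i) + C_spm^i * x_i, with x_i in {0, 1}."""
    return c_main * (1 - x) + c_spm * x


def loop_body_wcet(succ, cost, entry):
    """WCET of an acyclic loop body, propagated from the exit block upwards."""
    memo = {}

    def w(block):
        if block not in memo:
            # An exit block has no successors, so its WCET is its own cost.
            memo[block] = cost[block] + max((w(s) for s in succ[block]), default=0)
        return memo[block]

    return w(entry)


# Hypothetical loop body: entry -> {a, b} -> exit.
succ = {"entry": ["a", "b"], "a": ["exit"], "b": ["exit"], "exit": []}
# Assumed cycles per block as (main memory, SPM).
cycles = {"entry": (10, 4), "a": (30, 12), "b": (20, 8), "exit": (5, 2)}
# Assumed allocation: only block "a" is placed in the SPM.
in_spm = {"entry": 0, "a": 1, "b": 0, "exit": 0}

cost = {b: block_cost(cm, cs, in_spm[b]) for b, (cm, cs) in cycles.items()}
body_wcet = loop_body_wcet(succ, cost, "entry")
loop_bound = 100  # assumed worst-case iteration count
loop_wcet = body_wcet * loop_bound
```

With block "a" in the SPM, the worst-case path of this toy loop body runs through "b" instead of "a", which is the kind of path switch an allocation has to account for.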

Requirement

- Requirement: an optimal SPM allocation that is WCET-aware, i.e., one that reduces the WCET of the application
- Problem:
  - An allocation that is optimal for the average-case execution time is not necessarily optimal for WCET reduction
  - Locally optimizing the current worst-case path may not lead to the globally optimal solution
- The allocation must consider changes of the worst-case path
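A small, made-up example shows why optimizing only the current worst-case path can miss the global optimum. Everything in the sketch below (block names, costs, savings, sizes, SPM capacity, and the greedy strategy itself) is hypothetical and chosen only to expose the effect; the exhaustive search stands in for an exact method such as ILP.

```python
from itertools import combinations

# Two paths share block S; all numbers are illustrative assumptions.
main_cost = {"S": 50, "A": 50, "B": 45}   # cycles when executed from main memory
saving = {"S": 30, "A": 35, "B": 25}      # cycles saved when placed in the SPM
size = {"S": 4, "A": 4, "B": 4}
paths = [["S", "A"], ["S", "B"]]
capacity = 4


def path_time(path, spm):
    return sum(main_cost[b] - (saving[b] if b in spm else 0) for b in path)


def wcet(spm):
    return max(path_time(p, spm) for p in paths)


def greedy_on_current_wcep():
    """Repeatedly place the most beneficial block of the current worst-case path."""
    spm, free = set(), capacity
    while True:
        wcep = max(paths, key=lambda p: path_time(p, spm))
        fits = [b for b in wcep if b not in spm and size[b] <= free]
        if not fits:
            return spm
        best = max(fits, key=lambda b: saving[b])
        spm.add(best)
        free -= size[best]


def exhaustive_optimum():
    """Try every allocation that fits into the SPM and keep the lowest WCET."""
    blocks = list(main_cost)
    candidates = (set(c) for r in range(len(blocks) + 1)
                  for c in combinations(blocks, r)
                  if sum(size[b] for b in c) <= capacity)
    return min(candidates, key=wcet)


greedy = greedy_on_current_wcep()
optimal = exhaustive_optimum()
```

The greedy strategy places A, after which the other path becomes the worst case and the WCET only drops from 100 to 95. Placing the shared block S shortens both paths and gives a WCET of 70.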

SPM allocation for code
- WCET-aware static allocation
- Instruction SPM allocation for PRET architecture

WCET-aware static allocation
- Minimizes the WCET by placing the most beneficial parts of a program's code in the SPM
- This method:
  - captures a program's current worst-case execution path (WCEP) and its possible switches
  - considers jump penalties and variable basic block sizes based on jump scenarios

- Goal: the ILP has to minimize a program's WCET by assigning basic blocks to the SPM
- The objective function: $w^{main}_{entry}$
- Extend the WCET of each basic block $b_i$ by:
  - a jump penalty $jp_i$
  - a call penalty $cp_i$

Why a jump penalty?
- Consider an unconditional jump scenario: after $b_i$, $b_j$ has to execute
- The cost of the control transfer depends on how the two blocks are placed:
  - they are placed in different memories
  - they are placed in the same memory, but not adjacently
  - they are placed adjacently in the same memory
[Figure: basic blocks $b_i$, $b_k$, $b_j$]

Modeling global control flow:
- Consider a call to function $F$ inside a basic block $b_i$ of the calling function
- Use $w^F_{entry}$ to model the worst-case cost of the call in $b_i$
- In addition, a call penalty models the jump to the function
- Thus the WCET of a basic block is:
  $w_i \geq w_{succ} + c_i + jp_i + cp_i$
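The effect of the two penalties on a single block can be illustrated as follows. The penalty values and the rule that adjacent blocks in the same memory incur no jump cost are assumptions made for this sketch, not figures from the slides; the function names are hypothetical as well.

```python
def jump_penalty(same_memory, adjacent, cross_penalty=8, local_penalty=2):
    """Assumed penalty model for an unconditional transfer from b_i to b_j."""
    if not same_memory:
        return cross_penalty   # jump between main memory and SPM
    if not adjacent:
        return local_penalty   # jump within one memory
    return 0                   # b_j directly follows b_i: assumed fall-through


def block_wcet(c_i, w_succ, jp_i=0, cp_i=0):
    """Smallest w_i satisfying w_i >= w_succ + c_i + jp_i + cp_i for all successors."""
    return max(w_succ, default=0) + c_i + jp_i + cp_i


# Hypothetical block: cost 12, two successors with WCETs 25 and 17,
# placed in a different memory than its jump target, call penalty of 3 cycles.
jp = jump_penalty(same_memory=False, adjacent=False)
w_i = block_wcet(c_i=12, w_succ=[25, 17], jp_i=jp, cp_i=3)
```

In an ILP the penalties are not constants but depend on the allocation variables of the blocks involved, so moving one block can change the penalties of its neighbours.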

- Advantages:
  - Considers jump penalties and variable basic block sizes based on the jump scenario
  - It is an optimal WCET-centric allocation method for program code, since it is based on ILP
- Disadvantage:
  - The WCET analysis required to find the cost of each basic block is expensive

SPM allocation for PRET architecture

- The precision timed (PRET) architecture has predictable timing and repeatable temporal behavior
- It provides precise timing control using timing instructions
- PRET is a multi-threaded architecture
- The CFG used for the ILP formulation is extended with timed blocks

- The objective function: the overall execution time of all the timed blocks of all the threads
- This is minimized under two constraints:
  - Timing constraint: the execution time of each individual timed block is just met
  - SPM size constraint: the total size of all allocated basic blocks must lie within the SPM size
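The shape of this optimization problem can be sketched with a brute-force search over a tiny, invented instance. The thread and block names, sizes, cycle counts and deadlines are all hypothetical, the timing constraint is interpreted here as "each timed block finishes within its specified time", and exhaustive enumeration stands in for a real ILP solver.

```python
from itertools import combinations

# (thread, timed block, basic block) -> (size, cycles in main memory, cycles in SPM)
blocks = {
    ("t0", "tb0", "b0"): (4, 40, 15),
    ("t0", "tb0", "b1"): (2, 20, 10),
    ("t1", "tb1", "b2"): (4, 50, 20),
    ("t1", "tb1", "b3"): (2, 10, 6),
}
deadline = {("t0", "tb0"): 50, ("t1", "tb1"): 45}  # assumed timing specification
spm_size = 8                                       # one SPM shared by both threads


def timed_block_times(spm):
    """Execution time of every timed block for a given set of SPM-resident blocks."""
    times = dict.fromkeys(deadline, 0)
    for key, (_size, slow, fast) in blocks.items():
        times[key[:2]] += fast if key in spm else slow
    return times


def allocate():
    """Minimize total timed-block time subject to the deadlines and the SPM size."""
    best, best_total = None, None
    keys = list(blocks)
    for r in range(len(keys) + 1):
        for chosen in combinations(keys, r):
            if sum(blocks[k][0] for k in chosen) > spm_size:
                continue  # violates the SPM size constraint
            times = timed_block_times(set(chosen))
            if any(times[tb] > deadline[tb] for tb in deadline):
                continue  # violates a timing constraint
            total = sum(times.values())
            if best_total is None or total < best_total:
                best, best_total = set(chosen), total
    return best, best_total


allocation, total_time = allocate()
```

In this instance neither timed block meets its deadline from main memory alone, and the search ends up placing one basic block from each thread in the shared SPM.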

- Advantages:
  - Incorporates instructions from multiple hardware threads of the PRET architecture
  - The allocation uses the unused space of one thread for another thread's timed blocks, and is hence better than a dedicated, segmented SPM for each thread
  - It allocates basic blocks so that the explicit timing specified in the program is met

SPM allocation for data
- WCET-aware static allocation based on ILP
- SPM allocation using branch and bound algorithm
- Dynamic SPM allocation for data

WCET-aware static allocation based on ILP
- This allocation considers both scalar variables and arrays
- The method analyses the CFG of the program to find the objective function $w^{main}_{entry}$
- Variables are allocated to the SPM to reduce the WCET of the program
- The objective function is minimized under the SPM size constraint

- Advantages:
  - Optimal method, since it is based on ILP
  - WCET reduction of up to 80% is possible for different benchmarks
- Disadvantage:
  - Does not consider infeasible path information in programs

SPM allocation using branch and bound algorithm

- Search space: the set of all possible allocations, i.e., the power set of the variables
- Measure the potential WCET reduction of a variable using its maximum contribution over all execution paths
- Sort the variables in decreasing order of this measured value

- Traverse the search tree and keep the maximum WCET reduction achieved so far at any leaf node as the bound
- A heuristic function computes the maximum possible WCET reduction at any leaf node in the sub-tree rooted at node m
- The search space corresponding to the sub-tree rooted at m is pruned if the maximum WCET reduction that can be obtained in it does not exceed the bound
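A compact sketch of such a branch-and-bound search is given below. The data layout, the variable names and the particular optimistic heuristic (current reduction plus the summed potentials of all undecided variables) are assumptions made for illustration; infeasible path handling is left out.

```python
def branch_and_bound(path_time, saving, size, capacity):
    """saving[v][p]: cycles that variable v saves on path p when placed in the SPM."""
    base = max(path_time.values())

    def reduction(chosen):
        worst = max(t - sum(saving[v].get(p, 0) for v in chosen)
                    for p, t in path_time.items())
        return base - worst

    # Potential of a variable: its maximum contribution over all paths.
    potential = {v: max(saving[v].values()) for v in saving}
    order = sorted(saving, key=potential.get, reverse=True)
    best = {"reduction": 0, "chosen": frozenset()}

    def search(i, chosen, free):
        if i == len(order):  # leaf: every variable has been decided
            r = reduction(chosen)
            if r > best["reduction"]:
                best.update(reduction=r, chosen=chosen)
            return
        # Optimistic estimate for the sub-tree: no allocation below this node
        # can gain more than the potentials of the undecided variables.
        estimate = reduction(chosen) + sum(potential[v] for v in order[i:])
        if estimate <= best["reduction"]:
            return  # prune the sub-tree
        v = order[i]
        if size[v] <= free:
            search(i + 1, chosen | {v}, free - size[v])  # place v in the SPM
        search(i + 1, chosen, free)                      # leave v in main memory

    search(0, frozenset(), capacity)
    return best["chosen"], best["reduction"]


# Hypothetical instance: two paths, three variables, SPM capacity of 6.
path_time = {"p1": 100, "p2": 90}
saving = {"x": {"p1": 30, "p2": 30}, "y": {"p1": 40}, "z": {"p2": 10}}
size = {"x": 4, "y": 4, "z": 2}
chosen, gain = branch_and_bound(path_time, saving, size, capacity=6)
```

Variable y has the largest potential but only helps one path, so the search settles on an allocation containing x, which shortens both paths. The estimate never underestimates, because adding a variable cannot reduce the worst path by more than that variable's largest per-path saving.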

General idea for handling infeasible paths:
1. Find the infeasible path patterns by conservatively identifying pairs of branches and/or assignments that are guaranteed to conflict
   Example: x=2 conflicts with x>3
2. The WCET calculation is the same as before, but uses a conflict list for each path to avoid exhaustive path enumeration
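The effect of a conflict pair on the WCET estimate can be shown with a deliberately naive sketch. It enumerates a handful of made-up paths explicitly, which is exactly what the actual method avoids, and it assumes that a recorded conflict always holds when both effects appear in that order on a path (no intervening reassignment).

```python
# Pairs (earlier effect, later effect) that can never occur together on one path.
conflicts = {("x=2", "x>3")}


def is_feasible(effects):
    """A path is infeasible if a later effect conflicts with an earlier one."""
    seen = []
    for effect in effects:
        if any((earlier, effect) in conflicts for earlier in seen):
            return False
        seen.append(effect)
    return True


# Hypothetical paths: (assignments and branch outcomes along the path, execution time).
paths = {
    "p1": (["x=2", "x>3"], 120),
    "p2": (["x=2", "x<=3"], 90),
    "p3": (["x=5", "x>3"], 100),
}

naive_wcet = max(t for _, t in paths.values())
feasible_wcet = max(t for effects, t in paths.values() if is_feasible(effects))
```

Ignoring the conflict would report the infeasible path p1 as the worst case; excluding it tightens the estimate and also changes which variables look most profitable to allocate.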

- Advantages:
  - Provides an optimal allocation
  - Considers infeasible path information in the program
  - The WCET reduction obtained is higher than with the ILP-based static allocation
- Disadvantage:
  - Complexity is exponential in the number of data variables to be allocated

Dynamic SPM allocation for data

- Enables eviction and placement of data in the SPM at run time
- Steps:
  1. Determine the possible targets of load-store instructions
  2. Formulate an ILP to obtain an allocation
  3. Greedily refine the allocation to get the WCET-centric allocation

1. Determine the possible targets of load-store instructions
   - Done by applying pointer analysis:
     - to the whole text of the program
     - to all pointer definitions
2. Compute the WCET of the application
3. Compute data access information on the worst-case execution path

4. Formulate an ILP to obtain an allocation
   - A 0/1 ILP is formulated to compute the allocation
   - The formulation is applied on the flow graph
5. Apply the process iteratively until no further reduction in WCET can be obtained

- Advantages:
  - Highly beneficial for small SPMs
  - Outperforms static allocation by 12% to 85%
  - Useful for systems with an SPM shared among several RT tasks
- Disadvantages:
  - Allocation quality depends on the granularity of the flow graph
  - The run time of the ILP solver is a limitation for the largest benchmarks

Conclusion

- WCET is a key metric in RT embedded systems
- WCET-centric SPM allocation techniques exist for program code and data objects
- ILP-based methods give an optimal allocation that reduces the WCET
- Instruction allocation for the PRET architecture allocates basic blocks from multiple threads and meets the explicit timing requirements
- The dynamic allocation approach for data allows efficient use of a small SPM

References

1. H. Falk & J.C. Kleinsorge (2009): Optimal static WCET-aware scratchpad allocation of program code. In: Design Automation Conference, 2009. DAC '09. 46th ACM/IEEE, pp. 732-737.
2. A. Prakash & H.D. Patel (2012): An instruction scratchpad memory allocation for the precision timed architecture. In: Design, Automation Test in Europe Conference Exhibition (DATE), 2012, pp. 659-664, doi:10.1109/date.2012.6176553.
3. V. Suhendra, T. Mitra, A. Roychoudhury & Ting Chen (2005): WCET centric data allocation to scratchpad memory. In: Real-Time Systems Symposium, 2005. RTSS 2005. 26th IEEE International, pp. 10 pp.-232, doi:10.1109/rtss.2005.45.
4. J.-F. Deverge & I. Puaut (2007): WCET-Directed Dynamic Scratchpad Memory Allocation of Data. In: Real-Time Systems, 2007. ECRTS '07. 19th Euromicro Conference on, pp. 179-190, doi:10.1109/ECRTS.2007.37.

Thank you