WCET Analysis of Parallel Benchmarks using On-Demand Coherent Cache

Arthur Pyka, Lillian Tadros, Sascha Uhrig (University of Dortmund)
Hugues Cassé, Haluk Ozaktas and Christine Rochange (Université Paul Sabatier, Toulouse)

Outline
- Coherence in Today's Architectures
- Coherence and Static WCET-Analysis
- In Short: The On-Demand Coherent Cache
- Static WCET-Analysis with ODC²
- Evaluation
- Conclusion

Coherence in Today's Architectures
- Caches are part of almost every multi-/many-core architecture.
- Accesses to shared data can make the copies held in different caches inconsistent.
- Coherent accesses are preserved by cache coherence protocols.
[Figure: two cores whose caches hold inconsistent data, X = 0 and X = 1]
- Coherence protocols affect the timing of data accesses.
- Coherence protocols affect the state and content of the caches.
- Coherence protocols therefore affect a static worst-case execution time (WCET) analysis.

Example: Quad-Core LEON4-FT GR712RC
- Bus interconnect
- Private L1 caches
- Shared L2 cache
- Write-through policy
[Figure: Quad-core LEON4 block diagram [1]]
Coherence is ensured using a snooping-based write-invalidate protocol.
- Snooping: every core listens to the data accesses on the bus.
- Write-invalidate: cached data is invalidated by write accesses of other cores.
[1] Aeroflex Gaisler LEON4 Product Sheet

Example: Quad-Core LEON4-FT GR712RC (write-invalidate walk-through)
Step 1: Core 0 and Core 1 both cache Data X = 0; shared memory holds Data X = 0.
Step 2: Core 0 writes X = 1 (write-through). Core 1 snoops the write on the bus and marks its copy of X invalid.
Step 3: Core 1 reads X, misses in its cache and obtains X = 1; shared memory holds Data X = 1.
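The walk-through above can be replayed with a small functional model. This is only a sketch under simplifying assumptions: the classes `Core` and `Bus` are hypothetical names, caches have unlimited capacity, the shared L2 cache is folded into the memory, and no timing is modelled. It is not a description of the actual LEON4 cache controller.

```python
class Bus:
    """Shared bus with write-through memory behind it (hypothetical model)."""

    def __init__(self):
        self.memory = {}   # address -> value in shared memory
        self.cores = []    # all cores snooping this bus

    def write(self, writer, addr, value):
        self.memory[addr] = value
        # Snooping: every other core observes the write on the bus.
        for core in self.cores:
            if core is not writer:
                core.snoop_write(addr)


class Core:
    """Core with a private write-through L1 cache of unlimited size."""

    def __init__(self, bus):
        self.cache = {}    # address -> value, valid lines only
        self.bus = bus
        bus.cores.append(self)

    def read(self, addr):
        if addr not in self.cache:              # miss: load from memory
            self.cache[addr] = self.bus.memory.get(addr, 0)
        return self.cache[addr]

    def write(self, addr, value):
        self.cache[addr] = value                # update the local copy
        self.bus.write(self, addr, value)       # write-through via the bus

    def snoop_write(self, addr):
        self.cache.pop(addr, None)              # write-invalidate


bus = Bus()
core0, core1 = Core(bus), Core(bus)
bus.memory["X"] = 0

step1 = (core0.read("X"), core1.read("X"))     # both cores cache X = 0
core0.write("X", 1)                            # step 2: Core 0 writes X = 1
core1_invalidated = "X" not in core1.cache     # Core 1 lost its copy
step3 = core1.read("X")                        # step 3: miss, reload of X
```

From the point of view of Core 1, the invalidation in step 2 is triggered entirely by another core, which is the property that matters for the timing analysis discussed next.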

Coherence and Static WCET-Analysis
Estimation of the WCET using a static analysis tool:
- Abstraction of the hardware architecture incl. caches
- Cache analysis to calculate data access latencies
[Figure: Overview of the OTAWA framework [2]]
Cache analysis:
- Prediction of cache hits and cache misses
- Abstraction of possible cache states (Must/May analysis)
- Modification of the cache state by data accesses in the program
- Analysis of the private cache in isolation from other cores
[2] Ballabriga, C. et al.: OTAWA: an Open Toolbox for Adaptive WCET Analysis. In: The 8th IFIP Workshop on Software Technologies for Future Embedded and Ubiquitous Systems
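To make the Must analysis mentioned above more concrete, here is a minimal sketch of the textbook abstract-interpretation formulation for one fully associative LRU cache set. The names `ASSOC`, `must_update`, `must_join` and `is_always_hit` are illustrative assumptions; this is not the OTAWA implementation, and the May analysis is omitted.

```python
ASSOC = 4  # assumed associativity of the single LRU cache set


def must_update(state, block):
    """Abstract LRU update; `state` maps block -> upper bound on its age."""
    old_age = state.get(block, ASSOC)    # ASSOC means "not guaranteed cached"
    new_state = {}
    for other, age in state.items():
        if other == block:
            continue
        # Only blocks younger than the accessed block grow older.
        new_age = age + 1 if age < old_age else age
        if new_age < ASSOC:              # otherwise it may have been evicted
            new_state[other] = new_age
    new_state[block] = 0                 # accessed block is now the youngest
    return new_state


def must_join(state_a, state_b):
    """Control-flow merge: keep blocks cached on both paths, with the larger age."""
    return {b: max(state_a[b], state_b[b]) for b in state_a.keys() & state_b.keys()}


def is_always_hit(state, block):
    """An access is a guaranteed hit if the block is in the Must state."""
    return block in state


state = {}
for block in ["a", "b"]:
    state = must_update(state, block)
then_path = must_update(state, "c")      # one branch accesses c
else_path = must_update(state, "a")      # the other branch accesses a
merged = must_join(then_path, else_path)
```

After the merge, `a` and `b` are still guaranteed to be cached, while `c` is not because it was loaded on only one of the two paths. Note that the update considers only accesses of the analysed program, which is exactly the "isolation from other cores" assumption.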

Coherence and Static WCET-Analysis
[Figure: analysed core with cached Data X = 0 receiving an external invalidation from a cache holding Data X = 1]
- External invalidations turn the cache into a shared resource.
- The target of an invalidation is unknown, so invalidation of any data must be assumed.
- The time of an invalidation is unknown, so invalidation must be assumed at any possible time.
- Prediction of cache hits is nearly impossible.
External invalidations prohibit a feasible WCET estimation.
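The loss of precision can be illustrated on a single program path. The sketch below is an assumption-laden toy, not an actual analysis: `guaranteed_hits` is a hypothetical helper that counts the accesses a Must-style analysis could classify as hits in one fully associative LRU set, first in isolation and then under the pessimistic assumption that any line may have been invalidated externally before every access.

```python
def guaranteed_hits(trace, ways, external_invalidation):
    """Count accesses on one path that can be classified as guaranteed hits."""
    lru = []     # blocks guaranteed to be cached, most recently used first
    hits = 0
    for block in trace:
        if external_invalidation:
            # Unknown target and unknown time: nothing is guaranteed to remain.
            lru = []
        if block in lru:
            hits += 1
            lru.remove(block)
        lru.insert(0, block)
        del lru[ways:]                   # LRU eviction beyond the associativity
    return hits


trace = ["a", "b", "a", "b", "a"]
isolated = guaranteed_hits(trace, ways=2, external_invalidation=False)
snooped = guaranteed_hits(trace, ways=2, external_invalidation=True)
```

With the private cache analysed in isolation, three of the five accesses are guaranteed hits; once external invalidations have to be assumed at any time, none can be guaranteed.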

Example: Quad-Core LEON4-FT GR712RC
- Bus interconnect
- Private L1 caches
- Shared L2 cache
- Write-through policy
[Figure: Quad-core LEON4 block diagram [1]]
Alternative: purely software-based cache coherence using the Cache-Flush instruction.
- Cache-Flush: invalidation of the complete cache, triggered by a special instruction.
[1] Aeroflex Gaisler LEON4 Product Sheet

On-Demand Coherent Cache (ODC²)
[Figure: core with its ODC²; ODC² block frame with the fields Shared Bit, Ctrl, Repl. info, Tag, Data]
- Repl. info: replacement information
- Ctrl: control flags (Valid, Modified)
- Shared Bit: indicates that the line possibly carries shared data
- Software/hardware co-approach
- Handling of coherence on demand
- Different handling of accesses to private and shared data
- Access to shared data only in protected code regions
- Invalidation of shared data after use
- No coherence transactions

On-Demand Coherent Cache (ODC²)
Abstract program:
  - Accesses to private data -
  Mutex Lock / Barrier;
  enter_shared_mode();
  - Accesses to private/shared data -
  exit_shared_mode();
  Mutex Unlock / Barrier;
  - Accesses to private data -
ODC² functionality:
- ODC² in Private-Mode: no coherence operations; functionality and timing equal to an incoherent cache
- ODC² in Shared-Mode: newly loaded cache lines are marked as shared; checking of target address; write-through policy
- ODC² in Restore-Procedure: invalidation of marked cache lines
- ODC² in Private-Mode: no coherence operations; functionality and timing equal to an incoherent cache
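The mode sequence above can be sketched as a small functional model. Everything here is a simplification made for illustration: `OdcCache` is a hypothetical class, capacity is unlimited, the target address check is omitted, private-mode writes simply stay in the cache, and the mutex or barrier around the shared region is left out because the example runs sequentially. Only `enter_shared_mode` and `exit_shared_mode` are names taken from the slide.

```python
class OdcCache:
    """Toy ODC² model: one shared bit per line, no capacity limit (assumption)."""

    def __init__(self, memory):
        self.memory = memory       # shared memory, dict address -> value
        self.lines = {}            # address -> [value, shared_bit]
        self.shared_mode = False

    def enter_shared_mode(self):
        self.shared_mode = True

    def exit_shared_mode(self):
        # Restore procedure: invalidate every line marked as shared.
        self.lines = {a: line for a, line in self.lines.items() if not line[1]}
        self.shared_mode = False

    def _line(self, addr):
        if addr not in self.lines:
            # Lines loaded in shared mode get the shared bit set.
            self.lines[addr] = [self.memory.get(addr, 0), self.shared_mode]
        return self.lines[addr]

    def read(self, addr):
        return self._line(addr)[0]

    def write(self, addr, value):
        self._line(addr)[0] = value
        if self.shared_mode:
            self.memory[addr] = value      # write-through in shared mode


memory = {"X": 0}
c0, c1 = OdcCache(memory), OdcCache(memory)

c1.write("P", 7)             # private data, accessed in private mode

c1.enter_shared_mode()
before = c1.read("X")        # X is loaded and marked as shared
c1.exit_shared_mode()        # X is invalidated, P stays cached

c0.enter_shared_mode()
c0.write("X", 1)             # written through to shared memory
c0.exit_shared_mode()

c1.enter_shared_mode()
after = c1.read("X")         # miss, reloads the new value
c1.exit_shared_mode()
```

In this model no cache ever reacts to the activity of another core: the only invalidations are those a core performs on its own lines when it leaves the shared region.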

Static WCET-Analysis with ODC²
[Figure: analysed core with its ODC² holding Data X = 0]
- The cache remains a private resource.
- Time and target of invalidation are known.
- Invalidation increases the cache-miss rate.
- Invalidation improves efficiency.
- The influence on timing is bounded and predictable.
- Additional Shared Analysis to identify accesses to shared data
- New classification of accesses in the Must/May Analysis: Always Shared, Never Shared, Sometimes Shared
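One possible reading of the three-way classification is sketched below. It assumes that an access counts as shared when it executes between `enter_shared_mode` and `exit_shared_mode`, and that the verdicts of all paths reaching the same access are merged. The enum `Sharing`, the helper `classify_accesses` and the path encoding are hypothetical; the slides do not specify how the Shared Analysis is actually implemented in OTAWA.

```python
from enum import Enum


class Sharing(Enum):
    NEVER = "Never Shared"
    ALWAYS = "Always Shared"
    SOMETIMES = "Sometimes Shared"


def join(a, b):
    """Merge the verdicts of two paths for the same access."""
    return a if a is b else Sharing.SOMETIMES


def classify_accesses(paths):
    """Each path is a list of events: mode switches or access labels."""
    result = {}
    for path in paths:
        in_shared_mode = False
        for event in path:
            if event == "enter_shared_mode":
                in_shared_mode = True
            elif event == "exit_shared_mode":
                in_shared_mode = False
            else:
                # Any other event is treated as the label of a memory access.
                seen = Sharing.ALWAYS if in_shared_mode else Sharing.NEVER
                result[event] = join(result[event], seen) if event in result else seen
    return result


paths = [
    ["load_a", "enter_shared_mode", "load_x", "exit_shared_mode", "load_b"],
    ["load_a", "enter_shared_mode", "load_x", "load_b", "exit_shared_mode"],
]
classes = classify_accesses(paths)
```

Here `load_a` is Never Shared, `load_x` is Always Shared, and `load_b` is Sometimes Shared because it lies inside the shared region on only one of the two paths.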

Evaluation
System architecture:
- Tree-type interconnect
- PPC-based cores with private L1 caches
- Shared memory
Parallel benchmarks:
- Matrix Multiplication
- Fast Fourier Transformation
- Dijkstra Shortest Path
Evaluated platforms:
- ODC²
- Uncached shared data
- Software Cache-Flush
- Magic Coherence
WCET analysis of ODC² and common alternatives using the OTAWA toolbox
[Figure: evaluation flow. Inputs: parallel benchmarks (program flow and synchronisation) and the parMERASA real-time capable many-core architecture incl. ODC². Tool: OTAWA evaluation framework, cache analysis enhanced with ODC² model.]

Evaluation
WCET analysis regarding parallel execution with 2-8 cores:
- Significantly reduced WCET compared to Uncached and Cache-Flush
- Reduced accesses to shared memory with ODC²
- Coherence overhead of ODC² depends on the access pattern to shared memory
[Figures: Normalised WCET, Number of Accesses, and Number of WT Accesses for the benchmarks Matrix, FFT and Dijkstra; series: Uncached, Flush, ODC², Magic]

Conclusion
- Existing hardware cache coherence techniques are not suitable for static WCET analysis, while software coherence techniques lack performance.
- ODC² provides a time-predictable hardware/software co-approach.
- Static timing analysis using ODC² allows a significant reduction of the WCET compared to common alternatives.
- Promising approach for hard real-time multicore systems.
Thank you for your attention. Any questions?