WCET Analysis of Parallel Benchmarks using On-Demand Coherent Cache

Size: px
Start display at page:

Download "WCET Analysis of Parallel Benchmarks using On-Demand Coherent Cache"

Transcription

1 WCET Analysis of Parallel Benchmarks using On-Demand Coherent Arthur Pyka, Lillian Tadros, Sascha Uhrig University of Dortmund Hugues Cassé, Haluk Ozaktas and Christine Rochange Université Paul Sabatier, Toulouse

2 Outline Coherence in Todays Architectures Coherence and Static WCET-Analysis In Short: The On-Demand Coherent Static WCET-Analysis with ODC² Evaluation Conclusion 2

3 Coherence in Todays Architectures s are part of almost every multi/many-core architecture. Accesses on shared data induce inconsistence of data in caches. Coherent accesses are preserved by cache coherence protocols. Core X = 0 Coherence protocols affect the timing of data accesses. Inconsistent data Coherence protocols affect the state and content of the caches. Core X = 1 Coherence protocols affect a static worst-case execution time analysis. 3

4 Example: Quad-Core LEON4-FT GR712RC Bus interconnect Private L1-s Shared L2- Write-Through Policy Quad Core LEON4 Block Diagram [1] Coherence is ensured using a Snooping-based Write-Invalidate protocol. Snooping: Write-Invalidate: Any core listen to data accesses on the bus. Invalidation of caches data by write accesses of other cores. [1] Aeroflex Gaisler LEON4 Product Sheet 4

5 Example: Quad-Core LEON4-FT GR712RC Bus interconnect Private L1-s Shared L2- Core 0 Write-Through Strategy Core 1 Data X = 0 Data X = 0 Coherence is ensured using a Snooping-based Write- Invalidate protocol. Snooping: Write-Invalidate: Shared Memory Any core listen to data accesses on the bus. Invalidation Data X of = 0 caches data by write accesses of other cores. 5

6 Example: Quad-Core LEON4-FT GR712RC Bus interconnect Private L1-s Shared L2- Write-Through Strategy Write X = 1 Invalid X Coherence is ensured using a Snooping-based Write- Invalidate protocol. Snooping: Write-Invalidate: Core 0 Shared Memory Core 1 Data X = 1 Data X = 0 Any core listen to data accesses on the bus. Invalidation Data X of = 0 caches data by write accesses of other cores. 6

7 Example: Quad-Core LEON4-FT GR712RC Bus interconnect Private L1-s Shared L2- Core 0 Write-Through Strategy Data X = 1 Core 1 Read X = 1 Coherence is ensured using a Snooping-based Write- Invalidate protocol. Snooping: Write-Invalidate: Shared Memory Any core listen to data accesses on the bus. Invalidation Data X of = 1 caches data by write accesses of other cores. 7

8 Coherence and Static WCET-Analysis Estimation of the WCET using a static analysis tool Abstraction of the hardware architecture incl. caches Analysis to calculate data access latencies Analysis Overview of the OTAWA framework [2] Prediction of cache-hits and cache-misses Abstraction of possible cache states (Must/May Analysis) Modification of cache state by data accesses in program Analysis of private cache in isolation from other cores [2] Ballabriga, C. et al.: OTAWA: an Open Toolbox for Adaptive WCET Analysis. In: The 8th IFIP Workshop on Software Technologies for Future Embedded and Ubiquitous Systems 8

9 Coherence and Static WCET-Analysis Analysis CORE Data X = 0 External Invalidation Data X = 1 External Invalidations turn the cache into a shared resource Target of invalidation is unknown -> Invalidation of any data must be assumed Time of invalidation is unknown -> Invalidation must be assumed at any possible time Prediction of cache-hits is nearly impossible External invalidations prohibit a feasible WCET estimation. 9

10 Example: Quad-Core LEON4-FT GR712RC Bus interconnect Private L1-s Shared L2- Write-Through Policy Quad Core LEON4 Block Diagram [1] Alternative: Pure software-based cache coherence using the -Flush instruction. -Flush: Invalidation of the complete cache triggered by a special instruction. [1] Aeroflex Gaisler LEON4 Product Sheet 10

11 On-Demand Coherent (ODC²) Core ODC ODC² Block-Frame Shared Bit Crtl Repl. info Tag Data Replacement information Control flags (Valid, Modified) Indication of possible carrier of shared data Software/Hardware co-approach Handling of coherence on demand Different handling of accesses to private and shared data Access to shared data only in protected code regions Invalidation of shared data after use No coherence transactions 11

12 On-Demand Coherent (ODC²) Abstract Program ODC² Functionality - Accesses on private data - Mutex Lock / Barrier; enter_shared_mode(); - Accesses on private/shared data - exit_shared_mode(); Mutex Unlock / Barrier; - Accesses on private data - ODC² in Private-Mode No coherence-operations Functionality and timing equal to incoherent cache ODC² in Shared-Mode Newly loaded cache-lines are marked as shared Checking of target address Write-Through Policy ODC² in Restore-Procedure Invalidation of marked cache-lines ODC² in Private-Mode No coherence-operations Functionality and timing equal to incoherent cache 12

13 Static WCET-Analysis with ODC² Analysis CORE ODC Data X = 0 remains private resource Time and target of invalidation is known Invalidation increases cache-miss rate Invalidation improve efficiency Influence on timing is bounded and predictable Additional Shared Analysis to identify accesses to shared data New classification of accesses in Must/May Analysis: Always Shared, Never Shared, Sometimes Shared 13

14 Evaluation System architecture: Tree-type interconnect PPC-based cores with private L1-s Shared memory Parallel Benchmarks: Matrix Multiplication Fast Fourier Transformation Dijkstra Shortest Path Evaluated Platforms: ODC² Uncached shared data Software -Flush Magic Coherence Parallel benchmarks Program flow and synchronisation WCET analysis of ODC² and common alternatives using the OTAWA toolbox parmerasa real-time capable many-core architecture incl. ODC² OTAWA evaluation framework analysis enhanced with ODC² model 14

15 Evaluation WCET Analysis regarding parallel execution with 2-8 cores. Significantly reduced WCET compared to Uncached and -Flush Reduced accesses to shared memory with ODC² Coherence overhead of ODC² depends on access pattern to shared memory Number of Accesses Normalised WCET 3,50 3,00 2,50 2,00 1,50 1,00 0,50 0,00 Matrix FFT Djikstra Number of WT Accesses Uncached Flush ODC² Magic 0 Matrix FFT Dijkstra 0 Matrix FFT Dijkstra 15

16 Conclusion Existing hardware cache coherence techniques are not suitable for a static WCET analysis, software coherence techniques lack in performance. ODC² permits time-predictable hardware/software co-approach. Static timing analysis using ODC² allows significant reduction of WCET compared to common alternatives. Promising approach for hard real-time multicore systems. Thank you for your attention Any questions? 16

Lecture 25: Multiprocessors

Lecture 25: Multiprocessors Lecture 25: Multiprocessors Today s topics: Virtual memory wrap-up Snooping-based cache coherence protocol Directory-based cache coherence protocol Synchronization 1 TLB and Cache Is the cache indexed

More information

Experiences and Results of Parallelisation of Industrial Hard Real-time Applications for the parmerasa Multi-core

Experiences and Results of Parallelisation of Industrial Hard Real-time Applications for the parmerasa Multi-core Experiences and Results of Parallelisation of Industrial Hard Real-time Applications for the parmerasa Multi-core Theo Ungerer 1, Christian Bradatsch 1, Martin Frieb 1, Florian Kluge 1, Jörg Mische 1,

More information

Flynn s Classification

Flynn s Classification Flynn s Classification SISD (Single Instruction Single Data) Uniprocessors MISD (Multiple Instruction Single Data) No machine is built yet for this type SIMD (Single Instruction Multiple Data) Examples:

More information

A Tuneable Software Cache Coherence Protocol for Heterogeneous MPSoCs. Marco Bekooij & Frank Ophelders

A Tuneable Software Cache Coherence Protocol for Heterogeneous MPSoCs. Marco Bekooij & Frank Ophelders A Tuneable Software Cache Coherence Protocol for Heterogeneous MPSoCs Marco Bekooij & Frank Ophelders Outline Context What is cache coherence Addressed challenge Short overview of related work Related

More information

Shared Memory Multiprocessors. Symmetric Shared Memory Architecture (SMP) Cache Coherence. Cache Coherence Mechanism. Interconnection Network

Shared Memory Multiprocessors. Symmetric Shared Memory Architecture (SMP) Cache Coherence. Cache Coherence Mechanism. Interconnection Network Shared Memory Multis Processor Processor Processor i Processor n Symmetric Shared Memory Architecture (SMP) cache cache cache cache Interconnection Network Main Memory I/O System Cache Coherence Cache

More information

Lecture 25: Multiprocessors. Today s topics: Snooping-based cache coherence protocol Directory-based cache coherence protocol Synchronization

Lecture 25: Multiprocessors. Today s topics: Snooping-based cache coherence protocol Directory-based cache coherence protocol Synchronization Lecture 25: Multiprocessors Today s topics: Snooping-based cache coherence protocol Directory-based cache coherence protocol Synchronization 1 Snooping-Based Protocols Three states for a block: invalid,

More information

ECE 669 Parallel Computer Architecture

ECE 669 Parallel Computer Architecture ECE 669 Parallel Computer Architecture Lecture 18 Scalable Parallel Caches Overview ost cache protocols are more complicated than two state Snooping not effective for network-based systems Consider three

More information

D 8.4 Workshop Report

D 8.4 Workshop Report Project Number 288008 D 8.4 Workshop Report Version 2.0 30 July 2014 Final Public Distribution Denmark Technical University, Eindhoven University of Technology, Technical University of Vienna, The Open

More information

Chapter 5. Multiprocessors and Thread-Level Parallelism

Chapter 5. Multiprocessors and Thread-Level Parallelism Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model

More information

Introduction to Multiprocessors (Part II) Cristina Silvano Politecnico di Milano

Introduction to Multiprocessors (Part II) Cristina Silvano Politecnico di Milano Introduction to Multiprocessors (Part II) Cristina Silvano Politecnico di Milano Outline The problem of cache coherence Snooping protocols Directory-based protocols Prof. Cristina Silvano, Politecnico

More information

Multiprocessor Cache Coherency. What is Cache Coherence?

Multiprocessor Cache Coherency. What is Cache Coherence? Multiprocessor Cache Coherency CS448 1 What is Cache Coherence? Two processors can have two different values for the same memory location 2 1 Terminology Coherence Defines what values can be returned by

More information

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming Cache design review Let s say your code executes int x = 1; (Assume for simplicity x corresponds to the address 0x12345604

More information

Chapter 5. Multiprocessors and Thread-Level Parallelism

Chapter 5. Multiprocessors and Thread-Level Parallelism Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model

More information

History-based Schemes and Implicit Path Enumeration

History-based Schemes and Implicit Path Enumeration History-based Schemes and Implicit Path Enumeration Claire Burguière and Christine Rochange Institut de Recherche en Informatique de Toulouse Université Paul Sabatier 6 Toulouse cedex 9, France {burguier,rochange}@irit.fr

More information

This is an author-deposited version published in : Eprints ID : 12597

This is an author-deposited version published in :  Eprints ID : 12597 Open Archive TOULOUSE Archive Ouverte (OATAO) OATAO is an open access repository that collects the work of Toulouse researchers and makes it freely available over the web where possible. This is an author-deposited

More information

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015

Lecture 10: Cache Coherence: Part I. Parallel Computer Architecture and Programming CMU /15-618, Spring 2015 Lecture 10: Cache Coherence: Part I Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Marble House The Knife (Silent Shout) Before starting The Knife, we were working

More information

Caches in Real-Time Systems. Instruction Cache vs. Data Cache

Caches in Real-Time Systems. Instruction Cache vs. Data Cache Caches in Real-Time Systems [Xavier Vera, Bjorn Lisper, Jingling Xue, Data Caches in Multitasking Hard Real- Time Systems, RTSS 2003.] Schedulability Analysis WCET Simple Platforms WCMP (memory performance)

More information

Chapter 05. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 05. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 05 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 5.1 Basic structure of a centralized shared-memory multiprocessor based on a multicore chip.

More information

Memory Hierarchy in a Multiprocessor

Memory Hierarchy in a Multiprocessor EEC 581 Computer Architecture Multiprocessor and Coherence Department of Electrical Engineering and Computer Science Cleveland State University Hierarchy in a Multiprocessor Shared cache Fully-connected

More information

Lecture 18: Coherence and Synchronization. Topics: directory-based coherence protocols, synchronization primitives (Sections

Lecture 18: Coherence and Synchronization. Topics: directory-based coherence protocols, synchronization primitives (Sections Lecture 18: Coherence and Synchronization Topics: directory-based coherence protocols, synchronization primitives (Sections 5.1-5.5) 1 Cache Coherence Protocols Directory-based: A single location (directory)

More information

EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy Prof. Sherief Reda School of Engineering Brown University

EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy Prof. Sherief Reda School of Engineering Brown University EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy Prof. Sherief Reda School of Engineering Brown University Material from: Parallel Computer Organization and Design by Debois,

More information

Chapter 8. Multiprocessors. In-Cheol Park Dept. of EE, KAIST

Chapter 8. Multiprocessors. In-Cheol Park Dept. of EE, KAIST Chapter 8. Multiprocessors In-Cheol Park Dept. of EE, KAIST Can the rapid rate of uniprocessor performance growth be sustained indefinitely? If the pace does slow down, multiprocessor architectures will

More information

Cache Coherence. Bryan Mills, PhD. Slides provided by Rami Melhem

Cache Coherence. Bryan Mills, PhD. Slides provided by Rami Melhem Cache Coherence Bryan Mills, PhD Slides provided by Rami Melhem Cache coherence Programmers have no control over caches and when they get updated. x = 2; /* initially */ y0 eventually ends up = 2 y1 eventually

More information

Shared memory. Caches, Cache coherence and Memory consistency models. Diego Fabregat-Traver and Prof. Paolo Bientinesi WS15/16

Shared memory. Caches, Cache coherence and Memory consistency models. Diego Fabregat-Traver and Prof. Paolo Bientinesi WS15/16 Shared memory Caches, Cache coherence and Memory consistency models Diego Fabregat-Traver and Prof. Paolo Bientinesi HPAC, RWTH Aachen fabregat@aices.rwth-aachen.de WS15/16 Shared memory Caches, Cache

More information

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types Chapter 5 Multiprocessor Cache Coherence Thread-Level Parallelism 1: read 2: read 3: write??? 1 4 From ILP to TLP Memory System is Coherent If... ILP became inefficient in terms of Power consumption Silicon

More information

Data-Centric Consistency Models. The general organization of a logical data store, physically distributed and replicated across multiple processes.

Data-Centric Consistency Models. The general organization of a logical data store, physically distributed and replicated across multiple processes. Data-Centric Consistency Models The general organization of a logical data store, physically distributed and replicated across multiple processes. Consistency models The scenario we will be studying: Some

More information

M4 Parallelism. Implementation of Locks Cache Coherence

M4 Parallelism. Implementation of Locks Cache Coherence M4 Parallelism Implementation of Locks Cache Coherence Outline Parallelism Flynn s classification Vector Processing Subword Parallelism Symmetric Multiprocessors, Distributed Memory Machines Shared Memory

More information

Lecture 21: Virtual Memory. Spring 2018 Jason Tang

Lecture 21: Virtual Memory. Spring 2018 Jason Tang Lecture 21: Virtual Memory Spring 2018 Jason Tang 1 Topics Virtual addressing Page tables Translation lookaside buffer 2 Computer Organization Computer Processor Memory Devices Control Datapath Input Output

More information

Lecture 33: Multiprocessors Synchronization and Consistency Professor Randy H. Katz Computer Science 252 Spring 1996

Lecture 33: Multiprocessors Synchronization and Consistency Professor Randy H. Katz Computer Science 252 Spring 1996 Lecture 33: Multiprocessors Synchronization and Consistency Professor Randy H. Katz Computer Science 252 Spring 1996 RHK.S96 1 Review: Miss Rates for Snooping Protocol 4th C: Coherency Misses More processors:

More information

Design and Analysis of Time-Critical Systems Introduction

Design and Analysis of Time-Critical Systems Introduction Design and Analysis of Time-Critical Systems Introduction Jan Reineke @ saarland university ACACES Summer School 2017 Fiuggi, Italy computer science Structure of this Course 2. How are they implemented?

More information

Incoherent each cache copy behaves as an individual copy, instead of as the same memory location.

Incoherent each cache copy behaves as an individual copy, instead of as the same memory location. Cache Coherence This lesson discusses the problems and solutions for coherence. Different coherence protocols are discussed, including: MSI, MOSI, MOESI, and Directory. Each has advantages and disadvantages

More information

Predictable Cache Coherence for Multi- Core Real-Time Systems

Predictable Cache Coherence for Multi- Core Real-Time Systems Predictable Cache Coherence for Multi- Core Real-Time Systems Mohamed Hassan, Anirudh M. Kaushik and Hiren Patel RTAS 2017 Motivation: Data sharing in multi-core real-time systems Core 1 RT tasks L1 D/I

More information

Scope-based Method Cache Analysis

Scope-based Method Cache Analysis Scope-based Method Cache Analysis Benedikt Huber 1, Stefan Hepp 1, Martin Schoeberl 2 1 Vienna University of Technology 2 Technical University of Denmark 14th International Workshop on Worst-Case Execution

More information

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012)

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012) Cache Coherence CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Shared memory multi-processor Processors read and write to shared variables - More precisely: processors issues

More information

The Cache-Coherence Problem

The Cache-Coherence Problem The -Coherence Problem Lecture 12 (Chapter 6) 1 Outline Bus-based multiprocessors The cache-coherence problem Peterson s algorithm Coherence vs. consistency Shared vs. Distributed Memory What is the difference

More information

Lecture 24: Board Notes: Cache Coherency

Lecture 24: Board Notes: Cache Coherency Lecture 24: Board Notes: Cache Coherency Part A: What makes a memory system coherent? Generally, 3 qualities that must be preserved (SUGGESTIONS?) (1) Preserve program order: - A read of A by P 1 will

More information

Timing analysis and timing predictability

Timing analysis and timing predictability Timing analysis and timing predictability Architectural Dependences Reinhard Wilhelm Saarland University, Saarbrücken, Germany ArtistDesign Summer School in China 2010 What does the execution time depends

More information

Multiprocessor Synchronization

Multiprocessor Synchronization Multiprocessor Systems Memory Consistency In addition, read Doeppner, 5.1 and 5.2 (Much material in this section has been freely borrowed from Gernot Heiser at UNSW and from Kevin Elphinstone) MP Memory

More information

Foundations of Computer Systems

Foundations of Computer Systems 18-600 Foundations of Computer Systems Lecture 21: Multicore Cache Coherence John P. Shen & Zhiyi Yu November 14, 2016 Prevalence of multicore processors: 2006: 75% for desktops, 85% for servers 2007:

More information

Migrating from the UT699 to the UT699E

Migrating from the UT699 to the UT699E Standard Products Application Note Migrating from the UT699 to the UT699E January 2015 www.aeroflex.com/leon Table 1.1 Cross Reference of Applicable Products Product Name: Manufacturer Part Number SMD

More information

Lecture 24: Multiprocessing Computer Architecture and Systems Programming ( )

Lecture 24: Multiprocessing Computer Architecture and Systems Programming ( ) Systems Group Department of Computer Science ETH Zürich Lecture 24: Multiprocessing Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Most of the rest of this

More information

Shared Memory SMP and Cache Coherence (cont) Adapted from UCB CS252 S01, Copyright 2001 USB

Shared Memory SMP and Cache Coherence (cont) Adapted from UCB CS252 S01, Copyright 2001 USB Shared SMP and Cache Coherence (cont) Adapted from UCB CS252 S01, Copyright 2001 USB 1 Review: Snoopy Cache Protocol Write Invalidate Protocol: Multiple readers, single writer Write to shared data: an

More information

ECE/CS 757: Homework 1

ECE/CS 757: Homework 1 ECE/CS 757: Homework 1 Cores and Multithreading 1. A CPU designer has to decide whether or not to add a new micoarchitecture enhancement to improve performance (ignoring power costs) of a block (coarse-grain)

More information

CS377P Programming for Performance Multicore Performance Cache Coherence

CS377P Programming for Performance Multicore Performance Cache Coherence CS377P Programming for Performance Multicore Performance Cache Coherence Sreepathi Pai UTCS October 26, 2015 Outline 1 Cache Coherence 2 Cache Coherence Awareness 3 Scalable Lock Design 4 Transactional

More information

Multiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache.

Multiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache. Multiprocessors why would you want a multiprocessor? Multiprocessors and Multithreading more is better? Cache Cache Cache Classifying Multiprocessors Flynn Taxonomy Flynn Taxonomy Interconnection Network

More information

CONTENTION IN MULTICORE HARDWARE SHARED RESOURCES: UNDERSTANDING OF THE STATE OF THE ART

CONTENTION IN MULTICORE HARDWARE SHARED RESOURCES: UNDERSTANDING OF THE STATE OF THE ART CONTENTION IN MULTICORE HARDWARE SHARED RESOURCES: UNDERSTANDING OF THE STATE OF THE ART Gabriel Fernandez 1, Jaume Abella 2, Eduardo Quiñones 2, Christine Rochange 3, Tullio Vardanega 4 and Francisco

More information

CMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 15: Memory Consistency and Synchronization Prof. Yanjing Li University of Chicago Administrative Stuff! Lab 5 (multi-core) " Basic requirements: out later today

More information

1. Define algorithm complexity 2. What is called out of order in detail? 3. Define Hardware prefetching. 4. Define software prefetching. 5. Define wor

1. Define algorithm complexity 2. What is called out of order in detail? 3. Define Hardware prefetching. 4. Define software prefetching. 5. Define wor CS6801-MULTICORE ARCHECTURES AND PROGRAMMING UN I 1. Difference between Symmetric Memory Architecture and Distributed Memory Architecture. 2. What is Vector Instruction? 3. What are the factor to increasing

More information

Thread- Level Parallelism. ECE 154B Dmitri Strukov

Thread- Level Parallelism. ECE 154B Dmitri Strukov Thread- Level Parallelism ECE 154B Dmitri Strukov Introduc?on Thread- Level parallelism Have mul?ple program counters and resources Uses MIMD model Targeted for?ghtly- coupled shared- memory mul?processors

More information

EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy

EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy Prof. Sherief Reda School of Engineering Brown University Material from: Parallel Computer Organization and Design by Debois,

More information

PCS - Part Two: Multiprocessor Architectures

PCS - Part Two: Multiprocessor Architectures PCS - Part Two: Multiprocessor Architectures Institute of Computer Engineering University of Lübeck, Germany Baltic Summer School, Tartu 2008 Part 2 - Contents Multiprocessor Systems Symmetrical Multiprocessors

More information

MC2: Multicore and Cache Analysis via Deterministic and Probabilistic Jitter Bounding

MC2: Multicore and Cache Analysis via Deterministic and Probabilistic Jitter Bounding www.bsc.es MC2: Multicore and Cache Analysis via Deterministic and Probabilistic Jitter Bounding Enrique Díaz¹,², Mikel Fernández¹, Leonidas Kosmidis¹, Enrico Mezzetti¹, Carles Hernandez¹, Jaume Abella¹,

More information

Advanced and parallel architectures. Part B. Prof. A. Massini. June 13, Exercise 1a (3 points) Exercise 1b (3 points) Exercise 2 (8 points)

Advanced and parallel architectures. Part B. Prof. A. Massini. June 13, Exercise 1a (3 points) Exercise 1b (3 points) Exercise 2 (8 points) Advanced and parallel architectures Prof. A. Massini June 13, 2017 Part B Exercise 1a (3 points) Exercise 1b (3 points) Exercise 2 (8 points) Student s Name Exercise 3 (4 points) Exercise 4 (3 points)

More information

PARALLEL MEMORY ARCHITECTURE

PARALLEL MEMORY ARCHITECTURE PARALLEL MEMORY ARCHITECTURE Mahdi Nazm Bojnordi Assistant Professor School of Computing University of Utah CS/ECE 6810: Computer Architecture Overview Announcement Homework 6 is due tonight n The last

More information

Multicore Workshop. Cache Coherency. Mark Bull David Henty. EPCC, University of Edinburgh

Multicore Workshop. Cache Coherency. Mark Bull David Henty. EPCC, University of Edinburgh Multicore Workshop Cache Coherency Mark Bull David Henty EPCC, University of Edinburgh Symmetric MultiProcessing 2 Each processor in an SMP has equal access to all parts of memory same latency and bandwidth

More information

COMP Parallel Computing. CC-NUMA (1) CC-NUMA implementation

COMP Parallel Computing. CC-NUMA (1) CC-NUMA implementation COP 633 - Parallel Computing Lecture 10 September 27, 2018 CC-NUA (1) CC-NUA implementation Reading for next time emory consistency models tutorial (sections 1-6, pp 1-17) COP 633 - Prins CC-NUA (1) Topics

More information

Mul$processor Architecture. CS 5334/4390 Spring 2014 Shirley Moore, Instructor February 4, 2014

Mul$processor Architecture. CS 5334/4390 Spring 2014 Shirley Moore, Instructor February 4, 2014 Mul$processor Architecture CS 5334/4390 Spring 2014 Shirley Moore, Instructor February 4, 2014 1 Agenda Announcements (5 min) Quick quiz (10 min) Analyze results of STREAM benchmark (15 min) Mul$processor

More information

Aperiodic Servers (Issues)

Aperiodic Servers (Issues) Aperiodic Servers (Issues) Interference Causes more interference than simple periodic tasks Increased context switching Cache interference Accounting Context switch time Again, possibly more context switches

More information

1. Introduction. 1 Multi-Core Execution of Hard Real-Time Applications Supporting Analysability. This research is partially funded by the

1. Introduction. 1 Multi-Core Execution of Hard Real-Time Applications Supporting Analysability. This research is partially funded by the WCET ANALYSIS OF A PARALLEL 3D MULTIGRID SOLVER EXECUTED ON THE MERASA MULTI-CORE 1 Christine Rochange 2, Armelle Bonenfant 2, Pascal Sainrat 2, Mike Gerdes 3, Julian Wolf 3, Theo Ungerer 3, Zlatko Petrov

More information

MERASA: MULTICORE EXECUTION OF HARD REAL-TIME APPLICATIONS SUPPORTING ANALYZABILITY

MERASA: MULTICORE EXECUTION OF HARD REAL-TIME APPLICATIONS SUPPORTING ANALYZABILITY ... MERASA: MULTICORE EXECUTION OF HARD REAL-TIME APPLICATIONS SUPPORTING ANALYZABILITY... THE MERASA PROJECT AIMS TO ACHIEVE A BREAKTHROUGH IN HARDWARE DESIGN, HARD REAL-TIME SUPPORT IN SYSTEM SOFTWARE,

More information

Design and Analysis of Real-Time Systems Predictability and Predictable Microarchitectures

Design and Analysis of Real-Time Systems Predictability and Predictable Microarchitectures Design and Analysis of Real-Time Systems Predictability and Predictable Microarcectures Jan Reineke Advanced Lecture, Summer 2013 Notion of Predictability Oxford Dictionary: predictable = adjective, able

More information

parmerasa Dissemination Event Address of Welcome

parmerasa Dissemination Event Address of Welcome parmerasa Dissemination Event Address of Welcome Prof. Dr. Theo Ungerer, parmerasa Project Coordinator, University of Augsburg Prof. Dr. Jesús Labarta, Director of Computer Sciences Department of Barcelona

More information

Review on ichat: Inter Cache Hardware Assistant Data Transfer for Heterogeneous Chip Multiprocessors. By: Anvesh Polepalli Raj Muchhala

Review on ichat: Inter Cache Hardware Assistant Data Transfer for Heterogeneous Chip Multiprocessors. By: Anvesh Polepalli Raj Muchhala Review on ichat: Inter Cache Hardware Assistant Data Transfer for Heterogeneous Chip Multiprocessors By: Anvesh Polepalli Raj Muchhala Introduction Integrating CPU and GPU into a single chip for performance

More information

Evaluating the Portability of UPC to the Cell Broadband Engine

Evaluating the Portability of UPC to the Cell Broadband Engine Evaluating the Portability of UPC to the Cell Broadband Engine Dipl. Inform. Ruben Niederhagen JSC Cell Meeting CHAIR FOR OPERATING SYSTEMS Outline Introduction UPC Cell UPC on Cell Mapping Compiler and

More information

CMPE 511 TERM PAPER. Distributed Shared Memory Architecture. Seda Demirağ

CMPE 511 TERM PAPER. Distributed Shared Memory Architecture. Seda Demirağ CMPE 511 TERM PAPER Distributed Shared Memory Architecture by Seda Demirağ 2005701688 1. INTRODUCTION: Despite the advances in processor design, users still demand more and more performance. Eventually,

More information

Multiprocessors and Locking

Multiprocessors and Locking Types of Multiprocessors (MPs) Uniform memory-access (UMA) MP Access to all memory occurs at the same speed for all processors. Multiprocessors and Locking COMP9242 2008/S2 Week 12 Part 1 Non-uniform memory-access

More information

Multiprocessors II: CC-NUMA DSM. CC-NUMA for Large Systems

Multiprocessors II: CC-NUMA DSM. CC-NUMA for Large Systems Multiprocessors II: CC-NUMA DSM DSM cache coherence the hardware stuff Today s topics: what happens when we lose snooping new issues: global vs. local cache line state enter the directory issues of increasing

More information

COSC4201 Multiprocessors

COSC4201 Multiprocessors COSC4201 Multiprocessors Prof. Mokhtar Aboelaze Parts of these slides are taken from Notes by Prof. David Patterson (UCB) Multiprocessing We are dedicating all of our future product development to multicore

More information

Timing Anomalies Reloaded

Timing Anomalies Reloaded Gernot Gebhard AbsInt Angewandte Informatik GmbH 1 of 20 WCET 2010 Brussels Belgium Timing Anomalies Reloaded Gernot Gebhard AbsInt Angewandte Informatik GmbH Brussels, 6 th July, 2010 Gernot Gebhard AbsInt

More information

Multiprocessor scheduling, part 1 -ChallengesPontus Ekberg

Multiprocessor scheduling, part 1 -ChallengesPontus Ekberg Multiprocessor scheduling, part 1 -ChallengesPontus Ekberg 2017-10-03 What is a multiprocessor? Simplest answer: A machine with >1 processors! In scheduling theory, we include multicores in this defnition

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

EC 513 Computer Architecture

EC 513 Computer Architecture EC 513 Computer Architecture Cache Coherence - Directory Cache Coherence Prof. Michel A. Kinsy Shared Memory Multiprocessor Processor Cores Local Memories Memory Bus P 1 Snoopy Cache Physical Memory P

More information

CS433 Homework 6. Problem 1 [15 points] Assigned on 11/28/2017 Due in class on 12/12/2017

CS433 Homework 6. Problem 1 [15 points] Assigned on 11/28/2017 Due in class on 12/12/2017 CS433 Homework 6 Assigned on 11/28/2017 Due in class on 12/12/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies on collaboration.

More information

Concurrent Preliminaries

Concurrent Preliminaries Concurrent Preliminaries Sagi Katorza Tel Aviv University 09/12/2014 1 Outline Hardware infrastructure Hardware primitives Mutual exclusion Work sharing and termination detection Concurrent data structures

More information

Contents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11

Contents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11 Preface xvii Acknowledgments xix CHAPTER 1 Introduction to Parallel Computing 1 1.1 Motivating Parallelism 2 1.1.1 The Computational Power Argument from Transistors to FLOPS 2 1.1.2 The Memory/Disk Speed

More information

Module 9: "Introduction to Shared Memory Multiprocessors" Lecture 16: "Multiprocessor Organizations and Cache Coherence" Shared Memory Multiprocessors

Module 9: Introduction to Shared Memory Multiprocessors Lecture 16: Multiprocessor Organizations and Cache Coherence Shared Memory Multiprocessors Shared Memory Multiprocessors Shared memory multiprocessors Shared cache Private cache/dancehall Distributed shared memory Shared vs. private in CMPs Cache coherence Cache coherence: Example What went

More information

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática Architectures for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 14

More information

The Cache Write Problem

The Cache Write Problem Cache Coherency A multiprocessor and a multicomputer each comprise a number of independent processors connected by a communications medium, either a bus or more advanced switching system, such as a crossbar

More information

Goldibear and the 3 Locks. Programming With Locks Is Tricky. More Lock Madness. And To Make It Worse. Transactional Memory: The Big Idea

Goldibear and the 3 Locks. Programming With Locks Is Tricky. More Lock Madness. And To Make It Worse. Transactional Memory: The Big Idea Programming With Locks s Tricky Multicore processors are the way of the foreseeable future thread-level parallelism anointed as parallelism model of choice Just one problem Writing lock-based multi-threaded

More information

Module 7: Synchronization Lecture 13: Introduction to Atomic Primitives. The Lecture Contains: Synchronization. Waiting Algorithms.

Module 7: Synchronization Lecture 13: Introduction to Atomic Primitives. The Lecture Contains: Synchronization. Waiting Algorithms. The Lecture Contains: Synchronization Waiting Algorithms Implementation Hardwired Locks Software Locks Hardware Support Atomic Exchange Test & Set Fetch & op Compare & Swap Traffic of Test & Set Backoff

More information

Fall 2011 PhD Qualifier Exam

Fall 2011 PhD Qualifier Exam Computer Architecture Area Fall 2011 PhD Qualifier Exam November 18 th 2011 This exam has six (6) equally- weighted problems. You must submit your answers to all six problems. Write your answers/solutions

More information

Page 1. Cache Coherence

Page 1. Cache Coherence Page 1 Cache Coherence 1 Page 2 Memory Consistency in SMPs CPU-1 CPU-2 A 100 cache-1 A 100 cache-2 CPU-Memory bus A 100 memory Suppose CPU-1 updates A to 200. write-back: memory and cache-2 have stale

More information

Multiprocessors & Thread Level Parallelism

Multiprocessors & Thread Level Parallelism Multiprocessors & Thread Level Parallelism COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Introduction

More information

Vulcan: Hardware Support for Detecting Sequential Consistency Violations Dynamically

Vulcan: Hardware Support for Detecting Sequential Consistency Violations Dynamically Vulcan: Hardware Support for Detecting Sequential Consistency Violations Dynamically Abdullah Muzahid, Shanxiang Qi, and University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu MICRO December

More information

Advanced and parallel architectures

Advanced and parallel architectures Cognome Nome Advanced and parallel architectures Prof. A. Massini June 11, 2015 Exercise 1a (2 points) Exercise 1b (2 points) Exercise 2 (5 points) Exercise 3 (3 points) Exercise 4a (3 points) Exercise

More information

Optimizing Replication, Communication, and Capacity Allocation in CMPs

Optimizing Replication, Communication, and Capacity Allocation in CMPs Optimizing Replication, Communication, and Capacity Allocation in CMPs Zeshan Chishti, Michael D Powell, and T. N. Vijaykumar School of ECE Purdue University Motivation CMP becoming increasingly important

More information

Cache Coherence Protocols for Chip Multiprocessors - I

Cache Coherence Protocols for Chip Multiprocessors - I Cache Coherence Protocols for Chip Multiprocessors - I John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 522 Lecture 5 6 September 2016 Context Thus far chip multiprocessors

More information

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM UNIT III MULTIPROCESSORS AND THREAD LEVEL PARALLELISM 1. Symmetric Shared Memory Architectures: The Symmetric Shared Memory Architecture consists of several processors with a single physical memory shared

More information

Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems

Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems Taming Non-blocking Caches to Improve Isolation in Multicore Real-Time Systems Prathap Kumar Valsan, Heechul Yun, Farzad Farshchi University of Kansas 1 Why? High-Performance Multicores for Real-Time Systems

More information

Directory Implementation. A High-end MP

Directory Implementation. A High-end MP ectory Implementation Distributed memory each processor (or cluster of processors) has its own memory processor-memory pairs are connected via a multi-path interconnection network snooping with broadcasting

More information

Lecture 15: Virtual Memory and Large Caches. Today: TLB design and large cache design basics (Sections )

Lecture 15: Virtual Memory and Large Caches. Today: TLB design and large cache design basics (Sections ) Lecture 15: Virtual Memory and Large Caches Today: TLB design and large cache design basics (Sections 5.3-5.4) 1 TLB and Cache Is the cache indexed with virtual or physical address? To index with a physical

More information

10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems

10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems 1 License: http://creativecommons.org/licenses/by-nc-nd/3.0/ 10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems To enhance system performance and, in some cases, to increase

More information

12 Cache-Organization 1

12 Cache-Organization 1 12 Cache-Organization 1 Caches Memory, 64M, 500 cycles L1 cache 64K, 1 cycles 1-5% misses L2 cache 4M, 10 cycles 10-20% misses L3 cache 16M, 20 cycles Memory, 256MB, 500 cycles 2 Improving Miss Penalty

More information

Analysis and Implementation of Global Preemptive Fixed-Priority Scheduling with Dynamic Cache Allocation

Analysis and Implementation of Global Preemptive Fixed-Priority Scheduling with Dynamic Cache Allocation Analysis and Implementation of Global Preemptive Fixed-Priority Scheduling with Dynamic Cache Allocation Meng Xu Linh Thi Xuan Phan Hyon-Young Choi Insup Lee Department of Computer and Information Science

More information

Lecture 18: Coherence Protocols. Topics: coherence protocols for symmetric and distributed shared-memory multiprocessors (Sections

Lecture 18: Coherence Protocols. Topics: coherence protocols for symmetric and distributed shared-memory multiprocessors (Sections Lecture 18: Coherence Protocols Topics: coherence protocols for symmetric and distributed shared-memory multiprocessors (Sections 4.2-4.4) 1 SMP/UMA/Centralized Memory Multiprocessor Main Memory I/O System

More information

Coding guidelines for WCET analysis using measurement-based and static analysis techniques

Coding guidelines for WCET analysis using measurement-based and static analysis techniques INSTITUT DE RECHERCHE EN INFORMATIQUE DE TOULOUSE Coding guidelines for WCET analysis using measurement-based and static analysis techniques Armelle Bonenfant, Ian Broster, Clément Ballabriga, Guillem

More information

Multiprocessors. Key questions: How do parallel processors share data? How do parallel processors coordinate? How many processors?

Multiprocessors. Key questions: How do parallel processors share data? How do parallel processors coordinate? How many processors? Multiprocessors Key questions: How do parallel processors share data? How do parallel processors coordinate? How many processors? Two main approaches to sharing data: Shared memory (single address space)

More information

Secure Hierarchy-Aware Cache Replacement Policy (SHARP): Defending Against Cache-Based Side Channel Attacks

Secure Hierarchy-Aware Cache Replacement Policy (SHARP): Defending Against Cache-Based Side Channel Attacks : Defending Against Cache-Based Side Channel Attacks Mengjia Yan, Bhargava Gopireddy, Thomas Shull, Josep Torrellas University of Illinois at Urbana-Champaign http://iacoma.cs.uiuc.edu Presented by Mengjia

More information

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 5. Multiprocessors and Thread-Level Parallelism

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 5. Multiprocessors and Thread-Level Parallelism Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model

More information