WCET Analysis of Parallel Benchmarks using On-Demand Coherent Cache

WCET Analysis of Parallel Benchmarks using On-Demand Coherent Arthur Pyka, Lillian Tadros, Sascha Uhrig University of Dortmund Hugues Cassé, Haluk Ozaktas and Christine Rochange Université Paul Sabatier, Toulouse

Outline Coherence in Todays Architectures Coherence and Static WCET-Analysis In Short: The On-Demand Coherent Static WCET-Analysis with ODC² Evaluation Conclusion 2

Coherence in Todays Architectures s are part of almost every multi/many-core architecture. Accesses on shared data induce inconsistence of data in caches. Coherent accesses are preserved by cache coherence protocols. Core X = 0 Coherence protocols affect the timing of data accesses. Inconsistent data Coherence protocols affect the state and content of the caches. Core X = 1 Coherence protocols affect a static worst-case execution time analysis. 3

Example: Quad-Core LEON4-FT GR712RC Bus interconnect Private L1-s Shared L2- Write-Through Policy Quad Core LEON4 Block Diagram [1] Coherence is ensured using a Snooping-based Write-Invalidate protocol. Snooping: Write-Invalidate: Any core listen to data accesses on the bus. Invalidation of caches data by write accesses of other cores. [1] Aeroflex Gaisler LEON4 Product Sheet 4

Example: Quad-Core LEON4-FT GR712RC Bus interconnect Private L1-s Shared L2- Core 0 Write-Through Strategy Core 1 Data X = 0 Data X = 0 Coherence is ensured using a Snooping-based Write- Invalidate protocol. Snooping: Write-Invalidate: Shared Memory Any core listen to data accesses on the bus. Invalidation Data X of = 0 caches data by write accesses of other cores. 5

Example: Quad-Core LEON4-FT GR712RC Bus interconnect Private L1-s Shared L2- Write-Through Strategy Write X = 1 Invalid X Coherence is ensured using a Snooping-based Write- Invalidate protocol. Snooping: Write-Invalidate: Core 0 Shared Memory Core 1 Data X = 1 Data X = 0 Any core listen to data accesses on the bus. Invalidation Data X of = 0 caches data by write accesses of other cores. 6

Example: Quad-Core LEON4-FT GR712RC Bus interconnect Private L1-s Shared L2- Core 0 Write-Through Strategy Data X = 1 Core 1 Read X = 1 Coherence is ensured using a Snooping-based Write- Invalidate protocol. Snooping: Write-Invalidate: Shared Memory Any core listen to data accesses on the bus. Invalidation Data X of = 1 caches data by write accesses of other cores. 7

Coherence and Static WCET-Analysis Estimation of the WCET using a static analysis tool Abstraction of the hardware architecture incl. caches Analysis to calculate data access latencies Analysis Overview of the OTAWA framework [2] Prediction of cache-hits and cache-misses Abstraction of possible cache states (Must/May Analysis) Modification of cache state by data accesses in program Analysis of private cache in isolation from other cores [2] Ballabriga, C. et al.: OTAWA: an Open Toolbox for Adaptive WCET Analysis. In: The 8th IFIP Workshop on Software Technologies for Future Embedded and Ubiquitous Systems 8

Coherence and Static WCET-Analysis Analysis CORE Data X = 0 External Invalidation Data X = 1 External Invalidations turn the cache into a shared resource Target of invalidation is unknown -> Invalidation of any data must be assumed Time of invalidation is unknown -> Invalidation must be assumed at any possible time Prediction of cache-hits is nearly impossible External invalidations prohibit a feasible WCET estimation. 9

Example: Quad-Core LEON4-FT GR712RC Bus interconnect Private L1-s Shared L2- Write-Through Policy Quad Core LEON4 Block Diagram [1] Alternative: Pure software-based cache coherence using the -Flush instruction. -Flush: Invalidation of the complete cache triggered by a special instruction. [1] Aeroflex Gaisler LEON4 Product Sheet 10

On-Demand Coherent (ODC²) Core ODC ODC² Block-Frame Shared Bit Crtl Repl. info Tag Data Replacement information Control flags (Valid, Modified) Indication of possible carrier of shared data Software/Hardware co-approach Handling of coherence on demand Different handling of accesses to private and shared data Access to shared data only in protected code regions Invalidation of shared data after use No coherence transactions 11

On-Demand Coherent (ODC²) Abstract Program ODC² Functionality - Accesses on private data - Mutex Lock / Barrier; enter_shared_mode(); - Accesses on private/shared data - exit_shared_mode(); Mutex Unlock / Barrier; - Accesses on private data - ODC² in Private-Mode No coherence-operations Functionality and timing equal to incoherent cache ODC² in Shared-Mode Newly loaded cache-lines are marked as shared Checking of target address Write-Through Policy ODC² in Restore-Procedure Invalidation of marked cache-lines ODC² in Private-Mode No coherence-operations Functionality and timing equal to incoherent cache 12

Static WCET-Analysis with ODC² Analysis CORE ODC Data X = 0 remains private resource Time and target of invalidation is known Invalidation increases cache-miss rate Invalidation improve efficiency Influence on timing is bounded and predictable Additional Shared Analysis to identify accesses to shared data New classification of accesses in Must/May Analysis: Always Shared, Never Shared, Sometimes Shared 13

Evaluation System architecture: Tree-type interconnect PPC-based cores with private L1-s Shared memory Parallel Benchmarks: Matrix Multiplication Fast Fourier Transformation Dijkstra Shortest Path Evaluated Platforms: ODC² Uncached shared data Software -Flush Magic Coherence Parallel benchmarks Program flow and synchronisation WCET analysis of ODC² and common alternatives using the OTAWA toolbox parmerasa real-time capable many-core architecture incl. ODC² OTAWA evaluation framework analysis enhanced with ODC² model 14

Evaluation WCET Analysis regarding parallel execution with 2-8 cores. Significantly reduced WCET compared to Uncached and -Flush Reduced accesses to shared memory with ODC² Coherence overhead of ODC² depends on access 350000 pattern to shared 300000 memory 250000 Number of Accesses 200000 150000 100000 50000 Normalised WCET 3,50 3,00 2,50 2,00 1,50 1,00 0,50 0,00 Matrix FFT Djikstra Number of WT Accesses 300000 250000 200000 150000 100000 50000 Uncached Flush ODC² Magic 0 Matrix FFT Dijkstra 0 Matrix FFT Dijkstra 15

Conclusion Existing hardware cache coherence techniques are not suitable for a static WCET analysis, software coherence techniques lack in performance. ODC² permits time-predictable hardware/software co-approach. Static timing analysis using ODC² allows significant reduction of WCET compared to common alternatives. Promising approach for hard real-time multicore systems. Thank you for your attention Any questions? 16