Predictable Cache Coherence for Multi- Core Real-Time Systems

Size: px

Start display at page:

Download "Predictable Cache Coherence for Multi- Core Real-Time Systems"

Jodie Harper
5 years ago
Views:

1 Predictable Cache Coherence for Multi- Core Real-Time Systems Mohamed Hassan, Anirudh M. Kaushik and Hiren Patel RTAS 2017

2 Motivation: Data sharing in multi-core real-time systems Core 1 RT tasks L1 D/I $ P1 Private data Interconnect LLC/Main memory S1 Shared data PAGE 2

3 Motivation: Data sharing in multi-core real-time systems No caching of shared data [RTSS 09, RTNS 10] T1 T2 T3 P1 P2 S1 S2 P1 P2 S3 S4 Simpler timing analysis Hardware support Long execution time PAGE 3

4 Motivation: Data sharing in multi-core real-time systems No caching of shared data [RTSS 09, RTNS 10] Task scheduling on shared data [ECRTS 10, RTSS 16] Scheduler T1 T2 T3 T1 T2 T3 P1 P2 P1 S1 P2 S1 P1 S3 S1 P1 S3 S2 P2 S4 S2 P2 S4 Simpler timing analysis Hardware support Long execution time Private cache hits on shared data No hardware support Limited multi-core parallelism Changes to OS scheduler PAGE 4

5 Motivation: Data sharing in multi-core real-time systems No caching of shared data [RTSS 09, RTNS 10] Task scheduling on shared data [ECRTS 10, RTSS 16] Scheduler Software cache coherence [SAMOS 14] T1 T2 T3 P1 P2 T1 P1 T2 S1 T3 P2 T1 P1 T2 S1 /* Access T3 to * private data */ P2 /* Access to * shared data */ S1 P1 S3 S1 P1 S3 S1 P1 S3 S2 P2 S4 S2 P2 S4 S2 P2 S4 Simpler timing analysis Hardware support Long execution time Private cache hits on shared data No hardware support Limited multi-core parallelism Changes to OS scheduler PAGE 5 Private cache hits on shared data Hardware support Source code changes

6 Motivation: Data sharing in multi-core real-time systems No caching of shared data [RTSS 09, RTNS 10] S1 S2 P1 P2 S3 S4 Task scheduling on shared data [ECRTS 10, RTSS 16] S1 S2 P1 P2 S3 S4 Software cache coherence [SAMOS 14] Scheduler /* Access to T1 T2 T3 T1 T2 T3 T1 T2 T3 * private data */ P1Disabling P2 simultaneous P1 S1 cached P2 shared data P1 S1for timing P2 /* Access to analysis at the cost of lower performance and OS and * shared data */ application changes. S1 S2 P1 P2 S3 S4 Simpler timing analysis Hardware support Long execution time Private cache hits on shared data No hardware support Limited multi-core parallelism Changes to OS scheduler PAGE 6 Private cache hits on shared data Hardware support Source code changes

7 Motivation: Data sharing in multi-core real-time systems No caching of shared data [RTSS 09, RTNS 10] S1 Task scheduling on shared data [ECRTS 10, RTSS 16] Software cache coherence [SAMOS 14] Scheduler T1 T2 T3 T1 T2This work: T3 T1 T2 /* Access to T3 * private data */ Allowing P1 predictable P2 P1simultaneous S1 P2 cached P1 S1 shared P2 data /* Access to accesses through hardware cache coherence, and provide * shared data */ significant performance improvements with no S2 P1 P2 S3 S4 changes to S1OS P1and S3 applications. S1 P1 S2 P2 S4 S2 P2 S3 S4 Simpler timing analysis Hardware support Long execution time Private cache hits on shared data No hardware support Limited multi-core parallelism Changes to OS scheduler PAGE 7 Private cache hits on shared data Hardware support Source code changes

8 Outline Overview of hardware cache coherence Challenges of applying conventional hardware cache coherence on RT multi-core systems Our solution: Predictable Modified-Shared-Invalid (PMSI) cache coherence protocol Latency analysis of PMSI Results Conclusion PAGE 8

9 Overview of hardware cache coherence Modified-Shared-Invalid protocol Core 1 I M OwnGETM OwnUPG OtherGETS S OwnGETS or OtherGETS GETS: Memory load GETM/UPG: Memory store PUTM: Replacement PAGE 9

10 Applying hardware cache coherence to multi-core RT systems Modified-Shared-Invalid protocol Core 1 I L1 D/I $ L1 D/I $ Real-time interconnect + LLC/Main memory M OwnGETM OwnUPG OtherGETS S OwnGETS or OtherGETS PAGE 10

11 Applying hardware cache coherence to multi-core RT systems Modified-Shared-Invalid protocol Core 1 I Multi-core L1 D/I $ RT system L1 D/I $ with predictable arbitration to shared components + + Real-time interconnect Well studied cache coherence protocol = OwnUPG M S Predictable latencies for shared data accesses? LLC/Main memory OwnGETM OtherGETS OwnGETS or OtherGETS PAGE 11

12 Applying hardware cache coherence to multi-core RT systems Modified-Shared-Invalid protocol Core 1 I Multi-core L1 D/I $ RT system L1 D/I $ with predictable arbitration to shared components + + Real-time interconnect Well studied cache coherence protocol = OwnUPG M S Predictable latencies for shared data accesses? LLC $/Main memory OwnGETM OtherGETS OwnGETS or OtherGETS PAGE 12

13 Sources of unpredictable memory access latencies due to hardware cache coherence 1. Inter-core coherence interference on same cache line 2. Inter-core coherence interference on different cache lines 3. Inter-core coherence interference due to write hits 4. Intra-core coherence interference PAGE 13

14 System model TDM schedule Core 3 Core 1 Core 3 Core 1 Core 3 A:100 M A:- I A:- I A:- I Real-time interconnect TDM arbitration A:50 M B:42 S C:31 S PR: Pending requests PWB: Pending write backs PAGE 14

15 Intra-core coherence interference TDM schedule Core 3 Core 1 Core 3 Core Event Read A Core 1 Core 3 PWB A:100 M A A:- I A:- I * A:- I : GetS A A:50 M B:42 S C:31 S PR: Pending requests PWB: Pending write backs * Denotes waiting for data PAGE 15

16 Intra-core coherence interference TDM schedule Core 3 Core 1 Core 3 Core 1 Core 3 A:100 M B:42 S PWB A A:- I A:- * A:- I Core Event Read A Read B : GetS B A:50 M B:42 S C:31 S PR: Pending requests PWB: Pending write backs PAGE 16

17 Intra-core coherence interference TDM schedule Core 3 Core 1 Core 3 Core 1 Core 3 A:100 M B:42 S A:- I A:- * A:- I Core Event Read A Read B Waiting for data A:50 M B:42 S C:31 S * Denotes waiting for data PAGE 17

18 Intra-core coherence interference TDM schedule Core 3 Core 1 Core 3 Core 1 Core 3 A:100 M B:42 S PWB A A:- I A:- * A:- I Core Event Read A Read B Waiting for data Read C A:50 M B:42 S C:31 S PR: Pending requests PWB: Pending write backs PAGE 18

19 Intra-core coherence interference TDM schedule Core 3 Core 1 Core 3 Core 1 Core 3 A:100 M B:42 S PWB A A:- I A:- * A:- I Core Event Read A Read B Waiting for data Read C A:50 M B:42 S C:31 S Unbounded latency for * Denotes waiting for data PAGE 19

20 Intra-core coherence interference TDM schedule Core 3 Core 1 Core 3 Core 1 Core 3 A:100 M B:42 S PWB A A:- I A:- * A:- I Core Event Read A Read B Waiting for data Read C : GetS B/GetS C A:50 M B:42 S C:31 S PR: Pending requests PWB: Pending write backs * Denotes waiting for data PAGE 20

21 Intra-core coherence interference TDM schedule Core 3 Core 1 Core 3 Solution Core1: 1 Architectural modification Core 3 A:100 M Predictable A:- I A:- B:42 S arbitration * A:- I Core Event Read A Read B Waiting for data Read C WB A Read B : GetS B/GetS C A:50 M B:42 S C:31 S PAGE 21

22 Intra-core coherence interference TDM schedule Core 3 Core 1 Core 3 Solution Core1: 1 Architectural modification Core 3 A:100 M Predictable A:- I A:- B:42 S arbitration * A:- I Core Event Read A Read B Waiting for data Read C WB A Read B : GetS B/GetS C A:50 M B:42 S C:31 S PAGE 22

23 Intra-core coherence interference Solution Core1: 1 Architectural modification Core 3 A:100 M B:42 S TDM schedule Core 3 Core 1 Core 3 Predictable A:- I arbitration A:- X A:- I Core Event Read A Read B Waiting for data Read C WB A Read B : GetS B/GetS C A:50 M C:31 S B:42 S Solution 2: Coherence protocol modifications Need to maintain state to indicate s pending WB of A PAGE 23

24 Sources of unpredictable memory access latencies due to cache coherence 1. Inter-core coherence interference on same cache line ü Solution: Architectural modification to shared LLC/memory 2. Inter-core coherence interference on different cache lines ü Architectural modification + Coherence modifications = Predictable Modified-Shared-Invalidate cache coherence protocol (PMSI) Solution: Architectural modification to the core + Coherence modifications 3. Inter-core coherence interference due to write hits ü Predictable latencies for simultaneous shared data access ü ü No application/os changes 4. Intra-core coherence üinterference Significant performance benefits Solution: Architectural modification to the core + Coherence modifications ü Solution: Architectural modification to the core + Coherence modifications PAGE 24

25 PMSI cache coherence protocol: Protocol modifications State Core events Bus events Load Store Replacement OwnData OwnUpg OwnPutM OtherGetS OtherGetM I Issue GetS/IS d Issue GetM/IM d X X X X N/A N/A N/A N/A S Hit Issue Upg/SM w X N/A X X N/A I I X M Hit Hit Issue PutM/MI wb X X X Issue PutM/MS wb Issue PutM/MI wb X X IS d X X X Read/S X X N/A IS d I IS d I N/A IM d X X X Write/S X X IM d S IM d I X N/A IM d I X X X Write/MI wb X X N/A X X N/A IS d I X X X Read/I X X N/A X X N/A IM d S X X X Write/MS wb X X N/A X X N/A SM w X X X N/A Store/ M X N/A I I X MI wb Hit Hit N/A X X Send data/i N/A N/A X X MS wb Hit Hit MI wb X X Send data/s N/A MI wb X X Other Upg Other PutM PAGE 25

26 PMSI cache coherence protocol: Protocol modifications State Core events Bus events Load Store Replacement OwnData OwnUpg OwnPutM OtherGetS OtherGetM I Issue GetS/IS d Issue GetM/IM d X X X X N/A N/A N/A N/A S Hit Issue Upg/SM w X N/A X X N/A I I X M Hit Hit Issue PutM/MI wb X X X Issue PutM/MS wb Issue PutM/MI wb X X IS d X X X Read/S X X N/A IS d I IS d I N/A IM d X X X Write/S X X IM d S IM d I X N/A IM d I X X X Write/MI wb X X N/A X X N/A IS d I X X X Read/I X X N/A X X N/A IM d S X X X Write/MS wb X X N/A X X N/A SM w X X X N/A Store/ M X N/A I I X MI wb Hit Hit N/A X X Send data/i N/A N/A X X MS wb Hit Hit MI wb X X Send data/s N/A MI wb X X Other Upg Other PutM PAGE 26

27 Intra-core coherence interference A:100 TDM schedule Core 3 Core 1 Core 3 TDM Core 1 Core 3 PR PWB M A:- I A:- MS wb Read B A I * State Core Bus events Event Read A A:- IOtherGets OtherGetM M Issue PutM/MS wb Issue PutM/MI wb : GetS A A:50 M B:42 S C:31 S PR LUT Address Core ID Request A 2 GetS PAGE 27

28 Intra-core coherence interference TDM Core 1 Core 3 A:100 MSwb S TDM schedule Core 3 Core 1 Core 3 WB slot PR PWB State Read BA:- AI A:- IS d A:- I MS wb Bus events OwnPutM Send data/s Core Event Read A WB A : Send A A: M S B:42 S C:31 S PR LUT Address Core ID Request A 2 GetS PAGE 28

29 Intra-core coherence interference TDM schedule Core 3 Core 1 Core 3 Core 1 Core 3 A:100 S A:- I A:100 * S A:- I Core Event Read A WB A Complete read A Data response A:100 A:100 S B:42 S C:31 S PR LUT Address Core ID Request A 2 GetS PAGE 29

30 Intra-core coherence interference TDM Core 1 Core 3 A:100 S TDM schedule Core 3 Core 1 Core 3 B:42 I S PR PWB Request slot Read BA:- I A:100 S A:- I Read C Core Event Read A WB A Complete read A Read B : GetS B A:100 S B:42 S C:31 S PAGE 30

31 Latency analysis of PMSI WCLarb N: Number of cores Core 3 Core 1 Core 3 TDM schedule Core 1 Core 3 Pending WB Pending request A:- IM d I A:- IM d I A:- IM d A:50 M MI wb L acc Real-time interconnect TDM arbitration TDM Address Core ID Request A 0 GetM A 1 GetM LLC $/Main memory WCL intercoh WCL intracoh A 2 GetM PAGE 31

32 Latency analysis of PMSI PAGE 32 WCL of memory request = L acc + WCL arb ( N)+ WCL intracoh ( N) + WCL intercoh ( N 2 )

33 Results Cycle accurate Gem5 simulator 2, 4, 8 cores, 16KB direct mapped private L1-D/I caches In-order cores, 2GHz operating frequency TDM bus arbitration with slot width = 50 cycles Benchmarks: SPLASH-2, synthetic benchmarks to stress worst-case scenarios Simulation framework available at PAGE 33

34 Results: Observed worst case latency per request PMSI Unpredictable 2 Unpredictable 3 Unpredictable 4 PAGE 34

35 Results: Observed worst case latency per request Latency bound PMSI Unpredictable 2 Unpredictable 3 Unpredictable 4 PAGE 35

36 Results: Observed worst case latency per request Latency bound PMSI Unpredictable 2 Unpredictable 3 Unpredictable 4 PAGE 36

37 Results: Slowdown relative to MESI cache coherence Uncache all Single core Uncache shared PMSI MSI MESI PAGE 37

38 Results: Slowdown relative to MESI cache coherence Uncache all Single core Uncache shared PMSI MSI MESI PAGE 38

39 Conclusion Enumerate sources of unpredictability when applying conventional hardware cache coherence on multi-core real-time systems Propose PMSI cache coherence that allows predictable simultaneous shared data accesses with no changes to application/os Hardware overhead of PMSI is ~ 128 bytes WCL of memory access with PMSI square of number of cores ( N 2 ) PMSI outperforms prior techniques for handling shared data accesses (upto 4x improvement over uncache-shared), and suffers on average 46% slowdown compared to conventional MESI cache coherence protocol PAGE 39

40 THANK YOU & QUESTIONS? PAGE 40

Presented by Anh Nguyen

Presented by Anh Nguyen u When a cache has a block in state M or E and receives a GetS from another core, if using the MSI protocol or the MESI protocol, the cache must u change the block state from M