Johnathan Alsop, Matthew D. Sinclair, Sarita V. Adve* *Illinois, AMD, Wisconsin. Sponsors: NSF, C-FAR, ADA (JUMP center by SRC, DARPA)

Size: px

Start display at page:

Download "Johnathan Alsop*, Matthew D. Sinclair*, Sarita V. Adve* *Illinois, AMD, Wisconsin. Sponsors: NSF, C-FAR, ADA (JUMP center by SRC, DARPA)"

Ross Lee
5 years ago
Views:

1 Johnathan Alsop*, Matthew D. Sinclair*, Sarita V. Adve* *Illinois, AMD, Wisconsin Sponsors: NSF, C-FAR, ADA (JUMP center by SRC, DARPA)

2 Specialized architectures are increasingly important in all compute domains 2

3 Specialization Requires Better Memory Systems Traditional heterogeneity: CPU Accelerator CPU Mem Space data in data out Accelerator Mem Space No fine-grain synchronization No irregular access patterns Wasteful data movement Shared coherent memory: CPU Accelerator Shared Mem Space coherent data Existing solutions: complex and inflexible Fine-grain synchronization Irregular access Implicit data reuse 3

4 Heterogeneous devices have diverse memory demands Temporal locality Spatial locality Fine-grain Synch Latency Sensitivity Throughput Sensitivity 4

5 Heterogeneous devices have diverse memory demands Temporal locality Spatial locality Fine-grain Synch Latency Sensitivity Throughput Sensitivity Typical CPU workloads: fine-grain synch, latency sensitive 5

6 Heterogeneous devices have diverse memory demands Temporal locality Spatial locality Fine-grain Synch Latency Sensitivity Throughput Sensitivity Typical workloads: spatial locality, throughput sensitive

7 MESI protocol fits CPU workloads Properties MESI coherence DeNovo Granularity Line Reads: Spatial Line locality Writes: False sharing Word Reads: Flexible Writes: Word Invalidation Writer-invalidate Self-invalidate Temporal locality Self-invalidate for reads Overheads limit throughput Updates Ownership Write-through Temporal locality for Ownership writes Indirection if low locality Good for: MESI CPU 7

8 s prefer simpler protocols Properties MESI coherence DeNovo Granularity Line Reads: Line Writes: Word Reads: No false Flexible sharing Writes: Synch Word limits spatial locality Invalidation Writer-invalidate Self-invalidate Self-invalidate Simple, scalable Synch limits read reuse Updates Ownership Write-through Ownership Simple, low overhead Synch limits write reuse Good for: MESI CPU coh. 8

9 DeNovo is a good fit for CPU and Properties MESI coherence DeNovo Granularity Line Reads: Line Writes: Word Reads: Flexible Writes: Word Invalidation Writer-invalidate Self-invalidate Self-invalidate Updates Ownership Write-through Ownership Good for: MESI CPU coh. DeNovo CPU or 9

10 Existing Solutions: Inflexible and Inefficient CPU Accel 1 Accel 2? MESI L1 coh. L1 DeNovo MESI L1 L1 coh. L1 MESI/ coh. Hybrid L2 MESI LLC Examples: ARM ACE, IBM CAPI, AMD APU 10

11 Existing Solutions: Inflexible and Inefficient CPU Accel 1 Accel 2? MESI L1 If the glove doesn t fit There s limited benefit! MESI L1 coh. L1 MESI/ coh. Hybrid L2 MESI LLC Examples: ARM ACE, IBM CAPI, AMD APU 11

12 Existing Solutions: Inflexible and Inefficient CPU Accel 1 Accel 2? MESI L1 If the glove doesn t fit There s limited benefit! MESI L1 coh. L1 MESI/ coh. Hybrid L2 MESI LLC Examples: ARM ACE, IBM CAPI, AMD APU 12

13 Spandex: Flexible Heterogeneous Coherence Interface CPU Accel 1 Accel 2? MESI L1 coh. L1 DeNovo L1 coh. L1 Spandex Adapts to exploit individual device s workload attributes Better performance, lower complexity Fits like a glove for any heterogeneous system! 13

14 Spandex Overview CPU Accel? MESI L1 coh. L1 DeNovo L1 Key Components Flexible device request interface TU TU External Request Interface Device Request Interface Spandex LLC TU DeNovo-based LLC External request interface Device may need a translation unit (TU) 14

15 Spandex Overview CPU Accel? MESI L1 coh. L1 DeNovo L1 Key Components Flexible device request interface TU TU External Request Interface Device Request Interface Spandex LLC TU DeNovo-based LLC External request interface Device may need a translation unit (TU) 15

16 Device Request Interface Action Request Indicates Read ReqV Self-invalidation ReqS Writer-invalidation Write ReqWT Write-through ReqO Ownership only Read+ Write ReqWT+data ReqO+data Atomic write-through Ownership + Data Writeback ReqWB Owned data eviction Requests also specify granularity and (optionally) a bitmask 16

17 Spandex Overview CPU Accel? MESI L1 coh. L1 DeNovo L1 Key Components Flexible device request interface TU TU TU DeNovo-based LLC External Request Interface Device Request Interface Spandex LLC External request interface Device may need a translation unit (TU) 17

18 ReqWT RspWT ReqO RspO ST ST Spandex LLC CPU Accel? MESI L1 coh. L1 DeNovo L1 States: I, V, O, S Allocation at line granularity Ownership at word granularity Data field tracks owner ID May generate requests to owner/sharer Tag State O Mask Data V ID ID Spandex LLC No false sharing Non-blocking ownership transfer 18

19 Spandex Overview CPU Accel? MESI L1 coh. L1 DeNovo L1 Key Components Flexible device request interface TU TU TU DeNovo-based LLC External Request Interface Device Request Interface Spandex LLC External request interface Device may need a translation unit (TU) 19

20 ReqV ReqV External Request Interface CPU Accel? MESI L1 coh. L1 DeNovo L1 States: I, S, O States: I, V States: I, V, O External Request ReqV ReqO ReqO+data RvkO Inv ReqS Must handle if supports state O O O O S S and O RspV Spandex LLC O Translation Unit may implement functionality if not supported by device 20

21 Evaluation: Configurations Hierarchical MESI CPU L1 Spandex CPU L1 MESI LLC L1 CPU L1 CPU L1 L1 L1 Spandex LLC L2 L1 Configuration LLC protocol CPU protocol protocol HMG Hierarchical MESI MESI coherence Hierarchical MESI MESI DeNovo Spandex MESI coherence Spandex MESI DeNovo Spandex DeNovo coherence Spandex DeNovo DeNovo CPU- workloads from Pannotia and Chai benchmark suites 21

22 HMG HMG HMG HMG HMG HMG HMG Hbest Sbest Execution Time (cycles) Evaluation: CPU- Applications 120% 100% 80% 60% 40% 20% 0% BC PR HSTI Average TRNS RSCT TQH Different workloads prefer different protocols Spandex flexibility consistently better execution time (avg 16% lower) 22

23 HMG HMG HMG HMG HMG HMG HMG Hbest Sbest Network Traffic (flits) Evaluation: CPU- Applications 100% 80% 60% 40% 20% Probe ReqWB/WT ReqV/S ReqWT+data ReqO[+data] 0% BC PR HSTI Average TRNS RSCT TQH Spandex flexibility consistently better NW traffic (avg 27% lower) 23

24 Conclusion and Future Work Producer Consumer MESI LLC Simple, Flexible, Efficient Future Work: exploit SW or HW hints about data access patterns Dynamic Spandex request selection Producer-consumer forwarding Extended granularity flexibility 24

25 Johnathan Alsop*, Matthew D. Sinclair*, Sarita V. Adve* *Illinois, AMD, Wisconsin Sponsors: NSF, C-FAR, ADA (JUMP center by SRC, DARPA)

Spandex: A Flexible Interface for Efficient Heterogeneous Coherence Johnathan Alsop 1, Matthew D. Sinclair 1,2, and Sarita V.

Appears in the Proceedings of ISCA 2018. Spandex: A Flexible Interface for Efficient Heterogeneous Coherence Johnathan Alsop 1, Matthew D. Sinclair 1,2, and Sarita V. Adve 1 1 University of Illinois at