STLAC: A Spatial and Temporal Locality-Aware Cache and Network-on-Chip Codesign for Tiled Manycore Systems


1 STLAC: A Spatial and Temporal Locality-Aware Cache and Network-on-Chip Codesign for Tiled Manycore Systems
Mingyu Wang and Zhaolin Li
Institute of Microelectronics, Tsinghua University, Beijing, China
wang-my12@mails.tsinghua.edu.cn

2 Outline
Introduction and Motivation
Design of STLAC
Evaluation and Results
Conclusions

3 Introduction: Target Architectures
Tiled many-core architectures offer large last-level cache resources, a meshed network-on-chip (NoC), good power distribution, and good scalability.
Tiled many-core architectures are widely used in multimedia applications and scientific computing.

4 Introduction: Challenges for Design
Cache capacity interference problem: different workloads have different memory access behaviors (e.g., stream-like workloads with low temporal locality).
Non-uniform cache access (NUCA) effect: especially notable in NoC-linked cache banks.

5 Introduction: Motivation
The spatial and temporal locality of workloads are the properties that cache designs exploit to overcome the memory wall problem.
Different workloads have different locality features.
Goal: solve the cache capacity interference problem and the NUCA effect from the viewpoint of cache and NoC codesign.

6 Introduction: Motivation
A prefetch buffer speculates on data blocks at subsequent addresses to exploit spatial locality.
A victim cache collects data blocks evicted from the upper memory hierarchy to exploit temporal locality.
(Diagram: spatial locality and temporal locality are served by the prefetch buffer with fast prefetch and by the victim cache, respectively, both backed by the burst-support NoC.)
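
The sketch below illustrates the two classic structures named on this slide in generic form: a small victim buffer that catches evicted blocks (temporal locality) and a sequential prefetcher that speculates on the next block addresses (spatial locality). The FIFO policy, capacity, and block size are illustrative assumptions, not STLAC parameters.

```cpp
// Generic victim buffer + sequential prefetch candidates, for illustration only.
#include <cstddef>
#include <cstdint>
#include <deque>
#include <iostream>
#include <unordered_set>

constexpr uint64_t kBlockBytes = 64;  // assumed cache block size

struct VictimBuffer {
    std::size_t capacity = 16;            // assumed small fully-associative buffer
    std::deque<uint64_t> fifo;            // block addresses, oldest first
    std::unordered_set<uint64_t> present; // fast membership test

    // Called when the upper memory hierarchy evicts a block.
    void insert_evicted(uint64_t block) {
        if (present.count(block)) return;
        if (fifo.size() == capacity) {    // FIFO replacement
            present.erase(fifo.front());
            fifo.pop_front();
        }
        fifo.push_back(block);
        present.insert(block);
    }
    // Probed on a miss before going off-chip.
    bool probe(uint64_t block) const { return present.count(block) != 0; }
};

// On a demand access, speculate on the next `degree` sequential blocks.
std::deque<uint64_t> sequential_prefetch_candidates(uint64_t addr, int degree) {
    std::deque<uint64_t> out;
    uint64_t block = addr / kBlockBytes;
    for (int i = 1; i <= degree; ++i) out.push_back((block + i) * kBlockBytes);
    return out;
}

int main() {
    VictimBuffer vb;
    vb.insert_evicted(0x1000);
    std::cout << "victim hit: " << vb.probe(0x1000) << "\n";
    for (uint64_t a : sequential_prefetch_candidates(0x1000, 4))
        std::cout << "prefetch candidate 0x" << std::hex << a << std::dec << "\n";
}
```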

7 Proposed STLAC: Key Idea
Dynamically partition the last-level cache (LLC) into a data prefetch part and a victim part according to locality prediction.
Exploit a hybrid burst-support NoC for fast data prefetch and to save network usage.
Explore further optimization opportunities from the cache and NoC codesign.

8 Proposed STLAC: Overview of the Codesign
The figure shows normal data access behaviors; in step 4, a fast data prefetch is issued between node 4 and node 14 via the burst-support NoC.

9 Proposed STLAC: Adaptive Cache Resource Partition
The ratio between the victim part and the prefetch part is adjusted dynamically.
Cache partitioning operates at way granularity and is executed periodically.

10 Proposed STLAC: Cache Partition Algorithm (CPA)
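
The CPA itself appears as a figure on this slide. Below is a minimal sketch of one plausible way-granularity repartitioning step, assuming each LLC bank keeps per-part hit counters sampled once per epoch; the counter names, the one-way-per-epoch move, and the policy details are illustrative assumptions, not the published algorithm.

```cpp
// Periodic way-granularity repartitioning between the prefetch part and
// the victim part of one LLC bank (illustrative sketch).
#include <algorithm>
#include <cstdint>
#include <iostream>

struct BankPartition {
    int total_ways;              // associativity of the LLC bank
    int prefetch_ways;           // ways currently assigned to the prefetch part
    int victim_ways;             // ways currently assigned to the victim part
    uint64_t prefetch_hits = 0;  // hits in the prefetch part this epoch
    uint64_t victim_hits   = 0;  // hits in the victim part this epoch
};

// Run once per epoch: move one way toward the part with the higher
// hit rate per way, keeping at least one way in each part.
void repartition(BankPartition& b) {
    double p_rate = static_cast<double>(b.prefetch_hits) / std::max(1, b.prefetch_ways);
    double v_rate = static_cast<double>(b.victim_hits)   / std::max(1, b.victim_ways);
    if (p_rate > v_rate && b.victim_ways > 1) {
        --b.victim_ways; ++b.prefetch_ways;   // spatial locality dominates
    } else if (v_rate > p_rate && b.prefetch_ways > 1) {
        --b.prefetch_ways; ++b.victim_ways;   // temporal locality dominates
    }
    b.prefetch_hits = b.victim_hits = 0;      // start a new sampling epoch
}

int main() {
    BankPartition bank{16, 8, 8};
    bank.prefetch_hits = 12000;  // example epoch statistics
    bank.victim_hits = 3000;
    repartition(bank);
    std::cout << "prefetch ways: " << bank.prefetch_ways
              << ", victim ways: " << bank.victim_ways << "\n";
}
```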

11 Proposed STLAC: Fast Data Prefetch Using the Burst-support NoC
Prefetch remote data as fast as possible without breaking data continuity.
Speculate on data blocks at incremental addresses in the destination nodes.
Avoid frequently sending separate request or response flits from the source or destination nodes.
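
As a rough illustration of the last point, the sketch below packs a run of sequential block addresses into a single burst prefetch request instead of issuing one request/response pair per block. The message fields and the expansion at the destination are assumptions about what a burst-support message could carry, not the exact STLAC flit format.

```cpp
// Forming one burst prefetch request for a run of consecutive blocks
// (illustrative message layout).
#include <cstdint>
#include <iostream>
#include <vector>

struct BurstPrefetchRequest {
    int src_node;          // requesting tile
    int dst_node;          // home tile of the data
    uint64_t start_block;  // first speculated block address
    int length;            // number of consecutive blocks returned as one burst
};

// Build one burst request covering `length` blocks following `miss_block`.
BurstPrefetchRequest make_burst_request(int src, int dst,
                                        uint64_t miss_block, int length) {
    return BurstPrefetchRequest{src, dst, miss_block + 1, length};
}

// The destination expands the request into the blocks it streams back
// contiguously over the reserved burst connection.
std::vector<uint64_t> expand_burst(const BurstPrefetchRequest& req) {
    std::vector<uint64_t> blocks;
    for (int i = 0; i < req.length; ++i) blocks.push_back(req.start_block + i);
    return blocks;
}

int main() {
    BurstPrefetchRequest req = make_burst_request(4, 14, /*miss_block=*/0x200, /*length=*/4);
    std::cout << "one request covers " << req.length << " blocks:";
    for (uint64_t b : expand_burst(req)) std::cout << " 0x" << std::hex << b << std::dec;
    std::cout << "\n";
}
```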

12 Proposed STLAC: Example of the Hybrid Data Access
Connection 1 represents a 4-length data prefetch that uses packet-switched (PS), circuit-switched (CS), and burst-support NoC connections; the other connections represent normal PS connections.

13 Proposed STLAC: Architecture of the Burst-support Router
An N-field virtual channel (VC) state register is added.
Burst packets are assigned higher priority.
The switch is reserved for the entire burst packet to keep data continuity.
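
A minimal sketch of the switch-reservation rule follows: once a burst flit wins an output port, the port stays reserved for that packet's virtual channel until its tail flit passes, so the burst is not interleaved with other traffic. The single-output allocator and flit fields are simplifications for illustration, not the full router microarchitecture.

```cpp
// Holding a crossbar output for an entire burst packet (illustrative).
#include <iostream>
#include <optional>

struct Flit {
    bool is_burst;  // belongs to a burst packet
    bool is_tail;   // last flit of its packet
    int  vc;        // requesting virtual channel
};

class OutputPort {
    std::optional<int> reserved_vc;  // VC currently holding the switch for a burst
public:
    // Returns true if the flit may traverse the switch this cycle.
    bool allocate(const Flit& f) {
        if (reserved_vc && *reserved_vc != f.vc) return false;  // port held by a burst
        if (f.is_burst && !f.is_tail) reserved_vc = f.vc;       // keep the port reserved
        if (f.is_tail) reserved_vc.reset();                     // release on the tail flit
        return true;
    }
};

int main() {
    OutputPort east;
    std::cout << east.allocate({true, false, /*vc=*/2}) << "\n";  // burst head wins, port reserved
    std::cout << east.allocate({false, true, /*vc=*/0}) << "\n";  // normal flit blocked
    std::cout << east.allocate({true, true, /*vc=*/2})  << "\n";  // burst tail passes, port released
}
```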

14 Proposed STLAC: Architecture of the Burst-support Router (Starvation Avoidance)
The initial age of normal PS flits is set to 0, and burst flits have higher priority than PS flits with age 0.
After a PS transmission has been blocked for a predetermined number of cycles, the age of its flits is increased.
PS flits with age 1 have the same priority as burst flits.
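
A minimal sketch of this age-based rule, assuming an illustrative blocking threshold (the threshold value and tie-breaking choice are assumptions):

```cpp
// Age-based starvation avoidance between burst flits and normal PS flits.
#include <iostream>

constexpr int kAgeThresholdCycles = 8;  // assumed blocking threshold

struct Candidate {
    bool is_burst;
    int  age;             // 0 or 1 for PS flits; unused for burst flits
    int  blocked_cycles;  // how long this flit has waited at the switch
};

// Promote a blocked PS flit once it has waited long enough.
void update_age(Candidate& c) {
    if (!c.is_burst && c.blocked_cycles >= kAgeThresholdCycles) c.age = 1;
}

// Effective priority: burst flits and age-1 PS flits share the top level.
int priority(const Candidate& c) { return c.is_burst ? 1 : c.age; }

// Pick the winner between two competing candidates (ties go to the first).
const Candidate& arbitrate(const Candidate& a, const Candidate& b) {
    return priority(b) > priority(a) ? b : a;
}

int main() {
    Candidate burst{true, 0, 0};
    Candidate ps{false, 0, 0};
    std::cout << "burst wins over age-0 PS: " << arbitrate(ps, burst).is_burst << "\n";
    ps.blocked_cycles = kAgeThresholdCycles;
    update_age(ps);
    std::cout << "promoted PS age: " << ps.age << " (now equal priority with burst)\n";
}
```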

15 Evaluation and Results: Methodology
NoC simulator: Booksim 2.0, 4x4 2D mesh, 128-bit channels, DOR routing.
Full-system simulator: FeS2 multiprocessor simulator.

16 Evaluation and Results: Burst-support NoC Simulation Results
Results are shown for a 5% burst injection rate.

17 Evaluation and Results: Full-System Simulation Results (Miss Rate Reduction)
The cache partition is executed every 1M cycles.
About 40% of off-chip misses are eliminated on average; the reduction rises to 50% for workloads with good spatial and temporal locality.

18 Evaluation and Results: Full-System Simulation Results (Hit Distributions)
Hit distributions reveal the locality differences across workloads: a high hit rate in the prefetch part indicates good spatial locality, and a high hit rate in the victim part indicates good temporal locality.
Cache pollution is relieved because the prefetch part and the victim part are managed separately.

19 Evaluation and Results: Full-System Simulation Results (Network Usage Reduction)
The total number of flits injected into the network is used as the metric for network usage.
Network usage is reduced by 7.6% on average.

20 Evaluation and Results: Full-System Simulation Results (Performance)
The total runtime in cycles is used as the performance metric.
Performance improves by about 15% on average; the improvement rises to nearly 30% for workloads with good spatial and temporal locality.

21 Conclusions
In this work:
Cache resources are partitioned adaptively by taking advantage of the differences in spatial and temporal locality across workloads.
Fast prefetch is realized in the proposed hybrid burst-support network, saving on-chip network usage.
Opportunities for performance improvement are explored from the viewpoint of cache and NoC codesign.
Future work:
More discussion of energy consumption.
Further optimization of the NoC router for network latency reduction (e.g., the routing algorithm).

22 Thank you for your attention!
