Compiler-Directed Cache Polymorphism


1 Compiler-Directed Cache Polymorphism
  J. S. Hu, M. Kandemir, N. Vijaykrishnan, M. J. Irwin, H. Saputra, W. Zhang
  Microsystems Design Lab, The Pennsylvania State University

2 Outline
  Motivation
  Compiler-Directed Cache Polymorphism
  Array-based Code
  Algorithms for Cache Polymorphism
  Experiments & Results
  Conclusions & Future Work

3 Motivation
  State-of-the-art microprocessors normally employ large cache structures;
  Fixed cache structures are wasted when they are utilized ineffectively;
  Traditional loop optimizations may be inapplicable because of dependences, and they are not energy aware;
  A fixed cache structure may also increase energy consumption;
  Previous reconfigurable cache architectures lack a software-based direction mechanism.

4 Our Approach
  Compiler-directed cache polymorphism:
    Based on compiler analysis;
    No transformation of the original code;
    Determines a near-optimal cache configuration for each loop nest;
    Dynamically reconfigures the caches for different nests at run time.

5 Array-based Code
  Array-based programs
    Assumption 1: globally declared arrays;
    Assumption 2: same lexical-level nests;
    Assumption 3: perfectly nested nests.
  Characteristics of array-based code
    Computation in nests dominates the execution time of array-based codes;
    Cache behavior directly affects the performance and energy consumption of nests;
    Data locality is the key to improving cache behavior.

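6 Array References

$$AR_j[f_{j,1}(\vec{i})][f_{j,2}(\vec{i})]\cdots[f_{j,m}(\vec{i})] \tag{1}$$

$$\begin{pmatrix} f_{j,1} \\ f_{j,2} \\ \vdots \\ f_{j,m} \end{pmatrix} =
\begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}
\begin{pmatrix} i_1 \\ i_2 \\ \vdots \\ i_n \end{pmatrix} +
\begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_m \end{pmatrix} \tag{2}$$

$$\vec{f} = A\vec{i} + \vec{c} \tag{3}$$

  $\vec{f}$: reference subscript vector
  $A$: access matrix
  $\vec{i}$: loop index vector
  $\vec{c}$: constant offset vector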
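As a small illustration of the subscript relation (reference subscript vector = access matrix times loop index vector, plus constant offset vector), the sketch below evaluates it for one iteration. The function name `subscripts` and the sample reference `X[i1 + 1][i2 - 1]` are hypothetical and chosen only for this example; they do not come from the slides.

```python
def subscripts(A, c, i):
    """Return the subscript vector f = A*i + c.

    A is the m x n access matrix (list of rows), c the constant offset
    vector of length m, and i the loop index vector of length n.
    """
    return [sum(a * x for a, x in zip(row, i)) + off
            for row, off in zip(A, c)]


# Hypothetical reference X[i1 + 1][i2 - 1] inside a two-deep loop nest.
A = [[1, 0],
     [0, 1]]
c = [1, -1]

print(subscripts(A, c, [3, 5]))  # [4, 4]
```

A transposed access such as `X[i2][i1]` would use the matrix `[[0, 1], [1, 0]]` with a zero offset vector.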

7 Compiler-Directed Cache Polymorphism
  How to analyze the reuse properties of a given nest efficiently?
  How to transform data reuse into real data locality?
    Why? To improve performance and energy efficiency.
    What's the barrier? Cache interferences.
    Approach? Avoid or reduce the majority of cache interferences.
    Prerequisite? The cache behavior of array references.
  How to figure out the cache behavior within the reuse space?
  How to avoid or reduce the majority of cache interferences?
  How to optimize cache configurations?
  How to dynamically reconfigure caches for each nest?

8 Algorithms for Cache Polymorphism
  Algorithms 1 & 2: self and group reuse analysis
  Characteristics of Algorithms 1 & 2
    Operate at the granularity of a uniform reference set;
    Work on the access matrix and constant offset vector to extract reuse information;
    Do not solve a system of equations to find reuse;
    More efficient.

9 Example
  Array a:
    uniform reference set 1: {A_a1, A_a2}
    uniform reference set 2: {A_a3}
  Array b:
    uniform reference set 1: {A_b1}

10 Example
  Data reuse obtained from Algorithms 1 & 2
  Array a:
    Uniform reference set 1: self-spatial reuse at level l; group-temporal reuse at level j;
    Uniform reference set 2: self-spatial reuse at level l; self-temporal reuse at level j;
  Array b:
    Self-spatial reuse at level i.
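The slides do not show the bodies of Algorithms 1 and 2, so the following is only a textbook-style sketch of the two ideas they name: grouping references into uniform reference sets and reading self reuse directly off the access matrix. It assumes that a uniform reference set is the set of references to one array that share the same access matrix, that arrays are stored in row-major order, and that a unit-stride change in the last subscript counts as spatial reuse. The function names and the sample references are hypothetical.

```python
from collections import defaultdict


def uniform_reference_sets(refs):
    """Group references by (array name, access matrix).

    refs is a list of (array, A, c) triples; the result maps each
    (array, A) key to the list of constant offset vectors in that set.
    """
    sets = defaultdict(list)
    for array, A, c in refs:
        key = (array, tuple(tuple(row) for row in A))
        sets[key].append(tuple(c))
    return dict(sets)


def self_reuse(A, level):
    """Classify the self reuse carried by loop `level` (0-based column of A)."""
    column = [row[level] for row in A]
    if all(v == 0 for v in column):
        return "self-temporal"   # the loop index never appears in a subscript
    if all(v == 0 for v in column[:-1]):
        return "self-spatial"    # only the last (contiguous) subscript changes
    return "none"


# Hypothetical nest over (i, j) with references a[i][j], a[i][j+1], b[j][i].
refs = [
    ("a", [[1, 0], [0, 1]], [0, 0]),
    ("a", [[1, 0], [0, 1]], [0, 1]),
    ("b", [[0, 1], [1, 0]], [0, 0]),
]
for (array, A), offsets in uniform_reference_sets(refs).items():
    print(array, offsets, [self_reuse(A, lvl) for lvl in range(2)])
```

Because the classification only inspects columns of the access matrix, no system of equations has to be solved, which matches the efficiency argument on the slides. Group reuse would additionally compare the offset vectors inside one set; that step is omitted here.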

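11 Algorithms for Cache Polymorphism
  At a given loop iteration, the footprint of a reference is computed as:

$$f_{SMS}(\vec{i}) = SA + Cof_1 \cdot i_1 + Cof_2 \cdot i_2 + \cdots + Cof_n \cdot i_n \tag{4}$$

$$SA = elmt\_sz \cdot \sum_{j=1}^{m} \Big( \prod_{k=j+1}^{m+1} dd_k \Big) \cdot c_j, \qquad
dd_k = \begin{cases} 1, & k = m+1 \\ du_k - dl_k + 1, & k \le m \end{cases} \tag{5}$$

$$Cof_j = elmt\_sz \cdot \sum_{l=1}^{m} \Big( \prod_{k=l+1}^{m+1} dd_k \Big) \cdot a_{lj}, \qquad
dd_k = \begin{cases} 1, & k = m+1 \\ du_k - dl_k + 1, & k \le m \end{cases} \tag{6}$$

  Reuse space for reuse at loop level j:

$$\vec{i}^{\,*}_{j} = (i_1 = l_1,\ i_2 = l_2,\ \ldots,\ i_j = l_j,\ i_{j+1},\ \ldots,\ i_n), \qquad l_k \le i_k \le u_k,\ k > j \tag{7}$$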
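To make the footprint formula concrete, here is a minimal sketch that computes a start offset `SA` and one coefficient per loop index for a row-major array, then evaluates the address at one iteration. It is an interpretation, not the authors' code: it assumes row-major storage, takes the size of each array dimension as `extents`, and returns byte offsets relative to the start of the array. The function names are hypothetical.

```python
def address_coefficients(A, c, extents, elmt_sz):
    """Return (SA, Cof) for a reference with access matrix A and offsets c.

    extents[k] is the number of elements in array dimension k (assumed
    row-major); elmt_sz is the element size in bytes.
    """
    m = len(extents)
    # weight[k] = product of the extents of all dimensions after k.
    weight = [1] * m
    for k in range(m - 2, -1, -1):
        weight[k] = weight[k + 1] * extents[k + 1]
    sa = elmt_sz * sum(w * off for w, off in zip(weight, c))
    n = len(A[0])
    cof = [elmt_sz * sum(weight[l] * A[l][j] for l in range(m))
           for j in range(n)]
    return sa, cof


def footprint_address(sa, cof, i):
    """Evaluate SA + Cof_1*i_1 + ... + Cof_n*i_n for loop index vector i."""
    return sa + sum(cf * x for cf, x in zip(cof, i))


# Hypothetical reference X[i1 + 1][i2 - 1] into an 8 x 8 array of 4-byte elements.
sa, cof = address_coefficients([[1, 0], [0, 1]], [1, -1], [8, 8], 4)
print(sa, cof)                            # 28 [32, 4]
print(footprint_address(sa, cof, [3, 5])) # 144
```

The result agrees with direct linearization: iteration (3, 5) touches element [4][4], which sits at byte offset (4 * 8 + 4) * 4 = 144.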

12 Algorithms for Cache Polymorphism
  Algorithm 3: simulate the footprints of array references in the reuse space.
    Invokes Algorithms 1 & 2 for each uniform reference set;
    Exploits the highest reuse level among those sets;
    Creates an array bitmap;
    For each array reference:
      Use $f_{SMS}(\vec{i})$ to simulate the memory addresses of the array reference within the reuse space at loop level j;
      Set the bit in the bitmap if the corresponding block in memory is accessed.
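The bitmap step can be sketched as follows. This is an illustrative reading of Algorithm 3, not the authors' implementation: it takes the reuse space at level j to mean that the outer j loops are held at their lower bounds while the inner loops run over their full ranges, and it assumes every simulated address falls inside the array. The names `array_bitmap`, `fixed`, and `array_bytes` are hypothetical.

```python
from itertools import product


def array_bitmap(sa, cof, bounds, fixed, line_size, array_bytes):
    """Mark the memory blocks of one array touched inside the reuse space.

    sa, cof     : start offset and per-loop coefficients of the reference
    bounds      : inclusive (lower, upper) bounds of each loop, outermost first
    fixed       : number of outer loops held at their lower bound
    line_size   : cache block size in bytes
    array_bytes : total size of the array in bytes
    """
    nblocks = -(-array_bytes // line_size)  # ceiling division
    bitmap = [0] * nblocks
    ranges = [range(lo, lo + 1) if k < fixed else range(lo, hi + 1)
              for k, (lo, hi) in enumerate(bounds)]
    for i in product(*ranges):
        addr = sa + sum(cf * x for cf, x in zip(cof, i))
        bitmap[addr // line_size] = 1
    return bitmap


# Hypothetical reference X[i1][i2] into an 8 x 8 array of 4-byte elements,
# with the outer loop fixed and 16-byte blocks.
bm = array_bitmap(0, [32, 4], [(0, 7), (0, 7)], 1, 16, 256)
print(bm)  # blocks 0 and 1 are set, the other 14 are clear
```

Fixing the outer loop restricts the walk to one row of the array (32 bytes), so only two 16-byte blocks are marked.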

13 Algorithms for Cache Polymorphism
  Algorithm 4: optimizing the cache configurations
    Objective: a near-optimal cache configuration for both performance and energy.
    Scheme: reduce conflicts by increasing the number of cache ways.
    Approach: optimize through the nest-level bitmap.
      Map (accumulate) all array bitmaps into a nest bitmap;
      The value of each bit position indicates the conflicts in that cache block;
      Initialize the cache with number of sets = bitmap size;
      Halve the number of sets and remap the bitmap;
      Repeat the previous step until the number of ways reaches the upper bound;
      Report the configuration with the smallest cache size and the fewest ways as the near-optimal cache configuration.
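The halving loop of Algorithm 4 can be sketched as below. This is a simplified interpretation under stated assumptions: the nest bitmap is given as a list of per-block access counts whose length is a power of two, the number of ways a configuration needs is the largest count in any set, and the search stops once that number exceeds `max_ways`. The function names and the tie-breaking order (size first, then ways) are choices made for this example.

```python
def fold(bitmap):
    """Halve the number of sets by adding the upper half onto the lower half."""
    half = len(bitmap) // 2
    return [bitmap[k] + bitmap[k + half] for k in range(half)]


def near_optimal_config(nest_bitmap, line_size, max_ways):
    """Return (cache_bytes, ways, line_size) for the smallest candidate.

    Candidates are generated by repeatedly halving the number of sets;
    ties on cache size are broken in favor of fewer ways.
    """
    candidates = []
    bitmap = list(nest_bitmap)
    while bitmap:
        ways = max(max(bitmap), 1)   # conflicts in the busiest set
        if ways > max_ways:
            break
        sets = len(bitmap)
        candidates.append((sets * ways * line_size, ways, sets))
        if sets == 1:
            break
        bitmap = fold(bitmap)
    if not candidates:
        return None
    size, ways, _sets = min(candidates)
    return size, ways, line_size


print(near_optimal_config([1, 0, 1, 0, 0, 0, 0, 0], 16, 4))  # (32, 2, 16)
print(near_optimal_config([1, 1, 1, 1], 16, 2))              # (64, 1, 16)
```

In the second call the direct-mapped and two-way candidates have the same size, so the one with fewer ways is reported, in line with the selection rule on the slide.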

14 Example

15 Example
  Algorithm 3: obtain footprints (16B)

16 Example
  Algorithm 4:
    Remap array bitmaps into the nest bitmap
    Optimize cache configuration
    Near-optimal cache configuration: (2KB, 2 way, 16B)

17 Algorithms for Cache Polymorphism
  Algorithm 5: global-level cache polymorphism
    Takes the source code as input;
    Uses a SUIF pass to construct a global array list;
    Invokes Algorithm 4 to optimize the cache configuration for each nest;
    Activates the cache reconfiguration mechanisms at run time (Shade as the execution engine);
    Outputs performance data for each loop nest at different cache configurations.

18 Experiments
  Framework: CDCP is implemented with the SUIF compiler and Shade
  Scheme: performance & energy comparison
    Shade: optimal cache configurations from exhaustive simulation
    CDCP: near-optimal cache configurations determined by CDCP

19 Cache Configurations
  [Table: cache configurations chosen by Shade and by CDCP for each code (adi.c, aps.c); columns: Codes, Shade, CDCP.]

20 Results (1)
  [Figure: L1 D-cache hit rate, Shade vs. CDCP, for adi.c, aps.c, bmcm.c, eflux.c, tomcat.c, tsf.c, vpenta.c, wss.c.]
  Performance comparison at 16B cache line size.

21 Results (2)
  [Figure: L1 D-cache hit rate; series ln0-shade, ln0-cdcp, ln1-shade, ln1-cdcp, ln2-shade, ln2-cdcp.]
  Performance comparison breakdown at each loop nest for aps.c at 16, 32, and 64B cache line sizes.

22 Energy Consumption (µjoules)
  [Table: energy consumption under Shade and CDCP configurations for adi.c and aps.c, with a total for each code; columns: Codes, Shade, CDCP.]

23 Related Work
  Compiler optimizations
    D. Gannon et al. [JPDC 88]
    M. Wolf and M. Lam [PLDI 91]
    K. S. McKinley et al. [ACM TOPLAS 96]
  Reconfigurable caches
    D. H. Albonesi [Micro 99]
    P. Ranganathan et al. [ISCA 00]

24 Conclusions and Future Work
  Conclusions:
    Proposed a new technique: CDCP;
    Presented a set of algorithms to implement CDCP;
    Experimental results show that CDCP achieves competitive performance with much lower energy consumption.
  Future Work:
    Experiment with different sets of applications;
    Finer granularity than the loop nest;
    Combine CDCP with loop- and data-based compiler optimizations.

25 Thank you!

26 Algorithm running time
  Name        Running Time (s)
  adi.c       0.49
  aps.c       .638
  bmcm.c
  eflux.c
  tomcat.c    .544
  tsf.c
  vpenta.c
  wss.c

27 Results (2)
  [Figure: L1 D-cache hit rate, Shade vs. CDCP, for adi.c, aps.c, bmcm.c, eflux.c, tomcat.c, tsf.c, vpenta.c, wss.c.]
  Performance comparison at 32B cache line size.

28 Results (3)
  [Figure: L1 D-cache hit rate, Shade vs. CDCP, for adi.c, aps.c, bmcm.c, eflux.c, tomcat.c, tsf.c, vpenta.c, wss.c.]
  Performance comparison at 64B cache line size.
