Preliminary Evaluation of the Load Data Re-Computation Method for Delinquent Loads

Hideki Miwa, Yasuhiro Dougo, Victor M. Goulart Ferreira, Koji Inoue, and Kazuaki Murakami
Dept. of Informatics, Kyushu University, Fukuoka, Japan
Dept. of Elec. Eng. and Computer Science, Fukuoka University, Fukuoka, Japan
arch-ccc-lpc@c.csce.kyushu-u.ac.jp

Abstract

In recent years, the performance gap between microprocessors and main memory has been increasing. This gap limits microprocessor performance and is well known in the literature as the Memory Wall Problem (MWP). This paper proposes a new method to reduce the effect of the MWP by means of re-computation. The basic idea is to replace a frequently cache-missing load (a so-called delinquent load) instruction with a piece of code that regenerates the load value. We call this code re-computation (RC) code. The method reduces both the number of main memory accesses and the cache miss penalty, thereby alleviating the MWP. In our experiments, we obtain up to a 45.3% reduction of execution time on SPEC CPU 2000 benchmark programs.

1 Introduction

Microprocessor performance has been improving by about 55% per year, whereas the performance of DRAM, commonly used as main memory, has been improving by only about 7% per year [4]. The performance gap between them has been increasing, and it is one of the main factors limiting microprocessor performance. This problem is known as the memory wall problem. To solve it, the memory access latency needs to be reduced or hidden. One traditional latency reduction technique is to employ an on-chip cache memory, which is widely used in state-of-the-art commercial microprocessors. A common latency hiding technique, on the other hand, is prefetching, and researchers have proposed a number of techniques to improve its accuracy. In recent years, some research groups have proposed prefetching techniques focused on the load instructions that frequently cause cache misses [3, 7].
Those load instructions have come to be known as delinquent load (DL) instructions; in one experiment they were responsible for 90% of total cache misses [3]. The proposed prefetching techniques are based on multithreaded processors: besides the normal program execution in one thread (the main thread), the data addresses accessed by DL instructions are speculatively calculated in another thread (a helper thread), so the microprocessor can prefetch the data whose addresses the helper thread calculates. However, prefetching still has to access the main memory to improve the cache hit rate, so it cannot reduce the number of memory accesses. Moreover, the main drawback of prefetching is that inaccurately loaded data pollutes the on-chip cache, resulting in higher cache miss rates. These points are the limitations of prefetching.

In this paper, we propose a novel approach to the memory wall problem based on Re-Computation (RC). It originates from a concept we call CCC (Computing Centric Computation), one of our measures against the memory wall problem. In recent microprocessors, the memory access latency is far longer than the computation time of a single operation; CCC replaces memory accesses with computations to reduce total execution time. We apply this concept to load instructions and propose the Load Data Re-Computation (LDRC) technique. To avoid the growing off-chip memory-access latency, we replace load instructions with corresponding RC code; that is, we attempt to trade memory access latency for RC code execution time. To concentrate on the loads whose cache misses hurt most, we apply this method only to DL instructions. Hence, if the microprocessor can execute the RC codes faster than the corresponding DL instructions, LDRC executes the RC codes instead and shortens the total execution time. In this paper, we preliminarily evaluate
our technique and show its potential for execution time reduction.

The remainder of this paper is organized as follows. Section 2 introduces related work. Section 3 describes the concept of our re-computation technique. Section 4 presents the experiments and the evaluation results for our method. Section 5 concludes the paper.

[Figure 1. Outline of LDRC. Original source code: c = a + b; (the value c is stored to main memory) followed by z = x + c; (the value c is no longer in the cache and has to be read from main memory). Original object code: Load a; Load b; Add c, a, b; Store c; Load c; Load x; Add z, x, c; Store z. RC code: Load a; Load b; Add c, a, b. LDRC-applied object code: Load a; Load b; Add c, a, b; Store c; [RC code]; Load x; Add z, x, c; Store z, where Load c has been replaced with the RC code, assuming the value x is in the cache.]

2 Related work

There are many techniques to reduce or hide the latency of main memory accesses. However, as far as we know, no technique based on re-computation has been proposed. In this section, we take up methods dealing with DL instructions. Recently, speculative data prefetching techniques for DL instructions have been proposed to hide the access latency.

Roth and Sohi propose speculative Data-Driven Multithreading (DDMT) in [7]. In DDMT, instructions are classified by their criticality to performance from trace data. If the processor detects that critical instructions are upcoming at run time, it forks a speculative thread to execute them. The speculative thread is called a data-driven thread (DDT) and is executed in parallel with the main thread. A DDT includes load instructions that cause first-level cache misses or branch instructions that are frequently mispredicted.

Collins et al. propose Speculative Precomputation (SP) in [3]. SP also uses a simultaneous multithreading processor to improve the performance of single-thread applications. In SP, precomputation slices (p-slices), which include DL instructions and their dependent instructions, are executed in another thread to calculate the data addresses accessed by DL instructions and prefetch them. SP targets Itanium processors, which have an in-order pipeline, whereas DDMT targets out-of-order processors.

Annavaram et al. propose Data Prefetching by Dependence Graph Precomputation (DGP) in [1]. DGP employs a Dependence Graph Generator (DGG) in the IFQ, which generates a dependence graph for a load or store instruction that is likely to cause a cache miss. A Precomputation Engine (PE) executes the dependence graph to generate the prefetch address and prefetch the data. DGP does not need a multithreaded processor, but it needs the additional DGG and PE hardware.

The techniques described above are limited to data prefetching: if the predicted data address is wrong, they cannot hide the latency of main memory accesses. In the next section, we propose a data re-computation method for DL instructions.

3 Load data re-computation

In this section, we propose the load data re-computation technique, called LDRC, and discuss its effectiveness. To keep the problem simple, we assume that the benchmark programs run single-threaded on a normal 32-bit RISC-type microprocessor. The microprocessor accesses a 2-level on-chip cache memory and an off-chip main memory.

3.1 Basic idea

To explain the concept of LDRC, consider the execution of a load instruction. If the data accessed by the load is not in the on-chip L2 cache, it has to be fetched from the off-chip main memory. Since fetching data from off-chip memory takes a long time, the cache miss penalty is very large. However, if the data was defined earlier in the program, we can obtain it by re-computation; and if the re-computation time for the load data is shorter than the memory access time, re-computation reduces the execution time. This is the basic idea of LDRC. Namely, we attempt to replace DL instructions with corresponding re-computation codes to reduce execution time.
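As a rough sketch of this trade-off (not the authors' implementation; the cost model, latencies, and all names below are hypothetical illustrations), the choice between loading and re-computing a value can be written as:

```python
# A rough cost model of the LDRC trade-off (hypothetical, for
# illustration only): a delinquent load that misses in the L2 cache
# pays the main-memory latency, while re-executing the RC code that
# defines the value pays only computation time.

MAIN_MEMORY_LATENCY = 250  # clock cycles (the value assumed in Table 1)
L1_HIT_LATENCY = 1         # clock cycles
CPI = 1.0                  # assumed mean clock cycles per instruction

def value_cost(rc_code_length, load_hits_in_cache):
    """Clock cycles to obtain a value, taking the cheaper of the
    delinquent load and its RC code, as LDRC aims to do."""
    load_cost = L1_HIT_LATENCY if load_hits_in_cache else MAIN_MEMORY_LATENCY
    rc_cost = rc_code_length * CPI
    return min(load_cost, rc_cost)

# Figure 1's RC code ("Load a; Load b; Add c, a, b") has 3 instructions,
# so re-computing c is far cheaper than a 250-cycle main-memory access.
print(value_cost(rc_code_length=3, load_hits_in_cache=False))
```

Under these assumed latencies, re-computation wins whenever the RC code is shorter than the miss penalty, which is exactly condition 3 discussed in Subsection 3.3.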
We give an example in Figure 1 to explain this method. Figure 1 shows an original source code, the corresponding object code, an RC code, and the modified object code. In the original source code, the variable c is defined in the first statement (c = a + b) and used in the latter one (z = x + c). At the latter statement, we assume that the value of c has been evicted from the on-chip cache, so in the corresponding object code the instruction Load c causes a cache miss and c must be loaded from main memory. In the LDRC-applied object code, the RC code is inserted in place of the Load c that was in the original object code, and the variable c is regenerated by the RC code. If the variables a and b are in the cache, c is successfully re-computed by the RC code without any cache misses. The execution cycles are reduced if the re-computation can replace the accesses to main memory and its time is shorter than the main memory access time.

3.2 RC code generation

We now explain the procedure used to generate RC codes:

1. Determine the DL instructions through simulations of the benchmark programs. DL instructions depend on the input vectors of a benchmark program, so the DL instruction determination has to be carried out for each input.
2. Construct a CDFG (control data flow graph) for each benchmark program.
3. On the CDFG, find the definition point of the loaded data for each DL instruction.
4. Extract from the CDFG the code that defines each loaded value.

The above algorithm yields the RC codes. Furthermore, if we know that a load instruction in a leaf of an RC code will itself cause a cache miss, then an RC code for that load should be generated as well. We call such RC codes multi-layered RC codes, and the normal ones single-layered RC codes.

3.3 The conditions to obtain performance improvements by LDRC

This subsection discusses the conditions that should be satisfied in order to cut down the execution time by LDRC.
We ignore implementation issues here, because the objective of the evaluation is to show the potential of LDRC to reduce execution time. To reduce execution time, applications should satisfy at least the following four conditions:

1. The DL instructions cause a lot of cache misses.
2. The DL instructions can be replaced with RC codes.
3. The execution time of an RC code is shorter than the memory access latency.
4. The load instructions within the RC codes do not cause any cache misses.

Conditions 1 and 2 describe the opportunity to apply the proposed method: the more opportunities we have, the more benefit we can obtain. Conditions 3 and 4 describe the overhead caused by the proposed method: execution time cannot be reduced unless the re-computation time is short and the load instructions in the RC codes hit in the cache.

4 Evaluation

In this section, we quantitatively evaluate our method using several benchmark programs. Since the objective of the evaluation is to show the maximum performance improvement due to our method, we ignore several implementation issues.

4.1 Experimental environment

To evaluate our method, we used the SimpleScalar simulation tool set [2]. The processor configuration assumed in this evaluation is shown in Table 1. We selected the following benchmark programs from the SPEC CPU 2000 benchmark suite [5]; these applications were compiled with the MIRV C compiler [6].

Floating point benchmark programs (4 programs): 177.mesa, 179.art, 183.equake, 188.ammp
Integer benchmark programs (7 programs): 164.gzip, 175.vpr, 176.gcc, 181.mcf, 197.parser, 255.vortex, 256.bzip2

We used the reference inputs of SPEC CPU 2000 for these benchmarks, and selected command line parameters so as to cause the maximum number of
L2 data cache misses by DL instructions. To shorten the simulation time, we evaluated the 200M instructions executed after the first 2G instructions.

Table 1. Simulator parameters (cc: clock cycle(s))
  Instruction issue: out of order
  Branch predictor: 2-level (gshare, 2K entries); BTB: 512 entries, 4-way; RAS: 32 entries
  Inst. decode width: 8 instructions/cc; inst. issue width: 8 instructions/cc
  IFQ size: 8 entries; RUU size: 64 entries; LSQ size: 32 entries
  Cache memory: L1 data 32KB (64B/entry, 2-way, 256 entries); L1 instruction 32KB (64B/entry, 1-way, 512 entries); L2 unified 2MB (64B/entry, 4-way, 8192 entries)
  Latency: L1 cache 1 cc; L2 cache 16 cc; main memory 250 cc
  Memory bandwidth: 8B; memory ports: 2
  ITLB, DTLB: 1M entries (4KB/entry, 256 entries/way, 4-way); miss penalty 30 cc
  Integer operation units (units, exec. lat., issue lat.): ALU 4, 1 cc, 1 cc; Mult. 1, 3 cc, 1 cc; Div. 1, 20 cc, 19 cc
  Floating operation units (units, exec. lat., issue lat.): ALU 4, 2 cc, 1 cc; Mult. 1, 4 cc, 1 cc; Div. 1, 12 cc, 12 cc; SQRT 1, 24 cc, 24 cc

4.2 Evaluation assumptions

The assumptions in this evaluation are as follows. We define as DL instructions the 16 static load instructions that cause L2 cache misses most frequently. This definition is based on the observation that these 16 static load instructions are responsible for the greater part of the L2 cache misses. Figure 2 shows the percentage of L2 cache misses caused by the DL instructions for each benchmark program. We ranked the DL instructions by their cache miss counts as DL1, DL2, DL3, and DL4-16.

[Figure 2. Percentage of L2 cache misses due to DL instructions (DL1, DL2, DL3, DL4-16) for each benchmark program.]

According to Figure 2, the DL1, DL2, and DL3 instructions are responsible for more than 40% of the total cache misses in most of the benchmark programs, and all the DL instructions together are responsible for more than %. In Subsection 3.2, we explained the way to generate RC codes.
In practice, however, we could not generate the CDFGs, so we used instruction traces instead. As a consequence, while RC codes should include all the paths needed for re-computation, the generated RC codes include only one path.

To count the number of instructions in the RC codes, we would have to generate all of them, which takes a long time. In this evaluation, to reduce the RC code generation time, we group the RC codes by similarity and generate only one representative RC code per group; we assume that all the RC codes in a group have the same number of instructions as the representative RC code.

In order to always satisfy condition 4 of Subsection 3.3, we assume that all the load instructions in all the RC codes hit in the L2 cache. If the execution time of an RC code is longer than the memory access latency, we assume that the RC code is not executed and that the load instruction fetches the data from the off-chip memory instead. Finally, we do not consider the side effects of LDRC on the instruction and data caches.
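The DL-instruction selection assumed above (the 16 static loads with the most L2 misses) can be sketched as follows; the profile format is a hypothetical illustration, not the authors' tooling:

```python
from collections import Counter

def select_delinquent_loads(l2_miss_pcs, n=16):
    """Rank static load instructions (identified by their PC) by the
    number of L2 data-cache misses they caused in a profiling run, and
    return the n most delinquent as (pc, miss_count) pairs."""
    return Counter(l2_miss_pcs).most_common(n)

# Hypothetical miss profile: one static load dominates the miss stream,
# mirroring the skew shown in Figure 2.
profile = [0x400] * 90 + [0x404] * 8 + [0x408] * 2
print(select_delinquent_loads(profile, n=3))
```

With real traces, the same ranking directly yields the DL1, DL2, DL3, and DL4-16 groups used in Figure 2.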
4.3 Evaluation index

We adopt the percentage of execution time reduced by LDRC as the evaluation index. Formula (1) gives the execution time (in clock cycles) under this method:

  C_ldrc = C_ideal + C_rc + C_norc    (1)

  C_ldrc: the execution time (in clock cycles) under LDRC.
  C_ideal: the execution time (in clock cycles) ignoring the data cache miss penalty of the DL instructions; here we assume that the DL instructions always hit in the cache.
  C_rc: the execution time (in clock cycles) of all the RC codes.
  C_norc: the execution time (in clock cycles) of the DL instructions that cannot be replaced with RC codes.

We estimate C_rc with Formula (2):

  C_rc = CPI × Σ_{i=0}^{dl} Σ_{k=0}^{rcc} I_rccg(i, k) × N_rccg(i, k)    (2)

  CPI: the mean execution clock cycles per instruction.
  I_rccg(i, k): the number of instructions in the k-th representative RC code for the i-th DL instruction.
  N_rccg(i, k): the number of executions of the k-th representative RC code for the i-th DL instruction.
  dl: the number of DL instructions (fixed to 16 in this experiment).
  rcc: the number of RC code groups.

Formula (3) gives the execution time of the DL instructions that cannot be replaced with RC codes:

  C_norc = (C_org − C_ideal) × N_norc / N_dl2miss    (3)

  C_org: the original execution time (in clock cycles).
  N_dl2miss: the total number of L2 misses due to DL instructions.
  N_norc: the total number of DL instructions that cannot be replaced with RC codes.

[Figure 3. Percentage of reduced execution time (Ideal vs. LDRC).]

[Figure 4. The average execution time of RC codes (in clock cycles).]

Formula (4) gives the percentage of execution time reduction, using Formulas (1), (2), and (3):

  R_ldrc = (1 − C_ldrc / C_org) × 100 [%]    (4)

  R_ldrc: the percentage of execution time reduction by the proposed method.
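Formulas (1) through (4) combine into a small estimator; the following sketch uses hypothetical input numbers purely to illustrate the arithmetic:

```python
def ldrc_reduction(c_org, c_ideal, cpi, rc_groups, n_norc, n_dl2miss):
    """Estimate R_ldrc, the percentage of execution time reduced by
    LDRC, following Formulas (1)-(4).

    rc_groups: (instruction_count, execution_count) pairs, one per
               representative RC code group (I_rccg and N_rccg).
    """
    c_rc = cpi * sum(i * n for i, n in rc_groups)    # Formula (2)
    c_norc = (c_org - c_ideal) * n_norc / n_dl2miss  # Formula (3)
    c_ldrc = c_ideal + c_rc + c_norc                 # Formula (1)
    return (1 - c_ldrc / c_org) * 100                # Formula (4)

# Hypothetical run: 1e9 cycles, 40% of which is DL miss penalty, one RC
# code group of 10 instructions executed 1M times, every DL replaceable.
print(ldrc_reduction(c_org=1_000_000_000, c_ideal=600_000_000, cpi=1.0,
                     rc_groups=[(10, 1_000_000)], n_norc=0, n_dl2miss=2_000_000))
```

In this made-up example the RC overhead (10M cycles) is tiny compared with the removed miss penalty (400M cycles), giving a reduction of about 39%.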
To estimate the percentage of execution time reduced by LDRC, we obtain C_org, C_ideal, I_rccg(i, k), N_rccg(i, k), N_norc, N_dl2miss, and CPI from our experiments.

4.4 Experimental results and discussions

Figure 3 shows the percentage of execution time reduced by LDRC, Figure 4 the mean execution time of the RC codes, and Figure 5 the percentage of DL instructions that can be replaced with RC codes. In Figure 3, there are two bars per benchmark: the first corresponds to the ideal case, in which all the DL instructions hit in the cache,
and the second bar corresponds to the LDRC-applied case. Figure 4 plots the average execution time of the representative RC code of each RC code group mentioned in Subsection 4.2. Figure 5 indicates the percentage of DL instructions that can be replaced with RC codes.

[Figure 5. The percentage of DL instructions which can be replaced with RC codes, for each benchmark program.]

According to Figure 3, one benchmark program, 181.mcf, can reduce execution time by up to 45.3%. On the other hand, LDRC shows no execution time reduction on 177.mesa, 179.art, 176.gcc, and 255.vortex. We discuss the reasons for these results in terms of the average execution time of the RC codes and the number of DL instructions that can be replaced with RC codes.

First, according to Figure 3, we see little execution time reduction on 177.mesa, 176.gcc, and 255.vortex because these programs barely satisfy condition 1 of Subsection 3.3: their DL instructions have little influence on performance, so LDRC has little to gain. Secondly, according to Figure 5, 179.art has very few DL instructions that can be replaced with RC codes, so it barely satisfies condition 2 of Subsection 3.3; such programs offer little opportunity to apply LDRC. Finally, consider one more benchmark program, 188.ammp. According to Figure 3, this program shows high potential for execution time reduction in the ideal case but not in the LDRC case; the reason, as Figure 4 shows, is that its RC codes take a long time to execute, so it does not satisfy condition 3 of Subsection 3.3.

From the above discussion, if any one of the performance improving conditions of Subsection 3.3 is not satisfied, we cannot obtain a large execution time reduction.
One benchmark program in this evaluation, 181.mcf, satisfies every condition, so it obtains a large execution time reduction.

5 Conclusion

In this paper, we proposed a load data re-computation technique called LDRC as a measure against the memory wall problem. We applied LDRC to DL instructions and evaluated its potential for execution time reduction. From the experimental results, one benchmark program, 181.mcf, can reduce execution time by up to 45.3%, confirming the potential of the approach. In this evaluation, we ignored several overhead factors, such as changes in the miss rates of the data and instruction caches. Further evaluations that take realistic implementations into account will show the effectiveness of LDRC more precisely.

References

[1] M. Annavaram, J. M. Patel, and E. S. Davidson. Data prefetching by dependence graph precomputation. In Proc. of the 28th Intl. Symposium on Computer Architecture, 2001.
[2] D. Burger and T. M. Austin. The SimpleScalar tool set, version 2.0. University of Wisconsin-Madison Computer Sciences Department Technical Report, June 1997.
[3] J. D. Collins, H. Wang, D. M. Tullsen, C. Hughes, G. Hoflener, D. Lavery, and J. P. Shen. Speculative precomputation: Long-range prefetching of delinquent loads. In Proc. of the 28th Intl. Symposium on Computer Architecture, 2001.
[4] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach, 3rd Edition. Morgan Kaufmann Publishers, 2003.
[5] J. L. Henning. SPEC CPU2000: Measuring CPU performance in the new millennium. IEEE Computer, 33(7), July 2000.
[6] M. Postiff, D. Greene, C. Lefurgy, D. Helder, and T. Mudge. The MIRV SimpleScalar/PISA compiler. University of Michigan EECS Department Technical Report, April 2000.
[7] A. Roth and G. Sohi. Speculative data-driven multithreading. In Proc. of the 7th Intl. Symposium on High-Performance Computer Architecture, January 2001.