Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors


Wenun Wang and Wei-Ming Lin
Department of Electrical and Computer Engineering, The University of Texas at San Antonio, San Antonio, TX, USA

Abstract

Effective distribution of critical shared resources among concurrently executing threads is key to improving overall system performance in Simultaneous Multi-Threading (SMT) processors. One of the most critical shared resources is the physical register file in the rename stage, and a disproportional distribution of these rename registers can easily render it a bottleneck along the pipeline stages. In this paper, we propose an architectural-level physical register file allocation algorithm to better utilize the register file. Once the overall physical register file utilization exceeds a certain threshold, the thread with the highest occupancy is temporarily suspended in order to allow other threads more space to proceed, achieving a higher throughput. Our simulations based on M-Sim [1] show that the proposed technique improves IPC (Instructions Per Clock cycle) by a very significant margin (up to 25%) without sacrificing execution fairness among the threads.

1 Introduction

To enhance overall performance in an SMT system by improving resource allocation fairness, several techniques have been proposed that modify the fetch policy. ICOUNT [2] gives priority to threads with the fewest instructions in the decode, rename and issue stages. DCRA [3] monitors the usage of resources by each thread and gives a higher share of the available resources to memory-intensive threads. Several techniques target the utilization of critical shared resources, for example the physical register file. In [4], the performance for a given register file size is improved in an SMT system by permitting the compiler and operating system to expedite the deallocation process.
Another technique focused on early release of physical registers is proposed in [5], in which a virtual register window is employed to achieve early release and reduce the register file pressure. In [6], a technique is proposed to reduce register file cost by delaying the allocation of physical registers until a later pipeline stage. However, these techniques either require the support of an intelligent compiler and operating system [4][6] or extra hardware modifications in other pipeline stages [5], which can be too system-specific and/or complex to be practical.

In this paper, we propose an efficient physical register file allocation scheme based on the notion of suspending a thread when system resources are heavily contended. This is achieved by blocking the thread with the highest utilization of physical registers whenever the current overall utilization is relatively high. Our proposed technique operates at the architectural level and requires no modification to the operating system or compiler. Furthermore, the technique is a stand-alone process in the rename stage, and thus no modification is required in other pipeline stages. Our simulation results show that the proposed technique improves IPC by up to 24% and 25%, and enhances the Harmonic IPC (a performance indicator of execution fairness) by up to 30.48% and 40.47%, on a 4-threaded workload and an 8-threaded workload respectively.

2 Simulation Environment

2.1 Simulator

We use M-Sim [1], a multi-threaded microarchitectural simulation environment, to estimate the performance of our proposed method in an SMT system. Table 1 gives the detailed configuration of the simulated processor.

2.2 Workloads

Simulations on simultaneous multi-threading use workloads of mixed SPEC CPU2006 benchmark suites [8] with mixtures of various levels of ILP.
The ILP classification of each mix is obtained by initializing it in accordance with the procedure described for the SimPoint tool [9] and simulating it individually in a SimpleScalar environment.

Copyright ISCA, SEDE 2016, September 26-28, 2016, Denver, Colorado, USA

Table 1: Configuration of Simulated Processor

Machine Width: 8 wide fetch/dispatch/issue/commit
L/S Queue Size: 48-entry load/store queue
ROB & IQ Size: 128-entry ROB, 32-entry IQ
Functional Units & Latency (total/issue): 4 Int Add (1/1), 1 Int Mult (3/1) / Div (20/19), 2 Load/Store (1/1), 4 FP Add (2/1), 1 FP Mult (4/1) / Div (12/12) / Sqrt (24/24)
Physical Registers: integer and floating point, as specified in the paper
L1 I-cache: 64KB, 2-way set associative, 64-byte line
L1 D-cache: 64KB, 4-way set associative, 64-byte line, write back, 1-cycle access latency
L2 Cache (unified): 512KB, 16-way set associative, 64-byte line, write back, 10-cycle access latency
BTB: 512-entry, 4-way set associative
Branch Predictor: bimod, 2K-entry
Pipeline Structure: 5-stage front-end (fetch-dispatch), scheduling (register file access: 2 stages), execution, write back, commit
Memory Configuration: 32-bit wide, 300-cycle access latency

Three types of ILP classification are identified: low ILP (memory bound), medium ILP, and high ILP (execution bound). Table 2 gives the 4-threaded and 8-threaded simulated workloads with various mixtures of ILP classification types.
Table 2: SPEC CPU2006 4-threaded and 8-threaded Workloads

4-threaded:
Mix1: libquantum, dealii, gromacs, namd
Mix2: soplex, leslie3d, povray, milc
Mix3: hmmer, sjeng, gobmk, gcc
Mix4: lbm, cactusadm, xalancbmk, bzip2
Mix5: libquantum, dealii, gobmk, gcc
Mix6: gromacs, namd, soplex, leslie3d
Mix7: dealii, gromacs, lbm, cactusadm
Mix8: libquantum, namd, xalancbmk, bzip2
Mix9: povray, milc, cactusadm, xalancbmk
Mix10: hmmer, sjeng, lbm, bzip2

8-threaded:
Mix1: libquantum, dealii, gromacs, namd, soplex, leslie3d, povray, milc
Mix2: libquantum, dealii, gromacs, namd, lbm, cactusadm, xalancbmk, bzip2
Mix3: hmmer, sjeng, gobmk, gcc, lbm, cactusadm, xalancbmk, bzip2
Mix4: libquantum, dealii, gromacs, soplex, leslie3d, povray, lbm, cactusadm
Mix5: dealii, gromacs, namd, xalancbmk, hmmer, cactusadm, milc, bzip2
Mix6: gromacs, namd, sjeng, gobmk, gcc, lbm, cactusadm, xalancbmk

2.3 Metrics

For a multi-threaded workload, the total combined IPC is a typical indicator used to measure overall performance, defined as the sum of each thread's IPC:

    Overall IPC = sum_{i=1}^{n} IPC_i    (1)

where n denotes the number of threads per mix in the system. However, in order to preclude starvation effects among threads, the so-called Harmonic IPC is also adopted, which reflects the degree of execution fairness among the threads, namely,

    Harmonic IPC = n / sum_{i=1}^{n} (1 / IPC_i)    (2)

In this paper, these two indicators are used to compare the proposed algorithm to the baseline (default) system. The following metric indicates the IPC improvement percentage averaged over the selected mixes (instead of the improvement percentage of the averaged IPC), which is applied to both Overall IPC and Harmonic IPC, namely,

    Imp% = ( sum_{j=1}^{m} (IPC_new(j) - IPC_baseline(j)) / IPC_baseline(j) * 100% ) / m    (3)

where m denotes the number of mixes in the workload.

3 Motivation

This section is devoted to analyzing the distribution of physical registers among multiple threads.
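The three metrics defined in Section 2.3 (Eqs. 1-3) are simple enough to sketch directly. The following Python helpers are illustrative only (they are not part of M-Sim); the function names are our own:

```python
def overall_ipc(ipcs):
    """Eq. (1): total combined IPC, the sum of per-thread IPC values."""
    return sum(ipcs)

def harmonic_ipc(ipcs):
    """Eq. (2): n / sum(1/IPC_i); a starved thread (tiny IPC_i)
    drags this metric down, so it reflects execution fairness."""
    n = len(ipcs)
    return n / sum(1.0 / ipc for ipc in ipcs)

def avg_improvement_pct(new, baseline):
    """Eq. (3): per-mix improvement percentages averaged over the m
    mixes (not the improvement of the averaged IPC)."""
    m = len(new)
    return sum((a - b) / b * 100.0 for a, b in zip(new, baseline)) / m
```

For example, four threads each at IPC 2.0 give Harmonic IPC 2.0, while one starved thread near zero would pull the harmonic value toward zero even though the Overall IPC barely changes.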
Note that, in general, competition for the floating-point register file is not as severe as that for the integer one [7]. Thus, throughout the rest of this paper, the register file under discussion refers to the integer one.

We first analyze the occupancy rate of the physical register file to gauge the demand for it. Figure 1 (Average Overall Physical Register Utilization of the 4-threaded Workload with R_t = 192) shows the overall physical register utilization with the register file size (denoted as R_t) set at 192. Each point represents an average result from the 10 mixes of the 4-threaded workload. As the

results show, in nearly 80% of simulation clock cycles the physical register file is fully occupied, which clearly indicates the demand for a larger register file.

We further analyze the utilization distribution of physical registers among threads on the 4-threaded workload to determine the seriousness of single-thread dominance. Figure 2 (Highest Register Utilization among Threads of the 4-threaded Workload with R_t = 192) shows the percentage-of-time distribution of the highest individual thread utilization of the shared registers per mix with R_t = 192. If the registers were perfectly (equally) distributed among 4 threads, the highest individual utilization would be 25%. However, as shown in Figure 2, except for mix6, over 40% of the shared registers are occupied by one single thread for most of the simulation time. Especially for mix9 and mix10, in nearly 60% of simulation clock cycles over 80% of the shared registers are occupied by one single thread. Such single-thread dominance can easily become a sustained bottleneck for the other threads' progress.

Figure 3: Flowchart of the Proposed Algorithm

4 Proposed Method

Our proposed technique is based on the conjecture that, when the overall utilization of the register file exceeds a certain threshold indicating an ongoing congestion, it may be worthwhile to suspend the most demanding thread from acquiring more registers, sacrificing a certain degree of thread-level parallelism for a better flow of thread processing.
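The single-thread dominance measured in Section 3 can be quantified in post-processing. The sketch below is a hypothetical helper (not M-Sim code) that, given per-cycle occupancy samples, computes how often one thread alone exceeds a given share of the shared registers, as in the Figure 2 analysis:

```python
def dominance_fraction(per_cycle_occupancy, r_shared, share=0.4):
    """Fraction of sampled cycles in which a single thread holds more
    than `share` of the R_r shared registers.
    per_cycle_occupancy: list of per-cycle lists, one register count
    per thread; r_shared: number of shared registers (R_r)."""
    dominated = sum(
        1 for regs in per_cycle_occupancy if max(regs) > share * r_shared
    )
    return dominated / len(per_cycle_occupancy)
```

With share=0.4, a result near 1.0 would correspond to the mixes in Figure 2 where one thread holds over 40% of the shared registers for most of the simulation.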
The basic concept of our proposed technique is that the thread with the highest physical register utilization (occupancy) will be suspended in the rename stage, without being allocated new physical registers, whenever the overall register utilization is above a set threshold. Figure 3 shows the flowchart of the proposed features built into the original default scheme. The proposed thread-suspending technique is imposed on a window basis, with a fixed, preset number of clock cycles per window. That is, suspension of the thread identified at the end of the current window will last through the next window if the set threshold is exceeded in the current window.

Note that the physical register file is only partially shared among threads in an SMT system: a portion of it is dedicated to the architectural registers of each thread, with the rest (denoted as R_r) being shared among the threads. All utilization measurements in this paper are with respect to R_r. During the current window, the register utilization of thread i is measured, denoted as U_I(i), as is the overall utilization of all threads, denoted as U_O. If the average number of registers used in a window is R_O, then

    U_O = R_O / R_r

and

    U_I(i) = R(i) / R_r

where R(i) is the average number of registers occupied by thread i. At the end of each window, the proposed algorithm checks whether the overall utilization is higher than the preset overall threshold (denoted as Th_O) to determine whether the suspension scheme needs to be initiated for the next window; that is, when U_O > Th_O.
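The end-of-window check described above can be sketched as follows. This is a minimal illustration of feature 1 only, with hypothetical names (not M-Sim code); avg_regs[i] stands for R(i), the average number of shared registers held by thread i over the window just ended:

```python
def suspension_target(avg_regs, r_shared, th_o):
    """Return the index of the thread to suspend for the next window,
    or None if the overall threshold Th_O is not exceeded."""
    u_i = [r / r_shared for r in avg_regs]   # U_I(i) = R(i) / R_r
    u_o = sum(u_i)                           # U_O = R_O / R_r
    if u_o > th_o:                           # congestion detected
        # suspend the highest register-occupying thread
        return max(range(len(u_i)), key=u_i.__getitem__)
    return None                              # no suspension next window
```

For instance, with R_r = 160 and window averages of 80, 30, 20 and 20 registers, U_O = 0.9375 exceeds Th_O = 0.72, so thread 0 would be suspended for the next window.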

In this algorithm, for the sake of avoiding thread starvation, no thread will be suspended for two consecutive windows. Our observation shows that such a prevention scheme is mostly not needed, since the register utilization of a suspended thread normally decreases significantly during the current window, so that it does not remain the highest register-occupying thread at the end of the window. Such a status change comes from the fact that, even when a thread is suspended from renaming during the current window, its already-renamed instructions can still continue to finish and commit, releasing its occupied shared resources along the way for other threads to use. This newly added feature is indicated in the lightly shaded box in Figure 3 as feature 1.

Due to a combination of register file size and the threads' concurrent behaviors, exceeding the overall utilization threshold Th_O may not truly reflect the targeted scenario of congestion and imbalanced usage. Namely, a universally set Th_O may not be applicable to cases with smaller register file sizes (R_r): U_O will be mostly high in such cases, but this does not necessarily indicate an imbalanced usage among the threads. Due to this concern, another threshold is instilled to ensure that the suspension mechanism is imposed only when the highest per-thread occupancy exceeds this threshold, the so-called highest individual utilization threshold, denoted as Th_IH; that is, when

    U_IH > Th_IH

where U_IH denotes the highest per-thread occupancy, defined as

    U_IH = max_i{U_I(i)} / U_O

which is measured against the overall usage instead of against R_r. That is, under this new feature (indicated as feature 2 in Figure 3), not only does the overall threshold have to be exceeded, but this individual threshold condition has to be satisfied as well.

In this approach, the size of the register file will affect the effectiveness of our proposed technique. A larger register file diminishes the benefits of our technique, since the competition among threads is less severe. For a smaller register file, even with the highest-utilization thread suspended, congestion can remain severe due to the limited size. However, the main compromise between the benefits and drawbacks lies in the value settings of Th_O and Th_IH. If these thresholds are set too high, the intention of our proposed technique, preventing shared resources from being dominated by one single thread, will not be realized, since the threshold conditions are less likely to be satisfied. On the other hand, thresholds set too low may unnecessarily suspend threads in a condition which does not really call for the scheme.

Also note that the preset window size, denoted as W, is another parameter that may greatly affect the effectiveness of the approach. If the window size is set too small, there may not be enough time for the suspended thread to release enough registers for others to use before the thread comes back in the next window. On the other hand, a window size set too large may reduce the effectiveness of the proposed technique in capturing the threads' temporal behavioral changes, while at the same time potentially degrading execution fairness among the threads.

5 Simulation Results

5.1 Feature I: with Overall Utilization Threshold Th_O

As shown in Figure 3, the technique is first applied with the first feature only, using the overall utilization threshold (Th_O) to determine when the suspension mechanism is to be imposed. The first simulation runs investigate the effect of window size on the effectiveness of the proposed algorithm. An ample register file size is chosen for the 4-threaded workload, with R_t = 256. Figure 4 (Average IPC Improvement vs. Window Size on a 4-threaded Workload with R_t = 256) shows the average IPC improvement with different window sizes; four different Th_O values are used to provide a more comprehensive comparison across window sizes. IPC improvement continues to rise as the window size (W) increases until about 8000 clock cycles (cc), and the optimal Th_O value is around 72%. Based on this observation, all our following simulation runs use W = 8000.

Simulations are also performed to disclose the effect of the adopted register file size on IPC performance. Results are shown in Figure 5 for different threshold (Th_O) values, for the 4-threaded workload and the 8-threaded workload respectively. Note that the mechanism defaults back to the original system when Th_O is set to 100%, in which the threshold

condition is never satisfied, leading to no suspension at all. The average IPC improvement from this technique can reach up to about 25% for both workloads. In the 4-threaded case, the improvement percentage in general increases as R_t increases from 160 to 224, but drops significantly when it further increases to 256, an indication of a suitable range for applying the proposed mechanism. That is, when the register file size is not large enough, congestion remains prevalent, and thus suspending only one thread may not provide the optimal benefit. On the other hand, when the register file size is sufficiently large, the benefit of the proposed mechanism is not as obvious, indicating that the need for thread suspension is not as critical. A similar scenario occurs in the 8-threaded case, where the performance improvement peaks at R_t = 576 and then drops significantly when the size is set higher.

Figure 5: Average IPC Improvement vs. Overall Physical Register Utilization Threshold (Th_O) on a 4-threaded Workload and 8-threaded Workload with W = 8000

A summary of the optimal Th_O value for each register file size is given in Table 3 for the two workloads respectively. The optimal threshold value continues to decrease as the size of the register file increases, which ascertains our conjecture on the use of this threshold: a smaller register file demands a higher threshold to indicate the need for thread suspension.

Table 3: Optimal Th_O vs. Physical Register File Size

4 threads (R_t = 160, 192, 224, 256): Th_O = 98%, 95%, 95%, 80%
8 threads: Th_O = 98%, 95%, 80%, 80%, 80%

5.2 Feature II: with Individual Utilization Threshold Th_IH

As displayed in Figure 3, the second proposed feature imposes an additional individual utilization threshold, Th_IH, as a second condition required for thread suspension. To clearly see the effect of this extra condition, combined with the overall utilization threshold Th_O, on the overall performance, a simulation is performed on a variety of register file sizes with both features turned on. Figure 6 (Average Overall IPC Improvement vs. Overall Physical Register Utilization Threshold on a 4-threaded Workload with R_t = 160 and R_t = 192) shows the average IPC improvement under these two threshold values for register file sizes of 160 and 192 on the 4-threaded workload. Three different Th_IH values are used: 40%, 60%, and 80%. A fourth plot with Th_IH = 0 represents the case of not employing this extra threshold, since the condition is then always satisfied. For the cases with smaller register files (R_t = 160 and R_t = 192), Th_IH = 40% is too small to properly reflect single-thread dominance, which leads to no further improvement over not employing this threshold (Th_IH = 0). Employing a higher Th_IH can further increase the performance improvement by up to 7%, depending on the Th_O selected. This result verifies our aforementioned claim that satisfying the overall utilization threshold (Th_O) condition may not indicate resource dominance by one or a few threads, especially with a smaller R_t or a smaller Th_O.

5.3 Execution Fairness

When a selective thread-suspension mechanism like the proposed one is imposed for the sake of increasing overall performance, there is a potential that execution unfairness or even disastrous thread starvation might

occur. Thus, another measurement is carried out in our simulations using the Harmonic IPC (defined in Eq. 2) as a performance indicator. Figure 7 (Harmonic IPC Improvement vs. Physical Register File Size on a 4-threaded Workload and 8-threaded Workload) shows the Harmonic IPC improvement with varying register file size on the two workloads. Values of Th_O and Th_IH are set to their respective values for optimal overall IPC improvement, so as to reveal the greatest potential execution unfairness. As the results show, our proposed technique not only improves the overall IPC but also enhances the Harmonic IPC, by up to 30.48% and 40.47% on the 4-threaded workload and 8-threaded workload respectively.

6 Conclusion

In this paper, an efficient physical register file allocation scheme is proposed to reduce register file pressure in an SMT system. By preventing the physical register file from being overwhelmingly occupied by one single thread, the technique delivers a very significant performance improvement without sacrificing execution fairness. A further extension of this technique could select not just one but potentially several threads for temporary suspension, which is especially relevant when the register file is small and/or the number of threads in the system is large. Such an extension would require a more intelligent threshold setting and a more reliable real-time mechanism for reading congestion and thread dominance.

7 Acknowledgement

This research is partially supported by funding from a National Science Foundation Award.

References

[1] J. Sharkey, "M-Sim: A Flexible, Multithreaded Simulation Environment," Tech. Report CS-TR-05-DP1, Department of Computer Science, SUNY Binghamton.

[2] D. M. Tullsen, S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo and R. L.
Stamm, "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor," in Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996.

[3] F. J. Cazorla, A. Ramirez, M. Valero and E. Fernandez, "Dynamically Controlled Resource Allocation in SMT Processors," in Proceedings of the 37th International Symposium on Microarchitecture, Dec. 2004.

[4] J. L. Lo, S. S. Parekh, S. J. Eggers, H. M. Levy and D. M. Tullsen, "Software-Directed Register Deallocation for Simultaneous Multithreading Processors," IEEE Transactions on Parallel and Distributed Systems, Vol. 10, Issue 9, Sept. 1999.

[5] T. Monreal, A. Gonzalez, M. Valero, J. Gonzalez and V. Vinals, "Dynamic Register Renaming through Virtual-Physical Registers," Journal of Instruction-Level Parallelism, Vol. 2, 2000.

[6] E. Quinones, J. Parcerisa and A. Gonzalez, "Leveraging Register Windows to Reduce Physical Registers to the Bare Minimum," IEEE Transactions on Computers, Vol. 59, No. 12, Dec. 2010.

[7] N. Zhang and W.-M. Lin, "Efficient Physical Register File Allocation in Simultaneous Multi-Threading CPUs," 33rd IEEE International Performance Computing and Communications Conference (IPCCC 2014), Austin, Texas, December 5-7, 2014.

[8] Standard Performance Evaluation Corporation (SPEC) website.

[9] T. Sherwood, E. Perelman, G. Hamerly and B. Calder, "Automatically Characterizing Large Scale Program Behavior," in Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, October 2002.


More information

A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors In Proceedings of the th International Symposium on High Performance Computer Architecture (HPCA), Madrid, February A Low-Complexity, High-Performance Fetch Unit for Simultaneous Multithreading Processors

More information

Runahead Threads to Improve SMT Performance

Runahead Threads to Improve SMT Performance Runahead Threads to Improve SMT Performance Tanausú Ramírez 1, Alex Pajuelo 1, Oliverio J. Santana 2, Mateo Valero 1,3 1 Universitat Politècnica de Catalunya, Spain. {tramirez,mpajuelo,mateo}@ac.upc.edu.

More information

Optimizing SMT Processors for High Single-Thread Performance

Optimizing SMT Processors for High Single-Thread Performance University of Maryland Inistitute for Advanced Computer Studies Technical Report UMIACS-TR-2003-07 Optimizing SMT Processors for High Single-Thread Performance Gautham K. Dorai, Donald Yeung, and Seungryul

More information

Saving Register-File Leakage Power by Monitoring Instruction Sequence in ROB

Saving Register-File Leakage Power by Monitoring Instruction Sequence in ROB Saving Register-File Leakage Power by Monitoring Instruction Sequence in ROB Wann-Yun Shieh * and Hsin-Dar Chen Department of Computer Science and Information Engineering Chang Gung University, Taiwan

More information

AB-Aware: Application Behavior Aware Management of Shared Last Level Caches

AB-Aware: Application Behavior Aware Management of Shared Last Level Caches AB-Aware: Application Behavior Aware Management of Shared Last Level Caches Suhit Pai, Newton Singh and Virendra Singh Computer Architecture and Dependable Systems Laboratory Department of Electrical Engineering

More information

DCache Warn: an I-Fetch Policy to Increase SMT Efficiency

DCache Warn: an I-Fetch Policy to Increase SMT Efficiency DCache Warn: an I-Fetch Policy to Increase SMT Efficiency Francisco J. Cazorla, Alex Ramirez, Mateo Valero Departament d Arquitectura de Computadors Universitat Politècnica de Catalunya Jordi Girona 1-3,

More information

Performance Characterization of SPEC CPU Benchmarks on Intel's Core Microarchitecture based processor

Performance Characterization of SPEC CPU Benchmarks on Intel's Core Microarchitecture based processor Performance Characterization of SPEC CPU Benchmarks on Intel's Core Microarchitecture based processor Sarah Bird ϕ, Aashish Phansalkar ϕ, Lizy K. John ϕ, Alex Mericas α and Rajeev Indukuru α ϕ University

More information

Paired ROBs: A Cost-Effective Reorder Buffer Sharing Strategy for SMT Processors

Paired ROBs: A Cost-Effective Reorder Buffer Sharing Strategy for SMT Processors Paired ROBs: A Cost-Effective Reorder Buffer Sharing Strategy for SMT Processors R. Ubal, J. Sahuquillo, S. Petit and P. López Department of Computing Engineering (DISCA) Universidad Politécnica de Valencia,

More information

High Performance Memory Requests Scheduling Technique for Multicore Processors

High Performance Memory Requests Scheduling Technique for Multicore Processors High Performance Memory Requests Scheduling Technique for Multicore Processors Walid El-Reedy Electronics and Comm. Engineering Cairo University, Cairo, Egypt walid.elreedy@gmail.com Ali A. El-Moursy Electrical

More information

The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory

The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory Lavanya Subramanian* Vivek Seshadri* Arnab Ghosh* Samira Khan*

More information

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need?? Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross

More information

Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero

Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero The Nineteenth International Conference on Parallel Architectures and Compilation Techniques (PACT) 11-15

More information

Techniques for Efficient Processing in Runahead Execution Engines

Techniques for Efficient Processing in Runahead Execution Engines Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt Depment of Electrical and Computer Engineering University of Texas at Austin {onur,hyesoon,patt}@ece.utexas.edu

More information

Beyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji

Beyond ILP. Hemanth M Bharathan Balaji. Hemanth M & Bharathan Balaji Beyond ILP Hemanth M Bharathan Balaji Multiscalar Processors Gurindar S Sohi Scott E Breach T N Vijaykumar Control Flow Graph (CFG) Each node is a basic block in graph CFG divided into a collection of

More information

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Moore s Law Moore, Cramming more components onto integrated circuits, Electronics, 1965. 2 3 Multi-Core Idea:

More information

Improving Data Cache Performance via Address Correlation: An Upper Bound Study

Improving Data Cache Performance via Address Correlation: An Upper Bound Study Improving Data Cache Performance via Address Correlation: An Upper Bound Study Peng-fei Chuang 1, Resit Sendag 2, and David J. Lilja 1 1 Department of Electrical and Computer Engineering Minnesota Supercomputing

More information

Filtered Runahead Execution with a Runahead Buffer

Filtered Runahead Execution with a Runahead Buffer Filtered Runahead Execution with a Runahead Buffer ABSTRACT Milad Hashemi The University of Texas at Austin miladhashemi@utexas.edu Runahead execution dynamically expands the instruction window of an out

More information

UCB CS61C : Machine Structures

UCB CS61C : Machine Structures inst.eecs.berkeley.edu/~cs61c UCB CS61C : Machine Structures Lecture 36 Performance 2010-04-23 Lecturer SOE Dan Garcia How fast is your computer? Every 6 months (Nov/June), the fastest supercomputers in

More information

Dynamic Scheduling. CSE471 Susan Eggers 1

Dynamic Scheduling. CSE471 Susan Eggers 1 Dynamic Scheduling Why go out of style? expensive hardware for the time (actually, still is, relatively) register files grew so less register pressure early RISCs had lower CPIs Why come back? higher chip

More information

Adaptive Cache Memories for SMT Processors

Adaptive Cache Memories for SMT Processors Adaptive Cache Memories for SMT Processors Sonia Lopez, Oscar Garnica, David H. Albonesi, Steven Dropsho, Juan Lanchares and Jose I. Hidalgo Department of Computer Engineering, Rochester Institute of Technology,

More information

Energy Proportional Datacenter Memory. Brian Neel EE6633 Fall 2012

Energy Proportional Datacenter Memory. Brian Neel EE6633 Fall 2012 Energy Proportional Datacenter Memory Brian Neel EE6633 Fall 2012 Outline Background Motivation Related work DRAM properties Designs References Background The Datacenter as a Computer Luiz André Barroso

More information

Simultaneous Multithreading (SMT)

Simultaneous Multithreading (SMT) #1 Lec # 2 Fall 2003 9-10-2003 Simultaneous Multithreading (SMT) An evolutionary processor architecture originally introduced in 1995 by Dean Tullsen at the University of Washington that aims at reducing

More information

Improving Cache Performance using Victim Tag Stores

Improving Cache Performance using Victim Tag Stores Improving Cache Performance using Victim Tag Stores SAFARI Technical Report No. 2011-009 Vivek Seshadri, Onur Mutlu, Todd Mowry, Michael A Kozuch {vseshadr,tcm}@cs.cmu.edu, onur@cmu.edu, michael.a.kozuch@intel.com

More information

Cache Friendliness-aware Management of Shared Last-level Caches for High Performance Multi-core Systems

Cache Friendliness-aware Management of Shared Last-level Caches for High Performance Multi-core Systems 1 Cache Friendliness-aware Management of Shared Last-level Caches for High Performance Multi-core Systems Dimitris Kaseridis, Member, IEEE, Muhammad Faisal Iqbal, Student Member, IEEE and Lizy Kurian John,

More information

NightWatch: Integrating Transparent Cache Pollution Control into Dynamic Memory Allocation Systems

NightWatch: Integrating Transparent Cache Pollution Control into Dynamic Memory Allocation Systems NightWatch: Integrating Transparent Cache Pollution Control into Dynamic Memory Allocation Systems Rentong Guo 1, Xiaofei Liao 1, Hai Jin 1, Jianhui Yue 2, Guang Tan 3 1 Huazhong University of Science

More information

L1-Bandwidth Aware Thread Allocation in Multicore SMT Processors

L1-Bandwidth Aware Thread Allocation in Multicore SMT Processors L1-Bandwidth Aware Thread Allocation in Multicore SMT Processors Josué Feliu, Julio Sahuquillo, Salvador Petit, and José Duato Department of Computer Engineering (DISCA) Universitat Politècnica de València

More information

An Analysis of Microarchitecture Vulnerability to Soft Errors on Simultaneous Multithreaded Architectures

An Analysis of Microarchitecture Vulnerability to Soft Errors on Simultaneous Multithreaded Architectures An Analysis of Microarchitecture Vulnerability to Soft Errors on Simultaneous Multithreaded Architectures Wangyuan Zhang, Xin Fu, Tao Li and José Fortes Department of Electrical and Computer Engineering,

More information

High Performance Memory Requests Scheduling Technique for Multicore Processors

High Performance Memory Requests Scheduling Technique for Multicore Processors High Performance Memory Requests Scheduling Technique for Multicore Processors by Walid Ahmed Mohamed El-Reedy A Thesis Submitted to the Faculty of Engineering at Cairo University In Partial Fulfillment

More information

A Comparison of Capacity Management Schemes for Shared CMP Caches

A Comparison of Capacity Management Schemes for Shared CMP Caches A Comparison of Capacity Management Schemes for Shared CMP Caches Carole-Jean Wu and Margaret Martonosi Princeton University 7 th Annual WDDD 6/22/28 Motivation P P1 P1 Pn L1 L1 L1 L1 Last Level On-Chip

More information

Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems

Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Min Kyu Jeong, Doe Hyun Yoon^, Dam Sunwoo*, Michael Sullivan, Ikhwan Lee, and Mattan Erez The University of Texas at Austin Hewlett-Packard

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Energy Models for DVFS Processors

Energy Models for DVFS Processors Energy Models for DVFS Processors Thomas Rauber 1 Gudula Rünger 2 Michael Schwind 2 Haibin Xu 2 Simon Melzner 1 1) Universität Bayreuth 2) TU Chemnitz 9th Scheduling for Large Scale Systems Workshop July

More information

Improving Cache Performance by Exploi7ng Read- Write Disparity. Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A.

Improving Cache Performance by Exploi7ng Read- Write Disparity. Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A. Improving Cache Performance by Exploi7ng Read- Write Disparity Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A. Jiménez Summary Read misses are more cri?cal than write misses

More information

DEMM: a Dynamic Energy-saving mechanism for Multicore Memories

DEMM: a Dynamic Energy-saving mechanism for Multicore Memories DEMM: a Dynamic Energy-saving mechanism for Multicore Memories Akbar Sharifi, Wei Ding 2, Diana Guttman 3, Hui Zhao 4, Xulong Tang 5, Mahmut Kandemir 5, Chita Das 5 Facebook 2 Qualcomm 3 Intel 4 University

More information

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Moore s Law Moore, Cramming more components onto integrated circuits, Electronics,

More information

15-740/ Computer Architecture Lecture 21: Superscalar Processing. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 21: Superscalar Processing. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 21: Superscalar Processing Prof. Onur Mutlu Carnegie Mellon University Announcements Project Milestone 2 Due November 10 Homework 4 Out today Due November 15

More information

A Content Aware Integer Register File Organization

A Content Aware Integer Register File Organization A Content Aware Integer Register File Organization Rubén González, Adrián Cristal, Daniel Ortega, Alexander Veidenbaum and Mateo Valero Universitat Politècnica de Catalunya HP Labs Barcelona University

More information

Multithreaded Value Prediction

Multithreaded Value Prediction Multithreaded Value Prediction N. Tuck and D.M. Tullesn HPCA-11 2005 CMPE 382/510 Review Presentation Peter Giese 30 November 2005 Outline Motivation Multithreaded & Value Prediction Architectures Single

More information

Multithreaded Processors. Department of Electrical Engineering Stanford University

Multithreaded Processors. Department of Electrical Engineering Stanford University Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread

More information

Lecture 14: Multithreading

Lecture 14: Multithreading CS 152 Computer Architecture and Engineering Lecture 14: Multithreading John Wawrzynek Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~johnw

More information

Near-Threshold Computing: How Close Should We Get?

Near-Threshold Computing: How Close Should We Get? Near-Threshold Computing: How Close Should We Get? Alaa R. Alameldeen Intel Labs Workshop on Near-Threshold Computing June 14, 2014 Overview High-level talk summarizing my architectural perspective on

More information

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019

250P: Computer Systems Architecture. Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 250P: Computer Systems Architecture Lecture 9: Out-of-order execution (continued) Anton Burtsev February, 2019 The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction and instr

More information

Footprint-based Locality Analysis

Footprint-based Locality Analysis Footprint-based Locality Analysis Xiaoya Xiang, Bin Bao, Chen Ding University of Rochester 2011-11-10 Memory Performance On modern computer system, memory performance depends on the active data usage.

More information

Beyond ILP II: SMT and variants. 1 Simultaneous MT: D. Tullsen, S. Eggers, and H. Levy

Beyond ILP II: SMT and variants. 1 Simultaneous MT: D. Tullsen, S. Eggers, and H. Levy EE482: Advanced Computer Organization Lecture #13 Processor Architecture Stanford University Handout Date??? Beyond ILP II: SMT and variants Lecture #13: Wednesday, 10 May 2000 Lecturer: Anamaya Sullery

More information

Bias Scheduling in Heterogeneous Multi-core Architectures

Bias Scheduling in Heterogeneous Multi-core Architectures Bias Scheduling in Heterogeneous Multi-core Architectures David Koufaty Dheeraj Reddy Scott Hahn Intel Labs {david.a.koufaty, dheeraj.reddy, scott.hahn}@intel.com Abstract Heterogeneous architectures that

More information

Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures

Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures Nagi N. Mekhiel Department of Electrical and Computer Engineering Ryerson University, Toronto, Ontario M5B 2K3

More information

Minimalist Open-page: A DRAM Page-mode Scheduling Policy for the Many-core Era

Minimalist Open-page: A DRAM Page-mode Scheduling Policy for the Many-core Era Minimalist Open-page: A DRAM Page-mode Scheduling Policy for the Many-core Era Dimitris Kaseridis Electrical and Computer Engineering The University of Texas at Austin Austin, TX, USA kaseridis@mail.utexas.edu

More information

Motivation. Banked Register File for SMT Processors. Distributed Architecture. Centralized Architecture

Motivation. Banked Register File for SMT Processors. Distributed Architecture. Centralized Architecture Motivation Banked Register File for SMT Processors Jessica H. Tseng and Krste Asanović MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA 02139, USA BARC2004 Increasing demand on

More information

Speculative Execution for Hiding Memory Latency

Speculative Execution for Hiding Memory Latency Speculative Execution for Hiding Memory Latency Alex Pajuelo, Antonio Gonzalez and Mateo Valero Departament d Arquitectura de Computadors Universitat Politècnica de Catalunya Barcelona-Spain {mpajuelo,

More information

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1)

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1) Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized for most of the time by doubling

More information

Mapping of Applications to Heterogeneous Multi-cores Based on Micro-architecture Independent Characteristics

Mapping of Applications to Heterogeneous Multi-cores Based on Micro-architecture Independent Characteristics Mapping of Applications to Heterogeneous Multi-cores Based on Micro-architecture Independent Characteristics Jian Chen, Nidhi Nayyar and Lizy K. John Department of Electrical and Computer Engineering The

More information

Simultaneous Multithreading (SMT)

Simultaneous Multithreading (SMT) Simultaneous Multithreading (SMT) An evolutionary processor architecture originally introduced in 1996 by Dean Tullsen at the University of Washington that aims at reducing resource waste in wide issue

More information

Lecture: SMT, Cache Hierarchies. Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1)

Lecture: SMT, Cache Hierarchies. Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) Lecture: SMT, Cache Hierarchies Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized

More information

Lightweight Memory Tracing

Lightweight Memory Tracing Lightweight Memory Tracing Mathias Payer*, Enrico Kravina, Thomas Gross Department of Computer Science ETH Zürich, Switzerland * now at UC Berkeley Memory Tracing via Memlets Execute code (memlets) for

More information

Understanding The Effects of Wrong-path Memory References on Processor Performance

Understanding The Effects of Wrong-path Memory References on Processor Performance Understanding The Effects of Wrong-path Memory References on Processor Performance Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt The University of Texas at Austin 2 Motivation Processors spend

More information

Threaded Multiple Path Execution

Threaded Multiple Path Execution Threaded Multiple Path Execution Steven Wallace Brad Calder Dean M. Tullsen Department of Computer Science and Engineering University of California, San Diego fswallace,calder,tullseng@cs.ucsd.edu Abstract

More information

Perceptron Learning for Reuse Prediction

Perceptron Learning for Reuse Prediction Perceptron Learning for Reuse Prediction Elvira Teran Zhe Wang Daniel A. Jiménez Texas A&M University Intel Labs {eteran,djimenez}@tamu.edu zhe2.wang@intel.com Abstract The disparity between last-level

More information

Handout 2 ILP: Part B

Handout 2 ILP: Part B Handout 2 ILP: Part B Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism Loop unrolling by compiler to increase ILP Branch prediction to increase ILP

More information

Dynamic Instruction Scheduling For Microprocessors Having Out Of Order Execution

Dynamic Instruction Scheduling For Microprocessors Having Out Of Order Execution Dynamic Instruction Scheduling For Microprocessors Having Out Of Order Execution Suresh Kumar, Vishal Gupta *, Vivek Kumar Tamta Department of Computer Science, G. B. Pant Engineering College, Pauri, Uttarakhand,

More information

CS A Large, Fast Instruction Window for Tolerating. Cache Misses 1. Tong Li Jinson Koppanalil Alvin R. Lebeck. Department of Computer Science

CS A Large, Fast Instruction Window for Tolerating. Cache Misses 1. Tong Li Jinson Koppanalil Alvin R. Lebeck. Department of Computer Science CS 2002 03 A Large, Fast Instruction Window for Tolerating Cache Misses 1 Tong Li Jinson Koppanalil Alvin R. Lebeck Jaidev Patwardhan Eric Rotenberg Department of Computer Science Duke University Durham,

More information

A Hybrid Adaptive Feedback Based Prefetcher

A Hybrid Adaptive Feedback Based Prefetcher A Feedback Based Prefetcher Santhosh Verma, David M. Koppelman and Lu Peng Department of Electrical and Computer Engineering Louisiana State University, Baton Rouge, LA 78 sverma@lsu.edu, koppel@ece.lsu.edu,

More information

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 5)

Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 5) Chapter 3 Instruction-Level Parallelism and its Exploitation (Part 5) ILP vs. Parallel Computers Dynamic Scheduling (Section 3.4, 3.5) Dynamic Branch Prediction (Section 3.3, 3.9, and Appendix C) Hardware

More information

Boost Sequential Program Performance Using A Virtual Large. Instruction Window on Chip Multicore Processor

Boost Sequential Program Performance Using A Virtual Large. Instruction Window on Chip Multicore Processor Boost Sequential Program Performance Using A Virtual Large Instruction Window on Chip Multicore Processor Liqiang He Inner Mongolia University Huhhot, Inner Mongolia 010021 P.R.China liqiang@imu.edu.cn

More information

Adaptive Cache Partitioning on a Composite Core

Adaptive Cache Partitioning on a Composite Core Adaptive Cache Partitioning on a Composite Core Jiecao Yu, Andrew Lukefahr, Shruti Padmanabha, Reetuparna Das, Scott Mahlke Computer Engineering Lab University of Michigan, Ann Arbor, MI {jiecaoyu, lukefahr,

More information

Lecture-13 (ROB and Multi-threading) CS422-Spring

Lecture-13 (ROB and Multi-threading) CS422-Spring Lecture-13 (ROB and Multi-threading) CS422-Spring 2018 Biswa@CSE-IITK Cycle 62 (Scoreboard) vs 57 in Tomasulo Instruction status: Read Exec Write Exec Write Instruction j k Issue Oper Comp Result Issue

More information

MPEG-2 Video Decompression on Simultaneous Multithreaded Multimedia Processors

MPEG-2 Video Decompression on Simultaneous Multithreaded Multimedia Processors MPEG- Video Decompression on Simultaneous Multithreaded Multimedia Processors Heiko Oehring Ulrich Sigmund Theo Ungerer VIONA Development GmbH Karlstr. 7 D-733 Karlsruhe, Germany uli@viona.de VIONA Development

More information

CS 152 Computer Architecture and Engineering. Lecture 18: Multithreading

CS 152 Computer Architecture and Engineering. Lecture 18: Multithreading CS 152 Computer Architecture and Engineering Lecture 18: Multithreading Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~krste

More information

Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative

Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative Emre Kültürsay *, Mahmut Kandemir *, Anand Sivasubramaniam *, and Onur Mutlu * Pennsylvania State University Carnegie Mellon University

More information

The Impact of Resource Sharing Control on the Design of Multicore Processors

The Impact of Resource Sharing Control on the Design of Multicore Processors The Impact of Resource Sharing Control on the Design of Multicore Processors Chen Liu 1 and Jean-Luc Gaudiot 2 1 Department of Electrical and Computer Engineering, Florida International University, 10555

More information