ECE404 Term Project Sentinel Thread


Alok Garg
Department of Electrical and Computer Engineering, University of Rochester

1 Introduction

Performance-degrading events such as branch mispredictions and cache misses still limit the peak performance of current processors, despite aggressive exploitation of instruction-level parallelism. This is because conventional branch prediction and cache prefetching techniques are ineffective for a narrow range of hard-to-predict branches and long-latency instructions. These few hard-to-predict instructions are responsible for the low performance of aggressive out-of-order microprocessors. A few pre-computational techniques, such as helper-threaded processors [3-7, 11, 12] and slipstream processors [9, 10], selectively pre-compute the outcomes of this small fraction of instructions much earlier in time. The outcomes are then used to predict these instructions accurately when they actually execute. Although each of these techniques has its pros and cons, none of them fully exploits the potential of pre-computation. In addition, they are complex enough to be impractical for commercial processors. In this project we try to achieve higher potential gains by using simpler and more optimized pre-computation techniques.

1.1 Potential of pre-computational techniques

Figure 1 shows the potential of pre-computational techniques. Very high speedup can be achieved with a perfect cache and perfect branch prediction. Perfect branch prediction gives better speedup for integer benchmarks, while a perfect cache gives better speedup for floating-point benchmarks. Pre-computational techniques try to predict these branches accurately and to prefetch cache misses well in advance, thus mitigating their impact on performance.

Figure 1. Speedup achieved for perfect cache, perfect branch, and if both cache and branch are perfect.

Table 1. Simulator Parameters

Parameter                        Value
Branch predictor                 Combination of bimodal and 2-level
Bimodal predictor entries        2048
Level 1 table entries            1024
Level 2 table entries            4096
BTB entries, associativity       2048, 2
Branch mispredict penalty        12 cycles
Fetch width                      8
Fetch queue size                 128
Integer issue queue size         80
FP issue queue size              80
Load/Store queue size            128
Issue width                      16
Dispatch and commit width        8
Integer physical registers       128
FP physical registers            128
Reorder buffer size              256
Integer FUs                      4
FP FUs                           4
L1 I-Cache                       32KB, 2-way
L1 D-Cache                       64KB, 2-way
L2 Cache                         2MB, 8-way
L1 cache latency                 2 cycles
L2 cache latency                 25 cycles
Memory latency                   160 cycles

1.2 Approach

Our approach is similar to the slipstream processor proposed by Sundaramoorthy et al. [10]. In a slipstream processor, a separate thread called the advance stream (A-stream) is used to pre-compute the outcomes of hard-to-predict instructions. These outcomes are then passed to the primary thread (R-stream), which executes the full dynamic instruction stream. The A-stream is only a subset of the original dynamic stream. These skeleton instructions are enough for the A-stream to make correct forward progress. Because it executes fewer instructions, the A-stream may be speculative. The A-stream therefore mimics the original program and tries to remain on the correct path most of the time. This scheme is very pessimistic in instruction removal, so the A-stream does not speed up enough to mitigate the effect of all the performance-degrading events.

We try to improve on this approach by being more aggressive in instruction removal. Since the A-stream is already speculative, we may choose not to execute any load or store instructions. Instead, only the instructions that lead to the computation of hard-to-predict instructions need to be executed. We believe that removing the unnecessary instructions need not divert the A-stream onto the wrong path. This is possible when both threads share the level-one cache. The shared cache feeds the A-stream with data that is not too stale and keeps it on the correct path. Since the A-stream does not commit any store into the shared cache, the state of the cache also remains consistent.

2 Skeleton Creation

We used Alto [8] to remove ineffectual instructions from the static application code. As an initial step, we kept all the instructions required to compute the outcome of all the branches. Table 2 shows the results of instruction removal on the Spec2000 benchmarks. As Table 2 shows, fewer instructions are removed from integer benchmarks than from floating-point benchmarks. This is because integer applications contain, on average, more branches than floating-point applications.

2.1 Experiments on Skeleton

To analyze the impact and potential of the created skeleton, we executed the skeleton independently on the SimpleScalar simulator [1, 2]. We compared the speedup obtained for the skeleton with respect to the original instruction stream.
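The selection rule from Section 2, keeping every instruction needed to compute a branch outcome, amounts to a backward slice from the branches. The following is only a minimal sketch of that idea on a straight-line block, tracking register dependences alone; it is not how Alto is implemented. The `Instr` record, the `build_skeleton` function, and the example instructions are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record (not Alto's representation)."""
    op: str
    dst: Optional[str]            # destination register, None for stores/branches
    srcs: Tuple[str, ...]         # source registers
    is_branch: bool = False


def build_skeleton(block: List[Instr]) -> List[int]:
    """Return the indices of instructions that feed some branch in the block.

    Assumption: straight-line code and register dependences only; memory
    dependences and control flow between blocks are ignored.
    """
    live = set()   # registers whose values a kept instruction still needs
    keep = []
    for i in range(len(block) - 1, -1, -1):      # walk backward from the end
        ins = block[i]
        if ins.is_branch or (ins.dst is not None and ins.dst in live):
            keep.append(i)
            if ins.dst is not None:
                live.discard(ins.dst)            # this definition satisfies the need
            live.update(ins.srcs)                # its inputs are now needed
    return sorted(keep)


# Hypothetical five-instruction block ending in a branch.
example_block = [
    Instr("add",   "r1", ("r2", "r3")),          # feeds the compare
    Instr("mul",   "r4", ("r5", "r6")),          # only feeds the store
    Instr("store", None, ("r4", "r7")),          # never feeds a branch
    Instr("cmplt", "r8", ("r1", "r9")),          # feeds the branch
    Instr("bne",   None, ("r8",), is_branch=True),
]
kept = build_skeleton(example_block)
removed_pct = 100.0 * (len(example_block) - len(kept)) / len(example_block)
print(kept, removed_pct)
```

In this toy block the multiply and the store are dropped because nothing on the path to the branch reads their results, which mirrors why code with fewer branches leaves more instructions removable.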

Table 2. Results of Instruction Removal on Spec2000 Benchmarks.

Columns: Application, Total Static Instructions, Static Instructions Removed, % Instructions Removed
Applications: bzip, gap, gcc, gzip, mcf, twolf, vpr, applu, art, equake, galgel, lucas, mesa, mgrid, swim

This gives us the potential of our approach. To ensure that the skeleton always remains on the correct path, we modified the simulator to emulate all the instructions correctly but to dispatch only the skeleton instructions. This means that the skeleton instruction stream is virtually filled with NOPs. Figure 2 shows the results of our experiments. As expected, the skeleton achieves an appreciable speedup for integer benchmarks and a very high speedup for floating-point benchmarks. The speedup obtained for the integer applications shows the potential of this scheme: at best we would be able to achieve, on average, more than 25% speedup across all the benchmarks. More aggressive pruning of the skeleton may give better results.

3 Simulator Implementation

Complete and faithful simulation of a slipstream processor requires simulator modifications. Slipstream support requires two threads to execute on two separate contexts of a chip multiprocessor (CMP). One context is used by the primary thread, called the R-stream. The other context is used by the speculative lead thread, called the A-stream. A CMP simulator is used for the simulations. The modifications required in the CMP simulator are summarized below:

- Sharing of the L1 and L2 caches is required between the CMP processors.
- Sharing of the branch predictor and branch target buffer is required between the CMP processors.
- All the caches must be value-driven to feed A-stream memory references.
- The L2 cache miss engine needs to access the memory image of the R-stream (controlled by the emulation engine of the R-stream) to service memory accesses in the case of a cache miss.
- A communication mechanism between the R-stream and the A-stream is required.
- Modifications to the A-stream emulation engine are also required.

We are still working on the simulator changes.
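The communication mechanism between the two streams is listed above as a requirement but is not specified further. As a rough sketch under stated assumptions, one simple form it could take is a bounded FIFO of branch outcomes: the A-stream pushes (PC, taken) pairs, and the R-stream uses the head entry as its prediction when the PC matches. The `OutcomeQueue` class, its capacity parameter, and the flush-on-mismatch policy are assumptions made for illustration, not part of the simulator described here.

```python
from collections import deque


class OutcomeQueue:
    """Hypothetical bounded FIFO carrying branch outcomes from A-stream to R-stream."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._fifo = deque()

    def push(self, pc: int, taken: bool) -> bool:
        """A-stream side. Returns False when full, meaning the A-stream must wait."""
        if len(self._fifo) >= self.capacity:
            return False
        self._fifo.append((pc, taken))
        return True

    def predict(self, pc: int, fallback: bool) -> bool:
        """R-stream side. Use the A-stream outcome if the head entry is for this
        branch; otherwise fall back to the R-stream's own prediction.

        Assumption: a PC mismatch means the A-stream has left the correct path,
        so the remaining entries are discarded.
        """
        if not self._fifo:
            return fallback
        head_pc, taken = self._fifo[0]
        if head_pc == pc:
            self._fifo.popleft()
            return taken
        self._fifo.clear()
        return fallback

    def __len__(self) -> int:
        return len(self._fifo)


# Usage: the A-stream runs ahead and queues two outcomes; a third does not fit.
q = OutcomeQueue(capacity=2)
pushed = [q.push(0x40, True), q.push(0x48, False), q.push(0x50, True)]
first = q.predict(0x40, fallback=False)    # head matches, A-stream outcome used
second = q.predict(0x99, fallback=True)    # mismatch, queue flushed, fallback used
print(pushed, first, second, len(q))
```

The bounded capacity limits how far the A-stream can run ahead of the R-stream, and the fallback path keeps the R-stream correct even when the speculative A-stream supplies nothing useful.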
4 Conclusion

The experiments performed in this work indicate a high potential for the slipstream processor. Simulator modifications are in progress so that we can understand the complete effect of the A-stream on the performance of the overall system. Once the modifications are complete, we can further prune and optimize the skeleton for better performance. Based on the behavior of the slipstream processor, a hardware mechanism may be proposed next.

Figure 2. Speedup obtained for skeleton.

References

[1] T. Austin, E. Larson, and D. Ernst. SimpleScalar: An Infrastructure for Computer System Modeling. IEEE Computer, 39(2):59-67, Feb.

[2] D. Burger and T. Austin. The SimpleScalar Tool Set, Version 2.0. Technical report 1342, Computer Sciences Department, University of Wisconsin-Madison, June.

[3] R. Chappell, J. Stark, S. Kim, S. Reinhardt, and Y. Patt. Simultaneous Subordinate Microthreading (SSMT). In International Symposium on Computer Architecture, Atlanta, Georgia, May.

[4] R. Chappell, F. Tseng, A. Yoaz, and Y. Patt. Difficult-path branch prediction using subordinate microthreads. In International Symposium on Computer Architecture, Anchorage, Alaska, May.

[5] R. Chappell, F. Tseng, A. Yoaz, and Y. Patt. Microarchitectural Support for Precomputation Microthreads. In International Symposium on Microarchitecture, pages 74-84, Istanbul, Turkey, Nov.

[6] J. Collins, D. Tullsen, H. Wang, and J. Shen. Dynamic Speculative Precomputation. In International Symposium on Microarchitecture, Austin, Texas, Dec.

[7] J. Collins, H. Wang, D. Tullsen, C. Hughes, Y. Lee, D. Lavery, and J. Shen. Speculative Precomputation: Long-range Prefetching of Delinquent Loads. In International Symposium on Computer Architecture, pages 14-25, Göteborg, Sweden, June-July.

[8] R. Muth, S. Debray, S. Watterson, and K. De Bosschere. alto: A Link-Time Optimizer for the Compaq Alpha. Software: Practice and Experience, 31(1):67-101, Jan.

[9] Z. Purser, K. Sundaramoorthy, and E. Rotenberg. A study of slipstream processors. In International Symposium on Microarchitecture, Monterey, California, Dec.

[10] K. Sundaramoorthy, Z. Purser, and E. Rotenberg. Slipstream Processors: Improving both Performance and Fault Tolerance. In International Conference on Architectural Support for Programming Languages and Operating Systems, Cambridge, Massachusetts, Nov.

[11] C. Zilles and G. Sohi. Execution-Based Prediction Using Speculative Slices. In International Symposium on Computer Architecture, pages 2-13, Göteborg, Sweden, June-July.

[12] C. B. Zilles and G. S. Sohi. Understanding the backward slices of performance degrading instructions. In International Symposium on Computer Architecture, Vancouver, Canada, June.
