Application-Specific Autonomic Cache Tuning for General Purpose GPUs


Sam Gianelli, Edward Richter, Diego Jimenez, Hugo Valdez, Tosiron Adegbija, Ali Akoglu
Department of Electrical and Computer Engineering, University of Arizona, Tucson, AZ, USA

Abstract: Cache tuning has been widely studied in CPUs and shown to achieve substantial energy savings with minimal performance degradation. However, cache tuning has yet to be explored in General Purpose Graphics Processing Units (GPGPUs), which have emerged as efficient alternatives for general purpose high-performance computing. In this paper, we explore autonomic cache tuning for GPGPUs, where the cache configurations (cache size, line size, and associativity) can be dynamically specialized/tuned to the executing application's resource requirements. We investigate cache tuning for both the level one (L1) and level two (L2) caches to derive insights into which cache level offers maximum optimization benefits. To illustrate the optimization potential of autonomic cache tuning in GPGPUs, we implement a tuning heuristic that can dynamically determine each application's best L1 data cache configuration during runtime. Our results show that application-specific autonomic L1 data cache tuning can reduce the average energy delay product (EDP) and improve the performance by 16.5% and 18.8%, respectively, as compared to a static cache.

Index Terms: Graphics processing unit, GPGPU, GPU cache management, configurable caches, high performance computing, cache tuning, cache memories, low-power design, low-power embedded systems, adaptable hardware

I. Introduction and Motivation

A processor's memory hierarchy contributes substantially to the processor's overall energy consumption. In some cases, such as embedded systems CPUs, the memory hierarchy (specifically, the caches) can contribute more than 40% of the system's total energy [19], [20]. As a result, much optimization research has focused on reducing the cache's energy consumption without degrading performance.

The cache is increasingly becoming a source of energy overhead in general purpose graphics processing units (GPGPUs). GPGPUs used to rely solely on massive parallelism and shared memory to hide memory latency. However, due to the proliferation of applications with irregular memory accesses and large amounts of data, modern GPGPUs rely on caches to help mitigate the intrinsic delay incurred from memory accesses [25]. Unfortunately, GPGPU cache management and design has not evolved as rapidly as CPU caches, due to decreased cache capacity per thread and cache line lifetime (the amount of time a line goes without replacement in a cache) [23]. Because of the high throughput of GPGPUs and the inadequate performance of current GPGPU caches, a very large number of misses can occur in a very small amount of time, leading to overall performance degradation and increased energy consumption.

GPU cache management becomes even more important in a High-Performance Computing (HPC) system. 10.4% of the top 500 HPC systems utilize a GPU model from the NVIDIA Tesla family as an accelerator [24]. Thus, the HPC system's overall energy efficiency and performance is highly dependent on the GPU's energy efficiency and performance. As GPGPU applications' data continue to increase, storing all the data in the shared memory becomes infeasible. Thus, the GPGPU's cache becomes more important, as it reduces the memory access time for frequently accessed data.
Most current GPGPU cache management research focuses on energy reduction and performance optimization through efficient cache bypassing and thread throttling. Cache bypassing [6], [7], [13], [29] mitigates thrashing in the L1 data caches by artificially extending the cache line lifetime. Bypassing schemes allow frequently reused data blocks to remain in the cache, while data requests for low-reuse blocks are made to bypass the cache. However, as the number of bypass requests increases, on-chip resource congestion increases, eventually causing the memory pipeline to stall. Thread throttling [6], [13], [29] increases the hit rate of a system's cache(s) while decreasing potential resource congestion by dynamically or statically limiting the number of warps/threads that can be run on a given streaming multiprocessor (SM). Decreasing the number of threads increases the cache capacity per thread, which in effect increases the amount of locality in the cache. As a result, however, multiple GPGPU cores are not utilized to their full potential.

In this paper, we explore the potential of leveraging autonomic cache tuning for GPGPU cache management. Cache tuning, otherwise known as cache optimization, uses a configurable cache [28], whose total size, associativity, and line size can be dynamically adjusted during runtime to specialize the cache's configuration to the executing application's needs.
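To make the tunable parameters concrete, the sketch below models one point in such a design space and enumerates the full space a tuner would search; the class and helper names are illustrative assumptions and are not part of the configurable cache design in [28].

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class CacheConfig:
    """One point in the cache design space (illustrative names)."""
    num_sets: int
    line_size: int      # bytes
    associativity: int  # ways

    @property
    def total_size(self) -> int:
        # Effective capacity; shutting down ways shrinks the cache,
        # concatenating physical blocks yields larger logical lines.
        return self.num_sets * self.line_size * self.associativity

def design_space(sets, line_sizes, assocs):
    """Enumerate every tunable configuration (the tuner's search space)."""
    return [CacheConfig(s, l, a) for s, l, a in product(sets, line_sizes, assocs)]

# Example: an L1D-like space of 3 x 2 x 3 = 18 configurations.
l1d_space = design_space(sets=(16, 32, 64), line_sizes=(128, 256), assocs=(1, 2, 4))
print(len(l1d_space), "configurations,", max(c.total_size for c in l1d_space) // 1024, "KB max")
```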

Cache tuning exploits the fact that different applications have different configuration requirements for energy efficiency and optimal performance [9]. Thus, based on each application's execution characteristics (e.g., cache miss rates, instructions per cycle, branch mispredicts, etc.), the cache can be tuned to minimize energy consumption without degrading performance. While cache tuning has been extensively studied in CPUs [12], [20], [22], this paper is the first, to the best of our knowledge, to explore cache tuning in the context of GPGPUs. Our goal in this study is to derive insights to enable the implementation of autonomous caches in emerging GPGPUs. To this end, our contributions are as follows:
- We propose that autonomic cache tuning is a viable approach to cache management in GPGPUs, and extensively analyze the potential for improving the energy efficiency and performance of GPGPU L1 and L2 data caches.
- We analyze autonomic cache tuning in L1 and L2 data caches to derive insight into which cache level offers the most benefits for optimal performance and energy efficiency.
- We illustrate the benefits of runtime autonomic cache tuning using a tuning heuristic that dynamically determines executing GPGPU applications' L1 data cache configurations.
Through detailed experiments using GPGPU-Sim [4], a common GPGPU architectural simulator, we show that an autonomously tuned configurable L1 data cache can reduce the average energy delay product and improve the performance (with respect to IPC) by 16.5% and 18.8%, respectively, as compared to a static non-configurable cache. Our results also reveal that tuning the L2 data cache can reduce the average EDP and improve the performance by 18.1% and 13.4%, respectively, as compared to a static L2 cache. Finally, our results reveal a strong correlation between GPGPU performance and L1 data cache reservation failures [7], and provide a solid foundation for future work enabling autonomous caches in GPGPUs.

The rest of the paper is organized as follows. In Section II we provide an overview of the related work on GPU cache management strategies, configurable caches, and cache tuning. We describe our methodology, simulation environment, and experimental setup in Section III. In Section IV, we present the results and analysis generated by exhaustively exploring the cache design space to evaluate the impact of different configurations on application execution. We establish the methodology and present the results from our autonomous cache tuning experiment in Section V. Finally, in Section VI, we present our conclusions and future work.

II. Background and Related Work

Due to the poor performance of caches in GPGPUs, there has been extensive research on GPU cache management schemes. This section provides an overview of current GPU cache management schemes and background on configurable caches and cache tuning, which we leverage in this work.

A. GPU Cache Management

The main reason why caches are less effective for GPUs than for CPUs is that GPUs' high throughput is mismatched with the architecture of general purpose caches [13]. Our work represents a step in the direction of making caches more viable for GPUs, and is orthogonal to current GPU cache management techniques. We review three main research directions that currently exist for optimizing caches for GPUs. The first involves schemes forcing low-reuse memory accesses to bypass the cache in order not to evict high-reuse cache blocks.
Chen et al. [6] proposed a scheme that forces memory requests to bypass the L1 data cache based on reuse distance theory. This scheme also dynamically throttles the number of warps to prevent occupancy of too many resources when the cache is being bypassed frequently. Dai et al. [7] introduced a model to calculate the ideal number of requests to bypass the cache as a function of the estimated number of reservation failures. Li et al. [18] observed that the number of times cache blocks are reused is very small in certain GPGPU applications and proposed a scheme that bypasses the cache if a memory request is associated with a block that is not going to be reused.

The second popular research area involves optimizing cache indexing to efficiently populate the cache. Kim et al. [14] studied the effects of multiple indexing techniques on performance and energy consumption within GPU caches. These indexing functions include static schemes such as XORing different bits of the instruction address and polynomial modulus mapping. Additionally, they experimented with one dynamic indexing function, which uses different bits of the instruction address to index into the cache. Kim et al. [15] proposed using polynomial modulus mapping as a means of interleaving set indexes within the cache.

A third approach optimizes cache management at the warp level in order to best account for each application's specific characteristics. Khairy et al. [13] proposed a warp throttling scheme that throttles the number of active warps based on the current misses per kilo instructions (MPKI) in order to reduce the contention seen in the L1 data cache. Zhao and Wang [29] observed that irregular memory accesses are the most detrimental aspect of an application due to memory divergence and a loss of locality. To combat this, they devised a scheme, present at both the memory level and the instruction level, that determines the degree of irregularity of memory requests across the threads within a warp. If those memory requests surpass a certain threshold, the entire warp is forced to bypass the L1 data cache.
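As a rough illustration of the warp-level policies described above, the sketch below pairs an MPKI-driven warp-throttling decision with a divergence-based bypass test; the thresholds, counters, and function names are hypothetical and are not drawn from [13] or [29].

```python
def allowed_active_warps(mpki: float, max_warps: int,
                         mpki_threshold: float = 30.0,
                         min_warps: int = 8) -> int:
    """Throttle active warps when cache contention (MPKI) is high.
    (Hypothetical policy in the spirit of MPKI-based throttling.)"""
    if mpki <= mpki_threshold:
        return max_warps                      # low contention: full occupancy
    # Scale the warp count down as MPKI grows beyond the threshold.
    scale = mpki_threshold / mpki
    return max(min_warps, int(max_warps * scale))

def warp_should_bypass_l1d(unique_lines_touched: int,
                           divergence_threshold: int = 16) -> bool:
    """Bypass the L1D for a warp whose memory accesses diverge across
    many distinct cache lines (hypothetical divergence test)."""
    return unique_lines_touched > divergence_threshold

# Example usage with made-up runtime statistics.
print(allowed_active_warps(mpki=60.0, max_warps=48))    # -> 24
print(warp_should_bypass_l1d(unique_lines_touched=32))  # -> True
```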

B. Configurable Caches

A configurable cache [28] is a multi-banked cache that allows the cache size to be configured to specialize the cache's configuration to executing applications. Given a physical cache, where each cache way is implemented as a different bank, the ways can be shut down to reduce the cache size or concatenated to configure the associativity. To configure the block size, multiple physical blocks can be fetched and concatenated to logically form larger block sizes. This configurability can be achieved using small bit-width configuration registers, and augmenting a state-of-the-art cache to support configurability results in negligible design, power, and area overheads. In addition, incorporating configurability into caches has no impact on the critical path, thus making configurable caches a viable option for low-overhead cache optimization. We direct the reader to [28] for details on the configurable cache's circuitry and design.

C. Cache Tuning

Cache tuning determines the cache configurations that best satisfy an application's execution requirements. Due to potentially large design spaces (possible combinations of configurations), exhaustively exploring the full cache design space is typically unrealistic. Thus, various cache tuning heuristics and algorithms [3], [11], [22] have been developed to prune the design space, in order to reduce cache tuning overheads (the energy and time spent during tuning). Zhang et al. [27] presented a heuristic that automatically tunes the cache in an embedded system to a pseudo-optimal configuration when considering energy, execution time, or a combination of the two, depending on the priority of the system. Gordon-Ross et al. [10] expanded upon this heuristic and introduced the Two-level Cache Tuner (TCaT), a method to automatically tune two levels of cache in an embedded system. Cache tuning heuristics/algorithms are usually implemented using a software or hardware tuning mechanism, known as the cache tuner. Adegbija et al. [2] developed low-overhead cache tuners for multicore systems, and analyzed different cache tuner architectures for effectively tuning caches in multicore embedded systems. The authors analyzed the energy and hardware footprint of the different architectures in order to identify which architectures resulted in minimum overheads. This work illustrated the extensibility and scalability of low-overhead cache tuning to large multicore systems, and provides a foundation for the work proposed herein.

III. Methodology

This section describes our methodology for evaluating the potential of autonomic cache tuning in GPGPU caches. We first describe our simulation environment, and thereafter present details of our experimental setup.

A. Simulation Environment

To evaluate the effect of varying cache configurations within a GPGPU, we used the GPGPU-Sim simulator [4] to emulate a system based on the NVIDIA Fermi GTX480 [26]. Table I depicts the static base configuration parameters used for our evaluations.

TABLE I: Simulated base configurations
  SM config:         15 SMs, width = 32, 700 MHz, 48 warps/SM, 1536 threads
  Cache hashing:     Linear set function
  Shared memory/SM:  6 kB
  Warp scheduling:   2 GTO (greedy-then-oldest) warp schedulers/SM
  Memory model:      6 GDDR5 memory controllers, FR-FCFS scheduling, 924 MHz;
                     GDDR5 timing tCL = 12, tRP = 12, tRC = 40, tRAS = 28, tRCD = 12, tRRD = 6

We used the GPUWattch simulator [17] to collect the GPU's average power for each application. To take into account both energy and delay in the evaluations, we used the energy delay product (EDP) and instructions per cycle (IPC) as our evaluation metrics. We calculated the EDP as follows:

EDP = system power * (total application cycles / system frequency)^2
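For concreteness, the short sketch below computes the EDP exactly as defined above from one benchmark's simulated statistics; the function and variable names are illustrative, and the example numbers are placeholders rather than measured results.

```python
def energy_delay_product(avg_power_w: float, total_cycles: int, freq_hz: float) -> float:
    """EDP = power * delay^2, with delay = total cycles / clock frequency."""
    delay_s = total_cycles / freq_hz
    return avg_power_w * delay_s ** 2

# Placeholder example: 95 W average power, 2.1e9 cycles at 700 MHz.
edp = energy_delay_product(avg_power_w=95.0, total_cycles=2_100_000_000, freq_hz=700e6)
print(f"delay = {2_100_000_000 / 700e6:.2f} s, EDP = {edp:.2f} W*s^2")
```

Because the EDP multiplies power by the square of the delay, it penalizes slowdowns more heavily than a plain energy metric, which is why the tuner can use it to balance power savings against performance loss.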
We modeled the GTX480 GPU because its memory hierarchy and associated performance sufficiently represent state-of-the-art GPUs [8], [21]. Therefore, analysis and conclusions made through experimentation on the GTX480 can easily be extended to other GPU models. The GTX480 memory hierarchy has a dedicated L1 data (L1D) cache for every SM, and a unified L2 data (L2D) cache that is split into multiple partitions used by every SM in the GPGPU. Each simulation requires a configuration file in which different system parameters can be manipulated, including the number of sets, the associativity, and the block size of both the L1D and L2D caches, and the number of memory partitions for the unified L2D cache. In conjunction with the configuration information, the actual benchmark/application is input into the simulator as well. GPGPU-Sim then collects the IPC, the total number of instructions, and the reservation failures; these statistics are also used to calculate the execution time of the program. Subsequently, the average power statistic is generated after the GPGPU-Sim simulation using GPUWattch. We then collated and analyzed the execution and power statistics of the different cache configurations for the benchmarks to determine the impact of cache tuning in GPGPUs.

TABLE II: Ranges of the L1D parameter values
  Parameter            L1D Range     L1D Base Value
  Number of sets       16, 32, 64
  Associativity        1, 2, 4       4
  Line size (bytes)    128, 256

TABLE III: Ranges of the L2D parameter values
  Parameter              L2D Range      L2D Base Value
  Number of sets         64, 128, 256
  Associativity          8, 16, 32
  Line size (bytes)      128, 256
  Number of partitions   6              6

Fig. 1: Flowchart describing the exhaustive search, simulation environment, and setup.

B. Experimental Setup

To investigate the impact of autonomic cache tuning on GPGPU energy efficiency and performance, we first conducted an exhaustive search of the L1D and L2D cache configurations and evaluated the improvements achieved by the optimal cache configurations (with respect to power consumption, IPC, and EDP) as compared to the base configuration. Tables II and III depict the L1D and L2D cache configuration design spaces, respectively. The design space comprises 18 L1D cache configurations, with the L2D cache set to the base configuration, and 18 L2D configurations, with the L1D cache set to the base configuration, resulting in 36 total configurations. We determined the base configurations using the most common values found in the literature for the L1D ([13], [7], [29], [14]) and L2D ([13], [7], [29], [18]) caches. We assume that the caches are configured as described in Section II-B. For future work, we plan to investigate the interleaved exploration of both the L1D and L2D cache configurations, which will exponentially increase the design space.

The range for each configuration parameter is chosen to explore cache sizes larger than the base configuration, since increasing the cache size has been shown to improve performance [6]. The L1D cache associativities are smaller than the L2D cache associativities, because the L2D cache is much larger than the L1D cache. We observed that the L1D and L2D caches yield diminishing returns at associativities larger than 4 and 32, respectively.

Figure 1 depicts the sequence of steps we used to perform the exhaustive search of the L1D and L2D cache configurations. We run every benchmark using GPGPU-Sim in order to generate the performance statistics. GPGPU-Sim then outputs a file used as an input to GPUWattch in order to simulate the power statistics of the GPGPU. Then, the configuration changes to the next configuration within the configuration space. This process continues until all cache configurations are explored for every benchmark.

We tuned both the L1D and L2D caches to derive insights into which level derives the most benefit from cache tuning. Since GPGPUs typically have multiple L1D caches [8], we assume that each L1D cache is globally tuned with the same configuration, i.e., the configurations are homogeneous across the L1D caches. Thus, we assume a low-overhead hardware cache tuner that can simultaneously change the caches' configurations without incurring significant power and area overheads. This tuner has been extensively analyzed for CPUs by prior work [2]; for future work, however, we plan to further evaluate these tuners in the context of GPGPUs. We assume a single unified L2D cache, as is the case in the GTX480 [8]. To represent a variety of GPGPU applications, we used seven benchmarks from the Rodinia 3.1 Benchmark Suite [5] and the NVIDIA Toolkit [1]. From the Rodinia 3.1 Benchmark Suite, we used BFS, HotSpot, NW, SRADv2, and Pathfinder, and from the NVIDIA Toolkit, we used VectorAdd and ScalarProd.
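The flow in Fig. 1 amounts to a driver loop over the design space, sketched below. The run_gpgpu_sim and run_gpuwattch helpers stand in for generating a configuration file, invoking the simulators, and parsing their outputs; they are assumptions for illustration, not actual GPGPU-Sim or GPUWattch APIs.

```python
from itertools import product

def run_gpgpu_sim(benchmark: str, num_sets: int, assoc: int, line_size: int) -> dict:
    """Hypothetical wrapper: writes a GPGPU-Sim config file, runs the benchmark,
    and returns parsed statistics (ipc, total_cycles, reservation_failures)."""
    raise NotImplementedError

def run_gpuwattch(sim_output: dict) -> float:
    """Hypothetical wrapper: feeds GPGPU-Sim output to GPUWattch and returns average power (W)."""
    raise NotImplementedError

FREQ_HZ = 700e6  # core clock from Table I

def exhaustive_search(benchmarks, sets, assocs, line_sizes):
    """Evaluate every cache configuration for every benchmark (Fig. 1 flow)."""
    results = {}
    for bench in benchmarks:
        for s, a, l in product(sets, assocs, line_sizes):
            stats = run_gpgpu_sim(bench, s, a, l)
            power = run_gpuwattch(stats)
            delay = stats["total_cycles"] / FREQ_HZ
            results[(bench, s, a, l)] = {
                "ipc": stats["ipc"],
                "power": power,
                "edp": power * delay ** 2,
                "reservation_failures": stats["reservation_failures"],
            }
    return results
```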
IV. Exhaustive Search Experimental Results

Our first set of experiments involved an exhaustive search of the L1D and L2D cache design spaces to evaluate the benefits of tuning the GPGPU's caches. The L1D caches are usually much smaller than the L2D cache, making the L1D caches more susceptible to thrashing. This observation leads us to hypothesize that the performance gain from tuning the L1D caches in the GPGPU will be greater than from tuning the unified L2D cache. After our exhaustive search, we made three key observations with respect to tuning GPGPU caches. In what follows, we present each observation, followed by detailed results that support the observation.

1) Different applications require different L1D and L2D cache configurations for optimal energy efficiency and performance. To verify that cache tuning is a viable GPGPU cache optimization technique, we investigated the frequency with which different L1D and L2D cache configurations were deemed best for different evaluation metrics. Figure 2 shows a histogram displaying the frequency with which different L1D (Figure 2a) and L2D (Figure 2b) configurations were determined to be best for the different metrics.
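The tallies plotted in Figure 2 can be reproduced with a short post-processing pass over the exhaustive-search results: for each benchmark and metric, every configuration tied for the best value receives a point (the tie rule noted below). This sketch assumes the hypothetical results dictionary produced by the earlier driver loop.

```python
from collections import Counter, defaultdict

def best_config_histogram(results, metric, lower_is_better, eps=1e-9):
    """Count how often each configuration is best for `metric`, awarding a
    point to every configuration that ties for the best value (Fig. 2 tally).
    `results` maps (benchmark, sets, assoc, line_size) -> {metric: value}."""
    per_benchmark = defaultdict(dict)
    for (bench, *cfg), stats in results.items():
        per_benchmark[bench][tuple(cfg)] = stats[metric]

    histogram = Counter()
    for cfg_values in per_benchmark.values():
        best = min(cfg_values.values()) if lower_is_better else max(cfg_values.values())
        for cfg, value in cfg_values.items():
            if abs(value - best) <= eps:
                histogram[cfg] += 1  # tied configurations each get a point
    return histogram

# e.g. best_config_histogram(results, "edp", lower_is_better=True)
#      best_config_histogram(results, "ipc", lower_is_better=False)
```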

(a) L1D best configuration distribution. (b) L2D best configuration distribution.
Fig. 2: Distribution of best L1D (a) and L2D (b) cache configurations based on IPC, EDP, and average power. Configurations are identified as CacheSize-BlockSize-Associativity.

The frequencies of the configurations with the optimal IPC, EDP, and average power are indicated with blue, red, and green bars, respectively. For the cases where two or more configurations tie for the best performance within a specific benchmark, each of the tied configurations is awarded a point. As seen in Figure 2a, when the average power is taken into account, the best configurations tend to be small, direct mapped caches. This is to be expected because cache modules with fewer sets and ways typically consume less power. When considering the IPC, larger configurations (size and associativity) tend to be the optimal configurations for the different benchmarks. Again, this observation is expected because IPC is highly related to the total throughput of the GPGPU; larger caches with large associativities typically result in fewer misses, and enable faster accesses. Finally, with EDP as the focus, the optimal cache configurations are far more spread out among the different configurations in the design space, but still favor large caches. This can be attributed to the fact that GPGPUs are throughput-oriented devices, and small caches result in a large number of reservation failures, which lead to massive cache thrashing.

Figure 2b shows a more even distribution of L2D configurations that achieve the best average power and IPC. However, there were fewer configurations that achieved the best EDP across all 7 benchmarks. This implies that the performance gain from tuning the L2D cache is much smaller than from tuning the L1D cache. Despite the large variation of best configurations for average power and IPC, the deviation between the statistics for each configuration is much smaller than that of the L1D cache. We also observed that EDP has a smaller distribution of best L2D cache configurations than the L1D cache. Overall, these observations indicate that tuning the L2D cache offers fewer optimization benefits than tuning the L1D cache; to minimize runtime overheads, the L2D cache configuration should be a single static configuration determined through extensive a priori analysis.

The key takeaway from these results is that different applications have different L1D and L2D cache configuration requirements for optimal execution. Our results show, for example, that the best configuration for minimum EDP varies significantly for the different applications. These observations, made using a set of test applications, motivate us to perform analysis for a larger set of applications in future work.

2) L1D cache tuning provides significant improvement as compared to a base static cache; L2D cache tuning provides much smaller improvement. We performed further experiments and analysis to determine which cache level offers the most improvement from cache tuning. To illustrate the potential variation of results achieved using the best, worst, and base configurations, we used exhaustive search to determine the best and worst configurations, and compared them to the base configuration. Figure 3a compares the L1D cache configurations with respect to EDP (the best and worst configurations are normalized to the base). The optimal cache configuration on average reduced the EDP by 16.5% as compared to the base cache.
The EDP savings were as high as 25% and 35.5% for Hotspot and SRADv2, respectively. On the other hand, the worst cache configuration for EDP was, on average, 402% higher than the EDP of the base configuration. This can be attributed predominantly to two large outliers, ScalarProduct and SRADv2, whose worst configurations increased the EDP by 1110% and 613%, respectively, as compared to the base configuration. Figure 4a compares the L2D cache configurations with respect to EDP. We observed that the optimal cache configuration reduced the average EDP by 18.1% as compared to the base configuration. The EDP savings were as high as 36% and 34.9% for NW and SRADv2, respectively. The worst configurations increased the average EDP by 8%, as compared to the base configuration, with increases of up to 33% and 17% for BFS and NW, respectively. The EDP savings of the optimal L1D cache configurations were similar to those of the optimal L2D cache configurations.

Fig. 3: Cache tuning potential to improve EDP, average power, execution time, and IPC compared to the base L1D cache. Panels: (a) EDP, (b) average power, (c) execution time, and (d) IPC comparisons of the best, worst, and base L1D configurations.

Fig. 4: Cache tuning potential to improve EDP, average power, execution time, and IPC compared to the base L2D cache. Panels: (a) EDP, (b) average power, (c) execution time, and (d) IPC comparisons of the best, worst, and base L2D configurations.

On average, tuning to the optimal L1D cache configurations achieved 16.5% EDP savings, while the optimal L2D cache configurations achieved 18.1% EDP savings. However, there was a more significant difference in the EDP increase from the worst configurations. On average, the worst L1D configuration for EDP increased the EDP of the base by 402%, while the worst L2D configuration for EDP only increased the EDP by 8%. These results reveal that the GPGPU's EDP is more sensitive to changes in the L1D cache configuration than in the L2D cache configuration; a suboptimal L1D cache can result in more significant EDP degradation. Thus, optimization emphasis must be placed on the L1D cache to prevent the potential EDP increases from suboptimal L1D cache configurations.

Figure 3b compares the best, base, and worst L1D configurations when considering the average power consumption for each benchmark. On average across the benchmarks, the best configurations reduced the average power consumption by 50.7% as compared to the base configuration. For ScalarProduct and Hotspot, the best configuration reduced the power consumption by 72.9% and 55.5%, respectively. The worst configurations on average consumed 17.7% more power than the base configuration. Figure 4b shows the difference in overall average power consumed by the GPGPU between the best, base, and worst L2D cache configurations for each benchmark. On average across all the benchmarks, the best configurations reduced the power consumption by 5.32% as compared to the base configuration, with power savings as high as 32.2% for BFS. On average, the worst configurations increased power consumption by 14.4%. These results show the significant disparity in the power savings obtained from tuning the L1D and L2D caches. Tuning the L2D cache has a smaller impact on the GPGPU's average power savings (5.32%) than changing the L1D cache configuration (50.7%). We also observed that the power increases from the worst configurations were more similar for the L1D and L2D caches, at 17.7% and 14.4%, respectively.

Figure 3c compares the different L1D cache configurations with respect to the execution time of the different benchmarks. Because the base configuration was relatively large, the execution time savings from cache tuning were much smaller for all the benchmarks than the power and EDP savings. Furthermore, we observed dramatic execution time degradations with suboptimal configurations. On average across all the benchmarks, the best configurations reduced the execution time by 14.6%, as compared to the base, while the worst configurations increased the execution time by 226%. Figure 4c shows the comparison between the best, base, and worst L2D cache configurations with respect to execution time. The best L2D cache configuration reduced the average execution time by 10.3%, as compared to the base configuration, with reductions as high as 32.3% and 19.1% for SRADv2 and NW, respectively. The worst L2D configuration increased the execution time by only 1.61% with respect to the base. These results reveal that tuning the L1D and L2D caches achieves similar execution time savings with respect to the base configuration (14.6% and 10.3%, respectively).

Fig. 5: Relationship between normalized reservation failures and (a) average power, (b) execution time, and (c) EDP.
However, we observed a significant difference in the execution time degradation of the worst L1D and L2D configurations: 226% and 1.61%, respectively. Figure 3d compares the best and worst L1D configurations to the base, with respect to IPC. Similar to the average power results, the best configurations increased the average IPC by 18.8% as compared to the base configuration, with increases as high as 29% and 48% for Hotspot and SRADv2, respectively.

The worst configurations decreased IPC by 63.0%. Figure 4d looks at the same IPC comparison for the L2D cache configurations. The best L2D configuration increased the average IPC by 13.4%, as compared to the base, with increases as high as 23.6% and 47.7% for NW and SRADv2, respectively. The worst L2D configuration only decreased the average IPC by 1.53% compared to the base. These results reveal similar impacts of tuning the L1D and L2D caches on IPC. However, the IPC reduction resulting from the worst configurations was more significant for the L1D cache (63%) than the L2D cache (1.53%).

Figures 3 and 4 show that the GPGPU's average IPC, power, execution time, and EDP are highly sensitive to both the L1D and L2D cache configurations. However, in general, our results reveal that L1D cache tuning creates much larger deviations in the results than L2D cache tuning. For example, the standard deviation of the average power across all L1D cache configurations was 15.78, whereas across all L2D cache configurations it was only 3.692; the IPC showed a similarly larger spread under L1D tuning. These results show that while both L1D and L2D cache tuning can improve a GPGPU's efficiency, tuning the L1D cache offers more promise and warrants more optimization effort than tuning the L2D cache.

3) L1D cache reservation failures can be used for performance prediction during cache tuning. The GPU handles misses with Miss Status/Information Holding Registers (MSHRs) [16]. There are multiple MSHRs per streaming multiprocessor (SM); each MSHR can track one pending memory request. However, if there are many misses in a small period of time, an SM can run out of resources, which include available MSHRs as well as miss queue entries and available cache lines. This is referred to as a reservation failure, and it requires stalling the entire memory pipeline until sufficient resources are available to handle additional cache misses [7].

Our results indicate that as L1D reservation failures increase, the GPGPU's overall average power consumption decreases. We show this trend in Figure 5a using two candidate benchmarks, NW and Hotspot, both of which exhibit strong correlations. On average, all benchmarks have a high coefficient of determination (R^2), indicating a strong correlation between the power consumption and reservation failures for all applications. This trend occurs because as the cache size decreases the reservation failures increase, since there are fewer resources available and the cache is unable to keep up with the demands of the GPGPU [13]. GPGPUs are designed for high throughput; therefore, smaller caches have more problems keeping all the necessary data in the cache at a given time. This causes a large number of cache misses, resulting in poorer performance and subsequently more reservation failures. Figure 5b shows a linear correlation between execution time and reservation failures for NW and Hotspot, and we observed a similar trend for all the benchmarks, which on average have a high coefficient of determination (R^2) (figures omitted for brevity). We also observed similar trends for the EDP, as shown in Figure 5c. Given these observations, we conjecture that the reservation failures can be used for predicting L1D configurations during runtime.
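The correlation analysis behind Figure 5 can be reproduced from the exhaustive-search statistics as sketched below. The Pearson correlation here is a generic illustration of how R and R^2 would be computed per benchmark, not the authors' analysis script, and it assumes the hypothetical results dictionary from the earlier driver sketch.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def reservation_failure_correlation(results, benchmark, metric):
    """R and R^2 between L1D reservation failures and `metric`
    (e.g., 'power' or 'edp') across all configurations of one benchmark."""
    failures, values = [], []
    for (bench, *_cfg), stats in results.items():
        if bench == benchmark:
            failures.append(stats["reservation_failures"])
            values.append(stats[metric])
    r = pearson_r(failures, values)
    return r, r * r
```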
We intend to exploit these correlations for cache tuning in future work. We found that, unlike for the L1D cache, there was no correlation between the L2D cache and reservation failures. This can be attributed to the fact that there are many more reservation failures for the L1D cache than for the L2D cache; the average number of L1D reservation failures across all benchmarks and all configurations was far larger than the number of L2D reservation failures. There were more L1D cache reservation failures because there are many more L1D caches than L2D caches (15 L1D caches in the GTX480 versus one unified L2D cache), and the L2D cache is much larger than the L1D caches. Thus, the L2D cache is able to maintain the locality of several executing threads. For the L2D cache, the coefficients of determination (R^2) between the reservation failures and the average power, execution time, and EDP were all low (figures omitted for brevity).

V. Autonomic Cache Tuning Analysis

In addition to conducting an exhaustive search of all possible L1D and L2D cache configurations, we used a simple heuristic, similar to the one proposed in [27], to dynamically predict the optimal L1D cache configuration for executing applications. We assume that the heuristic is implemented by a low-overhead hardware cache tuner [2]. This tuner has been extensively analyzed to show that it contributes minimal power and area overhead to a resource-constrained embedded system. In addition, the accrued tuning stall cycles do not result in any significant runtime tuning overhead. To take into account both energy and execution time, the heuristic uses the EDP as the optimization metric; however, the heuristic can easily be modified to optimize for energy or execution time separately. The heuristic operates as follows (a brief code sketch follows the list):
1) Begin with the smallest cache: 16 sets, 128-byte line size, direct mapped. Increase the number of sets to 32 and keep the other two parameters constant. If the change in the number of sets decreases the EDP, increase the number of sets to 64. Pick the number of sets corresponding to the lowest overall EDP.
2) With the number of sets and the associativity held constant, increase the line size from 128 to 256 bytes. If the increase in line size decreases the EDP, choose 256 bytes as the line size; otherwise, choose 128 bytes.
3) With the number of sets and the line size held constant, increase the associativity from direct mapped to 2-way set associative. If this change decreases the EDP, increase the associativity from 2-way to 4-way set associative. Pick the associativity corresponding to the lowest overall EDP.
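The listing below is a minimal sketch of the three-step heuristic, assuming a hypothetical evaluate_edp(sets, line_size, assoc) callback that evaluates one configuration and returns its EDP; it illustrates the tuning order described above rather than the hardware tuner's actual implementation.

```python
def tune_l1d(evaluate_edp):
    """Greedy parameter-by-parameter L1D tuning heuristic (EDP-optimized).
    `evaluate_edp(sets, line_size, assoc)` is a hypothetical callback that
    returns the measured EDP of one configuration."""
    # Step 1: tune the number of sets, starting from the smallest cache.
    sets, line_size, assoc = 16, 128, 1
    best_sets, best_edp = sets, evaluate_edp(sets, line_size, assoc)
    edp_32 = evaluate_edp(32, line_size, assoc)
    if edp_32 < best_edp:
        best_sets, best_edp = 32, edp_32
        edp_64 = evaluate_edp(64, line_size, assoc)
        if edp_64 < best_edp:
            best_sets, best_edp = 64, edp_64
    sets = best_sets

    # Step 2: tune the line size.
    edp_256 = evaluate_edp(sets, 256, assoc)
    if edp_256 < best_edp:
        line_size, best_edp = 256, edp_256

    # Step 3: tune the associativity.
    edp_2way = evaluate_edp(sets, line_size, 2)
    if edp_2way < best_edp:
        assoc, best_edp = 2, edp_2way
        edp_4way = evaluate_edp(sets, line_size, 4)
        if edp_4way < best_edp:
            assoc, best_edp = 4, edp_4way

    return (sets, line_size, assoc), best_edp
```

In the simulated setting, evaluate_edp would wrap one GPGPU-Sim plus GPUWattch run; in hardware, it corresponds to executing a tuning interval with the configurable cache set to that configuration.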

Fig. 6: Heuristic EDP normalized to the base L1D cache.

Figure 6 shows that for every benchmark, the heuristic determined the optimal configuration for EDP. Similar to the exhaustive search, the heuristic achieved average EDP savings of 16.5%, as compared to the base cache configuration, with savings as high as 35.6% for SRADv2. The heuristic achieved these EDP savings while exploring only 31.75% of the design space. We expect this percentage to reduce even further with a larger design space.

VI. Conclusions

In this paper, we explored autonomic cache tuning for optimizing the L1 and L2 data caches in general purpose graphics processing units (GPGPUs). Cache tuning adapts the system's cache configurations to executing applications' resource requirements in order to reduce energy or improve performance. We showed that different GPGPU applications require different runtime L1 and L2 data cache configurations for optimal execution. Using extensive experiments, we showed that L1 data cache tuning can reduce the average power and EDP by 50.7% and 16.5%, respectively, and increase the average IPC by 18.8%, as compared to a static L1 data cache. We also showed that, in general, the L1 data cache offers more promise for GPGPU optimization than the L2 data cache. Our results also revealed that reservation failures may be used to predict when to switch to a new cache configuration and what that configuration may be. Finally, we implemented a simple heuristic to autonomously tune the L1D cache, showing optimal configuration predictions for all applications while significantly reducing the explored design space.

For future work, we intend to explore an interlaced tuning of the L1 and L2 caches. This interlaced tuning will significantly increase the design space, and potentially lead to higher optimization potential. Given a larger design space, we also intend to explore techniques for optimizing the cache tuning heuristic to determine optimal configurations while exploring a small subset of the design space. We also intend to design a prediction scheme, leveraging reservation failures, to predict the best cache configurations for application phases in a GPGPU. Finally, we plan to extensively evaluate and explore additional techniques for implementing configurable caches and cache tuning in GPGPUs, while accruing minimal runtime overheads.

References

[1] NVIDIA CUDA Toolkit. Accessed: February.
[2] T. Adegbija, A. Gordon-Ross, and M. Rawlins. Analysis of cache tuner architectural layouts for multicore embedded systems. In Performance Computing and Communications Conference (IPCCC), 2014 IEEE International, pages 1-8. IEEE, 2014.
[3] M. H. Alsafrjalani, A. Gordon-Ross, and P. Viana. Minimum effort design space subsetting for configurable caches. In Embedded and Ubiquitous Computing (EUC), IEEE International Conference on. IEEE.
[4] A. Bakhoda, G. L. Yuan, W. W. Fung, H. Wong, and T. M. Aamodt. Analyzing CUDA workloads using a detailed GPU simulator.
In Performance Analysis of Systems and Software (ISPASS), IEEE International Symposium on. IEEE.
[5] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron. Rodinia: A benchmark suite for heterogeneous computing. In Workload Characterization (IISWC), IEEE International Symposium on. IEEE.
[6] X. Chen, L.-W. Chang, C. I. Rodrigues, J. Lv, Z. Wang, and W.-M. Hwu. Adaptive cache management for energy-efficient GPU computing. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society.
[7] H. Dai, C. Li, H. Zhou, S. Gupta, C. Kartsaklis, and M. Mantor. A model-driven approach to warp/thread-block level GPU cache bypassing. In Design Automation Conference (DAC), ACM/EDAC/IEEE, pages 1-6. IEEE.
[8] GeForce. GTX480 specifications. com/hardware/desktop-gpus/geforce-gtx-480/specifications. Accessed: May.
[9] A. Gordon-Ross and F. Vahid. A self-tuning configurable cache. In Proceedings of the 44th Annual Design Automation Conference. ACM.
[10] A. Gordon-Ross, F. Vahid, and N. Dutt. Automatic tuning of two-level caches to embedded applications. In Proceedings of the Conference on Design, Automation and Test in Europe, Volume 1. IEEE Computer Society.
[11] H. Hajimiri and P. Mishra. Intra-task dynamic cache reconfiguration. In VLSI Design (VLSID), International Conference on. IEEE.
[12] H. Hajimiri, P. Mishra, and S. Bhunia. Dynamic cache tuning for efficient memory based computing in multicore architectures. In International Conference on VLSI Design and Embedded Systems (VLSID). IEEE.
[13] M. Khairy, M. Zahran, and A. G. Wassal. Efficient utilization of GPGPU cache hierarchy. In Proceedings of the 8th Workshop on General Purpose Processing using GPUs. ACM.
[14] K. Y. Kim and W. Baek. Quantifying the performance and energy efficiency of advanced cache indexing for GPGPU computing. Microprocessors and Microsystems, 43:81-94, 2016.

[15] K. Y. Kim, J. Park, and W. Baek. IACM: Integrated adaptive cache management for high-performance and energy-efficient GPGPU computing. In Computer Design (ICCD), 2016 IEEE 34th International Conference on. IEEE, 2016.
[16] A. Lashgar, E. Salehi, and A. Baniasadi. Understanding outstanding memory request handling resources in GPGPUs. In Proceedings of the Sixth International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies (HEART 2015), 2015.
[17] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi. GPUWattch: Enabling energy optimizations in GPGPUs. In ACM SIGARCH Computer Architecture News, volume 41. ACM.
[18] C. Li, S. L. Song, H. Dai, A. Sidelnik, S. K. S. Hari, and H. Zhou. Locality-driven dynamic GPU cache bypassing. In Proceedings of the 29th ACM International Conference on Supercomputing. ACM.
[19] A. Malik, B. Moyer, and D. Cermak. A low power unified cache architecture providing power and performance flexibility. In Low Power Electronics and Design (ISLPED '00), Proceedings of the 2000 International Symposium on. IEEE, 2000.
[20] S. Mittal. A survey of architectural techniques for improving cache power efficiency. Sustainable Computing: Informatics and Systems, 4:33-43.
[21] NVIDIA. Tesla GPU accelerators. content/tesla/pdf/nvidia-tesla-kepler-family-datasheet.pdf. Accessed: May.
[22] M. Rawlins and A. Gordon-Ross. An application classification guided cache tuning heuristic for multi-core architectures. In Asia and South Pacific Design Automation Conference (ASP-DAC). IEEE.
[23] T. G. Rogers, M. O'Connor, and T. M. Aamodt. Cache-conscious wavefront scheduling. In Proceedings of the Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society.
[24] E. Strohmaier, H. Meuer, J. Dongarra, and H. Simon. The TOP500 list and progress in high-performance computing. Computer. IEEE Computer Society.
[25] C. M. Wittenbrink, E. Kilgariff, and A. Prabhu. Fermi GF100 GPU architecture. IEEE Micro, 31(2):50-59.
[26] C. M. Wittenbrink, E. Kilgariff, and A. Prabhu. Fermi GF100 GPU architecture. IEEE Micro, 31(2):50-59.
[27] C. Zhang and F. Vahid. Cache configuration exploration on prototyping platforms. In Rapid Systems Prototyping, Proceedings of the 14th IEEE International Workshop on. IEEE.
[28] C. Zhang, F. Vahid, and W. Najjar. A highly configurable cache architecture for embedded systems. In Computer Architecture, Proceedings of the 30th Annual International Symposium on. IEEE.
[29] C. Zhao, F. Wang, Z. Lin, H. Zhou, and N. Zheng. Selectively GPU cache bypassing for un-coalesced loads. In Parallel and Distributed Systems (ICPADS), 2016 IEEE 22nd International Conference on. IEEE, 2016.


More information

GPU Performance vs. Thread-Level Parallelism: Scalability Analysis and A Novel Way to Improve TLP

GPU Performance vs. Thread-Level Parallelism: Scalability Analysis and A Novel Way to Improve TLP 1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 GPU Performance vs. Thread-Level Parallelism: Scalability Analysis

More information

Low Effort Design Space Exploration Methodology for Configurable Caches

Low Effort Design Space Exploration Methodology for Configurable Caches computers Article Low Effort Design Space Exploration Methodology for Configurable Caches Mohamad Hammam Alsafrjalani 1, * and Ann Gordon-Ross 2 1 Department of Electrical and Computer Engineering, University

More information

GPU Programming. Lecture 1: Introduction. Miaoqing Huang University of Arkansas 1 / 27

GPU Programming. Lecture 1: Introduction. Miaoqing Huang University of Arkansas 1 / 27 1 / 27 GPU Programming Lecture 1: Introduction Miaoqing Huang University of Arkansas 2 / 27 Outline Course Introduction GPUs as Parallel Computers Trend and Design Philosophies Programming and Execution

More information

A Reconfigurable Cache Design for Embedded Dynamic Data Cache

A Reconfigurable Cache Design for Embedded Dynamic Data Cache I J C T A, 9(17) 2016, pp. 8509-8517 International Science Press A Reconfigurable Cache Design for Embedded Dynamic Data Cache Shameedha Begum, T. Vidya, Amit D. Joshi and N. Ramasubramanian ABSTRACT Applications

More information

Understanding The Effects of Wrong-path Memory References on Processor Performance

Understanding The Effects of Wrong-path Memory References on Processor Performance Understanding The Effects of Wrong-path Memory References on Processor Performance Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt The University of Texas at Austin 2 Motivation Processors spend

More information

Reconfigurable Multicore Server Processors for Low Power Operation

Reconfigurable Multicore Server Processors for Low Power Operation Reconfigurable Multicore Server Processors for Low Power Operation Ronald G. Dreslinski, David Fick, David Blaauw, Dennis Sylvester, Trevor Mudge University of Michigan, Advanced Computer Architecture

More information

CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA

CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0. Julien Demouth, NVIDIA CUDA Optimization with NVIDIA Nsight Visual Studio Edition 3.0 Julien Demouth, NVIDIA What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nsight VSE APOD

More information

Automatic Data Layout Transformation for Heterogeneous Many-Core Systems

Automatic Data Layout Transformation for Heterogeneous Many-Core Systems Automatic Data Layout Transformation for Heterogeneous Many-Core Systems Ying-Yu Tseng, Yu-Hao Huang, Bo-Cheng Charles Lai, and Jiun-Liang Lin Department of Electronics Engineering, National Chiao-Tung

More information

Position Paper: OpenMP scheduling on ARM big.little architecture

Position Paper: OpenMP scheduling on ARM big.little architecture Position Paper: OpenMP scheduling on ARM big.little architecture Anastasiia Butko, Louisa Bessad, David Novo, Florent Bruguier, Abdoulaye Gamatié, Gilles Sassatelli, Lionel Torres, and Michel Robert LIRMM

More information

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Elec525 Spring 2005 Raj Bandyopadhyay, Mandy Liu, Nico Peña Hypothesis Some modern processors use a prefetching unit at the front-end

More information

UNLOCKING BANDWIDTH FOR GPUS IN CC-NUMA SYSTEMS

UNLOCKING BANDWIDTH FOR GPUS IN CC-NUMA SYSTEMS UNLOCKING BANDWIDTH FOR GPUS IN CC-NUMA SYSTEMS Neha Agarwal* David Nellans Mike O Connor Stephen W. Keckler Thomas F. Wenisch* NVIDIA University of Michigan* (Major part of this work was done when Neha

More information

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture

More information

Fine-Grained Resource Sharing for Concurrent GPGPU Kernels

Fine-Grained Resource Sharing for Concurrent GPGPU Kernels Fine-Grained Resource Sharing for Concurrent GPGPU Kernels Chris Gregg Jonathan Dorn Kim Hazelwood Kevin Skadron Department of Computer Science University of Virginia PO Box 400740 Charlottesville, VA

More information

Adaptive Cache Partitioning on a Composite Core

Adaptive Cache Partitioning on a Composite Core Adaptive Cache Partitioning on a Composite Core Jiecao Yu, Andrew Lukefahr, Shruti Padmanabha, Reetuparna Das, Scott Mahlke Computer Engineering Lab University of Michigan, Ann Arbor, MI {jiecaoyu, lukefahr,

More information

Intra-Cluster Coalescing to Reduce GPU NoC Pressure

Intra-Cluster Coalescing to Reduce GPU NoC Pressure Intra-Cluster Coalescing to Reduce GPU NoC Pressure Lu Wang 1, Xia Zhao 1, David Kaeli 2, Zhiying Wang 3, and Lieven Eeckhout 1 1 Ghent University,{luluwang.wang, xia.zhao, lieven.eeckhout}@ugent.be 2

More information

HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes.

HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes. HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes Ian Glendinning Outline NVIDIA GPU cards CUDA & OpenCL Parallel Implementation

More information

An Efficient STT-RAM Last Level Cache Architecture for GPUs

An Efficient STT-RAM Last Level Cache Architecture for GPUs An Efficient STT-RAM Last Level Cache Architecture for GPUs Mohammad Hossein Samavatian, Hamed Abbasitabar, Mohammad Arjomand, and Hamid Sarbazi-Azad, HPCAN Lab, Computer Engineering Department, Sharif

More information

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Kenneth B. Kent University of New Brunswick Faculty of Computer Science Fredericton, New Brunswick, Canada ken@unb.ca Micaela Serra

More information

LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs

LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs Jin Wang* Norm Rubin Albert Sidelnik Sudhakar Yalamanchili* *Georgia Institute of Technology NVIDIA Research Email: {jin.wang,sudha}@gatech.edu,

More information

Portland State University ECE 588/688. Graphics Processors

Portland State University ECE 588/688. Graphics Processors Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly

More information

Automatic Intra-Application Load Balancing for Heterogeneous Systems

Automatic Intra-Application Load Balancing for Heterogeneous Systems Automatic Intra-Application Load Balancing for Heterogeneous Systems Michael Boyer, Shuai Che, and Kevin Skadron Department of Computer Science University of Virginia Jayanth Gummaraju and Nuwan Jayasena

More information

Characterizing the Performance Bottlenecks of Irregular GPU Kernels

Characterizing the Performance Bottlenecks of Irregular GPU Kernels Characterizing the Performance Bottlenecks of Irregular GPU Kernels Molly A. O Neil M.S. Candidate, Computer Science Thesis Defense April 2, 2015 Committee Members: Dr. Martin Burtscher Dr. Apan Qasem

More information

Sustainable Computing: Informatics and Systems

Sustainable Computing: Informatics and Systems Sustainable Computing: Informatics and Systems 2 (212) 71 8 Contents lists available at SciVerse ScienceDirect Sustainable Computing: Informatics and Systems j ourna l ho me page: www.elsevier.com/locate/suscom

More information

Cache Memory Access Patterns in the GPU Architecture

Cache Memory Access Patterns in the GPU Architecture Rochester Institute of Technology RIT Scholar Works Theses Thesis/Dissertation Collections 7-2018 Cache Memory Access Patterns in the GPU Architecture Yash Nimkar ypn4262@rit.edu Follow this and additional

More information

Exploiting Uniform Vector Instructions for GPGPU Performance, Energy Efficiency, and Opportunistic Reliability Enhancement

Exploiting Uniform Vector Instructions for GPGPU Performance, Energy Efficiency, and Opportunistic Reliability Enhancement Exploiting Uniform Vector Instructions for GPGPU Performance, Energy Efficiency, and Opportunistic Reliability Enhancement Ping Xiang, Yi Yang*, Mike Mantor #, Norm Rubin #, Lisa R. Hsu #, Huiyang Zhou

More information

A REUSED DISTANCE BASED ANALYSIS AND OPTIMIZATION FOR GPU CACHE

A REUSED DISTANCE BASED ANALYSIS AND OPTIMIZATION FOR GPU CACHE Virginia Commonwealth University VCU Scholars Compass Theses and Dissertations Graduate School 2016 A REUSED DISTANCE BASED ANALYSIS AND OPTIMIZATION FOR GPU CACHE Dongwei Wang Follow this and additional

More information

A task migration algorithm for power management on heterogeneous multicore Manman Peng1, a, Wen Luo1, b

A task migration algorithm for power management on heterogeneous multicore Manman Peng1, a, Wen Luo1, b 5th International Conference on Advanced Materials and Computer Science (ICAMCS 2016) A task migration algorithm for power management on heterogeneous multicore Manman Peng1, a, Wen Luo1, b 1 School of

More information

Techniques for Efficient Processing in Runahead Execution Engines

Techniques for Efficient Processing in Runahead Execution Engines Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt Depment of Electrical and Computer Engineering University of Texas at Austin {onur,hyesoon,patt}@ece.utexas.edu

More information

Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow

Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow Fundamental Optimizations (GTC 2010) Paulius Micikevicius NVIDIA Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow Optimization

More information

GaaS Workload Characterization under NUMA Architecture for Virtualized GPU

GaaS Workload Characterization under NUMA Architecture for Virtualized GPU GaaS Workload Characterization under NUMA Architecture for Virtualized GPU Huixiang Chen, Meng Wang, Yang Hu, Mingcong Song, Tao Li Presented by Huixiang Chen ISPASS 2017 April 24, 2017, Santa Rosa, California

More information

Storage Efficient Hardware Prefetching using Delta Correlating Prediction Tables

Storage Efficient Hardware Prefetching using Delta Correlating Prediction Tables Storage Efficient Hardware Prefetching using Correlating Prediction Tables Marius Grannaes Magnus Jahre Lasse Natvig Norwegian University of Science and Technology HiPEAC European Network of Excellence

More information

A Brief Compendium of On Chip Memory Highlighting the Tradeoffs Implementing SRAM,

A Brief Compendium of On Chip Memory Highlighting the Tradeoffs Implementing SRAM, A Brief Compendium of On Chip Memory Highlighting the Tradeoffs Implementing, RAM, or edram Justin Bates Department of Electrical and Computer Engineering University of Central Florida Orlando, FL 3816-36

More information

ARCHITECTURAL APPROACHES TO REDUCE LEAKAGE ENERGY IN CACHES

ARCHITECTURAL APPROACHES TO REDUCE LEAKAGE ENERGY IN CACHES ARCHITECTURAL APPROACHES TO REDUCE LEAKAGE ENERGY IN CACHES Shashikiran H. Tadas & Chaitali Chakrabarti Department of Electrical Engineering Arizona State University Tempe, AZ, 85287. tadas@asu.edu, chaitali@asu.edu

More information

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference The 2017 IEEE International Symposium on Workload Characterization Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference Shin-Ying Lee

More information

Computer Architecture

Computer Architecture Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,

More information

Programming GPUs for database applications - outsourcing index search operations

Programming GPUs for database applications - outsourcing index search operations Programming GPUs for database applications - outsourcing index search operations Tim Kaldewey Research Staff Member Database Technologies IBM Almaden Research Center tkaldew@us.ibm.com Quo Vadis? + special

More information

High Performance Computing on GPUs using NVIDIA CUDA

High Performance Computing on GPUs using NVIDIA CUDA High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and

More information

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt High Performance Systems Group

More information

Phase-based Cache Locking for Embedded Systems

Phase-based Cache Locking for Embedded Systems Phase-based Cache Locking for Embedded Systems Tosiron Adegbija and Ann Gordon-Ross* Department of Electrical and Computer Engineering, University of Florida (UF), Gainesville, FL 32611, USA tosironkbd@ufl.edu

More information

Mitigating GPU Memory Divergence for Data-Intensive Applications. Bin Wang

Mitigating GPU Memory Divergence for Data-Intensive Applications. Bin Wang Mitigating GPU Memory Divergence for Data-Intensive Applications by Bin Wang A dissertation submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree

More information

Inter-warp Divergence Aware Execution on GPUs

Inter-warp Divergence Aware Execution on GPUs Inter-warp Divergence Aware Execution on GPUs A Thesis Presented by Chulian Zhang to The Department of Electrical and Computer Engineering in partial fulfillment of the requirements for the degree of Master

More information

Selective Fill Data Cache

Selective Fill Data Cache Selective Fill Data Cache Rice University ELEC525 Final Report Anuj Dharia, Paul Rodriguez, Ryan Verret Abstract Here we present an architecture for improving data cache miss rate. Our enhancement seeks

More information

A Spherical Placement and Migration Scheme for a STT-RAM Based Hybrid Cache in 3D chip Multi-processors

A Spherical Placement and Migration Scheme for a STT-RAM Based Hybrid Cache in 3D chip Multi-processors , July 4-6, 2018, London, U.K. A Spherical Placement and Migration Scheme for a STT-RAM Based Hybrid in 3D chip Multi-processors Lei Wang, Fen Ge, Hao Lu, Ning Wu, Ying Zhang, and Fang Zhou Abstract As

More information

Review on ichat: Inter Cache Hardware Assistant Data Transfer for Heterogeneous Chip Multiprocessors. By: Anvesh Polepalli Raj Muchhala

Review on ichat: Inter Cache Hardware Assistant Data Transfer for Heterogeneous Chip Multiprocessors. By: Anvesh Polepalli Raj Muchhala Review on ichat: Inter Cache Hardware Assistant Data Transfer for Heterogeneous Chip Multiprocessors By: Anvesh Polepalli Raj Muchhala Introduction Integrating CPU and GPU into a single chip for performance

More information

GRAPHICS PROCESSING UNITS

GRAPHICS PROCESSING UNITS GRAPHICS PROCESSING UNITS Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 4, John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2011

More information

gem5-gpu Extending gem5 for GPGPUs Jason Power, Marc Orr, Joel Hestness, Mark Hill, David Wood

gem5-gpu Extending gem5 for GPGPUs Jason Power, Marc Orr, Joel Hestness, Mark Hill, David Wood gem5-gpu Extending gem5 for GPGPUs Jason Power, Marc Orr, Joel Hestness, Mark Hill, David Wood (powerjg/morr)@cs.wisc.edu UW-Madison Computer Sciences 2012 gem5-gpu gem5 + GPGPU-Sim (v3.0.1) Flexible memory

More information

Lecture 2: CUDA Programming

Lecture 2: CUDA Programming CS 515 Programming Language and Compilers I Lecture 2: CUDA Programming Zheng (Eddy) Zhang Rutgers University Fall 2017, 9/12/2017 Review: Programming in CUDA Let s look at a sequential program in C first:

More information

Scientific Computing on GPUs: GPU Architecture Overview

Scientific Computing on GPUs: GPU Architecture Overview Scientific Computing on GPUs: GPU Architecture Overview Dominik Göddeke, Jakub Kurzak, Jan-Philipp Weiß, André Heidekrüger and Tim Schröder PPAM 2011 Tutorial Toruń, Poland, September 11 http://gpgpu.org/ppam11

More information

A Framework for Modeling GPUs Power Consumption

A Framework for Modeling GPUs Power Consumption A Framework for Modeling GPUs Power Consumption Sohan Lal, Jan Lucas, Michael Andersch, Mauricio Alvarez-Mesa, Ben Juurlink Embedded Systems Architecture Technische Universität Berlin Berlin, Germany January

More information

Application-independent Autotuning for GPUs

Application-independent Autotuning for GPUs Application-independent Autotuning for GPUs Martin Tillmann, Thomas Karcher, Carsten Dachsbacher, Walter F. Tichy KARLSRUHE INSTITUTE OF TECHNOLOGY KIT University of the State of Baden-Wuerttemberg and

More information

Threshold-Based Markov Prefetchers

Threshold-Based Markov Prefetchers Threshold-Based Markov Prefetchers Carlos Marchani Tamer Mohamed Lerzan Celikkanat George AbiNader Rice University, Department of Electrical and Computer Engineering ELEC 525, Spring 26 Abstract In this

More information

{jtan, zhili,

{jtan, zhili, Cost-Effective Soft-Error Protection for SRAM-Based Structures in GPGPUs Jingweijia Tan, Zhi Li, Xin Fu Department of Electrical Engineering and Computer Science University of Kansas Lawrence, KS 66045,

More information