Design Space Exploration for Multicore Architectures: A Power/Performance/Thermal View


Matteo Monchiero (1), Ramon Canal (2), Antonio González (2,3)

(1) Dipartimento di Elettronica e Informazione, Politecnico di Milano, Via Ponzio 34/5, 20133 Milano, Italy
(2) Dept. of Computer Architecture, Universitat Politècnica de Catalunya, C. Jordi Girona, Barcelona, Spain
(3) Intel Barcelona Research Center, Intel Labs-Universitat Politècnica de Catalunya, C. Jordi Girona, Barcelona, Spain

ABSTRACT
Multicore architectures dominate the recent microprocessor design trend. This is due to several reasons: better performance, the thread-level parallelism available in modern applications, the diminishing returns of ILP, better thermal/power scaling (many small cores dissipate less than one large, complex core), and ease and reuse of design. This paper presents a thorough evaluation of multicore architectures. The architecture we target is composed of a configurable number of cores, a memory hierarchy consisting of private L1 and L2 caches, and a shared bus interconnect. We consider parallel shared-memory applications. We explore the design space spanned by the number of cores, the L2 cache size, and the processor complexity, showing the behavior of the different configurations/applications with respect to performance, energy consumption, and temperature. Design trade-offs are analyzed, stressing the interdependency of the metrics and the design factors. In particular, we evaluate several chip floorplans. Their power/thermal characteristics are analyzed, showing the importance of considering thermal effects at the architectural level to achieve the best design choice.

1. INTRODUCTION
The main semiconductor firms have recently been proposing microprocessor solutions composed of a few cores integrated on a single chip [1-3]. This approach, named Chip Multiprocessor (CMP), makes it possible to deal efficiently with the power/thermal issues dominating deep sub-micron technologies and to exploit the thread-level parallelism of modern applications.
Power has been recognized as a first-class design constraint [4], and many works in the literature target the analysis and optimization of power consumption. Issues related to chip thermal behavior have been faced only recently, but have emerged as one of the most important factors determining a wide range of architectural decisions [5]. This paper targets a CMP system composed of multiple cores, each one with a private L1 and L2 hierarchy. This approach has been adopted in some industrial products, e.g. the Intel Montecito [1] and Pentium D [6], and the AMD Athlon64 X2 [7]. It maximizes design reuse, since this architectural style does not require a re-design of the secondary cache, as a shared-L2 architecture would. Unlike many recent works on design space exploration of CMPs, we consider parallel shared-memory applications, which can be considered the natural workload for a small-scale multiprocessor. We target several scientific and multimedia programs. Scientific applications represent the traditional benchmark for multiprocessors, while multimedia programs represent a promising way of improving parallelism in everyday computing. Our experimental framework consists of a detailed microarchitectural simulator [8], integrated with the Wattch [9] and CACTI [10] power models. Thermal effects have been modeled at the functional-unit granularity by using HotSpot [5]. This environment makes a fast and accurate exploration of the target design space possible.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ICS'06, June 28-30, 2006, Cairns, Queensland, Australia. Copyright 2006 ACM ...$5.00.
The main contribution of this work is the analysis of several energy/performance design trade-offs when varying the core complexity, the L2 cache size, and the number of cores, for parallel applications. In particular, we discuss the interdependence of energy/thermal efficiency, performance, and the architectural-level chip floorplan. Our findings can be summarized as follows:
- The L2 cache is an important factor in determining the chip thermal behavior.
- Large CMPs of fairly narrow cores are the best solution for energy-delay.
- The geometric characteristics of the floorplan are important for the chip temperature.

This paper is organized as follows. Section 2 presents the related work. Metrics are defined in Section 3. The target design space, as well as a description of the considered architecture, is presented in Section 4. The experimental framework used in this paper is introduced in Section 5. Section 6 discusses the performance/energy/thermal results for the proposed configurations. Section 7 presents an analysis of the spatial distribution of the temperature for selected chip floorplans. Finally, conclusions are drawn in Section 8.

2. RELATED WORK
Many works have appeared exploring the design space of CMPs from the point of view of different metrics and application domains. This paper extends previous works in several respects. In detail:

Huh et al. [11] evaluate the impact of several design factors on performance. The authors discuss the interactions of core complexity, cache hierarchy, and available off-chip bandwidth. The paper focuses on a workload composed of single-threaded applications. It is shown that out-of-order cores are more effective than in-order ones. Furthermore, bandwidth limitations can force large L2 caches to be used, thereby reducing the number of cores. A similar study was conducted by Ekman et al. [12] for scientific parallel programs. Kaxiras et al. [13] deal with multimedia workloads, especially targeting DSPs for mobile phones. The paper discusses the energy efficiency of SMT and CMP organizations under given performance constraints. They show that both approaches can be more efficient than a single-threaded processor, and claim that SMT has some advantages in terms of power. Grochowski et al. [14] discuss how to achieve the best performance for scalar and parallel codes in a power-constrained environment. The idea proposed in that paper is to dynamically vary the amount of energy expended to process instructions according to the amount of parallelism available in the software. They evaluate several architectures, showing that a combination of voltage/frequency scaling and asymmetric cores represents the best approach. In [15], Li and Martinez study the power/performance implications of parallel computing on CMPs. They use a mixed analytical-experimental model to explore parallel efficiency, the number of processors used, and voltage/frequency scaling. They show that parallel computation can bring significant power savings when the number of processors used and the voltage/frequency scaling levels are properly chosen. These works [11-14] are orthogonal to ours. Our experimental framework is similar to that of [15, 16]. We also consider parallel applications and the same benchmarks as in [16].
The power model we use accounts for thermal effects on leakage energy, but, unlike the authors of [15, 16], we also provide a temperature-dependent model for the L2 caches. This makes it possible to capture some particular behaviors of temperature and energy consumption, since the L2 caches may occupy more than half of the chip area in some configurations. In [17], a comparison of SMT and CMP architectures is carried out. The authors consider performance, power, and temperature metrics. They show that SMT and CMP exhibit similar thermal behavior, but the sources of heating are different: localized hotspots for SMTs, global heating for CMPs. They found that CMPs are more efficient for computation-intensive applications, while SMTs perform better for memory-bound programs, due to the larger L2 cache available. The paper also discusses the suitability of several dynamic thermal management techniques for CMPs and SMTs. Li et al. [18] conduct a thorough exploration of the multi-dimensional design space for CMPs. They show that thermal constraints dominate other physical constraints such as pin bandwidth and power delivery. According to the authors, thermal constraints tend to favor shallow pipelines and narrower cores, and tend to reduce the optimal number of cores and the L2 cache size. The focus of [17, 18] is on single-threaded applications. In this paper, we target a different benchmark set, composed of explicitly parallel applications. In [19], Kumar et al. provide an analysis of several on-chip interconnections. The authors show the importance of considering the interconnect while designing a CMP and argue that careful co-design of the interconnection and the other architectural entities is needed. To the best of the authors' knowledge, the problem of architectural-level floorplanning has not been addressed for multicore architectures.
Several chip floorplans have been proposed as instruments for thermal/power models, but the interactions between floorplanning issues and the power/performance characteristics of CMPs have not been addressed up to now. In [20], Sankaranarayanan et al. discuss some issues related to chip floorplanning for single-core processors. In particular, the authors show that the chip temperature variation is sensitive to three main parameters: the lateral spreading of heat in the silicon, the power density, and the temperature-dependent leakage power. We also use these parameters to describe the thermal behavior of CMPs. The idea of distributing the microarchitecture to reduce temperature is exploited in [21]; in that work, the organization of a distributed front-end for clustered microarchitectures is proposed and evaluated. Ku et al. [22] analyze the inter-dependence of temperature and leakage energy in cache memories. They focus on single-core workloads. The authors analyze several low-power techniques and propose a temperature-aware cache management approach. In this paper, we account for temperature effects on leakage in the memory hierarchy, stressing its importance for making the best design decisions. This paper's contribution differs from the previous ones, since it combines power/performance exploration for parallel applications with the interactions with the chip floorplan. Our model takes into account thermal effects and the temperature dependence of leakage energy.

3. METRICS
This section describes the metrics that we use in this paper to characterize the performance and power efficiency of the CMP configurations we explore. We measure the performance of the system when running a given parallel application by using the execution time (or delay). This is computed as the time needed to complete the execution of the program. Delay is often preferred to IPC for multiprocessor studies, since it accounts for the differences in instruction count which may arise when different system configurations are used.
This mostly happens when the number of cores is varied and the dynamic instruction count changes to deal with a different number of parallel threads. Power efficiency is evaluated through the system energy, i.e. the energy needed to run a given application. We also account for thermal effects, and we therefore report the average temperature across the chip. Furthermore, in Section 7, we evaluate several alternative floorplans by means of the spatial distribution of the temperature (the thermal map of the chip). The Energy Delay Product (EDP) and the Energy Delay Squared Product (ED2P) are typically used to evaluate energy-delay trade-offs. EDP is usually preferred when targeting low-power devices, while ED2P is preferred when focusing on high-performance systems, since it gives higher priority to performance over power. In this paper, we use ED2P. Nevertheless, we provide a discussion of ED^nP optimality for the considered design space. We use concepts taken from [23]; in particular, we adopt the notion of hardware intensity to properly characterize the complexity of the different configurations.

4. DESIGN SPACE
We consider a shared-memory multiprocessor, composed of several independent out-of-order cores (each one with a private L2 cache), communicating on a shared bus. The detailed microarchitecture is discussed in Section 5. In this section, we focus on the target design space. This is composed of the following architectural parameters:

Number of cores (2, 4, 8). This is the number of cores present in the system.

Table 1: Chip area [mm²] for each configuration (issue width, #cores, L2 size). [The numeric values were lost in extraction.]

Issue width (2, 4, 6, 8). We modulate the number of instructions which can be issued in parallel to the integer/floating-point reservation stations. According to this parameter, many microarchitectural blocks are scaled; see Table 2 for details. The issue width is therefore an index of the core complexity.

L2 cache size (256, 512, 1024, 2048 KB). This is the size of the private L2 cache of each processor.

For each configuration in terms of number of cores and L2 cache size we consider a different chip floorplan. As in some related works on thermal analysis [5], the floorplan of the core is modeled on the Alpha EV6 (21264) [24]. The area of several units (register files, issue and Ld/St queues, rename units, and FUs) has been scaled according to size and complexity [5]. Each floorplan has been re-designed to minimize dead space and not to increase the delay of the critical loops. Figure 1 illustrates the methodology used to build the floorplans. In particular, to obtain a reasonable aspect ratio we defined two different layouts: one for the small caches, 256KB and 512KB (Figure 1.a), and another one for the large caches, 1024KB and 2048KB (Figure 1.b). The floorplans for 4 and 8 cores are built by using the 2-core floorplan as a base unit. The base unit for small caches is shown in Figure 1.a: core 0 (P0) and core 1 (P1) are placed side by side, and the L2 caches are placed in front of each core. In the base unit for large caches (Figure 1.b), the L2 is split in two pieces, one in front of the core and the other one beside it; in this way, each processor+cache unit is roughly square. For 4 and 8 cores this floorplan is replicated and possibly mirrored, trying to obtain the aspect ratio which best approximates 1 (a perfect square). Figures 1.c/d show how the 4-core and 8-core floorplans have been obtained from the 2-core and 4-core floorplans, respectively.
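The replicate-and-mirror construction described above amounts to searching the grid factorizations of a base unit for the bounding box closest to a square. A minimal sketch, with illustrative dimensions and a hypothetical function name (not taken from the paper):

```python
def best_tiling(unit_w, unit_h, n_units):
    """Return the (rows, cols) grid of n_units identical base units whose
    bounding box has the aspect ratio closest to 1 (a perfect square)."""
    best = None
    for rows in range(1, n_units + 1):
        if n_units % rows:
            continue  # keep only exact grid factorizations
        cols = n_units // rows
        w, h = cols * unit_w, rows * unit_h
        ratio = max(w, h) / min(w, h)  # >= 1; 1 means a perfect square
        if best is None or ratio < best[0]:
            best = (ratio, rows, cols)
    return best[1], best[2]

# A base unit twice as wide as tall (two cores side by side): four copies
# tile best as a 2x2 arrangement, i.e. a replicated-and-mirrored layout.
print(best_tiling(2.0, 1.0, 4))  # -> (2, 2)
```

The same search explains why an 8-unit chip is laid out by replicating the 4-core arrangement rather than lining all units up in a single row.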
For any cache/core configuration, each floorplan has the shape/organization outlined in Figure 1, but a different size (the sizes have been omitted for the sake of clarity). Table 1 shows the chip area for each design. For each configuration, the shared bus is placed according to the communication requirements; its size is derived from [19].

Figure 1: Chip floorplans for 2, 4, and 8 cores, with 256/512KB L2 caches (a) and 1024/2048KB L2 caches (b).

5. EXPERIMENTAL SETUP
The simulation infrastructure that we use in this paper is composed of a microarchitecture simulator modeling a configurable CMP, integrated power models, and a temperature modeling tool. The CMP simulator is SESC [8]. It models a multiprocessor composed of a configurable number of cores. Table 2 reports the main parameters of the simulated architectures. Each core is an out-of-order superscalar processor, with private caches (instruction L1, data L1, and a unified L2 for instructions and data). The 4-issue microarchitecture models the Alpha 21264 [24]. For the other issue widths, all the processor structures have been scaled accordingly, as shown in Table 2. Inter-processor communication develops on a high-bandwidth shared bus (57 GB/s), pipelined and clocked at half of the core clock (see Table 2). The coherence protocol acts directly among the L2 caches and is MESI snoopy-based. This protocol generates invalidate/write requests between the L1 and L2 caches to ensure the coherency of shared data. Memory ordering is ruled by a weak consistency model. The power model integrated in the simulator is based on Wattch [9] for the processor structures, CACTI [10] for the caches, and Orion [26] for the buses. The thermal model is based on HotSpot 3.0.2 [5]. HotSpot uses the dynamic power traces and the chip floorplan to drive the thermal simulation. As outputs, it provides the transient behavior and the steady-state temperature. According to [5], we chose a standard cooling solution featuring a thermal resistance of 1.0 K/W, and an ambient air temperature of 45°C.
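As a zeroth-order sanity check on the package parameters just quoted, a lumped (single-node) version of such a thermal model puts the steady-state die temperature at ambient plus thermal resistance times dissipated power. The 1.0 K/W and 45°C figures are the ones above; the power value is made up:

```python
def steady_state_temp(power_w, r_th=1.0, t_amb=45.0):
    """Lumped steady state: T = T_amb + R_th * P (one thermal node).
    HotSpot solves the same balance on a much finer RC network."""
    return t_amb + r_th * power_w

print(steady_state_temp(30.0))  # a hypothetical 30 W chip -> 75.0 degC
```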
We carefully select the initial temperature values to ensure that the thermal simulation converges to a stationary state. HotSpot has been augmented with a temperature-dependent leakage model, based on [27]. This model accounts for the different transistor densities of functional units and memories. The temperature dependency is modeled by varying the amount of leakage according to the proper exponential distribution, as in [27]. At each iteration of the thermal simulation, the leakage contribution is calculated and injected into each HotSpot grid element. Table 3 lists the benchmarks we selected. They are 3 scientific applications from the Splash-2 suite [28] and the MPEG encoder/decoder from ALPbench [29]. All benchmarks have been run to completion and statistics have been collected on the whole program run,

after skipping the initialization. For the thermal simulations, we used the power trace of the whole benchmark run, dumped at a fixed cycle interval. We used the standard data sets for Splash-2, while we limited the MPEG runs to 10 frames. Table 3 shows the total number of graduated instructions for each application. This number is fairly constant across all the simulated configurations for all the benchmarks, except for FMM and VOLREND. Their variation (around 3%) is related to the different thread-spawning and synchronization overhead as the number of cores is scaled.

Table 2: Architecture configuration. The four columns correspond to the 2-, 4-, 6-, and 8-issue cores; the issue width is used as the representative of a core configuration throughout the paper. [Several numeric entries were lost in extraction.]
- Processor: Vdd 0.9V; core area (w/ L1$) [mm²]; branch penalty 7 cycles; branch unit: BTB (1K entries), Alpha-style hybrid branch predictor (3.7KB), RAS (32 entries)
- Fetch/issue/retire width: 2 / 4 / 6 / 8
- Issue queue, INT/FP instruction windows, INT/FP registers, ROB: scaled with the issue width
- LdSt/Int/FP units: 1/2/1, 1/4/2, 1/6/3, 1/8/4
- Ld/St queue entries: 16/16, 32/32, 48/48, 64/64
- IL1 and DL1: private, split instruction/data L1 caches; DL1 write-through
- Unified L2: 256KB / 512KB / 1024KB / 2048KB, write-back; coherence protocol: MESI
- CMP: 2/4/8 cores; shared bus at 1.5GHz, bandwidth 57 GB/s

Table 3: Benchmarks (description and problem size; the graduated instruction counts, in millions, were partially lost in extraction).
- FMM: Fast Multipole Method; 16K particles
- mpegdec: MPEG-2 decoder; flowg (Stanford), 10 frames
- mpegenc: MPEG-2 encoder; flowg (Stanford), 10 frames
- VOLREND: volume rendering using ray casting; head dataset, 5 viewpoints
- WATER-NS: forces and potentials of water molecules; 512 molecules, 5 steps

6. PERFORMANCE AND POWER EVALUATION
In this section, we evaluate the simulated CMP configurations from the point of view of performance and energy.
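Before turning to the results, the metrics of Section 3 can be made concrete. Delay is preferred to IPC precisely because the graduated instruction count changes with the configuration (Table 3), and the ED^nP family then ranks configurations. All numbers below are invented for illustration:

```python
def delay_s(instructions, ipc, freq_hz):
    """Execution time = instructions / (IPC * frequency); unlike raw IPC,
    it folds in a differing dynamic instruction count."""
    return instructions / (ipc * freq_hz)

def ed_n_p(energy_j, delay, n=2):
    """Energy * Delay^n: n=1 is EDP, n=2 the ED2P used in this paper."""
    return energy_j * delay ** n

# Hypothetical runs: the 8-core one executes ~4% more (synchronization)
# instructions, but at a much higher aggregate IPC.
d8 = delay_s(1.04e9, 8.0, 3e9)
d2 = delay_s(1.00e9, 2.5, 3e9)
assert d8 < d2  # delay accounts for the extra instructions

# ED2P can favor the faster configuration even at a higher energy cost.
print(ed_n_p(15.0, d8) < ed_n_p(10.0, d2))  # -> True
```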
We consider the interdependence of these metrics as well as the chip physical characteristics (temperature and floorplan). We base this analysis on average values across all the simulated benchmarks, apart from Section 6.4, which presents a characterization of each benchmark.

Figure 2: Execution time (delay). Each group of bars (L2 cache varying) is labeled with <#procs>-<issue width>.

6.1 Performance
Figure 2 shows the average execution time for all the configurations, in terms of cores, issue width, and L2 cache. Results in terms of IPC show the same behavior. It can be observed that when scaling the system from 2 to 4, and from 4 to 8 processors, the delay is reduced by almost half at each step. This trend is homogeneous across the other parameter variations. It means that these applications achieve high parallel efficiency, and that the communication overhead is not appreciable. By varying the issue width, for a given processor count and L2 size, the largest speedup is observed when moving from 2- to 4-issue. If the issue width is increased further, the delay improvement saturates (5.5% from 4- to 6-issue, and 3.3% from 6- to 8-issue). This is because of the diminishing returns of ILP. The same trend is also seen (though not as dramatically) for the 4- and 8-core configurations. The impact of the L2 cache size is at most 4-5% (from 256KB to 2048KB) when the other parameters are fixed. Each time the L2 size is doubled, the speedup is around 1-2%. This behavior is also orthogonal to the other parameters, since the trend can be seen across all configurations.

6.2 Energy
In this section, we analyze the effect of varying the number of cores, the issue width, and the L2 cache size on the system energy.

Number of cores. Figure 3 reports the energy behavior. It can be seen that, when the number of processors is increased, the system energy increases only slightly, although the area nearly doubles as the core count doubles.
For example, for a 4-issue 1024KB configuration, the energy increase rate is about 5% (both from 2 to 4 cores, and from 4 to 8 cores). In fact, the leakage energy increase due to the larger chip is mitigated by the reduction of the power-on time (i.e., the time during which the chip is leaking is shorter). On the other hand, when considering configurations with a given resource budget and similar area (see Table 1), the energy typically decreases. For a total issue width of 16 and 2MB of total cache, the energy decreases by 37% from 2 to 4 cores. From 4 to 8 cores it increases slightly, but the 8-core chip is also larger. In this case, the shorter delay makes the difference, letting relatively large CMPs achieve good energy efficiency.

Issue width. By varying the issue width and, therefore, the core

complexity, energy has a minimum at 4-issue. This occurs for 2, 4, and 8 processors. The 4-issue cores offer the best balance between complexity, power consumption, and performance. The 2-issue microarchitecture cannot efficiently extract the available IPC (leakage, due to the longer delay, dominates). The 6- and 8-issue processor cores need much more energy to extract little more parallelism than the 4-issue ones, and are therefore quite inefficient.

L2 size. The L2 cache is the major contributor to chip area, so it affects leakage energy considerably. The static energy component in our simulations ranges from 37% (2P, 2-issue, 256KB L2) to over 70% (8P, 8-issue, 2048KB L2). Figure 4 shows the leakage energy and the core/L2 leakage breakdown. Most of the leakage is due to the L2 caches, and this component scales linearly with the L2 cache size. The static power of the core increases as the issue width and the number of cores are increased. Nevertheless, this effect is mitigated by larger L2 caches, since they enhance the heat-spreading capabilities of the chip (the overall temperature decreases, resulting in less leakage energy). The average chip temperature is shown in Figure 5. Although the absolute variation is at most 1.7°C, even a small temperature variation can correspond to an appreciable energy change. In [27], SPICE simulation results are reported showing that 1°C translates into around 4-5% leakage variation.

Figure 3: Energy. Each group of bars (L2 cache varying) is labeled with <#procs>-<issue width>.
Figure 4: Leakage energy. Each bar is labeled with <#issue>-<L2 size>. The core/L2 leakage breakdown is represented by a dash superposed on each bar.
Figure 5: Average chip temperature. Each group of bars (L2 cache varying) is labeled with <#procs>-<issue width>.
Figure 6: Energy-Delay scatter plot for the simulated configurations, with constant-ED^nP curves (EDP, ED^1.5P, ED2P).
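The temperature dependence of leakage used throughout this section can be sketched with a simple exponential model. The exponent below is fitted to the 4-5% per °C figure quoted above; the reference temperature, power values, and function names are hypothetical:

```python
import math

K_PER_DEGC = 0.045  # ~4.5%/degC, the middle of the 4-5% range above

def leakage_power(p_ref_w, temp_c, t_ref_c=45.0):
    """Subthreshold leakage grows exponentially with temperature:
    P_leak(T) = P_ref * exp(k * (T - T_ref))."""
    return p_ref_w * math.exp(K_PER_DEGC * (temp_c - t_ref_c))

def system_energy(p_dyn_w, p_leak_ref_w, temp_c, delay_s):
    """Energy = (dynamic + temperature-scaled leakage power) * power-on
    time: a shorter run leaks for less time, which is why doubling the
    core count (more leaky area, almost half the delay) costs so little."""
    return (p_dyn_w + leakage_power(p_leak_ref_w, temp_c)) * delay_s

# Cooling the chip by 1 degC trims leakage power by about 4.4%.
drop = 1 - leakage_power(10.0, 44.0) / leakage_power(10.0, 45.0)
print(round(100 * drop, 1))  # -> 4.4
```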
Regarding our simulations, the temperature reduction due to the caches lowers the core leakage energy, as shown in Figure 4. In this case, a 5% (core leakage) energy reduction corresponds to a 0.5°C reduction of the average chip temperature. On the contrary, leakage in the cache is much more sensitive to area than to temperature, so the leakage increase due to the area cannot be compensated by the temperature decrease.

6.3 Energy Delay
Figure 6 reports the Energy-Delay scatter plot for the design space, together with several curves of constant ED^nP, for n = 1, 1.5, and 2. Furthermore, we connected with a line all the points of the energy-efficient family [23]. In the Energy-Delay plane, these points form the convex hull of all possible configurations of the design space. The best configuration from the Energy-Delay point of view is

the 8-core 4-issue 256KB-L2 one. This is optimal for ED^nP with n < 2.5, as can be intuitively observed in Figure 6. For n >= 2.5, the optimum moves to 512KB-L2 and therefore to the points of the energy-efficient family with higher hardware intensity (the hardware intensity is defined as the ratio of the relative increase in energy to the corresponding relative gain in performance achievable by an architectural modification; see [23] for a detailed discussion). When minimizing energy, the optimal solution is the 2-core 4-issue 256KB-L2 one (low hardware intensity). The 2-issue configurations are penalized by poor performance, while the 4-core 4-issue 256KB-L2 one represents an interesting trade-off, featuring only 5.5% worse energy and a noticeably better delay. Nevertheless, one could also consider other metrics, such as area (see Table 1) or average temperature (see Figure 5). In the first case (area), an architecture with a smaller core count can be preferred: the 4-core 4-issue 256KB-L2 design needs a considerably smaller die than the optimal 8-core one, which in turn features only a slightly better delay. Otherwise, if the chip temperature is to be minimized, larger caches can be preferred: the 8-core 4-issue 2048KB-L2 configuration features a 0.7°C lower average temperature. This comes at the expense of significant leakage power consumption in the L2 caches (see Figure 4).

Figure 7: ED2P (Energy*Delay² [Js²]). Each group of bars (L2 cache varying) is labeled with <#procs>-<issue width>.
Figure 8: IPC for each benchmark.
Figure 9: Energy per cycle for each benchmark.
Figure 10: Average temperature for each benchmark.
(Figures 8-10 compare the 8-proc 4-issue 256KB-L2 and the 4-proc 8-issue 512KB-L2 configurations.)

In Figure 6, we represent as squares the points with 2MB of total L2 cache (equivalent to the optimum). They form 3 clusters, each one corresponding to a core/cache configuration: 8/256KB, 4/512KB, and 2/1024KB.
These points show that increasing the number of cores while decreasing the per-processor L2 size improves the energy-delay behavior. Furthermore, when considering the same total issue width (8-core 4-issue and 4-core 8-issue), the advantage of reducing the issue width while increasing the core count becomes evident: the parallelism of the applications in the considered benchmark set is exploited much better by multicore architectures than by complex monolithic cores. It can also be observed that each cluster progressively rotates when moving towards the optimum, making the different issue widths more similar from the point of view of delay. Figure 7 shows the ED2P for the simulated configurations. ED2P strongly favors high-performance architectures: the 8-core CMPs, regardless of the other parameters, are better than any other architecture, from the large-cache configurations up to the EDP optimum. These results show that the rule of thumb that configurations offering the same ILP (#cores x issue width) have similar energy/performance trade-offs is not totally accurate when considering ED2P (it still holds for EDP). For example, the 8-issue 4-core 512KB CMP is equivalent to the 4-issue 8-core 256KB CMP in terms of hardware budget, but not in ED2P, even though both configurations have the same total issue width and the same total L2 capacity. The area budget, however, favors the 4-core design over the 8-core one. Cache size has a relatively small impact on performance (see Figure 2), while it is important for energy. For the considered data sets, 256KB of L2 cache per processor can be considered enough; moreover, it minimizes energy. It should be pointed out that the L2 size affects the core leakage and the L2 leakage differently (see Figure 4). Intelligent cache management, aware of which data are actually used, could considerably reduce the dependence of the leakage on the L2 size, making larger caches more attractive when considering energy/delay trade-offs.
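The energy-efficient family connected in Figure 6 can be approximated in code by a Pareto filter over (delay, energy) points; the configuration values below are invented for illustration:

```python
def efficient_family(points):
    """Keep the (delay, energy) points not dominated by any other point
    that is no worse in both metrics; sorted by delay, they trace the
    lower-left frontier of an energy-delay scatter plot."""
    def dominated(p):
        return any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in points)
    return sorted(p for p in points if not dominated(p))

# Hypothetical (delay, energy) samples: (1.6, 7.0) is strictly worse
# than (1.5, 6.0), so it drops out of the family.
configs = [(1.0, 9.0), (1.5, 6.0), (2.0, 5.0), (1.6, 7.0)]
print(efficient_family(configs))  # -> [(1.0, 9.0), (1.5, 6.0), (2.0, 5.0)]
```

Sweeping n in ED^nP then just picks different points along this frontier, which is why the optimum shifts with n as described above.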
6.4 Per-Benchmark Results
Figures 8, 9, and 10 show the IPC, the energy per cycle, and the average temperature for each benchmark. Results are reported for two configurations: the optimal one (4-issue, 8 procs., 256KB L2) and the one with the equivalent total cache/issue budget, i.e. 8-issue, 4 procs., 512KB L2. The benchmarks have been selected because they represent two different kinds of workload. The MPEG decoder and encoder are integer video applications, stressing the integer units and the memory system. FMM, VOLREND, and WATER-NS are floating-point scientific benchmarks, each featuring different characteristics in terms of IPC and energy. The IPC varies widely across the benchmarks: that of mpegenc and FMM is much lower than that of VOLREND and mpegdec (3-5 vs 5-6), while WATER-NS stands out with a really high IPC (9-10). On the other hand, the energy per cycle does not vary much across the benchmarks (2-3 nJ/cycle), and it is also quite similar for the 4-core and the 8-core CMPs. The temperature variability across the benchmarks is 1.7°C. The temperature difference between the two architectures reported in the graphs is about 0.5°C for mpegenc, FMM, VOLREND, and mpegdec, while it is negligible for WATER-NS.

7. TEMPERATURE SPATIAL DISTRIBUTION
This section discusses the spatial distribution of the temperature (i.e. the chip thermal map) and its relationship with the floorplan design. We first analyze several different floorplan designs for CMPs; we then discuss the temperature distribution while varying the processor core complexity.

7.1 Evaluation of Floorplans
The main goal of this section is to discuss how floorplan design at the system level can impact the chip thermal behavior. Several floorplan topologies, featuring different core/cache placements, are analyzed. We reduce the scope of this part of the work to 4-core, 4-issue CMPs with L2 cache sizes of 256KB and 1024KB; similar results are obtained for the other configurations. Figure 11 shows the additional floorplans we have considered in addition to those in Figure 1.
Floorplans are classified with respect to the processor position on the die:
- Paired (Figure 1). Cores are paired and placed on alternate sides of the die.
- Lined up (Figures 11.b and 11.d). Cores are lined up on one side of the die.
- Centered (Figures 11.a and 11.c). Cores are placed in the middle of the die.

Area and aspect ratio are roughly equivalent for each cache size and topology. We report the thermal map for each floorplan type and cache size. This is the stationary solution obtained by HotSpot. We use the grid model, making sure that the grid spacing is the same for every floorplan. The grid spacing determines the magnitude of the hotspots, since each grid point carries roughly the average temperature of the small area surrounding it; our constant grid-spacing setup ensures that the hotspot magnitudes are comparable across designs. The power traces are from the WATER-NS benchmark. We selected this application as representative of the behavior of all the benchmarks; the conclusions drawn in this analysis apply to each simulated application.

Figure 11: Additional floorplans taken into account for the evaluation of the spatial distribution of the temperature: (a) 256KB centered, (b) 256KB lined up, (c) 1024KB centered, (d) 1024KB lined up.

Table 4: Average and maximum chip temperature for the different floorplans (paired, lined up, centered) and the two L2 sizes (256KB, 1024KB). [The numeric values were lost in extraction.]

For each thermal map (see, e.g., Figure 12), several common phenomena can be observed. A temperature peak corresponds to each processor; this is due to the hotspots in the core. The nature of the hotspots will be analyzed in the next section. The caches (in particular the L2) and the shared bus are cooler, apart from the portions which are heated by their proximity to hot units. The paired floorplans for 256KB (Figure 12) and 1024KB (Figure 15) show similar thermal characteristics.
The temperature of the two hotspots of each core (the hottest is the FP unit, the other one is the Integer Window coupled with the Ld/St queue) ranges from 73.1/71. to 7./7 °C for 5KB L2. The hottest peaks are the ones closest to the die corners (the FP units on the right side), since they have fewer sides to spread heat into the silicon. For 1KB L2, the hotspot temperatures are between 7.5/71.5 and 73./7. In this case the hottest peaks are between each core pair. This is because of thermal coupling between the processors. The same phenomena appear for the lined up floorplans (Figure 1 and Figure 15). The rightmost hotspots suffer from the corner effect, while the inner ones from thermal coupling. Figure 1 shows an alternative design. Here the processors are placed in the center of the die (see Figure 11.a). Their orientation is also different, since they are back to back. As can be observed in the table, the maximum temperature is lowered significantly (1./1. °C). In this case, this is due to the increased spreading perimeter of the processors: two sides face the surrounding silicon, and another side faces the bus (the bus goes across the die, see Figure 11.a). The centered floorplans are the only ones with a fairly hot bus. In this case, leakage in the bus can be significant. Furthermore, for this floorplan, the temperature between the backs of the processors is quite high (9. °C), unlike the other floorplans, which feature relatively cold backs (7.9 °C). For the 1KB-centered floorplan (Figure 17) these characteristics are less evident. The maximum temperature decrease is only ./1. °C with respect to the paired and lined up floorplans. In particular, it is small compared with the paired one. This is reasonable considering that, in the paired floorplan, the cache (1KB) surrounds each pair of cores, therefore providing enough spreading area. Overall, the choice of the cache size and the relative placing of
Figure 1: Thermal map for the 5KB-paired floorplan (Figure 1)
Figure 13: Thermal map for the 5KB-lined up floorplan (Figure 11.b)
Figure 1: Thermal map for the 5KB-centered floorplan (Figure 11.a)
Figure 15: Thermal map for the 1KB-lined up floorplan (Figure 11.d)
Figure 1: Thermal map for the 1KB-paired floorplan (Figure 1)
Figure 17: Thermal map for the 1KB-centered floorplan (Figure 11.c)

the processors can be important in determining the chip thermal map. Layouts where the processors are placed in the center of the die and the L2 caches surround the cores typically feature a lower peak temperature. The magnitude of the temperature decrease, with respect to alternative floorplans, is limited to approximately 1- °C. The impact on the system energy can be considered negligible, since L2 leakage dominates. Nevertheless, such a reduction of the hotspot temperature can lead to a leakage reduction localized at the hotspot sources.

7.2 Processor Complexity

In Figure 1, several thermal maps for different issue widths of the processors are reported. We selected the WATER-NS benchmark, a floating point one, since this enables us to discuss heating in both the FP and the Integer units. The CMP configuration is -core 5KB-L2. The hottest units are the FP ALU, the Ld/St queue and the Integer I-Window. Depending on the issue width, the temperatures of the Integer Window and the Ld/St queue vary. For - and -issue, the FP ALU dominates, while as the issue width is scaled up the Integer Window and the Ld/St queue become the hottest units of the die. This is due to the super-linear increase of power density in these structures. Hotspot magnitude depends on the issue width, i.e. the power density of the microarchitecture blocks, and on the compactness of the floorplan. The -issue chip exhibits higher temperature peaks than the -issue one. This is because the floorplan of the -issue core has many cold units surrounding the hotspots (e.g. see the FP ALU hotspots).
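Power density, rather than raw power, is the first-order hotspot predictor invoked above: small, heavily exercised structures such as the FP ALU or the Ld/St queue dominate even though a large L2 cache draws more total power. A hedged sketch of this bookkeeping, where all block names, powers, and areas are hypothetical:

```python
def power_density(blocks):
    """Rank microarchitecture blocks by power density (W/mm^2), hottest first.

    blocks -- dict mapping block name -> (power_w, area_mm2).
    The numbers used below are illustrative, not measured values.
    """
    return sorted(((p / a, name) for name, (p, a) in blocks.items()),
                  reverse=True)

blocks = {
    "FP_ALU": (1.2, 0.6),   # small, heavily exercised -> high density
    "LdStQ":  (0.9, 0.5),
    "IntQ":   (0.8, 0.5),
    "L2":     (4.0, 40.0),  # highest total power, but spread over a large area
}
ranking = power_density(blocks)
```

The ranking puts the small execution structures first and the L2 last, matching the observation that caches stay cool despite their total power.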
Interleaving hot and cold blocks is an effective method to provide spreading silicon to hot areas. Thermal coupling of hotspots exists for various units shown in Figure 1, and it is often caused by inter-processor interaction. For example, in the -issue floorplan, the LdSt queue of the left processor is thermally coupled with the FP ALU of the right one. In the -issue, the Integer Execution Unit is warmed by the coupling between the LdSt queue and the FP ALU of the two cores. As far as intra-processor coupling is concerned, it can be observed in the -issue design, between the FP ALU and the FP register file, and in the -issue, between the Integer Window and the LdSt queue.

7.3 Discussion

Different factors determine the hotspots in the processors. The power density of each microarchitecture unit provides a proxy to temperature, but does not suffice to explain effects due to the thermal
coupling and spreading area.

Figure 1: Thermal maps of the -core 5KB-L2 CMP for VOLREND, varying issue width (from left to right: -, -, -, and -issue)

It is important to account for the geometric characteristics of the floorplan. We summarize those impacting chip heating as follows:

Proximity of hot units. If two or more hotspots come close, this will produce thermal coupling and therefore raise the temperature locally.

Relative positions of hot and cold units. A floorplan interleaving hot and cold units will result in a lower global power density (and therefore a lower temperature).

Available spreading silicon. Units placed in a position that limits their spreading perimeter will exhibit a higher temperature, e.g. units placed in a corner of the die.

These principles, all related to the common idea of lowering the global power density, apply to the core microarchitecture, as well as to the CMP system architecture.
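These principles can be turned into a crude floorplan score: penalize hot blocks that sit close to each other and hot blocks with little spreading margin to the die edges (interleaving hot and cold units falls out of the distance term). Everything here, names, coordinates, densities, and the threshold, is hypothetical; it is a sketch of the heuristic, not a validated thermal model:

```python
import math

def thermal_risk(units, hot_threshold=1.5, die=(10.0, 10.0)):
    """Score a floorplan against the geometric principles above.

    units -- list of (name, x, y, density) block centroids (mm, W/mm^2).
    Higher score = worse. Penalizes (a) hot units placed close together
    (thermal coupling) and (b) hot units near die edges/corners, which
    have little surrounding silicon to spread heat into.
    """
    hot = [(n, x, y, d) for n, x, y, d in units if d >= hot_threshold]
    score = 0.0
    for i, (_, x1, y1, d1) in enumerate(hot):      # (a) hotspot proximity
        for _, x2, y2, d2 in hot[i + 1:]:
            score += d1 * d2 / (1.0 + math.hypot(x1 - x2, y1 - y2))
    w, h = die
    for _, x, y, d in hot:                          # (b) spreading margin
        margin = min(x, y, w - x, h - y)
        score += d / (1.0 + margin)
    return score

# Hot units clustered in a die corner vs. spaced through the die center:
corner = [("FP0", 0.5, 0.5, 2.0), ("LSQ0", 1.0, 0.8, 1.8)]
centered = [("FP0", 4.0, 5.0, 2.0), ("LSQ0", 6.0, 5.0, 1.8)]
```

The corner layout scores worse than the centered one, in line with the preference for centered, cache-surrounded placements observed in the floorplan evaluation.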
In the first case, they suggest not placing hot units, such as register files and instruction queues, too close to each other. For CMPs, they can be translated as follows:

Proximity of the cores. The interaction between two cores placed side-by-side can generate thermal coupling between some units of the processors.

Relative position of the cores and the L2 caches. If the L2 caches are placed so as to surround the cores, this results in better heat spreading and a lower chip temperature.

Position of the cores in the die. Processors placed in the center of the die offer better thermal behavior than processors in the die corners or along the die edges.

8. CONCLUSIONS AND FUTURE WORK

In this paper, we have discussed the impact of the choice of several architectural parameters of CMPs on power/performance/thermal metrics. Our conclusions apply to multicore architectures composed of processors with private L2 caches and running parallel applications. Our contributions can be summarized as follows:

The choice of the L2 cache size is not only a matter of performance/area, but also of temperature.

According to EDP, the best CMP configuration consists of a large number of fairly narrow cores. Wide cores can be considered too power hungry to be competitive.

Floorplan geometric characteristics are important for chip temperature. The relative position of processors and caches must be carefully determined to avoid hotspot thermal coupling.

Several other factors, such as area, yield and reliability, may affect design choices as well. At the same time, orthogonal architectural alternatives such as heterogeneous cores, L3 caches, and complex interconnects might be present in future CMPs. Nevertheless, evaluating all these alternatives is left for future research.

9.
ACKNOWLEDGMENTS

This work has been supported by the Generalitat de Catalunya under grant SGR-95, the Spanish Ministry of Education and Science under grants TIN-7739-C-1, TIN-37 and HI5-99, and the Ministero dell'Istruzione, dell'Università e della Ricerca under the MAIS-FIRB project.

10. REFERENCES

[1] C. McNairy and R. Bhatia. Montecito: A dual-core, dual-thread Itanium processor. IEEE Micro, pages 1, March/April 5.
[2] P. Kongetira, K. Aingaran, and K. Olukotun. Niagara: A 32-way multithreaded Sparc processor. IEEE Micro, pages 1 9, March/April 5.
[3] R. Kalla, B. Sinharoy, and J. M. Tendler. IBM Power5 chip: A dual-core multithreaded processor. IEEE Micro, pages 7, March/April.
[4] Trevor Mudge. Power: A first class constraint for future architectures. In HPCA: Proceedings of the th International Symposium on High-Performance Computer
Architecture, Washington, DC, USA. IEEE Computer Society.
[5] Kevin Skadron, Mircea R. Stan, Karthik Sankaranarayanan, Wei Huang, Sivakumar Velusamy, and David Tarjan. Temperature-aware microarchitecture: Modeling and implementation. ACM Trans. Archit. Code Optim., 1(1):9 15.
[6] Intel. White paper: Superior performance with dual-core. Technical report, Intel, 5.
[7] Kelly Quinn, Jessica Yang, and Vernon Turner. The next evolution in enterprise computing: The convergence of multicore x86 processing and 64-bit operating systems. White paper, Advanced Micro Devices Inc., April 5.
[8] Jose Renau, Basilio Fraguela, James Tuck, Wei Liu, Milos Prvulovic, Luis Ceze, Smruti Sarangi, Paul Sack, Karin Strauss, and Pablo Montesinos. SESC simulator, January 5.
[9] David Brooks, Vivek Tiwari, and Margaret Martonosi. Wattch: A framework for architectural-level power analysis and optimizations. In Proc. of the 27th International Symposium on Computer Architecture, pages 3 9. ACM Press.
[10] Premkishore Shivakumar and Norman P. Jouppi. CACTI 3.0: An integrated cache timing, power, and area model. Technical Report 1/, Western Research Laboratory, Compaq, 1.
[11] J. Huh, D. Burger, and S. Keckler. Exploring the design space of future CMPs. In PACT 1: Proceedings of the 10th International Conference on Parallel Architectures and Compilation Techniques, pages 199 1, Washington, DC, USA, 1. IEEE Computer Society.
[12] M. Ekman and P. Stenstrom. Performance and power impact of issue-width in chip-multiprocessor cores. In ICPP 3: Proceedings of the 3 International Conference on Parallel Processing, Washington, DC, USA, 3. IEEE Computer Society.
[13] Stefanos Kaxiras, Girija Narlikar, Alan D. Berenbaum, and Zhigang Hu. Comparing power consumption of an SMT and a CMP DSP for mobile phone workloads. In CASES 1: Proceedings of the 1 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, pages 11, New York, NY, USA, 1. ACM Press.
[14] Ed Grochowski, Ronny Ronen, John Shen, and Hong Wang. Best of both latency and throughput. In ICCD: Proceedings of the IEEE International Conference on Computer Design, pages 3 3, Washington, DC, USA. IEEE Computer Society.
[15] J. Li and J. F. Martinez. Power-performance implications of thread-level parallelism on chip multiprocessors. In ISPASS 5: Proceedings of the 5 International Symposium on Performance Analysis of Systems and Software, pages 1 13, Washington, DC, USA, 5. IEEE Computer Society.
[16] J. Li and J. F. Martinez. Dynamic power-performance adaptation of parallel computation on chip multiprocessors. In HPCA: Proceedings of the 12th International Symposium on High Performance Computer Architecture, Washington, DC, USA. IEEE Computer Society.
[17] Yingmin Li, David Brooks, Zhigang Hu, and Kevin Skadron. Performance, energy, and thermal considerations for SMT and CMP architectures. In HPCA 5: Proceedings of the 11th International Symposium on High-Performance Computer Architecture, pages 71, Washington, DC, USA, 5. IEEE Computer Society.
[18] Y. Li, B. Lee, D. Brooks, Z. Hu, and K. Skadron. CMP design space exploration subject to physical constraints. In HPCA: Proceedings of the 12th International Symposium on High Performance Computer Architecture, Washington, DC, USA. IEEE Computer Society.
[19] Rakesh Kumar, Victor Zyuban, and Dean M. Tullsen. Interconnections in multi-core architectures: Understanding mechanisms, overheads and scaling. In ISCA 5: Proceedings of the 32nd Annual International Symposium on Computer Architecture, pages 19, Washington, DC, USA, 5. IEEE Computer Society.
[20] K. Sankaranarayanan, S. Velusamy, M. Stan, and K. Skadron. A case for thermal-aware floorplanning at the microarchitectural level. Journal of Instruction Level Parallelism, 5.
[21] Pedro Chaparro, Grigorios Magklis, Jose Gonzalez, and Antonio Gonzalez. Distributing the frontend for temperature reduction.
In HPCA 5: Proceedings of the 11th International Symposium on High-Performance Computer Architecture, pages 1 7, Washington, DC, USA, 5. IEEE Computer Society.
[22] Ja Chun Ku, Serkan Ozdemir, Gokhan Memik, and Yehea Ismail. Thermal management of on-chip caches through power density minimization. In MICRO 38: Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture, pages 3 93, Washington, DC, USA, 5. IEEE Computer Society.
[23] V. Zyuban and P. N. Strenski. Balancing hardware intensity in microprocessor pipelines. IBM Journal of Research & Development, 47(5-6):55 59, 3.
[24] R. E. Kessler. The Alpha 21264 microprocessor. IEEE Micro, 19(2).
[25] Subbarao Palacharla, Norman P. Jouppi, and J. E. Smith. Complexity-effective superscalar processors. In ISCA 97: Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 1, New York, NY, USA. ACM Press.
[26] Hang-Sheng Wang, Xinping Zhu, Li-Shiuan Peh, and Sharad Malik. Orion: A power-performance simulator for interconnection networks. In Proc. of the 35th Annual ACM/IEEE International Symposium on Microarchitecture, pages 9 35.
[27] Weiping Liao, Lei He, and K. M. Lepak. Temperature and supply voltage aware performance and power modeling at microarchitecture level. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, (7):1.
[28] Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta. The SPLASH-2 programs: Characterization and methodological considerations. In ISCA 95: Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 3, New York, NY, USA. ACM Press.
[29] Man-Lap Li, Ruchira Sasanka, Sarita V. Adve, Yen-Kuang Chen, and Eric Debes. The ALPBench benchmark suite for complex multimedia applications. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC-5), Washington, DC, USA, 5. IEEE Computer Society.