Design Space Exploration for Multicore Architectures: A Power/Performance/Thermal View

Matteo Monchiero (1), Ramon Canal (2), Antonio González (2,3)
(1) Dipartimento di Elettronica e Informazione, Politecnico di Milano, Via Ponzio 34/5, 20133 Milano, Italy. monchier@elet.polimi.it
(2) Dept. of Computer Architecture, Universitat Politècnica de Catalunya, C. Jordi Girona 1-3, 08034 Barcelona, Spain. rcanal@ac.upc.edu
(3) Intel Barcelona Research Center, Intel Labs - Universitat Politècnica de Catalunya, Barcelona, Spain. antonio.gonzalez@intel.com

ABSTRACT
Multicore architectures are ruling the recent microprocessor design trend. This is due to several reasons: better performance, the thread-level parallelism bounds of modern applications, the diminishing returns of ILP, better thermal/power scaling (many small cores dissipate less than a single large and complex one), and ease and reuse of design. This paper presents a thorough evaluation of multicore architectures. The architecture we target is composed of a configurable number of cores, a memory hierarchy consisting of private L1 and L2 caches, and a shared bus interconnect. We consider parallel shared-memory applications. We explore the design space spanned by the number of cores, the L2 cache size and the processor complexity, showing how the different configurations/applications behave with respect to performance, energy consumption and temperature. Design trade-offs are analyzed, stressing the interdependence of the metrics and the design factors. In particular, we evaluate several chip floorplans. Their power/thermal characteristics are analyzed, showing the importance of considering thermal effects at the architectural level to reach the best design choice.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ICS'06, June 28-30, 2006, Cairns, Queensland, Australia. Copyright 2006 ACM.

1. INTRODUCTION
Main semiconductor firms have recently been proposing microprocessor solutions composed of a few cores integrated on a single chip [1-3]. This approach, named Chip Multiprocessor (CMP), makes it possible to deal efficiently with the power/thermal issues dominating deep sub-micron technologies, and makes it easy to exploit the thread-level parallelism of modern applications. Power has been recognized as a first-class design constraint [4], and many works in the literature target the analysis and optimization of power consumption. Issues related to chip thermal behavior have been faced only recently, but have emerged as one of the most important factors determining a wide range of architectural decisions [5]. This paper targets a CMP system composed of multiple cores, each one with a private L1 and L2 hierarchy. This approach has been adopted in some industrial products, e.g. the Intel Montecito [1] and Pentium D [6], and the AMD Athlon64 X2 [7]. It maximizes design reuse, since this architectural style does not require a re-design of the secondary cache, as a shared-L2 architecture would. Unlike many recent works on design space exploration of CMPs, we consider parallel shared-memory applications, which can be considered the natural workload for a small-scale multiprocessor. We target several scientific programs and multimedia ones.
Scientific applications represent the traditional benchmark for multiprocessors, while multimedia programs represent a promising way of improving parallelism in everyday computing. Our experimental framework consists of a detailed microarchitectural simulator [8], integrated with the Wattch [9] and CACTI [10] power models. Thermal effects have been modeled at the functional-unit granularity by using HotSpot [5]. This environment makes a fast and accurate exploration of the target design space possible. The main contribution of this work is the analysis of several energy/performance design trade-offs when varying core complexity, L2 cache size and number of cores, for parallel applications. In particular, we discuss the interdependence of energy/thermal efficiency, performance and the architectural-level chip floorplan. Our findings can be summarized as follows:
We show that the L2 cache is an important factor in determining chip thermal behavior.
We show that large CMPs of fairly narrow cores are the best solution for energy-delay.
We show that the geometric characteristics of the floorplan are important for chip temperature.
This paper is organized as follows. Section 2 presents the related work. Metrics are defined in Section 3. The target design space, as well as the description of the considered architecture, is presented in Section 4. The experimental framework used in this paper is introduced in Section 5. Section 6 discusses performance/energy/thermal results for the proposed configurations. Section 7 presents an analysis of the spatial distribution of the temperature for selected chip floorplans. Finally, conclusions are drawn in Section 8.

2. RELATED WORK
Many works have appeared exploring the design space of CMPs from the point of view of different metrics and application domains. This paper extends previous works in several respects. In detail:

Huh et al. [11] evaluate the impact of several design factors on performance. The authors discuss the interactions of core complexity, cache hierarchy, and available off-chip bandwidth. The paper focuses on a workload composed of single-threaded applications. It is shown that out-of-order cores are more effective than in-order ones. Furthermore, bandwidth limitations can force large L2 caches to be used, therefore reducing the number of cores. A similar study was conducted by Ekman et al. [12] for scientific parallel programs.
Kaxiras et al. [13] deal with multimedia workloads, especially targeting DSPs for mobile phones. The paper discusses the energy efficiency of SMT and CMP organizations under given performance constraints. They show that both approaches can be more efficient than a single-threaded processor, and claim that SMT has some advantages in terms of power.
Grochowski et al. [14] discuss how to achieve the best performance for scalar and parallel codes in a power-constrained environment. The idea proposed in their paper is to dynamically vary the amount of energy expended to process instructions according to the amount of parallelism available in the software. They evaluate several architectures, showing that a combination of voltage/frequency scaling and asymmetric cores represents the best approach.
In [15], Li and Martinez study the power/performance implications of parallel computing on CMPs. They use a mixed analytical-experimental model to explore parallel efficiency, the number of processors used, and voltage/frequency scaling. They show that parallel computation can bring significant power savings when the number of processors used and the voltage/frequency scaling levels are properly chosen.
These works [12-14] are orthogonal to ours. Our experimental framework is similar to the one of [15, 16]. We also consider parallel applications and the same benchmarks as [16]. The power model we use accounts for thermal effects on leakage energy but, unlike the authors of [15, 16], we also provide a temperature-dependent model for the L2 caches. This makes it possible to capture some particular behaviors of temperature and energy consumption, since the L2 caches may occupy more than half of the chip area in some configurations.
In [17], a comparison of SMT and CMP architectures is carried out. The authors consider performance, power and temperature metrics. They show that SMT and CMP exhibit similar thermal behavior, but the sources of heating are different: localized hotspots for SMTs, global heating for CMPs. They find that CMPs are more efficient for computation-intensive applications, while SMTs perform better for memory-bound programs thanks to the larger L2 cache available. The paper also discusses the suitability of several dynamic thermal management techniques for CMPs and SMTs.
Li et al. [18] conduct a thorough exploration of the multi-dimensional design space for CMPs. They show that thermal constraints dominate other physical constraints such as pin bandwidth and power delivery. According to the authors, thermal constraints tend to favor shallow pipelines and narrower cores, and tend to reduce the optimal number of cores and the L2 cache size.
The focus of [17, 18] is on single-threaded applications. In this paper, we target a different benchmark set, composed of explicitly parallel applications.
In [19], Kumar et al. provide an analysis of several on-chip interconnections.
The authors show the importance of considering the interconnect while designing a CMP, and argue that a careful co-design of the interconnection and of the other architectural entities is needed.
To the best of the authors' knowledge, the problem of architectural-level floorplanning has not been faced for multi-core architectures. Several chip floorplans have been proposed as instrumental to thermal/power models, but the interactions between floorplanning issues and CMP power/performance characteristics have not been faced up to now. In [20], Sankaranarayanan et al. discuss some issues related to chip floorplanning for single-core processors. In particular, the authors show that chip temperature variation is sensitive to three main parameters: the lateral spreading of heat in the silicon, the power density, and the temperature-dependent leakage power. We also use these parameters to describe the thermal behavior of CMPs. The idea of distributing the microarchitecture to reduce temperature is exploited in [21]; in that paper, the organization of a distributed front-end for clustered microarchitectures is proposed and evaluated.
Ku et al. [22] analyze the inter-dependence of temperature and leakage energy in cache memories. They focus on single-core workloads. The authors analyze several low-power techniques and propose a temperature-aware cache management approach. In this paper, we account for temperature effects on leakage in the memory hierarchy, stressing its importance for making the best design decisions.
This paper's contribution differs from the previous ones, since it combines a power/performance exploration for parallel applications with the interactions with the chip floorplan. Our model takes into account thermal effects and the temperature dependence of leakage energy.

3. METRICS
This section describes the metrics that we use in this paper to characterize the performance and power efficiency of the CMP configurations we explore. We measure the performance of the system when running a given parallel application by using the execution time (or delay), computed as the time needed to complete the execution of the program. Delay is often preferred to IPC for multiprocessor studies, since it accounts for the different instruction counts that may arise when different system configurations are used. This mostly happens when the number of cores is varied and the dynamic instruction count changes to deal with a different number of parallel threads. Power efficiency is evaluated through the system energy, i.e. the energy needed to run a given application. We also account for thermal effects, and we therefore report the average temperature across the chip. Furthermore, in Section 7, we evaluate several alternative floorplans by means of the spatial distribution of the temperature (the thermal map of the chip).
The Energy Delay Product (EDP) and the Energy Delay-squared Product (ED2P) are typically used to evaluate energy-delay trade-offs. EDP is usually preferred when targeting low-power devices, while ED2P is preferred when focusing on high-performance systems, since it gives higher priority to performance over power. In this paper, we use ED2P; in any case, we give a discussion of ED^nP optimality for the considered design space. We use concepts taken from [23]. In particular, we adopt the notion of hardware intensity to properly characterize the complexity of different configurations.
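Since every metric in this family is a single scalar per configuration, ranking candidate designs is mechanical. The sketch below is illustrative only (it is not the paper's tooling, and the configuration names and numbers are placeholders): it shows how ED^nP orders a design space, with n = 0 giving plain energy, n = 1 the EDP, and n = 2 the ED2P.

```python
# Illustrative ED^nP ranking helper; all data below are hypothetical.

def ednp(energy, delay, n=2.0):
    """Energy * Delay^n: n=0 -> energy, n=1 -> EDP, n=2 -> ED2P."""
    return energy * delay ** n

def best_configuration(configs, n=2.0):
    """Return the configuration that minimizes ED^nP."""
    return min(configs, key=lambda c: ednp(c["energy_j"], c["delay_s"], n))

configs = [
    {"name": "2-core 8-issue", "energy_j": 90.0, "delay_s": 2.4},
    {"name": "8-core 4-issue", "energy_j": 110.0, "delay_s": 1.0},
]
for n in (0.0, 1.0, 1.5, 2.0):
    print(n, best_configuration(configs, n)["name"])
```

As n grows, the optimum migrates toward the low-delay end of the space, which is exactly the ED^nP behavior discussed for this design space in Section 6.3.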
4. DESIGN SPACE
We consider a shared-memory multiprocessor, composed of several independent out-of-order cores (each one with a private L2 cache), communicating on a shared bus. The detailed microarchitecture is discussed in Section 5. In this section, we focus on the target design space, which is composed of the following architectural parameters:
Number of cores (2, 4, 8). This is the number of cores present in the system.

Table 1: Chip area [mm2] for each configuration (issue width, number of cores, L2 size [KB]).
Issue width (2, 4, 6, 8). We modulate the number of instructions that can be issued in parallel to the integer/floating-point reservation stations. Several microarchitectural blocks are scaled according to this parameter (see Table 2 for details). The issue width is therefore an index of the core complexity.
L2 cache size (256, 512, 1024, 2048 KB). This is the size of the private L2 cache of each processor.
Figure 1: Chip floorplans for 2, 4 and 8 cores: (a) base unit for 256/512KB L2 caches (P0 and P1 side by side, with L2_0 and L2_1 in front); (b) base unit for 1024/2048KB L2 caches; (c) 4 cores; (d) 8 cores.
For each configuration in terms of number of cores and L2 cache size we consider a different chip floorplan. As in some related works on thermal analysis [5], the floorplan of the core is modeled on the Alpha EV6 (21264) [24]. The area of several units (register files, issue and Ld/St queues, rename units, and FUs) has been scaled according to size and complexity [25]. Each floorplan has been re-designed to minimize dead space and not to increase the delay of the critical loops. Figure 1 illustrates the methodology used to build the floorplans. In particular, to obtain a reasonable aspect ratio we defined two different layouts: one for small caches, 256KB and 512KB (Figure 1.a), and another one for large caches, 1024KB and 2048KB (Figure 1.b). The floorplans for 4 and 8 cores are built by using the 2-core floorplan as a base unit. In the base unit for small caches (Figure 1.a), core 0 (P0) and core 1 (P1) are placed side by side, and the L2 caches are placed in front of each core. In the base unit for large caches (Figure 1.b), the L2 is split in two pieces, one in front of the core and the other one beside it; in this way, each processor+cache unit is roughly square. For 4 and 8 cores this floorplan is replicated and possibly mirrored, trying to obtain the aspect ratio which best approximates 1 (a perfect square). Figures 1.c/d show how the 4-core and 8-core floorplans have been obtained from the 2-core and 4-core floorplans, respectively. For any cache/core configuration, each floorplan has the shape/organization outlined in Figure 1, but a different size (sizes have been omitted for the sake of clarity). Table 1 shows the chip area for each design. For each configuration, the shared bus is placed according to the communication requirements; its size is derived from [19].

5. EXPERIMENTAL SETUP
The simulation infrastructure that we use in this paper is composed of a microarchitecture simulator modeling a configurable CMP, integrated power models, and a temperature modeling tool. The CMP simulator is SESC [8]. It models a multiprocessor composed of a configurable number of cores; Table 2 reports the main parameters of the simulated architectures. Each core is an out-of-order superscalar processor, with private caches (instruction L1, data L1, and a unified L2 for instructions and data). The 4-issue microarchitecture models the Alpha 21264 [24]. For the other issue widths, all the processor structures have been scaled accordingly, as shown in Table 2. Inter-processor communication develops on a high-bandwidth shared bus (57 GB/s), pipelined and clocked at half of the core clock (see Table 2). The coherence protocol acts directly among the L2 caches and is MESI snoopy-based; it generates invalidate/write requests between the L1 and L2 caches to ensure the coherency of shared data. Memory ordering is ruled by a weak consistency model.
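For intuition about the protocol just mentioned, the following is a deliberately simplified MESI next-state function for a snoopy bus. It is a sketch, not SESC's implementation: the event names and the shared_line hint are assumptions, and write-backs and bus transactions are only noted in comments.

```python
# Simplified MESI state machine for one L2 block (illustrative only).
M, E, S, I = "M", "E", "S", "I"

def next_state(state, event, shared_line=False):
    """Next MESI state given a local access or a snooped bus request."""
    if event == "local_read":
        if state == I:                      # read miss: issue BusRd
            return S if shared_line else E  # E only if no other sharer
        return state                        # M/E/S hit: unchanged
    if event == "local_write":              # BusRdX/upgrade if not M or E
        return M
    if event == "snoop_read":               # another core reads the block
        return S if state in (M, E) else state  # M supplies/writes back data
    if event == "snoop_write":              # another core wants ownership
        return I                            # M writes back before invalidating
    raise ValueError("unknown event: " + event)
```

The invalidate/write requests exchanged with the L1s simply mirror these L2-level transitions, keeping each private hierarchy coherent with the bus.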
The power model integrated in the simulator is based on Wattch [9] for the processor structures, CACTI [10] for the caches, and Orion [26] for the buses. The thermal model is based on HotSpot 3.0.2 [5]. HotSpot uses dynamic power traces and the chip floorplan to drive the thermal simulation; as outputs, it provides the transient behavior and the steady-state temperature. According to [5], we chose a standard cooling solution featuring a thermal resistance of 1.0 K/W, and an ambient air temperature of 45°C. We carefully select the initial temperature values to ensure that the thermal simulation converges to a stationary state. HotSpot has been augmented with a temperature-dependent leakage model, based on [27]. This model accounts for the different transistor densities of functional units and memories. The temperature dependency is modeled by varying the amount of leakage according to the proper exponential distribution, as in [27]. At each iteration of the thermal simulation, the leakage contribution is calculated and injected into each HotSpot grid element.
Table 3 lists the benchmarks we selected: 3 scientific applications from the SPLASH-2 suite [28] and the MPEG-2 encoder/decoder from ALPBench [29]. All benchmarks have been run up to completion, and statistics have been collected on the whole program run after skipping the initialization.
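The temperature-leakage feedback loop can be made concrete in a few lines. This is a minimal sketch in the spirit of the model described above, not the calibrated implementation: the reference temperature, the exponential coefficient and the per-cell baselines are illustrative assumptions.

```python
import math

T_REF = 45.0   # reference temperature [C] (assumed)
BETA = 0.017   # exponential sensitivity [1/C] (assumed)

def leakage_power(p_ref, temp_c):
    """Leakage of one grid cell: exponential in temperature, with a
    density-dependent baseline p_ref [W] taken at T_REF."""
    return p_ref * math.exp(BETA * (temp_c - T_REF))

def inject_leakage(grid_temps, grid_p_ref):
    """One thermal-simulation iteration: recompute each cell's leakage
    from its current temperature; the result is added to the dynamic
    power injected into the corresponding grid element."""
    return [leakage_power(p, t) for p, t in zip(grid_p_ref, grid_temps)]
```

Because the thermal solver and the leakage model feed each other, this step is repeated until the temperature map stops changing, which is why the initial temperatures must be chosen carefully for convergence.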

Table 2: Architecture configuration. Issue-width-dependent parameters are given for the 2-, 4-, 6- and 8-issue cores; the issue width is used as the representative of a core configuration throughout the paper.
Vdd: 0.9V
Branch unit: 7-cycle branch penalty; BTB (1K entries); Alpha-style hybrid branch predictor (3.7KB); RAS
Fetch/issue/retire width: 2/2/2, 4/4/4, 6/6/6, 8/8/10
LdSt/Int/FP units: 1/2/1, 1/4/2, 1/6/3, 1/8/4
Ld/St queue entries: 16/16, 32/32, 48/48, 64/64
Issue queues, instruction windows, register files and ROB: scaled with the issue width
L1 caches: private instruction and data L1 (DL1 write-through)
Unified L2: 256KB, 512KB, 1024KB or 2048KB; write-back
Coherence protocol: MESI
CMP: 2, 4 or 8 cores; shared bus: pipelined, clocked at 1.5GHz, 57 GB/s bandwidth

Table 3: Benchmarks, with description, problem size, and total graduated instructions (millions).
FMM: Fast Multipole Method; 16k particles, 10 steps.
mpegdec: MPEG-2 decoder; flowg (Stanford), 10 frames.
mpegenc: MPEG-2 encoder; flowg (Stanford), 10 frames.
VOLREND: volume rendering using ray casting; head, 5 viewpoints.
WATER-NS: forces and potentials of water molecules; 512 molecules, 5 steps.

For the thermal simulations, we used the power trace related to the whole benchmark run, dumped at a fixed cycle granularity. We used the standard data sets for SPLASH-2, while we limited the MPEG benchmarks to 10 frames. The total number of graduated instructions for each application, reported in Table 3, is fairly constant across all the simulated configurations for all the benchmarks, except for FMM and VOLREND; this variation (3-4%) is related to the different thread spawning and synchronization overhead as the number of cores is scaled.

6. PERFORMANCE AND POWER EVALUATION
In this section, we evaluate the simulated CMP configurations from the point of view of performance and energy. We consider the interdependence of these metrics as well as the chip physical characteristics (temperature and floorplan). We base this analysis on average values across all the simulated benchmarks, apart from Section 6.4, which presents a per-benchmark characterization.
Figure 2: Execution time (delay). Each group of bars (L2 cache varying) is labeled <#procs>-<issue width>.

6.1 Performance
Figure 2 shows the average execution time for all the configurations, in terms of cores, issue width and L2 cache; results in terms of IPC show the same behavior. It can be observed that when scaling the system from 2 to 4, and from 4 to 8 processors, the delay is reduced by 47% at each step. This trend is homogeneous across the other parameter variations: it means that these applications achieve high parallel efficiency, and that the communication overhead is not appreciable. By varying the issue width, for a given processor count and L2 size, the largest speedup is observed when moving from 2- to 4-issue; if the issue width is increased further, the improvement of the delay saturates (5.5% from 4- to 6-issue, and 3.3% from 6- to 8-issue), because of diminishing ILP returns. This trend is also seen (though not as dramatically) for the configurations with 4 and 8 cores. The impact of the L2 cache size is at most 4-5% (from 256KB to 2MB) when the other parameters are fixed; each time the L2 size is doubled, the speedup is around 1-2%. This behavior is also orthogonal to the other parameters, since the trend can be seen across all configurations.
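The scaling claim can be restated quantitatively: parallel efficiency relates the measured delays to ideal linear scaling. The helper below is illustrative (not the paper's tooling), and the example numbers simply re-express the 47% per-doubling reduction quoted above.

```python
def speedup(delay_base, delay_scaled):
    return delay_base / delay_scaled

def parallel_efficiency(delay_base, cores_base, delay_scaled, cores_scaled):
    """Speedup divided by the core-count ratio; 1.0 = perfect scaling."""
    return speedup(delay_base, delay_scaled) / (cores_scaled / cores_base)

# A 47% delay reduction per core doubling means a 1/0.53 ~= 1.89x speedup,
# i.e. roughly 94% parallel efficiency:
print(parallel_efficiency(1.0, 2, 0.53, 4))  # ~0.94
```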
6.2 Energy
In this section, we analyze the effect of varying the number of cores, the issue width and the L2 cache size on the system energy.
Number of cores. Figure 3 reports the energy behavior. It can be seen that, when the number of processors is increased, the system energy increases only slightly, although the chip area nearly doubles as the core count doubles; for example, for a 4-issue 1MB configuration, the energy increase rate is 5% (both from 2 to 4 cores and from 4 to 8 cores). In fact, the leakage energy increase due to the larger chip is mitigated by the reduction of the power-on time (i.e., the time during which the chip is leaking is shorter). On the other hand, when considering configurations with a given resource budget and similar area (see Table 1), the energy typically decreases: for a total issue width of 16 and 2MB of total cache, the energy decreases by 37% from 2 to 4 cores. From 4 to 8 cores it slightly increases, but the chip for 8 cores is larger; in this case, the shorter delay makes the difference, making relatively large CMPs result in good energy efficiency.

Figure 3: Energy. Each group of bars (L2 cache varying) is labeled <#procs>-<issue width>.
Figure 4: Leakage energy. Each bar is labeled <#issue>-<L2 size>; the core/L2 leakage breakdown is represented by a dash superposed on each bar.
Figure 5: Average chip temperature. Each group of bars (L2 cache varying) is labeled <#procs>-<issue width>.
Figure 6: Energy-Delay scatter plot for the simulated configurations, with curves of constant EDP, ED^1.5P and ED2P.
Issue width. By varying the issue width and, therefore, the core complexity, the energy has a minimum at 4-issue. This occurs for 2, 4 and 8 processors: the 4-issue cores offer the best balance between complexity, power consumption and performance. The 2-issue microarchitecture cannot efficiently extract the available IPC (leakage due to the longer delay dominates), while 8-issue cores need much more energy to extract little more parallelism with respect to the 4- and 6-issue ones, and are therefore quite inefficient.
L2 size. The L2 cache represents the major contributor to chip area, so it affects the leakage energy considerably. The static energy component in our simulations ranges from 37% (2 processors, 2-issue, 256KB L2) to more than 70% (8 processors, 8-issue, 2MB L2). Figure 4 shows the leakage energy and the core-leakage/L2-leakage breakdown. Most of the leakage is due to the L2 caches, and this component scales linearly with the L2 cache size. The static power of the core increases as the issue width and the number of cores increase; nevertheless, this effect is mitigated by larger L2 caches, since they offer enhanced heat-spreading capabilities to the chip (the overall temperature decreases, resulting in less leakage energy).
The average chip temperature is shown in Figure 5. Although the absolute variation is at most 1.7°C, even a small temperature variation can correspond to an appreciable energy modification: SPICE simulation results reported in [22] show that 1°C translates into around 4-5% leakage variation. In our simulations, the temperature reduction due to the caches causes the reduction of the core leakage energy shown in Figure 4; in this case, a 5% energy reduction (core leakage) corresponds to a 0.5°C reduction of the average chip temperature. On the contrary, the leakage in the cache is much more sensitive to area than to temperature, so the leakage increase due to the area cannot be compensated by the temperature decrease.

6.3 Energy-Delay
Figure 6 reports the Energy-Delay scatter plot for the design space, together with several curves of constant ED^nP, for n = 1, 1.5 and 2. Furthermore, we connected with a line all the points of the energy-efficient family [23]: in the Energy-Delay plane, these points form a convex hull of all the possible configurations of the design space.
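Extracting the energy-efficient family from such a scatter plot is a small computational-geometry exercise: it is the lower-left portion of the convex hull of the (delay, energy) points, and every ED^nP optimum (for any n >= 0) lies on it. A sketch with hypothetical points, not the paper's measurements:

```python
def cross(o, a, b):
    """Cross product of vectors o->a and o->b (positive = left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def efficient_family(points):
    """points: (delay, energy) pairs. Returns the energy-efficient family:
    the lower convex hull, truncated at the global energy minimum (beyond
    it, hull points are dominated in both delay and energy)."""
    pts = sorted(set(points))
    hull = []
    for p in pts:                       # Andrew's monotone chain, lower hull
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    stop = min(range(len(hull)), key=lambda i: hull[i][1])
    return hull[:stop + 1]

# Hypothetical (delay, energy) points: only convex, non-dominated
# trade-offs survive on the frontier.
print(efficient_family([(1.0, 120), (1.3, 90), (1.8, 80), (2.6, 85), (1.5, 110)]))
# -> [(1.0, 120), (1.3, 90), (1.8, 80)]
```

Moving right along this frontier trades delay for energy at a decreasing rate; the hardware intensity of [23] is closely related to the local slope of this frontier expressed in relative terms.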

Figure 7: Energy*Delay^2 [J*s^2]. Each group of bars (L2 cache varying) is labeled <#procs>-<issue width>.
Figure 8: IPC for each benchmark. Figure 9: Energy per cycle [nJ/cycle] for each benchmark. Figure 10: Average temperature [°C] for each benchmark. In Figures 8-10, each benchmark (mpegenc, mpegdec, FMM, VOLREND, WATER-NS) is shown for the 8-processor 4-issue 256KB-L2 and the 4-processor 8-issue 512KB-L2 configurations.
The best configuration from the Energy-Delay point of view is the 8-core 4-issue 256KB-L2 one. This is optimal for ED^nP with n < 2.5, as can be intuitively observed in Figure 6; if n >= 2.5, the optimum moves to 512KB L2 caches, and therefore to the points of the energy-efficient family with higher hardware intensity (the hardware intensity is defined as the ratio of the relative increase in energy to the corresponding relative gain in performance achievable by an architectural modification; see [23] for a detailed discussion). When minimizing energy, the optimal solution is the 2-core 4-issue 256KB-L2 one (low hardware intensity). The 2-issue configurations are penalized by poor performance, while the 4-core 4-issue 256KB-L2 represents an interesting trade-off, featuring only 5.5% worse energy and a sizably better delay.
Nevertheless, one could also consider other metrics, such as area (see Table 1) or average temperature (see Figure 5). In the first case (area), architectures with a smaller core count can be preferred: the 4-core 4-issue 256KB-L2 configuration needs roughly half the area of the optimal one (the 8-core 4-issue 256KB-L2), which in turn features a 46.7% better delay. Otherwise, if the chip temperature is to be minimized, larger caches can be preferred: the 8-core 4-issue 2MB-L2 features a 0.7°C lower average temperature, at the expense of a relevant leakage power consumption in the L2 caches (see Figure 4).
In Figure 6, we represent as squares the points with 2MB of total L2 cache (equivalent to the optimum). They form 3 clusters, each one corresponding to a core/cache configuration: 8 cores/256KB, 4 cores/512KB and 2 cores/1024KB. These points show that increasing the number of cores, while decreasing the L2 size of each processor, improves the energy-delay. Furthermore, when considering the same total issue width (4-core 8-issue and 8-core 4-issue), the advantages of reducing the issue width while increasing the core count become evident: the parallelism of the applications in the considered benchmark set is much better exploited by multi-core architectures than by complex monolithic cores. It can also be observed that each cluster progressively rotates as it moves towards the optimum, making the different issue widths more similar from the point of view of delay.
Figure 7 shows the Energy Delay-squared Product for the simulated configurations. ED2P strongly favors high-performance architectures: 8-core CMPs, regardless of the other parameters, are better than any other architecture. These results show that the rule of thumb that configurations offering the same ILP budget (#cores x issue width) have similar energy/performance trade-offs is not totally accurate when considering ED2P (it still holds for EDP). For example, the 8-issue 4-core 512KB CMP is equivalent to the 4-issue 8-core 256KB CMP in terms of hardware budget (both have a total issue width of 32 and 2MB of L2 cache), but not in terms of ED2P; moreover, the area budget favors the 4-core over the 8-core die.
Cache size has a relatively small impact on performance (see Figure 2), while it is important for energy. For the considered data sets, a 256KB L2 cache per processor can be considered enough.
Moreover, it minimizes energy. It should be pointed out that the L2 size affects the core and the L2 leakage differently (see Figure 4). Intelligent cache management, aware of which data are actually used, could considerably reduce the dependence of the leakage on the L2 size, making larger caches more attractive from the energy/delay point of view.

6.4 Per-Benchmark Results
Figures 8, 9 and 10 show the IPC, the energy per cycle, and the average temperature for each benchmark. Results are reported for two configurations: the optimal one (4-issue, 8 processors, 256KB L2) and the one with an equivalent total cache/issue budget, i.e. 8-issue, 4 processors, 512KB L2. These benchmarks have been selected since they represent two different kinds of workload: the MPEG decoder and encoder are integer video applications, stressing the integer units and the memory system, while FMM, VOLREND and WATER-NS are floating-point scientific benchmarks, each of them featuring different characteristics in terms of IPC and energy. The IPC ranges from around 3 (FMM) to 9-10 (WATER-NS). Note that the IPC of mpegenc and FMM is much lower than that of VOLREND and mpegdec, while WATER-NS stands out with a really high IPC. On the other hand, the energy per cycle does not vary as much across the benchmarks (20-30 nJ/cycle), and it is also quite similar for the 8-core and the 4-core CMPs. The temperature variability is 1.7°C across the benchmarks; the temperature difference between the two architectures is about 0.5°C for mpegenc and FMM and 0.25°C for VOLREND and mpegdec, while it is negligible for WATER-NS.

7. TEMPERATURE SPATIAL DISTRIBUTION
This section discusses the spatial distribution of the temperature (i.e., the chip thermal map) and its relationship with the floorplan design. We first analyze several different floorplan designs for CMPs; we then discuss the temperature distribution while varying the processor core complexity.

7.1 Evaluation of Floorplans
The main goal of this section is to discuss how floorplan design at the system level can impact the chip thermal behavior. Several floorplan topologies, featuring different core/cache placements, are analyzed. We reduce the scope of this part of the work to 4-core 4-issue CMPs with L2 cache sizes of 256KB and 1MB; similar results are obtained for the other configurations. Figure 11 shows the floorplans we have considered in addition to those in Figure 1. Floorplans are classified with respect to the processor position in the die:
Paired (Figure 1). Cores are paired and placed on alternate sides of the die.
Lined up (Figures 11.b and 11.d). Cores are lined up on one side of the die.
Centered (Figures 11.a and 11.c). Cores are placed in the middle of the die.
Area and aspect ratio are roughly equivalent for each cache size and topology. We report the thermal map for each floorplan type and cache size; this is the stationary solution as obtained by HotSpot. We use the grid model, making sure that the grid spacing is 100 µm for each floorplan. The grid spacing determines the magnitude of the hotspots, since each grid point reports roughly the average temperature of the 100 µm x 100 µm area surrounding it; our constant grid spacing ensures that the hotspot magnitude is comparable across designs. The power traces are from the WATER-NS benchmark; we selected this application as representative of the behavior of all the benchmarks, and the conclusions drawn in this analysis apply to each simulated application.
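HotSpot solves an equivalent RC network; a toy relaxation solver is enough to see why the grid spacing sets the reported hotspot magnitude, since each cell reports a single value for an h x h footprint. The code below is an illustration, not HotSpot: the sheet conductance, the power map and the boundary handling (die border pinned at ambient) are crude assumptions.

```python
def steady_temps(cell_power_w, g=0.05, t_amb=45.0, sweeps=5000):
    """Iterative relaxation of g*(4*T - sum(neighbors)) = P for each
    interior cell of an n x n die grid; g ~ silicon conductivity times
    die thickness [W/K] (assumed), edges held at ambient temperature."""
    n = len(cell_power_w)
    t = [[t_amb] * n for _ in range(n)]
    for _ in range(sweeps):
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                t[i][j] = 0.25 * (t[i - 1][j] + t[i + 1][j] + t[i][j - 1]
                                  + t[i][j + 1] + cell_power_w[i][j] / g)
    return t

# Two hot cells ("cores") in an otherwise idle die ("cache"): halving the
# grid resolution would average each peak with its cool neighbors and
# report a lower, smoother hotspot.
power = [[0.0] * 12 for _ in range(12)]
power[5][3] = power[5][8] = 0.4       # hypothetical 0.4 W hotspots
tmap = steady_temps(power)
print(max(max(row) for row in tmap))  # peak temperature [C]
```

In a real die the lateral borders are nearly adiabatic rather than pinned at ambient, which is why units in corners, with less surrounding silicon to spread into, run hotter; that boundary condition is the main simplification in this sketch.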
Figure 11: Additional floorplans taken into account for the evaluation of the spatial distribution of the temperature: (a) 256KB centered, (b) 256KB lined up, (c) 1MB centered, (d) 1MB lined up.
Table 4: Average and maximum chip temperature [°C] for the paired, lined-up and centered floorplans, with 256KB and 1MB L2 caches.
Several common phenomena can be observed in each thermal map. Several temperature peaks correspond to each processor; this is due to the hotspots in the core, whose nature will be analyzed in the next section. The caches (in particular the L2) and the shared bus are cooler, apart from the portions which are heated by the proximity of hot units.
The paired floorplans for 256KB (Figure 12) and 1MB (Figure 16) show similar thermal characteristics. Each core exhibits two hotspots: the hottest is the FP unit, the other one is the integer window coupled with the Ld/St queue; their temperatures span a few degrees, roughly between 71 and 74°C. For the 256KB L2, the hottest peaks are the ones closest to the die corners (the FP units on the right side), since they dispose of fewer spreading sides into the silicon; for the 1MB L2, the hottest peaks lie between each core pair, because of the thermal coupling between processors. The same phenomena appear for the lined-up floorplans (Figures 13 and 15): the rightmost hotspots suffer from the corner effect, while the inner ones suffer from thermal coupling.
Figure 14 shows an alternative design, where the processors are placed in the center of the die (see Figure 11.a) with a different orientation, back to back. As can be observed in Table 4, the maximum temperature is lowered significantly (by more than 1°C). This is due to the increased spreading perimeter of the processors: two sides are on the surrounding silicon, and another side is on the bus (the bus goes across the die, see Figure 11.a). The centered floorplans are the only ones with a fairly hot bus; in this case, the leakage in the bus can be significant. Furthermore, for this floorplan, the temperature between the backs of the processors is quite high (around 69°C), unlike the other floorplans, which feature relatively cold backs (67.9°C). For the 1MB centered floorplan (Figure 17) these characteristics are less evident, and the maximum temperature decrease with respect to the paired and lined-up floorplans is small, in particular when compared with the paired one. This is reasonable considering that, in the paired floorplan, the 1MB cache surrounds each pair of cores, therefore providing enough spreading area.

Figure 12: Thermal map for the 256KB paired floorplan (Figure 1). Figure 13: Thermal map for the 256KB lined-up floorplan (Figure 11.b). Figure 14: Thermal map for the 256KB centered floorplan (Figure 11.a). Figure 15: Thermal map for the 1MB lined-up floorplan (Figure 11.d). Figure 16: Thermal map for the 1MB paired floorplan (Figure 1). Figure 17: Thermal map for the 1MB centered floorplan (Figure 11.c).
Overall, the choice of the cache size and the relative placement of the processors can be important in determining the chip thermal map. Layouts where the processors are placed in the center of the die and where the L2 caches surround the cores typically feature a lower peak temperature. The magnitude of the temperature decrease, with respect to the alternative floorplans, is limited to approximately 1-2°C; the impact on the system energy can be considered negligible, since the L2 leakage dominates. Despite this, such a reduction of the hotspot temperature can lead to a leakage reduction localized at the hotspot sources.

7.2 Processor Complexity
Figure 18 reports several thermal maps for different issue widths of the processors. We selected the WATER-NS benchmark, a floating-point one, since this enables us to discuss the heating in both the FP and the integer units. The CMP configuration is 2-core 256KB-L2. The hottest units are the FP ALU, the Ld/St queue and the integer instruction window. Depending on the issue width, the temperatures of the integer window and of the Ld/St queue vary: for the 2- and 4-issue cores the FP ALU dominates, while as the issue width is scaled up, the integer window and the Ld/St queue become the hottest units of the die, due to the super-linear increase of the power density in these structures. The hotspot magnitude depends on the issue width (i.e., on the power density of the microarchitectural blocks) and on the compactness of the floorplan: some of the narrower cores exhibit higher temperature peaks than wider ones, because the floorplans of the wider cores have many cold units surrounding the hotspots (e.g., see the FP ALU hotspots), and interleaving hot and cold blocks is an effective method to provide spreading silicon to the hot areas.
Thermal coupling of hotspots exists for various units shown in Figure 18, and it is often caused by inter-processor interaction: in one of the floorplans, the Ld/St queue of the left processor is thermally coupled with the FP ALU of the right one, while in another the integer execution unit is warmed by the coupling between the Ld/St queue and the FP ALU of the two cores. Intra-processor coupling can also be observed, e.g., between the FP ALU and the FP register file, or between the integer window and the Ld/St queue.

7.3 Discussion
Different factors determine the hotspots in the processors. The power density of each microarchitectural unit provides a proxy for temperature, but it does not suffice to explain the effects due to thermal coupling and spreading area.

Figure 18: Thermal maps of the 2-core 256KB-L2 CMP for VOLREND, varying the issue width (from left to right: 2-, 4-, 6-, and 8-issue).
It is important to care about the geometric characteristics of the floorplan. We summarize those impacting chip heating as follows:
Proximity of hot units. If two or more hotspots come close, this produces thermal coupling and therefore raises the temperature locally.
Relative positions of hot and cold units. A floorplan interleaving hot and cold units results in a lower global power density (and therefore a lower temperature).
Available spreading silicon. Units placed in a position that limits their spreading perimeter, e.g. in a corner of the die, exhibit a higher temperature.
These principles, all related to the common idea of lowering the global power density, apply to the core microarchitecture as well as to the CMP system architecture. In the first case, they suggest not placing hot units, such as register files and instruction queues, too close to each other. For CMPs, they can be translated as follows:
Proximity of the cores. The interaction between two cores placed side by side can generate thermal coupling between some units of the processors.
Relative position of cores and L2 caches. If the L2 caches are placed so as to surround the cores, the result is better heat spreading and a lower chip temperature.
Position of the cores in the die. Processors placed in the center of the die offer a better thermal behavior than processors in the die corners or beside the die edge.

8. CONCLUSIONS AND FUTURE WORK
In this paper, we have discussed the impact of the choice of several architectural parameters of CMPs on power/performance/thermal metrics. Our conclusions apply to multicore architectures composed of processors with a private L2 cache and running parallel applications. Our contributions can be summarized as follows:
The choice of the L2 cache size is not only a matter of performance/area, but also of temperature.
According to ED2P, the best CMP configuration consists of a large number of fairly narrow cores; wide cores can be considered too power-hungry to be competitive.
The geometric characteristics of the floorplan are important for the chip temperature; the relative position of processors and caches must be carefully determined to avoid hotspot thermal coupling.
Several other factors, such as area, yield and reliability, may affect the design choices as well.
At the same time, orthogonal architectural alternatives such as heterogeneous cores, L3 caches and complex interconnects might be present in future CMPs. Evaluating all these alternatives is left for future research.

9. ACKNOWLEDGMENTS
This work has been supported by the Generalitat de Catalunya under grant SGR-95, the Spanish Ministry of Education and Science under grants TIN2004-07739-C04-01, TIN-37 and HI5-99, and the Ministero dell'Istruzione, dell'Università e della Ricerca under the MAIS-FIRB project.

10. REFERENCES
[1] C. McNairy and R. Bhatia. Montecito: A dual-core, dual-thread Itanium processor. IEEE Micro, 25(2):10-20, March/April 2005.
[2] P. Kongetira, K. Aingaran, and K. Olukotun. Niagara: A 32-way multithreaded Sparc processor. IEEE Micro, 25(2):21-29, March/April 2005.
[3] R. Kalla, B. Sinharoy, and J. M. Tendler. IBM Power5 chip: A dual-core multithreaded processor. IEEE Micro, 24(2):40-47, March/April 2004.
[4] Trevor Mudge. Power: A first class constraint for future architectures. In Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA). IEEE Computer Society.

[5] Kevin Skadron, Mircea R. Stan, Karthik Sankaranarayanan, Wei Huang, Sivakumar Velusamy, and David Tarjan. Temperature-aware microarchitecture: Modeling and implementation. ACM Transactions on Architecture and Code Optimization, 1(1):94-125, 2004.
[6] Intel. White paper: Superior performance with dual-core. Technical report, Intel, 2005.
[7] Kelly Quinn, Jessica Yang, and Vernon Turner. The next evolution in enterprise computing: The convergence of multicore x86 processing and 64-bit operating systems. White paper, Advanced Micro Devices Inc., April 2005.
[8] Jose Renau, Basilio Fraguela, James Tuck, Wei Liu, Milos Prvulovic, Luis Ceze, Smruti Sarangi, Paul Sack, Karin Strauss, and Pablo Montesinos. SESC simulator, January 2005.
[9] David Brooks, Vivek Tiwari, and Margaret Martonosi. Wattch: A framework for architectural-level power analysis and optimizations. In Proceedings of the 27th International Symposium on Computer Architecture, pages 83-94. ACM Press, 2000.
[10] Premkishore Shivakumar and Norman P. Jouppi. CACTI 3.0: An integrated cache timing, power, and area model. Technical Report 2001/2, Western Research Laboratory, Compaq, 2001.
[11] J. Huh, D. Burger, and S. Keckler. Exploring the design space of future CMPs. In PACT '01: Proceedings of the 10th International Conference on Parallel Architectures and Compilation Techniques, pages 199-210. IEEE Computer Society, 2001.
[12] M. Ekman and P. Stenstrom. Performance and power impact of issue-width in chip-multiprocessor cores. In ICPP '03: Proceedings of the 2003 International Conference on Parallel Processing. IEEE Computer Society, 2003.
[13] Stefanos Kaxiras, Girija Narlikar, Alan D. Berenbaum, and Zhigang Hu. Comparing power consumption of an SMT and a CMP DSP for mobile phone workloads. In CASES '01: Proceedings of the 2001 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems. ACM Press, 2001.
[14] Ed Grochowski, Ronny Ronen, John Shen, and Hong Wang. Best of both latency and throughput. In ICCD '04: Proceedings of the IEEE International Conference on Computer Design. IEEE Computer Society, 2004.
[15] J. Li and J. F. Martinez. Power-performance implications of thread-level parallelism on chip multiprocessors. In ISPASS '05: Proceedings of the 2005 International Symposium on Performance Analysis of Systems and Software, pages 124-134. IEEE Computer Society, 2005.
[16] J. Li and J. F. Martinez. Dynamic power-performance adaptation of parallel computation on chip multiprocessors. In HPCA '06: Proceedings of the 12th International Symposium on High Performance Computer Architecture. IEEE Computer Society, 2006.
[17] Yingmin Li, David Brooks, Zhigang Hu, and Kevin Skadron. Performance, energy, and thermal considerations for SMT and CMP architectures. In HPCA '05: Proceedings of the 11th International Symposium on High-Performance Computer Architecture, pages 71-82. IEEE Computer Society, 2005.
[18] Y. Li, B. Lee, D. Brooks, Z. Hu, and K. Skadron. CMP design space exploration subject to physical constraints. In HPCA '06: Proceedings of the 12th International Symposium on High Performance Computer Architecture. IEEE Computer Society, 2006.
[19] Rakesh Kumar, Victor Zyuban, and Dean M. Tullsen. Interconnections in multi-core architectures: Understanding mechanisms, overheads and scaling.
In ISCA '05: Proceedings of the 32nd Annual International Symposium on Computer Architecture, pages 408-419. IEEE Computer Society, 2005.
[20] K. Sankaranarayanan, S. Velusamy, M. Stan, and K. Skadron. A case for thermal-aware floorplanning at the microarchitectural level. Journal of Instruction-Level Parallelism, 7, 2005.
[21] Pedro Chaparro, Grigorios Magklis, Jose Gonzalez, and Antonio Gonzalez. Distributing the frontend for temperature reduction. In HPCA '05: Proceedings of the 11th International Symposium on High-Performance Computer Architecture, pages 61-70. IEEE Computer Society, 2005.
[22] Ja Chun Ku, Serkan Ozdemir, Gokhan Memik, and Yehea Ismail. Thermal management of on-chip caches through power density minimization. In MICRO 38: Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture, pages 283-293. IEEE Computer Society, 2005.
[23] V. Zyuban and P. N. Strenski. Balancing hardware intensity in microprocessor pipelines. IBM Journal of Research & Development, 47(5/6):585-598, 2003.
[24] R. E. Kessler. The Alpha 21264 microprocessor. IEEE Micro, 19(2):24-36, 1999.
[25] Subbarao Palacharla, Norman P. Jouppi, and J. E. Smith. Complexity-effective superscalar processors. In ISCA '97: Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 206-218. ACM Press, 1997.
[26] Hang-Sheng Wang, Xinping Zhu, Li-Shiuan Peh, and Sharad Malik. Orion: A power-performance simulator for interconnection networks. In Proceedings of the 35th Annual ACM/IEEE International Symposium on Microarchitecture, pages 294-305, 2002.
[27] Weiping Liao, Lei He, and K. M. Lepak. Temperature and supply voltage aware performance and power modeling at microarchitecture level. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 24(7), 2005.
[28] Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta. The SPLASH-2 programs: Characterization and methodological considerations. In ISCA '95: Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 24-36. ACM Press, 1995.
[29] Man-Lap Li, Ruchira Sasanka, Sarita V. Adve, Yen-Kuang Chen, and Eric Debes. The ALPBench benchmark suite for complex multimedia applications. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC-2005). IEEE Computer Society, 2005.


Shared vs. Snoop: Evaluation of Cache Structure for Single-chip Multiprocessors vs. : Evaluation of Structure for Single-chip Multiprocessors Toru Kisuki,Masaki Wakabayashi,Junji Yamamoto,Keisuke Inoue, Hideharu Amano Department of Computer Science, Keio University 3-14-1, Hiyoshi

More information

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor 1 CPI < 1? How? From Single-Issue to: AKS Scalar Processors Multiple issue processors: VLIW (Very Long Instruction Word) Superscalar processors No ISA Support Needed ISA Support Needed 2 What if dynamic

More information

One-Level Cache Memory Design for Scalable SMT Architectures

One-Level Cache Memory Design for Scalable SMT Architectures One-Level Cache Design for Scalable SMT Architectures Muhamed F. Mudawar and John R. Wani Computer Science Department The American University in Cairo mudawwar@aucegypt.edu rubena@aucegypt.edu Abstract

More information

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors G. Chen 1, M. Kandemir 1, I. Kolcu 2, and A. Choudhary 3 1 Pennsylvania State University, PA 16802, USA 2 UMIST,

More information

Minimizing Energy Consumption for High-Performance Processing

Minimizing Energy Consumption for High-Performance Processing Minimizing Energy Consumption for High-Performance Processing Eric F. Weglarz, Kewal K. Saluja, and Mikko H. Lipasti University ofwisconsin-madison Madison, WI 5376 USA fweglarz,saluja,mikkog@ece.wisc.edu

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Reconfigurable Multicore Server Processors for Low Power Operation

Reconfigurable Multicore Server Processors for Low Power Operation Reconfigurable Multicore Server Processors for Low Power Operation Ronald G. Dreslinski, David Fick, David Blaauw, Dennis Sylvester, Trevor Mudge University of Michigan, Advanced Computer Architecture

More information

Reducing Data Cache Energy Consumption via Cached Load/Store Queue

Reducing Data Cache Energy Consumption via Cached Load/Store Queue Reducing Data Cache Energy Consumption via Cached Load/Store Queue Dan Nicolaescu, Alex Veidenbaum, Alex Nicolau Center for Embedded Computer Systems University of Cafornia, Irvine {dann,alexv,nicolau}@cecs.uci.edu

More information

Simultaneous Multithreading (SMT)

Simultaneous Multithreading (SMT) Simultaneous Multithreading (SMT) An evolutionary processor architecture originally introduced in 1995 by Dean Tullsen at the University of Washington that aims at reducing resource waste in wide issue

More information

Last Level Cache Size Flexible Heterogeneity in Embedded Systems

Last Level Cache Size Flexible Heterogeneity in Embedded Systems Last Level Cache Size Flexible Heterogeneity in Embedded Systems Mario D. Marino, Kuan-Ching Li Leeds Beckett University, m.d.marino@leedsbeckett.ac.uk Corresponding Author, Providence University, kuancli@gm.pu.edu.tw

More information

Multithreading: Exploiting Thread-Level Parallelism within a Processor

Multithreading: Exploiting Thread-Level Parallelism within a Processor Multithreading: Exploiting Thread-Level Parallelism within a Processor Instruction-Level Parallelism (ILP): What we ve seen so far Wrap-up on multiple issue machines Beyond ILP Multithreading Advanced

More information

Reducing Power Consumption for High-Associativity Data Caches in Embedded Processors

Reducing Power Consumption for High-Associativity Data Caches in Embedded Processors Reducing Power Consumption for High-Associativity Data Caches in Embedded Processors Dan Nicolaescu Alex Veidenbaum Alex Nicolau Dept. of Information and Computer Science University of California at Irvine

More information

Adaptive Cache Partitioning on a Composite Core

Adaptive Cache Partitioning on a Composite Core Adaptive Cache Partitioning on a Composite Core Jiecao Yu, Andrew Lukefahr, Shruti Padmanabha, Reetuparna Das, Scott Mahlke Computer Engineering Lab University of Michigan, Ann Arbor, MI {jiecaoyu, lukefahr,

More information

Position Paper: OpenMP scheduling on ARM big.little architecture

Position Paper: OpenMP scheduling on ARM big.little architecture Position Paper: OpenMP scheduling on ARM big.little architecture Anastasiia Butko, Louisa Bessad, David Novo, Florent Bruguier, Abdoulaye Gamatié, Gilles Sassatelli, Lionel Torres, and Michel Robert LIRMM

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis 1 Computer Technology Performance improvements: Improvements in semiconductor technology

More information

CSE 548 Computer Architecture. Clock Rate vs IPC. V. Agarwal, M. S. Hrishikesh, S. W. Kechler. D. Burger. Presented by: Ning Chen

CSE 548 Computer Architecture. Clock Rate vs IPC. V. Agarwal, M. S. Hrishikesh, S. W. Kechler. D. Burger. Presented by: Ning Chen CSE 548 Computer Architecture Clock Rate vs IPC V. Agarwal, M. S. Hrishikesh, S. W. Kechler. D. Burger Presented by: Ning Chen Transistor Changes Development of silicon fabrication technology caused transistor

More information

Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero

Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero The Nineteenth International Conference on Parallel Architectures and Compilation Techniques (PACT) 11-15

More information

Exploitation of instruction level parallelism

Exploitation of instruction level parallelism Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering

More information

The Effect of Temperature on Amdahl Law in 3D Multicore Era

The Effect of Temperature on Amdahl Law in 3D Multicore Era The Effect of Temperature on Amdahl Law in 3D Multicore Era L Yavits, A Morad, R Ginosar Abstract This work studies the influence of temperature on performance and scalability of 3D Chip Multiprocessors

More information

On GPU Bus Power Reduction with 3D IC Technologies

On GPU Bus Power Reduction with 3D IC Technologies On GPU Bus Power Reduction with 3D Technologies Young-Joon Lee and Sung Kyu Lim School of ECE, Georgia Institute of Technology, Atlanta, Georgia, USA yjlee@gatech.edu, limsk@ece.gatech.edu Abstract The

More information

The Scalability of Scheduling Algorithms for Unpredictably Heterogeneous CMP Architectures

The Scalability of Scheduling Algorithms for Unpredictably Heterogeneous CMP Architectures Appeared in the Workshop on Parallel Execution of Sequential Programs on Multi-core Architectures (PESPMA), Beijing, China, June 2008, co-located with the 35 th International Symposium on Computer Architecture

More information

Amdahl's Law in the Multicore Era

Amdahl's Law in the Multicore Era Amdahl's Law in the Multicore Era Explain intuitively why in the asymmetric model, the speedup actually decreases past a certain point of increasing r. The limiting factor of these improved equations and

More information

Exploring different level of parallelism Instruction-level parallelism (ILP): how many of the operations/instructions in a computer program can be performed simultaneously 1. e = a + b 2. f = c + d 3.

More information

MPEG-2 Video Decompression on Simultaneous Multithreaded Multimedia Processors

MPEG-2 Video Decompression on Simultaneous Multithreaded Multimedia Processors MPEG- Video Decompression on Simultaneous Multithreaded Multimedia Processors Heiko Oehring Ulrich Sigmund Theo Ungerer VIONA Development GmbH Karlstr. 7 D-733 Karlsruhe, Germany uli@viona.de VIONA Development

More information

Several Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining

Several Common Compiler Strategies. Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining Several Common Compiler Strategies Instruction scheduling Loop unrolling Static Branch Prediction Software Pipelining Basic Instruction Scheduling Reschedule the order of the instructions to reduce the

More information

Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures

Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures Nagi N. Mekhiel Department of Electrical and Computer Engineering Ryerson University, Toronto, Ontario M5B 2K3

More information

T. N. Vijaykumar School of Electrical and Computer Engineering Purdue University, W Lafayette, IN

T. N. Vijaykumar School of Electrical and Computer Engineering Purdue University, W Lafayette, IN Resource Area Dilation to Reduce Power Density in Throughput Servers Michael D. Powell 1 Fault Aware Computing Technology Group Intel Massachusetts, Inc. T. N. Vijaykumar School of Electrical and Computer

More information

Efficiency of Thread-Level Speculation in SMT and CMP Architectures - Performance, Power and Thermal Perspective

Efficiency of Thread-Level Speculation in SMT and CMP Architectures - Performance, Power and Thermal Perspective Efficiency of Thread-Level Speculation in SMT and CMP Architectures - Performance, Power and Thermal Perspective Venkatesan Packirisamy, Yangchun Luo, Wei-Lung Hung, Antonia Zhai, Pen-Chung Yew and Tin-Fook

More information

Multi-threaded processors. Hung-Wei Tseng x Dean Tullsen

Multi-threaded processors. Hung-Wei Tseng x Dean Tullsen Multi-threaded processors Hung-Wei Tseng x Dean Tullsen OoO SuperScalar Processor Fetch instructions in the instruction window Register renaming to eliminate false dependencies edule an instruction to

More information

Data Parallel Architectures

Data Parallel Architectures EE392C: Advanced Topics in Computer Architecture Lecture #2 Chip Multiprocessors and Polymorphic Processors Thursday, April 3 rd, 2003 Data Parallel Architectures Lecture #2: Thursday, April 3 rd, 2003

More information

An Approach for Adaptive DRAM Temperature and Power Management

An Approach for Adaptive DRAM Temperature and Power Management IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1 An Approach for Adaptive DRAM Temperature and Power Management Song Liu, Yu Zhang, Seda Ogrenci Memik, and Gokhan Memik Abstract High-performance

More information

Power and Thermal Models. for RAMP2

Power and Thermal Models. for RAMP2 Power and Thermal Models for 2 Jose Renau Department of Computer Engineering, University of California Santa Cruz http://masc.cse.ucsc.edu Motivation Performance not the only first order design parameter

More information

Cache Implications of Aggressively Pipelined High Performance Microprocessors

Cache Implications of Aggressively Pipelined High Performance Microprocessors Cache Implications of Aggressively Pipelined High Performance Microprocessors Timothy J. Dysart, Branden J. Moore, Lambert Schaelicke, Peter M. Kogge Department of Computer Science and Engineering University

More information

A Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors

A Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors A Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors Murali Jayapala 1, Francisco Barat 1, Pieter Op de Beeck 1, Francky Catthoor 2, Geert Deconinck 1 and Henk Corporaal

More information

Energy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS

Energy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS Energy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS Who am I? Education Master of Technology, NTNU, 2007 PhD, NTNU, 2010. Title: «Managing Shared Resources in Chip Multiprocessor Memory

More information

Many-Core Computing Era and New Challenges. Nikos Hardavellas, EECS

Many-Core Computing Era and New Challenges. Nikos Hardavellas, EECS Many-Core Computing Era and New Challenges Nikos Hardavellas, EECS Moore s Law Is Alive And Well 90nm 90nm transistor (Intel, 2005) Swine Flu A/H1N1 (CDC) 65nm 2007 45nm 2010 32nm 2013 22nm 2016 16nm 2019

More information

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor

CPI IPC. 1 - One At Best 1 - One At best. Multiple issue processors: VLIW (Very Long Instruction Word) Speculative Tomasulo Processor Single-Issue Processor (AKA Scalar Processor) CPI IPC 1 - One At Best 1 - One At best 1 From Single-Issue to: AKS Scalar Processors CPI < 1? How? Multiple issue processors: VLIW (Very Long Instruction

More information

A 1.5GHz Third Generation Itanium Processor

A 1.5GHz Third Generation Itanium Processor A 1.5GHz Third Generation Itanium Processor Jason Stinson, Stefan Rusu Intel Corporation, Santa Clara, CA 1 Outline Processor highlights Process technology details Itanium processor evolution Block diagram

More information

EECS4201 Computer Architecture

EECS4201 Computer Architecture Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis These slides are based on the slides provided by the publisher. The slides will be

More information

Multi-Core Microprocessor Chips: Motivation & Challenges

Multi-Core Microprocessor Chips: Motivation & Challenges Multi-Core Microprocessor Chips: Motivation & Challenges Dileep Bhandarkar, Ph. D. Architect at Large DEG Architecture & Planning Digital Enterprise Group Intel Corporation October 2005 Copyright 2005

More information

Lecture: SMT, Cache Hierarchies. Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1)

Lecture: SMT, Cache Hierarchies. Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) Lecture: SMT, Cache Hierarchies Topics: SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized

More information

Introducing Multi-core Computing / Hyperthreading

Introducing Multi-core Computing / Hyperthreading Introducing Multi-core Computing / Hyperthreading Clock Frequency with Time 3/9/2017 2 Why multi-core/hyperthreading? Difficult to make single-core clock frequencies even higher Deeply pipelined circuits:

More information

CO403 Advanced Microprocessors IS860 - High Performance Computing for Security. Basavaraj Talawar,

CO403 Advanced Microprocessors IS860 - High Performance Computing for Security. Basavaraj Talawar, CO403 Advanced Microprocessors IS860 - High Performance Computing for Security Basavaraj Talawar, basavaraj@nitk.edu.in Course Syllabus Technology Trends: Transistor Theory. Moore's Law. Delay, Power,

More information

Multiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache.

Multiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache. Multiprocessors why would you want a multiprocessor? Multiprocessors and Multithreading more is better? Cache Cache Cache Classifying Multiprocessors Flynn Taxonomy Flynn Taxonomy Interconnection Network

More information

Fundamentals of Computer Design

Fundamentals of Computer Design CS359: Computer Architecture Fundamentals of Computer Design Yanyan Shen Department of Computer Science and Engineering 1 Defining Computer Architecture Agenda Introduction Classes of Computers 1.3 Defining

More information

EE282 Computer Architecture. Lecture 1: What is Computer Architecture?

EE282 Computer Architecture. Lecture 1: What is Computer Architecture? EE282 Computer Architecture Lecture : What is Computer Architecture? September 27, 200 Marc Tremblay Computer Systems Laboratory Stanford University marctrem@csl.stanford.edu Goals Understand how computer

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 1. Copyright 2012, Elsevier Inc. All rights reserved. Computer Technology

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 1. Copyright 2012, Elsevier Inc. All rights reserved. Computer Technology Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis 1 Computer Technology Performance improvements: Improvements in semiconductor technology

More information

Cache Performance, System Performance, and Off-Chip Bandwidth... Pick any Two

Cache Performance, System Performance, and Off-Chip Bandwidth... Pick any Two Cache Performance, System Performance, and Off-Chip Bandwidth... Pick any Two Bushra Ahsan and Mohamed Zahran Dept. of Electrical Engineering City University of New York ahsan bushra@yahoo.com mzahran@ccny.cuny.edu

More information

An Energy-Efficient Asymmetric Multi-Processor for HPC Virtualization

An Energy-Efficient Asymmetric Multi-Processor for HPC Virtualization An Energy-Efficient Asymmetric Multi-Processor for HP Virtualization hung Lee and Peter Strazdins*, omputer Systems Group, Research School of omputer Science, The Australian National University (slides

More information

Neural Network based Energy-Efficient Fault Tolerant Architect

Neural Network based Energy-Efficient Fault Tolerant Architect Neural Network based Energy-Efficient Fault Tolerant Architectures and Accelerators University of Rochester February 7, 2013 References Flexible Error Protection for Energy Efficient Reliable Architectures

More information

ARCHITECTURAL APPROACHES TO REDUCE LEAKAGE ENERGY IN CACHES

ARCHITECTURAL APPROACHES TO REDUCE LEAKAGE ENERGY IN CACHES ARCHITECTURAL APPROACHES TO REDUCE LEAKAGE ENERGY IN CACHES Shashikiran H. Tadas & Chaitali Chakrabarti Department of Electrical Engineering Arizona State University Tempe, AZ, 85287. tadas@asu.edu, chaitali@asu.edu

More information

CS 654 Computer Architecture Summary. Peter Kemper

CS 654 Computer Architecture Summary. Peter Kemper CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining

More information

Performance Impact of Resource Conflicts on Chip Multi-processor Servers

Performance Impact of Resource Conflicts on Chip Multi-processor Servers Performance Impact of Resource Conflicts on Chip Multi-processor Servers Myungho Lee, Yeonseung Ryu, Sugwon Hong, and Chungki Lee Department of Computer Software, MyongJi University, Yong-In, Gyung Gi

More information

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu

More information

Simultaneous Multithreading on Pentium 4

Simultaneous Multithreading on Pentium 4 Hyper-Threading: Simultaneous Multithreading on Pentium 4 Presented by: Thomas Repantis trep@cs.ucr.edu CS203B-Advanced Computer Architecture, Spring 2004 p.1/32 Overview Multiple threads executing on

More information

Lecture 21: Parallelism ILP to Multicores. Parallel Processing 101

Lecture 21: Parallelism ILP to Multicores. Parallel Processing 101 18 447 Lecture 21: Parallelism ILP to Multicores S 10 L21 1 James C. Hoe Dept of ECE, CMU April 7, 2010 Announcements: Handouts: Lab 4 due this week Optional reading assignments below. The Microarchitecture

More information

Evaluation of Design Alternatives for a Multiprocessor Microprocessor

Evaluation of Design Alternatives for a Multiprocessor Microprocessor Evaluation of Design Alternatives for a Multiprocessor Microprocessor Basem A. Nayfeh, Lance Hammond and Kunle Olukotun Computer Systems Laboratory Stanford University Stanford, CA 9-7 {bnayfeh, lance,

More information

A Survey of Techniques for Power Aware On-Chip Networks.

A Survey of Techniques for Power Aware On-Chip Networks. A Survey of Techniques for Power Aware On-Chip Networks. Samir Chopra Ji Young Park May 2, 2005 1. Introduction On-chip networks have been proposed as a solution for challenges from process technology

More information

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need?? Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

Multicore Soft Error Rate Stabilization Using Adaptive Dual Modular Redundancy

Multicore Soft Error Rate Stabilization Using Adaptive Dual Modular Redundancy Multicore Soft Error Rate Stabilization Using Adaptive Dual Modular Redundancy Ramakrishna Vadlamani, Jia Zhao, Wayne Burleson and Russell Tessier Department of Electrical and Computer Engineering University

More information

A Spherical Placement and Migration Scheme for a STT-RAM Based Hybrid Cache in 3D chip Multi-processors

A Spherical Placement and Migration Scheme for a STT-RAM Based Hybrid Cache in 3D chip Multi-processors , July 4-6, 2018, London, U.K. A Spherical Placement and Migration Scheme for a STT-RAM Based Hybrid in 3D chip Multi-processors Lei Wang, Fen Ge, Hao Lu, Ning Wu, Ying Zhang, and Fang Zhou Abstract As

More information

Defining Wakeup Width for Efficient Dynamic Scheduling

Defining Wakeup Width for Efficient Dynamic Scheduling Defining Wakeup Width for Efficient Dynamic Scheduling Aneesh Aggarwal ECE Depment Binghamton University Binghamton, NY 9 aneesh@binghamton.edu Manoj Franklin ECE Depment and UMIACS University of Maryland

More information

MEMORY/RESOURCE MANAGEMENT IN MULTICORE SYSTEMS

MEMORY/RESOURCE MANAGEMENT IN MULTICORE SYSTEMS MEMORY/RESOURCE MANAGEMENT IN MULTICORE SYSTEMS INSTRUCTOR: Dr. MUHAMMAD SHAABAN PRESENTED BY: MOHIT SATHAWANE AKSHAY YEMBARWAR WHAT IS MULTICORE SYSTEMS? Multi-core processor architecture means placing

More information

Pull based Migration of Real-Time Tasks in Multi-Core Processors

Pull based Migration of Real-Time Tasks in Multi-Core Processors Pull based Migration of Real-Time Tasks in Multi-Core Processors 1. Problem Description The complexity of uniprocessor design attempting to extract instruction level parallelism has motivated the computer

More information

A Simple Model for Estimating Power Consumption of a Multicore Server System

A Simple Model for Estimating Power Consumption of a Multicore Server System , pp.153-160 http://dx.doi.org/10.14257/ijmue.2014.9.2.15 A Simple Model for Estimating Power Consumption of a Multicore Server System Minjoong Kim, Yoondeok Ju, Jinseok Chae and Moonju Park School of

More information