Dynamic Temperature Aware Scheduling for CPU-GPU 3D Multicore Processor with Regression Predictor


JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.18, NO.1, FEBRUARY, 2018 ISSN(Print) ISSN(Online)

Hossein Pourmeidani¹, Ajay Sharma¹, Kyoshin Choo¹, Mainul Hassan¹, Minsu Choi², KyungKi Kim³, and Byunghyun Jang¹

Abstract: The 3D stacked integration of CPU, GPU and DRAM dies is an emerging direction in chip fabrication, in which dies are vertically interconnected by TSVs (Through-Silicon Vias) to achieve high bandwidth, low latency and low power consumption. However, the thinned substrate, high power density and low thermal conductivity of the inter-layer dielectric material make thermal management a crucial problem. Moreover, vertically stacked dies are subject to tight thermal correlations. High temperatures, which tend to show strong spatial and temporal locality, can negatively affect the IC's reliability and lifetime. To mitigate these problems on CPU-GPU 3D heterogeneous processors, this work proposes a novel dynamic temperature-aware task scheduling approach for compute workloads written in the OpenCL framework. The proposed scheduler predicts the future temperature of each core with a regression model based on the core's current temperature, its neighbors' temperatures and the execution profile of each workgroup. The scheduler then assigns workgroups from the task queue to cores according to their predicted temperatures, keeping the 3D chip below a certain threshold temperature. Our experimental results demonstrate that the proposed scheduling technique is a viable solution to the hotspot and heat dissipation issues of 3D stacked heterogeneous processors, with reasonable performance tradeoffs.

Index Terms: Dynamic thermal management, 3D IC, task scheduling, heterogeneous computing, GPGPU

Manuscript received Nov. 30, 2017; accepted Dec. 18
¹ Department of Computer and Information Science, The University of Mississippi, University, MS, USA
² Department of Electrical & Computer Engineering, Missouri University of Science & Technology, USA
³ Information and Communication Research Center, Daegu University, Gyeongsan, South Korea
E-mail: bjang@cs.olemiss.edu, kkkim@daegu.ac.kr

I. INTRODUCTION

3D integration technology has recently gained considerable attention. Its main benefit is a notable reduction of the interconnect wiring among the dies of a System on Chip (SoC). Wires are a primary source of latency, area, and power in modern microprocessors: prior studies have shown that wires can consume more than 30% of total power in traditional 2D chip multiprocessors [1]. By vertically stacking two or more dies with high-density, high-speed interfaces, 3D technology reduces wire length by a factor of the square root of the number of layers [2]. This reduction yields better performance and lower power dissipation in the interconnect.

Despite these significant advantages, 3D integration introduces a new problem that did not exist in 2D designs. As dies are stacked, power density grows because active devices are closer together, which significantly increases chip temperature. In addition, the lower dies are placed far from the heat sink and have longer heat dissipation paths. Hotspots that form in the chip therefore become a crucial concern for processor reliability. For example, previous studies show that the peak temperature of a two-layer 3D implementation of an Alpha-like processor can be up to 20 °C higher than that of the 2D structure [3, 4]. Other studies on multiple non-memory stacking 3D floorplans show similar thermal behavior [1, 5, 6]. Prior studies have also shown that vertically neighboring dies exhibit significantly higher thermal correlation [7, 8]: a core in one layer can become hot because of a high-temperature core at the same vertical location in a different layer.

To address this issue, we propose a dynamic temperature-aware task scheduling technique for compute workloads on 3D stacked CPU-GPU heterogeneous processors. The proposed technique suppresses hotspots through temperature prediction modeling and task scheduling while minimizing performance degradation.

II. RELATED WORK AND BACKGROUND

Conventional CPU-GPU heterogeneous systems, in which the two processors are connected through a PCI-E bus, suffer from considerable data copy overhead between host and device memories. Industry found a solution in single-chip heterogeneous processors, in which the CPU and GPU are fabricated on a single die and share a physically unified memory. Even these 2D ICs suffer from poor parallelism and scalability due to the limited bandwidth, high latency, and high energy consumption of off-chip DRAM. To solve these problems, processor architecture is evolving toward 3D ICs of CPU, GPU and DRAM dies vertically interconnected by TSVs (Through-Silicon Vias). However, a new problem arises: the vertically stacked active dies produce a considerable amount of heat in three dimensions and suffer from poor heat dissipation, high power density, and hotspots, because the active layers are separated from each other by dielectric layers.
Thermal management is therefore one of the biggest challenges in this fabrication technology, since hotspots can cause faults and reliability problems. Because thermal-reliability management in 3D ICs is both important and difficult, a number of design-time mechanical cooling solutions have been proposed, such as liquid cooling [15], thermal vias [16], heat sinks, and fans. Although these approaches will remain the front-line mechanisms for dealing with the thermal wall, they are costly and unwieldy, and they do not fully address the transient nature of the problem. Task scheduling for temperature management on homogeneous multiprocessors has been well studied in the past. Yin et al. [9] proposed an algorithm that rapidly lowers core temperature by deactivating any core that will reach a critical temperature in the next clock tick and migrating tasks according to the affinity of each core. The high integration density of 3D ICs makes thermal modeling and management more complicated, and thermal management techniques developed for 2D ICs cannot be applied directly to 3D ICs. New techniques tailored to 3D chip thermal modeling and management are therefore emerging, and several approaches have been proposed for the thermal modeling of 3D chips. Recently, Zhao et al. [10] proposed a thread migration approach to reduce temperature in a 3D architecture with stacked DRAM: threads are migrated between cores based on core temperatures, and the hottest and coldest threads swap places when the temperature variance is large enough. In [11], a task scheduling approach is proposed to manage both chip temperature and memory access delay. It improves performance by preventing tasks from migrating far from their data, but it places little emphasis on reducing chip temperature.
Coskun et al. [12] combined dynamic thread migration with DVFS for thermal management and achieved thermal results similar to those of DVFS, but with less performance degradation. Zhou et al. [13] show that there is a strong thermal correlation between vertically adjacent layers in a 3D chip. They treat vertically adjacent cores as super cores and propose an OS-level scheduling algorithm that allocates the hottest super task (a set of tasks) to the coolest super core. In a 3D chip multiprocessor, heat dissipation ability also varies from core to core. Liu et al. [14] proposed an algorithm that maps and schedules jobs according to the thermal conductivity of the cores: hotter jobs are allocated to cores closer to the heat sink, and cooler jobs to cores farther from it. While the approaches described above target homogeneous multicore processors, our work addresses the thermal management problem for emerging heterogeneous processors in which the CPU and GPU are vertically stacked.

III. TEMPERATURE AWARE DYNAMIC TASK SCHEDULING FOR 3D CPU-GPU HETEROGENEOUS PROCESSORS

We propose a Dynamic Thermal Management (DTM) technique to solve the hotspot problem of the vertically stacked 3DHP (3D Heterogeneous Processor). Because we use OpenCL workloads, our goal is to assign workgroups to cores so that the cores remain under a certain temperature threshold with minimal performance degradation. To that end, we consider the current temperature of the target core, the temperatures of its neighboring cores, and the heat to be generated by the workgroup being assigned. DTM serves as a runtime solution that reduces thermal hotspots and temperature gradients with the smallest possible performance impact.

Fig. 1 shows the overall diagram of the proposed DTM system. The management engine continuously monitors the thermal map of the chip, obtained from temperature sensors in each layer, checks the activity of the CPU, GPU and DRAM, and receives workload profiles of heterogeneous tasks from the dynamic global task scheduler. Based on this information, the engine runs an algorithm that decides which software and hardware techniques to apply, and how, to best control thermal hotspots. A range of hardware techniques, such as fine-grained DVFS and Power Gating (PG), has been developed; almost all processors on the market are equipped with some of them, and they have proven effective. In a 3DHP, heterogeneous workloads written in heterogeneous programming languages such as OpenCL provide additional opportunities to manage tasks and hardware devices dynamically at multiple levels.

To model a realistic 3D stacked chip, we consider the three-layer floorplan shown in Fig. 2.
The bottom layer is a CPU with 4 cores, the middle layer is a GPU with 32 CUs (compute units), and the top layer is DRAM. All three active silicon layers generate heat, and their vertically stacked structure makes heat harder to dissipate than in a 2D structure.

Fig. 1. The system diagram of the proposed Dynamic Thermal Management (DTM) for 3DHP.
Fig. 2. A floorplan of CPU-GPU 3D heterogeneous processor.

As in existing works, the distance between cores is considered an important factor in modeling the thermal correlation among cores. In this work, three major factors are analyzed to predict the temperature change of each core: its current temperature, its neighbors' temperatures, and the execution time of the workgroup to be assigned. We classify a temperature greater than a threshold, Threshold_Critical, as critical; a temperature less than a threshold, Threshold_Hot, as normal; and a temperature between the two as hot. Two cores are considered neighbors when their shortest distance in the distance graph is 1, 2 or 3. Fig. 3 shows example distance graphs for a GPU with 32 cores and a CPU with 4 cores. The neighbor distances based on the GPU distance graph are listed in Table 1. For example, core 8 has distance 1 to cores 7 and 16, distance 2 to cores 6, 15 and 24, and distance 3 to cores 5, 14, 23 and 32. Table 2 lists the neighbor distances for the 4-core CPU based on the CPU distance graph. The distance between each GPU core and the CPU core directly underneath it is assumed to be 2.

Fig. 3. A distance graph: (a) GPU, (b) CPU.

Table 1. GPU shortest neighbor distance
Core  Distance 1       Distance 2                   Distance 3
1     2, 9             3, 10, 17                    4, 11, 18, 25
2     1, 3, 10         4, 9, 11, 18                 5, 12, 17, 19, 26
3     2, 4, 11         1, 5, 10, 12, 19             6, 9, 13, 18, 20, 27
4     3, 5, 12         2, 6, 11, 13, 20             1, 7, 10, 14, 19, 21, 28
5     4, 6, 13         3, 7, 12, 14, 21             2, 8, 11, 15, 20, 22, 29
6     5, 7, 14         4, 8, 13, 15, 22             3, 12, 16, 21, 23, 30
7     6, 8, 15         5, 14, 16, 23                4, 13, 22, 24, 31
8     7, 16            6, 15, 24                    5, 14, 23, 32
9     1, 10, 17        2, 11, 18, 25                3, 12, 19, 26
10    2, 9, 11, 18     1, 3, 12, 17, 19, 26         4, 13, 20, 25, 27
11    3, 10, 12, 19    2, 4, 9, 13, 18, 20, 27      1, 5, 14, 17, 21, 26, 28
12    4, 11, 13, 20    3, 5, 10, 14, 19, 21, 28     2, 6, 9, 15, 18, 22, 27, 29
13    5, 12, 14, 21    4, 6, 11, 15, 20, 22, 29     3, 7, 10, 16, 19, 23, 28, 30
14    6, 13, 15, 22    5, 7, 12, 16, 21, 23, 30     4, 8, 11, 20, 24, 29, 31
15    7, 14, 16, 23    6, 8, 13, 22, 24, 31         5, 12, 21, 30, 32
16    8, 15, 24        7, 14, 23, 32                6, 13, 22, 31
17    9, 18, 25        1, 10, 19, 26                2, 11, 20, 27
18    10, 17, 19, 26   2, 9, 11, 20, 25, 27         1, 3, 12, 21, 28
19    11, 18, 20, 27   3, 10, 12, 17, 21, 26, 28    2, 4, 9, 13, 22, 25, 29
20    12, 19, 21, 28   4, 11, 13, 18, 22, 27, 29    3, 5, 10, 14, 17, 23, 26, 30
21    13, 20, 22, 29   5, 12, 14, 19, 23, 28, 30    4, 6, 11, 15, 18, 24, 27, 31
22    14, 21, 23, 30   6, 13, 15, 20, 24, 29, 31    5, 7, 12, 16, 19, 28, 32
23    15, 22, 24, 31   7, 14, 16, 21, 30, 32        6, 8, 13, 20, 29
24    16, 23, 32       8, 15, 22, 31                7, 14, 21, 30
25    17, 26           9, 18, 27                    1, 10, 19, 28
26    18, 25, 27       10, 17, 19, 28               2, 9, 11, 20, 29
27    19, 26, 28       11, 18, 20, 25, 29           3, 10, 12, 17, 21, 30
28    20, 27, 29       12, 19, 21, 26, 30           4, 11, 13, 18, 22, 25, 31
29    21, 28, 30       13, 20, 22, 27, 31           5, 12, 14, 19, 23, 26, 32
30    22, 29, 31       14, 21, 23, 28, 32           6, 13, 15, 20, 24, 27
31    23, 30, 32       15, 22, 24, 29               7, 14, 16, 21, 28
32    24, 31           16, 23, 30                   8, 15, 22, 29

Table 2. CPU shortest neighbor distance
Core  Distance 1  Distance 2  Distance 3

We then compute the neighbors' temperature weight (NTW) as

$$\mathrm{NTW} = \frac{\beta_1\,\mathrm{ATND}(1) + \beta_2\,\mathrm{ATND}(2) + \beta_3\,\mathrm{ATND}(3)}{\beta_1 + \beta_2 + \beta_3}$$

where $\mathrm{ATND}(I)$ is the average temperature of the neighbors at distance $I$, $\beta_I$ is the weight for the neighbors at distance $I$, and $\beta_1 > \beta_2 > \beta_3$. We determined the $\beta$ values empirically, because neighbors at different distances have different impacts on the target core.

Along with NTW, we consider the execution time of each workgroup. This factor directly affects the future temperature because, under the thread execution model of modern GPUs, a workgroup cannot be migrated to another core once it has been assigned to a CU. A core is therefore likely to reach a higher temperature when a workgroup runs on it for a longer time. Suppose we have $M$ workgroups extracted from $N$ kernels, denoted $\mathrm{WG}_{11}, \ldots, \mathrm{WG}_{MN}$, and $P$ available cores (including both CPU and GPU cores), denoted $C_1, \ldots, C_P$. Note that modern GPUs can run different kernels simultaneously. Each workgroup of each kernel has an execution time on each of the available cores: $\mathrm{Time}_{i,j,k}$ is the execution time of workgroup $\mathrm{WG}_{ij}$ on core $C_k$. The allocation variable $A_{i,j,k} = 1$ if workgroup $\mathrm{WG}_{ij}$ is assigned to core $C_k$ and 0 otherwise. Because a workgroup can be assigned to only one core, for each $\mathrm{WG}_{ij}$:

$$\sum_{k=1}^{P} A_{i,j,k} = 1, \qquad i = 1, \ldots, M, \quad j = 1, \ldots, N.$$

The execution time of workgroup $\mathrm{WG}_{ij}$ is then given by:

$$\mathrm{ExecutionTime}(i,j) = \sum_{k=1}^{P} A_{i,j,k}\,\mathrm{Time}_{i,j,k}$$

Our main objective is to predict the future temperature of each core from the three factors above. We use statistical regression, which finds a relationship between one dependent variable (e.g., $Y$) and one or more independent variables (e.g., $X_1, X_2, \ldots$) in order to predict the dependent variable. Regression can be simple linear or multiple linear: simple linear regression uses one independent variable, whereas multiple linear regression uses more than one independent variable to predict the outcome. The general form of multiple linear regression is:

$$Y_i = a_0 + a_1 X_{i1} + a_2 X_{i2} + \cdots + a_n X_{in} + \varepsilon_i$$

Here $Y$ is the dependent variable, $X_1, X_2, \ldots, X_n$ are the $n$ independent variables measured over several observations $i$, the coefficients $a_0, \ldots, a_n$ are estimated from the data, and $\varepsilon_i$ is the error term. For example, Fig. 4 shows how multiple linear regression finds a linear relationship between two independent variables, $X_1$ and $X_2$, and the dependent variable $Y$. To predict the future temperature of each core, we take the future temperature as the dependent variable and the current temperature, NTW and execution time as the independent variables.

Fig. 4. Multiple linear regression for temperature prediction.

Finally, the scheduler assigns a workgroup to the core whose predicted temperature is the lowest among the cores predicted to be normal. If no core has a predicted normal temperature, the scheduler decreases the frequency of the cores in order to reduce their temperature. The proposed algorithm is pseudo-coded as follows:

Algorithm: The proposed task scheduler algorithm for 3DHP
1: while (there is a core whose current temperature is critical) do
2:   for k = 1 to P do
3:     Compute NTW for each C_k
4:   endfor
5:   for k = 1 to P do
6:     Predict future temperature for each C_k
7:   endfor
8:   if (there is a core whose predicted temperature is normal) then
9:     Assign next workgroup to the core whose predicted temperature is the lowest and normal
10:  else
11:    Decrement the frequency of cores
12:  endif
13: endwhile

IV. EXPERIMENTAL SETUP AND RESULTS

Extensive experiments are carried out on the sample floorplan shown in Fig. 2. Threshold_Critical and Threshold_Hot are set to 80 °C and 60 °C, respectively, in all experiments, and $\beta_1$, $\beta_2$ and $\beta_3$ are set to 4, 2 and 1, respectively. First, we compute the power consumed by each core in the 3DHP with the McPAT [17] power simulator, using detailed cycle-level statistics collected from Multi2Sim [18] architectural simulation. Once power consumption is computed, we use the HotSpot [19] thermal simulator to compute the temperature of each component of the 3DHP. The temperatures of all cores are computed every 925 clock cycles, which matches the GPU clock frequency tested.

Several metrics are chosen to evaluate the proposed scheduling algorithm: peak temperature, temperature changes, final temperature and performance. Five well-known benchmark workloads from the AMD OpenCL SDK are used: MatrixMultiplication, BinarySearch, Reduction, FFT and BitonicSort. The final temperature is an important metric because it is the initial temperature of the next kernel to be executed. Our experiments show that, with the proposed scheduler, the final temperatures of all cores are normal for all benchmarks tested. For example, Figs. 5 and 6 show the final thermal maps after the completion of the MatrixMultiplication benchmark. Fig. 5 shows the final temperatures of the GPU cores: under the default Round-Robin scheduler, 16 cores are critical, 8 are hot and 8 are normal, while all cores are normal when the proposed scheduler is used. In Fig. 6, all CPU cores end up critical under the Round-Robin scheduler, while all of them are normal under the proposed scheduler. As mentioned earlier, the thermal correlation between CPU and GPU cores is evident.

Fig. 5. Final thermal map for GPU cores: (a) our proposed scheduler, (b) default round-robin scheduler.
Fig. 6. Final thermal map for CPU cores: (a) our proposed scheduler, (b) default round-robin scheduler.

Table 3 shows the average final temperature of all CPU and GPU cores for the Round-Robin scheduler and the proposed scheduler. The proposed scheduler reduces the final temperature by more than 50%.

Table 3. Average final temperatures (°C) of all cores
Benchmark              Round-Robin   Proposed Scheduler   Degradation
MatrixMultiplication
Reduction
BinarySearch
FFT
BitonicSort

Fig. 7. Temperature changes: (a) MatrixMultiplication, (b) BinarySearch, (c) Reduction, (d) FFT, (e) BitonicSort.
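The prediction and assignment steps of Section III can be sketched in Python as follows. This is a minimal, hypothetical sketch, not the authors' implementation. It assumes that the 32 GPU CUs form a 4 × 8 grid numbered row by row, with Manhattan distance as the graph distance (a layout inferred from the Table 1 entries, which it reproduces); it models only the GPU layer, so the CPU core underneath each CU and the CPU graph of Table 2 are omitted; and it estimates the regression coefficients with ordinary least squares, since the estimator is not stated in the text. The helper names `classify`, `gpu_neighbors`, `ntw`, `fit_linear`, `predict` and `pick_core` are invented for illustration. The thresholds (60 °C, 80 °C) and $\beta$ weights (4, 2, 1) are the values from Section IV.

```python
"""Hypothetical sketch of the Section III scheduling step (GPU layer only)."""

GRID_ROWS, GRID_COLS = 4, 8                     # assumed row-major 4 x 8 CU layout
THRESHOLD_HOT, THRESHOLD_CRITICAL = 60.0, 80.0  # degC, values from Section IV
BETAS = {1: 4.0, 2: 2.0, 3: 1.0}                # beta_1 > beta_2 > beta_3 (Section IV)


def classify(temp):
    """Critical above Threshold_Critical, normal below Threshold_Hot, else hot."""
    if temp > THRESHOLD_CRITICAL:
        return "critical"
    if temp < THRESHOLD_HOT:
        return "normal"
    return "hot"


def gpu_neighbors(cu, dist):
    """1-based CU ids at Manhattan distance `dist` from `cu`, in ascending order."""
    r, c = divmod(cu - 1, GRID_COLS)
    return [rr * GRID_COLS + cc + 1
            for rr in range(GRID_ROWS) for cc in range(GRID_COLS)
            if abs(rr - r) + abs(cc - c) == dist]


def ntw(cu, temps):
    """NTW = sum(beta_I * ATND(I)) / sum(beta_I) over distances I = 1, 2, 3."""
    num = den = 0.0
    for dist, beta in BETAS.items():
        ring = gpu_neighbors(cu, dist)
        if ring:  # skip an empty ring so it does not distort the average
            num += beta * sum(temps[n] for n in ring) / len(ring)
            den += beta
    return num / den


def fit_linear(X, y):
    """Ordinary least squares via the normal equations; returns [a0, a1, ..., an]."""
    rows = [[1.0, *x] for x in X]               # prepend the intercept column
    n = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    b = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(n)]
    for col in range(n):                        # Gaussian elimination, partial pivoting
        piv = max(range(col, n), key=lambda k: abs(A[k][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for k in range(col + 1, n):
            f = A[k][col] / A[col][col]
            for j in range(col, n):
                A[k][j] -= f * A[col][j]
            b[k] -= f * b[col]
    coef = [0.0] * n
    for i in reversed(range(n)):                # back substitution
        coef[i] = (b[i] - sum(A[i][j] * coef[j] for j in range(i + 1, n))) / A[i][i]
    return coef


def predict(coef, features):
    """Y = a0 + a1*X1 + ... + an*Xn (the error term is not modeled)."""
    return coef[0] + sum(a * x for a, x in zip(coef[1:], features))


def pick_core(temps, exec_time, coef):
    """Lowest predicted temperature among predicted-normal cores, or None.

    None means no core is predicted normal; the caller would then lower the
    core frequencies, which this sketch does not model.
    """
    best_core, best_temp = None, None
    for k in temps:
        t = predict(coef, (temps[k], ntw(k, temps), exec_time[k]))
        if classify(t) == "normal" and (best_temp is None or t < best_temp):
            best_core, best_temp = k, t
    return best_core


if __name__ == "__main__":
    print(gpu_neighbors(8, 1), gpu_neighbors(8, 2), gpu_neighbors(8, 3))
    # -> [7, 16] [6, 15, 24] [5, 14, 23, 32]
    temps = {cu: 70.0 for cu in range(1, 33)}   # every CU "hot" ...
    temps[5], temps[9] = 40.0, 55.0             # ... except CUs 5 and 9
    exec_time = {cu: 1.0 for cu in temps}
    toy_coef = [0.0, 1.0, 0.0, 0.0]             # toy model: prediction = current temp
    print(pick_core(temps, exec_time, toy_coef))  # -> 5
```

In a fuller version, the distance-2 ring of each CU would also include the CPU core directly underneath it, as stated in Section III, and `fit_linear` would be trained on (current temperature, NTW, execution time) samples collected from the simulator rather than on the toy coefficients used in the demo.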

Temperature changes show the ability of a scheduler to keep the temperature below a certain threshold. Fig. 7 shows the temperature changes every 20 intervals for all benchmarks tested. The experimental results show that the proposed scheduler outperforms the default Round-Robin scheduler in keeping temperature under control, and that the average temperature of the cores always stays below the critical temperature when the proposed scheduler is used. Downward slopes indicate that the scheduler is bringing the temperature under control.

The peak temperature shows how well a scheduler eliminates the worst thermal conditions. Table 4 shows the peak temperature reduction for each benchmark, based on Fig. 7: on average, peak temperature decreases by 19% across all benchmarks tested.

Table 4. Peak temperature degradation
Benchmark              Peak Temperature Degradation
MatrixMultiplication   19.5%
BinarySearch           16.3%
Reduction              17.8%
FFT                    16.5%
BitonicSort            24.9%

Performance degradation is measured using the overall execution time. Table 5 shows the performance overhead of the proposed scheduler compared with the default Round-Robin scheduler. The overhead is less than 20% for BinarySearch and between 30.5% and 37.3% for the other benchmarks. The performance loss is caused by not using cores with critical temperatures: critical cores are left idle to cool down, and only normal cores are used. The overhead of BinarySearch is lower than that of the other benchmarks because its temperature changes are smaller, as shown in Fig. 7(b).

Table 5. Performance overhead
Benchmark              Performance Overhead
MatrixMultiplication   36.5%
BinarySearch           18.9%
Reduction              37.3%
FFT                    36.9%
BitonicSort            30.5%

Table 6. Classification of benchmarks
Benchmark              Thermal Group
BinarySearch           Cool
MatrixMultiplication   Warm
Reduction              Warm
FFT                    Warm
BitonicSort            Hot
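The 19% average peak-temperature reduction stated above, and the 32% average performance overhead cited in the conclusions, follow directly from Tables 4 and 5. A short arithmetic check in Python (the dictionary names are ours; the values are copied from the tables):

```python
from statistics import mean

# Values copied from Table 4 (peak temperature degradation, %)
peak_reduction = {"MatrixMultiplication": 19.5, "BinarySearch": 16.3,
                  "Reduction": 17.8, "FFT": 16.5, "BitonicSort": 24.9}
# Values copied from Table 5 (performance overhead, %)
overhead = {"MatrixMultiplication": 36.5, "BinarySearch": 18.9,
            "Reduction": 37.3, "FFT": 36.9, "BitonicSort": 30.5}

avg_peak = mean(list(peak_reduction.values()))
avg_overhead = mean(list(overhead.values()))
print(f"mean peak reduction: {avg_peak:.1f}%")            # 19.0%
print(f"mean performance overhead: {avg_overhead:.1f}%")  # 32.0%
```

The same dictionaries also make the outliers easy to spot: BitonicSort has the largest peak reduction, and BinarySearch has the smallest overhead.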
We also tested back-to-back runs of multiple kernels. Based on the final and peak temperatures under the default Round-Robin scheduler, we classify the benchmarks into three categories, hot, warm and cool, as described in Table 6. Benchmarks from different categories are combined into different benchmark mixes to demonstrate the importance of the final temperature and to evaluate the efficiency of the scheduler. Table 7 shows the benchmark mixes used in our experiments, and the results are shown in Tables 8 and 9. The tables show that the proposed scheduler reduces the peak temperature more for warmer combinations than for cooler ones, at the cost of higher overhead. For example, the peak temperature reduction for the WWW combination is 11.1 percentage points larger than for the WCW combination, while its overhead is 7 percentage points higher.

Table 7. Mix of benchmarks
Benchmark Mix          Classification
Bitonic + MM + FFT     HWW
Red + FFT + MM         WWW
Bin + FFT + Bitonic    CWH
FFT + Bin + MM         WCW

Table 8. Peak temperature degradation of benchmark mix
Benchmark Mix          Peak Temperature Degradation
Bitonic + MM + FFT     16.7%
Red + FFT + MM         18.9%
Bin + FFT + Bitonic    13.7%
FFT + Bin + MM         7.8%

Table 9. Performance overhead of benchmark mix
Benchmark Mix          Performance Overhead
Bitonic + MM + FFT     34.5%
Red + FFT + MM         35.5%
Bin + FFT + Bitonic    32.4%
FFT + Bin + MM         28.5%

V. CONCLUSIONS

In this paper, a novel temperature-aware workgroup assignment algorithm for vertically stacked 3D heterogeneous processors has been proposed and validated. Unlike previous approaches to thermal management for homogeneous multicore processors, we target emerging heterogeneous workloads that run on both CPU and GPU processors. Using well-verified simulators widely used in the field, we have demonstrated the efficiency of the proposed temperature-aware scheduler in improving the thermal conditions of 3D CPU-GPU heterogeneous processors. To reduce hotspots, peak temperature and final temperature, the proposed scheduler predicts the future temperature of each core and assigns the next workgroups to the most desirable cores. The experimental results show that the proposed scheduler reduces the final temperature by more than 50% and the peak temperature by 19% on average, at the cost of a 32% average performance overhead.

ACKNOWLEDGMENTS

This work was supported by National Science Foundation (NSF) grant CCF.

REFERENCES

[1] B. Black, M. Annavaram, N. Brekelbaum, J. DeVale, L. Jiang, G. H. Loh, D. McCaule, P. Morrow, D. W. Nelson, D. Pantuso et al., "Die stacking (3D) microarchitecture," in Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06). IEEE, 2006.
[2] J. W. Joyner, P. Zarkesh-Ha, and J. D. Meindl, "A stochastic global net-length distribution for a three-dimensional system-on-a-chip (3D-SoC)," in ASIC/SOC Conference, Proceedings. 14th Annual IEEE International. IEEE, 2001.
[3] W.-L. Hung, G. M. Link, Y. Xie, N. Vijaykrishnan, and M. J. Irwin, "Interconnect and thermal-aware floorplanning for 3D microprocessors," in 7th International Symposium on Quality Electronic Design (ISQED'06). IEEE, 2006, 6 pp.
[4] K. Puttaswamy and G. H. Loh, "Thermal analysis of a 3D die-stacked high-performance microprocessor," in Proceedings of the 16th ACM Great Lakes Symposium on VLSI. ACM, 2006.
[5] M. Awasthi and R. Balasubramonian, "Exploring the design space for 3D clustered architectures," in Proceedings of the 3rd IBM Watson Conference on Interaction between Architecture, Circuits, and Compilers.
[6] K. Puttaswamy and G. H. Loh, "Thermal herding: Microarchitecture techniques for controlling hotspots in high-performance 3D-integrated processors," in 2007 IEEE 13th International Symposium on High Performance Computer Architecture. IEEE, 2007.
[7] K. Banerjee, S. J. Souri, P. Kapur, and K. C. Saraswat, "3-D ICs: A novel chip design for improving deep-submicrometer interconnect performance and systems-on-chip integration," Proceedings of the IEEE, vol. 89, no. 5.
[8] Y. Xie, G. H. Loh, B. Black, and K. Bernstein, "Design space exploration for 3D architectures," ACM Journal on Emerging Technologies in Computing Systems (JETC), vol. 2, no. 2.
[9] X. Yin, Y. Zhu, L. Xia, J. Ye, T. Huang, Y. Fu, and M. Qiu, "Efficient implementation of thermal-aware scheduler on a quad-core processor," in 2011 IEEE 10th International Conference on Trust, Security and Privacy in Computing and Communications. IEEE, 2011.
[10] D. Zhao, H. Homayoun, and A. V. Veidenbaum, "Temperature aware thread migration in 3D architecture with stacked DRAM," in International Symposium on Quality Electronic Design (ISQED). IEEE, 2013.
[11] H. Wang, Y. Fu, T. Liu, and J. Wang, "Thermal management via task scheduling for 3D NoC based multi-processor," in 2010 International SoC Design Conference (ISOCC). IEEE, 2010.
[12] A. K. Coskun, J. L. Ayala, D. Atienza, T. S. Rosing, and Y. Leblebici, "Dynamic thermal management in 3D multicore architectures," in 2009 Design, Automation & Test in Europe Conference & Exhibition. IEEE, 2009.
[13] X. Zhou, J. Yang, Y. Xu, Y. Zhang, and J. Zhao, "Thermal-aware task scheduling for 3D multicore processors," IEEE Transactions on Parallel and Distributed Systems, vol. 21, no. 1.
[14] S. Liu, J. Zhang, Q. Wu, and Q. Qiu, "Thermal-aware job allocation and scheduling for three dimensional chip multiprocessor," in International Symposium on Quality Electronic Design (ISQED). IEEE, 2010.
[15] A. Sridhar, A. Vincenzi, M. Ruggiero, T. Brunschwiler, and D. Atienza, "3D-ICE: Fast compact transient thermal modeling for 3D ICs with inter-tier liquid cooling," in Proceedings of the International Conference on Computer-Aided Design. IEEE Press, 2010.
[16] B. Goplen and S. Sapatnekar, "Thermal via placement in 3D ICs," in Proceedings of the 2005 International Symposium on Physical Design. ACM, 2005.
[17] D. M. Tullsen, "Simulation and modeling of a simultaneous multithreading processor," in International Conference for the Resource Management & Performance Evaluation of Enterprise Computing Systems, CMG, Part 2 (of 2), 1996.
[18] R. Ubal, B. Jang, P. Mistry, D. Schaa, and D. Kaeli, "Multi2Sim: a simulation framework for CPU-GPU computing," in International Conference on Parallel Architectures and Compilation Techniques (PACT). IEEE, 2012.
[19] K. Skadron, M. R. Stan, K. Sankaranarayanan, W. Huang, S. Velusamy, and D. Tarjan, "Temperature-aware microarchitecture: Modeling and implementation," ACM Transactions on Architecture and Code Optimization (TACO), vol. 1, no. 1.

Hossein Pourmeidani received his B.S. and M.S. degrees in Computer Engineering from Islamic Azad University in 2010 and 2012, respectively. He is currently pursuing his Ph.D. degree at the University of Mississippi. His interests include computer architecture and GPU computing.

Ajay Sharma received his M.S. degree in Computer Science from the University of Mississippi. He is currently working for FedEx, U.S.A. His research includes CPU-GPU heterogeneous computing and high performance computing.

Kyoshin Choo received his B.S. degree from Handong Global University in South Korea, his M.S. degree from the University of Michigan, Ann Arbor, and his Ph.D. degree in Computer Science from the University of Mississippi. He is currently working for AMD, U.S.A.

Mainul Hassan received his B.S. degree in Computer Engineering from Bangladesh University of Engineering and Technology and his M.S. degree in Computer Science from the University of Mississippi. He is currently working for IMS Health, U.S.A.

Minsu Choi (M'02, SM'08) received his B.S., M.S. and Ph.D. degrees in Computer Science from Oklahoma State University in 1995, 1998 and 2002, respectively. He is currently an associate professor of Electrical and Computer Engineering at Missouri University of Science & Technology (Missouri S&T). His research mainly focuses on Computer Architecture & VLSI, Crypto-hardware design, Nanoelectronics, Embedded Systems, Fault Tolerance, Testing, Quality Assurance, Reliability Modeling and Analysis, Configurable Computing, Parallel & Distributed Systems and Dependable Instrumentation & Measurement. He has won two outstanding teaching awards at MST, one of them in 2008. He is a senior member of IEEE and a member of the Golden Key National Honor Society and Sigma Xi.

Kyung Ki Kim received his B.S. and M.S. degrees in Electronic Engineering from Yeungnam University, South Korea, in 1995 and 1997, respectively, and his Ph.D. degree in Computer Engineering from Northeastern University, Boston, MA. He was a member of technical staff with Sun Microsystems, Santa Clara, CA, in 2008 and a senior researcher with Illinois Institute of Technology, Chicago, IL. He is currently an Associate Professor at Daegu University, South Korea. His current research focuses on nanoscale CMOS design, high-speed low-power VLSI design, analog VLSI circuit design, electronic CAD and nanoelectronics.

Byunghyun Jang received his B.S. in Bio-Mechatronic Engineering from Sungkyunkwan University, South Korea, his M.S. degree in Computer Science from Oklahoma State University, Stillwater, OK, and his Ph.D. in Computer Engineering from Northeastern University, Boston, MA. He is currently an Assistant Professor of Computer and Information Science at the University of Mississippi, University, MS, where he directs the Heterogeneous Systems Research (HEROES) Laboratory. Prior to joining academia in 2012, he spent several years at AMD and Samsung. His research focuses on CPU-GPU heterogeneous computing, hardware architecture and compilers for data-parallel architectures.

Efficient Evaluation and Management of Temperature and Reliability for Multiprocessor Systems

Efficient Evaluation and Management of Temperature and Reliability for Multiprocessor Systems Efficient Evaluation and Management of Temperature and Reliability for Multiprocessor Systems Ayse K. Coskun Electrical and Computer Engineering Department Boston University http://people.bu.edu/acoskun

More information

Exploring Performance, Power, and Temperature Characteristics of 3D Systems with On-Chip DRAM

Exploring Performance, Power, and Temperature Characteristics of 3D Systems with On-Chip DRAM Exploring Performance, Power, and Temperature Characteristics of 3D Systems with On-Chip DRAM Jie Meng, Daniel Rossell, and Ayse K. Coskun Electrical and Computer Engineering Department, Boston University,

More information

A Spherical Placement and Migration Scheme for a STT-RAM Based Hybrid Cache in 3D chip Multi-processors

A Spherical Placement and Migration Scheme for a STT-RAM Based Hybrid Cache in 3D chip Multi-processors , July 4-6, 2018, London, U.K. A Spherical Placement and Migration Scheme for a STT-RAM Based Hybrid in 3D chip Multi-processors Lei Wang, Fen Ge, Hao Lu, Ning Wu, Ying Zhang, and Fang Zhou Abstract As

More information

Investigation and Comparison of Thermal Distribution in Synchronous and Asynchronous 3D ICs Abstract -This paper presents an analysis and comparison

Investigation and Comparison of Thermal Distribution in Synchronous and Asynchronous 3D ICs Abstract -This paper presents an analysis and comparison Investigation and Comparison of Thermal Distribution in Synchronous and Asynchronous 3D ICs Brent Hollosi 1, Tao Zhang 2, Ravi S. P. Nair 3, Yuan Xie 2, Jia Di 1, and Scott Smith 3 1 Computer Science &

More information

A 256-Radix Crossbar Switch Using Mux-Matrix-Mux Folded-Clos Topology

A 256-Radix Crossbar Switch Using Mux-Matrix-Mux Folded-Clos Topology http://dx.doi.org/10.5573/jsts.014.14.6.760 JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.14, NO.6, DECEMBER, 014 A 56-Radix Crossbar Switch Using Mux-Matrix-Mux Folded-Clos Topology Sung-Joon Lee

More information

THERMAL BENCHMARK AND POWER BENCHMARK SOFTWARE

THERMAL BENCHMARK AND POWER BENCHMARK SOFTWARE Nice, Côte d Azur, France, 27-29 September 26 THERMAL BENCHMARK AND POWER BENCHMARK SOFTWARE Marius Marcu, Mircea Vladutiu, Horatiu Moldovan and Mircea Popa Department of Computer Science, Politehnica

More information

A Simple Model for Estimating Power Consumption of a Multicore Server System

A Simple Model for Estimating Power Consumption of a Multicore Server System , pp.153-160 http://dx.doi.org/10.14257/ijmue.2014.9.2.15 A Simple Model for Estimating Power Consumption of a Multicore Server System Minjoong Kim, Yoondeok Ju, Jinseok Chae and Moonju Park School of

More information

Thermal Modeling and Active Cooling

Thermal Modeling and Active Cooling Thermal Modeling and Active Cooling for 3D MPSoCs Prof. David Atienza, Embedded Systems Laboratory (ESL), EE Institute, Faculty of Engineering MPSoC 09, 2-7 August 2009 (Savannah, Georgia, USA) Thermal-Reliability

More information

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu

More information

Design of Low-Power and Low-Latency 256-Radix Crossbar Switch Using Hyper-X Network Topology

Design of Low-Power and Low-Latency 256-Radix Crossbar Switch Using Hyper-X Network Topology JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.15, NO.1, FEBRUARY, 2015 http://dx.doi.org/10.5573/jsts.2015.15.1.077 Design of Low-Power and Low-Latency 256-Radix Crossbar Switch Using Hyper-X Network

More information

Accelerating Distance Transform Image based Hand Detection using CPU-GPU Heterogeneous Computing

Accelerating Distance Transform Image based Hand Detection using CPU-GPU Heterogeneous Computing JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.16, NO.5, OCTOBER, 2016 ISSN(Print) 1598-1657 http://dx.doi.org/10.5573/jsts.2016.16.5.557 ISSN(Online) 2233-4866 Accelerating Distance Transform Image

More information

Towards Performance Modeling of 3D Memory Integrated FPGA Architectures

Towards Performance Modeling of 3D Memory Integrated FPGA Architectures Towards Performance Modeling of 3D Memory Integrated FPGA Architectures Shreyas G. Singapura, Anand Panangadan and Viktor K. Prasanna University of Southern California, Los Angeles CA 90089, USA, {singapur,

More information

A task migration algorithm for power management on heterogeneous multicore Manman Peng1, a, Wen Luo1, b

A task migration algorithm for power management on heterogeneous multicore Manman Peng1, a, Wen Luo1, b 5th International Conference on Advanced Materials and Computer Science (ICAMCS 2016) A task migration algorithm for power management on heterogeneous multicore Manman Peng1, a, Wen Luo1, b 1 School of

More information

An Approach for Adaptive DRAM Temperature and Power Management

An Approach for Adaptive DRAM Temperature and Power Management IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1 An Approach for Adaptive DRAM Temperature and Power Management Song Liu, Yu Zhang, Seda Ogrenci Memik, and Gokhan Memik Abstract High-performance

More information

Temperature-Sensitive Loop Parallelization for Chip Multiprocessors

Temperature-Sensitive Loop Parallelization for Chip Multiprocessors Temperature-Sensitive Loop Parallelization for Chip Multiprocessors Sri Hari Krishna Narayanan, Guilin Chen, Mahmut Kandemir, Yuan Xie Department of CSE, The Pennsylvania State University {snarayan, guilchen,

More information

Chapter 0 Introduction

Chapter 0 Introduction Chapter 0 Introduction Jin-Fu Li Laboratory Department of Electrical Engineering National Central University Jhongli, Taiwan Applications of ICs Consumer Electronics Automotive Electronics Green Power

More information

Energy-efficient Custom Topology-based Dynamic Voltage-frequency Island-enabled Network-on-chip Design

Energy-efficient Custom Topology-based Dynamic Voltage-frequency Island-enabled Network-on-chip Design JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.18, NO.3, JUNE, 2018 ISSN(Print) 1598-1657 https://doi.org/10.5573/jsts.2018.18.3.352 ISSN(Online) 2233-4866 Energy-efficient Custom Topology-based

More information

3-Dimensional (3D) ICs: A Survey

3-Dimensional (3D) ICs: A Survey 3-Dimensional (3D) ICs: A Survey Lavanyashree B.J M.Tech, Student VLSI DESIGN AND EMBEDDED SYSTEMS Dayananda Sagar College of engineering, Bangalore. Abstract VLSI circuits are scaled to meet improved

More information

Hardware/Software T e T chniques for for DRAM DRAM Thermal Management

Hardware/Software T e T chniques for for DRAM DRAM Thermal Management Hardware/Software Techniques for DRAM Thermal Management 6/19/2012 1 Introduction The performance of the main memory is an important factor on overall system performance. To improve DRAM performance, designers

More information

The Effect of Temperature on Amdahl Law in 3D Multicore Era

The Effect of Temperature on Amdahl Law in 3D Multicore Era The Effect of Temperature on Amdahl Law in 3D Multicore Era L Yavits, A Morad, R Ginosar Abstract This work studies the influence of temperature on performance and scalability of 3D Chip Multiprocessors

More information

Power dissipation! The VLSI Interconnect Challenge. Interconnect is the crux of the problem. Interconnect is the crux of the problem.

Power dissipation! The VLSI Interconnect Challenge. Interconnect is the crux of the problem. Interconnect is the crux of the problem. The VLSI Interconnect Challenge Avinoam Kolodny Electrical Engineering Department Technion Israel Institute of Technology VLSI Challenges System complexity Performance Tolerance to digital noise and faults

More information

Predictive Thread-to-Core Assignment on a Heterogeneous Multi-core Processor*

Predictive Thread-to-Core Assignment on a Heterogeneous Multi-core Processor* Predictive Thread-to-Core Assignment on a Heterogeneous Multi-core Processor* Tyler Viswanath Krishnamurthy, and Hridesh Laboratory for Software Design Department of Computer Science Iowa State University

More information

Department of Electrical and Computer Engineering, University of Rochester, Computer Studies Building,

Department of Electrical and Computer Engineering, University of Rochester, Computer Studies Building, ,, Computer Studies Building, BOX 270231, Rochester, New York 14627 585.360.6181 (phone) kose@ece.rochester.edu http://www.ece.rochester.edu/ kose Research Interests and Vision Research interests: Design

More information

OVERVIEW: NETWORK ON CHIP 3D ARCHITECTURE

OVERVIEW: NETWORK ON CHIP 3D ARCHITECTURE OVERVIEW: NETWORK ON CHIP 3D ARCHITECTURE 1 SOMASHEKHAR, 2 REKHA S 1 M. Tech Student (VLSI Design & Embedded System), Department of Electronics & Communication Engineering, AIET, Gulbarga, Karnataka, INDIA

More information

Lecture 1: Introduction

Lecture 1: Introduction Contemporary Computer Architecture Instruction set architecture Lecture 1: Introduction CprE 581 Computer Systems Architecture, Fall 2016 Reading: Textbook, Ch. 1.1-1.7 Microarchitecture; examples: Pipeline

More information

THERMAL EXPLORATION AND SIGN-OFF ANALYSIS FOR ADVANCED 3D INTEGRATION

THERMAL EXPLORATION AND SIGN-OFF ANALYSIS FOR ADVANCED 3D INTEGRATION THERMAL EXPLORATION AND SIGN-OFF ANALYSIS FOR ADVANCED 3D INTEGRATION Cristiano Santos 1, Pascal Vivet 1, Lee Wang 2, Michael White 2, Alexandre Arriordaz 3 DAC Designer Track 2017 Pascal Vivet Jun/2017

More information

Predictive Thermal Management for Hard Real-Time Tasks

Predictive Thermal Management for Hard Real-Time Tasks Predictive Thermal Management for Hard Real-Time Tasks Albert Mo Kim Cheng and Chen Feng Real-Time System Laboratory, Department of Computer Science University of Houston, Houston, TX 77204, USA {cheng,

More information

SERVICE ORIENTED REAL-TIME BUFFER MANAGEMENT FOR QOS ON ADAPTIVE ROUTERS

SERVICE ORIENTED REAL-TIME BUFFER MANAGEMENT FOR QOS ON ADAPTIVE ROUTERS SERVICE ORIENTED REAL-TIME BUFFER MANAGEMENT FOR QOS ON ADAPTIVE ROUTERS 1 SARAVANAN.K, 2 R.M.SURESH 1 Asst.Professor,Department of Information Technology, Velammal Engineering College, Chennai, Tamilnadu,

More information

Reconfigurable Multicore Server Processors for Low Power Operation

Reconfigurable Multicore Server Processors for Low Power Operation Reconfigurable Multicore Server Processors for Low Power Operation Ronald G. Dreslinski, David Fick, David Blaauw, Dennis Sylvester, Trevor Mudge University of Michigan, Advanced Computer Architecture

More information

A Low-Power ECC Check Bit Generator Implementation in DRAMs

A Low-Power ECC Check Bit Generator Implementation in DRAMs 252 SANG-UHN CHA et al : A LOW-POWER ECC CHECK BIT GENERATOR IMPLEMENTATION IN DRAMS A Low-Power ECC Check Bit Generator Implementation in DRAMs Sang-Uhn Cha *, Yun-Sang Lee **, and Hongil Yoon * Abstract

More information

Parallelizing Inline Data Reduction Operations for Primary Storage Systems

Parallelizing Inline Data Reduction Operations for Primary Storage Systems Parallelizing Inline Data Reduction Operations for Primary Storage Systems Jeonghyeon Ma ( ) and Chanik Park Department of Computer Science and Engineering, POSTECH, Pohang, South Korea {doitnow0415,cipark}@postech.ac.kr

More information

Adaptive Power Blurring Techniques to Calculate IC Temperature Profile under Large Temperature Variations

Adaptive Power Blurring Techniques to Calculate IC Temperature Profile under Large Temperature Variations Adaptive Techniques to Calculate IC Temperature Profile under Large Temperature Variations Amirkoushyar Ziabari, Zhixi Bian, Ali Shakouri Baskin School of Engineering, University of California Santa Cruz

More information

Dynamic Cache Pooling for Improving Energy Efficiency in 3D Stacked Multicore Processors

Dynamic Cache Pooling for Improving Energy Efficiency in 3D Stacked Multicore Processors Dynamic Cache Pooling for Improving Energy Efficiency in 3D Stacked Multicore Processors Jie Meng, Tiansheng Zhang, and Ayse K. Coskun Electrical and Computer Engineering Department, Boston University,

More information

Microprocessor Thermal Analysis using the Finite Element Method

Microprocessor Thermal Analysis using the Finite Element Method Microprocessor Thermal Analysis using the Finite Element Method Bhavya Daya Massachusetts Institute of Technology Abstract The microelectronics industry is pursuing many options to sustain the performance

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 1. Copyright 2012, Elsevier Inc. All rights reserved. Computer Technology

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 1. Copyright 2012, Elsevier Inc. All rights reserved. Computer Technology Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis 1 Computer Technology Performance improvements: Improvements in semiconductor technology

More information

Energy-Efficiency Prediction of Multithreaded Workloads on Heterogeneous Composite Cores Architectures using Machine Learning Techniques

Energy-Efficiency Prediction of Multithreaded Workloads on Heterogeneous Composite Cores Architectures using Machine Learning Techniques Energy-Efficiency Prediction of Multithreaded Workloads on Heterogeneous Composite Cores Architectures using Machine Learning Techniques Hossein Sayadi Department of Electrical and Computer Engineering

More information

Thermal-Aware 3D IC Placement Via Transformation

Thermal-Aware 3D IC Placement Via Transformation Thermal-Aware 3D IC Placement Via Transformation Jason Cong, Guojie Luo, Jie Wei and Yan Zhang Department of Computer Science University of California, Los Angeles Los Angeles, CA 90095 Email: { cong,

More information

Fundamentals of Quantitative Design and Analysis

Fundamentals of Quantitative Design and Analysis Fundamentals of Quantitative Design and Analysis Dr. Jiang Li Adapted from the slides provided by the authors Computer Technology Performance improvements: Improvements in semiconductor technology Feature

More information

A TAXONOMY AND SURVEY OF ENERGY-EFFICIENT DATA CENTERS AND CLOUD COMPUTING SYSTEMS

A TAXONOMY AND SURVEY OF ENERGY-EFFICIENT DATA CENTERS AND CLOUD COMPUTING SYSTEMS A TAXONOMY AND SURVEY OF ENERGY-EFFICIENT DATA CENTERS AND CLOUD COMPUTING SYSTEMS Anton Beloglazov, Rajkumar Buyya, Young Choon Lee, and Albert Zomaya Prepared by: Dr. Faramarz Safi Islamic Azad University,

More information

Microprocessor Trends and Implications for the Future

Microprocessor Trends and Implications for the Future Microprocessor Trends and Implications for the Future John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 522 Lecture 4 1 September 2016 Context Last two classes: from

More information

A Cool Scheduler for Multi-Core Systems Exploiting Program Phases

A Cool Scheduler for Multi-Core Systems Exploiting Program Phases IEEE TRANSACTIONS ON COMPUTERS, VOL. 63, NO. 5, MAY 2014 1061 A Cool Scheduler for Multi-Core Systems Exploiting Program Phases Zhiming Zhang and J. Morris Chang, Senior Member, IEEE Abstract Rapid growth

More information

Temperature Aware Thread Block Scheduling in GPGPUs

Temperature Aware Thread Block Scheduling in GPGPUs Temperature Aware Thread Block Scheduling in GPGPUs Rajib Nath University of California, San Diego rknath@ucsd.edu Raid Ayoub Strategic CAD Labs, Intel Corporation raid.ayoub@intel.com Tajana Simunic Rosing

More information

Network on Chip Architecture: An Overview

Network on Chip Architecture: An Overview Network on Chip Architecture: An Overview Md Shahriar Shamim & Naseef Mansoor 12/5/2014 1 Overview Introduction Multi core chip Challenges Network on Chip Architecture Regular Topology Irregular Topology

More information

Xylem: Enhancing Vertical Thermal Conduction in 3D Processor-Memory Stacks

Xylem: Enhancing Vertical Thermal Conduction in 3D Processor-Memory Stacks Xylem: Enhancing Vertical Thermal Conduction in 3D Processor-Memory Stacks Aditya Agrawal, Josep Torrellas and Sachin Idgunji University of Illinois at Urbana Champaign and Nvidia Corporation http://iacoma.cs.uiuc.edu

More information

Peeling the Power Onion

Peeling the Power Onion CERCS IAB Workshop, April 26, 2010 Peeling the Power Onion Hsien-Hsin S. Lee Associate Professor Electrical & Computer Engineering Georgia Tech Power Allocation for Server Farm Room Datacenter 8.1 Total

More information

OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI

OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI CMPE 655- MULTIPLE PROCESSOR SYSTEMS OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI What is MULTI PROCESSING?? Multiprocessing is the coordinated processing

More information

Staged Memory Scheduling

Staged Memory Scheduling Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie Mellon University, *AMD Research June 12 th 2012 Executive Summary Observation:

More information

Efficient Implementation of Single Error Correction and Double Error Detection Code with Check Bit Precomputation

Efficient Implementation of Single Error Correction and Double Error Detection Code with Check Bit Precomputation http://dx.doi.org/10.5573/jsts.2012.12.4.418 JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.12, NO.4, DECEMBER, 2012 Efficient Implementation of Single Error Correction and Double Error Detection

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis 1 Computer Technology Performance improvements: Improvements in semiconductor technology

More information

ISSN Vol.04,Issue.01, January-2016, Pages:

ISSN Vol.04,Issue.01, January-2016, Pages: WWW.IJITECH.ORG ISSN 2321-8665 Vol.04,Issue.01, January-2016, Pages:0077-0082 Implementation of Data Encoding and Decoding Techniques for Energy Consumption Reduction in NoC GORANTLA CHAITHANYA 1, VENKATA

More information

Figure 5.2: (a) Floor plan examples for varying the number of memory controllers and ranks. (b) Example configuration.

Figure 5.2: (a) Floor plan examples for varying the number of memory controllers and ranks. (b) Example configuration. Figure 5.2: (a) Floor plan examples for varying the number of memory controllers and ranks. (b) Example configuration. The study found that a 16 rank 4 memory controller system obtained a speedup of 1.338

More information

Best Engineering Practice to Extend the Free Air-Cooling Limit in Tablet Hand Held Devices AMD TFE 2011

Best Engineering Practice to Extend the Free Air-Cooling Limit in Tablet Hand Held Devices AMD TFE 2011 Best Engineering Practice to Extend the Free Air-Cooling Limit in Tablet Hand Held Devices AMD TFE 2011 Gamal Refai-Ahmed, Ph.D, AMD Fellow Guy Wagner, Director - Electronic Cooling Solutions William Maltz,

More information

Three DIMENSIONAL-CHIPS

Three DIMENSIONAL-CHIPS IOSR Journal of Electronics and Communication Engineering (IOSR-JECE) ISSN: 2278-2834, ISBN: 2278-8735. Volume 3, Issue 4 (Sep-Oct. 2012), PP 22-27 Three DIMENSIONAL-CHIPS 1 Kumar.Keshamoni, 2 Mr. M. Harikrishna

More information

Traffic- and Thermal-Aware Run-Time Thermal Management Scheme for 3D NoC Systems

Traffic- and Thermal-Aware Run-Time Thermal Management Scheme for 3D NoC Systems 2010 Fourth ACM/IEEE International Symposium on Networks-on-Chip Traffic- and Thermal-Aware Run-Time Thermal Management Scheme for 3D NoC Systems Chih-Hao Chao, Kai-Yuan Jheng, Hao-Yu Wang, Jia-Cheng Wu,

More information

Co-optimization of TSV assignment and micro-channel placement for 3D-ICs

Co-optimization of TSV assignment and micro-channel placement for 3D-ICs THE INSTITUTE FOR SYSTEMS RESEARCH ISR TECHNICAL REPORT 2012-10 Co-optimization of TSV assignment and micro-channel placement for 3D-ICs Bing Shi, Ankur Srivastava and Caleb Serafy ISR develops, applies

More information

WITH the development of the semiconductor technology,

WITH the development of the semiconductor technology, Dual-Link Hierarchical Cluster-Based Interconnect Architecture for 3D Network on Chip Guang Sun, Yong Li, Yuanyuan Zhang, Shijun Lin, Li Su, Depeng Jin and Lieguang zeng Abstract Network on Chip (NoC)

More information

CS 475: Parallel Programming Introduction

CS 475: Parallel Programming Introduction CS 475: Parallel Programming Introduction Wim Bohm, Sanjay Rajopadhye Colorado State University Fall 2014 Course Organization n Let s make a tour of the course website. n Main pages Home, front page. Syllabus.

More information

Energy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS

Energy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS Energy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS Who am I? Education Master of Technology, NTNU, 2007 PhD, NTNU, 2010. Title: «Managing Shared Resources in Chip Multiprocessor Memory

More information

3D Memory Formed of Unrepairable Memory Dice and Spare Layer

3D Memory Formed of Unrepairable Memory Dice and Spare Layer 3D Memory Formed of Unrepairable Memory Dice and Spare Layer Donghyun Han, Hayoug Lee, Seungtaek Lee, Minho Moon and Sungho Kang, Senior Member, IEEE Dept. Electrical and Electronics Engineering Yonsei

More information

Computer Architecture!

Computer Architecture! Informatics 3 Computer Architecture! Dr. Vijay Nagarajan and Prof. Nigel Topham! Institute for Computing Systems Architecture, School of Informatics! University of Edinburgh! General Information! Instructors

More information

Effective Memory Access Optimization by Memory Delay Modeling, Memory Allocation, and Slack Time Management

Effective Memory Access Optimization by Memory Delay Modeling, Memory Allocation, and Slack Time Management International Journal of Computer Theory and Engineering, Vol., No., December 01 Effective Memory Optimization by Memory Delay Modeling, Memory Allocation, and Slack Time Management Sultan Daud Khan, Member,

More information

Minimizing Thermal Variation in Heterogeneous HPC System with FPGA Nodes

Minimizing Thermal Variation in Heterogeneous HPC System with FPGA Nodes Minimizing Thermal Variation in Heterogeneous HPC System with FPGA Nodes Yingyi Luo, Xiaoyang Wang, Seda Ogrenci-Memik, Gokhan Memik, Kazutomo Yoshii, Pete Beckman @ICCD 2018 Motivation FPGAs in data centers

More information

A Strategy for Interconnect Testing in Stacked Mesh Network-on- Chip

A Strategy for Interconnect Testing in Stacked Mesh Network-on- Chip 2010 25th International Symposium on Defect and Fault Tolerance in VLSI Systems A Strategy for Interconnect Testing in Stacked Mesh Network-on- Chip Min-Ju Chan and Chun-Lung Hsu Department of Electrical

More information

Three-Dimensional Integrated Circuits: Performance, Design Methodology, and CAD Tools

Three-Dimensional Integrated Circuits: Performance, Design Methodology, and CAD Tools Three-Dimensional Integrated Circuits: Performance, Design Methodology, and CAD Tools Shamik Das, Anantha Chandrakasan, and Rafael Reif Microsystems Technology Laboratories Massachusetts Institute of Technology

More information

Parallel Computing. Parallel Computing. Hwansoo Han

Parallel Computing. Parallel Computing. Hwansoo Han Parallel Computing Parallel Computing Hwansoo Han What is Parallel Computing? Software with multiple threads Parallel vs. concurrent Parallel computing executes multiple threads at the same time on multiple

More information

The Power Wall. Why Aren t Modern CPUs Faster? What Happened in the Late 1990 s?

The Power Wall. Why Aren t Modern CPUs Faster? What Happened in the Late 1990 s? The Power Wall Why Aren t Modern CPUs Faster? What Happened in the Late 1990 s? Edward L. Bosworth, Ph.D. Associate Professor TSYS School of Computer Science Columbus State University Columbus, Georgia

More information

BREAKING THE MEMORY WALL

BREAKING THE MEMORY WALL BREAKING THE MEMORY WALL CS433 Fall 2015 Dimitrios Skarlatos OUTLINE Introduction Current Trends in Computer Architecture 3D Die Stacking The memory Wall Conclusion INTRODUCTION Ideal Scaling of power

More information

From the table we can see that the main contribution to. EDA Publishing/THERMINIC 2011

From the table we can see that the main contribution to. EDA Publishing/THERMINIC 2011 Single-hip loud omputer Thermal odel ohammadsadegh Sadri, Andrea Bartolini, Luca Benini University of Bologna Via Risorgimento, 2, 40136 Bologna, Italy Tel:0039(0)512093787;Fax:0039(0)512093785, Email:mohammadsadegh.sadr2,a.bartolini,luca.benini@unibo.it

More information

EFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE- STACKED DRAM CACHES

EFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE- STACKED DRAM CACHES EFFICIENTLY ENABLING CONVENTIONAL BLOCK SIZES FOR VERY LARGE DIE- STACKED DRAM CACHES MICRO 2011 @ Porte Alegre, Brazil Gabriel H. Loh [1] and Mark D. Hill [2][1] December 2011 [1] AMD Research [2] University

More information

ALONG with the continued scaling of complementary

ALONG with the continued scaling of complementary IEEE TRANSACTIONS ON COMPONENTS AND PACKAGING TECHNOLOGIES, VOL. 28, NO. 4, DECEMBER 2005 615 Parameterized Physical Compact Thermal Modeling Wei Huang, Student Member, IEEE, Mircea R. Stan, Senior Member,

More information

6T- SRAM for Low Power Consumption. Professor, Dept. of ExTC, PRMIT &R, Badnera, Amravati, Maharashtra, India 1

6T- SRAM for Low Power Consumption. Professor, Dept. of ExTC, PRMIT &R, Badnera, Amravati, Maharashtra, India 1 6T- SRAM for Low Power Consumption Mrs. J.N.Ingole 1, Ms.P.A.Mirge 2 Professor, Dept. of ExTC, PRMIT &R, Badnera, Amravati, Maharashtra, India 1 PG Student [Digital Electronics], Dept. of ExTC, PRMIT&R,

More information

RUNTIME SUPPORT FOR ADAPTIVE SPATIAL PARTITIONING AND INTER-KERNEL COMMUNICATION ON GPUS

RUNTIME SUPPORT FOR ADAPTIVE SPATIAL PARTITIONING AND INTER-KERNEL COMMUNICATION ON GPUS RUNTIME SUPPORT FOR ADAPTIVE SPATIAL PARTITIONING AND INTER-KERNEL COMMUNICATION ON GPUS Yash Ukidave, Perhaad Mistry, Charu Kalra, Dana Schaa and David Kaeli Department of Electrical and Computer Engineering

More information

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Moore s Law Moore, Cramming more components onto integrated circuits, Electronics, 1965. 2 3 Multi-Core Idea:

More information

PowerRed: A Flexible Modeling Framework for Power Efficiency Exploration in GPUs

PowerRed: A Flexible Modeling Framework for Power Efficiency Exploration in GPUs PowerRed: A Flexible Modeling Framework for Power Efficiency Exploration in GPUs Karthik Ramani, Ali Ibrahim, Dan Shimizu School of Computing, University of Utah AMD Inc. Abstract The tremendous increase

More information

3D Integration & Packaging Challenges with through-silicon-vias (TSV)

3D Integration & Packaging Challenges with through-silicon-vias (TSV) NSF Workshop 2/02/2012 3D Integration & Packaging Challenges with through-silicon-vias (TSV) Dr John U. Knickerbocker IBM - T.J. Watson Research, New York, USA Substrate IBM Research Acknowledgements IBM

More information

Thermal Via Planning for 3-D ICs

Thermal Via Planning for 3-D ICs Thermal Via Planning for 3-D ICs Jason Cong Computer Science Department, UCLA Los Angeles, CA 90095 cong@cs.ucla.edu Yan Zhang Computer Science Department, UCLA Los Angeles, CA 90095 zhangyan@cs.ucla.edu

More information

Test-Architecture Optimization for 3D Stacked ICs

Test-Architecture Optimization for 3D Stacked ICs ACM STUDENT RESEARCH COMPETITION GRAND FINALS 1 Test-Architecture Optimization for 3D Stacked ICs I. PROBLEM AND MOTIVATION TSV-based 3D-SICs significantly impact core-based systemon-chip (SOC) design.

More information

Thermal-Driven Multilevel Routing for 3-D ICs

Thermal-Driven Multilevel Routing for 3-D ICs Thermal-Driven Multilevel Routing for 3-D ICs Jason Cong and Yan Zhang Computer Science Department, UCLA Los Angeles, CA 90095 tel. 310-206-5449, fax. 310-825-2273 cong, zhangyan@cs.ucla.edu Abstract 3-D

More information

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1

More information

Runtime Network-on-Chip Thermal and Power Balancing

Runtime Network-on-Chip Thermal and Power Balancing APPLICATIONS OF MODELLING AND SIMULATION http://www.ams-mss.org eissn 2600-8084 VOL 1, NO. 1, 2017, 36-41 Runtime Network-on-Chip Thermal and Power Balancing M. S. Rusli *, M. N. Marsono and N. S. Husin

More information

More Course Information

More Course Information More Course Information Labs and lectures are both important Labs: cover more on hands-on design/tool/flow issues Lectures: important in terms of basic concepts and fundamentals Do well in labs Do well

More information

A Study of IR-drop Noise Issues in 3D ICs with Through-Silicon-Vias
