Energy-efficient Custom Topology-based Dynamic Voltage-frequency Island-enabled Network-on-chip Design


JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.18, NO.3, JUNE, 2018

Chang-Lin Li, Jae-Chern Yoo, and Tae Hee Han*

Abstract: The voltage-frequency island (VFI) design paradigm has strong potential for reducing energy consumption in network-on-chip (NoC). The V/F of each island can be dynamically tuned according to the application's requirements. However, dynamic VFI (DVFI) requires an efficient on-chip communication architecture to compensate for the latency overhead produced while tuning the proper V/F of each VFI. Although a standard topology has been used in most VFI designs, this approach incurs a large energy and latency overhead owing to redundant hop counts. Therefore, we propose a custom topology-based DVFI for an energy-efficient manycore platform to maximize energy efficiency at a reasonable implementation cost. In this regard, a custom topology generation method is combined with a heuristic run-time V/F tuning algorithm that considers the core and link utilization. Experimental results demonstrated the effectiveness of the proposed scheme in terms of execution time and energy-delay product.

Index Terms: Network-on-chip (NoC), voltage-frequency island (VFI), dynamic voltage-frequency island (DVFI), custom topology, topology generation

Manuscript received Aug. 25, 2017; accepted Jan. 22, 2018. The College of Information and Communication Engineering, Sungkyunkwan University. E-mail: than@skku.edu

I. INTRODUCTION

Owing to the diminishing returns of performance scaling in single-core processors and the ever-increasing computational demand, the system-on-chip (SoC) design paradigm has shifted to the manycore processor era.
Moreover, the communication bottleneck between the processing cores and the memory has forced the communication subsystem to adopt a scalable and distributed on-chip interconnection architecture, which is called network-on-chip (NoC) [1]. In addition, energy efficiency has become a primary design concern not only for battery-powered embedded systems but also for high-end server machines. In this regard, the voltage-frequency island (VFI) design has been widely adopted as an efficient and scalable energy optimization solution [2]. In a VFI-based manycore system, it is possible to tune the V/F of each VFI dynamically under the given performance constraints [3]. Compared with per-core dynamic voltage frequency scaling (DVFS), where each core has its own V/F scaling domain, the dynamic VFI (DVFI) is more practical for large-scale manycore processors in terms of implementation complexity and associated cost, considering that the required voltage regulators (VRs) and phase-locked loops (PLLs) do not scale down well with finer fabrication technologies [4]. Moreover, compared with per-core DVFS, the DVFI is well suited to state-of-the-art, highly energy-efficient asymmetric multicore architectures such as the ARM big.LITTLE technology. However, the DVFI requires distributed core- and link-level information for assigning and tuning a proper V/F value to each VFI, which incurs extra latency [5]. Therefore, a latency-aware communication architecture for DVFI-enabled NoCs is needed.

Most VFI-related NoC architectures employ a standard (e.g., mesh) topology (Fig. 1(a)). However, several studies have demonstrated that a standard topology produces a large energy and latency overhead owing to long multi-hop paths. Therefore, it cannot cope with the overhead produced in the DVFI [6, 7]. On the other hand, a custom topology (Fig. 1(b)) is well suited to accommodate the diversified requirements of today's computing environment and the most recent leading-edge manycore architectures, such as the ARM DynamIQ technology [8]. In this regard, custom topologies provide more design optimization opportunities with less latency and energy overhead than the standard one. Therefore, a custom topology should be incorporated into the DVFI to further improve the energy efficiency and on-chip network latency in a flexible manner [6].

Fig. 1. VFI architecture with (a) a mesh, (b) a custom topology.

Accordingly, a new scheme for designing a custom topology-based DVFI (CT-DVFI) is proposed for achieving significant energy savings. In this regard, a custom topology generation method and an associated DVFI tuning method are incorporated. As topology generation is an NP-hard problem, a heuristic algorithm is deployed while considering the DVFI tuning. During the topology generation, the core utilization and communication traffic are used to cluster the cores and links. Moreover, the intra- and inter-VFI communications, together with the VFI information, are used to generate the NoC components (routers and links) so as to optimize the trade-offs between performance and energy consumption.
Because the proposed custom topology utilizes a minimum number of routers, the additional latency produced by the DVFI tuning can be addressed by pursuing a minimum hop count for the dedicated communication paths. For the DVFI, a tuning method is proposed to dynamically tune the V/F of each VFI at runtime according to a metric developed from core- and link-level utilization. The experimental results showed significant savings in terms of execution time and energy-delay product compared with the mesh topology and the other VFI configurations.

The rest of the paper is organized as follows. Section 2 reviews the related works. Section 3 gives a detailed description of the proposed design method. Section 4 gives the experimental results. Finally, Section 5 concludes the paper.

II. RELATED WORKS

VFI-based energy optimization schemes can be categorized into two types: static VFI (SVFI) and dynamic VFI (DVFI). Basically, the difference between SVFI and DVFI is the flexibility in VFI partitioning: the former fixes the VFI partition at design time, whereas the latter can reallocate the partitioning at runtime. Ogras et al. employed the VFI paradigm in their NoC design for minimum energy consumption in a pioneering study [2]. For further energy optimization, Jang et al. incorporated partitioning, mapping, and routing and proposed an energy optimization framework [9]. The methods used in these studies can be classified as SVFI techniques, in which a single VFI partition is used for all types of applications and the V/Fs of the VFIs do not vary at runtime. In contrast to SVFI-based schemes, DVFI-based schemes can tune the V/F values of each VFI at runtime to further reduce the energy dissipation. Yan et al. proposed a hybrid regulator scheme to improve the power efficiency of multicore architectures with restricted shape and size of the VFIs [10]. Musoll analyzed the benefits of reconfiguring the VFIs in overcoming process variations [11].
These studies evaluated the energy saving and design overhead of the DVFI compared with the SVFI and per-core DVFS, but they ignored the latency overhead inherent in the NoC topology, e.g., the mesh. With respect to the NoC topology, the mesh is the most preferred topology owing to its advantages of reusability and reduced design time. However, the inherent redundancy of multiple hops between communicating cores produces a large latency and energy overhead, and, thus, the mesh topology cannot cope with the latency overhead in the DVFI. On the other hand, a custom topology provides better opportunities for optimization and can be generated with predefined requirements with respect to the number of routers and hop counts for a given application. A number of recent studies on custom topology generation for static VFI-based NoCs have been presented, and they have demonstrated that a custom topology-based VFI scheme can achieve latency savings compared with a mesh-based one [12, 13]. Consequently, to address the aforementioned timing overhead of the DVFI and of V/F tuning in a custom topology, a new design method that incorporates both the DVFI and the custom topology is necessary.

III. CT-DVFI DESIGN METHOD

In this section, we describe a detailed design method for constructing the CT-DVFI. The CT-DVFI design method consists of the custom topology generation and DVFI tuning steps for improving the energy efficiency. In the custom topology generation step, the core and communication information are employed to construct an optimal VFI cluster and to generate the NoC components (routers and links) for each application. In the DVFI tuning step, the V/F of each VFI is determined by a metric developed from core- and link-level utilization.

1. Topology Generation

The goal of the custom topology generation is to construct a topology such that all cores can communicate and transfer data over the on-chip network while satisfying specific requirements, such as the performance and energy consumption.
In addition, the cost of implementing mixed-clock first-in-first-out buffers (mcFIFOs), along with the partitioning between the intra- and inter-VFI communications, should be determined appropriately while considering the VFI architecture. It has been noted that, with the use of a custom topology, the energy consumption and the network latency can be reduced compared with the mesh topology. This allows us to implement DVFI tuning while maintaining the required network performance. Hence, in this part, we propose a custom topology generation method for the VFI to support efficient data transfers among VFIs and to enable the DVFI. Considering the on-chip communication and topological properties of the VFI, the custom topology generation method consists of two main steps: core clustering and topology construction. In the following, we give detailed descriptions of each step of the proposed topology generation method.

Core clustering: Usually, it is advantageous to cluster cores with similar operating V/F demands. Several previous works have demonstrated that, if the cores with similar V/F levels use the same V/F, interface overheads such as mcFIFOs can be reduced [2]. When the NoC is considered, the communication between the cores should be fully taken into account during clustering to reduce the communication cost. In [12], a core clustering method based on communication volume was presented, and its energy efficiency and performance improvement were demonstrated with various evaluations. However, in the DVFI, the core-level information, i.e., not only the core communication but also the core utilization, needs to be taken into account. The new VFI-aware clustering method relies on both the core utilization and the communication traffic. The principle of this method is to cluster the cores with similar core utilization and communication traffic into the same VFI so that the V/F can be tuned easily.
For example, cores with low utilization should be clustered together and assigned to a cluster with a low V/F level, whereas clusters with a high V/F level are allocated cores with high utilization. In this respect, the instructions per cycle (IPC) and the communication volume are used to measure the core utilization and the communication traffic, respectively.

The pseudocode of the proposed core clustering method is shown in Fig. 2. First, the core utilization is used to allocate the cores with similar behavior to the same clusters (lines 1-8). For this, we follow the widely known k-means clustering approach to cluster the cores with similar utilization. The k-means algorithm iteratively partitions the elements into a predefined number of clusters such that each element belongs to the cluster with the nearest mean [14]. In our method, we start by forming the basic initialization for the k-means clustering, where each core is randomly assigned to one of P clusters (P is the number of workload-based clusters) (line 1), and calculate the initial center of each cluster (lines 1-2). The center of each cluster, $\mu$, is defined as the mean utilization (IPC) of all the cores included in that cluster and is calculated as

$$\mu_j = \frac{\sum_{i=1}^{N} \delta_{ij} u_i}{\sum_{i=1}^{N} \delta_{ij}}. \qquad (1)$$

Here, $j$ is the cluster index, $u_i$ is the IPC of the $i$-th core, and $\delta_{ij}$ is an indicator function, which is set to 1 if and only if the $i$-th core belongs to the $j$-th cluster and to 0 otherwise. The cores in each cluster are evaluated to iteratively find the nearest cluster and move the core to that cluster according to

$$\arg\min_{j} \lVert x_i - \mu_j \rVert^2, \qquad (2)$$

where $x_i$ denotes the utilization of the $i$-th core. This operation repeats until all the clusters are updated and P clusters with similar utilization are generated.

Fig. 2. Pseudocode of the core clustering algorithm.

Then, to account for the communication-induced energy consumption, the communication volume is used to create Q clusters within each of the P clusters generated previously. The communication-based clustering begins by constructing the initial groups from the required minimum voltage, which strongly affects the communication energy and design overhead, as demonstrated in [13]. The cores in each of the P clusters are allocated to the corresponding one of Q groups (Q is the number of communication-based clusters) according to

$$V_i < \frac{2 (V_H - V_L)}{Q}\, q, \quad q = 1, \ldots, Q. \qquad (3)$$

Here, $V_i$ is the required minimum voltage of the $i$-th core, and $V_H$ and $V_L$ are the predefined highest and lowest voltages in each of the P clusters, respectively. Since the V/F of all the cores within each cluster should be identical, the V/F of all the cores in the same cluster is set to the maximum value among the cores.
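The utilization-based step of Eqs. (1) and (2) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `cluster_cores_by_ipc`, the seeded round-robin initialization (used so that no cluster starts empty), and the iteration cap are all assumptions made for this example.

```python
import random


def cluster_cores_by_ipc(ipc, num_clusters, seed=0, max_iters=100):
    """Group cores by utilization (IPC) with one-dimensional k-means.

    ipc: list of per-core IPC values (u_i in Eq. (1)).
    Returns (assignment, centers); assignment[i] is the cluster of core i.
    """
    n = len(ipc)
    if not 1 <= num_clusters <= n:
        raise ValueError("need 1 <= num_clusters <= number of cores")

    # Random initial assignment. Dealing a shuffled order round-robin keeps
    # every cluster non-empty (an assumption of this sketch, not the paper).
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    assignment = [0] * n
    for rank, core in enumerate(order):
        assignment[core] = rank % num_clusters

    centers = [0.0] * num_clusters
    for _ in range(max_iters):
        # Eq. (1): the center is the mean IPC of the cores in the cluster.
        for j in range(num_clusters):
            members = [ipc[i] for i in range(n) if assignment[i] == j]
            if members:  # a cluster that became empty keeps its old center
                centers[j] = sum(members) / len(members)

        # Eq. (2): move every core to the cluster with the nearest center.
        new_assignment = [
            min(range(num_clusters), key=lambda j: (ipc[i] - centers[j]) ** 2)
            for i in range(n)
        ]
        if new_assignment == assignment:
            break  # no core moved, so the clustering has converged
        assignment = new_assignment

    return assignment, centers


# Six hypothetical cores: three lightly loaded, three heavily loaded.
assignment, centers = cluster_cores_by_ipc([0.2, 0.25, 0.3, 1.8, 1.9, 2.0], 2)
```

With these hypothetical IPC values, the three lightly loaded cores end up in one cluster and the three heavily loaded cores in the other, which is the grouping the utilization-based step is meant to produce before the communication-based refinement.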
Because the communication-based clustering is focused on the energy consumption, the communication energy is estimated for the current clustering as a temporary value. The energy consumption is calculated using the equation given in [4]. To reduce the communication energy, we iteratively select the core pair with the largest inter-VFI communication volume and check, for each core of the pair, whether the total energy consumption can be reduced by migrating it to the other cluster. If none of the alternative core migrations can reduce the total energy consumption, the core pair with the next highest communication volume is considered. Accordingly, the inter-VFI links are transformed into intra-VFI links for the migrated core pairs. This iteration stops when there is no improvement in energy consumption after the subsequent rearrangement of the VFIs. Consequently, the cores with similar utilization and communication traffic are clustered together in this step.

Topology construction: The topology construction is built upon the clusters from the previous step and upon the cores contained in each cluster. We select the clusters one by one and assign each one to an appropriate router with a restricted number of ports. The detailed process is discussed below.

The pseudocode of the proposed topology construction method is shown in Fig. 3. In the algorithm, we first place the routers to build up an initial VFI NoC topology. The minimum number of routers should be determined while connecting the cores to routers. To prevent excessive design complexity associated with the number of ports in the routers, typical four-port routers, which are common in two-dimensional NoC design, are used. Given a cluster with n nodes, the minimum number of routers, $R_{min}$, is determined as

$$R_{min} = \left\lceil \frac{n}{2} \right\rceil. \qquad (4)$$

Next, for each intra-VFI communication, the core pair with the largest communication volume among all the communicating core pairs is connected to the same router, iteratively, until all the core pairs are connected. Then, the routers with available ports are connected to each other to generate as many inter-VFI links as possible. Finally, network interfaces (NIs) are generated between the routers and the cores to packetize the data.

Fig. 3. Pseudocode of the topology construction algorithm.

Because the routing path should be determined for the generated topology, shortest-path routing is used as the default routing. For each inter-VFI communication, the inter-VFI links on the minimal routing path are retrieved and the usage count of each of these links is incremented. Among the candidate inter-VFI links in each VFI communication, the link with the largest count is the most frequently used inter-VFI link. Finally, mcFIFOs are generated on the chosen optimal inter-VFI links.

2. DVFI Tuning Method

With time-varying workloads, dynamic fine-tuning of the V/F levels of the VFIs is applied. The traditional DVFS uses core-level information to tune the core's V/F values. For the DVFI, we use the combined information from all cores and links within the VFI, i.e., the core and link utilization, to determine the suitable V/F of each VFI. Therefore, we employ a metric, M, that incorporates the information of the cores and links in a VFI, which is defined as

$$M = \omega_c \sum_{\forall i \in VFI_j} \frac{uc_i}{nc_j} + \omega_l \sum_{\forall l \in VFI_j} \frac{ul_l}{nl_j}. \qquad (5)$$

Here, $uc_i$ is the utilization of the $i$-th core, $ul_l$ is the utilization of the $l$-th link, $nc_j$ is the number of cores in the $j$-th VFI, and $nl_j$ is the number of links in the $j$-th VFI. $\omega_c$ and $\omega_l$ are the weights for the utilization of the core and link, respectively. The weights are calculated as the proportion of the core to the link utilization. According to the value of M, the predicted V/F is calculated and the V/F is adjusted for the VFI.

IV. EXPERIMENTAL RESULTS

In this section, the efficiency of the proposed CT-DVFI is evaluated against mesh-based designs and different VFI configurations. Sniper [16], a multi-core simulator, is used to obtain detailed core- and network-level information. The platform configurations were set according to the Intel Xeon Nehalem architecture to construct a 64-core system. We modified the Sniper code to support NoC interconnection. A nominal cache configuration of 64 KB L1 instruction and data caches and a shared 8 MB L2 cache is assumed. The PARSEC and SPLASH2 benchmarks were used in the simulation. The core-level statistics generated by the Sniper simulations were fed into McPAT [17] to determine the energy consumption. To consider the nominal operation scenario, the adopted dynamic V/F levels use discrete V/F pairs of 0.5 V/1.25 GHz, 0.6 V/1.5 GHz, 0.6 V/1.5 GHz, 0.8 V/2.0 GHz, 0.9 V/2.25 GHz, and 1 V/2.5 GHz. To estimate the energy overhead introduced by the on-chip VR, we follow the method used in a recent work [2], and the overhead is calculated as

$$E_{VR} = (1 - \eta)\, C_{filter}\, \lvert V_2^2 - V_1^2 \rvert. \qquad (6)$$

Here, $E_{VR}$ is the energy dissipated by the voltage regulator due to a voltage transition, $\eta$ is the power efficiency of the regulator, $C_{filter}$ is the regulator filter capacitance, and $V_2$ and $V_1$ are the two voltage levels. Therefore, the energy overhead for each VFI was calculated as the sum of the energy overheads of the clock signal, mcFIFOs, and VR.

To demonstrate the performance of our proposed method, we considered different VFI configurations, such as SVFI and DVFI, and network configurations, such as the mesh and a custom topology. Therefore, we performed simulations for mesh-based SVFI (ME-SVFI), mesh-based DVFI (ME-DVFI), custom topology-based SVFI (CT-SVFI), and custom topology-based DVFI (CT-DVFI). As the baseline for all our configurations, we considered the commonly used mesh-based non-VFI (ME-NVFI). The per-core DVFS scheme was ignored in this experiment owing to its impracticality in manycore design.

We compared the execution time of all the configurations considered here with the baseline ME-NVFI in Fig. 4. We can see that the custom topology produced less latency overhead than the mesh in each VFI configuration. In addition, the CT-DVFI produced the lowest values among all configurations for the benchmarks considered. Moreover, the energy-delay product is evaluated for comparison.
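As a reminder of how a normalized energy-delay product is obtained, the following minimal sketch multiplies energy by execution time for each configuration and divides by the ME-NVFI baseline. The function name `normalized_edp`, the configuration labels `config_a` and `config_b`, and all numeric inputs are hypothetical placeholders, not measurements from the paper.

```python
def normalized_edp(energy, exec_time, baseline="ME-NVFI"):
    """Energy-delay product of each configuration relative to the baseline.

    energy and exec_time map a configuration name to its total energy and
    execution time; any consistent units work because the ratio is unit-free.
    """
    edp = {name: energy[name] * exec_time[name] for name in energy}
    return {name: value / edp[baseline] for name, value in edp.items()}


# Placeholder inputs for illustration only; these are NOT results from the paper.
energy = {"ME-NVFI": 10.0, "config_a": 8.0, "config_b": 6.0}
exec_time = {"ME-NVFI": 2.0, "config_a": 2.5, "config_b": 2.0}
ratios = normalized_edp(energy, exec_time)
```

A ratio below 1 means the configuration has a lower energy-delay product than the baseline. In the placeholder data, `config_a` saves energy but runs longer, so its ratio stays at 1, which shows why the product is used rather than energy alone.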
Fig. 5 shows the normalized energy-delay product of all configurations with respect to ME-NVFI. The CT-DVFI configuration shows a lower energy-delay product than its mesh counterpart in all benchmarks. Also, the DVFI outperforms the other configurations with either the mesh or the custom topology, owing to the capability of the DVFI to reduce energy consumption with little performance impact.

Fig. 4. Comparison of the execution time using different benchmarks.

Fig. 5. Comparison of the energy delay products using different benchmarks.
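The DVFI tuning step of Section III can also be made concrete with a small sketch. It rests on several assumptions that are not stated in the paper: utilizations are normalized to [0, 1], the weights sum to 1, the metric of Eq. (5) is mapped to the lowest V/F level whose frequency covers M times the maximum frequency (a hypothetical policy, since the text does not give the mapping), and Eq. (6) has the form (1 - η) · C_filter · |V2² - V1²|. The names `vfi_metric`, `select_vf`, and `regulator_energy`, and the efficiency and capacitance values, are illustrative only.

```python
def vfi_metric(core_util, link_util, w_core, w_link):
    """Eq. (5): weighted mean core utilization plus weighted mean link utilization."""
    if not core_util or not link_util:
        raise ValueError("a VFI needs at least one core and one link")
    mean_core = sum(core_util) / len(core_util)
    mean_link = sum(link_util) / len(link_util)
    return w_core * mean_core + w_link * mean_link


# (volts, GHz) pairs from Section IV; the pair repeated in the text is listed once.
VF_LEVELS = [(0.5, 1.25), (0.6, 1.5), (0.8, 2.0), (0.9, 2.25), (1.0, 2.5)]


def select_vf(metric, levels=VF_LEVELS):
    """Hypothetical policy: lowest level whose frequency covers metric * f_max."""
    f_max = levels[-1][1]
    target = min(max(metric, 0.0), 1.0) * f_max  # clamp the metric to [0, 1]
    for volt, freq in levels:
        if freq >= target:
            return volt, freq
    return levels[-1]


def regulator_energy(v_from, v_to, efficiency, c_filter):
    """Assumed form of Eq. (6): energy lost in the regulator for one transition."""
    return (1.0 - efficiency) * c_filter * abs(v_to ** 2 - v_from ** 2)


# One VFI with two cores and two links (hypothetical utilizations and weights).
m = vfi_metric([0.8, 0.6], [0.3, 0.1], w_core=0.7, w_link=0.3)
volt, freq = select_vf(m)
overhead = regulator_energy(1.0, volt, efficiency=0.9, c_filter=1e-6)
```

Keeping the regulator overhead next to the level selection makes the trade-off visible: every change of level costs transition energy, so a tuning policy only pays off when the lower level is held long enough to recover that cost.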

V. CONCLUSIONS

In this paper, a new scheme for designing a custom topology-based DVFI is proposed for energy-efficient manycore platforms. We demonstrated that the proposed CT-DVFI significantly improved the energy efficiency without sacrificing the performance. We also showed that, for all the benchmarks considered, it achieved a significantly lower energy-delay product than the other topological, VFI, and combined configurations.

ACKNOWLEDGMENTS

This work was supported by the MOTIE (Ministry of Trade, Industry & Energy) ( ) and KSRC (Korea Semiconductor Research Consortium) support program for the development of the future semiconductor device and by the IT R&D Program of MSIP/IITP ( ).

REFERENCES

[1] L. Benini and G. De Micheli, "Networks on Chips: A New SoC Paradigm," Computer, Vol. 35, No. 1.
[2] U. Y. Ogras, et al., "Voltage-Frequency Island Partitioning for GALS-based Networks-on-Chip," IEEE Design Automation Conference.
[3] R. David, et al., "Dynamic Power Management of Voltage-Frequency Island Partitioned Networks-on-Chip using Intel's Single-chip Cloud Computer," IEEE/ACM International Symposium on Networks on Chip.
[4] S. Herbert and D. Marculescu, "Analysis of Dynamic Voltage/Frequency Scaling in Chip-Multiprocessors," International Symposium on Low Power Electronics and Design.
[5] L. Guang, et al., "Autonomous DVFS on Supply Islands for Energy-Constrained NoC Communication," International Conference on Architecture of Computing Systems.
[6] S. Tosun, et al., "Application-specific topology generation algorithms for network-on-chip design," IET Computer & Digital Techniques, Vol. 6, No. 5.
[7] B. Huang, et al., "Application-Specific Network-on-Chip Synthesis with Topology-Aware Floorplanning," Symposium on Integrated Circuits and Systems Design, pp. 1-6.
[8]
[9] W. Jang and D. Z. Pan, "A Voltage-Frequency Island Aware Energy Optimization Framework for Network-on-Chip," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, Vol. 1, No. 3.
[10] P. Choudhary and D. Marculescu, "Power management of voltage/frequency Island-based systems using hardware based methods," IEEE Transactions on Very Large Scale Integration Systems, Vol. 17, No. 3.
[11] J. Howard, et al., "A 48-core IA-32 processor in 45 nm CMOS using on-die message-passing and DVFS for performance and power scaling," IEEE J. Solid-State Circuits, Vol. 46, No. 1, Jan.
[12] C. L. Li, et al., "Communication-aware custom topology generation for VFI network-on-chip," IEICE Electronics Express, Vol. 11, No. 18, pp. 1-8, 2014.
[13] C. Li, et al., "Energy-efficient Custom Topology Generation for Link-failure-aware Network-on-chip in Voltage-frequency Island Regime," Journal of Semiconductor Technology and Science, Vol. 16, No. 6.
[14] S. Jin, et al., "Statistical Energy Optimization on Voltage Frequency Island based MPSoCs in the Presence of Process Variations," Microelectronics Journal, Vol. 54.
[15] J. A. Hartigan, Clustering Algorithms, Wiley.
[16] T. E. Carlson, et al., "Sniper: Exploring the Level of Abstraction for Scalable and Accurate Parallel Multi-Core Simulation," International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-12.
[17] S. Li, et al., "McPAT: An Integrated Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures," International Symposium on Microarchitecture, 2009.


Chang-Lin Li received his B.S. degree from the Department of Computer, Electronics and Telecommunication Engineering at Yanbian University of Science and Technology, Yanji, China, in 2010, and his M.S. degree from the Department of Cogno-Mechatronics Engineering at Pusan National University, Busan, Korea. He is currently a combined M.S. and Ph.D. student in the Department of Electrical and Computer Engineering at Sungkyunkwan University, Suwon, Korea.

Tae Hee Han received his B.S., M.S., and Ph.D. degrees in electrical engineering from KAIST, Daejeon, Republic of Korea, in 1992, 1994, and 1999, respectively. From 1999 to 2006, he was with the Telecom R&D Center at Samsung Electronics, Suwon, Korea. Since March 2008, he has been with Sungkyunkwan University, Suwon, Republic of Korea, as a professor. His research interests include SoC architectures and design technologies. From May 2011 to April 2013, he served as a full-time advisor on semiconductor devices for the Korean government.

Jae-Chern Yoo received the B.S. degree in electronics from Sungkyunkwan University, Korea, in 1986, and the M.S. and Ph.D. degrees in information & communication engineering and in electronics from KAIST and POSTECH, Korea, in 1996 and 2001, respectively. Since March 2008, he has been with Sungkyunkwan University as an Associate Professor.
