
Hybrid L2 NUCA Design and Management Considering Data Access Latency, Energy Efficiency, and Storage Lifetime

Seunghan Lee, Student Member, IEEE, Kyungsu Kang, Member, IEEE, Jongpil Jung, Student Member, IEEE, and Chong-Min Kyung, Fellow, IEEE

Abstract: Nonvolatile magnetic RAM (MRAM) offers high cell density and low leakage power. This paper reports on using a 3-D integration technology based on through-silicon vias to stack disparate memory technologies (e.g., SRAM and MRAM) onto chip multiprocessors. We explore the design of a 3-D stacked nonuniform hybrid SRAM/MRAM L2 cache architecture (NUCA) that uses the on-chip network to mitigate the interconnection problem. In addition, this paper investigates the problem of partitioning the shared SRAM/MRAM hybrid L2 cache and placing cache data into the partitioned 3-D stacked hybrid NUCA for concurrently executing multiple applications, in order to improve the system performance in terms of instructions per second while considering the heterogeneous characteristics in interconnection wire delay, memory cell density, memory access latency, and memory power consumption of a 3-D stacked hybrid SRAM/MRAM L2 cache. Experimental results show that the proposed runtime method with a 3-D stacked hybrid L2 cache improves performance by 61%, energy efficiency (i.e., energy-delay product) by 53%, and storage lifetime by 15.6 times on average compared with a conventional SRAM-only or MRAM-only L2 cache of similar area.

Index Terms: 3-D integration, dynamic cache management, nonvolatile memory.

Manuscript received October 27, 2015; revised February 2, 2016; accepted March 3, 2016. Date of publication March 24, 2016. This work was supported by the Center for Integrated Smart Sensors within the Ministry of Science, ICT and Future Planning through the Global Frontier Project under Grant CISS-2013M3A6A. S. Lee and K. Kang are with the Memory Business Department, Samsung Electronics, Hwaseong, South Korea (e-mail: hslsh26@gmail.com; kyungsu.kang@gmail.com). J. Jung and C.-M. Kyung are with the Department of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon, South Korea (e-mail: jpjung@duo.kaist.ac.kr; kyung@kaist.ac.kr).

I. INTRODUCTION

ENLARGEMENT of the traditional on-chip SRAM used as cache memory is becoming prohibitive, as the cache leakage power becomes critical and the access latency keeps increasing along with the cache capacity [1], [2]. Emerging memory technologies, such as magnetic RAM (MRAM), phase-change RAM (PCRAM), and resistive RAM (ReRAM), have been explored as potential alternatives to the existing memory technologies (e.g., SRAM and DRAM). Among these technologies, MRAM in particular has been explored as a replacement for the traditional on-chip SRAM cache due to its lower latency and higher write endurance compared with the other emerging memories [3], [7], [8]. MRAM has many advantages, such as ultralow leakage power, high cell density, and nonvolatility. However, it suffers from longer write latency and higher write energy consumption when compared with SRAM-based cache architectures.
We, therefore, need to consider the use of a hybrid SRAM/MRAM cache to improve the system performance in an energy-efficient manner, since most performance-critical write accesses are accommodated by the low-latency SRAM cache, while the energy efficiency is mainly determined by the MRAM cache. Nonuniform cache architecture (NUCA) has been proposed to manage large on-chip caches whose long access latency is mainly caused by the on-chip interconnect delay, which increases rapidly with technology scaling. In NUCA, a large cache is divided into multiple cache banks with different access latencies depending on their physical distances to the cores that request cache data. Each cache bank is also accessed independently by multiple cores, which increases the memory bandwidth [21]-[23]. 3-D integration is also an emerging technology that can alleviate the on-chip interconnect delay problem by stacking multiple active silicon layers on top of each other and connecting them through vertical interconnects [e.g., through-silicon vias (TSVs)]. 3-D integration makes it possible to stack disparate memory technologies (e.g., SRAM and MRAM) together in a cost-effective manner [9]. In this paper, we explore runtime management schemes for a 3-D stacked hybrid SRAM/MRAM L2 NUCA in order to improve the performance of 3-D chip multiprocessors (CMPs). For that, cache data placement can be done statically, e.g., cache way placement [10], and/or dynamically (e.g., cache partitioning [11] and data migration [12]), while considering the heterogeneous characteristics in interconnect wire delay (e.g., vertical wires are much shorter than horizontal wires), memory cell density (e.g., the size of an MRAM cell is much smaller than that of an SRAM cell), memory access latency (e.g., SRAM has much lower access latency than MRAM), and memory power consumption (e.g., MRAM consumes much higher power for memory writes, but benefits from ultralow leakage power compared with SRAM). The contributions of this paper are summarized as follows. 1) We compare the hybrid memory technology, i.e., the SRAM/MRAM L2 NUCA, with the pure memory technologies, i.e., the SRAM L2 NUCA and the MRAM L2 NUCA, in terms of performance and power efficiency.

2) We study the design of the 3-D stacked SRAM/MRAM L2 NUCA by managing the placement of cache ways, which has an impact on the performance of the stacked hybrid L2 NUCA as well as of the 3-D CMPs. 3) We propose a runtime cache partitioning method for the 3-D stacked SRAM/MRAM L2 NUCA to maximize the performance of 3-D CMPs in terms of instructions per second (IPSs). 4) We investigate the effect of data migration on the performance, the power efficiency, and the lifetime of the 3-D stacked SRAM/MRAM L2 NUCA.

The rest of this paper is organized as follows. Section II discusses related works. The design parameters and explorations for the 3-D hybrid L2 NUCA are explained in Section III. Section IV presents the proposed runtime management policies for the 3-D hybrid L2 NUCA. Section V presents the experimental setup. Section VI presents the experimental results. Finally, the conclusion is drawn in Section VII.

II. RELATED WORKS

MRAM has received attention as an alternative to SRAM or DRAM in on-chip cache memory design because of its several advantages, namely high-speed write accesses (1-50 ns) and high endurance (>10^15 write cycles) compared with PCRAM and ReRAM [3], [8]. In particular, Sun et al. [3] proposed an MRAM-based L2 cache stacked directly atop CMPs and compared it against an SRAM-based L2 cache in terms of performance and energy. In addition, a read-preemptive write buffer and an SRAM-MRAM hybrid L2 cache were proposed in order to mitigate the long write latency and high write energy of MRAM. They evaluated the proposed methods on both the static NUCA and the dynamic NUCA of the stacked L2 MRAM. However, both the static NUCA and the dynamic NUCA may suffer from the long network delay and the reduced MRAM lifetime caused by excessive data writes during data migration, which are addressed in this paper. The runtime cache partitioning method proposed in this paper maximizes the performance of 3-D CMPs with a stacked L2 SRAM/MRAM NUCA, while considering both the network latency and the lifetime of the L2 SRAM/MRAM NUCA. There are previous works focused on cache architecture design and runtime cache management policies for an MRAM on-chip cache [14]-[16]. Mao et al. [14] proposed an MRAM-aware prefetching method that prioritizes different types of cache requests based on their criticality for CMPs with an MRAM-based L3 cache. Wang et al. [15] proposed a selective bypass caching management for write-intensive data in an MRAM L3 cache. Chen et al. [16] combined static and dynamic schemes to optimize the cache data placement in a hybrid SRAM/MRAM L2 cache. In [16], a compiler was used to guide the hardware to rapidly achieve the desired cache data placement, while the hardware corrects the compiler hints based on the runtime cache behavior. However, these works assume a uniform cache architecture (UCA) and are, hence, only applicable to medium-sized caches. In the UCA model, the cache access latency is determined by the latency of the farthest cache bank. This is not a scalable model as cache sizes and the latency differences between the nearest and farthest cache banks grow. To address this problem, NUCAs have been proposed [17]. There are also previous works focused on NUCA-based MRAM cache architectures [18], [19].
Mishra et al. [18] proposed a network-level solution that alleviates the long write latency of MRAM by prioritizing cache accesses to idle banks and delaying accesses to the MRAM cache banks that are currently servicing long-latency write requests. Syu et al. [19] proposed a partition-level wear-leveling scheme to improve the lifetime of the hybrid SRAM/MRAM NUCA. However, these works assumed that a hybrid cache with a fixed capacity is assigned to each processor, which is likely to lead to suboptimal results depending on the applications. Since MRAM fabrication involves hybrid magnetic-CMOS processes, integrating MRAM into the same 2-D chip can be a key obstacle. Fortunately, the emerging 3-D chip integration technologies may provide further design and manufacturing cost benefits for on-chip mixed-technology, i.e., MRAM, integration [3], [20]. Dong et al. [20] evaluated the architectural-level performance and energy benefits of a stacked MRAM L2 cache and demonstrated that stacking MRAM atop a microprocessor can bring performance improvement and achieve greater power reduction compared with DRAM and SRAM. There are previous works focused on hybrid cache architectures [3]-[6]. Sun et al. [3], Wang et al. [4], and Zhao et al. [5] considered using a hybrid SRAM/MRAM cache to improve energy efficiency with negligible performance overhead, since most performance-critical write accesses can be accommodated by the low-latency SRAM, while the energy efficiency is achieved by the MRAM. Li et al. [6] proposed a hybrid cache hierarchy leveraging SRAM/eDRAM/RRAM to provide performance-maximal bandwidth capacity to the on-chip memory system.

III. DESIGN OF HYBRID L2 NUCA FOR 3-D CMP

In this paper, we focus on the design and management of a hybrid SRAM/MRAM L2 NUCA for a 3-D CMP, where an SRAM tier and an MRAM tier are stacked onto a multicore die. Fig. 1 shows an illustration of the target 3-D CMP.

Fig. 1. 3-D CMP where the hybrid SRAM-MRAM L2 cache is stacked. MRAM is placed on the top layer and SRAM on the other layer in the 3-D CMP.

The multicore tier consists of eight 16-stage pipelined in-order cores, such as the Intel Atom N455 [24]. The area of the core tier is 37 mm^2 in the 45-nm technology node [25]. Each core consists of a processing core, private L1 instruction and data caches, an L2 cache controller, and a TSV bus (consisting of TSVs for address, data, and control signals).

Each stacked SRAM and MRAM tier consists of multiple memory banks, forming a mesh network for planar communication. This 3-D hybrid bus-mesh topology [34], one of the most popular and general 3-D network-on-chip (NoC) topologies, takes advantage of the short vertical links of 3-D NoCs. The capacities of SRAM and MRAM are 2 and 8 MB, respectively, assuming that each stacked tier has an area similar to that of the core tier (i.e., 37 mm^2) and that each memory technology is fabricated in the 45-nm technology node [26]. In this section, we explore the effect of the design parameters of the hybrid L2 NUCA, such as the cache bank capacity and the cache-data-to-bank mapping, on the performance. For that, the experiments in this section were performed by assuming that several benchmarks are separately run on the P0 core in Fig. 1. Then, we compare the hybrid L2 NUCA with pure L2 NUCAs that use only one memory technology, either SRAM or MRAM, in terms of performance.

A. Cache Bank Capacity Decision

When the physical size of the NUCA is fixed, the capacity of a bank determines the access latency of a cache bank as well as the number of cache banks in the NUCA, which affects the average hop distance for data traversal in the network. The average hop distance, a commonly used metric to evaluate the performance of a network, denotes the number of hops taken by a packet to traverse a path, averaged over all possible source-destination pairs in the network [27]. As the capacity of a cache bank increases, the number of cache banks decreases, and thus, the average hop distance is also reduced, because the number of cache banks constituting the path for data traversal is reduced. However, the access latency of a cache bank itself increases with the cache bank capacity. Conversely, as the capacity of a cache bank decreases, the access latency of a cache bank decreases and the average hop distance increases. Given the target 3-D CMP shown in Fig. 1, we varied the cache bank capacity from 32 to 512 KB for the SRAM cache. Table I shows the cache bank access latency and the average hop distance according to the bank capacity, as obtained by the Non-Volatile SIMulator (NVSim) [26].

TABLE I. COMPARISON OF BANK ACCESS TIME AND AVERAGE HOP DISTANCE OF CACHE BANK ACCESS ACCORDING TO THE BANK CAPACITY (45-nm TECHNOLOGY SRAM BANK).

A tradeoff between the cache bank access latency and the average hop distance needs to be made with respect to the cache bank capacity to minimize the average cache access time (ACAT), that is,

ACAT = gamma * t_read + (1 - gamma) * t_write + hop * t_sw, (1)

where t_read is the cache read latency, t_write is the cache write latency, hop is the average hop distance, and t_sw is the network latency, i.e., the latency owing to the network router. gamma is the ratio of cache reads to total cache accesses, which typically lies between 0.34 and 0.95 for the SPEC2000/2006 benchmark suite. In (1), the first two terms represent the average cache bank access latency, and the last term represents the network latency resulting from data traversal through the network. Fig. 2 shows the ACAT with respect to the bank capacity for the SRAM cache and the MRAM cache.

Fig. 2. Normalized ACAT according to the cache bank size for (a) 45-nm SRAM bank and (b) MRAM bank. The cache access time consists of the cache bank access latency and the network latency.
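As a concrete illustration of how (1) drives the bank-capacity choice, the following minimal Python sketch evaluates the ACAT for a set of candidate SRAM bank capacities and picks the minimum. The latency and hop-distance numbers are placeholders standing in for the NVSim-derived values of Table I, and the read ratio and per-hop router latency are assumed values, not the paper's.

# Minimal sketch of the bank-capacity exploration driven by (1).
# The latency/hop numbers below are placeholders; the paper derives them
# from NVSim (Table I) for each candidate SRAM bank capacity.

def acat(gamma, t_read, t_write, avg_hop, t_sw):
    """Average cache access time of (1): bank latency plus network latency."""
    return gamma * t_read + (1.0 - gamma) * t_write + avg_hop * t_sw

# capacity_kB -> (t_read, t_write, avg_hop), in cycles/hops (hypothetical)
candidates = {
    32:  (4.0, 4.0, 3.8),
    64:  (5.0, 5.0, 3.0),
    128: (6.0, 6.0, 2.4),
    256: (8.0, 8.0, 1.9),
    512: (11.0, 11.0, 1.5),
}

gamma, t_sw = 0.7, 4.0   # read ratio and per-hop router latency (assumed)
best = min(candidates, key=lambda c: acat(gamma, *candidates[c][:2],
                                          candidates[c][2], t_sw))
print("bank capacity minimizing ACAT:", best, "KB")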
As shown in Fig. 2, the ACAT is minimized at bank capacities of 128 and 512 KB for the SRAM cache and the MRAM cache, respectively. This method is generally applicable to any 3-D CMP with a 3-D stacked homogeneous NUCA cache. In general, the bank sizes of the L2 SRAM and the L2 MRAM are determined at design time using (1), without knowledge of the runtime behavior caused by L2 cache partitioning and migration. The runtime cache access behavior is handled dynamically by the runtime algorithms running on the L2 NUCA.

B. Dynamic Logical to Physical Cache Mapping

In our 3-D NUCA shown in Fig. 1, each cache layer is divided into 16 banks interconnected via a 2-D mesh network. In NUCA, a multitude of cache banks provides substantial flexibility for mapping cache data to banks. One extreme is the static NUCA (S-NUCA) strategy [17], where the mapping of cache data into banks is statically determined based on the low-order bits of the set index included in the memory address. Using the low-order bits of the set index as the cache bank index statistically distributes cache data uniformly across the allocated cache banks, and thus, the average hop distance of a cache access can be large. The other extreme is the dynamic NUCA (D-NUCA) strategy [17], where cache data can be mapped into any cache bank. In the D-NUCA approach, the energy consumed for locating the cache data can be large, because a cache access request from a core must be broadcast to all the cache banks allocated to the core, so that the tags of all allocated cache banks can be read to check for a cache hit or miss. The increased network traffic owing to this broadcasting can degrade the system performance, especially for systems with many cores and many cache banks. Besides, aggressive cache data migrations intended to increase the number of cache hits in the closest banks in D-NUCA can exacerbate the network traffic and shorten the lifetime of certain cache banks due to intensive write operations to those banks in the case of an MRAM cache.

To reduce the overhead owing to such cache data migrations, we prohibit the cache data migrations that keep the least recently used (LRU) ordering aligned with the proximity between the processor core and an accessed bank. However, if there are multiple cache data blocks with the same LRU order when a cache miss occurs, the block located in the bank closest to the core is chosen for eviction (a small sketch of this tie-breaking rule is given after the discussion of Fig. 5 below).

Fig. 3. (a) Normalized memory access time and (b) normalized EDP of the D-NUCA L2 cache, with and without runtime cache data migration, for the heterogeneous SRAM-MRAM 3-D target architecture shown in Fig. 1.

Fig. 3 shows the average memory access time and the energy-delay product (EDP) of the D-NUCA scheme with and without cache data migration for the SRAM/MRAM 3-D target system shown in Fig. 1. As shown in Fig. 3(a), prohibiting data migrations in D-NUCA increases the average memory access time by only 0.96% on average compared with the migration-based D-NUCA. The benefit of the reduced average hop distance from the requesting core to the accessed banks can be offset by the increase in network traffic owing to excessive data migration. On the other hand, D-NUCA without migration improves EDP by 12.45% on average compared with the migration-based D-NUCA due to the reduced number of bank access operations, as shown in Fig. 3(b). Compared with the other benchmarks, ammp has the largest EDP reduction due to its lower temporal locality. Since a benchmark with lower temporal locality accesses the memory almost randomly, D-NUCA with migration incurs an excessive number of data migrations and, thus, increases both the memory access time and the energy consumption. Another shortcoming of the migration-based D-NUCA is the reduction of the MRAM lifetime owing to overly frequent cache data write operations.

Fig. 4. Lifetime results of each benchmark in the D-NUCA L2 cache, with and without runtime cache data migration, normalized with respect to the migration-based D-NUCA L2 cache.

Thus, as shown in Fig. 4, D-NUCA without data migration increases the L2 cache lifetime by 3.58 times on average compared with the migration-based D-NUCA; i.e., D-NUCA without migration is suitable for an on-chip last-level MRAM cache, where the lifetime is as important as the access time.

Fig. 5. Normalized number of accesses of each bank when 16 SRAM banks are allocated to a core for SPEC2000/2006 benchmarks. The x-axis denotes the allocated bank indexes sorted in ascending order of hop distance from the processing core.

For D-NUCA without migration, Fig. 5 shows the number of accesses to each bank for several benchmarks, obtained through simulation from 0 to 6 ms when 16 SRAM banks are allocated to a core. In Fig. 5, the x-axis denotes the allocated bank indexes sorted in ascending order of hop distance from the processing core. As shown in Fig. 5, in D-NUCA, the number of accesses to each bank tends to decrease as the physical distance from the processing core to the bank increases, as shown for the first group, i.e., twolf, equake, ammp, gcc, and parser. On the other hand, for the second group of benchmarks, such as art, bzip, lbm, sjeng, and mcf in Fig. 5, the number of accesses to far banks is quite significant.
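Returning to the migration-free replacement rule introduced at the beginning of this subsection, the sketch below shows the tie-breaking step: the victim is chosen by LRU age alone, and among equally old candidates the line in the bank closest to the requesting core is evicted, so that the incoming line can be placed nearby. The data structures and field names are illustrative, not the hardware implementation.

# Sketch of victim selection in the migration-free D-NUCA described above.
# A set's lines may live in different banks; 'age' is the LRU age (larger =
# older) and 'hops' is the hop distance from the requesting core to the
# line's bank. All names here are illustrative.

from dataclasses import dataclass

@dataclass
class LineInfo:
    bank: int
    age: int    # LRU age, larger means less recently used
    hops: int   # hop distance from the requesting core to this line's bank

def pick_victim(lines):
    """Evict by LRU age; break ties toward the bank closest to the core."""
    oldest = max(line.age for line in lines)
    candidates = [l for l in lines if l.age == oldest]
    return min(candidates, key=lambda l: l.hops)

ways = [LineInfo(bank=5, age=3, hops=4), LineInfo(bank=1, age=3, hops=1),
        LineInfo(bank=2, age=1, hops=2)]
print(pick_victim(ways))   # -> the age-3 line in the 1-hop bank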
In D-NUCA without migration, cache replacement depends only on the LRU replacement policy, which makes accessed bank indexes more uniformly distributed as shown for the second group benchmarks. In application programs such as the second group benchmarks, the average hop distance becomes larger in D-NUCA, and memory access time becomes even longer owing to the network traffic overhead of the broadcasting. Fig. 6 shows the results of average hop distance and average memory access time for S-NUCA and migration-prohibited D-NUCA. As shown in Fig. 6(a), the average hop distance in D-NUCA is smaller than that in S-NUCA. However, a number of applications, such as art, bzip, lbm, sjeng, and mcf (the second group), show negligible average hop distance reduction from S-NUCA to D-NUCA, because, in these applications, banks far from the processing core are quite significantly accessed in D-NUCA, as shown in Fig. 5. In this type of application, S-NUCA outperforms D-NUCA in terms of average memory access time, as shown in Fig. 6(b), owing

to the D-NUCA network traffic overhead of broadcasting.

Fig. 6. (a) Average hop distance of bank accesses and (b) average memory access time for S-NUCA and D-NUCA without migration, normalized with respect to S-NUCA, for the target 3-D architecture shown in Fig. 1.

On the other hand, the first group (twolf, equake, ammp, gcc, and parser) shows a larger reduction of the average hop distance from S-NUCA to D-NUCA, because the accesses in these applications mainly go to banks near the processing core in D-NUCA, as shown in Fig. 5. In these applications, the network traffic overhead owing to broadcasting in D-NUCA is compensated for by the reduction of the hop distance of bank accesses, and thereby D-NUCA outperforms S-NUCA in terms of memory access time, as shown in Fig. 6(b). Therefore, the preference between S-NUCA and D-NUCA in terms of memory access time can differ according to the cache access pattern of the benchmark, i.e., the average hop distance of bank accesses.

Fig. 7. Average hop distance variation over time, in units of 0.1 s, for the gcc, parser, and equake benchmarks when 16 SRAM cache banks are allocated to a core.

Fig. 7 shows the average hop distance variation over time for three benchmarks (gcc, parser, and equake). As shown in Fig. 7, the average hop distance varies with time. Thus, to minimize the memory access time, it is necessary to switch the cache data mapping scheme between S-NUCA and D-NUCA at runtime. This paper employs an address translator to support both S-NUCA and D-NUCA, along with a runtime switch/controller.

Fig. 8. Address translator for the dynamic cache data mapping scheme configuration.

Fig. 8 shows the implementation of the address translator, which consists of a 1-bit cache data mapping control register C and a multiplexer for selecting the cache data mapping scheme. Since the size of a cache data block is 64 bytes, the six least significant bits of the memory address form the block offset, as shown in Fig. 8. Since there are 2048 (=2^11) sets in each cache bank and 32 (=2^5) SRAM and MRAM banks in total in our cache architecture, the next 16 (11 + 5) bits form the set index. Within this 16-bit set index, the low-order 5 bits are used for bank indexing, and the upper 11 bits are used for set indexing in the S-NUCA mode. In the D-NUCA mode, bank indexing is not applicable, because cache data can be sent to any bank. If the D-NUCA scheme is selected, the cache data mapping control register C is set to 1. The runtime cache data mapping selection between S-NUCA and D-NUCA is explained in Section IV-C.
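The following sketch mirrors the bit-field decomposition performed by the address translator of Fig. 8, with the field widths taken from the text (6 offset bits, 5 bank-index bits, 11 set-index bits). The function name and the returned tuple layout are illustrative; in D-NUCA mode the bank field is simply not used for placement.

# Bit-level sketch of the address translator in Fig. 8. Field widths follow
# the text (64-B blocks, 2048 sets per bank, 32 banks); the helper name and
# the returned tuple layout are illustrative.

OFFSET_BITS, BANK_BITS, SET_BITS = 6, 5, 11

def translate(addr, dnuca_mode):
    """Split a physical address into (tag, set, bank) for S-NUCA;
    in D-NUCA mode the bank field is not used for placement."""
    bank   = (addr >> OFFSET_BITS) & ((1 << BANK_BITS) - 1)
    set_ix = (addr >> (OFFSET_BITS + BANK_BITS)) & ((1 << SET_BITS) - 1)
    tag    = addr >> (OFFSET_BITS + BANK_BITS + SET_BITS)
    if dnuca_mode:   # control register C = 1: any allocated bank may hold the line
        return tag, set_ix, None
    return tag, set_ix, bank   # C = 0: low-order set-index bits pick the home bank

print(translate(0x12345678, dnuca_mode=False))
print(translate(0x12345678, dnuca_mode=True))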
C. Homogeneous 3-D L2 NUCAs

Let us compare the two memory technologies used for the 3-D L2 NUCA, i.e., SRAM and MRAM. Here, we consider two examples of homogeneous 3-D L2 NUCAs: 1) a 3-D SRAM-based L2 NUCA and 2) a 3-D MRAM-based L2 NUCA, as shown in Fig. 9(a) and (b), respectively.

Fig. 9. Two motivational examples of 3-D CMPs, each of which consists of eight cores with one stacked L2 cache layer. (a) 3-D CMP where an SRAM-based L2 cache layer is stacked. (b) 3-D CMP where an MRAM-based L2 cache layer is stacked.

In Fig. 9(a) and (b), the capacities of an SRAM bank and an MRAM bank are the same as those in the target 3-D CMP shown in Fig. 1, i.e., 128 and 512 KB, respectively.

Fig. 10. (a) Normalized number of accesses of each bank when 16 SRAM banks are allocated for the equake, gcc, and sjeng benchmarks. The x-axis denotes the allocated bank indexes sorted in ascending order of hop distance from the processing core. (b) Same for 16 MRAM banks. (c) Indexes of cache bank nodes according to the distance from the processor core node.

Fig. 10 shows the number of cache bank accesses with respect to the cache bank position for three benchmarks from SPEC2000/2006, i.e., equake, gcc, and sjeng, separately run on the P0 core in Fig. 9(a), i.e., the 3-D SRAM-based L2 NUCA. In this result, the D-NUCA addressing scheme (explained in Section III-B) is used in order to fully exploit the given cache capacity. In Fig. 10(a) and (b), it is observed that the spread of bank accesses differs across applications. The spread is largest in the case of sjeng and smallest in the case of equake. This tendency holds for both SRAM and MRAM. Fig. 11(a) shows the average hop distance for the two examples, i.e., the 3-D SRAM-based L2 NUCA and the 3-D MRAM-based L2 NUCA, where each benchmark application is run on the P0 core. Compared with the 3-D SRAM-based L2 NUCA, more cache accesses are served by closer banks in the 3-D MRAM-based L2 NUCA because of the larger capacity of each MRAM bank. Thus, the resulting average hop distance of the 3-D MRAM-based NUCA is smaller than that of the 3-D SRAM-based NUCA. However, the reduction ratios of the average hop distance differ according to the applications.

In the case of equake, the reduction ratio of the average hop distance is smaller than in the case of gcc, because equake already has a small average hop distance and a small cache working set that can be covered by a small cache capacity. In the case of sjeng, which yields a larger average hop distance, even the large capacity of the 3-D MRAM-based NUCA is not enough to cover the cache working set. Thus, the reduction ratio of the average hop distance is smaller than in the case of gcc, where the 3-D MRAM-based NUCA effectively reduces the average hop distance of the cache accesses. As shown in Fig. 10(a) and (b), the distance from the requesting core to the accessed banks becomes effectively shorter in the case of gcc than in the other cases.

Fig. 11. (a) Average hop distance of bank accesses, (b) normalized IPSs, and (c) normalized average memory access time for the two examples of 3-D CMPs, i.e., SRAM-based L2 cache and MRAM-based L2 cache, as shown in Fig. 9.

Fig. 11(b) and (c) shows the results of IPS and average memory access time, respectively. As shown in Fig. 11(a), in the cases of equake, sjeng, twolf, bzip, lbm, mcf, and ammp, the reduction of the average hop distance obtained by using the 3-D MRAM-based NUCA instead of the 3-D SRAM-based NUCA is very small. This makes the 3-D SRAM-based NUCA outperform the 3-D MRAM-based NUCA once the longer access latency of MRAM is taken into account. On the other hand, in the cases of gcc, parser, and art, the performance gain due to the larger reduction of the average hop distance with the 3-D MRAM-based NUCA more than compensates for the performance loss resulting from the longer access latency of MRAM, which makes the resulting IPS of the 3-D MRAM-based NUCA larger, and its average memory access time smaller, than those of the 3-D SRAM-based NUCA. In short, using a single, homogeneous memory technology, either SRAM or MRAM, cannot give the best IPS performance across all applications. This motivates us to use a hybrid, heterogeneous 3-D SRAM-MRAM NUCA, which can exploit the advantages of both memory technologies.

IV. RUNTIME MANAGEMENT OF HYBRID L2 NUCA FOR 3-D CMP

Given the target 3-D CMP described in Section III, our runtime solution is to find: 1) the best allocation of the SRAM and MRAM cache banks to each core (i.e., cache partitioning) and 2) the best cache addressing scheme (e.g., S-NUCA or D-NUCA), at every runtime reconfiguration interval, such that the instruction throughput of the target 3-D CMP is maximized.
The procedure to find the performance-maximal runtime solution consists of the following three steps: 1) finding the L2 cache capacity (logically) assigned to each core to minimize the aggregated L2 cache misses (Section IV-A); 2) based on the assigned L2 cache capacity, finding the numbers and the positions of the SRAM and MRAM cache banks assigned to each core (Section IV-B); and 3) selecting the best cache addressing scheme between S-NUCA and D-NUCA (Section IV-C).

A. First Step: Cache Capacity Allocation

In the process of capacity allocation, the amount of L2 cache capacity to be assigned to each core is determined so that the total number of L2 cache misses is minimized.

Algorithm 1. Cache Capacity Allocation.

Algorithm 1 shows the proposed L2 cache capacity allocation. It starts with initializing the L2 cache capacity assigned to each core (i.e., C_i) as C_total, where C_total is the total L2 cache capacity and C_i is the L2 cache capacity assigned to core i (lines 1-3).

For each core, the increase in cache misses is estimated when the assigned L2 cache capacity is halved (lines 5-7). After the estimation of cache misses, the core yielding the smallest increase in L2 cache misses has its assigned cache capacity cut by half (lines 8 and 9). This procedure is iterated until the sum of C_i over all N cores becomes equal to or less than C_total (line 4). For the cache miss estimation with respect to the assigned L2 cache capacity (line 6), we adopt a hardware-based cache miss monitor [21]-[23], which consists of virtual cache tags and hit counters.

Fig. 12. (a) Cache miss monitor consisting of virtual cache tags and hit counters. (b) Example of how the number of cache misses can be estimated.

Fig. 12 shows an example of a four-way set-associative cache, illustrating how the cache miss monitor provides the number of cache misses with respect to the assigned cache capacity. In Fig. 12(a), each cache set has four counters (v_1, v_2, v_3, and v_4) that record the cache hit counts for each of the four recency positions, ranging from most recently used to LRU. If a cache access results in a hit, the counter corresponding to the hit-causing recency position is incremented. Fig. 12(b) shows the hit counter results for each recency position. Since the LRU cache replacement policy obeys the stack property [28], given information about misses in a cache with a large number of ways, it is possible to estimate the cache misses for a cache with a smaller number of ways. As shown in Fig. 12(b), if the cache size is reduced from four ways to three ways, the misses increase from 25 to 35. Further reducing the cache size to two ways increases the number of misses to 50, and with only one way, the cache incurs 70 misses. To reduce the hardware overhead of the cache miss monitor, dynamic set sampling [21] is used, which reduces the number of cache sets in the virtual cache tags by sampling a few cache sets from the main tag arrays instead of covering the entire cache sets. In this paper, 32 cache sets are sampled.
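A compact sketch of the capacity-allocation step (Algorithm 1), combined with the stack-property miss estimate of Fig. 12, is given below. Capacity is expressed in cache ways for simplicity, and the per-core access counts and per-way hit counters are made-up monitor readings; the function and variable names are ours, not the paper's.

# Sketch of Algorithm 1 using the stack-property miss estimate of Fig. 12.
# Per-core inputs: total L2 accesses and the per-recency-position hit
# counters collected by the miss monitor. All numbers are illustrative.

def misses_at(ways, accesses, way_hits):
    """Stack property: a w-way cache hits exactly the accesses whose
    recency position is <= w, so misses = accesses - sum(way_hits[:w])."""
    return accesses - sum(way_hits[:ways])

def allocate_capacity(cores, total_ways):
    """Start every core at the full capacity, then repeatedly halve the core
    whose halving adds the fewest extra misses until the total fits."""
    alloc = {c: total_ways for c in cores}
    while sum(alloc.values()) > total_ways:
        def penalty(c):
            acc, hits = cores[c]
            return misses_at(alloc[c] // 2, acc, hits) - misses_at(alloc[c], acc, hits)
        victim = min((c for c in cores if alloc[c] > 1), key=penalty)
        alloc[victim] //= 2
    return alloc

# two cores sharing a 16-way (per-set) L2; hit counters indexed by recency position
cores = {
    "P0": (1000, [400, 200, 100, 50] + [10] * 12),   # high cache utility
    "P1": (1000, [700, 50, 20, 10] + [2] * 12),      # low cache utility
}
print(allocate_capacity(cores, total_ways=16))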

B. Second Step: Cache Bank Allocation

Given the cache capacity C_i assigned to core i for all the cores in the 3-D CMP, the second step, i.e., cache bank allocation, allocates SRAM and MRAM cache banks to each core in order to minimize the average memory access latency, while keeping the total capacity of the SRAM and MRAM cache banks allocated to a core equal to the assigned cache capacity C_i.

Algorithm 2. Cache Bank Allocation.

Algorithm 2 shows the proposed SRAM and MRAM bank allocation algorithm. It starts by coloring the SRAM cache bank nearest to each core with the color of that core, which updates the remaining capacity of each core, R_i, to C_i - 128 KB, i.e., the assigned cache capacity of core i minus the capacity of one SRAM bank (lines 1-4). This initial conditioning guarantees that at least one SRAM cache bank is allocated to each core. The following procedure is iterated until all the remaining capacities become zero. First, we find uncolored q_S-nodes or q_M-nodes (SRAM and MRAM bank nodes, respectively) that neighbor at least one colored bank node. Each uncolored bank is then assigned the color of the core that yields the largest memory-access-latency reduction, i.e., RL_lj, when the uncolored bank (bank j) is allocated to that core (core l) (lines 15-18), in order to minimize the average cache access latency. The candidate cores to which an uncolored bank can be allocated include the cores with the shortest hop distance from the uncolored bank (line 8) and the cores to which the colored neighbor banks have already been allocated (line 13). After each uncolored bank is colored, the remaining capacity of the core to which it was allocated is updated (lines 23-27). RL_ij, the estimated reduction in memory access latency when bank j is additionally assigned to core i, is given by

RL_ij = h_ij * (L_OFF - L_ij), (2)

where h_ij is the number of cache hits from core i to bank j after bank j is allocated to core i, L_OFF is the latency of an off-chip memory access, which is assumed to be constant for all cores, and L_ij is the access latency from core i to bank j. The number of cache hits in bank j due to accesses from core i, i.e., h_ij, can be estimated from the auxiliary tag arrays (ATAs) shown in Fig. 12. In the ATAs, the number of cache misses is estimated with respect to the number of cache ways assigned to core i. Then, h_ij is estimated by subtracting the number of aggregated cache misses, given the additional cache capacity of bank j, from the number of cache accesses from core i. In (2), L_ij is represented as

L_ij = d^R_ij * r^R_i + d^W_ij * (1 - r^R_i), (3)

where d^R_ij and d^W_ij are, respectively, the read and write access latencies between core i and bank j, and r^R_i is the read-access ratio of core i.
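The scoring at the heart of the bank-allocation step can be sketched as follows, directly from (2) and (3): an uncolored bank is colored with the candidate core for which the estimated latency reduction RL is largest. The off-chip latency, per-core read ratios, and hit estimates below are illustrative placeholders for the monitor-derived quantities described above.

# Sketch of the scoring used in the bank-allocation step (Algorithm 2).
# Equations (2)-(3): an uncolored bank goes to the candidate core with the
# largest estimated latency reduction RL. h_ij (extra hits gained by adding
# bank j to core i) would come from the ATA-based monitor; the numbers here
# are illustrative.

L_OFF = 200.0   # off-chip access latency in cycles (assumed constant)

def bank_latency(d_read, d_write, read_ratio):
    """L_ij of (3): read/write latencies from core i to bank j, weighted by
    the core's read ratio."""
    return d_read * read_ratio + d_write * (1.0 - read_ratio)

def latency_reduction(extra_hits, d_read, d_write, read_ratio):
    """RL_ij of (2): every extra hit avoids an off-chip access."""
    return extra_hits * (L_OFF - bank_latency(d_read, d_write, read_ratio))

# one uncolored MRAM bank, two candidate cores (hypothetical monitor data)
candidates = {
    "P0": dict(extra_hits=900, d_read=14, d_write=24, read_ratio=0.8),
    "P1": dict(extra_hits=300, d_read=10, d_write=20, read_ratio=0.6),
}
winner = max(candidates, key=lambda c: latency_reduction(**candidates[c]))
print("bank colored for", winner)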
C. Third Step: Cache Addressing Decision

After the SRAM and MRAM bank allocation, the selection between the S-NUCA and D-NUCA cache addressing schemes is made at runtime to further reduce the network latency for data traversal in the network. To do so, it is necessary to estimate the average network latency of both cache addressing schemes and to select the scheme with the shorter average network latency. The average network latency is defined as the average number of core clock cycles for data traversal through the network, i.e., from sending the cache access request to receiving the cache access completion reply. The average network latency is a function of the average hop distance of bank accesses.

Fig. 13. Average network latency according to the average hop distance for the two cache addressing schemes, S-NUCA and D-NUCA, when 16 cache banks are allocated to a core.

Fig. 13 shows the average network latency for the two cache addressing schemes, S-NUCA and D-NUCA, according to the average hop distance when 16 cache banks are allocated to a core. As shown in Fig. 13, the average network latency can be approximated by straight lines, that is,

l^S_net,i = alpha^S * hop^S_i + beta^S, (4)
l^D_net,i = alpha^D * hop^D_i + beta^D, (5)

where l^S_net,i and l^D_net,i are the estimated average network latencies of S-NUCA and D-NUCA, and alpha^S, beta^S, alpha^D, and beta^D are the coefficients of the straight-line approximations of the average network latency as a function of the average hop distances, hop^S_i and hop^D_i, of S-NUCA and D-NUCA, respectively, for core i. The average hop distance hop_i is defined as the average number of hops needed to access the L2 cache banks allocated to core i and is calculated as

hop_i = ( sum_{k=1..N_i} a_k * d_k ) / ( sum_{k=1..N_i} a_k ), (6)

where N_i is the total number of L2 cache banks allocated to core i, and a_k and d_k are the number of accesses to bank k and the hop distance from the core to bank k among the allocated banks. The effect of the packet broadcasting overhead in D-NUCA appears as a steeper slope of the approximated straight line than that of S-NUCA, as shown in Fig. 13. The ratio of the slope of the D-NUCA line to that of the S-NUCA line is defined as the congestion factor theta. The congestion factor denotes the effective average hop distance of D-NUCA normalized to that of S-NUCA and can be expressed as

theta = alpha^D / alpha^S, theta > 1, beta^S = beta^D, (7)

where theta is usually larger than one owing to the packet broadcasting overhead in D-NUCA. The value of theta varies according to the number of allocated banks. In D-NUCA, as more cache banks are allocated, the number of packets to be sent simultaneously for broadcasting increases linearly. However, in S-NUCA, only one packet is required to check a cache hit/miss, because the destination cache bank has already been determined by the memory address. Thus, the relative network delay ratio of D-NUCA to S-NUCA, i.e., the congestion factor theta, increases linearly as more cache banks are allocated to a core in D-NUCA. For a more accurate estimation of the average network latency, it is convenient to keep a lookup table of congestion factors according to the number of allocated banks, as shown in Table II.

TABLE II. LOOKUP TABLE FOR CONGESTION FACTORS, theta, ACCORDING TO THE NUMBER OF ALLOCATED CACHE BANKS.

beta^S and beta^D represent the network latencies of S-NUCA and D-NUCA, respectively, when the average hop distance of bank access is zero. Intuitively, the zero-hop average network latencies of S-NUCA and D-NUCA must be equal, i.e., beta^S = beta^D. Finally, the cache addressing scheme with the shorter average network latency is selected. The selection metric, i.e., the average network latency of S-NUCA minus that of D-NUCA, is given by

delta_l_net = l^S_net - l^D_net = alpha^S * (hop^S - theta * hop^D). (8)

If delta_l_net > 0, D-NUCA is selected; otherwise, the S-NUCA scheme is selected. Because alpha^S is a positive value, the metric for comparing the two schemes can be simplified to hop^S - theta * hop^D.
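A small sketch of the per-core addressing decision of (6)-(8) follows: the access-weighted average hop distance is computed for each scheme, the D-NUCA value is scaled by the congestion factor looked up from the number of allocated banks (as in Table II), and the scheme with the smaller effective value is selected. The congestion-factor values and access counts below are illustrative, not the measured ones.

# Sketch of the addressing decision of (6)-(8). The theta table stands in
# for Table II; per-bank access counts and hop distances are made up.

THETA = {4: 1.1, 8: 1.2, 16: 1.4, 32: 1.8}   # placeholder congestion factors

def avg_hops(accesses, hops):
    """Equation (6): access-weighted average hop distance over allocated banks."""
    return sum(a * d for a, d in zip(accesses, hops)) / sum(accesses)

def choose_scheme(acc_s, hops_s, acc_d, hops_d, n_banks):
    hop_s = avg_hops(acc_s, hops_s)          # expected under S-NUCA
    hop_d = avg_hops(acc_d, hops_d)          # expected under D-NUCA
    theta = THETA[n_banks]
    # (8): alpha_S * (hop_S - theta * hop_D) > 0  <=>  hop_S > theta * hop_D
    return "D-NUCA" if hop_s > theta * hop_d else "S-NUCA"

# 4 banks allocated to a core; per-bank access counts and hop distances (made up)
print(choose_scheme(acc_s=[25, 25, 25, 25], hops_s=[1, 2, 2, 3],
                    acc_d=[70, 20, 8, 2],  hops_d=[1, 2, 2, 3], n_banks=4))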

D. Algorithm Overhead

In this paper, the proposed runtime algorithm is periodically run on one core (called the Master Core) to adjust the cache resource allocation according to the runtime cache access behavior of each core. During the time interval between successive cache reconfigurations, the instructions per cycle (IPC), L2 accesses, and L2 SRAM and MRAM cache hits are gathered from hardware monitors every 5 ms (the smallest time interval in the OS time scheduler). Based on the gathered information, the proposed algorithm is invoked every 50 ms to determine the performance-maximal system configuration (i.e., cache configuration) during workload execution. During the cache reconfiguration period, the Master Core needs to suspend its workload execution to run the runtime algorithm. To estimate the timing overhead, the computational time of the two methods was measured with PAPI [33] on a Samsung NC100 laptop at 1.67 GHz, because the CPU cores implemented in the SimpleScalar-based NoC simulator cannot emulate the Intel 45-nm Atom processor that is the target CPU in this paper. Here, we assumed that the overhead of switching between the algorithm and the benchmark programs is negligible, since the algorithm is too small to cause noticeable context-switching overheads (e.g., memory traffic competition and instruction cache pollution). Based on the measured data, the maximum computational time of the overall algorithm (also measured by PAPI) is 1.11 ms, which is negligible compared with the 50-ms dynamic reconfiguration interval. After the cache reconfiguration results are obtained from the runtime algorithm, the 32 BANK ID registers (stored in an L2 cache controller), which indicate the cache banks assigned to the core, need to be updated for each core. Updating the BANK ID registers takes only one clock cycle.
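The timing of this runtime loop can be pictured with the toy schedule below: counters are sampled every 5 ms and the reconfiguration algorithm is invoked on the Master Core every 50 ms. The stub functions and the simulated horizon are illustrative only.

# Toy timeline of the runtime management loop described above: hardware
# counters are sampled every 5 ms and the reconfiguration algorithm runs on
# the Master Core every 50 ms. Intervals and stub bodies are illustrative.

SAMPLE_MS, RECONF_MS, HORIZON_MS = 5, 50, 150
samples = []

def sample_counters(t):                 # stand-in for reading IPC/L2 counters
    samples.append({"t": t, "ipc": 1.0, "l2_hits": 0})

def reconfigure(window):                # stand-in for the three-step algorithm
    return "reconfigured using %d samples" % len(window)

for t in range(0, HORIZON_MS + 1, SAMPLE_MS):
    sample_counters(t)
    if t > 0 and t % RECONF_MS == 0:
        print(t, "ms:", reconfigure(samples[-RECONF_MS // SAMPLE_MS:]))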
V. EXPERIMENTAL SETUP

TABLE III. ARCHITECTURE CONFIGURATIONS FOR SIMULATION.

Table III shows the parameters of the baseline configuration used in our experiments. The baseline configuration is an eight-core CMP with an L2 NUCA stacked on the CMP. Intel Atom processors with 32-/24-KB private L1 instruction and data caches were used as the cores [24]. For the L2 NUCA, three candidates are evaluated: SRAM-only, MRAM-only, and hybrid SRAM/MRAM. SRAM-only stacks two layers of SRAM cache on the CMP, with 16 SRAM cache banks per SRAM cache layer [Fig. 9(a)], while MRAM-only consists of two layers of MRAM cache, each with 16 MRAM cache banks [Fig. 9(b)]. Hybrid SRAM/MRAM stacks one layer of SRAM cache and one layer of MRAM cache on the CMP, with 16 SRAM and 16 MRAM cache banks in the SRAM and MRAM cache layers, respectively [Fig. 1]. However, four MRAM banks were disabled to make the total available cache capacity a power of two, so that the mapping of cache data to banks can be determined from the set-index bits of the memory address, as shown in Fig. 8. Each SRAM cache bank and MRAM cache bank has a capacity of 128 and 512 KB, respectively. The access latencies of an SRAM cache bank and an MRAM cache bank are estimated with the NVSim simulator [26].

TABLE IV. CACHE PARAMETERS OF DIFFERENT MEMORY TECHNOLOGIES FABRICATED IN THE 45-nm TECHNOLOGY NODE.

Table IV shows the detailed parameters of the 128-KB SRAM cache bank and the 512-KB MRAM cache bank used in the experiments. The experiments were conducted with our cycle-accurate NoC simulator. Starting from the SimpleScalar simulator (version 3.0d) [29], Manjikian [35] enhanced the original simulator to perform functional multiprocessor simulation as well as multiprocessor cache simulation. Based on the enhanced SimpleScalar multiprocessor simulator [35], we added features to support a multibanked SRAM/MRAM L2 cache, where the banks are connected to each other through the NoC. First, we modified the unified L2 cache model to support a multibanked L2 cache whose banks are interconnected through a network. Second, we implemented the proposed cache partitioning algorithm, which dynamically allocates a set of cache banks to each core.

The experiments were performed with the SPEC2000/2006 [30] benchmark suites. In order to build a CMP environment running multiple processes/programs simultaneously, we chose eight programs (ammp, art, gzip, twolf, parser, gcc, sjeng, and vortex), each of which has a different memory demand (i.e., cache utility). These are mapped onto the target 3-D CMP and run at the same time. Among the benchmark programs, ammp, art, vortex, and gcc represent high memory-demand programs, which need more cache/memory resources, while gzip, twolf, parser, and sjeng represent compute-bound programs, which need fewer cache/memory resources. For the evaluation of the proposed cache partitioning scheme, we consider the following four cache partitioning schemes. 1) Uniform: A naive method of reducing the hop distance of cache accesses is to partition the cache memory uniformly. The cache banks on top of a core are dedicated to that core, so each core only accesses the cache banks above it. Owing to the ease of hardware implementation, several 3-D CMPs have already adopted this cache partitioning scheme [19], [31], [32]. 2) Sun-DNUCA: The whole L2 cache is shared by all the cores. The LRU policy manages the shared cache resources for cache replacement, while considering the heterogeneity of access latency between the SRAM cache and the MRAM cache to minimize the cache access latency. This scheme can be regarded as an extension of Sun et al.'s [3] work. 3) Utility-Aware Cache Partitioning (UCP): This is our first proposed method. The whole L2 cache is divided into partitions, and the cache banks in a partition are dedicated to a core. We determine the number of cache banks in a partition based on the cache utility of the program running on the core that owns the partition, while considering the heterogeneity of access latency between the SRAM cache and the MRAM cache to minimize the cache access latency. 4) Utility- and Hop Distance-Aware Cache Partitioning (UDCP): This is our second proposed method. Like UCP, the whole L2 cache is divided into partitions, and the cache banks in a partition are dedicated to a core. However, the number of cache banks in a partition is determined not only by the cache utility of the program but also by the hop distance from the core to the assigned cache banks in the partition.

VI. EXPERIMENTAL RESULTS

A. Impact of Cache Management Method

In this section, we compare the four cache partitioning methods, i.e., Uniform, Sun-DNUCA, UCP, and UDCP, in terms of performance.

Fig. 14. MPKI results of each benchmark for the four cache partitioning methods, i.e., Uniform, Sun-DNUCA, UCP, and UDCP.

Fig. 15. (a) Average memory access time (cycles) and (b) IPS results of each benchmark for the four cache partitioning methods, i.e., Uniform, Sun-DNUCA, UCP, and UDCP.

Fig. 14 shows the L2 cache miss results [i.e., cache misses per kilo instructions (MPKIs)] for the four partitioning methods applied to the 3-D CMP with the hybrid SRAM/MRAM L2 NUCA. Fig. 14 shows that UCP and UDCP have significantly lower MPKI values than Uniform for the cores running art and ammp; the situation for the cores running the other programs (i.e., vortex, gcc, sjeng, gzip, twolf, and parser) is the opposite.
art and ammp have much higher cache utility, which needs larger cache capacity to sufficiently lower cache-miss rate. The higher cache utility of programs (i.e., art and ammp) makes UCP and UDCP assign more cache capacity to the cores running them than other cores, i.e., UCP and UDCP end up with assigning less cache capacity to cores running the programs with lower cache utility (i.e., vortex, gcc, sjeng, gzip, twolf, and parser) than Uniform, resulting in larger cache misses in those cores. However, the reduction of cache misses in the programs with higher cache utility overcompensates for the increase of cache misses in the programs with lower cache utility, which yields an overall reduction in memory access time and improved system performance, as shown in Fig. 15. Sun-DNUCA has the lowest MPKI among all the partitioning methods because of the shared L2 cache architecture. Fig. 15(a) shows the results of average memory access time in clock cycles for the four cache partitioning methods. Even though Uniform has the lowest hop distance from cores

to cache banks, its larger number of cache misses (shown in Fig. 14) results in more off-chip memory accesses and eventually degrades the average memory access time. On the contrary, although Sun-DNUCA has the lowest MPKI, as shown in Fig. 14, its average memory access time is increased owing to the largest hop distance across the whole L2 NUCA. In Fig. 15(a), the memory access latency of UDCP is lower than that of UCP by up to 29.94% (10.55% on average). UDCP considers not only the cache utility of the program but also the hop distance from the core to the assigned cache banks when the runtime cache partitioning is performed. Fig. 15(b) shows the IPS results for the four partitioning methods, normalized with respect to the results of Uniform. UCP yields an IPS improvement over Uniform of 92.45%. Compared with UCP, UDCP yields a further IPS improvement of 7.33%, which is the additional performance gained by considering the hop distance in the 3-D NUCA. Compared with Sun-DNUCA, UDCP yields an IPS improvement of 18.03%.

B. Impact of Hybrid Cache Composition

In this section, experiments were performed to compare the hybrid cache composition, i.e., the hybrid SRAM/MRAM L2 cache, with the homogeneous cache compositions, i.e., the SRAM-only and MRAM-only L2 caches, in terms of performance, energy efficiency, and lifetime. For the cache partitioning scheme, UDCP was used throughout.

Fig. 16. L2 cache hit ratio results of each benchmark, normalized with respect to SRAM-only.

Fig. 16 shows the L2 cache hit ratios of the three cache compositions, normalized with respect to those of SRAM-only. As shown in Fig. 16, MRAM-only and hybrid SRAM/MRAM yield significantly higher L2 cache hit ratios than SRAM-only, because the capacity of an MRAM cache bank is four times larger than that of an SRAM cache bank. Note that the amount of increase in the L2 cache hit ratio depends on the programs run on the 3-D CMP. For example, art and ammp show a rather large increase, while sjeng and gzip show a very small improvement in the L2 cache hit ratio because of their low cache utility.

Fig. 17. (a) IPS and (b) EDP results of each benchmark, normalized with respect to SRAM-only.

Fig. 17(a) shows the IPS results for the three cache compositions, normalized with respect to those of SRAM-only. In Fig. 17(a), hybrid SRAM/MRAM improves IPS over SRAM-only by 62.24%. Hybrid SRAM/MRAM also improves IPS over MRAM-only by 61% due to the reduced cache access latency. In hybrid SRAM/MRAM, because of the runtime cache partitioning proposed in this paper, most cache accesses are made to the low-latency SRAM cache banks, while the larger capacity needed to secure a low miss ratio is provided by the MRAM cache banks. If we compare SRAM-only with MRAM-only, for the low cache-utility programs (i.e., vortex, gcc, sjeng, gzip, twolf, and parser), SRAM-only has enough cache capacity to cover the working sets of the programs and, thus, outperforms MRAM-only, while, for the high cache-utility programs (i.e., art and ammp), MRAM-only outperforms SRAM-only, as shown in Fig. 17(a). Fig. 17(b) shows the EDP results for the three cache compositions, normalized with respect to SRAM-only.
In Fig. 17(b), hybrid SRAM/MRAM reduces EDP by up to 83.92% compared with SRAM-only, because hybrid SRAM/MRAM finishes the program execution earlier than SRAM-only while consuming lower leakage power due to the use of MRAM. The shorter execution time reduces not only the energy consumed by the cores but also the energy consumed by the L2 cache. Hybrid SRAM/MRAM reduces EDP by 53.02% on average compared with MRAM-only, due to the decrease in leakage energy coming from the shorter execution time.

TABLE V. IPS, EDP, AND LIFETIME RESULTS OF THE FOUR L2 CACHE ARCHITECTURES, i.e., SRAM-ONLY, MRAM-ONLY, HYBRID SUN-DNUCA [3], AND HYBRID SRAM-MRAM, WITH AND WITHOUT RUNTIME CACHE DATA MIGRATION.

Table V shows the effect of cache data migration on the performance, energy efficiency, and lifetime of the hybrid SRAM/MRAM L2 NUCA. In order to place frequently accessed cache lines closer to the processor core, the data migration algorithm arranges the cache lines within the same set (across the cache banks) as an LRU queue ordered by proximity to the processor core. Thus, when a new cache line is fetched from off-chip memory, the fetched line is placed in the bank closest to the processor core, the cache lines in each bank shift to the next bank successively, and the cache line in the bank farthest from the processor core (i.e., the least recently used cache line) is evicted from the L2 cache.
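The migration-based placement just described can be sketched as a per-set queue of banks ordered by proximity: a fill pushes the new line into the nearest bank, shifts the existing lines one bank outward, and evicts the line held in the farthest bank. The data layout below is illustrative.

# Sketch of migration-based placement: within a set, the banks allocated to
# a core behave like an LRU queue ordered by proximity (index 0 = nearest).
# A fill inserts at the nearest bank and evicts from the farthest one.

from collections import deque

def fill_with_migration(set_queue, new_line, n_banks):
    """Insert new_line at the nearest bank; evict from the farthest bank."""
    set_queue.appendleft(new_line)
    evicted = set_queue.pop() if len(set_queue) > n_banks else None
    return evicted

q = deque(["A", "B", "C"])            # lines currently in 3 allocated banks
print(fill_with_migration(q, "D", 3)) # -> "C" evicted from the farthest bank
print(list(q))                        # -> ['D', 'A', 'B']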

This data migration results in data proximity, which may reduce the average memory access latency and, thus, increase the system performance (i.e., IPS). On the other hand, it may increase the energy consumption (i.e., EDP) and reduce the lifetime because of the extra MRAM read/write operations during cache data movement. Table V shows the IPS, EDP, and lifetime results for the four L2 cache architectures, i.e., SRAM-only, MRAM-only, Hybrid-Sun-DNUCA, and Hybrid-UDCP, with and without the runtime cache data migration. For the runtime cache partitioning scheme, UDCP is used throughout. As shown in Table V, cache data migration increases IPS by 4.70% and 7.50%, while it degrades EDP by 19.21% and 15.75%, for hybrid SRAM/MRAM and MRAM-only, respectively. It is also shown that, by omitting the cache data migration, the lifetime can be increased by 4.4 (=45.58/10.34) times and 3.5 (=2.93/0.85) times for hybrid SRAM/MRAM and MRAM-only, respectively. In Table V, the lowest lifetime of MRAM-only and the lowest IPS of SRAM-only show that a homogeneous cache composition is not recommended for the L2 cache. Sun-DNUCA presents the best IPS result when the runtime data migration is applied. However, the runtime cache data migration increases the network traffic and the energy consumed by the excessive write operations to the MRAM banks and, thus, severely degrades the energy efficiency and the lifetime. When the runtime data migration is not applied, Hybrid-UDCP improves IPS, EDP, and lifetime by 18.03%, 50.58%, and 6.74% compared with Sun-DNUCA, respectively. In this paper, we assume a fixed endurance limit (in terms of the number of write cycles) for an MRAM cell. Based on this write-cycle limit, we use the endurance model proposed in [8] and [19] to estimate the lifetime of MRAM. An MRAM cell can no longer be used once its write count exceeds the endurance limit during the simulation. To evaluate the lifetime, we repeat the benchmark programs until one of the MRAM cache blocks exceeds the maximum write cycles.

Fig. 18. Lifetime results for the two L2 NUCAs, i.e., MRAM-only and hybrid SRAM/MRAM. Runtime cache data migration is (a) applied or (b) not.

Fig. 18 shows the lifetime comparison between MRAM-only and hybrid SRAM/MRAM, with and without runtime cache data migration. As shown in Fig. 18, regardless of whether data migration is used, hybrid SRAM/MRAM achieves a large enhancement in the L2 cache lifetime by exploiting SRAM to reduce the number of writes to MRAM. The comparison of Fig. 18(a) and (b) shows that using cache data migration reduces the L2 cache lifetime because of the frequent cache data writes performed for the migration. The lower lifetime of MRAM-only again indicates that a homogeneous cache composition based on the MRAM technology is not recommended for the L2 cache.
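The lifetime-evaluation procedure described above can be sketched as follows: the per-block write counts observed for one benchmark run are scaled against an assumed endurance limit, and the lifetime is reported as the number of repetitions until the most-written MRAM block wears out. The endurance value and the write trace below are placeholders; the paper's actual endurance model comes from [8] and [19].

# Sketch of the lifetime evaluation: replay the benchmark write trace until
# some MRAM block's cumulative writes exceed an assumed endurance limit.

ENDURANCE_LIMIT = 10**6   # assumed write-cycle budget per block (illustrative)

def lifetime_in_runs(writes_per_run):
    """Number of benchmark repetitions until the most-written block wears out."""
    hottest = max(writes_per_run.values())
    return ENDURANCE_LIMIT // hottest

# per-MRAM-block writes observed during one benchmark run (hypothetical)
trace = {"blk0": 1200, "blk1": 400, "blk7": 2500}
print("benchmark can be repeated", lifetime_in_runs(trace), "times")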
TABLE VI. MRAM CACHE BANK PARAMETERS WITH DIFFERENT TECHNOLOGIES, i.e., 45, 32, AND 22 nm.

Table VI shows the MRAM cache bank parameters at different technology nodes, i.e., 45, 32, and 22 nm. In Table VI, the read/write latency (and the read/write energy) remains almost unchanged as the technology node decreases because of the increased cell density, while the bank capacity of MRAM and the leakage power consumption increase significantly.

Fig. 19. (a) IPS and (b) EDP results for the three L2 NUCAs (i.e., SRAM-only, MRAM-only, and hybrid SRAM/MRAM) with respect to different MRAM technology nodes, i.e., 45, 32, and 22 nm.

Fig. 19 shows the IPS and EDP results for the three L2 NUCAs (i.e., SRAM-only, MRAM-only, and hybrid SRAM/MRAM) with respect to the different MRAM technology nodes, i.e., 45, 32, and 22 nm. The results are normalized with respect to those of SRAM-only at each technology node. As shown in Fig. 19(a), MRAM-only improves IPS with technology scaling due to the reduced cache misses resulting from the increased L2 MRAM capacity. However, as shown in Fig. 19(b), the EDP of MRAM-only worsens with technology scaling due to the increase in leakage power consumption. The hybrid cache outperforms SRAM-only and MRAM-only by 82.3% and 88.3% in terms of IPS and by 42.4% and 77.4% in terms of EDP, respectively, when the 22-nm technology node is used for the L2 MRAM.

TABLE VII. EFFECTS OF SRAM TO MRAM RATIO IN THE HYBRID SRAM/MRAM L2 NUCA.

Table VII shows the effect of the SRAM-to-MRAM capacity ratio in the hybrid SRAM/MRAM L2 NUCA on the system performance and energy efficiency. In Table VII (first row), the numbers before S and M are the numbers of SRAM and MRAM layers stacked on the processor cores, respectively. For example, 1S1M means that one SRAM layer and one MRAM layer are stacked on the processor cores, so the capacity of MRAM is four times that of SRAM. In the cases of 1S2M and 2S1M, the capacity ratios of SRAM to MRAM are 1:8 and 1:2, respectively. As shown in Table VII, 1S2M and 2S1M degrade IPS by 6.7% and 17.4% compared with 1S1M. This is because both 1S2M and 2S1M yield larger average hop distances, while the L2 cache miss ratios are similar among the three cache architectures (i.e., 1S1M, 1S2M, and 2S1M). 1S2M and 2S1M increase the average hop distance by 31.5% and 29% (on average) compared with 1S1M, respectively. The small difference in the L2 cache miss ratio implies that increasing the L2 cache capacity by stacking more cache layers does not greatly reduce L2 cache misses. In addition, 1S2M and 2S1M increase EDP by 31.3% and 34% compared with 1S1M, because stacking one additional cache layer increases the leakage power due to the larger cache capacity (i.e., area) and the higher operating temperature.

Fig. 20. Memory access time results with respect to benchmark configurations, i.e., compute-bound, memory-bound, and mixed.

Fig. 20 shows the memory access time results for the three benchmark configurations (i.e., compute-bound, memory-bound, and mixed). In compute-bound, all cores run compute-intensive benchmark programs (twolf, gzip, parser, sjeng, sixtrack, gamess, specrand, and povray), while, in memory-bound, all cores run memory-bound benchmark programs (ammp, art, vortex, gcc, mcf, lbm, soplex, and libquantum). In the mixed case, half of the cores run compute-intensive benchmark programs, while the rest run memory-bound benchmark programs. In Fig. 20, in the compute-bound case, the average memory access time of Uniform is comparable with that of the other schemes, because the memory demand of each program is small, and thus, the small amount of cache capacity assigned to each core is sufficient, without performance degradation. In the memory-bound case, Sun-DNUCA, UCP, and UDCP significantly outperform Uniform in terms of memory access time, by 43.08% on average, because the dynamic cache reconfiguration exploits the time-varying behavior of the memory-bound benchmark programs by allocating additional cache capacity to the cores that need more cache capacity.

VII. CONCLUSION

This paper has explored the design space of a multiprocessor with a 3-D stacked hybrid SRAM/MRAM L2 cache. We proposed a design and management method for dynamic cache partitioning that enhances processor performance in terms of IPS. We considered characteristics such as interconnect wire delay, memory cell density, memory access latency, and memory power consumption for the heterogeneous SRAM and MRAM L2 cache banks. The L2 cache miss rate is kept low by partitioning the cache based on the cache resource utility of the programs running on the cores. In addition, the cache addressing scheme is dynamically chosen at runtime to further reduce the congestion-related network latency.
Fig. 20. Memory access time results with respect to benchmark configurations, i.e., compute-bound, memory-bound, and mixed.

Fig. 20 shows the memory access time results for the three benchmark configurations (i.e., compute-bound, memory-bound, and mixed). In compute-bound, all cores run compute-intensive benchmark programs (twolf, gzip, parser, sjeng, sixtrack, gamess, specrand, and povray), while, in memory-bound, all cores run memory-bound benchmark programs (ammp, art, vortex, gcc, mcf, lbm, soplex, and libquantum). In the mixed configuration, half of the cores run compute-intensive benchmark programs, while the rest run memory-bound benchmark programs. As shown in Fig. 20, in the case of compute-bound, the average memory access time of Uniform is comparable with that of the other schemes, because the memory demand of each program is small, and thus the small cache capacity assigned to each core is sufficient to avoid performance degradation. In the case of memory-bound, Sun-DNUCA, UCP, and UDCP outperform Uniform in terms of memory access time by 43.08% on average, because dynamic cache reconfiguration exploits the time-varying behavior of memory-bound benchmark programs by allocating additional cache capacity to the cores that need it most.

VII. CONCLUSION

This paper has explored the design space of a multiprocessor with a 3-D stacked hybrid SRAM/MRAM L2 cache. We proposed a design and management scheme for dynamic cache partitioning that enhances processor performance in terms of IPS. We considered characteristics such as interconnect wire delay, memory cell density, memory access latency, and memory power consumption for the heterogeneous SRAM and MRAM L2 cache banks. The L2 cache miss rate is kept low by partitioning the cache based on the cache resource utility of the programs running on the cores. In addition, the cache addressing scheme is chosen dynamically at runtime to further reduce the congestion-related network latency.

The experimental results show that the 3-D stacked hybrid SRAM/MRAM L2 cache with the proposed cache data placement scheme (i.e., hybrid) improves the instruction throughput by up to 82.21% (62.24% on average) and reduces EDP by up to 83.92%, compared with the 3-D stacked SRAM-based L2 cache (i.e., SRAM-only).
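To make the utility-driven partitioning idea concrete, the following is a minimal sketch in the spirit of the UCP-style allocation compared above: cache ways are handed out greedily to the core whose miss count would drop the most. The paper's runtime scheme additionally weighs hop distance and the asymmetric SRAM/MRAM access costs, so this is only an illustration; the miss curves and core names are invented.

    # Sketch of greedy utility-based cache partitioning (UCP-style).
    # misses[c][w] = number of L2 misses of core c when it owns w ways.
    # The curves are invented; a real system would measure them, e.g.,
    # with auxiliary tag directories or stack-distance profiling.

    misses = {
        "core0": [900, 500, 300, 200, 150, 140, 135, 133, 132],  # memory-bound
        "core1": [120, 110, 105, 102, 101, 100, 100, 100, 100],  # compute-bound
    }
    TOTAL_WAYS = 8

    def partition(misses, total_ways):
        """Greedily give each way to the core with the largest marginal gain."""
        alloc = {core: 0 for core in misses}
        for _ in range(total_ways):
            gain = {core: misses[core][alloc[core]] - misses[core][alloc[core] + 1]
                    for core in misses}
            winner = max(gain, key=gain.get)
            alloc[winner] += 1
        return alloc

    print(partition(misses, TOTAL_WAYS))
    # -> {'core0': 6, 'core1': 2}: the memory-bound core receives most of the ways.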

REFERENCES

[1] M. Monchiero, R. Canal, and A. González, "Design space exploration for multicore architectures: A power/performance/thermal view," in Proc. 20th ACM ICS, 2006.
[2] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi, "McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures," in Proc. 42nd MICRO, Dec. 2009.
[3] G. Sun, X. Dong, Y. Xie, J. Li, and Y. Chen, "A novel architecture of the 3D stacked MRAM L2 cache for CMPs," in Proc. 15th HPCA, Feb. 2009.
[4] J. Wang, Y. Tim, W.-F. Wong, Z.-L. Ong, Z. Sun, and H. Li, "A coherent hybrid SRAM and STT-RAM L1 cache architecture for shared memory multicores," in Proc. 19th ASP-DAC, Jan. 2014.
[5] J. Zhao, C. Xu, and Y. Xie, "Bandwidth-aware reconfigurable cache design with hybrid memory technologies," in Proc. ICCAD, Nov. 2011.
[6] J. Li, C. J. Xue, and Y. Xu, "STT-RAM based energy-efficiency hybrid cache for CMPs," in Proc. 19th VLSI-SoC, Oct. 2011.
[7] X. Wu, J. Li, L. Zhang, E. Speight, R. Rajamony, and Y. Xie, "Hybrid cache architecture with disparate memory technologies," in Proc. 36th ISCA, Jun. 2009.
[8] Y.-T. Chen et al., "Dynamically reconfigurable hybrid cache: An energy-efficient last-level cache design," in Proc. DATE, Mar. 2012.
[9] X. Dong and Y. Xie, "System-level cost analysis and design exploration for three-dimensional integrated circuits (3D ICs)," in Proc. ASP-DAC, Jan. 2009.
[10] L. R. Hsu, S. K. Reinhardt, R. Iyer, and S. Makineni, "Communist, utilitarian, and capitalist cache policies on CMPs: Caches as a shared resource," in Proc. 15th PACT, 2006.
[11] K. T. Sundararajan, V. Porpodas, T. M. Jones, N. P. Topham, and B. Franke, "Cooperative partitioning: Energy-efficient cache partitioning for high-performance CMPs," in Proc. 18th HPCA, Feb. 2012.
[12] F. Hameed, L. Bauer, and J. Henkel, "Dynamic cache management in multi-core architectures through run-time adaptation," in Proc. DATE, 2012.
[13] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, "Architecting phase change memory as a scalable DRAM alternative," in Proc. 36th ISCA, Jun. 2009.
[14] M. Mao, H. Li, A. K. Jones, and Y. Chen, "Coordinating prefetching and STT-RAM based last-level cache management for multicore systems," in Proc. 23rd GLSVLSI, 2013.
[15] J. Wang, X. Dong, and Y. Xie, "OAP: An obstruction-aware cache management policy for STT-RAM last-level caches," in Proc. DATE, 2013.
[16] Y.-T. Chen, J. Cong, H. Huang, C. Liu, R. Prabhakar, and G. Reinman, "Static and dynamic co-optimizations for blocks mapping in hybrid caches," in Proc. ISLPED, 2012.
[17] C. Kim, D. Burger, and S. W. Keckler, "An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches," in Proc. 10th ASPLOS, 2002.
[18] A. K. Mishra, X. Dong, G. Sun, Y. Xie, N. Vijaykrishnan, and C. R. Das, "Architecting on-chip interconnects for stacked 3D STT-RAM caches in CMPs," in Proc. 38th ISCA, Jun. 2011.
[19] S.-M. Syu, Y.-H. Shao, and I.-C. Lin, "High-endurance hybrid cache design in CMP architecture with cache partitioning and access-aware policy," in Proc. 23rd GLSVLSI, 2013.
[20] X. Dong, X. Wu, G. Sun, Y. Xie, H. Li, and Y. Chen, "Circuit and microarchitecture evaluation of 3D stacking magnetic RAM (MRAM) as a universal memory replacement," in Proc. 45th DAC, Jun. 2008.
[21] M. K. Qureshi and Y. N. Patt, "Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches," in Proc. 39th MICRO, 2006.
[22] J. Jung, S. Kim, and C.-M. Kyung, "Latency-aware utility-based NUCA cache partitioning in 3D-stacked multi-processor systems," in Proc. 18th VLSI-SoC, Sep. 2010.
[23] J. Jung, K. Kang, and C.-M. Kyung, "Design and management of 3D-stacked NUCA cache for chip multiprocessors," in Proc. 21st GLSVLSI, 2011.
[24] Intel Atom Processor, accessed on Mar. 21. [Online].
[25] G. Gerosa et al., "A sub-1 W to 2 W low-power IA processor for mobile Internet devices and ultra-mobile PCs in 45 nm hi-κ metal gate CMOS," in Proc. ISSCC, Feb. 2008.
[26] X. Dong, C. Xu, Y. Xie, and N. P. Jouppi, "NVSim: A circuit-level performance, energy, and area model for emerging nonvolatile memory," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 31, no. 7, Jul.
[27] D. Banerjee and B. Mukherjee, "Wavelength-routed optical networks: Linear formulation, resource budgeting tradeoffs, and a reconfiguration study," IEEE/ACM Trans. Netw., vol. 8, no. 5, Oct.
[28] R. L. Mattson, "Evaluation techniques for storage hierarchies," IBM Syst. J., vol. 9, no. 2.
[29] D. C. Burger, T. M. Austin, and S. Bennett, "Evaluating future microprocessors: The SimpleScalar tool set," Dept. Comput. Sci., Univ. Wisconsin-Madison, Madison, WI, USA, Tech. Rep. 1308, Jul.
[30] Standard Performance Evaluation Corporation, accessed on Mar. 21. [Online].
[31] S. Kim, D. Chandra, and Y. Solihin, "Fair cache sharing and partitioning in a chip multiprocessor architecture," in Proc. 13th PACT, 2004.
[32] B. Verghese, A. Gupta, and M. Rosenblum, "Performance isolation: Sharing and isolation in shared-memory multiprocessors," in Proc. 8th ASPLOS, 1998.
[33] Performance Application Programming Interface, accessed on Mar. 21. [Online].
[34] F. Li, C. Nicopoulos, T. Richardson, Y. Xie, V. Narayanan, and M. Kandemir, "Design and management of 3D chip multiprocessors using network-in-memory," in Proc. 33rd ISCA, 2006.
[35] N. Manjikian, "Multiprocessor enhancements of the SimpleScalar tool set," ACM SIGARCH Comput. Archit. News, vol. 29, no. 1, pp. 8–15, Mar.

Seunghan Lee (S'08) received the B.S. degree and the unified M.S. and Ph.D. degree in electrical engineering from the Korea Advanced Institute of Science and Technology, Daejeon, South Korea, in 2004 and 2014, respectively. He is currently with the Memory Business Department, Samsung Electronics, Hwaseong, South Korea. His current research interests include cache data management, hybrid memory design, and power and thermal management for 3-D chip multiprocessors. Mr. Lee was a recipient of the best paper award at the 14th International Symposium on Quality Electronic Design, Santa Clara, CA, USA.

Kyungsu Kang (S'06–M'10) received the B.S. degree from the Department of Electrical and Electronic Engineering, Kyungpook National University, Daegu, South Korea, in 2003, and the M.S. and Ph.D. degrees from the Department of Electrical Engineering and Computer Science, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea. He was a Post-Doctoral Fellow with the Smart Sensor Architecture Laboratory, KAIST, from 2010 to 2011, and, from 2011, with the Integrated Systems Laboratory, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland. He is currently with the Memory Business Department, Samsung Electronics, Hwaseong, South Korea. His current research interests include 3-D integration, networks-on-chip, dynamic power/thermal management, and memory/storage system solutions. Dr. Kang received the best paper award at the 14th International Symposium on Quality Electronic Design, Santa Clara, CA, USA.

Jongpil Jung (S'10) received the B.S. degree in electrical engineering from the Korea Advanced Institute of Science and Technology, Daejeon, South Korea, in 2008, where he is currently pursuing the unified M.S. and Ph.D. degrees in electrical engineering and computer science. His current research interests include low-energy system architecture, system modeling, and embedded systems.

Chong-Min Kyung (S'76–M'81–SM'99–F'08) received the B.S. degree in electronics engineering from Seoul National University, Seoul, South Korea, in 1975, and the M.S. and Ph.D. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, in 1977 and 1981, respectively. He held a post-doctoral position with Bell Telephone Laboratories, Murray Hill, NJ, USA, beginning in 1981. He joined KAIST in 1983, where he is currently a Hynix Chair Professor and has been involved in system-on-chip design and verification methodology as well as processor and graphics architectures for high-speed and/or low-power applications, including mobile video codecs. Dr. Kyung is a member of the National Academy of Engineering Korea and the Korean Academy of Science and Technology. He received the Most Excellent Design Award and the Special Feature Award in the University Design Contest at the Asia and South Pacific Design Automation Conference (ASP-DAC) in 1997 and 1998, respectively.
He also received best paper awards at the 36th DAC, New Orleans, LA, USA, at the 10th International Conference on Signal Processing Application and Technology, Orlando, FL, USA, in 1999, and at the International Conference on Computer Design, Austin, TX, USA, in 1999, as well as the National Medal from the Korean Government for his contribution to research and education in IC design. He was the General Chair of the Asian Solid-State Circuits Conference in 2007 and of ASP-DAC in 2008.
