Page-Based Memory Allocation Policies of Local and Remote Memory in Cluster Computers

2012 IEEE 18th International Conference on Parallel and Distributed Systems

Page-Based Memory Allocation Policies of Local and Remote Memory in Cluster Computers

Mónica Serrano, Salvador Petit, Julio Sahuquillo, Rafael Ubal, Houcine Hassan, and José Duato
Department of Computer Engineering (DISCA), Universitat Politècnica de València, Valencia, Spain
{spetit, jsahuqui, husein,

Abstract—Main memory latencies have a strong impact on the overall execution time of applications. The need to efficiently schedule the costly DRAM memory resources across the different motherboards is a major concern in cluster computers. Most of these systems implement remote access capabilities that allow the OS to access remote memory. In this context, efficient scheduling becomes even more critical, since remote memory accesses may be several orders of magnitude slower than local accesses. These systems typically support interleaved memory at cache-block granularity. In contrast, in this paper we explore the impact on system performance of allocating memory at OS page granularity. Experimental results show that simply supporting interleaved memory at OS page granularity is a feasible solution that does not impact the performance of most of the benchmarks. Based on this observation, we investigated the reasons for the performance drops in those benchmarks showing unacceptable performance when working at page granularity. The results of this analysis led us to propose two memory allocation policies, namely on-demand (OD) and most-accessed in-local (Mail). The OD policy first places the requested pages in local memory; once this memory region is full, subsequent memory pages are placed in remote memory. This policy shows good performance when the most accessed pages are requested and allocated before the least accessed ones, which, as proven in this work, is the most common case. This simple policy reaches performance improvements of up to 25% in some benchmarks with respect to a typical block interleaving memory system. Nevertheless, this strategy performs poorly when a noticeable amount of the least accessed pages are requested before the most accessed ones. This performance drawback is solved by the Mail allocation policy, which uses profile information to guide the allocation of new pages. This scheme always outperforms the baseline block interleaving policy and, in some cases, improves the performance of the OD policy by 25%.

Keywords—cluster computers; memory allocation; interleaved memory; memory management; workload characterization

I. INTRODUCTION

Cluster computers are widespread in the domain of high performance computing. One of their main advantages is that the parallel processing power of a cluster is more cost-effective than that of a supercomputer with similar computing power. Cluster computers also provide high scalability and availability at a low cost, which makes these machines an attractive choice compared to supercomputers. These machines allow several processes to run concurrently in the same node of the cluster, thus sharing costly and critical resources like the main memory. Each node consists of a motherboard hosting several processors and a set of main memory modules. Typically, processors can only access the main memory in their local motherboard. Therefore, in such systems, the different memory requirements of the running processes can lead to imbalance in the memory utilization of the nodes.
If the memory requirements exceed the installed memory in a given node, disk paging is required, which can lead to unacceptable performance. A straightforward solution to avoid this drawback is to oversize the installed DRAM memory. However, this solution is expensive, since acquiring extra memory for each node could become prohibitive. In previous work [1] we showed that motherboards whose running processes require a large amount of memory can improve their performance by allowing the OS to access spare memory in remote motherboards. To do so, a fast interconnection network [2] with Remote Memory Access (RMA) capabilities [3] is required. Since the latency to access remote memory through this network is several orders of magnitude lower than the disk latency, performance is highly improved.

In this context, the challenge is to find out which pages from the processes concurrently running in the same node should be allocated to local memory and which ones to remote memory. This paper focuses on a scheduler at the operating system level to select this memory distribution. As the operating system manages memory at page granularity, we study the impact on system performance when working at this granularity, distributing memory pages among the memory regions (i.e., local or remote). This behavior is analyzed and compared to a typical memory distribution where local and remote memory are interleaved at cache-block granularity.

This paper characterizes the behavior of the entire SPEC CPU2006 benchmark suite [4] under several memory allocation schemes.

A set of metrics such as the execution time, the misses per kilo-instruction, and the distribution of L1 accesses are analyzed in order to provide a sound understanding of the effects of the memory behavior on system performance. The analysis allows us to classify the applications based on their behavior for a given memory distribution. The different memory distributions affect the execution time of a given application in different ways; thus, it is important that the scheduler be able to estimate how memory distributions impact performance. This knowledge could prevent the memory scheduler from choosing a memory distribution that would damage not only the performance of an application but also the overall system performance.

The results of the analysis show two interesting observations. First, we found that memory accesses are not evenly distributed among pages; instead, a small subset of pages is responsible for most of the accesses. Second, the most accessed pages are usually requested during the first half of the execution time. Based on these two observations, in this paper we propose two memory allocation policies. The first one, namely the on-demand policy, achieves good performance by simply placing the first requested pages in local memory and those exceeding the local capacity in remote memory. In this way, the scheduler exploits temporal locality in a reasonable way. The second one, referred to as the Mail (most-accessed in local) policy, allocates in local memory the pages receiving the most main memory accesses (i.e., L2 misses), while assigning the pages exceeding the local capacity to remote memory. Experimental results show that the OD policy outperforms the typical block interleaving memory system by 25% in some benchmarks, while some others suffer performance degradation. In contrast, the Mail allocation policy always outperforms the baseline block interleaving policy and, in some cases, surpasses the performance of the OD policy by 25%.

The remainder of this paper is organized as follows. Section II describes the studied memory distributions. Section III presents the experimental framework. Section IV compares the performance of systems with different interleaving granularities. Section V describes and analyzes the performance behavior of the on-demand policy. Section VI presents and evaluates the Mail strategy. Section VII discusses previous research related to this work. Finally, Section VIII presents some concluding remarks.

II. MEMORY ALLOCATION SCHEMES

This paper proposes two page-granularity memory allocation policies. For performance comparison purposes, we modeled a typical system implementing interleaved memory at cache-block granularity. In addition, as the proposed schemes work at page granularity, we modeled a page-level interleaved scheme to check how sensitive the applications are to the granularity of the interleaving. Below we discuss the interleaved schemes.

Block-level interleaved. This scheme, referred to as BI, assumes that cache blocks are allocated to local and remote memory in an interleaved way (e.g., even blocks in local memory and odd blocks in remote memory). This scheme has been taken as the baseline since it is the one typically implemented in current systems.

Page-level interleaved. This scheme, namely PI, also allocates memory in an interleaved way, but at OS page granularity. The idea behind this scheme is to explore whether performance remains acceptable in most benchmarks when working at this coarse granularity. This scheme has also been considered as a baseline, since the proposed strategies work at the same granularity. The sketch below makes the two mappings concrete.
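As an illustration of the two interleaved baselines, the following sketch shows the address-to-region mapping they imply; the constants and function names are assumptions made here for clarity, not taken from the paper.

```python
# Hypothetical sketch of the address-to-region mapping implied by the
# BI and PI baselines: the memory region (local or remote) alternates
# every granularity-sized chunk of the physical address space.

BLOCK_SIZE = 64          # bytes, cache block size (Section III)
PAGE_SIZE = 4 * 1024     # bytes, OS page size (Section III)

def region(addr: int, granularity: int) -> str:
    """Return the memory region holding the given physical address."""
    chunk = addr // granularity
    return "local" if chunk % 2 == 0 else "remote"

def bi_region(addr: int) -> str:
    # Block-level interleaving (BI): even blocks local, odd blocks remote.
    return region(addr, BLOCK_SIZE)

def pi_region(addr: int) -> str:
    # Page-level interleaving (PI): the same idea at OS-page granularity.
    return region(addr, PAGE_SIZE)

# Adjacent cache blocks alternate regions under BI, but stay in the
# same region under PI as long as they fall in the same page.
assert bi_region(0) == "local" and bi_region(64) == "remote"
assert pi_region(64) == "local"
```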
Next, we discuss the proposed page-granularity memory allocation policies, namely on-demand and most-accessed in-local. All the policies that work at page level use the virtual memory subsystem to assign a given memory page to one of the two memory regions: whenever a new virtual page is allocated, its physical frame is set depending on which region is chosen by the specific memory allocation policy.

On-demand. This proposal, referred to as OD, also allocates memory at OS page granularity, but it starts by allocating pages in local memory; when the requested pages exceed the local memory capacity, it allocates pages in remote memory, which works as an extension of the local memory. Since local and remote memories are both considered main memory, no swapping is performed between the two regions in this scheme. In other words, a cache miss whose requested block is found in remote memory is not handled as a page fault by the OS.

Most-accessed in-local. This scheme will be referred to as Mail. This proposal tries to improve the performance of the OD scheme by determining which pages should be allocated in local memory. The Mail scheme works as a scheduler that places the pages responsible for the most cache misses in local memory and the remaining ones in remote memory. In this paper, we analyze the performance benefits of this scheme when working in a static way.
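A minimal sketch of the two placement decisions may help fix ideas; the function names, arguments, and the profile format are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the two page-granularity policies (Section II).

def od_place(local_used_pages: int, local_capacity_pages: int) -> str:
    """On-demand (OD): fill local memory first, then spill to remote."""
    return "local" if local_used_pages < local_capacity_pages else "remote"

def mail_place(page_id: int, hot_pages: set[int]) -> str:
    """Mail: pages profiled as most accessed (most L2 misses) go local.

    The hot set is built offline from profile information (in the paper,
    gathered from a run under the PI scheme)."""
    return "local" if page_id in hot_pages else "remote"

# Example: with room for one more page, OD places the next page locally;
# Mail places page 7 locally only if the profile marked it as hot.
assert od_place(local_used_pages=9, local_capacity_pages=10) == "local"
assert mail_place(7, hot_pages={3, 7, 42}) == "local"
```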

III. MODELED SYSTEM AND EXPERIMENTAL FRAMEWORK

[Figure 1. Block diagram of the modeled system]

Figure 1 shows the block diagram of the modeled system. The cluster computer consists of two nodes, namely local and remote, connected by means of a high-speed interconnection network with RMA support to keep the remote memory access time low. The local node contains a single-issue processor and local memory, while the remote node contains only memory. That is, the remote memory is partitioned in two parts: one part serves the processors in its own motherboard, while the exceeding memory is made available to the OSes installed in other boards. In other words, the local OS only sees local and remote memory.

To study the performance implications of the discussed memory allocation schemes in the described cluster computer, we used the Multi2Sim [5] simulation framework. Multi2Sim is a detailed cycle-by-cycle execution-driven simulator for multicore multithreaded processors, which has been extended to model both the cluster-based computer and the memory allocation schemes. Also, in order to isolate the performance impact of the memory accesses, we assumed a perfect branch predictor, since branch prediction is responsible for a substantial number of pipeline stalls. Machine parameters concerning both the processor and the memory are shown in Table I. Experiments were launched using the entire SPEC CPU2006 benchmark suite, assuming 64B cache blocks and 4KB memory pages, which are typical values in existing systems.

Table I. Machine parameters
  Issue width:                         single issue
  Issue policy:                        out-of-order
  Branch predictor:                    perfect
  L1 cache (size, #ways, line size):   64KB, 2, 64B
  L1 cache latency:                    1 cc
  L2 cache (size, #ways, line size):   1MB, 16, 64B
  L2 cache latency:                    6 cc
  Local memory latency:                100 cc
  Remote memory latency:               410 cc

IV. PERFORMANCE ANALYSIS OF THE INTERLEAVED MEMORY SCHEMES

This section explores how the granularity size impacts the system performance. Figure 2(a) shows, for each benchmark, the execution time of the PI policy normalized to the BI scheme. This plot shows the sensitivity of each benchmark to the interleaving granularity: the wider the difference, the more sensitive the benchmark. Results show that the execution time of more than half of the benchmarks is not or scarcely affected when working at page granularity, while the execution time of about one third of them grows when working at such a coarse granularity. The performance penalty widely differs among the benchmarks penalized by coarse granularities. According to this degradation, benchmarks can be classified into high degradation (between 50% and 75%), e.g. gcc and h264ref; medium degradation (between 25% and 50%), e.g. perlbench, gobmk, and xalancbmk; and low degradation (between 2% and 25%), e.g. soplex, calculix, tonto, omnetpp, bzip2, dealII, povray, and GemsFDTD. As observed in Figure 2(b), the benchmarks in the latter group present good performance regardless of the interleaving size, since all of them except bzip2 achieve a Cycles per Instruction (CPI) value close to 1. Notice that a CPI value close to 1 represents near-optimal performance, as we modeled a single-issue processor. Since a perfect branch predictor is considered, performance drops mainly come from the memory subsystem. To analyze the memory subsystem behavior of both interleaving schemes, we explored the cache hierarchy. Figure 2(c) shows, for each benchmark and interleaving, the distribution of load instruction (memory read request) accesses to the L1 data cache. The results of such loads are classified into hits, delayed hits (i.e., hits in a block that is being fetched either from L2 or main memory), and misses.
We differentiate hits from delayed hits because the latter present a variable latency that may range from the main memory latency down to one close to that of a conventional hit. Notice that all the benchmarks whose CPI value is below 1.5 present a hit ratio (without considering delayed hits) greater than

[Figure 2. Interleaved memory configurations: (a) normalized execution time, (b) cycles per instruction, (c) distribution of L1 cache read accesses, (d) L2 read misses per kilo-instruction accessing remote memory]

It can be observed that working at page granularity (PI scheme) increases the number of delayed hits with respect to the BI scheme in some of the benchmarks. The longer the latency of these additional delayed hits, the stronger the impact on performance. Since remote memory accesses present the longest latencies, they potentially have the greatest impact on performance. To analyze this fact, we measured the L2 Read Misses Per Kilo-Instruction (RMPKI) that access remote memory, denoted L2 RMPKI_RM (L2 is the last-level cache, so an L2 miss incurs a main memory access). Figure 2(d) shows that the benchmarks whose PI performance degradation (see Figure 2(a)) is higher than 25% (i.e., gcc, h264ref, perlbench, gobmk, and xalancbmk) are those presenting the largest L2 RMPKI_RM increase. On the other hand, some benchmarks, such as str, gamess, gromacs, libquantum, zeusmp, namd, and sphinx3, present an L2 RMPKI_RM below 0.01 for both interleavings; thus, the impact of the memory interleaving scheme on these benchmarks is negligible. Finally, note that although the L2 RMPKI_RM of some benchmarks like tonto, omnetpp, dealII, and povray also noticeably grows, their CPI and hit ratio values (see Figures 2(b) and 2(c)) cancel the performance impact of the growing L2 RMPKI_RM.

In summary, with respect to the BI scheme, the PI impact on performance is noticeable only for a few benchmarks. The main reason for this performance drop is the increase in the number of L1 delayed hits that must wait until the block being fetched comes from remote memory.
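For clarity, this metric reduces to a simple ratio; a minimal sketch follows, with counter names assumed here for illustration.

```python
def l2_rmpki_rm(l2_read_misses_to_remote: int, instructions: int) -> float:
    """L2 read misses per kilo-instruction that are served by remote
    memory (the L2 RMPKI_RM metric discussed above)."""
    return l2_read_misses_to_remote / (instructions / 1000.0)

# Example: 2 million remote-serviced L2 read misses over 4 billion
# executed instructions yield an L2 RMPKI_RM of 0.5.
assert l2_rmpki_rm(2_000_000, 4_000_000_000) == 0.5
```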

V. ON-DEMAND MEMORY ALLOCATION

To avoid the increase of L2 misses accessing remote memory, we devised the on-demand (OD) memory allocation policy, which places the first accessed pages in local memory, as discussed in Section II. This scheme is based on the assumption that the pages accessed first will likely be accessed during the whole execution. Thus, if these pages are placed in local memory, the number of main memory accesses served by remote memory will be noticeably reduced. This section analyzes the performance of the OD policy compared to the BI scheme.

The BI scheme assumes, by design, that half of the working set is allocated to local memory and the other half to remote memory. Therefore, for fair comparison purposes, the devised OD scheme also implements this assumption, so the working set allocated to each memory region is roughly the same as in the baseline scheme. For the on-demand policy, this means that the local memory becomes full when 50% of the accessed pages have been brought into memory. From now on, we will refer to this scheme as on-demand 50 or simply OD-50.

[Figure 3. Block interleaving versus on-demand: normalized execution time]

Figure 3 shows, for each benchmark, the execution time of the OD-50 policy normalized to the BI scheme. Note that an on-demand distribution not only avoids the performance problems of page interleaving, but also improves the performance achieved by BI in 7 benchmarks. Performance improvements reach values up to 25% in some cases (bzip2, cactusADM, and astar). The only exception showing significantly worse performance under the OD-50 policy is the hmmer benchmark.

We explored the reasons behind the variations of OD-50 performance across the different benchmarks by studying the characteristics of the pages allocated to local or remote memory as the execution advances. The impact of each page on performance mainly depends on its temporal locality, quantified as the number of times it is accessed (i.e., L2 misses accessing that page), and on its latency (local or remote). For illustrative purposes, we analyzed this behavior and plotted the results for two of the best performing benchmarks under the OD-50 policy (astar and bzip2) and another two that do not take advantage of this allocation strategy (hmmer and wrf). To check how the applications take advantage of temporal locality in the OD-50 scheme, pages are classified as most accessed or MA (the 50% of pages responsible for the most L2 misses) and least accessed or LA (the remaining ones).

[Figure 4. Local and remote main memory first-accesses distribution along time: (a) astar, (b) bzip2, (c) hmmer, (d) wrf]

Figure 4 plots the accumulated distribution of MA and LA memory pages as they are brought into memory by the OD-50 scheme. To build the plot, only the first access to a given page was considered. Each of the 10 steps in the X axis is equally large in the number of L2 misses (i.e., main memory accesses), but not in time. Depending on whether the local memory is full or not, the OD-50 scheme places a page in remote or local memory, respectively; notice that this decision only considers the first access to the page, since no page replacement is performed. Since both memory regions have the same size, the dashed lines in Figure 4 mark the point where the local memory becomes full, so subsequently allocated pages will be placed in remote memory. According to temporal locality and OD-50, the sooner the MA pages are allocated in local memory, the better the performance.
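A minimal sketch of this MA/LA classification, assuming per-page L2 miss counts collected during the run (all names are illustrative):

```python
# Rank pages by their L2 miss counts; the top half are the MA (most
# accessed) pages and the rest are LA (least accessed), matching the
# 50/50 split used in Figure 4.

def ma_pages(misses_per_page: dict[int, int]) -> set[int]:
    """Return the ids of the 50% of pages with the most L2 misses."""
    ranked = sorted(misses_per_page, key=misses_per_page.get, reverse=True)
    return set(ranked[: len(ranked) // 2])

# Example: pages 1 and 4 concentrate the misses, so they form the MA set.
assert ma_pages({1: 900, 2: 10, 3: 5, 4: 700}) == {1, 4}
```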

The ideal situation would occur when all the MA pages are allocated before the dashed line, in local memory, and all the LA pages after the dashed line, in remote memory. In other words, the MA curve grows until the dashed line and then remains constant; from then on, the LA curve starts to grow. As observed, astar's plot (Figure 4(a)) resembles the described ideal situation. This can be corroborated by looking at Figure 3, which shows that this benchmark reaches the best on-demand performance compared to BI. Bzip2's distribution (Figure 4(b)) presents a similar plot, but some LA pages are allocated to local memory, so the on-demand distribution can still be improved. In contrast, for hmmer and wrf, which present the worst performance, the distribution shows a high number of LA page allocations before the dashed line (Figures 4(c) and 4(d)); thus, many MA pages are allocated to remote memory, damaging the performance.

VI. MAIL MEMORY ALLOCATION

As observed in Figure 4, the memory page distribution obtained by on-demand can be improved even for some of the best performing benchmarks (e.g., bzip2). The key is to place the memory pages with the highest number of main memory accesses in the local region. To this end, we devised the most-accessed in-local (Mail) allocation policy. With the aim of exploring the impact of the Mail policy on performance, profiling information, obtained from an execution under the PI memory allocation scheme, is used to discern MA from LA pages.

[Figure 5. Mail versus on-demand: normalized execution time]

Figure 5 shows the execution time of Mail normalized to the OD policy. As in previous experiments, we assume that the memory is split into two equally sized regions. As expected, there are several benchmarks whose performance is improved by Mail. In some cases, this improvement reaches 25% (xalancbmk, bzip2, and hmmer). The fact that benchmarks presenting an on-demand distribution close to the optimal, like astar (see Figure 4(a)), take scarce benefit from the Mail policy was also foreseeable.

To understand why some benchmarks are more sensitive to the improved page distribution applied by Mail, Figure 6 plots the accumulated distribution of L2 misses (i.e., main memory accesses), with pages ordered along the X axis from the most accessed to the least accessed.

[Figure 6. Accumulated percentage of main memory accesses: (a) astar, (b) bzip2, (c) hmmer, (d) wrf]

For example, Figure 6(c) shows that for hmmer, the 20% most accessed pages are responsible for around 100% of its L2 misses. Thus, hmmer's performance is greatly improved by the Mail scheme, which places the most accessed pages in the local memory region. On the other hand, although wrf also benefits from the Mail distribution, its performance improvement is smaller, since its L2 misses are not concentrated in a small set of pages, as shown in Figure 6(d). Nevertheless, its performance improvement under the Mail allocation policy is about 15%.
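The curves in Figure 6 can be derived from the same per-page miss counts; a minimal sketch under that assumption:

```python
# Accumulate the percentage of L2 misses covered as pages are taken
# from most to least accessed; a curve that saturates early (as for
# hmmer) means a small set of pages concentrates the accesses.

def accumulated_miss_pct(miss_counts: list[int]) -> list[float]:
    """Cumulative percentage of L2 misses, most accessed page first."""
    counts = sorted(miss_counts, reverse=True)
    total = sum(counts)
    running, curve = 0, []
    for c in counts:
        running += c
        curve.append(100.0 * running / total)
    return curve

# Example: one dominant page covers 90% of the misses on its own.
assert accumulated_miss_pct([90, 5, 5])[0] == 90.0
```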
VII. RELATED WORK

Several research papers in the literature deal with the characterization of applications on multicores and CMPs, but they only consider local main memory or cache memory; none analyzes the behavior of the local and remote memory distribution in cluster computers as this work does. Lu et al. [6] propose an approach that constructs a memory optimization model based on hardware performance counters to decide how and when to apply memory transformations by using genetic algorithms; they show that their approach improves the programs' memory access time and, with it, their memory performance. Zhuravlev et al. [7] state that memory controller, memory bus, and prefetching hardware contention contribute more to overall performance degradation than cache space contention. To reduce the effect of these factors on performance, they minimize the total number of misses issued from each cache by developing scheduling algorithms that distribute threads in such a way that the miss rate is balanced among the caches. To determine the behavior of applications sharing cache memory, Xie and Loh [8] propose a classification algorithm. They implement the solution in hardware to allow dynamic classification of application behaviors. Their proposal consists of a dynamic cache partitioning scheme performing slightly better, and with a lower implementation cost, than the Utility-based Cache Partitioning scheme. Xu et al. [9] propose a model that estimates the performance degradation due to cache contention of processes running on CMPs. They use distance histograms, cache access frequencies, and the relationship between throughput and cache miss rate to predict the effective cache size of a process when sharing the cache with other processes, allowing instruction throughput estimation. The average throughput prediction error of the model is 1.57.

Regarding remote memory allocation, research papers in the literature have focused mainly on memory swapping, rather than on scheduling the memory among applications to improve performance.

In [10], the authors develop a software-based prototype by extending the Xen hypervisor to emulate a disaggregated memory design wherein remote pages are swapped into local memory on demand upon access. They reveal that low-latency remote memory calls for a different regime of replacement policies than conventional disk paging, and they show the synergy between disaggregated memory and content-based page sharing, finding that the combination of remote and local memory distribution provides greater workload consolidation opportunity and performance-per-dollar than either technique alone. The distributed large memory system (DLM), presented by Midorikawa et al. [11], is a user-level, software-only solution that provides large virtual memory by using remote memory distributed over the nodes of a cluster. They propose a page-size control methodology that estimates the working data set and adapts the page size for each processing part of a running application to prevent memory server thrashing [12]. Some real applications have already adopted remote memory implementations, improving performance over systems using disk paging. Large graph processing is an example that has benefited from using remote memory rather than disks [13]; the authors present an implementation based on remote memory to deal with large data sets. In [14], the authors describe the scalability of linear algebra kernels based on a remote memory access approach. Their experimental results on large-scale systems (a Linux-InfiniBand cluster and a Cray XT) demonstrate consistent performance advantages over the ScaLAPACK suite, the leading implementation of parallel linear algebra algorithms used today.

The aforementioned papers focus on remote memory for swapping over cluster nodes and present their systems as an improvement over disk swapping. In contrast, the research presented in this paper defines useful application characterization parameters that are used by the proposed memory scheduler to decide which local and remote memory allocation configuration best enhances system performance.

VIII. CONCLUSIONS

High-end cluster computers incorporate a large amount of main memory across the different motherboards, which is often underutilized in some nodes while memory is lacking in others due to workload imbalance. RMA mechanisms can help to avoid this imbalance by allowing the applications of a given node to access the memory in remote nodes.

This can be accomplished with the support of the OS, which allocates the accessed pages either in the local memory of the node or in the memory of a remote node that is accessible through the fast RMA network. Note that, although the remote memory latency is longer than the local memory latency, it is still much shorter than the latency incurred when using virtual memory paging mechanisms.

In this paper, we have compared the performance of conventional hardware-based block interleaving between local and remote memory with the performance of OS-based page interleaving. We found that only some applications are significantly affected by page-based interleaving. Thus, we investigated the reasons behind this impact on performance in order to design better OS-based memory allocation policies. Based on the results of this study, we have proposed two memory allocation policies, namely on-demand (OD) and most-accessed in-local (Mail). The first one is a simple strategy that places new pages in local memory until this region is full; thus, it performs better when the most accessed pages are requested and allocated before the least accessed ones, which is the common case, as proven in this work. Experimental results show that the OD policy reaches around 25% performance improvement for some benchmarks with respect to a typical block interleaving memory system. However, under the OD policy, some benchmarks still allocate a large percentage of the least accessed pages to local memory. In contrast, the Mail allocation policy avoids this problem by using profile information to guide the allocation of new pages. Under this scheme, all the benchmarks perform better than under block interleaving and, in some cases, the performance offered by OD is improved by as much as 25%.

ACKNOWLEDGMENT

This work was supported by the Spanish MICINN, Consolider Programme and Plan E funds, as well as European Commission FEDER funds, under Grants CSD and TIN C.

REFERENCES

[1] M. Serrano, J. Sahuquillo, S. Petit, H. Hassan, and J. Duato, "A cost-effective heuristic to schedule local and remote memory in cluster computers," The Journal of Supercomputing, vol. 59, no. 3.
[2] J. Duato, F. Silla, and S. Yalamanchili, "Extending HyperTransport Protocol for Improved Scalability," in First International Workshop on HyperTransport Research and Applications.
[3] M. Nussle, M. Scherer, and U. Bruning, "A Resource Optimized Remote-Memory-Access Architecture for Low-latency Communication," in International Conference on Parallel Processing, Sept. 2009.
[4] "SPEC CPU2006 Benchmark Descriptions," ACM SIGARCH Computer Architecture News, Sept.
[5] R. Ubal, J. Sahuquillo, S. Petit, and P. López, "Multi2Sim: A Simulation Framework to Evaluate Multicore-Multithreaded Processors," in Proceedings of the 19th International Symposium on Computer Architecture and High Performance Computing.
[6] P. Lu, Y. Che, and Z. Wang, "A framework for effective memory optimization of high performance computing applications," in IEEE International Conference on High Performance Computing and Communications (HPCC), June 2009.
[7] S. Zhuravlev, S. Blagodurov, and A. Fedorova, "Addressing shared resource contention in multicore processors via scheduling," in Proceedings of the 15th International Conference on Architectural Support for Programming Languages and Operating Systems, 2010.
[8] Y. Xie and G. H. Loh, "Dynamic Classification of Program Memory Behaviors in CMPs," in 2nd Workshop on Chip Multiprocessor Memory Systems and Interconnects, held in conjunction with the 35th International Symposium on Computer Architecture.
[9] C. Xu, X. Chen, R. P. Dick, and Z. M. Mao, "Cache contention and application performance prediction for multi-core systems," in IEEE International Symposium on Performance Analysis of Systems and Software, 2010.
[10] K. Lim, Y. Turner, J. R. Santos, A. AuYoung, J. Chang, P. Ranganathan, and T. F. Wenisch, "System-level implications of disaggregated memory," in IEEE 18th International Symposium on High Performance Computer Architecture (HPCA), Feb. 2012.
[11] H. Midorikawa, M. Kurokawa, R. Himeno, and M. Sato, "DLM: A distributed Large Memory System using remote memory swapping over cluster nodes," in IEEE International Conference on Cluster Computing, Oct. 2008.
[12] H. Midorikawa and J. Uchiyama, "Automatic adaptive page-size control for remote memory paging," in IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), May 2012.
[13] K. Jeon, H. Han, S. gyu Kim, H. Eom, H. Yeom, and Y. Lee, "Large graph processing based on remote memory system," in IEEE International Conference on High Performance Computing and Communications (HPCC), Sept. 2010.
[14] M. Krishnan, R. Lewis, and A. Vishnu, "Scaling linear algebra kernels using remote memory access," in International Conference on Parallel Processing Workshops (ICPPW), Sept. 2010.


More information

Filtered Runahead Execution with a Runahead Buffer

Filtered Runahead Execution with a Runahead Buffer Filtered Runahead Execution with a Runahead Buffer ABSTRACT Milad Hashemi The University of Texas at Austin miladhashemi@utexas.edu Runahead execution dynamically expands the instruction window of an out

More information

An Analytical Performance Model for Co-Management of Last-Level Cache and Bandwidth Sharing

An Analytical Performance Model for Co-Management of Last-Level Cache and Bandwidth Sharing An Analytical Performance Model for Co-Management of Last-Level Cache and Bandwidth Sharing Taecheol Oh, Kiyeon Lee, and Sangyeun Cho Computer Science Department, University of Pittsburgh Pittsburgh, PA

More information

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors G. Chen 1, M. Kandemir 1, I. Kolcu 2, and A. Choudhary 3 1 Pennsylvania State University, PA 16802, USA 2 UMIST,

More information

WADE: Writeback-Aware Dynamic Cache Management for NVM-Based Main Memory System

WADE: Writeback-Aware Dynamic Cache Management for NVM-Based Main Memory System WADE: Writeback-Aware Dynamic Cache Management for NVM-Based Main Memory System ZHE WANG, Texas A&M University SHUCHANG SHAN, Chinese Institute of Computing Technology TING CAO, Australian National University

More information

Computer Architecture. Introduction

Computer Architecture. Introduction to Computer Architecture 1 Computer Architecture What is Computer Architecture From Wikipedia, the free encyclopedia In computer engineering, computer architecture is a set of rules and methods that describe

More information

Footprint-based Locality Analysis

Footprint-based Locality Analysis Footprint-based Locality Analysis Xiaoya Xiang, Bin Bao, Chen Ding University of Rochester 2011-11-10 Memory Performance On modern computer system, memory performance depends on the active data usage.

More information

IN modern systems, the high latency of accessing largecapacity

IN modern systems, the high latency of accessing largecapacity IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 27, NO. 10, OCTOBER 2016 3071 BLISS: Balancing Performance, Fairness and Complexity in Memory Access Scheduling Lavanya Subramanian, Donghyuk

More information

Predicting Performance Impact of DVFS for Realistic Memory Systems

Predicting Performance Impact of DVFS for Realistic Memory Systems Predicting Performance Impact of DVFS for Realistic Memory Systems Rustam Miftakhutdinov Eiman Ebrahimi Yale N. Patt The University of Texas at Austin Nvidia Corporation {rustam,patt}@hps.utexas.edu ebrahimi@hps.utexas.edu

More information

Threshold-Based Markov Prefetchers

Threshold-Based Markov Prefetchers Threshold-Based Markov Prefetchers Carlos Marchani Tamer Mohamed Lerzan Celikkanat George AbiNader Rice University, Department of Electrical and Computer Engineering ELEC 525, Spring 26 Abstract In this

More information

Cache Controller with Enhanced Features using Verilog HDL

Cache Controller with Enhanced Features using Verilog HDL Cache Controller with Enhanced Features using Verilog HDL Prof. V. B. Baru 1, Sweety Pinjani 2 Assistant Professor, Dept. of ECE, Sinhgad College of Engineering, Vadgaon (BK), Pune, India 1 PG Student

More information

Memory Systems IRAM. Principle of IRAM

Memory Systems IRAM. Principle of IRAM Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several

More information

Performance metrics for caches

Performance metrics for caches Performance metrics for caches Basic performance metric: hit ratio h h = Number of memory references that hit in the cache / total number of memory references Typically h = 0.90 to 0.97 Equivalent metric:

More information

Evaluating Private vs. Shared Last-Level Caches for Energy Efficiency in Asymmetric Multi-Cores

Evaluating Private vs. Shared Last-Level Caches for Energy Efficiency in Asymmetric Multi-Cores Evaluating Private vs. Shared Last-Level Caches for Energy Efficiency in Asymmetric Multi-Cores Anthony Gutierrez Adv. Computer Architecture Lab. University of Michigan EECS Dept. Ann Arbor, MI, USA atgutier@umich.edu

More information

Software-Controlled Multithreading Using Informing Memory Operations

Software-Controlled Multithreading Using Informing Memory Operations Software-Controlled Multithreading Using Informing Memory Operations Todd C. Mowry Computer Science Department University Sherwyn R. Ramkissoon Department of Electrical & Computer Engineering University

More information

CloudCache: Expanding and Shrinking Private Caches

CloudCache: Expanding and Shrinking Private Caches CloudCache: Expanding and Shrinking Private Caches Hyunjin Lee, Sangyeun Cho, and Bruce R. Childers Computer Science Department, University of Pittsburgh {abraham,cho,childers}@cs.pitt.edu Abstract The

More information

ChargeCache. Reducing DRAM Latency by Exploiting Row Access Locality

ChargeCache. Reducing DRAM Latency by Exploiting Row Access Locality ChargeCache Reducing DRAM Latency by Exploiting Row Access Locality Hasan Hassan, Gennady Pekhimenko, Nandita Vijaykumar, Vivek Seshadri, Donghyuk Lee, Oguz Ergin, Onur Mutlu Executive Summary Goal: Reduce

More information

Lecture Topics. Announcements. Today: Advanced Scheduling (Stallings, chapter ) Next: Deadlock (Stallings, chapter

Lecture Topics. Announcements. Today: Advanced Scheduling (Stallings, chapter ) Next: Deadlock (Stallings, chapter Lecture Topics Today: Advanced Scheduling (Stallings, chapter 10.1-10.4) Next: Deadlock (Stallings, chapter 6.1-6.6) 1 Announcements Exam #2 returned today Self-Study Exercise #10 Project #8 (due 11/16)

More information

Two-Level Address Storage and Address Prediction

Two-Level Address Storage and Address Prediction Two-Level Address Storage and Address Prediction Enric Morancho, José María Llabería and Àngel Olivé Computer Architecture Department - Universitat Politècnica de Catalunya (Spain) 1 Abstract. : The amount

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per

More information

Analysis of Sorting as a Streaming Application

Analysis of Sorting as a Streaming Application 1 of 10 Analysis of Sorting as a Streaming Application Greg Galloway, ggalloway@wustl.edu (A class project report written under the guidance of Prof. Raj Jain) Download Abstract Expressing concurrency

More information

Tortoise vs. hare: a case for slow and steady retrieval of large files

Tortoise vs. hare: a case for slow and steady retrieval of large files Tortoise vs. hare: a case for slow and steady retrieval of large files Abstract Large file transfers impact system performance at all levels of a network along the data path from source to destination.

More information

Applications Classification and Scheduling on Heterogeneous HPC Systems Using Experimental Research

Applications Classification and Scheduling on Heterogeneous HPC Systems Using Experimental Research Applications Classification and Scheduling on Heterogeneous HPC Systems Using Experimental Research Yingjie Xia 1, 2, Mingzhe Zhu 2, 3, Li Kuang 2, Xiaoqiang Ma 3 1 Department of Automation School of Electronic

More information

The Affinity Effects of Parallelized Libraries in Concurrent Environments. Abstract

The Affinity Effects of Parallelized Libraries in Concurrent Environments. Abstract The Affinity Effects of Parallelized Libraries in Concurrent Environments FABIO LICHT, BRUNO SCHULZE, LUIS E. BONA, AND ANTONIO R. MURY 1 Federal University of Parana (UFPR) licht@lncc.br Abstract The

More information

OpenPrefetch. (in-progress)

OpenPrefetch. (in-progress) OpenPrefetch Let There Be Industry-Competitive Prefetching in RISC-V Processors (in-progress) Bowen Huang, Zihao Yu, Zhigang Liu, Chuanqi Zhang, Sa Wang, Yungang Bao Institute of Computing Technology(ICT),

More information