Page-Based Memory Allocation Policies of Local and Remote Memory in Cluster Computers

2012 IEEE 18th International Conference on Parallel and Distributed Systems

Page-Based Memory Allocation Policies of Local and Remote Memory in Cluster Computers

Mónica Serrano, Salvador Petit, Julio Sahuquillo, Rafael Ubal, Houcine Hassan, and José Duato
Department of Computer Engineering (DISCA), Universitat Politècnica de València, Valencia, Spain
{spetit, jsahuqui, husein,

Abstract—Main memory latencies have a strong impact on the overall execution time of applications. The need to efficiently schedule the costly DRAM memory resources across the different motherboards is a major concern in cluster computers. Most of these systems implement remote access capabilities that allow the OS to access remote memory. In this context, efficient scheduling becomes even more critical, since remote memory accesses may be several orders of magnitude slower than local accesses. These systems typically support interleaved memory at cache-block granularity. In contrast, in this paper we explore the impact on system performance of allocating memory at OS page granularity. Experimental results show that simply supporting interleaved memory at OS page granularity is a feasible solution that does not impact the performance of most of the benchmarks. Based on this observation, we investigated the reasons for the performance drops in those benchmarks showing unacceptable performance when working at page granularity. The results of this analysis led us to propose two memory allocation policies, namely on-demand (OD) and most-accessed in-local (Mail). The OD policy first places the requested pages in local memory; once this memory region is full, subsequent memory pages are placed in remote memory. This policy shows good performance when the most accessed pages are requested and allocated before the least accessed ones, which, as proven in this work, is the most common case. This simple policy reaches performance improvements of up to 25% in some benchmarks with respect to a typical block interleaving memory system. Nevertheless, this strategy performs poorly when a noticeable amount of the least accessed pages are requested before the most accessed ones. This performance drawback is solved by the Mail allocation policy, which uses profile information to guide the allocation of new pages. This scheme always outperforms the baseline block interleaving policy and, in some cases, improves the performance of the OD policy by 25%.

Keywords—cluster computers; memory allocation; interleaved memory; memory management; workload characterization

I. INTRODUCTION

Cluster computers are widespread in the domain of high performance computing. One of their main advantages is that the parallel processing power of a cluster is more cost-effective than that of a supercomputer with similar computing power. Cluster computers also provide high scalability and availability at a low cost, which makes these machines an attractive choice compared to supercomputers. These machines allow several processes to run concurrently in the same node of the cluster, thus sharing costly and critical resources like the main memory. Each node consists of a motherboard hosting several processors and a set of main memory modules. Typically, processors can only access the main memory in their local motherboard. Therefore, in such systems, the different memory requirements of the running processes can lead to imbalance in the memory utilization of the nodes.
If the memory requirements exceed the installed memory in a given node, disk paging is required, which can lead to unacceptable performance. A straightforward solution to avoid this drawback is to oversize the installed DRAM memory. However, this solution is expensive, since acquiring extra memory for each node could become prohibitive. In previous work [1] we showed that motherboards whose running processes require a large amount of memory can improve their performance by allowing the OS to access spare memory in remote motherboards. To do so, a fast interconnection network [2] with Remote Memory Access (RMA) capabilities [3] is required. Since the latency to access remote memory through this network is several orders of magnitude lower than the disk latency, performance is highly improved.

In this context, the challenge is to find out which pages from the processes concurrently running in the same node should be allocated to local memory and which ones to remote memory. This paper focuses on a scheduler at the operating system level to select this memory distribution. As the operating system manages memory at page granularity, we study the impact on system performance when working at this granularity, distributing memory pages among the memory regions (i.e., local or remote). This behavior is analyzed and compared to a typical memory distribution where local and remote memory are interleaved at cache-block granularity.

This paper characterizes the behavior of the entire SPEC CPU2006 benchmark suite [4] under several memory allocation schemes.

A set of metrics such as the execution time, the misses per kilo-instruction, and the distribution of L1 accesses are analyzed in order to provide a sound understanding of the effects of the memory behavior on system performance. The analysis allows us to classify the applications based on their behavior for a given memory distribution. The different memory distributions affect the execution time of a given application in different ways; thus, it is important that the scheduler be able to estimate how memory distributions impact performance. This knowledge could prevent the memory scheduler from choosing a memory distribution that would damage not only the performance of an application but also the overall system performance.

The results of the analysis show two interesting observations. First, we found that memory accesses are not evenly distributed among pages; instead, a small subset of pages is responsible for most of the accesses. Second, the most accessed pages are usually requested during the first half of the execution time. Based on these two observations, in this paper we propose two memory allocation policies. The first one, namely the on-demand policy, achieves good performance by simply placing the first requested pages in local memory and those exceeding the local capacity in remote memory. In this way, the scheduler exploits temporal locality in a reasonable way. The second one, referred to as the Mail (most-accessed in local) policy, allocates in local memory the pages receiving the most main memory accesses (i.e., L2 misses), while assigning the pages exceeding the local capacity to remote memory. Experimental results show that the OD policy outperforms the typical block interleaving memory system by 25% in some benchmarks, while some others suffer performance degradation. In contrast, the Mail allocation policy always outperforms the baseline block interleaving policy and, in some cases, surpasses the performance of the OD policy by 25%.

The remainder of this paper is organized as follows. Section II describes the studied memory distributions. Section III presents the experimental framework. Section IV compares the performance of systems with different interleaving granularities. Section V describes and analyzes the performance behavior of the on-demand policy. Section VI presents and evaluates the Mail strategy. Section VII discusses previous research related to this work. Finally, Section VIII presents some concluding remarks.

II. MEMORY ALLOCATION SCHEMES

This paper proposes two page-granularity memory allocation policies. For performance comparison purposes, we modeled a typical system implementing interleaved memory at cache-block granularity. In addition, as the proposed schemes work at page granularity, we modeled a page-level interleaved scheme to check how sensitive the applications are to the granularity of the interleaving. Below we discuss the interleaved schemes.

Block-level interleaved. This scheme, referred to as BI, assumes that cache blocks are allocated to local and remote memory in an interleaved way (e.g., even blocks in local memory and odd blocks in remote memory). This scheme has been taken as the baseline since it is the one typically implemented in current systems.

Page-level interleaved. This scheme, namely PI, also allocates memory in an interleaved way, but at OS page granularity. The idea behind this scheme is to explore whether performance remains acceptable in most benchmarks when working at this coarse granularity. This scheme has also been considered as a baseline, since the proposed strategies work at the same granularity. The sketch below makes the two mappings concrete.
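As an illustration of the two interleaved baselines, the following sketch shows the address-to-region mapping they imply; the constants and function names are assumptions made here for clarity, not taken from the paper.

```python
# Hypothetical sketch of the address-to-region mapping implied by the
# BI and PI baselines: the memory region (local or remote) alternates
# every granularity-sized chunk of the physical address space.

BLOCK_SIZE = 64          # bytes, cache block size (Section III)
PAGE_SIZE = 4 * 1024     # bytes, OS page size (Section III)

def region(addr: int, granularity: int) -> str:
    """Return the memory region holding the given physical address."""
    chunk = addr // granularity
    return "local" if chunk % 2 == 0 else "remote"

def bi_region(addr: int) -> str:
    # Block-level interleaving (BI): even blocks local, odd blocks remote.
    return region(addr, BLOCK_SIZE)

def pi_region(addr: int) -> str:
    # Page-level interleaving (PI): the same idea at OS-page granularity.
    return region(addr, PAGE_SIZE)

# Adjacent cache blocks alternate regions under BI, but stay in the
# same region under PI as long as they fall in the same page.
assert bi_region(0) == "local" and bi_region(64) == "remote"
assert pi_region(64) == "local"
```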
Next, we discuss the proposed page-granularity memory allocation policies, namely on-demand and most-accessed in-local. All the policies that work at page level use the virtual memory subsystem to assign a given memory page to one of the two memory regions: whenever a new virtual page is allocated, its physical frame is set depending on which region is chosen by the specific memory allocation policy.

On-demand. This proposal, referred to as OD, also allocates memory at OS page granularity, but it starts by allocating pages in local memory; when the requested pages exceed the local memory capacity, it allocates pages in remote memory, which works as an extension of the local memory. Since local and remote memories are both considered main memory, no swapping is performed between the two regions in this scheme. In other words, a cache miss whose requested block is found in remote memory is not handled as a page fault by the OS.

Most-accessed in-local. This scheme will be referred to as Mail. This proposal tries to improve the performance of the OD scheme by determining which pages should be allocated in local memory. The Mail scheme works as a scheduler that places the pages responsible for the most cache misses in local memory and the remaining ones in remote memory. In this paper, we analyze the performance benefits of this scheme when working in a static way.
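A minimal sketch of the two placement decisions may help fix ideas; the function names, arguments, and the profile format are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the two page-granularity policies (Section II).

def od_place(local_used_pages: int, local_capacity_pages: int) -> str:
    """On-demand (OD): fill local memory first, then spill to remote."""
    return "local" if local_used_pages < local_capacity_pages else "remote"

def mail_place(page_id: int, hot_pages: set[int]) -> str:
    """Mail: pages profiled as most accessed (most L2 misses) go local.

    The hot set is built offline from profile information (in the paper,
    gathered from a run under the PI scheme)."""
    return "local" if page_id in hot_pages else "remote"

# Example: with room for one more page, OD places the next page locally;
# Mail places page 7 locally only if the profile marked it as hot.
assert od_place(local_used_pages=9, local_capacity_pages=10) == "local"
assert mail_place(7, hot_pages={3, 7, 42}) == "local"
```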

III. MODELED SYSTEM AND EXPERIMENTAL FRAMEWORK

[Figure 1. Block diagram of the modeled system]

Figure 1 shows the block diagram of the modeled system. The cluster computer consists of two nodes, namely local and remote, connected by means of a high-speed interconnection network with RMA support to keep the remote memory access time low. The local node contains a single-issue processor and local memory, while the remote node contains only memory. That is, the remote memory is partitioned in two parts: one part serves the processors in its own motherboard, while the exceeding memory is made available to the OSes installed in other boards. In other words, the local OS only sees local and remote memory.

To study the performance implications of the discussed memory allocation schemes in the described cluster computer, we used the Multi2Sim [5] simulation framework. Multi2Sim is a detailed cycle-by-cycle execution-driven simulator for multicore multithreaded processors, which has been extended to model both the cluster-based computer and the memory allocation schemes. Also, in order to isolate the performance impact of the memory accesses, we assumed a perfect branch predictor, since branch prediction is responsible for a substantial number of pipeline stalls. Machine parameters concerning both the processor and the memory are shown in Table I. Experiments were launched using the entire SPEC CPU2006 benchmark suite, assuming 64B cache blocks and 4KB memory pages, which are typical values in existing systems.

Table I. Machine parameters
  Issue width:                         single issue
  Issue policy:                        out-of-order
  Branch predictor:                    perfect
  L1 cache (size, #ways, line size):   64KB, 2, 64B
  L1 cache latency:                    1 cc
  L2 cache (size, #ways, line size):   1MB, 16, 64B
  L2 cache latency:                    6 cc
  Local memory latency:                100 cc
  Remote memory latency:               410 cc

IV. PERFORMANCE ANALYSIS OF THE INTERLEAVED MEMORY SCHEMES

This section explores how the granularity size impacts the system performance. Figure 2(a) shows, for each benchmark, the execution time of the PI policy normalized to the BI scheme. This plot shows the sensitivity of each benchmark to the interleaving granularity: the wider the difference, the more sensitive the benchmark. Results show that the execution time of more than half of the benchmarks is not or scarcely affected when working at page granularity, while the execution time of about one third of them grows when working at such a coarse granularity. The performance penalty widely differs among the benchmarks penalized by coarse granularities. According to this degradation, benchmarks can be classified into high degradation (between 50% and 75%), e.g. gcc and h264ref; medium degradation (between 25% and 50%), e.g. perlbench, gobmk, and xalancbmk; and low degradation (between 2% and 25%), e.g. soplex, calculix, tonto, omnetpp, bzip2, dealII, povray, and GemsFDTD. As observed in Figure 2(b), the benchmarks in the latter group present good performance regardless of the interleaving size, since all of them except bzip2 achieve a Cycles per Instruction (CPI) value close to 1. Notice that a CPI value close to 1 represents near-optimal performance, as we modeled a single-issue processor. Since a perfect branch predictor is considered, performance drops mainly come from the memory subsystem. To analyze the memory subsystem behavior of both interleaving schemes, we explored the cache hierarchy. Figure 2(c) shows, for each benchmark and interleaving, the distribution of load instruction (memory read request) accesses to the L1 data cache. The results of such loads are classified into hits, delayed hits (i.e., hits in a block that is being fetched either from L2 or main memory), and misses.
We differentiate hits from delayed hits because the latter present a variable latency that may range from the main memory latency down to one close to that of a conventional hit. Notice that all the benchmarks whose CPI value is below 1.5 present a hit ratio (without considering delayed hits) greater than

[Figure 2. Interleaved memory configurations: (a) normalized execution time, (b) cycles per instruction, (c) distribution of L1 cache read accesses, (d) L2 read misses per kilo-instruction accessing remote memory]

It can be observed that working at page granularity (PI scheme) increases the number of delayed hits with respect to the BI scheme in some of the benchmarks. The longer the latency of these additional delayed hits, the stronger the impact on performance. Since remote memory accesses present the longest latencies, they potentially have the greatest impact on performance. To analyze this fact, we measured the L2 Read Misses Per Kilo-Instruction (RMPKI) that access remote memory, denoted L2 RMPKI_RM (L2 is the last-level cache, so an L2 miss incurs a main memory access). Figure 2(d) shows that the benchmarks whose PI performance degradation (see Figure 2(a)) is higher than 25% (i.e., gcc, h264ref, perlbench, gobmk, and xalancbmk) are those presenting the largest L2 RMPKI_RM increase. On the other hand, some benchmarks, such as str, gamess, gromacs, libquantum, zeusmp, namd, and sphinx3, present an L2 RMPKI_RM below 0.01 for both interleavings; thus, the impact of the memory interleaving scheme on these benchmarks is negligible. Finally, note that although the L2 RMPKI_RM of some benchmarks like tonto, omnetpp, dealII, and povray also noticeably grows, their CPI and hit ratio values (see Figures 2(b) and 2(c)) cancel the performance impact of the growing L2 RMPKI_RM.

In summary, with respect to the BI scheme, the PI impact on performance is noticeable only for a few benchmarks. The main reason for this performance drop is the increase in the number of L1 delayed hits that must wait until the block being fetched comes from remote memory.
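For clarity, this metric reduces to a simple ratio; a minimal sketch follows, with counter names assumed here for illustration.

```python
def l2_rmpki_rm(l2_read_misses_to_remote: int, instructions: int) -> float:
    """L2 read misses per kilo-instruction that are served by remote
    memory (the L2 RMPKI_RM metric discussed above)."""
    return l2_read_misses_to_remote / (instructions / 1000.0)

# Example: 2 million remote-serviced L2 read misses over 4 billion
# executed instructions yield an L2 RMPKI_RM of 0.5.
assert l2_rmpki_rm(2_000_000, 4_000_000_000) == 0.5
```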

V. ON-DEMAND MEMORY ALLOCATION

To avoid the increase of L2 misses accessing remote memory, we devised the on-demand (OD) memory allocation policy, which places the first accessed pages in local memory, as discussed in Section II. This scheme is based on the assumption that the pages accessed first will likely be accessed during the whole execution. Thus, if these pages are placed in local memory, the number of main memory accesses served by remote memory will be noticeably reduced. This section analyzes the performance of the OD policy compared to the BI scheme.

The BI scheme assumes, by design, that half of the working set is allocated to local memory and the other half to remote memory. Therefore, for fair comparison purposes, the devised OD scheme also implements this assumption, so the working set allocated to each memory region is roughly the same as in the baseline scheme. For the on-demand policy, this means that the local memory becomes full when 50% of the accessed pages have been brought into memory. From now on, we will refer to this scheme as on-demand 50 or simply OD-50.

[Figure 3. Block interleaving versus on-demand: normalized execution time]

Figure 3 shows, for each benchmark, the execution time of the OD-50 policy normalized to the BI scheme. Note that an on-demand distribution not only avoids the performance problems of page interleaving, but also improves the performance achieved by BI in 7 benchmarks. Performance improvements reach values up to 25% in some cases (bzip2, cactusADM, and astar). The only exception showing significantly worse performance under the OD-50 policy is the hmmer benchmark.

We explored the reasons behind the variations of OD-50 performance across the different benchmarks by studying the characteristics of the pages allocated to local or remote memory as the execution advances. The impact of each page on performance mainly depends on its temporal locality, quantified as the number of times it is accessed (i.e., L2 misses accessing that page), and on its latency (local or remote). For illustrative purposes, we analyzed this behavior and plotted the results for two of the best performing benchmarks under the OD-50 policy (astar and bzip2) and another two that do not take advantage of this allocation strategy (hmmer and wrf). To check how the applications take advantage of temporal locality in the OD-50 scheme, pages are classified as most accessed or MA (the 50% of pages responsible for the most L2 misses) and least accessed or LA (the remaining ones).

[Figure 4. Local and remote main memory first-accesses distribution along time: (a) astar, (b) bzip2, (c) hmmer, (d) wrf]

Figure 4 plots the accumulated distribution of MA and LA memory pages as they are brought into memory by the OD-50 scheme. To build the plot, only the first access to a given page was considered. Each of the 10 steps in the X axis is equally large in the number of L2 misses (i.e., main memory accesses), but not in time. Depending on whether the local memory is full or not, the OD-50 scheme places a page in remote or local memory, respectively; notice that this decision only considers the first access to the page, since no page replacement is performed. Since both memory regions have the same size, the dashed lines in Figure 4 mark the point where the local memory becomes full, so subsequently allocated pages will be placed in remote memory. According to temporal locality and OD-50, the sooner the MA pages are allocated in local memory, the better the performance.
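A minimal sketch of this MA/LA classification, assuming per-page L2 miss counts collected during the run (all names are illustrative):

```python
# Rank pages by their L2 miss counts; the top half are the MA (most
# accessed) pages and the rest are LA (least accessed), matching the
# 50/50 split used in Figure 4.

def ma_pages(misses_per_page: dict[int, int]) -> set[int]:
    """Return the ids of the 50% of pages with the most L2 misses."""
    ranked = sorted(misses_per_page, key=misses_per_page.get, reverse=True)
    return set(ranked[: len(ranked) // 2])

# Example: pages 1 and 4 concentrate the misses, so they form the MA set.
assert ma_pages({1: 900, 2: 10, 3: 5, 4: 700}) == {1, 4}
```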

The ideal situation would occur when all the MA pages are allocated before the dashed line, in local memory, and all the LA pages after the dashed line, in remote memory. In other words, the MA curve grows until the dashed line and then remains constant; from then on, the LA curve starts to grow. As observed, astar's plot (Figure 4(a)) resembles the described ideal situation. This can be corroborated by looking at Figure 3, which shows that this benchmark reaches the best on-demand performance compared to BI. Bzip2's distribution (Figure 4(b)) presents a similar plot, but some LA pages are allocated to local memory, so the on-demand distribution can still be improved. In contrast, for hmmer and wrf, which present the worst performance, the distribution shows a high number of LA page allocations before the dashed line (Figures 4(c) and 4(d)); thus, many MA pages are allocated to remote memory, damaging the performance.

VI. MAIL MEMORY ALLOCATION

As observed in Figure 4, the memory page distribution obtained by on-demand can be improved even for some of the best performing benchmarks (e.g., bzip2). The key is to place the memory pages with the highest number of main memory accesses in the local region. To this end, we devised the most-accessed in-local (Mail) allocation policy. With the aim of exploring the impact of the Mail policy on performance, profiling information, obtained from an execution under the PI memory allocation scheme, is used to discern MA from LA pages.

[Figure 5. Mail versus on-demand: normalized execution time]

Figure 5 shows the execution time of Mail normalized to the OD policy. As in previous experiments, we assume that the memory is split into two equally sized regions. As expected, there are several benchmarks whose performance is improved by Mail. In some cases, this improvement reaches 25% (xalancbmk, bzip2, and hmmer). The fact that benchmarks presenting an on-demand distribution close to the optimal, like astar (see Figure 4(a)), take scarce benefit from the Mail policy was also foreseeable.

To understand why some benchmarks are more sensitive to the improved page distribution applied by Mail, Figure 6 plots the accumulated distribution of L2 misses (i.e., main memory accesses), with pages ordered along the X axis from the most accessed to the least accessed.

[Figure 6. Accumulated percentage of main memory accesses: (a) astar, (b) bzip2, (c) hmmer, (d) wrf]

For example, Figure 6(c) shows that for hmmer, the 20% most accessed pages are responsible for around 100% of its L2 misses. Thus, hmmer's performance is greatly improved by the Mail scheme, which places the most accessed pages in the local memory region. On the other hand, although wrf also benefits from the Mail distribution, its performance improvement is smaller, since its L2 misses are not concentrated in a small set of pages, as shown in Figure 6(d). Nevertheless, its performance improvement under the Mail allocation policy is about 15%.
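The curves in Figure 6 can be derived from the same per-page miss counts; a minimal sketch under that assumption:

```python
# Accumulate the percentage of L2 misses covered as pages are taken
# from most to least accessed; a curve that saturates early (as for
# hmmer) means a small set of pages concentrates the accesses.

def accumulated_miss_pct(miss_counts: list[int]) -> list[float]:
    """Cumulative percentage of L2 misses, most accessed page first."""
    counts = sorted(miss_counts, reverse=True)
    total = sum(counts)
    running, curve = 0, []
    for c in counts:
        running += c
        curve.append(100.0 * running / total)
    return curve

# Example: one dominant page covers 90% of the misses on its own.
assert accumulated_miss_pct([90, 5, 5])[0] == 90.0
```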
VII. RELATED WORK

Several research papers in the literature deal with the characterization of applications on multicores and CMPs, but they only consider local main memory or cache memory; none analyzes the behavior of the local and remote memory distribution in cluster computers as this work does. Lu et al. [6] propose an approach that constructs a memory optimization model based on hardware performance counters to decide how and when to apply memory transformations by using genetic algorithms; they show that their approach improves the programs' memory access time and, with it, their memory performance. Zhuravlev et al. [7] state that memory controller, memory bus, and prefetching hardware contention contribute more to overall performance degradation than cache space contention. To reduce the effect of these factors on performance, they minimize the total number of misses issued from each cache by developing scheduling algorithms that distribute threads in such a way that the miss rate is balanced among the caches. To determine the behavior of applications sharing cache memory, Xie and Loh [8] propose a classification algorithm. They implement the solution in hardware to allow dynamic classification of application behaviors. Their proposal consists of a dynamic cache partitioning scheme performing slightly better, and with a lower implementation cost, than the Utility-based Cache Partitioning scheme. Xu et al. [9] propose a model that estimates the performance degradation due to cache contention of processes running on CMPs. They use distance histograms, cache access frequencies, and the relationship between throughput and cache miss rate to predict the effective cache size of a process when sharing the cache with other processes, allowing instruction throughput estimation. The average throughput prediction error of the model is 1.57.

Regarding remote memory allocation, research papers in the literature have focused mainly on memory swapping, rather than on scheduling the memory among applications to improve performance.

In [10], the authors develop a software-based prototype by extending the Xen hypervisor to emulate a disaggregated memory design wherein remote pages are swapped into local memory on demand upon access. They reveal that low-latency remote memory calls for a different regime of replacement policies than conventional disk paging, and they show the synergy between disaggregated memory and content-based page sharing, finding that the combination of remote and local memory distribution provides greater workload consolidation opportunity and performance-per-dollar than either technique alone. The distributed large memory system (DLM), presented by Midorikawa et al. [11], is a user-level, software-only solution that provides large virtual memory by using remote memory distributed over the nodes of a cluster. They propose a page-size control methodology that estimates the working data set and adapts the page size for each processing part of a running application to prevent memory server thrashing [12]. Some real applications have already adopted remote memory implementations, improving performance over systems using disk paging. Large graph processing is an example that has benefited from using remote memory rather than disks [13]; the authors present an implementation based on remote memory to deal with large data sets. In [14], the authors describe the scalability of linear algebra kernels based on a remote memory access approach. Their experimental results on large-scale systems (a Linux-InfiniBand cluster and a Cray XT) demonstrate consistent performance advantages over the ScaLAPACK suite, the leading implementation of parallel linear algebra algorithms used today.

The aforementioned papers focus on remote memory for swapping over cluster nodes and present their systems as an improvement over disk swapping. In contrast, the research presented in this paper defines useful application characterization parameters that are used by the proposed memory scheduler to decide which local and remote memory allocation configuration best enhances system performance.

VIII. CONCLUSIONS

High-end cluster computers incorporate a large amount of main memory across the different motherboards, which is often underutilized in some nodes while memory is lacking in others due to workload imbalance. RMA mechanisms can help to avoid this imbalance by allowing the applications of a given node to access the memory in remote nodes.

This can be accomplished with the support of the OS, which allocates the accessed pages either in the local memory of the node or in the memory of a remote node that is accessible through the fast RMA network. Note that, although the remote memory latency is longer than the local memory latency, it is still much shorter than the latency incurred when using virtual memory paging mechanisms.

In this paper, we have compared the performance of conventional hardware-based block interleaving between local and remote memory with the performance of OS-based page interleaving. We found that only some applications are significantly affected by page-based interleaving. Thus, we investigated the reasons behind this impact on performance in order to design better OS-based memory allocation policies. Based on the results of this study, we have proposed two memory allocation policies, namely on-demand (OD) and most-accessed in-local (Mail). The first one is a simple strategy that places new pages in local memory until this region is full; thus, it performs better when the most accessed pages are requested and allocated before the least accessed ones, which is the common case, as proven in this work. Experimental results show that the OD policy reaches around 25% performance improvement for some benchmarks with respect to a typical block interleaving memory system. However, under the OD policy, some benchmarks still allocate a large percentage of the least accessed pages to local memory. In contrast, the Mail allocation policy avoids this problem by using profile information to guide the allocation of new pages. Under this scheme, all the benchmarks perform better than under block interleaving and, in some cases, the performance offered by OD is improved by as much as 25%.

ACKNOWLEDGMENT

This work was supported by the Spanish MICINN, Consolider Programme and Plan E funds, as well as European Commission FEDER funds, under Grants CSD and TIN C.

REFERENCES

[1] M. Serrano, J. Sahuquillo, S. Petit, H. Hassan, and J. Duato, "A cost-effective heuristic to schedule local and remote memory in cluster computers," The Journal of Supercomputing, vol. 59, no. 3.
[2] J. Duato, F. Silla, and S. Yalamanchili, "Extending HyperTransport Protocol for Improved Scalability," in First International Workshop on HyperTransport Research and Applications.
[3] M. Nussle, M. Scherer, and U. Bruning, "A Resource Optimized Remote-Memory-Access Architecture for Low-latency Communication," in International Conference on Parallel Processing, Sept. 2009.
[4] "SPEC CPU2006 Benchmark Descriptions," ACM SIGARCH Computer Architecture News, Sept.
[5] R. Ubal, J. Sahuquillo, S. Petit, and P. López, "Multi2Sim: A Simulation Framework to Evaluate Multicore-Multithreaded Processors," in Proceedings of the 19th International Symposium on Computer Architecture and High Performance Computing.
[6] P. Lu, Y. Che, and Z. Wang, "A framework for effective memory optimization of high performance computing applications," in IEEE International Conference on High Performance Computing and Communications (HPCC), June 2009.
[7] S. Zhuravlev, S. Blagodurov, and A. Fedorova, "Addressing shared resource contention in multicore processors via scheduling," in Proceedings of the 15th International Conference on Architectural Support for Programming Languages and Operating Systems, 2010.
[8] Y. Xie and G. H. Loh, "Dynamic Classification of Program Memory Behaviors in CMPs," in 2nd Workshop on Chip Multiprocessor Memory Systems and Interconnects, held in conjunction with the 35th International Symposium on Computer Architecture.
[9] C. Xu, X. Chen, R. P. Dick, and Z. M. Mao, "Cache contention and application performance prediction for multi-core systems," in IEEE International Symposium on Performance Analysis of Systems and Software, 2010.
[10] K. Lim, Y. Turner, J. R. Santos, A. AuYoung, J. Chang, P. Ranganathan, and T. F. Wenisch, "System-level implications of disaggregated memory," in IEEE 18th International Symposium on High Performance Computer Architecture (HPCA), Feb. 2012.
[11] H. Midorikawa, M. Kurokawa, R. Himeno, and M. Sato, "DLM: A distributed Large Memory System using remote memory swapping over cluster nodes," in IEEE International Conference on Cluster Computing, Oct. 2008.
[12] H. Midorikawa and J. Uchiyama, "Automatic adaptive page-size control for remote memory paging," in IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), May 2012.
[13] K. Jeon, H. Han, S. gyu Kim, H. Eom, H. Yeom, and Y. Lee, "Large graph processing based on remote memory system," in IEEE International Conference on High Performance Computing and Communications (HPCC), Sept. 2010.
[14] M. Krishnan, R. Lewis, and A. Vishnu, "Scaling linear algebra kernels using remote memory access," in International Conference on Parallel Processing Workshops (ICPPW), Sept. 2010.


More information

Filtered Runahead Execution with a Runahead Buffer

Filtered Runahead Execution with a Runahead Buffer Filtered Runahead Execution with a Runahead Buffer ABSTRACT Milad Hashemi The University of Texas at Austin miladhashemi@utexas.edu Runahead execution dynamically expands the instruction window of an out

More information

An Analytical Performance Model for Co-Management of Last-Level Cache and Bandwidth Sharing

An Analytical Performance Model for Co-Management of Last-Level Cache and Bandwidth Sharing An Analytical Performance Model for Co-Management of Last-Level Cache and Bandwidth Sharing Taecheol Oh, Kiyeon Lee, and Sangyeun Cho Computer Science Department, University of Pittsburgh Pittsburgh, PA

More information

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors G. Chen 1, M. Kandemir 1, I. Kolcu 2, and A. Choudhary 3 1 Pennsylvania State University, PA 16802, USA 2 UMIST,

More information

WADE: Writeback-Aware Dynamic Cache Management for NVM-Based Main Memory System

WADE: Writeback-Aware Dynamic Cache Management for NVM-Based Main Memory System WADE: Writeback-Aware Dynamic Cache Management for NVM-Based Main Memory System ZHE WANG, Texas A&M University SHUCHANG SHAN, Chinese Institute of Computing Technology TING CAO, Australian National University

More information

Computer Architecture. Introduction

Computer Architecture. Introduction to Computer Architecture 1 Computer Architecture What is Computer Architecture From Wikipedia, the free encyclopedia In computer engineering, computer architecture is a set of rules and methods that describe

More information

Footprint-based Locality Analysis

Footprint-based Locality Analysis Footprint-based Locality Analysis Xiaoya Xiang, Bin Bao, Chen Ding University of Rochester 2011-11-10 Memory Performance On modern computer system, memory performance depends on the active data usage.

More information

IN modern systems, the high latency of accessing largecapacity

IN modern systems, the high latency of accessing largecapacity IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 27, NO. 10, OCTOBER 2016 3071 BLISS: Balancing Performance, Fairness and Complexity in Memory Access Scheduling Lavanya Subramanian, Donghyuk

More information

Predicting Performance Impact of DVFS for Realistic Memory Systems

Predicting Performance Impact of DVFS for Realistic Memory Systems Predicting Performance Impact of DVFS for Realistic Memory Systems Rustam Miftakhutdinov Eiman Ebrahimi Yale N. Patt The University of Texas at Austin Nvidia Corporation {rustam,patt}@hps.utexas.edu ebrahimi@hps.utexas.edu

More information

Threshold-Based Markov Prefetchers

Threshold-Based Markov Prefetchers Threshold-Based Markov Prefetchers Carlos Marchani Tamer Mohamed Lerzan Celikkanat George AbiNader Rice University, Department of Electrical and Computer Engineering ELEC 525, Spring 26 Abstract In this

More information

Cache Controller with Enhanced Features using Verilog HDL

Cache Controller with Enhanced Features using Verilog HDL Cache Controller with Enhanced Features using Verilog HDL Prof. V. B. Baru 1, Sweety Pinjani 2 Assistant Professor, Dept. of ECE, Sinhgad College of Engineering, Vadgaon (BK), Pune, India 1 PG Student

More information

Memory Systems IRAM. Principle of IRAM

Memory Systems IRAM. Principle of IRAM Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several

More information

Performance metrics for caches

Performance metrics for caches Performance metrics for caches Basic performance metric: hit ratio h h = Number of memory references that hit in the cache / total number of memory references Typically h = 0.90 to 0.97 Equivalent metric:

More information

Evaluating Private vs. Shared Last-Level Caches for Energy Efficiency in Asymmetric Multi-Cores

Evaluating Private vs. Shared Last-Level Caches for Energy Efficiency in Asymmetric Multi-Cores Evaluating Private vs. Shared Last-Level Caches for Energy Efficiency in Asymmetric Multi-Cores Anthony Gutierrez Adv. Computer Architecture Lab. University of Michigan EECS Dept. Ann Arbor, MI, USA atgutier@umich.edu

More information

Software-Controlled Multithreading Using Informing Memory Operations

Software-Controlled Multithreading Using Informing Memory Operations Software-Controlled Multithreading Using Informing Memory Operations Todd C. Mowry Computer Science Department University Sherwyn R. Ramkissoon Department of Electrical & Computer Engineering University

More information

CloudCache: Expanding and Shrinking Private Caches

CloudCache: Expanding and Shrinking Private Caches CloudCache: Expanding and Shrinking Private Caches Hyunjin Lee, Sangyeun Cho, and Bruce R. Childers Computer Science Department, University of Pittsburgh {abraham,cho,childers}@cs.pitt.edu Abstract The

More information

ChargeCache. Reducing DRAM Latency by Exploiting Row Access Locality

ChargeCache. Reducing DRAM Latency by Exploiting Row Access Locality ChargeCache Reducing DRAM Latency by Exploiting Row Access Locality Hasan Hassan, Gennady Pekhimenko, Nandita Vijaykumar, Vivek Seshadri, Donghyuk Lee, Oguz Ergin, Onur Mutlu Executive Summary Goal: Reduce

More information

Lecture Topics. Announcements. Today: Advanced Scheduling (Stallings, chapter ) Next: Deadlock (Stallings, chapter

Lecture Topics. Announcements. Today: Advanced Scheduling (Stallings, chapter ) Next: Deadlock (Stallings, chapter Lecture Topics Today: Advanced Scheduling (Stallings, chapter 10.1-10.4) Next: Deadlock (Stallings, chapter 6.1-6.6) 1 Announcements Exam #2 returned today Self-Study Exercise #10 Project #8 (due 11/16)

More information

Two-Level Address Storage and Address Prediction

Two-Level Address Storage and Address Prediction Two-Level Address Storage and Address Prediction Enric Morancho, José María Llabería and Àngel Olivé Computer Architecture Department - Universitat Politècnica de Catalunya (Spain) 1 Abstract. : The amount

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM) 50ns 70ns, $20 $75 per GB Magnetic disk 5ms 20ms, $0.20 $2 per

More information

Analysis of Sorting as a Streaming Application

Analysis of Sorting as a Streaming Application 1 of 10 Analysis of Sorting as a Streaming Application Greg Galloway, ggalloway@wustl.edu (A class project report written under the guidance of Prof. Raj Jain) Download Abstract Expressing concurrency

More information

Tortoise vs. hare: a case for slow and steady retrieval of large files

Tortoise vs. hare: a case for slow and steady retrieval of large files Tortoise vs. hare: a case for slow and steady retrieval of large files Abstract Large file transfers impact system performance at all levels of a network along the data path from source to destination.

More information

Applications Classification and Scheduling on Heterogeneous HPC Systems Using Experimental Research

Applications Classification and Scheduling on Heterogeneous HPC Systems Using Experimental Research Applications Classification and Scheduling on Heterogeneous HPC Systems Using Experimental Research Yingjie Xia 1, 2, Mingzhe Zhu 2, 3, Li Kuang 2, Xiaoqiang Ma 3 1 Department of Automation School of Electronic

More information

The Affinity Effects of Parallelized Libraries in Concurrent Environments. Abstract

The Affinity Effects of Parallelized Libraries in Concurrent Environments. Abstract The Affinity Effects of Parallelized Libraries in Concurrent Environments FABIO LICHT, BRUNO SCHULZE, LUIS E. BONA, AND ANTONIO R. MURY 1 Federal University of Parana (UFPR) licht@lncc.br Abstract The

More information

OpenPrefetch. (in-progress)

OpenPrefetch. (in-progress) OpenPrefetch Let There Be Industry-Competitive Prefetching in RISC-V Processors (in-progress) Bowen Huang, Zihao Yu, Zhigang Liu, Chuanqi Zhang, Sa Wang, Yungang Bao Institute of Computing Technology(ICT),

More information