
ABSTRACT

AGRAWAL, ABHINAV RAJIV. Reducing Checkpoint/Restart Overhead using Near Data Processing for Exascale System. (Under the direction of James Tuck.)

With the increasing size and complexity of high-performance computing (HPC) systems needed to achieve exascale performance, the system mean time to interrupt (system MTTI) is projected to decrease. To maintain the performance efficiency of the system when using checkpoint/restart (C/R) for mitigation, checkpoints need to be stored at a faster rate, which in turn requires lower checkpoint commit and restore times. This requirement is aggravated by the increasing ratio of checkpoint size to I/O bandwidth. To overcome this, prior works have proposed multilevel (hierarchical) checkpoint schemes that involve frequent checkpoint writes to faster node-local storage with occasional writes to slower global I/O-based storage (e.g., disk). However, due to the increasing cost of writing/reading checkpoints to/from global I/O-based storage, this technique may not scale well for systems approaching exaflops performance.

While an I/O or storage hierarchy alleviates the performance cost by reducing I/O access times (including for checkpoint/restart), moving large data between storage at different levels of the hierarchy adds overhead. Near data processing (NDP) has been shown to be effective in reducing the amount of data movement in many applications by performing computations closer to data, thus reducing this overhead. In addition, offloading some applications' computations from the host processors to NDP has been shown to improve performance. In this work we show how NDP can be leveraged to improve C/R performance. We propose offloading the process of writing checkpoints to global I/O from the main compute cores to NDP. We also explore opportunities for additional optimizations using NDP to further reduce checkpoint overheads. Overall, our approach eliminates the performance cost of writing checkpoints to I/O as these operations are performed by NDP. We evaluate the performance of our novel application of NDP to reducing checkpoint/restart cost and compare it to existing checkpoint/restart optimizations. For two-level checkpoint schemes (i.e., checkpoints saved to local storage and remote I/O nodes), our evaluation for a projected exascale system shows that a baseline system (without NDP) spends nearly half its time writing checkpoints to I/O, restoring from a checkpoint, or re-executing lost work. With NDP for offloading checkpoint management and compression, the host processor is able to increase its progress rate from 51% to 78% (i.e., a >50% speedup in application performance).

We further explore how checkpoint compression can be combined with multilevel checkpointing. We perform a compression study and discuss the compression performance required to make it beneficial to add compression to all levels of multilevel checkpointing. We analyze the C/R performance and other benefits of this technique.

Our data shows that multilevel checkpointing combined with compression at all levels improves the efficiency of a system with C/R to 73%, compared to 35% for multilevel checkpointing without compression. The efficiency of multilevel checkpointing with compression is further improved to 89% when using NDP to offload certain C/R tasks. Finally, we explore how the two approaches - compression at all levels of multilevel checkpointing and the use of NDP - can be combined. Adding compression to all levels of multilevel checkpointing results in compressed checkpoint data being available in local storage. Therefore, the role and benefit of NDP for further compressing checkpoint data before writing it to global storage are evaluated. In addition to evaluating the performance overhead, we also estimate the energy and hardware cost of the various C/R configurations we discuss. Our cost-efficiency analysis shows that adding checkpoint compression to improve progress rate is a more efficient solution than increasing the bandwidth of node-local storage. We also show that a configuration that leverages NDP to offload the task of writing data to global I/O has higher cost efficiency than a configuration that performs checkpoint compression at each level of multilevel checkpointing.

Copyright 2017 by Abhinav Rajiv Agrawal

All Rights Reserved

Reducing Checkpoint/Restart Overhead using Near Data Processing for Exascale System

by
Abhinav Rajiv Agrawal

A dissertation submitted to the Graduate Faculty of North Carolina State University in partial fulfillment of the requirements for the Degree of Doctor of Philosophy

Computer Engineering

Raleigh, North Carolina
2017

APPROVED BY:

Gregory Byrd
Eric Rotenberg
Frank Mueller
James Tuck (Chair of Advisory Committee)

DEDICATION

To my parents - Rajni and Rajiv Agrawal.

ACKNOWLEDGEMENTS

This research was made possible due to the support and guidance of many people - my advisor, research group members, collaborators, family, and friends. Foremost, I would like to express my sincere gratitude to my advisor Dr. James Tuck for his constant support during my Ph.D. studies. I would like to thank him for his guidance and patience while mentoring me in my research work. I am grateful to Dr. Tuck for allowing me to work on my research with enough independence and flexibility. I would like to thank my dissertation committee members, Dr. Gregory Byrd, Dr. Eric Rotenberg, and Dr. Frank Mueller, for their service on my committee as well as for their insightful comments, feedback, and advice. I would also like to thank Gabriel Loh for collaborating with me on this work and for his advice during my internship. My sincere thanks also go to Bagus Wibowo for helping with my research as well as for the many stimulating discussions and late nights before deadlines. Many thanks to my fellow labmates - Joonmoo Huh, Amro Awad, Hussein Elnawawy, Vinesh Srinivasan, and Seunghee Shin. Thanks to Gayatri Powar for proofreading many paper and report drafts. Lastly, I would like to thank my parents for instilling in me the importance of education from a young age and supporting me throughout my academic journey. This accomplishment is as much theirs as it is mine.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
Chapter 1  INTRODUCTION
    Overview
    Existing C/R Optimization Techniques
    Adding Checkpoint Compression to Multilevel Checkpointing
    Leveraging NDP to Improve C/R Efficiency
    Contributions
    Organization of This Thesis
Chapter 2  BACKGROUND AND RELATED WORK
    Checkpoint/Restart
        Coordinated Checkpoint/Restart
    Checkpoint/Restart Overhead
        Failure Rate
        Checkpoint Size
        Progress Rate or C/R Efficiency
    Checkpoint/Restart Optimization Techniques
        Increase Checkpoint Commit Bandwidth
        Reduce Checkpoint Data Size
    Near Data Processing
Chapter 3  SCALING STUDY
    Overview
    Exascale System Projection
    MTTI Projection
    Checkpoint/Restart Overhead with no Optimization
Chapter 4  MULTILEVEL CHECKPOINTING WITH COMPRESSION
    Introduction
        Overview
        Multilevel Checkpointing
        Adding Checkpoint Compression to Multilevel C/R
    Compression Study
        Tools and Methodology
        Checkpoint Compression Speed And Factor
        Selecting Utility for Checkpoint Compression
    Evaluation
        Methodology
        Checkpoint/Restart Overhead Components
        Progress Rate Comparison

        4.3.4 C/R Overhead Breakdown (by Local and I/O Level)
    Summary
Chapter 5  LEVERAGING NDP FOR CHECKPOINT/RESTART
    Compute Node with NDP
        Operation of Multilevel Checkpointing with NDP
    NDP for Checkpoint Data Compression
        NDP Performance Requirements
        Configuring NDP for Compression
    Evaluation
        Methodology
        Checkpoint/Restart Overhead Components
        Progress Rate Comparison
        C/R Overhead - Breakdown (4% I/O Recovery)
        C/R Overhead - Sensitivity Study
    Summary
Chapter 6  PERFORMANCE, POWER AND COST ANALYSIS FOR COMBINATION OF CHECKPOINT/RESTART OPTIMIZATIONS
    Introduction
    Compression Study
        Tools and Methodology
        Data: Compression Speed and Factor
        Selecting Utility for Checkpoint Compression using NDP
    Performance Evaluation
        Methodology
        Progress Rate Comparison
        C/R Overhead - Breakdown (15% I/O Recovery)
    Methodology - Cost Analysis
        Energy Cost
        Hardware Cost
    Results - Cost Analysis
        Absolute Cost Breakdown
        Cost Performance Ratio
Chapter 7  CONCLUSION
BIBLIOGRAPHY

LIST OF TABLES

Table 3.1  Exascale system projection scaled from the Titan Cray XK7 supercomputer

Table 4.1  Checkpoint Data Details. The second column shows the size of the total checkpoint data collected for each mini-app in gigabytes. Further columns show compression speed for checkpoint data using different utilities and compression levels on the HDD and SSD systems. Compression speed is for a single thread of each utility. The value inside () is the compression level.

Table 4.2  Checkpoint commit and restore time in seconds for all compression utilities. Checkpoint size for all mini-apps is set to 112 GB per compute node. The I/O column contains checkpoint times when checkpoints are compressed and saved to global I/O storage. L/S and L/F contain checkpoint times when checkpoints are compressed and saved to slow compute-node local storage (5 GB/s) and fast compute-node local storage (15 GB/s), respectively. Note that the checkpoint time values in the "Average" row are not the average values of the seven mini-apps, but the checkpoint time if the performance model is simulated using the average compression factor and compression speed from Figure 4.1. Note that checkpoint commit/restore time in the absence of compression would be I/O: 1120 s, L/S: 22.4 s, and L/F: 7.47 s.

Table 4.3  C/R parameters for evaluation using performance model

Table 5.1  The required compression speed, the required number of processor cores in NDP, and the smallest possible checkpoint interval to I/O based on average compression factor and speed

Table 5.2  C/R parameters for evaluation using performance model

Table 6.1  Checkpoint compression data. lz4-compressed data of 7 mini-apps is compressed again using various compression utilities. The first column shows the size of the lz4-compressed checkpoint data used to collect compression parameters. Columns with header F contain compression factor and columns with header S contain compression speed in MB/s. Compression speed is the speed at which lz4-compressed data is compressed using the various utilities.

Table 6.2  Cumulative or equivalent checkpoint compression data for compression after lz4 compression. lz4-compressed data of 7 mini-apps is compressed again using various compression utilities. Compression factor in this table is a measure of the cumulative reduction in checkpoint size after compression using lz4 and the utility in the first row of the corresponding column. Compression speed is an equivalent compression speed, i.e., the speed if the uncompressed checkpoint data were compressed in the same amount of time as the lz4-compressed checkpoint data.

Table 6.3  Checkpoint commit time in seconds for all compression utilities for 2 scenarios. UnC: uncompressed checkpoint data compressed by NDP (Scenario 1); uncompressed checkpoint size for all mini-apps is set to 112 GB per compute node. Comp: lz4-compressed checkpoint data compressed by NDP (Scenario 2); checkpoint size is the size obtained if 112 GB of checkpoint data of the corresponding mini-app is compressed using lz4. Note that the checkpoint time values in the "Average" row are not the average values of the seven mini-apps, but the checkpoint time if the performance model is simulated using the average compression factor and compression speed.

Table 6.4  C/R parameters for performance, power and cost evaluation of multilevel checkpointing combined with compression and NDP

Table 6.5  Power and cost parameters

LIST OF FIGURES

Figure 1.1  Progress rate of a system with C/R as a function of M/δ. Increasing value of M/δ leads to a higher progress rate.

Figure 4.1  Compression factor for checkpoint data of mini-apps using various compression utilities. The value inside () is the compression level.

Figure 4.2  C/R overhead breakdown for one L/F + I/O-Comp configuration on the y-axis for increasing ratio of locally-saved to I/O-saved checkpoints (n) on the x-axis. Note that the y-axis does not start at 0 for better resolution.

Figure 4.3  Frequency of writing checkpoints to global I/O storage for seven mini-apps and the average compression case, shown as a bar plot. On the y-axis, the frequency of checkpointing is normalized to the frequency of checkpointing for the no-compression case. Total checkpoint I/O write traffic is shown as a line plot, normalized to the checkpoint I/O write traffic generated for the no-compression case.

Figure 4.4  Progress rate comparison between different configurations. Data is shown for the 7 mini-apps studied and an average progress rate over the 7 mini-apps. For I/O only, the compression factor and speed correspond to the xz(1) entry from Table 4.1. Similarly, for the other configurations, the compression factor and speed corresponding to xz(1) are used for checkpoints to I/O, while for the local level, if compression is performed, the parameters corresponding to lz4(1) are used.

Figure 4.5  C/R overhead breakdown normalized to compute time (left) and as % of total execution time (right). The y-axis does not start at 0 for the plot on the right. Six configurations from the left: multilevel with no compression (local: 5 GB/s), multilevel with compression to I/O (local: 5 GB/s), multilevel with compression to both (local: 5 GB/s), multilevel with no compression (local: 15 GB/s), multilevel with compression to I/O (local: 15 GB/s), multilevel with compression to both (local: 15 GB/s). The probability that recovery from local fails for the multilevel cases: 15%. Compression factor - I/O: 80.6%; local: 64.8%.

Figure 5.1  Hardware organization of a compute node with our proposed Near Data Checkpointing Architecture (NDCA).

Figure 5.2  Time-line of multilevel checkpointing with and without NDP. HOST: primary processing of the compute node + DRAM; NVM: compute node local storage; I/O: I/O-node-based storage or global I/O.

Figure 5.3  Ratio of the number of locally saved to the number of I/O-saved checkpoints for different configurations and compression factors.

Figure 5.4  Progress rate comparison between different configurations. Data are shown for 3 of the 7 mini-apps studied and an average progress rate over the 7 mini-apps. The first set of bars is for no compression, while for the others the compression factor used is for gzip(1) as shown in Table 4.1. Compression factor is specified in parentheses in the label on the x-axis.

Figure 5.5  C/R overhead breakdown normalized to compute time (left) and as % of total execution time (right). The y-axis does not start at 0 for both plots. Four configurations from the left: multilevel (Local + I/O-H), multilevel+compression (Local + I/O-HC), NDP (Local + I/O-N) & NDP+compression (Local + I/O-NC). The probability that recovery from local fails: 4%. Compression factor: 73%.

Figure 5.6  Progress for five C/R configurations for increasing checkpoint size. The y-axis does not start at 0%. MTTI: 30 minutes.

Figure 5.7  Progress for five C/R configurations for increasing MTTI. The y-axis does not start at 0%. Checkpoint size: 112 GB per compute node.

Figure 6.1  Progress rate for C/R configurations in which NDP reads lz4-compressed data from node-local storage (local level). Data is shown for the 7 mini-apps studied and an average progress rate over the 7 mini-apps. Compression data for compression performed using NDP is obtained from the compression study.

Figure 6.2  C/R overhead breakdown for the evaluated C/R configurations.

Figure 6.3  Five-year cost breakdown (in USD) per compute node and progress rate for the evaluated C/R configurations.

Figure 6.4  Cost breakdown (in USD). Cost is per compute node per (exa) floating point operations for the evaluated C/R configurations.

Figure 6.5  Cost (in USD) per compute node per (exa) floating point operations for the evaluated C/R configurations (without fast local storage), for varying compute node cost.

CHAPTER 1

INTRODUCTION

1.1 Overview

The increasing size and complexity of high-performance computing (HPC) systems needed to achieve exascale performance is projected to cause a decrease in the system mean time to interrupt (MTTI). Checkpoint/restart (C/R) is a widely used mechanism to deal with failures in HPC systems. It involves periodically saving to stable storage the application state required to resume execution. In case of a failure or interrupt, the application's execution resumes from the most recent checkpoint (saved state). In the absence of C/R, a failure would force the application to restart from the beginning, losing all completed work. However, C/R mechanisms also add performance overhead due to the time spent saving the checkpoint state, restoring from the saved state, and re-running lost work (work performed since the most recent checkpoint). The efficiency or availability of exascale systems with C/R is projected to be around 50% [Ber08].

C/R efficiency or progress rate is the ratio of the time it takes to run an application in the absence of failures and C/R overhead to the time it takes to perform the task in the presence of such overheads. (Progress rate and efficiency are used interchangeably in this document.) Under some simplifying assumptions, the progress rate can be approximated as a function of the ratio of the MTTI (M) to the time to save a checkpoint (δ) [Dal06; Dal07]. While Daly's work [Dal06] provides an equation to calculate the optimal checkpoint interval given MTTI (M) and checkpoint commit time (δ), Quantifying Checkpoint Efficiency [Dal07] provides an equation that calculates checkpoint efficiency given M and δ; checkpoint restore time is assumed to be the same as commit time. Figure 1.1 illustrates this function.

[Figure 1.1: Progress rate of a system with C/R as a function of M/δ. Increasing value of M/δ leads to a higher progress rate.]

For exascale systems with checkpoint/restart, this ratio M/δ decreases due to two factors. On one hand, the increasing number of compute nodes needed to reach exaflops performance would lead to a decrease in the system MTTI because the MTTI of a single compute node is not improving [SG07]. On the other hand, exascale systems are expected to have larger physical memory capacities and to run applications with larger problem sizes. This will lead to larger application state that needs checkpointing. Without a proportional increase in checkpoint commit bandwidth to storage, the time to save a checkpoint (δ) will increase. The combination of these two factors leads to a reduction in the ratio M/δ and thus in the progress rate.

An I/O or storage hierarchy improves the access time to storage by means of fast intermediate levels between the compute nodes and disk-based global I/O. These intermediate levels, in the form of burst buffers or compute-node local storage, can be flash-based solid state drives (SSDs). They provide a lower-overhead site where compute nodes can stage data before it is drained to slower disk-based storage in the I/O nodes [Bhi16]. While this alleviates the performance overhead by reducing the access time to storage (including for C/R), moving large amounts of data between storage at different levels of the hierarchy is not energy efficient.
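As a reference point for this M/δ dependence, a minimal first-order sketch in the Young/Daly style (valid only when δ is much smaller than M, and not the exact higher-order expression of [Dal06; Dal07]) is:

\tau_{\mathrm{opt}} \approx \sqrt{2\delta M}, \qquad
\text{waste} \approx \frac{\delta}{\tau_{\mathrm{opt}}} + \frac{\tau_{\mathrm{opt}}}{2M} = \sqrt{\frac{2\delta}{M}}, \qquad
\text{progress rate} \approx 1 - \sqrt{\frac{2\delta}{M}}

Here τ_opt is the checkpoint interval, the first waste term accounts for time spent writing checkpoints, the second for expected lost work, and restart time is ignored. In this form the progress rate depends only on the ratio M/δ, which is why the shrinking M/δ of exascale systems translates directly into lost efficiency.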

Near data processing (NDP) is effective in reducing the amount of data movement in many applications by performing computations closer to the data. Offloading some applications' computations from the host processors to NDP has been shown to improve performance and energy efficiency, especially for data-intensive applications [Kan13a; Cho16; Ses14; Do13; Tiw13; Cho13]. NDP (or active storage) can potentially address key challenges in areas such as scalability, performance, and reliability of I/O systems in exascale computing [Don11]. In this work we show that NDP can be leveraged to improve C/R performance. (NDP could be in the context of main memory, like DRAM, or in the context of storage, like SSDs. In this work, NDP refers to compute capabilities coupled to compute-node local storage, which would likely be flash SSDs or other NVM-based storage due to the high speed requirement.)

Using SSDs as a burst buffer or node-local storage would provide higher bandwidth for writing and reading checkpoints, thus decreasing C/R cost. However, using SSDs has been shown to be an expensive way to reduce C/R cost. Ibtesham et al. [Ibt15] show that it is more cost-effective to improve the efficiency of C/R using software techniques such as checkpoint compression [Ibt15; Ibt12b] or incremental checkpointing [Fer11; Kai16; Nic13]. Based on these observations, we evaluate the benefits of combining the use of NDP and checkpoint compression to reduce C/R overhead at a lower cost.

1.2 Existing C/R Optimization Techniques

A number of optimizations and mechanisms [Moo10; Di14; Don09; Kan13b; Zhe04; Ben09; Raj13; Ibt15; Ibt12b; Fer11; Kai16; Nic13; Gam14; Ell12; DP12] have been proposed to reduce C/R overhead. While a mechanism like partial redundancy [Ell12] increases the effective MTTI to reduce C/R overhead, many mechanisms aim to either increase the effective checkpoint commit bandwidth [Moo10; Di14; Don09; Kan13b; Zhe04; Ben09; Raj13] or reduce the checkpoint size [Ibt15; Ibt12b; Fer11; Kai16; Nic13]. Increasing the checkpoint commit bandwidth or decreasing the checkpoint size decreases the value of δ, thus improving the progress rate. Many of the techniques that reduce the effective checkpoint commit time take advantage of the storage hierarchy.

Multilevel checkpointing is a clear example of a C/R mechanism that exploits the storage hierarchy to improve C/R performance. Multilevel checkpointing [Moo10; Di14] involves writing frequent checkpoints to compute-node local storage, while writing occasional checkpoints to global I/O-based storage. To keep the overhead of checkpoints to compute-node local storage low, the bandwidth of local storage needs to be high. This can be achieved by adding storage in the form of flash-based solid state drives (SSDs). The bandwidth requirement, and thus the hardware cost, of fast local storage would increase with the increasing checkpoint size and failure rate in HPC systems.

Another issue with multilevel checkpointing is the high overhead of the occasional checkpoints to global I/O. Writing a checkpoint out to global I/O in a conventional multilevel checkpointing system requires the host processor to read the checkpoint data from main memory and then send the data over the network to the remote storage, which requires the host to execute all of the code associated with running the full network stack (e.g., TCP/IP). This can be a particularly slow process because checkpointing is typically bottlenecked by the slower I/O (disk) bandwidth at the shared remote I/O nodes. While this is happening, the host processor is generally not available to perform the useful computations of the main application. While the bandwidth of local storage in the compute node scales with application size, that of global I/O-based storage does not. Moody et al. [Moo10; Di14] show that with increasing failure rate and increasing time to save checkpoints to global I/O, the progress rate of a system with multilevel checkpointing decreases, although more slowly than for single-level checkpointing.

1.3 Adding Checkpoint Compression to Multilevel Checkpointing

Multilevel checkpointing increases the effective bandwidth of reading and writing checkpoint data by saving most checkpoints to fast compute-node local storage, while saving a few checkpoints to high-overhead global I/O-based storage. This local storage will become increasingly costly because its bandwidth requirement grows with the increasing checkpoint size and failure rate of HPC systems. In Chapter 4, we show that compressing checkpoint data before writing to local storage reduces the bandwidth requirement for local storage. The high performance overhead associated with checkpointing to global I/O is also mitigated by compressing checkpoint data before writing to global I/O. While adding checkpoint compression to multilevel checkpointing is an intuitive (or obvious) solution to mitigate the scaling issues of multilevel checkpointing for exascale systems, in this work we study how compression can be added to multilevel checkpointing. We evaluate the compression performance required to add compression at all levels of multilevel checkpointing and provide a methodology for determining these requirements. In Chapter 4 and Chapter 6 we quantify the performance and cost-efficiency improvements that the addition of checkpoint compression brings to multilevel checkpointing. Our evaluation shows that even with the addition of compression to each level of multilevel checkpointing, the overhead associated with checkpointing to global I/O is still high and provides further opportunity for improvement.

1.4 Leveraging NDP to Improve C/R Efficiency

In this work, we also explore leveraging NDP to target the overhead associated with checkpointing to global I/O. To improve C/R performance, multilevel checkpointing utilizes one feature of hierarchical storage (i.e., storage with different speeds and availability at different levels of the hierarchy). We propose leveraging an additional feature of hierarchical storage - the likely presence of NDP or active storage in (future) HPC systems. Using NDP (i.e., compute capabilities coupled to compute-node local storage) allows the host processor to quickly write checkpoints to the node-local storage and resume execution; NDP can then handle the slower process of sending the checkpoint(s) to global I/O off of the main application's critical path. NDP can be leveraged for additional optimizations that improve C/R performance. We explore the benefits of adding compression capabilities to our NDP-based checkpointing scheme, as this can reduce the network bandwidth required for sending checkpoints out to I/O (thereby reducing network contention for the main application's communication needs), and it can also help improve performance by speeding up checkpoint restoration (which is primarily limited by how fast checkpoints can be retrieved from the I/O nodes' disks). While checkpoint compression is not new, the exploitation of an NDP architecture to offload it from the host processor is a new twist: past approaches tolerated higher host-side processing costs because the compression reduced the I/O cost sufficiently to make it a net win, whereas our approach can get the benefits of compression without the host-side overheads. A simple sketch of this division of labor between the host and NDP is given at the end of this chapter.

1.5 Contributions

We make the following contributions:

- We perform a high-level analysis of existing checkpoint/restart optimizations using our projected exascale system. This analysis includes determining the scaling required by these optimizations to achieve a 90% progress rate on our projected system.

- We discuss how checkpoint compression can be added to multilevel checkpointing. Specifically, based on a compression study of checkpoint data, we determine which general-purpose compression utilities are best suited to the different levels of multilevel checkpointing based on their compression factor and speed.

- We show that adding compression to multilevel checkpointing reduces the hardware requirement and also improves its performance. Our data shows that multilevel checkpointing combined with compression improves the efficiency of a system with C/R to 73% compared to 35% for multilevel checkpointing without compression. Our data also shows that adding compression to the local level of multilevel checkpointing allows the bandwidth of node-local storage to be reduced by a factor of 3x (from 15 GB/s to 5 GB/s) while maintaining the progress rate.

- We describe the operational details of the checkpoint/restart mechanism using NDP as well as the compute node's hardware organization to implement such a mechanism.

- We evaluate checkpoint compression using NDP as a starting point for exploring additional optimizations that can be performed by NDP. We perform a compression study to help select a compression utility that achieves a good trade-off between compression speed and compression factor when compressing checkpoint data using NDP. The study also informs us of the compression speed requirement for the NDP hardware.

- We perform a detailed evaluation of multilevel checkpointing with NDP support. With our proposed NDP approach for offloading I/O management and compression, the host processor is able to increase its progress rate from 51% to 78% (i.e., a more than 50% speedup in the application performance).

- We present a methodology to estimate the 5-year cost of an exascale node for different C/R configurations, with certain simplifications. This cost analysis helps compare the cost efficiency of the various C/R configurations for our projected exascale node.

1.6 Organization of This Thesis

The rest of the thesis is organized as follows. In Chapter 2 we cover the background and related work for the topics discussed in this thesis. We start by projecting an exascale system configuration in Chapter 3; in that chapter we also project parameters relevant to checkpoint/restart and discuss the limitations of basic checkpoint/restart for our projected exascale system. In Chapter 4, we discuss our first proposal - adding checkpoint compression to each level of multilevel checkpointing. We discuss how this combination can be achieved, and we perform a checkpoint compression study in Section 4.2 to mimic compression performance for an exascale node with fast local storage (i.e., when compression is not bottlenecked by storage bandwidth). Based on this compression study we discuss how a compression utility can be picked for a particular level of multilevel checkpointing. The performance gains of such a combination of checkpoint/restart schemes are quantified in Section 4.3. Chapter 5 details our second proposal - leveraging NDP to reduce checkpoint/restart overhead for tasks associated with checkpointing to global I/O. We present a high-level node organization in Section 5.1 and describe the operational details of our NDP approach.

Using the data from the compression study in Section 4.2, we determine the compression performance requirement of NDP for scenarios in which NDP is to be used for compression. The impact of leveraging NDP on C/R overhead is evaluated in Section 5.3. Chapter 6 presents a discussion on how our first two proposals can be combined. In this chapter we discuss how the compression requirements for NDP are affected by the addition of compression at all levels of multilevel checkpointing. In Section 6.3, the performance overhead of the combination of our two approaches is evaluated. Next, in Section 6.4, the energy and hardware cost analysis methodology for C/R configurations is described. Finally, in Section 6.5, the cost-efficiency data for the various C/R configurations discussed in this thesis is presented. In Chapter 7, we conclude the thesis by summarizing our observations and discussing our main findings.
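As a concrete illustration of the offloading idea introduced in Section 1.4, the following minimal Python sketch mimics the division of labor on the host side only; it is not the NDP hardware or the mechanism evaluated in later chapters, and all paths and sizes are hypothetical. The host writes each checkpoint to fast node-local storage and immediately resumes compute, while a background worker, standing in for NDP, compresses staged checkpoints and drains them to global I/O off the critical path.

import pathlib
import queue
import threading
import time
import zlib

LOCAL_DIR = pathlib.Path("/tmp/local_ckpt")    # stand-in for node-local NVM/SSD storage
GLOBAL_DIR = pathlib.Path("/tmp/global_ckpt")  # stand-in for the global parallel file system

def drain_worker(work: "queue.Queue") -> None:
    """Plays the role of NDP: compress staged checkpoints and push them to global I/O."""
    while True:
        local_path = work.get()
        if local_path is None:                       # sentinel: no more checkpoints
            break
        blob = zlib.compress(local_path.read_bytes(), level=6)
        (GLOBAL_DIR / (local_path.name + ".z")).write_bytes(blob)

def write_checkpoint(step: int, state: bytes, work: "queue.Queue") -> None:
    """Host-side path: fast local write, hand off to the drain worker, return immediately."""
    path = LOCAL_DIR / f"ckpt_{step:06d}.bin"
    path.write_bytes(state)                          # fast local write on the critical path
    work.put(path)                                   # slow global write happens in the background

if __name__ == "__main__":
    LOCAL_DIR.mkdir(parents=True, exist_ok=True)
    GLOBAL_DIR.mkdir(parents=True, exist_ok=True)
    work = queue.Queue()
    ndp = threading.Thread(target=drain_worker, args=(work,))
    ndp.start()
    for step in range(3):
        write_checkpoint(step, bytes(1 << 20), work) # 1 MB stand-in for application state
        time.sleep(0.1)                              # stand-in for computation between checkpoints
    work.put(None)                                   # shut the worker down
    ndp.join()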

CHAPTER 2

BACKGROUND AND RELATED WORK

2.1 Checkpoint/Restart

Checkpoint/restart is a widely used fault tolerance mechanism to mitigate the performance overhead of faults in high-performance computing systems. It involves saving the state of the application's execution periodically. In case of a fault or a failure requiring recovery, the application is resumed from the most recent checkpointed state. This avoids restarting the application from the beginning, which would have a high performance cost.

2.1.1 Coordinated Checkpoint/Restart

Coordinated C/R involves synchronizing all nodes before writing a checkpoint. In coordinated checkpoint/restart, the blocking method stops execution on all nodes at a global synchronization point (e.g., by using a barrier operation) [Eln02; TS84; Hur09]. Alternatively, non-blocking methods can be used, which involve saving the communication state during the checkpointing operation [LY87; Cot06]. Blocking methods generally have a higher synchronization overhead than non-blocking methods, but non-blocking methods have higher complexity.
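As a minimal illustration of the blocking coordinated scheme described above, the sketch below assumes mpi4py is available and uses hypothetical file names; every rank stops at a barrier, writes its portion of the application state, and synchronizes again before resuming.

from mpi4py import MPI
import pickle

def coordinated_checkpoint(state, step, ckpt_dir="/tmp"):
    """Blocking coordinated checkpoint: synchronize, write per-rank state, synchronize."""
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    comm.Barrier()                                   # global synchronization point before writing
    path = f"{ckpt_dir}/ckpt_step{step}_rank{rank}.pkl"
    with open(path, "wb") as f:
        pickle.dump(state, f)                        # each rank saves its own portion of the state
    comm.Barrier()                                   # checkpoint set is complete once all ranks finish
    return path

# Example (run under mpirun): coordinated_checkpoint({"iteration": 42, "grid": [0.0] * 1024}, step=42)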

2.2 Checkpoint/Restart Overhead

The increase in C/R overhead for exascale systems can be attributed to an increase in failure rate (or a decrease in system mean time to interrupt) and an increase in checkpoint size without a corresponding increase in checkpoint read/write bandwidth. The following subsections look at each of these aspects.

2.2.1 Failure Rate

Projections show that the system MTTI of exascale machines could be in the range of minutes to tens of minutes [Don11; Ber08; Chu12]. Bergman et al. [Ber08] project the system MTTI to be 35 to 39 minutes for the strawman exascale system that they project. Chung et al. [Chu12] project the system MTTI to be less than 10 minutes for exascale systems. A study by Schroeder and Gibson [SG07] on petascale systems showed ~0.2 failures per socket per year, which is equivalent to a 5-year mean time to failure (MTTF) per socket. This node failure rate has been used in many prior studies [Rie10; Ibt15] to calculate the system failure rate. Using this node failure rate, an exascale system with 100,000 nodes would have a system MTTI of ~26.28 minutes. This is a more than 5x increase in failure rate compared to a petascale system's MTTI (which is in the range of hours [Gam14]).

2.2.2 Checkpoint Size

With the increasing computational capacity of exascale systems, these systems will be able to handle larger workloads with larger memory footprints. Therefore, these systems are projected to have larger system memory. The total system memory is projected to be in the range of tens to hundreds of petabytes [Don11; Ber08; Chu12]. Since a large part of the system's physical memory may need to be checkpointed for some applications, the checkpoint size for such applications would also be in the range of tens to hundreds of petabytes. This is a more than 10x increase in system memory compared to petascale systems, none of which have total system memory exceeding two petabytes.

2.2.3 Progress Rate or C/R Efficiency

Progress rate or C/R efficiency is a commonly used metric to quantify the overhead of checkpoint/restart techniques. Progress rate is the ratio of the time it takes to run an application in the absence of failures and C/R overhead to the time it takes to perform the task in the presence of such overheads. In other words, it is the fraction of time an application on a system with checkpoint/restart spends doing productive work.

To maintain the progress rate of systems in a scenario with increasing failure rate (or decreasing MTTI) and increasing checkpoint size for exascale systems, the checkpoint read/write bandwidth would need to increase in proportion to both the increase in failure rate and the increase in checkpoint size. This is based on Daly's formula for calculating C/R efficiency [Dal07]. According to Daly's formula, C/R efficiency is proportional to the system MTTI (M) and inversely proportional to the time to save a checkpoint (δ); that is, C/R efficiency is proportional to M/δ. The time to save a checkpoint (δ) can be approximated as proportional to the checkpoint size (Size) and inversely proportional to the write bandwidth (BW). Note that we are ignoring the cost of the synchronization operation, whose cost would grow with the increasing number of nodes that need to be synchronized. With this simplification, the C/R efficiency or progress rate is proportional to M · BW / Size. To maintain this ratio, the time to save a checkpoint (Size / BW) should decrease in proportion to the MTTI. This means BW should increase in proportion to the increase in Size and in inverse proportion to M (or in direct proportion to the failure rate). Based on the projections for exascale systems, the I/O bandwidth is not expected to scale in proportion to the failure rate and checkpoint size.

2.3 Checkpoint/Restart Optimization Techniques

Traditional checkpoint/restart involves periodically saving the state of an application to I/O-based storage. In case of a failure, the checkpoint data is used to restore the application and resume its execution from the checkpoint. However, an increasing failure rate requires a proportional decrease in checkpoint commit time to avoid a decrease in progress rate. This would require a decrease in checkpoint size or an increase in checkpoint commit bandwidth. The checkpoint size of HPC systems is expected to increase with the increasingly large memory footprints of applications; therefore, checkpoint commit bandwidth needs to increase to compensate not only for the increasing checkpoint size but also for the increasing failure rate. However, the projected increase in I/O bandwidth is unlikely to be enough to compensate for these effects, and therefore, to keep C/R feasible [Fer12] for future HPC (exascale) systems, optimizations or techniques that reduce C/R overhead are increasingly important. While there are techniques [Ell12] that improve C/R performance by increasing the MTTI, most optimizations improve C/R performance by reducing checkpoint commit time. These optimizations can be broadly divided into two categories.

2.3.1 Increase Checkpoint Commit Bandwidth

One set of optimizations reduces the average checkpoint commit time by increasing the effective checkpoint commit bandwidth. Examples of such optimizations are multilevel checkpointing [Moo10; Don09; Kan13b; Zhe04; Gam14], burst buffers [Bhi16], and file systems optimized to support faster checkpointing [Raj13; Ben09]. Multilevel checkpointing techniques [Don09; Kan13b; Zhe04; Gam14] are variations of the multilevel checkpoint scheme described by Moody et al. [Moo10], where checkpoints are frequently saved to faster compute-node local storage and less frequent checkpoints are saved to slower global I/O. The compute-node local memory could be DRAM [Moo10; Zhe04] or non-volatile memory (NVM) [Don09; Kan13b]. While the bandwidth to local storage is expected to scale with checkpoint size, the bandwidth to global I/O is not expected to scale [Fer12]. Moody et al. [Moo10] show that while multilevel checkpointing performs better than single-level checkpointing under increasing failure rate and increasing time to save checkpoints to I/O, its performance still degrades. Therefore, the global I/O component of multilevel checkpointing overhead will grow with the increasing cost of checkpointing to I/O. These techniques involve saving most checkpoints to a combination of fast and local storage and saving a few to slow disk-based storage.

2.3.2 Reduce Checkpoint Data Size

Another set of optimizations reduces the checkpoint commit time by reducing the amount of checkpoint data that is saved. Examples of such optimizations are checkpoint compression [Ibt15; Ibt12b], incremental checkpointing [Fer11], and data deduplication [Kai16; Nic13]. Checkpoint compression involves compressing checkpoints using general-purpose compression utilities like gzip before saving them. Incremental checkpointing involves creating a full checkpoint followed by saving checkpoint increments, i.e., only the state that has changed since the last checkpoint. These solutions have been shown [Ibt15] to be cost-effective ways of reducing C/R overhead compared to techniques that require hardware support, such as fast SSD-based storage.

2.4 Near Data Processing

Prior works have shown the performance and energy-efficiency benefits of adding near data processing (NDP) to different levels of the storage hierarchy, such as burst buffers for HPC systems or fast SSD storage [Kan13a; Cho16; Ses14; Do13; Tiw13; Cho13]. While in this work we use NDP in the context of processing coupled to local NVM-based storage, NDP is also used in the context of adding compute capability or specialized logic to main memory [Aza16; Ahn15]. A review of prior work on NDP in both contexts can be found in [Bal14].
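To make the two optimization families of Section 2.3 concrete, below is a minimal sketch of a multilevel checkpoint policy with compression at both levels; paths and parameters are hypothetical, and zlib merely stands in for utilities such as gzip or lz4. The occasional global write in this loop is precisely the step that later chapters propose to offload to NDP.

import pathlib
import zlib

LOCAL_DIR = pathlib.Path("/tmp/local_ckpt")    # stand-in for fast node-local storage (SSD/NVM)
GLOBAL_DIR = pathlib.Path("/tmp/global_ckpt")  # stand-in for slow global I/O storage

def multilevel_checkpoint(step: int, state: bytes, n: int = 10) -> None:
    """Save every checkpoint locally and every n-th checkpoint to global I/O, compressing both."""
    LOCAL_DIR.mkdir(parents=True, exist_ok=True)
    GLOBAL_DIR.mkdir(parents=True, exist_ok=True)

    # Local level: favor compression speed (low level) so the host stalls as briefly as possible.
    (LOCAL_DIR / f"ckpt_{step:06d}.z").write_bytes(zlib.compress(state, level=1))

    # Global level: taken only occasionally; favor compression factor (high level) because
    # the write is bottlenecked by shared I/O bandwidth anyway.
    if step % n == 0:
        (GLOBAL_DIR / f"ckpt_{step:06d}.z").write_bytes(zlib.compress(state, level=9))

def restore_latest() -> bytes:
    """Prefer the most recent local checkpoint; fall back to global I/O if none is available."""
    for directory in (LOCAL_DIR, GLOBAL_DIR):
        candidates = sorted(directory.glob("ckpt_*.z"))
        if candidates:
            return zlib.decompress(candidates[-1].read_bytes())
    raise FileNotFoundError("no checkpoint available at any level")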

CHAPTER 3

SCALING STUDY

3.1 Overview

In this chapter we project an exascale system configuration starting from an existing petascale HPC system. This projection is, in turn, used to project the MTTI of an exascale system. The exascale system configuration and the MTTI are used to calculate the overhead of basic C/R with no optimization using Daly's formula. We use the exascale configuration projected in this chapter to estimate the overhead of various C/R configurations using our performance model throughout this thesis.

3.2 Exascale System Projection

We project an exascale system by scaling an existing petascale system to exaflops performance. The assumptions made when scaling are based on cited technology trends, and the various parameters of the projection are compared to projections made in prior works [Don09; Chu12]. Furthermore, in our projections, we err on the side of more optimistic (lower) checkpoint/restart overheads. The intent is to show that even with these optimistic assumptions, the overhead of existing checkpoint/restart mechanisms on exascale systems would be high, resulting in a lower progress rate.

One implication of this preference is that we project a conservative increase in physical memory size and, consequently, in checkpoint size. Similarly, compared to other projections, a conservative increase in failure rate is projected. These assumptions lead to an optimistic scenario for checkpoint/restart cost.

In this study, we scale the Titan Cray XK7 system [Rog12], a petascale system, to exaflops performance. Titan has 18,688 compute nodes, each consisting of a 16-core AMD Opteron processor coupled with additional GPU acceleration. Each node has 38 GB of memory (2 GB per CPU core plus the GPU's 6 GB). Each node has a theoretical peak performance of 1.44 teraflops, with a theoretical system peak performance of 27 petaflops. A ~37x increase is required to reach exaflops performance. This can be accomplished by a combination of an increase in performance per compute node and an increase in the number of compute nodes. We assume that the performance of a single compute node can scale to 10 teraflops [van08], a ~7x increase compared to Titan's per-node performance. We assume a uniform 7x increase in both CPU and GPU performance. For the CPU, the performance increase is assumed to be achieved by a combination of a 75% increase in performance per core and an increase in core count from 16 to 64. If the ratio of 2 GB/core is maintained, the memory for the CPU would increase to 128 GB. We conservatively assume that the memory of the GPU is doubled to 12 GB (and not increased 7x, proportional to performance). The total memory for the node would then be 140 GB. This is a conservative estimate compared to projections made in past work [Don09]. With a 7x increase in the compute node's performance, the remaining increase in performance comes from a 5.3x increase in node count (37x/7x). This leads to 100,000 compute nodes, which at 10 teraflops each provide a system peak performance of 1 exaflops. With 100K compute nodes, the total memory of the system would be 14 PB, again a conservative projection compared to other projections [Don11; Chu12; Don09]. The aggregated data bandwidth of Titan to its file system is 1000 GB/s. We project this to increase to 10 TB/s, a 10x increase, which is of the same order as projected by Chen [Che11]. Titan uses the Gemini interconnect, which has an injection bandwidth of 20 GB/s. We scale it to 50 GB/s [Che11].

3.3 MTTI Projection

We project the system MTTI based on previously observed or projected node MTTI and scale it to the compute node count of our projected exascale system. A study by Schroeder and Gibson [SG07] on petascale systems showed ~0.2 failures per socket per year, which is equivalent to a 5-year mean time to failure (MTTF) per socket. Similar to previous work [Rie10; Ibt15], we assume a node/socket MTTF of 5 years. This results in a system MTTF of ~26.28 minutes with 100K nodes.

Table 3.1 Exascale system projection scaled from the Titan Cray XK7 supercomputer

Parameter        Titan Cray XK7     Exascale Projection   Factor change
Node Count       18,688             100,000               5.35x
System Peak      27 petaflops       1 exaflops            37x
Node Peak        1.44 teraflops     10 teraflops          7x
System Memory    710 TB             14 PB                 19.72x
Node Memory      38 GB              140 GB                3.68x
Interconnect BW  20 GB/s            50 GB/s               2.5x
I/O Bandwidth    1000 GB/s          10 TB/s               10x
System MTTI      160 minutes (*)    30 minutes            (1/5.33)x

(*) Prior work [Gam14] reports 9 failures per day for Titan, which converts to a failure every 160 minutes.

We assume each failure leads to an interrupt requiring recovery using checkpointed application state, and thus the system MTTI would also be ~26.28 minutes. For the sake of simplicity, we make the optimistic assumption of a system MTTI of 30 minutes for this exascale system, which falls in the range projected in previous work [Chu12]. Key parameters of our projected exascale system are listed in Table 3.1.

3.4 Checkpoint/Restart Overhead with no Optimization

This section discusses the feasibility of basic C/R for our projected exascale system and system MTTI. Assuming that 80% of the main memory needs to be checkpointed, each checkpoint would have a size of 11.2 PB for our projected system. Writing a single checkpoint to the global file system would require 1120 seconds (~18.7 minutes) at 10 TB/s. Using Daly's equation [Dal07] to calculate the progress rate, we get a value of 13.67%. (While Daly's work [Dal06] provides an equation to calculate the optimal checkpoint interval given MTTI and checkpoint time, Quantifying Checkpoint Efficiency [Dal07] contains an equation to calculate checkpoint efficiency given MTTI and checkpoint commit time; checkpoint restore time is assumed to be the same as commit time.) We validated this value using our performance model. This implies that the system will spend more than 85% of its time performing C/R-related tasks.

How can we improve the progress rate of this system? If we only consider basic C/R without optimization, then to achieve a progress rate of, say, 90% on a system with an MTTI of 30 minutes, the required checkpoint commit time comes to 9 seconds (calculated using the same formula used to calculate progress rate given system MTTI and checkpoint commit time). This would require a checkpoint commit rate of 11.2 PB / 9 seconds, or ~1.24 PB/s, for the system, which comes to ~12.44 GB/s per compute node. The ~1.24 PB/s far outpaces the projected 10 TB/s of global I/O bandwidth, thus requiring additional C/R optimizations.
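The back-of-the-envelope numbers above can be reproduced in a few lines; this is a sketch of the simple ratios only, not of the full performance model or of Daly's equation.

# Projected exascale C/R parameters from Table 3.1 and Section 3.4.
NODES = 100_000
CHECKPOINT_PB = 11.2        # 80% of the 14 PB projected system memory
IO_BW_TB_PER_S = 10.0       # projected global I/O bandwidth
TARGET_COMMIT_S = 9.0       # commit time needed for ~90% progress at a 30-minute MTTI

# Time to write one 11.2 PB checkpoint at 10 TB/s.
commit_s = CHECKPOINT_PB * 1000 / IO_BW_TB_PER_S       # 1120 s
print(f"commit time: {commit_s:.0f} s (~{commit_s / 60:.1f} min)")

# System-wide and per-node bandwidth needed to hit the 9-second target instead.
required_pb_per_s = CHECKPOINT_PB / TARGET_COMMIT_S    # ~1.24 PB/s
required_gb_per_node = required_pb_per_s * 1e6 / NODES # ~12.4 GB/s per node
print(f"required: {required_pb_per_s:.2f} PB/s system-wide, {required_gb_per_node:.2f} GB/s per node")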


More information

Expand In-Memory Capacity at a Fraction of the Cost of DRAM: AMD EPYCTM and Ultrastar

Expand In-Memory Capacity at a Fraction of the Cost of DRAM: AMD EPYCTM and Ultrastar White Paper March, 2019 Expand In-Memory Capacity at a Fraction of the Cost of DRAM: AMD EPYCTM and Ultrastar Massive Memory for AMD EPYC-based Servers at a Fraction of the Cost of DRAM The ever-expanding

More information

GPUs and Emerging Architectures

GPUs and Emerging Architectures GPUs and Emerging Architectures Mike Giles mike.giles@maths.ox.ac.uk Mathematical Institute, Oxford University e-infrastructure South Consortium Oxford e-research Centre Emerging Architectures p. 1 CPUs

More information

REMEM: REmote MEMory as Checkpointing Storage

REMEM: REmote MEMory as Checkpointing Storage REMEM: REmote MEMory as Checkpointing Storage Hui Jin Illinois Institute of Technology Xian-He Sun Illinois Institute of Technology Yong Chen Oak Ridge National Laboratory Tao Ke Illinois Institute of

More information

Architectural Differences nc. DRAM devices are accessed with a multiplexed address scheme. Each unit of data is accessed by first selecting its row ad

Architectural Differences nc. DRAM devices are accessed with a multiplexed address scheme. Each unit of data is accessed by first selecting its row ad nc. Application Note AN1801 Rev. 0.2, 11/2003 Performance Differences between MPC8240 and the Tsi106 Host Bridge Top Changwatchai Roy Jenevein risc10@email.sps.mot.com CPD Applications This paper discusses

More information

The Exascale Architecture

The Exascale Architecture The Exascale Architecture Richard Graham HPC Advisory Council China 2013 Overview Programming-model challenges for Exascale Challenges for scaling MPI to Exascale InfiniBand enhancements Dynamically Connected

More information

CS6453. Data-Intensive Systems: Rachit Agarwal. Technology trends, Emerging challenges & opportuni=es

CS6453. Data-Intensive Systems: Rachit Agarwal. Technology trends, Emerging challenges & opportuni=es CS6453 Data-Intensive Systems: Technology trends, Emerging challenges & opportuni=es Rachit Agarwal Slides based on: many many discussions with Ion Stoica, his class, and many industry folks Servers Typical

More information

EFFICIENT ATTACKS ON HOMOPHONIC SUBSTITUTION CIPHERS

EFFICIENT ATTACKS ON HOMOPHONIC SUBSTITUTION CIPHERS EFFICIENT ATTACKS ON HOMOPHONIC SUBSTITUTION CIPHERS A Project Report Presented to The faculty of the Department of Computer Science San Jose State University In Partial Fulfillment of the Requirements

More information

PowerVault MD3 SSD Cache Overview

PowerVault MD3 SSD Cache Overview PowerVault MD3 SSD Cache Overview A Dell Technical White Paper Dell Storage Engineering October 2015 A Dell Technical White Paper TECHNICAL INACCURACIES. THE CONTENT IS PROVIDED AS IS, WITHOUT EXPRESS

More information

TECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 6 th CALL (Tier-0)

TECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 6 th CALL (Tier-0) TECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 6 th CALL (Tier-0) Contributing sites and the corresponding computer systems for this call are: GCS@Jülich, Germany IBM Blue Gene/Q GENCI@CEA, France Bull Bullx

More information

Design of Parallel Algorithms. Course Introduction

Design of Parallel Algorithms. Course Introduction + Design of Parallel Algorithms Course Introduction + CSE 4163/6163 Parallel Algorithm Analysis & Design! Course Web Site: http://www.cse.msstate.edu/~luke/courses/fl17/cse4163! Instructor: Ed Luke! Office:

More information

Got Burst Buffer. Now What? Early experiences, exciting future possibilities, and what we need from the system to make it work

Got Burst Buffer. Now What? Early experiences, exciting future possibilities, and what we need from the system to make it work Got Burst Buffer. Now What? Early experiences, exciting future possibilities, and what we need from the system to make it work The Salishan Conference on High-Speed Computing April 26, 2016 Adam Moody

More information

Intel Xeon Phi архитектура, модели программирования, оптимизация.

Intel Xeon Phi архитектура, модели программирования, оптимизация. Нижний Новгород, 2017 Intel Xeon Phi архитектура, модели программирования, оптимизация. Дмитрий Прохоров, Дмитрий Рябцев, Intel Agenda What and Why Intel Xeon Phi Top 500 insights, roadmap, architecture

More information

The Future of High Performance Computing

The Future of High Performance Computing The Future of High Performance Computing Randal E. Bryant Carnegie Mellon University http://www.cs.cmu.edu/~bryant Comparing Two Large-Scale Systems Oakridge Titan Google Data Center 2 Monolithic supercomputer

More information

Shiraz: Exploiting System Reliability and Application Resilience Characteristics to Improve Large Scale System Throughput

Shiraz: Exploiting System Reliability and Application Resilience Characteristics to Improve Large Scale System Throughput 18 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Shiraz: Exploiting System Reliability and Application Resilience Characteristics to Improve Large Scale System Throughput

More information

ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL

ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY BHARAT SIGINAM IN

More information

An Oracle White Paper April 2010

An Oracle White Paper April 2010 An Oracle White Paper April 2010 In October 2009, NEC Corporation ( NEC ) established development guidelines and a roadmap for IT platform products to realize a next-generation IT infrastructures suited

More information

LATEST INTEL TECHNOLOGIES POWER NEW PERFORMANCE LEVELS ON VMWARE VSAN

LATEST INTEL TECHNOLOGIES POWER NEW PERFORMANCE LEVELS ON VMWARE VSAN LATEST INTEL TECHNOLOGIES POWER NEW PERFORMANCE LEVELS ON VMWARE VSAN Russ Fellows Enabling you to make the best technology decisions November 2017 EXECUTIVE OVERVIEW* The new Intel Xeon Scalable platform

More information

Performance of relational database management

Performance of relational database management Building a 3-D DRAM Architecture for Optimum Cost/Performance By Gene Bowles and Duke Lambert As systems increase in performance and power, magnetic disk storage speeds have lagged behind. But using solidstate

More information

Addressing the Memory Wall

Addressing the Memory Wall Lecture 26: Addressing the Memory Wall Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Cage the Elephant Back Against the Wall (Cage the Elephant) This song is for the

More information

Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery

Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark by Nkemdirim Dockery High Performance Computing Workloads Core-memory sized Floating point intensive Well-structured

More information

DCS-ctrl: A Fast and Flexible Device-Control Mechanism for Device-Centric Server Architecture

DCS-ctrl: A Fast and Flexible Device-Control Mechanism for Device-Centric Server Architecture DCS-ctrl: A Fast and Flexible ice-control Mechanism for ice-centric Server Architecture Dongup Kwon 1, Jaehyung Ahn 2, Dongju Chae 2, Mohammadamin Ajdari 2, Jaewon Lee 1, Suheon Bae 1, Youngsok Kim 1,

More information

FAULT TOLERANT SYSTEMS

FAULT TOLERANT SYSTEMS FAULT TOLERANT SYSTEMS http://www.ecs.umass.edu/ece/koren/faulttolerantsystems Part 16 - Checkpointing I Chapter 6 - Checkpointing Part.16.1 Failure During Program Execution Computers today are much faster,

More information

IBM Power AC922 Server

IBM Power AC922 Server IBM Power AC922 Server The Best Server for Enterprise AI Highlights More accuracy - GPUs access system RAM for larger models Faster insights - significant deep learning speedups Rapid deployment - integrated

More information

EN2910A: Advanced Computer Architecture Topic 06: Supercomputers & Data Centers Prof. Sherief Reda School of Engineering Brown University

EN2910A: Advanced Computer Architecture Topic 06: Supercomputers & Data Centers Prof. Sherief Reda School of Engineering Brown University EN2910A: Advanced Computer Architecture Topic 06: Supercomputers & Data Centers Prof. Sherief Reda School of Engineering Brown University Material from: The Datacenter as a Computer: An Introduction to

More information

Moneta: A High-performance Storage Array Architecture for Nextgeneration, Micro 2010

Moneta: A High-performance Storage Array Architecture for Nextgeneration, Micro 2010 Moneta: A High-performance Storage Array Architecture for Nextgeneration, Non-volatile Memories Micro 2010 NVM-based SSD NVMs are replacing spinning-disks Performance of disks has lagged NAND flash showed

More information

Lecture 14: Congestion Control"

Lecture 14: Congestion Control Lecture 14: Congestion Control" CSE 222A: Computer Communication Networks Alex C. Snoeren Thanks: Amin Vahdat, Dina Katabi Lecture 14 Overview" TCP congestion control review XCP Overview 2 Congestion Control

More information

1 Copyright 2012, Oracle and/or its affiliates. All rights reserved.

1 Copyright 2012, Oracle and/or its affiliates. All rights reserved. 1 Engineered Systems - Exadata Juan Loaiza Senior Vice President Systems Technology October 4, 2012 2 Safe Harbor Statement "Safe Harbor Statement: Statements in this presentation relating to Oracle's

More information

Intel Many Integrated Core (MIC) Architecture

Intel Many Integrated Core (MIC) Architecture Intel Many Integrated Core (MIC) Architecture Karl Solchenbach Director European Exascale Labs BMW2011, November 3, 2011 1 Notice and Disclaimers Notice: This document contains information on products

More information

Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries

Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries Jeffrey Young, Alex Merritt, Se Hoon Shon Advisor: Sudhakar Yalamanchili 4/16/13 Sponsors: Intel, NVIDIA, NSF 2 The Problem Big

More information

The Role of Database Aware Flash Technologies in Accelerating Mission- Critical Databases

The Role of Database Aware Flash Technologies in Accelerating Mission- Critical Databases The Role of Database Aware Flash Technologies in Accelerating Mission- Critical Databases Gurmeet Goindi Principal Product Manager Oracle Flash Memory Summit 2013 Santa Clara, CA 1 Agenda Relational Database

More information

The Structure and Properties of Clique Graphs of Regular Graphs

The Structure and Properties of Clique Graphs of Regular Graphs The University of Southern Mississippi The Aquila Digital Community Master's Theses 1-014 The Structure and Properties of Clique Graphs of Regular Graphs Jan Burmeister University of Southern Mississippi

More information

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building

More information

Emerging NVM Memory Technologies

Emerging NVM Memory Technologies Emerging NVM Memory Technologies Yuan Xie Associate Professor The Pennsylvania State University Department of Computer Science & Engineering www.cse.psu.edu/~yuanxie yuanxie@cse.psu.edu Position Statement

More information

REMOTE PERSISTENT MEMORY THINK TANK

REMOTE PERSISTENT MEMORY THINK TANK 14th ANNUAL WORKSHOP 2018 REMOTE PERSISTENT MEMORY THINK TANK Report Out Prepared by a cast of thousands April 13, 2018 THINK TANK ABSTRACT Challenge - Some people think that Remote Persistent Memory over

More information

Lecture 21: Reliable, High Performance Storage. CSC 469H1F Fall 2006 Angela Demke Brown

Lecture 21: Reliable, High Performance Storage. CSC 469H1F Fall 2006 Angela Demke Brown Lecture 21: Reliable, High Performance Storage CSC 469H1F Fall 2006 Angela Demke Brown 1 Review We ve looked at fault tolerance via server replication Continue operating with up to f failures Recovery

More information

VMware vsphere: Taking Virtualization to the Next Level

VMware vsphere: Taking Virtualization to the Next Level About this research note: Product Evaluation notes provide an analysis of the market position of a specific product and its vendor through an in-depth exploration of their relative capabilities. VMware

More information

Stellar performance for a virtualized world

Stellar performance for a virtualized world IBM Systems and Technology IBM System Storage Stellar performance for a virtualized world IBM storage systems leverage VMware technology 2 Stellar performance for a virtualized world Highlights Leverages

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

B.H.GARDI COLLEGE OF ENGINEERING & TECHNOLOGY (MCA Dept.) Parallel Database Database Management System - 2

B.H.GARDI COLLEGE OF ENGINEERING & TECHNOLOGY (MCA Dept.) Parallel Database Database Management System - 2 Introduction :- Today single CPU based architecture is not capable enough for the modern database that are required to handle more demanding and complex requirements of the users, for example, high performance,

More information

Checkpointing with DMTCP and MVAPICH2 for Supercomputing. Kapil Arya. Mesosphere, Inc. & Northeastern University

Checkpointing with DMTCP and MVAPICH2 for Supercomputing. Kapil Arya. Mesosphere, Inc. & Northeastern University MVAPICH Users Group 2016 Kapil Arya Checkpointing with DMTCP and MVAPICH2 for Supercomputing Kapil Arya Mesosphere, Inc. & Northeastern University DMTCP Developer Apache Mesos Committer kapil@mesosphere.io

More information

Experiences with the Parallel Virtual File System (PVFS) in Linux Clusters

Experiences with the Parallel Virtual File System (PVFS) in Linux Clusters Experiences with the Parallel Virtual File System (PVFS) in Linux Clusters Kent Milfeld, Avijit Purkayastha, Chona Guiang Texas Advanced Computing Center The University of Texas Austin, Texas USA Abstract

More information

White paper ETERNUS Extreme Cache Performance and Use

White paper ETERNUS Extreme Cache Performance and Use White paper ETERNUS Extreme Cache Performance and Use The Extreme Cache feature provides the ETERNUS DX500 S3 and DX600 S3 Storage Arrays with an effective flash based performance accelerator for regions

More information

Validating Hyperconsolidation Savings With VMAX 3

Validating Hyperconsolidation Savings With VMAX 3 Validating Hyperconsolidation Savings With VMAX 3 By Ashish Nadkarni, IDC Storage Team An IDC Infobrief, sponsored by EMC January 2015 Validating Hyperconsolidation Savings With VMAX 3 Executive Summary:

More information

Accelerating Implicit LS-DYNA with GPU

Accelerating Implicit LS-DYNA with GPU Accelerating Implicit LS-DYNA with GPU Yih-Yih Lin Hewlett-Packard Company Abstract A major hindrance to the widespread use of Implicit LS-DYNA is its high compute cost. This paper will show modern GPU,

More information

Introduction to TCP/IP Offload Engine (TOE)

Introduction to TCP/IP Offload Engine (TOE) Introduction to TCP/IP Offload Engine (TOE) Version 1.0, April 2002 Authored By: Eric Yeh, Hewlett Packard Herman Chao, QLogic Corp. Venu Mannem, Adaptec, Inc. Joe Gervais, Alacritech Bradley Booth, Intel

More information

How TokuDB Fractal TreeTM. Indexes Work. Bradley C. Kuszmaul. Guest Lecture in MIT Performance Engineering, 18 November 2010.

How TokuDB Fractal TreeTM. Indexes Work. Bradley C. Kuszmaul. Guest Lecture in MIT Performance Engineering, 18 November 2010. 6.172 How Fractal Trees Work 1 How TokuDB Fractal TreeTM Indexes Work Bradley C. Kuszmaul Guest Lecture in MIT 6.172 Performance Engineering, 18 November 2010. 6.172 How Fractal Trees Work 2 I m an MIT

More information

Memory Hierarchy Y. K. Malaiya

Memory Hierarchy Y. K. Malaiya Memory Hierarchy Y. K. Malaiya Acknowledgements Computer Architecture, Quantitative Approach - Hennessy, Patterson Vishwani D. Agrawal Review: Major Components of a Computer Processor Control Datapath

More information

Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems.

Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems. Cluster Networks Introduction Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems. As usual, the driver is performance

More information

Evaluation Report: Improving SQL Server Database Performance with Dot Hill AssuredSAN 4824 Flash Upgrades

Evaluation Report: Improving SQL Server Database Performance with Dot Hill AssuredSAN 4824 Flash Upgrades Evaluation Report: Improving SQL Server Database Performance with Dot Hill AssuredSAN 4824 Flash Upgrades Evaluation report prepared under contract with Dot Hill August 2015 Executive Summary Solid state

More information

Introduction to High Performance Parallel I/O

Introduction to High Performance Parallel I/O Introduction to High Performance Parallel I/O Richard Gerber Deputy Group Lead NERSC User Services August 30, 2013-1- Some slides from Katie Antypas I/O Needs Getting Bigger All the Time I/O needs growing

More information

MPI Optimizations via MXM and FCA for Maximum Performance on LS-DYNA

MPI Optimizations via MXM and FCA for Maximum Performance on LS-DYNA MPI Optimizations via MXM and FCA for Maximum Performance on LS-DYNA Gilad Shainer 1, Tong Liu 1, Pak Lui 1, Todd Wilde 1 1 Mellanox Technologies Abstract From concept to engineering, and from design to

More information

Hyper-converged Secondary Storage for Backup with Deduplication Q & A. The impact of data deduplication on the backup process

Hyper-converged Secondary Storage for Backup with Deduplication Q & A. The impact of data deduplication on the backup process Hyper-converged Secondary Storage for Backup with Deduplication Q & A The impact of data deduplication on the backup process Table of Contents Introduction... 3 What is data deduplication?... 3 Is all

More information

LECTURE 1. Introduction

LECTURE 1. Introduction LECTURE 1 Introduction CLASSES OF COMPUTERS When we think of a computer, most of us might first think of our laptop or maybe one of the desktop machines frequently used in the Majors Lab. Computers, however,

More information

Memory Systems IRAM. Principle of IRAM

Memory Systems IRAM. Principle of IRAM Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several

More information

Fundamentals of Quantitative Design and Analysis

Fundamentals of Quantitative Design and Analysis Fundamentals of Quantitative Design and Analysis Dr. Jiang Li Adapted from the slides provided by the authors Computer Technology Performance improvements: Improvements in semiconductor technology Feature

More information

Lessons from Post-processing Climate Data on Modern Flash-based HPC Systems

Lessons from Post-processing Climate Data on Modern Flash-based HPC Systems Lessons from Post-processing Climate Data on Modern Flash-based HPC Systems Adnan Haider 1, Sheri Mickelson 2, John Dennis 2 1 Illinois Institute of Technology, USA; 2 National Center of Atmospheric Research,

More information

Cray XC Scalability and the Aries Network Tony Ford

Cray XC Scalability and the Aries Network Tony Ford Cray XC Scalability and the Aries Network Tony Ford June 29, 2017 Exascale Scalability Which scalability metrics are important for Exascale? Performance (obviously!) What are the contributing factors?

More information

Engineers can be significantly more productive when ANSYS Mechanical runs on CPUs with a high core count. Executive Summary

Engineers can be significantly more productive when ANSYS Mechanical runs on CPUs with a high core count. Executive Summary white paper Computer-Aided Engineering ANSYS Mechanical on Intel Xeon Processors Engineer Productivity Boosted by Higher-Core CPUs Engineers can be significantly more productive when ANSYS Mechanical runs

More information

CSEE W4824 Computer Architecture Fall 2012

CSEE W4824 Computer Architecture Fall 2012 CSEE W4824 Computer Architecture Fall 2012 Lecture 8 Memory Hierarchy Design: Memory Technologies and the Basics of Caches Luca Carloni Department of Computer Science Columbia University in the City of

More information

Introduction CPS343. Spring Parallel and High Performance Computing. CPS343 (Parallel and HPC) Introduction Spring / 29

Introduction CPS343. Spring Parallel and High Performance Computing. CPS343 (Parallel and HPC) Introduction Spring / 29 Introduction CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Introduction Spring 2018 1 / 29 Outline 1 Preface Course Details Course Requirements 2 Background Definitions

More information

Technical Brief. AGP 8X Evolving the Graphics Interface

Technical Brief. AGP 8X Evolving the Graphics Interface Technical Brief AGP 8X Evolving the Graphics Interface Increasing Graphics Bandwidth No one needs to be convinced that the overall PC experience is increasingly dependent on the efficient processing of

More information

Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity. Donghyuk Lee Carnegie Mellon University

Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity. Donghyuk Lee Carnegie Mellon University Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity Donghyuk Lee Carnegie Mellon University Problem: High DRAM Latency processor stalls: waiting for data main memory high latency Major bottleneck

More information

ICON for HD(CP) 2. High Definition Clouds and Precipitation for Advancing Climate Prediction

ICON for HD(CP) 2. High Definition Clouds and Precipitation for Advancing Climate Prediction ICON for HD(CP) 2 High Definition Clouds and Precipitation for Advancing Climate Prediction High Definition Clouds and Precipitation for Advancing Climate Prediction ICON 2 years ago Parameterize shallow

More information