
ABSTRACT

AGRAWAL, ABHINAV RAJIV. Reducing Checkpoint/Restart Overhead using Near Data Processing for Exascale System. (Under the direction of James Tuck.)

With the increasing size and complexity of high-performance computing (HPC) systems needed to achieve exascale performance, the system mean time to interrupt (system MTTI) is projected to decrease. To maintain the performance efficiency of the system when using checkpoint/restart (C/R) for mitigation, checkpoints need to be stored at a faster rate, which in turn requires lower checkpoint commit and restore times. This requirement is aggravated by the increasing ratio of checkpoint size to I/O bandwidth. To overcome this, prior works have proposed multilevel (hierarchical) checkpoint schemes that involve frequent checkpoint writes to faster node-local storage with occasional writes to slower global I/O-based storage (e.g., disk). However, due to the increasing cost of writing/reading checkpoints to/from global I/O-based storage, this technique may not scale well for systems approaching exaflops performance.

While an I/O or storage hierarchy alleviates the performance cost by reducing I/O access times (including for checkpoint/restart), moving large data between storage at different levels of the hierarchy adds overhead. Near data processing (NDP) has been shown to be effective in reducing the amount of data movement in many applications by performing computations closer to data, thus reducing this overhead. In addition, offloading some applications' computations from the host processors to NDP has been shown to improve performance. In this work we show how NDP can be leveraged to improve C/R performance. We propose offloading the process of writing checkpoints to global I/O from the main compute cores to NDP. We also explore opportunities for additional optimizations using NDP to further reduce checkpoint overheads. Overall, our approach eliminates the performance cost of writing checkpoints to I/O as these operations are performed by NDP. We evaluate the performance of our novel application of NDP to reducing checkpoint/restart cost and compare it to existing checkpoint/restart optimizations. For two-level checkpoint schemes (i.e., checkpoints saved to local storage and remote I/O nodes), our evaluation for a projected exascale system shows that a baseline system (without NDP) spends nearly half its time writing checkpoints to I/O, restoring from a checkpoint, or re-executing lost work. With NDP for offloading checkpoint management and compression, the host processor is able to increase its progress rate from 51% to 78% (i.e., a >50% speedup in application performance).

We further explore how checkpoint compression can be combined with multilevel checkpointing. We perform a compression study and discuss the compression performance required to make it beneficial to add compression to all levels of multilevel checkpointing. We analyze the C/R performance and other benefits of this technique.

Our data shows that multilevel checkpointing combined with compression at all levels improves the efficiency of a system with C/R to 73%, compared to 35% for multilevel checkpointing without compression. The efficiency of multilevel checkpointing with compression is further improved to 89% when using NDP to offload certain C/R tasks. Finally, we explore how the two approaches - compression at all levels of multilevel checkpointing and the use of NDP - can be combined. Adding compression to all levels of multilevel checkpointing results in compressed checkpoint data being available in local storage. Therefore, the role and benefit of NDP for further compressing checkpoint data before writing it to global storage are evaluated. In addition to evaluating the performance overhead, we also estimate the energy and hardware cost of the various C/R configurations we discuss. Our cost-efficiency analysis shows that adding checkpoint compression to improve progress rate is a more efficient solution than increasing the bandwidth of node-local storage. We also show that a configuration that leverages NDP to offload the task of writing data to global I/O has higher cost efficiency than a configuration that performs checkpoint compression at each level of multilevel checkpointing.

Copyright 2017 by Abhinav Rajiv Agrawal

All Rights Reserved

Reducing Checkpoint/Restart Overhead using Near Data Processing for Exascale System

by
Abhinav Rajiv Agrawal

A dissertation submitted to the Graduate Faculty of North Carolina State University in partial fulfillment of the requirements for the Degree of Doctor of Philosophy

Computer Engineering

Raleigh, North Carolina
2017

APPROVED BY:

Gregory Byrd
Eric Rotenberg
Frank Mueller
James Tuck (Chair of Advisory Committee)

DEDICATION

To my parents - Rajni and Rajiv Agrawal.

ACKNOWLEDGEMENTS

This research was made possible due to the support and guidance of many people - my advisor, research group members, collaborators, family, and friends. Foremost, I would like to express my sincere gratitude to my advisor Dr. James Tuck for his constant support during my Ph.D. studies. I would like to thank him for his guidance and patience while mentoring me in my research work. I am grateful to Dr. Tuck for allowing me to work on my research with enough independence and flexibility. I would like to thank my dissertation committee members, Dr. Gregory Byrd, Dr. Eric Rotenberg, and Dr. Frank Mueller, for their service on my committee as well as for their insightful comments, feedback, and advice. I would also like to thank Gabriel Loh for collaborating with me on this work and for his advice during my internship. My sincere thanks also go to Bagus Wibowo for helping with my research as well as for the many stimulating discussions and late nights before deadlines. Many thanks to my fellow labmates - Joonmoo Huh, Amro Awad, Hussein Elnawawy, Vinesh Srinivasan, and Seunghee Shin. Thanks to Gayatri Powar for proofreading many paper and report drafts. Lastly, I would like to thank my parents for instilling in me the importance of education from a young age and supporting me throughout my academic journey. This accomplishment is as much theirs as it is mine.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
Chapter 1  INTRODUCTION
    Overview
    Existing C/R Optimization Techniques
    Adding Checkpoint Compression to Multilevel Checkpointing
    Leveraging NDP to Improve C/R Efficiency
    Contributions
    Organization of This Thesis
Chapter 2  BACKGROUND AND RELATED WORK
    Checkpoint/Restart
        Coordinated Checkpoint/Restart
    Checkpoint/Restart Overhead
        Failure Rate
        Checkpoint Size
        Progress Rate or C/R Efficiency
    Checkpoint/Restart Optimization Techniques
        Increase Checkpoint Commit Bandwidth
        Reduce Checkpoint Data Size
    Near Data Processing
Chapter 3  SCALING STUDY
    Overview
    Exascale System Projection
    MTTI Projection
    Checkpoint/Restart Overhead with no Optimization
Chapter 4  MULTILEVEL CHECKPOINTING WITH COMPRESSION
    Introduction
        Overview
        Multilevel Checkpointing
        Adding Checkpoint Compression to Multilevel C/R
    Compression Study
        Tools and Methodology
        Checkpoint Compression Speed And Factor
        Selecting Utility for Checkpoint Compression
    Evaluation
        Methodology
        Checkpoint/Restart Overhead Components
        Progress Rate Comparison

        4.3.4 C/R Overhead Breakdown (by Local and I/O Level)
    Summary
Chapter 5  LEVERAGING NDP FOR CHECKPOINT/RESTART
    Compute Node with NDP
        Operation of Multilevel Checkpointing with NDP
    NDP for Checkpoint Data Compression
        NDP Performance Requirements
        Configuring NDP for Compression
    Evaluation
        Methodology
        Checkpoint/Restart Overhead Components
        Progress Rate Comparison
        C/R Overhead - Breakdown (4% I/O Recovery)
        C/R Overhead - Sensitivity Study
    Summary
Chapter 6  PERFORMANCE, POWER AND COST ANALYSIS FOR COMBINATION OF CHECKPOINT/RESTART OPTIMIZATIONS
    Introduction
    Compression Study
        Tools and Methodology
        Data: Compression Speed and Factor
        Selecting Utility for Checkpoint Compression using NDP
    Performance Evaluation
        Methodology
        Progress Rate Comparison
        C/R Overhead - Breakdown (15% I/O Recovery)
    Methodology - Cost Analysis
        Energy Cost
        Hardware Cost
    Results - Cost Analysis
        Absolute Cost Breakdown
        Cost Performance Ratio
Chapter 7  CONCLUSION
BIBLIOGRAPHY

LIST OF TABLES

Table 3.1  Exascale system projection scaled from the Titan Cray XK7 supercomputer

Table 4.1  Checkpoint Data Details. The second column shows the size of the total checkpoint data collected for each mini-app in gigabytes. Further columns show compression speed for checkpoint data using different utilities and compression levels on the HDD and SSD systems. Compression speed is for a single thread of each utility. The value inside () is the compression level.

Table 4.2  Checkpoint commit and restore time in seconds for all compression utilities. Checkpoint size for all mini-apps is set to 112 GB per compute node. The I/O column contains checkpoint times when checkpoints are compressed and saved to global I/O storage. L/S and L/F contain checkpoint times when checkpoints are compressed and saved to slow compute-node local storage (5 GB/s) and fast compute-node local storage (15 GB/s), respectively. Note that the checkpoint time values in the "Average" row are not the average values of the seven mini-apps, but the checkpoint time if the performance model is simulated using the average compression factor and compression speed from Figure 4.1. Note that checkpoint commit/restore time in the absence of compression would be I/O: 1120 s, L/S: 22.4 s, and L/F: 7.47 s.

Table 4.3  C/R parameters for evaluation using performance model

Table 5.1  The required compression speed, the required number of processor cores in NDP, and the smallest possible checkpoint interval to I/O based on average compression factor and speed

Table 5.2  C/R parameters for evaluation using performance model

Table 6.1  Checkpoint compression data. lz4-compressed data of 7 mini-apps is compressed again using various compression utilities. The first column shows the size of the lz4-compressed checkpoint data used to collect compression parameters. Columns with header F contain compression factor and columns with header S contain compression speed in MB/s. Compression speed is the speed at which lz4-compressed data is compressed using the various utilities.

Table 6.2  Cumulative or equivalent checkpoint compression data for compression after lz4 compression. lz4-compressed data of 7 mini-apps is compressed again using various compression utilities. Compression factor in this table is a measure of the cumulative reduction in checkpoint size after compression using lz4 and the utility in the first row of the corresponding column. Compression speed is an equivalent compression speed, i.e., the speed if the uncompressed checkpoint data were compressed in the same amount of time as the lz4-compressed checkpoint data.

Table 6.3  Checkpoint commit time in seconds for all compression utilities for 2 scenarios. UnC: uncompressed checkpoint data compressed by NDP (Scenario 1); uncompressed checkpoint size for all mini-apps is set to 112 GB per compute node. Comp: lz4-compressed checkpoint data compressed by NDP (Scenario 2); checkpoint size is the size obtained if 112 GB of checkpoint data of the corresponding mini-app is compressed using lz4. Note that the checkpoint time values in the "Average" row are not the average values of the seven mini-apps, but the checkpoint time if the performance model is simulated using the average compression factor and compression speed.

Table 6.4  C/R parameters for performance, power and cost evaluation of multilevel checkpointing combined with compression and NDP

Table 6.5  Power and cost parameters

LIST OF FIGURES

Figure 1.1  Progress rate of a system with C/R as a function of M/δ. Increasing value of M/δ leads to a higher progress rate.

Figure 4.1  Compression factor for checkpoint data of mini-apps using various compression utilities. The value inside () is the compression level.

Figure 4.2  C/R overhead breakdown for one L/F + I/O-Comp configuration on the y-axis for increasing ratio of locally-saved to I/O-saved checkpoints (n) on the x-axis. Note that the y-axis does not start at 0 for better resolution.

Figure 4.3  Frequency of writing checkpoints to global I/O storage for seven mini-apps and the average compression case, shown as a bar plot. On the y-axis, the frequency of checkpointing is normalized to the frequency of checkpointing for the no-compression case. Total checkpoint I/O write traffic is shown as a line plot, normalized to the checkpoint I/O write traffic generated for the no-compression case.

Figure 4.4  Progress rate comparison between different configurations. Data is shown for the 7 mini-apps studied and an average progress rate over the 7 mini-apps. For I/O only, the compression factor and speed correspond to the xz(1) entry from Table 4.1. Similarly, for the other configurations, the compression factor and speed corresponding to xz(1) are used for checkpoints to I/O, while for the local level, if compression is performed, the parameters corresponding to lz4(1) are used.

Figure 4.5  C/R overhead breakdown normalized to compute time (left) and as % of total execution time (right). The y-axis does not start at 0 for the plot on the right. Six configurations from the left: multilevel with no compression (local: 5 GB/s), multilevel with compression to I/O (local: 5 GB/s), multilevel with compression to both (local: 5 GB/s), multilevel with no compression (local: 15 GB/s), multilevel with compression to I/O (local: 15 GB/s), multilevel with compression to both (local: 15 GB/s). The probability that recovery from local fails for the multilevel cases: 15%. Compression factor - I/O: 80.6%; local: 64.8%.

Figure 5.1  Hardware organization of a compute node with our proposed Near Data Checkpointing Architecture (NDCA).

Figure 5.2  Time-line of multilevel checkpointing with and without NDP. HOST: primary processing of the compute node + DRAM; NVM: compute node local storage; I/O: I/O-node-based storage or global I/O.

Figure 5.3  Ratio of the number of locally saved to the number of I/O-saved checkpoints for different configurations and compression factors.

Figure 5.4  Progress rate comparison between different configurations. Data are shown for 3 of the 7 mini-apps studied and an average progress rate over the 7 mini-apps. The first set of bars is for no compression, while for the others the compression factor used is for gzip(1) as shown in Table 4.1. Compression factor is specified in parentheses in the label on the x-axis.

Figure 5.5  C/R overhead breakdown normalized to compute time (left) and as % of total execution time (right). The y-axis does not start at 0 for both plots. Four configurations from the left: multilevel (Local + I/O-H), multilevel+compression (Local + I/O-HC), NDP (Local + I/O-N) & NDP+compression (Local + I/O-NC). The probability that recovery from local fails: 4%. Compression factor: 73%.

Figure 5.6  Progress for five C/R configurations for increasing checkpoint size. The y-axis does not start at 0%. MTTI: 30 minutes.

Figure 5.7  Progress for five C/R configurations for increasing MTTI. The y-axis does not start at 0%. Checkpoint size: 112 GB per compute node.

Figure 6.1  Progress rate for C/R configurations in which NDP reads lz4-compressed data from node-local storage (local level). Data is shown for the 7 mini-apps studied and an average progress rate over the 7 mini-apps. Compression data for compression performed using NDP is obtained from the compression study.

Figure 6.2  C/R overhead breakdown for the evaluated C/R configurations.

Figure 6.3  Five-year cost breakdown (in USD) per compute node and progress rate for the evaluated C/R configurations.

Figure 6.4  Cost breakdown (in USD). Cost is per compute node per (exa) floating point operations for the evaluated C/R configurations.

Figure 6.5  Cost (in USD) per compute node per (exa) floating point operations for the evaluated C/R configurations (without fast local storage), for varying compute node cost.

CHAPTER 1

INTRODUCTION

1.1 Overview

The increasing size and complexity of high-performance computing (HPC) systems needed to achieve exascale performance is projected to cause a decrease in the system mean time to interrupt (MTTI). Checkpoint/restart (C/R) is a widely used mechanism to deal with failures in HPC systems. It involves periodically saving to stable storage the application state required to resume execution. In case of a failure or interrupt, the application's execution resumes from the most recent checkpoint (saved state). In the absence of C/R, a failure would force the application to restart from the beginning, losing all completed work. However, C/R mechanisms also add performance overhead due to the time spent saving the checkpoint state, restoring from the saved state, and re-running lost work (work performed since the most recent checkpoint). The efficiency or availability of exascale systems with C/R is projected to be around 50% [Ber08].

C/R efficiency or progress rate is the ratio of the time it takes to run an application in the absence of failures and C/R overhead to the time it takes to perform the task in the presence of such overheads. (Progress rate and efficiency are used interchangeably in this document.) Under some simplifying assumptions, the progress rate can be approximated as a function of the ratio of the MTTI (M) to the time to save a checkpoint (δ) [Dal06; Dal07]. While Daly's work [Dal06] provides an equation to calculate the optimal checkpoint interval given MTTI (M) and checkpoint commit time (δ), Quantifying Checkpoint Efficiency [Dal07] provides an equation that calculates checkpoint efficiency given M and δ; checkpoint restore time is assumed to be the same as commit time. Figure 1.1 illustrates this function.

[Figure 1.1: Progress rate of a system with C/R as a function of M/δ. Increasing value of M/δ leads to a higher progress rate.]

For exascale systems with checkpoint/restart, this ratio M/δ decreases due to two factors. On one hand, the increasing number of compute nodes needed to reach exaflops performance would lead to a decrease in the system MTTI because the MTTI of a single compute node is not improving [SG07]. On the other hand, exascale systems are expected to have larger physical memory capacities and to run applications with larger problem sizes. This will lead to larger application state that needs checkpointing. Without a proportional increase in checkpoint commit bandwidth to storage, the time to save a checkpoint (δ) will increase. The combination of these two factors leads to a reduction in the ratio M/δ and thus in the progress rate.

An I/O or storage hierarchy improves the access time to storage by means of fast intermediate levels between the compute nodes and disk-based global I/O. These intermediate levels, in the form of burst buffers or compute-node local storage, can be flash-based solid state drives (SSDs). They provide a lower-overhead site where compute nodes can stage data before it is drained to slower disk-based storage in the I/O nodes [Bhi16]. While this alleviates the performance overhead by reducing the access time to storage (including for C/R), moving large amounts of data between storage at different levels of the hierarchy is not energy efficient.
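As a reference point for this M/δ dependence, a minimal first-order sketch in the Young/Daly style (valid only when δ is much smaller than M, and not the exact higher-order expression of [Dal06; Dal07]) is:

\tau_{\mathrm{opt}} \approx \sqrt{2\delta M}, \qquad
\text{waste} \approx \frac{\delta}{\tau_{\mathrm{opt}}} + \frac{\tau_{\mathrm{opt}}}{2M} = \sqrt{\frac{2\delta}{M}}, \qquad
\text{progress rate} \approx 1 - \sqrt{\frac{2\delta}{M}}

Here τ_opt is the checkpoint interval, the first waste term accounts for time spent writing checkpoints, the second for expected lost work, and restart time is ignored. In this form the progress rate depends only on the ratio M/δ, which is why the shrinking M/δ of exascale systems translates directly into lost efficiency.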

Near data processing (NDP) is effective in reducing the amount of data movement in many applications by performing computations closer to the data. Offloading some applications' computations from the host processors to NDP has been shown to improve performance and energy efficiency, especially for data-intensive applications [Kan13a; Cho16; Ses14; Do13; Tiw13; Cho13]. NDP (or active storage) can potentially address key challenges in areas such as scalability, performance, and reliability of I/O systems in exascale computing [Don11]. In this work we show that NDP can be leveraged to improve C/R performance. (NDP could be in the context of main memory, like DRAM, or in the context of storage, like SSDs. In this work, NDP refers to compute capabilities coupled to compute-node local storage, which would likely be flash SSDs or other NVM-based storage due to the high speed requirement.)

Using SSDs as a burst buffer or node-local storage would provide higher bandwidth for writing and reading checkpoints, thus decreasing C/R cost. However, using SSDs has been shown to be an expensive way to reduce C/R cost. Ibtesham et al. [Ibt15] show that it is more cost-effective to improve the efficiency of C/R using software techniques such as checkpoint compression [Ibt15; Ibt12b] or incremental checkpointing [Fer11; Kai16; Nic13]. Based on these observations, we evaluate the benefits of combining the use of NDP and checkpoint compression to reduce C/R overhead at a lower cost.

1.2 Existing C/R Optimization Techniques

A number of optimizations and mechanisms [Moo10; Di14; Don09; Kan13b; Zhe04; Ben09; Raj13; Ibt15; Ibt12b; Fer11; Kai16; Nic13; Gam14; Ell12; DP12] have been proposed to reduce C/R overhead. While a mechanism like partial redundancy [Ell12] increases the effective MTTI to reduce C/R overhead, many mechanisms aim to either increase the effective checkpoint commit bandwidth [Moo10; Di14; Don09; Kan13b; Zhe04; Ben09; Raj13] or reduce the checkpoint size [Ibt15; Ibt12b; Fer11; Kai16; Nic13]. Increasing the checkpoint commit bandwidth or decreasing the checkpoint size decreases the value of δ, thus improving the progress rate. Many of the techniques that reduce the effective checkpoint commit time take advantage of the storage hierarchy.

Multilevel checkpointing is a clear example of a C/R mechanism that exploits the storage hierarchy to improve C/R performance. Multilevel checkpointing [Moo10; Di14] involves writing frequent checkpoints to compute-node local storage, while writing occasional checkpoints to global I/O-based storage. To keep the overhead of checkpoints to compute-node local storage low, the bandwidth of local storage needs to be high. This can be achieved by adding storage in the form of flash-based solid state drives (SSDs). The bandwidth requirement, and thus the hardware cost, of fast local storage would increase with the increasing checkpoint size and failure rate in HPC systems.

Another issue with multilevel checkpointing is the high overhead of the occasional checkpoints to global I/O. Writing a checkpoint out to global I/O in a conventional multilevel checkpointing system requires the host processor to read the checkpoint data from main memory and then send the data over the network to the remote storage, which requires the host to execute all of the code associated with running the full network stack (e.g., TCP/IP). This can be a particularly slow process because checkpointing is typically bottlenecked by the slower I/O (disk) bandwidth at the shared remote I/O nodes. While this is happening, the host processor is generally not available to perform the useful computations of the main application. While the bandwidth of local storage in the compute node scales with application size, that of global I/O-based storage does not. Moody et al. [Moo10; Di14] show that with increasing failure rate and increasing time to save checkpoints to global I/O, the progress rate of a system with multilevel checkpointing decreases, although more slowly than for single-level checkpointing.

1.3 Adding Checkpoint Compression to Multilevel Checkpointing

Multilevel checkpointing increases the effective bandwidth of reading and writing checkpoint data by saving most checkpoints to fast compute-node local storage, while saving a few checkpoints to high-overhead global I/O-based storage. This local storage will become increasingly costly because its bandwidth requirement grows with the increasing checkpoint size and failure rate of HPC systems. In Chapter 4, we show that compressing checkpoint data before writing to local storage reduces the bandwidth requirement for local storage. The high performance overhead associated with checkpointing to global I/O is also mitigated by compressing checkpoint data before writing to global I/O. While adding checkpoint compression to multilevel checkpointing is an intuitive (or obvious) solution to mitigate the scaling issues of multilevel checkpointing for exascale systems, in this work we study how compression can be added to multilevel checkpointing. We evaluate the compression performance required to add compression at all levels of multilevel checkpointing and provide a methodology for determining these requirements. In Chapter 4 and Chapter 6 we quantify the performance and cost-efficiency improvements that the addition of checkpoint compression brings to multilevel checkpointing. Our evaluation shows that even with the addition of compression to each level of multilevel checkpointing, the overhead associated with checkpointing to global I/O is still high and provides further opportunity for improvement.

1.4 Leveraging NDP to Improve C/R Efficiency

In this work, we also explore leveraging NDP to target the overhead associated with checkpointing to global I/O. To improve C/R performance, multilevel checkpointing utilizes one feature of hierarchical storage (i.e., storage with different speeds and availability at different levels of the hierarchy). We propose leveraging an additional feature of hierarchical storage - the likely presence of NDP or active storage in (future) HPC systems. Using NDP (i.e., compute capabilities coupled to compute-node local storage) allows the host processor to quickly write checkpoints to the node-local storage and resume execution; NDP can then handle the slower process of sending the checkpoint(s) to global I/O off of the main application's critical path. NDP can be leveraged for additional optimizations that improve C/R performance. We explore the benefits of adding compression capabilities to our NDP-based checkpointing scheme, as this can reduce the network bandwidth required for sending checkpoints out to I/O (thereby reducing network contention for the main application's communication needs), and it can also help improve performance by speeding up checkpoint restoration (which is primarily limited by how fast checkpoints can be retrieved from the I/O nodes' disks). While checkpoint compression is not new, the exploitation of an NDP architecture to offload it from the host processor is a new twist: past approaches tolerated higher host-side processing costs because the compression reduced the I/O cost sufficiently to make it a net win, whereas our approach can get the benefits of compression without the host-side overheads. A simple sketch of this division of labor between the host and NDP is given at the end of this chapter.

1.5 Contributions

We make the following contributions:

- We perform a high-level analysis of existing checkpoint/restart optimizations using our projected exascale system. This analysis includes determining the scaling required by these optimizations to achieve a 90% progress rate on our projected system.

- We discuss how checkpoint compression can be added to multilevel checkpointing. Specifically, based on a compression study of checkpoint data, we determine which general-purpose compression utilities are best suited to the different levels of multilevel checkpointing based on their compression factor and speed.

- We show that adding compression to multilevel checkpointing reduces the hardware requirement and also improves its performance. Our data shows that multilevel checkpointing combined with compression improves the efficiency of a system with C/R to 73% compared to 35% for multilevel checkpointing without compression. Our data also shows that adding compression to the local level of multilevel checkpointing allows the bandwidth of node-local storage to be reduced by a factor of 3x (from 15 GB/s to 5 GB/s) while maintaining the progress rate.

- We describe the operational details of the checkpoint/restart mechanism using NDP as well as the compute node's hardware organization to implement such a mechanism.

- We evaluate checkpoint compression using NDP as a starting point for exploring additional optimizations that can be performed by NDP. We perform a compression study to help select a compression utility that achieves a good trade-off between compression speed and compression factor when compressing checkpoint data using NDP. The study also informs us of the compression speed requirement for the NDP hardware.

- We perform a detailed evaluation of multilevel checkpointing with NDP support. With our proposed NDP approach for offloading I/O management and compression, the host processor is able to increase its progress rate from 51% to 78% (i.e., a more than 50% speedup in the application performance).

- We present a methodology to estimate the 5-year cost of an exascale node for different C/R configurations, with certain simplifications. This cost analysis helps compare the cost efficiency of the various C/R configurations for our projected exascale node.

1.6 Organization of This Thesis

The rest of the thesis is organized as follows. In Chapter 2 we cover the background and related work for the topics discussed in this thesis. We start by projecting an exascale system configuration in Chapter 3; in that chapter we also project parameters relevant to checkpoint/restart and discuss the limitations of basic checkpoint/restart for our projected exascale system. In Chapter 4, we discuss our first proposal - adding checkpoint compression to each level of multilevel checkpointing. We discuss how this combination can be achieved, and we perform a checkpoint compression study in Section 4.2 to mimic compression performance for an exascale node with fast local storage (i.e., when compression is not bottlenecked by storage bandwidth). Based on this compression study we discuss how a compression utility can be picked for a particular level of multilevel checkpointing. The performance gains of such a combination of checkpoint/restart schemes are quantified in Section 4.3. Chapter 5 details our second proposal - leveraging NDP to reduce checkpoint/restart overhead for tasks associated with checkpointing to global I/O. We present a high-level node organization in Section 5.1 and describe the operational details of our NDP approach.

Using the data from the compression study in Section 4.2, we determine the compression performance requirement of NDP for scenarios in which NDP is to be used for compression. The impact of leveraging NDP on C/R overhead is evaluated in Section 5.3. Chapter 6 presents a discussion on how our first two proposals can be combined. In this chapter we discuss how the compression requirements for NDP are affected by the addition of compression at all levels of multilevel checkpointing. In Section 6.3, the performance overhead of the combination of our two approaches is evaluated. Next, in Section 6.4, the energy and hardware cost analysis methodology for C/R configurations is described. Finally, in Section 6.5, the cost-efficiency data for the various C/R configurations discussed in this thesis is presented. In Chapter 7, we conclude the thesis by summarizing our observations and discussing our main findings.
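As a concrete illustration of the offloading idea introduced in Section 1.4, the following minimal Python sketch mimics the division of labor on the host side only; it is not the NDP hardware or the mechanism evaluated in later chapters, and all paths and sizes are hypothetical. The host writes each checkpoint to fast node-local storage and immediately resumes compute, while a background worker, standing in for NDP, compresses staged checkpoints and drains them to global I/O off the critical path.

import pathlib
import queue
import threading
import time
import zlib

LOCAL_DIR = pathlib.Path("/tmp/local_ckpt")    # stand-in for node-local NVM/SSD storage
GLOBAL_DIR = pathlib.Path("/tmp/global_ckpt")  # stand-in for the global parallel file system

def drain_worker(work: "queue.Queue") -> None:
    """Plays the role of NDP: compress staged checkpoints and push them to global I/O."""
    while True:
        local_path = work.get()
        if local_path is None:                       # sentinel: no more checkpoints
            break
        blob = zlib.compress(local_path.read_bytes(), level=6)
        (GLOBAL_DIR / (local_path.name + ".z")).write_bytes(blob)

def write_checkpoint(step: int, state: bytes, work: "queue.Queue") -> None:
    """Host-side path: fast local write, hand off to the drain worker, return immediately."""
    path = LOCAL_DIR / f"ckpt_{step:06d}.bin"
    path.write_bytes(state)                          # fast local write on the critical path
    work.put(path)                                   # slow global write happens in the background

if __name__ == "__main__":
    LOCAL_DIR.mkdir(parents=True, exist_ok=True)
    GLOBAL_DIR.mkdir(parents=True, exist_ok=True)
    work = queue.Queue()
    ndp = threading.Thread(target=drain_worker, args=(work,))
    ndp.start()
    for step in range(3):
        write_checkpoint(step, bytes(1 << 20), work) # 1 MB stand-in for application state
        time.sleep(0.1)                              # stand-in for computation between checkpoints
    work.put(None)                                   # shut the worker down
    ndp.join()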

CHAPTER 2

BACKGROUND AND RELATED WORK

2.1 Checkpoint/Restart

Checkpoint/restart is a widely used fault tolerance mechanism to mitigate the performance overhead of faults in high-performance computing systems. It involves saving the state of the application's execution periodically. In case of a fault or a failure requiring recovery, the application is resumed from the most recent checkpointed state. This avoids restarting the application from the beginning, which would have a high performance cost.

2.1.1 Coordinated Checkpoint/Restart

Coordinated C/R involves synchronizing all nodes before writing a checkpoint. In coordinated checkpoint/restart, the blocking method stops execution on all nodes at a global synchronization point (e.g., by using a barrier operation) [Eln02; TS84; Hur09]. Alternatively, non-blocking methods can be used, which involve saving the communication state during the checkpointing operation [LY87; Cot06]. Blocking methods generally have a higher synchronization overhead than non-blocking methods, but non-blocking methods have higher complexity.
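As a minimal illustration of the blocking coordinated scheme described above, the sketch below assumes mpi4py is available and uses hypothetical file names; every rank stops at a barrier, writes its portion of the application state, and synchronizes again before resuming.

from mpi4py import MPI
import pickle

def coordinated_checkpoint(state, step, ckpt_dir="/tmp"):
    """Blocking coordinated checkpoint: synchronize, write per-rank state, synchronize."""
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    comm.Barrier()                                   # global synchronization point before writing
    path = f"{ckpt_dir}/ckpt_step{step}_rank{rank}.pkl"
    with open(path, "wb") as f:
        pickle.dump(state, f)                        # each rank saves its own portion of the state
    comm.Barrier()                                   # checkpoint set is complete once all ranks finish
    return path

# Example (run under mpirun): coordinated_checkpoint({"iteration": 42, "grid": [0.0] * 1024}, step=42)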

2.2 Checkpoint/Restart Overhead

The increase in C/R overhead for exascale systems can be attributed to an increase in failure rate (or a decrease in system mean time to interrupt) and an increase in checkpoint size without a corresponding increase in checkpoint read/write bandwidth. The following subsections look at each of these aspects.

2.2.1 Failure Rate

Projections show that the system MTTI of exascale machines could be in the range of minutes to tens of minutes [Don11; Ber08; Chu12]. Bergman et al. [Ber08] project the system MTTI to be 35 to 39 minutes for the strawman exascale system that they project. Chung et al. [Chu12] project the system MTTI to be less than 10 minutes for exascale systems. A study by Schroeder and Gibson [SG07] on petascale systems showed ~0.2 failures per socket per year, which is equivalent to a 5-year mean time to failure (MTTF) per socket. This node failure rate has been used in many prior studies [Rie10; Ibt15] to calculate the system failure rate. Using this node failure rate, an exascale system with 100,000 nodes would have a system MTTI of ~26.28 minutes. This is a more than 5x increase in failure rate compared to a petascale system's MTTI (which is in the range of hours [Gam14]).

2.2.2 Checkpoint Size

With the increasing computational capacity of exascale systems, these systems will be able to handle larger workloads with larger memory footprints. Therefore, these systems are projected to have larger system memory. The total system memory is projected to be in the range of tens to hundreds of petabytes [Don11; Ber08; Chu12]. Since a large part of the system's physical memory may need to be checkpointed for some applications, the checkpoint size for such applications would also be in the range of tens to hundreds of petabytes. This is a more than 10x increase in system memory compared to petascale systems, none of which have total system memory exceeding two petabytes.

2.2.3 Progress Rate or C/R Efficiency

Progress rate or C/R efficiency is a commonly used metric to quantify the overhead of checkpoint/restart techniques. Progress rate is the ratio of the time it takes to run an application in the absence of failures and C/R overhead to the time it takes to perform the task in the presence of such overheads. In other words, it is the fraction of time an application on a system with checkpoint/restart spends doing productive work.

To maintain the progress rate of systems in a scenario with increasing failure rate (or decreasing MTTI) and increasing checkpoint size for exascale systems, the checkpoint read/write bandwidth would need to increase in proportion to both the increase in failure rate and the increase in checkpoint size. This is based on Daly's formula for calculating C/R efficiency [Dal07]. According to Daly's formula, C/R efficiency is proportional to the system MTTI (M) and inversely proportional to the time to save a checkpoint (δ); that is, C/R efficiency is proportional to M/δ. The time to save a checkpoint (δ) can be approximated as proportional to the checkpoint size (Size) and inversely proportional to the write bandwidth (BW). Note that we are ignoring the cost of the synchronization operation, whose cost would grow with the increasing number of nodes that need to be synchronized. With this simplification, the C/R efficiency or progress rate is proportional to M · BW / Size. To maintain this ratio, the time to save a checkpoint (Size / BW) should decrease in proportion to the MTTI. This means BW should increase in proportion to the increase in Size and in inverse proportion to M (or in direct proportion to the failure rate). Based on the projections for exascale systems, the I/O bandwidth is not expected to scale in proportion to the failure rate and checkpoint size.

2.3 Checkpoint/Restart Optimization Techniques

Traditional checkpoint/restart involves periodically saving the state of an application to I/O-based storage. In case of a failure, the checkpoint data is used to restore the application and resume its execution from the checkpoint. However, an increasing failure rate requires a proportional decrease in checkpoint commit time to avoid a decrease in progress rate. This would require a decrease in checkpoint size or an increase in checkpoint commit bandwidth. The checkpoint size of HPC systems is expected to increase with the increasingly large memory footprints of applications; therefore, checkpoint commit bandwidth needs to increase to compensate not only for the increasing checkpoint size but also for the increasing failure rate. However, the projected increase in I/O bandwidth is unlikely to be enough to compensate for these effects, and therefore, to keep C/R feasible [Fer12] for future HPC (exascale) systems, optimizations or techniques that reduce C/R overhead are increasingly important. While there are techniques [Ell12] that improve C/R performance by increasing the MTTI, most optimizations improve C/R performance by reducing checkpoint commit time. These optimizations can be broadly divided into two categories.

2.3.1 Increase Checkpoint Commit Bandwidth

One set of optimizations reduces the average checkpoint commit time by increasing the effective checkpoint commit bandwidth. Examples of such optimizations are multilevel checkpointing [Moo10; Don09; Kan13b; Zhe04; Gam14], burst buffers [Bhi16], and file systems optimized to support faster checkpointing [Raj13; Ben09]. Multilevel checkpointing techniques [Don09; Kan13b; Zhe04; Gam14] are variations of the multilevel checkpoint scheme described by Moody et al. [Moo10], where checkpoints are frequently saved to faster compute-node local storage and less frequent checkpoints are saved to slower global I/O. The compute-node local memory could be DRAM [Moo10; Zhe04] or non-volatile memory (NVM) [Don09; Kan13b]. While the bandwidth to local storage is expected to scale with checkpoint size, the bandwidth to global I/O is not expected to scale [Fer12]. Moody et al. [Moo10] show that while multilevel checkpointing performs better than single-level checkpointing under increasing failure rate and increasing time to save checkpoints to I/O, its performance still degrades. Therefore, the global I/O component of multilevel checkpointing overhead will grow with the increasing cost of checkpointing to I/O. These techniques involve saving most checkpoints to a combination of fast and local storage and saving a few to slow disk-based storage.

2.3.2 Reduce Checkpoint Data Size

Another set of optimizations reduces the checkpoint commit time by reducing the amount of checkpoint data that is saved. Examples of such optimizations are checkpoint compression [Ibt15; Ibt12b], incremental checkpointing [Fer11], and data deduplication [Kai16; Nic13]. Checkpoint compression involves compressing checkpoints using general-purpose compression utilities like gzip before saving them. Incremental checkpointing involves creating a full checkpoint followed by saving checkpoint increments, i.e., only the state that has changed since the last checkpoint. These solutions have been shown [Ibt15] to be cost-effective ways of reducing C/R overhead compared to techniques that require hardware support, such as fast SSD-based storage.

2.4 Near Data Processing

Prior works have shown the performance and energy-efficiency benefits of adding near data processing (NDP) to different levels of the storage hierarchy, such as burst buffers for HPC systems or fast SSD storage [Kan13a; Cho16; Ses14; Do13; Tiw13; Cho13]. While in this work we use NDP in the context of processing coupled to local NVM-based storage, NDP is also used in the context of adding compute capability or specialized logic to main memory [Aza16; Ahn15]. A review of prior work on NDP in both contexts can be found in [Bal14].
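To make the two optimization families of Section 2.3 concrete, below is a minimal sketch of a multilevel checkpoint policy with compression at both levels; paths and parameters are hypothetical, and zlib merely stands in for utilities such as gzip or lz4. The occasional global write in this loop is precisely the step that later chapters propose to offload to NDP.

import pathlib
import zlib

LOCAL_DIR = pathlib.Path("/tmp/local_ckpt")    # stand-in for fast node-local storage (SSD/NVM)
GLOBAL_DIR = pathlib.Path("/tmp/global_ckpt")  # stand-in for slow global I/O storage

def multilevel_checkpoint(step: int, state: bytes, n: int = 10) -> None:
    """Save every checkpoint locally and every n-th checkpoint to global I/O, compressing both."""
    LOCAL_DIR.mkdir(parents=True, exist_ok=True)
    GLOBAL_DIR.mkdir(parents=True, exist_ok=True)

    # Local level: favor compression speed (low level) so the host stalls as briefly as possible.
    (LOCAL_DIR / f"ckpt_{step:06d}.z").write_bytes(zlib.compress(state, level=1))

    # Global level: taken only occasionally; favor compression factor (high level) because
    # the write is bottlenecked by shared I/O bandwidth anyway.
    if step % n == 0:
        (GLOBAL_DIR / f"ckpt_{step:06d}.z").write_bytes(zlib.compress(state, level=9))

def restore_latest() -> bytes:
    """Prefer the most recent local checkpoint; fall back to global I/O if none is available."""
    for directory in (LOCAL_DIR, GLOBAL_DIR):
        candidates = sorted(directory.glob("ckpt_*.z"))
        if candidates:
            return zlib.decompress(candidates[-1].read_bytes())
    raise FileNotFoundError("no checkpoint available at any level")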

CHAPTER 3

SCALING STUDY

3.1 Overview

In this chapter we project an exascale system configuration starting from an existing petascale HPC system. This projection is, in turn, used to project the MTTI of an exascale system. The exascale system configuration and the MTTI are used to calculate the overhead of basic C/R with no optimization using Daly's formula. We use the exascale configuration projected in this chapter to estimate the overhead of various C/R configurations using our performance model throughout this thesis.

3.2 Exascale System Projection

We project an exascale system by scaling an existing petascale system to exaflops performance. The assumptions made when scaling are based on cited technology trends, and the various parameters of the projection are compared to projections made in prior works [Don09; Chu12]. Furthermore, in our projections, we err on the side of more optimistic (lower) checkpoint/restart overheads. The intent is to show that even with these optimistic assumptions, the overhead of existing checkpoint/restart mechanisms on exascale systems would be high, resulting in a lower progress rate.

One implication of this preference is that we project a conservative increase in physical memory size and, consequently, in checkpoint size. Similarly, compared to other projections, a conservative increase in failure rate is projected. These assumptions lead to an optimistic scenario for checkpoint/restart cost.

In this study, we scale the Titan Cray XK7 system [Rog12], a petascale system, to exaflops performance. Titan has 18,688 compute nodes, each consisting of a 16-core AMD Opteron processor coupled with additional GPU acceleration. Each node has 38 GB of memory (2 GB per CPU core plus the GPU's 6 GB). Each node has a theoretical peak performance of 1.44 teraflops, with a theoretical system peak performance of 27 petaflops. A ~37x increase is required to reach exaflops performance. This can be accomplished by a combination of an increase in performance per compute node and an increase in the number of compute nodes. We assume that the performance of a single compute node can scale to 10 teraflops [van08], a ~7x increase compared to Titan's per-node performance. We assume a uniform 7x increase in both CPU and GPU performance. For the CPU, the performance increase is assumed to be achieved by a combination of a 75% increase in performance per core and an increase in core count from 16 to 64. If the ratio of 2 GB/core is maintained, the memory for the CPU would increase to 128 GB. We conservatively assume that the memory of the GPU is doubled to 12 GB (and not increased 7x, proportional to performance). The total memory for the node would then be 140 GB. This is a conservative estimate compared to projections made in past work [Don09]. With a 7x increase in the compute node's performance, the remaining increase in performance comes from a 5.3x increase in node count (37x/7x). This leads to 100,000 compute nodes, which at 10 teraflops each provide a system peak performance of 1 exaflops. With 100K compute nodes, the total memory of the system would be 14 PB, again a conservative projection compared to other projections [Don11; Chu12; Don09]. The aggregated data bandwidth of Titan to its file system is 1000 GB/s. We project this to increase to 10 TB/s, a 10x increase, which is of the same order as projected by Chen [Che11]. Titan uses the Gemini interconnect, which has an injection bandwidth of 20 GB/s. We scale it to 50 GB/s [Che11].

3.3 MTTI Projection

We project the system MTTI based on previously observed or projected node MTTI and scale it to the compute node count of our projected exascale system. A study by Schroeder and Gibson [SG07] on petascale systems showed ~0.2 failures per socket per year, which is equivalent to a 5-year mean time to failure (MTTF) per socket. Similar to previous work [Rie10; Ibt15], we assume a node/socket MTTF of 5 years. This results in a system MTTF of ~26.28 minutes with 100K nodes.

Table 3.1 Exascale system projection scaled from the Titan Cray XK7 supercomputer

Parameter        Titan Cray XK7     Exascale Projection   Factor change
Node Count       18,688             100,000               5.35x
System Peak      27 petaflops       1 exaflops            37x
Node Peak        1.44 teraflops     10 teraflops          7x
System Memory    710 TB             14 PB                 19.72x
Node Memory      38 GB              140 GB                3.68x
Interconnect BW  20 GB/s            50 GB/s               2.5x
I/O Bandwidth    1000 GB/s          10 TB/s               10x
System MTTI      160 minutes (*)    30 minutes            (1/5.33)x

(*) Prior work [Gam14] reports 9 failures per day for Titan, which converts to a failure every 160 minutes.

We assume each failure leads to an interrupt requiring recovery using checkpointed application state, and thus the system MTTI would also be ~26.28 minutes. For the sake of simplicity, we make the optimistic assumption of a system MTTI of 30 minutes for this exascale system, which falls in the range projected in previous work [Chu12]. Key parameters of our projected exascale system are listed in Table 3.1.

3.4 Checkpoint/Restart Overhead with no Optimization

This section discusses the feasibility of basic C/R for our projected exascale system and system MTTI. Assuming that 80% of the main memory needs to be checkpointed, each checkpoint would have a size of 11.2 PB for our projected system. Writing a single checkpoint to the global file system would require 1120 seconds (~18.7 minutes) at 10 TB/s. Using Daly's equation [Dal07] to calculate the progress rate, we get a value of 13.67%. (While Daly's work [Dal06] provides an equation to calculate the optimal checkpoint interval given MTTI and checkpoint time, Quantifying Checkpoint Efficiency [Dal07] contains an equation to calculate checkpoint efficiency given MTTI and checkpoint commit time; checkpoint restore time is assumed to be the same as commit time.) We validated this value using our performance model. This implies that the system will spend more than 85% of its time performing C/R-related tasks.

How can we improve the progress rate of this system? If we only consider basic C/R without optimization, then to achieve a progress rate of, say, 90% on a system with an MTTI of 30 minutes, the required checkpoint commit time comes to 9 seconds (calculated using the same formula used to calculate progress rate given system MTTI and checkpoint commit time). This would require a checkpoint commit rate of 11.2 PB / 9 seconds, or ~1.24 PB/s, for the system, which comes to ~12.44 GB/s per compute node. The ~1.24 PB/s far outpaces the projected 10 TB/s of global I/O bandwidth, thus requiring additional C/R optimizations.
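The back-of-the-envelope numbers above can be reproduced in a few lines; this is a sketch of the simple ratios only, not of the full performance model or of Daly's equation.

# Projected exascale C/R parameters from Table 3.1 and Section 3.4.
NODES = 100_000
CHECKPOINT_PB = 11.2        # 80% of the 14 PB projected system memory
IO_BW_TB_PER_S = 10.0       # projected global I/O bandwidth
TARGET_COMMIT_S = 9.0       # commit time needed for ~90% progress at a 30-minute MTTI

# Time to write one 11.2 PB checkpoint at 10 TB/s.
commit_s = CHECKPOINT_PB * 1000 / IO_BW_TB_PER_S       # 1120 s
print(f"commit time: {commit_s:.0f} s (~{commit_s / 60:.1f} min)")

# System-wide and per-node bandwidth needed to hit the 9-second target instead.
required_pb_per_s = CHECKPOINT_PB / TARGET_COMMIT_S    # ~1.24 PB/s
required_gb_per_node = required_pb_per_s * 1e6 / NODES # ~12.4 GB/s per node
print(f"required: {required_pb_per_s:.2f} PB/s system-wide, {required_gb_per_node:.2f} GB/s per node")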


More information

Expand In-Memory Capacity at a Fraction of the Cost of DRAM: AMD EPYCTM and Ultrastar

Expand In-Memory Capacity at a Fraction of the Cost of DRAM: AMD EPYCTM and Ultrastar White Paper March, 2019 Expand In-Memory Capacity at a Fraction of the Cost of DRAM: AMD EPYCTM and Ultrastar Massive Memory for AMD EPYC-based Servers at a Fraction of the Cost of DRAM The ever-expanding

More information

GPUs and Emerging Architectures

GPUs and Emerging Architectures GPUs and Emerging Architectures Mike Giles mike.giles@maths.ox.ac.uk Mathematical Institute, Oxford University e-infrastructure South Consortium Oxford e-research Centre Emerging Architectures p. 1 CPUs

More information

REMEM: REmote MEMory as Checkpointing Storage

REMEM: REmote MEMory as Checkpointing Storage REMEM: REmote MEMory as Checkpointing Storage Hui Jin Illinois Institute of Technology Xian-He Sun Illinois Institute of Technology Yong Chen Oak Ridge National Laboratory Tao Ke Illinois Institute of

More information

Architectural Differences nc. DRAM devices are accessed with a multiplexed address scheme. Each unit of data is accessed by first selecting its row ad

Architectural Differences nc. DRAM devices are accessed with a multiplexed address scheme. Each unit of data is accessed by first selecting its row ad nc. Application Note AN1801 Rev. 0.2, 11/2003 Performance Differences between MPC8240 and the Tsi106 Host Bridge Top Changwatchai Roy Jenevein risc10@email.sps.mot.com CPD Applications This paper discusses

More information

The Exascale Architecture

The Exascale Architecture The Exascale Architecture Richard Graham HPC Advisory Council China 2013 Overview Programming-model challenges for Exascale Challenges for scaling MPI to Exascale InfiniBand enhancements Dynamically Connected

More information

CS6453. Data-Intensive Systems: Rachit Agarwal. Technology trends, Emerging challenges & opportuni=es

CS6453. Data-Intensive Systems: Rachit Agarwal. Technology trends, Emerging challenges & opportuni=es CS6453 Data-Intensive Systems: Technology trends, Emerging challenges & opportuni=es Rachit Agarwal Slides based on: many many discussions with Ion Stoica, his class, and many industry folks Servers Typical

More information

EFFICIENT ATTACKS ON HOMOPHONIC SUBSTITUTION CIPHERS

EFFICIENT ATTACKS ON HOMOPHONIC SUBSTITUTION CIPHERS EFFICIENT ATTACKS ON HOMOPHONIC SUBSTITUTION CIPHERS A Project Report Presented to The faculty of the Department of Computer Science San Jose State University In Partial Fulfillment of the Requirements

More information

PowerVault MD3 SSD Cache Overview

PowerVault MD3 SSD Cache Overview PowerVault MD3 SSD Cache Overview A Dell Technical White Paper Dell Storage Engineering October 2015 A Dell Technical White Paper TECHNICAL INACCURACIES. THE CONTENT IS PROVIDED AS IS, WITHOUT EXPRESS

More information

TECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 6 th CALL (Tier-0)

TECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 6 th CALL (Tier-0) TECHNICAL GUIDELINES FOR APPLICANTS TO PRACE 6 th CALL (Tier-0) Contributing sites and the corresponding computer systems for this call are: GCS@Jülich, Germany IBM Blue Gene/Q GENCI@CEA, France Bull Bullx

More information

Design of Parallel Algorithms. Course Introduction

Design of Parallel Algorithms. Course Introduction + Design of Parallel Algorithms Course Introduction + CSE 4163/6163 Parallel Algorithm Analysis & Design! Course Web Site: http://www.cse.msstate.edu/~luke/courses/fl17/cse4163! Instructor: Ed Luke! Office:

More information

Got Burst Buffer. Now What? Early experiences, exciting future possibilities, and what we need from the system to make it work

Got Burst Buffer. Now What? Early experiences, exciting future possibilities, and what we need from the system to make it work Got Burst Buffer. Now What? Early experiences, exciting future possibilities, and what we need from the system to make it work The Salishan Conference on High-Speed Computing April 26, 2016 Adam Moody

More information

Intel Xeon Phi архитектура, модели программирования, оптимизация.

Intel Xeon Phi архитектура, модели программирования, оптимизация. Нижний Новгород, 2017 Intel Xeon Phi архитектура, модели программирования, оптимизация. Дмитрий Прохоров, Дмитрий Рябцев, Intel Agenda What and Why Intel Xeon Phi Top 500 insights, roadmap, architecture

More information

The Future of High Performance Computing

The Future of High Performance Computing The Future of High Performance Computing Randal E. Bryant Carnegie Mellon University http://www.cs.cmu.edu/~bryant Comparing Two Large-Scale Systems Oakridge Titan Google Data Center 2 Monolithic supercomputer

More information

Shiraz: Exploiting System Reliability and Application Resilience Characteristics to Improve Large Scale System Throughput

Shiraz: Exploiting System Reliability and Application Resilience Characteristics to Improve Large Scale System Throughput 18 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Shiraz: Exploiting System Reliability and Application Resilience Characteristics to Improve Large Scale System Throughput

More information

ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL

ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY BHARAT SIGINAM IN

More information

An Oracle White Paper April 2010

An Oracle White Paper April 2010 An Oracle White Paper April 2010 In October 2009, NEC Corporation ( NEC ) established development guidelines and a roadmap for IT platform products to realize a next-generation IT infrastructures suited

More information

LATEST INTEL TECHNOLOGIES POWER NEW PERFORMANCE LEVELS ON VMWARE VSAN

LATEST INTEL TECHNOLOGIES POWER NEW PERFORMANCE LEVELS ON VMWARE VSAN LATEST INTEL TECHNOLOGIES POWER NEW PERFORMANCE LEVELS ON VMWARE VSAN Russ Fellows Enabling you to make the best technology decisions November 2017 EXECUTIVE OVERVIEW* The new Intel Xeon Scalable platform

More information

Performance of relational database management

Performance of relational database management Building a 3-D DRAM Architecture for Optimum Cost/Performance By Gene Bowles and Duke Lambert As systems increase in performance and power, magnetic disk storage speeds have lagged behind. But using solidstate

More information

Addressing the Memory Wall

Addressing the Memory Wall Lecture 26: Addressing the Memory Wall Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Cage the Elephant Back Against the Wall (Cage the Elephant) This song is for the

More information

Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery

Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark by Nkemdirim Dockery High Performance Computing Workloads Core-memory sized Floating point intensive Well-structured

More information

DCS-ctrl: A Fast and Flexible Device-Control Mechanism for Device-Centric Server Architecture

DCS-ctrl: A Fast and Flexible Device-Control Mechanism for Device-Centric Server Architecture DCS-ctrl: A Fast and Flexible ice-control Mechanism for ice-centric Server Architecture Dongup Kwon 1, Jaehyung Ahn 2, Dongju Chae 2, Mohammadamin Ajdari 2, Jaewon Lee 1, Suheon Bae 1, Youngsok Kim 1,

More information

FAULT TOLERANT SYSTEMS

FAULT TOLERANT SYSTEMS FAULT TOLERANT SYSTEMS http://www.ecs.umass.edu/ece/koren/faulttolerantsystems Part 16 - Checkpointing I Chapter 6 - Checkpointing Part.16.1 Failure During Program Execution Computers today are much faster,

More information

IBM Power AC922 Server

IBM Power AC922 Server IBM Power AC922 Server The Best Server for Enterprise AI Highlights More accuracy - GPUs access system RAM for larger models Faster insights - significant deep learning speedups Rapid deployment - integrated

More information

EN2910A: Advanced Computer Architecture Topic 06: Supercomputers & Data Centers Prof. Sherief Reda School of Engineering Brown University

EN2910A: Advanced Computer Architecture Topic 06: Supercomputers & Data Centers Prof. Sherief Reda School of Engineering Brown University EN2910A: Advanced Computer Architecture Topic 06: Supercomputers & Data Centers Prof. Sherief Reda School of Engineering Brown University Material from: The Datacenter as a Computer: An Introduction to

More information

Moneta: A High-performance Storage Array Architecture for Nextgeneration, Micro 2010

Moneta: A High-performance Storage Array Architecture for Nextgeneration, Micro 2010 Moneta: A High-performance Storage Array Architecture for Nextgeneration, Non-volatile Memories Micro 2010 NVM-based SSD NVMs are replacing spinning-disks Performance of disks has lagged NAND flash showed

More information

Lecture 14: Congestion Control"

Lecture 14: Congestion Control Lecture 14: Congestion Control" CSE 222A: Computer Communication Networks Alex C. Snoeren Thanks: Amin Vahdat, Dina Katabi Lecture 14 Overview" TCP congestion control review XCP Overview 2 Congestion Control

More information

1 Copyright 2012, Oracle and/or its affiliates. All rights reserved.

1 Copyright 2012, Oracle and/or its affiliates. All rights reserved. 1 Engineered Systems - Exadata Juan Loaiza Senior Vice President Systems Technology October 4, 2012 2 Safe Harbor Statement "Safe Harbor Statement: Statements in this presentation relating to Oracle's

More information

Intel Many Integrated Core (MIC) Architecture

Intel Many Integrated Core (MIC) Architecture Intel Many Integrated Core (MIC) Architecture Karl Solchenbach Director European Exascale Labs BMW2011, November 3, 2011 1 Notice and Disclaimers Notice: This document contains information on products

More information

Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries

Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries Jeffrey Young, Alex Merritt, Se Hoon Shon Advisor: Sudhakar Yalamanchili 4/16/13 Sponsors: Intel, NVIDIA, NSF 2 The Problem Big

More information

The Role of Database Aware Flash Technologies in Accelerating Mission- Critical Databases

The Role of Database Aware Flash Technologies in Accelerating Mission- Critical Databases The Role of Database Aware Flash Technologies in Accelerating Mission- Critical Databases Gurmeet Goindi Principal Product Manager Oracle Flash Memory Summit 2013 Santa Clara, CA 1 Agenda Relational Database

More information

The Structure and Properties of Clique Graphs of Regular Graphs

The Structure and Properties of Clique Graphs of Regular Graphs The University of Southern Mississippi The Aquila Digital Community Master's Theses 1-014 The Structure and Properties of Clique Graphs of Regular Graphs Jan Burmeister University of Southern Mississippi

More information

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems)

EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) EI338: Computer Systems and Engineering (Computer Architecture & Operating Systems) Chentao Wu 吴晨涛 Associate Professor Dept. of Computer Science and Engineering Shanghai Jiao Tong University SEIEE Building

More information

Emerging NVM Memory Technologies

Emerging NVM Memory Technologies Emerging NVM Memory Technologies Yuan Xie Associate Professor The Pennsylvania State University Department of Computer Science & Engineering www.cse.psu.edu/~yuanxie yuanxie@cse.psu.edu Position Statement

More information

REMOTE PERSISTENT MEMORY THINK TANK

REMOTE PERSISTENT MEMORY THINK TANK 14th ANNUAL WORKSHOP 2018 REMOTE PERSISTENT MEMORY THINK TANK Report Out Prepared by a cast of thousands April 13, 2018 THINK TANK ABSTRACT Challenge - Some people think that Remote Persistent Memory over

More information

Lecture 21: Reliable, High Performance Storage. CSC 469H1F Fall 2006 Angela Demke Brown

Lecture 21: Reliable, High Performance Storage. CSC 469H1F Fall 2006 Angela Demke Brown Lecture 21: Reliable, High Performance Storage CSC 469H1F Fall 2006 Angela Demke Brown 1 Review We ve looked at fault tolerance via server replication Continue operating with up to f failures Recovery

More information

VMware vsphere: Taking Virtualization to the Next Level

VMware vsphere: Taking Virtualization to the Next Level About this research note: Product Evaluation notes provide an analysis of the market position of a specific product and its vendor through an in-depth exploration of their relative capabilities. VMware

More information

Stellar performance for a virtualized world

Stellar performance for a virtualized world IBM Systems and Technology IBM System Storage Stellar performance for a virtualized world IBM storage systems leverage VMware technology 2 Stellar performance for a virtualized world Highlights Leverages

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

B.H.GARDI COLLEGE OF ENGINEERING & TECHNOLOGY (MCA Dept.) Parallel Database Database Management System - 2

B.H.GARDI COLLEGE OF ENGINEERING & TECHNOLOGY (MCA Dept.) Parallel Database Database Management System - 2 Introduction :- Today single CPU based architecture is not capable enough for the modern database that are required to handle more demanding and complex requirements of the users, for example, high performance,

More information

Checkpointing with DMTCP and MVAPICH2 for Supercomputing. Kapil Arya. Mesosphere, Inc. & Northeastern University

Checkpointing with DMTCP and MVAPICH2 for Supercomputing. Kapil Arya. Mesosphere, Inc. & Northeastern University MVAPICH Users Group 2016 Kapil Arya Checkpointing with DMTCP and MVAPICH2 for Supercomputing Kapil Arya Mesosphere, Inc. & Northeastern University DMTCP Developer Apache Mesos Committer kapil@mesosphere.io

More information

Experiences with the Parallel Virtual File System (PVFS) in Linux Clusters

Experiences with the Parallel Virtual File System (PVFS) in Linux Clusters Experiences with the Parallel Virtual File System (PVFS) in Linux Clusters Kent Milfeld, Avijit Purkayastha, Chona Guiang Texas Advanced Computing Center The University of Texas Austin, Texas USA Abstract

More information

White paper ETERNUS Extreme Cache Performance and Use

White paper ETERNUS Extreme Cache Performance and Use White paper ETERNUS Extreme Cache Performance and Use The Extreme Cache feature provides the ETERNUS DX500 S3 and DX600 S3 Storage Arrays with an effective flash based performance accelerator for regions

More information

Validating Hyperconsolidation Savings With VMAX 3

Validating Hyperconsolidation Savings With VMAX 3 Validating Hyperconsolidation Savings With VMAX 3 By Ashish Nadkarni, IDC Storage Team An IDC Infobrief, sponsored by EMC January 2015 Validating Hyperconsolidation Savings With VMAX 3 Executive Summary:

More information

Accelerating Implicit LS-DYNA with GPU

Accelerating Implicit LS-DYNA with GPU Accelerating Implicit LS-DYNA with GPU Yih-Yih Lin Hewlett-Packard Company Abstract A major hindrance to the widespread use of Implicit LS-DYNA is its high compute cost. This paper will show modern GPU,

More information

Introduction to TCP/IP Offload Engine (TOE)

Introduction to TCP/IP Offload Engine (TOE) Introduction to TCP/IP Offload Engine (TOE) Version 1.0, April 2002 Authored By: Eric Yeh, Hewlett Packard Herman Chao, QLogic Corp. Venu Mannem, Adaptec, Inc. Joe Gervais, Alacritech Bradley Booth, Intel

More information

How TokuDB Fractal TreeTM. Indexes Work. Bradley C. Kuszmaul. Guest Lecture in MIT Performance Engineering, 18 November 2010.

How TokuDB Fractal TreeTM. Indexes Work. Bradley C. Kuszmaul. Guest Lecture in MIT Performance Engineering, 18 November 2010. 6.172 How Fractal Trees Work 1 How TokuDB Fractal TreeTM Indexes Work Bradley C. Kuszmaul Guest Lecture in MIT 6.172 Performance Engineering, 18 November 2010. 6.172 How Fractal Trees Work 2 I m an MIT

More information

Memory Hierarchy Y. K. Malaiya

Memory Hierarchy Y. K. Malaiya Memory Hierarchy Y. K. Malaiya Acknowledgements Computer Architecture, Quantitative Approach - Hennessy, Patterson Vishwani D. Agrawal Review: Major Components of a Computer Processor Control Datapath

More information

Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems.

Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems. Cluster Networks Introduction Communication has significant impact on application performance. Interconnection networks therefore have a vital role in cluster systems. As usual, the driver is performance

More information

Evaluation Report: Improving SQL Server Database Performance with Dot Hill AssuredSAN 4824 Flash Upgrades

Evaluation Report: Improving SQL Server Database Performance with Dot Hill AssuredSAN 4824 Flash Upgrades Evaluation Report: Improving SQL Server Database Performance with Dot Hill AssuredSAN 4824 Flash Upgrades Evaluation report prepared under contract with Dot Hill August 2015 Executive Summary Solid state

More information

Introduction to High Performance Parallel I/O

Introduction to High Performance Parallel I/O Introduction to High Performance Parallel I/O Richard Gerber Deputy Group Lead NERSC User Services August 30, 2013-1- Some slides from Katie Antypas I/O Needs Getting Bigger All the Time I/O needs growing

More information

MPI Optimizations via MXM and FCA for Maximum Performance on LS-DYNA

MPI Optimizations via MXM and FCA for Maximum Performance on LS-DYNA MPI Optimizations via MXM and FCA for Maximum Performance on LS-DYNA Gilad Shainer 1, Tong Liu 1, Pak Lui 1, Todd Wilde 1 1 Mellanox Technologies Abstract From concept to engineering, and from design to

More information

Hyper-converged Secondary Storage for Backup with Deduplication Q & A. The impact of data deduplication on the backup process

Hyper-converged Secondary Storage for Backup with Deduplication Q & A. The impact of data deduplication on the backup process Hyper-converged Secondary Storage for Backup with Deduplication Q & A The impact of data deduplication on the backup process Table of Contents Introduction... 3 What is data deduplication?... 3 Is all

More information

LECTURE 1. Introduction

LECTURE 1. Introduction LECTURE 1 Introduction CLASSES OF COMPUTERS When we think of a computer, most of us might first think of our laptop or maybe one of the desktop machines frequently used in the Majors Lab. Computers, however,

More information

Memory Systems IRAM. Principle of IRAM

Memory Systems IRAM. Principle of IRAM Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several

More information

Fundamentals of Quantitative Design and Analysis

Fundamentals of Quantitative Design and Analysis Fundamentals of Quantitative Design and Analysis Dr. Jiang Li Adapted from the slides provided by the authors Computer Technology Performance improvements: Improvements in semiconductor technology Feature

More information

Lessons from Post-processing Climate Data on Modern Flash-based HPC Systems

Lessons from Post-processing Climate Data on Modern Flash-based HPC Systems Lessons from Post-processing Climate Data on Modern Flash-based HPC Systems Adnan Haider 1, Sheri Mickelson 2, John Dennis 2 1 Illinois Institute of Technology, USA; 2 National Center of Atmospheric Research,

More information

Cray XC Scalability and the Aries Network Tony Ford

Cray XC Scalability and the Aries Network Tony Ford Cray XC Scalability and the Aries Network Tony Ford June 29, 2017 Exascale Scalability Which scalability metrics are important for Exascale? Performance (obviously!) What are the contributing factors?

More information

Engineers can be significantly more productive when ANSYS Mechanical runs on CPUs with a high core count. Executive Summary

Engineers can be significantly more productive when ANSYS Mechanical runs on CPUs with a high core count. Executive Summary white paper Computer-Aided Engineering ANSYS Mechanical on Intel Xeon Processors Engineer Productivity Boosted by Higher-Core CPUs Engineers can be significantly more productive when ANSYS Mechanical runs

More information

CSEE W4824 Computer Architecture Fall 2012

CSEE W4824 Computer Architecture Fall 2012 CSEE W4824 Computer Architecture Fall 2012 Lecture 8 Memory Hierarchy Design: Memory Technologies and the Basics of Caches Luca Carloni Department of Computer Science Columbia University in the City of

More information

Introduction CPS343. Spring Parallel and High Performance Computing. CPS343 (Parallel and HPC) Introduction Spring / 29

Introduction CPS343. Spring Parallel and High Performance Computing. CPS343 (Parallel and HPC) Introduction Spring / 29 Introduction CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Introduction Spring 2018 1 / 29 Outline 1 Preface Course Details Course Requirements 2 Background Definitions

More information

Technical Brief. AGP 8X Evolving the Graphics Interface

Technical Brief. AGP 8X Evolving the Graphics Interface Technical Brief AGP 8X Evolving the Graphics Interface Increasing Graphics Bandwidth No one needs to be convinced that the overall PC experience is increasingly dependent on the efficient processing of

More information

Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity. Donghyuk Lee Carnegie Mellon University

Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity. Donghyuk Lee Carnegie Mellon University Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity Donghyuk Lee Carnegie Mellon University Problem: High DRAM Latency processor stalls: waiting for data main memory high latency Major bottleneck

More information

ICON for HD(CP) 2. High Definition Clouds and Precipitation for Advancing Climate Prediction

ICON for HD(CP) 2. High Definition Clouds and Precipitation for Advancing Climate Prediction ICON for HD(CP) 2 High Definition Clouds and Precipitation for Advancing Climate Prediction High Definition Clouds and Precipitation for Advancing Climate Prediction ICON 2 years ago Parameterize shallow

More information