Implementing an efficient method of check-pointing on CPU-GPU


Harsha Sutaone, Sharath Prasad and Sumanth Suraneni

Abstract - In this paper, we describe the design, implementation, verification and analysis of fine-grained architectural support for efficient checkpointing and restart on a CPU-GPU heterogeneous system. We use Multi2Sim, a simulator capable of emulating a CPU-GPU system: a 32-bit x86 CPU that launches OpenCL kernels on a GPU model of the Advanced Micro Devices (AMD) Southern Islands architecture. We choose this configuration because Southern Islands is one of the few commercial GPU architectures with a publicly documented ISA, which helps demonstrate that the architectural changes proposed in this paper are feasible with low complexity on real GPU architectures. The AMDAPP benchmark suite of OpenCL kernels is used for verification and analysis. Our implementation leverages the underlying micro-architecture and the execution model to save only the required state, at a much finer granularity, hence reducing the overhead of checkpoint and restart. The design is verified for correctness by comparing the traces generated by checkpoint and restart with golden execution traces for each of the AMDAPP workloads. We then estimate the size of the files generated during checkpoint and restart and compare them with the size of the complete kernel state of the GPU at any given instant. Our design significantly reduces the memory overhead. Although this paper does not evaluate timing overhead, our design makes no drastic changes to the execution model, so we expect the timing overhead to be low.

Keywords: graphics processing units, OpenCL, checkpoint, restart, Southern Islands, Multi2Sim

I. INTRODUCTION

The graphics processing unit (GPU) has become an integral part of today's mainstream computing systems. The modern GPU is not only a powerful graphics engine but also a highly parallel programmable processor, while its CPU counterpart is efficient at running sequential code. Modern systems often combine the best of both worlds: off-loading highly parallel workloads from a CPU onto a GPU forms a CPU-GPU heterogeneous system. Many researchers have reported that various scientific and engineering applications can be significantly accelerated using GPUs. However, GPU computing has only recently emerged, and several functions commonly used in HPC are not yet supported. Dependability will be a major concern of future GPU computing [5], because a GPU computing application is often executed on a commodity PC that is not designed for GPU computing at all and that sometimes becomes unstable, for example due to insufficient cooling capability. GPU fault tolerance is a nascent field, and little R&D work has been done to address resilience issues in GPU environments. With the increasing popularity of GPUs, it is anticipated that reliability will be critical for their future success [5].

Checkpointing is the process of periodically writing the state information of a running application to physical storage. With this feature, an application can restart from the last checkpointed state instead of from the beginning, which would be computationally expensive in an HPC environment. In general, checkpointing tools can be classified into two classes:

A. Kernel-level

1) Such tools are built into the kernel of the operating system.
During a checkpoint, the entire process space (which tends to be huge) is written to physical storage.

2) The user does not need to recompile or re-link their applications.

3) Checkpointing and restarting of an application is usually done through OS commands.

4) A checkpointed application is usually unable to be restarted on a different host.

B. User-level

1) These tools are built into the application, which periodically writes its state information to physical storage.

2) Checkpointing of such applications is usually done by sending a specific signal to the application. Restarting is usually done by calling the application with additional parameters pointing to the location of the restart files (see the sketch below).

Checkpoint/restart is a fault tolerance mechanism that has been used on many system platforms. Instead of restarting the computation from the beginning when a failure occurs, a process can be restarted from the last checkpoint and can be migrated to a healthier system. Our survey of recent checkpoint/restart techniques on GPUs reveals that the majority of the techniques focus on application-level or user-level checkpointing.
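As a concrete illustration of the user-level class above, the following minimal sketch checkpoints a toy loop when it receives SIGUSR1 and restarts from a state file passed as an extra argument. All names and the state layout are our own hypothetical example, not taken from any of the surveyed tools.

```c
/* Minimal user-level checkpointing sketch (hypothetical names):
 * a signal handler marks a checkpoint request, and the main loop
 * writes its own state to disk at a safe point. */
#include <signal.h>
#include <stdio.h>

static volatile sig_atomic_t ckpt_requested = 0;

static void on_ckpt_signal(int sig) { (void)sig; ckpt_requested = 1; }

int main(int argc, char **argv) {
    long i, start = 0, result = 0;

    /* Restart path: caller passes the restart file as an extra argument. */
    if (argc > 1) {
        FILE *f = fopen(argv[1], "rb");
        if (f) { fread(&start, sizeof start, 1, f);
                 fread(&result, sizeof result, 1, f); fclose(f); }
    }
    signal(SIGUSR1, on_ckpt_signal);   /* checkpoint on SIGUSR1 */

    for (i = start; i < 1000000000L; i++) {
        result += i;                   /* the "application work" */
        if (ckpt_requested) {          /* safe point: write state */
            FILE *f = fopen("ckpt.bin", "wb");
            if (f) { long next = i + 1;
                     fwrite(&next, sizeof next, 1, f);
                     fwrite(&result, sizeof result, 1, f); fclose(f); }
            ckpt_requested = 0;
        }
    }
    printf("result=%ld\n", result);
    return 0;
}
```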

User-level support of this kind is provided either by manually inserting the checkpointing logic inside the application source code or by implicitly calling user-level checkpointing library routines [11][4][5][7][8].

We design and implement an efficient solution to checkpoint and restart. With our proposed solution we demonstrate the need for architectural support in a GPU for checkpoint/restart mechanisms. For the purpose of experimentation we use the Multi2Sim heterogeneous computing simulator. Multi2Sim is a simulation framework for CPU-GPU heterogeneous computing written in C. It includes models for superscalar, multithreaded and multicore CPUs, as well as GPU architectures. We emulate a 32-bit x86 CPU that launches OpenCL kernels on a GPU model of the AMD Southern Islands architecture. The most recent GPUs from AMD, the Southern Islands family (Radeon HD 7000 series), constitute a dramatic change from the Evergreen and Northern Islands GPUs: all levels of the GPU, from the ISA to the processing elements to the memory system, have been redesigned. Thanks to continued collaboration with AMD, support for the Southern Islands family of GPUs is provided in Multi2Sim version 4.2 [2].

The rest of this paper covers: 1. the background knowledge we have gathered regarding checkpoint/restart, the AMD OpenCL execution model, the Southern Islands GPU architecture and Multi2Sim; 2. the functional simulation and the workloads we have compiled and run to generate execution traces on Multi2Sim; 3. future work and the scope of exploration we hope to achieve.

II. BACKGROUND

A. The OpenCL Programming Model

OpenCL is an industry-standard programming framework designed specifically to target heterogeneous computing platforms [2]. The OpenCL programming model emphasizes parallel processing by using the Single Program Multiple Data (SPMD) paradigm, in which a single piece of code, called a kernel, maps to multiple subsets of input data, creating massively parallel threads. An OpenCL application is formed of a host program and one or more device kernels that run on the GPU.

1) The Host Program

The host program is the starting point of an OpenCL application. It executes on the CPU where the operating system is running; in our setup, this is the emulated 32-bit x86 CPU. It makes OpenCL API calls to launch the kernel code on the GPU device.

2) The Device Kernel

The device kernel is code written in a different programming language than the host program, namely OpenCL C. It is written by a kernel programmer to implement algorithms with high degrees of data parallelism, so that it runs well on the GPU. Multi2Sim provides a compiler that can compile OpenCL C code, with some effort, into statically compiled kernel binaries that can be launched on the Southern Islands (SI) GPU model.

Figure 1: The software entities defined in the OpenCL programming model. An ND-Range is formed of work-groups which are, in turn, sets of work-items executing the same OpenCL C kernel code. An instance of the OpenCL kernel is called a work-item, which can access its own pool of private memory.
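To make the host/device split concrete, the following is a minimal OpenCL C kernel of the kind described above. The vec_add kernel and its arguments are our own illustration, not a kernel taken from the AMDAPP suite.

```c
/* Illustrative OpenCL C device kernel (our own example): each
 * work-item processes one element of the input, so the same code
 * runs as thousands of parallel work-items. */
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c,
                      const unsigned int n)
{
    size_t gid = get_global_id(0);   /* this work-item's position in the ND-Range */
    if (gid < n)
        c[gid] = a[gid] + b[gid];
}
```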
B. The Multi2Sim Emulation of OpenCL Execution

Figure 2: The interaction of user code, OS code and hardware, in both the native and the simulated environment.

The interaction between the CPU and GPU on the simulator is very similar to that of programs and kernels run on a native machine. The host program makes OpenCL API calls to vendor-specific runtime libraries (in our case, the AMD Southern Islands runtime). These are serviced by GPU device drivers, which make ABI calls to the device. The simulator source code implements a list of OpenCL API calls for the purpose of modeling the CPU-GPU design; the rest of the source code makes driver calls to the emulator, corresponding to the ABI calls on native machines. Finally, Multi2Sim emulates each GPU model (for our purposes, the AMD SI ISA), which services the ABI calls from the driver.
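The sketch below shows the host-side API call sequence that this runtime layer intercepts, launching the vec_add kernel from the previous sketch. It is a minimal illustration with all error checking omitted; we assume n is a multiple of the work-group size, and only standard OpenCL 1.x API calls are used.

```c
/* Sketch of the host-side call sequence described above. On Multi2Sim
 * these same API calls are serviced by the simulator's runtime instead
 * of a vendor driver. Error checking omitted for brevity. */
#include <CL/cl.h>

void launch(const char *src, size_t n, cl_mem a, cl_mem b, cl_mem c) {
    cl_platform_id plat;  cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "vec_add", NULL);

    clSetKernelArg(k, 0, sizeof(cl_mem), &a);
    clSetKernelArg(k, 1, sizeof(cl_mem), &b);
    clSetKernelArg(k, 2, sizeof(cl_mem), &c);
    unsigned int un = (unsigned int)n;
    clSetKernelArg(k, 3, sizeof(un), &un);

    /* local size 64 = one wavefront per work-group; n assumed a multiple of 64 */
    size_t global = n, local = 64;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, &local, 0, NULL, NULL);
    clFinish(q);
}
```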

C. The Southern Islands Architecture

A kernel running on an SI GPU has access to multiple levels of storage and to both scalar and vector arithmetic-logic units.

1) Some Terminology of the AMD SI ISA

a) Work-group: a collection of work-items working together, capable of sharing data and synchronizing with each other.

b) Wavefront: a collection of 64 work-items grouped for efficient processing on the compute unit. Each wavefront shares a single program counter.

c) Work-item: the basic unit of computation. Work-items are arranged into work-groups with two basic properties: i) work-items contained in the same work-group can perform efficient synchronization operations, and ii) work-items within the same work-group can share data through a low-latency local memory pool. The totality of work-groups forms the ND-Range (a grid of work-item groups), which shares a common global memory space.

2) Running an OpenCL Kernel on a Southern Islands GPU

Code execution on the GPU starts when the host program launches an OpenCL kernel. An instance of a kernel is called an ND-Range, and is formed of work-groups, which in turn are comprised of work-items (see Figure 1). When the ND-Range is launched by the OpenCL driver, the programming and execution models are mapped onto the Southern Islands GPU.

3) The Southern Islands Instruction Set Architecture (ISA)

a) Vector and Scalar Instructions

Most arithmetic operations on a GPU are performed by vector instructions. A vector instruction is fetched once for an entire wavefront and executed in SIMD fashion across its 64 work-items. Data that are private to each work-item are stored in vector registers. Scalar instructions are likewise fetched once for the entire wavefront, but are executed only once on its behalf. Data that are shared by all work-items in the wavefront are stored in scalar registers.

b) Southern Islands Assembly

The basic format and characteristics of the AMD Southern Islands instruction set are as follows. Scalar instructions use the prefix s_, and vector instructions use v_. All registers are 32 bits; instructions performing 64-bit computations use two consecutive registers to store 64-bit values. The execution of some instructions implicitly modifies the value of certain scalar registers. Special registers may also be handled directly, in the same way as general-purpose registers. For example, the VCC register is a 64-bit mask representing the result of a vector comparison.
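As an illustration of how a functional emulator can model this, the sketch below packs a per-lane floating-point comparison (in the spirit of an SI v_cmp instruction) into a 64-bit VCC value. The names and structure are our own, not Multi2Sim's.

```c
/* Modeling a vector compare: the comparison runs per lane across the
 * 64 work-items of a wavefront, and the result is packed into the
 * 64-bit scalar VCC mask. */
#include <stdint.h>

#define WAVEFRONT_SIZE 64

uint64_t emulate_v_cmp_lt(const float src0[WAVEFRONT_SIZE],
                          const float src1[WAVEFRONT_SIZE],
                          uint64_t exec_mask)
{
    uint64_t vcc = 0;
    for (int lane = 0; lane < WAVEFRONT_SIZE; lane++) {
        if (!((exec_mask >> lane) & 1))
            continue;                      /* inactive lane: leave its bit 0 */
        if (src0[lane] < src1[lane])
            vcc |= (uint64_t)1 << lane;    /* one result bit per work-item */
    }
    return vcc;
}
```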
4) Relevant Modules of the SI GPU

a) The Compute Unit

Figure 3: Simplified diagram of the SI GPU architecture.

An ultra-threaded dispatcher acts as a work-group scheduler. It keeps consuming pending work-groups from the running ND-Range and assigns them to the compute units as they become available. The global memory scoped to the whole ND-Range corresponds to a physical global memory hierarchy on the GPU, formed of caches and main memory. Since all work-items forming a work-group run the same code, the compute unit combines sets of 64 work-items within a work-group to run in SIMD (single-instruction-multiple-data) fashion; these sets of 64 work-items are the wavefronts introduced above. A work-item's private memory is physically mapped to a portion of the register file. When a work-item uses more private memory than the register file allows, registers are spilled to privately allocated regions of global memory.

Figure 4: Compute unit.

The GPU's ultra-threaded dispatcher assigns work-groups to compute units as they are or become available. At a given time during execution, one compute unit can have zero, one, or more work-groups allocated to it. These work-groups are split into wavefronts, for which the compute unit executes one instruction at a time. Each compute unit in the GPU is a replica of an identical design. The LDS unit interacts with local memory to service its instructions, while the scalar and vector memory units can access global memory, which is shared by all compute units.
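The per-wavefront state implied by this organization can be summarized in a structure like the following. This is our own sketch of the state that section IV later identifies as the unit of checkpointing, not Multi2Sim's actual data structures.

```c
/* Our sketch of per-wavefront state: a shared PC and EXEC mask, the
 * wavefront's SGPRs, and one VGPR file per work-item. */
#include <stdint.h>

#define WAVEFRONT_SIZE 64
#define MAX_SGPRS      104   /* per-wavefront limit quoted in this section */
#define MAX_VGPRS      256   /* per-work-item limit quoted in this section */

struct work_item_state {
    uint32_t vgpr[MAX_VGPRS];             /* private vector registers */
};

struct wavefront_state {
    uint64_t pc;                          /* single PC shared by 64 work-items */
    uint64_t exec;                        /* active-lane mask */
    uint32_t sgpr[MAX_SGPRS];             /* shared scalar registers */
    struct work_item_state item[WAVEFRONT_SIZE];
    int      completed;                   /* finished wavefronts need no registers */
};
```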

b) Memory Hierarchy

Figure 5: Block diagram of the memory pipeline.

- Local memory: the Local Data Share (LDS) unit is responsible for handling all local memory instructions.
- Vector memory: the vector memory unit is responsible for handling all vector global memory operations.

The state of an ND-Range in execution is represented by the private, local, and global memory images, mapped to the register files, the local physical memories, and the global memory hierarchy, respectively. Register files are modeled in Multi2Sim without contention, and accesses to them complete with a fixed latency.

c) Work-item Private Memory

- VGPRs: every work-item has access to some number of vector general-purpose registers (VGPRs), up to a maximum of 256. VGPRs are 32 bits wide and are used by the vector ALU and vector memory systems. Double-precision operations use two adjacent VGPRs to form a 64-bit value.
- SGPRs: every wavefront is allocated up to a maximum of 104 scalar general-purpose registers (SGPRs). These SGPRs are 32 bits wide and are wavefront-private, since they are common to all work-items in a wavefront.
- Private memory: work-items can allocate private memory space to allow spilling VGPRs to memory. This memory is accessed through vector memory instructions.

d) Work-group Private Memory

Each compute unit has 64 KB of Local Data Share (LDS) memory that enables low-latency communication between work-items within a work-group, including between work-items in a wavefront. Each work-group can allocate up to 32 KB of this space and can read and write any portion of the LDS space allocated to it. The LDS also includes 32 integer atomic units to enable fast, unordered atomic operations. This memory can be used as a software cache for predictable data reuse, as a data exchange mechanism for the work-items of a work-group, or as a cooperative way to enable more efficient access to off-chip memory.

e) Global Memory

Figure 6: Block diagram of the memory pipeline.

Work-items have access to two distinct types of global memory (memory visible to all work-groups): the global data share and device memory. Multi2Sim does not implement a Global Data Share (GDS) unit, so we choose to ignore it; on a typical GPU with a GDS, it is simply additional state to maintain, just like the other memory units. On Multi2Sim, device memory is implemented as a single block of video memory. The simulator does not provide the complete hierarchy, and we choose to ignore this as well; it is not representative of a typical GPU, whose device memory comprises a complete hierarchy of caches and main memory. The discussion of our solution in the following sections therefore assumes the dump of a single memory block.
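Using only the limits quoted above, a short back-of-the-envelope computation bounds the register state attached to a single wavefront. This is illustrative arithmetic based on those architectural limits, not a measurement from our evaluation.

```c
/* Worst-case register footprint of one wavefront, from the limits above. */
#include <stdio.h>

int main(void) {
    const unsigned lanes = 64, vgprs = 256, sgprs = 104, reg_bytes = 4;
    unsigned vgpr_bytes = lanes * vgprs * reg_bytes;   /* 65,536 B = 64 KB */
    unsigned sgpr_bytes = sgprs * reg_bytes;           /* 416 B */
    printf("worst-case wavefront registers: %u bytes\n",
           vgpr_bytes + sgpr_bytes);
    /* Add up to 32 KB of LDS per work-group; a checkpoint that saves only
     * running wavefronts therefore scales with occupancy, not kernel size. */
    return 0;
}
```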
III. RELEVANT WORK

We surveyed lightweight checkpointing mechanisms on CPU-GPU systems, along with the motivation and use cases of checkpointing on such systems [11][4][5][7][8][12][13][14]. All the solutions researched so far that checkpoint on a GPU provide user-level support for saving the state of a running process on the GPU. Checkpointing GPU systems presents challenges compared to other conventional multi-processor systems; these differences emerge from the natural differences between the hardware design and programming languages of GPUs and those of unicore processors and multiprocessors. NVIDIA's CUDA has introduced a latency-hiding technique called streams that can overlap memory copies and kernel execution [12]. The performance models in [13] show that the streaming technique can reduce kernel execution times even though the applications may require more data transfer between device and host. Moreover, they suggest that streaming is more suitable for applications whose data are independent, so that both host-to-device and device-to-host memory transfers can be carried out without interrupting the kernel.

User-level checkpointing is supported either by manually inserting the checkpointing logic inside the application source code or by implicitly calling user-level library routines. In the latter case developers can, for example, re-compile their source code against the libckpt library [16] or relink their

object files with the Condor library [17]. User-level checkpointing can also be used in conjunction with source code analysis tools [14][15] to determine the appropriate places to insert the checkpointing code. System-level checkpointing is supported through kernel modules or virtual machines; software such as BLCR [18] and CRAK [19] provides kernel-level checkpoint/restart capabilities without modifying application executables.

IV. IMPLEMENTATION

We propose architectural support for creating a checkpoint and restarting on a GPU. The main idea is to keep track of the fine-grained execution of the kernel on the GPU and, during the creation of a checkpoint, to use the exact micro-architectural state to identify the regions of kernel state that are relevant and require restoration during restart. For finished work-groups we need not store the VGPRs and SGPRs they access; the ND-Range keeps track of completed work-groups, and we only need to record this state. For running work-groups, we must store the VGPRs and SGPRs accessed by each of their wavefronts. We therefore need hardware support to read the addresses accessed by the wavefronts: up to 256 VGPRs per work-item, and up to 104 SGPRs and the PC for each wavefront. Also, since each work-group addresses a private portion of the LDS unit, we need to snapshot only that region of memory.

In our implementation we have the luxury of a flexible simulator that keeps track of all the required state elements, allowing us to prove our concept. We exploit the OpenCL programming model, which launches the kernel as independent work-groups on the GPU. Each work-group is a collection of work-items that run the same instruction stream, and 64 work-items are grouped into wavefronts. Our advantage lies in the fact that the ultra-threaded dispatcher issues at wavefront granularity and keeps track of execution, dependence and resource information. The dispatcher launches the wavefronts of a particular work-group based on the number of VGPRs and SGPRs required per work-group and the resources available on each compute unit. Therefore, at any time only a subset of the complete kernel state is relevant for checkpointing, and this subset can be extracted by examining the states of the dispatcher, the wavefronts and the work-groups. Our concept is to snapshot only this portion of the kernel state, rather than the complete kernel state of the GPU. Let us examine in detail the exact mechanism of checkpoint and restart.

A. Checkpoint Mechanism

During the execution of an OpenCL kernel, we keep track of the state of each ND-Range, work-group, wavefront and work-item. The ultra-threaded dispatcher modifies the state of each according to the execution of the wavefronts on the GPU: as wavefronts complete, they update the kernel state; once all the wavefronts in a work-group have completed, the work-group is finished; and once all work-groups have completed execution, the kernel is done. At the time of a checkpoint there are completed work-groups, running work-groups and waiting work-groups, and we are concerned only with the kernel state modified by the running work-groups.

Figure 7: Trace at the checkpoint location.

Figure 7 indicates the checkpoint location. It can be observed that the checkpoint happens at the current PC value shown in the trace, and that on restart this same PC value is reloaded. Next, we take a global memory snapshot along with the other execution state elements.
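The following sketch condenses this mechanism into code, reusing the wavefront_state structure from the sketch in section II. The types and function names are hypothetical; the paper's actual implementation lives inside Multi2Sim's emulator.

```c
/* Checkpoint step (our names): only wavefronts of running work-groups
 * are saved, plus each running work-group's private LDS slice.
 * Compile together with the wavefront_state sketch from section II. */
#include <stdint.h>
#include <stdio.h>

struct work_group_state {
    int running;                       /* completed/waiting groups are skipped */
    int num_wavefronts;
    struct wavefront_state *wf;
    uint8_t *lds_base;                 /* this group's slice of the LDS */
    size_t lds_size;
};

void checkpoint_ndrange(FILE *out,
                        struct work_group_state *wg, int num_groups,
                        const uint8_t *global_mem, size_t global_size)
{
    for (int g = 0; g < num_groups; g++) {
        if (!wg[g].running)
            continue;                  /* finished groups' results are already
                                          reflected in global memory */
        for (int w = 0; w < wg[g].num_wavefronts; w++)
            fwrite(&wg[g].wf[w], sizeof wg[g].wf[w], 1, out);  /* PC, EXEC,
                                          SGPRs and per-lane VGPRs */
        fwrite(wg[g].lds_base, 1, wg[g].lds_size, out);
    }
    /* Finally, the global memory image (a single block on Multi2Sim). */
    fwrite(global_mem, 1, global_size, out);
}
```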
Together, the saved registers, LDS contents and global memory image provide sufficient information to restart the program from the checkpointed location.

B. Restart Mechanism

During a restart, we stop the GPU execution and restore all the necessary state elements from the nearest checkpointed location in the program. All the state variables stored during the previous checkpoint are loaded back into their appropriate locations. Next, we restore the LDS and global memory. Completed work-groups need not be re-executed; this is taken care of by the previously set state variables. For the work-groups that require re-execution, we restore the register state by writing back their respective VGPR and SGPR contents, iterating through the wavefronts in each of these work-groups.
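A matching restart routine, again as a hypothetical sketch built on the structures above, reads the image back in the same order it was written:

```c
/* Restart step (our names): restore PC, EXEC, registers and LDS slices
 * for running work-groups, then the global memory image. */
void restart_ndrange(FILE *in,
                     struct work_group_state *wg, int num_groups,
                     uint8_t *global_mem, size_t global_size)
{
    for (int g = 0; g < num_groups; g++) {
        if (!wg[g].running)
            continue;                  /* completed groups are not re-executed */
        for (int w = 0; w < wg[g].num_wavefronts; w++)
            fread(&wg[g].wf[w], sizeof wg[g].wf[w], 1, in);
        fread(wg[g].lds_base, 1, wg[g].lds_size, in);
    }
    fread(global_mem, 1, global_size, in);
    /* Execution then resumes at each restored wavefront's saved PC. */
}
```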

Figure 8: Trace generated after restart from the checkpoint location.

Figure 8 shows the trace obtained after restarting execution from the checkpoint location. In the figure, observe that the PC starts from the checkpointed value instead of from 0x0.

Figure 9: Difference between the golden trace and the restart trace.

Figure 9 compares the golden trace with the restarted trace. The portion of the trace colored green is the execution trace from before the checkpoint.

C. Verification Strategy

We use the AMDAPP workloads to test our implementation. We first generate the execution trace for each of these benchmarks without checkpointing; this acts as our golden execution trace. A hard-coded checkpointing flag is set at a random point during program execution, once per workload; checkpointing then produces a checkpoint file that saves the current checkpoint state for the benchmark. Next, we run the kernel again; if it detects a previous checkpoint, execution moves to that location of the program. We generate a new execution trace once the program has restarted from the previous checkpoint. This trace should match the golden trace from the checkpointed location onwards. We use the UNIX tool Kompare to compare the execution traces after the program has been run under restart.
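The check Kompare performs interactively can be expressed as a simple rule: the restarted trace must reproduce the golden trace from the checkpoint line onward. The sketch below is our illustration of that rule; the file names and the line-based trace format are assumptions.

```c
/* Verify that the restarted trace matches the golden trace from the
 * checkpoint line onward (our sketch; Kompare does this interactively). */
#include <stdio.h>
#include <string.h>

int traces_match(const char *golden_path, const char *restart_path,
                 long checkpoint_line)
{
    FILE *g = fopen(golden_path, "r"), *r = fopen(restart_path, "r");
    char gl[512], rl[512];
    long line = 0;

    if (!g || !r) return 0;
    /* Skip the pre-checkpoint part of the golden trace. */
    while (line < checkpoint_line && fgets(gl, sizeof gl, g))
        line++;
    /* Every remaining golden line must be reproduced after restart. */
    while (fgets(gl, sizeof gl, g)) {
        if (!fgets(rl, sizeof rl, r) || strcmp(gl, rl) != 0) {
            fclose(g); fclose(r);
            return 0;
        }
    }
    fclose(g); fclose(r);
    return 1;
}
```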

V. EVALUATION

Our experimental evaluation centers on the checkpoint file generated by each benchmark. We calculate the memory overhead by estimating the file sizes and comparing them with the total kernel state. Figure 10 indicates the number of work-groups in each of the benchmarks, and Figure 11 gives the number of instructions in each of the AMDAPP benchmarks.

Figure 10: Number of work-groups in each benchmark.

Figure 11: Number of instructions in each benchmark.

The checkpoint file generated for each benchmark is in the range of kilobytes (KB). Figure 12 compares the size of the checkpoint file for each of the benchmarks, and Figure 13 gives the sizes of the global memory stored while checkpointing.

Figure 12: Checkpoint size for each benchmark.

Figure 13: Global memory size at the checkpoint location.

During checkpointing we store only the LDS regions of the running work-groups, rather than dumping the LDS modules of all the work-groups in the benchmark. Figure 14 compares the LDS memory stored against the actual LDS memory size, along with the reduction achieved.

Figure 14: LDS comparison and reduction achieved.

VI. SUMMARY

In our checkpoint/restart implementation, we use architectural and execution information to identify the relevant subset of kernel state that must be stored to effectively checkpoint a running process on a GPU. This implementation of checkpoint/restart makes the GPU fault tolerant and much more reliable. We have proved our concept on a flexible simulator and demonstrated the need for such support in GPU architectures to checkpoint with minimal overhead. We also show that the solution is agnostic to the type and size of the workload. Our results indicate significant improvements in memory overhead compared to saving a complete kernel state snapshot.

VII. FUTURE WORK

The solution we provide can be extended to further minimize the LDS and global memory snapshots: if we add additional state to the pages in each of these memories, we can keep track of the pages modified by completed and running work-groups. We can also implement a driver call to invoke checkpointing on the GPU from the host CPU, and we can explore compression algorithms to further reduce the overhead across multiple checkpoints. Finally, scheduling checkpoints at appropriate locations can significantly reduce roll-back time, enabling GPUs to work efficiently in real-time scenarios.

VIII. REFERENCES

[1] S. Laosooksathit et al., "Lightweight checkpoint mechanism and modeling in GPGPU environment," in 4th Workshop on System-level Virtualization for High Performance Computing (HPCVirt 2010), April 2010. france.fr/workshops/hpcvirt2010/hpcvirt2010_1.pdf

[2] R. Ubal et al., "Multi2Sim: a simulation framework for CPU-GPU computing," in Proceedings of PACT 2012, ACM, Minneapolis, 2012.

[3] "Reference: Southern Islands Series Instruction Set Architecture," Advanced Micro Devices, Inc., Sunnyvale, CA. uthern_islands_instruction_set_architecture.pdf

[4] Hong Ong, Natthapol Saragol, Kasidit Chanchio, and Chokchai Leangsuksun, "VCCP: A Transparent, Coordinated Checkpointing System for Virtualization-based Cluster Computing," IEEE Cluster.

[5] Hiroyuki Takizawa, Katsuto Sato, Kazuhiko Komatsu, and Hiroaki Kobayashi, "CheCUDA: A Checkpoint/Restart Tool for CUDA Applications," in PDCAT, 2009.

[6] S. Laosooksathit et al., "Lightweight checkpoint mechanism and modeling in GPGPU environment," in 4th Workshop on System-level Virtualization for High Performance Computing (HPCVirt 2010), April 2010.

[7] H. Takizawa, K. Koyama, K. Sato, K. Komatsu, and H. Kobayashi, "CheCL: Transparent Checkpointing and Process Migration of OpenCL Applications," in 2011 IEEE International Parallel & Distributed Processing Symposium (IPDPS), pp. 864-876, May 2011.

[8] J. Menon, M. de Kruijf, and K. Sankaralingam, "iGPU: Exception Support and Speculative Execution on GPUs," ISCA, 2012.

[9] J. Duell, "The design and implementation of Berkeley Lab's Linux checkpoint/restart," Lawrence Berkeley National Laboratory, TR, 2000.

[10] P. H. Hargrove and J. C. Duell, "Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters," Journal of Physics: Conference Series, vol. 46, pp. 494-499, 2006.

[11] Lizandro Dami, "Parallelization & checkpointing of GPU applications through program transformation," Ames, Iowa, 2012.

[12] NVIDIA, NVIDIA CUDA Programming Guide.

[13] Supada Laosooksathit, Chokchai Leangsuksun, Abdelkader Baggag, and Clayton Chandler, "Stream Experiments: Toward Latency Hiding in GPGPU," in Parallel and Distributed Computing and Networks (PDCN).

[14] K. Chanchio and X. H. Sun, "Data collection and restoration for heterogeneous process migration," Software: Practice and Experience, 32:1-27, April 15, 2002.

[15] A. Ferrari, S. J. Chapin, and A. Grimshaw, "Heterogeneous process state capture and recovery through Process Introspection," Cluster Computing 3, 2 (Apr. 2000).

[16] J. S. Plank, M. Beck, G. Kingsley, and K. Li, "Libckpt: Transparent Checkpointing under Unix," in Proceedings of the 1995 Winter USENIX Technical Conference, 1995.

[17] M. Litzkow and M. Solomon, "The Evolution of Condor Checkpointing."

[18] J. Duell, P. Hargrove, and E. Roman, "The Design and Implementation of Berkeley Lab's Linux Checkpoint/Restart."

[19] H. Zhong and J. Nieh, "CRAK: Linux checkpoint/restart as a kernel module," Technical Report CUCS, Department of Computer Science, Columbia University.
