Implementing an efficient method of check-pointing on CPU-GPU


Harsha Sutaone, Sharath Prasad and Sumanth Suraneni

Abstract - In this paper, we describe the design, implementation, verification and analysis of fine-grained architectural support for efficient checkpointing and restart on a CPU-GPU heterogeneous system. We use Multi2Sim, a simulator capable of emulating a CPU-GPU system: a 32-bit x86 CPU that launches OpenCL kernels on a GPU model of the Advanced Micro Devices (AMD) Southern Islands architecture. We choose this configuration because Southern Islands is one of the few commercial GPU architectures with a publicly documented ISA, which helps demonstrate that the architectural changes proposed in this paper are feasible with low complexity on real GPU architectures. The AMDAPP benchmark suite of OpenCL kernels is used for verification and analysis. Our implementation leverages the underlying micro-architecture and the execution model to save only the required state, at a much finer granularity, hence reducing the overhead of checkpoint and restart. The design is verified for correctness by comparing the traces generated by checkpoint and restart with golden execution traces for each of the AMDAPP workloads. We then estimate the size of the files generated during checkpoint and restart and compare them with the size of the complete kernel state of the GPU at any given instant. Our design significantly reduces the memory overhead. Although this paper does not evaluate timing overhead, our design makes no drastic changes to the execution model, so we expect the timing overhead to be low.

Keywords: graphics processing units, OpenCL, checkpoint, restart, Southern Islands, Multi2Sim

I. INTRODUCTION

The graphics processing unit (GPU) has become an integral part of today's mainstream computing systems. The modern GPU is not only a powerful graphics engine but also a highly parallel programmable processor, while its CPU counterpart is efficient at running sequential code. Modern systems often combine the best of both worlds: off-loading highly parallel workloads from a CPU onto a GPU forms a CPU-GPU heterogeneous system. Many researchers have reported that various scientific and engineering applications can be significantly accelerated using GPUs. However, GPU computing has only recently emerged, and several functions commonly used in HPC are not yet supported. Dependability will be a major concern of future GPU computing [5], because a GPU computing application is often executed on a commodity PC that is not designed for GPU computing at all and that sometimes becomes unstable, for example due to insufficient cooling capability. GPU fault tolerance is a nascent field, and little R&D work has been done to address resilience issues in GPU environments. With the increasing popularity of GPUs, it is anticipated that reliability will be critical for their future success [5].

Checkpointing is the process of periodically writing the state information of a running application to physical storage. With this feature, an application can restart from the last checkpointed state instead of from the beginning, which would be computationally expensive in an HPC environment. In general, checkpointing tools can be classified into two classes:

A. Kernel-level

1) Such tools are built into the kernel of the operating system.
During a checkpoint, the entire process space (which tends to be huge) is written to physical storage.

2) The user does not need to recompile or re-link their applications.

3) Checkpointing and restarting of an application is usually done through OS commands.

4) A checkpointed application is usually unable to be restarted on a different host.

B. User-level

1) These tools are built into the application, which periodically writes its state information to physical storage.

2) Checkpointing of such applications is usually done by sending a specific signal to the application. Restarting is usually done by calling the application with additional parameters pointing to the location of the restart files (see the sketch below).

Checkpoint/restart is a fault tolerance mechanism that has been used on many system platforms. Instead of restarting the computation from the beginning when a failure occurs, a process can be restarted from the last checkpoint and can be migrated to a healthier system. Our survey of recent checkpoint/restart techniques on GPUs reveals that the majority of the techniques focus on application-level or user-level checkpointing.
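As a concrete illustration of the user-level class above, the following minimal sketch checkpoints a toy loop when it receives SIGUSR1 and restarts from a state file passed as an extra argument. All names and the state layout are our own hypothetical example, not taken from any of the surveyed tools.

```c
/* Minimal user-level checkpointing sketch (hypothetical names):
 * a signal handler marks a checkpoint request, and the main loop
 * writes its own state to disk at a safe point. */
#include <signal.h>
#include <stdio.h>

static volatile sig_atomic_t ckpt_requested = 0;

static void on_ckpt_signal(int sig) { (void)sig; ckpt_requested = 1; }

int main(int argc, char **argv) {
    long i, start = 0, result = 0;

    /* Restart path: caller passes the restart file as an extra argument. */
    if (argc > 1) {
        FILE *f = fopen(argv[1], "rb");
        if (f) { fread(&start, sizeof start, 1, f);
                 fread(&result, sizeof result, 1, f); fclose(f); }
    }
    signal(SIGUSR1, on_ckpt_signal);   /* checkpoint on SIGUSR1 */

    for (i = start; i < 1000000000L; i++) {
        result += i;                   /* the "application work" */
        if (ckpt_requested) {          /* safe point: write state */
            FILE *f = fopen("ckpt.bin", "wb");
            if (f) { long next = i + 1;
                     fwrite(&next, sizeof next, 1, f);
                     fwrite(&result, sizeof result, 1, f); fclose(f); }
            ckpt_requested = 0;
        }
    }
    printf("result=%ld\n", result);
    return 0;
}
```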

User-level support of this kind is provided either by manually inserting the checkpointing logic inside the application source code or by implicitly calling user-level checkpointing library routines [11][4][5][7][8].

We design and implement an efficient solution to checkpoint and restart. With our proposed solution we demonstrate the need for architectural support in a GPU for checkpoint/restart mechanisms. For the purpose of experimentation we use the Multi2Sim heterogeneous computing simulator. Multi2Sim is a simulation framework for CPU-GPU heterogeneous computing written in C. It includes models for superscalar, multithreaded and multicore CPUs, as well as GPU architectures. We emulate a 32-bit x86 CPU that launches OpenCL kernels on a GPU model of the AMD Southern Islands architecture. The most recent GPUs from AMD, the Southern Islands family (Radeon HD 7000 series), constitute a dramatic change from the Evergreen and Northern Islands GPUs: all levels of the GPU, from the ISA to the processing elements to the memory system, have been redesigned. Thanks to continued collaboration with AMD, support for the Southern Islands family of GPUs is provided in Multi2Sim version 4.2 [2].

The rest of this paper covers: 1. the background knowledge we have gathered regarding checkpoint/restart, the AMD OpenCL execution model, the Southern Islands GPU architecture and Multi2Sim; 2. the functional simulation and the workloads we have compiled and run to generate execution traces on Multi2Sim; 3. future work and the scope of exploration we hope to achieve.

II. BACKGROUND

A. The OpenCL Programming Model

OpenCL is an industry-standard programming framework designed specifically to target heterogeneous computing platforms [2]. The OpenCL programming model emphasizes parallel processing by using the Single Program Multiple Data (SPMD) paradigm, in which a single piece of code, called a kernel, maps to multiple subsets of input data, creating massively parallel threads. An OpenCL application is formed of a host program and one or more device kernels that run on the GPU.

1) The Host Program

The host program is the starting point of an OpenCL application. It executes on the CPU where the operating system is running; in our setup, this is the emulated 32-bit x86 CPU. It makes OpenCL API calls to launch the kernel code on the GPU device.

2) The Device Kernel

The device kernel is code written in a different programming language than the host program, namely OpenCL C. It is written by a kernel programmer to implement algorithms with high degrees of data parallelism, so that it runs well on the GPU. Multi2Sim provides a compiler that can compile OpenCL C code, with some effort, into statically compiled kernel binaries that can be launched on the Southern Islands (SI) GPU model.

Figure 1: The software entities defined in the OpenCL programming model. An ND-Range is formed of work-groups which are, in turn, sets of work-items executing the same OpenCL C kernel code. An instance of the OpenCL kernel is called a work-item, which can access its own pool of private memory.
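To make the host/device split concrete, the following is a minimal OpenCL C kernel of the kind described above. The vec_add kernel and its arguments are our own illustration, not a kernel taken from the AMDAPP suite.

```c
/* Illustrative OpenCL C device kernel (our own example): each
 * work-item processes one element of the input, so the same code
 * runs as thousands of parallel work-items. */
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c,
                      const unsigned int n)
{
    size_t gid = get_global_id(0);   /* this work-item's position in the ND-Range */
    if (gid < n)
        c[gid] = a[gid] + b[gid];
}
```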
B. The Multi2Sim Emulation of OpenCL Execution

Figure 2: The interaction of user code, OS code and hardware, in both the native and the simulated environment.

The interaction between the CPU and GPU on the simulator is very similar to that of programs and kernels run on a native machine. The host program makes OpenCL API calls to vendor-specific runtime libraries (in our case, the AMD Southern Islands runtime). These are serviced by GPU device drivers, which make ABI calls to the device. The simulator source code implements a list of OpenCL API calls for the purpose of modeling the CPU-GPU design; the rest of the source code makes driver calls to the emulator, corresponding to the ABI calls on native machines. Finally, Multi2Sim emulates each GPU model (for our purposes, the AMD SI ISA), which services the ABI calls from the driver.
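The sketch below shows the host-side API call sequence that this runtime layer intercepts, launching the vec_add kernel from the previous sketch. It is a minimal illustration with all error checking omitted; we assume n is a multiple of the work-group size, and only standard OpenCL 1.x API calls are used.

```c
/* Sketch of the host-side call sequence described above. On Multi2Sim
 * these same API calls are serviced by the simulator's runtime instead
 * of a vendor driver. Error checking omitted for brevity. */
#include <CL/cl.h>

void launch(const char *src, size_t n, cl_mem a, cl_mem b, cl_mem c) {
    cl_platform_id plat;  cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "vec_add", NULL);

    clSetKernelArg(k, 0, sizeof(cl_mem), &a);
    clSetKernelArg(k, 1, sizeof(cl_mem), &b);
    clSetKernelArg(k, 2, sizeof(cl_mem), &c);
    unsigned int un = (unsigned int)n;
    clSetKernelArg(k, 3, sizeof(un), &un);

    /* local size 64 = one wavefront per work-group; n assumed a multiple of 64 */
    size_t global = n, local = 64;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, &local, 0, NULL, NULL);
    clFinish(q);
}
```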

C. The Southern Islands Architecture

A kernel running on an SI GPU has access to multiple levels of storage and to both scalar and vector arithmetic-logic units.

1) Some Terminology of the AMD SI ISA

a) Work-group: a collection of work-items working together, capable of sharing data and synchronizing with each other.

b) Wavefront: a collection of 64 work-items grouped for efficient processing on the compute unit. Each wavefront shares a single program counter.

c) Work-item: the basic unit of computation. Work-items are arranged into work-groups with two basic properties: i) work-items contained in the same work-group can perform efficient synchronization operations, and ii) work-items within the same work-group can share data through a low-latency local memory pool. The totality of work-groups forms the ND-Range (a grid of work-item groups), which shares a common global memory space.

2) Running an OpenCL Kernel on a Southern Islands GPU

Code execution on the GPU starts when the host program launches an OpenCL kernel. An instance of a kernel is called an ND-Range, and is formed of work-groups, which in turn are comprised of work-items (see Figure 1). When the ND-Range is launched by the OpenCL driver, the programming and execution models are mapped onto the Southern Islands GPU.

3) The Southern Islands Instruction Set Architecture (ISA)

a) Vector and Scalar Instructions

Most arithmetic operations on a GPU are performed by vector instructions. A vector instruction is fetched once for an entire wavefront and executed in SIMD fashion across its 64 work-items. Data that are private to each work-item are stored in vector registers. Scalar instructions are likewise fetched once for the entire wavefront, but are executed only once on its behalf. Data that are shared by all work-items in the wavefront are stored in scalar registers.

b) Southern Islands Assembly

The basic format and characteristics of the AMD Southern Islands instruction set are as follows. Scalar instructions use the prefix s_, and vector instructions use v_. All registers are 32 bits; instructions performing 64-bit computations use two consecutive registers to store 64-bit values. The execution of some instructions implicitly modifies the value of certain scalar registers. Special registers may also be handled directly, in the same way as general-purpose registers. For example, the VCC register is a 64-bit mask representing the result of a vector comparison.
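As an illustration of how a functional emulator can model this, the sketch below packs a per-lane floating-point comparison (in the spirit of an SI v_cmp instruction) into a 64-bit VCC value. The names and structure are our own, not Multi2Sim's.

```c
/* Modeling a vector compare: the comparison runs per lane across the
 * 64 work-items of a wavefront, and the result is packed into the
 * 64-bit scalar VCC mask. */
#include <stdint.h>

#define WAVEFRONT_SIZE 64

uint64_t emulate_v_cmp_lt(const float src0[WAVEFRONT_SIZE],
                          const float src1[WAVEFRONT_SIZE],
                          uint64_t exec_mask)
{
    uint64_t vcc = 0;
    for (int lane = 0; lane < WAVEFRONT_SIZE; lane++) {
        if (!((exec_mask >> lane) & 1))
            continue;                      /* inactive lane: leave its bit 0 */
        if (src0[lane] < src1[lane])
            vcc |= (uint64_t)1 << lane;    /* one result bit per work-item */
    }
    return vcc;
}
```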
4) Relevant Modules of the SI GPU

a) The Compute Unit

Figure 3: Simplified diagram of the SI GPU architecture.

An ultra-threaded dispatcher acts as a work-group scheduler. It keeps consuming pending work-groups from the running ND-Range and assigns them to the compute units as they become available. The global memory scoped to the whole ND-Range corresponds to a physical global memory hierarchy on the GPU, formed of caches and main memory. Since all work-items forming a work-group run the same code, the compute unit combines sets of 64 work-items within a work-group to run in SIMD (single-instruction-multiple-data) fashion; these sets of 64 work-items are the wavefronts introduced above. A work-item's private memory is physically mapped to a portion of the register file. When a work-item uses more private memory than the register file allows, registers are spilled to privately allocated regions of global memory.

Figure 4: Compute unit.

The GPU's ultra-threaded dispatcher assigns work-groups to compute units as they are or become available. At a given time during execution, one compute unit can have zero, one, or more work-groups allocated to it. These work-groups are split into wavefronts, for which the compute unit executes one instruction at a time. Each compute unit in the GPU is a replica of an identical design. The LDS unit interacts with local memory to service its instructions, while the scalar and vector memory units can access global memory, which is shared by all compute units.
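The per-wavefront state implied by this organization can be summarized in a structure like the following. This is our own sketch of the state that section IV later identifies as the unit of checkpointing, not Multi2Sim's actual data structures.

```c
/* Our sketch of per-wavefront state: a shared PC and EXEC mask, the
 * wavefront's SGPRs, and one VGPR file per work-item. */
#include <stdint.h>

#define WAVEFRONT_SIZE 64
#define MAX_SGPRS      104   /* per-wavefront limit quoted in this section */
#define MAX_VGPRS      256   /* per-work-item limit quoted in this section */

struct work_item_state {
    uint32_t vgpr[MAX_VGPRS];             /* private vector registers */
};

struct wavefront_state {
    uint64_t pc;                          /* single PC shared by 64 work-items */
    uint64_t exec;                        /* active-lane mask */
    uint32_t sgpr[MAX_SGPRS];             /* shared scalar registers */
    struct work_item_state item[WAVEFRONT_SIZE];
    int      completed;                   /* finished wavefronts need no registers */
};
```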

b) Memory Hierarchy

Figure 5: Block diagram of the memory pipeline.

- Local memory: the Local Data Share (LDS) unit is responsible for handling all local memory instructions.
- Vector memory: the vector memory unit is responsible for handling all vector global memory operations.

The state of an ND-Range in execution is represented by the private, local, and global memory images, mapped to the register files, the local physical memories, and the global memory hierarchy, respectively. Register files are modeled in Multi2Sim without contention, and accesses to them complete with a fixed latency.

c) Work-item Private Memory

- VGPRs: every work-item has access to some number of vector general-purpose registers (VGPRs), up to a maximum of 256. VGPRs are 32 bits wide and are used by the vector ALU and vector memory systems. Double-precision operations use two adjacent VGPRs to form a 64-bit value.
- SGPRs: every wavefront is allocated up to a maximum of 104 scalar general-purpose registers (SGPRs). These SGPRs are 32 bits wide and are wavefront-private, since they are common to all work-items in a wavefront.
- Private memory: work-items can allocate private memory space to allow spilling VGPRs to memory. This memory is accessed through vector memory instructions.

d) Work-group Private Memory

Each compute unit has 64 KB of Local Data Share (LDS) memory that enables low-latency communication between work-items within a work-group, including between work-items in a wavefront. Each work-group can allocate up to 32 KB of this space and can read and write any portion of the LDS space allocated to it. The LDS also includes 32 integer atomic units to enable fast, unordered atomic operations. This memory can be used as a software cache for predictable data reuse, as a data exchange mechanism for the work-items of a work-group, or as a cooperative way to enable more efficient access to off-chip memory.

e) Global Memory

Figure 6: Block diagram of the memory pipeline.

Work-items have access to two distinct types of global memory (memory visible to all work-groups): the global data share and device memory. Multi2Sim does not implement a Global Data Share (GDS) unit, so we choose to ignore it; on a typical GPU with a GDS, it is simply additional state to maintain, just like the other memory units. On Multi2Sim, device memory is implemented as a single block of video memory. The simulator does not provide the complete hierarchy, and we choose to ignore this as well; it is not representative of a typical GPU, whose device memory comprises a complete hierarchy of caches and main memory. The discussion of our solution in the following sections therefore assumes the dump of a single memory block.
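Using only the limits quoted above, a short back-of-the-envelope computation bounds the register state attached to a single wavefront. This is illustrative arithmetic based on those architectural limits, not a measurement from our evaluation.

```c
/* Worst-case register footprint of one wavefront, from the limits above. */
#include <stdio.h>

int main(void) {
    const unsigned lanes = 64, vgprs = 256, sgprs = 104, reg_bytes = 4;
    unsigned vgpr_bytes = lanes * vgprs * reg_bytes;   /* 65,536 B = 64 KB */
    unsigned sgpr_bytes = sgprs * reg_bytes;           /* 416 B */
    printf("worst-case wavefront registers: %u bytes\n",
           vgpr_bytes + sgpr_bytes);
    /* Add up to 32 KB of LDS per work-group; a checkpoint that saves only
     * running wavefronts therefore scales with occupancy, not kernel size. */
    return 0;
}
```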
III. RELEVANT WORK

We surveyed lightweight checkpointing mechanisms on CPU-GPU systems, along with the motivation and use cases of checkpointing on such systems [11][4][5][7][8][12][13][14]. All the solutions researched so far that checkpoint on a GPU provide user-level support for saving the state of a running process on the GPU. Checkpointing GPU systems presents challenges compared to other conventional multi-processor systems; these differences emerge from the natural differences between the hardware design and programming languages of GPUs and those of unicore processors and multiprocessors. NVIDIA's CUDA has introduced a latency-hiding technique called streams that can overlap memory copies and kernel execution [12]. The performance models in [13] show that the streaming technique can reduce kernel execution times even though the applications may require more data transfer between device and host. Moreover, they suggest that streaming is more suitable for applications whose data are independent, so that both host-to-device and device-to-host memory transfers can be carried out without interrupting the kernel.

User-level checkpointing is supported either by manually inserting the checkpointing logic inside the application source code or by implicitly calling user-level library routines. In the latter case developers can, for example, re-compile their source code against the libckpt library [16] or relink their

object files with the Condor library [17]. User-level checkpointing can also be used in conjunction with source code analysis tools [14][15] to determine the appropriate places to insert the checkpointing code. System-level checkpointing is supported through kernel modules or virtual machines; software such as BLCR [18] and CRAK [19] provides kernel-level checkpoint/restart capabilities without modifying application executables.

IV. IMPLEMENTATION

We propose architectural support for creating a checkpoint and restarting on a GPU. The main idea is to keep track of the fine-grained execution of the kernel on the GPU and, during the creation of a checkpoint, to use the exact micro-architectural state to identify the regions of kernel state that are relevant and require restoration during restart. For finished work-groups we need not store the VGPRs and SGPRs they access; the ND-Range keeps track of completed work-groups, and we only need to record this state. For running work-groups, we must store the VGPRs and SGPRs accessed by each of their wavefronts. We therefore need hardware support to read the addresses accessed by the wavefronts: up to 256 VGPRs per work-item, and up to 104 SGPRs and the PC for each wavefront. Also, since each work-group addresses a private portion of the LDS unit, we need to snapshot only that region of memory.

In our implementation we have the luxury of a flexible simulator that keeps track of all the required state elements, allowing us to prove our concept. We exploit the OpenCL programming model, which launches the kernel as independent work-groups on the GPU. Each work-group is a collection of work-items that run the same instruction stream, and 64 work-items are grouped into wavefronts. Our advantage lies in the fact that the ultra-threaded dispatcher issues at wavefront granularity and keeps track of execution, dependence and resource information. The dispatcher launches the wavefronts of a particular work-group based on the number of VGPRs and SGPRs required per work-group and the resources available on each compute unit. Therefore, at any time only a subset of the complete kernel state is relevant for checkpointing, and this subset can be extracted by examining the states of the dispatcher, the wavefronts and the work-groups. Our concept is to snapshot only this portion of the kernel state, rather than the complete kernel state of the GPU. Let us examine in detail the exact mechanism of checkpoint and restart.

A. Checkpoint Mechanism

During the execution of an OpenCL kernel, we keep track of the state of each ND-Range, work-group, wavefront and work-item. The ultra-threaded dispatcher modifies the state of each according to the execution of the wavefronts on the GPU: as wavefronts complete, they update the kernel state; once all the wavefronts in a work-group have completed, the work-group is finished; and once all work-groups have completed execution, the kernel is done. At the time of a checkpoint there are completed work-groups, running work-groups and waiting work-groups, and we are concerned only with the kernel state modified by the running work-groups.

Figure 7: Trace at the checkpoint location.

Figure 7 indicates the checkpoint location. It can be observed that the checkpoint happens at the current PC value shown in the trace, and that on restart this same PC value is reloaded. Next, we take a global memory snapshot along with the other execution state elements.
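The following sketch condenses this mechanism into code, reusing the wavefront_state structure from the sketch in section II. The types and function names are hypothetical; the paper's actual implementation lives inside Multi2Sim's emulator.

```c
/* Checkpoint step (our names): only wavefronts of running work-groups
 * are saved, plus each running work-group's private LDS slice.
 * Compile together with the wavefront_state sketch from section II. */
#include <stdint.h>
#include <stdio.h>

struct work_group_state {
    int running;                       /* completed/waiting groups are skipped */
    int num_wavefronts;
    struct wavefront_state *wf;
    uint8_t *lds_base;                 /* this group's slice of the LDS */
    size_t lds_size;
};

void checkpoint_ndrange(FILE *out,
                        struct work_group_state *wg, int num_groups,
                        const uint8_t *global_mem, size_t global_size)
{
    for (int g = 0; g < num_groups; g++) {
        if (!wg[g].running)
            continue;                  /* finished groups' results are already
                                          reflected in global memory */
        for (int w = 0; w < wg[g].num_wavefronts; w++)
            fwrite(&wg[g].wf[w], sizeof wg[g].wf[w], 1, out);  /* PC, EXEC,
                                          SGPRs and per-lane VGPRs */
        fwrite(wg[g].lds_base, 1, wg[g].lds_size, out);
    }
    /* Finally, the global memory image (a single block on Multi2Sim). */
    fwrite(global_mem, 1, global_size, out);
}
```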
Together, the saved registers, LDS contents and global memory image provide sufficient information to restart the program from the checkpointed location.

B. Restart Mechanism

During a restart, we stop the GPU execution and restore all the necessary state elements from the nearest checkpointed location in the program. All the state variables stored during the previous checkpoint are loaded back into their appropriate locations. Next, we restore the LDS and global memory. Completed work-groups need not be re-executed; this is taken care of by the previously set state variables. For the work-groups that require re-execution, we restore the register state by writing back their respective VGPR and SGPR contents, iterating through the wavefronts in each of these work-groups.
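A matching restart routine, again as a hypothetical sketch built on the structures above, reads the image back in the same order it was written:

```c
/* Restart step (our names): restore PC, EXEC, registers and LDS slices
 * for running work-groups, then the global memory image. */
void restart_ndrange(FILE *in,
                     struct work_group_state *wg, int num_groups,
                     uint8_t *global_mem, size_t global_size)
{
    for (int g = 0; g < num_groups; g++) {
        if (!wg[g].running)
            continue;                  /* completed groups are not re-executed */
        for (int w = 0; w < wg[g].num_wavefronts; w++)
            fread(&wg[g].wf[w], sizeof wg[g].wf[w], 1, in);
        fread(wg[g].lds_base, 1, wg[g].lds_size, in);
    }
    fread(global_mem, 1, global_size, in);
    /* Execution then resumes at each restored wavefront's saved PC. */
}
```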

Figure 8: Trace generated after restart from the checkpoint location.

Figure 8 shows the trace obtained after restarting execution from the checkpoint location. In the figure, observe that the PC starts from the checkpointed value instead of from 0x0.

Figure 9: Difference between the golden trace and the restart trace.

Figure 9 compares the golden trace with the restarted trace. The portion of the trace colored green is the execution trace from before the checkpoint.

C. Verification Strategy

We use the AMDAPP workloads to test our implementation. We first generate the execution trace for each of these benchmarks without checkpointing; this acts as our golden execution trace. A hard-coded checkpointing flag is set at a random point during program execution, once per workload; checkpointing then produces a checkpoint file that saves the current checkpoint state for the benchmark. Next, we run the kernel again; if it detects a previous checkpoint, execution moves to that location of the program. We generate a new execution trace once the program has restarted from the previous checkpoint. This trace should match the golden trace from the checkpointed location onwards. We use the UNIX tool Kompare to compare the execution traces after the program has been run under restart.
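The check Kompare performs interactively can be expressed as a simple rule: the restarted trace must reproduce the golden trace from the checkpoint line onward. The sketch below is our illustration of that rule; the file names and the line-based trace format are assumptions.

```c
/* Verify that the restarted trace matches the golden trace from the
 * checkpoint line onward (our sketch; Kompare does this interactively). */
#include <stdio.h>
#include <string.h>

int traces_match(const char *golden_path, const char *restart_path,
                 long checkpoint_line)
{
    FILE *g = fopen(golden_path, "r"), *r = fopen(restart_path, "r");
    char gl[512], rl[512];
    long line = 0;

    if (!g || !r) return 0;
    /* Skip the pre-checkpoint part of the golden trace. */
    while (line < checkpoint_line && fgets(gl, sizeof gl, g))
        line++;
    /* Every remaining golden line must be reproduced after restart. */
    while (fgets(gl, sizeof gl, g)) {
        if (!fgets(rl, sizeof rl, r) || strcmp(gl, rl) != 0) {
            fclose(g); fclose(r);
            return 0;
        }
    }
    fclose(g); fclose(r);
    return 1;
}
```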

V. EVALUATION

Our experimental evaluation centers on the checkpoint file generated by each benchmark. We calculate the memory overhead by estimating the file sizes and comparing them with the total kernel state. Figure 10 indicates the number of work-groups in each of the benchmarks, and Figure 11 gives the number of instructions in each of the AMDAPP benchmarks.

Figure 10: Number of work-groups in each benchmark.

Figure 11: Number of instructions in each benchmark.

The checkpoint file generated for each benchmark is in the range of kilobytes (KB). Figure 12 compares the size of the checkpoint file for each of the benchmarks, and Figure 13 gives the sizes of the global memory stored while checkpointing.

Figure 12: Checkpoint size for each benchmark.

Figure 13: Global memory size at the checkpoint location.

During checkpointing we store only the LDS regions of the running work-groups, rather than dumping the LDS modules of all the work-groups in the benchmark. Figure 14 compares the LDS memory stored against the actual LDS memory size, along with the reduction achieved.

Figure 14: LDS comparison and reduction achieved.

VI. SUMMARY

In our checkpoint/restart implementation, we use architectural and execution information to identify the relevant subset of kernel state that must be stored to effectively checkpoint a running process on a GPU. This implementation of checkpoint/restart makes the GPU fault tolerant and much more reliable. We have proved our concept on a flexible simulator and demonstrated the need for such support in GPU architectures to checkpoint with minimal overhead. We also show that the solution is agnostic to the type and size of the workload. Our results indicate significant improvements in memory overhead compared to saving a complete kernel state snapshot.

VII. FUTURE WORK

The solution we provide can be extended to further minimize the LDS and global memory snapshots: if we add additional state to the pages in each of these memories, we can keep track of the pages modified by completed and running work-groups. We can also implement a driver call to invoke checkpointing on the GPU from the host CPU, and we can explore compression algorithms to further reduce the overhead across multiple checkpoints. Finally, scheduling checkpoints at appropriate locations can significantly reduce roll-back time, enabling GPUs to work efficiently in real-time scenarios.

VIII. REFERENCES

[1] S. Laosooksathit et al., "Lightweight checkpoint mechanism and modeling in GPGPU environment," in 4th Workshop on System-level Virtualization for High Performance Computing (HPCVirt 2010), April 2010. france.fr/workshops/hpcvirt2010/hpcvirt2010_1.pdf

[2] R. Ubal et al., "Multi2Sim: a simulation framework for CPU-GPU computing," in Proceedings of PACT 2012, ACM, Minneapolis, 2012.

[3] "Reference: Southern Islands Series Instruction Set Architecture," Advanced Micro Devices, Inc., Sunnyvale, CA. uthern_islands_instruction_set_architecture.pdf

[4] Hong Ong, Natthapol Saragol, Kasidit Chanchio, and Chokchai Leangsuksun, "VCCP: A Transparent, Coordinated Checkpointing System for Virtualization-based Cluster Computing," IEEE Cluster.

[5] Hiroyuki Takizawa, Katsuto Sato, Kazuhiko Komatsu, and Hiroaki Kobayashi, "CheCUDA: A Checkpoint/Restart Tool for CUDA Applications," in PDCAT, 2009.

[6] S. Laosooksathit et al., "Lightweight checkpoint mechanism and modeling in GPGPU environment," in 4th Workshop on System-level Virtualization for High Performance Computing (HPCVirt 2010), April 2010.

[7] H. Takizawa, K. Koyama, K. Sato, K. Komatsu, and H. Kobayashi, "CheCL: Transparent Checkpointing and Process Migration of OpenCL Applications," in 2011 IEEE International Parallel & Distributed Processing Symposium (IPDPS), pp. 864-876, May 2011.

[8] J. Menon, M. de Kruijf, and K. Sankaralingam, "iGPU: Exception Support and Speculative Execution on GPUs," ISCA, 2012.

[9] J. Duell, "The design and implementation of Berkeley Lab's Linux checkpoint/restart," Lawrence Berkeley National Laboratory, TR, 2000.

[10] P. H. Hargrove and J. C. Duell, "Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters," Journal of Physics: Conference Series, vol. 46, pp. 494-499, 2006.

[11] Lizandro Dami, "Parallelization & checkpointing of GPU applications through program transformation," Ames, Iowa, 2012.

[12] NVIDIA, NVIDIA CUDA Programming Guide.

[13] Supada Laosooksathit, Chokchai Leangsuksun, Abdelkader Baggag, and Clayton Chandler, "Stream Experiments: Toward Latency Hiding in GPGPU," in Parallel and Distributed Computing and Networks (PDCN).

[14] K. Chanchio and X. H. Sun, "Data collection and restoration for heterogeneous process migration," Software: Practice and Experience, 32:1-27, April 15, 2002.

[15] A. Ferrari, S. J. Chapin, and A. Grimshaw, "Heterogeneous process state capture and recovery through Process Introspection," Cluster Computing 3, 2 (Apr. 2000).

[16] J. S. Plank, M. Beck, G. Kingsley, and K. Li, "Libckpt: Transparent Checkpointing under Unix," in Proceedings of the 1995 Winter USENIX Technical Conference, 1995.

[17] M. Litzkow and M. Solomon, "The Evolution of Condor Checkpointing."

[18] J. Duell, P. Hargrove, and E. Roman, "The Design and Implementation of Berkeley Lab's Linux Checkpoint/Restart."

[19] H. Zhong and J. Nieh, "CRAK: Linux checkpoint/restart as a kernel module," Technical Report CUCS, Department of Computer Science, Columbia University.
