Implementing an efficient method of check-pointing on CPU-GPU
Harsha Sutaone, Sharath Prasad and Sumanth Suraneni

Abstract
In this paper, we describe the design, implementation, verification and analysis of fine-grained architectural support for efficient checkpoint and restart on a CPU-GPU heterogeneous system. We use Multi2Sim, a simulator capable of emulating a CPU-GPU system: a 32-bit x86 CPU that launches OpenCL kernels on a GPU model of the AMD Southern Islands architecture. We choose this configuration because Southern Islands is one of the few commercial GPU architectures with a publicly documented ISA, which helps demonstrate that the architectural changes proposed in this paper are feasible with low complexity on real GPU architectures. The AMDAPP benchmark suite with OpenCL kernels is used for verification and analysis. Our implementation leverages the underlying micro-architecture and the execution model to save only the required state, at a much finer granularity, thereby reducing the overhead of checkpoint and restart. The design is verified for correctness by comparing the traces generated by checkpoint and restart with golden execution traces for each of the AMDAPP workloads. We then estimate the size of the files generated during checkpoint and restart and compare them with the size of the complete kernel state of the GPU at any given instant; our design significantly reduces this memory overhead. Although this paper does not discuss timing overhead, our design makes no drastic changes to the execution model, so we expect the timing overhead to be low.

Keywords
Graphics processing units, OpenCL, checkpoint, restart, Southern Islands, Multi2Sim

I. INTRODUCTION
The graphics processing unit (GPU) has become an integral part of today's mainstream computing systems. The modern GPU is not only a powerful graphics engine but also a highly parallel programmable processor.
The CPU counterpart is efficient at running sequential code, and modern systems often combine the best of both worlds: off-loading highly parallel workloads from the CPU onto a GPU forms a CPU-GPU heterogeneous system. Many researchers have reported that various scientific and engineering applications can be significantly accelerated using GPUs. However, GPU computing has only recently emerged, and several functions commonly used in HPC are not yet supported in GPU computing. Dependability will be a major concern of future GPU computing [5], because a GPU computing application is often executed on a commodity PC that is not designed for GPU computing at all and sometimes becomes unstable, for example due to insufficient cooling capability. GPU fault tolerance is a nascent field, and very little R&D work has been done to address resilience issues in GPU environments. With the increasing popularity of GPUs, it is anticipated that reliability will be critical for their future success [5].

Checkpointing is the process of periodically writing out the state information of a running application to physical storage. With this feature, an application can restart from the last checkpointed state instead of from the beginning, which would be computationally expensive in an HPTC environment. In general, checkpointing tools can be classified into two classes:

A. Kernel-level
1) Such tools are built into the kernel of the operating system. During a checkpoint, the entire process space (which tends to be huge) is written to physical storage.
2) The user does not need to recompile/re-link their applications.
3) Checkpointing and restarting of applications is usually done through OS commands.
4) A checkpointed application is usually unable to be restarted on a different host.

B. User-level
1) These tools are built into the application, which periodically writes its status information to physical storage.
2) Checkpointing of such applications is usually done by sending a specific signal to the application. Restarting is usually done by calling the application with additional parameters pointing to the location of the restart files.

Checkpoint/restart is a fault tolerance mechanism that has been used on many system platforms. Instead of restarting the computation from the beginning when a failure occurs, with checkpoint/restart a process can be restarted from the last checkpoint and can be migrated to a healthier system. Our survey of recent checkpoint/restart techniques on GPUs has revealed that the majority of the techniques focus on application-level or user-level checkpointing. This is supported through either manually inserting the checkpointing logic inside the application source code or implicitly calling user-level checkpointing library routines [11][4][5][7][8].

We design and implement an efficient solution to checkpoint and restart. With our proposed solution we demonstrate the need for architectural change in a GPU to support checkpoint/restart mechanisms. For the purpose of experimentation, we use the Multi2Sim heterogeneous computing simulator. Multi2Sim is a simulation framework for CPU-GPU heterogeneous computing written in C. It includes models for superscalar, multithreaded, and multicore CPUs, as well as GPU architectures. We emulate a 32-bit x86 CPU that launches OpenCL kernels on a GPU simulator emulating the AMD Southern Islands architecture. The most recent GPUs from AMD, the Southern Islands family (Radeon HD 7000-series), constitute a dramatic change from the Evergreen and Northern Islands GPUs: all levels of the GPU, from the ISA to the processing elements to the memory system, have been redesigned. Thanks to continued collaboration with AMD, support for the Southern Islands family of GPUs is provided in Multi2Sim version 4.2 [2].

Code execution on the GPU starts when the host program launches an OpenCL kernel. An instance of a kernel is called an ND-Range, and is formed of work-groups, which in turn are comprised of work-items (see below). When the ND-Range is launched by the OpenCL driver, the programming and execution models are mapped onto the Southern Islands GPU.

2) The Device Kernel
The device kernel is code written in a different programming language than the host program, namely OpenCL C. It is written by a kernel programmer to implement algorithms with high degrees of data parallelism so that they run well on the GPU. Multi2Sim provides a compiler that can compile OpenCL code, with some effort, to statically compiled kernel binaries that can be launched on the Southern Islands (SI) GPU model.
Figure 1: The software entities defined in the OpenCL programming model. An ND-Range is formed of work-groups which are, in turn, sets of work-items executing the same OpenCL C kernel code. An instance of the OpenCL kernel is called a work-item, which can access its own pool of private memory.

B. The Multi2Sim Emulation of OpenCL Execution
The rest of this document covers: 1. the background knowledge we have gathered regarding checkpoint/restart, the AMD OpenCL execution model, the Southern Islands GPU architecture, and Multi2Sim; 2. the functional simulation and the workloads we have compiled and run to generate execution traces on Multi2Sim; 3. future work and the scope of exploration we hope to achieve.

II. BACKGROUND
A. The OpenCL Programming Model
OpenCL is an industry-standard programming framework designed specifically to target heterogeneous computing platforms [2]. The OpenCL programming model emphasizes parallel processing using the Single Program Multiple Data (SPMD) paradigm, in which a single piece of code, called a kernel, maps to multiple subsets of input data, creating massively parallel threads. An OpenCL application is formed of a host program and one or more device kernels that run on the GPU.

1) The Host Program
The host program is the starting point for an OpenCL application. It executes on the CPU where the operating system is running; in our setup it runs on the emulated 32-bit x86 CPU. It makes OpenCL API calls to launch the kernel code on the GPU device.

Figure 2: The interaction of user code, OS code and hardware, in both native and simulated environments.

The interaction between the CPU and GPU on the simulator is very similar to programs and kernels run on a native machine. The host program makes OpenCL API calls to vendor-specific runtime libraries (in our case AMD Southern Islands). These are serviced by GPU device drivers, which make ABI calls to the device. The simulator source code has a list of OpenCL API calls implemented for the purpose of modeling the CPU-GPU design; the rest of the source code makes driver calls to the emulator, corresponding to the ABI calls on native machines. Finally, Multi2Sim emulates each GPU model (for our purposes, the AMD SI ISA), which services the ABI calls from the driver.

C. The Southern Islands Architecture
A kernel running on an SI GPU has access to multiple levels of storage and to both scalar and vector arithmetic-logic units.

1) Some Terminology of the AMD SI ISA
a) Work-group: A collection of work-items working together, capable of sharing data and synchronizing with each other.
b) Wavefront: A collection of 64 work-items grouped for efficient processing on the compute unit. Each wavefront shares a single program counter.
c) Work-item: The basic unit of computation. Work-items are arranged into work-groups with two basic properties: i) work-items contained in the same work-group can perform efficient synchronization operations, and ii) work-items within the same work-group can share data through a low-latency local memory pool. The totality of work-groups forms the ND-Range (grid of work-item groups), which shares a common global memory space.

3) The Southern Islands Instruction Set Architecture (ISA)
a) Vector and Scalar Instructions
Most arithmetic operations on a GPU are performed by vector instructions. A vector instruction is fetched once for an entire wavefront and executed in SIMD fashion across the 64 work-items comprising it. Data that are private per work-item are stored in vector registers. Scalar instructions are also fetched in common for the entire wavefront, but are executed only once on its behalf. Data that are shared by all work-items in the wavefront are stored in scalar registers.

b) Southern Islands Assembly
The basic format and characteristics of the AMD Southern Islands instruction set are as follows.
Scalar instructions use the prefix s_, and vector instructions use v_. All registers are 32 bits wide; instructions performing 64-bit computations use two consecutive registers to store 64-bit values. The execution of some instructions implicitly modifies the value of some scalar registers. Special registers may also be handled directly, in the same way as general-purpose registers. For example, the VCC register is a 64-bit mask representing the result of a vector comparison.

2) Running an OpenCL Kernel on a Southern Islands GPU

Figure 3: Simplified diagram of the SI GPU architecture.

An ultra-threaded dispatcher acts as a work-group scheduler. It keeps consuming pending work-groups from the running ND-Range and assigns them to the compute units as they become available. The global memory, whose scope is the whole ND-Range, corresponds to a physical global memory hierarchy on the GPU, formed of caches and main memory. Since all work-items forming a work-group run the same code, the compute unit combines sets of 64 work-items within a work-group to run in a SIMD (single-instruction-multiple-data) fashion; these sets of 64 work-items are known as wavefronts. A work-item's private memory is physically mapped to a portion of the register file. When the work-item uses more private memory than the register file allows, register spills happen using privately allocated regions of global memory.

4) Relevant Modules of the SI GPU
a) The Compute Unit

Figure 4: Compute Unit.

The GPU's ultra-threaded dispatcher assigns work-groups to compute units as they become available. At a given time during execution, one compute unit can have zero, one, or more work-groups allocated to it. These work-groups are split into wavefronts, for which the compute unit executes one instruction at a time. Each compute unit in the GPU is replicated with an identical design.
The LDS unit interacts with local memory to service its instructions, while the scalar and vector memory units can access global memory, shared by all compute units.
b) Memory Hierarchy

Figure 5: Block diagram of the memory pipeline.

- Local memory - The Local Data Share (LDS) unit is responsible for handling all local memory instructions.
- Vector memory - The Vector Memory Unit is responsible for handling all vector global memory operations.

The state of an ND-Range in execution is represented by the private, local, and global memory images, mapped to the register files, local physical memories, and global memory hierarchy, respectively. Register files are modeled in Multi2Sim without contention, and their accesses happen with fixed latency.

d) Work-group Private Memory
Each compute unit has 64 KB of Local Data Share (LDS) memory space that enables low-latency communication between work-items within a work-group, including between work-items in a wavefront. Each work-group can allocate up to 32 KB of this space, and can read and write any portion of the LDS space allocated to it. The LDS also includes 32 integer atomic units to enable fast, unordered atomic operations. This memory can be used as a software cache for predictable data reuse, as a data exchange mechanism for the work-items of a work-group, or as a cooperative way to enable more efficient access to off-chip memory.

e) Global Memory
Work-items have access to two distinct types of global memory (memory visible to all work-groups): the global data share and device memory. Multi2Sim does not implement a Global Data Share (GDS) unit, so we choose to ignore it; in a typical GPU with a GDS, it is simply additional state to maintain, just like the other memory units. On Multi2Sim, the device memory is implemented as a single block of video memory. The simulator does not provide the complete hierarchy, and we choose to ignore this as well; this is not representative of a typical GPU, since real device memory comprises a complete hierarchy of caches and main memory. The discussion of our solution in the following sections assumes the dump of a single memory block.
Figure 6: Block diagram of the memory pipeline.

c) Work-item Private Memory
VGPRs: Every work-item has access to some number of vector general-purpose registers (VGPRs), up to a maximum of 256. VGPRs are 32 bits wide and are used by the vector ALU and vector memory systems. Double-precision operations use two adjacent VGPRs to form a 64-bit value.
SGPRs: Every wavefront is allocated up to a maximum of 104 scalar general-purpose registers (SGPRs). These SGPRs are 32 bits wide and are wavefront-private, since they are common to all work-items in a wavefront.
Private memory: Work-items can allocate private memory space to allow spilling VGPRs to memory. This memory is accessed through vector memory instructions.

III. RELEVANT WORK
We surveyed lightweight checkpointing mechanisms on CPU-GPU systems, along with the motivation and use cases of checkpointing on such systems [11][4][5][7][8][12][13][14]. All the solutions researched so far that checkpoint on a GPU provide user-level support for saving the state of a running process on the GPU. Checkpointing for GPU systems presents challenges compared to other conventional multi-processor systems; these differences emerge from the natural differences between the hardware design and programming languages of GPUs compared to those of unicore processors and multiprocessors. NVIDIA's CUDA has introduced a latency-hiding technique called streams that can overlap memory copies and kernel execution [12]. The performance models in [13] show that the streaming technique can reduce kernel execution times, while the applications may require more data transfer between device and host. Moreover, it suggests that the streaming technique is more suitable for applications whose data are independent, so that both host-to-device and device-to-host memory transfers can be carried out without kernel interruption. User-level checkpointing is supported through either manually inserting the checkpointing logic inside an application's source code or implicitly calling user-level library routines.
In the latter, developers can, for example, re-compile their source code against the libckpt [16] library or relink object files with the Condor library [17]. User-level checkpointing can also be used in conjunction with source code analysis tools [14, 15] to determine the appropriate places to insert the checkpointing code. System-level checkpointing is supported through kernel modules or virtual machines; software such as BLCR [18] and CRAK [19] provides kernel-level checkpoint/restart capabilities without modifying application executables.

IV. IMPLEMENTATION
We propose architectural support for creating a checkpoint and restarting on a GPU. The main idea is to keep track of the fine-grained execution of the kernel on the GPU and, during the creation of a checkpoint, to use the exact micro-architectural state to identify the regions of kernel state that are relevant and require restoration during restart. For finished work-groups we need not store the VGPRs and SGPRs they access; the ND-Range keeps track of completed work-groups, and we need only keep track of this state. For running work-groups, we need to store the VGPRs and SGPRs accessed by each of the wavefronts in these work-groups. So we need hardware support to read the addresses accessed by the wavefronts: the 256 VGPRs per work-item, and the 104 SGPRs and PC for each wavefront. Also, since each work-group addresses a private portion of the LDS unit, we need to snapshot only that region of memory. Figure 7 indicates the checkpoint location: the checkpoint happens at the current PC, and when restarting, that same PC is loaded.

In our implementation we have the luxury of using a flexible simulator that keeps track of all the required state elements to prove our concept. We exploit the OpenCL programming model, which launches the kernel as independent work-groups on the GPU. Each work-group has a collection of work-items that run the same instruction stream, and 64 work-items are grouped into elements called wavefronts.
Our advantage lies in the fact that the ultra-threaded dispatcher issues at wavefront granularity and keeps track of execution, dependence and resource information. The dispatcher launches the wavefronts of a particular work-group based on the number of VGPRs and SGPRs required per work-group and the resources available on each compute unit. Therefore, at any given time, only a subset of the complete kernel state is relevant for checkpointing, and this subset can be extracted by examining the states of the dispatcher, wavefronts and work-groups. Our concept is therefore to snapshot only this portion of the kernel state, rather than storing the complete kernel state of the GPU. Let us examine in detail the exact mechanism of checkpoint and restart.

A. Checkpoint Mechanism
During the execution of an OpenCL kernel, we keep track of the state of each ND-Range, work-group, wavefront and work-item. The ultra-threaded dispatcher modifies the state of each according to the execution of the wavefronts on the GPU. As wavefronts complete, they modify the kernel state; once all the wavefronts in a work-group have completed, the work-group is finished, and once all the work-groups have completed execution, the kernel is complete. At the time of a checkpoint, there are completed work-groups, running work-groups and waiting work-groups; we are concerned only with the kernel state modified by the running work-groups on the GPU.

Figure 7: Trace at the checkpoint location.

Next, we take a global memory snapshot along with the other execution state elements. This provides sufficient information to restart the program from the checkpointed location.

B. Restart Mechanism
During a restart, we stop the GPU execution and restore all the necessary state elements from the nearest checkpointed location in the program. All the state variables that were stored during the previous checkpoint are loaded back into the appropriate locations. Next, we restore the LDS and global memory.
All the completed work-groups in the program need not be re-executed; this is taken care of by the previously restored state variables. For the work-groups that require re-execution, we restore the state of the registers by writing back their saved VGPR and SGPR contents. This is done by iterating through the wavefronts in each of these work-groups.
Figure 8 indicates the trace obtained after restarting execution from the checkpoint location; observe that the PC has started from the checkpointed value instead of 0x0.

Figure 8: Trace generated after restart from the checkpoint location.

C. Verification Strategy
We use the AMDAPP workloads to test our implementation. We generated the execution traces for each of these benchmarks without checkpointing; these act as our golden execution traces. We have a hard-coded checkpointing flag that is set randomly during program execution, and we run this once per workload. Next, we run the kernel again and, if it detects a previous checkpoint, we move to that location of program execution. We then generate the new execution trace once the program has restarted from the previous checkpoint. The generated execution trace should match the golden trace from the checkpointed location onward. We use a UNIX tool called Kompare to verify the execution trace after the program has been run on restart. Figure 9 compares the golden trace with the restarted trace; the green-colored portion indicates the execution trace before the checkpoint.

Figure 9: Difference between the golden trace and the restart trace.

V. EVALUATION
Our experimental evaluation is of the checkpoint files generated by each benchmark. We calculate the memory overhead by estimating the file sizes and comparing them with the total kernel state. Figure 10 indicates the number of work-groups in each of the benchmarks.

Figure 10: Number of work-groups in each benchmark.

Figure 11 gives the number of instructions in each of the benchmarks in AMDAPP.

Figure 11: Number of instructions in each benchmark.

After checkpointing, we generated a checkpoint file which saves the current checkpoint state for each benchmark. Each file is in the range of kilobytes (KB). Figure 12 compares the size of the checkpoint file for each of the benchmarks.

Figure 12: Checkpoint size for each benchmark.
Figure 13 gives the sizes of the global memory stored while checkpointing.

Figure 13: Global memory size at the checkpoint location.

During checkpointing, we store the LDS module of the running work-group, as against dumping the LDS modules of all the work-groups in the benchmark. Figure 14 gives the comparison of the LDS memory stored versus the actual LDS memory size, along with the reduction achieved.

Figure 14: LDS comparison and reduction achieved.

VI. SUMMARY
In our checkpoint/restart implementation, we use architectural and execution information to identify the relevant subset of kernel state that needs to be stored to effectively checkpoint a running process on a GPU. This implementation of checkpoint/restart makes the GPU fault tolerant and much more reliable. We have proved our concept on a flexible simulator, and we demonstrate the need for such support in GPU architectures to checkpoint with minimal overhead. We also show that this solution is agnostic to the type and size of the workload. Our results indicate significant improvements in memory overhead as compared to saving the complete kernel state snapshot.

VII. FUTURE WORK
The solution we provide can be extended to further minimize the LDS and global memory snapshots: if we add additional state to the pages in each of these memories, we can keep track of the pages modified by completed and running work-groups. We can also implement a driver call to invoke checkpointing on the GPU from the host CPU. We can further explore the possibility of compression algorithms to reduce the overhead of multiple checkpoints. Finally, if we schedule checkpoints at appropriate locations, roll-back time can be reduced significantly, making GPUs work efficiently in real-time scenarios.

VIII. REFERENCES
[1] S. Laosooksathit et al., "Lightweight checkpoint mechanism and modeling in GPGPU environment," in 4th Workshop on System-level Virtualization for High Performance Computing (HPCVirt 2010), April 2010.
[2] R. Ubal et al., "Multi2Sim: a simulation framework for CPU-GPU computing," in Proceedings of PACT 2012, ACM, Minneapolis, 2012.
[3] "Reference: Southern Islands Series Instruction Set Architecture," Advanced Micro Devices, Inc., Sunnyvale, CA.
[4] H. Ong, N. Saragol, K. Chanchio, and C. Leangsuksun, "VCCP: A Transparent, Coordinated Checkpointing System for Virtualization-based Cluster Computing," IEEE Cluster.
[5] H. Takizawa, K. Sato, K. Komatsu, and H. Kobayashi, "CheCUDA: A Checkpoint/Restart Tool for CUDA Applications," in PDCAT, 2009.
[6] S. Laosooksathit et al., "Lightweight checkpoint mechanism and modeling in GPGPU environment," in 4th Workshop on System-level Virtualization for High Performance Computing (HPCVirt 2010), April 2010.
[7] H. Takizawa, K. Koyama, K. Sato, K. Komatsu, and H. Kobayashi, "CheCL: Transparent Checkpointing and Process Migration of OpenCL Applications," in Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International, pp. 864-876, May 2011.
[8] J. Menon, M. de Kruijf, and K. Sankaralingam, "iGPU: Exception Support and Speculative Execution on GPUs," ISCA, 2012.
[9] J. Duell, "The design and implementation of Berkeley Lab's Linux checkpoint/restart," Lawrence Berkeley National Laboratory, TR, 2000.
[10] P. H. Hargrove and J. C. Duell, "Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters," Journal of Physics: Conference Series, vol. 46, pp. 494-499.
[11] L. Dami, "Parallelization & checkpointing of GPU applications through program transformation," Ames, Iowa, 2012.
[12] NVIDIA, NVIDIA CUDA Programming Guide.
[13] S. Laosooksathit, C. Leangsuksun, A. Baggag, and C. Chandler, "Stream Experiments: Toward Latency Hiding in GPGPU," in Parallel and Distributed Computing and Networks (PDCN).
[14] K. Chanchio and X. H. Sun, "Data collection and restoration for heterogeneous process migration," Software-Practice and Experience, 32:1-27.
[15] A. Ferrari, S. J. Chapin, and A. Grimshaw, "Heterogeneous process state capture and recovery through Process Introspection," Cluster Computing 3, 2 (Apr. 2000).
[16] J. S. Plank, M. Beck, G. Kingsley, and K. Li, "Libckpt: Transparent Checkpointing under Unix," in Proceedings of the 1995 Winter USENIX Technical Conference, 1995.
[17] M. Litzkow and M. Solomon, "The Evolution of Condor Checkpointing."
[18] J. Duell, P. Hargrove, and E. Roman, "The Design and Implementation of Berkeley Lab's Linux Checkpoint/Restart."
[19] H. Zhong and J. Nieh, "CRAK: Linux checkpoint/restart as a kernel module," Technical Report CUCS, Department of Computer Science, Columbia University.
EE392C: Advanced Topics in Computer Architecture Lecture #10 Polymorphic Processors Stanford University Thursday, 8 May 2003 Virtual Machines Lecture #10: Thursday, 1 May 2003 Lecturer: Jayanth Gummaraju,
More informationUnderstanding GPGPU Vector Register File Usage
Understanding GPGPU Vector Register File Usage Mark Wyse AMD Research, Advanced Micro Devices, Inc. Paul G. Allen School of Computer Science & Engineering, University of Washington AGENDA GPU Architecture
More informationGPUs have enormous power that is enormously difficult to use
524 GPUs GPUs have enormous power that is enormously difficult to use Nvidia GP100-5.3TFlops of double precision This is equivalent to the fastest super computer in the world in 2001; put a single rack
More informationhigh performance medical reconstruction using stream programming paradigms
high performance medical reconstruction using stream programming paradigms This Paper describes the implementation and results of CT reconstruction using Filtered Back Projection on various stream programming
More informationDesign of Digital Circuits Lecture 21: GPUs. Prof. Onur Mutlu ETH Zurich Spring May 2017
Design of Digital Circuits Lecture 21: GPUs Prof. Onur Mutlu ETH Zurich Spring 2017 12 May 2017 Agenda for Today & Next Few Lectures Single-cycle Microarchitectures Multi-cycle and Microprogrammed Microarchitectures
More informationGeneral Purpose GPU Programming. Advanced Operating Systems Tutorial 7
General Purpose GPU Programming Advanced Operating Systems Tutorial 7 Tutorial Outline Review of lectured material Key points Discussion OpenCL Future directions 2 Review of Lectured Material Heterogeneous
More informationGPU Fundamentals Jeff Larkin November 14, 2016
GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate
More informationSerial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing
CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.
More informationIntroduction to Multicore Programming
Introduction to Multicore Programming Minsoo Ryu Department of Computer Science and Engineering 2 1 Multithreaded Programming 2 Automatic Parallelization and OpenMP 3 GPGPU 2 Multithreaded Programming
More informationModern Processor Architectures. L25: Modern Compiler Design
Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions
More informationA Framework for Visualization of OpenCL Applications Execution
A Framework for Visualization of OpenCL Applications Execution A. Ziabari*, R. Ubal*, D. Schaa**, D. Kaeli* *NUCAR Group, Northeastern Universiy **AMD Conference title 1 Outline Introduction Simulation
More informationOptimization solutions for the segmented sum algorithmic function
Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code
More informationCS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS
CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight
More informationParallel Computing: Parallel Architectures Jin, Hai
Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer
More informationMaster Informatics Eng.
Advanced Architectures Master Informatics Eng. 2018/19 A.J.Proença Data Parallelism 3 (GPU/CUDA, Neural Nets,...) (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 2018/19 1 The
More informationCourse II Parallel Computer Architecture. Week 2-3 by Dr. Putu Harry Gunawan
Course II Parallel Computer Architecture Week 2-3 by Dr. Putu Harry Gunawan www.phg-simulation-laboratory.com Review Review Review Review Review Review Review Review Review Review Review Review Processor
More informationGeneral Purpose GPU Programming. Advanced Operating Systems Tutorial 9
General Purpose GPU Programming Advanced Operating Systems Tutorial 9 Tutorial Outline Review of lectured material Key points Discussion OpenCL Future directions 2 Review of Lectured Material Heterogeneous
More informationWhat s in a process?
CSE 451: Operating Systems Winter 2015 Module 5 Threads Mark Zbikowski mzbik@cs.washington.edu Allen Center 476 2013 Gribble, Lazowska, Levy, Zahorjan What s in a process? A process consists of (at least):
More informationComputer Architecture
Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,
More informationChallenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008
Michael Doggett Graphics Architecture Group April 2, 2008 Graphics Processing Unit Architecture CPUs vsgpus AMD s ATI RADEON 2900 Programming Brook+, CAL, ShaderAnalyzer Architecture Challenges Accelerated
More informationMulti-core Architectures. Dr. Yingwu Zhu
Multi-core Architectures Dr. Yingwu Zhu What is parallel computing? Using multiple processors in parallel to solve problems more quickly than with a single processor Examples of parallel computing A cluster
More informationGPU-accelerated Verification of the Collatz Conjecture
GPU-accelerated Verification of the Collatz Conjecture Takumi Honda, Yasuaki Ito, and Koji Nakano Department of Information Engineering, Hiroshima University, Kagamiyama 1-4-1, Higashi Hiroshima 739-8527,
More informationMULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance
More informationHeterogeneous-Race-Free Memory Models
Heterogeneous-Race-Free Memory Models Jyh-Jing (JJ) Hwang, Yiren (Max) Lu 02/28/2017 1 Outline 1. Background 2. HRF-direct 3. HRF-indirect 4. Experiments 2 Data Race Condition op1 op2 write read 3 Sequential
More informationMULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming
MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance
More informationChapter 4: Threads. Chapter 4: Threads. Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues
Chapter 4: Threads Silberschatz, Galvin and Gagne 2013 Chapter 4: Threads Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues 4.2 Silberschatz, Galvin
More informationOPENCL GPU BEST PRACTICES BENJAMIN COQUELLE MAY 2015
OPENCL GPU BEST PRACTICES BENJAMIN COQUELLE MAY 2015 TOPICS Data transfer Parallelism Coalesced memory access Best work group size Occupancy branching All the performance numbers come from a W8100 running
More informationB. Tech. Project Second Stage Report on
B. Tech. Project Second Stage Report on GPU Based Active Contours Submitted by Sumit Shekhar (05007028) Under the guidance of Prof Subhasis Chaudhuri Table of Contents 1. Introduction... 1 1.1 Graphic
More informationINSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing
UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática Architectures for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 12
More informationAR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors
AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors Computer Sciences Department University of Wisconsin Madison http://www.cs.wisc.edu/~ericro/ericro.html ericro@cs.wisc.edu High-Performance
More information15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011
5-740/8-740 Computer Architecture Lecture 0: Out-of-Order Execution Prof. Onur Mutlu Carnegie Mellon University Fall 20, 0/3/20 Review: Solutions to Enable Precise Exceptions Reorder buffer History buffer
More informationModern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design
Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant
More informationNVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield
NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host
More informationPredictive Thread-to-Core Assignment on a Heterogeneous Multi-core Processor*
Predictive Thread-to-Core Assignment on a Heterogeneous Multi-core Processor* Tyler Viswanath Krishnamurthy, and Hridesh Laboratory for Software Design Department of Computer Science Iowa State University
More informationAdvanced CUDA Optimization 1. Introduction
Advanced CUDA Optimization 1. Introduction Thomas Bradley Agenda CUDA Review Review of CUDA Architecture Programming & Memory Models Programming Environment Execution Performance Optimization Guidelines
More informationIntroduction to Multicore Programming
Introduction to Multicore Programming Minsoo Ryu Department of Computer Science and Engineering 2 1 Multithreaded Programming 2 Synchronization 3 Automatic Parallelization and OpenMP 4 GPGPU 5 Q& A 2 Multithreaded
More informationAccelerating MapReduce on a Coupled CPU-GPU Architecture
Accelerating MapReduce on a Coupled CPU-GPU Architecture Linchuan Chen Xin Huo Gagan Agrawal Department of Computer Science and Engineering The Ohio State University Columbus, OH 43210 {chenlinc,huox,agrawal}@cse.ohio-state.edu
More informationNon-uniform memory access machine or (NUMA) is a system where the memory access time to any region of memory is not the same for all processors.
CS 320 Ch. 17 Parallel Processing Multiple Processor Organization The author makes the statement: "Processors execute programs by executing machine instructions in a sequence one at a time." He also says
More informationProfiling and Debugging OpenCL Applications with ARM Development Tools. October 2014
Profiling and Debugging OpenCL Applications with ARM Development Tools October 2014 1 Agenda 1. Introduction to GPU Compute 2. ARM Development Solutions 3. Mali GPU Architecture 4. Using ARM DS-5 Streamline
More informationThreading Hardware in G80
ing Hardware in G80 1 Sources Slides by ECE 498 AL : Programming Massively Parallel Processors : Wen-Mei Hwu John Nickolls, NVIDIA 2 3D 3D API: API: OpenGL OpenGL or or Direct3D Direct3D GPU Command &
More informationGRAPHICS PROCESSING UNITS
GRAPHICS PROCESSING UNITS Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 4, John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2011
More informationHSA Foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017!
Advanced Topics on Heterogeneous System Architectures HSA Foundation! Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017! Antonio R. Miele! Marco D. Santambrogio! Politecnico di Milano! 2
More informationComputer Architecture Lecture 14: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013
18-447 Computer Architecture Lecture 14: Out-of-Order Execution Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 2/18/2013 Reminder: Homework 3 Homework 3 Due Feb 25 REP MOVS in Microprogrammed
More informationOPERATING SYSTEM. Chapter 4: Threads
OPERATING SYSTEM Chapter 4: Threads Chapter 4: Threads Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues Operating System Examples Objectives To
More informationTUNING CUDA APPLICATIONS FOR MAXWELL
TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v6.5 August 2014 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2
More informationOpenACC Course. Office Hour #2 Q&A
OpenACC Course Office Hour #2 Q&A Q1: How many threads does each GPU core have? A: GPU cores execute arithmetic instructions. Each core can execute one single precision floating point instruction per cycle
More informationTUNING CUDA APPLICATIONS FOR MAXWELL
TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v7.0 March 2015 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2
More informationMulti-core Architectures. Dr. Yingwu Zhu
Multi-core Architectures Dr. Yingwu Zhu Outline Parallel computing? Multi-core architectures Memory hierarchy Vs. SMT Cache coherence What is parallel computing? Using multiple processors in parallel to
More informationAutomatic Intra-Application Load Balancing for Heterogeneous Systems
Automatic Intra-Application Load Balancing for Heterogeneous Systems Michael Boyer, Shuai Che, and Kevin Skadron Department of Computer Science University of Virginia Jayanth Gummaraju and Nuwan Jayasena
More informationRegression Modelling of Power Consumption for Heterogeneous Processors. Tahir Diop
Regression Modelling of Power Consumption for Heterogeneous Processors by Tahir Diop A thesis submitted in conformity with the requirements for the degree of Master of Applied Science Graduate Department
More informationArchitectural and Runtime Enhancements for Dynamically Controlled Multi-Level Concurrency on GPUs
Architectural and Runtime Enhancements for Dynamically Controlled Multi-Level Concurrency on GPUs A Dissertation Presented by Yash Ukidave to The Department of Electrical and Computer Engineering in partial
More informationIntroduction to GPU hardware and to CUDA
Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 35 Course outline Introduction to GPU hardware
More informationOpenMP for next generation heterogeneous clusters
OpenMP for next generation heterogeneous clusters Jens Breitbart Research Group Programming Languages / Methodologies, Universität Kassel, jbreitbart@uni-kassel.de Abstract The last years have seen great
More informationSpace-Efficient Page-Level Incremental Checkpointing *
JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 22, 237-246 (2006) Space-Efficient Page-Level Incremental Checkpointing * JUNYOUNG HEO, SANGHO YI, YOOKUN CHO AND JIMAN HONG + School of Computer Science
More informationParallel Processing SIMD, Vector and GPU s cont.
Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP
More informationQuestions answered in this lecture: CS 537 Lecture 19 Threads and Cooperation. What s in a process? Organizing a Process
Questions answered in this lecture: CS 537 Lecture 19 Threads and Cooperation Why are threads useful? How does one use POSIX pthreads? Michael Swift 1 2 What s in a process? Organizing a Process A process
More informationOperating Systems: Internals and Design Principles. Chapter 2 Operating System Overview Seventh Edition By William Stallings
Operating Systems: Internals and Design Principles Chapter 2 Operating System Overview Seventh Edition By William Stallings Operating Systems: Internals and Design Principles Operating systems are those
More informationComputer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13
Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Moore s Law Moore, Cramming more components onto integrated circuits, Electronics,
More informationOperating Systems Overview. Chapter 2
Operating Systems Overview Chapter 2 Operating System A program that controls the execution of application programs An interface between the user and hardware Masks the details of the hardware Layers and
More information10/10/ Gribble, Lazowska, Levy, Zahorjan 2. 10/10/ Gribble, Lazowska, Levy, Zahorjan 4
What s in a process? CSE 451: Operating Systems Autumn 2010 Module 5 Threads Ed Lazowska lazowska@cs.washington.edu Allen Center 570 A process consists of (at least): An, containing the code (instructions)
More informationConcepts. Virtualization
Concepts Virtualization Concepts References and Sources James Smith, Ravi Nair, The Architectures of Virtual Machines, IEEE Computer, May 2005, pp. 32-38. Mendel Rosenblum, Tal Garfinkel, Virtual Machine
More informationGPU programming. Dr. Bernhard Kainz
GPU programming Dr. Bernhard Kainz Overview About myself Motivation GPU hardware and system architecture GPU programming languages GPU programming paradigms Pitfalls and best practice Reduction and tiling
More informationComputer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University
Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Moore s Law Moore, Cramming more components onto integrated circuits, Electronics, 1965. 2 3 Multi-Core Idea:
More informationComputer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling)
18-447 Computer Architecture Lecture 12: Out-of-Order Execution (Dynamic Instruction Scheduling) Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 2/13/2015 Agenda for Today & Next Few Lectures
More informationA Code Merging Optimization Technique for GPU. Ryan Taylor Xiaoming Li University of Delaware
A Code Merging Optimization Technique for GPU Ryan Taylor Xiaoming Li University of Delaware FREE RIDE MAIN FINDING A GPU program can use the spare resources of another GPU program without hurting its
More informationApplication Programming
Multicore Application Programming For Windows, Linux, and Oracle Solaris Darryl Gove AAddison-Wesley Upper Saddle River, NJ Boston Indianapolis San Francisco New York Toronto Montreal London Munich Paris
More informationHakam Zaidan Stephen Moore
Hakam Zaidan Stephen Moore Outline Vector Architectures Properties Applications History Westinghouse Solomon ILLIAC IV CDC STAR 100 Cray 1 Other Cray Vector Machines Vector Machines Today Introduction
More informationWhat s in a traditional process? Concurrency/Parallelism. What s needed? CSE 451: Operating Systems Autumn 2012
What s in a traditional process? CSE 451: Operating Systems Autumn 2012 Ed Lazowska lazowska @cs.washi ngton.edu Allen Center 570 A process consists of (at least): An, containing the code (instructions)
More informationMotivation. Threads. Multithreaded Server Architecture. Thread of execution. Chapter 4
Motivation Threads Chapter 4 Most modern applications are multithreaded Threads run within application Multiple tasks with the application can be implemented by separate Update display Fetch data Spell
More informationC152 Laboratory Exercise 5
C152 Laboratory Exercise 5 Professor: Krste Asanovic GSI: Henry Cook Department of Electrical Engineering & Computer Science University of California, Berkeley April 9, 2008 1 Introduction and goals The
More informationigpu: Exception Support and Speculation Execution on GPUs Jaikrishnan Menon, Marc de Kruijf University of Wisconsin-Madison ISCA 2012
igpu: Exception Support and Speculation Execution on GPUs Jaikrishnan Menon, Marc de Kruijf University of Wisconsin-Madison ISCA 2012 Outline Motivation and Challenges Background Mechanism igpu Architecture
More informationGrassroots ASPLOS. can we still rethink the hardware/software interface in processors? Raphael kena Poss University of Amsterdam, the Netherlands
Grassroots ASPLOS can we still rethink the hardware/software interface in processors? Raphael kena Poss University of Amsterdam, the Netherlands ASPLOS-17 Doctoral Workshop London, March 4th, 2012 1 Current
More informationPreparing seismic codes for GPUs and other
Preparing seismic codes for GPUs and other many-core architectures Paulius Micikevicius Developer Technology Engineer, NVIDIA 2010 SEG Post-convention Workshop (W-3) High Performance Implementations of
More informationTesla GPU Computing A Revolution in High Performance Computing
Tesla GPU Computing A Revolution in High Performance Computing Mark Harris, NVIDIA Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory
More information1 Publishable Summary
1 Publishable Summary 1.1 VELOX Motivation and Goals The current trend in designing processors with multiple cores, where cores operate in parallel and each of them supports multiple threads, makes the
More informationOVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI
CMPE 655- MULTIPLE PROCESSOR SYSTEMS OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI What is MULTI PROCESSING?? Multiprocessing is the coordinated processing
More informationParallel Programming on Larrabee. Tim Foley Intel Corp
Parallel Programming on Larrabee Tim Foley Intel Corp Motivation This morning we talked about abstractions A mental model for GPU architectures Parallel programming models Particular tools and APIs This
More informationGPU ACCELERATED DATABASE MANAGEMENT SYSTEMS
CIS 601 - Graduate Seminar Presentation 1 GPU ACCELERATED DATABASE MANAGEMENT SYSTEMS PRESENTED BY HARINATH AMASA CSU ID: 2697292 What we will talk about.. Current problems GPU What are GPU Databases GPU
More information