LUNAR TEMPERATURE CALCULATIONS ON A GPU
Kyle M. Berney
Department of Information & Computer Sciences
Department of Mathematics
University of Hawaiʻi at Mānoa
Honolulu, HI

ABSTRACT

Lunar surface temperature is a crucial parameter for the retention of volatiles, such as ice. Near the lunar poles, temperature is determined by reflected sunlight rather than direct sunlight. Temperature modeling of craters with high-resolution topography is a formidable computational challenge, because the computational cost increases rapidly with spatial resolution. Graphics Processing Units (GPUs), a novel type of computer hardware, are now programmable and inexpensive, and they provide low-latency memory units, hardware-implemented trigonometric functions, and multi-threaded functionality. Using the CUDA C programming language, we implement a simplified model for the surface energy balance that retains the computational complexity of the full problem. We experiment with different approaches and compare the runtime of our CUDA C programs to the runtime of the same surface energy balance model written in the C programming language. Our objective is to develop a computationally efficient algorithm for surface energy balance calculations on a GPU, which may be used for future research and applications. We achieved a speedup by a factor of over 100 compared to a CPU.

INTRODUCTION

Recent and ongoing lunar missions have provided high-resolution topography, from the Kaguya spacecraft (Araki et al., 2009) and from the Lunar Orbiter Laser Altimeter (LOLA) (Smith et al., 2010a, 2010b), and surface temperature from the DIVINER instrument onboard the Lunar Reconnaissance Orbiter (Paige et al., 2010a, 2010b). Figure 1 below shows a high-resolution temperature map of the south polar region of the Moon.
The brightly colored areas (red, yellow, and green) receive abundant direct sunlight, while the darker colored areas (blue and purple) receive little to no sunlight. The temperature of these darker areas is determined by the reflected light they receive from other surface elements in their field of view. Given the high spatial resolution of the available data, modeling temperature is computationally challenging. In particular, the surface energy balance is the most computationally expensive component, since an algorithm has to visit many surface elements at each time step.
Figure 1: High-resolution temperature map of the south polar region of the Moon. The inner white circle represents 85° S latitude. (Adopted from Paige et al. 2010a).

Modern GPUs are programmable and can be used effectively as a numerical coprocessor (Kirk and Hwu, 2010). GPU-based clusters are becoming increasingly common everywhere from university research labs to the world's fastest supercomputers. For suitably parallel workloads, a single GPU can now replace a sizeable conventional computer cluster. GPUs are relatively inexpensive compared to an equal number of CPU (Central Processing Unit) cores; they provide extensive low-latency memory units and hardware-implemented trigonometric functions, and they can execute thousands of threads in parallel.

For this research project, an NVidia Tesla C1060 GPU was used, shown in Figure 2 below. The NVidia Tesla C1060 has 30 multiprocessors with 8 cores each, for a total of 240 cores. Each core runs at a 1.3 GHz clock rate, and a block may contain a maximum of 512 threads. For comparison, a commonplace Intel Xeon CPU has 4 cores with 2 threads per core and a clock rate of 2.3 GHz.

Figure 2: An NVidia Tesla C1060 GPU, which can be mounted in a regular PCIe x16 card slot. The Tesla series of GPUs from NVidia is designed solely for general-purpose GPU computing; hence it serves as a numerical coprocessor for massively parallel computations.
To use our GPU for numerical computations, we learned CUDA C, a programming language developed by NVidia. A CUDA C program consists of host code, which runs on the CPU, and device code, which runs on the GPU. It is essentially the C programming language with extra keywords and functions that provide GPU-related functionality. For example, the keyword __global__ declares that a function will run on the GPU rather than the CPU. These functions are called kernels, and they are executed in parallel on the GPU. Other examples are the cudaMalloc(), cudaFree(), and cudaMemcpy() functions, which are analogous to the C functions malloc(), free(), and memcpy() respectively. The function cudaMalloc() allocates DRAM (Dynamic Random Access Memory) space for GPU use, cudaFree() de-allocates DRAM space that was allocated via cudaMalloc(), and cudaMemcpy() copies GPU memory to CPU memory or vice versa.

One of the main features of CUDA C is that it provides explicit use of multiple types of GPU memory. There are five different GPU memory types: global memory, local memory, shared memory, constant memory, and texture memory. Local memory is stored off chip in DRAM; however, it is cached on the GPU. Local memory is private to a single thread and can be both read and written. Global memory is also stored in DRAM and can be both read and written; however, global memory has no cache on the GPU, and it can be accessed by all threads. Shared memory is the only one of these memory types stored on the GPU chip itself. It can also be both read and written, but it can only be accessed by the threads of one group, called a block. Constant and texture memory are similar: both are stored in DRAM, are read-only, and have caches on the GPU. They differ in their use: constant memory is used when threads access the same memory location at the same time, while texture memory is used when threads access nearby memory locations.

Figure 3 below provides a graphical overview of the different types of CUDA memory.

Figure 3: Graphical overview of the memory types on a CUDA device. (Adopted from NVidia 2010)
METHODS

To limit the set-up time for our model, we used a simplified toy model that retains the computational complexity of the full problem. Our model consists of a collection of elements, which are the analog of lunar surface elements. Each of these elements reflects light and illuminates the other elements in its field of view. To mimic solar illumination, an artificial incoming energy flux is computed, which changes with time. Time is determined by the sun position, which is measured by the hour angle (radians east of noon). Hence, the incoming energy for element $i$ consists of direct sunlight and reflected sunlight from other surface elements. This brings us to our mathematical formulation:

$$Q_i^{n} = F_i^{n} + A \sum_{j} \alpha_{i,j}\, Q_j^{n-1},$$

where $Q_i^{n}$ is the incoming energy for surface element $i$ at time step $n$, $A$ is albedo, $F_i^{n}$ is the direct incoming solar radiation (insolation) for surface element $i$ at time step $n$, $\alpha_{i,j}$ is the angle subtended by surface element $j$ as seen by surface element $i$, and $Q_j^{n-1}$ is the incoming energy for surface element $j$ at time step $n-1$. Every surface element experiences sunrise and sunset, which depend on its azimuthal orientation and the latitude. We compute the time average of $Q_i$ over one day.

For each of the two semesters spent on this research project, a separate lunar surface energy model was developed and implemented. In the first semester, we developed a model that consists of a simple ring of surface elements, shown below in Figure 4. For this model, every surface element is in the field of view of all other surface elements. Moreover, $\alpha_{i,j}$ is independent of $i$ and $j$. It can be calculated using the formula

$$\alpha_{i,j} = \frac{\Delta\phi}{2}, \qquad \text{where } \Delta\phi = \frac{2\pi}{N}$$

and $N$ is the number of surface elements. Similarly, the azimuth value for each surface element can be calculated using

$$\phi_i = i\,\Delta\phi.$$

Because of this, we do not need to store the $\alpha_{i,j}$ values or the azimuth values for each surface element in memory; we instead calculate them on the fly when needed.
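The ring-model update described above can be sketched in plain C as a CPU reference. This is a minimal sketch under stated assumptions, not the author's program: the function name `ring_step`, the use of `double`, and passing the insolation as a precomputed array `f` are all hypothetical choices; the source does not give the insolation formula. The sketch also assumes that an element does not illuminate itself and that the subtended angle is half of 2π/N, which agrees with the inscribed angle theorem for points on a circle.

```c
#include <assert.h>
#include <stddef.h>

#define RING_PI 3.14159265358979323846

/* One time step of the ring model on the CPU (illustrative sketch).
 * f     : direct insolation of each element at the current step (assumed given)
 * q_old : incoming energy of each element at the previous time step
 * q_new : output, incoming energy of each element at the current step
 * Assumption: element i receives no reflected light from itself. */
void ring_step(size_t n, double albedo,
               const double *f, const double *q_old, double *q_new)
{
    assert(n > 0);
    /* Subtended angle, identical for every pair: (2*pi/n) / 2. */
    const double alpha = RING_PI / (double)n;

    for (size_t i = 0; i < n; ++i) {
        double reflected = 0.0;
        /* Visit every other element: this inner loop is what makes the
         * cost of a time step grow with the square of n. */
        for (size_t j = 0; j < n; ++j) {
            if (j != i)
                reflected += alpha * q_old[j];
        }
        q_new[i] = f[i] + albedo * reflected;
    }
}
```

Calling `ring_step` once per time step, swapping `q_old` and `q_new` between calls, and accumulating `q_new` into a running sum would give the daily time average mentioned in the text.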
In the second semester of our research, we wanted to develop a surface energy balance model in which some surface elements are not in the field of view of all other surface elements, and which requires storing the $\alpha_{i,j}$ values and the azimuth values for each surface element in memory. Hence, we created a model that consists of two conjoined regular octagons, shown below in Figure 4. For this model, there are no shortcut formulas for the azimuth values and the $\alpha_{i,j}$ values, so we compute the values once and store them in memory.
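When the subtended angles are stored rather than computed on the fly, one time step becomes a matrix-vector product plus the insolation term. The following C sketch illustrates this on the CPU. It is an assumption-laden illustration, not the author's code: the name `stored_step`, the row-major layout `alpha[i * n + j]`, and the convention that an entry of zero marks an element outside the field of view are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One time step using a stored, flattened n-by-n matrix of subtended angles.
 * alpha : row-major matrix, alpha[i * n + j] is the angle subtended by
 *         element j as seen from element i; assumed to be 0 when j is not
 *         in the field of view of i (and for j == i)
 * f     : direct insolation of each element at the current step (assumed given)
 * q_old : incoming energy at the previous step
 * q_new : output, incoming energy at the current step */
void stored_step(size_t n, double albedo, const double *alpha,
                 const double *f, const double *q_old, double *q_new)
{
    assert(n > 0);
    for (size_t i = 0; i < n; ++i) {
        double reflected = 0.0;
        for (size_t j = 0; j < n; ++j)
            reflected += alpha[i * n + j] * q_old[j];
        q_new[i] = f[i] + albedo * reflected;
    }
}
```

Storing zeros for invisible pairs keeps the loop free of branches at the cost of memory that grows with the square of the number of elements.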
Figure 4: Lunar surface energy balance models developed. The first semester's model is depicted on the left and the second semester's model on the right.

Multiple CUDA C programs were developed for each of the two surface energy balance models. The main difference between the CUDA C programs is the type of GPU memory in which the data structures are stored. For both models, we have two 1-dimensional arrays of size $N$, each holding one value per surface element. For our second semester model, we also have a 1-dimensional array of size $N$ for the azimuth values and a square $N \times N$ matrix, flattened into a 1-dimensional array of size $N \times N$, for the $\alpha_{i,j}$ values.

Two CUDA C programs were developed for the first semester's surface energy balance model. The first CUDA C program, global_ring.cu, stored both arrays in global memory, while the second CUDA C program, constant_ring.cu, stored one array in global memory and the other in constant memory.

For the second semester's model, five different CUDA C programs were developed. The first program, global.cu, stores all data structures in global memory. The second program, constant.cu, stores one array in constant memory and all other data structures in global memory. The third program, transpose_global.cu, is the same as global.cu except that the $\alpha_{i,j}$ matrix is transposed. Similarly, the fourth program, transpose_constant.cu, is the same as constant.cu except that the $\alpha_{i,j}$ matrix is transposed. For the last program, timeloop.cu, we move the time loop onto the GPU. In other words, instead of calling a kernel at every time step, we call a single kernel that iterates over time. The $\alpha_{i,j}$ matrix is transposed in timeloop.cu as well.
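The transposed variants differ from their counterparts only in how the flattened matrix of subtended angles is laid out. A minimal C sketch of the transposition is shown below; the name `transpose_flat` and the use of `float` are assumptions made for illustration, and the source does not say whether the author transposed the matrix on the host or built it in transposed form directly.

```c
#include <assert.h>
#include <stddef.h>

/* Out-of-place transpose of a flattened, row-major n-by-n matrix.
 * a  : input,  a[i * n + j] holds entry (i, j)
 * at : output, at[j * n + i] holds the same entry (i, j)
 * The two buffers must not overlap. */
void transpose_flat(size_t n, const float *a, float *at)
{
    assert(a != at);
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            at[j * n + i] = a[i * n + j];
}
```

Because the matrix is computed once and reused at every time step, a one-time transpose like this costs little compared with the repeated reads it reorganizes.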
RESULTS

Figure 5: Runtime graph of the first semester's surface energy balance model programs (log-log plot, base 2; horizontal axis: number of surface elements, N; vertical axis: time in seconds; series: first semester C program, global_ring.cu, constant_ring.cu; trend line: y = x). The black trend line is shown for comparison.

Figure 5 above shows the runtimes of the programs developed for the first semester's surface energy balance model. The first thing we see is that the runtime of our C program does indeed grow with the number of surface elements. Both global_ring.cu and constant_ring.cu run much faster than the C program, with constant_ring.cu being the fastest. This is because constant memory has a cache on the GPU, which makes constant memory reads significantly faster than global memory reads. However, we are limited by the size of our GPU's constant memory space. Our NVidia Tesla C1060 GPU has a total of 64 kB of constant memory space, which translates to 16,384 4-byte floating point or integer values. Hence, constant_ring.cu can only operate on fewer than 16,384 surface elements. We found that the peak speedup occurred at 15,000 surface elements, with global_ring.cu running 115 times faster than the C program and constant_ring.cu running 207 times faster.
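The element limit quoted above is simple arithmetic: the constant memory size divided by the size of one stored value. The C sketch below spells it out; the helper name `constant_capacity` is hypothetical, and the 64 kB figure is the one stated in the text for the Tesla C1060.

```c
#include <assert.h>
#include <stddef.h>

/* Number of values of size elem_size (in bytes) that fit into a
 * constant memory space of the given size (in bytes). */
size_t constant_capacity(size_t bytes, size_t elem_size)
{
    assert(elem_size > 0);
    return bytes / elem_size;
}
```

For example, `constant_capacity(64 * 1024, sizeof(float))` gives 16,384 on a platform where `float` occupies 4 bytes, which is the bound on the number of surface elements for the constant-memory programs.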
Figure 6: Runtime graph of the second semester's surface energy balance model programs (log-log plot, base 2; horizontal axis: number of surface elements, N; vertical axis: time in seconds; series: second semester C program, global.cu, constant.cu, transpose_global.cu, transpose_constant.cu, timeloop.cu; trend line: $y = x^2$). The black trend line is shown for comparison.

Figure 6 above shows the runtimes of the programs developed for the second semester's surface energy balance model. As with the first semester's C program, the runtime of the second semester's C program grows with the number of surface elements. The first two CUDA programs developed, global.cu and constant.cu, both run faster than the C program; however, their speedup is only a factor of 3 to 4, much less than what we saw in the first semester. We traced the cause of the slowdown to the memory access pattern of the $\alpha_{i,j}$ matrix. By transposing the $\alpha_{i,j}$ matrix, the memory access pattern becomes coalesced. In other words, consecutive threads now access consecutive memory addresses. This small change in the organization of the $\alpha_{i,j}$ matrix translates to a very significant speedup. We find that the peak speedup occurs at 15,260 elements, with transpose_global.cu running 72 times faster than the C program, transpose_constant.cu running 106 times faster, and timeloop.cu running 74 times faster.

Since transpose_constant.cu uses constant memory, it is limited to fewer than 16,384 surface elements. However, both transpose_global.cu and timeloop.cu can operate on a larger number of surface elements. We also see that the runtimes of transpose_global.cu and timeloop.cu are similar to each other; thus, whether the time loop runs on the CPU or the GPU is not significant.
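The effect of the transpose on the access pattern can be seen from index arithmetic alone. The C sketch below assumes, hypothetically, that thread i handles surface element i and loops over the other elements j, so that at any moment neighbouring threads read entries (i, j) and (i + 1, j) of the flattened matrix. The function names are illustrative and do not come from the source.

```c
#include <assert.h>
#include <stddef.h>

/* Flattened index of entry (i, j) in the original row-major layout. */
size_t idx_row_major(size_t n, size_t i, size_t j) { return i * n + j; }

/* Flattened index of the same entry (i, j) after transposing the matrix. */
size_t idx_transposed(size_t n, size_t i, size_t j) { return j * n + i; }

/* Distance, in array elements, between the entries read by thread i and
 * thread i + 1 when both are working on the same column index j. */
size_t thread_stride(size_t (*idx)(size_t, size_t, size_t),
                     size_t n, size_t i, size_t j)
{
    assert(i + 1 < n && j < n);
    return idx(n, i + 1, j) - idx(n, i, j);
}
```

With the row-major layout the stride equals n, so neighbouring threads touch addresses far apart; with the transposed layout the stride is 1, which is the consecutive-address pattern the text calls coalesced.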
CONCLUSION

We achieved a significant speedup for crater reflection calculations. For both surface energy balance models, the CUDA C program that used constant memory for one of its arrays ran the fastest, with constant_ring.cu having a peak speedup factor of 206 and transpose_constant.cu having a peak speedup factor of 106. However, those programs are limited to fewer than 16,384 surface elements. The key lesson we take from this research project is that where the data are stored and how threads access them both matter a great deal. In the first semester we saw that the choice of storage, global memory versus constant memory, has a big impact; in the second semester we saw that the access pattern, non-coalesced versus coalesced memory access, is very significant. Our results also suggest that GPUs could be used successfully in 3-dimensional models of the surface temperature in lunar craters.

REFERENCES

Araki, H. et al. (2009) Lunar global shape and polar topography derived from Kaguya-LALT laser altimetry. Science 323,
Kirk, D. B. and Hwu, W. W. (2010) Programming Massively Parallel Processors. Morgan Kaufmann.
Nvidia (2010) CUDA C Best Practices Guide.
Paige, D. A. et al. (2010a) Diviner Lunar Radiometer observations of cold traps in the Moon's south polar region. Science 330,
Paige, D. A. et al. (2010b) The Lunar Reconnaissance Orbiter Diviner Lunar Radiometer Experiment. Space Sci. Rev. 150,
Smith, D. E. et al. (2010a) The Lunar Orbiter Laser Altimeter Investigation on the Lunar Reconnaissance Orbiter Mission. Space Sci. Rev. 150,
Smith, D. E. et al. (2010b) Initial Observations from the Lunar Orbiter Laser Altimeter (LOLA). Geophys. Res. Lett. 37, L
More informationInformation Coding / Computer Graphics, ISY, LiTH. CUDA memory! ! Coalescing!! Constant memory!! Texture memory!! Pinned memory 26(86)
26(86) Information Coding / Computer Graphics, ISY, LiTH CUDA memory Coalescing Constant memory Texture memory Pinned memory 26(86) CUDA memory We already know... Global memory is slow. Shared memory is
More informationHigh Performance Computing on GPUs using NVIDIA CUDA
High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and
More informationA Comprehensive Study on the Performance of Implicit LS-DYNA
12 th International LS-DYNA Users Conference Computing Technologies(4) A Comprehensive Study on the Performance of Implicit LS-DYNA Yih-Yih Lin Hewlett-Packard Company Abstract This work addresses four
More informationFinite Element Integration and Assembly on Modern Multi and Many-core Processors
Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,
More informationTUNING CUDA APPLICATIONS FOR MAXWELL
TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v7.0 March 2015 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2
More informationIntroduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied
More informationA Fast GPU-Based Approach to Branchless Distance-Driven Projection and Back-Projection in Cone Beam CT
A Fast GPU-Based Approach to Branchless Distance-Driven Projection and Back-Projection in Cone Beam CT Daniel Schlifske ab and Henry Medeiros a a Marquette University, 1250 W Wisconsin Ave, Milwaukee,
More informationIntroduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied
More informationNVidia s GPU Microarchitectures. By Stephen Lucas and Gerald Kotas
NVidia s GPU Microarchitectures By Stephen Lucas and Gerald Kotas Intro Discussion Points - Difference between CPU and GPU - Use s of GPUS - Brie f History - Te sla Archite cture - Fermi Architecture -
More informationAn Execution Strategy and Optimized Runtime Support for Parallelizing Irregular Reductions on Modern GPUs
An Execution Strategy and Optimized Runtime Support for Parallelizing Irregular Reductions on Modern GPUs Xin Huo, Vignesh T. Ravi, Wenjing Ma and Gagan Agrawal Department of Computer Science and Engineering
More informationCartoon parallel architectures; CPUs and GPUs
Cartoon parallel architectures; CPUs and GPUs CSE 6230, Fall 2014 Th Sep 11! Thanks to Jee Choi (a senior PhD student) for a big assist 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ~ socket 14 ~ core 14 ~ HWMT+SIMD
More informationIntroduction CPS343. Spring Parallel and High Performance Computing. CPS343 (Parallel and HPC) Introduction Spring / 29
Introduction CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Introduction Spring 2018 1 / 29 Outline 1 Preface Course Details Course Requirements 2 Background Definitions
More informationBaseline V IRAM Trimedia. Cycles ( x 1000 ) N
CS 252 COMPUTER ARCHITECTURE MAY 2000 An Investigation of the QR Decomposition Algorithm on Parallel Architectures Vito Dai and Brian Limketkai Abstract This paper presents an implementation of a QR decomposition
More informationEfficient Tridiagonal Solvers for ADI methods and Fluid Simulation
Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation Nikolai Sakharnykh - NVIDIA San Jose Convention Center, San Jose, CA September 21, 2010 Introduction Tridiagonal solvers very popular
More informationComputer Caches. Lab 1. Caching
Lab 1 Computer Caches Lab Objective: Caches play an important role in computational performance. Computers store memory in various caches, each with its advantages and drawbacks. We discuss the three main
More informationCUDA GPGPU Workshop CUDA/GPGPU Arch&Prog
CUDA GPGPU Workshop 2012 CUDA/GPGPU Arch&Prog Yip Wichita State University 7/11/2012 GPU-Hardware perspective GPU as PCI device Original PCI PCIe Inside GPU architecture GPU as PCI device Traditional PC
More informationWarps and Reduction Algorithms
Warps and Reduction Algorithms 1 more on Thread Execution block partitioning into warps single-instruction, multiple-thread, and divergence 2 Parallel Reduction Algorithms computing the sum or the maximum
More informationMulti-Processors and GPU
Multi-Processors and GPU Philipp Koehn 7 December 2016 Predicted CPU Clock Speed 1 Clock speed 1971: 740 khz, 2016: 28.7 GHz Source: Horowitz "The Singularity is Near" (2005) Actual CPU Clock Speed 2 Clock
More informationA TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE
A TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE Abu Asaduzzaman, Deepthi Gummadi, and Chok M. Yip Department of Electrical Engineering and Computer Science Wichita State University Wichita, Kansas, USA
More informationN-Body Simulation using CUDA. CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo
N-Body Simulation using CUDA CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo Project plan Develop a program to simulate gravitational
More informationIntel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins
Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Outline History & Motivation Architecture Core architecture Network Topology Memory hierarchy Brief comparison to GPU & Tilera Programming Applications
More informationParallel Programming and Debugging with CUDA C. Geoff Gerfin Sr. System Software Engineer
Parallel Programming and Debugging with CUDA C Geoff Gerfin Sr. System Software Engineer CUDA - NVIDIA s Architecture for GPU Computing Broad Adoption Over 250M installed CUDA-enabled GPUs GPU Computing
More informationGViM: GPU-accelerated Virtual Machines
GViM: GPU-accelerated Virtual Machines Vishakha Gupta, Ada Gavrilovska, Karsten Schwan, Harshvardhan Kharche @ Georgia Tech Niraj Tolia, Vanish Talwar, Partha Ranganathan @ HP Labs Trends in Processor
More informationA novel way to efficiently simulate complex full systems incorporating hardware accelerators
ARM Research Summit 2017 Workshop A novel way to efficiently simulate complex full systems incorporating hardware accelerators Nikolaos Tampouratzis Technical University of Crete, Greece Motivation / The
More informationKaguya s HDTV and Its Imaging
Kaguya s HDTV and Its Imaging C h a p t e r 2 Overview of the HDTV System In addition to Kaguya s 13 science instruments, the HDTV was unique in being specifically included to engage the public in the
More informationGPU-accelerated ray-tracing for real-time treatment planning
Journal of Physics: Conference Series OPEN ACCESS GPU-accelerated ray-tracing for real-time treatment planning To cite this article: H Heinrich et al 2014 J. Phys.: Conf. Ser. 489 012050 View the article
More informationParallel Implementation of Facial Detection Using Graphics Processing Units
Marquette University e-publications@marquette Master's Theses (2009 -) Dissertations, Theses, and Professional Projects Parallel Implementation of Facial Detection Using Graphics Processing Units Russell
More informationPerformance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster
Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Veerendra Allada, Troy Benjegerdes Electrical and Computer Engineering, Ames Laboratory Iowa State University &
More informationBlock Lanczos-Montgomery method over large prime fields with GPU accelerated dense operations
Block Lanczos-Montgomery method over large prime fields with GPU accelerated dense operations Nikolai Zamarashkin and Dmitry Zheltkov INM RAS, Gubkina 8, Moscow, Russia {nikolai.zamarashkin,dmitry.zheltkov}@gmail.com
More informationCSE 599 I Accelerated Computing - Programming GPUS. Advanced Host / Device Interface
CSE 599 I Accelerated Computing - Programming GPUS Advanced Host / Device Interface Objective Take a slightly lower-level view of the CPU / GPU interface Learn about different CPU / GPU communication techniques
More informationCUDA OPTIMIZATIONS ISC 2011 Tutorial
CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control
More informationTR An Overview of NVIDIA Tegra K1 Architecture. Ang Li, Radu Serban, Dan Negrut
TR-2014-17 An Overview of NVIDIA Tegra K1 Architecture Ang Li, Radu Serban, Dan Negrut November 20, 2014 Abstract This paperwork gives an overview of NVIDIA s Jetson TK1 Development Kit and its Tegra K1
More informationACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS
ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS Ferdinando Alessi Annalisa Massini Roberto Basili INGV Introduction The simulation of wave propagation
More information