Evaluation and Improvement of GPU Ray Tracing with a Thread Migration Technique

Similar documents
Fast BVH Construction on GPUs

CS427 Multicore Architecture and Parallel Computing

Bounding Volume Hierarchy Optimization through Agglomerative Treelet Restructuring

Ray Tracing. Computer Graphics CMU /15-662, Fall 2016

Acceleration Data Structures

EFFICIENT BVH CONSTRUCTION AGGLOMERATIVE CLUSTERING VIA APPROXIMATE. Yan Gu, Yong He, Kayvon Fatahalian, Guy Blelloch Carnegie Mellon University

A Hardware Pipeline for Accelerating Ray Traversal Algorithms on Streaming Processors

MSBVH: An Efficient Acceleration Data Structure for Ray Traced Motion Blur

PantaRay: Fast Ray-traced Occlusion Caching of Massive Scenes J. Pantaleoni, L. Fascione, M. Hill, T. Aila

6.837 Introduction to Computer Graphics Final Exam Tuesday, December 20, :05-12pm Two hand-written sheet of notes (4 pages) allowed 1 SSD [ /17]

S U N G - E U I YO O N, K A I S T R E N D E R I N G F R E E LY A VA I L A B L E O N T H E I N T E R N E T

Simpler and Faster HLBVH with Work Queues

Accelerating Ray-Tracing

Computer Graphics (CS 543) Lecture 13b Ray Tracing (Part 1) Prof Emmanuel Agu. Computer Science Dept. Worcester Polytechnic Institute (WPI)

Benchmark 1.a Investigate and Understand Designated Lab Techniques The student will investigate and understand designated lab techniques.

Row Tracing with Hierarchical Occlusion Maps

Portland State University ECE 588/688. Graphics Processors

HLBVH: Hierarchical LBVH Construction for Real Time Ray Tracing of Dynamic Geometry. Jacopo Pantaleoni and David Luebke NVIDIA Research

Interactive Ray Tracing: Higher Memory Coherence

Maximizing Parallelism in the Construction of BVHs, Octrees, and k-d Trees

Improving Memory Space Efficiency of Kd-tree for Real-time Ray Tracing Byeongjun Choi, Byungjoon Chang, Insung Ihm

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

Real-Time Ray Tracing Using Nvidia Optix Holger Ludvigsen & Anne C. Elster 2010

Lecture 2 - Acceleration Structures

COMP 4801 Final Year Project. Ray Tracing for Computer Graphics. Final Project Report FYP Runjing Liu. Advised by. Dr. L.Y.

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1

HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes.

Cross-Hardware GPGPU implementation of Acceleration Structures for Ray Tracing using OpenCL

Review for Ray-tracing Algorithm and Hardware

Simpler and Faster HLBVH with Work Queues. Kirill Garanzha NVIDIA Jacopo Pantaleoni NVIDIA Research David McAllister NVIDIA

Spatial Data Structures and Speed-Up Techniques. Tomas Akenine-Möller Department of Computer Engineering Chalmers University of Technology

INFOGR Computer Graphics. J. Bikker - April-July Lecture 11: Acceleration. Welcome!

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

Collision Detection with Bounding Volume Hierarchies

Photorealistic 3D Rendering for VW in Mobile Devices

Accelerated Ambient Occlusion Using Spatial Subdivision Structures

Ray Tracing with Multi-Core/Shared Memory Systems. Abe Stephens

Scene Management. Video Game Technologies 11498: MSc in Computer Science and Engineering 11156: MSc in Game Design and Development

RACBVHs: Random Accessible Compressed Bounding Volume Hierarchies

Enhancing Traditional Rasterization Graphics with Ray Tracing. October 2015

GPU Architecture and Function. Michael Foster and Ian Frasch

INFOMAGR Advanced Graphics. Jacco Bikker - February April Welcome!

Real Time Ray Tracing

Lecture 11: Ray tracing (cont.)

Lecture 4 - Real-time Ray Tracing

Efficient Clustered BVH Update Algorithm for Highly-Dynamic Models. Kirill Garanzha

A Parallel Access Method for Spatial Data Using GPU

Computer Graphics. Lecture 13. Global Illumination 1: Ray Tracing and Radiosity. Taku Komura

Enabling immersive gaming experiences Intro to Ray Tracing

Duksu Kim. Professional Experience Senior researcher, KISTI High performance visualization

Optimization solutions for the segmented sum algorithmic function

Parallel Processing of Ray Tracing on GPU with Dynamic Pipelining

Accelerating Geometric Queries. Computer Graphics CMU /15-662, Fall 2016

INFOMAGR Advanced Graphics. Jacco Bikker - February April Welcome!

Computer Graphics Ray Casting. Matthias Teschner

Single Scattering in Refractive Media with Triangle Mesh Boundaries

Parallel Computing: Parallel Architectures Jin, Hai

1. Introduction 2. Methods for I/O Operations 3. Buses 4. Liquid Crystal Displays 5. Other Types of Displays 6. Graphics Adapters 7.

Fast Stereoscopic Rendering on Mobile Ray Tracing GPU for Virtual Reality Applications

Grid-based SAH BVH construction on a GPU

A Real-time Rendering Method Based on Precomputed Hierarchical Levels of Detail in Huge Dataset

GPU-BASED RAY TRACING ALGORITHM USING UNIFORM GRID STRUCTURE

Lecture 11 - GPU Ray Tracing (1)

Ray Tracing Acceleration Data Structures

A GPU-Based Data Structure for a Parallel Ray Tracing Illumination Algorithm

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

Massively Parallel Hierarchical Scene Processing with Applications in Rendering

and Parallel Algorithms Programming with CUDA, WS09 Waqar Saleem, Jens Müller

Implementation Details of GPU-based Out-of-Core Many-Lights Rendering

S U N G - E U I YO O N, K A I S T R E N D E R I N G F R E E LY A VA I L A B L E O N T H E I N T E R N E T

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

GPU-AWARE HYBRID TERRAIN RENDERING

Accelerating Ray Tracing

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

Computer Graphics. Lecture 10. Global Illumination 1: Ray Tracing and Radiosity. Taku Komura 12/03/15

Adaptive Assignment for Real-Time Raytracing

CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST

Intersection Acceleration

CPSC / Sonny Chan - University of Calgary. Collision Detection II

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

Spring 2009 Prof. Hyesoon Kim

CS580: Ray Tracing. Sung-Eui Yoon ( 윤성의 ) Course URL:

Parallel Programming for Graphics

S U N G - E U I YO O N, K A I S T R E N D E R I N G F R E E LY A VA I L A B L E O N T H E I N T E R N E T

Fast continuous collision detection among deformable Models using graphics processors CS-525 Presentation Presented by Harish

Ray Tracing. Cornell CS4620/5620 Fall 2012 Lecture Kavita Bala 1 (with previous instructors James/Marschner)

Ray Tracing. Foley & Van Dam, Chapters 15 and 16

Ray Tracing Foley & Van Dam, Chapters 15 and 16

Ray Intersection Acceleration

A Bandwidth Effective Rendering Scheme for 3D Texture-based Volume Visualization on GPU

Specialized Acceleration Structures for Ray-Tracing. Warren Hunt

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)

Comparison of hierarchies for occlusion culling based on occlusion queries

B-KD Trees for Hardware Accelerated Ray Tracing of Dynamic Scenes

Goal. Interactive Walkthroughs using Multiple GPUs. Boeing 777. DoubleEagle Tanker Model

Hardware Accelerated Volume Visualization. Leonid I. Dimitrov & Milos Sramek GMI Austrian Academy of Sciences

Analysis-driven Engineering of Comparison-based Sorting Algorithms on GPUs

Divide and Conquer G-Buffer Ray Tracing

Computer Architecture

Transcription:

Evaluation and Improvement of GPU Ray Tracing with a Thread Migration Technique Xingxing Zhu and Yangdong Deng Institute of Microelectronics, Tsinghua University, Beijing, China Email: zhuxingxing0107@163.com, dengyd@tsinghua.eud.cn thread migration technique. The thread migration reduced the cache miss rate and improved the speed of the traversal. The contribution of the paper is that the proposed techniques significantly reduces cache miss rate. The miss rate of the primary was reduced by 34% to 51%, and the rate of the secondary was decreased by 62% to 76%. The resultant speed of the primary traversal is improved by 1.29 to 1.74 fold, while the speed of the secondary is improved by 1.85 to 2.59 X. To explore the performance of our algorithms, we implemented our algorithms in CUDA(Compute Unified Device Architecture) and benchmarked their performance on an NVIDIA GeForce GTX 560Ti GPU. We also used GPGPU-Simversion 3.x as our GPU emulator. The emulator configuration is based on the Fermi architecture with 15 SM(Streaming Multiprocessor, SM) cluster, each cluster contains 10 SM cores. Abstract Ray tracing is a computer graphics rendering technique. Different from the traditional rasterization algorithm, the tracing algorithm simulate the real vision process. Being able to deliver highly realistic graphics effects, it has been considered as the fundamental graphics rendering mechanism for high-end applications and is also likely to be adopted as the work-horse of future graphics hardware. However, the high computational effort is the major stumbling block for tracing to be deployed in real-time applications. In this paper we present a detailed characterization of the tracing algorithm by focusing on tracing algorithm based on BVH acceleration structure. Using a state of art graphics processor unit (GPU) simulator, we are able to provide key insight on improving the performance of tracing on GPUs. We propose thread migration technique to reduce the cache miss rate and enhance the processing throughput of primary and secondary s. Index Terms tracing, graphic processing unit, thread migration II. I. INTRODUCTION A. Ray Tracing Algorithm The basic idea of tracing is to trace the light from the observation point through each pixel on the view plane to the objects in the scene. We compute the color of the pixel according to the reflection and refraction characteristics of the surface and the light source. The light from the observation point to each pixel of the view plane is called the primary. The light generated at the intersection point of the scene according to the laws of reflection and refraction of the light is called the secondary. The essential stage of tracing algorithm is to search the objects that each intersects. The scene is composed of a variety of different elements such as triangles, polygons, polyhedrons, sphere, etc.. Among those elements triangle is the most commonly used, which is also compatible with the rasterization algorithm primitive. A moderate size scene contains 50K to 10M triangles. In order to improve space traversal performance, several acceleration data structures are widely used. It can reduce the complexity of algorithms to O log( N ) [3], while N is the number of triangles in Ray tracing is a computer graphics rendering technique [1]. Different from the traditional rasterization algorithm [2], it is based on the vision formation process. Since it is able to deliver highly realistic graphics effects, it has been considered as the technology of choice for high-end graphics. In addition, the tracing algorithm has a strong potential to be adopted as the work-horse of future consumer level graphics hardware. So far, it has been widely used in such fields as engineering design, film industry, entertainment, gaming, computer aided design(cad) and movie production. Especially, tracing based graphics have appeared in many Hollywood movies like Happy Feet, The Lord of the Rings and Avatar. However, the high computational demand is the major stumbling block for tracing to be adopted in real-time applications. Therefore, now the tracing technology can only be used in off-line mode. In this paper we present a detailed characterization of the tracing algorithm based on BVH acceleration structure. Using a state of art Graphics Processor Unit (GPU) simulator, we are able to provide key insight on improving the performance of tracing on GPUs. We also propose a the scene. The commonly used acceleration structures in tracing are Bounding Volume Hierarchy (BVH)[4] Manuscript received May 8, 2013; revised August 5, 2013 doi: 10.12720/ijsps.1.1.111-115 BACKGROUND 111

and kd-tree [5]. BVH is a tree structure to organize the Axis-Aligned Bounding Box (AABB) of special geometric objects hierarchically. Kd-Tree is a space partitioning data structure for organizing points in a k-dimensional space. Generally the performance of traversal on kd-tree is faster than on BVH [6], but BVH is more suitable for dynamic scenes for the coordinates of the AABB can be updated without reconstructing the whole tree. Therefore, we use BVH as the acceleration structure in this paper. General-purpose about the conference room scene. Each cluster contains 15 SM cores. B. General-purpose Graphic Processing Unit GPU(Graphics Processing Unit, GPU) is a graphic processing hardwarewhich was designed for rasterization algorithm. Modern GPU are mostly programmable so that is can also be used for other parallel computing applicationsgpu is a highly parallel, multithreaded, manycore processor with tremendous computational power and memory bandwidth. The NVIDIA Fermi architecture has 512 CUDA(Compute Unified Device Architecture) core, each CUDA core can operate one integer or floating-point instruction per clock cycle. There are 16 SM coreson one chip.each of them contains 32 CUDA cores. All threads in GPU are organized into blocks, and a certain number of threads in the block are organized into a warp. The warp can be mapped to the SM. III. Figure 1. Test scenarios(top-left: fairy forest, top-right: conference room, bottom-left: Stanforddragon, bottom-right: happy ), the render resolution is 1024*768. TABLE I. TEST SCENARIOS Vertex Triangle Fairy Forest Room Stanford Happy 100737 174117 BVH Node Leaf 57519 57519 166907 282759 103801 103801 435545 871414 299504 299504 543775 1087425 385901 385901 QUANTITATIVE ANALYSISOF THE RAY TRACING ALGORITHM A. Test In the paper we use our implementation of TimoAila s open source GPU tracer [7] to analyze some detailed characterization of tracing algorithm.in order to get the implementation details of the underlying hardware, we used the traditional GPU emulator GPGPU-Sim. The detail information is shown in Fig. 1 and Table I. B. The Cache Miss Rate In the many-core architecture, the global memory access latency will increase greatly with the increase of the numbers of SM cores, which is influenced by on-chip interconnection network and the limitation of the global memory port, and thus affect the scalability of the many-core architecture. For many-core architecture, the memory access latency is (described) in equation (1). [ ( Figure 2. Execution speed of the primary and secondary s on the conference room scene Fig. 2 shows the total processing speed significantly deviates from the linear. The secondary s even start to decline, when the number of the core cluster is more than 5. Fig. 3 shows that the number of the global memory access increaseswith the increase number of the SM cores, and thus lead to the increase of the network latency ( ). Due to are fixed memory latency, we can only optimize the cache miss rate. Consider the size of the cache is limited by the chip area, we try to change the way of access to reduce the cache miss rate. )](1) is the access latency to the cache in the core, is the access latency to the global memory-side secondary cache, is the access latency to the global memory, is the network latency from the SM core to the global memory, and are the cache miss rate. Fig. 2, Fig. 3 and Fig. 4 show the simulation results of the tracing program on the GPGPU-Sim emulator 112

In the thread migration, the memory address is divided into a few section, each section corresponds to a SM core. Only the corresponding SM core can access the data in the section. In the thread migration, the thread will migrate between the SM cores according to the accessed data. For example, if the thread(i)running on the SM core(t) needs to access the data address corresponding to the SM core(j)(i ), the information of thread(i) will be migrated to SM core(j). We design a thread migration program which takes the advantage of the characteristics of BVH traversal. We bind the sub-tree with the SM core and divide the traversal process into scheduling distribute stage and sub-tree traversal stage. During the scheduling stage, the thread will traversal the upper BVH tree, and will stop after it reaches the certain level of the BVH tree. Then the accessed sub-tree will be inserted into the waiting queue. During the sub-tree traversal stage, they will traversal the sub-tree using the BVH traversal algorithm. Fig. 5 shows the detailed information. Figure 3. Network latency and the cache miss rate Figure 4. The cache miss rate about the conference room scene(top: the primary cache miss rate, bottom : the secondary cache miss rate) Fig.4 shows that the cache miss rate of the primary is about 20%, and the cache miss rate of the secondary is about 40%. Such high degree of cache miss rate will result in frequent replacement (thrashing) of the data in the cache, affecting the performance of the cache. IV. RESULTS AND ANALYSIS OF THE RAY TRACING ALGORITHM BASED ON THREAD MIGRATION Figure 5. Scheduling distribute stage and sub-tree traversal stage(top: Scheduling distribute stage, bottom: sub-tree traversal stage) A. Thread Migration Traditionally, a will be bound to a thread in the tracing program running on GPU. All the s will be packaged and mapped to the SM cores, then do parallel computing on GPUs. That means we bind the threads with the SM core and the data, for each SM core to access is random. The random access to data will cause high degree of the cache miss rate and will greatly restrict the program s performance. Otherwise the information of the tracing program for each thread only includes the light source, direction and other relevant information, which is very simple. So we solve the problem from another standpoint. We bind data with the SM core, and the thread executes through the scheduling assignment. This is the thread migration [8]. B. Results and Analysis We use the GPGPU-Sim to approximately simulate the tracing algorithm based on the thread migration, In our design, each sub-tree traversal program running on the SM core only can traversal the sub-tree. This reduces the competition of the different threads of the data in cache and alleviates the cache miss rate. However, we cannot directly specify the thread on different SM cores in GPGPU-Sim emulator, but can only through the automatic allocation of hardware and software. Therefore, it is difficult to achieve the separation of different sub-tree traversal thread. We allocate a certain virtual cache to each sub-tree bounding on the SM core to achieve our goal. There are 113

m virtual caches, wherein m is the number of the sub-trees. Each virtual cache is as large as the real one andcorresponds to a curtain sub-tree. To run a sub-tree thread, only the corresponding virtual cache data can be accessed while the others cannot. But the data in the other virtual caches is still stored without being replaced. Thus the data is still available when we switch to another sub-tree thread. Due to the cache miss rate are the main problem the thread migration technique proposed to solve, the virtual cache can approximately simulate its behavior. TABLE II. Ray Primary Secondary THE MISS RATE IN THE TRAVERSAL STAGE Standard Cache miss rate Improved Improved/ standard (%) Fairy 0.2872 0.1409 49 0.2875 0.1888 66 0.3558 0.1736 49 0.3884 0.2300 59 Fairy 0.3891 0.0945 24 0.3558 0.0979 28 0.4818 0.1827 38 0.5007 0.1598 32 TABLE III. THE SIMULATION TIME IN THE TRAVERSAL STAGE Ray Primary Secondary Fairy Fairy Simulation cycle Standard Improved Improved /standard (%) 517168 399729 77 297066 223943 75 386685 221619 57 291061 186805 64 1748566 693091 40 1081839 583214 54 1581138 665433 42 3009938 1161525 39 Table II and Table III show the results simulated on the GPGPU-Sim. From the Table II we can see that the miss rate of the primary was reduced by 34% to 51%, and the rate of the secondary was decreased by 62% to 76%. From Table III we can see that the resultant speed of the primary traversal is improved by 1.29 to 1.74 fold, while the speed of the secondary is improved by 1.85 to 2.59 X. In addition, we can see that the accelerator ratio of the complex scenes is better than the simple scenes. This indicates that the thread migration technology has advantage in processing the complex scenes. The reason is that the random memory access in such cases is likely to cause the problem of clogging of memory port and interconnect network which can be solved by the thread migration technology. V. CONCLUSION In this paper we present a detailed characterization of the tracing algorithm by focusing on tracing algorithm based on BVH acceleration structure. Using a state of art graphics processor unit (GPU) simulator, we are able to provide key insight on improving the performance of tracing on GPUs. We propose thread migration technique to reduce the cache miss rate and enhance the processing throughput of primary and secondary s. ACKNOWLEDGMENT The traversal part of our code is the implementation of TimoAila s downloadable tracing. And the author would like to explicitly thank the anonymous ICSAP reviewers. REFERENCES [1] A. Appel, Some techniques for shading machine rendering of solids, in Proc. AFIPS Conf., Washington DC, 1968, pp. 37-45. [2] L. Szirmay-Kalos, T. Umenhoffer, B Toth, et al. Volumetric ambient occlusion for real-time rendering and games, IEEE Computer Graphics and Applications, vol. 30, pp. 70-79, Jan. 2010. [3] J. Pantaleoni and D. Luebke, HLBVH: Hierarchical LBVH construction for real-time tracing of dynamic geometry, in Proc. of the Conf. on High Performance Graphics, 2010, pp. 87-95. [4] H. Weghorst, G. Hooper, and D. P. Greenberg, Improved computational methods for tracing, ACM Transactions on Graphics, vol. 3, pp. 52-69, Jan 1984. [5] I. Wald, Fast construction of SAH BVHs on the Intel many integrated core (MIC) architecture, IEEE Trans. on Visualization and Computer Graphics, vol. 18, pp. 47-57, Jan. 2012. [6] A. Santos, J. M. Teixeira, T. Farias, et al., Understanding the efficiency of KD-tree -traversal techniques over a GPGPU architecture, International Journal of Parallel Programming, vol. 40, pp. 331-352, June 2012. [7] T. Aila and S. Laine, Understanding the efficiency of traversal on GPUs, in Proc. of the High Performance Graphics, New York, 2009, pp. 145-149. [8] M. Lis, S. K. Shim, O. Khan, and S. Devadas, Shared memory via execution migration, presented at International on Architectural Support for Programming Languages and Operating Systems, New York, 2011. XingxingZhu received her B.S. degree in In School of Information Science and Engineering from the Southeast University, Nanjing, China in 2009. She is currently pursuing the M.S. degree in Institute of Microelectronics in the University of Tsinghua, Beijing, China. Her research interests mainly focus on tracing algorithm and general purpose computing on graphics processing hardware (GPGPU). 114

Yangdong Deng received his Ph.D. degree in Electrical and Computer Engineering from Carnegie Mellon University, Pittsburgh, PA, in 2006. He received his ME and BE degrees in Electrical and ElectronicsDepartment from Tsinghua University, Beijing, in 1998 and 1995, respectively. His research interests include parallel electronic design automation (EDA) algorithms, parallel program optimization and general purpose computing on graphics processing hardware. 115