Distributed Virtual Reality Computation

Jeff Russell
4/15/05

Introduction

Virtual Reality is generally understood today to mean the combination of digitally generated graphics, sound, and input. The goal of most VR systems is immersiveness: generating the illusion that the user is in fact in another reality. This is best accomplished with a multi-walled display system that surrounds the user with sound and images that respond to input. The applications of VR range from data visualization, engineering, and physics simulation to anthropology, art, and entertainment. One of the challenges in this field is computing power: where does one find the computing ability to drive interactive graphics and sound at such high resolutions? This paper briefly outlines a cluster-based approach to virtual reality computation.

The Traditional Approach

The traditional approach to immersive VR displays has been to use a single multiprocessor shared-memory computer to drive all aspects of a program: I/O, simulation, and rendering. This has been, and remains, an effective solution to the problems of multi-display VR. The shared-memory architecture of such machines ensures essentially no latency between display updates, and the multiple processors put a large amount of rendering and simulation power in a single bundle. Memory access and allocation is handled transparently by the operating system. Distributing computation on such machines is largely trivial from the user's perspective: the user simply makes graphics API calls, which are easily parallelized by the operating system or the device drivers. Silicon Graphics Inc. is famous for producing and selling such machines, and they continue to be used widely by academia, industry, and government.

The weakness of this approach becomes apparent, however, as soon as one tries to expand or upgrade such a system. The architectural challenges involved in engineering a user-expandable shared-memory computer are nontrivial, to say the least. This, paired with the high cost of such unique hardware and the pace of progress in microprocessor design, makes the single-bundle approach undesirable from a flexibility standpoint. Simply put: trying to build a bigger computer isn't the answer to VR.

The Cluster Approach

A single modern commodity desktop computer can of course effectively render three-dimensional graphics to a single display, and remain responsive while handling input, playing audio, and so on. The focus of computer graphics hardware design has in the last decade shifted to a market with far more economic potential than engineering and data visualization: home entertainment. Gaming consoles and desktop computers are now available for less than $1000 that provide a significant portion of the computational power of a multiprocessor graphics workstation costing tens of thousands of dollars or more. The obvious answer to the flexibility problem, then, would appear to be a VR system built from a cluster of these machines.

In the field of high performance computing, a clustered system is nothing new, and effective methods and tools have been developed to build such systems. The needs of VR, however, impose a new and rather severe restriction: latency. A virtual reality system requires that the user be able to manipulate objects on the display in real time, with no apparent delay between input and display state changes. Each node must update with input at the very least 20-30 times per second to maintain the illusion of continuous animation. This is somewhat of a paradigm shift from traditional HPC: rather than a single lengthy job that may take minutes or hours, this system is presented with an unending series of jobs, each of which must be completed in a fraction of a second. The latency imposed by data transfer among nodes becomes a very significant portion of the time the system has to render a frame, and it becomes clear that not all of the computations performed by a typical VR application can be distributed across memory spaces.

Dividing the Work

A typical VR application performs the following tasks every frame:

- Receive and send input to peripherals
- Run physics or animation
- Manipulate and translate geometry
- Render to display

Since a single node can easily perform all of these tasks for a single display, a simplifying step can be taken: the tasks that are not easily distributable can simply be duplicated across the nodes. That is, each node runs its own instance of the job, performing physics and geometry manipulations exactly redundant to those being done on the other nodes. This tradeoff unfortunately forfeits any benefit parallelization would yield in this area, but, more importantly, it removes a substantial amount of inter-node communication.
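The duplication strategy can be sketched in a few lines. The following is a toy model, not code from any real VR framework: the `Node` class and its fields are hypothetical stand-ins, and the "physics" is a one-dimensional integrator. The point it demonstrates is that if every node receives the same input stream and runs the same deterministic simulation, their world states never diverge, so no state needs to cross the network.

```python
class Node:
    """One cluster node; every node runs identical simulation code."""

    def __init__(self, view_id):
        self.view_id = view_id   # the only per-node difference: which wall it renders
        self.position = 0.0      # toy stand-in for the replicated world state
        self.velocity = 0.0

    def frame(self, input_impulse, dt=1.0 / 30.0):
        # 1. Apply the broadcast input -- identical on every node.
        self.velocity += input_impulse
        # 2. Run physics -- redundant across nodes, but deterministic,
        #    so the copies stay in lockstep with no state exchange.
        self.position += self.velocity * dt
        # 3. Render this node's own view (stubbed out here).
        return ("render", self.view_id, self.position)

nodes = [Node(v) for v in range(4)]        # four walls, four nodes
input_stream = [0.5, 0.0, -0.2, 0.0, 0.1]  # one input sample per frame

for impulse in input_stream:
    results = [n.frame(impulse) for n in nodes]

# Same inputs + same deterministic physics = identical world state everywhere.
assert all(n.position == nodes[0].position for n in nodes)
```

Only the input impulses would travel over the network in a real system; everything else in the loop is computed redundantly on each node.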

In most cases a given VR facility will only be able to connect I/O devices to a single node in the system, and in this situation network traffic is unavoidable. Fortunately, I/O data tend to be quite light, and these data, combined with synchronization commands, comprise the entirety of inter-node communication for each frame. Typically only one node will be connected to audio output, although in theory audio output could be distributed, with no added latency, across a number of nodes equal to the number of sound channels available.

Figure 1: A sample VR cluster setup driving a four-walled display. Courtesy Eric Olson.

The clustering benefits are greatest in one area: rendering. Each node in a VR cluster renders the same scene, consisting of the same set of textures, geometry, and so forth. However, each node renders a different view of this scene, which means that as far as pixels are concerned the task can be totally parallelized. Since rendering comprises the majority of the computation in most VR applications, this makes the clustering approach quite effective. With the exception of the minimal overhead of I/O transfer and synchronization, the rendering power of such a system can increase linearly with the number of nodes.

Figure 2: Performance data for the pantheon test application. Courtesy Eric Olson.

Figure 2 illustrates the difference in performance characteristics between a PC cluster and an SGI Onyx2, a multi-CPU shared-memory computer. The first thing to note is that the PC cluster outperforms the Onyx2 by a factor of 3:1 even for a single display, a testament to the growing power of modern commodity hardware. Much more significant, however, is that the PC cluster experiences almost no drop in frame rate as the number of displays (and nodes) increases, whereas the Onyx2 pays a hard price for every active display. It is worth mentioning that even a four-computer cluster is an order of magnitude less expensive than a single Onyx2.

The Epitome of Parallelism: The GPU

The GPU, or Graphics Processing Unit, is a specialized chip designed specifically for interactive rendering. While any CPU can of course render 3D graphics for VR, the custom design of the GPU makes it several orders of magnitude faster at this task. Everyday PCs usually contain at least a modest GPU for 2D and 3D video acceleration, and nodes in a VR cluster are certainly no exception.

The design of the GPU is highly constrained and specialized. The chip typically accepts as input only a set of texture images and geometry, which it quickly renders to an on-board frame buffer for display. Figure 3 is a greatly simplified diagram of a GPU rendering pipeline. Through a graphics API such as OpenGL, the user configures the way geometry will be rendered to the screen, and then passes a pointer to an array of vertices to be rendered. These vertices are processed by the vertex units on the GPU. Since each vertex can generally be transformed and lit independently of the others, a modern GPU will typically have 4-8 separate vertex pipelines, which together can transform roughly 400-800 million triangles per second.

Each triple of vertices forms a triangle on the screen, which must then be filled with pixels. The results of the vertex pipelines are fed through a triangle setup engine and then passed to the pixel pipelines. A typical GPU has anywhere from 4 to 16 pixel pipes, each deeply pipelined and independent of the others. The pixel units are responsible for accessing texture images, performing lighting and shading, and other effects. These pipelines render pixels to the on-board frame buffer at a tremendous rate, on the order of 2-8 billion pixels per second.
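To see why this design parallelizes so well, consider the toy CPU-side sketch below. It is not GPU code; the `transform` function and the uniform scale it applies are hypothetical stand-ins for the real transform-and-light stage. What it illustrates is the data independence the pipeline exploits: each vertex is processed with no reference to any other, and triangle setup then simply groups the results in triples.

```python
def transform(vertex, scale=2.0):
    # Stand-in for the per-vertex transform-and-light stage. It reads only
    # its own vertex -- this independence is what lets a GPU run many
    # vertex pipelines side by side.
    x, y, z = vertex
    return (x * scale, y * scale, z * scale)

# Two triangles' worth of input vertices (six vertices total).
vertices = [(0, 0, 0), (1, 0, 0), (0, 1, 0),
            (1, 0, 0), (1, 1, 0), (0, 1, 0)]

# "Vertex pipelines": every vertex is transformed independently.
transformed = [transform(v) for v in vertices]

# "Triangle setup": each triple of transformed vertices forms one triangle,
# which would then be handed to the (equally independent) pixel pipelines.
triangles = [tuple(transformed[i:i + 3]) for i in range(0, len(transformed), 3)]

assert len(triangles) == 2
assert triangles[0][1] == (2.0, 0.0, 0.0)
```

The pixel-fill stage has the same character: each pixel of a triangle can be shaded independently, which is why GPUs replicate pixel pipes just as freely.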

Figure 3: A simplified GPU pipeline.

The result of all this parallelization and specialization is a rendering device hundreds of times faster than a general-purpose CPU. The statistics are staggering: a modern GPU can reach floating-point performance of 200 gigaflops or more when fully saturated. By comparison, a fast CPU will generally perform in the 10-20 gigaflop range, and is further limited by its general-purpose memory and cache structure.

The benefits of a GPU do not end with sheer rendering speed. An off-CPU renderer leaves the CPU free to perform other calculations while rendering completes. This means that the complexity of a VR application can now be far less limited by rendering complexity, and gives programmers room to implement detailed physics simulations or other expensive computations. All of this parallelism is, thankfully, present within a single cluster node, avoiding any network traffic between nodes for the purposes of rendering.

Other Considerations

To allow for synchronous display updates, all nodes in a VR cluster must post their newest frames to the display at the same time, which can only be done after every node has finished rendering. Since the rendering load for each node is fixed, there is no benefit to having some machines in the cluster be faster than others: the faster machines will only end up waiting for the slower ones at the end of each frame. Hardware for the cluster nodes is easily updated, although care must be taken to update all of the machines to get any benefit from the new devices. Developers who need more CPU cycles than a single processor can provide have the option of purchasing multiprocessor nodes, or simply using MPI to distribute computation across several nodes; this must of course be done with the aforementioned latency concerns in mind.

Several tools exist to provide a framework for clustered VR development. Foremost among them is VR Juggler [www.vrjuggler.org], which allows a single application to run on a cluster, as a standalone program, or on a large multiprocessor shared-memory workstation. These tools remove the burden of distributed computing and threading from the user, allowing rapid VR development across a multitude of platforms.
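The synchronized frame posting described above amounts to a per-frame barrier. The sketch below models it with threads standing in for cluster nodes (a real system would use network messages or an MPI barrier; the names and the thread-per-node setup are illustrative assumptions). No node "swaps buffers" until every node has finished rendering, which is also why a faster machine gains nothing: it just reaches the barrier sooner and waits.

```python
import threading

NUM_NODES = 4
swap_barrier = threading.Barrier(NUM_NODES)  # one party per cluster node
frames_posted = []
post_lock = threading.Lock()

def node_main(rank, frame_input):
    # ... each node renders its own wall's view from the broadcast input ...
    rendered = f"wall-{rank}:{frame_input}"
    # Wait until every node has finished rendering this frame; only then
    # may anyone post (swap) its buffer, keeping all walls in sync.
    swap_barrier.wait()
    with post_lock:
        frames_posted.append(rendered)

threads = [threading.Thread(target=node_main, args=(r, "frame-0"))
           for r in range(NUM_NODES)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# All four walls updated together -- none posted early.
assert len(frames_posted) == NUM_NODES
```

In a framework such as VR Juggler this synchronization is handled for the application automatically; the sketch only shows the shape of what happens each frame.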