Web Physics: A Hardware Accelerated Physics Engine for Web- Based Applications

Similar documents
A Different Approach for Continuous Physics. Vincent ROBERT Physics Programmer at Ubisoft

GAME PROGRAMMING ON HYBRID CPU-GPU ARCHITECTURES TAKAHIRO HARADA, AMD DESTRUCTION FOR GAMES ERWIN COUMANS, AMD

Duksu Kim. Professional Experience Senior researcher, KISTI High performance visualization

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s)

GPGPU Applications. for Hydrological and Atmospheric Simulations. and Visualizations on the Web. Ibrahim Demir

WebCL Overview and Roadmap

What is a Rigid Body?

SIGGRAPH Briefing August 2014

Copyright Khronos Group Page 1. Vulkan Overview. June 2015

Parallel Systems. Project topics

Deformable and Fracturing Objects

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

Evaluation and Exploration of Next Generation Systems for Applicability and Performance Volodymyr Kindratenko Guochun Shi

Khronos and the Mobile Ecosystem

Physics Parallelization. Erwin Coumans SCEA

Next Generation OpenGL Neil Trevett Khronos President NVIDIA VP Mobile Copyright Khronos Group Page 1

Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory

Collision Detection of Triangle Meshes using GPU

WebGL Meetup GDC Copyright Khronos Group, Page 1

Collision Detection II. These slides are mainly from Ming Lin s course notes at UNC Chapel Hill

Khronos Connects Software to Silicon

General Purpose GPU Programming. Advanced Operating Systems Tutorial 9

Lesson 05. Mid Phase. Collision Detection

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Scene Management. Video Game Technologies 11498: MSc in Computer Science and Engineering 11156: MSc in Game Design and Development

Collision Detection with Bounding Volume Hierarchies

Radeon ProRender and Radeon Rays in a Gaming Rendering Workflow. Takahiro Harada, AMD 2017/3

Use cases. Faces tagging in photo and video, enabling: sharing media editing automatic media mashuping entertaining Augmented reality Games

Using Bounding Volume Hierarchies Efficient Collision Detection for Several Hundreds of Objects

OpenCL Press Conference

A Design Framework for Mapping Vectorized Synchronous Dataflow Graphs onto CPU-GPU Platforms

RACBVHs: Random Accessible Compressed Bounding Volume Hierarchies

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

Profiling and Debugging OpenCL Applications with ARM Development Tools. October 2014

How GPUs can find your next hit: Accelerating virtual screening with OpenCL. Simon Krige

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU

Enabling a Richer Multimedia Experience with GPU Compute. Roberto Mijat Visual Computing Marketing Manager

Numerical Algorithms on Multi-GPU Architectures

Anatomy of a Physics Engine. Erwin Coumans

CPU-GPU hybrid computing for feature extraction from video stream

GPU Programming Using NVIDIA CUDA

Warps and Reduction Algorithms

References. Additional lecture notes for 2/18/02.

Real-Time Hair Simulation and Rendering on the GPU. Louis Bavoil

Collision and Proximity Queries

Comparison of CPU and GPGPU performance as applied to procedurally generating complex cave systems

Current Trends in Computer Graphics Hardware

Execution Strategy and Runtime Support for Regular and Irregular Applications on Emerging Parallel Architectures

Copyright Khronos Group, Page 1. Khronos Overview. Taiwan, February 2012

GPGPU Lessons Learned. Mark Harris

General Purpose GPU Programming. Advanced Operating Systems Tutorial 7

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Direct Rendering of Trimmed NURBS Surfaces

Physically-Based Laser Simulation

How to Optimize Geometric Multigrid Methods on GPUs

Copyright Khronos Group 2012 Page 1. OpenCL 1.2. August 2012

Addressing Heterogeneity in Manycore Applications

ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors

GPU Memory Model Overview

Spatial Data Structures and Speed-Up Techniques. Tomas Akenine-Möller Department of Computer Engineering Chalmers University of Technology

Ray Casting of Trimmed NURBS Surfaces on the GPU

Press Briefing SIGGRAPH 2015 Neil Trevett Khronos President NVIDIA Vice President Mobile Ecosystem. Copyright Khronos Group Page 1

Further Developing GRAMPS. Jeremy Sugerman FLASHG January 27, 2009

CMPE 665:Multiple Processor Systems CUDA-AWARE MPI VIGNESH GOVINDARAJULU KOTHANDAPANI RANJITH MURUGESAN

Multi2sim Kepler: A Detailed Architectural GPU Simulator

High-performance Penetration Depth Computation for Haptic Rendering

Prospects for a more robust, simpler and more efficient shader cross-compilation pipeline in Unity with SPIR-V

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA

Sugar: Secure GPU Acceleration in Web Browsers

Announcements. Written Assignment2 is out, due March 8 Graded Programming Assignment2 next Tuesday

Challenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008

CS 220: Introduction to Parallel Computing. Introduction to CUDA. Lecture 28

Recent Advances in Heterogeneous Computing using Charm++

On-the-Fly Elimination of Dynamic Irregularities for GPU Computing

Cache Capacity Aware Thread Scheduling for Irregular Memory Access on Many-Core GPGPUs

Advanced Imaging Applications on Smart-phones Convergence of General-purpose computing, Graphics acceleration, and Sensors

Introduction to Collision Detection

Enhancing Traditional Rasterization Graphics with Ray Tracing. October 2015

WebGL, WebCL and Beyond!

Efficient Lists Intersection by CPU- GPU Cooperative Computing

CAB: Fast Update of OBB Trees for Collision Detection between Articulated Bodies

Real-Time Physics Simulation. Alan Hazelden.

GPGPU on ARM. Tom Gall, Gil Pitney, 30 th Oct 2013

Parallelizing Inline Data Reduction Operations for Primary Storage Systems

Chapter 3 Parallel Software

CPSC / Sonny Chan - University of Calgary. Collision Detection II

Open API Standards for Mobile Graphics, Compute and Vision Processing GTC, March 2014

John Hsu Nate Koenig ROSCon 2012

Press Briefing SIGGRAPH 2015 Neil Trevett Khronos President NVIDIA Vice President Mobile Ecosystem. Copyright Khronos Group Page 1

Dynamic Cuda with F# HPC GPU & F# Meetup. March 19. San Jose, California

Siggraph Asia December 2011

Fast continuous collision detection among deformable Models using graphics processors CS-525 Presentation Presented by Harish

Whiz-Bang Graphics and Media Performance for Java Platform, Micro Edition (JavaME)

T6: Position-Based Simulation Methods in Computer Graphics. Jan Bender Miles Macklin Matthias Müller

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka

Hybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

Viewport 2.0 API Porting Guide for Locators

Transcription:

Web Physics: A Hardware Accelerated Physics Engine for Web- Based Applications Tasneem Brutch, Bo Li, Guodong Rong, Yi Shen, Chang Shu Samsung Research America-Silicon Valley {t.brutch, robert.li, g.rong, c.shu, yi.shen}@samsung.com

Web Physics Summary Web Physics Motivation Goals Usages Approach Accomplis hments Transparent access to native 2D physics engine through JS APIs. OpenCL accelerated 2D Physics Engine (Box2D-OpenCL). High performance, interactive compute & graphics apps on mobile Transparent & efficient access to acceleration from web apps. Cross-platform, portable, accelerated, robust & flexible Real time mobile gaming; Compute intensive simulation; Computer Aided modeling; Physical effects & realism; Augmented Reality; Real-time big data visualization; Hybrid Web Physics approach: Accelerate using native multicore hardware & open parallel APIs. Expose parallelism through JS APIs to web applications. Web Physics JavaScript API Framework (Web Physics JS APIs) OpenCL accelerated 2D Physics Engine (Box2D-OpenCL) JavaScript Physics Engine (Box2DWeb 2.2.1 APIs) Games Media Apps Web API Web Physics JS Bindings OpenCL Accelerated Box2D--OpenCL Native Physics Engine OpenCL Device Driver Multicore CPU Web Apps JavaScript Libraries Native Apps Manycore GPGPU 2

Web Physics Approach

Web Physics Overview Architectural Decision Accelerated Box2D JavaScript bindings for physics engine APIs Modular architecture Preserve Box2D open source APIs Reasoning JS performance a concern for high compute use cases. OpenCL parallelization API for heterogeneous multicore devices (CPU, GPU) can provide needed performance. Portable Web APIs for compute intensive web apps. Cross platform application developer support Transparent access to accelerated library, without requiring parallelization of apps. Incremental parallelization of physics engine pipeline Ease of testing. Leverage existing ecosystem: To ensure that existing apps using Box2D physics engine can use Box2D-OpenCL 4

Web Physics Requirements 1. Transparent and efficient access to acceleration on client through JS APIs, from Web applications: JS Bindings implemented in WebKit-based browser engine 2. Near-native performance for high compute web simulation: Hybrid approach: Access to accelerated native physics engine through JS bindings <= 1.5X performance impact from JS binding, relative to native physics engine 3. Full feature support, and no API change: Complete Box2D features and API support in accelerated native physics engine No API change: Box2D-OpenCL APIs should be identical to those of Box2D 2.2.1 4. Complete JavaScript API support in JS physics engine for benchmarking Box2DWeb 2.2.1 JS physics engine supporting all Box2D 2.2.1 features. 5. Acceleration using open parallel API for heterogeneous platforms: OpenCL accelerated physics engine for multicore CPU and GPGPU (Box2D-OpenCL) 5

JavaScript Bindings

Web Physics Architectural Approaches Web Applications Browser/ Web Runtime WebKit Box2D Web API WebCore NPAPI Framework Box2D IPC Plugin Binding Box2D- OpenCL Web Applications Browser/ Web Runtime WebCore Box2D JS Binding WebKit Box2D Web API JavaScript Core Box2D-OpenCL Web Applications Browser/ WebRuntime WebCore WebKit Box2D Web API JavaScript Core Injected Bundle Box2D-OpenCL Dynamic Injection Box2D JS Binding Pros Plugin-Approach WebCore Approach Inject Bundle Browser portability. Maintainability: Less impact from Best comparative performance. physics engine changes No IPC overhead Code reusability Cons IPC communication expensive between plugin & web process Difficult to Maintain More complex Usable by native and web apps Portability Maintainability No IPC overhead Complexity Not Reusable across browsers 7

Box2DWeb 2.2.1 JavaScript Physics Engine Physics Engine written entirely in JavaScript. Existing open source project (box2dweb) had implemented Box2D 2.1.2 APIs. Box2DWeb 2.2.1 JS physics engine implements JS APIs corresponding to all Box2D 2.2.1 APIs. Used for benchmarking & comparative analysis. Currently in the process of being open sourced. WebCore Web Apps Box2DWeb2.2.1 JS Library Browser / WebRuntime WebKit JavaScript Core 8

Web Physics Binding Implementation ² IDL code generator features: + Constructor overloading + Support for constructor-type static read-only attribute in IDL Patch up-streamed to WebKit.org JSC specific binding classes, functions generated from Box2D IDLs. Code generator can support JSC & V8. ² Wrapper Layer features: + Interfaces with native Box2D physics engine. A glue layer from Box2D native engine to autogenerated JS binding classes. + Most code is JS engine independent + Complete support for Box2D classes + 1:1 mapping to Box2D native objects + Preserves Box2D tree structure + Callback function support using Listeners + Supports DebugDraw ² Performance overhead <= 1.5x relative to Box2D Includes parsing overhead of JavaScript code, and binding code overhead ² Implementation restrictions: Strict type checking, relative to Box2DWeb Native classes & functions written into IDL files, & their implementation added manually Web Physics JS Bindings Framework 9

OpenCL Acceleration

Box2D-OpenCL Parallelization Collision Detection: Broad Phase and Narrow Phase. Enforces natural physical constraints on simulated physical objects Detects collisions. Constraint Solver: Computes velocity and position constraints Updates velocities and positions of bodies Benchmarking: Collision Detection & Constraints Solver prioritized for parallelization ~64% of total time spent in Collision Detection stage ~21% of time spent in Broad Phase ~43% in Narrow Phase ~35% of total time spent in Solver 11

Collision Detection Broad Phase Goal: Quickly find pairs of objects that might collide each other, and cull out all pairs that cannot collide each other. Algorithm: Axis Aligned Bounding Box (AABB) used to approximate objects for fast testing Sequential Broad Phase: BVH (Bounding Volume Hierarchy) Traverse from top to bottom Good for sequential programs but not efficient for parallel programs due to low parallelism in higher levels of BVH Parallel Broad Phase: SaP (Sweep and Prune) [1] Compute an interval [x i, X i ] for each AABB O i Sort all AABBs by x_min For all AABBs O i, execute the followings in parallel: Sweep from x i to X i to find all O j with x j [x i, X i ] and i<j For each O j, if O i O j, output a pair (O i, O j ) [1] F. Liu, T. Harada, Y. Lee, and Y. J. Kim, Real-time collision culling of a million bodies on graphics processing units, ACM Trans. Graph., vol. 29, no. 6, pp. 154:1 154:8, 2010. 12

Collision Detection Narrow Phase Goal: For each pair generated by BP, test whether the two objects collide or not. If collide, generate a manifold for the pair. Algorithm: Separating Axis Theorem (SAT) used to test intersection of two objects. Sequential Narrow Phase: Loop over all pairs Check each pair using different functions for different types (e.g. polygon-polygon, polygon-edge, circleedge, etc.) Parallel NP (Solution 1): Execute a single kernel for all pairs, using different branches for different types inside the kernel: if polygon-polygon pair Compute for p-p type else if circle-edge pair Compute for c-e type Parallel NP (Solution 2): Execute a multiple kernels for different type of pairs, each kernel deal with one type of pair only. Ø Solution 2 has better performance 13

Collision Detection Narrow Phase At the end of NP stage, we compact all valid pairs (i.e. whose two objects collide with each other), to the head of the list for the next stage. input pairs (from BP) validity masks (1-valid, 0-invalid) exclusive scan results (b i =a 0 +a 1 + a i-1 ) A B C D E F G H I J K L M N O P 0 1 1 0 0 0 1 1 1 1 0 1 0 0 0 1 0 0 1 2 2 2 2 3 4 5 6 6 7 7 7 7 compacted valid pairs B C G H I J L P 14

Solver Goal: For each collided contact pair generated by collision detection, solve the physics constraints and update two bodies velocities and positions 1. Solve velocity constraints iteratively Compute relative velocities between bodies Compute impulse from the relative velocities Update velocities using the impulse 2. Solve position constraints iteratively Compute penetration and convert it to impulse Update positions using the impulse Parallel Solver: Accelerated pipeline for parallel computation of contact constraints 15

Box2D-OpenCL Parallel Solver Architecture Goal: Optimize every component in solver pipeline to get the best parallel performance 1. Integrate velocities: Compute all bodies in parallel Apply external forces to update body velocity 2. Cluster collided contact pairs into different contact groups All pairs in each group do not share the same body Allows parallel computation in each group 3. Impulse Initial Guess (Warm Start): In parallel for each contact group Last frame s impulse (if any) is used as an initial guess for current impulse This initial guess accelerates velocity constraint solver 4. Solve Velocity Constraints parallel for each contact group Compute impulse to solve contact constraints and update body velocities 5. Integrate Positions for all bodies in parallel Use the new computed velocity to update the position for each body 6. Solve Position Constraints in parallel for each contact group Compute impulse to update body positions to avoid penetration 16

Parallel Solver Features 1. Significant performance improvement for high speed simulation: 5X performance improvement relative sequential algorithms 2. Full feature support includes special joints types: Parallel implementation of different joint types: Distance, Joint/Revolute, Joint/Prismatic, Joint/Pulley, Joint/Gear, joint/rope, Joint, etc. 3. Transparency: Parallelization is transparent to JS app developer App developers do not need to know anything about physics engine parallelization 4. Easily extensible: Fluid, Smoke, Fracture, Sand, etc. 5. CPU/GPU parallelization: Portable across different platforms 17

Experimental Results

Performance Results Core i7 (8 cores), NVIDIA GT 650M (384 shader cores): Max speedup for Box2DWeb vs. Box2D+Binding & Box2DWeb vs. Box2D-OpenCL+Binding: Box2DWeb vs. Binding+Box2D: 8.39X Box2DWeb vs. Binding+Box2D-OpenCL: 23.58X Box2DWeb vs. Binding+Box2D vs. Binding+Box2D- OpenCL Box2DWeb Binding+Box2D Binding+Box2D- OpenCL Tizen 2.2.1: Max speedup for Box2DWeb vs. Binding+Box2D & Box2DWeb vs. Binding+Box2D-OpenCL: Box2DWeb vs. Binding+Box2D: 4.75X Box2DWeb vs. Binding+Box2D-OpenCL: 14.88X Box2DWeb vs. Binding+Box2D vs. Binding+Box2D- OpenCL Box2DWeb Binding+Box2D Binding+Box2D- OpenCL 19

Box2D-OpenCL Acceleration Performance data for rigid body pipeline of OpenCL parallelized BVH construction Box2D-OpenCL (without binding) Tested on system with Core i7-3770k CPU, Radeon HD 7770 (640 unified shaders) GPGPU BVH Broad Phase Box2D on CPU List of Contacts GPU Box2D on CPU List of bodies Narrow Phase List of updated bodies List of manifolds Constraint Solver Sequential GPGPU CPU Sequential GPGPU CPU Sequential GPGPU CPU Sequential GPGPU CPU Sequential performance numbers are for Box2D. GPGPU numbers are for Box2D-OpenCL on GPGPU. CPU numbers are for Box2D- OpenCL on multicore CPU. 20

Conclusion

Comprehensive Web Physics Testing Comprehensive Testing Testing categories: Unit tests Stress tests Code path coverage Benchmarking Demo apps Memory Leak tests Robustness testing Tested on multiple platforms Test Plan # Test Type Description Unit Tests for WP 1 Tested for functionality and correctness JavaScript APIs Repetitive construction & destruction of classes. 2 Stress Testing Continuous extended execution Code Path Coverage Native and JS applications to test Box2D, Box2D- 3 Testing OpenCL & Web Physics bindings for code coverage 4 Benchmarking & Port of Web Physics bindings to Tizen. Benchmarking Performance Analysis & performance analysis Demo applications for Web Physics (Magic Sands, 5 Demo apps for testing glbrownian, Touch&Play) Static & dynamic testing (using Valgrind & developer 6 Memory Leak testing tool). Box2D & Box2D-OpenCL tested with native & JS demos Negative test cases. Testing with invalid and 7 Robustness testing insufficient input. Complete 22

Conclusion ² Web Physics and Box2D-OpenCL results: ü OpenCL accelerated physics engine, with web-based JS interface ü Box2D-OpenCL: OpenCL accelerated rigid body pipeline. Exposes same API as Box2D ü JS Physics Engine (Box2DWeb 2.2.1): Soon to be open sourced ü Web Physics JS bindings & Box2D-OpenCL optimized for Tizen. ü Box2D-OpenCL open sourced (contributions to Box2D-OpenCL are invited) https://github.com/samsung/box2d-opencl Ø The presenters would like to acknowledge and thank, Simon Gibbs for project guidance Braja Biswal, Sumit Maheshwari, Gajendra N. for contributions to testing & bindings & development of demo applications Linhai Qiu and Chonhyon Park for contributions to parallelization Thank you! 23