Exploratory Visualization of Petascale Particle Data in Nvidia DGX-1


Benjamin Hernandez, PhD
Advanced Data and Workflows Group
Oak Ridge Leadership Computing Facility
Oak Ridge National Laboratory
ORNL is managed by UT-Battelle for the US Department of Energy

Oak Ridge Leadership Computing Facility (OLCF)
Mission: Provide the computational and data resources required to solve the most challenging problems.
Highly competitive user allocation programs (INCITE, ALCC).
Projects receive 10x to 100x more resources than at other generally available centers.
We partner with users to enable science and engineering breakthroughs.

Sight: Exploratory Visualization of Scientific Data
Client/server architecture to provide high-end visualization on laptops, desktops, and powerwalls.
Heterogeneous scientific visualization
  Take advantage of both CPU and GPU resources within a node: DGX-1 use case.
  Advanced shading to enable new insights into data exploration.
Parallel I/O and data staging
Pluggable for in-situ visualization
Lightweight tool
  Load your data
  Perform exploratory analysis
  Visualize/save results

Sight System Architecture (in progress)
[Diagram components: Local/Parallel File System, HPC Cluster, ADIOS I/O System, VTK-m, VTK-m Compression, OSPRay* (CPU cores), Nvidia OptiX* (multi-GPU), Visualization Frames, WebSockets, Server (DGX-1 or multi-GPU node), HTML Client]
*OSPRay and Nvidia OptiX are finely tuned libraries for ray tracing on multicore and manycore architectures.

ADIOS
ADIOS is an I/O framework.
It provides multiple methods to stage data to a staging area (on node, off node, off machine).
Data output can be anything one wants.
Different methods allow for different types of data movement, aggregation, and arrangement to the storage system, or streaming over local nodes, LAN, or WAN.
It contains its own file format, if you choose to use it (ADIOS-BP).
It compresses and decompresses data in parallel.
It contains mechanisms to index and query data.

First Approach: OpenGL
VBO (points)
Vertex shader (V.S.): apply the transfer function
Geometry shader (G.S.): generate quads with texture coordinates
Fragment shader (F.S.): sphere generation and shading

OpenGL Bindless Graphics
Initialization
Address pointer
Display
Vertices span from vboaddress to vboaddress + sizeof(float) * size.

Fragment Shader Sphere Generation
Sphere equation: r^2 = (x - x_0)^2 + (y - y_0)^2 + (z - z_0)^2
Quad corners in texture coordinates: (-1,1), (1,1), (-1,-1), (1,-1)
r = 1.0, z = 1.0

    x = texcoord.x
    y = texcoord.y
    zz = 1.0 - x*x - y*y
    if (zz <= 0.0)   // removes fragments outside the sphere
        discard;
    // scale to the desired radius
    // calculate diffuse illumination
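The per-fragment test above can be checked on the CPU. The following is a minimal Python sketch, not the shader used in Sight: the function names `sphere_impostor_fragment` and `lambert`, the returned tuple, and the choice of a Lambert term for the "diffuse illumination" step are all assumptions made for illustration.

```python
import math

def sphere_impostor_fragment(tex_x, tex_y, radius=1.0):
    """Evaluate one fragment of a unit-sphere impostor drawn on a quad.

    tex_x, tex_y are the quad's texture coordinates in [-1, 1].
    Returns None when the fragment would be discarded, otherwise
    (normal, depth_offset). Hypothetical helper, for illustration only.
    """
    zz = 1.0 - tex_x * tex_x - tex_y * tex_y
    if zz <= 0.0:
        return None  # fragment lies outside the sphere silhouette
    z = math.sqrt(zz)
    normal = (tex_x, tex_y, z)      # already unit length on a unit sphere
    depth_offset = radius * z       # scale to the desired radius
    return normal, depth_offset

def lambert(normal, light_dir):
    """Diffuse term max(0, n . l) with l normalized (assumed shading model)."""
    length = math.sqrt(sum(c * c for c in light_dir))
    unit_light = tuple(c / length for c in light_dir)
    return max(0.0, sum(n * c for n, c in zip(normal, unit_light)))
```

A fragment at the quad corner (1, 1) is discarded, while the quad center yields the normal (0, 0, 1), which is fully lit by a light along +z.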

Results

OpenGL Multi-GPU Rendering
One MPI task for each device
  Easy to implement.
  Each device initializes its own GLX/EGL context.
Multi-threading: one thread per device
  This is possible in EGL:
  Create the main context in the main thread:
    mainctx = eglCreateContext(display, config, 0, contextattrs);
  Each additional thread creates a shared context:
    lclthrdctx = eglCreateContext(display, config, mainctx, contextattrs);
  Implement mutexes/semaphores to synchronize any updates.
Vulkanize your viz!
  Devices are aware of other devices and can coordinate with each other.
  That is precisely what Nvidia OptiX can do.
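The thread-per-device pattern with a mutex around shared updates can be sketched without any graphics API. The Python example below is only an analogy under stated assumptions: no EGL call is made, `render_on_devices` is a hypothetical name, and the "rendered" value is just the device id standing in for real pixel data.

```python
import threading

def render_on_devices(num_devices, num_rows):
    """Simulate one worker thread per device writing rows of a shared image.

    Rows are assigned round-robin; each worker stores its device id in the
    rows it owns. Hypothetical stand-in for per-device rendering.
    """
    framebuffer = [None] * num_rows
    lock = threading.Lock()  # guards updates to the shared framebuffer

    def worker(device_id):
        # A real renderer would create its shared context for this device here.
        for row in range(device_id, num_rows, num_devices):
            value = device_id  # placeholder for the rendered row
            with lock:
                framebuffer[row] = value

    threads = [threading.Thread(target=worker, args=(d,))
               for d in range(num_devices)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return framebuffer
```

With two simulated devices and six rows, the rows alternate between device 0 and device 1.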

Second Approach: Nvidia OptiX Ray Tracing Engine
The OptiX API is an application framework for achieving optimal ray tracing performance on the GPU.
It provides a simple, recursive, and flexible pipeline for accelerating ray tracing algorithms.
Similar to OpenGL, it does the heavy lifting of ray tracing and leaves capability and technique to the developers.
It can also use all GPUs available in your system.
It naturally fits material appearance and scene illumination.

Nvidia OptiX Programming Model
OptiX provides eight programmable components. Some of them are:
1. Ray generation
2. Intersection
3. Shading (closest hit)
4. Shadows (any hit)
5. Selector
Shaders use a CUDA-like syntax.

Nvidia OptiX Graph Nodes
The scene is defined by graph nodes, a tree-like hierarchy where:
Nodes at the bottom describe geometric objects.
Nodes at the top describe collections of geometric objects.
[Diagram: example hierarchy of Group, Transform, Selector, Acceleration, and Instance nodes]

Nvidia OptiX Graph Nodes
Keep the hierarchy as flat as possible.
But not too flat!
[Diagrams: Group, Acceleration, and Instance nodes placed over one or several sets of particles]
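One way to read "flat, but not too flat" is a single top-level group whose children each hold a contiguous slice of the particle array. The slides do not say how the particles were split, so the even partitioning below is an assumption, and `partition_particles` is a hypothetical helper, not an OptiX call.

```python
def partition_particles(num_particles, num_chunks):
    """Split indices [0, num_particles) into num_chunks contiguous ranges.

    Returns a list of (start, end) half-open ranges whose sizes differ by
    at most one. Each range could back one geometry instance under a
    single top-level group (assumed layout, for illustration only).
    """
    base, extra = divmod(num_particles, num_chunks)
    ranges = []
    start = 0
    for i in range(num_chunks):
        size = base + (1 if i < extra else 0)  # spread the remainder
        ranges.append((start, start + size))
        start += size
    return ranges
```

For example, 10 particles in 4 chunks gives the ranges (0, 3), (3, 6), (6, 8), and (8, 10).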

Results: Test Systems
Workstation: CPU Intel Xeon, 20 cores, 512 GB; GPU Titan Z (2 GeForce Kepler GPUs, 2x6 GB VRAM); Ubuntu 16; Nvidia Driver
Rhea node: CPU Intel Xeon; GPU 2x Tesla K80 (4 Tesla Kepler GPUs, 2x24 GB VRAM); Red Hat 7; Nvidia Driver
DGX-1: CPU Intel; GPU 8x Tesla Pascal SMX, 8x16 GB VRAM; Ubuntu 16; Nvidia Driver
All systems: CUDA 8.0, OptiX
Acceleration structure: Trbvh
Image resolution: 1080p
Shading: Phong illumination and ambient occlusion

Results: How fast is the acceleration structure built?
[Chart: build time in ms (lower is better) for 1, 10, and 20 million particles on the workstation, the Rhea node, and DGX-1]

Results: Performance (lower is better)
[Charts: ms per frame versus millions of particles for the workstation, the Rhea node, and DGX-1, in two panels, "Frame rate (worst case)" and "Frame rate (best case)", with 30 fps and 60 fps reference lines]
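The charts report milliseconds per frame against 30 fps and 60 fps reference lines, and the two units convert as ms = 1000 / fps. A small Python sketch of that arithmetic follows; the helper names are hypothetical and no measured value from the slides is used.

```python
def frame_time_ms(fps):
    """Frame-time budget in milliseconds for a target frame rate."""
    return 1000.0 / fps

def meets_target(ms_per_frame, target_fps):
    """True when a measured frame time fits within the target's budget."""
    return ms_per_frame <= frame_time_ms(target_fps)
```

A frame that takes 20 ms fits the 30 fps budget (about 33.3 ms) but not the 60 fps budget (about 16.7 ms).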

Results

Discussion
DGX-1 can handle particle systems up to 10x larger in our test environment.
For particle systems of the same size, DGX-1 is 10x faster than the workstation and 4.6x faster than the Rhea node.
We expect the DGX-1 speedup to increase at larger image resolutions.
  Our preliminary tests showed that DGX-1 has enough compute power to drive a powerwall (3840 x ..., ... fps).
  Test larger resolutions.
Researchers are usually happy when they can explore datasets even at 1 fps.

Discussion
Nvidia OptiX provides multi-GPU support with no hassle.
  Test whether Nvidia OptiX leaves free resources for analysis tasks.
  Paging was removed in OptiX 4.x.
DGX-1 includes 40 CPU cores and 512 GB of RAM.
  Using Nvidia OptiX together with the OSPRay library will allow full system allocation to handle larger systems.
Summit is likely to support EGL through the Nvidia GPU drivers (do not take this as a fact, or as an alternative fact either!).
Best if (pre)exascale visualization tools are 100% CUDA compliant.

Questions?
Benjamin Hernandez, PhD
hernandezarb@ornl.gov
Advanced Data and Workflows Group
Oak Ridge Leadership Computing Facility
Oak Ridge National Laboratory

Acknowledgements: Dylan Lacewell and the Nvidia OptiX team for their technical support. Datasets provided by Cheng-Yu Shi and Leonid Zhigilei from the Computational Materials Group at the University of Virginia. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

Further Reading
Tom True, Alina Alt (2013). Configuring, Programming and Debugging Applications for Multiple GPUs. GTC 2013.
Wil Braithwaite. Multi-GPU Programming for Visual Computing. SIGGRAPH 2013.
Available in GTC on Demand.
ADIOS Manual
OptiX Tutorial
Talks

2018 INCITE Call for Proposals
The 2018 INCITE Call for Proposals opened April 17, 2017 and closes June 23, 2017.
Features large allocations of computer time and supporting resources at the Argonne and Oak Ridge Leadership Computing Facility (LCF) centers, operated by the US Department of Energy (DOE) Office of Science.
Soliciting research proposals for awards of time on the 27-petaflop Cray XK7, Titan, and the 10-petaflop IBM Blue Gene/Q, Mira. In addition, certain 2018 INCITE awards will receive time on Argonne's new Intel/Cray system, a 9.65-petaflop system called Theta.
The INCITE program seeks research proposals for capability computing:
  Production simulations, including ensembles, that use a large fraction of the LCF systems, or
  Proposals that require the unique LCF architectural infrastructure for high-performance computing projects that cannot be performed elsewhere.
The INCITE program is open to US and non-US based researchers.
The INCITE program invites you to participate in an INCITE Proposal Writing Webinar, offered on April 19, May 18, and June 6.

Results: How fast is the acceleration structure built?
[CUDA profiler summaries per system. Only the Time(%) values and the units of the other columns survive; the Calls column is empty.]

Workstation
Time(%)  Time  Avg  Min  Max  Name
49.63%   ms    us   us   ms   [CUDA memcpy HtoD]
22.64%   ms    ms   ms   ms   Megakernel_CUDA_
%        ms    ms   us   ms   [CUDA memcpy DtoH]
6.74%    ms    ms   ms   ms   Megakernel_CUDA_0
0.30%    us    us   us   us   [CUDA memcpy HtoA]
0.21%    us    us   us   us   [CUDA memset]
0.05%    us    us   us   us   [CUDA memcpy DtoD]

Rhea node
Time(%)  Time  Avg  Min  Max  Name
50.29%   ms    us   us   ms   [CUDA memcpy HtoD]
34.61%   ms    us   us   ms   [CUDA memcpy DtoH]
9.48%    ms    us   us   ms   Megakernel_CUDA_1
5.03%    ms    ms   ms   ms   Megakernel_CUDA_0
0.36%    us    us   us   us   [CUDA memset]
0.17%    us    us   us   us   [CUDA memcpy HtoA]
0.06%    us    us   us   us   [CUDA memcpy DtoD]

DGX-1
Time(%)  Time  Avg  Min  Max  Name
43.50%   ms    us   us   ms   [CUDA memcpy HtoD]
30.77%   ms    us   us   ms   [CUDA memcpy DtoH]
18.13%   ms    us   us   us   [CUDA memcpy PtoP]
6.16%    ms    us   us   us   Megakernel_CUDA_1
1.11%    us    us   us   us   Megakernel_CUDA_0
0.14%    us    us   us   us   [CUDA memcpy HtoA]
0.13%    us    us   us   us   [CUDA memset]
0.06%    us    us   us   us   [CUDA memcpy DtoD]


ADIOS I/O
Abstracts metadata, data types, and dimensions from the source code into an XML file.
Language bindings: C, Fortran
Data transforms: zlib, bzip2, szip, zfp, isobar, Alacrity
All data in adios_write() calls are buffered before writing to the file system.
Transport methods: POSIX, MPI, MPI_LUSTRE, PHDF5, DATASPACES, DIMES, FLEXPATH, ICEE

ADIOS I/O
Generate the C code from the XML file:
  gpp.py atoms.xml
  Output: gwrite_atoms.ch, gread_atoms.ch
Both files contain code to write and read ADIOS files.
You only need to modify your XML file and generate new *.ch files.
The main code remains the same.

ADIOS Write/Read Example
[Code listings: Write, Read]
