Combining NVIDIA Docker and databases to enhance agile development and optimize resource allocation


1 Combining NVIDIA Docker and databases to enhance agile development and optimize resource allocation
Chris Davis, Sophie Voisin, Devin White, Andrew Hardin
Scalable and High Performance Geocomputation Team
Geographic Information Science and Technology Group
Oak Ridge National Laboratory
GTC 2017, May 2017
ORNL is managed by UT-Battelle for the US Department of Energy

2 Outline
- Background
- Example HPC Application
- Study Results
- Lessons Learned / Future Work

3 The Story
We are:
- Developing an HPC suite of applications
- Spread across multiple R&D teams
- In an Agile development process
- Delivering to a production environment
- Needing to support multiple systems / multiple capabilities
- Collecting performance metrics for system optimization

4 Why We Use NVIDIA-Docker
[Figure: comparison of NVIDIA-Docker, Docker, and a Virtual Machine on Resource Optimization, Operating Systems, GPU Access, and Flexibility]

5 Hardware
Quadro: Compute + Display
- Cards: M4000, P6000
- Table rows: Capability, Block, SM, Cores, Memory
- Memory: 8GB (M4000), 24GB (P6000)

6 Hardware
Tesla: Compute Only
- Cards: K40, K80
- Table rows: Capability, Block, SM, Cores, Memory
- Memory: 12GB (K40), 12GB (K80)

7 Hardware
High End: DELL C4130
- GPU: 4 x K80
- RAM: 256GB
- Cores: 48
- SSD Storage: 400GB

8 Constructing Containers
Build Container:
- Based off NVIDIA Images at gitlab.com: CentOS 7, CUDA 8.0 / 7.5, cudnn 5.1, GCC
- Cores: 24
- Mount local folder with code
- Compile against chosen compute capability
- Copy product inside container
- docker commit container updates to new image
- docker save to Isilon
[Diagram: HPC Server (Local Drive, Container, NVIDIA-Docker, CPUs, GPUs), PostgreSQL (Compile Stats, Profile Stats), Git Repo, Isilon (Containers, Data)]
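The build steps above can be sketched as a list of command lines. This is a minimal illustration, not the team's actual tooling: the base image name, the `/src/build.sh` script, the `app:cc<capability>` tag scheme, and all paths are hypothetical placeholders. Only the `run -v`, `docker commit`, and `docker save -o` subcommands come from the standard Docker CLI.

```python
import shlex


def build_commands(container, base_image, code_dir, capability, archive):
    """Return the command lines for one container build, in order."""
    new_image = f"app:cc{capability}"  # hypothetical tag scheme
    return [
        # Start a named container with the local code folder mounted at /src
        # and run a (hypothetical) build script for the chosen capability.
        ["nvidia-docker", "run", "--name", container,
         "-v", f"{code_dir}:/src", base_image,
         "/src/build.sh", capability],
        # Snapshot the container, including the compiled product, as a new image.
        ["docker", "commit", container, new_image],
        # Write the new image to an archive on shared storage.
        ["docker", "save", "-o", archive, new_image],
    ]


if __name__ == "__main__":
    for cmd in build_commands("build-cc37", "cuda-centos7-base:8.0",
                              "/home/user/code", "37", "/isilon/app-cc37.tar"):
        print(shlex.join(cmd))
```

Returning the commands as data instead of executing them makes it easy to log each step alongside its timing before passing it to `subprocess.run`.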

9 Running Containers
For each compute capability:
- docker load from Isilon storage
- Run container & profile script
- Send nvprof results to Profile Stats DB
- Container/Image removed
[Diagram: PostgreSQL (Compile Stats, Profile Stats), Isilon (Containers, Data), HPC Server (Local Drive, Container, NVIDIA-Docker, CPUs, GPUs)]
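A rough sketch of the per-capability run cycle follows, again as command lines only. The image tag, the `/opt/app/run.sh` entry point, and the log paths are assumptions made for illustration. The `nvprof` options `--csv` and `--log-file` are real profiler flags. Uploading the parsed results to the Profile Stats database is left out.

```python
def profile_commands(archive, image, datasets, log_dir):
    """Command lines to load one image, profile each data set, and clean up."""
    cmds = [["docker", "load", "-i", archive]]
    for dataset in datasets:
        log = f"/logs/nvprof_{dataset}.csv"  # path inside the container
        cmds.append([
            # --rm removes the container as soon as the run finishes.
            "nvidia-docker", "run", "--rm",
            "-v", f"{log_dir}:/logs", image,
            "nvprof", "--csv", "--log-file", log,
            "/opt/app/run.sh", dataset,  # hypothetical application entry point
        ])
    # Remove the image so the next compute capability starts from a clean host.
    cmds.append(["docker", "rmi", image])
    return cmds


if __name__ == "__main__":
    for cmd in profile_commands("/isilon/app-cc37.tar", "app:cc37",
                                ["D1", "D2", "D3", "D4"], "/tmp/profiles"):
        print(" ".join(cmd))
```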

10 Hooking It All Together
- One server generates containers
- All servers pull containers from Isilon
- Data to be processed pulled from Isilon
- Container build stats stored in Compiler DB
- Container execution stats stored in Profiler DB
[Diagram: three HPC Servers (Local Drive, Container, NVIDIA-Docker, CPUs, GPUs) connected to the Git Repo, Isilon (Containers, Data), and PostgreSQL (Compile Stats, Profile Stats)]

11 Profiling Combinations
- nvprof output parsed and sent to Profile DB
- Containers for:
  - CUDA version (7.5, 8.0)
  - Each capability
  - All capabilities
  - CPU only
- Data sets: 4 (D1, D2, D3, D4)
- Total of 104 profiles
[Diagram: GPUs (K40, K80, M4000, P6000) and CPU, compute capabilities including 3.0, 3.5, 6.0, and 6.1, and data sets D1 to D4]


12 Database
Postgres Databases: Compile DB, Run Time DB, NVPROF DB
Shared Fields: Compute Capability, Hostname, CUDA Version, Num CPU Threads
Other fields across the three databases: Compile Time, Execution Time, GPU Device, Dataset, Kernel / API Call, Step Name, Step Time, Percent, Num Calls, Ave Time, Min Time, Max Time, Timestamp
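To make the shared fields concrete, here is a small sketch of a compile-statistics table and one summary query. It is an assumption-laden illustration: it uses Python's built-in `sqlite3` in place of PostgreSQL so that it runs anywhere, the table and column names are invented, and the assignment of Compile Time and Timestamp to this table is a guess at the slide's layout. The rows in the usage section are made-up placeholders, not measurements from the study.

```python
import sqlite3

# Shared fields from the slide plus a compile time and a timestamp.
SCHEMA = """
CREATE TABLE compile_stats (
    compute_capability TEXT,
    hostname           TEXT,
    cuda_version       TEXT,
    num_cpu_threads    INTEGER,
    compile_time_sec   REAL,
    recorded_at        TEXT DEFAULT CURRENT_TIMESTAMP
)
"""


def record_compile(conn, capability, hostname, cuda_version, threads, seconds):
    """Store one build result."""
    conn.execute(
        "INSERT INTO compile_stats (compute_capability, hostname, cuda_version,"
        " num_cpu_threads, compile_time_sec) VALUES (?, ?, ?, ?, ?)",
        (capability, hostname, cuda_version, threads, seconds),
    )


def mean_compile_time(conn):
    """Average compile time in seconds, keyed by CUDA version."""
    rows = conn.execute(
        "SELECT cuda_version, AVG(compile_time_sec) FROM compile_stats"
        " GROUP BY cuda_version ORDER BY cuda_version"
    )
    return dict(rows.fetchall())


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    # Placeholder values for illustration only.
    record_compile(conn, "3.7", "host-a", "7.5", 24, 90.0)
    record_compile(conn, "3.7", "host-a", "8.0", 24, 150.0)
    print(mean_compile_time(conn))
```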

13 Outline: Background, Example HPC Application, Study Results, Lessons Learned / Future Work


14 Example HPC Application
Geospatial metadata generator
- Leverages open-source third-party libraries: OpenCV, Caffe, GDAL
- Computer vision algorithms, GPU enabled: SURF, ORB, NCC, NMI
- Automated matching against control data
- Calculates geospatial metadata for input imagery: Satellites, Manned Aircraft, Unmanned Aerial Systems


15 Example HPC Application - GTC16
Two-step Image Re-alignment Application using NMI (Normalized Mutual Information)
$$NMI = \frac{H_S + H_C}{H_J}$$
Core Libraries: NITRO, GDAL, Proj.4, libpq (Postgres), OpenCV, CUDA, OpenMP
[Diagram: pipeline stages Input Image, Preprocessing, Source Selection, Global Localization, Registration, Resection, Metadata, and Output Image, each marked CPU or GPU; Source, Control, and Joint histograms]

16 Example HPC Application - GTC16
Global Localization
- Control: 382x100
- Tactical: 258x67
- Solutions: 4250
Objective: Re-align the source image with the control image.
Method (in-house implementation): Roughly match source and control images.
- Coarse resolution
- Mask for non-valid data
- Exhaustive search
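The solution count on this slide is consistent with trying every placement of the 258x67 tactical image that lies fully inside the 382x100 control image. The sketch below enumerates those placements. Treating the solution space this way is an interpretation of the slide, and the function name is made up for illustration.

```python
def solution_offsets(control_size, source_size):
    """All (x, y) offsets at which the source fits entirely inside the control."""
    control_w, control_h = control_size
    source_w, source_h = source_size
    return [(x, y)
            for y in range(control_h - source_h + 1)
            for x in range(control_w - source_w + 1)]


offsets = solution_offsets((382, 100), (258, 67))
print(len(offsets))  # 125 * 34 = 4250
```

In an exhaustive search, each of these offsets gets one similarity score, and the best-scoring offset is kept.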

17 Example HPC Application - GTC16
Global Localization


18 Example HPC Application - GTC16
Similarity Metric: Normalized Mutual Information
$$NMI = \frac{H_S + H_C}{H_J}, \qquad H = -\sum_i p_i \log_2 p_i$$
- $H$ is the entropy
- $p_i$ is the probability density function (histograms for S and C, joint histogram for J)
- Source image and mask: $N_S \times M_S$ pixels
- Control image and mask: $N_C \times M_C$ pixels
- Solution space: $n \times m$ NMI coefficients
- Histogram with masked area: missing data, artifact, homogeneous area
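The metric above can be written out in a few lines. This is a plain-Python sketch under stated assumptions, not the CUDA kernel from the talk: images are flat sequences of integer intensities, each distinct intensity is its own histogram bin, and a pixel counts only when both masks mark it valid. Returning NaN when the joint entropy is zero (for example, two perfectly homogeneous areas) is a choice made here.

```python
import math
from collections import Counter


def entropy(counts, total):
    """Shannon entropy in bits of a histogram given as raw counts."""
    return -sum((n / total) * math.log2(n / total) for n in counts if n)


def nmi(source, control, source_mask=None, control_mask=None):
    """Normalized mutual information (H_S + H_C) / H_J over valid pixels."""
    pairs = [
        (s, c)
        for i, (s, c) in enumerate(zip(source, control))
        if (source_mask is None or source_mask[i])
        and (control_mask is None or control_mask[i])
    ]
    total = len(pairs)
    if total == 0:
        return math.nan
    h_s = entropy(Counter(s for s, _ in pairs).values(), total)
    h_c = entropy(Counter(c for _, c in pairs).values(), total)
    h_j = entropy(Counter(pairs).values(), total)
    if h_j == 0:
        return math.nan  # undefined for a single joint bin
    return (h_s + h_c) / h_j


print(nmi([0, 1, 2, 3], [0, 1, 2, 3]))  # identical images: 2.0
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))  # independent images: 1.0
```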

19 Example HPC Application - GTC16
Summary: Global Localization as coarse re-alignment
Problem: joint histogram computation for each solution
- No compromise on the number of bins
- Exhaustive search
Solution: leverage the K80 specifications
- 12 GB of memory
- 1 thread per solution
- Less than 25 seconds: 61K solutions for a 131K pixel image
Kernel specifications:
- occupancy: 100%
- threads / block: 128
- total memory / GPU: 7.03 GB
- memory %: 61.06%
- spill stores: 0
- spill loads: 0
- registers: 27
- smem / block: 0
- smem / SM: 0
- smem %: 0.00%
- Other rows: stack frame, total memory / block (MB), total memory / SM (MB), cmem[0], cmem[2], solution / thread

20 Example HPC Application - GTC16
Registration
- Control: 382x100
- Tactical: 258x67


21 Example HPC Application - GTC16
Registration
- Control: 382x100
- Tactical: 258x67
- Tactical & Control: 4571x1555
Objective: Refine the localization
Method:
- Use higher resolution (~400 times)
- Keypoint matching


22 Example HPC Application - GTC16
Registration Workflow
- Descriptors: 11x11 intensity values
- Search windows: 73x73 pixels
[Diagram: Source Image -> detect -> Keypoint list -> describe -> Descriptor; Control Image -> Search Windows (from Keypoint list); Descriptor and Search Windows -> metric -> Tiepoint list]

23 Application
Similarity Metric: Normalized Mutual Information
$$NMI = \frac{H_S + H_C}{H_J}, \qquad H = -\sum_i p_i \log_2 p_i$$
- $H$ is the entropy
- $p_i$ is the probability density function (histograms for S and C, joint histogram for J)
Small images but numerous keypoints:
- Numerous keypoints with GPU SURF detector
- Image / Descriptor size: 11 x 11 intensity values to describe
- Search area: 73 x 73 control sub-image
- Solution space: 63 x 63 = 3969 NMI coefficients / keypoint

24 Example HPC Application - GTC16
Summary: Registration refines the re-alignment
Problem: joint histogram computation for each solution
- No compromise on the number of bins
- Exhaustive search
Solution: leverage the K80 specifications
- 12 GB of memory
- 1 block per solution
- Leverage the number of values in the descriptors: 121 (maximum), far fewer than the joint histogram bins
- Less than 100 seconds: 65K keypoints, 260M NMI coefficients
- About 10K keypoints in less than 20 seconds
Kernel: find the best match for all keypoints
- 1 block per keypoint
- Optimized for the 63 x 63 search windows: 64 threads / block (1 idle); each thread computes a row of solutions
- Sparse joint histogram: many bins but only 121 values
- Leverage the 11 x 11 descriptor size:
  - Create 2 lists (length 121) of intensity values: a list of indices for the source and a list of indices for the corresponding control subset
  - Update joint histogram count from lists
  - Loop over lists to retrieve aggregate count
  - Set aggregate count to 0 after first retrieval
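The list-driven histogram update described in the kernel bullets can be illustrated on the CPU. This is a Python sketch of the idea only, not the CUDA kernel. The bin count of 256 per image is an assumption, since the slide's bin count is not preserved. The point is that the loops touch at most 121 histogram cells, never the full histogram, and leave it zeroed for the next solution.

```python
import math

BINS = 256  # assumed number of intensity bins per image


def joint_entropy_sparse(src, ctl, hist):
    """Joint entropy (bits) of paired intensity lists.

    hist is a flat BINS * BINS list of zeros. It is all zeros again on
    return, so it can be reused for the next solution without a full clear.
    """
    n = len(src)
    # Update joint histogram counts from the two lists.
    for s, c in zip(src, ctl):
        hist[s * BINS + c] += 1
    # Loop over the lists again to retrieve each aggregate count once.
    h = 0.0
    for s, c in zip(src, ctl):
        count = hist[s * BINS + c]
        if count:
            p = count / n
            h -= p * math.log2(p)
            hist[s * BINS + c] = 0  # reset after first retrieval
    return h


hist = [0] * (BINS * BINS)
print(joint_entropy_sparse([5, 5, 7, 7], [3, 3, 4, 4], hist))  # 1.0
```

Resetting each cell on first retrieval serves two purposes: repeated pairs are not counted twice, and the histogram needs no separate clearing pass.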

25 Outline: Background, Example HPC Application, Study Results, Lessons Learned / Future Work

26 Compile Time Results
[Table: compile time in seconds and size of binary files in MB, by compute capability specification, for OFF, CUDA 7.5, and CUDA 8.0]

27 Run Time Results
[Charts: average run time (sec) for data sets D1, D2, D3, and D4; series CPU, CUDA 7.5, and CUDA 8]

28 K80 - Kernel Time Results in Seconds with nvprof
[Charts: Step 1 and Step 2 kernel timings vs CUDA version (7.5 and 8) for data sets D1, D2, D3, and D4; statistics shown: average, min, max, std]

29 Run Time Results
[Charts: Step 2 kernel time (sec) for data sets D1, D2, D3, and D4 on K40, K80, M4000, and P6000 with CUDA 7.5 and CUDA 8; statistics shown: average, min, max, std]

30 Outline: Background, Example HPC Application, Study Results, Lessons Learned / Future Work

31 Lessons Learned
GPU isolation:
- Ran into an issue when swapping out the P6000 and K40: nvidia-smi swapped the GPU IDs for the K40 and M4000.
- This caused nvidia-docker to ignore the NV_GPU value.
- UUID vs. index
- Our application can set the GPU index for multi-GPU environments (defaults to 0)
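One reading of "UUID vs. index" is that selecting a GPU by UUID stays stable when index numbering changes after a card swap. The sketch below looks up UUIDs by device name from `nvidia-smi --query-gpu=index,uuid,name --format=csv,noheader` output and builds an `NV_GPU` value from them. The sample output, the UUIDs, and the helper names are made up for illustration, and whether this matches the fix the team adopted is an assumption.

```python
import csv
import io

QUERY = ["nvidia-smi", "--query-gpu=index,uuid,name", "--format=csv,noheader"]

# Made-up sample of what the query above prints, one GPU per line.
SAMPLE = """0, GPU-11111111-aaaa-bbbb-cccc-000000000001, Quadro M4000
1, GPU-22222222-aaaa-bbbb-cccc-000000000002, Tesla K40c
"""


def parse_gpus(text):
    """Parse 'index, uuid, name' CSV lines into dictionaries."""
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [{"index": int(row[0]), "uuid": row[1], "name": row[2]}
            for row in rows if row]


def nv_gpu_for(gpus, name_fragment):
    """Comma-separated UUIDs of GPUs whose name contains name_fragment."""
    return ",".join(g["uuid"] for g in gpus if name_fragment in g["name"])


gpus = parse_gpus(SAMPLE)
print(nv_gpu_for(gpus, "K40"))
# To use it: run QUERY with subprocess, then launch nvidia-docker with
# env=dict(os.environ, NV_GPU=nv_gpu_for(gpus, "K40")).
```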

32 Future Work
- Move off desktop machines to a full testing platform with dedicated hardware and multiple GPU types
- Investigate Docker Registry & Docker Swarm for managing containers
- Enhance database analysis to autogenerate reports
- Generalize the process to containerize any GPU application to profile with this architecture

33 Thank you!

34 Customer Resources
Hardware: DELL C4130 (GPU: 4 x K80; RAM: 256GB; SSD Storage: 400GB)
[Chart: run time with 6 threads (sec) for data sets D1, D2, D3, and D4; series CPU and CUDA]
