Combining NVIDIA Docker and databases to enhance agile development and optimize resource allocation
1 Combining NVIDIA Docker and databases to enhance agile development and optimize resource allocation
Chris Davis, Sophie Voisin, Devin White, Andrew Hardin
Scalable and High Performance Geocomputation Team, Geographic Information Science and Technology Group, Oak Ridge National Laboratory
GTC 2017, May 2017
ORNL is managed by UT-Battelle for the US Department of Energy
2 Outline
- Background
- Example HPC Application
- Study Results
- Lessons Learned / Future Work
3 The Story
We are:
- Developing an HPC suite of applications
- Spread across multiple R&D teams
- In an Agile development process
- Delivering to a production environment
- Needing to support multiple systems / multiple capabilities
- Collecting performance metrics for system optimization
4 Why We Use NVIDIA-Docker
- Criteria: Resource Optimization, Operating Systems, GPU Access, Flexibility
- Options compared: NVIDIA-Docker, Docker, Virtual Machine
5 Hardware
Quadro: Compute + Display
- Card: M4000, P6000
- Memory: 8GB (M4000), 24GB (P6000)
- Capability, Block, SM, Cores: (values missing)
6 Hardware
Tesla: Compute Only
- Card: K40, K80
- Memory: 12GB (K40), 12GB (K80)
- Capability, Block, SM, Cores: (values missing)
7 Hardware
High End: DELL C4130
- GPU: 4 x K80
- RAM: 256GB
- Cores: 48
- SSD Storage: 400GB
8 Constructing Containers
Build Container:
- Based off NVIDIA images at gitlab.com: CentOS 7, CUDA 8.0 / 7.5, cuDNN 5.1, GCC
- Cores: 24
- Mount local folder with code
- Compile against chosen compute capability
- Copy product inside container
- docker commit container updates to new image
- docker save to Isilon
[Diagram: HPC server (local drive, container, NVIDIA-Docker, CPUs, GPUs) connected to a Git repo, PostgreSQL (Compile Stats, Profile Stats), and Isilon storage holding containers and data]
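The build steps on this slide could be scripted roughly as follows. This is a minimal sketch rather than the team's actual tooling: the image names, mount paths, build script, and archive location are hypothetical placeholders, and the function only assembles command lines instead of running them.

```python
from typing import List


def build_commands(capability: str, cuda_version: str) -> List[List[str]]:
    """Assemble the build, commit, and save commands for one container.

    Every name and path below is a hypothetical placeholder.
    """
    tag = f"cuda{cuda_version}-sm{capability.replace('.', '')}"
    container = f"build-{tag}"
    base_image = f"example/centos7-cuda:{cuda_version}"  # assumed base image name
    new_image = f"example/app:{tag}"                     # assumed result image name
    archive = f"/mnt/isilon/containers/{tag}.tar"        # assumed Isilon mount point
    return [
        # Mount the local source tree and compile for the chosen capability.
        ["nvidia-docker", "run", "--name", container,
         "-v", "/local/src:/src",
         base_image, "/src/build.sh", capability],
        # Commit the container's changes to a new image.
        ["docker", "commit", container, new_image],
        # Save the new image as a tar archive on shared storage.
        ["docker", "save", "-o", archive, new_image],
    ]
```

Each inner list can be handed to `subprocess.run` on a machine where Docker and nvidia-docker are installed; keeping command construction separate from execution makes the sequence easy to inspect or log first.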
9 Running Containers
For each compute capability:
- docker load from Isilon storage
- Run container & profile script
- Send nvprof results to Profile Stats DB
- Container/Image removed
[Diagram: HPC server (local drive, container, NVIDIA-Docker, CPUs, GPUs) connected to PostgreSQL (Compile Stats, Profile Stats) and Isilon storage holding containers and data]
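The per-capability run loop could look something like the sketch below. It is an illustration under assumptions: the archive path, image name, results mount, and the profiled executable are hypothetical, and the function returns the environment and command lines without executing them.

```python
from typing import Dict, List, Tuple


def run_commands(tag: str, gpu: str) -> Tuple[Dict[str, str], List[List[str]]]:
    """Assemble the load, profile, and cleanup commands for one container.

    Paths, image names, and the profiled binary are hypothetical placeholders.
    """
    archive = f"/mnt/isilon/containers/{tag}.tar"  # assumed Isilon mount point
    image = f"example/app:{tag}"                   # assumed image name
    log = f"/results/{tag}.csv"                    # nvprof log inside the container
    env = {"NV_GPU": gpu}                          # GPU selection read by nvidia-docker
    commands = [
        # Load the saved image from shared storage.
        ["docker", "load", "-i", archive],
        # Run the application under nvprof; --rm removes the container afterwards.
        ["nvidia-docker", "run", "--rm",
         "-v", "/local/results:/results",
         image,
         "nvprof", "--csv", "--log-file", log, "/opt/app/run_pipeline"],
        # Remove the image once profiling is done.
        ["docker", "rmi", image],
    ]
    return env, commands
```

The CSV log written to the mounted results folder is what a separate parser would read before sending rows to the Profile Stats database.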
10 Hooking It All Together
- One server generates containers
- All servers pull containers from Isilon
- Data to be processed pulled from Isilon
- Container build stats stored in Compiler DB
- Container execution stats stored in Profiler DB
[Diagram: three HPC servers (each with local drive, container, NVIDIA-Docker, CPUs, GPUs) sharing a Git repo, PostgreSQL (Compile Stats, Profile Stats), and Isilon storage holding containers and data]
11 Profiling Combinations
- nvprof output parsed and sent to Profile DB
- Containers for: CUDA version (CUDA 7.5, CUDA 8); each capability, all capabilities, CPU only
- Data sets: 4 (D1, D2, D3, D4)
- Total of 104 profiles
[Diagram: GPUs (K40, K80, M4000, P6000) and compute capability labels (3.0, 3.5, 6.0, 6.1 among those legible)]
12 Database
Postgres Databases: Compile DB, Run Time DB, NVPROF DB
- Shared fields: Compute Capability, Hostname, CUDA Version, Num CPU Threads
- Other fields (per-database columns are interleaved in the transcription): Compile Time, Execution Time, GPU Device, Dataset, Kernel / API Call, Step Name, Step Time, Percent, Num Calls, Ave Time, Min Time, Max Time, Timestamp
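One way to picture the three databases is as three tables that share a common set of columns. The sketch below uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL so that it runs anywhere; the table names, column types, and the assignment of the non-shared fields to each table are assumptions made for illustration, since the slide layout does not survive in the transcription.

```python
import sqlite3

# Fields the slide lists as shared across all three databases.
SHARED = ("compute_capability TEXT, hostname TEXT, "
          "cuda_version TEXT, num_cpu_threads INTEGER")

# Assumed split of the remaining fields; not confirmed by the source.
SCHEMA = f"""
CREATE TABLE compile_stats (
    {SHARED}, compile_time REAL, timestamp TEXT);
CREATE TABLE run_time_stats (
    {SHARED}, dataset TEXT, step_name TEXT, step_time REAL,
    execution_time REAL, timestamp TEXT);
CREATE TABLE nvprof_stats (
    {SHARED}, gpu_device TEXT, dataset TEXT, kernel_or_api_call TEXT,
    percent REAL, step_time REAL, num_calls INTEGER,
    ave_time REAL, min_time REAL, max_time REAL, timestamp TEXT);
"""


def open_stats_db() -> sqlite3.Connection:
    """Create an in-memory database with the three statistics tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def columns(conn: sqlite3.Connection, table: str) -> list:
    """Return the column names of a table, in declaration order."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
```

Because the shared columns appear in every table, results from the compile, run time, and nvprof records can be joined or grouped on compute capability, hostname, and CUDA version.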
13 Outline
- Background
- Example HPC Application
- Study Results
- Lessons Learned / Future Work
14 Example HPC Application
Geospatial metadata generator
- Leverages open source 3rd-party libraries: OpenCV, Caffe, GDAL
- Computer vision algorithms, GPU enabled: SURF, ORB, NCC, NMI
- Automated matching against control data
- Calculates geospatial metadata for input imagery: satellites, manned aircraft, unmanned aerial systems
15 Example HPC Application - GTC16
Two-step Image Re-alignment Application using NMI (Normalized Mutual Information)
$$\mathrm{NMI} = \frac{H_S + H_C}{H_J}$$
- Core Libraries: NITRO, GDAL, Proj.4, libpq (Postgres), OpenCV, CUDA, OpenMP
- Pipeline (stages marked CPU or GPU): Input Image, Preprocessing, Source Selection, Global Localization, Registration, Resection, Metadata, Output Image
[Diagram: source, control, and joint histograms]
16 Example HPC Application - GTC16
Global Localization
Objective: Re-align the source image with the control image.
Method: In-house implementation
- Roughly match source and control images
- Coarse resolution
- Mask for non-valid data
- Exhaustive search
Example: Control 382x100, Tactical 258x67, Solutions 4250
[Pipeline diagram and core libraries as on slide 15]
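The solution count on this slide is consistent with sliding the tactical image over every position where it fits entirely inside the control image. The sketch below enumerates those positions; treating the search as full-overlap placements only is an assumption inferred from the numbers, and the function name is made up for illustration.

```python
from typing import Iterator, Tuple


def candidate_offsets(control: Tuple[int, int],
                      source: Tuple[int, int]) -> Iterator[Tuple[int, int]]:
    """Yield every (x, y) placement of the source fully inside the control image.

    Sizes are given as (width, height) in pixels.
    """
    control_w, control_h = control
    source_w, source_h = source
    for y in range(control_h - source_h + 1):
        for x in range(control_w - source_w + 1):
            yield (x, y)


# Sizes from the slide: control 382x100, tactical 258x67.
offsets = list(candidate_offsets((382, 100), (258, 67)))
print(len(offsets))  # 125 * 34 = 4250
```

In an exhaustive search, a similarity score is computed at each of these offsets and the best-scoring offset is kept.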
17 Example HPC Application - GTC16
Global Localization
18 Example HPC Application - GTC16
Similarity Metric: Normalized Mutual Information
$$\mathrm{NMI} = \frac{H_S + H_C}{H_J}, \qquad H = -\sum_i p_i \log p_i$$
- $H$ is the entropy
- $p_i$ is the probability density function, taken from the histogram for $S$ and $C$ and from the joint histogram for $J$
- Histogram with masked area: missing data, artifact, homogeneous area
- Source image and mask: $N_S \times M_S$ pixels
- Control image and mask: $N_C \times M_C$ pixels
- Solution space: $n \times m$ NMI coefficients
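As a concrete reading of the metric, the following pure-Python sketch computes NMI as the sum of the two marginal entropies divided by the joint entropy, skipping masked-out pixels. It is not the GPU implementation described in the talk: the function names, the flat-list image representation, the mask convention (truthy means valid), and returning 0.0 when the joint entropy is zero are all assumptions. The logarithm base cancels in the ratio, so base 2 is used for convenience.

```python
import math
from collections import Counter
from typing import Iterable, Optional, Sequence


def entropy(counts: Iterable[int]) -> float:
    """Shannon entropy (base 2) of a histogram given as bin counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def nmi(source: Sequence[int], control: Sequence[int],
        mask: Optional[Sequence[int]] = None) -> float:
    """Normalized mutual information (H_S + H_C) / H_J over valid pixels."""
    pairs = [(s, c) for i, (s, c) in enumerate(zip(source, control))
             if mask is None or mask[i]]
    h_s = entropy(Counter(s for s, _ in pairs).values())
    h_c = entropy(Counter(c for _, c in pairs).values())
    h_j = entropy(Counter(pairs).values())
    if h_j == 0:
        # Homogeneous or fully masked area: no information to compare.
        return 0.0
    return (h_s + h_c) / h_j
```

Two identical patches give the maximum value of 2, while statistically independent patches give 1, which is why the best alignment is the offset with the largest coefficient.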
19 Example HPC Application - GTC16
Summary: Global Localization as coarse re-alignment
Problematic: joint histogram computation for each solution
- No compromise on the number of bins
- Exhaustive search
Solution: leverage the K80 specifications
- 12 GB of memory
- 1 thread per solution
- Less than 25 seconds for 61K solutions on a 131K pixel image
Kernel specifications (entries without a value are missing in the transcription):
- occupancy: 100%
- threads / block: 128
- stack frame:
- total memory / block (MB):
- total memory / SM (MB):
- total memory / GPU: 7.03 GB
- memory %: 61.06%
- spill stores: 0
- spill loads: 0
- registers: 27
- smem / block: 0
- smem / SM: 0
- smem %: 0.00%
- cmem[0]:
- cmem[2]:
- solution / thread:
20 Example HPC Application - GTC16
Registration
Control 382x100, Tactical 258x67
[Pipeline diagram and core libraries as on slide 15]
21 Example HPC Application - GTC16
Registration
Objective: Refine the localization
Method:
- Use higher resolution (~400 times)
- Keypoint matching
Image sizes: Control 382x100, Tactical 258x67, Tactical & Control 4571x1555
[Pipeline diagram and core libraries as on slide 15]
22 Example HPC Application - GTC16
Registration Workflow
- Source image: detect keypoints (keypoint list), then describe them (descriptors)
- Control image: search windows taken from the keypoint list
- Metric between descriptors and search windows produces the tiepoint list
- Descriptors: 11x11 intensity values
- Search windows: 73x73 pixels
23 Application
Similarity Metric: Normalized Mutual Information
$$\mathrm{NMI} = \frac{H_S + H_C}{H_J}, \qquad H = -\sum_i p_i \log p_i$$
- $H$ is the entropy
- $p_i$ is the probability density function, taken from the histogram for $S$ and $C$ and from the joint histogram for $J$
Small images but numerous keypoints:
- Numerous keypoints with GPU SURF detector (upper bound missing in the transcription)
- Image / descriptor size: 11 x 11 intensity values to describe
- Search area: 73 x 73 control sub-image
- Solution space: 63 x 63 = 3969 NMI coefficients per keypoint
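The 63 x 63 figure follows from sliding an 11 x 11 descriptor across a 73 x 73 search window. A short worked calculation, written as a sketch with a made-up helper name, also shows how the workload grows with the keypoint count; the 65,000 keypoints used here approximate the "65K keypoints" quoted on the registration summary slide.

```python
def solution_space(search: int, descriptor: int) -> int:
    """Number of placements of a square descriptor inside a square search window."""
    side = search - descriptor + 1
    return side * side


per_keypoint = solution_space(73, 11)  # 63 * 63 = 3969

# Roughly 65K keypoints (approximation of the figure on the summary slide).
total = 65_000 * per_keypoint          # 257,985,000, i.e. about 260M coefficients
```

This matches the order of magnitude of the 260M NMI coefficients reported for the registration step.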
24 Example HPC Application - GTC16
Summary: Registration refines the re-alignment
Problematic: joint histogram computation for each solution
- No compromise on the number of bins
- Exhaustive search
Solution: leverage the K80 specifications
- 12 GB of memory
- 1 block per solution
- Leverage the number of values of the descriptors: 121 (maximum) << (right-hand value missing in the transcription)
- Less than 100 seconds for 65K keypoints, 260M NMI coefficients
- About 10K keypoints in less than 20 seconds
Kernel: find the best match for all keypoints
- 1 block per keypoint
- Optimized for the 63 x 63 search windows: 64 threads / block (1 idle); each thread computes a row of solutions
- Sparse joint histogram: many bins (count missing in the transcription) but only 121 values
- Leverage the 11 x 11 descriptor size:
  - Create 2 lists (length 121) of intensity values
  - Update joint histogram count from lists
  - Loop over lists to retrieve aggregate count
  - Set aggregate count to 0 after first retrieval
[Diagram: list of indices for source, list of indices for the corresponding control subset, joint histogram]
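The list-based joint histogram trick can be illustrated on the CPU as follows. This is a sketch of one plausible reading of the four steps, not the CUDA kernel from the talk: the function name and the dense 2D list used for the histogram are assumptions. The point is that only the bins touched by the 121 value pairs are visited, and zeroing each bin after its first read both avoids double counting and leaves the histogram clean for the next solution.

```python
import math
from typing import List, Sequence


def sparse_joint_entropy(src: Sequence[int], ctl: Sequence[int],
                         hist: List[List[int]]) -> float:
    """Joint entropy of paired intensity lists using a reusable, all-zero histogram."""
    n = len(src)
    # Update the joint histogram counts from the two lists.
    for s, c in zip(src, ctl):
        hist[s][c] += 1
    # Loop over the lists (not over all bins) to retrieve the aggregate counts.
    h = 0.0
    for s, c in zip(src, ctl):
        count = hist[s][c]
        if count:
            p = count / n
            h -= p * math.log2(p)
            hist[s][c] = 0  # reset after first retrieval so repeats are skipped
    return h
```

With 121 pairs the loops run 121 iterations each regardless of how many bins the histogram has, which is what makes a large bin count affordable.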
25 Outline
- Background
- Example HPC Application
- Study Results
- Lessons Learned / Future Work
26 Compile Time Results
[Charts: compile time in seconds and size of binary files in MB by compute capability specification, for OFF, CUDA 7.5, and CUDA 8.0]
27 Run Time Results
[Charts: average run time (sec) for data sets D1, D2, D3, and D4, comparing CPU, CUDA 7.5, and CUDA 8]
28 K80 - Kernel Time Results in Seconds with nvprof
[Charts: Step 1 and Step 2 kernel timings vs CUDA version (7.5 and 8) for data sets D1, D2, D3, and D4; series: average, min, max, std]
29 Run Time Results
[Charts: Step 2 kernel time (sec) for data sets D1, D2, D3, and D4 on K40, K80, M4000, and P6000 under CUDA 7.5 and CUDA 8; series: average, min, max, std]
30 Outline
- Background
- Example HPC Application
- Study Results
- Lessons Learned / Future Work
31 Lessons Learned
GPU isolation:
- Ran into an issue when swapping out the P6000 and K40: nvidia-smi swapped the GPU IDs for the K40 and M4000, which caused nvidia-docker to ignore the NV_GPU value
- UUID vs index
- Our application can set the GPU index for multi-GPU environments (defaults to 0)
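One way to sidestep index reordering is to select GPUs by UUID, which nvidia-docker's `NV_GPU` variable accepts in place of an index. The sketch below parses the text produced by `nvidia-smi --query-gpu=index,uuid,name --format=csv,noheader` and builds the environment for a chosen card. The helper names are made up for illustration, and whether this matches how the team resolved the issue is not stated in the slides.

```python
from typing import Dict, List

# Query that prints one "index, uuid, name" line per GPU.
QUERY: List[str] = ["nvidia-smi", "--query-gpu=index,uuid,name",
                    "--format=csv,noheader"]


def uuid_by_name(query_output: str) -> Dict[str, str]:
    """Map each GPU product name to the UUID of its first occurrence."""
    mapping: Dict[str, str] = {}
    for line in query_output.strip().splitlines():
        _index, uuid, name = [field.strip() for field in line.split(",", 2)]
        mapping.setdefault(name, uuid)
    return mapping


def docker_env(query_output: str, gpu_name: str) -> Dict[str, str]:
    """Build the NV_GPU setting that pins nvidia-docker to a GPU by UUID."""
    return {"NV_GPU": uuid_by_name(query_output)[gpu_name]}
```

In use, the output of running `QUERY` (for example through `subprocess.run(..., capture_output=True, text=True)`) would be passed to `docker_env`, and the returned mapping merged into the environment of the `nvidia-docker run` call.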
32 Future Work
- Move off desktop machines to a full testing platform with dedicated hardware and multiple GPU types
- Investigate Docker Registry & Docker Swarm for managing containers
- Enhance database analysis to autogenerate reports
- Generalize the process to containerize any GPU application to profile with this architecture
33 Thank you!
34 Customer Resources
- Hardware: DELL server with K80 GPUs, 256GB RAM, SSD Storage 400GB
[Chart: run time with 6 threads (sec) for data sets D1, D2, D3, and D4, comparing CPU and CUDA]
More informationSystem Requirements for Q-DAS Products
System Requirements for Q-DAS Products Version V11 (32-bit) System Requirements for Q-DAS Products 1/13 System Requirements for Q-DAS Products Contents 1. Preface... 2 1.1. General hardware requirements...
More informationComputing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany
Computing on GPUs Prof. Dr. Uli Göhner DYNAmore GmbH Stuttgart, Germany Summary: The increasing power of GPUs has led to the intent to transfer computing load from CPUs to GPUs. A first example has been
More informationParallelising Pipelined Wavefront Computations on the GPU
Parallelising Pipelined Wavefront Computations on the GPU S.J. Pennycook G.R. Mudalige, S.D. Hammond, and S.A. Jarvis. High Performance Systems Group Department of Computer Science University of Warwick
More informationKepler Overview Mark Ebersole
Kepler Overview Mark Ebersole TFLOPS TFLOPS 3x Performance in a Single Generation 3.5 3 2.5 2 1.5 1 0.5 0 1.25 1 Single Precision FLOPS (SGEMM) 2.90 TFLOPS.89 TFLOPS.36 TFLOPS Xeon E5-2690 Tesla M2090
More informationAutomatic Identification of Application I/O Signatures from Noisy Server-Side Traces. Yang Liu Raghul Gunasekaran Xiaosong Ma Sudharshan S.
Automatic Identification of Application I/O Signatures from Noisy Server-Side Traces Yang Liu Raghul Gunasekaran Xiaosong Ma Sudharshan S. Vazhkudai Instance of Large-Scale HPC Systems ORNL s TITAN (World
More informationNVIDIA GRID. Ralph Stocker, GRID Sales Specialist, Central Europe
NVIDIA GRID Ralph Stocker, GRID Sales Specialist, Central Europe rstocker@nvidia.com GAMING AUTO ENTERPRISE HPC & CLOUD TECHNOLOGY THE WORLD LEADER IN VISUAL COMPUTING PERFORMANCE DELIVERED FROM THE CLOUD
More informationPractical Introduction to CUDA and GPU
Practical Introduction to CUDA and GPU Charlie Tang Centre for Theoretical Neuroscience October 9, 2009 Overview CUDA - stands for Compute Unified Device Architecture Introduced Nov. 2006, a parallel computing
More informationCafeGPI. Single-Sided Communication for Scalable Deep Learning
CafeGPI Single-Sided Communication for Scalable Deep Learning Janis Keuper itwm.fraunhofer.de/ml Competence Center High Performance Computing Fraunhofer ITWM, Kaiserslautern, Germany Deep Neural Networks
More informationComputer Science Section. Computational and Information Systems Laboratory National Center for Atmospheric Research
Computer Science Section Computational and Information Systems Laboratory National Center for Atmospheric Research My work in the context of TDD/CSS/ReSET Polynya new research computing environment Polynya
More informationEditing Versioned Geodatabases : An Introduction
Esri International User Conference San Diego, California Technical Workshops July 24, 2012 Editing Versioned Geodatabases : An Introduction Cheryl Cleghorn Shawn Thorne Assumptions: Basic knowledge of
More informationHDX 3D Version 1.0 Requirements Guide
HDX 3D Version 1.0 Requirements Guide www.citrix.com TABLE OF CONTENTS Chapter 1 Overview... 3 Introduction to HDX 3D for Professional Graphics... 3 Architecture... 3 Licensing... 4 Chapter 2 Requirements...
More informationSentinelOne Technical Brief
SentinelOne Technical Brief SentinelOne unifies prevention, detection and response in a fundamentally new approach to endpoint protection, driven by machine learning and intelligent automation. By rethinking
More informationNVIDIA GPU CLOUD DEEP LEARNING FRAMEWORKS
TECHNICAL OVERVIEW NVIDIA GPU CLOUD DEEP LEARNING FRAMEWORKS A Guide to the Optimized Framework Containers on NVIDIA GPU Cloud Introduction Artificial intelligence is helping to solve some of the most
More informationOptimizing Testing Performance With Data Validation Option
Optimizing Testing Performance With Data Validation Option 1993-2016 Informatica LLC. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording
More informationShifter at CSCS Docker Containers for HPC
Shifter at CSCS Docker Containers for HPC HPC Advisory Council Swiss Conference Alberto Madonna, Lucas Benedicic, Felipe A. Cruz, Kean Mariotti - CSCS April 9 th, 2018 Table of Contents 1. Introduction
More information