BUILDING A GPU-FOCUSED CI SOLUTION
Mike Wendt
@mike_wendt
github.com/nvidia
github.com/mike-wendt
AGENDA
Need for GPU CI
Challenges of GPU CI
Methods to Implement GPU CI
Improving GPU CI Today
Demo
Lessons Learned
Next Steps
Getting Started
NEED FOR GPU CI
The number of GPU-accelerated applications is growing
    The leading open-source software projects from Apache and others rely on CI
External demand
    Partners are collaborating with us on projects like the GPU Open Analytics Initiative (GoAi) and need GPU CI to ensure stable builds
Internal demand
    Large internal code bases for many kinds of GPU-accelerated applications require testing across different platforms and hardware
    Performance testing of new drivers and hardware needs repeatable methods to make sure we continue to deliver performance
CHALLENGES OF GPU CI
GPUs bring a different set of problems than traditional CI
Need GPUs
    Cloud or physical
    Resource management
Expose GPU configuration to developers
    Driver, CUDA, GPU type
Many traditional tools like Travis CI, CircleCI, and others do not support GPUs
    For good reasons: dangers of misuse
For tools that offer support, it is often not native
    Still feels hacky, but it gets the job done
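One way to expose the GPU configuration to developers is to record it at the start of every CI job. The sketch below parses the CSV output of `nvidia-smi --query-gpu=... --format=csv,noheader`. The function names and the sample text are assumptions made for illustration; the sample is not output captured from a real machine.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi; the order must match the parsed columns.
QUERY_FIELDS = ["name", "driver_version", "memory.total"]


def parse_gpu_inventory(csv_text):
    """Turn `--format=csv,noheader` output into one dict per GPU."""
    rows = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    return [
        dict(zip(QUERY_FIELDS, (cell.strip() for cell in row)))
        for row in rows
        if row  # skip blank lines
    ]


def query_gpu_inventory():
    """Run nvidia-smi on the current host (requires an installed NVIDIA driver)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader",
    ]
    return parse_gpu_inventory(subprocess.check_output(cmd, text=True))


# Illustrative sample text only, so the parser can be tried without a GPU.
sample = (
    "Tesla V100-SXM2-16GB, 384.81, 16152 MiB\n"
    "Tesla P100-PCIE-16GB, 384.81, 16276 MiB\n"
)
inventory = parse_gpu_inventory(sample)
for gpu in inventory:
    print(gpu["name"], gpu["driver_version"], gpu["memory.total"])
```

Printing this inventory into the build log gives developers the driver and GPU type next to every test result.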
METHODS TO IMPLEMENT GPU CI
BARE-METAL + GPU
Fastest to get started with the most limitations
Benefits
    Reduces complexity with minimal setup
    Works well for a small set of projects that use the same or similar dependencies
Challenges
    Managing dependencies can be tricky for multiple projects
    Limits the ability to test multiple platforms; testing is limited to the installed CUDA version and OS
    Resource management is difficult
[Diagram: Bare-metal + GPU. Labels: CI Environment, Source Code, Tests, Test Results, GPUs, Server]
DOCKER + NVIDIA CONTAINER RUNTIME
github.com/nvidia/nvidia-docker
Docker runtime that allows for GPU passthrough on Linux systems
Works with Debian/Ubuntu, RHEL/CentOS, and Amazon Linux
Allows for testing multiple CUDA/OS environments on one machine
Includes options to set supported driver operations and restrict GPU visibility
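The driver-operation and GPU-visibility options can be sketched as a small command builder. This assumes nvidia-docker 2 with the runtime registered under the name `nvidia`, and uses the `NVIDIA_VISIBLE_DEVICES` and `NVIDIA_DRIVER_CAPABILITIES` environment variables read by the NVIDIA container runtime. The helper name `nvidia_docker_run` and the `nvidia/cuda` image reference are illustrative choices, not part of the talk.

```python
import shlex


def nvidia_docker_run(image, command, gpus="all", capabilities=("compute", "utility")):
    """Build a `docker run` argv that exposes only the requested GPUs.

    gpus:         value for NVIDIA_VISIBLE_DEVICES, e.g. "all", "0" or "0,1"
    capabilities: driver features to mount, joined into NVIDIA_DRIVER_CAPABILITIES
    """
    return [
        "docker", "run", "--rm",
        "--runtime=nvidia",  # assumes the NVIDIA runtime is registered as "nvidia"
        "-e", "NVIDIA_VISIBLE_DEVICES=" + gpus,
        "-e", "NVIDIA_DRIVER_CAPABILITIES=" + ",".join(capabilities),
        image,
    ] + list(command)


# Restrict the test container to GPU 0 only.
argv = nvidia_docker_run("nvidia/cuda", ["nvidia-smi"], gpus="0")
print(" ".join(shlex.quote(arg) for arg in argv))
```

Passing `argv` to `subprocess.run` on a GPU host would start the container; building the list separately keeps the configuration easy to log and review.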
DOCKER + GPU
Easier to use with some hacking still required
Benefits
    Ability to test multiple CUDA/OS combinations
    Handles dependency management for all projects
    Enables fine-grained resource management
    Supports the scale needed for larger projects and teams
Challenges
    Typically requires pre-built Docker images containing the test environment, with the code under test injected into the container
    Configuration tends to involve many environment variables and is cumbersome to manage
    GitLab CI and Jenkins require runners for multiple nodes
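Testing multiple CUDA/OS combinations amounts to running the same tests once per image in a build matrix. A minimal sketch of generating that matrix follows. The tag pattern `<cuda>-<flavor>-<os>` and the version lists are assumptions for illustration; check which tags the image repository actually publishes before relying on them.

```python
from itertools import product


def build_matrix(cuda_versions, os_names, repo="nvidia/cuda", flavor="devel"):
    """Return one entry per CUDA/OS combination with the image to test in."""
    return [
        {
            "cuda": cuda,
            "os": os_name,
            # Assumed tag layout: <cuda>-<flavor>-<os>
            "image": f"{repo}:{cuda}-{flavor}-{os_name}",
        }
        for cuda, os_name in product(cuda_versions, os_names)
    ]


matrix = build_matrix(["8.0", "9.0"], ["ubuntu16.04", "centos7"])
for entry in matrix:
    print(entry["image"])
```

Each entry can then be handed to whatever launches the test container, so adding a CUDA version or OS means changing one list rather than duplicating job definitions.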
[Diagram: Docker + GPU. Labels: CI Environment, Dockerfile or Container, Custom Config, Docker Container, Source Code, Tests, Test Results, Docker + NVIDIA Runtime, GPUs, Server]
KUBERNETES + DOCKER + GPU
Promises to be the easiest to use with minimal hacking
Benefits
    GPU support in v1.8+ of Kubernetes
    Takes care of the runner challenge with GitLab/Jenkins
    Resource management and scheduling are handled by Kubernetes
Challenges
    Can only target GPUs on homogeneous nodes (heterogeneous support coming)
    Not all tools support GPU CI out of the box
    Docker containers are required for testing, but building them can be the previous step in a pipeline
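With Kubernetes doing the scheduling, a GPU test job reduces to a pod that requests GPUs as a resource. The sketch below builds such a manifest as a Python dict. It assumes the NVIDIA device plugin is deployed so that nodes advertise the `nvidia.com/gpu` resource; the function name, pod name, and image reference are hypothetical.

```python
import json


def gpu_test_pod(name, image, command, gpu_count=1):
    """Build a one-shot pod manifest that asks the scheduler for GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",  # run the tests once, do not restart
            "containers": [
                {
                    "name": "tests",
                    "image": image,
                    "command": list(command),
                    # GPUs are requested through limits; the resource name
                    # assumes the NVIDIA device plugin is installed.
                    "resources": {"limits": {"nvidia.com/gpu": gpu_count}},
                }
            ],
        },
    }


# Hypothetical image produced by an earlier pipeline step.
pod = gpu_test_pod("gpu-ci-job", "registry.example.com/project/tests:latest",
                   ["make", "test"], gpu_count=2)
print(json.dumps(pod, indent=2))
```

The printed JSON can be saved to a file and submitted with `kubectl apply -f`, since Kubernetes accepts JSON manifests as well as YAML.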
[Diagram: Kubernetes + Docker + GPU. Labels: CI Environment, Docker Container Repo, Docker Container, Dockerfile or Container, Custom Config, Docker Test Container, Kubernetes Master, Scheduler, Source Code, Tests, Test Results, Server, Docker + NVIDIA Runtime, GPUs, Kubernetes Worker]
HOW CAN WE MAKE THIS BETTER TODAY?
JENKINS PLUGIN FOR NVIDIA + DOCKER
Based on the Jenkins docker-slaves plugin
Simplifies the configuration of Docker containers for GPU CI testing
Allows for targeting a Dockerfile within the repo to build and use for testing, or a Docker image in a remote hub
Supports side-containers with GPU support
Easy to use and adapt a project for GPU CI
DEMO
JENKINS PLUGIN FOR NVIDIA + DOCKER
Simplifying the configuration for GPU CI
[Diagram labels: Jenkins CI Environment, Dockerfile or Container + Plugin Config, Docker Container, Source Code, Tests, Test Results, Docker + NVIDIA Runtime, GPUs, Server]
LESSONS LEARNED
CI best practices apply to GPU code as well
    Pull request testing is one of the best methods to ensure code quality
GitLab CI works great if there are only a few GPU-enabled repos to test
    For scale-out, GitLab on Kubernetes is best
Larger organizations and projects need a centralized CI platform like Jenkins
    Setup of a new repo is easy, and with parameterized builds we can reuse existing pipelines
Advanced uses of Jenkins
    Tagging is key to testing on multiple GPU architectures, and pipelines are key to testing multiple CUDA versions
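The tagging idea can be shown with a toy model: each build node carries a set of labels, and a job runs only on nodes that carry every label it requires. This is not the Jenkins API, only a sketch of the matching rule; the node names and labels are hypothetical.

```python
def eligible_nodes(nodes, required_labels):
    """Return the names of nodes that carry every required label, sorted."""
    required = set(required_labels)
    return sorted(
        name for name, labels in nodes.items() if required <= set(labels)
    )


# Hypothetical build nodes, tagged by GPU architecture and CUDA version.
nodes = {
    "gpu-node-1": {"gpu", "pascal", "cuda9"},
    "gpu-node-2": {"gpu", "volta", "cuda9"},
    "cpu-node-1": {"linux"},
}

print(eligible_nodes(nodes, ["gpu", "volta"]))
print(eligible_nodes(nodes, ["gpu"]))
```

Running the same parameterized job once per architecture label is then a matter of looping over the labels of interest.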
NEXT STEPS
Continue plugin development and release as an open source project
Internal
    Continue deployment of GPU CI and migrate performance testing toward full GPU CI
    Leverage capabilities of Jenkins to go beyond CI with CD and workflow automation
External
    Expand GPU CI testing by testing pull requests of open source projects using Jenkins and the plugin
Take advantage of the GPU targeting within Kubernetes and new GPU features in the coming months
Look at ways to more closely integrate GPU CI with GitLab CI and Jenkins plugins for Kubernetes
GETTING STARTED
Links to useful repos
github.com/nvidia
    NVIDIA Docker Runtime: nvidia-docker
    NVIDIA Kubernetes Device Plugin: k8s-device-plugin
github.com/mike-wendt
    Jenkins Plugin For NVIDIA: coming soon
    Docker + NVIDIA Runtime on Ubuntu: nvidia-docker-ubuntu
THANK YOU
Mike Wendt
@mike_wendt
github.com/nvidia
github.com/mike-wendt