Shifter: Fast and consistent HPC workflows using containers


Shifter: Fast and consistent HPC workflows using containers CUG 2017, Redmond, Washington Lucas Benedicic, Felipe A. Cruz, Thomas C. Schulthess - CSCS May 11, 2017

Outline
1. Overview
2. Docker
3. Shifter
4. Workflows
5. Use cases
6. Conclusion

Overview

CSCS / Piz Daint
Location: Swiss National Supercomputing Centre (CSCS)
Name: Piz Daint
Model: Cray XC50/XC40
Node: Intel Xeon E5-2690, NVIDIA Tesla P100, Aries interconnect
TOP500: 8th in the world
Green500: 2nd in the world

Motivation
Bring Docker containers to production on Piz Daint.
Docker:
- Flexible and self-contained execution environments.
- A tool that enables workflows for some users.
- Part of an ecosystem that provides value to users.
The SI group focuses on enhancing Shifter's container runtime:
- Usability.
- Robustness.
- High performance.

Talk in a nutshell
Production workflows with Docker and Shifter on Piz Daint:
1. Build and test containers with Docker on a laptop.
2. Run at high performance with Shifter on Piz Daint.

About Docker
Motto: "Build, Ship, and Run Any App, Anywhere"
Portability and convenience first; target: web applications.
Creation:
- Creates semi-isolated containers.
- Packages all application requirements.
Management:
- Easy-to-use recipe file.
- Version-control-driven image creation.
Sharing:
- Push and pull images from a community-driven hub (i.e., DockerHub).

About Shifter (1)
A lean runtime that enables the deployment of Docker-like Linux containers on HPC systems.
Developed at NERSC by D. Jacobsen and S. Canon.
Objectives:
- HPC environments.
- Performance.
- Security.

About Shifter (2)
Flexibility for users:
- Enable complex software stacks using different Linux flavors.
- Develop on your laptop and run on an HPC system.
Integration with HPC resources:
- Availability of shared resources (e.g., parallel filesystems, accelerator devices, and network interfaces).
Distribution and reproducibility:
- Integration with public image repositories, e.g., DockerHub.
- Improved reproducibility of results.

Workflow (from a laptop to Piz Daint)
Docker: ease of use and convenience (steps 1-3). Shifter: performance and security (steps 4-5).
1) Build the container.
2) Test the container.
3) Push it to a registry.
4) Pull the container.
5) Run on Piz Daint.
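As a sketch, the five steps map onto commands like these (the image name myuser/myapp and the program ./myapp are placeholders, not from the talk):

```
$ docker build -t myuser/myapp .             # 1) build the container on the laptop
$ docker run --rm myuser/myapp ./myapp       # 2) test it locally
$ docker push myuser/myapp                   # 3) push to a registry (e.g., DockerHub)
$ shifterimg pull myuser/myapp               # 4) pull onto Piz Daint with Shifter
$ srun shifter --image=myuser/myapp ./myapp  # 5) run on compute nodes
```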

About Docker and Shifter
Containers are hardware- and platform-agnostic by design. How do we go about accessing specialized hardware like GPUs?
1) CSCS and NVIDIA co-designed a solution that provides:
- direct access to the GPU character devices;
- automatic discovery of the required libraries at runtime.
NVIDIA's DGX-1 software stack is based on this solution.
2) CSCS extended the design to the MPI stack:
- Supports different versions of MPICH-based implementations.
Let's look at use cases to illustrate the workflows.

Use case: devicequery (GPU)

devicequery (1): building a simple image
Start from the NVIDIA CUDA base image and add devicequery (ethcscs/dockerfiles:cudasamples8.0):

FROM nvidia/cuda:8.0
RUN apt-get update && apt-get install -y --no-install-recommends \
        cuda-samples-$CUDA_PKG_VERSION && \
    rm -rf /var/lib/apt/lists/*
RUN (cd /usr/local/cuda/samples/1_Utilities/deviceQuery && make)
RUN (cd /usr/local/cuda/samples/5_Simulations/nbody && make)

devicequery (2): testing on a laptop
Let's start with Docker on the laptop:

$ nvidia-docker build -t "ethcscs/dockerfiles:cudasamples8.0" .
$ nvidia-docker run ethcscs/dockerfiles:cudasamples8.0 ./devicequery
$ nvidia-docker push ethcscs/dockerfiles:cudasamples8.0

./devicequery Starting...
 CUDA Device Query (Runtime API) version (CUDART static linking)
 Detected 1 CUDA Capable device(s)
 Device 0: "GeForce 940MX"
   CUDA Driver Version / Runtime Version 8.0 / 8.0
[...]
devicequery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce 940MX
Result = PASS

devicequery (3): running on Piz Daint
Running the container on Piz Daint:

$ shifterimg pull ethcscs/dockerfiles:cudasamples8.0
$ salloc -N 1 -C gpu
$ srun shifter --image=ethcscs/dockerfiles:cudasamples8.0 ./devicequery

./devicequery Starting...
 CUDA Device Query (Runtime API) version (CUDART static linking)
 Detected 1 CUDA Capable device(s)
 Device 0: "Tesla P100-PCIE-16GB"
   CUDA Driver Version / Runtime Version 8.0 / 8.0
[...]
devicequery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = Tesla P100-PCIE-16GB
Result = PASS

Use case: Nbody (GPU)

Nbody: same image
- N-body double-precision calculation using 200k bodies, single node.
- GPU-accelerated runs using the official CUDA image from DockerHub.
- Relative GFLOP/s performance comparing the laptop and Piz Daint:

             Laptop             Piz Daint (P100)
Native       18.34 [GFLOP/s]    149.01x
Container*   18.34 [GFLOP/s]    149.04x

*Laptop runs use nvidia-docker; Piz Daint uses Shifter.

Use case: TensorFlow (GPU / Third-party container)

TensorFlow
- A software library for building and training neural networks; uses CUDA.
- Official TensorFlow image from DockerHub (not modified).
- TensorFlow has a rapid release cycle (a new build is available every week!).
- Ready-to-run containers.
- Performance relative to the laptop wall-clock time for an image classification test (MNIST):

Test case            Laptop*          Piz Daint (P100)
MNIST, TF tutorial   613 [seconds]    17.17x

*Laptop run using nvidia-docker.
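Pulling and running the unmodified official image follows the same pattern as the other use cases; a sketch (the image tag and script path are hypothetical, not from the talk):

```
$ shifterimg pull tensorflow/tensorflow:latest-gpu
$ srun -C gpu shifter --image=tensorflow/tensorflow:latest-gpu \
      python /opt/examples/mnist_tutorial.py
```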

Note on MPI-stack support
Based on the MPICH Application Binary Interface (ABI) compatibility initiative:
- MPICH v3.1 (released February 2014)
- IBM MPI v2.1 (released December 2014)
- Intel MPI Library v5.0 (released June 2014)
- Cray MPT v7.0.0 (released June 2014)
- MVAPICH2 v2.0 (released June 2014)

$ srun -n 2 shifter --mpi --image=osu-benchmarks-image ./osu_latency

At run time Shifter:
- swaps the MPI library inside the container;
- checks for an ABI-compatible MPI on the host system;
- enables hardware acceleration.
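Conceptually, the swap works because the dynamic linker resolves the container binary's MPI symbols against whichever ABI-compatible library it finds first. A sketch of the idea (not Shifter's actual implementation; the host path is hypothetical):

```
# The container binary was linked against an MPICH-ABI libmpi.
# Pointing the loader at the host's ABI-compatible Cray MPT first
# makes the same binary use the interconnect-aware implementation.
$ export LD_LIBRARY_PATH=/opt/cray/mpt/default/lib:$LD_LIBRARY_PATH
$ ldd ./osu_latency | grep libmpi   # now resolves to the host library
```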

Use case: OSU benchmark (MPI)

OSU Benchmark

$ srun -n 2 shifter --mpi --image=osu-benchmarks-image ./osu_latency

Host MPI: Cray MPT 7.5.0 (Cray Aries interconnect).
Container MPI: (A) MPICH v3.1, (B) MVAPICH2 2.1, (C) Intel MPI Library.
Native performance!

Use case: PyFR (CUDA + MPI / Complex build)

PyFR
- Python-based framework for solving advection-diffusion type problems on streaming architectures.
- 2016 Gordon Bell Prize finalist (highly scalable).
- GPU- and MPI-accelerated runs using containers.
- Complex build (a 100-line Dockerfile) and test on a laptop.
- Production-like run on Piz Daint.
- Parallel efficiency for a 10-GB test case at different node counts:

Number of nodes    Piz Daint (P100)
1                  1.000
2                  0.975
4                  0.964
8                  0.927
16                 0.874

Use case: Portable compilation units / Linpack benchmark

Vanilla Linpack with specialized BLAS
- The performance of some applications depends on targeted optimization of libraries.
- Use a container to pack the application environment.
- Proof of concept: pack vanilla Linpack and compile a specialized BLAS before the run.
- Two stages: compile first (linking against host libraries), then run.
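A hedged sketch of the two stages (image name, makefile target, and library path are hypothetical, not from the talk):

```
# Stage 1: compile inside the container, linking the Linpack binary
# against the host's optimized BLAS visible at run time.
$ srun -N 1 shifter --image=linpack-image \
      make arch=Linux_Host LAlib=/opt/host-blas/libblas.a

# Stage 2: run the freshly linked binary in the same container.
$ srun -N 1 shifter --image=linpack-image ./xhpl
```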

Conclusion

Conclusion
The Docker-Shifter combination brings to production on Piz Daint:
- Portability.
- Scalability.
- High performance.
The use cases shown highlighted:
- pulling and running containers;
- high-performance containers;
- access to hardware accelerators like GPUs;
- use of the high-speed interconnect through MPI;
- a portable compilation environment.

Thank you for your attention

GPU device access (3)
On GPU device numbering:
- A must-have for application portability.
- A containerized application will consistently access the GPUs starting from ID 0 (zero).

$ export CUDA_VISIBLE_DEVICES=0
$ srun shifter --image=ethcscs/dockerfiles:cudasamples8.0 ./devicequery
[...]
 Detected 1 CUDA Capable device(s)
 Device 0: "Tesla K40m"
[...]

$ export CUDA_VISIBLE_DEVICES=2
$ srun shifter --image=ethcscs/dockerfiles:cudasamples8.0 ./devicequery
[...]
 Detected 1 CUDA Capable device(s)
 Device 0: "Tesla K80"
[...]
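The renumbering itself can be modeled with a small, GPU-free shell sketch (for illustration only): the comma-separated host IDs in CUDA_VISIBLE_DEVICES appear to the containerized application as consecutive IDs starting at 0.

```shell
# Model the mapping: selecting host GPUs 0 and 2 means the
# containerized application sees them as devices 0 and 1.
export CUDA_VISIBLE_DEVICES=0,2
i=0
for host_id in $(echo "$CUDA_VISIBLE_DEVICES" | tr ',' ' '); do
    echo "container device $i -> host GPU $host_id"
    i=$((i+1))
done
```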

GPU device access (4)
The same image on a multi-GPU system with Shifter:

$ shifterimg pull ethcscs/dockerfiles:cudasamples8.0
$ export CUDA_VISIBLE_DEVICES=0,2
$ srun shifter --image=ethcscs/dockerfiles:cudasamples8.0 ./devicequery

/usr/local/cuda/samples/bin/x86_64/linux/release/devicequery Starting...
 CUDA Device Query (Runtime API) version (CUDART static linking)
 Detected 2 CUDA Capable device(s)
 Device 0: "Tesla K40m"
[...]
 Device 1: "Tesla K80"
[...]
devicequery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 2, Device0 = Tesla K40m, Device1 = Tesla K80
Result = PASS

Use case: deploy at scale

Pynamic*
- Tests the startup time of workloads by simulating the DLL behaviour of Python applications.
- Compares wall-clock time for container vs. native runs.
- Over 3,000 MPI processes.

*Pynamic parameters: 495 shared object files; 1950 avg. functions per object; 215 math-like library files; 1950 avg. math functions; function name length ~100 characters.