Can we boost HPC performance further? Integrating IBM POWER servers with GPUs into an OpenStack environment. Ankit Purohit, Takeaki Matsumoto
Self-Introduction Ankit Purohit (a.purohit@ntt.com), NTT Communications Technology Development, High Performance Computing, GPU. Takeaki Matsumoto (takeaki.matsumoto@ntt.com), NTT Communications Technology Development, R&D for OpenStack, Ops for Private Cloud. 1
Previous talk at OpenPOWER Summit 2018 March 19, 2018 at Las Vegas OpenPOWER Summit Website: https://openpowerfoundation.org/summit-2018-03-us/ Co-speaker: Yutaka Kawai, IBM Japan Our Talk's Video: https://www.youtube.com/watch?v=l4g6smtgcou&feature=youtu.be Topics * KVM on POWER * Many other Benchmarks 2
Agenda Background Our OpenStack GPU cloud Motivation for using POWER server Goal Can we boost more performance with POWER? Approach Unleash POWER's full performance as Baremetal server Integrate POWER server into OpenStack Cloud Conclusion Another choice: Kubernetes 3
Background NTT Communications The largest Telecommunications company in Japan Subsidiaries and offices in over 110 cities worldwide Part of a Fortune Global 100 company Our team provides a GPU cloud using OpenStack for in-house users' experimental usage. AI communication engine COTOHA http://www.ntt.com/en/services/application/cotoha.html Deep Learning training on customer data (time-series) etc. 5
Our OpenStack Environment x86 servers (as compute nodes) with NVIDIA K10, M60, and P100 GPUs. Image source: https://www.openstack.org/software/ 6
Motivation to try IBM POWER system Even with the same GPU card, can a different server architecture bring us better performance?
Intel based system (DGX-1): CPU and GPU are connected via PCIe (32 GB/s); bandwidth between CPU sockets is 64 GB/s; bandwidth between CPU and memory is 76.8 GB/s.
IBM POWER8 system (Minsky): CPU and GPU are connected via NVLink (80 GB/s); bandwidth between CPU sockets is 76.8 GB/s; bandwidth between CPU and memory is 115 GB/s. 7
Goal How can we boost more performance with POWER? 8
Agenda Background Our OpenStack GPU cloud Motivation for using POWER server Goal Can we boost more performance with POWER? Approach Unleash POWER s full performance as Baremetal server Integrate POWER server into OpenStack Cloud Conclusion Another choice: Kubernetes 9
Benchmark program: nbody - nbody is a CUDA sample program. It can compute the simulation in single or double precision on the GPU and reports results in GFLOP/s; it can also run on the CPU only. $ ./nbody -benchmark -numbodies=2048000 -numdevices=1
-benchmark: run benchmark to measure performance
-numbodies: number of bodies (>= 1) to run in simulation (2048000 for the GPU benchmark, 20480 for the CPU benchmark)
-numdevices=<i>: where i = number of CUDA devices (> 0) to use for simulation
-cpu: run n-body simulation on the CPU
-fp64: use double precision floating point values for simulation 10
Benchmark program: nbody We use nbody to emulate a memory-intensive workload. In nbody, the GPUs directly access data from host memory (main memory) many times (zero-copy). Is NVLink (or PCIe) the bottleneck? (Diagram: nbody data flow from main memory through the CPU to GPU0, GPU1, ... and their GPU memories.) 11
Benchmark Result: POWER8 baremetal (1/2) With default server configuration. Workload: numbodies=2048000, FP32 on Minsky w/ RHEL7.3 (1GPU / 2GPU / 2GPU / 4GPU). When using 4 GPUs, performance is lower than with 2 GPUs because it does not scale. When using 2 GPUs, specifying different GPUs gives different performance. Why?! T. Kamenoue, M. Mitsugi, and Y. Kawai, "The optimization of nbody simulation on Multi-GPU environment," in Proc. the 80th National Convention of Information Processing Society of Japan (IPSJ), Tokyo, Japan, Mar. 2018, pp. 1-25,26. 12
A Solution: Memory Interleave What does memory interleave actually do? It spreads allocations equally across the memory of all nodes (CPU sockets) in a round-robin way, so I/O access is balanced; this works well in the case of the nbody benchmark (FP32). How to execute: numactl --interleave=all ./nbody OR numactl -i all ./nbody (Diagram: interleave disabled (default) vs. interleave enabled.) 13
What happens if Interleave is disabled? Workload: FP32, numbodies=2048000, 4 GPUs, interleave disabled. GPU0 and GPU1 always read from the CLOSE memory; GPU2 and GPU3 always read from the FAR memory. Elapsed time per iteration: GPU0: 4.3-4.4 s, GPU1: 4.3-4.4 s, GPU2: 9.2-9.10 s, GPU3: 9.2-9.10 s. Benchmark result: 8673 GFLOP/s. (Diagram: two POWER8 sockets, each with 115 GB/s to its local system memory and 80 GB/s NVLink to two P100 GPUs.) 14
What happens if Interleave is enabled? Workload: FP32, numbodies=2048000, 4 GPUs, interleave enabled. Every GPU reads 1/2 of the data from the CLOSE memory and 1/2 from the FAR memory. Elapsed time per iteration: 5.2-5.3 s on GPU0 through GPU3. Benchmark result: 15969 GFLOP/s. (Diagram: same topology as the previous slide.) 15
Benchmark Result: POWER8 baremetal (2/2) With memory interleave enabled. Workload: numbodies=2048000, FP32 on Minsky w/ RHEL7.3 (1GPU / 2GPU / 2GPU / 4GPU). Now it scales: the 4-GPU case has become faster than the 2-GPU case. 16
Benchmark Result: POWER8 vs DGX-1 baremetal nbody result when increasing the number of GPUs. Workload: numbodies=2048000, FP32 (chart: GFLOP/s at 1GPU / 2GPU / 4GPU, POWER8 vs DGX-1). The current Intel architecture machine cannot benefit from memory interleave because of its narrower I/O bandwidth. 17
Agenda Background Our OpenStack GPU cloud Motivation for using POWER server Goal Can we boost more performance with POWER? Approach Unleash POWER s full performance as Baremetal server Integrate POWER server into OpenStack Cloud Conclusion Another choice: Kubernetes 18
How to integrate POWER8 to OpenStack (Diagram: a Controller (x86) running nova-api, nova-scheduler, and nova-conductor, with nova-compute running on x86 compute nodes and on a ppc64le compute node.) 19
How to integrate POWER8 to OpenStack Linux can run on POWER8 KVM can run on POWER8 OpenStack can run on POWER8 (Cloud Archive repository available) Basically, the same procedure as on x86 can be used 20
How to integrate POWER8 to OpenStack For GPUs, we need KVM PCI passthrough. KVM supports it since qemu (1:2.6.1+dfsg-0ubuntu2) xenial: "Enable GPU Passthru for ppc64le" https://launchpad.net/bugs/1541902 An IOMMU (like Intel VT-d) is also needed; in POWER servers, the IBM Translation Control Entry (TCE) is available. 21
How to integrate POWER8 to OpenStack Environment OpenPOWER IBM S822LC for HPC "Minsky" CPU: 20 cores (logical: 160 cores) MEM: 1TB GPU: NVIDIA P100 * 4 (with NVLink) OS Ubuntu 16.04.4 (kernel: 4.15.0-13-generic) Software KVM 2.11 Nova 17.0.1 (Queens) 22
How to integrate POWER8 to OpenStack Configuration
Kernel parameter: vfio-pci.disable_idle_d3=1
Disable SMT: $ sudo ppc64_cpu --smt=off
Disable the nouveau driver:
$ cat /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
$ sudo update-initramfs -u
$ sudo reboot
$ lsmod | grep nouveau (should print nothing) 23
How to integrate POWER8 to OpenStack Nova Configuration
Compute node: ensure the PCI device id:
$ lspci -nn | grep -i nvidia
0002:01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:15f9] (rev a1)
nova.conf:
[DEFAULT]
pci_passthrough_whitelist = {"vendor_id":"10de","product_id":"15f9"}
Controller node: nova.conf:
[DEFAULT]
pci_alias = {"vendor_id":"10de", "product_id":"15f9", "name":"P100"}
[filter_scheduler]
enabled_filters = ...,PciPassthroughFilter 24
Our OpenStack Environment: After Integration x86 servers with NVIDIA K10, M60, and P100 GPUs, plus POWER8 servers with NVIDIA P100 GPUs. Image source: https://www.openstack.org/software/ 25
Benchmark of OpenStack-integrated VM Instance flavor: vCPU: 16, Mem: 120GB, Disk: 160GB. Metadata: pci_passthrough:alias=P100:4, hw:mem_page_size=16384, hw:numa_nodes=2. GPU environment: NVIDIA driver 390.12, CUDA 9.1. 26
Benchmark of OpenStack-integrated VM nbody benchmark results $ numactl -i all ./nbody -benchmark -numbodies=2048000 (1GPU / 2GPU / 4GPU) 27
Benchmark of OpenStack-integrated VM CPU-GPU memory bandwidth benchmark results $ ./bandwidthTest 28
Benchmark of OpenStack-integrated VM CPU-GPU memory bandwidth benchmark results $ ./bandwidthTest Why? 29
Benchmark of OpenStack-integrated VM NVLink implementation Physically, the CPU and GPU are connected via NVLink (about 2.5x PCIe bandwidth), but Linux recognizes the GPU as a PCI device plus separate NVLink bridge devices. 30
Benchmark of OpenStack-integrated VM OpenStack attached only the GPU to the VM via PCI passthrough; without the NVLink devices, the link falls back to PCIe x8. 31
Benchmark of OpenStack-integrated VM Does passing through all 3 devices (the GPU and its two NVLink devices) solve this issue? 32
Benchmark of OpenStack-integrated VM GPU loc-code $ lspci -d 10de:15f9 0002:01:00.0 3D controller: NVIDIA Corporation Device 15f9 (rev a1) 0003:01:00.0 3D controller: NVIDIA Corporation Device 15f9 (rev a1) 000a:01:00.0 3D controller: NVIDIA Corporation Device 15f9 (rev a1) 000b:01:00.0 3D controller: NVIDIA Corporation Device 15f9 (rev a1) $ cat /sys/bus/pci/devices/0002\:01\:00.0/of_node/ibm\,loc-code GPU1 $ cat /sys/bus/pci/devices/0003\:01\:00.0/of_node/ibm\,loc-code GPU2 $ cat /sys/bus/pci/devices/000a\:01\:00.0/of_node/ibm\,loc-code GPU3 $ cat /sys/bus/pci/devices/000b\:01\:00.0/of_node/ibm\,loc-code GPU4 33
Benchmark of OpenStack-integrated VM NVLink devices and their connections $ lspci -d 1014:04ea 0004:00:00.0 Bridge: IBM Device 04ea 0004:00:00.1 Bridge: IBM Device 04ea 0004:00:01.0 Bridge: IBM Device 04ea 0004:00:01.1 Bridge: IBM Device 04ea 0005:00:00.0 Bridge: IBM Device 04ea 0005:00:00.1 Bridge: IBM Device 04ea 0005:00:01.0 Bridge: IBM Device 04ea 0005:00:01.1 Bridge: IBM Device 04ea $ cat /sys/bus/pci/devices/0004\:00\:00.0/of_node/ibm\,loc-code GPU2 $ cat /sys/bus/pci/devices/0004\:00\:00.1/of_node/ibm\,loc-code GPU2 $ cat /sys/bus/pci/devices/0004\:00\:01.0/of_node/ibm\,loc-code GPU1 $ cat /sys/bus/pci/devices/0004\:00\:01.1/of_node/ibm\,loc-code GPU1 $ cat /sys/bus/pci/devices/0005\:00\:00.0/of_node/ibm\,loc-code GPU4 $ cat /sys/bus/pci/devices/0005\:00\:00.1/of_node/ibm\,loc-code GPU4 $ cat /sys/bus/pci/devices/0005\:00\:01.0/of_node/ibm\,loc-code GPU3 $ cat /sys/bus/pci/devices/0005\:00\:01.1/of_node/ibm\,loc-code GPU3 34
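The loc-code lookups above can be scripted. A minimal sketch in Python: on a real Minsky you would read /sys/bus/pci/devices/&lt;addr&gt;/of_node/ibm,loc-code for each device; here the sysfs reads are stubbed with the addresses and loc-codes shown on these slides.

```python
# Sketch: group NVLink bridge devices with their GPU by ibm,loc-code.
# LOC_CODES stands in for sysfs reads; values are from the slides above.
from collections import defaultdict

LOC_CODES = {
    # GPU PCI addresses (10de:15f9)
    "0002:01:00.0": "GPU1", "0003:01:00.0": "GPU2",
    "000a:01:00.0": "GPU3", "000b:01:00.0": "GPU4",
    # NVLink bridge addresses (1014:04ea)
    "0004:00:00.0": "GPU2", "0004:00:00.1": "GPU2",
    "0004:00:01.0": "GPU1", "0004:00:01.1": "GPU1",
    "0005:00:00.0": "GPU4", "0005:00:00.1": "GPU4",
    "0005:00:01.0": "GPU3", "0005:00:01.1": "GPU3",
}

GPU_ADDRS = {"0002:01:00.0", "0003:01:00.0", "000a:01:00.0", "000b:01:00.0"}

def build_device_sets(loc_codes):
    """Return {loc_code: {"gpu": addr, "nvlinks": [addr, ...]}}."""
    sets = defaultdict(lambda: {"gpu": None, "nvlinks": []})
    for addr, loc in loc_codes.items():
        if addr in GPU_ADDRS:
            sets[loc]["gpu"] = addr
        else:
            sets[loc]["nvlinks"].append(addr)
    return dict(sets)

if __name__ == "__main__":
    for loc, devs in sorted(build_device_sets(LOC_CODES).items()):
        print(loc, devs["gpu"], sorted(devs["nvlinks"]))
```

The resulting map gives, for each GPU to be passed through, exactly the two bridge addresses that must accompany it.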
Benchmark of OpenStack-integrated VM Add NVLink devices (by hand) to instance-000000xx.xml: ~~~ <hostdev mode='subsystem' type='pci' managed='yes'> <source> <address domain='0x0002' bus='0x01' slot='0x00' function='0x0'/> </source> <address type='pci' domain='0x0000' bus='0x00' slot='0x8' function='0x0'/> </hostdev> <hostdev mode='subsystem' type='pci' managed='yes'> <source> <address domain='0x0004' bus='0x00' slot='0x01' function='0x0'/> </source> <address type='pci' domain='0x0000' bus='0x00' slot='0x9' function='0x0' multifunction='on'/> </hostdev> <hostdev mode='subsystem' type='pci' managed='yes'> <source> <address domain='0x0004' bus='0x00' slot='0x01' function='0x1'/> </source> <address type='pci' domain='0x0000' bus='0x00' slot='0x9' function='0x1'/> </hostdev> ~~~ 35
Benchmark of OpenStack-integrated VM CPU-GPU Memory bandwidth benchmark results with NVLink device added 36
Benchmark of OpenStack-integrated VM nbody benchmark results with NVLink devices added (1GPU / 2GPU / 4GPU) 37
How can we manage NVLink devices? OpenStack doesn't care about device connections: it allocates from the 1014:04ea (NVLink) pool and the 10de:15f9 (GPU) pool independently, so a request like P100:1,NVLink:2 may receive NVLink devices that belong to a different GPU. 38
How can we manage NVLink devices? Ideally, there would be a device_set_p100 pool in which each entry bundles a GPU with its own two NVLink devices (GPU1 through GPU4), and a request for device_set_p100:1 would receive one matching set. 39
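The "ideal" pooling on this slide can be modeled as follows. DeviceSetPool and its method names are illustrative, not an existing Nova API: the point is that a GPU and its NVLink bridges are allocated atomically, never split across requests.

```python
# Sketch of the ideal scheduling: instead of independent GPU (10de:15f9)
# and NVLink (1014:04ea) pools, allocate whole *device sets*, so a VM
# never receives NVLink bridges belonging to another GPU.
class DeviceSetPool:
    def __init__(self, sets):
        # sets: list of (gpu_addr, [nvlink_addr, nvlink_addr]) tuples
        self._free = list(sets)

    def allocate(self, count=1):
        """Hand out `count` complete device sets, or fail atomically."""
        if count > len(self._free):
            raise RuntimeError("not enough free device sets")
        taken, self._free = self._free[:count], self._free[count:]
        return taken

# Two of the four Minsky sets, using addresses from the earlier slides.
pool = DeviceSetPool([
    ("0002:01:00.0", ["0004:00:01.0", "0004:00:01.1"]),  # GPU1
    ("0003:01:00.0", ["0004:00:00.0", "0004:00:00.1"]),  # GPU2
])
```

A request for device_set_p100:1 maps to pool.allocate(1) and always yields a GPU together with its own bridges.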
How can we manage NVLink devices? Our solution: add a simple script between libvirt and qemu. Rename qemu-system-ppc64 to qemu-system-ppc64.orig and install the script as qemu-system-ppc64. Nova requests the P100; libvirt invokes the script; the script adds the NVLink device parameters and launches the VM with the P100 and its NVLink devices. 40
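A sketch of such a wrapper. The paths and the GPU-to-NVLink mapping are illustrative (a real script must derive the bridge addresses for whichever GPU is passed through, e.g. from ibm,loc-code), and a stub that echoes its arguments stands in for the renamed qemu binary so the idea can be demonstrated without a POWER host.

```shell
#!/bin/sh
# Sketch of the libvirt/qemu shim. In production the wrapper would be
# installed as /usr/bin/qemu-system-ppc64 and the real binary renamed to
# qemu-system-ppc64.orig; here everything lives in a temp dir.
set -eu
dir=$(mktemp -d)

# Stub standing in for the renamed real qemu binary: just echo the args.
cat > "$dir/qemu-system-ppc64.orig" <<'EOF'
#!/bin/sh
echo "$@"
EOF

# The wrapper: pass all original arguments through, and if the GPU at
# 0002:01:00.0 is being passed through, append its two NVLink bridges
# (addresses taken from the loc-code mapping on the earlier slides).
cat > "$dir/qemu-system-ppc64" <<'EOF'
#!/bin/sh
extra=""
case "$*" in
  *0002:01:00.0*)
    extra="-device vfio-pci,host=0004:00:01.0 -device vfio-pci,host=0004:00:01.1"
    ;;
esac
# $extra is intentionally unquoted so it splits into separate arguments.
exec "$(dirname "$0")/qemu-system-ppc64.orig" "$@" $extra
EOF
chmod +x "$dir"/qemu-system-ppc64*

# Simulate libvirt launching a VM with only the GPU; the wrapper appends
# the two NVLink -device arguments before handing off to "qemu".
out=$("$dir/qemu-system-ppc64" -device vfio-pci,host=0002:01:00.0)
echo "$out"
rm -rf "$dir"
```

Because libvirt only sees the wrapper's path, Nova needs no changes: it keeps requesting a plain P100 while every VM still gets the matching NVLink devices.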
Agenda Background Our OpenStack GPU cloud Motivation for using POWER server Goal Can we boost more performance with POWER? Approach Unleash POWER s full performance as Baremetal server Integrate POWER server into OpenStack Cloud Conclusion Another choice: Kubernetes 41
Conclusion How can we boost more performance with POWER? Memory interleave may be required to get maximum performance. Add POWER as a compute node into OpenStack and specify the GPU and its NVLink devices to pass through to the VM. POWER8 gives better performance than x86 in some cases thanks to its powerful NVLink CPU-GPU connection. With OpenStack, some limitations exist: SMT is not available, and NVLink requires extra device allocation that OpenStack doesn't support yet. 42
Agenda Background Our OpenStack GPU cloud Motivation for using POWER server Goal Can we boost more performance with POWER? Approach Unleash POWER s full performance as Baremetal server Integrate POWER server into OpenStack Cloud Conclusion Another choice: Kubernetes 43
Another option What about containers? 44
Another option How do we manage containers and GPUs? 45
Another option Kubernetes schedules containers, can integrate with OpenStack, and supports GPU scheduling. Requirements: NVIDIA drivers ~= 361.93, the Device Plugin feature, the NVIDIA device plugin for Kubernetes, and nvidia-docker. 46
Another option (Diagram: the stack, top to bottom: Device Plugin feature, NVIDIA device plugin for Kubernetes, nvidia-docker, NVIDIA driver, NVIDIA GPU.) 47
Another option Device Plugin feature
K8s version <= 1.9: add a kubelet exec parameter, "--feature-gates=DevicePlugins=true". Example (deployed by kubeadm):
$ cat /etc/systemd/system/kubelet.service.d/10-kubeadm.conf | grep KUBELET_EXTRA_ARGS=
Environment="KUBELET_EXTRA_ARGS=--feature-gates=DevicePlugins=true"
K8s version >= 1.10: the Device Plugins feature is Beta and enabled by default.
Note: If you deploy k8s using kubeadm and the controller is x86, you have to do something like:
$ docker tag gcr.io/google_containers/kube-proxy-ppc64le:v1.9.2 gcr.io/google_containers/kube-proxy:v1.9.2 48
Another option NVIDIA device plugin for Kubernetes https://github.com/nvidia/k8s-device-plugin Build the image for ppc64le: $ docker build . -t nvidia/k8s-device-plugin:1.9 49
Another option nvidia-docker (2.0) supports NVLink devices. ppc64le packages were not available at first; nvidia-docker depends on libnvidia-container (https://github.com/nvidia/libnvidia-container) and nvidia-container-runtime (https://github.com/nvidia/nvidia-container-runtime), but they can now be installed from the official NVIDIA repository: https://nvidia.github.io/nvidia-docker/ 50
Another option Change the default Docker runtime:
$ cat /etc/docker/daemon.json
$ sudo systemctl daemon-reload
$ sudo systemctl restart docker
Enable the NVIDIA device plugin:
$ kubectl create -f https://raw.githubusercontent.com/nvidia/k8s-device-plugin/v1.9/nvidia-device-plugin.yml 51
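The daemon.json contents are not reproduced on the slide; the standard nvidia-docker 2.0 configuration that makes nvidia the default runtime looks like this (path as installed by the nvidia-container-runtime package):

```json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
```

Setting "default-runtime" matters here because kubelet launches containers through Docker directly, so GPU pods only work if the NVIDIA runtime is the default.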
Another option Ensure the GPU resource is available: $ kubectl describe node (check that nvidia.com/gpu appears in the node's Capacity). 52
Another option Run a test pod to verify GPU access: $ kubectl apply -f bandwidth-test.yml then $ kubectl logs bwt-pod (bandwidth-test.yml) 53
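bandwidth-test.yml itself is not reproduced on the slide; a minimal pod spec that requests one GPU might look like the following (the image name and command are assumptions; any ppc64le CUDA image with the bandwidthTest sample built will do):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: bwt-pod
spec:
  restartPolicy: Never
  containers:
  - name: bandwidth-test
    # Illustrative image; substitute your own CUDA image for ppc64le.
    image: nvidia/cuda-ppc64le:9.1-devel
    command: ["./bandwidthTest"]
    resources:
      limits:
        nvidia.com/gpu: 1   # satisfied by the NVIDIA device plugin
```

The nvidia.com/gpu limit is what routes the pod to a GPU node and triggers the device plugin to mount the device.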
Another option CPU-GPU Memory bandwidth benchmark results 54
Thank you! 55
References
OpenStack Docs: Attaching physical PCI devices to guests: https://docs.openstack.org/nova/pike/admin/pci-passthrough.html
Device Plugins - Kubernetes: https://kubernetes.io/docs/concepts/cluster-administration/device-plugins/
Feature Gates - Kubernetes: https://kubernetes.io/docs/reference/feature-gates/
GitHub - NVIDIA/k8s-device-plugin: https://github.com/nvidia/k8s-device-plugin
GitHub - NVIDIA/nvidia-docker: https://github.com/nvidia/nvidia-docker 56