Bringing Security and Multitenancy to Kubernetes Lei (Harry) Zhang
About Me: Lei (Harry) Zhang. #Microsoft MVP in cloud and datacenter management (though I'm a Linux guy :/). Previous: VMware, Baidu. Feature maintainer of Kubernetes. HyperCrew: https://hyper.sh ; Publications: Docker & Kubernetes Under the Hood. PhD candidate @ZJU: large-scale cluster management and scheduling.
A survey about boundaries: Are you comfortable with Linux containers as an effective boundary? Yes: I use containers in my private/safe environment. No: I use containers to serve the public cloud.
As long as we care about security, we have to wrap containers inside full-blown virtual machines. But then we lose the dream of cloud-native deployment. The reality: slow startup time, huge resource waste, and a memory tax for every container.
Revisit: Docker Container
Dockerfile: FROM busybox; ADD temp.txt /; VOLUME /data; CMD ["echo hello"]
Container Runtime (namespace, cgroups): the dynamic view and boundary of your running process.
Container Image: the static view of your program, data, dependencies, files and directories.
[Diagram: layer stack, top to bottom]
    read-write layer: /data, echo hello
    init layer: /etc/hosts, /etc/hostname, /etc/resolv.conf
    read-only layers: CMD ["echo hello"] (json), VOLUME /data (json), ADD temp.txt / (/temp.txt), FROM busybox (/bin /dev /etc /home /lib /lib64 /media /mnt /opt /proc /root /run /sbin /sys /tmp /usr /var)
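The layer stack above can be sketched as a lookup that searches from the top layer down, with all writes landing in the read-write layer. This is only a minimal illustration: the dictionaries, file contents, and the `resolve`/`write` helpers are hypothetical and not how any real storage driver is implemented.

```python
from typing import Dict, List, Optional

Layer = Dict[str, str]  # path -> file content (illustrative only)


def build_stack() -> List[Layer]:
    """Return layers ordered bottom to top, mirroring the slide."""
    busybox = {"/bin/sh": "busybox shell"}          # FROM busybox (read-only)
    added = {"/temp.txt": "original temp file"}     # ADD temp.txt / (read-only)
    init = {"/etc/hosts": "127.0.0.1 localhost"}    # init layer
    read_write: Layer = {}                          # container's read-write layer
    return [busybox, added, init, read_write]


def resolve(stack: List[Layer], path: str) -> Optional[str]:
    """Return the topmost version of a path, or None if no layer has it."""
    for layer in reversed(stack):
        if path in layer:
            return layer[path]
    return None


def write(stack: List[Layer], path: str, content: str) -> None:
    """Writes never touch image layers; they go to the top (read-write) layer."""
    stack[-1][path] = content


if __name__ == "__main__":
    stack = build_stack()
    write(stack, "/temp.txt", "modified in container")
    print(resolve(stack, "/temp.txt"))
    print(stack[1]["/temp.txt"])
```

The image layers stay untouched after the write, which is why the same image can back many containers.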
HyperContainer: Secure Kubernetes at the runtime level
HyperContainer. Container Runtime: RunV (https://github.com/hyperhq/runv), the OCI-compatible, hypervisor-based runtime implementation, widely adopted by companies like Huawei. Control daemon: https://github.com/hyperhq/hyperd ; Container Image: Docker Image Spec.
Combine the best parts
Portable and behaves like a Linux container:
    $ hyperctl run -t busybox echo helloworld
    sub-second startup time*, ~12MB memory cost
Fully isolated sandbox with an independent guest kernel:
    $ hyperctl exec -t busybox uname -r
    4.4.12-hyper (or your provided kernel)
Security, backward compatibility, maturity. See: http://hypercontainer.io/why-hyper.html
HyperContainer is a Pod. That's how HyperContainer fits into the Kubernetes philosophy. Wait, why is the Pod so important?
Pod: lesson learned from Borg. Should sample.war be packaged with Tomcat?
Pod: lesson learned from Borg. InitContainers: one or more containers started in sequence before the pod's normal containers are started. They can share volumes, perform network operations, and perform computation prior to the app containers.
So, a Pod is: the group of super-affinity containers; the atomic scheduling unit; the "process group" in a container cloud. It lets you do the right things without modifying your container image. Analogy: Kubernetes = Spring Framework, Pod = IoC. [Diagram: a Pod containing log and app containers, an infra container, an init container, and a volume]
Pod is not easy to simulate. The log and app containers have super affinity. Requirement: app: 1G, log: 0.5G. Available: Node_A: 1.25G, Node_B: 2G. What happens if app is scheduled to Node_A?
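To make the question concrete: once app takes 1G on Node_A, only 0.25G remains, so log cannot join it. The sketch below contrasts per-container placement with placing the Pod as one unit. The first-fit policy and both function names are assumptions made for illustration; this is not the Kubernetes scheduler.

```python
from typing import Dict, Optional


def schedule_individually(requests: Dict[str, float],
                          nodes: Dict[str, float]) -> Dict[str, Optional[str]]:
    """Place each container on its own, first node with enough free memory."""
    free = dict(nodes)
    placement: Dict[str, Optional[str]] = {}
    for name, need in requests.items():
        placement[name] = None
        for node, avail in free.items():
            if avail >= need:
                free[node] = avail - need
                placement[name] = node
                break
    return placement


def schedule_as_pod(requests: Dict[str, float],
                    nodes: Dict[str, float]) -> Optional[str]:
    """Place all containers together: the node must fit their total request."""
    total = sum(requests.values())
    for node, avail in nodes.items():
        if avail >= total:
            return node
    return None


if __name__ == "__main__":
    requests = {"app": 1.0, "log": 0.5}          # GB, from the slide
    nodes = {"Node_A": 1.25, "Node_B": 2.0}      # GB available, from the slide
    print(schedule_individually(requests, nodes))
    print(schedule_as_pod(requests, nodes))
```

Scheduled one by one, app and log end up on different nodes and the affinity is broken; scheduled as a Pod, both go to Node_B.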
HyperContainer is a Pod. Linux-container-based runtimes wrap and encapsulate several app containers into a logical group. Hypervisor-based container runtime: the hypervisor serves as a natural boundary of the Pod.
HyperContainer is a Pod. kubelet calls through the Container Runtime Interface: create sandbox Foo --> create container C --> start container C --> stop container C --> remove container C --> delete sandbox Foo. Sandbox: normally the infra container; with HyperContainer, a hypervisor with HyperKernel and a HyperStart process as PID 1, which sets up the mnt namespace, launches apps from the images, etc.
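The call sequence above can be sketched as a tiny in-memory runtime that enforces the ordering. The class and method names follow the wording on the slide and are hypothetical; they are not the actual CRI gRPC method names.

```python
class SandboxRuntime:
    """In-memory stand-in for a runtime behind the CRI (illustrative only)."""

    def __init__(self):
        self.sandboxes = {}   # sandbox name -> {container name: state}
        self.calls = []       # ordered record of accepted requests

    def create_sandbox(self, sandbox):
        self.sandboxes[sandbox] = {}
        self.calls.append(f"create sandbox {sandbox}")

    def _move(self, verb, sandbox, container, expected, new):
        containers = self.sandboxes[sandbox]  # KeyError if the sandbox is missing
        if containers.get(container) != expected:
            raise RuntimeError(f"cannot {verb} container {container}")
        if new is None:
            del containers[container]
        else:
            containers[container] = new
        self.calls.append(f"{verb} container {container}")

    def create_container(self, sandbox, container):
        self._move("create", sandbox, container, None, "created")

    def start_container(self, sandbox, container):
        self._move("start", sandbox, container, "created", "running")

    def stop_container(self, sandbox, container):
        self._move("stop", sandbox, container, "running", "stopped")

    def remove_container(self, sandbox, container):
        self._move("remove", sandbox, container, "stopped", None)

    def delete_sandbox(self, sandbox):
        if self.sandboxes[sandbox]:
            raise RuntimeError(f"sandbox {sandbox} still has containers")
        del self.sandboxes[sandbox]
        self.calls.append(f"delete sandbox {sandbox}")


if __name__ == "__main__":
    rt = SandboxRuntime()
    rt.create_sandbox("Foo")
    rt.create_container("Foo", "C")
    rt.start_container("Foo", "C")
    rt.stop_container("Foo", "C")
    rt.remove_container("Foo", "C")
    rt.delete_sandbox("Foo")
    print(" --> ".join(rt.calls))
```

The sandbox is the only thing that changes between runtimes: the container calls stay the same whether the sandbox is an infra container or a hypervisor.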
Hypernetes Kubernetes with HyperContainer Runtime
Hypernetes (also: h8s). 1. Kubernetes + HyperContainer runtime: officially supported by using kubernetes/frakti. 2. Multi-tenant network and persistent volumes: battle-tested Neutron + Cinder plugin.
Multi-tenant Network
Multi-tenant Network. Goal: leverage tenant-aware Neutron networks for Kubernetes, following the network plugin workflow. Non-goal: breaking the k8s network model or hacking k8s code.
Define the Network. Network: a top-level API object. Each tenant (created by Keystone) has its own Network. A Network maps to a Neutron net. A Network Controller is responsible for managing the Network lifecycle.
Example [Diagram: the controller-manager runs ControlLoops (network, pod, replica, namespace, service, job, deployment, volume, petset) that reconcile the Desired World with the Real World; the network loop calls Neutron to create/delete networks. Also shown: api-server, etcd, scheduler, and on each node a kubelet SyncLoop and a proxy]
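A control loop of this kind can be sketched as a comparison between the desired set of Network objects and what actually exists in Neutron. `FakeNeutron` and its methods are hypothetical stand-ins, not the Neutron client API, and the real controller certainly handles more cases than this.

```python
from typing import Iterable, List, Set


class FakeNeutron:
    """Hypothetical in-memory replacement for a Neutron client."""

    def __init__(self, existing: Iterable[str] = ()):
        self.networks: Set[str] = set(existing)

    def list_networks(self) -> List[str]:
        return sorted(self.networks)

    def create_network(self, name: str) -> None:
        self.networks.add(name)

    def delete_network(self, name: str) -> None:
        self.networks.discard(name)


def reconcile_networks(desired: Set[str], neutron: FakeNeutron) -> List[str]:
    """Drive the real world toward the desired world; return the actions taken."""
    actual = set(neutron.list_networks())
    actions = []
    for name in sorted(desired - actual):
        neutron.create_network(name)
        actions.append(f"create {name}")
    for name in sorted(actual - desired):
        neutron.delete_network(name)
        actions.append(f"delete {name}")
    return actions


if __name__ == "__main__":
    neutron = FakeNeutron(existing=["tenant-old"])
    print(reconcile_networks({"tenant-a", "tenant-b"}, neutron))
    print(reconcile_networks({"tenant-a", "tenant-b"}, neutron))  # nothing left to do
```

Running the loop twice shows the property that matters: once the two worlds agree, a further pass does nothing.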
Kubernetes Network Model. Container reaches container: all containers can communicate with all other containers without NAT. Node reaches container: all nodes can communicate with all containers (and vice versa) without NAT. IP addressing: a Pod in the cluster can be addressed by its IP.
How does h8s fit that? A Network can be assigned to one or more Namespaces. Pods belonging to the same Network can reach each other directly through IP. A Pod's network maps to a Neutron port. kubelet is responsible for Pod network setup, so let's see how kubelet works.
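The Network-to-Namespace rule can be illustrated with a small lookup. The data layout and the function names are assumptions for this sketch only; they say nothing about how Hypernetes stores the mapping.

```python
from typing import Dict, List, Optional


def network_of(namespace: str,
               networks: Dict[str, List[str]]) -> Optional[str]:
    """Return the Network a Namespace is assigned to, if any."""
    for network, namespaces in networks.items():
        if namespace in namespaces:
            return network
    return None


def can_reach(ns_a: str, ns_b: str, networks: Dict[str, List[str]]) -> bool:
    """Pods reach each other directly only if their Namespaces share a Network."""
    net_a = network_of(ns_a, networks)
    net_b = network_of(ns_b, networks)
    return net_a is not None and net_a == net_b


if __name__ == "__main__":
    # One Network may cover several Namespaces (names are made up).
    networks = {"net-tenant-a": ["dev", "staging"], "net-tenant-b": ["prod"]}
    print(can_reach("dev", "staging", networks))
    print(can_reach("dev", "prod", networks))
```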
Example, step 1: Pod created. [Diagram: scheduler, api-server, etcd, and two nodes each running a kubelet SyncLoop and a proxy]
Example, step 2: Pod object added. [Same diagram]
Example, step 3 (scheduler): 3.1 New pod object detected; 3.2 Bind pod with node. [Same diagram]
Example, step 4 (kubelet): 4.1 Detected pod bound to me; 4.2 Start containers in pod. [Same diagram]
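The four steps can be sketched with a shared store standing in for the api-server and etcd, plus one pass each for the scheduler and a kubelet. The store layout, the trivial "first node" placement, and all function names are assumptions for illustration; the real components watch the api-server rather than scanning a dictionary.

```python
from typing import Dict, List, Optional

Pod = Dict[str, Optional[object]]


def create_pod(store: Dict[str, Pod], name: str) -> None:
    """Steps 1 and 2: the pod is created and its object is added to the store."""
    store[name] = {"node": None, "running": False}


def scheduler_pass(store: Dict[str, Pod], nodes: List[str]) -> List[str]:
    """Step 3: detect unbound pods and bind each one to a node."""
    bound = []
    for name, pod in store.items():
        if pod["node"] is None:
            pod["node"] = nodes[0]  # placeholder policy, not real scheduling
            bound.append(name)
    return bound


def kubelet_pass(store: Dict[str, Pod], node: str) -> List[str]:
    """Step 4: detect pods bound to this node and start their containers."""
    started = []
    for name, pod in store.items():
        if pod["node"] == node and not pod["running"]:
            pod["running"] = True
            started.append(name)
    return started


if __name__ == "__main__":
    store: Dict[str, Pod] = {}
    create_pod(store, "web")
    print(kubelet_pass(store, "node-1"))      # nothing yet: pod is not bound
    print(scheduler_pass(store, ["node-1", "node-2"]))
    print(kubelet_pass(store, "node-2"))      # not this node's pod
    print(kubelet_pass(store, "node-1"))
```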
Design of kubelet [Diagram: at startup, choose runtime (docker, rkt, hyper/remote) and InitNetworkPlugin. Components around the SyncLoop: NodeStatus, Network Status, status Manager, volume Manager, image Manager, PLEG, PodUpdate. The SyncLoop dispatches HandlePods {Add, Update, Remove, Delete, ...} to a Pod Update Worker, which (e.g. for add): generates Pod status; checks volume status (discussed later); calls the runtime to start containers; sets up the Pod network (see next slide)]
Set Up Pod Network
kubestack: a standalone gRPC daemon that 1. translates the SetUpPod request to the Neutron network API; 2. handles the multi-tenant Service proxy.
Service
OnServiceUpdate:
    $ iptables-save | grep my-service
    -A KUBE-SERVICES -d 10.0.0.116/32 -p tcp -m comment --comment "default/my-service: cluster IP" -m tcp --dport 8001 -j KUBE-SVC-KEAUNL7HVWWSEZA6
    -A KUBE-SVC-KEAUNL7HVWWSEZA6 -m comment --comment "default/my-service:" --mode random -j KUBE-SEP-6XXFWO3KTRMPKCHZ
    -A KUBE-SVC-KEAUNL7HVWWSEZA6 -m comment --comment "default/my-service:" --mode random -j KUBE-SEP-57KPRZ3JQVENLNBRZ
    -A KUBE-SEP-6XXFWO3KTRMPKCHZ -p tcp -m comment --comment "default/my-service:" -m tcp -j DNAT --to-destination 172.17.0.2:80
    -A KUBE-SEP-57KPRZ3JQVENLNBRZ -p tcp -m comment --comment "default/my-service:" -m tcp -j DNAT --to-destination 172.17.0.3:80
OnEndpointsUpdate:
    [Diagram: portal 10.0.0.116:8001 --> random mode rules --> backend rule_1 172.17.0.2:80, backend rule_2 172.17.0.3:80]
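The "random mode rules" step can be sketched as a chain of rules tried in order, each matching with some probability, with the last rule always matching. The probabilities are not shown on the slide; the 1/(remaining rules) values used here are an assumption chosen so that every backend is equally likely, and `pick_backend` is a hypothetical helper.

```python
import random
from collections import Counter
from typing import List


def pick_backend(backends: List[str], rng: random.Random) -> str:
    """Walk the rule chain: rule i matches with probability 1/(n - i)."""
    if not backends:
        raise ValueError("service has no endpoints")
    n = len(backends)
    for i, backend in enumerate(backends):
        is_last = i == n - 1
        if is_last or rng.random() < 1.0 / (n - i):
            return backend
    raise AssertionError("unreachable: the last rule always matches")


if __name__ == "__main__":
    backends = ["172.17.0.2:80", "172.17.0.3:80"]   # from the slide
    rng = random.Random(0)
    counts = Counter(pick_backend(backends, rng) for _ in range(10000))
    print(counts)
```

With two backends the first rule matches half the time and the second catches the rest, so traffic to the portal is spread roughly evenly.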
Multi-tenant Service. The default iptables-based kube-proxy is not tenant-aware: endpoint Pods and Nodes with iptables rules are isolated into different networks. Hypernetes uses a built-in HAProxy as the Service portal: it handles all Service instances within the same namespace and follows the same OnServiceUpdate and OnEndpointsUpdate workflow. ExternalProvider: an OpenStack LB will be created as the Service, e.g. curl 58.215.33.98:8078
Persistent Volume
Kubernetes Persistent Volume
Volume Manager: reconcile against the desired World
    Get mountedvolume from actualstateofworld; unmount volumes in mountedvolume but not in desiredstateofworld
    AttachVolume() if vol in desiredstateofworld and not attached
    MountVolume() if vol in desiredstateofworld and not in mountedvolume
    Verify devices that should be detached/unmounted are detached/unmounted
Tips: 1. -v host:path 2. attach vs. mount 3. totally independent from container management
[Diagram: the Cinder volume plugin attaches the volume to the Host at the attach path, which is then mounted to each Pod's mountpath]
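The reconcile steps above can be sketched over three sets: desired, attached, and mounted volumes. The function name, the set-based state, and the explicit final detach step are assumptions for this sketch (the slide only says detachment is verified); the real volume manager is asynchronous and tracks far more state.

```python
from typing import List, Set, Tuple

Action = Tuple[str, str]


def reconcile(desired: Set[str], attached: Set[str],
              mounted: Set[str]) -> Tuple[List[Action], Set[str], Set[str]]:
    """Return the actions taken plus the new attached and mounted sets."""
    attached, mounted = set(attached), set(mounted)  # do not mutate the inputs
    actions: List[Action] = []

    # Unmount volumes that are mounted but no longer desired.
    for vol in sorted(mounted - desired):
        actions.append(("unmount", vol))
        mounted.discard(vol)
    # Attach desired volumes that are not attached yet.
    for vol in sorted(desired - attached):
        actions.append(("attach", vol))
        attached.add(vol)
    # Mount desired volumes that are not mounted yet.
    for vol in sorted(desired - mounted):
        actions.append(("mount", vol))
        mounted.add(vol)
    # Assumed cleanup: detach whatever is attached but not desired.
    for vol in sorted(attached - desired):
        actions.append(("detach", vol))
        attached.discard(vol)
    return actions, attached, mounted


if __name__ == "__main__":
    actions, attached, mounted = reconcile(
        desired={"vol-a", "vol-b"}, attached={"vol-b", "vol-c"}, mounted={"vol-c"})
    for action in actions:
        print(action)
```

Attach and mount are separate steps here on purpose, matching the slide's tip: a volume is first attached to the host and only then mounted for a Pod.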
Persistent Volume with HyperContainer (enhanced Cinder volume plugin). Linux container: 1. full OpenStack cluster; 2. query Nova to find the node; 3. attach the Cinder volume to a host path; 4. bind mount the host path to the Pod containers. HyperContainer: directly attach block devices to the Pod, thanks to the hypervisor-based Pod boundary; this eliminates the extra time to query Nova. [Diagram: in both cases the Volume Manager reconciles against the desired World; with Linux containers the volume is attached to the Host and mounted to each Pod's mountpath, with HyperContainer the volume is attached to the Pod itself]
PV Example: 1. Create a Cinder volume. 2. Claim the volume by referencing its volume ID.
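The manifest from the original slide is not preserved here, so the following is only a guess at its general shape: a PersistentVolume whose `cinder` source points at an existing volume through `volumeID`. The helper name, the object name, the size, and the placeholder volume ID are all hypothetical, and Hypernetes may reference the volume differently.

```python
import json
from typing import Any, Dict


def make_cinder_pv(name: str, volume_id: str, size_gi: int) -> Dict[str, Any]:
    """Build a PersistentVolume manifest that references a Cinder volume by ID."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": name},
        "spec": {
            "capacity": {"storage": f"{size_gi}Gi"},
            "accessModes": ["ReadWriteOnce"],
            # The ID is whatever Cinder returned when the volume was created.
            "cinder": {"volumeID": volume_id, "fsType": "ext4"},
        },
    }


if __name__ == "__main__":
    pv = make_cinder_pv("example-pv", "<cinder-volume-id>", 1)  # placeholder ID
    print(json.dumps(pv, indent=2))
```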
Container Runtime Interface
Future of CRI. Keep Docker as the only default container runtime. Other runtimes: oci-runtime, rktlet, hyperd. Frakti: the Remote Container Runtime Kit, https://github.com/kubernetes/frakti (welcome to try it out, star and fork).
What if the image becomes non-standard, e.g. the Docker image becomes somehow Docker-specific? Don't worry, kubelet.imagemanager is moving to be runtime-specific, but then k8s will probably choose NO DEFAULT runtime.
Full Topology [Diagram components: Master; Keystone; Neutron (Object: Network); Cinder and Ceph; several Nodes, each running Pods; kubestack; kube-proxy; kubelet (Object: Pod); Neutron L2 Agent; Cinder Plugin]
Summary. A new way to build secure and multi-tenant Kubernetes: Kubernetes + HyperContainer + Neutron Plugin + Cinder Plugin + Keystone. Project URL: https://github.com/hyperhq/hypernetes ; Roadmap: graduate the HyperContainer runtime to k8s upstream (see HyperContainer in an official k8s release); Neutron CNI plugin. Tip: https://hyper.sh is totally built on Hypernetes, try it out :)
END Lei (Harry) Zhang @resouer