
Life of a Packet KubeCon Europe 2017 Michael Rubin <mrubin@google.com> TL/TLM in GKE/Kubernetes github.com/matchstick Google Cloud Platform

Kubernetes is about clusters. Because of that, networking is pretty important: most of Kubernetes centers on network concepts. Our job is to make sure your applications can communicate: with each other, with the world outside your cluster, and only where you want.

The IP-per-pod model

Every pod has a real IP address. This is different from the out-of-the-box model Docker offers: no machine-private IPs, no port-mapping. Pod IPs are accessible from other pods, regardless of which VM they are on. Built on Linux network namespaces (aka netns) and virtual interfaces.

Network namespaces Node

Network namespaces root netns Node

Network namespaces root netns pod1 netns Node

Network namespaces root netns vethxx vethxy pod1 netns Node

Network namespaces root netns vethxx pod1 netns Node

Network namespaces root netns vethxx vethyy pod1 netns Node pod2 netns

Network namespaces cbr0 root netns vethxx vethyy pod1 netns Node pod2 netns
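A rough sketch of the plumbing the diagrams above describe, using iproute2 (the namespace name, interface names, and address are illustrative; in a real cluster the node's network plugin performs these steps, not an admin by hand):

  ip link add cbr0 type bridge                      # the node's pod bridge
  ip netns add pod1                                 # pod1's network namespace
  ip link add vethxx type veth peer name ctr-eth0   # a veth pair
  ip link set ctr-eth0 netns pod1                   # one end moves into the pod's netns
  ip netns exec pod1 ip addr add 10.24.1.5/24 dev ctr-eth0
  ip netns exec pod1 ip link set ctr-eth0 up
  ip link set vethxx master cbr0                    # the other end stays in the root netns, attached to the bridge
  ip link set vethxx up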

Life of a packet: pod-to-pod, same node root netns vethxx cbr0 vethyy pod1 netns Node pod2 netns

Life of a packet: pod-to-pod, same node src: pod1 dst: pod2 root netns cbr0 vethxx vethyy pod1 netns Node pod2 netns

Life of a packet: pod-to-pod, same node src: pod1 dst: pod2 root netns cbr0 vethxx vethyy pod1 ctr1 netns netns Node pod2 ctr2 netns


Flat network space. Pods must be reachable across Nodes, too. Kubernetes doesn't care HOW, but this is a requirement: L2, L3, or overlay. Assign a CIDR (IP block) to each Node. GCP: teach the network how to route packets.
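On GKE, for example, each Node object records the block it was assigned, and a GCP route sends that block to the Node's VM. A minimal sketch of the Node side (the name and CIDR are illustrative):

  kind: Node
  apiVersion: v1
  metadata:
    name: node1
  spec:
    podCIDR: 10.24.1.0/24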

Life of a packet: pod-to-pod, across nodes Node1 pod1 pod2 vethxx vethyy root cbr0 root vethxx cbr0 vethyy Node2 pod3 pod4

Life of a packet: pod-to-pod, across nodes src: pod1 dst: pod4 Node1 root pod1 ctr1 vethxx cbr0 vethyy ctr2 pod2 root vethxx cbr0 vethyy Node2 pod3 ctr3 pod4 ctr4


Overlays

Overlay networks: why? Can't get enough IP space. Network can't handle extra routes. Want management features. Encapsulate packet-in-packet; traverse the native network between Nodes.

Overlay networks: why not? Latency overhead in some cloud providers. Complexity overhead. Often not required. Use it when you know you need it.

Overlay example: Flannel (vxlan) VXLAN interface acts like any other vnic Node1 root pod1 vethxx cbr0 pod2 vethyy flannel0 root vethxx cbr0 flannel0 vethyy Node2 pod3 pod4

Overlay example: Flannel src: pod1 dst: pod4 Node1 root pod1 vethxx cbr0 pod2 vethyy flannel0 root vethxx cbr0 flannel0 vethyy Node2 pod3 pod4


Overlay example: Flannel src: pod1 dst: pod4 root netns Node1 vethxx cbr0 flannel0


Overlay example: Flannel. Encapsulates the packet. Flannel device implementation: simple VXLAN, developed by CoreOS for containers and Kubernetes. Uses Linux native VXLAN devices, with a userspace agent for address resolution; the data path is in-kernel (fast). Diagram: the inner packet [src: pod1 | dst: pod4 | payload] is wrapped by flannel0 into [MAC | src: node1 | dst: node2 | UDP | src: pod1 | dst: pod4 | payload].
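For reference, flannel picks its backend from a small JSON config (net-conf.json, typically stored in etcd or a ConfigMap); a minimal sketch with an illustrative cluster CIDR:

  {
    "Network": "10.244.0.0/16",
    "Backend": {
      "Type": "vxlan"
    }
  }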

Overlay example: Flannel src: pod1 src: node1 dst: pod4 dst: node2 encapsulated root netns flannel0 Node1 vethxx cbr0

Overlay example: Flannel src: node1 dst: node2 root netns Node1 vethxx cbr0 flannel0

Overlay example: Flannel src: node1 dst: node2 Node1 root pod1 vethxx cbr0 pod2 vethyy flannel0 root vethxx cbr0 flannel0 vethyy Node2 pod3 pod4


Overlay example: Flannel src: node1 dst: node2 flannel0 cbr0 root netns vethxx Node2

Overlay example: Flannel src: node1 src: pod1 dst: node2 dst: pod4 decapsulated flannel0 root netns cbr0 vethxx Node2

Overlay example: Flannel src: pod1 dst: pod4 Node1 root pod1 vethxx cbr0 pod2 vethyy flannel0 root vethxx cbr0 flannel0 vethyy Node2 pod3 pod4


Dealing with change. A real cluster changes over time: rolling updates, scale-up and scale-down events, pods crash or hang, Nodes reboot. The pod addresses you need to reach can change without warning. You need something more durable than a pod IP.

Services

The service abstraction. A service is a group of endpoints (usually pods). Services provide a stable VIP; the VIP automatically routes to backend pods. Implementations can vary; we will examine the default implementation. The set of pods behind a service can change, but clients only need the VIP, which doesn't change.

Service. What you submit is simple; other fields will be defaulted or assigned. The selector field chooses which pods to balance across.

kind: Service
apiVersion: v1
metadata:
  name: store-be
spec:
  selector:
    app: store
    role: be
  ports:
  - name: http
    port: 80

Service. What you get back has more information: it automatically creates a distributed load balancer, and the default is to allocate an in-cluster IP (type: ClusterIP).

kind: Service
apiVersion: v1
metadata:
  name: store-be
  namespace: default
  creationTimestamp: 2016-05-06T19:16:56Z
  resourceVersion: "7"
  selfLink: /api/v1/namespaces/default/services/store-be
  uid: 196d5751-13bf-11e6-9353-42010a800fe3
spec:
  type: ClusterIP
  selector:
    app: store
    role: be
  clusterIP: 10.9.3.76
  ports:
  - name: http
    protocol: TCP
    port: 80
    targetPort: 80
  sessionAffinity: None

Endpoints selector: app: store role: be 10.11.5.3 app: store role: be app: store role: fe app: db role: be 10.11.8.67 10.7.1.18 app: store role: be app: db role: be app: store role: be 10.11.8.67 10.4.1.11 10.11.0.9


Endpoints. When you create a service, a controller wakes up and creates an Endpoints object that holds the IPs of the pod backends.

kind: Endpoints
apiVersion: v1
metadata:
  name: store-be
  namespace: default
subsets:
- addresses:
  - ip: 10.11.8.67
  - ip: 10.11.5.3
  - ip: 10.11.0.9
  ports:
  - name: http
    port: 80
    protocol: TCP

Life of a packet: pod-to-service root netns vethxx cbr0 pod1 netns

Life of a packet: pod-to-service src: pod1 dst: svc1 root netns vethxx cbr0 pod1 ctr1 netns


Life of a packet: pod-to-service src: pod1 dst: svc1 root netns iptables cbr0 vethxx pod1 ctr1 netns

Life of a packet: pod-to-service src: pod1 dst: svc1 dst: pod99 DNAT, conntrack root netns iptables cbr0 vethxx pod1 ctr1 netns

Conntrack: Linux kernel connection-tracking. Remembers address translations, based on the 5-tuple. Does a lot more, but that's not very relevant here. Reversed on the return path.

{ protocol = TCP, src_ip = pod1, src_port = 1234, dst_ip = svc1, dst_port = 80 }
  => { protocol = TCP, src_ip = pod1, src_port = 1234, dst_ip = pod99, dst_port = 80 }

Life of a packet: pod-to-service src: pod1 dst: pod99 root netns iptables cbr0 vethxx pod1 ctr1 netns

Life of a packet: pod-to-service src: pod99 dst: pod1 root netns iptables cbr0 vethxx pod1 ctr1 netns

Life of a packet: pod-to-service src: pod99 src: svc1 dst: pod1 un-dnat root netns iptables cbr0 vethxx pod1 ctr1 netns

Life of a packet: pod-to-service src: svc1 dst: pod1 root netns iptables cbr0 vethxx pod1 ctr1 netns


A bit more on iptables. The iptables rules look scary, but are actually simple:

  if dest.ip == svc1.ip && dest.port == svc1.port {
      pick one of the backends at random
      rewrite destination IP
  }

Configured by kube-proxy, a pod running on each Node. It is not actually a proxy and is not in the data path. Kube-proxy is a controller: it watches the API for services.
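For the store-be service above (clusterIP 10.9.3.76, three endpoints), the rules kube-proxy writes look roughly like this; the chain suffixes and exact probability values are illustrative, but the shape, a per-service chain that statistically jumps to a per-endpoint DNAT chain, is the real mechanism:

  -A KUBE-SERVICES -d 10.9.3.76/32 -p tcp --dport 80 -j KUBE-SVC-STOREBE
  -A KUBE-SVC-STOREBE -m statistic --mode random --probability 0.333 -j KUBE-SEP-1
  -A KUBE-SVC-STOREBE -m statistic --mode random --probability 0.500 -j KUBE-SEP-2
  -A KUBE-SVC-STOREBE -j KUBE-SEP-3
  -A KUBE-SEP-1 -p tcp -j DNAT --to-destination 10.11.8.67:80
  -A KUBE-SEP-2 -p tcp -j DNAT --to-destination 10.11.5.3:80
  -A KUBE-SEP-3 -p tcp -j DNAT --to-destination 10.11.0.9:80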

DNS. Even easier: services are added to an in-cluster DNS server. You would never hardcode an IP, but you might hardcode a hostname and port. Serves A and SRV records. DNS itself runs as pods and a service.
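For the store-be service in the default namespace, the names look like this (cluster.local is the default cluster domain and can be configured differently):

  A record:   store-be.default.svc.cluster.local  ->  10.9.3.76
  SRV record: _http._tcp.store-be.default.svc.cluster.local  ->  port 80 of store-be.default.svc.cluster.local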

DNS Service. Requests a particular cluster IP. Pods are auto-scaled with the cluster size. The service VIP is stable.

kind: Service
apiVersion: v1
metadata:
  name: kube-dns
  namespace: kube-system
spec:
  clusterIP: 10.0.0.10
  selector:
    k8s-app: kube-dns
  ports:
  - name: dns
    port: 53
    protocol: UDP
  - name: dns-tcp
    port: 53
    protocol: TCP

Simple and powerful. Can use any port you want, no conflicts. Can request a particular clusterIP. Can remap ports.
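Remapping is just port vs. targetPort in the spec; for example, store-be could expose port 80 on the VIP while the pods listen on 8080 (the backend port here is illustrative):

  ports:
  - name: http
    port: 80          # what clients dial on the service VIP
    targetPort: 8080  # what the backend pods actually listen on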

That's all there is to it. Services are an abstraction: the API is a VIP. No running process, no intercepting the data path. All a client needs to do is hit the service IP:port.

Sending external traffic Services are within a cluster What happens if you want your pod to reach google.com?

Egress

Leaving the GCP project Nodes get private IPs (in 10.0.0.0/8) Nodes can have public IPs, too GCP: Public IPs are provided by 1-to-1 NAT

Life of a packet: Node-to-internet 1:1 NAT Cluster root netns Node

Life of a packet: Node-to-internet src: internal-ip dst: 8.8.8.8 Cluster 1:1 NAT root netns Node

Life of a packet: Node-to-internet src: internal-ip src: external-ip dst: 8.8.8.8 Cluster 1:1 NAT root netns Node

Life of a packet: Node-to-internet src: external-ip dst: 8.8.8.8 Cluster 1:1 NAT root netns Node

Life of a packet: Node-to-internet src: 8.8.8.8 dst: external-ip Cluster 1:1 NAT root netns Node

Life of a packet: Node-to-internet src: 8.8.8.8 dst: external-ip dst: internal-ip Cluster 1:1 NAT root netns Node

Life of a packet: Node-to-internet src: 8.8.8.8 dst: internal-ip Cluster 1:1 NAT root netns Node

Life of a packet: pod-to-internet 1:1 NAT Cluster root netns vethxx cbr0 Node pod1 netns

Life of a packet: pod-to-internet src: pod1 dst: 8.8.8.8 Cluster 1:1 NAT root netns vethxx cbr0 Node pod1 netns


Life of a packet: pod-to-internet dropped! Cluster 1:1 NAT root netns vethxx cbr0 Node pod1 netns

What went wrong? The 1:1 NAT only understands Node IPs; anything else gets dropped, and pod IPs != Node IPs. When in doubt, add some more iptables: MASQUERADE, aka SNAT. Applies to any packet with a destination *outside* of 10.0.0.0/8.
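The rule is roughly this (the interface name is illustrative); anything not destined for the cluster's 10.0.0.0/8 gets rewritten to the Node's internal IP on the way out:

  iptables -t nat -A POSTROUTING ! -d 10.0.0.0/8 -o eth0 -j MASQUERADE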

Life of a packet: pod-to-internet src: pod1 src: internal-ip dst: 8.8.8.8 MASQUERADE Cluster root netns 1:1 NAT iptables cbr0 vethxx Node pod1 netns

Life of a packet: pod-to-internet src: internal-ip dst: 8.8.8.8 Cluster 1:1 NAT iptables root netns vethxx cbr0 Node pod1 netns

Life of a packet: pod-to-internet src: internal-ip src: external-ip dst: 8.8.8.8 Cluster 1:1 NAT iptables root netns vethxx cbr0 Node pod1 netns

Life of a packet: pod-to-internet src: external-ip dst: 8.8.8.8 Cluster 1:1 NAT iptables root netns vethxx cbr0 Node pod1 netns

Life of a packet: pod-to-internet src: 8.8.8.8 dst: external-ip Cluster 1:1 NAT iptables root netns vethxx cbr0 Node pod1 netns

Life of a packet: pod-to-internet src: 8.8.8.8 dst: external-ip dst: internal-ip Cluster 1:1 NAT iptables root netns vethxx cbr0 Node pod1 netns

Life of a packet: pod-to-internet src: 8.8.8.8 dst: internal-ip Cluster 1:1 NAT iptables root netns vethxx cbr0 Node pod1 netns

Life of a packet: pod-to-internet src: 8.8.8.8 dst: internal-ip dst: pod1 Cluster 1:1 NAT iptables root netns vethxx cbr0 Node pod1 netns

Life of a packet: pod-to-internet src: 8.8.8.8 dst: pod1 Cluster 1:1 NAT iptables root netns vethxx cbr0 Node pod1 netns


Receiving external traffic. Kubernetes builds on two kinds of cloud load balancer: Network Load Balancer (L4) and HTTP/S Load Balancer (L7). These map to Kubernetes APIs: Service type=LoadBalancer and Ingress.
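The L7 path is driven by the Ingress API; this talk focuses on the L4 path, but for reference a minimal Ingress of that era looked roughly like this (the hostname is illustrative):

  kind: Ingress
  apiVersion: extensions/v1beta1
  metadata:
    name: store-ingress
  spec:
    rules:
    - host: store.example.com
      http:
        paths:
        - path: /
          backend:
            serviceName: store-be
            servicePort: 80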

L4: Service + LoadBalancer

Service. Change the type of your service; it is implemented by the cloud provider controller.

kind: Service
apiVersion: v1
metadata:
  name: store-be
spec:
  type: LoadBalancer
  selector:
    app: store
    role: be
  ports:
  - name: https
    port: 443

Service. The LB info is populated when ready.

kind: Service
apiVersion: v1
metadata:
  name: store-be
  # ...
spec:
  type: LoadBalancer
  selector:
    app: store
    role: be
  clusterIP: 10.9.3.76
  ports:
  # ...
  sessionAffinity: None
status:
  loadBalancer:
    ingress:
    - ip: 86.75.30.9

Life of a packet: external-to-service Cluster pod1 pod2 pod3 Node1 Node2 Node3

Life of a packet: external-to-service Net LB Cluster pod1 pod2 pod3 Node1 Node2 Node3


Life of a packet: external-to-service src: client dst: LB Cluster Net LB pod1 pod2 pod3 Node1 Node2 Node3

Life of a packet: external-to-service src: client dst: LB Cluster Net LB Choose a Node pod1 pod2 pod3 VM1 Node1 VM1 Node2 VM1 Node3

Life of a packet: external-to-service src: client dst: LB Cluster Net LB Choose a Node VM1 Node1 pod1 pod2 pod3 Node2 Node3

Life of a packet: external-to-service src: client dst: LB Cluster Net LB VM1 Node1 pod1 pod2 pod3 Node2 Node3


Balancing to Nodes. Most LBs only know about Nodes, and Nodes do not map 1:1 with pods. Node1 Node2 Node3

The imbalance problem Assume the LB only hits Nodes with backend pods on them The LB only knows about Nodes Cluster Net LB pod1 pod2 pod3 Node1 Node2 Node3

The imbalance problem Net LB Cluster 50% 50% pod1 pod2 pod3 Node1 Node2 Node3

The imbalance problem Net LB Cluster 50% 50% 50% 25% 25% pod1 pod2 pod3 Node1 Node2 Node3

Balancing to Nodes. Most cloud LBs only know about Nodes, and Nodes do not map 1:1 with pods. How do we avoid imbalance? iptables, of course. Node1 Node2 Node3

Life of a packet: external-to-service src: client dst: LB Cluster Net LB Choose a pod iptables pod1 pod2 pod3 Node1 Node2 Node3


Life of a packet: external-to-service src: client dst: LB dst: pod2 Cluster iptables Net LB NAT pod1 pod2 pod3 Node1 Node2 Node3

Life of a packet: external-to-service src: client dst: pod2 Cluster Net LB iptables pod1 pod2 pod3 Node1 Node2 Node3


Life of a packet: external-to-service src: pod2 dst: client Cluster Net LB iptables pod1 pod2 pod3 Node1 Node2 Node3


Life of a packet: external-to-service src: pod2 dst: client Cluster Net LB INVALID iptables pod1 pod2 pod3 Node1 Node2 Node3

Life of a packet: external-to-service src: client src: Node1 dst: LB dst: pod2 NAT Cluster iptables Net LB pod1 pod2 pod3 Node1 Node2 Node3

Life of a packet: external-to-service src: Node1 dst: pod2 Cluster Net LB iptables pod1 pod2 pod3 Node1 Node2 Node3


Life of a packet: external-to-service src: pod2 dst: Node1 Cluster Net LB iptables pod1 pod2 pod3 Node1 Node2 Node3


Life of a packet: external-to-service src: pod2 src: LB dst: Node1 dst: client Cluster iptables Net LB pod1 pod2 pod3 Node1 Node2 Node3

Life of a packet: external-to-service src: LB dst: client Cluster Net LB pod1 pod2 pod3 Node1 Node2 Node3


Explaining the complexity. To avoid imbalance, we re-balance inside Kubernetes: a backend is chosen randomly from all pods. Good: well balanced, in practice. Bad: can cause an extra network hop, and hides the client IP from the user's backend. Users wanted to make the trade-off themselves.

OnlyLocal. Specify an external-traffic policy: iptables will always choose a pod on the same node. Preserves client IP; risks imbalance.

kind: Service
apiVersion: v1
metadata:
  name: store-be
  annotations:
    service.beta.kubernetes.io/external-traffic: OnlyLocal
spec:
  type: LoadBalancer
  selector:
    app: store
    role: be
  ports:
  - name: https
    port: 443

Opt-in to the imbalance problem In practice Kubernetes spreads pods across nodes If pods >> nodes: OK If nodes >> pods: OK If pods ~= nodes: risk Cluster Net LB 50% 50% iptables iptables 50% 25% 25% pod1 pod2 pod3 Node1 Node2 Node3

Life of a packet: external-to-service. Nodes with no local backend are not considered: the health check fails if there are no backends on that node. Cluster Net LB pod1 pod2 pod3 Node1 Node2 Node3

Life of a packet: external-to-service src: client dst: LB Cluster Net LB pod1 pod2 pod3 Node1 Node2 Node3

Life of a packet: external-to-service src: client dst: LB Cluster Net LB Choose a Node pod1 pod2 pod3 VM1 Node1 Node2 VM1 Node3

Life of a packet: external-to-service src: client dst: LB Cluster Net LB Node1 pod1 pod2 pod3 Node2 VM1 Node3

Life of a packet: external-to-service src: client dst: LB Cluster Net LB pod1 pod2 pod3 Node1 Node2 VM1 Node3

Life of a packet: external-to-service src: client dst: LB Cluster Net LB iptables Choose a pod pod1 pod2 pod3 Node1 Node2 Node3

Life of a packet: external-to-service src: client dst: LB dst: pod2 Cluster Net LB iptables DNAT pod1 pod2 pod3 Node1 Node2 Node3

Life of a packet: external-to-service src: client dst: pod2 Cluster Net LB iptables pod1 pod2 pod3 Node1 Node2 Node3

Life of a packet: external-to-service src: pod2 src: LB dst: client Cluster Net LB iptables pod1 pod2 pod3 Node1 Node2 Node3

Life of a packet: external-to-service src: LB dst: client Cluster Net LB iptables pod1 pod2 pod3 Node1 Node2 Node3

Life of a packet: external-to-service src: LB dst: client Cluster Net LB pod1 pod2 pod3 Node1 Node2 Node3


Network Policy

Network Policy. A common pattern for applications is to organize into micro-services or tiers, for example the classic three-tier app: app: store role: fe, app: store role: be, app: store role: db. Users want to lock down the network: allow some tiers to communicate with others, but not a free-for-all.

Network Policy. The network can disallow communication between tiers that should not talk to each other: an FE should never reach around to the DB, and an FE should never talk to another FE. Labels are dynamic, so these rules must be dynamic too.

Namespaces. A private scope for creating and managing objects (Pods, Services, NetworkPolicies...). Namespaced objects always live in some namespace: if you don't specify one in the YAML, it comes from the kubectl command line or the current context: kubectl -n my-namespace create -f file.yaml

Network Policy When first created, the namespace allows all pods to reach each other. Remember we said all pods can reach each other. app: store role: fe app: store role: be app: store role: db

Network Policy Describe the allowed links Example: Pods labelled role: be can receive traffic from role: fe Pods labelled role: db can receive traffic from role: be app: store role: fe app: store role: be app: store role: db

Network Policy. Install policies with a per-namespace API. Specify: the pods subject to this policy, the pods allowed to connect to the subjects, and the port(s) allowed.

apiVersion: extensions/v1beta1
kind: NetworkPolicy
metadata:
  name: store-net-policy
spec:
  podSelector:
    matchLabels:
      role: db
  ingress:
  - from:
    - podSelector:
        matchLabels:
          role: be
    ports:
    - protocol: TCP
      port: 6379
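The fe-to-be link described two slides back would be a second policy of the same shape (the name and port here are illustrative):

  apiVersion: extensions/v1beta1
  kind: NetworkPolicy
  metadata:
    name: store-be-net-policy
  spec:
    podSelector:
      matchLabels:
        role: be
    ingress:
    - from:
      - podSelector:
          matchLabels:
            role: fe
      ports:
      - protocol: TCP
        port: 80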

Network Policy Switch the network isolation mode to DefaultDeny Absent policies, no network traffic can flow app: store role: fe X app: store role: be app: store role: db X X X X

Network Policy.

kind: Namespace
apiVersion: v1
metadata:
  name: store-namespace
  annotations:
    net.beta.kubernetes.io/network-policy: |
      { "ingress": { "isolation": "DefaultDeny" } }

Network Policy. Connections allowed by NetworkPolicies are OK. Ordering is very important for these steps: define the policies before switching the namespace to DefaultDeny. app: store role: fe X app: store role: be X app: store role: db

Network Policy There may be multiple NetworkPolicies in a Namespace Purely additive There is no way to specify deny in NetworkPolicy Implemented at L3/L4, no L7 support app: store role: fe app: store role: be app: store role: db

Open implementation. Beta in v1.6, expected GA in v1.7. Kubernetes allows the user to select the best method to implement NetworkPolicy. Today there is a wide range of choices: Calico, Contiv, OpenShift, Romana, Trireme, WeaveNet, and more...

Watch this space. Kubernetes networking is a moving target. The efforts of open source developers continue to improve and simplify the system. Hopefully at the next KubeCon we will have the opportunity to present more.

Kubernetes is Open: open community, open design, open source, open to ideas. https://kubernetes.io Code: github.com/kubernetes/kubernetes Chat: slack.k8s.io Twitter: @kubernetesio Google Cloud Platform