Host Stack and Layers
Florin Coras, Dave Barach, Keith Burns, Dave Wallace
A universal terabit network platform for native cloud network services:
- Most efficient on the planet (efficiency)
- Superior performance
- Flexible and extensible (software-defined networking)
- Cloud native (cloud network services)
- Open source (Linux Foundation)
Breaking the barrier of software-defined network services: 1 terabit of services on a single Intel Xeon server!
How does it work? A compute-optimized software network platform:
1. Packet processing is decomposed into a directed graph of nodes (e.g. dpdk-input, ethernet-input, ip4-input, ip4-lookup, ip4-rewrite-transit, interface-output).
2. Packets move through the graph nodes as a vector.
3. Graph nodes are optimized to fit inside the instruction cache.
4. Packets are pre-fetched into the data cache.
Each graph node implements a micro network function (micro-nf) processing packets. This exploits modern Intel Xeon micro-architectures: the instruction cache and data cache stay hot, minimizing memory latency and usage.
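The vector-processing idea can be sketched as a toy graph node. The `packet_t` layout and node logic below are invented for illustration (VPP's real nodes operate on vlib buffers), and the prefetch uses a GCC-style intrinsic:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical packet: only the field this sketch needs. */
typedef struct {
  uint8_t ttl;
} packet_t;

/* One toy "graph node": decrement TTL across a whole vector of packets.
 * Processing many packets per call keeps the node's instructions hot in
 * the i-cache while packet data streams through the d-cache. Returns the
 * number of packets dropped (TTL expired). */
static int ip4_ttl_node(packet_t *pkts[], int n_pkts, packet_t *dropped[]) {
  int n_dropped = 0;
  for (int i = 0; i < n_pkts; i++) {
#ifdef __GNUC__
    /* Pre-fetch a packet a few slots ahead, as in step 4 above. */
    if (i + 4 < n_pkts)
      __builtin_prefetch(pkts[i + 4]);
#endif
    if (pkts[i]->ttl <= 1)
      dropped[n_dropped++] = pkts[i]; /* would be handed to an error node */
    else
      pkts[i]->ttl--;                 /* continues to the next graph node */
  }
  return n_dropped;
}
```

In VPP proper, each node hands its output vector to the next node in the graph rather than returning it; a single node's inner loop processing the full vector is what keeps the i-cache hot.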
Motivation: container networking (kernel path)
[Diagram: two processes (PID 1234, PID 4321) exchange data through glibc send()/recv(), kernel FIFOs, the kernel's IP routing, and the network devices.]
Motivation: container networking (VPP path)
[Diagram: the same two processes, but each send()/recv() now traverses per-app FIFOs into VPP: af_packet or dpdk devices, Ethernet, MPLS, IP4/IP6, and services such as ACLs, SR, VXLAN and LISP, alongside the kernel's IP routing.]
Why not this?
[Diagram: cut the kernel out entirely and connect the two processes' FIFOs directly to VPP's IP layer over DPDK.]
Host Stack
[Diagram: an app with rx/tx FIFOs allocated in a shared-memory (shm) segment, attached to VPP's session layer.]
Host Stack: Session Layer
- Maintains per-app state and conveys session events to/from apps
- Allocates and manages sessions, segments and FIFOs
- Isolates network resources via namespacing
- 5-tuple session lookup tables and local/global session rule tables (filters)
- Support for pluggable transport protocols
- Binary/native C API for external/built-in applications
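To illustrate the 5-tuple lookup the session layer performs, here is a minimal hash-table sketch. The key layout, hash function and probing scheme are all illustrative, not VPP's (the real tables live in src/vnet/session):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative 5-tuple key. */
typedef struct {
  uint32_t src_ip, dst_ip;
  uint16_t src_port, dst_port;
  uint8_t  proto;
} tuple5_t;

#define TBL_SLOTS 64

typedef struct { tuple5_t key; uint32_t session; int used; } slot_t;
typedef struct { slot_t slots[TBL_SLOTS]; } lookup_tbl_t;

static int key_eq(const tuple5_t *a, const tuple5_t *b) {
  return a->src_ip == b->src_ip && a->dst_ip == b->dst_ip &&
         a->src_port == b->src_port && a->dst_port == b->dst_port &&
         a->proto == b->proto;
}

/* Toy mixing function, not VPP's actual hash. */
static uint32_t tuple5_hash(const tuple5_t *t) {
  uint32_t h = t->src_ip ^ (t->dst_ip * 2654435761u);
  h ^= ((uint32_t) t->src_port << 16) | t->dst_port;
  return h ^ t->proto;
}

static void tbl_add(lookup_tbl_t *tbl, const tuple5_t *k, uint32_t session) {
  uint32_t i = tuple5_hash(k) % TBL_SLOTS;
  while (tbl->slots[i].used)
    i = (i + 1) % TBL_SLOTS;              /* linear probing */
  tbl->slots[i].key = *k;
  tbl->slots[i].session = session;
  tbl->slots[i].used = 1;
}

static int tbl_find(const lookup_tbl_t *tbl, const tuple5_t *k, uint32_t *out) {
  uint32_t i = tuple5_hash(k) % TBL_SLOTS;
  for (int n = 0; n < TBL_SLOTS && tbl->slots[i].used; n++) {
    if (key_eq(&tbl->slots[i].key, k)) {
      *out = tbl->slots[i].session;
      return 1;
    }
    i = (i + 1) % TBL_SLOTS;
  }
  return 0;                                /* no such session */
}
```

On an incoming packet or connect request, the session layer resolves the 5-tuple to a session index this way before any per-session work happens.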
Host Stack: SVM FIFOs
- Allocated within shared-memory segments
- Fixed position and size
- Lock-free enqueue/dequeue, with an atomic size increment
- Option to dequeue or peek data
- Support for out-of-order data enqueues
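The FIFO properties above can be sketched as a minimal single-producer/single-consumer byte FIFO: fixed size, lock-free copies, and an atomically maintained fill count. This is a toy model, not VPP's actual svm_fifo:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define FIFO_SIZE 1024

typedef struct {
  uint8_t data[FIFO_SIZE];
  uint32_t head;        /* consumer-owned read index */
  uint32_t tail;        /* producer-owned write index */
  atomic_uint cursize;  /* shared: bytes currently queued */
} fifo_t;

static uint32_t fifo_enqueue(fifo_t *f, const uint8_t *buf, uint32_t len) {
  uint32_t free_space = FIFO_SIZE - atomic_load(&f->cursize);
  if (len > free_space)
    len = free_space;
  for (uint32_t i = 0; i < len; i++)
    f->data[(f->tail + i) % FIFO_SIZE] = buf[i];
  f->tail = (f->tail + len) % FIFO_SIZE;
  atomic_fetch_add(&f->cursize, len);  /* publish bytes to the consumer */
  return len;
}

static uint32_t fifo_dequeue(fifo_t *f, uint8_t *buf, uint32_t len) {
  uint32_t avail = atomic_load(&f->cursize);
  if (len > avail)
    len = avail;
  for (uint32_t i = 0; i < len; i++)
    buf[i] = f->data[(f->head + i) % FIFO_SIZE];
  f->head = (f->head + len) % FIFO_SIZE;
  atomic_fetch_sub(&f->cursize, len);  /* release space to the producer */
  return len;
}
```

A "peek" variant would copy without advancing `head`, and out-of-order enqueue would write at `tail + offset` while deferring the size increment until the gap is filled.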
Host Stack: TCP
- Clean-slate implementation with a complete state machine
- Connection management and flow control (window management)
- Timers and retransmission: fast retransmit, SACK
- NewReno congestion control, SACK-based fast recovery
- Checksum offloading
- Linux compatibility tested with the IWL protocol tester
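A sketch of the NewReno-style window arithmetic mentioned above; the field names and the simplified recovery step are illustrative, not VPP's (the real implementation is in src/vnet/tcp):

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
  uint32_t cwnd;      /* congestion window, bytes */
  uint32_t ssthresh;  /* slow-start threshold, bytes */
  uint32_t mss;       /* maximum segment size, bytes */
} cc_state_t;

static void cc_ack_received(cc_state_t *cc, uint32_t bytes_acked) {
  if (cc->cwnd < cc->ssthresh)
    /* Slow start: grow by up to one MSS per ACK (doubles per RTT). */
    cc->cwnd += bytes_acked < cc->mss ? bytes_acked : cc->mss;
  else
    /* Congestion avoidance: roughly one MSS per RTT. */
    cc->cwnd += (cc->mss * cc->mss) / cc->cwnd;
}

static void cc_loss_detected(cc_state_t *cc) {
  /* Halve the window, floored at two segments (simplified recovery). */
  uint32_t half = cc->cwnd / 2;
  cc->ssthresh = half > 2 * cc->mss ? half : 2 * cc->mss;
  cc->cwnd = cc->ssthresh;
}
```

SACK-based fast recovery refines the loss step: instead of collapsing to `ssthresh` blindly, the sender uses the SACK scoreboard to retransmit only the holes.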
Host Stack: Comms Library (VCL)
- Comms library (VCL) that apps can link against
- LD_PRELOAD library for legacy apps
- epoll support
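The LD_PRELOAD shim's core job is deciding, per file descriptor, whether a call belongs to a VCL session or should fall through to the kernel. A self-contained sketch with an invented fd encoding; a real shim resolves the libc fallback via dlsym(RTLD_NEXT, ...), but here both backends are injected as function pointers:

```c
#include <assert.h>
#include <stddef.h>

/* Invented encoding for the sketch: fds at or above VCL_FD_BASE
 * denote VCL sessions. Not VCL's actual scheme. */
#define VCL_FD_BASE (1 << 24)

typedef long (*io_fn_t)(int fd, void *buf, size_t n);

static long shim_read(int fd, void *buf, size_t n,
                      io_fn_t vcl_read, io_fn_t libc_read) {
  if (fd >= VCL_FD_BASE)
    return vcl_read(fd - VCL_FD_BASE, buf, n); /* session fd: VCL path */
  return libc_read(fd, buf, n);                /* plain fd: kernel path */
}

/* Stand-in backends so the routing can be exercised. */
static long demo_vcl_read(int fd, void *buf, size_t n) {
  (void) buf; (void) n;
  return 1000 + fd;   /* pretend: bytes read from session fd */
}

static long demo_libc_read(int fd, void *buf, size_t n) {
  (void) buf; (void) n;
  return fd;          /* pretend: bytes read from kernel fd */
}
```

This routing is what lets unmodified binaries (nginx, curl, ...) run over the host stack: socket calls on session fds go to shared-memory FIFOs, everything else behaves exactly as before.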
Application Attachment
[Diagram: the app attaches to the session layer, then binds (server) or connects (client); its rx/tx FIFOs live in a shm segment.]
Establishment: the server attaches to the session layer, then binds and listens.
Establishment: the client attaches and connects; the session layer opens a connection towards the listener.
Establishment: the client and server transports perform the handshake.
Establishment: once the handshake completes, the client's connect has succeeded and the server sees a new client.
Establishment: the session layer delivers a connect reply to the client and an accept notification to the server; each side is given rx/tx FIFOs in a shm segment.
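The attach/bind/connect/accept sequence in the slides above can be summarized as a toy state machine; the states and events are illustrative, not VPP's actual session-layer FSM:

```c
#include <assert.h>

typedef enum {
  S_CLOSED, S_ATTACHED, S_LISTENING, S_CONNECTING, S_ESTABLISHED
} sstate_t;

typedef enum {
  E_ATTACH,          /* app attaches to the session layer */
  E_BIND_LISTEN,     /* server binds and listens */
  E_CONNECT,         /* client asks for a connection */
  E_HANDSHAKE_DONE,  /* transport handshake completed (client side) */
  E_ACCEPT_NOTIFY    /* accept notification delivered (server side) */
} sevent_t;

static sstate_t session_step(sstate_t s, sevent_t e) {
  switch (s) {
  case S_CLOSED:
    return e == E_ATTACH ? S_ATTACHED : s;
  case S_ATTACHED:
    if (e == E_BIND_LISTEN) return S_LISTENING;  /* server path */
    if (e == E_CONNECT)     return S_CONNECTING; /* client path */
    return s;
  case S_CONNECTING:
    return e == E_HANDSHAKE_DONE ? S_ESTABLISHED : s;
  case S_LISTENING:
    /* Simplified: the listener itself transitions; in reality a new
     * session is created and the listener keeps listening. */
    return e == E_ACCEPT_NOTIFY ? S_ESTABLISHED : s;
  default:
    return s;
  }
}
```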
Data Transfer
[Diagram: the client's write copies data into its tx FIFO and posts a write event; the transport (congestion-controlled, reliable) carries the data to the server's rx FIFO, where read copies it into the app's buffer.]
Data Transfer: performance
Same flow as above. Not yet part of CSIT, but some rough numbers on a Xeon E5-2690: ~200k connections per second and ~8 Gbps per core!
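A sketch of the tx side of this flow: the app copies data into the session fifo and posts a write event so the transport knows there is work. To bound event traffic, this toy version only posts an event when the fifo transitions from empty to non-empty (an illustrative policy, not necessarily VPP's exact one):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FIFO_CAP 256
#define EVTQ_LEN 16

typedef struct {
  uint8_t fifo[FIFO_CAP];
  uint32_t len;             /* bytes queued */
  int events[EVTQ_LEN];     /* toy event queue holding session ids */
  int n_events;
} session_t;

static uint32_t app_write(session_t *s, int session_id,
                          const uint8_t *buf, uint32_t n) {
  if (n > FIFO_CAP - s->len)
    n = FIFO_CAP - s->len;                 /* don't overflow the fifo */
  int was_empty = s->len == 0;
  memcpy(s->fifo + s->len, buf, n);        /* copy to fifo */
  s->len += n;
  if (n && was_empty && s->n_events < EVTQ_LEN)
    s->events[s->n_events++] = session_id; /* post the write evt */
  return n;
}
```

The rx side mirrors this: the transport enqueues into the server's rx fifo, posts a read event, and the app's read copies from the fifo into its own buffer.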
Redirected Connections (Cut-through)
[Diagram: the server binds; client and server are attached to the same VPP instance.]
Redirected Connections (Cut-through)
[Diagram: when the client connects, the session layer recognizes the local destination and redirects the connection so the two apps exchange data directly.]
Redirected Connections (Cut-through)
Because the redirected connection exchanges data directly through shared memory, throughput is memory-bandwidth constrained: ~120 Gbps!
Ongoing work
- Overall integration with k8s and Istio/Envoy
- Rx policer / tx pacer
- TSO
- New congestion control algorithms
- PMTU discovery
- Optimization, hardening and testing
- VCL/LD_PRELOAD testing with iperf, nginx, wget, curl
Next steps: get involved
- Get the code, build the code, run the code
- Session layer: src/vnet/session; TCP: src/vnet/tcp; SVM: src/svm; VCL: src/vcl
- Read/watch the tutorials
- Join the mailing lists
Thank you! Questions?
Florin Coras (email: fcoras@cisco.com, IRC: florinc)
Multi-threading
[Diagram: App1 has per-worker rx/tx FIFO pairs; IP/session workers run on core 0 and core 1 on top of DPDK.]
Features: Namespaces
- An app requests access to a VPP namespace with a namespace id plus a secret
[Diagram: namespaces ns1, ns2 and ns3 map onto IP instances backed by FIB tables fib1 and fib2.]
Features: Tables
- An app requests access to global and/or local scope
[Diagram: namespaces ns1 and ns2 each have a local table; a global table is associated with fib1.]
Features: Tables
- Both table types have rule tables that can be used for filtering
- Local tables are namespace-specific and can be used for egress filtering
- Global tables are FIB-table-specific and can be used for ingress filtering
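The filtering described above can be sketched as a first-match rule lookup over masked addresses; the rule layout and semantics here are illustrative only, not VPP's session rule-table format:

```c
#include <assert.h>
#include <stdint.h>

/* One toy rule: match a masked IPv4 address and an optional port,
 * then allow or deny the session. */
typedef struct {
  uint32_t ip, ip_mask;
  uint16_t port;           /* 0 means wildcard */
  int deny;
} rule_t;

/* Returns 1 if the session is allowed, 0 if denied.
 * First matching rule wins; no match means default allow. */
static int session_filter(const rule_t *rules, int n_rules,
                          uint32_t ip, uint16_t port) {
  for (int i = 0; i < n_rules; i++) {
    int ip_hit = (ip & rules[i].ip_mask) ==
                 (rules[i].ip & rules[i].ip_mask);
    int port_hit = rules[i].port == 0 || rules[i].port == port;
    if (ip_hit && port_hit)
      return rules[i].deny ? 0 : 1;
  }
  return 1;
}
```

In the egress case this check would run against a namespace's local table before a connect is allowed out; in the ingress case, against the global table of the FIB the connection arrives on.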