TOWARDS FAST IP FORWARDING
IP FORWARDING PERFORMANCE IMPROVEMENT AND MEASUREMENT IN FREEBSD
Nanako Momiyama, Keio University
25th September 2016, EuroBSDcon 2016
OUTLINE
- Motivation
- Design and implementation
  - Applying fast packet I/O and fast IP lookup to the FreeBSD network stack
- Measurement results
- Problem analysis
- Approach (ongoing work)
- Conclusion
MOTIVATION
- Software packet forwarding has played an important role in general-purpose OSes
  - L2 bridging, IP routing, firewalls, etc.
- Increasing network capacities (10GbE, 40GbE, ...) pushed people out of the kernel
  - User-space packet forwarding on top of netmap [1], DPDK [2]
- The strains of using them in production are beginning to show
  - API/CLI compatibility, port scalability (NICs, VMs), features and isolation
- It is time to bridge the performance gap between kernel-based packet forwarding (1-2 Mpps) and user-space forwarding (> 10 Mpps)
STARTING POINT
- L3 IP forwarding
  - Supports the Internet
  - Useful for datacenter and VM back-ends
    - L2 networks don't scale
[Figure: multiple VMs, a vrouter, and a server]
WHERE IS THE PERFORMANCE BOTTLENECK?
- Default FreeBSD can forward packets at only 1.4 Mpps (10GbE line rate is 14.88 Mpps)
- Packet I/O?
  - Was a main bottleneck for packet forwarding
  - Now there are several solutions that achieve the 10GbE line rate: netmap, DPDK
- IP routing table lookup?
  - Hardware appliances have TCAM for fast lookup
  - Now there are several fast routing lookup algorithms for software: SAIL [3], DXR [4], Poptrie [5]
- What if we bring these techniques into FreeBSD?
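The 14.88 Mpps figure follows from standard Ethernet framing. The sketch below reproduces the arithmetic, assuming minimum-size 64-byte frames (CRC included) plus the usual 8 bytes of preamble and start-of-frame delimiter and 12 bytes of inter-frame gap; the constant and function names are illustrative only.

```python
# Per-frame cost on the wire for a minimum-size Ethernet frame.
FRAME_BYTES = 64            # includes the 4-byte CRC
PREAMBLE_SFD_BYTES = 8      # 7-byte preamble + 1-byte start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12   # mandatory idle time between frames
LINK_BPS = 10_000_000_000   # 10GbE


def line_rate_pps(frame_bytes=FRAME_BYTES, link_bps=LINK_BPS):
    """Maximum packets per second the link can carry at this frame size."""
    wire_bits = (frame_bytes + PREAMBLE_SFD_BYTES + INTERFRAME_GAP_BYTES) * 8
    return link_bps / wire_bits


pps = line_rate_pps()
print(f"{pps / 1e6:.2f} Mpps, {1e9 / pps:.1f} ns per packet")
```

At line rate the router therefore has a budget of roughly 67 ns per packet, which is the scale against which the per-stage costs later in the talk should be read.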
DESIGN AND IMPLEMENTATION
- Design overview
  - FreeBSD for the control plane
    - The OS network stack, to preserve existing APIs
  - VALE [6] + DXR for the forwarding plane
    - VALE for fast, scalable packet I/O
    - DXR for fast IP route lookup
[Figure: FreeBSD default network stack. User: application. Kernel: routing socket, OS stack (IP with radix tree, Ethernet), device I/O]
VALE OVERVIEW
- VALE is a software switch
  - Runs in the kernel
  - Part of the netmap framework
    - netmap is a fast packet I/O framework that enables applications to send and receive packets at 10GbE line rate
- VALE works as an L2 learning switch by default
  - Packets do NOT go through the OS network stack; they are just forwarded from one port to another
  - The L2 switch logic can be replaced with a different module
[Figure: two diagrams. Default I/O: user / kernel, OS stack (IP, Ethernet), device I/O. VALE with L2 learning bridge: user / kernel, switch fabric, switch logic (L2 learning bridge)]
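To make "switch logic" concrete, here is a minimal sketch of what an L2 learning bridge decides per packet: remember which port each source MAC was seen on, then look up the destination MAC. This is generic learning-bridge behaviour, not VALE's code or API; the class, method, and `BROADCAST` sentinel are hypothetical names.

```python
BROADCAST = -1  # hypothetical sentinel meaning "send to every other port"


class LearningBridge:
    """Toy L2 learning switch logic: maps a packet to a destination port."""

    def __init__(self):
        self.mac_to_port = {}

    def lookup(self, src_mac, dst_mac, in_port):
        # Learn: the source MAC is reachable through the ingress port.
        self.mac_to_port[src_mac] = in_port
        # Forward: use the learned port, or flood if the destination is unknown.
        return self.mac_to_port.get(dst_mac, BROADCAST)


bridge = LearningBridge()
first = bridge.lookup("aa:aa", "bb:bb", in_port=0)   # bb:bb unknown yet
reply = bridge.lookup("bb:bb", "aa:aa", in_port=1)   # aa:aa was learned on port 0
again = bridge.lookup("aa:aa", "bb:bb", in_port=0)   # bb:bb now known on port 1
```

Replacing the switch logic, as the slide describes, amounts to swapping this per-packet decision for a different one, such as an L3 lookup.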
NEW SWITCH LOGIC IMPLEMENTATION
- Create a new function as a new switch logic (L3 module) in VALE
- Use VALE for packet I/O and the OS network stack for L2/L3
  - Make a fake mbuf in VALE and pass it to the OS network stack
  - The OS stack embeds the route lookup result in an unused mbuf field
  - Before if_transmit(), force a return so that VALE transmits the packets
[Figure: VALE with L3 module. User / kernel, OS stack (IP, Ethernet), switch fabric, switch logic (L3 module), with a fake mbuf passed from the switch logic to the OS stack]
DXR OVERVIEW
- DXR is a fast IPv4 route lookup algorithm
  - Creates compact data structures from a large routing table (radix tree)
  - These fit into CPU caches
  - See the DXR paper for more details
[Figure: the default routing structure generates the DXR compact FIB, which consists of a lookup table (direct indexing, entries 0x0000 to 0xffff), range tables (binary search, each entry a range start and a next hop index), and a next hop table]
Ref.: Modified from Figure 1 of Zec, Marko, Luigi Rizzo, and Miljenko Mikuc. "DXR: towards a billion routing lookups per second in software." ACM SIGCOMM Computer Communication Review 42.5 (2012): 29-36.
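The two-stage lookup in the figure can be sketched as follows: the upper 16 bits of the destination address index a table directly, and the entry either names a next hop or points to a sorted range table that is binary-searched with the lower 16 bits. This is a simplified illustration under stated assumptions, not the DXR implementation: the tables are hand-built, the direct table is a sparse dict rather than a 65536-entry array, and every name and route below is hypothetical.

```python
from bisect import bisect_right
import ipaddress

# Hypothetical next hop table (index -> gateway label).
NEXT_HOPS = ["default-gw", "gw-a", "gw-b", "gw-c"]
DEFAULT_NH = 0

# Direct table keyed by the top 16 bits of the destination address.
# An int entry is a next hop index; a list entry is a range table of
# (start_of_range_in_low_16_bits, next_hop_index), sorted and starting at 0.
DIRECT = {
    0x0A00: 1,                                          # 10.0.0.0/16
    0xC0A8: [(0x0000, 2), (0x0200, 3), (0x0800, 1)],    # 192.168.0.0/16, split
}


def lookup(dst):
    """Return the next hop label for a dotted-quad IPv4 destination."""
    addr = int(ipaddress.IPv4Address(dst))
    hi, lo = addr >> 16, addr & 0xFFFF
    entry = DIRECT.get(hi, DEFAULT_NH)
    if isinstance(entry, int):
        return NEXT_HOPS[entry]             # resolved by direct indexing alone
    starts = [start for start, _ in entry]
    i = bisect_right(starts, lo) - 1        # last range whose start <= lo
    return NEXT_HOPS[entry[i][1]]
```

For example, `lookup("192.168.2.0")` falls in the range starting at 0x0200 and resolves to `gw-c`, while `lookup("8.8.8.8")` has no direct entry and falls back to the default next hop.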
DXR IMPLEMENTATION
- Porting the DXR patch for FreeBSD 8.0 to FreeBSD 12.0-CURRENT
  - DXR builds and uses new compact data structures based on the OS radix tree
- DXR integration
  - The DXR-specific lookup function is called instead of ip_findroute()
[Figure: user / kernel, socket, OS stack (IP with radix tree and DXR FIB, Ethernet), device I/O]
EXPERIMENTAL SETUP
- Machine spec
  - OS: FreeBSD (12.0-CURRENT, 04/08/16 snapshot)
  - CPU: Intel(R) Core(TM) i7-3930k CPU @ 3.20GHz, 6 cores
  - NIC: Intel X520 10GbE dual-port
- Method
  - Two machines connected back-to-back
  - Generate 10GbE line-rate traffic using the pkt-gen application
  - Measure packet rates forwarded by the router machine
- Setting
  - Packet size is 64 bytes (incl. Ethernet CRC)
  - Routing table size is minimal (fewer than 10 entries)
[Figure: send-and-receive machine (pktgen tx, pktgen rx) connected to the router machine]
RESULTS
- Default FreeBSD
  - 1.43 Mpps out of the 14.88 Mpps 10GbE line rate

Throughput: 1.43 Mpps

function        implementation                group
packet input    device I/O                    I/O
L2 input        if_input (if_ethersubr.c)     Protocol
L3              ip_input                      Protocol
Route lookup    ip_fastfwd                    Protocol
L2 output       if_output (if_ethersubr.c)    Protocol
packet output   device I/O                    I/O
RESULTS
- Default I/O + DXR lookup
  - Uses DXR lookup instead of the FreeBSD default routing lookup (ip_findroute())
  - 1.66 Mpps out of the 14.88 Mpps 10GbE line rate
  - Replacing the lookup part saves 97 ns per packet

Throughput: 1.66 Mpps

function        implementation                group
packet input    device I/O                    I/O
L2 input        if_input (if_ethersubr.c)     Protocol
L3              ip_input                      Protocol
Route lookup    DXR                           Protocol
L2 output       if_output (if_ethersubr.c)    Protocol
packet output   device I/O                    I/O
RESULTS
- VALE + default routing lookup
  - Replaces FreeBSD default I/O with VALE
  - 1.95 Mpps out of the 14.88 Mpps 10GbE line rate
  - Replacing packet I/O saves 187 ns per packet

Throughput: 1.95 Mpps

function        implementation                group
packet input    netmap                        I/O
L2 input        if_input (if_ethersubr.c)     Protocol
L3              ip_input                      Protocol
Route lookup    ip_fastfwd                    Protocol
L2 output       if_output (if_ethersubr.c)    Protocol
packet output   netmap                        I/O
RESULTS
- VALE + DXR lookup
  - Replaces FreeBSD default I/O with VALE and uses DXR lookup
  - 2.43 Mpps out of the 14.88 Mpps 10GbE line rate
  - Slightly (1 Mpps) faster than default FreeBSD but still SLOW

Throughput: 2.43 Mpps

function        implementation                group
packet input    netmap                        I/O
L2 input        if_input (if_ethersubr.c)     Protocol
L3              ip_input                      Protocol
Route lookup    DXR                           Protocol
L2 output       if_output (if_ethersubr.c)    Protocol
packet output   netmap                        I/O
RESULTS AND TAKEAWAY

Module                      Throughput
Default (baseline)          1.43 Mpps
Default I/O + DXR lookup    1.66 Mpps
VALE + default lookup       1.95 Mpps
VALE + DXR lookup           2.43 Mpps
VALE L2 switch              12.39 Mpps

- The VALE L2 switch itself can achieve 12.39 Mpps
- Why does the 10 Mpps gap between the L2 and L3 modules exist?
  - We should investigate which parts take time
  - Packet I/O and route lookup are not very expensive anymore
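The per-packet savings quoted on the earlier result slides can be recovered from these throughputs, since the time spent per packet is the reciprocal of the packet rate. The sketch below does that conversion; the dictionary keys and helper name are illustrative, and the inputs are the rounded rates from the table.

```python
# Throughputs from the table above, in Mpps.
RATES_MPPS = {
    "default": 1.43,
    "default_io_dxr": 1.66,
    "vale_default_lookup": 1.95,
    "vale_dxr": 2.43,
    "vale_l2_switch": 12.39,
}


def ns_per_packet(mpps):
    """Average time per forwarded packet in nanoseconds."""
    return 1_000.0 / mpps   # 1 / (mpps * 1e6) seconds, expressed in ns


baseline_ns = ns_per_packet(RATES_MPPS["default"])
saved_by_dxr = baseline_ns - ns_per_packet(RATES_MPPS["default_io_dxr"])
saved_by_vale = baseline_ns - ns_per_packet(RATES_MPPS["vale_default_lookup"])
l3_vs_l2_gap = (ns_per_packet(RATES_MPPS["vale_dxr"])
                - ns_per_packet(RATES_MPPS["vale_l2_switch"]))

for name, rate in RATES_MPPS.items():
    print(f"{name:22s} {ns_per_packet(rate):7.1f} ns/packet")
```

With the rounded rates this gives roughly 97 ns saved by DXR and about 186 ns saved by VALE, in line with the 97 ns and 187 ns on the slides. The gap between the L3 module and the plain L2 switch is on the order of 330 ns per packet, which is the time the later measurements try to attribute.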
MEASUREMENT METHODOLOGY
- Hardcode the output interface in VALE in advance
- Force a return at several vantage points
- Receive the packets on the send-and-receive machine and measure rates
[Figure: VALE and DXR. User / kernel, OS stack (IP with radix tree and DXR FIB, Ethernet), switch fabric, switch logic (L3 module), with "return" markers at each vantage point]
VALE + DXR LOOKUP
Which part consumes time?

Input side (return point: throughput)
- return before if_input(): 14.36 Mpps
- return before ip_input(): 5.32 Mpps
- return before ip_tryforward(): 4.64 Mpps
- Step costs shown: 118 ns, 36 ns

Output side
- Return points: before if_output(), before if_transmit()
- Throughputs shown: 4.64 Mpps, 3.66 Mpps, 2.44 Mpps
- Step costs shown: 49 ns, 137 ns

[Figure: VALE and DXR. User / kernel, OS stack (IP with radix tree and DXR FIB, Ethernet), switch fabric, switch logic (L3 module)]
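The step costs on this slide are differences in per-packet time between neighbouring vantage points. The sketch below applies that arithmetic to the printed rates, listed from the earliest return point to the full path. It is an illustration only: the variable names are hypothetical, and the pairing of output-side rates with specific return points is not asserted here.

```python
# Rates in Mpps, ordered from the earliest forced return to the full path.
RATES_MPPS = [14.36, 5.32, 4.64, 3.66, 2.44]


def ns_per_packet(mpps):
    """Average time per packet in nanoseconds at the given rate."""
    return 1_000.0 / mpps


times_ns = [ns_per_packet(r) for r in RATES_MPPS]

# Cost of each step = extra time per packet compared with the previous point.
step_costs_ns = [later - earlier
                 for earlier, later in zip(times_ns, times_ns[1:])]

for rate, cost in zip(RATES_MPPS[1:], step_costs_ns):
    print(f"down to {rate:5.2f} Mpps: +{cost:6.1f} ns")
```

From the rounded rates as printed, the first and last steps reproduce the 118 ns and 137 ns labels. The two middle steps come out near 28 ns and 58 ns rather than the 36 ns and 49 ns shown, although their sum (about 85 ns) agrees.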
MEASUREMENT CONCLUSION
- Packet I/O is fast enough and the cost of route lookup is negligible
- L2 protocol processing has become the new performance bottleneck
- How can we solve this problem?
BASIC DESIGN (ONGOING WORK)
- if_input() bypass
  - Filter packets in VALE: if a packet has protocol type IPv4 (0x0800) and its destination MAC address is that of the input interface, it goes directly to ip_input()
- if_output() bypass
  - Add a new field to DXR's FIB to cache the destination MAC address of the next hop
  - Avoid if_output() (incl. ARP resolution) for subsequent packets

DXR next hop table
  If & gw addr    MAC addr
  0 : 1.2.3.4     08:00:27:60:10:20
  1 : 2.3.4.5     08:00:27:f4:d0:7a

[Figure: user / kernel, OS stack (IP: ip_input(); Ether: if_input(), if_output()), switch fabric, switch logic (L3 module) with filter]
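The two bypasses can be sketched in a few lines. This is a hedged illustration of the idea rather than the ongoing kernel work: it assumes untagged Ethernet II frames, and `takes_ip_fast_path`, `next_hop_mac`, and the `resolve` callback (standing in for ARP resolution) are hypothetical names. The table values are the ones shown on the slide.

```python
ETHERTYPE_IPV4 = 0x0800
ETHER_HDR_LEN = 14  # dst MAC (6) + src MAC (6) + EtherType (2), no VLAN tag


def takes_ip_fast_path(frame, input_if_mac):
    """if_input() bypass: IPv4 frames addressed to the input interface."""
    if len(frame) < ETHER_HDR_LEN:
        return False
    dst_mac = frame[0:6]
    ethertype = int.from_bytes(frame[12:14], "big")
    return ethertype == ETHERTYPE_IPV4 and dst_mac == input_if_mac


# Next hop table extended with a cached MAC field (None = not resolved yet).
NEXT_HOPS = [
    {"ifindex": 0, "gw": "1.2.3.4", "mac": bytes.fromhex("080027601020")},
    {"ifindex": 1, "gw": "2.3.4.5", "mac": None},
]


def next_hop_mac(nh_index, resolve):
    """if_output() bypass: resolve once, then reuse the cached MAC."""
    nh = NEXT_HOPS[nh_index]
    if nh["mac"] is None:
        nh["mac"] = resolve(nh["gw"])   # slow path, first packet only
    return nh["mac"]
```

Only the first packet towards a next hop pays for resolution; later packets read the cached address directly. A real implementation would also need to invalidate the cached MAC when the neighbour entry changes, which this sketch leaves out.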
CONCLUSION
- FreeBSD can forward packets at only 1.43 Mpps
- By replacing packet I/O with VALE and route lookup with DXR, we can forward packets at 2.43 Mpps
- The Ethernet layer remains expensive
  - We have to bypass it for further speedup
THANK YOU
Questions? Comments?
Mail: nanako@sfc.wide.ad.jp
Code: https://github.com/nanakom/freebsd/tree/dxr
REFERENCES
[1] L. Rizzo. netmap: A novel framework for fast packet I/O. In 2012 USENIX Annual Technical Conference (USENIX ATC 12), June 2012.
[2] DPDK: http://dpdk.org
[3] T. Yang, G. Xie, Y. Li, Q. Fu, A. X. Liu, Q. Li, and L. Mathy. Guarantee IP Lookup Performance with FIB Explosion. In ACM SIGCOMM, pages 39-50, 2014.
[4] M. Zec, L. Rizzo, and M. Mikuc. DXR: Towards a billion routing lookups per second in software. SIGCOMM Comput. Commun. Rev., 42(5):29-36, Sept. 2012.
[5] H. Asai and Y. Ohara. Poptrie: A compressed trie with population count for fast and scalable software IP routing table lookup. In ACM SIGCOMM, pages 57-70, 2015.
[6] M. Honda, F. Huici, G. Lettieri, and L. Rizzo. mSwitch: A highly-scalable, modular software switch. In Proceedings of the 1st ACM SIGCOMM Symposium on Software Defined Networking Research, SOSR '15, pages 1:1-1:13, New York, NY, USA, 2015. ACM.