OpenFlow Software Switch & Intel DPDK performance analysis
Agenda
- Background
- Intel DPDK
- OpenFlow 1.3 implementation sketch
- Prototype design and setup
- Results
- Future work, optimization ideas

OF 1.3 prototype measurement - EWSDN | Public | Ericsson AB 2013 | 2013-10-07
Intel DPDK Basics
Why Intel DPDK? (www.intel.com/go/dpdk)
- A kernel-space implementation is more restricted: it is harder to develop and debug, interrupts are still needed, and it has performance issues.
- A user-space implementation over a normal Linux kernel is slow: user/kernel memory separation makes copying slow. Some workarounds exist (e.g. pcap mmap), but they are still not fast enough.
- A similar but less widespread solution is netmap: http://info.iet.unipi.it/~luigi/netmap/
Main features:
- Poll-mode driver: avoids interrupts and scheduling
- Direct I/O: the packet (or its first X bytes) is copied to the L1 cache directly
Some details from the Intel DPDK tutorial follow.
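The poll-mode idea can be sketched without any DPDK dependency: the core spins on a burst-receive call instead of sleeping until an interrupt, trading CPU cycles for latency. `rx_burst_fn`, `fake_rx` and the `budget` cutoff below are illustrative stand-ins, not DPDK APIs (a real DPDK lcore calls `rte_eth_rx_burst()` in an endless loop); a minimal sketch:

```c
#include <stdint.h>
#include <stddef.h>

#define BURST_SIZE 32  /* packets fetched per poll, as in DPDK's examples */

/* Hypothetical burst-receive function standing in for rte_eth_rx_burst():
 * fills up to `max` packet pointers and returns how many were delivered
 * (0 when the queue is empty). */
typedef uint16_t (*rx_burst_fn)(void *pkts[], uint16_t max);

/* The poll-mode pattern: keep asking the NIC queue for a burst of packets.
 * Here the loop stops once `budget` packets have been seen, so the sketch
 * terminates; a real forwarding core loops forever. */
static uint64_t poll_loop(rx_burst_fn rx, uint64_t budget)
{
    void *pkts[BURST_SIZE];
    uint64_t seen = 0;
    while (seen < budget) {
        uint16_t n = rx(pkts, BURST_SIZE);
        for (uint16_t i = 0; i < n; i++) {
            /* per-packet processing would go here */
        }
        seen += n;
    }
    return seen;
}

/* Toy traffic source for the self-test: always delivers a full burst. */
static uint16_t fake_rx(void *pkts[], uint16_t max)
{
    for (uint16_t i = 0; i < max; i++)
        pkts[i] = NULL;   /* no real packet buffers in this sketch */
    return max;
}
```

Because packets are only counted in whole bursts, a budget of 100 completes after four bursts of 32.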
Intel DPDK Basic Design
- Designed to run on any Intel architecture CPU: from Intel Atom through client cores to Sandy Bridge; essential to the IA value proposition.
- Pthreads bind hardware threads to software tasks: literally no scheduler overhead.
- User-level polled-mode driver: no kernel-context/interrupt-context switching overhead.
- Huge pages to improve performance: 1 GB huge pages as well as 2 MB page support; co-exists with Linux's 4 KB pages.
- Low-latency cache and memory access: DDIO, cache prefetch, and rte_cache_aligned memory.
Understanding the Choices & Performance: Setting the Direction for the Intel DPDK
- Scheduler (or why not): hardware threads only; no scheduler or task switcher. A typical task switch costs 200+ processor cycles (varies with processor architecture).
- Process a bunch of packets at a time: cores handle packets in batches to amortize some of the latencies.
- Prefetch: critical to latency hiding, since there are no software threads and stalls on hardware threads are costly. The queue-based model is key to making prefetch effective.
- Locks: generally lockless implementations wherever possible; a spinlock lock/unlock pair costs 60-90 cycles. Queues are lockless (single-producer and multi-producer, single consumer).
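The lockless single-producer/single-consumer queue mentioned above can be sketched with C11 atomics. This is an illustrative miniature of the idea behind DPDK's `rte_ring` (the real ring also supports multi-producer mode, bulk operations and much larger sizes), not the library's actual code:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 8   /* must be a power of two; tiny for illustration */

struct spsc_ring {
    _Atomic uint32_t head;      /* next slot the producer will write */
    _Atomic uint32_t tail;      /* next slot the consumer will read */
    void *slots[RING_SIZE];
};

static bool ring_enqueue(struct spsc_ring *r, void *obj)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SIZE)            /* ring is full */
        return false;
    r->slots[head & (RING_SIZE - 1)] = obj;
    /* release: publish the slot contents before advancing the head */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

static bool ring_dequeue(struct spsc_ring *r, void **obj)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail == head)                        /* ring is empty */
        return false;
    *obj = r->slots[tail & (RING_SIZE - 1)];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Tiny self-check: push 1..5 through the ring and sum what comes out. */
static int ring_selftest(void)
{
    static struct spsc_ring r;               /* zero-initialized */
    int vals[5] = {1, 2, 3, 4, 5}, sum = 0;
    for (int i = 0; i < 5; i++)
        ring_enqueue(&r, &vals[i]);
    void *p;
    while (ring_dequeue(&r, &p))
        sum += *(int *)p;
    return sum;
}
```

With one producer thread and one consumer thread, the acquire/release pairs on `head` and `tail` are the only synchronization needed: no lock, hence no 60-90 cycle spinlock pair per operation.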
Scheduler, or Why Not?
- The primary reason was performance: task-switch overhead is typically a few hundred cycles.
- FXSAVE/FXRSTOR cost about 100 and 150 cycles respectively (on Intel NetBurst); faster on recent processors, but not significantly so.
- If pre-emptive, the cost of an interrupt must be added.
- To put that in perspective in a 10 GbE environment: on a 3 GHz processor, a small (64 B) packet arrives every 67.2 ns = 201 cycles.
- For lower-bandwidth environments, the essential thing to consider is the added CPU bandwidth consumed.
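The 67.2 ns figure follows from the wire occupancy of a minimal frame: a 64 B Ethernet frame takes 84 B on the wire once the 7 B preamble, 1 B start-of-frame delimiter and 12 B inter-frame gap are counted. A small sketch reproducing the slide's arithmetic:

```c
/* Per-packet time budget at line rate. 84 wire bytes = 672 bits;
 * at 10 Gbit/s that is 67.2 ns, i.e. ~201 cycles at 3 GHz. */
static double ns_per_packet(double frame_bytes, double gbps)
{
    double wire_bytes = frame_bytes + 7 + 1 + 12;  /* preamble + SFD + IFG */
    return wire_bytes * 8.0 / gbps;                /* bits / (Gbit/s) -> ns */
}

static double cycles_per_packet(double frame_bytes, double gbps, double ghz)
{
    return ns_per_packet(frame_bytes, gbps) * ghz;
}
```

So a single task switch of a few hundred cycles already exceeds the entire per-packet budget at 64 B line rate.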
Packet Bunching
- Done on the NIC today: NIC receive descriptors are bunched four to a cache line.
- Writing back partial descriptors carries a severe performance penalty: conflicts between the CPU and the I/O device on the same cache line, and increased memory and PCI-Express bandwidth usage.
- Bunching is needed to overcome PCI-Express latencies.
- All Intel Ethernet* controllers have settings that control descriptor write-back: coalesce as many descriptors as possible on receive; transmit-side coalescing is done as well (software controlled); timer values can be set to control latency (EITR).
- The prototype takes this paradigm to the next level by having the fast path process bunches of packets, facilitated by the queue abstraction.
* Other names and brands may be claimed as the property of others.
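"Four to a cache line" follows directly from the descriptor size: receive descriptors are 16 bytes, and a cache line is 64 bytes. The struct below is modeled loosely on the Intel 82599 advanced receive descriptor in its read (NIC-owned) format; the field names are illustrative, not the datasheet's:

```c
#include <stdint.h>

/* 16-byte receive descriptor (read format): two DMA addresses. */
struct rx_desc {
    uint64_t pkt_addr;   /* DMA address of the packet data buffer */
    uint64_t hdr_addr;   /* DMA address of the header buffer */
};

#define CACHE_LINE 64
/* Exactly four descriptors share one cache line, so the NIC should only
 * write a line back once all four of its descriptors are complete. */
#define DESCS_PER_LINE (CACHE_LINE / sizeof(struct rx_desc))
```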
Prefetch
- Two types of prefetch: hardware and software.
- Hardware prefetch is issued by the core:
  - L1 DCU prefetcher: streaming prefetcher triggered by ascending accesses to recently loaded data
  - L1 IP-based strided prefetcher: triggered by individual loads with a stride
  - L2 DPL: prefetches data into the L2 cache based on DCU requests; includes an adjacent-cache-line prefetcher and a strided prefetcher (e.g. for skipped cache lines)
- Software prefetch needs to be issued appropriately ahead of time to be effective: issued too early, it can cause eviction before use.
- There are multiple types of software prefetch.
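The "appropriately ahead of time" rule is usually implemented by prefetching a fixed distance ahead in the processing loop. The sketch below uses the GCC/Clang builtin `__builtin_prefetch` (DPDK wraps it as `rte_prefetch0()`); the array of packet lengths and the distance of 4 are illustrative:

```c
#include <stdint.h>
#include <stddef.h>

#define PREFETCH_AHEAD 4   /* illustrative distance; tuned per workload */

/* While working on element i, prefetch element i + PREFETCH_AHEAD so it is
 * (hopefully) already in cache when the loop reaches it. The summation is a
 * stand-in for real per-packet work. */
static uint64_t sum_lengths(const uint32_t *pkt_len, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&pkt_len[i + PREFETCH_AHEAD],
                               0 /* read */, 3 /* high temporal locality */);
        total += pkt_len[i];
    }
    return total;
}
```

Making `PREFETCH_AHEAD` too large risks exactly the eviction-before-use problem the slide warns about.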
Paging
- 1 GB super-page and 2 MB huge-page support.
- The performance implications are primarily due to D-TLB thrashing/replacement.
- The paging performance drop is difficult to gauge: it is really dependent on the application, and gets significantly worse as the memory footprint increases.
- It varies by architecture, but initial measurements suggested ~30% on L3 forwarding.
- Quite often there are 2-3 D-TLB replacements per packet.
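Why huge pages help the D-TLB can be seen from simple coverage arithmetic: the memory a TLB can map is entries times page size. The 64-entry figure below is purely illustrative (actual D-TLB sizes vary by microarchitecture):

```c
/* Memory footprint a TLB level can cover without a replacement. */
static long long tlb_coverage_bytes(int entries, long long page_size)
{
    return (long long)entries * page_size;
}
```

With 4 KB pages, 64 entries cover only 256 KB; with 2 MB huge pages the same 64 entries cover 128 MB, so packet buffers and tables stop thrashing the D-TLB.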
Intel Data Direct I/O Technology (Intel DDIO)
(Figure; test system: 1x SNB-EP 8C B0 @ 2.0 GHz.)
Intel DPDK Performance: IPv4 Layer 3 Forwarding on an IA Server Platform (64 B throughput, Mpps)

Aggregate 64 B throughput milestones:
- Native Linux kernel: 12 Mpps
- Integrated memory controller introduced + Intel DPDK R0.7: 42 Mpps
- Intel DPDK R1.0: 55 Mpps
- Integrated PCIe* controller introduced, Intel DPDK R1.3: 110 Mpps
- Intel DPDK R1.3 (2S): 250 Mpps
- Intel DPDK R1.4: 255 Mpps

Per-core numbers:
- Intel DPDK Release 1.3: SNB @ 2.1 GHz, 1C/1T = 18.6 Mpps, 1C/2T = 24 Mpps; SNB @ 2.7 GHz, 1C/1T = 23.7 Mpps, 1C/2T = 28.8 Mpps
- Intel DPDK Release 1.4: IVB @ 2.4 GHz, 1C/1T = 23.9 Mpps, 1C/2T = 28.5 Mpps

Test systems:
- 2009: 2S Intel Xeon E5540 (2x4C Nehalem) @ 2.53 GHz
- 2010: 2S Intel Xeon E5645 (2x6C Westmere-EP) @ 2.40 GHz
- 2012: 1S Intel Xeon E5-2658, C1 stepping (1x8C Sandy Bridge-EP) @ 2.1 GHz, 8x 10 GbE, PCIe Gen2
- 2012: 2S Intel Xeon E5-2658, C1 stepping (2x8C Sandy Bridge-EP) @ 2.1 GHz, 22x 10 GbE, PCIe Gen2
- 2013: 2S Intel Xeon E5-2658 v2 (2x10C Ivy Bridge-EP) @ 2.4 GHz, 22x 10 GbE, PCIe Gen2

Massive IA performance improvements since 2009; PCIe Gen3 will offer even better performance!

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
OpenFlow Basics
- Main idea: programmable networking, with flexibility and programmability together with high performance.
- Problem: today OF is either flexible OR fast:
  - flexible rules with many tuples require a TCAM or a slow lookup, and TCAM is expensive and uses a lot of power
  - complex instructions and actions mean high overhead for software implementations
  - some solutions limit flexibility to increase performance (e.g. Table Type Patterns, TTP)
- In theory, performance should depend only on the data-plane functions the node implements in the given scenario: it should be irrelevant whether the device executes a native implementation of the use case or OF rules programmed by a controller for the same purpose.
(Diagram: OpenFlow 1.3 data-model sketch. A Flow Table (table ID) is searched by wildcard/priority lookup for a Flow Entry; the entry's Instructions either Apply Actions immediately or write the Action Set that is executed at the end of the pipeline. Entries and actions reference Group Entries (group ID, with Buckets and bucket liveliness), Meters (meter ID), Ports (port ID, port liveliness) and Queues (queue ID). The control plane adds/removes/modifies flow, group and meter entries; internal control propagates liveliness, removes dependent flows on removal, and removes associated flow entries and instructions. Legend: data access per packet processing; data access by control plane; data access by internal control; liveliness propagation.)
Agenda
- Background
- OpenFlow 1.3 implementation sketch
- Intel DPDK
- Prototype design and setup
- Results
- Future work, optimization ideas
Why a New Prototype?
- Software prototypes investigated and not selected:
  - OVS: well-established, open source; aimed mainly at virtual environments and has performance issues (OVS on Intel DPDK, OVDK, is an ongoing activity).
  - CPqD softswitch: used by the ONF for prototyping new features, open source; serious performance limitations.
  - LINC: Erlang-based softswitch, open source; runs in a VM environment, while we are primarily interested in close-to-the-hardware solutions.
- Hardware-based prototypes/products:
  - serious limitations in the number of rules: OF implementations usually use TCAM, which has limited capacity
  - usually hard to program, modify, or extend with new features
Configuration
- Simple MAC-based forwarding:
  - 1, 10, 100, 1000, 2000 and 5000 DMAC rules
  - currently a linear search, and the last rule always matches, so caching is not easy
  - instruction = Write-Actions; action set = Output (egress port)
- Intel DPDK based generator station (tgen): generates 15 Mpps (at 64 bytes/packet) on one core.
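The measured lookup can be sketched as a linear scan over DMAC rules whose action set is a single output port. The structure and function names below are illustrative, not the prototype's actual code:

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* One flow rule of the benchmark: match on destination MAC, output port. */
struct dmac_rule {
    uint8_t  dmac[6];    /* destination MAC to match */
    uint16_t out_port;   /* action set: Output(out_port) */
};

/* Linear search; returns the egress port of the first match, -1 on miss.
 * In the benchmark the LAST rule matches, so every lookup walks the whole
 * table: cost grows linearly with the number of rules, and a match cache
 * cannot help. */
static int lookup_dmac(const struct dmac_rule *tbl, size_t n,
                       const uint8_t dmac[6])
{
    for (size_t i = 0; i < n; i++)
        if (memcmp(tbl[i].dmac, dmac, 6) == 0)
            return tbl[i].out_port;
    return -1;
}
```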
(Diagram: the same OpenFlow 1.3 data-model sketch as in the implementation section, annotated with numbered steps 1-7 tracing per-packet processing through the prototype: flow-table lookup, flow-entry match, instructions/Apply Actions, action-set execution, and output via group, port and queue.)
Measurement Setup
- Management: 3Com switch, 172.31.32.0/24, 1 GbE links to both machines.
- Generator: Intel Xeon E5-2630 (2x6 cores @ 2.3 GHz), 8x4 GB DDR3 SDRAM, Intel Niantic (82599EB) 2x10 GbE NIC.
- OF-SW (device under test): Intel Xeon E5-2630 (2x6 cores @ 2.3 GHz), 8x4 GB DDR3 SDRAM.
- The generator and the OF-SW are connected by two 10 GbE links.
(Diagram: OF-SW software layout. Cores 0-2 run Linux and the standard Linux driver, serving ETH0/ETH1 (1 GbE, management). Cores 3-5 run rx/tx loops and the OF code over the Intel DPDK driver, serving queues q42/q43 and q52/q53 on ETH2/ETH3 (10 GbE), which connect to the generator.)
Results
Main results:
- ~25% overhead vs. L2FWD (Intel's example application) at 64 B packets; it was more before the software was highly optimized. At 128 B and above, both reach 10 GbE line rate.

  Pkt size (B)   L2FWD (Mpps / Gbps)   OF 1.3, 1 rule (Mpps / Gbps)
  64             13.82 / 7.08          10.27 / 5.26
  128             8.45 / 8.65           8.45 / 8.65
  256             4.53 / 9.28           4.53 / 9.28
  512             2.35 / 9.63           2.35 / 9.63

- Processing cost is linear in the number of rules (not surprisingly). (Chart: performance in kpps and processing time in ns for 10, 100, 500, 1000 and 2000 rules, log scale from 1 to 100000.)
- So we began some investigation.
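The 25% figure follows from the 64 B row of the table: relative throughput loss against the L2FWD baseline. A one-line sketch of the calculation:

```c
/* Relative overhead in percent: how much throughput the OF 1.3 prototype
 * loses against a baseline, e.g. overhead_pct(13.82, 10.27) ~ 25.7%. */
static double overhead_pct(double baseline_mpps, double measured_mpps)
{
    return 100.0 * (baseline_mpps - measured_mpps) / baseline_mpps;
}
```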
Some Details
- Per-rule processing time vs. number of rules: with a small number of rules the cache(s) are used effectively.
- Note that real traffic would behave better (the benchmark always matches the last rule).
(Chart: time per rule in ns, roughly between 4 and 15 ns, for 10, 100, 500, 1000, 2000 and 5000 rules.)
Improving Further
- Preliminary results of the current code: the overhead vs. L2FWD has been completely removed.
(Chart: 64 B throughput in Mpps, roughly on a scale of 8 to 16, comparing L2FWD, OF 1.3 with 1 rule, and the improved "OF 1.3 ++".)
And Further
Current status: the static OF overhead is essentially removed. It is time to improve rule-processing speed and to implement the control plane.
Basic ideas under discussion:
- high-performance southbound interface: minimize the need for locking, timeouts, etc.
- fast data-plane execution: flow caching, lookup-algorithm selection, selective TTP usage, prediction