RouteBricks: Exploiting Parallelism To Scale Software Routers

Size: px

Start display at page:

Download "RouteBricks: Exploiting Parallelism To Scale Software Routers"

Marshall Goodman
5 years ago
Views:

1 outebricks: Exploiting Parallelism To Scale Software outers Mihai Dobrescu & Norbert Egi, Katerina Argyraki, Byung-Gon Chun, Kevin Fall, Gianluca Iannaccone, Allan Knies, Maziar Manesh, Sylvia atnasamy EPFL, Intel Labs Berkeley, Lancaster University

2 Building routers Fast Programmable» custom statistics» filtering» packet transformation» 2

3 Why programmable routers New ISP services» intrusion detection, application acceleration Simpler network monitoring» measure link latency, track down traffic New protocols» IP traceback, Trajectory Sampling, Enable flexible, extensible networks 3

4 Today: fast or programmable Fast hardware routers» throughput : Tbps» no programmability Programmable software routers» processing by general-purpose CPUs» throughput < 10Gbps 4

5 outebricks A router out of off-the-shelf PCs» familiar programming environment» large-volume manufacturing Can we build a Tbps router out of PCs? 5

6 outer = N packet processing + switching N: number of external router ports : external line rate 6

7 A hardware router N linecards linecards Processing at rate ~ per linecard 7

8 A hardware router N linecards switch fabric linecards Processing at rate ~ per linecard Switching at rate N x by switch fabric 8

9 outebricks N commodity interconnect servers servers Processing at rate ~ per server Switching at rate ~ per server 9

10 outebricks N commodity interconnect servers servers Per-server processing rate: c x 10

11 Outline Interconnect Server optimizations Performance Conclusions 11

12 Outline Interconnect Server optimizations Performance Conclusions 12

13 equirements N commodity interconnect Internal link rates < Per-server processing rate: c x Per-server fanout: constant 13

14 A naive solution N 14

15 A naive solution N N external links of capacity N 2 internal links of capacity 15

16 Valiant load balancing N /N /N 16

17 Valiant load balancing /N /N N N external links of capacity N 2 internal links of capacity 2/N 17

18 Valiant load balancing /N /N N Per-server processing rate: 3 Uniform traffic: 2 18

19 Per-server fanout? N 19

20 Per-server fanout? N Increase server capacity 20

21 Per-server fanout? N Increase server capacity 21

22 Per-server fanout? N Increase server capacity Add intermediate nodes» k-degree n-stage butterfly 22

23 Our solution: combination Assign max external ports per server Full mesh, if possible Extra servers, otherwise 23

24 Example Assuming current servers» 5 NICs, 2 x 10G ports or 8 x 1G ports» 1 external port per server N = 32 ports: full mesh» 32 servers N = 1024 ports: 16-ary 4-fly» 2 extra servers per port 24

25 ecap N Valiant load balancing + full mesh k-ary n-fly Per-server processing rate:

26 Outline Interconnect Server optimizations Performance Conclusions 26

27 Setup: NUMA architecture min-size packets I/O hub Mem Ports Cores Mem» Nehalem architecture, QuickPath interconnect» CPUs: 2 x [2.8GHz, 4 cores, 8MB L3 cache]» NICs: 2 x Intel XFS 2x10Gbps» kernel-mode Click 27

28 Single-server performance I/O hub Mem Ports Cores Mem First try: 1.3 Gbps 28

29 Problem #1: book-keeping Managing packet descriptors» moving between NIC and memory» updating descriptor rings Solution: batch packet operations» NIC batches multiple packet descriptors» CPU polls for multiple packets 29

30 Single-server performance I/O hub Mem Ports Cores Mem First try: 1.3 Gbps With batching: 3 Gbps 30

31 Problem #2: queue access Ports Cores 31

32 Problem #2: queue access ule #1: 1 core per port 32

33 Problem #2: queue access ule #1: 1 core per port ule #2: 1 core per packet 33

34 Problem #2: queue access ule #1: 1 core per port ule #2: 1 core per packet 34

35 Problem #2: queue access ule #1: 1 core per port ule #2: 1 core per packet 35

36 Problem #2: queue access ule #1: 1 core per port ule #2: 1 core per packet queue 36

37 Single-server performance I/O hub Mem Ports Cores Mem First try: 1.3 Gbps With batching: 3 Gbps With multiple queues: 9.7 Gbps 37

38 ecap State-of-the art hardware» NUMA architecture, multi-queue NICs Modified NIC driver» batching Careful queue-to-core allocation» one core per queue, per packet 38

39 Outline Interconnect Server optimizations Performance Conclusions 39

40 Single-server performance Gbps ealistic size mix Min-size packets No-op forwarding IP routing ealistic size mix: = 8 12 Gbps Min-size packets: = 2 3 Gbps 40

41 Bottlenecks Gbps ealistic size mix Min-size packets No-op forwarding IP routing ealistic size mix: I/O Min-size packets: CPU 41

42 With upcoming servers ealistic size mix Gbps Min-size packets No-op forwarding IP routing ealistic size mix: = Gbps Min-size packets: = Gbps 42

43 B4 prototype N = 4 external ports» 1 server per port» full mesh ealistic size mix: 4 x 8.75 = 35 Gbps» expected = 8 12 Gbps Min-size packets: 4 x 3 = 12 Gbps» expected = 2 3 Gbps 43

44 I did not talk about eordering» avoid per-flow reordering» 0.15% Latency» 24 microseconds per server (estimate) Open issues» power, form-factor, programming model 44

45 Conclusions outebricks: high-end software router» Valiant LB cluster of commodity servers Programmable with Click Performance:» easily = 1Gbps, N = 100s» = 10Gbps for realistic traffic» for worst case, with upcoming servers 45

46 Thank you. NIC driver and more information at 46

Today s Paper. Routers forward packets. Networks and routers. EECS 262a Advanced Topics in Computer Systems Lecture 18

Today s Paper. Routers forward packets. Networks and routers. EECS 262a Advanced Topics in Computer Systems Lecture 18 EECS 262a Advanced Topics in Computer Systems Lecture 18 Software outers/outebricks March 30 th, 2016 John Kubiatowicz Electrical Engineering and Computer Sciences University of California, Berkeley Slides