EECS 262a Advanced Topics in Computer Systems, Lecture 18: Software Routers / RouteBricks

EECS 262a Advanced Topics in Computer Systems, Lecture 18: Software Routers / RouteBricks. March 30th, 2016. John Kubiatowicz, Electrical Engineering and Computer Sciences, University of California, Berkeley. Slides courtesy of Sylvia Ratnasamy. http://www.eecs.berkeley.edu/~kubitron/cs262

Today's paper: "RouteBricks: Exploiting Parallelism To Scale Software Routers." Mihai Dobrescu, Norbert Egi, Katerina Argyraki, Byung-Gon Chun, Kevin Fall, Gianluca Iannaccone, Allan Knies, Maziar Manesh, and Sylvia Ratnasamy. Appears in Proceedings of the 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009. Thoughts? The paper is divided into two pieces: a single-server router and cluster-based routing.

Networks and routers. Routers interconnect networks: edge networks such as UCB, MIT, HP, and NYU connect through a core provider such as AT&T.

Routers forward packets. Each packet carries a header and a payload; a router looks up the packet's destination address in its route table to choose the next hop. Example route table at Router 1: destination UCB, next hop Router 4; HP, Router 5; MIT, Router 2; NYU, Router 3.

Router definitions. A router has N external ports, each running at R bits per second (bps): N = number of external router ports, R = line rate of a port, router capacity = N x R.

Networks and routers. Routers appear at several places in the network: the core (e.g., AT&T), the edge (ISP), the edge (enterprise, e.g., UCB, HP, MIT, NYU), and home/small-business networks.

Examples of routers (core). Cisco CRS-1: R = 10/40 Gbps, N x R = 46 Tbps, 72 racks, 1 MW. Juniper T640: R = 2.5/10 Gbps, N x R = 320 Gbps.

Examples of routers (edge). Cisco ASR 1006: R = 1/10 Gbps, N x R = 40 Gbps. Juniper M120: R = 2.5/10 Gbps, N x R = 120 Gbps.

Examples of routers (small business). Cisco 3945E: R = 10/100/1000 Mbps, N x R < 10 Gbps.

Building routers. For the edge and core: ASICs, network processors, or commodity servers (RouteBricks). For home and small business: ASICs, network/embedded processors, or commodity PCs and servers.

Why programmable routers? New ISP services (intrusion detection, application acceleration); simpler network monitoring (measure link latency, track down traffic); new protocols (IP traceback, Trajectory Sampling, etc.); in short, they enable flexible, extensible networks.

Challenge: performance. Deployed edge/core routers offer port speeds (R) of 1/10/40 Gbps and capacities (N x R) of 40 Gbps to 40 Tbps. PC-based software routers offered a capacity (N x R) of 1-2 Gbps in 2007 [Click] and 4 Gbps in 2009 [Vyatta]. Subsequent challenges: power, form factor, etc.

A single-server router. A server with N router links: sockets with cores, integrated memory controllers, point-to-point inter-socket links (e.g., QPI), and Network Interface Cards (NICs).

Packet processing in a server. Per packet: (1) a core polls the input port; (2) the NIC writes the packet to memory; (3) the core reads the packet; (4) the core processes the packet (address lookup, checksum, etc.); (5) the core writes the packet to the output port.

Lesson #1: multi-core alone isn't enough. Assuming 8 cores at 2.8 GHz and R = 10 Gbps of all-64B packets: 19.5 million packets per second, i.e., one packet every 0.05 µsecs, leaving roughly 1000 cycles to process each packet. Teaser: this suggests that efficient use of CPU cycles is key!

Hardware lesson: avoid shared-bus servers. Older (2008) servers used a shared front-side bus with the memory controller in the chipset; current (2009) servers have integrated memory controllers and point-to-point links, giving today about 200 Gbps of memory bandwidth and 144 Gbps of I/O bandwidth.
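To make the five per-packet steps concrete, here is a minimal sketch (in C) of the forwarding loop a single core would run; the NIC functions (nic_rx_poll, nic_tx) and the helpers are hypothetical placeholders for illustration, not the Click or RouteBricks API.

```c
/* Minimal sketch of the per-packet steps above; all NIC/helper functions are
 * hypothetical placeholders, not the actual Click/RouteBricks interfaces. */
#include <stddef.h>
#include <stdint.h>

struct pkt { uint8_t *data; size_t len; };

extern struct pkt *nic_rx_poll(int port);        /* step 1: core polls input port;    */
                                                 /* step 2: NIC has already DMAed the */
                                                 /*         packet into memory        */
extern int  route_lookup(const uint8_t *hdr);    /* step 4a: destination address lookup */
extern void update_checksum(uint8_t *hdr);       /* step 4b: checksum/header updates    */
extern void nic_tx(int port, struct pkt *p);     /* step 5: core writes packet to port  */

void forward_loop(int in_port)
{
    for (;;) {
        struct pkt *p = nic_rx_poll(in_port);    /* steps 1-2 */
        if (!p)
            continue;                            /* nothing arrived yet */
        int out_port = route_lookup(p->data);    /* steps 3-4: read and process packet */
        update_checksum(p->data);
        nic_tx(out_port, p);                     /* step 5 */
    }
}
```

At R = 10 Gbps of 64B packets, this loop has a budget of roughly 22.4 Gcycles/sec divided by 19.5 Mpkts/sec, about 1150 cycles per packet across 8 cores at 2.8 GHz, which is why the lessons that follow focus on avoiding locks, cache misses, and per-packet book-keeping.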

Lesson #2: on cores and ports. How do we assign cores to input and output ports? Problem: locking. If multiple cores poll the same input port, or transmit to the same output port, they must synchronize. Hence, rule 1: one core per port.

Lesson #2, continued. Problem: cache misses and inter-core communication. In a pipelined assignment, a packet is transferred between cores and (may be) transferred across L3 caches; in a parallel assignment, the packet stays at one core and always in one cache. Hence, rule 2: one core per packet.

Lesson #2, continued. Two rules: one core per port and one core per packet. Problem: often we cannot satisfy both simultaneously, for example when there are more cores than ports. Solution: use multi-queue (multi-Q) NICs.

Multi-Q NICs. A feature of modern NICs (added for virtualization): each port is associated with multiple queues on the NIC; the NIC demuxes incoming traffic (and muxes outgoing traffic) across the queues, demuxing based on a hash of packet fields (e.g., source + destination address).

Multi-Q NICs, repurposed for routing. The rules become: one core per port queue, and one core per packet. If the number of queues per port equals the number of cores, both rules can always be enforced.

Lesson #2, recap. Use multi-Q NICs, with a modified NIC driver for lock-free polling of queues, with one core per queue (avoids locking) and one core per packet (avoids cache misses and inter-core communication).

Lesson #3: book-keeping. The per-packet steps (core polls input port, NIC writes packet to memory, core reads, processes, and writes the packet to the output port) also move packet descriptors. Problem: excessive per-packet book-keeping overhead. Solution: batch packet operations; the NIC transfers packets in batches of k.
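The per-core loop that results from these rules can be sketched as follows; the burst-style functions (nic_rx_burst, nic_tx_burst) and the constant BATCH_K are hypothetical stand-ins in the spirit of the modified driver, not its actual interface.

```c
/* Sketch of the queue-per-core, batched receive pattern described above.
 * Function names and BATCH_K are illustrative, not the RouteBricks driver API. */
#include <stddef.h>

#define BATCH_K 32                         /* NIC transfers packets k at a time */

struct pkt;                                /* opaque packet handle */
extern size_t nic_rx_burst(int port, int queue, struct pkt **pkts, size_t max);
extern void   nic_tx_burst(int port, int queue, struct pkt **pkts, size_t n);
extern int    process(struct pkt *p);      /* lookup + header updates; returns out port */

/* Each core runs this loop on its own queue of each port, so no locks are
 * needed and a packet never migrates to another core's cache. */
void core_loop(int in_port, int my_queue)
{
    struct pkt *batch[BATCH_K];
    for (;;) {
        size_t n = nic_rx_burst(in_port, my_queue, batch, BATCH_K);
        for (size_t i = 0; i < n; i++) {
            int out_port = process(batch[i]);
            nic_tx_burst(out_port, my_queue, &batch[i], 1);
        }
    }
}
```

A production driver would also coalesce transmits per output queue rather than sending one packet at a time, but the queue-per-core structure is what removes the locks and keeps each packet in one cache.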

Recap: routing on a server. Design lessons: (1) parallel hardware, at the cores, the memory, and the NICs; (2) careful queue-to-core allocation: one core per queue, one core per packet; (3) reduced book-keeping per packet: a modified NIC driver with batching.

Single-server measurements: experimental setup. Test server: Intel Nehalem (X5560), dual socket, 8 cores at 2.80 GHz, 2 NICs with 2 x 10 Gbps ports per NIC, max 40 Gbps (see the paper; this needs careful memory placement, etc.). Additional servers generate and sink the test traffic.

Project feedback from meetings. You should have updated your project descriptions and plan: turn your description/plan into a living document in Google Docs, share the Google Docs link with us, and update the plan and progress throughout the semester. Questions to address: What is your evaluation methodology? What will you compare and evaluate against (a strawman)? What are your evaluation metrics? What is your typical workload (trace-based, analytical, ...)? Create a concrete, staged project execution plan: set reasonable initial goals with incremental milestones so you always have something to show for the project.

Midterm: over the weekend? Out Wednesday, due 11:59 PM PST a week from tomorrow (11/11). Rules: open book, no collaboration with other students.

Experimental setup (continued). Test server: Intel Nehalem (X5560). Software: kernel-mode Click [TOCS 00] with the modified NIC driver (batching, multi-Q); packet processing runs in the Click runtime on top of the modified driver. Additional servers generate and sink the test traffic.

Packet-processing workloads: static forwarding (no header processing) and IP routing, the latter with trie-based longest-prefix address lookup over ~300,000 table entries [RouteViews], plus checksum calculation, header updates, etc.

Input traffic: either all minimum-size (64B) packets, which maximizes the packet rate for a given port speed R, or a realistic mix of packet sizes [Abilene].

Factor analysis: design lessons. Test scenario: static forwarding of min-sized packets. Packets per second (millions): older shared-bus server, 1.2; current Nehalem server, 2.8; Nehalem plus the batching NIC driver, 5.9; Nehalem with multi-Q and the batching driver, 19.
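For concreteness, the following is a minimal sketch of the kind of trie-based longest-prefix-match lookup the IP-routing workload performs; real routers use compressed multi-bit tries, and these structure and function names are illustrative rather than taken from Click or the paper.

```c
/* Toy unibit trie for longest-prefix match over IPv4 destination addresses.
 * Illustrative only; production lookups use compressed multi-bit tries. */
#include <stdint.h>
#include <stddef.h>

struct trie_node {
    struct trie_node *child[2];   /* indexed by the next bit of the address  */
    int next_hop;                 /* -1 if no prefix terminates at this node */
};

/* Walk the destination address bit by bit, remembering the last node that
 * carried a next hop: that node corresponds to the longest matching prefix. */
int lpm_lookup(const struct trie_node *root, uint32_t dst_addr)
{
    int best = -1;
    const struct trie_node *n = root;
    for (int bit = 31; n != NULL; bit--) {
        if (n->next_hop >= 0)
            best = n->next_hop;                  /* longest prefix seen so far */
        if (bit < 0)
            break;                               /* all 32 bits consumed */
        n = n->child[(dst_addr >> bit) & 1];
    }
    return best;                                 /* next hop, or -1 if no match */
}
```

Each lookup touches at most 33 nodes; multi-bit tries inspect several address bits per step to cut the number of memory references, which matters given the per-packet cycle budget discussed earlier.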

Single-server performance. With minimum-size packets: 9.7 Gbps for static forwarding and 6.35 Gbps for IP routing (bottleneck?). With realistic packet sizes: 36.5 Gbps for both workloads out of a 40 Gbps maximum, where the bottleneck is traffic generation.

Bottleneck analysis (64B packets). Test scenario: IP routing of min-sized packets. Recall: max IP routing rate = 6.35 Gbps = 12.4 M pkts/sec. Per-packet load due to routing, maximum component capacity as nominal (empirical), and the resulting maximum packet rate as nominal (empirical):
Memory: 725 bytes/pkt; 51 (33) Gbytes/sec; 70 (46) Mpkts/sec.
I/O: 191 bytes/pkt; 16 (11) Gbytes/sec; 84 (58) Mpkts/sec.
Inter-socket link: 231 bytes/pkt; 25 (18) Gbytes/sec; 108 (78) Mpkts/sec.
CPUs: 1693 cycles/pkt; 22.4 Gcycles/sec; 13 Mpkts/sec.
The CPUs are the bottleneck.

Recap: single-server performance (R and N x R). Current servers with realistic packet sizes: R = 1/10 Gbps, N x R = 36.5 Gbps. Current servers with min-sized packets: R = 1 Gbps, N x R = 6.35 Gbps (the CPUs are the bottleneck). What about upcoming (2010) servers, with 4x the cores, 2x the memory, and 2x the I/O?
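A quick sanity check on the bottleneck analysis above (a back-of-the-envelope calculation, not a figure from the paper): the CPU budget allows about 22.4 Gcycles/sec divided by 1693 cycles/pkt, roughly 13.2 Mpkts/sec, close to the measured 12.4 Mpkts/sec, whereas the memory, I/O, and inter-socket budgets would each allow 46 Mpkts/sec or more even at their empirical capacities; the CPUs therefore bind first.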

Recap: single-server performance (R and N x R, summary). Current servers, realistic packet sizes: 1/10 Gbps, 36.5 Gbps. Current servers, min-sized packets: 1 Gbps, 6.35 Gbps (CPUs bottleneck). Upcoming servers (estimated), realistic packet sizes: 1/10/40 Gbps, 146 Gbps. Upcoming servers (estimated), min-sized packets: 1/10 Gbps, 25.4 Gbps.

Practical architecture: goal. Scale software routers to multiple 10 Gbps ports: for example, 320 Gbps (32 x 10 Gbps), the higher end of edge routers and the lower end of core routers.

A cluster-based router. Today: interconnect multiple servers, each terminating one or more external ports.

Interconnecting servers. Challenge: any input can send up to R bps to any output. What interconnect should we use?

A naïve solution: a full mesh with N² internal links of capacity R. Problem: commodity servers cannot accommodate N x R of traffic.

Interconnecting servers: challenges. Any input can send up to R bps to any output, but we need a low-capacity interconnect (~N x R), i.e., fewer than N links per server, each of capacity less than R; and we must cope with overload.

Overload. Drop at the input servers? Problem: that requires global state (e.g., we may need to drop 20 Gbps fairly across the input ports). Drop at the output server? Problem: the output server might receive up to N x R of traffic.

Interconnecting servers: challenges, refined. Any input can send up to R bps to any output, but we need a lower-capacity interconnect (fewer than N, lower-capacity than R links per server); we must cope with overload, which requires distributed dropping without global scheduling; and processing at each server should scale as R, not N x R.

Interconnecting servers: challenges and constraints. Any input can send up to R bps to any output, and we must cope with overload, under constraints imposed by commodity servers and NICs: internal link rates of at most R, per-node processing of c x R for a small constant c, and limited per-node fanout.

Valiant Load Balancing (VLB). Valiant et al. [STOC 81], originally for communication in multiprocessors; since applied to data centers [Greenberg 09], all-optical routers [Keslassy 03], traffic engineering [Zhang-Shen 04], etc. Idea: random load-balancing across a low-capacity interconnect. Solution: use VLB.

VLB: operation, phases 1 and 2. Packets are forwarded in two phases over a full mesh of N² internal links of capacity R/N. Phase 1: packets arriving at an external port are uniformly load-balanced across the N servers; each server sends up to R/N to each other server and receives up to R bps. Phase 2: each server sends up to R/N (of the traffic it received in phase 1) to the output server; the output server receives up to R bps of traffic for its external port and drops any excess fairly.

VLB: operation, phases 1+2 combined. The N² internal links each need capacity 2R/N; each server receives up to 2R bps over internal links plus R bps from its external port, hence each server processes up to 3R, or up to 2R when traffic is uniform [Direct VLB, Liu 05].
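A toy sketch of the two-phase forwarding decision just described, assuming N servers in a full mesh; the helper functions (output_server, send_internal, emit_external) are hypothetical placeholders, not RouteBricks code.

```c
/* Two-phase VLB forwarding sketch; all helper functions are hypothetical. */
#include <stdlib.h>

#define N 32                                    /* servers / external ports */

extern int  output_server(const void *pkt);     /* which server owns the output port   */
extern void send_internal(int server, const void *pkt, int phase);
extern void emit_external(const void *pkt);     /* transmit on the local external port */

/* Phase 1: a packet arrives on this server's external port and is sent to a
 * uniformly chosen intermediate server (at most R/N toward any one server). */
void vlb_phase1(const void *pkt)
{
    int intermediate = rand() % N;
    send_internal(intermediate, pkt, 1);
}

/* A packet arrives on an internal link, tagged with its phase. */
void vlb_internal(int me, const void *pkt, int phase)
{
    if (phase == 1) {
        int out = output_server(pkt);
        if (out == me)
            emit_external(pkt);                 /* we already own the output port */
        else
            send_internal(out, pkt, 2);         /* phase 2: forward to the owner  */
    } else {
        emit_external(pkt);                     /* phase 2 delivery */
    }
}
```

Because the intermediate server is chosen uniformly, no internal link needs more than 2R/N of capacity for any admissible traffic pattern, which is exactly the property the capacity argument above relies on.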

VLB: fanout? (1) Fewer but faster links, and fewer but faster servers: multiple external ports per server (if server constraints permit).

VLB: fanout? (2) Use extra servers to form a constant-degree multi-stage interconnect (e.g., a butterfly).

Authors' solution: assign the maximum number of external ports per server; interconnect the servers with commodity NIC links, in a full mesh if possible, else introduce extra servers in a k-degree butterfly; the servers run flowlet-based VLB.

Scalability question: how well does clustering scale for realistic server fanout and processing capacity? Metric: the number of servers required to achieve a target router speed.

Scalability assumptions: 7 NICs per server; each NIC has 6 x 10 Gbps or 8 x 1 Gbps ports. Current servers: one external port per server (i.e., a server must process 20-30 Gbps). Upcoming servers: two external ports per server (i.e., a server must process 40-60 Gbps).

Example: a 320 Gbps router, R = 10 Gbps, N = 32, with current servers (one external port each). Target: 32 servers. Internal links carry 2R/N < 1 Gbps, so 1 Gbps internal links suffice; with 8 x 1 Gbps ports per NIC, this needs 4 NICs per server. So: 32 servers, 4 NICs per server (1 Gbps NICs).

Scalability (computed): servers required for a target router speed of 160 Gbps / 320 Gbps / 640 Gbps / 1.28 Tbps / 2.56 Tbps: current servers, 16 / 32 / 128 / 256 / 512; upcoming servers, 8 / 16 / 32 / 128 / 256 (the larger steps reflect the transition from a full mesh to a butterfly). Example: a 320 Gbps router can be built with 32 current servers.
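As a back-of-the-envelope check of the 320 Gbps example (my arithmetic, not text from the slides): a full mesh of N = 32 servers needs 31 internal links per server; VLB phase 1+2 traffic per internal link is 2R/N = 2 x 10 Gbps / 32, about 0.63 Gbps, so 1 Gbps links suffice; and at 8 x 1 Gbps ports per NIC, those 31 links require ceil(31/8) = 4 of the 1 Gbps NICs per server, matching the slide.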

Implementation: the RB8/4. Four Nehalem servers, each with 2 external 10 Gbps ports (Intel Niantic NICs). Specs: 8 external 10 Gbps ports, form factor 4U, power 1.2 kW, cost ~$10k. Key results (realistic traffic): 72 Gbps of routing, reordering of 0-0.15%, and the VLB bounds were validated.

Is this a good paper? What were the authors' goals? What about the evaluation and metrics? Did they convince you that this was a good system/approach? Were there any red flags? What mistakes did they make? Does the system/approach meet the test-of-time challenge? How would you review this paper today?