Ethernet VPN (EVPN) in the Data Center
Description and Design Considerations
Vasilis Stavropoulos, Sparkle GR
EVPN in Data Center
- The necessity for EVPN (what it is, which problems it solves)
- EVPN with MPLS transport (RFC 7432)
- EVPN with VXLAN transport (draft-ietf-bess-evpn-overlay-07)
- Design considerations
- Configuration examples (Junos)
- L4-L7 services integration
Data Center L2 issues
- Traditionally in Data Centers (DC), tenant separation is performed at L2 with VLANs.
- This introduces spanning-tree limitations and dangers: data-plane flooding, broadcast storms, partially used uplinks.
- Slow recovery times due to STP convergence.
- Potential scalability problems imposed by the maximum number of VLAN IDs (4096).
- Proprietary vendor solutions (vPC, MC-LAG) are needed to bypass spanning-tree limitations.
EVPN Benefits
- EVPN brings MAC learning into the control plane, via another extension (EVPN signaling) of our favorite protocol, BGP.
- It allows tunneling of L2 traffic (overlay) through an IP fabric (underlay).
- Faster convergence times.
- Service-provider-level scalability (route reflectors).
- All-active multi-homing from hosts to the network without vendor-proprietary solutions.
- Anycast gateway: an identical gateway (IP/MAC) for all hosts/VMs in the fabric, leading to reduced ARP flooding and traffic optimization.
EVPN Terminology
- EVI: EVPN Instance, the instance that spans all PEs participating in the specific EVPN.
- ES: Ethernet Segment, the connection between the hosts and the PEs; in the case of active/active uplinks, an ES represents the link-aggregation set.
- ESI: Ethernet Segment Identifier, which uniquely identifies the connected hosts on the PE; it has a zero value for single-homed hosts and a non-zero, unique value for multi-homed hosts.
EVPN Terminology: Route Types
- Route Type 1: Ethernet Auto-Discovery (A-D) route. Provides auto-discovery for multi-homed hosts and represents the ESI (also known as the mass-withdraw route).
- Route Type 2: MAC/IP Advertisement route. EVPN allows end hosts' IP and MAC addresses to be advertised within the EVPN Network Layer Reachability Information (NLRI). This allows control-plane learning of end systems' MAC addresses.
- Route Type 3: Inclusive Multicast Ethernet Tag route. Sets up a path for broadcast, unknown unicast, and multicast (BUM) traffic from a PE device to the remote PE devices on a per-VLAN, per-ESI basis (ingress replication method).
- Route Type 4: Ethernet Segment route. These routes are needed in multi-homed (active/active) scenarios and help determine the Designated Forwarder (DF) PE. A DF is elected per ESI for BUM traffic handling.
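The DF election mentioned for Route Type 4 can be sketched in a few lines. This is a minimal illustration of the default procedure in RFC 7432 section 8.5 (all names here are assumptions, not from the slides): the PEs attached to the same ES are ordered by originator IP address, and VLAN V is served by the PE at index V mod N.

```python
# Sketch of the default DF election from RFC 7432 section 8.5.
# PEs sharing an Ethernet Segment are sorted by IP; VLAN V maps
# to the PE at position (V mod N). Peer IPs below are examples.
from ipaddress import ip_address

def elect_df(es_peers, vlan_id):
    """Return the Designated Forwarder for one VLAN on a shared ES."""
    ordered = sorted(es_peers, key=ip_address)  # ascending IP order
    return ordered[vlan_id % len(ordered)]

peers = ["10.10.10.2", "10.10.10.3"]  # two multi-homing PEs (example)
print(elect_df(peers, 100))  # 100 % 2 == 0 -> 10.10.10.2
print(elect_df(peers, 101))  # 101 % 2 == 1 -> 10.10.10.3
```

Note how consecutive VLANs land on different PEs, spreading BUM-forwarding duty across the multi-homing PEs.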
EVPN Network (MPLS transport)
- IP fabric (Clos) with MPLS enabled.
- iBGP between the leaf routers with the EVPN signaling extension; OSPF as the IGP.
- LDP or RSVP as the MPLS signaling protocol.
- We achieve MAC and MAC/IP advertisement through MP-BGP (control-plane learning).
- The VMs of Host1 and Host2 think they are on the same broadcast domain, although an IP fabric sits in the middle.
- The anycast GW offers transparent VM mobility between hosts.
EVPN Network (MPLS transport)
- The ESI is the same on Leaf-1 and Leaf-2, which simplifies link aggregation towards two distinct physical switches/routers (no vPC, MC-LAG, etc.).
- Via Route Type 1, Leaf-1 learns that the MACs of Host2's VMs are behind both Leaf-2 and Leaf-3, so it load-balances traffic towards them.
- Route Type 2 describes individual MAC/IP advertisements through BGP.
- The ESI also gives faster convergence: if the Host2-Leaf-3 link goes down, Leaf-3 withdraws its Route Type 1 and all related MACs are purged immediately from the other PEs.
EVPN Network: New Extended Communities
- MAC Mobility extended community: carries a sequence number that helps PEs withdraw old MAC/IP routes during VM relocations between hosts.
- Default Gateway extended community: carried by the MAC/IP route to indicate that the route is associated with a default GW. Alternatively, manually configure the same IP/MAC per interface on all PEs.
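The MAC mobility mechanism above can be sketched as a simple comparison (the data model here is an assumption for illustration): each re-advertisement of a moved MAC carries an incremented sequence number, and every PE keeps only the route with the highest one, so the stale route is withdrawn.

```python
# Minimal sketch of MAC Mobility tie-breaking (RFC 7432 section 15):
# the MAC/IP route with the higher sequence number wins, and the PE
# holding the older route withdraws it. Route tuples are assumed to
# be (advertising_pe, sequence_number).

def best_mac_route(routes):
    """Return the winning advertisement for one MAC address."""
    return max(routes, key=lambda r: r[1])

# VM first learned behind leaf1 (seq 0); after vMotion, leaf2
# re-advertises the same MAC with seq 1.
routes = [("leaf1", 0), ("leaf2", 1)]
print(best_mac_route(routes))  # ('leaf2', 1) -> leaf1's route is stale
```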
EVPN Network (MPLS transport)
EVPN VLAN-based (a different instance per VLAN), 1:1 mapping:
  VLAN 10 --- EVI 10 --- VLAN 10 or translated
  VLAN 20 --- EVI 20 --- VLAN 20 or translated
EVPN VLAN bundle-based (same instance, different bridge domains):
  VLAN 10 --- EVI 10 (bridge-domain 10) --- VLAN 10 or translated
  VLAN 20 --- EVI 10 (bridge-domain 20) --- VLAN 20 or translated
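The two service models above differ only in how a VLAN resolves to an EVI. A tiny sketch (the dictionary layout is an assumption, purely illustrative):

```python
# VLAN-based: each VLAN gets its own EVI (1:1).
# Bundle-based: one shared EVI, one bridge domain (BD) per VLAN.
vlan_based = {10: {"evi": 10}, 20: {"evi": 20}}
bundle_based = {10: {"evi": 10, "bd": 10},
                20: {"evi": 10, "bd": 20}}

def evi_for_vlan(service_map, vlan):
    """Resolve the EVI serving a given customer VLAN."""
    return service_map[vlan]["evi"]

print(evi_for_vlan(vlan_based, 20))   # 20 - separate instance
print(evi_for_vlan(bundle_based, 20)) # 10 - shared instance
```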
EVPN Network (VXLAN transport)
- Leaf switches are usually lower-spec devices, not supporting MPLS or having limited features.
- MPLS is not popular in the enterprise world and is not supported by hypervisors.
- So EVPN with VXLAN transport is the most popular choice for the overlay.
- It provides a theoretical upper limit of 16.7M VNIs (VXLAN Network Identifiers, a 24-bit field in the header), compared to 4096 VLANs.
EVPN Network (VXLAN transport)
- VXLAN provides L2 overlay tunneling by encapsulating MAC frames in IP/UDP, creating an independent overlay network over the IP fabric.
- It uses Virtual Tunnel End Point (VTEP) interfaces in hypervisors or physical switches to perform this encapsulation.
- A VTEP is a function with two interfaces: one L2 interface towards the LAN segment (hosts/VMs) and one L3 interface towards the IP fabric.
- VLAN-to-VXLAN mapping takes place at the LAN side before encapsulation.
- The initial implementation of VXLAN used a flood-and-learn mechanism over multicast for VTEP discovery in the fabric; enabling multicast in the DC for this reason alone is neither scalable nor elegant.
- EVPN solves this by enabling VTEP discovery through control-plane learning (BGP).
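The encapsulation step can be made concrete with the 8-byte VXLAN header from RFC 7348, which the VTEP prepends to the original MAC frame before the outer UDP/IP headers (UDP destination port 4789). A minimal sketch:

```python
# Sketch of the VXLAN header (RFC 7348): 1 flags byte with the I bit
# (0x08) set, 3 reserved bytes, a 24-bit VNI, 1 reserved byte.
import struct

def vxlan_header(vni):
    """Build the 8-byte header a VTEP prepends to the MAC frame."""
    return struct.pack("!II", 0x08 << 24, vni << 8)

def vxlan_vni(header):
    """Recover the VNI at the remote VTEP during de-encapsulation."""
    return struct.unpack("!II", header)[1] >> 8

hdr = vxlan_header(5001)
print(len(hdr), vxlan_vni(hdr))  # 8 5001
```

The 24-bit VNI field is where the 16.7M (2^24) segment limit mentioned earlier comes from.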
VXLAN + VTEP
(Figures: Cisco Nexus 9000 Series VXLAN white paper, white-paper-c11-729383, cisco.com)
EVPN Network (VXLAN transport)
- IP fabric (Clos) without MPLS.
- A VTEP is a function with two interfaces: one L2 interface towards the LAN segment (hosts/VMs) and one L3 interface towards the IP fabric.
- VTEP IP discovery happens through the MP-BGP EVPN control plane.
- MAC frames are encapsulated in UDP/IP before being transported through the IP fabric.
- The de-encapsulation process takes place at the remote VTEP.
EVPN Network (VXLAN transport: VTEP function)

Local VTEP forwarding table:
  MAC                  VXLAN ID   Remote VTEP
  ab:cd:ef:12:34:56    5001       10.10.10.1
  ac:dd:11:22:33:aa    5010       10.10.10.2
  be:af:12:ac:22:ac    5020       10.10.10.3

VLAN-to-VXLAN mapping (LAN side):
  VLAN   VXLAN
  501    5001
  510    5010
  520    5020
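The two tables above drive the VTEP's forwarding decision. A hedged sketch of that lookup, using the same example entries (the function shape is an assumption, not vendor behavior): the LAN-side VLAN selects the VNI, and a MAC hit learned from Type-2 routes yields a unicast tunnel, while a miss or broadcast falls back to BUM flooding.

```python
# Sketch of a VTEP lookup over the tables shown above: control-plane
# (Type-2) learning fills mac_table; vlan_to_vni is local config.
vlan_to_vni = {501: 5001, 510: 5010, 520: 5020}
mac_table = {
    "ab:cd:ef:12:34:56": (5001, "10.10.10.1"),
    "ac:dd:11:22:33:aa": (5010, "10.10.10.2"),
    "be:af:12:ac:22:ac": (5020, "10.10.10.3"),
}

def forward(dst_mac, vlan):
    """Decide unicast tunnel vs BUM flood for one frame."""
    vni = vlan_to_vni[vlan]
    entry = mac_table.get(dst_mac)
    if entry and entry[0] == vni:
        return ("unicast", entry[1])  # encapsulate towards one VTEP
    return ("flood", vni)             # BUM: ingress replication

print(forward("ac:dd:11:22:33:aa", 510))  # ('unicast', '10.10.10.2')
```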
EVPN Network (VLAN vs VXLAN)

root@vmx1> show configuration routing-instances EVPN-100
instance-type virtual-switch;
route-distinguisher 10.10.10.1:100;
vrf-import VL100-vrf-import;
vrf-target target:100:100;
protocols {
    evpn {
        extended-vlan-list 100-101;
        default-gateway do-not-advertise;
    }
}
bridge-domains {
    VL-100 {
        vlan-id 100;
        interface ge-0/0/1.100;
        routing-interface irb.100;
    }
    VL-101 {
        vlan-id 101;
        interface ge-0/0/1.101;
        routing-interface irb.101;
    }
}

root@vmx1> show configuration routing-instances EVPN-100
vtep-source-interface lo0.0;
instance-type virtual-switch;
route-distinguisher 10.10.10.1:100;
vrf-target target:100:100;
protocols {
    evpn {
        encapsulation vxlan;
        extended-vni-list 1000-1010;
        default-gateway do-not-advertise;
    }
}
bridge-domains {
    VL-100 {
        vlan-id 100;
        interface ge-0/0/1.100;
        vxlan {
            vni 1000;
            ingress-node-replication;
        }
    }
    ..
}
EVPN Network
- Even with VXLAN there are limitations on the merchant-silicon switches usually used as ToR.
- Most of the smart things take place at the spine level, where all the features are available (e.g. VXLAN L3 gateway).
- This leads to more complex scenarios and configurations between different types of equipment.
- Exercise: collapse the leaf architecture inside the hypervisor (a virtual router as a leaf) and proceed with VLAN IDs and MPLS transport as an example.
EVPN Network
EVPN packet walkthrough
- VM-B sends an ARP request for the IP of VM-A.
- The packet is flooded to all PEs participating in the EVI, using the Type 3 route (ingress replication) learned via BGP.
- The packet reaches VM-A, which replies to the ARP request with its own IP address.
- The reply is unicast and is sent only to the specific remote PE, due to MAC learning from MP-BGP (Route Type 2, MAC/IP).
- Different MPLS labels are allocated for RT-2 and RT-3.
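The walkthrough above can be sketched as two lookups (the data structures are illustrative assumptions): Type-3 routes supply the flood list used for the broadcast ARP request, while the learned Type-2 entry lets the reply go to exactly one PE.

```python
# Sketch of the forwarding decisions in the packet walkthrough:
# the ARP request is BUM traffic and is ingress-replicated to every
# remote PE from the Type-3 flood list; the reply matches a Type-2
# (MAC/IP) entry and is unicast to a single PE.
type3_flood_list = ["10.10.10.1", "10.10.10.3"]   # remote PEs in the EVI
type2_macs = {"00:50:56:95:20:bf": "10.10.10.1"}  # VM-A, learned via BGP

def next_hops(dst_mac):
    """Return the list of PEs that receive a copy of the frame."""
    if dst_mac == "ff:ff:ff:ff:ff:ff":  # broadcast (ARP request)
        return type3_flood_list          # one copy per remote PE
    return [type2_macs[dst_mac]]         # known unicast: single PE

print(next_hops("ff:ff:ff:ff:ff:ff"))  # both PEs get a copy
print(next_hops("00:50:56:95:20:bf"))  # ['10.10.10.1'] only
```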
EVPN routes

show route advertising-protocol bgp 10.10.10.1 table EVPN-100.evpn.0 detail

Route Type-2 (MAC):
* 2:10.10.10.2:100::100::00:50:56:95:5d:11/304 (1 entry, 1 announced)
 BGP group IBGP type Internal
     Route Distinguisher: 10.10.10.2:100
     Route Label: 299776
     ESI: 00:00:00:00:00:00:00:00:00:00   ------> single-homed
     Nexthop: Self
     Flags: Nexthop Change
     Localpref: 100
     AS path: [65001] I
     Communities: target:100:100

Route Type-3:
* 3:10.10.10.2:100::100::10.10.10.2/304 (1 entry, 1 announced)
 BGP group IBGP type Internal
     Route Distinguisher: 10.10.10.2:100
     Route Label: 299785
     PMSI: Flags 0x0: Label 299785: Type INGRESS-REPLICATION 10.10.10.2
     Nexthop: Self
     Flags: Nexthop Change
     Localpref: 100
     AS path: [65001] I
     Communities: target:100:100

Route Type-2 (MAC/IP):
* 2:10.10.10.2:100::100::00:50:56:95:5d:11::192.168.10.10/304 (1 entry, 1 announced)
 BGP group IBGP type Internal
     Route Distinguisher: 10.10.10.2:100
     Route Label: 299776
     ESI: 00:00:00:00:00:00:00:00:00:00
     Nexthop: Self
     Flags: Nexthop Change
     Localpref: 100
     AS path: [65001] I
     Communities: target:100:100

Decoding a Type-2 key:
  2                 : route type (MAC/IP advertisement)
  10.10.10.1:100    : RD
  00:50:56:95:5d:11 : MAC address of the VM
  192.168.10.11     : IP of the VM
  100               : VLAN ID
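The decoding legend above can be automated. A small parser sketch for the Junos-style route keys shown (the field layout `<type>:<RD>::<tag>::<MAC>[::<IP>]/<len>` is inferred from these examples; function and field names are assumptions):

```python
# Sketch: split a Junos-style EVPN route key into its components,
# covering both the MAC-only and MAC/IP variants of Type-2 routes.

def parse_evpn_route(key):
    """Parse '<type>:<RD>::<tag>::<MAC>[::<IP>]/<len>' into a dict."""
    body, _, plen = key.rpartition("/")
    fields = body.split("::")
    rtype, _, rd = fields[0].partition(":")
    route = {"type": int(rtype), "rd": rd, "vlan": int(fields[1]),
             "mac": fields[2], "ip": None, "len": int(plen)}
    if len(fields) == 4:  # MAC/IP variant carries a fourth field
        route["ip"] = fields[3]
    return route

r = parse_evpn_route(
    "2:10.10.10.2:100::100::00:50:56:95:5d:11::192.168.10.10/304")
print(r["type"], r["rd"], r["mac"], r["ip"])
```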
EVPN routes

root@vmx1> show route table EVPN-100
2:10.10.10.1:100::100::00:50:56:95:20:bf::192.168.10.11/304
                   *[EVPN/170] 00:01:42
                       Indirect
2:10.10.10.2:100::100::00:50:56:95:5d:11::192.168.10.10/304
                   *[BGP/170] 00:01:40, localpref 100, from 10.10.10.2
                      AS path: I, validation-state: unverified
                    > to 10.20.30.2 via ge-0/0/2.0

root@vmx2> show route table EVPN-100
2:10.10.10.1:100::100::00:50:56:95:20:bf::192.168.10.11/304
                   *[BGP/170] 00:03:28, localpref 100, from 10.10.10.1
                      AS path: I, validation-state: unverified
                    > to 10.20.30.1 via ge-0/0/2.0
2:10.10.10.2:100::100::00:50:56:95:5d:11::192.168.10.10/304
                   *[EVPN/170] 00:03:27
                       Indirect
EVPN vLeaf
- Each vPE/leaf has a pretty much identical configuration.
- It may be templated/automated.
- However, special care is needed to optimize resources (CPU, memory, network).
- Various optimization techniques exist for compute resources (NUMA, CPU pinning, ...).
- The same holds for the networking part (PCI pass-through, SR-IOV, DPDK, ...).
L4-7 Services integration
- In order to route traffic outside the IP fabric and maintain the desired multi-tenancy, we need to implement L3 VRFs.
- These VRFs have a different RD and RT than the EVPN ones, but contain the same routing interface, which continues to be the default GW per tenant.
- So for each VLAN we have one bridge domain in the EVPN instance and one L3 VRF, which contains e.g. a static or dynamic route towards the outside of the fabric, via another edge device (firewall).
L4-7 Services integration
L4-7 Services integration
- A tenant may use its own vFirewall or the provider's.
- The tenant's default GW remains the leaf (VRF).
- The inside interface of the vFW is terminated at a different port, participating only in the L3 VRF.
- This ensures independent vMotion of the VMs relative to the vFW, plus more flexibility in inter-VLAN forwarding (VRF import/export policies).
- EVPN instance for east/west traffic, VRF instance for routing outside the fabric.
L4-7 Services integration

L3 VRF:
instance-type vrf;
interface ge-0/0/3.0;    ------> vFW inside interface
interface irb.100;
route-distinguisher 10.10.10.1:10;
vrf-target target:100:10;
vrf-table-label;
routing-options {
    static {
        route 0.0.0.0/0 next-hop x.x.x.x;
    }
}
protocols {
    ospf {
        area 0.0.0.0 {
            interface ge-0/0/3.0 {
                metric 100;
            }
        }
    }
}

EVPN instance:
instance-type virtual-switch;
route-distinguisher 10.10.10.1:100;
vrf-import VL100-vrf-import;
vrf-target target:100:100;
protocols {
    evpn {
        extended-vlan-list 100-101;
        default-gateway do-not-advertise;
    }
}
bridge-domains {
    VL-100 {
        vlan-id 100;
        interface ge-0/0/1.100;    ------> LAN side (tenant VMs)
        routing-interface irb.100;
    }
    VL-101 {
        vlan-id 101;
        interface ge-0/0/1.101;
        routing-interface irb.101;
    }
}
L4-7 Services integration
- One vLeaf per node without a centralized lifecycle manager could be a problem, depending on the scale.
- However, the configuration per vLeaf is similar and can be (more) easily templated and automated.
- EVPN with MPLS transport could work at the DC level for small/medium design scenarios.
- Repeated rack configuration using the same VLAN IDs.
- Easier integration with the rest of the service-provider network, especially for potential Data Center Interconnect (DCI) needs.
Summary/References
- Legacy DC designs with L2 domains (VLANs) using spanning tree have long been considered obsolete, for all the well-known reasons.
- An intermediate solution uses vendor-proprietary protocols (vPC etc.) to reduce the STP topology and better utilize uplinks; however, limitations remain, e.g. in routing-protocol usage.
- EVPN brings the control plane into the game of MAC learning, eliminating the need for proprietary solutions and, of course, spanning tree.
- EVPN/MPLS or EVPN/VXLAN? EVPN/VXLAN in the DC and EVPN/MPLS at the core/SP is the trend, while other encapsulation methods are also available.
- https://tools.ietf.org/html/rfc7432 (BGP MPLS-Based Ethernet VPN)
- https://tools.ietf.org/html/draft-ietf-bess-evpn-overlay-07