Cisco ACI Multi-Pod/Multi-Site Deployment Options Max Ardica Principal Engineer BRKACI-2003
Agenda ACI Introduction and Multi-Fabric Use Cases ACI Multi-Fabric Design Options ACI Stretched Fabric Overview ACI Multi-Pod Deep Dive ACI Multi-Site Solutions Overview Conclusions 3
Session Objectives At the end of the session, participants should be able to: articulate the different Multi-Fabric deployment options offered with Cisco ACI, and understand the design considerations associated with those options. Initial assumption: the audience already has a good knowledge of the main ACI concepts (Tenant, BD, EPG, L2Out, L3Out, etc.) 4
Introducing: Application Centric Infrastructure (ACI) An ACI fabric with an integrated GBP VXLAN overlay, managed by the Application Policy Infrastructure Controller (APIC); policy (QoS, filters, services) is applied between EPGs such as Web, App, DB and the outside (tenant VRF). 6
ACI Multi-Pod/Multi-Site Use Cases
Single Site Multi-Fabric: multiple fabrics connected within the same DC (between halls or buildings, or within the same campus location), driven by cabling limitations, HA requirements, and scaling requirements.
Single Region Multi-Fabric (classic Active/Active scenario): scoped by an application mobility domain of 10 msec RTT; BDs/IP subnets can be stretched between sites. The goal is to reduce fate sharing across sites as much as possible while maintaining operational simplicity.
Multi Region Multi-Fabric: creation of separate availability zones, disaster recovery, minimal cross-site communication, and deployment of applications not requiring Layer 2 adjacency. 7
Agenda ACI Introduction and Multi-Fabric Use Cases ACI Multi-Fabric Design Options ACI Stretched Fabric Overview ACI Multi-Pod Deep Dive ACI Multi-Site Solutions Overview Conclusions 8
ACI Multi-Fabric Design Options Single APIC Cluster/Single Domain Stretched Fabric Multiple APIC Clusters/Multiple Domains Dual-Fabric Connected (L2 and L3 Extension) ACI Fabric Site 1 Site 2 ACI Fabric 1 ACI Fabric 2 L2/L3 Multi-Pod (Q3CY16) Multi-Site (Future) Pod A IP Network Pod n Site A IP Network Site n MP-BGP - EVPN MP-BGP - EVPN APIC Cluster 9
Agenda ACI Introduction and Multi-Fabric Use Cases ACI Multi-Fabric Design Options ACI Stretched Fabric Overview ACI Multi-Pod Deep Dive ACI Multi-Site Solutions Overview Conclusions 10
Stretched ACI Fabric (for more information on ACI Stretched Fabric deployment: BRKACI-3503)
A fabric stretched to two sites works as a single fabric deployed within a DC: one APIC cluster provides one management and configuration point, and the anycast gateway runs on all leaf switches. The design works with one or more transit leaves per site; any leaf node can be a transit leaf, and the number of transit leaves and links is dictated by redundancy and bandwidth capacity decisions. Different options exist for the inter-site links (dark fiber, DWDM, EoMPLS pseudowires). 11
Stretched ACI Fabric Support for 3 Interconnected Sites (Q2CY16) Transit leaves in all sites connect to the local and remote spines over 2x40G or 4x40G links. 16
Agenda ACI Introduction and Multi-Fabric Use Cases ACI Multi-Fabric Design Options ACI Stretched Fabric Overview ACI Multi-Pod Solution Deep Dive ACI Multi-Site Solutions Overview Conclusions 17
ACI Multi-Pod Solution Overview Multiple ACI Pods are connected by an IP inter-Pod L3 network; each Pod consists of leaf and spine nodes. All Pods are managed by a single APIC cluster (single management and policy domain), while the forwarding control planes (IS-IS, COOP) remain isolated per Pod for fault containment. VXLAN encapsulation is used in the data plane between Pods, with end-to-end policy enforcement. 18
ACI Multi-Pod Solution Use Cases
Handling a three-tier physical cabling layout: cabling constraints (multiple buildings, campus, metro) may require a second tier of spines; Multi-Pod is the preferred option when compared to a ToR FEX deployment.
Evolution of the Stretched Fabric design: metro area (dark fiber, DWDM) or L3 core, with more than two interconnected sites. 19
ACI Multi-Pod Solution SW and HW Requirements Software: the solution will be available from the Q3CY16 SW release. Hardware: Multi-Pod can be supported with all currently shipping Nexus 9000 platforms. The Inter-Pod Network is required to support multicast to handle BUM (L2 broadcast, unknown unicast, multicast) traffic across Pods. 20
ACI Multi-Pod Solution Supported Topologies: Pods within a single DC connected at 40G/100G; two DC sites connected back-to-back over dark fiber/DWDM (up to 10 msec RTT); three DC sites connected over dark fiber/DWDM (up to 10 msec RTT); multiple sites interconnected by a generic L3 network. Spine-facing links run at 40G/100G, with 10G/40G/100G supported across the interconnect. 21
ACI Multi-Pod Solution Scalability Considerations These scalability values may change without warning before the Multi-Pod solution is officially released. At FCS, the maximum number of supported ACI leaf nodes is 400 (across all Pods), with at most 200 leaf nodes per Pod. Use case 1: a larger number of Pods (up to 20), each with a small number of leaf nodes (20-30). Use case 2: a small number of Pods (2-3), each with a large number of leaf nodes (up to 200). 22
ACI Multi-Pod Solution Inter-Pod Network (IPN) Requirements
The IPN is not managed by APIC and must be pre-configured. The IPN topology can be arbitrary; it is not mandatory to connect to all spine nodes. Main requirements:
- 40G/100G interfaces to connect to the spine nodes
- PIM BiDir multicast to handle BUM traffic
- DHCP Relay to enable spine/leaf node discovery across Pods
- OSPF to peer with the spine nodes and learn VTEP reachability
- Increased MTU support to handle VXLAN-encapsulated traffic
- QoS to prioritize intra-APIC-cluster communication 23
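The requirement list above can be expressed as a simple pre-deployment checklist. The sketch below is purely illustrative (not an APIC or NX-OS API): the field names and the 9150-byte MTU value are assumptions chosen to comfortably carry VXLAN-encapsulated jumbo frames.

```python
# Hypothetical checklist for an IPN node against the Multi-Pod requirements
# listed above. Field names and MIN_MTU are illustrative assumptions.
REQUIRED_FEATURES = {"ospf", "pim-bidir", "dhcp-relay"}
MIN_MTU = 9150  # assumed: room for VXLAN overhead on jumbo payloads

def check_ipn_node(config: dict) -> list:
    """Return human-readable problems; an empty list means compliant."""
    problems = []
    missing = REQUIRED_FEATURES - set(config.get("features", []))
    if missing:
        problems.append(f"missing features: {sorted(missing)}")
    for iface in config.get("spine_facing_interfaces", []):
        if iface.get("speed_gbps") not in (40, 100):
            problems.append(f"{iface['name']}: spine links must be 40G or 100G")
        if iface.get("mtu", 0) < MIN_MTU:
            problems.append(f"{iface['name']}: MTU {iface.get('mtu')} < {MIN_MTU}")
    return problems

node = {
    "features": ["ospf", "pim-bidir"],  # dhcp-relay not enabled yet
    "spine_facing_interfaces": [{"name": "eth1/1", "speed_gbps": 40, "mtu": 1500}],
}
print(check_ipn_node(node))  # flags the missing feature and the low MTU
```

Running this against a node with all features enabled and jumbo MTU configured returns an empty list.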
APIC Distributed Multi-Active Database The database is replicated across APIC nodes, and one copy is active for every specific portion of the database. Processes are active on all nodes (not active/standby). The database is distributed as one active plus two backup instances (shards) for every attribute. 24
APIC Distributed Multi-Active Database When an APIC node fails, a backup copy of each of its active shards is promoted to active and takes over all tasks associated with that portion of the database. 26
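The shard behaviour on the two slides above can be modelled in a few lines. This is only an illustrative sketch; the real APIC shard placement and promotion logic is internal to the cluster software.

```python
# Minimal model of the "active + 2 backup" shard layout: each shard has
# three replicas spread across the APIC nodes; the first replica whose
# node is still up acts as the active copy. Illustrative only.
class Cluster:
    def __init__(self, nodes, n_shards):
        self.up = set(nodes)
        # rotate replica placement so active copies spread across nodes
        self.replicas = {
            s: [nodes[(s + k) % len(nodes)] for k in range(3)]
            for s in range(n_shards)
        }

    def active_for(self, shard):
        for node in self.replicas[shard]:
            if node in self.up:
                return node          # first healthy replica is active
        return None                  # all three replicas lost

    def fail(self, node):
        self.up.discard(node)

c = Cluster(["apic1", "apic2", "apic3"], n_shards=3)
print(c.active_for(0))  # apic1 holds the active copy of shard 0
c.fail("apic1")
print(c.active_for(0))  # backup on apic2 is promoted to active
```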
APIC Design Considerations
Additional APIC nodes increase the system scale (today up to 5 nodes are supported) but do not add more redundancy.
APIC allows only read-only access to the database when a minority of nodes remains active (standard database quorum).
There is a maximum supported distance of 800 km between APIC (database) nodes. Splitting the cluster so that a single site holds the majority of nodes is NOT RECOMMENDED: failure of that site may cause irreparable loss of data for some shards and inconsistent behaviour for others.
ACI Multi-Pod Solution APIC Cluster Deployment Considerations
The APIC cluster is stretched across multiple Pods, providing central management for all the Pods (VTEP addresses, VNIDs, class-ids, GIPo, etc.) and centralized policy definition.
It is recommended not to connect more than two APIC nodes per Pod (because three replicas are created per shard).
The first APIC node connects to the seed Pod and drives auto-provisioning for all the remote Pods; Pods can be auto-provisioned and managed even without a locally connected APIC node. 28
ACI Multi-Pod Solution Auto-Provisioning of Pods
1. APIC Node 1 is connected to a leaf node in seed Pod 1.
2. Discovery and provisioning of all the devices in the local seed Pod.
3. Provisioning of the interfaces on the spines facing the IPN, and of the EVPN control plane configuration.
4. Spine 1 in Pod 2 connects to the IPN and generates DHCP requests.
5. The DHCP requests are relayed by the IPN devices back to the APIC in Pod 1.
6. The DHCP response reaches Spine 1, allowing its full provisioning.
7. Discovery and provisioning of all the devices in the local Pod (Pod 2).
8. APIC Node 2 is connected to a leaf node in Pod 2.
9. APIC Node 2 joins the cluster.
10. Other Pods are discovered following the same procedure. 29
ACI Multi-Pod Solution IPN Control Plane
Separate IP address pools for VTEPs are assigned by APIC to each Pod. Summary routes for those pools are advertised toward the IPN via OSPF, and the spine nodes redistribute the other Pods' summary routes into the local IS-IS process; this is needed for local VTEPs to communicate with remote VTEPs. For example, the IPN global VRF carries 10.0.0.0/16 with the Pod 1 spines as next hops and 10.1.0.0/16 with the Pod 2 spines as next hops; a Pod 1 leaf's underlay VRF then learns 10.1.0.0/16 via the local spines. 30
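The summary-route exchange above can be sketched with Python's `ipaddress` module. The pool values mirror the example prefixes on the slide; the function is an illustration of the redistribution outcome, not a real routing implementation.

```python
# Sketch of the VTEP-pool route exchange: each Pod advertises one OSPF
# summary for its pool, and spines redistribute the *other* Pods' summaries
# into the local IS-IS process. Illustrative only.
import ipaddress

pod_vtep_pools = {
    "pod1": ipaddress.ip_network("10.0.0.0/16"),
    "pod2": ipaddress.ip_network("10.1.0.0/16"),
}

def isis_routes_for(local_pod: str) -> list:
    """Remote-Pod summaries a spine injects into the local IS-IS process."""
    return [str(net) for pod, net in pod_vtep_pools.items() if pod != local_pod]

# A Pod 2 VTEP is reachable from Pod 1 because it falls inside Pod 2's summary
vtep = ipaddress.ip_address("10.1.37.5")
print(isis_routes_for("pod1"))          # ['10.1.0.0/16']
print(vtep in pod_vtep_pools["pod2"])   # True
```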
ACI Fabric Integrated Overlay Decoupled Identity, Location and Policy The ACI Fabric decouples the tenant endpoint address (its identifier) from the location of that endpoint, which is defined by its locator, or VTEP address. Forwarding within the fabric is between VTEPs (ACI VXLAN tunnel endpoints) and leverages an extended VXLAN header format referred to as the ACI VXLAN policy header. The mapping of the internal tenant MAC or IP address to its location is performed by the VTEP using a distributed mapping database. 31
Host Routing - Inside Inline Hardware Mapping DB - 1,000,000+ Hosts The forwarding table on a leaf switch is divided between local (directly attached) and global entries: the Local Station Table contains the addresses of all hosts attached directly to that leaf, while the Global Station Table is a local cache of fabric endpoints (a cached portion of the full global table). The Proxy Station Table on the spines contains the addresses of all hosts attached to the fabric. If an endpoint is not found in the local cache, the packet is forwarded to the default forwarding table in the spine switches (1,000,000+ entries in the spine forwarding table). 32
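The three-stage lookup above can be sketched as a simple chain of dictionaries. The table contents are hypothetical examples, not real NX-OS data structures.

```python
# Sketch of the leaf lookup chain: local station table first, then the
# leaf's cached global table, then punt to the spine proxy's full mapping
# database. All entries below are hypothetical.
LOCAL = {"10.1.3.11": "port9"}                    # hosts attached to this leaf
GLOBAL_CACHE = {"10.1.3.35": "leaf3"}             # learned remote endpoints
SPINE_PROXY = {"10.1.3.35": "leaf3", "10.1.3.99": "leaf6"}  # all fabric hosts

def forward(dest_ip: str) -> str:
    if dest_ip in LOCAL:
        return f"local delivery on {LOCAL[dest_ip]}"
    if dest_ip in GLOBAL_CACHE:
        return f"VXLAN to {GLOBAL_CACHE[dest_ip]}"
    if dest_ip in SPINE_PROXY:                    # cache miss: spine resolves
        return f"VXLAN to spine proxy, resolved to {SPINE_PROXY[dest_ip]}"
    return "flood or drop, depending on BD unknown-unicast setting"

print(forward("10.1.3.11"))  # local delivery on port9
print(forward("10.1.3.99"))  # cache miss handled by the spine proxy
```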
ACI Multi-Pod Solution Inter-Pod MP-BGP EVPN Control Plane
MP-BGP EVPN is used to communicate endpoint (EP) and multicast group information between Pods. All remote-Pod entries are associated with a proxy VTEP next-hop address: for example, Pod 1's COOP database lists local endpoints 172.16.1.10 and 172.16.1.20 behind their leaves and remote endpoints 172.16.2.40 and 172.16.3.50 behind Proxy B, while Pod 2's database lists the Pod 1 endpoints behind Proxy A. A single BGP AS spans all the Pods. BGP EVPN runs on multiple spines in each Pod (a minimum of two for redundancy), and some spines may also provide route reflector functionality (one in each Pod). 34
ACI Multi-Pod Solution Overlay Data Plane
1. VM1 sends traffic destined to remote VM2.
2. VM2 is unknown on the local leaf, so traffic is encapsulated to the local Proxy A spine VTEP (adding the S_Class information).
3. The spine encapsulates the traffic to remote Proxy B.
4. The receiving spine encapsulates the traffic to the local leaf (Leaf 4).
5. The leaf learns the remote VM1 location and enforces policy.
6. If policy allows it, VM2 receives the packet. 35
ACI Multi-Pod Solution Overlay Data Plane (2)
7. VM2 sends traffic back to remote VM1.
8. The leaf enforces policy at ingress and, if allowed, encapsulates the traffic directly to the remote leaf node (having previously learned VM1's location).
9. The leaf in Pod 1 learns the remote VM2 location (no need to enforce policy again).
10. VM1 receives the packet.
11. From this point on, VM1-to-VM2 communication is encapsulated leaf to leaf (VTEP to VTEP). 36
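The learning sequence in the two slides above can be sketched as follows: the first packet in each direction transits a spine proxy, and once both leaves have learned the remote endpoint, traffic flows directly leaf to leaf. Purely illustrative, with hypothetical names.

```python
# Conversational-learning sketch: a leaf forwards to the spine proxy until
# it has cached the remote endpoint's leaf, then encapsulates directly.
class Leaf:
    def __init__(self, name):
        self.name, self.cache = name, {}

    def send(self, dst_vm):
        if dst_vm in self.cache:
            return f"{self.name}: encap direct to {self.cache[dst_vm]}"
        return f"{self.name}: dst unknown, encap to spine proxy"

    def learn(self, vm, leaf):
        self.cache[vm] = leaf        # populated from received data-plane traffic

leaf1, leaf4 = Leaf("leaf1"), Leaf("leaf4")
print(leaf1.send("VM2"))            # first packet goes via the proxy
leaf4.learn("VM1", "leaf1")         # leaf4 learns VM1 from the received packet
leaf1.learn("VM2", "leaf4")         # leaf1 learns VM2 from the return packet
print(leaf1.send("VM2"))            # now encapsulated leaf to leaf
```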
ACI Multi-Pod Solution Handling of Multi-Destination Traffic (BUM*)
1. VM1 generates a BUM frame.
2. The frame is associated with GIPo 1 and flooded intra-Pod via the corresponding tree.
3. The designated spine (Spine 2) sends the GIPo 1 traffic toward the IPN.
4. The IPN replicates the traffic to all the Pods that joined GIPo 1 (optimized delivery to Pods).
5. In the remote Pod, the frame is flooded along the tree associated with GIPo 1, and the receiving VTEP learns VM1's remote location.
6. VM2 receives the BUM frame.
*L2 broadcast, unknown unicast and multicast 37
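The GIPo-scoped replication described above can be sketched as a small flooding function. The Pod and leaf names are hypothetical; the point is that the IPN only replicates toward Pods that joined the group.

```python
# BUM replication sketch: flood along the intra-Pod tree, then the IPN
# (PIM BiDir) replicates one copy to each remote Pod that joined the GIPo.
pods = {
    "pod1": {"joined_gipo": {"gipo1"}, "leaves": ["leaf1", "leaf2"]},
    "pod2": {"joined_gipo": {"gipo1"}, "leaves": ["leaf4", "leaf6"]},
    "pod3": {"joined_gipo": set(),     "leaves": ["leaf7"]},  # BD not present
}

def flood_bum(src_pod: str, gipo: str) -> list:
    receivers = list(pods[src_pod]["leaves"])      # intra-Pod flooding tree
    for pod, info in pods.items():                 # IPN replication per GIPo
        if pod != src_pod and gipo in info["joined_gipo"]:
            receivers += info["leaves"]
    return receivers

print(flood_bum("pod1", "gipo1"))  # pod3 never sees the frame
```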
ACI Multi-Pod Solution Traditional WAN Connectivity A Pod does not need to have a dedicated WAN connection, and multiple WAN connections can be deployed across Pods. The traditional L3Out configuration applies, shared between tenants or dedicated per tenant (VRF-Lite). VTEPs always select the WAN connection in the local Pod, based on preferred metric, but inbound traffic may require hair-pinning across the IPN. It is recommended to deploy a clustering technology when stateful services are deployed. 39
ACI Integration with WAN at Scale Project GOLF Overview GOLF addresses both control plane and data plane scale: a VXLAN data plane and a BGP-EVPN control plane run between the ACI spines and the WAN routers, with OpFlex used to exchange configuration parameters (VRF names, BGP route-targets, etc.). Policy enforcement remains consistent on the ACI leaf nodes, for both the ingress and egress directions. GOLF router support (Q3CY16): Nexus 7000, ASR 9000 and ASR 1000 (not yet committed). 40
ACI Integration with WAN at Scale Supported Topologies: directly connected WAN routers, remote WAN routers reached across an IP network, and Multi-Pod combined with GOLF; in all cases MP-BGP EVPN runs between the spines and the WAN routers. 41
Multi-Pod and GOLF Intra-DC Deployment Control Plane Public BD subnets are advertised to the GOLF devices with the external spine-proxy TEP as next hop. WAN routes are received on the Pod spines as EVPN routes and translated to VPNv4/VPNv6 routes with the spine-proxy TEP as next hop. 42
Multi-Pod and GOLF Intra-DC Deployment Control Plane There is an option to consolidate the GOLF and IPN functions on the same devices*: they perform pure L3 routing for inter-Pod VXLAN traffic, and VXLAN encap/decap for WAN-to-DC traffic flows. *Not available at FCS 43
Multi-Pod and GOLF Multi-DC Deployment Control Plane Each Pod's spines advertise, via the MP-BGP EVPN control plane, host routes for the endpoints belonging to their public BD subnets; the GOLF devices then inject those host routes into the WAN or register them in the LISP database. 46
Multi-Pod and GOLF Multi-DC Deployment Data Plane Traffic from an external user is steered toward the GOLF devices (via routing or LISP). The GOLF devices VXLAN-encapsulate the traffic and send it to the spine-proxy VTEP address; the spine then encapsulates the traffic to the destination VTEP, which can apply policy. 47
Multi-Pod and GOLF Multi-DC Deployment Data Plane (2) In the return direction, the leaf applies policy and encapsulates the traffic directly to the local GOLF VTEP address; the GOLF devices de-encapsulate the traffic and route it to the WAN (or LISP-encapsulate it toward the remote router), and the traffic is received by the external user. 48
ACI Multi-Pod Solution Summary The ACI Multi-Pod solution represents the natural evolution of the Stretched Fabric design: it combines the advantages of a centralized management and policy domain with fault-domain isolation (each Pod runs independent control planes). Control and data plane integration with WAN edge devices (Nexus 7000/7700 and ASR 9000) completes and enriches the solution. The solution is planned to be available in Q3CY16 and will be released with a companion design guide. 49
Agenda ACI Introduction and Multi-Fabric Use Cases ACI Multi-Fabric Design Options ACI Stretched Fabric Overview ACI Multi-Pod Solution Deep Dive ACI Multi-Site Solutions Overview Conclusions 50
ACI Dual-Fabric Solution Overview (for more information on ACI Dual-Fabric deployment: BRKACI-3503) Independent ACI fabrics are interconnected via L2 and L3 DCI technologies. Each ACI fabric is independently managed by a separate APIC cluster (separate management and policy domains). The VXLAN data plane encapsulation is terminated at the edge of each fabric, with a VLAN hand-off to the DCI devices providing the Layer 2 extension service. This requires classifying inbound traffic to provide end-to-end policy extensibility. 51
ACI Multi-Site (Future) Overview Multiple ACI fabrics connected via an IP network form separate availability zones with maximum isolation: separate APIC clusters, separate management and policy domains, and separate fabric control planes. End-to-end policy enforcement is achieved through policy collaboration. The solution supports multiple sites and is not bound by distance. 52
ACI Multi-Site Reachability Host-level reachability is advertised between fabrics via BGP. The transit network is IP-based, and host routes do not need to be advertised into it. The policy context is carried with packets as they traverse the transit IP network, and forwarding between multiple fabrics is allowed (not limited to two sites).
ACI Multi-Site Policy Collaboration EPG policy is exported by a source site to the desired peer target-site fabrics: Fabric A advertises which of its endpoints it allows other sites to see. Target-site fabrics selectively import EPG policy from the desired source sites: Fabric B controls what it wants to allow its endpoints to see in other sites. For example, Site A may export its Web, App and DB EPGs to Fabric B while importing only Web and App from it. Policy export between multiple fabrics is allowed (not limited to two sites).
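The selective export/import model above reduces to a set intersection: a site only sees the peer EPGs it chose to import from the set the peer chose to export. The site and EPG names below are illustrative.

```python
# Sketch of selective EPG policy exchange between Multi-Site fabrics:
# visibility = (what the remote site exports) AND (what the local site imports).
exports = {
    "siteA": {"Web", "App", "DB"},   # Fabric A offers these EPGs to peers
    "siteB": {"Web", "App"},         # Fabric B withholds its DB tier
}
imports = {
    "siteA": {"Web", "App"},         # what A wants to see from peers
    "siteB": {"Web", "App", "DB"},   # what B wants to see from peers
}

def visible_epgs(local: str, remote: str) -> set:
    """EPGs of `remote` that `local` can reference in cross-site policy."""
    return exports[remote] & imports[local]

print(sorted(visible_epgs("siteA", "siteB")))  # ['App', 'Web']
print(sorted(visible_epgs("siteB", "siteA")))  # ['App', 'DB', 'Web']
```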
Scope of Policy Policy is applied at the provider side of the contract (always at the fabric where the provider endpoint is connected). This scopes changes: there is no need to propagate all policies to all fabrics, and a different policy can be applied based on the source EPG (that is, based on which fabric the consumer resides in).
Agenda ACI Introduction and Multi-Fabric Use Cases ACI Multi-Fabric Design Options ACI Stretched Fabric Overview ACI Multi-Pod Solution Deep Dive ACI Multi-Site Solutions Overview Conclusions 56
Conclusions Cisco ACI offers different multi-fabric options that can be deployed today, with a solid roadmap to evolve those options in the short and mid term: Multi-Pod represents the natural evolution of the existing Stretched Fabric design, and Multi-Site will replace the Dual-Fabric approach. Cisco will offer a smooth and gradual migration path to drive the adoption of these new solutions. 57
Where to Go for More Information ACI Stretched Fabric White Paper http://www.cisco.com/c/en/us/td/docs/switches/datacenter/aci/apic/sw/kb/b_kbaci-stretched-fabric.html#concept_524263c54d8749f2ad248faeba7dad78 ACI Dual Fabric Design Guide Coming soon! ACI Dual Fabric Live Demos Active/Active ASA Cluster Integration https://youtu.be/qn5ki5sviea vcenter vsphere 6.0 Integration http://videosharing.cisco.com/p.jsp?i=14394 58
Thank you 59