Managing the Management Switches
Erik Ruiter, SURFsara
Cumulus Meetup, Amsterdam, 2017
Outline
1. Old vs. new situation
2. Technologies used (Ansible / Cumulus)
3. Ansible examples
4. Results / what's next?
5. Short demo (if time permits)
Background (high-level design of the SURFsara network)
SURFsara provides many services, for instance:
- Supercomputing (Cartesius)
- HPC cloud
- Hadoop
- Grid computing / storage
All services are operated as clusters and are accessible to personnel through a management network, which is used for:
- Steppingstone LAN
- DRAC / IPMI access
- Backup traffic
- Server monitoring
Routing is done by a firewall.
Previous management network
- Left-over switches from past 1G-to-10G upgrades
- Unsupported switches, up to 10 years old
- Hand-maintained, vendor-proprietary CLIs:
  Arista, Brocade (and also Foundry), Cisco (7 different models), Dell (8 different models), Juniper, Nortel (unmanaged), Supermicro (unmanaged)
Desired management network
Physical requirements:
- Low(er)-power switches
- Redundant, swappable power supplies and fans
- Back-to-front airflow
Operational requirements:
- Standard CLI for all switches
- Automated configuration management / ZTP
- Mostly layer-2 requirements
[Diagram: two fiber core management switches in an MLAG/VC pair and two UTP core management switches, interconnected with 20 Gbps LACP over 10GBASE-SR MM fiber; HUB A and HUB B uplinked with 2 Gbps LACP over 1G-BASE-T; EoR and ToR switches and EoR console servers on 1G-BASE-T, with RJ45/RS232 serial console connections; management VLAN 800 (SURFSARA-NW-MGMT).]
Resulting management network
Tender for 78 switches (February to April 2016), won by Dell:
- Core switches: Dell S4048-ON
- Top-of-rack and end-of-row switches: Dell S3048-ON
- Console servers for OOB support: Opengear CM7148
Design:
- Core switches configured as an MLAG pair
- EoR switches connected to the cores using LACP (10 Gbps optical)
- ToR switches connected to EoR using a single uplink (1 Gbps UTP)
- Bare-metal ("white label") switches using the ONIE boot loader
- All switches running Cumulus Linux as the networking OS
Bare-Metal ("White Label") Switches
Decoupling of the network operating system from the hardware, similar to the rise of Linux and Windows on the IBM-compatible PC. Driven by cheap top-of-rack switches in big datacentres (e.g. Facebook Wedge / Backpack).
- Bootloader: ONIE
- Hardware: Dell, Edge-core (Accton), Quanta, Penguin, Mellanox, ...
- Software: Cumulus Linux, Big Switch, PicOS (Pica8), Pluribus, Dell OS10, Microsoft SONiC, etc.
- Network processors: Broadcom (Apollo2, Firebolt3, Helix4, Tomahawk, Trident/+/2/2+, Triumph2), Mellanox Spectrum, Cavium
Freedom of choice!
Cumulus Linux
- Linux on a switch: no more vendor-proprietary CLI
- Install your own software when desired
- Manageable using many configuration management systems
- Uses the switchd driver for communicating with the Broadcom ASIC
- SURFsara had already gained experience with Cumulus in previous projects
Port configuration: Cumulus vs Cisco
NOTE: Cumulus Linux 3.2 introduces a CLI (NCLU).
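To illustrate the difference, a minimal sketch (interface names, VLAN number and MTU are examples, not the configurations shown on the original slide): on Cumulus Linux a port is a stanza in /etc/network/interfaces, managed by ifupdown2, versus an IOS-style CLI on Cisco:

```
# Cumulus Linux: /etc/network/interfaces (ifupdown2)
auto swp1
iface swp1
    bridge-access 800
    mtu 9216
```

```
! Cisco IOS equivalent
interface GigabitEthernet0/1
 switchport mode access
 switchport access vlan 800
 mtu 9216
```

Because the Cumulus side is a plain text file, it can be rendered from a template and pushed by a configuration management tool.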
Network management approach
Original plan: a network controller. We evaluated several network controllers (e.g. NCS), but they were not what we hoped for:
- No or limited northbound authentication
- Poor vendor abstraction
- We only require a simple northbound interface: no device state needed, just pull status and push config
- No intelligent decisions required (e.g. traffic engineering)
- Network controllers can be difficult to learn and manage
- Single point of failure, or a complex redundant installation
Still, times can change; we keep track of these products to monitor improvements.
Managing the management switches
Our approach: configuration management with Ansible
- Built in Python -> flexible / extendible
- No agent required on the switch/server
- Support for multiple vendors (including Juniper, Cisco and Arista)
- Variables in YAML, templating with Jinja
- Structure: Playbooks -> Plays -> Roles -> Tasks
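The hierarchy above can be sketched as a minimal playbook (the group name is illustrative; the role names appear later in this deck):

```yaml
# mgmt/provisioning.yml (illustrative sketch)
- name: Provision management switches      # one play
  hosts: mgmt-switches                     # group from the inventory
  become: true
  roles:                                   # each role bundles its tasks
    - cl-users
    - cl-common
    - cl-snmp
```

Each role carries its own tasks, templates and default variables, so switches are configured by composing roles rather than by typing CLI commands.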
Zero Touch Provisioning
Provisions a switch without any user interaction:
1. Switch is racked, connected to the network management VLAN, and powered on
2. DHCP server provides an IP address for the management interface
3. ONIE boot loader downloads the required firmware from an HTTP server (using a URL from a DHCP option)
4. ZTP script removes the default login credentials, creates a NOC user, and adds authorized_keys
5. From here on, Ansible takes over switch configuration using playbooks and predefined variables
Flow: physical racking -> DHCP -> ONIE installation -> configuration using Ansible
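A sketch of the DHCP side of steps 2 and 3 with ISC dhcpd (addresses and URLs are hypothetical; this assumes the common conventions that ONIE reads the installer URL from the default-url option and that Cumulus ZTP uses a custom option such as 239 for the provisioning script):

```
# /etc/dhcp/dhcpd.conf (illustrative sketch)
option cumulus-provision-url code 239 = text;

subnet 10.0.8.0 netmask 255.255.255.0 {
    range 10.0.8.100 10.0.8.200;
    # ONIE fetches the NOS installer from this URL
    option default-url "http://10.0.8.1/onie-installer-cumulus-linux";
    # Cumulus ZTP fetches and runs this script after installation
    option cumulus-provision-url "http://10.0.8.1/ztp.sh";
}
```

The referenced ztp.sh would then do step 4: lock the default account, create the NOC user and install its authorized_keys, after which Ansible can reach the switch over SSH.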
Implemented roles
- cl-users: creates users and adds SSH keys
- cl-common: sets common settings: DNS resolvers, NTP server, timezone and hostname
- cl-license: sets and activates the Cumulus license
- cl-apt: sets apt repositories and the location of the apt proxy
- cl-ldap_auth: sets up LDAP authentication for non-local users (not in use at the moment)
- cl-snmp: sets up the SNMP configuration and starts the daemon
- cl-rsyslog: sets up syslogging and starts the daemon
- cl-interface: sets up switch interfaces (VLAN-aware bridge, VLAN tagging, LACP, MLAG, routed interfaces)
Cumulus also provides similar roles in their Ansible Galaxy repository.
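A sketch of what the cl-users tasks might look like (the noc_users variable and key file layout are illustrative, not the actual SURFsara role), using Ansible's standard user and authorized_key modules:

```yaml
# roles/cl-users/tasks/main.yml (illustrative sketch)
- name: Create NOC users
  user:
    name: "{{ item.name }}"
    groups: sudo
    append: true
    shell: /bin/bash
  with_items: "{{ noc_users }}"

- name: Install SSH public keys
  authorized_key:
    user: "{{ item.name }}"
    key: "{{ lookup('file', 'keys/' + item.name + '.pub') }}"
  with_items: "{{ noc_users }}"
```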
Inventory and variables
Inventory:
- Contains host and group information for Ansible-managed servers/switches
- SURFsara uses a dynamic inventory, backed by our in-house developed inventory database (CMT)
- This avoids having to keep a separate administration of Ansible hosts
Group variables file:
- A single file containing all variables that are the same for all switches in an Ansible group: NTP settings, DNS settings, SNMP, syslog settings, etc.
Host variables file:
- A file per switch, containing the variables that are unique per switch: management IP address, port settings (VLAN tagging / MTU / port description), MLAG / LACP settings
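A hedged sketch of how such variable files might look (all names and values are hypothetical examples, not SURFsara's actual data):

```yaml
# group_vars/mgmt-switches.yml (illustrative)
ntp_servers:
  - ntp1.example.nl
dns_resolvers:
  - 192.0.2.53
syslog_server: syslog.example.nl
---
# host_vars/sw-lab-h04-1.yml (illustrative)
mgmt_ip: 192.0.2.41/24
ports:
  - name: swp1
    vlan: 800
    mtu: 9216
    description: "DRAC rack 04"
```

Roles then render templates from these variables, so a change in one group variable propagates to every switch in the group on the next playbook run.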
Performing changes on the network
Scenario: Icinga has a new IP address, so the IP address of the SNMP querier needs to be changed on all switches:
1. Adjust the global variables file
2. Commit in Git
3. Pull the new config
4. Execute the Ansible playbook:
# ansible-playbook mgmt/provisioning.yml --tags snmp
Measured execution time: 2 minutes for 70 switches. (How much time would this take for 70 individually managed switches from 7 different vendors?)
Example role: SNMP (in too much detail)
- Tasks (YAML)
- Templates (Jinja)
- Variables (YAML)
- Result
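A condensed sketch of how such a role fits together (file names, variable names and the snmpd.conf directives are illustrative, not the exact SURFsara role shown on the slide):

```yaml
# roles/cl-snmp/tasks/main.yml (illustrative sketch)
- name: Render snmpd configuration from template and variables
  template:
    src: snmpd.conf.j2
    dest: /etc/snmp/snmpd.conf
  notify: restart snmpd

- name: Ensure snmpd is running and enabled
  service:
    name: snmpd
    state: started
    enabled: yes
---
# roles/cl-snmp/handlers/main.yml
- name: restart snmpd
  service:
    name: snmpd
    state: restarted
```

```jinja
{# roles/cl-snmp/templates/snmpd.conf.j2 (illustrative) #}
agentAddress udp:161
rocommunity {{ snmp_community }} {{ snmp_querier_ip }}
```

With snmp_querier_ip defined in group variables, the Icinga address change from the previous slide is a one-line edit followed by a tagged playbook run.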
Results / lessons learnt
Positives:
- Freedom to choose our own OS
- Standardized network (a big improvement over a pile of old switches)
- Configuration management using Ansible is powerful
- Cheaper switches, built for the datacenter environment
Keep in mind:
- Network engineers require some additional skills (Linux / Cumulus / Ansible / Git)
- A different way of working: pushing configs instead of configuring switches
- Configuration management using Ansible is powerful (again!)
- A support contract is important: who is responsible, Dell or Cumulus?
What's next
Short term:
- Fewer steps for making changes (no more manual Git commands)
- Expose Ansible through an API (poor man's Ansible Tower / Semaphore)
- Implement additional roles for routing (Quagga) and ACLs (iptables)
Longer term (wish list):
- Build a GUI (with authorization) for delegating small network changes to end users, e.g. changing the VLAN tag on a single port
- Explore using Ansible on production routers and switches (mostly Juniper equipment), possibly via NAPALM
- Standardize configuration abstraction using YAML, or possibly OpenConfig, for multiple vendors
- Add automated provisioning for Cacti and Icinga
Erik Ruiter Erik.Ruiter@surfsara.nl www.surfsara.nl
Network topology
[Lab diagram: sw-lab-c04-1 and sw-lab-c04-2 form a 20 Gbps MLAG pair; sw-lab-h04-1 and sw-lab-h04-2 are dual-homed to the pair via LACP.]
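The lab MLAG pair might be configured on Cumulus Linux roughly as follows (a sketch using ifupdown2 and clagd; interface names, addresses and the system MAC are illustrative, not the lab's actual configuration):

```
# /etc/network/interfaces on sw-lab-c04-1 (illustrative sketch)
auto peerlink
iface peerlink
    bond-slaves swp49 swp50

auto peerlink.4094
iface peerlink.4094
    address 169.254.1.1/30
    clagd-enable yes
    clagd-peer-ip 169.254.1.2
    clagd-sys-mac 44:38:39:ff:00:01

# Dual-homed LACP bond towards sw-lab-h04-1
auto bond1
iface bond1
    bond-slaves swp1
    clag-id 1
```

sw-lab-c04-2 would mirror this with the peer addresses swapped and the same clag-id, so the downstream switch sees one logical LACP partner.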