Terabit Networking with JASMIN
Jonathan Churchill, JASMIN Infrastructure Manager
Research Infrastructure Group, Scientific Computing Department
STFC Rutherford Appleton Laboratory
Terabit Networking with JASMIN
- What is JASMIN? Why is it needed?
- 3-year growth pains
- Network (re)design #3: criteria
- Network design issues
- ECMP CLOS architecture: advantages and disadvantages
- VXLAN
- JASMIN future expansion
JASMIN's Purpose
- CEDA data storage & services: curated data archive; archive management services; archive access services (HTTP, FTP, helpdesk, ...)
- Data-intensive scientific computing: global/regional datasets & models at high spatial and temporal resolution
- Private cloud: flexible access to high-volume & complex data for the climate & earth observation communities
- Online workspaces: services for sharing & collaboration
JASMIN is a world-leading, unique hybrid of:
- 16PB high-performance storage (~250GByte/s)
- High-performance computing (~4,000 cores)
- Non-blocking networking (>3Tbit/s) and Optical Private Network WANs
- Coupled with cloud hosting capabilities
To address one of NERC's most strategically important challenges: the improvement of predictive environmental science. Prof. Duncan Wingham, NERC Chief Exec.
Panasas Storage
- Parallel file system (cf. Lustre, GPFS, pNFS etc.); single namespace
- 140GB/s benchmarked (95 shelves of PAS14)
- Access via PanFS client / NFS / CIFS; POSIX filesystem out of the box; mounted on physical machines and VMs
- 103 shelves PAS11 + 101 shelves PAS14 + 40 shelves PAS16; each shelf connected at 10Gb (20Gb PAS14); 2,684 blades
- JASMIN: largest single Panasas realm in the world, with one management console
- TCO: big capital, small recurrent, but JASMIN2 cost/TB < GPFS/Lustre offerings
Three-year growth pains
- 172.16.X.0/21 = ~2,000 IPs
- 130.246.X.0/21
- Flat, overlaid L2
- 160 -> 240 ports @ 10Gb
Overview
- 12.5M, 38 racks, 850 Amps, 25 tonnes, 3 Terabit/s bandwidth
- A network: 1,100 ports @ 10GbE
- Panasas storage: 20PBytes, 15.1PB usable (385x 10Gb)
- NetApp + Dell: 1,010TB (VM VMDK images) (32x 10Gb)
- 4x VMware clusters: vjasmin 156 cores, 1.2TB (40x 10Gb); vcloud 208-1,648 cores, 1.5TB, 12.8TB (40x 10Gb)
- Lotus HPC cluster: 144-234 hosts, 2.2K-3.6K cores; RHEL6, Platform LSF, MPI; MPI network (10Gb low-latency Ethernet) (468x 10Gb)
- LightPaths @ 1 & 2Gb/s and 10Gb/s: Leeds, UKMO, Archer, (KNMI), CEMS-ISIC
Floor Plan
- Network distributed over ~30m x ~20m
- JASMIN 1, JASMIN 2, JASMIN 3, JASMIN 4/5 (2016-20), Science DMZ
Network Design Criteria
- Non-blocking (no network contention)
- Low latency (<20us MPI, preferably <10us); small latency spread
- Converged (IP storage, SAN storage, compute, MPI)
- 700-1,100 ports @ 10Gb; expansion to 1,600 ports and beyond without forklift upgrades
- Easy to manage and configure
- Cheap, later on: replaces JASMIN1's 240 ports in place
Cabling Costs
- 1,000 fibre connections = 400-600K; JASMIN1+2 need 700-1,100 10Gb connections across compute, storage and network racks
- Fully populated ToR (e.g. Force10 S4810P: 48x 10Gb SFP+, 4x 40Gb QSFP+): 6x S4810 ToR switches, 312 twinax, 48x active optic QSFP
- 1:1 contention ToR: 20x S4810 ToR switches, 80x active optic QSFP
- Lots of core 40Gb ports needed: MLAG to 72 ports?... Chassis switch? (expansion/cost)
CLOS L3 ECMP
- Spine: Mellanox SX1036; leaf: Mellanox SX1024; OSPF routing
- 1,104 x 10GbE ports deployed; chassis alternative was 768 ports max with no expansion, so 12 fixed spines instead
- Max 36 leaf switches: 1,728 ports @ 10GbE
- Non-blocking, zero contention (48x 10Gb downlinks = 12x 40Gb uplinks)
- Low latency (250ns L3 per switch/router); 7-10us MPI
- Cheap!.. (ish)
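The non-blocking and maximum-scale claims on the slide can be checked with a few lines of arithmetic. A minimal sketch, using the port counts given above (the constant and variable names are ours, not JASMIN's):

```python
# Check the 1:1 (non-blocking) contention claim for a leaf switch, and the
# maximum two-tier fabric size, from the port counts on the slide.
LEAF_DOWNLINKS_10G = 48   # SX1024: 48x 10Gb SFP+ ports facing hosts/storage
LEAF_UPLINKS_40G = 12     # SX1024: 12x 40Gb QSFP+ ports facing the spines
SPINE_PORTS_40G = 36      # SX1036: 36x 40GbE ports per spine

downlink_bw = LEAF_DOWNLINKS_10G * 10   # Gbit/s toward hosts
uplink_bw = LEAF_UPLINKS_40G * 40       # Gbit/s toward spines
assert downlink_bw == uplink_bw == 480  # equal bandwidth -> non-blocking

# With one uplink to each spine, 12 uplinks need 12 spines; each 36-port
# spine then accepts at most 36 leaves:
max_leaves = SPINE_PORTS_40G
max_ports = max_leaves * LEAF_DOWNLINKS_10G
print(max_ports)  # 1728, the "Max 36 leaf switches: 1,728 ports" figure
```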
Four routed ECMP hops. Fast.
ECMP CLOS L3 Advantages
- Massive scale, high performance
- Low latency with fixed switches; deterministic latency with a fixed spine and leaf
- Standards-based: supports multiple vendors
- Very small blast radius upon network failures; small isolated subnets
- Pay as you grow: start small and increment
https://www.nanog.org/sites/default/files/monday.general.hanks.multistage.10.pdf
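The load-spreading behind these advantages comes from per-flow hashing: each switch hashes a flow's 5-tuple and picks one of its equal-cost uplinks. A toy sketch of the idea (real switch ASICs use their own hardware hash functions; the function and spine names here are illustrative only):

```python
# Toy illustration of ECMP next-hop selection: hash the 5-tuple, take it
# modulo the number of equal-cost next hops. Per-flow (not per-packet),
# so packets of one flow stay on one path and arrive in order.
import hashlib

def ecmp_next_hop(src_ip, dst_ip, proto, src_port, dst_port, next_hops):
    """Deterministically map a flow to one of the equal-cost uplinks."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]

spines = [f"spine-{n}" for n in range(1, 13)]  # 12 spines, as in the fabric
# The same flow always hashes to the same spine:
a = ecmp_next_hop("10.0.1.5", "10.0.9.7", "tcp", 51000, 445, spines)
b = ecmp_next_hop("10.0.1.5", "10.0.9.7", "tcp", 51000, 445, spines)
assert a == b
```

Different flows land on different spines, which is what spreads load across the fabric; a single elephant flow, by the same logic, stays pinned to one uplink.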
ECMP CLOS L3 Issues
- Managing scale: numbers of IPs, subnets, VLANs, cables; monitoring
- Routed L3 network: requires dynamic OSPF routing (100s of routes per switch)
- No L2 between switches (VMware: SANs, vMotion); requires DHCP relay, VXLAN
- Complex traceroute seen by users
IP and Subnet Management
Subnet / IP Management
- 2x /21 Panasas storage
- 4x /24 internet connects
- 55x /26 fabric subnets
- 264x /30 inter-switch links
- 400 VMs & growing quickly; 288 servers; 2,244 storage blades
- 5,000 IPs & growing (another 1,000 this month!)
- ~260 VLAN IDs
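The address space behind a subnet plan like this can be tallied with Python's standard `ipaddress` module. A minimal sketch using the subnet counts from the slide (the resulting total is the plan's usable capacity, not the ~5,000 addresses actually in use):

```python
# Tally usable host addresses across the slide's subnet plan.
import ipaddress

plan = {21: 2, 24: 4, 26: 55, 30: 264}  # prefix length -> number of subnets

total = 0
for prefix, count in plan.items():
    # num_addresses counts the whole block; subtract network + broadcast
    usable = ipaddress.ip_network(f"0.0.0.0/{prefix}").num_addresses - 2
    total += usable * count

print(total)  # 9046 usable host addresses of capacity in the plan
```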
Monitoring / Visualisation
- Complex Cacti
- >30 fabric switches, >50 management switches
- 100s of links to monitor
- Nagios bloat
Need for VXLAN
- 24 hosts per L2 switch
- L2 subnets differ per switch
- No switch-to-switch vMotion
http://crankypotato.com/?p=598/
VXLAN
- ESXi multicast: PIM, IGMP snooping
- MTU: 50-byte encapsulation overhead, so VMs lose MTU 9000 for IP storage
- ESXi routing
- ESXi Auto Deploy DHCP
- No Panasas mounts
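The MTU point is simple arithmetic: VXLAN wraps each guest frame in a new Ethernet/IP/UDP/VXLAN stack, so a 9000-byte physical MTU no longer fits a 9000-byte guest frame. A quick check of the standard header sizes:

```python
# Why VXLAN's 50-byte overhead breaks MTU-9000 guests on a 9000-byte fabric.
OUTER_ETH = 14    # outer Ethernet header
OUTER_IP = 20     # outer IPv4 header
OUTER_UDP = 8     # outer UDP header
VXLAN_HDR = 8     # VXLAN header
OVERHEAD = OUTER_ETH + OUTER_IP + OUTER_UDP + VXLAN_HDR
assert OVERHEAD == 50  # the "50 Byte overhead" on the slide

physical_mtu = 9000
inner_mtu = physical_mtu - OVERHEAD
print(inner_mtu)  # 8950: a VM needing MTU 9000 for IP storage no longer fits
```

The fixes are either raising the physical MTU above 9050 or dropping the guest MTU, both of which ripple through every device on the path.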
JASMIN Future Expansion??
- 30-80PB on disk by 2020 (demand for 300PB)
- Requires >2-3K 10GbE ports (more likely 20Gb or 40Gb)
- Or 3-tier CLOS? 4x 100Gb fabric links??
- Standards-based: OSPF, ECMP, L3
- Needs automation / SDN to manage
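Why a third tier? With fixed-radix building blocks, a two-tier leaf/spine tops out once every spine port is used, while a three-tier fat tree scales with the cube of the radix. A sketch with uniform 36-port switches (an illustration of the standard folded-Clos formulae, not the JASMIN design, whose SX1024 leaves have an asymmetric 48+12 port split):

```python
# Port-count scaling of 2-tier vs 3-tier non-blocking Clos fabrics
# built from uniform k-port switches.
RADIX = 36  # ports per switch, e.g. a 36x 40GbE spine switch

# 2-tier leaf/spine: each leaf splits its radix half down, half up,
# and each spine port takes one leaf.
two_tier_hosts = (RADIX // 2) * RADIX   # 18 downlinks x 36 leaves

# 3-tier fat tree on k-port switches supports k^3 / 4 hosts.
three_tier_hosts = RADIX ** 3 // 4

print(two_tier_hosts, three_tier_hosts)  # 648 vs 11664
```

The jump from hundreds to tens of thousands of ports is what makes the third tier (and the automation to manage it) attractive once demand passes the 2-3K port mark.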
Questions? Contact: jonathan.churchill@stfc.ac.uk http://www.stfc.ac.uk/scd/ LinkedIn