An-Institut der Technischen Universität Berlin Network Troubleshooting with Mirror VNets Andreas Wundsam M. Amir Mehmood Olaf Maennel Anja Feldmann FutureNet III Workshop, 2010-12-10
Troubleshooting and evolving large scale networks is still an open problem
What's a large scale network? Distributed system Runs for a long time State distributed in many hosts + routers Under autonomous control Lots of users affected Distributed software + configuration
Admin challenges Routing optimization Re-convergence can take time Upgrade of faulty service component E.g., upgrade the super-node software in a telephony overlay How to avoid new bugs / ensure smooth upgrades?
Validation How do you test an update to the configuration or software of your system? simulation testbed
Simulation scalable simulation cheap accuracy? (wireless anyone!?)
Testbed may be more accurate but user behavior? testbed expensive to run on large scale? longtime?
Problems are nasty They like to surface in large scale, long term deployments with real user behavior/traffic. cannot be caught in small testbeds
Our approach leverage Virtual Networks
What is virtualization? Abstraction concept Hides the details of what's underneath Provides layer of indirection Enables Resource Sharing More than multiplexing! -- provides Isolation Transparence
Virtual Networks (VNets) combination of networ king and comput ing under unified management
Virtual Networks (VNets) combination of networ king and comput ing slices (slithers?) of routers links servers end hosts
Network Virtualization Provides coherent view and management of Virtual Networks (VNet) Allows to run multiple VNets On shared substrate Under separate administrative control In isolation
Mirror VNets Only output of Prod sent to ext entities Ext Host Ext Host Input mirrored to Prod and Mirror Prod Mirror Prod Mirror Prod Mirror Substrate Production VNet Mirror VNet
Typical use case Spawn Mirror Switch Prod
Typical use case Switch Prod Spawn Mirror Move over Users
Mirroring options Not all traffic has to be mirrored Not all mirrored traffic has to be transmitted Control plane traffic: <1% of data volume >90% bugs
Mirroring options Reducing Mirroring All traffic Selection (OF Rules, Layer 1-4) Sampling Reducing Transmission Full Transmission Packet Headers Re-synthesized based on stats
Mirror VNet case study Operator considers introduction of QoS because of customer complaints of lacking quality at peak load Uses a Mirror VNet to test solution, then switch over
Experiment setup BG Traffic Sender Node 1 Node 2 Node 3 BG Traffic Receiver VNET A Vnode 1A Vnode 2A Vnode 3A prod VNET B Vnode 1B Vnode 2B Vnode 3B shadow VOIP sender Vnet entry node Vnet transfer node Moni VOIP receiver /dev/null Traffic flow Vnet exit node
Metrics VOIP packet drops MoS: Mean Opinion Score using ITU-T E-Model non linear Quality Of Experience Metric values from 1 (worst) to 5 (best)
Experiment phases Phase 1: Things are running smoothly
Experiment phases Phase 2: Traffic spikes cause overload in VNode 2, quality degradation
Experiment phases Phase 3: Operator introduces Mirror VNet
Experiment phases Phase 4: Operator fixes problem in Mirror (introduces QoS)
Experiment phases Phase 5: Operator switches VNets
Experiment phases Phase 6: Operator dismantles VNET A
Experiment results (1) packet drops % 0 10 20 30 40 phase 1 phase 2 phase 3 phase 4 phase 5 phase 6 Receiver VNET A VNET B BG Traffic 0 5 10 15 20 25 BG traffic throughput [MBit/s] 0 500 1000 1500 Experiment time [s]
Experiment results (2) Phases 1 low traffic 2 high traffic 3 start Mirror VNET B 4 enable QOS 5 make VNET B prod. 6 dismantle A
Discussion Benefits Resilience against operator mistakes Real user traffic Rollback / undo for networks
Discussion Limitations Trade-off between overhead and prediction Elastic/closed loop traffic limits prediction quality
Future work Further case studies, larger networks Predict elasticity of traffic, adapt mirror predictions
Thank you.