SHARED MESH RESTORATION IN OPTICAL NETWORKS OFC 2004 Jean-Francois Labourdette, Ph.D. labourdette@ieee.org
Page: 2 Outline Introduction Network & Restoration Arch Evolution Mesh Routing & Provisioning Fast Shared Mesh Restoration Re-provisioning for Improved Reliability Planning & Dimensioning Re-optimization Maintenance Summary
Page: 3 Network Evolution Network architecture Mesh vs. Ring Multi-tier network Historical (DS0/DS1/DS3/STS-1/STS-48) Not a question of if but when Technology O/E/O switching All-optical switching in opaque network All-optical/transparent network e.g., (R)OADM Difficulty is often not just the technology but one of network management and operations
Restoration Architecture Evolution 80's DCS-based Mesh Restoration of DS3 Facilities Centralized (EMS/NMS) Path-based, failure-dependent, after fault detection and isolation Capacity-efficient but slow (~ minutes) 90's ADM-based Ring Protection of SONET/SDH Facilities Distributed Path-based (UPSR) or span-based (BLSR), pre-determined Fast ( 50 msec ) but capacity-inefficient 2000's OXC-based Mesh Protection/Restoration Distributed Path-based, failure-independent, pre-determined and preprovisioned Capacity-efficient AND fast (10 s 100's msec) Page: 4
Page: 5 Restoration Architecture Evolution (cont'd) Challenge: OXC-based mesh network architecture requires new network management, and new operations which can be more sophisticated than those in ring-based networks Question: can ring-based architectures be evolved instead? Trans-oceanic ring architecture (G.841) p-cycle (W. Grover et al.) Next-gen SONET/SDH equip. can handle multiple rings Ability to not close working or protection ring Ability to share protection channels across rings End-result is an evolution towards mesh networking
Page: 6 Mesh Operations Routing & Provisioning Shared mesh restoration based on pre-determined, preprovisioned restoration paths KEY requirement of mesh networking Complete synch between network and its inventory Cannot rely on manual entry of network inventory Provisioning Components Self-discovery of neighbour and port connectivity <-- the key feature Topology discovery Route computation Lightpath setup Debate on management vs. control plane approach
Page: 7 Automated Discovery of Port and Neighbouring Node Connectivity Hello (O1,P1) OXC O1 T P1 R OXC O2 T P10 R Hello (O2,P10) Hello Ack (O2,P10, O1,P2) Hello Ack (O1,P1, O2,P10) NDP at O1, P1 P2 T R misconfiguration T R P11 exchange reveals misconfiguration since sending P1 is followed by acknowledgment of P2 P3 T R T R P12 Hello (O1,P3) Hello (O2,P12) Hello Ack (O2,P12, O1,P3) NDP at O1, P3 Hello Ack (O1,P3, O2,P12) normal exchange
Comparing Control & Management Plane Approaches Page: 8 Management Plane Approach Control Plane Approach Neighbor discovery Manually configured Topology discovery Derived at NMS/EMS from neighbour & port adjacency information Route computation NMS/EMS computes the primary and backup paths from the topology and lightpath databases Lightpath setup NMS sets up lightpaths by configuring the NEs using TL-1 messages Neighbor discovery Done using neighbour discovery protocol (LMP) between NEs Topology discovery Done by NEs by exchanging neighbour & port adjacency info (OSPF/IS-IS) Route computation Done by NEs from topology database NEs may not have complete lightpath database Lightpath setup Done by NEs using a signaling protocol like RSVP-TE or CR-LDP
Page: 9 Dedicated Mesh (1+1) Protection U Primary for demand AB W A C S X V Bridging (both lightpaths are active) Secondary for demand AB Secondary for demand CD Cross-connections are established on secondary paths before failure Y Primary for demand CD Z T B D source transmits to both working and edge/nodediverse backup paths The destination decides which path to select, based on the quality of the received signals. Fastest restoration but wasteful because backup path is permanently active, although not used under normal conditions.
Shared Mesh Restoration Shared mesh restoration reserves the capacity for the backup path, and activate the backup only when necessary Enable the same capacity to serve multiple backup paths if the paths are not expected to be activated simultaneously. This condition is satisfied if their working paths are diverse A U Primary for demand AB V W B A Primary for demand AB is not affected U W V B S Optical line reserved for shared protection of demands AB and CD T S Optical line carries protection of demands CD T Cross-connections are not established Y Cross-connections are established after failure occurs Link failure Y C X Primary for demand CD Z D C X Z D BEFORE FAILURE AFTER FAILURE Page: 10
Page: 11 Diversity of Paths e a b c z e f a b c z f After computing the working path using a shortest path algorithm, and removing the edges to compute the backup. The residual graph becomes disconnected, and we find no corresponding backup path even though one exists. e a b c z f
Shared Risk Groups U second route A to B V SRG 4 W SRG 10 All optical lines in conduit share a same risk SRG 2 SRG 1 SRG 3 SRG 6 X SRG 5 SRG 8 SRG 9 SRG 7 A Y first route A to B Z B SRGs are used to represent sets of edges that may be affected by a common failure, such as fibers routed through a given conduit, etc. Non-trivial SRGs are the norm rather than the exception in telecom networks. There is no known optimum algorithm for arbitrary SRGs. Page: 12
Dedicated vs. Shared Mesh Case Study - Network Page: 13 100 Nodes 137 Links ~ 3000 OC-48 equivalent demands
Page: 14 Dedicated vs. Shared Mesh Case Study - Results US core network - 100 nodes, 137 links 50000 45000 Number of (bi-directional) Channels 40000 35000 30000 25000 20000 15000 10000 26829 19807 19913 26905 19981 9491 21199 11660 21227 Secondary Primary 5000 0 no protection 1+1 link failure protected 1+1 node failure protected Central mesh link failure restored Central mesh node failure restored
Page: 15 Restoration: Scope & Objectives Scope Guaranteed restoration after single failure events TR failure Amplifier failure Fiber/Cable cut Recovery via re-provisioning after multiple concurrent failures Objectives Fast response High degree of robustness Support for different service levels Dedicated (1+1) diverse protection for lowest restoration latency Shared diverse restoration for capacity efficiency Non pre-emptible non-restorable service Pre-emptible service
Page: 16 Mesh Networking and Ring-like Restoration Fast restoration Pre-computed and pre-provisioned backup paths Bit-oriented failure notification & restoration signalling (using SONET/SDH overhead bytes) Fast intra-system communication Robustness Bit-oriented failure notification & restoration signalling Per-channel independent signaling for each lightpath Connection after verification Optimized for the common case Fast and guaranteed restoration for single SRG failure Re-provisioning for multiple concurrent failures
Page: 17 Restoration Protocol Overview F G Shared Backup Path H 8 7 9 4 5 8 Primary Path B C D A 7 5 4 7 3 1 0 7 5 7 4 E 1 9 Drop port Drop port A shared backup path is soft-setup for each shared mesh restorable primary path Channels on the backup path may be shared with other backup paths Crossconnects are not setup during provisioning Path restoration triggers are sent to the end-nodes End-to-end signaling over the backup path activates it and completes the path restoration
Page: 18 Mesh Restoration Simulation Why simulation? Unlike ADM-based rings, restoration speed dependent on network loading and routing Predict restoration performance (e.g., for SLA compliance) Restored lightpaths (%) 100 90 80 70 60 50 40 30 20 10 0 0 20 40 60 80 100 Time since failure event (ms)
Page: 19 Restoration Simulation Methodology Use calibrated simulation tool to predict network performance, using identical traffic loading and routing as real network Equipment model System architecture racks, shelves, interface modules, control modules Internal communication architectures Processing queues Restoration protocol model Parameterize basic events, processes and queuing delays Measure restoration latency in a test network of real systems Tune parameters to calibrate simulation tool
Page: 20 Summary Mesh Restoration Shared mesh restoration is fast! - 10's to 100's of msec Very different from centralized mesh restoration in DCS networks (order of minutes) Distributed ring-like implementation Simulation critical for mesh network operations Similation tool with calibration, identical protocols and routing algorithms as in real network Integration with planning and operations systems Ring restoration will be slower with next-gen products Because same equipment will terminate 10's of rings Restoration processes will compete for resources Restoration times will be dependent on network loading
Page: 21 Network Reliability & Service Availability Speed of restoration is not that important for service availability A = MTBF/(MTBF + MTTR) Impact of MTTR ranging from msec to sec is insignificant Impact of MTTR of several hours if physical repair is needed (in case of double failure) is significant Ability to protect against double failure is key to high service availability
Page: 22 Service Unavailability Mostly due to double failures for protected services Service unavailability is roughly proportional to Length of path for unprotected service Product of length of working and protection path for protected service (higher unavailability of shared mesh over dedicated mesh due to impact of sharing) But decreases when network is split into independent restoration domains, so For limited geographical span, longer protection path on ring causes higher unavailability For larger geographical span, presence of many rings decreases chances of two failures happening in single ring, and ring-based architecture can achieve lower unavailability
Page: 23 Service Unavailability Ring vs. Mesh Unavailability with MTTR = 4 hrs 3 2.5 Unavailable Minutes/Yr 2 1.5 1 0.5 Mesh Ring 0 0 500 1000 1500 2000 2500 3000 3500 Lightpath Length (km)
Page: 24 Service Availability Re-provisioning But mesh networks can be made much more robust by splitting into multiple domains e.g., US/trans-Atlantic/European domains Using end-to-end re-provisioning in case of double failures EMS/NMS-based with fault detection and localization Take 10's of sec instead of hours with manual intervention Can be 100% succesful with enough spare capacity
Page: 25 Planning & Dimensioning Mesh networks are more robust than ring networks to traffic forecast uncertainties Many rings have low utilization because traffic did not materialize where and when predicted Will become less true with next-gen SONET/SDH equipment supporting many rings But ability to re-optimize mesh network Dimensioning mesh networks Can be modeled and analyzed mathematically (e.g., random graphs, Moore bound,...) Properties, approximations, asymptotic behavior (e.g., ratio of protected to working capacity)
TIME Page: 26 Why Re-optimization? Over-time, a network routing diverges from optimality On-line results in drift toward sub-optimal solutions, Service churn, capacity upgrade, and new additions to the network infrastructure, create opportunities for improvement Periodically re-optimize the routes to seize on these opportunities and espouse the network dynamics as it grows COST Cost of current solution Cost of best known solution (that is achievable using same input) Drift RE OPTIMIZE New capacity trunk and/or Service termination (churn)
Re-optimization: Objectives and challenges Page: 27 Objectives of re-optimization Increase routing efficiency and utilization, increase capacity availability to offer more services at no additional cost Decrease paths lengths (reduce latency), improve service quality Challenges Re-optimization is executed from the network operation center, using existing network infrastructure and given capacity Minimize undesirable impacts to services Minimize risks of major service disruptions in case of network failures, i.e. keep protection mechanism functional during re-optimization Ability to pause and resume the re-optimization, in order to cope with unexpected events
Page: 28 Types of Re-optimization Complete: re-optimize both primary and backup paths Most effective, but Requires service interruption to switch from old to new primary path Partial: re-optimize the backup path only No impact on primary, thus transparent to the client layer, Almost as effective as a complete re-optimization, Type can be decided on a per-lightpath basis Complete re-optimization for services with lower SLA Partial re-optimization for services with stringent requirements
How Re-optimization is Done? Network Operation Center Download network state Re optimization Tool Execute re route sequence 1. Select a lightpath 2. Remove it's backup path 3. Compute and provision new backup path 4. Iterate over 1 to 3 until no further improvement is observed Lightpath reroute sequence Re optimize routes Backup paths are re-routed one at a time, the corresponding lightpath is unprotected during re-routing The backup paths of some lightpaths may be reoptimized more than once to perform capacity swaps Intelligence, in selecting the proper lightpaths and proper sequence to achieve the most effective results Page: 29
Page: 30 Network Re-optimization A Real Network Example (45 cities across US) Normalized bandwidth 120 100 80 60 40 Actual solution from online routing. 31% Re-optimization 27% 20 0 Feb-02 Mar-02 Apr-02 May-02 Jul-02 Jun-02 Aug-02 Oct-02 Sep-02 Nov-02 Dec-02 Feb-03 Jan-03 Demand growth Actual used ports Port requirement of best known solution Mar-03
Page: 31 Network Re-optimization - Summary Experience clearly demonstrates benefits of re-optimization The nature of network operations leads to inefficient routing over time Up to 20% capacity saving: freed capacity can be reused for future services (capital avoidance) Backup latency is 30% shorter Procedure is safe Primary paths unaffected, no service interruption One demand at a time is briefly unprotected, while backup path is being re-provisioned Operates within actual network capacity, all operations are performed from network operation center Possible thanks to increased flexibility and efficiency offered by mesh optical networks, based on intelligent optical switches Not applicable to ring-based networks
Page: 32 Maintenance Maintenance operations need to be adapted to mesh networks In ring networks, no interference of maintenance activities between geographically distant rings In mesh networks, back-up capacity in one location can be shared by geographically distant lightpaths Operations can rely on accurate inventory of routes to schedule maintenance activities Ability to constrain routing of lightpaths and sharing Ability to reroute back-up path (as during re-optimization)
Page: 33 Summary Mesh - Overall long term strategic architecture evolution thanks to Capacity efficiency of shared path-based restoration Shared mesh restoration speed (10 s to 100 s msec) Improved reliability with re-provisioning Re-optimization Network control, management, and operations need to be adapted Consistent set of planning and modelling software tools, in addition to management system, critical for operating a mesh network efficiently All tools must be able to interact with each other to transfer data Sophisticated algorithms required for maximum benefits of a mesh network Set of tools along with mesh network intelligence provide higher efficiency, lower cost, higher reliability and service flexibility