Gamma Service Report - Final
18th September 2014

Broadband Service

Please read the following as it could have an impact on some of your customers.

Gamma Telecom Ltd, Kings House, Kings Road West, Newbury RG14 5BY
Tel: 0333 240 3000  Fax: 0333 240 3001  Email: marketing@gamma.co.uk

Reference: Gamma Ref-BB28142014
Start Date: 28th August 2014
Start Time: 02:09
Actual Clear Date: 28th August 2014
Actual Clear Time: 22:10

Summary Details

Loss of broadband connectivity with further impact on some voice services. Broadband connectivity to our Trafford (North) and Paul St (South) nodes was interrupted by planned maintenance on the BT network. Once the BT services were restored, the terminating devices for subscribers on Gamma's network could not recover the lost sessions. This prolonged the outage for the majority of our BB services. In addition to the failure of services dependent on the BB connectivity, the congestion caused some failed and poor-quality calls for our SIP trunking, Horizon, IB2, CPS & IDA services.

Timeline

02:09 - NOC alerts show loss of connectivity to our PST and TFD nodes.
02:30 - On-call transmission engineers fully engaged in diagnostics.
02:30-03:15 - Diagnostics indicate that the majority of BB connectivity has been lost and initial attempts at recovery are failing.
03:20 - Major Service Outage (MSO) process invoked.
03:20 - Gamma MSO bridge opened.
03:20-04:00 - Additional engineering engaged and working on resolution.
04:15 - First customer alert sent; regular updates followed throughout the day.
04:20 - BT engineering teams join the Gamma bridge.
04:20-05:00 - Gamma and BT engaged in cooperative diagnostics. At this point it started to become clear that there was an issue with the process for retaining subscribers on the network: there was a constant churn of subscribers joining and then dropping again after a 120-second window. BT indicated that planned works at a local exchange had commenced at the same time as the outage, at 02:10.
05:15-05:45 - Connectivity begins to return and approximately 25% of subscribers have successfully rejoined the network.
06:00 - Subscriber numbers fall rapidly again and most recovered sessions are dropped.
06:15-09:00 - BT performed various tests and remedial works on the network by rerouting traffic across both their core networks. BT and Gamma reviewing and tracking individual subscriber ingress/egress through the network. BT now assisting with a review of what happened at 05:45 that caused a partial restoration of sessions. BT can ping the Gamma tunnel termination devices, but they appear unreachable from elsewhere within the BT network. A BT Access Control List is removed at Manchester to see if that assists in resolving the apparent routing issues.
08:50-09:10 - Gamma commence a full restart of selected core equipment in the data path. This process is intrusive and only undertaken in exceptional circumstances. The restarts have no beneficial impact.
09:10 - BT begins a detailed review of the changes made at the local exchange that may have triggered the outage. Reversion to the conditions prior to the change has no impact.
09:15 - Equipment vendors fully engaged and reviewing detailed logs and traces of network activity.
09:30-11:30 - The focus of investigations is now a routing or IP conflict. As the individual sessions are built through a very large number of routes, extensive work is done to reduce the routing to a smaller, more manageable level (focused on our Trafford node) to allow effective diagnostics. This is complex and must be achieved without further impact to stable data services.
11:35 - After extensive analysis, equipment vendors report they can find no obvious issues with the core devices handling traffic.
11:51 - Majority of IPStream customers now stable on the Trafford node.
12:05 - BT revert their changes to the core network, re-introducing redundant paths.
12:19 - BT confirm they have fully reverted their network to standard topology.
12:25 - Begin re-establishing WBC links at Trafford. Using a route map, we start to allow our terminating equipment to respond to tunnel setups from a small BT subnet, restricting subscription attempts. This process is expanded slowly.
12:43 - Limited numbers of WBC customers begin to return to service.
12:58-15:00 - Subscribers continue to be introduced in a controlled fashion to avoid any losses of existing circuits.
15:00 - Gamma re-establishes the IPStream and WBC links at the North and South nodes (TFD & PST).
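The controlled re-introduction described at 12:25 (responding to tunnel setups only from a progressively widening set of BT subnets) can be sketched in Python. The prefixes and function names below are purely illustrative assumptions, not Gamma's actual configuration or address space:

```python
import ipaddress

# Subnets from which tunnel-setup requests are currently accepted.
# The seed prefix is illustrative, not real BT address space.
allowed_subnets = [ipaddress.ip_network("10.0.0.0/24")]

def accept_tunnel_setup(source_ip: str) -> bool:
    """Respond to a tunnel setup only if it originates from a
    subnet that has been explicitly re-admitted."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in allowed_subnets)

def widen_allowlist(subnet: str) -> None:
    """Expand coverage one subnet at a time, so that existing
    stable sessions are not disturbed."""
    allowed_subnets.append(ipaddress.ip_network(subnet))

# Initially only the small seed subnet is admitted.
assert accept_tunnel_setup("10.0.0.17") is True
assert accept_tunnel_setup("10.1.0.5") is False

# Expansion then proceeds slowly, subnet by subnet.
widen_allowlist("10.1.0.0/24")
assert accept_tunnel_setup("10.1.0.5") is True
```

The key property is that setup attempts from not-yet-admitted subnets are simply not answered, so the reconnect load on the terminating equipment grows only as fast as the allowlist.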
15:00-17:00 - To alleviate the load on Gamma termination equipment, BT apply an outbound Access Control List (ACL) towards Gamma.
17:40 - The BT ACL proves effective in allowing an increased rate of subscriber reconnects. Gamma introduce a similar process on their own equipment to re-introduce BT subnets in a more controlled fashion and return the network to fully routed status. This proves to be stable, allowing us to reach higher subscriber levels.
19:36 - All host links back up. Core systems stable.
19:55-21:45 - Continuing the process of bringing subscribers back online by permitting more subnets in the inbound ACL. Connectivity is managed to ensure that subscribers are fully balanced over the host links.
22:00 - All subnets now permitted. A small number of subscribers had not returned to service, but this was expected as CPE often require rebooting.
23:59 - Final balancing of subscribers across host links carried out; network and subscribers fully stable.

Corrective Action

After extensive network topology reroutes and detailed diagnostics, subscribers were returned to normal levels by restricting the rate at which connections were being re-established, to prevent overload of Gamma core network devices. This process is now built into the edge network devices and, in the unlikely event of a similar failure, will enable a more rapid restoration of subscribers.

The resulting congestion in the remainder of the Gamma network caused many reports of impact on voice services. This was addressed by rerouting traffic and increasing bandwidth as required on congested routes. These latter measures will remain in place until a full RCA is completed.

Gamma operates a fully resilient network and to date has successfully redirected traffic between nodes in the event of infrastructure failures with no impact on subscribers. Gamma's core termination equipment is rated to carry many more subscribers than are currently active, and consequently this will be one of the main areas of investigation.
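The corrective action of capping the rate at which sessions may re-establish is, in essence, token-bucket admission control. A minimal sketch follows; the rate and burst figures are assumed for illustration and are not Gamma's actual parameters:

```python
import time

class ReconnectLimiter:
    """Token bucket: admit at most `rate` session re-establishments
    per second, with short bursts of up to `burst`."""
    def __init__(self, rate: float, burst: int):
        self.rate = rate            # tokens replenished per second
        self.burst = burst
        self.tokens = float(burst)  # start with a full bucket
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                # defer this reconnect attempt

# Example: permit roughly 100 reconnects/sec with a burst of 10.
limiter = ReconnectLimiter(rate=100.0, burst=10)
admitted = sum(limiter.allow() for _ in range(1000))
# Only a small fraction of a sudden flood of 1000 attempts is
# admitted immediately; the rest retry later instead of
# overloading the core termination devices.
assert admitted < 1000
```

Deferred attempts are not lost: exchange-side equipment retries the setup, so the subscriber base climbs back at a sustainable rate, which matches the controlled recovery described in the timeline.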
Additional Comments

Work will also focus on how an external incident was able to impact all elements of our subscriber termination equipment. Extensive load tests will be made within our lab environment, in close cooperation with equipment vendors, to attempt to reproduce the failure modes experienced.

We will be working with BT to fully understand what part their planned maintenance works played in triggering such a large failure, and to ensure that we are adequately prepared should there be similar works. We will also be closely reviewing the handling of subscriber restoration rates within our network in the event of termination failures, and the larger-than-expected signalling levels experienced.
This work will be detailed and exhaustive, and we expect to have results within the next two weeks.

Update 12th September 2014

Through further analysis we have been able to better define the initial trigger of the failure and introduce additional mitigation. In order to explain the mitigation, a simplified view of the subscriber connection process is described in the following paragraph.

Equipment in the exchange will first authenticate users with Gamma RADIUS servers (RADIUS servers authenticate, authorise and account for each subscriber). Once authorised, a virtual tunnel will be opened up between the exchange and the Gamma terminating equipment (LNS servers). This tunnel offers a secure communication channel for subscribers within the exchange to fully connect to the Gamma core.

Contributory Factors

Whilst the initial trigger event (i.e. the reason subscriber sessions were dropped at 2am) for the outage was related to a change control on BT's network, the root cause of the prolonged outage is most likely an interworking issue between Gamma's and BT's networks. This issue, coupled with the method employed by Gamma to evenly distribute subscribers, appears to have caused excessive load on the associated LNS termination equipment following the planned maintenance. There was no absolute failure in BT's or Gamma's network that caused the outage. We are continuing to investigate why the above series of events had such a serious knock-on effect on our other LNS devices in geographically diverse locations.

Mitigation

We have reconfigured the method employed to distribute subscribers between the LNS termination devices, and are planning further load reduction by adjusting the rate of response to RADIUS authentication requests. In addition, and as described in the body of the original RFO above, we have a rapid and efficient method of restoring users in the unlikely event of a similar incident.
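The effect of reconfiguring subscriber distribution can be illustrated with a small model. Each (exchange, LNS) pair needs its own tunnel, so pinning every exchange to one LNS, rather than spreading its subscribers across all of them, cuts the tunnel count from exchanges x LNSes down to one per exchange. The exchange and LNS names and counts below are hypothetical, not Gamma's topology:

```python
# Hypothetical topology: 100 exchanges, 4 LNS termination devices.
exchanges = [f"exchange-{i}" for i in range(100)]
lns_pool = ["lns-a", "lns-b", "lns-c", "lns-d"]

# Policy 1: spread each exchange's subscribers across every LNS.
# Load is even, but every (exchange, LNS) pair holds a tunnel.
tunnels_spread = {(ex, lns) for ex in exchanges for lns in lns_pool}

# Policy 2: pin each exchange to one LNS via a stable hash.
# Load stays roughly even, with far fewer tunnels to re-establish
# after a mass disconnect.
def pick_lns(exchange: str) -> str:
    return lns_pool[sum(map(ord, exchange)) % len(lns_pool)]

tunnels_pinned = {(ex, pick_lns(ex)) for ex in exchanges}

print(len(tunnels_spread), len(tunnels_pinned))  # 400 100
```

Fewer tunnels means fewer simultaneous setup negotiations when the whole subscriber base reconnects at once, which is consistent with the excessive LNS load the report attributes to the even-distribution method.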
In the medium to long term, continuing upgrade and enhancement of our network equipment will ensure that we can continue to fully address future growth and changes within our suppliers' networks. A final RFO will be issued once we are satisfied that our mitigation steps and subsequent testing have proved effective.

Voice Service Impact

In addition to the obvious loss of voice services supported by broadband, we received reports of quality and connectivity issues impacting unrelated voice services. The initial diagnosis was that this was a result of congestion related to the broadband outage, and traffic was redistributed via alternative routes with good effect. However, this did not adequately explain the root cause. Subsequent investigations have revealed that it was related to an error/fault condition on one of our large IP interconnects, on third-party supplier fibre, that commenced at approximately 09:00. The fibre was not out of service (therefore no alarms) but was erroneously throttling bandwidth, which caused a variety of quality and connectivity problems. On this basis we are sure that the additional voice issues reported were unrelated to the BB outage.

Final Update 18th September 2014

After extensive testing in the Gamma labs and detailed consultation with our network vendor and equipment suppliers, the following mitigation steps have been taken and are now fully in service:

1. The distribution of subscribers to the Gamma termination devices has been modified to reduce the number of tunnels necessary to support the subscriber base.
2. The Gamma authentication servers (RADIUS) have been modified to limit the rate at which termination requests are processed.
3. An automated script is deployed to offer rapid restoration of subscribers via access control lists in the unlikely event that the above measures prove ineffective.

Contact: brian.mulligan@gamma.co.uk
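Mitigation step 3, an automated script restoring subscribers via access control lists, can be sketched as a loop that permits one subnet at a time while termination-device load stays under a threshold. Everything here (function names, the load probe, the threshold) is a hypothetical illustration, not the actual Gamma script:

```python
from collections import deque

def restore_subscribers(pending_subnets, lns_load, max_load=0.8):
    """Permit subnets one at a time, pausing whenever measured
    LNS load reaches max_load. `lns_load(permitted)` stands in
    for a real monitoring probe on the termination devices."""
    pending = deque(pending_subnets)
    permitted = []
    while pending:
        if lns_load(permitted) >= max_load:
            break                  # hold off; resume on the next run
        permitted.append(pending.popleft())
    return permitted

# Toy load model: each permitted subnet adds 10% LNS load.
result = restore_subscribers(
    [f"10.0.{i}.0/24" for i in range(10)],
    lns_load=lambda permitted: 0.1 * len(permitted),
)
print(result)  # first 8 subnets permitted, then the load cap holds
```

Run periodically, such a script widens the inbound ACL only as fast as the equipment can absorb, automating the manual subnet-by-subnet recovery performed during the incident.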