Gamma Service Incident Report Final 18/9/14


Gamma Telecom Ltd, Kings House, Kings Road West, Newbury RG14 5BY
Tel: 0333 240 3000 Fax: 0333 240 3001 Email: marketing@gamma.co.uk

Broadband Service

Please read the following, as it could have an impact on some of your customers.

Reference: Gamma Ref-BB28142014
Start Date: 28th August 2014
Start Time: 02:09
Actual Clear Date: 28th August 2014
Actual Clear Time: 22:10

Summary
Loss of broadband connectivity, with further impact on some voice services.

Details
Broadband connectivity to our Trafford (North) and Paul St (South) nodes was interrupted by planned maintenance on the BT network. Once the BT services were restored, the terminating devices for subscribers on Gamma's network could not recover the lost sessions, which prolonged the outage for the majority of our BB services. In addition to the failure of services dependent on BB connectivity, the resulting congestion caused some failed and poor-quality calls for our SIP trunking, Horizon, IB2, CPS & IDA services.

Timeline
02:09 - NOC alerts show loss of connectivity to our PST and TFD nodes.
02:30 - On-call transmission engineers fully engaged in diagnostics.
02:30-03:15 - Diagnostics indicate that the majority of BB connectivity has been lost and that initial attempts at recovery are failing.
03:20 - Major Service Outage process invoked.
03:20 - Gamma MSO bridge opened.
03:20-04:00 - Additional engineering engaged and working on resolution.
04:15 - First customer alert sent, with regular updates thereafter throughout the day.
04:20 - BT engineering teams join the Gamma bridge.
04:20-05:00 - Gamma and BT engaged in cooperative diagnostics. At this point it started to become clear that there was an issue with the process for subscribers being retained on the network: there was a constant churn of subscribers joining and then dropping again after a 120-second window. BT indicated that planned works at a local exchange had commenced at the same time as the outage, at 02:10.
05:15-05:45 - Connectivity begins to return; approximately 25% of subscribers have successfully rejoined the network.
06:00 - Subscriber numbers fall rapidly again and most recovered sessions are dropped.
06:15-09:00 - BT performs various tests and remedial works, rerouting traffic across both of its core networks. BT and Gamma review and track individual subscriber ingress/egress through the network. BT assists with a review of what happened at 05:45 to cause a partial restoration of sessions. BT can ping the Gamma tunnel termination devices, but they appear unreachable from elsewhere within the BT network. A BT Access Control List is removed at Manchester to see whether that resolves the apparent routing issues.
08:50-09:10 - Gamma commences a full restart of selected core equipment in the data path. This process is intrusive and only undertaken in exceptional circumstances. The restarts have no beneficial impact.
09:10 - BT begins a detailed review of the changes made at the local exchange that may have triggered the outage. Reverting to the conditions prior to the change has no impact.
09:15 - Equipment vendors fully engaged, reviewing detailed logs and traces of network activity.
09:30-11:30 - The focus of investigation is now a routing or IP conflict. As the individual sessions are built through a very large number of routes, extensive work is done to reduce the routing to a smaller, more manageable level (focused on our Trafford node) to allow effective diagnostics. This is complex and must be achieved without further impact to stable data services.
11:35 - After extensive analysis, equipment vendors report that they can find no obvious issues with the core devices handling traffic.
11:51 - The majority of IPStream customers are now stable on the Trafford node.
12:05 - BT reverts its changes to the core network, re-introducing redundant paths.
12:19 - BT confirms it has fully reverted its network to the standard topology.
12:25 - We begin re-establishing WBC links at Trafford. Using a route map, we start to allow our terminating equipment to respond to tunnel setups from a small BT subnet, restricting subscription attempts. This process is expanded slowly.
12:43 - A limited number of WBC customers begin to return to service.
12:58-15:00 - Subscribers continue to be introduced in a controlled fashion to avoid any loss of existing circuits.
15:00 - Gamma re-establishes the IPStream and WBC links at the North and South nodes (TFD & PST).
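The controlled reintroduction described at 12:25 onwards — permitting tunnel setups from a progressively larger set of BT subnets, holding back whenever the terminating equipment is under load — can be sketched as follows. This is a hypothetical illustration only: the subnet values, load threshold, and function names are invented for the example and are not Gamma's actual tooling.

```python
import ipaddress

# Hypothetical BT-side subnets from which tunnel setups originate.
# The real subnets and their ordering are not given in the report.
CANDIDATE_SUBNETS = [
    ipaddress.ip_network("198.51.100.0/26"),
    ipaddress.ip_network("198.51.100.64/26"),
    ipaddress.ip_network("198.51.100.128/25"),
]

def reintroduce(allowed, candidates, lns_load, max_load=0.8):
    """Permit one more subnet only while LNS load stays below the threshold."""
    if lns_load(allowed) >= max_load:
        return allowed  # hold: let existing sessions stabilise first
    remaining = [n for n in candidates if n not in allowed]
    if remaining:
        allowed = allowed + [remaining[0]]
    return allowed

def tunnel_setup_permitted(src_ip, allowed):
    """Edge filter: accept a tunnel setup only from a permitted subnet."""
    ip = ipaddress.ip_address(src_ip)
    return any(ip in net for net in allowed)

# Start with a single small subnet, as the report describes.
allowed = [CANDIDATE_SUBNETS[0]]
print(tunnel_setup_permitted("198.51.100.10", allowed))   # inside the first /26
print(tunnel_setup_permitted("198.51.100.200", allowed))  # not yet permitted
```

The key property is that expansion is gated on observed load, so a reconnect storm from a newly permitted subnet cannot cascade into another mass session drop.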

15:00-17:00 - To alleviate the load on Gamma termination equipment, BT applies an outbound Access Control List (ACL) towards Gamma.
17:40 - The BT ACL proves effective in allowing an increased rate of subscriber reconnects. Gamma introduces a similar process on its own equipment to bring in BT subnets in a more controlled fashion and returns the network to fully routed status. This proves to be stable, allowing us to reach higher subscriber levels.
19:36 - All host links back up. Core systems stable.
19:55-21:45 - We continue bringing subscribers back online by permitting more subnets in the inbound ACL. Connectivity is managed to ensure that subscribers are fully balanced over the host links.
22:00 - All subnets now permitted. A small number of subscribers had not returned to service, but this was expected, as CPE often require rebooting.
23:59 - Final balancing of subscribers across the host links carried out; network and subscribers fully stable.

Corrective Action
After extensive network topology reroutes and detailed diagnostics, subscribers were returned to normal levels by restricting the rate at which connections were re-established, preventing overload of Gamma's core network devices. This process is now built into the edge network devices and, in the unlikely event of a similar failure, will enable a more rapid restoration of subscribers.

The resulting congestion in the remainder of the Gamma network caused many reports of impact on voice services. This was addressed by rerouting traffic and increasing bandwidth as required on congested routes. These measures will remain in place until a full RCA is completed.

Gamma operates a fully resilient network and has to date successfully redirected traffic between nodes in the event of infrastructure failures with no impact on subscribers. Gamma's core termination equipment is rated to carry many more subscribers than are currently active, so this will be one of the main areas of investigation.
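The corrective action above — capping the rate at which sessions are re-established so the core devices are not overwhelmed — is in essence a rate limiter in front of session setup. A minimal token-bucket sketch follows; the report does not describe Gamma's implementation, so the rates and class are purely illustrative.

```python
import time

class TokenBucket:
    """Allow at most `rate` session setups per second, with bursts up to `burst`."""
    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self):
        # Refill tokens for the time elapsed since the last attempt,
        # capped at the burst size, then spend one token if available.
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # defer this reconnect attempt

# e.g. at most 50 new sessions per second, with bursts of up to 100
limiter = TokenBucket(rate=50, burst=100)
accepted = sum(1 for _ in range(1000) if limiter.allow())
```

Deferred attempts are simply retried later by the CPE, so the trade-off is a slower but stable restoration rather than a reconnect storm that drops sessions already recovered.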
Additional Comments
Work will also focus on how an external incident was able to impact all elements of our subscriber termination equipment. Extensive load tests will be carried out within our lab environment, in close cooperation with equipment vendors, to attempt to reproduce the failure modes experienced. We will be working with BT to fully understand what part their planned maintenance works played in triggering such a large failure, and to ensure that we are adequately prepared should there be similar works in future. We will also be closely reviewing the handling of subscriber restoration rates within our network in the event of termination failures, and the larger-than-expected signalling levels experienced.

This work will be detailed and exhaustive, and we expect to have results within the next two weeks.

Update 12th September 2014
Through further analysis we have been able to better define the initial trigger of the failure and to introduce additional mitigation. To explain the mitigation, a simplified view of the subscriber connection process follows. Equipment in the exchange first authenticates users against the Gamma RADIUS servers (the RADIUS servers authenticate, authorise and account for each subscriber). Once a subscriber is authorised, a virtual tunnel is opened between the exchange and the Gamma terminating equipment (the LNS servers). This tunnel provides a secure communication channel for subscribers within the exchange to connect fully to the Gamma core.

Contributory Factors
Whilst the initial trigger event (i.e. the reason subscriber sessions were dropped at around 02:00) was related to a change control on BT's network, the root cause of the prolonged outage is most likely an interworking issue between Gamma's and BT's networks. This issue, coupled with the method employed by Gamma to distribute subscribers evenly, appears to have caused excessive load on the associated LNS termination equipment following the planned maintenance. There was no absolute failure in either BT's or Gamma's network that caused the outage. We are continuing to investigate why the above series of events had such a serious knock-on effect on our other LNS devices in geographically diverse locations.

Mitigation
We have reconfigured the method employed to distribute subscribers between the LNS termination devices, and we are planning further load reduction by adjusting the rate of response to RADIUS authentication requests. In addition, as described in the body of the original RFO above, we have a rapid and efficient method of restoring users in the unlikely event of a similar incident.
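The subscriber connection process described above (RADIUS authentication, then a tunnel from the exchange to an LNS) can be outlined in simplified Python. Every class, name, and the least-loaded selection policy here is an invented stand-in for illustration; the report specifies only that RADIUS servers authenticate/authorise/account and that tunnels terminate on LNS devices.

```python
class Radius:
    """Toy stand-in for the RADIUS servers (authenticate, authorise, account)."""
    def __init__(self, users):
        self.users = users          # username -> password
        self.accounting = []        # accounting records started
    def authenticate(self, user, pw):
        return self.users.get(user) == pw
    def start_accounting(self, user):
        self.accounting.append(user)

class LNS:
    """Toy terminating device: tunnels keyed by exchange, subscribers per tunnel."""
    def __init__(self, name):
        self.name = name
        self.tunnels = {}           # exchange -> list of subscribers
    def attach(self, exchange, user):
        self.tunnels.setdefault(exchange, []).append(user)
        return (self.name, exchange)

def connect_subscriber(user, pw, exchange, radius, lns_pool):
    if not radius.authenticate(user, pw):
        return None                 # authentication failed: no session
    radius.start_accounting(user)
    # Distribution of subscribers across LNS devices is the mechanism the
    # report says was reconfigured; least-loaded is used here for illustration.
    lns = min(lns_pool, key=lambda l: sum(len(v) for v in l.tunnels.values()))
    return lns.attach(exchange, user)

radius = Radius({"alice@isp": "pw1"})
pool = [LNS("lns-trafford"), LNS("lns-paulst")]
session = connect_subscriber("alice@isp", "pw1", "exchange-A", radius, pool)
```

The distribution step is the point of interest in the contributory-factors analysis: if it concentrates tunnel setups after a mass drop, the selected LNS devices see excessive load exactly when every subscriber reconnects at once.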
In the medium to long term, the continuing upgrade and enhancement of our network equipment will ensure that we can continue to fully address future growth and changes within our suppliers' networks. A final RFO will be issued once we are satisfied that our mitigation steps and subsequent testing have proved effective.

Voice Service Impact
In addition to the obvious loss of voice services supported by broadband, we received reports of quality and connectivity issues impacting unrelated voice services. The initial diagnosis was that this was a result of congestion related to the broadband outage, and traffic was redistributed via alternative routes with good effect. However, this did not adequately explain the root cause. Subsequent investigations revealed that the issue was related to an error/fault condition, commencing at approximately 09:00, on one of our large IP interconnects from a third-party fibre supplier. The fibre was not out of service (and therefore raised no alarms) but was erroneously throttling bandwidth, which caused a variety of quality and connectivity problems. On this basis we are confident that the additional voice issues reported were unrelated to the BB outage.

Final Update 18th September 2014
After extensive testing in the Gamma labs and detailed consultation with our network vendor and equipment suppliers, the following mitigation steps have been taken and are now fully in service:

1. The distribution of subscribers to the Gamma termination devices has been modified to reduce the number of tunnels necessary to support the subscriber base.
2. The Gamma authentication servers (RADIUS) have been modified to limit the rate at which termination requests are processed.
3. An automated script has been deployed to offer rapid restoration of subscribers via access control lists, in the unlikely event that the above measures prove ineffective.

Contact: brian.mulligan@gamma.co.uk
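Mitigation step 3 above suggests a script that restores subscribers in stages by rewriting an access control list batch by batch, with a pause for sessions to settle between pushes. The report gives no detail of the actual script, so the following sketch is entirely hypothetical: the subnets, batch size, settle time, and ACL syntax are invented for illustration.

```python
import time

# Hypothetical ordered list of subscriber-facing subnets to restore.
SUBNETS = [
    "203.0.113.0/26", "203.0.113.64/26",
    "203.0.113.128/26", "203.0.113.192/26",
]

def acl_entries(permitted):
    """Render the inbound ACL: permit the given subnets, deny everything else."""
    lines = [f"permit ip {net} any" for net in permitted]
    lines.append("deny ip any any")
    return lines

def staged_restore(subnets, push, batch=2, settle=0.0):
    """Permit `batch` more subnets per step, letting sessions settle in between.

    `push` stands in for whatever applies the rendered ACL to the edge device.
    """
    permitted = []
    while len(permitted) < len(subnets):
        permitted = subnets[:len(permitted) + batch]
        push(acl_entries(permitted))   # apply the widened ACL
        time.sleep(settle)             # wait for the reconnect wave to subside
    return permitted

pushes = []
staged_restore(SUBNETS, pushes.append)
```

Automating this removes the manual per-subnet work described in the timeline (12:25 through 22:00), which is what makes a "more rapid restoration of subscribers" plausible in a repeat incident.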