Tuesday, June 12, 2007

Analysis of incident 11:25 to 11:45 11 June 2007

What happened?

We have now completed our investigation of what happened on 11 June and why it caused such a problem.

At 11:25 we experienced a loss of network reachability on our London to Manchester ring, which affected DSL customers. Although it was not immediately apparent, this also impacted our transit customers in Manchester, particularly those connected to Level3.

A layer 2 protocol problem had occurred affecting the switches in the ring.

Service on the London to Manchester ring was restored at around 11:30 by rebooting the Manchester switches, which restored service to affected DSL customers. Unfortunately, upon reloading, a further layer 2 protocol problem caused the automatic shutdown of one of the Gig interconnects with Level3 in Manchester.

This Level3 problem was identified at 11:40 and rectified. Service returned for affected Level3 transit customers at 11:45.

Root cause analysis

It seems that the problem was caused by the interaction of three separate factors:

  • Mis-configuration of a customer switch at IFL2
  • Engineering works being carried out in Telecity
  • Mis-configuration of an old port by Level3

These probably interacted as follows:

A layer 2 loop control protocol problem occurred between two isolated sides of the London to Manchester ring when a mis-configured customer switch at IFL2 sent out layer 2 loop control packets. After the reboot, a residual configuration on an old transit port at Level3 was brought back into service as part of the aborted London to Manchester work. This created a further loop between two Level3 ports, automatically shutting one of them down, which affected all customer VLANs mapped via that port.

Lessons to take to heart

Preventative

Better control of legacy cables and ports is required, both with suppliers and with customers. Customer ports also need strict layer 2 protocol controls at all times, with no exceptions by omission or special case.
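
To make this concrete, the sketch below shows one common way of applying both measures: error-disabling customer-facing ports that send loop control packets, and administratively shutting down legacy ports. It is an illustration only; the post does not name the switch platform, so Cisco IOS-style syntax is assumed and the port names are hypothetical.

    ! Illustrative sketch only: IOS-style syntax and port names assumed.
    !
    ! Customer-facing edge port: error-disable it if spanning-tree BPDUs
    ! (the "layer 2 loop control packets" described above) arrive on it.
    interface GigabitEthernet0/1
     description Customer handoff (hypothetical)
     spanning-tree portfast
     spanning-tree bpduguard enable
    !
    ! Legacy/unused port: administratively shut down so stale
    ! configuration cannot be brought back into service by accident.
    interface GigabitEthernet0/2
     description Decommissioned transit port (hypothetical)
     shutdown

Equivalent protections exist on other vendors' equipment; the point is that the controls are applied to every customer and legacy port, not selectively.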

Restorative

We estimate that problem analysis and the fix were delayed by about 10 minutes. When catastrophic problems occur, our priorities are:

  1. Diagnose and restore service as quickly as possible
  2. Triage customers and deal with fact-finding and result feedback
  3. Triage customers and ensure that higher support band customers have service restored first, with follow-ups
  4. Don’t deal with individual customer issues that cannot be quickly resolved or are out of the norm
  5. Don’t deal with unrelated issues
  6. Don’t give customers misleading information
  7. Work to ensure all customers are back fully enabled
  8. Analyse what went wrong
  9. Determine lessons to be learned
  10. Write the Reason for Outage report.

When catastrophic events happen, customers naturally want to know what the problem is. Our reception staff receive a high volume of calls in a short space of time.

The problem-solving team needs to focus on solving the problem and is kept isolated to avoid distractions.

The diagnostic team is guided by the problem-solving team on what information to gather from which customers in order to build up a picture of what is happening on the ground. This process needs to happen quickly and be very focused and directed. Anything that slows this process down is bad for all concerned.

To optimise the above, we will be implementing the following changes with immediate effect:

1. In the first instance, call-answering staff will take ‘focused yet detailed’ messages and email these to engineers. This ensures that engineers proactively manage fault resolution and are not distracted by inbound calls.

2. Diagnostic / information gatherers will seek to dispatch information requests quickly and retrieve the resulting feedback. If you need to consult a third party, please call back once you have gathered the information.

3. A bulletin will be published to the NOC website giving full information about the incident in due course, once the problem has been solved and analysed; this is standard procedure.

4. If you wish to make a complaint, please do so once the incident is resolved, using the complaints mechanism on our website. Your complaint will receive a response within 24 hours of incident resolution.

5. Abusive callers will not be tolerated.

We do value you as customers and want to serve you as individuals while fixing network problems collectively, and as quickly as possible. To this end we need, and thank you for, your co-operation and participation in the problem-solving process.