We have now completed our investigation of what happened on 11 June and why it caused such a problem.
At 11:25 we experienced a loss of network reachability on our ring network. A layer 2 protocol problem had occurred, affecting the switches in the ring.
Service on the Level3 transit connection was also affected.
This Level3 problem was identified at 11:40 and rectified. Service returned for affected Level3 transit customers at 11:45.
Root cause analysis
It seems the problem was caused by three separate factors interacting:
- Misconfiguration of a customer switch at IFL2
- Engineering works being carried out in Telecity
- Misconfiguration of an old port by Level3
These probably interacted as follows:
A layer 2 loop control protocol problem occurred between two separate, previously isolated sides of the ring.
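To make the failure mode concrete, here is a minimal sketch of why an uncontrolled layer 2 loop is so damaging. It is purely illustrative: the four-switch ring, the Python model and the flooding rule are assumptions for the example, not our actual topology or the specific protocol involved. Switches flood broadcast frames out of every port except the one the frame arrived on, so once a loop exists the copies circulate indefinitely instead of dying out.

```python
# Illustrative sketch only: not our real topology or protocol.
from collections import defaultdict

def build_ring(n):
    """Adjacency sets for a ring of n switches: 0-1-2-...-(n-1)-0."""
    links = defaultdict(set)
    for i in range(n):
        j = (i + 1) % n
        links[i].add(j)
        links[j].add(i)
    return links

def flood(links, origin, max_hops=8):
    """Number of broadcast frame copies in flight after each hop of flooding."""
    in_flight = [(origin, neighbour) for neighbour in links[origin]]
    counts = []
    for _ in range(max_hops):
        counts.append(len(in_flight))
        next_frames = []
        for came_from, at in in_flight:
            # A switch floods the frame out of every port except the one it
            # arrived on; nothing in the model ever removes a looping frame.
            next_frames.extend((at, nxt) for nxt in links[at] if nxt != came_from)
        in_flight = next_frames
    return counts

ring = build_ring(4)
print("loop intact:  ", flood(ring, 0))   # copies circulate forever

# A loop-control protocol (e.g. spanning tree) would block one link in the ring.
ring[0].discard(3)
ring[3].discard(0)
print("link blocked: ", flood(ring, 0))   # the same flood dies out
```

With the ring intact the circulating copies never drain; once one link is blocked, the same flood dies out after a few hops, which is the job the loop-control protocol failed to do here.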
Lessons to take to heart
Preventative
Better control of legacy cables and ports is required, both with suppliers and with customers. Customer-facing ports also need strict layer 2 protocol controls at all times, with no exceptions through omission or special-casing.
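As an illustration of the kind of strict control we mean, the sketch below shows an automated audit over a port inventory; the Port fields, switch names and policy rules are hypothetical examples, not our actual management tooling. It flags customer-facing ports that lack layer 2 loop protection and legacy ports that have been left enabled.

```python
# Hypothetical port-audit sketch: the inventory format, field names and
# policy below are illustrative only, not our actual management system.
from dataclasses import dataclass

@dataclass
class Port:
    switch: str
    name: str
    role: str            # "customer", "core" or "legacy"
    enabled: bool
    bpdu_guard: bool     # layer 2 loop protection on edge ports
    description: str

def audit(ports):
    """Return a list of policy violations found in the port inventory."""
    problems = []
    for p in ports:
        if p.role == "customer" and p.enabled and not p.bpdu_guard:
            problems.append(f"{p.switch} {p.name}: customer port without loop protection")
        if p.role == "legacy" and p.enabled:
            problems.append(f"{p.switch} {p.name}: legacy port still enabled ({p.description})")
    return problems

inventory = [
    Port("sw-ifl2-1", "ge-0/0/12", "customer", True, False, "customer uplink"),
    Port("sw-tc-2",   "ge-0/0/40", "legacy",   True, True,  "old transit cross-connect"),
    Port("sw-tc-2",   "ge-0/0/41", "core",     True, False, "ring east"),
]

for problem in audit(inventory):
    print(problem)
```

Running a check like this routinely against supplier and customer hand-off ports would catch exactly the omissions and special cases described above.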
Restorative
We estimate that we were delayed by about 10 minutes in analysing and fixing the problem. When catastrophic problems occur, our priorities are:
- Diagnose and restore service as quickly as possible
- Triage customers and deal with fact-finding and result feedback
- Triage customers and ensure higher support-band customers have service restored first, with follow-ups
- Don’t deal with individual customer issues that cannot be quickly resolved or are out of the norm
- Don’t deal with unrelated issues
- Don’t give customers misleading information
- Work to ensure all customers are back fully enabled
- Analyse what went wrong
- Determine lessons to be learned
- Write Reason for Outage report.
When catastrophic events happen, customers naturally want to know what the problem is, and our reception staff receive a high volume of calls in a short space of time.
The problem-solving team need to focus on solving the problem and are kept isolated to avoid distractions.
The diagnostic team is guided by the problem-solving team on what information to gather from which customers in order to build up a picture of what is happening on the ground. This process needs to happen quickly and be very focused and directed. Anything that slows it down is bad for all concerned.
To optimise the above, we will be implementing the following changes with immediate effect:
1. In the first instance, call-answering staff will take ‘focused yet detailed’ messages and email these to the engineers. This ensures that engineers proactively manage fault resolution and are not distracted by inbound calls.
2. Diagnostic and information-gathering staff will seek to dispatch information requests quickly and retrieve the feedback from them. If you need to consult a third party, please call back once you have gathered the information.
3. A bulletin giving full information about the incident will be published to the NOC website in due course, once the problem has been solved and analysed; this is standard procedure.
4. If you wish to make a complaint, please do so once the incident is resolved, using the complaints mechanism on our website. Your complaint will receive a response within 24 hours of incident resolution.
5. Abusive callers will not be tolerated.
We do value you as customers and want to serve you as individuals while fixing network problems collectively as quickly as possible. To that end we need, and thank you for, your co-operation and participation in the problem-solving process.