Monday, October 19, 2009
Core Network
Update 10:09; we've been told that the problems are down to a major incident in London which is affecting multiple parties; unfortunately it is of a scale which covers both our London-based datacentres.
Update 10:18; THN now appears to be stable. All our connections to RBHX are now down (rather than flapping).
Update 10:47; we've now seen RBHX come back online, though no direct confirmation yet from our provider.
Update 13:45; we've just seen a blip on all our connections at RBHX, lasting approx. 1 minute.
For the near future, services should still be considered at risk.
Monday, September 28, 2009
Manchester Transit HSRP Problems
During normal operation both routers carry traffic, with some customers given a higher priority on router 1 and the others a higher priority on router 2. A number of failover scenarios are tested to ensure failover does occur, so this failure is a bit unusual.
An engineer is currently en route to verify the status of the router experiencing problems.
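As a rough illustration of the behaviour described above, here is a minimal Python sketch of HSRP-style election: the live router with the highest priority for a group of customers carries that group's traffic, and if it fails the traffic should move to the surviving router. The router names and priority values are hypothetical, not our actual configuration.

```python
# Minimal sketch of HSRP-style active-router election (hypothetical names
# and priority values, not our actual configuration).

def elect_active(routers, group):
    """Return the live router with the highest priority for the given group."""
    candidates = [r for r in routers if r["alive"]]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r["priority"][group])

# During normal operation group "A" prefers router 1 and group "B" prefers
# router 2, so both routers carry traffic.
routers = [
    {"name": "router1", "alive": True, "priority": {"A": 110, "B": 90}},
    {"name": "router2", "alive": True, "priority": {"A": 90, "B": 110}},
]

for group in ("A", "B"):
    print(group, "->", elect_active(routers, group)["name"])  # A -> router1, B -> router2

# If router 1 fails, both groups should fail over to router 2.
routers[0]["alive"] = False
for group in ("A", "B"):
    print(group, "->", elect_active(routers, group)["name"])  # A -> router2, B -> router2
```

It is this failover behaviour that the scenario testing mentioned above is designed to exercise.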
Unexpected Reboot of rtr1.thn
The router appears to be stable after the reboot, though the reason why it rebooted is still being investigated. We will continue to monitor the router closely for the next few hours.
Saturday, September 19, 2009
Transit router
Customers directly connected to this router would have seen an outage of 32 minutes. Others may have experienced an outage of approx 30 seconds while their sessions were redirected to another router.
C2 apologise for the inconvenience this outage may have caused.
Thursday, June 18, 2009
Scheduled maintenance, Sunday 21 June, 04:00 to 06:00.
Following on from the Network incident report for 9th June, we are scheduling a network maintenance window for Sunday 21 June that will be open between 04:00 and 06:00. The purpose of this window will be to investigate and test the stability of the Manchester ring.
We are not expecting the network to experience any disruption in service, but the Manchester ring will obviously be more at risk of interruption while testing takes place.
Kind Regards
Stuart McKindley
Tuesday, June 12, 2007
Analysis of incident, 11:25 to 11:45, 11 June 2007
We have now completed our investigation of what happened on 11 June and why it caused such a problem.
At 11:25 we experienced a loss of network reachability on our
A layer 2 protocol problem had occurred affecting the switches in the ring.
Service on the
This Level3 problem was identified at 11:40 and rectified. Service returned for affected Level3 transit customers at 11:45.
Root cause analysis
It seems that the problem was caused by three separate factors interacting.
- Misconfiguration of a customer switch at IFL2
- Engineering works being carried out in Telecity
- Misconfiguration of an old port by Level3
These probably interacted as follows:
A layer 2 loop control protocol problem occurred between two different isolated sides of the
Lessons to take to heart
Preventative
Better control of legacy cables and ports is required, both with suppliers and with customers. Customers’ ports also need strict layer 2 protocol controls at all times, with no exceptions by omission or special case.
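To illustrate why those layer 2 controls matter, here is a small Python sketch of a deliberately looped three-switch topology (a toy example, not the topology involved in this incident): switches flood broadcast frames out of every port except the one they arrived on, so without a loop-control protocol blocking a link, copies of a single frame keep circulating instead of dying out.

```python
# Toy illustration of a layer 2 loop (hypothetical three-switch ring, not the
# actual incident topology). Switches flood a broadcast frame out of every
# port except the one it arrived on; with a loop and no control protocol
# blocking a link, copies of the frame keep circulating.

from collections import deque

links = {"sw1": ["sw2", "sw3"], "sw2": ["sw1", "sw3"], "sw3": ["sw1", "sw2"]}

def flood(origin, max_hops):
    """Count frame copies generated while flooding one broadcast for max_hops."""
    in_flight = deque([(origin, None, 0)])  # (switch, arrival neighbour, hops)
    copies = 0
    while in_flight:
        switch, came_from, hops = in_flight.popleft()
        if hops >= max_hops:
            continue
        for neighbour in links[switch]:
            if neighbour != came_from:  # flood everywhere except the arrival port
                copies += 1
                in_flight.append((neighbour, switch, hops + 1))
    return copies

# The copy count keeps growing with every hop instead of dying out.
for hops in (2, 4, 8):
    print(hops, "hops:", flood("sw1", hops), "copies")
```

In a real network the only limit is link speed, which is why an uncontrolled layer 2 loop can overwhelm a ring within seconds.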
Restorative
We estimate that the problem analysis and fix were delayed by about 10 minutes. When catastrophic problems occur, our priorities are:
- Diagnose and restore service as quickly as possible
- Triage customers and deal with fact-finding and result feedback
- Triage customers and ensure higher support-band customers have service restored first, then handle follow-ups.
- Don't deal with individual customer issues that cannot be quickly resolved or are out of the norm.
- Don't deal with unrelated issues
- Don’t give customers misleading information
- Work to ensure all customers are back fully enabled
- Analyse what went wrong
- Determine lessons to be learned
- Write a Reason for Outage report.
When catastrophic events happen customers naturally want to know what the problem is. Our reception staff receive a high volume of calls in a short space of time.
The problem-solving team need to focus on solving the problem and are isolated to avoid distractions.
The diagnostic team is guided by the problem-solving team on what information to gather from which customers in order to build up a picture of what is happening on the ground. This process needs to happen quickly and be very focused and directed. Anything that slows this process down is bad for all concerned.
To optimise the above, we will be implementing the following changes with immediate effect:
1. In the first instance, call-answering staff will take ‘focused yet detailed’ messages and email these to engineers. This ensures that engineers proactively manage fault resolution and are not distracted by inbound calls.
2. Diagnostic / information gatherers will seek to dispatch information requests quickly and retrieve the resulting feedback. If you need to consult a third party, please call back when you have gathered the information.
3. A bulletin will be published to the NOC website giving full information about the incident in due course once the problem has been solved and analysed – this is standard procedure.
4. If you wish to make a complaint, please do so once the incident is solved using the complaints mechanism on our website. Your complaint will receive a response within 24 hours of the incident resolution.
5. Abusive callers will not be tolerated.
We do value you as customers and want to serve you as individuals and, collectively, to fix network problems as quickly as possible. To this end we need, and thank you for, your co-operation and participation in the problem-solving process.
Wednesday, September 27, 2006
C2 NOC Blog Site
This site is intended for posting scheduled network maintenance tickets and updates, and also serves as an advisory site providing updates on outages and on emergency work being carried out to reinstate services.
This site is deliberately hosted outside the C2 network so that it should remain available even during any disruptive periods.
Kind Regards
Ben