Monday, October 19, 2009
Core Network
Update 10:09; we've been told that the problems are down to a major incident in London which is affecting multiple parties; unfortunately it is of a scale which covers both our London-based datacentres.
Update 10:18; THN now appears to be stable. All our connections to RBHX are now down (rather than flapping).
Update 10:47; we've now seen RBHX come back online, though no direct confirmation yet from our provider.
Update 13:45; we've just seen a blip on all our connections at RBHX, lasting approx. 1 minute.
For the near future, services should still be considered at risk.
Monday, September 28, 2009
Manchester Transit HSRP Problems
During normal operation both routers carry traffic, with some customers given a higher priority on router 1 and the others a higher priority on router 2. A number of failover scenarios are tested to ensure failover does occur, so this failure is a bit unusual.
An engineer is currently en route to verify the status of the router experiencing problems.
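As a rough illustration of the behaviour described above, here is a minimal Python sketch of HSRP-style election: the live router with the highest priority for a group of customers carries that group's traffic, and if it fails the traffic should move to the surviving router. The router names and priority values are hypothetical, not our actual configuration.

```python
# Minimal sketch of HSRP-style active-router election (hypothetical names
# and priority values, not our actual configuration).

def elect_active(routers, group):
    """Return the live router with the highest priority for the given group."""
    candidates = [r for r in routers if r["alive"]]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r["priority"][group])

# During normal operation group "A" prefers router 1 and group "B" prefers
# router 2, so both routers carry traffic.
routers = [
    {"name": "router1", "alive": True, "priority": {"A": 110, "B": 90}},
    {"name": "router2", "alive": True, "priority": {"A": 90, "B": 110}},
]

for group in ("A", "B"):
    print(group, "->", elect_active(routers, group)["name"])  # A -> router1, B -> router2

# If router 1 fails, both groups should fail over to router 2.
routers[0]["alive"] = False
for group in ("A", "B"):
    print(group, "->", elect_active(routers, group)["name"])  # A -> router2, B -> router2
```

It is this failover behaviour that the scenario testing mentioned above is designed to exercise.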
Unexpected Reboot of rtr1.thn
The router appears to be stable after the reboot, though the reason why it rebooted is still being investigated. We will continue to monitor the router closely for the next few hours.
Saturday, September 19, 2009
Transit router
Customers directly connected to this router would have seen an outage of 32 minutes. Others may have experienced an outage of approx 30 seconds while their sessions were redirected to another router.
C2 apologise for the inconvenience this outage may have caused.
Thursday, June 18, 2009
Scheduled maintenance, Sunday 21 June, 04:00 to 06:00.
Following on from the Network incident report for 9th June, we are scheduling a network maintenance window for Sunday 21 June that will be open between 04:00 and 06:00. The purpose of this window will be to investigate and test the stability of the Manchester ring.
We are not expecting the network to experience any disruption in service, but the Manchester ring will obviously be more at risk of interruption while testing takes place.
Kind Regards
Stuart McKindley
Tuesday, June 12, 2007
Analysis of incident, 11:25 to 11:45, 11 June 2007
We have now completed our investigation of what happened on 11 June and why it caused such a problem.
At 11:25 we experienced a loss of network reachability on our
A layer 2 protocol problem had occurred affecting the switches in the ring.
Service on the
This Level3 problem was identified at 11:40 and rectified. Service returned for affected Level3 transit customers at 11:45.
Root cause analysis
It seems that the problem was caused by three separate factors interacting.
- Misconfiguration of a customer switch at IFL2
- Engineering works being carried out in Telecity
- Misconfiguration of an old port by Level3
These probably interacted as follows:
A layer 2 loop control protocol problem occurred between two different isolated sides of the
Lessons to take to heart
Preventative
Better control of legacy cables and ports is required, both with suppliers and with customers. Customers’ ports also need strict layer 2 protocol controls at all times, with no exceptions by omission or special case.
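To illustrate why those layer 2 controls matter, here is a small Python sketch of a deliberately looped three-switch topology (a toy example, not the topology involved in this incident): switches flood broadcast frames out of every port except the one they arrived on, so without a loop-control protocol blocking a link, copies of a single frame keep circulating instead of dying out.

```python
# Toy illustration of a layer 2 loop (hypothetical three-switch ring, not the
# actual incident topology). Switches flood a broadcast frame out of every
# port except the one it arrived on; with a loop and no control protocol
# blocking a link, copies of the frame keep circulating.

from collections import deque

links = {"sw1": ["sw2", "sw3"], "sw2": ["sw1", "sw3"], "sw3": ["sw1", "sw2"]}

def flood(origin, max_hops):
    """Count frame copies generated while flooding one broadcast for max_hops."""
    in_flight = deque([(origin, None, 0)])  # (switch, arrival neighbour, hops)
    copies = 0
    while in_flight:
        switch, came_from, hops = in_flight.popleft()
        if hops >= max_hops:
            continue
        for neighbour in links[switch]:
            if neighbour != came_from:  # flood everywhere except the arrival port
                copies += 1
                in_flight.append((neighbour, switch, hops + 1))
    return copies

# The copy count keeps growing with every hop instead of dying out.
for hops in (2, 4, 8):
    print(hops, "hops:", flood("sw1", hops), "copies")
```

In a real network the only limit is link speed, which is why an uncontrolled layer 2 loop can overwhelm a ring within seconds.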
Restorative
We estimate that the problem analysis and fix were delayed by about 10 minutes. When catastrophic problems occur, our priorities are:
- Diagnose and restore service as quickly as possible
- Triage customers and deal with fact-finding and result feedback
- Triage customers and ensure higher support-band customers have service restored first, then handle follow-ups.
- Don't deal with individual customer issues that cannot be quickly resolved or are out of the norm.
- Don't deal with unrelated issues
- Don’t give customers misleading information
- Work to ensure all customers are back fully enabled
- Analyse what went wrong
- Determine lessons to be learned
- Write a Reason for Outage report.
When catastrophic events happen customers naturally want to know what the problem is. Our reception staff receive a high volume of calls in a short space of time.
The problem-solving team need to focus on solving the problem and are isolated to avoid distractions.
The diagnostic team is guided by the problem-solving team on what information to gather from which customers in order to build up a picture of what is happening on the ground. This process needs to happen quickly and be very focused and directed. Anything that slows this process down is bad for all concerned.
To optimise the above, we will be implementing the following changes with immediate effect:
1. In the first instance, call-answering staff will take ‘focused yet detailed’ messages and email these to engineers. This ensures that engineers proactively manage fault resolution and are not distracted by inbound calls.
2. Diagnostic / information gatherers will seek to dispatch information requests quickly and retrieve the resulting feedback. If you need to consult a third party, please call back when you have gathered the information.
3. A bulletin will be published to the NOC website giving full information about the incident in due course once the problem has been solved and analysed – this is standard procedure.
4. If you wish to make a complaint, please do so once the incident is solved using the complaints mechanism on our website. Your complaint will receive a response within 24 hours of the incident resolution.
5. Abusive callers will not be tolerated.
We do value you as customers and want to serve you as individuals and, collectively, to fix network problems as quickly as possible. To this end we need, and thank you for, your co-operation and participation in the problem-solving process.
Wednesday, September 27, 2006
C2 NOC Blog Site
This site is intended for posting scheduled network maintenance tickets and updates, and also serves as an advisory site providing updates on outages and on emergency work being carried out to reinstate services.
This site is deliberately hosted outside the C2 network so that it should remain available even during any disruptive periods.
Kind Regards
Ben