Thursday, April 05, 2012

Emergency Maintenance Notification

We have received notification from our primary transit provider that on 12th April 2012, between 1am and 4am GMT, there is a possibility of an outage of up to 20 minutes.

Regards
Atlas Technical

Tuesday, August 09, 2011

Future works on Manchester ring 19/08/2011



We've received notification from one of our network providers that as part of expanding their network, a number of fibre routes between TeleData and IFL2 require diverting.

This work will take place during the period 19/08/2011 23:00 to 20/08/2011 11:00 (BST).

This will result in a loss of protection on the Manchester ring until the work is complete.

Regards
C2

Wednesday, July 20, 2011

Issues on Manchester Ring

The cause of the issues on the Manchester ring last night is still not clear. We know part of the ring went down, but what we can't do is replicate the issue: each time we manually shut down part of the ring, traffic simply flows in the opposite direction. We are aware of issues with our network provider, who have been working on the TCW to IFL2 links, and we will be chasing them today for an update on whether they were working on anything at the time. As we cannot replicate the issue it would appear that some outside influence may be affecting the network; we will therefore be carrying out some emergency investigative work onsite today. From 1pm onwards you may see some network blips, which we will try to keep to a minimum.

Apologies for the short notice, however we need to find the cause if we can.

Once we have completed our tests we will post a further update.


---

Following on from the tests, a switch at TCW, which is part of a pair, has been identified as the most probable cause of the issues and instabilities. As the issue is somewhat intermittent I'm reluctant to pull the switch while it's under load from directly connected clients. The quietest time across all the ports is around 7am, so we will look to swap the unit out at that time on Thursday 21st. Directly connected clients and those using services connected to this switch will drop for a few minutes while the switch is replaced.




---

The work is now complete. The suspect switch will be taken back for further tests in the lab and further updates will be posted here.

Tuesday, December 21, 2010

DSL issues in London

Some C2 customers in the London area may currently be experiencing DSL issues due to an incident at the West End BT exchange. BT engineers are working to restore services after flooding caused a fire in the exchange. BT expect to have services restored shortly.

Thursday, December 02, 2010

Scheduled maintenance 6th December between 10:00 and 16:00

Hi,

We have been advised by our network provider that as part of their capacity planning they need to move the current line between IFL and Telecity Williams onto another line. Though the window for the required work is between 10:00 and 16:00, they expect that the line will only be down for one hour. The remaining two legs of the ring will remain untouched, so customers should not notice any disruption to service, but the Manchester ring will obviously be more at risk of interruption while the work takes place.

We have insisted that this work be completed while we are in contact with their engineers, so that if we notice anything unexpected on the network, we can get any work reversed immediately.


Kind Regards

Stuart McKindley

Monday, November 22, 2010

Removal of secondary mail server mail-relay20.c2internet.net

Due to hardware failure the server mail-relay20.c2internet.net is being removed from service.

This server's only role was to operate as a secondary/backup mx to customers requesting this functionality where those customers operated their own primary mail servers.

While this type of setup was initially the norm, as the war against spam continues these types of backup servers have been targeted as easy routes in. This gives rise to a few problems:

It's not uncommon for these backup servers to be whitelisted/trusted by the primary server, totally defeating any anti-spam techniques the primary is using. The backup server will also accept all mail for the domains it is told to be secondary for; if, when forwarding that mail, the primary server rejects a mailbox as unknown, the backup server will then try to send a non-delivery report (NDR). If the originating email came from a forged address these NDRs clog the system further, putting extra load on the server for no good reason. Worst case, the NDRs are sent to a valid email address which had nothing to do with the original email, at which point the server is generating backscatter, which is every bit as bad as spam.

If the primary mail server were to fail, most sending servers will quite happily queue the email, notify the sender of any delays, and retry delivery a few minutes after the server comes back up.

With all this in mind we will shortly be removing all DNS entries for mail-relay20.c2internet.net. The unusual thing here is that customers who have been using the service may well see a drop in the amount of incoming spam compared to what they have been used to.

This does not affect customers who run their own secondary mail servers.
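Once the DNS entries have been removed, customers who used the relay can confirm the change has propagated by listing their domain's MX records. The snippet below is a minimal sketch of such a check using Python with the dnspython library; both the library choice and the example domain are illustrative rather than anything we supply.

```python
# Minimal sketch: list a domain's MX records and flag the retired relay.
# Requires dnspython 2.x (pip install dnspython); "example.co.uk" is a placeholder.
import dns.resolver

def list_mx(domain):
    """Return (preference, exchange) pairs for the domain's MX records."""
    answers = dns.resolver.resolve(domain, "MX")
    return sorted((r.preference, str(r.exchange)) for r in answers)

for pref, host in list_mx("example.co.uk"):
    print(pref, host)
    if "mail-relay20.c2internet.net" in host:
        print("  -> retired secondary MX still listed; the DNS change may not have propagated yet")
```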

DSL Connections via BT

-- 15:15

Fault is now cleared - we will continue to monitor.

Apologies for any inconvenience.



-- 14:26

We're seeing a number of lines coming back up, though as yet we have had no notification of this fault being cleared. We will continue to monitor.


-- 13:49

There is an issue affecting a number of tail circuits that are provided over the BT wholesale network. This is affecting a number of ISPs and is not related to anything within our network or anything under our direct control. The issue is being investigated and more information will be posted as soon as is available.

Wednesday, November 10, 2010

Service Outage report for 10th November 2010

17:32-

We've just had confirmation that the outage was caused by two separate faults happening in two separate geographic locations: one fault was on the provider's Leeds to Sheffield connection, the other on their Warrington to Birmingham connection.



13:42-

At 10:50am this morning we lost both our west-bound and east-bound connections from Manchester to London, which partitioned our core network into two. This partitioning would have caused routing issues, and due to the location of the name and RADIUS servers within the network, name lookups and xDSL authentication would also have failed.

Our transit feed out of Manchester was also experiencing problems; as that issue cleared at the same time our connections came back up, it was no doubt down to the same root problem.

With the issue affecting multiple providers it was clear the problem was not within any equipment under our direct control, nor the outcome of any of our actions within the network.

Our main telephone system is also based out of Manchester; when the server went offline it failed over onto the backup analogue PSTN system, although the number of incoming calls obviously proved a challenge.

We're currently in discussion with our network provider for the Manchester to London connections, as these routes should be separate and diverse. Initially they also went via separate providers, however due to consolidation within the market one provider has ended up owning both networks. If it transpires that our provider has, without our knowledge or authorisation, joined these pathways then of course action will be taken.

At 12:40pm both connections came back up; with the exception of transit out of Manchester, once the network had re-converged, connections and traffic flows returned to normal. Approximately ten minutes after our connections re-established, transit via our transit provider was also restored.

Our apologies for this outage and the inconvenience.

Monday, July 26, 2010

Upstream transit provider

-- 9:00AM

We're currently experiencing packet loss on one of our upstream transit providers. The connection needs to remain active for a short while to allow us to run diagnostics before passing the call to our provider.

-- 9:25am

Sessions to this transit provider have now been shut down and traffic is flowing via alternative pathways. However, another upstream provider is also now showing packet loss, so this provider has also been disabled.

Thursday, June 03, 2010

General Service Issues

4:00AM - Customers may be experiencing some general connectivity issues ranging from some Internet hosts being unavailable through to DSL lines not reconnecting after a reboot.

There appears to be a service outage in London which is affecting some of our suppliers, this outage is affecting multiple providers.

If you are experiencing any issues please do not reboot your router at this stage.

More information will be posted as it becomes available

5:40AM - Most of the affected circuits came back up approx 15 mins ago; work is continuing to restore the remainder as soon as possible.

6:50AM - The remaining circuits have now re-established - all connectivity has returned to normal.

Friday, March 26, 2010

Browsing / Access delays

Customers are experiencing delays in reaching some parts of the Internet.

---

11:04am - The sites giving problems are those we would normally reach through Telehouse North in London, so we have closed down transit and peering at THN, which is forcing traffic to take alternative routes through different transit partners. This has improved things for some sites, however some others remain problematic. The issues causing these delays are located outside of our network and thus outside of our control. We are waiting on updates as to when these problems will be corrected, and we will then re-enable peering and transit at THN. Apologies to those customers affected by this issue.

Monday, March 15, 2010

C2 21CN ADSL

5pm - We're experiencing some issues with our 21CN connections failing to reconnect after a reboot. BT are investigating; if your connection is 21CN please do not do a manual reboot until the problem is resolved. Thanks.


8:37pm - Problems appear to be resolved and most of the disconnected sessions have come back online. The incident should be considered closed, however we will continue to monitor for a while.

Friday, January 22, 2010

C2 DSL Network

We're experiencing some issues on the DSL network and engineers are investigating. The problem is affecting multiple interconnects at multiple locations but does not appear to be core-network related.

Monday, November 30, 2009

Network upgrades

During the course of this week we are increasing our switchport capacity at IFL2 and TCW. We are also taking this opportunity to relocate some equipment to increase our internal redundancy and resilience.

While the installations and relocations take place there will be points in time when some circuits and systems will be deemed at risk; however, traffic at these points will be manually set to traverse alternate pathways and routers.

----

Dec 3rd update.

IFL2 Complete.

----

Dec 8th update.

Due to delays in getting some fibre links provisioned at TCW, this site will be delayed; the new circuits should be in by Dec 18th.

Monday, October 19, 2009

Core Network

We are currently experiencing problems across multiple links of the core network; our providers have been notified and we are waiting for an update.

Update 10:09; we've been told that the problems are down to a major incident in London which is affecting multiple parties; unfortunately it is of a scale which covers both our London-based datacentres.

Update 10:18; THN now appears to be stable; all our connections to RBHX are now down (rather than flapping).

Update 10:47; we've now seen RBHX come back online, though no direct confirmation yet from our provider.

Update 13:45; we've just seen a blip on all our connections at RBHX, approx 1 minute.

For the near future services should still be considered at risk.

Monday, September 28, 2009

Manchester Transit HSRP Problems

One of a pair of Cisco routers serving some transit customers is experiencing issues; currently it is not passing packets. This should have caused an automatic failover to its pair, however for reasons currently unknown it did not fail over. The standby priority for all affected customers has been increased on the operational router, so traffic is once again flowing.

During normal operation both routers carry traffic, with some customers having a higher priority on router 1 while the others have a higher priority on router 2. A number of failure scenarios are tested to ensure failover does occur, so this failure is a bit unusual.

An engineer is currently en-route to verify the status of the router experiencing problems.
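As background on why raising the standby priority restores traffic: in an HSRP-style election, the highest-priority router that is still advertising itself holds the active (gateway) role. One plausible failure mode, though not confirmed as the cause here, is a router whose control plane keeps advertising while its data plane has stopped forwarding packets, in which case it wrongly stays active. The sketch below is a simplified Python model of that election, not the real protocol; the router names, priorities and states are hypothetical.

```python
# Simplified model of an HSRP-style active-router election (illustration only,
# not the real protocol). Router names, priorities and states are hypothetical.
# Preemption is assumed; on real kit the standby also needs to be configured to preempt.
from dataclasses import dataclass

@dataclass
class Router:
    name: str
    priority: int
    advertising: bool   # control plane still sending hello packets
    forwarding: bool    # data plane actually passing packets

def elect_active(routers):
    """The highest-priority router that is still advertising wins the active role."""
    candidates = [r for r in routers if r.advertising]
    return max(candidates, key=lambda r: r.priority)

# Faulty router: still advertising, so it keeps the active role despite not forwarding.
rtr_a = Router("rtr-a", priority=110, advertising=True, forwarding=False)
rtr_b = Router("rtr-b", priority=100, advertising=True, forwarding=True)
print(elect_active([rtr_a, rtr_b]).name)   # rtr-a: traffic is effectively blackholed

# Workaround: raise the standby priority on the healthy router so it wins the election.
rtr_b.priority = 150
print(elect_active([rtr_a, rtr_b]).name)   # rtr-b: traffic flows again
```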

Unexpected Reboot of rtr1.thn

At 17:03 rtr1.thn unexpectedly rebooted; customers served via this router would have noticed approx 7 minutes of downtime while the router reloaded.

The router appears to be stable after the reboot, though the reason why it rebooted is still being investigated. We will continue to monitor the router closely for the next few hours.

Saturday, September 19, 2009

Transit router

At 17:33 one of our transit routers stopped passing packets on one of its interfaces. The affected router was restarted at 18:00, correcting the problem.

Customers directly connected to this router would have seen an outage of 32 minutes. Others may have experienced an outage of approx 30 seconds while their sessions were redirected to another router.

C2 apologise for the inconvenience this outage may have caused.

Thursday, June 18, 2009

Scheduled maintenance 21st June between 04:00 and 06:00.

Hi,

Following on from the Network incident report for 9th June, we are scheduling a network maintenance window for Sunday 21 June that will be open between 04:00 and 06:00. The purpose of this window will be to investigate and test the stability of the Manchester ring.

We are not expecting the network to experience any disruption in service, but the Manchester ring will obviously be more at risk of interruption while testing takes place.

Kind Regards

Stuart McKindley

Tuesday, June 12, 2007

Analysis of incident 11:25 to 11:45 11 June 2007

What happened?

We have now completed our investigation of what happened on 11 June and why it caused such a problem.

At 11:25 we experienced a loss of network reachability on our London to Manchester ring, which affected DSL customers. This also impacted, although it was not immediately apparent, our transit customers in Manchester, particularly those connected to Level3.

A layer 2 protocol problem had occurred affecting the switches in the ring.

Service on the London to Manchester ring was restored at around 11:30 by rebooting the Manchester switches, which restored service to affected DSL customers. Unfortunately, upon reloading, a further layer 2 protocol problem caused the automatic shutdown of one of the Gig interconnects with Level3 in Manchester.

This Level3 problem was identified at 11:40 and rectified. Service returned for affected Level3 transit customers at 11:45.

Root cause analysis

It seems that the problem was caused by three separate factors interacting.

  • Mis-configuration of a customer switch at IFL2
  • Engineering works being carried out in Telecity
  • Mis-configuration of an old port by Level3

These probably interacted as follows:

A layer 2 loop control protocol problem occurred between two different isolated sides of the London to Manchester ring. A misconfigured customer switch at IFL2 sent out layer 2 loop control packets. After the reboot, a residual configuration of an old transit port at Level3 was brought back into service as part of the aborted London to Manchester work. This created a further loop between two Level3 ports, automatically shutting one of them down, which affected all customer VLANs mapped via this port.

Lessons to take to Heart

Preventative

Better control of legacy cables and ports is required, both with suppliers and with customers. Customers' ports also need strict layer 2 protocol controls at all times, with no exceptions by omission or special case.

Restorative

We estimate that we were delayed by about 10 minutes in the problem analysis and fix. When catastrophic problems occur our priorities are:

  1. Diagnose and restore service as quickly as possible
  2. Triage customers and deal with fact-finding and result feedback
  3. Triage customers and ensure higher support band customers have service restored first / follow ups.
  4. Don’t deal with individual customer issues that cannot be quickly resolved / are out of the norm.
  5. Don’t deal with un-related issues
  6. Don’t give customers misleading information
  7. Work to ensure all customers are back fully enabled
  8. Analyse what went wrong
  9. Determine lessons to be learned
  10. Write Reason for Outage report.

When catastrophic events happen customers naturally want to know what the problem is. Our reception staff receive a high volume of calls in a short space of time.

The problem solving team need to focus on solving the problem and are isolated to avoid distractions.

The diagnostic team is guided by the problem solving team on what information to gather from which customers in order to build up a picture of what is happening on the ground. This process needs to happen quickly and be very focused and directed. Anything that slows this process down is bad for all concerned.

To optimise the above we will be implementing the following changes with immediate effect:

1. In the first instance, call answering staff will take ‘focused yet detailed’ messages and email these to engineers. This ensures that engineers proactively manage fault resolution and are not distracted by inbound calls.

2. Diagnostic / information gatherers will seek to dispatch information requests quickly and retrieve the feedback from them. If you need to consult a third party, please call back when you have gathered the information.

3. A bulletin will be published to the NOC website giving full information about the incident in due course once the problem has been solved and analysed – this is standard procedure.

4. If you wish to make a complaint, please do so once the incident is solved using the complaints mechanism on our website. Your complaint will receive a response within 24 hours of the incident resolution.

5. Abusive callers will not be tolerated.

We do value you as customers and want to serve you as individuals, and collectively fix network problems as quickly as possible. To this end we need, and thank you for, your co-operation and participation in the problem-solving process.