Wednesday, May 20, 2009
Mysql1 Server Restart
Mysql1 crashed a few minutes ago. It has been restarted and appears to be working normally.
Sunday, May 17, 2009
Connectivity Issue
We are aware that customers have been having issues connecting to sites on our network since this morning.
The issue has been traced to a hacked customer server which was sending out a denial of service attack. We will monitor the network over the next 24 hours to ensure that we can maintain a degree of stability.
UPDATE 19:00
It became apparent this afternoon that a second denial of service attack was hitting our DNS servers, ns1 and ns2. Customers will be aware of the importance of the name servers: not only do they allow remote visitors to find the IP address of a server, they are also used by the web servers and SQL servers for performing reverse lookups.
The result of the attack has been to slow down all services including DSL/transit due to the high volume of UDP traffic hitting our routers.
We believe that at this stage we have been able to take action to reduce the effects of the DoS attack. We will be monitoring the network this evening and taking appropriate action as needed.
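For readers unfamiliar with the reverse lookups mentioned above, the following is a minimal sketch of the name that a server asks the name servers about when it wants the hostname behind an IP address. The address 192.0.2.10 is a documentation example, not one of our servers, and the function name is purely illustrative.

```python
import ipaddress


def reverse_lookup_name(ip: str) -> str:
    """Return the DNS name queried for a PTR (reverse) lookup of `ip`."""
    # The octets are reversed and placed under the in-addr.arpa zone.
    return ipaddress.ip_address(ip).reverse_pointer


# 192.0.2.10 is a reserved documentation address (an assumption for this example).
print(reverse_lookup_name("192.0.2.10"))  # prints 10.2.0.192.in-addr.arpa
```

Every such query has to be answered by a name server, which is why a slow or overloaded ns1/ns2 can delay web and SQL connections as well as ordinary site lookups.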
UPDATE 19:50
Network monitoring is showing that services have returned to normal. We will continue to monitor the network and services over the next 24 hours.
UPDATE 21:00
One issue with some static transit routes has been resolved. There are now no outstanding issues.
Saturday, May 16, 2009
ADSL port change
Our ADSL traffic has now been migrated from its temporary port in our router back to its permanent port location. Our engineering investigations on the link are not yet complete, but we now have sufficient information to identify the cause of the issues with the circuit provider.
Traffic is already flowing over the link so we are satisfied that the migration has been completed smoothly. The loss of IP connectivity lasted for 90 seconds during the cable transfer.
DNS server restart
Both of our DNS servers required a restart this morning, causing some inconsistency in the resolution of domain names.
Wednesday, May 13, 2009
Maintenance Window
We will be bringing our second router back online between 11am and 1pm on Thursday 14th May.
ADSL customers may experience a brief (60 second) outage while cables are re-patched.
No other issues are expected as a result of the router replacement.
UPDATE 14:00
The router replacement work has taken longer than expected because a faulty APC distribution switch caused our main load balancer to reboot. As a result, sites on clusters 150, 156 and 212 have suffered 3 outages in the last hour, each lasting up to 10 minutes.
Work to complete the router installation is nearly finished. Further updates will follow.
UPDATE 18:00
The work to replace router 2 was completed earlier this afternoon, but the router is not yet in use. We will schedule out of hours work to enable the upstream connections on this router, to reduce disruption while our routing tables settle.
We have identified two issues during the work today.
An APC power switch which supports the load balancer for 150, 156 and 212 appears to have a loose connection, which causes the load balancer and servers 29, 37 and 64 to reboot if the cables are moved. We will schedule another maintenance window to transfer the power cables to a new APC unit. As yet a date has not been set.
The second issue relates to a fault with a link to IFL1 which carries our ADSL customer traffic. The work on the new router today has highlighted a potential fault, which we have raised with the link provider. When the issue has been investigated and resolved we will schedule an out of hours maintenance window to transfer the connection from router 1 to router 2.
Mail, repeated downloads.
We are noticing that a number of Outlook Express clients are experiencing issues. Downloads appear to be looping.
To rule out any corruption of the upstream cache at the mail server, the local index has been removed. While the index is rebuilt there will be some delay.
If your mail client is repeatedly downloading email, please take it offline. Then remove or reset any local cache and use the webmail facility in the interim, as repeated downloads are currently amounting to a denial of service.
UPDATE 21:45
The load on the mail servers and file system is now down to within normal levels.
There should be no delays on mail delivery or slow mail downloads.
We would like to remind customers that they must set their mail clients to delete mail after a set number of days to prevent their mail clients from downloading mail repeatedly.
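As an illustration of the "delete after a set number of days" setting, here is a minimal sketch of the rule a mail client applies when deciding which messages to remove from the server. The function name, the message numbers and the 7 day retention period are assumptions made for the example; they are not settings from our mail platform or from any particular client.

```python
from datetime import datetime, timedelta


def messages_to_delete(received, now, keep_days):
    """Return the message numbers received more than `keep_days` days before `now`.

    `received` maps a message number to the time the message arrived.
    """
    cutoff = now - timedelta(days=keep_days)
    return sorted(num for num, when in received.items() if when < cutoff)


# Hypothetical mailbox: message 1 is twelve days old, message 2 is one day old.
mailbox = {1: datetime(2009, 5, 1), 2: datetime(2009, 5, 12)}
print(messages_to_delete(mailbox, now=datetime(2009, 5, 13), keep_days=7))  # prints [1]
```

Keeping the mailbox on the server small in this way limits how much mail a client can fetch again if its local record of downloaded messages is lost.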
Monday, May 11, 2009
ADSL outage not directly related to router issue
We are well aware that the ADSL issue remains.
We have had confirmation that the system delivery is sound and that our infrastructure is ready to deliver the routing and transit; however, we are not receiving the inbound traffic.
This issue has now been confirmed to be unrelated to the router outage late on Saturday night.
We are pressing / waiting on switch/circuit owners between our IFL1 and IFL2 installations in Manchester to pick up the ball.
Saturday, May 09, 2009
Router Reboot
Router 2 was rebooted a short time ago, causing a brief interruption to network stability.
We are waiting on a new backup transit provider to deliver connection details to improve the reliability of the network and hope to have a new connection for router 1 completed by Monday.
UPDATE
The router has crashed twice and been rebooted twice. We are unable at this moment to identify the cause of the crashes, but suspect environmental issues at the colo facility.
We are working to migrate some of our backup transit routes to the remaining working routers to restore full routing to our network.
Further updates will follow.
Sunday, May 03, 2009
Network Issue
We are currently investigating what appears to be a network issue. Further information will be made available as we have it.
UPDATE
It would appear that a rogue AS advertisement is being made by an ISP which does not provide service to us. We are attempting to work through the various channels to have their network management centre correct the issue.
UPDATE
A ticket has been raised with the NOC of the ISP concerned. We are awaiting the results of their investigations.
UPDATE
The issue has been identified as relating to a single upstream provider, so engineering are now working to ensure that fixes are put in place on our supplier's network to restore connectivity via that route.
Our network stats are showing working routes via two out of three upstream providers and all of our Manap peering. The faulty connection accounts for approximately 35% of our total capacity which, because of the nature of the fault, is not being rerouted to the working connections.
A final update should be available within the next 15 minutes once full service has been restored.