Thursday, November 19, 2009
Network cable update
We have a number of network cables which need to be re-patched over the course of Friday.
Network customers may notice a few seconds of instability while the patch leads are moved.
Monday, November 16, 2009
Host Server 8
Host Server 8 has now been upgraded to PHP5 and MySQL 5.
We have migrated all databases and sites to the server and the service appears stable.
There are, however, some points to note:
- If your hosting service expired prior to June 2009 then your site has not been upgraded and is not live. Please contact us to renew the service and re-enable your site.
- A small number of sites have databases with some locked/crashed database tables. There is nothing to suggest these are active tables or that the issue has only recently occurred, but if you do have a table that requires a repair, please let us know.
- The FTP server on the new host is being very 'Monday' and not authenticating requests at the moment. Assuming nothing major crops up after the migration, we'll work on the fault over the remainder of the afternoon.
UPDATE 15:30
FTP accounts on the server are working without issue.
Support will run to 11pm via email this evening to resolve any remaining issues with users.
Tuesday, November 10, 2009
Web Cluster maintenance
We will be performing maintenance on the web cluster from 5.30am tomorrow morning.
The following services will be taken offline: clusters 150, 156 and 212.
We anticipate that work will be completed and full services restored by 8am, with a possible overrun to 8.30am at the latest.
While the work is being completed, sites will show an under maintenance page.
The work is necessary to allow us to migrate to a new storage platform and add support for PHP5 to the web clusters.
UPDATE 7AM
IFL have failed to renew the access pass for the engineer carrying out this morning's scheduled work. The pass expired 2 days ago, thereby preventing us from completing the migration according to our original timetable.
We have no alternative dates to complete the work, so it will commence at 7.30am this morning with a target completion of 9.30am.
UPDATE 11AM
We're seeing high traffic levels at the file server caused by a single site. If services do not settle to an acceptable level within the next 5 to 10 minutes we will disable the site and look for stability to return to the servers.
UPDATE 1.30PM
We have been able to confirm that the issues seen after the file server was returned to service this morning were caused by a single site. The site in question has been disabled until we can migrate it to alternative hardware.
We will be monitoring the servers for any further issues for the rest of today.
Tuesday, October 20, 2009
Phishing Emails
We've seen a batch of phishing emails arrive with customers today asking them to update contact details as part of a security update. We remind customers to be wary about clicking links in emails, even when they appear to come from known sources.
Monday, October 19, 2009
Web cluster stability
We've been experiencing some instability with the web cluster 150 servers over the last hour. The cause has now been traced to a single site causing the server to run out of memory.
The site in question has been disabled pending further investigations.
Monday, October 12, 2009
Mail Cluster Issue
INITIAL 9:45PM
We're currently repairing some file corruption on the main mail server file store.
The process should not take more than a few minutes. Until the repair is completed there is no access to POP, IMAP, or Webmail services.
UPDATE 10:15PM
The file system check and repair has completed, however a server reboot is required to clear a number of Dead/Zombie processes on the server.
We're going to assess the issues and decide on a course of action shortly.
UPDATE 10:50PM
An engineer will be dispatched to IFL2 at 3am to recover the file server to our new racks. We hope to have the service restored shortly after 8am.
UPDATE 8.15AM
The overnight work to recover the file server to our new racks was completed on time and without incident.
UPDATE 11:45AM
Heavy storage loads have resulted in the need to pull the file system and force a tree rebuild. This is going to be a lengthy process, so in the interim we are working to restore a backup, which should make inboxes available in about a fifth of the time. AGAIN, I would like to reiterate that this is not preventing inbound mail from arriving, merely the ability to collect it, an important difference.
UPDATE 13:38
Working on a solution to deliver / present fresh email to the end users. While historical data will be missing, all new emails will be arriving and visible - allowing us to migrate these to the existing file system once rebuilt. Light at end of collective tunnels.
UPDATE 13:45
Temporary solution in place. POP users will now be seeing all their new email arriving. Those who are paying for the IMAP solution will notice that all their mail has gone. DO NOT adjust your set. Simply work with what you have for now, and your inbox will be reunited with old mail on completion of the rebuild over the next 12 to 24 hours.
This completes updates on this entry; work from this point on will continue in a new entry. Over the last 10 minutes, over 467Mbytes of inbound mail have been stored, and we can see you retrieving email.
NB, if you use webmail or IMAP to collect mail, please avoid creating new directory structures until your old mail account has been merged with the new server.
UPDATE Wednesday 11.30am
Work is continuing on the old file server. We are currently running a snapshot of the data to backup before starting the merge with the new file server. Further updates will follow at 2.30pm today.
UPDATE Wednesday 2.30pm
The snapshot of the data is now 1/3rd complete. It's likely to be early evening before we can start the process of merging the new mails with the old. Next update 5pm.
UPDATE Wednesday 5pm
The snapshot of the data is now 3/4 complete. We're expecting to be able to start the data merge at 9pm, which should take less than 1 hour to complete. The next update will follow once the data merge has been completed and all webmail/IMAP data has been restored.
UPDATE Thursday 1am
Our new file server is now live with a full copy of all mails, old and new. At no point has any mail been lost or rejected.
There is a small risk that the mail indexes will have changed for IMAP users, resulting in a mismatch between the From/Subject line presented by your mail client and the mail downloaded. The fix is to right click the folder concerned and select 'rebuild index'.
Wednesday, September 09, 2009
Thursday Technical Support
Our key staff will be attending a conference from 7am to 1pm today (Thursday).
For the time we are away, we will be operating a reduced support service and no telephone support.
Emergencies will be handled by on call staff.
General support will resume early afternoon.
Wednesday, August 26, 2009
Network Stability 12:40pm
A transit customer's server has been subject to a DDoS attack, causing in excess of 100Mbit of small packets to traverse our network. As a result there has been a disruption to connectivity and some packet loss.
Steps to mitigate the problem have been taken, and as of 12.44pm stability has returned to the network with no packet loss apparent.
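For anyone wondering why small packets matter: routers tend to be limited by the number of packets they handle per second as well as by raw bandwidth. The sketch below is rough arithmetic only. The 64 byte and 1500 byte packet sizes are assumptions for illustration (the actual packet size in the attack is not recorded here), and link-layer framing overhead is ignored.

```python
def packets_per_second(bits_per_second, packet_bytes):
    """Approximate packet rate for a given bit rate and packet size.

    Ignores link-layer framing overhead, so real figures will be a little lower.
    """
    return bits_per_second / (packet_bytes * 8)


# Assumed sizes: 64 bytes for a "small" packet, 1500 bytes for a full-size one.
small = packets_per_second(100_000_000, 64)
large = packets_per_second(100_000_000, 1500)
print(f"64 byte packets:   {small:,.0f} per second")
print(f"1500 byte packets: {large:,.0f} per second")
```

Under those assumptions, 100Mbit/s of 64 byte packets is roughly 195,000 packets per second, against roughly 8,300 per second for the same bandwidth in full-size packets.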
Tuesday, July 28, 2009
Cluster 150 - 5.15pm
It would appear that a customer's WordPress site has caused the web cluster to overload. We're currently waiting for the load on the web servers to fall so that we can confirm the cause and restore service.
We expect the servers in cluster 150 to have recovered within the next 5 to 10 minutes.
An update will follow shortly.
UPDATE - 5.25pm
Services have been restored and will be monitored over the course of the evening. The service restart prevented us from confirming the exact process details that had caused the server load to climb.
Monday, July 20, 2009
"delayed emails" / duplicates.
Friday saw a scheduled file system check of a mail cluster member. It would appear that some files were not being deleted properly after being cleared from the spool queue, and when the check recovered these files to a state in which they could be deleted, the mail server diligently sent them on their way. The result was a phantom resending of emails, in some cases up to a month old.
As these mails had already been marked as deleted, they are DUPLICATES and can safely be disregarded.
We apologise for the confusion this may have caused.
Monday, June 08, 2009
Database restart
The server hosting our main database has crashed and is in the process of being rebooted.
Logins for Mail/POP/IMAP will be offline for a short time while the reboot completes.
UPDATE
The server has restarted without issues. All services are working normally.
Wednesday, May 20, 2009
Mysql1 Server Restart
Mysql1 crashed a few minutes ago. It has been restarted and appears to be working normally.
Sunday, May 17, 2009
Connectivity Issue
We are aware that customers have been having issues connecting to sites on our network since this morning.
The issue has been traced to a hacked customer server which was pushing out a denial of service attack. We will monitor the network over the next 24 hours to ensure that we can maintain a degree of stability.
UPDATE 19:00
It has become apparent this afternoon that a second denial of service attack was hitting our DNS servers ns1 and ns2. Customers will be aware of the importance of the name servers. Not only do they allow remote visitors to find the IP address of the server, they're also used by the web servers and SQL servers for performing reverse lookups.
The result of the attack has been to slow down all services including DSL/transit due to the high volume of UDP traffic hitting our routers.
We believe that at this stage we have been able to take action to reduce the effects of the DoS attack. We will be monitoring the network this evening and taking appropriate action as needed.
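For readers unfamiliar with the reverse lookups mentioned above: a reverse lookup asks the name servers for the PTR record of a specially formed name built from the IP address. The short sketch below only shows how that name is formed, using Python's standard ipaddress module. The function name and the example address (taken from the documentation range 192.0.2.0/24) are illustrative and are not part of our setup.

```python
import ipaddress


def reverse_lookup_name(ip):
    """Return the DNS name that is queried for a reverse (PTR) lookup of ip."""
    return ipaddress.ip_address(ip).reverse_pointer


# Example address from the reserved documentation range, not a real server.
print(reverse_lookup_name("192.0.2.10"))
```

Every such query has to be answered by a name server, which is why slow name servers slow down the web and SQL servers that depend on them.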
UPDATE 19:50
Network monitoring is showing that services have returned to normal. We will continue to monitor the network and services over the next 24 hours.
UPDATE 21:00
One issue with some static transit routes has been resolved. There are now no outstanding issues.
Saturday, May 16, 2009
ADSL port change
Our ADSL traffic has now been migrated from its temporary port in our router back to its permanent port location. Our engineering investigations on the link are not yet completed, but we now have sufficient information to be able to identify the cause of the issues with the circuit provider.
Traffic is already flowing over the link so we are satisfied that the migration has been completed smoothly. The loss of IP connectivity lasted for 90 seconds during the cable transfer.
DNS server restart
Both of our DNS servers required a restart this morning, causing some inconsistency with the resolution of domain names.
Wednesday, May 13, 2009
Maintenance Window
We will be bringing our second router back online between 11am and 1pm on Thursday 14th May.
ADSL customers may experience a brief (60 second) outage while cables are re-patched.
No other issues are expected as a result of the router replacement.
UPDATE 14:00
The router replacement has taken longer than expected due to a faulty APC distribution switch causing our main load balancer to reboot. As a result, sites on clusters 150, 156 and 212 have suffered 3 outages in the last hour lasting up to 10 minutes.
Work to complete the router installation is nearly finished. Further updates will follow.
UPDATE 18:00
The work to replace router 2 was completed earlier this afternoon but the router is not yet in use. We will schedule out of hours work to enable the upstream connections on this router to reduce disruption during the settlement of our routing tables.
We have identified two issues during the work today.
An APC power switch which supports the load balancer for 150, 156 and 212 appears to have a loose connection which causes the load balancer, servers 29, 37 and 64 to reboot if the cables are moved. We will schedule another maintenance window to transfer the power cables to a new APC unit. As yet a date has not been set.
The second issue relates to a fault with a link to IFL1 which carries our ADSL customer traffic. The work on the new router today has highlighted a potential fault which we have raised with the link provider today. When the issue has been investigated and resolved we will schedule an out of hours maintenance window to transfer the connection from router 1 to router 2.
Mail, repeated downloads.
We are noticing that a number of Outlook Express clients are experiencing issues. Downloads appear to be looping.
To rule out any corruption of the upstream cache at the mail server, the local index has been removed. While the index rebuilds there will be some delay.
If your mail client is repeatedly downloading email, please take it offline. Then remove or reset any local cache and use the webmail facility in the interim, as repeated downloads are currently forming a denial of service.
UPDATE 21:45
The load on the mail servers and file system is now down to within normal levels.
There should be no delays on mail delivery or slow mail downloads.
We would like to remind customers that they must set their mail clients to delete mail after a set number of days to prevent their mail clients from downloading mail repeatedly.
Monday, May 11, 2009
ADSL outage not directly related to router issue
We are very aware that the ADSL issue remains.
We have had confirmation that the system delivery is sound, and our infrastructure is ready to deliver the routing and transit, however we are not receiving the inbound traffic.
This issue has now been confirmed to be unrelated to the router outage of late Saturday night.
We are pressing / waiting on switch/circuit owners between our IFL1 and IFL2 installations in Manchester to pick up the ball.
Saturday, May 09, 2009
Router Reboot
Router 2 was rebooted a short time ago, causing a brief interruption to network stability.
We are waiting on a new backup transit provider to deliver connection details to improve the reliability of the network and hope to have a new connection for router 1 completed by Monday.
UPDATE
The router has crashed twice and been rebooted twice. We're unable at this moment to identify the cause of the crash but suspect environmental issues at the colo facility.
We are working to migrate some of our backup transit routes to the remaining working routers to restore full routing to our network.
Further updates will follow.
Sunday, May 03, 2009
Network Issue
We are currently investigating what appears to be a network issue. Further information will be made available as we have it.
UPDATE
It would appear that a rogue AS advertisement is being made by an ISP that does not provide service to us. We are attempting to work through the various channels to have their network management centre correct the issue.
UPDATE
A ticket has been raised with the NOC of the ISP concerned, and we are awaiting the results of their investigations.
UPDATE
The issue has been identified as relating to a single upstream provider, so engineering are now working to ensure that the fixes are put in place with our supplier network to restore connectivity via that route.
Our network stats are showing working routes via two out of three upstream providers and all of our Manap peering. The faulty connection accounts for approximately 35% of our total capacity which, because of the nature of the fault, is not being re-routed to the working connections.
A final update should be available within the next 15 minutes once full service has been restored.
Friday, February 27, 2009
Master Mysql Server
We are investigating an issue causing the main MySQL server to reject new connections from the web servers. Further updates will follow once we have identified the cause.
UPDATE
The service has been restarted a few times now but with no apparent resolution to the issue. We are continuing to look for the cause of the issue.
UPDATE
The issue with the MySQL server has been tracked back to a faulty connection with one of the 4 DNS servers. All services have been restored and are running normally.
Monday, February 09, 2009
LVS load balancer reboot
The LVS load balancer has been rebooted after a crash. The downtime was 14 minutes.
All services appear to be running normally.
Tuesday, February 03, 2009
Mail server issues
We are currently investigating two issues with the mail servers.
1) Possible authentication issue for sending mail
2) Failure of the file server holding POP/IMAP and webmail.
The file server is being rebooted and we are tracing the cause of the authentication issue for sending email.
Further updates to follow.
UPDATE
No issues with mail authentication have been found. Inbound and outbound mail are working normally.
The POP/IMAP/webmail file server is having a disk check performed before we bring it live again. That should be completed shortly.
UPDATE
Mail storage services are all restored and working again. We have identified a single domain that has been mailbombed, so we are taking steps to reduce the impact of the attack on the mail servers.
All customer mail services should be working normally now.
Tuesday, January 20, 2009
DNS: Purge with flame
We have had some issues over the past hours with DNS reliability.
We have taken the harsh step of sanity filtering the DNS zone records, removing invalid entries that contain chars outside the scope of [a-z|A-Z|0-9|.] within the primary tuples, with the additional allowance of things like = and : in text records (under 128 chars). The majority of the invalid entries were trailing spaces and people putting things like http:// in as a CNAME.
Clearing these has resulted in a far better DNS platform in terms of throughput and reliability.
Please take this opportunity to check over any domains that are causing you issues today before contacting us with support queries.
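If you want to check your own records before contacting us, the filter described above can be approximated with the sketch below. It is an illustration only and not the tooling we ran: the function names and the (type, value) record layout are hypothetical, the upper-case range is assumed to be A-Z, and the character sets are taken literally from this entry.

```python
import re

# Character sets as described in this entry (upper-case range assumed to be A-Z).
NAME_OK = re.compile(r"[a-zA-Z0-9.]+")
TEXT_OK = re.compile(r"[a-zA-Z0-9.=:]+")
MAX_TEXT_LEN = 128  # text records are described as "under 128 chars"


def is_valid_record(rtype, value):
    """Return True if value passes the sanity filter for its record type."""
    if rtype.upper() == "TXT":
        return len(value) < MAX_TEXT_LEN and TEXT_OK.fullmatch(value) is not None
    return NAME_OK.fullmatch(value) is not None


def filter_zone(records):
    """Split (rtype, value) pairs into kept and rejected lists."""
    kept, rejected = [], []
    for rtype, value in records:
        target = kept if is_valid_record(rtype, value) else rejected
        target.append((rtype, value))
    return kept, rejected


# Hypothetical records: one clean, one with http:// as a CNAME, one with a trailing space.
kept, rejected = filter_zone([
    ("CNAME", "www.example.com"),
    ("CNAME", "http://example.com"),
    ("A", "192.0.2.1 "),
])
print("kept:", kept)
print("rejected:", rejected)
```

The two most common faults we found, a trailing space and http:// in a CNAME, both end up in the rejected list.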
Tuesday, January 06, 2009
Webmail / POP / IMAP
The earlier issue has recurred and we are currently working on a longer term fix. This is affecting a small section of users; the majority are not affected. For those it does affect, the inbound mail servers are not affected in any way, however the inboxes are currently offline, so your connections will either time out or fail. We are aware of the situation and will update this entry as soon as the issue is resolved.
[update 1620] The back end for those affected, while struggling initially with the load of retries, has settled down and is now servicing requests without issue.
Webmail / POP / IMAP
The file server which supports the webmail, POP and IMAP servers has crashed. IFL support staff will be rebooting it shortly, at which point we hope to restore services.
The next update will be 9am.
UPDATE
The server has been rebooted and checked. All appears to be well, so webmail/POP/IMAP services are now working.
Any mail delivered in the last 5 hours will be stored on our inbound mail servers. It should be delivered to the mail accounts over the next few hours.
Monday, December 15, 2008
IMAP issue this morning.
An IMAP server this morning has decided that being busy and slow is not its thing, and it has had enough. Having checked it over, we are running some tests before bringing it back online. If your IP address is being diverted to this machine, you will be seeing connection errors. Expect a request for your username and password when it comes back up, and all will be well. We envisage about 10 mins tops of downtime. Many thanks for your patience.
Thursday, November 20, 2008
Inbound email
When it rains it pours.
We have located another issue with one of the primary MX cluster members, and with the way another member uses DNS. Both occurred overnight (19th-20th November).
These have been addressed, and we are seeing mail processing in at an expected rate given the time of day.
Please do keep in mind the way the mail RFCs work: unless you have asked the admin at the other end to force a flush, mails retry with an increasing delay between attempts, so be patient, and newer mail may arrive before mail that was sent earlier. This is the way it is meant to work to assist with resilience and spread the load.
We are sorry for the delays, and are happy to say that there are now no issues.
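To show why newer mail can arrive before older mail, here is a rough sketch of a sending server's retry schedule. The numbers are made up for illustration (a 15 minute first delay that doubles each time, capped at 4 hours); real mail servers each use their own schedule, and the function names are hypothetical.

```python
from datetime import datetime, timedelta


def retry_times(first_attempt, base=timedelta(minutes=15),
                cap=timedelta(hours=4), attempts=6):
    """List the times at which a sender would try to deliver one message.

    Assumption: the delay doubles after every failed attempt, up to a cap.
    """
    times = [first_attempt]
    delay = base
    for _ in range(attempts - 1):
        times.append(times[-1] + delay)
        delay = min(delay * 2, cap)
    return times


def delivery_time(first_attempt, server_back_at, **kwargs):
    """Return the first attempt made once the receiving server is back, or None."""
    for attempt in retry_times(first_attempt, **kwargs):
        if attempt >= server_back_at:
            return attempt
    return None


if __name__ == "__main__":
    back = datetime(2008, 11, 20, 9, 0)  # receiving server recovers at 09:00
    old = delivery_time(datetime(2008, 11, 20, 6, 0), back)  # sent during the fault
    new = delivery_time(datetime(2008, 11, 20, 9, 5), back)  # sent after recovery
    print("older message delivered:", old)
    print("newer message delivered:", new)
```

In this sketch the message sent at 06:00 is next retried at 09:45, while the message sent at 09:05 is accepted straight away, so the newer one lands first.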
SMTP
Outbound email is currently being delayed.
Client abuse resulted in a huge spam run, which we have successfully cleared from the queue. We have contacted and blacklisted those involved.
Clearing the backlog is being hampered because, quite logically, users are sending themselves test messages or resending mails that have not arrived. While this may make sense on the scale of one person, upscale this a couple of hundred times, and there is a fair amount of mail in the queue that doesn't need to be there.
We are aware of the issue, and we are processing the queue as well as we are able. Once the queue has returned below 8K mails, we are confident it will start to recover of its own accord. Bear with us, and we will update you once we have the green light.
[update 0955] Blaming the tail end of yesterday's issue for this morning's issues quickly became a non starter. We have located the issues and made good. I have posted a new article to cover what happened.
Saturday, August 02, 2008
Load balancer Issue
The primary load balancer failed this morning.
Service has been reinstated and tested.
We are sorry for any inconvenience caused to the affected users.
Thursday, July 24, 2008
New & recently changed email accounts.
We are aware of an issue with slave replication that we hope to address today. This is affecting some email accounts that have been changed or created recently. As the fix requires replacing the out of step database and then bringing it back up to speed, we will need to reduce service before restoring it to full strength. Every effort will be made to minimise further complication.
[update 1225] - this has been addressed, and is now up to speed - thank you for your patience.
Wednesday, July 02, 2008
Issue with cluster 1
We are currently experiencing an issue with web hosting cluster 1. Cluster 2 and 3 are operating within tolerances if a little slow.
We are currently looking into storage issues, as yesterday an air conditioning failure in the data centre IFL2 resulted in a sizable heat spike for a prolonged period.
UPDATE 17:00
The cause of the issue has been traced to the file server and the underlying cause identified as an overheating CPU. We believe this is associated with failure of some air conditioning units in IFL2 yesterday afternoon.
We will be working on the racks over the next week to improve air flow through the servers as well as increasing our range of monitoring sensors to include more temperature sensors on servers.
Tuesday, June 10, 2008
Webclusters 150 and 156/212
It has become apparent that the work on the database server today has not resolved all of the issues on the above clusters. While the service is working better than it was earlier, the process list on the web servers is continuing to climb to the point that new connections are not permitted.
At this point we have eliminated the possibility that the fault is within the LVS load balancer (that was replaced this morning) and the SQL server (replaced 1pm with a new server). We have also eliminated the network as the possible source of the issue as virtual servers and the mail service are working without issue.
The only other element of the service which is now in question is the NFS file server. While there are no obvious errors being produced, we feel that it is the only possible cause of the issues left. A new file server is in the rack and we have just begun transferring data from the old server to the new. We expect that to be substantially completed within the next 4 hours.
UPDATE 8PM
The transfer of files to a new file server is underway and proceeding without issues. The web servers have been pointed to the new file server. Files are being restored from a to z so sites starting a and b have already been migrated. Judging by the first hour of transfer, we expect the process to complete in the early hours of the morning.
We would like to thank customers for their patience during this time.
UPDATE 1AM
We are approaching half way through the transfer of sites from the old file server to the new. We expect the remainder of the process to be completed by 6 to 8am.
Webmail services have been restored and are working without issue.
FTP access to the new file server will be suspended until mid morning Wednesday.
UPDATE 7AM
The file transfer is still running, albeit very slowly, with about 75% of sites completed. Clients with sites still not available can email our support address with any sites not showing so that we can push them by hand. Priority will be given to business sites.
UPDATE 11 AM
All of the remaining sites should be restored within the next 2 hours. Once that is complete, ftp access will be made available to the new file server. Customers will not need to change any settings in their ftp clients.
UPDATE 2PM
All transfers are completed and there appears to be stability at last. A few bugs with sites have cropped up during the process but they have been ironed out. If you are aware of any site which is not working correctly please raise the issue with our support mail address and we will investigate it.
In summary,
Mysql has moved to a new server, no changes required from customers.
File server is on new hardware, no customer changes needed.
FTP service is up and working, no change to FTP settings needed.
Webmail up and working.
We would again like to thank our customers for their patience during this issue.
Web Clusters 150, 156, 212
We are currently dealing with an issue on all web clusters whereby requests are not being serviced correctly resulting in slow page load times.
Our investigations over the last 12+ hours have drawn a blank with the result that there is no apparent reason for the poor performance of the web servers.
We are in the process of bringing in outside support this morning to double check our own investigations and are in the process of installing a new file server.
At this point, we have tried to eliminate all of the common points of failure in the web cluster service. The file server appears to be operating correctly but as it is such a major component in the delivery of the web pages, and given that we have no other potential sources for the problem we have decided to replace the file server as a precaution. We expect that work to be completed by lunchtime today.
UPDATE 9.40am
Further testing has identified that the bottleneck appears to be the SQL server not the File server as previously thought. Work to replace the file server has been suspended for the time being. We have a standby version 5 SQL server and are currently investigating the issues involved in making an upgrade from the current mysql4 service to mysql5 on the new server. If the issues appear to carry too high a risk, then we will prepare a Mysql 4 server. This work is expected to last for the next 3 hours.
UPDATE 1pm
The database tables and indexes on the existing sql server have all been repaired but this has not given us any significant performance increase. A new Mysql 4 server is being prepared to take the databases. Next update 2pm.
UPDATE 2PM
The new database server should be running within the next 10 to 15 minutes.
UPDATE 3PM
The main database has now migrated to a new multi core host server. Initial tests are showing that the load on the web servers is now more stable, with pages being served within normal times.
We will continue to monitor the service for the next 24 hours to ensure stability has been returned.
We would like to thank our customers for their patience during this outage and offer our sincere apologies for the intermittent service over the last 24 hours.
Monday, June 09, 2008
VPS/ vserver clients.
VPS/vserver users please be aware that unless you specifically have a managed account with us then it is your responsibility to keep the server as secure as possible.
We have had two incidents already this week where clients have failed to keep on top of patches and updates, resulting in machines being taken offline after they were used in attacks on other networks.
If you have any questions regarding this or could do with some advice, then drop us a mail to support and we will assist as much as we can - we would prefer you to be on top of the situation as opposed to it coming as unwelcome news.
Network congestion
We are currently experiencing an inbound UDP flood.
Once we have isolated the source we will be able to deal with it and return services to you. Technically the services are unaffected, however while the switches are so busy service will be slow or time out.