Monday, December 19, 2011
Routing Issue
We've been investigating two reports of end users not able to reach our network. In both cases the points of failure appear to be well outside of our network. As yet we have been unable to identify the network at the heart of the issue.
If anything develops we'll advise here, but at this point we're not seeing any changes in network usage or border routes.
Thursday, December 08, 2011
Rack A5 power trip
The power supply to rack A5 tripped after the power supply in server 64 failed.
Power has been restored and all servers in A5 are working as normal.
Services (POP/IMAP mail) from server 64 have transferred to servers 65 and 66 via the cluster management system.
Sunday, December 04, 2011
POP/IMAP/Webmail Servers
The file server responsible for storing mail data has crashed three times in the last three days.
Investigations have shown one of the drives to be generating IO errors.
At midnight the failed drive will be replaced and resilvered.
UPDATE 9am Monday
The drive replacement / resilvering is 50% completed. The file server is operating in a degraded state, so performance is lower than normal.
We expect the process to be completed by 6pm.
Until that time the mail servers will run with a higher than normal load and may be less responsive than usual.
UPDATE 1pm Monday
Work to bring the file server up to speed with a full disk set should now be completed by 4pm.
During the work we identified that one of the POP cluster servers has a faulty network connection. Further investigation will continue this afternoon.
UPDATE 3pm Monday
The resilvering of the replaced drive is due to complete at 4.30pm.
While monitoring the server loads it has become increasingly apparent that a number of customers with POP accounts are storing mail on the server for considerable periods of time. POP is not well suited to storing mail for longer periods, and it is apparent that the loads on the POP servers are being caused by this inappropriate use of the POP accounts.
We will be performing a review of POP usage over the next week and advising customers where their usage needs to be modified.
UPDATE 8pm Monday
The file system completed the rebuild at 4.30pm. Server loads all returned to normal within a few minutes of the work being completed.
We will monitor the situation over the next 24 hours to ensure there are no further issues.