Tuesday, June 10, 2008
Web Clusters 150, 156, 212
We are currently dealing with an issue on all web clusters whereby requests are not being serviced correctly, resulting in slow page load times.
Our investigations over the last 12+ hours have drawn a blank: we have found no apparent reason for the poor performance of the web servers.
We are bringing in outside support this morning to double-check our own investigations, and we are installing a new file server.
At this point, we have tried to eliminate all of the common points of failure in the web cluster service. The file server appears to be operating correctly, but because it is such a major component in the delivery of web pages, and because we have no other potential sources for the problem, we have decided to replace it as a precaution. We expect that work to be completed by lunchtime today.
UPDATE 9.40am
Further testing has identified that the bottleneck appears to be the SQL server, not the file server as previously thought. Work to replace the file server has been suspended for the time being. We have a standby MySQL 5 server and are currently investigating the issues involved in upgrading from the current MySQL 4 service to MySQL 5 on the new server. If the upgrade appears to carry too high a risk, we will prepare a MySQL 4 server instead. This work is expected to take the next 3 hours.
UPDATE 1pm
The database tables and indexes on the existing SQL server have all been repaired, but this has not given us any significant performance increase. A new MySQL 4 server is being prepared to take the databases. Next update 2pm.
UPDATE 2pm
The new database server should be running within the next 10 to 15 minutes.
UPDATE 3pm
The main database has now been migrated to a new multi-core host server. Initial tests show that the load on the web servers is now more stable, with pages being served within normal times.
We will continue to monitor the service for the next 24 hours to ensure that stability has been restored.
We would like to thank our customers for their patience during this outage and offer our sincere apologies for the intermittent service over the last 24 hours.