If you have been following our Twitter stream, you will know we have been having problems with one of our web servers. Bad things happen, and this was about the worst-case scenario we could envision.
Fortunately we had backups of everything since about 12 hours before the crash. Everything has been restored, all sites are up and we are getting all of the accounts properly configured.
Here is how it all went down:
On Wednesday at approximately 3 PM we received notices that sites were behaving erratically – database connections were dropping and required libraries were not being loaded. When we tried to log on to the server, the connections were refused.
We called the server room and hooked up a terminal, which showed a kernel panic – similar to the Windows Blue Screen of Death or the Mac’s “grey screen with white text”. We rebooted the machine, and it never came back.
Our techs then ran a file system check to make sure the files were not corrupt. The system passed the check but on boot-up the server would kernel panic.
We attempted to resurrect the files directly on the server but the repeating kernel panic prevented us from booting the server.
So we provisioned a new server and began the time-consuming process of restoring the backups from the full backup on December 23rd. This would give us all of the files that were on the server the day it went down.
After the backups were complete, we started restoring each site individually. That step is now finished.
We are now in the process of updating the databases and files from the incremental backups performed over the last few days.
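For readers curious how a full backup and incremental backups fit together, here is a minimal sketch of the ordering logic. It is an illustration only, not the tooling we used: the `Backup` type, the `restore_plan` function, and the dates (including the placeholder year) are all hypothetical.

```python
from datetime import date
from typing import NamedTuple


class Backup(NamedTuple):
    taken_on: date
    kind: str  # "full" or "incremental"


def restore_plan(backups, target):
    """Return the backups to apply, oldest first, to reach the state at `target`."""
    # Only backups taken on or before the target date are usable.
    candidates = sorted(b for b in backups if b.taken_on <= target)
    fulls = [b for b in candidates if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup available on or before the target date")
    base = fulls[-1]  # the most recent full backup is the starting point
    # Incrementals taken after the base are layered on top in date order.
    incrementals = [
        b for b in candidates
        if b.kind == "incremental" and b.taken_on > base.taken_on
    ]
    return [base] + incrementals


# Hypothetical backup history; the year is a placeholder.
backups = [
    Backup(date(2000, 12, 16), "full"),
    Backup(date(2000, 12, 20), "incremental"),
    Backup(date(2000, 12, 23), "full"),
    Backup(date(2000, 12, 24), "incremental"),
    Backup(date(2000, 12, 25), "incremental"),
]
plan = restore_plan(backups, date(2000, 12, 25))
for step in plan:
    print(step.taken_on.isoformat(), step.kind)
```

The point of the sketch is that the most recent full backup goes down first and the incrementals are layered on top in date order, which is why the last few days of changes arrive after the sites themselves are back.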
How do we prevent this in the future? We have moved to a new server architecture and chipset, newer and more reliable hard drives, and an upgraded RAID system. And yes, we are continuing to back up everything on the server.
We were 8 days shy of this server being up for 1,000 days. We have never had a downtime experience that has lasted this long. Bad things happen. Electronics break. But we can make sure procedures are in place to minimize data loss and get your information back online as soon as possible.
Thank you for your patience and understanding while we worked to restore your data.
LOS ANGELES: Fox Studios reported today that the last two episodes of the popular sitcom Glee, set to air later this month, have been lost due to a “catastrophic” failure at their primary data center.