Sononaco: The Blog

A Holiday “Gift” From the Server Gremlins

If you have been following our Twitter stream, you will know we have been having problems with one of our web servers. Bad things happen. This was about the worst-case scenario we could envision.

Fortunately, we had backups of everything up to about 12 hours before the crash. Everything has been restored, all sites are up, and we are getting all of the accounts properly configured.

Here is how it all went down:

On Wednesday at approximately 3PM we received notices that sites were behaving erratically – database connections were dropping and required libraries were not being loaded. When we tried to log on to the server, the connections were refused.

We called the server room and hooked up a terminal, which showed a kernel panic (similar to the Windows Blue Screen of Death or the Mac’s “grey screen with white text”), so we rebooted the machine. It never came back up.

Our techs then ran a file system check to make sure the files were not corrupt. The system passed the check, but the server still hit a kernel panic on boot-up.

We attempted to recover the files directly on the server, but the repeating kernel panic prevented us from booting it.

So we provisioned a new server and began the time-consuming process of restoring from the full backup taken on December 23rd. This would give us all of the files that were on the server the day it went down.

After the backup restore was complete, we began restoring each site individually, which has now been completed.

We are now in the process of updating the databases and files from the incremental backups performed over the last few days.
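For the curious, here is a rough sketch of the idea behind restoring from a full backup and then layering incremental backups on top. It is only an illustration, not our actual backup tooling: the `restore` function, the file paths, and the sequence numbers are all made up for this example, and a deleted file is represented as `None`.

```python
def restore(full_backup, incrementals):
    """Rebuild the latest state from one full backup plus incrementals.

    full_backup:  dict mapping path -> contents (hypothetical format)
    incrementals: list of (sequence, changes) pairs, where changes maps
                  path -> new contents, or None if the file was deleted
    """
    state = dict(full_backup)  # copy so the full backup is left untouched
    # Apply incrementals oldest first so newer changes win.
    for _, changes in sorted(incrementals, key=lambda item: item[0]):
        for path, contents in changes.items():
            if contents is None:
                state.pop(path, None)  # file was removed after the full backup
            else:
                state[path] = contents
    return state


# Hypothetical data: one full backup and two incrementals, listed out of order.
full = {"site/index.html": "v1", "site/old.html": "v1"}
incrementals = [
    (2, {"site/index.html": "v3"}),
    (1, {"site/index.html": "v2", "site/old.html": None, "site/new.html": "v1"}),
]
restored = restore(full, incrementals)
print(restored)
```

The order matters: applying the incrementals out of sequence could leave an older version of a file on top of a newer one, which is why the sketch sorts them before applying.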

How do we prevent this in the future? We have installed a server with a new architecture and chipset, newer and more reliable hard drives, and an upgraded RAID system. And yes, we are continuing to back up everything on the server.

We were 8 days shy of this server being up for 1,000 days. We have never had a downtime experience that has lasted this long. Bad things happen. Electronics break. But we can make sure procedures are in place to minimize data loss and get your information back online as soon as possible.

Thank you for your patience and understanding while we worked to restore your data.

Downtime issues 6/16 around 3AM

In the wee hours of Thursday there came a bump in the night at our server host. The details of the outage are explained below.

In short, Cox Communications – an upstream internet provider for our hosting company – did something stupid by allocating a block of IP addresses to themselves. This essentially redirected the IP addresses to Cox’s network, preventing the traffic from reaching the hosting company.

=====

At approximately 3:20 AM EDT, CARI.net internal monitoring began reporting problems with DNS resolution. The problem was immediately escalated to our on call senior network admins. Due to the nature of the problem, remote access was not possible to resolve the issue, onsite access would be required. Once onsite it was established that two of our upstream Bandwidth providers (Level3 and COX) were not passing traffic, however the connections themselves were functional. Both providers were contacted and tickets were opened with tier 1 support. Working with Level3, we were able to jointly identify that the problem was originating from the COX network.

COX was apparently routing 3 of CARI.net’s 5 IP allocations incorrectly causing traffic to be dropped in the COX network.

At 5:30 AM COX’s on call Hi-Cap engineer contacted us. Since this was a routing problem he had to transfer the issue to the routing group. At 6:06 AM the on call COX routing engineer contacted us and confirmed what we already knew and stated that he would work on the problem and call us back. At 6:40 AM CARI.net internal monitoring indicated that DNS was once again functioning and some traffic was once again flowing to Level3. At 7:05 AM COX called back indicating that the problem was fixed.

COX will be working to create a full report of the incident. We will not be using the COX service until we receive this report. During the outage, all of CARI.net’s services were internally functioning normally.

=====
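To picture what a routing mistake like this does, here is a small sketch of a route lookup using longest-prefix matching. The report does not say exactly how the routes were wrong, so treat this as a generic illustration only: the address blocks are documentation-only ranges, and the next-hop names (`hosting-company`, `upstream-internal`, `internet`) are invented for the example.

```python
import ipaddress


def build(entries):
    """Turn (prefix, next_hop) string pairs into a simple routing table."""
    return [(ipaddress.ip_network(prefix), hop) for prefix, hop in entries]


def lookup(routes, address):
    """Return the next hop for address using longest-prefix match, or None."""
    addr = ipaddress.ip_address(address)
    matches = [(net, hop) for net, hop in routes if addr in net]
    if not matches:
        return None
    # The most specific matching prefix (largest prefix length) wins.
    return max(matches, key=lambda match: match[0].prefixlen)[1]


# Hypothetical table while everything is working.
healthy = build([
    ("0.0.0.0/0", "internet"),
    ("192.0.2.0/24", "hosting-company"),
    ("198.51.100.0/24", "hosting-company"),
])

# Same table after the upstream wrongly points one block at itself.
broken = build([
    ("0.0.0.0/0", "internet"),
    ("192.0.2.0/24", "upstream-internal"),
    ("198.51.100.0/24", "hosting-company"),
])

print(lookup(healthy, "192.0.2.10"))
print(lookup(broken, "192.0.2.10"))
```

In the broken table, traffic for the affected block never leaves the upstream network, even though the servers behind it are running normally. Blocks that were not touched keep working, which fits an outage that hit only some of the allocations.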

Mail System Upgraded to Version 7.0

The mail server upgrade has been completed. You will find a few things have been moved around.

1. The instant messaging system is still there.

2. Your Documents have been moved into the Briefcase.

3. New Zimlets are activated. Please check Preferences > Zimlets to enable them if they are not enabled already. The new “social” Zimlet has been in hot demand.

4. The new “Carbon” theme is beautiful. Find it under Preferences > General Theme

Enjoy the new system!

Mail server upgrade scheduled for Memorial Day Weekend

Several months ago, Zimbra 7.0, a major upgrade to our mail server software, was released. We are currently running version 6.

I have been reluctant to upgrade the server because several of the changes concerned me, most notably the absence of the Instant Messaging and Documents tabs/applications. Plus, mission-critical operations such as e-mail are not the place to jump blindly onto the latest and greatest. When was the last time the mail server was down for something other than maintenance? Right: February 23rd, 2011. And before that? 1,010 days prior. Not a bad track record.

So now the software is at version 7.1. That “dot-1” is key to me because it means the release-version bugs have been fixed. During testing I found out how to re-enable the instant messaging application and how to access the Documents, which have been moved to the Briefcase application.

Unless there are objections, I am hoping to apply this upgrade to the mail server over Memorial Day weekend. If you have any questions or concerns, or wish to try a demo of the new software, please let me know and I will set up a demo account.

All information © 2010 Sononaco, Inc.