Sononaco: The Blog

A Holiday “Gift” From the Server Gremlins

If you have been following our Twitter stream, you will know we have been having problems with one of our web servers. Bad things happen. This was about the worst-case scenario we could envision.

Fortunately we had backups of everything since about 12 hours before the crash. Everything has been restored, all sites are up and we are getting all of the accounts properly configured.

Here is how it all went down:

On Wednesday at approximately 3PM we received notices that sites were behaving erratically – database connections were dropping and required libraries were not being loaded. When we tried to log on to the server, the connections were refused.

We called the server room and hooked up a terminal, which showed a kernel panic – similar to the Windows Blue Screen of Death or the Mac’s “grey screen with white text” – so we rebooted the machine. It never came back up.

Our techs then ran a file system check to make sure the files were not corrupt. The system passed the check, but on boot-up the server would still kernel panic.

We attempted to resurrect the files directly on the server but the repeating kernel panic prevented us from booting the server.

So we provisioned a new server and began the time-consuming process of restoring from the full backup taken on December 23rd. This would give us all of the files that were on the server the day it went down.

Once the restore from backup was complete, we started restoring each site individually. That work has now been completed.

We are now in the process of updating the databases and files from the incremental backups performed over the last few days.
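For the curious, here is a rough sketch of how a full backup and a run of incremental backups combine into the latest state. It is only an illustration, not our actual backup tooling: the `restore` function, the file names and the day numbers are all made up for this example.

```python
def restore(full_backup, incrementals):
    """Rebuild the latest file state from one full backup plus incrementals.

    full_backup:  dict mapping path -> content at the time of the full backup
    incrementals: list of (day, changes) pairs, where changes maps
                  path -> new content, or path -> None for a deleted file
    """
    files = dict(full_backup)  # start from a copy of the full backup
    # Apply the incrementals oldest first so that later changes win.
    for _, changes in sorted(incrementals, key=lambda item: item[0]):
        for path, content in changes.items():
            if content is None:
                files.pop(path, None)   # file was deleted after the full backup
            else:
                files[path] = content   # file was added or modified
    return files


# Hypothetical data: one full backup, then two incrementals (out of order).
full = {"index.html": "v1", "about.html": "v1"}
incrementals = [
    (25, {"index.html": "v3"}),
    (24, {"index.html": "v2", "about.html": None}),
]

restored = restore(full, incrementals)
print(restored)  # {'index.html': 'v3'}
```

The order matters: applying the incrementals out of sequence would leave older versions of files on top of newer ones.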

How do we prevent this in the future? We have moved to a new server architecture and chipset, with newer, more reliable hard drives and an upgraded RAID system. And yes, we are continuing to back up everything on the server.

We were 8 days shy of this server being up for 1,000 days. We have never had a downtime experience that has lasted this long. Bad things happen. Electronics break. But we can make sure procedures are in place to minimize data loss and get your information back online as soon as possible.

Thank you for your patience and understanding while we worked to restore your data.

Downtime issues 6/16 around 3AM

In the wee hours of Thursday there came a bump in the night at our server host. The details of the outage are explained below.

In short, Cox Communications – an upstream internet provider for our hosting company – did something stupid by allocating a block of IP addresses to themselves. This essentially redirected the IP addresses to Cox’s network, preventing the traffic from reaching the hosting company.
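If you want a picture of why that breaks things, here is a toy routing table. It is a simplified sketch under our own assumptions, not a description of how Cox’s routers are configured: the `route` function, the destination labels and the address block (a range reserved for documentation) are all hypothetical.

```python
import ipaddress


def route(table, address):
    """Return where traffic for `address` is sent, or None if no route matches.

    `table` maps an ip_network to a destination label. When several
    networks contain the address, the most specific (longest) prefix wins.
    """
    ip = ipaddress.ip_address(address)
    candidates = [net for net in table if ip in net]
    if not candidates:
        return None
    best = max(candidates, key=lambda net: net.prefixlen)
    return table[best]


# Hypothetical block standing in for one of the hosting company's allocations.
block = ipaddress.ip_network("203.0.113.0/24")

before = {block: "hosting company"}
after = {block: "upstream provider's own network"}  # the mistaken entry

print(route(before, "203.0.113.10"))
print(route(after, "203.0.113.10"))
```

Nothing about the server changes between the two lookups. Only the table entry does, which is why the sites were unreachable even though everything at the hosting company was running normally.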

=====

At approximately 3:20 AM EDT, CARI.net internal monitoring began reporting problems with DNS resolution. The problem was immediately escalated to our on call senior network admins. Due to the nature of the problem, it could not be resolved remotely; onsite access was required. Once onsite, it was established that two of our upstream bandwidth providers (Level3 and COX) were not passing traffic, although the connections themselves were functional. Both providers were contacted and tickets were opened with tier 1 support. Working with Level3, we were able to jointly identify that the problem was originating from the COX network.

COX was apparently routing 3 of CARI.net’s 5 IP allocations incorrectly causing traffic to be dropped in the COX network.

At 5:30 AM COX’s on call Hi-Cap engineer contacted us. Since this was a routing problem he had to transfer the issue to the routing group. At 6:06 AM the on call COX routing engineer contacted us and confirmed what we already knew and stated that he would work on the problem and call us back. At 6:40 AM CARI.net internal monitoring indicated that DNS was once again functioning and some traffic was once again flowing to Level3. At 7:05 AM COX called back indicating that the problem was fixed.

COX will be working to create a full report of the incident. We will not be using the COX service until we receive this report. During the outage, all of CARI.net’s services were internally functioning normally.

=====

Mother’s Day Mail Server Upgrade

We will be upgrading the mail server this weekend by applying the latest security patch. For the most part these upgrades are extremely smooth, only requiring about 20-30 minutes of downtime.

The upgrade will be performed Saturday or Sunday. If you have any questions about this please contact us.

This upgrade will have *no* effect on web sites or e-commerce.

Server downtime April 17th

In the wee hours of Sunday morning April 17th something went wrong. But first, a little history and a quick lesson.

First a little Computers 101: Each file and folder has an owner and a set of permissions. The owner “owns” the file or folder, and the permissions control which actions other users can take on it, such as reading, writing and executing. This is important for later.
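For those who like to see it for themselves, here is a small sketch that reads the owner and permissions of a file on a Unix-like system. The `describe` helper and the `example.txt` file are made up for this illustration.

```python
import os
import stat
import tempfile


def describe(path):
    """Return (owner user id, permission string) for a file or folder."""
    info = os.stat(path)
    # stat.filemode turns the numeric mode into the familiar "ls -l" form.
    return info.st_uid, stat.filemode(info.st_mode)


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.txt")  # hypothetical file
    with open(path, "w") as handle:
        handle.write("hello")
    # Owner may read and write, the group may read, everyone else gets nothing.
    os.chmod(path, 0o640)
    owner, permissions = describe(path)
    print(owner, permissions)
```

The permission string reads left to right: the file type, then three characters each for the owner, the group and everyone else.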

Earlier this week, as we do every week, we applied security patches to our hosting servers. Usually this process goes so smoothly no one ever knows it’s happening or has happened. The process appeared flawless and we went about our business.

Fast forward to Sunday morning at 4AM. That’s when our servers run through their weekly maintenance routines (cleaning up logs, clearing out caches, rotating logs). It usually lasts about 5 minutes and is a most unspectacular event. That is, unless there is a problem.

The security patch earlier in the week introduced a tiny little bug when it updated the module that handles all of the security routines of the web site. In human speak, it’s the thing that makes web addresses that begin with “https” secure.

When the upgrade was applied, the “aliases” folder in the web root had its permissions and ownership changed. During the maintenance routines the system tried to restart the Apache web server. With the permissions and ownership different on the “aliases” folder, the server could not be restarted.
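A check along these lines, run after patching, is one way to catch this kind of drift before a restart does. It is a sketch only: the `permission_problems` helper, the expected mode of `755` and the temporary “aliases” folder are assumptions for illustration, not the real values on our servers or anything Apache itself runs.

```python
import os
import stat
import tempfile


def permission_problems(path, expected_mode, expected_uid):
    """Return a list of differences between actual and expected settings."""
    info = os.stat(path)
    problems = []
    actual_mode = stat.S_IMODE(info.st_mode)  # keep only the permission bits
    if actual_mode != expected_mode:
        problems.append(f"mode is {actual_mode:o}, expected {expected_mode:o}")
    if info.st_uid != expected_uid:
        problems.append(f"owner is uid {info.st_uid}, expected uid {expected_uid}")
    return problems


with tempfile.TemporaryDirectory() as web_root:
    aliases = os.path.join(web_root, "aliases")  # stand-in for the real folder
    os.mkdir(aliases)
    os.chmod(aliases, 0o755)
    me = os.getuid()

    print(permission_problems(aliases, 0o755, me))  # nothing to report

    os.chmod(aliases, 0o700)  # simulate a patch changing the permissions
    print(permission_problems(aliases, 0o755, me))  # reports the changed mode
```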

You may think it is kind of stupid that a silly permissions issue would prevent the server from restarting, but it’s a good thing. We don’t want to grant access to everything on the server. That would be bad.

So Sunday morning we spent our time changing permissions and restarting the services. We apologize for the downtime and appreciate your understanding.

All information © 2010 Sononaco, Inc.