Outage report #downtime


Hi All,

The site was down between 6:38pm and 7:03pm and again between 9:51pm and 10:23pm Pacific time.

The original outage was due to a large digest being generated and sent to several hundred people. This exposed a new out-of-memory issue in our message queue system, which caused it to crash. Every part of the site depends on the message queue, so when it went down, it took the entire site down with it. Unfortunately, the cats maintenance page also did not come up automatically when this happened, because I had not correctly configured the new load balancer that I installed last week. As a result, people going to the website saw a cryptic error message.

Due to spotty cell service, I did not receive any pages about the original outage until about 6:50pm. I made an initial diagnosis and got the site back up within 13 minutes. To do that, I had to clear out the message queue, after first saving an offline backup. The backup contained approximately 500 large unsent digest messages. I took the next few hours to analyze the situation and verify that my initial hunch was correct (and that it wasn't some other issue).

To fix the issue of the message queue running out of memory, I needed to upgrade the machine it runs on. That was the cause of the second, intentional outage, starting at 9:51pm. Once that machine was upgraded, I was able to restore the backed-up message queue and send out the digest messages. Because the system was down at the time, the normal 10pm digest run was delayed until 10:30pm. As far as I can tell, no messages were lost because of this outage.

I have also fixed the configuration issue with the load balancer, so the cats maintenance page will now come up if the site has a problem.

Thanks, Mark
