Outage report #downtime


Hi All,

The site was off-line from 2:02pm through 2:38pm. The process that sends email ran out of memory while sending a large number of large email digests. Each time it ran out of memory, it caused the machine to reboot. This machine also runs our load balancer, which is responsible for distributing incoming email and web traffic. With this machine rebooting constantly, the site was effectively off-line.

After evaluating the options, I decided that the quickest way to get back on-line was to upgrade that machine to an instance with substantially more memory. The majority of the downtime was caused by that upgrade. Once the upgrade was complete, the machine came back on-line, and it is currently processing the backlog of messages.

If you'll recall, a similar thing happened a few months ago. I thought I had made changes to the software to prevent this from happening again. Guess not. So I will be looking at that. I will also be moving the load balancer off this machine and onto its own dedicated machine.


