Re: Outage report #downtime


 

Sounds like progress, tech-wise and biz-wise!

J

Sent from my iPhone

On Jun 16, 2017, at 2:55 PM, Mark Fletcher <markf@corp.groups.io> wrote:

Hi All,

The site was off-line from 2:02pm through 2:38pm. The process that sends email ran out of memory when a large number of large email digests were sent. Each time it ran out of memory, it caused the machine to reboot. This machine also runs our load balancer, which is responsible for distributing incoming email and web traffic. With this machine rebooting constantly, the site was effectively off-line.

After evaluating the options, I decided that the quickest way to get back on-line was to upgrade that machine to an instance with substantially more memory. The majority of downtime was caused by that upgrade. Once the upgrade was complete, the machine came back on-line and is currently processing the backlog of messages.

If you'll recall, a similar thing happened a few months ago. I thought I had made changes to the software to prevent this from happening again. Guess not. So I will be looking at that. I will also be moving the load balancer off of this machine and onto its own dedicated machine.

Thanks,

Mark


--
J

Messages are the sole opinion of the author. Especially the fishy ones.

I wish I could shut up, but I can't, and I won't. - Desmond Tutu
