Downtime on Monday, September 21, 2020 #postmortem


 

Hi All,

Here's a summary of what happened during the email outage on Monday and the steps I'm taking to address it.

What Happened

I noticed around 9am that the program that sends emails, called karld, had crashed and restarted several times during the night. I had not been paged. When I checked the mail queue, I saw that a large number of emails had not been sent. I began to investigate.

karld was running out of memory and restarting, and it could send only a few emails before each restart. I initially believed this was because about 600k messages to gmail.com addresses were waiting to be sent (of a total of 900k messages in the queue), and we simply didn't have enough memory to support that. At the time, I thought the main problem was an issue with delivering email to Gmail.

As a short-term measure, I greatly increased the swap on the main email sending machine to try to clear the message backlog. I also decreased the number of outbound email connections. This started to clear the backlog but tripped another issue: the machine was now running out of socket resources, which had never happened before. I increased the kernel socket resource limits, and that allowed the backlog to continue to drain. But I noticed that email to gmail.com was not being sent in any significant numbers, so the backlog was not draining quickly. Restarting the karld process would cause some gmail.com messages to be sent, so I began restarting the process every few minutes while I tried to puzzle out what was happening. I discovered two inter-related bugs.

First, a change I made last week introduced a bug where we were including a large amount of extra, unneeded data with every recipient for each message. This greatly increased the memory usage of the system. This did not cause a problem until sometime early Monday morning, when enough messages were in the unsent queue that karld started taking up too much memory.
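
To illustrate why per-recipient overhead only becomes a problem once the queue grows, here is a rough back-of-the-envelope sketch in Python. It is not karld's code, and the byte figures are invented for the example; the post does not say how much extra data was attached or how much memory the machine has.

```python
def queue_memory_bytes(queued_entries, base_bytes, extra_bytes=0):
    """Approximate memory held by queued recipient records."""
    return queued_entries * (base_bytes + extra_bytes)


def max_queue_before_limit(memory_limit_bytes, base_bytes, extra_bytes=0):
    """Largest number of queued entries that fits under the memory limit."""
    return memory_limit_bytes // (base_bytes + extra_bytes)


# Hypothetical sizes: 200 bytes per entry normally, 2000 extra bytes with the bug.
QUEUED = 900_000  # queue size mentioned in the post
normal = queue_memory_bytes(QUEUED, base_bytes=200)
bloated = queue_memory_bytes(QUEUED, base_bytes=200, extra_bytes=2000)
print(normal)   # 180000000
print(bloated)  # 1980000000
```

With a small queue, both figures are negligible, which is consistent with the bug going unnoticed until the backlog built up.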

But that didn't explain the initial reason that messages were not being sent, which I had initially thought was caused by an intermittent Gmail delivery issue. In reality, it was caused by a second bug, also caused by the change I made last week. We have to track how many concurrent connections we have to each email service, because many services restrict the number of connections we can have to them. This second bug was causing an accounting issue where we would lose track of how many open connections we had with each email service. In some cases, we thought we had open connections when we actually had none. Specifically, this was triggered for Gmail, and it meant that we weren't sending many messages to them. That caused the large number of unsent messages, which in turn caused the out-of-memory issue.
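
The connection accounting described above can be sketched in Python as follows. This is only an illustration, not karld's implementation: the class name `ConnectionTracker`, the limits, and the default are all assumptions. It shows the failure mode (a slot that is reserved but never released makes the service look permanently busy) and one common way to guard against it, by releasing the slot in a `finally` block.

```python
import threading
from contextlib import contextmanager


class ConnectionTracker:
    """Counts open outbound connections per email service (hypothetical helper)."""

    def __init__(self, limits, default_limit=10):
        self._limits = dict(limits)   # e.g. {"gmail.com": 2}; values are made up
        self._default = default_limit
        self._open = {}               # service -> connections we believe are open
        self._lock = threading.Lock()

    def open_count(self, service):
        with self._lock:
            return self._open.get(service, 0)

    def try_acquire(self, service):
        """Reserve a slot; return False if the service is at its limit."""
        with self._lock:
            limit = self._limits.get(service, self._default)
            current = self._open.get(service, 0)
            if current >= limit:
                return False
            self._open[service] = current + 1
            return True

    def release(self, service):
        """Give a slot back; never let the count go negative."""
        with self._lock:
            current = self._open.get(service, 0)
            if current > 0:
                self._open[service] = current - 1

    @contextmanager
    def connection(self, service):
        """Yield True if a slot was reserved; always release it on exit."""
        acquired = self.try_acquire(service)
        try:
            yield acquired
        finally:
            if acquired:
                self.release(service)
```

If a code path calls `try_acquire` and then exits without calling `release`, the count for that service stays high even though no connection is open, and every later `try_acquire` for it returns False. Wrapping each delivery in `with tracker.connection("gmail.com") as ok:` releases the slot even when delivery raises an exception.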

I fixed these bugs, and the email queue was back to normal by around 2pm.

Changes Already Made

  • I fixed the two bugs at the root of the downtime.
  • I re-enabled monitoring of karld, which I had disabled last week for an unrelated issue.
  • During the downtime, I discovered that status.groups.io was not working. I fixed the DNS issue causing that and that website is back up.
  • I fixed several inaccurate metrics on the karld internal dashboard, which will help diagnose issues in the future.
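
As an illustration of the kind of check that could page on repeated restarts like the ones that went unnoticed overnight, here is a minimal Python sketch. It assumes restart times are available as timestamps; the function name `should_page`, the threshold, and the window are hypothetical and do not describe the actual groups.io monitoring setup.

```python
from datetime import datetime, timedelta


def should_page(restart_times, max_restarts=3, window=timedelta(minutes=30)):
    """Return True if at least max_restarts restarts fall within one window."""
    times = sorted(restart_times)
    start = 0
    for end in range(len(times)):
        # Shrink the window from the left until it spans at most `window`.
        while times[end] - times[start] > window:
            start += 1
        if end - start + 1 >= max_restarts:
            return True
    return False


# Example with made-up timestamps: three restarts within 20 minutes.
burst = [
    datetime(2020, 9, 21, 2, 0),
    datetime(2020, 9, 21, 2, 10),
    datetime(2020, 9, 21, 2, 20),
]
print(should_page(burst))  # True
```

A single restart, or restarts spread hours apart, would not trigger the page under these assumed settings.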

Changes Still To Be Made

  • I am going to accelerate the addition of a second karld instance. Emails from beta are already being sent from the new instance.
  • I am going to make it easier to add a status banner to the website. Right now it requires a software release, which is an unnecessary speed bump when I'm diagnosing a problem.
  • I am going to automate posting of issues to the status.groups.io page. Right now, it has to be done by hand, which means it doesn't get done.

Please let me know if you have any questions.

Thanks, Mark


Andy
 

What is "status.groups.io"?

When I go there, it says (among other things):

    Sep 20, 2020
    No incidents reported.

Seriously?

I guess you're saying things get reported only when you manually enter them, which tends to be never.  Got it.

Andy


RCardona
 

Andy,

Reread the 21 Sept message... status.groups.io is listed under the section, "Changes Still to be Made."

Give the man some time to implement it.

Rob


On Sat, Sep 26, 2020 at 05:18 PM, Andy wrote:
What is "status.groups.io"?
When I go there, it says (among other things):
    Sep 20, 2020
    No incidents reported.

Seriously?
I guess you're saying things get reported only when you manually enter them, which tends to be never.  Got it.

Andy