Here's a summary of what happened during the email outage on Monday and the steps I'm taking to address it.
I noticed around 9am that karld, the program that sends our email, had crashed and restarted several times during the night. I had not been paged. When I checked the mail queue, I found a large number of unsent emails, so I began to investigate.
karld was running out of memory and restarting, and it could send only a few emails before each restart. I initially believed the cause was the roughly 600k messages to gmail.com addresses waiting to be sent (of a total of 900k messages in the queue), and that we simply didn't have enough memory to hold a queue that large. At the time I thought the underlying problem was an issue with delivering email to Gmail.
As a short-term measure, I greatly increased the swap on the main email sending machine to try to clear the message backlog. I also decreased the number of outbound email connections. This started to clear the backlog, but tripped another issue: the machine was now running out of socket resources, which had never happened before. I increased the kernel socket resource limits, and that allowed the backlog to continue to drain. But I noticed that email to gmail.com was not being sent in any significant numbers, so the backlog was not draining quickly. Restarting the karld process would cause some gmail.com messages to be sent, so I began restarting the process every few minutes while I tried to puzzle out what was happening. That led me to two interrelated bugs.
The first bug came from a change I made last week: we were including a large amount of extra, unneeded data with every recipient of each message. This greatly increased the memory usage of the system. It did not cause a problem until sometime early Monday morning, when enough messages had accumulated in the unsent queue that karld started taking up too much memory.
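To illustrate the kind of memory growth involved, here is a minimal sketch in Python. It is not karld's code: the Message and Recipient classes and the build_recipients helper are hypothetical names, and this summary does not say what the extra data was or how it was removed. The sketch only shows one common way to keep per-recipient records small, which is to have every recipient point at a single shared message object instead of carrying its own copy of message-level data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    """Message-level data, stored once per message (hypothetical structure)."""
    message_id: str
    headers: dict
    body: str


@dataclass
class Recipient:
    """Per-recipient record: only the address plus a reference to the message."""
    address: str
    message: Message  # a reference to the shared object, not a copy


def build_recipients(message, addresses):
    # Every recipient record points at the same Message instance, so the
    # per-recipient memory cost does not grow with the size of the message.
    return [Recipient(address=a, message=message) for a in addresses]


msg = Message("m1", {"Subject": "hello"}, "x" * 10_000)
recipients = build_recipients(msg, ["a@gmail.com", "b@gmail.com", "c@example.com"])

# True when no recipient holds its own copy of the message data.
shared = all(r.message is msg for r in recipients)
```

With a queue of hundreds of thousands of messages, any data duplicated per recipient is multiplied by the number of queued recipients, which is why a bug like this can stay invisible until the queue grows.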
But that didn't explain why messages had stopped being sent in the first place, which I had initially attributed to an intermittent Gmail delivery issue. In reality, it was caused by a second bug, also introduced by the change I made last week. We have to track how many concurrent connections we have to each email service, because many services restrict the number of connections we can have to them. This second bug caused an accounting issue where we would lose track of how many open connections we had with each email service. In some cases, we thought we had open connections when we actually had none. Specifically, this was triggered for Gmail, and it meant that we were sending very few messages to them. That caused the large number of unsent messages, which in turn caused the out-of-memory issue.
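For readers unfamiliar with this kind of accounting, here is a minimal Python sketch of per-service connection tracking. It is an illustration under stated assumptions, not karld's implementation: ConnectionTracker, its limits, and the simulated failure are all hypothetical, and this summary does not say exactly how the count drifted. One common cause of this class of bug is an error path that skips the decrement, so the sketch releases the slot in a finally block.

```python
from collections import defaultdict
from contextlib import contextmanager


class ConnectionTracker:
    """Counts open outbound connections per email service (hypothetical)."""

    def __init__(self, limits, default_limit=10):
        self.limits = dict(limits)        # e.g. {"gmail.com": 2}; values are illustrative
        self.default_limit = default_limit
        self.open_counts = defaultdict(int)

    def can_open(self, service):
        limit = self.limits.get(service, self.default_limit)
        return self.open_counts[service] < limit

    @contextmanager
    def connection(self, service):
        if not self.can_open(service):
            raise RuntimeError(f"connection limit reached for {service}")
        self.open_counts[service] += 1
        try:
            yield
        finally:
            # Always give the slot back, even if delivery raised. Skipping
            # this on an error path leaves a phantom "open" connection.
            self.open_counts[service] -= 1


tracker = ConnectionTracker({"gmail.com": 2})

# Simulate a delivery attempt that fails partway through.
try:
    with tracker.connection("gmail.com"):
        raise ConnectionError("simulated delivery failure")
except ConnectionError:
    pass

# The failed attempt must not leave the counter stuck above zero.
leaked = tracker.open_counts["gmail.com"]
```

If the count is never released, the tracker eventually believes the service is at its limit and stops opening new connections to it, which matches the symptom described above: almost no mail going to one service while restarts temporarily reset the count.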
I fixed these bugs, and the email queue was back to normal by around 2pm.
Changes Already Made
Changes Still To Be Made
Please let me know if you have any questions.