Re: Reduced email delivery overnight #outage


Christopher Hallsworth <challsworth2@...>
 

Wow, thanks for the report. I've never seen any service but this one generate reports of this kind. In fact, I've never seen any other service generate any reports at all. Thanks once again for a job well done.

On 4 Feb 2017, at 15:26, Mark Fletcher <markf@corp.groups.io> wrote:

Outage Summary

Email delivery was dramatically slowed overnight due to a crashing bug in the karl process. I fixed the bug around 6:25am Pacific time, and it took about 25 minutes for all queued email to be delivered.

Duration

From approximately midnight Pacific time until 6:25am Pacific time.

Cause

A group was transferred from Yahoo yesterday with a lot of users at a bogus/typo domain, yahooo.com. That domain has a blank MX record, and we were not ignoring blank MX records. As a result, for email to those users we would connect to the SMTP port on localhost to send the message, essentially DoSing ourselves.

I was alerted to this behavior on Friday evening via our normal alerts, wrote a fix, and pushed it to the site. Because of how we retry sending messages, the code introduced with the fix did not run until after I went to bed. I did not consider the fix risky, but it contained a divide by zero, which caused a crash. The karl process then kept crashing until I woke up and saw the problem. I was not alerted to the crashes.
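
For readers wondering what "ignoring blank MX records" might look like, here is a minimal sketch in Python. It is not the actual groups.io fix, which is not shown in this report. The function name `usable_mx_hosts`, the `LOOPBACK_NAMES` set, and the `(preference, exchange)` tuple format are all assumptions made for illustration. The idea is to drop any MX target that is empty, is the "." null MX, or points back at the local machine, so that the sender never falls through to connecting to its own SMTP port.

```python
from typing import Iterable, List, Tuple

# Assumed list of targets that would make the sender connect to itself.
LOOPBACK_NAMES = {"localhost", "127.0.0.1", "::1"}


def usable_mx_hosts(records: Iterable[Tuple[int, str]]) -> List[str]:
    """Return MX hosts in preference order, skipping unusable targets.

    `records` is a hypothetical list of (preference, exchange) pairs as
    returned by whatever DNS lookup the mailer already performs.
    """
    hosts = []
    for _preference, exchange in sorted(records, key=lambda r: r[0]):
        # Normalize: trim whitespace, drop the trailing root dot, lowercase.
        host = exchange.strip().rstrip(".").lower()
        if not host:
            # Blank exchange, or a null MX whose exchange is just ".".
            continue
        if host in LOOPBACK_NAMES:
            # Never deliver to ourselves.
            continue
        hosts.append(host)
    return hosts
```

In this sketch an empty result means the caller should treat the address as undeliverable instead of attempting a connection with an empty hostname.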

Areas Of Improvement

• I was not alerted to the crashes because our alerts are emailed to PagerDuty using karl. Alerts should be sent directly to PagerDuty using their API instead of via email.
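
As a rough illustration of sending an alert over HTTP instead of through the mail path, the sketch below assumes PagerDuty's Events API v2 and uses only the Python standard library. It is not the groups.io implementation. The helper names `build_trigger_event` and `send_event` are hypothetical, and the routing key is a placeholder that would come from a PagerDuty service integration.

```python
import json
import urllib.request

# Events API v2 endpoint (assumed to be the API in use).
PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_trigger_event(routing_key, summary, source, severity="critical"):
    """Build the JSON body for a 'trigger' event."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,    # short human-readable description
            "source": source,      # host or component raising the alert
            "severity": severity,  # critical, error, warning, or info
        },
    }


def send_event(event, timeout=10.0):
    """POST the event and return the HTTP status code."""
    body = json.dumps(event).encode("utf-8")
    request = urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A monitor could then call `send_event(build_trigger_event("YOUR_ROUTING_KEY", "karl is crashing", "mail-01"))`, where the key, summary, and host name are placeholders. Because the request does not pass through the mail system, a mail outage would not suppress it.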
Thanks, Mark
