Reduced email delivery overnight #outage

 

Outage Summary

Email delivery was dramatically slowed overnight due to a crashing bug in the karl process. I fixed the bug around 6:25am Pacific time, and it took about 25 minutes for all queued email to be delivered.

Duration

From approximately midnight Pacific time until 6:25am Pacific time.

Cause

A group was transferred from Yahoo yesterday with a lot of users on a bogus/typo domain, yahooo.com. That domain has a blank MX record, and we were not ignoring blank MX records. As a result, when sending email to those users we would connect to the localhost SMTP port, essentially DoSing ourselves. I was alerted to this behavior on Friday evening via our normal alerts, wrote a fix, and pushed it to the site. Because of how we retry sending messages, the code introduced with the fix did not run until after I went to bed. I did not consider the fix risky, but it contained a divide-by-zero error that caused a crash. The karl process then continued to crash until I woke up and saw the problem. I was not alerted to the crashes.
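
For illustration only, here is a minimal sketch of the kind of MX filtering involved, assuming a Go-based delivery path. The function and names are hypothetical, not the actual karl code: it resolves a recipient domain's MX records and skips blank or null entries so delivery is never attempted against localhost.

    package main

    import (
        "fmt"
        "net"
    )

    // usableMXHosts resolves a domain's MX records and drops entries whose
    // host is blank or "." (a null MX), since handing those to an SMTP
    // client can end up targeting our own local SMTP port.
    func usableMXHosts(domain string) ([]string, error) {
        mxs, err := net.LookupMX(domain)
        if err != nil {
            return nil, err
        }
        var hosts []string
        for _, mx := range mxs {
            if mx.Host == "" || mx.Host == "." {
                continue // blank/null MX: do not attempt delivery
            }
            hosts = append(hosts, mx.Host)
        }
        if len(hosts) == 0 {
            return nil, fmt.Errorf("no usable MX hosts for %s", domain)
        }
        return hosts, nil
    }

    func main() {
        hosts, err := usableMXHosts("yahooo.com")
        if err != nil {
            fmt.Println("skipping delivery:", err)
            return
        }
        fmt.Println("delivering via:", hosts)
    }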

Areas Of Improvement
  • I was not alerted to the crashes because our alerts are emailed to PagerDuty by karl itself. Alerts should instead be sent directly to PagerDuty through their API rather than via email (a rough sketch follows below).
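
As a rough sketch of the direct-to-PagerDuty approach, assuming their Events API v2: the routing key, event fields, and function names here are placeholders, not our actual integration.

    package main

    import (
        "bytes"
        "encoding/json"
        "fmt"
        "net/http"
    )

    // pdEvent mirrors the PagerDuty Events API v2 trigger payload.
    type pdEvent struct {
        RoutingKey  string    `json:"routing_key"`
        EventAction string    `json:"event_action"`
        Payload     pdPayload `json:"payload"`
    }

    type pdPayload struct {
        Summary  string `json:"summary"`
        Source   string `json:"source"`
        Severity string `json:"severity"`
    }

    // triggerPagerDuty posts an alert straight to PagerDuty over HTTPS,
    // with no dependency on our own email pipeline.
    func triggerPagerDuty(routingKey, summary string) error {
        body, err := json.Marshal(pdEvent{
            RoutingKey:  routingKey,
            EventAction: "trigger",
            Payload:     pdPayload{Summary: summary, Source: "karl", Severity: "critical"},
        })
        if err != nil {
            return err
        }
        resp, err := http.Post("https://events.pagerduty.com/v2/enqueue",
            "application/json", bytes.NewReader(body))
        if err != nil {
            return err
        }
        defer resp.Body.Close()
        if resp.StatusCode != http.StatusAccepted {
            return fmt.Errorf("pagerduty returned %s", resp.Status)
        }
        return nil
    }

    func main() {
        // Placeholder routing key; the real key comes from the PagerDuty service config.
        if err := triggerPagerDuty("ROUTING_KEY_HERE", "karl process is crash-looping"); err != nil {
            fmt.Println("alert failed:", err)
        }
    }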

Thanks, Mark

Ro
 

It seems that Karl is a troublemaker and a bully. Thanks for dealing with him!


Ro

Very impressive. Just imagine, if you will, Yahoo creating a report like this for every outage.

https://www.youtube.com/watch?v=NzlG28B-R8Y

--
J

Messages are the sole opinion of the author. Especially the fishy ones.

I wish I could shut up, but I can't, and I won't. - Desmond Tutu

Christopher Hallsworth
 

Wow, thanks for the report. I've never seen any service but this one generate reports of this kind; in fact, I've never seen any other service generate reports at all. Thanks once again for a job well done.
