Downtime post-mortem for outages on 2/9/19 and 2/10/19 #outage



As promised, here is the post-mortem for the downtime of a couple of weekends ago.

The site was mostly unreachable for two periods the weekend of 2/9-10/19. The periods were:

  • 2/9/19 6:52am - 7:25am
  • 2/10/19 4:17am - 5:12am

The underlying cause of the downtime was that not enough connections were available to the main database via pgbouncer, our connection pooler. This caused a sort of death spiral, with more and more connection attempts happening and failing, until the database basically became unreachable.
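To illustrate the death spiral, here is a toy model in Python. It is only a sketch under simplifying assumptions, not a description of how pgbouncer or our application actually behaves: time moves in ticks, each connection is held for one tick, and every failed attempt is retried on the next tick. The function name and the numbers are made up for illustration.

```python
def simulate_attempts(pool_size, arrivals_per_tick, ticks):
    """Return the number of connection attempts made on each tick.

    Toy model (all assumptions, not measured behavior):
    - pool_size connections can be served per tick
    - arrivals_per_tick new requests show up every tick
    - every request that fails to get a connection retries next tick
    """
    backlog = 0  # failed attempts carried over as retries
    attempts_history = []
    for _ in range(ticks):
        attempts = arrivals_per_tick + backlog
        served = min(attempts, pool_size)
        backlog = attempts - served
        attempts_history.append(attempts)
    return attempts_history


# Demand slightly above pool capacity: attempts grow every tick.
print(simulate_attempts(pool_size=100, arrivals_per_tick=120, ticks=5))
# Demand below pool capacity: attempts stay flat.
print(simulate_attempts(pool_size=150, arrivals_per_tick=120, ticks=5))
```

In this model, once demand exceeds the pool size even slightly, the retries pile on top of new requests and the number of attempts never recovers on its own. Raising the pool size above demand keeps the load flat.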

We ran out of connections, ironically, because of changes I made to off-load some work from the main database onto a replica. The replica setup was not fully configured yet, and it inadvertently caused a lot of extra connections to be made to the main database, tipping it over.

Coincidentally, and unrelated to the connection problem, two of our Elasticsearch nodes crashed during the first outage, causing further slowdowns to the site. The nodes crashed because they ran out of heap space, due, I believe, to a memory leak of some kind.

I was not paged for the first outage, because the main process that would have paged me was misconfigured, and therefore wasn't sending any pages. For the second outage, I did not have my phone's volume up enough, and slept through several of the pages. :-/

Here are the things I've done to address these issues:

  • Greatly increased the number of available connections to the main database.
  • Fixed the issue that caused me not to be paged for the first outage.
  • Finished the configuration for using the replica to off-load the main database for some queries.
  • Set the Elasticsearch process to automatically restart when it crashes.

The calendar has been updated with both outages.

Thanks, Mark



What about a clone or two to prevent the “sleeping through”???  Or cyborg.