As promised, here is the post-mortem for the downtime from a couple of weekends ago.
The site was mostly unreachable for two periods the weekend of 2/9-10/19. The periods were:
The underlying cause of the downtime was that not enough connections were available to the main database via pgbouncer, our connection pooler. This caused a sort of death spiral, with more and more connection attempts being made and failing, until the database basically became unreachable.
We ran out of connections, ironically, because of work I did to off-load some work from the main database onto a replica. This was not fully configured yet, and inadvertently caused a lot of extra connections to be made to the main database, tipping it over.
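To illustrate how a half-configured replica setup can push a pooler past its limit, here is a rough back-of-the-envelope sketch in Python. Every number and name in it (the process count, pool size, and connection limit) is a made-up assumption for illustration, not the site's actual configuration, and it shows only one way the extra connections could arise: the new "replica" pool still pointing at the main database.

```python
# Hypothetical connection budget check. All values are illustrative assumptions.

def connections_needed(app_processes: int, pools_per_process: int, pool_size: int) -> int:
    """Worst-case number of connections the app can open to one database."""
    return app_processes * pools_per_process * pool_size


def over_budget(needed: int, limit: int) -> bool:
    """True when worst-case demand exceeds what the pooler will hand out."""
    return needed > limit


POOLER_LIMIT = 150  # assumed cap on connections to the main database

# Before the change: each process holds one pool, aimed at the main database.
before = connections_needed(app_processes=20, pools_per_process=1, pool_size=5)

# After the change: each process holds a second pool meant for the replica,
# but if that pool still resolves to the main database, demand doubles.
after = connections_needed(app_processes=20, pools_per_process=2, pool_size=5)

print(before, over_budget(before, POOLER_LIMIT))
print(after, over_budget(after, POOLER_LIMIT))
```

With these assumed numbers, demand fits under the limit before the change and exceeds it afterwards, which is the general shape of the failure described above.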
Coincidentally, and unrelated to the database problem, two of our Elasticsearch nodes crashed during the first outage, causing further slowdowns to the site. The nodes crashed because they ran out of heap space, due, I believe, to a memory leak of some kind.
I was not paged for the first outage, because the main process that would have paged me was misconfigured, and therefore wasn't sending any pages. For the second outage, I did not have my phone's volume up enough, and slept through several of the pages. :-/
Here are the things I've done to address these issues:
The calendar has been updated with both outages.