It's been a couple of days since the unplanned downtime on Tuesday. Here's a summary of what happened and what I've done to address it.
- I was paged at approximately 6:50am on Tuesday; the alert indicated that all web servers were down. The page and alert were correct and timely.
- It took me about an hour to diagnose the problem. The web servers were running out of database connections and restarting constantly. The cause appeared to be a change I had made to the chat system the previous day.
- I disabled chat and restarted all the web servers, and the site came back up.
- Email delivery was not affected during this time.
- I believe I've fixed the bug that caused chat to take up all the database connections, although without a stress test (i.e., a large group of people using it at once) I can't be certain.
- I have partitioned the chat system so that even if it has a resource leak like this in the future, it will not take down the web servers; it would only affect chat.
- I have implemented a connection pooling system in front of the main database system and reevaluated all database connection timeouts, to make the overall system more resilient to any sort of database connection leak.
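
For anyone curious how the last two items fit together, here is a minimal sketch of the general idea, not the code actually running on the site. It assumes a fixed-size pool per subsystem with a timeout on acquiring a connection. The names `BoundedPool`, `PoolExhausted`, `web_pool`, and `chat_pool` are hypothetical, the pool sizes are made up, and an in-memory `sqlite3` database stands in for the real one.

```python
import queue
import sqlite3
from contextlib import contextmanager


class PoolExhausted(Exception):
    """Raised when no connection becomes free within the timeout."""


class BoundedPool:
    """Fixed-size pool: at most `size` connections are ever handed out."""

    def __init__(self, name, size, connect, acquire_timeout=1.0):
        self.name = name
        self.acquire_timeout = acquire_timeout
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self):
        try:
            # Fail fast instead of waiting forever for a free connection.
            conn = self._idle.get(timeout=self.acquire_timeout)
        except queue.Empty:
            raise PoolExhausted(f"{self.name}: no free connection") from None
        try:
            yield conn
        finally:
            # Always hand the connection back, even if the caller raised,
            # so an error in request handling cannot leak it.
            self._idle.put(conn)

    def idle_count(self):
        return self._idle.qsize()


def connect():
    # Stand-in for the real database (assumption for this sketch).
    return sqlite3.connect(":memory:", check_same_thread=False)


# Separate pools per subsystem: chat can exhaust its own pool,
# but it can never consume the connections reserved for the web servers.
web_pool = BoundedPool("web", size=4, connect=connect, acquire_timeout=0.1)
chat_pool = BoundedPool("chat", size=2, connect=connect, acquire_timeout=0.1)

with web_pool.connection() as conn:
    conn.execute("SELECT 1").fetchone()
```

With this layout, a leak in chat shows up as `PoolExhausted` errors inside chat only, while `web_pool` keeps serving requests. The acquire timeout is what turns a silent pile-up of stuck requests into an immediate, visible error.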