Downtime this morning #outage
Well, that was fun. Here's what I know right now. At 8:28am Pacific time, one of the backend machines appeared to freeze up in a weird way. This machine takes all changes to the main database (new messages, new activity logs, etc.) and inserts them into the search cluster. For reasons I don't yet know, and by a mechanism I don't understand, this set off a chain of events that started eating up all connections to the main database. That effectively took the site down at 8:34am, which is when I got paged the first time. It took me some time to figure out that the machine was frozen and to reboot it. The site came back at 8:52am.
The site is now functioning normally, and all email sent to groups during the outage should have been queued and resent once the site was back.