Re: 1/15/22 Outage #postmortem


Mike Hanauer
 

Mark, thanks so much for the complete explanation. Informative and interesting.

Consider Better, not Bigger. So many advantages. Just ask. USA adds a Chicago to our overpop each year.
"Still more population growth is not our way to a healthy community, a healthy planet, OR enjoyable cycling."

    ~Mike


On Tuesday, January 18, 2022, 03:14:31 PM EST, Mark Fletcher <markf@corp.groups.io> wrote:


SUMMARY: Between 2:44pm and 3:38pm Pacific Time on Saturday, January 15th, all aspects of the service, including the website and email delivery, were down. People visiting the website were redirected to the Maintenance page. Emails sent to the service were temporarily rejected, so that the sending servers would resend them later.

FAULT: The event was triggered when the main database stopped accepting internal network connections at 2:44pm Pacific Time.

DETECTION: The event was detected by our monitoring system, and a page was sent to me within one minute of this happening.

RESPONSE: I was away from home at the time, and did not have my laptop. Because of this, I did not have easy command-line access to the servers. I had to rely on our web-based internal dashboard systems to diagnose and fix the problem, using a third party's computer. I also had trouble changing the https://status.groups.io page and updating the Groups.io Twitter account.

RECOVERY: Once I had determined that the main database was not accepting internal network connections, I restarted the database, and the site came back on-line. No data was lost and no emails were lost.

CORRECTIVE ACTIONS: The main database has never frozen in this way in the past. I don't know why it froze, and there isn't anything in the logs. Given that the uptime of the database is measured in hundreds of days, at this point I don't see any changes that need to be made. If this happens again soon, I will re-evaluate.

In situations like this, I normally rely on command-line access to the Groups.io servers, but I did not easily have that in this instance. Our internal dashboards were missing some features that would have helped diagnose this problem and would have reduced the amount of downtime.

A major focus is to eliminate my need for command-line access to the machines in order to diagnose issues with the site. I have embarked on adding features to our internal web-based dashboards to address this, including the following:

  • An integrated database connection test dashboard, to more easily highlight where connection issues are occurring
  • Access to more database logs within the dashboard
  • Additional database statistics dashboards
  • Better display of logs within the dashboard, especially on mobile devices
  • Adding and testing a one-button Status Page/Tweet update function
  • Several bug fixes to our dashboards that only became apparent during this situation
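
To illustrate the first item, a connection test could be as simple as the sketch below. It is only an illustration under stated assumptions, not the Groups.io dashboard code: the names check_tcp, run_checks and CheckResult are hypothetical, and any host names and ports passed in would be placeholders. It only checks whether a TCP connection can be opened within a timeout, which is the symptom described in the FAULT section; a database that accepts connections but does not answer queries would need a deeper check.

```python
import socket
import time
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class CheckResult:
    name: str                    # label shown on the dashboard (hypothetical)
    ok: bool                     # True if the TCP connection was opened
    latency_ms: Optional[float]  # connect time in milliseconds, None on failure
    error: Optional[str]         # short error description, None on success


def check_tcp(name: str, host: str, port: int, timeout: float = 2.0) -> CheckResult:
    """Try to open (and immediately close) a TCP connection to host:port."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            pass
    except OSError as exc:  # covers refused connections, timeouts, DNS errors
        return CheckResult(name, False, None, f"{type(exc).__name__}: {exc}")
    elapsed_ms = (time.monotonic() - start) * 1000.0
    return CheckResult(name, True, elapsed_ms, None)


def run_checks(targets: List[Tuple[str, str, int]], timeout: float = 2.0) -> List[CheckResult]:
    """Run check_tcp for each (name, host, port) target, in order."""
    return [check_tcp(name, host, port, timeout) for name, host, port in targets]
```

Running such a check from each internal server that talks to the database, rather than from a single machine, is what would show where connections are failing.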

I expect these features to be completed this week.

Mark
