Re: Emergency downtime #downtime

Glenn Glazer

On 2/22/2020 09:05, Mark Fletcher wrote:

.... and we are back again.

Daily summaries did not go out for the past 3 hours because the database query to get summary subscriptions was erroring out with an out of memory error from the database that indicated some kind of corruption on the database again. This was a different problem than the one I saw yesterday. Instead of just restarting the database, I restarted the entire machine this time. If this doesn't fix it, I will have to do something more serious. Again, it doesn't appear that there's any corruption or loss of data in the database; right now this appears to just be a machine issue.

Thanks, Mark

Something we do at work is to use database replication to form a chain of three database hosts: alpha-beta-gamma.  If something goes wrong with the alpha host, we simply take it offline, promote the beta to alpha (e.g., change DNS, undo the read-only status of the beta), promote gamma to beta and then debug the old alpha offline and/or just make a new gamma.

I realize that there is a cost involved in this that you may or may not be able to sustain, but perhaps a two-host alpha-beta model would work as well.
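
For illustration, the promotion step in that chain could be sketched roughly as follows. This is a toy in-memory model in Python, not real failover tooling: the `Host` and `Chain` classes, the `dns_target` field, and the host names are all hypothetical, and an actual setup would rely on the database's own replication and promotion commands plus a real DNS change.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Host:
    name: str
    read_only: bool = True  # replicas stay read-only until promoted


@dataclass
class Chain:
    # hosts[0] plays the alpha role, hosts[1] beta, hosts[2] gamma, and so on.
    hosts: List[Host]
    dns_target: str = ""  # stand-in for the DNS name clients connect to

    def __post_init__(self) -> None:
        self._sync_roles()

    def _sync_roles(self) -> None:
        # Only the head of the chain accepts writes; DNS points at it.
        for position, host in enumerate(self.hosts):
            host.read_only = position != 0
        self.dns_target = self.hosts[0].name if self.hosts else ""

    def fail_over(self) -> Host:
        """Take the current alpha offline and promote everything behind it."""
        if len(self.hosts) < 2:
            raise RuntimeError("no replica available to promote")
        failed = self.hosts.pop(0)   # old alpha leaves the chain
        failed.read_only = True      # keep it from accepting writes while debugged
        self._sync_roles()           # beta becomes alpha, gamma becomes beta
        return failed

    def add_replica(self, host: Host) -> None:
        """Attach a fresh host at the tail of the chain (the new gamma)."""
        host.read_only = True
        self.hosts.append(host)


chain = Chain([Host("db1"), Host("db2"), Host("db3")])
old_alpha = chain.fail_over()       # db1 is pulled out for offline debugging
chain.add_replica(Host("db4"))      # rebuild the chain back to three hosts
```

After the failover in this sketch, `db2` is the writable head of the chain, `db3` sits behind it, and `db4` takes the tail position. With only two hosts (the alpha-beta variant), `fail_over` still works once, but there is no spare replica left until a new one is added.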



PG&E Delenda Est
