Emergency downtime #downtime


 

Hi All,

The database is misbehaving again. I need to restart the database machine to see if that fixes it. The site will be down for about 5-10 minutes, starting at 8:50am Pacific Time.

Thanks, Mark


 

Looking forward to cat pics! Hoping you get that misbehaving database under control. :-)
--
J

Messages are the sole opinion of the author, especially the fishy ones.
My humanity is bound up in yours, for we can only be human together. - Desmond Tutu


 

.... and we are back again.

Daily summaries did not go out for the past 3 hours because the database query to get summary subscriptions was erroring out with an out of memory error from the database that indicated some kind of corruption on the database again. This was a different problem than the one I saw yesterday. Instead of just restarting the database, I restarted the entire machine this time. If this doesn't fix it, I will have to do something more serious. Again, it doesn't appear that there's any corruption or loss of data in the database; right now this appears to just be a machine issue.

Thanks, Mark


Glenn Glazer
 

On 2/22/2020 09:05, Mark Fletcher wrote:

> .... and we are back again.
>
> Daily summaries did not go out for the past 3 hours because the database query to get summary subscriptions was erroring out with an out of memory error from the database that indicated some kind of corruption on the database again. This was a different problem than the one I saw yesterday. Instead of just restarting the database, I restarted the entire machine this time. If this doesn't fix it, I will have to do something more serious. Again, it doesn't appear that there's any corruption or loss of data in the database; right now this appears to just be a machine issue.
>
> Thanks, Mark


Something we do at work is to use database replication to form a chain of three database hosts: alpha-beta-gamma.  If something goes wrong with the alpha host, we simply take it offline, promote the beta to alpha (e.g., change DNS, undo the read-only status of the beta), promote gamma to beta and then debug the old alpha offline and/or just make a new gamma.

I realize that there is a cost involved to this that groups.io may or may not be able to sustain, but perhaps an alpha-beta model would work also.
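
To make the promotion steps above concrete, here is a rough sketch in Python. It only models the bookkeeping in memory and touches no real database or DNS. The names `Host`, `ReplicationChain`, `fail_over`, `add_replica`, and `dns_target` are made up for illustration and are not from any particular tool; a real setup would replace each step with the replication and DNS commands of whatever database is in use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Host:
    name: str
    read_only: bool = True  # replicas reject writes until promoted


class ReplicationChain:
    """Ordered chain of hosts; index 0 plays the 'alpha' (primary) role."""

    def __init__(self, hosts: List[Host]) -> None:
        if not hosts:
            raise ValueError("chain needs at least one host")
        self.hosts = list(hosts)
        self.hosts[0].read_only = False
        # Hypothetical stand-in for the DNS record the application connects to.
        self.dns_target = self.hosts[0].name

    def fail_over(self) -> Host:
        """Take the primary offline and move every remaining host up one step."""
        if len(self.hosts) < 2:
            raise RuntimeError("no replica available to promote")
        old_primary = self.hosts.pop(0)
        old_primary.read_only = True        # fenced off for offline debugging
        new_primary = self.hosts[0]
        new_primary.read_only = False       # undo the replica's read-only status
        self.dns_target = new_primary.name  # "change DNS"
        return old_primary

    def add_replica(self, host: Host) -> None:
        """Append a fresh read-only replica to the end of the chain (a new 'gamma')."""
        host.read_only = True
        self.hosts.append(host)


chain = ReplicationChain([Host("db1"), Host("db2"), Host("db3")])
failed = chain.fail_over()      # db1 goes offline, db2 becomes alpha, db3 becomes beta
chain.add_replica(Host("db4"))  # db4 is the new gamma
print(chain.dns_target, [h.name for h in chain.hosts])
```

With only two hosts (the alpha-beta model), `fail_over` still works once, but the chain is then left with a single host until `add_replica` is called, which is where the extra cost of a third host buys some breathing room.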

Best,

Glenn

--
PG&E Delenda Est


 

On Sat, Feb 22, 2020 at 9:10 AM Glenn Glazer <glenn.glazer@...> wrote:

> Something we do at work is to use database replication to form a chain of three database hosts: alpha-beta-gamma. If something goes wrong with the alpha host, we simply take it offline, promote the beta to alpha (e.g., change DNS, undo the read-only status of the beta), promote gamma to beta and then debug the old alpha offline and/or just make a new gamma.

We do run a hot standby, which also serves some queries. I'm trying to decide next steps, should a problem reappear, and one possibility is to switch over to the hot standby. That wouldn't be instantaneous: since we don't currently have a 'gamma' as in your example, I'd need to spin up a new hot standby before bringing the site back. I may bring one up today.

Thanks,
Mark