On 2/22/2020 09:05, Mark Fletcher wrote:
.... and we are back again.
Daily summaries did not go out for the past 3 hours because the
database query to get summary subscriptions was erroring out
with an out of memory error from the database that indicated
some kind of corruption on the database again. This was a
different problem than the one I saw yesterday. Instead of just
restarting the database, I restarted the entire machine this
time. If this doesn't fix it, I will have to do something more
serious. Again, it doesn't appear that there's any corruption or
loss of data in the database; right now this appears to just be
a machine issue.
Something we do at work is to use database replication to form a
chain of three database hosts: alpha-beta-gamma. If something goes
wrong with the alpha host, we simply take it offline, promote the
beta to alpha (e.g., change DNS, undo the read-only status of the
beta), promote gamma to beta and then debug the old alpha offline
and/or just make a new gamma.
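
To make the promotion order concrete, here is a rough sketch in Python of the chain described above. It is only a toy model: the Host class, the fail_over function and the db1..db4 names are all made up for illustration, and the real steps (changing DNS, clearing the read-only flag on the database itself, re-pointing replication) are not shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Host:
    name: str        # hypothetical machine name, e.g. "db1"
    read_only: bool  # replicas are read-only; only the alpha accepts writes


def fail_over(chain: List[Host]) -> Host:
    """Remove the failed alpha (chain[0]) and promote each remaining host one step.

    Returns the old alpha so it can be debugged offline.
    """
    if len(chain) < 2:
        raise ValueError("no replica available to promote")
    old_alpha = chain.pop(0)     # take the failed alpha out of the chain
    old_alpha.read_only = True   # it must not accept writes while offline
    chain[0].read_only = False   # the old beta is now the alpha
    return old_alpha             # the old gamma is now chain[1], i.e. the beta


# alpha-beta-gamma, in replication order
chain = [Host("db1", False), Host("db2", True), Host("db3", True)]
failed = fail_over(chain)
chain.append(Host("db4", True))  # optionally build a new gamma
print([h.name for h in chain])   # ['db2', 'db3', 'db4']
```

The point of keeping the hosts in an ordered list is that "promote" is just a shift by one position; the only host whose state really changes is the new alpha, which stops being read-only.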
I realize that there is a cost involved in this that groups.io may
or may not be able to sustain, but perhaps an alpha-beta model would
PG&E Delenda Est