Event: Data center power loss #outage - Wednesday, 20 June 2018 #outage #cal-invite

main@beta.groups.io Calendar <main@...>
 

Data center power loss #outage

When:
Wednesday, 20 June 2018 9:30pm to
Thursday, 21 June 2018 12:39am
(GMT-07:00) America/Los Angeles

Description:

Summary

On June 20 at approximately 9:30pm, Linode's Fremont datacenter lost Internet connectivity, effectively taking the site off-line. Connectivity was restored after midnight, and the site was brought back on-line around 12:39am on June 21. Linode says that a power outage was responsible, but that's all the information they've given. More than half of the machines in the Groups.io cluster were rebooted during this process. All machines came back up without issues.

Action Items

I was not paged when the site went down; I happened to notice it at about 10pm. The system I use to check whether the entire site is reachable failed to notify me in this instance. I need to fix that.

Groups.io is hosted in only one datacenter. To avoid this type of downtime in the future, a multi-datacenter setup will be needed. I have a technical path to get there, but it greatly complicates the system. Given that this is only the second time in four years that the datacenter has gone down, moving to a multi-datacenter setup is low priority right now.

Thanks, Mark

 

Mark,
Thanks for the invite to this outage. Darn, I missed it! ;p
Seriously: as long as no data was lost, I agree with the low priority.
Thanks for the info!
--
J

 

Messages are the sole opinion of the author, especially the fishy ones.

I wish I could shut up, but I can't, and I won't. - Desmond Tutu

William Finn
 

Thanks for the update Mark.

Have you considered a resiliency product that replicates your systems into the cloud and fails them over automatically?



Dave Sergeant
 

On 21 Jun 2018 at 9:52, main@beta.groups.io Calendar wrote:

"Data center power loss #outage" Event
Thanks Mark. Yes, we noticed this here in the UK. I realised a message
I had posted to a group hadn't gone through; then, when I needed to check
the bouncing status of one of our members, I found the website was off line.
But it was up again soon after breakfast, so most of our members weren't
even aware.

As you say, it is so rare that you probably don't need to address it.
But I imagined you would be having a rather disturbed night's sleep
over there in the States...

73 Dave G3YMC

http://davesergeant.com

Mark Irving
 

I found the web site not working this morning, too, and checked the useful status page to see what had happened. As a recent immigrant from Y! Groups, I continue to be pleasantly surprised by Groups.io. It's not that it never goes wrong, although it's far more reliable than Y!, but that the level of information available is far better. This seems the right moment to say thank you to Mark Fletcher for his dedication and competence.

  - Mark, Cambridgeshire, UK

Dan Hartford
 

Mark,

Thanks for the update and if you don't mind, I have a few comments (having been a data center manager in a past life).

1)  Yes, by all means fix your notification system that pages you if it can't connect.

2)  You should be concerned about using a single data center in Fremont, CA (if that's where it is), which is directly on top of the Calaveras Fault - considered the most likely to generate a significant quake in the Bay Area.

3)  I'm surprised that a commercial data center, as I presume it is, does not have automatic backup power that instantly swaps to battery when the PG&E grid has a hiccup and then the batteries run the system till diesel generators start up and kick in.  Perhaps a different data center company should be looked at.

4)  Having said all of that, this is a "messaging" system.  It's not a hospital or bank card processing system, or airline reservation system.  So, having an outage for a few hours once in a while is not the end of the world.  It is an inconvenience at most.  Now, an outage of several days or weeks as Yahoo had a month or so ago is a big deal (and why I moved from Yahoo to Groups.io) and must be avoided.  This should be of special concern with a datacenter on a major earthquake fault.  Contracting with a data center company which has multiple sites across the country and offers 24-hour recovery at an alternate data center site is something you should look into.  In these plans the data center company is on the hook to perform the transfer and recovery when a failure occurs (including IP addresses and all the other details) - not you.  Moving that responsibility to the data center outfit may be way easier than building your own multi-center architecture.

Anyway, just thought I'd chime in.  

Cheers -- Dan