HA WTF?

Started by deanwebb, August 25, 2015, 12:39:55 PM


deanwebb

Got a high-availability device cluster... the secondary has had a slightly flaky history, but it's always, "wait a few minutes, and it'll come back", so nothing is done for it.

Friday, did an SP upgrade on it. Primary took the upgrade just fine, the secondary... well, it lost HA and reports as "disconnected". Can't SSH to it.

Today, vendor recommended rebooting it. Shouldn't affect the primary at all. That's what HA is for, right?

:challenge-accepted:

So I fire up the Raritan and cycle power on the secondary.

Both it and the primary go down and then come back up with a Linux OS "soft lockup" error.

:rage: :kramer: :frustration: :facepalm4:

Vendor says the error message indicates the possibility of bad hardware. They recommend rebooting it.

:facepalm2:

Of course, I'm in a very large company, so that's gonna take some time to accomplish.

Lesson learned: If part of an HA pair shows flaky behavior, don't "wait a few minutes." RMA the fartknocker ASAP.

:coolstory:
Take a baseball bat and trash all the routers, shout out "IT'S A NETWORK PROBLEM NOW, SUCKERS!" and then peel out of the parking lot in your Ferrari.
"The world could perish if people only worked on things that were easy to handle." -- Vladimir Savchenko
Any questions? No questions! | BCEB: Belkin Certified Expert Baffler | "Plan B is Plan A with an element of panic." -- John Clarke
Accounting is architecture, remember that!
Air gaps are high-latency Internet connections.

Reggle

That's what HA is for. So you can replace the one with issues.

deanwebb

Sure is... but, dang, we're a little gun-shy with this pair. If taking the HA unit out brings down the primary, that's not HA... that's HF... High Failure!

mmcgurty

I have four Cisco WLCs, and the primary server just up and died on Friday afternoon at 3:05 PM EST. I was very surprised it actually failed over properly to the others and load-balanced the APs connected to them. A replacement is supposed to be shipping today.