Monitoring nightmare

Started by icecream-guy, December 30, 2020, 04:45:48 PM


icecream-guy

So, I get paged for a critical issue, with widespread panic and a major incident declared, because a single PSN node went down. It's not like we have 14 PSN nodes in total across 2 separate datacenters (we do). Boy, did I send some hate mail for that one. I could see it if all the nodes in one of the DCs went down, but an emergency call-out alert for a single node down in one DC is ridiculous. I think ITIL 4 says something about silos and collaboration/communication. I suggested a meeting next week to discuss it; we'll see if that ever happens.
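For the sake of argument, here is roughly the severity rule described above, as a minimal Python sketch. Everything in it is an assumption for illustration: the `classify_alert` name, the node names, the severity labels, and the even 7/7 split between datacenters (only the 14-node, 2-DC total is actual). It is not how any particular monitoring product does it.

```python
from collections import defaultdict

def classify_alert(nodes):
    """Return a severity for a fleet of redundant nodes.

    nodes: dict mapping a (hypothetical) node name to a tuple
    (datacenter, is_up).
    """
    total = defaultdict(int)
    down = defaultdict(int)
    for datacenter, is_up in nodes.values():
        total[datacenter] += 1
        if not is_up:
            down[datacenter] += 1

    # Nothing down anywhere: no alert at all.
    if not any(down.values()):
        return "ok"
    # Every node in at least one datacenter is down: page someone.
    if any(down[dc] == total[dc] for dc in total):
        return "critical"
    # Some nodes down, but every datacenter still has capacity.
    return "warning"

# Assumed layout: 14 nodes, 7 per datacenter, one node down in dc1.
fleet = {f"psn{i:02d}": ("dc1" if i < 7 else "dc2", True) for i in range(14)}
fleet["psn03"] = ("dc1", False)
print(classify_alert(fleet))
```

With that layout a single node down comes out as a warning, and only losing every node in one DC would justify the emergency call-out.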
:professorcat:

My Moral Fibers have been cut.

deanwebb

Ugh. It shouldn't be SEV ONE if fault tolerance kicks in and we're all still running as per.
Take a baseball bat and trash all the routers, shout out "IT'S A NETWORK PROBLEM NOW, SUCKERS!" and then peel out of the parking lot in your Ferrari.
"The world could perish if people only worked on things that were easy to handle." -- Vladimir Savchenko
Any questions? No questions! | BCEB: Belkin Certified Expert Baffler | "Plan B is Plan A with an element of panic." -- John Clarke
Accounting is architecture, remember that!
Air gaps are high-latency Internet connections.