ran into a weird one yesterday (long)

Started by icecream-guy, January 08, 2016, 08:14:52 AM

icecream-guy

Lost access to management resources yesterday morning. Couldn't get into nuthin: jumpbox, nope; console into the jumpbox, nope; TACACS, nope. Trolling around the data center, I found one of the 6500 VSS switches down. That shouldn't be an issue, yet... The Sup on the active VSS 6500 switch had failed (the switch crashed due to memory allocation failures) and the system failed over. I brought up the second switch, which restored some services, but users were still complaining about intermittent connectivity issues. This VSS pair is not the core, but a distro block with lots of stuff connected. We were scratching our heads; this shouldn't happen.

We thought we had narrowed it down to some legacy services, rebooted the switch connected to those, and they came up, but users were still complaining about intermittent connectivity to other areas of the network. So we focused on the VSS switch that had the problems. We determined that all the affected services were connected via port-channels; services that weren't were not affected. So we focused on the port-channels: all interfaces up/up, everything looked connected, but reports of connectivity loss kept coming in.

Someone on the team had the idea to shut each member of the port-channel in turn and see if connectivity was restored. On the 4th try on a 6-member port-channel, after a shut/no shut, all services were restored. Our guess is that it was either some sort of hashing issue on the switch or some sort of UDLD issue (UDLD is not enabled on the switches). The UDLD issue is more logical, but a shut/no shut of an interface should not have fixed a physical issue with a fiber cable. Just plain weird. Took a lengthy outage for it, 4+ hours.
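For anyone who wants to reason about the "shut one member at a time" approach, here is a minimal Python sketch of the idea. It is a toy model, not what the switch actually does: the interface names (`Te1/1` to `Te1/6`), the hash-modulo link selection, and the notion of a single "faulty" member that silently drops traffic are all assumptions for illustration. It also only models the "shut" half of the shut/no shut.

```python
def link_for(flow_hash, active_links):
    """Pick a member link for a flow (simplified: hash modulo active link count)."""
    return active_links[flow_hash % len(active_links)]


def connectivity_ok(flow_hashes, active_links, faulty_link):
    """True when no flow lands on the member that is silently dropping traffic."""
    return all(link_for(h, active_links) != faulty_link for h in flow_hashes)


def isolate_faulty_link(links, flow_hashes, faulty_link):
    """Remove one member at a time; return the member whose removal restores all flows."""
    for candidate in links:
        remaining = [link for link in links if link != candidate]  # "shut" the candidate
        if connectivity_ok(flow_hashes, remaining, faulty_link):
            return candidate
    return None


# Hypothetical 6-member bundle where the 4th member is the bad one.
links = [f"Te1/{n}" for n in range(1, 7)]
flows = range(48)  # stand-in for 48 distinct flow hash values
found = isolate_faulty_link(links, flows, faulty_link="Te1/4")
print(found)
```

In this model only the flows that hash onto the bad member fail while every interface still shows up/up, which would look like "intermittent" connectivity from the users' side.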
:professorcat:

My Moral Fibers have been cut.

SimonV


deanwebb

Take a baseball bat and trash all the routers, shout out "IT'S A NETWORK PROBLEM NOW, SUCKERS!" and then peel out of the parking lot in your Ferrari.
"The world could perish if people only worked on things that were easy to handle." -- Vladimir Savchenko
Any questions? No questions! | BCEB: Belkin Certified Expert Baffler | "Plan B is Plan A with an element of panic." -- John Clarke
Accounting is architecture, remember that!
Air gaps are high-latency Internet connections.

icecream-guy


wintermute000


SimonV

Quote from: ristau5741 on January 08, 2016, 11:15:42 AM
Quote from: SimonV on January 08, 2016, 08:34:55 AM
Are you running LACP or mode on?

LACP , mode active


Good, mode on was the first thing I thought of when I read port-channel issues.

LynK

If this were related to UDLD, you would see the port down/down (notconnect) on one side and down/down (err-disabled) on the other, so it's most likely not a UDLD issue. How are you doing link balancing on the port-channel? Anything fancy like source IP? Hashing would affect LACP balancing ... if that were indeed the case.
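To make the hashing question concrete, here is a rough Python sketch of how a src-dst-IP hash could pin conversations to member links. Everything in it is an assumption for illustration: the XOR hash, the 8-bucket (3-bit) result, the round-robin bucket assignment, and the interface names. The actual load-balance method on the switches in this thread is not known.

```python
import ipaddress
from collections import Counter

BUCKETS = 8  # assumption: the hash yields a 3-bit result


def bucket(src_ip, dst_ip):
    """XOR the two addresses and keep the low 3 bits (illustrative hash only)."""
    s = int(ipaddress.ip_address(src_ip))
    d = int(ipaddress.ip_address(dst_ip))
    return (s ^ d) % BUCKETS


def bucket_owners(members):
    """Hand out the 8 buckets to the member links round-robin."""
    return [members[b % len(members)] for b in range(BUCKETS)]


def member_for(src_ip, dst_ip, members):
    """Return the member link a given src/dst pair would always use."""
    return bucket_owners(members)[bucket(src_ip, dst_ip)]


members = [f"Te1/{n}" for n in range(1, 7)]  # hypothetical 6-member bundle
share = Counter(bucket_owners(members))      # buckets per member
print(share)
print(member_for("10.0.0.1", "10.0.0.2", members))
```

With 8 buckets over 6 members, two links end up carrying two buckets each and the rest one each, so the split is uneven. More to the point, a given src/dst pair always lands on the same member, so one bad member breaks the same conversations every time while others sail through.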
Sys Admin: "You have a stuck route"
            Me: "You have an incorrect Default Gateway"

wintermute000

LACP should ensure no mismatches though, including bringing down member links.

SHOULD lol... bearing in mind that real-time L2 control traffic like LACP is supposed to be the hardest thing to get right in clustering/multi-chassis software (e.g. they still haven't gotten most of this stuff working in an OpenFlow central control plane)

dlots

This kinda thing is why I refused to VSS our cores together

icecream-guy

Near as I could figure, the 10G module in one VSS chassis was in a cyclical reboot: it would power on, come up, and stay online for about 2 seconds, then the module would crash again and power off. It did this repeatedly, every 3-4 minutes, until we pulled it out of the chassis; that's when things got back to normal. So the port-channels would come up on both switches' modules, traffic would start to traverse the port-channel on the rebooting module, and then get lost when the module rebooted. I don't know what the load balancing across the port-channels was at the time, but it looks like it was a fairly heavily used link.
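A pattern like this (online for a couple of seconds, then gone, every few minutes) is something you could spot from link up/down timestamps. Below is a small Python sketch of that check. The event format, the thresholds, and the sample timeline are all made up for illustration; they are not taken from the actual switch logs.

```python
def up_intervals(events):
    """Pair each 'up' event with the next 'down'; return (start, duration) tuples."""
    intervals = []
    up_at = None
    for ts, state in events:
        if state == "up":
            up_at = ts
        elif state == "down" and up_at is not None:
            intervals.append((up_at, ts - up_at))
            up_at = None
    return intervals


def looks_like_crash_loop(events, max_up_seconds=5, min_cycles=3):
    """Flag a link that repeatedly comes up and drops again within a few seconds."""
    short = [d for _, d in up_intervals(events) if d <= max_up_seconds]
    return len(short) >= min_cycles


# Hypothetical timeline: module online for ~2 s, then gone, repeating every 210 s.
events = []
for cycle in range(4):
    t = cycle * 210
    events += [(t, "up"), (t + 2, "down")]

print(up_intervals(events))
print(looks_like_crash_loop(events))
```

Each short "up" window is long enough for the port-channel member to rejoin the bundle and attract traffic, which then gets dropped when the module goes away again.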