Phew, that was a scary one....

Started by icecream-guy, March 14, 2017, 04:34:32 AM


icecream-guy

Upgraded two vPC-peered Nexus 5596s this morning with 22 dual-homed FEXs. I was scared I'd lose a FEX or something. It took a long time: the FEXs kept rebooting on one switch. I let them do that for about 45 minutes, then realized my window was closing, so I had to reload the other switch. Seems like everything came up OK. Monitoring now.
:professorcat:

My Moral Fibers have been cut.

that1guy15

Ugh. sorry dude.

At my last gig we did a swap-out of about the same size. Replaced 3750s/6500 VSS (yes...) with 5K/2248 FEX. Had almost the whole department on-site to help. We swapped it out and had the DC down for less than 30 minutes in total, I think.

But we also pre-built and staged all the Nexus gear and had the upgrades applied beforehand.

Nexus vPC and FEX upgrades are a nightmare. ISSU my ass!
That1guy15
@that1guy_15
blog.movingonesandzeros.net

icecream-guy

Seems like the active/active firewall cluster went split brain. It can ping beyond the VIPs and IPs, and I can see the pings in captures, but no responses come back.
Makes no sense.


:professorcat:

My Moral Fibers have been cut.

deanwebb

Quote from: ristau5741 on March 14, 2017, 10:14:18 AM
seems like the active/active firewall cluster when split brain, can ping beyond VIP's and IPs, see it in captures, but no responses back.
makes no sense.

:zomgwtfbbq:

Totally serious question... How many things will you need to reboot to get back to normal?
Take a baseball bat and trash all the routers, shout out "IT'S A NETWORK PROBLEM NOW, SUCKERS!" and then peel out of the parking lot in your Ferrari.
"The world could perish if people only worked on things that were easy to handle." -- Vladimir Savchenko
Вопросы есть? Вопросов нет! | BCEB: Belkin Certified Expert Baffler | "Plan B is Plan A with an element of panic." -- John Clarke
Accounting is architecture, remember that!
Air gaps are high-latency Internet connections.

icecream-guy

Quote from: deanwebb on March 14, 2017, 10:52:54 AM
Quote from: ristau5741 on March 14, 2017, 10:14:18 AM
seems like the active/active firewall cluster when split brain, can ping beyond VIP's and IPs, see it in captures, but no responses back.
makes no sense.

:zomgwtfbbq:

Totally serious question... How many things will you need to reboot to get back to normal?

Rolled the switches back off the new code, and I'm testing whether that was the cause or not.

The firewall wasn't getting ARP from the gateway on the upstream switch.
:professorcat:

My Moral Fibers have been cut.

wintermute000


icecream-guy

Yeah, the upstream SVI on the 6500. Found a possible bug; it looks and smells like what happened:

https://bst.cloudapps.cisco.com/bugsearch/bug/CSCtx03643
Nexus 5500: Increase MAC aging timer to be more than ARP aging timer

Symptom:
As MAC aging timer is set to 300 seconds (default), unidirectional traffic may get flooded through the Nexus 5000/5500 once MAC entry is aged out.

Further Problem Description:
Recommended MAC aging timer value is 1800 seconds (as in Nexus 7000 and 6000 platforms) in an NX-OS environment.

If a Catalyst switch is being used as an L3 gateway, its default ARP timer is 14400 seconds. The recommendation in this case is to decrease the ARP timer to 1500 seconds and use the 1800 second MAC timer.
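
For anyone who wants to sanity-check their own timers against that rule, here is a minimal Python sketch. It only encodes the relationship stated in the bug text (MAC aging should be longer than ARP aging); the function name `flooding_risk` and the two dictionaries are made up for illustration and are not part of any Cisco tool.

```python
# Hypothetical helper, not a Cisco utility: encodes the rule from CSCtx03643
# that the MAC aging timer should be longer than the gateway's ARP timer.

def flooding_risk(mac_aging_s: int, arp_timeout_s: int) -> bool:
    """Return True when the MAC entry can age out before the gateway re-ARPs.

    In that window the switch no longer knows the destination MAC, so
    unidirectional traffic toward that host may be flooded.
    """
    return mac_aging_s <= arp_timeout_s


# Values quoted in the bug description (seconds).
DEFAULTS = {"mac_aging_s": 300, "arp_timeout_s": 14400}     # Nexus 5500 MAC, Catalyst ARP
RECOMMENDED = {"mac_aging_s": 1800, "arp_timeout_s": 1500}  # recommended pair

if __name__ == "__main__":
    for label, timers in (("defaults", DEFAULTS), ("recommended", RECOMMENDED)):
        state = "at risk" if flooding_risk(**timers) else "ok"
        print(f"{label}: MAC {timers['mac_aging_s']}s, "
              f"ARP {timers['arp_timeout_s']}s -> {state}")
```

With the defaults the MAC entry expires long before the ARP entry does, which is the mismatch the bug describes. The recommended pair keeps the ARP refresh (1500 s) inside the MAC aging window (1800 s).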


who knew?
:professorcat:

My Moral Fibers have been cut.

Nerm


LynK

@ristau,

Are there recommended upgrade steps when upgrading DCs with FEXs? I have never used FEX before, but I am proposing one to replace our 6500s. It would be nice to have this information and study it beforehand.
Sys Admin: "You have a stuck route"
            Me: "You have an incorrect Default Gateway"

wintermute000

Unless you need the price point, they are a pain. They don't switch locally, and upgrades are butt-clenching. Go 1RU 9200s leaf-spine any day.

Dieselboy

Nice post, Ristau. I like your train of thought.

icecream-guy

Quote from: LynK on March 17, 2017, 02:15:37 PM
@ristau,

Is there a recommended upgrade steps when upgrading DCs with FEXs? I have never used FEX before, but I am proposing one to replace our 6500s. It would be nice to have this information and study it beforehand.

The problem is primarily with dual-homed FEX, with two switches running vPC between them. When the first switch is reloaded all is fine, but the FEX go into the 'AA version mismatch' state, where they are pretty much out to lunch due to the code difference. The FEX are still online and connected to the second switch, so that's not an issue. The outage occurs when the second switch is reloaded: those online FEX get reloaded again too. With them out to lunch on the first switch, one has to wait for the FEX to reload and upgrade on the second switch before they will connect and come online to the first switch, once the code versions match. Then they will come up online on the second switch. This generally takes 10 minutes, but it depends on the number of connected FEX.


If you use single-homed FEX, connect servers to two different FEX, and configure the servers to do the redundancy, this should avoid the problem above.


And don't buy 10G FEX, they suck. The buffers are too small to handle more than a few busy 10G servers, and you will end up with tail drops. Cisco recommends connecting the 10G servers to the 5Ks directly.

:professorcat:

My Moral Fibers have been cut.

wintermute000

If you get bailed up by some Cisco-white-paper-worshipping person (typically from a large Cisco-is-the-universe VAR) forcing you down the road of dual-homed FEXs, here is some ammo to reinforce Ristau's excellent comments:


https://rednectar.net/2012/08/30/why-i-wouldnt-bother-with-enhanced-vpc/




icecream-guy

I also forgot the interesting bug on the Nexus 5596s (5010, 5020, and 5548s are all going away shortly). There is a hardware bug, CSCuf57615: if the unit is running dual power supplies and one power supply fails, the switch will reload. We hit this one a few weeks ago. The workaround is to get a replacement for the failed power supply.

:professorcat:

My Moral Fibers have been cut.