Redundant interfaces keep triggering spanning tree loops

Started by kurdam, November 21, 2023, 01:34:21 AM

kurdam

Hi,

I'm opening this thread because I think I'm missing some knowledge about how the spanning tree protocol behaves with redundant/dual-controller hardware.

To explain my problem, here are some details:
We are located in two datacenters, with a symmetrical configuration on each site.
Our entry point is a FortiGate that acts as the gateway for all our VLANs (public and private) and as our firewall. From there we are connected to two Cisco Nexus 9000 switches in a vPC configuration, and some Catalyst switches are connected to those, also over vPC.

I have set all the STP weights at the switch level rather than at the interface level, because it was recommended to me and it is easier to manage.
I have tried STP, RSTP, and MST with exactly the same results.
I had exactly the same problems before and after implementing vPC.

Some end devices are connected to this infrastructure: hypervisors, NAS, and some network storage (Dell EqualLogic).

All connections in this infrastructure have a redundant path through a second piece of network hardware, to avoid downtime if something goes down. So all our hardware has either dual controllers or dual network interfaces (via VDS or Linux bonding, set up in active/standby configuration).

I think I'm missing some knowledge about this kind of configuration because, no matter what I try, I can't seem to avoid network loops when we have, for example, a router update.

I studied the logs, and err_dis_loop events occur on our switch interfaces, seemingly at random, on the ports facing our servers and storage (each time it happens on different hardware). I understand that with this configuration, with potential network loops everywhere, this is to be expected to some degree, even though I moved to a vPC infrastructure to reduce the problem.

I suspect that during STP convergence, the dual controller is also switching its active interface to find a working path, and the switch does not understand what is happening because it sees the same MAC address on two interfaces, so it blocks the ports (or at least one).

tl;dr: Is there some specific configuration I have to set up to avoid blocked ports during convergence in an infrastructure with dual-controller / VDS / Linux bonded interfaces/hardware?

Thank you in advance for your help. I will be happy to give you more information if you need it. ;)

deanwebb

If you have different NICs involved, there should be different MAC addresses for each - so I don't understand why the same MAC address would appear on two interfaces. Can you explain that part in more detail?
Take a baseball bat and trash all the routers, shout out "IT'S A NETWORK PROBLEM NOW, SUCKERS!" and then peel out of the parking lot in your Ferrari.
"The world could perish if people only worked on things that were easy to handle." -- Vladimir Savchenko
Any questions? No questions! | BCEB: Belkin Certified Expert Baffler | "Plan B is Plan A with an element of panic." -- John Clarke
Accounting is architecture, remember that!
Air gaps are high-latency Internet connections.

kurdam

I'm not sure, but I think that when you create an aggregate of two NICs, it creates a VIP that holds the IP you want to assign to the two ports, and this VIP also has its own MAC. So when a failover happens, the switches see the VIP MAC moving from one port to another and block it because the switch thinks there is a loop. (From what I'm seeing, this guess looks wrong and you are correct.)

From what I'm seeing in VMware, that's not the case: the MAC on the VIP is the same as one of the two hardware interfaces used for the failover. But I don't know about all our other hardware.

In any case, I don't get why I'm getting these kinds of errors:
2023 Nov 21 07:09:35.204 ciscoXX %L2FM-2-L2FM_MAC_MOVE_PORT_DOWN: Loops detected in the network for mac 0a1b.2c3d.4e5f among ports Eth1/XX and PoXX vlan X - Port Eth1/42 Disabled on loop detection
or
2023 Nov  7 02:21:44.357 ciscoXX %L2FM-4-L2FM_MAC_MOVE2: Mac 0a1b.2c3d.4e5f in vlan XX has moved from Eth1/XX to Eth1/XY
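
One way to check whether these events are really random is to tally them per MAC and port pair over a longer log export. Below is a minimal Python sketch, assuming each message sits on a single line and looks exactly like the two samples quoted here. The regexes and the function name tally_mac_moves are made up for illustration and are not part of any Cisco tooling.

```python
import re
from collections import Counter

# Patterns modelled only on the two sample messages in this post (assumption:
# other NX-OS releases may word these messages differently).
MOVE_RE = re.compile(
    r"%L2FM-4-L2FM_MAC_MOVE2: Mac (?P<mac>\S+) in vlan (?P<vlan>\S+) "
    r"has moved from (?P<src>\S+) to (?P<dst>\S+)"
)
DOWN_RE = re.compile(
    r"%L2FM-2-L2FM_MAC_MOVE_PORT_DOWN: Loops detected in the network for mac "
    r"(?P<mac>\S+) among ports (?P<a>\S+) and (?P<b>\S+) vlan (?P<vlan>\S+)"
)


def tally_mac_moves(lines):
    """Count MAC-move and port-down events per (mac, sorted port pair)."""
    moves = Counter()
    downs = Counter()
    for line in lines:
        m = MOVE_RE.search(line)
        if m:
            ports = tuple(sorted((m["src"], m["dst"])))
            moves[(m["mac"], ports)] += 1
            continue
        m = DOWN_RE.search(line)
        if m:
            ports = tuple(sorted((m["a"], m["b"])))
            downs[(m["mac"], ports)] += 1
    return moves, downs


if __name__ == "__main__":
    sample = [
        "2023 Nov 21 07:09:35.204 ciscoXX %L2FM-2-L2FM_MAC_MOVE_PORT_DOWN: "
        "Loops detected in the network for mac 0a1b.2c3d.4e5f among ports "
        "Eth1/XX and PoXX vlan X - Port Eth1/42 Disabled on loop detection",
        "2023 Nov  7 02:21:44.357 ciscoXX %L2FM-4-L2FM_MAC_MOVE2: Mac "
        "0a1b.2c3d.4e5f in vlan XX has moved from Eth1/XX to Eth1/XY",
    ]
    moves, downs = tally_mac_moves(sample)
    for key, count in (moves + downs).most_common():
        print(key, count)
```

If the same MAC keeps bouncing between the same two ports, that points at one host's failover behaviour rather than at a topology-wide loop.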


Maybe I have to tune my STP timers to avoid this behaviour:
  • Hello Time 2 sec
  • Max Age 20 sec
  • Forward Delay 15 sec
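
For context on what these timers imply: in classic 802.1D STP, a port can take up to Max Age plus twice the Forward Delay to start forwarding after an indirect failure, and 802.1D also constrains how the three timers relate to each other. Here is a small Python sketch of that arithmetic, using the values listed above. The function names are made up for illustration, and the calculation applies to legacy STP only: RSTP and MST normally converge through a proposal/agreement handshake and fall back on these timers only in some cases.

```python
def stp_timers_consistent(hello, max_age, forward_delay):
    """802.1D rule: 2*(forward_delay - 1) >= max_age >= 2*(hello + 1), in seconds."""
    return 2 * (forward_delay - 1) >= max_age >= 2 * (hello + 1)


def classic_stp_worst_case(max_age, forward_delay):
    """Worst-case time to forwarding in legacy STP after an indirect failure:
    wait for the stored BPDU to age out, then listening + learning."""
    return max_age + 2 * forward_delay


if __name__ == "__main__":
    hello, max_age, forward_delay = 2, 20, 15  # values from the list above
    print(stp_timers_consistent(hello, max_age, forward_delay))  # True
    print(classic_stp_worst_case(max_age, forward_delay))        # 50
```

So with these defaults, legacy STP can leave a path down for up to 50 seconds, which is longer than many NIC failover mechanisms wait before switching their active interface.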


deanwebb

I'm wondering if the setup would be better served by a load balancer array that would handle the VIPs, so that the server interfaces themselves could have separate IP and MAC addresses.

But if your tuning works out, that's another workaround.

kurdam

The "load balancer" part is done on the end device: it's either the VDS on VMware, the Linux bond on Proxmox, the intelligence inside our dual-controller network storage, or the multiplexor card on our Windows server. That part is working correctly and doing what it is supposed to do.

What I don't understand is why, when the topology changes because of a failure, my switch keeps blocking the ports and preventing my hardware from failing over correctly...

My real question is: am I missing something? Is there a specific config for this kind of hardware that I don't know about and have to set up on the switch side? Or is it just an STP tuning problem?

deanwebb

Can the MAC for the VIP be different from the MAC for a hardware interface? I would want to do that to see if that resolves the issue, change the address so it's not a duplicate.