port channel business

Started by icecream-guy, August 29, 2016, 11:12:23 AM


icecream-guy

Host A (ESX Server) cannot Ping Host B (NetApp Storage).

Host A and Host B are in the same VLAN
The VLAN is an L2 private network and exists primarily on the ESX server side for the servers to communicate with storage.
Host A and Host B are connected to two Nexus 5Ks.
Host A and Host B are each configured in their own port-channel running LACP.
The port-channels are trunked and allow the same VLANs,
with one member of each port-channel on each 5K switch.
The port-channels are up and their member interfaces are bundled.
The port-channels are configured as vPCs, with the peer-link running between the two 5Ks.
I set up a monitor session and Wireshark-captured packets in both directions in the VLAN.
On the first switch I see the ECHO requests going out to the destination.
On the second switch I see the ECHO replies coming back from the destination.
The sender never gets the reply.

The port-channels are up, the interfaces are up, the vPC is up, the VLAN is trunked properly, and both hosts' MAC addresses are in the VLAN.

I'm guessing it's a port-channel load-balancing issue, which is why I see the ICMP request and the ICMP reply on two different switches,
but I think I would see both if I'm capturing in the VLAN on both switches.
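For anyone following along, those checks map to something like the below on each 5K. This is just a sketch - the VLAN (100) and port-channel numbers (Po10/Po20) are made up, substitute your own:

    show vpc brief                          (peer status, keepalive, state of vPC 10/20)
    show vpc consistency-parameters global  (type-1 mismatches that would suspend a vPC)
    show port-channel summary               (member ports should show as bundled)
    show port-channel load-balance          (the hash in question)
    show interface trunk                    (VLAN 100 allowed and forwarding on Po10/Po20)
    show mac address-table vlan 100         (which port-channel each host MAC is learned on)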


:professorcat:

My Moral Fibers have been cut.

wintermute000

IIRC, ESXi LACP by default does not use the same hashing algorithm as a Cisco switch. Make sure they're both using the same hashing method.
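A quick way to line the two up, if my memory of the 5.5 esxcli namespace is right (it assumes the host is on a 5.5 vDS with a LAG configured, which may not match your setup):

    On the 5K:
      show port-channel load-balance

    On the ESXi host:
      esxcli network vswitch dvs vmware lacp config get
      esxcli network vswitch dvs vmware lacp status get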

icecream-guy

Hmm, we discussed this. Cisco was source/dest IP/MAC, VMware was source/dest IP, and VMware didn't have an option to match Cisco. When you set source/dest IP on Cisco, I think it does include MAC by default; I think there was an option to not use MAC, but I don't think it could be set from the CLI on the 5K.
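For reference, the 5K hash is a global knob, something like the below (a sketch only - the exact keyword set varies by NX-OS release, so check the options with a ?):

    configure terminal
      port-channel load-balance ethernet source-dest-ip
    exit
    show port-channel load-balance

Whether source-dest-ip also folds the MAC into the hash is the part we couldn't pin down on the CLI, so treat the show command output as the source of truth rather than my memory.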
:professorcat:

My Moral Fibers have been cut.

Dieselboy

I have this problem on our N3Ks going to a Cisco UCS-C running ESXi 5.5. I tried doing this when I was running ESXi 5.0, and I just kept losing connectivity to the ESXi host and the VMs after bringing up the port-channel in vPC.
When I lose connectivity, shutting down any single link going from the N3K to the UCS restores it. I was planning to upgrade to ESXi 5.5 anyway and thought it might have been an issue with ESXi 5.0, but the same issue is there in 5.5, so I'm now thinking it might be something to do with the L2 load-balancing.

Over the weekend I set this up again after upgrading the UCS CIMC and the ESXi to 5.5, and I lost connectivity. If I simply shut down either link (it doesn't matter which one), connectivity is restored straight away.
I decided not to change the load-balancing because I have port-channels going everywhere and I wasn't sure what would happen to those other working port-channels.

In addition, I have another ESXi 5.5 box (a Riverbed appliance) on which I have successfully configured a nailed-up port-channel in vPC. I actually manually copied this configuration onto the Cisco UCS as a known-good config, but the UCS does not work when configured like the Riverbed. So it's strange how one unit works and the other doesn't with the exact same config and without changing the load-balancing algo. These are the same N3K switches.

All I can think of is that the ESXi on the UCS is sending data back to the wrong Nexus switch and the vPC is blocking it, although I haven't confirmed this.
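For comparison, the nailed-up (mode on) vPC toward the Riverbed / UCS looks roughly like this on each N3K - the interface, VLAN and channel numbers are placeholders:

    interface port-channel20
      switchport mode trunk
      switchport trunk allowed vlan 10,20
      vpc 20

    interface Ethernet1/20
      switchport mode trunk
      switchport trunk allowed vlan 10,20
      channel-group 20 mode on

The catch on the ESXi side of a static channel is the teaming policy: it has to be "Route based on IP hash". If it's left on "Route based on originating virtual port" while the switch side is channelled, the host's MAC flaps between uplinks and you get exactly this kind of one-link-works, two-links-don't behaviour.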

wintermute000

LACP is a hornets' nest with VMware.

All their design guides recommend LBT or route based on originating port for NSX (since you can't use load-based teaming once multiple VTEPs are involved).

It basically achieves the same thing as LACP (i.e. load balancing), except you take away any and all complexity from the switch side, which only sees dumb access/trunk ports: no STP, nothing. Life is good.
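In practice, dropping the channel means the switch side is just plain edge trunks, something like this per N5K/N3K (again a sketch - interface and VLAN numbers are made up):

    interface Ethernet1/20
      description ESXi vmnic - no channel-group
      switchport mode trunk
      switchport trunk allowed vlan 10,20
      spanning-tree port type edge trunk

with the teaming handled entirely in the vSwitch (originating port ID, or LBT on a vDS).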


Diesel, I wouldn't discount UCS firmware bugs. I sit next to a guy who spends half his day swearing at N7Ks and the other half swearing at UCS FIs, and I know for a fact he ran into some hilarious UCS bugs re: pinning, uplinks, LACP and all that shemozzle.

icecream-guy

Quote from: Dieselboy on August 31, 2016, 01:32:56 AM
I have this problem on our N3Ks going to a Cisco UCS-C running ESXi 5.5. I tried doing this when I was running ESXi 5.0, and I just kept losing connectivity to the ESXi host and the VMs after bringing up the port-channel in vPC.
When I lose connectivity, shutting down any single link going from the N3K to the UCS restores it. I was planning to upgrade to ESXi 5.5 anyway and thought it might have been an issue with ESXi 5.0, but the same issue is there in 5.5, so I'm now thinking it might be something to do with the L2 load-balancing.

Over the weekend I set this up again after upgrading the UCS CIMC and the ESXi to 5.5, and I lost connectivity. If I simply shut down either link (it doesn't matter which one), connectivity is restored straight away.
I decided not to change the load-balancing because I have port-channels going everywhere and I wasn't sure what would happen to those other working port-channels.


I think we have a winner here; similar issue, less the UCS.
:professorcat:

My Moral Fibers have been cut.

Dieselboy

Wintermute - I have an open TAC case for a UCS FI 6248, where the XML API isn't working. Our virtualisation management system (Red Hat Virtualisation) uses the API to power up the blades when more resources are needed, or if there is a hypervisor OS issue, for example. We upgraded to 2.2.3h to support the RHEL 7.2 enic driver version, and shortly after, the RHEV system began having issues powering up the blades. It would work most of the time, then over time it got to the point where it would not work most of the time. Now it doesn't work at all.

I ran the API commands from the CLI of another system; it uses curl. The response from UCS is that the system is powered up, but looking at UCSM the blade is never powered up. I rebooted both FIs last weekend and that fixed it completely, until Tuesday lunchtime when it broke again.

What frustrated me is that I logged a TAC case last week and Cisco's response was that I should contact the developers at Cisco DevNet, as they weren't sure if they should provide support. I said why? It's a break-fix issue. I didn't get anywhere with that engineer and he went off shift without getting back to me, so I re-queued the case. The second engineer said the same thing. I said okay, please explain why or give more information, as I don't follow. I did contact the dev team and they weren't sure either, even though I said it's a break-fix scenario. The second TAC engineer is actively providing support now, though.
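For what it's worth, the curl calls against the UCSM XML API look roughly like this - the hostname, credentials and service-profile name (SP1) are placeholders, and I'm going from memory on the lsPower object, so double-check the exact object/attribute names against the UCSM XML API docs:

    curl -k -d '<aaaLogin inName="admin" inPassword="password"/>' https://ucsm-vip/nuova
      (returns an outCookie to use in the next call)

    curl -k -d '<configConfMo cookie="COOKIE" inHierarchical="false" dn="org-root/ls-SP1/power"><inConfig><lsPower dn="org-root/ls-SP1/power" state="up"/></inConfig></configConfMo>' https://ucsm-vip/nuova

If UCSM answers the second call with a success response but the blade never actually powers on, capturing that request/response pair is probably the most useful evidence for the TAC case.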



Ristau - the confusing thing for me is that I did have two identically configured ESXi 5.5 systems, both running two links (one link to each Nexus N3K) in a nailed-up channel (no LACP). One unit is working fine today, and the other unit I couldn't get working, as mentioned already.

The only difference here for me is that the working port-channel has one VLAN untagged and the non-working one has a few tagged VLANs.
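That difference is worth chasing: if the working channel's single VLAN rides untagged as the trunk's native VLAN, but the non-working one expects tagged frames, a native-VLAN / port-group-VLAN mismatch on either end would look exactly like this. A quick comparison (a sketch - it assumes a standard vSwitch, and the port-group name "Storage" and VLAN 100 are made up):

    On the N3K:
      show interface ethernet 1/20 switchport     (compare the trunk native VLAN against the allowed/tagged VLANs)

    On the ESXi host:
      esxcli network vswitch standard portgroup list
      esxcli network vswitch standard portgroup set --portgroup-name=Storage --vlan-id=100

A port group with VLAN 0 expects untagged frames (i.e. the switch's native VLAN); a non-zero VLAN ID expects that VLAN tagged on the wire.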