IOS bug of the day

icecream-guy · February 17, 2016, 11:05:15 AM

Ran into this oddity today, thought I'd share

Beni MR1: Debug(network-rf ) enabled by default on Mingla
CSCut80144

#show debug
network RF:
network-rf idb-sync-history events debugging is on

I accidentally turned it off and was trying to turn it back on, assuming it must have been needed if someone had enabled it.
Turns out it's a bug: idb-sync-history events debugging is turned on whenever an affected router boots.

https://tools.cisco.com/bugsearch/bug/CSCut80144

TheGreatDoc · February 17, 2016, 11:59:42 AM

My ASR doesn't have that bug

srg · February 17, 2016, 12:51:42 PM

I think this one is kinda awesome: show cli history detail displays all hex MAC values in decimal format.
I opened the SR almost a year ago, and to my knowledge it is still not fixed.

SimonV · January 16, 2017, 07:57:52 AM

Ran into this one at a customer site on Friday:

Quote: CSCuz57493 - High CPU observed in punjectrx fed-ots-main thread

https://quickview.cloudapps.cisco.com/quickview/bug/CSCuz57493

The first symptom was an L2 loop (MAC flaps in the logs) in their server VLAN.
I was at an airport, so I could only talk them through a few commands, and I asked them to start disconnecting redundant ports.
Once those links were disconnected the server VLAN was OK again, but some servers still had intermittent reachability loss.
When I finally got there, I found that the ARP and MAC address tables on one 3850 core switch were inconsistent: I could see the ARP entries and ping devices, but the MAC address was either missing from the CAM table or showed up on the wrong port. I also found one of the 4 CPUs running at 100% because of the punjectrx process (25% overall, since that is one core out of four).
We finally vMotioned all devices to the other side and rebooted the switch stack, which solved the problem.
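One way to spot the kind of ARP/CAM mismatch described above is to cross-check the two tables. The sketch below is a hypothetical helper, not a Cisco tool: it assumes you have already parsed the ARP table into an IP-to-MAC dict and the MAC address table into a MAC-to-port dict, and that you know which port each host should be on (the `expected_port` mapping and all sample values are made up for illustration).

```python
"""Cross-check ARP entries against the MAC address (CAM) table.

Hypothetical sketch: the dicts stand in for data you would collect yourself
(e.g. by parsing "show ip arp" and "show mac address-table"); no switch API
is used here.
"""
from typing import Dict, List, Optional, Tuple


def find_inconsistencies(
    arp: Dict[str, str],            # IP address -> MAC address
    cam: Dict[str, str],            # MAC address -> port it was learned on
    expected_port: Dict[str, str],  # MAC -> port the host should be on (assumed known)
) -> List[Tuple[str, str, Optional[str], str]]:
    """Return (ip, mac, cam_port, problem) for every suspicious ARP entry."""
    problems = []
    for ip, mac in sorted(arp.items()):
        port = cam.get(mac)
        if port is None:
            # ARP resolves (and ping may work), but there is no CAM entry.
            problems.append((ip, mac, None, "missing from CAM"))
        elif mac in expected_port and expected_port[mac] != port:
            # A CAM entry exists, but on an unexpected interface.
            problems.append((ip, mac, port, "wrong port"))
    return problems


if __name__ == "__main__":
    arp = {"10.0.0.10": "0050.5600.0001", "10.0.0.11": "0050.5600.0002",
           "10.0.0.12": "0050.5600.0003"}
    cam = {"0050.5600.0001": "Gi1/0/1", "0050.5600.0003": "Gi2/0/5"}
    expected = {"0050.5600.0003": "Gi1/0/3"}
    for entry in find_inconsistencies(arp, cam, expected):
        print(entry)
```

With the sample data, 10.0.0.11 is flagged as missing from the CAM table and 10.0.0.12 as learned on the wrong port.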

My current hypothesis is that it just stopped sending BPDUs and that caused the blocking links on downstream switches to go forwarding.

Thanks Cisco

NetworkGroover · January 16, 2017, 10:33:32 AM

Quote from: SimonV on January 16, 2017, 07:57:52 AM
My current hypothesis is that it just stopped sending BPDUs and that caused the blocking links on downstream switches to go forwarding.

Loop guard on the downstream switches would have been helpful, yeah? Minus the whole, "just stopped sending BPDUs" thing...

srg · January 16, 2017, 10:35:27 AM

This is a beauty; https://bst.cloudapps.cisco.com/bugsearch/bug/CSCvb41889

Symptom:
after leap second addition, everyday night 23:59:50 leap sec addition happening

deanwebb · January 16, 2017, 12:21:58 PM

Quote from: srg on January 16, 2017, 10:35:27 AM
This is a beauty; https://bst.cloudapps.cisco.com/bugsearch/bug/CSCvb41889

Symptom:
after leap second addition, everyday night 23:59:50 leap sec addition happening

The timing on that couldn't have been any worse.

srg · January 16, 2017, 12:24:30 PM

Quote from: deanwebb on January 16, 2017, 12:21:58 PM
The timing on that couldn't have been any worse.

I think this is actually only triggered in releases that fix another XE bug, one where a watchdog triggered by the leap second made the box crash and reboot.

SimonV · January 16, 2017, 12:56:27 PM

Quote from: AspiringNetworker on January 16, 2017, 10:33:32 AM
Loop guard on the downstream switches would have been helpful, yeah? Minus the whole, "just stopped sending BPDUs" thing...

Well, not in this case, I think. Loop guard is implemented on the blocking links, but here the BPDUs would have stopped arriving on a designated port.

As I said, it's just a hypothesis at the moment; I still need to go through the logs. By the way, the loop was gone by the time I arrived and we couldn't reproduce it, so it could very well be that the loop was elsewhere.

NetworkGroover · January 16, 2017, 01:37:56 PM

Right, but if I remember correctly, it triggers when BPDUs stop being received on those ports and places the port in the loop-inconsistent state. Are you saying that BPDUs were still being received on those ports?

Quote: The loop guard feature makes additional checks. If BPDUs are not received on a non-designated port, and loop guard is enabled, that port is moved into the STP loop-inconsistent blocking state, instead of the listening / learning / forwarding state. Without the loop guard feature, the port assumes the designated port role. The port moves to the STP forwarding state and creates a loop.
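The behaviour in that quote boils down to a small decision. Below is a toy Python model of it, under loud simplifying assumptions: it ignores max-age timers, the listening/learning transitions and per-VLAN/per-instance state, and the names `PortState` and `on_bpdu_timeout` are made up for illustration. It is not how IOS implements STP.

```python
"""Toy model of the loop guard decision from the quoted description.

Simplified sketch: no timers, no listening/learning steps, no per-VLAN
instances. Names are hypothetical.
"""
from enum import Enum


class PortState(Enum):
    BLOCKING = "blocking"
    FORWARDING = "forwarding"
    LOOP_INCONSISTENT = "loop-inconsistent (blocking)"


def on_bpdu_timeout(is_designated: bool, loop_guard: bool,
                    current: PortState) -> PortState:
    """State a port ends up in once BPDUs stop arriving on it."""
    if is_designated:
        # Designated ports send BPDUs rather than expect them, so loop
        # guard does not act here; the state is left unchanged.
        return current
    if loop_guard:
        # Loop guard keeps the non-designated port blocked.
        return PortState.LOOP_INCONSISTENT
    # Without loop guard the port assumes the designated role and
    # (after listening/learning, collapsed here) starts forwarding,
    # which is how a formerly blocking link can create a loop.
    return PortState.FORWARDING


if __name__ == "__main__":
    for guard in (False, True):
        print(guard, on_bpdu_timeout(False, guard, PortState.BLOCKING).value)
```

The designated-port branch is the crux of the exchange above: loop guard only protects ports that expect to receive BPDUs.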

SimonV · January 17, 2017, 01:58:00 AM

Well, in any case, the blocking link moved to forwarding, but I'm not sure whether that's because it stopped receiving BPDUs on the blocked port itself or because it just lost the root port on the other interface. That IOS bug really puts my focus on the core switch rather than the access switch; it would be a serious coincidence to have two different issues within 4 hours.

SimonV · January 27, 2017, 06:38:08 AM

Quote: CSCuo58994 - Failed POST:PortASIC Macsec Loopback Tests during bootup

The system continuously reboots, failing POST tests on WS-C3750X-24T-L:

POST: PortASIC Macsec Loopback Tests : Begin
Pattern not found Y \002
POST: Failed Packet compare asic_index 1 port_hardware_index 0
Pattern not found D \002
...
...
POST: Failed Packet compare asic_index 1 port_hardware_index 26
POST: Failed MacsecEncryption Packet Test asic_index 1 port_hardware_index 26
POST: PortASIC Macsec Loopback Tests : End, Status Failed

Error: Macsec POST failed. Cannot continue.

Workaround:
Use IOS version 12.2(55)SE9

This is happening to one switch in a three-unit stack.

deanwebb · January 27, 2017, 09:17:36 AM

Quote from: SimonV on January 27, 2017, 06:38:08 AM
Workaround:
Use IOS version 12.2(55)SE9

Ummm... is 12.2(55)SE9 a *downgrade* from the IOS you're on, by any chance?

SimonV · January 27, 2017, 09:23:34 AM

Yes, it is. It's on 15.0(2)SE4 so I'm not sure what the *downgrade path* is

deanwebb · January 27, 2017, 09:27:04 AM

Quote from: SimonV on January 27, 2017, 09:23:34 AM
Yes, it is. It's on 15.0(2)SE4 so I'm not sure what the *downgrade path* is

I believe the first step is...

Just guessing...