IOS bug of the day

Started by icecream-guy, February 17, 2016, 11:05:15 AM


icecream-guy

Ran into this oddity today, thought I'd share


Beni MR1: Debug(network-rf ) enabled by default on Mingla
CSCut80144


#show debug
network RF:
  network-rf idb-sync-history events debugging is on

I accidentally turned it off and was trying to turn it back on, since I figured it must have been needed if someone had turned debugging on.
Turns out it's a bug: idb-sync-history events debugging is on whenever an affected router boots.


https://tools.cisco.com/bugsearch/bug/CSCut80144
:professorcat:

My Moral Fibers have been cut.

TheGreatDoc

a.k.a. Daniel.
I don't have any certs, just learned it all by myself.

srg

I think this is kinda awesome: show cli history detail displays all hex MAC values in decimal format.
I opened the SR almost a year ago; to my knowledge it still isn't fixed.
as if the mind had gone dark forever.

SimonV

Ran into this one at a customer site on Friday:

Quote: CSCuz57493 - High CPU observed in punjectrx fed-ots-main thread

https://quickview.cloudapps.cisco.com/quickview/bug/CSCuz57493

The first symptom was an L2 loop (MAC flaps in the logs) in their server VLAN.
I was in an airport, so I could only talk them through certain commands and ask them to start disconnecting redundant ports.
Once those links were disconnected the server VLAN was OK again, but they still had intermittent reachability loss on some servers.
When I finally got there, I found that the ARP and MAC address tables on one 3850 core switch were inconsistent: I could see the ARP entries and ping devices, but the MAC address was either missing from the CAM table or showed up on the wrong port. I also found one of the 4 CPUs running at 100% because of the punjectx process (25% used in total).
We finally vMotioned all devices to the other side and rebooted the switch stack - problem solved.

My current hypothesis is that it just stopped sending BPDUs and that caused the blocking links on downstream switches to go forwarding.
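
For anyone wanting to repeat the ARP vs. CAM comparison described above without eyeballing two long tables, here is a minimal Python sketch. It assumes the ARP and MAC address tables have already been collected and parsed into dictionaries; the function name, the data layout, the addresses and the port names are all made up for illustration and do not come from the switch in question.

```python
def find_arp_mac_mismatches(arp_table, mac_table, expected_ports=None):
    """Compare ARP entries against the MAC address (CAM) table.

    arp_table:      {ip: mac}   (hypothetical, parsed from ARP output)
    mac_table:      {mac: port} (hypothetical, parsed from MAC table output)
    expected_ports: {mac: port} optional, where each MAC should be learned
    """
    expected_ports = expected_ports or {}
    problems = []
    for ip, mac in sorted(arp_table.items()):
        port = mac_table.get(mac)
        if port is None:
            # ARP resolves, but the switch has no CAM entry for the MAC
            problems.append((ip, mac, "missing from MAC table"))
        elif mac in expected_ports and port != expected_ports[mac]:
            # CAM entry exists, but on a different port than expected
            problems.append((ip, mac, f"on {port}, expected {expected_ports[mac]}"))
    return problems


# Made-up sample data, for illustration only
arp = {
    "10.0.0.10": "0050.56aa.0001",
    "10.0.0.11": "0050.56aa.0002",
    "10.0.0.12": "0050.56aa.0003",
}
cam = {
    "0050.56aa.0001": "Te1/0/1",
    "0050.56aa.0003": "Te2/0/5",
}
expected = {"0050.56aa.0003": "Te1/0/3"}

for ip, mac, reason in find_arp_mac_mismatches(arp, cam, expected):
    print(ip, mac, reason)
```

Entries that resolve in ARP but are missing from the CAM table, or are learned on an unexpected port, are the ones worth chasing first.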

Thanks Cisco  :thankyou:

NetworkGroover

Quote from: SimonV on January 16, 2017, 07:57:52 AM

Loop guard on the downstream switches would have been helpful, yeah?  Minus the whole, "just stopped sending BPDUs" thing...
Engineer by day, DJ by night, family first always

srg

This is a beauty; https://bst.cloudapps.cisco.com/bugsearch/bug/CSCvb41889

Symptom:
after leap second addition, everyday night 23:59:50 leap sec addition happening

deanwebb

Quote from: srg on January 16, 2017, 10:35:27 AM

The timing on that couldn't have been any worse.

:haha2:
Take a baseball bat and trash all the routers, shout out "IT'S A NETWORK PROBLEM NOW, SUCKERS!" and then peel out of the parking lot in your Ferrari.
"The world could perish if people only worked on things that were easy to handle." -- Vladimir Savchenko
Any questions? No questions! | BCEB: Belkin Certified Expert Baffler | "Plan B is Plan A with an element of panic." -- John Clarke
Accounting is architecture, remember that!
Air gaps are high-latency Internet connections.

srg

Quote from: deanwebb on January 16, 2017, 12:21:58 PM
The timing on that couldn't have been any worse.

I think this is actually only triggered in releases that fix another XE bug, one that made the box crash and reboot due to a watchdog triggered by the leap second. ;)

SimonV

Quote from: AspiringNetworker on January 16, 2017, 10:33:32 AM
Loop guard on the downstream switches would have been helpful, yeah?  Minus the whole, "just stopped sending BPDUs" thing...

Well, not in this case, I think. Loop guard is implemented on the blocking links, but here the BPDUs would have stopped arriving on a designated port.

As I said, just a hypothesis at the moment; I still need to go through the logs... By the way, the loop was gone by the time I arrived and we couldn't reproduce it, so it could very well be that the loop was elsewhere.

NetworkGroover

Right but if I remember correctly, it triggers based on BPDUs not being received on those ports and places the port in loop-inconsistent state.  Are you saying that BPDUs were still being received on those ports?

Quote: The loop guard feature makes additional checks. If BPDUs are not received on a non-designated port, and loop guard is enabled, that port is moved into the STP loop-inconsistent blocking state, instead of the listening / learning / forwarding state. Without the loop guard feature, the port assumes the designated port role. The port moves to the STP forwarding state and creates a loop.
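
To make the quoted behaviour concrete, here is a small Python sketch of the decision it describes. It is a toy model of the documented rule only, not of how any switch implements STP; the function name, role strings and return strings are invented for illustration.

```python
def port_state_after_bpdu_loss(role, loop_guard_enabled):
    """Toy model: what happens to a port once BPDUs stop arriving on it.

    role: "root" or "alternate" (both non-designated), or "designated".
    """
    if role == "designated":
        # Loop guard acts on non-designated ports; nothing changes here.
        return "unchanged"
    if loop_guard_enabled:
        # Port is held in loop-inconsistent (blocking) instead of
        # going through listening / learning / forwarding.
        return "loop-inconsistent (blocking)"
    # Without loop guard the port assumes the designated role and
    # moves to forwarding, which is how the loop forms.
    return "designated, forwarding (loop risk)"


for role in ("alternate", "root", "designated"):
    for guard in (True, False):
        print(role, guard, "->", port_state_after_bpdu_loss(role, guard))
```

The case under discussion is the last row of the table: if the BPDUs stopped on a designated port, loop guard has nothing to act on.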

SimonV

Well, in any case, the blocking link moved to forwarding but I'm not sure if it's because it stopped receiving BPDUs on the blocked port itself, or if it just lost the root port on the other interface.  That IOS bug is really putting my focus on the core switch, and not on the access switch. It would be a serious coincidence to have two different issues in a time span of 4 hours.

SimonV

Quote: CSCuo58994 - Failed POST:PortASIC Macsec Loopback Tests during bootup

The system continuously reboots, failing for POST tests on WS-C3750X-24T-L:

POST: PortASIC Macsec Loopback Tests : Begin
Pattern not found Y \002
POST: Failed Packet compare asic_index 1 port_hardware_index 0
Pattern not found D \002
...
...
POST: Failed Packet compare asic_index 1 port_hardware_index 26
POST: Failed MacsecEncryption Packet Test asic_index 1 port_hardware_index 26
POST: PortASIC Macsec Loopback Tests : End, Status Failed

Error: Macsec POST failed. Cannot continue.

Workaround:
Use IOS version 12.2(55)SE9

:rolleyes: :rolleyes: :rolleyes:

Happening to one switch of a three-unit stack

deanwebb

Quote from: SimonV on January 27, 2017, 06:38:08 AM
Workaround:
Use IOS version 12.2(55)SE9

Ummm... is 12.2(55)SE9 a *downgrade* from the IOS you're on, by any chance?

SimonV

Yes, it is. It's on 15.0(2)SE4 so I'm not sure what the *downgrade path* is

deanwebb

Quote from: SimonV on January 27, 2017, 09:23:34 AM
Yes, it is. It's on 15.0(2)SE4 so I'm not sure what the *downgrade path* is

I believe the first step is...

:kiwf:

Just guessing...