What Causes a Switch to Crash?

deanwebb · April 28, 2018, 11:49:12 AM

"You turned on (X) and that brought down the switch!"

I've heard that more than once, in reference to various things, including but not limited to Netflow, SNMP, SSH connections, SPAN ports, and port-based ACLs. So, I'm wondering how much of that is for-reals and how much of that is a switch guy who was dealing with a flaky switch and just wanted to blame someone else for his switch going down.

Basically, what brings down switches and why?

wintermute000 · April 29, 2018, 07:43:06 PM

1.) broadcast storms
2.) features that smash the CPU like ACL logging, too-verbose-debugs etc.
3.) bugs - I have literally seen 'debug ntp' crash a 6500. Another classic: when the 10Gb modules were new, some versions of 3750X code would make the switch run at 80-100% CPU, redlining and dropping packets like flies, just from having the module physically installed. Legit bugs do happen.
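
To make item 1 concrete, here is a minimal Python sketch of why a broadcast storm escalates. It assumes a hypothetical full mesh of four switches with no spanning tree, where each switch floods a broadcast out of every inter-switch link except the one it arrived on. The topology and the name `simulate_storm` are illustrative assumptions, not something taken from the posts above.

```python
from collections import Counter

def simulate_storm(num_switches: int, ticks: int) -> list[int]:
    """Return the number of broadcast copies in flight after each tick.

    Hypothetical model: every switch links to every other switch, there is
    no STP, and Ethernet frames carry no TTL, so copies are never discarded.
    """
    # in_flight[(src, dst)] = copies currently travelling from src to dst
    in_flight = Counter()
    # Tick 1: a host on switch 0 sends one broadcast; switch 0 floods it.
    for dst in range(1, num_switches):
        in_flight[(0, dst)] += 1
    history = [sum(in_flight.values())]
    for _ in range(ticks - 1):
        nxt = Counter()
        for (src, dst), copies in in_flight.items():
            # dst floods each copy to every neighbour except the sender.
            for out in range(num_switches):
                if out not in (src, dst):
                    nxt[(dst, out)] += copies
        in_flight = nxt
        history.append(sum(in_flight.values()))
    return history

print(simulate_storm(num_switches=4, ticks=6))  # [3, 6, 12, 24, 48, 96]
```

In this toy model the copy count doubles every hop. Real switches have finite link and CPU capacity, so what you see in practice is saturated links and a pegged CPU rather than unbounded growth.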

icecream-guy · April 30, 2018, 06:24:25 AM

Typically, when a device hits an event that its code does not know how to handle, an exception is generated, and as a self-preservation measure the device reboots itself to reset everything back to "normal".
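
The reload-on-unhandled-exception behaviour described here can be sketched in Python. This is a loose analogy only, not how IOS is implemented; `handle_event`, `run_with_reload`, and the event names are all hypothetical.

```python
def handle_event(event: str, state: dict) -> None:
    """Process one event; raise on anything the code was not written for."""
    if event == "link_up":
        state["links"] += 1
    elif event == "link_down":
        state["links"] -= 1
    else:
        raise ValueError(f"no handler for event {event!r}")

def run_with_reload(events: list[str]) -> tuple[dict, list[str]]:
    """Feed events to the handler, 'reloading' whenever one is unhandled."""
    crash_log = []
    state = {"links": 0}          # known-good boot state
    for event in events:
        try:
            handle_event(event, state)
        except Exception as exc:
            # Record why we crashed (the crashinfo analogue), then
            # discard all runtime state and start from a clean boot.
            crash_log.append(str(exc))
            state = {"links": 0}
    return state, crash_log

state, crash_log = run_with_reload(["link_up", "link_up", "bogus", "link_up"])
print(state, crash_log)
```

Note that everything learned before the unhandled event is thrown away on the reload, which is why a crash on a device holding a lot of runtime state is so disruptive.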

deanwebb · April 30, 2018, 07:14:22 AM

Oh yes, the debug sessions that engineers forgot to turn off... Good call on that one.

mlan · May 01, 2018, 03:22:20 PM

The best crash I have ever experienced was a memory bit flip that forced a reload of a 6500 supervisor in a VSS pair. The ensuing network destruction that resulted from that crash was a sight to behold. Root cause was possibly a solar flare? Haha...

wintermute000 · May 01, 2018, 06:39:08 PM

are you sure it was a bit flip or was that a random guess by a TAC guy wanting to close it out?

icecream-guy · May 02, 2018, 06:45:51 AM

Quote from: wintermute000 on May 01, 2018, 06:39:08 PM
are you sure it was a bit flip or was that a random guess by a TAC guy wanting to close it out?

if they can't pinpoint the cause, that's the scapegoat.

dlots · May 02, 2018, 09:07:07 AM

Updating IOSs

The Cisco "test crash" command (will pretty much always crash your Cisco device)

Control Plane Policing

deanwebb · May 02, 2018, 10:07:52 AM

Quote from: dlots on May 02, 2018, 09:07:07 AM

Control Plane Policing

This is recommended for security on Cisco features that can't be switched off... what's some more detail / war story about how this brings down a switch?

dlots · May 02, 2018, 12:25:52 PM

I honestly don't remember if it was a switch or a router, but we put CoPP on, did a write mem, waited a while, and the device went down. It came back up, we waited a little, then it went back down (repeat).

mlan · May 04, 2018, 05:11:30 PM

Quote from: wintermute000 on May 01, 2018, 06:39:08 PM
are you sure it was a bit flip or was that a random guess by a TAC guy wanting to close it out?

I still have the SP and RP crashfiles... here are the relevant bits from the SP crashfile:


Cache error detected!
  CPO_ECC     (reg 26/0): 0x00000089
  CPO_CACHERI (reg 27/0): 0xA0000000
  CP0_CAUSE   (reg 13/0): 0x00001C00

Real cache error detected.  System will be halted.

Error: Primary data cache, fields: data,
Actual physical addr 0x00000000,
virtual address is imprecise.

 Imprecise Data Parity Error

 Imprecise Data Parity Error

 08:58:20 PDT Wed Jul 13 2011: Interrupt exception, CPU signal 20, PC = 0x40FEA860



--------------------------------------------------------------------
   Possible software fault. Upon reccurence, please collect
   crashinfo, "show tech" and contact Cisco Technical Support.
--------------------------------------------------------------------


-Traceback= 417BEE50 
$0 : 00000000, AT : 42640000, v0 : 52D11A90, v1 : 45BF04F8
a0 : 52D11AC4, a1 : 52D44E3C, a2 : 40FEA848, a3 : 52D44E3C
t0 : 408B5698, t1 : 3400FF01, t2 : 3400F100, t3 : FFFF00FF
t4 : 417B13A8, t5 : 0000FFFF, t6 : 00000004, t7 : 0000030D
s0 : 52D44E3C, s1 : 00000002, s2 : 40FEA848, s3 : 52D44E3C
s4 : 43ECEF90, s5 : 00000004, s6 : 00000000, s7 : EFFFFFFA
t8 : 55BB5088, t9 : 00000000, k0 : 55B8DC94, k1 : 408EAE50
gp : 42647238, sp : 52D44D90, s8 : 9FBF04BE, ra : 40FEA860
EPC  : 417BEE50, ErrorEPC : 40FEA860, SREG     : 3400FF05
MDLO : 3B13B68E, MDHI     : 00000719, BadVaddr : 00000000
DATA_START : 0x42322420
Cause 00000000 (Code 0x0): Interrupt exception

The SP crash forced the RP to reload and then all hell broke loose.... more info
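
The "Data Parity Error" lines in the crashfile refer to a parity check failing. As a rough illustration of the general technique (not of the supervisor's actual cache hardware), the Python sketch below stores one even-parity bit alongside a word and shows that a single flipped bit is detected but cannot be located or repaired. The function names are hypothetical, and the sample value is just borrowed from the register dump.

```python
def parity_bit(word: int) -> int:
    """Even parity: 1 if the word has an odd number of set bits, else 0."""
    return bin(word).count("1") % 2

def store(word: int) -> tuple[int, int]:
    """Return the word together with the parity bit computed at write time."""
    return word, parity_bit(word)

def check(word: int, stored_parity: int) -> bool:
    """True if the word still matches the parity recorded when it was written."""
    return parity_bit(word) == stored_parity

word, p = store(0x3400FF01)
assert check(word, p)                 # clean read

flipped = word ^ (1 << 7)             # a single bit flips in storage
print(check(flipped, p))              # False: error detected

double = word ^ (1 << 7) ^ (1 << 20)  # two bits flip
print(check(double, p))               # True: the error goes unnoticed
```

Parity only says that something is wrong, not which bit. When the affected data cannot simply be fetched again, stopping is the safe response, which fits the "System will be halted" line in the crashfile.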

deanwebb · May 10, 2018, 09:16:52 AM

Wow, a whole article on how to blame sunspots for your crash. Niiiiiiiiiiiiiiiiiiice. Putting that in my bag of tricks...

SimonV · May 14, 2018, 06:46:18 AM

I found this in the comments of one of the whitepapers about it:

Quote: When given the transient soft parity error explanation for a device or component failure, the following link may help you rule out Solar Flares as a possibility: http://www.tesis.lebedev.ru/en/sun_flares.html?m=9&d=11&y=2013

Replace date in the URL or click the calendar on the page.

deanwebb · May 14, 2018, 02:43:06 PM

That is so awesome. I think this is my favorite whitepaper now.

Dieselboy · May 21, 2018, 08:32:29 PM

Quote from: mlan on May 04, 2018, 05:11:30 PM
Quote from: wintermute000 on May 01, 2018, 06:39:08 PM
are you sure it was a bit flip or was that a random guess by a TAC guy wanting to close it out?

I still have the SP and RP crashfiles... here are the relevant bits from the SP crashfile:


The SP crash forced the RP to reload and then all hell broke loose.... more info

How is it possible when Cisco equipment uses ECC memory?
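
For contrast with plain parity, the Python sketch below shows what single-error-correcting ECC does, using a textbook Hamming(7,4) code. It is a generic illustration only and makes no claim about which memories or caches in a 6500 supervisor are ECC-protected; note that the crashfile quoted above reports a parity error in the primary data cache. The function names are hypothetical.

```python
def hamming74_encode(nibble: int) -> list[int]:
    """Encode 4 data bits into a 7-bit codeword (index 0 unused)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                          # positions 1..7; code[0] is padding
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions with bit 0 set
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions with bit 1 set
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions with bit 2 set
    return code

def hamming74_decode(code: list[int]) -> tuple[int, int]:
    """Return (nibble, error_position); error_position is 0 if none seen."""
    code = code[:]                          # do not modify the caller's list
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of the positions of all set bits
    if syndrome:
        code[syndrome] ^= 1                 # flip the bit the syndrome points at
    d = [code[3], code[5], code[6], code[7]]
    nibble = sum(bit << i for i, bit in enumerate(d))
    return nibble, syndrome

codeword = hamming74_encode(0b1011)
codeword[6] ^= 1                            # single bit flip in storage
nibble, pos = hamming74_decode(codeword)
print(nibble == 0b1011, pos)                # True 6
```

This small code repairs any single flipped bit, but two flips in the same codeword would be "corrected" to the wrong value. Practical ECC memory adds an extra overall parity bit (SECDED) so that double-bit errors are at least detected, and anything beyond that still ends in an uncorrectable error.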