Networking-Forums.com

Professional Discussions => Routing and Switching => Topic started by: deanwebb on April 28, 2018, 11:49:12 AM

Title: What Causes a Switch to Crash?
Post by: deanwebb on April 28, 2018, 11:49:12 AM
"You turned on (X) and that brought down the switch!"

I've heard that more than once, in reference to various things, including but not limited to Netflow, SNMP, SSH connections, SPAN ports, and port-based ACLs. So, I'm wondering how much of that is for-reals and how much of that is a switch guy that was dealing with a flaky switch and just wanted to blame someone else for his switch going down.

Basically, what brings down switches and why?
Title: Re: What Causes a Switch to Crash?
Post by: wintermute000 on April 29, 2018, 07:43:06 PM
1.) broadcast storms
2.) features that smash the CPU like ACL logging, too-verbose-debugs etc.
3.) bugs - I have literally seen 'debug ntp' crash a 6500. Another classic: when the 10Gb modules were new, some versions of 3750X code would make the thing run at 80-100% CPU, redlining and dropping packets like flies, just for having the module physically in there. Legit bugs do happen.
Title: Re: What Causes a Switch to Crash?
Post by: icecream-guy on April 30, 2018, 06:24:25 AM
Typically, when an event happens on a device that the programmed code does not know how to handle, an exception is generated, and as a life preservation task the device will reboot itself to reset everything back to "normal".
Title: Re: What Causes a Switch to Crash?
Post by: deanwebb on April 30, 2018, 07:14:22 AM
Oh yes, the debug sessions that engineers forgot to turn off... Good call on that one.
Title: Re: What Causes a Switch to Crash?
Post by: mlan on May 01, 2018, 03:22:20 PM
The best crash I have ever experienced was a memory bit flip that forced a reload of a 6500 supervisor in a VSS pair.  The ensuing network destruction that resulted from that crash was a sight to behold.  Root cause was possibly a solar flare?  Haha...
Title: Re: What Causes a Switch to Crash?
Post by: wintermute000 on May 01, 2018, 06:39:08 PM
are you sure it was a bit flip or was that a random guess by a TAC guy wanting to close it out?
Title: Re: What Causes a Switch to Crash?
Post by: icecream-guy on May 02, 2018, 06:45:51 AM
Quote from: wintermute000 on May 01, 2018, 06:39:08 PM
are you sure it was a bit flip or was that a random guess by a TAC guy wanting to close it out?

if they can't pinpoint the cause, that's the scapegoat.
Title: Re: What Causes a Switch to Crash?
Post by: dlots on May 02, 2018, 09:07:07 AM
Updating IOSs

The Cisco "test crash" command (will pretty much always crash your Cisco device)

Control Plane Policing
Title: Re: What Causes a Switch to Crash?
Post by: deanwebb on May 02, 2018, 10:07:52 AM
Quote from: dlots on May 02, 2018, 09:07:07 AM

Control Plane Policing


This is recommended for security on Cisco features that can't be switched off... what's some more detail / war story about how this brings down a switch?
Title: Re: What Causes a Switch to Crash?
Post by: dlots on May 02, 2018, 12:25:52 PM
I honestly don't remember if it was a switch or a router, but we put CoPP on, did a write mem, waited a while, and the device went down. It came back up, we waited a few, then it went back down (repeat).
Title: Re: What Causes a Switch to Crash?
Post by: mlan on May 04, 2018, 05:11:30 PM
Quote from: wintermute000 on May 01, 2018, 06:39:08 PM
are you sure it was a bit flip or was that a random guess by a TAC guy wanting to close it out?

I still have the SP and RP crashfiles... here are the relevant bits from the SP crashfile:

Cache error detected!
  CPO_ECC     (reg 26/0): 0x00000089
  CPO_CACHERI (reg 27/0): 0xA0000000
  CP0_CAUSE   (reg 13/0): 0x00001C00

Real cache error detected.  System will be halted.

Error: Primary data cache, fields: data,
Actual physical addr 0x00000000,
virtual address is imprecise.

Imprecise Data Parity Error

Imprecise Data Parity Error

08:58:20 PDT Wed Jul 13 2011: Interrupt exception, CPU signal 20, PC = 0x40FEA860



--------------------------------------------------------------------
   Possible software fault. Upon reccurence, please collect
   crashinfo, "show tech" and contact Cisco Technical Support.
--------------------------------------------------------------------


-Traceback= 417BEE50
$0 : 00000000, AT : 42640000, v0 : 52D11A90, v1 : 45BF04F8
a0 : 52D11AC4, a1 : 52D44E3C, a2 : 40FEA848, a3 : 52D44E3C
t0 : 408B5698, t1 : 3400FF01, t2 : 3400F100, t3 : FFFF00FF
t4 : 417B13A8, t5 : 0000FFFF, t6 : 00000004, t7 : 0000030D
s0 : 52D44E3C, s1 : 00000002, s2 : 40FEA848, s3 : 52D44E3C
s4 : 43ECEF90, s5 : 00000004, s6 : 00000000, s7 : EFFFFFFA
t8 : 55BB5088, t9 : 00000000, k0 : 55B8DC94, k1 : 408EAE50
gp : 42647238, sp : 52D44D90, s8 : 9FBF04BE, ra : 40FEA860
EPC  : 417BEE50, ErrorEPC : 40FEA860, SREG     : 3400FF05
MDLO : 3B13B68E, MDHI     : 00000719, BadVaddr : 00000000
DATA_START : 0x42322420
Cause 00000000 (Code 0x0): Interrupt exception



The SP crash forced the RP to reload and then all hell broke loose....   more info (https://www.cisco.com/c/en/us/support/docs/switches/catalyst-6500-series-switches/116135-trouble-6500-parity-00.html)
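
For anyone sorting through a pile of crashfiles, the cache-error block above is regular enough to pull out with a short script. This is only a rough sketch: the function name `parse_cache_error` and the regex are illustrative assumptions built from the excerpt in this post, not from any Cisco tool, and other platforms or software versions may format these lines differently.

```python
import re

# Matches lines such as "  CPO_ECC     (reg 26/0): 0x00000089".
# The name pattern is deliberately loose because the excerpt itself
# prints both "CPO_" and "CP0_" prefixes.
REG_LINE = re.compile(
    r"^\s*(CP[O0]_\w+)\s+\(reg (\d+)/(\d+)\):\s+0x([0-9A-Fa-f]+)\s*$"
)


def parse_cache_error(text):
    """Return (registers, parity_lines) found in a crashfile excerpt.

    Hypothetical helper: registers maps the printed register name to its
    integer value; parity_lines lists every line mentioning a parity error.
    """
    registers = {}
    parity_lines = []
    for line in text.splitlines():
        match = REG_LINE.match(line)
        if match:
            name, _reg, _sel, value = match.groups()
            registers[name] = int(value, 16)
        elif "Parity Error" in line:
            parity_lines.append(line.strip())
    return registers, parity_lines


# Sample text copied from the excerpt above (trimmed).
SAMPLE = """\
Cache error detected!
  CPO_ECC     (reg 26/0): 0x00000089
  CPO_CACHERI (reg 27/0): 0xA0000000
  CP0_CAUSE   (reg 13/0): 0x00001C00

Real cache error detected.  System will be halted.

Imprecise Data Parity Error
"""

registers, parity_lines = parse_cache_error(SAMPLE)
for name, value in registers.items():
    print(f"{name}: 0x{value:08X}")
print(f"parity error lines: {len(parity_lines)}")
```

The script only extracts the values; it makes no attempt to decode what the individual register bits mean.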
Title: Re: What Causes a Switch to Crash?
Post by: deanwebb on May 10, 2018, 09:16:52 AM
Wow, a whole article on how to blame sunspots for your crash. Niiiiiiiiiiiiiiiiiiice. Putting that in my bag of tricks...
Title: Re: What Causes a Switch to Crash?
Post by: SimonV on May 14, 2018, 06:46:18 AM
I found this in the comments of one of the whitepapers about it:

Quote:
When given the transient soft parity error explanation for a device or component failure, the following link may help you rule out Solar Flares as a possibility: http://www.tesis.lebedev.ru/en/sun_flares.html?m=9&d=11&y=2013

Replace date in the URL or click the calendar on the page.
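
If you want to generate that link for a given crash date rather than edit it by hand, here is a minimal sketch. It assumes the page keeps taking the `m`, `d` and `y` query parameters exactly as they appear in the quoted link (unpadded month and day); `flare_page_url` is a made-up helper name, and nothing here checks that the site is still reachable.

```python
from datetime import date
from urllib.parse import urlencode

# Base address taken from the link quoted above.
BASE_URL = "http://www.tesis.lebedev.ru/en/sun_flares.html"


def flare_page_url(day):
    """Build the flare-calendar link for one date (hypothetical helper)."""
    # Parameter names and order mirror the quoted link: m, d, y.
    query = urlencode({"m": day.month, "d": day.day, "y": day.year})
    return f"{BASE_URL}?{query}"


# Date stamped in the SP crashfile earlier in the thread: Jul 13 2011.
print(flare_page_url(date(2011, 7, 13)))
```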
Title: Re: What Causes a Switch to Crash?
Post by: deanwebb on May 14, 2018, 02:43:06 PM
That is so awesome. I think this is my favorite whitepaper now.
Title: Re: What Causes a Switch to Crash?
Post by: Dieselboy on May 21, 2018, 08:32:29 PM
Quote from: mlan on May 04, 2018, 05:11:30 PM
The SP crash forced the RP to reload and then all hell broke loose....   more info (https://www.cisco.com/c/en/us/support/docs/switches/catalyst-6500-series-switches/116135-trouble-6500-parity-00.html)

How is it possible when Cisco equipment uses ECC memory?
Title: Re: What Causes a Switch to Crash?
Post by: shortstop20 on May 22, 2018, 12:15:05 PM
Quote from: Dieselboy on May 21, 2018, 08:32:29 PM
How is it possible when Cisco equipment uses ECC memory?

I can't answer that question but we have seen the parity errors on a Catalyst 6807, twice.
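
One possible reading, offered as an assumption rather than a confirmed root cause: the crashfile mlan posted earlier in the thread reports the fault against the "Primary data cache" as a "Data Parity Error", not against main memory. If that cache is protected by parity alone (an assumption here), a flipped bit can be detected but not repaired, because a single parity bit carries no information about which bit changed. The toy sketch below shows that general property with plain even parity over one byte; it is not a model of the actual supervisor hardware, and the function names are illustrative.

```python
def even_parity(byte):
    """Return the even-parity bit for an 8-bit value (1 if the count of 1 bits is odd)."""
    return bin(byte & 0xFF).count("1") % 2


def parity_ok(byte, stored_parity):
    """True if the data still agrees with the parity bit stored alongside it."""
    return even_parity(byte) == stored_parity


original = 0b10110010
stored_parity = even_parity(original)

single_flip = original ^ 0b00000100  # one bit flipped
double_flip = original ^ 0b00010100  # two bits flipped

print(parity_ok(original, stored_parity))     # unchanged data passes the check
print(parity_ok(single_flip, stored_parity))  # a single flip fails the check
print(parity_ok(double_flip, stored_parity))  # a double flip slips through
```

The check can say "something is wrong" for the single flip, but it cannot point at the bad bit, so the only safe reaction is to stop using the data. ECC schemes store extra check bits precisely so that a single flipped bit can be located and corrected.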