What Causes a Switch to Crash?

Started by deanwebb, April 28, 2018, 11:49:12 AM

deanwebb

"You turned on (X) and that brought down the switch!"

I've heard that more than once, in reference to various things, including but not limited to Netflow, SNMP, SSH connections, SPAN ports, and port-based ACLs. So, I'm wondering how much of that is for-reals and how much of that is a switch guy that was dealing with a flaky switch and just wanted to blame someone else for his switch going down.

Basically, what brings down switches and why?
Take a baseball bat and trash all the routers, shout out "IT'S A NETWORK PROBLEM NOW, SUCKERS!" and then peel out of the parking lot in your Ferrari.
"The world could perish if people only worked on things that were easy to handle." -- Vladimir Savchenko
Вопросы есть? Вопросов нет! ("Any questions? No questions!") | BCEB: Belkin Certified Expert Baffler | "Plan B is Plan A with an element of panic." -- John Clarke
Accounting is architecture, remember that!
Air gaps are high-latency Internet connections.

wintermute000

1.) broadcast storms
2.) features that smash the CPU like ACL logging, too-verbose-debugs etc.
3.) bugs - I have literally seen 'debug ntp' crash a 6500. Another classic: when the 10Gb modules were new, some versions of 3750X code would make the thing run at 80-100% CPU, redlining and dropping packets like flies, just for having the module physically in there. Legit bugs do happen.

icecream-guy

Typically, when an event happens on a device that the code does not know how to handle, an exception is generated, and as a self-preservation measure the device reboots itself to reset everything back to "normal".
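
A toy sketch of that idea in Python. All names here (`Device`, `handle`, `run`) are made up for illustration, and this is not how any vendor's OS is actually written; it only shows the "unknown event, exception, reset to clean state" pattern:

```python
class Device:
    """Toy model of a device that reloads when it hits an unhandled event."""

    def __init__(self):
        self.reload_count = 0
        self.state = {}

    def _link_up(self):
        self.state["link"] = "up"

    def _link_down(self):
        self.state["link"] = "down"

    def handle(self, event):
        # Only these two events have a handler; anything else is "unknown".
        handlers = {"link_up": self._link_up, "link_down": self._link_down}
        handlers[event]()  # raises KeyError for an event with no handler

    def run(self, events):
        for event in events:
            try:
                self.handle(event)
            except Exception:
                # Last resort: throw away all state and start clean ("reload").
                self.state = {}
                self.reload_count += 1


device = Device()
device.run(["link_up", "mystery_event", "link_down"])
print(device.reload_count, device.state)  # 1 {'link': 'down'}
```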
:professorcat:

My Moral Fibers have been cut.

deanwebb

Oh yes, the debug sessions that engineers forgot to turn off... Good call on that one.

mlan

The best crash I have ever experienced was a memory bit flip that forced a reload of a 6500 supervisor in a VSS pair.  The ensuing network destruction that resulted from that crash was a sight to behold.  Root cause was possibly a solar flare?  Haha...

wintermute000

are you sure it was a bit flip or was that a random guess by a TAC guy wanting to close it out?

icecream-guy

Quote from: wintermute000 on May 01, 2018, 06:39:08 PM
are you sure it was a bit flip or was that a random guess by a TAC guy wanting to close it out?

if they can't pinpoint the cause, that's the scapegoat.

dlots

Updating IOSs

The Cisco "test crash" command (will pretty much always crash your Cisco device)

Control Plane Policing

deanwebb

Quote from: dlots on May 02, 2018, 09:07:07 AM

Control Plane Policing


This is recommended for security on Cisco features that can't be switched off... what's some more detail / war story about how this brings down a switch?

dlots

I honestly don't remember if it was a switch or a router, but we put CoPP on, did a write mem, waited a while and the device went down, came back up, waited a few, then it went back down (repeat).
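
For anyone wondering what a control-plane policer does in principle, here is a minimal token-bucket sketch in Python. It is a generic policer, not Cisco's CoPP implementation, and `TokenBucket`, `police`, `rate_pps` and `burst` are hypothetical names. It does not explain the reload loop described above; it only shows that anything arriving faster than the configured rate gets dropped, which is why a rate set too low for a class can starve legitimate control traffic.

```python
class TokenBucket:
    """Generic token-bucket policer: one token per packet."""

    def __init__(self, rate_pps, burst):
        self.rate_pps = rate_pps      # tokens added per second
        self.burst = burst            # maximum tokens the bucket can hold
        self.tokens = float(burst)    # start with a full bucket
        self.last = 0.0               # time of the previous packet

    def allow(self, now):
        # Refill according to the time elapsed, capped at the burst size.
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate_pps)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True               # packet is passed to the CPU
        return False                  # packet is dropped by the policer


def police(arrival_times, rate_pps, burst):
    """Return (passed, dropped) counts for packets arriving at the given times."""
    bucket = TokenBucket(rate_pps, burst)
    passed = sum(1 for t in arrival_times if bucket.allow(t))
    return passed, len(arrival_times) - passed


# 50 packets arriving in the same instant against a burst of 10:
print(police([0.0] * 50, rate_pps=100, burst=10))  # (10, 40)
```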

mlan

Quote from: wintermute000 on May 01, 2018, 06:39:08 PM
are you sure it was a bit flip or was that a random guess by a TAC guy wanting to close it out?

I still have the SP and RP crashfiles... here are the relevant bits from the SP crashfile:

Cache error detected!
  CPO_ECC     (reg 26/0): 0x00000089
  CPO_CACHERI (reg 27/0): 0xA0000000
  CP0_CAUSE   (reg 13/0): 0x00001C00

Real cache error detected.  System will be halted.

Error: Primary data cache, fields: data,
Actual physical addr 0x00000000,
virtual address is imprecise.

Imprecise Data Parity Error

Imprecise Data Parity Error

08:58:20 PDT Wed Jul 13 2011: Interrupt exception, CPU signal 20, PC = 0x40FEA860



--------------------------------------------------------------------
   Possible software fault. Upon reccurence, please collect
   crashinfo, "show tech" and contact Cisco Technical Support.
--------------------------------------------------------------------


-Traceback= 417BEE50
$0 : 00000000, AT : 42640000, v0 : 52D11A90, v1 : 45BF04F8
a0 : 52D11AC4, a1 : 52D44E3C, a2 : 40FEA848, a3 : 52D44E3C
t0 : 408B5698, t1 : 3400FF01, t2 : 3400F100, t3 : FFFF00FF
t4 : 417B13A8, t5 : 0000FFFF, t6 : 00000004, t7 : 0000030D
s0 : 52D44E3C, s1 : 00000002, s2 : 40FEA848, s3 : 52D44E3C
s4 : 43ECEF90, s5 : 00000004, s6 : 00000000, s7 : EFFFFFFA
t8 : 55BB5088, t9 : 00000000, k0 : 55B8DC94, k1 : 408EAE50
gp : 42647238, sp : 52D44D90, s8 : 9FBF04BE, ra : 40FEA860
EPC  : 417BEE50, ErrorEPC : 40FEA860, SREG     : 3400FF05
MDLO : 3B13B68E, MDHI     : 00000719, BadVaddr : 00000000
DATA_START : 0x42322420
Cause 00000000 (Code 0x0): Interrupt exception



The SP crash forced the RP to reload and then all hell broke loose....   more info
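
If you have a pile of crashinfo files to go through, a small script can pull out the cache-error registers and flag parity errors. This is a rough sketch that assumes the register lines look exactly like the excerpt above; `parse_cache_error` and `REG_LINE` are hypothetical names, and it makes no attempt to decode what the register bits mean, since that is platform-specific. The register labels are kept exactly as the device prints them.

```python
import re

# Matches lines such as "  CPO_ECC     (reg 26/0): 0x00000089"
REG_LINE = re.compile(
    r"^\s*(\w+)\s+\(reg\s+(\d+)/(\d+)\):\s+0x([0-9A-Fa-f]+)\s*$"
)


def parse_cache_error(text):
    """Return ({register label: integer value}, parity_error_seen)."""
    registers = {}
    parity_error = False
    for line in text.splitlines():
        match = REG_LINE.match(line)
        if match:
            registers[match.group(1)] = int(match.group(4), 16)
        elif "Parity Error" in line:
            parity_error = True
    return registers, parity_error


SAMPLE = """\
Cache error detected!
  CPO_ECC     (reg 26/0): 0x00000089
  CPO_CACHERI (reg 27/0): 0xA0000000
  CP0_CAUSE   (reg 13/0): 0x00001C00

Imprecise Data Parity Error
"""

registers, parity_error = parse_cache_error(SAMPLE)
for name, value in registers.items():
    print(f"{name}: 0x{value:08X}")
print("parity error reported:", parity_error)
```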

deanwebb

Wow, a whole article on how to blame sunspots for your crash. Niiiiiiiiiiiiiiiiiiice. Putting that in my bag of tricks...

SimonV

I found this in the comments of one of the whitepapers about it:

Quote: When given the transient soft parity error explanation for a device or component failure, the following link may help you rule out Solar Flares as a possibility: http://www.tesis.lebedev.ru/en/sun_flares.html?m=9&d=11&y=2013

Replace date in the URL or click the calendar on the page.
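
If you'd rather script it, here is a tiny Python helper that builds the link for a given crash date. It assumes `m`, `d` and `y` are month, day and year, which is only inferred from the example URL above; `flare_url` is a made-up name.

```python
from datetime import date
from urllib.parse import urlencode

BASE_URL = "http://www.tesis.lebedev.ru/en/sun_flares.html"


def flare_url(day):
    """Build the solar-flare page URL for a given date (m/d/y assumed)."""
    query = urlencode({"m": day.month, "d": day.day, "y": day.year})
    return BASE_URL + "?" + query


# Date taken from the timestamp in the crashfile posted earlier in the thread.
print(flare_url(date(2011, 7, 13)))
# http://www.tesis.lebedev.ru/en/sun_flares.html?m=7&d=13&y=2011
```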

deanwebb

That is so awesome. I think this is my favorite whitepaper now.

Dieselboy

Quote from: mlan on May 04, 2018, 05:11:30 PM
Quote from: wintermute000 on May 01, 2018, 06:39:08 PM
are you sure it was a bit flip or was that a random guess by a TAC guy wanting to close it out?

I still have the SP and RP crashfiles... here are the relevant bits from the SP crashfile: [crashfile excerpt snipped, see mlan's post above]

The SP crash forced the RP to reload and then all hell broke loose....   more info

How is it possible when Cisco equipment uses ECC memory?
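
One possible reading, not confirmed anywhere in this thread: the crashfile names the "Primary data cache" and a "Data Parity Error", so the flipped bit may have been in a parity-protected CPU cache rather than in ECC-protected main memory. The Python sketch below shows the general difference between the two schemes. It uses a single parity bit and a textbook Hamming(7,4) code as stand-ins, and says nothing about the actual hardware in a 6500 supervisor.

```python
def parity_bit(bits):
    """Even-parity bit: can reveal that one bit flipped, but not which one."""
    return sum(bits) % 2


def hamming74_encode(data):
    """Encode 4 data bits as 7 bits laid out as p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = (d1 + d2 + d4) % 2
    p2 = (d1 + d3 + d4) % 2
    p3 = (d2 + d3 + d4) % 2
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(code):
    """Fix at most one flipped bit and return the 4 data bits."""
    c = list(code)
    s1 = (c[0] + c[2] + c[4] + c[6]) % 2
    s2 = (c[1] + c[2] + c[5] + c[6]) % 2
    s3 = (c[3] + c[4] + c[5] + c[6]) % 2
    position = s1 + 2 * s2 + 4 * s3   # 1-based index of the bad bit, 0 if none
    if position:
        c[position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


data = [1, 0, 1, 1]

# Parity: the flip is detected, but the original value cannot be recovered.
stored_parity = parity_bit(data)
flipped = [1, 0, 0, 1]
print("parity mismatch:", parity_bit(flipped) != stored_parity)  # True

# ECC-style code: the same single-bit flip is located and corrected.
codeword = hamming74_encode(data)
codeword[5] ^= 1
print("recovered:", hamming74_correct(codeword))  # [1, 0, 1, 1]
```

With detection only, the safe move on a mismatch is to stop and reload, which would match the "System will be halted" line in the crashfile.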