routing and switching issue

Started by icecream-guy, June 05, 2023, 06:36:21 PM


icecream-guy

So I have not been an NE (network engineer) for a few years. Disclaimer: I don't know the exact models or versions of the devices affected; I am a firewall security engineer. This is just a documentation post for the major outage we had over the past weekend, covering the symptoms and some findings.

So around last Friday there were intermittent outages in one of our data centers. A lot of this is hearsay. Users could access devices but then could not auth; we could access the network with AD credentials but not with RSA credentials. For me, I could log into a firewall management interface with AD but could not get to enable mode with RSA. It was on and off: it would fail, but then a few minutes later I could get in.

Hellish troubleshooting weekend, where it seemed like the problem was flapping (but flapping where?) across many applications (not just RSA). Users couldn't log in to devices, web services, and other services. Most of these issues were related to the F5 (VIPs). We determined that the F5s were seeing the flaps: the F5 monitoring ports were seen flapping on the F5s. From past experience with issues like this, it was probably a bad fiber or SFP on one of the port-channel members on one of the data center switches. The F5s seemed to be connected to the same switches as some of the hosts, e.g. we could not ping from the F5 to a backend server on the same network.
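
For anyone trying to answer the "flapping where?" question from monitor logs, here is a minimal sketch of one way to quantify flapping: count up/down state changes per monitored port and find the busiest time window. The event format, the `max_flaps_in_window` helper, and the sample timestamps are all hypothetical illustrations, not data or tooling from this outage.

```python
from datetime import datetime, timedelta

def transitions(events):
    """Return the timestamps at which the state differs from the previous event.

    events: list of (timestamp, state) tuples, sorted by timestamp.
    """
    changed = []
    prev = None
    for ts, state in events:
        if prev is not None and state != prev:
            changed.append(ts)
        prev = state
    return changed

def max_flaps_in_window(events, window):
    """Largest number of state changes seen inside any sliding time window."""
    times = transitions(events)
    best = 0
    start = 0
    for end in range(len(times)):
        # Shrink the window from the left until it spans at most `window`.
        while times[end] - times[start] > window:
            start += 1
        best = max(best, end - start + 1)
    return best

# Hypothetical monitor events for one port (made-up data).
t0 = datetime(2023, 6, 2, 9, 0, 0)
sample = [
    (t0, "up"),
    (t0 + timedelta(seconds=30), "down"),
    (t0 + timedelta(seconds=60), "up"),
    (t0 + timedelta(seconds=90), "down"),
    (t0 + timedelta(minutes=20), "up"),
]

print(max_flaps_in_window(sample, timedelta(minutes=5)))
```

Running the same count per port or per pool member and sorting by the result is one way to see whether the flaps cluster behind a single switch.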

Working with Cisco TAC, the issue was isolated to a few 7Ks on the network. It seemed that the F5s and hosts were connected to one of the switches that was having issues. It looked like a possible memory issue, but the syslogs couldn't confirm it; there were no memory errors in the logs.

Once the affected devices were identified and rebooted, the reboot process showed that there were memory issues on the device. TAC made some config adjustments, and the issue cleared up for the most part.

Although this is informational for the members here, it is more a post for those who are seeing a similar problem and maybe googling for answers.

GN


:professorcat:

My Moral Fibers have been cut.

deanwebb

Makes me want to rant about memory in Cisco gear... the stuff is cheap, we can buy it by the GB, and we're still dealing with MB-level quantities in our network gear...
Take a baseball bat and trash all the routers, shout out "IT'S A NETWORK PROBLEM NOW, SUCKERS!" and then peel out of the parking lot in your Ferrari.
"The world could perish if people only worked on things that were easy to handle." -- Vladimir Savchenko
Any questions? No questions! | BCEB: Belkin Certified Expert Baffler | "Plan B is Plan A with an element of panic." -- John Clarke
Accounting is architecture, remember that!
Air gaps are high-latency Internet connections.

Dieselboy

Strange problem!

Cisco uses ECC memory, which is in large part why Cisco routers used to be the most stable networking gear. ECC is error-correcting memory. Compare that with normal user computers, where the system just crashes when a memory error occurs. As a guess, I would lean more towards a software bug relating to memory rather than a hardware issue.

Our customer had a similar issue, but there EIGRP kept breaking and hello packets were getting lost. A reboot of the device seemed to fix that, too.


Otanx

I am at Cisco Live this week, and taking a few of the breakout sessions on deep-dive troubleshooting. The largest section is about how to identify memory issues, either just high utilization or memory leaks, and then how to identify the root cause. This seems to be the biggest issue they run into.

-Otanx

icecream-guy

With more details, it seems like a memory allocation issue: with almost 3000 routes on the device, more memory had to be allocated to the region where the routes are stored. Reminds me of the TCAM issues on the 6500s when dealing with IPv4 and IPv6 routing and trying to balance the two.
Seems like we need some sort of route summarization, but that may mean re-IPing large parts of the network.
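
As a rough way to gauge how much summarization could help before any re-IP work, here is a minimal sketch that collapses a list of prefixes into the fewest covering networks using Python's standard `ipaddress` module. The prefixes below are made-up examples, not routes from this network, and the `summarize` helper is hypothetical.

```python
import ipaddress

def summarize(prefixes):
    """Collapse contiguous or overlapping prefixes into the fewest networks.

    IPv4 and IPv6 are collapsed separately because
    ipaddress.collapse_addresses() rejects mixed address versions.
    """
    nets = [ipaddress.ip_network(p) for p in prefixes]
    v4 = [n for n in nets if n.version == 4]
    v6 = [n for n in nets if n.version == 6]
    return list(ipaddress.collapse_addresses(v4)) + list(
        ipaddress.collapse_addresses(v6)
    )

# Hypothetical routing-table prefixes (illustrative only).
routes = [
    "10.1.0.0/24",
    "10.1.1.0/24",
    "10.1.2.0/24",
    "10.1.3.0/24",
    "10.2.0.0/24",
]

summary = summarize(routes)
print(len(routes), "routes ->", len(summary), "summaries")
for net in summary:
    print(net)
```

This only merges prefixes that aggregate exactly, so it never covers address space that was not already routed. It also ignores next hops, so a real summary would have to be computed per next hop or per advertising boundary. If the existing addressing is scattered, the count barely drops, which is the case where a re-IP becomes the only way to get clean summaries.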