IOS Order of operations with firewall and nat failover problem

Started by Dieselboy, October 28, 2018, 10:41:10 PM

Dieselboy

I have set up a remote site with two internet connections, with failover using IP SLA and tracked routing, as I have done many times before. This time, however, there is no internet access (and no ICMP ping reply) during the failover.

I can see the tracked route being removed and the second default route taking over; it was active when I ran "show ip route". However, when I ran "show ip cef x.x.x.x", where x.x.x.x is a random internet host, the output showed CEF still sending the packet to the wrong ISP, where there was no route. I disabled CEF to see whether that would resolve the issue, but it did not. Then the primary ISP came back up, and I have not been able to test again yet. TAC advised that the cause is the existing NAT on the router sending the packets out of the wrong interface. However, I don't think that would have caused any issue with ICMP, and ICMP was also not working at that time.
Also, the backup internet connection used to be the primary, so I know the config works there. As a test, I added a static route for networking-forums.com via the backup ISP: I can access the site fine, and traceroute shows the traffic going via the backup. There is just no connectivity during the failover.
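
For readers following along, here is a minimal sketch of the kind of dual-ISP setup described above. It is not the poster's actual config: the interface names, the addresses (documentation ranges), the SLA target 192.0.2.1, and the names NAT-ACL, NAT-PRIMARY and NAT-BACKUP are all assumptions for illustration, and the zone firewall config is left out.

```text
! Hypothetical dual-ISP failover sketch; all names and addresses are placeholders
interface GigabitEthernet0/0
 description Primary ISP
 ip address 203.0.113.2 255.255.255.252
 ip nat outside
!
interface GigabitEthernet0/1
 description Backup ISP
 ip address 198.51.100.2 255.255.255.252
 ip nat outside
!
interface GigabitEthernet0/2
 description LAN
 ip address 192.168.10.1 255.255.255.0
 ip nat inside
!
! Probe a host through the primary ISP only
ip sla 1
 icmp-echo 192.0.2.1 source-interface GigabitEthernet0/0
 frequency 5
ip sla schedule 1 life forever start-time now
!
track 1 ip sla 1 reachability
!
! Host route pins the probe to the primary so it cannot succeed via the backup
ip route 192.0.2.1 255.255.255.255 203.0.113.1
! Primary default route is withdrawn when track 1 goes down
ip route 0.0.0.0 0.0.0.0 203.0.113.1 track 1
! Floating default route via the backup ISP (administrative distance 10)
ip route 0.0.0.0 0.0.0.0 198.51.100.1 10
!
ip access-list standard NAT-ACL
 permit 192.168.10.0 0.0.0.255
!
! One route-map per egress interface so the NAT rule follows the routing decision
route-map NAT-PRIMARY permit 10
 match ip address NAT-ACL
 match interface GigabitEthernet0/0
route-map NAT-BACKUP permit 10
 match ip address NAT-ACL
 match interface GigabitEthernet0/1
!
ip nat inside source route-map NAT-PRIMARY interface GigabitEthernet0/0 overload
ip nat inside source route-map NAT-BACKUP interface GigabitEthernet0/1 overload
```

The "match interface" clause is what is supposed to tie each overload rule to the interface the routing table selected, which is why the order of operations matters so much for this design.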

Found this, but it doesn't mention the firewall, and I'm using the zone firewall. https://www.techrepublic.com/article/understand-the-order-of-operations-for-cisco-ios/

Dieselboy

Tested failover again today. Internet access was restored after running the command "clear ip nat translation *".

This suggests NAT is the culprit. However, in the order of operations routing is done before NAT, so I am a bit confused.

After testing the above, I then made the primary ISP live again. The IP SLA track went back to normal and the tracked route was re-installed. Internet connectivity was restored immediately, and no clearing of any NAT translations was required.

If NAT translations were the issue, then why are they not affecting the failback?  :twitch:
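
One way to check whether leftover translations are pinning flows to the old ISP address is to compare the routing, CEF and NAT views side by side during the failover. This is a sketch only: track object 1 and the sample destination 192.0.2.10 are assumed placeholders, not values from this thread.

```text
! Is the track object down, and which default route is installed?
show track 1
show ip route 0.0.0.0
! Where does CEF actually forward a sample destination?
show ip cef 192.0.2.10 detail
! Which inside-global address are the existing translations using?
show ip nat translations
show ip nat statistics
```

If the inside-global column in "show ip nat translations" still shows the address of the interface that is no longer the egress, that would be consistent with existing entries being reused rather than rebuilt.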

Dieselboy

Did some packet captures today. During the failback scenario, when the primary ISP is up once again, we now see the router sending the packet out of the primary interface (which is good), but the router is NATing the source to the backup interface's IP (overload). So traffic leaves via the primary ISP, the return traffic is addressed to the backup ISP interface, and so the return traffic arrives via the backup interface.


c2900-universalk9-mz.SPA.154-3.M10.bin = "show ip cef" shows an incorrect next hop. Possible bug causing traffic to be sent to the incorrect upstream gateway, which is down.

c2900-universalk9-mz.SPA.154-3.M4.bin = the above "bug", or rather the CEF behaviour, is resolved, but now traffic leaves the correct interface with the incorrect source IP address, so response packets arrive via the incorrect interface.

Two different TAC engineers have said that they see this often and that what they do is use an EEM script to clear the NAT translations. I created this config about 10 years ago and never needed an EEM script back then. The order of operations states that routing is done before NAT, so I am expecting the router to route the packet, see that there is no NAT entry for that traffic on this interface, and set up a new NAT entry for the traffic.

I've been looking but have been unable to find any docs stating that an existing NAT translation is honoured before routing. If anyone has this, please share it. I could be mistaken or confused about this part.
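
For reference, the kind of EEM workaround TAC describes might look roughly like the applet below. This is a sketch and not the script TAC supplied: the applet name is made up, and it assumes track object 1 is the object that follows the primary ISP's IP SLA probe.

```text
! Hypothetical applet; assumes track object 1 follows the primary ISP probe
event manager applet CLEAR-NAT-ON-FAILOVER
 event track 1 state any
 action 1.0 cli command "enable"
 action 2.0 cli command "clear ip nat translation *"
 action 3.0 syslog msg "Track 1 changed state, NAT translations cleared"
```

Using "state any" fires the applet on both failover and failback. The trade-off is that "clear ip nat translation *" drops every dynamic translation, so sessions that were not affected get reset as well.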

Dieselboy

I dug up one of my previous working configs, from back in 2011. I compared the configs and they are identical apart from interface names and the names of route-maps.

Asked TAC to confirm the bug and let me know the ID.

To me, it looks like the order of operations on IOS is muddled and, rather than raise it to get it fixed, TAC have been sending out EEM scripts to clear NAT translations.

Otanx

This is a problem on ASAs as well. Here is what happened to us: a remote site has a single firewall backed by a switch, with several long-lasting UDP sessions running from outside to inside. If we do maintenance on the switch and reboot it, the interface on the firewall goes down (of course). This causes the firewall to clear all sessions that used that interface and to remove the route entries for that interface. The next UDP packet in the session hits the outside interface. ACL lookup says permit. Route lookup now says default route. The firewall hairpins the traffic back at the internet, builds the connection, and the traffic black-holes. We don't care at this point: the switch is rebooting and we know traffic is getting lost. However, when the switch comes back online, the ASA updates routing but does NOT clear existing connections, so traffic is still forwarded the wrong way. It will not start working until the session times out (which could be hours or days in our case) or someone does a "clear conn" on the firewall once everything is back up.

Normally this isn't an issue for us because we patch the switch then firewall during the same maintenance so rebooting the firewall fixes the issue.

-Otanx
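
A rough sketch of the manual fix described above, plus one ASA setting that may be worth testing for this case. The host address 192.168.10.50 and the 30-second value are placeholders, and whether the floating-conn timeout helps in this exact hairpin scenario is an assumption, not something confirmed in this thread.

```text
! Inspect, then clear, only the connections for one inside host
show conn address 192.168.10.50
clear conn address 192.168.10.50
! Config mode: allow connections to be torn down when a better route appears
! (default is 0:00:00, which disables the behaviour)
timeout floating-conn 0:00:30
```

Clearing by address avoids resetting every connection on the firewall, which a bare "clear conn" would do.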

icecream-guy

Quote from: Otanx on November 02, 2018, 09:59:55 AM

We had this exact thing happen with our voice system, which went haywire. In our case, though, the SIP UDP connection never timed out because the traffic was looping. After a couple of hours of troubleshooting and an hour on the phone with TAC, the connection was cleared and all was good again.
:professorcat:

Dieselboy

Cisco need to get their act together before they start jeopardising the value of the CCIE and CCNP certifications. How can they allow bugs like this to continue without anyone getting them fixed properly? A bug in a component, fair enough; things can happen. But when the bug affects the underlying nature of the device, basically turning the router into an ASA, it just shows a complete lack of process and care at Cisco, and this has been a growing problem for coming up on 10 years.

deanwebb

Quote from: Dieselboy on November 04, 2018, 11:06:59 PM

If it's not a feature that directly affects sales, it gets backburnered in favor of fixing things that *are* affecting sales.