What could cause a router sub interface to drop random pings?

Started by Dieselboy, April 29, 2016, 06:33:56 AM

Dieselboy

I have a 2921 connected to a WAN switch. Also in the WAN switch are 2 internet circuits.
The router port 0/0 has one circuit /30 on the physical interface and a sub interface with a different internet circuit /30. I'm using VRF lite to segregate the two internet circuits from each other.

If I ping the IP on the physical interface I get good response and no dropped packets.
If I ping the IP on the subinterface I get packet loss.

example:
This is the ping to the physical interface from the internet:

TP-2901V#ping 116.[] rep 1000
Type escape sequence to abort.
Sending 1000, 100-byte ICMP Echos to 116.[], timeout is 2 seconds:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!
Success rate is 99 percent (999/1000), round-trip min/avg/max = 4/4/40 ms
TP-2901V#


This is the ping to the subinterface from the net


TP-2901V#ping 139.[] rep 1000
Type escape sequence to abort.
Sending 1000, 100-byte ICMP Echos to 139.[], timeout is 2 seconds:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!.!!!!!!!!!!!!!!!!!!!!!!!!!!!.!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!.!!!!!!!!!!!!!!!!!!!.!!!.!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!.!!!!!!!!.!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!.
!!!!!!!!!!!!!!!!!!!!
Success rate is 98 percent (985/1000), round-trip min/avg/max = 1/4/156 ms
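
The summary lines only show whole percentages (99 and 98 here), which makes the two results look closer than they are. A minimal Python sketch that recomputes the loss from the (received/sent) counts in the two summaries above; the helper name `loss_percent` is purely illustrative:

```python
def loss_percent(received: int, sent: int) -> float:
    """Return packet loss as a percentage of packets sent."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return 100.0 * (sent - received) / sent


# Counts copied from the two "Success rate" lines above.
physical = loss_percent(999, 1000)      # ping to 116.[] (physical interface)
subinterface = loss_percent(985, 1000)  # ping to 139.[] (subinterface)

print(f"physical interface: {physical:.1f}% loss")
print(f"subinterface:       {subinterface:.1f}% loss")
```

On these numbers the subinterface loses 15 packets against 1 on the physical interface, even though the rounded success rates differ by a single point.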


I see the same loss if I ping through the router to other internal devices. Pinging through the good physical interface, however, shows no issues.

If I take the router out of the equation, plug a laptop into the switch where the internet circuit resides and configure the laptop NIC with the router's IP, I get clean ping responses like the good one above. All physical interfaces are 1 Gb (including the laptop) and there are no incrementing errors or duplex issues.

The subinterface that is dropping packets actually has the simpler config, hence the confusion: it has no service policy applying QoS.
CPU use is <20% on the router.


!
interface GigabitEthernet0/0
 description Auto 1GB link to HP CIN-1620WAN-1 port 1 - INTERNET FACING
 bandwidth 20000
 ip vrf forwarding ~~VRF-1
 ip address 116.[] 255.255.255.252
 ip access-group GI00-~~-INBOUND in
 no ip redirects
 no ip proxy-arp
 duplex auto
 speed auto
 no cdp enable
 no mop enabled
 service-policy output QOS-~~-OUT
end



interface GigabitEthernet0/0.132
 description ~~ interface to ~~ /30
 bandwidth 50000
 encapsulation dot1Q 132
 ip vrf forwarding ~~-TID-VRF-2
 ip address 139.[] 255.255.255.252
 ip access-group GI02.132-IPv4-~~-INBOUND in
 no ip redirects
 no ip unreachables
 no ip proxy-arp
 no cdp enable
end


There are policers applied to the ISP circuit upstream which police traffic to 50M. All I'm doing is pinging from another Cisco 2900 router with the default data size, so I won't be anywhere near 50M and would not expect the policer to kick in. Likewise, if I ping my laptop on the internet IP I get fast ping results and no loss.
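
A rough back-of-the-envelope check of that claim, as a Python sketch. It assumes the router sends each echo only after the previous reply (or timeout), so the offered load is roughly one packet per round trip, and it ignores Layer 2 overhead. The 100-byte size and ~4 ms average RTT are taken from the ping outputs above; the function name is illustrative.

```python
def ping_rate_bps(packet_bytes: int, rtt_seconds: float) -> float:
    """Approximate offered load when only one echo is outstanding at a time."""
    if rtt_seconds <= 0:
        raise ValueError("rtt_seconds must be positive")
    return packet_bytes * 8 / rtt_seconds


POLICER_BPS = 50_000_000  # the 50M upstream policer mentioned in the post

# 100-byte echoes at roughly 4 ms average round-trip time.
default_ping = ping_rate_bps(100, 0.004)
share = 100 * default_ping / POLICER_BPS

print(f"offered load: {default_ping / 1000:.0f} kbit/s")
print(f"share of policer rate: {share:.2f}%")
```

Under those assumptions the test traffic is a few hundred kbit/s at most, a fraction of a percent of the policed rate, which supports ruling the policer out for this particular test.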

To rule out an issue with the switch, I connected my laptop to the switch in the same VLAN as the internet circuit and ran the same pings - results were good.

The last thing I can do is disconnect the router entirely, connect my laptop to the same port with the VLAN configured on the NIC, and run the same test.

Wanted to post here to see if anyone else had something similar...

:o

routerdork

I've not seen anyone run a config like this without two sub-interfaces. Unless that's a VRF Lite thing? I see that you did drop one packet on the other ping so that leads me to believe all is not perfect in the world. What do the show interface outputs on each side of the link show?
"The thing about quotes on the internet is that you cannot confirm their validity." -Abraham Lincoln

NetworkGroover

Yeah, if you know the ICMP packets are getting there... then this is where counters can help, if they expose any that are useful. Again though, that's assuming they are getting there.
Engineer by day, DJ by night, family first always

Reggle

First of all, you have a service policy on the physical link. That will likely affect subinterfaces too.
Which brings me to the second point: I would use two subinterfaces to avoid any accidental interference.

NetworkGroover

Quote from: Reggle on April 29, 2016, 01:47:34 PM
First of all, you have a service policy on the physical link. That will likely affect subinterfaces too.
Which brings me to the second point: I would use two subinterfaces to avoid any accidental interference.

If the service policy were affecting both, shouldn't there be similar behavior?

deanwebb

Does one path go through a load balancer and the other not?
Take a baseball bat and trash all the routers, shout out "IT'S A NETWORK PROBLEM NOW, SUCKERS!" and then peel out of the parking lot in your Ferrari.
"The world could perish if people only worked on things that were easy to handle." -- Vladimir Savchenko
Вопросы есть? Вопросов нет! | BCEB: Belkin Certified Expert Baffler | "Plan B is Plan A with an element of panic." -- John Clarke
Accounting is architecture, remember that!
Air gaps are high-latency Internet connections.

Reggle

Quote from: AspiringNetworker on April 29, 2016, 03:31:52 PM
Quote from: Reggle on April 29, 2016, 01:47:34 PM
First of all, you have a service policy on the physical link. That will likely affect subinterfaces too.
Which brings me to the second point: I would use two subinterfaces to avoid any accidental interference.

If the service policy were affecting both, shouldn't there be similar behavior?
If you go for two subinterfaces, apply the service policy on the subinterfaces only. I believe that's possible, though I may be mistaken.
If you mean that the service policy should affect both the main and the subinterface, that's indeed the case. But I also see differing bandwidth statements, and the service policy class and ACL (not shown) may consider the subinterface traffic more interesting to drop.

I don't know, this setup just doesn't "feel" right.

Dieselboy

Hi guys thanks for all your replies.

You're all completely correct about the service policy on the physical interface / subinterface. I'm moving the physical configuration to a subinterface today as I will have a window.
How this happened: the router was configured for internet on the physical interface, then we had another circuit procured, so during business hours I set that up on a subinterface. You can either apply a service policy to the physical interface only, or leave the physical interface without one and apply a policy to each individual subinterface.
That said, removing the service policy entirely makes no difference.

There's no load balancer. At the moment it goes like this on both circuits:
Internet > fibre > NTE > my switch > router
I think I know what you're getting at in terms of a load balancer - the upstream MAC address. I did already check whether the ISP MAC address changed when I did a clear arp, but it did not.

The physical interface is 1 Gb. There are no errors, but there are unknown protocol drops, and there are output drops due to the service policy shaping at 20 Mb/s. Even so, removing the service policy entirely makes no difference.

I'll move the interfaces to how they should be and see if there's any change. If that's the only thing that sticks out then it's a good place to start. I was wary of making that change because of the risk of introducing the same issue on the in-production circuit.
Cheers
Tony

Dieselboy

Moving the interfaces has had no effect at all on the lost ICMP but I have gone and applied individual QoS per sub interface at least.

Since the ASAs are connected to the same switch, I moved the IP to the ASA, and the ASA can ping the upstream gateway fine. If I do this test on the router I get the same odd packet loss.

So I then moved the IP to another subinterface on the router, in case there were any issues with the interface itself, the physical cable (even though there are no errors), or some other interface issue, and I still get the same packet loss.

So to summarise:
ISP1 -> router = good
ISP2 -> router = bad
ISP2 -> ASA = good

So this got me thinking... What changed? I've ruled out a lot of things.

I then decided to change the mac manually on the interface.

So what I did was (since I've still got a window to do intrusive changes):
- copy the ASA mac address from the show int on the ASA
- remove the VLAN from the switch going to the ASA (I left the config there for the moment, but can be deleted)
- under the physical interface of the router, set the mac address from the Burned in address (4c00.828a.cf00) to the virtual one from the ASA (00a0.c9c0.8201)
- run pings

And what do you know, there's no packet loss, either pinging from the router or pinging from my Cisco router at home across the internet to the router:


- My main ISP connection came back before I had a chance to clear ARP. I was expecting it to stay down until I changed the MAC address back.

TP-2901V#ping 116.[] rep 1000
Type escape sequence to abort.
Sending 1000, 100-byte ICMP Echos to 116.[], timeout is 2 seconds:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!
Success rate is 100 percent (1000/1000), round-trip min/avg/max = 1/4/160 ms
TP-2901V#ping 139.[] rep 1000
Type escape sequence to abort.
Sending 1000, 100-byte ICMP Echos to 139.[], timeout is 2 seconds:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!!!!!!!!!!!!!!!!!!!
Success rate is 100 percent (1000/1000), round-trip min/avg/max = 1/3/8 ms


If the mac address change "fixed" it, why didn't it fix it when I moved the config from gi0/0 (4c00.828a.cf00) to Gi0/2 (4c00.828a.cf02) ?

Could the issue be at the ISP end, meaning they have a MAC in their table that is the same as both of my BIAs?

Just had a thought - if they're doing QinQ somewhere (the ISP) then this might be the issue. I saw some mac address re-learn issue between our Datacentres back in 2010. Going to have to think to remember what the cause was and the resolution.

What do you guys make of it? :) I don't really want to have to specify mac addresses like this.

I checked the MAC table of my switch and it does not have an entry for the burned-in address any more. So whatever the issue is, it's on the other side of the NTE, which is a Cisco ME3400 provided by the ISP.

Dieselboy

Just remembered the details of the QinQ packet loss. The issue arose when we used HSRP MAC addresses, because they were the same in each VLAN. When the packets transited the QinQ, every VLAN was carried inside the same outer QinQ VLAN, so the MAC addresses were not unique and the core switches kept re-learning the MAC in different locations.
The fix was to move to HSRPv2 and use unique FHRP MACs.
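
To illustrate why the HSRP MACs collided, here is a small Python sketch of how the virtual MAC is derived from the group number: HSRPv1 uses 0000.0c07.acXX (groups 0-255) and HSRPv2 uses 0000.0c9f.fXXX (groups 0-4095). The VLAN IDs and the "group number equals VLAN ID" scheme are assumptions for illustration, not details from that old setup.

```python
def hsrp_v1_mac(group: int) -> str:
    """HSRPv1 virtual MAC: 0000.0c07.acXX, where XX is the group number in hex."""
    if not 0 <= group <= 255:
        raise ValueError("HSRPv1 groups are 0-255")
    return f"0000.0c07.ac{group:02x}"


def hsrp_v2_mac(group: int) -> str:
    """HSRPv2 virtual MAC: 0000.0c9f.fXXX, where XXX is the group number in hex."""
    if not 0 <= group <= 4095:
        raise ValueError("HSRPv2 groups are 0-4095")
    return f"0000.0c9f.f{group:03x}"


vlans = [10, 20]  # hypothetical inner VLANs sharing one outer QinQ VLAN

# Same group number reused in every VLAN: identical virtual MAC everywhere.
v1 = {vlan: hsrp_v1_mac(1) for vlan in vlans}
# One possible scheme with HSRPv2: group number = VLAN ID, so each MAC is unique.
v2 = {vlan: hsrp_v2_mac(vlan) for vlan in vlans}

print("HSRPv1, group 1 in each VLAN:", v1)
print("HSRPv2, group = VLAN ID:     ", v2)
```

The wider HSRPv2 group range is what makes a per-VLAN group number practical; reusing the same group number in every VLAN would still produce the same MAC under either version.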

I'm not using HSRP or any FHRP, and the upstream gateway mac is 0014.1bd5.8c00. So still unsure at this time.

Reggle

Very interesting, thanks for the feedback. Since I'll be deploying a new QinQ soon myself I can use it. However, do you have confirmation from your service provider that this is the case? Because the router MAC address should be globally unique.

Dieselboy

Haven't managed to confirm anything yet. I called them after my last post but I can only speak to 1st and 2nd line. To quote, "there is no way they can contact 3rd line"; they have to leave them a message. Of course that's BS though. It's the weekend anyway, so I've reverted my config from earlier.

Indeed the router physical address must be unique, but it's confusing why I get packet loss when using that address and not with a virtual address. This makes me think that there is a duplicate MAC or similar, as these were the same symptoms as when we had the QinQ issue.

One other thing worth mentioning is that ISP 1 and ISP 2 are just different physical ports on the NTE ME3400 switch. Originally before I joined the company, the company sourced fibre internet with ISP 1 but they don't run their own fibre, they get their fibre from ISP2. So as we are "on-net" ISP 2 just had to configure port 2 on the ME3400.
Now, my router uses the same MAC as the customer interface for both ISPs (the burned in address). So the same MAC will be seen on port 1 and port 2 of their ME3400, albeit in different VLANs... it must be different VLANs on their 3400... However, if the problem was because of the same mac being seen then the issue wouldn't be fixed by using a virtual mac. So then this makes me lean toward my Mac not being unique. But even if this was the case, I also tried with another physical interface on that router and different burned in mac - same problem. I would have thought that duplicate mac is possible but highly unlikely. 2 duplicates can't be true.

So I really am not sure. I look forward to speaking with them soon.

One final thing, if the issue was at my end then I would expect it to affect the working internet connection I have too. The problem must be in a cloud somewhere. I'll go and pray.  :awesome:

Dieselboy

I pointed out to the ISP that when I previously tested the circuit with my laptop I got zero packet loss (because the BIA is different). And I said that if they really wanted, I could copy the MAC from the router onto my NIC driver so my known-good laptop would use the MAC which is experiencing packet loss.
And as expected, extreme packet loss when sourcing from that specific MAC address. It is worse during business hours. Yesterday (Sunday) when I was on a call with the ISP I was only getting 4 drops within 1000. During the day I'm getting a lot more but not as much as I've seen (15% packet loss) in previous times.

So all day long I've been sending 1500 byte ICMP pings sourcing from the bad MAC in the hope it's having a negative impact elsewhere in the ISP network so someone else logs a fault :)

Dieselboy

To again rule out any issues with my WAN switches, I plugged my laptop directly into the NTE. The attached screenshot shows good pings with the burned in address of my laptop. I've circled the point at which I specified the MAC address, copied from the Cisco router.

While I'm here, I used MAC addresses from 4C00828ACF00 to 4C00828ACF09 and they all exhibit the same behavior.
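
For anyone repeating this kind of test, a small Python sketch that generates a consecutive block of MAC addresses starting from the router's burned-in address and converts Cisco dotted notation into colon-separated form. The colon format is an assumption (some NIC drivers want hyphens or no separator), and the helper names are illustrative.

```python
from typing import List

HEX_DIGITS = set("0123456789abcdefABCDEF")


def to_colon(mac: str) -> str:
    """Convert a MAC in any common notation (e.g. 4c00.828a.cf00) to aa:bb:cc:dd:ee:ff."""
    digits = "".join(c for c in mac if c in HEX_DIGITS).lower()
    if len(digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))


def mac_range(first: str, count: int) -> List[str]:
    """Return `count` consecutive MAC addresses starting at `first`."""
    start = int(to_colon(first).replace(":", ""), 16)
    return [to_colon(f"{start + i:012x}") for i in range(count)]


# Ten addresses starting at the router's Gi0/0 burned-in address.
addresses = mac_range("4c00.828a.cf00", 10)
for address in addresses:
    print(address)
```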

I really don't think I can do any more tests from my end.

deanwebb

You know it's a crazy case when you be messing with MAC addresses.