Packet Drop Issue

Started by deanwebb, May 27, 2016, 10:46:24 AM

deanwebb

This is an update of the Wireless issue I ranted on about... well, we got the replacement and it had the same issue.

In the traces, I noted that fragmented UDP traffic from the site goes just fine... but arrives at cross-WAN destinations without the final packet that makes everything make sense. Unfragmented UDP traffic, no problem. TCP traffic... can't say I've seen any fragmented TCP stuff. But fragmented SNMP traffic NEVER works and fragmented RADIUS traffic works about half the time, with more failures during busier times of the day.

We think the RADIUS and SNMP issues are connected, since they have the same issue: terminating fragment drops. We can test for success by generating bulk SNMP get-requests and seeing if that get-response packet shows up or gets dropped. Right now, it's getting dropped at some point between the far site's Riverbed (checked egress on WAN interface, looked good there) and the NIC on the destination servers (RADIUS server and Cisco Prime server).

Anyone here ever see a thing like that, where those final fragments get dropped and the destination responds with an ICMP message that the fragment reassembly time expired?

We're going to be doing captures on the WAN routers to see if the WAN router at the far location is sending those
Take a baseball bat and trash all the routers, shout out "IT'S A NETWORK PROBLEM NOW, SUCKERS!" and then peel out of the parking lot in your Ferrari.
"The world could perish if people only worked on things that were easy to handle." -- Vladimir Savchenko
Вопросы есть? Вопросов нет! | BCEB: Belkin Certified Expert Baffler | "Plan B is Plan A with an element of panic." -- John Clarke
Accounting is architecture, remember that!
Air gaps are high-latency Internet connections.

icecream-guy

Don't Riverbeds run in pairs? One for the far end, one for the local end, to do the proprietary magic on the WAN between the two.
:professorcat:

My Moral Fibers have been cut.

deanwebb

We do have one in the main office where the RADIUS server is, but... all the UDP traffic bypasses it. It goes straight through the IPS to the L3 core switch.

And, yes, I checked the IPS... no packets blocked to the RADIUS server.

wintermute000

Have defo seen fragments dropped before, nice detective work, but every time I've seen a fragment-getting-dropped scenario, it's been... wait for it... a firewall LOL (or at least a ZBFW feature, seen that before definitely). Is there anything at all doing reassembly before it hits the host? Pretty sure an IPS would have to reassemble UDP fragments to do its job....

As you say, it can't be the Riverbeds, as they pass UDP through (by default anyway, and you've confirmed it as well). One good thing, though, is that doing a non-intrusive, rolling packet capture on a Riverbed is a GUI-enabled piece of cake, so you can easily see if it's happening before or after WAN transit. From your wording it's unclear which end you've checked: is it OK at the point it enters the DC-side Riverbed? Don't forget you can capture on either side of the Riverbed too (if memory serves me correctly) if you really want to be paranoid. But you want to at least start by chopping the problem in half (i.e. is the fragment getting dropped in the WAN or in your DC).
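One way to do that halving is to diff the fragments seen at two capture points. The following is only a sketch: it assumes each capture has already been exported to a list of (source IP, IP ID, fragment offset) tuples, which is a hypothetical format rather than what any capture tool emits as-is, and the addresses and IDs are made up.

```python
from collections import defaultdict


def missing_fragments(upstream, downstream):
    """Return fragments seen upstream but not downstream, grouped by datagram.

    Each capture is an iterable of (src_ip, ip_id, frag_offset) tuples.
    This tuple layout is an assumed export format for illustration only.
    """
    seen_downstream = set(downstream)
    missing = defaultdict(list)
    for frag in upstream:
        if frag not in seen_downstream:
            src_ip, ip_id, frag_offset = frag
            missing[(src_ip, ip_id)].append(frag_offset)
    return {key: sorted(offsets) for key, offsets in missing.items()}


# Made-up data: one datagram split into three fragments
# (offsets are in 8-byte units, as in the IPv4 header).
remote_wan_egress = [
    ("10.1.1.5", 0x1A2B, 0),
    ("10.1.1.5", 0x1A2B, 185),
    ("10.1.1.5", 0x1A2B, 370),
]
dc_core_ingress = [
    ("10.1.1.5", 0x1A2B, 0),
    ("10.1.1.5", 0x1A2B, 185),
]

print(missing_fragments(remote_wan_egress, dc_core_ingress))
```

If the result is empty between two points, the drop is happening somewhere else, so the next capture pair moves further along the path.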

Do you see the same UDP fragmentation behaviour (and presumably NO drops) with other controllers at other sites? While you're packet capturing, you might as well get a 'reference' capture.

Just because we're down in the weeds: you are talking about fragmentation of the UDP payload, right, i.e. reassembled UDP segments forming a 'UDP packet'? You aren't talking about IP fragmentation issues, PMTUD works, 1500 nice and clean all the way through, etc.?

deanwebb

No firewall in the path. There is an IPS in the path; it does not show that it is blocking any traffic, but it is inline.

The Riverbed in the remote site had traces running on inbound/outbound on both the LAN and WAN interfaces. Zero packet loss all the way out to the WAN router, this much we know. On the DC side, the Riverbeds aren't yet inline... so we have to get the trace set up tomorrow on our core switch, since there aren't any spare interfaces on the routers. I figure if the traffic is intact on the way in, then we have something set up in both datacenters we tested that affects *only* traffic from this one site. If not, then it's something in front of the switch, again affecting traffic only from this site.

Same RADIUS controllers that we've tested all handle traffic from other WLCs. I've also checked with a WLC-RADIUS server combo to a third datacenter. Same fragmentation of SNMP and RADIUS, no drops. I took a few of these reference captures specifically to keep my sanity. 1500 MTU on all devices on our LAN. This is reassembly of UDP packets.
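For reference, here is a rough sketch of how a large UDP datagram splits at a 1500-byte MTU. It assumes plain IPv4 with a 20-byte header and no options; the 4000-byte payload is just an illustrative number, not a size taken from the captures.

```python
def fragment_sizes(udp_payload_len, mtu=1500, ip_header_len=20, udp_header_len=8):
    """Return (offset_in_8_byte_units, data_bytes, more_fragments) per fragment.

    Assumes IPv4 without options. The UDP header travels in the first
    fragment only, so it is counted as part of the IP payload being split.
    """
    total = udp_payload_len + udp_header_len
    # Every fragment except the last must carry a multiple of 8 bytes.
    max_data = (mtu - ip_header_len) // 8 * 8
    fragments = []
    offset = 0
    while offset < total:
        size = min(max_data, total - offset)
        more = offset + size < total
        fragments.append((offset // 8, size, more))
        offset += size
    return fragments


# Illustrative only: a 4000-byte UDP payload.
for frag in fragment_sizes(4000):
    print(frag)
```

The last tuple in the list is the terminating fragment (more_fragments is False), which is the one going missing here.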

I'm skeptical that it's a device in the datacenters, since it's only this one site having the issue. What I do suspect is either the WAN router at the remote site, or one of the devices in the MPLS network close to the remote side end of things.

I hope to know more on the capture we get tomorrow. I'll need to see if we can get our WAN provider to also get captures from inside its network.

Dieselboy

Network diagram? Is the traffic being encapsulated over a VPN tunnel?

deanwebb

No VPN.

It's WiSM inside 6800 -> Riverbed -> WAN router -> MPLS -> WAN router -> IPS -> Core switches -> Vblock -> RADIUS server

wintermute000

Can't fault your logic (i.e. only one site is affected, so it's 99% unlikely to be the hub site; packet captures at your spoke WAN CE are fine, so the issue is 99% likely to be in the WAN).

Dieselboy

Not a load-balancing issue with the provider MPLS, like what I found recently on my provider Internet service?
What is the MAC address of the source of the traffic that the provider sees?

deanwebb

@Wintermute: thanks for confirming my suspicions.

@Dieselboy: We checked that right after you posted that thread.

NetworkGroover

I'll admit that I was too lazy to read the entire thread - but I saw "fragmented" and thought, "Why?"

Why do you have fragmented traffic?  That's never desirable I thought?

EDIT - Ah, nevermind.  I see.
Engineer by day, DJ by night, family first always

deanwebb

OK, found a zillion packet drops on the egress interface on the remote site's WAN router. That is bad because it's not good.

The drops spike with user activity. All queues are affected. About 20% of the traffic there is from the guest wireless SSID... guess what we just throttled down...

So, I can see why that drops RADIUS traffic as the user count increases for the day, but I still don't see how the SNMP bulk traffic gets whacked while unfragmented SNMP traffic always gets through... Doesn't help that I'm also sick today and had to deal with a partner site whose ISP started blocking port 22 outbound...

icecream-guy

Quote from: deanwebb on June 01, 2016, 12:58:42 PM

what sick? of the rain?  you waterlogged?  I'd blame the port 22 thing on all the flooding there in Texas.

Dieselboy

Quote from: ristau5741 on June 01, 2016, 01:46:33 PM

Unfragmented SNMP traffic always gets through and fragmented SNMP does not because, if one packet is lost out of the fragmented SNMP, then all of the packets would be discarded, right? If one SNMP request is fragmented into 3 packets and one of those 3 is lost, then the entire request is lost and would need to be resent.
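A back-of-the-envelope sketch of that effect, assuming each packet is dropped independently with the same probability. That is a simplification, since congestion drops tend to come in bursts, and the 10% loss rate is a made-up figure, not a measurement from this network.

```python
def datagram_delivery_probability(per_packet_loss, fragments):
    """Chance that every fragment of one datagram arrives.

    Assumes independent, identical loss per packet (a simplification).
    """
    return (1.0 - per_packet_loss) ** fragments


# Hypothetical 10% per-packet loss on the congested link.
for n in (1, 2, 3):
    p = datagram_delivery_probability(0.10, n)
    print(f"{n} fragment(s): {p:.3f} chance the whole datagram arrives")
```

Under these assumptions, a 3-fragment datagram fares noticeably worse than an unfragmented one, but independent loss alone would not explain fragmented SNMP failing every single time.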

deanwebb

Nope. One of the fragments may get dropped, but the other 1 or 2 go on through, hence the ICMP type 11 messages saying that the device didn't get all the fragments it was expecting.

And when we see the resend, all 3 for the request leave, but fewer than 3 arrive at the destination.
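For anyone picking these out of a capture by hand: ICMP type 11 is Time Exceeded, and code 1 is the fragment reassembly variant (code 0 is TTL exceeded in transit). A minimal sketch that classifies the first bytes of an ICMP message, assuming the raw ICMP header has already been extracted from the packet; the function name and sample bytes are made up for illustration.

```python
import struct

ICMP_TIME_EXCEEDED = 11
TIME_EXCEEDED_CODES = {
    0: "TTL exceeded in transit",
    1: "fragment reassembly time exceeded",
}


def describe_icmp(icmp_bytes):
    """Describe an ICMP message from its first 4 bytes (type, code, checksum)."""
    icmp_type, code, _checksum = struct.unpack("!BBH", icmp_bytes[:4])
    if icmp_type == ICMP_TIME_EXCEEDED:
        detail = TIME_EXCEEDED_CODES.get(code, "unknown code")
        return f"Time Exceeded (type 11, code {code}): {detail}"
    return f"ICMP type {icmp_type}, code {code}"


# Made-up header bytes: type 11, code 1, checksum zeroed for the example.
print(describe_icmp(bytes([11, 1, 0, 0])))
```

Code 1 coming back from the destination means it held the fragments it did receive until its reassembly timer ran out, which matches a fragment being lost upstream.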