Packet Drop Issue

Started by deanwebb, May 27, 2016, 10:46:24 AM


wintermute000

If you're getting random output drops, then it's strange that you've only noticed this on fragmented UDP traffic.

You're probably sick of captures by now LOL but a good test is iperf, as there are sequence numbers in there even in UDP mode IIRC.
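To illustrate why sequence numbers help here, the following is a minimal sketch of counting loss and reordering from a list of received sequence numbers. It is not iperf's own logic; the function name `summarize_loss` and the assumption that numbering starts at 0 are made up for the example.

```python
def summarize_loss(received_seqs):
    """Return (lost, out_of_order) counts for received sequence numbers.

    Assumption for this sketch: the sender numbers datagrams 0, 1, 2, ...
    and the highest number seen marks the end of the test.
    """
    seen = set()
    highest = -1
    out_of_order = 0
    for seq in received_seqs:
        if seq in seen:
            continue  # duplicate datagram, ignore it
        if seq < highest:
            out_of_order += 1  # arrived after a later datagram
        seen.add(seq)
        highest = max(highest, seq)
    expected = highest + 1
    lost = expected - len(seen)
    return lost, out_of_order


# Datagrams 5 and 6 never arrive; datagram 3 arrives after 4.
lost, reordered = summarize_loss([0, 1, 2, 4, 3, 7, 8])
print(lost, reordered)
```

Run against a capture from each side of the WAN link, the same count shows whether the drops are random or tied to particular packets.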

deanwebb

No, I always love a good capture... and I've got sequence numbers in Wireshark.

Our WAN provider's engineers found zillions of drops on the outbound interface and throttled back the guest wireless traffic... but today is a holiday at the remote site, with only 80 people on the wireless there, and we're still seeing the goofy RADIUS drops... I'm thinking the solution is not in throttling traffic, although the throttling was clearly needed: packets in all queues were getting dropped because about 20% of the total traffic was guest wireless stuff.
Take a baseball bat and trash all the routers, shout out "IT'S A NETWORK PROBLEM NOW, SUCKERS!" and then peel out of the parking lot in your Ferrari.
"The world could perish if people only worked on things that were easy to handle." -- Vladimir Savchenko
Вопросы есть? Вопросов нет! | BCEB: Belkin Certified Expert Baffler | "Plan B is Plan A with an element of panic." -- John Clarke
Accounting is architecture, remember that!
Air gaps are high-latency Internet connections.

deanwebb

RESOLVED:

QoS policy. It actually had the RADIUS and SNMP traffic marked for a priority queue, but fragments are IPv4... those go into the bulk traffic and get dropped first. Policy was amended to include classification by endpoints and then a few more tweaks, and now it's working.

PHEW.

Dieselboy


wintermute000

Can I clarify: your QoS policy was presumably relying on NBAR instead of L3/L4 to classify RADIUS and SNMP, hence it could not ID the fragments?

And the fix was to switch to classifying via traditional L3/L4? As I'd imagine even a fragmented UDP packet would surely have the port info intact?

I'm very curious if more application-smart devices *cough* Palo Alto *cough* have the same issue. I'll hassle the firewall guys next week if above is indeed the case.
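On the port-info question: in standard IPv4 fragmentation, only the fragment at offset 0 starts with the UDP header, so later fragments carry no port numbers of their own. A minimal sketch of reading that from the IP header follows; the helper names and the 192.0.2.x addresses are hypothetical, and the checksum is left at 0 because only the flags and offset field matters here.

```python
import struct


def fragment_info(ip_header: bytes):
    """Return (more_fragments, offset_bytes, has_l4_header) for an IPv4 header."""
    # Bytes 6-7 hold 3 flag bits followed by a 13-bit fragment offset.
    (flags_frag,) = struct.unpack("!H", ip_header[6:8])
    more_fragments = bool(flags_frag & 0x2000)
    offset_bytes = (flags_frag & 0x1FFF) * 8  # offset is counted in 8-byte units
    # Only the fragment at offset 0 begins with the UDP/TCP header.
    has_l4_header = offset_bytes == 0
    return more_fragments, offset_bytes, has_l4_header


def make_header(more_fragments: bool, offset_units: int) -> bytes:
    """Build a minimal 20-byte IPv4/UDP header for the demo (hypothetical values)."""
    flags_frag = (0x2000 if more_fragments else 0) | offset_units
    return struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, 1500, 1, flags_frag, 64, 17, 0,
        bytes([192, 0, 2, 1]), bytes([192, 0, 2, 2]),
    )


first = fragment_info(make_header(True, 0))      # first fragment: ports visible
second = fragment_info(make_header(False, 185))  # last fragment at byte 1480: no ports
print(first, second)
```

A classifier that matches on ports therefore has to fall back on Layer 3 information, or on state from the first fragment, for everything after offset 0.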

deanwebb

UDP fragments had some port info in them, but it was still default Cisco QoS logic to put fragments into the highest category. Fix was to specifically deny fragments from that top queue, which was congested beyond belief.

Just encountered another site where all TCP traffic worked fine, but SNMP and RADIUS were broken... hmmm...

wintermute000

So what exactly is the class-map matching on? Can you post the configuration, please?
I.e., what criteria were configured for SNMP and RADIUS that were failing to catch fragments?

deanwebb

We first had no match, that gave random results.

Then we matched on destination IP addresses of the RADIUS servers. That gave predictable, but still erroneous results. The problem was that the first packet would match as SNMP or RADIUS traffic, but the fragment would still shoot into the EF queue and get whacked.

Once the line to exclude fragments from the top queue was entered, the issue resolved. We then removed the match on IP address and the issue stayed resolved.

Extended IP access list EF_Video_Voice
5 deny ip any any fragments


That's pretty much all we needed, that line at the top of the ACL to keep fragments out. Then the default Cisco behavior when treating fragments didn't screw up our traffic.
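For anyone trying to picture that default behavior, here is a rough Python model of how IOS ACLs are documented to treat non-initial fragments, as far as I understand it: a permit entry with Layer 4 criteria matches a non-initial fragment on Layer 3 information alone, while a deny entry with Layer 4 criteria is skipped. The vendor's real ACL was never posted, so the single voice entry below (UDP port 16384) is purely hypothetical, as are the class and function names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    action: str                      # "permit" or "deny"
    protocol: str                    # "ip" matches any protocol
    dst_port: Optional[int] = None   # Layer 4 criterion, if any
    fragments: bool = False          # the "fragments" keyword


@dataclass
class Packet:
    protocol: str
    dst_port: Optional[int]          # None for a non-initial fragment
    noninitial_fragment: bool = False


def evaluate(acl, pkt):
    """First-match evaluation; "permit" means the packet lands in the class."""
    for e in acl:
        if e.protocol not in ("ip", pkt.protocol):
            continue
        if e.fragments:
            if pkt.noninitial_fragment:
                return e.action      # keyword entries only hit non-initial fragments
            continue
        if e.dst_port is None:
            return e.action          # Layer 3 only entry applies to everything
        if pkt.noninitial_fragment:
            if e.action == "permit":
                return "permit"      # matched on Layer 3 alone, ports unknown
            continue                 # deny entries with ports are skipped
        if e.dst_port == pkt.dst_port:
            return e.action
    return "deny"                    # implicit deny: not in this class


# Hypothetical stand-in for the vendor's ACL, before and after the fix.
voice_acl = [Entry("permit", "udp", dst_port=16384)]
fixed_acl = [Entry("deny", "ip", fragments=True)] + voice_acl

radius_first = Packet("udp", 1812)
radius_frag = Packet("udp", None, noninitial_fragment=True)

print(evaluate(voice_acl, radius_first), evaluate(voice_acl, radius_frag))
print(evaluate(fixed_acl, radius_first), evaluate(fixed_acl, radius_frag))
```

In this model the unfragmented RADIUS packet never matches the voice class, but its trailing fragment does until the `deny ip any any fragments` line is placed first.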

wintermute000

I'm sorry, but this makes even less sense now. I had assumed you were using 'opposite' terminology and calling the best-effort queue the 'top' queue, but you mean EF, i.e. the best queue, which I presume is given priority.

If the fragments are being (accidentally) prioritized, why are they being dropped?

Secondly, what is the match criterion that is incorrectly matching UDP fragments into the EF class and queue? I.e., what is the class-map that classifies EF? I've had issues before with NBARv1 mismatching, for example.
If you are using the ACL EF_Video_Voice in the class-map, then why is it even matching ANY RADIUS/SNMP fragments in the first place?
I am aware voice traffic is usually small UDP packets, which is why I suspect it's an issue with NBAR and the use of "match protocol voice" or similar commands.


Are you able to post the complete QoS configurations?

deanwebb

Can't post the QoS policy, since it's on gear owned and operated by our WAN vendor. I only have that snippet sent by the vendor.

The internet pipe is very small - 200Mb/sec for about 3000 people. The voice/video queue was set at 1Mb/sec, as well. There was massive congestion in all queues, and there were drops in all queues.

The SNMP and RADIUS were in a queue that had less priority than voice/video, but which also had more capacity allotted to it to handle the traffic. First packets or unfragmented packets got through every time.

The fragments, however, according to the WAN vendor engineer, were hit by a default behavior in Cisco QoS logic. I.e., there was no policy to send them into our highest-priority-queue-of-doom; they just wound up in there because of that default behavior. Blocking fragments from that queue pushed them into the next-highest-priority queue, which was where the rest of the SNMP and RADIUS traffic was successfully getting through.

That's the best I can do, I'm afraid. But it was no policy that moved the fragments, it was default Cisco QoS logic. If the voice/video queue had had more space in it and the WAN link had not been running at 90%+ congestion, we wouldn't have seen the issue.

LynK

Silly question: have you double-checked MTU on all of your devices? Have you double-checked speed/duplex on your links?
Sys Admin: "You have a stuck route"
            Me: "You have an incorrect Default Gateway"

deanwebb

Yes and yes.

The site wanted to see if the fix allowed it to use just the datacenter and not the local RADIUS server. The answer is yes, until the entire office shows up to work a few hours after the early birds roll in. Then it's a hellride until people start to go home for the day. That WAN link is still too small, so even though the fragmented RADIUS traffic is no longer getting sent to die in the voice-video queue, it gets massacred in the general queue alongside other traffic when there's just too much stuff on the line.