Cisco WiSM 2 sending random EAP-Type information

Started by deanwebb, April 26, 2016, 10:38:31 AM

deanwebb

So... the RADIUS server is getting random EAP-Types from a WiSM-2 module in one of our locations. All around the world, we have only EAP-TLS, PEAP, MS-CHAP (love those XPs!), but in this one location, we got EAP-SAKE, Zonelabs, AirFortress-EAP, and some numbers from reserved blocks of EAP-types.
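
For anyone who wants to flag these in their own captures or logs, here is a minimal sketch of pulling the EAP Type out of the raw EAP packet carried in the RADIUS EAP-Message attribute (attribute 79). The type numbers are as I recall them from the IANA EAP registry, so verify them before relying on this. The `EXPECTED_TYPES` set and the function names are my own assumptions for illustration, not anything from the RADIUS server or the WiSM.

```python
import struct

# EAP method type numbers as recalled from the IANA EAP registry.
# Assumption: double-check these against the registry before trusting them.
EAP_TYPE_NAMES = {
    1: "Identity",
    3: "Nak",
    13: "EAP-TLS",
    25: "PEAP",
    26: "MS-EAP-Authentication (commonly used for EAP-MSCHAPv2)",
    37: "AirFortress-EAP",
    44: "ZoneLabs",
    48: "EAP-SAKE",
}

# Assumption: the types we expect to see. Identity and Nak show up in
# normal EAP exchanges, so they are allowed alongside the real methods.
EXPECTED_TYPES = {1, 3, 13, 25, 26}


def eap_type(eap_message: bytes):
    """Return the Type byte of an EAP Request/Response, else None."""
    if len(eap_message) < 5:
        return None
    code, _ident, length = struct.unpack("!BBH", eap_message[:4])
    # Only Request (1) and Response (2) packets carry a Type field.
    if code not in (1, 2) or length < 5:
        return None
    return eap_message[4]


def is_unexpected(eap_message: bytes) -> bool:
    """True when the packet carries an EAP Type outside EXPECTED_TYPES."""
    t = eap_type(eap_message)
    return t is not None and t not in EXPECTED_TYPES
```

Feed it the concatenated EAP-Message value from an Access-Request. Anything where `is_unexpected()` returns True is worth a closer look.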

:zomgwtfbbq:

I should not have to be aware of this link if everything is working properly: http://www.vocal.com/secure-communication/eap-types/

And we're getting all kinds... you name it, it's popping up on our RADIUS server. Of course, those guys get rejected. Around the world, fewer than 1% of our clients get a reject; at this site, well over 20% do. The RADIUS servers for this site are doing 100% accept for other sites that are just as distant from the RADIUS server, or more so. Clients from that site that are visiting other sites have 100% acceptance on the WLAN. To me, everything points at the WiSM.

We called Cisco TAC... he'd like to run a few more traces and debugs...

:phone:

I think this is where he's supposed to say "Time to RMA that bad boy." But... he wants more traces...

:wha?:

OK... let's set those up... oh, wait, the WiSM is in a core 6509, so any trace we do on traffic outbound from that WiSM to the RADIUS server is mixed in with everything else on a highly-utilized main line.
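
One way to cut the noise on a busy link is a capture filter limited to the WLC and the RADIUS server. Below is a minimal sketch that builds a pcap-style filter string. The IP addresses are placeholders, and ports 1812/1813 are an assumption (some deployments still use 1645/1646). The fragment term is there because non-first IP fragments carry no UDP header, so a plain port filter would miss them.

```python
def radius_capture_filter(wlc_ip: str, radius_ip: str, ports=(1812, 1813)) -> str:
    """Build a pcap filter for traffic between one WLC and one RADIUS server.

    Matches the given UDP ports plus any IPv4 fragment between the two hosts.
    """
    port_terms = " or ".join(f"udp port {p}" for p in ports)
    # Bytes 6-7 of the IPv4 header hold the flags and fragment offset.
    # Mask 0x3fff keeps the More Fragments bit and the 13-bit offset.
    frag_term = "(ip[6:2] & 0x3fff != 0)"
    return f"host {wlc_ip} and host {radius_ip} and ({port_terms} or {frag_term})"


# Hypothetical addresses from the documentation ranges, not our real ones.
print(radius_capture_filter("192.0.2.10", "198.51.100.20"))
```

The resulting string can be pasted into Wireshark's capture filter box or quoted on a tcpdump command line.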

:frustration:

If this was my network, I wouldn't put up with that BS. RMA that WiSM. EVERY OTHER WLC WORKS. RADIUS IS FINE. THE CLIENTS, EVEN, ARE FINE. I mean, I want to throw the client under the bus, but it's innocent, this time. No. It's the WiSM. It should be obvious. I want my RMA. Not another debug.

:no:
Take a baseball bat and trash all the routers, shout out "IT'S A NETWORK PROBLEM NOW, SUCKERS!" and then peel out of the parking lot in your Ferrari.
"The world could perish if people only worked on things that were easy to handle." -- Vladimir Savchenko
Вопросы есть? Вопросов нет! | BCEB: Belkin Certified Expert Baffler | "Plan B is Plan A with an element of panic." -- John Clarke
Accounting is architecture, remember that!
Air gaps are high-latency Internet connections.

Reggle


deanwebb


icecream-guy

:professorcat:

My Moral Fibers have been cut.

deanwebb

I'm not the primary from my company on this case. I'm just along for the ride because I handle RADIUS. Our wireless guy doesn't want to go to a manager, just yet... after today's charlie foxtrot call, he might, though.

:wall:

deanwebb

Online with the lads, doing debugs and wireshark captures. I love how it takes a conference call to get a wireshark straight...

Dieselboy

To be fair, I was happy with my Wireshark 1.x until a few weeks ago, when it crashed and wouldn't end task, causing loads of problems as I had to reboot. Now the 2.x version constantly crashes, the GUI is a bit different, and for some reason it won't let me type or delete characters in the filter unless I highlight it and then type over it. Arrow keys don't work there either.

Did you get it isolated / identified?

deanwebb

We got a capture during a failure... we were turning the radio off and on repeatedly to try to roll those dice and get a RADIUS-Reject. Within a few minutes, it was accepted with EAP-TLS, then rejected with AirFortress-EAP, then once we got the capture, the client tried again on its own and got in with EAP-TLS again.

We shall see if Cisco says "time to RMA" to-day in about an hour.

deanwebb

Follow-up... Cisco didn't RMA. More discussions. Cisco claimed it couldn't find any failures in its capture. Since then, I learned that they only looked for one kind of failure, which pisses me off. I got up at 5AM to start running captures on the RADIUS server handling that WiSM.

I did notice that the strange failures do happen in clusters and I picked up one such cluster in one of my captures, which is still running... but when it's done, I'm going through it with care and precision, I tell you! Care and precision!

Stupid random error that Cisco pushes back on even though everything points at their WiSM as the problem... and they STILL tried to blame the RADIUS server on Friday...
:no:

deanwebb

OK, finished plowing through Wiresharks.

Friggin' WiSM runs just fine for big stretches of time, and then... WHAM! It's sending IPv4 fragments of size 1442 that are suspiciously the same size as a RADIUS-Request packet that should be next in the conversation flow. Also lots and lots of duplicate traffic from the WiSM. This lasts for a minute or three and then it's back to normal.
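
For reference, here is a rough sketch of the two checks I mean: is a given IPv4 header a fragment, and which RADIUS packets show up more than once. It works on raw bytes you have already pulled out of a capture. The function names are made up for illustration, and treating "same code, identifier and authenticator" as a duplicate is my assumption about what counts as a resend.

```python
import struct
from collections import Counter


def fragment_info(ip_header: bytes):
    """Return (is_fragment, more_fragments, offset_bytes, total_length)."""
    if len(ip_header) < 20 or ip_header[0] >> 4 != 4:
        raise ValueError("not an IPv4 header")
    total_length = struct.unpack("!H", ip_header[2:4])[0]
    flags_frag = struct.unpack("!H", ip_header[6:8])[0]
    more_fragments = bool(flags_frag & 0x2000)   # MF flag
    offset_bytes = (flags_frag & 0x1FFF) * 8     # offset is in 8-byte units
    is_fragment = more_fragments or offset_bytes > 0
    return is_fragment, more_fragments, offset_bytes, total_length


def duplicate_radius_packets(radius_packets):
    """Map (code, identifier, authenticator) to count for packets seen twice or more."""
    seen = Counter()
    for pkt in radius_packets:
        if len(pkt) < 20:        # RADIUS header is 20 bytes
            continue
        seen[(pkt[0], pkt[1], pkt[4:20])] += 1
    return {key: n for key, n in seen.items() if n > 1}
```

Run the first over every IPv4 header from the WiSM and the second over every RADIUS payload, and the bad stretches should stand out from the quiet ones.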

Email sent to Cisco, customer, managers...

mlan


deanwebb

Looks like a code upgrade on the WiSM-2 to the most current release worked. I'll know for sure on Monday, but things look very promising right now.

That was our idea, though. Why Cisco didn't think to suggest that is beyond me. I thought it was reflex action for TAC to suggest a code upgrade and a reboot before doing anything else. True, this is apparently a very new release, but, still... why were we the ones that sussed that out? Not happy with our support in this case.

:hankhill:

deanwebb

I feel bad for that location, but vindicated in my troubleshooting and assessment. After the upgrade, the original behavior returned. All of it, even if it is at a slightly reduced frequency.

Someone just found a 5508 in a lab site at that location... I'm all for cutting the site over to that bad boy while we wait for the HA pair of 5508s to arrive.

:yeahright:

deanwebb

The Cisco guy is unbelievable. Unnnnnnnnnnnnnnbelievable.

We've demonstrated that it's not the client, not the RADIUS server, not the Riverbed, not the WAN routers, not the MPLS network, not the datacenter infrastructure, and it's not even the chassis housing the WiSM, since client traffic for the other WLC we're testing with using FlexConnect has to pass through the 6800 to get to the WLC.

So... it's either the WiSM itself or some screwy interaction with WiSM at our level of code and the 6800 at our level of code.

But the TAC guy tried to say that, since the RADIUS traffic from the WLC to the server follows a different path, we're somehow bypassing the WAN infrastructure.
:zomgwtfbbq:

After the wireless guy and I explained how the traffic to the WLC goes through all the same stuff that traffic from the WiSM traverses, TAC still wanted to see if something else could be causing the weird fragments and delays in communication.

I said, "The client sends a packet to the NAS and awaits a response. The RADIUS server gets that packet and sends a response. That client never gets the response from the RADIUS server. This is the only device in our environment that has this issue. I'm pretty sure it's the WiSM."

TAC guy said, "Well, I don't know if replacing the WiSM with another one is going to be the solution that you want."

"You're right. I'm thinking that any WiSM we use is going to be garbage. I want a real WLC in there, handling all the traffic."

I got to be bad cop for the rest of the call and the wireless guy got to be the good cop. We didn't coordinate our responses ahead of time, but we have a good improv chemistry. I'd get up to a moderate roar and he'd be ready with an "Easy there, now. What we mean to say is..." TAC guy would be ready to deal with the wireless guy. And if he strayed, I got to roar a bit again. I don't *like* being the bad cop, but I know I can fill the role when I have to.

TAC guy said he'd have to see if he could find us a loaner. (Turns out, the lab 5508 was unusable... damn!) I let that comment pass without responding, "Well, I'm pretty sure that if you can't find us a loaner, Aruba Networks could set us up with one."
:problem?:

But I might say that next time... I dunno...

Otanx

This just boggles me. Can't you just say "send me a new WiSM"? I have requested new hardware, and they have done it after much less hassle than you are going through.

-Otanx