Please Explain BGP to Me...

Started by deanwebb, July 06, 2017, 11:37:33 AM


deanwebb

Situation: Site X starts to experience connectivity issues at 11:00am. No Outlook, no Office 365, no browsing of Internet or Intranet... but they still have Skype, can browse to web pages of WLCs (both by IP and by FQDN), SSH works fine, and they can both ping Intranet web servers and telnet to port 80 on them.

No QoS, Riverbed, IPS, or Firewall in the path. When I set users at Site X to use a different proxy (one that is slated for decommissioning), they can browse everywhere. But, because we need a registry hack to set the proxy in IE, Outlook and Office 365 still fail - along with web-based applications that require IE. I go to bed last night thinking it's a proxy issue, good luck with that to the proxy team.

I wake up to find that the resolution was in doing stuff with the WAN link... an hour or so before Site X had its issues, the local telcos were doing work on the WAN link and had brought up a secondary line. When Site X turned off the primary link and brought up the secondary, everything worked as it was supposed to work.

I'm confused.

Why would we be able to make direct, non-browser connections to web servers if there was something jacked up in the WAN link? I get that, maybe, the route to the proxy was messed up, but we were able to ping it and telnet to its open ports. Once we used the proxy that was getting decommed, we could even download the proxy script from the main corporate proxy. That says to me, "full L2 and L3 connectivity, all is well".

Why is it that the WAN link was the problem? One of the engineers said something about BGP, but I don't know enough to understand him, or to determine if he was wrong.
Take a baseball bat and trash all the routers, shout out "IT'S A NETWORK PROBLEM NOW, SUCKERS!" and then peel out of the parking lot in your Ferrari.
"The world could perish if people only worked on things that were easy to handle." -- Vladimir Savchenko
Any questions? No questions! | BCEB: Belkin Certified Expert Baffler | "Plan B is Plan A with an element of panic." -- John Clarke
Accounting is architecture, remember that!
Air gaps are high-latency Internet connections.

burnyd

That could really be anything. For example, routing certain subnets over one WAN link could kill traffic because of a null route. Or maybe asymmetrical traffic is going through one firewall and out another. It's probably the firewall in the DC, 100%.
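For anyone following along, the asymmetric-path idea can be sketched in a few lines of Python. This is a toy model, not any vendor's behavior: the `StatefulFirewall` class, the firewall names, and the client address 10.0.0.10 are all made up for illustration. It only shows why a reply that comes back through a firewall that never saw the outbound SYN gets dropped.

```python
class StatefulFirewall:
    """Toy model: a reply is allowed only if this firewall saw the outbound flow."""

    def __init__(self, name):
        self.name = name
        self.sessions = set()  # (source, destination) pairs seen going out

    def outbound(self, src, dst):
        # Record the flow when the client's SYN leaves through this firewall.
        self.sessions.add((src, dst))
        return True

    def inbound(self, src, dst):
        # Allow return traffic only if the reverse flow is in the session table.
        return (dst, src) in self.sessions


fw_a = StatefulFirewall("fw-a")  # hypothetical firewall on the outbound path
fw_b = StatefulFirewall("fw-b")  # hypothetical firewall on a different return path

client = ("10.0.0.10", 50000)    # made-up client address and source port
server = ("10.1.1.2", 80)

fw_a.outbound(client, server)

symmetric_ok = fw_a.inbound(server, client)   # reply returns via the same firewall
asymmetric_ok = fw_b.inbound(server, client)  # reply returns via the other firewall

print("symmetric path allowed: ", symmetric_ok)
print("asymmetric path allowed:", asymmetric_ok)
```

In this model the asymmetric reply is dropped for every connection, so on its own it would not explain a failure that hits only some applications.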

deanwebb

Can't be the firewall because changing the proxy fixed it and changing the WAN circuit fixed it even better.

But even if it was asymmetrical traffic, why would a telnet to the proxy on port 80 work, but a browser connection to the same proxy on the same port fail? It's not like we had an app-sensitive QoS policy being beta tested in Site X... I thought that either routes work, or they don't, or they flap. This thing was so app-specific, it made me think some disgruntled admin had installed a Palo Alto inline without anyone knowing. How could that be a routing issue?

SimonV

Quote: Or maybe asymmetrical traffic going through one firewall and out another. It's probably the firewall in the DC 100%

That was my first thought too...

Quote from: deanwebb on July 06, 2017, 12:19:37 PM
But even if it was asymmetrical traffic, why would a telnet to the proxy on port 80 work, but a browser connection to the same proxy on the same port fail? It's not like we had an app-sensitive QoS policy being beta tested in Site X... I thought that either routes work, or they don't, or they flap. This thing was so app-specific, it made me think some disgruntled admin had installed a Palo Alto inline without anyone knowing. How could that be a routing issue?

A telnet only verifies the 3-way handshake, really. Palo Alto for example lets the 3-way handshake through before it can move to application ID. Not saying that's what happened, just to explain that Telnet is not always the best test. It could also be some sort of traffic shaping on the provider (or upstream) network, where they classified HTTP as traffic eligible to drop.
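To make the telnet point concrete, here is a minimal Python sketch of the two tests side by side. The function names are made up for this example, and the second check is only a stand-in for what a browser does: it sends a small HEAD request and waits for any reply. A path that completes the handshake but drops application data would pass the first check and fail the second.

```python
import socket


def tcp_connect_ok(host, port, timeout=3.0):
    """Roughly what 'telnet host port' proves: the three-way handshake completes."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_response_ok(host, port, timeout=3.0):
    """Send a real request and require at least one byte back from the server."""
    request = (
        f"HEAD / HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
    ).encode("ascii")
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(request)
            # recv() raises a timeout (an OSError subclass) if nothing arrives.
            return len(sock.recv(1)) > 0
    except OSError:
        return False
```

Running both against the proxy from a Site X client would show whether the failure sits in session setup or in the data that follows it, which a bare telnet cannot tell you.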

Have you verified with wireshark what happened in the browser? Would be interesting to see if at least the TCP session formed... Is port 80 your true proxy port?

We had something like that at one of our Russian sites too, a couple of months back. The SYN would go through and the SYN-ACK would be returned, but the ACK always mysteriously disappeared. Our provider spent weeks trying to find the issue. We all suspected some sort of firewall at the local carrier, but the problem suddenly disappeared, and we never found out what it was.

deanwebb

Pretty sure the TCP session formed, but we did not have Wireshark on the client PC at Site X and the Unix guys kept making excuses not to do a tcpdump.

WAN guys were positive that there was no QoS, neither shaping nor policing, which was my first thought.

We'll say the proxy ports include 80, (n), and (m). One of those numbers will work. The old proxy uses (m) and the new ones will answer on 80 and (n). Doing a telnet or TCPing to any of those ports would return a successful response.

Where I'm going nuts is why all the WLC web pages worked just fine but web pages of devices in the same VLAN would fail. I.e., a WLC would respond to https://10.1.1.1 or https://remote.wireless.controller.megacorp.com, but the web server at 10.1.1.2 would time out when the browser went to it, until we switched the proxy or went to the secondary WAN line.

that1guy15

Network gremlins is my official CCIE guess.

But what those guys said sounds right too.
That1guy15
@that1guy_15
blog.movingonesandzeros.net

SimonV

So what's the next step, schedule a second fail-over to reproduce?

LynK

Dean,

This could be a lot of things... but let me give you a hand.

Certain subnets could be routing out certain routers based on static routes (or PBR), with no ip sla for failover. Basically black-holing any return traffic. We are going to need to see a diagram of sorts, and maybe a config to give you a hand.
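A rough Python sketch of that black-hole scenario, for illustration only. The route table, the link names "primary" and "secondary", and the `pick_route` helper are all hypothetical, and the `track_next_hop` flag is a stand-in for what IP SLA tracking would do on a real router. It shows a more specific static route still winning the lookup after its link has gone down.

```python
import ipaddress


def pick_route(routes, dest, link_up, track_next_hop):
    """Longest-prefix match; with tracking on, skip routes whose link is down."""
    dest_ip = ipaddress.ip_address(dest)
    candidates = [
        (net, hop)
        for net, hop in routes
        if dest_ip in net and (not track_next_hop or link_up[hop])
    ]
    if not candidates:
        return None
    # The most specific (longest) prefix wins, as in a normal routing lookup.
    return max(candidates, key=lambda route: route[0].prefixlen)[1]


# Hypothetical table: a static /24 pinned to the primary link, default via secondary.
routes = [
    (ipaddress.ip_network("10.1.1.0/24"), "primary"),
    (ipaddress.ip_network("0.0.0.0/0"), "secondary"),
]
link_up = {"primary": False, "secondary": True}

untracked = pick_route(routes, "10.1.1.2", link_up, track_next_hop=False)
tracked = pick_route(routes, "10.1.1.2", link_up, track_next_hop=True)

print("without tracking, traffic goes to:", untracked)  # the dead link
print("with tracking, traffic goes to:   ", tracked)
```

Without tracking, the lookup keeps choosing the dead primary link for 10.1.1.0/24 even though a working default route exists, which is the black hole described above.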

Sys Admin: "You have a stuck route"
            Me: "You have an incorrect Default Gateway"

deanwebb

To make matters more fun, the issue is with gear and lines that belong to a third-party ISP in Latin America that provides last mile connectivity to the MPLS provider. I'm not gonna get configs on this one...  :-\

isaiahgoveait

Just found this http://pin.it/Ze9hSBz



deanwebb

I think I'll upload it here...

:tmyk:


