Tweaking routing metrics

Started by deanwebb, December 09, 2016, 04:50:32 PM

deanwebb

Is this ever done? If so, what do you tweak and why?
Take a baseball bat and trash all the routers, shout out "IT'S A NETWORK PROBLEM NOW, SUCKERS!" and then peel out of the parking lot in your Ferrari.
"The world could perish if people only worked on things that were easy to handle." -- Vladimir Savchenko
Any questions? No questions! | BCEB: Belkin Certified Expert Baffler | "Plan B is Plan A with an element of panic." -- John Clarke
Accounting is architecture, remember that!
Air gaps are high-latency Internet connections.

that1guy15

BGP timers in the DC for peering and dead time.

On direct links between stable gear, drop them all the way down. The defaults were designed for unpredictable WAN links, not DC links.

Honestly, BFD and fast-reroute, when available, have made timers a non-issue.
That1guy15
@that1guy_15
blog.movingonesandzeros.net

deanwebb

Quote from: that1guy15 on December 10, 2016, 02:07:08 PM
BGP timers in the DC for peering and dead time.

Direct links on stable gear drop them all the way down. They were designed for unpredictable WAN links not DC links.

Honestly BFD and fast-reroute when available have made timers a non-issue.

I think I'm going to wind up learning something here... but that's what this place is for, right? We ask and answer questions to learn and to keep sharp.

I'll start with, "what is peering and dead time?"

that1guy15

Sorry, I mixed up terms. The timers I'm talking about are the keepalive and hold timers. "Dead timers" got mixed up in my head with BGP Dead Peer Detection, which is used to quickly detect that a BGP neighbor/peer is lost. Once the loss is noticed, the peer and its prefixes are flushed immediately, which can cut convergence from 3+ minutes to seconds or less.

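The hold timer is the number of seconds to keep a peer session "established" without receiving a keepalive (or update) from the peer; BGP has no hellos. Once the hold timer expires, the session is torn down, the peer drops out of "established" (you will usually see it sitting in "idle" or "active" while it tries to reconnect), and prefixes from the peer are flushed. By default on a lot of gear (Cisco, for example), keepalives are sent every 60 seconds and the hold time is 3x the keepalive, so 180 seconds.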

This means that if a peer goes down without the link itself dropping, it can take up to 3 minutes to notice and start the convergence process.
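To put rough numbers on that, here is a minimal Python sketch of the detection window. It assumes the peer dies silently (no link-down event, no BFD) and that keepalives are the only traffic on the session; the function name `detection_window` is made up for illustration.

```python
def detection_window(keepalive_s: int, hold_s: int) -> tuple[int, int]:
    """Return (best, worst) seconds to notice a silently dead peer.

    The hold timer restarts whenever a keepalive or update arrives, so the
    clock effectively starts at the last message the peer managed to send.
    """
    if hold_s <= keepalive_s:
        raise ValueError("hold time must exceed the keepalive interval")
    worst = hold_s                # peer died right after sending a keepalive
    best = hold_s - keepalive_s   # peer died just before its next keepalive
    return best, worst


print(detection_window(60, 180))  # (120, 180)
print(detection_window(1, 3))     # (2, 3)
```

So with 60/180 you are blind for somewhere between 2 and 3 minutes, and with 1/3 for 2 to 3 seconds.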

My recommendation was to reduce the timers to the smallest values possible, which is a keepalive of 1 second and a hold time of 3 seconds (still 3x, since BGP does not allow a nonzero hold time below 3 seconds). So 3 seconds to detect a peer loss. Not bad, but as my original post said, BFD and Dead Peer Detection drop this to well under a second.
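Worth knowing when you shrink these: per RFC 4271 the hold time is either 0 (keepalives off) or at least 3 seconds, and the two peers end up using the smaller of the values they each propose. A rough sketch of that logic follows. The function name is hypothetical, and deriving the keepalive as one third of the hold time is the RFC's suggested ratio, not something every implementation does the same way.

```python
MIN_NONZERO_HOLD_S = 3  # RFC 4271: hold time is 0 or at least 3 seconds


def negotiate_timers(local_hold_s: int, peer_hold_s: int) -> tuple[int, int]:
    """Return (hold, keepalive) in seconds as the session would use them."""
    for hold in (local_hold_s, peer_hold_s):
        if hold != 0 and hold < MIN_NONZERO_HOLD_S:
            raise ValueError(f"hold time {hold}s is not allowed (0 or >= 3)")
    hold = min(local_hold_s, peer_hold_s)  # the smaller proposal wins
    if hold == 0:
        return 0, 0                        # keepalives disabled entirely
    keepalive = max(1, hold // 3)          # one third, never below 1 second
    return hold, keepalive


print(negotiate_timers(3, 180))   # (3, 1)
print(negotiate_timers(180, 90))  # (90, 30)
```

The practical upshot is that you only need to lower the timers on one side for the session to pick up the shorter hold time.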

Sorry for all the mixup.

deanwebb

Honestly, this is all new to me, so I had no idea if you had a mixup or not. :D

But now that leads me to ask, why would the defaults be set the way they are if the most rapid detection is preferable?

NetworkGroover

In my opinion, I wouldn't touch any of the timers.  Between BFD and fast-fallover or whatever it's called, convergence time should be minimal.

Am I misunderstanding? Can you propose a scenario I can build out in vEOS to tinker with (if I ever have time)?

Dunno if this helps at all, but I tinkered with it a bit while I was writing http://aspiringnetworker.blogspot.com/2015/08/bgp-in-arista-data-center_90.html

Check out "The Need for Fast Failure Detection"
Engineer by day, DJ by night, family first always

that1guy15

Quote from: AspiringNetworker on December 11, 2016, 08:09:38 PM
In my opinion, I wouldn't touch any of the timers.  Between BFD and fast-fallover or whatever it's called, convergence time should be minimal.

Am I misunderstanding? Can you propose a scenario I can build out in vEOS to tinker (If I ever have time).

Dunno if this helps at all, but I tinkered with it a bit while I was writing http://aspiringnetworker.blogspot.com/2015/08/bgp-in-arista-data-center_90.html

Check out "The Need for Fast Failure Detection"

I tested both ways during our DC turnup testing last year and timers played no role. We put both in place.

wintermute000

It's free, and it's an additional safety net beneath BFD. Also, not everything / every scenario can run BFD (peerings via SVIs on some hw/sw, etc.), and there are P2MP topologies to think about.

LynK

^^This.

Dean, essentially you want to tweak the BGP timers because they are so drastically long (naturally because it is traditionally an external routing protocol). You want to tweak this across your DC well... because itz SUPA FASTTTTTT
Sys Admin: "You have a stuck route"
            Me: "You have an incorrect Default Gateway"

deanwebb

OK, thanks. I think I'm getting an understanding here. Remember, I didn't go to no routin' school! :)

That being said, what's BFD?

that1guy15

Bidirectional Forwarding Detection. AKA an advanced heartbeat between device interfaces, protocols, etc., used to detect failures or issues very quickly.

Supported across almost all vendors now.
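For a feel of how fast "very quickly" is, here is a small Python sketch of the detection time calculation for BFD asynchronous mode as described in RFC 5880. The interval and multiplier values are made-up examples, not anyone's recommended settings, and the function name is hypothetical.

```python
def bfd_detection_time_ms(local_required_min_rx_ms: int,
                          remote_desired_min_tx_ms: int,
                          remote_detect_mult: int) -> int:
    """Detection time on the local system, BFD asynchronous mode."""
    # The remote end sends no faster than the slower of these two values.
    agreed_remote_tx_ms = max(local_required_min_rx_ms,
                              remote_desired_min_tx_ms)
    # Declare the neighbor down after this many missed intervals.
    return remote_detect_mult * agreed_remote_tx_ms


print(bfd_detection_time_ms(300, 300, 3))  # 900
print(bfd_detection_time_ms(50, 100, 3))   # 300
```

Even the relaxed 300 ms x 3 example lands under a second, compared with 3 seconds for the tightest BGP hold time.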

burnyd

Yep, support across vendors and the fact that it is done at the line card level are really the benefits, on top of the really awesome convergence. VMware started offering it within NSX as of late as well, so that's pretty neato.

Anything else?

deanwebb

that1guy15 mentions unstable WAN links... what makes a WAN link unstable?

burnyd

WAN links are typically underpowered: T1s, small Ethernet lines, etc. They get unstable for the same reason proxies and firewalls do, i.e. YouTube, Facebook, etc.