BGP in the DC - BGP Path Hunting and ECMP

Started by NetworkGroover, April 21, 2015, 05:21:50 PM

NetworkGroover

Hey guys,

I'm reading the draft RFC regarding use of BGP in the DC and it mentioned "BGP Path Hunting".  I had to Google that to learn what it is, but after learning that, I'm pondering something.

In a 2-tier spine/leaf BGP ECMP switch fabric, is that really a concern?

I guess a more pointed form of the question would be... let's say I'm a leaf switch and I have a route to get to something "outside" hanging off my spine switches.  If that route is equal through every spine, they will be ECMP routes. Now, if one spine switch loses that route, wouldn't the leaf ultimately end up just removing that ECMP route and moving on?

I'm just trying to visualize how a BGP path hunting situation would occur.

EDIT - If it helps at all I can provide a picture, but you can pretty much look up any spine/leaf data center architecture for reference
Engineer by day, DJ by night, family first always

burnyd

Provide a picture


NetworkGroover

Quote from: burnyd on April 22, 2015, 08:56:45 AM
Provide  a picture

Any leaf/spine architecture.  Attached is a pic I slapped together for the paper I'm writing based off the draft RFC.

EDIT - Let's focus on a single "cluster" and say there are 4 spines instead of just two, and XYZ subnet is being advertised from the core to those 4 spines... so a leaf has 4 ECMP routes to that subnet. What happens if one of those spines, for whatever reason, loses that advertised subnet, but the other three are fine?

EDIT 2 - Or is there any other failure scenario you can come up with that would cause BGP Path hunting to occur?
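
A rough sketch of the four-spine scenario in Python, just to make the question concrete. The spine names and AS numbers are made up, and the only thing compared is AS_PATH length, which is a big simplification of the real best-path process:

```python
# Hypothetical names and ASNs, for illustration only.
def ecmp_next_hops(paths):
    """Return the next hops whose AS_PATH length ties for shortest."""
    if not paths:
        return []
    best_len = min(len(as_path) for as_path in paths.values())
    return sorted(nh for nh, as_path in paths.items() if len(as_path) == best_len)

# The leaf's view of the XYZ subnet: next hop -> AS_PATH as received.
paths = {
    "spine1": [65000, 64600],
    "spine2": [65000, 64600],
    "spine3": [65000, 64600],
    "spine4": [65000, 64600],
}
before = ecmp_next_hops(paths)

# spine3 withdraws the prefix; the leaf simply drops that one path.
paths.pop("spine3")
after = ecmp_next_hops(paths)

print(before)  # four next hops
print(after)   # three next hops, no new path to go hunting for
```

In this simplified model the leaf never has a longer alternate path to fall back on, so losing one spine only shrinks the ECMP set.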


burnyd

Yes, as long as all BGP metrics are equal and multipath is on, multipathing should work to a given destination.  It's really interesting that there is a separate core layer above the spine layers.  Why is that?  Why does your design have three pod-like environments, each with its own leafs and spines?  Why not one large leaf/spine network?

burnyd

Come to think of it, I am using all Nexus switches for my Clos/ECMP BGP setup, which I think has AS-path multipath relax set up on it, so I would imagine you need a similar feature.
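
For anyone unsure what the relax knob changes, here is a minimal Python sketch. It assumes (as I understand the common behaviour, not any one vendor's exact rules) that without relax two paths must carry an identical AS_PATH to share ECMP, while with relax only the AS_PATH length has to match. The function name and AS numbers are hypothetical:

```python
def multipath_eligible(best, candidate, relax=False):
    """Can `candidate` share ECMP with `best`? Only AS_PATH is compared here."""
    if len(best) != len(candidate):
        return False          # different length: never equal cost
    if relax:
        return True           # relax: same length is enough
    return best == candidate  # strict: contents must match too

# Same prefix learned through two upstream devices in different ASNs.
via_a = [65001, 64600]
via_b = [65002, 64600]

strict = multipath_eligible(via_a, via_b)
relaxed = multipath_eligible(via_a, via_b, relax=True)
print(strict, relaxed)
```

This matters most when the upstream devices sit in different ASNs; if they all share one ASN the paths are already identical and the strict check passes anyway.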

NetworkGroover

Quote from: burnyd on April 22, 2015, 11:30:34 AM
Yes, as long as all BGP metrics are equal and multipath is on, multipathing should work to a given destination.  It's really interesting that there is a separate core layer above the spine layers.  Why is that?  Why does your design have three pod-like environments, each with its own leafs and spines?  Why not one large leaf/spine network?

I should have just presented an easier picture, but I had this one handy so slapped it on there.  This picture is a recreation from the draft RFC, which, if I understand correctly, had some part to play in Facebook's Altoona DC design. There's a ton of good info here that I have yet to read (shame on me): https://code.facebook.com/posts/360346274145943/introducing-data-center-fabric-the-next-generation-facebook-data-center-network/

Right now for my paper I'm only focusing on basic stuff, like why BGP over OSPF, why eBGP over iBGP for MLAG peer leaf switches, etc.  Right now I'm working on the ASN scheme and I had this note about BGP path hunting, so I wanted to dig more into it.  The article I found about it on Routing Freak, and another site (I forget which), presented the exact same topology, which isn't a spine/leaf (and likely not one to be used in production, but was perfect for explaining what happens during BGP path hunting).  So I'm just curious how that would happen in this architecture.

If I feel my paper is too short, maybe I'll get into scaling, but for now, with my tendency to be verbose and if my previous paper is any indicator (https://www.arista.com/assets/data/pdf/Whitepapers/STPInteroperabilitywithCisco.pdf), current topics may pan out to be 20+ pages.

burnyd

Well, when did you guys start supporting 4-byte ASNs?  Okay, so this entire paper is based off of Petr Lapukhov's draft on Facebook's use of BGP when you have a ton of VMs in a data center.  I followed his draft for my ECMP data center.  It is a really good one.

BGP is going to be a better choice.  You get a lot more traffic manipulation than an IGP gives you.  Also, you can run all sorts of things over BGP because it can leverage multiple address families.  It's real easy to throw things into different OSPF areas.  However, filtering with OSPF on things you have no control over is extremely difficult, e.g. an NSX edge router or a CSR router.  Filtering is tough.  Also, throughout the network you have things like communities, AS-path removal, all sorts of things.

Are you writing this for Arista's site documentation?  Once again, I think you should stay away from the pod approach; imho that sort of thing is not needed with a large overlay / ECMP layer 3 at the access.  Since I have lived through this, and we are friends on LinkedIn and everything else, I can Skype or chat over the phone with you to give you real-world examples.

NetworkGroover

Hehe.  Don't try to put everything together 'too' closely - I'm not doing it word for word.  Just wanted to give you some background and sources I'm leveraging.  For me personally, I haven't worked with entities of that size, so I don't feel I have the experience to warrant writing a paper about hyperscale data centers. Just being honest.

Arista did just release EOS 4.15 which supports 4 byte ASNs, multipath relax, and other features... if you were curious.

Yes, everything you mentioned on why BGP is better is what I've discovered in research as well.  I still get folks who want to run OSPF but I do my best to explain that ultimately the end game will be BGP when you get to a certain size, so why not just make the move now?  And like you said, filtering, ugh... that's 'loads' of fun - and by that I mean not at all.

Right now I'm just focusing on some basic stuff to A) educate myself, and B) hopefully help alleviate some concerns/questions for folks about running BGP in the DC that other documentation may not provide. A lot of documentation seems to talk about things at a high level - with this paper I'm trying to get into the nitty-gritty, "devil's in the details", best practice kind of stuff.  It's still very early - once I complete it I'll be putting it out for review by my peers, which honestly I expect to be torn apart as my knowledge of STP is much stronger than BGP, but hey, that's how I like to learn and if it makes the paper better I'm all for it. ;)

I was mainly going to use that picture to give a bigger view on AS usage scheme and how it works - not to tell folks they should design their DC in pods or not.

burnyd

If all those features are baked in then yes, it should work perfectly.  Those features were not there previously, which, not to turn this into a vendor war, was part of why Arista was never considered.

OSPF, once again, is extremely difficult to filter properly, especially when you are bringing up virtual routers all the time.  One error in the script, or an issue bringing up a default route, could wreck an entire environment.

To get this stuff to work properly you have to have an advanced understanding of how BGP works, especially because of allowas-in, AS-path relax and so on, so I see why you are doing the pod design, but honestly I hate that.  It makes for so many additional leaf and spine switches.  The model I used was to get two giant 18-slot chassis switches and use them as spine switches.  Run 40Gb links down to the ToR leaf switches and done.  If I need more east-west bisection bandwidth then I'll get more spine switches or break out my 7718s into more VDCs.

But once again, if you want to talk over Skype, I have been through this entire scenario front and back in my year-plus of running an overlay network.

NetworkGroover

Yes - I'm not here to discuss vendors.  I learned that lesson the hard way as you know.

Again, to reiterate - I'm not telling anyone to design their DC in a pod fashion or otherwise.  I'm sure either way is completely viable - I've seen both, and more in the manner you describe than the other. 

We're getting away from the main point of the thread. I'm just curious if a BGP path hunting issue could exist in a two-tier, spine/leaf architecture running BGP and ECMP.  I'm having a hard time visualizing it.

burnyd

Sorry, getting back on topic.

I set specific dampening in all my configurations, and I have had leafs fail or reload due to issues, but I never ran into an issue like the path hunting you are describing... at least I don't think I have.
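
One possible reason it doesn't show up, sketched in Python under a stated assumption: all spines share one ASN (my reading of the draft's ASN scheme) and ordinary eBGP loop detection is on. A leaf that re-advertises a spine-learned route to another spine hands over a path that already contains the spine ASN, so the spine drops it and never holds a longer backup path to hunt through. All names and AS numbers are hypothetical:

```python
def accepts(local_asn, as_path):
    """Standard eBGP loop check: reject a path that already contains our ASN."""
    return local_asn not in as_path

def backup_paths_on_spine(spine_asn, other_spine_asn, leaf_asns, origin_path):
    """Paths this spine would keep via leaves that learned the prefix from another spine."""
    backups = []
    for leaf in leaf_asns:
        readvertised = [leaf, other_spine_asn] + origin_path
        if accepts(spine_asn, readvertised):
            backups.append(readvertised)
    return backups

leafs = [65101, 65102, 65103]
core_path = [64600]

# Spines share AS 65000: every bounced path is rejected.
shared = backup_paths_on_spine(65000, 65000, leafs, core_path)
# Spines in different ASNs: the longer paths are kept as backups.
distinct = backup_paths_on_spine(65001, 65002, leafs, core_path)

print(len(shared), len(distinct))
```

With distinct spine ASNs those stored longer paths are exactly what a spine would cycle through after a withdraw, which is the path hunting behaviour the articles describe.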

wintermute000

slightly OT but what is the driver for having duplicate ASNs on the leaf multilayers?

NetworkGroover

Quote from: wintermute000 on April 23, 2015, 06:30:31 AM
slightly OT but what is the driver for having duplicate ASNs on the leaf multilayers?

Allows you to conserve AS numbers (I'm guessing not using 4-byte ASNs for the reasons mentioned in the draft RFC).

EDIT -  Check out: https://www.nanog.org/meetings/nanog55/presentations/Monday/Lapukhov.pdf

It's talked about specifically there.
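
The catch with reusing leaf ASNs is loop detection: a leaf sees its own ASN in routes originated by a leaf in another cluster and drops them, unless something like allowas-in is enabled. A minimal Python sketch of that check, with a hypothetical function name and made-up AS numbers:

```python
def accept_route(local_asn, as_path, allowas_in=0):
    """Accept unless our own ASN appears more than `allowas_in` times in the path."""
    return as_path.count(local_asn) <= allowas_in

# A leaf in AS 65101 receives a prefix originated by a leaf in another
# cluster that reuses AS 65101 (path read left to right, origin last).
path_from_other_cluster = [65002, 65000, 65001, 65101]

default_behaviour = accept_route(65101, path_from_other_cluster)
with_allowas_in = accept_route(65101, path_from_other_cluster, allowas_in=1)
print(default_behaviour, with_allowas_in)
```

The count-based check is an assumption about how the knob is usually modelled; actual platforms differ in defaults and limits.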

NetworkGroover

Quote from: burnyd on April 22, 2015, 04:34:13 PM
Sorry getting back on topic.

I set specific dampening in all my configurations, and I have had leafs fail or reload due to issues, but I never ran into an issue like the path hunting you are describing... at least I don't think I have.

Hmmm... ok..  I actually do have a separate question for you (and anyone else who runs a spine/leaf DC...)  - I'll open another thread.

burnyd

Quote from: wintermute000 on April 23, 2015, 06:30:31 AM
slightly OT but what is the driver for having duplicate ASNs on the leaf multilayers?

Just an easy way to keep redeploying the same stuff, imho.  It's in the draft RFC Petr Lapukhov wrote.