BGP in the DC - Draft white paper

Started by NetworkGroover, July 01, 2015, 11:43:03 AM


NetworkGroover

Heh - looks like the draft white paper is going to get a major overhaul.  Found out some pretty interesting things I'm in the middle of testing right now.  I was leaning toward eBGP between MLAG peers for a few convenience reasons, but now it looks like there's a large entity that uses the same AS at every single leaf.  That's huge because then we can use BGP dynamic neighbors at the spine and never need to add another line of BGP config for each leaf switch that gets added... sha-weet!
Engineer by day, DJ by night, family first always

that1guy15

Then your spine switches run as route reflectors?

Dynamic neighbors is pretty cool and I can see how that would fit in well in this design. As long as everything is cookie-cutter then life is good.
That1guy15
@that1guy_15
blog.movingonesandzeros.net

NetworkGroover

Quote from: that1guy15 on July 07, 2015, 02:17:24 PM
Then your spine switches run as route reflectors?

Dynamic neighbors is pretty cool and I can see how that would fit in well in this design. As long as everything is cookie-cutter then life is good.

No no - the spines run in their own separate AS with no connection between them - that gives you easy loop prevention (think SPINE1 advertises to LEAF1, which in turn advertises to SPINE2 - SPINE2 sees its own AS in the path and doesn't add the route).  Just the leaf switches run with the same AS.  Sounds weird, but get the "iBGP mesh" thought out of your head - this is outside-the-box kind of stuff.  A new concept to me which I'm labbing up now to see how it works... but it's gotta work - the company that does this is huge.
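
The loop prevention described above can be sketched in a few lines of Python. This is only a toy model, not how any BGP implementation is written. The `advertise` and `accepts` helpers are made up for illustration, and the AS numbers (64600 for the spines, 65000 for the leaf) are borrowed from later in this thread.

```python
# Toy model of eBGP AS_PATH loop prevention (not a real BGP implementation).
# Assumptions: both spines share AS 64600, the leaf uses AS 65000.

SPINE_AS = 64600
LEAF_AS = 65000


def advertise(as_path, sender_as):
    """Return the AS_PATH an eBGP neighbor sees: the sender prepends its own AS."""
    return [sender_as] + as_path


def accepts(as_path, local_as):
    """A router discards any route whose AS_PATH already contains its own AS."""
    return local_as not in as_path


# SPINE1 originates a route and advertises it to LEAF1.
path_at_leaf1 = advertise([], SPINE_AS)
leaf1_accepts = accepts(path_at_leaf1, LEAF_AS)

# LEAF1 re-advertises the same route up to SPINE2.
path_at_spine2 = advertise(path_at_leaf1, LEAF_AS)
spine2_accepts = accepts(path_at_spine2, SPINE_AS)  # SPINE2 finds 64600 in the path

print(leaf1_accepts, spine2_accepts)
```

In this model LEAF1 takes the route and SPINE2 drops it, with no filtering needed.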

EDIT - Re-reading this I see I did a crappy job of explaining.  So let me put it this way.

Say you have a two-tier spine/leaf DC design.  The spines both run AS 64600.  The way things are playing out, I'm seeing three options at the leaf:
1. Run a different AS at every leaf, such as 65000, 65001, 65002, etc.  This was my preferred way to do it, but that may be changing.
2. Run a different AS at each rack (one or two switches), such as 65000 in one rack, 65001 in another, etc.
3. Run the same AS at every leaf, such as 65000 - at every leaf.  If you're running two switches (MLAG, etc.) you iBGP peer them of course, and use allowas-in to accept routes from other leaf switches.  If this can be done, you can leverage dynamic neighbors at the spine and REALLY cut down on config.
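
To see why option 3 needs allowas-in, here is a rough Python sketch of the check a leaf would apply. It assumes allowas-in behaves as "the local AS may appear in the path up to N times"; the `accepts` function and its `allowas_in` parameter are illustrative names, not any vendor's API.

```python
# Toy model of the AS_PATH check with an allowas-in style override.
# Assumptions: spines use AS 64600, every leaf uses AS 65000, and
# `allowas_in` is the number of times the local AS may appear in the path.


def accepts(as_path, local_as, allowas_in=0):
    """Accept the route if the local AS appears at most `allowas_in` times."""
    return as_path.count(local_as) <= allowas_in


# A prefix originated by LEAF1 (AS 65000) and relayed by a spine (AS 64600)
# arrives at LEAF2 (also AS 65000) with this AS_PATH:
path_at_leaf2 = [64600, 65000]

without_override = accepts(path_at_leaf2, 65000)             # default loop check
with_override = accepts(path_at_leaf2, 65000, allowas_in=1)  # one occurrence allowed

print(without_override, with_override)
```

Without the override LEAF2 would throw away every route from the other leaf switches, which is the whole reason the knob is needed in a same-AS design.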

NetworkGroover

For example, at a spine with just 3 leaf switches, I went from this:

BGPDC-SPINE2(config)#sh run sec router bgp
router bgp 64600
   router-id 192.168.254.2
   bgp log-neighbor-changes
   distance bgp 20 200 200
   maximum-paths 32 ecmp 32
   neighbor eBGP_GROUP peer-group
   neighbor eBGP_GROUP fall-over bfd
   neighbor eBGP_GROUP password 7 gxo9zOfCTHZihMXNwE0BXQ==
   neighbor eBGP_GROUP maximum-routes 12000
   neighbor 192.168.255.17 peer-group eBGP_GROUP
   neighbor 192.168.255.17 remote-as 65000
   neighbor 192.168.255.19 peer-group eBGP_GROUP
   neighbor 192.168.255.19 remote-as 65001
   neighbor 192.168.255.21 peer-group eBGP_GROUP
   neighbor 192.168.255.21 remote-as 65002
   network 192.168.254.2/32
   aggregate-address 192.168.255.16/28 summary-only


To this:
BGPDC-SPINE2(config-router-bgp)#sh active
router bgp 64600
   router-id 192.168.254.2
   bgp log-neighbor-changes
   distance bgp 20 200 200
   maximum-paths 32 ecmp 32
   bgp listen range 192.168.255.0/24 peer-group ARISTA remote-as 65000
   neighbor ARISTA peer-group
   neighbor ARISTA fall-over bfd
   neighbor ARISTA password 7 6x5GIQqJNWigZDc2QCgeMg==
   neighbor ARISTA maximum-routes 12000
   network 192.168.254.2/32
   aggregate-address 192.168.255.16/28 summary-only


And it works:
BGPDC-SPINE1(config)#sh ip bgp summ
BGP summary information for VRF default
Router identifier 192.168.254.1, local AS number 64600
Neighbor         V  AS           MsgRcvd   MsgSent  InQ OutQ  Up/Down State  PfxRcd PfxAcc
192.168.255.1    4  65000             27        20    0    0 00:14:24 Estab  3      3
192.168.255.3    4  65000             33        21    0    0 00:14:22 Estab  3      3
192.168.255.5    4  65000              9         9    0    0 00:03:04 Estab  2      2

BGPDC-SPINE1(config)#sh ip route bgp

VRF name: default
Codes: C - connected, S - static, K - kernel,
       O - OSPF, IA - OSPF inter area, E1 - OSPF external type 1,
       E2 - OSPF external type 2, N1 - OSPF NSSA external type 1,
       N2 - OSPF NSSA external type2, B I - iBGP, B E - eBGP,
       R - RIP, I - ISIS, A B - BGP Aggregate, A O - OSPF Summary,
       NG - Nexthop Group Static Route

B E    192.168.10.0/24 [20/0] via 192.168.255.1, Ethernet1
                               via 192.168.255.3, Ethernet2
B E    192.168.20.0/24 [20/0] via 192.168.255.5, Ethernet3
B E    192.168.254.3/32 [20/0] via 192.168.255.1, Ethernet1
                                via 192.168.255.3, Ethernet2
B E    192.168.254.4/32 [20/0] via 192.168.255.1, Ethernet1
                                via 192.168.255.3, Ethernet2
B E    192.168.254.5/32 [20/0] via 192.168.255.5, Ethernet3

Otanx

If that works it is pretty cool. I read your paper, and was thinking that using the bgp listen range command on the spines would have been more helpful than on the leafs, but didn't see how it would work. Using the same AS on all the leafs would make adding leafs easy.

-Otanx

that1guy15

Interesting.

The single ASN for all leafs is a smart idea. I need to chew through all of this more!!

NetworkGroover

Yeah I completely agree that using bgp listen at the spines is way better, but unfortunately the command in its current form requires an AS be specified, and you can't specify multiple. You could specify multiple peer groups but that's not really practical as you'd have a ton of repeated config (think bfd, authentication, etc. for each peer group). So having a different AS at each leaf obviously caused a problem there.  Using the same AS at every leaf makes it so that you can use bgp listen at the spine instead and as shown earlier drastically reduce config.
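
The limitation is easier to see as a small Python sketch of what the spine is effectively doing with a listen range. This is a simplified model that assumes one range maps to exactly one remote AS; `accept_dynamic_peer` is a hypothetical helper, and the range and AS values are taken from the spine config posted earlier.

```python
import ipaddress

# Toy model of how a spine might decide whether to accept a dynamic neighbor.
# Assumption: one listen range maps to exactly one remote AS.
LISTEN_RANGE = ipaddress.ip_network("192.168.255.0/24")
EXPECTED_REMOTE_AS = 65000


def accept_dynamic_peer(peer_ip, peer_as):
    """Accept a peer only if its address is in the range and its AS matches."""
    in_range = ipaddress.ip_address(peer_ip) in LISTEN_RANGE
    return in_range and peer_as == EXPECTED_REMOTE_AS


same_as_leaf = accept_dynamic_peer("192.168.255.1", 65000)   # leaf in the shared AS
other_as_leaf = accept_dynamic_peer("192.168.255.3", 65001)  # leaf with its own AS
outside_range = accept_dynamic_peer("10.0.0.1", 65000)       # not in the listen range

print(same_as_leaf, other_as_leaf, outside_range)
```

A leaf in any AS other than the one tied to the range never comes up, so per-leaf ASNs and a single listen range don't mix.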

Some folks are making an argument though that they want to be able to trace routes back to a leaf using AS_PATH, which obviously won't be too easy if every leaf has the same AS (though I would think the server subnets below it would help with that, but whatever). So in that situation, as a happy medium, it looks like another option would be to use a different AS for each rack (single leaf or dual-leaf via MLAG).  That destroys bgp listen at the spine though, BUT there may be a fix coming down the road to allow the use of the command with multiple ASNs - which would help alleviate that.

Otanx

Could you have each leaf prepend a different AS? So each leaf would be in the same AS, but prepend a different AS? So something like

Leaf 1

router bgp 65001
neighbor 10.1.0.2 remote-as 65200
neighbor 10.1.0.2 route-map prepend out
!
route-map prepend permit 10
set as-path prepend 65042


Leaf 2

router bgp 65001
neighbor 10.1.0.2 remote-as 65200
neighbor 10.1.0.2 route-map prepend out
!
route-map prepend permit 10
set as-path prepend 65043


That is just one line of difference for each leaf which isn't bad.
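
Here is a rough Python sketch of what the paths might look like with this idea. The AS numbers come from the two configs above. The ordering (route-map prepend first, then the leaf's own AS added on eBGP export) is an assumption that should be checked in a lab, and `LEAF_TAGS` is a hypothetical lookup table.

```python
# Toy model of the prepend idea. The ordering of the prepended AS relative
# to the leaf's own AS is an assumption, not verified behavior.

LEAF_AS = 65001
SPINE_AS = 65200
LEAF_TAGS = {65042: "Leaf 1", 65043: "Leaf 2"}  # hypothetical tag AS -> leaf table


def export_from_leaf(tag_as):
    """AS_PATH a spine would see for a prefix originated by a leaf."""
    return [LEAF_AS, tag_as]


def export_from_spine(as_path):
    """AS_PATH another leaf would see after the spine relays the route."""
    return [SPINE_AS] + as_path


def originating_leaf(as_path):
    """Identify the leaf from the last AS in the path (the prepended tag)."""
    return LEAF_TAGS.get(as_path[-1], "unknown")


path_at_other_leaf = export_from_spine(export_from_leaf(65042))
traced_leaf = originating_leaf(path_at_other_leaf)

# The path still contains 65001, so the receiving leaf would still need allowas-in.
needs_allowas_in = LEAF_AS in path_at_other_leaf

print(path_at_other_leaf, traced_leaf, needs_allowas_in)
```

If the ordering assumption holds, the last AS in the path identifies the leaf while the spine still sees a single remote AS.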

-Otanx

NetworkGroover

Quote from: Otanx on July 08, 2015, 09:07:44 AM
Could you have each leaf prepend a different AS? So each leaf would be in the same AS, but prepend a different AS? So something like

Leaf 1

router bgp 65001
neighbor 10.1.0.2 remote-as 65200
neighbor 10.1.0.2 route-map prepend out
!
route-map prepend permit 10
set as-path prepend 65042


Leaf 2

router bgp 65001
neighbor 10.1.0.2 remote-as 65200
neighbor 10.1.0.2 route-map prepend out
!
route-map prepend permit 10
set as-path prepend 65043


That is just one line of difference for each leaf which isn't bad.

-Otanx

Hmmm. Maybe? I don't see why not... allowas-in should still work I think. Eats up more ASNs which some people don't like and it's a little clunky but may get the job done.  I'll play with it.

Otanx

If you want to trace to a leaf with AS_PATH then this gives you the simplified config, and only uses one more ASN than you would have anyway. If you don't care about tracing using AS_PATH then don't use the prepend.

-Otanx



NetworkGroover

Quote from: Otanx on July 08, 2015, 01:04:55 PM
If you want to trace to a leaf with AS_PATH then this gives you the simplified config, and only uses one more ASN than you would have anyway. If you don't care about tracing using AS_PATH then don't use the prepend.

-Otanx

Oh, I was in no way knocking it.  Completely agree.

Otanx

Right, I am just suggesting you can make it an optional part of the deployment. If you want tracing, add these two lines. If you don't, leave them out.

-Otanx

NetworkGroover

Eeesh.  So my original paper is based off using a different AS at every leaf - even those that are MLAG'd together.

Just realized something... let's say I have a spine switch (SPINE1) connected to two leaf switches (LEAF1, LEAF2).  Both of the leaf switches are MLAG'd together and connected to the same server, advertising its subnet to the spine.

If the two leaf switches are in different ASs, does that create a routing loop?   I'm thinking LEAF1 advertises the server subnet to SPINE1, and SPINE1 advertises it down to LEAF2 - will LEAF2 accept and add the route?  Need to test this... I feel like this is a simple networking 101 concept I'm forgetting about... but if it's true, I don't even see that as a viable design option - to address it would require filtering, and why do that when you can just put the two leaf switches in the same AS and address the problem.

NetworkGroover

Yeah... paper's getting a major overhaul.  Will let you guys know once the updates are finished if you're interested.

NetworkGroover

Well, I think I'm about done with the paper.  It's had a major overhaul and I've finally been talked out of running eBGP between MLAG'd switches - mostly for ease of automation.  You can find the updated paper on my LinkedIn profile - it will not be officially published as pieces of it will be used in a more holistic design doc to be released at a future date.