BGP in the DC - Point-to-Point links

Started by NetworkGroover, April 23, 2015, 12:23:07 PM


NetworkGroover

    Hey guys,

    So if you're running a large two-tier spine/leaf data center with BGP, you likely have a lot of /31s to deal with.  I've seen/read several ways to handle this, but what's your take?  How do you handle prefix advertisement in this kind of environment? I know you don't want to add a network statement for every link for scalability reasons, but then how do you support troubleshooting with traceroute?

    Like let's say I have a spine switch with links going down to two leaf switches, and those links are addressed 192.168.255.0/31 and 192.168.255.2/31. On that spine switch, would I:


    • Use network statements 192.168.255.0/31 and 192.168.255.2/31
    • Redistribute connected with a route map only allowing those subnets
    • Use aggregate-address 192.168.255.0/30 - and if so, with which options?
    • Something else entirely
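
    Just to make those concrete, here's roughly what I'm picturing for each (Arista-style syntax, untested sketch - the ASN and the prefix-list/route-map names are made up):

    ! per-link network statements
    router bgp 64600
       network 192.168.255.0/31
       network 192.168.255.2/31

    ! filtered redistribute connected
    ip prefix-list P2P-LINKS seq 10 permit 192.168.255.0/30 le 31
    route-map CONN-TO-BGP permit 10
       match ip address prefix-list P2P-LINKS
    router bgp 64600
       redistribute connected route-map CONN-TO-BGP

    ! single aggregate covering both links
    router bgp 64600
       aggregate-address 192.168.255.0/30 summary-only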
Engineer by day, DJ by night, family first always

NetworkGroover

I've narrowed this down to a couple of things.

Spine switches:
Option 1: summary network statements and static null route
Option 2: aggregate-address summary only

Leaf switches:
Don't advertise the P2P links - only your server subnets

How do you guys handle it?
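
For reference, Option 1 on a spine would look roughly like this (untested sketch - the ASN and the /24 summary block are just placeholders):

ip route 192.168.255.0/24 Null0
!
router bgp 64600
   network 192.168.255.0/24

The static to Null0 is only there so the network statement has an exact matching route in the RIB to advertise; Option 2 would just be the equivalent aggregate-address line with summary-only.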
Engineer by day, DJ by night, family first always

burnyd

Redistribute connected.  That's what I do for quicker deployment.  I don't summarize a whole lot because, generally, in that leaf-and-spine architecture those devices should be able to hold a lot of routes.  The only place you should be doing summarization is your WAN/exit leaf, to summarize the environment.
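
Config-wise that's basically just this on each switch (rough sketch, not Nexus-specific syntax - the ASN and the summary prefix on the exit leaf are only placeholders):

router bgp 65001
   redistribute connected

and then only on the WAN/exit leaf something like:

router bgp 65010
   aggregate-address 10.0.0.0/8 summary-only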

NetworkGroover

Quote from: burnyd on April 23, 2015, 06:23:42 PM
Redistribute connected.  That's what I do for quicker deployment.  I don't summarize a whole lot because, generally, in that leaf-and-spine architecture those devices should be able to hold a lot of routes.  The only place you should be doing summarization is your WAN/exit leaf, to summarize the environment.

Is that really optimal though?  Doesn't that put a lot of routes into your FIB?

I can't disclose the name, but I found out a very large DC entity does the following:

From Spine:
- Advertise a summary network statement that includes all P2P links
- Advertise a loopback via network statement
- Add a static null route for the P2P summary

From Leaf:
- Advertise server subnet (usually a simple /24 per rack) via network statement
- Advertise loopback via network statement
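
On the leaf side that works out to just a couple of network statements - something like this (sketch; the ASN, server /24, and loopback here are placeholders):

router bgp 65000
   network 10.0.1.0/24
   network 192.168.254.3/32

No P2P links advertised from the leaf at all.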

They could have also done the same in a simpler fashion at the spine using BGP aggregate addressing with the summary-only option, but the reason they do it the way they do is that there was some bug in NX-OS at some point that required this as a workaround - and, as we all know, once a workaround is in place you tend to just leave it there, since it's not worth revisiting when you're at that large a scale.

I did it this way (using aggregate-address summary-only) and I like it - just a single entry for the leaf to use (per spine) instead of multiple /31s.  You do have to be smart about allocating a block of IPs per spine to account for scaling, though, otherwise it'll get a little messy and you'll end up using multiple aggregate-address statements, I'd imagine.

Granted, I'm using a very small virtual environment on my laptop, but here's an example.  Attached is the topology in GNS3 I'm using.

On SPINE1 I'm using redistribute connected, and on SPINE2 I'm using aggregate-address:

VS-LEAF-1(config-router-bgp)#sh ip route bgp
Codes: C - connected, S - static, K - kernel,
       O - OSPF, IA - OSPF inter area, E1 - OSPF external type 1,
       E2 - OSPF external type 2, N1 - OSPF NSSA external type 1,
       N2 - OSPF NSSA external type2, B I - iBGP, B E - eBGP,
       R - RIP, I - ISIS, A B - BGP Aggregate, A O - OSPF Summary,
       NG - Nexthop Group Static Route

B E    192.168.254.1/32 [200/0] via 192.168.255.0, Ethernet6
B E    192.168.254.2/32 [200/0] via 192.168.255.4, Ethernet7
B E    192.168.254.4/32 [200/0] via 1.1.1.3, Vlan4093
B E    192.168.255.2/31 [200/0] via 192.168.255.0, Ethernet6
B E    192.168.255.4/30 [200/0] via 192.168.255.4, Ethernet7


So, you can't REALLY see the effect here because there are so few switches, but what I want to point out is the 192.168.255.2/31 route.  That route is coming from SPINE1, which is doing the redistribute connected.  I imagine if I were to scale this out and add 4 more leaf switches (though my crappy IP allocation doesn't support it very well), there would be 4 more /31s for the associated P2P links sitting in LEAF1's routing table.

The 192.168.255.4/30 route is coming from SPINE2, which is using aggregate-address.  Now, I obviously wasn't smart about my IP allocation - I should have set aside a block for each spine to accommodate additional leaf switches (additional P2P links).  But the point is, had I done that, the leaf would have a single entry per spine covering that spine's entire range, rather than a /31 for each P2P link that isn't directly connected.  Make sense?

Engineer by day, DJ by night, family first always

NetworkGroover

This might help visualize it a little better.  I assigned a /25 block to each spine, so now SPINE1's addressing for P2P links starts at .0, and SPINE2's starts at .128.

Here's the config from the spines:

VS-SPINE-1(config-router-bgp)#sh run sec router bgp
router bgp 64600
   bgp log-neighbor-changes
   maximum-paths 32 ecmp 32
   neighbor eBGP_GROUP peer-group
   neighbor eBGP_GROUP fall-over bfd
   neighbor eBGP_GROUP maximum-routes 12000
   neighbor 192.168.255.1 peer-group eBGP_GROUP
   neighbor 192.168.255.1 remote-as 65000
   neighbor 192.168.255.3 peer-group eBGP_GROUP
   neighbor 192.168.255.3 remote-as 65001
   network 192.168.254.1/32
   aggregate-address 192.168.255.0/25 summary-only

VS-SPINE-2(config-router-bgp)#sh active
router bgp 64600
   bgp log-neighbor-changes
   maximum-paths 32 ecmp 32
   neighbor eBGP_GROUP peer-group
   neighbor eBGP_GROUP fall-over bfd
   neighbor eBGP_GROUP maximum-routes 12000
   neighbor 192.168.255.129 peer-group eBGP_GROUP
   neighbor 192.168.255.129 remote-as 65000
   neighbor 192.168.255.131 peer-group eBGP_GROUP
   neighbor 192.168.255.131 remote-as 65001
   network 192.168.254.2/32
   aggregate-address 192.168.255.128/25 summary-only


And here's the resulting routing tables of the leaf switches:
VS-LEAF-1(config-router-bgp)#sh ip route bgp
Codes: C - connected, S - static, K - kernel,
       O - OSPF, IA - OSPF inter area, E1 - OSPF external type 1,
       E2 - OSPF external type 2, N1 - OSPF NSSA external type 1,
       N2 - OSPF NSSA external type2, B I - iBGP, B E - eBGP,
       R - RIP, I - ISIS, A B - BGP Aggregate, A O - OSPF Summary,
       NG - Nexthop Group Static Route

B E    192.168.254.1/32 [200/0] via 192.168.255.0, Ethernet6
B E    192.168.254.2/32 [200/0] via 192.168.255.128, Ethernet7
B E    192.168.254.4/32 [200/0] via 1.1.1.3, Vlan4093
B E    192.168.255.0/25 [200/0] via 192.168.255.0, Ethernet6
B E    192.168.255.128/25 [200/0] via 192.168.255.128, Ethernet7

VS-LEAF-2(config-router-bgp)#sh ip route bgp
Codes: C - connected, S - static, K - kernel,
       O - OSPF, IA - OSPF inter area, E1 - OSPF external type 1,
       E2 - OSPF external type 2, N1 - OSPF NSSA external type 1,
       N2 - OSPF NSSA external type2, B I - iBGP, B E - eBGP,
       R - RIP, I - ISIS, A B - BGP Aggregate, A O - OSPF Summary,
       NG - Nexthop Group Static Route

B E    192.168.254.1/32 [200/0] via 192.168.255.2, Ethernet6
B E    192.168.254.2/32 [200/0] via 192.168.255.130, Ethernet7
B E    192.168.254.3/32 [200/0] via 1.1.1.2, Vlan4093
B E    192.168.255.0/25 [200/0] via 192.168.255.2, Ethernet6
B E    192.168.255.128/25 [200/0] via 192.168.255.130, Ethernet7


Nice and clean - just one /25 entry per spine.
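
I didn't paste it here, but a quick way to sanity-check this on a spine is to look at what's actually being advertised to a leaf and confirm the contributing /31s show up as suppressed - something along the lines of:

VS-SPINE-1#show ip bgp neighbors 192.168.255.1 advertised-routes
VS-SPINE-1#show ip bgp 192.168.255.2/31

The first should list just the 192.168.255.0/25 and the spine loopback; the second should still show the /31 in the BGP table, flagged as suppressed.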
Engineer by day, DJ by night, family first always

burnyd

I see what you are saying.  However, I can't go and make changes to all the physical network switches, because part of my model is bringing up networks all the time from my NSX ESGs to the ToR switches.

Talking FIB space: yes, it takes up a decent amount.  However, the specific Nexus switches I picked on purpose are set to dedicate their TCAM space to IPv4 unicast.

NetworkGroover

Quote from: burnyd on April 24, 2015, 08:02:02 PM
I see what you are saying.  However, I can't go and make changes to all the physical network switches, because part of my model is bringing up networks all the time from my NSX ESGs to the ToR switches.

Talking FIB space: yes, it takes up a decent amount.  However, the specific Nexus switches I picked on purpose are set to dedicate their TCAM space to IPv4 unicast.

I wasn't suggesting you do, bud.  Different strokes for different folks, and all that jazz.  :pub:
Engineer by day, DJ by night, family first always

killabee

I've done "redistribute connected," "aggregate-address summary" and null route advertisements in non-leaf/spine scenarios.  There wasn't a silver bullet for all my designs...I considered their pros/cons and used the best fit.  In your case, I'd go with the "aggregate-address summary" so the summary is advertised conditionally based on the availability of the subset prefixes on the leafs and the leaf links.  There's a caveat to this in terms of how the AS_PATH shows up...but I think that's only an issue for eBGP, and not an issue in your case.

Without thinking too much about it, I want to say that advertising a null summary could blackhole traffic since that's unconditionally up (unless you have IPSLAs, track objects, etc).  You wouldn't want one of the spine switches to do that when there's ample link redundancy.
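
To put it concretely, the two spine-side variants being compared are roughly this (sketch, borrowing the /25 and ASN from your lab):

router bgp 64600
   aggregate-address 192.168.255.0/25 summary-only

versus

ip route 192.168.255.0/25 Null0
router bgp 64600
   network 192.168.255.0/25

The aggregate goes away as soon as the last contributing /31 is gone from the BGP table; the network-plus-null-static stays advertised as long as the static exists, and a static to Null0 never goes down on its own.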

The "redistribute connected" will help you by not having to type many "network" commands, but you could get yourself in trouble if you don't use a route map or watch what interfaces you create.  It won't help you with your FIB issue, though.

NetworkGroover

Quote from: killabee on April 25, 2015, 11:00:01 PM
I've done "redistribute connected," "aggregate-address summary" and null route advertisements in non-leaf/spine scenarios.  There wasn't a silver bullet for all my designs...I considered their pros/cons and used the best fit.  In your case, I'd go with the "aggregate-address summary" so the summary is advertised conditionally based on the availability of the subset prefixes on the leafs and the leaf links.  There's a caveat to this in terms of how the AS_PATH shows up...but I think that's only an issue for eBGP, and not an issue in your case.

Without thinking too much about it, I want to say that advertising a null summary could blackhole traffic since that's unconditionally up (unless you have IPSLAs, track objects, etc).  You wouldn't want one of the spine switches to do that when there's ample link redundancy.

The "redistribute connected" will help you by not having to type many "network" commands, but you could get yourself in trouble if you don't use a route map or watch what interfaces you create.  It won't help you with your FIB issue, though.

Gotcha - yeah, this is geared more toward the goal of a two-tier spine/leaf architecture that's easily repeatable and scales to any size DC.  Regarding your second paragraph, could you elaborate a bit?  Keep in mind that the only thing we're summarizing here is the point-to-point links - the server subnets and other prefixes wouldn't be included in that summary or summarized separately; they'd be explicitly advertised.  They're just not shown in my example because I haven't set any up yet.
Engineer by day, DJ by night, family first always