Meh... looks like my work is going to be swallowed up into a larger design doc... so if you guys are interested, I uploaded my draft proposal for a practical BGP in the DC design doc to my LinkedIn profile under my current position until it gets published. Once published I'll broadcast the URL. Let me know what you think (PM me if you don't know who I am).
Cool! I'll have a look when I get some time.
Thanks for sharing.
I would love to read it but I don't think I have anyone from here on LinkedIn (something I should probably change at some point).
Yeah - you should rectify that. It's a small world and you never know when someone from the forum will show up in your neck of the woods.
Just a little while ago... what.. four of us were able to get together and hang out?
Nice document. Only when you're talking about failover and reconvergence timers and BGP, I'd expect BFD to be mentioned somewhere. Did you forget it or is it not that interesting in this kind of design? If it's the latter, I'd like to know why. And I'm sure others do as well.
Quote from: Reggle on July 01, 2015, 03:21:03 PM
Nice document. Only when you're talking about failover and reconvergence timers and BGP, I'd expect BFD to be mentioned somewhere. Did you forget it or is it not that interesting in this kind of design? If it's the latter, I'd like to know why. And I'm sure others do as well.
Thanks! You probably missed it - it's in the section called "The Need for Fast Failure Detection" - last paragraph. I just didn't write a huge section on it.
Quote from: Nerm on July 01, 2015, 01:44:27 PM
I would love to read it but I don't think I have anyone from here on LinkedIn (something I should probably change at some point).
Don't tell me you missed the linkedin-circle-jerk thread aka the belkin thread of the year
Quote from: wintermute000 on July 01, 2015, 09:54:22 PM
Quote from: Nerm on July 01, 2015, 01:44:27 PM
I would love to read it but I don't think I have anyone from here on LinkedIn (something I should probably change at some point).
Don't tell me you missed the linkedin-circle-jerk thread aka the belkin thread of the year
lolwut
Quote from: AspiringNetworker on July 01, 2015, 10:08:08 PM
Quote from: wintermute000 on July 01, 2015, 09:54:22 PM
Quote from: Nerm on July 01, 2015, 01:44:27 PM
I would love to read it but I don't think I have anyone from here on LinkedIn (something I should probably change at some point).
Don't tell me you missed the linkedin-circle-jerk thread aka the belkin thread of the year
lolwut
another place, another time.
OK, time for another LinkedIn roundabout...
Petr Lapukhov presented a BGP Design for Data Center back in 2012 for NANOG when he was with Microsoft.
Quote from: AnthonyC on July 04, 2015, 09:54:46 PM
Petr Lapukhov presented a BGP Design for Data Center back in 2012 for NANOG when he was /w Microsoft.
I love comments like these. What was the point of that?
Yes, Petr Lapukhov did present a BGP Design for the DC, and you apparently missed the fact one of my listed sources was the IETF Draft he collaborated on - if you even looked at my paper at all. Thanks though.
The point of this wasn't to say I created the design. It was to discuss the design in a practical manner, the whys, hows, etc. - and to teach myself and to share a little knowledge with others so they didn't go through the same pains I did.
Quote from: AspiringNetworker on July 05, 2015, 11:58:10 AM
Quote from: AnthonyC on July 04, 2015, 09:54:46 PM
Petr Lapukhov presented a BGP Design for Data Center back in 2012 for NANOG when he was /w Microsoft.
I love comments like these. What was the point of that?
Yes, Petr Lapukhov did present a BGP Design for the DC, and you apparently missed the fact one of my listed sources was the IETF Draft he collaborated on - if you even looked at my paper at all. Thanks though.
The point of this wasn't to say I created the design. It was to discuss the design in a practical manner, the whys, hows, etc. - and to teach myself and to share a little knowledge with others so they didn't go through the same pains I did.
Why so defensive? Others may find it interesting to read another paper on the topic. Also, I don't see a link to the paper and I'm not sure what your LinkedIn profile is.
Quote from: AnthonyC on July 06, 2015, 11:32:06 AM
Why so defensive? Others may find it interesting to read another paper on the topic. Also, I don't see a link to the paper and I'm not sure what your LinkedIn profile is.
Wow. My behavior was totally from being used to a certain type of behavior on other forums without even thinking about all of the factors involved. My apologies - you're absolutely correct. Heck, I just realized I have the pdf that you're talking about saved as part of my research. ;)
I'll PM you my LinkedIn profile if you're interested. Sorry again.
No problem; it is the Internet and I have pretty thick skin. :)
Heh - looks like the draft white paper is going to get a major overhaul. Found out some pretty interesting things I'm in the middle of testing right now. I was leaning toward eBGP between MLAG peers for a few convenience reasons, but now it looks like there's a large entity that uses the same AS at every single leaf. That's huge because then we can use BGP dynamic neighbors at the spine and never need to add another line of BGP config for each leaf switch that gets added... sha-weet!
Then your spine switches run as route reflectors?
Dynamic neighbors is pretty cool and I can see how that would fit in well in this design. As long as everything is cookie-cutter then life is good.
Quote from: that1guy15 on July 07, 2015, 02:17:24 PM
Then your spine switches run as route reflectors?
Dynamic neighbors is pretty cool and I can see how that would fit in well in this design. As long as everything is cookie-cutter then life is good.
No no - the spines all run in one AS of their own, separate from the leaf AS, with no connection between them - that gives you easy loop prevention (think SPINE1 advertises to LEAF1, which in turn advertises to SPINE2 - SPINE2 sees its own AS in the path and doesn't add the route). Just the leaf switches run with the same AS. Sounds weird, but get the "iBGP mesh" thought out of your head - this is outside-the-box kind of stuff. A new concept to me which I'm labbing up now to see how it works... but it's gotta work - the company that does this is huge.
EDIT - Re-reading this I see I did a crappy job of explaining. So let me put it this way.
Say you have a two-tier spine/leaf DC design. The spines both run AS 64600. The way things are playing out, I'm seeing three options at the leaf:
1. Run a different AS at every leaf, such as 65000, 65001, 65002, etc. This was my preferred way to do it, but that may be changing.
2. Run a different AS at each rack (one or two switches), such as 65000 in one rack, 65001 in another, etc.
3. Run the same AS at every leaf, such as 65000. If you're running two switches in a rack (MLAG, etc.), you iBGP peer them of course, and use allowas-in so a leaf accepts routes from other leaf switches even though its own AS is already in the AS_PATH. If this can be done, you can leverage dynamic neighbors at the spine and REALLY cut down on config.
For example, at a spine with just 3 leaf switches, I went from this:
BGPDC-SPINE2(config)#sh run sec router bgp
router bgp 64600
   router-id 192.168.254.2
   bgp log-neighbor-changes
   distance bgp 20 200 200
   maximum-paths 32 ecmp 32
   neighbor eBGP_GROUP peer-group
   neighbor eBGP_GROUP fall-over bfd
   neighbor eBGP_GROUP password 7 gxo9zOfCTHZihMXNwE0BXQ==
   neighbor eBGP_GROUP maximum-routes 12000
   neighbor 192.168.255.17 peer-group eBGP_GROUP
   neighbor 192.168.255.17 remote-as 65000
   neighbor 192.168.255.19 peer-group eBGP_GROUP
   neighbor 192.168.255.19 remote-as 65001
   neighbor 192.168.255.21 peer-group eBGP_GROUP
   neighbor 192.168.255.21 remote-as 65002
   network 192.168.254.2/32
   aggregate-address 192.168.255.16/28 summary-only
To this:
BGPDC-SPINE2(config-router-bgp)#sh active
router bgp 64600
   router-id 192.168.254.2
   bgp log-neighbor-changes
   distance bgp 20 200 200
   maximum-paths 32 ecmp 32
   bgp listen range 192.168.255.0/24 peer-group ARISTA remote-as 65000
   neighbor ARISTA peer-group
   neighbor ARISTA fall-over bfd
   neighbor ARISTA password 7 6x5GIQqJNWigZDc2QCgeMg==
   neighbor ARISTA maximum-routes 12000
   network 192.168.254.2/32
   aggregate-address 192.168.255.16/28 summary-only
And it works:
BGPDC-SPINE1(config)#sh ip bgp summ
BGP summary information for VRF default
Router identifier 192.168.254.1, local AS number 64600
  Neighbor         V  AS     MsgRcvd  MsgSent  InQ  OutQ  Up/Down   State  PfxRcd  PfxAcc
  192.168.255.1    4  65000       27       20    0     0  00:14:24  Estab       3       3
  192.168.255.3    4  65000       33       21    0     0  00:14:22  Estab       3       3
  192.168.255.5    4  65000        9        9    0     0  00:03:04  Estab       2       2
BGPDC-SPINE1(config)#sh ip route bgp
VRF name: default
Codes: C - connected, S - static, K - kernel,
       O - OSPF, IA - OSPF inter area, E1 - OSPF external type 1,
       E2 - OSPF external type 2, N1 - OSPF NSSA external type 1,
       N2 - OSPF NSSA external type2, B I - iBGP, B E - eBGP,
       R - RIP, I - ISIS, A B - BGP Aggregate, A O - OSPF Summary,
       NG - Nexthop Group Static Route
 B E    192.168.10.0/24 [20/0] via 192.168.255.1, Ethernet1
                               via 192.168.255.3, Ethernet2
 B E    192.168.20.0/24 [20/0] via 192.168.255.5, Ethernet3
 B E    192.168.254.3/32 [20/0] via 192.168.255.1, Ethernet1
                                via 192.168.255.3, Ethernet2
 B E    192.168.254.4/32 [20/0] via 192.168.255.1, Ethernet1
                                via 192.168.255.3, Ethernet2
 B E    192.168.254.5/32 [20/0] via 192.168.255.5, Ethernet3
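For anyone who wants to reason about the loop prevention without labbing it, here is a rough Python sketch of the AS_PATH check described above. It is a toy model only, not how EOS implements BGP: the function names `accepts` and `send` are made up for illustration, and the AS numbers (64600 for the spines, 65000 for every leaf) are the ones from this post.

```python
def accepts(local_as, as_path, allowas_in=0):
    """Toy eBGP loop check: reject a path that contains the local AS more
    than `allowas_in` times (0 models the default, i.e. no allowas-in)."""
    return as_path.count(local_as) <= allowas_in


def send(local_as, as_path):
    """An eBGP speaker prepends its own AS when advertising to a neighbor."""
    return [local_as] + as_path


SPINE_AS = 64600  # shared by every spine
LEAF_AS = 65000   # shared by every leaf (option 3)

# A route originated at SPINE1, bounced off LEAF1, arriving at SPINE2.
via_leaf = send(LEAF_AS, send(SPINE_AS, []))
print("SPINE2 accepts route via LEAF1:", accepts(SPINE_AS, via_leaf))

# A route originated at LEAF1, relayed by SPINE1, arriving at LEAF2.
via_spine = send(SPINE_AS, send(LEAF_AS, []))
print("LEAF2 accepts without allowas-in:", accepts(LEAF_AS, via_spine))
print("LEAF2 accepts with allowas-in 1:", accepts(LEAF_AS, via_spine, allowas_in=1))
```

The spine rejects anything that has already passed through a spine, and a leaf only accepts another leaf's routes once one occurrence of its own AS is tolerated, which is the role allowas-in plays in option 3.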
If that works it is pretty cool. I read your paper, and was thinking that using the bgp listen range command on the spines would have been more helpful than on the leafs, but didn't see how it would work. Using the same AS on all the leafs would make adding leafs easy.
-Otanx
Interesting.
The single ASN for all leafs is a smart idea. I need to chew through all of this more!
Yeah I completely agree that using bgp listen at the spines is way better, but unfortunately the command in its current form requires an AS be specified, and you can't specify multiple. You could specify multiple peer groups but that's not really practical as you'd have a ton of repeated config (think bfd, authentication, etc. for each peer group). So having a different AS at each leaf obviously caused a problem there. Using the same AS at every leaf makes it so that you can use bgp listen at the spine instead and as shown earlier drastically reduce config.
Some folks are making an argument though that they want to be able to trace routes back to a leaf using AS_PATH, which obviously won't be too easy if every leaf has the same AS (though I would think the server subnets below it would help with that, but whatever). So in that situation, as a happy medium, it looks like another option would be to use a different AS for each rack (single leaf or dual-leaf via MLAG). That destroys bgp listen at the spine though, BUT there may be a fix coming down the road to allow the use of the command with multiple ASNs - which would help alleviate that.
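To make the single-AS limitation concrete, here is a small Python sketch of the matching rule as described above: a listen range is tied to exactly one remote AS, so a peer is only accepted if its address falls in the range and it announces that AS. This is a toy model under that assumption, not EOS internals, and `accept_dynamic_peer` is a made-up name. The range and AS number are taken from the spine config earlier in the thread.

```python
import ipaddress

# Values from the "bgp listen range" line shown earlier in the thread.
LISTEN_RANGE = ipaddress.ip_network("192.168.255.0/24")
EXPECTED_AS = 65000


def accept_dynamic_peer(peer_ip, peer_as):
    """Accept a dynamic neighbor only if its source address is inside the
    listen range AND it announces the single AS configured for that range."""
    in_range = ipaddress.ip_address(peer_ip) in LISTEN_RANGE
    return in_range and peer_as == EXPECTED_AS


# Same AS at every leaf: any new leaf in the range comes up with no new config.
print(accept_dynamic_peer("192.168.255.1", 65000))
print(accept_dynamic_peer("192.168.255.7", 65000))

# Different AS per leaf or per rack: the address matches, the AS does not.
print(accept_dynamic_peer("192.168.255.3", 65001))
```

With one AS per leaf or per rack, every additional AS would need its own range and peer group, which is the repeated config mentioned above.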
Could you have each leaf prepend a different AS? So each leaf would be the same AS, but prepend a different AS? So something like
Leaf 1
router bgp 65001
 neighbor 10.1.0.2 remote-as 65200
 neighbor 10.1.0.2 route-map prepend out
!
route-map prepend permit 10
 set as-path prepend 65042
Leaf 2
router bgp 65001
 neighbor 10.1.0.2 remote-as 65200
 neighbor 10.1.0.2 route-map prepend out
!
route-map prepend permit 10
 set as-path prepend 65043
That is just one line of difference for each leaf which isn't bad.
-Otanx
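A quick Python sketch of what the AS_PATH would look like with this idea, in case it helps. Assumptions: the outbound route-map prepend is applied before the router adds its own AS (so the local AS ends up leftmost), the spine is AS 65200 as in the config above, and the helper names (`send`, `path_at_other_leaf`, `origin_of`) and the `LEAF_TAGS` table are made up for illustration.

```python
LEAF_AS = 65001    # same on every leaf, from the config above
SPINE_AS = 65200   # the remote-as in the config above
LEAF_TAGS = {"leaf1": 65042, "leaf2": 65043}  # per-leaf prepended ASN


def send(local_as, as_path, prepend=()):
    """Path as seen by the eBGP neighbor: local AS, then any route-map
    prepends, then whatever was already in the path."""
    return [local_as, *prepend, *as_path]


def path_at_other_leaf(origin_leaf):
    """AS_PATH of a leaf-originated route once the spine relays it."""
    at_spine = send(LEAF_AS, [], prepend=[LEAF_TAGS[origin_leaf]])
    return send(SPINE_AS, at_spine)


def origin_of(as_path):
    """Map the rightmost ASN back to the leaf that prepended it."""
    for leaf, tag in LEAF_TAGS.items():
        if as_path[-1] == tag:
            return leaf
    return None


path = path_at_other_leaf("leaf1")
print(path, "->", origin_of(path))
# The shared leaf AS still appears once, so allowas-in is still needed.
print("occurrences of leaf AS:", path.count(LEAF_AS))
```

In this model the rightmost ASN identifies the originating leaf, while the spine still sees a single remote AS for every leaf.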
Quote from: Otanx on July 08, 2015, 09:07:44 AM
Could you have each leaf prepend a different AS? So each leaf would be the same AS, but prepend a different AS? So something like
Leaf 1
router bgp 65001
 neighbor 10.1.0.2 remote-as 65200
 neighbor 10.1.0.2 route-map prepend out
!
route-map prepend permit 10
 set as-path prepend 65042
Leaf 2
router bgp 65001
 neighbor 10.1.0.2 remote-as 65200
 neighbor 10.1.0.2 route-map prepend out
!
route-map prepend permit 10
 set as-path prepend 65043
That is just one line of difference for each leaf which isn't bad.
-Otanx
Hmmm. Maybe? I don't see why not... allowas-in should still work I think. Eats up more ASNs which some people don't like and it's a little clunky but may get the job done. I'll play with it.
If you want to trace to a leaf with AS_PATH then this gives you the simplified config, and only uses one more ASN than you would have anyway. If you don't care about tracing using AS_PATH then don't use the prepend.
-Otanx
Quote from: Otanx on July 08, 2015, 01:04:55 PM
If you want to trace to a leaf with AS_PATH then this gives you the simplified config, and only uses one more ASN than you would have anyway. If you don't care about tracing using AS_PATH then don't use the prepend.
-Otanx
Oh, I was in no way knocking it. Completely agree.
Right, I am just suggesting you can make it an optional part of the deployment. If you want tracing, add these two lines. If you don't, leave them out.
-Otanx
Eeesh. So my original paper is based off using a different AS at every leaf - even those that are MLAG'd together.
Just realized something... let's say I have a spine switch (SPINE1) connected to two leaf switches (LEAF1, LEAF2). Both of the leaf switches are MLAG'd together and connected to the same server, advertising its subnet to the spine.
If the two leaf switches are in different ASes, does that create a routing loop? I'm thinking LEAF1 advertises the server subnet to SPINE1, and SPINE1 advertises it down to LEAF2 - will LEAF2 accept and add the route? Need to test this... I feel like this is a simple networking 101 concept I'm forgetting about... but if it's true, I don't even see that as a viable design option - to address it would require filtering, and why do that when you can just put the two leaf switches in the same AS and avoid the problem.
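A rough way to think through the question above, as a Python sketch. It models only the AS_PATH loop check and nothing else: whether LEAF2 would actually install the route also depends on route preference (a directly connected route normally wins over an eBGP-learned one), which this does not model. The function names are hypothetical, and 65000/65001 are the example leaf ASNs from option 1 earlier in the thread.

```python
SPINE_AS = 64600


def accepts(local_as, as_path, allowas_in=0):
    """Toy eBGP loop check: accept unless the local AS already appears in
    the path more than `allowas_in` times."""
    return as_path.count(local_as) <= allowas_in


def path_from_leaf1_at_leaf2(leaf1_as):
    """LEAF1 -> SPINE1 -> LEAF2: each eBGP hop prepends its own AS."""
    return [SPINE_AS, leaf1_as]


# Different AS on each MLAG peer (LEAF1 = 65000, LEAF2 = 65001).
different = accepts(65001, path_from_leaf1_at_leaf2(65000))
print("different ASes, LEAF2 passes the AS_PATH check:", different)

# Same AS on both MLAG peers (65000), no allowas-in configured.
same = accepts(65000, path_from_leaf1_at_leaf2(65000))
print("same AS, LEAF2 passes the AS_PATH check:", same)
```

So with different ASes the AS_PATH check alone will not stop LEAF2 from learning its MLAG peer's advertisement back through the spine, whereas with a shared AS the default loop check drops it unless allowas-in is configured.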
Yeah... paper's getting a major overhaul. Will let you guys know once the updates are finished if you're interested.
Well, I think I'm about done with the paper. It's had a major overhaul and I've finally been talked out of running eBGP between MLAG'd switches - mostly for ease of automation. You can find the updated paper on my LinkedIn profile - it will not be officially published as pieces of it will be used in a more holistic design doc to be released at a future date.
Gimme the link for EOS Central, I couldn't find it.
Go to arista.com and at the bottom left of the web page there is a link for Software Downloads. Click that and create a guest account (or link to your account if you have one) and then log in. Then you can download it. Once Software Downloads is selected after logging in, it's at the bottom under all the notification boxes; looks like there are ISOs and VMDKs. The 4.15.0F vmdk is about 450MB.
Quote from: burnyd on August 08, 2015, 10:32:22 AM
Gimme the link for eos central I couldnt find it.
I'm assuming that's what you wanted by the way :) Either way it helped me.
It's not going to be officially published. It will be cannibalized into a larger design doc coming out at a future date being written by a separate team. I'm sharing it with you fine folks since I love you so much. ;) (And I trust your feedback)
It's on my LinkedIn profile.