Leaf & Spine Architectures

Started by routerdork, October 08, 2015, 09:01:13 AM

burnyd

Well, to be fair, I don't think people who work with big, booming, expensive SANs and arrays will be employed much longer unless they figure something out.  That field is changing dramatically.

wintermute000

not to mention that their secret sauce (FC) is basically a bunch of ethernet type concepts with different names to confuse the uninitiated.


But yeah, with VSAN, Nutanix etc. there is definitely a massive shift away from the big, bad, break-the-bank SAN / shared storage architecture that has dominated the last 20 years. I have a feeling Dell might have paid at the top of the market. At the other end of the scale - I was at a small customer the other day and he was showing me this crazy local-storage-like solution (looked like SAS cables) but simultaneously cabled - and accessible concurrently - to all 3 of his hosts. He swore it was perfect and acted just like shared storage as far as he was concerned, but it cost the same as an external DAS, not a NAS or a SAN, and he didn't need to buy 10G switches/NIC cards/SFPs. The only thing 'wrong' with it was that he reckoned you have to manually remount it if a host reboots/fails, but in a small environment where you know that fact it's not such a big deal.

NetworkGroover

Quote from: routerdork on October 08, 2015, 02:16:07 PM
Quote from: AspiringNetworker on October 08, 2015, 11:10:32 AM
If it helps at all, I wrote about this subject based on Petr Lapukhov's work:

http://aspiringnetworker.blogspot.com/2015/08/bgp-in-arista-data-center_90.html
So question on this. Very informative by the way. Using the same AS on each Leaf was for the benefit of the configuration template? Doesn't seem that efficient to me to make it one thing and then prepend it with another. If you are prepending it, you are using it. Or maybe I missed something else in the concept?

So, necroing here but discovered something.  I liked the idea of prepending for route source tracing, like was mentioned by another member, but as you mentioned - "If you are prepending it, you are using it" - and no, you didn't miss something - I did.

I was doing some testing with Ansible and got side-tracked with noticing this:

Here's the bgp table on one of my spines:

Every 2s: sh ip bgp                                       Dec 19, 2015 07:13:47
BGP routing table information for VRF default
Router identifier 192.168.254.1, local AS number 64600
Route status codes: s - suppressed, * - valid, > - active, E - ECMP head, e - EC
                    S - Stale, c - Contributing to ECMP, b - backup
Origin codes: i - IGP, e - EGP, ? - incomplete
AS Path Attributes: Or-ID - Originator ID, C-LST - Cluster List, LL Nexthop - Li

       Network             Next Hop         Metric  LocPref Weight Path
* >   192.168.254.1/32    -                1       0       -       i
* >Ec 192.168.254.3/32    192.168.255.1    0       100     0       65000 i
*  ec 192.168.254.3/32    192.168.255.3    0       100     0       65000 i
* >Ec 192.168.254.4/32    192.168.255.3    0       100     0       65000 i
*  ec 192.168.254.4/32    192.168.255.1    0       100     0       65000 i
* >   192.168.254.5/32    192.168.255.5    0       100     0       65001 i


This is a VXLAN environment where I'm not advertising the host subnets - only the loopbacks which are the VXLAN tunnel endpoint addresses.  Now here's the table after I push route-map config via Ansible to prepend an ASN:

Every 2s: sh ip bgp                                       Dec 19, 2015 07:17:21
BGP routing table information for VRF default
Router identifier 192.168.254.1, local AS number 64600
Route status codes: s - suppressed, * - valid, > - active, E - ECMP head, e - EC
                    S - Stale, c - Contributing to ECMP, b - backup
Origin codes: i - IGP, e - EGP, ? - incomplete
AS Path Attributes: Or-ID - Originator ID, C-LST - Cluster List, LL Nexthop - Li

       Network             Next Hop         Metric  LocPref Weight Path
* >   192.168.254.1/32    -                1       0       -       i
* >   192.168.254.3/32    192.168.255.1    0       100     0       65000 70001
*     192.168.254.3/32    192.168.255.3    0       100     0       65000 70002
* >   192.168.254.4/32    192.168.255.3    0       100     0       65000 70002
*     192.168.254.4/32    192.168.255.1    0       100     0       65000 70001
* >   192.168.254.5/32    192.168.255.5    0       100     0       65001 70003


Now I see what ToRs are responsible for what routes (70001 is LEAF1, 70002 is LEAF2, etc), but ECMP is gone.  Now, THAT said, this kinda showed me another interesting tidbit.... if you look at the old bgp table again:

Every 2s: sh ip bgp                                       Dec 19, 2015 07:13:47
BGP routing table information for VRF default
Router identifier 192.168.254.1, local AS number 64600
Route status codes: s - suppressed, * - valid, > - active, E - ECMP head, e - EC
                    S - Stale, c - Contributing to ECMP, b - backup
Origin codes: i - IGP, e - EGP, ? - incomplete
AS Path Attributes: Or-ID - Originator ID, C-LST - Cluster List, LL Nexthop - Li

       Network             Next Hop         Metric  LocPref Weight Path
* >   192.168.254.1/32    -                1       0       -       i
* >Ec 192.168.254.3/32    192.168.255.1    0       100     0       65000 i
*  ec 192.168.254.3/32    192.168.255.3    0       100     0       65000 i
* >Ec 192.168.254.4/32    192.168.255.3    0       100     0       65000 i
*  ec 192.168.254.4/32    192.168.255.1    0       100     0       65000 i
* >   192.168.254.5/32    192.168.255.5    0       100     0       65001 i


If we think about this in a VXLAN scenario... so traffic hits a leaf switch, gets VXLAN-encapsulated, and is destined for 192.168.254.4 for example.  There's two paths to that IP, because I have two leaf switches that are MLAG-peered in the same AS (iBGP peering). So, looking at the route on this spine for that IP:

BGPDC-SPINE1(config)#sh ip route 192.168.254.4

VRF name: default
Codes: C - connected, S - static, K - kernel,
       O - OSPF, IA - OSPF inter area, E1 - OSPF external type 1,
       E2 - OSPF external type 2, N1 - OSPF NSSA external type 1,
       N2 - OSPF NSSA external type2, B I - iBGP, B E - eBGP,
       R - RIP, I - ISIS, A B - BGP Aggregate, A O - OSPF Summary,
       NG - Nexthop Group Static Route

B E    192.168.254.4/32 [200/0] via 192.168.255.1, Ethernet1
                                 via 192.168.255.3, Ethernet2


So technically.. even though that IP is actually one hop away... because I have both leaf switches in the same AS, some traffic will be hashed to the other leaf switch and will have to cross the peer link in order to hit the VTEP IP... no big deal - iBGP will handle it.. but not the MOST optimal path to take. Prepending the AS for route source tracing actually also has a side effect of addressing this.  Now my route to 192.168.254.4 points to the leaf switch that it resides on.
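To make the hashing point concrete, here is a minimal sketch of how a spine with two equal-cost next hops might spread VXLAN flows. It is only an illustration: `pick_next_hop` and `peer_link_crossings` are hypothetical names, and SHA-256 over the outer header fields stands in for whatever hash the real hardware uses. The addresses are the ones from the tables above, with 192.168.255.3 taken as the leaf that owns 192.168.254.4.

```python
import hashlib
from typing import Sequence, Tuple


def pick_next_hop(flow: Tuple, next_hops: Sequence[str]) -> str:
    """Pick one ECMP next hop by hashing the flow's header fields.

    Assumption: a software hash standing in for the switch's hardware hash.
    """
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return next_hops[int.from_bytes(digest[:4], "big") % len(next_hops)]


def peer_link_crossings(next_hops: Sequence[str], owner: str, flows: int = 1000) -> int:
    """Count flows that land on a leaf other than the one owning the VTEP."""
    crossings = 0
    for i in range(flows):
        # Outer VXLAN header: LEAF3 VTEP -> LEAF2 VTEP, varying UDP source port,
        # destination port 4789 (the standard VXLAN port).
        flow = ("192.168.254.5", "192.168.254.4", 49152 + i, 4789)
        if pick_next_hop(flow, next_hops) != owner:
            crossings += 1
    return crossings


# Two equal-cost next hops: some flows reach the wrong leaf and cross the peer link.
both = peer_link_crossings(["192.168.255.1", "192.168.255.3"], owner="192.168.255.3")
# Single next hop (the state after prepending): nothing crosses the peer link.
only_owner = peer_link_crossings(["192.168.255.3"], owner="192.168.255.3")
print(both, only_owner)
```

With a reasonably uniform hash, roughly half of the flows would be expected to take the extra hop over the peer link in the two-next-hop case.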

Then I wondered, what's the point?  Aren't I effectively turning these leaves into discrete AS's by doing that?  Well, sorta, but I do maintain the benefit of making the leaf switches an iBGP peering, which in turn enables me to leverage dynamic BGP peering at the spine without it getting too messy:

BGPDC-SPINE1(config)#sh run sec router bgp
router bgp 64600
   router-id 192.168.254.1
   maximum-paths 32 ecmp 32
   bgp listen range 192.168.255.0/30 peer-group ARISTA remote-as 65000
   bgp listen range 192.168.255.4/31 peer-group ARISTA remote-as 65001
   neighbor ARISTA peer-group
   neighbor ARISTA maximum-routes 12000
   network 192.168.254.1/32


Using eBGP between peered leaf switches, this would be messy because I'd need a listen range statement for every single leaf switch - at least until they add the ability to specify an AS range in the command.
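As a rough illustration of what the two listen range statements do, the sketch below matches an incoming peer address against the configured ranges to find the expected remote AS. It is not how EOS implements it; `LISTEN_RANGES` and `remote_as_for` are hypothetical names, and only the prefixes and AS numbers come from the config above.

```python
import ipaddress
from typing import Optional

# Ranges and AS numbers copied from the spine config above.
LISTEN_RANGES = [
    (ipaddress.ip_network("192.168.255.0/30"), 65000),  # LEAF1 + LEAF2 (MLAG pair)
    (ipaddress.ip_network("192.168.255.4/31"), 65001),  # LEAF3 (standalone)
]


def remote_as_for(peer_ip: str) -> Optional[int]:
    """Return the remote AS expected for a peer, or None if no range matches."""
    addr = ipaddress.ip_address(peer_ip)
    for network, remote_as in LISTEN_RANGES:
        if addr in network:
            return remote_as
    return None


for peer in ("192.168.255.1", "192.168.255.3", "192.168.255.5"):
    print(peer, remote_as_for(peer))
```

Because both leaves of the MLAG pair share AS 65000, one range covers the pair. With a separate AS per leaf, `LISTEN_RANGES` would need one entry per leaf, which is the messiness described above.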
Engineer by day, DJ by night, family first always

wintermute000

"because I have both leaf switches in the same AS, some traffic will be hashed to the other leaf switch and will have to cross the peer link in order to hit the VTEP IP"

Sorry, confused. Do you mean the spine switch? I.e., ECMP, so going via a different spine?
Not sure what you mean by hashed to the other leaf switch?

Are the 192.168.1.3, 5 spines or leaves?

NetworkGroover

We're looking at traffic hitting a spine switch, then being ECMP-routed to two different leaf switches.  Just pretend it's a two-tier leaf/spine with 2 spine switches and 3 leaf switches.  Leaf1 and Leaf2 are MLAG and iBGP peers.  Leaf3 is standalone.

.3, .4, .5 are leaf switch loopback0 - Leaf1, Leaf2, and Leaf3, respectively.  These are used as the VTEP addresses.  In my hypothetical scenario, pretend VXLAN traffic has left LEAF3 and is being sent to LEAF2 - it has to hit the Spine first and then the spine has to make a routing decision.

.1 is Spine1

Though I will say, this is probably where virtual VTEP comes in... will need to add that and look at it....

NetworkGroover

Yep... with virtual VTEP it looks better.

So instead of using Loopback0 on Leaf1 and 2 (Within the same "rack") I created Loopback1 on both leaves and gave them the same address (2.2.2.1) and advertised that instead.  Now the spine looks cleaner:

BGPDC-SPINE1(config)#sh ip bgp
BGP routing table information for VRF default
Router identifier 192.168.254.1, local AS number 64600
Route status codes: s - suppressed, * - valid, > - active, E - ECMP head, e - ECMP
                    S - Stale, c - Contributing to ECMP, b - backup
Origin codes: i - IGP, e - EGP, ? - incomplete
AS Path Attributes: Or-ID - Originator ID, C-LST - Cluster List, LL Nexthop - Link Local Nexthop

       Network             Next Hop         Metric  LocPref Weight Path
* >Ec 2.2.2.1/32          192.168.255.1    0       100     0       65000 70001 i
*  ec 2.2.2.1/32          192.168.255.3    0       100     0       65000 70002 i
* >   2.2.2.2/32          192.168.255.5    0       100     0       65001 70003 i
* >   192.168.254.1/32    -                1       0       -       i
* >   192.168.254.3/32    192.168.255.1    0       100     0       65000 70001 i
*     192.168.254.3/32    192.168.255.3    0       100     0       65000 70002 70001 i
* >   192.168.254.4/32    192.168.255.3    0       100     0       65000 70002 i
*     192.168.254.4/32    192.168.255.1    0       100     0       65000 70001 70002 i
* >   192.168.254.5/32    192.168.255.5    0       100     0       65001 70003 i


Now you have ECMP, source route tracing, and the suboptimal path is eliminated.  It won't matter which leaf the spine sends it to (which leaf of the pair that is, of course)
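The flags in that table can be reproduced with a deliberately simplified rule: among the paths for a prefix, keep every next hop whose AS_PATH is as short as the shortest one. This is only a sketch under that assumption. Real BGP best-path selection has more steps, and whether equal-length paths with different contents may share load depends on the platform and its multipath settings. `ROUTES` and `ecmp_next_hops` are hypothetical names; the rows are copied from the table above, where LocPref and metric are identical for all of them.

```python
from collections import defaultdict

# (prefix, next hop, AS_PATH) rows copied from the spine table above.
ROUTES = [
    ("2.2.2.1/32", "192.168.255.1", [65000, 70001]),
    ("2.2.2.1/32", "192.168.255.3", [65000, 70002]),
    ("192.168.254.3/32", "192.168.255.1", [65000, 70001]),
    ("192.168.254.3/32", "192.168.255.3", [65000, 70002, 70001]),
    ("192.168.254.4/32", "192.168.255.3", [65000, 70002]),
    ("192.168.254.4/32", "192.168.255.1", [65000, 70001, 70002]),
]


def ecmp_next_hops(routes):
    """Group next hops per prefix, keeping only those with the shortest AS_PATH."""
    by_prefix = defaultdict(list)
    for prefix, next_hop, as_path in routes:
        by_prefix[prefix].append((len(as_path), next_hop))
    result = {}
    for prefix, candidates in by_prefix.items():
        shortest = min(length for length, _ in candidates)
        result[prefix] = [nh for length, nh in candidates if length == shortest]
    return result


for prefix, hops in ecmp_next_hops(ROUTES).items():
    print(prefix, hops)
```

Under this rule the shared virtual VTEP address 2.2.2.1 keeps both next hops, while each per-leaf loopback keeps only the leaf it lives on, which matches the Ec/ec markers in the output.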

burnyd

I don't get it. Why are you prepending in a leaf/spine network?

NetworkGroover

The idea was route source tracing: just by looking at the AS_PATH, you'll know which ToR a route belongs to if you have pairs of them in the same AS, or if you're using the same AS for all of them.
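A minimal sketch of that lookup, assuming the prepended ASN always ends up as the last ASN in the AS_PATH seen at the spine. `TOR_BY_PREPENDED_ASN` and `originating_tor` are hypothetical names; the ASN-to-leaf mapping is the one used earlier in the thread (70001 is LEAF1, 70002 is LEAF2, 70003 is LEAF3).

```python
from typing import Optional

# Mapping used earlier in the thread: each ToR prepends its own tracing ASN.
TOR_BY_PREPENDED_ASN = {70001: "LEAF1", 70002: "LEAF2", 70003: "LEAF3"}


def originating_tor(as_path: str) -> Optional[str]:
    """Return the ToR that originated a route, judged by the last ASN in the path.

    The trailing origin code (for example "i") is ignored.
    """
    asns = [int(tok) for tok in as_path.split() if tok.isdigit()]
    if not asns:
        return None  # locally originated route: the path is only an origin code
    return TOR_BY_PREPENDED_ASN.get(asns[-1])


print(originating_tor("65000 70002 70001 i"))  # learned via LEAF2, originated by LEAF1
print(originating_tor("65001 70003 i"))
```

A path without a prepended ASN, such as `65000 i`, returns `None` here because 65000 is shared by the MLAG pair and does not identify a single ToR.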

matgar

Quote from: AspiringNetworker on December 21, 2015, 05:53:20 PM
We're looking at traffic hitting a spine switch, then being ECMP-routed to two different leaf switches.  Just pretend it's a two-tier leaf/spine with 2 spine switches and 3 leaf switches.  Leaf1 and Leaf2 are MLAG and iBGP peers.  Leaf3 is standalone.

.3, .4, .5 are leaf switch loopback0 - Leaf1, Leaf2, and Leaf3, respectively.  These are used as the VTEP addresses.  In my hypothetical scenario, pretend VXLAN traffic has left LEAF3 and is being sent to LEAF2 - it has to hit the Spine first and then the spine has to make a routing decision.

.1 is Spine1

Though I will say, this is probably where virtual VTEP comes in... will need to add that and look at it....

So first of all, I have never fooled around with a spine/leaf setup, so it's very possible I'm missing something obvious here.
With that said, I have a problem with your scenario.

Why would the ip of a VTEP exist in 2 different leafs?

Also, ECMP in a spine/leaf setup is, as far as I know, supposed to happen at the leaf.
I.e., LEAF3 has 2 paths to LEAF2, via either SPINE1 or SPINE2.

NetworkGroover

Quote from: matgar
Why would the ip of a VTEP exist in 2 different leafs?

Both leafs are effectively acting as a single logical unit - think of them as two ToRs in the same rack.  To get to the same resources (servers, etc.), you could go to either leaf - why treat them as two discrete entities when they both do the same job?

Quote from: matgar
Also, ECMP in a spine/leaf setup is, as far as I know, supposed to happen at the leaf.
I.e., LEAF3 has 2 paths to LEAF2, via either SPINE1 or SPINE2.

Doesn't ECMP happen anywhere there is more than one route with equal costs to the same destination?  So if the spine needed to reach a host, and there were two equal-cost paths to reach it via LEAF1 or LEAF2, is that not ECMP?  I'm not being snarky here I'm seriously asking - I've been known for being dumb before and if I'm misunderstanding the definition of ECMP I'd like to alleviate that.

EDIT - Oh, and welcome to the forum.

burnyd

Quote from: AspiringNetworker on December 22, 2015, 08:29:08 AM
The idea was route source tracing: just by looking at the AS_PATH, you'll know which ToR a route belongs to if you have pairs of them in the same AS, or if you're using the same AS for all of them.

I see.  Generally you have enough bandwidth that if traffic lands on ToR 1 for ToR 2's loopback / VTEP, it's not an issue.  Haha, if it's an issue, go ahead and add another spine switch.  Also, go to your internal thing and look at the BGP unnumbered request I put in.  The one that allows BGP over IPv6 link-local IPs and overlays IPv4.  Right now I want to use 169.254.x.x IPs and keep reusing the IPs everywhere, but it's more of a political thing internally.

NetworkGroover

Quote from: burnyd on December 23, 2015, 10:05:13 AM
Quote from: AspiringNetworker on December 22, 2015, 08:29:08 AM
The idea was route source tracing: just by looking at the AS_PATH, you'll know which ToR a route belongs to if you have pairs of them in the same AS, or if you're using the same AS for all of them.

I see.  Generally you have enough bandwidth that if traffic lands on ToR 1 for ToR 2's loopback / VTEP, it's not an issue.  Haha, if it's an issue, go ahead and add another spine switch.  Also, go to your internal thing and look at the BGP unnumbered request I put in.  The one that allows BGP over IPv6 link-local IPs and overlays IPv4.  Right now I want to use 169.254.x.x IPs and keep reusing the IPs everywhere, but it's more of a political thing internally.

It's not really to address any issue - just something you could do if you wanted to.  I personally don't see a huge benefit, and in this case with VXLAN it's almost completely useless since the only thing I'm advertising is the loopback addresses which already identify each ToR.  If it wasn't a 100% VXLAN environment and I was advertising host subnets I could somewhat see the benefit, but still it's optional.

that1guy15

Quote from: burnyd on December 23, 2015, 10:05:13 AM
Quote from: AspiringNetworker on December 22, 2015, 08:29:08 AM
The idea was route source tracing: just by looking at the AS_PATH, you'll know which ToR a route belongs to if you have pairs of them in the same AS, or if you're using the same AS for all of them.

I see.  Generally you have enough bandwidth that if traffic lands on ToR 1 for ToR 2's loopback / VTEP, it's not an issue.  Haha, if it's an issue, go ahead and add another spine switch.  Also, go to your internal thing and look at the BGP unnumbered request I put in.  The one that allows BGP over IPv6 link-local IPs and overlays IPv4.  Right now I want to use 169.254.x.x IPs and keep reusing the IPs everywhere, but it's more of a political thing internally.

Cases like this are the reason IPv6 link-local is the way to go. Just need more adoption of IPv6... or more flexibility with MPBGP using IPv6 AFI to carry IPv4 prefixes.
That1guy15
@that1guy_15
blog.movingonesandzeros.net

NetworkGroover

Quote from: that1guy15 on December 23, 2015, 10:43:58 AM
Quote from: burnyd on December 23, 2015, 10:05:13 AM
Quote from: AspiringNetworker on December 22, 2015, 08:29:08 AM
The idea was route source tracing: just by looking at the AS_PATH, you'll know which ToR a route belongs to if you have pairs of them in the same AS, or if you're using the same AS for all of them.

I see.  Generally you have enough bandwidth that if traffic lands on ToR 1 for ToR 2's loopback / VTEP, it's not an issue.  Haha, if it's an issue, go ahead and add another spine switch.  Also, go to your internal thing and look at the BGP unnumbered request I put in.  The one that allows BGP over IPv6 link-local IPs and overlays IPv4.  Right now I want to use 169.254.x.x IPs and keep reusing the IPs everywhere, but it's more of a political thing internally.

Cases like this are the reason IPv6 link-local is the way to go. Just need more adoption of IPv6... or more flexibility with MPBGP using IPv6 AFI to carry IPv4 prefixes.

Not even following what you guys are talking about.... lol.  What's the goal? Why so complex?  What's the  driver to do whatever it is you're describing that current methods can't address? Just curious at this point.

burnyd

Quote from: AspiringNetworker on December 23, 2015, 12:23:08 PM
Quote from: that1guy15 on December 23, 2015, 10:43:58 AM
Quote from: burnyd on December 23, 2015, 10:05:13 AM
Quote from: AspiringNetworker on December 22, 2015, 08:29:08 AM
The idea was route source tracing: just by looking at the AS_PATH, you'll know which ToR a route belongs to if you have pairs of them in the same AS, or if you're using the same AS for all of them.

I see.  Generally you have enough bandwidth that if traffic lands on ToR 1 for ToR 2's loopback / VTEP, it's not an issue.  Haha, if it's an issue, go ahead and add another spine switch.  Also, go to your internal thing and look at the BGP unnumbered request I put in.  The one that allows BGP over IPv6 link-local IPs and overlays IPv4.  Right now I want to use 169.254.x.x IPs and keep reusing the IPs everywhere, but it's more of a political thing internally.

Cases like this are the reason IPv6 link-local is the way to go. Just need more adoption of IPv6... or more flexibility with MPBGP using IPv6 AFI to carry IPv4 prefixes.

Not even following what you guys are talking about.... lol.  What's the goal? Why so complex?  What's the  driver to do whatever it is you're describing that current methods can't address? Just curious at this point.


qft!!!!!