Leaf & Spine Architectures

Started by routerdork, October 08, 2015, 09:01:13 AM


NetworkGroover

Quote from: burnyd on October 12, 2015, 11:38:15 AM
Quote from: ristau5741 on October 09, 2015, 07:52:12 AM
I think the greatest thing about the leaf/spine architecture is that it can scale well above the five 9's. Depending on hardware, network architecture, and cash in pocket, the more spines one has, with servers multi-homed to different leafs (leaves?), the more redundancy you get, to the point where one can scale to 10 9's, 20 9's or more, where more than half the network can be down without affecting services.

I'm not going to say it's unlimited scale, but it's close to it. After you fill up the ports on the spine you have to do a 3D model like Facebook did, but that's a ton of servers!

I posted this yesterday ironically.

https://danielhertzberg.wordpress.com/2015/10/11/333/

Cool! ;)
Engineer by day, DJ by night, family first always
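
As a rough sanity check on the "more spines, more nines" claim quoted above, here's a quick Python sketch. The numbers are hypothetical: it assumes independent spine failures and a 99.9%-available spine, and it only models a total spine-layer blackout, not partial capacity loss, leaf failures, or correlated outages.

```python
import math

# Back-of-the-envelope check on the "more spines, more nines" idea.
# Assumptions (all hypothetical): spine failures are independent and each
# spine is individually 99.9% available.  This only models the chance that
# EVERY spine is down at once -- it says nothing about lost capacity,
# leaf failures, or correlated outages (power, bad config push, etc.).

def all_spines_down(n_spines: int, per_spine_availability: float) -> float:
    """Probability that every spine in the layer is down simultaneously."""
    return (1.0 - per_spine_availability) ** n_spines

for n in (2, 4, 16, 64):
    p = all_spines_down(n, 0.999)
    nines = -math.log10(p)  # each extra "three 9s" spine adds roughly 3 more nines
    print(f"{n:>2} spines -> spine layer is roughly {nines:.0f} nines available")
```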

burnyd

Quote from: AspiringNetworker on October 12, 2015, 02:09:02 PM
Quote from: ristau5741 on October 09, 2015, 07:52:12 AM
I think the greatest thing about the leaf/spine architecture is that it can scale well above the five 9's. Depending on hardware, network architecture, and cash in pocket, the more spines one has, with servers multi-homed to different leafs (leaves?), the more redundancy you get, to the point where one can scale to 10 9's, 20 9's or more, where more than half the network can be down without affecting services.

Yep... just imagine a 64-way ECMP design.

"One of our spine switches went down.  Wanna go handle it?"

"Naahhh.. I'll deal with it next week.  We only lost 1/64th of our bandwidth."

Haha, I can't even imagine the sheer number of leaf switches, and the number of ports, that would require. I read that in some of the 7500 documentation and it's like, holy crap, who in the world would need that?
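
Just to put numbers on how big a 64-way ECMP fabric gets, here's a rough Python sketch. All the port counts are made up (64 chassis spines with 576 ports each and leaves split 64 up / 64 down), so treat it as an illustration, not a reference design.

```python
# Rough port math for a hypothetical 64-way ECMP leaf/spine fabric.
# These numbers are invented for illustration: 64 chassis spines with 576
# ports each (7500-class), and leaves with 64 uplinks + 64 server ports.

N_SPINES = 64
SPINE_PORTS = 576        # ports per spine chassis (hypothetical)
LEAF_UPLINKS = 64        # one uplink per spine -> 64-way ECMP
LEAF_SERVER_PORTS = 64   # server-facing leaf ports (1:1 oversubscription)

assert LEAF_UPLINKS == N_SPINES           # every leaf connects to every spine

max_leaves = SPINE_PORTS                  # each leaf consumes one port on every spine
max_servers = max_leaves * LEAF_SERVER_PORTS
spine_leaf_links = max_leaves * N_SPINES  # cabling between the two tiers

print(f"leaves:             {max_leaves}")
print(f"servers (at 1:1):   {max_servers}")
print(f"spine<->leaf links: {spine_leaf_links}")
print(f"capacity lost per dead spine: 1/{N_SPINES} = {100 / N_SPINES:.2f}%")
```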


NetworkGroover

Quote from: burnyd on October 12, 2015, 03:01:14 PM
Quote from: AspiringNetworker on October 12, 2015, 02:09:02 PM
Quote from: ristau5741 on October 09, 2015, 07:52:12 AM
I think the greatest thing about the leaf/spine architecture is that it can scale well above the five 9's. Depending on hardware, network architecture, and cash in pocket, the more spines one has, with servers multi-homed to different leafs (leaves?), the more redundancy you get, to the point where one can scale to 10 9's, 20 9's or more, where more than half the network can be down without affecting services.

Yep... just imagine a 64-way ECMP design.

"One of our spine switches went down.  Wanna go handle it?"

"Naahhh.. I'll deal with it next week.  We only lost 1/64th of our bandwidth."

Haha, I can't even imagine the sheer number of leaf switches, and the number of ports, that would require. I read that in some of the 7500 documentation and it's like, holy crap, who in the world would need that?

Right? Not any of the guys I've ever worked with... I can assure you of that. The largest I've seen so far is only 4-way. I don't get to work with the big boys, though.
Engineer by day, DJ by night, family first always

wintermute000

#18
Quote from: that1guy15 on October 12, 2015, 01:01:27 PM
Quote from: burnyd on October 12, 2015, 11:38:15 AM

I posted this yesterday ironically.

https://danielhertzberg.wordpress.com/2015/10/11/333/
Great post dude!!

Hey burnyd

Reading your post, and this BGP bit:

"The peerings in a BGP leaf spine architecture are rather easy.  iBGP between each leaf switches and eBGP between each leaf to spine connection. ECMP is rather vital in this topology as BGP by default DOES NOT leverage multiple links in a ECMP fashion.  So generally it has to be turned on."

Can you elaborate a little bit?
- each spine has its own ASN (you wrote earlier). So you're not rolling with the IETF draft proposal idea where the spines share an ASN?
- but the leaves do not connect to each other, nor do you have an underlying IGP to enable multi-hop iBGP - so what is each leaf switch iBGP peering with?
- presumably the spines do not peer at all?


AspiringNetworker, in your article, re: the spines in each cluster/pod sharing the same ASN - they're not connected to each other since it's a Clos design, so you don't have any iBGP peering?

NetworkGroover

#19
Quote from: routerdork on October 08, 2015, 02:16:07 PM
Quote from: AspiringNetworker on October 08, 2015, 11:10:32 AM
If it helps at all, I wrote about this subject based on Petr Lapukhov's work:

http://aspiringnetworker.blogspot.com/2015/08/bgp-in-arista-data-center_90.html
So, a question on this (very informative, by the way): using the same AS on each leaf was for the benefit of the configuration template? It doesn't seem that efficient to me to make it one thing and then prepend it with another; if you are prepending it, you are using it. Or maybe I missed something else in the concept?

The point of prepending was only to let you track routes back to a particular leaf switch via AS_PATH if you desired to do so, since all the leaf switches were in the same AS. Using the same AS obviously means you only need a single line in your configuration for that piece, which you can apply to every leaf, and a single bgp listen command at every spine. The source-route tracking isn't really required, but it can be handy if you desire to leverage it.
Engineer by day, DJ by night, family first always
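
Purely as an illustration of the "track the source leaf via AS_PATH" idea (this is a guess at how such a scheme could look, not the exact method from the blog post): if every leaf sits in a shared AS but prepends a unique extra ASN when it originates a prefix, the source leaf can be recovered from the path.

```python
# Hypothetical illustration only -- not the exact scheme from the article.
# Assume every leaf uses shared AS 65001, and each leaf also prepends a
# unique "identifier" ASN (65101 for leaf1, 65102 for leaf2, ...) on the
# prefixes it originates, so the source shows up in AS_PATH.

LEAF_ID_ASNS = {65101: "leaf1", 65102: "leaf2", 65103: "leaf3"}  # made up

def originating_leaf(as_path: list[int]) -> str:
    """Scan an AS_PATH (origin AS last) for a leaf identifier ASN."""
    for asn in reversed(as_path):
        if asn in LEAF_ID_ASNS:
            return LEAF_ID_ASNS[asn]
    return "unknown"

# A route seen at a spine, learned from leaf2: shared leaf AS, then the tag.
print(originating_leaf([65001, 65102]))  # -> "leaf2"
```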

NetworkGroover

Quote from: wintermute000 on October 21, 2015, 12:23:55 AM
AspiringNetworker, in your article, re: the spines in each cluster/pod sharing the same ASN - they're not connected to each other since it's a Clos design, so you don't have any iBGP peering?

No, there is never any peering between spine switches in an L3 ECMP design. The reason the spine switches have the same ASN is BGP's built-in route loop prevention.

Imagine a route is advertised from "Spine1" down to "Leaf1". Leaf1 then in turn advertises it, right? So the route gets advertised to "Spine2". Spine2, since it's in the same ASN as Spine1, notices its own ASN in the route's AS_PATH, so it doesn't accept the route. Built-in route loop prevention.
Engineer by day, DJ by night, family first always
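
That check is simple enough to sketch in a few lines of Python (the ASNs are made up: both spines in AS 65000, Leaf1 in AS 65101):

```python
# Toy model of eBGP's AS_PATH loop prevention, as described above.
# Hypothetical ASNs: Spine1 and Spine2 share AS 65000, Leaf1 is AS 65101.

SPINE_ASN = 65000

def spine_accepts(as_path: list[int], my_asn: int = SPINE_ASN) -> bool:
    """An eBGP speaker rejects any update whose AS_PATH contains its own ASN."""
    return my_asn not in as_path

# Route originated by Spine1, re-advertised by Leaf1, arriving at Spine2:
path_at_spine2 = [65101, 65000]   # Leaf1 prepended itself; Spine1 is already in the path
print(spine_accepts(path_at_spine2))  # False -> Spine2 drops it, loop prevented

# A normal route originated by Leaf1 is accepted:
print(spine_accepts([65101]))         # True
```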

wintermute000

#21
Cheers.
Which design have you seen more of in the wild? I.e., re-use ASNs or not?

routerdork

I've been reading a lot of the Cisco Live presentations on VXLAN/MSDC/Nexus/DC architecture/etc. Something I've been thinking about with all this new DC stuff... I have yet to see any article, presentation, etc. say anything about QoS. Maybe I haven't come across the right presentations yet, but I would think this is still going to be needed?
"The thing about quotes on the internet is that you cannot confirm their validity." -Abraham Lincoln

NetworkGroover

Quote from: wintermute000 on October 21, 2015, 03:39:42 PM
Cheers.
Which design have you seen more of in the wild? I.e., re-use ASNs or not?

Honestly I haven't personally dealt with any of the big guys yet, and those are usually the ones that deploy this. The ones I do know about from talking to other folks, though, use a per-rack AS and then re-use those ASNs at each "cluster" or "pod". With 4-byte ASNs you could skip the re-use and still not worry about running out of ASNs even in a huge data center... but if you introduce another vendor that doesn't support 4-byte ASNs, you're SOL... which is why the re-use of ASNs at each pod of ToRs was chosen.
Engineer by day, DJ by night, family first always
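
Here's a small Python sketch of what that "per-rack AS, re-used in every pod" numbering could look like. The ranges are invented (2-byte private ASNs, one shared spine ASN per pod), and note that re-using leaf ASNs across pods generally means relaxing the AS_PATH loop check (e.g. allowas-in) on the leaves for routes that cross pods.

```python
# Hypothetical ASN plan for "per-rack AS, re-used in every pod".
# Numbers are invented: 2-byte private ASNs (64512-65534) so a vendor
# without 4-byte ASN support doesn't break the design.  Spines within a
# pod share one ASN (for the loop-prevention behaviour discussed earlier);
# leaf/ToR ASNs repeat in every pod, which is why inter-pod routes usually
# need the AS_PATH loop check relaxed (e.g. allowas-in) on the leaves.

POD_SPINE_AS_BASE = 65000   # pod 1 spines -> 65001, pod 2 spines -> 65002, ...
RACK_AS_BASE = 65100        # rack 1 -> 65101, rack 2 -> 65102, ... in every pod

def asn_plan(pod: int, racks_per_pod: int) -> dict:
    """Return the spine ASN and the (re-used) per-rack leaf ASNs for one pod."""
    return {
        "pod": pod,
        "spine_asn": POD_SPINE_AS_BASE + pod,
        "leaf_asns": [RACK_AS_BASE + r for r in range(1, racks_per_pod + 1)],
    }

print(asn_plan(pod=1, racks_per_pod=4))
print(asn_plan(pod=2, racks_per_pod=4))  # same leaf ASNs, different spine ASN
```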

NetworkGroover

Quote from: routerdork on October 28, 2015, 09:36:43 AM
I've been reading a lot of the Cisco Live presentations on VXLAN/MSDC/Nexus/DC architecture/etc. Something I've been thinking about with all this new DC stuff... I have yet to see any article, presentation, etc. say anything about QoS. Maybe I haven't come across the right presentations yet, but I would think this is still going to be needed?

Yeahhh, that's a recurring theme I've noticed moving from working on campus-type stuff to the DC. I haven't once in my entire time dealing with folks in my current role configured QoS - not once. Well, that's a lie, I did for certification testing, but that certification testing was built around a campus environment to support voice. I dunno if it's just the huge pipes you see in the DC compared to the campus or what, but honestly I'm thankful. I view QoS as a crutch and a PITA. I wouldn't want to introduce that into my DC, personally.

Just Googled and found this - apparently Ivan P. agrees, at least partially:
http://blog.ipspace.net/2015/02/do-we-need-qos-in-data-center.html
Engineer by day, DJ by night, family first always

wintermute000

You need QoS when you run FCoE or iSCSI over the same converged LAN - that's a clear use case. But most people I've seen are still splitting out their storage networks, despite Cisco's massive push for CNAs and all the FC/FCoE stuff they shoehorned into the 7Ks. It doesn't help that most network jockeys like us don't know much about FC or FCoE - all I remember is the two hours of slides I got during my DCUFI training, LOL, and even that's a bit of a haze.

NetworkGroover

Quote from: wintermute000 on October 28, 2015, 04:49:22 PM
You need QoS when you run FCoE or iSCSI over the same converged LAN - that's a clear use case. But most people I've seen are still splitting out their storage networks, despite Cisco's massive push for CNAs and all the FC/FCoE stuff they shoehorned into the 7Ks. It doesn't help that most network jockeys like us don't know much about FC or FCoE - all I remember is the two hours of slides I got during my DCUFI training, LOL, and even that's a bit of a haze.

Yes, true. When you bring IP storage into the mix is when you start looking at things like DCBX, etc. I think that while FC still has a presence, as speeds have increased to 10/40/100G people are starting to see the value and cost savings of running IP-based storage rather than needing specialized adapters and other equipment, and people with that particular skill set.
Engineer by day, DJ by night, family first always

wintermute000

#27
Quote from: AspiringNetworker on October 28, 2015, 07:25:02 PM
I think that while FC still has a presence, as speeds have increased to 10/40/100G people are starting to see the value and cost savings of running IP-based storage rather than needing specialized adapters and other equipment, and people with that particular skill set.

Strictly speaking, DCBX etc. is for Ethernet-based storage, which covers not only iSCSI but FCoE - and FCoE still requires FC knowledge and config, just like IP over Ethernet requires IP routing knowledge and config separate from the Ethernet. IIRC, all the FC stuff in the Nexus 7K, for example, is separate from the IP stuff and has to be configured and maintained separately - it just happens to run over layer 2 Ethernet. Anyhow, my point was that we IP guys are guilty of letting the issue lie; we're not FC-skilled, so we haven't exactly been pushing hard for convergence.

Good point re: the increasing speeds of Ethernet (though I don't know how FC has progressed), and as you say, especially in the 40 to 100G realm it makes sense to converge since you just have so much bandwidth. There's still the practical reality of the difference in priorities and mindsets between a traditional data network and a storage network.

Some interesting comments in this discussion I Googled up; it alludes to several limitations of FCoE that are hazily coming back to me now:


http://forums.theregister.co.uk/forum/1/2014/02/11/fcoe_faster_than_fibre_channel_who_knows/


burnyd

FC needs to die. DCBX is being used in a lot of supercomputer/GPU setups, from what I understand, due to the way pause frames work and how it's sort of lossless to a degree. But you really have to be pushing 25/40/100Gb links super hard to need anything like that.

NetworkGroover

Quote from: burnyd on October 29, 2015, 10:23:44 AM
FC needs to die. DCBX is being used in a lot of supercomputer/GPU setups, from what I understand, due to the way pause frames work and how it's sort of lossless to a degree. But you really have to be pushing 25/40/100Gb links super hard to need anything like that.

+1... which of course doesn't make the FC guys happy, since they have such a niche skill set, but hey... everyone's got to adapt - not just the storage guys. Leave it to me to get heavily involved in networking at a time when so much is changing so rapidly... ugh. At least in the DC space, anyway... I guess I like making life hard on myself. ;P
Engineer by day, DJ by night, family first always