In December 2014 I got an early Christmas present from Dell. They shipped me the latest 40G and 10G S series (Force 10) switches so that I could begin to test, validate and document the integration and reference architectures between Dell Networking and Nutanix. I'm starting with an L2 MLAG (Multi-chassis Link Aggregation Group) configuration and I will work my way through to a full ECMP (Equal Cost Multi-path) L3 configuration, including VMware NSX. This will be a journey, and I'll cover the different options, their key considerations and configurations in the eventual white papers that Nutanix publishes. I've had a few weeks to configure the switches and do some initial testing (after the Christmas holiday break), so I thought I'd write about what I've found so far. On the NSX front, you'll be interested to know that Nutanix already has customers running NSX and that the platforms work extremely well together, as they both scale out linearly and predictably. But we'll leave NSX-specific discussion for another day. This article contains some highlights without stealing the thunder of the white papers I'm working on.
Datacenter networks were traditionally designed with three layers: Access, where servers connect; Aggregation or Distribution, where the access switches connect and which is normally the L2 demarcation; and Core, where everything is brought together at L3. Spanning Tree Protocol (STP) is enabled and redundant links are blocked to prevent network loops on the L2 segments. With this design you cannot utilize all the links, due to STP, and it can be complex to scale and to achieve consistent latency between different points in the network. These problems can be solved by taking a leaf-spine architecture approach.
As you can see in the diagram above, I have 2 Spine switches – Dell S6000s with 32 x 40GbE ports – and 4 Leaf switches – Dell S4810s with 48 x 10GbE ports and 4 x 40GbE ports. The Spine and two of the Leaf switches are running Dell FTOS (Force Ten OS) 9.6, while the other two Leaf switches are the S4810-ON (Open Networking) model and are running Cumulus Linux 2.5. Check out the details of the Dell Networking Force 10 S series switches.
The reason I like Cumulus Linux is, well, it's just Linux, but with hardware-accelerated switching and routing. If you know Linux (it's based on Debian), then it's very easy to use: there's no need for additional training, and it fits into the same management frameworks, such as Puppet, Chef, Ansible and others. So it's great to have the choice of this or FTOS on the Dell Networking switches. I found both Cumulus and FTOS easy to use, and the documentation is very good.
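To give a sense of what "it's just Linux" means in practice, here is a minimal sketch of day-to-day tasks on a Cumulus Linux switch using nothing but standard Debian tooling (the port name and package below are only examples, not taken from my lab config):
# Inspect a front-panel port with the standard iproute2 tools
ip link show swp1
# Edit the interfaces file and apply the changes via ifupdown2
sudo vi /etc/network/interfaces
sudo ifreload -a
# Pull extra tooling straight from the Debian package repositories
sudo apt-get update && sudo apt-get install tcpdump
This is also exactly why the switches drop into Puppet, Chef or Ansible alongside your servers with no special handling.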
The port-to-port latency on the S6000s is ~500ns, while the S4810s are ~800ns (so about 1.3us Leaf to Spine), even when routing. With the overhead in the IP stack of each of the hypervisors and VMs, I'm seeing end-to-end latency across the network between VMs of < 90us. This is using standard Intel 10GbE NICs, the VMXNET3 vNIC, and without using the latency sensitivity settings or other tuning within VMware vSphere 5.5.
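If you want a rough sanity check of numbers like these in your own environment, a simple ping between two Linux guests on different hosts will get you in the ballpark (the address is just a placeholder, and this is not how the figures above were measured):
ping -c 100 192.168.10.20
# Check the rtt min/avg/max summary at the end; averages well under 0.1 ms (100us) are consistent with the results above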
The main benefit of using MLAG is that you don't have to disable a whole lot of links to prevent network loops and have STP interfering; it's also very easy to set up. STP is still enabled to prevent loops during switch boot, but once things are up and running all of the links are available to pass traffic and you benefit from the combined bandwidth. Each switch is managed and updated independently, so it doesn't become a single point of failure, as it could if you'd chosen to stack the switches. Switch stacking might be OK if you needed multiple stacks anyway and the host links to the stacks themselves were redundant (you can't mix stacking and MLAG together). While independent management does add some slight overhead, it gains you the ability to update switch firmware independently without causing any disruption. With the standard automation and management frameworks such as Puppet, Ansible, Chef, CFEngine etc., the management overhead is greatly reduced or eliminated in any case.
Here is a diagram of what my lab network looks like at a high level. My lab hosts are directly connected to the leaf switches.
In FTOS an MLAG is called a Virtual Link Trunk (VLT), and in Cumulus it's called CLAG (Chassis Link Aggregation Group). The process to configure them is fairly similar at a high level:
Configure Out of Band Management Interfaces (after initial switch boot / install)
Create a port channel between the adjacent switches (Spine or Leaf)
Configure Spanning Tree (RSTP)
Create a Peer Link on top of the port channel between the adjacent switches so that inter-switch communications can take place to sync MAC addresses etc. Set up a backup address in case the primary fails; this prevents split-brain scenarios.
Configure and enable VLT or CLAGD (examples will follow)
Configure and enable the other port channels, edge ports, VLANs, routing etc
FTOS Spine Example:
! Note: Peer Link is recommended to be static port channel not LACP.
!
lacp ungroup member-independent vlt
lacp ungroup member-independent port-channel 100
!
default vlan-id 4000
!
protocol spanning-tree rstp
no disable
bridge-priority 16384
!
vlt domain 1
peer-link port-channel 100
back-up destination xxx.xxx.xxx.xxx <- IP Address of Backup Destination
primary-priority 16384
peer-routing
peer-routing-timeout 1
!
interface Port-channel 10
description Cumulus Leaf-Link
no ip address
mtu 9216
portmode hybrid
switchport
lacp fast-switchover
vlt-peer-lag port-channel 10
no shutdown
!
interface Port-channel 100
description Peer-Link
no ip address
mtu 9216
channel-member fortyGigE 0/120,124
no shutdown
!
interface fortyGigE 0/112
description Leaf1 – Port Channel 10
no ip address
mtu 9216
flowcontrol rx on tx off
!
port-channel-protocol LACP
port-channel 10 mode active
no shutdown
!
interface fortyGigE 0/116
description Leaf2 – Port Channel 10
no ip address
mtu 9216
flowcontrol rx on tx off
!
port-channel-protocol LACP
port-channel 10 mode active
no shutdown
!
interface fortyGigE 0/120
description Peer-Port 1 – Port Channel 100
no ip address
mtu 9216
flowcontrol rx on tx off
no shutdown
!
interface fortyGigE 0/124
description Peer-Port 2 – Port Channel 100
no ip address
mtu 9216
flowcontrol rx on tx off
no shutdown
!
interface Vlan 500
description Host VLAN
no ip address
mtu 9216
tagged Port-channel 1,10
no shutdown
!
interface Vlan 4000
mtu 9216
!untagged Port-channel 100
no shutdown
!
Cumulus Leaf Example with CLAGD:
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5), ifup(8)
#
# Please see /usr/share/doc/python-ifupdown2/examples/ for examples
#
#
# The loopback network interface
auto lo
iface lo inet loopback
# The primary network interface
auto eth0
iface eth0
address xxx.xxx.xxx.xxx/24
broadcast xxx.xxx.xxx.255
# Spine Link
auto spn1-2
iface spn1-2
bond-slaves swp49 swp50
bond-mode 802.3ad
bond-miimon 100
bond-use-carrier 1
bond-min-links 1
bond-xmit-hash-policy layer3+4
clag-id 1
# clag-id needs to be unique on each clag, like a vlt domain id on FTOS
mstpctl-portnetwork no
mtu 9216
# Peer Link to Other LeafSwitch
auto pl
iface pl
bond-slaves swp51 swp52
bond-mode 802.3ad
bond-miimon 100
bond-use-carrier 1
bond-min-links 1
bond-xmit-hash-policy layer3+4
mstpctl-portnetwork no
mtu 9216
# CLAGD Peer Int Config
auto pl.4000
iface pl.4000
address xxx.xxx.xxx.xxx/30
clagd-enable yes
clagd-priority 8192
clagd-peer-ip 172.16.0.2
clagd-backup-ip 192.168.255.12
clagd-sys-mac 44:38:39:ff:00:01
# Switch Port Interface Configuration
auto swp1
iface swp1
mtu 9216
auto swp2
iface swp2
mtu 9216
auto swp3
iface swp3
mtu 9216
auto swp4
iface swp4
mtu 9216
auto swp49
iface swp49
mtu 9216
auto swp50
iface swp50
mtu 9216
auto swp51
iface swp51
mtu 9216
auto swp52
iface swp52
mtu 9216
# Bridge Configuration
#
auto br0
iface br0
bridge-vlan-aware yes
bridge-ports pl spn1-2 glob swp[1-4]
bridge-stp on
bridge-pvid 1
bridge-vids 500
mstpctl-portadminedge swp1=yes swp2=yes swp3=yes swp4=yes
bridge-mcsnoop 1
mtu 9216
# Bridge VLAN
# Host VLAN
auto br0.500
iface br0.500
address xxx.xxx.xxx.xxx/23
broadcast xxx.xxx.xxx.255
up ip route add 0.0.0.0/0 via 192.168.1.230
mtu 9216
The above are examples, not complete configs; you can't just copy and paste them and expect them to work in your environment, but they could be used as a starting point.
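Once the peer links and port channels are up, it's worth verifying the MLAG state on both platforms before connecting hosts. As a rough guide (exact output varies by release), something like the following:
FTOS (from the switch CLI):
show vlt brief
show vlt detail
Cumulus Linux (from the shell):
sudo clagctl
show vlt brief should report the peer as up and the backup destination reachable, while clagctl shows the peer status and which bonds are dual-connected.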
Bringing this all together: the Dell S6000 / S4810 combination allows you to create a scalable network design that provides predictable, consistent low latency and high throughput from end to end in the network. The configuration of the MLAG/CLAG/VLT is straightforward, and it provides management flexibility. You get to choose either FTOS or Open Networking software such as Cumulus Linux as your switch OS. With Cumulus Linux, it's just Linux, but with wire-speed, non-blocking networking accelerated in hardware, and it fits into the normal management frameworks. For environments such as Hyper-converged or Web-scale infrastructure, the network scales linearly, as do the systems that connect to it. Each time you grow, you get consistent, predictable and linear performance.
Here is an example of a high level diagram that might be appropriate for a small scale deployment. In this case the Dell S4810s are used as both Leaf and Spine switches in a VLT configuration, with Dell N3048 or N3024 switches providing 1G connectivity. With this you could easily start with a single rack and scale to 8 racks, each with fully redundant 10G and 1G connectivity. With, say, 24 servers or Hyper-converged nodes per rack you would be able to support 192 servers across the 8 racks. 40G QSFP+ ports are used for the Peer Links, while the Leaf connects to the Spine using 40G to 10G breakout cables.
Here is an example high level diagram of a medium density Nutanix Web-scale Converged Infrastructure deployment with Dell S4810 Leaf switches connected to an S6000 Spine, which could be using FTOS VLT or Cumulus CLAG. As you can see, this design is capable of scaling to 12 racks (576 nodes) with the S6000 Spine switches, and up to 52 racks and 2496 nodes with the Dell Z9500 Spine switches (the Z9500 supports 128 x 40GbE ports). Both options still have spare 40GbE ports available for Border-Leaf connectivity (cross-datacenter links, Internet routers etc).
If you wanted a higher density design you could combine S6000 switches in a middle of rack or top of rack configuration as Leaf switches with an S6000 or Z9500 Spine. This diagram provides a high level example of a high density configuration in a standard 48U rack. In this example the design would support 10 racks with 880 nodes with an S6000 Spine, and 48 racks with 4224 nodes with a Z9500 Spine, with enough ports to accommodate the Border-Leaf nodes as well.
With densities in the datacenter increasing and the power consumption of servers decreasing, I can see an explosion of 40GbE Top of Rack (ToR) or Middle of Rack (MoR) Leaf switches. This would also make it easy to adopt different bandwidth oversubscription models, 6:1, 4:1 etc., as requirements change.
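As a quick worked example, using the leaf port counts from the lab configuration above (and assuming a fully populated rack):
48 x 10GbE host ports = 480Gbps southbound
2 x 40GbE uplinks to the spine (the other 2 x 40GbE used for the peer link) = 80Gbps northbound, i.e. 480/80 = 6:1 oversubscription
4 x 40GbE uplinks (if the peer link isn't needed, e.g. in an L3 design) = 160Gbps northbound, i.e. 480/160 = 3:1 oversubscription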
So far we’ve covered the Dell Networking switches by themselves, with FTOS and Cumulus, and then some examples combined with Nutanix. Dell is an OEM partner of Nutanix software and delivers the Dell XC Series Web-scale Converged Appliances. With Dell XC you have a number of hardware options for different use cases, and you can build a complete solution, including networking.
The following example shows Dell S4810 Leaf switches and Dell S6000 Spine switches with Dell XC series appliances. These appliances are 2U each and contain one node. Dell XC also has 1U options.
Final Word
Those of you who spend time in a modern datacenter will notice I've drawn the diagrams with the switch ports facing forward, which is actually backwards. I did this to make the diagrams easy to draw, and because I like flashing lights. In real world environments the airflow of the switches would be reversed and the back of the switches, where the PSUs are, would face the front of the rack, so that the cabling can be nice and tidy. In summary, where predictable, low latency, high throughput, linearly scalable networking is required, Leaf-Spine architectures are becoming increasingly common (and are simpler than three tier, IMHO). Web-scale converged or Hyper-converged infrastructure benefits from the low latency, high throughput and linear scalability the Dell Networking switches can provide, as does network virtualization such as VMware NSX. Dell Networking switches combined with Nutanix appliances, or Dell XC series appliances powered by Nutanix, can deliver a unified and simplified high performance virtual infrastructure with greatly reduced complexity compared to a traditional three tier architecture. A great foundation for a private cloud or software defined datacenter.
—
This post first appeared on the Long White Virtual Clouds blog at longwhiteclouds.com. By Michael Webster. Copyright © 2012 – 2015 – IT Solutions 2000 Ltd and Michael Webster. All rights reserved. Not to be reproduced for commercial purposes without written permission.
Disclaimer: I work for DELL.
Hi Michael. Very nice post. I think it is generally better to use an L3 leaf-spine fabric with OSPF and/or BGP because of better scalability and reliability. I'm sure you are working towards an L3 network architecture, and I'm looking forward to future posts, because this is exactly what I'm testing in my lab to become familiar with it. I'm very interested in what your final conclusion will be.
Hi David, I'm actually working towards this. I'm planning to set up L3 to the Spine after I'm done testing VLT. This configuration is a lot simpler for smaller environments. I'm not using LACP to the hosts yet. Later on I'm also interested in doing full L3 to the hosts to see how that works. I'll definitely be doing more articles as I work through this, as well as writing the official Nutanix guides.
Excellent. Sounds very promising.
Absolutely agree with you. L2 is much simpler, and that's the reason I have only done L2 (VLT/mVLT/rVLT) deployments for our customers so far.
However, an L2 network is a single failure domain, which can have a negative impact on availability and scalability.
Right now I have one customer who is planning a new datacenter for 3,000+ servers for next year. He is interested in a leaf-spine L3 network PoC, so I'll most probably (if time allows) try to build a small lab this month.
I have some ideas about how to implement L3 leaf-spine with OSPF, but I have never tested it so far because I'm more of a virtualization guy than a networking guy, and as you stated correctly the L2 implementation is simpler for us, but that's most probably only because we are a little bit less familiar with L3 dynamic routing protocols.
That's the reason I'm very interested in your thoughts, and thanks for sharing your experience with the community.
Hi David, no problem. ECMP (Equal Cost MultiPath) with OSPF is a good way to go. It's not that complicated to set up on the Dell switches; you just have to get the AS numbers right, use the private range, etc. You also have to make sure you don't need to stretch a normal VLAN across to another rack when using ToR/MoR switches (MoR = Middle of Rack). If you did, then you'd need to use an overlay such as NSX to span the VLAN across the racks. Whereas with the L2 design you can easily configure the VLANs across the racks, but you definitely have the downside of a single failure domain. I'll be doing ECMP with FTOS and also with Cumulus Linux, and putting NSX on top. Will blog about that once I'm done.
Hi Michael,
Great post. Is LACP the recommended load balancing method on the ESXi vDS?
Hi Shan,
LACP on the ESXi VDS is definitely highly recommended. One reason is that LACP solves a potential black hole scenario when the VLTi (peer-link) fails. VLTi failure is rare, but it can happen.
I have described this in detail at
http://blog.igics.com/2015/05/dell-force10-vtl-an…
David.
Properly configured, a VLTi failure won't cause link failure if a backup link is used with LBT. The complexity of LACP is not justified in most cases; load based teaming is recommended. If you're worried about VLTi failure, link state tracking can be used (ifplugd on Cumulus).
Hi Mike. Are you sure? Did you test it? I did 🙂
I did this test as part of the validation of my already implemented VLT network design.
I had the same assumption as you, but the validation tests told me the truth.
The backup link will not help you. It's not a bug, it's a feature. If the VLTi is down then MAC addresses cannot be synced between the VLT nodes, so the VLT domain cannot work. The backup link is used just to know who is up and who is down. When the VLTi is down and the secondary VLT node sees the primary node up and running over the backup link, then all ports participating in VLT port-channels on the secondary VLT node are switched to link down. However, orphan ports (non-VLT ports) are still up, and that leads to the black hole scenario.
When you use switch independent teaming on the ESXi host, you effectively use orphan ports.
When you use LACP then it is VLT aware.
The backup link is beneficial when the primary node is down: if the backup link is up, the secondary VLT node will keep all VLT port-channel ports up.
When the backup link is not configured correctly, there is no visibility between the VLT nodes during a VLTi failure, and it is treated as a split-brain scenario where only the primary VLT node will switch traffic.
David.
I get that. With dual redundant VLTi and ToR switches, what real-world scenario will take down the VLTi where Link State Tracking will not address the orphaned ports? In Cisco there is a command for orphaned ports; in Cumulus you use ifplugd. In my FTOS design I would use LST, but due to the very low probability of this case I have accepted the risk. Which, incidentally, is what the network architects I'm working with at Dell also recommend.
I fully agree that the probability of VLTi failure is very low because of redundancy. But the typical failure scenario is human error.
I also agree that this particular scenario can be documented as a risk and accepted by the customer.
However, if it's not accepted, I also believe that link state tracking (in Force10 language UFD, Uplink Failure Detection) is a potential solution or workaround.
BUT REMEMBER: orphaned ports must be configured as dependent on some VLT port-channel (for example the VLT to the upstream router) and NOT on the VLTi (peer-link) port-channel, because the VLTi port-channel can legitimately be down during primary VLT node maintenance such as a firmware upgrade or reload.
I'm planning to test and validate the UFD workaround in my lab, because neither LACP nor a static EtherChannel can be used when NPAR is leveraged on the NICs. And that's exactly what is used in my particular design, because of vSphere licensing (Standard edition) and iSCSI with DCB constraints.
I call this solution a workaround because it can introduce other unwanted side effects (a dependency on something irrelevant).
Another workaround (maybe more reliable than UFD) would be a Force10 smart-script (Perl, Python, zsh) testing the VLTi status and also the status of the primary VLT node. However, who likes custom scripts, right? It would have a negative impact on the long-term manageability of the solution.
I still believe LACP (or a static EtherChannel) from the host is the purest solution if applicable, and you will get the best result. I don't think the VDS LACP configuration is too complex.
I also agree that VMware's LACP implementation is relatively new, and I have seen some interoperability issues with HP IRF, but that's another topic.
P.S. I'm also a proponent of simple solutions and I really like switch independent teaming, especially VMware's LBT, but it is good to know there is at least one risky scenario.
Human error in any scenario can take the network down. I don't think it increases the risk of VLTi failure any more than any other change to the networking environment. The real workaround is to have appropriate failure domains around ToR dependencies, including the connected server infrastructure. There are a few scenarios that can mess up the LACP configuration on the vSphere host side of things too. So with the choice between a very low probability of failure and having to do a lot more documentation, testing and verification to implement LACP, I'd go for the former. Unless the customer is implementing a solution such as NSX, which really needs LACP to allow the overlay to achieve acceptable performance.
Hi Mike
I have a very basic question.
In my environment we are planning to use NSX with a Leaf-Spine architecture. (In the current scenario we have a traditional 3 tier network design.)
When we go for NSX with a Leaf-Spine architecture, I would like to understand whether the leaf switches should be L2 or L3.
My understanding is that they should be L2, since NSX will have the Edge Gateways, which will form the L3 adjacency with the spine switches.
Please clarify
Yes, the Leaf switches should be configured for L2. L3 is usually in the Spine.