VMware Distributed vSwitch LACP Configuration with Dell Force10 and Cumulus Linux

In my article Etherchannel and IP Hash or Load Based Teaming? I argued that using port channels with EtherChannel or LACP is an overly complex configuration that doesn’t offer much in terms of predictable load balancing across host NICs. LACP or EtherChannel port channels to your host NICs require a lot more switch-side configuration, in most cases for little benefit. In most common cases Load Based Teaming is the recommended choice with vSphere Distributed Switches.

However, there are a number of situations where you really can’t avoid using port channels with LACP or static EtherChannel load balancing to your hosts. One of them is when you want to use a network virtualization product such as VMware NSX and maximize the performance and utilization of your host NICs. This article covers both the vSphere host and the physical switch configuration needed to set up LACP on VMware vSphere 6.0 with Dell Force10 switches running the Cumulus Linux network OS.

First we will start with the network configuration and then move on to the vSphere host configuration. If you are not familiar with Cumulus Linux, check out the Cumulus Networks web site. It makes networking easy enough that even server administrators can do it, because it’s just Linux. Everything can be automated using familiar automation tools and frameworks, for server and network configuration alike. That makes it suitable both for small environments with limited administration staff and for large-scale environments that need to change and adapt rapidly. I’ve heard that Cumulus is popular with at least one of the large cloud / web-scale companies, where they have thousands of switches.

Cumulus Linux offers comprehensive network functionality on a choice of hardware at a fraction of the cost of other network OSes. In the examples presented here we use Dell Force10 switches to provide a low-latency, high-performance and highly reliable network fabric. Dell offers a choice of network OS on their Open Network compliant switches, including Cumulus Linux, Force10 OS, and BigSwitch. Dell provided the network switches for my lab environment, and I use them for all the day-to-day performance testing and validation that I perform across business critical applications running in virtualized environments on Nutanix web-scale hyperconverged infrastructure.

Below I’ve provided two different ways of configuring Cumulus Linux that allow you to use LACP across a redundant switching fabric using CLAG, which is similar to MLAG, or to VLT in other switch configurations. The configuration below is for the leaf switches in a leaf and spine architecture, which is depicted in the image below. The first approach uses /etc/network/interfaces exclusively and is more manual. The second uses Mako templates placed in /etc/network/interfaces.d, from which the configuration is built automatically.

The first part of the configuration is to configure the links between the two top of rack or middle of rack leaf switches and the link to the spine.

LeafSwitch1 CLAG, Peer and Spine Config using /etc/network/interfaces:


# Spine Link
auto spn1-2
iface spn1-2 
  bond-slaves swp49 swp50
  bond-mode 802.3ad
  bond-miimon 100
  bond-use-carrier 1
  bond-min-links 1
  bond-lacp-rate 1
  bond-xmit_hash_policy layer3+4
  clag-id 1
  mstpctl-portnetwork no
  mtu 9216

# Peer Link
auto pl
iface pl 
  bond-slaves swp51 swp52
  bond-mode 802.3ad
  bond-miimon 100
  bond-use-carrier 1
  bond-min-links 1
  bond-lacp-rate 1
  bond-xmit_hash_policy layer3+4
  mstpctl-portnetwork no
  mtu 9216

# CLAGD Peer Config

auto pl.4000
iface pl.4000 
  address 172.16.0.1/30
  clagd-enable yes
  clagd-priority 8192
  clagd-peer-ip 172.16.0.2
  # MGMT IP Address of Peer
  clagd-backup-ip 192.168.255.12
  clagd-sys-mac 44:38:39:ff:00:01

LeafSwitch2 CLAG, Peer and Spine Config using /etc/network/interfaces:

auto spn1-2
iface spn1-2 
  bond-slaves swp49 swp50 
  bond-mode 802.3ad
  bond-miimon 100 
  bond-use-carrier 1 
  bond-min-links 1 
  bond-lacp-rate 1
  bond-xmit_hash_policy layer3+4 
  clag-id 1
  mstpctl-portnetwork no
  mtu 9216

# Peer Link
auto pl
iface pl 
  bond-slaves swp51 swp52
  bond-mode 802.3ad
  bond-miimon 100
  bond-lacp-rate 1
  bond-use-carrier 1
  bond-min-links 1
  bond-xmit_hash_policy layer3+4
  mstpctl-portnetwork no
  mtu 9216

# CLAGD Peer Config
auto pl.4000
iface pl.4000 
  address 172.16.0.2/30
  clagd-enable yes
  clagd-peer-ip 172.16.0.1
  # MGMT IP Address of Peer
  clagd-backup-ip 192.168.255.11
  clagd-sys-mac 44:38:39:ff:00:01

Next we need to define the port channels, known as bonds in Linux networking, which the hosts use to connect to both switches redundantly. This ensures there is no single point of failure in the network fabric to the hosts. A bond consists of one or more physical ports connected to a host. The CLAG ID must be unique across all bonds on a switch, and a given bond must use the same CLAG ID on both peer switches.

Here is an example of a host port channel configured in /etc/network/interfaces:

# Switch Port Interface Configuration
# NXVMW Block Configuration
auto nxvmw-node1
iface nxvmw-node1
  bond-slaves swp1
  bond-mode 802.3ad
  bond-lacp-rate 1
  bond-min-links 1
  bond-lacp-bypass-allow 1
  bond-miimon 100
  bond-xmit_hash_policy layer3+4
  bridge-pvid 560
  mstpctl-portadminedge yes
  mstpctl-bpduguard yes
  clag-id 2
  mtu 9216
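
Since every bond on a switch needs its own CLAG ID, a quick check before reloading the configuration can catch copy-and-paste mistakes. The following is a minimal sketch in Python, not part of the original configuration. The function names and the shortened `SAMPLE` text are illustrative assumptions, and the parser only understands plain `iface` stanzas, not Mako templates or included files.

```python
from collections import defaultdict

# Shortened sample in the style of the stanzas above (illustrative only).
SAMPLE = """\
auto spn1-2
iface spn1-2
  bond-slaves swp49 swp50
  clag-id 1

auto nxvmw-node1
iface nxvmw-node1
  bond-slaves swp1
  clag-id 2
"""


def clag_ids_by_bond(config_text):
    """Map each clag-id value to the iface stanzas that declare it."""
    owners = defaultdict(list)
    current_iface = None
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and whole-line comments
        fields = line.split()
        if fields[0] == "iface" and len(fields) > 1:
            current_iface = fields[1]
        elif fields[0] == "clag-id" and len(fields) > 1 and current_iface:
            owners[fields[1]].append(current_iface)
    return dict(owners)


def duplicate_clag_ids(config_text):
    """Return only the clag-ids that more than one bond claims."""
    return {
        clag_id: names
        for clag_id, names in clag_ids_by_bond(config_text).items()
        if len(names) > 1
    }


if __name__ == "__main__":
    # An empty dict means no clag-id is shared between bonds.
    print(duplicate_clag_ids(SAMPLE))
```

To check a real switch you would read /etc/network/interfaces into a string and pass it to `duplicate_clag_ids`.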

Now that we have our uplinks to the spine and our downlinks to the hosts, we need a bridge for all of the ports to connect to. In this case we will use a VLAN-aware bridge, as we want to trunk VLANs to the host ports. For those not familiar with it, a bridge in Linux networking is like a virtual switch that physical and virtual ports connect to. In Cumulus Linux the bridge is what all of the hardware-accelerated physical switch ports connect to. Individual ports to the hosts are grouped into bonds. In addition to physical switch port interfaces and bonds, a bridge can have multiple VLAN interfaces. You can use a glob to specify a group of similarly named interfaces, as shown below. The example includes multiple VLANs, some of which are assigned layer 3 IP addresses.

# Bridge Configuration

auto br0
iface br0
   bridge-vlan-aware yes
   bridge-ports pl spn1-2 glob nxvmw-node1-4 glob nx3000-node1-4
   bridge-stp on
   bridge-pvid 1
   bridge-vids 100,560,570,580,590,600
   bridge-mcsnoop 1
   bridge-igmp-querier-src 192.168.245.211
   mtu 9216

# Bridge VLAN
auto br0.600
iface br0.600
   address 172.16.100.211/23
   broadcast 172.16.101.255
   mtu 9216

auto br0.100
iface br0.100
   address 192.168.245.211/24
   broadcast 192.168.245.255
   up ip route add 0.0.0.0/0 via 192.168.245.230
   mtu 9216
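
The broadcast address on each VLAN interface has to agree with the address and prefix length. This is easy to get wrong with a /23, where the broadcast falls in the next third octet. Below is a small sketch using Python's standard `ipaddress` module to confirm the two pairs used above. The helper name `broadcast_for` and the `VLAN_INTERFACES` table are assumptions made for illustration.

```python
import ipaddress


def broadcast_for(cidr):
    """Return the broadcast address implied by an interface address in CIDR form."""
    return str(ipaddress.ip_interface(cidr).network.broadcast_address)


# Address and broadcast values copied from the br0.600 and br0.100 stanzas.
VLAN_INTERFACES = {
    "br0.600": ("172.16.100.211/23", "172.16.101.255"),
    "br0.100": ("192.168.245.211/24", "192.168.245.255"),
}

if __name__ == "__main__":
    for name, (cidr, configured) in VLAN_INTERFACES.items():
        status = "ok" if broadcast_for(cidr) == configured else "MISMATCH"
        print(name, cidr, configured, status)
```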

ifupdown2 supports Mako templates natively, so you can create templates such as the ones below and put them in /etc/network/interfaces.d to achieve the same thing. Mako templates allow a simple form of automation and programmability for interface creation and configuration on Cumulus Linux. Customers could have a script or automation tool that configures their virtual environment, applies the corresponding changes to the Mako templates on their Cumulus switches, and updates the physical environment all in one go.

Here is the Mako template to configure clagd – /etc/network/interfaces.d/clag_cfg. It creates the peer link and all the necessary CLAG configuration.

<%
# INPUTS - Peer Link
lacp_mode = "fast" # or "slow" # Use for bond rate
peer_bond_name = "pl" # Name for the bond link
peer_port_start = 51
peer_port_end = 52
# INPUTS - Clag Info
clag_vlan_id = 4000
clag_address = "172.16.0.2/30" # SP1 CLAG address
clag_peer_address = "172.16.0.1" # CLAG peer link address
clag_pair_unique_mac = "44:38:39:ff:ff:01" # Unique/reserved per CLAG pair MAC
clag_priority = 8192 # CLAG priority; the slave should be a higher number, like 12288
clag_args = [] # Optional flags can be added over here.
# OUTPUT
# Will create the bond member matching list.
peer_slaves = "glob swp[{0}-{1}]".format(peer_port_start, peer_port_end)

if clag_args:
    clag_args_str = " ".join(clag_args)
else:
    clag_args_str = ""
lacp_rate = 1 if "fast" in lacp_mode else 0
%>


auto ${peer_bond_name}
iface ${peer_bond_name}
  bond-slaves ${peer_slaves}
  bond-mode 802.3ad
  bond-miimon 100
  bond-use-carrier 1
  bond-min-links 1
  bond-lacp-rate ${lacp_rate}
  bond-xmit-hash-policy layer3+4
  mstpctl-portnetwork no
  mtu 9216

auto ${peer_bond_name}.${clag_vlan_id}
iface ${peer_bond_name}.${clag_vlan_id}
  address ${clag_address}
  clagd-enable yes
  clagd-priority ${clag_priority}
  clagd-peer-ip ${clag_peer_address}
  clagd-sys-mac ${clag_pair_unique_mac}
  mtu 9216

%if clag_args_str:
  clagd-args ${clag_args_str}
%endif
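
The header of the clag_cfg template derives three values from its inputs: the bond member glob, the numeric LACP rate and the optional clagd argument string. As a rough sketch, the same logic can be tried out in plain Python before it goes onto a switch. The function name `peer_link_vars` and its parameter names are hypothetical and not part of the template.

```python
def peer_link_vars(lacp_mode="fast", port_start=51, port_end=52, clag_args=None):
    """Compute the derived values the clag_cfg header hands to its stanzas."""
    clag_args = clag_args or []
    return {
        # Glob that matches every switch port in the peer link range.
        "peer_slaves": "glob swp[{0}-{1}]".format(port_start, port_end),
        # bond-lacp-rate takes 1 for fast and 0 for slow.
        "lacp_rate": 1 if "fast" in lacp_mode else 0,
        # Space-separated optional flags, or an empty string when there are none.
        "clag_args_str": " ".join(clag_args),
    }


if __name__ == "__main__":
    for key, value in peer_link_vars().items():
        print(key, "=", repr(value))
```

With the default inputs this yields `glob swp[51-52]` for the bond members and an LACP rate of 1, which matches the hand-written peer link stanza earlier in the article.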

Here is a Mako template that could be used to configure the uplinks to the Spine switches – /etc/network/interfaces.d/spine_uplink_cfg

<%
# INPUT
# Tuple is of the form (name of bond, bond slaves, LACP RATE = 1/FAST or 0/SLOW, clag_id)
deploy_uplink_bonds = [("spn1-2","swp49 swp50","1","1")]
%>

% for bond_name, bond_slaves, lacp_rate, clag_id in deploy_uplink_bonds:

auto ${bond_name}
iface ${bond_name}
  bond-slaves ${bond_slaves}
  bond-mode 802.3ad
  bond-miimon 100
  bond-lacp-rate ${lacp_rate}
  bond-use-carrier 1 
  bond-min-links 1 
  bond-xmit_hash_policy layer3+4 
  clag-id ${clag_id}
  mstpctl-portnetwork no
  mtu 9216 
% endfor

Here is a Mako template that could be used to configure the host port channels aka bonds – /etc/network/interfaces.d/downlink_bonds_cfg

<%
# INPUT
# Tuple is of the form (name of bond, bond slaves, LACP RATE = 1/FAST or 0/SLOW, clag_id)
deploy_downlink_bonds = [("nxvmw-node1","swp1","1","2"),
                         ("nxvmw-node2","swp2","1","3"),
                         ("nxvmw-node3","swp47","1","4"),
                         ("nxvmw-node4","swp48","1","5")]
%>
 
% for bond_name, bond_slaves, lacp_rate, clag_id in deploy_downlink_bonds: 

auto ${bond_name}
iface ${bond_name}
 bond-slaves ${bond_slaves}
 bond-mode 802.3ad
 bond-miimon 100
 bond-lacp-rate ${lacp_rate}
 bond-min-links 1
 bond-lacp-bypass-allow 1
 bridge-pvid 560
 mstpctl-portadminedge yes
 mstpctl-bpduguard yes
 clag-id ${clag_id}
 bond-xmit-hash-policy layer3+4
 mtu 9216
% endfor
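
To preview what the downlink loop expands to without touching a switch, the stanza can be approximated with `string.Template` from Python's standard library, which happens to use the same `${name}` placeholder syntax. This is only a sketch under that assumption. It is not how ifupdown2 renders Mako, the stanza is shortened, and `render_bonds` is a hypothetical helper.

```python
from string import Template

# Shortened stanza using the same ${...} placeholders as the Mako template.
STANZA = Template("""\
auto ${bond_name}
iface ${bond_name}
 bond-slaves ${bond_slaves}
 bond-mode 802.3ad
 bond-lacp-rate ${lacp_rate}
 clag-id ${clag_id}
""")

# Same tuples as deploy_downlink_bonds: (bond name, bond slaves, LACP rate, clag_id).
DOWNLINK_BONDS = [
    ("nxvmw-node1", "swp1", "1", "2"),
    ("nxvmw-node2", "swp2", "1", "3"),
    ("nxvmw-node3", "swp47", "1", "4"),
    ("nxvmw-node4", "swp48", "1", "5"),
]


def render_bonds(bonds):
    """Render one stanza per tuple, mirroring the template's for loop."""
    stanzas = []
    for bond_name, bond_slaves, lacp_rate, clag_id in bonds:
        stanzas.append(
            STANZA.substitute(
                bond_name=bond_name,
                bond_slaves=bond_slaves,
                lacp_rate=lacp_rate,
                clag_id=clag_id,
            )
        )
    return "\n".join(stanzas)


if __name__ == "__main__":
    print(render_bonds(DOWNLINK_BONDS))
```

A preview like this makes it obvious when a placeholder is left unexpanded, for example a `clag-id` line that still shows a literal brace expression instead of a number.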

In a future article I will cover a similar configuration with Dell Force10 switches and the Dell Force10 Network OS (FTOS), in addition to VMware NSX and VXLAN configuration.

To use LACP in vSphere you need to be on vSphere 5.1 or above, and to have configured your vSphere Distributed Switch for LACP. If you’re running vSphere 5.5 or above it’s recommended that you use Enhanced LACP, which offers additional load balancing methods. The load balancing method I’ve found most useful in my tests is depicted in the vDS configuration image below, which is based on a vDS version 6.0:

Image: VMware vSphere 6.0 LACP Config (2015-11-25)

With a vSphere 6.0 vDS you can configure up to 64 port channels, known as Link Aggregation Groups (LAGs), per host and up to 32 ports per LAG. It is recommended that you review the other relevant vSphere Configuration Maximums.
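
If you plan LAGs in a script or spreadsheet export, those two limits are easy to check up front. Here is a minimal sketch. The limits are the ones quoted above for a vSphere 6.0 vDS, while the `check_lag_plan` helper and its input format (LAG name mapped to uplink port count) are assumptions for illustration and do not talk to vCenter.

```python
# Limits quoted in the article for a vSphere 6.0 vDS.
MAX_LAGS_PER_HOST = 64
MAX_PORTS_PER_LAG = 32


def check_lag_plan(lags):
    """Return a list of problems for a plan mapping LAG name -> uplink port count."""
    problems = []
    if len(lags) > MAX_LAGS_PER_HOST:
        problems.append(
            "{0} LAGs planned, limit is {1}".format(len(lags), MAX_LAGS_PER_HOST)
        )
    for name, ports in sorted(lags.items()):
        if not 1 <= ports <= MAX_PORTS_PER_LAG:
            problems.append(
                "{0}: {1} ports, allowed range is 1-{2}".format(
                    name, ports, MAX_PORTS_PER_LAG
                )
            )
    return problems


if __name__ == "__main__":
    # A host with two 10GbE NICs in a single LAG, as in this article.
    print(check_lag_plan({"lag1": 2}))
```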

You connect your host NICs directly to a port on the LAG rather than to a standard vDS uplink. The LAG then appears as a single interface in the port group configuration. The image below shows the host 10GbE NICs connected to a LAG on a vDS.

Image: VMware vSphere 6.0 LACP Host Uplink Config (2015-11-25)

When you configure the vDS port groups that VMs will connect to, you need to ensure that the LAG is the only active interface. All other interfaces should be marked as unused. During migration, however, you can mark the LAG as active and another interface as standby until you’ve migrated all physical and virtual interfaces to the LAG. It is recommended that you read the guidance on migrating to a LAG in the Enhanced LACP configuration documentation for the vDS. The image below shows an example of the vDS port group configuration.

Image: VMware vSphere 6.0 LACP Portgroup Config (2015-11-25)

When using a LAG the Load Balancing method in the Portgroup configuration does not apply.

As far as performance goes, I’ve successfully tested hosts maxed out with 2 x 10GbE links carrying full-duplex traffic at wire speed, i.e. 4.8 GBytes/s, with latency as low as 70 µs, even when using VMware NSX and VXLAN overlay networks.
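
To put the 4.8 GBytes/s figure in context, here is the arithmetic as a small sketch. It assumes the figure is the aggregate across both links and both directions, and it ignores Ethernet framing overhead, so the raw line rate it computes is an upper bound. The function name is made up for illustration.

```python
def aggregate_line_rate_gbytes(links, gbit_per_link, full_duplex=True):
    """Raw aggregate line rate in GBytes/s, ignoring framing overhead."""
    directions = 2 if full_duplex else 1
    return links * gbit_per_link * directions / 8


if __name__ == "__main__":
    line_rate = aggregate_line_rate_gbytes(2, 10)  # 2 x 10GbE, both directions
    measured = 4.8                                 # GBytes/s, figure from the article
    print("{0:.1f} GB/s raw line rate, {1:.0%} achieved".format(
        line_rate, measured / line_rate))
```

Two 10GbE links in both directions give 40 Gbit/s, or 5 GBytes/s raw, so the measured figure is about 96% of the theoretical ceiling.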

Final Word

When you need to use LACP and port channels / LAGs for VMware vSphere, using Cumulus Linux on your physical network switches provides an easy way to get up and running quickly and to automate configuration. The above configurations can be used and reused as much as you like. You can also download a free version of Cumulus Linux VX, which lets you simulate the physical network configuration before you apply it to physical switches.

—

This post first appeared on the Long White Virtual Clouds blog at longwhiteclouds.com. By Michael Webster. Copyright © 2012 – 2015 – IT Solutions 2000 Ltd and Michael Webster. All rights reserved. Not to be reproduced for commercial purposes without written permission.

2 Responses · 2015-11-24 19:50:18 · Michael Webster

Doug Youd (@cnidus) September 3, 2016 at 12:05 pm | Permalink

Gday Michael,

Disclaimer: I work for Cumulus Networks.

Generally when I’m doing MLAG-based implementations I try to suggest LACP everywhere if possible. (There are a number of situations where this is not realistic though).

My justification is that it gives the switches a messaging mechanism to the hosts for a variety of topology changes and allows for more intelligent failover. It can also simplify the network configuration (i.e. avoid having to use ifplugd etc).

For example, there are a few failure scenarios in an MLAG:
1) Uplink failure
2) Peerlink (ISL) failure
3) Switch failure.
4) mlag daemon failure.
5) Control-plane failure.
6) Planned maintenance.

These scenarios can have different desired actions. In a lot of cases, if the host is given the appropriate topology change info, it can make the best decision on which links to use and which to drop from the bundle.

We use the “chassis ID/MAC” in the LACP messages to notify on topology change.

For example, in a planned maintenance window, clagd (the mlag process) will notify its peer to assume the ‘primary’ role, shutdown the local daemon gracefully and revert the local chassis-MAC to the default (i.e. non-cluster MAC). The host will then see an LACP bundle with 2 different chassis-mac’s and drop the link to the maintenance-switch (as defined in the LACP standard).

When maintenance is complete, the clagd daemon will start again, form a relationship with the peer, then set the local chassis-mac for the lacp links back to the cluster-ID (same as the peer had active). The host will then see the chassis-mac matching again and gracefully add the link back into the bundle and seamlessly start using it.

tl;dr – My 2c is (if it’s available), LACP config on the host is worth the effort for MLAG topo’s.

Sharing state between host and upstream network: LACP part 3 | GIXtools project April 25, 2017 at 4:58 am | Permalink

[…] is also the reason I chose to write this post, I’ve seen many others describe in detail LBT vs etherchannel/LACP (Nice articles @vcdxnz01, btw), but none that go into much detail […]
