To be fair, this doesn’t come up often, but it has come up a few times in the last few weeks. I’ve seen it often enough, and it traps quite a few admins, especially when host-side ports don’t use LACP. When you have a Cisco VPC configured between your Nexus switches and you do maintenance or have a failure, a switch coming back into service can cause traffic black holes. This is where traffic for one NIC or another turns into a GNDN (goes nowhere, does nothing). I first hit this many years ago when working at one of the world’s largest banks. When we did failover testing everything worked fine, until we brought the switch we had powered down back into service. So what should you do about this?
The first thing to do is check the best practices and the documentation for the switch you are using. The Cisco Nexus 9000 and Nexus 7000 documentation is incredibly useful, and the best practices and configuration guides cover all the different parameters for VPC configuration. The main one to look at is the VPC Domain Delay Restore configuration. This is the timer that holds back the VPCs on a recovering switch so that they don’t come back into service until the switch can actually pass traffic (for example, once its routing tables have had time to converge). The default value is 30 seconds.
That default is often too aggressive. The testing I’ve done shows that a value of 300 or 600 seconds (5 or 10 minutes) is usually ideal. This is especially important when the downstream connections to the servers are not using LACP. In that case you should also take care to avoid orphaned ports and determine whether you need link state tracking.
N7K(config)# vpc domain <domain number>
N7K(config-vpc-domain)# delay restore <1-3600 sec>
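As a rough illustration only, here is what that might look like with the 300 second value mentioned above. The domain number 10 is an assumption made up for this example, so substitute your own domain ID and whatever timer value your own testing supports. The last command simply displays the VPC section of the running configuration so you can confirm the timer was applied.
N7K# configure terminal
N7K(config)# vpc domain 10
N7K(config-vpc-domain)# delay restore 300
N7K(config-vpc-domain)# end
N7K# show running-config vpc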
This post first appeared on the Long White Virtual Clouds blog at longwhiteclouds.com. By Michael Webster. Copyright © 2012 – 2017 – IT Solutions 2000 Ltd and Michael Webster. All rights reserved. Not to be reproduced for commercial purposes without written permission.