6 Responses

  1. gman

    Good read, but in the end this goes back to the basics: don't create vSwitches teamed to the same physical NIC in any situation. This would not have been an issue if that best practice had been adhered to. Just because 10GbE provides enough bandwidth to get by with just a couple of connections doesn't mean you should do it. A single dual-port NIC is a single point of failure, regardless of the fact that it has two ports.

    1. @vcdxnz001

      I agree in general, but in many situations that is not possible, either because the customer you are working with has already decided to go with just the 2 x 10GbE ports on a single dual-port NIC, or because the converged system architecture or server platform (or the cost) doesn't allow it. This article documents quite an extreme case, where a cascading failure occurred because of a fault with the NICs in each host of the cluster. So there is also something to be said for not using identical NICs in every host in a cluster, as that too is a single point of failure. But you still need to keep your hosts as similar as possible in terms of memory, CPUs, etc.

      If the host platform allows it, the ideal is a minimum of two physical dual-port 10GbE NICs and two physical dual-port HBAs. But this is not possible in many cases, and it also increases the cost of the infrastructure, which in many situations might not be justified. At least by reading this I hope people will be conscious of the situation and can make informed decisions that meet their specific business requirements, as every situation is different.
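
As a rough illustration of the teaming rule discussed in this thread, the check below flags any team whose uplinks all sit on one physical card. It is only a sketch: the inventory dictionaries, the `vmnic`-to-card mapping, and the function name `single_card_teams` are hypothetical and are not read from any real host or VMware API.

```python
# Hypothetical inventory: which physical card each uplink port belongs to.
UPLINK_TO_CARD = {
    "vmnic0": "card-A",  # port 1 of the first dual-port NIC
    "vmnic1": "card-A",  # port 2 of the first dual-port NIC
    "vmnic2": "card-B",  # port 1 of the second dual-port NIC
    "vmnic3": "card-B",  # port 2 of the second dual-port NIC
}


def single_card_teams(teams, uplink_to_card):
    """Return the names of teams whose uplinks all share one physical card."""
    flagged = []
    for name, uplinks in teams.items():
        # Collect the distinct physical cards behind this team's uplinks.
        cards = {uplink_to_card[uplink] for uplink in uplinks}
        if len(cards) < 2:
            flagged.append(name)
    return sorted(flagged)


# Hypothetical teaming layout: vSwitch0 teams two ports of the same card.
teams = {
    "vSwitch0": ["vmnic0", "vmnic1"],
    "vSwitch1": ["vmnic0", "vmnic2"],
}

print(single_card_teams(teams, UPLINK_TO_CARD))
```

A host with only one dual-port card will always be flagged, which is exactly the trade-off being debated here.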

  2. babyg

    Interesting article Mike.

    First off, I agree with gman, i.e. you should never team on the same physical NIC (dual-port, quad-port or not). If that was the case here, it is simply poor practice (bad customer).

    I do not agree with the statement that you should have different types of NIC in a host, especially as we move towards converged adapters embedded on-board. In some systems (Cisco UCS, for example) you only have one choice and can't have different types. The point of buying enterprise-ready hardware is that you are paying to mitigate these types of issues, and limiting the number and types of NICs has operational benefits going forward. If you were to take that thinking to the next level, you would end up having AMD-based HPs and Intel-based IBMs "just in case". Yeah, OK, a little extreme, but you get my point.

    Interesting about the SD card needing to be the first boot device. Weird. This means that if you want to update the BIOS via CD, iLO, etc. in the future, you will need to change the boot source and then change it back to SD. I've never seen this issue myself, so I'm interested in which HP box you were dealing with.

    Good article, good to read about others' experiences.

    1. @vcdxnz001

      The systems we were dealing with were DL380 G7s.

      The risk you take with a converged infrastructure with everything on-board is that if there is a firmware or driver problem you have no solution available other than relying on the vendor to fix it, which could take weeks or months. This has proven time and again to be a very risky proposition, as some of the most popular vendors have had major firmware and driver problems with their devices. This can of course be mitigated by very thorough testing of the infrastructure before going into production, but you won't be able to anticipate and test every eventuality. If there is only the one on-board multi-port card, then you can't team across cards, which isn't ideal, as I think we all agree.

      So for the benefit of those operational gains and the lower costs, you are increasing the risk and impact of some failures. I definitely would not consider two different CPU architectures, but it might be worth considering two different brands of server in a very large cluster if you want to mitigate this risk and having multiple different NICs in the server isn't an option. If NIC driver and firmware problems were rare and their consequences weren't so bad, we wouldn't have to worry about it. Unfortunately that's not the reality. If you want to build a robust, always-on infrastructure to support business-critical applications, you have to reduce both the risk of failure and the impact of failure.

      On the subject of CNAs specifically, if you lose a dual-port CNA in the server (assuming there is only one), not only have you lost your network but you've also lost your access to storage. The consequences of this can be catastrophic. So this particular scenario needs careful consideration depending on the workloads being virtualized.

      I'm not against any particular architecture or system, as I've deployed most of them, and I've used converged infrastructures where they met the customer's requirements. But you always need to be aware of the risks you are taking and take them knowingly. In many environments the risks or impacts are insignificant compared to the benefits, but that might not hold when you're virtualizing business-critical applications. Every environment and situation is different, and our job is to provide the most cost-effective and efficient infrastructure possible that meets the unique business requirements.

  3. babyg

    Have you deployed Cisco UCS yet?

    Cisco UCS is gaining quite a bit of traction here in the northern hemisphere, and of course all HP G7 kit is gearing towards CNA-type stuff, plus there is Cisco Nexus for HP (finally) now. UCS is great kit, though more of a "one size fits all" type of approach. Unfortunately UCS blades include only a single slot for a CNA in their standard blade offering (note that, unlike HP, you can't add a second mezzanine HBA or a second mezzanine NIC), thus the issue described here is a very real possibility. Good article, good insight. I'm afraid bean counters read the glossy "FCoE to solve global warming" type stuff and don't see the value in separate physical management, etc. As you say, horses for courses.

    PS: I've changed my current deployment from Cisco UCS 230M2s to Cisco 440M2s to mitigate single dual-port CNA issues…

  4. The Achilles Heel of the VMware Distributed Switch « Long White Virtual Clouds

    […] problem to occur, such as a full site power outage or a storage network failure. My article titled When Management NIC’s Go Down is a good example of the type of failure, other than a full power outage, that could cause these […]
