14 Responses

  1. When should N+1 not Equal 1 Host with VMware HA? | Long White Virtual Clouds

    […] http://longwhiteclouds.com/2012/05/13/when-should-n1-not-equal-1-host-with-vmware-ha/ […]

  2. joe@nomaen.com

    But… Seriously, how many times have you seen a blade chassis fail?

    1. joe@nomaen.com

      Sorry, I should elaborate. I understand the necessity to design for failures of all types, but personally, while I have seen total site failure occur several times with multiple customers, I have never seen a blade chassis fail…

      1. Duncan Epping (@Dunc

        I have seen it 3 times so far… Not a pretty sight, I can tell you that. It is indeed uncommon, though, and you will need to decide for yourself whether you want to take a failure like that into account.

      2. @vcdxnz001

        Agreed, not every environment will justify or require this level of resiliency; it should be driven by business requirements and also cost considerations. Hopefully the infrastructure can become more location aware, and perhaps a future release will take some of this location awareness into account in HA and DRS rules. As we drive more efficiency and consolidation through cloud, the number of environments that need to consider chassis failure will undoubtedly increase, especially as clusters become ever larger.

      3. Paul Kelly

        Every major customer (1000+ users) I have worked for has experienced at least one chassis failure. Chassis introduce an ugly single point of failure issue that needs to be catered for. If you haven't had this experience then I would say you have been very lucky so far.

    2. @vcdxnz001

      A number of times, and for a few different reasons. In the worst case, a single management domain was created across two HP C-Class blade chassis, and an upgrade of the management firmware caused both chassis to become unavailable, along with all hosts within them. I have also seen serious configuration errors on blade chassis cause complete blade unavailability within the chassis. So although it should be rare, I have seen it quite a few times. I have also seen multiple instances, just in the last 6 months, where a single rack lost power due to maintenance or configuration issues. It's not just about hardware failure, but also configuration issues, maintenance, and upgrades. In an environment that requires high availability, these considerations are important. Bottom line: it's not as rare as it should be, and we have to plan and design to eliminate as many single points of failure as possible, within reasonable cost objectives dependent on customer requirements.
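      The capacity math behind this is worth making explicit. A minimal sketch of why HA spare capacity should cover the largest failure domain rather than a single host — all host counts below are made-up examples, not from any environment described in this thread:

```python
# Sketch: when a whole blade chassis is the failure domain, HA's
# "N+1" reserve should cover the largest chassis, not just one host.
# The chassis layouts below are hypothetical.

def ha_spare_hosts(chassis_layout, tolerate="chassis"):
    """Hosts' worth of spare capacity HA needs.

    chassis_layout -- list of host counts, one entry per chassis
    tolerate       -- "host" for classic N+1, "chassis" to survive
                      the loss of the largest single enclosure
    """
    if tolerate == "host":
        return 1
    # Worst case: the biggest enclosure fails as a unit.
    return max(chassis_layout)

# 16 hosts split 8/8 across two chassis:
print(ha_spare_hosts([8, 8]))           # 8 -- half the cluster reserved
print(ha_spare_hosts([8, 8], "host"))   # 1 -- classic N+1
# The same 16 hosts over four chassis shrink the reserve:
print(ha_spare_hosts([4, 4, 4, 4]))     # 4
```

      Spreading the same number of hosts over more enclosures shrinks the reserve needed to survive a chassis loss, which is exactly the trade-off between resiliency and cost discussed here.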

  3. FY

    If the rack could expose the chassis information, that would be easier.

  4. Paul Kelly

    Great writeup Mike, well done. It is amazing how you seem to put pen to paper on stuff that is currently floating around in my head.

  5. Mark

    Thanks for this article. Can you elaborate at all on the scrutiny you gave to the storage layer with regard to it also being a failure point? I.e., we can arrange a layout on the servers to reduce risk from a chassis failure, but below that we may have a storage layout that undermines or contradicts efforts at the chassis layer.

    1. @vcdxnz001

      Hi Mark, ideally the same consideration would be given to the storage layer to ensure availability. In the case of one of my customers there are multiple storage arrays for different parts of the infrastructure. In many cases there may only be separation of storage between management and resource clusters. It will depend on the type of storage array selected, but in a lot of cases the storage layer already has multiple levels of redundancy and resiliency built in. In one customer use case they are deploying their VMs with in-guest disk mirroring between two storage arrays to mitigate the risk of failure of one array; this is for an application that doesn't scale horizontally. Configuration risks on the storage still exist and also need to be mitigated. Availability and scalability at all levels of the infrastructure and applications need to be considered, driven by business requirements.

  6. Mike Sheehy

    The bottom line is that it's rare you can address every failure scenario. We can go on and on, and in most cases, at some point, there will still be a single point of failure; however, that doesn't mean you shouldn't mitigate risk at each layer.

    In my environment I have two Dell m1000e blade chassis and have spread my MGMT/Resource clusters across both enclosures, with two Cisco 3750s in a stack and a couple of Brocade FC switches; however, they connect to a single VNX5500, albeit with dual controllers. Even at the storage layer I have my pools spread across my DAEs, with hot spares, and my EFDs for FAST are spread across DAEs as well.

    This doesn't mean that I'm completely protected from a storage failure, but I took the necessary steps where I could to mitigate risk.
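    The spreading described above can be sketched as a simple round-robin placement of cluster nodes across enclosures; the enclosure and host names below are hypothetical, not the actual environment:

```python
# Sketch: deal cluster nodes out round-robin across enclosures so a
# single chassis failure takes out at most ceil(n / chassis_count)
# nodes of any one cluster. All names here are hypothetical.

def spread_across_chassis(nodes, chassis):
    """Return {chassis: [nodes]} with nodes placed round-robin."""
    placement = {c: [] for c in chassis}
    for i, node in enumerate(nodes):
        placement[chassis[i % len(chassis)]].append(node)
    return placement

layout = spread_across_chassis(
    ["mgmt01", "mgmt02", "res01", "res02", "res03", "res04"],
    ["m1000e-A", "m1000e-B"],
)
for enclosure, hosts in layout.items():
    print(enclosure, hosts)
```

    With this placement, losing either enclosure leaves one management node and half the resource nodes running, which is the intent of splitting the MGMT/Resource clusters across both chassis.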

  7. JPerkins

    EXCELLENT article!!!

  8. Simon

    Great article, and it's exactly where we are at the moment. In reference to chassis failure, we have experienced one: the chassis didn't fail outright, but the backplane needed a swap-out due to a bent pin. We also have the same problem in another chassis but are living with that issue at the moment.

    My main concern for chassis outage is down to firmware upgrades on the VC cards. If the upgrade doesn't go well, or you have a design flaw in your architecture, you can suffer an outage and end up with a split-brain scenario in your farm. Also, the three maxims highlighted are important considerations; people do make mistakes and software may have bugs. To add to this, consider your application software and make sure the application architecture fits the DRS design and is also tolerant of vMotion. We have been in the position where an application cluster was tweaked too aggressively and did not survive the vMotion. Application experts will from time to time make changes and not apply any logic to what has been implemented at the hardware layer.
