23 Responses

  1. joshatwell

    Great post! I was wondering whether you had seen any issues, or had any additional information, about clock speed throttling when doing a large number of vMotions, as mentioned by Eric Siebert: http://de.twitter.com/ericsiebert/status/21051008

    In our environment with 10G interfaces, when we put hosts into maintenance mode, one of our monitoring applications shows huge spikes in response time to the VMs on those hosts. It is safe to say that response time is impacted. No pings are dropped, but the impact on latency-sensitive applications is certainly evident.

    I haven't yet had an opportunity to dig into this more deeply, so I thought I would see if you already have additional information on the subject. Thanks again for the post!

    1. @vcdxnz001

      Hi Josh, the technology you're referring to is SDPS (Stun During Page Send). It's only used when the page dirty rate of the VMs is faster than the rate at which pages can be copied to the destination host. SDPS only kicks in when a vMotion might otherwise fail. It's a failsafe mechanism. So if you have sufficient bandwidth and you're not migrating workloads with a page dirty rate that would exceed the 10Gb/s bandwidth, then you shouldn't be seeing any issues with SDPS. You may want to consider allowing more bandwidth for vMotion, depending on the impact you're seeing on particular applications, or reducing the concurrency of vMotion operations by modifying some advanced settings. Any changes should be tested.

      However, if you are using QoS or vNICs in a UCS platform, then you may have less vMotion bandwidth than necessary, and in some cases SDPS may kick in. But it only slows down the CPU of a VM for a few milliseconds at a time. This should not be enough to cause any significant response time impact. If you have very latency-sensitive applications, then they may require special attention and tuning.
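The relationship between page dirty rate and copy bandwidth can be illustrated with a toy model. This is only a sketch of iterative pre-copy in general, not VMware's actual algorithm: the function name, the stop threshold and the pass limit are all assumptions made for illustration. Each pass sends whatever memory is still outstanding, and the pages dirtied during that pass become the work for the next one.

```python
def precopy_rounds(memory_gb, dirty_rate_gbps, link_gbps,
                   stop_gb=0.05, max_rounds=50):
    """Return the number of pre-copy passes needed, or None if the copy
    never catches up. Hypothetical model, not the vMotion implementation."""
    remaining_gbit = memory_gb * 8.0   # convert GB to Gbit
    stop_gbit = stop_gb * 8.0          # small enough for the final switchover
    for rounds in range(1, max_rounds + 1):
        seconds = remaining_gbit / link_gbps         # time to send this pass
        remaining_gbit = dirty_rate_gbps * seconds   # pages dirtied meanwhile
        if remaining_gbit <= stop_gbit:
            return rounds
    return None  # dirty rate >= copy rate: the case a failsafe like SDPS covers


print(precopy_rounds(64, 2, 10))    # dirty rate well below the link rate
print(precopy_rounds(64, 12, 10))   # dirty rate above the link rate
```

In this model the outstanding memory shrinks by the ratio of dirty rate to link rate on every pass, so it converges only when the dirty rate is below the usable vMotion bandwidth. Anything that reduces that bandwidth (QoS limits, more concurrent vMotions) moves a workload closer to the non-converging case.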

      Generally things just work, but business-critical and latency-sensitive applications do require a different approach and more care and attention. Let me know what you find when you dig deeper, and if necessary get VMware involved. I'd be keen to see what you come up with and exactly what the situation is.

  2. Clearing up a misunderstanding around CPU throttling with vMotion

    […] was reading a nice article by Michael Webster on multi-nic vMotion. In the comment section Josh Attwell refers to a tweet by Eric Siebert around how CPUs are […]

  3. Paul Kelly

    Great article. Just wondering, though: I'm not sure how to configure this to work with a vDS using LBT, NIOC and SIOC.

    Do you need a VMkernel port for each chunk of 10GbE you want to allow for vMotion?

    If you wanted to guarantee 50% of the 4 uplinks to allow roughly 20Gbps of vMotion traffic, would you configure the environment something like this:

    vDS1 – LBT, SIOC, NIOC

    Management – 5 NIOC Shares, active uplinks 1,2,3,4

    VM Networking – 20 NIOC Shares, active uplinks 1,2,3,4

    iSCSI – 20 NIOC Shares, active uplinks 1,2,3,4

    FT – 15 NIOC Shares, active uplinks 1,2,3,4

    vMotion1 – 30 NIOC Shares, active uplinks 1,2,3,4

    vMotion2 – 30 NIOC Shares, active uplinks 1,2,3,4
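The share values above can be checked with some quick arithmetic. This is a rough sketch under simplifying assumptions: every traffic type is saturating the links at the same time, and the four 10GbE uplinks are treated as one 40Gbps pool, whereas shares are, as far as I understand it, enforced per physical uplink and only under contention. The function and variable names are made up for illustration.

```python
from fractions import Fraction

UPLINKS = 4
UPLINK_GBPS = 10

# Share values exactly as listed in the comment above.
shares = {
    "Management": 5,
    "VM Networking": 20,
    "iSCSI": 20,
    "FT": 15,
    "vMotion1": 30,
    "vMotion2": 30,
}


def worst_case_gbps(shares, total_gbps):
    """Bandwidth each traffic type is left with when all types contend."""
    total = sum(shares.values())
    return {name: Fraction(s, total) * total_gbps for name, s in shares.items()}


floor = worst_case_gbps(shares, UPLINKS * UPLINK_GBPS)
vmotion_gbps = floor["vMotion1"] + floor["vMotion2"]

for name, gbps in floor.items():
    print(f"{name}: {float(gbps):.2f} Gbps")
print(f"vMotion combined: {float(vmotion_gbps):.2f} Gbps")
```

The shares total 120, of which the two vMotion port groups hold 60, so under this model vMotion keeps half of the 40Gbps (20Gbps) in the fully congested case and can use more whenever other traffic is idle.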

    1. @vcdxnz001

      Hi Paul,

      vMotion1 – Active Uplink 1, all others Standby or Unused
      vMotion2 – Active Uplink 2, all others Standby or Unused

      All other port groups, except management, should have all uplinks active, using Route based on physical NIC load. Best practice for management is active/passive with failback set to No, so that it always stays on the same side of the switch fabric and doesn't flip-flop around. This reduces the chances of false isolation events a bit more.
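As a way of sanity-checking a layout like this before applying it, the teaming can be written down as plain data and validated. This is a hypothetical sketch, not a vSphere API: the dictionary structure and the function name are invented, "all others" is shown as standby rather than unused, and the management uplink choice is an assumption, since the reply does not say which uplinks to use.

```python
# Hypothetical description of the teaming layout from the reply above.
port_groups = {
    "vMotion1": {"active": [1], "standby": [2, 3, 4]},
    "vMotion2": {"active": [2], "standby": [1, 3, 4]},
    "Management": {"active": [1], "standby": [2], "failback": False},  # uplinks assumed
    "VM Networking": {"active": [1, 2, 3, 4], "standby": []},
    "iSCSI": {"active": [1, 2, 3, 4], "standby": []},
    "FT": {"active": [1, 2, 3, 4], "standby": []},
}


def check_vmotion_layout(port_groups):
    """Return a list of problems with the vMotion port groups (empty if none)."""
    problems = []
    seen = {}  # active uplink -> vMotion port group already using it
    for name, pg in port_groups.items():
        if not name.startswith("vMotion"):
            continue
        active = pg["active"]
        if len(active) != 1:
            problems.append(f"{name}: expected exactly one active uplink, got {active}")
            continue
        uplink = active[0]
        if uplink in seen:
            problems.append(f"{name}: shares active uplink {uplink} with {seen[uplink]}")
        seen[uplink] = name
    return problems


print(check_vmotion_layout(port_groups))
```

The check encodes the two points of the reply: each vMotion port group has exactly one active uplink, and no two vMotion port groups share the same active uplink.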

      No need to configure LBT on the vMotion port groups; the other VMs will move around them. NIOC can still be used to control the quality of service. I generally just use the Low, Normal and High share levels without specifying custom values. It's all relative, as long as the important traffic types get the bulk of the shares when there is congestion. LBT and NIOC sort out most p