My main mission in my work and in my blog is to help customers successfully virtualize business critical apps without compromising SLA's, while reducing risk and gaining the economic and operational benefits of virtualization. I very often see customers attempting to virtualize their business critical apps using exactly the same approach they took with dev/test and tier 2 and 3 applications. The result is that they often struggle, and sometimes their projects are not successful. Achieving success when virtualizing critical apps takes a different approach. Often attempting these important projects yourself is more costly than bringing in the right team that has done it all before. The objective of this article is to bust some of the myths that surround virtualizing business critical apps and give you some ideas on how to approach such projects so that you can increase your chances of success. This is by no means exhaustive, but it will give you a good starting point. I will share some of the secrets to success that I've built up over hundreds of successful vBCA projects and from lessons learned from projects I've seen go off the rails.
Thanks to the wonders of modern technology this article comes to you from 40,088ft somewhere above Australia on a Singapore Airlines A380-800 from Sydney to Singapore. This is the quietest and most comfortable plane I've ever flown in. I'm on my way to Singapore to assist with an Architect Bootcamp where I'll be training 50 of ASEAN's top architects on the important aspects of virtualizing business critical applications.
Defining Business Critical Applications
Before I start busting myths and revealing some of my secrets, let's define what we mean by business critical apps, and what makes an app critical:
- Virtual Desktop Environment – If it supports all users
- ERP systems and supporting databases and middleware
- Manufacturing, Power Grid Management Systems, Process Automation and Control Systems – SCADA
- Financial systems, payment processing, online banking
- Middleware and ESB systems
- Billing systems
- Customer facing online systems, e-commerce systems
- Medical systems
- Security or door access control systems at secure facilities, such as airports or military bases
- High performance computing grid’s for military, finance or biotech research and operations
- Firstly, a virtual CPU (vCPU) only has one thread, whereas a physical core can have two threads when hyperthreading is enabled.
- Network access between VM’s on the same host that are connected to the same virtual network happens at memory speed. If you have two VM’s that interact on the network a lot it might make sense to group them together.
- Fewer virtual CPU's can mean better performance, especially where there are many VM's running on a host. This is because a VM with fewer vCPU's has a better chance of having each vCPU scheduled on a physical processor thread. Oversizing VM's can lead to performance problems. Start smaller, as you can always add vCPU's later once you've verified performance.
- Because modern x86 servers support NUMA (Non-Uniform Memory Access) you should aim to size your VM's, if possible, to fit within a NUMA node. For example, if your physical CPU sockets have 8 cores, your VM's will be configured optimally if they have 1, 2, 4 or 8 vCPU's. It's easy to work out your NUMA node size by dividing the number of physical CPU cores and the total amount of memory by the number of physical CPU sockets (a small sizing sketch follows after this list). You should aim to size your VM's memory to be less than the size of the NUMA node, and ideally less than half the total memory on the host.
- Due to the multiple layers of abstraction in the storage stack of a virtual machine it is best to use the simplest IO scheduler. For Linux systems the IO scheduler (elevator) should be set to NOOP; the sketch after this list shows one way to check which scheduler is active. Also, because of those layers of abstraction, the data will effectively always be fragmented. This is normal and expected, and defragmenting your systems would not be a good idea and would likely cause performance issues.
- Treating virtualization as a magical black box and just expecting any workload to run with half of the resources it needs. Virtualizing isn’t a magical black box. It is still bound by the laws of physics. If your workload really needs 6 vCPU or 8 vCPU and 96GB RAM at times then you better make sure it gets it when it needs it. There are plenty of benefits to virtualizing over and above just consolidation ratios. In fact when virtualizing business critical apps consolidation ratios are the least important factor. Reducing risk, ensuring availability and performance, and greatly simplifying DR, performance management and capacity planning are generally higher up the list of priorities. Ultimately you need to ensure that your physical hardware and infrastructure underpinning your virtualization solution has the capability to meet your objectives. If you buy cheap slow hardware don’t expect it to perform like a rocket when you put your virtual machines on it. This is really just common sense.
- Failing to baseline or properly evaluate and record the performance and other requirements of the source system, or of the new system that is being developed. Not having objective measures of performance and other SLA's falls into this category too. If you haven't got a baseline and you haven't got clearly documented and agreed objective business and technical metrics and requirements, you will find it almost impossible to achieve success. A gut feeling about whether something is working OK is not sufficient when dealing with business critical apps. This leads us to the next point.
- Failing to verify that the performance, availability and other business requirements can actually be met and that you have the right infrastructure to meet them. Each component of the solution needs to be able to meet its performance, availability and other business objectives. Testing per component and as an integrated solution will prove whether the solution works as expected. If each component meets its objectives then logically so should the integrated whole.
- Insufficient planning for risk and disaster scenarios prior to, during, and after migration.
- Making a solution more complicated than it needs to be, or designing a technical solution that is not supported by business requirements. For example, implementing a metro stretched cluster solution when there is no business justification for it and where alternative solutions would be simpler, less costly and still meet the requirements. As a lot of high availability features, such as VMware HA, are already built into the base VMware vSphere platform, you may not need in-guest clustering solutions in some cases. This can greatly simplify your solution and its operations.
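To make the NUMA sizing and IO scheduler points above a bit more concrete, here is a minimal Python sketch. The host values and VM sizes in it are purely hypothetical examples, not measurements from any real system: it works out the NUMA node size by dividing physical cores and total memory by the socket count, checks whether a proposed VM fits within a single node, and reads the currently active Linux IO scheduler (elevator) for a block device from sysfs. Treat it as a starting point, not a hardened tool.

```python
# numa_and_elevator_check.py - all host and VM figures below are hypothetical examples.

def numa_node_size(sockets, total_cores, total_mem_gb):
    """NUMA node size = physical cores and total memory divided by the socket count."""
    return total_cores // sockets, total_mem_gb / sockets


def vm_fits_numa_node(vm_vcpus, vm_mem_gb, sockets, total_cores, total_mem_gb):
    """True if the VM's vCPU count and memory both fit inside a single NUMA node."""
    node_cores, node_mem_gb = numa_node_size(sockets, total_cores, total_mem_gb)
    return vm_vcpus <= node_cores and vm_mem_gb <= node_mem_gb


def active_io_scheduler(device="sda"):
    """Read the active Linux IO scheduler (elevator) for a block device from sysfs.

    The active scheduler is shown in square brackets, e.g. 'noop [deadline] cfq'.
    """
    with open(f"/sys/block/{device}/queue/scheduler") as f:
        line = f.read().strip()
    return line.split("[")[1].split("]")[0] if "[" in line else line


if __name__ == "__main__":
    # Hypothetical 2-socket host: 16 cores and 256GB RAM in total.
    sockets, cores, mem_gb = 2, 16, 256
    node_cores, node_mem = numa_node_size(sockets, cores, mem_gb)
    print(f"NUMA node size: {node_cores} cores, {node_mem:.0f}GB")

    # Hypothetical candidate VM sizes: (vCPUs, memory GB).
    for vcpus, vm_mem in [(4, 64), (8, 96), (12, 160)]:
        fits = vm_fits_numa_node(vcpus, vm_mem, sockets, cores, mem_gb)
        print(f"VM with {vcpus} vCPU / {vm_mem}GB fits in one NUMA node: {fits}")

    try:
        print(f"Active IO scheduler on sda: {active_io_scheduler('sda')}")
    except OSError:
        print("Could not read the IO scheduler (not Linux, or a different device name)")
```

On the hypothetical 2-socket, 16-core, 256GB host above the NUMA node works out to 8 cores and 128GB, so the 12 vCPU / 160GB example VM would span nodes and would be a candidate for resizing.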
You can't take a plug and pray approach with Business Critical Apps. To ensure their SLA's, predictability and low risk you need to take a very methodical and disciplined approach to the project and have objective measures to meet that can be verified.
Secrets to Successful vBCA Projects
As I said previously, you need a methodical and disciplined approach to virtualizing business critical apps. It is much more of an applications or software development (SDLC) type project than a pure infrastructure project, especially as even the hardware is now software once it's virtualized. This is particularly true if the project involves migration from a traditional Unix system, which may involve porting of code or software redevelopment if it's not a standard commercial off-the-shelf (COTS) product. So here are some tips, or secrets, to successfully virtualizing business critical apps:
- Clearly document all the important business requirements and when doing the architecture and solution design make sure you have traceability of design decisions back to the business requirements that they support.
- Baseline the source environment and record metrics that are objective and a valid representation of system performance and availability as it impacts the end users (a minimal response-time sampling sketch follows after this list). How you achieve this is up to you, but it constantly surprises me how many customers don't bother to baseline or evaluate and record the baseline performance, availability and other important requirements of their source systems prior to virtualizing them. Every time I have seen this, without exception, the projects have run into problems. If you don't have objective, accurate and valid metrics of source system performance, how can you verify that the system will at least achieve, if not exceed, the prior state, which is our goal? Another point here is that a CPU utilization of 40% or 50% might not make any difference so long as the end user response times and scalability objectives are met. So any evaluation must be in an application metrics context rather than just infrastructure utilization metrics.
- Test your infrastructure! Verify it's meeting all of the business requirements, design criteria, performance, availability, security and recoverability metrics (and other metrics) that are important for your project. If you don't test it, how will you know it'll work when you need it most? I recommend a risk based approach to testing so that you get the most coverage of the most important things for the least amount of effort and time. It will be up to you to decide what and how much to test. I will give you some more ideas on this below.
- Test your applications, both component based and integration based testing, including testing of the combined workload of multiple application instances or VM’s on a host. You should also test your availability and recoverability methods while testing the applications under load.
- Prepare for the worst! Testing and verifying everything is not a one time process; it is an ongoing process. You need to have well defined plans that are tested and verified to work. This is especially true when it comes to DR, and it needs to cover every component. You need to test normal operating scenarios as well as scenarios where things go wrong. Getting to know how the system behaves during failures and disasters will give you a lot more confidence. VMware Site Recovery Manager can provide you with an automated recovery process that is auditable and testable without disruption to production. Recovery and failure testing should include security and compliance requirements to ensure that your systems are secure and compliant even when recovered after a disaster. Also bear in mind that most real disasters in a datacenter are man made, not the result of a natural phenomenon.
- Test your migration methodology and your roll back plans. Before you do a production migration for real you should test your migration methodology. During this process you should be timing it, and although it's a test you should be making it as real and valid as possible. I also recommend a pilot or proof of concept in most cases. You should also test and verify your roll back process and have clear criteria about when, how and under what conditions a roll back would be initiated.
- Follow VMware's and your vendors' best practices for architecture design and for the applications that you're migrating, at least as a baseline to start from. Best practices are created over numerous projects and are the best place to start in the absence of any special requirements that might cause you to modify them. Some best practices might not be valid for your environment and you may need to create your own, but at a minimum all previous best practice documents should be reviewed during your project. I often see people having trouble with databases and applications when they have not even bothered to read the best practice documentation that would have prevented the problem in the first place.
- Make sure the applications teams and end users are core members of your project team and that they have input into design, testing, and migration methods. This will not only help get their buy-in, it'll increase the chances that you cover all important aspects of the application migration, and they will get an understanding of how the applications will behave once virtualized, even when things are going wrong.
- Don't overcommit resources too aggressively without having observed the system's performance and behaviour once it has been virtualized. You should consider allocating no more than one vCPU per logical host CPU or thread to start with, and not overcommitting memory on the hosts in your VMware clusters, until you properly understand usage patterns and have real data to base decisions on (see the overcommitment ratio sketch after this list). You can increase your system utilization safely and get better overall consolidation by grouping systems that need different resources onto the same host. Fortunately most resource scheduling decisions are made automatically by VMware features such as VMware Distributed Resource Scheduler (DRS). For systems that need high storage performance and low latency you need to make sure they are configured with enough virtual SCSI controllers, sufficient virtual disks, and access to sufficient physical storage devices to get the queue depth and parallelism they need.
- Make use of all of the features of VMware vSphere to ensure quality of service for your critical applications and prevent impacts from noisy neighbours. For example, using Network IO Control, Load Based Teaming, Storage IO Control, VMware HA and VMware DRS can give you much more predictable performance and improved quality of service, over and above what you could get when the applications were physical.
- Keep risk low. Plan for risk and have detailed risk mitigation plans that are documented and agreed with all key stakeholders. Identifying and mitigating risks and their impacts will be an ongoing process during the project and even after the project is complete and in operation. In some cases the mitigation plans will also need to be tested and verified (depending on impact). Virtualizing without compromising is the name of the game, and reducing risk is one of the most important objectives. There is no point virtualizing if, through a flawed process, flawed design, or lack of testing, you introduce more risk that has a severe business impact. This could well outweigh any possible benefits if you don't plan and execute your projects carefully. Done well, of course, the results and benefits are substantial.
- Migrate low risk systems first; prove the processes and gain confidence before moving on to higher risk systems. Once you have proved multiple times over that your design, your process and the results are in accordance with your objectives, it will be much easier to migrate higher risk systems. You would normally start with dev systems, then test systems, then pre-production, before finally migrating the actual production system.
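To illustrate the kind of application-level baseline described above (end user response times rather than just infrastructure utilization), here is a minimal, stdlib-only Python sketch that samples response time against a hypothetical application URL and reports simple percentiles. The URL, sample count and pacing are invented for illustration; in a real project you would use a proper monitoring or APM tool and metrics agreed with the application owners.

```python
# baseline_response_time.py - minimal response-time baseline sketch.
# The URL, sample count and pacing below are hypothetical examples.
import statistics
import time
import urllib.request

APP_URL = "http://app.example.internal/health"  # hypothetical application endpoint
SAMPLES = 50                                    # arbitrary number of samples
PACING_SECONDS = 1                              # simple fixed pacing between samples


def sample_response_time_ms(url):
    """Time a single request to the application and return the elapsed milliseconds."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=10) as response:
        response.read()
    return (time.perf_counter() - start) * 1000.0


def main():
    results = []
    failures = 0
    for _ in range(SAMPLES):
        try:
            results.append(sample_response_time_ms(APP_URL))
        except OSError:
            failures += 1  # count failed requests as unavailability, not latency
        time.sleep(PACING_SECONDS)

    print(f"samples={SAMPLES} successful={len(results)} failed={failures}")
    if len(results) >= 2:
        p95 = statistics.quantiles(results, n=20)[-1]  # 95th percentile cut point
        print(f"median={statistics.median(results):.1f}ms "
              f"p95={p95:.1f}ms max={max(results):.1f}ms")


if __name__ == "__main__":
    main()
```

Whatever you record, the important thing is that it is agreed as a valid representation of the end user experience, so it can be compared like for like after the migration.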
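And to make the conservative "no more than one vCPU per logical host thread to start with" guideline from the overcommitment point easy to sanity check, here is a small sketch that computes the vCPU-to-logical-CPU and memory allocation ratios for a cluster. The host and VM figures are made up for illustration; a real check would also account for HA failover capacity and virtualization overheads.

```python
# overcommit_check.py - hypothetical cluster and VM figures for illustration only.

def cluster_ratios(hosts, vms):
    """hosts: list of (logical_threads, memory_gb) per host.
    vms:   list of (vcpus, memory_gb) per VM.
    Returns (vCPU : logical CPU ratio, VM memory : host memory ratio)."""
    total_threads = sum(threads for threads, _ in hosts)
    total_host_mem = sum(mem for _, mem in hosts)
    total_vcpus = sum(vcpus for vcpus, _ in vms)
    total_vm_mem = sum(mem for _, mem in vms)
    return total_vcpus / total_threads, total_vm_mem / total_host_mem


if __name__ == "__main__":
    # Four hypothetical hosts, each 2 sockets x 8 cores x 2 threads = 32 threads, 256GB RAM.
    hosts = [(32, 256)] * 4
    # Hypothetical business critical VMs placed on the cluster: (vCPUs, memory GB).
    vms = [(8, 96), (8, 96), (4, 64), (4, 64), (16, 128), (8, 64)]

    cpu_ratio, mem_ratio = cluster_ratios(hosts, vms)
    print(f"vCPU : logical CPU ratio = {cpu_ratio:.2f}")
    print(f"VM memory : host memory  = {mem_ratio:.2f}")
    if cpu_ratio > 1.0 or mem_ratio > 1.0:
        print("Above the conservative 1:1 starting point suggested for business critical apps")
    else:
        print("Within the conservative 1:1 starting point")
```

In practice you would subtract the capacity of any hosts reserved for HA failover before working out the ratios, which is one more reason a conservative starting point pays off.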
Testing Coverage
In terms of what you should test and your testing coverage, here are some of the areas I would recommend you consider when planning out your vBCA project testing strategy and plans:
- Pilot and Design Verification Testing
- Has the design been implemented as expected?
- Does the migration process work as expected?
- System and Operation Testing
- Does the application and the full solution function as expected?
- Do the maintenance and operational aspects of the design work as expected?
- Availability and Recovery Testing
- Do individual infrastructure and application components behave as expected when components fail?
- Do the business continuity and availability aspects of the infrastructure and applications work as expected under various disaster scenarios?
- Performance and Scalability Testing
- Does the solution meet the performance SLA’s for applications and infrastructure?
- What is the saturation point and headroom of the design and individual components, and what is the sweet spot for scalability?
You may also want to do application regression testing, integration testing, and of course User Acceptance Testing. I would recommend a risk based approach so that you can cover the most important areas thoroughly but without unnecessary effort required to test areas that are low impact. You will need to decide how much testing and what testing is required to give you the desired level of comfort and to verify that your requirements and objectives are being met. You won’t know if you’ve met your objectives unless you’ve tested and verified them. So test thoroughly and test often.
Final Word
Virtualizing Business Critical Applications successfully requires a disciplined and methodical approach that reduces and manages risk and a higher level of assessment, testing and verification. The reason for this is that the impacts are a lot higher if the applications are not meeting their objectives. A revenue generating system that becomes unavailable or where performance is severely degraded will likely have immediate consequences, and could even jeopardise the future viability of the business. A medical system or military system going down could mean the difference between life and death. By approaching a vBCA project in the right way and through thorough planning and testing you can achieve better SLA’s for your applications, with lower risk, higher availability, and very often much higher performance. If you do it wrong you might find yourself updating your CV. vBCA is about virtualizing without compromise. No compromise to SLA’s, no compromise to performance, no compromise to risk.
—
This post first appeared on the Long White Virtual Clouds blog at longwhiteclouds.com, by Michael Webster. Copyright © 2013 – IT Solutions 2000 Ltd and Michael Webster. All rights reserved. Not to be reproduced for commercial purposes without written permission.
Wow, that was a great article. Thanks for providing us with your in-depth reasoning and also that myth-busting section. I was wondering from your perspective, what are the overall top 5 business-critical applications that businesses are trying to move to the cloud? I'm interested to hear your answer since it seems you have a vast range of experience with different types of businesses. Thanks Michael!
Hi Courtney, sorry for the late response. It's hard to say what the top 5 applications moving to the cloud are. The cloud needs to be defined as public or private, and then there's the question of whether we're talking about traditional business critical apps or newly developed ones. Assuming traditional COTS apps, SAP is definitely one, as is anything Java or web related, and email. But it varies a great deal and it would be almost impossible to tie it down.
Another great post Michael.
To me it seems like there is still a stigma associated with virtualizing BCAs simply because the platform on which they're running is virtual. Like you've mentioned in your article, assuming appropriate planning, design and implementation, there is rarely a reason to go physical at this point.
What I'm also interested in hearing from you is from your experience, do people tend to rebuild their vSphere environments or do they build a separate environment to support BCAs?
Hi Greg, there are many things to consider, but in most cases a cluster or group of clusters will be purpose designed and built to meet the specific requirements of the business critical applications, if those requirements can't already be met by the existing design and configuration. This can be for a number of reasons, and it is generally preferred because customers usually want a lower consolidation ratio and lower overcommitment levels in their business critical application clusters than in their other clusters. It can also be due to licensing of the applications, to provide a different host platform, or because a large number of LUNs are needed. But this may not always be the best option, as having lots of clusters can introduce additional management overhead. Bigger clusters are generally more efficient to manage and a smaller percentage of resources has to be set aside for failures. The main difference is that the particular application requirements are evaluated in more detail and the solution is designed to meet each application's specific requirements. The overhead of going to this level of detail is hard to justify for non-business critical applications, which will be deployed in more general purpose clusters. When it comes to VM design we still try to take a service catalog approach, with as small a set of standardized template sizes as possible to fit most application requirements, and an exception process for systems outside of the standards. In most cases, though, the template sizes for the business critical apps might differ from the other environments. At the end of the day it's all about providing predictability of service levels and availability, and ensuring that the business critical apps always get access to the resources they need, when they need them, based on the customer's requirements, in the most efficient manner possible.