Blueprint for Successful Large Scale Oracle Virtualization on vSphere
I recently attended a Webinar on virtualizing business critical Oracle systems, presented by a very large company that started its journey back in 2004 and has so far successfully virtualized 86% of its systems, including some of its most critical Oracle systems. After the Webinar I went back and re-read the whitepaper they published a few months ago about their journey to refresh myself on its contents. The whitepaper is one of the best I have read, and the presentation was one of the best I have attended. What they laid out is a successful blueprint and framework for how any company might virtualize business critical Oracle systems, and it is definitely a must read. I’ll discuss the main points I think are important and what I got out of it, then you can read it yourself and make up your own mind.
The Webinar, covering EMC IT’s virtualization journey and EMC IT’s Virtual Oracle Deployment Framework, was presented by Ramesh Razdan, Senior Director and EMC Distinguished Engineer. Ramesh is a very smart guy and a brilliant presenter. Both the Webinar and the whitepaper present a solid, pragmatic, disciplined and methodical approach to virtualizing large scale Oracle applications with differing levels of criticality, including the most critical, which require five 9′s availability. A version of the Webinar presented by Darryl Smith (with whom I have been discussing the results) and Ramesh is available at http://www.emc.com/events/2012/q1/01-19-12-virtual-oracle-deployment.htm. One of the most valuable aspects of the paper and Webinar was EMC’s candidacy criteria and how they modeled the requirements of each system and application layer before matching these to the best fit virtualization solution. Here are some of the raw numbers covered in the paper:
50K Users, 400K Customers and Partners, ~850 DB’s, 52M Transactions per day, 5 Datacentres, 10PB Storage, 500 Applications, 6.5K OS Images, 80 Countries, 20 Languages
EMC classified their application components and systems by both availability and scalability, and then matched those requirements to the appropriate solution. This is very close to the process I take customers through when virtualizing business critical apps and in Unix to Linux on vSphere migrations. These two dimensions are very important. When planning the journey, however, I add further dimensions that I believe are needed in the classification process, such as complexity to migrate and financial classification in terms of ROI. By using a multidimensional classification approach you can build a clear roadmap and project plan with measurable and achievable results.
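To make the multidimensional classification concrete, here is a minimal sketch of how systems might be ranked into a roadmap. The field names, the 1 to 5 scales and the scoring formula are all my own illustrative assumptions, not something taken from the EMC IT whitepaper, so treat it as a starting point for your own model rather than a recommended weighting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemProfile:
    """One candidate system. All scales here are hypothetical."""
    name: str
    availability_pct: float    # required availability, e.g. 99.9
    peak_vcpus: int            # scalability dimension: peak CPU demand
    migration_complexity: int  # 1 (simple) to 5 (hard)
    roi_score: int             # 1 (low return) to 5 (high return)


def migration_priority(s: SystemProfile) -> float:
    """Higher score means earlier in the roadmap (illustrative weighting)."""
    # More critical systems are pushed later so confidence is built first.
    if s.availability_pct >= 99.99:
        criticality = 3
    elif s.availability_pct >= 99.9:
        criticality = 2
    else:
        criticality = 1
    return s.roi_score / (s.migration_complexity * criticality)


systems = [
    SystemProfile("erp-prod", 99.999, 24, 5, 5),
    SystemProfile("dev-db", 99.0, 4, 1, 3),
    SystemProfile("crm", 99.9, 8, 2, 4),
]

# Sort so that high return, low complexity, low criticality systems go first.
roadmap = sorted(systems, key=migration_priority, reverse=True)
for s in roadmap:
    print(f"{s.name}: priority {migration_priority(s):.2f}")
```

With these made-up inputs the low risk dev system lands first and the five 9′s production system last, which is the ordering you would normally want from a roadmap.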
Databases and applications previously supported by RAC in some cases don’t need RAC anymore. Applications and systems that required 99.9% availability could be supported by single instance Oracle DB’s deployed in VMware HA/DRS Clusters, as the VMware HA functionality provided sufficient availability. RAC was still required for systems that needed very high availability (four or five 9′s), or where resource requirements exceeded the capability of a single virtual machine (8 vCPU and 255GB RAM on vSphere 4; 32 vCPU and 1TB RAM on vSphere 5). Reduced complexity from having to manage fewer RAC instances is not the only benefit. It may also be possible to lower license costs, as you don’t necessarily need to license the same Oracle edition or the same features (for example, moving from Oracle Enterprise to Standard Edition). However, there is a cost to managing many different licenses and license types and ensuring compliance. This is why many organizations choose to sign Enterprise License Agreements. For some workloads single instance Oracle DB’s provide better performance. In 2011 I completed an Oracle DB migration project (Sun E25K to RHEL Linux on vSphere 4) for an organization with as many users as EMC. One of the reasons they were migrating and virtualizing their Oracle DB’s was so they didn’t have to deploy RAC for availability. For them the additional cost and complexity of RAC was significant.
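The RAC versus single instance decision above boils down to a simple rule, sketched below. The function names are hypothetical, and treating 99.99% as the cut-over point is my reading of "four or five 9′s", so anything between 99.9% and 99.99% is an assumption you should decide for yourself. The per-VM limits are the ones quoted above.

```python
# Per-VM maximums quoted in this post: version -> (max vCPU, max RAM in GB)
VM_LIMITS = {4: (8, 255), 5: (32, 1024)}


def annual_downtime_minutes(availability_pct: float) -> float:
    """Downtime per year allowed by an availability target."""
    return (1 - availability_pct / 100) * 365 * 24 * 60


def choose_deployment(availability_pct: float, vcpus: int, ram_gb: int,
                      vsphere_version: int = 5) -> str:
    """Illustrative rule only; the 99.99% threshold is an assumption."""
    max_vcpu, max_ram_gb = VM_LIMITS[vsphere_version]
    if availability_pct >= 99.99:
        return "RAC"  # four or five 9's still calls for RAC
    if vcpus > max_vcpu or ram_gb > max_ram_gb:
        return "RAC"  # workload does not fit in a single VM
    return "single instance + VMware HA"


for target in (99.9, 99.99, 99.999):
    print(f"{target}% allows {annual_downtime_minutes(target):.1f} min/year")

print(choose_deployment(99.9, vcpus=8, ram_gb=64))
print(choose_deployment(99.9, vcpus=16, ram_gb=128, vsphere_version=4))
```

The downtime arithmetic shows why the tiers differ so much: 99.9% allows roughly 8.8 hours of downtime a year, while five 9′s allows only a little over 5 minutes, which leaves no room for an HA restart of a large database.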
EMC has been on the journey for 8 years and has learned a lot during the process that can be leveraged by others. Their experience as documented in the whitepaper and presented in the Webinar can be used as a blueprint for other organizations and tweaked where needed. EMC and VMware Professional Services Organizations and partners experienced and accredited with Virtualizing Business Critical Applications can help customers reproduce similar excellent results by using the collective best practices as a starting point to work from.
The process of virtualizing your business’s most critical applications is a journey and does take a disciplined, methodical approach, but it can have spectacular financial and operational rewards. Plan and design the virtualization project to meet business requirements from the start. Start with low risk, low impact systems, test, build confidence, then move on to more critical systems. Virtualize Dev/Test before Prod. This is the standard virtualization journey. One aspect that is not discussed in the paper is how large the cost efficiencies can be from running valid tests on systems that are much closer replicas of their production counterparts than would ordinarily be the case. In large organizations with large numbers of application projects the costs associated with testing are very significant. Accelerating and improving the testing lifecycle can make a very significant improvement to the bottom line, not to mention much faster time to market.
You can use DRS Host Groups to separate workloads for licensing, but be prepared to be audited and to prove your use. VMware DRS Host Groups and VM to Host Rules can be an effective way of separating groups of VM’s within a cluster for licensing or availability reasons. If you’re using them for licensing reasons, take care to ensure that the separation will be recognized by the software vendors involved. I would normally recommend a conservative approach: if there are any possible grey areas, use a separate cluster. However, this isn’t as efficient from an infrastructure perspective, as additional clusters require additional management and additional failover resources. There is a very good thread of discussion on this topic here posted by Jeff Browning, aka the Oracle Storage Guy. He also posted a blog article on the topic titled Comments by Dave Welch of House of Brick on Oracle on VMware Licensing and here. I would encourage you to read these if you plan to use DRS Host Groups for licensing separation.
Pay careful attention to the lessons learned, but beware! Every environment is different. Best practices and lessons learned from other environments may be a great place to start from, but they may not all apply to your unique environment. I agree with and use as best practices all of EMC IT’s recommendations except one, which was to disable Transparent Page Sharing (TPS). Disabling TPS can have disastrous consequences, including additional host swapping, which can result in extremely poor performance, a loss far greater than anything disabling it could possibly gain. In every system I’ve virtualized, and in every environment, including those of similar scale to EMC’s, I have never had to disable it to achieve acceptable performance. The small amount of CPU required to calculate memory page hashes should not by itself result in any measurable performance difference, and a properly designed and managed environment should not be breaking large pages into small pages for Oracle Databases or Applications in the first place.
I was so concerned by this recommendation that I followed up with EMC to try to understand the situation and data that led them to it in the first place. Based on my inquiries, I understand the original testing that led to this recommendation was conducted on ESX 3.5 (not sure which build or update version), in an overcommitted cluster. With that specific test setup of the application layer and the hardware at the time, a 5% improvement in performance was measured. EMC’s biggest reason for disabling TPS appears to be to avoid the breaking down of large pages, and in their scenario the environment is carefully controlled and isolated. They manage usage of system memory very carefully and use memory very aggressively for Oracle, and the hosts are used for a single purpose and in some cases a single VM. Disabling TPS in this scenario may not be as risky because of the expertise they commit to configuring and monitoring their environment. I have not yet been able to determine whether the TPS setting was the only setting changed to achieve the 5% measured improvement.
I was advised that subsequent testing on vSphere 5 did not show any improvement from disabling TPS. Unfortunately the whitepaper did not put this recommendation in context with the version of hypervisor tested, and even though the paper was published in November 2011, the recommendation was not updated to reflect the latest testing. There have been a few fundamental changes since the original ESX 3.5 that would, in my opinion, eliminate any possible improvement that may have resulted from disabling TPS when the other best practices are applied (e.g. VM reservations). These include Intel Extended Page Table (EPT) support (not supported until ESX 3.5 U4), vSphere backing guest pages with large host pages by default on hardware supporting EPT (Nehalem and above), and major improvements in hypervisor efficiency. VMware’s published tests of EPT alone showed a 48% improvement in performance for MMU intensive workloads. I have not had time yet to independently verify the findings. My advice would be to not implement this recommendation without your own testing, given the serious consequences that could result from disabling TPS, the additional management overheads and the low probability of performance gains on recent hardware with a recent vSphere version. Scott Drummonds has posted an article regarding the TPS discussion titled Transparent Page Sharing and Performance. I would recommend you read it.
[Updated:24/10/2012] Subsequent testing on vSphere 5 has shown no difference between having TPS Enabled or Disabled. My very strong recommendation is to not modify the default memory management behaviour when it comes to TPS, so keep it enabled (default setting). Make sure you follow the best practices around memory reservations and using Huge Pages for your databases.
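As a rough aid for the Huge Pages part of that advice, here is a small sketch of the sizing arithmetic for a Linux guest. It assumes the common x86-64 default huge page size of 2048 KiB (check Hugepagesize in /proc/meminfo on your own system) and that the whole SGA is to be backed by huge pages. The function name and the 16 GiB SGA are made up for illustration, and real deployments often add a little headroom on top of the bare minimum.

```python
import math


def hugepages_needed(sga_gib: float, hugepage_kib: int = 2048) -> int:
    """Minimum number of huge pages needed to back an SGA of sga_gib GiB.

    hugepage_kib defaults to 2048 KiB, the usual x86-64 Linux default;
    this is an assumption, so confirm it against /proc/meminfo.
    """
    sga_kib = sga_gib * 1024 * 1024
    return math.ceil(sga_kib / hugepage_kib)


sga_gib = 16  # hypothetical SGA size for illustration
pages = hugepages_needed(sga_gib)

# vm.nr_hugepages is the Linux sysctl that sets the huge page pool size.
print(f"vm.nr_hugepages = {pages}")
```

Whatever value you arrive at, the VM memory reservation should cover it, otherwise the guest may be holding huge pages that the host cannot guarantee to back with physical memory.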
This is the only point in the EMC IT whitepaper that I thought required further investigation and explanation. I am grateful that the authors of the EMC IT paper have been willing to provide more of an understanding of how they got to their results with regard to TPS. Whenever there is something that doesn’t seem quite right, or there is a technical disagreement, it opens up an opportunity to learn something new. Both EMC and VMware take this approach and they are very much learning and collaborative organizations between themselves and also with their respective ecosystem partners. We all benefit from this approach and the engineering culture that results from it.
Here is the link again to the EMC IT Virtual Oracle Deployment Framework. It’s one of the best written whitepapers in my opinion and presents a very strong case for and method to virtualize Oracle applications and databases in very large enterprise environments. Thanks to Ramesh, Darryl and the entire EMC IT Team that were involved in the project of virtualizing their Oracle systems, and especially for publishing the whitepaper so others can all have a great chance of achieving the same success.
My company and I have worked closely with EMC and VMware on similar projects and achieved extremely good results for our clients. If you’re a customer interested in how you might start on your journey to virtualizing your business critical Oracle applications, or a partner wanting to lead your customers on the journey, contact me from the Author page, or contact your local EMC or VMware Professional Services representative or Account Manager.
For additional information on virtualizing Oracle visit my Oracle Page.
This post first appeared on the Long White Virtual Clouds blog at longwhiteclouds.com, by Michael Webster. Copyright © 2012 – IT Solutions 2000 Ltd and Michael Webster. All rights reserved. Not to be reproduced for commercial purposes without written permission.