Many organizations are interested in deploying Oracle Databases, even the business critical ones, on vSphere. I’ve been helping customers for a while now to deploy Oracle and other business critical applications in a VMware environment (since ESX 3.x). As there is a lot of FUD around I thought I’d try and write an article on how to successfully virtualize Oracle databases in a way that will meet performance and SLA requirements, along with the key design considerations. This will focus a lot more on the how than the why, although I’ve included the key virtualization drivers from my experience. There are numerous benefits to virtualizing Oracle databases and other business critical applications on VMware vSphere, so I hope this helps you ensure the process is successful. This will be a fairly high level overview and by no means exhaustive, but should provide you with some solid guidance to build upon. Much of the information is also applicable to other database systems and other types of business critical applications.
Benefits of Virtualizing Oracle databases
- Reduce hardware costs
- Provision databases on demand – Minutes to provision in production and in the lab
- Increase application Quality of Service – Scale dynamically (Hot Add of CPU, Memory, Storage to VM’s and Hot Add Hosts to Clusters )
- Built-in high availability and simple disaster recovery – VMware HA, VMware Site Recovery Manager
- Fully utilize hardware without compromising SLA’s
- Maintain application isolation
- Automated authorization workflow and provisioning using VMware Service Manager and vCenter Orchestrator
- Avoid manual configuration errors during provisioning by using a service catalog of pre-built templates
- Zero downtime maintenance by vMotion migrating live databases without disruption (Including Oracle RAC see previous article)
- Hardware independence – restore database virtual machines to any system capable of running vSphere, no need to have identical hardware at DR site
- Reduce operational costs by using consistent operating environment with increased automation
- Improved visibility, control and protection of intra-host and inter-host traffic using vShield Edge and vShield App
- Consolidate licensing – Potentially increase utilization of Oracle licenses (Depends on Degree of Consolidation) and reduce licensing costs in some cases
Previous Experience is Valuable
Please keep in mind when reading through the rest of the article that it is always a good idea to have a person experienced in virtualizing Oracle databases on your project team when you are going through the analysis, design, validation, and at least the first part of the migration phase. Having someone that has already gone through this process successfully multiple times will save you a lot of time, money, and reduce your project risk. There are some very good VMware Partners that can assist you (including my company), as well as VMware Professional Services.
Meeting your SLA’s and Business Requirements
Meeting your SLA’s and Business Requirements is critical and fully understanding them during the design and analysis phase is one of the key factors that will ensure a successful outcome. You should consider any best practices as a good starting point to reduce your risk and effort on the project, but you should test and verify the configuration to ensure that you meet all the requirements and business objectives of the project (functional and non-functional verification). You should not blindly follow best practices without understanding them and determining if they fit your specific situation and objectives.
Baseline and Analyze Current State Performance
The first task on any project that aims to virtualize a business critical application is to analyze and baseline the current state performance. If the project is virtualizing a new system, then you would analyze the performance requirements and metrics of the new system based on Application ISV sizing and performance recommendations. Below I outline some tools that can be helpful and some general guidelines that I have found successful. The baseline performance analysis is critical during design and validation and will greatly assist in achieving a successful outcome.
Tools for Analysis and Performance Baselining
You might want to consider using some of the following tools during the performance baselining and analysis phase of the project.
- VMware Capacity Planner
- VMware vCenter AppSpeed and App Insight
- VMware vCenter Application Discovery Manager
- VMware vCenter Operations Enterprise
- HP Load Runner, Apache Jmeter
- IOMeter, ORION, FIO
- NetPerf, iPerf
- Existing Enterprise and App Monitoring Tools and System Logs
General Guidelines during Analysis and Performance Baseline
- Consider both peak and average workloads
- Consider impacts of consolidation of workloads that use similar resources at the same time of day
- Consider additional resources required for growth
- Consider known business projects and objectives that may impact performance and the workload model
- Consider system architecture differences (RISC vs CISC)
- Synthetic workload results can be great at comparing different configurations, but might be misleading, try to test against a realistic as possible business workload model
Key Considerations During Design Phase
The following is a list (not exhaustive) in no particular order of some of the key areas that need to be considered during the Design Phase of the project:
- ISV Support
- Scalability (Scale Up vs. Scale Out)
- Standardized VM Template Sizes
- Optimal Physical Host Sizes
- Storage IOPS and Latency Requirements
- Tradeoff between Memory and Storage IOPS
- Security Hardening
- NUMA Node Size Impact to VM Sizing and Performance
- Regulatory Compliance (PCI DSS, SOX, HIPPA, DISA STIG etc.)
- vMotion Slot Size for Monster VM’s (is there enough unreserved RAM on a host in the cluster to put a host into maintenance mode?)
- vSphere Clusters and HA
- Performance and Resource Constraints
- Testing (Infrastructure and Applications, non-functional and functional tests)
- Operational Readiness and Standard Operating Procedures
- Appropriate over-commitment of resources
- Appropriate storage sizes (vmdk, lun, mount point), RAID levels, paths and queue depths
Defining Standard OS Templates and Sizing Considerations
During the discovery, analysis and performance baseline phase of a project you should attempt to define a small number of standard template sizes suitable for the size of workloads being virtualized. Defining a small number of standard templates that cover the majority of database sizes will reduce operational costs and provide the basis for a database service catalog. This will make the database service easy to order, provision and consume. Templates sizing should be based on observed system metrics and performance test results.
Each template should include the following characteristics:
- Balance RAM, CPU, and IOPS requirements
- Include the customized base OS configuration including tuning parameters
- Standard disk storage to cover the likely smallest or starting size database server, including OS, Swap, Redo, Archive Log, App, Backup and at least one Data file volume
- Provide the most efficient unit of scalability, performance, consolidation and management
- Fit easily within the NUMA node sizes of the physical hosts
- Large enough to provide for current performance SLA’s with room for future growth
- Limits the number of datastores and LUNs required per database server to a minimum
- Include any standard and required base software packages including monitoring, management and backup agents if applicable
- Include standard OS hardening
- Memory reservation policy
Storage requirements are normally the biggest variable in a database server so can be provisioned based on specific requirements based on the optimal size for SAN admins, storage technology and performance.
Large monolithic database servers containing large numbers of databases and schema can be split up into multiple smaller database servers to provide better performance, availability, managability and isolation. Larger numbers of smaller database servers can be managed efficiently in a virtualized environment, while still supporting large individual database servers where necessary.
On a recent engagement I defined three base templates described below:
- Medium – 2 vCPU 16GB RAM
- Large – 3 vCPU, 32GB RAM
- XXL – 6 vCPU, 94GB RAM
Each template had 150GB split into multiple virtual disks across multiple virtual SCSI controllers on a small number of VMFS datastores per database server. These templates were the best fit for the database systems I was migrating from. In this instance the databases I virtualized were on Sun E25K systems each with multiple containers, and each container with one or more zones. The storage design is described later in this article.
The average transaction latency for one of the XXL systems migrated went down from 13ms on average on the Sun E25K system to 2ms on average once virtualized, this is with more than double the number of database transactions per second. This was with a similar amount of memory, but with only 25% the number of CPU’s (6 vCPU vs 24CPU’s).
Don’t pay too much attention to consolidation ratios initially. They are not the most relevant metric, especially to business critical applications, can be changed by changing host size, and are different for almost every situation and environment. Focus instead on the workloads and optimal utilization first. For production systems overcommitting resources too much may result in poor performance an not meeting SLA’s, but it may be acceptable to adopt more aggressive overcommitment in Test/Dev environments, both will impact consolidation ratios. Good consolidation will come from making sure you provide an optimal balance of workloads in the clusters and design efficiently to meet the peak workloads within the SLA’s defined for the project.
Datastore Sizing and Performance Considerations
Datastores should be sized sufficiently to allow for maximum utilization and as little as possible waste while allowing for appropriate performance, managability and recoverability. The actual standard sizes you choose will be largely based on how the physical storage is configured and on what is easiest for your virtualization and storage teams to manage, as well as taking into account the performance needs of the database systems. The best size will be one that balances IOPS with storage capacity. You should always take into consideration the restore time and restore point objectives as bigger datastores will take longer to back up and restore.
In most cases you will run out of IOPS long before you run out of capacity. This is why I try and balance high IO consumers with low IO consumers on the same datastore (such as with OS / App / Redo volumes). However you should try and avoid combining IO workloads that will conflict onto the same datastores if you have concerns over it’s ability to meet IOPS and latency metrics. VMware Storage IO Control (SIOC) can help balance conflicting IO workloads by limiting queue slots for some VM’s. This can prevent noisy neighbours from impacting performance of other VM’s. There is a trade off here between latency and throughput however. You should consider enabling SIOC and using the default shares, this will ensure that each VM has equal access to storage resources and will limit the possibility of IO storms and denial of service conditions caused by high IO VM’s. Remember SIOC will only come into play in times of storage IO resource contention, and only where there is more than one VM per datastore. It can also be impacted by any non-VM workload on the arrays, such as replication, automated array tiering, array based snapshots, or backups.
If you have a tiered storage infrastructure choosing the correct tier of storage for each volume profile will be important. The storage tier should take into account the optimal RAID level to produce the performance you are trying to achieve. RAID 10 may be able to produce a performance level using far less physical disk spindles than RAID 5 would be able to do. Where high IOPS are required RAID 10 can be a more efficient choice than RAID 5. RAID 5 will likely limit the effective usable storage space as you will not be able to push any more IOPS through the defined number of physical disk spindles. These generic considerations don’t take into account specific features of the storage arrays you are using and therefore you should consult with your storage vendor on the best configuration to deliver a defined performance profile. Having Flash Based LUN’s, Flash Storage Cache, or Automated Storage Tiering will all potentially impact the optimal storage configuration. Consider using Flash Based LUN’s where the highest IOPS performance and lowest latency is required and the additional cost of Flash is justified.
The following list are some of the important considerations that will drive datastore sizes:
- OS data size
- DBA standard data file size
- Number of data files per volume or per vmdk
- Optimal size for IOPS and spindles on the back end storage as determined by your SAN admins or storage vendor
- RAID levels appropriate to storage technology and performance requirements
- VM log and swap file sizes
- Use of VM Snapshots
- Use of ASM
- RPO, RTO considerations and DR strategy
- Risk and impact of losing an individual datastore or group of virtual disks
Given that many DBA’s prefer to allocate tables spaces with defined data file sizes, which will be fully allocated on the storage, you can safely use thin provisioning where it makes sense, as it will have no detrimental impact on performance. However if you choose to use thin provisioning it would not be wise to overcommit the datastore. The last thing you want is to run out of physical space and have IO failing. The smart and appropriate use of Thin Provisioning can result in substantial cost savings without sacrificing performance but always be conscious of the potential impacts and additional management overheads.
To ensure that any VM snapshots don’t impact performance or availability of the database you should consider creating dedicated snapshot datastores and redirecting snapshots to them using the workingDir=”<new_path_location>” parameter in the virtual machine configuration file (vmx) (See KB 1002929).
If you are running Oracle RAC on VMDK’s utilizing the multi-writer flag then you should set all of your virtual disk devices to independent persistent. In this case you should not be taking VMDK level snapshots, and the independent persistent setting will enforce this. Independent Persistent setting can also be used for stand lone DB instances where you want to prevent snapshots being taken or impacting the VM. If using this setting you do not need to modify the VM working directory.
In and ESXi environment you should consider setting up a scratch and core dump datastore and each host configured to use it. This might only be 50GB in size. The scratch datastore is where the hosts will store log and temporary data so it will survive a reboot. (See KB 1033696 and KB 1000328)
As you will see from the design diagram examples at the bottom of this article I recently used the following standard datastore sizes in a large Oracle virtualization project:
150GB – OS/App/Redo Datastore, dedicated per each Database server, containing multiple VMDK files
512GB – Data file volumes, multiple per database server if necessary, each contianing multiple VMDK files
1TB – Backup store volume, dedicated for very large database servers, or shared for smaller database servers, containing multiple VMDK files
The above will be driven by the workload profile and size of the databases you will be virtualizing. If you have a data warehouse system with many TB of data then your datastore and VMDK sizes will be much larger.
If you are going to have databases with large disk footprints you may need to modify the ESXi VMFS Heap Size by changing the advanced setting VMFS3.MaxHeapSizeMB. Review KB 1004424 and Jason Boche’s article Monster VMs & ESX(i) Heap Size: Trouble In Storage Paradise.
Host Design and Sizing Considerations
Once you are close to completion of your baseline performance analysis of the target workloads you can make informed decisions about the size of the compute platforms. A single host will need to at least accommodate the biggest database system you plan to virtualize at it’s peak. You will need to ensure that when multiple databases are consolidated down onto the one host that it will be able to accommodate the peak workload that is within your SLA’s. The key goal is to try and reduce the number of hosts to improve managability and operational costs, while not compromising availability or performance. At the same time you want a consistent, known and cost effective unit of capacity expansion. You need to try and reduce single points of failure as much as possible without making the solution to costly to implement. In some converged infrastructures you won’t have any choice regarding certain single points of failure, such as where a converged network adapter with 2 x 10GbE Ports is all that is available.
Here are some of the key areas you need to consider in your host design:
- NUMA Node Size
- Storage and Network Access Ports
- CPU Clock Speed
- vMotion Networks
NUMA Node Size
The NUMA Node Size is important due to the way that vSphere scheduler will balance VM’s vCPU’s and Memory. If you can size your NUMA nodes such that your largest VM fits easily within a single NUMA node you will ensure locality of vCPU’s and Memory, which will produce optimal performance. NUMA nodes need to be balanced across the host (same amount of CPU and memory per node), and sometimes it is not possible to have a single NUMA node sufficiently large to accommodate the biggest VM. vSphere will balance VM’s across NUMA nodes to accommodate large ‘wide’ VM’s, and in vSphere 5 you can expose the NUMA architecture to the VM’s so they can optimize their memory and CPU access accordingly. For database servers more memory is better than fastest memory, so if you have to sacrifice raw memory clock speed to get much more RAM in a host it is most likely worth it. For more information on NUMA you can refer to this information on Wikipedia.
Storage and Network Access Ports
Ideally for availability and performance reasons you will have multiple storage and network access ports, and also at least two physical storage and network access adapter cards, and load balance between them. In many converged architectures this is not possible, and some may consider it a costly luxury considering the individual port and adapter card costs when weighed against fairly good component reliability. But the risk vs cost is a key consideration for Business Critical Applications. But every environment is different, so there will be no one size fits all approach. In environments where only a single dual port 10GbE converged network adapter per host is being used you will need to pay great attention to quality of service to separate traffic types, and also the maximum number of systems consolidated per host to limit your exposure to component failure. Ensure when deciding on network port speed and number that you consider vMotion performance requirements (see below for more discussion on this). Consider the risk of driver and firmware problems if you have all the same type of adapters in the entire datacentre. Reduce single points of failure as much as possible and know the risks which you are accepting.
While CPU in some database workloads is not a major performance constraint having a high CPU clock speed can improve performance of some OLTP type workloads if they are not storage constrained. Depending on the workloads you will be consolidating a two socket host platform with a higher CPU clock speed may perform much better than a quad socket platform with a lower CPU clock speed. However the quad socket system is likely to be an appropriate choice when you are consolidating many very large database systems, as it may be possible to cost effectively size the hosts to have a much larger NUMA node size. This may be a good compromise over raw clock speed for many database workloads, as having more memory can improve transaction rates and reduce transaction latency and required storage IOPS on the SAN (see Capacity Planning Considerations below). The compromise between sockets, memory, and clock speed is not the same for all vendors’ architectures, and I’ve intentionally kept this article vendor neutral. Some modern architectures allow very large NUMA node sizes in a small physical footprint and with high CPU and memory clock speed. When deciding on the best platform to virtualize on you should consider your existing relationships and the best architectural and operational fit. Both modern Rack and Blade server platforms can successfully accommodate database workloads if designed and implemented appropriately, so the choice is yours.
Large database servers with lots of memory assigned place additional pressure on your vMotion network. If you try and migrate a 48GB or larger VM with a rapid memory change rate from one host to another over a 1Gb/s network link it is going to take a very long time. In order to ensure applications and users are not impacted during vMotion operations, and to ensure that database servers can be quickly moved from one physical host to another for maintenance purposes you should ensure you have sufficient bandwidth available to your vMotion vmkernel port(s). I would strongly recommend using 10Gb/s NIC’s for your vMotion network and that you enable Jumbo Frames. With vSphere 5 you should take advantage of using multiple vMotion vmkernel ports to load balance the vMotion traffic and improve migration times, but be aware that this will consume more CPU so there is a tradeoff. vSphere vMotion traffic can fully saturate a 10Gb/s NIC, so you will need to either separate the traffic physically onto dedicated NIC ports or use Network IO Control (QoS on Cisco Nexus 1000v), or both depending on your environment. Testing I have done in my lab has shown that on vSphere 5 vMotion operations on a single 10GbE NIC port using a single vmkernel port can consume more than 1250MBytes/s as measured by ESXTOP at peak.
vSphere Cluster Design and Sizing Considerations
There are a number of key design and sizing considerations when you are virtualizing important Oracle databases. The following is a brief list of some of the most important considerations:
- Number and Type of Clusters (Management vs Resource)
- Min and Max hosts per cluster (including failure and maintenance hosts)
- Dedicated hosts for failure and maintenance capacity
- Target utilization thresholds
- HA and DRS Settings
- vMotion Slot Size
- Scratch, Core Dump and VM Swap Datastores Per vSphere Cluster
Number and type of vSphere Clusters
The reason to create multiple clusters are many and varied. The are three key reasons why you would want to create separate clusters for Oracle Databases and other types of business critical applications from the rest of your fleet:
- Optimize Licensing Costs and Utilization
- Provide resource guarantees to ensure SLA’s are met
- Separation of Production from Non-Production, and separation of management infrastructure from managed infrastructure
At a logical level there are three general types of vSphere Clusters, Management Clusters, Resource Clusters, or a combination of the two. In order to ensure only productive workloads (Oracle Databases) have dedicated resources it is a very good idea to have a management cluster of hosts sized specifically to run the management functions, such as vCenter. With a management cluster you can effectively separate the infrastructure being managed from the systems that are managing it, which reduces the risks of a failure impacting the management infrastructure and inhibiting troubleshooting and recovery. This design will eliminate circular dependencies that would exist if the management systems were deployed within the same cluster as the productive workloads. Separating the clusters in this way allows you to get the optimum use of resources and licenses. vCenter, vCenter Database, Update Manager, Site Recovery Manager, Monitoring Tools and any other management tools can all be hosted in a small number of small hosts (2 or 3) depending on the size of the management systems infrastructure. If you are investing significant money in Oracle licenses and the infrastructure to run them at peak performance you don’t want non Oracle systems impacting those resources.
In large vSphere deployments it is quite common to have a management cluster at each physical datacentre that hosts all the management systems and infrastructure that supports all of the other vSphere clusters and resources. This allows the management cluster to be scaled as needed without any impact on resource clusters and vice versa. This provides you with a very scalable, highly available, and high performance design that can accommodate a very large deployments, and the necessary predictability, performance and availability required by business critical applications, and Oracle database workloads.
Min and Max Hosts per Cluster
Outside of the published vSphere Configuration Maximums there are a number of factors that will drive the number of hosts you choose per cluster. Some of these are as follows:
- Overall size and number of VM’s being deployed
- Deploying Oracle RAC Clusters using VMFS Datastores (See Oracle RAC in an HA/DRS Environment)
- Separating Production from Non-Production Oracle Databases
- Separation of management functions into dedicated management clusters
- Number of LUNs per Oracle Database (See Capacity Planning Considerations below)
- Larger clusters are more resource efficient and require a smaller proportion of resources for failure and maintenance
- Smaller clusters of larger hosts may be easier to manage and consume less datacentre space, network and storage ports, power and cooling
- VMware HA considerations around number of primaries if using pre vSphere 5
- Number of hosts required for failure and maintenance capacity
- The use of Stretched VMware HA/DRS Clusters
If your resource clusters will be running Oracle RAC it would be recommended to keep the size of the cluster to 8 hosts or less or use a DRS Host Affinity Rule to ensure that the Oracle RAC nodes only run on a specific 8 nodes. 8 nodes is the VMFS limit when you are sharing a VMDK file across multiple hosts with the multi-writer protection flag disabled. This is the best way to deploy Oracle RAC in a vSphere Cluster as it will still allow vMotion for maintenance and ease of deployment from templates while still providing optimal performance.
Where possible vSphere clusters you should start with 3 hosts as a minimum to allow for maintenance and failure for both resource clusters and management clusters. The host configuration for each cluster will be sized for the workloads they are to support. For very large environments you should consider separating Test/Dev, Pre-Prod, DR, and Production Oracle Database systems into separate clusters each sized appropriately to support the current and expected future workloads. To make the most efficient use of licensed and deployed resources I normally recommend that the Test/Dev and Pre-prod workloads run out of the DR datacentre. If you don’t have separate clusters for all non-production and DR systems the non-production systems can easily be shut down or given much lower priority by using shares when there is a need for a DR failover.
Dedicated hosts for failure and maintenance capacity
For important production resource clusters I recommend N+2 hosts so that you can suffer a failure during maintenance without major impact to performance and availability. This provides dedicated capacity for failure and maintenance. By being conservative in this way you also reduce the chances of resource contention and allow for some growth. If necessary having this resource configuration would allow you to quickly deploy new databases for the business while servers are still on order without risking the performance of the production environment.
Many organisations will find that their non-production systems are also critical and should consider using N+2 hosts in each vSphere cluster and conservative levels of over-commitment. Some large applications can have up to 6 or more non-production environments, some of which are full production scale, and large project teams maintaining them and delivering new projects. The costs associated with downtime and performance degradation for non-production environments in some cases make the cost and effort of deploying an additional host for maintenance capacity seem inconsequential.
Target utilization levels
With the assumption that you will be trying to right size the hosts, vSphere clusters and VM’s to make the most efficient use of licenses, resources, and for optimal performance I recommend conservative target utilization levels for Oracle Database clusters. Excluding Failure and Maintenance Capacity I generally recommend <80% memory utilization and <70% CPU utilization at peak, with vCPU per pCPU Core ratio of <= 2:1. This reduces the risk of unnecessary latency or resource contention when many of the Oracle databases need concurrent access to the same resources. Being fairly conservative with target resource utilization levels will also reduce the risk associated resource fragmentation with regard to your vMotion Slot Size (see below)
Resource contention in vSphere clusters dedicated to a single use can be a real risk if you are deploying a large number of database systems that have workload patterns or scheduled operations that need the same resources at the same time. If during your performance baseline and analysis phase of the project you have systems that use different resources at different times, or the databases themselves will not all reach their peak at the same time you could consider more aggressive over-commitment and utilization targets. The best consolidation and most efficient use of resources will be found where complimentary workloads that use different resources at different times are grouped together.
An example of where greater consolidation may be possible is if you have a batch processing system that runs overnight and an OLTP system running during the day. The highest peak resource utilization for host memory and CPU might be 80% and 60% respectively, and both systems will be quite happy being consolidated together.
HA and DRS Settings
If you have followed my recommendations up to this point you will have a small number of different Oracle Database master templates in your service catalog that have been right sized for the majority of database workloads that you will deploy. You will be using Memory Reservations to ensure the minimum acceptable level of performance, which will cover the SGA/PGA and critical OS memory space. You will have a conservative vCPU to pCPU Core ratio. You will have an amount of unreserved memory that can be used for filesystem cache to boost performance. You will have N+2 hosts in your cluster so that a failure event won’t have a major negative impact on performance even when a host is under maintenance. But even if you had two hosts fail the filesystem cache which is unreserved can be sacrificed by trading off higher IOPS load on the SAN.
Assuming all of the above are true I would recommend the following HA and DRS Settings:
- HA Enabled – If a host fails you want the Oracle Databases on that host to restart automatically and get back up and running as soon as possible with as little human intervention as possible.
- Admission Control Enabled – Disallow vm power operations if they violate availability constraints
- Admission Control Policy – Percentage of Cluster Resources Reserved as Failover Spare Capacity, P = (F/N)*100, where P is the percentage, F is the logical number of host failures to tolerate, and N is the number of hosts in the cluster. To achieve N+2 for a cluster you enforce N+1 using admission control and use capacity planning and change control for deployments to ensure sufficient spare capacity for failure during maintenance. This is because when a host is put into maintenance the same percentage of cluster resources is reserved by admission control, however the total resource available in the cluster is reduced by 1 host. The reason you would want to use this setting over the host failures the cluster tolerates is because you are likely to have a large variation in resource reservations. The host failures option would be very restrictive as by default it would use the largest reservation to calculate the HA Slot Size for the cluster to calculate remaining slots. If you had one 64GB Memory reservation and many 16GB reservations 64GB would be used to calculate the number of HA Slots in the Cluster and how many slots were remaining. You would find very quickly that you can’t deploy any more VM’s. You could specify a specific host for failure, but the downside to this is that it can’t be used productively during the majority of the time where there is no failure, but may still need to be fully licensed.
- HA Isolation Response – Leave Powered On. If your management network is separate from your VM network, and could become isolated independent of the Oracle Database servers, and when you’re using Fibre Channel Storage, or storage on a separate and independent network, I recommend you set the Isolation Response to Leave Powered On. I’d much rather have a network problem than a data corruption problem. If you’re using Ethernet/IP based storage and your management network becomes isolated the best Isolation Response may be Power Off, as you’ve just lost access to your storage also. But you would want to adjust the failure detection time so that brief transient network issues do not trigger isolation responses.
- VM Monitoring Enabled – Set to Medium sensitivity. HA will restart the VM if it has frozen or crashed for 60 seconds, and restart it up to 3 times within a 24 hour period. As of vSphere 4.1 HA also checks network and disk I/O in addition to the VM heartbeat to determine if the VM is available.
- DRS Enabled – Reduce any resource contention, provide automated placement and automated vMotion migrations for maintenance.
- Fully Automated – Migration Threshold left at Default Setting (apply priority 1, 2 and 3 recommendations). This will allow DRS to optimize the cluster if there were any events causing resource contention, and to migrate VM’s using vMotion when hosts enter maintenance mode.
- DRS VM Affinity and Anti-Affinity Rules – Use DRS Anti-Affinity Rules to ensure availability of systems by keeping systems apart, use DRS Affinity Rules to keep systems together that generate large amounts of network traffic between each other. For an Oracle Cluster you will most likely be using Anti-Affinity rules to ensure Oracle RAC Nodes are kept separate.
vMotion Slot Size
vMotion Slot Size is a new term that I have developed to describe a key consideration that you will very likely come across when virtualizing business critical applications such as Oracle databases due to the high likelihood of making heavy use of memory reservations to guarantee a minimum acceptable performance SLA. This is not the same thing as the HA Slot Size discussed above. Duncan Epping and Frank Denneman in their book titled vSphere 4.1 HA and DRS Technical Deepdive refer to this problem as Resource Fragmentation. The difference between the vMotion Slot Size and a VMware HA Cluster Slot Size is that the vMotion Slot Size (or resource fragmentation) concept will also apply when you have a cluster with its admission control policy set to reserve a percentage of cluster resources for failure. There is no enforcement or means of tracking vMotion Slot Size currently built into vSphere, and DRS doesn’t take it into account when calculating optimal load balance. As previously discussed you definitely don’t want to be using a specified number of hosts for failure, as that setting will severely limit the number and size of database servers you can power on in the cluster. The following diagram illustrates the point:
In the above diagram you have 4 hosts in a cluster with 13 VM’s deployed. Assuming the cluster admission control is set to 25% of cluster resources reserved the utilization level will fit within the HA admission control parameters (Logically N+1). However it may not be possible to put host 4 into maintenance mode and have DRS automatically migrate the workloads as none of the other hosts would have sufficient unreserved memory to accommodate the large 48GB reserved VM that it contains. DRS may attempt to do resource defragmentation, but there is no guarantee that it will free up sufficient resources. This would necessitate some manually initiated vMotion migrations before the host can be put into maintenance mode. If the cluster suffered a host failure on host 4 only the 16GB VM may be able to restart, as there is insufficient unreserved resources to restart the large 48GB reserved VM. As of vSphere 4.1 HA will ask DRS to defragment the host resources, but this is not a guarantee, and DRS may still not be able to free up sufficient resources to restart the 48GB VM. HA Restart priorities may help, but won’t solve the problem of the large VM not being able to restart.
I recommend having your largest VM being a maximum of half a single hosts CPU and Memory resources. So in a host with 48 cores and 256GB RAM your largest VM should be no bigger than 24 vCPU and 128GB RAM. With an appropriate reservation this guideline should limit the possibility and impact of the vMotion Slot Size problem described above.
Scratch, Core Dump and VM Swap Datastores Per vSphere Cluster
The Scratch Location is where ESXi hosts store temporary files, and log data so that they can survive a reboot. Without having Scratch configured all the ESXi hosts logs, unless it’s configured to sent to Syslog, will be lost upon reboot. This may hinder troubleshooting. The Core Dump Location is where the core files from any kernel panics or Purple Screen of Death (PSOD) are recorded. Both of these are important and you should be setting up at least one small LUN/Datastore to store this data as mentioned above in Datastore Sizing and Performance Considerations (See KB 1033696 and KB 1000328). For a large cluster you should consider creating two Core Dump/Scratch Datastores so that losing one doesn’t result in all hosts losing access.
Given the measures we’ve taken to protect VM memory with reservations and conservative utilization targets it’s unlikely a cluster will experience large amounts of host swapping. However it is still possible and could be cause by incorrect resource assignments or limits. You may not want the VM swap files sitting on your highest performance and most expensive storage and risking productive I/O for these rare cases, especially when the VM swap files could be very large. Due to this many organizations will choose to implement dedicated VM Swap Datastores.
I would recommend creating a minimum of two VM Swap Datastores per vSphere Cluster, but would prefer one per host. Having two or more separate VM Swap datastores not only removes the risk of VM Swap I/O flooding production datastores, it also reduces the risk if one of the swap datastores is locked or unavailable not all of the VM’s will be impacted. You may want to put these VM Swap files on a lower tier of storage. You should consider the availability and management impacts of doing this. If you are using storage systems that support fully automated storage tiering you should consider how the array will manage these separate LUN’s. If the automatic tiering is at the block level and not the LUN level you may not have to worry about using dedicated VM Swap Datastores, as the VM Swap files will be migrated to lower tiered storage automatically.
Capacity Planning Considerations
When considering capacity planning you should take the Target Utilization levels and HA Admission Control Policy into consideration. You should also understand which resources are likely to have the most contention, and where the likely constraints will be.
In my experience the key constraints for most Oracle Database environments fall in this order:
- Disk I/O
- LUN’s per Host/Cluster
- Storage IO Queue’s
Due to the large amount of memory many Oracle Databases require, and the need to use reservations to guarantee a minimum acceptable level of performance Host Memory is normally the biggest constraint. More RAM = Less IOPS.
The next constraint is normally disk I/O, especially if you don’t have enough memory. There is a direct tradeoff between having more memory and reducing disk I/O, especially for read operations. If you reach the limit of IOPS or latency the consequences to the end users can be immediate and severe. I’ve had customers describe the situation when they’ve reached this limit like coming to the edge of a cliff and then falling off. More RAM = Less IOPS.
You will run out of IOPS and have poor latency in most cases well before you run out of capacity, especially with the spindle size of modern disks. The exception to this rule is currently Flash based storage and Enterprise Flash Disks (EFD’s). They are definitely useful for very high I/O consumers.
LUN’s Per Host/Cluster
With Oracle Database servers you are very likely to be using multiple datastores or multiple LUN’s per server. There is a limit of 256 LUN’s per host (1024 paths), and each host within a cluster should have visibility of all the storage to ensure that you can properly manage the environment and vMotion workloads when necessary. This means it’s quite likely you could well run out of LUN’s well before you reach the maximum number of hosts in a cluster. This will likely become obvious during planning and design and you will be able to size your clusters in terms of number of hosts and other resources accordingly.
With vSphere 5 it is possible to have datastore extents up to 64TB, although RPO and RTO at this size may be a concern. This does allow you to use much larger datastores and therefore consolidate more database servers onto fewer datastores and LUN’s. Provided they can support the IO requirements. This may in some cases allow you to deploy many more hosts and many more Oracle Databases in a given vSphere cluster.
Storage IO Queue’s
As part of the tuning process you may be adjusting the storage device/LUN queue depths and also adjusting the disk.schednumreqoutstanding advanced parameter. When you are considering this ensure you take into consideration the total number of queue’s per HBA port, and if necessary add additional HBA ports for the maximum concurrency of IO’s that you expect. For the case of fibre channel HBA’s as an example if you have 4096 queue slots per HBA port and you have 200 LUN’s configured with a device queue depth of 64, and assuming all LUN’s are being used at the same time and consuming all their queues (not probable in most cases), you would need a total of 12,800 HBA queue slots, so four physical HBA ports. In reality you’d start with two HBA ports and observe if your HBA port queues are ever being maxed out. You would detect this by having a higher than normal DAVG and KAVG value (in ESXTOP), and excessive device queuing in the hypervisor.
You should also bare in mind that when the IO’s are released to the driver they are outside the control of the hypervisor, so if one VM is consuming all the queue slots a VM issuing a rare IO might suffer some latency impacts. If you have multiple storage platforms connected to your vSphere hosts and they have different device queue depths by default you should set disk.schednumreqoutstanding parameter to the lowest common denominator. Storage IO Control can help, but it will only kick in if the latency increases beyond the threshold set, by default 30ms. It is possible that on a fast array that you exceed the queue depth without exceeding the device latency threshold for SIOC.
Take care modifying the default storage parameters. They have been set conservatively to give priority to fairness over absolute maximum performance. Special thanks to Thorbjoern Donbaek from VMware for all the help and advice he’s given me in the area of device queue depths and it’s impact on VM performance.
General Best Practices for Business Critical DB
- Do not overcommit core memory for production systems
- Set reservations to enforce this
- Right size VMs, memory requirement can be high
- Avoid Swapping / Paging
- Where relevant, to avoid NUMA remote memory access:
- Use Virtual NUMA or Size VM memory <= Memory per NUMA node, # of vCPU <= and easily divisible by # of cores per NUMA node
- But when sizing requirements dictate, large database VMs can cross NUMA nodes, ESX can manage wide NUMA VMs, use Virtual NUMA on ESXi 5.
- You can over-commit vCPUs conservatively, based on workload
- Start with no vCPU over-commit. Once average host utilization is known, depending on capacity, increase vCPUs (e.g. adding VMs). Use DRS capabilities to control and balance overall workload.
- For highly utilized production systems total number of vCPUs on all VMs per host <= number of Server cores, or threads with HT
- Create VMFS file systems from vCenter to ensure partition alignment
- Use multiple vSCSI adapters and distribute target devices
- Allows execution of several parallel I/O operations inside the Guest OS
- Use paravirtual SCSI adapter (http://kb.vmware.com/kb/1017652)
- On Linux Swappiness = 0
- Linux Guests OS IO Scheduler = NOOP
- Tune network buffers and window size where necessary
- Use application vendor OS tuning parameters as a starting point, Oracle Linux Kernel Paramater recommendations for 11G
- Adjust storage queue depths
- HBA Queue Depth
- Size Storage for IOPS and Latency
- Separate production and non-production on different datastores / storage devices if possible
- Align Guest OS Disk Partitions
- Minimize time drift in virtual machines follow guidelines in KB articles 1006427(Linux) and 1318(Windows)
- Separate infrastructure traffic from VM traffic for security and isolation
- Use NIC teaming for availability and load balancing
- Use Net I/O Control different network traffic types
- Use NIC Load Based Teaming
- Use VMXNET3 Paravirtualized NIC
- Use Jumbo Frames on 10GE for vMotion and IP Storage
- Install VMware tools (mandatory)
- Optimize ESX Host BIOS Settings (See Oracle Databases on VMware – Best Practices Guide for detailed BIOS recommendations)
- Disable any unnecessary devices in the VM BIOS and remove unnecessary devices from VM configuration
DB Specific Best Practice Recommendations
- Spread the databases across multiple LUNs as you would with a physical database server
- Use VMFS in most cases (Including Oracle RAC)
- Use RDM when required by storage vendor tools (e.g. backup, tiered storage) or MSCS Clustering
- Use RDM’s only if there is a need to swing LUNs across to physical systems
- Refer to VMware KB 1034165 for Oracle RAC and Third Party Clustering on VMFS
- Refer to my blog post on Oracle RAC in DRS and HA Clusters
- At least one data file per CPU for each tablespace
- Parallel Threads Per CPU = 1
- Tune file system handles, num open files
- Tune number of OS processes
- In large environments separation of Production and Dev/Test database systems on different clusters
- Data file size and growth size should be multiple of underlying VMFS block size. 8GB, 16GB, 32GB for example
- Split OS and Database files over separate virtual SCSI controllers and virtual disks on as few VMFS datastores as will support I/O requirements.
- Consider using ASM for storage management for the multiple virtual disks, use separate disk groups for data files, redo log and archive log
- When using ASM the ASM AU size should be the same as or a multiple of the VMFS block size
According to tests conducted by EMC when virtualizing Oracle RAC for their internal mission critical systems enabling NUMA in Oracle, even with 11G actually degrades performance. For detailed insight into EMC’s experience and their strategy for their internal mission critical Oracle RAC systems on vSphere please refer to this excellent article: EMC IT Virtual Oracle Deploy Framework.
For more information on ASM see Oracle’s Introduction to Automatic Storage Management (ASM)
The above recommendations are a good starting point but should be reviewed and tuned for your specific environment and requirements. Depending on the SLA’s and performance requirements for Test and Development environments you may want to consider higher utilization and overcommitment levels to improve cost efficiency further. If you plan to only have a single cluster for all your database systems, including production and non-production systems, you will need to pay very close attention to reservations and overcommitment levels. For large environments you should consider separate clusters for production and non-production systems.
VMware has also published a great Oracle Databases on VMware – Best Practices Guide, I would highly recommend you also review this thoroughly.
Linux OS Volume Logical Design Example:
The diagram below displays an example of how you might decide to deploy the volumes and storage on a Linux based Oracle database server.
The OS and App/Redo Volume in the above diagram are grouped together into a single datastore to make efficient use of available IOPS on tier 1 storage. This has the effect of grouping a low IO consumer with a high IO consumer. The backup datastore is separate and dedicated to the very large database servers as the latency may be higher and the type of IO stream is expected to be sequential. It is place on tier 3 storage. The data file volumes are grouped three per datastore and on tier 1 storage. This was done to align the performance with the physcial SAN characteristics and a standard datastore size for managability. If you were using ASM to manage the database storage it would just be a matter of configuring the virtual disks and assigning them to ASM to manage. You would set up one ASM disk group for data file storage and one for redo and archive log storage, each with the relevant virtual disks assigned. Having multiple virtual disks is important as there is a limit of 32 IO queues per LUN or volume mount point in the Linux OS and you want to maximize the parallelism of IO’s for best performance and lowest latency.
In the above example I’ve used a very small swap partition. This is intentional as the server has been sized with sufficient memory that it will not need to swap, I also tuned the kernel parameters to ensure that swapping is an absolute last resort. The memory needed by the database and key OS processes has been reserved in vCenter, only memory used for filesystem cache has been left unreserved. If there was ever a case of server failure which caused momentary memory contention I was happy to sacrifice some filesystem cache for additional IOPS on the SAN, but I was not willing to risk any core database or OS memory. Given that I design the production cluster with N+2 hosts, which included provision for maintenance capacity the chances of memory contention are quite low. The Oracle SGA is using large pages to ensure that the OS cannot swap out it’s pages, which also reduces the need for swap.
Test Dev Database Logical Storage Design Example:
The diagram below displays an example of how the storage for test and development or small production databases might be deployed.
The diagram above illustrates how you might choose to group two Test/Dev Database servers across a relatively small number of datastores to make efficient use of the number of LUN’s available to a host/cluster. Note that it is only the data file and backup store volume datastores that are shared. The OS/App/Redo volumes are separate per database server. This is because this is where the majority if high intensity IO is expected.
Large Production Database Logical Storage Design Example:
The diagram below displays an example of how the storage for a large production database might be deployed.
Large database servers will require dedicated storage to meet their requirements but this can still be optimized on a smaller number of datastores to get the most efficient use of storage IOPS and to improve managability.
I hope you have enjoyed the above high level overview of how to successfully Deploy Enterprise Oracle Databases on vSphere. Here are some additional references that may be helpful.
For everything I’ve written on Virtualizing Oracle visit my Oracle Page.
Oracle Databases on VMware vSphere Part 2 by Kannan Mani – Oracle RAC on VMFS with ASM
This post first appeared on the Long White Virtual Clouds blog at longwhiteclouds.com. By Michael Webster +. Copyright © 2012 – IT Solutions 2000 Ltd and Michael Webster +. All rights reserved. Not to be reproduced for commercial purposes without written permission.