VMware has announced that it will turn off TPS in upcoming versions of its hypervisor ESXi and its vCloud Air hybrid cloud service. This is due to a security bug that is considered a very rare possibility and is only exploitable in very controlled and largely misconfigured environments. TPS, also known as Transparent Page Sharing, is a memory management technique that allows multiple VMs to share a read-only copy of the same memory page. When a VM needs to update or write to a page, a new copy is created. The idea is that if there are many VMs with similar memory pages on the same physical host server, the hypervisor will de-duplicate the pages and only store one copy. The result is that you can run more VMs per physical server while still achieving very good performance.
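To make the share-then-copy-on-write idea concrete, here is a minimal toy sketch in Python. It is not VMware's implementation: the class name `SharedPageStore`, its methods, and the choice of SHA-256 to detect identical pages are all assumptions made for illustration only.

```python
import hashlib

PAGE_SIZE = 4096  # bytes; the small (4KB) page size


class SharedPageStore:
    """Toy model of page sharing: identical pages are stored once, copy-on-write.

    Hypothetical illustration only; a real hypervisor does this at the
    level of physical memory pages, not Python dictionaries.
    """

    def __init__(self):
        self.pages = {}     # content hash -> page bytes (the single stored copy)
        self.refcount = {}  # content hash -> number of guest pages mapped to it
        self.mapping = {}   # (vm, guest page number) -> content hash

    def write(self, vm, page_no, data):
        """Write one full page for a VM, sharing the stored copy if content matches."""
        if len(data) != PAGE_SIZE:
            raise ValueError("data must be exactly one page")
        key = (vm, page_no)
        old = self.mapping.get(key)
        if old is not None:
            # The VM stops referencing the old (possibly shared) copy.
            # Other VMs that still map it are unaffected: copy-on-write.
            self._release(old)
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.pages:
            self.pages[digest] = bytes(data)
            self.refcount[digest] = 0
        self.refcount[digest] += 1
        self.mapping[key] = digest

    def read(self, vm, page_no):
        """Return the page content as seen by the given VM."""
        return self.pages[self.mapping[(vm, page_no)]]

    def _release(self, digest):
        """Drop one reference and free the stored copy when nobody uses it."""
        self.refcount[digest] -= 1
        if self.refcount[digest] == 0:
            del self.refcount[digest]
            del self.pages[digest]

    def logical_pages(self):
        """Number of guest pages the VMs believe they have."""
        return len(self.mapping)

    def physical_pages(self):
        """Number of distinct copies actually stored."""
        return len(self.pages)
```

If two VMs each write an all-zero page, `logical_pages()` is 2 while `physical_pages()` is 1. As soon as one of them writes different content to its page, it gets its own copy and the other VM still reads zeros.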
TPS has long been used by VMware as a competitive advantage over the other hypervisors. But realistically it hasn’t been in wide use by most customers for some time (since ESX 3.5), for three reasons: the amount of RAM per host has increased, processors from Nehalem onwards use large memory pages (2MB instead of 4KB), and most customers don’t want to run their systems at 100% utilization, so that they can handle bursts of activity. When using large pages, TPS only kicked in when systems were over 96% memory utilization, at which point large pages would be broken down into small pages that could be shared. However, this has been a popular technique with service providers, with virtual desktop environments, and in some test and development environments, where overcommitment of memory may have been acceptable.
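The page-size arithmetic behind that is simple enough to sketch. The snippet below is only a back-of-the-envelope illustration: the function names are hypothetical, and the 96% figure is just the threshold quoted above, not a value read from any host.

```python
SMALL_PAGE = 4 * 1024          # 4KB small page
LARGE_PAGE = 2 * 1024 * 1024   # 2MB large page
BREAK_THRESHOLD = 0.96         # utilization figure quoted in the text


def small_pages_per_large_page():
    """How many 4KB pages a single 2MB large page breaks down into."""
    return LARGE_PAGE // SMALL_PAGE


def large_pages_likely_broken(used_gb, total_gb, threshold=BREAK_THRESHOLD):
    """True when host memory utilization is over the threshold.

    Hypothetical helper: it only compares a ratio, it does not query a host.
    """
    return used_gb / total_gb > threshold
```

Each large page covers 512 small pages, so on a host that stays well below the threshold there are simply no small pages for TPS to share.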
The security problem was found by recent research that leverages Transparent Page Sharing (TPS) to gain unauthorized access to data under certain highly controlled conditions. The research demonstrated that by forcing a flush and reload of cache memory, it is possible to measure memory timings to try to determine an AES encryption key in use on another virtual machine running on the same physical processor of the host server, if Transparent Page Sharing is enabled. This is effectively a VM escape, where code executed within one VM can break the hypervisor isolation and read data from another VM’s memory. Certainly not a good situation if said VM contains credit card data, as we’ve already had enough breaches recently. The conditions under which this could be exploited would be rare in the real world, especially as most environments don’t use TPS actively, even if it is enabled. Even so, I believe in being secure by default. The number of conditions that have to be true simultaneously for this to be exploited makes an attack very unlikely, but if it were exploited the impact could be high. So I believe that VMware is taking the right approach to this research by disabling TPS.
I have been a proponent of leaving TPS enabled in the past, even though a few others have previously recommended it be disabled for performance reasons. My argument was that TPS is a good safety net if all else fails, even if during normal operations it is not used. Also, performance was never proven to be a factor. I put this argument in my article Blueprint for Successful Large Scale Oracle Virtualization on vSphere when an EMC paper recommended disabling TPS. To quote that article: “Disabling TPS can have disastrous consequences, including causing additional host swapping, which can result in extremely poor performance, much worse than disabling it could ever possibly gain.” So this raises the question: now that it’s being disabled by VMware, what impact will it have?
Without TPS you will have to be much more conservative with memory usage per host. If your business requirements dictate it, you will have to be able to sustain maintenance and failure without causing memory overcommitment. If a failure or maintenance causes temporary or prolonged overcommitment of memory, you will see a lot more guest OS swapping due to ballooning, and host swapping may also occur, which would greatly impact performance. Memory swapping is the enemy of performance, and when it occurs it also adds significantly to poor performance on shared storage. But this is possibly better than the alternative of leaving the security bug in place.
If you have an existing VMware vSphere environment, this means you need to evaluate the level of resource usage you have today, your standard operating procedures for maintenance, and the settings of VMware HA Admission Control for failure. If you don’t have sufficient available memory to operate your environment in the case of failure or maintenance, then you may need to upgrade the amount of RAM per host or purchase additional hosts. With any additional hosts you’d need additional licenses. Frank Denneman has a good take on the capacity planning implications in his article here.
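As a rough way to think about that evaluation, the following Python sketch checks whether the configured VM memory still fits in physical RAM after losing one or more hosts. It is a simplification under stated assumptions: the function name `memory_headroom_gb` is hypothetical, it assumes the largest hosts are the ones lost, and it ignores hypervisor overhead, per-VM overhead, reservations and HA Admission Control policy details.

```python
def memory_headroom_gb(host_ram_gb, vm_memory_gb, hosts_lost=1):
    """Spare RAM in GB after losing the largest `hosts_lost` hosts.

    A negative result means the surviving hosts would be overcommitted
    (with no page sharing to fall back on). Hypothetical helper that
    ignores hypervisor and per-VM memory overhead.
    """
    if hosts_lost < 0 or hosts_lost >= len(host_ram_gb):
        raise ValueError("hosts_lost must leave at least one surviving host")
    # Worst case: the hosts with the most RAM are the ones that fail
    # or go into maintenance.
    surviving = sorted(host_ram_gb)[: len(host_ram_gb) - hosts_lost]
    return sum(surviving) - sum(vm_memory_gb)


if __name__ == "__main__":
    hosts = [256, 256, 256, 256]  # four example hosts with 256 GB each
    print(memory_headroom_gb(hosts, [700]))  # 68: fits after one host failure
    print(memory_headroom_gb(hosts, [800]))  # -32: overcommitted after one failure
```

In the second case the cluster runs fine with all four hosts (1024 GB), but a single host failure leaves it 32 GB short, which is exactly the situation where you would previously have leaned on TPS as a safety net.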
TPS will be disabled by default from the following VMware vSphere Releases:
- ESXi 5.5 Update release – Q1 2015
- ESXi 5.1 Update release – Q4 2014
- ESXi 5.0 Update release – Q1 2015
- The next major version of ESXi
VMware’s official statement on this problem is contained within KB 2080735 Security considerations and disallowing inter-Virtual Machine Transparent Page Sharing. This KB also contains the steps to disable TPS on older versions of VMware vSphere that will not be covered by patches.
If you want to check whether you have TPS enabled or not on your existing versions, and if you want to disable it you can use the following PowerCLI examples (explicitly provided without any warranty, use at your own risk):
Check if TPS is Enabled on all hosts connected to a vCenter Server, Mem.ShareScanGHz returns > 0 if enabled.
Connect-VIServer <YourvCenter>
Get-VMHost -State Connected | Get-AdvancedSetting -Name Mem.ShareScanGHz | Format-Table -Property Entity,Name,Value -AutoSize
Disconnect-VIServer
Disable TPS on all hosts connected to a vCenter Server by setting Mem.ShareScanGHz = 0, check the setting has been applied correctly
Connect-VIServer <YourvCenter>
Get-VMHost -State Connected | Get-AdvancedSetting -Name Mem.ShareScanGHz | Set-AdvancedSetting -Value 0
Get-VMHost -State Connected | Get-AdvancedSetting -Name Mem.ShareScanGHz | Format-Table -Property Entity,Name,Value -AutoSize
Disconnect-VIServer
So if TPS is vulnerable to data leakage and VM escape attacks, what about the recently announced Project Fargo, AKA VMFork? VMFork allows a running VM to be quiesced and rapidly cloned by using a similar copy-on-write technique to share a read-only copy of the parent VM’s memory, and by sharing the parent VM’s read-only disk, with updates being written to a delta disk. This allows a VM to be cloned and get up and running on the network with its own personality in a matter of a few seconds, with the VM memory and disk effectively being deduped at the same time. This doesn’t just have applicability to VDI environments, but also to web server environments, Dev and Test environments and many other use cases. I’m sure VMware won’t let VMFork out in the wild until issues such as the VM escape bug with TPS are addressed. Kit Colbert, VMware CTO for End User Computing, has said to me that VMFork is much more secure than TPS, so it may not suffer from the same problems.
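The read-only parent plus delta disk idea can be sketched in a few lines of Python. This is only a toy model of the general copy-on-write pattern described above, not how VMFork is actually built; the class name `ForkedDisk` and its methods are hypothetical.

```python
class ForkedDisk:
    """Toy child disk: reads fall through to a shared read-only parent,
    writes land in a private delta. Hypothetical illustration only."""

    def __init__(self, parent_blocks):
        # The parent is a tuple so that no child can modify it in place.
        self._parent = tuple(parent_blocks)
        self._delta = {}  # block number -> data written by this child

    def read(self, block_no):
        """Return this child's view of a block."""
        if block_no in self._delta:
            return self._delta[block_no]
        return self._parent[block_no]

    def write(self, block_no, data):
        """Record a change in the delta; the parent is never touched."""
        if not 0 <= block_no < len(self._parent):
            raise IndexError("block number out of range")
        self._delta[block_no] = data

    def delta_size(self):
        """Number of blocks this child has changed."""
        return len(self._delta)


if __name__ == "__main__":
    parent = (b"boot", b"os", b"app")
    child_a = ForkedDisk(parent)
    child_b = ForkedDisk(parent)
    child_a.write(2, b"app-v2")  # child A gets its own personality
    print(child_a.read(2), child_b.read(2), parent[2])
```

Each clone starts with an empty delta, which is why it can be up and running in seconds: nothing is copied until the clone actually changes something.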
VMware is not alone in having a VM escape vulnerability discovered. A security bug in the Xen hypervisor was also made public that allowed a VM escape, where code executed within one VM could escape the encapsulation of the hypervisor to a neighbour VM or dom0. This is covered at the VUPEN Vulnerability Research Team’s blog site.
Final Word
Nothing is fully secure. You can never guarantee that your system isn’t vulnerable to attack. All you can do is take appropriate measures to reduce the risk of attack, and implement technical controls along with monitoring and auditing processes. Implement separation of duties, least privilege access, and role-based access controls. Implement the guidelines from VMware’s and other vendors’ hardening guides that make sense based on your business requirements. Comply with the security standards for your industry or company that make sense. Stay on top of critical security patches and implement them as soon as practicable, especially for any environments containing public-facing or highly secure systems.
—
This post first appeared on the Long White Virtual Clouds blog at longwhiteclouds.com. By Michael Webster. Copyright © 2012 – 2014 – IT Solutions 2000 Ltd and Michael Webster. All rights reserved. Not to be reproduced for commercial purposes without written permission.
Just a note, Mem.ShareScanGHz does not disable TPS, the correct value is Mem.ShareForceSalting, see the following KB for more details http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2097593