What is vNUMA?

I am occasionally asked what vNUMA is, so I decided to write this post for those who have asked. It aims to explain things at a fairly high level and hopefully will enable readers to understand what vNUMA is and why it is important.

VMware introduced vNUMA (Virtual Non-Uniform Memory Access) in vSphere 5. It is a feature that exposes the underlying NUMA architecture of the hypervisor to the VMs running on it. Assuming those VMs run NUMA-aware operating systems, they can potentially gain significant performance improvements from seeing the underlying NUMA architecture.

In order to fully understand vNUMA and its benefits we must first explain UMA (Uniform Memory Access) and NUMA (Non-Uniform Memory Access).

UMA

Uniform Memory Access is a shared memory architecture where all the processors share the same physical memory uniformly. This configuration is also known as a Symmetric Multi-Processing (SMP) system. The graphic below illustrates the UMA architecture:

[Diagram: UMA architecture]

As you can see, both processors have direct access to the same memory over the same bus, and this access is uniform or equal, meaning that no processor has a performance advantage over another when accessing memory addresses. This architecture is suitable for general-purpose, time-critical applications used by multiple users, as well as for large single programs in time-critical applications.

The problem with UMA was that the demand for much larger server systems meant more processors sharing the same memory bus, which increased memory access latency and consequently impacted operating system and application performance.

NUMA

NUMA architecture works by linking memory directly to a processor to create a NUMA node. Here, all processors have access to all memory, but that memory access is not uniform or equal. The graphic below illustrates this:

[Diagram: NUMA architecture]

As you can see, each processor has direct access to its own memory, known as local memory. It can also access the memory attached to the other processor, known as remote memory. Access to remote memory is significantly slower than access to local memory, which is why the memory access is non-uniform.

Memory access times are not uniform and depend on the location of the memory and the node from which it is accessed, as the technology’s name implies.

The main benefits of the NUMA architecture are reduced memory latency and improved application memory performance. To realise these benefits the operating system must be NUMA-aware, so that it can place applications within specific NUMA nodes and prevent them from crossing NUMA-node boundaries.
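
For example, on a NUMA-aware Linux guest the topology the operating system sees is exposed under /sys/devices/system/node. The short Python sketch below simply reads that standard sysfs interface to list each node’s CPUs and memory; it is purely illustrative and makes no vSphere-specific assumptions. A UMA-style VM would typically report a single node, while a wide, vNUMA-enabled VM would report several.

    # Minimal sketch (Linux only): list the NUMA nodes visible to the guest
    # by reading the standard sysfs interface /sys/devices/system/node.
    from pathlib import Path

    def list_numa_nodes():
        """Print each NUMA node with the CPUs and total memory it owns."""
        for node in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
            cpus = (node / "cpulist").read_text().strip()
            # The first line of the per-node meminfo file reads "Node N MemTotal: <n> kB"
            mem_total = " ".join((node / "meminfo").read_text().splitlines()[0].split()[-2:])
            print(f"{node.name}: CPUs {cpus}, memory {mem_total}")

    if __name__ == "__main__":
        list_numa_nodes()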

vNUMA

As mentioned earlier, vNUMA exposes the underlying NUMA architecture of the hypervisor to the VMs running on it. As long as those VMs run NUMA-aware operating systems, they can make the most intelligent and efficient use of the underlying processors and memory.

vNUMA was introduced with vSphere 5. In earlier versions of vSphere, a VM whose vCPUs spanned multiple physical sockets believed it was running on a UMA system, so its operating system could not apply its NUMA-aware management features. This could significantly impact the performance of the VM.

With the increased demand for ever larger VMs, wide or "monster" VMs are becoming the norm in enterprise cloud environments, especially when those VMs are running critical workloads. The latest version of vSphere at the time of writing, vSphere 6, supports VMs with 128 vCPUs and 4 TB of memory. As VMs move towards these upper limits in size they will certainly span multiple NUMA nodes, and this is where vNUMA can significantly improve the system and application performance of these large, high-performance VMs.

Below are some important points relating to vNUMA:

  • vNUMA requires VMs to run virtual hardware version 8 or above.
  • The hypervisor must run vSphere 5.0 or above.
  • The hypervisor must be running on NUMA-enabled hardware.
  • vNUMA is automatically enabled for VMs with more than 8 x vCPUs (so 9 x vCPUs or more).
  • On VMs with 8 x vCPUs or fewer, vNUMA must be enabled manually. It can be set in the VM’s Configuration Parameters (see the sketch after this list).
  • vNUMA is not used on a VM with vCPU hotplug enabled; such a VM will use UMA with interleaved memory access instead.
  • A VM’s vNUMA topology is set based on the NUMA topology of the hypervisor it is running on. It retains the vNUMA topology of the hypervisor it was first started on even if it is migrated to another hypervisor in the same cluster. This is why it is good practice to build clusters from identical physical hosts/hardware.
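
To illustrate the manual step mentioned in the list above, the sketch below uses pyVmomi (the Python SDK for the vSphere API) to add the advanced setting that lowers the vCPU threshold at which a virtual NUMA topology is exposed. It is a minimal sketch under stated assumptions rather than a definitive procedure: it assumes vm is an already-retrieved vim.VirtualMachine object, that the VM is powered off, and that the advanced option key numa.vcpu.min applies to your vSphere version, so verify it against the VMware documentation before using it.

    # Hedged sketch: lower the threshold so a virtual NUMA topology is also
    # exposed to VMs with 8 or fewer vCPUs. Assumes pyVmomi is installed and
    # that vm is an existing, powered-off vim.VirtualMachine.
    from pyVmomi import vim

    def expose_vnuma_to_small_vm(vm, min_vcpus=1):
        """Add numa.vcpu.min to the VM's advanced configuration parameters."""
        option = vim.option.OptionValue(key="numa.vcpu.min", value=str(min_vcpus))
        spec = vim.vm.ConfigSpec(extraConfig=[option])
        return vm.ReconfigVM_Task(spec=spec)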

VM Sizing

VMware recommends sizing VMs so that they align with physical NUMA boundaries. So, if your hypervisor has 8 x cores per socket (octo-core), each NUMA node is assumed to contain 8 cores, and you should size your virtual machines in multiples of 8 x vCPUs (8 vCPUs, 16 vCPUs, 24 vCPUs, 32 vCPUs and so on).

VMware also recommends sizing VMs with the default value of 1 core per socket (the number of virtual sockets is therefore equal to the number of vCPUs). So, on our octo-core server, if we require 8 x vCPUs we would give the VM the configuration below:

[Screenshot: VM configured with 8 virtual sockets and 1 core per socket]

If the configuration is changed to 8 cores per socket, the VM still aligns correctly with the NUMA node, as there are still 8 cores:

[Screenshot: VM configured with 1 virtual socket and 8 cores per socket]

However, this configuration can result in reduced performance. According to this VMware article, it is considered less than optimal and resulted in significant increases in application execution time. The conclusion was that the corespersocket configuration of a virtual machine does have an impact on performance when the manually configured vNUMA topology does not optimally match the physical NUMA topology.
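
For completeness, the hedged pyVmomi sketch below shows how the vCPU count and cores-per-socket value discussed above map onto the vSphere API; the properties numCPUs and numCoresPerSocket belong to the VirtualMachineConfigSpec object, while the helper name and its defaults are illustrative assumptions.

    # Hedged sketch: set the vCPU layout of a VM. Assumes vm is an existing,
    # powered-off vim.VirtualMachine. As recommended above, leaving the default
    # of 1 core per socket lets the hypervisor choose the vNUMA topology itself.
    from pyVmomi import vim

    def set_vcpu_layout(vm, total_vcpus=8, cores_per_socket=1):
        """Set the total vCPU count and the cores-per-socket value of a VM."""
        spec = vim.vm.ConfigSpec(numCPUs=total_vcpus,
                                 numCoresPerSocket=cores_per_socket)
        return vm.ReconfigVM_Task(spec=spec)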

Sizing VMs incorrectly, so that they do not match the underlying NUMA topology, results in reduced performance. Using our example where the hypervisor has 8 cores per socket, a VM created with 10 virtual sockets (10 vCPUs in total) would breach the NUMA boundary: 8 of the vCPUs would come from the assigned NUMA node and 2 would come from another NUMA node. This would reduce performance, as applications would be forced to access remote memory and incur a performance hit. The size of that hit depends on factors unique to that specific VM.
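
As a rough back-of-the-envelope check of the arithmetic above, the short sketch below tests whether a requested vCPU count aligns with the physical NUMA node size (8 cores per node in our example). The function is purely illustrative and not a VMware tool.

    # Illustrative sizing check, assuming 8 cores per physical NUMA node as in
    # the example above.
    def check_vcpu_alignment(requested_vcpus, cores_per_numa_node=8):
        """Report whether a vCPU count aligns with physical NUMA node boundaries."""
        remainder = requested_vcpus % cores_per_numa_node
        if remainder == 0:
            nodes = requested_vcpus // cores_per_numa_node
            print(f"{requested_vcpus} vCPUs fit cleanly into {nodes} NUMA node(s).")
        else:
            print(f"{requested_vcpus} vCPUs spill {remainder} vCPU(s) into an extra NUMA node; "
                  f"consider {requested_vcpus - remainder} or "
                  f"{requested_vcpus + cores_per_numa_node - remainder} vCPUs instead.")

    check_vcpu_alignment(10)  # the 10-vCPU example above: 8 local and 2 remote
    check_vcpu_alignment(16)  # fits cleanly across two 8-core NUMA nodes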

References:
vNUMA: What it is and why it matters
Understanding vNUMA (Virtual Non-Uniform Memory Access)
Sizing VMs and NUMA nodes
ESX 4.1 NUMA Scheduling
SQL Server Virtual Machine vNUMA Sizing
Many Cores per Socket or Single-Core Socket Mystery
Does corespersocket Affect Performance?
Cores Per Socket and vNUMA in VMware vSphere
Performance Best Practices for VMware vSphere® 6.0