
NUMA Performance Considerations in VMware vSphere

Ruben Soto – Member Technical Staff
AMD
May, 2012
Abstract
Addressing performance anomalies with workloads running on servers based on AMD Opteron™
processors and VMware vSphere™ requires a deep understanding of how vSphere manages and
assigns resources to a virtual machine. It is no longer sufficient to apply traditional hypervisor deployment
practices without considering the impact of the processor microarchitecture and how the hypervisor
leverages its features.

Beginning an analysis of AMD Opteron processor performance requires an understanding of (a) how
vSphere initially ‘carves up’ and lays out a large vSMP VM on a multi-die processor in a multi-socket
server, (b) the role and decision-making process of the vSphere scheduler, and (c) how vSphere™
attempts to maintain proper workload distribution to optimize performance.

While the issues and practices described here have a VMware vSphere™ focus, many of the same
concepts apply to competing hypervisors as well.
Introduction
As x86 server and multi-core processor architectures continue to evolve, so do the operating systems and
applications that run on these systems. AMD and VMware customers are implementing increasingly
larger, more robust VMs than ever before. In keeping with this trend, VMware continues to enhance
vSphere in order to support these large, resource-intensive workloads. For example, starting with
vSphere™ 4.0, the ESX hypervisor supports:

• 8 Virtual CPUs (vCPU) per virtual machine (vSMP)
• 256 GB RAM per VM
• 10 virtual NICs per VM

With vSphere 5.0 these configuration parameters have been increased to:

• 32 vCPUs per virtual machine
• 1 TB RAM per VM


Coupled with the concentrated physical compute capacity of AMD Opteron™ processors, complex
workloads such as ERP, Data Warehouses, and Business Intelligence, which previously may have been
too demanding to be suitable for virtualization, are now feasible. However, to get the most
out of your VM infrastructure, greater consideration should now be given to ensuring that these complex
solutions perform as expected.

Understanding Non Uniform Memory Access (NUMA)
For those not intimately familiar with memory access designs, Non Uniform Memory Access (NUMA)
provides a separate memory bank for each processor or, in the case of a Multi Chip Module (MCM)
design, each processor die. This design significantly improves performance for multi-processor
systems because access to memory local to a processor is much faster than access to non-local memory
(that is, memory local to, or shared with, another processor). As you will see, this concept is extremely
important when configuring multi-vCPU VMs.
The Processor’s Role
At a very high level, the AMD Opteron™ 6000 Series platform is a two-die design packaged onto a
common substrate/package referred to as a Multi-Chip Module (MCM; see Figure 1 below). Each die in
the AMD Opteron 6100 Series processor consists of either 4 or 6 physical compute cores. Each die in the
AMD Opteron 6200 Series processor may consist of 2, 4, 6, or even 8 physical compute cores. This
design introduces the notion of ‘intra-processor’ NUMA (Non Uniform Memory Access) at the processor
level, and ‘inter-processor’ NUMA at the multi-socket server level, which has a major influence on how
vSMPs must be crafted so as not to introduce performance penalties.

No discussion of performance would be complete without also discussing resource over-commitment.
VMware vSphere™ allows the administrator to allocate CPU cores and memory for all virtual machines
beyond the total physical resources that exist on a node or on the entire system. vSphere then does its
best to manage these resources efficiently to get the best performance possible. While the vSphere
algorithms do an excellent job of this, over-commitment of resources can often result in the movement of
virtual machines from one processor to another, and subsequently of their memory from one node to
another, in order to balance system performance as a whole. This rebalancing process itself takes CPU
cycles and results in some non-local memory access, which in turn may degrade performance for some
workloads. In short, every CPU cycle spent on rebalancing a VM is a cycle not spent on doing
meaningful work on behalf of the workload.



Figure 1: AMD Opteron™ 6000 Series Processor MCM (illustration courtesy of Frank Denneman)
VM Initial Placement
vSphere™ utilizes the same BIOS-created structures that any bare-metal OS would use to get a picture
of the underlying host server. It’s from these structures that vSphere creates a ‘mapping’ of the host
server, defining the NUMA node boundaries as well as identifying the relationship of the compute cores
and associated memory locations. As shown in the figure above, the AMD Opteron™ 6000-series
processor consists of two dies with each die representing a single NUMA node. With this information, the
vSphere CPU scheduler will attempt to ensure that each virtual machine’s memory is located on the same
NUMA node on which its vCPUs reside.
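
As a quick sanity check of the mapping vSphere builds, you can confirm how many NUMA nodes the hypervisor sees on the host. The command below is a minimal sketch, assuming an ESXi 5.x host with esxcli available (on ESX 4.x the same information is visible on the esxtop memory screen); the output is abbreviated and the values shown are illustrative for a single AMD Opteron 6100 Series processor (two dies, hence two nodes).

# esxcli hardware memory get
   Physical Memory: 68719476736 Bytes
   NUMA Node Count: 2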

The CPU scheduler implemented with VMware vSphere uses a round-robin technique for initial vCPU
placement based on the size of the vSMP VM, the location and size of the physical memory structures,
and other pertinent information such as VM Entitlements, Reservations, and any information regarding
over-commitment of resources. The CPU scheduler is further responsible for keeping a process and its
data co-resident on the same NUMA node. This is important with regard to performance, as it helps
reduce remote memory access latency and promotes cache sharing. This is another example of VMware’s
software engineering strength and collaboration with AMD.

At this point it should be obvious that the AMD Opteron processor’s high core count provides many
benefits for multi-threaded workloads that can take advantage of the increase in vCPU limits. More cores
per NUMA node can vastly improve performance and capacity, including:

– Greater local memory access
– Larger vSMP VMs without splitting across nodes
– Reduced World Switches


You might be thinking that’s all well and good, but what about a vSMP VM that is defined with more
vCPUs than fit within a single NUMA node?

Let’s look at an example of configuring a VM with more vCPUs than cores available on a single NUMA
node. We’ll base our example on the AMD Opteron 6100 Series processor, which is constructed of 2 dies
(see Figure 1 above) with up to 6 physical cores per die, each die having its own memory controller.

VMware vSphere™ 4.1 NUMA Scheduling

Prior to VMware vSphere™ 4.1, the ESX scheduler would assign a VM to a NUMA node called the home
node with the complete VM treated as one NUMA client. In this case the total number of vCPUs of a
NUMA client must not exceed the number of CPU cores on that node. In other words, all vCPUs must
reside within the NUMA node. If the total number of vCPUs for a VM exceeds the available cores on the
NUMA node then the VM will not be considered a NUMA client and will not be managed as such. For
example, a VM with 8 vCPUs on a 4-socket quad-core system like the one shown in the figure below
exceeds the capability of the NUMA node. In this case the CPU scheduler will abandon its NUMA-
boundary-biased algorithm in favor of equally distributing all vCPUs across all available NUMA nodes as
shown below in Figure 2.




Figure 2: 8-vCPU VM spread across 4 quad-core processors (illustration courtesy of Frank Denneman)

The performance implications in this case should be obvious:
• The CPU scheduler will place data at a position it deems appropriate, without much
consideration of parent process placement. This can potentially cause performance issues
due to remote memory accesses.
• NUMA optimizations could be disabled since there is no possibility of consolidating the vSMP
onto one NUMA node.
• Note: The situation is more acute for HyperThreaded processors since vSphere does not
recognize ‘logical processors’ until after the VM boots; at placement time they effectively do not exist.

With vSphere 4.1 VMware introduced the concept of Wide-VMs, which are able to take advantage of
NUMA management. A wide-VM is defined as a virtual machine that has more vCPUs than the available
cores on a NUMA node.

In the case of a wide-VM, the CPU scheduler splits the wide-VM into smaller NUMA clients whose
vCPU count does not exceed the number of cores per NUMA node. In the case above, the 8-vCPU SMP
VM would be placed as two 4-vCPU NUMA clients, each with its own home node (see Figure 3 below).
The vCPU scheduling of each smaller NUMA client is then performed the same as for any other NUMA
client. In this case, memory access improves from about 25% local (without NUMA management) to
about 50% local for the wide-VM.





Figure 3: 8-vCPU VM spread across 2 quad-core processors, with each half managed as a NUMA client (illustration courtesy of Frank Denneman)
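
If you want to influence how the scheduler carves a wide-VM into NUMA clients, vSphere 5.0 exposes a per-VM advanced option for this. The snippet below is a minimal sketch of the relevant lines in a VM’s configuration (.vmx) file; the option name numa.vcpu.maxPerVirtualNode and the value of 4 reflect this example’s 4-core NUMA nodes, so verify the option against the Resource Management Guide for your release before relying on it.

numvcpus = "8"
numa.vcpu.maxPerVirtualNode = "4"

With these settings the 8-vCPU VM would be presented to the scheduler as two 4-vCPU NUMA clients, mirroring the split shown in Figure 3.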


VMware vSphere™ Workload Rebalancing
While the VMware vSphere™ CPU scheduler will attempt to keep a process and its data within the
same NUMA node, this is not guaranteed. Other potential issues introduced with larger vSMPs are:

• The VM’s guest OS expects all the vCPUs to execute in close synchronicity. If the time difference
in execution becomes too large, the guest OS may incur an unrecoverable fault.
• Because all vCPUs need to be available before the virtual machine begins execution, VM
placement may not be optimal under high system utilization.

Every two seconds, vSphere takes an accounting of resource utilization, vSMP synchronicity, and VM
Reservations and Entitlements, and takes any actions it deems necessary to “rebalance” workloads.
During this rebalancing, a running process may be moved to a vCPU on a different NUMA node, separate
from its data, causing a performance problem due to remote memory access. vSphere will eventually
rebalance the data portion as well, but there may be a performance issue in the meantime.
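
To see whether rebalancing is separating a VM from its data, the NUMA counters in esxtop are a useful check. The steps below are a sketch assuming ESX/ESXi 4.1 or later, where the memory screen exposes per-VM NUMA statistics; the field-selection letter for the NUMA columns can differ between releases, so confirm it in the field menu.

# esxtop
(press ‘m’ for the memory screen, then ‘f’ and enable the NUMA STATS fields)

NHN   = current home node(s) of the VM
NMIG  = number of NUMA migrations since power-on
N%L   = percentage of the VM’s memory that is local

A steadily climbing NMIG value, or an N%L figure well below 100, is a good indication that the VM is being moved away from its data.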

VMware has been enhancing the CPU scheduler in vSphere over the past major releases to account for
these conditions, and to further optimize workload distribution, via a “relaxed co-scheduling” algorithm.



Note: This situation is more acute in HyperThreaded processors as vSphere enumerates physical and
logical cores in sequential fashion. In a rebalance, it’s possible that two independent VMs may be
collocated on the same physical core, causing an overcommitted CPU condition.

System Rebalancing – A Case Study
A customer recently reported a performance issue during a benchmarking exercise with an AMD
Opteron™ 6100 Series-based server with 12 cores per processor. The customer reported excessive
latency compared both to an older AMD Opteron-based server and to a competing server. After
a little research we found a couple of interesting data points:

1. The VMs were being migrated from node to node, causing each VM and its data to become
separated (remote) for long periods of time.
2. In an attempt to keep vCPUs and their associated data co-resident, the vSMPs were being split
across dies as part of VM rebalancing. This resulted in non-optimal cache utilization and data
sharing across vCPUs that needed to communicate.

In order to keep each VM on a single NUMA node, the customer set the parameter
“sched.cpu.vsmpConsolidate=true”, which resolved this particular issue.
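
For reference, this is a per-VM setting. The line below is a minimal sketch of how it appears when added to the VM’s configuration (.vmx) file, or via the vSphere Client under Edit Settings > Options > Advanced > General > Configuration Parameters; apply it with the VM powered off, and note that the quoting shown is simply the standard .vmx convention.

sched.cpu.vsmpConsolidate = "TRUE"

As noted in the deployment considerations below, this is most useful for cache-sharing workloads whose vCPU count fits within a single NUMA node.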

Important vSMP VM Deployment Considerations
The following practices will help address the issues discussed here:

1. Craft vSMPs so they do not exceed the physical core count of the die/NUMA Node.

2. Although vCPU scheduling has been “relaxed”, synchronicity demands that each vCPU’s virtual
time and run assignment remain in close proximity.

a. “Encourage” vSphere™ to keep all vCPU siblings in a vSMP on the same NUMA Node:

i. “sched.cpu.vsmpConsolidate=true”

1. Optional parameter in VM Config file
2. Good for cache sharing workloads
3. Helps reduce the potential for remote memory access

3. Do not oversize the VM with excessive vCPUs.
If a workload consumes 60% of a 3 GHz quad-core processor, then #cores * Utilization will
yield the starting point for the number of vCPUs required:

#vCPU = (4)(.6) = 2.4 (Round up to 3 vCPUs)

Testing will ultimately determine how many vCPUs to configure in the vSMP, but this
should provide a starting point for initial configuration.


4. For benchmarking, rebalancing may not be necessary. Options include (see the configuration
sketch after this list):

a. Disable rebalancing via “Numa.RebalanceEnable=0”
b. Disable excessive page migrations via “/Numa/PageMigEnable=0”
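
For reference, these two options are host-level advanced settings. The commands below are a minimal sketch of how they could be applied, assuming an ESX 4.x service console (esxcfg-advcfg) or an ESXi 5.x shell (esxcli); they can equally be changed under Advanced Settings in the vSphere Client, and should normally be left at their defaults outside of controlled benchmarking.

# ESX 4.x service console
esxcfg-advcfg -s 0 /Numa/RebalanceEnable
esxcfg-advcfg -s 0 /Numa/PageMigEnable

# ESXi 5.x shell
esxcli system settings advanced set -o /Numa/RebalanceEnable -i 0
esxcli system settings advanced set -o /Numa/PageMigEnable -i 0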
A Word on CPU Affinity
Setting the CPU affinity for a particular VM limits the CPUs on which that VM can execute. It is important
to note that CPU affinity does not dedicate a CPU to a VM; rather, it designates the CPU or CPUs that a
VM can run on. Other VMs running on the system may also utilize the resources of a CPU whose
affinity is assigned to another VM. CPU affinity may improve performance in certain situations, but use it
with caution, as it is known to degrade performance in many circumstances. Below are a few guidelines to
follow when considering the use of CPU affinity (a configuration sketch follows these guidelines).

• Can help when:
– VM workloads are cache bound
– Workloads require isolation to a physical CPU
• Should only be used as part of an overall configuration strategy
– Requires significant extra management (Shares, Reservations, and Entitlements)
– CPU affinity does not guarantee memory affinity
– Scheduling overhead remains the same
• Other affinity considerations
– Limits the available CPUs on which the virtual machine can run
– Physical CPU isolation/dedication to a VM is not guaranteed
– VM is excluded from NUMA scheduling
• Avoid in a DRS cluster
– Precludes vMotion
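
If, after weighing the guidelines above, affinity is still appropriate, it is set per VM. The lines below are a minimal sketch of the corresponding .vmx entries (affinity can also be set in the vSphere Client under Resources > Advanced CPU); the CPU and node numbers are illustrative for a 4-vCPU VM pinned to one die of an AMD Opteron 6100 Series processor, and sched.mem.affinity is included on the assumption that you also want the memory kept on the same node, since CPU affinity alone does not guarantee memory affinity.

sched.cpu.affinity = "6,7,8,9"
sched.mem.affinity = "1"

Remember that a VM with affinity set is excluded from NUMA scheduling, and affinity precludes vMotion in a DRS cluster.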
Other Performance Considerations
The following are a few other items to consider when analyzing your system’s virtualization performance.

1. For I/O or Network Performance/Benchmarking:

• Ensure I/O virtualization (IOMMU) is enabled in the BIOS
• Ensure VMDirectPath is enabled in ESX/ESXi
– Only available with selected 10Gb NICs
– The entire chain needs to support this
• Test with “Large Pages” and Jumbo Frames (default = off); a jumbo-frame configuration sketch
appears at the end of this section
• NIC teaming helps balance load across adapters
• ESX only: ensure the Console OS (COS) interrupt handler is not shared with slower devices.
An ESX console that shares an interrupt with a slow device will perform poorly, and as a result its
VMs will perform poorly, as shown in the data below. Note the high number of interrupts for CPU0
as well as Interrupt Request Queue (IRQ) #16. The detail on IRQ #16 shown after the interrupt
counts provides more clarity.



# cat /proc/vmware/interrupts

Vector PCPU0 PCPU1 PCPU2 PCPU3
0x71: 68505248 0 0 0 COS irq 16 (PCI level), VMK vmnic6, VMK vmnic1

Note: The COS only runs on CPU0.

# cat /proc/vmware/pci | grep <Interrupt Vector #>

Bus:Sl.F Vend:Dvid Subv:Subd Type Vendor ISA/ irq/Vec P M Module Name
000:29.0 8086:24d2 1014:02dc USB Intel 11/ 16/0x71 A C
007:00.0 14e4:1659 1014:02c6 Ethernet Broadcom 11/ 16/0x71 A V tg3 vmnic1
008:00.0 14e4:1659 1014:02c6 Ethernet Broadcom 11/ 16/0x71 A V tg3 vmnic6

In this case, the COS is sharing the IRQ with the VMkernel, the VMNICs, and the USB port.
2. Ensure the settings for Hypervisor Power Management are consistent with the guest OS.
3. Follow the sizing and best practices recommendations of the storage and ISV providers.
4. Beware of caching conflicts between the I/O adapters, CPU, OS, and workload.
• Each layer assumes it can do the caching better than the others
• So which one should? Testing is key
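
As mentioned in the I/O list above, jumbo frames must be enabled end to end before testing. The commands below are a minimal sketch of the host-side piece, assuming an ESX 4.x service console or an ESXi 5.x shell and a vSwitch named vSwitch1 (the name is illustrative); the guest OS NIC and the physical switch ports must also be configured for a 9000-byte MTU.

# ESX 4.x: set the MTU on the vSwitch
esxcfg-vswitch -m 9000 vSwitch1

# ESXi 5.x equivalent
esxcli network vswitch standard set --vswitch-name=vSwitch1 --mtu=9000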

Summary

As you can see, VM placement with regard to processor architecture and memory layout can greatly influence
the performance of your VMware vSphere environment. In addition, AMD Opteron™ processors with up to 8 cores
per NUMA node and 16 cores per processor give you a distinct advantage in configuring your multi-vCPU VMs
for performance. A keen eye toward aligning vCPUs with the available cores per NUMA node will maximize local
memory access and give you the performance advantage you’re looking for.
Resources
http://www.vmware.com/files/pdf/techpaper/VMW_vSphere41_cpu_schedule_ESX.pdf
http://www.vmware.com/pdf/Perf_Best_Practices_vSphere4.1.pdf
http://www.vmware.com/pdf/vSphere4/r41/vsp_41_resource_mgmt.pdf
http://www.vmware.com/pdf/vsphere4/r41/vsp_41_config_max.pdf
Frank Denneman blog at http://frankdenneman.nl/2010/09/esx-4-1-numa-scheduling/



Ruben Soto is a Field Application Engineer at AMD. His postings are his own opinions and
may not represent AMD’s positions, strategies or opinions. Links to third party sites, and
references to third party trademarks, are provided for convenience and illustrative purposes
only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third
party endorsement of AMD or any of its products is implied.