3.3. Claims Design
3.3.1. Introduction
Xen’s page allocator supports a claims API that allows privileged domain builders to reserve an amount of available memory before populating the guest physical memory of new domains they are creating, configuring and building.
These reservations are called claims. They ensure that the claimed memory remains available for the domains when allocating it, even if other domains are allocating memory at the same time.
Installing claims is a privileged operation performed by domain builders before they populate the guest physical memory. This prevents other domains from allocating memory earmarked for domains under construction. Xen maintains the per-domain claim state for pages that are claimed but not yet allocated.
When claim installation succeeds, Xen updates the claim state to reflect the new targets and protects the claimed memory until it is allocated or the claim is released. As Xen allocates pages for the domain, claims are redeemed by reducing the claim state by the size of each allocation.
3.3.2. Design Goals
The design’s primary goals are:
Allow domain builders to claim memory on multiple NUMA nodes using a claim set atomically.
Preserve the existing XENMEM_claim_pages hypercall command for compatibility with existing domain builders and its legacy semantics, while introducing a new, unrestricted hypercall command for new use cases such as NUMA-aware claim sets.
Global claims are supported for compatibility with existing domain builders and for use cases where a flexible claim that can be satisfied from any node is desirable, such as on UMA machines or as a fallback for memory that becomes available on any node. This means that neither the legacy global claim call nor the variables maintaining the global claim state can be removed or replaced. They are still needed: claims are not just for NUMA use cases, but for parallel domain builds in general.
Only on UMA machines is a global claim equivalent to a claim on node 0. On NUMA machines, a global claim can cover more memory than any single node holds, and it can serve as a flexible fallback: when the preferred NUMA node(s) have insufficient free memory at the time of claim installation, the global claim ensures that the shortfall is available from any node.
Use fast allocation-time claims protection in the allocator’s hot paths to protect claimed memory from parallel allocations by other domain builders during parallel domain builds, and from allocations by already running domains.
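The fallback role of global claims described in the goals above can be sketched from the domain builder's side. This is a minimal illustration, not code from Xen or its libraries: `struct claim_plan` and `plan_claims()` are hypothetical names, and the sketch assumes the builder already knows how many pages are free on its preferred node.

```c
#include <assert.h>

/* Hypothetical result type: how many pages to claim where. */
struct claim_plan {
    unsigned long node_pages;    /* claim on the preferred NUMA node */
    unsigned long global_pages;  /* shortfall, claimed globally (any node) */
};

/*
 * Split a domain's memory requirement into a node-specific claim and a
 * global claim for whatever the preferred node cannot provide.
 */
struct claim_plan plan_claims(unsigned long needed,
                              unsigned long preferred_node_free)
{
    struct claim_plan plan;

    plan.node_pages = needed < preferred_node_free ? needed
                                                   : preferred_node_free;
    plan.global_pages = needed - plan.node_pages;

    /* The two claims together always cover the full requirement. */
    assert(plan.node_pages + plan.global_pages == needed);
    return plan;
}
```

For example, a domain needing 100 pages with 70 pages free on its preferred node would claim 70 pages on that node and 30 pages globally. The free-page figure may be stale by the time the claim is installed, so a builder following this approach would still have to handle a failed installation.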
3.3.3. Design Overview
The legacy XENMEM_claim_pages hypercall is superseded by XEN_DOMCTL_claim_memory. This hypercall installs a claim set: an array of memory_claim_t entries, where each entry specifies a page count and a target, which is either a specific NUMA node ID or a special selector (for example, a global or flexible claim).
Like legacy claims, claim sets are validated and installed under domain.page_alloc_lock and heap_lock: either the entire set is accepted, or the request fails with no side effects. Repeated calls to install claims replace any existing claims for the domain rather than accumulating.
Claim sets do not retain the legacy behaviour of subtracting existing allocations from the installed claims, neither in total nor per node across the individual claim set entries. Installing claim sets after allocations is not a supported use case, the legacy subtraction is surprising and counterintuitive, and page exchanges make incremental per-node tracking of already-allocated pages difficult.
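The all-or-nothing and replace-not-accumulate semantics can be sketched as a validate-then-commit routine. This is a simplified model under stated assumptions: the entry layout, `TARGET_GLOBAL`, `struct host_state`, `struct dom_claims` and `install_claim_set()` are hypothetical, the availability checks are a guess at a reasonable policy, and the locking under domain.page_alloc_lock and heap_lock is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_NODES 4           /* hypothetical node count for this sketch */
#define TARGET_GLOBAL 0xFFu  /* hypothetical selector for a global claim */

/* Hypothetical entry layout; the real memory_claim_t may differ. */
struct claim_entry { unsigned long pages; unsigned int target; };

struct host_state {
    unsigned long free_pages[NR_NODES];
    unsigned long node_claimed[NR_NODES]; /* all domains' claims per node */
    unsigned long global_claimed;         /* all domains' global claims */
};

struct dom_claims {
    unsigned long node_claims[NR_NODES];
    unsigned long global_claims;
};

bool install_claim_set(struct host_state *h, struct dom_claims *d,
                       const struct claim_entry *set, size_t n)
{
    struct dom_claims next = { {0}, 0 };
    unsigned long total_free = 0, others_total = 0, next_total = 0;

    /* Phase 1: build the replacement state; reject unknown targets. */
    for (size_t i = 0; i < n; i++) {
        if (set[i].target == TARGET_GLOBAL)
            next.global_claims += set[i].pages;
        else if (set[i].target < NR_NODES)
            next.node_claims[set[i].target] += set[i].pages;
        else
            return false;
        next_total += set[i].pages;
    }

    /* Phase 2: validate, treating the domain's old claims as released. */
    for (unsigned int node = 0; node < NR_NODES; node++) {
        assert(h->node_claimed[node] >= d->node_claims[node]);
        unsigned long others = h->node_claimed[node] - d->node_claims[node];

        if (others + next.node_claims[node] > h->free_pages[node])
            return false;
        total_free += h->free_pages[node];
        others_total += others;
    }
    others_total += h->global_claimed - d->global_claims;
    if (others_total + next_total > total_free)
        return false;

    /* Phase 3: commit. Nothing was modified before this point. */
    for (unsigned int node = 0; node < NR_NODES; node++)
        h->node_claimed[node] = h->node_claimed[node]
                                - d->node_claims[node]
                                + next.node_claims[node];
    h->global_claimed = h->global_claimed - d->global_claims
                        + next.global_claims;
    *d = next;
    return true;
}
```

Because every check runs before the first write, a rejected set leaves both the host totals and the domain's existing claims untouched, and a successful call overwrites the old claims instead of adding to them.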
Summary:
Legacy domain builders can continue to use the previous (now deprecated) XENMEM_claim_pages hypercall command to install single-node claims with the legacy semantics and, aside from improvements or fixes to global claims in general, observe no changes in their behaviour.
Updated domain builders can take advantage of claim sets to install NUMA-aware claims on multiple NUMA nodes and/or globally in a single step.
For readers following the design in order, the next sections cover the following topics:
Claim Installation Paths explains how claim sets are installed.
Protection of Claims describes how claimed memory is protected during allocation.
Redeeming Claims explains how claims are redeemed as allocations succeed.
Claims Accounting describes the accounting model that underpins those steps.
3.3.4. Key design decisions
- node_outstanding_claims[MAX_NUMNODES]
Tracks the sum of all claims on a node. get_free_buddy() checks it before scanning zones on a node, so claimed memory is protected from other allocations.
- redeem_claims_for_allocation()
When allocating memory for a domain, the page allocator redeems the matching claims for this allocation. This ensures that the domain’s total memory allocation, domain_tot_pages(domain), plus its outstanding claims, domain.global_claims + domain.node_claims, remains within the domain’s limit, domain.max_pages. See Redeeming Claims for details on redeeming claims.
- domain.global_claims (formerly domain.outstanding_claims)
Support for global claims is maintained for two reasons: first, for compatibility with existing domain builders, and second, for use cases where a flexible claim that can be satisfied from any node is desirable.
When the preferred NUMA node(s) for a domain do not have sufficient free memory to satisfy the domain’s memory requirements, global claims provide a flexible fallback for the memory shortfall from the preferred node(s) that can be satisfied from any available node.
In this case, domain builders can exploit a combination of passing the preferred node to xc_domain_populate_physmap() and NUMA node affinity to steer allocations towards the preferred NUMA node(s), while letting the global claim ensure that the shortfall is available.
This allows the domain builder to define a set of desired NUMA nodes to allocate from and even specify which nodes to prefer for an allocation, but the claim for the shortfall is flexible, not specific to any node.
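The interplay of the three design decisions above can be sketched with a small model. It is an illustration under assumptions, not the allocator's code: `node_free_pages`, `struct domain_model`, `node_allocation_permitted()` and `redeem_claims_model()` are hypothetical, the order of redeeming node claims before global claims is assumed, and protection of global claims is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_NODES 2  /* hypothetical node count for this sketch */

/* Simplified stand-ins for the allocator state named above. */
unsigned long node_free_pages[NR_NODES];
unsigned long node_outstanding_claims[NR_NODES];

struct domain_model {
    unsigned long tot_pages;   /* models domain_tot_pages(domain) */
    unsigned long max_pages;
    unsigned long global_claims;
    unsigned long node_claims[NR_NODES];
};

static unsigned long min_ul(unsigned long a, unsigned long b)
{
    return a < b ? a : b;
}

unsigned long outstanding_claims_of(const struct domain_model *d)
{
    unsigned long sum = d->global_claims;

    for (unsigned int node = 0; node < NR_NODES; node++)
        sum += d->node_claims[node];
    return sum;
}

/* Protection: pages claimed on the node by other domains must stay free. */
bool node_allocation_permitted(const struct domain_model *d,
                               unsigned int node, unsigned long request)
{
    assert(node < NR_NODES);
    if (request > node_free_pages[node])
        return false;

    unsigned long others = node_outstanding_claims[node]
                           - d->node_claims[node];
    return node_free_pages[node] - request >= others;
}

/* Redeeming: shrink the claims by the size of a successful allocation. */
void redeem_claims_model(struct domain_model *d, unsigned int node,
                         unsigned long pages)
{
    /* Assumed order: the node-specific claim first, then the global one. */
    unsigned long from_node = min_ul(pages, d->node_claims[node]);
    unsigned long from_global = min_ul(pages - from_node, d->global_claims);

    d->node_claims[node] -= from_node;
    node_outstanding_claims[node] -= from_node;
    d->global_claims -= from_global;

    node_free_pages[node] -= pages;
    d->tot_pages += pages;

    /* Allocated pages plus outstanding claims stay within max_pages. */
    assert(d->tot_pages + outstanding_claims_of(d) <= d->max_pages);
}
```

In this model a domain can always allocate up to its own node claim plus the node's unclaimed memory, while any request that would eat into another domain's claim is refused before the node is scanned.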
3.3.5. Non-goals
3.3.5.1. Using per-node allocator data
Some data structures could be moved into the per-node allocator data allocated by init_node_heap(), to avoid bouncing those data structures between nodes, but that would not eliminate the need to take the global heap_lock, which is still needed to protect the allocator’s internal state during allocation and deallocation.
The synchronisation point for taking the global heap_lock is the main point of contention during allocation, freeing and scrubbing pages. The overhead of accessing the per-node claims accounting data is expected to be minimal.
However, we aim to move that data into the per-node allocator data in the future to reduce the need to bounce those data structures between nodes.
3.3.5.2. Legacy behaviours
Installing claims is a privileged operation performed by domain builders before they populate guest memory. As such, tracking previous allocations is not in scope for claims.
For the following reasons, claim sets do not retain the legacy behaviour of subtracting existing allocations from installed claims:
Xen does not currently maintain a d->node_tot_pages[node] count, and the hypercall to exchange extents of memory with new memory makes such accounting relatively complicated.
The legacy behaviour is somewhat surprising and counterintuitive. Because installing claims after allocations is not a supported use case, subtracting existing allocations at installation time is unnecessary.
Claim sets are a new API and can provide more intuitive semantics without subtracting existing allocations from installed claims. This also simplifies the implementation and makes it easier to maintain.
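The difference between the two semantics can be shown with a pair of one-line helpers. This is a sketch with hypothetical function names; how either interface rejects out-of-range requests is not modelled, and the legacy helper simply assumes the request is at least as large as the existing allocation.

```c
#include <assert.h>

/*
 * Legacy semantics: the request is a target for the domain's total, so
 * pages that are already allocated are subtracted from the claim.
 */
unsigned long legacy_outstanding(unsigned long requested,
                                 unsigned long tot_pages)
{
    assert(requested >= tot_pages);  /* precondition of this sketch */
    return requested - tot_pages;
}

/*
 * Claim set semantics: the request is the claim itself; existing
 * allocations play no part at installation time.
 */
unsigned long claim_set_outstanding(unsigned long requested,
                                    unsigned long tot_pages)
{
    (void)tot_pages;  /* deliberately ignored */
    return requested;
}
```

For a domain that already holds 30 pages, a request for 100 pages leaves 70 pages claimed under the legacy semantics and 100 pages claimed under claim set semantics. When claims are installed before any allocation, as intended, both give the same result.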
3.3.5.3. Versioned hypercall
The domain builders using the XEN_DOMCTL_claim_memory hypercall also need to use other version-controlled hypercalls which are wrapped through the libxenctrl library.
Wrapping this call in libxenctrl is therefore a practical approach; otherwise, we would have a mix of version-controlled and unversioned hypercalls, which could be confusing for API users and for future maintenance. From the domain builders’ viewpoint, it is more consistent to expose the claims hypercall in the same way as the other calls they use.
Stable interfaces also have drawbacks: with stable syscalls, Linux needs to maintain the old interface indefinitely, which can be a maintenance burden and can limit the ability to make improvements or changes to the interface in the future. Linux carries many system call successor families, e.g., oldstat, stat, newstat, stat64, fstatat, statx, with similar examples including openat, openat2, clone3, dup3, waitid, mmap2, epoll_create1, pselect6 and many more. Glibc hides that complexity from users by providing a consistent API, but it still needs to maintain the old system calls for compatibility.
In contrast, versioned hypercalls allow for more flexibility and evolution of the API while still providing a clear path to adopt new features. The reserved fields and reserved bits in the structures of this hypercall allow for many future extensions without breaking existing callers.
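How reserved fields keep room for extensions can be sketched as follows. The layout is purely illustrative: `struct claim_entry_model`, its `reserved` field and `check_claim_entry()` are hypothetical and do not describe the real memory_claim_t. The sketch only shows the common pattern of rejecting non-zero reserved fields so that they can be given a meaning later.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry layout; the real memory_claim_t may differ. */
struct claim_entry_model {
    uint64_t pages;
    uint32_t target;
    uint32_t reserved;  /* callers must pass zero */
};

/*
 * Reject entries that set reserved fields. Because every accepted caller
 * passes zero today, a later version can assign a meaning to non-zero
 * values without changing the behaviour of existing callers.
 */
int check_claim_entry(const struct claim_entry_model *entry)
{
    assert(entry != NULL);
    return entry->reserved != 0 ? -EINVAL : 0;
}
```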
3.3.6. Glossary
- claims
Reservations of memory for domains that are installed by domain builders before populating the domain’s memory. Claims ensure that the reserved memory remains available for the domains when allocating it, even if other domains are allocating memory at the same time.
- claim set
An array of memory_claim_t entries, each specifying a page count and a target (either a NUMA node ID or a special value for global claims), that can be installed atomically for a domain to reserve memory on multiple NUMA nodes. The chapter on Claim sets provides further information on the structure and semantics of claim sets.
- claim set installation
- installing claim sets
- installing claims
The process of validating and installing a claim set for a domain under domain.page_alloc_lock and heap_lock, ensuring that either the entire set is accepted and installed, or the request fails with no side effects. The chapter on Claim set installation provides further information on how claim sets are validated and installed.
- domain builders
Privileged entities (such as toolstacks in management domains) responsible for constructing and configuring domains, including installing claims, populating memory, and setting up other resources before the domains are started.
- domains
Virtual machine instances managed by Xen, built by domain builders.
- global claims
Claims that can be satisfied from any NUMA node, required for compatibility with existing domain builders and for use cases where strict node-local placement is not required or not possible, such as on UMA machines or as a fallback for memory that becomes available on any node.
- libxenctrl
A library used by domain builders running in privileged domains to interact with the hypervisor, including making hypercalls to install claims and populate memory.
- libxenguest
A library used by domain builders running in privileged domains to build domains. It implements build phases such as meminit and relies on libxenctrl calls, for example xc_domain_populate_physmap(), to install claims and populate memory.
- meminit
The phase of a domain build where the guest’s physical memory is populated, which involves allocating and mapping physical memory for the domain’s guest physmap. This should be performed after installing claims to protect the process against parallel allocations by other domain builder processes during parallel domain builds.
It is implemented in libxenguest and optionally installs claims to ensure the claimed memory is reserved before populating the physmap using calls to xc_domain_populate_physmap().
- nodemask
A bitmap representing a set of NUMA nodes, used for status information like node_online_map and the domain.node_affinity.
- node
- NUMA node
- NUMA nodes
A grouping of CPUs and memory in a NUMA architecture. NUMA nodes have varying access latencies to memory, and NUMA-aware claims allow domain builders to reserve memory on specific NUMA nodes for performance reasons. Platform firmware configures what constitutes a NUMA node, and Xen relies on that configuration for NUMA-related features.
When this design refers to NUMA nodes, it is referring to the NUMA nodes as defined by the platform firmware and exposed to Xen, initialized at boot time and not changing at runtime (so far).
The NUMA node ID is a numeric identifier for a NUMA node, used whenever code specifies a NUMA node, such as the target of a claim or indexing into arrays related to NUMA nodes.
NUMA node IDs start at 0 and are less than MAX_NUMNODES.
Some NUMA nodes may be offline, and the node_online_map is used to track which nodes are online. Currently, Xen does not support hotplug of NUMA nodes, so the set of online NUMA nodes is determined at boot time based on the platform firmware configuration and does not change at runtime.
- NUMA node affinity
The preference of a domain for a set of NUMA nodes. Domain builders can use it to guide memory allocation towards a set of preferred NUMA nodes, even when they do not direct the buddy allocator to consider (or prefer) one specific node for an allocation.
By default, domains have NUMA node auto-affinity, which means their NUMA node affinity is determined automatically by the hypervisor based on the CPU affinity of their vCPUs. Auto-affinity can be disabled and the node affinity configured explicitly.
- guest physical memory
- physmap
The mapping of a domain’s guest physical memory to the host’s machine address space. The physmap defines how the guest’s physical memory corresponds to the actual memory locations on the host.
- populating
The process of allocating and mapping physical memory for a domain’s guest physmap, performed by the domain builders, preferably after installing claims to protect the process against parallel allocations by other domain builder processes during parallel domain builds.
- toolstacks
Privileged entities (running in privileged domains) responsible for managing domains, including building, configuring, and controlling their lifecycle using domain builders. One toolstack may run multiple domain builders in parallel to build multiple domains at the same time.
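The glossary's rules for NUMA node IDs (IDs start at 0, are less than MAX_NUMNODES, and a node may be offline) can be sketched as a validity check. This is a simplified model: `MODEL_MAX_NUMNODES`, `node_id_usable()` and `claims_on_node()` are hypothetical, and a single 64-bit word stands in for the nodemask type behind node_online_map.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limit so that one 64-bit word can model a nodemask. */
#define MODEL_MAX_NUMNODES 64

/* A node ID is usable if it is in range and its online bit is set. */
bool node_id_usable(unsigned int node, uint64_t online_map)
{
    return node < MODEL_MAX_NUMNODES && ((online_map >> node) & 1u) != 0;
}

/* Index a per-node array only after the node ID has been validated. */
unsigned long claims_on_node(const unsigned long per_node[MODEL_MAX_NUMNODES],
                             unsigned int node, uint64_t online_map)
{
    assert(node_id_usable(node, online_map));
    return per_node[node];
}
```

With an online map of 0x5, nodes 0 and 2 are usable as claim targets, while node 1 (offline) and any ID of 64 or above (out of range) are not.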