2.8. Accounting#

Note

Claims accounting state is only updated while holding the heap_lock. See Locking of the claims state for details on the locks used to protect the claims accounting state.

This section formalises the internal state and invariants that Xen must maintain to ensure correctness.

For readers following the design in order, the preceding sections are:

  1. Claims Design introduces the overall model and goals.

  2. Installation explains how claim sets are installed.

  3. Protection describes how claimed memory is protected during allocation.

  4. Redeeming explains how claims are redeemed when allocations succeed.

2.8.1. Overview#

Table 1: Claims accounting: All accesses, Aggregate state, and invariants protected by heap_lock.#

| Level | Claims must be lower or equal to | the available memory |
|---|---|---|
| Total | outstanding_claims = aggregate state: SUM(domain.outstanding_pages) over all domains. Also, it is the sum of claims over all nodes: SUM(node_outstanding_claims[*]) | total_avail_pages = aggregate state: SUM(node_avail_pages) |
| Node | node_outstanding_claims[node] = aggregate state over all domains: SUM(domain.claims[node]) | node_avail_pages[node] = aggregate of the free lists of all zones on node |
| Dom per-node | domain.node_claims = SUM(domain.claims[node]) | node_avail_pages[node] |
| Total claims | domain.outstanding_pages | total_avail_pages |
| Memory limit | domain.outstanding_pages + domain_tot_pages() | Invariant: must be lower or equal to domain.max_pages |

2.8.2. Total claims and available memory#

These variables tracking the total claims and available memory in the system are aggregates of the actual per-node and per-domain values.

They are maintained only to keep the checks in the allocator hot paths efficient. With them, the allocator can quickly determine whether an allocation can be satisfied from unclaimed memory, or whether further checks are needed to determine if the claims of the domain can be used to satisfy the allocation.

The number of unclaimed pages across all nodes in the system is derived as total_avail_pages minus outstanding_claims. This number is then used to:

  • Permit allocation requests if they can be satisfied from unclaimed pages.

  • Ensure that the sum of all claims never exceeds the total free memory.
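As a minimal sketch of this derivation: the two globals below are stand-ins for the aggregates described in this section, and fits_in_unclaimed() is a hypothetical helper name, not a function taken from Xen. In Xen, such a check would run while holding heap_lock.

```c
#include <stdbool.h>

/* Stand-ins for the two aggregates described above (illustration only). */
static unsigned long total_avail_pages;
static unsigned long outstanding_claims;

/* Free pages that are not reserved by any claim. */
static unsigned long unclaimed_pages(void)
{
    /* Relies on the invariant outstanding_claims <= total_avail_pages. */
    return total_avail_pages - outstanding_claims;
}

/*
 * Hypothetical helper: true if 'pages' fit into unclaimed memory.
 * The same comparison serves both uses listed above: permitting an
 * allocation request and accepting an additional claim.
 */
static bool fits_in_unclaimed(unsigned long pages)
{
    return pages <= unclaimed_pages();
}
```

With 1000 available pages and 300 claimed pages, requests of up to 700 pages pass this check without touching any claim.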

unsigned long total_avail_pages#

Total available pages in the system across all NUMA nodes. It is the aggregate of the per-node available pages: total_avail_pages = SUM(node_avail_pages[*])

unsigned long outstanding_claims#

The total sum of all claims across all domains. outstanding_claims = SUM(domain.outstanding_pages)

2.8.3. Per-node claims and available memory#

unsigned long node_avail_pages[MAX_NUMNODES]#

Available pages for each NUMA node, indexed by node ID. The count includes all free pages on the node, whether or not they are covered by claims. It is used for validating that node claims do not exceed the available memory on the respective NUMA node.

unsigned long node_outstanding_claims[MAX_NUMNODES]#

The total claims across all domains for each NUMA node, indexed by node ID. This is maintained for efficient checks in the allocator hot paths.
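The per-node check can be sketched in the same way as the host-wide one. This is an illustration only: the arrays are local stand-ins, MAX_NUMNODES is assumed to be 64 (the x86_64 default), and node_claim_fits() is a hypothetical helper name.

```c
#include <stdbool.h>

#define MAX_NUMNODES 64 /* assumed value, matching the x86_64 default */

/* Stand-ins for the per-node state described above. */
static unsigned long node_avail_pages[MAX_NUMNODES];
static unsigned long node_outstanding_claims[MAX_NUMNODES];

/* Free pages on 'node' that are not reserved by any claim. */
static unsigned long node_unclaimed_pages(unsigned int node)
{
    /* Relies on node_outstanding_claims[n] <= node_avail_pages[n]. */
    return node_avail_pages[node] - node_outstanding_claims[node];
}

/* Hypothetical helper: may 'pages' more be claimed on 'node'? */
static bool node_claim_fits(unsigned int node, unsigned long pages)
{
    /* The bounds check runs first, so the arrays are never overrun. */
    return node < MAX_NUMNODES && pages <= node_unclaimed_pages(node);
}
```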

This diagram illustrates the claims accounting state and the invariants:

2.8.4. Accounting diagram#

%% SPDX-License-Identifier: CC-BY-4.0
%% Claim variables and their Invariants
flowchart TD

subgraph "Access&nbsp;under&nbsp;the&nbsp;<tt><b>heap_lock</b></tt>&nbsp;only:"
   direction TB
   Memory_of_Nodes --"&nbsp; Contribute to &nbsp;"--> Overall_Memory
   Overall_Memory --"&nbsp; Available to &nbsp;"--> Memory_of_Domains
end

subgraph Memory_of_Nodes["Per-node claims and available memory"]
    direction LR
    per_node_claims -->|"&nbsp; less or equal to &nbsp;"| node_avail_pages
    per_node_claims["Claims on the node:
                     <tt>node_outstanding_claims[n]"]
    node_avail_pages["Available pages on the node:
                      <tt>node_avail_pages[n]"]
end

subgraph Overall_Memory["Overall claims and available memory"]
    direction LR
    outstanding -->|"&nbsp; less or equal to &nbsp;"| avail_pages
    outstanding["Total claims on the host:
                 <tt>outstanding_claims"]
    avail_pages["Available pages on the host:
                 <tt>total_avail_pages"]
end

subgraph Memory_of_Domains["Per-domain&nbsp;claims and available memory"]
    direction LR
    claims -->|"&nbsp; less or equal to &nbsp;"| available_memory_for_domains
    claims["Claims of the domain:<br><tt>d->outstanding_pages"]
    available_memory_for_domains["Available pages:<br><tt>node_avail_pages[n]
                                                          total_avail_pages"]
end

    

Diagram: Claims accounting state and invariants#

2.8.5. Claims accounting state for each domain#

struct domain#

The main structure representing a domain in Xen. It holds the claims accounting state for the domain (both unpinned and node-specific claims), the maximum page limit for the domain, and the lock protecting the domain’s page allocation counts.

While the domain’s page counts are currently unsigned int, work is underway to change them to unsigned long to support larger page counts beyond 16 TB. The code is already designed to anticipate this change and work with either unsigned int or unsigned long page counts equally well.

unsigned int outstanding_pages#

The domain’s total claim, representing the number of pages claimed for the domain.

unsigned int node_claims#

The total of the domain’s node-affine claims, maintained for efficient checks in the allocator hot paths without needing to sum over the per-node claims each time. It is equal to the sum of claims[node] over all nodes.

unsigned int claims[MAX_NUMNODES]#

The domain’s claims for each NUMA node, indexed by node ID.

As the storage for struct domain is allocated using a dedicated page for each domain, this array allows for efficient and fast storage with direct indexing, without consuming any additional memory for an extra allocation.

The per-node claims serve NUMA-affine domains in three ways: they specify the amount of memory claimed on each node, they ensure that the domain’s claims for a node do not exceed the available memory on that node, and they allow the allocator to redeem claims from the appropriate nodes when allocating memory for the domain.

Allocation of the domain structure in xen/common/domain.c#
static struct domain *alloc_domain_struct(void)
{
#ifndef arch_domain_struct_memflags
# define arch_domain_struct_memflags() 0
#endif

    struct domain *d = alloc_xenheap_pages(0, arch_domain_struct_memflags());

    BUILD_BUG_ON(sizeof(*d) > PAGE_SIZE);

    if ( d )
        clear_page(d);

    return d;
}

The page allocated for struct domain is large enough to accommodate this array several times, even beyond the current MAX_NUMNODES limit of 64. It should be sufficient even for future expansion of the maximum number of supported NUMA nodes if needed. The allocation has a build-time assertion for safety to ensure that struct domain fits within the allocated page.

The sum of these claims is stored in domain.node_claims for efficient checks in the allocator hot paths which need to know the total number of node claims for the domain.
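One way to keep such a cached sum in step with the array is to adjust both in the same update. The sketch below uses a hypothetical cut-down structure and helper names (toy_domain_claims, set_node_claim, sum_node_claims), and assumes MAX_NUMNODES is 64. It is not the code used by Xen.

```c
#define MAX_NUMNODES 64 /* assumed value, matching the x86_64 default */

/* Hypothetical cut-down view of the claims fields in struct domain. */
struct toy_domain_claims {
    unsigned int node_claims;          /* cached SUM(claims[node]) */
    unsigned int claims[MAX_NUMNODES]; /* per-node claims */
};

/* Set the claim for one node and keep the cached sum consistent. */
static void set_node_claim(struct toy_domain_claims *d,
                           unsigned int node, unsigned int pages)
{
    d->node_claims -= d->claims[node]; /* drop the old contribution */
    d->node_claims += pages;           /* add the new contribution */
    d->claims[node] = pages;
}

/* Slow reference computation that the cached sum lets hot paths avoid. */
static unsigned int sum_node_claims(const struct toy_domain_claims *d)
{
    unsigned int node, sum = 0;

    for ( node = 0; node < MAX_NUMNODES; node++ )
        sum += d->claims[node];

    return sum;
}
```

Updating the cached value incrementally keeps each change O(1), while sum_node_claims() is only needed for cross-checking.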

unsigned int max_pages#

The domain’s memory limit in pages, set at domain creation time. The domain’s claims plus its allocated pages, as returned by domain_tot_pages(), must not exceed it.

rspinlock_t page_alloc_lock#

Lock held while checking new claims plus domain_tot_pages() against domain.max_pages when installing these new claims. It is a recursive spinlock, so nested calls into the allocator can take it again while it is already held, such as when redeeming claims during page allocation. When installing claims, it is taken before heap_lock to keep the locking order consistent. It must not be taken while holding heap_lock, as that could deadlock.
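The ordering rule can be illustrated with a toy model that tracks lock state in plain variables. This is not real locking and none of these helper names exist in Xen. The sketch only encodes the two rules stated above: page_alloc_lock may nest, and it must not be taken while heap_lock is held.

```c
#include <stdbool.h>

/* Toy lock state: plain variables instead of real spinlocks. */
static unsigned int page_alloc_lock_depth; /* recursive: may nest */
static bool heap_lock_held;                /* plain: must not nest */

/* Returns false if taking the lock would violate the documented order. */
static bool take_page_alloc_lock(void)
{
    if ( heap_lock_held )
        return false; /* page_alloc_lock must come before heap_lock */
    page_alloc_lock_depth++;
    return true;
}

static void release_page_alloc_lock(void)
{
    page_alloc_lock_depth--;
}

/* Returns false if heap_lock is already held (it is not recursive). */
static bool take_heap_lock(void)
{
    if ( heap_lock_held )
        return false;
    heap_lock_held = true;
    return true;
}

static void release_heap_lock(void)
{
    heap_lock_held = false;
}
```

In this model, installing a claim would take page_alloc_lock, then heap_lock, and release them in the reverse order.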

nodemask_t node_affinity#

The set of NUMA nodes the domain is affine to. The allocator hot paths read it to quickly obtain the nodes to consider for memory allocation decisions.

2.8.6. Claims accounting invariants#

Xen must maintain the following invariants at all times to ensure correctness of claims accounting:
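As a hedged illustration, the relations summarised in Table 1 can be written as a consistency check over a toy model of the state. The structures, the small NR_NODES value and the function name are hypothetical, and the check covers only the relations that Table 1 states directly.

```c
#include <stdbool.h>

#define NR_NODES 4 /* small hypothetical stand-in for MAX_NUMNODES */

struct toy_domain {
    unsigned int outstanding_pages; /* total claim of the domain */
    unsigned int node_claims;       /* cached SUM(claims[node]) */
    unsigned int claims[NR_NODES];  /* per-node claims */
    unsigned int max_pages;         /* memory limit */
    unsigned int tot_pages;         /* stands in for domain_tot_pages() */
};

struct toy_heap {
    unsigned long total_avail_pages;
    unsigned long outstanding_claims;
    unsigned long node_avail_pages[NR_NODES];
    unsigned long node_outstanding_claims[NR_NODES];
};

static bool claims_invariants_hold(const struct toy_heap *h,
                                   const struct toy_domain *doms,
                                   unsigned int nr_doms)
{
    unsigned long avail_sum = 0, claim_sum = 0;
    unsigned int n, i;

    for ( n = 0; n < NR_NODES; n++ )
    {
        unsigned long node_sum = 0;

        for ( i = 0; i < nr_doms; i++ )
            node_sum += doms[i].claims[n];

        /* Node level: aggregate matches and fits into the node. */
        if ( node_sum != h->node_outstanding_claims[n] ||
             node_sum > h->node_avail_pages[n] )
            return false;

        avail_sum += h->node_avail_pages[n];
    }

    /* Total level: host-wide availability is the sum over nodes. */
    if ( avail_sum != h->total_avail_pages )
        return false;

    for ( i = 0; i < nr_doms; i++ )
    {
        const struct toy_domain *d = &doms[i];
        unsigned int dom_sum = 0;

        for ( n = 0; n < NR_NODES; n++ )
            dom_sum += d->claims[n];

        /* Domain level: the cached per-node sum must match. */
        if ( dom_sum != d->node_claims )
            return false;

        /* Memory limit: widen before adding to avoid wrap-around. */
        if ( (unsigned long long)d->outstanding_pages + d->tot_pages >
             d->max_pages )
            return false;

        claim_sum += d->outstanding_pages;
    }

    /* Total level: claims match the aggregate and fit into the host. */
    return claim_sum == h->outstanding_claims &&
           claim_sum <= h->total_avail_pages;
}
```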

2.8.7. Constants#

MAX_NUMNODES#

The maximum number of NUMA nodes supported by Xen. Used for validating node IDs in the memory_claim_t entries of claim sets. When Xen is built without NUMA support, it is 1.

The default on x86_64 is 64, which is sufficient for current hardware. With this value, bitmaps such as node_online_map and domain.node_affinity each fit in a single 64-bit value, and the domain.claims[MAX_NUMNODES] array stays compact.

xen/arch/Kconfig limits the maximum number of NUMA nodes to 64. Although nodeid_t could in theory represent up to 254 nodes, configuring machines to split the installed memory into more than 64 nodes would be unusual. For example, dual-socket servers, even when using multiple chips per CPU package, should typically be configured for 2 NUMA nodes by default.

nodemask_t node_online_map#

A bitmap representing which NUMA nodes are currently online in the system. This is used for validating that claims are only made for online nodes and for efficient checks in the allocator hot paths to quickly determine which nodes are online. Currently, Xen does not support hotplug of NUMA nodes, so this is set at boot time based on the platform firmware configuration and does not change at runtime.
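A node-online test against such a bitmap can be sketched as follows. The sketch assumes that the whole mask fits into one 64-bit value, as it does with MAX_NUMNODES of 64. The type toy_nodemask_t and the helper toy_node_online() are hypothetical simplifications, not Xen's nodemask_t API.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_NUMNODES 64 /* assumed value, matching the x86_64 default */

/* Hypothetical simplification: one bit per node in a 64-bit word. */
typedef uint64_t toy_nodemask_t;

/* True if 'node' is a valid node ID and its bit is set in 'online'. */
static bool toy_node_online(toy_nodemask_t online, unsigned int node)
{
    /* The bounds check runs first, so the shift count stays below 64. */
    return node < MAX_NUMNODES && ((online >> node) & 1);
}
```

A claim for a node that fails this test would be rejected during validation.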

2.8.8. Types#

typedef uint8_t nodeid_t#

Type for NUMA node IDs. It is passed to Xenctrl using the mem_flags argument of xc_domain_populate_physmap() and passed to Xen in this form.

The node ID occupies 8 bits in the flags, which limits the theoretical maximum value of CONFIG_NR_NUMA_NODES to 254 (255 is NUMA_NO_NODE). This is far beyond the current maximum of 64 supported by Xen and should be sufficient for all practical purposes. The small type also allows for efficient storage of NUMA nodes in arrays indexed by node ID and in the nodemask_t bitmaps node_online_map and domain.node_affinity, which are checked in the allocator hot paths.
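To illustrate how an 8-bit node ID can share a flags word with other bits, the sketch below stores the ID plus one in an 8-bit field, so that an empty field decodes to NUMA_NO_NODE. The field position, the TOY_ macro names and both helper functions are assumptions made for this example. The actual encoding used by Xen and libxenctrl may differ.

```c
#include <stdint.h>

typedef uint8_t nodeid_t;

#define NUMA_NO_NODE   0xFF /* 255, as stated above */
#define TOY_NODE_SHIFT 8    /* assumed position of the node field */
#define TOY_NODE_MASK  0xFFu

/* Store 'node' in the 8-bit field, leaving all other flag bits alone. */
static unsigned int toy_flags_set_node(unsigned int flags, nodeid_t node)
{
    unsigned int field = (node + 1u) & TOY_NODE_MASK; /* 255 wraps to 0 */

    flags &= ~(TOY_NODE_MASK << TOY_NODE_SHIFT);
    return flags | (field << TOY_NODE_SHIFT);
}

/* Recover the node ID; an empty field yields NUMA_NO_NODE. */
static nodeid_t toy_flags_get_node(unsigned int flags)
{
    return (nodeid_t)(((flags >> TOY_NODE_SHIFT) - 1u) & TOY_NODE_MASK);
}
```

Storing the ID plus one means that flags without any node information decode to NUMA_NO_NODE rather than to node 0.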

type nodemask_t#

A bitmap representing a set of NUMA nodes. It is used for status information such as node_online_map, which tracks the online nodes, and domain.node_affinity, which tracks the nodes in a domain’s node affinity.

2.8.9. Memflags#

type memflags#

Flags for memory allocation requests that can affect the allocation behaviour, such as node preference and whether the request is for an exact node.

MEMF_no_owner#

Flag for memory allocation requests to indicate that the allocation shall not be owned by a domain, and as part of that, MEMF_no_refcount is also set.

MEMF_no_refcount#

Flag for memory allocation requests to indicate that the request is not reference-counted to a domain’s memory allocation state. As a consequence, the claims of a domain cannot be used to protect or redeem the allocation. This is used for requests which are not for domains or which explicitly bypass reference-counting for other reasons.
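The relationship between these two flags and claims can be sketched as follows. The bit positions and the TOY_ names are hypothetical, and the helpers only model the behaviour described above: MEMF_no_owner implies MEMF_no_refcount, and requests with MEMF_no_refcount cannot use claims.

```c
#include <stdbool.h>

/* Hypothetical bit positions; the real values in Xen may differ. */
#define TOY_MEMF_no_refcount (1u << 0)
#define TOY_MEMF_no_owner    (1u << 1)

/* An allocation without an owner is also not reference-counted. */
static unsigned int normalise_memflags(unsigned int memflags)
{
    if ( memflags & TOY_MEMF_no_owner )
        memflags |= TOY_MEMF_no_refcount;

    return memflags;
}

/* Only reference-counted requests may be covered by a domain's claims. */
static bool request_may_use_claims(unsigned int memflags)
{
    return !(normalise_memflags(memflags) & TOY_MEMF_no_refcount);
}
```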

MEMF_no_scrub#

Flag for memory allocation requests to indicate that the allocated memory should not be scrubbed (zeroed) before being used. This is used for performance reasons for certain types of allocations where the caller guarantees that the memory will be properly initialized before use.

2.8.10. Locking of the claims state#

spinlock_t heap_lock#

Lock for all heap operations including claims. It protects the claims state and invariants from concurrent updates and ensures that checks in the allocator hot paths see a consistent view of the claims state.

2.8.11. Helper functions#

inline unsigned int domain_tot_pages(struct domain *d)#
Parameters:
  • d (struct domain*) – The domain for which to calculate the total pages.

Returns:

The total pages allocated to the domain.

This function is used for validating that an allocation and the domain’s claims do not exceed domain.max_pages.