3.9. Implementation

Note

This part describes how claims are implemented and how they interact with memory allocation in Xen. It covers the functions and data structures used to install claims, to allocate memory against them, and to recall them when memory is offlined.

3.9.1. Installation of claims

This section describes the functions and data structures involved in installing claims for domains, and the internal functions for validating and installing claim sets.

int domain_set_outstanding_pages(domain, pages)

This function has been replaced by domain_set_claim_entries().

int domain_set_claim_entries(domain, nr_entries, claim_set)
Parameters:
  • domain (struct domain*) – The domain for which to set the node claims

  • nr_entries (unsigned int) – The number of claims in the claim set

  • claim_set (memory_claim_t*) – The claim set to install for the domain

Returns:

0 on success, or a negative error code on failure.

Handles installing claim sets. It performs validation of the claim set and updates the domain’s claims accordingly.

The function works in four phases:

  1. Validate claim entries and check node-specific claims availability

  2. Validate the host-wide request against the remaining availability

  3. Reset any current claims of the domain

  4. Install the claim set as the domain’s claiming state

Phase 1 checks claim entries for validity and memory availability:

  1. Target must be XEN_DOMCTL_CLAIM_MEMORY_TOTAL or a node.

  2. Each target node may only appear once in the claim set.

  3. For node-specific claims, requested pages must not exceed the available memory on that node after accounting for existing claims.

  4. The explicit padding field must be zero for forward compatibility.

Phase 2 checks:

  1. The total sum of the requested pages must not exceed the total unclaimed memory of the host after accounting for existing claims.

  2. The claims must not exceed the domain.max_pages limit. See Accounting and Redeeming for the accounting checks that enforce the domain’s domain.max_pages limit.

Added in version claims-v5.
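The validation in phases 1 and 2 can be sketched as follows. This is a minimal illustration, not the Xen code: the struct layouts, the SKETCH_ names, the error codes and the rejection of a duplicate host-wide entry are all assumptions made for the example. It also ignores any claims the domain already holds, and it treats the domain.max_pages check as a plain comparison against the sum of the entries.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_NODES   4U          /* hypothetical number of NUMA nodes */
#define SKETCH_CLAIM_TOTAL UINT32_MAX  /* stand-in for XEN_DOMCTL_CLAIM_MEMORY_TOTAL */

/* Hypothetical entry layout; the real memory_claim_t may differ. */
struct sketch_claim {
    uint64_t pages;   /* pages requested for this target */
    uint32_t target;  /* a node id, or SKETCH_CLAIM_TOTAL */
    uint32_t pad;     /* explicit padding, must be zero */
};

/* Hypothetical snapshot of the memory that is still unclaimed. */
struct sketch_host {
    uint64_t node_unclaimed[SKETCH_MAX_NODES]; /* per node */
    uint64_t host_unclaimed;                   /* host-wide */
};

/* Phases 1 and 2 only: returns 0 if the claim set could be installed. */
static int sketch_validate_claims(const struct sketch_host *h,
                                  const struct sketch_claim *set,
                                  unsigned int nr_entries,
                                  uint64_t max_pages)
{
    bool seen_node[SKETCH_MAX_NODES] = { false };
    bool seen_total = false;
    uint64_t sum = 0;

    /* Phase 1: per-entry validity and node-local availability. */
    for (unsigned int i = 0; i < nr_entries; i++) {
        const struct sketch_claim *c = &set[i];

        if (c->pad != 0)
            return -EINVAL;            /* padding must be zero */

        if (c->target == SKETCH_CLAIM_TOTAL) {
            if (seen_total)
                return -EINVAL;        /* assumption: only one host-wide entry */
            seen_total = true;
        } else {
            if (c->target >= SKETCH_MAX_NODES || seen_node[c->target])
                return -EINVAL;        /* unknown node or duplicate target */
            seen_node[c->target] = true;
            if (c->pages > h->node_unclaimed[c->target])
                return -ENOMEM;        /* node cannot back this claim */
        }
        sum += c->pages;
    }

    /* Phase 2: host-wide availability and the domain's page limit. */
    if (sum > h->host_unclaimed)
        return -ENOMEM;
    if (sum > max_pages)
        return -EINVAL;
    return 0;
}
```

Only after both phases succeed would the existing claims be reset and the new set installed, so a rejected request leaves the domain's current claims untouched.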

int domain_get_claim_entries(domain, nr_entries, claim_set)
Parameters:
  • domain (struct domain*) – The domain for which to retrieve a claim set

  • nr_entries (unsigned int*) – The number of claims in the claim set

  • claim_set (memory_claim_t*) – The preallocated buffer for up to nr_entries claim entries

Returns:

0 on success with nr_entries updated to the number of claims written to the buffer, or a negative error code on failure.

Retrieves a claim set for the current claims of the domain and writes it to the provided buffer. The number of claims written to the buffer is stored in the variable pointed to by nr_entries.

nr_entries specifies the size of the provided buffer for claim entries, and the function writes up to that many claim entries to the buffer. If the buffer is too small to hold all claim entries, the function returns -ERANGE and updates nr_entries to the number of entries needed to hold all claim entries.

Added in version claims-v7.
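The buffer-size contract described above can be sketched as follows. The entry layout and the sketch_ names are hypothetical. The example assumes that nothing is written to the buffer when -ERANGE is returned, which the description above does not specify.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical entry layout, for illustration only. */
struct sketch_entry {
    unsigned int target;
    unsigned long pages;
};

/*
 * Copy the nr_src current entries into buf, which has room for
 * *nr_entries entries. On return, *nr_entries holds the number of
 * entries written, or the number needed if the buffer was too small.
 */
static int sketch_get_entries(const struct sketch_entry *src,
                              unsigned int nr_src,
                              struct sketch_entry *buf,
                              unsigned int *nr_entries)
{
    if (*nr_entries < nr_src) {
        *nr_entries = nr_src;   /* tell the caller how much room is needed */
        return -ERANGE;
    }

    for (unsigned int i = 0; i < nr_src; i++)
        buf[i] = src[i];

    *nr_entries = nr_src;       /* entries actually written */
    return 0;
}
```

A caller that receives -ERANGE can allocate a buffer with the returned number of entries and repeat the call.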

3.9.2. Helper functions for managing claims

unsigned long domain_release_host_claims(domain, release)
Parameters:
  • domain (struct domain*) – The domain for which to release host-wide claims

  • release (unsigned long) – The number of pages to release

Returns:

The number of host-wide pages actually deducted from the domain.

This function releases the specified number of host-wide claims. It limits the release to the number of host-wide claims actually held by the domain and updates the overall claim state accordingly.

Added in version claims-v4.
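The clamping behaviour can be illustrated with a small sketch. The struct field and the global counter below are simplified stand-ins chosen for the example, not the names used in Xen, and locking is omitted.

```c
#include <assert.h>

/* Hypothetical, simplified per-domain state. */
struct sketch_domain {
    unsigned long host_claims;   /* host-wide pages claimed by the domain */
};

/* Hypothetical stand-in for the host-wide total of all outstanding claims. */
static unsigned long sketch_outstanding_claims;

static unsigned long sketch_release_host_claims(struct sketch_domain *d,
                                                unsigned long release)
{
    /* Never release more than the domain actually holds. */
    if (release > d->host_claims)
        release = d->host_claims;

    d->host_claims -= release;
    sketch_outstanding_claims -= release;

    return release;              /* pages actually deducted */
}
```

Returning the clamped amount lets a caller track how much of a larger request still has to be satisfied from other claims.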

unsigned long domain_release_node_claims(domain, node, release)
Parameters:
  • domain (struct domain*) – The domain for which to release the node claims

  • node (nodeid_t) – The node for which to release the claim

  • release (unsigned long) – The number of pages to release from the claim

Returns:

The number of pages actually deducted from the domain’s claim.

This function deducts the specified number of pages from a domain’s claim on a specific node. It limits the release to the number of pages the domain actually holds as a claim on that node, reduces the domain’s node-local claim by that amount, and updates the host-wide and node-specific claim state accordingly.

Added in version claims-v5.

void domain_recall_node_claims(domain, recall)
Parameters:
  • domain (struct domain*) – The domain for which to recall node claims

  • recall (unsigned long) – The number of node-specific pages to recall

This function recalls the specified number of node-specific claims from the domain and updates the overall claim state accordingly.

It iterates over the domain’s node-specific claims and calls domain_release_node_claims() to release up to the remaining number of pages from each node claim, until the specified number of pages has been recalled or all node-specific claims have been exhausted.

This function is used to recall node-specific claims from a domain when offlining memory or when pages for a domain are allocated on other nodes than the claimed node.

Added in version claims-v5.
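The recall loop can be sketched on top of a clamped per-node release. The data layout, the sketch_ names and the ascending node order are assumptions made for the example. The description above does not state the order in which nodes are visited.

```c
#include <assert.h>

#define SKETCH_NODES 3U   /* hypothetical number of NUMA nodes */

/* Hypothetical, simplified per-domain state. */
struct sketch_domain {
    unsigned long node_claims[SKETCH_NODES];
};

/* Hypothetical stand-ins for the per-node and host-wide claim totals. */
static unsigned long sketch_node_outstanding[SKETCH_NODES];
static unsigned long sketch_outstanding;

/* Release up to `release` pages of d's claim on `node`; return the amount. */
static unsigned long sketch_release_node(struct sketch_domain *d,
                                         unsigned int node,
                                         unsigned long release)
{
    if (release > d->node_claims[node])
        release = d->node_claims[node];

    d->node_claims[node] -= release;
    sketch_node_outstanding[node] -= release;
    sketch_outstanding -= release;
    return release;
}

/* Recall up to `recall` pages, taking them from one node after another. */
static void sketch_recall_node_claims(struct sketch_domain *d,
                                      unsigned long recall)
{
    for (unsigned int node = 0; node < SKETCH_NODES && recall > 0; node++)
        recall -= sketch_release_node(d, node, recall);
}
```

Because each release is clamped to what the domain holds on that node, the loop stops either when the requested amount has been recalled or when every node claim has dropped to zero.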

3.9.3. Allocation with claims

The functions below play a key role in allocating memory for domains.

int xc_domain_populate_physmap(xch, domid, extents, order, mem_flags, extent_start)
Parameters:
  • xch (xc_interface*) – The libxenctrl interface

  • domid (uint32_t) – The ID of the domain

  • extents (unsigned long) – Number of extents

  • order (unsigned int) – Order of the extents

  • mem_flags (unsigned int) – Allocation flags

  • extent_start (xen_pfn_t*) – Starting PFN

Returns:

0 on success, or a negative error code on failure.

This function is a wrapper for the XENMEM_populate_physmap hypercall, which is handled by the populate_physmap() function in the hypervisor. libxenguest uses it to populate the guest physical memory of a domain. Domain builders can set the NUMA node affinity and pass the preferred node to this function to steer allocations towards the preferred NUMA node(s). Claims then ensure that the memory remains available even when multiple domains are built in parallel.

The meminit API calls xc_domain_populate_physmap() for populating the guest physical memory. It invokes the restartable XENMEM_populate_physmap hypercall implemented by populate_physmap().

void populate_physmap(struct memop_args *a)
Parameters:
  • a (struct memop_args*) – Provides status and hypercall restart info

Allocates memory for building a domain and uses it for populating the physmap. For allocation, it uses alloc_domheap_pages(), which forwards the request to alloc_heap_pages().

During domain creation, it adds the MEMF_no_scrub flag to the request for populating the physmap to optimise domain startup by allowing the use of unscrubbed pages.

When unscrubbed pages are used, it scrubs them as needed, using hypercall continuation to avoid long hypercall latency and watchdog timeouts.

Domain builders can optimise on-demand scrubbing by running physmap population pinned to the domain’s NUMA node, keeping scrubbing local and avoiding cross-node traffic.

struct page_info *alloc_heap_pages(unsigned int zone_lo, unsigned int zone_hi, unsigned int order, unsigned int memflags, struct domain *d)
Parameters:
  • zone_lo (unsigned int) – The lowest zone index to consider for allocation

  • zone_hi (unsigned int) – The highest zone index to consider for allocation

  • order (unsigned int) – The order of the pages to allocate (2^order pages)

  • memflags (unsigned int) – Memory allocation flags that may affect the allocation

  • d (struct domain*) – The domain for which to allocate memory or NULL

Returns:

The allocated page_info structure, or NULL on failure

This function allocates a contiguous block of pages from the heap. It checks claims and available memory before attempting the allocation. On success, it updates relevant counters and redeems claims as necessary.

It first checks whether the request can be satisfied given the domain’s claims and available memory using claims_permit_request(). If claims and availability permit the request, it calls get_free_buddy() to find a suitable block of free pages while respecting node and zone constraints.

Simplified pseudocode of its logic:

struct page_info *alloc_heap_pages(unsigned int zone_lo,
                                   unsigned int zone_hi,
                                   unsigned int order,
                                   unsigned int memflags,
                                   struct domain *d) {
    struct page_info *pg;

    /* d's claims and the available memory need to permit the request. */
    if (!claims_permit_request(1UL << order, total_avail_pages, memflags,
                               NUMA_NO_NODE, d, outstanding_claims))
        return NULL;

    /* Find a suitable buddy block. Pass the zone range, order and
     * memflags so the helper can apply node and zone selection. */
    pg = get_free_buddy(zone_lo, zone_hi, order, memflags, d);
    if (!pg)
        return NULL;

    /* Redeem the claims that this allocation consumes on pg's node. */
    redeem_claims_for_allocation(d, 1UL << order, node_of(pg));
    update_counters_and_stats(d, order);
    /* Scrub on demand if the block contains unscrubbed pages. */
    if (pg_has_dirty_pages(pg))
        scrub_dirty_pages(pg);
    return pg;
}

struct page_info *get_free_buddy(zone_lo, zone_hi, order, memflags, domain)
Parameters:
  • zone_lo (unsigned int) – The lowest zone index to consider for allocation

  • zone_hi (unsigned int) – The highest zone index to consider for allocation

  • order (unsigned int) – The order of the pages to allocate (2^order pages)

  • memflags (unsigned int) – Flags for conducting the allocation

  • domain (struct domain*) – domain to allocate memory for or NULL

Returns:

The allocated page_info structure, or NULL on failure

This function finds a suitable block of free pages in the buddy allocator while respecting claims and node-level available memory.

Called by alloc_heap_pages() after verifying the request is permissible, it iterates over nodes and zones to find a buddy block that satisfies the request. It checks node-local claims before attempting allocation from a node.

Using claims_permit_request(), it checks whether the node has enough unclaimed memory to satisfy the request or whether the domain’s claims can permit the request on that node after accounting for outstanding claims.

If the node can satisfy the request, it searches for a suitable block in the specified zones. If found, it returns the block; otherwise it tries the next node until all online nodes are exhausted.

Simplified pseudocode of its logic:

/*
 * preferred_node_or_next_node() represents the policy to first try the
 * preferred/requested node then fall back to other online nodes.
 */
struct page_info *get_free_buddy(unsigned int zone_lo,
                                 unsigned int zone_hi,
                                 unsigned int order,
                                 unsigned int memflags,
                                 const struct domain *d) {
    nodeid_t request_node = MEMF_get_node(memflags);
    struct page_info *pg;

    /*
     * Iterate over candidate nodes: start with the preferred node (if any),
     * then try other online nodes according to the normal placement policy.
     */
    while (there are more nodes to try) {
        nodeid_t node = preferred_node_or_next_node(request_node);
        /* Unclaimed pages on the node, plus d's own claim on it. */
        unsigned long avail_pages = node_avail_pages[node] -
                                    node_outstanding_claims[node]
                                    + ((d && !(memflags & MEMF_no_refcount))
                                       ? d->claims[node] : 0);

        /* Ensure the target node and the claims can permit this allocation. */
        if ( avail_pages < (1UL << order) )
            goto next_node;

        /* Find a zone on this node with a suitable buddy. */
        for (int zone = zone_hi; zone >= (int)zone_lo; zone--)
            for (unsigned int j = order; j <= MAX_ORDER; j++)
                if ((pg = remove_head(&heap(node, zone, j))) != NULL)
                    return pg;
     next_node:
        if (request_node != NUMA_NO_NODE && (memflags & MEMF_exact_node))
            return NULL;
        /* Fall back to the next node and repeat. */
    }
    return NULL;
}

Note

The actual implementation includes additional details but the pseudocode captures the core logic of checking claims and available memory while searching for a suitable buddy.
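The per-node check in the pseudocode comes down to one comparison, which a small worked example makes concrete. The helper name and its parameters are hypothetical. The sketch assumes that the claims on a node never exceed its free pages, so the subtraction cannot underflow.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * A request of 2^order pages fits on a node if it does not exceed the
 * node's unclaimed pages plus the requesting domain's own claim there.
 * Pass own_claim = 0 for a domain without a claim on the node.
 */
static bool sketch_node_permits(unsigned long node_avail,
                                unsigned long node_claimed,
                                unsigned long own_claim,
                                unsigned int order)
{
    unsigned long usable = node_avail - node_claimed + own_claim;

    return usable >= (1UL << order);
}
```

For example, take a node with 1000 free pages of which 900 are claimed, 512 of them by domain A. A domain without a claim can use only the 100 unclaimed pages, so an order-6 request (64 pages) is permitted and an order-7 request (128 pages) is not. Domain A can use 100 + 512 = 612 pages, so an order-9 request (512 pages) is permitted while an order-10 request (1024 pages) is not.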

3.9.4. Offlining memory in presence of claims

When offlining pages, Xen must ensure that neither the available memory on a node nor the total number of free pages falls below the respective outstanding claims. If either does, Xen recalls claims from domains until the accounting is valid again.

This is triggered by privileged domains via the XEN_SYSCTL_page_offline_op sysctl or by machine-check memory errors.

Pages that are currently allocated cannot be removed from circulation immediately. They are marked for offlining and are offlined when they are freed back to the allocator. However, when pages that are already free are offlined directly, free memory shrinks immediately, so the outstanding claims may need to be adjusted at the same time.

reserve_offlined_page() needs to check whether offlining the page causes total_avail_pages to fall below outstanding_claims or node_avail_pages[page->node] to fall below node_outstanding_claims[page->node]. If so, reserve_offlined_page() must look for domains with relevant claims and recall those claims until the claim accounting is valid again.

This can violate claim guarantees, but it is necessary to maintain system stability when memory must be offlined.

int reserve_offlined_page(struct page_info *head)
Parameters:
  • head (struct page_info*) – The page being offlined

Returns:

0 on success, or a negative error code on failure.

This function is called during the offlining process to offline pages.

If offlining a page causes available memory to fall below outstanding claims, it checks the node-specific and host-wide claim accounting and recalls claims from domains as necessary to ensure accounting invariants hold after a buddy is offlined.
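The invariant repair can be sketched as follows. Everything here is a simplified stand-in: the sketch_ names, the fixed domain table and the order in which domains lose claims are assumptions for the example, and the host-wide step only recalls host-wide claims. The real code may select domains and claims differently.

```c
#include <assert.h>

#define SKETCH_NODES   2U   /* hypothetical number of NUMA nodes */
#define SKETCH_DOMAINS 2U   /* hypothetical number of domains */

/* Hypothetical, simplified per-domain claim state. */
struct sketch_domain {
    unsigned long node_claims[SKETCH_NODES];
    unsigned long host_claims;
};

static struct sketch_domain sketch_domains[SKETCH_DOMAINS];
static unsigned long sketch_total_avail, sketch_outstanding;
static unsigned long sketch_node_avail[SKETCH_NODES];
static unsigned long sketch_node_outstanding[SKETCH_NODES];

/* Deduct up to `want` pages from *claim and return the amount deducted. */
static unsigned long sketch_take(unsigned long *claim, unsigned long want)
{
    unsigned long got = want < *claim ? want : *claim;

    *claim -= got;
    return got;
}

/* Account for `pages` free pages on `node` leaving the allocator. */
static void sketch_offline_free_pages(unsigned int node, unsigned long pages)
{
    unsigned int i;

    sketch_node_avail[node] -= pages;
    sketch_total_avail -= pages;

    /* Restore node_avail >= node_outstanding by recalling node claims. */
    for (i = 0; i < SKETCH_DOMAINS &&
                sketch_node_outstanding[node] > sketch_node_avail[node]; i++) {
        unsigned long excess = sketch_node_outstanding[node] -
                               sketch_node_avail[node];
        unsigned long got = sketch_take(&sketch_domains[i].node_claims[node],
                                        excess);

        sketch_node_outstanding[node] -= got;
        sketch_outstanding -= got;
    }

    /* Restore total_avail >= outstanding by recalling host-wide claims. */
    for (i = 0; i < SKETCH_DOMAINS &&
                sketch_outstanding > sketch_total_avail; i++) {
        unsigned long excess = sketch_outstanding - sketch_total_avail;

        sketch_outstanding -= sketch_take(&sketch_domains[i].host_claims,
                                          excess);
    }
}
```

The node-level step runs first because recalling a node claim also lowers the host-wide total, which may already be enough to restore the host-wide invariant.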