3.9. Implementation¶
Note
This part describes how claims are implemented in Xen and how they interact with memory allocation. It covers the functions and data structures involved in installing claims and in allocating memory with claims.
3.9.1. Installation of claims¶
This section describes the functions and data structures involved in installing claims for domains, and the internal functions for validating and installing claim sets.
- int domain_set_outstanding_pages(domain, pages)¶
This function is replaced by domain_set_claim_entries().
- int domain_set_claim_entries(domain, nr_entries, claim_set)¶
- Parameters:
domain (struct domain*) – The domain for which to set the node claims
nr_entries (unsigned int) – The number of claims in the claim set
claim_set (memory_claim_t*) – The claim set to install for the domain
- Returns:
0 on success, or a negative error code on failure.
Handles installing claim sets. It performs validation of the claim set and updates the domain’s claims accordingly.
The function works in four phases:
1. Validate claim entries and check node-specific claims availability
2. Validate the host-wide request against the remaining availability
3. Reset any current claims of the domain
4. Install the claim set as the domain’s claiming state
Phase 1 checks claim entries for validity and memory availability:
Target must be XEN_DOMCTL_CLAIM_MEMORY_TOTAL or a node.
Each target node may only appear once in the claim set.
For node-specific claims, requested pages must not exceed the available memory on that node after accounting for existing claims.
The explicit padding field must be zero for forward compatibility.
Phase 2 checks:
The total sum of the requested pages must not exceed the total unclaimed memory of the host after accounting for existing claims.
The claims must not exceed the domain.max_pages limit. See Accounting and Redeeming for the accounting checks that enforce the domain’s domain.max_pages limit.
Added in version claims-v5.
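The two validation phases above can be sketched as a small, self-contained model. This is only an illustration under stated assumptions: `struct claim_entry`, `struct host_state`, `CLAIM_TOTAL`, `validate_claim_set()` and the chosen error codes are hypothetical stand-ins, not the actual Xen definitions, and the model ignores the domain's own existing claims and arithmetic overflow.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES   4
/* Hypothetical stand-in for XEN_DOMCTL_CLAIM_MEMORY_TOTAL. */
#define CLAIM_TOTAL 0xFFFFFFFFu

/* Hypothetical, simplified claim entry; the real memory_claim_t may differ. */
struct claim_entry {
    unsigned int target;   /* node index or CLAIM_TOTAL */
    unsigned int pad;      /* explicit padding, must be zero */
    unsigned long pages;   /* requested pages */
};

/* Hypothetical model of the accounting state (claimed <= avail is assumed). */
struct host_state {
    unsigned long node_avail[MAX_NODES];
    unsigned long node_claimed[MAX_NODES];
    unsigned long total_avail;
    unsigned long total_claimed;
};

static int validate_claim_set(const struct host_state *h,
                              const struct claim_entry *set, size_t nr,
                              unsigned long max_pages)
{
    bool seen[MAX_NODES] = { false };
    bool seen_total = false;   /* assumption: the total target is unique too */
    unsigned long sum = 0;

    assert(h != NULL && (set != NULL || nr == 0));

    /* Phase 1: per-entry validity and node-specific availability. */
    for (size_t i = 0; i < nr; i++) {
        const struct claim_entry *e = &set[i];

        if (e->pad != 0)
            return -EINVAL;
        if (e->target == CLAIM_TOTAL) {
            if (seen_total)
                return -EINVAL;
            seen_total = true;
        } else {
            if (e->target >= MAX_NODES || seen[e->target])
                return -EINVAL;
            seen[e->target] = true;
            if (e->pages > h->node_avail[e->target] - h->node_claimed[e->target])
                return -ENOMEM;
        }
        sum += e->pages;
    }

    /* Phase 2: host-wide availability and the max_pages limit. */
    if (sum > h->total_avail - h->total_claimed)
        return -ENOMEM;
    if (sum > max_pages)
        return -EINVAL;

    return 0;
}
```

Only after both phases succeed would the existing claims be reset and the new set installed, so a rejected request leaves the domain's current claims untouched in this model.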
- int domain_get_claim_entries(domain, nr_entries, claim_set)¶
- Parameters:
domain (struct domain*) – The domain for which to retrieve a claim set
nr_entries (unsigned int*) – The number of claims in the claim set
claim_set (memory_claim_t*) – The preallocated buffer for up to nr_entries claim entries
- Returns:
0 on success with nr_entries updated to the number of claims written to the buffer, or a negative error code on failure.
Retrieves a claim set for the current claims of the domain and writes it to the provided buffer. The number of claims written to the buffer is stored in the variable pointed to by nr_entries.
On entry, the value pointed to by nr_entries specifies the size of the provided buffer in claim entries, and the function writes up to that many claim entries to the buffer. If the buffer is too small to hold all claim entries, the function returns -ERANGE and updates nr_entries to the number of entries needed to hold all claim entries.
Added in version claims-v7.
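The buffer-size protocol described above lends itself to a query-then-retry pattern. The following is a minimal model of that contract, not the Xen implementation: `struct claim_entry` and `get_claim_entries()` are hypothetical, and the sketch assumes that nothing is copied when the buffer is too small.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified claim entry. */
struct claim_entry {
    unsigned int target;
    unsigned long pages;
};

/*
 * Model of the in/out nr_entries contract: on entry *nr_entries is the
 * buffer capacity, on return it is either the number of entries written
 * (success) or the number of entries required (-ERANGE).
 */
static int get_claim_entries(const struct claim_entry *src, unsigned int nr_src,
                             unsigned int *nr_entries, struct claim_entry *buf)
{
    assert(nr_entries != NULL);

    if (*nr_entries < nr_src) {
        *nr_entries = nr_src;   /* tell the caller how much room is needed */
        return -ERANGE;
    }

    for (unsigned int i = 0; i < nr_src; i++)
        buf[i] = src[i];
    *nr_entries = nr_src;

    return 0;
}
```

A caller can start with a small buffer, and on -ERANGE retry once with a buffer of the reported size.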
3.9.2. Helper functions for managing claims¶
- unsigned long domain_release_host_claims(domain, release)¶
- Parameters:
domain (struct domain*) – The domain for which to release host-wide claims
release (unsigned long) – The number of pages to release
- Returns:
The number of host-wide pages actually deducted from the domain.
This function releases the specified number of host-wide claims. It limits the release to the number of host-wide claims actually held by the domain and updates the overall claim state accordingly.
Added in version claims-v4.
- unsigned long domain_release_node_claims(domain, node, release)¶
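- Parameters:
domain (struct domain*) – The domain for which to release node-specific claims
node (nodeid_t) – The NUMA node on which to release claims
release (unsigned long) – The number of pages to release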
- Returns:
The number of pages actually deducted from the domain’s claim.
This function deducts the specified number of pages from a domain’s claim on a specific node. It limits the release to the number of pages the domain actually holds as a claim on that node, and it updates the host-wide and node-specific claim state accordingly.
Added in version claims-v5.
- void domain_recall_node_claims(domain, recall)¶
- Parameters:
domain (struct domain*) – The domain for which to recall node claims
recall (unsigned long) – The number of node-specific pages to recall
This function recalls the specified number of node-specific claims from the domain and updates the overall claim state accordingly.
It iterates over the domain’s node-specific claims and calls domain_release_node_claims() to release up to the given number of pages from each node claim until the specified number of pages has been recalled or all node-specific claims have been exhausted.
This function is used to recall node-specific claims from a domain when offlining memory or when pages for a domain are allocated on nodes other than the claimed node.
Added in version claims-v5.
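The clamping and iteration behaviour of the three helpers can be modelled in a few lines. This is a sketch under assumptions: the structures and function names below are hypothetical simplifications, and the model assumes that the host-wide outstanding total also includes node-specific claims.

```c
#include <assert.h>

#define MAX_NODES 4

/* Hypothetical, simplified per-domain claim state (in pages). */
struct dom_claims {
    unsigned long host;              /* host-wide claim */
    unsigned long node[MAX_NODES];   /* node-specific claims */
};

/* Hypothetical global accounting; the total is assumed to include node claims. */
struct claim_totals {
    unsigned long outstanding;
    unsigned long node_outstanding[MAX_NODES];
};

static unsigned long min_ul(unsigned long a, unsigned long b)
{
    return a < b ? a : b;
}

/* Release at most what the domain holds host-wide; return the amount released. */
static unsigned long release_host_claims(struct claim_totals *t,
                                         struct dom_claims *d,
                                         unsigned long release)
{
    unsigned long n = min_ul(release, d->host);

    d->host -= n;
    t->outstanding -= n;
    return n;
}

/* Release at most what the domain holds on this node. */
static unsigned long release_node_claims(struct claim_totals *t,
                                         struct dom_claims *d,
                                         unsigned int node,
                                         unsigned long release)
{
    unsigned long n;

    assert(node < MAX_NODES);
    n = min_ul(release, d->node[node]);
    d->node[node] -= n;
    t->node_outstanding[node] -= n;
    t->outstanding -= n;
    return n;
}

/* Walk the nodes until enough has been recalled or no node claims remain. */
static void recall_node_claims(struct claim_totals *t, struct dom_claims *d,
                               unsigned long recall)
{
    for (unsigned int node = 0; node < MAX_NODES && recall > 0; node++)
        recall -= release_node_claims(t, d, node, recall);
}
```

Returning the amount actually released lets the recall loop subtract exactly what each node contributed, so it never recalls more than requested.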
3.9.3. Allocation with claims¶
The functions below play a key role in allocating memory for domains.
- int xc_domain_populate_physmap(xch, domid, extents, order, mem_flags, extent_start)¶
- Parameters:
xch (xc_interface*) – The libxenctrl interface
domid (uint32_t) – The ID of the domain
extents (unsigned long) – Number of extents
order (unsigned int) – Order of the extents
mem_flags (unsigned int) – Allocation flags
extent_start (xen_pfn_t*) – Starting PFN
- Returns:
0 on success, or a negative error code on failure.
This function is a wrapper for the XENMEM_populate_physmap hypercall, which is handled by the populate_physmap() function in the hypervisor. It is used by libxenguest for populating the guest physical memory of a domain. Domain builders can set the NUMA node affinity and pass the preferred node to this function to steer allocations towards the preferred NUMA node(s). Claims then ensure that the memory is available even when multiple domains are being built in parallel.
The meminit API calls xc_domain_populate_physmap()
for populating the guest physical memory. It invokes the restartable
XENMEM_populate_physmap hypercall implemented by
populate_physmap().
-
void populate_physmap(struct memop_args *a)¶
- Parameters:
a (struct memop_args*) – Provides status and hypercall restart info
Allocates memory for building a domain and uses it for populating the physmap. For allocation, it uses alloc_domheap_pages(), which forwards the request to alloc_heap_pages().
During domain creation, it adds the MEMF_no_scrub flag to the request for populating the physmap to optimise domain startup by allowing the use of unscrubbed pages.
When that happens, it scrubs the pages as needed using hypercall continuation to avoid long hypercall latency and watchdog timeouts.
Domain builders can optimise on-demand scrubbing by running physmap population pinned to the domain’s NUMA node, keeping scrubbing local and avoiding cross-node traffic.
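The restartable nature of the hypercall can be illustrated with a toy model. This is only a sketch of the continuation idea: `struct memop_model`, `populate_step()` and the fixed work budget are hypothetical, and the real hypervisor decides when to preempt by other means than a simple counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the progress state kept across restarts. */
struct memop_model {
    unsigned int nr_extents;   /* total extents requested */
    unsigned int nr_done;      /* extents populated so far */
};

/*
 * Process extents until done or until the work budget is used up.
 * Returns true if the operation was preempted and must be restarted,
 * false once all extents are populated.
 */
static bool populate_step(struct memop_model *a, unsigned int budget)
{
    unsigned int worked = 0;

    assert(a->nr_done <= a->nr_extents);

    while (a->nr_done < a->nr_extents) {
        if (worked == budget)
            return true;       /* progress is saved in a->nr_done */
        /* Allocation and on-demand scrubbing of one extent would go here. */
        a->nr_done++;
        worked++;
    }
    return false;
}
```

Because progress is recorded in the argument structure, each restart resumes where the previous invocation stopped, which keeps every individual invocation short.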
-
struct page_info *alloc_heap_pages(unsigned int zone_lo, unsigned int zone_hi, unsigned int order, unsigned int memflags, struct domain *d)¶
- Parameters:
zone_lo (unsigned int) – The lowest zone index to consider for allocation
zone_hi (unsigned int) – The highest zone index to consider for allocation
order (unsigned int) – The order of the pages to allocate (2^order pages)
memflags (unsigned int) – Memory allocation flags that may affect the allocation
d (struct domain*) – The domain for which to allocate memory or NULL
- Returns:
The allocated page_info structure, or NULL on failure
This function allocates a contiguous block of pages from the heap. It checks claims and available memory before attempting the allocation. On success, it updates relevant counters and redeems claims as necessary.
It first checks whether the request can be satisfied given the domain’s claims and available memory using claims_permit_request(). If claims and availability permit the request, it calls get_free_buddy() to find a suitable block of free pages while respecting node and zone constraints.
Simplified pseudocode of its logic:
struct page_info *alloc_heap_pages(unsigned int zone_lo,
                                   unsigned int zone_hi,
                                   unsigned int order,
                                   unsigned int memflags,
                                   struct domain *d)
{
    struct page_info *pg;

    /* D's claims and available memory need to permit the request. */
    if (!claims_permit_request(1UL << order, total_avail_pages, memflags,
                               NUMA_NO_NODE, d, outstanding_claims))
        return NULL;

    /* Find a suitable buddy block. Pass the zone range, order and
     * memflags so the helper can apply node and zone selection. */
    pg = get_free_buddy(zone_lo, zone_hi, order, memflags, d);
    if (!pg)
        return NULL;

    /* Redeem D's claims against the node the pages came from. */
    redeem_claims_for_allocation(d, 1UL << order, node_of(pg));
    update_counters_and_stats(d, order);

    /* Scrub the block if it still contains dirty pages. */
    if (pg_has_dirty_pages(pg))
        scrub_dirty_pages(pg);

    return pg;
}
-
struct page_info *get_free_buddy(zone_lo, zone_hi, order, memflags, domain)¶
- Parameters:
zone_lo (unsigned int) – The lowest zone index to consider for allocation
zone_hi (unsigned int) – The highest zone index to consider for allocation
order (unsigned int) – The order of the pages to allocate (2^order pages)
memflags (unsigned int) – Flags for conducting the allocation
domain (struct domain*) – domain to allocate memory for or NULL
- Returns:
The allocated page_info structure, or NULL on failure
This function finds a suitable block of free pages in the buddy allocator while respecting claims and node-level available memory.
Called by alloc_heap_pages() after verifying the request is permissible, it iterates over nodes and zones to find a buddy block that satisfies the request. It checks node-local claims before attempting allocation from a node.
Using claims_permit_request(), it checks whether the node has enough unclaimed memory to satisfy the request or whether the domain’s claims can permit the request on that node after accounting for outstanding claims.
If the node can satisfy the request, it searches for a suitable block in the specified zones. If found, it returns the block; otherwise it tries the next node until all online nodes are exhausted.
Simplified pseudocode of its logic:
/*
 * preferred_node_or_next_node() represents the policy to first try the
 * preferred/requested node then fall back to other online nodes.
 */
struct page_info *get_free_buddy(unsigned int zone_lo,
                                 unsigned int zone_hi,
                                 unsigned int order,
                                 unsigned int memflags,
                                 const struct domain *d)
{
    nodeid_t request_node = MEMF_get_node(memflags);
    struct page_info *pg;

    /*
     * Iterate over candidate nodes: start with preferred node (if any),
     * then try other online nodes according to the normal placement policy.
     */
    while (there are more nodes to try) {
        nodeid_t node = preferred_node_or_next_node(request_node);
        unsigned long avail_pages = node_avail_pages[node] -
                                    node_outstanding_claims[node] +
                                    ((d && !(memflags & MEMF_no_refcount))
                                     ? d->claims[node] : 0);

        /* Ensure the target node and the claims permit this allocation */
        if (avail_pages < (1UL << order))
            goto next_node;

        /* Find a zone on this node with a suitable buddy */
        for (int zone = zone_hi; zone >= (int)zone_lo; zone--)
            for (int j = order; j <= MAX_ORDER; j++)
                if ((pg = remove_head(&heap(node, zone, j))) != NULL)
                    return pg;

    next_node:
        if (request_node != NUMA_NO_NODE && (memflags & MEMF_exact_node))
            return NULL;
        /* Fall back to the next node and repeat. */
    }

    return NULL;
}
Note
The actual implementation includes additional details but the pseudocode captures the core logic of checking claims and available memory while searching for a suitable buddy.
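The per-node availability check in the pseudocode can be exercised in isolation with concrete numbers. The helper below is a hypothetical, self-contained restatement of that one expression, with the domain's own claim on the node passed in explicitly; it is not a function from the Xen source.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * A node permits a request of 2^order pages if its free memory, minus all
 * outstanding claims on the node, plus the requesting domain's own claim
 * on that node (when it may be used), covers the request.
 */
static bool node_permits_request(unsigned long node_avail,
                                 unsigned long node_outstanding,
                                 unsigned long domain_claim,
                                 bool use_claim,
                                 unsigned int order)
{
    unsigned long avail;

    /* Accounting invariants assumed by this model. */
    assert(node_outstanding <= node_avail);
    assert(domain_claim <= node_outstanding);

    avail = node_avail - node_outstanding + (use_claim ? domain_claim : 0);
    return avail >= (1UL << order);
}
```

For example, with 1000 free pages and 900 pages of outstanding claims on a node, of which 600 belong to the requesting domain, an order-9 request (512 pages) is refused for a domain without a claim but permitted for the claiming domain, because its own claim is added back to the 100 unclaimed pages.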
3.9.4. Offlining memory in presence of claims¶
When offlining pages, Xen must ensure that neither the available memory on a node nor the total number of free pages falls below its respective outstanding claims. If either does, Xen recalls claims from domains until accounting is valid again.
This is triggered by privileged domains via the
XEN_SYSCTL_page_offline_op sysctl or by machine-check memory errors.
Pages that are currently allocated cannot be taken out of circulation immediately. They are marked for offlining and are offlined when they are freed back to the allocator. However, when pages that are already free are offlined directly, free memory shrinks immediately, so the outstanding claims may need to be adjusted right away as well.
reserve_offlined_page() needs to check whether offlining the page
causes total_avail_pages to fall below outstanding_claims or
node_avail_pages[page->node] to fall below
node_outstanding_claims[page->node]. If so,
reserve_offlined_page() must look for domains with relevant claims
and recall those claims until the claim accounting is valid again.
When node_outstanding_claims[page->node] exceeds node_avail_pages[page->node] for the offlined page, reserve_offlined_page() should call domain_release_node_claims() to recall claims from domains with claims on the node of the offlined buddy until the claim accounting of the node is valid again.
When total outstanding_claims exceeds total_avail_pages, reserve_offlined_page() calls domain_release_host_claims() to recall host-wide claims from domains until the overall claims accounting is valid again.
This can violate claim guarantees, but it is necessary to maintain system stability when memory must be offlined.
-
int reserve_offlined_page(struct page_info *head)¶
- Parameters:
head (struct page_info*) – The page being offlined
- Returns:
0 on success, or a negative error code on failure.
This function is called during the offlining process to offline pages.
If offlining a page causes available memory to fall below outstanding claims, it checks the node-specific and host-wide claim accounting and recalls claims from domains as necessary to ensure accounting invariants hold after a buddy is offlined.
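The recall step during offlining can be modelled for a single node as follows. This is a sketch under assumptions: the structures and `offline_free_pages()` are hypothetical, the host-wide total is assumed to include node claims, and recalling from domains in array order is an arbitrary policy chosen for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-domain claims: host-wide and on the affected node. */
struct dom_claim {
    unsigned long host;
    unsigned long node;
};

/* Hypothetical accounting for the host and for the affected node. */
struct acct {
    unsigned long total_avail;
    unsigned long outstanding;        /* assumed to include node claims */
    unsigned long node_avail;
    unsigned long node_outstanding;
};

static unsigned long min_ul(unsigned long a, unsigned long b)
{
    return a < b ? a : b;
}

/* Offline free pages on the node, then recall claims until accounting holds. */
static void offline_free_pages(struct acct *a, struct dom_claim *doms,
                               size_t nr_doms, unsigned long pages)
{
    assert(pages <= a->node_avail && pages <= a->total_avail);
    a->node_avail -= pages;
    a->total_avail -= pages;

    /* Node-level invariant: node_outstanding <= node_avail. */
    for (size_t i = 0; i < nr_doms && a->node_outstanding > a->node_avail; i++) {
        unsigned long n = min_ul(a->node_outstanding - a->node_avail,
                                 doms[i].node);
        doms[i].node -= n;
        a->node_outstanding -= n;
        a->outstanding -= n;
    }

    /* Host-level invariant: outstanding <= total_avail. */
    for (size_t i = 0; i < nr_doms && a->outstanding > a->total_avail; i++) {
        unsigned long n = min_ul(a->outstanding - a->total_avail, doms[i].host);
        doms[i].host -= n;
        a->outstanding -= n;
    }
}
```

Node claims are recalled first in this model because releasing them also lowers the host-wide total, which can make a separate host-wide recall smaller or unnecessary.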