diff mbox series

xen/page_alloc: Simplify domain_adjust_tot_pages

Message ID 20250224132724.9074-1-alejandro.vallejo@cloud.com (mailing list archive)
State Superseded
Headers show
Series xen/page_alloc: Simplify domain_adjust_tot_pages | expand

Commit Message

Alejandro Vallejo Feb. 24, 2025, 1:27 p.m. UTC
The logic has too many levels of indirection and it's very hard to
understand it its current form. Split it between the corner case where
the adjustment is bigger than the current claim and the rest to avoid 5
auxiliary variables.

While at it, fix incorrect field name in nearby comment.

Not a functional change (although by its nature it might not seem
immediately obvious at first).

Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com>
---
 xen/common/page_alloc.c | 25 +++++++++++--------------
 1 file changed, 11 insertions(+), 14 deletions(-)

Comments

Alejandro Vallejo Feb. 24, 2025, 2:49 p.m. UTC | #1
Open question to whoever reviews this...

On Mon Feb 24, 2025 at 1:27 PM GMT, Alejandro Vallejo wrote:
>      spin_lock(&heap_lock);
> -    /* adjust domain outstanding pages; may not go negative */
> -    dom_before = d->outstanding_pages;
> -    dom_after = dom_before - pages;
> -    BUG_ON(dom_before < 0);
> -    dom_claimed = dom_after < 0 ? 0 : dom_after;
> -    d->outstanding_pages = dom_claimed;
> -    /* flag accounting bug if system outstanding_claims would go negative */
> -    sys_before = outstanding_claims;
> -    sys_after = sys_before - (dom_before - dom_claimed);
> -    BUG_ON(sys_after < 0);
> -    outstanding_claims = sys_after;
> +    BUG_ON(outstanding_claims < d->outstanding_pages);
> +    if ( pages > 0 && d->outstanding_pages < pages )
> +    {
> +        /* `pages` exceeds the domain's outstanding count. Zero it out. */
> +        outstanding_claims -= d->outstanding_pages;
> +        d->outstanding_pages = 0;

While this matches the previous behaviour, do we _really_ want it? It's weird,
quirky, and it hard to extend to NUMA-aware claims (which is something in
midway through).

Wouldn't it make sense to fail the allocation (earlier) if the claim has run
out? Do we even expect this to ever happen this late in the allocation call
chain?

> +    } else {
> +        outstanding_claims -= pages;
> +        d->outstanding_pages -= pages;
> +    }
>      spin_unlock(&heap_lock);
>  
>  out:

Cheers,
Alejandro
Roger Pau Monne Feb. 26, 2025, 1:56 p.m. UTC | #2
On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote:
> The logic has too many levels of indirection and it's very hard to
> understand it its current form. Split it between the corner case where
> the adjustment is bigger than the current claim and the rest to avoid 5
> auxiliary variables.
> 
> While at it, fix incorrect field name in nearby comment.
> 
> Not a functional change (although by its nature it might not seem
> immediately obvious at first).
> 
> Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com>
> ---
>  xen/common/page_alloc.c | 25 +++++++++++--------------
>  1 file changed, 11 insertions(+), 14 deletions(-)
> 
> diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
> index 1bf070c8c5df..4306753f0bd2 100644
> --- a/xen/common/page_alloc.c
> +++ b/xen/common/page_alloc.c
> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */
>  
>  unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
>  {
> -    long dom_before, dom_after, dom_claimed, sys_before, sys_after;
> -
>      ASSERT(rspin_is_locked(&d->page_alloc_lock));
>      d->tot_pages += pages;
>  
>      /*
> -     * can test d->claimed_pages race-free because it can only change
> +     * can test d->outstanding_pages race-free because it can only change
>       * if d->page_alloc_lock and heap_lock are both held, see also
>       * domain_set_outstanding_pages below
>       */
> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
>          goto out;

I think you can probably short-circuit the logic below if pages == 0?
(and avoid taking the heap_lock)

>  
>      spin_lock(&heap_lock);
> -    /* adjust domain outstanding pages; may not go negative */
> -    dom_before = d->outstanding_pages;
> -    dom_after = dom_before - pages;
> -    BUG_ON(dom_before < 0);
> -    dom_claimed = dom_after < 0 ? 0 : dom_after;
> -    d->outstanding_pages = dom_claimed;
> -    /* flag accounting bug if system outstanding_claims would go negative */
> -    sys_before = outstanding_claims;
> -    sys_after = sys_before - (dom_before - dom_claimed);
> -    BUG_ON(sys_after < 0);
> -    outstanding_claims = sys_after;
> +    BUG_ON(outstanding_claims < d->outstanding_pages);
> +    if ( pages > 0 && d->outstanding_pages < pages )
> +    {
> +        /* `pages` exceeds the domain's outstanding count. Zero it out. */
> +        outstanding_claims -= d->outstanding_pages;
> +        d->outstanding_pages = 0;
> +    } else {
> +        outstanding_claims -= pages;
> +        d->outstanding_pages -= pages;

I wonder if it's intentional for a pages < 0 value to modify
outstanding_claims and d->outstanding_pages, I think those values
should only be set from domain_set_outstanding_pages().
domain_adjust_tot_pages() should only decrease the value, but never
increase either outstanding_claims or d->outstanding_pages.

At best the behavior is inconsistent, because once
d->outstanding_pages reaches 0 there will be no further modification
from domain_adjust_tot_pages().

I think it would be best if outstanding_claims and
d->outstanding_pages was only modified if pages > 0.

Thanks, Roger.
Jan Beulich Feb. 26, 2025, 2:02 p.m. UTC | #3
On 24.02.2025 14:27, Alejandro Vallejo wrote:
> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
>          goto out;
>  
>      spin_lock(&heap_lock);
> -    /* adjust domain outstanding pages; may not go negative */
> -    dom_before = d->outstanding_pages;
> -    dom_after = dom_before - pages;
> -    BUG_ON(dom_before < 0);
> -    dom_claimed = dom_after < 0 ? 0 : dom_after;
> -    d->outstanding_pages = dom_claimed;
> -    /* flag accounting bug if system outstanding_claims would go negative */
> -    sys_before = outstanding_claims;
> -    sys_after = sys_before - (dom_before - dom_claimed);
> -    BUG_ON(sys_after < 0);
> -    outstanding_claims = sys_after;
> +    BUG_ON(outstanding_claims < d->outstanding_pages);
> +    if ( pages > 0 && d->outstanding_pages < pages )

The lhs isn't needed, is it? d->outstanding_pages is an unsigned quantity,
after all. Else dropping the earlier of the two BUG_ON() wouldn't be quite
right.

> +    {
> +        /* `pages` exceeds the domain's outstanding count. Zero it out. */
> +        outstanding_claims -= d->outstanding_pages;
> +        d->outstanding_pages = 0;
> +    } else {

Nit: Braces on their own lines please.

In any event - yes, this reads quite a bit easier after the adjustment.

With the adjustments (happy to make while committing, so long as you agree)
Reviewed-by: Jan Beulich <jbeulich@suse.com>

Jan

> +        outstanding_claims -= pages;
> +        d->outstanding_pages -= pages;
> +    }
>      spin_unlock(&heap_lock);
>  
>  out:
Jan Beulich Feb. 26, 2025, 2:05 p.m. UTC | #4
On 24.02.2025 15:49, Alejandro Vallejo wrote:
> Open question to whoever reviews this...
> 
> On Mon Feb 24, 2025 at 1:27 PM GMT, Alejandro Vallejo wrote:
>>      spin_lock(&heap_lock);
>> -    /* adjust domain outstanding pages; may not go negative */
>> -    dom_before = d->outstanding_pages;
>> -    dom_after = dom_before - pages;
>> -    BUG_ON(dom_before < 0);
>> -    dom_claimed = dom_after < 0 ? 0 : dom_after;
>> -    d->outstanding_pages = dom_claimed;
>> -    /* flag accounting bug if system outstanding_claims would go negative */
>> -    sys_before = outstanding_claims;
>> -    sys_after = sys_before - (dom_before - dom_claimed);
>> -    BUG_ON(sys_after < 0);
>> -    outstanding_claims = sys_after;
>> +    BUG_ON(outstanding_claims < d->outstanding_pages);
>> +    if ( pages > 0 && d->outstanding_pages < pages )
>> +    {
>> +        /* `pages` exceeds the domain's outstanding count. Zero it out. */
>> +        outstanding_claims -= d->outstanding_pages;
>> +        d->outstanding_pages = 0;
> 
> While this matches the previous behaviour, do we _really_ want it? It's weird,
> quirky, and it hard to extend to NUMA-aware claims (which is something in
> midway through).
> 
> Wouldn't it make sense to fail the allocation (earlier) if the claim has run
> out? Do we even expect this to ever happen this late in the allocation call
> chain?

This goes back to what a "claim" means. Even without any claim, a domain may
allocate memory. So a claim having run out doesn't imply allocation has to
fail.

NUMA-aware claims require more than an adjustment just here, I expect. Tracking
of claims (certainly the global, maybe also the per-domain value) would likely
need to become per-node, for example.

Jan
Jan Beulich Feb. 26, 2025, 2:08 p.m. UTC | #5
On 26.02.2025 14:56, Roger Pau Monné wrote:
> On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote:
>> --- a/xen/common/page_alloc.c
>> +++ b/xen/common/page_alloc.c
>> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */
>>  
>>  unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
>>  {
>> -    long dom_before, dom_after, dom_claimed, sys_before, sys_after;
>> -
>>      ASSERT(rspin_is_locked(&d->page_alloc_lock));
>>      d->tot_pages += pages;
>>  
>>      /*
>> -     * can test d->claimed_pages race-free because it can only change
>> +     * can test d->outstanding_pages race-free because it can only change
>>       * if d->page_alloc_lock and heap_lock are both held, see also
>>       * domain_set_outstanding_pages below
>>       */
>> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
>>          goto out;
> 
> I think you can probably short-circuit the logic below if pages == 0?
> (and avoid taking the heap_lock)

Are there callers passing in 0?

>>      spin_lock(&heap_lock);
>> -    /* adjust domain outstanding pages; may not go negative */
>> -    dom_before = d->outstanding_pages;
>> -    dom_after = dom_before - pages;
>> -    BUG_ON(dom_before < 0);
>> -    dom_claimed = dom_after < 0 ? 0 : dom_after;
>> -    d->outstanding_pages = dom_claimed;
>> -    /* flag accounting bug if system outstanding_claims would go negative */
>> -    sys_before = outstanding_claims;
>> -    sys_after = sys_before - (dom_before - dom_claimed);
>> -    BUG_ON(sys_after < 0);
>> -    outstanding_claims = sys_after;
>> +    BUG_ON(outstanding_claims < d->outstanding_pages);
>> +    if ( pages > 0 && d->outstanding_pages < pages )
>> +    {
>> +        /* `pages` exceeds the domain's outstanding count. Zero it out. */
>> +        outstanding_claims -= d->outstanding_pages;
>> +        d->outstanding_pages = 0;
>> +    } else {
>> +        outstanding_claims -= pages;
>> +        d->outstanding_pages -= pages;
> 
> I wonder if it's intentional for a pages < 0 value to modify
> outstanding_claims and d->outstanding_pages, I think those values
> should only be set from domain_set_outstanding_pages().
> domain_adjust_tot_pages() should only decrease the value, but never
> increase either outstanding_claims or d->outstanding_pages.
> 
> At best the behavior is inconsistent, because once
> d->outstanding_pages reaches 0 there will be no further modification
> from domain_adjust_tot_pages().

Right, at that point the claim has run out. While freeing pages with an
active claim means that the claim gets bigger (which naturally needs
reflecting in the global).

Jan
Roger Pau Monne Feb. 26, 2025, 2:18 p.m. UTC | #6
On Mon, Feb 24, 2025 at 02:49:48PM +0000, Alejandro Vallejo wrote:
> Open question to whoever reviews this...
> 
> On Mon Feb 24, 2025 at 1:27 PM GMT, Alejandro Vallejo wrote:
> >      spin_lock(&heap_lock);
> > -    /* adjust domain outstanding pages; may not go negative */
> > -    dom_before = d->outstanding_pages;
> > -    dom_after = dom_before - pages;
> > -    BUG_ON(dom_before < 0);
> > -    dom_claimed = dom_after < 0 ? 0 : dom_after;
> > -    d->outstanding_pages = dom_claimed;
> > -    /* flag accounting bug if system outstanding_claims would go negative */
> > -    sys_before = outstanding_claims;
> > -    sys_after = sys_before - (dom_before - dom_claimed);
> > -    BUG_ON(sys_after < 0);
> > -    outstanding_claims = sys_after;
> > +    BUG_ON(outstanding_claims < d->outstanding_pages);
> > +    if ( pages > 0 && d->outstanding_pages < pages )
> > +    {
> > +        /* `pages` exceeds the domain's outstanding count. Zero it out. */
> > +        outstanding_claims -= d->outstanding_pages;
> > +        d->outstanding_pages = 0;
> 
> While this matches the previous behaviour, do we _really_ want it? It's weird,
> quirky, and it hard to extend to NUMA-aware claims (which is something in
> midway through).
> 
> Wouldn't it make sense to fail the allocation (earlier) if the claim has run
> out? Do we even expect this to ever happen this late in the allocation call
> chain?

I'm unsure.  This is the case where more memory than initially claimed
has been allocated, but by the time domain_adjust_tot_pages() gets
called the memory has already been allocated, so it's kind of
unhelpful to fail by then.

I think any caller that requests more memory than what has been
initially claimed for the domain should be prepared to deal with such
allocation failing.  This quirky handling is very likely a workaround
for the miscellaneous differences between the memory accounted by the
toolstack for a guest vs the memory really used by such guest.  I bet
if you limit a guest to strictly only allocate up to
d->outstanding_pages domain creation will fail.

In general the toolstack memory calculations are not fully accurate,
see for example how vmx_alloc_vlapic_mapping() allocates a domheap
page which very likely the toolstack won't have accounted for.  There
are likely other examples that would possibly break the accounting
done by the toolstack.

Thanks, Roger.
Roger Pau Monne Feb. 26, 2025, 2:28 p.m. UTC | #7
On Wed, Feb 26, 2025 at 03:08:33PM +0100, Jan Beulich wrote:
> On 26.02.2025 14:56, Roger Pau Monné wrote:
> > On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote:
> >> --- a/xen/common/page_alloc.c
> >> +++ b/xen/common/page_alloc.c
> >> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */
> >>  
> >>  unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
> >>  {
> >> -    long dom_before, dom_after, dom_claimed, sys_before, sys_after;
> >> -
> >>      ASSERT(rspin_is_locked(&d->page_alloc_lock));
> >>      d->tot_pages += pages;
> >>  
> >>      /*
> >> -     * can test d->claimed_pages race-free because it can only change
> >> +     * can test d->outstanding_pages race-free because it can only change
> >>       * if d->page_alloc_lock and heap_lock are both held, see also
> >>       * domain_set_outstanding_pages below
> >>       */
> >> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
> >>          goto out;
> > 
> > I think you can probably short-circuit the logic below if pages == 0?
> > (and avoid taking the heap_lock)
> 
> Are there callers passing in 0?

Not sure, but if there are no callers expected we might add an ASSERT
to that effect then.

> >>      spin_lock(&heap_lock);
> >> -    /* adjust domain outstanding pages; may not go negative */
> >> -    dom_before = d->outstanding_pages;
> >> -    dom_after = dom_before - pages;
> >> -    BUG_ON(dom_before < 0);
> >> -    dom_claimed = dom_after < 0 ? 0 : dom_after;
> >> -    d->outstanding_pages = dom_claimed;
> >> -    /* flag accounting bug if system outstanding_claims would go negative */
> >> -    sys_before = outstanding_claims;
> >> -    sys_after = sys_before - (dom_before - dom_claimed);
> >> -    BUG_ON(sys_after < 0);
> >> -    outstanding_claims = sys_after;
> >> +    BUG_ON(outstanding_claims < d->outstanding_pages);
> >> +    if ( pages > 0 && d->outstanding_pages < pages )
> >> +    {
> >> +        /* `pages` exceeds the domain's outstanding count. Zero it out. */
> >> +        outstanding_claims -= d->outstanding_pages;
> >> +        d->outstanding_pages = 0;
> >> +    } else {
> >> +        outstanding_claims -= pages;
> >> +        d->outstanding_pages -= pages;
> > 
> > I wonder if it's intentional for a pages < 0 value to modify
> > outstanding_claims and d->outstanding_pages, I think those values
> > should only be set from domain_set_outstanding_pages().
> > domain_adjust_tot_pages() should only decrease the value, but never
> > increase either outstanding_claims or d->outstanding_pages.
> > 
> > At best the behavior is inconsistent, because once
> > d->outstanding_pages reaches 0 there will be no further modification
> > from domain_adjust_tot_pages().
> 
> Right, at that point the claim has run out. While freeing pages with an
> active claim means that the claim gets bigger (which naturally needs
> reflecting in the global).

domain_adjust_tot_pages() is not exclusively called when freeing
pages, see steal_page() for example.

When called from steal_page() it's wrong to increase the claim, as
it assumes that the page removed from d->tot_pages is freed, but
that's not the case.  The domain might end up in a situation where
the claim is bigger than the available amount of memory.

Thanks, Roger.
Jan Beulich Feb. 26, 2025, 2:36 p.m. UTC | #8
On 26.02.2025 15:28, Roger Pau Monné wrote:
> On Wed, Feb 26, 2025 at 03:08:33PM +0100, Jan Beulich wrote:
>> On 26.02.2025 14:56, Roger Pau Monné wrote:
>>> On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote:
>>>> --- a/xen/common/page_alloc.c
>>>> +++ b/xen/common/page_alloc.c
>>>> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */
>>>>  
>>>>  unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
>>>>  {
>>>> -    long dom_before, dom_after, dom_claimed, sys_before, sys_after;
>>>> -
>>>>      ASSERT(rspin_is_locked(&d->page_alloc_lock));
>>>>      d->tot_pages += pages;
>>>>  
>>>>      /*
>>>> -     * can test d->claimed_pages race-free because it can only change
>>>> +     * can test d->outstanding_pages race-free because it can only change
>>>>       * if d->page_alloc_lock and heap_lock are both held, see also
>>>>       * domain_set_outstanding_pages below
>>>>       */
>>>> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
>>>>          goto out;
>>>
>>> I think you can probably short-circuit the logic below if pages == 0?
>>> (and avoid taking the heap_lock)
>>
>> Are there callers passing in 0?
> 
> Not sure, but if there are no callers expected we might add an ASSERT
> to that effect then.
> 
>>>>      spin_lock(&heap_lock);
>>>> -    /* adjust domain outstanding pages; may not go negative */
>>>> -    dom_before = d->outstanding_pages;
>>>> -    dom_after = dom_before - pages;
>>>> -    BUG_ON(dom_before < 0);
>>>> -    dom_claimed = dom_after < 0 ? 0 : dom_after;
>>>> -    d->outstanding_pages = dom_claimed;
>>>> -    /* flag accounting bug if system outstanding_claims would go negative */
>>>> -    sys_before = outstanding_claims;
>>>> -    sys_after = sys_before - (dom_before - dom_claimed);
>>>> -    BUG_ON(sys_after < 0);
>>>> -    outstanding_claims = sys_after;
>>>> +    BUG_ON(outstanding_claims < d->outstanding_pages);
>>>> +    if ( pages > 0 && d->outstanding_pages < pages )
>>>> +    {
>>>> +        /* `pages` exceeds the domain's outstanding count. Zero it out. */
>>>> +        outstanding_claims -= d->outstanding_pages;
>>>> +        d->outstanding_pages = 0;
>>>> +    } else {
>>>> +        outstanding_claims -= pages;
>>>> +        d->outstanding_pages -= pages;
>>>
>>> I wonder if it's intentional for a pages < 0 value to modify
>>> outstanding_claims and d->outstanding_pages, I think those values
>>> should only be set from domain_set_outstanding_pages().
>>> domain_adjust_tot_pages() should only decrease the value, but never
>>> increase either outstanding_claims or d->outstanding_pages.
>>>
>>> At best the behavior is inconsistent, because once
>>> d->outstanding_pages reaches 0 there will be no further modification
>>> from domain_adjust_tot_pages().
>>
>> Right, at that point the claim has run out. While freeing pages with an
>> active claim means that the claim gets bigger (which naturally needs
>> reflecting in the global).
> 
> domain_adjust_tot_pages() is not exclusively called when freeing
> pages, see steal_page() for example.

Or also when pages were allocated. steal_page() ...

> When called from steal_page() it's wrong to increase the claim, as
> it assumes that the page removed from d->tot_pages is freed, but
> that's not the case.  The domain might end up in a situation where
> the claim is bigger than the available amount of memory.

... is a case that likely wasn't considered when the feature was added.

I never really liked this; I'd be quite happy to see it ripped out, as
long as we'd be reasonably certain it isn't in active use by people.

Jan
Roger Pau Monne Feb. 26, 2025, 4:04 p.m. UTC | #9
On Wed, Feb 26, 2025 at 03:36:33PM +0100, Jan Beulich wrote:
> On 26.02.2025 15:28, Roger Pau Monné wrote:
> > On Wed, Feb 26, 2025 at 03:08:33PM +0100, Jan Beulich wrote:
> >> On 26.02.2025 14:56, Roger Pau Monné wrote:
> >>> On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote:
> >>>> --- a/xen/common/page_alloc.c
> >>>> +++ b/xen/common/page_alloc.c
> >>>> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */
> >>>>  
> >>>>  unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
> >>>>  {
> >>>> -    long dom_before, dom_after, dom_claimed, sys_before, sys_after;
> >>>> -
> >>>>      ASSERT(rspin_is_locked(&d->page_alloc_lock));
> >>>>      d->tot_pages += pages;
> >>>>  
> >>>>      /*
> >>>> -     * can test d->claimed_pages race-free because it can only change
> >>>> +     * can test d->outstanding_pages race-free because it can only change
> >>>>       * if d->page_alloc_lock and heap_lock are both held, see also
> >>>>       * domain_set_outstanding_pages below
> >>>>       */
> >>>> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
> >>>>          goto out;
> >>>
> >>> I think you can probably short-circuit the logic below if pages == 0?
> >>> (and avoid taking the heap_lock)
> >>
> >> Are there callers passing in 0?
> > 
> > Not sure, but if there are no callers expected we might add an ASSERT
> > to that effect then.
> > 
> >>>>      spin_lock(&heap_lock);
> >>>> -    /* adjust domain outstanding pages; may not go negative */
> >>>> -    dom_before = d->outstanding_pages;
> >>>> -    dom_after = dom_before - pages;
> >>>> -    BUG_ON(dom_before < 0);
> >>>> -    dom_claimed = dom_after < 0 ? 0 : dom_after;
> >>>> -    d->outstanding_pages = dom_claimed;
> >>>> -    /* flag accounting bug if system outstanding_claims would go negative */
> >>>> -    sys_before = outstanding_claims;
> >>>> -    sys_after = sys_before - (dom_before - dom_claimed);
> >>>> -    BUG_ON(sys_after < 0);
> >>>> -    outstanding_claims = sys_after;
> >>>> +    BUG_ON(outstanding_claims < d->outstanding_pages);
> >>>> +    if ( pages > 0 && d->outstanding_pages < pages )
> >>>> +    {
> >>>> +        /* `pages` exceeds the domain's outstanding count. Zero it out. */
> >>>> +        outstanding_claims -= d->outstanding_pages;
> >>>> +        d->outstanding_pages = 0;
> >>>> +    } else {
> >>>> +        outstanding_claims -= pages;
> >>>> +        d->outstanding_pages -= pages;
> >>>
> >>> I wonder if it's intentional for a pages < 0 value to modify
> >>> outstanding_claims and d->outstanding_pages, I think those values
> >>> should only be set from domain_set_outstanding_pages().
> >>> domain_adjust_tot_pages() should only decrease the value, but never
> >>> increase either outstanding_claims or d->outstanding_pages.
> >>>
> >>> At best the behavior is inconsistent, because once
> >>> d->outstanding_pages reaches 0 there will be no further modification
> >>> from domain_adjust_tot_pages().
> >>
> >> Right, at that point the claim has run out. While freeing pages with an
> >> active claim means that the claim gets bigger (which naturally needs
> >> reflecting in the global).
> > 
> > domain_adjust_tot_pages() is not exclusively called when freeing
> > pages, see steal_page() for example.
> 
> Or also when pages were allocated. steal_page() ...
> 
> > When called from steal_page() it's wrong to increase the claim, as
> > it assumes that the page removed from d->tot_pages is freed, but
> > that's not the case.  The domain might end up in a situation where
> > the claim is bigger than the available amount of memory.
> 
> ... is a case that likely wasn't considered when the feature was added.
> 
> I never really liked this; I'd be quite happy to see it ripped out, as
> long as we'd be reasonably certain it isn't in active use by people.

What do you mean with 'it' in the above sentence, the whole claim
stuff?  Or just getting rid of allowing the claim to increase as a
result of pages being removed from a domain?

Thanks, Roger.
Jan Beulich Feb. 26, 2025, 4:06 p.m. UTC | #10
On 26.02.2025 17:04, Roger Pau Monné wrote:
> On Wed, Feb 26, 2025 at 03:36:33PM +0100, Jan Beulich wrote:
>> On 26.02.2025 15:28, Roger Pau Monné wrote:
>>> On Wed, Feb 26, 2025 at 03:08:33PM +0100, Jan Beulich wrote:
>>>> On 26.02.2025 14:56, Roger Pau Monné wrote:
>>>>> On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote:
>>>>>> --- a/xen/common/page_alloc.c
>>>>>> +++ b/xen/common/page_alloc.c
>>>>>> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */
>>>>>>  
>>>>>>  unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
>>>>>>  {
>>>>>> -    long dom_before, dom_after, dom_claimed, sys_before, sys_after;
>>>>>> -
>>>>>>      ASSERT(rspin_is_locked(&d->page_alloc_lock));
>>>>>>      d->tot_pages += pages;
>>>>>>  
>>>>>>      /*
>>>>>> -     * can test d->claimed_pages race-free because it can only change
>>>>>> +     * can test d->outstanding_pages race-free because it can only change
>>>>>>       * if d->page_alloc_lock and heap_lock are both held, see also
>>>>>>       * domain_set_outstanding_pages below
>>>>>>       */
>>>>>> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
>>>>>>          goto out;
>>>>>
>>>>> I think you can probably short-circuit the logic below if pages == 0?
>>>>> (and avoid taking the heap_lock)
>>>>
>>>> Are there callers passing in 0?
>>>
>>> Not sure, but if there are no callers expected we might add an ASSERT
>>> to that effect then.
>>>
>>>>>>      spin_lock(&heap_lock);
>>>>>> -    /* adjust domain outstanding pages; may not go negative */
>>>>>> -    dom_before = d->outstanding_pages;
>>>>>> -    dom_after = dom_before - pages;
>>>>>> -    BUG_ON(dom_before < 0);
>>>>>> -    dom_claimed = dom_after < 0 ? 0 : dom_after;
>>>>>> -    d->outstanding_pages = dom_claimed;
>>>>>> -    /* flag accounting bug if system outstanding_claims would go negative */
>>>>>> -    sys_before = outstanding_claims;
>>>>>> -    sys_after = sys_before - (dom_before - dom_claimed);
>>>>>> -    BUG_ON(sys_after < 0);
>>>>>> -    outstanding_claims = sys_after;
>>>>>> +    BUG_ON(outstanding_claims < d->outstanding_pages);
>>>>>> +    if ( pages > 0 && d->outstanding_pages < pages )
>>>>>> +    {
>>>>>> +        /* `pages` exceeds the domain's outstanding count. Zero it out. */
>>>>>> +        outstanding_claims -= d->outstanding_pages;
>>>>>> +        d->outstanding_pages = 0;
>>>>>> +    } else {
>>>>>> +        outstanding_claims -= pages;
>>>>>> +        d->outstanding_pages -= pages;
>>>>>
>>>>> I wonder if it's intentional for a pages < 0 value to modify
>>>>> outstanding_claims and d->outstanding_pages, I think those values
>>>>> should only be set from domain_set_outstanding_pages().
>>>>> domain_adjust_tot_pages() should only decrease the value, but never
>>>>> increase either outstanding_claims or d->outstanding_pages.
>>>>>
>>>>> At best the behavior is inconsistent, because once
>>>>> d->outstanding_pages reaches 0 there will be no further modification
>>>>> from domain_adjust_tot_pages().
>>>>
>>>> Right, at that point the claim has run out. While freeing pages with an
>>>> active claim means that the claim gets bigger (which naturally needs
>>>> reflecting in the global).
>>>
>>> domain_adjust_tot_pages() is not exclusively called when freeing
>>> pages, see steal_page() for example.
>>
>> Or also when pages were allocated. steal_page() ...
>>
>>> When called from steal_page() it's wrong to increase the claim, as
>>> it assumes that the page removed from d->tot_pages is freed, but
>>> that's not the case.  The domain might end up in a situation where
>>> the claim is bigger than the available amount of memory.
>>
>> ... is a case that likely wasn't considered when the feature was added.
>>
>> I never really liked this; I'd be quite happy to see it ripped out, as
>> long as we'd be reasonably certain it isn't in active use by people.
> 
> What do you mean with 'it' in the above sentence, the whole claim
> stuff?

Yes.

>  Or just getting rid of allowing the claim to increase as a
> result of pages being removed from a domain?

No.

Jan
Andrew Cooper Feb. 26, 2025, 4:34 p.m. UTC | #11
On 26/02/2025 4:06 pm, Jan Beulich wrote:
> On 26.02.2025 17:04, Roger Pau Monné wrote:
>> On Wed, Feb 26, 2025 at 03:36:33PM +0100, Jan Beulich wrote:
>>> On 26.02.2025 15:28, Roger Pau Monné wrote:
>>>> On Wed, Feb 26, 2025 at 03:08:33PM +0100, Jan Beulich wrote:
>>>>> On 26.02.2025 14:56, Roger Pau Monné wrote:
>>>>>> On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote:
>>>>>>> --- a/xen/common/page_alloc.c
>>>>>>> +++ b/xen/common/page_alloc.c
>>>>>>> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */
>>>>>>>  
>>>>>>>  unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
>>>>>>>  {
>>>>>>> -    long dom_before, dom_after, dom_claimed, sys_before, sys_after;
>>>>>>> -
>>>>>>>      ASSERT(rspin_is_locked(&d->page_alloc_lock));
>>>>>>>      d->tot_pages += pages;
>>>>>>>  
>>>>>>>      /*
>>>>>>> -     * can test d->claimed_pages race-free because it can only change
>>>>>>> +     * can test d->outstanding_pages race-free because it can only change
>>>>>>>       * if d->page_alloc_lock and heap_lock are both held, see also
>>>>>>>       * domain_set_outstanding_pages below
>>>>>>>       */
>>>>>>> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
>>>>>>>          goto out;
>>>>>> I think you can probably short-circuit the logic below if pages == 0?
>>>>>> (and avoid taking the heap_lock)
>>>>> Are there callers passing in 0?
>>>> Not sure, but if there are no callers expected we might add an ASSERT
>>>> to that effect then.
>>>>
>>>>>>>      spin_lock(&heap_lock);
>>>>>>> -    /* adjust domain outstanding pages; may not go negative */
>>>>>>> -    dom_before = d->outstanding_pages;
>>>>>>> -    dom_after = dom_before - pages;
>>>>>>> -    BUG_ON(dom_before < 0);
>>>>>>> -    dom_claimed = dom_after < 0 ? 0 : dom_after;
>>>>>>> -    d->outstanding_pages = dom_claimed;
>>>>>>> -    /* flag accounting bug if system outstanding_claims would go negative */
>>>>>>> -    sys_before = outstanding_claims;
>>>>>>> -    sys_after = sys_before - (dom_before - dom_claimed);
>>>>>>> -    BUG_ON(sys_after < 0);
>>>>>>> -    outstanding_claims = sys_after;
>>>>>>> +    BUG_ON(outstanding_claims < d->outstanding_pages);
>>>>>>> +    if ( pages > 0 && d->outstanding_pages < pages )
>>>>>>> +    {
>>>>>>> +        /* `pages` exceeds the domain's outstanding count. Zero it out. */
>>>>>>> +        outstanding_claims -= d->outstanding_pages;
>>>>>>> +        d->outstanding_pages = 0;
>>>>>>> +    } else {
>>>>>>> +        outstanding_claims -= pages;
>>>>>>> +        d->outstanding_pages -= pages;
>>>>>> I wonder if it's intentional for a pages < 0 value to modify
>>>>>> outstanding_claims and d->outstanding_pages, I think those values
>>>>>> should only be set from domain_set_outstanding_pages().
>>>>>> domain_adjust_tot_pages() should only decrease the value, but never
>>>>>> increase either outstanding_claims or d->outstanding_pages.
>>>>>>
>>>>>> At best the behavior is inconsistent, because once
>>>>>> d->outstanding_pages reaches 0 there will be no further modification
>>>>>> from domain_adjust_tot_pages().
>>>>> Right, at that point the claim has run out. While freeing pages with an
>>>>> active claim means that the claim gets bigger (which naturally needs
>>>>> reflecting in the global).
>>>> domain_adjust_tot_pages() is not exclusively called when freeing
>>>> pages, see steal_page() for example.
>>> Or also when pages were allocated. steal_page() ...
>>>
>>>> When called from steal_page() it's wrong to increase the claim, as
>>>> it assumes that the page removed from d->tot_pages is freed, but
>>>> that's not the case.  The domain might end up in a situation where
>>>> the claim is bigger than the available amount of memory.
>>> ... is a case that likely wasn't considered when the feature was added.
>>>
>>> I never really liked this; I'd be quite happy to see it ripped out, as
>>> long as we'd be reasonably certain it isn't in active use by people.
>> What do you mean with 'it' in the above sentence, the whole claim
>> stuff?
> Yes.
>
>>  Or just getting rid of allowing the claim to increase as a
>> result of pages being removed from a domain?
> No.

Alejandro and I discussed this earlier in the week.

The claim infrastructure stuff is critical for a toolstack capable of
doing things in parallel.

However, it is also nonsensical for there to be a remaining claim by the
time domain construction is done.

If creation_finished were a concrete thing, rather than a bodge hacked
into domain_unpause_by_systemcontroller(), it ought to be made to fail
if there were an outstanding claim.  I suggested that we follow through
on a previous suggestion of making it a real hypercall (which is needed
by the encrypted VM crowd too).

~Andrew
Jan Beulich Feb. 26, 2025, 4:42 p.m. UTC | #12
On 26.02.2025 17:34, Andrew Cooper wrote:
> On 26/02/2025 4:06 pm, Jan Beulich wrote:
>> On 26.02.2025 17:04, Roger Pau Monné wrote:
>>> On Wed, Feb 26, 2025 at 03:36:33PM +0100, Jan Beulich wrote:
>>>> On 26.02.2025 15:28, Roger Pau Monné wrote:
>>>>> On Wed, Feb 26, 2025 at 03:08:33PM +0100, Jan Beulich wrote:
>>>>>> On 26.02.2025 14:56, Roger Pau Monné wrote:
>>>>>>> On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote:
>>>>>>>> --- a/xen/common/page_alloc.c
>>>>>>>> +++ b/xen/common/page_alloc.c
>>>>>>>> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */
>>>>>>>>  
>>>>>>>>  unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
>>>>>>>>  {
>>>>>>>> -    long dom_before, dom_after, dom_claimed, sys_before, sys_after;
>>>>>>>> -
>>>>>>>>      ASSERT(rspin_is_locked(&d->page_alloc_lock));
>>>>>>>>      d->tot_pages += pages;
>>>>>>>>  
>>>>>>>>      /*
>>>>>>>> -     * can test d->claimed_pages race-free because it can only change
>>>>>>>> +     * can test d->outstanding_pages race-free because it can only change
>>>>>>>>       * if d->page_alloc_lock and heap_lock are both held, see also
>>>>>>>>       * domain_set_outstanding_pages below
>>>>>>>>       */
>>>>>>>> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
>>>>>>>>          goto out;
>>>>>>> I think you can probably short-circuit the logic below if pages == 0?
>>>>>>> (and avoid taking the heap_lock)
>>>>>> Are there callers passing in 0?
>>>>> Not sure, but if there are no callers expected we might add an ASSERT
>>>>> to that effect then.
>>>>>
>>>>>>>>      spin_lock(&heap_lock);
>>>>>>>> -    /* adjust domain outstanding pages; may not go negative */
>>>>>>>> -    dom_before = d->outstanding_pages;
>>>>>>>> -    dom_after = dom_before - pages;
>>>>>>>> -    BUG_ON(dom_before < 0);
>>>>>>>> -    dom_claimed = dom_after < 0 ? 0 : dom_after;
>>>>>>>> -    d->outstanding_pages = dom_claimed;
>>>>>>>> -    /* flag accounting bug if system outstanding_claims would go negative */
>>>>>>>> -    sys_before = outstanding_claims;
>>>>>>>> -    sys_after = sys_before - (dom_before - dom_claimed);
>>>>>>>> -    BUG_ON(sys_after < 0);
>>>>>>>> -    outstanding_claims = sys_after;
>>>>>>>> +    BUG_ON(outstanding_claims < d->outstanding_pages);
>>>>>>>> +    if ( pages > 0 && d->outstanding_pages < pages )
>>>>>>>> +    {
>>>>>>>> +        /* `pages` exceeds the domain's outstanding count. Zero it out. */
>>>>>>>> +        outstanding_claims -= d->outstanding_pages;
>>>>>>>> +        d->outstanding_pages = 0;
>>>>>>>> +    } else {
>>>>>>>> +        outstanding_claims -= pages;
>>>>>>>> +        d->outstanding_pages -= pages;
>>>>>>> I wonder if it's intentional for a pages < 0 value to modify
>>>>>>> outstanding_claims and d->outstanding_pages, I think those values
>>>>>>> should only be set from domain_set_outstanding_pages().
>>>>>>> domain_adjust_tot_pages() should only decrease the value, but never
>>>>>>> increase either outstanding_claims or d->outstanding_pages.
>>>>>>>
>>>>>>> At best the behavior is inconsistent, because once
>>>>>>> d->outstanding_pages reaches 0 there will be no further modification
>>>>>>> from domain_adjust_tot_pages().
>>>>>> Right, at that point the claim has run out. While freeing pages with an
>>>>>> active claim means that the claim gets bigger (which naturally needs
>>>>>> reflecting in the global).
>>>>> domain_adjust_tot_pages() is not exclusively called when freeing
>>>>> pages, see steal_page() for example.
>>>> Or also when pages were allocated. steal_page() ...
>>>>
>>>>> When called from steal_page() it's wrong to increase the claim, as
>>>>> it assumes that the page removed from d->tot_pages is freed, but
>>>>> that's not the case.  The domain might end up in a situation where
>>>>> the claim is bigger than the available amount of memory.
>>>> ... is a case that likely wasn't considered when the feature was added.
>>>>
>>>> I never really liked this; I'd be quite happy to see it ripped out, as
>>>> long as we'd be reasonably certain it isn't in active use by people.
>>> What do you mean with 'it' in the above sentence, the whole claim
>>> stuff?
>> Yes.
>>
>>>  Or just getting rid of allowing the claim to increase as a
>>> result of pages being removed from a domain?
>> No.
> 
> Alejandro and I discussed this earlier in the week.
> 
> The claim infrastructure stuff is critical for a toolstack capable of
> doing things in parallel.
> 
> However, it is also nonsensical for there to be a remaining claim by the
> time domain construction is done.

I'm not entirely sure about this. Iirc it was the tmem work where this was
added, and then pretty certainly it was needed also for already running
domains.

> If creation_finished were a concrete thing, rather than a bodge hacked
> into domain_unpause_by_systemcontroller(), it ought to be made to fail
> if there were an outstanding claim.  I suggested that we follow through
> on a previous suggestion of making it a real hypercall (which is needed
> by the encrypted VM crowd too).

Rather than failing we could simply zap the leftover?

Jan
Andrew Cooper Feb. 26, 2025, 4:51 p.m. UTC | #13
On 26/02/2025 4:42 pm, Jan Beulich wrote:
> On 26.02.2025 17:34, Andrew Cooper wrote:
>> On 26/02/2025 4:06 pm, Jan Beulich wrote:
>>> On 26.02.2025 17:04, Roger Pau Monné wrote:
>>>> On Wed, Feb 26, 2025 at 03:36:33PM +0100, Jan Beulich wrote:
>>>>> On 26.02.2025 15:28, Roger Pau Monné wrote:
>>>>>> On Wed, Feb 26, 2025 at 03:08:33PM +0100, Jan Beulich wrote:
>>>>>>> On 26.02.2025 14:56, Roger Pau Monné wrote:
>>>>>>>> On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote:
>>>>>>>>> --- a/xen/common/page_alloc.c
>>>>>>>>> +++ b/xen/common/page_alloc.c
>>>>>>>>> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */
>>>>>>>>>  
>>>>>>>>>  unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
>>>>>>>>>  {
>>>>>>>>> -    long dom_before, dom_after, dom_claimed, sys_before, sys_after;
>>>>>>>>> -
>>>>>>>>>      ASSERT(rspin_is_locked(&d->page_alloc_lock));
>>>>>>>>>      d->tot_pages += pages;
>>>>>>>>>  
>>>>>>>>>      /*
>>>>>>>>> -     * can test d->claimed_pages race-free because it can only change
>>>>>>>>> +     * can test d->outstanding_pages race-free because it can only change
>>>>>>>>>       * if d->page_alloc_lock and heap_lock are both held, see also
>>>>>>>>>       * domain_set_outstanding_pages below
>>>>>>>>>       */
>>>>>>>>> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
>>>>>>>>>          goto out;
>>>>>>>> I think you can probably short-circuit the logic below if pages == 0?
>>>>>>>> (and avoid taking the heap_lock)
>>>>>>> Are there callers passing in 0?
>>>>>> Not sure, but if there are no callers expected we might add an ASSERT
>>>>>> to that effect then.
>>>>>>
>>>>>>>>>      spin_lock(&heap_lock);
>>>>>>>>> -    /* adjust domain outstanding pages; may not go negative */
>>>>>>>>> -    dom_before = d->outstanding_pages;
>>>>>>>>> -    dom_after = dom_before - pages;
>>>>>>>>> -    BUG_ON(dom_before < 0);
>>>>>>>>> -    dom_claimed = dom_after < 0 ? 0 : dom_after;
>>>>>>>>> -    d->outstanding_pages = dom_claimed;
>>>>>>>>> -    /* flag accounting bug if system outstanding_claims would go negative */
>>>>>>>>> -    sys_before = outstanding_claims;
>>>>>>>>> -    sys_after = sys_before - (dom_before - dom_claimed);
>>>>>>>>> -    BUG_ON(sys_after < 0);
>>>>>>>>> -    outstanding_claims = sys_after;
>>>>>>>>> +    BUG_ON(outstanding_claims < d->outstanding_pages);
>>>>>>>>> +    if ( pages > 0 && d->outstanding_pages < pages )
>>>>>>>>> +    {
>>>>>>>>> +        /* `pages` exceeds the domain's outstanding count. Zero it out. */
>>>>>>>>> +        outstanding_claims -= d->outstanding_pages;
>>>>>>>>> +        d->outstanding_pages = 0;
>>>>>>>>> +    } else {
>>>>>>>>> +        outstanding_claims -= pages;
>>>>>>>>> +        d->outstanding_pages -= pages;
>>>>>>>> I wonder if it's intentional for a pages < 0 value to modify
>>>>>>>> outstanding_claims and d->outstanding_pages, I think those values
>>>>>>>> should only be set from domain_set_outstanding_pages().
>>>>>>>> domain_adjust_tot_pages() should only decrease the value, but never
>>>>>>>> increase either outstanding_claims or d->outstanding_pages.
>>>>>>>>
>>>>>>>> At best the behavior is inconsistent, because once
>>>>>>>> d->outstanding_pages reaches 0 there will be no further modification
>>>>>>>> from domain_adjust_tot_pages().
>>>>>>> Right, at that point the claim has run out. While freeing pages with an
>>>>>>> active claim means that the claim gets bigger (which naturally needs
>>>>>>> reflecting in the global).
>>>>>> domain_adjust_tot_pages() is not exclusively called when freeing
>>>>>> pages, see steal_page() for example.
>>>>> Or also when pages were allocated. steal_page() ...
>>>>>
>>>>>> When called from steal_page() it's wrong to increase the claim, as
>>>>>> it assumes that the page removed from d->tot_pages is freed, but
>>>>>> that's not the case.  The domain might end up in a situation where
>>>>>> the claim is bigger than the available amount of memory.
>>>>> ... is a case that likely wasn't considered when the feature was added.
>>>>>
>>>>> I never really liked this; I'd be quite happy to see it ripped out, as
>>>>> long as we'd be reasonably certain it isn't in active use by people.
>>>> What do you mean with 'it' in the above sentence, the whole claim
>>>> stuff?
>>> Yes.
>>>
>>>>  Or just getting rid of allowing the claim to increase as a
>>>> result of pages being removed from a domain?
>>> No.
>> Alejandro and I discussed this earlier in the week.
>>
>> The claim infrastructure stuff is critical for a toolstack capable of
>> doing things in parallel.
>>
>> However, it is also nonsensical for there to be a remaining claim by the
>> time domain construction is done.
> I'm not entirely sure about this. Iirc it was the tmem work where this was
> added, and then pretty certainly it was needed also for already running
> domains.

It wasn't TMEM.  It was generally large-memory VMs.

The problem is if you've got 2x 2T VMs booting on a 3T system.

Previously, you'd start building both of them, and minutes later they
both fail because Xen was fully out of memory.

Claim was introduced to atomically reserve the memory you were intending
to build a domain with.

For XenServer, we're working on NUMA fixes, and something that's
important there is to be able to reserve memory on a specific NUMA node
(hence why this is all getting looked at).
>> If creation_finished were a concrete thing, rather than a bodge hacked
>> into domain_unpause_by_systemcontroller(), it ought to be made to fail
>> if there were an outstanding claim.  I suggested that we follow through
>> on a previous suggestion of making it a real hypercall (which is needed
>> by the encrypted VM crowd too).
> Rather than failing we could simply zap the leftover?

Hmm.  Perhaps.

I'd be slightly wary about zapping a claim, but there should only be an
outstanding claim if the toolstack did something wrong.

OTOH, we absolutely definitely do need a real hypercall here at some
point soon.

~Andrew
Alejandro Vallejo Feb. 27, 2025, 2:36 p.m. UTC | #14
On Wed Feb 26, 2025 at 2:05 PM GMT, Jan Beulich wrote:
> On 24.02.2025 15:49, Alejandro Vallejo wrote:
> > Open question to whoever reviews this...
> > 
> > On Mon Feb 24, 2025 at 1:27 PM GMT, Alejandro Vallejo wrote:
> >>      spin_lock(&heap_lock);
> >> -    /* adjust domain outstanding pages; may not go negative */
> >> -    dom_before = d->outstanding_pages;
> >> -    dom_after = dom_before - pages;
> >> -    BUG_ON(dom_before < 0);
> >> -    dom_claimed = dom_after < 0 ? 0 : dom_after;
> >> -    d->outstanding_pages = dom_claimed;
> >> -    /* flag accounting bug if system outstanding_claims would go negative */
> >> -    sys_before = outstanding_claims;
> >> -    sys_after = sys_before - (dom_before - dom_claimed);
> >> -    BUG_ON(sys_after < 0);
> >> -    outstanding_claims = sys_after;
> >> +    BUG_ON(outstanding_claims < d->outstanding_pages);
> >> +    if ( pages > 0 && d->outstanding_pages < pages )
> >> +    {
> >> +        /* `pages` exceeds the domain's outstanding count. Zero it out. */
> >> +        outstanding_claims -= d->outstanding_pages;
> >> +        d->outstanding_pages = 0;
> > 
> > While this matches the previous behaviour, do we _really_ want it? It's weird,
> > quirky, and it hard to extend to NUMA-aware claims (which is something in
> > midway through).
> > 
> > Wouldn't it make sense to fail the allocation (earlier) if the claim has run
> > out? Do we even expect this to ever happen this late in the allocation call
> > chain?
>
> This goes back to what a "claim" means. Even without any claim, a domain may
> allocate memory. So a claim having run out doesn't imply allocation has to
> fail.

Hmmm... but that violates the purpose of the claim infra as far as I understand
it. If a domain may overallocate by (e.g) ballooning in memory it can distort the
ability of another domain to start up, even if it succeeded in its own claim.

We might also break the invariant that total claims are strictly >=
total_avail_pages.

I'm somewhat puzzled at the "why" of having separate concepts for max_mem and
claims. I guess it simply grew the way it did. Reinstating sanity would
probably involve making max_mem effectively the claim, but that's a ton of
work I really would rather not do for now.

>
> NUMA-aware claims require more than an adjustment just here, I expect. Tracking
> of claims (certainly the global, maybe also the per-domain value) would likely
> need to become per-node, for example.

A fair amount more, yes. I'm preparing a series on the side to address per-node
claims and it's far more invasive on page_alloc.c. This function was just
sufficiently impossible to read that I felt the urge to send it ahead of time
for my own mental health.

>
> Jan

Cheers,
Alejandro
Alejandro Vallejo Feb. 27, 2025, 2:39 p.m. UTC | #15
On Wed Feb 26, 2025 at 4:51 PM GMT, Andrew Cooper wrote:
> On 26/02/2025 4:42 pm, Jan Beulich wrote:
> > On 26.02.2025 17:34, Andrew Cooper wrote:
> >> On 26/02/2025 4:06 pm, Jan Beulich wrote:
> >>> On 26.02.2025 17:04, Roger Pau Monné wrote:
> >>>> On Wed, Feb 26, 2025 at 03:36:33PM +0100, Jan Beulich wrote:
> >>>>> On 26.02.2025 15:28, Roger Pau Monné wrote:
> >>>>>> On Wed, Feb 26, 2025 at 03:08:33PM +0100, Jan Beulich wrote:
> >>>>>>> On 26.02.2025 14:56, Roger Pau Monné wrote:
> >>>>>>>> On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote:
> >>>>>>>>> --- a/xen/common/page_alloc.c
> >>>>>>>>> +++ b/xen/common/page_alloc.c
> >>>>>>>>> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */
> >>>>>>>>>  
> >>>>>>>>>  unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
> >>>>>>>>>  {
> >>>>>>>>> -    long dom_before, dom_after, dom_claimed, sys_before, sys_after;
> >>>>>>>>> -
> >>>>>>>>>      ASSERT(rspin_is_locked(&d->page_alloc_lock));
> >>>>>>>>>      d->tot_pages += pages;
> >>>>>>>>>  
> >>>>>>>>>      /*
> >>>>>>>>> -     * can test d->claimed_pages race-free because it can only change
> >>>>>>>>> +     * can test d->outstanding_pages race-free because it can only change
> >>>>>>>>>       * if d->page_alloc_lock and heap_lock are both held, see also
> >>>>>>>>>       * domain_set_outstanding_pages below
> >>>>>>>>>       */
> >>>>>>>>> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
> >>>>>>>>>          goto out;
> >>>>>>>> I think you can probably short-circuit the logic below if pages == 0?
> >>>>>>>> (and avoid taking the heap_lock)
> >>>>>>> Are there callers passing in 0?
> >>>>>> Not sure, but if there are no callers expected we might add an ASSERT
> >>>>>> to that effect then.
> >>>>>>
> >>>>>>>>>      spin_lock(&heap_lock);
> >>>>>>>>> -    /* adjust domain outstanding pages; may not go negative */
> >>>>>>>>> -    dom_before = d->outstanding_pages;
> >>>>>>>>> -    dom_after = dom_before - pages;
> >>>>>>>>> -    BUG_ON(dom_before < 0);
> >>>>>>>>> -    dom_claimed = dom_after < 0 ? 0 : dom_after;
> >>>>>>>>> -    d->outstanding_pages = dom_claimed;
> >>>>>>>>> -    /* flag accounting bug if system outstanding_claims would go negative */
> >>>>>>>>> -    sys_before = outstanding_claims;
> >>>>>>>>> -    sys_after = sys_before - (dom_before - dom_claimed);
> >>>>>>>>> -    BUG_ON(sys_after < 0);
> >>>>>>>>> -    outstanding_claims = sys_after;
> >>>>>>>>> +    BUG_ON(outstanding_claims < d->outstanding_pages);
> >>>>>>>>> +    if ( pages > 0 && d->outstanding_pages < pages )
> >>>>>>>>> +    {
> >>>>>>>>> +        /* `pages` exceeds the domain's outstanding count. Zero it out. */
> >>>>>>>>> +        outstanding_claims -= d->outstanding_pages;
> >>>>>>>>> +        d->outstanding_pages = 0;
> >>>>>>>>> +    } else {
> >>>>>>>>> +        outstanding_claims -= pages;
> >>>>>>>>> +        d->outstanding_pages -= pages;
> >>>>>>>> I wonder if it's intentional for a pages < 0 value to modify
> >>>>>>>> outstanding_claims and d->outstanding_pages, I think those values
> >>>>>>>> should only be set from domain_set_outstanding_pages().
> >>>>>>>> domain_adjust_tot_pages() should only decrease the value, but never
> >>>>>>>> increase either outstanding_claims or d->outstanding_pages.
> >>>>>>>>
> >>>>>>>> At best the behavior is inconsistent, because once
> >>>>>>>> d->outstanding_pages reaches 0 there will be no further modification
> >>>>>>>> from domain_adjust_tot_pages().
> >>>>>>> Right, at that point the claim has run out. While freeing pages with an
> >>>>>>> active claim means that the claim gets bigger (which naturally needs
> >>>>>>> reflecting in the global).
> >>>>>> domain_adjust_tot_pages() is not exclusively called when freeing
> >>>>>> pages, see steal_page() for example.
> >>>>> Or also when pages were allocated. steal_page() ...
> >>>>>
> >>>>>> When called from steal_page() it's wrong to increase the claim, as
> >>>>>> it assumes that the page removed from d->tot_pages is freed, but
> >>>>>> that's not the case.  The domain might end up in a situation where
> >>>>>> the claim is bigger than the available amount of memory.
> >>>>> ... is a case that likely wasn't considered when the feature was added.
> >>>>>
> >>>>> I never really liked this; I'd be quite happy to see it ripped out, as
> >>>>> long as we'd be reasonably certain it isn't in active use by people.
> >>>> What do you mean with 'it' in the above sentence, the whole claim
> >>>> stuff?
> >>> Yes.
> >>>
> >>>>  Or just getting rid of allowing the claim to increase as a
> >>>> result of pages being removed from a domain?
> >>> No.
> >> Alejandro and I discussed this earlier in the week.
> >>
> >> The claim infrastructure stuff is critical for a toolstack capable of
> >> doing things in parallel.
> >>
> >> However, it is also nonsensical for there to be a remaining claim by the
> >> time domain construction is done.
> > I'm not entirely sure about this. Iirc it was the tmem work where this was
> > added, and then pretty certainly it was needed also for already running
> > domains.
>
> It wasn't TMEM.  It was generally large-memory VMs.
>
> The problem is if you've got 2x 2T VMs booting on a 3T system.
>
> Previously, you'd start building both of them, and minutes later they
> both fail because Xen was fully out of memory.
>
> Claim was introduced to atomically reserve the memory you were intending
> to build a domain with.
>
> For XenServer, we're working on NUMA fixes, and something that's
> important there is to be able to reserve memory on a specific NUMA node
> (hence why this is all getting looked at).
> >> If creation_finished were a concrete thing, rather than a bodge hacked
> >> into domain_unpause_by_systemcontroller(), it ought to be made to fail
> >> if there were an outstanding claim.  I suggested that we follow through
> >> on a previous suggestion of making it a real hypercall (which is needed
> >> by the encrypted VM crowd too).
> > Rather than failing we could simply zap the leftover?
>
> Hmm.  Perhaps.
>
> I'd be slightly wary about zapping a claim, but there should only be an
> outstanding claim if the toolstack did something wrong.
>
> OTOH, we absolutely definitely do need a real hypercall here at some
> point soon.
>
> ~Andrew

It should probably be removed at the end. Otherwise we're effectively leaking
all claimed but non-used memory for the lifetime of the domain.

Cheers,
Alejandro
Alejandro Vallejo Feb. 27, 2025, 2:50 p.m. UTC | #16
On Wed Feb 26, 2025 at 2:28 PM GMT, Roger Pau Monné wrote:
> On Wed, Feb 26, 2025 at 03:08:33PM +0100, Jan Beulich wrote:
> > On 26.02.2025 14:56, Roger Pau Monné wrote:
> > > On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote:
> > >> --- a/xen/common/page_alloc.c
> > >> +++ b/xen/common/page_alloc.c
> > >> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */
> > >>  
> > >>  unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
> > >>  {
> > >> -    long dom_before, dom_after, dom_claimed, sys_before, sys_after;
> > >> -
> > >>      ASSERT(rspin_is_locked(&d->page_alloc_lock));
> > >>      d->tot_pages += pages;
> > >>  
> > >>      /*
> > >> -     * can test d->claimed_pages race-free because it can only change
> > >> +     * can test d->outstanding_pages race-free because it can only change
> > >>       * if d->page_alloc_lock and heap_lock are both held, see also
> > >>       * domain_set_outstanding_pages below
> > >>       */
> > >> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
> > >>          goto out;
> > > 
> > > I think you can probably short-circuit the logic below if pages == 0?
> > > (and avoid taking the heap_lock)
> > 
> > Are there callers passing in 0?
>
> Not sure, but if there are no callers expected we might add an ASSERT
> to that effect then.
>
> > >>      spin_lock(&heap_lock);
> > >> -    /* adjust domain outstanding pages; may not go negative */
> > >> -    dom_before = d->outstanding_pages;
> > >> -    dom_after = dom_before - pages;
> > >> -    BUG_ON(dom_before < 0);
> > >> -    dom_claimed = dom_after < 0 ? 0 : dom_after;
> > >> -    d->outstanding_pages = dom_claimed;
> > >> -    /* flag accounting bug if system outstanding_claims would go negative */
> > >> -    sys_before = outstanding_claims;
> > >> -    sys_after = sys_before - (dom_before - dom_claimed);
> > >> -    BUG_ON(sys_after < 0);
> > >> -    outstanding_claims = sys_after;
> > >> +    BUG_ON(outstanding_claims < d->outstanding_pages);
> > >> +    if ( pages > 0 && d->outstanding_pages < pages )
> > >> +    {
> > >> +        /* `pages` exceeds the domain's outstanding count. Zero it out. */
> > >> +        outstanding_claims -= d->outstanding_pages;
> > >> +        d->outstanding_pages = 0;
> > >> +    } else {
> > >> +        outstanding_claims -= pages;
> > >> +        d->outstanding_pages -= pages;
> > > 
> > > I wonder if it's intentional for a pages < 0 value to modify
> > > outstanding_claims and d->outstanding_pages, I think those values
> > > should only be set from domain_set_outstanding_pages().
> > > domain_adjust_tot_pages() should only decrease the value, but never
> > > increase either outstanding_claims or d->outstanding_pages.
> > > 
> > > At best the behavior is inconsistent, because once
> > > d->outstanding_pages reaches 0 there will be no further modification
> > > from domain_adjust_tot_pages().
> > 
> > Right, at that point the claim has run out. While freeing pages with an
> > active claim means that the claim gets bigger (which naturally needs
> > reflecting in the global).
>
> domain_adjust_tot_pages() is not exclusively called when freeing
> pages, see steal_page() for example.
>
> When called from steal_page() it's wrong to increase the claim, as
> it assumes that the page removed from d->tot_pages is freed, but
> that's not the case.  The domain might end up in a situation where
> the claim is bigger than the available amount of memory.
>
> Thanks, Roger.

This is what I meant by my initial reply questioning the logic itself.

It's all very dubious with memory_exchange and makes very little sense on the
tentative code I have for per-node claims.

I'd be quite happy to put an early exit before the spin_lock on pages <= 0.
That also covers your initial comment and prevents claims from growing after a
domain started running if it didn't happen to consume all of them.

Is anyone opposed?

Cheers,
Alejandro
Alejandro Vallejo Feb. 27, 2025, 2:59 p.m. UTC | #17
On Wed Feb 26, 2025 at 2:02 PM GMT, Jan Beulich wrote:
> On 24.02.2025 14:27, Alejandro Vallejo wrote:
> > @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
> >          goto out;
> >  
> >      spin_lock(&heap_lock);
> > -    /* adjust domain outstanding pages; may not go negative */
> > -    dom_before = d->outstanding_pages;
> > -    dom_after = dom_before - pages;
> > -    BUG_ON(dom_before < 0);
> > -    dom_claimed = dom_after < 0 ? 0 : dom_after;
> > -    d->outstanding_pages = dom_claimed;
> > -    /* flag accounting bug if system outstanding_claims would go negative */
> > -    sys_before = outstanding_claims;
> > -    sys_after = sys_before - (dom_before - dom_claimed);
> > -    BUG_ON(sys_after < 0);
> > -    outstanding_claims = sys_after;
> > +    BUG_ON(outstanding_claims < d->outstanding_pages);
> > +    if ( pages > 0 && d->outstanding_pages < pages )
>
> The lhs isn't needed, is it? d->outstanding_pages is an unsigned quantity,
> after all. Else dropping the earlier of the two BUG_ON() wouldn't be quite
> right.

d->outstanding pages is unsigned, but pages isn't.

It was originally like that, but I then got concerned about 32bit machines,
where you'd be comparing a signed and an unsigned integer of the same
not-very-large width. That seems like dangerous terrains if the unsigned number
grows large enough.

TL;DR: It's there for clarity and paranoia. Even if the overflowing into bit 31
would be rare in such a system.

>
> > +    {
> > +        /* `pages` exceeds the domain's outstanding count. Zero it out. */
> > +        outstanding_claims -= d->outstanding_pages;
> > +        d->outstanding_pages = 0;
> > +    } else {
>
> Nit: Braces on their own lines please.

Ack.

>
> In any event - yes, this reads quite a bit easier after the adjustment.
>
> With the adjustments (happy to make while committing, so long as you agree)
> Reviewed-by: Jan Beulich <jbeulich@suse.com>

Thanks. I'd probably like to hold off and send a v2 if you're fine with the
adjustment I answered Roger with (returning ealy on pages <= 0, so claims are
never increased on free).

>
> Jan
>

Cheers,
Alejandro
Jan Beulich March 5, 2025, 10:49 a.m. UTC | #18
On 27.02.2025 15:36, Alejandro Vallejo wrote:
> On Wed Feb 26, 2025 at 2:05 PM GMT, Jan Beulich wrote:
>> On 24.02.2025 15:49, Alejandro Vallejo wrote:
>>> Open question to whoever reviews this...
>>>
>>> On Mon Feb 24, 2025 at 1:27 PM GMT, Alejandro Vallejo wrote:
>>>>      spin_lock(&heap_lock);
>>>> -    /* adjust domain outstanding pages; may not go negative */
>>>> -    dom_before = d->outstanding_pages;
>>>> -    dom_after = dom_before - pages;
>>>> -    BUG_ON(dom_before < 0);
>>>> -    dom_claimed = dom_after < 0 ? 0 : dom_after;
>>>> -    d->outstanding_pages = dom_claimed;
>>>> -    /* flag accounting bug if system outstanding_claims would go negative */
>>>> -    sys_before = outstanding_claims;
>>>> -    sys_after = sys_before - (dom_before - dom_claimed);
>>>> -    BUG_ON(sys_after < 0);
>>>> -    outstanding_claims = sys_after;
>>>> +    BUG_ON(outstanding_claims < d->outstanding_pages);
>>>> +    if ( pages > 0 && d->outstanding_pages < pages )
>>>> +    {
>>>> +        /* `pages` exceeds the domain's outstanding count. Zero it out. */
>>>> +        outstanding_claims -= d->outstanding_pages;
>>>> +        d->outstanding_pages = 0;
>>>
>>> While this matches the previous behaviour, do we _really_ want it? It's weird,
>>> quirky, and it hard to extend to NUMA-aware claims (which is something in
>>> midway through).
>>>
>>> Wouldn't it make sense to fail the allocation (earlier) if the claim has run
>>> out? Do we even expect this to ever happen this late in the allocation call
>>> chain?
>>
>> This goes back to what a "claim" means. Even without any claim, a domain may
>> allocate memory. So a claim having run out doesn't imply allocation has to
>> fail.
> 
> Hmmm... but that violates the purpose of the claim infra as far as I understand
> it. If a domain may overallocate by (e.g) ballooning in memory it can distort the
> ability of another domain to start up, even if it succeeded in its own claim.

Why would that be? As long as we hold back enough memory to cover the claim, it
shouldn't matter what kind of allocation we want to process. I'd say that a PV
guest starting ballooned ought to be able to deflate its balloon as far as
there was a claim established for it up front.

> We might also break the invariant that total claims are strictly >=
> total_avail_pages.

Same here - I don't see why this would happen as long as all accounting is
working correctly.

> I'm somewhat puzzled at the "why" of having separate concepts for max_mem and
> claims. I guess it simply grew the way it did. Reinstating sanity would
> probably involve making max_mem effectively the claim, but that's a ton of
> work I really would rather not do for now.

To me the two are different (beyond claim being global while max-mem is per-
domain). max-mem is a hard boundary (beyond which allocations _will_ fail),
whereas a claim is a softer one, beyond which allocations _may_ fail.

Jan
Jan Beulich March 5, 2025, 10:50 a.m. UTC | #19
On 27.02.2025 15:50, Alejandro Vallejo wrote:
> On Wed Feb 26, 2025 at 2:28 PM GMT, Roger Pau Monné wrote:
>> On Wed, Feb 26, 2025 at 03:08:33PM +0100, Jan Beulich wrote:
>>> On 26.02.2025 14:56, Roger Pau Monné wrote:
>>>> On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote:
>>>>> --- a/xen/common/page_alloc.c
>>>>> +++ b/xen/common/page_alloc.c
>>>>> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */
>>>>>  
>>>>>  unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
>>>>>  {
>>>>> -    long dom_before, dom_after, dom_claimed, sys_before, sys_after;
>>>>> -
>>>>>      ASSERT(rspin_is_locked(&d->page_alloc_lock));
>>>>>      d->tot_pages += pages;
>>>>>  
>>>>>      /*
>>>>> -     * can test d->claimed_pages race-free because it can only change
>>>>> +     * can test d->outstanding_pages race-free because it can only change
>>>>>       * if d->page_alloc_lock and heap_lock are both held, see also
>>>>>       * domain_set_outstanding_pages below
>>>>>       */
>>>>> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
>>>>>          goto out;
>>>>
>>>> I think you can probably short-circuit the logic below if pages == 0?
>>>> (and avoid taking the heap_lock)
>>>
>>> Are there callers passing in 0?
>>
>> Not sure, but if there are no callers expected we might add an ASSERT
>> to that effect then.
>>
>>>>>      spin_lock(&heap_lock);
>>>>> -    /* adjust domain outstanding pages; may not go negative */
>>>>> -    dom_before = d->outstanding_pages;
>>>>> -    dom_after = dom_before - pages;
>>>>> -    BUG_ON(dom_before < 0);
>>>>> -    dom_claimed = dom_after < 0 ? 0 : dom_after;
>>>>> -    d->outstanding_pages = dom_claimed;
>>>>> -    /* flag accounting bug if system outstanding_claims would go negative */
>>>>> -    sys_before = outstanding_claims;
>>>>> -    sys_after = sys_before - (dom_before - dom_claimed);
>>>>> -    BUG_ON(sys_after < 0);
>>>>> -    outstanding_claims = sys_after;
>>>>> +    BUG_ON(outstanding_claims < d->outstanding_pages);
>>>>> +    if ( pages > 0 && d->outstanding_pages < pages )
>>>>> +    {
>>>>> +        /* `pages` exceeds the domain's outstanding count. Zero it out. */
>>>>> +        outstanding_claims -= d->outstanding_pages;
>>>>> +        d->outstanding_pages = 0;
>>>>> +    } else {
>>>>> +        outstanding_claims -= pages;
>>>>> +        d->outstanding_pages -= pages;
>>>>
>>>> I wonder if it's intentional for a pages < 0 value to modify
>>>> outstanding_claims and d->outstanding_pages, I think those values
>>>> should only be set from domain_set_outstanding_pages().
>>>> domain_adjust_tot_pages() should only decrease the value, but never
>>>> increase either outstanding_claims or d->outstanding_pages.
>>>>
>>>> At best the behavior is inconsistent, because once
>>>> d->outstanding_pages reaches 0 there will be no further modification
>>>> from domain_adjust_tot_pages().
>>>
>>> Right, at that point the claim has run out. While freeing pages with an
>>> active claim means that the claim gets bigger (which naturally needs
>>> reflecting in the global).
>>
>> domain_adjust_tot_pages() is not exclusively called when freeing
>> pages, see steal_page() for example.
>>
>> When called from steal_page() it's wrong to increase the claim, as
>> it assumes that the page removed from d->tot_pages is freed, but
>> that's not the case.  The domain might end up in a situation where
>> the claim is bigger than the available amount of memory.
>>
>> Thanks, Roger.
> 
> This is what I meant by my initial reply questioning the logic itself.
> 
> It's all very dubious with memory_exchange and makes very little sense on the
> tentative code I have for per-node claims.
> 
> I'd be quite happy to put an early exit before the spin_lock on pages <= 0.
> That also covers your initial comment and prevents claims from growing after a
> domain started running if it didn't happen to consume all of them.
> 
> Is anyone opposed?

We first need to reach common understanding what a claim is (or is not). See
the other reply just sent.

Jan
Jan Beulich March 5, 2025, 10:52 a.m. UTC | #20
On 27.02.2025 15:59, Alejandro Vallejo wrote:
> On Wed Feb 26, 2025 at 2:02 PM GMT, Jan Beulich wrote:
>> On 24.02.2025 14:27, Alejandro Vallejo wrote:
>>> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
>>>          goto out;
>>>  
>>>      spin_lock(&heap_lock);
>>> -    /* adjust domain outstanding pages; may not go negative */
>>> -    dom_before = d->outstanding_pages;
>>> -    dom_after = dom_before - pages;
>>> -    BUG_ON(dom_before < 0);
>>> -    dom_claimed = dom_after < 0 ? 0 : dom_after;
>>> -    d->outstanding_pages = dom_claimed;
>>> -    /* flag accounting bug if system outstanding_claims would go negative */
>>> -    sys_before = outstanding_claims;
>>> -    sys_after = sys_before - (dom_before - dom_claimed);
>>> -    BUG_ON(sys_after < 0);
>>> -    outstanding_claims = sys_after;
>>> +    BUG_ON(outstanding_claims < d->outstanding_pages);
>>> +    if ( pages > 0 && d->outstanding_pages < pages )
>>
>> The lhs isn't needed, is it? d->outstanding_pages is an unsigned quantity,
>> after all. Else dropping the earlier of the two BUG_ON() wouldn't be quite
>> right.
> 
> d->outstanding pages is unsigned, but pages isn't.
> 
> It was originally like that, but I then got concerned about 32bit machines,
> where you'd be comparing a signed and an unsigned integer of the same
> not-very-large width. That seems like dangerous terrains if the unsigned number
> grows large enough.

Oh, indeed - I didn't take ILP32 arches into account.

Jan
Alejandro Vallejo March 5, 2025, 1:22 p.m. UTC | #21
On Wed Mar 5, 2025 at 10:49 AM GMT, Jan Beulich wrote:
> On 27.02.2025 15:36, Alejandro Vallejo wrote:
> > On Wed Feb 26, 2025 at 2:05 PM GMT, Jan Beulich wrote:
> >> On 24.02.2025 15:49, Alejandro Vallejo wrote:
> >>> Open question to whoever reviews this...
> >>>
> >>> On Mon Feb 24, 2025 at 1:27 PM GMT, Alejandro Vallejo wrote:
> >>>>      spin_lock(&heap_lock);
> >>>> -    /* adjust domain outstanding pages; may not go negative */
> >>>> -    dom_before = d->outstanding_pages;
> >>>> -    dom_after = dom_before - pages;
> >>>> -    BUG_ON(dom_before < 0);
> >>>> -    dom_claimed = dom_after < 0 ? 0 : dom_after;
> >>>> -    d->outstanding_pages = dom_claimed;
> >>>> -    /* flag accounting bug if system outstanding_claims would go negative */
> >>>> -    sys_before = outstanding_claims;
> >>>> -    sys_after = sys_before - (dom_before - dom_claimed);
> >>>> -    BUG_ON(sys_after < 0);
> >>>> -    outstanding_claims = sys_after;
> >>>> +    BUG_ON(outstanding_claims < d->outstanding_pages);
> >>>> +    if ( pages > 0 && d->outstanding_pages < pages )
> >>>> +    {
> >>>> +        /* `pages` exceeds the domain's outstanding count. Zero it out. */
> >>>> +        outstanding_claims -= d->outstanding_pages;
> >>>> +        d->outstanding_pages = 0;
> >>>
> >>> While this matches the previous behaviour, do we _really_ want it? It's weird,
> >>> quirky, and it hard to extend to NUMA-aware claims (which is something in
> >>> midway through).
> >>>
> >>> Wouldn't it make sense to fail the allocation (earlier) if the claim has run
> >>> out? Do we even expect this to ever happen this late in the allocation call
> >>> chain?
> >>
> >> This goes back to what a "claim" means. Even without any claim, a domain may
> >> allocate memory. So a claim having run out doesn't imply allocation has to
> >> fail.
> > 
> > Hmmm... but that violates the purpose of the claim infra as far as I understand
> > it. If a domain may overallocate by (e.g) ballooning in memory it can distort the
> > ability of another domain to start up, even if it succeeded in its own claim.
>
> Why would that be? As long as we hold back enough memory to cover the claim, it
> shouldn't matter what kind of allocation we want to process. I'd say that a PV
> guest starting ballooned ought to be able to deflate its balloon as far as
> there was a claim established for it up front.

The fact a domain allocated something does not mean it had it claimed before. A
claim is a restriction to others' ability to allocate/claim, not to the domain
that made the claim. e.g:

  0. host is idle. No domU.
  1. dom1 is created with a claim to 10% of host memory and max_mem of 80% of
     host meomory.
  2. dom1 balloons in 70% of host memory.
  3. dom1 ballons out 70% of host memory.
  4. dom1 now holds a a claim to 80% of host memory.

It's all quite perverse. Fortunately, looking at adjacent claims-related code
xl seems to default to making a claim prior to populating the physmap and
cancelling the claim at the end of the meminit() hook so this is never a real
problem.

This tells me that the logic intent is to force early failure of
populate_physmap and nothing else. It's never active by the time ballooning or
memory exchange matter at all.

Xen ought to cancel the claim by itself though, toolstack should not NEED to do
it.

And furthermore, the fact that the claim goes up and down interacts really
poorly with the per-node claims I want to implement, because it's just
impossible to know if a freed page actually comes from a prior decrease of a
claim or not.

>
> > We might also break the invariant that total claims are strictly >=
> > total_avail_pages.
>
> Same here - I don't see why this would happen as long as all accounting is
> working correctly.

See previous example.

>
> > I'm somewhat puzzled at the "why" of having separate concepts for max_mem and
> > claims. I guess it simply grew the way it did. Reinstating sanity would
> > probably involve making max_mem effectively the claim, but that's a ton of
> > work I really would rather not do for now.
>
> To me the two are different (beyond claim being global while max-mem is per-
> domain). max-mem is a hard boundary (beyond which allocations _will_ fail),
> whereas a claim is a softer one, beyond which allocations _may_ fail.
>
> Jan

What I meant is that I'd rather have a non-oversubscribed memory model by which
I can statically partition my memory and make 100% sure the sum of all max_mem
of every existing domain never exceeds host capacity. But this is neither the
intent of the existing claim infra nor what I'm after right now. I was just
thinking out loud in text form. :)

Cheers,
Alejandro
Jan Beulich March 5, 2025, 1:39 p.m. UTC | #22
On 05.03.2025 14:22, Alejandro Vallejo wrote:
> On Wed Mar 5, 2025 at 10:49 AM GMT, Jan Beulich wrote:
>> On 27.02.2025 15:36, Alejandro Vallejo wrote:
>>> On Wed Feb 26, 2025 at 2:05 PM GMT, Jan Beulich wrote:
>>>> On 24.02.2025 15:49, Alejandro Vallejo wrote:
>>>>> Open question to whoever reviews this...
>>>>>
>>>>> On Mon Feb 24, 2025 at 1:27 PM GMT, Alejandro Vallejo wrote:
>>>>>>      spin_lock(&heap_lock);
>>>>>> -    /* adjust domain outstanding pages; may not go negative */
>>>>>> -    dom_before = d->outstanding_pages;
>>>>>> -    dom_after = dom_before - pages;
>>>>>> -    BUG_ON(dom_before < 0);
>>>>>> -    dom_claimed = dom_after < 0 ? 0 : dom_after;
>>>>>> -    d->outstanding_pages = dom_claimed;
>>>>>> -    /* flag accounting bug if system outstanding_claims would go negative */
>>>>>> -    sys_before = outstanding_claims;
>>>>>> -    sys_after = sys_before - (dom_before - dom_claimed);
>>>>>> -    BUG_ON(sys_after < 0);
>>>>>> -    outstanding_claims = sys_after;
>>>>>> +    BUG_ON(outstanding_claims < d->outstanding_pages);
>>>>>> +    if ( pages > 0 && d->outstanding_pages < pages )
>>>>>> +    {
>>>>>> +        /* `pages` exceeds the domain's outstanding count. Zero it out. */
>>>>>> +        outstanding_claims -= d->outstanding_pages;
>>>>>> +        d->outstanding_pages = 0;
>>>>>
>>>>> While this matches the previous behaviour, do we _really_ want it? It's weird,
>>>>> quirky, and it hard to extend to NUMA-aware claims (which is something in
>>>>> midway through).
>>>>>
>>>>> Wouldn't it make sense to fail the allocation (earlier) if the claim has run
>>>>> out? Do we even expect this to ever happen this late in the allocation call
>>>>> chain?
>>>>
>>>> This goes back to what a "claim" means. Even without any claim, a domain may
>>>> allocate memory. So a claim having run out doesn't imply allocation has to
>>>> fail.
>>>
>>> Hmmm... but that violates the purpose of the claim infra as far as I understand
>>> it. If a domain may overallocate by (e.g) ballooning in memory it can distort the
>>> ability of another domain to start up, even if it succeeded in its own claim.
>>
>> Why would that be? As long as we hold back enough memory to cover the claim, it
>> shouldn't matter what kind of allocation we want to process. I'd say that a PV
>> guest starting ballooned ought to be able to deflate its balloon as far as
>> there was a claim established for it up front.
> 
> The fact a domain allocated something does not mean it had it claimed before. A
> claim is a restriction to others' ability to allocate/claim, not to the domain
> that made the claim.

Yes and no. No in so far as the target's "ability to allocate" may or may not
be meant to extend beyond domain creation.

> e.g:
> 
>   0. host is idle. No domU.
>   1. dom1 is created with a claim to 10% of host memory and max_mem of 80% of
>      host meomory.
>   2. dom1 balloons in 70% of host memory.
>   3. dom1 ballons out 70% of host memory.
>   4. dom1 now holds a a claim to 80% of host memory.

Sure, this shouldn't be the result. Yet that may merely be an effect of claim
accounting being insufficient.

> It's all quite perverse. Fortunately, looking at adjacent claims-related code
> xl seems to default to making a claim prior to populating the physmap and
> cancelling the claim at the end of the meminit() hook so this is never a real
> problem.
> 
> This tells me that the logic intent is to force early failure of
> populate_physmap and nothing else. It's never active by the time ballooning or
> memory exchange matter at all.

Ah yes, this I find more convincing. (Oddly enough this is all x86-only code.)

> Xen ought to cancel the claim by itself though, toolstack should not NEED to do
> it.

Fundamentally yes. Except that the toolstack can do it earlier than Xen could.

Jan
Alejandro Vallejo March 5, 2025, 2:55 p.m. UTC | #23
On Wed Mar 5, 2025 at 1:39 PM GMT, Jan Beulich wrote:
> On 05.03.2025 14:22, Alejandro Vallejo wrote:
> > On Wed Mar 5, 2025 at 10:49 AM GMT, Jan Beulich wrote:
> >> On 27.02.2025 15:36, Alejandro Vallejo wrote:
> >>> On Wed Feb 26, 2025 at 2:05 PM GMT, Jan Beulich wrote:
> >>>> On 24.02.2025 15:49, Alejandro Vallejo wrote:
> >>>>> Open question to whoever reviews this...
> >>>>>
> >>>>> On Mon Feb 24, 2025 at 1:27 PM GMT, Alejandro Vallejo wrote:
> >>>>>>      spin_lock(&heap_lock);
> >>>>>> -    /* adjust domain outstanding pages; may not go negative */
> >>>>>> -    dom_before = d->outstanding_pages;
> >>>>>> -    dom_after = dom_before - pages;
> >>>>>> -    BUG_ON(dom_before < 0);
> >>>>>> -    dom_claimed = dom_after < 0 ? 0 : dom_after;
> >>>>>> -    d->outstanding_pages = dom_claimed;
> >>>>>> -    /* flag accounting bug if system outstanding_claims would go negative */
> >>>>>> -    sys_before = outstanding_claims;
> >>>>>> -    sys_after = sys_before - (dom_before - dom_claimed);
> >>>>>> -    BUG_ON(sys_after < 0);
> >>>>>> -    outstanding_claims = sys_after;
> >>>>>> +    BUG_ON(outstanding_claims < d->outstanding_pages);
> >>>>>> +    if ( pages > 0 && d->outstanding_pages < pages )
> >>>>>> +    {
> >>>>>> +        /* `pages` exceeds the domain's outstanding count. Zero it out. */
> >>>>>> +        outstanding_claims -= d->outstanding_pages;
> >>>>>> +        d->outstanding_pages = 0;
> >>>>>
> >>>>> While this matches the previous behaviour, do we _really_ want it? It's weird,
> >>>>> quirky, and it hard to extend to NUMA-aware claims (which is something in
> >>>>> midway through).
> >>>>>
> >>>>> Wouldn't it make sense to fail the allocation (earlier) if the claim has run
> >>>>> out? Do we even expect this to ever happen this late in the allocation call
> >>>>> chain?
> >>>>
> >>>> This goes back to what a "claim" means. Even without any claim, a domain may
> >>>> allocate memory. So a claim having run out doesn't imply allocation has to
> >>>> fail.
> >>>
> >>> Hmmm... but that violates the purpose of the claim infra as far as I understand
> >>> it. If a domain may overallocate by (e.g) ballooning in memory it can distort the
> >>> ability of another domain to start up, even if it succeeded in its own claim.
> >>
> >> Why would that be? As long as we hold back enough memory to cover the claim, it
> >> shouldn't matter what kind of allocation we want to process. I'd say that a PV
> >> guest starting ballooned ought to be able to deflate its balloon as far as
> >> there was a claim established for it up front.
> > 
> > The fact a domain allocated something does not mean it had it claimed before. A
> > claim is a restriction to others' ability to allocate/claim, not to the domain
> > that made the claim.
>
> Yes and no. No in so far as the target's "ability to allocate" may or may not
> be meant to extend beyond domain creation.
>
> > e.g:
> > 
> >   0. host is idle. No domU.
> >   1. dom1 is created with a claim to 10% of host memory and max_mem of 80% of
> >      host meomory.
> >   2. dom1 balloons in 70% of host memory.
> >   3. dom1 ballons out 70% of host memory.
> >   4. dom1 now holds a a claim to 80% of host memory.
>
> Sure, this shouldn't be the result. Yet that may merely be an effect of claim
> accounting being insufficient.
>
> > It's all quite perverse. Fortunately, looking at adjacent claims-related code
> > xl seems to default to making a claim prior to populating the physmap and
> > cancelling the claim at the end of the meminit() hook so this is never a real
> > problem.
> > 
> > This tells me that the logic intent is to force early failure of
> > populate_physmap and nothing else. It's never active by the time ballooning or
> > memory exchange matter at all.
>
> Ah yes, this I find more convincing. (Oddly enough this is all x86-only code.)

(about claims being an x86-ism in toolstack). Yeah, that's weird. It's should
be moved in time to a common area of xg. I guess no other arch has cared about
bootstorms or massive VMs just yet.

>
> > Xen ought to cancel the claim by itself though, toolstack should not NEED to do
> > it.
>
> Fundamentally yes. Except that the toolstack can do it earlier than Xen could.
>
> Jan

Yes, in the sense that it can do it just after it's done populating the
physmap, but Xen should still zero it at the end in case toolstack hasn't done
so. Nothing good can come out of that counter going up and down in such a
dubious fashion.

Cheers,
Alejandro
Alejandro Vallejo March 11, 2025, 9:46 a.m. UTC | #24
On Wed Mar 5, 2025 at 1:39 PM GMT, Jan Beulich wrote:
> > It's all quite perverse. Fortunately, looking at adjacent claims-related code
> > xl seems to default to making a claim prior to populating the physmap and
> > cancelling the claim at the end of the meminit() hook so this is never a real
> > problem.
> > 
> > This tells me that the logic intent is to force early failure of
> > populate_physmap and nothing else. It's never active by the time ballooning or
> > memory exchange matter at all.
>
> Ah yes, this I find more convincing. (Oddly enough this is all x86-only code.)

Should I take this as an "ack" to the general plan of early returning on pages
<=0? I have a series pending that relies on it (the v2 of this[1]). And would
rather defer its sending until this one get some form of nod. Otherwise I'll
integrate it in the other series so I can at least reduce remove dependencies
between things in-flight.

Cheers,
Alejandro

[1] https://lore.kernel.org/xen-devel/20250304111000.9252-1-alejandro.vallejo@cloud.com/
Jan Beulich March 11, 2025, 10:06 a.m. UTC | #25
On 11.03.2025 10:46, Alejandro Vallejo wrote:
> On Wed Mar 5, 2025 at 1:39 PM GMT, Jan Beulich wrote:
>>> It's all quite perverse. Fortunately, looking at adjacent claims-related code
>>> xl seems to default to making a claim prior to populating the physmap and
>>> cancelling the claim at the end of the meminit() hook so this is never a real
>>> problem.
>>>
>>> This tells me that the logic intent is to force early failure of
>>> populate_physmap and nothing else. It's never active by the time ballooning or
>>> memory exchange matter at all.
>>
>> Ah yes, this I find more convincing. (Oddly enough this is all x86-only code.)
> 
> Should I take this as an "ack" to the general plan of early returning on pages
> <=0? I have a series pending that relies on it (the v2 of this[1]).

Not an ack, but an "I can accept this, but someone else will need to ack it".

Jan

> And would
> rather defer its sending until this one get some form of nod. Otherwise I'll
> integrate it in the other series so I can at least reduce remove dependencies
> between things in-flight.
> 
> Cheers,
> Alejandro
> 
> [1] https://lore.kernel.org/xen-devel/20250304111000.9252-1-alejandro.vallejo@cloud.com/
diff mbox series

Patch

diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 1bf070c8c5df..4306753f0bd2 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -490,13 +490,11 @@  static long outstanding_claims; /* total outstanding claims by all domains */
 
 unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
 {
-    long dom_before, dom_after, dom_claimed, sys_before, sys_after;
-
     ASSERT(rspin_is_locked(&d->page_alloc_lock));
     d->tot_pages += pages;
 
     /*
-     * can test d->claimed_pages race-free because it can only change
+     * can test d->outstanding_pages race-free because it can only change
      * if d->page_alloc_lock and heap_lock are both held, see also
      * domain_set_outstanding_pages below
      */
@@ -504,17 +502,16 @@  unsigned long domain_adjust_tot_pages(struct domain *d, long pages)
         goto out;
 
     spin_lock(&heap_lock);
-    /* adjust domain outstanding pages; may not go negative */
-    dom_before = d->outstanding_pages;
-    dom_after = dom_before - pages;
-    BUG_ON(dom_before < 0);
-    dom_claimed = dom_after < 0 ? 0 : dom_after;
-    d->outstanding_pages = dom_claimed;
-    /* flag accounting bug if system outstanding_claims would go negative */
-    sys_before = outstanding_claims;
-    sys_after = sys_before - (dom_before - dom_claimed);
-    BUG_ON(sys_after < 0);
-    outstanding_claims = sys_after;
+    BUG_ON(outstanding_claims < d->outstanding_pages);
+    if ( pages > 0 && d->outstanding_pages < pages )
+    {
+        /* `pages` exceeds the domain's outstanding count. Zero it out. */
+        outstanding_claims -= d->outstanding_pages;
+        d->outstanding_pages = 0;
+    } else {
+        outstanding_claims -= pages;
+        d->outstanding_pages -= pages;
+    }
     spin_unlock(&heap_lock);
 
 out: