Message ID | 20250224132724.9074-1-alejandro.vallejo@cloud.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | xen/page_alloc: Simplify domain_adjust_tot_pages | expand |
Open question to whoever reviews this... On Mon Feb 24, 2025 at 1:27 PM GMT, Alejandro Vallejo wrote: > spin_lock(&heap_lock); > - /* adjust domain outstanding pages; may not go negative */ > - dom_before = d->outstanding_pages; > - dom_after = dom_before - pages; > - BUG_ON(dom_before < 0); > - dom_claimed = dom_after < 0 ? 0 : dom_after; > - d->outstanding_pages = dom_claimed; > - /* flag accounting bug if system outstanding_claims would go negative */ > - sys_before = outstanding_claims; > - sys_after = sys_before - (dom_before - dom_claimed); > - BUG_ON(sys_after < 0); > - outstanding_claims = sys_after; > + BUG_ON(outstanding_claims < d->outstanding_pages); > + if ( pages > 0 && d->outstanding_pages < pages ) > + { > + /* `pages` exceeds the domain's outstanding count. Zero it out. */ > + outstanding_claims -= d->outstanding_pages; > + d->outstanding_pages = 0; While this matches the previous behaviour, do we _really_ want it? It's weird, quirky, and it hard to extend to NUMA-aware claims (which is something in midway through). Wouldn't it make sense to fail the allocation (earlier) if the claim has run out? Do we even expect this to ever happen this late in the allocation call chain? > + } else { > + outstanding_claims -= pages; > + d->outstanding_pages -= pages; > + } > spin_unlock(&heap_lock); > > out: Cheers, Alejandro
On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote: > The logic has too many levels of indirection and it's very hard to > understand it its current form. Split it between the corner case where > the adjustment is bigger than the current claim and the rest to avoid 5 > auxiliary variables. > > While at it, fix incorrect field name in nearby comment. > > Not a functional change (although by its nature it might not seem > immediately obvious at first). > > Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com> > --- > xen/common/page_alloc.c | 25 +++++++++++-------------- > 1 file changed, 11 insertions(+), 14 deletions(-) > > diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c > index 1bf070c8c5df..4306753f0bd2 100644 > --- a/xen/common/page_alloc.c > +++ b/xen/common/page_alloc.c > @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */ > > unsigned long domain_adjust_tot_pages(struct domain *d, long pages) > { > - long dom_before, dom_after, dom_claimed, sys_before, sys_after; > - > ASSERT(rspin_is_locked(&d->page_alloc_lock)); > d->tot_pages += pages; > > /* > - * can test d->claimed_pages race-free because it can only change > + * can test d->outstanding_pages race-free because it can only change > * if d->page_alloc_lock and heap_lock are both held, see also > * domain_set_outstanding_pages below > */ > @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages) > goto out; I think you can probably short-circuit the logic below if pages == 0? (and avoid taking the heap_lock) > > spin_lock(&heap_lock); > - /* adjust domain outstanding pages; may not go negative */ > - dom_before = d->outstanding_pages; > - dom_after = dom_before - pages; > - BUG_ON(dom_before < 0); > - dom_claimed = dom_after < 0 ? 0 : dom_after; > - d->outstanding_pages = dom_claimed; > - /* flag accounting bug if system outstanding_claims would go negative */ > - sys_before = outstanding_claims; > - sys_after = sys_before - (dom_before - dom_claimed); > - BUG_ON(sys_after < 0); > - outstanding_claims = sys_after; > + BUG_ON(outstanding_claims < d->outstanding_pages); > + if ( pages > 0 && d->outstanding_pages < pages ) > + { > + /* `pages` exceeds the domain's outstanding count. Zero it out. */ > + outstanding_claims -= d->outstanding_pages; > + d->outstanding_pages = 0; > + } else { > + outstanding_claims -= pages; > + d->outstanding_pages -= pages; I wonder if it's intentional for a pages < 0 value to modify outstanding_claims and d->outstanding_pages, I think those values should only be set from domain_set_outstanding_pages(). domain_adjust_tot_pages() should only decrease the value, but never increase either outstanding_claims or d->outstanding_pages. At best the behavior is inconsistent, because once d->outstanding_pages reaches 0 there will be no further modification from domain_adjust_tot_pages(). I think it would be best if outstanding_claims and d->outstanding_pages was only modified if pages > 0. Thanks, Roger.
On 24.02.2025 14:27, Alejandro Vallejo wrote: > @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages) > goto out; > > spin_lock(&heap_lock); > - /* adjust domain outstanding pages; may not go negative */ > - dom_before = d->outstanding_pages; > - dom_after = dom_before - pages; > - BUG_ON(dom_before < 0); > - dom_claimed = dom_after < 0 ? 0 : dom_after; > - d->outstanding_pages = dom_claimed; > - /* flag accounting bug if system outstanding_claims would go negative */ > - sys_before = outstanding_claims; > - sys_after = sys_before - (dom_before - dom_claimed); > - BUG_ON(sys_after < 0); > - outstanding_claims = sys_after; > + BUG_ON(outstanding_claims < d->outstanding_pages); > + if ( pages > 0 && d->outstanding_pages < pages ) The lhs isn't needed, is it? d->outstanding_pages is an unsigned quantity, after all. Else dropping the earlier of the two BUG_ON() wouldn't be quite right. > + { > + /* `pages` exceeds the domain's outstanding count. Zero it out. */ > + outstanding_claims -= d->outstanding_pages; > + d->outstanding_pages = 0; > + } else { Nit: Braces on their own lines please. In any event - yes, this reads quite a bit easier after the adjustment. With the adjustments (happy to make while committing, so long as you agree) Reviewed-by: Jan Beulich <jbeulich@suse.com> Jan > + outstanding_claims -= pages; > + d->outstanding_pages -= pages; > + } > spin_unlock(&heap_lock); > > out:
On 24.02.2025 15:49, Alejandro Vallejo wrote: > Open question to whoever reviews this... > > On Mon Feb 24, 2025 at 1:27 PM GMT, Alejandro Vallejo wrote: >> spin_lock(&heap_lock); >> - /* adjust domain outstanding pages; may not go negative */ >> - dom_before = d->outstanding_pages; >> - dom_after = dom_before - pages; >> - BUG_ON(dom_before < 0); >> - dom_claimed = dom_after < 0 ? 0 : dom_after; >> - d->outstanding_pages = dom_claimed; >> - /* flag accounting bug if system outstanding_claims would go negative */ >> - sys_before = outstanding_claims; >> - sys_after = sys_before - (dom_before - dom_claimed); >> - BUG_ON(sys_after < 0); >> - outstanding_claims = sys_after; >> + BUG_ON(outstanding_claims < d->outstanding_pages); >> + if ( pages > 0 && d->outstanding_pages < pages ) >> + { >> + /* `pages` exceeds the domain's outstanding count. Zero it out. */ >> + outstanding_claims -= d->outstanding_pages; >> + d->outstanding_pages = 0; > > While this matches the previous behaviour, do we _really_ want it? It's weird, > quirky, and it hard to extend to NUMA-aware claims (which is something in > midway through). > > Wouldn't it make sense to fail the allocation (earlier) if the claim has run > out? Do we even expect this to ever happen this late in the allocation call > chain? This goes back to what a "claim" means. Even without any claim, a domain may allocate memory. So a claim having run out doesn't imply allocation has to fail. NUMA-aware claims require more than an adjustment just here, I expect. Tracking of claims (certainly the global, maybe also the per-domain value) would likely need to become per-node, for example. Jan
On 26.02.2025 14:56, Roger Pau Monné wrote: > On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote: >> --- a/xen/common/page_alloc.c >> +++ b/xen/common/page_alloc.c >> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */ >> >> unsigned long domain_adjust_tot_pages(struct domain *d, long pages) >> { >> - long dom_before, dom_after, dom_claimed, sys_before, sys_after; >> - >> ASSERT(rspin_is_locked(&d->page_alloc_lock)); >> d->tot_pages += pages; >> >> /* >> - * can test d->claimed_pages race-free because it can only change >> + * can test d->outstanding_pages race-free because it can only change >> * if d->page_alloc_lock and heap_lock are both held, see also >> * domain_set_outstanding_pages below >> */ >> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages) >> goto out; > > I think you can probably short-circuit the logic below if pages == 0? > (and avoid taking the heap_lock) Are there callers passing in 0? >> spin_lock(&heap_lock); >> - /* adjust domain outstanding pages; may not go negative */ >> - dom_before = d->outstanding_pages; >> - dom_after = dom_before - pages; >> - BUG_ON(dom_before < 0); >> - dom_claimed = dom_after < 0 ? 0 : dom_after; >> - d->outstanding_pages = dom_claimed; >> - /* flag accounting bug if system outstanding_claims would go negative */ >> - sys_before = outstanding_claims; >> - sys_after = sys_before - (dom_before - dom_claimed); >> - BUG_ON(sys_after < 0); >> - outstanding_claims = sys_after; >> + BUG_ON(outstanding_claims < d->outstanding_pages); >> + if ( pages > 0 && d->outstanding_pages < pages ) >> + { >> + /* `pages` exceeds the domain's outstanding count. Zero it out. */ >> + outstanding_claims -= d->outstanding_pages; >> + d->outstanding_pages = 0; >> + } else { >> + outstanding_claims -= pages; >> + d->outstanding_pages -= pages; > > I wonder if it's intentional for a pages < 0 value to modify > outstanding_claims and d->outstanding_pages, I think those values > should only be set from domain_set_outstanding_pages(). > domain_adjust_tot_pages() should only decrease the value, but never > increase either outstanding_claims or d->outstanding_pages. > > At best the behavior is inconsistent, because once > d->outstanding_pages reaches 0 there will be no further modification > from domain_adjust_tot_pages(). Right, at that point the claim has run out. While freeing pages with an active claim means that the claim gets bigger (which naturally needs reflecting in the global). Jan
On Mon, Feb 24, 2025 at 02:49:48PM +0000, Alejandro Vallejo wrote: > Open question to whoever reviews this... > > On Mon Feb 24, 2025 at 1:27 PM GMT, Alejandro Vallejo wrote: > > spin_lock(&heap_lock); > > - /* adjust domain outstanding pages; may not go negative */ > > - dom_before = d->outstanding_pages; > > - dom_after = dom_before - pages; > > - BUG_ON(dom_before < 0); > > - dom_claimed = dom_after < 0 ? 0 : dom_after; > > - d->outstanding_pages = dom_claimed; > > - /* flag accounting bug if system outstanding_claims would go negative */ > > - sys_before = outstanding_claims; > > - sys_after = sys_before - (dom_before - dom_claimed); > > - BUG_ON(sys_after < 0); > > - outstanding_claims = sys_after; > > + BUG_ON(outstanding_claims < d->outstanding_pages); > > + if ( pages > 0 && d->outstanding_pages < pages ) > > + { > > + /* `pages` exceeds the domain's outstanding count. Zero it out. */ > > + outstanding_claims -= d->outstanding_pages; > > + d->outstanding_pages = 0; > > While this matches the previous behaviour, do we _really_ want it? It's weird, > quirky, and it hard to extend to NUMA-aware claims (which is something in > midway through). > > Wouldn't it make sense to fail the allocation (earlier) if the claim has run > out? Do we even expect this to ever happen this late in the allocation call > chain? I'm unsure. This is the case where more memory than initially claimed has been allocated, but by the time domain_adjust_tot_pages() gets called the memory has already been allocated, so it's kind of unhelpful to fail by then. I think any caller that requests more memory than what has been initially claimed for the domain should be prepared to deal with such allocation failing. This quirky handling is very likely a workaround for the miscellaneous differences between the memory accounted by the toolstack for a guest vs the memory really used by such guest. I bet if you limit a guest to strictly only allocate up to d->outstanding_pages domain creation will fail. In general the toolstack memory calculations are not fully accurate, see for example how vmx_alloc_vlapic_mapping() allocates a domheap page which very likely the toolstack won't have accounted for. There are likely other examples that would possibly break the accounting done by the toolstack. Thanks, Roger.
On Wed, Feb 26, 2025 at 03:08:33PM +0100, Jan Beulich wrote: > On 26.02.2025 14:56, Roger Pau Monné wrote: > > On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote: > >> --- a/xen/common/page_alloc.c > >> +++ b/xen/common/page_alloc.c > >> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */ > >> > >> unsigned long domain_adjust_tot_pages(struct domain *d, long pages) > >> { > >> - long dom_before, dom_after, dom_claimed, sys_before, sys_after; > >> - > >> ASSERT(rspin_is_locked(&d->page_alloc_lock)); > >> d->tot_pages += pages; > >> > >> /* > >> - * can test d->claimed_pages race-free because it can only change > >> + * can test d->outstanding_pages race-free because it can only change > >> * if d->page_alloc_lock and heap_lock are both held, see also > >> * domain_set_outstanding_pages below > >> */ > >> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages) > >> goto out; > > > > I think you can probably short-circuit the logic below if pages == 0? > > (and avoid taking the heap_lock) > > Are there callers passing in 0? Not sure, but if there are no callers expected we might add an ASSERT to that effect then. > >> spin_lock(&heap_lock); > >> - /* adjust domain outstanding pages; may not go negative */ > >> - dom_before = d->outstanding_pages; > >> - dom_after = dom_before - pages; > >> - BUG_ON(dom_before < 0); > >> - dom_claimed = dom_after < 0 ? 0 : dom_after; > >> - d->outstanding_pages = dom_claimed; > >> - /* flag accounting bug if system outstanding_claims would go negative */ > >> - sys_before = outstanding_claims; > >> - sys_after = sys_before - (dom_before - dom_claimed); > >> - BUG_ON(sys_after < 0); > >> - outstanding_claims = sys_after; > >> + BUG_ON(outstanding_claims < d->outstanding_pages); > >> + if ( pages > 0 && d->outstanding_pages < pages ) > >> + { > >> + /* `pages` exceeds the domain's outstanding count. Zero it out. */ > >> + outstanding_claims -= d->outstanding_pages; > >> + d->outstanding_pages = 0; > >> + } else { > >> + outstanding_claims -= pages; > >> + d->outstanding_pages -= pages; > > > > I wonder if it's intentional for a pages < 0 value to modify > > outstanding_claims and d->outstanding_pages, I think those values > > should only be set from domain_set_outstanding_pages(). > > domain_adjust_tot_pages() should only decrease the value, but never > > increase either outstanding_claims or d->outstanding_pages. > > > > At best the behavior is inconsistent, because once > > d->outstanding_pages reaches 0 there will be no further modification > > from domain_adjust_tot_pages(). > > Right, at that point the claim has run out. While freeing pages with an > active claim means that the claim gets bigger (which naturally needs > reflecting in the global). domain_adjust_tot_pages() is not exclusively called when freeing pages, see steal_page() for example. When called from steal_page() it's wrong to increase the claim, as it assumes that the page removed from d->tot_pages is freed, but that's not the case. The domain might end up in a situation where the claim is bigger than the available amount of memory. Thanks, Roger.
On 26.02.2025 15:28, Roger Pau Monné wrote: > On Wed, Feb 26, 2025 at 03:08:33PM +0100, Jan Beulich wrote: >> On 26.02.2025 14:56, Roger Pau Monné wrote: >>> On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote: >>>> --- a/xen/common/page_alloc.c >>>> +++ b/xen/common/page_alloc.c >>>> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */ >>>> >>>> unsigned long domain_adjust_tot_pages(struct domain *d, long pages) >>>> { >>>> - long dom_before, dom_after, dom_claimed, sys_before, sys_after; >>>> - >>>> ASSERT(rspin_is_locked(&d->page_alloc_lock)); >>>> d->tot_pages += pages; >>>> >>>> /* >>>> - * can test d->claimed_pages race-free because it can only change >>>> + * can test d->outstanding_pages race-free because it can only change >>>> * if d->page_alloc_lock and heap_lock are both held, see also >>>> * domain_set_outstanding_pages below >>>> */ >>>> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages) >>>> goto out; >>> >>> I think you can probably short-circuit the logic below if pages == 0? >>> (and avoid taking the heap_lock) >> >> Are there callers passing in 0? > > Not sure, but if there are no callers expected we might add an ASSERT > to that effect then. > >>>> spin_lock(&heap_lock); >>>> - /* adjust domain outstanding pages; may not go negative */ >>>> - dom_before = d->outstanding_pages; >>>> - dom_after = dom_before - pages; >>>> - BUG_ON(dom_before < 0); >>>> - dom_claimed = dom_after < 0 ? 0 : dom_after; >>>> - d->outstanding_pages = dom_claimed; >>>> - /* flag accounting bug if system outstanding_claims would go negative */ >>>> - sys_before = outstanding_claims; >>>> - sys_after = sys_before - (dom_before - dom_claimed); >>>> - BUG_ON(sys_after < 0); >>>> - outstanding_claims = sys_after; >>>> + BUG_ON(outstanding_claims < d->outstanding_pages); >>>> + if ( pages > 0 && d->outstanding_pages < pages ) >>>> + { >>>> + /* `pages` exceeds the domain's outstanding count. Zero it out. */ >>>> + outstanding_claims -= d->outstanding_pages; >>>> + d->outstanding_pages = 0; >>>> + } else { >>>> + outstanding_claims -= pages; >>>> + d->outstanding_pages -= pages; >>> >>> I wonder if it's intentional for a pages < 0 value to modify >>> outstanding_claims and d->outstanding_pages, I think those values >>> should only be set from domain_set_outstanding_pages(). >>> domain_adjust_tot_pages() should only decrease the value, but never >>> increase either outstanding_claims or d->outstanding_pages. >>> >>> At best the behavior is inconsistent, because once >>> d->outstanding_pages reaches 0 there will be no further modification >>> from domain_adjust_tot_pages(). >> >> Right, at that point the claim has run out. While freeing pages with an >> active claim means that the claim gets bigger (which naturally needs >> reflecting in the global). > > domain_adjust_tot_pages() is not exclusively called when freeing > pages, see steal_page() for example. Or also when pages were allocated. steal_page() ... > When called from steal_page() it's wrong to increase the claim, as > it assumes that the page removed from d->tot_pages is freed, but > that's not the case. The domain might end up in a situation where > the claim is bigger than the available amount of memory. ... is a case that likely wasn't considered when the feature was added. I never really liked this; I'd be quite happy to see it ripped out, as long as we'd be reasonably certain it isn't in active use by people. Jan
On Wed, Feb 26, 2025 at 03:36:33PM +0100, Jan Beulich wrote: > On 26.02.2025 15:28, Roger Pau Monné wrote: > > On Wed, Feb 26, 2025 at 03:08:33PM +0100, Jan Beulich wrote: > >> On 26.02.2025 14:56, Roger Pau Monné wrote: > >>> On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote: > >>>> --- a/xen/common/page_alloc.c > >>>> +++ b/xen/common/page_alloc.c > >>>> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */ > >>>> > >>>> unsigned long domain_adjust_tot_pages(struct domain *d, long pages) > >>>> { > >>>> - long dom_before, dom_after, dom_claimed, sys_before, sys_after; > >>>> - > >>>> ASSERT(rspin_is_locked(&d->page_alloc_lock)); > >>>> d->tot_pages += pages; > >>>> > >>>> /* > >>>> - * can test d->claimed_pages race-free because it can only change > >>>> + * can test d->outstanding_pages race-free because it can only change > >>>> * if d->page_alloc_lock and heap_lock are both held, see also > >>>> * domain_set_outstanding_pages below > >>>> */ > >>>> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages) > >>>> goto out; > >>> > >>> I think you can probably short-circuit the logic below if pages == 0? > >>> (and avoid taking the heap_lock) > >> > >> Are there callers passing in 0? > > > > Not sure, but if there are no callers expected we might add an ASSERT > > to that effect then. > > > >>>> spin_lock(&heap_lock); > >>>> - /* adjust domain outstanding pages; may not go negative */ > >>>> - dom_before = d->outstanding_pages; > >>>> - dom_after = dom_before - pages; > >>>> - BUG_ON(dom_before < 0); > >>>> - dom_claimed = dom_after < 0 ? 0 : dom_after; > >>>> - d->outstanding_pages = dom_claimed; > >>>> - /* flag accounting bug if system outstanding_claims would go negative */ > >>>> - sys_before = outstanding_claims; > >>>> - sys_after = sys_before - (dom_before - dom_claimed); > >>>> - BUG_ON(sys_after < 0); > >>>> - outstanding_claims = sys_after; > >>>> + BUG_ON(outstanding_claims < d->outstanding_pages); > >>>> + if ( pages > 0 && d->outstanding_pages < pages ) > >>>> + { > >>>> + /* `pages` exceeds the domain's outstanding count. Zero it out. */ > >>>> + outstanding_claims -= d->outstanding_pages; > >>>> + d->outstanding_pages = 0; > >>>> + } else { > >>>> + outstanding_claims -= pages; > >>>> + d->outstanding_pages -= pages; > >>> > >>> I wonder if it's intentional for a pages < 0 value to modify > >>> outstanding_claims and d->outstanding_pages, I think those values > >>> should only be set from domain_set_outstanding_pages(). > >>> domain_adjust_tot_pages() should only decrease the value, but never > >>> increase either outstanding_claims or d->outstanding_pages. > >>> > >>> At best the behavior is inconsistent, because once > >>> d->outstanding_pages reaches 0 there will be no further modification > >>> from domain_adjust_tot_pages(). > >> > >> Right, at that point the claim has run out. While freeing pages with an > >> active claim means that the claim gets bigger (which naturally needs > >> reflecting in the global). > > > > domain_adjust_tot_pages() is not exclusively called when freeing > > pages, see steal_page() for example. > > Or also when pages were allocated. steal_page() ... > > > When called from steal_page() it's wrong to increase the claim, as > > it assumes that the page removed from d->tot_pages is freed, but > > that's not the case. The domain might end up in a situation where > > the claim is bigger than the available amount of memory. > > ... is a case that likely wasn't considered when the feature was added. > > I never really liked this; I'd be quite happy to see it ripped out, as > long as we'd be reasonably certain it isn't in active use by people. What do you mean with 'it' in the above sentence, the whole claim stuff? Or just getting rid of allowing the claim to increase as a result of pages being removed from a domain? Thanks, Roger.
On 26.02.2025 17:04, Roger Pau Monné wrote: > On Wed, Feb 26, 2025 at 03:36:33PM +0100, Jan Beulich wrote: >> On 26.02.2025 15:28, Roger Pau Monné wrote: >>> On Wed, Feb 26, 2025 at 03:08:33PM +0100, Jan Beulich wrote: >>>> On 26.02.2025 14:56, Roger Pau Monné wrote: >>>>> On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote: >>>>>> --- a/xen/common/page_alloc.c >>>>>> +++ b/xen/common/page_alloc.c >>>>>> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */ >>>>>> >>>>>> unsigned long domain_adjust_tot_pages(struct domain *d, long pages) >>>>>> { >>>>>> - long dom_before, dom_after, dom_claimed, sys_before, sys_after; >>>>>> - >>>>>> ASSERT(rspin_is_locked(&d->page_alloc_lock)); >>>>>> d->tot_pages += pages; >>>>>> >>>>>> /* >>>>>> - * can test d->claimed_pages race-free because it can only change >>>>>> + * can test d->outstanding_pages race-free because it can only change >>>>>> * if d->page_alloc_lock and heap_lock are both held, see also >>>>>> * domain_set_outstanding_pages below >>>>>> */ >>>>>> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages) >>>>>> goto out; >>>>> >>>>> I think you can probably short-circuit the logic below if pages == 0? >>>>> (and avoid taking the heap_lock) >>>> >>>> Are there callers passing in 0? >>> >>> Not sure, but if there are no callers expected we might add an ASSERT >>> to that effect then. >>> >>>>>> spin_lock(&heap_lock); >>>>>> - /* adjust domain outstanding pages; may not go negative */ >>>>>> - dom_before = d->outstanding_pages; >>>>>> - dom_after = dom_before - pages; >>>>>> - BUG_ON(dom_before < 0); >>>>>> - dom_claimed = dom_after < 0 ? 0 : dom_after; >>>>>> - d->outstanding_pages = dom_claimed; >>>>>> - /* flag accounting bug if system outstanding_claims would go negative */ >>>>>> - sys_before = outstanding_claims; >>>>>> - sys_after = sys_before - (dom_before - dom_claimed); >>>>>> - BUG_ON(sys_after < 0); >>>>>> - outstanding_claims = sys_after; >>>>>> + BUG_ON(outstanding_claims < d->outstanding_pages); >>>>>> + if ( pages > 0 && d->outstanding_pages < pages ) >>>>>> + { >>>>>> + /* `pages` exceeds the domain's outstanding count. Zero it out. */ >>>>>> + outstanding_claims -= d->outstanding_pages; >>>>>> + d->outstanding_pages = 0; >>>>>> + } else { >>>>>> + outstanding_claims -= pages; >>>>>> + d->outstanding_pages -= pages; >>>>> >>>>> I wonder if it's intentional for a pages < 0 value to modify >>>>> outstanding_claims and d->outstanding_pages, I think those values >>>>> should only be set from domain_set_outstanding_pages(). >>>>> domain_adjust_tot_pages() should only decrease the value, but never >>>>> increase either outstanding_claims or d->outstanding_pages. >>>>> >>>>> At best the behavior is inconsistent, because once >>>>> d->outstanding_pages reaches 0 there will be no further modification >>>>> from domain_adjust_tot_pages(). >>>> >>>> Right, at that point the claim has run out. While freeing pages with an >>>> active claim means that the claim gets bigger (which naturally needs >>>> reflecting in the global). >>> >>> domain_adjust_tot_pages() is not exclusively called when freeing >>> pages, see steal_page() for example. >> >> Or also when pages were allocated. steal_page() ... >> >>> When called from steal_page() it's wrong to increase the claim, as >>> it assumes that the page removed from d->tot_pages is freed, but >>> that's not the case. The domain might end up in a situation where >>> the claim is bigger than the available amount of memory. >> >> ... is a case that likely wasn't considered when the feature was added. >> >> I never really liked this; I'd be quite happy to see it ripped out, as >> long as we'd be reasonably certain it isn't in active use by people. > > What do you mean with 'it' in the above sentence, the whole claim > stuff? Yes. > Or just getting rid of allowing the claim to increase as a > result of pages being removed from a domain? No. Jan
On 26/02/2025 4:06 pm, Jan Beulich wrote: > On 26.02.2025 17:04, Roger Pau Monné wrote: >> On Wed, Feb 26, 2025 at 03:36:33PM +0100, Jan Beulich wrote: >>> On 26.02.2025 15:28, Roger Pau Monné wrote: >>>> On Wed, Feb 26, 2025 at 03:08:33PM +0100, Jan Beulich wrote: >>>>> On 26.02.2025 14:56, Roger Pau Monné wrote: >>>>>> On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote: >>>>>>> --- a/xen/common/page_alloc.c >>>>>>> +++ b/xen/common/page_alloc.c >>>>>>> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */ >>>>>>> >>>>>>> unsigned long domain_adjust_tot_pages(struct domain *d, long pages) >>>>>>> { >>>>>>> - long dom_before, dom_after, dom_claimed, sys_before, sys_after; >>>>>>> - >>>>>>> ASSERT(rspin_is_locked(&d->page_alloc_lock)); >>>>>>> d->tot_pages += pages; >>>>>>> >>>>>>> /* >>>>>>> - * can test d->claimed_pages race-free because it can only change >>>>>>> + * can test d->outstanding_pages race-free because it can only change >>>>>>> * if d->page_alloc_lock and heap_lock are both held, see also >>>>>>> * domain_set_outstanding_pages below >>>>>>> */ >>>>>>> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages) >>>>>>> goto out; >>>>>> I think you can probably short-circuit the logic below if pages == 0? >>>>>> (and avoid taking the heap_lock) >>>>> Are there callers passing in 0? >>>> Not sure, but if there are no callers expected we might add an ASSERT >>>> to that effect then. >>>> >>>>>>> spin_lock(&heap_lock); >>>>>>> - /* adjust domain outstanding pages; may not go negative */ >>>>>>> - dom_before = d->outstanding_pages; >>>>>>> - dom_after = dom_before - pages; >>>>>>> - BUG_ON(dom_before < 0); >>>>>>> - dom_claimed = dom_after < 0 ? 0 : dom_after; >>>>>>> - d->outstanding_pages = dom_claimed; >>>>>>> - /* flag accounting bug if system outstanding_claims would go negative */ >>>>>>> - sys_before = outstanding_claims; >>>>>>> - sys_after = sys_before - (dom_before - dom_claimed); >>>>>>> - BUG_ON(sys_after < 0); >>>>>>> - outstanding_claims = sys_after; >>>>>>> + BUG_ON(outstanding_claims < d->outstanding_pages); >>>>>>> + if ( pages > 0 && d->outstanding_pages < pages ) >>>>>>> + { >>>>>>> + /* `pages` exceeds the domain's outstanding count. Zero it out. */ >>>>>>> + outstanding_claims -= d->outstanding_pages; >>>>>>> + d->outstanding_pages = 0; >>>>>>> + } else { >>>>>>> + outstanding_claims -= pages; >>>>>>> + d->outstanding_pages -= pages; >>>>>> I wonder if it's intentional for a pages < 0 value to modify >>>>>> outstanding_claims and d->outstanding_pages, I think those values >>>>>> should only be set from domain_set_outstanding_pages(). >>>>>> domain_adjust_tot_pages() should only decrease the value, but never >>>>>> increase either outstanding_claims or d->outstanding_pages. >>>>>> >>>>>> At best the behavior is inconsistent, because once >>>>>> d->outstanding_pages reaches 0 there will be no further modification >>>>>> from domain_adjust_tot_pages(). >>>>> Right, at that point the claim has run out. While freeing pages with an >>>>> active claim means that the claim gets bigger (which naturally needs >>>>> reflecting in the global). >>>> domain_adjust_tot_pages() is not exclusively called when freeing >>>> pages, see steal_page() for example. >>> Or also when pages were allocated. steal_page() ... >>> >>>> When called from steal_page() it's wrong to increase the claim, as >>>> it assumes that the page removed from d->tot_pages is freed, but >>>> that's not the case. The domain might end up in a situation where >>>> the claim is bigger than the available amount of memory. >>> ... is a case that likely wasn't considered when the feature was added. >>> >>> I never really liked this; I'd be quite happy to see it ripped out, as >>> long as we'd be reasonably certain it isn't in active use by people. >> What do you mean with 'it' in the above sentence, the whole claim >> stuff? > Yes. > >> Or just getting rid of allowing the claim to increase as a >> result of pages being removed from a domain? > No. Alejandro and I discussed this earlier in the week. The claim infrastructure stuff is critical for a toolstack capable of doing things in parallel. However, it is also nonsensical for there to be a remaining claim by the time domain construction is done. If creation_finished were a concrete thing, rather than a bodge hacked into domain_unpause_by_systemcontroller(), it ought to be made to fail if there were an outstanding claim. I suggested that we follow through on a previous suggestion of making it a real hypercall (which is needed by the encrypted VM crowd too). ~Andrew
On 26.02.2025 17:34, Andrew Cooper wrote: > On 26/02/2025 4:06 pm, Jan Beulich wrote: >> On 26.02.2025 17:04, Roger Pau Monné wrote: >>> On Wed, Feb 26, 2025 at 03:36:33PM +0100, Jan Beulich wrote: >>>> On 26.02.2025 15:28, Roger Pau Monné wrote: >>>>> On Wed, Feb 26, 2025 at 03:08:33PM +0100, Jan Beulich wrote: >>>>>> On 26.02.2025 14:56, Roger Pau Monné wrote: >>>>>>> On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote: >>>>>>>> --- a/xen/common/page_alloc.c >>>>>>>> +++ b/xen/common/page_alloc.c >>>>>>>> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */ >>>>>>>> >>>>>>>> unsigned long domain_adjust_tot_pages(struct domain *d, long pages) >>>>>>>> { >>>>>>>> - long dom_before, dom_after, dom_claimed, sys_before, sys_after; >>>>>>>> - >>>>>>>> ASSERT(rspin_is_locked(&d->page_alloc_lock)); >>>>>>>> d->tot_pages += pages; >>>>>>>> >>>>>>>> /* >>>>>>>> - * can test d->claimed_pages race-free because it can only change >>>>>>>> + * can test d->outstanding_pages race-free because it can only change >>>>>>>> * if d->page_alloc_lock and heap_lock are both held, see also >>>>>>>> * domain_set_outstanding_pages below >>>>>>>> */ >>>>>>>> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages) >>>>>>>> goto out; >>>>>>> I think you can probably short-circuit the logic below if pages == 0? >>>>>>> (and avoid taking the heap_lock) >>>>>> Are there callers passing in 0? >>>>> Not sure, but if there are no callers expected we might add an ASSERT >>>>> to that effect then. >>>>> >>>>>>>> spin_lock(&heap_lock); >>>>>>>> - /* adjust domain outstanding pages; may not go negative */ >>>>>>>> - dom_before = d->outstanding_pages; >>>>>>>> - dom_after = dom_before - pages; >>>>>>>> - BUG_ON(dom_before < 0); >>>>>>>> - dom_claimed = dom_after < 0 ? 0 : dom_after; >>>>>>>> - d->outstanding_pages = dom_claimed; >>>>>>>> - /* flag accounting bug if system outstanding_claims would go negative */ >>>>>>>> - sys_before = outstanding_claims; >>>>>>>> - sys_after = sys_before - (dom_before - dom_claimed); >>>>>>>> - BUG_ON(sys_after < 0); >>>>>>>> - outstanding_claims = sys_after; >>>>>>>> + BUG_ON(outstanding_claims < d->outstanding_pages); >>>>>>>> + if ( pages > 0 && d->outstanding_pages < pages ) >>>>>>>> + { >>>>>>>> + /* `pages` exceeds the domain's outstanding count. Zero it out. */ >>>>>>>> + outstanding_claims -= d->outstanding_pages; >>>>>>>> + d->outstanding_pages = 0; >>>>>>>> + } else { >>>>>>>> + outstanding_claims -= pages; >>>>>>>> + d->outstanding_pages -= pages; >>>>>>> I wonder if it's intentional for a pages < 0 value to modify >>>>>>> outstanding_claims and d->outstanding_pages, I think those values >>>>>>> should only be set from domain_set_outstanding_pages(). >>>>>>> domain_adjust_tot_pages() should only decrease the value, but never >>>>>>> increase either outstanding_claims or d->outstanding_pages. >>>>>>> >>>>>>> At best the behavior is inconsistent, because once >>>>>>> d->outstanding_pages reaches 0 there will be no further modification >>>>>>> from domain_adjust_tot_pages(). >>>>>> Right, at that point the claim has run out. While freeing pages with an >>>>>> active claim means that the claim gets bigger (which naturally needs >>>>>> reflecting in the global). >>>>> domain_adjust_tot_pages() is not exclusively called when freeing >>>>> pages, see steal_page() for example. >>>> Or also when pages were allocated. steal_page() ... >>>> >>>>> When called from steal_page() it's wrong to increase the claim, as >>>>> it assumes that the page removed from d->tot_pages is freed, but >>>>> that's not the case. The domain might end up in a situation where >>>>> the claim is bigger than the available amount of memory. >>>> ... is a case that likely wasn't considered when the feature was added. >>>> >>>> I never really liked this; I'd be quite happy to see it ripped out, as >>>> long as we'd be reasonably certain it isn't in active use by people. >>> What do you mean with 'it' in the above sentence, the whole claim >>> stuff? >> Yes. >> >>> Or just getting rid of allowing the claim to increase as a >>> result of pages being removed from a domain? >> No. > > Alejandro and I discussed this earlier in the week. > > The claim infrastructure stuff is critical for a toolstack capable of > doing things in parallel. > > However, it is also nonsensical for there to be a remaining claim by the > time domain construction is done. I'm not entirely sure about this. Iirc it was the tmem work where this was added, and then pretty certainly it was needed also for already running domains. > If creation_finished were a concrete thing, rather than a bodge hacked > into domain_unpause_by_systemcontroller(), it ought to be made to fail > if there were an outstanding claim. I suggested that we follow through > on a previous suggestion of making it a real hypercall (which is needed > by the encrypted VM crowd too). Rather than failing we could simply zap the leftover? Jan
On 26/02/2025 4:42 pm, Jan Beulich wrote: > On 26.02.2025 17:34, Andrew Cooper wrote: >> On 26/02/2025 4:06 pm, Jan Beulich wrote: >>> On 26.02.2025 17:04, Roger Pau Monné wrote: >>>> On Wed, Feb 26, 2025 at 03:36:33PM +0100, Jan Beulich wrote: >>>>> On 26.02.2025 15:28, Roger Pau Monné wrote: >>>>>> On Wed, Feb 26, 2025 at 03:08:33PM +0100, Jan Beulich wrote: >>>>>>> On 26.02.2025 14:56, Roger Pau Monné wrote: >>>>>>>> On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote: >>>>>>>>> --- a/xen/common/page_alloc.c >>>>>>>>> +++ b/xen/common/page_alloc.c >>>>>>>>> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */ >>>>>>>>> >>>>>>>>> unsigned long domain_adjust_tot_pages(struct domain *d, long pages) >>>>>>>>> { >>>>>>>>> - long dom_before, dom_after, dom_claimed, sys_before, sys_after; >>>>>>>>> - >>>>>>>>> ASSERT(rspin_is_locked(&d->page_alloc_lock)); >>>>>>>>> d->tot_pages += pages; >>>>>>>>> >>>>>>>>> /* >>>>>>>>> - * can test d->claimed_pages race-free because it can only change >>>>>>>>> + * can test d->outstanding_pages race-free because it can only change >>>>>>>>> * if d->page_alloc_lock and heap_lock are both held, see also >>>>>>>>> * domain_set_outstanding_pages below >>>>>>>>> */ >>>>>>>>> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages) >>>>>>>>> goto out; >>>>>>>> I think you can probably short-circuit the logic below if pages == 0? >>>>>>>> (and avoid taking the heap_lock) >>>>>>> Are there callers passing in 0? >>>>>> Not sure, but if there are no callers expected we might add an ASSERT >>>>>> to that effect then. >>>>>> >>>>>>>>> spin_lock(&heap_lock); >>>>>>>>> - /* adjust domain outstanding pages; may not go negative */ >>>>>>>>> - dom_before = d->outstanding_pages; >>>>>>>>> - dom_after = dom_before - pages; >>>>>>>>> - BUG_ON(dom_before < 0); >>>>>>>>> - dom_claimed = dom_after < 0 ? 0 : dom_after; >>>>>>>>> - d->outstanding_pages = dom_claimed; >>>>>>>>> - /* flag accounting bug if system outstanding_claims would go negative */ >>>>>>>>> - sys_before = outstanding_claims; >>>>>>>>> - sys_after = sys_before - (dom_before - dom_claimed); >>>>>>>>> - BUG_ON(sys_after < 0); >>>>>>>>> - outstanding_claims = sys_after; >>>>>>>>> + BUG_ON(outstanding_claims < d->outstanding_pages); >>>>>>>>> + if ( pages > 0 && d->outstanding_pages < pages ) >>>>>>>>> + { >>>>>>>>> + /* `pages` exceeds the domain's outstanding count. Zero it out. */ >>>>>>>>> + outstanding_claims -= d->outstanding_pages; >>>>>>>>> + d->outstanding_pages = 0; >>>>>>>>> + } else { >>>>>>>>> + outstanding_claims -= pages; >>>>>>>>> + d->outstanding_pages -= pages; >>>>>>>> I wonder if it's intentional for a pages < 0 value to modify >>>>>>>> outstanding_claims and d->outstanding_pages, I think those values >>>>>>>> should only be set from domain_set_outstanding_pages(). >>>>>>>> domain_adjust_tot_pages() should only decrease the value, but never >>>>>>>> increase either outstanding_claims or d->outstanding_pages. >>>>>>>> >>>>>>>> At best the behavior is inconsistent, because once >>>>>>>> d->outstanding_pages reaches 0 there will be no further modification >>>>>>>> from domain_adjust_tot_pages(). >>>>>>> Right, at that point the claim has run out. While freeing pages with an >>>>>>> active claim means that the claim gets bigger (which naturally needs >>>>>>> reflecting in the global). >>>>>> domain_adjust_tot_pages() is not exclusively called when freeing >>>>>> pages, see steal_page() for example. >>>>> Or also when pages were allocated. steal_page() ... >>>>> >>>>>> When called from steal_page() it's wrong to increase the claim, as >>>>>> it assumes that the page removed from d->tot_pages is freed, but >>>>>> that's not the case. The domain might end up in a situation where >>>>>> the claim is bigger than the available amount of memory. >>>>> ... is a case that likely wasn't considered when the feature was added. >>>>> >>>>> I never really liked this; I'd be quite happy to see it ripped out, as >>>>> long as we'd be reasonably certain it isn't in active use by people. >>>> What do you mean with 'it' in the above sentence, the whole claim >>>> stuff? >>> Yes. >>> >>>> Or just getting rid of allowing the claim to increase as a >>>> result of pages being removed from a domain? >>> No. >> Alejandro and I discussed this earlier in the week. >> >> The claim infrastructure stuff is critical for a toolstack capable of >> doing things in parallel. >> >> However, it is also nonsensical for there to be a remaining claim by the >> time domain construction is done. > I'm not entirely sure about this. Iirc it was the tmem work where this was > added, and then pretty certainly it was needed also for already running > domains. It wasn't TMEM. It was generally large-memory VMs. The problem is if you've got 2x 2T VMs booting on a 3T system. Previously, you'd start building both of them, and minutes later they both fail because Xen was fully out of memory. Claim was introduced to atomically reserve the memory you were intending to build a domain with. For XenServer, we're working on NUMA fixes, and something that's important there is to be able to reserve memory on a specific NUMA node (hence why this is all getting looked at). >> If creation_finished were a concrete thing, rather than a bodge hacked >> into domain_unpause_by_systemcontroller(), it ought to be made to fail >> if there were an outstanding claim. I suggested that we follow through >> on a previous suggestion of making it a real hypercall (which is needed >> by the encrypted VM crowd too). > Rather than failing we could simply zap the leftover? Hmm. Perhaps. I'd be slightly wary about zapping a claim, but there should only be an outstanding claim if the toolstack did something wrong. OTOH, we absolutely definitely do need a real hypercall here at some point soon. ~Andrew
On Wed Feb 26, 2025 at 2:05 PM GMT, Jan Beulich wrote: > On 24.02.2025 15:49, Alejandro Vallejo wrote: > > Open question to whoever reviews this... > > > > On Mon Feb 24, 2025 at 1:27 PM GMT, Alejandro Vallejo wrote: > >> spin_lock(&heap_lock); > >> - /* adjust domain outstanding pages; may not go negative */ > >> - dom_before = d->outstanding_pages; > >> - dom_after = dom_before - pages; > >> - BUG_ON(dom_before < 0); > >> - dom_claimed = dom_after < 0 ? 0 : dom_after; > >> - d->outstanding_pages = dom_claimed; > >> - /* flag accounting bug if system outstanding_claims would go negative */ > >> - sys_before = outstanding_claims; > >> - sys_after = sys_before - (dom_before - dom_claimed); > >> - BUG_ON(sys_after < 0); > >> - outstanding_claims = sys_after; > >> + BUG_ON(outstanding_claims < d->outstanding_pages); > >> + if ( pages > 0 && d->outstanding_pages < pages ) > >> + { > >> + /* `pages` exceeds the domain's outstanding count. Zero it out. */ > >> + outstanding_claims -= d->outstanding_pages; > >> + d->outstanding_pages = 0; > > > > While this matches the previous behaviour, do we _really_ want it? It's weird, > > quirky, and it hard to extend to NUMA-aware claims (which is something in > > midway through). > > > > Wouldn't it make sense to fail the allocation (earlier) if the claim has run > > out? Do we even expect this to ever happen this late in the allocation call > > chain? > > This goes back to what a "claim" means. Even without any claim, a domain may > allocate memory. So a claim having run out doesn't imply allocation has to > fail. Hmmm... but that violates the purpose of the claim infra as far as I understand it. If a domain may overallocate by (e.g) ballooning in memory it can distort the ability of another domain to start up, even if it succeeded in its own claim. We might also break the invariant that total claims are strictly >= total_avail_pages. I'm somewhat puzzled at the "why" of having separate concepts for max_mem and claims. I guess it simply grew the way it did. Reinstating sanity would probably involve making max_mem effectively the claim, but that's a ton of work I really would rather not do for now. > > NUMA-aware claims require more than an adjustment just here, I expect. Tracking > of claims (certainly the global, maybe also the per-domain value) would likely > need to become per-node, for example. A fair amount more, yes. I'm preparing a series on the side to address per-node claims and it's far more invasive on page_alloc.c. This function was just sufficiently impossible to read that I felt the urge to send it ahead of time for my own mental health. > > Jan Cheers, Alejandro
On Wed Feb 26, 2025 at 4:51 PM GMT, Andrew Cooper wrote: > On 26/02/2025 4:42 pm, Jan Beulich wrote: > > On 26.02.2025 17:34, Andrew Cooper wrote: > >> On 26/02/2025 4:06 pm, Jan Beulich wrote: > >>> On 26.02.2025 17:04, Roger Pau Monné wrote: > >>>> On Wed, Feb 26, 2025 at 03:36:33PM +0100, Jan Beulich wrote: > >>>>> On 26.02.2025 15:28, Roger Pau Monné wrote: > >>>>>> On Wed, Feb 26, 2025 at 03:08:33PM +0100, Jan Beulich wrote: > >>>>>>> On 26.02.2025 14:56, Roger Pau Monné wrote: > >>>>>>>> On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote: > >>>>>>>>> --- a/xen/common/page_alloc.c > >>>>>>>>> +++ b/xen/common/page_alloc.c > >>>>>>>>> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */ > >>>>>>>>> > >>>>>>>>> unsigned long domain_adjust_tot_pages(struct domain *d, long pages) > >>>>>>>>> { > >>>>>>>>> - long dom_before, dom_after, dom_claimed, sys_before, sys_after; > >>>>>>>>> - > >>>>>>>>> ASSERT(rspin_is_locked(&d->page_alloc_lock)); > >>>>>>>>> d->tot_pages += pages; > >>>>>>>>> > >>>>>>>>> /* > >>>>>>>>> - * can test d->claimed_pages race-free because it can only change > >>>>>>>>> + * can test d->outstanding_pages race-free because it can only change > >>>>>>>>> * if d->page_alloc_lock and heap_lock are both held, see also > >>>>>>>>> * domain_set_outstanding_pages below > >>>>>>>>> */ > >>>>>>>>> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages) > >>>>>>>>> goto out; > >>>>>>>> I think you can probably short-circuit the logic below if pages == 0? > >>>>>>>> (and avoid taking the heap_lock) > >>>>>>> Are there callers passing in 0? > >>>>>> Not sure, but if there are no callers expected we might add an ASSERT > >>>>>> to that effect then. > >>>>>> > >>>>>>>>> spin_lock(&heap_lock); > >>>>>>>>> - /* adjust domain outstanding pages; may not go negative */ > >>>>>>>>> - dom_before = d->outstanding_pages; > >>>>>>>>> - dom_after = dom_before - pages; > >>>>>>>>> - BUG_ON(dom_before < 0); > >>>>>>>>> - dom_claimed = dom_after < 0 ? 0 : dom_after; > >>>>>>>>> - d->outstanding_pages = dom_claimed; > >>>>>>>>> - /* flag accounting bug if system outstanding_claims would go negative */ > >>>>>>>>> - sys_before = outstanding_claims; > >>>>>>>>> - sys_after = sys_before - (dom_before - dom_claimed); > >>>>>>>>> - BUG_ON(sys_after < 0); > >>>>>>>>> - outstanding_claims = sys_after; > >>>>>>>>> + BUG_ON(outstanding_claims < d->outstanding_pages); > >>>>>>>>> + if ( pages > 0 && d->outstanding_pages < pages ) > >>>>>>>>> + { > >>>>>>>>> + /* `pages` exceeds the domain's outstanding count. Zero it out. */ > >>>>>>>>> + outstanding_claims -= d->outstanding_pages; > >>>>>>>>> + d->outstanding_pages = 0; > >>>>>>>>> + } else { > >>>>>>>>> + outstanding_claims -= pages; > >>>>>>>>> + d->outstanding_pages -= pages; > >>>>>>>> I wonder if it's intentional for a pages < 0 value to modify > >>>>>>>> outstanding_claims and d->outstanding_pages, I think those values > >>>>>>>> should only be set from domain_set_outstanding_pages(). > >>>>>>>> domain_adjust_tot_pages() should only decrease the value, but never > >>>>>>>> increase either outstanding_claims or d->outstanding_pages. > >>>>>>>> > >>>>>>>> At best the behavior is inconsistent, because once > >>>>>>>> d->outstanding_pages reaches 0 there will be no further modification > >>>>>>>> from domain_adjust_tot_pages(). > >>>>>>> Right, at that point the claim has run out. While freeing pages with an > >>>>>>> active claim means that the claim gets bigger (which naturally needs > >>>>>>> reflecting in the global). > >>>>>> domain_adjust_tot_pages() is not exclusively called when freeing > >>>>>> pages, see steal_page() for example. > >>>>> Or also when pages were allocated. steal_page() ... > >>>>> > >>>>>> When called from steal_page() it's wrong to increase the claim, as > >>>>>> it assumes that the page removed from d->tot_pages is freed, but > >>>>>> that's not the case. The domain might end up in a situation where > >>>>>> the claim is bigger than the available amount of memory. > >>>>> ... is a case that likely wasn't considered when the feature was added. > >>>>> > >>>>> I never really liked this; I'd be quite happy to see it ripped out, as > >>>>> long as we'd be reasonably certain it isn't in active use by people. > >>>> What do you mean with 'it' in the above sentence, the whole claim > >>>> stuff? > >>> Yes. > >>> > >>>> Or just getting rid of allowing the claim to increase as a > >>>> result of pages being removed from a domain? > >>> No. > >> Alejandro and I discussed this earlier in the week. > >> > >> The claim infrastructure stuff is critical for a toolstack capable of > >> doing things in parallel. > >> > >> However, it is also nonsensical for there to be a remaining claim by the > >> time domain construction is done. > > I'm not entirely sure about this. Iirc it was the tmem work where this was > > added, and then pretty certainly it was needed also for already running > > domains. > > It wasn't TMEM. It was generally large-memory VMs. > > The problem is if you've got 2x 2T VMs booting on a 3T system. > > Previously, you'd start building both of them, and minutes later they > both fail because Xen was fully out of memory. > > Claim was introduced to atomically reserve the memory you were intending > to build a domain with. > > For XenServer, we're working on NUMA fixes, and something that's > important there is to be able to reserve memory on a specific NUMA node > (hence why this is all getting looked at). > >> If creation_finished were a concrete thing, rather than a bodge hacked > >> into domain_unpause_by_systemcontroller(), it ought to be made to fail > >> if there were an outstanding claim. I suggested that we follow through > >> on a previous suggestion of making it a real hypercall (which is needed > >> by the encrypted VM crowd too). > > Rather than failing we could simply zap the leftover? > > Hmm. Perhaps. > > I'd be slightly wary about zapping a claim, but there should only be an > outstanding claim if the toolstack did something wrong. > > OTOH, we absolutely definitely do need a real hypercall here at some > point soon. > > ~Andrew It should probably be removed at the end. Otherwise we're effectively leaking all claimed but non-used memory for the lifetime of the domain. Cheers, Alejandro
On Wed Feb 26, 2025 at 2:28 PM GMT, Roger Pau Monné wrote: > On Wed, Feb 26, 2025 at 03:08:33PM +0100, Jan Beulich wrote: > > On 26.02.2025 14:56, Roger Pau Monné wrote: > > > On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote: > > >> --- a/xen/common/page_alloc.c > > >> +++ b/xen/common/page_alloc.c > > >> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */ > > >> > > >> unsigned long domain_adjust_tot_pages(struct domain *d, long pages) > > >> { > > >> - long dom_before, dom_after, dom_claimed, sys_before, sys_after; > > >> - > > >> ASSERT(rspin_is_locked(&d->page_alloc_lock)); > > >> d->tot_pages += pages; > > >> > > >> /* > > >> - * can test d->claimed_pages race-free because it can only change > > >> + * can test d->outstanding_pages race-free because it can only change > > >> * if d->page_alloc_lock and heap_lock are both held, see also > > >> * domain_set_outstanding_pages below > > >> */ > > >> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages) > > >> goto out; > > > > > > I think you can probably short-circuit the logic below if pages == 0? > > > (and avoid taking the heap_lock) > > > > Are there callers passing in 0? > > Not sure, but if there are no callers expected we might add an ASSERT > to that effect then. > > > >> spin_lock(&heap_lock); > > >> - /* adjust domain outstanding pages; may not go negative */ > > >> - dom_before = d->outstanding_pages; > > >> - dom_after = dom_before - pages; > > >> - BUG_ON(dom_before < 0); > > >> - dom_claimed = dom_after < 0 ? 0 : dom_after; > > >> - d->outstanding_pages = dom_claimed; > > >> - /* flag accounting bug if system outstanding_claims would go negative */ > > >> - sys_before = outstanding_claims; > > >> - sys_after = sys_before - (dom_before - dom_claimed); > > >> - BUG_ON(sys_after < 0); > > >> - outstanding_claims = sys_after; > > >> + BUG_ON(outstanding_claims < d->outstanding_pages); > > >> + if ( pages > 0 && d->outstanding_pages < pages ) > > >> + { > > >> + /* `pages` exceeds the domain's outstanding count. Zero it out. */ > > >> + outstanding_claims -= d->outstanding_pages; > > >> + d->outstanding_pages = 0; > > >> + } else { > > >> + outstanding_claims -= pages; > > >> + d->outstanding_pages -= pages; > > > > > > I wonder if it's intentional for a pages < 0 value to modify > > > outstanding_claims and d->outstanding_pages, I think those values > > > should only be set from domain_set_outstanding_pages(). > > > domain_adjust_tot_pages() should only decrease the value, but never > > > increase either outstanding_claims or d->outstanding_pages. > > > > > > At best the behavior is inconsistent, because once > > > d->outstanding_pages reaches 0 there will be no further modification > > > from domain_adjust_tot_pages(). > > > > Right, at that point the claim has run out. While freeing pages with an > > active claim means that the claim gets bigger (which naturally needs > > reflecting in the global). > > domain_adjust_tot_pages() is not exclusively called when freeing > pages, see steal_page() for example. > > When called from steal_page() it's wrong to increase the claim, as > it assumes that the page removed from d->tot_pages is freed, but > that's not the case. The domain might end up in a situation where > the claim is bigger than the available amount of memory. > > Thanks, Roger. This is what I meant by my initial reply questioning the logic itself. It's all very dubious with memory_exchange and makes very little sense on the tentative code I have for per-node claims. I'd be quite happy to put an early exit before the spin_lock on pages <= 0. That also covers your initial comment and prevents claims from growing after a domain started running if it didn't happen to consume all of them. Is anyone opposed? Cheers, Alejandro
On Wed Feb 26, 2025 at 2:02 PM GMT, Jan Beulich wrote: > On 24.02.2025 14:27, Alejandro Vallejo wrote: > > @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages) > > goto out; > > > > spin_lock(&heap_lock); > > - /* adjust domain outstanding pages; may not go negative */ > > - dom_before = d->outstanding_pages; > > - dom_after = dom_before - pages; > > - BUG_ON(dom_before < 0); > > - dom_claimed = dom_after < 0 ? 0 : dom_after; > > - d->outstanding_pages = dom_claimed; > > - /* flag accounting bug if system outstanding_claims would go negative */ > > - sys_before = outstanding_claims; > > - sys_after = sys_before - (dom_before - dom_claimed); > > - BUG_ON(sys_after < 0); > > - outstanding_claims = sys_after; > > + BUG_ON(outstanding_claims < d->outstanding_pages); > > + if ( pages > 0 && d->outstanding_pages < pages ) > > The lhs isn't needed, is it? d->outstanding_pages is an unsigned quantity, > after all. Else dropping the earlier of the two BUG_ON() wouldn't be quite > right. d->outstanding pages is unsigned, but pages isn't. It was originally like that, but I then got concerned about 32bit machines, where you'd be comparing a signed and an unsigned integer of the same not-very-large width. That seems like dangerous terrains if the unsigned number grows large enough. TL;DR: It's there for clarity and paranoia. Even if the overflowing into bit 31 would be rare in such a system. > > > + { > > + /* `pages` exceeds the domain's outstanding count. Zero it out. */ > > + outstanding_claims -= d->outstanding_pages; > > + d->outstanding_pages = 0; > > + } else { > > Nit: Braces on their own lines please. Ack. > > In any event - yes, this reads quite a bit easier after the adjustment. > > With the adjustments (happy to make while committing, so long as you agree) > Reviewed-by: Jan Beulich <jbeulich@suse.com> Thanks. I'd probably like to hold off and send a v2 if you're fine with the adjustment I answered Roger with (returning ealy on pages <= 0, so claims are never increased on free). > > Jan > Cheers, Alejandro
On 27.02.2025 15:36, Alejandro Vallejo wrote: > On Wed Feb 26, 2025 at 2:05 PM GMT, Jan Beulich wrote: >> On 24.02.2025 15:49, Alejandro Vallejo wrote: >>> Open question to whoever reviews this... >>> >>> On Mon Feb 24, 2025 at 1:27 PM GMT, Alejandro Vallejo wrote: >>>> spin_lock(&heap_lock); >>>> - /* adjust domain outstanding pages; may not go negative */ >>>> - dom_before = d->outstanding_pages; >>>> - dom_after = dom_before - pages; >>>> - BUG_ON(dom_before < 0); >>>> - dom_claimed = dom_after < 0 ? 0 : dom_after; >>>> - d->outstanding_pages = dom_claimed; >>>> - /* flag accounting bug if system outstanding_claims would go negative */ >>>> - sys_before = outstanding_claims; >>>> - sys_after = sys_before - (dom_before - dom_claimed); >>>> - BUG_ON(sys_after < 0); >>>> - outstanding_claims = sys_after; >>>> + BUG_ON(outstanding_claims < d->outstanding_pages); >>>> + if ( pages > 0 && d->outstanding_pages < pages ) >>>> + { >>>> + /* `pages` exceeds the domain's outstanding count. Zero it out. */ >>>> + outstanding_claims -= d->outstanding_pages; >>>> + d->outstanding_pages = 0; >>> >>> While this matches the previous behaviour, do we _really_ want it? It's weird, >>> quirky, and it hard to extend to NUMA-aware claims (which is something in >>> midway through). >>> >>> Wouldn't it make sense to fail the allocation (earlier) if the claim has run >>> out? Do we even expect this to ever happen this late in the allocation call >>> chain? >> >> This goes back to what a "claim" means. Even without any claim, a domain may >> allocate memory. So a claim having run out doesn't imply allocation has to >> fail. > > Hmmm... but that violates the purpose of the claim infra as far as I understand > it. If a domain may overallocate by (e.g) ballooning in memory it can distort the > ability of another domain to start up, even if it succeeded in its own claim. Why would that be? As long as we hold back enough memory to cover the claim, it shouldn't matter what kind of allocation we want to process. I'd say that a PV guest starting ballooned ought to be able to deflate its balloon as far as there was a claim established for it up front. > We might also break the invariant that total claims are strictly >= > total_avail_pages. Same here - I don't see why this would happen as long as all accounting is working correctly. > I'm somewhat puzzled at the "why" of having separate concepts for max_mem and > claims. I guess it simply grew the way it did. Reinstating sanity would > probably involve making max_mem effectively the claim, but that's a ton of > work I really would rather not do for now. To me the two are different (beyond claim being global while max-mem is per- domain). max-mem is a hard boundary (beyond which allocations _will_ fail), whereas a claim is a softer one, beyond which allocations _may_ fail. Jan
On 27.02.2025 15:50, Alejandro Vallejo wrote: > On Wed Feb 26, 2025 at 2:28 PM GMT, Roger Pau Monné wrote: >> On Wed, Feb 26, 2025 at 03:08:33PM +0100, Jan Beulich wrote: >>> On 26.02.2025 14:56, Roger Pau Monné wrote: >>>> On Mon, Feb 24, 2025 at 01:27:24PM +0000, Alejandro Vallejo wrote: >>>>> --- a/xen/common/page_alloc.c >>>>> +++ b/xen/common/page_alloc.c >>>>> @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */ >>>>> >>>>> unsigned long domain_adjust_tot_pages(struct domain *d, long pages) >>>>> { >>>>> - long dom_before, dom_after, dom_claimed, sys_before, sys_after; >>>>> - >>>>> ASSERT(rspin_is_locked(&d->page_alloc_lock)); >>>>> d->tot_pages += pages; >>>>> >>>>> /* >>>>> - * can test d->claimed_pages race-free because it can only change >>>>> + * can test d->outstanding_pages race-free because it can only change >>>>> * if d->page_alloc_lock and heap_lock are both held, see also >>>>> * domain_set_outstanding_pages below >>>>> */ >>>>> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages) >>>>> goto out; >>>> >>>> I think you can probably short-circuit the logic below if pages == 0? >>>> (and avoid taking the heap_lock) >>> >>> Are there callers passing in 0? >> >> Not sure, but if there are no callers expected we might add an ASSERT >> to that effect then. >> >>>>> spin_lock(&heap_lock); >>>>> - /* adjust domain outstanding pages; may not go negative */ >>>>> - dom_before = d->outstanding_pages; >>>>> - dom_after = dom_before - pages; >>>>> - BUG_ON(dom_before < 0); >>>>> - dom_claimed = dom_after < 0 ? 0 : dom_after; >>>>> - d->outstanding_pages = dom_claimed; >>>>> - /* flag accounting bug if system outstanding_claims would go negative */ >>>>> - sys_before = outstanding_claims; >>>>> - sys_after = sys_before - (dom_before - dom_claimed); >>>>> - BUG_ON(sys_after < 0); >>>>> - outstanding_claims = sys_after; >>>>> + BUG_ON(outstanding_claims < d->outstanding_pages); >>>>> + if ( pages > 0 && d->outstanding_pages < pages ) >>>>> + { >>>>> + /* `pages` exceeds the domain's outstanding count. Zero it out. */ >>>>> + outstanding_claims -= d->outstanding_pages; >>>>> + d->outstanding_pages = 0; >>>>> + } else { >>>>> + outstanding_claims -= pages; >>>>> + d->outstanding_pages -= pages; >>>> >>>> I wonder if it's intentional for a pages < 0 value to modify >>>> outstanding_claims and d->outstanding_pages, I think those values >>>> should only be set from domain_set_outstanding_pages(). >>>> domain_adjust_tot_pages() should only decrease the value, but never >>>> increase either outstanding_claims or d->outstanding_pages. >>>> >>>> At best the behavior is inconsistent, because once >>>> d->outstanding_pages reaches 0 there will be no further modification >>>> from domain_adjust_tot_pages(). >>> >>> Right, at that point the claim has run out. While freeing pages with an >>> active claim means that the claim gets bigger (which naturally needs >>> reflecting in the global). >> >> domain_adjust_tot_pages() is not exclusively called when freeing >> pages, see steal_page() for example. >> >> When called from steal_page() it's wrong to increase the claim, as >> it assumes that the page removed from d->tot_pages is freed, but >> that's not the case. The domain might end up in a situation where >> the claim is bigger than the available amount of memory. >> >> Thanks, Roger. > > This is what I meant by my initial reply questioning the logic itself. > > It's all very dubious with memory_exchange and makes very little sense on the > tentative code I have for per-node claims. > > I'd be quite happy to put an early exit before the spin_lock on pages <= 0. > That also covers your initial comment and prevents claims from growing after a > domain started running if it didn't happen to consume all of them. > > Is anyone opposed? We first need to reach common understanding what a claim is (or is not). See the other reply just sent. Jan
On 27.02.2025 15:59, Alejandro Vallejo wrote: > On Wed Feb 26, 2025 at 2:02 PM GMT, Jan Beulich wrote: >> On 24.02.2025 14:27, Alejandro Vallejo wrote: >>> @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages) >>> goto out; >>> >>> spin_lock(&heap_lock); >>> - /* adjust domain outstanding pages; may not go negative */ >>> - dom_before = d->outstanding_pages; >>> - dom_after = dom_before - pages; >>> - BUG_ON(dom_before < 0); >>> - dom_claimed = dom_after < 0 ? 0 : dom_after; >>> - d->outstanding_pages = dom_claimed; >>> - /* flag accounting bug if system outstanding_claims would go negative */ >>> - sys_before = outstanding_claims; >>> - sys_after = sys_before - (dom_before - dom_claimed); >>> - BUG_ON(sys_after < 0); >>> - outstanding_claims = sys_after; >>> + BUG_ON(outstanding_claims < d->outstanding_pages); >>> + if ( pages > 0 && d->outstanding_pages < pages ) >> >> The lhs isn't needed, is it? d->outstanding_pages is an unsigned quantity, >> after all. Else dropping the earlier of the two BUG_ON() wouldn't be quite >> right. > > d->outstanding pages is unsigned, but pages isn't. > > It was originally like that, but I then got concerned about 32bit machines, > where you'd be comparing a signed and an unsigned integer of the same > not-very-large width. That seems like dangerous terrains if the unsigned number > grows large enough. Oh, indeed - I didn't take ILP32 arches into account. Jan
On Wed Mar 5, 2025 at 10:49 AM GMT, Jan Beulich wrote: > On 27.02.2025 15:36, Alejandro Vallejo wrote: > > On Wed Feb 26, 2025 at 2:05 PM GMT, Jan Beulich wrote: > >> On 24.02.2025 15:49, Alejandro Vallejo wrote: > >>> Open question to whoever reviews this... > >>> > >>> On Mon Feb 24, 2025 at 1:27 PM GMT, Alejandro Vallejo wrote: > >>>> spin_lock(&heap_lock); > >>>> - /* adjust domain outstanding pages; may not go negative */ > >>>> - dom_before = d->outstanding_pages; > >>>> - dom_after = dom_before - pages; > >>>> - BUG_ON(dom_before < 0); > >>>> - dom_claimed = dom_after < 0 ? 0 : dom_after; > >>>> - d->outstanding_pages = dom_claimed; > >>>> - /* flag accounting bug if system outstanding_claims would go negative */ > >>>> - sys_before = outstanding_claims; > >>>> - sys_after = sys_before - (dom_before - dom_claimed); > >>>> - BUG_ON(sys_after < 0); > >>>> - outstanding_claims = sys_after; > >>>> + BUG_ON(outstanding_claims < d->outstanding_pages); > >>>> + if ( pages > 0 && d->outstanding_pages < pages ) > >>>> + { > >>>> + /* `pages` exceeds the domain's outstanding count. Zero it out. */ > >>>> + outstanding_claims -= d->outstanding_pages; > >>>> + d->outstanding_pages = 0; > >>> > >>> While this matches the previous behaviour, do we _really_ want it? It's weird, > >>> quirky, and it hard to extend to NUMA-aware claims (which is something in > >>> midway through). > >>> > >>> Wouldn't it make sense to fail the allocation (earlier) if the claim has run > >>> out? Do we even expect this to ever happen this late in the allocation call > >>> chain? > >> > >> This goes back to what a "claim" means. Even without any claim, a domain may > >> allocate memory. So a claim having run out doesn't imply allocation has to > >> fail. > > > > Hmmm... but that violates the purpose of the claim infra as far as I understand > > it. If a domain may overallocate by (e.g) ballooning in memory it can distort the > > ability of another domain to start up, even if it succeeded in its own claim. > > Why would that be? As long as we hold back enough memory to cover the claim, it > shouldn't matter what kind of allocation we want to process. I'd say that a PV > guest starting ballooned ought to be able to deflate its balloon as far as > there was a claim established for it up front. The fact a domain allocated something does not mean it had it claimed before. A claim is a restriction to others' ability to allocate/claim, not to the domain that made the claim. e.g: 0. host is idle. No domU. 1. dom1 is created with a claim to 10% of host memory and max_mem of 80% of host meomory. 2. dom1 balloons in 70% of host memory. 3. dom1 ballons out 70% of host memory. 4. dom1 now holds a a claim to 80% of host memory. It's all quite perverse. Fortunately, looking at adjacent claims-related code xl seems to default to making a claim prior to populating the physmap and cancelling the claim at the end of the meminit() hook so this is never a real problem. This tells me that the logic intent is to force early failure of populate_physmap and nothing else. It's never active by the time ballooning or memory exchange matter at all. Xen ought to cancel the claim by itself though, toolstack should not NEED to do it. And furthermore, the fact that the claim goes up and down interacts really poorly with the per-node claims I want to implement, because it's just impossible to know if a freed page actually comes from a prior decrease of a claim or not. > > > We might also break the invariant that total claims are strictly >= > > total_avail_pages. > > Same here - I don't see why this would happen as long as all accounting is > working correctly. See previous example. > > > I'm somewhat puzzled at the "why" of having separate concepts for max_mem and > > claims. I guess it simply grew the way it did. Reinstating sanity would > > probably involve making max_mem effectively the claim, but that's a ton of > > work I really would rather not do for now. > > To me the two are different (beyond claim being global while max-mem is per- > domain). max-mem is a hard boundary (beyond which allocations _will_ fail), > whereas a claim is a softer one, beyond which allocations _may_ fail. > > Jan What I meant is that I'd rather have a non-oversubscribed memory model by which I can statically partition my memory and make 100% sure the sum of all max_mem of every existing domain never exceeds host capacity. But this is neither the intent of the existing claim infra nor what I'm after right now. I was just thinking out loud in text form. :) Cheers, Alejandro
On 05.03.2025 14:22, Alejandro Vallejo wrote: > On Wed Mar 5, 2025 at 10:49 AM GMT, Jan Beulich wrote: >> On 27.02.2025 15:36, Alejandro Vallejo wrote: >>> On Wed Feb 26, 2025 at 2:05 PM GMT, Jan Beulich wrote: >>>> On 24.02.2025 15:49, Alejandro Vallejo wrote: >>>>> Open question to whoever reviews this... >>>>> >>>>> On Mon Feb 24, 2025 at 1:27 PM GMT, Alejandro Vallejo wrote: >>>>>> spin_lock(&heap_lock); >>>>>> - /* adjust domain outstanding pages; may not go negative */ >>>>>> - dom_before = d->outstanding_pages; >>>>>> - dom_after = dom_before - pages; >>>>>> - BUG_ON(dom_before < 0); >>>>>> - dom_claimed = dom_after < 0 ? 0 : dom_after; >>>>>> - d->outstanding_pages = dom_claimed; >>>>>> - /* flag accounting bug if system outstanding_claims would go negative */ >>>>>> - sys_before = outstanding_claims; >>>>>> - sys_after = sys_before - (dom_before - dom_claimed); >>>>>> - BUG_ON(sys_after < 0); >>>>>> - outstanding_claims = sys_after; >>>>>> + BUG_ON(outstanding_claims < d->outstanding_pages); >>>>>> + if ( pages > 0 && d->outstanding_pages < pages ) >>>>>> + { >>>>>> + /* `pages` exceeds the domain's outstanding count. Zero it out. */ >>>>>> + outstanding_claims -= d->outstanding_pages; >>>>>> + d->outstanding_pages = 0; >>>>> >>>>> While this matches the previous behaviour, do we _really_ want it? It's weird, >>>>> quirky, and it hard to extend to NUMA-aware claims (which is something in >>>>> midway through). >>>>> >>>>> Wouldn't it make sense to fail the allocation (earlier) if the claim has run >>>>> out? Do we even expect this to ever happen this late in the allocation call >>>>> chain? >>>> >>>> This goes back to what a "claim" means. Even without any claim, a domain may >>>> allocate memory. So a claim having run out doesn't imply allocation has to >>>> fail. >>> >>> Hmmm... but that violates the purpose of the claim infra as far as I understand >>> it. If a domain may overallocate by (e.g) ballooning in memory it can distort the >>> ability of another domain to start up, even if it succeeded in its own claim. >> >> Why would that be? As long as we hold back enough memory to cover the claim, it >> shouldn't matter what kind of allocation we want to process. I'd say that a PV >> guest starting ballooned ought to be able to deflate its balloon as far as >> there was a claim established for it up front. > > The fact a domain allocated something does not mean it had it claimed before. A > claim is a restriction to others' ability to allocate/claim, not to the domain > that made the claim. Yes and no. No in so far as the target's "ability to allocate" may or may not be meant to extend beyond domain creation. > e.g: > > 0. host is idle. No domU. > 1. dom1 is created with a claim to 10% of host memory and max_mem of 80% of > host meomory. > 2. dom1 balloons in 70% of host memory. > 3. dom1 ballons out 70% of host memory. > 4. dom1 now holds a a claim to 80% of host memory. Sure, this shouldn't be the result. Yet that may merely be an effect of claim accounting being insufficient. > It's all quite perverse. Fortunately, looking at adjacent claims-related code > xl seems to default to making a claim prior to populating the physmap and > cancelling the claim at the end of the meminit() hook so this is never a real > problem. > > This tells me that the logic intent is to force early failure of > populate_physmap and nothing else. It's never active by the time ballooning or > memory exchange matter at all. Ah yes, this I find more convincing. (Oddly enough this is all x86-only code.) > Xen ought to cancel the claim by itself though, toolstack should not NEED to do > it. Fundamentally yes. Except that the toolstack can do it earlier than Xen could. Jan
On Wed Mar 5, 2025 at 1:39 PM GMT, Jan Beulich wrote: > On 05.03.2025 14:22, Alejandro Vallejo wrote: > > On Wed Mar 5, 2025 at 10:49 AM GMT, Jan Beulich wrote: > >> On 27.02.2025 15:36, Alejandro Vallejo wrote: > >>> On Wed Feb 26, 2025 at 2:05 PM GMT, Jan Beulich wrote: > >>>> On 24.02.2025 15:49, Alejandro Vallejo wrote: > >>>>> Open question to whoever reviews this... > >>>>> > >>>>> On Mon Feb 24, 2025 at 1:27 PM GMT, Alejandro Vallejo wrote: > >>>>>> spin_lock(&heap_lock); > >>>>>> - /* adjust domain outstanding pages; may not go negative */ > >>>>>> - dom_before = d->outstanding_pages; > >>>>>> - dom_after = dom_before - pages; > >>>>>> - BUG_ON(dom_before < 0); > >>>>>> - dom_claimed = dom_after < 0 ? 0 : dom_after; > >>>>>> - d->outstanding_pages = dom_claimed; > >>>>>> - /* flag accounting bug if system outstanding_claims would go negative */ > >>>>>> - sys_before = outstanding_claims; > >>>>>> - sys_after = sys_before - (dom_before - dom_claimed); > >>>>>> - BUG_ON(sys_after < 0); > >>>>>> - outstanding_claims = sys_after; > >>>>>> + BUG_ON(outstanding_claims < d->outstanding_pages); > >>>>>> + if ( pages > 0 && d->outstanding_pages < pages ) > >>>>>> + { > >>>>>> + /* `pages` exceeds the domain's outstanding count. Zero it out. */ > >>>>>> + outstanding_claims -= d->outstanding_pages; > >>>>>> + d->outstanding_pages = 0; > >>>>> > >>>>> While this matches the previous behaviour, do we _really_ want it? It's weird, > >>>>> quirky, and it hard to extend to NUMA-aware claims (which is something in > >>>>> midway through). > >>>>> > >>>>> Wouldn't it make sense to fail the allocation (earlier) if the claim has run > >>>>> out? Do we even expect this to ever happen this late in the allocation call > >>>>> chain? > >>>> > >>>> This goes back to what a "claim" means. Even without any claim, a domain may > >>>> allocate memory. So a claim having run out doesn't imply allocation has to > >>>> fail. > >>> > >>> Hmmm... but that violates the purpose of the claim infra as far as I understand > >>> it. If a domain may overallocate by (e.g) ballooning in memory it can distort the > >>> ability of another domain to start up, even if it succeeded in its own claim. > >> > >> Why would that be? As long as we hold back enough memory to cover the claim, it > >> shouldn't matter what kind of allocation we want to process. I'd say that a PV > >> guest starting ballooned ought to be able to deflate its balloon as far as > >> there was a claim established for it up front. > > > > The fact a domain allocated something does not mean it had it claimed before. A > > claim is a restriction to others' ability to allocate/claim, not to the domain > > that made the claim. > > Yes and no. No in so far as the target's "ability to allocate" may or may not > be meant to extend beyond domain creation. > > > e.g: > > > > 0. host is idle. No domU. > > 1. dom1 is created with a claim to 10% of host memory and max_mem of 80% of > > host meomory. > > 2. dom1 balloons in 70% of host memory. > > 3. dom1 ballons out 70% of host memory. > > 4. dom1 now holds a a claim to 80% of host memory. > > Sure, this shouldn't be the result. Yet that may merely be an effect of claim > accounting being insufficient. > > > It's all quite perverse. Fortunately, looking at adjacent claims-related code > > xl seems to default to making a claim prior to populating the physmap and > > cancelling the claim at the end of the meminit() hook so this is never a real > > problem. > > > > This tells me that the logic intent is to force early failure of > > populate_physmap and nothing else. It's never active by the time ballooning or > > memory exchange matter at all. > > Ah yes, this I find more convincing. (Oddly enough this is all x86-only code.) (about claims being an x86-ism in toolstack). Yeah, that's weird. It's should be moved in time to a common area of xg. I guess no other arch has cared about bootstorms or massive VMs just yet. > > > Xen ought to cancel the claim by itself though, toolstack should not NEED to do > > it. > > Fundamentally yes. Except that the toolstack can do it earlier than Xen could. > > Jan Yes, in the sense that it can do it just after it's done populating the physmap, but Xen should still zero it at the end in case toolstack hasn't done so. Nothing good can come out of that counter going up and down in such a dubious fashion. Cheers, Alejandro
On Wed Mar 5, 2025 at 1:39 PM GMT, Jan Beulich wrote: > > It's all quite perverse. Fortunately, looking at adjacent claims-related code > > xl seems to default to making a claim prior to populating the physmap and > > cancelling the claim at the end of the meminit() hook so this is never a real > > problem. > > > > This tells me that the logic intent is to force early failure of > > populate_physmap and nothing else. It's never active by the time ballooning or > > memory exchange matter at all. > > Ah yes, this I find more convincing. (Oddly enough this is all x86-only code.) Should I take this as an "ack" to the general plan of early returning on pages <=0? I have a series pending that relies on it (the v2 of this[1]). And would rather defer its sending until this one get some form of nod. Otherwise I'll integrate it in the other series so I can at least reduce remove dependencies between things in-flight. Cheers, Alejandro [1] https://lore.kernel.org/xen-devel/20250304111000.9252-1-alejandro.vallejo@cloud.com/
On 11.03.2025 10:46, Alejandro Vallejo wrote: > On Wed Mar 5, 2025 at 1:39 PM GMT, Jan Beulich wrote: >>> It's all quite perverse. Fortunately, looking at adjacent claims-related code >>> xl seems to default to making a claim prior to populating the physmap and >>> cancelling the claim at the end of the meminit() hook so this is never a real >>> problem. >>> >>> This tells me that the logic intent is to force early failure of >>> populate_physmap and nothing else. It's never active by the time ballooning or >>> memory exchange matter at all. >> >> Ah yes, this I find more convincing. (Oddly enough this is all x86-only code.) > > Should I take this as an "ack" to the general plan of early returning on pages > <=0? I have a series pending that relies on it (the v2 of this[1]). Not an ack, but an "I can accept this, but someone else will need to ack it". Jan > And would > rather defer its sending until this one get some form of nod. Otherwise I'll > integrate it in the other series so I can at least reduce remove dependencies > between things in-flight. > > Cheers, > Alejandro > > [1] https://lore.kernel.org/xen-devel/20250304111000.9252-1-alejandro.vallejo@cloud.com/
diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c index 1bf070c8c5df..4306753f0bd2 100644 --- a/xen/common/page_alloc.c +++ b/xen/common/page_alloc.c @@ -490,13 +490,11 @@ static long outstanding_claims; /* total outstanding claims by all domains */ unsigned long domain_adjust_tot_pages(struct domain *d, long pages) { - long dom_before, dom_after, dom_claimed, sys_before, sys_after; - ASSERT(rspin_is_locked(&d->page_alloc_lock)); d->tot_pages += pages; /* - * can test d->claimed_pages race-free because it can only change + * can test d->outstanding_pages race-free because it can only change * if d->page_alloc_lock and heap_lock are both held, see also * domain_set_outstanding_pages below */ @@ -504,17 +502,16 @@ unsigned long domain_adjust_tot_pages(struct domain *d, long pages) goto out; spin_lock(&heap_lock); - /* adjust domain outstanding pages; may not go negative */ - dom_before = d->outstanding_pages; - dom_after = dom_before - pages; - BUG_ON(dom_before < 0); - dom_claimed = dom_after < 0 ? 0 : dom_after; - d->outstanding_pages = dom_claimed; - /* flag accounting bug if system outstanding_claims would go negative */ - sys_before = outstanding_claims; - sys_after = sys_before - (dom_before - dom_claimed); - BUG_ON(sys_after < 0); - outstanding_claims = sys_after; + BUG_ON(outstanding_claims < d->outstanding_pages); + if ( pages > 0 && d->outstanding_pages < pages ) + { + /* `pages` exceeds the domain's outstanding count. Zero it out. */ + outstanding_claims -= d->outstanding_pages; + d->outstanding_pages = 0; + } else { + outstanding_claims -= pages; + d->outstanding_pages -= pages; + } spin_unlock(&heap_lock); out:
The logic has too many levels of indirection and it's very hard to understand it its current form. Split it between the corner case where the adjustment is bigger than the current claim and the rest to avoid 5 auxiliary variables. While at it, fix incorrect field name in nearby comment. Not a functional change (although by its nature it might not seem immediately obvious at first). Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com> --- xen/common/page_alloc.c | 25 +++++++++++-------------- 1 file changed, 11 insertions(+), 14 deletions(-)