[v3,01/10] page-alloc: Clamp get_free_buddy() to online nodes

Message ID 20190729121204.13559-2-andrew.cooper3@citrix.com (mailing list archive)
State New, archived
Series xen/nodemask: API cleanup and fixes

Commit Message

Andrew Cooper July 29, 2019, 12:11 p.m. UTC
d->node_affinity defaults to NODE_MASK_ALL, which has bits set outside of
node_online_map.  This in turn causes the loop in get_free_buddy() to waste
effort iterating over offline nodes.

Always clamp d->node_affinity to node_online_map when in use.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Wei Liu <wl@xen.org>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Julien Grall <julien.grall@arm.com>
CC: George Dunlap <george.dunlap@eu.citrix.com>

v3:
 * Rebase to before the nodemask API cleanup, to make it easier to backport.
v2:
 * Rebase over the nodemask API change, and implement with a single
   nodes_and()
---
 xen/common/page_alloc.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)
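
For illustration, the clamp is a plain intersection of the two masks.  A
minimal sketch using the Xen nodemask API (node numbers are made up for the
example):

    nodemask_t affinity = NODE_MASK_ALL;    /* d->node_affinity default */
    nodemask_t nodemask = node_online_map;  /* say only nodes 0 and 1 online */

    /* Intersect: any affinity bits outside node_online_map are dropped. */
    nodes_and(nodemask, nodemask, affinity);

    /* nodemask now covers online nodes only, so the get_free_buddy()
     * loop no longer visits offline nodes. */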

Comments

Jan Beulich July 29, 2019, 3:48 p.m. UTC | #1
On 29.07.2019 14:11, Andrew Cooper wrote:
> --- a/xen/common/page_alloc.c
> +++ b/xen/common/page_alloc.c
> @@ -811,11 +811,18 @@ static struct page_info *get_free_buddy(unsigned int zone_lo,
>                                           const struct domain *d)
>   {
>       nodeid_t first, node = MEMF_get_node(memflags), req_node = node;
> -    nodemask_t nodemask = d ? d->node_affinity : node_online_map;
> +    nodemask_t nodemask = node_online_map;
>       unsigned int j, zone, nodemask_retry = 0;
>       struct page_info *pg;
>       bool use_unscrubbed = (memflags & MEMF_no_scrub);
>   
> +    /*
> +     * d->node_affinity is our preferred allocation set if provided, but it
> +     * may have bit set outside of node_online_map.  Clamp it.

Would you mind adding the apparently missing "s" to "bit"?

> +     */
> +    if ( d )
> +        nodes_and(nodemask, nodemask, d->node_affinity);

Despite my earlier ack: Code further down assumes a non-empty mask,
which is no longer guaranteed afaics. I think you want to append an
"intersects" check in the if(). With that feel free to promote my
A-b to R-b.

Jan
Andrew Cooper July 29, 2019, 5:26 p.m. UTC | #2
On 29/07/2019 16:48, Jan Beulich wrote:
> On 29.07.2019 14:11, Andrew Cooper wrote:
>> --- a/xen/common/page_alloc.c
>> +++ b/xen/common/page_alloc.c
>> @@ -811,11 +811,18 @@ static struct page_info *get_free_buddy(unsigned int zone_lo,
>>                                           const struct domain *d)
>>   {
>>       nodeid_t first, node = MEMF_get_node(memflags), req_node = node;
>> -    nodemask_t nodemask = d ? d->node_affinity : node_online_map;
>> +    nodemask_t nodemask = node_online_map;
>>       unsigned int j, zone, nodemask_retry = 0;
>>       struct page_info *pg;
>>       bool use_unscrubbed = (memflags & MEMF_no_scrub);
>>   
>> +    /*
>> +     * d->node_affinity is our preferred allocation set if provided, but it
>> +     * may have bit set outside of node_online_map.  Clamp it.
> Would you mind adding the apparently missing "s" to "bit"?

Oops yes.

>
>> +     */
>> +    if ( d )
>> +        nodes_and(nodemask, nodemask, d->node_affinity);
> Despite my earlier ack: Code further down assumes a non-empty mask,
> which is no longer guaranteed afaics.

Nothing previously guaranteed that d->node_affinity had any bits set in
it, either.

That said, in practice it is either ALL, or something derived from the
cpu=>node mappings, so I don't think this is a problem in practice.

> I think you want to append an
> "intersects" check in the if().

I think it would be better to assert that callers don't give us complete
junk.

> With that feel free to promote my
> A-b to R-b.

How about:

    if ( d )
    {
        if ( nodes_intersects(nodemask, d->node_affinity) )
            nodes_and(nodemask, nodemask, d->node_affinity);
        else
            ASSERT_UNREACHABLE();
    }

?

This change has passed my normal set of prepush checks (not that
there is anything interesting NUMA-wise in there).

~Andrew
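
A standalone sketch of what that guard catches: nodes_intersects() tests the
AND without modifying either mask, so it can act as a predicate in front of
the destructive nodes_and().  Node numbers are invented for the example:

    nodemask_t online = NODE_MASK_NONE, affinity = NODE_MASK_NONE;

    node_set(0, online);
    node_set(1, online);      /* nodes 0 and 1 online */
    node_set(3, affinity);    /* affinity names only the offline node 3 */

    if ( nodes_intersects(online, affinity) )  /* false in this example */
        nodes_and(online, online, affinity);
    else
        ASSERT_UNREACHABLE();  /* debug builds flag the junk caller */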
Jan Beulich July 30, 2019, 8:09 a.m. UTC | #3
On 29.07.2019 19:26, Andrew Cooper wrote:
> On 29/07/2019 16:48, Jan Beulich wrote:
>> On 29.07.2019 14:11, Andrew Cooper wrote:
>>> +    if ( d )
>>> +        nodes_and(nodemask, nodemask, d->node_affinity);
>> Despite my earlier ack: Code further down assumes a non-empty mask,
>> which is no longer guaranteed afaics.
> 
> Nothing previously guaranteed that d->node_affinity had any bits set in
> it, either.
> 
> That said, in practice it is either ALL, or something derived from the
> cpu=>node mappings, so I don't think this is a problem in practice.
> 
>> I think you want to append an
>> "intersects" check in the if().
> 
> I think it would be better to assert that callers don't give us complete
> junk.
> 
>> With that feel free to promote my
>> A-b to R-b.
> 
> How about:
> 
>      if ( d )
>      {
>          if ( nodes_intersects(nodemask, d->node_affinity) )
>              nodes_and(nodemask, nodemask, d->node_affinity);
>          else
>              ASSERT_UNREACHABLE();
>      }
> 
> ?
> 
> This change has passed my normal set of prepush checks (not that
> there is anything interesting NUMA-wise in there).

domain_update_node_affinity() means to guarantee a non-empty mask (by
way of a similar assertion), when ->auto_node_affinity is set. Otoh
domain_set_node_affinity() may clear that flag, at which point I can't
see what would guarantee that the intersection would remain non-empty
as CPUs get offlined. (I don't understand, btw, why the function calls
domain_update_node_affinity() after clearing the flag.) Therefore I
don't think an assertion like you suggest would be legitimate. For
this there would be a prereq of domain_update_node_affinity() clearing
the flag when it finds the intersection has become empty. And even
then there would be a window in time where the intersection might be
actively empty, so some synchronization would be needed in addition.

Jan
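
A much-simplified sketch of the interplay Jan describes, showing the shape
only and not the actual Xen code:

    void domain_update_node_affinity(struct domain *d)
    {
        if ( d->auto_node_affinity )
        {
            /* Automatic mode: recompute d->node_affinity from the
             * vCPUs' cpu=>node mapping and assert it non-empty. */
        }
        /* Explicit mode, i.e. after domain_set_node_affinity() cleared
         * the flag: the caller-supplied mask is kept as-is, and nothing
         * on this path re-validates it against node_online_map. */
    }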
Andrew Cooper July 30, 2019, 5:32 p.m. UTC | #4
On 30/07/2019 09:09, Jan Beulich wrote:
> On 29.07.2019 19:26, Andrew Cooper wrote:
>> On 29/07/2019 16:48, Jan Beulich wrote:
>>> On 29.07.2019 14:11, Andrew Cooper wrote:
>>>> +    if ( d )
>>>> +        nodes_and(nodemask, nodemask, d->node_affinity);
>>> Despite my earlier ack: Code further down assumes a non-empty mask,
>>> which is no longer guaranteed afaics.
>> Nothing previously guaranteed that d->node_affinity had any bits set in
>> it, either.
>>
>> That said, in practice it is either ALL, or something derived from the
>> cpu=>node mappings, so I don't think this is a problem in practice.
>>
>>> I think you want to append an
>>> "intersects" check in the if().
>> I think it would be better to assert that callers don't give us complete
>> junk.
>>
>>> With that feel free to promote my
>>> A-b to R-b.
>> How about:
>>
>>      if ( d )
>>      {
>>          if ( nodes_intersects(nodemask, d->node_affinity) )
>>              nodes_and(nodemask, nodemask, d->node_affinity);
>>          else
>>              ASSERT_UNREACHABLE();
>>      }
>>
>> ?
>>
>> This change has passed my normal set of prepush checks (not that
>> there is anything interesting NUMA-wise in there).
> domain_update_node_affinity() means to guarantee a non-empty mask (by
> way of a similar assertion), when ->auto_node_affinity is set. Otoh
> domain_set_node_affinity() may clear that flag, at which point I can't
> see what would guarantee that the intersection would remain non-empty
> as CPUs get offlined.

I don't see what CPU offlining has to do with anything.  There is no
such thing as taking a node out of the node_online_map, nor should there
be - even if we offline an entire socket's worth of CPUs, the memory
controller is still active and available for use.

The domain always has non-zero vCPUs, which will always result in an
intersection with node_online_map.

What is a problem is XEN_DOMCTL_setnodeaffinity being called with a node
mask which is disjoint from node_online_map to begin with.

This problematic behaviour already exists today, and I bet there is a
lot of fun to be had with that hypercall.

As a first pass,

diff --git a/xen/common/domain.c b/xen/common/domain.c
index 9aefc2a680..57c84cdc42 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -631,8 +631,9 @@ void domain_update_node_affinity(struct domain *d)
 
 int domain_set_node_affinity(struct domain *d, const nodemask_t *affinity)
 {
-    /* Being affine with no nodes is just wrong */
-    if ( nodes_empty(*affinity) )
+    /* Being affine with no nodes, or disjoint with the system, is wrong. */
+    if ( nodes_empty(*affinity) ||
+         !nodes_intersects(*affinity, node_online_map) )
         return -EINVAL;
 
     spin_lock(&d->node_affinity_lock);

ought to work, and make it safe to retain the ASSERT() in the heap
allocator.

~Andrew
Jan Beulich July 31, 2019, 8:22 a.m. UTC | #5
On 30.07.2019 19:32, Andrew Cooper wrote:
> On 30/07/2019 09:09, Jan Beulich wrote:
>> On 29.07.2019 19:26, Andrew Cooper wrote:
>>> On 29/07/2019 16:48, Jan Beulich wrote:
>>>> On 29.07.2019 14:11, Andrew Cooper wrote:
>>>>> +    if ( d )
>>>>> +        nodes_and(nodemask, nodemask, d->node_affinity);
>>>> Despite my earlier ack: Code further down assumes a non-empty mask,
>>>> which is no longer guaranteed afaics.
>>> Nothing previously guaranteed that d->node_affinity had any bits set in
>>> it, either.
>>>
>>> That said, in practice it is either ALL, or something derived from the
>>> cpu=>node mappings, so I don't think this is a problem in practice.
>>>
>>>> I think you want to append an
>>>> "intersects" check in the if().
>>> I think it would be better to assert that callers don't give us complete
>>> junk.
>>>
>>>> With that feel free to promote my
>>>> A-b to R-b.
>>> How about:
>>>
>>>       if ( d )
>>>       {
>>>           if ( nodes_intersects(nodemask, d->node_affinity) )
>>>               nodes_and(nodemask, nodemask, d->node_affinity);
>>>           else
>>>               ASSERT_UNREACHABLE();
>>>       }
>>>
>>> ?
>>>
>>> This change has passed my normal set of prepush checks (not that
>>> there is anything interesting NUMA-wise in there).
>> domain_update_node_affinity() means to guarantee a non-empty mask (by
>> way of a similar assertion), when ->auto_node_affinity is set. Otoh
>> domain_set_node_affinity() may clear that flag, at which point I can't
>> see what would guarantee that the intersection would remain non-empty
>> as CPUs get offlined.
> 
> I don't see what CPU offlining has to do with anything.  There is no
> such thing as taking a node out of the node_online_map, nor should there
> be - even if we offline an entire socket's worth of CPUs, the memory
> controller is still active and available for use.
> 
> The domain always has non-zero vCPUs, which will always result in an
> intersection with node_online_map.

Oh, right - I forgot that we (almost) never clear bits from
node_online_map. There's one use of node_set_offline() in
memory_add() - I wonder whether we shouldn't ditch
node_set_offline() to make it more visible that we don't mean
to ever clear bits there.

> What is a problem is XEN_DOMCTL_setnodeaffinity being called with a node
> mask which is disjoint from node_online_map to begin with.
> 
> This problematic behaviour already exists today, and I bet there is a
> lot of fun to be had with that hypercall.
> 
> As a first pass,
> 
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index 9aefc2a680..57c84cdc42 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -631,8 +631,9 @@ void domain_update_node_affinity(struct domain *d)
>   
>   int domain_set_node_affinity(struct domain *d, const nodemask_t *affinity)
>   {
> -    /* Being affine with no nodes is just wrong */
> -    if ( nodes_empty(*affinity) )
> +    /* Being affine with no nodes, or disjoint with the system, is wrong. */
> +    if ( nodes_empty(*affinity) ||
> +         !nodes_intersects(*affinity, node_online_map) )
>           return -EINVAL;

Right, and then you don't need the nodes_empty() part anymore. With
this change folded in (or as a prereq one to allow backporting) you
can add my R-b with the adjustment further up in place.

Jan
Andrew Cooper July 31, 2019, 9:01 a.m. UTC | #6
On 31/07/2019 09:22, Jan Beulich wrote:
> On 30.07.2019 19:32, Andrew Cooper wrote:
>> On 30/07/2019 09:09, Jan Beulich wrote:
>>> On 29.07.2019 19:26, Andrew Cooper wrote:
>>>> On 29/07/2019 16:48, Jan Beulich wrote:
>>>>> On 29.07.2019 14:11, Andrew Cooper wrote:
>>>>>> +    if ( d )
>>>>>> +        nodes_and(nodemask, nodemask, d->node_affinity);
>>>>> Despite my earlier ack: Code further down assumes a non-empty mask,
>>>>> which is no longer guaranteed afaics.
>>>> Nothing previously guaranteed that d->node_affinity had any bits set in
>>>> it, either.
>>>>
>>>> That said, in practice it is either ALL, or something derived from the
>>>> cpu=>node mappings, so I don't think this is a problem in practice.
>>>>
>>>>> I think you want to append an
>>>>> "intersects" check in the if().
>>>> I think it would be better to assert that callers don't give us complete
>>>> junk.
>>>>
>>>>> With that feel free to promote my
>>>>> A-b to R-b.
>>>> How about:
>>>>
>>>>       if ( d )
>>>>       {
>>>>           if ( nodes_intersects(nodemask, d->node_affinity) )
>>>>               nodes_and(nodemask, nodemask, d->node_affinity);
>>>>           else
>>>>               ASSERT_UNREACHABLE();
>>>>       }
>>>>
>>>> ?
>>>>
>>>> This change has passed my normal set of prepush checks (not that
>>>> there is anything interesting NUMA-wise in there).
>>> domain_update_node_affinity() means to guarantee a non-empty mask (by
>>> way of a similar assertion), when ->auto_node_affinity is set. Otoh
>>> domain_set_node_affinity() may clear that flag, at which point I can't
>>> see what would guarantee that the intersection would remain non-empty
>>> as CPUs get offlined.
>> I don't see what CPU offlining has to do with anything.  There is no
>> such thing as taking a node out of the node_online_map, nor should there
>> be - even if we offline an entire socket's worth of CPUs, the memory
>> controller is still active and available for use.
>>
>> The domain always has non-zero vCPUs, which will always result in an
>> intersection with node_online_map.
> Oh, right - I forgot that we (almost) never clear bits from
> node_online_map. There's one use of node_set_offline() in
> memory_add() - I wonder whether we shouldn't ditch
> node_set_offline() to make it more visible that we don't mean
> to ever clear bits there.
>
>> What is a problem is XEN_DOMCTL_setnodeaffinity being called with a node
>> mask which is disjoint from node_online_map to begin with.
>>
>> This problematic behaviour already exists today, and I bet there is a
>> lot of fun to be had with that hypercall.
>>
>> As a first pass,
>>
>> diff --git a/xen/common/domain.c b/xen/common/domain.c
>> index 9aefc2a680..57c84cdc42 100644
>> --- a/xen/common/domain.c
>> +++ b/xen/common/domain.c
>> @@ -631,8 +631,9 @@ void domain_update_node_affinity(struct domain *d)
>>   
>>   int domain_set_node_affinity(struct domain *d, const nodemask_t *affinity)
>>   {
>> -    /* Being affine with no nodes is just wrong */
>> -    if ( nodes_empty(*affinity) )
>> +    /* Being affine with no nodes, or disjoint with the system, is wrong. */
>> +    if ( nodes_empty(*affinity) ||
>> +         !nodes_intersects(*affinity, node_online_map) )
>>           return -EINVAL;
> Right, and then you don't need the nodes_empty() part anymore.

Good point.

> With this change folded in (or as a prereq one to allow backporting) you
> can add my R-b with the adjustment further up in place.

I think I'll fold it together into a single change.  It's directly
relevant to the subject.

~Andrew
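
For reference, folding the review feedback together presumably yields
something like this (a sketch assembled from the discussion above, not the
committed patch):

    /* get_free_buddy(): */
    /*
     * d->node_affinity is our preferred allocation set if provided, but it
     * may have bits set outside of node_online_map.  Clamp it.
     */
    if ( d )
    {
        if ( nodes_intersects(nodemask, d->node_affinity) )
            nodes_and(nodemask, nodemask, d->node_affinity);
        else
            ASSERT_UNREACHABLE();  /* ruled out by the domctl check below */
    }

    /* domain_set_node_affinity(): reject junk before it reaches the
     * allocator; this check subsumes the old nodes_empty() one. */
    if ( !nodes_intersects(*affinity, node_online_map) )
        return -EINVAL;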

Patch

diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 44a72d0b19..efa437c7df 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -811,11 +811,18 @@  static struct page_info *get_free_buddy(unsigned int zone_lo,
                                         const struct domain *d)
 {
     nodeid_t first, node = MEMF_get_node(memflags), req_node = node;
-    nodemask_t nodemask = d ? d->node_affinity : node_online_map;
+    nodemask_t nodemask = node_online_map;
     unsigned int j, zone, nodemask_retry = 0;
     struct page_info *pg;
     bool use_unscrubbed = (memflags & MEMF_no_scrub);
 
+    /*
+     * d->node_affinity is our preferred allocation set if provided, but it
+     * may have bit set outside of node_online_map.  Clamp it.
+     */
+    if ( d )
+        nodes_and(nodemask, nodemask, d->node_affinity);
+
     if ( node == NUMA_NO_NODE )
     {
         if ( d != NULL )