diff mbox series

[v2,2/2] xen/common: Add NUMA node id bounds check to page_alloc.c/node_to_scrub

Message ID cbecf35392441b03dc3ccd2a119b6fbdbf13b855.1695754185.git.sanastasio@raptorengineering.com (mailing list archive)
State Superseded
Headers show
Series Fix Power CI build | expand

Commit Message

Shawn Anastasio Sept. 26, 2023, 6:54 p.m. UTC
When building for Power with CONFIG_DEBUG unset, a compiler error gets
raised inside page_alloc.c's node_to_scrub function:

common/page_alloc.c: In function 'node_to_scrub.part.0':
common/page_alloc.c:1217:29: error: array subscript 1 is above array
            bounds of 'long unsigned int[1]' [-Werror=array-bounds]
 1217 |         if ( node_need_scrub[node] )

It appears that this is a false positive, given that in practice
cycle_node should never return a node ID >= MAX_NUMNODES as long as the
architecture's node_online_map is properly defined and initialized, so
this additional bounds check is only to satisfy GCC.

Signed-off-by: Shawn Anastasio <sanastasio@raptorengineering.com>
---
v2: Add comment to explain the bounds check.

 xen/common/page_alloc.c | 8 ++++++++
 1 file changed, 8 insertions(+)

--
2.30.2

Comments

Andrew Cooper Sept. 26, 2023, 8:54 p.m. UTC | #1
On 26/09/2023 7:54 pm, Shawn Anastasio wrote:
> When building for Power with CONFIG_DEBUG unset, a compiler error gets
> raised inside page_alloc.c's node_to_scrub function:
>
> common/page_alloc.c: In function 'node_to_scrub.part.0':
> common/page_alloc.c:1217:29: error: array subscript 1 is above array
>             bounds of 'long unsigned int[1]' [-Werror=array-bounds]
>  1217 |         if ( node_need_scrub[node] )
>
> It appears that this is a false positive, given that in practice
> cycle_node should never return a node ID >= MAX_NUMNODES as long as the
> architecture's node_online_map is properly defined and initialized, so
> this additional bounds check is only to satisfy GCC.

If GCC thinks it's got an index of 1 here, then it thinks it's proved
that a 1 gets passed.  But the fact that cycle_node() really can return
1 if one variable gets tweaked in memory means that logic derived on
this property is broken.

But we've got even more basic problems to fix first.

xen$ git grep -B1 '__read_mostly node_online_map'
arch/arm/smpboot.c-45-/* Fake one node for now. See also asm/numa.h */
arch/arm/smpboot.c:46:nodemask_t __read_mostly node_online_map = { { [0]
= 1UL } };
arch/ppc/stubs.c-25-
arch/ppc/stubs.c:26:nodemask_t __read_mostly node_online_map = { { [0] =
1UL } };
common/numa.c-44-
common/numa.c:45:nodemask_t __read_mostly node_online_map = { { [0] =
1UL } };

There are 3 identical definitions of node_online_map, one for the
architecture which really supports NUMA, and two for the stubs which don't.

And the bug here is that code outside of CONFIG_NUMA assumes NUMA is
valid, including several places in page_alloc.c,
domain_set_node_affinity(), credit1 and sysctl.  XEN_SYSCTL_numainfo
even has a bigger sin of using a static MAX_NUMNODES array when it
doesn't need an array in the first place.

It's rude for xen/nodemask.h to short circuit some of the operations on
node_online_map based on MAX_NUMNODES but not others.

If nothing else, the fallback for "not really NUMA" needs providing by
the common code and not by the arch stubs.  When that is sorted, we
might have more consistent behaviour to investigate.

~Andrew
Stefano Stabellini Sept. 26, 2023, 9:14 p.m. UTC | #2
On Tue, 26 Sep 2023, Shawn Anastasio wrote:
> When building for Power with CONFIG_DEBUG unset, a compiler error gets
> raised inside page_alloc.c's node_to_scrub function:
> 
> common/page_alloc.c: In function 'node_to_scrub.part.0':
> common/page_alloc.c:1217:29: error: array subscript 1 is above array
>             bounds of 'long unsigned int[1]' [-Werror=array-bounds]
>  1217 |         if ( node_need_scrub[node] )
> 
> It appears that this is a false positive, given that in practice
> cycle_node should never return a node ID >= MAX_NUMNODES as long as the
> architecture's node_online_map is properly defined and initialized, so
> this additional bounds check is only to satisfy GCC.
> 
> Signed-off-by: Shawn Anastasio <sanastasio@raptorengineering.com>

Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>


> ---
> v2: Add comment to explain the bounds check.
> 
>  xen/common/page_alloc.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
> index 35d9a26fa6..c53f917dbc 100644
> --- a/xen/common/page_alloc.c
> +++ b/xen/common/page_alloc.c
> @@ -1211,6 +1211,14 @@ static unsigned int node_to_scrub(bool get_node)
>          } while ( !cpumask_empty(&node_to_cpumask(node)) &&
>                    (node != local_node) );
> 
> +        /*
> +         * In practice `node` will always be within MAX_NUMNODES, but GCC can't
> +         * always see that, so an explicit check is necessary to avoid tripping
> +         * its out-of-bounds array access warning (-Warray-bounds).
> +         */
> +        if ( node >= MAX_NUMNODES )
> +            break;
> +
>          if ( node == local_node )
>              break;
> 
> --
> 2.30.2
>
Shawn Anastasio Sept. 26, 2023, 10:33 p.m. UTC | #3
On 9/26/23 3:54 PM, Andrew Cooper wrote:
> On 26/09/2023 7:54 pm, Shawn Anastasio wrote:
>> When building for Power with CONFIG_DEBUG unset, a compiler error gets
>> raised inside page_alloc.c's node_to_scrub function:
>>
>> common/page_alloc.c: In function 'node_to_scrub.part.0':
>> common/page_alloc.c:1217:29: error: array subscript 1 is above array
>>             bounds of 'long unsigned int[1]' [-Werror=array-bounds]
>>  1217 |         if ( node_need_scrub[node] )
>>
>> It appears that this is a false positive, given that in practice
>> cycle_node should never return a node ID >= MAX_NUMNODES as long as the
>> architecture's node_online_map is properly defined and initialized, so
>> this additional bounds check is only to satisfy GCC.
> 
> If GCC thinks it's got an index of 1 here, then it thinks it's proved
> that a 1 gets passed.  But the fact that cycle_node() really can return
> 1 if one variable gets tweaked in memory means that logic derived on
> this property is broken.
> 
> But we've got even more basic problems to fix first.
> 
> xen$ git grep -B1 '__read_mostly node_online_map'
> arch/arm/smpboot.c-45-/* Fake one node for now. See also asm/numa.h */
> arch/arm/smpboot.c:46:nodemask_t __read_mostly node_online_map = { { [0]
> = 1UL } };
> arch/ppc/stubs.c-25-
> arch/ppc/stubs.c:26:nodemask_t __read_mostly node_online_map = { { [0] =
> 1UL } };
> common/numa.c-44-
> common/numa.c:45:nodemask_t __read_mostly node_online_map = { { [0] =
> 1UL } };
> 
> There are 3 identical definitions of node_online_map, one for the
> architecture which really supports NUMA, and two for the stubs which don't.
> 
> And the bug here is that code outside of CONFIG_NUMA assumes NUMA is
> valid, including several places in page_alloc.c,
> domain_set_node_affinity(), credit1 and sysctl.  XEN_SYSCTL_numainfo
> even has a bigger sin of using a static MAX_NUMNODES array when it
> doesn't need an array in the first place.
> 

Agreed. It seems like the current state of affairs resulted from hacks
in Arm to work around NUMA assumptions in common instead of properly
making common support !CONFIG_NUMA. And now ppc is inheriting some of
that baggage.

> It's rude for xen/nodemask.h to short circuit some of the operations on
> node_online_map based on MAX_NUMNODES but not others.
> 
> If nothing else, the fallback for "not really NUMA" needs providing by
> the common code and not by the arch stubs.  When that is sorted, we
> might have more consistent behaviour to investigate.
>

I'll leave these more sweeping changes up to someone with a bit more
familiarity with this code for now. In the meantime, this patch at
should introduce no functional change to the status quo and allows ppc
to build in the interim.

> ~Andrew

Thanks,
Shawn
diff mbox series

Patch

diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 35d9a26fa6..c53f917dbc 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -1211,6 +1211,14 @@  static unsigned int node_to_scrub(bool get_node)
         } while ( !cpumask_empty(&node_to_cpumask(node)) &&
                   (node != local_node) );

+        /*
+         * In practice `node` will always be within MAX_NUMNODES, but GCC can't
+         * always see that, so an explicit check is necessary to avoid tripping
+         * its out-of-bounds array access warning (-Warray-bounds).
+         */
+        if ( node >= MAX_NUMNODES )
+            break;
+
         if ( node == local_node )
             break;