[v2] mm/page_alloc: Work around a pahole limitation with zero-sized struct pagesets

Message ID 20210527120251.GC30378@techsingularity.net (mailing list archive)
State Not Applicable

Commit Message

Mel Gorman May 27, 2021, 12:02 p.m. UTC
This patch replaces
mm-page_alloc-convert-per-cpu-list-protection-to-local_lock-fix.patch in
Andrew's tree.

Michal Suchanek reported the following problem with linux-next

  [    0.000000] Linux version 5.13.0-rc2-next-20210519-1.g3455ff8-vanilla (geeko@buildhost) (gcc (SUSE Linux) 10.3.0, GNU ld (GNU Binutils; openSUSE Tumbleweed) 2.36.1.20210326-3) #1 SMP Wed May 19 10:05:10 UTC 2021 (3455ff8)
  [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.13.0-rc2-next-20210519-1.g3455ff8-vanilla root=UUID=ec42c33e-a2c2-4c61-afcc-93e95278f687 plymouth.enable=0 resume=/dev/disk/by-uuid/f1fe4560-a801-4faf-a638-834c407027c7 mitigations=auto earlyprintk initcall_debug nomodeset earlycon ignore_loglevel console=ttyS0,115200
...
  [   26.093364] calling  tracing_set_default_clock+0x0/0x62 @ 1
  [   26.098937] initcall tracing_set_default_clock+0x0/0x62 returned 0 after 0 usecs
  [   26.106330] calling  acpi_gpio_handle_deferred_request_irqs+0x0/0x7c @ 1
  [   26.113033] initcall acpi_gpio_handle_deferred_request_irqs+0x0/0x7c returned 0 after 3 usecs
  [   26.121559] calling  clk_disable_unused+0x0/0x102 @ 1
  [   26.126620] initcall clk_disable_unused+0x0/0x102 returned 0 after 0 usecs
  [   26.133491] calling  regulator_init_complete+0x0/0x25 @ 1
  [   26.138890] initcall regulator_init_complete+0x0/0x25 returned 0 after 0 usecs
  [   26.147816] Freeing unused decrypted memory: 2036K
  [   26.153682] Freeing unused kernel image (initmem) memory: 2308K
  [   26.165776] Write protecting the kernel read-only data: 26624k
  [   26.173067] Freeing unused kernel image (text/rodata gap) memory: 2036K
  [   26.180416] Freeing unused kernel image (rodata/data gap) memory: 1184K
  [   26.187031] Run /init as init process
  [   26.190693]   with arguments:
  [   26.193661]     /init
  [   26.195933]   with environment:
  [   26.199079]     HOME=/
  [   26.201444]     TERM=linux
  [   26.204152]     BOOT_IMAGE=/boot/vmlinuz-5.13.0-rc2-next-20210519-1.g3455ff8-vanilla
  [   26.254154] BPF:      type_id=35503 offset=178440 size=4
  [   26.259125] BPF:
  [   26.261054] BPF:Invalid offset
  [   26.264119] BPF:
  [   26.264119]
  [   26.267437] failed to validate module [efivarfs] BTF: -22

Andrii Nakryiko bisected the problem to the commit "mm/page_alloc: convert
per-cpu list protection to local_lock" currently staged in mmotm. In his
own words

  The immediate problem is two different definitions of numa_node per-cpu
  variable. They both are at the same offset within .data..percpu ELF
  section, they both have the same name, but one of them is marked as
  static and another as global. And one is int variable, while another
  is struct pagesets. I'll look some more tomorrow, but adding Jiri and
  Arnaldo for visibility.

  [110907] DATASEC '.data..percpu' size=178904 vlen=303
  ...
        type_id=27753 offset=163976 size=4 (VAR 'numa_node')
        type_id=27754 offset=163976 size=4 (VAR 'numa_node')

  [27753] VAR 'numa_node' type_id=27556, linkage=static
  [27754] VAR 'numa_node' type_id=20, linkage=global

  [20] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED

  [27556] STRUCT 'pagesets' size=0 vlen=1
        'lock' type_id=507 bits_offset=0

  [506] STRUCT '(anon)' size=0 vlen=0
  [507] TYPEDEF 'local_lock_t' type_id=506

The patch in question introduces a zero-sized per-cpu struct and while
this is not wrong, versions of pahole prior to 1.22 (unreleased) get
confused during BTF generation with two separate variables occupying the
same address.
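
The symptom in the dump above, two VAR entries sharing one offset inside
the DATASEC, can be spotted mechanically. The following is a minimal
sketch that assumes the BTF dump has been saved as text in the format
quoted above; btf_duplicate_offsets is a hypothetical helper name, not an
existing tool. A shared offset is legitimate for a zero-sized variable, so
the output only flags entries worth a closer look.

```shell
#!/bin/sh
# Hypothetical helper: read DATASEC entry lines on stdin and print every
# "offset=N" token that has already been seen on an earlier line.
btf_duplicate_offsets() {
	awk '
		{
			for (i = 1; i <= NF; i++)
				if ($i ~ /^offset=/) {
					if (seen[$i]++)
						print $i
				}
		}'
}

# The two DATASEC entries quoted in the report above.
btf_duplicate_offsets <<'EOF'
type_id=27753 offset=163976 size=4 (VAR 'numa_node')
type_id=27754 offset=163976 size=4 (VAR 'numa_node')
EOF
```

For the two quoted entries this prints offset=163976 once.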

This patch checks for older versions of pahole and only allows
DEBUG_INFO_BTF_MODULES if pahole supports zero-sized per-cpu structures.
DEBUG_INFO_BTF is still allowed as a KVM boot test passed with pahole
v1.19.  While pahole 1.22 does not exist yet, it is assumed that Hritik's
fix that allows DEBUG_INFO_BTF_MODULES to work will be included in that
release.

Reported-by: Michal Suchanek <msuchanek@suse.de>
Reported-by: Hritik Vijay <hritikxx8@gmail.com>
Debugged-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 lib/Kconfig.debug | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Comments

Andrii Nakryiko May 27, 2021, 2:37 p.m. UTC | #1
On Thu, May 27, 2021 at 5:02 AM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> This patch replaces
> mm-page_alloc-convert-per-cpu-list-protection-to-local_lock-fix.patch in
> Andrew's tree.
>
> Michal Suchanek reported the following problem with linux-next
>
>   [    0.000000] Linux version 5.13.0-rc2-next-20210519-1.g3455ff8-vanilla (geeko@buildhost) (gcc (SUSE Linux) 10.3.0, GNU ld (GNU Binutils; openSUSE Tumbleweed) 2.36.1.20210326-3) #1 SMP Wed May 19 10:05:10 UTC 2021 (3455ff8)
>   [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.13.0-rc2-next-20210519-1.g3455ff8-vanilla root=UUID=ec42c33e-a2c2-4c61-afcc-93e95278f687 plymouth.enable=0 resume=/dev/disk/by-uuid/f1fe4560-a801-4faf-a638-834c407027c7 mitigations=auto earlyprintk initcall_debug nomodeset earlycon ignore_loglevel console=ttyS0,115200
> ...
>   [   26.093364] calling  tracing_set_default_clock+0x0/0x62 @ 1
>   [   26.098937] initcall tracing_set_default_clock+0x0/0x62 returned 0 after 0 usecs
>   [   26.106330] calling  acpi_gpio_handle_deferred_request_irqs+0x0/0x7c @ 1
>   [   26.113033] initcall acpi_gpio_handle_deferred_request_irqs+0x0/0x7c returned 0 after 3 usecs
>   [   26.121559] calling  clk_disable_unused+0x0/0x102 @ 1
>   [   26.126620] initcall clk_disable_unused+0x0/0x102 returned 0 after 0 usecs
>   [   26.133491] calling  regulator_init_complete+0x0/0x25 @ 1
>   [   26.138890] initcall regulator_init_complete+0x0/0x25 returned 0 after 0 usecs
>   [   26.147816] Freeing unused decrypted memory: 2036K
>   [   26.153682] Freeing unused kernel image (initmem) memory: 2308K
>   [   26.165776] Write protecting the kernel read-only data: 26624k
>   [   26.173067] Freeing unused kernel image (text/rodata gap) memory: 2036K
>   [   26.180416] Freeing unused kernel image (rodata/data gap) memory: 1184K
>   [   26.187031] Run /init as init process
>   [   26.190693]   with arguments:
>   [   26.193661]     /init
>   [   26.195933]   with environment:
>   [   26.199079]     HOME=/
>   [   26.201444]     TERM=linux
>   [   26.204152]     BOOT_IMAGE=/boot/vmlinuz-5.13.0-rc2-next-20210519-1.g3455ff8-vanilla
>   [   26.254154] BPF:      type_id=35503 offset=178440 size=4
>   [   26.259125] BPF:
>   [   26.261054] BPF:Invalid offset
>   [   26.264119] BPF:
>   [   26.264119]
>   [   26.267437] failed to validate module [efivarfs] BTF: -22
>
> Andrii Nakryiko bisected the problem to the commit "mm/page_alloc: convert
> per-cpu list protection to local_lock" currently staged in mmotm. In his
> own words
>
>   The immediate problem is two different definitions of numa_node per-cpu
>   variable. They both are at the same offset within .data..percpu ELF
>   section, they both have the same name, but one of them is marked as
>   static and another as global. And one is int variable, while another
>   is struct pagesets. I'll look some more tomorrow, but adding Jiri and
>   Arnaldo for visibility.
>
>   [110907] DATASEC '.data..percpu' size=178904 vlen=303
>   ...
>         type_id=27753 offset=163976 size=4 (VAR 'numa_node')
>         type_id=27754 offset=163976 size=4 (VAR 'numa_node')
>
>   [27753] VAR 'numa_node' type_id=27556, linkage=static
>   [27754] VAR 'numa_node' type_id=20, linkage=global
>
>   [20] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED
>
>   [27556] STRUCT 'pagesets' size=0 vlen=1
>         'lock' type_id=507 bits_offset=0
>
>   [506] STRUCT '(anon)' size=0 vlen=0
>   [507] TYPEDEF 'local_lock_t' type_id=506
>
> The patch in question introduces a zero-sized per-cpu struct and while
> this is not wrong, versions of pahole prior to 1.22 (unreleased) get
> confused during BTF generation with two separate variables occupying the
> same address.
>
> This patch checks for older versions of pahole and only allows
> DEBUG_INFO_BTF_MODULES if pahole supports zero-sized per-cpu structures.
> DEBUG_INFO_BTF is still allowed as a KVM boot test passed with pahole

Unfortunately this won't work. The problem is that vmlinux BTF is
corrupted, which results in module BTFs being rejected as well, as
they depend on it.

But vmlinux BTF corruption makes BPF subsystem completely unusable. So
even though kernel boots, nothing BPF-related works. So we'd need to
add dependency for DEBUG_INFO_BTF on pahole 1.22+.

> v1.19.  While pahole 1.22 does not exist yet, it is assumed that Hritik's
> fix that allows DEBUG_INFO_BTF_MODULES to work will be included in that
> release.
>
> Reported-by: Michal Suchanek <msuchanek@suse.de>
> Reported-by: Hritik Vijay <hritikxx8@gmail.com>
> Debugged-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  lib/Kconfig.debug | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 678c13967580..51b355cbe6d7 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -313,9 +313,12 @@ config DEBUG_INFO_BTF
>  config PAHOLE_HAS_SPLIT_BTF
>         def_bool $(success, test `$(PAHOLE) --version | sed -E 's/v([0-9]+)\.([0-9]+)/\1\2/'` -ge "119")
>
> +config PAHOLE_HAS_ZEROSIZE_PERCPU_SUPPORT
> +       def_bool $(success, test `$(PAHOLE) --version | sed -E 's/v([0-9]+)\.([0-9]+)/\1\2/'` -ge "122")
> +
>  config DEBUG_INFO_BTF_MODULES
>         def_bool y
> -       depends on DEBUG_INFO_BTF && MODULES && PAHOLE_HAS_SPLIT_BTF
> +       depends on DEBUG_INFO_BTF && MODULES && PAHOLE_HAS_SPLIT_BTF && PAHOLE_HAS_ZEROSIZE_PERCPU_SUPPORT
>         help
>           Generate compact split BTF type information for kernel modules.
>

Mel Gorman May 27, 2021, 2:54 p.m. UTC | #2
On Thu, May 27, 2021 at 07:37:05AM -0700, Andrii Nakryiko wrote:
> > This patch checks for older versions of pahole and only allows
> > DEBUG_INFO_BTF_MODULES if pahole supports zero-sized per-cpu structures.
> > DEBUG_INFO_BTF is still allowed as a KVM boot test passed with pahole
> 
> Unfortunately this won't work. The problem is that vmlinux BTF is
> corrupted, which results in module BTFs being rejected as well, as
> they depend on it.
> 
> But vmlinux BTF corruption makes BPF subsystem completely unusable. So
> even though kernel boots, nothing BPF-related works. So we'd need to
> add dependency for DEBUG_INFO_BTF on pahole 1.22+.
> 

While bpf usage would be broken, the kernel will boot and the effect
should be transparent to any kernel build based on "make oldconfig".
CONFIG_DEBUG_INFO_BTF defaults N so if that is forced out, it will be
easily missed by a distribution kernel maintainer.

Yes, users of BPF will be affected and it may generate bug reports but
the fix will be to build with a working pahole. Breaking boot on the other
hand is a lot more visible and hacking around this with a non-zero struct
size has been shot down.

Andrii Nakryiko May 27, 2021, 4:36 p.m. UTC | #3
On Thu, May 27, 2021 at 7:54 AM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Thu, May 27, 2021 at 07:37:05AM -0700, Andrii Nakryiko wrote:
> > > This patch checks for older versions of pahole and only allows
> > > DEBUG_INFO_BTF_MODULES if pahole supports zero-sized per-cpu structures.
> > > DEBUG_INFO_BTF is still allowed as a KVM boot test passed with pahole
> >
> > Unfortunately this won't work. The problem is that vmlinux BTF is
> > corrupted, which results in module BTFs being rejected as well, as
> > they depend on it.
> >
> > But vmlinux BTF corruption makes BPF subsystem completely unusable. So
> > even though kernel boots, nothing BPF-related works. So we'd need to
> > add dependency for DEBUG_INFO_BTF on pahole 1.22+.
> >
>
> While bpf usage would be broken, the kernel will boot and the effect
> should be transparent to any kernel build based on "make oldconfig".

I think if DEBUG_INFO_BTF=y has no chance of generating valid vmlinux
BTF it has to be forced out. So if we are doing this at all, we should
do it for CONFIG_DEBUG_INFO_BTF, not CONFIG_DEBUG_INFO_BTF_MODULES.
CONFIG_DEBUG_INFO_BTF_MODULES will follow automatically.

> CONFIG_DEBUG_INFO_BTF defaults N so if that is forced out, it will be
> easily missed by a distribution kernel maintainer.

We actually had previous discussions on forcing build failure in cases
when CONFIG_DEBUG_INFO_BTF=y can't be satisfied, but no one followed
up. I'll look into this and will try to change the behavior. It's
caused too much confusion previously and now with changes like this we
are going to waste even more people's time.

>
> Yes, users of BPF will be affected and it may generate bug reports but
> the fix will be to build with a working pahole. Breaking boot on the other
> hand is a lot more visible and hacking around this with a non-zero struct
> size has been shot down.
>
> --
> Mel Gorman
> SUSE Labs

Mel Gorman May 27, 2021, 5:27 p.m. UTC | #4
On Thu, May 27, 2021 at 09:36:35AM -0700, Andrii Nakryiko wrote:
> On Thu, May 27, 2021 at 7:54 AM Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > On Thu, May 27, 2021 at 07:37:05AM -0700, Andrii Nakryiko wrote:
> > > > This patch checks for older versions of pahole and only allows
> > > > DEBUG_INFO_BTF_MODULES if pahole supports zero-sized per-cpu structures.
> > > > DEBUG_INFO_BTF is still allowed as a KVM boot test passed with pahole
> > >
> > > Unfortunately this won't work. The problem is that vmlinux BTF is
> > > corrupted, which results in module BTFs being rejected as well, as
> > > they depend on it.
> > >
> > > But vmlinux BTF corruption makes BPF subsystem completely unusable. So
> > > even though kernel boots, nothing BPF-related works. So we'd need to
> > > add dependency for DEBUG_INFO_BTF on pahole 1.22+.
> > >
> >
> > While bpf usage would be broken, the kernel will boot and the effect
> > should be transparent to any kernel build based on "make oldconfig".
> 
> I think if DEBUG_INFO_BTF=y has no chance of generating valid vmlinux
> BTF it has to be forced out. So if we are doing this at all, we should
> do it for CONFIG_DEBUG_INFO_BTF, not CONFIG_DEBUG_INFO_BTF_MODULES.
> CONFIG_DEBUG_INFO_BTF_MODULES will follow automatically.
> 

Ok, I sent a version that prevents DEBUG_INFO_BTF being set unless
pahole is at least 1.22.

> > CONFIG_DEBUG_INFO_BTF defaults N so if that is forced out, it will be
> > easily missed by a distribution kernel maintainer.
> 
> We actually had previous discussions on forcing build failure in cases
> when CONFIG_DEBUG_INFO_BTF=y can't be satisfied, but no one followed
> up.

It is weird how it is handled. DEBUG_INFO_BTF can be set and then fail to
build vmlinux because pahole is too old. With DEBUG_INFO_BTF now requiring
at least 1.22, the other version checks for 1.16 and 1.19 are redundant
and could be cleaned up.

> I'll look into this and will try to change the behavior. It's
> caused too much confusion previously and now with changes like this we
> are going to waste even more people's time.
> 

Thanks.

Andrii Nakryiko May 27, 2021, 10:25 p.m. UTC | #5
On Thu, May 27, 2021 at 10:27 AM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Thu, May 27, 2021 at 09:36:35AM -0700, Andrii Nakryiko wrote:
> > On Thu, May 27, 2021 at 7:54 AM Mel Gorman <mgorman@techsingularity.net> wrote:
> > >
> > > On Thu, May 27, 2021 at 07:37:05AM -0700, Andrii Nakryiko wrote:
> > > > > This patch checks for older versions of pahole and only allows
> > > > > DEBUG_INFO_BTF_MODULES if pahole supports zero-sized per-cpu structures.
> > > > > DEBUG_INFO_BTF is still allowed as a KVM boot test passed with pahole
> > > >
> > > > Unfortunately this won't work. The problem is that vmlinux BTF is
> > > > corrupted, which results in module BTFs being rejected as well, as
> > > > they depend on it.
> > > >
> > > > But vmlinux BTF corruption makes BPF subsystem completely unusable. So
> > > > even though kernel boots, nothing BPF-related works. So we'd need to
> > > > add dependency for DEBUG_INFO_BTF on pahole 1.22+.
> > > >
> > >
> > > While bpf usage would be broken, the kernel will boot and the effect
> > > should be transparent to any kernel build based on "make oldconfig".
> >
> > I think if DEBUG_INFO_BTF=y has no chance of generating valid vmlinux
> > BTF it has to be forced out. So if we are doing this at all, we should
> > do it for CONFIG_DEBUG_INFO_BTF, not CONFIG_DEBUG_INFO_BTF_MODULES.
> > CONFIG_DEBUG_INFO_BTF_MODULES will follow automatically.
> >
>
> Ok, I sent a version that prevents DEBUG_INFO_BTF being set unless
> pahole is at least 1.22.
>
> > > CONFIG_DEBUG_INFO_BTF defaults N so if that is forced out, it will be
> > > easily missed by a distribution kernel maintainer.
> >
> > We actually had previous discussions on forcing build failure in cases
> > when CONFIG_DEBUG_INFO_BTF=y can't be satisfied, but no one followed
> > up.
>
> It is weird how it is handled. DEBUG_INFO_BTF can be set and then fail to
> build vmlinux because pahole is too old. With DEBUG_INFO_BTF now requiring
> at least 1.22, the other version checks for 1.16 and 1.19 are redundant
> and could be cleaned up.
>
> > I'll look into this and will try to change the behavior. It's
> > caused too much confusion previously and now with changes like this we
> > are going to waste even more people's time.
> >
>
> Thanks.

So I've tried to change that, but I'm not sure that's possible with
the current Kconfig system. I tried to use $(error-if), but it happens
too early, at the text pre-processing stage, before the value of
CONFIG_DEBUG_INFO_BTF is known, so it's impossible to express
something like this:

$(error-if,CONFIG_DEBUG_INFO_BTF=y && PAHOLE_VERSION < 116,Pahole is too old)

Masahiro, is it possible to somehow express the condition that if
CONFIG_DEBUG_INFO_BTF=y is selected, but some external dependency
(pahole version in this case) is too old, then fail the build
immediately? Currently we fail at the very end of vmlinux linking
step, which is very late.

Alternatively, it was proposed to just add an extra dependency (like,
"depends on PAHOLE_IS_116_OR_NEWER"), but that will silently unselect
CONFIG_DEBUG_INFO_BTF if the condition is not satisfied, so it's even
more confusing to users.

Any suggestions on how to proceed with something like that? Thanks!

>
> --
> Mel Gorman
> SUSE Labs
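
One way to get the early failure asked about above, outside of Kconfig,
might be a small check that runs before the build starts. This is only a
sketch under stated assumptions: check_btf_pahole and its three arguments
are made up for illustration, nothing like it is claimed to exist in the
tree, and the version parsing simply reuses the sed expression from the
patch.

```shell
#!/bin/sh
# Hypothetical pre-build check; not an existing kernel script.
# $1: path to a .config file
# $2: pahole version string such as "v1.21"
# $3: minimum version as an integer such as 116
check_btf_pahole() {
	config=$1
	version=$2
	minimum=$3
	# Nothing to check unless BTF generation was requested.
	grep -q '^CONFIG_DEBUG_INFO_BTF=y$' "$config" || return 0
	# Same transformation as the Kconfig checks: "v1.21" becomes "121".
	number=$(printf '%s\n' "$version" | sed -E 's/v([0-9]+)\.([0-9]+)/\1\2/')
	if [ "$number" -lt "$minimum" ]; then
		echo "error: CONFIG_DEBUG_INFO_BTF=y but pahole $version is too old" >&2
		return 1
	fi
	return 0
}

# Demonstration against a throwaway config file.
cfg=$(mktemp)
echo 'CONFIG_DEBUG_INFO_BTF=y' > "$cfg"
check_btf_pahole "$cfg" v1.15 116 || echo "build would stop here"
check_btf_pahole "$cfg" v1.16 116 && echo "build continues"
rm -f "$cfg"
```

Unlike an extra "depends on" line, a check of this kind leaves the user's
choice of CONFIG_DEBUG_INFO_BTF=y in place and reports why the build
cannot proceed.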

Patch

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 678c13967580..51b355cbe6d7 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -313,9 +313,12 @@  config DEBUG_INFO_BTF
 config PAHOLE_HAS_SPLIT_BTF
 	def_bool $(success, test `$(PAHOLE) --version | sed -E 's/v([0-9]+)\.([0-9]+)/\1\2/'` -ge "119")
 
+config PAHOLE_HAS_ZEROSIZE_PERCPU_SUPPORT
+	def_bool $(success, test `$(PAHOLE) --version | sed -E 's/v([0-9]+)\.([0-9]+)/\1\2/'` -ge "122")
+
 config DEBUG_INFO_BTF_MODULES
 	def_bool y
-	depends on DEBUG_INFO_BTF && MODULES && PAHOLE_HAS_SPLIT_BTF
+	depends on DEBUG_INFO_BTF && MODULES && PAHOLE_HAS_SPLIT_BTF && PAHOLE_HAS_ZEROSIZE_PERCPU_SUPPORT
 	help
 	  Generate compact split BTF type information for kernel modules.
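
The version test added by the patch can be tried outside Kconfig. The
sketch below feeds fixed version strings through the same sed expression
instead of calling $(PAHOLE) --version; the two function names are
hypothetical and exist only for this illustration.

```shell
#!/bin/sh
# Hypothetical helper: turn "v1.21" into "121" by dropping the "v" and
# the dot, exactly as the sed expression in the patch does.
pahole_version_number() {
	printf '%s\n' "$1" | sed -E 's/v([0-9]+)\.([0-9]+)/\1\2/'
}

# Hypothetical helper: succeed when the version string is at least 1.22,
# mirroring the -ge "122" comparison in the patch.
pahole_has_zerosize_percpu_support() {
	test "$(pahole_version_number "$1")" -ge 122
}

for v in v1.19 v1.21 v1.22; do
	if pahole_has_zerosize_percpu_support "$v"; then
		echo "$v: supported"
	else
		echo "$v: too old"
	fi
done
```

Because major and minor are simply concatenated, a version string such as
v2.0 would become 20 and compare as older than 122, so the scheme relies
on two-digit minor numbers within the 1.x series.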