diff mbox

arm64 crashkernel fails to boot on acpi-only machines due to ACPI regions being no longer mapped as NOMAP

Message ID CACi5LpMzYidDaC0_yfwgVOisH-FqcNViYj+Z54uKfUtHkJKKXA@mail.gmail.com (mailing list archive)
State New, archived
Headers show

Commit Message

Bhupesh Sharma Dec. 25, 2017, 8:14 p.m. UTC
On Mon, Dec 25, 2017 at 8:55 AM, AKASHI Takahiro
<takahiro.akashi@linaro.org> wrote:
> On Sun, Dec 24, 2017 at 01:21:02AM +0530, Bhupesh Sharma wrote:
>> On Fri, Dec 22, 2017 at 2:03 PM, AKASHI Takahiro
>> <takahiro.akashi@linaro.org> wrote:
>> > On Thu, Dec 21, 2017 at 05:36:30PM +0530, Bhupesh Sharma wrote:
>> >> Hello Akashi,
>> >>
>> >> On Thu, Dec 21, 2017 at 4:04 PM, AKASHI Takahiro
>> >> <takahiro.akashi@linaro.org> wrote:
>> >> > Bhupesh,
>> >> >
>> >> > Can you test the patch attached below, please?
>> >> >
>> >> > It is intended to retain already-reserved regions (ACPI reclaim memory
>> >> > in this case) in system ram (i.e. memblock.memory) without explicitly
>> >> > exporting them via usable-memory-range.
>> >> > (I still have to figure out what the side-effect of this patch is.)
>> >> >
>> >> > Thanks,
>> >> > -Takahiro AKASHI
>> >> >
>> >> > On Thu, Dec 21, 2017 at 01:30:43AM +0530, Bhupesh Sharma wrote:
>> >> >> On Tue, Dec 19, 2017 at 6:39 PM, Ard Biesheuvel
>> >> >> <ard.biesheuvel@linaro.org> wrote:
>> >> >> > On 19 December 2017 at 07:09, AKASHI Takahiro
>> >> >> > <takahiro.akashi@linaro.org> wrote:
>> >> >> >> On Mon, Dec 18, 2017 at 01:40:09PM +0800, Dave Young wrote:
>> >> >> >>> On 12/15/17 at 05:59pm, AKASHI Takahiro wrote:
>> >> >> >>> > On Wed, Dec 13, 2017 at 12:17:22PM +0000, Ard Biesheuvel wrote:
>> >> >> >>> > > On 13 December 2017 at 12:16, AKASHI Takahiro
>> >> >> >>> > > <takahiro.akashi@linaro.org> wrote:
>> >> >> >>> > > > On Wed, Dec 13, 2017 at 10:49:27AM +0000, Ard Biesheuvel wrote:
>> >> >> >>> > > >> On 13 December 2017 at 10:26, AKASHI Takahiro
>> >> >> >>> > > >> <takahiro.akashi@linaro.org> wrote:
>> >> >> >>> > > >> > Bhupesh, Ard,
>> >> >> >>> > > >> >
>> >> >> >>> > > >> > On Wed, Dec 13, 2017 at 03:21:59AM +0530, Bhupesh Sharma wrote:
>> >> >> >>> > > >> >> Hi Ard, Akashi
>> >> >> >>> > > >> >>
>> >> >> >>> > > >> > (snip)
>> >> >> >>> > > >> >
>> >> >> >>> > > >> >> Looking deeper into the issue, since the arm64 kexec-tools uses the
>> >> >> >>> > > >> >> 'linux,usable-memory-range' dt property to allow crash dump kernel to
>> >> >> >>> > > >> >> identify its own usable memory and exclude, at its boot time, any
>> >> >> >>> > > >> >> other memory areas that are part of the panicked kernel's memory.
>> >> >> >>> > > >> >> (see https://www.kernel.org/doc/Documentation/devicetree/bindings/chosen.txt
>> >> >> >>> > > >> >> , for details)
>> >> >> >>> > > >> >
>> >> >> >>> > > >> > Right.
>> >> >> >>> > > >> >
>> >> >> >>> > > >> >> 1). Now when 'kexec -p' is executed, this node is patched up only
>> >> >> >>> > > >> >> with the crashkernel memory range:
>> >> >> >>> > > >> >>
>> >> >> >>> > > >> >>                 /* add linux,usable-memory-range */
>> >> >> >>> > > >> >>                 nodeoffset = fdt_path_offset(new_buf, "/chosen");
>> >> >> >>> > > >> >>                 result = fdt_setprop_range(new_buf, nodeoffset,
>> >> >> >>> > > >> >>                                 PROP_USABLE_MEM_RANGE, &crash_reserved_mem,
>> >> >> >>> > > >> >>                                 address_cells, size_cells);
>> >> >> >>> > > >> >>
>> >> >> >>> > > >> >> (see https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git/tree/kexec/arch/arm64/kexec-arm64.c#n465
>> >> >> >>> > > >> >> , for details)
>> >> >> >>> > > >> >>
>> >> >> >>> > > >> >> 2). This excludes the ACPI reclaim regions irrespective of whether
>> >> >> >>> > > >> >> they are marked as System RAM or as RESERVED. As,
>> >> >> >>> > > >> >> 'linux,usable-memory-range' dt node is patched up only with
>> >> >> >>> > > >> >> 'crash_reserved_mem' and not 'system_memory_ranges'
>> >> >> >>> > > >> >>
>> >> >> >>> > > >> >> 3). As a result when the crashkernel boots up it doesn't find this
>> >> >> >>> > > >> >> ACPI memory and crashes while trying to access the same:
>> >> >> >>> > > >> >>
>> >> >> >>> > > >> >> # kexec -p /boot/vmlinuz-`uname -r` --initrd=/boot/initramfs-`uname
>> >> >> >>> > > >> >> -r`.img --reuse-cmdline -d
>> >> >> >>> > > >> >>
>> >> >> >>> > > >> >> [snip..]
>> >> >> >>> > > >> >>
>> >> >> >>> > > >> >> Reserved memory range
>> >> >> >>> > > >> >> 000000000e800000-000000002e7fffff (0)
>> >> >> >>> > > >> >>
>> >> >> >>> > > >> >> Coredump memory ranges
>> >> >> >>> > > >> >> 0000000000000000-000000000e7fffff (0)
>> >> >> >>> > > >> >> 000000002e800000-000000003961ffff (0)
>> >> >> >>> > > >> >> 0000000039d40000-000000003ed2ffff (0)
>> >> >> >>> > > >> >> 000000003ed60000-000000003fbfffff (0)
>> >> >> >>> > > >> >> 0000001040000000-0000001ffbffffff (0)
>> >> >> >>> > > >> >> 0000002000000000-0000002ffbffffff (0)
>> >> >> >>> > > >> >> 0000009000000000-0000009ffbffffff (0)
>> >> >> >>> > > >> >> 000000a000000000-000000affbffffff (0)
>> >> >> >>> > > >> >>
>> >> >> >>> > > >> >> 4). So if we revert Ard's patch or just comment the fixing up of the
>> >> >> >>> > > >> >> memory cap'ing passed to the crash kernel inside
>> >> >> >>> > > >> >> 'arch/arm64/mm/init.c' (see below):
>> >> >> >>> > > >> >>
>> >> >> >>> > > >> >> static void __init fdt_enforce_memory_region(void)
>> >> >> >>> > > >> >> {
>> >> >> >>> > > >> >>         struct memblock_region reg = {
>> >> >> >>> > > >> >>                 .size = 0,
>> >> >> >>> > > >> >>         };
>> >> >> >>> > > >> >>
>> >> >> >>> > > >> >>         of_scan_flat_dt(early_init_dt_scan_usablemem, &reg);
>> >> >> >>> > > >> >>
>> >> >> >>> > > >> >>         if (reg.size)
>> >> >> >>> > > >> >>                 //memblock_cap_memory_range(reg.base, reg.size); /*
>> >> >> >>> > > >> >> comment this out */
>> >> >> >>> > > >> >> }
>> >> >> >>> > > >> >
>> >> >> >>> > > >> > Please just don't do that. It can cause a fatal damage on
>> >> >> >>> > > >> > memory contents of the *crashed* kernel.
>> >> >> >>> > > >> >
>> >> >> >>> > > >> >> 5). Both the above temporary solutions fix the problem.
>> >> >> >>> > > >> >>
>> >> >> >>> > > >> >> 6). However exposing all System RAM regions to the crashkernel is not
>> >> >> >>> > > >> >> advisable and may cause the crashkernel or some crashkernel drivers to
>> >> >> >>> > > >> >> fail.
>> >> >> >>> > > >> >>
>> >> >> >>> > > >> >> 6a). I am trying an approach now, where the ACPI reclaim regions are
>> >> >> >>> > > >> >> added to '/proc/iomem' separately as ACPI reclaim regions by the
>> >> >> >>> > > >> >> kernel code and on the other hand the user-space 'kexec-tools' will
>> >> >> >>> > > >> >> pick up the ACPI reclaim regions from '/proc/iomem' and add it to the
>> >> >> >>> > > >> >> dt node 'linux,usable-memory-range'
>> >> >> >>> > > >> >
>> >> >> >>> > > >> > I still don't understand why we need to carry over the information
>> >> >> >>> > > >> > about "ACPI Reclaim memory" to crash dump kernel. In my understandings,
>> >> >> >>> > > >> > such regions are free to be reused by the kernel after some point of
>> >> >> >>> > > >> > initialization. Why does crash dump kernel need to know about them?
>> >> >> >>> > > >> >
>> >> >> >>> > > >>
>> >> >> >>> > > >> Not really. According to the UEFI spec, they can be reclaimed after
>> >> >> >>> > > >> the OS has initialized, i.e., when it has consumed the ACPI tables and
>> >> >> >>> > > >> no longer needs them. Of course, in order to be able to boot a kexec
>> >> >> >>> > > >> kernel, those regions needs to be preserved, which is why they are
>> >> >> >>> > > >> memblock_reserve()'d now.
>> >> >> >>> > > >
>> >> >> >>> > > > For my better understandings, who is actually accessing such regions
>> >> >> >>> > > > during boot time, uefi itself or efistub?
>> >> >> >>> > > >
>> >> >> >>> > >
>> >> >> >>> > > No, only the kernel. This is where the ACPI tables are stored. For
>> >> >> >>> > > instance, on QEMU we have
>> >> >> >>> > >
>> >> >> >>> > >  ACPI: RSDP 0x0000000078980000 000024 (v02 BOCHS )
>> >> >> >>> > >  ACPI: XSDT 0x0000000078970000 000054 (v01 BOCHS  BXPCFACP 00000001
>> >> >> >>> > >   01000013)
>> >> >> >>> > >  ACPI: FACP 0x0000000078930000 00010C (v05 BOCHS  BXPCFACP 00000001
>> >> >> >>> > > BXPC 00000001)
>> >> >> >>> > >  ACPI: DSDT 0x0000000078940000 0011DA (v02 BOCHS  BXPCDSDT 00000001
>> >> >> >>> > > BXPC 00000001)
>> >> >> >>> > >  ACPI: APIC 0x0000000078920000 000140 (v03 BOCHS  BXPCAPIC 00000001
>> >> >> >>> > > BXPC 00000001)
>> >> >> >>> > >  ACPI: GTDT 0x0000000078910000 000060 (v02 BOCHS  BXPCGTDT 00000001
>> >> >> >>> > > BXPC 00000001)
>> >> >> >>> > >  ACPI: MCFG 0x0000000078900000 00003C (v01 BOCHS  BXPCMCFG 00000001
>> >> >> >>> > > BXPC 00000001)
>> >> >> >>> > >  ACPI: SPCR 0x00000000788F0000 000050 (v02 BOCHS  BXPCSPCR 00000001
>> >> >> >>> > > BXPC 00000001)
>> >> >> >>> > >  ACPI: IORT 0x00000000788E0000 00007C (v00 BOCHS  BXPCIORT 00000001
>> >> >> >>> > > BXPC 00000001)
>> >> >> >>> > >
>> >> >> >>> > > covered by
>> >> >> >>> > >
>> >> >> >>> > >  efi:   0x0000788e0000-0x00007894ffff [ACPI Reclaim Memory ...]
>> >> >> >>> > >  ...
>> >> >> >>> > >  efi:   0x000078970000-0x00007898ffff [ACPI Reclaim Memory ...]
>> >> >> >>> >
>> >> >> >>> > OK. I mistakenly understood those regions could be freed after exiting
>> >> >> >>> > UEFI boot services.
>> >> >> >>> >
>> >> >> >>> > >
>> >> >> >>> > > >> So it seems that kexec does not honour the memblock_reserve() table
>> >> >> >>> > > >> when booting the next kernel.
>> >> >> >>> > > >
>> >> >> >>> > > > not really.
>> >> >> >>> > > >
>> >> >> >>> > > >> > (In other words, can or should we skip some part of ACPI-related init code
>> >> >> >>> > > >> > on crash dump kernel?)
>> >> >> >>> > > >> >
>> >> >> >>> > > >>
>> >> >> >>> > > >> I don't think so. And the change to the handling of ACPI reclaim
>> >> >> >>> > > >> regions only revealed the bug, not created it (given that other
>> >> >> >>> > > >> memblock_reserve regions may be affected as well)
>> >> >> >>> > > >
>> >> >> >>> > > > As whether we should honor such reserved regions over kexec'ing
>> >> >> >>> > > > depends on each one's specific nature, we will have to take care one-by-one.
>> >> >> >>> > > > As a matter of fact, no information about "reserved" memblocks is
>> >> >> >>> > > > exposed to user space (via proc/iomem).
>> >> >> >>> > > >
>> >> >> >>> > >
>> >> >> >>> > > That is why I suggested (somewhere in this thread?) to not expose them
>> >> >> >>> > > as 'System RAM'. Do you think that could solve this?
>> >> >> >>> >
>> >> >> >>> > Memblock-reserv'ing them is necessary to prevent their corruption and
>> >> >> >>> > marking them under another name in /proc/iomem would also be good in order
>> >> >> >>> > not to allocate them as part of crash kernel's memory.
>> >> >> >>> >
>> >> >> >>> > But I'm not still convinced that we should export them in useable-
>> >> >> >>> > memory-range to crash dump kernel. They will be accessed through
>> >> >> >>> > acpi_os_map_memory() and so won't be required to be part of system ram
>> >> >> >>> > (or memblocks), I guess.
>> >> >> >>> >     -> Bhupesh?
>> >> >> >>>
>> >> >> >>> I forgot how arm64 kernel retrieve the memory ranges and initialize
>> >> >> >>> them.  If no "e820" like interfaces shouldn't kernel reinitialize all
>> >> >> >>> the memory according to the efi memmap?  For kdump kernel anything other
>> >> >> >>> than usable memory (which is from the dt node instead) should be
>> >> >> >>> reinitialized according to efi passed info, no?
>> >> >> >>
>> >> >> >> All the regions exported in efi memmap will be added to memblock.memory
>> >> >> >> in (u)efi_init() and then trimmed down to the exact range specified as
>> >> >> >> usable-memory-range by fdt_enforce_memory_region().
>> >> >> >>
>> >> >> >> Now I noticed that the current fdt_enforce_memory_region() may not work well
>> >> >> >> with multiple entries in usable-memory-range.
>> >> >> >>
>> >> >> >
>> >> >> > In any case, the root of the problem is that memory regions lose their
>> >> >> > 'memory' annotation due to the way the memory map is mangled before
>> >> >> > being supplied to the kexec kernel.
>> >> >> >
>> >> >> > Would it be possible to classify all memory that we want to hide from
>> >> >> > the kexec kernel as NOMAP instead? That way, it will not be mapped
>> >> >> > implicitly, but will still be mapped cacheable by acpi_os_ioremap(),
>> >> >> > so this seems to be the most appropriate way to deal with the host
>> >> >> > kernel's memory contents.
>> >> >>
>> >> >> Hmm. wouldn't appending the acpi reclaim regions to
>> >> >> 'linux,usable-memory-range' in the dtb being passed to the crashkernel
>> >> >> be better? Because its indirectly achieving a similar objective
>> >> >> (although may be a subset of all System RAM regions on the primary
>> >> >> kernel's memory).
>> >> >>
>> >> >> I am not aware of the background about the current kexec-tools
>> >> >> implementation where we add only the crashkernel range to the dtb
>> >> >> being passed to the crashkernel.
>> >> >>
>> >> >> Probably Akashi can answer better, as to how we arrived at this design
>> >> >> approach and why we didn't want to expose all System RAM regions (i.e.
>> >> >> ! NOMPAP regions) to the crashkernel.
>> >> >>
>> >> >> I am suspecting that some issues were seen/meet when the System RAM (!
>> >> >> NOMAP regions) were exposed to the crashkernel, and that's why we
>> >> >> finalized on this design approach, but this is something which is just
>> >> >> my guess.
>> >> >>
>> >> >> Regards,
>> >> >> Bhupesh
>> >> >>
>> >> >> >>> >
>> >> >> >>> > Just FYI, on x86, ACPI tables seems to be exposed to crash dump kernel
>> >> >> >>> > via a kernel command line parameter, "memmap=".
>> >> >> >>>
>> >> >> >>> memmap= is only used in old kexec-tools, now we are passing them via
>> >> >> >>> e820 table.
>> >> >> >>
>> >> >> >> Thanks. I remember that you have explained it before.
>> >> >> >>
>> >> >> >> -Takahiro AKASHI
>> >> >> >>
>> >> >> >>> [snip]
>> >> >> >>>
>> >> >> >>> Thanks
>> >> >> >>> Dave
>> >> >
>> >> > ===8<==
>> >> > From 74e2451fea83d546feae76160ba7de426913fe03 Mon Sep 17 00:00:00 2001
>> >> > From: AKASHI Takahiro <takahiro.akashi@linaro.org>
>> >> > Date: Thu, 21 Dec 2017 19:14:23 +0900
>> >> > Subject: [PATCH] arm64: kdump: mark unusable memory as NOMAP
>> >> >
>> >> > ---
>> >> >  arch/arm64/mm/init.c | 10 ++++++++--
>> >> >  1 file changed, 8 insertions(+), 2 deletions(-)
>> >> >
>> >> > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
>> >> > index 00e7b900ca41..8175db94257b 100644
>> >> > --- a/arch/arm64/mm/init.c
>> >> > +++ b/arch/arm64/mm/init.c
>> >> > @@ -352,11 +352,17 @@ static void __init fdt_enforce_memory_region(void)
>> >> >         struct memblock_region reg = {
>> >> >                 .size = 0,
>> >> >         };
>> >> > +       u64 idx;
>> >> > +       phys_addr_t start, end;
>> >> >
>> >> >         of_scan_flat_dt(early_init_dt_scan_usablemem, &reg);
>> >> >
>> >> > -       if (reg.size)
>> >> > -               memblock_cap_memory_range(reg.base, reg.size);
>> >> > +       if (reg.size) {
>> >> > +               for_each_free_mem_range(idx, NUMA_NO_NODE, MEMBLOCK_NONE,
>> >> > +                                       &start, &end, NULL)
>> >> > +                       memblock_mark_nomap(start, end - start);
>> >> > +               memblock_clear_nomap(reg.base, reg.size);
>> >> > +       }
>> >> >  }
>> >> >
>> >> >  void __init arm64_memblock_init(void)
>> >> > --
>> >> > 2.15.1
>> >> >
>> >>
>> >> Thanks for the patch. After applying this on top of
>> >> 4.15.0-rc4-next-20171220, there seems to be a improvement and the
>> >> crashkernel boot no longer hangs while trying to access the acpi
>> >> tables.
>> >>
>> >> However I notice a minor issue. Please see the log below for
>> >> reference, the following message keeps spamming the console but I see
>> >> the crashkernel boot proceed further.:
>> >>
>> >> [    0.000000] ACPI: NUMA: SRAT: PXM 3 -> MPIDR 0x70303 -> Node 3
>> >> [    0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x3fffffff]
>> >> [    0.000000] ACPI: SRAT: Node 1 PXM 1 [mem 0x2000000000-0x2fffffffff]
>> >> [    0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x1000000000-0x1fffffffff]
>> >> [    0.000000] ACPI: SRAT: Node 3 PXM 3 [mem 0xa000000000-0xafffffffff]
>> >> [    0.000000] ACPI: SRAT: Node 2 PXM 2 [mem 0x9000000000-0x9fffffffff]
>> >> [    0.000000] NUMA: NODE_DATA [mem 0x1ffbffe200-0x1ffbffffff]
>> >> [    0.000000] NUMA: NODE_DATA [mem 0x1ffbffc400-0x1ffbffe1ff]
>> >> [    0.000000] NUMA: NODE_DATA(1) on node 0
>> >> [    0.000000] NUMA: NODE_DATA [mem 0x1ffbffa600-0x1ffbffc3ff]
>> >> [    0.000000] NUMA: NODE_DATA(2) on node 0
>> >> [    0.000000] NUMA: NODE_DATA [mem 0x1ffbff8800-0x1ffbffa5ff]
>> >> [    0.000000] NUMA: NODE_DATA(3) on node 0
>> >> [    0.000000] [ffff7fe008000000-ffff7fe00800ffff] potential offnode
>> >> page_structs
>> >> [    0.000000] [ffff7fe008010000-ffff7fe00801ffff] potential offnode
>> >> page_structs
>> >> [    0.000000] [ffff7fe008020000-ffff7fe00802ffff] potential offnode
>> >> page_structs
>> >> [    0.000000] [ffff7fe008030000-ffff7fe00803ffff] potential offnode
>> >> page_structs
>> >> [    0.000000] [ffff7fe008040000-ffff7fe00804ffff] potential offnode
>> >> page_structs
>> >> [    0.000000] [ffff7fe008050000-ffff7fe00805ffff] potential offnode
>> >> page_structs
>> >>
>> >> [snip..]
>> >> [    0.000000] [ffff7fe0081f0000-ffff7fe0081fffff] potential offnode
>> >> page_structs
>> >
>> > These messages shows that some "struct page" data are allocated on remote
>> > (numa) nodes.
>> > Since on your crash dump kernel, all the usable system memory (starting
>> > 0x0e800000) belongs to Node#0, we can't avoid such non-local allocations.
>> >
>> > In my best guess, you can ingore them except for some performance penality.
>> > This may be one side-effect.
>> >
>> > So does your crash dump kernel now boot successfully?
>> >
>>
>> Indeed. The crash dump kernel now boots successfully and the crash
>> dump core can be saved properly as well (I tried saving it to local
>> disk).
>
> Thank you for the confirmation.
> (I'd like to suggest you to examine the core dump with crash utility.)
>
>> However, the 'potential offnode page_structs' WARN messages hog the
>> console and delay crashkernel boot for a significant duration, which
>> can be irritating.
>>
>> Can we also consider ratelimiting this WARNING message [which seems to
>> come from vmemmap_verify()] if invoked in the context of crash kernel,
>> in addition to making the above change suggested by  you.
>
> Well, we may be able to change pr_warn() to pr_warn_once() here, but
> I hope that adding "numa=off" to kernel command line should also work.

Hmm, adding "numa=off" to crashkernel bootargs works, and TBH it was
my initial thought process as well, but I am not sure if this will
cause any regressions on aarch64 systems which use crashdump feature.

I think the 2nd solution, i.e limiting the warn message print
frequency might be a better option. Can you please add the following
patch (may be as a separate one) and send it along the patch which
marks all areas other than the crashkernel region being passed to the
crashkernel as NOMAP, so that we can get this issue fixed in upstream
aarch64 kernel:

I have tested this solution on huawei taishan board and can boot
crashkernel successfully and also save the crash core properly
(without the  console warn message flooding which used to hold up the
crashkernel boot).

Thanks,
Bhupesh

Comments

Dave Young Dec. 26, 2017, 1:32 a.m. UTC | #1
On 12/26/17 at 01:44am, Bhupesh Sharma wrote:
> On Mon, Dec 25, 2017 at 8:55 AM, AKASHI Takahiro
> <takahiro.akashi@linaro.org> wrote:
> > On Sun, Dec 24, 2017 at 01:21:02AM +0530, Bhupesh Sharma wrote:
> >> On Fri, Dec 22, 2017 at 2:03 PM, AKASHI Takahiro
> >> <takahiro.akashi@linaro.org> wrote:
> >> > On Thu, Dec 21, 2017 at 05:36:30PM +0530, Bhupesh Sharma wrote:
> >> >> Hello Akashi,
> >> >>
> >> >> On Thu, Dec 21, 2017 at 4:04 PM, AKASHI Takahiro
> >> >> <takahiro.akashi@linaro.org> wrote:
> >> >> > Bhupesh,
> >> >> >
> >> >> > Can you test the patch attached below, please?
> >> >> >
> >> >> > It is intended to retain already-reserved regions (ACPI reclaim memory
> >> >> > in this case) in system ram (i.e. memblock.memory) without explicitly
> >> >> > exporting them via usable-memory-range.
> >> >> > (I still have to figure out what the side-effect of this patch is.)
> >> >> >
> >> >> > Thanks,
> >> >> > -Takahiro AKASHI
> >> >> >
> >> >> > On Thu, Dec 21, 2017 at 01:30:43AM +0530, Bhupesh Sharma wrote:
> >> >> >> On Tue, Dec 19, 2017 at 6:39 PM, Ard Biesheuvel
> >> >> >> <ard.biesheuvel@linaro.org> wrote:
> >> >> >> > On 19 December 2017 at 07:09, AKASHI Takahiro
> >> >> >> > <takahiro.akashi@linaro.org> wrote:
> >> >> >> >> On Mon, Dec 18, 2017 at 01:40:09PM +0800, Dave Young wrote:
> >> >> >> >>> On 12/15/17 at 05:59pm, AKASHI Takahiro wrote:
> >> >> >> >>> > On Wed, Dec 13, 2017 at 12:17:22PM +0000, Ard Biesheuvel wrote:
> >> >> >> >>> > > On 13 December 2017 at 12:16, AKASHI Takahiro
> >> >> >> >>> > > <takahiro.akashi@linaro.org> wrote:
> >> >> >> >>> > > > On Wed, Dec 13, 2017 at 10:49:27AM +0000, Ard Biesheuvel wrote:
> >> >> >> >>> > > >> On 13 December 2017 at 10:26, AKASHI Takahiro
> >> >> >> >>> > > >> <takahiro.akashi@linaro.org> wrote:
> >> >> >> >>> > > >> > Bhupesh, Ard,
> >> >> >> >>> > > >> >
> >> >> >> >>> > > >> > On Wed, Dec 13, 2017 at 03:21:59AM +0530, Bhupesh Sharma wrote:
> >> >> >> >>> > > >> >> Hi Ard, Akashi
> >> >> >> >>> > > >> >>
> >> >> >> >>> > > >> > (snip)
> >> >> >> >>> > > >> >
> >> >> >> >>> > > >> >> Looking deeper into the issue, since the arm64 kexec-tools uses the
> >> >> >> >>> > > >> >> 'linux,usable-memory-range' dt property to allow crash dump kernel to
> >> >> >> >>> > > >> >> identify its own usable memory and exclude, at its boot time, any
> >> >> >> >>> > > >> >> other memory areas that are part of the panicked kernel's memory.
> >> >> >> >>> > > >> >> (see https://www.kernel.org/doc/Documentation/devicetree/bindings/chosen.txt
> >> >> >> >>> > > >> >> , for details)
> >> >> >> >>> > > >> >
> >> >> >> >>> > > >> > Right.
> >> >> >> >>> > > >> >
> >> >> >> >>> > > >> >> 1). Now when 'kexec -p' is executed, this node is patched up only
> >> >> >> >>> > > >> >> with the crashkernel memory range:
> >> >> >> >>> > > >> >>
> >> >> >> >>> > > >> >>                 /* add linux,usable-memory-range */
> >> >> >> >>> > > >> >>                 nodeoffset = fdt_path_offset(new_buf, "/chosen");
> >> >> >> >>> > > >> >>                 result = fdt_setprop_range(new_buf, nodeoffset,
> >> >> >> >>> > > >> >>                                 PROP_USABLE_MEM_RANGE, &crash_reserved_mem,
> >> >> >> >>> > > >> >>                                 address_cells, size_cells);
> >> >> >> >>> > > >> >>
> >> >> >> >>> > > >> >> (see https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git/tree/kexec/arch/arm64/kexec-arm64.c#n465
> >> >> >> >>> > > >> >> , for details)
> >> >> >> >>> > > >> >>
> >> >> >> >>> > > >> >> 2). This excludes the ACPI reclaim regions irrespective of whether
> >> >> >> >>> > > >> >> they are marked as System RAM or as RESERVED. As,
> >> >> >> >>> > > >> >> 'linux,usable-memory-range' dt node is patched up only with
> >> >> >> >>> > > >> >> 'crash_reserved_mem' and not 'system_memory_ranges'
> >> >> >> >>> > > >> >>
> >> >> >> >>> > > >> >> 3). As a result when the crashkernel boots up it doesn't find this
> >> >> >> >>> > > >> >> ACPI memory and crashes while trying to access the same:
> >> >> >> >>> > > >> >>
> >> >> >> >>> > > >> >> # kexec -p /boot/vmlinuz-`uname -r` --initrd=/boot/initramfs-`uname
> >> >> >> >>> > > >> >> -r`.img --reuse-cmdline -d
> >> >> >> >>> > > >> >>
> >> >> >> >>> > > >> >> [snip..]
> >> >> >> >>> > > >> >>
> >> >> >> >>> > > >> >> Reserved memory range
> >> >> >> >>> > > >> >> 000000000e800000-000000002e7fffff (0)
> >> >> >> >>> > > >> >>
> >> >> >> >>> > > >> >> Coredump memory ranges
> >> >> >> >>> > > >> >> 0000000000000000-000000000e7fffff (0)
> >> >> >> >>> > > >> >> 000000002e800000-000000003961ffff (0)
> >> >> >> >>> > > >> >> 0000000039d40000-000000003ed2ffff (0)
> >> >> >> >>> > > >> >> 000000003ed60000-000000003fbfffff (0)
> >> >> >> >>> > > >> >> 0000001040000000-0000001ffbffffff (0)
> >> >> >> >>> > > >> >> 0000002000000000-0000002ffbffffff (0)
> >> >> >> >>> > > >> >> 0000009000000000-0000009ffbffffff (0)
> >> >> >> >>> > > >> >> 000000a000000000-000000affbffffff (0)
> >> >> >> >>> > > >> >>
> >> >> >> >>> > > >> >> 4). So if we revert Ard's patch or just comment the fixing up of the
> >> >> >> >>> > > >> >> memory cap'ing passed to the crash kernel inside
> >> >> >> >>> > > >> >> 'arch/arm64/mm/init.c' (see below):
> >> >> >> >>> > > >> >>
> >> >> >> >>> > > >> >> static void __init fdt_enforce_memory_region(void)
> >> >> >> >>> > > >> >> {
> >> >> >> >>> > > >> >>         struct memblock_region reg = {
> >> >> >> >>> > > >> >>                 .size = 0,
> >> >> >> >>> > > >> >>         };
> >> >> >> >>> > > >> >>
> >> >> >> >>> > > >> >>         of_scan_flat_dt(early_init_dt_scan_usablemem, &reg);
> >> >> >> >>> > > >> >>
> >> >> >> >>> > > >> >>         if (reg.size)
> >> >> >> >>> > > >> >>                 //memblock_cap_memory_range(reg.base, reg.size); /*
> >> >> >> >>> > > >> >> comment this out */
> >> >> >> >>> > > >> >> }
> >> >> >> >>> > > >> >
> >> >> >> >>> > > >> > Please just don't do that. It can cause a fatal damage on
> >> >> >> >>> > > >> > memory contents of the *crashed* kernel.
> >> >> >> >>> > > >> >
> >> >> >> >>> > > >> >> 5). Both the above temporary solutions fix the problem.
> >> >> >> >>> > > >> >>
> >> >> >> >>> > > >> >> 6). However exposing all System RAM regions to the crashkernel is not
> >> >> >> >>> > > >> >> advisable and may cause the crashkernel or some crashkernel drivers to
> >> >> >> >>> > > >> >> fail.
> >> >> >> >>> > > >> >>
> >> >> >> >>> > > >> >> 6a). I am trying an approach now, where the ACPI reclaim regions are
> >> >> >> >>> > > >> >> added to '/proc/iomem' separately as ACPI reclaim regions by the
> >> >> >> >>> > > >> >> kernel code and on the other hand the user-space 'kexec-tools' will
> >> >> >> >>> > > >> >> pick up the ACPI reclaim regions from '/proc/iomem' and add it to the
> >> >> >> >>> > > >> >> dt node 'linux,usable-memory-range'
> >> >> >> >>> > > >> >
> >> >> >> >>> > > >> > I still don't understand why we need to carry over the information
> >> >> >> >>> > > >> > about "ACPI Reclaim memory" to crash dump kernel. In my understandings,
> >> >> >> >>> > > >> > such regions are free to be reused by the kernel after some point of
> >> >> >> >>> > > >> > initialization. Why does crash dump kernel need to know about them?
> >> >> >> >>> > > >> >
> >> >> >> >>> > > >>
> >> >> >> >>> > > >> Not really. According to the UEFI spec, they can be reclaimed after
> >> >> >> >>> > > >> the OS has initialized, i.e., when it has consumed the ACPI tables and
> >> >> >> >>> > > >> no longer needs them. Of course, in order to be able to boot a kexec
> >> >> >> >>> > > >> kernel, those regions needs to be preserved, which is why they are
> >> >> >> >>> > > >> memblock_reserve()'d now.
> >> >> >> >>> > > >
> >> >> >> >>> > > > For my better understandings, who is actually accessing such regions
> >> >> >> >>> > > > during boot time, uefi itself or efistub?
> >> >> >> >>> > > >
> >> >> >> >>> > >
> >> >> >> >>> > > No, only the kernel. This is where the ACPI tables are stored. For
> >> >> >> >>> > > instance, on QEMU we have
> >> >> >> >>> > >
> >> >> >> >>> > >  ACPI: RSDP 0x0000000078980000 000024 (v02 BOCHS )
> >> >> >> >>> > >  ACPI: XSDT 0x0000000078970000 000054 (v01 BOCHS  BXPCFACP 00000001
> >> >> >> >>> > >   01000013)
> >> >> >> >>> > >  ACPI: FACP 0x0000000078930000 00010C (v05 BOCHS  BXPCFACP 00000001
> >> >> >> >>> > > BXPC 00000001)
> >> >> >> >>> > >  ACPI: DSDT 0x0000000078940000 0011DA (v02 BOCHS  BXPCDSDT 00000001
> >> >> >> >>> > > BXPC 00000001)
> >> >> >> >>> > >  ACPI: APIC 0x0000000078920000 000140 (v03 BOCHS  BXPCAPIC 00000001
> >> >> >> >>> > > BXPC 00000001)
> >> >> >> >>> > >  ACPI: GTDT 0x0000000078910000 000060 (v02 BOCHS  BXPCGTDT 00000001
> >> >> >> >>> > > BXPC 00000001)
> >> >> >> >>> > >  ACPI: MCFG 0x0000000078900000 00003C (v01 BOCHS  BXPCMCFG 00000001
> >> >> >> >>> > > BXPC 00000001)
> >> >> >> >>> > >  ACPI: SPCR 0x00000000788F0000 000050 (v02 BOCHS  BXPCSPCR 00000001
> >> >> >> >>> > > BXPC 00000001)
> >> >> >> >>> > >  ACPI: IORT 0x00000000788E0000 00007C (v00 BOCHS  BXPCIORT 00000001
> >> >> >> >>> > > BXPC 00000001)
> >> >> >> >>> > >
> >> >> >> >>> > > covered by
> >> >> >> >>> > >
> >> >> >> >>> > >  efi:   0x0000788e0000-0x00007894ffff [ACPI Reclaim Memory ...]
> >> >> >> >>> > >  ...
> >> >> >> >>> > >  efi:   0x000078970000-0x00007898ffff [ACPI Reclaim Memory ...]
> >> >> >> >>> >
> >> >> >> >>> > OK. I mistakenly understood those regions could be freed after exiting
> >> >> >> >>> > UEFI boot services.
> >> >> >> >>> >
> >> >> >> >>> > >
> >> >> >> >>> > > >> So it seems that kexec does not honour the memblock_reserve() table
> >> >> >> >>> > > >> when booting the next kernel.
> >> >> >> >>> > > >
> >> >> >> >>> > > > not really.
> >> >> >> >>> > > >
> >> >> >> >>> > > >> > (In other words, can or should we skip some part of ACPI-related init code
> >> >> >> >>> > > >> > on crash dump kernel?)
> >> >> >> >>> > > >> >
> >> >> >> >>> > > >>
> >> >> >> >>> > > >> I don't think so. And the change to the handling of ACPI reclaim
> >> >> >> >>> > > >> regions only revealed the bug, not created it (given that other
> >> >> >> >>> > > >> memblock_reserve regions may be affected as well)
> >> >> >> >>> > > >
> >> >> >> >>> > > > As whether we should honor such reserved regions over kexec'ing
> >> >> >> >>> > > > depends on each one's specific nature, we will have to take care one-by-one.
> >> >> >> >>> > > > As a matter of fact, no information about "reserved" memblocks is
> >> >> >> >>> > > > exposed to user space (via proc/iomem).
> >> >> >> >>> > > >
> >> >> >> >>> > >
> >> >> >> >>> > > That is why I suggested (somewhere in this thread?) to not expose them
> >> >> >> >>> > > as 'System RAM'. Do you think that could solve this?
> >> >> >> >>> >
> >> >> >> >>> > Memblock-reserv'ing them is necessary to prevent their corruption and
> >> >> >> >>> > marking them under another name in /proc/iomem would also be good in order
> >> >> >> >>> > not to allocate them as part of crash kernel's memory.
> >> >> >> >>> >
> >> >> >> >>> > But I'm not still convinced that we should export them in useable-
> >> >> >> >>> > memory-range to crash dump kernel. They will be accessed through
> >> >> >> >>> > acpi_os_map_memory() and so won't be required to be part of system ram
> >> >> >> >>> > (or memblocks), I guess.
> >> >> >> >>> >     -> Bhupesh?
> >> >> >> >>>
> >> >> >> >>> I forgot how arm64 kernel retrieve the memory ranges and initialize
> >> >> >> >>> them.  If no "e820" like interfaces shouldn't kernel reinitialize all
> >> >> >> >>> the memory according to the efi memmap?  For kdump kernel anything other
> >> >> >> >>> than usable memory (which is from the dt node instead) should be
> >> >> >> >>> reinitialized according to efi passed info, no?
> >> >> >> >>
> >> >> >> >> All the regions exported in efi memmap will be added to memblock.memory
> >> >> >> >> in (u)efi_init() and then trimmed down to the exact range specified as
> >> >> >> >> usable-memory-range by fdt_enforce_memory_region().
> >> >> >> >>
> >> >> >> >> Now I noticed that the current fdt_enforce_memory_region() may not work well
> >> >> >> >> with multiple entries in usable-memory-range.
> >> >> >> >>
> >> >> >> >
> >> >> >> > In any case, the root of the problem is that memory regions lose their
> >> >> >> > 'memory' annotation due to the way the memory map is mangled before
> >> >> >> > being supplied to the kexec kernel.
> >> >> >> >
> >> >> >> > Would it be possible to classify all memory that we want to hide from
> >> >> >> > the kexec kernel as NOMAP instead? That way, it will not be mapped
> >> >> >> > implicitly, but will still be mapped cacheable by acpi_os_ioremap(),
> >> >> >> > so this seems to be the most appropriate way to deal with the host
> >> >> >> > kernel's memory contents.
> >> >> >>
> >> >> >> Hmm. wouldn't appending the acpi reclaim regions to
> >> >> >> 'linux,usable-memory-range' in the dtb being passed to the crashkernel
> >> >> >> be better? Because its indirectly achieving a similar objective
> >> >> >> (although may be a subset of all System RAM regions on the primary
> >> >> >> kernel's memory).
> >> >> >>
> >> >> >> I am not aware of the background about the current kexec-tools
> >> >> >> implementation where we add only the crashkernel range to the dtb
> >> >> >> being passed to the crashkernel.
> >> >> >>
> >> >> >> Probably Akashi can answer better, as to how we arrived at this design
> >> >> >> approach and why we didn't want to expose all System RAM regions (i.e.
> >> >> >> ! NOMPAP regions) to the crashkernel.
> >> >> >>
> >> >> >> I am suspecting that some issues were seen/meet when the System RAM (!
> >> >> >> NOMAP regions) were exposed to the crashkernel, and that's why we
> >> >> >> finalized on this design approach, but this is something which is just
> >> >> >> my guess.
> >> >> >>
> >> >> >> Regards,
> >> >> >> Bhupesh
> >> >> >>
> >> >> >> >>> >
> >> >> >> >>> > Just FYI, on x86, ACPI tables seems to be exposed to crash dump kernel
> >> >> >> >>> > via a kernel command line parameter, "memmap=".
> >> >> >> >>>
> >> >> >> >>> memmap= is only used in old kexec-tools, now we are passing them via
> >> >> >> >>> e820 table.
> >> >> >> >>
> >> >> >> >> Thanks. I remember that you have explained it before.
> >> >> >> >>
> >> >> >> >> -Takahiro AKASHI
> >> >> >> >>
> >> >> >> >>> [snip]
> >> >> >> >>>
> >> >> >> >>> Thanks
> >> >> >> >>> Dave
> >> >> >
> >> >> > ===8<==
> >> >> > From 74e2451fea83d546feae76160ba7de426913fe03 Mon Sep 17 00:00:00 2001
> >> >> > From: AKASHI Takahiro <takahiro.akashi@linaro.org>
> >> >> > Date: Thu, 21 Dec 2017 19:14:23 +0900
> >> >> > Subject: [PATCH] arm64: kdump: mark unusable memory as NOMAP
> >> >> >
> >> >> > ---
> >> >> >  arch/arm64/mm/init.c | 10 ++++++++--
> >> >> >  1 file changed, 8 insertions(+), 2 deletions(-)
> >> >> >
> >> >> > diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> >> >> > index 00e7b900ca41..8175db94257b 100644
> >> >> > --- a/arch/arm64/mm/init.c
> >> >> > +++ b/arch/arm64/mm/init.c
> >> >> > @@ -352,11 +352,17 @@ static void __init fdt_enforce_memory_region(void)
> >> >> >         struct memblock_region reg = {
> >> >> >                 .size = 0,
> >> >> >         };
> >> >> > +       u64 idx;
> >> >> > +       phys_addr_t start, end;
> >> >> >
> >> >> >         of_scan_flat_dt(early_init_dt_scan_usablemem, &reg);
> >> >> >
> >> >> > -       if (reg.size)
> >> >> > -               memblock_cap_memory_range(reg.base, reg.size);
> >> >> > +       if (reg.size) {
> >> >> > +               for_each_free_mem_range(idx, NUMA_NO_NODE, MEMBLOCK_NONE,
> >> >> > +                                       &start, &end, NULL)
> >> >> > +                       memblock_mark_nomap(start, end - start);
> >> >> > +               memblock_clear_nomap(reg.base, reg.size);
> >> >> > +       }
> >> >> >  }
> >> >> >
> >> >> >  void __init arm64_memblock_init(void)
> >> >> > --
> >> >> > 2.15.1
> >> >> >
> >> >>
> >> >> Thanks for the patch. After applying this on top of
> >> >> 4.15.0-rc4-next-20171220, there seems to be a improvement and the
> >> >> crashkernel boot no longer hangs while trying to access the acpi
> >> >> tables.
> >> >>
> >> >> However I notice a minor issue. Please see the log below for
> >> >> reference, the following message keeps spamming the console but I see
> >> >> the crashkernel boot proceed further.:
> >> >>
> >> >> [    0.000000] ACPI: NUMA: SRAT: PXM 3 -> MPIDR 0x70303 -> Node 3
> >> >> [    0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x3fffffff]
> >> >> [    0.000000] ACPI: SRAT: Node 1 PXM 1 [mem 0x2000000000-0x2fffffffff]
> >> >> [    0.000000] ACPI: SRAT: Node 0 PXM 0 [mem 0x1000000000-0x1fffffffff]
> >> >> [    0.000000] ACPI: SRAT: Node 3 PXM 3 [mem 0xa000000000-0xafffffffff]
> >> >> [    0.000000] ACPI: SRAT: Node 2 PXM 2 [mem 0x9000000000-0x9fffffffff]
> >> >> [    0.000000] NUMA: NODE_DATA [mem 0x1ffbffe200-0x1ffbffffff]
> >> >> [    0.000000] NUMA: NODE_DATA [mem 0x1ffbffc400-0x1ffbffe1ff]
> >> >> [    0.000000] NUMA: NODE_DATA(1) on node 0
> >> >> [    0.000000] NUMA: NODE_DATA [mem 0x1ffbffa600-0x1ffbffc3ff]
> >> >> [    0.000000] NUMA: NODE_DATA(2) on node 0
> >> >> [    0.000000] NUMA: NODE_DATA [mem 0x1ffbff8800-0x1ffbffa5ff]
> >> >> [    0.000000] NUMA: NODE_DATA(3) on node 0
> >> >> [    0.000000] [ffff7fe008000000-ffff7fe00800ffff] potential offnode
> >> >> page_structs
> >> >> [    0.000000] [ffff7fe008010000-ffff7fe00801ffff] potential offnode
> >> >> page_structs
> >> >> [    0.000000] [ffff7fe008020000-ffff7fe00802ffff] potential offnode
> >> >> page_structs
> >> >> [    0.000000] [ffff7fe008030000-ffff7fe00803ffff] potential offnode
> >> >> page_structs
> >> >> [    0.000000] [ffff7fe008040000-ffff7fe00804ffff] potential offnode
> >> >> page_structs
> >> >> [    0.000000] [ffff7fe008050000-ffff7fe00805ffff] potential offnode
> >> >> page_structs
> >> >>
> >> >> [snip..]
> >> >> [    0.000000] [ffff7fe0081f0000-ffff7fe0081fffff] potential offnode
> >> >> page_structs
> >> >
> >> > These messages shows that some "struct page" data are allocated on remote
> >> > (numa) nodes.
> >> > Since on your crash dump kernel, all the usable system memory (starting
> >> > 0x0e800000) belongs to Node#0, we can't avoid such non-local allocations.
> >> >
> >> > In my best guess, you can ingore them except for some performance penality.
> >> > This may be one side-effect.
> >> >
> >> > So does your crash dump kernel now boot successfully?
> >> >
> >>
> >> Indeed. The crash dump kernel now boots successfully and the crash
> >> dump core can be saved properly as well (I tried saving it to local
> >> disk).
> >
> > Thank you for the confirmation.
> > (I'd like to suggest you to examine the core dump with crash utility.)
> >
> >> However, the 'potential offnode page_structs' WARN messages hog the
> >> console and delay crashkernel boot for a significant duration, which
> >> can be irritating.
> >>
> >> Can we also consider ratelimiting this WARNING message [which seems to
> >> come from vmemmap_verify()] if invoked in the context of crash kernel,
> >> in addition to making the above change suggested by  you.
> >
> > Well, we may be able to change pr_warn() to pr_warn_once() here, but
> > I hope that adding "numa=off" to kernel command line should also work.
> 
> Hmm, adding "numa=off" to crashkernel bootargs works, and TBH it was
> my initial thought process as well, but I am not sure if this will
> cause any regressions on aarch64 systems which use crashdump feature.

It should be fine since we use numa=off by default for all other arches
ie. x86, ppc64 and s390. Actually disabling numa in kdump kernel can save
mm component memory usage. 

> 
> I think the 2nd solution, i.e limiting the warn message print
> frequency might be a better option. Can you please add the following
> patch (may be as a separate one) and send it along the patch which
> marks all areas other than the crashkernel region being passed to the
> crashkernel as NOMAP, so that we can get this issue fixed in upstream
> aarch64 kernel:
> 
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index 17acf01791fa..4c13fe3c644d 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -169,7 +169,7 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
>         int actual_node = early_pfn_to_nid(pfn);
> 
>         if (node_distance(actual_node, node) > LOCAL_DISTANCE)
> -               pr_warn("[%lx-%lx] potential offnode page_structs\n",
> +               pr_warn_once("[%lx-%lx] potential offnode page_structs\n",
>                         start, end - 1);
>  }
> 
> I have tested this solution on huawei taishan board and can boot
> crashkernel successfully and also save the crash core properly
> (without the  console warn message flooding which used to hold up the
> crashkernel boot).
> 
> Thanks,
> Bhupesh

Thanks
Dave
Dave Young Dec. 26, 2017, 1:35 a.m. UTC | #2
[snip]
> > > Well, we may be able to change pr_warn() to pr_warn_once() here, but
> > > I hope that adding "numa=off" to kernel command line should also work.
> > 
> > Hmm, adding "numa=off" to crashkernel bootargs works, and TBH it was
> > my initial thought process as well, but I am not sure if this will
> > cause any regressions on aarch64 systems which use crashdump feature.
> 
> It should be fine since we use numa=off by default for all other arches
> ie. x86, ppc64 and s390. Actually disabling numa in kdump kernel can save
> mm component memory usage. 
> 

Forgot to say I means in RHEL and Fedora we use numa=off for kdump..
AKASHI Takahiro Dec. 26, 2017, 2:28 a.m. UTC | #3
On Tue, Dec 26, 2017 at 09:35:17AM +0800, Dave Young wrote:
> [snip]
> > > > Well, we may be able to change pr_warn() to pr_warn_once() here, but
> > > > I hope that adding "numa=off" to kernel command line should also work.
> > > 
> > > Hmm, adding "numa=off" to crashkernel bootargs works, and TBH it was
> > > my initial thought process as well, but I am not sure if this will
> > > cause any regressions on aarch64 systems which use crashdump feature.
> > 
> > It should be fine since we use numa=off by default for all other arches
> > ie. x86, ppc64 and s390. Actually disabling numa in kdump kernel can save
> > mm component memory usage. 
> > 
> 
> Forgot to say I means in RHEL and Fedora we use numa=off for kdump..

Thank you for the clarification.
(It might be better to make numa off automatically if maxcpus == 0 (and 1?).)

-Takahiro AKASHI
Bhupesh Sharma Dec. 26, 2017, 2:56 a.m. UTC | #4
On Tue, Dec 26, 2017 at 7:58 AM, AKASHI Takahiro
<takahiro.akashi@linaro.org> wrote:
> On Tue, Dec 26, 2017 at 09:35:17AM +0800, Dave Young wrote:
>> [snip]
>> > > > Well, we may be able to change pr_warn() to pr_warn_once() here, but
>> > > > I hope that adding "numa=off" to kernel command line should also work.
>> > >
>> > > Hmm, adding "numa=off" to crashkernel bootargs works, and TBH it was
>> > > my initial thought process as well, but I am not sure if this will
>> > > cause any regressions on aarch64 systems which use crashdump feature.
>> >
>> > It should be fine since we use numa=off by default for all other arches
>> > ie. x86, ppc64 and s390. Actually disabling numa in kdump kernel can save
>> > mm component memory usage.
>> >
>>
>> Forgot to say I means in RHEL and Fedora we use numa=off for kdump..
>
> Thank you for the clarification.
> (It might be better to make numa off automatically if maxcpus == 0 (and 1?).)
>

Not sure if we can leave this to the distribution-specific kdump
scripts (as the crashkernel boot can be held up for sufficient time
and may appear stuck). The distribution scripts may be different (for
e.g. ubuntu and RHEL/fedora) across distributions and may have
different bootarg options.

So how about considering a kernel fix only which doesn't require
relying on changing the distribution-specific kdump scripts, as we
should avoid introducing a regression while trying to fix a regression
:)

Just my 2 cents.

Thanks,
Bhupesh
Dave Young Dec. 26, 2017, 6:56 a.m. UTC | #5
On 12/26/17 at 11:28am, AKASHI Takahiro wrote:
> On Tue, Dec 26, 2017 at 09:35:17AM +0800, Dave Young wrote:
> > [snip]
> > > > > Well, we may be able to change pr_warn() to pr_warn_once() here, but
> > > > > I hope that adding "numa=off" to kernel command line should also work.
> > > > 
> > > > Hmm, adding "numa=off" to crashkernel bootargs works, and TBH it was
> > > > my initial thought process as well, but I am not sure if this will
> > > > cause any regressions on aarch64 systems which use crashdump feature.
> > > 
> > > It should be fine since we use numa=off by default for all other arches
> > > ie. x86, ppc64 and s390. Actually disabling numa in kdump kernel can save
> > > mm component memory usage. 
> > > 
> > 
> > Forgot to say I means in RHEL and Fedora we use numa=off for kdump..
> 
> Thank you for the clarification.
> (It might be better to make numa off automatically if maxcpus == 0 (and 1?).)

Hmm, I did a quick test with qemu/kvm, kdump kernel boot without numa=off
I'm not sure why I do not see the warning messages on x86
machines, maybe something arm64 specific?

> 
> -Takahiro AKASHI
Dave Young Dec. 26, 2017, 6:58 a.m. UTC | #6
On 12/26/17 at 08:26am, Bhupesh Sharma wrote:
> On Tue, Dec 26, 2017 at 7:58 AM, AKASHI Takahiro
> <takahiro.akashi@linaro.org> wrote:
> > On Tue, Dec 26, 2017 at 09:35:17AM +0800, Dave Young wrote:
> >> [snip]
> >> > > > Well, we may be able to change pr_warn() to pr_warn_once() here, but
> >> > > > I hope that adding "numa=off" to kernel command line should also work.
> >> > >
> >> > > Hmm, adding "numa=off" to crashkernel bootargs works, and TBH it was
> >> > > my initial thought process as well, but I am not sure if this will
> >> > > cause any regressions on aarch64 systems which use crashdump feature.
> >> >
> >> > It should be fine since we use numa=off by default for all other arches
> >> > ie. x86, ppc64 and s390. Actually disabling numa in kdump kernel can save
> >> > mm component memory usage.
> >> >
> >>
> >> Forgot to say I means in RHEL and Fedora we use numa=off for kdump..
> >
> > Thank you for the clarification.
> > (It might be better to make numa off automatically if maxcpus == 0 (and 1?).)
> >
> 
> Not sure if we can leave this to the distribution-specific kdump
> scripts (as the crashkernel boot can be held up for sufficient time
> and may appear stuck). The distribution scripts may be different (for
> e.g. ubuntu and RHEL/fedora) across distributions and may have
> different bootarg options.

Personally I think distribution should take care of this param as for
kdump.  But as AKASHI said it could be a issue for 1st kernel with
nr_cpus=1 booting.  Problem is why we do not see this issue on other
machines.

> 
> So how about considering a kernel fix only which doesn't require
> relying on changing the distribution-specific kdump scripts, as we
> should avoid introducing a regression while trying to fix a regression
> :)
> 
> Just my 2 cents.
> 
> Thanks,
> Bhupesh

Thanks
Dave
Bhupesh Sharma Jan. 8, 2018, 8 p.m. UTC | #7
Hello Akashi,

On Tue, Dec 26, 2017 at 8:26 AM, Bhupesh Sharma <bhsharma@redhat.com> wrote:
> On Tue, Dec 26, 2017 at 7:58 AM, AKASHI Takahiro
> <takahiro.akashi@linaro.org> wrote:
>> On Tue, Dec 26, 2017 at 09:35:17AM +0800, Dave Young wrote:
>>> [snip]
>>> > > > Well, we may be able to change pr_warn() to pr_warn_once() here, but
>>> > > > I hope that adding "numa=off" to kernel command line should also work.
>>> > >
>>> > > Hmm, adding "numa=off" to crashkernel bootargs works, and TBH it was
>>> > > my initial thought process as well, but I am not sure if this will
>>> > > cause any regressions on aarch64 systems which use crashdump feature.
>>> >
>>> > It should be fine since we use numa=off by default for all other arches
>>> > ie. x86, ppc64 and s390. Actually disabling numa in kdump kernel can save
>>> > mm component memory usage.
>>> >
>>>
>>> Forgot to say I means in RHEL and Fedora we use numa=off for kdump..
>>
>> Thank you for the clarification.
>> (It might be better to make numa off automatically if maxcpus == 0 (and 1?).)
>>
>
> Not sure if we can leave this to the distribution-specific kdump
> scripts (as the crashkernel boot can be held up for sufficient time
> and may appear stuck). The distribution scripts may be different (for
> e.g. ubuntu and RHEL/fedora) across distributions and may have
> different bootarg options.
>
> So how about considering a kernel fix only which doesn't require
> relying on changing the distribution-specific kdump scripts, as we
> should avoid introducing a regression while trying to fix a regression
> :)
>
> Just my 2 cents.
>

Sorry for the delay but I was on holidays in the last week.

Are you planning to send a patch to fix this issue or do you want me
to send a RFC version instead?

i think this is a blocking issue for aarch64 kdump support on newer
kernels (v4.14) and we are already hearing about this issue from other
users as well, so it would be great to get this fixed now that we have
root-caused the issue and found a possible way around.

Regards,
Bhupesh
AKASHI Takahiro Jan. 9, 2018, 4:42 a.m. UTC | #8
Bhupesh,

On Tue, Jan 09, 2018 at 01:30:07AM +0530, Bhupesh Sharma wrote:
> Hello Akashi,
> 
> On Tue, Dec 26, 2017 at 8:26 AM, Bhupesh Sharma <bhsharma@redhat.com> wrote:
> > On Tue, Dec 26, 2017 at 7:58 AM, AKASHI Takahiro
> > <takahiro.akashi@linaro.org> wrote:
> >> On Tue, Dec 26, 2017 at 09:35:17AM +0800, Dave Young wrote:
> >>> [snip]
> >>> > > > Well, we may be able to change pr_warn() to pr_warn_once() here, but
> >>> > > > I hope that adding "numa=off" to kernel command line should also work.
> >>> > >
> >>> > > Hmm, adding "numa=off" to crashkernel bootargs works, and TBH it was
> >>> > > my initial thought process as well, but I am not sure if this will
> >>> > > cause any regressions on aarch64 systems which use crashdump feature.
> >>> >
> >>> > It should be fine since we use numa=off by default for all other arches
> >>> > ie. x86, ppc64 and s390. Actually disabling numa in kdump kernel can save
> >>> > mm component memory usage.
> >>> >
> >>>
> >>> Forgot to say I means in RHEL and Fedora we use numa=off for kdump..
> >>
> >> Thank you for the clarification.
> >> (It might be better to make numa off automatically if maxcpus == 0 (and 1?).)
> >>
> >
> > Not sure if we can leave this to the distribution-specific kdump
> > scripts (as the crashkernel boot can be held up for sufficient time
> > and may appear stuck). The distribution scripts may be different (for
> > e.g. ubuntu and RHEL/fedora) across distributions and may have
> > different bootarg options.
> >
> > So how about considering a kernel fix only which doesn't require
> > relying on changing the distribution-specific kdump scripts, as we
> > should avoid introducing a regression while trying to fix a regression
> > :)
> >
> > Just my 2 cents.
> >
> 
> Sorry for the delay but I was on holidays in the last week.
> 
> Are you planning to send a patch to fix this issue or do you want me
> to send a RFC version instead?

I should have submitted my own patch before my new year holidays,
but I will do so as soon as possible.

Thanks,
-Takahiro AKASHI


> i think this is a blocking issue for aarch64 kdump support on newer
> kernels (v4.14) and we are already hearing about this issue from other
> users as well, so it would be great to get this fixed now that we have
> root-caused the issue and found a possible way around.
> 
> Regards,
> Bhupesh
AKASHI Takahiro Jan. 9, 2018, 5:02 a.m. UTC | #9
On Tue, Dec 26, 2017 at 02:56:36PM +0800, Dave Young wrote:
> On 12/26/17 at 11:28am, AKASHI Takahiro wrote:
> > On Tue, Dec 26, 2017 at 09:35:17AM +0800, Dave Young wrote:
> > > [snip]
> > > > > > Well, we may be able to change pr_warn() to pr_warn_once() here, but
> > > > > > I hope that adding "numa=off" to kernel command line should also work.
> > > > > 
> > > > > Hmm, adding "numa=off" to crashkernel bootargs works, and TBH it was
> > > > > my initial thought process as well, but I am not sure if this will
> > > > > cause any regressions on aarch64 systems which use crashdump feature.
> > > > 
> > > > It should be fine since we use numa=off by default for all other arches
> > > > ie. x86, ppc64 and s390. Actually disabling numa in kdump kernel can save
> > > > mm component memory usage. 
> > > > 
> > > 
> > > Forgot to say I means in RHEL and Fedora we use numa=off for kdump..
> > 
> > Thank you for the clarification.
> > (It might be better to make numa off automatically if maxcpus == 0 (and 1?).)
> 
> Hmm, I did a quick test with qemu/kvm, kdump kernel boot without numa=off
> I'm not sure why I do not see the warning messages on x86
> machines, maybe something arm64 specific?

I didn't see the messages(i.e. "potential offnode page_structs")
on arm64 qemu (with -smp 2 -numa node -numa node).

It seems that qemu doesn't generate acpi slit(inter-node distance table).

Thanks,
-Takahiro AKASHI

> > 
> > -Takahiro AKASHI
AKASHI Takahiro Jan. 9, 2018, 5:22 a.m. UTC | #10
On Tue, Dec 26, 2017 at 02:58:45PM +0800, Dave Young wrote:
> On 12/26/17 at 08:26am, Bhupesh Sharma wrote:
> > On Tue, Dec 26, 2017 at 7:58 AM, AKASHI Takahiro
> > <takahiro.akashi@linaro.org> wrote:
> > > On Tue, Dec 26, 2017 at 09:35:17AM +0800, Dave Young wrote:
> > >> [snip]
> > >> > > > Well, we may be able to change pr_warn() to pr_warn_once() here, but
> > >> > > > I hope that adding "numa=off" to kernel command line should also work.
> > >> > >
> > >> > > Hmm, adding "numa=off" to crashkernel bootargs works, and TBH it was
> > >> > > my initial thought process as well, but I am not sure if this will
> > >> > > cause any regressions on aarch64 systems which use crashdump feature.
> > >> >
> > >> > It should be fine since we use numa=off by default for all other arches
> > >> > ie. x86, ppc64 and s390. Actually disabling numa in kdump kernel can save
> > >> > mm component memory usage.
> > >> >
> > >>
> > >> Forgot to say I means in RHEL and Fedora we use numa=off for kdump..
> > >
> > > Thank you for the clarification.
> > > (It might be better to make numa off automatically if maxcpus == 0 (and 1?).)
> > >
> > 
> > Not sure if we can leave this to the distribution-specific kdump
> > scripts (as the crashkernel boot can be held up for sufficient time
> > and may appear stuck). The distribution scripts may be different (for
> > e.g. ubuntu and RHEL/fedora) across distributions and may have
> > different bootarg options.
> 
> Personally I think distribution should take care of this param as for
> kdump.  But as AKASHI said it could be a issue for 1st kernel with
> nr_cpus=1 booting.  Problem is why we do not see this issue on other
> machines.

The issue won't be kdump-specific. Theoretically, it also takes place
when "mem=" is specified on numa.

Since we can avoid annoying messages by adding "numa=off", I'm reluctant to
suppress most of messages but the first. My suggestion here is to add some
notes in Documentation/kdump/kdump.txt regarding NUMA case.

Thanks,
Takahiro AKASHI


> > 
> > So how about considering a kernel fix only which doesn't require
> > relying on changing the distribution-specific kdump scripts, as we
> > should avoid introducing a regression while trying to fix a regression
> > :)
> > 
> > Just my 2 cents.
> > 
> > Thanks,
> > Bhupesh
> 
> Thanks
> Dave
Bhupesh Sharma Jan. 9, 2018, 11:46 a.m. UTC | #11
On Tue, Jan 9, 2018 at 10:12 AM, AKASHI Takahiro
<takahiro.akashi@linaro.org> wrote:
> Bhupesh,
>
> On Tue, Jan 09, 2018 at 01:30:07AM +0530, Bhupesh Sharma wrote:
>> Hello Akashi,
>>
>> On Tue, Dec 26, 2017 at 8:26 AM, Bhupesh Sharma <bhsharma@redhat.com> wrote:
>> > On Tue, Dec 26, 2017 at 7:58 AM, AKASHI Takahiro
>> > <takahiro.akashi@linaro.org> wrote:
>> >> On Tue, Dec 26, 2017 at 09:35:17AM +0800, Dave Young wrote:
>> >>> [snip]
>> >>> > > > Well, we may be able to change pr_warn() to pr_warn_once() here, but
>> >>> > > > I hope that adding "numa=off" to kernel command line should also work.
>> >>> > >
>> >>> > > Hmm, adding "numa=off" to crashkernel bootargs works, and TBH it was
>> >>> > > my initial thought process as well, but I am not sure if this will
>> >>> > > cause any regressions on aarch64 systems which use crashdump feature.
>> >>> >
>> >>> > It should be fine since we use numa=off by default for all other arches
>> >>> > ie. x86, ppc64 and s390. Actually disabling numa in kdump kernel can save
>> >>> > mm component memory usage.
>> >>> >
>> >>>
>> >>> Forgot to say I means in RHEL and Fedora we use numa=off for kdump..
>> >>
>> >> Thank you for the clarification.
>> >> (It might be better to make numa off automatically if maxcpus == 0 (and 1?).)
>> >>
>> >
>> > Not sure if we can leave this to the distribution-specific kdump
>> > scripts (as the crashkernel boot can be held up for sufficient time
>> > and may appear stuck). The distribution scripts may be different (for
>> > e.g. ubuntu and RHEL/fedora) across distributions and may have
>> > different bootarg options.
>> >
>> > So how about considering a kernel fix only which doesn't require
>> > relying on changing the distribution-specific kdump scripts, as we
>> > should avoid introducing a regression while trying to fix a regression
>> > :)
>> >
>> > Just my 2 cents.
>> >
>>
>> Sorry for the delay but I was on holidays in the last week.
>>
>> Are you planning to send a patch to fix this issue or do you want me
>> to send a RFC version instead?
>
> I should have submitted my own patch before my new year holidays,
> but I will do so as soon as possible.

Thanks for the confirmation.
I will look forward to the patches and give them a go on the arm64
boards available with me.

Regards,
Bhupesh

>
>> i think this is a blocking issue for aarch64 kdump support on newer
>> kernels (v4.14) and we are already hearing about this issue from other
>> users as well, so it would be great to get this fixed now that we have
>> root-caused the issue and found a possible way around.
>>
>> Regards,
>> Bhupesh
diff mbox

Patch

diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 17acf01791fa..4c13fe3c644d 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -169,7 +169,7 @@  void __meminit vmemmap_verify(pte_t *pte, int node,
        int actual_node = early_pfn_to_nid(pfn);

        if (node_distance(actual_node, node) > LOCAL_DISTANCE)
-               pr_warn("[%lx-%lx] potential offnode page_structs\n",
+               pr_warn_once("[%lx-%lx] potential offnode page_structs\n",
                        start, end - 1);
 }