mbox series

[0/2] arm64: kexec_file_load vs memory reservations

Message ID 20210429133533.1750721-1-maz@kernel.org (mailing list archive)
Headers show
Series arm64: kexec_file_load vs memory reservations | expand

Message

Marc Zyngier April 29, 2021, 1:35 p.m. UTC
It recently became apparent that using kexec with kexec_file_load() on
arm64 is pretty similar to playing Russian roulette.

Depending on the amount of memory, the HW supported and the firmware
interface used, your secondary kernel may overwrite critical memory
regions without which the secondary kernel cannot boot (the GICv3 LPI
tables being a prime example of such reserved regions).

It turns out that there is at least two ways for reserved memory
regions to be described to kexec: /proc/iomem for the userspace
implementation, and memblock.reserved for kexec_file. And of course,
our LPI tables are only reserved using the resource tree, leading to
the aforementioned stamping. Similar things could happen with ACPI
tables as well.

On my 24xA53 system artificially limited to 256MB of RAM (yes, it
boots with that little memory), trying to kexec a secondary kernel
failed every times. I can only presume that this was mostly tested
using kdump, which preserves the entire kernel memory range.

This small series aims at triggering a discussion on what are the
expectations for kexec_file, and whether we should unify the two
reservation mechanisms.

And in the meantime, it gets things going...

Marc Zyngier (2):
  firmware/efi: Tell memblock about EFI reservations
  ACPI: arm64: Reserve the ACPI tables in memblock

 arch/arm64/kernel/acpi.c   |  1 +
 drivers/firmware/efi/efi.c | 23 ++++++++++++++++++++++-
 2 files changed, 23 insertions(+), 1 deletion(-)

Comments

Marc Zyngier May 12, 2021, 6:04 p.m. UTC | #1
+ Dave Young, which I accidentally missed in my initial post

On Thu, 29 Apr 2021 14:35:31 +0100,
Marc Zyngier <maz@kernel.org> wrote:
> 
> It recently became apparent that using kexec with kexec_file_load() on
> arm64 is pretty similar to playing Russian roulette.
> 
> Depending on the amount of memory, the HW supported and the firmware
> interface used, your secondary kernel may overwrite critical memory
> regions without which the secondary kernel cannot boot (the GICv3 LPI
> tables being a prime example of such reserved regions).
> 
> It turns out that there is at least two ways for reserved memory
> regions to be described to kexec: /proc/iomem for the userspace
> implementation, and memblock.reserved for kexec_file. And of course,
> our LPI tables are only reserved using the resource tree, leading to
> the aforementioned stamping. Similar things could happen with ACPI
> tables as well.
> 
> On my 24xA53 system artificially limited to 256MB of RAM (yes, it
> boots with that little memory), trying to kexec a secondary kernel
> failed every times. I can only presume that this was mostly tested
> using kdump, which preserves the entire kernel memory range.
> 
> This small series aims at triggering a discussion on what are the
> expectations for kexec_file, and whether we should unify the two
> reservation mechanisms.
> 
> And in the meantime, it gets things going...
> 
> Marc Zyngier (2):
>   firmware/efi: Tell memblock about EFI reservations
>   ACPI: arm64: Reserve the ACPI tables in memblock
> 
>  arch/arm64/kernel/acpi.c   |  1 +
>  drivers/firmware/efi/efi.c | 23 ++++++++++++++++++++++-
>  2 files changed, 23 insertions(+), 1 deletion(-)

Any comment on this?

I've separately started working on using the resource tree to slice
and dice the memblocks that are candidate for kexec_file_load(), but
I'd like some consensus on whether this is the right way to address
the issue.

Without something like this, kexec_file_load() is not usable on arm64,
so I'm pretty eager to see the back of this bug.

Thanks,

	M.
Dave Young May 13, 2021, 3:17 a.m. UTC | #2
Hi Marc,
On 05/12/21 at 07:04pm, Marc Zyngier wrote:
> + Dave Young, which I accidentally missed in my initial post
> 
> On Thu, 29 Apr 2021 14:35:31 +0100,
> Marc Zyngier <maz@kernel.org> wrote:
> > 
> > It recently became apparent that using kexec with kexec_file_load() on
> > arm64 is pretty similar to playing Russian roulette.
> > 
> > Depending on the amount of memory, the HW supported and the firmware
> > interface used, your secondary kernel may overwrite critical memory
> > regions without which the secondary kernel cannot boot (the GICv3 LPI
> > tables being a prime example of such reserved regions).
> > 
> > It turns out that there is at least two ways for reserved memory
> > regions to be described to kexec: /proc/iomem for the userspace
> > implementation, and memblock.reserved for kexec_file. And of course,
> > our LPI tables are only reserved using the resource tree, leading to
> > the aforementioned stamping. Similar things could happen with ACPI
> > tables as well.
> > 
> > On my 24xA53 system artificially limited to 256MB of RAM (yes, it
> > boots with that little memory), trying to kexec a secondary kernel
> > failed every times. I can only presume that this was mostly tested
> > using kdump, which preserves the entire kernel memory range.
> > 
> > This small series aims at triggering a discussion on what are the
> > expectations for kexec_file, and whether we should unify the two
> > reservation mechanisms.
> > 
> > And in the meantime, it gets things going...
> > 
> > Marc Zyngier (2):
> >   firmware/efi: Tell memblock about EFI reservations
> >   ACPI: arm64: Reserve the ACPI tables in memblock
> > 
> >  arch/arm64/kernel/acpi.c   |  1 +
> >  drivers/firmware/efi/efi.c | 23 ++++++++++++++++++++++-
> >  2 files changed, 23 insertions(+), 1 deletion(-)
> 
> Any comment on this?
> 
> I've separately started working on using the resource tree to slice
> and dice the memblocks that are candidate for kexec_file_load(), but
> I'd like some consensus on whether this is the right way to address
> the issue.
> 
> Without something like this, kexec_file_load() is not usable on arm64,
> so I'm pretty eager to see the back of this bug.

The arm64 memory reservation is tricky, I do not think I understand it
correctly.  Previously there were a lot discussion, Ard and AKASHI
should know more about it, see if they can provide comments.

About the problem you see, another way is to just add an arch weak
function like powerpc: arch_kexec_locate_mem_hole, and walking resource
tree for kexec_file_load as well.  But I might be wrong since I did not
follow up the arm64 specific history.

> 
> Thanks,
> 
> 	M.
> 
> -- 
> Without deviation from the norm, progress is not possible.
> 

Thanks
Dave
Marc Zyngier May 13, 2021, 11:07 a.m. UTC | #3
[- Bhupesh, as his RH address now bounces]

On Thu, 13 May 2021 04:17:38 +0100,
Dave Young <dyoung@redhat.com> wrote:
> 
> Hi Marc,
> On 05/12/21 at 07:04pm, Marc Zyngier wrote:
> > + Dave Young, which I accidentally missed in my initial post
> > 
> > On Thu, 29 Apr 2021 14:35:31 +0100,
> > Marc Zyngier <maz@kernel.org> wrote:
> > > 
> > > It recently became apparent that using kexec with kexec_file_load() on
> > > arm64 is pretty similar to playing Russian roulette.
> > > 
> > > Depending on the amount of memory, the HW supported and the firmware
> > > interface used, your secondary kernel may overwrite critical memory
> > > regions without which the secondary kernel cannot boot (the GICv3 LPI
> > > tables being a prime example of such reserved regions).
> > > 
> > > It turns out that there is at least two ways for reserved memory
> > > regions to be described to kexec: /proc/iomem for the userspace
> > > implementation, and memblock.reserved for kexec_file. And of course,
> > > our LPI tables are only reserved using the resource tree, leading to
> > > the aforementioned stamping. Similar things could happen with ACPI
> > > tables as well.
> > > 
> > > On my 24xA53 system artificially limited to 256MB of RAM (yes, it
> > > boots with that little memory), trying to kexec a secondary kernel
> > > failed every times. I can only presume that this was mostly tested
> > > using kdump, which preserves the entire kernel memory range.
> > > 
> > > This small series aims at triggering a discussion on what are the
> > > expectations for kexec_file, and whether we should unify the two
> > > reservation mechanisms.
> > > 
> > > And in the meantime, it gets things going...
> > > 
> > > Marc Zyngier (2):
> > >   firmware/efi: Tell memblock about EFI reservations
> > >   ACPI: arm64: Reserve the ACPI tables in memblock
> > > 
> > >  arch/arm64/kernel/acpi.c   |  1 +
> > >  drivers/firmware/efi/efi.c | 23 ++++++++++++++++++++++-
> > >  2 files changed, 23 insertions(+), 1 deletion(-)
> > 
> > Any comment on this?
> > 
> > I've separately started working on using the resource tree to slice
> > and dice the memblocks that are candidate for kexec_file_load(), but
> > I'd like some consensus on whether this is the right way to address
> > the issue.
> > 
> > Without something like this, kexec_file_load() is not usable on arm64,
> > so I'm pretty eager to see the back of this bug.
> 
> The arm64 memory reservation is tricky, I do not think I understand it
> correctly.  Previously there were a lot discussion, Ard and AKASHI
> should know more about it, see if they can provide comments.

I'll let them chime in. It looks like most of the discussions were
around kdump, which doesn't suffer from this issue (the memory is
reserved upfront).

> About the problem you see, another way is to just add an arch weak
> function like powerpc: arch_kexec_locate_mem_hole, and walking resource
> tree for kexec_file_load as well.  But I might be wrong since I did not
> follow up the arm64 specific history.

Right, this would avoid messing with the core code. However, the
problem remains in the sense that there is no clear definition of what
"reserved memory" is in general, and where it is described. For
example, x86 places the ACPI tables in reserved memblocks, while arm64
doesn't (unless we use my second patch).

To reliably use the resource tree, we need to ensure that it contains
all the reservations that appeared in memblock too. And if we can't
have a single reference, then we have to consider the union of the two
trees.

Thoughts?

	M.
Will Deacon May 18, 2021, 11:48 a.m. UTC | #4
[Fixing Bhupesh's email address]

On Thu, Apr 29, 2021 at 02:35:31PM +0100, Marc Zyngier wrote:
> It recently became apparent that using kexec with kexec_file_load() on
> arm64 is pretty similar to playing Russian roulette.
> 
> Depending on the amount of memory, the HW supported and the firmware
> interface used, your secondary kernel may overwrite critical memory
> regions without which the secondary kernel cannot boot (the GICv3 LPI
> tables being a prime example of such reserved regions).
> 
> It turns out that there is at least two ways for reserved memory
> regions to be described to kexec: /proc/iomem for the userspace
> implementation, and memblock.reserved for kexec_file. And of course,
> our LPI tables are only reserved using the resource tree, leading to
> the aforementioned stamping. Similar things could happen with ACPI
> tables as well.
> 
> On my 24xA53 system artificially limited to 256MB of RAM (yes, it
> boots with that little memory), trying to kexec a secondary kernel
> failed every times. I can only presume that this was mostly tested
> using kdump, which preserves the entire kernel memory range.
> 
> This small series aims at triggering a discussion on what are the
> expectations for kexec_file, and whether we should unify the two
> reservation mechanisms.

Bhupesh, since you've been involved with kexec file on arm64 before, please
could you take a look at these patches?

Thanks,

Will
Bhupesh Sharma May 18, 2021, 2:23 p.m. UTC | #5
Hi Will,

On Tue, 18 May 2021 at 17:19, Will Deacon <will@kernel.org> wrote:
>
> [Fixing Bhupesh's email address]
>
> On Thu, Apr 29, 2021 at 02:35:31PM +0100, Marc Zyngier wrote:
> > It recently became apparent that using kexec with kexec_file_load() on
> > arm64 is pretty similar to playing Russian roulette.
> >
> > Depending on the amount of memory, the HW supported and the firmware
> > interface used, your secondary kernel may overwrite critical memory
> > regions without which the secondary kernel cannot boot (the GICv3 LPI
> > tables being a prime example of such reserved regions).
> >
> > It turns out that there is at least two ways for reserved memory
> > regions to be described to kexec: /proc/iomem for the userspace
> > implementation, and memblock.reserved for kexec_file. And of course,
> > our LPI tables are only reserved using the resource tree, leading to
> > the aforementioned stamping. Similar things could happen with ACPI
> > tables as well.
> >
> > On my 24xA53 system artificially limited to 256MB of RAM (yes, it
> > boots with that little memory), trying to kexec a secondary kernel
> > failed every times. I can only presume that this was mostly tested
> > using kdump, which preserves the entire kernel memory range.
> >
> > This small series aims at triggering a discussion on what are the
> > expectations for kexec_file, and whether we should unify the two
> > reservation mechanisms.
>
> Bhupesh, since you've been involved with kexec file on arm64 before, please
> could you take a look at these patches?

Thanks for adding me in Cc.
Yes, I will look and test these patches asap.

Regards,
Bhupesh
Catalin Marinas May 19, 2021, 3:19 p.m. UTC | #6
On Thu, Apr 29, 2021 at 02:35:31PM +0100, Marc Zyngier wrote:
> It recently became apparent that using kexec with kexec_file_load() on
> arm64 is pretty similar to playing Russian roulette.
> 
> Depending on the amount of memory, the HW supported and the firmware
> interface used, your secondary kernel may overwrite critical memory
> regions without which the secondary kernel cannot boot (the GICv3 LPI
> tables being a prime example of such reserved regions).
> 
> It turns out that there is at least two ways for reserved memory
> regions to be described to kexec: /proc/iomem for the userspace
> implementation, and memblock.reserved for kexec_file. And of course,
> our LPI tables are only reserved using the resource tree, leading to
> the aforementioned stamping. Similar things could happen with ACPI
> tables as well.

So which one of these (/proc/iomem and memblock.reserved) would be the
correct option? If none of them, is their intersection any better?
Looking at the default kexec_locate_mem_hole(), it uses the resources
tree if !CONFIG_ARCH_KEEP_MEMBLOCK, otherwise memblock.

PowerPC implements its own arch_kexec_locate_mem_hole() to skip specific
arch regions. We could do something similar for arm64 if the arch code
knows where the LPI reservation is or the ACPI tables.

If we conclude that we need some intersection of resource reservations
and memblock, maybe we should change the default kexec_locate_mem_hole()
implementation to check for both (e.g. start with the resource tree and
only consider a range valid if not in memblock.reserved).
Marc Zyngier May 25, 2021, 4:22 p.m. UTC | #7
On Wed, 19 May 2021 16:19:44 +0100,
Catalin Marinas <catalin.marinas@arm.com> wrote:
> 
> On Thu, Apr 29, 2021 at 02:35:31PM +0100, Marc Zyngier wrote:
> > It recently became apparent that using kexec with kexec_file_load() on
> > arm64 is pretty similar to playing Russian roulette.
> > 
> > Depending on the amount of memory, the HW supported and the firmware
> > interface used, your secondary kernel may overwrite critical memory
> > regions without which the secondary kernel cannot boot (the GICv3 LPI
> > tables being a prime example of such reserved regions).
> > 
> > It turns out that there is at least two ways for reserved memory
> > regions to be described to kexec: /proc/iomem for the userspace
> > implementation, and memblock.reserved for kexec_file. And of course,
> > our LPI tables are only reserved using the resource tree, leading to
> > the aforementioned stamping. Similar things could happen with ACPI
> > tables as well.
> 
> So which one of these (/proc/iomem and memblock.reserved) would be the
> correct option? If none of them, is their intersection any better?

/proc/iomem is what we use for userspace, so you'd expect this to be
the right thing to use.

> Looking at the default kexec_locate_mem_hole(), it uses the resources
> tree if !CONFIG_ARCH_KEEP_MEMBLOCK, otherwise memblock.

Yup, and funnily enough, forcing a fallback to the resources tree
doesn't help either, because the logic used here isn't much better (it
takes the RAM areas at face value, without excluding any of the
reserved regions that are children of the "System RAM" regions).

It's not funny anymore.

> PowerPC implements its own arch_kexec_locate_mem_hole() to skip specific
> arch regions. We could do something similar for arm64 if the arch code
> knows where the LPI reservation is or the ACPI tables.

It feels like a bit of a failure to duplicate all that code. I'd
consider that the last possible outcome.

> If we conclude that we need some intersection of resource reservations
> and memblock, maybe we should change the default kexec_locate_mem_hole()
> implementation to check for both (e.g. start with the resource tree and
> only consider a range valid if not in memblock.reserved).

I am more angling towards this. But my worry is that different
architectures have already different ways to reserve memory (PPC seems
to do their own stuff on top of memblock, x86 I assumes uses the
resource tree in a different way than arm64).

Anyway, I'll keep digging.

	M.
James Morse June 2, 2021, 2:22 p.m. UTC | #8
Hi Marc,

On 29/04/2021 14:35, Marc Zyngier wrote:
> It recently became apparent that using kexec with kexec_file_load() on
> arm64 is pretty similar to playing Russian roulette.
> 
> Depending on the amount of memory, the HW supported and the firmware
> interface used, your secondary kernel may overwrite critical memory
> regions without which the secondary kernel cannot boot (the GICv3 LPI
> tables being a prime example of such reserved regions).
> 
> It turns out that there is at least two ways for reserved memory
> regions to be described to kexec: /proc/iomem for the userspace
> implementation, and memblock.reserved for kexec_file. 

One is spilled into the other by request_standard_resources()...


> And of course,
> our LPI tables are only reserved using the resource tree, leading to
> the aforementioned stamping.

Presumably well after efi_init() has run...

> Similar things could happen with ACPI tables as well.

efi_init() calls reserve_regions(), which has:
|	/* keep ACPI reclaim memory intact for kexec etc. */
|	if (md->type == EFI_ACPI_RECLAIM_MEMORY)
|		memblock_reserve(paddr, size);

This is also what stops mm from allocating them, as memblock-reserved gets copied into the
PG_Reserved flag by free_low_memory_core_early()'s calls to reserve_bootmem_region().

Is your machines firmware putting them in a region with a different type?

(The UEFI spec has something to say: see 2.3.6 "AArch64 Platforms":
| ACPI Tables loaded at boot time can be contained in memory of type EfiACPIReclaimMemory
| (recommended) or EfiACPIMemoryNVS

NVS would fail the is_usable_memory() check earlier, so gets treated as nomap)


Thanks,

James

> On my 24xA53 system artificially limited to 256MB of RAM (yes, it
> boots with that little memory), trying to kexec a secondary kernel
> failed every times. I can only presume that this was mostly tested
> using kdump, which preserves the entire kernel memory range.
> 
> This small series aims at triggering a discussion on what are the
> expectations for kexec_file, and whether we should unify the two
> reservation mechanisms.
> 
> And in the meantime, it gets things going...
> 
> Marc Zyngier (2):
>   firmware/efi: Tell memblock about EFI reservations
>   ACPI: arm64: Reserve the ACPI tables in memblock
> 
>  arch/arm64/kernel/acpi.c   |  1 +
>  drivers/firmware/efi/efi.c | 23 ++++++++++++++++++++++-
>  2 files changed, 23 insertions(+), 1 deletion(-)
>
Marc Zyngier June 2, 2021, 3:59 p.m. UTC | #9
Hi James,

On Wed, 02 Jun 2021 15:22:00 +0100,
James Morse <james.morse@arm.com> wrote:
> 
> Hi Marc,
> 
> On 29/04/2021 14:35, Marc Zyngier wrote:
> > It recently became apparent that using kexec with kexec_file_load() on
> > arm64 is pretty similar to playing Russian roulette.
> > 
> > Depending on the amount of memory, the HW supported and the firmware
> > interface used, your secondary kernel may overwrite critical memory
> > regions without which the secondary kernel cannot boot (the GICv3 LPI
> > tables being a prime example of such reserved regions).
> > 
> > It turns out that there is at least two ways for reserved memory
> > regions to be described to kexec: /proc/iomem for the userspace
> > implementation, and memblock.reserved for kexec_file. 
> 
> One is spilled into the other by request_standard_resources()...
> 
> 
> > And of course,
> > our LPI tables are only reserved using the resource tree, leading to
> > the aforementioned stamping.
> 
> Presumably well after efi_init() has run...

Yup, much later. And we can keep on reserving memory as long as we
boot new CPUs. Having it as a one-off sync doesn't really help here.

> 
> > Similar things could happen with ACPI tables as well.
> 
> efi_init() calls reserve_regions(), which has:
> |	/* keep ACPI reclaim memory intact for kexec etc. */
> |	if (md->type == EFI_ACPI_RECLAIM_MEMORY)
> |		memblock_reserve(paddr, size);
> 
> This is also what stops mm from allocating them, as
> memblock-reserved gets copied into the PG_Reserved flag by
> free_low_memory_core_early()'s calls to reserve_bootmem_region().
> 
> Is your machines firmware putting them in a region with a different type?

Good question. Moritz (cc'd) saw the tables being overwritten on his
system (which I don't have access to), so I guess this is not entirely
clear cut how this happens.

My SQ box reports the ACPI region as "ACPI Reclaim", so I guess it
works as expected here.

> (The UEFI spec has something to say: see 2.3.6 "AArch64 Platforms":
> | ACPI Tables loaded at boot time can be contained in memory of type EfiACPIReclaimMemory
> | (recommended) or EfiACPIMemoryNVS
> 
> NVS would fail the is_usable_memory() check earlier, so gets treated
> as nomap)

Note that I've since changed tactics and proposed that we fully rely
on the resource tree instead[1].

Thanks,

	M.

[1] https://lore.kernel.org/r/20210531095720.77469-1-maz@kernel.org
James Morse June 2, 2021, 4:58 p.m. UTC | #10
Hi Marc,

On 02/06/2021 16:59, Marc Zyngier wrote:
> On Wed, 02 Jun 2021 15:22:00 +0100,
> James Morse <james.morse@arm.com> wrote:
>> On 29/04/2021 14:35, Marc Zyngier wrote:
>>> It recently became apparent that using kexec with kexec_file_load() on
>>> arm64 is pretty similar to playing Russian roulette.
>>>
>>> Depending on the amount of memory, the HW supported and the firmware
>>> interface used, your secondary kernel may overwrite critical memory
>>> regions without which the secondary kernel cannot boot (the GICv3 LPI
>>> tables being a prime example of such reserved regions).
>>>
>>> It turns out that there is at least two ways for reserved memory
>>> regions to be described to kexec: /proc/iomem for the userspace
>>> implementation, and memblock.reserved for kexec_file. 
>>
>> One is spilled into the other by request_standard_resources()...
>>
>>
>>> And of course,
>>> our LPI tables are only reserved using the resource tree, leading to
>>> the aforementioned stamping.
>>
>> Presumably well after efi_init() has run...
> 
> Yup, much later. And we can keep on reserving memory as long as we
> boot new CPUs. Having it as a one-off sync doesn't really help here.

It might need doing for all possible CPUs up-front... otherwise someone loads a kexec
kernel and correctly picks a safe location ... then a CPU comes online and reserves a hole
in the middle of it: kexec isn't using the selected location until you reboot().
(memory hotplug has some 'fun' in this area, which can only be fixed by using memblock,
which ought to know about removable memory ranges ... but doesn't)

There does need to be a point where the list of reserved memory stops changing.


>>> Similar things could happen with ACPI tables as well.
>>
>> efi_init() calls reserve_regions(), which has:
>> |	/* keep ACPI reclaim memory intact for kexec etc. */
>> |	if (md->type == EFI_ACPI_RECLAIM_MEMORY)
>> |		memblock_reserve(paddr, size);
>>
>> This is also what stops mm from allocating them, as
>> memblock-reserved gets copied into the PG_Reserved flag by
>> free_low_memory_core_early()'s calls to reserve_bootmem_region().
>>
>> Is your machines firmware putting them in a region with a different type?

> Good question. Moritz (cc'd) saw the tables being overwritten on his
> system (which I don't have access to), so I guess this is not entirely
> clear cut how this happens.

If we have systems that store the tables in 'conventional memory' we have bigger problems!


> My SQ box reports the ACPI region as "ACPI Reclaim", so I guess it
> works as expected here.
> 
>> (The UEFI spec has something to say: see 2.3.6 "AArch64 Platforms":
>> | ACPI Tables loaded at boot time can be contained in memory of type EfiACPIReclaimMemory
>> | (recommended) or EfiACPIMemoryNVS
>>
>> NVS would fail the is_usable_memory() check earlier, so gets treated
>> as nomap)

> Note that I've since changed tactics and proposed that we fully rely
> on the resource tree instead[1].

Yup - I came back here to work out why you gave up on memblock:reserving the reserved
regions...



Thanks,

James