mbox series

[RFC,v1,0/4] arm64: MMU enabled kexec kernel relocation

Message ID 20190716165641.6990-1-pasha.tatashin@soleen.com (mailing list archive)
Headers show
Series arm64: MMU enabled kexec kernel relocation | expand

Message

Pasha Tatashin July 16, 2019, 4:56 p.m. UTC
Added identity mapped page table, and keep MMU enabled while
kernel is being relocated from sparse pages to the final
destination during kexec.

More description about the problem I am trying to solve here, can be
found here:
https://lore.kernel.org/lkml/20190709182014.16052-1-pasha.tatashin@soleen.com/

This patch series works in terms, that I can kexec-reboot both in QEMU
and on a physical machine. However, I do not see performance improvement
during relocation. The performance is just as slow as before with disabled
caches.

Am I missing something? Perhaps, there is some flag that I should also
enable in page table? Please provide me with any suggestions.

Pavel Tatashin (4):
  arm64, mm: identity mapped page table
  arm64, kexec: interface preparation for mmu enabled kexec
  arm64, kexec: add kexec's own identity page table
  arm64: Keep MMU on while kernel is being relocated

 arch/arm64/include/asm/ident_map.h  |  26 ++++++
 arch/arm64/include/asm/kexec.h      |   5 +-
 arch/arm64/kernel/cpu-reset.S       |   8 --
 arch/arm64/kernel/cpu-reset.h       |   7 +-
 arch/arm64/kernel/machine_kexec.c   | 128 +++++++++++++++++++++-------
 arch/arm64/kernel/relocate_kernel.S |  36 +++++---
 arch/arm64/mm/Makefile              |   1 +
 arch/arm64/mm/ident_map.c           |  99 +++++++++++++++++++++
 8 files changed, 255 insertions(+), 55 deletions(-)
 create mode 100644 arch/arm64/include/asm/ident_map.h
 create mode 100644 arch/arm64/mm/ident_map.c

Comments

Bhupesh Sharma July 16, 2019, 7:14 p.m. UTC | #1
Hi Pavel,

On Tue, Jul 16, 2019 at 10:26 PM Pavel Tatashin
<pasha.tatashin@soleen.com> wrote:
>
> Added identity mapped page table, and keep MMU enabled while
> kernel is being relocated from sparse pages to the final
> destination during kexec.
>
> More description about the problem I am trying to solve here, can be
> found here:
> https://lore.kernel.org/lkml/20190709182014.16052-1-pasha.tatashin@soleen.com/
>
> This patch series works in terms, that I can kexec-reboot both in QEMU
> and on a physical machine. However, I do not see performance improvement
> during relocation. The performance is just as slow as before with disabled
> caches.

Thanks for the patchset, but if the changes still don't positively
impact the kexec-reboot timings, I am not sure we if gain by adding
these to the kernel.

Like I mentioned in the previous threads, we have been carrying some
relevant fixes for the same in Linux distros. I have been trying to
find time to fix them and send them upstream, but I am caught up with
some nasty kexec_file_load() issues on arm64 currently.

So, I will find some time to work on them (may be next week) and will
Cc you when I post them out after some checks on real physical
hardware.

Thanks,
Bhupesh

> Am I missing something? Perhaps, there is some flag that I should also
> enable in page table? Please provide me with any suggestions.
>
> Pavel Tatashin (4):
>   arm64, mm: identity mapped page table
>   arm64, kexec: interface preparation for mmu enabled kexec
>   arm64, kexec: add kexec's own identity page table
>   arm64: Keep MMU on while kernel is being relocated
>
>  arch/arm64/include/asm/ident_map.h  |  26 ++++++
>  arch/arm64/include/asm/kexec.h      |   5 +-
>  arch/arm64/kernel/cpu-reset.S       |   8 --
>  arch/arm64/kernel/cpu-reset.h       |   7 +-
>  arch/arm64/kernel/machine_kexec.c   | 128 +++++++++++++++++++++-------
>  arch/arm64/kernel/relocate_kernel.S |  36 +++++---
>  arch/arm64/mm/Makefile              |   1 +
>  arch/arm64/mm/ident_map.c           |  99 +++++++++++++++++++++
>  8 files changed, 255 insertions(+), 55 deletions(-)
>  create mode 100644 arch/arm64/include/asm/ident_map.h
>  create mode 100644 arch/arm64/mm/ident_map.c
>
> --
> 2.22.0
>
>
> _______________________________________________
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec
Pasha Tatashin July 16, 2019, 7:26 p.m. UTC | #2
On Tue, Jul 16, 2019 at 3:14 PM Bhupesh Sharma <bhsharma@redhat.com> wrote:
>
> Hi Pavel,
>
> On Tue, Jul 16, 2019 at 10:26 PM Pavel Tatashin
> <pasha.tatashin@soleen.com> wrote:
> >
> > Added identity mapped page table, and keep MMU enabled while
> > kernel is being relocated from sparse pages to the final
> > destination during kexec.
> >
> > More description about the problem I am trying to solve here, can be
> > found here:
> > https://lore.kernel.org/lkml/20190709182014.16052-1-pasha.tatashin@soleen.com/
> >
> > This patch series works in terms, that I can kexec-reboot both in QEMU
> > and on a physical machine. However, I do not see performance improvement
> > during relocation. The performance is just as slow as before with disabled
> > caches.
>
> Thanks for the patchset, but if the changes still don't positively
> impact the kexec-reboot timings, I am not sure we if gain by adding
> these to the kernel.

Hi Bhupesh,

I am not asking to add these to the kernel (hence RFC), I am looking
for help to figure out why the relocation is still slow, once that is
understood I will submit the patches for integration. My previous
patch series fixed the relocation problem by pre-reserving space, but
because the culprit of the problem was narrowed down to disabled
caches it was decided that a better fix would be to do relocation with
MMU still enabled, this is why I created this new series.

>
> Like I mentioned in the previous threads, we have been carrying some
> relevant fixes for the same in Linux distros. I have been trying to
> find time to fix them and send them upstream, but I am caught up with
> some nasty kexec_file_load() issues on arm64 currently.

As I understood, the fixes were for slow purgatory checksum checking,
and not for relocation of there kernel. Are you saying redhat is
carrying some patches that address slow relocation problem as well?

Thank you,
Pasha

>
> So, I will find some time to work on them (may be next week) and will
> Cc you when I post them out after some checks on real physical
> hardware.
>
> Thanks,
> Bhupesh
>
> > Am I missing something? Perhaps, there is some flag that I should also
> > enable in page table? Please provide me with any suggestions.
> >
> > Pavel Tatashin (4):
> >   arm64, mm: identity mapped page table
> >   arm64, kexec: interface preparation for mmu enabled kexec
> >   arm64, kexec: add kexec's own identity page table
> >   arm64: Keep MMU on while kernel is being relocated
> >
> >  arch/arm64/include/asm/ident_map.h  |  26 ++++++
> >  arch/arm64/include/asm/kexec.h      |   5 +-
> >  arch/arm64/kernel/cpu-reset.S       |   8 --
> >  arch/arm64/kernel/cpu-reset.h       |   7 +-
> >  arch/arm64/kernel/machine_kexec.c   | 128 +++++++++++++++++++++-------
> >  arch/arm64/kernel/relocate_kernel.S |  36 +++++---
> >  arch/arm64/mm/Makefile              |   1 +
> >  arch/arm64/mm/ident_map.c           |  99 +++++++++++++++++++++
> >  8 files changed, 255 insertions(+), 55 deletions(-)
> >  create mode 100644 arch/arm64/include/asm/ident_map.h
> >  create mode 100644 arch/arm64/mm/ident_map.c
> >
> > --
> > 2.22.0
> >
> >
> > _______________________________________________
> > kexec mailing list
> > kexec@lists.infradead.org
> > http://lists.infradead.org/mailman/listinfo/kexec
James Morse July 17, 2019, 5:51 p.m. UTC | #3
Hi Pavel,

On 16/07/2019 17:56, Pavel Tatashin wrote:
> Added identity mapped page table, and keep MMU enabled while
> kernel is being relocated from sparse pages to the final
> destination during kexec.

The 'tl;dr' version of this: I strongly urge you to start with the hibernate code that
already covers all these known corner cases. x86 was not a good starting point.


After a quick skim:

This will map 'nomap' regions of memory with cacheable attributes. This is a non-starter.
These regions were described by firmware as having content that was/is written with
different attributes. The attributes must match whenever it is mapped, otherwise we have a
loss of coherency. Mapping this stuff as cacheable means the CPU can prefetch it into the
cache whenever it likes.
It may be important that we do not ever map some of these regions, even though its
described as memory. On AMD-Seattle the bottom page of memory is reserved by firmware for
its own use; it is made secure-only, and any access causes an
external-abort/machine-check. UEFI describes this as 'Reserved', and we preserve this in
the kernel as 'nomap'. The equivalent DT support uses memreserve, possibly with the
'nomap' attribute.

Mapping a 'new'/unknown region with cacheable attributes can never be safe, even if we
trusted kexec-tool to only write the kernel to memory. The host may be using a bigger page
size causing more memory to become cacheable than was intended.
Linux's EFI support rounds the UEFI memory map to the largest support page size, (and
winges about firmware bugs).
If we're allowing kexec to load images in a region not described as IORESOURCE_SYSTEM_RAM,
that is a bug we should fix.

The only way to do this properly is to copy the linear mapping. The arch code has lots of
complex code to generate it correctly at boot, we do not want to duplicate it.
(this is why hibernate copies the linear mapping)


These patches do not remove the running page tables from TTBR1. As you overwrite the live
page tables you will corrupt the state of the CPU. The page-table walker may access things
that aren't memory, cache memory that shouldn't be cached (see above), and allocate
conflicting entries in the TLB.

You cannot use the mm page table helpers to build an idmap on arm64. The mm page table
helpers have a compile-time VA_BITS, and we support systems where there is no memory below
1<<VA_BITS. (crazy huh!). Picking on AMD-Seattle again: if you boot a 4K 39bit VA kernel,
the idmap will have more page table levels than the page table helpers can build. This is
why there are special helpers to load the idmap, and twiddle TCR_EL1.T0SZ.
You already need to copy the linear-map, so using an idmap is extra work. You want to work
with linear-map addresses, you probably need to add the field to the appropriate structure.

The kexec relocation code still runs at EL2. You can't use a copy of the linear map here
as there is only one TTBR on v8.0, and you'd need to setup EL2 as its been torn back to
the hyp-stub. This is the reason hibernate posts EL2 in a holding pen while it rewrites
all of memory, then calls back to fixup EL2. Keeping the rewrite phase at EL1 means it
doesn't need independently tweaking/testing. You need to do something similar, either
calling EL2 to start the new image, or disabling the MMU at EL1 to start the new image there.

You will need to alter the relocation code to do nothing for kdump, as no relocation is
required and building page-tables is extra work where the kernel may croak, preventing us
from reaching kdump.

Finally, having this independent idmap machinery isn't desirable from a maintenance
perspective. Please start with the hibernate code that already solves a very similar
problem, as it already has most of these problems covered.


> This patch series works in terms, that I can kexec-reboot both in QEMU

I wouldn't expect Qemu's emulation of the MMU and caches to be performance accurate.


> and on a physical machine. However, I do not see performance improvement
> during relocation. The performance is just as slow as before with disabled
> caches.

> Am I missing something? Perhaps, there is some flag that I should also
> enable in page table? Please provide me with any suggestions.

Some information about the physical machine you tested this on would help.
I'm guessing its v8.0, and booted at EL2....


Thanks,

James
Pasha Tatashin July 17, 2019, 7:13 p.m. UTC | #4
Hi James,

Thank you for taking a look at this work.

> After a quick skim:
>
> This will map 'nomap' regions of memory with cacheable attributes. This is a non-starter.
> These regions were described by firmware as having content that was/is written with
> different attributes. The attributes must match whenever it is mapped, otherwise we have a
> loss of coherency. Mapping this stuff as cacheable means the CPU can prefetch it into the
> cache whenever it likes.

> It may be important that we do not ever map some of these regions, even though its
> described as memory. On AMD-Seattle the bottom page of memory is reserved by firmware for
> its own use; it is made secure-only, and any access causes an
> external-abort/machine-check. UEFI describes this as 'Reserved', and we preserve this in
> the kernel as 'nomap'. The equivalent DT support uses memreserve, possibly with the
> 'nomap' attribute.
>
> Mapping a 'new'/unknown region with cacheable attributes can never be safe, even if we
> trusted kexec-tool to only write the kernel to memory. The host may be using a bigger page
> size causing more memory to become cacheable than was intended.
> Linux's EFI support rounds the UEFI memory map to the largest support page size, (and
> winges about firmware bugs).
> If we're allowing kexec to load images in a region not described as IORESOURCE_SYSTEM_RAM,
> that is a bug we should fix.

We are allowing this. If you consider this to be a bug, I will fix it,
and this will actually simplify the idmap page table. User will
receive an error during kexec load if a request is made to load into
!IORESOURCE_SYSTEM_RAM region.

>
> The only way to do this properly is to copy the linear mapping. The arch code has lots of
> complex code to generate it correctly at boot, we do not want to duplicate it.
> (this is why hibernate copies the linear mapping)

As I understand, you would like to take a copy of idmap page table,
and add entries only for segment
sources and destinations into the new page table?

If so, there is a slight problem: arch hook machine_kexec_prepare() is
called prior to loading segments from userland. We can solve this by
adding another hook that is called after kimage_terminate().

> These patches do not remove the running page tables from TTBR1. As you overwrite the live
> page tables you will corrupt the state of the CPU. The page-table walker may access things
> that aren't memory, cache memory that shouldn't be cached (see above), and allocate
> conflicting entries in the TLB.

Indeed. However, I was following what is done in create_safe_exec_page():
https://soleen.com/source/xref/linux/arch/arm64/kernel/hibernate.c?r=af873fce#263

ttbr1 is not removed there. Am I missing something, or is not yet
configured there?

I will set ttbr1 to zero page.

> You cannot use the mm page table helpers to build an idmap on arm64. The mm page table
> helpers have a compile-time VA_BITS, and we support systems where there is no memory below
> 1<<VA_BITS. (crazy huh!). Picking on AMD-Seattle again: if you boot a 4K 39bit VA kernel,
> the idmap will have more page table levels than the page table helpers can build. This is
> why there are special helpers to load the idmap, and twiddle TCR_EL1.T0SZ.
> You already need to copy the linear-map, so using an idmap is extra work. You want to work
> with linear-map addresses, you probably need to add the field to the appropriate structure.

OK. Makes sense. I will do the way hibernate setup this table. I was
indeed following x86, hoping that eventually it would be possible to
unite: kasan, hibernate, and kexec implementation of this page table.

>
> The kexec relocation code still runs at EL2. You can't use a copy of the linear map here
> as there is only one TTBR on v8.0, and you'd need to setup EL2 as its been torn back to
> the hyp-stub.

As I understand normally on baremetal kexec runs at EL1 not EL2. On my
machine is_kernel_in_hyp_mode() == false in cpu_soft_restart.

This is the reason hibernate posts EL2 in a holding pen while it rewrites
> all of memory, then calls back to fixup EL2. Keeping the rewrite phase at EL1 means it
> doesn't need independently tweaking/testing. You need to do something similar, either
> calling EL2 to start the new image, or disabling the MMU at EL1 to start the new image there.

OK, I will study how hibernate does this. I was thinking that if we
are running in EL2 we can simply configure TTBR0_EL2 instead of
TTBR0_EL1. But, I need to understand this better.

>
> You will need to alter the relocation code to do nothing for kdump, as no relocation is
> required and building page-tables is extra work where the kernel may croak, preventing us
> from reaching kdump.

Yes, I was planning to do nothing for kdump, which involves not
allocating page table. It is not part of the current patchest, as the
current series is not ready.

>
> Finally, having this independent idmap machinery isn't desirable from a maintenance
> perspective. Please start with the hibernate code that already solves a very similar
> problem, as it already has most of these problems covered.

OK.

> > This patch series works in terms, that I can kexec-reboot both in QEMU
>
> I wouldn't expect Qemu's emulation of the MMU and caches to be performance accurate.

I am not measuring performance in QEMU, I use it for
development/verification only. The performance is measured on real
hardware only.

>
> > and on a physical machine. However, I do not see performance improvement
> > during relocation. The performance is just as slow as before with disabled
> > caches.
>
> > Am I missing something? Perhaps, there is some flag that I should also
> > enable in page table? Please provide me with any suggestions.
>
> Some information about the physical machine you tested this on would help.
> I'm guessing its v8.0, and booted at EL2....

I am using Broadcom's Stingray SoC. Because  is_kernel_in_hyp_mode()
returns false, I believe it is EL1. How can I boot it at EL2?

So, I am still confused why I do not see performance improvements
during relocation on this machine. Any theories?

Thank you,
Pasha
James Morse July 26, 2019, 2 p.m. UTC | #5
Hi Pavel,

On 17/07/2019 20:13, Pavel Tatashin wrote:
>> After a quick skim:
>>
>> This will map 'nomap' regions of memory with cacheable attributes. This is a non-starter.
>> These regions were described by firmware as having content that was/is written with
>> different attributes. The attributes must match whenever it is mapped, otherwise we have a
>> loss of coherency. Mapping this stuff as cacheable means the CPU can prefetch it into the
>> cache whenever it likes.
> 
>> It may be important that we do not ever map some of these regions, even though its
>> described as memory. On AMD-Seattle the bottom page of memory is reserved by firmware for
>> its own use; it is made secure-only, and any access causes an
>> external-abort/machine-check. UEFI describes this as 'Reserved', and we preserve this in
>> the kernel as 'nomap'. The equivalent DT support uses memreserve, possibly with the
>> 'nomap' attribute.
>>
>> Mapping a 'new'/unknown region with cacheable attributes can never be safe, even if we
>> trusted kexec-tool to only write the kernel to memory. The host may be using a bigger page
>> size causing more memory to become cacheable than was intended.
>> Linux's EFI support rounds the UEFI memory map to the largest support page size, (and
>> winges about firmware bugs).
>> If we're allowing kexec to load images in a region not described as IORESOURCE_SYSTEM_RAM,
>> that is a bug we should fix.
> 
> We are allowing this. If you consider this to be a bug, I will fix it,
> and this will actually simplify the idmap page table. User will
> receive an error during kexec load if a request is made to load into
> !IORESOURCE_SYSTEM_RAM region.

I consider this a bug, but we can see what others think.
This suggests kexec-tools can open /proc/iomem, find a likely looking gap, and try to load
the new kernel between two platform devices.


>> The only way to do this properly is to copy the linear mapping. The arch code has lots of
>> complex code to generate it correctly at boot, we do not want to duplicate it.
>> (this is why hibernate copies the linear mapping)
> 
> As I understand, you would like to take a copy of idmap page table,
> and add entries only for segment
> sources and destinations into the new page table?

I don't think there is a need to idmap memory at all. We should copy the linear map so you
know you won't overwrite its page tables as part of loading the new kernel.


> If so, there is a slight problem: arch hook machine_kexec_prepare() is
> called prior to loading segments from userland. We can solve this by
> adding another hook that is called after kimage_terminate().

Yes, all this would need doing as machine_kexec() runs. We preferably need to allocate
memory in this path, or at least have a bitmap of what we can/can't overwrite.


>> These patches do not remove the running page tables from TTBR1. As you overwrite the live
>> page tables you will corrupt the state of the CPU. The page-table walker may access things
>> that aren't memory, cache memory that shouldn't be cached (see above), and allocate
>> conflicting entries in the TLB.
> 
> Indeed. However, I was following what is done in create_safe_exec_page():
> https://soleen.com/source/xref/linux/arch/arm64/kernel/hibernate.c?r=af873fce#263
> 
> ttbr1 is not removed there. Am I missing something, or is not yet
> configured there?

Hibernate maps a single executable page in ttbr0_el1 that holds its relocation code.
The relocation code then switches ttbr1_el1 to point to the copy of the linear map. See
the 'break_before_make_ttbr_switch' macro in swsusp_arch_suspend_exit().


> I will set ttbr1 to zero page.
> 
>> You cannot use the mm page table helpers to build an idmap on arm64. The mm page table
>> helpers have a compile-time VA_BITS, and we support systems where there is no memory below
>> 1<<VA_BITS. (crazy huh!). Picking on AMD-Seattle again: if you boot a 4K 39bit VA kernel,
>> the idmap will have more page table levels than the page table helpers can build. This is
>> why there are special helpers to load the idmap, and twiddle TCR_EL1.T0SZ.
>> You already need to copy the linear-map, so using an idmap is extra work. You want to work
>> with linear-map addresses, you probably need to add the field to the appropriate structure.
> 
> OK. Makes sense. I will do the way hibernate setup this table. I was
> indeed following x86, hoping that eventually it would be possible to
> unite: kasan, hibernate, and kexec implementation of this page table.

Our kasan and hibernate code already went a different way. I doubt we can bring them back
in to look like x86, they have different problems to solve.


>> The kexec relocation code still runs at EL2. You can't use a copy of the linear map here
>> as there is only one TTBR on v8.0, and you'd need to setup EL2 as its been torn back to
>> the hyp-stub.
> 
> As I understand normally on baremetal kexec runs at EL1 not EL2. On my
> machine is_kernel_in_hyp_mode() == false in cpu_soft_restart.

and is_hyp_mode_available() ?


This depends on which exception-level your bootloader set Linux running. You should get a
boot message that tells you:
| CPU: All CPU(s) started at EL2

is_kernel_in_hyp_mode() is for determining if the kernel is running at EL2. This is the
case if you get a message like:
| kvm [1]: VHE mode initialized successfully
VHE is a v8.1 feature that repaints the system-registers so a kernel written to run at EL1
can run almost unmodified at EL2.

We also have is_hyp_mode_available(), which will return true if all the CPUs booted at EL2.


kexec either runs its relocation code at EL1, or at EL2 if that is where the first kernel
booted. If you call to EL2, the MMU was already off as KVM will reset EL2 to the hyp-stub
in response to the reboot notifiers hardware_disable() call.

(kvm_arch_hardware_disable() calls cpu_hyp_reset())


> This is the reason hibernate posts EL2 in a holding pen while it rewrites
>> all of memory, then calls back to fixup EL2. Keeping the rewrite phase at EL1 means it
>> doesn't need independently tweaking/testing. You need to do something similar, either
>> calling EL2 to start the new image, or disabling the MMU at EL1 to start the new image there.

> OK, I will study how hibernate does this. I was thinking that if we
> are running in EL2 we can simply configure TTBR0_EL2 instead of
> TTBR0_EL1. But, I need to understand this better.

Yes, if you've got a VHE system it can skip the jumping around Exception-levels, but we
still need to support the non-VHE systems.


>>> This patch series works in terms, that I can kexec-reboot both in QEMU
>>> and on a physical machine. However, I do not see performance improvement
>>> during relocation. The performance is just as slow as before with disabled
>>> caches.
>>
>>> Am I missing something? Perhaps, there is some flag that I should also
>>> enable in page table? Please provide me with any suggestions.
>>
>> Some information about the physical machine you tested this on would help.
>> I'm guessing its v8.0, and booted at EL2....

> I am using Broadcom's Stingray SoC.

First hit on google is [0]. Which assuming its the same SoC, says its Cortex-A72, this is
a v8.0 part, it doesn't have VHE. The kernel will be running at EL1, if it supports KVM it
must have booted at EL2.


> Because  is_kernel_in_hyp_mode()
> returns false, I believe it is EL1. How can I boot it at EL2?

This check is for VHE.
|static inline bool is_kernel_in_hyp_mode(void)
| {
|	return read_sysreg(CurrentEL) == CurrentEL_EL2;
| }

The kernel's early startup in head.S detects if its running at EL2 or EL1. If it has VHE,
it enables the feature and stays at EL2. Otherwise it installs the 'hyp-stub' and drops to
EL1.

On v8.0 the kernel has to run at EL1 because we need the two ttbr registers.
(kernel/user-space). EL2 on these parts only has one. v8.1's VHE adds a second ttbr to
EL2, meaning we can run linux at EL2 and KVM avoids the jumping between exception levels.


You can check for that:
| CPU: All CPU(s) started at EL2
message. If your bootloader is starting the OS at EL1, you will need to speak to the
firmware folk about getting that changed.


> So, I am still confused why I do not see performance improvements
> during relocation on this machine. Any theories?

I assume you started at EL2. You moved EL1's mmu-off code, but cpu_soft_restart() will
call the relocation code at EL2 where the MMU is already off. Because you expected to be
using an idmap, nothing goes wrong, but nothing has changed.


Thanks,

James

[0] https://www.broadcom.com/products/storage/ethernet-storage-adapters-ics/ps1100r