[v2,00/19] arm64: Enable LPA2 support for 4k and 16k pages

Message ID: 20221124123932.2648991-1-ardb@kernel.org

Message

Ard Biesheuvel Nov. 24, 2022, 12:39 p.m. UTC
Enable support for LPA2 when running with 4k or 16k pages. In the former
case, this requires 5 level paging with a runtime fallback to 4 levels
on non-LPA2 hardware. For consistency, the same approach is adopted for
16k pages, where we fall back to 3 level paging (47 bit virtual
addressing) on non-LPA2 configurations. (Falling back to 48 bits would
involve finding a workaround for the fact that we cannot construct a
level 0 table covering 52 bits of VA space that is aligned to its size
in memory while also having its top 2 entries, which represent the
48-bit region, appear at an alignment of 64 bytes, as the architecture
requires for TTBR address values. Also, using an additional level of
paging to translate a single VA bit is wasteful in terms of TLB
efficiency.)
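
To make the arithmetic behind that parenthetical concrete, here is a
minimal sketch for the 16k case (illustrative only; the entry indices
follow from the granule geometry, not from the patches themselves):

#include <stdio.h>

int main(void)
{
	/* 16k granule: a level 0 table for 52-bit VA resolves VA bits [51:47] */
	unsigned int entries  = 1u << (52 - 47);  /* 32 entries */
	unsigned int size     = entries * 8;      /* 256-byte table */
	unsigned int first48  = entries - 2;      /* entries #30 and #31 cover
	                                             the 48-bit region */
	unsigned int ttbr_off = first48 * 8;      /* 240: only 16-byte aligned,
	                                             short of the 64-byte TTBR
	                                             minimum */

	printf("size=%u first48=#%u ttbr_off=%u\n", size, first48, ttbr_off);
	return 0;
}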

This means support for falling back to 3 levels of paging at runtime
when configured for 4 is also needed.

Another thing worth noting is that the repurposed physical address bits
in the page table descriptors were not RES0 before, so there is now a
big global switch (called TCR.DS) that controls how all page table
descriptors are interpreted. This requires some extra care in the PTE
conversion helpers, and additional handling in the boot code to ensure
that we set TCR.DS safely if supported (and not overridden).
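
As a rough illustration of what that means for the conversion helpers,
here is a hedged sketch assuming 4k pages and the LPA2 descriptor
layout, in which bits [9:8] (the former shareability field) carry PA
bits [51:50]; the names and the runtime check are illustrative, not the
series' actual code:

#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for whatever runtime check exists once TCR.DS
 * has been configured. */
static bool lpa2_enabled;

#define PTE_ADDR_LOW	(((1ULL << 50) - 1) & ~0xfffULL)  /* bits [49:12] */
#define PTE_ADDR_HIGH	(0x3ULL << 8)                     /* bits [9:8]   */

static uint64_t pte_to_phys_sketch(uint64_t pte)
{
	uint64_t pa = pte & PTE_ADDR_LOW;

	/* With TCR.DS set, descriptor bits [9:8] (the old shareability
	 * field) carry PA bits [51:50]. */
	if (lpa2_enabled)
		pa |= (pte & PTE_ADDR_HIGH) << 42;

	return pa;
}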

Note that this series is mostly orthogonal to work done by Anshuman
last year: this series assumes that 52-bit physical addressing is never
needed to map the kernel image itself, and therefore that we never need
ID map range extension to cover the kernel with a 5th level when running
with 4. And given that the LPA2 architectural feature covers both the
virtual and physical range extensions, where enabling the latter is
required to enable the former, we can simplify things further by only
enabling them as a pair. (I.e., 52-bit physical addressing cannot be
enabled for 48-bit VA space or smaller.)

This series applies onto some of my previous work that is still in
flight, so these patches will not apply in isolation. Complete branch
can be found here:
https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=arm64-4k-lpa2

It supersedes the RFC v1 I sent out last week, which covered 16k pages
only. It also supersedes some related work I sent out in isolation
before:

[PATCH] arm64: mm: Enable KASAN for 16k/48-bit VA configurations
[PATCH 0/3] arm64: mm: Model LVA support as a CPU feature

Tested on QEMU with -cpu max and lpa2 both off and on, as well as using
the arm64.nolva override kernel command line parameter. Note that this
requires a QEMU built from the latest sources.

Cc: Marc Zyngier <maz@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>

Anshuman Khandual (3):
  arm64/mm: Simplify and document pte_to_phys() for 52 bit addresses
  arm64/mm: Add FEAT_LPA2 specific TCR_EL1.DS field
  arm64/mm: Add FEAT_LPA2 specific ID_AA64MMFR0.TGRAN[2]

Ard Biesheuvel (16):
  arm64: kaslr: Adjust randomization range dynamically
  arm64: mm: get rid of kimage_vaddr global variable
  arm64: head: remove order argument from early mapping routine
  arm64: mm: Handle LVA support as a CPU feature
  arm64: mm: Deal with potential ID map extension if VA_BITS >
    VA_BITS_MIN
  arm64: mm: Add feature override support for LVA
  arm64: mm: Wire up TCR.DS bit to PTE shareability fields
  arm64: mm: Add LPA2 support to phys<->pte conversion routines
  arm64: mm: Add definitions to support 5 levels of paging
  arm64: mm: add 5 level paging support to G-to-nG conversion routine
  arm64: Enable LPA2 at boot if supported by the system
  arm64: mm: Add 5 level paging support to fixmap and swapper handling
  arm64: kasan: Reduce minimum shadow alignment and enable 5 level
    paging
  arm64: mm: Add support for folding PUDs at runtime
  arm64: ptdump: Disregard unaddressable VA space
  arm64: Enable 52-bit virtual addressing for 4k and 16k granule configs

 arch/arm64/Kconfig                      |  23 ++-
 arch/arm64/include/asm/assembler.h      |  42 ++---
 arch/arm64/include/asm/cpufeature.h     |   2 +
 arch/arm64/include/asm/fixmap.h         |   1 +
 arch/arm64/include/asm/kernel-pgtable.h |  27 ++-
 arch/arm64/include/asm/memory.h         |  23 ++-
 arch/arm64/include/asm/pgalloc.h        |  53 +++++-
 arch/arm64/include/asm/pgtable-hwdef.h  |  34 +++-
 arch/arm64/include/asm/pgtable-prot.h   |  18 +-
 arch/arm64/include/asm/pgtable-types.h  |   6 +
 arch/arm64/include/asm/pgtable.h        | 197 ++++++++++++++++++--
 arch/arm64/include/asm/sysreg.h         |   2 +
 arch/arm64/include/asm/tlb.h            |   3 +-
 arch/arm64/kernel/cpufeature.c          |  46 ++++-
 arch/arm64/kernel/head.S                |  99 +++++-----
 arch/arm64/kernel/image-vars.h          |   4 +
 arch/arm64/kernel/pi/idreg-override.c   |  29 ++-
 arch/arm64/kernel/pi/kaslr_early.c      |  23 ++-
 arch/arm64/kernel/pi/map_kernel.c       | 115 +++++++++++-
 arch/arm64/kernel/sleep.S               |   3 -
 arch/arm64/mm/init.c                    |   2 +-
 arch/arm64/mm/kasan_init.c              | 124 ++++++++++--
 arch/arm64/mm/mmap.c                    |   4 +
 arch/arm64/mm/mmu.c                     | 138 ++++++++++----
 arch/arm64/mm/pgd.c                     |  17 +-
 arch/arm64/mm/proc.S                    |  76 +++++++-
 arch/arm64/mm/ptdump.c                  |   4 +-
 arch/arm64/tools/cpucaps                |   1 +
 28 files changed, 907 insertions(+), 209 deletions(-)

Comments

Ryan Roberts Nov. 24, 2022, 2:39 p.m. UTC | #1
Hi Ard,

Thanks for including me on this. I'll plan to do a review over the next week or
so, but in the meantime, I have a couple of general questions/comments:

On 24/11/2022 12:39, Ard Biesheuvel wrote:
> Enable support for LPA2 when running with 4k or 16k pages. In the former
> case, this requires 5 level paging with a runtime fallback to 4 on
> non-LPA2 hardware. For consistency, the same approach is adopted for 16k
> pages, where we fall back to 3 level paging (47 bit virtual addressing)
> on non-LPA2 configurations. 

It seems odd to me that on a non-LPA2 system, if you run a kernel that
is compiled for 16KB pages and 48 VA bits, then you will get 48 VA bits,
but if you run a kernel that is compiled for 16KB pages and 52 VA bits,
then you will get only 47 VA bits. Wouldn't that pose a potential user
space compat issue?

> (Falling back to 48 bits would involve
> finding a workaround for the fact that we cannot construct a level 0
> table covering 52 bits of VA space that appears aligned to its size in
> memory, and has the top 2 entries that represent the 48-bit region
> appearing at an alignment of 64 bytes, which is required by the
> architecture for TTBR address values. 

I'm not sure I've understood this. The level 0 table would need 32 entries for
52 VA bits so the table size is 256 bytes, naturally aligned to 256 bytes. 64 is
a factor of 256 so surely the top 2 entries are guaranteed to also meet the
constraint for the fallback path too?

> Also, using an additional level of
> paging to translate a single VA bit is wasteful in terms of TLB
> efficiency)
> 
> This means support for falling back to 3 levels of paging at runtime
> when configured for 4 is also needed.
> 
> Another thing worth to note is that the repurposed physical address bits
> in the page table descriptors were not RES0 before, and so there is now
> a big global switch (called TCR.DS) which controls how all page table
> descriptors are interpreted. This requires some extra care in the PTE
> conversion helpers, and additional handling in the boot code to ensure
> that we set TCR.DS safely if supported (and not overridden)
> 
> Note that this series is mostly orthogonal to work by Anshuman done last
> year: this series assumes that 52-bit physical addressing is never
> needed to map the kernel image itself, and therefore that we never need
> ID map range extension to cover the kernel with a 5th level when running
> with 4. 

This limitation will certainly make it more tricky to test the LPA2 stage2
implementation that I have done. I've got scripts that construct host systems
with all the RAM above 48 bits so that the output addresses in the stage2 page
tables can be guaranteed to contain OAs > 48 bits. I think the work around here
would be to place the RAM so that it straddles the 48 bit boundary such that the
size of RAM below is the same size as the kernel image, and place the kernel
image in it. Then this will ensure that the VM's memory still uses the RAM above
the threshold. Or is there a simpler approach?

> And given that the LPA2 architectural feature covers both the
> virtual and physical range extensions, where enabling the latter is
> required to enable the former, we can simplify things further by only
> enabling them as a pair. (I.e., 52-bit physical addressing cannot be
> enabled for 48-bit VA space or smaller)
> 
> [...]

Thanks,
Ryan
Ard Biesheuvel Nov. 24, 2022, 5:14 p.m. UTC | #2
On Thu, 24 Nov 2022 at 15:39, Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Hi Ard,
>
> Thanks for including me on this. I'll plan to do a review over the next week or
> so, but in the meantime, I have a couple of general questions/comments:
>
> On 24/11/2022 12:39, Ard Biesheuvel wrote:
> > Enable support for LPA2 when running with 4k or 16k pages. In the former
> > case, this requires 5 level paging with a runtime fallback to 4 on
> > non-LPA2 hardware. For consistency, the same approach is adopted for 16k
> > pages, where we fall back to 3 level paging (47 bit virtual addressing)
> > on non-LPA2 configurations.
>
> It seems odd to me that if you have a non-LPA2 system, if you run a kernel that
> is compiled for 16KB pages and 48 VA bits, then you will get 48 VA bits. But if
> you run a kernel that is compiled for 16KB pages and 52 VA bits then you will
> get 47 VA bits? Wouldn't that pose a potential user space compat issue?
>

Well, given that Android happily runs with 39-bit VAs to avoid 4 level
paging at all cost, I don't think that is a universal concern.

The benefit of this approach is that you can decide at runtime whether
you want to take the performance hit of 4 (or 5) level paging to get
access to the extended VA space.

> > (Falling back to 48 bits would involve
> > finding a workaround for the fact that we cannot construct a level 0
> > table covering 52 bits of VA space that appears aligned to its size in
> > memory, and has the top 2 entries that represent the 48-bit region
> > appearing at an alignment of 64 bytes, which is required by the
> > architecture for TTBR address values.
>
> I'm not sure I've understood this. The level 0 table would need 32 entries for
> 52 VA bits so the table size is 256 bytes, naturally aligned to 256 bytes. 64 is
> a factor of 256 so surely the top 2 entries are guaranteed to also meet the
> constraint for the fallback path too?
>

The top 2 entries are 16 bytes combined, and end on a 256 byte aligned
boundary so I don't see how they can start on a 64 byte aligned
boundary at the same time.

My RFC had a workaround for this, but it is a bit nasty because we
need to copy those two entries at the right time and keep them in
sync.

> > Also, using an additional level of
> > paging to translate a single VA bit is wasteful in terms of TLB
> > efficiency)
> >
> > This means support for falling back to 3 levels of paging at runtime
> > when configured for 4 is also needed.
> >
> > Another thing worth to note is that the repurposed physical address bits
> > in the page table descriptors were not RES0 before, and so there is now
> > a big global switch (called TCR.DS) which controls how all page table
> > descriptors are interpreted. This requires some extra care in the PTE
> > conversion helpers, and additional handling in the boot code to ensure
> > that we set TCR.DS safely if supported (and not overridden)
> >
> > Note that this series is mostly orthogonal to work by Anshuman done last
> > year: this series assumes that 52-bit physical addressing is never
> > needed to map the kernel image itself, and therefore that we never need
> > ID map range extension to cover the kernel with a 5th level when running
> > with 4.
>
> This limitation will certainly make it more tricky to test the the LPA2 stage2
> implementation that I have done. I've got scripts that construct host systems
> with all the RAM above 48 bits so that the output addresses in the stage2 page
> tables can be guaranteed to contain OAs > 48 bits. I think the work around here
> would be to place the RAM so that it straddles the 48 bit boundary such that the
> size of RAM below is the same size as the kernel image, and place the kernel
> image in it. Then this will ensure that the VM's memory still uses the RAM above
> the threshold. Or is there a simpler approach?
>

No, that sounds reasonable. I'm using QEMU which happily lets you put
the start of DRAM at any address you can imagine (if you recompile it)

Another approach could be to simply stick a memblock_reserve()
somewhere that covers all 48-bit addressable memory, but you will need
some of both in any case.
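
As a rough illustration of that suggestion (a hedged sketch, not code
from the series; the helper name is an assumption and the split point
is arbitrary), reserving the 48-bit addressable range early in boot
might look like:

#include <linux/memblock.h>

/* Hedged sketch only: the kernel image still lives somewhere in this
 * range (and stays reserved), but nothing else will be allocated from
 * it, so remaining allocations land above the 48-bit boundary. */
static void __init reserve_low_ram_for_testing(void)
{
	memblock_reserve(0, 1ULL << 48);
}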

> > And given that the LPA2 architectural feature covers both the
> > virtual and physical range extensions, where enabling the latter is
> > required to enable the former, we can simplify things further by only
> > enabling them as a pair. (I.e., 52-bit physical addressing cannot be
> > enabled for 48-bit VA space or smaller)
> >
> > [...]
>
> Thanks,
> Ryan
>
>
Ryan Roberts Nov. 25, 2022, 9:22 a.m. UTC | #3
On 24/11/2022 17:14, Ard Biesheuvel wrote:
> On Thu, 24 Nov 2022 at 15:39, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> Hi Ard,
>>
>> Thanks for including me on this. I'll plan to do a review over the next week or
>> so, but in the meantime, I have a couple of general questions/comments:
>>
>> On 24/11/2022 12:39, Ard Biesheuvel wrote:
>>> Enable support for LPA2 when running with 4k or 16k pages. In the former
>>> case, this requires 5 level paging with a runtime fallback to 4 on
>>> non-LPA2 hardware. For consistency, the same approach is adopted for 16k
>>> pages, where we fall back to 3 level paging (47 bit virtual addressing)
>>> on non-LPA2 configurations.
>>
>> It seems odd to me that if you have a non-LPA2 system, if you run a kernel that
>> is compiled for 16KB pages and 48 VA bits, then you will get 48 VA bits. But if
>> you run a kernel that is compiled for 16KB pages and 52 VA bits then you will
>> get 47 VA bits? Wouldn't that pose a potential user space compat issue?
>>
> 
> Well, given that Android happily runs with 39-bit VAs to avoid 4 level
> paging at all cost, I don't think that is a universal concern.

Well presumably the Android kernel is always explicitly compiled for 39 VA bits
so that's what user space is used to? I was really just making the point that if
you have (the admittedly exotic and unlikely) case of having a 16KB kernel
previously compiled for 48 VA bits, and you "upgrade" it to 52 VA bits now that
the option is available, on HW without LPA2, this will actually be observed as a
"downgrade" to 47 bits. If you previously wanted to limit to 3 levels of lookup
with 16KB you would already have been compiling for 47 VA bits.

> 
> The benefit of this approach is that you can decide at runtime whether
> you want to take the performance hit of 4 (or 5) level paging to get
> access to the extended VA space.
> 
>>> (Falling back to 48 bits would involve
>>> finding a workaround for the fact that we cannot construct a level 0
>>> table covering 52 bits of VA space that appears aligned to its size in
>>> memory, and has the top 2 entries that represent the 48-bit region
>>> appearing at an alignment of 64 bytes, which is required by the
>>> architecture for TTBR address values.
>>
>> I'm not sure I've understood this. The level 0 table would need 32 entries for
>> 52 VA bits so the table size is 256 bytes, naturally aligned to 256 bytes. 64 is
>> a factor of 256 so surely the top 2 entries are guaranteed to also meet the
>> constraint for the fallback path too?
>>
> 
> The top 2 entries are 16 bytes combined, and end on a 256 byte aligned
> boundary so I don't see how they can start on a 64 byte aligned
> boundary at the same time.

I'm still not following; why would the 2 entry/16 byte table *end* on a 256 byte
boundary? I guess I should go and read your patch before making assumptions, but
my assumption from your description here was that you were optimistically
allocating a 32 entry/256 byte table for the 52 VA bit case, then needing to
reuse that table for the 2 entry/16 byte case if HW turns out not to support
LPA2. In which case, surely the 2 entry table would be overlayed at the start
(low address) of the allocated 32 entry table, and therefore its alignment is
256 bytes, which meets the HW's 64 byte alignment requirement?

> 
> My RFC had a workaround for this, but it is a bit nasty because we
> need to copy those two entries at the right time and keep them in
> sync.
> 
>>> Also, using an additional level of
>>> paging to translate a single VA bit is wasteful in terms of TLB
>>> efficiency)
>>>
>>> This means support for falling back to 3 levels of paging at runtime
>>> when configured for 4 is also needed.
>>>
>>> Another thing worth to note is that the repurposed physical address bits
>>> in the page table descriptors were not RES0 before, and so there is now
>>> a big global switch (called TCR.DS) which controls how all page table
>>> descriptors are interpreted. This requires some extra care in the PTE
>>> conversion helpers, and additional handling in the boot code to ensure
>>> that we set TCR.DS safely if supported (and not overridden)
>>>
>>> Note that this series is mostly orthogonal to work by Anshuman done last
>>> year: this series assumes that 52-bit physical addressing is never
>>> needed to map the kernel image itself, and therefore that we never need
>>> ID map range extension to cover the kernel with a 5th level when running
>>> with 4.
>>
>> This limitation will certainly make it more tricky to test the the LPA2 stage2
>> implementation that I have done. I've got scripts that construct host systems
>> with all the RAM above 48 bits so that the output addresses in the stage2 page
>> tables can be guaranteed to contain OAs > 48 bits. I think the work around here
>> would be to place the RAM so that it straddles the 48 bit boundary such that the
>> size of RAM below is the same size as the kernel image, and place the kernel
>> image in it. Then this will ensure that the VM's memory still uses the RAM above
>> the threshold. Or is there a simpler approach?
>>
> 
> No, that sounds reasonable. I'm using QEMU which happily lets you put
> the start of DRAM at any address you can imagine (if you recompile it)

I'm running on FVP, which will let me do this with runtime parameters. Anyway,
I'll update my tests to cope with this constraint and run this patch set
through, and I'll let you know if it spots anything.
> 
> Another approach could be to simply stick a memblock_reserve()
> somewhere that covers all 48-bit addressable memory, but you will need
> some of both in any case.
> 
>>> And given that the LPA2 architectural feature covers both the
>>> virtual and physical range extensions, where enabling the latter is
>>> required to enable the former, we can simplify things further by only
>>> enabling them as a pair. (I.e., 52-bit physical addressing cannot be
>>> enabled for 48-bit VA space or smaller)
>>>
>>> [...]
>>
>> Thanks,
>> Ryan
>>
>>
Ard Biesheuvel Nov. 25, 2022, 9:35 a.m. UTC | #4
On Fri, 25 Nov 2022 at 10:23, Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 24/11/2022 17:14, Ard Biesheuvel wrote:
> > On Thu, 24 Nov 2022 at 15:39, Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> Hi Ard,
> >>
> >> Thanks for including me on this. I'll plan to do a review over the next week or
> >> so, but in the meantime, I have a couple of general questions/comments:
> >>
> >> On 24/11/2022 12:39, Ard Biesheuvel wrote:
> >>> Enable support for LPA2 when running with 4k or 16k pages. In the former
> >>> case, this requires 5 level paging with a runtime fallback to 4 on
> >>> non-LPA2 hardware. For consistency, the same approach is adopted for 16k
> >>> pages, where we fall back to 3 level paging (47 bit virtual addressing)
> >>> on non-LPA2 configurations.
> >>
> >> It seems odd to me that if you have a non-LPA2 system, if you run a kernel that
> >> is compiled for 16KB pages and 48 VA bits, then you will get 48 VA bits. But if
> >> you run a kernel that is compiled for 16KB pages and 52 VA bits then you will
> >> get 47 VA bits? Wouldn't that pose a potential user space compat issue?
> >>
> >
> > Well, given that Android happily runs with 39-bit VAs to avoid 4 level
> > paging at all cost, I don't think that is a universal concern.
>
> Well presumably the Android kernel is always explicitly compiled for 39 VA bits
> so that's what user space is used to? I was really just making the point that if
> you have (the admittedly exotic and unlikely) case of having a 16KB kernel
> previously compiled for 48 VA bits, and you "upgrade" it to 52 VA bits now that
> the option is available, on HW without LPA2, this will actually be observed as a
> "downgrade" to 47 bits. If you previously wanted to limit to 3 levels of lookup
> with 16KB you would already have been compiling for 47 VA bits.
>

I am not debating that. I'm just saying that, without any hardware in
existence, it is difficult to predict which of these concerns is going
to dominate, and so I opted for the least messy and most symmetrical
approach.

> >
> > The benefit of this approach is that you can decide at runtime whether
> > you want to take the performance hit of 4 (or 5) level paging to get
> > access to the extended VA space.
> >
> >>> (Falling back to 48 bits would involve
> >>> finding a workaround for the fact that we cannot construct a level 0
> >>> table covering 52 bits of VA space that appears aligned to its size in
> >>> memory, and has the top 2 entries that represent the 48-bit region
> >>> appearing at an alignment of 64 bytes, which is required by the
> >>> architecture for TTBR address values.
> >>
> >> I'm not sure I've understood this. The level 0 table would need 32 entries for
> >> 52 VA bits so the table size is 256 bytes, naturally aligned to 256 bytes. 64 is
> >> a factor of 256 so surely the top 2 entries are guaranteed to also meet the
> >> constraint for the fallback path too?
> >>
> >
> > The top 2 entries are 16 bytes combined, and end on a 256 byte aligned
> > boundary so I don't see how they can start on a 64 byte aligned
> > boundary at the same time.
>
> I'm still not following; why would the 2 entry/16 byte table *end* on a 256 byte
> boundary? I guess I should go and read your patch before making assumptions, but
> my assumption from your description here was that you were optimistically
> allocating a 32 entry/256 byte table for the 52 VA bit case, then needing to
> reuse that table for the 2 entry/16 byte case if HW turns out not to support
> LPA2. In which case, surely the 2 entry table would be overlayed at the start
> (low address) of the allocated 32 entry table, and therefore its alignment is
> 256 bytes, which meets the HW's 64 byte alignment requirement?
>

No, it's at the end, that is the point. I am specifically referring to
TTBR1 upper region page tables here.

Please refer to the existing ttbr1_offset asm macro, which implements
this today for 64k pages + LVA. In this case, however, the condensed
table covers 6 bits of translation so it is naturally aligned to the
TTBR minimum alignment.
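
For contrast with the 16k numbers, the 64k arithmetic that makes this
natural alignment work out can be sketched as follows (illustrative
only):

#include <stdio.h>

int main(void)
{
	/* 64k granule + LVA: the top level resolves VA bits [51:42] */
	unsigned int entries = 1u << (52 - 42);     /* 1024 entries, 8 KB */
	unsigned int sub     = 1u << (48 - 42);     /* 64-entry sub-table
	                                               covering 48-bit VAs */
	unsigned int off     = (entries - sub) * 8; /* 7680 = 15 * 512 */

	/* 7680 is a multiple of the 512-byte sub-table size, so TTBR1 can
	 * point straight at the condensed table; the 16k equivalent (240)
	 * is not even 64-byte aligned. */
	printf("off=%u aligned=%d\n", off, off % 512 == 0);
	return 0;
}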

> >
> > My RFC had a workaround for this, but it is a bit nasty because we
> > need to copy those two entries at the right time and keep them in
> > sync.
> >
> >>> Also, using an additional level of
> >>> paging to translate a single VA bit is wasteful in terms of TLB
> >>> efficiency)
> >>>
> >>> This means support for falling back to 3 levels of paging at runtime
> >>> when configured for 4 is also needed.
> >>>
> >>> Another thing worth to note is that the repurposed physical address bits
> >>> in the page table descriptors were not RES0 before, and so there is now
> >>> a big global switch (called TCR.DS) which controls how all page table
> >>> descriptors are interpreted. This requires some extra care in the PTE
> >>> conversion helpers, and additional handling in the boot code to ensure
> >>> that we set TCR.DS safely if supported (and not overridden)
> >>>
> >>> Note that this series is mostly orthogonal to work by Anshuman done last
> >>> year: this series assumes that 52-bit physical addressing is never
> >>> needed to map the kernel image itself, and therefore that we never need
> >>> ID map range extension to cover the kernel with a 5th level when running
> >>> with 4.
> >>
> >> This limitation will certainly make it more tricky to test the the LPA2 stage2
> >> implementation that I have done. I've got scripts that construct host systems
> >> with all the RAM above 48 bits so that the output addresses in the stage2 page
> >> tables can be guaranteed to contain OAs > 48 bits. I think the work around here
> >> would be to place the RAM so that it straddles the 48 bit boundary such that the
> >> size of RAM below is the same size as the kernel image, and place the kernel
> >> image in it. Then this will ensure that the VM's memory still uses the RAM above
> >> the threshold. Or is there a simpler approach?
> >>
> >
> > No, that sounds reasonable. I'm using QEMU which happily lets you put
> > the start of DRAM at any address you can imagine (if you recompile it)
>
> I'm running on FVP, which will let me do this with runtime parameters. Anyway,
> I'll update my tests to cope with this constraint and run this patch set
> through, and I'll let you know if it spots anything.

Excellent, thanks.

> >
> > Another approach could be to simply stick a memblock_reserve()
> > somewhere that covers all 48-bit addressable memory, but you will need
> > some of both in any case.
> >
> >>> And given that the LPA2 architectural feature covers both the
> >>> virtual and physical range extensions, where enabling the latter is
> >>> required to enable the former, we can simplify things further by only
> >>> enabling them as a pair. (I.e., 52-bit physical addressing cannot be
> >>> enabled for 48-bit VA space or smaller)
> >>>
> >>> [...]
> >>
> >> Thanks,
> >> Ryan
> >>
> >>
>
Ryan Roberts Nov. 25, 2022, 10:07 a.m. UTC | #5
On 25/11/2022 09:35, Ard Biesheuvel wrote:
> On Fri, 25 Nov 2022 at 10:23, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 24/11/2022 17:14, Ard Biesheuvel wrote:
>>> On Thu, 24 Nov 2022 at 15:39, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> Hi Ard,
>>>>
>>>> Thanks for including me on this. I'll plan to do a review over the next week or
>>>> so, but in the meantime, I have a couple of general questions/comments:
>>>>
>>>> On 24/11/2022 12:39, Ard Biesheuvel wrote:
>>>>> Enable support for LPA2 when running with 4k or 16k pages. In the former
>>>>> case, this requires 5 level paging with a runtime fallback to 4 on
>>>>> non-LPA2 hardware. For consistency, the same approach is adopted for 16k
>>>>> pages, where we fall back to 3 level paging (47 bit virtual addressing)
>>>>> on non-LPA2 configurations.
>>>>
>>>> It seems odd to me that if you have a non-LPA2 system, if you run a kernel that
>>>> is compiled for 16KB pages and 48 VA bits, then you will get 48 VA bits. But if
>>>> you run a kernel that is compiled for 16KB pages and 52 VA bits then you will
>>>> get 47 VA bits? Wouldn't that pose a potential user space compat issue?
>>>>
>>>
>>> Well, given that Android happily runs with 39-bit VAs to avoid 4 level
>>> paging at all cost, I don't think that is a universal concern.
>>
>> Well presumably the Android kernel is always explicitly compiled for 39 VA bits
>> so that's what user space is used to? I was really just making the point that if
>> you have (the admittedly exotic and unlikely) case of having a 16KB kernel
>> previously compiled for 48 VA bits, and you "upgrade" it to 52 VA bits now that
>> the option is available, on HW without LPA2, this will actually be observed as a
>> "downgrade" to 47 bits. If you previously wanted to limit to 3 levels of lookup
>> with 16KB you would already have been compiling for 47 VA bits.
>>
> 
> I am not debating that. I'm just saying that, without any hardware in
> existence, it is difficult to predict which of these concerns is going
> to dominate, and so I opted for the least messy and most symmetrical
> approach.

OK fair enough. My opinion is logged ;-).

> 
>>>
>>> The benefit of this approach is that you can decide at runtime whether
>>> you want to take the performance hit of 4 (or 5) level paging to get
>>> access to the extended VA space.
>>>
>>>>> (Falling back to 48 bits would involve
>>>>> finding a workaround for the fact that we cannot construct a level 0
>>>>> table covering 52 bits of VA space that appears aligned to its size in
>>>>> memory, and has the top 2 entries that represent the 48-bit region
>>>>> appearing at an alignment of 64 bytes, which is required by the
>>>>> architecture for TTBR address values.
>>>>
>>>> I'm not sure I've understood this. The level 0 table would need 32 entries for
>>>> 52 VA bits so the table size is 256 bytes, naturally aligned to 256 bytes. 64 is
>>>> a factor of 256 so surely the top 2 entries are guaranteed to also meet the
>>>> constraint for the fallback path too?
>>>>
>>>
>>> The top 2 entries are 16 bytes combined, and end on a 256 byte aligned
>>> boundary so I don't see how they can start on a 64 byte aligned
>>> boundary at the same time.
>>
>> I'm still not following; why would the 2 entry/16 byte table *end* on a 256 byte
>> boundary? I guess I should go and read your patch before making assumptions, but
>> my assumption from your description here was that you were optimistically
>> allocating a 32 entry/256 byte table for the 52 VA bit case, then needing to
>> reuse that table for the 2 entry/16 byte case if HW turns out not to support
>> LPA2. In which case, surely the 2 entry table would be overlayed at the start
>> (low address) of the allocated 32 entry table, and therefore its alignment is
>> 256 bytes, which meets the HW's 64 byte alignment requirement?
>>
> 
> No, it's at the end, that is the point. I am specifically referring to
> TTBR1 upper region page tables here.
> 
> Please refer to the existing ttbr1_offset asm macro, which implements
> this today for 64k pages + LVA. In this case, however, the condensed
> table covers 6 bits of translation so it is naturally aligned to the
> TTBR minimum alignment.

Afraid I don't see any such ttbr1_offset macro, either in upstream or the branch
you posted. The best I can find is TTBR1_OFFSET in arm arch, which I'm guessing
isn't it. I'm keen to understand this better if you can point me to the right
location?

Regardless, the Arm ARM states this for TTBR1_EL1.BADDR:

"""
BADDR[47:1], bits [47:1]

Translation table base address:
• Bits A[47:x] of the stage 1 translation table base address bits are in
register bits[47:x].
• Bits A[(x-1):0] of the stage 1 translation table base address are zero.

Address bit x is the minimum address bit required to align the translation table
to the size of the table. The smallest permitted value of x is 6. The AArch64
Virtual Memory System Architecture chapter describes how x is calculated based
on the value of TCR_EL1.T1SZ, the translation stage, and the translation granule
size.

Note
A translation table is required to be aligned to the size of the table. If a
table contains fewer than eight entries, it must be aligned on a 64 byte address
boundary.
"""

I don't see how that is referring to the alignment of the *end* of the table?



> 
>>>
>>> My RFC had a workaround for this, but it is a bit nasty because we
>>> need to copy those two entries at the right time and keep them in
>>> sync.
>>>
>>>>> Also, using an additional level of
>>>>> paging to translate a single VA bit is wasteful in terms of TLB
>>>>> efficiency)
>>>>>
>>>>> This means support for falling back to 3 levels of paging at runtime
>>>>> when configured for 4 is also needed.
>>>>>
>>>>> Another thing worth to note is that the repurposed physical address bits
>>>>> in the page table descriptors were not RES0 before, and so there is now
>>>>> a big global switch (called TCR.DS) which controls how all page table
>>>>> descriptors are interpreted. This requires some extra care in the PTE
>>>>> conversion helpers, and additional handling in the boot code to ensure
>>>>> that we set TCR.DS safely if supported (and not overridden)
>>>>>
>>>>> Note that this series is mostly orthogonal to work by Anshuman done last
>>>>> year: this series assumes that 52-bit physical addressing is never
>>>>> needed to map the kernel image itself, and therefore that we never need
>>>>> ID map range extension to cover the kernel with a 5th level when running
>>>>> with 4.
>>>>
>>>> This limitation will certainly make it more tricky to test the the LPA2 stage2
>>>> implementation that I have done. I've got scripts that construct host systems
>>>> with all the RAM above 48 bits so that the output addresses in the stage2 page
>>>> tables can be guaranteed to contain OAs > 48 bits. I think the work around here
>>>> would be to place the RAM so that it straddles the 48 bit boundary such that the
>>>> size of RAM below is the same size as the kernel image, and place the kernel
>>>> image in it. Then this will ensure that the VM's memory still uses the RAM above
>>>> the threshold. Or is there a simpler approach?
>>>>
>>>
>>> No, that sounds reasonable. I'm using QEMU which happily lets you put
>>> the start of DRAM at any address you can imagine (if you recompile it)
>>
>> I'm running on FVP, which will let me do this with runtime parameters. Anyway,
>> I'll update my tests to cope with this constraint and run this patch set
>> through, and I'll let you know if it spots anything.
> 
> Excellent, thanks.
> 
>>>
>>> Another approach could be to simply stick a memblock_reserve()
>>> somewhere that covers all 48-bit addressable memory, but you will need
>>> some of both in any case.
>>>
>>>>> And given that the LPA2 architectural feature covers both the
>>>>> virtual and physical range extensions, where enabling the latter is
>>>>> required to enable the former, we can simplify things further by only
>>>>> enabling them as a pair. (I.e., 52-bit physical addressing cannot be
>>>>> enabled for 48-bit VA space or smaller)
>>>>>
>>>>> [...]
>>>>
>>>> Thanks,
>>>> Ryan
>>>>
>>>>
>>
Ard Biesheuvel Nov. 25, 2022, 10:36 a.m. UTC | #6
On Fri, 25 Nov 2022 at 11:07, Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 25/11/2022 09:35, Ard Biesheuvel wrote:
> > On Fri, 25 Nov 2022 at 10:23, Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 24/11/2022 17:14, Ard Biesheuvel wrote:
> >>> On Thu, 24 Nov 2022 at 15:39, Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>
> >>>> Hi Ard,
> >>>>
> >>>> Thanks for including me on this. I'll plan to do a review over the next week or
> >>>> so, but in the meantime, I have a couple of general questions/comments:
> >>>>
> >>>> On 24/11/2022 12:39, Ard Biesheuvel wrote:
> >>>>> Enable support for LPA2 when running with 4k or 16k pages. In the former
> >>>>> case, this requires 5 level paging with a runtime fallback to 4 on
> >>>>> non-LPA2 hardware. For consistency, the same approach is adopted for 16k
> >>>>> pages, where we fall back to 3 level paging (47 bit virtual addressing)
> >>>>> on non-LPA2 configurations.
> >>>>
> >>>> It seems odd to me that if you have a non-LPA2 system, if you run a kernel that
> >>>> is compiled for 16KB pages and 48 VA bits, then you will get 48 VA bits. But if
> >>>> you run a kernel that is compiled for 16KB pages and 52 VA bits then you will
> >>>> get 47 VA bits? Wouldn't that pose a potential user space compat issue?
> >>>>
> >>>
> >>> Well, given that Android happily runs with 39-bit VAs to avoid 4 level
> >>> paging at all cost, I don't think that is a universal concern.
> >>
> >> Well presumably the Android kernel is always explicitly compiled for 39 VA bits
> >> so that's what user space is used to? I was really just making the point that if
> >> you have (the admittedly exotic and unlikely) case of having a 16KB kernel
> >> previously compiled for 48 VA bits, and you "upgrade" it to 52 VA bits now that
> >> the option is available, on HW without LPA2, this will actually be observed as a
> >> "downgrade" to 47 bits. If you previously wanted to limit to 3 levels of lookup
> >> with 16KB you would already have been compiling for 47 VA bits.
> >>
> >
> > I am not debating that. I'm just saying that, without any hardware in
> > existence, it is difficult to predict which of these concerns is going
> > to dominate, and so I opted for the least messy and most symmetrical
> > approach.
>
> OK fair enough. My opinion is logged ;-).
>
> >
> >>>
> >>> The benefit of this approach is that you can decide at runtime whether
> >>> you want to take the performance hit of 4 (or 5) level paging to get
> >>> access to the extended VA space.
> >>>
> >>>>> (Falling back to 48 bits would involve
> >>>>> finding a workaround for the fact that we cannot construct a level 0
> >>>>> table covering 52 bits of VA space that appears aligned to its size in
> >>>>> memory, and has the top 2 entries that represent the 48-bit region
> >>>>> appearing at an alignment of 64 bytes, which is required by the
> >>>>> architecture for TTBR address values.
> >>>>
> >>>> I'm not sure I've understood this. The level 0 table would need 32 entries for
> >>>> 52 VA bits so the table size is 256 bytes, naturally aligned to 256 bytes. 64 is
> >>>> a factor of 256 so surely the top 2 entries are guaranteed to also meet the
> >>>> constraint for the fallback path too?
> >>>>
> >>>
> >>> The top 2 entries are 16 bytes combined, and end on a 256 byte aligned
> >>> boundary so I don't see how they can start on a 64 byte aligned
> >>> boundary at the same time.
> >>
> >> I'm still not following; why would the 2 entry/16 byte table *end* on a 256 byte
> >> boundary? I guess I should go and read your patch before making assumptions, but
> >> my assumption from your description here was that you were optimistically
> >> allocating a 32 entry/256 byte table for the 52 VA bit case, then needing to
> >> reuse that table for the 2 entry/16 byte case if HW turns out not to support
> >> LPA2. In which case, surely the 2 entry table would be overlayed at the start
> >> (low address) of the allocated 32 entry table, and therefore its alignment is
> >> 256 bytes, which meets the HW's 64 byte alignment requirement?
> >>
> >
> > No, it's at the end, that is the point. I am specifically referring to
> > TTBR1 upper region page tables here.
> >
> > Please refer to the existing ttbr1_offset asm macro, which implements
> > this today for 64k pages + LVA. In this case, however, the condensed
> > table covers 6 bits of translation so it is naturally aligned to the
> > TTBR minimum alignment.
>
> Afraid I don't see any such ttbr1_offset macro, either in upstream or the branch
> you posted. The best I can find is TTBR1_OFFSET in arm arch, which I'm guessing
> isn't it. I'm keen to understand this better if you can point me to the right
> location?
>

Apologies, I got the name wrong

/*
 * Offset ttbr1 to allow for 48-bit kernel VAs set with 52-bit PTRS_PER_PGD.
 * orr is used as it can cover the immediate value (and is idempotent).
 * In future this may be nop'ed out when dealing with 52-bit kernel VAs.
 *      ttbr: Value of ttbr to set, modified.
 */
        .macro  offset_ttbr1, ttbr, tmp
#ifdef CONFIG_ARM64_VA_BITS_52
        mrs_s   \tmp, SYS_ID_AA64MMFR2_EL1
        and     \tmp, \tmp, #(0xf << ID_AA64MMFR2_EL1_VARange_SHIFT)
        cbnz    \tmp, .Lskipoffs_\@
        orr     \ttbr, \ttbr, #TTBR1_BADDR_4852_OFFSET
.Lskipoffs_\@ :
#endif
        .endm

> Regardless, the Arm ARM states this for TTBR1_EL1.BADDR:
>
> """
> BADDR[47:1], bits [47:1]
>
> Translation table base address:
> • Bits A[47:x] of the stage 1 translation table base address bits are in
> register bits[47:x].
> • Bits A[(x-1):0] of the stage 1 translation table base address are zero.
>
> Address bit x is the minimum address bit required to align the translation table
> to the size of the table. The smallest permitted value of x is 6. The AArch64
> Virtual Memory System Architecture chapter describes how x is calculated based
> on the value of TCR_EL1.T1SZ, the translation stage, and the translation granule
> size.
>
> Note
> A translation table is required to be aligned to the size of the table. If a
> table contains fewer than eight entries, it must be aligned on a 64 byte address
> boundary.
> """
>
> I don't see how that is referring to the alignment of the *end* of the table?
>

It refers to the address poked into the register

When you create a 256 byte aligned 32 entry 52-bit level 0 table for
16k pages, entry #0 covers the start of the 52-bit addressable VA
space, and entry #30 covers the start of the 48-bit addressable VA
space.

When LPA2 is not supported, the walk must start at entry #30 so that
is where TTBR1_EL1 should point, but doing so violates the
architecture's alignment requirement.

So what we might do is double the size of the table, and clone entries
#30 and #31 to positions #32 and #33, for instance (and remember to
keep them in sync, which is not that hard)
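
A hedged sketch of that doubling idea (not code from the series; the
64-entry allocation and helper name are assumptions):

#include <stdint.h>

/* The level 0 table is allocated at double size: 64 entries instead of
 * 32. Entries #30/#31 cover the 48-bit region of the 52-bit layout;
 * their clones at #32/#33 start at byte offset 256, which satisfies
 * the 64-byte TTBR alignment requirement that offset 240 cannot. */
static void sync_fallback_entries(uint64_t *pgd)
{
	pgd[32] = pgd[30];
	pgd[33] = pgd[31];
}
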
Ryan Roberts Nov. 25, 2022, 2:12 p.m. UTC | #7
On 25/11/2022 10:36, Ard Biesheuvel wrote:
> On Fri, 25 Nov 2022 at 11:07, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 25/11/2022 09:35, Ard Biesheuvel wrote:
>>> On Fri, 25 Nov 2022 at 10:23, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 24/11/2022 17:14, Ard Biesheuvel wrote:
>>>>> On Thu, 24 Nov 2022 at 15:39, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>
>>>>>> Hi Ard,
>>>>>>
>>>>>> Thanks for including me on this. I'll plan to do a review over the next week or
>>>>>> so, but in the meantime, I have a couple of general questions/comments:
>>>>>>
>>>>>> On 24/11/2022 12:39, Ard Biesheuvel wrote:
>>>>>>> Enable support for LPA2 when running with 4k or 16k pages. In the former
>>>>>>> case, this requires 5 level paging with a runtime fallback to 4 on
>>>>>>> non-LPA2 hardware. For consistency, the same approach is adopted for 16k
>>>>>>> pages, where we fall back to 3 level paging (47 bit virtual addressing)
>>>>>>> on non-LPA2 configurations.
>>>>>>
>>>>>> It seems odd to me that if you have a non-LPA2 system, if you run a kernel that
>>>>>> is compiled for 16KB pages and 48 VA bits, then you will get 48 VA bits. But if
>>>>>> you run a kernel that is compiled for 16KB pages and 52 VA bits then you will
>>>>>> get 47 VA bits? Wouldn't that pose a potential user space compat issue?
>>>>>>
>>>>>
>>>>> Well, given that Android happily runs with 39-bit VAs to avoid 4 level
>>>>> paging at all cost, I don't think that is a universal concern.
>>>>
>>>> Well presumably the Android kernel is always explicitly compiled for 39 VA bits
>>>> so that's what user space is used to? I was really just making the point that if
>>>> you have (the admittedly exotic and unlikely) case of having a 16KB kernel
>>>> previously compiled for 48 VA bits, and you "upgrade" it to 52 VA bits now that
>>>> the option is available, on HW without LPA2, this will actually be observed as a
>>>> "downgrade" to 47 bits. If you previously wanted to limit to 3 levels of lookup
>>>> with 16KB you would already have been compiling for 47 VA bits.
>>>>
>>>
>>> I am not debating that. I'm just saying that, without any hardware in
>>> existence, it is difficult to predict which of these concerns is going
>>> to dominate, and so I opted for the least messy and most symmetrical
>>> approach.
>>
>> OK fair enough. My opinion is logged ;-).
>>
>>>
>>>>>
>>>>> The benefit of this approach is that you can decide at runtime whether
>>>>> you want to take the performance hit of 4 (or 5) level paging to get
>>>>> access to the extended VA space.
>>>>>
>>>>>>> (Falling back to 48 bits would involve
>>>>>>> finding a workaround for the fact that we cannot construct a level 0
>>>>>>> table covering 52 bits of VA space that appears aligned to its size in
>>>>>>> memory, and has the top 2 entries that represent the 48-bit region
>>>>>>> appearing at an alignment of 64 bytes, which is required by the
>>>>>>> architecture for TTBR address values.
>>>>>>
>>>>>> I'm not sure I've understood this. The level 0 table would need 32 entries for
>>>>>> 52 VA bits so the table size is 256 bytes, naturally aligned to 256 bytes. 64 is
>>>>>> a factor of 256 so surely the top 2 entries are guaranteed to also meet the
>>>>>> constraint for the fallback path too?
>>>>>>
>>>>>
>>>>> The top 2 entries are 16 bytes combined, and end on a 256 byte aligned
>>>>> boundary so I don't see how they can start on a 64 byte aligned
>>>>> boundary at the same time.
>>>>
>>>> I'm still not following; why would the 2 entry/16 byte table *end* on a 256 byte
>>>> boundary? I guess I should go and read your patch before making assumptions, but
>>>> my assumption from your description here was that you were optimistically
>>>> allocating a 32 entry/256 byte table for the 52 VA bit case, then needing to
>>>> reuse that table for the 2 entry/16 byte case if HW turns out not to support
>>>> LPA2. In which case, surely the 2 entry table would be overlayed at the start
>>>> (low address) of the allocated 32 entry table, and therefore its alignment is
>>>> 256 bytes, which meets the HW's 64 byte alignment requirement?
>>>>
>>>
>>> No, it's at the end, that is the point. I am specifically referring to
>>> TTBR1 upper region page tables here.
>>>
>>> Please refer to the existing ttbr1_offset asm macro, which implements
>>> this today for 64k pages + LVA. In this case, however, the condensed
>>> table covers 6 bits of translation so it is naturally aligned to the
>>> TTBR minimum alignment.
>>
>> Afraid I don't see any such ttbr1_offset macro, either in upstream or the branch
>> you posted. The best I can find is TTBR1_OFFSET in arm arch, which I'm guessing
>> isn't it. I'm keen to understand this better if you can point me to the right
>> location?
>>
> 
> Apologies, I got the name wrong
> 
> /*
>  * Offset ttbr1 to allow for 48-bit kernel VAs set with 52-bit PTRS_PER_PGD.
>  * orr is used as it can cover the immediate value (and is idempotent).
>  * In future this may be nop'ed out when dealing with 52-bit kernel VAs.
>  *      ttbr: Value of ttbr to set, modified.
>  */
>         .macro  offset_ttbr1, ttbr, tmp
> #ifdef CONFIG_ARM64_VA_BITS_52
>         mrs_s   \tmp, SYS_ID_AA64MMFR2_EL1
>         and     \tmp, \tmp, #(0xf << ID_AA64MMFR2_EL1_VARange_SHIFT)
>         cbnz    \tmp, .Lskipoffs_\@
>         orr     \ttbr, \ttbr, #TTBR1_BADDR_4852_OFFSET
> .Lskipoffs_\@ :
> #endif
>         .endm
> 
>> Regardless, the Arm ARM states this for TTBR1_EL1.BADDR:
>>
>> """
>> BADDR[47:1], bits [47:1]
>>
>> Translation table base address:
>> • Bits A[47:x] of the stage 1 translation table base address bits are in
>> register bits[47:x].
>> • Bits A[(x-1):0] of the stage 1 translation table base address are zero.
>>
>> Address bit x is the minimum address bit required to align the translation table
>> to the size of the table. The smallest permitted value of x is 6. The AArch64
>> Virtual Memory System Architecture chapter describes how x is calculated based
>> on the value of TCR_EL1.T1SZ, the translation stage, and the translation granule
>> size.
>>
>> Note
>> A translation table is required to be aligned to the size of the table. If a
>> table contains fewer than eight entries, it must be aligned on a 64 byte address
>> boundary.
>> """
>>
>> I don't see how that is referring to the alignment of the *end* of the table?
>>
> 
> It refers to the address poked into the register
> 
> When you create a 256 byte aligned 32 entry 52-bit level 0 table for
> 16k pages, entry #0 covers the start of the 52-bit addressable VA
> space, and entry #30 covers the start of the 48-bit addressable VA
> space.
> 
> When LPA2 is not supported, the walk must start at entry #30 so that
> is where TTBR1_EL1 should point, but doing so violates the
> architecture's alignment requirement.
> 
> So what we might do is double the size of the table, and clone entries
> #30 and #31 to positions #32 and #33, for instance (and remember to
> keep them in sync, which is not that hard)

OK I get it now - thanks for explaining. I think the key part that I didn't
appreciate is that the table is always allocated and populated as if we have 52
VA bits even if we only have 48, then you just fix up TTBR1 to point part way
down the table if we only have 48 bits.
Ard Biesheuvel Nov. 25, 2022, 2:19 p.m. UTC | #8
On Fri, 25 Nov 2022 at 15:13, Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 25/11/2022 10:36, Ard Biesheuvel wrote:
> > On Fri, 25 Nov 2022 at 11:07, Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 25/11/2022 09:35, Ard Biesheuvel wrote:
> >>> On Fri, 25 Nov 2022 at 10:23, Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>
> >>>> On 24/11/2022 17:14, Ard Biesheuvel wrote:
> >>>>> On Thu, 24 Nov 2022 at 15:39, Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>>
> >>>>>> Hi Ard,
> >>>>>>
> >>>>>> Thanks for including me on this. I'll plan to do a review over the next week or
> >>>>>> so, but in the meantime, I have a couple of general questions/comments:
> >>>>>>
> >>>>>> On 24/11/2022 12:39, Ard Biesheuvel wrote:
> >>>>>>> Enable support for LPA2 when running with 4k or 16k pages. In the former
> >>>>>>> case, this requires 5 level paging with a runtime fallback to 4 on
> >>>>>>> non-LPA2 hardware. For consistency, the same approach is adopted for 16k
> >>>>>>> pages, where we fall back to 3 level paging (47 bit virtual addressing)
> >>>>>>> on non-LPA2 configurations.
> >>>>>>
> >>>>>> It seems odd to me that if you have a non-LPA2 system, if you run a kernel that
> >>>>>> is compiled for 16KB pages and 48 VA bits, then you will get 48 VA bits. But if
> >>>>>> you run a kernel that is compiled for 16KB pages and 52 VA bits then you will
> >>>>>> get 47 VA bits? Wouldn't that pose a potential user space compat issue?
> >>>>>>
> >>>>>
> >>>>> Well, given that Android happily runs with 39-bit VAs to avoid 4 level
> >>>>> paging at all cost, I don't think that is a universal concern.
> >>>>
> >>>> Well presumably the Android kernel is always explicitly compiled for 39 VA bits
> >>>> so that's what user space is used to? I was really just making the point that if
> >>>> you have (the admittedly exotic and unlikely) case of having a 16KB kernel
> >>>> previously compiled for 48 VA bits, and you "upgrade" it to 52 VA bits now that
> >>>> the option is available, on HW without LPA2, this will actually be observed as a
> >>>> "downgrade" to 47 bits. If you previously wanted to limit to 3 levels of lookup
> >>>> with 16KB you would already have been compiling for 47 VA bits.
> >>>>
> >>>
> >>> I am not debating that. I'm just saying that, without any hardware in
> >>> existence, it is difficult to predict which of these concerns is going
> >>> to dominate, and so I opted for the least messy and most symmetrical
> >>> approach.
> >>
> >> OK fair enough. My opinion is logged ;-).
> >>
> >>>
> >>>>>
> >>>>> The benefit of this approach is that you can decide at runtime whether
> >>>>> you want to take the performance hit of 4 (or 5) level paging to get
> >>>>> access to the extended VA space.
> >>>>>
> >>>>>>> (Falling back to 48 bits would involve
> >>>>>>> finding a workaround for the fact that we cannot construct a level 0
> >>>>>>> table covering 52 bits of VA space that appears aligned to its size in
> >>>>>>> memory, and has the top 2 entries that represent the 48-bit region
> >>>>>>> appearing at an alignment of 64 bytes, which is required by the
> >>>>>>> architecture for TTBR address values.
> >>>>>>
> >>>>>> I'm not sure I've understood this. The level 0 table would need 32 entries for
> >>>>>> 52 VA bits so the table size is 256 bytes, naturally aligned to 256 bytes. 64 is
> >>>>>> a factor of 256 so surely the top 2 entries are guaranteed to also meet the
> >>>>>> constraint for the fallback path too?
> >>>>>>
> >>>>>
> >>>>> The top 2 entries are 16 bytes combined, and end on a 256 byte aligned
> >>>>> boundary so I don't see how they can start on a 64 byte aligned
> >>>>> boundary at the same time.
> >>>>
> >>>> I'm still not following; why would the 2 entry/16 byte table *end* on a 256 byte
> >>>> boundary? I guess I should go and read your patch before making assumptions, but
> >>>> my assumption from your description here was that you were optimistically
> >>>> allocating a 32 entry/256 byte table for the 52 VA bit case, then needing to
> >>>> reuse that table for the 2 entry/16 byte case if HW turns out not to support
> >>>> LPA2. In which case, surely the 2 entry table would be overlayed at the start
> >>>> (low address) of the allocated 32 entry table, and therefore its alignment is
> >>>> 256 bytes, which meets the HW's 64 byte alignment requirement?
> >>>>
> >>>
> >>> No, it's at the end, that is the point. I am specifically referring to
> >>> TTBR1 upper region page tables here.
> >>>
> >>> Please refer to the existing ttbr1_offset asm macro, which implements
> >>> this today for 64k pages + LVA. In this case, however, the condensed
> >>> table covers 6 bits of translation so it is naturally aligned to the
> >>> TTBR minimum alignment.
> >>
> >> Afraid I don't see any such ttbr1_offset macro, either in upstream or the branch
> >> you posted. The best I can find is TTBR1_OFFSET in arm arch, which I'm guessing
> >> isn't it. I'm keen to understand this better if you can point me to the right
> >> location?
> >>
> >
> > Apologies, I got the name wrong
> >
> > /*
> >  * Offset ttbr1 to allow for 48-bit kernel VAs set with 52-bit PTRS_PER_PGD.
> >  * orr is used as it can cover the immediate value (and is idempotent).
> >  * In future this may be nop'ed out when dealing with 52-bit kernel VAs.
> >  *      ttbr: Value of ttbr to set, modified.
> >  */
> >         .macro  offset_ttbr1, ttbr, tmp
> > #ifdef CONFIG_ARM64_VA_BITS_52
> >         mrs_s   \tmp, SYS_ID_AA64MMFR2_EL1
> >         and     \tmp, \tmp, #(0xf << ID_AA64MMFR2_EL1_VARange_SHIFT)
> >         cbnz    \tmp, .Lskipoffs_\@
> >         orr     \ttbr, \ttbr, #TTBR1_BADDR_4852_OFFSET
> > .Lskipoffs_\@ :
> > #endif
> >         .endm
> >
> >> Regardless, the Arm ARM states this for TTBR1_EL1.BADDR:
> >>
> >> """
> >> BADDR[47:1], bits [47:1]
> >>
> >> Translation table base address:
> >> • Bits A[47:x] of the stage 1 translation table base address bits are in
> >> register bits[47:x].
> >> • Bits A[(x-1):0] of the stage 1 translation table base address are zero.
> >>
> >> Address bit x is the minimum address bit required to align the translation table
> >> to the size of the table. The smallest permitted value of x is 6. The AArch64
> >> Virtual Memory System Architecture chapter describes how x is calculated based
> >> on the value of TCR_EL1.T1SZ, the translation stage, and the translation granule
> >> size.
> >>
> >> Note
> >> A translation table is required to be aligned to the size of the table. If a
> >> table contains fewer than eight entries, it must be aligned on a 64 byte address
> >> boundary.
> >> """
> >>
> >> I don't see how that is referring to the alignment of the *end* of the table?
> >>
> >
> > It refers to the address poked into the register
> >
> > When you create a 256 byte aligned 32 entry 52-bit level 0 table for
> > 16k pages, entry #0 covers the start of the 52-bit addressable VA
> > space, and entry #30 covers the start of the 48-bit addressable VA
> > space.
> >
> > When LPA2 is not supported, the walk must start at entry #30 so that
> > is where TTBR1_EL1 should point, but doing so violates the
> > architecture's alignment requirement.
> >
> > So what we might do is double the size of the table, and clone entries
> > #30 and #31 to positions #32 and #33, for instance (and remember to
> > keep them in sync, which is not that hard)
>
> OK I get it now - thanks for explaining. I think the key part that I didn't
> appreciate is that the table is always allocated and populated as if we have 52
> VA bits even if we only have 48, then you just fix up TTBR1 to point part way
> down the table if we only have 48 bits.
>

Exactly. The entire address space is represented in software as if we
support 52 bits of virtual addressing, and we simply don't map
anything outside of the 48-bit addressable area if the hardware does
not support it. That way, handling the LVA/LPA2 feature remains an
arch specific implementation detail, and all the generic software page
table walking code has to be none the wiser. And for user space page
tables and the ID map, we can simply use the same root table address -
only kernel page tables need the additional hack here.
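
A minimal sketch of the cloning scheme described above (hypothetical
helper name, not the actual implementation; assumes the 32-entry table
is allocated with room for 64 entries):

#include <stdint.h>

typedef uint64_t u64;

/*
 * 16k pages, 52-bit VA: the level 0 table has 32 entries (256 bytes).
 * Without LPA2 the walk must start at entry #30, whose address is only
 * 16-byte aligned, so entries #30/#31 are mirrored at #32/#33 to obtain
 * a 64-byte aligned address that may legally go into TTBR1_EL1.
 */
static u64 ttbr1_baddr(u64 *pgd, int lpa2)      /* pgd: 256-byte aligned */
{
        if (lpa2)
                return (u64)pgd;        /* walk covers all 52 VA bits */

        /* keep the clones in sync with the live 48-bit entries */
        pgd[32] = pgd[30];
        pgd[33] = pgd[31];

        return (u64)&pgd[32];           /* pgd + 256: 64-byte aligned */
}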
Ryan Roberts Nov. 29, 2022, 3:31 p.m. UTC | #9
Hi Ard,

As promised, I ran your patch set through my test set up and have noticed a few
issues. Sorry it turned into rather a long email...

First, a quick explanation of the test suite: For all valid combinations of the
below parameters, boot the host kernel on the FVP, then boot the guest kernel in
a VM, check that booting succeeds all the way to the guest shell, then power off
the guest followed by the host to check that shutdown is clean.

Parameters:
 - hw_pa:		[48, lpa, lpa2]
 - hw_va:		[48, 52]
 - kvm_mode:		[vhe, nvhe, protected]
 - host_page_size:	[4KB, 16KB, 64KB]
 - host_pa:		[48, 52]
 - host_va:		[48, 52]
 - host_load_addr:	[low, high]
 - guest_page_size:	[64KB]
 - guest_pa:		[52]
 - guest_va:		[52]
 - guest_load_addr:	[low, high]

When *_load_addr is 'low', that means the RAM is below 48 bits in (I)PA space.
'high' means the RAM starts at 2048TB for the guest (52 bit PA), and it means
there are 2 regions for the host; one at 0x880000000000 (48 bit PA) sized to
hold the kernel image only and another at 0x8800000000000 (52 bit PA) sized at
4GB. The FVP only allows RAM at certain locations and having a contiguous region
cross the 48 bit boundary is not an option. So I chose these values to ensure
that the linear map size is within 51 bits, which is a requirement for
nvhe/protected mode kvm.
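
A quick standalone check of that arithmetic (illustrative only, using
the two region bases given above):

#include <stdio.h>

int main(void)
{
        unsigned long long lo = 0x880000000000ULL;      /* 48-bit region */
        unsigned long long hi = 0x8800000000000ULL;     /* 52-bit region */
        unsigned long long span = hi + (4ULL << 30) - lo;

        /* prints 51: both regions fit within a 51-bit linear map */
        printf("bits needed: %d\n", 64 - __builtin_clzll(span - 1));
        return 0;
}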

In all cases, I preload TF-A bl31, kernel, dt and initrd into RAM and run the
FVP. This sidesteps problems with EFI needing low memory, and with the FVP's
block devices needing DMA memory below 44 bits PA. bl31 and dt are appropriately
fixed up for the 2 different memory layouts.

Given this was designed to test my KVM changes, I was previously running these
without the host_load_addr=high option for the 4k and 16k host kernels (since
this requires your patch set). In this situation there are 132 valid configs and
all of them pass.

I then rebased my changes on top of yours and added in the host_load_addr=high
option. Now there are 186 valid configs, 64 of which fail. (some of these
failures are regressions). From a quick initial triage, there are 3 failure modes:


1) 18 FAILING TESTS: Host kernel never outputs anything to console

  TF-A runs successfully, says it is jumping to the kernel, then nothing further
  is seen. I'm pretty confident that the blobs are loaded into memory correctly
  because the same framework is working for the other configs (including 64k
  kernel loaded into high memory). This affects all configs where a host kernel
  with 4k or 16k pages built with LPA2 support is loaded into high memory.


2) 4 FAILING TESTS: Host kernel gets stuck initializing KVM

  During kernel boot, last console log is "kvm [1]: vgic interrupt IRQ9". All
  failing tests are configured for protected KVM, and are built with LPA2
  support, running on non-LPA2 HW.


3) 42 FAILING TESTS: Guest kernel never outputs anything to console

  Host kernel boots fine, and we attempt to launch a guest kernel using kvmtool.
  There is no error reported, but the guest never outputs anything. Haven't
  worked out which config options are common to all failures yet.


Finally, I removed my code, and ran with your patch set as provided. For this I
hacked up my test suite to boot the host, and ignore booting a guest. I also
didn't bother to vary the KVM mode and just left it in VHE mode. There were 46
valid configs here, of which 4 failed. They were all the same failure mode as
(1) above. Failing configs were:

id  hw_pa  hw_va  host_page_size  host_pa  host_va  host_load_addr
------------------------------------------------------------------
40  lpa    52     4k              52       52       high
45  lpa    52     16k             52       52       high
55  lpa2   52     4k              52       52       high
60  lpa2   52     16k             52       52       high


So on the balance of probabilities, I think failure mode (1) is very likely to
be due to a bug in your code. (2) and (3) could be my issue or your issue: I
propose to dig into those a bit further and will get back to you on them. I
don't plan to look any further into (1).

Thanks,
Ryan


On 24/11/2022 12:39, Ard Biesheuvel wrote:
> Enable support for LPA2 when running with 4k or 16k pages. In the former
> case, this requires 5 level paging with a runtime fallback to 4 on
> non-LPA2 hardware. For consistency, the same approach is adopted for 16k
> pages, where we fall back to 3 level paging (47 bit virtual addressing)
> on non-LPA2 configurations. (Falling back to 48 bits would involve
> finding a workaround for the fact that we cannot construct a level 0
> table covering 52 bits of VA space that appears aligned to its size in
> memory, and has the top 2 entries that represent the 48-bit region
> appearing at an alignment of 64 bytes, which is required by the
> architecture for TTBR address values. Also, using an additional level of
> paging to translate a single VA bit is wasteful in terms of TLB
> efficiency)
> 
> This means support for falling back to 3 levels of paging at runtime
> when configured for 4 is also needed.
> 
> Another thing worth to note is that the repurposed physical address bits
> in the page table descriptors were not RES0 before, and so there is now
> a big global switch (called TCR.DS) which controls how all page table
> descriptors are interpreted. This requires some extra care in the PTE
> conversion helpers, and additional handling in the boot code to ensure
> that we set TCR.DS safely if supported (and not overridden)
> 
> Note that this series is mostly orthogonal to work by Anshuman done last
> year: this series assumes that 52-bit physical addressing is never
> needed to map the kernel image itself, and therefore that we never need
> ID map range extension to cover the kernel with a 5th level when running
> with 4. And given that the LPA2 architectural feature covers both the
> virtual and physical range extensions, where enabling the latter is
> required to enable the former, we can simplify things further by only
> enabling them as a pair. (I.e., 52-bit physical addressing cannot be
> enabled for 48-bit VA space or smaller)
> 
> This series applies onto some of my previous work that is still in
> flight, so these patches will not apply in isolation. Complete branch
> can be found here:
> https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=arm64-4k-lpa2
> 
> It supersedes the RFC v1 I sent out last week, which covered 16k pages
> only. It also supersedes some related work I sent out in isolation
> before:
> 
> [PATCH] arm64: mm: Enable KASAN for 16k/48-bit VA configurations
> [PATCH 0/3] arm64: mm: Model LVA support as a CPU feature
> 
> Tested on QEMU with -cpu max and lpa2 both off and on, as well as using
> the arm64.nolva override kernel command line parameter. Note that this
> requires a QEMU built from the latest sources.
> 
> Cc: Marc Zyngier <maz@kernel.org>
> Cc: Will Deacon <will@kernel.org>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Mark Brown <broonie@kernel.org>
> Cc: Anshuman Khandual <anshuman.khandual@arm.com>
> Cc: Richard Henderson <richard.henderson@linaro.org>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> 
> Anshuman Khandual (3):
>   arm64/mm: Simplify and document pte_to_phys() for 52 bit addresses
>   arm64/mm: Add FEAT_LPA2 specific TCR_EL1.DS field
>   arm64/mm: Add FEAT_LPA2 specific ID_AA64MMFR0.TGRAN[2]
> 
> Ard Biesheuvel (16):
>   arm64: kaslr: Adjust randomization range dynamically
>   arm64: mm: get rid of kimage_vaddr global variable
>   arm64: head: remove order argument from early mapping routine
>   arm64: mm: Handle LVA support as a CPU feature
>   arm64: mm: Deal with potential ID map extension if VA_BITS >
>     VA_BITS_MIN
>   arm64: mm: Add feature override support for LVA
>   arm64: mm: Wire up TCR.DS bit to PTE shareability fields
>   arm64: mm: Add LPA2 support to phys<->pte conversion routines
>   arm64: mm: Add definitions to support 5 levels of paging
>   arm64: mm: add 5 level paging support to G-to-nG conversion routine
>   arm64: Enable LPA2 at boot if supported by the system
>   arm64: mm: Add 5 level paging support to fixmap and swapper handling
>   arm64: kasan: Reduce minimum shadow alignment and enable 5 level
>     paging
>   arm64: mm: Add support for folding PUDs at runtime
>   arm64: ptdump: Disregard unaddressable VA space
>   arm64: Enable 52-bit virtual addressing for 4k and 16k granule configs
> 
>  arch/arm64/Kconfig                      |  23 ++-
>  arch/arm64/include/asm/assembler.h      |  42 ++---
>  arch/arm64/include/asm/cpufeature.h     |   2 +
>  arch/arm64/include/asm/fixmap.h         |   1 +
>  arch/arm64/include/asm/kernel-pgtable.h |  27 ++-
>  arch/arm64/include/asm/memory.h         |  23 ++-
>  arch/arm64/include/asm/pgalloc.h        |  53 +++++-
>  arch/arm64/include/asm/pgtable-hwdef.h  |  34 +++-
>  arch/arm64/include/asm/pgtable-prot.h   |  18 +-
>  arch/arm64/include/asm/pgtable-types.h  |   6 +
>  arch/arm64/include/asm/pgtable.h        | 197 ++++++++++++++++++--
>  arch/arm64/include/asm/sysreg.h         |   2 +
>  arch/arm64/include/asm/tlb.h            |   3 +-
>  arch/arm64/kernel/cpufeature.c          |  46 ++++-
>  arch/arm64/kernel/head.S                |  99 +++++-----
>  arch/arm64/kernel/image-vars.h          |   4 +
>  arch/arm64/kernel/pi/idreg-override.c   |  29 ++-
>  arch/arm64/kernel/pi/kaslr_early.c      |  23 ++-
>  arch/arm64/kernel/pi/map_kernel.c       | 115 +++++++++++-
>  arch/arm64/kernel/sleep.S               |   3 -
>  arch/arm64/mm/init.c                    |   2 +-
>  arch/arm64/mm/kasan_init.c              | 124 ++++++++++--
>  arch/arm64/mm/mmap.c                    |   4 +
>  arch/arm64/mm/mmu.c                     | 138 ++++++++++----
>  arch/arm64/mm/pgd.c                     |  17 +-
>  arch/arm64/mm/proc.S                    |  76 +++++++-
>  arch/arm64/mm/ptdump.c                  |   4 +-
>  arch/arm64/tools/cpucaps                |   1 +
>  28 files changed, 907 insertions(+), 209 deletions(-)
>
Ard Biesheuvel Nov. 29, 2022, 3:47 p.m. UTC | #10
On Tue, 29 Nov 2022 at 16:31, Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Hi Ard,
>
> As promised, I ran your patch set through my test set up and have noticed a few
> issues. Sorry it turned into rather a long email...
>

No worries, and thanks a lot for going through the trouble.

> First, a quick explanation of the test suite: For all valid combinations of the
> below parameters, boot the host kernel on the FVP, then boot the guest kernel in
> a VM, check that booting succeeds all the way to the guest shell, then power off
> the guest followed by the host to check that shutdown is clean.
>
> Parameters:
>  - hw_pa:               [48, lpa, lpa2]
>  - hw_va:               [48, 52]
>  - kvm_mode:            [vhe, nvhe, protected]
>  - host_page_size:      [4KB, 16KB, 64KB]
>  - host_pa:             [48, 52]
>  - host_va:             [48, 52]
>  - host_load_addr:      [low, high]
>  - guest_page_size:     [64KB]
>  - guest_pa:            [52]
>  - guest_va:            [52]
>  - guest_load_addr:     [low, high]
>
> When *_load_addr is 'low', that means the RAM is below 48 bits in (I)PA space.
> 'high' means the RAM starts at 2048TB for the guest (52 bit PA), and it means
> there are 2 regions for the host; one at 0x880000000000 (48 bit PA) sized to
> hold the kernel image only and another at 0x8800000000000 (52 bit PA) sized at
> 4GB. The FVP only allows RAM at certain locations and having a contiguous region
> cross the 48 bit boundary is not an option. So I chose these values to ensure
> that the linear map size is within 51 bits, which is a requirement for
> nvhe/protected mode kvm.
>
> In all cases, I preload TF-A bl31, kernel, dt and initrd into RAM and run the
> FVP. This sidesteps problems with EFI needing low memory, and with the FVP's
> block devices needing DMA memory below 44 bits PA. bl31 and dt are appropriately
> fixed up for the 2 different memory layouts.
>
> Given this was designed to test my KVM changes, I was previously running these
> without the host_load_addr=high option for the 4k and 16k host kernels (since
> this requires your patch set). In this situation there are 132 valid configs and
> all of them pass.
>
> I then rebased my changes on top of yours and added in the host_load_addr=high
> option. Now there are 186 valid configs, 64 of which fail. (some of these
> failures are regressions). From a quick initial triage, there are 3 failure modes:
>
>
> 1) 18 FAILING TESTS: Host kernel never outputs anything to console
>
>   TF-A runs successfully, says it is jumping to the kernel, then nothing further
>   is seen. I'm pretty confident that the blobs are loaded into memory correctly
>   because the same framework is working for the other configs (including 64k
>   kernel loaded into high memory). This affects all configs where a host kernel
>   with 4k or 16k pages built with LPA2 support is loaded into high memory.
>

Not sure how to interpret this in combination with your explanation
above, but if 'loaded high' means that the kernel itself is not in
48-bit addressable physical memory, this failure is expected.

Given that we have no way of informing the bootloader or firmware
whether or not a certain kernel image supports running from such a
high offset, it must currently assume it cannot. We've just queued up
a documentation fix to clarify this in the boot protocol, i.e., that
the kernel must be loaded in 48-bit addressable physical memory.

The fact that you had to doctor your boot environment to get around
this kind of proves my point, and unless someone is silly enough to
ship a SoC that cannot function without this, I don't think we should
add this support.

I understand how this is an interesting case for completeness from a
validation pov, but the reality is that adding support for this would
mean introducing changes amounting to dead code to fragile boot
sequence code that is already hard to maintain.

>
> 2) 4 FAILING TESTS: Host kernel gets stuck initializing KVM
>
>   During kernel boot, last console log is "kvm [1]: vgic interrupt IRQ9". All
>   failing tests are configured for protected KVM, and are built with LPA2
>   support, running on non-LPA2 HW.
>

I will try to reproduce this locally.

>
> 3) 42 FAILING TESTS: Guest kernel never outputs anything to console
>
>   Host kernel boots fine, and we attempt to launch a guest kernel using kvmtool.
>   There is no error reported, but the guest never outputs anything. Haven't
>   worked out which config options are common to all failures yet.
>

This goes a bit beyond what I am currently set up for in terms of
testing, but I'm happy to help narrow this down.

>
> Finally, I removed my code, and ran with your patch set as provided. For this I
> hacked up my test suite to boot the host, and ignore booting a guest. I also
> didn't bother to vary the KVM mode and just left it in VHE mode. There were 46
> valid configs here, of which 4 failed. They were all the same failure mode as
> (1) above. Failing configs were:
>
> id  hw_pa  hw_va  host_page_size  host_pa  host_va  host_load_addr
> ------------------------------------------------------------------
> 40  lpa    52     4k              52       52       high
> 45  lpa    52     16k             52       52       high
> 55  lpa2   52     4k              52       52       high
> 60  lpa2   52     16k             52       52       high
>

Same point as above then, I guess.

>
> So on the balance of probabilities, I think failure mode (1) is very likely to
> be due to a bug in your code. (2) and (3) could be my issue or your issue: I
> propose to dig into those a bit further and will get back to you on them. I
> don't plan to look any further into (1).
>

Thanks again. (1) is expected, and (2) is something I will investigate further.
Ryan Roberts Nov. 29, 2022, 4:35 p.m. UTC | #11
On 29/11/2022 15:47, Ard Biesheuvel wrote:
> On Tue, 29 Nov 2022 at 16:31, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> Hi Ard,
>>
>> As promised, I ran your patch set through my test set up and have noticed a few
>> issues. Sorry it turned into rather a long email...
>>
> 
> No worries, and thanks a lot for going through the trouble.
> 
>> First, a quick explanation of the test suite: For all valid combinations of the
>> below parameters, boot the host kernel on the FVP, then boot the guest kernel in
>> a VM, check that booting succeeds all the way to the guest shell, then power off
>> the guest followed by the host to check that shutdown is clean.
>>
>> Parameters:
>>  - hw_pa:               [48, lpa, lpa2]
>>  - hw_va:               [48, 52]
>>  - kvm_mode:            [vhe, nvhe, protected]
>>  - host_page_size:      [4KB, 16KB, 64KB]
>>  - host_pa:             [48, 52]
>>  - host_va:             [48, 52]
>>  - host_load_addr:      [low, high]
>>  - guest_page_size:     [64KB]
>>  - guest_pa:            [52]
>>  - guest_va:            [52]
>>  - guest_load_addr:     [low, high]
>>
>> When *_load_addr is 'low', that means the RAM is below 48 bits in (I)PA space.
>> 'high' means the RAM starts at 2048TB for the guest (52 bit PA), and it means
>> there are 2 regions for the host; one at 0x880000000000 (48 bit PA) sized to
>> hold the kernel image only and another at 0x8800000000000 (52 bit PA) sized at
>> 4GB. The FVP only allows RAM at certain locations and having a contiguous region
>> cross the 48 bit boundary is not an option. So I chose these values to ensure
>> that the linear map size is within 51 bits, which is a requirement for
>> nvhe/protected mode kvm.
>>
>> In all cases, I preload TF-A bl31, kernel, dt and initrd into RAM and run the
>> FVP. This sidesteps problems with EFI needing low memory, and with the FVP's
>> block devices needing DMA memory below 44 bits PA. bl31 and dt are appropriately
>> fixed up for the 2 different memory layouts.
>>
>> Given this was designed to test my KVM changes, I was previously running these
>> without the host_load_addr=high option for the 4k and 16k host kernels (since
>> this requires your patch set). In this situation there are 132 valid configs and
>> all of them pass.
>>
>> I then rebased my changes on top of yours and added in the host_load_addr=high
>> option. Now there are 186 valid configs, 64 of which fail. (some of these
>> failures are regressions). From a quick initial triage, there are 3 failure modes:
>>
>>
>> 1) 18 FAILING TESTS: Host kernel never outputs anything to console
>>
>>   TF-A runs successfully, says it is jumping to the kernel, then nothing further
>>   is seen. I'm pretty confident that the blobs are loaded into memory correctly
>>   because the same framework is working for the other configs (including 64k
>>   kernel loaded into high memory). This affects all configs where a host kernel
>>   with 4k or 16k pages built with LPA2 support is loaded into high memory.
>>
> 
> Not sure how to interpret this in combination with your explanation
> above, but if 'loaded high' means that the kernel itself is not in
> 48-bit addressable physical memory, this failure is expected.

Sorry - my wording was confusing. host_load_addr=high means what I said at the
top; the kernel image is loaded at 0x880000000000 in a block of memory sized to
hold the kernel image only (actually it's forward-aligned to 2MB). The dtb and
initrd are loaded into a 4GB region at 0x8800000000000. The reason I'm doing
this is to ensure that when I create a VM, the memory used for it (at least the
vast majority) is coming from the region at 52 bits. I want to do this to prove
that the stage2 implementation is correctly handling the 52 OA case.

> 
> Given that we have no way of informing the bootloader or firmware
> whether or not a certain kernel image supports running from such a
> high offset, it must currently assume it cannot. We've just queued up
> a documentation fix to clarify this in the boot protocol, i.e., that
> the kernel must be loaded in 48-bit addressable physical memory.

OK, but I think what I'm doing complies with this. Unless the DTB also has to be
below 48 bits?

> 
> The fact that you had to doctor your boot environment to get around
> this kind of proves my point, and unless someone is silly enough to
> ship a SoC that cannot function without this, I don't think we should
> add this support.
> 
> I understand how this is an interesting case for completeness from a
> validation pov, but the reality is that adding support for this would
> mean introducing changes amounting to dead code to fragile boot
> sequence code that is already hard to maintain.

I'm not disagreeing. But I think what I'm doing should conform with the
requirements? (Previously I had the tests set up to just have a single region of
memory above 52 bits and the kernel image was placed there. That works/worked
for the 64KB kernel. But I brought the kernel image to below 48 bits to align
with the requirements of this patch set.)

If you see an easier way for me to validate 52 bit OAs in the stage 2 (and
ideally hyp stage 1), then I'm all ears!

> 
>>
>> 2) 4 FAILING TESTS: Host kernel gets stuck initializing KVM
>>
>>   During kernel boot, last console log is "kvm [1]: vgic interrupt IRQ9". All
>>   failing tests are configured for protected KVM, and are built with LPA2
>>   support, running on non-LPA2 HW.
>>
> 
> I will try to reproduce this locally.
> 
>>
>> 3) 42 FAILING TESTS: Guest kernel never outputs anything to console
>>
>>   Host kernel boots fine, and we attempt to launch a guest kernel using kvmtool.
>>   There is no error reported, but the guest never outputs anything. Haven't
>>   worked out which config options are common to all failures yet.
>>
> 
> This goes a bit beyond what I am currently set up for in terms of
> testing, but I'm happy to help narrow this down.
> 
>>
>> Finally, I removed my code, and ran with your patch set as provided. For this I
>> hacked up my test suite to boot the host, and ignore booting a guest. I also
>> didn't bother to vary the KVM mode and just left it in VHE mode. There were 46
>> valid configs here, of which 4 failed. They were all the same failure mode as
>> (1) above. Failing configs were:
>>
>> id  hw_pa  hw_va  host_page_size  host_pa  host_va  host_load_addr
>> ------------------------------------------------------------------
>> 40  lpa    52     4k              52       52       high
>> 45  lpa    52     16k             52       52       high
>> 55  lpa2   52     4k              52       52       high
>> 60  lpa2   52     16k             52       52       high
>>
> 
> Same point as above then, I guess.
> 
>>
>> So on the balance of probabilities, I think failure mode (1) is very likely to
>> be due to a bug in your code. (2) and (3) could be my issue or your issue: I
>> propose to dig into those a bit further and will get back to you on them. I
>> don't plan to look any further into (1).
>>
> 
> Thanks again. (1) is expected, and (2) is something I will investigate further.
Ard Biesheuvel Nov. 29, 2022, 4:56 p.m. UTC | #12
On Tue, 29 Nov 2022 at 17:36, Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 29/11/2022 15:47, Ard Biesheuvel wrote:
> > On Tue, 29 Nov 2022 at 16:31, Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> Hi Ard,
> >>
> >> As promised, I ran your patch set through my test set up and have noticed a few
> >> issues. Sorry it turned into rather a long email...
> >>
> >
> > No worries, and thanks a lot for going through the trouble.
> >
> >> First, a quick explanation of the test suite: For all valid combinations of the
> >> below parameters, boot the host kernel on the FVP, then boot the guest kernel in
> >> a VM, check that booting succeeds all the way to the guest shell, then power off
> >> the guest followed by the host to check that shutdown is clean.
> >>
> >> Parameters:
> >>  - hw_pa:               [48, lpa, lpa2]
> >>  - hw_va:               [48, 52]
> >>  - kvm_mode:            [vhe, nvhe, protected]
> >>  - host_page_size:      [4KB, 16KB, 64KB]
> >>  - host_pa:             [48, 52]
> >>  - host_va:             [48, 52]
> >>  - host_load_addr:      [low, high]
> >>  - guest_page_size:     [64KB]
> >>  - guest_pa:            [52]
> >>  - guest_va:            [52]
> >>  - guest_load_addr:     [low, high]
> >>
> >> When *_load_addr is 'low', that means the RAM is below 48 bits in (I)PA space.
> >> 'high' means the RAM starts at 2048TB for the guest (52 bit PA), and it means
> >> there are 2 regions for the host; one at 0x880000000000 (48 bit PA) sized to
> >> hold the kernel image only and another at 0x8800000000000 (52 bit PA) sized at
> >> 4GB. The FVP only allows RAM at certain locations and having a contiguous region
> >> cross the 48 bit boundary is not an option. So I chose these values to ensure
> >> that the linear map size is within 51 bits, which is a requirement for
> >> nvhe/protected mode kvm.
> >>
> >> In all cases, I preload TF-A bl31, kernel, dt and initrd into RAM and run the
> >> FVP. This sidesteps problems with EFI needing low memory, and with the FVP's
> >> block devices needing DMA memory below 44 bits PA. bl31 and dt are appropriately
> >> fixed up for the 2 different memory layouts.
> >>
> >> Given this was designed to test my KVM changes, I was previously running these
> >> without the host_load_addr=high option for the 4k and 16k host kernels (since
> >> this requires your patch set). In this situation there are 132 valid configs and
> >> all of them pass.
> >>
> >> I then rebased my changes on top of yours and added in the host_load_addr=high
> >> option. Now there are 186 valid configs, 64 of which fail. (some of these
> >> failures are regressions). From a quick initial triage, there are 3 failure modes:
> >>
> >>
> >> 1) 18 FAILING TESTS: Host kernel never outputs anything to console
> >>
> >>   TF-A runs successfully, says it is jumping to the kernel, then nothing further
> >>   is seen. I'm pretty confident that the blobs are loaded into memory correctly
> >>   because the same framework is working for the other configs (including 64k
> >>   kernel loaded into high memory). This affects all configs where a host kernel
> >>   with 4k or 16k pages built with LPA2 support is loaded into high memory.
> >>
> >
> > Not sure how to interpret this in combination with your explanation
> > above, but if 'loaded high' means that the kernel itself is not in
> > 48-bit addressable physical memory, this failure is expected.
>
> Sorry - my wording was confusing. host_load_addr=high means what I said at the
> top; the kernel image is loaded at 0x880000000000 in a block of memory sized to
> >> hold the kernel image only (actually it's forward-aligned to 2MB). The dtb and
> initrd are loaded into a 4GB region at 0x8800000000000. The reason I'm doing
> this is to ensure that when I create a VM, the memory used for it (at least the
> vast majority) is coming from the region at 52 bits. I want to do this to prove
> that the stage2 implementation is correctly handling the 52 OA case.
>
> >
> > Given that we have no way of informing the bootloader or firmware
> > whether or not a certain kernel image supports running from such a
> > high offset, it must currently assume it cannot. We've just queued up
> > a documentation fix to clarify this in the boot protocol, i.e., that
> > the kernel must be loaded in 48-bit addressable physical memory.
>
> OK, but I think what I'm doing complies with this. Unless the DTB also has to be
> below 48 bits?
>

Ahh yes, good point. Yes, this is actually implied but we should
clarify this. Or fix it.

But the same reasoning applies: currently, a loader cannot know from
looking at a certain binary whether or not it supports addressing any
memory outside of the 48-bit addressable range, so any asset loading
into physical memory is affected by the same limitation.

This has implications for other firmware aspects as well, i.e., ACPI tables etc.

> >
> > The fact that you had to doctor your boot environment to get around
> > this kind of proves my point, and unless someone is silly enough to
> > ship a SoC that cannot function without this, I don't think we should
> > add this support.
> >
> > I understand how this is an interesting case for completeness from a
> > validation pov, but the reality is that adding support for this would
> > mean introducing changes amounting to dead code to fragile boot
> > sequence code that is already hard to maintain.
>
> I'm not disagreeing. But I think what I'm doing should conform with the
> requirements? (Previously I had the tests set up to just have a single region of
> memory above 52 bits and the kernel image was placed there. That works/worked
> for the 64KB kernel. But I brought the kernel image to below 48 bits to align
> >> with the requirements of this patch set.)
>
> If you see an easier way for me to validate 52 bit OAs in the stage 2 (and
> ideally hyp stage 1), then I'm all ears!
>

There is a Kconfig knob CONFIG_ARM64_FORCE_52BIT which was introduced
for a similar purpose, i.e., to ensure that the 52-bit range gets
utilized. I could imagine adding a similar control for KVM in
particular or for preferring allocations from outside the 48-bit PA
range in general.

> >
> >>
> >> 2) 4 FAILING TESTS: Host kernel gets stuck initializing KVM
> >>
> >>   During kernel boot, last console log is "kvm [1]: vgic interrupt IRQ9". All
> >>   failing tests are configured for protected KVM, and are built with LPA2
> >>   support, running on non-LPA2 HW.
> >>
> >
> > I will try to reproduce this locally.
> >
> >>
> >> 3) 42 FAILING TESTS: Guest kernel never outputs anything to console
> >>
> >>   Host kernel boots fine, and we attempt to launch a guest kernel using kvmtool.
> >>   There is no error reported, but the guest never outputs anything. Haven't
> >>   worked out which config options are common to all failures yet.
> >>
> >
> > This goes a bit beyond what I am currently set up for in terms of
> > testing, but I'm happy to help narrow this down.
> >
> >>
> >> Finally, I removed my code, and ran with your patch set as provided. For this I
> >> hacked up my test suite to boot the host, and ignore booting a guest. I also
> >> didn't bother to vary the KVM mode and just left it in VHE mode. There were 46
> >> valid configs here, of which 4 failed. They were all the same failure mode as
> >> (1) above. Failing configs were:
> >>
> >> id  hw_pa  hw_va  host_page_size  host_pa  host_va  host_load_addr
> >> ------------------------------------------------------------------
> >> 40  lpa    52     4k              52       52       high
> >> 45  lpa    52     16k             52       52       high
> >> 55  lpa2   52     4k              52       52       high
> >> 60  lpa2   52     16k             52       52       high
> >>
> >
> > Same point as above then, I guess.
> >
> >>
> >> So on the balance of probabilities, I think failure mode (1) is very likely to
> >> be due to a bug in your code. (2) and (3) could be my issue or your issue: I
> >> propose to dig into those a bit further and will get back to you on them. I
> >> don't plan to look any further into (1).
> >>
> >
> > Thanks again. (1) is expected, and (2) is something I will investigate further.
>
Ryan Roberts Dec. 1, 2022, 12:22 p.m. UTC | #13
Hi Ard,

I wanted to provide a quick update on the debugging I've been doing from my end.
See below...


On 29/11/2022 16:56, Ard Biesheuvel wrote:
> On Tue, 29 Nov 2022 at 17:36, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 29/11/2022 15:47, Ard Biesheuvel wrote:
>>> On Tue, 29 Nov 2022 at 16:31, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> Hi Ard,
>>>>
>>>> As promised, I ran your patch set through my test set up and have noticed a few
>>>> issues. Sorry it turned into rather a long email...
>>>>
>>>
>>> No worries, and thanks a lot for going through the trouble.
>>>
>>>> First, a quick explanation of the test suite: For all valid combinations of the
>>>> below parameters, boot the host kernel on the FVP, then boot the guest kernel in
>>>> a VM, check that booting succeeds all the way to the guest shell, then power off
>>>> the guest followed by the host to check that shutdown is clean.
>>>>
>>>> Parameters:
>>>>  - hw_pa:               [48, lpa, lpa2]
>>>>  - hw_va:               [48, 52]
>>>>  - kvm_mode:            [vhe, nvhe, protected]
>>>>  - host_page_size:      [4KB, 16KB, 64KB]
>>>>  - host_pa:             [48, 52]
>>>>  - host_va:             [48, 52]
>>>>  - host_load_addr:      [low, high]
>>>>  - guest_page_size:     [64KB]
>>>>  - guest_pa:            [52]
>>>>  - guest_va:            [52]
>>>>  - guest_load_addr:     [low, high]
>>>>
>>>> When *_load_addr is 'low', that means the RAM is below 48 bits in (I)PA space.
>>>> 'high' means the RAM starts at 2048TB for the guest (52 bit PA), and it means
>>>> there are 2 regions for the host; one at 0x880000000000 (48 bit PA) sized to
>>>> hold the kernel image only and another at 0x8800000000000 (52 bit PA) sized at
>>>> 4GB. The FVP only allows RAM at certain locations and having a contiguous region
>>>> cross the 48 bit boundary is not an option. So I chose these values to ensure
>>>> that the linear map size is within 51 bits, which is a requirement for
>>>> nvhe/protected mode kvm.
>>>>
>>>> In all cases, I preload TF-A bl31, kernel, dt and initrd into RAM and run the
>>>> FVP. This sidesteps problems with EFI needing low memory, and with the FVP's
>>>> block devices needing DMA memory below 44 bits PA. bl31 and dt are appropriately
>>>> fixed up for the 2 different memory layouts.
>>>>
>>>> Given this was designed to test my KVM changes, I was previously running these
>>>> without the host_load_addr=high option for the 4k and 16k host kernels (since
>>>> this requires your patch set). In this situation there are 132 valid configs and
>>>> all of them pass.
>>>>
>>>> I then rebased my changes on top of yours and added in the host_load_addr=high
>>>> option. Now there are 186 valid configs, 64 of which fail. (some of these
>>>> failures are regressions). From a quick initial triage, there are 3 failure modes:
>>>>
>>>>
>>>> 1) 18 FAILING TESTS: Host kernel never outputs anything to console
>>>>
>>>>   TF-A runs successfully, says it is jumping to the kernel, then nothing further
>>>>   is seen. I'm pretty confident that the blobs are loaded into memory correctly
>>>>   because the same framework is working for the other configs (including 64k
>>>>   kernel loaded into high memory). This affects all configs where a host kernel
>>>>   with 4k or 16k pages built with LPA2 support is loaded into high memory.
>>>>
>>>
>>> Not sure how to interpret this in combination with your explanation
>>> above, but if 'loaded high' means that the kernel itself is not in
>>> 48-bit addressable physical memory, this failure is expected.
>>
>> Sorry - my wording was confusing. host_load_addr=high means what I said at the
>> top; the kernel image is loaded at 0x880000000000 in a block of memory sized to
>> hold the kernel image only (actually it's forward-aligned to 2MB). The dtb and
>> initrd are loaded into a 4GB region at 0x8800000000000. The reason I'm doing
>> this is to ensure that when I create a VM, the memory used for it (at least the
>> vast majority) is coming from the region at 52 bits. I want to do this to prove
>> that the stage2 implementation is correctly handling the 52 OA case.
>>
>>>
>>> Given that we have no way of informing the bootloader or firmware
>>> whether or not a certain kernel image supports running from such a
>>> high offset, it must currently assume it cannot. We've just queued up
>>> a documentation fix to clarify this in the boot protocol, i.e., that
>>> the kernel must be loaded in 48-bit addressable physical memory.
>>
>> OK, but I think what I'm doing complies with this. Unless the DTB also has to be
>> below 48 bits?
>>
> 
> Ahh yes, good point. Yes, this is actually implied but we should
> clarify this. Or fix it.
> 
> But the same reasoning applies: currently, a loader cannot know from
> looking at a certain binary whether or not it supports addressing any
> memory outside of the 48-bit addressable range, so any asset loading
> into physical memory is affected by the same limitation.
> 
> This has implications for other firmware aspects as well, i.e., ACPI tables etc.
> 
>>>
>>> The fact that you had to doctor your boot environment to get around
>>> this kind of proves my point, and unless someone is silly enough to
>>> ship a SoC that cannot function without this, I don't think we should
>>> add this support.
>>>
>>> I understand how this is an interesting case for completeness from a
>>> validation pov, but the reality is that adding support for this would
>>> mean introducing changes amounting to dead code to fragile boot
>>> sequence code that is already hard to maintain.
>>
>> I'm not disagreeing. But I think what I'm doing should conform with the
>> requirements? (Previously I had the tests set up to just have a single region of
>> memory above 52 bits and the kernel image was placed there. That works/worked
>> for the 64KB kernel. But I brought the kernel image to below 48 bits to align
>> with the requirements of this patch set.)
>>
>> If you see an easier way for me to validate 52 bit OAs in the stage 2 (and
>> ideally hyp stage 1), then I'm all ears!
>>
> 
> There is a Kconfig knob CONFIG_ARM64_FORCE_52BIT which was introduced
> for a similar purpose, i.e., to ensure that the 52-bit range gets
> utilized. I could imagine adding a similar control for KVM in
> particular or for preferring allocations from outside the 48-bit PA
> range in general.

I managed to get these all booting after moving the kernel and dtb to low
memory. I've kept the initrd in high memory for now, which works fine. (I think
the initrd memory will get freed and I didn't want any free low memory floating
around to ensure that kvm guests get high memory). Once booted, some of these
get converted to passes, and others remain failures but for other reasons (see
below).


> 
>>>
>>>>
>>>> 2) 4 FAILING TESTS: Host kernel gets stuck initializing KVM
>>>>
>>>>   During kernel boot, last console log is "kvm [1]: vgic interrupt IRQ9". All
>>>>   failing tests are configured for protected KVM, and are built with LPA2
>>>>   support, running on non-LPA2 HW.
>>>>
>>>
>>> I will try to reproduce this locally.

It turns out the same issue is hit when running your patches without mine on
top. The root cause in both cases is an assumption that kvm_get_parange() makes
that ID_AA64MMFR0_EL1_PARANGE_MAX will always be 48 for 4KB and 16KB PAGE_SIZE.
That is no longer true with your patches. It causes the VTCR_EL2 to be
programmed incorrectly for the host stage2, then on return to the host, bang.

This demonstrates that kernel stage1 support for LPA2 depends on kvm support for
LPA2, since for protected kvm, the host stage2 needs to be able to id map the
full physical range that the host kernel sees prior to deprivilege. So I don't
think it's fixable in your series. I have a fix in my series for this.
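
For reference, this is roughly the shape of the assumption (paraphrased
from kvm_get_parange() and the ID_AA64MMFR0_EL1_PARANGE_MAX definitions;
simplified, names approximate):

#if defined(CONFIG_ARM64_64K_PAGES)
#define ID_AA64MMFR0_EL1_PARANGE_MAX    ID_AA64MMFR0_EL1_PARANGE_52
#else                                   /* 4k and 16k pages */
#define ID_AA64MMFR0_EL1_PARANGE_MAX    ID_AA64MMFR0_EL1_PARANGE_48
#endif

static inline u64 kvm_get_parange(u64 mmfr0)
{
        u64 parange = cpuid_feature_extract_unsigned_field(mmfr0,
                                ID_AA64MMFR0_EL1_PARANGE_SHIFT);

        /* silently clamps PARange 52 to 48 for 4k/16k LPA2 kernels */
        if (parange > ID_AA64MMFR0_EL1_PARANGE_MAX)
                parange = ID_AA64MMFR0_EL1_PARANGE_MAX;

        return parange;
}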

I also found another dependency problem (hit by some of the tests that were
previously failing at the first issue) where kvm uses its page table library to
walk a user space page table created by the kernel (see
get_user_mapping_size()). In the case where the kernel creates an LPA2 page
table, kvm can't walk it without my patches. I've also added a fix for this to
my series.
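
The underlying descriptor format difference (an illustrative sketch, 4k
granule shown): with FEAT_LPA2, output address bits [51:50] live in
descriptor bits [9:8], which encode shareability when TCR.DS == 0, so a
walker that is not LPA2 aware reconstructs the wrong physical address.

#include <stdint.h>

typedef uint64_t u64;

#define GENMASK_ULL(h, l) \
        ((~0ULL << (l)) & (~0ULL >> (63 - (h))))

/* LPA2 layout: PA[49:12] in bits [49:12], PA[51:50] in bits [9:8] */
static u64 pte_to_phys_lpa2(u64 pte)
{
        return (pte & GENMASK_ULL(49, 12)) |
               ((pte & GENMASK_ULL(9, 8)) << 42);
}

/* a pre-LPA2 reading of the same descriptor drops PA[51:50] entirely */
static u64 pte_to_phys_classic(u64 pte)
{
        return pte & GENMASK_ULL(47, 12);
}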


>>>
>>>>
>>>> 3) 42 FAILING TESTS: Guest kernel never outputs anything to console
>>>>
>>>>   Host kernel boots fine, and we attempt to launch a guest kernel using kvmtool.
>>>>   There is no error reported, but the guest never outputs anything. Haven't
>>>>   worked out which config options are common to all failures yet.
>>>>
>>>
>>> This goes a bit beyond what I am currently set up for in terms of
>>> testing, but I'm happy to help narrow this down.

I don't have a root cause for this yet. I'll try to take a look this afternoon.
Will keep you posted.

>>>
>>>>
>>>> Finally, I removed my code, and ran with your patch set as provided. For this I
>>>> hacked up my test suite to boot the host, and ignore booting a guest. I also
>>>> didn't bother to vary the KVM mode and just left it in VHE mode. There were 46
>>>> valid configs here, of which 4 failed. They were all the same failure mode as
>>>> (1) above. Failing configs were:
>>>>
>>>> id  hw_pa  hw_va  host_page_size  host_pa  host_va  host_load_addr
>>>> ------------------------------------------------------------------
>>>> 40  lpa    52     4k              52       52       high
>>>> 45  lpa    52     16k             52       52       high
>>>> 55  lpa2   52     4k              52       52       high
>>>> 60  lpa2   52     16k             52       52       high
>>>>
>>>
>>> Same point as above then, I guess.
>>>
>>>>
>>>> So on the balance of probabilities, I think failure mode (1) is very likely to
>>>> be due to a bug in your code. (2) and (3) could be my issue or your issue: I
>>>> propose to dig into those a bit further and will get back to you on them. I
>>>> don't plan to look any further into (1).
>>>>
>>>
>>> Thanks again. (1) is expected, and (2) is something I will investigate further.
>>

Once I have all the tests passing, I'll post my series, then hopefully we can
move it all forwards as one?

As part of my debugging, I've got a patch to sort out the tlbi code to support
LPA2 properly - I think I raised that comment on one of the patches. Are you
happy for me to post as part of my series?

Thanks,
Ryan
Ard Biesheuvel Dec. 1, 2022, 1:43 p.m. UTC | #14
On Thu, 1 Dec 2022 at 13:22, Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Hi Ard,
>
> I wanted to provide a quick update on the debugging I've been doing from my end.
> See below...
>
>
> On 29/11/2022 16:56, Ard Biesheuvel wrote:
> > On Tue, 29 Nov 2022 at 17:36, Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 29/11/2022 15:47, Ard Biesheuvel wrote:
> >>> On Tue, 29 Nov 2022 at 16:31, Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>
> >>>> Hi Ard,
> >>>>
> >>>> As promised, I ran your patch set through my test set up and have noticed a few
> >>>> issues. Sorry it turned into rather a long email...
> >>>>
> >>>
> >>> No worries, and thanks a lot for going through the trouble.
> >>>
> >>>> First, a quick explanation of the test suite: For all valid combinations of the
> >>>> below parameters, boot the host kernel on the FVP, then boot the guest kernel in
> >>>> a VM, check that booting succeeds all the way to the guest shell, then power off
> >>>> the guest followed by the host to check that shutdown is clean.
> >>>>
> >>>> Parameters:
> >>>>  - hw_pa:               [48, lpa, lpa2]
> >>>>  - hw_va:               [48, 52]
> >>>>  - kvm_mode:            [vhe, nvhe, protected]
> >>>>  - host_page_size:      [4KB, 16KB, 64KB]
> >>>>  - host_pa:             [48, 52]
> >>>>  - host_va:             [48, 52]
> >>>>  - host_load_addr:      [low, high]
> >>>>  - guest_page_size:     [64KB]
> >>>>  - guest_pa:            [52]
> >>>>  - guest_va:            [52]
> >>>>  - guest_load_addr:     [low, high]
> >>>>
> >>>> When *_load_addr is 'low', that means the RAM is below 48 bits in (I)PA space.
> >>>> 'high' means the RAM starts at 2048TB for the guest (52 bit PA), and it means
> >>>> there are 2 regions for the host; one at 0x880000000000 (48 bit PA) sized to
> >>>> hold the kernel image only and another at 0x8800000000000 (52 bit PA) sized at
> >>>> 4GB. The FVP only allows RAM at certain locations and having a contiguous region
> >>>> cross the 48 bit boundary is not an option. So I chose these values to ensure
> >>>> that the linear map size is within 51 bits, which is a requirement for
> >>>> nvhe/protected mode kvm.
> >>>>
> >>>> In all cases, I preload TF-A bl31, kernel, dt and initrd into RAM and run the
> >>>> FVP. This sidesteps problems with EFI needing low memory, and with the FVP's
> >>>> block devices needing DMA memory below 44 bits PA. bl31 and dt are appropriately
> >>>> fixed up for the 2 different memory layouts.
> >>>>
> >>>> Given this was designed to test my KVM changes, I was previously running these
> >>>> without the host_load_addr=high option for the 4k and 16k host kernels (since
> >>>> this requires your patch set). In this situation there are 132 valid configs and
> >>>> all of them pass.
> >>>>
> >>>> I then rebased my changes on top of yours and added in the host_load_addr=high
> >>>> option. Now there are 186 valid configs, 64 of which fail. (some of these
> >>>> failures are regressions). From a quick initial triage, there are 3 failure modes:
> >>>>
> >>>>
> >>>> 1) 18 FAILING TESTS: Host kernel never outputs anything to console
> >>>>
> >>>>   TF-A runs successfully, says it is jumping to the kernel, then nothing further
> >>>>   is seen. I'm pretty confident that the blobs are loaded into memory correctly
> >>>>   because the same framework is working for the other configs (including 64k
> >>>>   kernel loaded into high memory). This affects all configs where a host kernel
> >>>>   with 4k or 16k pages built with LPA2 support is loaded into high memory.
> >>>>
> >>>
> >>> Not sure how to interpret this in combination with your explanation
> >>> above, but if 'loaded high' means that the kernel itself is not in
> >>> 48-bit addressable physical memory, this failure is expected.
> >>
> >> Sorry - my wording was confusing. host_load_addr=high means what I said at the
> >> top; the kernel image is loaded at 0x880000000000 in a block of memory sized to
> >> hold the kernel image only (actually it's forward-aligned to 2MB). The dtb and
> >> initrd are loaded into a 4GB region at 0x8800000000000. The reason I'm doing
> >> this is to ensure that when I create a VM, the memory used for it (at least the
> >> vast majority) is coming from the region at 52 bits. I want to do this to prove
> >> that the stage2 implementation is correctly handling the 52 OA case.
> >>
> >>>
> >>> Given that we have no way of informing the bootloader or firmware
> >>> whether or not a certain kernel image supports running from such a
> >>> high offset, it must currently assume it cannot. We've just queued up
> >>> a documentation fix to clarify this in the boot protocol, i.e., that
> >>> the kernel must be loaded in 48-bit addressable physical memory.
> >>
> >> OK, but I think what I'm doing complies with this. Unless the DTB also has to be
> >> below 48 bits?
> >>
> >
> > Ahh yes, good point. Yes, this is actually implied but we should
> > clarify this. Or fix it.
> >
> > But the same reasoning applies: currently, a loader cannot know from
> > looking at a certain binary whether or not it supports addressing any
> > memory outside of the 48-bit addressable range, so any asset loading
> > into physical memory is affected by the same limitation.
> >
> > This has implications for other firmware aspects as well, i.e., ACPI tables etc.
> >
> >>>
> >>> The fact that you had to doctor your boot environment to get around
> >>> this kind of proves my point, and unless someone is silly enough to
> >>> ship a SoC that cannot function without this, I don't think we should
> >>> add this support.
> >>>
> >>> I understand how this is an interesting case for completeness from a
> >>> validation pov, but the reality is that adding support for this would
> >>> mean introducing changes amounting to dead code to fragile boot
> >>> sequence code that is already hard to maintain.
> >>
> >> I'm not disagreeing. But I think what I'm doing should conform with the
> >> requirements? (Previously I had the tests set up to just have a single region of
> >> memory above 52 bits and the kernel image was placed there. That works/worked
> >> for the 64KB kernel. But I brought the kernel image to below 48 bits to align
> >> with the requirements of this patch set.)
> >>
> >> If you see an easier way for me to validate 52 bit OAs in the stage 2 (and
> >> ideally hyp stage 1), then I'm all ears!
> >>
> >
> > There is a Kconfig knob CONFIG_ARM64_FORCE_52BIT which was introduced
> > for a similar purpose, i.e., to ensure that the 52-bit range gets
> > utilized. I could imagine adding a similar control for KVM in
> > particular or for preferring allocations from outside the 48-bit PA
> > range in general.
>
> I managed to get these all booting after moving the kernel and dtb to low
> memory. I've kept the initrd in high memory for now, which works fine. (I think
> the initrd memory will get freed and I didn't want any free low memory floating
> around to ensure that kvm guests get high memory).

The DT's early mapping is via the ID map, while the initrd is only
mapped much later via the kernel mappings in TTBR1, so this is why it
works.

However, for the boot protocol, this distinction doesn't really
matter: if the boot stack cannot be certain that a certain kernel
image was built to support 52-bit physical addressing, it simply must
never place anything there.

> Once booting, some of these
> get converted to passes, and others remain failures but for other reasons (see
> below).
>
>
> >
> >>>
> >>>>
> >>>> 2) 4 FAILING TESTS: Host kernel gets stuck initializing KVM
> >>>>
> >>>>   During kernel boot, last console log is "kvm [1]: vgic interrupt IRQ9". All
> >>>>   failing tests are configured for protected KVM, and are built with LPA2
> >>>>   support, running on non-LPA2 HW.
> >>>>
> >>>
> >>> I will try to reproduce this locally.
>
> It turns out the same issue is hit when running your patches without mine on
> top. The root cause in both cases is an assumption that kvm_get_parange() makes
> that ID_AA64MMFR0_EL1_PARANGE_MAX will always be 48 for 4KB and 16KB PAGE_SIZE.
> That is no longer true with your patches. It causes the VTCR_EL2 to be
> programmed incorrectly for the host stage2, then on return to the host, bang.
>

Right. I made an attempt at replacing 'u32 level' with 's32 level'
throughout that code, along with some related changes, but I didn't
spot this issue.

> This demonstrates that kernel stage1 support for LPA2 depends on kvm support for
> LPA2, since for protected kvm, the host stage2 needs to be able to id map the
> full physical range that the host kernel sees prior to deprivilege. So I don't
> think it's fixable in your series. I have a fix in my series for this.
>

The reference to ID_AA64MMFR0_EL1_PARANGE_MAX should be fixable in
isolation, no? Even if it results in a KVM that cannot use 52-bit PAs
while the host can.

> I also found another dependency problem (hit by some of the tests that were
> previously failing at the first issue) where kvm uses its page table library to
> walk a user space page table created by the kernel (see
> get_user_mapping_size()). In the case where the kernel creates an LPA2 page
> table, kvm can't walk it without my patches. I've also added a fix for this to
> my series.
>

OK

>
> >>>
> >>>>
> >>>> 3) 42 FAILING TESTS: Guest kernel never outputs anything to console
> >>>>
> >>>>   Host kernel boots fine, and we attempt to launch a guest kernel using kvmtool.
> >>>>   There is no error reported, but the guest never outputs anything. Haven't
> >>>>   worked out which config options are common to all failures yet.
> >>>>
> >>>
> >>> This goes a bit beyond what I am currently set up for in terms of
> >>> testing, but I'm happy to help narrow this down.
>
> I don't have a root cause for this yet. I'll try to take a look this afternoon.
> Will keep you posted.
>

Thanks

> >>>
> >>>>
> >>>> Finally, I removed my code, and ran with your patch set as provided. For this I
> >>>> hacked up my test suite to boot the host, and ignore booting a guest. I also
> >>>> didn't bother to vary the KVM mode and just left it in VHE mode. There were 46
> >>>> valid configs here, of which 4 failed. They were all the same failure mode as
> >>>> (1) above. Failing configs were:
> >>>>
> >>>> id  hw_pa  hw_va  host_page_size  host_pa  host_va  host_load_addr
> >>>> ------------------------------------------------------------------
> >>>> 40  lpa    52     4k              52       52       high
> >>>> 45  lpa    52     16k             52       52       high
> >>>> 55  lpa2   52     4k              52       52       high
> >>>> 60  lpa2   52     16k             52       52       high
> >>>>
> >>>
> >>> Same point as above then, I guess.
> >>>
> >>>>
> >>>> So on the balance of probabilities, I think failure mode (1) is very likely to
> >>>> be due to a bug in your code. (2) and (3) could be my issue or your issue: I
> >>>> propose to dig into those a bit further and will get back to you on them. I
> >>>> don't plan to look any further into (1).
> >>>>
> >>>
> >>> Thanks again. (1) is expected, and (2) is something I will investigate further.
> >>
>
> Once I have all the tests passing, I'll post my series, then hopefully we can
> move it all forwards as one?
>

That would be great, yes, although my work depends on a sizable rework
of the early boot code that has seen very little review as of yet.

So for the time being, let's keep aligned but let's not put any eggs
in each other's baskets :-)

> As part of my debugging, I've got a patch to sort out the tlbi code to support
> LPA2 properly - I think I raised that comment on one of the patches. Are you
> happy for me to post as part of my series?
>

I wasn't sure where to look tbh. The generic 5-level paging stuff
seems to work fine - is this specific to KVM?
Ryan Roberts Dec. 1, 2022, 4 p.m. UTC | #15
On 01/12/2022 13:43, Ard Biesheuvel wrote:
> [...]
>>>>>>
>>>>>> 2) 4 FAILING TESTS: Host kernel gets stuck initializing KVM
>>>>>>
>>>>>>   During kernel boot, last console log is "kvm [1]: vgic interrupt IRQ9". All
>>>>>>   failing tests are configured for protected KVM, and are built with LPA2
>>>>>>   support, running on non-LPA2 HW.
>>>>>>
>>>>>
>>>>> I will try to reproduce this locally.
>>
>> It turns out the same issue is hit when running your patches without mine on
>> top. The root cause in both cases is an assumption that kvm_get_parange() makes
>> that ID_AA64MMFR0_EL1_PARANGE_MAX will always be 48 for 4KB and 16KB PAGE_SIZE.
>> That is no longer true with your patches. It causes the VTCR_EL2 to be
>> programmed incorrectly for the host stage2, then on return to the host, bang.
>>
> 
> Right. I made an attempt at replacing 'u32 level' with 's32 level'
> throughout that code, along with some related changes, but I didn't
> spot this issue.
> 
>> This demonstrates that kernel stage1 support for LPA2 depends on kvm support for
>> LPA2, since for protected kvm, the host stage2 needs to be able to id map the
>> full physical range that the host kernel sees prior to deprivilege. So I don't
>> think it's fixable in your series. I have a fix in my series for this.
>>
> 
> The reference to ID_AA64MMFR0_EL1_PARANGE_MAX should be fixable in
> isolation, no? Even if it results in a KVM that cannot use 52-bit PAs
> while the host can.

Well that doesn't really sound like a good solution to me - the VMM can't
control which physical memory it is using so it would be the luck of the draw as
to whether kvm gets passed memory above 48 PA bits. So you would probably have
to explicitly prevent kvm from initializing if you want to keep your kernel LPA2
patches decoupled from mine.

> 
>> I also found another dependency problem (hit by some of the tests that were
>> previously failing at the first issue) where kvm uses its page table library to
>> walk a user space page table created by the kernel (see
>> get_user_mapping_size()). In the case where the kernel creates an LPA2 page
>> table, kvm can't walk it without my patches. I've also added a fix for this to
>> my series.
>>
> 
> OK

Again, I think the only way to work around this without kvm understanding LPA2
is to disable KVM when the kernel is using LPA2.

> 
>>
>>>>>
>>>>>>
>>>>>> 3) 42 FAILING TESTS: Guest kernel never outputs anything to console
>>>>>>
>>>>>>   Host kernel boots fine, and we attempt to launch a guest kernel using kvmtool.
>>>>>>   There is no error reported, but the guest never outputs anything. Haven't
>>>>>>   worked out which config options are common to all failures yet.
>>>>>>
>>>>>
>>>>> This goes a bit beyond what I am currently set up for in terms of
>>>>> testing, but I'm happy to help narrow this down.
>>
>> I don't have a root cause for this yet. I'll try to take a look this afternoon.
>> Will keep you posted.
>>

OK found it. All these failures are when loading the guest in 'high' memory. My
test environment is allocating a VM with 256M@2048T (so all memory above 48 bits
IPA) and putting the 64KB/52VA/52PA kernel, dtb and initrd there. This works
fine without your changes. I guess as part of your change you have broken the
64KB kernel's ability to create an ID map above 48 bits (which conforms to your
clarification of the boot protocol). I guess this was intentional? It makes it
really hard for me to test >48 bit IA at stage2...

> 
>>
>> Once I have all the tests passing, I'll post my series, then hopefully we can
>> move it all forwards as one?
>>
> 
> That would be great, yes, although my work depends on a sizable rework
> of the early boot code that has seen very little review as of yet.
> 
> So for the time being, let's keep aligned but let's not put any eggs
> in each other's baskets :-)
>
>> As part of my debugging, I've got a patch to sort out the tlbi code to support
>> LPA2 properly - I think I raised that comment on one of the patches. Are you
>> happy for me to post as part of my series?
>>
> 
> I wasn't sure where to look tbh. The generic 5-level paging stuff
> seems to work fine - is this specific to KVM?

There are a couple of issues that I spotted; for the range-based tlbi
instructions, when LPA2 is in use (TCR_EL1.DS=1), BaseADDR must be 64KB aligned
(when LPA2 is disabled it only needs to be page aligned). So there is some
forward alignment required using the non-range tlbi in __flush_tlb_range(). I
think this would manifest as invalidating the wrong entries if left as-is once
your patches are applied.
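
Something along these lines, presumably (a sketch only;
tlbi_vale1is()/tlbi_rvale1is() are hypothetical stand-ins for the
kernel's __tlbi() invocations):

#define SZ_64K          0x10000UL
#define PAGE_SHIFT      12                      /* 4k, for illustration */
#define PAGE_SIZE       (1UL << PAGE_SHIFT)

extern void tlbi_vale1is(unsigned long addr);   /* single-entry op */
extern void tlbi_rvale1is(unsigned long addr, unsigned long pages);

static void flush_tlb_range_lpa2(unsigned long start, unsigned long end)
{
        /* invalidate page by page until BaseADDR is 64KB aligned */
        while (start < end && (start & (SZ_64K - 1))) {
                tlbi_vale1is(start);
                start += PAGE_SIZE;
        }

        /* the range-based instruction is now safe to use */
        if (start < end)
                tlbi_rvale1is(start, (end - start) >> PAGE_SHIFT);
}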

The second problem is that __tlbi_level() uses level 0 as the "don't use level
hint" sentinel. I think this will work just fine (if slightly suboptimal) if
left as is, since the higher layers are never passing anything outside the range
[0, 3] at the moment, but if that changed and -1 was passed then it would cause
a bug.
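
For reference, the current shape of the macro (paraphrased from
arch/arm64/include/asm/tlbflush.h), with one possible fix sketched in: a
distinct sentinel (the TLBI_TTL_UNKNOWN name is hypothetical) so that a
real level 0 no longer collides with the "no hint" case.

#define TLBI_TTL_UNKNOWN        INT_MAX         /* hypothetical sentinel */

#define __tlbi_level(op, addr, level) do {                              \
        u64 arg = addr;                                                 \
                                                                        \
        if (cpus_have_const_cap(ARM64_HAS_ARMV8_4_TTL) &&               \
            (level) != TLBI_TTL_UNKNOWN) {      /* was: 'level' != 0 */ \
                u64 ttl = (level) & 3;                                  \
                ttl |= get_trans_granule() << 2;                        \
                arg &= ~TLBI_TTL_MASK;                                  \
                arg |= FIELD_PREP(TLBI_TTL_MASK, ttl);                  \
        }                                                               \
                                                                        \
        __tlbi(op, arg);                                                \
} while (0)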