Message ID: 20221124123932.2648991-1-ardb@kernel.org (mailing list archive)
Series: arm64: Enable LPA2 support for 4k and 16k pages
Hi Ard,

Thanks for including me on this. I'll plan to do a review over the next week or so, but in the meantime, I have a couple of general questions/comments:

On 24/11/2022 12:39, Ard Biesheuvel wrote:
> Enable support for LPA2 when running with 4k or 16k pages. In the former case, this requires 5 level paging with a runtime fallback to 4 on non-LPA2 hardware. For consistency, the same approach is adopted for 16k pages, where we fall back to 3 level paging (47 bit virtual addressing) on non-LPA2 configurations.

It seems odd to me that if you have a non-LPA2 system and you run a kernel that is compiled for 16KB pages and 48 VA bits, then you will get 48 VA bits. But if you run a kernel that is compiled for 16KB pages and 52 VA bits then you will get 47 VA bits? Wouldn't that pose a potential user space compat issue?

> (Falling back to 48 bits would involve finding a workaround for the fact that we cannot construct a level 0 table covering 52 bits of VA space that appears aligned to its size in memory, and has the top 2 entries that represent the 48-bit region appearing at an alignment of 64 bytes, which is required by the architecture for TTBR address values.

I'm not sure I've understood this. The level 0 table would need 32 entries for 52 VA bits so the table size is 256 bytes, naturally aligned to 256 bytes. 64 is a factor of 256 so surely the top 2 entries are guaranteed to also meet the constraint for the fallback path too?

> Also, using an additional level of paging to translate a single VA bit is wasteful in terms of TLB efficiency)
>
> This means support for falling back to 3 levels of paging at runtime when configured for 4 is also needed.
>
> Another thing worth to note is that the repurposed physical address bits in the page table descriptors were not RES0 before, and so there is now a big global switch (called TCR.DS) which controls how all page table descriptors are interpreted. This requires some extra care in the PTE conversion helpers, and additional handling in the boot code to ensure that we set TCR.DS safely if supported (and not overridden)
>
> Note that this series is mostly orthogonal to work by Anshuman done last year: this series assumes that 52-bit physical addressing is never needed to map the kernel image itself, and therefore that we never need ID map range extension to cover the kernel with a 5th level when running with 4.

This limitation will certainly make it more tricky to test the LPA2 stage2 implementation that I have done. I've got scripts that construct host systems with all the RAM above 48 bits so that the output addresses in the stage2 page tables can be guaranteed to contain OAs > 48 bits. I think the workaround here would be to place the RAM so that it straddles the 48 bit boundary such that the size of RAM below is the same size as the kernel image, and place the kernel image in it. Then this will ensure that the VM's memory still uses the RAM above the threshold. Or is there a simpler approach?

> And given that the LPA2 architectural feature covers both the virtual and physical range extensions, where enabling the latter is required to enable the former, we can simplify things further by only enabling them as a pair. (I.e., 52-bit physical addressing cannot be enabled for 48-bit VA space or smaller)
>
> [...]

Thanks,
Ryan
On Thu, 24 Nov 2022 at 15:39, Ryan Roberts <ryan.roberts@arm.com> wrote: > > Hi Ard, > > Thanks for including me on this. I'll plan to do a review over the next week or > so, but in the meantime, I have a couple of general questions/comments: > > On 24/11/2022 12:39, Ard Biesheuvel wrote: > > Enable support for LPA2 when running with 4k or 16k pages. In the former > > case, this requires 5 level paging with a runtime fallback to 4 on > > non-LPA2 hardware. For consistency, the same approach is adopted for 16k > > pages, where we fall back to 3 level paging (47 bit virtual addressing) > > on non-LPA2 configurations. > > It seems odd to me that if you have a non-LPA2 system, if you run a kernel that > is compiled for 16KB pages and 48 VA bits, then you will get 48 VA bits. But if > you run a kernel that is compiled for 16KB pages and 52 VA bits then you will > get 47 VA bits? Wouldn't that pose a potential user space compat issue? > Well, given that Android happily runs with 39-bit VAs to avoid 4 level paging at all cost, I don't think that is a universal concern. The benefit of this approach is that you can decide at runtime whether you want to take the performance hit of 4 (or 5) level paging to get access to the extended VA space. > > (Falling back to 48 bits would involve > > finding a workaround for the fact that we cannot construct a level 0 > > table covering 52 bits of VA space that appears aligned to its size in > > memory, and has the top 2 entries that represent the 48-bit region > > appearing at an alignment of 64 bytes, which is required by the > > architecture for TTBR address values. > > I'm not sure I've understood this. The level 0 table would need 32 entries for > 52 VA bits so the table size is 256 bytes, naturally aligned to 256 bytes. 64 is > a factor of 256 so surely the top 2 entries are guaranteed to also meet the > constraint for the fallback path too? > The top 2 entries are 16 bytes combined, and end on a 256 byte aligned boundary so I don't see how they can start on a 64 byte aligned boundary at the same time. My RFC had a workaround for this, but it is a bit nasty because we need to copy those two entries at the right time and keep them in sync. > > Also, using an additional level of > > paging to translate a single VA bit is wasteful in terms of TLB > > efficiency) > > > > This means support for falling back to 3 levels of paging at runtime > > when configured for 4 is also needed. > > > > Another thing worth to note is that the repurposed physical address bits > > in the page table descriptors were not RES0 before, and so there is now > > a big global switch (called TCR.DS) which controls how all page table > > descriptors are interpreted. This requires some extra care in the PTE > > conversion helpers, and additional handling in the boot code to ensure > > that we set TCR.DS safely if supported (and not overridden) > > > > Note that this series is mostly orthogonal to work by Anshuman done last > > year: this series assumes that 52-bit physical addressing is never > > needed to map the kernel image itself, and therefore that we never need > > ID map range extension to cover the kernel with a 5th level when running > > with 4. > > This limitation will certainly make it more tricky to test the the LPA2 stage2 > implementation that I have done. I've got scripts that construct host systems > with all the RAM above 48 bits so that the output addresses in the stage2 page > tables can be guaranteed to contain OAs > 48 bits. 
I think the work around here > would be to place the RAM so that it straddles the 48 bit boundary such that the > size of RAM below is the same size as the kernel image, and place the kernel > image in it. Then this will ensure that the VM's memory still uses the RAM above > the threshold. Or is there a simpler approach? > No, that sounds reasonable. I'm using QEMU which happily lets you put the start of DRAM at any address you can imagine (if you recompile it) Another approach could be to simply stick a memblock_reserve() somewhere that covers all 48-bit addressable memory, but you will need some of both in any case. > > And given that the LPA2 architectural feature covers both the > > virtual and physical range extensions, where enabling the latter is > > required to enable the former, we can simplify things further by only > > enabling them as a pair. (I.e., 52-bit physical addressing cannot be > > enabled for 48-bit VA space or smaller) > > > > [...] > > Thanks, > Ryan > >
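To make the alignment arithmetic in the exchange above concrete, here is a small standalone sketch (illustrative only, not code from the series; it uses the 32-entry level 0 geometry described above):

#include <stdio.h>

/*
 * A 32-entry level 0 table is 32 * 8 = 256 bytes, naturally aligned to
 * 256 bytes. Its top 2 entries (the ones covering the 48-bit region)
 * start at byte offset 30 * 8 = 240, which is only 16-byte aligned, so
 * their address cannot serve as a TTBR value (64-byte minimum alignment).
 */
int main(void)
{
	unsigned long entries = 32, entry_size = 8;
	unsigned long top2_off = (entries - 2) * entry_size;

	printf("table size: %lu bytes\n", entries * entry_size);
	printf("top 2 entries at offset %lu; %lu %% 64 = %lu\n",
	       top2_off, top2_off, top2_off % 64);
	return 0;
}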
On 24/11/2022 17:14, Ard Biesheuvel wrote: > On Thu, 24 Nov 2022 at 15:39, Ryan Roberts <ryan.roberts@arm.com> wrote: >> >> Hi Ard, >> >> Thanks for including me on this. I'll plan to do a review over the next week or >> so, but in the meantime, I have a couple of general questions/comments: >> >> On 24/11/2022 12:39, Ard Biesheuvel wrote: >>> Enable support for LPA2 when running with 4k or 16k pages. In the former >>> case, this requires 5 level paging with a runtime fallback to 4 on >>> non-LPA2 hardware. For consistency, the same approach is adopted for 16k >>> pages, where we fall back to 3 level paging (47 bit virtual addressing) >>> on non-LPA2 configurations. >> >> It seems odd to me that if you have a non-LPA2 system, if you run a kernel that >> is compiled for 16KB pages and 48 VA bits, then you will get 48 VA bits. But if >> you run a kernel that is compiled for 16KB pages and 52 VA bits then you will >> get 47 VA bits? Wouldn't that pose a potential user space compat issue? >> > > Well, given that Android happily runs with 39-bit VAs to avoid 4 level > paging at all cost, I don't think that is a universal concern. Well presumably the Android kernel is always explicitly compiled for 39 VA bits so that's what user space is used to? I was really just making the point that if you have (the admittedly exotic and unlikely) case of having a 16KB kernel previously compiled for 48 VA bits, and you "upgrade" it to 52 VA bits now that the option is available, on HW without LPA2, this will actually be observed as a "downgrade" to 47 bits. If you previously wanted to limit to 3 levels of lookup with 16KB you would already have been compiling for 47 VA bits. > > The benefit of this approach is that you can decide at runtime whether > you want to take the performance hit of 4 (or 5) level paging to get > access to the extended VA space. > >>> (Falling back to 48 bits would involve >>> finding a workaround for the fact that we cannot construct a level 0 >>> table covering 52 bits of VA space that appears aligned to its size in >>> memory, and has the top 2 entries that represent the 48-bit region >>> appearing at an alignment of 64 bytes, which is required by the >>> architecture for TTBR address values. >> >> I'm not sure I've understood this. The level 0 table would need 32 entries for >> 52 VA bits so the table size is 256 bytes, naturally aligned to 256 bytes. 64 is >> a factor of 256 so surely the top 2 entries are guaranteed to also meet the >> constraint for the fallback path too? >> > > The top 2 entries are 16 bytes combined, and end on a 256 byte aligned > boundary so I don't see how they can start on a 64 byte aligned > boundary at the same time. I'm still not following; why would the 2 entry/16 byte table *end* on a 256 byte boundary? I guess I should go and read your patch before making assumptions, but my assumption from your description here was that you were optimistically allocating a 32 entry/256 byte table for the 52 VA bit case, then needing to reuse that table for the 2 entry/16 byte case if HW turns out not to support LPA2. In which case, surely the 2 entry table would be overlayed at the start (low address) of the allocated 32 entry table, and therefore its alignment is 256 bytes, which meets the HW's 64 byte alignment requirement? > > My RFC had a workaround for this, but it is a bit nasty because we > need to copy those two entries at the right time and keep them in > sync. 
> >>> Also, using an additional level of >>> paging to translate a single VA bit is wasteful in terms of TLB >>> efficiency) >>> >>> This means support for falling back to 3 levels of paging at runtime >>> when configured for 4 is also needed. >>> >>> Another thing worth to note is that the repurposed physical address bits >>> in the page table descriptors were not RES0 before, and so there is now >>> a big global switch (called TCR.DS) which controls how all page table >>> descriptors are interpreted. This requires some extra care in the PTE >>> conversion helpers, and additional handling in the boot code to ensure >>> that we set TCR.DS safely if supported (and not overridden) >>> >>> Note that this series is mostly orthogonal to work by Anshuman done last >>> year: this series assumes that 52-bit physical addressing is never >>> needed to map the kernel image itself, and therefore that we never need >>> ID map range extension to cover the kernel with a 5th level when running >>> with 4. >> >> This limitation will certainly make it more tricky to test the the LPA2 stage2 >> implementation that I have done. I've got scripts that construct host systems >> with all the RAM above 48 bits so that the output addresses in the stage2 page >> tables can be guaranteed to contain OAs > 48 bits. I think the work around here >> would be to place the RAM so that it straddles the 48 bit boundary such that the >> size of RAM below is the same size as the kernel image, and place the kernel >> image in it. Then this will ensure that the VM's memory still uses the RAM above >> the threshold. Or is there a simpler approach? >> > > No, that sounds reasonable. I'm using QEMU which happily lets you put > the start of DRAM at any address you can imagine (if you recompile it) I'm running on FVP, which will let me do this with runtime parameters. Anyway, I'll update my tests to cope with this constraint and run this patch set through, and I'll let you know if it spots anything. > > Another approach could be to simply stick a memblock_reserve() > somewhere that covers all 48-bit addressable memory, but you will need > some of both in any case. > >>> And given that the LPA2 architectural feature covers both the >>> virtual and physical range extensions, where enabling the latter is >>> required to enable the former, we can simplify things further by only >>> enabling them as a pair. (I.e., 52-bit physical addressing cannot be >>> enabled for 48-bit VA space or smaller) >>> >>> [...] >> >> Thanks, >> Ryan >> >>
On Fri, 25 Nov 2022 at 10:23, Ryan Roberts <ryan.roberts@arm.com> wrote: > > On 24/11/2022 17:14, Ard Biesheuvel wrote: > > On Thu, 24 Nov 2022 at 15:39, Ryan Roberts <ryan.roberts@arm.com> wrote: > >> > >> Hi Ard, > >> > >> Thanks for including me on this. I'll plan to do a review over the next week or > >> so, but in the meantime, I have a couple of general questions/comments: > >> > >> On 24/11/2022 12:39, Ard Biesheuvel wrote: > >>> Enable support for LPA2 when running with 4k or 16k pages. In the former > >>> case, this requires 5 level paging with a runtime fallback to 4 on > >>> non-LPA2 hardware. For consistency, the same approach is adopted for 16k > >>> pages, where we fall back to 3 level paging (47 bit virtual addressing) > >>> on non-LPA2 configurations. > >> > >> It seems odd to me that if you have a non-LPA2 system, if you run a kernel that > >> is compiled for 16KB pages and 48 VA bits, then you will get 48 VA bits. But if > >> you run a kernel that is compiled for 16KB pages and 52 VA bits then you will > >> get 47 VA bits? Wouldn't that pose a potential user space compat issue? > >> > > > > Well, given that Android happily runs with 39-bit VAs to avoid 4 level > > paging at all cost, I don't think that is a universal concern. > > Well presumably the Android kernel is always explicitly compiled for 39 VA bits > so that's what user space is used to? I was really just making the point that if > you have (the admittedly exotic and unlikely) case of having a 16KB kernel > previously compiled for 48 VA bits, and you "upgrade" it to 52 VA bits now that > the option is available, on HW without LPA2, this will actually be observed as a > "downgrade" to 47 bits. If you previously wanted to limit to 3 levels of lookup > with 16KB you would already have been compiling for 47 VA bits. > I am not debating that. I'm just saying that, without any hardware in existence, it is difficult to predict which of these concerns is going to dominate, and so I opted for the least messy and most symmetrical approach. > > > > The benefit of this approach is that you can decide at runtime whether > > you want to take the performance hit of 4 (or 5) level paging to get > > access to the extended VA space. > > > >>> (Falling back to 48 bits would involve > >>> finding a workaround for the fact that we cannot construct a level 0 > >>> table covering 52 bits of VA space that appears aligned to its size in > >>> memory, and has the top 2 entries that represent the 48-bit region > >>> appearing at an alignment of 64 bytes, which is required by the > >>> architecture for TTBR address values. > >> > >> I'm not sure I've understood this. The level 0 table would need 32 entries for > >> 52 VA bits so the table size is 256 bytes, naturally aligned to 256 bytes. 64 is > >> a factor of 256 so surely the top 2 entries are guaranteed to also meet the > >> constraint for the fallback path too? > >> > > > > The top 2 entries are 16 bytes combined, and end on a 256 byte aligned > > boundary so I don't see how they can start on a 64 byte aligned > > boundary at the same time. > > I'm still not following; why would the 2 entry/16 byte table *end* on a 256 byte > boundary? I guess I should go and read your patch before making assumptions, but > my assumption from your description here was that you were optimistically > allocating a 32 entry/256 byte table for the 52 VA bit case, then needing to > reuse that table for the 2 entry/16 byte case if HW turns out not to support > LPA2. 
In which case, surely the 2 entry table would be overlayed at the start > (low address) of the allocated 32 entry table, and therefore its alignment is > 256 bytes, which meets the HW's 64 byte alignment requirement? > No, it's at the end, that is the point. I am specifically referring to TTBR1 upper region page tables here. Please refer to the existing ttbr1_offset asm macro, which implements this today for 64k pages + LVA. In this case, however, the condensed table covers 6 bits of translation so it is naturally aligned to the TTBR minimum alignment. > > > > My RFC had a workaround for this, but it is a bit nasty because we > > need to copy those two entries at the right time and keep them in > > sync. > > > >>> Also, using an additional level of > >>> paging to translate a single VA bit is wasteful in terms of TLB > >>> efficiency) > >>> > >>> This means support for falling back to 3 levels of paging at runtime > >>> when configured for 4 is also needed. > >>> > >>> Another thing worth to note is that the repurposed physical address bits > >>> in the page table descriptors were not RES0 before, and so there is now > >>> a big global switch (called TCR.DS) which controls how all page table > >>> descriptors are interpreted. This requires some extra care in the PTE > >>> conversion helpers, and additional handling in the boot code to ensure > >>> that we set TCR.DS safely if supported (and not overridden) > >>> > >>> Note that this series is mostly orthogonal to work by Anshuman done last > >>> year: this series assumes that 52-bit physical addressing is never > >>> needed to map the kernel image itself, and therefore that we never need > >>> ID map range extension to cover the kernel with a 5th level when running > >>> with 4. > >> > >> This limitation will certainly make it more tricky to test the the LPA2 stage2 > >> implementation that I have done. I've got scripts that construct host systems > >> with all the RAM above 48 bits so that the output addresses in the stage2 page > >> tables can be guaranteed to contain OAs > 48 bits. I think the work around here > >> would be to place the RAM so that it straddles the 48 bit boundary such that the > >> size of RAM below is the same size as the kernel image, and place the kernel > >> image in it. Then this will ensure that the VM's memory still uses the RAM above > >> the threshold. Or is there a simpler approach? > >> > > > > No, that sounds reasonable. I'm using QEMU which happily lets you put > > the start of DRAM at any address you can imagine (if you recompile it) > > I'm running on FVP, which will let me do this with runtime parameters. Anyway, > I'll update my tests to cope with this constraint and run this patch set > through, and I'll let you know if it spots anything. Excellent, thanks. > > > > Another approach could be to simply stick a memblock_reserve() > > somewhere that covers all 48-bit addressable memory, but you will need > > some of both in any case. > > > >>> And given that the LPA2 architectural feature covers both the > >>> virtual and physical range extensions, where enabling the latter is > >>> required to enable the former, we can simplify things further by only > >>> enabling them as a pair. (I.e., 52-bit physical addressing cannot be > >>> enabled for 48-bit VA space or smaller) > >>> > >>> [...] > >> > >> Thanks, > >> Ryan > >> > >> >
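For contrast, the existing 64k + LVA case Ard refers to does work out naturally; a sketch with entry counts derived from the granule geometry (not taken from the kernel sources):

#include <stdio.h>

/*
 * With 64k pages the top-level table holds 1024 entries for 52 VA bits,
 * and the condensed 48-bit table covers 6 bits of translation: 64
 * entries, 512 bytes. A table must be aligned to its own size to be a
 * valid TTBR target, and here the offsets line up.
 */
int main(void)
{
	unsigned long off_64k = (1024 - 64) * 8;	/* 7680 bytes into the table */
	unsigned long size_48 = 64 * 8;			/* 512-byte condensed table */

	printf("64k+LVA:  offset %lu %% %lu = %lu (naturally aligned)\n",
	       off_64k, size_48, off_64k % size_48);
	printf("16k+LPA2: offset 240 %% 64 = %d (misaligned)\n", 240 % 64);
	return 0;
}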
On 25/11/2022 09:35, Ard Biesheuvel wrote: > On Fri, 25 Nov 2022 at 10:23, Ryan Roberts <ryan.roberts@arm.com> wrote: >> >> On 24/11/2022 17:14, Ard Biesheuvel wrote: >>> On Thu, 24 Nov 2022 at 15:39, Ryan Roberts <ryan.roberts@arm.com> wrote: >>>> >>>> Hi Ard, >>>> >>>> Thanks for including me on this. I'll plan to do a review over the next week or >>>> so, but in the meantime, I have a couple of general questions/comments: >>>> >>>> On 24/11/2022 12:39, Ard Biesheuvel wrote: >>>>> Enable support for LPA2 when running with 4k or 16k pages. In the former >>>>> case, this requires 5 level paging with a runtime fallback to 4 on >>>>> non-LPA2 hardware. For consistency, the same approach is adopted for 16k >>>>> pages, where we fall back to 3 level paging (47 bit virtual addressing) >>>>> on non-LPA2 configurations. >>>> >>>> It seems odd to me that if you have a non-LPA2 system, if you run a kernel that >>>> is compiled for 16KB pages and 48 VA bits, then you will get 48 VA bits. But if >>>> you run a kernel that is compiled for 16KB pages and 52 VA bits then you will >>>> get 47 VA bits? Wouldn't that pose a potential user space compat issue? >>>> >>> >>> Well, given that Android happily runs with 39-bit VAs to avoid 4 level >>> paging at all cost, I don't think that is a universal concern. >> >> Well presumably the Android kernel is always explicitly compiled for 39 VA bits >> so that's what user space is used to? I was really just making the point that if >> you have (the admittedly exotic and unlikely) case of having a 16KB kernel >> previously compiled for 48 VA bits, and you "upgrade" it to 52 VA bits now that >> the option is available, on HW without LPA2, this will actually be observed as a >> "downgrade" to 47 bits. If you previously wanted to limit to 3 levels of lookup >> with 16KB you would already have been compiling for 47 VA bits. >> > > I am not debating that. I'm just saying that, without any hardware in > existence, it is difficult to predict which of these concerns is going > to dominate, and so I opted for the least messy and most symmetrical > approach. OK fair enough. My opinion is logged ;-). > >>> >>> The benefit of this approach is that you can decide at runtime whether >>> you want to take the performance hit of 4 (or 5) level paging to get >>> access to the extended VA space. >>> >>>>> (Falling back to 48 bits would involve >>>>> finding a workaround for the fact that we cannot construct a level 0 >>>>> table covering 52 bits of VA space that appears aligned to its size in >>>>> memory, and has the top 2 entries that represent the 48-bit region >>>>> appearing at an alignment of 64 bytes, which is required by the >>>>> architecture for TTBR address values. >>>> >>>> I'm not sure I've understood this. The level 0 table would need 32 entries for >>>> 52 VA bits so the table size is 256 bytes, naturally aligned to 256 bytes. 64 is >>>> a factor of 256 so surely the top 2 entries are guaranteed to also meet the >>>> constraint for the fallback path too? >>>> >>> >>> The top 2 entries are 16 bytes combined, and end on a 256 byte aligned >>> boundary so I don't see how they can start on a 64 byte aligned >>> boundary at the same time. >> >> I'm still not following; why would the 2 entry/16 byte table *end* on a 256 byte >> boundary? 
I guess I should go and read your patch before making assumptions, but >> my assumption from your description here was that you were optimistically >> allocating a 32 entry/256 byte table for the 52 VA bit case, then needing to >> reuse that table for the 2 entry/16 byte case if HW turns out not to support >> LPA2. In which case, surely the 2 entry table would be overlayed at the start >> (low address) of the allocated 32 entry table, and therefore its alignment is >> 256 bytes, which meets the HW's 64 byte alignment requirement? >> > > No, it's at the end, that is the point. I am specifically referring to > TTBR1 upper region page tables here. > > Please refer to the existing ttbr1_offset asm macro, which implements > this today for 64k pages + LVA. In this case, however, the condensed > table covers 6 bits of translation so it is naturally aligned to the > TTBR minimum alignment. Afraid I don't see any such ttbr1_offset macro, either in upstream or the branch you posted. The best I can find is TTBR1_OFFSET in arm arch, which I'm guessing isn't it. I'm keen to understand this better if you can point me to the right location? Regardless, the Arm ARM states this for TTBR1_EL1.BADDR: """ BADDR[47:1], bits [47:1] Translation table base address: • Bits A[47:x] of the stage 1 translation table base address bits are in register bits[47:x]. • Bits A[(x-1):0] of the stage 1 translation table base address are zero. Address bit x is the minimum address bit required to align the translation table to the size of the table. The smallest permitted value of x is 6. The AArch64 Virtual Memory System Architecture chapter describes how x is calculated based on the value of TCR_EL1.T1SZ, the translation stage, and the translation granule size. Note A translation table is required to be aligned to the size of the table. If a table contains fewer than eight entries, it must be aligned on a 64 byte address boundary. """ I don't see how that is referring to the alignment of the *end* of the table? > >>> >>> My RFC had a workaround for this, but it is a bit nasty because we >>> need to copy those two entries at the right time and keep them in >>> sync. >>> >>>>> Also, using an additional level of >>>>> paging to translate a single VA bit is wasteful in terms of TLB >>>>> efficiency) >>>>> >>>>> This means support for falling back to 3 levels of paging at runtime >>>>> when configured for 4 is also needed. >>>>> >>>>> Another thing worth to note is that the repurposed physical address bits >>>>> in the page table descriptors were not RES0 before, and so there is now >>>>> a big global switch (called TCR.DS) which controls how all page table >>>>> descriptors are interpreted. This requires some extra care in the PTE >>>>> conversion helpers, and additional handling in the boot code to ensure >>>>> that we set TCR.DS safely if supported (and not overridden) >>>>> >>>>> Note that this series is mostly orthogonal to work by Anshuman done last >>>>> year: this series assumes that 52-bit physical addressing is never >>>>> needed to map the kernel image itself, and therefore that we never need >>>>> ID map range extension to cover the kernel with a 5th level when running >>>>> with 4. >>>> >>>> This limitation will certainly make it more tricky to test the the LPA2 stage2 >>>> implementation that I have done. I've got scripts that construct host systems >>>> with all the RAM above 48 bits so that the output addresses in the stage2 page >>>> tables can be guaranteed to contain OAs > 48 bits. 
I think the work around here >>>> would be to place the RAM so that it straddles the 48 bit boundary such that the >>>> size of RAM below is the same size as the kernel image, and place the kernel >>>> image in it. Then this will ensure that the VM's memory still uses the RAM above >>>> the threshold. Or is there a simpler approach? >>>> >>> >>> No, that sounds reasonable. I'm using QEMU which happily lets you put >>> the start of DRAM at any address you can imagine (if you recompile it) >> >> I'm running on FVP, which will let me do this with runtime parameters. Anyway, >> I'll update my tests to cope with this constraint and run this patch set >> through, and I'll let you know if it spots anything. > > Excellent, thanks. > >>> >>> Another approach could be to simply stick a memblock_reserve() >>> somewhere that covers all 48-bit addressable memory, but you will need >>> some of both in any case. >>> >>>>> And given that the LPA2 architectural feature covers both the >>>>> virtual and physical range extensions, where enabling the latter is >>>>> required to enable the former, we can simplify things further by only >>>>> enabling them as a pair. (I.e., 52-bit physical addressing cannot be >>>>> enabled for 48-bit VA space or smaller) >>>>> >>>>> [...] >>>> >>>> Thanks, >>>> Ryan >>>> >>>> >>
On Fri, 25 Nov 2022 at 11:07, Ryan Roberts <ryan.roberts@arm.com> wrote: > > On 25/11/2022 09:35, Ard Biesheuvel wrote: > > On Fri, 25 Nov 2022 at 10:23, Ryan Roberts <ryan.roberts@arm.com> wrote: > >> > >> On 24/11/2022 17:14, Ard Biesheuvel wrote: > >>> On Thu, 24 Nov 2022 at 15:39, Ryan Roberts <ryan.roberts@arm.com> wrote: > >>>> > >>>> Hi Ard, > >>>> > >>>> Thanks for including me on this. I'll plan to do a review over the next week or > >>>> so, but in the meantime, I have a couple of general questions/comments: > >>>> > >>>> On 24/11/2022 12:39, Ard Biesheuvel wrote: > >>>>> Enable support for LPA2 when running with 4k or 16k pages. In the former > >>>>> case, this requires 5 level paging with a runtime fallback to 4 on > >>>>> non-LPA2 hardware. For consistency, the same approach is adopted for 16k > >>>>> pages, where we fall back to 3 level paging (47 bit virtual addressing) > >>>>> on non-LPA2 configurations. > >>>> > >>>> It seems odd to me that if you have a non-LPA2 system, if you run a kernel that > >>>> is compiled for 16KB pages and 48 VA bits, then you will get 48 VA bits. But if > >>>> you run a kernel that is compiled for 16KB pages and 52 VA bits then you will > >>>> get 47 VA bits? Wouldn't that pose a potential user space compat issue? > >>>> > >>> > >>> Well, given that Android happily runs with 39-bit VAs to avoid 4 level > >>> paging at all cost, I don't think that is a universal concern. > >> > >> Well presumably the Android kernel is always explicitly compiled for 39 VA bits > >> so that's what user space is used to? I was really just making the point that if > >> you have (the admittedly exotic and unlikely) case of having a 16KB kernel > >> previously compiled for 48 VA bits, and you "upgrade" it to 52 VA bits now that > >> the option is available, on HW without LPA2, this will actually be observed as a > >> "downgrade" to 47 bits. If you previously wanted to limit to 3 levels of lookup > >> with 16KB you would already have been compiling for 47 VA bits. > >> > > > > I am not debating that. I'm just saying that, without any hardware in > > existence, it is difficult to predict which of these concerns is going > > to dominate, and so I opted for the least messy and most symmetrical > > approach. > > OK fair enough. My opinion is logged ;-). > > > > >>> > >>> The benefit of this approach is that you can decide at runtime whether > >>> you want to take the performance hit of 4 (or 5) level paging to get > >>> access to the extended VA space. > >>> > >>>>> (Falling back to 48 bits would involve > >>>>> finding a workaround for the fact that we cannot construct a level 0 > >>>>> table covering 52 bits of VA space that appears aligned to its size in > >>>>> memory, and has the top 2 entries that represent the 48-bit region > >>>>> appearing at an alignment of 64 bytes, which is required by the > >>>>> architecture for TTBR address values. > >>>> > >>>> I'm not sure I've understood this. The level 0 table would need 32 entries for > >>>> 52 VA bits so the table size is 256 bytes, naturally aligned to 256 bytes. 64 is > >>>> a factor of 256 so surely the top 2 entries are guaranteed to also meet the > >>>> constraint for the fallback path too? > >>>> > >>> > >>> The top 2 entries are 16 bytes combined, and end on a 256 byte aligned > >>> boundary so I don't see how they can start on a 64 byte aligned > >>> boundary at the same time. > >> > >> I'm still not following; why would the 2 entry/16 byte table *end* on a 256 byte > >> boundary? 
> >> I guess I should go and read your patch before making assumptions, but my assumption from your description here was that you were optimistically allocating a 32 entry/256 byte table for the 52 VA bit case, then needing to reuse that table for the 2 entry/16 byte case if HW turns out not to support LPA2. In which case, surely the 2 entry table would be overlayed at the start (low address) of the allocated 32 entry table, and therefore its alignment is 256 bytes, which meets the HW's 64 byte alignment requirement?
> >>
> > No, it's at the end, that is the point. I am specifically referring to TTBR1 upper region page tables here.
> >
> > Please refer to the existing ttbr1_offset asm macro, which implements this today for 64k pages + LVA. In this case, however, the condensed table covers 6 bits of translation so it is naturally aligned to the TTBR minimum alignment.

> Afraid I don't see any such ttbr1_offset macro, either in upstream or the branch you posted. The best I can find is TTBR1_OFFSET in arm arch, which I'm guessing isn't it. I'm keen to understand this better if you can point me to the right location?

Apologies, I got the name wrong

/*
 * Offset ttbr1 to allow for 48-bit kernel VAs set with 52-bit PTRS_PER_PGD.
 * orr is used as it can cover the immediate value (and is idempotent).
 * In future this may be nop'ed out when dealing with 52-bit kernel VAs.
 * ttbr: Value of ttbr to set, modified.
 */
.macro offset_ttbr1, ttbr, tmp
#ifdef CONFIG_ARM64_VA_BITS_52
	mrs_s	\tmp, SYS_ID_AA64MMFR2_EL1
	and	\tmp, \tmp, #(0xf << ID_AA64MMFR2_EL1_VARange_SHIFT)
	cbnz	\tmp, .Lskipoffs_\@
	orr	\ttbr, \ttbr, #TTBR1_BADDR_4852_OFFSET
.Lskipoffs_\@ :
#endif
.endm

> Regardless, the Arm ARM states this for TTBR1_EL1.BADDR:
>
> """
> BADDR[47:1], bits [47:1]
>
> Translation table base address:
> • Bits A[47:x] of the stage 1 translation table base address bits are in register bits[47:x].
> • Bits A[(x-1):0] of the stage 1 translation table base address are zero.
>
> Address bit x is the minimum address bit required to align the translation table to the size of the table. The smallest permitted value of x is 6. The AArch64 Virtual Memory System Architecture chapter describes how x is calculated based on the value of TCR_EL1.T1SZ, the translation stage, and the translation granule size.
>
> Note
> A translation table is required to be aligned to the size of the table. If a table contains fewer than eight entries, it must be aligned on a 64 byte address boundary.
> """
>
> I don't see how that is referring to the alignment of the *end* of the table?

It refers to the address poked into the register.

When you create a 256 byte aligned 32 entry 52-bit level 0 table for 16k pages, entry #0 covers the start of the 52-bit addressable VA space, and entry #30 covers the start of the 48-bit addressable VA space.

When LPA2 is not supported, the walk must start at entry #30 so that is where TTBR1_EL1 should point, but doing so violates the architecture's alignment requirement.

So what we might do is double the size of the table, and clone entries #30 and #31 to positions #32 and #33, for instance (and remember to keep them in sync, which is not that hard)
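A rough standalone sketch of the doubled-table workaround Ard outlines here — the helper names, the have_lpa2 flag, and the doubled-allocation convention are hypothetical, for illustration only, not code from the series or the earlier RFC:

#include <stdint.h>
#include <string.h>

typedef uint64_t pgd_t;

/*
 * 16k granule, 52-bit VA: 32 level 0 entries. The allocation is assumed
 * to be doubled to 64 entries and 256-byte aligned, so that index 32
 * (byte offset 256) is a legal, 64-byte aligned TTBR target.
 */
#define PTRS_PER_PGD	32
#define IDX_48BIT	30	/* first entry covering the 48-bit region */
#define IDX_CLONE	32	/* clone position just past the real table */

/* Mirror the two 48-bit-region entries; call whenever #30/#31 change. */
static void sync_48bit_clone(pgd_t *pgd)
{
	memcpy(&pgd[IDX_CLONE], &pgd[IDX_48BIT], 2 * sizeof(pgd_t));
}

/* LPA2 hardware walks from entry #0; otherwise point TTBR1 at the clone. */
static uintptr_t ttbr1_baddr(pgd_t *pgd, int have_lpa2)
{
	return have_lpa2 ? (uintptr_t)pgd : (uintptr_t)&pgd[IDX_CLONE];
}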
On 25/11/2022 10:36, Ard Biesheuvel wrote: > On Fri, 25 Nov 2022 at 11:07, Ryan Roberts <ryan.roberts@arm.com> wrote: >> >> On 25/11/2022 09:35, Ard Biesheuvel wrote: >>> On Fri, 25 Nov 2022 at 10:23, Ryan Roberts <ryan.roberts@arm.com> wrote: >>>> >>>> On 24/11/2022 17:14, Ard Biesheuvel wrote: >>>>> On Thu, 24 Nov 2022 at 15:39, Ryan Roberts <ryan.roberts@arm.com> wrote: >>>>>> >>>>>> Hi Ard, >>>>>> >>>>>> Thanks for including me on this. I'll plan to do a review over the next week or >>>>>> so, but in the meantime, I have a couple of general questions/comments: >>>>>> >>>>>> On 24/11/2022 12:39, Ard Biesheuvel wrote: >>>>>>> Enable support for LPA2 when running with 4k or 16k pages. In the former >>>>>>> case, this requires 5 level paging with a runtime fallback to 4 on >>>>>>> non-LPA2 hardware. For consistency, the same approach is adopted for 16k >>>>>>> pages, where we fall back to 3 level paging (47 bit virtual addressing) >>>>>>> on non-LPA2 configurations. >>>>>> >>>>>> It seems odd to me that if you have a non-LPA2 system, if you run a kernel that >>>>>> is compiled for 16KB pages and 48 VA bits, then you will get 48 VA bits. But if >>>>>> you run a kernel that is compiled for 16KB pages and 52 VA bits then you will >>>>>> get 47 VA bits? Wouldn't that pose a potential user space compat issue? >>>>>> >>>>> >>>>> Well, given that Android happily runs with 39-bit VAs to avoid 4 level >>>>> paging at all cost, I don't think that is a universal concern. >>>> >>>> Well presumably the Android kernel is always explicitly compiled for 39 VA bits >>>> so that's what user space is used to? I was really just making the point that if >>>> you have (the admittedly exotic and unlikely) case of having a 16KB kernel >>>> previously compiled for 48 VA bits, and you "upgrade" it to 52 VA bits now that >>>> the option is available, on HW without LPA2, this will actually be observed as a >>>> "downgrade" to 47 bits. If you previously wanted to limit to 3 levels of lookup >>>> with 16KB you would already have been compiling for 47 VA bits. >>>> >>> >>> I am not debating that. I'm just saying that, without any hardware in >>> existence, it is difficult to predict which of these concerns is going >>> to dominate, and so I opted for the least messy and most symmetrical >>> approach. >> >> OK fair enough. My opinion is logged ;-). >> >>> >>>>> >>>>> The benefit of this approach is that you can decide at runtime whether >>>>> you want to take the performance hit of 4 (or 5) level paging to get >>>>> access to the extended VA space. >>>>> >>>>>>> (Falling back to 48 bits would involve >>>>>>> finding a workaround for the fact that we cannot construct a level 0 >>>>>>> table covering 52 bits of VA space that appears aligned to its size in >>>>>>> memory, and has the top 2 entries that represent the 48-bit region >>>>>>> appearing at an alignment of 64 bytes, which is required by the >>>>>>> architecture for TTBR address values. >>>>>> >>>>>> I'm not sure I've understood this. The level 0 table would need 32 entries for >>>>>> 52 VA bits so the table size is 256 bytes, naturally aligned to 256 bytes. 64 is >>>>>> a factor of 256 so surely the top 2 entries are guaranteed to also meet the >>>>>> constraint for the fallback path too? >>>>>> >>>>> >>>>> The top 2 entries are 16 bytes combined, and end on a 256 byte aligned >>>>> boundary so I don't see how they can start on a 64 byte aligned >>>>> boundary at the same time. 
>>>> >>>> I'm still not following; why would the 2 entry/16 byte table *end* on a 256 byte >>>> boundary? I guess I should go and read your patch before making assumptions, but >>>> my assumption from your description here was that you were optimistically >>>> allocating a 32 entry/256 byte table for the 52 VA bit case, then needing to >>>> reuse that table for the 2 entry/16 byte case if HW turns out not to support >>>> LPA2. In which case, surely the 2 entry table would be overlayed at the start >>>> (low address) of the allocated 32 entry table, and therefore its alignment is >>>> 256 bytes, which meets the HW's 64 byte alignment requirement? >>>> >>> >>> No, it's at the end, that is the point. I am specifically referring to >>> TTBR1 upper region page tables here. >>> >>> Please refer to the existing ttbr1_offset asm macro, which implements >>> this today for 64k pages + LVA. In this case, however, the condensed >>> table covers 6 bits of translation so it is naturally aligned to the >>> TTBR minimum alignment. >> >> Afraid I don't see any such ttbr1_offset macro, either in upstream or the branch >> you posted. The best I can find is TTBR1_OFFSET in arm arch, which I'm guessing >> isn't it. I'm keen to understand this better if you can point me to the right >> location? >> > > Apologies, I got the name wrong > > /* > * Offset ttbr1 to allow for 48-bit kernel VAs set with 52-bit PTRS_PER_PGD. > * orr is used as it can cover the immediate value (and is idempotent). > * In future this may be nop'ed out when dealing with 52-bit kernel VAs. > * ttbr: Value of ttbr to set, modified. > */ > .macro offset_ttbr1, ttbr, tmp > #ifdef CONFIG_ARM64_VA_BITS_52 > mrs_s \tmp, SYS_ID_AA64MMFR2_EL1 > and \tmp, \tmp, #(0xf << ID_AA64MMFR2_EL1_VARange_SHIFT) > cbnz \tmp, .Lskipoffs_\@ > orr \ttbr, \ttbr, #TTBR1_BADDR_4852_OFFSET > .Lskipoffs_\@ : > #endif > .endm > >> Regardless, the Arm ARM states this for TTBR1_EL1.BADDR: >> >> """ >> BADDR[47:1], bits [47:1] >> >> Translation table base address: >> • Bits A[47:x] of the stage 1 translation table base address bits are in >> register bits[47:x]. >> • Bits A[(x-1):0] of the stage 1 translation table base address are zero. >> >> Address bit x is the minimum address bit required to align the translation table >> to the size of the table. The smallest permitted value of x is 6. The AArch64 >> Virtual Memory System Architecture chapter describes how x is calculated based >> on the value of TCR_EL1.T1SZ, the translation stage, and the translation granule >> size. >> >> Note >> A translation table is required to be aligned to the size of the table. If a >> table contains fewer than eight entries, it must be aligned on a 64 byte address >> boundary. >> """ >> >> I don't see how that is referring to the alignment of the *end* of the table? >> > > It refers to the address poked into the register > > When you create a 256 byte aligned 32 entry 52-bit level 0 table for > 16k pages, entry #0 covers the start of the 52-bit addressable VA > space, and entry #30 covers the start of the 48-bit addressable VA > space. > > When LPA2 is not supported, the walk must start at entry #30 so that > is where TTBR1_EL1 should point, but doing so violates the > architecture's alignment requirement. > > So what we might do is double the size of the table, and clone entries > #30 and #31 to positions #32 and #33, for instance (and remember to > keep them in sync, which is not that hard) OK I get it now - thanks for explaining. 
I think the key part that I didn't appreciate is that the table is always allocated and populated as if we have 52 VA bits even if we only have 48; you then just fix up TTBR1 to point part way down the table in the 48-bit case.
On Fri, 25 Nov 2022 at 15:13, Ryan Roberts <ryan.roberts@arm.com> wrote: > > On 25/11/2022 10:36, Ard Biesheuvel wrote: > > On Fri, 25 Nov 2022 at 11:07, Ryan Roberts <ryan.roberts@arm.com> wrote: > >> > >> On 25/11/2022 09:35, Ard Biesheuvel wrote: > >>> On Fri, 25 Nov 2022 at 10:23, Ryan Roberts <ryan.roberts@arm.com> wrote: > >>>> > >>>> On 24/11/2022 17:14, Ard Biesheuvel wrote: > >>>>> On Thu, 24 Nov 2022 at 15:39, Ryan Roberts <ryan.roberts@arm.com> wrote: > >>>>>> > >>>>>> Hi Ard, > >>>>>> > >>>>>> Thanks for including me on this. I'll plan to do a review over the next week or > >>>>>> so, but in the meantime, I have a couple of general questions/comments: > >>>>>> > >>>>>> On 24/11/2022 12:39, Ard Biesheuvel wrote: > >>>>>>> Enable support for LPA2 when running with 4k or 16k pages. In the former > >>>>>>> case, this requires 5 level paging with a runtime fallback to 4 on > >>>>>>> non-LPA2 hardware. For consistency, the same approach is adopted for 16k > >>>>>>> pages, where we fall back to 3 level paging (47 bit virtual addressing) > >>>>>>> on non-LPA2 configurations. > >>>>>> > >>>>>> It seems odd to me that if you have a non-LPA2 system, if you run a kernel that > >>>>>> is compiled for 16KB pages and 48 VA bits, then you will get 48 VA bits. But if > >>>>>> you run a kernel that is compiled for 16KB pages and 52 VA bits then you will > >>>>>> get 47 VA bits? Wouldn't that pose a potential user space compat issue? > >>>>>> > >>>>> > >>>>> Well, given that Android happily runs with 39-bit VAs to avoid 4 level > >>>>> paging at all cost, I don't think that is a universal concern. > >>>> > >>>> Well presumably the Android kernel is always explicitly compiled for 39 VA bits > >>>> so that's what user space is used to? I was really just making the point that if > >>>> you have (the admittedly exotic and unlikely) case of having a 16KB kernel > >>>> previously compiled for 48 VA bits, and you "upgrade" it to 52 VA bits now that > >>>> the option is available, on HW without LPA2, this will actually be observed as a > >>>> "downgrade" to 47 bits. If you previously wanted to limit to 3 levels of lookup > >>>> with 16KB you would already have been compiling for 47 VA bits. > >>>> > >>> > >>> I am not debating that. I'm just saying that, without any hardware in > >>> existence, it is difficult to predict which of these concerns is going > >>> to dominate, and so I opted for the least messy and most symmetrical > >>> approach. > >> > >> OK fair enough. My opinion is logged ;-). > >> > >>> > >>>>> > >>>>> The benefit of this approach is that you can decide at runtime whether > >>>>> you want to take the performance hit of 4 (or 5) level paging to get > >>>>> access to the extended VA space. > >>>>> > >>>>>>> (Falling back to 48 bits would involve > >>>>>>> finding a workaround for the fact that we cannot construct a level 0 > >>>>>>> table covering 52 bits of VA space that appears aligned to its size in > >>>>>>> memory, and has the top 2 entries that represent the 48-bit region > >>>>>>> appearing at an alignment of 64 bytes, which is required by the > >>>>>>> architecture for TTBR address values. > >>>>>> > >>>>>> I'm not sure I've understood this. The level 0 table would need 32 entries for > >>>>>> 52 VA bits so the table size is 256 bytes, naturally aligned to 256 bytes. 64 is > >>>>>> a factor of 256 so surely the top 2 entries are guaranteed to also meet the > >>>>>> constraint for the fallback path too? 
> >>>>>> > >>>>> > >>>>> The top 2 entries are 16 bytes combined, and end on a 256 byte aligned > >>>>> boundary so I don't see how they can start on a 64 byte aligned > >>>>> boundary at the same time. > >>>> > >>>> I'm still not following; why would the 2 entry/16 byte table *end* on a 256 byte > >>>> boundary? I guess I should go and read your patch before making assumptions, but > >>>> my assumption from your description here was that you were optimistically > >>>> allocating a 32 entry/256 byte table for the 52 VA bit case, then needing to > >>>> reuse that table for the 2 entry/16 byte case if HW turns out not to support > >>>> LPA2. In which case, surely the 2 entry table would be overlayed at the start > >>>> (low address) of the allocated 32 entry table, and therefore its alignment is > >>>> 256 bytes, which meets the HW's 64 byte alignment requirement? > >>>> > >>> > >>> No, it's at the end, that is the point. I am specifically referring to > >>> TTBR1 upper region page tables here. > >>> > >>> Please refer to the existing ttbr1_offset asm macro, which implements > >>> this today for 64k pages + LVA. In this case, however, the condensed > >>> table covers 6 bits of translation so it is naturally aligned to the > >>> TTBR minimum alignment. > >> > >> Afraid I don't see any such ttbr1_offset macro, either in upstream or the branch > >> you posted. The best I can find is TTBR1_OFFSET in arm arch, which I'm guessing > >> isn't it. I'm keen to understand this better if you can point me to the right > >> location? > >> > > > > Apologies, I got the name wrong > > > > /* > > * Offset ttbr1 to allow for 48-bit kernel VAs set with 52-bit PTRS_PER_PGD. > > * orr is used as it can cover the immediate value (and is idempotent). > > * In future this may be nop'ed out when dealing with 52-bit kernel VAs. > > * ttbr: Value of ttbr to set, modified. > > */ > > .macro offset_ttbr1, ttbr, tmp > > #ifdef CONFIG_ARM64_VA_BITS_52 > > mrs_s \tmp, SYS_ID_AA64MMFR2_EL1 > > and \tmp, \tmp, #(0xf << ID_AA64MMFR2_EL1_VARange_SHIFT) > > cbnz \tmp, .Lskipoffs_\@ > > orr \ttbr, \ttbr, #TTBR1_BADDR_4852_OFFSET > > .Lskipoffs_\@ : > > #endif > > .endm > > > >> Regardless, the Arm ARM states this for TTBR1_EL1.BADDR: > >> > >> """ > >> BADDR[47:1], bits [47:1] > >> > >> Translation table base address: > >> • Bits A[47:x] of the stage 1 translation table base address bits are in > >> register bits[47:x]. > >> • Bits A[(x-1):0] of the stage 1 translation table base address are zero. > >> > >> Address bit x is the minimum address bit required to align the translation table > >> to the size of the table. The smallest permitted value of x is 6. The AArch64 > >> Virtual Memory System Architecture chapter describes how x is calculated based > >> on the value of TCR_EL1.T1SZ, the translation stage, and the translation granule > >> size. > >> > >> Note > >> A translation table is required to be aligned to the size of the table. If a > >> table contains fewer than eight entries, it must be aligned on a 64 byte address > >> boundary. > >> """ > >> > >> I don't see how that is referring to the alignment of the *end* of the table? > >> > > > > It refers to the address poked into the register > > > > When you create a 256 byte aligned 32 entry 52-bit level 0 table for > > 16k pages, entry #0 covers the start of the 52-bit addressable VA > > space, and entry #30 covers the start of the 48-bit addressable VA > > space. 
> > When LPA2 is not supported, the walk must start at entry #30 so that is where TTBR1_EL1 should point, but doing so violates the architecture's alignment requirement.
> >
> > So what we might do is double the size of the table, and clone entries #30 and #31 to positions #32 and #33, for instance (and remember to keep them in sync, which is not that hard)
>
> OK I get it now - thanks for explaining. I think the key part that I didn't appreciate is that the table is always allocated and populated as if we have 52 VA bits even if we only have 48; you then just fix up TTBR1 to point part way down the table in the 48-bit case.

Exactly. The entire address space is represented in software as if we support 52 bits of virtual addressing, and we simply don't map anything outside of the 48-bit addressable area if the hardware does not support it. That way, handling the LVA/LPA2 feature remains an arch-specific implementation detail, and all the generic software page table walking code can be none the wiser. And for user space page tables and the ID map, we can simply use the same root table address - only kernel page tables need the additional hack here.
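A minimal sketch of the idea just described, using the 64k + LVA geometry that the offset_ttbr1 macro already handles upstream (the function name is hypothetical; the constants follow from the granule geometry, not from the patches):

#include <stdint.h>

#define VA_BITS		52
#define VA_BITS_MIN	48
#define PGDIR_SHIFT	42	/* 64k granule: top level resolves bits [51:42] */

/*
 * The kernel page tables are always sized and populated for 52 VA bits;
 * on hardware limited to 48 bits, nothing is mapped outside the 48-bit
 * window and the walk simply starts part way down the same table.
 */
static uintptr_t ttbr1_fixup(uintptr_t pgd_phys, unsigned int hw_va_bits)
{
	if (hw_va_bits == VA_BITS)
		return pgd_phys;		/* LVA: walk the whole table */

	/* index of the first entry covering the 48-bit region: 1024 - 64 = 960 */
	unsigned long idx = (1UL << (VA_BITS - PGDIR_SHIFT)) -
			    (1UL << (VA_BITS_MIN - PGDIR_SHIFT));

	return pgd_phys + idx * sizeof(uint64_t);	/* + 7680 bytes */
}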
Hi Ard,

As promised, I ran your patch set through my test setup and have noticed a few issues. Sorry it turned into rather a long email...

First, a quick explanation of the test suite: For all valid combinations of the below parameters, boot the host kernel on the FVP, then boot the guest kernel in a VM, check that booting succeeds all the way to the guest shell, then power off the guest followed by the host to check that shutdown is clean.

Parameters:
- hw_pa: [48, lpa, lpa2]
- hw_va: [48, 52]
- kvm_mode: [vhe, nvhe, protected]
- host_page_size: [4KB, 16KB, 64KB]
- host_pa: [48, 52]
- host_va: [48, 52]
- host_load_addr: [low, high]
- guest_page_size: [64KB]
- guest_pa: [52]
- guest_va: [52]
- guest_load_addr: [low, high]

When *_load_addr is 'low', that means the RAM is below 48 bits in (I)PA space. 'high' means the RAM starts at 2048TB for the guest (52 bit PA), and it means there are 2 regions for the host; one at 0x880000000000 (48 bit PA) sized to hold the kernel image only and another at 0x8800000000000 (52 bit PA) sized at 4GB. The FVP only allows RAM at certain locations and having a contiguous region cross the 48 bit boundary is not an option. So I chose these values to ensure that the linear map size is within 51 bits, which is a requirement for nvhe/protected mode KVM.

In all cases, I preload TF-A bl31, kernel, dt and initrd into RAM and run the FVP. This sidesteps problems with EFI needing low memory, and with the FVP's block devices needing DMA memory below 44 bits PA. bl31 and dt are appropriately fixed up for the 2 different memory layouts.

Given this was designed to test my KVM changes, I was previously running these without the host_load_addr=high option for the 4k and 16k host kernels (since this requires your patch set). In this situation there are 132 valid configs and all of them pass.

I then rebased my changes on top of yours and added in the host_load_addr=high option. Now there are 186 valid configs, 64 of which fail (some of these failures are regressions). From a quick initial triage, there are 3 failure modes:

1) 18 FAILING TESTS: Host kernel never outputs anything to console

TF-A runs successfully, says it is jumping to the kernel, then nothing further is seen. I'm pretty confident that the blobs are loaded into memory correctly because the same framework is working for the other configs (including the 64k kernel loaded into high memory). This affects all configs where a host kernel with 4k or 16k pages built with LPA2 support is loaded into high memory.

2) 4 FAILING TESTS: Host kernel gets stuck initializing KVM

During kernel boot, the last console log is "kvm [1]: vgic interrupt IRQ9". All failing tests are configured for protected KVM, and are built with LPA2 support, running on non-LPA2 HW.

3) 42 FAILING TESTS: Guest kernel never outputs anything to console

The host kernel boots fine, and we attempt to launch a guest kernel using kvmtool. There is no error reported, but the guest never outputs anything. I haven't worked out which config options are common to all failures yet.

Finally, I removed my code, and ran with your patch set as provided. For this I hacked up my test suite to boot the host and ignore booting a guest. I also didn't bother to vary the KVM mode and just left it in VHE mode. There were 46 valid configs here, of which 4 failed. They were all the same failure mode as (1) above.
Failing configs were:

id  hw_pa  hw_va  host_page_size  host_pa  host_va  host_load_addr
------------------------------------------------------------------
40  lpa    52     4k              52       52       high
45  lpa    52     16k             52       52       high
55  lpa2   52     4k              52       52       high
60  lpa2   52     16k             52       52       high

So on the balance of probabilities, I think failure mode (1) is very likely to be due to a bug in your code. (2) and (3) could be my issue or your issue: I propose to dig into those a bit further and will get back to you on them. I don't plan to look any further into (1).

Thanks,
Ryan

On 24/11/2022 12:39, Ard Biesheuvel wrote:
> Enable support for LPA2 when running with 4k or 16k pages. In the former case, this requires 5 level paging with a runtime fallback to 4 on non-LPA2 hardware. For consistency, the same approach is adopted for 16k pages, where we fall back to 3 level paging (47 bit virtual addressing) on non-LPA2 configurations. (Falling back to 48 bits would involve finding a workaround for the fact that we cannot construct a level 0 table covering 52 bits of VA space that appears aligned to its size in memory, and has the top 2 entries that represent the 48-bit region appearing at an alignment of 64 bytes, which is required by the architecture for TTBR address values. Also, using an additional level of paging to translate a single VA bit is wasteful in terms of TLB efficiency)
>
> This means support for falling back to 3 levels of paging at runtime when configured for 4 is also needed.
>
> Another thing worth to note is that the repurposed physical address bits in the page table descriptors were not RES0 before, and so there is now a big global switch (called TCR.DS) which controls how all page table descriptors are interpreted. This requires some extra care in the PTE conversion helpers, and additional handling in the boot code to ensure that we set TCR.DS safely if supported (and not overridden)
>
> Note that this series is mostly orthogonal to work by Anshuman done last year: this series assumes that 52-bit physical addressing is never needed to map the kernel image itself, and therefore that we never need ID map range extension to cover the kernel with a 5th level when running with 4. And given that the LPA2 architectural feature covers both the virtual and physical range extensions, where enabling the latter is required to enable the former, we can simplify things further by only enabling them as a pair. (I.e., 52-bit physical addressing cannot be enabled for 48-bit VA space or smaller)
>
> This series applies onto some of my previous work that is still in flight, so these patches will not apply in isolation. Complete branch can be found here:
> https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=arm64-4k-lpa2
>
> It supersedes the RFC v1 I sent out last week, which covered 16k pages only. It also supersedes some related work I sent out in isolation before:
>
> [PATCH] arm64: mm: Enable KASAN for 16k/48-bit VA configurations
> [PATCH 0/3] arm64: mm: Model LVA support as a CPU feature
>
> Tested on QEMU with -cpu max and lpa2 both off and on, as well as using the arm64.nolva override kernel command line parameter. Note that this requires a QEMU built from the latest sources.
> > Cc: Marc Zyngier <maz@kernel.org> > Cc: Will Deacon <will@kernel.org> > Cc: Mark Rutland <mark.rutland@arm.com> > Cc: Kees Cook <keescook@chromium.org> > Cc: Catalin Marinas <catalin.marinas@arm.com> > Cc: Mark Brown <broonie@kernel.org> > Cc: Anshuman Khandual <anshuman.khandual@arm.com> > Cc: Richard Henderson <richard.henderson@linaro.org> > Cc: Ryan Roberts <ryan.roberts@arm.com> > > Anshuman Khandual (3): > arm64/mm: Simplify and document pte_to_phys() for 52 bit addresses > arm64/mm: Add FEAT_LPA2 specific TCR_EL1.DS field > arm64/mm: Add FEAT_LPA2 specific ID_AA64MMFR0.TGRAN[2] > > Ard Biesheuvel (16): > arm64: kaslr: Adjust randomization range dynamically > arm64: mm: get rid of kimage_vaddr global variable > arm64: head: remove order argument from early mapping routine > arm64: mm: Handle LVA support as a CPU feature > arm64: mm: Deal with potential ID map extension if VA_BITS > > VA_BITS_MIN > arm64: mm: Add feature override support for LVA > arm64: mm: Wire up TCR.DS bit to PTE shareability fields > arm64: mm: Add LPA2 support to phys<->pte conversion routines > arm64: mm: Add definitions to support 5 levels of paging > arm64: mm: add 5 level paging support to G-to-nG conversion routine > arm64: Enable LPA2 at boot if supported by the system > arm64: mm: Add 5 level paging support to fixmap and swapper handling > arm64: kasan: Reduce minimum shadow alignment and enable 5 level > paging > arm64: mm: Add support for folding PUDs at runtime > arm64: ptdump: Disregard unaddressable VA space > arm64: Enable 52-bit virtual addressing for 4k and 16k granule configs > > arch/arm64/Kconfig | 23 ++- > arch/arm64/include/asm/assembler.h | 42 ++--- > arch/arm64/include/asm/cpufeature.h | 2 + > arch/arm64/include/asm/fixmap.h | 1 + > arch/arm64/include/asm/kernel-pgtable.h | 27 ++- > arch/arm64/include/asm/memory.h | 23 ++- > arch/arm64/include/asm/pgalloc.h | 53 +++++- > arch/arm64/include/asm/pgtable-hwdef.h | 34 +++- > arch/arm64/include/asm/pgtable-prot.h | 18 +- > arch/arm64/include/asm/pgtable-types.h | 6 + > arch/arm64/include/asm/pgtable.h | 197 ++++++++++++++++++-- > arch/arm64/include/asm/sysreg.h | 2 + > arch/arm64/include/asm/tlb.h | 3 +- > arch/arm64/kernel/cpufeature.c | 46 ++++- > arch/arm64/kernel/head.S | 99 +++++----- > arch/arm64/kernel/image-vars.h | 4 + > arch/arm64/kernel/pi/idreg-override.c | 29 ++- > arch/arm64/kernel/pi/kaslr_early.c | 23 ++- > arch/arm64/kernel/pi/map_kernel.c | 115 +++++++++++- > arch/arm64/kernel/sleep.S | 3 - > arch/arm64/mm/init.c | 2 +- > arch/arm64/mm/kasan_init.c | 124 ++++++++++-- > arch/arm64/mm/mmap.c | 4 + > arch/arm64/mm/mmu.c | 138 ++++++++++---- > arch/arm64/mm/pgd.c | 17 +- > arch/arm64/mm/proc.S | 76 +++++++- > arch/arm64/mm/ptdump.c | 4 +- > arch/arm64/tools/cpucaps | 1 + > 28 files changed, 907 insertions(+), 209 deletions(-) >
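The cover letter's note about the PTE conversion helpers is worth unpacking: with TCR.DS == 1, the output-address bits that no longer fit in the traditional field are carried in descriptor bits [9:8], which previously held the shareability attribute. Below is a minimal sketch of what the conversion then has to do, assuming 4k pages; the SKETCH_* names are invented for illustration and are not the macros from the series:

#include <stdint.h>

/*
 * Sketch only: with TCR.DS == 1 (4k pages assumed), descriptor bits
 * [49:12] hold OA[49:12] and bits [9:8] -- formerly the shareability
 * field -- hold OA[51:50], so a physical address spans two bitfields.
 */
#define SKETCH_PAGE_SHIFT	12
#define SKETCH_PTE_ADDR_LOW	((((uint64_t)1 << (50 - SKETCH_PAGE_SHIFT)) - 1) << SKETCH_PAGE_SHIFT)
#define SKETCH_PTE_ADDR_HIGH	((uint64_t)3 << 8)	/* carries OA[51:50] */
#define SKETCH_HIGH_SHIFT	42			/* bit 8 -> bit 50 */

static inline uint64_t sketch_pte_to_phys(uint64_t pte)
{
	return (pte & SKETCH_PTE_ADDR_LOW) |
	       ((pte & SKETCH_PTE_ADDR_HIGH) << SKETCH_HIGH_SHIFT);
}

static inline uint64_t sketch_phys_to_pte(uint64_t phys)
{
	return (phys & SKETCH_PTE_ADDR_LOW) |
	       ((phys >> SKETCH_HIGH_SHIFT) & SKETCH_PTE_ADDR_HIGH);
}

Because the split only applies when TCR.DS is set, any code that hard-codes the pre-LPA2 descriptor layout will misparse LPA2 descriptors, which is the root of the KVM page table walking problem discussed later in the thread.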
On Tue, 29 Nov 2022 at 16:31, Ryan Roberts <ryan.roberts@arm.com> wrote: > > Hi Ard, > > As promised, I ran your patch set through my test set up and have noticed a few > issues. Sorry it turned into rather a long email... > No worries, and thanks a lot for going through the trouble. > First, a quick explanation of the test suite: For all valid combinations of the > below parameters, boot the host kernel on the FVP, then boot the guest kernel in > a VM, check that booting succeeds all the way to the guest shell then poweroff > guest followed by host can check shutdown is clean. > > Parameters: > - hw_pa: [48, lpa, lpa2] > - hw_va: [48, 52] > - kvm_mode: [vhe, nvhe, protected] > - host_page_size: [4KB, 16KB, 64KB] > - host_pa: [48, 52] > - host_va: [48, 52] > - host_load_addr: [low, high] > - guest_page_size: [64KB] > - guest_pa: [52] > - guest_va: [52] > - guest_load_addr: [low, high] > > When *_load_addr is 'low', that means the RAM is below 48 bits in (I)PA space. > 'high' means the RAM starts at 2048TB for the guest (52 bit PA), and it means > there are 2 regions for the host; one at 0x880000000000 (48 bit PA) sized to > hold the kernel image only and another at 0x8800000000000 (52 bit PA) sized at > 4GB. The FVP only allows RAM at certain locations and having a contiguous region > cross the 48 bit boundary is not an option. So I chose these values to ensure > that the linear map size is within 51 bits, which is a requirement for > nhve/protected mode kvm. > > In all cases, I preload TF-A bl31, kernel, dt and initrd into RAM and run the > FVP. This sidesteps problems with EFI needing low memory, and with the FVP's > block devices needing DMA memory below 44 bits PA. bl31 and dt are appropriately > fixed up for the 2 different memory layouts. > > Given this was designed to test my KVM changes, I was previously running these > without the host_load_addr=high option for the 4k and 16k host kernels (since > this requires your patch set). In this situation there are 132 valid configs and > all of them pass. > > I then rebased my changes on top of yours and added in the host_load_addr=high > option. Now there are 186 valid configs, 64 of which fail. (some of these > failures are regressions). From a quick initial triage, there are 3 failure modes: > > > 1) 18 FAILING TESTS: Host kernel never outputs anything to console > > TF-A runs successfully, says it is jumping to the kernel, then nothing further > is seen. I'm pretty confident that the blobs are loaded into memory correctly > because the same framework is working for the other configs (including 64k > kernel loaded into high memory). This affects all configs where a host kernel > with 4k or 16k pages built with LPA2 support is loaded into high memory. > Not sure how to interpret this in combination with your explanation above, but if 'loaded high' means that the kernel itself is not in 48-bit addressable physical memory, this failure is expected. Given that we have no way of informing the bootloader or firmware whether or not a certain kernel image supports running from such a high offset, it must currently assume it cannot. We've just queued up a documentation fix to clarify this in the boot protocol, i.e., that the kernel must be loaded in 48-bit addressable physical memory. The fact that you had to doctor your boot environment to get around this kind of proves my point, and unless someone is silly enough to ship a SoC that cannot function without this, I don't think we should add this support. 
I understand how this is an interesting case for completeness from a validation pov, but the reality is that adding support for this would mean introducing changes amounting to dead code to fragile boot sequence code that is already hard to maintain. > > 2) 4 FAILING TESTS: Host kernel gets stuck initializing KVM > > During kernel boot, last console log is "kvm [1]: vgic interrupt IRQ9". All > failing tests are configured for protected KVM, and are build with LPA2 > support, running on non-LPA2 HW. > I will try to reproduce this locally. > > 3) 42 FAILING TESTS: Guest kernel never outputs anything to console > > Host kernel boots fine, and we attempt to launch a guest kernel using kvmtool. > There is no error reported, but the guest never outputs anything. Haven't > worked out which config options are common to all failures yet. > This goes a bit beyond what I am currently set up for in terms of testing, but I'm happy to help narrow this down. > > Finally, I removed my code, and ran with your patch set as provided. For this I > hacked up my test suite to boot the host, and ignore booting a guest. I also > didn't bother to vary the KVM mode and just left it in VHE mode. There were 46 > valid configs here, of which 4 failed. They were all the same failure mode as > (1) above. Failing configs were: > > id hw_pa hw_va host_page_size host_pa host_va host_load_addr > ------------------------------------------------------------------ > 40 lpa 52 4k 52 52 high > 45 lpa 52 16k 52 52 high > 55 lpa2 52 4k 52 52 high > 60 lpa2 52 16k 52 52 high > Same point as above then, I guess. > > So on the balance of probabilities, I think failure mode (1) is very likely to > be due to a bug in your code. (2) and (3) could be my issue or your issue: I > propose to dig into those a bit further and will get back to you on them. I > don't plan to look any further into (1). > Thanks again. (1) is expected, and (2) is something I will investigate further.
On 29/11/2022 15:47, Ard Biesheuvel wrote: > On Tue, 29 Nov 2022 at 16:31, Ryan Roberts <ryan.roberts@arm.com> wrote: >> >> Hi Ard, >> >> As promised, I ran your patch set through my test set up and have noticed a few >> issues. Sorry it turned into rather a long email... >> > > No worries, and thanks a lot for going through the trouble. > >> First, a quick explanation of the test suite: For all valid combinations of the >> below parameters, boot the host kernel on the FVP, then boot the guest kernel in >> a VM, check that booting succeeds all the way to the guest shell then poweroff >> guest followed by host can check shutdown is clean. >> >> Parameters: >> - hw_pa: [48, lpa, lpa2] >> - hw_va: [48, 52] >> - kvm_mode: [vhe, nvhe, protected] >> - host_page_size: [4KB, 16KB, 64KB] >> - host_pa: [48, 52] >> - host_va: [48, 52] >> - host_load_addr: [low, high] >> - guest_page_size: [64KB] >> - guest_pa: [52] >> - guest_va: [52] >> - guest_load_addr: [low, high] >> >> When *_load_addr is 'low', that means the RAM is below 48 bits in (I)PA space. >> 'high' means the RAM starts at 2048TB for the guest (52 bit PA), and it means >> there are 2 regions for the host; one at 0x880000000000 (48 bit PA) sized to >> hold the kernel image only and another at 0x8800000000000 (52 bit PA) sized at >> 4GB. The FVP only allows RAM at certain locations and having a contiguous region >> cross the 48 bit boundary is not an option. So I chose these values to ensure >> that the linear map size is within 51 bits, which is a requirement for >> nhve/protected mode kvm. >> >> In all cases, I preload TF-A bl31, kernel, dt and initrd into RAM and run the >> FVP. This sidesteps problems with EFI needing low memory, and with the FVP's >> block devices needing DMA memory below 44 bits PA. bl31 and dt are appropriately >> fixed up for the 2 different memory layouts. >> >> Given this was designed to test my KVM changes, I was previously running these >> without the host_load_addr=high option for the 4k and 16k host kernels (since >> this requires your patch set). In this situation there are 132 valid configs and >> all of them pass. >> >> I then rebased my changes on top of yours and added in the host_load_addr=high >> option. Now there are 186 valid configs, 64 of which fail. (some of these >> failures are regressions). From a quick initial triage, there are 3 failure modes: >> >> >> 1) 18 FAILING TESTS: Host kernel never outputs anything to console >> >> TF-A runs successfully, says it is jumping to the kernel, then nothing further >> is seen. I'm pretty confident that the blobs are loaded into memory correctly >> because the same framework is working for the other configs (including 64k >> kernel loaded into high memory). This affects all configs where a host kernel >> with 4k or 16k pages built with LPA2 support is loaded into high memory. >> > > Not sure how to interpret this in combination with your explanation > above, but if 'loaded high' means that the kernel itself is not in > 48-bit addressable physical memory, this failure is expected. Sorry - my wording was confusing. host_load_addr=high means what I said at the top; the kernel image is loaded at 0x880000000000 in a block of memory sized to hold the kernel image only (actually it's forward-aligned to 2MB). The dtb and initrd are loaded into a 4GB region at 0x8800000000000. The reason I'm doing this is to ensure that when I create a VM, the memory used for it (at least the vast majority) is coming from the region at 52 bits.
I want to do this to prove that the stage2 implementation is correctly handling the 52-bit OA case. > > Given that we have no way of informing the bootloader or firmware > whether or not a certain kernel image supports running from such a > high offset, it must currently assume it cannot. We've just queued up > a documentation fix to clarify this in the boot protocol, i.e., that > the kernel must be loaded in 48-bit addressable physical memory. OK, but I think what I'm doing complies with this. Unless the DTB also has to be below 48 bits? > > The fact that you had to doctor your boot environment to get around > this kind of proves my point, and unless someone is silly enough to > ship a SoC that cannot function without this, I don't think we should > add this support. > > I understand how this is an interesting case for completeness from a > validation pov, but the reality is that adding support for this would > mean introducing changes amounting to dead code to fragile boot > sequence code that is already hard to maintain. I'm not disagreeing. But I think what I'm doing should conform with the requirements? (Previously I had the tests set up to just have a single region of memory above 52 bits and the kernel image was placed there. That works/worked for the 64KB kernel. But I brought the kernel image to below 48 bits to align with the requirements of this patch set.) If you see an easier way for me to validate 52 bit OAs in the stage 2 (and ideally hyp stage 1), then I'm all ears! > >> >> 2) 4 FAILING TESTS: Host kernel gets stuck initializing KVM >> >> During kernel boot, last console log is "kvm [1]: vgic interrupt IRQ9". All >> failing tests are configured for protected KVM, and are build with LPA2 >> support, running on non-LPA2 HW. >> > > I will try to reproduce this locally. > >> >> 3) 42 FAILING TESTS: Guest kernel never outputs anything to console >> >> Host kernel boots fine, and we attempt to launch a guest kernel using kvmtool. >> There is no error reported, but the guest never outputs anything. Haven't >> worked out which config options are common to all failures yet. >> > > This goes a bit beyond what I am currently set up for in terms of > testing, but I'm happy to help narrow this down. > >> >> Finally, I removed my code, and ran with your patch set as provided. For this I >> hacked up my test suite to boot the host, and ignore booting a guest. I also >> didn't bother to vary the KVM mode and just left it in VHE mode. There were 46 >> valid configs here, of which 4 failed. They were all the same failure mode as >> (1) above. Failing configs were: >> >> id hw_pa hw_va host_page_size host_pa host_va host_load_addr >> ------------------------------------------------------------------ >> 40 lpa 52 4k 52 52 high >> 45 lpa 52 16k 52 52 high >> 55 lpa2 52 4k 52 52 high >> 60 lpa2 52 16k 52 52 high >> > > Same point as above then, I guess. > >> >> So on the balance of probabilities, I think failure mode (1) is very likely to >> be due to a bug in your code. (2) and (3) could be my issue or your issue: I >> propose to dig into those a bit further and will get back to you on them. I >> don't plan to look any further into (1). >> > > Thanks again. (1) is expected, and (2) is something I will investigate further.
On Tue, 29 Nov 2022 at 17:36, Ryan Roberts <ryan.roberts@arm.com> wrote: > > On 29/11/2022 15:47, Ard Biesheuvel wrote: > > On Tue, 29 Nov 2022 at 16:31, Ryan Roberts <ryan.roberts@arm.com> wrote: > >> > >> Hi Ard, > >> > >> As promised, I ran your patch set through my test set up and have noticed a few > >> issues. Sorry it turned into rather a long email... > >> > > > > No worries, and thanks a lot for going through the trouble. > > > >> First, a quick explanation of the test suite: For all valid combinations of the > >> below parameters, boot the host kernel on the FVP, then boot the guest kernel in > >> a VM, check that booting succeeds all the way to the guest shell then poweroff > >> guest followed by host can check shutdown is clean. > >> > >> Parameters: > >> - hw_pa: [48, lpa, lpa2] > >> - hw_va: [48, 52] > >> - kvm_mode: [vhe, nvhe, protected] > >> - host_page_size: [4KB, 16KB, 64KB] > >> - host_pa: [48, 52] > >> - host_va: [48, 52] > >> - host_load_addr: [low, high] > >> - guest_page_size: [64KB] > >> - guest_pa: [52] > >> - guest_va: [52] > >> - guest_load_addr: [low, high] > >> > >> When *_load_addr is 'low', that means the RAM is below 48 bits in (I)PA space. > >> 'high' means the RAM starts at 2048TB for the guest (52 bit PA), and it means > >> there are 2 regions for the host; one at 0x880000000000 (48 bit PA) sized to > >> hold the kernel image only and another at 0x8800000000000 (52 bit PA) sized at > >> 4GB. The FVP only allows RAM at certain locations and having a contiguous region > >> cross the 48 bit boundary is not an option. So I chose these values to ensure > >> that the linear map size is within 51 bits, which is a requirement for > >> nhve/protected mode kvm. > >> > >> In all cases, I preload TF-A bl31, kernel, dt and initrd into RAM and run the > >> FVP. This sidesteps problems with EFI needing low memory, and with the FVP's > >> block devices needing DMA memory below 44 bits PA. bl31 and dt are appropriately > >> fixed up for the 2 different memory layouts. > >> > >> Given this was designed to test my KVM changes, I was previously running these > >> without the host_load_addr=high option for the 4k and 16k host kernels (since > >> this requires your patch set). In this situation there are 132 valid configs and > >> all of them pass. > >> > >> I then rebased my changes on top of yours and added in the host_load_addr=high > >> option. Now there are 186 valid configs, 64 of which fail. (some of these > >> failures are regressions). From a quick initial triage, there are 3 failure modes: > >> > >> > >> 1) 18 FAILING TESTS: Host kernel never outputs anything to console > >> > >> TF-A runs successfully, says it is jumping to the kernel, then nothing further > >> is seen. I'm pretty confident that the blobs are loaded into memory correctly > >> because the same framework is working for the other configs (including 64k > >> kernel loaded into high memory). This affects all configs where a host kernel > >> with 4k or 16k pages built with LPA2 support is loaded into high memory. > >> > > > > Not sure how to interpret this in combination with your explanation > > above, but if 'loaded high' means that the kernel itself is not in > > 48-bit addressable physical memory, this failure is expected. > > Sorry - my wording was confusing. host_load_addr=high means what I said at the > top; the kernel image is loaded at 0x880000000000 in a block of memory sized to > hold the kernel image only (actually its forward aligned to 2MB). 
The dtb and > initrd are loaded into a 4GB region at 0x8800000000000. The reason I'm doing > this is to ensure that when I create a VM, the memory used for it (at least the > vast majority) is coming from the region at 52 bits. I want to do this to prove > that the stage2 implementation is correctly handling the 52 OA case. > > > > > Given that we have no way of informing the bootloader or firmware > > whether or not a certain kernel image supports running from such a > > high offset, it must currently assume it cannot. We've just queued up > > a documentation fix to clarify this in the boot protocol, i.e., that > > the kernel must be loaded in 48-bit addressable physical memory. > > OK, but I think what I'm doing complies with this. Unless the DTB also has to be > below 48 bits? > Ahh yes, good point. Yes, this is actually implied but we should clarify this. Or fix it. But the same reasoning applies: currently, a loader cannot know from looking at a certain binary whether or not it supports addressing any memory outside of the 48-bit addressable range, so any asset loading into physical memory is affected by the same limitation. This has implications for other firmware aspects as well, i.e., ACPI tables etc. > > > > The fact that you had to doctor your boot environment to get around > > this kind of proves my point, and unless someone is silly enough to > > ship a SoC that cannot function without this, I don't think we should > > add this support. > > > > I understand how this is an interesting case for completeness from a > > validation pov, but the reality is that adding support for this would > > mean introducing changes amounting to dead code to fragile boot > > sequence code that is already hard to maintain. > > I'm not disagreeing. But I think what I'm doing should conform with the > requirements? (Previously I had the tests set up to just have a single region of > memory above 52 bits and the kernel image was placed there. That works/worked > for the 64KB kernel. But I brought the kernel image to below 48 bits to align > with the requirements of this patch set. > > If you see an easier way for me to validate 52 bit OAs in the stage 2 (and > ideally hyp stage 1), then I'm all ears! > There is a Kconfig knob CONFIG_ARM64_FORCE_52BIT which was introduced for a similar purpose, i.e., to ensure that the 52-bit range gets utilized. I could imagine adding a similar control for KVM in particular or for preferring allocations from outside the 48-bit PA range in general. > > > >> > >> 2) 4 FAILING TESTS: Host kernel gets stuck initializing KVM > >> > >> During kernel boot, last console log is "kvm [1]: vgic interrupt IRQ9". All > >> failing tests are configured for protected KVM, and are build with LPA2 > >> support, running on non-LPA2 HW. > >> > > > > I will try to reproduce this locally. > > > >> > >> 3) 42 FAILING TESTS: Guest kernel never outputs anything to console > >> > >> Host kernel boots fine, and we attempt to launch a guest kernel using kvmtool. > >> There is no error reported, but the guest never outputs anything. Haven't > >> worked out which config options are common to all failures yet. > >> > > > > This goes a bit beyond what I am currently set up for in terms of > > testing, but I'm happy to help narrow this down. > > > >> > >> Finally, I removed my code, and ran with your patch set as provided. For this I > >> hacked up my test suite to boot the host, and ignore booting a guest. I also > >> didn't bother to vary the KVM mode and just left it in VHE mode. 
There were 46 > >> valid configs here, of which 4 failed. They were all the same failure mode as > >> (1) above. Failing configs were: > >> > >> id hw_pa hw_va host_page_size host_pa host_va host_load_addr > >> ------------------------------------------------------------------ > >> 40 lpa 52 4k 52 52 high > >> 45 lpa 52 16k 52 52 high > >> 55 lpa2 52 4k 52 52 high > >> 60 lpa2 52 16k 52 52 high > >> > > > > Same point as above then, I guess. > > > >> > >> So on the balance of probabilities, I think failure mode (1) is very likely to > >> be due to a bug in your code. (2) and (3) could be my issue or your issue: I > >> propose to dig into those a bit further and will get back to you on them. I > >> don't plan to look any further into (1). > >> > > > > Thanks again. (1) is expected, and (2) is something I will investigate further. >
Hi Ard, I wanted to provide a quick update on the debugging I've been doing from my end. See below... On 29/11/2022 16:56, Ard Biesheuvel wrote: > On Tue, 29 Nov 2022 at 17:36, Ryan Roberts <ryan.roberts@arm.com> wrote: >> >> On 29/11/2022 15:47, Ard Biesheuvel wrote: >>> On Tue, 29 Nov 2022 at 16:31, Ryan Roberts <ryan.roberts@arm.com> wrote: >>>> >>>> Hi Ard, >>>> >>>> As promised, I ran your patch set through my test set up and have noticed a few >>>> issues. Sorry it turned into rather a long email... >>>> >>> >>> No worries, and thanks a lot for going through the trouble. >>> >>>> First, a quick explanation of the test suite: For all valid combinations of the >>>> below parameters, boot the host kernel on the FVP, then boot the guest kernel in >>>> a VM, check that booting succeeds all the way to the guest shell then poweroff >>>> guest followed by host can check shutdown is clean. >>>> >>>> Parameters: >>>> - hw_pa: [48, lpa, lpa2] >>>> - hw_va: [48, 52] >>>> - kvm_mode: [vhe, nvhe, protected] >>>> - host_page_size: [4KB, 16KB, 64KB] >>>> - host_pa: [48, 52] >>>> - host_va: [48, 52] >>>> - host_load_addr: [low, high] >>>> - guest_page_size: [64KB] >>>> - guest_pa: [52] >>>> - guest_va: [52] >>>> - guest_load_addr: [low, high] >>>> >>>> When *_load_addr is 'low', that means the RAM is below 48 bits in (I)PA space. >>>> 'high' means the RAM starts at 2048TB for the guest (52 bit PA), and it means >>>> there are 2 regions for the host; one at 0x880000000000 (48 bit PA) sized to >>>> hold the kernel image only and another at 0x8800000000000 (52 bit PA) sized at >>>> 4GB. The FVP only allows RAM at certain locations and having a contiguous region >>>> cross the 48 bit boundary is not an option. So I chose these values to ensure >>>> that the linear map size is within 51 bits, which is a requirement for >>>> nhve/protected mode kvm. >>>> >>>> In all cases, I preload TF-A bl31, kernel, dt and initrd into RAM and run the >>>> FVP. This sidesteps problems with EFI needing low memory, and with the FVP's >>>> block devices needing DMA memory below 44 bits PA. bl31 and dt are appropriately >>>> fixed up for the 2 different memory layouts. >>>> >>>> Given this was designed to test my KVM changes, I was previously running these >>>> without the host_load_addr=high option for the 4k and 16k host kernels (since >>>> this requires your patch set). In this situation there are 132 valid configs and >>>> all of them pass. >>>> >>>> I then rebased my changes on top of yours and added in the host_load_addr=high >>>> option. Now there are 186 valid configs, 64 of which fail. (some of these >>>> failures are regressions). From a quick initial triage, there are 3 failure modes: >>>> >>>> >>>> 1) 18 FAILING TESTS: Host kernel never outputs anything to console >>>> >>>> TF-A runs successfully, says it is jumping to the kernel, then nothing further >>>> is seen. I'm pretty confident that the blobs are loaded into memory correctly >>>> because the same framework is working for the other configs (including 64k >>>> kernel loaded into high memory). This affects all configs where a host kernel >>>> with 4k or 16k pages built with LPA2 support is loaded into high memory. >>>> >>> >>> Not sure how to interpret this in combination with your explanation >>> above, but if 'loaded high' means that the kernel itself is not in >>> 48-bit addressable physical memory, this failure is expected. >> >> Sorry - my wording was confusing. 
host_load_addr=high means what I said at the >> top; the kernel image is loaded at 0x880000000000 in a block of memory sized to >> hold the kernel image only (actually its forward aligned to 2MB). The dtb and >> initrd are loaded into a 4GB region at 0x8800000000000. The reason I'm doing >> this is to ensure that when I create a VM, the memory used for it (at least the >> vast majority) is coming from the region at 52 bits. I want to do this to prove >> that the stage2 implementation is correctly handling the 52 OA case. >> >>> >>> Given that we have no way of informing the bootloader or firmware >>> whether or not a certain kernel image supports running from such a >>> high offset, it must currently assume it cannot. We've just queued up >>> a documentation fix to clarify this in the boot protocol, i.e., that >>> the kernel must be loaded in 48-bit addressable physical memory. >> >> OK, but I think what I'm doing complies with this. Unless the DTB also has to be >> below 48 bits? >> > > Ahh yes, good point. Yes, this is actually implied but we should > clarify this. Or fix it. > > But the same reasoning applies: currently, a loader cannot know from > looking at a certain binary whether or not it supports addressing any > memory outside of the 48-bit addressable range, so any asset loading > into physical memory is affected by the same limitation. > > This has implications for other firmware aspects as well, i.e., ACPI tables etc. > >>> >>> The fact that you had to doctor your boot environment to get around >>> this kind of proves my point, and unless someone is silly enough to >>> ship a SoC that cannot function without this, I don't think we should >>> add this support. >>> >>> I understand how this is an interesting case for completeness from a >>> validation pov, but the reality is that adding support for this would >>> mean introducing changes amounting to dead code to fragile boot >>> sequence code that is already hard to maintain. >> >> I'm not disagreeing. But I think what I'm doing should conform with the >> requirements? (Previously I had the tests set up to just have a single region of >> memory above 52 bits and the kernel image was placed there. That works/worked >> for the 64KB kernel. But I brought the kernel image to below 48 bits to align >> with the requirements of this patch set. >> >> If you see an easier way for me to validate 52 bit OAs in the stage 2 (and >> ideally hyp stage 1), then I'm all ears! >> > > There is a Kconfig knob CONFIG_ARM64_FORCE_52BIT which was introduced > for a similar purpose, i.e., to ensure that the 52-bit range gets > utilized. I could imagine adding a similar control for KVM in > particular or for preferring allocations from outside the 48-bit PA > range in general. I managed to get these all booting after moving the kernel and dtb to low memory. I've kept the initrd in high memory for now, which works fine. (I think the initrd memory will get freed and I didn't want any free low memory floating around to ensure that kvm guests get high memory). Once booted, some of these convert to passes, and others remain failures but for other reasons (see below).
It turns out the same issue is hit when running your patches without mine on top. The root cause in both cases is an assumption that kvm_get_parange() makes that ID_AA64MMFR0_EL1_PARANGE_MAX will always be 48 for 4KB and 16KB PAGE_SIZE. That is no longer true with your patches. It causes the VTCR_EL2 to be programmed incorrectly for the host stage2, then on return to the host, bang. This demonstrates that kernel stage1 support for LPA2 depends on kvm support for LPA2, since for protected kvm, the host stage2 needs to be able to id map the full physical range that the host kernel sees prior to deprivilege. So I don't think it's fixable in your series. I have a fix in my series for this. I also found another dependency problem (hit by some of the tests that were previously failing at the first issue) where kvm uses its page table library to walk a user space page table created by the kernel (see get_user_mapping_size()). In the case where the kernel creates an LPA2 page table, kvm can't walk it without my patches. I've also added a fix for this to my series. >>> >>>> >>>> 3) 42 FAILING TESTS: Guest kernel never outputs anything to console >>>> >>>> Host kernel boots fine, and we attempt to launch a guest kernel using kvmtool. >>>> There is no error reported, but the guest never outputs anything. Haven't >>>> worked out which config options are common to all failures yet. >>>> >>> >>> This goes a bit beyond what I am currently set up for in terms of >>> testing, but I'm happy to help narrow this down. I don't have a root cause for this yet. I'll try to take a look this afternoon. Will keep you posted. >>> >>>> >>>> Finally, I removed my code, and ran with your patch set as provided. For this I >>>> hacked up my test suite to boot the host, and ignore booting a guest. I also >>>> didn't bother to vary the KVM mode and just left it in VHE mode. There were 46 >>>> valid configs here, of which 4 failed. They were all the same failure mode as >>>> (1) above. Failing configs were: >>>> >>>> id hw_pa hw_va host_page_size host_pa host_va host_load_addr >>>> ------------------------------------------------------------------ >>>> 40 lpa 52 4k 52 52 high >>>> 45 lpa 52 16k 52 52 high >>>> 55 lpa2 52 4k 52 52 high >>>> 60 lpa2 52 16k 52 52 high >>>> >>> >>> Same point as above then, I guess. >>> >>>> >>>> So on the balance of probabilities, I think failure mode (1) is very likely to >>>> be due to a bug in your code. (2) and (3) could be my issue or your issue: I >>>> propose to dig into those a bit further and will get back to you on them. I >>>> don't plan to look any further into (1). >>>> >>> >>> Thanks again. (1) is expected, and (2) is something I will investigate further. >> Once I have all the tests passing, I'll post my series, then hopefully we can move it all forwards as one? As part of my debugging, I've got a patch to sort out the tlbi code to support LPA2 properly - I think I raised that comment on one of the patches. Are you happy for me to post as part of my series? Thanks, Ryan
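For concreteness, the clamp Ryan is referring to sits in a helper of roughly this shape (paraphrased from the kernel rather than quoted verbatim; only the clamp matters here):

static inline unsigned int kvm_get_parange(u64 mmfr0)
{
	unsigned int parange = cpuid_feature_extract_unsigned_field(mmfr0,
				ID_AA64MMFR0_EL1_PARANGE_SHIFT);

	/*
	 * ID_AA64MMFR0_EL1_PARANGE_MAX is a compile-time constant that
	 * encodes a 48-bit PA range for 4KB and 16KB pages. Clamping
	 * the hardware PARange to it makes VTCR_EL2 describe a 48-bit
	 * stage2 even when the LPA2-enabled host stage1 uses 52-bit
	 * PAs -- hence the crash on the return to the host.
	 */
	if (parange > ID_AA64MMFR0_EL1_PARANGE_MAX)
		parange = ID_AA64MMFR0_EL1_PARANGE_MAX;

	return parange;
}

Presumably the fix is to make that maximum LPA2-aware at runtime rather than a per-granule compile-time constant.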
On Thu, 1 Dec 2022 at 13:22, Ryan Roberts <ryan.roberts@arm.com> wrote: > > Hi Ard, > > I wanted to provide a quick update on the debugging I've been doing from my end. > See below... > > > On 29/11/2022 16:56, Ard Biesheuvel wrote: > > On Tue, 29 Nov 2022 at 17:36, Ryan Roberts <ryan.roberts@arm.com> wrote: > >> > >> On 29/11/2022 15:47, Ard Biesheuvel wrote: > >>> On Tue, 29 Nov 2022 at 16:31, Ryan Roberts <ryan.roberts@arm.com> wrote: > >>>> > >>>> Hi Ard, > >>>> > >>>> As promised, I ran your patch set through my test set up and have noticed a few > >>>> issues. Sorry it turned into rather a long email... > >>>> > >>> > >>> No worries, and thanks a lot for going through the trouble. > >>> > >>>> First, a quick explanation of the test suite: For all valid combinations of the > >>>> below parameters, boot the host kernel on the FVP, then boot the guest kernel in > >>>> a VM, check that booting succeeds all the way to the guest shell then poweroff > >>>> guest followed by host can check shutdown is clean. > >>>> > >>>> Parameters: > >>>> - hw_pa: [48, lpa, lpa2] > >>>> - hw_va: [48, 52] > >>>> - kvm_mode: [vhe, nvhe, protected] > >>>> - host_page_size: [4KB, 16KB, 64KB] > >>>> - host_pa: [48, 52] > >>>> - host_va: [48, 52] > >>>> - host_load_addr: [low, high] > >>>> - guest_page_size: [64KB] > >>>> - guest_pa: [52] > >>>> - guest_va: [52] > >>>> - guest_load_addr: [low, high] > >>>> > >>>> When *_load_addr is 'low', that means the RAM is below 48 bits in (I)PA space. > >>>> 'high' means the RAM starts at 2048TB for the guest (52 bit PA), and it means > >>>> there are 2 regions for the host; one at 0x880000000000 (48 bit PA) sized to > >>>> hold the kernel image only and another at 0x8800000000000 (52 bit PA) sized at > >>>> 4GB. The FVP only allows RAM at certain locations and having a contiguous region > >>>> cross the 48 bit boundary is not an option. So I chose these values to ensure > >>>> that the linear map size is within 51 bits, which is a requirement for > >>>> nhve/protected mode kvm. > >>>> > >>>> In all cases, I preload TF-A bl31, kernel, dt and initrd into RAM and run the > >>>> FVP. This sidesteps problems with EFI needing low memory, and with the FVP's > >>>> block devices needing DMA memory below 44 bits PA. bl31 and dt are appropriately > >>>> fixed up for the 2 different memory layouts. > >>>> > >>>> Given this was designed to test my KVM changes, I was previously running these > >>>> without the host_load_addr=high option for the 4k and 16k host kernels (since > >>>> this requires your patch set). In this situation there are 132 valid configs and > >>>> all of them pass. > >>>> > >>>> I then rebased my changes on top of yours and added in the host_load_addr=high > >>>> option. Now there are 186 valid configs, 64 of which fail. (some of these > >>>> failures are regressions). From a quick initial triage, there are 3 failure modes: > >>>> > >>>> > >>>> 1) 18 FAILING TESTS: Host kernel never outputs anything to console > >>>> > >>>> TF-A runs successfully, says it is jumping to the kernel, then nothing further > >>>> is seen. I'm pretty confident that the blobs are loaded into memory correctly > >>>> because the same framework is working for the other configs (including 64k > >>>> kernel loaded into high memory). This affects all configs where a host kernel > >>>> with 4k or 16k pages built with LPA2 support is loaded into high memory. 
> >>>> > >>> > >>> Not sure how to interpret this in combination with your explanation > >>> above, but if 'loaded high' means that the kernel itself is not in > >>> 48-bit addressable physical memory, this failure is expected. > >> > >> Sorry - my wording was confusing. host_load_addr=high means what I said at the > >> top; the kernel image is loaded at 0x880000000000 in a block of memory sized to > >> hold the kernel image only (actually its forward aligned to 2MB). The dtb and > >> initrd are loaded into a 4GB region at 0x8800000000000. The reason I'm doing > >> this is to ensure that when I create a VM, the memory used for it (at least the > >> vast majority) is coming from the region at 52 bits. I want to do this to prove > >> that the stage2 implementation is correctly handling the 52 OA case. > >> > >>> > >>> Given that we have no way of informing the bootloader or firmware > >>> whether or not a certain kernel image supports running from such a > >>> high offset, it must currently assume it cannot. We've just queued up > >>> a documentation fix to clarify this in the boot protocol, i.e., that > >>> the kernel must be loaded in 48-bit addressable physical memory. > >> > >> OK, but I think what I'm doing complies with this. Unless the DTB also has to be > >> below 48 bits? > >> > > > > Ahh yes, good point. Yes, this is actually implied but we should > > clarify this. Or fix it. > > > > But the same reasoning applies: currently, a loader cannot know from > > looking at a certain binary whether or not it supports addressing any > > memory outside of the 48-bit addressable range, so any asset loading > > into physical memory is affected by the same limitation. > > > > This has implications for other firmware aspects as well, i.e., ACPI tables etc. > > > >>> > >>> The fact that you had to doctor your boot environment to get around > >>> this kind of proves my point, and unless someone is silly enough to > >>> ship a SoC that cannot function without this, I don't think we should > >>> add this support. > >>> > >>> I understand how this is an interesting case for completeness from a > >>> validation pov, but the reality is that adding support for this would > >>> mean introducing changes amounting to dead code to fragile boot > >>> sequence code that is already hard to maintain. > >> > >> I'm not disagreeing. But I think what I'm doing should conform with the > >> requirements? (Previously I had the tests set up to just have a single region of > >> memory above 52 bits and the kernel image was placed there. That works/worked > >> for the 64KB kernel. But I brought the kernel image to below 48 bits to align > >> with the requirements of this patch set. > >> > >> If you see an easier way for me to validate 52 bit OAs in the stage 2 (and > >> ideally hyp stage 1), then I'm all ears! > >> > > > > There is a Kconfig knob CONFIG_ARM64_FORCE_52BIT which was introduced > > for a similar purpose, i.e., to ensure that the 52-bit range gets > > utilized. I could imagine adding a similar control for KVM in > > particular or for preferring allocations from outside the 48-bit PA > > range in general. > > I managed to get these all booting after moving the kernel and dtb to low > memory. I've kept the initrd in high memory for now, which works fine. (I think > the initrd memory will get freed and I didn't want any free low memory floating > around to ensure that kvm guests get high memory). 
The DT's early mapping is via the ID map, while the initrd is only mapped much later via the kernel mappings in TTBR1, so this is why it works. However, for the boot protocol, this distinction doesn't really matter: if the boot stack cannot be certain that a certain kernel image was built to support 52-bit physical addressing, it simply must never place anything there. > Once booting, some of these > get converted to passes, and others remain failures but for other reasons (see > below). > > > > >>> > >>>> > >>>> 2) 4 FAILING TESTS: Host kernel gets stuck initializing KVM > >>>> > >>>> During kernel boot, last console log is "kvm [1]: vgic interrupt IRQ9". All > >>>> failing tests are configured for protected KVM, and are build with LPA2 > >>>> support, running on non-LPA2 HW. > >>>> > >>> > >>> I will try to reproduce this locally. > > It turns out the same issue is hit when running your patches without mine on > top. The root cause in both cases is an assumption that kvm_get_parange() makes > that ID_AA64MMFR0_EL1_PARANGE_MAX will always be 48 for 4KB and 16KB PAGE_SIZE. > That is no longer true with your patches. It causes the VTCR_EL2 to be > programmed incorrectly for the host stage2, then on return to the host, bang. > Right. I made an attempt at replacing 'u32 level' with 's32 level' throughout that code, along with some related changes, but I didn't spot this issue. > This demonstrates that kernel stage1 support for LPA2 depends on kvm support for > LPA2, since for protected kvm, the host stage2 needs to be able to id map the > full physical range that the host kernel sees prior to deprivilege. So I don't > think it's fixable in your series. I have a fix in my series for this. > The reference to ID_AA64MMFR0_EL1_PARANGE_MAX should be fixable in isolation, no? Even if it results in a KVM that cannot use 52-bit PAs while the host can. > I also found another dependency problem (hit by some of the tests that were > previously failing at the first issue) where kvm uses its page table library to > walk a user space page table created by the kernel (see > get_user_mapping_size()). In the case where the kernel creates an LPA2 page > table, kvm can't walk it without my patches. I've also added a fix for this to > my series. > OK > > >>> > >>>> > >>>> 3) 42 FAILING TESTS: Guest kernel never outputs anything to console > >>>> > >>>> Host kernel boots fine, and we attempt to launch a guest kernel using kvmtool. > >>>> There is no error reported, but the guest never outputs anything. Haven't > >>>> worked out which config options are common to all failures yet. > >>>> > >>> > >>> This goes a bit beyond what I am currently set up for in terms of > >>> testing, but I'm happy to help narrow this down. > I don't have a root cause for this yet. I'll try to take a loo this afternoon. > Will keep you posted. > Thanks > >>> > >>>> > >>>> Finally, I removed my code, and ran with your patch set as provided. For this I > >>>> hacked up my test suite to boot the host, and ignore booting a guest. I also > >>>> didn't bother to vary the KVM mode and just left it in VHE mode. There were 46 > >>>> valid configs here, of which 4 failed. They were all the same failure mode as > >>>> (1) above.
Failing configs were: > >>>> > >>>> id hw_pa hw_va host_page_size host_pa host_va host_load_addr > >>>> ------------------------------------------------------------------ > >>>> 40 lpa 52 4k 52 52 high > >>>> 45 lpa 52 16k 52 52 high > >>>> 55 lpa2 52 4k 52 52 high > >>>> 60 lpa2 52 16k 52 52 high > >>>> > >>> > >>> Same point as above then, I guess. >>> > >>>> > >>>> So on the balance of probabilities, I think failure mode (1) is very likely to > >>>> be due to a bug in your code. (2) and (3) could be my issue or your issue: I > >>>> propose to dig into those a bit further and will get back to you on them. I > >>>> don't plan to look any further into (1). > >>>> > >>> > >>> Thanks again. (1) is expected, and (2) is something I will investigate further. > >> > Once I have all the tests passing, I'll post my series, then hopefully we can > move it all forwards as one? > That would be great, yes, although my work depends on a sizable rework of the early boot code that has seen very little review as of yet. So for the time being, let's keep aligned but let's not put any eggs in each other's baskets :-) > As part of my debugging, I've got a patch to sort out the tlbi code to support > LPA2 properly - I think I raised that comment on one of the patches. Are you > happy for me to post as part of my series? > I wasn't sure where to look tbh. The generic 5-level paging stuff seems to work fine - is this specific to KVM?
On 01/12/2022 13:43, Ard Biesheuvel wrote: > [...] >>>>>> >>>>>> 2) 4 FAILING TESTS: Host kernel gets stuck initializing KVM >>>>>> >>>>>> During kernel boot, last console log is "kvm [1]: vgic interrupt IRQ9". All >>>>>> failing tests are configured for protected KVM, and are build with LPA2 >>>>>> support, running on non-LPA2 HW. >>>>>> >>>>> >>>>> I will try to reproduce this locally. >> >> It turns out the same issue is hit when running your patches without mine on >> top. The root cause in both cases is an assumption that kvm_get_parange() makes >> that ID_AA64MMFR0_EL1_PARANGE_MAX will always be 48 for 4KB and 16KB PAGE_SIZE. >> That is no longer true with your patches. It causes the VTCR_EL2 to be >> programmed incorrectly for the host stage2, then on return to the host, bang. >> > > Right. I made an attempt at replacing 'u32 level' with 's32 level' > throughout that code, along with some related changes, but I didn't > spot this issue. > >> This demonstrates that kernel stage1 support for LPA2 depends on kvm support for >> LPA2, since for protected kvm, the host stage2 needs to be able to id map the >> full physical range that the host kernel sees prior to deprivilege. So I don't >> think it's fixable in your series. I have a fix in my series for this. >> > > The reference to ID_AA64MMFR0_EL1_PARANGE_MAX should be fixable in > isolation, no? Even if it results in a KVM that cannot use 52-bit PAs > while the host can. Well that doesn't really sound like a good solution to me - the VMM can't control which physical memory it is using so it would be the luck of the draw as to whether kvm gets passed memory above 48 PA bits. So you would probably have to explicitly prevent kvm from initializing if you want to keep your kernel LPA2 patches decoupled from mine. > >> I also found another dependency problem (hit by some of the tests that were >> previously failing at the first issue) where kvm uses its page table library to >> walk a user space page table created by the kernel (see >> get_user_mapping_size()). In the case where the kernel creates an LPA2 page >> table, kvm can't walk it without my patches. I've also added a fix for this to >> my series. >> > > OK Again, I think the only way to work around this without kvm understanding LPA2 is to disable KVM when the kernel is using LPA2. > >> >>>>> >>>>>> >>>>>> 3) 42 FAILING TESTS: Guest kernel never outputs anything to console >>>>>> >>>>>> Host kernel boots fine, and we attempt to launch a guest kernel using kvmtool. >>>>>> There is no error reported, but the guest never outputs anything. Haven't >>>>>> worked out which config options are common to all failures yet. >>>>>> >>>>> >>>>> This goes a bit beyond what I am currently set up for in terms of >>>>> testing, but I'm happy to help narrow this down. >> >> I don't have a root cause for this yet. I'll try to take a loo this afternoon. >> Will keep you posted. >> OK found it. All these failures are when loading the guest in 'high' memory. My test environment is allocating a vm with 256M@2048T (so all memory above 48 bits IPA) and putting the 64KB/52VA/52PA kernel, dtb and initrd there. This works fine without your changes. I guess as part of your change you have broken the 64KB kernel's ability to create an ID map above 48 bits (which conforms to your clarification of the boot protocol). I guess this was intentional? It makes it really hard for me to test >48 bit IA at stage2...
> >> >> Once I have all the tests passing, I'll post my series, then hopefully we can >> move it all forwards as one? >> > > That would be great, yes, although my work depends on a sizable rework > of the early boot code that has seen very little review as of yet. > > So for the time being, let's keep aligned but let's not put any eggs > in each other's baskets :-) > >> As part of my debugging, I've got a patch to sort out the tlbi code to support >> LPA2 properly - I think I raised that comment on one of the patches. Are you >> happy for me to post as part of my series? >> > > I wasn't sure where to look tbh. The generic 5-level paging stuff > seems to work fine - is this specific to KVM? There are a couple of issues that I spotted; for the range-based tlbi instructions, when LPA2 is in use (TCR_EL1.DS=1), BaseADDR must be 64KB aligned (when LPA2 is disabled it only needs to be page aligned). So there is some forward alignment required using the non-range tlbi in __flush_tlb_range(). I think this would manifest as invalidating the wrong entries if left as-is once your patches are applied. The second problem is that __tlbi_level() uses level 0 as the "don't use level hint" sentinel. I think this will work just fine (if slightly suboptimal) if left as is, since the higher layers are never passing anything outside the range [0, 3] at the moment, but if that changed and -1 was passed then it would cause a bug.
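To illustrate the forward alignment Ryan describes, here is a sketch of the shape it could take inside __flush_tlb_range(). This is not the actual patch from his series: system_supports_lpa2() is a hypothetical predicate for TCR_EL1.DS == 1, and start/end/asid/tlb_level stand in for the values already available in that function.

	unsigned long addr = start;

	if (system_supports_lpa2()) {
		/*
		 * With TCR_EL1.DS == 1 the range-based TLBI BaseADDR
		 * field must be 64KB aligned, so flush the leading
		 * pages one at a time with the non-range instruction
		 * until the cursor reaches a 64KB boundary.
		 */
		while ((addr & (SZ_64K - 1)) && addr < end) {
			__tlbi_level(vale1is, __TLBI_VADDR(addr, asid), tlb_level);
			addr += PAGE_SIZE;
		}
	}

	/* ... the remainder of [addr, end) via the range-based ops ... */

	/*
	 * The second issue mentioned above: __tlbi_level() treats
	 * level 0 as "no level hint", so once level -1 exists (5
	 * levels of paging) a different sentinel value is needed.
	 */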