mbox series

[v2,00/14] Transparent Contiguous PTEs for User Mappings

Message ID 20231115163018.1303287-1-ryan.roberts@arm.com (mailing list archive)
Headers show
Series Transparent Contiguous PTEs for User Mappings | expand

Message

Ryan Roberts Nov. 15, 2023, 4:30 p.m. UTC
Hi All,

This is v2 of a series to opportunistically and transparently use contpte
mappings (set the contiguous bit in ptes) for user memory when those mappings
meet the requirements. It is part of a wider effort to improve performance by
allocating and mapping variable-sized blocks of memory (folios). One aim is for
the 4K kernel to approach the performance of the 16K kernel, but without
breaking compatibility and without the associated increase in memory. Another
aim is to benefit the 16K and 64K kernels by enabling 2M THP, since this is the
contpte size for those kernels. We have good performance data that demonstrates
both aims are being met (see below).

Of course this is only one half of the change. We require the mapped physical
memory to be the correct size and alignment for this to actually be useful (i.e.
64K for 4K pages, or 2M for 16K/64K pages). Fortunately folios are solving this
problem for us. Filesystems that support it (XFS, AFS, EROFS, tmpfs, ...) will
allocate large folios up to the PMD size today, and more filesystems are coming.
And the other half of my work, to enable "small-sized THP" (large folios) for
anonymous memory, makes contpte sized folios prevalent for anonymous memory too
[2].

Optimistically, I would really like to get this series merged for v6.8; there is
a chance that the small-sized THP series will also get merged for that version.
But even if it doesn't, this series still benefits file-backed memory from the
file systems that support large folios so shouldn't be held up for it.
Additionally I've got data that shows this series adds no regression when the
system has no appropriate large folios.

All dependecies listed against v1 are now resolved; This series applies cleanly
against v6.7-rc1.

Note that the first patch is for core-mm and provides the refactoring to make a
crucial optimization possible - which is then implemented in patch 13. The
remaining patches are arm64-specific.

Testing
=======

I've tested this series together with small-sized THP [2] on both Ampere Altra
(bare metal) and Apple M2 (VM):
  - mm selftests (inc new tests written for small-sized THP); no regressions
  - Speedometer Java script benchmark in Chromium web browser; no issues
  - Kernel compilation; no issues
  - Various tests under high memory pressure with swap enabled; no issues


Performance
===========

John Hubbard at Nvidia has indicated dramatic 10x performance improvements for
some workloads at [3], when using 64K base page kernel.

You can also see the original performance results I posted against v1 [1] which
are still valid.

I've additionally run the kernel compilation and speedometer benchmarks on a
system with small-sized THP disabled and large folio support for file-backed
memory intentionally disabled; I see no change in performance in this case (i.e.
no regression when this change is "present but not useful").


Changes since v1
================

  - Export contpte_* symbols so that modules can continue to call inline
    functions (e.g. ptep_get) which may now call the contpte_* functions (thanks
    to JohnH)
  - Use pte_valid() instead of pte_present() where sensible (thanks to Catalin)
  - Factor out (pte_valid() && pte_cont()) into new pte_valid_cont() helper
    (thanks to Catalin)
  - Fixed bug in contpte_ptep_set_access_flags() where TLBIs were missed (thanks
    to Catalin)
  - Added ARM64_CONTPTE expert Kconfig (enabled by default) (thanks to Anshuman)
  - Simplified contpte_ptep_get_and_clear_full()
  - Improved various code comments


[1] https://lore.kernel.org/linux-arm-kernel/20230622144210.2623299-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/linux-arm-kernel/20231115132734.931023-1-ryan.roberts@arm.com/
[3] https://lore.kernel.org/linux-mm/c507308d-bdd4-5f9e-d4ff-e96e4520be85@nvidia.com/


Thanks,
Ryan


Ryan Roberts (14):
  mm: Batch-copy PTE ranges during fork()
  arm64/mm: set_pte(): New layer to manage contig bit
  arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
  arm64/mm: pte_clear(): New layer to manage contig bit
  arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
  arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
  arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
  arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
  arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
  arm64/mm: ptep_get(): New layer to manage contig bit
  arm64/mm: Split __flush_tlb_range() to elide trailing DSB
  arm64/mm: Wire up PTE_CONT for user mappings
  arm64/mm: Implement ptep_set_wrprotects() to optimize fork()
  arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown

 arch/arm64/Kconfig                |  10 +-
 arch/arm64/include/asm/pgtable.h  | 325 +++++++++++++++++++---
 arch/arm64/include/asm/tlbflush.h |  13 +-
 arch/arm64/kernel/efi.c           |   4 +-
 arch/arm64/kernel/mte.c           |   2 +-
 arch/arm64/kvm/guest.c            |   2 +-
 arch/arm64/mm/Makefile            |   1 +
 arch/arm64/mm/contpte.c           | 447 ++++++++++++++++++++++++++++++
 arch/arm64/mm/fault.c             |  12 +-
 arch/arm64/mm/fixmap.c            |   4 +-
 arch/arm64/mm/hugetlbpage.c       |  40 +--
 arch/arm64/mm/kasan_init.c        |   6 +-
 arch/arm64/mm/mmu.c               |  16 +-
 arch/arm64/mm/pageattr.c          |   6 +-
 arch/arm64/mm/trans_pgd.c         |   6 +-
 include/linux/pgtable.h           |  13 +
 mm/memory.c                       | 175 +++++++++---
 17 files changed, 956 insertions(+), 126 deletions(-)
 create mode 100644 arch/arm64/mm/contpte.c

--
2.25.1

Comments

Barry Song Nov. 27, 2023, 3:18 a.m. UTC | #1
> Ryan Roberts (14):
>   mm: Batch-copy PTE ranges during fork()
>   arm64/mm: set_pte(): New layer to manage contig bit
>   arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
>   arm64/mm: pte_clear(): New layer to manage contig bit
>   arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
>   arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
>   arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
>   arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
>   arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
>   arm64/mm: ptep_get(): New layer to manage contig bit
>   arm64/mm: Split __flush_tlb_range() to elide trailing DSB
>   arm64/mm: Wire up PTE_CONT for user mappings
>   arm64/mm: Implement ptep_set_wrprotects() to optimize fork()
>   arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown

Hi Ryan,
Not quite sure if I missed something, are we splitting/unfolding CONTPTES
in the below cases

1. madvise(MADV_DONTNEED) on a part of basepages on a CONTPTE large folio

2. vma split in a large folio due to various reasons such as mprotect,
munmap, mlock etc.

3. try_to_unmap_one() to reclaim a folio, ptes are scanned one by one
rather than being as a whole.

In hardware, we need to make sure CONTPTE follow the rule - always 16
contiguous physical address with CONTPTE set. if one of them run away
from the 16 ptes group and PTEs become unconsistent, some terrible
errors/faults can happen in HW. for example

case0:
addr0 PTE - has no CONTPE
addr0+4kb PTE - has CONTPTE
....
addr0+60kb PTE - has CONTPTE

case 1:
addr0 PTE - has no CONTPE
addr0+4kb PTE - has CONTPTE
....
addr0+60kb PTE - has swap

Unconsistent 16 PTEs will lead to crash even in the firmware based on
our observation.

Thanks
Barry
Ryan Roberts Nov. 27, 2023, 9:15 a.m. UTC | #2
On 27/11/2023 03:18, Barry Song wrote:
>> Ryan Roberts (14):
>>   mm: Batch-copy PTE ranges during fork()
>>   arm64/mm: set_pte(): New layer to manage contig bit
>>   arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
>>   arm64/mm: pte_clear(): New layer to manage contig bit
>>   arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
>>   arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
>>   arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
>>   arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
>>   arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
>>   arm64/mm: ptep_get(): New layer to manage contig bit
>>   arm64/mm: Split __flush_tlb_range() to elide trailing DSB
>>   arm64/mm: Wire up PTE_CONT for user mappings
>>   arm64/mm: Implement ptep_set_wrprotects() to optimize fork()
>>   arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown
> 
> Hi Ryan,
> Not quite sure if I missed something, are we splitting/unfolding CONTPTES
> in the below cases

The general idea is that the core-mm sets the individual ptes (one at a time if
it likes with set_pte_at(), or in a block with set_ptes()), modifies its
permissions (ptep_set_wrprotect(), ptep_set_access_flags()) and clears them
(ptep_clear(), etc); This is exactly the same interface as previously.

BUT, the arm64 implementation of those interfaces will now detect when a set of
adjacent PTEs (a contpte block - so 16 naturally aligned entries when using 4K
base pages) are all appropriate for having the CONT_PTE bit set; in this case
the block is "folded". And it will detect when the first PTE in the block
changes such that the CONT_PTE bit must now be unset ("unfolded"). One of the
requirements for folding a contpte block is that all the pages must belong to
the *same* folio (that means its safe to only track access/dirty for thecontpte
block as a whole rather than for each individual pte).

(there are a couple of optimizations that make the reality slightly more
complicated than what I've just explained, but you get the idea).

On that basis, I believe all the specific cases you describe below are all
covered and safe - please let me know if you think there is a hole here!

> 
> 1. madvise(MADV_DONTNEED) on a part of basepages on a CONTPTE large folio

The page will first be unmapped (e.g. ptep_clear() or ptep_get_and_clear(), or
whatever). The implementation of that will cause an unfold and the CONT_PTE bit
is removed from the whole contpte block. If there is then a subsequent
set_pte_at() to set a swap entry, the implementation will see that its not
appropriate to re-fold, so the range will remain unfolded.

> 
> 2. vma split in a large folio due to various reasons such as mprotect,
> munmap, mlock etc.

I'm not sure if PTEs are explicitly unmapped/remapped when splitting a VMA? I
suspect not, so if the VMA is split in the middle of a currently folded contpte
block, it will remain folded. But this is safe and continues to work correctly.
The VMA arrangement is not important; it is just important that a single folio
is mapped contiguously across the whole block.

> 
> 3. try_to_unmap_one() to reclaim a folio, ptes are scanned one by one
> rather than being as a whole.

Yes, as per 1; the arm64 implementation will notice when the first entry is
cleared and unfold the contpte block.

> 
> In hardware, we need to make sure CONTPTE follow the rule - always 16
> contiguous physical address with CONTPTE set. if one of them run away
> from the 16 ptes group and PTEs become unconsistent, some terrible
> errors/faults can happen in HW. for example

Yes, the implementation obeys all these rules; see contpte_try_fold() and
contpte_try_unfold(). the fold/unfold operation is only done when all
requirements are met, and we perform it in a manner that is conformant to the
architecture requirements (see contpte_fold() - being renamed to
contpte_convert() in the next version).

Thanks for the review!

Thanks,
Ryan

> 
> case0:
> addr0 PTE - has no CONTPE
> addr0+4kb PTE - has CONTPTE
> ....
> addr0+60kb PTE - has CONTPTE
> 
> case 1:
> addr0 PTE - has no CONTPE
> addr0+4kb PTE - has CONTPTE
> ....
> addr0+60kb PTE - has swap
> 
> Unconsistent 16 PTEs will lead to crash even in the firmware based on
> our observation.
> 
> Thanks
> Barry
> 
>
Barry Song Nov. 27, 2023, 10:35 a.m. UTC | #3
On Mon, Nov 27, 2023 at 10:15 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 27/11/2023 03:18, Barry Song wrote:
> >> Ryan Roberts (14):
> >>   mm: Batch-copy PTE ranges during fork()
> >>   arm64/mm: set_pte(): New layer to manage contig bit
> >>   arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
> >>   arm64/mm: pte_clear(): New layer to manage contig bit
> >>   arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
> >>   arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
> >>   arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
> >>   arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
> >>   arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
> >>   arm64/mm: ptep_get(): New layer to manage contig bit
> >>   arm64/mm: Split __flush_tlb_range() to elide trailing DSB
> >>   arm64/mm: Wire up PTE_CONT for user mappings
> >>   arm64/mm: Implement ptep_set_wrprotects() to optimize fork()
> >>   arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown
> >
> > Hi Ryan,
> > Not quite sure if I missed something, are we splitting/unfolding CONTPTES
> > in the below cases
>
> The general idea is that the core-mm sets the individual ptes (one at a time if
> it likes with set_pte_at(), or in a block with set_ptes()), modifies its
> permissions (ptep_set_wrprotect(), ptep_set_access_flags()) and clears them
> (ptep_clear(), etc); This is exactly the same interface as previously.
>
> BUT, the arm64 implementation of those interfaces will now detect when a set of
> adjacent PTEs (a contpte block - so 16 naturally aligned entries when using 4K
> base pages) are all appropriate for having the CONT_PTE bit set; in this case
> the block is "folded". And it will detect when the first PTE in the block
> changes such that the CONT_PTE bit must now be unset ("unfolded"). One of the
> requirements for folding a contpte block is that all the pages must belong to
> the *same* folio (that means its safe to only track access/dirty for thecontpte
> block as a whole rather than for each individual pte).
>
> (there are a couple of optimizations that make the reality slightly more
> complicated than what I've just explained, but you get the idea).
>
> On that basis, I believe all the specific cases you describe below are all
> covered and safe - please let me know if you think there is a hole here!
>
> >
> > 1. madvise(MADV_DONTNEED) on a part of basepages on a CONTPTE large folio
>
> The page will first be unmapped (e.g. ptep_clear() or ptep_get_and_clear(), or
> whatever). The implementation of that will cause an unfold and the CONT_PTE bit
> is removed from the whole contpte block. If there is then a subsequent
> set_pte_at() to set a swap entry, the implementation will see that its not
> appropriate to re-fold, so the range will remain unfolded.
>
> >
> > 2. vma split in a large folio due to various reasons such as mprotect,
> > munmap, mlock etc.
>
> I'm not sure if PTEs are explicitly unmapped/remapped when splitting a VMA? I
> suspect not, so if the VMA is split in the middle of a currently folded contpte
> block, it will remain folded. But this is safe and continues to work correctly.
> The VMA arrangement is not important; it is just important that a single folio
> is mapped contiguously across the whole block.

I don't think it is safe to keep CONTPTE folded in a split_vma case. as
otherwise, copy_ptes in your other patch might only copy a part
of CONTPES.
For example,  if page0-page4 and page5-page15 are splitted in split_vma,
in fork, while copying pte for the first VMA, we are copying page0-page4,
this will immediately cause inconsistent CONTPTE. as we have to
make sure all CONTPTEs are atomically mapped in a PTL.

>
> >
> > 3. try_to_unmap_one() to reclaim a folio, ptes are scanned one by one
> > rather than being as a whole.
>
> Yes, as per 1; the arm64 implementation will notice when the first entry is
> cleared and unfold the contpte block.
>
> >
> > In hardware, we need to make sure CONTPTE follow the rule - always 16
> > contiguous physical address with CONTPTE set. if one of them run away
> > from the 16 ptes group and PTEs become unconsistent, some terrible
> > errors/faults can happen in HW. for example
>
> Yes, the implementation obeys all these rules; see contpte_try_fold() and
> contpte_try_unfold(). the fold/unfold operation is only done when all
> requirements are met, and we perform it in a manner that is conformant to the
> architecture requirements (see contpte_fold() - being renamed to
> contpte_convert() in the next version).
>
> Thanks for the review!
>
> Thanks,
> Ryan
>
> >
> > case0:
> > addr0 PTE - has no CONTPE
> > addr0+4kb PTE - has CONTPTE
> > ....
> > addr0+60kb PTE - has CONTPTE
> >
> > case 1:
> > addr0 PTE - has no CONTPE
> > addr0+4kb PTE - has CONTPTE
> > ....
> > addr0+60kb PTE - has swap
> >
> > Unconsistent 16 PTEs will lead to crash even in the firmware based on
> > our observation.
> >

Thanks
 Barry
Ryan Roberts Nov. 27, 2023, 11:11 a.m. UTC | #4
On 27/11/2023 10:35, Barry Song wrote:
> On Mon, Nov 27, 2023 at 10:15 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 27/11/2023 03:18, Barry Song wrote:
>>>> Ryan Roberts (14):
>>>>   mm: Batch-copy PTE ranges during fork()
>>>>   arm64/mm: set_pte(): New layer to manage contig bit
>>>>   arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
>>>>   arm64/mm: pte_clear(): New layer to manage contig bit
>>>>   arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
>>>>   arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
>>>>   arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
>>>>   arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
>>>>   arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
>>>>   arm64/mm: ptep_get(): New layer to manage contig bit
>>>>   arm64/mm: Split __flush_tlb_range() to elide trailing DSB
>>>>   arm64/mm: Wire up PTE_CONT for user mappings
>>>>   arm64/mm: Implement ptep_set_wrprotects() to optimize fork()
>>>>   arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown
>>>
>>> Hi Ryan,
>>> Not quite sure if I missed something, are we splitting/unfolding CONTPTES
>>> in the below cases
>>
>> The general idea is that the core-mm sets the individual ptes (one at a time if
>> it likes with set_pte_at(), or in a block with set_ptes()), modifies its
>> permissions (ptep_set_wrprotect(), ptep_set_access_flags()) and clears them
>> (ptep_clear(), etc); This is exactly the same interface as previously.
>>
>> BUT, the arm64 implementation of those interfaces will now detect when a set of
>> adjacent PTEs (a contpte block - so 16 naturally aligned entries when using 4K
>> base pages) are all appropriate for having the CONT_PTE bit set; in this case
>> the block is "folded". And it will detect when the first PTE in the block
>> changes such that the CONT_PTE bit must now be unset ("unfolded"). One of the
>> requirements for folding a contpte block is that all the pages must belong to
>> the *same* folio (that means its safe to only track access/dirty for thecontpte
>> block as a whole rather than for each individual pte).
>>
>> (there are a couple of optimizations that make the reality slightly more
>> complicated than what I've just explained, but you get the idea).
>>
>> On that basis, I believe all the specific cases you describe below are all
>> covered and safe - please let me know if you think there is a hole here!
>>
>>>
>>> 1. madvise(MADV_DONTNEED) on a part of basepages on a CONTPTE large folio
>>
>> The page will first be unmapped (e.g. ptep_clear() or ptep_get_and_clear(), or
>> whatever). The implementation of that will cause an unfold and the CONT_PTE bit
>> is removed from the whole contpte block. If there is then a subsequent
>> set_pte_at() to set a swap entry, the implementation will see that its not
>> appropriate to re-fold, so the range will remain unfolded.
>>
>>>
>>> 2. vma split in a large folio due to various reasons such as mprotect,
>>> munmap, mlock etc.
>>
>> I'm not sure if PTEs are explicitly unmapped/remapped when splitting a VMA? I
>> suspect not, so if the VMA is split in the middle of a currently folded contpte
>> block, it will remain folded. But this is safe and continues to work correctly.
>> The VMA arrangement is not important; it is just important that a single folio
>> is mapped contiguously across the whole block.
> 
> I don't think it is safe to keep CONTPTE folded in a split_vma case. as
> otherwise, copy_ptes in your other patch might only copy a part
> of CONTPES.
> For example,  if page0-page4 and page5-page15 are splitted in split_vma,
> in fork, while copying pte for the first VMA, we are copying page0-page4,
> this will immediately cause inconsistent CONTPTE. as we have to
> make sure all CONTPTEs are atomically mapped in a PTL.

No that's not how it works. The CONT_PTE bit is not blindly copied from parent
to child. It is explicitly managed by the arch code and set when appropriate. In
the case above, we will end up calling set_ptes() for page0-page4 in the child.
set_ptes() will notice that there are only 5 contiguous pages so it will map
without the CONT_PTE bit.

> 
>>
>>>
>>> 3. try_to_unmap_one() to reclaim a folio, ptes are scanned one by one
>>> rather than being as a whole.
>>
>> Yes, as per 1; the arm64 implementation will notice when the first entry is
>> cleared and unfold the contpte block.
>>
>>>
>>> In hardware, we need to make sure CONTPTE follow the rule - always 16
>>> contiguous physical address with CONTPTE set. if one of them run away
>>> from the 16 ptes group and PTEs become unconsistent, some terrible
>>> errors/faults can happen in HW. for example
>>
>> Yes, the implementation obeys all these rules; see contpte_try_fold() and
>> contpte_try_unfold(). the fold/unfold operation is only done when all
>> requirements are met, and we perform it in a manner that is conformant to the
>> architecture requirements (see contpte_fold() - being renamed to
>> contpte_convert() in the next version).
>>
>> Thanks for the review!
>>
>> Thanks,
>> Ryan
>>
>>>
>>> case0:
>>> addr0 PTE - has no CONTPE
>>> addr0+4kb PTE - has CONTPTE
>>> ....
>>> addr0+60kb PTE - has CONTPTE
>>>
>>> case 1:
>>> addr0 PTE - has no CONTPE
>>> addr0+4kb PTE - has CONTPTE
>>> ....
>>> addr0+60kb PTE - has swap
>>>
>>> Unconsistent 16 PTEs will lead to crash even in the firmware based on
>>> our observation.
>>>
> 
> Thanks
>  Barry
Barry Song Nov. 27, 2023, 10:53 p.m. UTC | #5
On Tue, Nov 28, 2023 at 12:11 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 27/11/2023 10:35, Barry Song wrote:
> > On Mon, Nov 27, 2023 at 10:15 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 27/11/2023 03:18, Barry Song wrote:
> >>>> Ryan Roberts (14):
> >>>>   mm: Batch-copy PTE ranges during fork()
> >>>>   arm64/mm: set_pte(): New layer to manage contig bit
> >>>>   arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
> >>>>   arm64/mm: pte_clear(): New layer to manage contig bit
> >>>>   arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
> >>>>   arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
> >>>>   arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
> >>>>   arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
> >>>>   arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
> >>>>   arm64/mm: ptep_get(): New layer to manage contig bit
> >>>>   arm64/mm: Split __flush_tlb_range() to elide trailing DSB
> >>>>   arm64/mm: Wire up PTE_CONT for user mappings
> >>>>   arm64/mm: Implement ptep_set_wrprotects() to optimize fork()
> >>>>   arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown
> >>>
> >>> Hi Ryan,
> >>> Not quite sure if I missed something, are we splitting/unfolding CONTPTES
> >>> in the below cases
> >>
> >> The general idea is that the core-mm sets the individual ptes (one at a time if
> >> it likes with set_pte_at(), or in a block with set_ptes()), modifies its
> >> permissions (ptep_set_wrprotect(), ptep_set_access_flags()) and clears them
> >> (ptep_clear(), etc); This is exactly the same interface as previously.
> >>
> >> BUT, the arm64 implementation of those interfaces will now detect when a set of
> >> adjacent PTEs (a contpte block - so 16 naturally aligned entries when using 4K
> >> base pages) are all appropriate for having the CONT_PTE bit set; in this case
> >> the block is "folded". And it will detect when the first PTE in the block
> >> changes such that the CONT_PTE bit must now be unset ("unfolded"). One of the
> >> requirements for folding a contpte block is that all the pages must belong to
> >> the *same* folio (that means its safe to only track access/dirty for thecontpte
> >> block as a whole rather than for each individual pte).
> >>
> >> (there are a couple of optimizations that make the reality slightly more
> >> complicated than what I've just explained, but you get the idea).
> >>
> >> On that basis, I believe all the specific cases you describe below are all
> >> covered and safe - please let me know if you think there is a hole here!
> >>
> >>>
> >>> 1. madvise(MADV_DONTNEED) on a part of basepages on a CONTPTE large folio
> >>
> >> The page will first be unmapped (e.g. ptep_clear() or ptep_get_and_clear(), or
> >> whatever). The implementation of that will cause an unfold and the CONT_PTE bit
> >> is removed from the whole contpte block. If there is then a subsequent
> >> set_pte_at() to set a swap entry, the implementation will see that its not
> >> appropriate to re-fold, so the range will remain unfolded.
> >>
> >>>
> >>> 2. vma split in a large folio due to various reasons such as mprotect,
> >>> munmap, mlock etc.
> >>
> >> I'm not sure if PTEs are explicitly unmapped/remapped when splitting a VMA? I
> >> suspect not, so if the VMA is split in the middle of a currently folded contpte
> >> block, it will remain folded. But this is safe and continues to work correctly.
> >> The VMA arrangement is not important; it is just important that a single folio
> >> is mapped contiguously across the whole block.
> >
> > I don't think it is safe to keep CONTPTE folded in a split_vma case. as
> > otherwise, copy_ptes in your other patch might only copy a part
> > of CONTPES.
> > For example,  if page0-page4 and page5-page15 are splitted in split_vma,
> > in fork, while copying pte for the first VMA, we are copying page0-page4,
> > this will immediately cause inconsistent CONTPTE. as we have to
> > make sure all CONTPTEs are atomically mapped in a PTL.
>
> No that's not how it works. The CONT_PTE bit is not blindly copied from parent
> to child. It is explicitly managed by the arch code and set when appropriate. In
> the case above, we will end up calling set_ptes() for page0-page4 in the child.
> set_ptes() will notice that there are only 5 contiguous pages so it will map
> without the CONT_PTE bit.

Ok. cool. alternatively, in the code I shared to you, we are doing an unfold
immediately when split_vma happens within a large anon folio, so we disallow
CONTPTE to cross two VMAs to avoid all kinds of complexity afterwards.

https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/huge_memory.c

#ifdef CONFIG_CONT_PTE_HUGEPAGE
void vma_adjust_cont_pte_trans_huge(struct vm_area_struct *vma,
    unsigned long start,
    unsigned long end,
    long adjust_next)
{
         /*
         * If the new start address isn't hpage aligned and it could
         * previously contain an hugepage: check if we need to split
         * an huge pmd.
         */
         if (start & ~HPAGE_CONT_PTE_MASK &&
             (start & HPAGE_CONT_PTE_MASK) >= vma->vm_start &&
             (start & HPAGE_CONT_PTE_MASK) + HPAGE_CONT_PTE_SIZE <= vma->vm_end)
                  split_huge_cont_pte_address(vma, start, false, NULL);

         ....
}
#endif

In your approach, you are still holding CONTPTE crossing two VMAs. but it seems
ok. I can't have a case which might fail in my brain right now. only
running the code on
a large amount of real hardware will tell :-)

>
> >
> >>
> >>>
> >>> 3. try_to_unmap_one() to reclaim a folio, ptes are scanned one by one
> >>> rather than being as a whole.
> >>
> >> Yes, as per 1; the arm64 implementation will notice when the first entry is
> >> cleared and unfold the contpte block.
> >>
> >>>
> >>> In hardware, we need to make sure CONTPTE follow the rule - always 16
> >>> contiguous physical address with CONTPTE set. if one of them run away
> >>> from the 16 ptes group and PTEs become unconsistent, some terrible
> >>> errors/faults can happen in HW. for example
> >>
> >> Yes, the implementation obeys all these rules; see contpte_try_fold() and
> >> contpte_try_unfold(). the fold/unfold operation is only done when all
> >> requirements are met, and we perform it in a manner that is conformant to the
> >> architecture requirements (see contpte_fold() - being renamed to
> >> contpte_convert() in the next version).
> >>
> >> Thanks for the review!
> >>
> >> Thanks,
> >> Ryan
> >>
> >>>
> >>> case0:
> >>> addr0 PTE - has no CONTPE
> >>> addr0+4kb PTE - has CONTPTE
> >>> ....
> >>> addr0+60kb PTE - has CONTPTE
> >>>
> >>> case 1:
> >>> addr0 PTE - has no CONTPE
> >>> addr0+4kb PTE - has CONTPTE
> >>> ....
> >>> addr0+60kb PTE - has swap
> >>>
> >>> Unconsistent 16 PTEs will lead to crash even in the firmware based on
> >>> our observation.
> >>>
> >

Thanks
Barry
Yang Shi Nov. 28, 2023, 3:13 a.m. UTC | #6
On Mon, Nov 27, 2023 at 1:15 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 27/11/2023 03:18, Barry Song wrote:
> >> Ryan Roberts (14):
> >>   mm: Batch-copy PTE ranges during fork()
> >>   arm64/mm: set_pte(): New layer to manage contig bit
> >>   arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
> >>   arm64/mm: pte_clear(): New layer to manage contig bit
> >>   arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
> >>   arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
> >>   arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
> >>   arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
> >>   arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
> >>   arm64/mm: ptep_get(): New layer to manage contig bit
> >>   arm64/mm: Split __flush_tlb_range() to elide trailing DSB
> >>   arm64/mm: Wire up PTE_CONT for user mappings
> >>   arm64/mm: Implement ptep_set_wrprotects() to optimize fork()
> >>   arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown
> >
> > Hi Ryan,
> > Not quite sure if I missed something, are we splitting/unfolding CONTPTES
> > in the below cases
>
> The general idea is that the core-mm sets the individual ptes (one at a time if
> it likes with set_pte_at(), or in a block with set_ptes()), modifies its
> permissions (ptep_set_wrprotect(), ptep_set_access_flags()) and clears them
> (ptep_clear(), etc); This is exactly the same interface as previously.
>
> BUT, the arm64 implementation of those interfaces will now detect when a set of
> adjacent PTEs (a contpte block - so 16 naturally aligned entries when using 4K
> base pages) are all appropriate for having the CONT_PTE bit set; in this case
> the block is "folded". And it will detect when the first PTE in the block
> changes such that the CONT_PTE bit must now be unset ("unfolded"). One of the
> requirements for folding a contpte block is that all the pages must belong to
> the *same* folio (that means its safe to only track access/dirty for thecontpte
> block as a whole rather than for each individual pte).
>
> (there are a couple of optimizations that make the reality slightly more
> complicated than what I've just explained, but you get the idea).
>
> On that basis, I believe all the specific cases you describe below are all
> covered and safe - please let me know if you think there is a hole here!
>
> >
> > 1. madvise(MADV_DONTNEED) on a part of basepages on a CONTPTE large folio
>
> The page will first be unmapped (e.g. ptep_clear() or ptep_get_and_clear(), or
> whatever). The implementation of that will cause an unfold and the CONT_PTE bit
> is removed from the whole contpte block. If there is then a subsequent
> set_pte_at() to set a swap entry, the implementation will see that its not
> appropriate to re-fold, so the range will remain unfolded.
>
> >
> > 2. vma split in a large folio due to various reasons such as mprotect,
> > munmap, mlock etc.
>
> I'm not sure if PTEs are explicitly unmapped/remapped when splitting a VMA? I
> suspect not, so if the VMA is split in the middle of a currently folded contpte
> block, it will remain folded. But this is safe and continues to work correctly.
> The VMA arrangement is not important; it is just important that a single folio
> is mapped contiguously across the whole block.

Even with different permissions, for example, read-only vs read-write?
The mprotect() may change the permission. It should be misprogramming
per ARM ARM.

>
> >
> > 3. try_to_unmap_one() to reclaim a folio, ptes are scanned one by one
> > rather than being as a whole.
>
> Yes, as per 1; the arm64 implementation will notice when the first entry is
> cleared and unfold the contpte block.
>
> >
> > In hardware, we need to make sure CONTPTE follow the rule - always 16
> > contiguous physical address with CONTPTE set. if one of them run away
> > from the 16 ptes group and PTEs become unconsistent, some terrible
> > errors/faults can happen in HW. for example
>
> Yes, the implementation obeys all these rules; see contpte_try_fold() and
> contpte_try_unfold(). the fold/unfold operation is only done when all
> requirements are met, and we perform it in a manner that is conformant to the
> architecture requirements (see contpte_fold() - being renamed to
> contpte_convert() in the next version).
>
> Thanks for the review!
>
> Thanks,
> Ryan
>
> >
> > case0:
> > addr0 PTE - has no CONTPE
> > addr0+4kb PTE - has CONTPTE
> > ....
> > addr0+60kb PTE - has CONTPTE
> >
> > case 1:
> > addr0 PTE - has no CONTPE
> > addr0+4kb PTE - has CONTPTE
> > ....
> > addr0+60kb PTE - has swap
> >
> > Unconsistent 16 PTEs will lead to crash even in the firmware based on
> > our observation.
> >
> > Thanks
> > Barry
> >
> >
>
>
Barry Song Nov. 28, 2023, 5:49 a.m. UTC | #7
On Mon, Nov 27, 2023 at 5:15 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 27/11/2023 03:18, Barry Song wrote:
> >> Ryan Roberts (14):
> >>   mm: Batch-copy PTE ranges during fork()
> >>   arm64/mm: set_pte(): New layer to manage contig bit
> >>   arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
> >>   arm64/mm: pte_clear(): New layer to manage contig bit
> >>   arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
> >>   arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
> >>   arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
> >>   arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
> >>   arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
> >>   arm64/mm: ptep_get(): New layer to manage contig bit
> >>   arm64/mm: Split __flush_tlb_range() to elide trailing DSB
> >>   arm64/mm: Wire up PTE_CONT for user mappings
> >>   arm64/mm: Implement ptep_set_wrprotects() to optimize fork()
> >>   arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown
> >
> > Hi Ryan,
> > Not quite sure if I missed something, are we splitting/unfolding CONTPTES
> > in the below cases
>
> The general idea is that the core-mm sets the individual ptes (one at a time if
> it likes with set_pte_at(), or in a block with set_ptes()), modifies its
> permissions (ptep_set_wrprotect(), ptep_set_access_flags()) and clears them
> (ptep_clear(), etc); This is exactly the same interface as previously.
>
> BUT, the arm64 implementation of those interfaces will now detect when a set of
> adjacent PTEs (a contpte block - so 16 naturally aligned entries when using 4K
> base pages) are all appropriate for having the CONT_PTE bit set; in this case
> the block is "folded". And it will detect when the first PTE in the block
> changes such that the CONT_PTE bit must now be unset ("unfolded"). One of the
> requirements for folding a contpte block is that all the pages must belong to
> the *same* folio (that means its safe to only track access/dirty for thecontpte
> block as a whole rather than for each individual pte).
>
> (there are a couple of optimizations that make the reality slightly more
> complicated than what I've just explained, but you get the idea).
>
> On that basis, I believe all the specific cases you describe below are all
> covered and safe - please let me know if you think there is a hole here!
>
> >
> > 1. madvise(MADV_DONTNEED) on a part of basepages on a CONTPTE large folio
>
> The page will first be unmapped (e.g. ptep_clear() or ptep_get_and_clear(), or
> whatever). The implementation of that will cause an unfold and the CONT_PTE bit
> is removed from the whole contpte block. If there is then a subsequent
> set_pte_at() to set a swap entry, the implementation will see that its not
> appropriate to re-fold, so the range will remain unfolded.
>
> >
> > 2. vma split in a large folio due to various reasons such as mprotect,
> > munmap, mlock etc.
>
> I'm not sure if PTEs are explicitly unmapped/remapped when splitting a VMA? I
> suspect not, so if the VMA is split in the middle of a currently folded contpte
> block, it will remain folded. But this is safe and continues to work correctly.
> The VMA arrangement is not important; it is just important that a single folio
> is mapped contiguously across the whole block.
>
> >
> > 3. try_to_unmap_one() to reclaim a folio, ptes are scanned one by one
> > rather than being as a whole.
>
> Yes, as per 1; the arm64 implementation will notice when the first entry is
> cleared and unfold the contpte block.
>
> >
> > In hardware, we need to make sure CONTPTE follow the rule - always 16
> > contiguous physical address with CONTPTE set. if one of them run away
> > from the 16 ptes group and PTEs become unconsistent, some terrible
> > errors/faults can happen in HW. for example
>
> Yes, the implementation obeys all these rules; see contpte_try_fold() and
> contpte_try_unfold(). the fold/unfold operation is only done when all
> requirements are met, and we perform it in a manner that is conformant to the
> architecture requirements (see contpte_fold() - being renamed to
> contpte_convert() in the next version).

Hi Ryan,

sorry for too many comments, I remembered another case

4. mremap

a CONTPTE might be remapped to another address which might not be
aligned with 16*basepage. thus, in move_ptes(), we are copying CONPTEs
from src to dst.
static int move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
                unsigned long old_addr, unsigned long old_end,
                struct vm_area_struct *new_vma, pmd_t *new_pmd,
                unsigned long new_addr, bool need_rmap_locks)
{
        struct mm_struct *mm = vma->vm_mm;
        pte_t *old_pte, *new_pte, pte;
        ...

        /*
         * We don't have to worry about the ordering of src and dst
         * pte locks because exclusive mmap_lock prevents deadlock.
         */
        old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
        if (!old_pte) {
                err = -EAGAIN;
                goto out;
        }
        new_pte = pte_offset_map_nolock(mm, new_pmd, new_addr, &new_ptl);
        if (!new_pte) {
                pte_unmap_unlock(old_pte, old_ptl);
                err = -EAGAIN;
                goto out;
        }
        if (new_ptl != old_ptl)
                spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
        flush_tlb_batched_pending(vma->vm_mm);
        arch_enter_lazy_mmu_mode();

        for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
                                   new_pte++, new_addr += PAGE_SIZE) {
                if (pte_none(ptep_get(old_pte)))
                        continue;

                pte = ptep_get_and_clear(mm, old_addr, old_pte);
       ....
}

This has two possibilities
1. new_pte is aligned with CONT_PTES, we can still keep CONTPTE;
2. new_pte is not aligned with CONT_PTES, we should drop CONTPTE
while copying.

does your code also handle this properly?

>
> Thanks for the review!
>
> Thanks,
> Ryan
>
> >
> > case0:
> > addr0 PTE - has no CONTPE
> > addr0+4kb PTE - has CONTPTE
> > ....
> > addr0+60kb PTE - has CONTPTE
> >
> > case 1:
> > addr0 PTE - has no CONTPE
> > addr0+4kb PTE - has CONTPTE
> > ....
> > addr0+60kb PTE - has swap
> >
> > Unconsistent 16 PTEs will lead to crash even in the firmware based on
> > our observation.
> >
> > Thanks
> > Barry

Thanks
Barry
Ryan Roberts Nov. 28, 2023, 11:52 a.m. UTC | #8
On 27/11/2023 22:53, Barry Song wrote:
> On Tue, Nov 28, 2023 at 12:11 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 27/11/2023 10:35, Barry Song wrote:
>>> On Mon, Nov 27, 2023 at 10:15 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 27/11/2023 03:18, Barry Song wrote:
>>>>>> Ryan Roberts (14):
>>>>>>   mm: Batch-copy PTE ranges during fork()
>>>>>>   arm64/mm: set_pte(): New layer to manage contig bit
>>>>>>   arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
>>>>>>   arm64/mm: pte_clear(): New layer to manage contig bit
>>>>>>   arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
>>>>>>   arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
>>>>>>   arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
>>>>>>   arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
>>>>>>   arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
>>>>>>   arm64/mm: ptep_get(): New layer to manage contig bit
>>>>>>   arm64/mm: Split __flush_tlb_range() to elide trailing DSB
>>>>>>   arm64/mm: Wire up PTE_CONT for user mappings
>>>>>>   arm64/mm: Implement ptep_set_wrprotects() to optimize fork()
>>>>>>   arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown
>>>>>
>>>>> Hi Ryan,
>>>>> Not quite sure if I missed something, are we splitting/unfolding CONTPTES
>>>>> in the below cases
>>>>
>>>> The general idea is that the core-mm sets the individual ptes (one at a time if
>>>> it likes with set_pte_at(), or in a block with set_ptes()), modifies its
>>>> permissions (ptep_set_wrprotect(), ptep_set_access_flags()) and clears them
>>>> (ptep_clear(), etc); This is exactly the same interface as previously.
>>>>
>>>> BUT, the arm64 implementation of those interfaces will now detect when a set of
>>>> adjacent PTEs (a contpte block - so 16 naturally aligned entries when using 4K
>>>> base pages) are all appropriate for having the CONT_PTE bit set; in this case
>>>> the block is "folded". And it will detect when the first PTE in the block
>>>> changes such that the CONT_PTE bit must now be unset ("unfolded"). One of the
>>>> requirements for folding a contpte block is that all the pages must belong to
>>>> the *same* folio (that means its safe to only track access/dirty for thecontpte
>>>> block as a whole rather than for each individual pte).
>>>>
>>>> (there are a couple of optimizations that make the reality slightly more
>>>> complicated than what I've just explained, but you get the idea).
>>>>
>>>> On that basis, I believe all the specific cases you describe below are all
>>>> covered and safe - please let me know if you think there is a hole here!
>>>>
>>>>>
>>>>> 1. madvise(MADV_DONTNEED) on a part of basepages on a CONTPTE large folio
>>>>
>>>> The page will first be unmapped (e.g. ptep_clear() or ptep_get_and_clear(), or
>>>> whatever). The implementation of that will cause an unfold and the CONT_PTE bit
>>>> is removed from the whole contpte block. If there is then a subsequent
>>>> set_pte_at() to set a swap entry, the implementation will see that its not
>>>> appropriate to re-fold, so the range will remain unfolded.
>>>>
>>>>>
>>>>> 2. vma split in a large folio due to various reasons such as mprotect,
>>>>> munmap, mlock etc.
>>>>
>>>> I'm not sure if PTEs are explicitly unmapped/remapped when splitting a VMA? I
>>>> suspect not, so if the VMA is split in the middle of a currently folded contpte
>>>> block, it will remain folded. But this is safe and continues to work correctly.
>>>> The VMA arrangement is not important; it is just important that a single folio
>>>> is mapped contiguously across the whole block.
>>>
>>> I don't think it is safe to keep CONTPTE folded in a split_vma case. as
>>> otherwise, copy_ptes in your other patch might only copy a part
>>> of CONTPES.
>>> For example,  if page0-page4 and page5-page15 are splitted in split_vma,
>>> in fork, while copying pte for the first VMA, we are copying page0-page4,
>>> this will immediately cause inconsistent CONTPTE. as we have to
>>> make sure all CONTPTEs are atomically mapped in a PTL.
>>
>> No that's not how it works. The CONT_PTE bit is not blindly copied from parent
>> to child. It is explicitly managed by the arch code and set when appropriate. In
>> the case above, we will end up calling set_ptes() for page0-page4 in the child.
>> set_ptes() will notice that there are only 5 contiguous pages so it will map
>> without the CONT_PTE bit.
> 
> Ok. cool. alternatively, in the code I shared to you, we are doing an unfold
> immediately when split_vma happens within a large anon folio, so we disallow
> CONTPTE to cross two VMAs to avoid all kinds of complexity afterwards.
> 
> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/huge_memory.c
> 
> #ifdef CONFIG_CONT_PTE_HUGEPAGE
> void vma_adjust_cont_pte_trans_huge(struct vm_area_struct *vma,
>     unsigned long start,
>     unsigned long end,
>     long adjust_next)
> {
>          /*
>          * If the new start address isn't hpage aligned and it could
>          * previously contain an hugepage: check if we need to split
>          * an huge pmd.
>          */
>          if (start & ~HPAGE_CONT_PTE_MASK &&
>              (start & HPAGE_CONT_PTE_MASK) >= vma->vm_start &&
>              (start & HPAGE_CONT_PTE_MASK) + HPAGE_CONT_PTE_SIZE <= vma->vm_end)
>                   split_huge_cont_pte_address(vma, start, false, NULL);
> 
>          ....
> }
> #endif
> 
> In your approach, you are still holding CONTPTE crossing two VMAs. but it seems
> ok. I can't have a case which might fail in my brain right now. only

Yes, I'm dealing with the CONT_PTE bit at the pgtable level, not at the VMA level.


> running the code on
> a large amount of real hardware will tell :-)

Indeed - is this something you might be able to help with? :)

> 
>>
>>>
>>>>
>>>>>
>>>>> 3. try_to_unmap_one() to reclaim a folio, ptes are scanned one by one
>>>>> rather than being as a whole.
>>>>
>>>> Yes, as per 1; the arm64 implementation will notice when the first entry is
>>>> cleared and unfold the contpte block.
>>>>
>>>>>
>>>>> In hardware, we need to make sure CONTPTE follow the rule - always 16
>>>>> contiguous physical address with CONTPTE set. if one of them run away
>>>>> from the 16 ptes group and PTEs become unconsistent, some terrible
>>>>> errors/faults can happen in HW. for example
>>>>
>>>> Yes, the implementation obeys all these rules; see contpte_try_fold() and
>>>> contpte_try_unfold(). the fold/unfold operation is only done when all
>>>> requirements are met, and we perform it in a manner that is conformant to the
>>>> architecture requirements (see contpte_fold() - being renamed to
>>>> contpte_convert() in the next version).
>>>>
>>>> Thanks for the review!
>>>>
>>>> Thanks,
>>>> Ryan
>>>>
>>>>>
>>>>> case0:
>>>>> addr0 PTE - has no CONTPE
>>>>> addr0+4kb PTE - has CONTPTE
>>>>> ....
>>>>> addr0+60kb PTE - has CONTPTE
>>>>>
>>>>> case 1:
>>>>> addr0 PTE - has no CONTPE
>>>>> addr0+4kb PTE - has CONTPTE
>>>>> ....
>>>>> addr0+60kb PTE - has swap
>>>>>
>>>>> Unconsistent 16 PTEs will lead to crash even in the firmware based on
>>>>> our observation.
>>>>>
>>>
> 
> Thanks
> Barry
Ryan Roberts Nov. 28, 2023, 11:58 a.m. UTC | #9
On 28/11/2023 03:13, Yang Shi wrote:
> On Mon, Nov 27, 2023 at 1:15 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 27/11/2023 03:18, Barry Song wrote:
>>>> Ryan Roberts (14):
>>>>   mm: Batch-copy PTE ranges during fork()
>>>>   arm64/mm: set_pte(): New layer to manage contig bit
>>>>   arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
>>>>   arm64/mm: pte_clear(): New layer to manage contig bit
>>>>   arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
>>>>   arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
>>>>   arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
>>>>   arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
>>>>   arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
>>>>   arm64/mm: ptep_get(): New layer to manage contig bit
>>>>   arm64/mm: Split __flush_tlb_range() to elide trailing DSB
>>>>   arm64/mm: Wire up PTE_CONT for user mappings
>>>>   arm64/mm: Implement ptep_set_wrprotects() to optimize fork()
>>>>   arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown
>>>
>>> Hi Ryan,
>>> Not quite sure if I missed something, are we splitting/unfolding CONTPTES
>>> in the below cases
>>
>> The general idea is that the core-mm sets the individual ptes (one at a time if
>> it likes with set_pte_at(), or in a block with set_ptes()), modifies its
>> permissions (ptep_set_wrprotect(), ptep_set_access_flags()) and clears them
>> (ptep_clear(), etc); This is exactly the same interface as previously.
>>
>> BUT, the arm64 implementation of those interfaces will now detect when a set of
>> adjacent PTEs (a contpte block - so 16 naturally aligned entries when using 4K
>> base pages) are all appropriate for having the CONT_PTE bit set; in this case
>> the block is "folded". And it will detect when the first PTE in the block
>> changes such that the CONT_PTE bit must now be unset ("unfolded"). One of the
>> requirements for folding a contpte block is that all the pages must belong to
>> the *same* folio (that means its safe to only track access/dirty for thecontpte
>> block as a whole rather than for each individual pte).
>>
>> (there are a couple of optimizations that make the reality slightly more
>> complicated than what I've just explained, but you get the idea).
>>
>> On that basis, I believe all the specific cases you describe below are all
>> covered and safe - please let me know if you think there is a hole here!
>>
>>>
>>> 1. madvise(MADV_DONTNEED) on a part of basepages on a CONTPTE large folio
>>
>> The page will first be unmapped (e.g. ptep_clear() or ptep_get_and_clear(), or
>> whatever). The implementation of that will cause an unfold and the CONT_PTE bit
>> is removed from the whole contpte block. If there is then a subsequent
>> set_pte_at() to set a swap entry, the implementation will see that its not
>> appropriate to re-fold, so the range will remain unfolded.
>>
>>>
>>> 2. vma split in a large folio due to various reasons such as mprotect,
>>> munmap, mlock etc.
>>
>> I'm not sure if PTEs are explicitly unmapped/remapped when splitting a VMA? I
>> suspect not, so if the VMA is split in the middle of a currently folded contpte
>> block, it will remain folded. But this is safe and continues to work correctly.
>> The VMA arrangement is not important; it is just important that a single folio
>> is mapped contiguously across the whole block.
> 
> Even with different permissions, for example, read-only vs read-write?
> The mprotect() may change the permission. It should be misprogramming
> per ARM ARM.

If the permissions are changed, then mprotect() must have called the pgtable
helpers to modify the page table (e.g. ptep_set_wrprotect(),
ptep_set_access_flags() or whatever). These functions will notice that the
contpte block is currently folded and unfold it before apply the permissions
change. The unfolding process is done in a way that intentionally avoids
misprogramming as defined by the Arm ARM. See contpte_fold() in contpte.c.

> 
>>
>>>
>>> 3. try_to_unmap_one() to reclaim a folio, ptes are scanned one by one
>>> rather than being as a whole.
>>
>> Yes, as per 1; the arm64 implementation will notice when the first entry is
>> cleared and unfold the contpte block.
>>
>>>
>>> In hardware, we need to make sure CONTPTE follow the rule - always 16
>>> contiguous physical address with CONTPTE set. if one of them run away
>>> from the 16 ptes group and PTEs become unconsistent, some terrible
>>> errors/faults can happen in HW. for example
>>
>> Yes, the implementation obeys all these rules; see contpte_try_fold() and
>> contpte_try_unfold(). the fold/unfold operation is only done when all
>> requirements are met, and we perform it in a manner that is conformant to the
>> architecture requirements (see contpte_fold() - being renamed to
>> contpte_convert() in the next version).
>>
>> Thanks for the review!
>>
>> Thanks,
>> Ryan
>>
>>>
>>> case0:
>>> addr0 PTE - has no CONTPE
>>> addr0+4kb PTE - has CONTPTE
>>> ....
>>> addr0+60kb PTE - has CONTPTE
>>>
>>> case 1:
>>> addr0 PTE - has no CONTPE
>>> addr0+4kb PTE - has CONTPTE
>>> ....
>>> addr0+60kb PTE - has swap
>>>
>>> Unconsistent 16 PTEs will lead to crash even in the firmware based on
>>> our observation.
>>>
>>> Thanks
>>> Barry
>>>
>>>
>>
>>
Ryan Roberts Nov. 28, 2023, 12:08 p.m. UTC | #10
On 28/11/2023 05:49, Barry Song wrote:
> On Mon, Nov 27, 2023 at 5:15 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 27/11/2023 03:18, Barry Song wrote:
>>>> Ryan Roberts (14):
>>>>   mm: Batch-copy PTE ranges during fork()
>>>>   arm64/mm: set_pte(): New layer to manage contig bit
>>>>   arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
>>>>   arm64/mm: pte_clear(): New layer to manage contig bit
>>>>   arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
>>>>   arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
>>>>   arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
>>>>   arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
>>>>   arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
>>>>   arm64/mm: ptep_get(): New layer to manage contig bit
>>>>   arm64/mm: Split __flush_tlb_range() to elide trailing DSB
>>>>   arm64/mm: Wire up PTE_CONT for user mappings
>>>>   arm64/mm: Implement ptep_set_wrprotects() to optimize fork()
>>>>   arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown
>>>
>>> Hi Ryan,
>>> Not quite sure if I missed something, are we splitting/unfolding CONTPTES
>>> in the below cases
>>
>> The general idea is that the core-mm sets the individual ptes (one at a time if
>> it likes with set_pte_at(), or in a block with set_ptes()), modifies its
>> permissions (ptep_set_wrprotect(), ptep_set_access_flags()) and clears them
>> (ptep_clear(), etc); This is exactly the same interface as previously.
>>
>> BUT, the arm64 implementation of those interfaces will now detect when a set of
>> adjacent PTEs (a contpte block - so 16 naturally aligned entries when using 4K
>> base pages) are all appropriate for having the CONT_PTE bit set; in this case
>> the block is "folded". And it will detect when the first PTE in the block
>> changes such that the CONT_PTE bit must now be unset ("unfolded"). One of the
>> requirements for folding a contpte block is that all the pages must belong to
>> the *same* folio (that means its safe to only track access/dirty for thecontpte
>> block as a whole rather than for each individual pte).
>>
>> (there are a couple of optimizations that make the reality slightly more
>> complicated than what I've just explained, but you get the idea).
>>
>> On that basis, I believe all the specific cases you describe below are all
>> covered and safe - please let me know if you think there is a hole here!
>>
>>>
>>> 1. madvise(MADV_DONTNEED) on a part of basepages on a CONTPTE large folio
>>
>> The page will first be unmapped (e.g. ptep_clear() or ptep_get_and_clear(), or
>> whatever). The implementation of that will cause an unfold and the CONT_PTE bit
>> is removed from the whole contpte block. If there is then a subsequent
>> set_pte_at() to set a swap entry, the implementation will see that its not
>> appropriate to re-fold, so the range will remain unfolded.
>>
>>>
>>> 2. vma split in a large folio due to various reasons such as mprotect,
>>> munmap, mlock etc.
>>
>> I'm not sure if PTEs are explicitly unmapped/remapped when splitting a VMA? I
>> suspect not, so if the VMA is split in the middle of a currently folded contpte
>> block, it will remain folded. But this is safe and continues to work correctly.
>> The VMA arrangement is not important; it is just important that a single folio
>> is mapped contiguously across the whole block.
>>
>>>
>>> 3. try_to_unmap_one() to reclaim a folio, ptes are scanned one by one
>>> rather than being as a whole.
>>
>> Yes, as per 1; the arm64 implementation will notice when the first entry is
>> cleared and unfold the contpte block.
>>
>>>
>>> In hardware, we need to make sure CONTPTE follow the rule - always 16
>>> contiguous physical address with CONTPTE set. if one of them run away
>>> from the 16 ptes group and PTEs become unconsistent, some terrible
>>> errors/faults can happen in HW. for example
>>
>> Yes, the implementation obeys all these rules; see contpte_try_fold() and
>> contpte_try_unfold(). the fold/unfold operation is only done when all
>> requirements are met, and we perform it in a manner that is conformant to the
>> architecture requirements (see contpte_fold() - being renamed to
>> contpte_convert() in the next version).
> 
> Hi Ryan,
> 
> sorry for too many comments, I remembered another case
> 
> 4. mremap
> 
> a CONTPTE might be remapped to another address which might not be
> aligned with 16*basepage. thus, in move_ptes(), we are copying CONPTEs
> from src to dst.
> static int move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
>                 unsigned long old_addr, unsigned long old_end,
>                 struct vm_area_struct *new_vma, pmd_t *new_pmd,
>                 unsigned long new_addr, bool need_rmap_locks)
> {
>         struct mm_struct *mm = vma->vm_mm;
>         pte_t *old_pte, *new_pte, pte;
>         ...
> 
>         /*
>          * We don't have to worry about the ordering of src and dst
>          * pte locks because exclusive mmap_lock prevents deadlock.
>          */
>         old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
>         if (!old_pte) {
>                 err = -EAGAIN;
>                 goto out;
>         }
>         new_pte = pte_offset_map_nolock(mm, new_pmd, new_addr, &new_ptl);
>         if (!new_pte) {
>                 pte_unmap_unlock(old_pte, old_ptl);
>                 err = -EAGAIN;
>                 goto out;
>         }
>         if (new_ptl != old_ptl)
>                 spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
>         flush_tlb_batched_pending(vma->vm_mm);
>         arch_enter_lazy_mmu_mode();
> 
>         for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
>                                    new_pte++, new_addr += PAGE_SIZE) {
>                 if (pte_none(ptep_get(old_pte)))
>                         continue;
> 
>                 pte = ptep_get_and_clear(mm, old_addr, old_pte);
>        ....
> }
> 
> This has two possibilities
> 1. new_pte is aligned with CONT_PTES, we can still keep CONTPTE;
> 2. new_pte is not aligned with CONT_PTES, we should drop CONTPTE
> while copying.
> 
> does your code also handle this properly?

Yes; same mechanism - the arm64 arch code does the CONT_PTE bit management and
folds/unfolds as neccessary.

Admittedly this may be non-optimal because we are iterating a single PTE at a
time. When we clear the first pte of a contpte block in the source, the block
will be unfolded. When we set the last pte of the contpte block in the dest, the
block will be folded. If we had a batching mechanism, we could just clear the
whole source contpte block in one hit (no need to unfold first) and we could
just set the dest contpte block in one hit (no need to fold at the end).

I haven't personally seen this as a hotspot though; I don't know if you have any
data to the contrary? I've followed this type of batching technique for the fork
case (patch 13). We could do a similar thing in theory, but its a bit more
complex because of the ptep_get_and_clear() return value; you would need to
return all ptes for the cleared range, or somehow collapse the actual info that
the caller requires (presumably access/dirty info).

> 
>>
>> Thanks for the review!
>>
>> Thanks,
>> Ryan
>>
>>>
>>> case0:
>>> addr0 PTE - has no CONTPE
>>> addr0+4kb PTE - has CONTPTE
>>> ....
>>> addr0+60kb PTE - has CONTPTE
>>>
>>> case 1:
>>> addr0 PTE - has no CONTPE
>>> addr0+4kb PTE - has CONTPTE
>>> ....
>>> addr0+60kb PTE - has swap
>>>
>>> Unconsistent 16 PTEs will lead to crash even in the firmware based on
>>> our observation.
>>>
>>> Thanks
>>> Barry
> 
> Thanks
> Barry
Barry Song Nov. 28, 2023, 7:37 p.m. UTC | #11
On Wed, Nov 29, 2023 at 1:08 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 28/11/2023 05:49, Barry Song wrote:
> > On Mon, Nov 27, 2023 at 5:15 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 27/11/2023 03:18, Barry Song wrote:
> >>>> Ryan Roberts (14):
> >>>>   mm: Batch-copy PTE ranges during fork()
> >>>>   arm64/mm: set_pte(): New layer to manage contig bit
> >>>>   arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
> >>>>   arm64/mm: pte_clear(): New layer to manage contig bit
> >>>>   arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
> >>>>   arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
> >>>>   arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
> >>>>   arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
> >>>>   arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
> >>>>   arm64/mm: ptep_get(): New layer to manage contig bit
> >>>>   arm64/mm: Split __flush_tlb_range() to elide trailing DSB
> >>>>   arm64/mm: Wire up PTE_CONT for user mappings
> >>>>   arm64/mm: Implement ptep_set_wrprotects() to optimize fork()
> >>>>   arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown
> >>>
> >>> Hi Ryan,
> >>> Not quite sure if I missed something, are we splitting/unfolding CONTPTES
> >>> in the below cases
> >>
> >> The general idea is that the core-mm sets the individual ptes (one at a time if
> >> it likes with set_pte_at(), or in a block with set_ptes()), modifies its
> >> permissions (ptep_set_wrprotect(), ptep_set_access_flags()) and clears them
> >> (ptep_clear(), etc); This is exactly the same interface as previously.
> >>
> >> BUT, the arm64 implementation of those interfaces will now detect when a set of
> >> adjacent PTEs (a contpte block - so 16 naturally aligned entries when using 4K
> >> base pages) are all appropriate for having the CONT_PTE bit set; in this case
> >> the block is "folded". And it will detect when the first PTE in the block
> >> changes such that the CONT_PTE bit must now be unset ("unfolded"). One of the
> >> requirements for folding a contpte block is that all the pages must belong to
> >> the *same* folio (that means its safe to only track access/dirty for thecontpte
> >> block as a whole rather than for each individual pte).
> >>
> >> (there are a couple of optimizations that make the reality slightly more
> >> complicated than what I've just explained, but you get the idea).
> >>
> >> On that basis, I believe all the specific cases you describe below are all
> >> covered and safe - please let me know if you think there is a hole here!
> >>
> >>>
> >>> 1. madvise(MADV_DONTNEED) on a part of basepages on a CONTPTE large folio
> >>
> >> The page will first be unmapped (e.g. ptep_clear() or ptep_get_and_clear(), or
> >> whatever). The implementation of that will cause an unfold and the CONT_PTE bit
> >> is removed from the whole contpte block. If there is then a subsequent
> >> set_pte_at() to set a swap entry, the implementation will see that its not
> >> appropriate to re-fold, so the range will remain unfolded.
> >>
> >>>
> >>> 2. vma split in a large folio due to various reasons such as mprotect,
> >>> munmap, mlock etc.
> >>
> >> I'm not sure if PTEs are explicitly unmapped/remapped when splitting a VMA? I
> >> suspect not, so if the VMA is split in the middle of a currently folded contpte
> >> block, it will remain folded. But this is safe and continues to work correctly.
> >> The VMA arrangement is not important; it is just important that a single folio
> >> is mapped contiguously across the whole block.
> >>
> >>>
> >>> 3. try_to_unmap_one() to reclaim a folio, ptes are scanned one by one
> >>> rather than being as a whole.
> >>
> >> Yes, as per 1; the arm64 implementation will notice when the first entry is
> >> cleared and unfold the contpte block.
> >>
> >>>
> >>> In hardware, we need to make sure CONTPTE follow the rule - always 16
> >>> contiguous physical address with CONTPTE set. if one of them run away
> >>> from the 16 ptes group and PTEs become unconsistent, some terrible
> >>> errors/faults can happen in HW. for example
> >>
> >> Yes, the implementation obeys all these rules; see contpte_try_fold() and
> >> contpte_try_unfold(). the fold/unfold operation is only done when all
> >> requirements are met, and we perform it in a manner that is conformant to the
> >> architecture requirements (see contpte_fold() - being renamed to
> >> contpte_convert() in the next version).
> >
> > Hi Ryan,
> >
> > sorry for too many comments, I remembered another case
> >
> > 4. mremap
> >
> > a CONTPTE might be remapped to another address which might not be
> > aligned with 16*basepage. thus, in move_ptes(), we are copying CONPTEs
> > from src to dst.
> > static int move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
> >                 unsigned long old_addr, unsigned long old_end,
> >                 struct vm_area_struct *new_vma, pmd_t *new_pmd,
> >                 unsigned long new_addr, bool need_rmap_locks)
> > {
> >         struct mm_struct *mm = vma->vm_mm;
> >         pte_t *old_pte, *new_pte, pte;
> >         ...
> >
> >         /*
> >          * We don't have to worry about the ordering of src and dst
> >          * pte locks because exclusive mmap_lock prevents deadlock.
> >          */
> >         old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
> >         if (!old_pte) {
> >                 err = -EAGAIN;
> >                 goto out;
> >         }
> >         new_pte = pte_offset_map_nolock(mm, new_pmd, new_addr, &new_ptl);
> >         if (!new_pte) {
> >                 pte_unmap_unlock(old_pte, old_ptl);
> >                 err = -EAGAIN;
> >                 goto out;
> >         }
> >         if (new_ptl != old_ptl)
> >                 spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
> >         flush_tlb_batched_pending(vma->vm_mm);
> >         arch_enter_lazy_mmu_mode();
> >
> >         for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
> >                                    new_pte++, new_addr += PAGE_SIZE) {
> >                 if (pte_none(ptep_get(old_pte)))
> >                         continue;
> >
> >                 pte = ptep_get_and_clear(mm, old_addr, old_pte);
> >        ....
> > }
> >
> > This has two possibilities
> > 1. new_pte is aligned with CONT_PTES, we can still keep CONTPTE;
> > 2. new_pte is not aligned with CONT_PTES, we should drop CONTPTE
> > while copying.
> >
> > does your code also handle this properly?
>
> Yes; same mechanism - the arm64 arch code does the CONT_PTE bit management and
> folds/unfolds as neccessary.
>
> Admittedly this may be non-optimal because we are iterating a single PTE at a
> time. When we clear the first pte of a contpte block in the source, the block
> will be unfolded. When we set the last pte of the contpte block in the dest, the
> block will be folded. If we had a batching mechanism, we could just clear the
> whole source contpte block in one hit (no need to unfold first) and we could
> just set the dest contpte block in one hit (no need to fold at the end).
>
> I haven't personally seen this as a hotspot though; I don't know if you have any
> data to the contrary? I've followed this type of batching technique for the fork
> case (patch 13). We could do a similar thing in theory, but its a bit more

in my previous testing, i don't see mremap quite often, so no worries.
as long as it is bug-free,
it is fine to me though a mremap microbench will definitely lose :-)

> complex because of the ptep_get_and_clear() return value; you would need to
> return all ptes for the cleared range, or somehow collapse the actual info that
> the caller requires (presumably access/dirty info).
>

Thanks
Barry