Message ID | cover.1722861064.git.zhengqi.arch@bytedance.com (mailing list archive) |
---|---|
Series | synchronously scan and reclaim empty user PTE pages |
Add the x86 mailing list.

On 2024/8/5 20:55, Qi Zheng wrote:
> Changes in RFC v2:
> - fix compilation errors in [RFC PATCH 5/7] and [RFC PATCH 7/7] reported
>   by the kernel test robot
> - use pte_offset_map_nolock() + pmd_same() instead of check_pmd_still_valid()
>   in retract_page_tables() (in [RFC PATCH 4/7])
> - rebase onto next-20240805
>
> Hi all,
>
> Previously, we tried to use a completely asynchronous method to reclaim empty
> user PTE pages [1]. After discussing with David Hildenbrand, we decided to
> implement synchronous reclamation in the madvise(MADV_DONTNEED) case as the
> first step.
>
> So this series aims to synchronously scan and reclaim empty user PTE pages in
> zap_page_range_single() (which madvise(MADV_DONTNEED) etc. will invoke). In
> zap_page_range_single(), mmu_gather is used to perform batch TLB flushing and
> page freeing operations. Therefore, if we want to free the empty PTE page in
> this path, the most natural way is to add it to mmu_gather as well. There are
> two problems that need to be solved here:
>
> 1. Currently, if CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected, mmu_gather
>    frees page table pages by semi-RCU:
>
>    - batch table freeing: asynchronous free by RCU
>    - single table freeing: IPI + synchronous free
>
>    But this is not enough to free the empty PTE page table pages in paths
>    other than the munmap and exit_mmap paths, because the IPI cannot be
>    synchronized with rcu_read_lock() in pte_offset_map{_lock}(). So single
>    table freeing should also be done by RCU, like batch table freeing.
>
> 2. When we use mmu_gather to batch-flush the TLB and free PTE pages, the TLB
>    is not flushed before the pmd lock is released. This may result in the
>    following two situations:
>
>    1) Userland can trigger a page fault and fill in a huge page, which will
>       cause a small-size TLB entry and a huge TLB entry to coexist for the
>       same address.
>
>    2) Userland can also trigger a page fault and fill in a PTE page, which
>       will cause two small-size TLB entries to coexist, but the PTE pages
>       they map are different.
>
> For case 1), according to Intel's TLB Application note (317080), some x86
> CPUs do not allow it:
>
> ```
> If software modifies the paging structures so that the page size used for a
> 4-KByte range of linear addresses changes, the TLBs may subsequently contain
> both ordinary and large-page translations for the address range. A reference
> to a linear address in the address range may use either translation. Which of
> the two translations is used may vary from one execution to another and the
> choice may be implementation-specific.
>
> Software wishing to prevent this uncertainty should not write to a paging-
> structure entry in a way that would change, for any linear address, both the
> page size and either the page frame or attributes. It can instead use the
> following algorithm: first mark the relevant paging-structure entry (e.g.,
> PDE) not present; then invalidate any translations for the affected linear
> addresses (see Section 5.2); and then modify the relevant paging-structure
> entry to mark it present and establish translation(s) for the new page size.
> ```
>
> We can also learn more from the comments above pmdp_invalidate() in
> __split_huge_pmd_locked().
>
> For case 2), we can see from the comments above ptep_clear_flush() in
> wp_page_copy() that this situation is also not allowed. Even without this
> patch series, madvise(MADV_DONTNEED) can already cause this situation:
>
>     CPU 0                             CPU 1
>
>     madvise(MADV_DONTNEED)
>     --> clear pte entry
>         pte_unmap_unlock
>                                       touch and tlb miss
>                                       --> set pte entry
>     mmu_gather flush tlb
>
> But strangely, I didn't see any relevant fix code. Maybe I missed something,
> or is this guaranteed by userland?
>
> Anyway, this series defines the following two functions to be implemented by
> the architecture. If the architecture does not allow the above two
> situations, it should define these two functions to flush the TLB before
> set_pmd_at():
>
> - arch_flush_tlb_before_set_huge_page
> - arch_flush_tlb_before_set_pte_page
>
> As a first step, we support this feature on x86_64 and select the newly
> introduced CONFIG_ARCH_SUPPORTS_PT_RECLAIM.
>
> In order to reduce overhead, we only handle the cases with a high probability
> of generating empty PTE pages; other cases are filtered out, such as:
>
> - hugetlb vma (unsuitable)
> - userfaultfd_wp vma (may reinstall the pte entry)
> - writable private file mapping case (COW-ed anon page is not zapped)
> - etc.
>
> For the userfaultfd_wp and writable private file mapping cases (and the
> MADV_FREE case, of course), we will consider scanning and freeing empty PTE
> pages asynchronously in the future.
>
> This series is based on next-20240805.
>
> Comments and suggestions are welcome!
>
> Thanks,
> Qi
>
> [1]. https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/
>
> Qi Zheng (7):
>   mm: pgtable: make pte_offset_map_nolock() return pmdval
>   mm: introduce CONFIG_PT_RECLAIM
>   mm: pass address information to pmd_install()
>   mm: pgtable: try to reclaim empty PTE pages in zap_page_range_single()
>   x86: mm: free page table pages by RCU instead of semi RCU
>   x86: mm: define arch_flush_tlb_before_set_huge_page
>   x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64
>
>  Documentation/mm/split_page_table_lock.rst |   3 +-
>  arch/arm/mm/fault-armv.c                   |   2 +-
>  arch/powerpc/mm/pgtable.c                  |   2 +-
>  arch/x86/Kconfig                           |   1 +
>  arch/x86/include/asm/pgtable.h             |   6 +
>  arch/x86/include/asm/tlb.h                 |  19 +++
>  arch/x86/kernel/paravirt.c                 |   7 ++
>  arch/x86/mm/pgtable.c                      |  23 +++-
>  include/linux/hugetlb.h                    |   2 +-
>  include/linux/mm.h                         |  13 +-
>  include/linux/pgtable.h                    |  14 +++
>  mm/Kconfig                                 |  14 +++
>  mm/Makefile                                |   1 +
>  mm/debug_vm_pgtable.c                      |   2 +-
>  mm/filemap.c                               |   4 +-
>  mm/gup.c                                   |   2 +-
>  mm/huge_memory.c                           |   3 +
>  mm/internal.h                              |  17 ++-
>  mm/khugepaged.c                            |  32 +++--
>  mm/memory.c                                |  21 ++--
>  mm/migrate_device.c                        |   2 +-
>  mm/mmu_gather.c                            |   9 +-
>  mm/mprotect.c                              |   8 +-
>  mm/mremap.c                                |   4 +-
>  mm/page_vma_mapped.c                       |   2 +-
>  mm/pgtable-generic.c                       |  21 ++--
>  mm/pt_reclaim.c                            | 131 +++++++++++++++++++++
>  mm/userfaultfd.c                           |  10 +-
>  mm/vmscan.c                                |   2 +-
>  29 files changed, 321 insertions(+), 56 deletions(-)
>  create mode 100644 mm/pt_reclaim.c
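To make the proposed hook interface above concrete, here is a minimal sketch of what an x86_64 arch_flush_tlb_before_set_huge_page() could look like. The hook name comes from the cover letter; the body, which checks mm_tlb_flush_pending() and flushes the PMD-sized range with flush_tlb_mm_range() before the caller installs the huge mapping with set_pmd_at(), is only an assumption for illustration and is not taken from the actual patches.

```
#include <linux/mm.h>
#include <asm/tlbflush.h>

/*
 * Hedged sketch (not the actual patch): flush any translations for the
 * PMD-sized range before a huge mapping is installed with set_pmd_at(),
 * so stale 4-KByte translations cannot coexist with the new large-page
 * translation.
 */
static inline void arch_flush_tlb_before_set_huge_page(struct mm_struct *mm,
							unsigned long addr)
{
	unsigned long start = ALIGN_DOWN(addr, PMD_SIZE);

	/* Assumption: only needed while a deferred mmu_gather flush is pending. */
	if (mm_tlb_flush_pending(mm))
		flush_tlb_mm_range(mm, start, start + PMD_SIZE, PAGE_SHIFT, false);
}
```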
Hi all,

On 2024/8/5 20:55, Qi Zheng wrote:

[...]

>
> 2. When we use mmu_gather to batch-flush the TLB and free PTE pages, the TLB
>    is not flushed before the pmd lock is released. This may result in the
>    following two situations:
>
>    1) Userland can trigger a page fault and fill in a huge page, which will
>       cause a small-size TLB entry and a huge TLB entry to coexist for the
>       same address.
>
>    2) Userland can also trigger a page fault and fill in a PTE page, which
>       will cause two small-size TLB entries to coexist, but the PTE pages
>       they map are different.
>
> For case 1), according to Intel's TLB Application note (317080), some x86
> CPUs do not allow it:
>
> [...]
>
> For case 2), we can see from the comments above ptep_clear_flush() in
> wp_page_copy() that this situation is also not allowed. Even without this
> patch series, madvise(MADV_DONTNEED) can already cause this situation:
>
>     CPU 0                             CPU 1
>
>     madvise(MADV_DONTNEED)
>     --> clear pte entry
>         pte_unmap_unlock
>                                       touch and tlb miss
>                                       --> set pte entry
>     mmu_gather flush tlb
>
> But strangely, I didn't see any relevant fix code. Maybe I missed something,
> or is this guaranteed by userland?

I'm still quite confused about this; is there anyone who is familiar with
this part?

Thanks,
Qi

>
> Anyway, this series defines the following two functions to be implemented by
> the architecture. If the architecture does not allow the above two
> situations, it should define these two functions to flush the TLB before
> set_pmd_at():
>
> - arch_flush_tlb_before_set_huge_page
> - arch_flush_tlb_before_set_pte_page
>

[...]
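As background for case 2), the pattern referenced above from wp_page_copy() is to clear the old PTE and flush the TLB while the PTE lock is still held, so that no other CPU can fault in a second translation while a stale one may still be cached. The sketch below illustrates that pattern; the helper name and surrounding context are hypothetical, while ptep_clear_flush() and set_pte_at() are the real interfaces involved.

```
#include <linux/mm.h>

/*
 * Illustration only (hypothetical helper): replace a present PTE without
 * ever letting two translations for the same address coexist. The PTE lock
 * blocks concurrent faults, and ptep_clear_flush() clears the entry and
 * flushes the TLB for @addr before the new entry is installed.
 */
static void replace_pte_locked(struct vm_area_struct *vma, unsigned long addr,
			       pte_t *pte, pte_t new_pte, spinlock_t *ptl)
{
	spin_lock(ptl);
	ptep_clear_flush(vma, addr, pte);
	set_pte_at(vma->vm_mm, addr, pte, new_pte);
	spin_unlock(ptl);
}
```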
On 2024/8/6 11:31, Qi Zheng wrote:
> Hi all,
>
> On 2024/8/5 20:55, Qi Zheng wrote:
>
> [...]
>
>>
>> 2. When we use mmu_gather to batch-flush the TLB and free PTE pages, the
>>    TLB is not flushed before the pmd lock is released. This may result in
>>    the following two situations:
>>
>>    1) Userland can trigger a page fault and fill in a huge page, which will
>>       cause a small-size TLB entry and a huge TLB entry to coexist for the
>>       same address.
>>
>>    2) Userland can also trigger a page fault and fill in a PTE page, which
>>       will cause two small-size TLB entries to coexist, but the PTE pages
>>       they map are different.
>>
>> [...]
>>
>> For case 2), we can see from the comments above ptep_clear_flush() in
>> wp_page_copy() that this situation is also not allowed. Even without this
>> patch series, madvise(MADV_DONTNEED) can already cause this situation:
>>
>>     CPU 0                             CPU 1
>>
>>     madvise(MADV_DONTNEED)
>>     --> clear pte entry
>>         pte_unmap_unlock
>>                                       touch and tlb miss
>>                                       --> set pte entry
>>     mmu_gather flush tlb
>>
>> But strangely, I didn't see any relevant fix code. Maybe I missed
>> something, or is this guaranteed by userland?
>
> I'm still quite confused about this; is there anyone who is familiar with
> this part?

This is not a new issue introduced by this patch series, and I have sent a
separate RFC patch [1] to track it. I will remove this part of the handling
in the next version.

[1]. https://lore.kernel.org/lkml/20240815120715.14516-1-zhengqi.arch@bytedance.com/

> Thanks,
> Qi
>
>>
>> Anyway, this series defines the following two functions to be implemented
>> by the architecture. If the architecture does not allow the above two
>> situations, it should define these two functions to flush the TLB before
>> set_pmd_at():
>>
>> - arch_flush_tlb_before_set_huge_page
>> - arch_flush_tlb_before_set_pte_page
>>
>
> [...]
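As background for the pte_offset_map_nolock() + pmd_same() change mentioned in the RFC v2 changelog, the sketch below shows how a caller might revalidate the PMD after taking the PTE lock, catching the case where the PTE page was reclaimed or replaced in the meantime. The extra pmdvalp parameter is assumed from the patch title ("mm: pgtable: make pte_offset_map_nolock() return pmdval") and may not match the final interface; the function name is hypothetical.

```
#include <linux/mm.h>

/* Hypothetical caller, for illustration only. */
static bool walk_ptes_example(struct mm_struct *mm, pmd_t *pmd,
			      unsigned long addr)
{
	spinlock_t *ptl;
	pmd_t pmdval;
	pte_t *pte;

	/* Assumed interface: pte_offset_map_nolock() also returns the pmd value. */
	pte = pte_offset_map_nolock(mm, pmd, &pmdval, addr, &ptl);
	if (!pte)
		return false;

	spin_lock(ptl);
	/* Recheck that the PTE page was not freed or replaced in the meantime. */
	if (unlikely(!pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
		spin_unlock(ptl);
		pte_unmap(pte);
		return false;
	}

	/* ... operate on the PTEs while holding ptl ... */

	spin_unlock(ptl);
	pte_unmap(pte);
	return true;
}
```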