Message ID | cover.1719570849.git.zhengqi.arch@bytedance.com (mailing list archive) |
---|---|
Headers | show |
Series | synchronously scan and reclaim empty user PTE pages | expand |
Add the x86 mailing list that I forgot to CC before. On 2024/7/1 16:46, Qi Zheng wrote: > Hi all, > > Previously, we tried to use a completely asynchronous method to reclaim empty > user PTE pages [1]. After discussing with David Hildenbrand, we decided to > implement synchronous reclaimation in the case of madvise(MADV_DONTNEED) as the > first step. > > So this series aims to synchronously scan and reclaim empty user PTE pages in > zap_page_range_single() (madvise(MADV_DONTNEED) etc will invoke this). In > zap_page_range_single(), mmu_gather is used to perform batch tlb flushing and > page freeing operations. Therefore, if we want to free the empty PTE page in > this path, the most natural way is to add it to mmu_gather as well. There are > two problems that need to be solved here: > > 1. Now, if CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected, mmu_gather will free > page table pages by semi RCU: > > - batch table freeing: asynchronous free by RCU > - single table freeing: IPI + synchronous free > > But this is not enough to free the empty PTE page table pages in paths other > that munmap and exit_mmap path, because IPI cannot be synchronized with > rcu_read_lock() in pte_offset_map{_lock}(). So we should let single table > also be freed by RCU like batch table freeing. > > 2. When we use mmu_gather to batch flush tlb and free PTE pages, the TLB is not > flushed before pmd lock is unlocked. This may result in the following two > situations: > > 1) Userland can trigger page fault and fill a huge page, which will cause > the existence of small size TLB and huge TLB for the same address. > > 2) Userland can also trigger page fault and fill a PTE page, which will > cause the existence of two small size TLBs, but the PTE page they map > are different. > > For case 1), according to Intel's TLB Application note (317080), some CPUs of > x86 do not allow it: > > ``` > If software modifies the paging structures so that the page size used for a > 4-KByte range of linear addresses changes, the TLBs may subsequently contain > both ordinary and large-page translations for the address range.12 A reference > to a linear address in the address range may use either translation. Which of > the two translations is used may vary from one execution to another and the > choice may be implementation-specific. > > Software wishing to prevent this uncertainty should not write to a paging- > structure entry in a way that would change, for any linear address, both the > page size and either the page frame or attributes. It can instead use the > following algorithm: first mark the relevant paging-structure entry (e.g., > PDE) not present; then invalidate any translations for the affected linear > addresses (see Section 5.2); and then modify the relevant paging-structure > entry to mark it present and establish translation(s) for the new page size. > ``` > > We can also learn more information from the comments above pmdp_invalidate() > in __split_huge_pmd_locked(). > > For case 2), we can see from the comments above ptep_clear_flush() in > wp_page_copy() that this situation is also not allowed. Even without > this patch series, madvise(MADV_DONTNEED) can also cause this situation: > > CPU 0 CPU 1 > > madvise (MADV_DONTNEED) > --> clear pte entry > pte_unmap_unlock > touch and tlb miss > --> set pte entry > mmu_gather flush tlb > > But strangely, I didn't see any relevant fix code, maybe I missed something, > or is this guaranteed by userland? > > Anyway, this series defines the following two functions to be implemented by > the architecture. If the architecture does not allow the above two situations, > then define these two functions to flush the tlb before set_pmd_at(). > > - arch_flush_tlb_before_set_huge_page > - arch_flush_tlb_before_set_pte_page > > As a first step, we supported this feature on x86_64 and selectd the newly > introduced CONFIG_ARCH_SUPPORTS_PT_RECLAIM. > > In order to reduce overhead, we only handle the cases with a high probability > of generating empty PTE pages, and other cases will be filtered out, such as: > > - hugetlb vma (unsuitable) > - userfaultfd_wp vma (may reinstall the pte entry) > - writable private file mapping case (COW-ed anon page is not zapped) > - etc > > For userfaultfd_wp and writable private file mapping cases (and MADV_FREE case, > of course), consider scanning and freeing empty PTE pages asynchronously in > the future. > > This series is based on next-20240627. > > Comments and suggestions are welcome! > > Thanks, > Qi > > [1]. https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/ > > Qi Zheng (7): > mm: pgtable: make pte_offset_map_nolock() return pmdval > mm: introduce CONFIG_PT_RECLAIM > mm: pass address information to pmd_install() > mm: pgtable: try to reclaim empty PTE pages in zap_page_range_single() > x86: mm: free page table pages by RCU instead of semi RCU > x86: mm: define arch_flush_tlb_before_set_huge_page > x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64 > > Documentation/mm/split_page_table_lock.rst | 3 +- > arch/arm/mm/fault-armv.c | 2 +- > arch/powerpc/mm/pgtable.c | 2 +- > arch/x86/Kconfig | 1 + > arch/x86/include/asm/pgtable.h | 6 + > arch/x86/include/asm/tlb.h | 23 ++++ > arch/x86/kernel/paravirt.c | 7 ++ > arch/x86/mm/pgtable.c | 15 ++- > include/linux/hugetlb.h | 2 +- > include/linux/mm.h | 13 +- > include/linux/pgtable.h | 14 +++ > mm/Kconfig | 14 +++ > mm/Makefile | 1 + > mm/debug_vm_pgtable.c | 2 +- > mm/filemap.c | 4 +- > mm/gup.c | 2 +- > mm/huge_memory.c | 3 + > mm/internal.h | 17 ++- > mm/khugepaged.c | 24 +++- > mm/memory.c | 21 ++-- > mm/migrate_device.c | 2 +- > mm/mmu_gather.c | 2 +- > mm/mprotect.c | 8 +- > mm/mremap.c | 4 +- > mm/page_vma_mapped.c | 2 +- > mm/pgtable-generic.c | 21 ++-- > mm/pt_reclaim.c | 131 +++++++++++++++++++++ > mm/userfaultfd.c | 10 +- > mm/vmscan.c | 2 +- > 29 files changed, 307 insertions(+), 51 deletions(-) > create mode 100644 mm/pt_reclaim.c >