Message ID | cover.1733305182.git.zhengqi.arch@bytedance.com (mailing list archive)
---|---
Series | synchronously scan and reclaim empty user PTE pages
On Wed, 4 Dec 2024 19:09:40 +0800 Qi Zheng <zhengqi.arch@bytedance.com> wrote:

> ...
>
> Previously, we tried to use a completely asynchronous method to reclaim empty
> user PTE pages [1]. After discussing with David Hildenbrand, we decided to
> implement synchronous reclamation in the case of madvise(MADV_DONTNEED) as the
> first step.

Please help us understand what the other steps are. Because we don't want to
commit to a particular partial implementation only to later discover that
completing that implementation causes us problems.

> So this series aims to synchronously free the empty PTE pages in the
> madvise(MADV_DONTNEED) case. We will detect and free empty PTE pages in
> zap_pte_range(), and will add zap_details.reclaim_pt to exclude cases other
> than madvise(MADV_DONTNEED).
>
> In zap_pte_range(), mmu_gather is used to perform batch TLB flushing and page
> freeing operations. Therefore, if we want to free the empty PTE page in this
> path, the most natural way is to add it to mmu_gather as well. Now, if
> CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected, mmu_gather will free page table
> pages by semi RCU:
>
> - batch table freeing: asynchronous free by RCU
> - single table freeing: IPI + synchronous free
>
> But this is not enough to free the empty PTE page table pages in paths other
> than the munmap and exit_mmap paths, because IPI cannot be synchronized with
> rcu_read_lock() in pte_offset_map{_lock}(). So we should let single tables
> also be freed by RCU like batch table freeing.
>
> As a first step, we supported this feature on x86_64 and selected the newly
> introduced CONFIG_ARCH_SUPPORTS_PT_RECLAIM.
>
> For other cases such as madvise(MADV_FREE), consider scanning and freeing
> empty PTE pages asynchronously in the future.

Handling MADV_FREE sounds fairly straightforward?
On Wed, Dec 4, 2024 at 11:49 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> On Wed, 4 Dec 2024 19:09:40 +0800 Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> > But this is not enough to free the empty PTE page table pages in paths other
> > than the munmap and exit_mmap paths, because IPI cannot be synchronized with
> > rcu_read_lock() in pte_offset_map{_lock}(). So we should let single tables
> > also be freed by RCU like batch table freeing.
> >
> > As a first step, we supported this feature on x86_64 and selected the newly
> > introduced CONFIG_ARCH_SUPPORTS_PT_RECLAIM.
> >
> > For other cases such as madvise(MADV_FREE), consider scanning and freeing
> > empty PTE pages asynchronously in the future.
>
> Handling MADV_FREE sounds fairly straightforward?

AFAIU MADV_FREE usually doesn't immediately clear PTEs (except if they are
swap/hwpoison/... PTEs). So the easy thing to do would be to check whether the
page table has become empty within madvise(), but I think the most likely case
would be that PTEs still remain (and will be asynchronously zapped later when
memory pressure causes reclaim, or something like that).

So I don't see an easy path to doing it for MADV_FREE.
On 2024/12/5 06:49, Andrew Morton wrote:
> On Wed, 4 Dec 2024 19:09:40 +0800 Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>
>> ...
>>
>> Previously, we tried to use a completely asynchronous method to reclaim empty
>> user PTE pages [1]. After discussing with David Hildenbrand, we decided to
>> implement synchronous reclamation in the case of madvise(MADV_DONTNEED) as
>> the first step.
>
> Please help us understand what the other steps are. Because we don't
> want to commit to a particular partial implementation only to later
> discover that completing that implementation causes us problems.

Although it is the first step, it is relatively independent, because it solves
the problem (huge PTE memory usage) in the madvise(MADV_DONTNEED) case, while
the other steps are to solve the problem in other cases. I can briefly
describe all the plans in my mind here:

First step
==========

I plan to implement synchronous reclamation of empty user PTE pages in the
madvise(MADV_DONTNEED) case for the following reasons:

1. It covers most of the known cases. (On ByteDance servers, all the problems
   of huge PTE memory usage are in this case.)
2. It helps verify the lock protection scheme and other infrastructure.

This is what this patch series is doing (only supporting x86). Once this is
done, support for more architectures will be added.

Second step
===========

I plan to implement asynchronous reclamation for madvise(MADV_FREE) and other
cases. The initial idea is to mark the vma first, then add the corresponding
mm to a global linked list, and then perform asynchronous scanning and
reclamation in the memory reclamation process.

Third step
==========

Based on the above infrastructure, we may try to reclaim all full-zero PTE
pages (where all pte entries map the zero page), which will be beneficial to
the memory balloon case mentioned by David Hildenbrand.
Another plan
============

Currently, page table modifications are protected by the page table locks
(page_table_lock or the split pmd/pte locks), but the life cycle of page table
pages is protected by mmap_lock (and the vma lock). For more details, please
refer to the newly added Documentation/mm/process_addrs.rst file.

Currently we try to free the PTE pages through RCU when CONFIG_PT_RECLAIM is
turned on. In this case, we will no longer need to hold mmap_lock for the
read/write operations on the PTE pages. So maybe we can remove the page tables
from the protection of the mmap lock (which is too big-hammer), like this:

1. Free all levels of page table pages by RCU, not just PTE pages, but also
   pmd, pud, etc.
2. Similar to pte_offset_map/pte_unmap, add [pmd|pud]_offset_map and
   [pmd|pud]_unmap, make them all contain rcu_read_lock/rcu_read_unlock, and
   make them accept failure.

In this way, we no longer need the mmap lock. For readers, such as page table
walkers, we are already in the RCU critical section. For writers, we only need
to hold the page table lock.

But there is a difficulty here: the RCU critical section is not allowed to
sleep, yet it is possible to sleep in the callback function of .pmd_entry,
such as mmu_notifier_invalidate_range_start().

Use SRCU instead? Or use RCU + a refcount method? Not sure. But I think it's
an interesting thing to try.

Thanks!
On 2024/12/5 06:56, Jann Horn wrote:
> On Wed, Dec 4, 2024 at 11:49 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>> On Wed, 4 Dec 2024 19:09:40 +0800 Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>>> But this is not enough to free the empty PTE page table pages in paths other
>>> than the munmap and exit_mmap paths, because IPI cannot be synchronized with
>>> rcu_read_lock() in pte_offset_map{_lock}(). So we should let single tables
>>> also be freed by RCU like batch table freeing.
>>>
>>> As a first step, we supported this feature on x86_64 and selected the newly
>>> introduced CONFIG_ARCH_SUPPORTS_PT_RECLAIM.
>>>
>>> For other cases such as madvise(MADV_FREE), consider scanning and freeing
>>> empty PTE pages asynchronously in the future.
>>
>> Handling MADV_FREE sounds fairly straightforward?
>
> AFAIU MADV_FREE usually doesn't immediately clear PTEs (except if they
> are swap/hwpoison/... PTEs). So the easy thing to do would be to check
> whether the page table has become empty within madvise(), but I think
> the most likely case would be that PTEs still remain (and will be
> asynchronously zapped later when memory pressure causes reclaim, or
> something like that).
>
> So I don't see an easy path to doing it for MADV_FREE.

+1. Thanks for helping explain!
Hi Andrew,

I have sent patches [1][2][3] to fix recently reported issues:

[1]. https://lore.kernel.org/lkml/20241210084156.89877-1-zhengqi.arch@bytedance.com/
     (fix warning, needs to be folded into [PATCH v4 02/11])
[2]. https://lore.kernel.org/lkml/20241206112348.51570-1-zhengqi.arch@bytedance.com/
     (fix uninitialized symbol, needs to be folded into [PATCH v4 09/11])
[3]. https://lore.kernel.org/lkml/20241210084431.91414-1-zhengqi.arch@bytedance.com/
     (fix UAF, needs to be placed before [PATCH v4 11/11])

If you need me to re-post a complete v5, please let me know.

Thanks,
Qi

On 2024/12/4 19:09, Qi Zheng wrote:
> Changes in v4:
> - update process_addrs.rst in [PATCH v4 01/11]
>   (suggested by Lorenzo Stoakes)
> - fix [PATCH v3 4/9] and move it after [PATCH v3 5/9]
>   (pointed out by David Hildenbrand)
> - change to use any_skipped instead of rechecking pte_none() to detect empty
>   user PTE pages (suggested by David Hildenbrand)
> - rebase onto next-20241203
>
> Changes in v3:
> - recheck pmd state instead of pmd_same() in retract_page_tables()
>   (suggested by Jann Horn)
> - recheck dst_pmd entry in move_pages_pte() (pointed out by Jann Horn)
> - introduce new skip_none_ptes() (suggested by David Hildenbrand)
> - minor changes in [PATCH v2 5/7]
> - remove tlb_remove_table_sync_one() if CONFIG_PT_RECLAIM is enabled
> - use put_page() instead of free_page_and_swap_cache() in
>   __tlb_remove_table_one_rcu() (pointed out by Jann Horn)
> - collect the Reviewed-bys and Acked-bys
> - rebase onto next-20241112
>
> Changes in v2:
> - fix [PATCH v1 1/7] (Jann Horn)
> - reset force_flush and force_break to false in [PATCH v1 2/7] (Jann Horn)
> - introduce zap_nonpresent_ptes() and do_zap_pte_range()
> - check pte_none() instead of can_reclaim_pt after the processing of PTEs
>   (remove [PATCH v1 3/7] and [PATCH v1 4/7])
> - reorder patches
> - rebase onto next-20241031
>
> Changes in v1:
> - replace [RFC PATCH 1/7] with a separate series (already merged into
>   mm-unstable):
>   https://lore.kernel.org/lkml/cover.1727332572.git.zhengqi.arch@bytedance.com/
>   (suggested by David Hildenbrand)
> - squash [RFC PATCH 2/7] into [RFC PATCH 4/7]
>   (suggested by David Hildenbrand)
> - change to scan and reclaim empty user PTE pages in zap_pte_range()
>   (suggested by David Hildenbrand)
> - send a separate RFC patch to track the TLB flushing issue, and remove that
>   part from this series ([RFC PATCH 3/7] and [RFC PATCH 6/7]).
>   link: https://lore.kernel.org/lkml/20240815120715.14516-1-zhengqi.arch@bytedance.com/
> - add [PATCH v1 1/7] into this series
> - drop RFC tag
> - rebase onto next-20241011
>
> Changes in RFC v2:
> - fix compilation errors in [RFC PATCH 5/7] and [RFC PATCH 7/7] reported by
>   the kernel test robot
> - use pte_offset_map_nolock() + pmd_same() instead of check_pmd_still_valid()
>   in retract_page_tables() (in [RFC PATCH 4/7])
> - rebase onto next-20240805
>
> Hi all,
>
> Previously, we tried to use a completely asynchronous method to reclaim empty
> user PTE pages [1]. After discussing with David Hildenbrand, we decided to
> implement synchronous reclamation in the case of madvise(MADV_DONTNEED) as the
> first step.
>
> So this series aims to synchronously free the empty PTE pages in the
> madvise(MADV_DONTNEED) case.
> We will detect and free empty PTE pages in
> zap_pte_range(), and will add zap_details.reclaim_pt to exclude cases other
> than madvise(MADV_DONTNEED).
>
> In zap_pte_range(), mmu_gather is used to perform batch TLB flushing and page
> freeing operations. Therefore, if we want to free the empty PTE page in this
> path, the most natural way is to add it to mmu_gather as well. Now, if
> CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected, mmu_gather will free page table
> pages by semi RCU:
>
> - batch table freeing: asynchronous free by RCU
> - single table freeing: IPI + synchronous free
>
> But this is not enough to free the empty PTE page table pages in paths other
> than the munmap and exit_mmap paths, because IPI cannot be synchronized with
> rcu_read_lock() in pte_offset_map{_lock}(). So we should let single tables
> also be freed by RCU like batch table freeing.
>
> As a first step, we supported this feature on x86_64 and selected the newly
> introduced CONFIG_ARCH_SUPPORTS_PT_RECLAIM.
>
> For other cases such as madvise(MADV_FREE), consider scanning and freeing
> empty PTE pages asynchronously in the future.
>
> This series is based on next-20241112 (which contains the series [2]).
>
> Note: issues related to TLB flushing are not new to this series and are
> tracked in the separate RFC patch [3]. For more context, please refer to this
> thread [4].
>
> Comments and suggestions are welcome!
>
> Thanks,
> Qi
>
> [1]. https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/
> [2]. https://lore.kernel.org/lkml/cover.1727332572.git.zhengqi.arch@bytedance.com/
> [3]. https://lore.kernel.org/lkml/20240815120715.14516-1-zhengqi.arch@bytedance.com/
> [4].
> https://lore.kernel.org/lkml/6f38cb19-9847-4f70-bbe7-06881bb016be@bytedance.com/
>
> Qi Zheng (11):
>   mm: khugepaged: recheck pmd state in retract_page_tables()
>   mm: userfaultfd: recheck dst_pmd entry in move_pages_pte()
>   mm: introduce zap_nonpresent_ptes()
>   mm: introduce do_zap_pte_range()
>   mm: skip over all consecutive none ptes in do_zap_pte_range()
>   mm: zap_install_uffd_wp_if_needed: return whether uffd-wp pte has been
>     re-installed
>   mm: do_zap_pte_range: return any_skipped information to the caller
>   mm: make zap_pte_range() handle full within-PMD range
>   mm: pgtable: reclaim empty PTE page in madvise(MADV_DONTNEED)
>   x86: mm: free page table pages by RCU instead of semi RCU
>   x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64
>
>  Documentation/mm/process_addrs.rst |   4 +
>  arch/x86/Kconfig                   |   1 +
>  arch/x86/include/asm/tlb.h         |  20 +++
>  arch/x86/kernel/paravirt.c         |   7 +
>  arch/x86/mm/pgtable.c              |  10 +-
>  include/linux/mm.h                 |   1 +
>  include/linux/mm_inline.h          |  11 +-
>  include/linux/mm_types.h           |   4 +-
>  mm/Kconfig                         |  15 ++
>  mm/Makefile                        |   1 +
>  mm/internal.h                      |  19 +++
>  mm/khugepaged.c                    |  45 +++--
>  mm/madvise.c                       |   7 +-
>  mm/memory.c                        | 253 ++++++++++++++++++-----------
>  mm/mmu_gather.c                    |   9 +-
>  mm/pt_reclaim.c                    |  71 ++++++++
>  mm/userfaultfd.c                   |  51 ++++--
>  17 files changed, 397 insertions(+), 132 deletions(-)
>  create mode 100644 mm/pt_reclaim.c