[v4,00/11] synchronously scan and reclaim empty user PTE pages

Message ID: cover.1733305182.git.zhengqi.arch@bytedance.com

Message

Qi Zheng Dec. 4, 2024, 11:09 a.m. UTC
Changes in v4:
 - update the process_addrs.rst in [PATCH v4 01/11]
   (suggested by Lorenzo Stoakes)
 - fix [PATCH v3 4/9] and move it after [PATCH v3 5/9]
   (pointed out by David Hildenbrand)
 - change to use any_skipped instead of rechecking pte_none() to detect empty
   user PTE pages (suggested by David Hildenbrand)
 - rebase onto the next-20241203

Changes in v3:
 - recheck pmd state instead of pmd_same() in retract_page_tables()
   (suggested by Jann Horn)
 - recheck dst_pmd entry in move_pages_pte() (pointed out by Jann Horn)
 - introduce new skip_none_ptes() (suggested by David Hildenbrand)
 - minor changes in [PATCH v2 5/7]
 - remove tlb_remove_table_sync_one() if CONFIG_PT_RECLAIM is enabled.
 - use put_page() instead of free_page_and_swap_cache() in
   __tlb_remove_table_one_rcu() (pointed out by Jann Horn)
 - collect the Reviewed-bys and Acked-bys
 - rebase onto the next-20241112

Changes in v2:
 - fix [PATCH v1 1/7] (Jann Horn)
 - reset force_flush and force_break to false in [PATCH v1 2/7] (Jann Horn)
 - introduce zap_nonpresent_ptes() and do_zap_pte_range()
 - check pte_none() instead of can_reclaim_pt after the processing of PTEs
   (remove [PATCH v1 3/7] and [PATCH v1 4/7])
 - reorder patches
 - rebase onto the next-20241031

Changes in v1:
 - replace [RFC PATCH 1/7] with a separate series (already merged into mm-unstable):
   https://lore.kernel.org/lkml/cover.1727332572.git.zhengqi.arch@bytedance.com/
   (suggested by David Hildenbrand)
 - squash [RFC PATCH 2/7] into [RFC PATCH 4/7]
   (suggested by David Hildenbrand)
 - change to scan and reclaim empty user PTE pages in zap_pte_range()
   (suggested by David Hildenbrand)
 - sent a separate RFC patch to track the tlb flushing issue, and removed
   that part from this series ([RFC PATCH 3/7] and [RFC PATCH 6/7]).
   link: https://lore.kernel.org/lkml/20240815120715.14516-1-zhengqi.arch@bytedance.com/
 - add [PATCH v1 1/7] into this series
 - drop RFC tag
 - rebase onto the next-20241011

Changes in RFC v2:
 - fix compilation errors in [RFC PATCH 5/7] and [RFC PATCH 7/7] reported by
   kernel test robot
 - use pte_offset_map_nolock() + pmd_same() instead of check_pmd_still_valid()
   in retract_page_tables() (in [RFC PATCH 4/7])
 - rebase onto the next-20240805

Hi all,

Previously, we tried to use a completely asynchronous method to reclaim empty
user PTE pages [1]. After discussing with David Hildenbrand, we decided to
implement synchronous reclamation in the madvise(MADV_DONTNEED) case as the
first step.

So this series aims to synchronously free empty PTE pages in the
madvise(MADV_DONTNEED) case. We detect and free empty PTE pages in
zap_pte_range(), and add zap_details.reclaim_pt to exclude cases other than
madvise(MADV_DONTNEED).
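
To make the plumbing concrete, here is a minimal sketch based on the
description above (the reclaim_pt field comes from this series, but the exact
field placement and call-site details may differ from the patches):

	/*
	 * Sketch: only the madvise(MADV_DONTNEED) path sets reclaim_pt,
	 * so zap_pte_range() knows when reclaiming empty PTE pages is
	 * wanted.
	 */
	struct zap_details {
		/* ... existing fields ... */
		bool reclaim_pt;	/* set only for MADV_DONTNEED */
	};

	static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
						unsigned long start,
						unsigned long end)
	{
		struct zap_details details = {
			.reclaim_pt = true,
			.even_cows = true,
		};

		zap_page_range_single(vma, start, end - start, &details);
		return 0;
	}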

In zap_pte_range(), mmu_gather is used to perform batch tlb flushing and page
freeing operations. Therefore, if we want to free empty PTE pages in this
path, the most natural way is to add them to mmu_gather as well. Now, if
CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected, mmu_gather frees page table
pages by semi RCU:

 - batch table freeing: asynchronous free by RCU
 - single table freeing: IPI + synchronous free

But this is not enough to free empty PTE pages in paths other than the
munmap and exit_mmap paths, because the IPI cannot be synchronized with
rcu_read_lock() in pte_offset_map{_lock}(). So we should let single tables
also be freed by RCU, like batch table freeing.
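
The v3 changelog above already hints at the shape of this change (put_page()
in __tlb_remove_table_one_rcu()); a rough sketch of the x86 direction follows,
which may differ from the actual patch in details:

	/*
	 * Sketch: defer the free of a single page table page to RCU
	 * instead of IPI + synchronous free, so that it is also
	 * synchronized with rcu_read_lock() in pte_offset_map{_lock}().
	 */
	static void __tlb_remove_table_one_rcu(struct rcu_head *head)
	{
		struct page *page = container_of(head, struct page, rcu_head);

		put_page(page);
	}

	static void __tlb_remove_table_one(void *table)
	{
		struct page *page = table;

		call_rcu(&page->rcu_head, __tlb_remove_table_one_rcu);
	}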

As a first step, we support this feature on x86_64 and select the newly
introduced CONFIG_ARCH_SUPPORTS_PT_RECLAIM.
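
The Kconfig wiring implied by the last two patches looks roughly like the
sketch below (the actual option text and dependencies live in mm/Kconfig and
arch/x86/Kconfig in the series):

	# mm/Kconfig (sketch)
	config ARCH_SUPPORTS_PT_RECLAIM
		def_bool n

	config PT_RECLAIM
		bool "reclaim empty user page table pages"
		default y
		depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP
		select MMU_GATHER_RCU_TABLE_FREE

	# arch/x86/Kconfig (sketch), under config X86
	select ARCH_SUPPORTS_PT_RECLAIM	if X86_64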

For other cases such as madvise(MADV_FREE), we will consider scanning and
freeing empty PTE pages asynchronously in the future.

This series is based on next-20241203 (which contains the series [2]).

Note: issues related to TLB flushing are not new to this series and are tracked
      in the separate RFC patch [3]. For more context, please refer to this
      thread [4].

Comments and suggestions are welcome!

Thanks,
Qi

[1]. https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/
[2]. https://lore.kernel.org/lkml/cover.1727332572.git.zhengqi.arch@bytedance.com/
[3]. https://lore.kernel.org/lkml/20240815120715.14516-1-zhengqi.arch@bytedance.com/
[4]. https://lore.kernel.org/lkml/6f38cb19-9847-4f70-bbe7-06881bb016be@bytedance.com/

Qi Zheng (11):
  mm: khugepaged: recheck pmd state in retract_page_tables()
  mm: userfaultfd: recheck dst_pmd entry in move_pages_pte()
  mm: introduce zap_nonpresent_ptes()
  mm: introduce do_zap_pte_range()
  mm: skip over all consecutive none ptes in do_zap_pte_range()
  mm: zap_install_uffd_wp_if_needed: return whether uffd-wp pte has been
    re-installed
  mm: do_zap_pte_range: return any_skipped information to the caller
  mm: make zap_pte_range() handle full within-PMD range
  mm: pgtable: reclaim empty PTE page in madvise(MADV_DONTNEED)
  x86: mm: free page table pages by RCU instead of semi RCU
  x86: select ARCH_SUPPORTS_PT_RECLAIM if X86_64

 Documentation/mm/process_addrs.rst |   4 +
 arch/x86/Kconfig                   |   1 +
 arch/x86/include/asm/tlb.h         |  20 +++
 arch/x86/kernel/paravirt.c         |   7 +
 arch/x86/mm/pgtable.c              |  10 +-
 include/linux/mm.h                 |   1 +
 include/linux/mm_inline.h          |  11 +-
 include/linux/mm_types.h           |   4 +-
 mm/Kconfig                         |  15 ++
 mm/Makefile                        |   1 +
 mm/internal.h                      |  19 +++
 mm/khugepaged.c                    |  45 +++--
 mm/madvise.c                       |   7 +-
 mm/memory.c                        | 253 ++++++++++++++++++-----------
 mm/mmu_gather.c                    |   9 +-
 mm/pt_reclaim.c                    |  71 ++++++++
 mm/userfaultfd.c                   |  51 ++++--
 17 files changed, 397 insertions(+), 132 deletions(-)
 create mode 100644 mm/pt_reclaim.c

Comments

Andrew Morton Dec. 4, 2024, 10:49 p.m. UTC | #1
On Wed,  4 Dec 2024 19:09:40 +0800 Qi Zheng <zhengqi.arch@bytedance.com> wrote:

> 
> ...
>
> Previously, we tried to use a completely asynchronous method to reclaim empty
> user PTE pages [1]. After discussing with David Hildenbrand, we decided to
> implement synchronous reclamation in the madvise(MADV_DONTNEED) case as the
> first step.

Please help us understand what the other steps are.  Because we don't
want to commit to a particular partial implementation only to later
discover that completing that implementation causes us problems.

> So this series aims to synchronously free the empty PTE pages in
> madvise(MADV_DONTNEED) case. We will detect and free empty PTE pages in
> zap_pte_range(), and will add zap_details.reclaim_pt to exclude cases other than
> madvise(MADV_DONTNEED).
> 
> In zap_pte_range(), mmu_gather is used to perform batch tlb flushing and page
> freeing operations. Therefore, if we want to free the empty PTE page in this
> path, the most natural way is to add it to mmu_gather as well. Now, if
> CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected, mmu_gather will free page table
> pages by semi RCU:
> 
>  - batch table freeing: asynchronous free by RCU
>  - single table freeing: IPI + synchronous free
> 
> But this is not enough to free the empty PTE page table pages in paths other
> than the munmap and exit_mmap paths, because IPI cannot be synchronized with
> rcu_read_lock() in pte_offset_map{_lock}(). So we should let single table also
> be freed by RCU like batch table freeing.
> 
> As a first step, we support this feature on x86_64 and select the newly
> introduced CONFIG_ARCH_SUPPORTS_PT_RECLAIM.
> 
> For other cases such as madvise(MADV_FREE), consider scanning and freeing empty
> PTE pages asynchronously in the future.

Handling MADV_FREE sounds fairly straightforward?
Jann Horn Dec. 4, 2024, 10:56 p.m. UTC | #2
On Wed, Dec 4, 2024 at 11:49 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> On Wed,  4 Dec 2024 19:09:40 +0800 Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> > But this is not enough to free the empty PTE page table pages in paths other
> > than the munmap and exit_mmap paths, because IPI cannot be synchronized with
> > rcu_read_lock() in pte_offset_map{_lock}(). So we should let single table also
> > be freed by RCU like batch table freeing.
> >
> > As a first step, we support this feature on x86_64 and select the newly
> > introduced CONFIG_ARCH_SUPPORTS_PT_RECLAIM.
> >
> > For other cases such as madvise(MADV_FREE), consider scanning and freeing empty
> > PTE pages asynchronously in the future.
>
> Handling MADV_FREE sounds fairly straightforward?

AFAIU MADV_FREE usually doesn't immediately clear PTEs (except if they
are swap/hwpoison/... PTEs). So the easy thing to do would be to check
whether the page table has become empty within madvise(), but I think
the most likely case would be that PTEs still remain (and will be
asynchronously zapped later when memory pressure causes reclaim, or
something like that).

So I don't see an easy path to doing it for MADV_FREE.
Qi Zheng Dec. 5, 2024, 3:56 a.m. UTC | #3
On 2024/12/5 06:49, Andrew Morton wrote:
> On Wed,  4 Dec 2024 19:09:40 +0800 Qi Zheng <zhengqi.arch@bytedance.com> wrote:
> 
>>
>> ...
>>
>> Previously, we tried to use a completely asynchronous method to reclaim empty
>> user PTE pages [1]. After discussing with David Hildenbrand, we decided to
>> implement synchronous reclamation in the madvise(MADV_DONTNEED) case as the
>> first step.
> 
> Please help us understand what the other steps are.  Because we don't
> want to commit to a particular partial implementation only to later
> discover that completing that implementation causes us problems.

Although it is the first step, it is relatively independent, because it
solves the problem (huge PTE memory usage) in the madvise(MADV_DONTNEED)
case, while the other steps are meant to solve the problem in other
cases.

I can briefly describe all the plans in my mind here:

First step
==========

I plan to implement synchronous reclamation of empty user PTE pages in
the madvise(MADV_DONTNEED) case, for the following reasons:

1. It covers most of the known cases. (On ByteDance servers, all the
    known huge PTE memory usage problems fall into this case.)
2. It helps verify the lock protection scheme and other infrastructure.

This is what this series does (x86_64 only for now). Once this is done,
support for more architectures will be added.

Second step
===========

I plan to implement asynchronous reclamation for madvise(MADV_FREE) and
other cases. The initial idea is to mark the vma first, then add the
corresponding mm to a global linked list, and then perform asynchronous
scanning and reclamation during memory reclaim.
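
For illustration only, the bookkeeping could look something like this
(nothing here exists yet; pt_reclaim_node is a made-up mm_struct field):

	/* Hypothetical sketch of the second step's bookkeeping. */
	static LIST_HEAD(pt_reclaim_mms);
	static DEFINE_SPINLOCK(pt_reclaim_lock);

	static void pt_reclaim_queue_mm(struct mm_struct *mm)
	{
		spin_lock(&pt_reclaim_lock);
		/* mm->pt_reclaim_node is hypothetical, not a real field */
		if (list_empty(&mm->pt_reclaim_node))
			list_add_tail(&mm->pt_reclaim_node, &pt_reclaim_mms);
		spin_unlock(&pt_reclaim_lock);
	}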

Third step
==========

Based on the above infrastructure, we may try to reclaim all full-zero
PTE pages (PTE pages whose entries all map the shared zero page), which
will be beneficial to the memory balloon case mentioned by David
Hildenbrand.
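
To illustrate what "full-zero" means, a hypothetical helper (not part of
this series) might check one PTE page like this:

	/*
	 * Hypothetical sketch: report whether every entry of a PTE page
	 * is either none or maps the shared zero page. Locking (pte
	 * lock or RCU) is the caller's responsibility and omitted here.
	 */
	static bool pte_page_all_zero(pte_t *start_pte)
	{
		int i;

		for (i = 0; i < PTRS_PER_PTE; i++) {
			pte_t pte = ptep_get(start_pte + i);

			if (pte_none(pte))
				continue;
			if (!pte_present(pte) || !is_zero_pfn(pte_pfn(pte)))
				return false;
		}
		return true;
	}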

Another plan
============

Currently, page table modifications are protected by the page table locks
(page_table_lock or the split pmd/pte locks), but the life cycle of page
table pages is protected by mmap_lock (and the vma lock). For more details,
please refer to the newly added Documentation/mm/process_addrs.rst file.

Currently, we try to free the PTE pages through RCU when
CONFIG_PT_RECLAIM is turned on. In this case, we will no longer need to
hold mmap_lock for read/write operations on the PTE pages.

So maybe we can remove the page tables from the protection of the mmap
lock (which is too coarse-grained), like this:

1. free all levels of page table pages by RCU, not just PTE pages, but
    also pmd, pud, etc.
2. similar to pte_offset_map/pte_unmap, add
    [pmd|pud]_offset_map/[pmd|pud]_unmap, make them all contain
    rcu_read_lock/rcu_read_unlock, and make them accept failure.

In this way, we no longer need the mmap lock. For readers, such as page
table walkers, we are already in the critical section of RCU. For
writers, we only need to hold the page table lock.
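
A rough sketch of the shape such an API could take (none of this exists;
the names just follow the pte_offset_map()/pte_unmap() analogy):

	/*
	 * Hypothetical sketch: like pte_offset_map(), but one level up.
	 * The RCU read lock covers the mapped lifetime, and the caller
	 * must accept failure (NULL) if the pud is not a page table.
	 */
	static pmd_t *pmd_offset_map(pud_t *pud, unsigned long addr)
	{
		pud_t pudval;

		rcu_read_lock();
		pudval = pudp_get(pud);
		if (!pud_present(pudval) || pud_leaf(pudval)) {
			rcu_read_unlock();
			return NULL;	/* caller must handle failure */
		}
		return pmd_offset(&pudval, addr);
	}

	static void pmd_unmap(pmd_t *pmd)
	{
		rcu_read_unlock();
	}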

But there is a difficulty here: the RCU critical section is not allowed
to sleep, but it is possible to sleep in the callback function of
.pmd_entry, such as mmu_notifier_invalidate_range_start().

Use SRCU instead? Or an RCU + refcount method? Not sure. But I think
it's an interesting thing to try.

Thanks!
Qi Zheng Dec. 5, 2024, 3:59 a.m. UTC | #4
On 2024/12/5 06:56, Jann Horn wrote:
> On Wed, Dec 4, 2024 at 11:49 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>> On Wed,  4 Dec 2024 19:09:40 +0800 Qi Zheng <zhengqi.arch@bytedance.com> wrote:
>>> But this is not enough to free the empty PTE page table pages in paths other
>>> than the munmap and exit_mmap paths, because IPI cannot be synchronized with
>>> rcu_read_lock() in pte_offset_map{_lock}(). So we should let single table also
>>> be freed by RCU like batch table freeing.
>>>
>>> As a first step, we support this feature on x86_64 and select the newly
>>> introduced CONFIG_ARCH_SUPPORTS_PT_RECLAIM.
>>>
>>> For other cases such as madvise(MADV_FREE), consider scanning and freeing empty
>>> PTE pages asynchronously in the future.
>>
>> Handling MADV_FREE sounds fairly straightforward?
> 
> AFAIU MADV_FREE usually doesn't immediately clear PTEs (except if they
> are swap/hwpoison/... PTEs). So the easy thing to do would be to check
> whether the page table has become empty within madvise(), but I think
> the most likely case would be that PTEs still remain (and will be
> asynchronously zapped later when memory pressure causes reclaim, or
> something like that).
> 
> So I don't see an easy path to doing it for MADV_FREE.

+1. Thanks for helping explain!
Qi Zheng Dec. 10, 2024, 8:57 a.m. UTC | #5
Hi Andrew,

I have sent patches [1][2][3] to fix the recently reported issues:

[1]. https://lore.kernel.org/lkml/20241210084156.89877-1-zhengqi.arch@bytedance.com/
     (Fix warning, needs to be folded into [PATCH v4 02/11])

[2]. https://lore.kernel.org/lkml/20241206112348.51570-1-zhengqi.arch@bytedance.com/
     (Fix uninitialized symbol, needs to be folded into [PATCH v4 09/11])

[3]. https://lore.kernel.org/lkml/20241210084431.91414-1-zhengqi.arch@bytedance.com/
     (Fix UAF, needs to be placed before [PATCH v4 11/11])

If you need me to re-post a complete v5, please let me know.

Thanks,
Qi


On 2024/12/4 19:09, Qi Zheng wrote:
> 
> ...
>