[RFC,0/7] Try to free empty and zero user PTE page table pages

Message ID 20220825101037.96517-1-zhengqi.arch@bytedance.com (mailing list archive)

Message

Qi Zheng Aug. 25, 2022, 10:10 a.m. UTC
Hi,

Previously, in order to free empty user PTE page table pages, I posted
patch sets for the following two solutions:
 - atomic refcount version:
	https://lore.kernel.org/lkml/20211110105428.32458-1-zhengqi.arch@bytedance.com/
 - percpu refcount version:
	https://lore.kernel.org/lkml/20220429133552.33768-1-zhengqi.arch@bytedance.com/

Both patch sets have the following behavior:
a. Protect page table walkers by hooking pte_offset_map{_lock}() and
   pte_unmap{_unlock}()
b. Automatically reclaim PTE page table pages outside of the memory reclaim path

For behavior a, David Hildenbrand pointed out the following disadvantages:
 - It introduces a lot of complexity. It's not something easy to get in and most
   probably not easy to get out again
 - It is inconvenient to extend to other architectures. For example, for the
   contiguous PTEs of arm64, the pointer to the PTE entry is obtained directly
   through pte_offset_kernel() instead of pte_offset_map{_lock}()
 - It has been found that pte_unmap() is missing in some places that only
   execute on 64-bit systems, which is a disaster for pte_refcount

For behavior b, it may not be necessary to actively reclaim PTE pages, especially
when memory pressure is not high, and deferring to the reclaim path may be a
better choice.

In addition, the above two solutions only handle empty PTE pages (PTE pages
where all entries are empty), and do not deal with the zero PTE page (a PTE
page where all entries map the shared zero page) mentioned by
David Hildenbrand:
	"Especially the shared zeropage is nasty, because there are
	 sane use cases that can trigger it. Assume you have a VM
	 (e.g., QEMU) that inflated the balloon to return free memory
	 to the hypervisor.

	 Simply migrating that VM will populate the shared zeropage to
	 all inflated pages, because migration code ends up reading all
	 VM memory. Similarly, the guest can just read that memory as
	 well, for example, when the guest issues kdump itself."

The purpose of this RFC patch series is to continue the discussion and address
the above issues. The proposed solution is as follows.

In order to quickly identify the above two types of PTE pages, we still
introduce a pte_refcount for each PTE page. Both the mapped and the zero PTE
entry counters are packed into the pte_refcount of the PTE page, with the
following bit layout:

 - bits 0-9 are mapped PTE entry count
 - bits 10-19 are zero PTE entry count

In this way, when the mapped PTE entry count is 0, we know that the PTE page
is an empty PTE page, and when the zero PTE entry count is PTRS_PER_PTE, we
know that the PTE page is a zero PTE page.
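
For illustration, here is a minimal sketch of the counter layout described
above; the constant and helper names are hypothetical, not necessarily the
ones used in the series:

#define PTE_MAPPED_SHIFT        0
#define PTE_MAPPED_MASK         (0x3ffUL << PTE_MAPPED_SHIFT)  /* bits 0-9  */
#define PTE_ZERO_SHIFT          10
#define PTE_ZERO_MASK           (0x3ffUL << PTE_ZERO_SHIFT)    /* bits 10-19 */

static inline unsigned long pte_mapped_count(unsigned long pte_refcount)
{
        return (pte_refcount & PTE_MAPPED_MASK) >> PTE_MAPPED_SHIFT;
}

static inline unsigned long pte_zero_count(unsigned long pte_refcount)
{
        return (pte_refcount & PTE_ZERO_MASK) >> PTE_ZERO_SHIFT;
}

/* An empty PTE page: no PTE entry is mapped at all. */
static inline bool pte_page_is_empty(unsigned long pte_refcount)
{
        return pte_mapped_count(pte_refcount) == 0;
}

/* A zero PTE page: every PTE entry maps the shared zeropage. */
static inline bool pte_page_is_zero(unsigned long pte_refcount)
{
        return pte_zero_count(pte_refcount) == PTRS_PER_PTE;
}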

We only update the pte_refcount when setting and clearing PTE entries, and
since both operations are protected by the pte lock, pte_refcount can be a
non-atomic variable with little performance overhead.
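
As a rough sketch only (reusing the hypothetical constants from the sketch
above), the tracking helpers could update the counters like this; the
pte_refcount field placement and the signatures are illustrative, not the
series' actual track_pte_{set, clear}() API:

/*
 * Illustrative only: assumes a pte_refcount field added to the PTE page's
 * struct page by this series (exact name/placement may differ). Callers
 * hold the pte lock, so plain non-atomic updates are sufficient.
 */
static void track_pte_set_sketch(struct page *pte_page, pte_t pte)
{
        pte_page->pte_refcount += 1UL << PTE_MAPPED_SHIFT;
        if (is_zero_pfn(pte_pfn(pte)))
                pte_page->pte_refcount += 1UL << PTE_ZERO_SHIFT;
}

static void track_pte_clear_sketch(struct page *pte_page, pte_t pte)
{
        pte_page->pte_refcount -= 1UL << PTE_MAPPED_SHIFT;
        if (is_zero_pfn(pte_pfn(pte)))
                pte_page->pte_refcount -= 1UL << PTE_ZERO_SHIFT;
}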

Page table walkers are excluded by holding the write lock of mmap_lock while
doing pmd_clear() (in the newly added path that reclaims PTE pages).
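
A very rough sketch of that reclaim side, building on the helpers and the
hypothetical pte_refcount field above; the function is illustrative only,
and as the discussion below notes, mmap_lock alone may not be sufficient
(the rmap lock(s) may also be needed):

static void try_to_free_user_pte_sketch(struct mm_struct *mm, pmd_t *pmd)
{
        struct page *pte_page;

        mmap_write_lock(mm);
        if (pmd_none(*pmd) || pmd_trans_huge(*pmd))
                goto out;

        pte_page = pmd_page(*pmd);
        if (pte_page_is_empty(pte_page->pte_refcount) ||
            pte_page_is_zero(pte_page->pte_refcount)) {
                pmd_clear(pmd);
                flush_tlb_mm(mm);       /* drop stale translations */
                mm_dec_nr_ptes(mm);
                pte_free(mm, pte_page);
        }
out:
        mmap_write_unlock(mm);
}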

[RFC PATCH 7/7] is an example of reclaiming the empty and zero PTE pages of a
process. However, the best time to reclaim is in the memory reclaim path, for
example right before waking up the oom killer, when the system can no longer
reclaim more memory. Compared with killing a process, holding the write lock
of mmap_lock to reclaim memory by releasing empty and zero PTE pages is more
acceptable.

My idea is to count the number of bytes of reclaimable PTE pages (including
empty and zero PTE pages) in each mm (mm->reclaimable_pt_bytes, similar to
mm->pgtables_bytes), and to maintain an rbtree keyed by
mm->reclaimable_pt_bytes, so that the reclaim path can pick the mm with the
largest mm->reclaimable_pt_bytes to reclaim from.
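
As a sketch of what that could look like (none of this is in the posted
series; the mm->reclaim_pt_node field, the function names, and the omitted
locking are all hypothetical):

#include <linux/rbtree.h>

/* rbtree of mm_structs ordered by mm->reclaimable_pt_bytes. */
static struct rb_root reclaim_pt_tree = RB_ROOT;

static void reclaim_pt_tree_insert(struct mm_struct *mm)
{
        struct rb_node **link = &reclaim_pt_tree.rb_node;
        struct rb_node *parent = NULL;

        while (*link) {
                struct mm_struct *entry;

                parent = *link;
                entry = rb_entry(parent, struct mm_struct, reclaim_pt_node);
                if (mm->reclaimable_pt_bytes < entry->reclaimable_pt_bytes)
                        link = &parent->rb_left;
                else
                        link = &parent->rb_right;
        }
        rb_link_node(&mm->reclaim_pt_node, parent, link);
        rb_insert_color(&mm->reclaim_pt_node, &reclaim_pt_tree);
}

/* Reclaim path: the rightmost node has the largest reclaimable_pt_bytes. */
static struct mm_struct *pick_mm_to_reclaim(void)
{
        struct rb_node *node = rb_last(&reclaim_pt_tree);

        return node ? rb_entry(node, struct mm_struct, reclaim_pt_node) : NULL;
}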

This series is based on v5.19.

Comments and suggestions are welcome.

Thanks,
Qi

Qi Zheng (7):
  mm: use ptep_clear() in non-present cases
  mm: introduce CONFIG_FREE_USER_PTE
  mm: add pte_to_page() helper
  mm: introduce pte_refcount for user PTE page table page
  pte_ref: add track_pte_{set, clear}() helper
  x86/mm: add x86_64 support for pte_ref
  mm: add proc interface to free user PTE page table pages

 arch/x86/Kconfig               |   1 +
 arch/x86/include/asm/pgtable.h |   4 +
 include/linux/mm.h             |   2 +
 include/linux/mm_types.h       |   1 +
 include/linux/pgtable.h        |  11 +-
 include/linux/pte_ref.h        |  41 ++++++
 kernel/sysctl.c                |  12 ++
 mm/Kconfig                     |  11 ++
 mm/Makefile                    |   2 +-
 mm/memory.c                    |   2 +-
 mm/mprotect.c                  |   2 +-
 mm/pte_ref.c                   | 234 +++++++++++++++++++++++++++++++++
 12 files changed, 319 insertions(+), 4 deletions(-)
 create mode 100644 include/linux/pte_ref.h
 create mode 100644 mm/pte_ref.c

Comments

David Hildenbrand Aug. 29, 2022, 10:09 a.m. UTC | #1
On 25.08.22 12:10, Qi Zheng wrote:
> Hi,
> 
> Before this, in order to free empty user PTE page table pages, I posted the
> following patch sets of two solutions:
>  - atomic refcount version:
> 	https://lore.kernel.org/lkml/20211110105428.32458-1-zhengqi.arch@bytedance.com/
>  - percpu refcount version:
> 	https://lore.kernel.org/lkml/20220429133552.33768-1-zhengqi.arch@bytedance.com/
> 
> Both patch sets have the following behavior:
> a. Protect the page table walker by hooking pte_offset_map{_lock}() and
>    pte_unmap{_unlock}()
> b. Will automatically reclaim PTE page table pages in the non-reclaiming path
> 
> For behavior a, there may be the following disadvantages mentioned by
> David Hildenbrand:
>  - It introduces a lot of complexity. It's not something easy to get in and most
>    probably not easy to get out again
>  - It is inconvenient to extend to other architectures. For example, for the
>    continuous ptes of arm64, the pointer to the PTE entry is obtained directly
>    through pte_offset_kernel() instead of pte_offset_map{_lock}()
>  - It has been found that pte_unmap() is missing in some places that only
>    execute on 64-bit systems, which is a disaster for pte_refcount
> 
> For behavior b, it may not be necessary to actively reclaim PTE pages, especially
> when memory pressure is not high, and deferring to the reclaim path may be a
> better choice.
> 
> In addition, the above two solutions are only for empty PTE pages (a PTE page
> where all entries are empty), and do not deal with the zero PTE page ( a PTE
> page where all page table entries are mapped to shared zero page) mentioned by
> David Hildenbrand:
> 	"Especially the shared zeropage is nasty, because there are
> 	 sane use cases that can trigger it. Assume you have a VM
> 	 (e.g., QEMU) that inflated the balloon to return free memory
> 	 to the hypervisor.
> 
> 	 Simply migrating that VM will populate the shared zeropage to
> 	 all inflated pages, because migration code ends up reading all
> 	 VM memory. Similarly, the guest can just read that memory as
> 	 well, for example, when the guest issues kdump itself."
> 
> The purpose of this RFC patch is to continue the discussion and fix the above
> issues. The following is the solution to be discussed.

Thanks for providing an alternative! It's certainly easier to digest :)

> 
> In order to quickly identify the above two types of PTE pages, we still
> introduced a pte_refcount for each PTE page. We put the mapped and zero PTE
> entry counter into the pte_refcount of the PTE page. The bitmask has the
> following meaning:
> 
>  - bits 0-9 are mapped PTE entry count
>  - bits 10-19 are zero PTE entry count

I guess we could factor the zero PTE change out, to have an even simpler
first version. The issue is that some features (userfaultfd) don't
expect page faults when something was already mapped previously.

PTE markers as introduced by Peter might require a thought -- we don't
have anything mapped but do have additional information that we have to
maintain.

> 
> In this way, when mapped PTE entry count is 0, we can know that the current PTE
> page is an empty PTE page, and when zero PTE entry count is PTRS_PER_PTE, we can
> know that the current PTE page is a zero PTE page.
> 
> We only update the pte_refcount when setting and clearing of PTE entry, and
> since they are both protected by pte lock, pte_refcount can be a non-atomic
> variable with little performance overhead.
> 
> For page table walker, we mutually exclusive it by holding write lock of
> mmap_lock when doing pmd_clear() (in the newly added path to reclaim PTE pages).

I recall when I played with that idea that the mmap_lock is not
sufficient to rip out a page table. IIRC, we also have to hold the rmap
lock(s), to prevent RMAP walkers from still using the page table.

Especially if multiple VMAs intersect a page table, things might get
tricky, because multiple rmap locks could be involved.

We might want/need another mechanism to synchronize against page table
walkers.
Qi Zheng Aug. 29, 2022, 2 p.m. UTC | #2
On 2022/8/29 18:09, David Hildenbrand wrote:
> On 25.08.22 12:10, Qi Zheng wrote:
>> Hi,
>>
>> Before this, in order to free empty user PTE page table pages, I posted the
>> following patch sets of two solutions:
>>   - atomic refcount version:
>> 	https://lore.kernel.org/lkml/20211110105428.32458-1-zhengqi.arch@bytedance.com/
>>   - percpu refcount version:
>> 	https://lore.kernel.org/lkml/20220429133552.33768-1-zhengqi.arch@bytedance.com/
>>
>> Both patch sets have the following behavior:
>> a. Protect the page table walker by hooking pte_offset_map{_lock}() and
>>     pte_unmap{_unlock}()
>> b. Will automatically reclaim PTE page table pages in the non-reclaiming path
>>
>> For behavior a, there may be the following disadvantages mentioned by
>> David Hildenbrand:
>>   - It introduces a lot of complexity. It's not something easy to get in and most
>>     probably not easy to get out again
>>   - It is inconvenient to extend to other architectures. For example, for the
>>     continuous ptes of arm64, the pointer to the PTE entry is obtained directly
>>     through pte_offset_kernel() instead of pte_offset_map{_lock}()
>>   - It has been found that pte_unmap() is missing in some places that only
>>     execute on 64-bit systems, which is a disaster for pte_refcount
>>
>> For behavior b, it may not be necessary to actively reclaim PTE pages, especially
>> when memory pressure is not high, and deferring to the reclaim path may be a
>> better choice.
>>
>> In addition, the above two solutions are only for empty PTE pages (a PTE page
>> where all entries are empty), and do not deal with the zero PTE page ( a PTE
>> page where all page table entries are mapped to shared zero page) mentioned by
>> David Hildenbrand:
>> 	"Especially the shared zeropage is nasty, because there are
>> 	 sane use cases that can trigger it. Assume you have a VM
>> 	 (e.g., QEMU) that inflated the balloon to return free memory
>> 	 to the hypervisor.
>>
>> 	 Simply migrating that VM will populate the shared zeropage to
>> 	 all inflated pages, because migration code ends up reading all
>> 	 VM memory. Similarly, the guest can just read that memory as
>> 	 well, for example, when the guest issues kdump itself."
>>
>> The purpose of this RFC patch is to continue the discussion and fix the above
>> issues. The following is the solution to be discussed.
> 
> Thanks for providing an alternative! It's certainly easier to digest :)

Hi David,

Nice to see your reply.

> 
>>
>> In order to quickly identify the above two types of PTE pages, we still
>> introduced a pte_refcount for each PTE page. We put the mapped and zero PTE
>> entry counter into the pte_refcount of the PTE page. The bitmask has the
>> following meaning:
>>
>>   - bits 0-9 are mapped PTE entry count
>>   - bits 10-19 are zero PTE entry count
> 
> I guess we could factor the zero PTE change out, to have an even simpler
> first version. The issue is that some features (userfaultfd) don't
> expect page faults when something was already mapped previously.

OK, we can deal with the empty PTE page case first.
> 
> PTE markers as introduced by Peter might require a thought -- we don't
> have anything mapped but do have additional information that we have to
> maintain.

I see that the pte marker entry is a non-present entry, not an empty entry
(pte_none()). So we have already dealt with this situation, which is also
what is done in [RFC PATCH 1/7].

> 
>>
>> In this way, when mapped PTE entry count is 0, we can know that the current PTE
>> page is an empty PTE page, and when zero PTE entry count is PTRS_PER_PTE, we can
>> know that the current PTE page is a zero PTE page.
>>
>> We only update the pte_refcount when setting and clearing of PTE entry, and
>> since they are both protected by pte lock, pte_refcount can be a non-atomic
>> variable with little performance overhead.
>>
>> For page table walker, we mutually exclusive it by holding write lock of
>> mmap_lock when doing pmd_clear() (in the newly added path to reclaim PTE pages).
> 
> I recall when I played with that idea that the mmap_lock is not
> sufficient to rip out a page table. IIRC, we also have to hold the rmap
> lock(s), to prevent RMAP walkers from still using the page table.

Oh, I forgot about this. We should also hold the rmap lock(s), like
move_normal_pmd() does.

> 
> Especially if multiple VMAs intersect a page table, things might get
> tricky, because multiple rmap locks could be involved.

Maybe we can iterate over the vma list and just process the 2M-aligned
part?

> 
> We might want/need another mechanism to synchronize against page table
> walkers.

This is a tricky problem, equivalent to narrowing the protection scope
of mmap_lock. Any preliminary ideas?

Thanks,
Qi

>