[RFC,00/18] Try to free user PTE page table pages

Message ID 20220429133552.33768-1-zhengqi.arch@bytedance.com

Message

Qi Zheng April 29, 2022, 1:35 p.m. UTC
Hi,

This patch series tries to free user PTE page table pages when no one is
using them.

The story begins with some malloc libraries (e.g. jemalloc or tcmalloc)
that allocate large amounts of virtual address space with mmap() and do
not unmap it; instead, they use madvise(MADV_DONTNEED) to free the
physical memory when they want. But madvise() does not free the page
tables, so a process that touches an enormous virtual address space can
end up with a large number of page tables.

The following figures are a memory usage snapshot of one process, taken
on one of our servers:

        VIRT:  55t
        RES:   590g
        VmPTE: 110g

As we can see, the PTE page tables take 110g, while RES is 590g. In
theory, the process needs only about 1.2g of PTE page tables to map that
physical memory (each 4K PTE page maps 2M, so 590g / 512 ≈ 1.2g). The
PTE page tables occupy so much memory because madvise(MADV_DONTNEED)
only clears the PTEs and frees the physical memory, but does not free
the PTE page table pages. So we can free those empty PTE page tables to
save memory. In the case above, we can save about 108g (best case), and
the larger the gap between VIRT and RES, the more memory we save.

In this patch series, we add a pte_ref field to the struct page of a
page table to track how many users the user PTE page table page has.
Similar to the page refcount mechanism, a user of a PTE page table must
hold a refcount on it before accessing it. The user PTE page table page
may be freed when the last refcount is dropped.

Different from my earlier patchset[1], the pte_ref is now a struct
percpu_ref, and we switch it to atomic mode only in cases such as
MADV_DONTNEED and MADV_FREE that may clear user PTE page table entries,
then release the user PTE page table page once pte_ref drops to 0. The
advantage is that there is basically no performance overhead in percpu
mode, yet empty PTE tables can still be freed. In addition, the
implementation of this patchset is much simpler and more portable than
the earlier one[1].
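
As a rough sketch, the resulting usage pattern looks like the following
(the helper names come from the patch list below; the exact signatures
and internals here are illustrative, not the final code):

        pte_t *pte;
        spinlock_t *ptl;

        /*
         * Illustrative only: accessing a user PTE table under pte_ref.
         * pte_tryget_map_lock() acts like pte_offset_map_lock(), but
         * first takes a reference on the PTE table page via
         * pte_tryget(), failing if the table is already being freed.
         */
        pte = pte_tryget_map_lock(mm, pmd, addr, &ptl);
        if (!pte)
                goto retry;     /* PTE table was freed concurrently */

        /* ... read or modify the PTE entries ... */

        pte_unmap_unlock(pte, ptl);
        pte_put(mm, pmd, addr); /* drop the ref; may free an empty table */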

Testing:

The following code snippet shows the effect of the optimization:

        mmap 50G
        while (1) {
                for (i = 0; i < 1024 * 25; i++) {
                        touch 2M memory
                        madvise MADV_DONTNEED 2M
                }
        }
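
For reference, a self-contained userspace equivalent of the pseudocode
above might look like this (a sketch, not the exact test program used):

        #include <string.h>
        #include <sys/mman.h>

        #define SZ_2M   (2UL << 20)
        #define NR_2M   (1024 * 25)     /* 25 * 1024 * 2M = 50G */

        int main(void)
        {
                char *buf = mmap(NULL, NR_2M * SZ_2M,
                                 PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                if (buf == MAP_FAILED)
                        return 1;

                for (;;) {
                        for (unsigned long i = 0; i < NR_2M; i++) {
                                /* touch 2M so a PTE table is populated */
                                memset(buf + i * SZ_2M, 1, SZ_2M);
                                /* empty it again; without this series
                                 * the PTE table stays behind */
                                madvise(buf + i * SZ_2M, SZ_2M,
                                        MADV_DONTNEED);
                        }
                }
        }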

As we can see, the memory usage of VmPTE is reduced:

                        before                          after
VIRT                   50.0 GB                        50.0 GB
RES                     3.1 MB                         3.1 MB
VmPTE                102640 kB                          96 kB

I have also been testing stability with LTP[2] for several weeks and
have not seen any crashes so far.

This series is based on v5.18-rc2.

Comments and suggestions are welcome.

Thanks,
Qi.

[1] https://patchwork.kernel.org/project/linux-mm/cover/20211110105428.32458-1-zhengqi.arch@bytedance.com/
[2] https://github.com/linux-test-project/ltp

Qi Zheng (18):
  x86/mm/encrypt: add the missing pte_unmap() call
  percpu_ref: make ref stable after percpu_ref_switch_to_atomic_sync()
    returns
  percpu_ref: make percpu_ref_switch_lock per percpu_ref
  mm: convert to use ptep_clear() in pte_clear_not_present_full()
  mm: split the related definitions of pte_offset_map_lock() into
    pgtable.h
  mm: introduce CONFIG_FREE_USER_PTE
  mm: add pte_to_page() helper
  mm: introduce percpu_ref for user PTE page table page
  pte_ref: add pte_tryget() and {__,}pte_put() helper
  mm: add pte_tryget_map{_lock}() helper
  mm: convert to use pte_tryget_map_lock()
  mm: convert to use pte_tryget_map()
  mm: add try_to_free_user_pte() helper
  mm: use try_to_free_user_pte() in MADV_DONTNEED case
  mm: use try_to_free_user_pte() in MADV_FREE case
  pte_ref: add track_pte_{set, clear}() helper
  x86/mm: add x86_64 support for pte_ref
  Documentation: add document for pte_ref

 Documentation/vm/index.rst         |   1 +
 Documentation/vm/pte_ref.rst       | 210 ++++++++++++++++++++++++++
 arch/x86/Kconfig                   |   1 +
 arch/x86/include/asm/pgtable.h     |   7 +-
 arch/x86/mm/mem_encrypt_identity.c |  10 +-
 fs/proc/task_mmu.c                 |  16 +-
 fs/userfaultfd.c                   |  10 +-
 include/linux/mm.h                 | 162 ++------------------
 include/linux/mm_types.h           |   1 +
 include/linux/percpu-refcount.h    |   6 +-
 include/linux/pgtable.h            | 196 +++++++++++++++++++++++-
 include/linux/pte_ref.h            |  73 +++++++++
 include/linux/rmap.h               |   2 +
 include/linux/swapops.h            |   4 +-
 kernel/events/core.c               |   5 +-
 lib/percpu-refcount.c              |  86 +++++++----
 mm/Kconfig                         |  10 ++
 mm/Makefile                        |   2 +-
 mm/damon/vaddr.c                   |  30 ++--
 mm/debug_vm_pgtable.c              |   2 +-
 mm/filemap.c                       |   4 +-
 mm/gup.c                           |  20 ++-
 mm/hmm.c                           |   9 +-
 mm/huge_memory.c                   |   4 +-
 mm/internal.h                      |   3 +-
 mm/khugepaged.c                    |  18 ++-
 mm/ksm.c                           |   4 +-
 mm/madvise.c                       |  35 +++--
 mm/memcontrol.c                    |   8 +-
 mm/memory-failure.c                |  15 +-
 mm/memory.c                        | 187 +++++++++++++++--------
 mm/mempolicy.c                     |   4 +-
 mm/migrate.c                       |   8 +-
 mm/migrate_device.c                |  22 ++-
 mm/mincore.c                       |   5 +-
 mm/mlock.c                         |   5 +-
 mm/mprotect.c                      |   4 +-
 mm/mremap.c                        |  10 +-
 mm/oom_kill.c                      |   3 +-
 mm/page_table_check.c              |   2 +-
 mm/page_vma_mapped.c               |  59 +++++++-
 mm/pagewalk.c                      |   6 +-
 mm/pte_ref.c                       | 230 +++++++++++++++++++++++++++++
 mm/rmap.c                          |   9 ++
 mm/swap_state.c                    |   4 +-
 mm/swapfile.c                      |  18 ++-
 mm/userfaultfd.c                   |  11 +-
 mm/vmalloc.c                       |   2 +-
 48 files changed, 1203 insertions(+), 340 deletions(-)
 create mode 100644 Documentation/vm/pte_ref.rst
 create mode 100644 include/linux/pte_ref.h
 create mode 100644 mm/pte_ref.c

Comments

Qi Zheng May 17, 2022, 8:30 a.m. UTC | #1
On 2022/4/29 9:35 PM, Qi Zheng wrote:
> [...]

Hi David,

I learned from the LWN article[1] that you led a session at LSFMM on
the problems posed by the lack of page-table reclaim (and thank you very
much for mentioning some of my work in this direction). So I would like
to know: what are the community's further plans for this problem?

For the approach of adding a pte_ref to each PTE page table page, I have
posted two versions so far: an atomic count version[2] and a percpu_ref
version (this patchset).

For the atomic count version:
- Advantage: PTE pages can be freed as soon as the reference count
  drops to 0.
- Disadvantage: The increments and decrements of pte_ref are atomic
  operations, which have a certain performance overhead, though this
  should not become a performance bottleneck until the mmap_lock
  contention problem is resolved.

For the percpu_ref version:
- Advantage: In percpu mode, the increments and decrements of pte_ref
  are operations on local per-cpu variables, so there is basically no
  performance overhead.
- Disadvantage: The pte_ref needs to be explicitly switched to atomic
  mode so that unused PTE pages can be freed (see the sketch below).
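
Roughly, the fast paths of the two versions differ like this (a sketch;
the struct page field names are illustrative):

        /* atomic count version: every get is a shared atomic RMW */
        if (!atomic_inc_not_zero(&page->pte_refcount))
                return NULL;    /* table is already being freed */

        /*
         * percpu_ref version: in percpu mode this is only a
         * this_cpu_inc() with no cache-line contention; gets/puts
         * become atomic (and the count observable as 0) only after an
         * explicit percpu_ref_switch_to_atomic_sync().
         */
        if (!percpu_ref_tryget(&page->pte_ref))
                return NULL;    /* table is already being freed */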

There is still plenty of room to optimize the implementation of both
versions. But before I do further work, I would like to hear your and
the community's views and suggestions on these two versions.

Thanks,
Qi

[1]: https://lwn.net/Articles/893726 (Ways to reclaim unused page-table pages)
[2]: https://lore.kernel.org/lkml/20211110105428.32458-1-zhengqi.arch@bytedance.com/

David Hildenbrand May 18, 2022, 2:51 p.m. UTC | #2
On 17.05.22 10:30, Qi Zheng wrote:
> [...]
> 
> Hi David,
> 
> I learned from the LWN article[1] that you led a session at LSFMM on
> the problems posed by the lack of page-table reclaim (and thank you very
> much for mentioning some of my work in this direction). So I would like
> to know: what are the community's further plans for this problem?

Hi,

yes, I talked about the involved challenges, especially how malicious
user space can trigger allocation of almost exclusively page tables and
essentially consume a lot of unmovable+unswappable memory and even store
secrets in the page table structure.

Empty PTE tables are one such case we care about, but there is more. Even
with your approach, we can still end up with many page tables that are
allocated on higher levels (e.g., PMD tables) or page tables that are
not empty (especially ones filled with the shared zeropage).

Ideally, we'd have some mechanism that can also reclaim other
reclaimable page tables (e.g., filled with shared zeropage). One idea
was to add reclaimable page tables to the LRU list and to then
scan+reclaim them on demand. There are multiple challenges involved,
obviously. One is how to synchronize against concurrent page table
walkers, another one is how to invalidate MMU notifiers from reclaim
context. It would most probably involve storing required information in
the memmap to be able to lock+synchronize.
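
(Purely hypothetical, but the kind of per-table state such an LRU
approach might need to keep in the memmap could look like the sketch
below; none of these fields exist today.)

        /* Hypothetical sketch only -- nothing like this exists. */
        struct pt_reclaim_state {
                struct list_head lru;   /* position on a page-table LRU */
                struct mm_struct *mm;   /* owner, for MMU notifier calls */
                unsigned long addr;     /* start of the mapped VA range */
        };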

Having that said, adding infrastructure that might not be easy to extend
to the more general case of reclaiming other reclaimable page tables on
multiple levels (esp PMD tables) might not be what we want. OTOH, it
gets the job done for the one case we care about.

It's really hard to tell what to do because reclaiming page tables and
eventually handling malicious user space correctly is far from trivial :)

I'll be on vacation until end of May, I'll come back to this mail once
I'm back.
Matthew Wilcox May 18, 2022, 2:56 p.m. UTC | #3
On Wed, May 18, 2022 at 04:51:06PM +0200, David Hildenbrand wrote:
> yes, I talked about the involved challenges, especially how malicious
> user space can trigger allocation of almost exclusively page tables and
> essentially consume a lot of unmovable+unswappable memory and even store
> secrets in the page table structure.

There are a lot of ways for userspace to consume a large amount of
kernel memory.  For example, one can open a file and set file locks on
alternate bytes.  We generally handle this by accounting the memory to
the process and letting the OOM killer, rlimits, memcg or other mechanism
take care of it.  Just because page tables are (generally) reclaimable
doesn't mean we need to treat them specially.
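
(To make that example concrete, a sketch -- the path and sizes here are
arbitrary. Each non-adjacent byte-range lock needs its own lock record
in the kernel, so a loop like this pins kernel memory in proportion to
the number of locks:)

        #include <fcntl.h>
        #include <unistd.h>

        int main(void)
        {
                int fd = open("/tmp/lockfile", O_RDWR | O_CREAT, 0600);
                struct flock fl = {
                        .l_type   = F_WRLCK,
                        .l_whence = SEEK_SET,
                        .l_len    = 1,
                };

                /* lock alternate bytes so adjacent locks never merge */
                for (off_t off = 0; off < 1 << 20; off += 2) {
                        fl.l_start = off;
                        fcntl(fd, F_SETLK, &fl);
                }
                pause();        /* hold the locks */
        }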
Qi Zheng May 19, 2022, 3:58 a.m. UTC | #4
On 2022/5/18 10:51 PM, David Hildenbrand wrote:
> On 17.05.22 10:30, Qi Zheng wrote:
>> [...]
>>
>> Hi David,
>>
>> I learned from the LWN article[1] that you led a session at LSFMM on
>> the problems posed by the lack of page-table reclaim (and thank you very
>> much for mentioning some of my work in this direction). So I would like
>> to know: what are the community's further plans for this problem?
> 
> Hi,
> 
> yes, I talked about the involved challenges, especially, how malicious
> user space can trigger allocation of almost elusively page tables and
> essentially consume a lot of unmovable+unswappable memory and even store
> secrets in the page table structure.

It is indeed difficult to deal with malicious user space programs,
because as long as a PTE page table page contains even one entry that
maps a physical page, the entire PTE page cannot be freed.

So maybe we should first solve the problems encountered in engineering
practice. We have hit the situation I mentioned in the cover letter
several times on our servers:

        VIRT:  55t
        RES:   590g
        VmPTE: 110g

They are not malicious programs; they just use jemalloc/tcmalloc
normally (nowadays jemalloc/tcmalloc often use mmap+madvise instead
of mmap+munmap to improve performance). And we checked and found that
most of these PTE tables are empty.

Of course, normal operation may also lead to the same consequences as
a malicious program, but we have not found such examples on our
servers.

> 
> Empty PTE tables are one such case we care about, but there is more. Even
> with your approach, we can still end up with many page tables that are
> allocated on higher levels (e.g., PMD tables) or page tables that are

Yes, currently my patch does not consider PMD tables. The reason is
that their maximum memory consumption is only 1G on a 64-bit system
(each 4K PMD table maps 1G of virtual address space), so the impact is
smaller than the up to 512G that PTE tables can consume.

> not empty (especially ones filled with the shared zeropage).

This case is indeed a problem, and a more difficult one. :(

> 
> Ideally, we'd have some mechanism that can also reclaim other
> reclaimable page tables (e.g., filled with shared zeropage). One idea
> was to add reclaimable page tables to the LRU list and to then
> scan+reclaim them on demand. There are multiple challenges involved,
> obviously. One is how to synchronize against concurrent page table

Agreed. The current situation is that holding the read lock of
mmap_lock ensures that the PTE tables are stable. Without the refcount
method, and without changing the locking that protects the PTE tables,
the write lock of mmap_lock would have to be held to ensure
synchronization (which has a huge impact on performance).

> walkers, another one is how to invalidate MMU notifiers from reclaim
> context. It would most probably involve storing required information in
> the memmap to be able to lock+synchronize.

This may also be a direction worth exploring.

> 
> Having that said, adding infrastructure that might not be easy to extend
> to the more general case of reclaiming other reclaimable page tables on
> multiple levels (esp PMD tables) might not be what we want. OTOH, it
> gets the job done for the one case we care about.
> 
> It's really hard to tell what to do because reclaiming page tables and
> eventually handling malicious user space correctly is far from trivial :)

Yeah, agree :(

> 
> I'll be on vacation until end of May, I'll come back to this mail once
> I'm back.
> 

OK, thanks, and have a nice holiday.
Qi Zheng May 19, 2022, 4:03 a.m. UTC | #5
On 2022/5/18 10:56 PM, Matthew Wilcox wrote:
> On Wed, May 18, 2022 at 04:51:06PM +0200, David Hildenbrand wrote:
>> yes, I talked about the involved challenges, especially how malicious
>> user space can trigger allocation of almost exclusively page tables and
>> essentially consume a lot of unmovable+unswappable memory and even store
>> secrets in the page table structure.
> 
> There are a lot of ways for userspace to consume a large amount of
> kernel memory.  For example, one can open a file and set file locks on

Yes, malicious programs are really hard to defend against; maybe we
should first try to solve some common cases (such as empty PTE tables).

> alternate bytes.  We generally handle this by accounting the memory to
> the process and letting the OOM killer, rlimits, memcg or other mechanism
> take care of it.  Just because page tables are (generally) reclaimable
> doesn't mean we need to treat them specially.