[RFC,v2,0/4] Add support for sharing page tables across processes (Previously mshare)

Message ID: cover.1682453344.git.khalid.aziz@oracle.com

Message

Khalid Aziz April 26, 2023, 4:49 p.m. UTC
Memory pages shared between processes require a page table entry
(PTE) in each process. Each PTE consumes some memory, and as long
as the number of mappings being maintained is small, the space
consumed by page tables is not objectionable. When very few memory
pages are shared between processes, the number of page table
entries (PTEs) to maintain is mostly constrained by the number of
pages of memory on the system. As the number of shared pages and
the number of times pages are shared goes up, the amount of memory
consumed by page tables starts to become significant. This issue
does not apply to threads: any number of threads can share the same
pages inside a process while sharing the same PTEs. Extending this
model to sharing pages across processes can eliminate the issue for
cross-process sharing as well.

Field deployments commonly see memory pages shared across thousands
of processes. On x86_64, each page requires a PTE that is only 8
bytes long, which is very small compared to the 4K page size. But
when 2000 processes map the same page in their address space, each
one of them requires 8 bytes for its PTE, and together that adds up
to 16K of memory just to hold the PTEs for one 4K page. On a
database server with a 300GB SGA, a system crash from an
out-of-memory condition was seen when 1500+ clients tried to share
this SGA, even though the system had 512GB of memory. On this
server, the worst-case scenario of all 1500 processes mapping every
page of the SGA would have required 878GB+ just for the PTEs. If
these PTEs could be shared, the amount of memory saved would be
very significant.
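
(As a sanity check on that worst case: 300GB / 4KB per page is
~78.6 million pages; 78.6 million PTEs x 8 bytes is ~600MB of PTEs
per process; ~600MB x 1500 processes is ~879GB, matching the
878GB+ figure above.)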

This patch series adds a new flag to the mmap() call -
MAP_SHARED_PT. This flag can be specified along with MAP_SHARED by
a process to hint to the kernel that it wishes to share the page
table entries for this file-mapped region with other processes. Any
other process that mmaps the same file with the MAP_SHARED_PT flag
can then share the same page table entries. Besides specifying the
MAP_SHARED_PT flag, the processes must map the file at a
PMD-aligned address, with a size that is a multiple of PMD size,
and at the same virtual address. This last requirement of the same
virtual address can possibly be relaxed if that is the consensus.
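
As a minimal sketch of the intended userspace usage, assuming a 2MB
PMD on x86_64 (the MAP_SHARED_PT value below is a placeholder, the
real value comes from the asm-generic/mman-common.h change in this
series; the file path and fixed address are only illustrative and
error handling is omitted):

    #include <fcntl.h>
    #include <sys/mman.h>

    #define MAP_SHARED_PT 0x8000000      /* placeholder value */
    #define PMD_SZ        (2UL << 20)    /* assumes 2MB PMD */

    int main(void)
    {
            int fd = open("/dev/shm/sga", O_RDWR);

            /* Address and length must be PMD aligned, and every
             * sharing process must use the same virtual address. */
            void *addr = mmap((void *)(1UL << 40), 4 * PMD_SZ,
                              PROT_READ | PROT_WRITE,
                              MAP_SHARED | MAP_SHARED_PT | MAP_FIXED,
                              fd, 0);
            return addr == MAP_FAILED;
    }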

When mmap() is called with the MAP_SHARED_PT flag, a new host mm
struct is created to hold the shared page tables. This host mm
struct is not attached to any process. The start and size of the
host mm are set to the start and size of the mmap region, and a VMA
covering this range is added to the host mm struct. Existing page
table entries from the process that creates the mapping are copied
over to the host mm struct. All processes mapping this shared
region are considered guest processes. When a guest process mmaps
the shared region, the vm flag VM_SHARED_PT is added to the VMAs in
the guest process. Upon a page fault, the VMA is checked for the
presence of the VM_SHARED_PT flag. If the flag is found, the
corresponding PMD is updated with the PMD from the host mm struct,
so that the PMD points to the page tables in the host mm struct.
The vm_mm pointer of the VMA is also updated to point to the host
mm struct for the duration of fault handling, to ensure fault
handling happens in the context of the host mm struct. When a new
PTE is created, it is created in the host mm struct's page tables,
and the PMD in the guest mm points to the same PTEs.
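
In rough pseudocode, the fault-side redirection described above
looks something like the sketch below (an illustration only; the
ptshare_* helper names are made up, not the actual functions in
this series):

    vm_fault_t ptshare_fault(struct vm_area_struct *vma,
                             unsigned long address,
                             unsigned int flags, struct pt_regs *regs)
    {
            struct mm_struct *guest_mm = vma->vm_mm;
            struct mm_struct *host_mm;
            vm_fault_t ret;

            /* Look up the host mm that holds the shared page
             * tables for this file. */
            host_mm = ptshare_host_mm(vma->vm_file);

            /* Point the guest PMD at the host mm's PTE page. */
            ptshare_copy_pmd(host_mm, guest_mm, address);

            /* Handle the fault in the host mm's context so any new
             * PTE lands in the shared page tables. */
            vma->vm_mm = host_mm;
            ret = handle_mm_fault(vma, address, flags, regs);
            vma->vm_mm = guest_mm;

            return ret;
    }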

This is a basic working implementation. It will need to go through
more testing and refinements. Some notes and questions:

- The PMD alignment and size requirement is currently hard-coded.
  Is there a need or desire to make this more flexible and work
  with other alignments/sizes? PMD size allows this infrastructure
  to be adapted to form the basis for hugetlbfs page table sharing
  as well. More work will be needed to make that happen.

- Is there a reason to allow a userspace app to query this size and
  alignment requirement for MAP_SHARED_PT in some way?

- Shared PTEs mean that an mprotect() call made by one process
  affects all processes sharing the same mapping, and that behavior
  will need to be documented clearly (a short illustration follows
  this list). The effect of an mprotect() call being different for
  processes using shared page tables is the primary reason to
  require an explicit opt-in from userspace processes to share page
  tables. With transparent sharing derived from MAP_SHARED alone,
  the changed effect of mprotect() could break a significant number
  of userspace apps. One could work around that by unsharing
  whenever mprotect() changes modes on a shared mapping, but that
  introduces complexity, and the capability to execute a single
  mprotect() to change modes across thousands of processes sharing
  a mapped database is a feature explicitly asked for by database
  folks. This capability has a significant performance advantage
  compared to sending a message to every process using the shared
  mapping so each can call mprotect() and change modes itself, or
  to trapping on permission mismatches in each process.

- This implementation does not allow partially unmapping a page
  table shared mapping. Should that be supported in the future?
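
As a short illustration of the shared-mprotect semantics from the
note above (hypothetical, assuming a mapping created as in the
earlier mmap() sketch):

    /* In any one of the sharing processes: */
    mprotect(addr, 4 * PMD_SZ, PROT_READ);
    /* With shared page tables, the range becomes read-only for
     * every process sharing this mapping, in a single call. */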

Some concerns in this RFC:

- When the page tables for a process are freed upon process exit,
  pmd_free_tlb() gets called at one point to free all PMDs
  allocated by the process. For a shared page table, shared PMDs
  cannot be released when a guest process exits. These shared PMDs
  are released when the host mm struct is released, upon the last
  reference to the page table shared region hosted by this mm going
  away. For now, to stop the PMDs from being released, this RFC
  introduces the following change in mm/memory.c, which works but
  does not feel like the right approach. Any suggestions for a
  better long-term approach will be much appreciated:

@@ -210,13 +221,19 @@ static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,

        pmd = pmd_offset(pud, start);
        pud_clear(pud);
-       pmd_free_tlb(tlb, pmd, start);
-       mm_dec_nr_pmds(tlb->mm);
+       if (shared_pte) {
+               tlb_flush_pud_range(tlb, start, PAGE_SIZE);
+               tlb->freed_tables = 1;
+       } else {
+               pmd_free_tlb(tlb, pmd, start);
+               mm_dec_nr_pmds(tlb->mm);
+       }
 }

 static inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,

- This implementation requires an additional VM flag. Since all
  lower 32 bits are currently in use, the new VM flag must come
  from the upper 32 bits, which restricts this feature to 64-bit
  processors.

- This feature is implemented for file mappings only. Is there a
  need to support it for anonymous memory as well?

- Accounting of MAP_SHARED_PT mapped file pages and page table
  bytes in a process is not quite accurate yet in this RFC and will
  be fixed in the non-RFC version of these patches.

I would appreciate any feedback on these patches and any ideas for
improvements before moving them out of the RFC stage.


Changes from RFC v1:
- Broke the patches up into smaller patches
- Fixed a few bugs related to freeing PTEs and PMDs incorrectly
- Cleaned up the code a bit


Khalid Aziz (4):
  mm/ptshare: Add vm flag for shared PTE
  mm/ptshare: Add flag MAP_SHARED_PT to mmap()
  mm/ptshare: Create new mm struct for page table sharing
  mm/ptshare: Add page fault handling for page table shared regions

 include/linux/fs.h                     |   2 +
 include/linux/mm.h                     |   8 +
 include/trace/events/mmflags.h         |   3 +-
 include/uapi/asm-generic/mman-common.h |   1 +
 mm/Makefile                            |   2 +-
 mm/internal.h                          |  21 ++
 mm/memory.c                            | 105 ++++++++--
 mm/mmap.c                              |  88 +++++++++
 mm/ptshare.c                           | 263 +++++++++++++++++++++++++
 9 files changed, 476 insertions(+), 17 deletions(-)
 create mode 100644 mm/ptshare.c

Comments

Mike Kravetz April 26, 2023, 9:27 p.m. UTC | #1
On 04/26/23 10:49, Khalid Aziz wrote:
> Memory pages shared between processes require a page table entry
> (PTE) for each process. Each of these PTE consumes some of the
> memory and as long as number of mappings being maintained is small
> enough, this space consumed by page tables is not objectionable.
> When very few memory pages are shared between processes, the number
> of page table entries (PTEs) to maintain is mostly constrained by
> the number of pages of memory on the system.  As the number of
> shared pages and the number of times pages are shared goes up,
> amount of memory consumed by page tables starts to become
> significant. This issue does not apply to threads. Any number of
> threads can share the same pages inside a process while sharing the
> same PTEs. Extending this same model to sharing pages across
> processes can eliminate this issue for sharing across processes as
> well.
> 
> Some of the field deployments commonly see memory pages shared
> across 1000s of processes. On x86_64, each page requires a PTE that
> is only 8 bytes long which is very small compared to the 4K page
> size. When 2000 processes map the same page in their address space,
> each one of them requires 8 bytes for its PTE and together that adds
> up to 8K of memory just to hold the PTEs for one 4K page. On a
> database server with 300GB SGA, a system crash was seen with
> out-of-memory condition when 1500+ clients tried to share this SGA
> even though the system had 512GB of memory. On this server, in the
> worst case scenario of all 1500 processes mapping every page from
> SGA would have required 878GB+ for just the PTEs. If these PTEs
> could be shared, amount of memory saved is very significant.

When hugetlb pmd sharing was introduced the motivating factor was performance
as much as memory savings.  See commit,
39dde65c9940 [PATCH] shared page table for hugetlb page

I have not verified the claims in that commit message, but it does sound
reasonable.  My assumption is that the same would apply here.

> 
> This patch series adds a new flag to mmap() call - MAP_SHARED_PT.
> This flag can be specified along with MAP_SHARED by a process to
> hint to kernel that it wishes to share page table entries for this
> file mapping mmap region with other processes. Any other process
> that mmaps the same file with MAP_SHARED_PT flag can then share the
> same page table entries. Besides specifying MAP_SHARED_PT flag, the
> processes must map the files at a PMD aligned address with a size
> that is a multiple of PMD size and at the same virtual addresses.
> This last requirement of same virtual addresses can possibly be
> relaxed if that is the consensus.

I would think the same virtual address requirement is problematic in the
case of ASLR.  hugetlb pmd sharing does not have the 'same virtual
addresses' requirement.  My guess is that requiring the same virtual
address does not make code simpler.

> When mmap() is called with MAP_SHARED_PT flag, a new host mm struct
> is created to hold the shared page tables. Host mm struct is not
> attached to a process. Start and size of host mm are set to the
> start and size of the mmap region and a VMA covering this range is
> also added to host mm struct. Existing page table entries from the
> process that creates the mapping are copied over to the host mm
> struct.

It seems there is one host mm struct per shared mapping.  Is that
correct?  And, the host mm is the size of that single mapping?
Suppose the following:
- process A maps shared file offset 0 to 2xPMD_SIZE
- process B maps shared file offset 0 to 2xPMD_SIZE
It then makes sense A and B would use the same host mm.  Now,
- process C maps shared file offset 0 to 4xPMD_SIZE
Does C share anything with A and B?  Are there multiple host mm structs?

Just wondering at a high level how this would work without looking at
the code.

>         All processes mapping this shared region are considered
> guest processes. When a guest process mmap's the shared region, a vm
> flag VM_SHARED_PT is added to the VMAs in guest process. Upon a page
> fault, VMA is checked for the presence of VM_SHARED_PT flag. If the
> flag is found, its corresponding PMD is updated with the PMD from
> host mm struct so the PMD will point to the page tables in host mm
> struct. vm_mm pointer of the VMA is also updated to point to host mm
> struct for the duration of fault handling to ensure fault handling
> happens in the context of host mm struct. When a new PTE is
> created, it is created in the host mm struct page tables and the PMD
> in guest mm points to the same PTEs.
> 
> This is a basic working implementation. It will need to go through
> more testing and refinements. Some notes and questions:
> 
> - PMD size alignment and size requirement is currently hard coded
>   in. Is there a need or desire to make this more flexible and work
>   with other alignments/sizes? PMD size allows for adapting this
>   infrastructure to form the basis for hugetlbfs page table sharing
>   as well. More work will be needed to make that happen.
> 
> - Is there a reason to allow a userspace app to query this size and
>   alignment requirement for MAP_SHARED_PT in some way?
> 
> - Shared PTEs means mprotect() call made by one process affects all
>   processes sharing the same mapping and that behavior will need to
>   be documented clearly. Effect of mprotect call being different for
>   processes using shared page tables is the primary reason to
>   require an explicit opt-in from userspace processes to share page
>   tables. With a transparent sharing derived from MAP_SHARED alone,
>   changed effect of mprotect can break significant number of
>   userspace apps. One could work around that by unsharing whenever
>   mprotect changes modes on shared mapping but that introduces
>   complexity and the capability to execute a single mprotect to
>   change modes across 1000's of processes sharing a mapped database
>   is a feature explicitly asked for by database folks. This
>   capability has significant performance advantage when compared to
>   mechanism of sending messages to every process using shared
>   mapping to call mprotect and change modes in each process, or
>   using traps on permissions mismatch in each process.

I would guess this is more than just mprotect, and anything that impacts
page tables.  Correct?  For example MADV_DONTNEED, MADV_HUGEPAGE,
MADV_NOHUGEPAGE.

> - This implementation does not allow unmapping page table shared
>   mappings partially. Should that be supported in future?
> 
> Some concerns in this RFC:
> 
> - When page tables for a process are freed upon process exit,
>   pmd_free_tlb() gets called at one point to free all PMDs allocated
>   by the process. For a shared page table, shared PMDs can not be
>   released when a guest process exits. These shared PMDs are
>   released when host mm struct is released upon end of last
>   reference to page table shared region hosted by this mm. For now
>   to stop PMDs being released, this RFC introduces following change
>   in mm/memory.c which works but does not feel like the right
>   approach. Any suggestions for a better long term approach will be
>   very appreciated:
> 
> @@ -210,13 +221,19 @@ static inline void free_pmd_range(struct mmu_gather *tlb,
> pud_t *pud,
> 
>         pmd = pmd_offset(pud, start);
>         pud_clear(pud);
> -       pmd_free_tlb(tlb, pmd, start);
> -       mm_dec_nr_pmds(tlb->mm);
> +       if (shared_pte) {
> +               tlb_flush_pud_range(tlb, start, PAGE_SIZE);
> +               tlb->freed_tables = 1;
> +       } else {
> +               pmd_free_tlb(tlb, pmd, start);
> +               mm_dec_nr_pmds(tlb->mm);
> +       }
>  }
> 
>  static inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
> 
> - This implementation requires an additional VM flag. Since all lower
>   32 bits are currently in use, the new VM flag must come from upper
>   32 bits which restricts this feature to 64-bit processors.
> 
> - This feature is implemented for file mappings only. Is there a
>   need to support it for anonymous memory as well?

I have not looked at the implementation; are 'file mappings' only
mappings using the page cache?  Or do you just need a file
descriptor?  Would a file descriptor created via memfd_create() work
for anonymous memory?
Khalid Aziz April 27, 2023, 4:40 p.m. UTC | #2
On 4/26/23 15:27, Mike Kravetz wrote:
> On 04/26/23 10:49, Khalid Aziz wrote:
>> Memory pages shared between processes require a page table entry
>> (PTE) for each process. Each of these PTE consumes some of the
>> memory and as long as number of mappings being maintained is small
>> enough, this space consumed by page tables is not objectionable.
>> When very few memory pages are shared between processes, the number
>> of page table entries (PTEs) to maintain is mostly constrained by
>> the number of pages of memory on the system.  As the number of
>> shared pages and the number of times pages are shared goes up,
>> amount of memory consumed by page tables starts to become
>> significant. This issue does not apply to threads. Any number of
>> threads can share the same pages inside a process while sharing the
>> same PTEs. Extending this same model to sharing pages across
>> processes can eliminate this issue for sharing across processes as
>> well.
>>
>> Some of the field deployments commonly see memory pages shared
>> across 1000s of processes. On x86_64, each page requires a PTE that
>> is only 8 bytes long which is very small compared to the 4K page
>> size. When 2000 processes map the same page in their address space,
>> each one of them requires 8 bytes for its PTE and together that adds
>> up to 8K of memory just to hold the PTEs for one 4K page. On a
>> database server with 300GB SGA, a system crash was seen with
>> out-of-memory condition when 1500+ clients tried to share this SGA
>> even though the system had 512GB of memory. On this server, in the
>> worst case scenario of all 1500 processes mapping every page from
>> SGA would have required 878GB+ for just the PTEs. If these PTEs
>> could be shared, amount of memory saved is very significant.
> 
> When hugetlb pmd sharing was introduced the motivating factor was performance
> as much as memory savings.  See commit,
> 39dde65c9940 [PATCH] shared page table for hugetlb page
> 
> I have not verified the claims in that commit message, but it does sound
> reasonable.  My assumption is that the same would apply here.

Hi Mike,

Thanks for the feedback. I looked up the commit log for 39dde65c9940. You are right, the same does apply here.

> 
>>
>> This patch series adds a new flag to mmap() call - MAP_SHARED_PT.
>> This flag can be specified along with MAP_SHARED by a process to
>> hint to kernel that it wishes to share page table entries for this
>> file mapping mmap region with other processes. Any other process
>> that mmaps the same file with MAP_SHARED_PT flag can then share the
>> same page table entries. Besides specifying MAP_SHARED_PT flag, the
>> processes must map the files at a PMD aligned address with a size
>> that is a multiple of PMD size and at the same virtual addresses.
>> This last requirement of same virtual addresses can possibly be
>> relaxed if that is the consensus.
> 
> I would think the same virtual address requirement is problematic in the
> case of ASLR.  hugetlb pmd sharing does not have the 'same virtual
> addresses' requirement.  My guess is that requiring the same virtual
> address does not make code simpler.

The same virtual address requirement made the previous mshare
implementation quite a bit simpler, but the reimplementation in this
RFC does not rely on virtual addresses being the same as much. That
does open up the possibility of removing this requirement. I just
haven't looked through the potential complications enough.

> 
>> When mmap() is called with MAP_SHARED_PT flag, a new host mm struct
>> is created to hold the shared page tables. Host mm struct is not
>> attached to a process. Start and size of host mm are set to the
>> start and size of the mmap region and a VMA covering this range is
>> also added to host mm struct. Existing page table entries from the
>> process that creates the mapping are copied over to the host mm
>> struct.
> 
> It seems there is one host mm struct per shared mapping.  Is that
> correct?  And, the host mm is the size of that single mapping?

There is one host mm per shared file. This was done to allow multiple
VMAs to exist in this mm, so that a single file can potentially have
multiple shared regions. In the RFC, I am creating only one shared
region per file for now.

> Suppose the following:
> - process A maps shared file offset 0 to 2xPMD_SIZE
> - process B maps shared file offset 0 to 2xPMD_SIZE
> It then makes sense A and B would use the same host mm.  Now,

Yes, that is correct.

> - process C maps shared file offset 0 to 4xPMD_SIZE
> Does C share anything with A and B?  Are there multiple host mm structs?

In the RFC implementation, C does not share anything with A and B. The
code is written to allow for expansion to support this sharing, as
well as sharing multiple non-contiguous regions of the file. My goal
is to debug and finalize the sharing infrastructure for the simple
case and then start expanding it. In this specific case, the
implementation would expand the existing shared VMAs to cover offset 0
to 4xPMD.

> 
> Just wondering at a high level how this would work without looking at
> the code.
> 
>>          All processes mapping this shared region are considered
>> guest processes. When a guest process mmap's the shared region, a vm
>> flag VM_SHARED_PT is added to the VMAs in guest process. Upon a page
>> fault, VMA is checked for the presence of VM_SHARED_PT flag. If the
>> flag is found, its corresponding PMD is updated with the PMD from
>> host mm struct so the PMD will point to the page tables in host mm
>> struct. vm_mm pointer of the VMA is also updated to point to host mm
>> struct for the duration of fault handling to ensure fault handling
>> happens in the context of host mm struct. When a new PTE is
>> created, it is created in the host mm struct page tables and the PMD
>> in guest mm points to the same PTEs.
>>
>> This is a basic working implementation. It will need to go through
>> more testing and refinements. Some notes and questions:
>>
>> - PMD size alignment and size requirement is currently hard coded
>>    in. Is there a need or desire to make this more flexible and work
>>    with other alignments/sizes? PMD size allows for adapting this
>>    infrastructure to form the basis for hugetlbfs page table sharing
>>    as well. More work will be needed to make that happen.
>>
>> - Is there a reason to allow a userspace app to query this size and
>>    alignment requirement for MAP_SHARED_PT in some way?
>>
>> - Shared PTEs means mprotect() call made by one process affects all
>>    processes sharing the same mapping and that behavior will need to
>>    be documented clearly. Effect of mprotect call being different for
>>    processes using shared page tables is the primary reason to
>>    require an explicit opt-in from userspace processes to share page
>>    tables. With a transparent sharing derived from MAP_SHARED alone,
>>    changed effect of mprotect can break significant number of
>>    userspace apps. One could work around that by unsharing whenever
>>    mprotect changes modes on shared mapping but that introduces
>>    complexity and the capability to execute a single mprotect to
>>    change modes across 1000's of processes sharing a mapped database
>>    is a feature explicitly asked for by database folks. This
>>    capability has significant performance advantage when compared to
>>    mechanism of sending messages to every process using shared
>>    mapping to call mprotect and change modes in each process, or
>>    using traps on permissions mismatch in each process.
> 
> I would guess this is more than just mprotect, and anything that impacts
> page tables.  Correct?  For example MADV_DONTNEED, MADV_HUGEPAGE,
> MADV_NOHUGEPAGE.

Yes. For instance, the RFC implementation allows one process to set
MADV_DONTNEED and remove PTEs, which removes them for all other
processes sharing the mapping. That behavior does not quite sound
right. A more robust implementation would possibly remove PTEs only
once all sharing processes have set MADV_DONTNEED. Other flags might
need different treatment; for instance, MADV_WILLNEED should trigger
read-ahead even if set by just one process.


> 
>> - This implementation does not allow unmapping page table shared
>>    mappings partially. Should that be supported in future?
>>
>> Some concerns in this RFC:
>>
>> - When page tables for a process are freed upon process exit,
>>    pmd_free_tlb() gets called at one point to free all PMDs allocated
>>    by the process. For a shared page table, shared PMDs can not be
>>    released when a guest process exits. These shared PMDs are
>>    released when host mm struct is released upon end of last
>>    reference to page table shared region hosted by this mm. For now
>>    to stop PMDs being released, this RFC introduces following change
>>    in mm/memory.c which works but does not feel like the right
>>    approach. Any suggestions for a better long term approach will be
>>    very appreciated:
>>
>> @@ -210,13 +221,19 @@ static inline void free_pmd_range(struct mmu_gather *tlb,
>> pud_t *pud,
>>
>>          pmd = pmd_offset(pud, start);
>>          pud_clear(pud);
>> -       pmd_free_tlb(tlb, pmd, start);
>> -       mm_dec_nr_pmds(tlb->mm);
>> +       if (shared_pte) {
>> +               tlb_flush_pud_range(tlb, start, PAGE_SIZE);
>> +               tlb->freed_tables = 1;
>> +       } else {
>> +               pmd_free_tlb(tlb, pmd, start);
>> +               mm_dec_nr_pmds(tlb->mm);
>> +       }
>>   }
>>
>>   static inline void free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
>>
>> - This implementation requires an additional VM flag. Since all lower
>>    32 bits are currently in use, the new VM flag must come from upper
>>    32 bits which restricts this feature to 64-bit processors.
>>
>> - This feature is implemented for file mappings only. Is there a
>>    need to support it for anonymous memory as well?
> 
> I have not looked at the implementation, are 'file mappings' only
> mappings using the page cache?  Or, do you just need a file descriptor?
> Would a file descriptor created via memfd_create work for anonymous
> memory?
> 

I look for just a file pointer, so an fd created by memfd_create() would work for anonymous memory.
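
For example, an untested sketch (reusing the placeholder PMD_SZ and
MAP_SHARED_PT definitions from the cover letter's mmap() example)
would be:

    int fd = memfd_create("ptshare-anon", 0);  /* fd-backed anonymous memory */

    ftruncate(fd, 4 * PMD_SZ);
    void *addr = mmap((void *)(1UL << 40), 4 * PMD_SZ,
                      PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_SHARED_PT | MAP_FIXED, fd, 0);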

Thanks,
Khalid
Peter Xu June 12, 2023, 4:25 p.m. UTC | #3
On Wed, Apr 26, 2023 at 10:49:47AM -0600, Khalid Aziz wrote:
> This patch series adds a new flag to mmap() call - MAP_SHARED_PT.

Since hugetlb has this, it'll be very helpful if you can provide a
comparison between this approach and hugetlb's, especially the
differences, and the reasoning behind them.

Merging anything like this should definitely also pave the way for
hugetlb's future, so it even seems to be a "requirement" of such a
patchset, even if only implicitly, considering the "hotness" hugetlb
has recently seen on the refactoring front (if not a rewrite).

Some high level questions:

  - Why mmap() flag?

    For this one, I agree it should be opt-in: sharing the pgtable
    definitely means sharing a lot of privileges for operating on the
    current mm, so one should be aware of that and be prepared to be
    messed with.

    IIUC hugetlb doesn't do this; instead, when something "racy"
    happens, it just unshares by default.  To me opt-in makes more
    sense if starting from zero, because I don't think a process wants
    to leak its mm to others by default.

    I think you mentioned allowing the pgtable to be shared so that an
    mprotect() from one MM is seen by all MMs; but if so, then
    DONTNEED should really do the same - when one MM DONTNEEDs a
    range, it should go away for all.  It doesn't make a lot of sense
    to me to treat it differently with a DONTNEED refcount anywhere.

  - Can a guest MM opt out?

    Should we allow a guest MM to opt out when it wants to?  That
    sounds like a good thing to have to me, especially for files,
    where it sounds as simple as zapping the pgtable.  But then the
    mmap flag will stop working, IIUC, so this goes back to the
    previous question (e.g. what about madvise or prctl?).

  - Why an mm_struct to represent the shared pgtable?

    IIUC hugetlb used the pgtable page itself plus some refcounting
    (the refcounting is racy with gup-fast, as Jann used to point out,
    but that's another story..).  My question is: do you think that
    won't work here?  If so, why not?  Why is mm_struct the structure
    you chose for representing a shared pgtable?  And why per-file?

Thanks,
Rongwei Wang June 30, 2023, 11:29 a.m. UTC | #4
Hi Khalid

I see this patch series was sent out in April, and I want to ask about
the status of this RFC now (IMHO, the code still has some places to
fix). This feature is useful for saving a lot of memory on page
tables, so we would also like to use this optimization in our
databases if upstream accepts it.

BTW, in the past few weeks I made some adjustments on top of your code
to simplify it and meet the needs of our databases, e.g. multiple VMAs
sharing the same shadow mm, madvise, and memory compaction. If you are
interested, I can provide the detailed code.


Thanks,

-wrw

On 2023/4/27 00:49, Khalid Aziz wrote:
> [snip]
Rongwei Wang July 31, 2023, 4:35 a.m. UTC | #5
Hi Matthew

May I ask you another question about mshare under this RFC? I remember
you said you would redesign mshare to be per-VMA rather than
per-mapping (apologies if I remember wrongly) in the last MM alignment
session, and I also referred to that when re-coding this part in our
internal version (based on this RFC). It seems that per-VMA sharing
can simplify the structure of page table sharing, and it doesn't even
have to care about differing permissions on the file mapping; these
are the advantages (maybe) that I can imagine. But IMHO, that does not
seem a strong enough reason to switch from per-mapping to per-VMA.

And I can't imagine what other considerations upstream may have. Can
you share the reason for redesigning this in a per-VMA way? Is it for
integration with hugetlbfs page table sharing or anonymous page
sharing?

Thanks for your time.

On 2023/4/27 00:49, Khalid Aziz wrote:
> [snip]
Matthew Wilcox July 31, 2023, 12:25 p.m. UTC | #6
On Mon, Jul 31, 2023 at 12:35:00PM +0800, Rongwei Wang wrote:
> Hi Matthew
> 
> May I ask you another question about mshare under this RFC? I remember you
> said you will redesign the mshare to per-vma not per-mapping (apologize if
> remember wrongly) in last time MM alignment session. And I also refer to you
> to re-code this part in our internal version (based on this RFC). It seems
> that per VMA will can simplify the structure of pgtable sharing, even
> doesn't care the different permission of file mapping. these are advantages
> (maybe) that I can imagine. But IMHO, It seems not a strongly reason to
> switch per-mapping to per-vma.
> 
> And I can't imagine other considerations of upstream. Can you share the
> reason why redesigning in a per-vma way, due to integation with hugetlbfs
> pgtable sharing or anonymous page sharing?

It was David who wants to make page table sharing be per-VMA.  I think
he is advocating for the wrong approach.  In any case, I don't have time
to work on mshare and Khalid is on leave until September, so I don't
think anybody is actively working on mshare.

> [snip]
David Hildenbrand July 31, 2023, 12:50 p.m. UTC | #7
On 31.07.23 14:25, Matthew Wilcox wrote:
> On Mon, Jul 31, 2023 at 12:35:00PM +0800, Rongwei Wang wrote:
>> Hi Matthew
>>
>> May I ask you another question about mshare under this RFC? I remember
>> you said in the last MM alignment session that you would redesign
>> mshare to be per-VMA rather than per-mapping (apologies if I remember
>> wrongly). I also referred to you when re-coding this part in our
>> internal version (based on this RFC). It seems that per-VMA can
>> simplify the structure of pgtable sharing, since it need not care
>> about differing permissions on the file mapping. Those are the
>> advantages (maybe) that I can imagine, but IMHO they do not seem like
>> a strong enough reason to switch from per-mapping to per-VMA.
>>
>> And I cannot guess the other upstream considerations. Can you share
>> the reason for redesigning in a per-VMA way? Is it for integration
>> with hugetlbfs pgtable sharing or anonymous page sharing?
> 
> It was David who wants to make page table sharing be per-VMA.  I think
> he is advocating for the wrong approach.  In any case, I don't have time
> to work on mshare and Khalid is on leave until September, so I don't
> think anybody is actively working on mshare.

I also don't have any time to look into this, but my comment
essentially was that we should try decoupling page table sharing
(reduced memory consumption, shorter rmap walks) from the
mprotect(PROT_READ) use case.

For page table sharing I was wondering whether there could be ways to
just have that done semi-automatically, similar to how it's done for
hugetlb. There are some clear limitations: mappings < PMD_SIZE won't be
able to benefit.

It's still unclear whether that is a real limitation. Some use cases 
were raised (put all user space library mappings into a shared area), 
but I realized that these conflict with MAP_PRIVATE requirements of such 
areas. Maybe I'm wrong and this is easily resolved.

At least it's not the primary use case that was raised. For the primary 
use cases (VMs, databases) that map huge areas, it might not be a 
limitation.


Regarding mprotect(PROT_READ), my point was that mprotect() is most
probably the wrong tool to use (especially due to signal handling).
Instead, I was suggesting having a way to essentially protect pages in a
shmem file -- and get notified whenever anything wants to write to such
a page, either via the page tables or via write() and friends. We do
have the write-notify infrastructure for filesystems in place that we
might extend/reuse. That mechanism could benefit from shared page tables
by having to do fewer rmap walks.
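
To be concrete about what I mean by write-notify: the ->page_mkwrite
path. A memfd-level protection could hook in roughly like the following
sketch (shmem_file_is_protected() is entirely made up):

static vm_fault_t shmem_protected_page_mkwrite(struct vm_fault *vmf)
{
        struct inode *inode = file_inode(vmf->vma->vm_file);

        /* Writes through a PTE into a protected range get refused (or
         * reported); everything else takes the normal write-notify
         * path. */
        if (shmem_file_is_protected(inode, vmf->pgoff))
                return VM_FAULT_SIGBUS;
        return filemap_page_mkwrite(vmf);
}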

Again, I don't have time to look into that (just like everybody else as 
it appears) and might miss something important. Just sharing my thoughts 
that I raised in the call.
Rongwei Wang July 31, 2023, 4:19 p.m. UTC | #8
On 2023/7/31 20:50, David Hildenbrand wrote:
> On 31.07.23 14:25, Matthew Wilcox wrote:
>> On Mon, Jul 31, 2023 at 12:35:00PM +0800, Rongwei Wang wrote:
>>> Hi Matthew
>>>
>>> May I ask you another question about mshare under this RFC? I
>>> remember you said in the last MM alignment session that you would
>>> redesign mshare to be per-VMA rather than per-mapping (apologies if
>>> I remember wrongly). I also referred to you when re-coding this part
>>> in our internal version (based on this RFC). It seems that per-VMA
>>> can simplify the structure of pgtable sharing, since it need not
>>> care about differing permissions on the file mapping. Those are the
>>> advantages (maybe) that I can imagine, but IMHO they do not seem
>>> like a strong enough reason to switch from per-mapping to per-VMA.
>>>
>>> And I cannot guess the other upstream considerations. Can you share
>>> the reason for redesigning in a per-VMA way? Is it for integration
>>> with hugetlbfs pgtable sharing or anonymous page sharing?
>>
>> It was David who wants to make page table sharing be per-VMA.  I think
>> he is advocating for the wrong approach.  In any case, I don't have time
>> to work on mshare and Khalid is on leave until September, so I don't
>> think anybody is actively working on mshare.
>
> I also don't have any time to look into this, but my comment
> essentially was that we should try decoupling page table sharing
> (reduced memory consumption, shorter rmap walks) from the
> mprotect(PROT_READ) use case.

Hi David, Matthew

Thanks for your reply.

Sorry, I can't see the relationship between decoupling page table
sharing and the per-VMA design. And I think mprotect(PROT_READ) has to
modify all shared page tables of the related tasks. It seems I am
missing something about per-VMA in your words.

BTW, I can imagine a corner case that shows a (possible) defect of
per-mapping: if we create a page-table-shared range via memfd_create(),
a child process also owns this shared range. But that child cannot
create page table sharing based on the same fd again after munmap()ing
the range (same mapping, but a different VMA area). Per-VMA, of course,
is the better choice here: it could continue to create page table
sharing based on the original fd, because a new mm struct is created
that way. I guess that is a kind of the decoupling you mean? A sketch
of the sequence follows below.

It's just a corner case; I am not sure how important it is.
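
A rough sketch of that sequence (MAP_SHARED_PT as in this RFC's patched
uapi headers; the address and length are made up, PMD aligned and a
multiple of the PMD size):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

#define ADDR ((void *)0x7f0000000000UL) /* made up, PMD aligned */
#define LEN  (16UL << 21)               /* 16 x 2MB PMDs */

int main(void)
{
        int fd = memfd_create("ptshare", 0);

        ftruncate(fd, LEN);
        /* The parent establishes the page-table-shared mapping. */
        void *p = mmap(ADDR, LEN, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_SHARED_PT, fd, 0);

        if (fork() == 0) {
                /* The child shares the range at first... */
                munmap(p, LEN);
                /* ...but with per-mapping it cannot establish sharing
                 * from the same fd again (same mapping, different VMA
                 * area). */
                mmap(ADDR, LEN, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_SHARED_PT, fd, 0);
        }
        return 0;
}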

>
> For page table sharing I was wondering whether there could be ways to
> just have that done semi-automatically, similar to how it's done for
> hugetlb. There are some clear limitations: mappings < PMD_SIZE won't
> be able to benefit.
>
> It's still unclear whether that is a real limitation. Some use cases 
> were raised (put all user space library mappings into a shared area), 
> but I realized that these conflict with MAP_PRIVATE requirements of 
> such areas. Maybe I'm wrong and this is easily resolved.
>
> At least it's not the primary use case that was raised. For the 
> primary use cases (VMs, databases) that map huge areas, it might not 
> be a limitation.
>
>
> Regarding mprotect(PROT_READ), my point was that mprotect() is most
> probably the wrong tool to use (especially due to signal handling).
> Instead, I was suggesting having a way to essentially protect pages in
> a shmem file -- and get notified whenever anything wants to write to
> such a page, either via the page tables or via write() and friends. We
> do have the write-notify infrastructure for filesystems in place that
> we might extend/reuse.
I am not well versed in filesystems. Write-notify sounds like a good
idea; maybe I need some time to digest it.
> That mechanism could benefit from shared page tables by having to do
> fewer rmap walks.
>
> Again, I don't have time to look into that (just like everybody else 
> as it appears) and might miss something important. Just sharing my 
> thoughts that I raised in the call.

Your words are very helpful to me. I will try to design our internal
version of this feature the right way.

Thanks again.
David Hildenbrand July 31, 2023, 4:30 p.m. UTC | #9
On 31.07.23 18:19, Rongwei Wang wrote:
> 
> On 2023/7/31 20:50, David Hildenbrand wrote:
>> On 31.07.23 14:25, Matthew Wilcox wrote:
>>> On Mon, Jul 31, 2023 at 12:35:00PM +0800, Rongwei Wang wrote:
>>>> Hi Matthew
>>>>
>>>> May I ask you another question about mshare under this RFC? I
>>>> remember you said in the last MM alignment session that you would
>>>> redesign mshare to be per-VMA rather than per-mapping (apologies if
>>>> I remember wrongly). I also referred to you when re-coding this
>>>> part in our internal version (based on this RFC). It seems that
>>>> per-VMA can simplify the structure of pgtable sharing, since it
>>>> need not care about differing permissions on the file mapping.
>>>> Those are the advantages (maybe) that I can imagine, but IMHO they
>>>> do not seem like a strong enough reason to switch from per-mapping
>>>> to per-VMA.
>>>>
>>>> And I cannot guess the other upstream considerations. Can you share
>>>> the reason for redesigning in a per-VMA way? Is it for integration
>>>> with hugetlbfs pgtable sharing or anonymous page sharing?
>>>
>>> It was David who wants to make page table sharing be per-VMA.  I think
>>> he is advocating for the wrong approach.  In any case, I don't have time
>>> to work on mshare and Khalid is on leave until September, so I don't
>>> think anybody is actively working on mshare.
>>
>> I also don't have any time to look into this, but my comment
>> essentially was that we should try decoupling page table sharing
>> (reduced memory consumption, shorter rmap walks) from the
>> mprotect(PROT_READ) use case.
> 
> Hi David, Matthew
> 
> Thanks for your reply.
> 
> Sorry, I can't see the relationship between decoupling page table
> sharing and the per-VMA design. And I think mprotect(PROT_READ) has to
> modify all shared page tables of the related tasks. It seems I am
> missing something about per-VMA in your words.

Assume we do do the page table sharing at mmap time, if the flags are 
right. Let's focus on the most common:

mmap(memfd, PROT_READ | PROT_WRITE, MAP_SHARED)

And doing the same in each and every process.


The original design of doing an mprotect(PROT_READ) in each and every
process is just absolutely inefficient for protecting a memfd page.

For that case, my thought was that you actually want to write-protect 
the pages on the memfd level.

So instead of doing mprotect(PROT_READ) in 999 processes, or doing
mprotect(PROT_READ) on mshare(), you would have a memfd feature to
protect pages from any write access -- not using virtual addresses but
using an offset in the memfd.


Assume such a (badly imagined) memfd_protect(PROT_READ) would make sure 
that:
(1) Any page table mappings of the page are write-protected and
(2) Any write access using the page table mappings triggers write-notify and
(3) Any other access -- e.g., write() -- similarly informs memfd.

Without page table sharing, (1) would have to walk all mappings via the 
rmap. With page table sharing, it would only have to walk one page table.

But the features would be two separate things.

What memfd would do with that write notification (inject a signal, 
something like uffd) would be a different story.
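
In userspace, the flow I have in mind would look roughly like this
(memfd_protect() is, as said, entirely made up -- name, signature and
semantics included):

        int fd = memfd_create("buffer", MFD_CLOEXEC);

        ftruncate(fd, LEN);
        /* One call write-protects the file range for every process
         * mapping it -- keyed by (offset, length) in the memfd, not
         * by any virtual address. */
        memfd_protect(fd, 0, LEN, PROT_READ);

instead of 999 processes each calling mprotect(addr, LEN, PROT_READ) on
their own mapping.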


Again, just an idea and maybe complete garbage.
Matthew Wilcox July 31, 2023, 4:38 p.m. UTC | #10
On Mon, Jul 31, 2023 at 06:30:22PM +0200, David Hildenbrand wrote:
> Assume we do do the page table sharing at mmap time, if the flags are right.
> Let's focus on the most common:
> 
> mmap(memfd, PROT_READ | PROT_WRITE, MAP_SHARED)
> 
> And doing the same in each and every process.

That may be the most common in your usage, but for a database, you're
looking at two usage scenarios.  Postgres calls mmap() on the database
file itself so that all processes share the kernel page cache.
Some Commercial Databases call mmap() on a hugetlbfs file so that all
processes share the same userspace buffer cache.  Other Commercial
Databases call shmget() / shmat() with SHM_HUGETLB for the exact
same reason.
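
For concreteness, those two patterns look roughly like the sketches
below (path, key and mode invented; len must be a multiple of the huge
page size):

#include <fcntl.h>
#include <sys/ipc.h>
#include <sys/mman.h>
#include <sys/shm.h>

/* Shared userspace buffer cache on hugetlbfs */
void *hugetlbfs_cache(size_t len)
{
        int fd = open("/dev/hugepages/bufcache", O_CREAT | O_RDWR, 0600);

        return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                    fd, 0);
}

/* The same userspace buffer cache via SysV shared memory */
void *sysv_cache(key_t key, size_t len)
{
        int id = shmget(key, len, IPC_CREAT | SHM_HUGETLB | 0600);

        return shmat(id, NULL, 0);
}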

This is why I proposed mshare().  Anyone can use it for anything.
We have such a diverse set of users who want to do stuff with shared
page tables that we should not be tying it to memfd or any other
filesystem.  Not to mention that it's more flexible; you can map
individual 4kB files into it and still get page table sharing.
David Hildenbrand July 31, 2023, 4:48 p.m. UTC | #11
On 31.07.23 18:38, Matthew Wilcox wrote:
> On Mon, Jul 31, 2023 at 06:30:22PM +0200, David Hildenbrand wrote:
>> Assume we do do the page table sharing at mmap time, if the flags are right.
>> Let's focus on the most common:
>>
>> mmap(memfd, PROT_READ | PROT_WRITE, MAP_SHARED)
>>
>> And doing the same in each and every process.
> 
> That may be the most common in your usage, but for a database, you're
> looking at two usage scenarios.  Postgres calls mmap() on the database
> file itself so that all processes share the kernel page cache.
> Some Commercial Databases call mmap() on a hugetlbfs file so that all
> processes share the same userspace buffer cache.  Other Commercial
> Databases call shmget() / shmat() with SHM_HUGETLB for the exact
> same reason.

I remember you said that postgres might be looking into using shmem as
well; maybe I am wrong.

memfd/hugetlb/shmem could all be handled alike, just "arbitrary 
filesystems" would require more work.

> 
> This is why I proposed mshare().  Anyone can use it for anything.
> We have such a diverse set of users who want to do stuff with shared
> page tables that we should not be tying it to memfd or any other
> filesystem.  Not to mention that it's more flexible; you can map
> individual 4kB files into it and still get page table sharing.

That's not what the current proposal does, or am I wrong?

Also, I'm curious, is that a real requirement in the database world?
Matthew Wilcox July 31, 2023, 4:54 p.m. UTC | #12
On Mon, Jul 31, 2023 at 06:48:47PM +0200, David Hildenbrand wrote:
> On 31.07.23 18:38, Matthew Wilcox wrote:
> > On Mon, Jul 31, 2023 at 06:30:22PM +0200, David Hildenbrand wrote:
> > > Assume we do do the page table sharing at mmap time, if the flags are right.
> > > Let's focus on the most common:
> > > 
> > > mmap(memfd, PROT_READ | PROT_WRITE, MAP_SHARED)
> > > 
> > > And doing the same in each and every process.
> > 
> > That may be the most common in your usage, but for a database, you're
> > looking at two usage scenarios.  Postgres calls mmap() on the database
> > file itself so that all processes share the kernel page cache.
> > Some Commercial Databases call mmap() on a hugetlbfs file so that all
> > processes share the same userspace buffer cache.  Other Commercial
> > Databases call shmget() / shmat() with SHM_HUGETLB for the exact
> > same reason.
> 
> I remember you said that postgres might be looking into using shmem as
> well; maybe I am wrong.

No, I said that postgres was also interested in sharing page tables.
I don't think they have any use for shmem.

> memfd/hugetlb/shmem could all be handled alike, just "arbitrary filesystems"
> would require more work.

But arbitrary filesystems were one of the original use cases: the
database is stored on a persistent memory filesystem, and neither the
kernel nor userspace has a cache.  The Postgres & Commercial Database
use-cases collapse into the same case, and we want to mmap the files
directly and share the page tables.
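
A sketch of that direct-mapped case (the path is invented; MAP_SYNC is
the existing mechanism for mapping pmem-backed files directly):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>

/* Database file on a DAX filesystem: no kernel page cache and no
 * userspace copy -- loads and stores hit the persistent memory
 * itself. */
void *map_db(size_t len)
{
        int fd = open("/mnt/pmem/db.dat", O_RDWR);

        return mmap(NULL, len, PROT_READ | PROT_WRITE,
                    MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
}

Sharing the page tables of such mappings is exactly what mshare would
add on top.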

> > This is why I proposed mshare().  Anyone can use it for anything.
> > We have such a diverse set of users who want to do stuff with shared
> > page tables that we should not be tying it to memfd or any other
> > filesystem.  Not to mention that it's more flexible; you can map
> > individual 4kB files into it and still get page table sharing.
> 
> That's not what the current proposal does, or am I wrong?

I think you're wrong, but I haven't had time to read the latest patches.

> Also, I'm curious, is that a real requirement in the database world?

I don't know.  It's definitely an advantage that falls out of the design
of mshare.
David Hildenbrand July 31, 2023, 5:06 p.m. UTC | #13
On 31.07.23 18:54, Matthew Wilcox wrote:
> On Mon, Jul 31, 2023 at 06:48:47PM +0200, David Hildenbrand wrote:
>> On 31.07.23 18:38, Matthew Wilcox wrote:
>>> On Mon, Jul 31, 2023 at 06:30:22PM +0200, David Hildenbrand wrote:
>>>> Assume we do do the page table sharing at mmap time, if the flags are right.
>>>> Let's focus on the most common:
>>>>
>>>> mmap(memfd, PROT_READ | PROT_WRITE, MAP_SHARED)
>>>>
>>>> And doing the same in each and every process.
>>>
>>> That may be the most common in your usage, but for a database, you're
>>> looking at two usage scenarios.  Postgres calls mmap() on the database
>>> file itself so that all processes share the kernel page cache.
>>> Some Commercial Databases call mmap() on a hugetlbfs file so that all
>>> processes share the same userspace buffer cache.  Other Commercial
>>> Databases call shmget() / shmat() with SHM_HUGETLB for the exact
>>> same reason.
>>
>> I remember you said that postgres might be looking into using shmem as well,
>> maybe I am wrong.
> 
> No, I said that postgres was also interested in sharing page tables.
> I don't think they have any use for shmem.
> 
>> memfd/hugetlb/shmem could all be handled alike, just "arbitrary filesystems"
>> would require more work.
> 
> But arbitrary filesystems were one of the original use cases: the
> database is stored on a persistent memory filesystem, and neither the
> kernel nor userspace has a cache.  The Postgres & Commercial Database
> use-cases collapse into the same case, and we want to mmap the files
> directly and share the page tables.

Yes, and transparent page table sharing can be achieved otherwise.

I guess what you imply is that they want to share page tables and have a 
single mprotect(PROT_READ) to modify the shared page tables.

> 
>>> This is why I proposed mshare().  Anyone can use it for anything.
>>> We have such a diverse set of users who want to do stuff with shared
>>> page tables that we should not be tying it to memfd or any other
>>> filesystem.  Not to mention that it's more flexible; you can map
>>> individual 4kB files into it and still get page table sharing.
>>
>> That's not what the current proposal does, or am I wrong?
> 
> I think you're wrong, but I haven't had time to read the latest patches.
> 

Maybe I misunderstood what the MAP_SHARED_PT flag actually does.

"
This patch series adds a new flag to mmap() call - MAP_SHARED_PT.
This flag can be specified along with MAP_SHARED by a process to
hint to kernel that it wishes to share page table entries for this
file mapping mmap region with other processes. Any other process
that mmaps the same file with MAP_SHARED_PT flag can then share the
same page table entries. Besides specifying MAP_SHARED_PT flag, the
processes must map the files at a PMD aligned address with a size
that is a multiple of PMD size and at the same virtual addresses.
This last requirement of same virtual addresses can possibly be
relaxed if that is the consensus.
"

Given that, I'm confused about how 4kB files would interact with the
PMD size requirement.

Probably I got it all wrong.

>> Also, I'm curious, is that a real requirement in the database world?
> 
> I don't know.  It's definitely an advantage that falls out of the design
> of mshare.

Okay, just checking whether there is an important use case I'm missing;
I'm also not aware of any.


Anyhow, I have other work to do. Happy to continue the discussion once
someone is actually working on this (again).
Rongwei Wang Aug. 1, 2023, 6:53 a.m. UTC | #14
On 2023/8/1 00:38, Matthew Wilcox wrote:
> On Mon, Jul 31, 2023 at 06:30:22PM +0200, David Hildenbrand wrote:
>> Assume we do do the page table sharing at mmap time, if the flags are right.
>> Let's focus on the most common:
>>
>> mmap(memfd, PROT_READ | PROT_WRITE, MAP_SHARED)
>>
>> And doing the same in each and every process.
> That may be the most common in your usage, but for a database, you're
> looking at two usage scenarios.  Postgres calls mmap() on the database
> file itself so that all processes share the kernel page cache.
> Some Commercial Databases call mmap() on a hugetlbfs file so that all
> processes share the same userspace buffer cache.  Other Commercial
> Databases call shmget() / shmat() with SHM_HUGETLB for the exact
> same reason.
>
> This is why I proposed mshare().  Anyone can use it for anything.

Hi Matthew

I'm a little confused about this mshare(). Which mshare() do you refer
to here: the previous filesystem-based mshare(), or this RFC v2 posted
by Khalid?

IMHO, there are big differences between the previous mshare() and
MAP_SHARED_PT now.

> We have such a diverse set of users who want to do stuff with shared
> page tables that we should not be tying it to memfd or any other
> filesystem.  Not to mention that it's more flexible; you can map
> individual 4kB files into it and still get page table sharing.
Matthew Wilcox Aug. 1, 2023, 7:28 p.m. UTC | #15
On Tue, Aug 01, 2023 at 02:53:02PM +0800, Rongwei Wang wrote:
> 
> On 2023/8/1 00:38, Matthew Wilcox wrote:
> > On Mon, Jul 31, 2023 at 06:30:22PM +0200, David Hildenbrand wrote:
> > > Assume we do do the page table sharing at mmap time, if the flags are right.
> > > Let's focus on the most common:
> > > 
> > > mmap(memfd, PROT_READ | PROT_WRITE, MAP_SHARED)
> > > 
> > > And doing the same in each and every process.
> > That may be the most common in your usage, but for a database, you're
> > looking at two usage scenarios.  Postgres calls mmap() on the database
> > file itself so that all processes share the kernel page cache.
> > Some Commercial Databases call mmap() on a hugetlbfs file so that all
> > processes share the same userspace buffer cache.  Other Commercial
> > Databases call shmget() / shmat() with SHM_HUGETLB for the exact
> > same reason.
> > 
> > This is why I proposed mshare().  Anyone can use it for anything.
> 
> Hi Matthew
> 
> I'm a little confused about this mshare(). Which mshare() do you refer
> to here: the previous filesystem-based mshare(), or this RFC v2 posted
> by Khalid?
> 
> IMHO, there are big differences between the previous mshare() and
> MAP_SHARED_PT now.

I haven't read this version of the patchset.  I'm describing the original
idea, not what it may have turned into.  As far as I'm concerned, we're
still trying to decide what functionality we actually want, not arguing
about whether this exact patchset has the correct number of tab indents
to be merged.