
[v2,0/9] Add support for shared PTEs across processes

Message ID cover.1656531090.git.khalid.aziz@oracle.com (mailing list archive)

Message

Khalid Aziz June 29, 2022, 10:53 p.m. UTC
Memory pages shared between processes require a page table entry
(PTE) for each process. Each of these PTEs consumes some memory
and, as long as the number of mappings being maintained is small
enough, the space consumed by page tables is not objectionable.
When very few memory pages are shared between processes, the
number of page table entries (PTEs) to maintain is mostly
constrained by the number of pages of memory on the system. As the
number of shared pages and the number of times pages are shared
goes up, the amount of memory consumed by page tables starts to
become significant. This issue does not apply to threads: any
number of threads can share the same pages inside a process while
sharing the same PTEs. Extending this same model to sharing pages
across processes can eliminate the issue for cross-process sharing
as well.

Some field deployments commonly see memory pages shared across
1000s of processes. On x86_64, each page requires a PTE that is
only 8 bytes long, which is very small compared to the 4K page
size. When 2000 processes map the same page in their address
space, each one of them requires 8 bytes for its PTE and together
that adds up to 8K of memory just to hold the PTEs for one 4K
page. On a database server with a 300GB SGA, a system crash from
an out-of-memory condition was seen when 1500+ clients tried to
share this SGA even though the system had 512GB of memory. On this
server, the worst case scenario of all 1500 processes mapping
every page of the SGA would have required 878GB+ for just the
PTEs. If these PTEs could be shared, the amount of memory saved
would be very significant.
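(For reference, the arithmetic behind that number: a 300GB SGA
spans roughly 78.6 million 4K pages; at 8 bytes per PTE that is
about 600MB of page tables per process, or roughly 879GB across
1500 processes.)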

This patch series implements a mechanism in the kernel to allow
userspace processes to opt into sharing PTEs. It adds a new
in-memory filesystem - msharefs. A file created on msharefs creates
a new shared region where all processes sharing that region will
share the PTEs as well. A process can create a new file on msharefs
and then mmap it, which assigns a starting address and size to this
mshare'd region. Another process that has the right permission to
open the file on msharefs can then mmap this file in its address
space at the same virtual address and size and share this region
through shared PTEs. An unlink() on the file marks the mshare'd
region for deletion once there are no more users of the region.
When the mshare region is deleted, all the pages used by the region
are freed.


API
===

mshare does not introduce a new API. It instead uses existing APIs
to implement page table sharing. The steps to use this feature are:

1. Mount msharefs on /sys/fs/mshare -
	mount -t msharefs msharefs /sys/fs/mshare

2. mshare regions have alignment and size requirements. The start
   address for the region must be aligned to a fixed address
   boundary and the region size must be a multiple of that same
   value. This alignment/size value can be obtained by reading the
   file /sys/fs/mshare/mshare_info, which returns the number in
   text format (a rounding sketch follows these steps).

3. For the process creating mshare region:
	a. Create a file on /sys/fs/mshare, for example -
		fd = open("/sys/fs/mshare/shareme",
				O_RDWR|O_CREAT|O_EXCL, 0600);
	
	b. mmap this file to establish starting address and size - 
		mmap((void *)TB(2), BUF_SIZE, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);

	c. Read and write the mshare'd region normally.

4. For processes attaching to mshare'd region:
	a. Open the file on msharefs, for example -
		fd = open("/sys/fs/mshare/shareme", O_RDWR);

	b. Get information about mshare'd region from the file:
		struct mshare_info {
			unsigned long start;
			unsigned long size;
		} m_info;

		read(fd, &m_info, sizeof(m_info));
	
	c. mmap the mshare'd region -
		mmap(m_info.start, m_info.size,
			PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

5. To delete the mshare region -
		unlink("/sys/fs/mshare/shareme");
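
Regarding the alignment requirement in step 2, a quick way to round
a chosen start address and size up to the required granularity
(illustrative only; start, size and align are unsigned long, align
is the value read from mshare_info and is assumed to be a power of
two, as page table level sizes are):

	start = (start + align - 1) & ~(align - 1);
	size = (size + align - 1) & ~(align - 1);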



Example Code
============

A snippet of the code that a donor process would run looks like the following:

-----------------
	fd = open("/sys/fs/mshare/mshare_info", O_RDONLY);
	read(fd, req, 128);
	alignsize = atoi(req);
	close(fd);
	fd = open("/sys/fs/mshare/shareme", O_RDWR|O_CREAT|O_EXCL, 0600);
	start = alignsize * 4;
	size = alignsize * 2;
        addr = mmap((void *)start, size, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, 0, 0);
        if (addr == MAP_FAILED)
                perror("ERROR: mmap failed");

        strncpy(addr, "Some random shared text",
			sizeof("Some random shared text"));
-----------------


A snippet of the code that a consumer process would execute looks like the following:

-----------------
	struct mshare_info {
		unsigned long start;
		unsigned long size;
	} minfo;


        fd = open("/sys/fs/mshare/shareme", O_RDONLY);

        if ((count = read(fd, &minfo, sizeof(struct mshare_info))) > 0)
                printf("INFO: %ld bytes shared at addr 0x%lx \n",
				minfo.size, minfo.start);

        addr = mmap((void *)minfo.start, minfo.size, PROT_READ | PROT_WRITE,
			MAP_SHARED, fd, 0);

        printf("Guest mmap at %px:\n", addr);
        printf("%s\n", addr);
	printf("\nDone\n");

-----------------
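
For completeness, a minimal teardown sketch (reusing fd, addr and
minfo from the consumer snippet above; as described in step 5 of
the API section, the unlink marks the region for deletion once the
last user is gone):

-----------------
        /* Detach from the mshare'd region, then remove the msharefs file */
        munmap(addr, minfo.size);
        close(fd);
        unlink("/sys/fs/mshare/shareme");
-----------------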



v1 -> v2:
	- Eliminated mshare and mshare_unlink system calls and
	  replaced API with standard mmap and unlink (Based upon
	  v1 patch discussions and LSF/MM discussions)
	- All fd based API (based upon feedback and suggestions from
	  Andy Lutomirski, Eric Biederman, Kirill and others)
	- Added a file /sys/fs/mshare/mshare_info to provide
	  alignment and size requirement info (based upon feedback
	  from Dave Hansen, Mark Hemment and discussions at LSF/MM)
	- Addressed TODOs in v1
	- Added support for directories in msharefs
	- Added locks around any time vma is touched (Dave Hansen)
	- Eliminated the need to point vm_mm in original vmas to the
	  newly synthesized mshare mm
	- Ensured mmap_read_unlock is called for correct mm in
	  handle_mm_fault (Dave Hansen)

Khalid Aziz (9):
  mm: Add msharefs filesystem
  mm/mshare: pre-populate msharefs with information file
  mm/mshare: make msharefs writable and support directories
  mm/mshare: Add a read operation for msharefs files
  mm/mshare: Add vm flag for shared PTE
  mm/mshare: Add mmap operation
  mm/mshare: Add unlink and munmap support
  mm/mshare: Add basic page table sharing support
  mm/mshare: Enable mshare region mapping across processes

 Documentation/filesystems/msharefs.rst |  19 +
 include/linux/mm.h                     |  10 +
 include/trace/events/mmflags.h         |   3 +-
 include/uapi/linux/magic.h             |   1 +
 include/uapi/linux/mman.h              |   5 +
 mm/Makefile                            |   2 +-
 mm/internal.h                          |   7 +
 mm/memory.c                            | 101 ++++-
 mm/mshare.c                            | 575 +++++++++++++++++++++++++
 9 files changed, 719 insertions(+), 4 deletions(-)
 create mode 100644 Documentation/filesystems/msharefs.rst
 create mode 100644 mm/mshare.c

Comments

Mark Hemment June 30, 2022, 11:57 a.m. UTC | #1
Hi Khalid,

On Wed, 29 Jun 2022 at 23:54, Khalid Aziz <khalid.aziz@oracle.com> wrote:
>
>
> Memory pages shared between processes require a page table entry
> (PTE) for each process. Each of these PTE consumes consume some of
> the memory and as long as number of mappings being maintained is
> small enough, this space consumed by page tables is not
> objectionable. When very few memory pages are shared between
> processes, the number of page table entries (PTEs) to maintain is
> mostly constrained by the number of pages of memory on the system.
> As the number of shared pages and the number of times pages are
> shared goes up, amount of memory consumed by page tables starts to
> become significant. This issue does not apply to threads. Any number
> of threads can share the same pages inside a process while sharing
> the same PTEs. Extending this same model to sharing pages across
> processes can eliminate this issue for sharing across processes as
> well.
>
> Some of the field deployments commonly see memory pages shared
> across 1000s of processes. On x86_64, each page requires a PTE that
> is only 8 bytes long which is very small compared to the 4K page
> size. When 2000 processes map the same page in their address space,
> each one of them requires 8 bytes for its PTE and together that adds
> up to 8K of memory just to hold the PTEs for one 4K page. On a
> database server with 300GB SGA, a system crash was seen with
> out-of-memory condition when 1500+ clients tried to share this SGA
> even though the system had 512GB of memory. On this server, in the
> worst case scenario of all 1500 processes mapping every page from
> SGA would have required 878GB+ for just the PTEs. If these PTEs
> could be shared, amount of memory saved is very significant.
>
> This patch series implements a mechanism in kernel to allow
> userspace processes to opt into sharing PTEs. It adds a new
> in-memory filesystem - msharefs. A file created on msharefs creates
> a new shared region where all processes sharing that region will
> share the PTEs as well. A process can create a new file on msharefs
> and then mmap it which assigns a starting address and size to this
> mshare'd region. Another process that has the right permission to
> open the file on msharefs can then mmap this file in its address
> space at same virtual address and size and share this region through
> shared PTEs. An unlink() on the file marks the mshare'd region for
> deletion once there are no more users of the region. When the mshare
> region is deleted, all the pages used by the region are freed.

  Noting the flexibility of 'mshare' has been reduced from v1.  The
earlier version allowed msharing of named mappings, while this patch
is only for anonymous mappings.
  Any plans to support named mappings?  If not, I guess *someone* will
want it (eventually).  Minor, as the patch does not introduce new
syscalls, but having an API which is flexible for both named and anon
mappings would be good (this is a nit, not a strong suggestion).

  The cover letter details the problem being solved and the API, but
gives no details of the implementation.  A paragraph on the use of a
mm_struct per-msharefs file would be helpful.

  I've only quickly scanned the patchset; not in enough detail to
comment on each patch, but a few observations.

  o I was expecting to see mprotect() against a mshared vma to either
be disallowed or code to support the splitting of a mshared vma.  I
didn't see either.

  o For the case where the mshare file has been closed/munmap'ed but not
unlinked, a 'mshare_data' structure will be leaked when the inode is
evicted.

  o The alignment requirement is PGDIR_SIZE, which is very large.
Should/could this be PMD_SIZE?

  o mshare should be a conditional feature (CONFIG_MSHARE ?).


  I might get a chance to do a finer-grained review later/tomorrow.

> API
> ===
>
> mshare does not introduce a new API. It instead uses existing APIs
> to implement page table sharing. The steps to use this feature are:
>
> 1. Mount msharefs on /sys/fs/mshare -
>         mount -t msharefs msharefs /sys/fs/mshare
>
> 2. mshare regions have alignment and size requirements. Start
>    address for the region must be aligned to an address boundary and
>    be a multiple of fixed size. This alignment and size requirement
>    can be obtained by reading the file /sys/fs/mshare/mshare_info
>    which returns a number in text format. mshare regions must be
>    aligned to this boundary and be a multiple of this size.
>
> 3. For the process creating mshare region:
>         a. Create a file on /sys/fs/mshare, for example -
>                 fd = open("/sys/fs/mshare/shareme",
>                                 O_RDWR|O_CREAT|O_EXCL, 0600);
>
>         b. mmap this file to establish starting address and size -
>                 mmap((void *)TB(2), BUF_SIZE, PROT_READ | PROT_WRITE,
>                         MAP_SHARED, fd, 0);
>
>         c. Write and read to mshared region normally.
>
> 4. For processes attaching to mshare'd region:
>         a. Open the file on msharefs, for example -
>                 fd = open("/sys/fs/mshare/shareme", O_RDWR);
>
>         b. Get information about mshare'd region from the file:
>                 struct mshare_info {
>                         unsigned long start;
>                         unsigned long size;
>                 } m_info;
>
>                 read(fd, &m_info, sizeof(m_info));
>
>         c. mmap the mshare'd region -
>                 mmap(m_info.start, m_info.size,
>                         PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
>
> 5. To delete the mshare region -
>                 unlink("/sys/fs/mshare/shareme");
>
>
>
> Example Code
> ============
>
> Snippet of the code that a donor process would run looks like below:
>
> -----------------
>         fd = open("/sys/fs/mshare/mshare_info", O_RDONLY);
>         read(fd, req, 128);
>         alignsize = atoi(req);
>         close(fd);
>         fd = open("/sys/fs/mshare/shareme", O_RDWR|O_CREAT|O_EXCL, 0600);
>         start = alignsize * 4;
>         size = alignsize * 2;
>         addr = mmap((void *)start, size, PROT_READ | PROT_WRITE,
>                         MAP_SHARED | MAP_ANONYMOUS, 0, 0);

Typo, missing 'fd'; MAP_SHARED | MAP_ANONYMOUS, fd, 0)

>         if (addr == MAP_FAILED)
>                 perror("ERROR: mmap failed");
>
>         strncpy(addr, "Some random shared text",
>                         sizeof("Some random shared text"));
> -----------------
>
>
> Snippet of code that a consumer process would execute looks like:
>
> -----------------
>         struct mshare_info {
>                 unsigned long start;
>                 unsigned long size;
>         } minfo;
>
>
>         fd = open("/sys/fs/mshare/shareme", O_RDONLY);
>
>         if ((count = read(fd, &minfo, sizeof(struct mshare_info)) > 0))
>                 printf("INFO: %ld bytes shared at addr 0x%lx \n",
>                                 minfo.size, minfo.start);
>
>         addr = mmap(minfo.start, minfo.size, PROT_READ | PROT_WRITE,
>                         MAP_SHARED, fd, 0);
>
>         printf("Guest mmap at %px:\n", addr);
>         printf("%s\n", addr);
>         printf("\nDone\n");
>
> -----------------
>
>
>
> v1 -> v2:
>         - Eliminated mshare and mshare_unlink system calls and
>           replaced API with standard mmap and unlink (Based upon
>           v1 patch discussions and LSF/MM discussions)
>         - All fd based API (based upon feedback and suggestions from
>           Andy Lutomirski, Eric Biederman, Kirill and others)
>         - Added a file /sys/fs/mshare/mshare_info to provide
>           alignment and size requirement info (based upon feedback
>           from Dave Hansen, Mark Hemment and discussions at LSF/MM)
>         - Addressed TODOs in v1
>         - Added support for directories in msharefs
>         - Added locks around any time vma is touched (Dave Hansen)
>         - Eliminated the need to point vm_mm in original vmas to the
>           newly synthesized mshare mm
>         - Ensured mmap_read_unlock is called for correct mm in
>           handle_mm_fault (Dave Hansen)
>
> Khalid Aziz (9):
>   mm: Add msharefs filesystem
>   mm/mshare: pre-populate msharefs with information file
>   mm/mshare: make msharefs writable and support directories
>   mm/mshare: Add a read operation for msharefs files
>   mm/mshare: Add vm flag for shared PTE
>   mm/mshare: Add mmap operation
>   mm/mshare: Add unlink and munmap support
>   mm/mshare: Add basic page table sharing support
>   mm/mshare: Enable mshare region mapping across processes
>
>  Documentation/filesystems/msharefs.rst |  19 +
>  include/linux/mm.h                     |  10 +
>  include/trace/events/mmflags.h         |   3 +-
>  include/uapi/linux/magic.h             |   1 +
>  include/uapi/linux/mman.h              |   5 +
>  mm/Makefile                            |   2 +-
>  mm/internal.h                          |   7 +
>  mm/memory.c                            | 101 ++++-
>  mm/mshare.c                            | 575 +++++++++++++++++++++++++
>  9 files changed, 719 insertions(+), 4 deletions(-)
>  create mode 100644 Documentation/filesystems/msharefs.rst
>  create mode 100644 mm/mshare.c
>
> --
> 2.32.0

Cheers,
Mark
Khalid Aziz June 30, 2022, 3:39 p.m. UTC | #2
On 6/30/22 05:57, Mark Hemment wrote:
> Hi Khalid,
> 
> On Wed, 29 Jun 2022 at 23:54, Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>
>>
>> Memory pages shared between processes require a page table entry
>> (PTE) for each process. Each of these PTE consumes consume some of
>> the memory and as long as number of mappings being maintained is
>> small enough, this space consumed by page tables is not
>> objectionable. When very few memory pages are shared between
>> processes, the number of page table entries (PTEs) to maintain is
>> mostly constrained by the number of pages of memory on the system.
>> As the number of shared pages and the number of times pages are
>> shared goes up, amount of memory consumed by page tables starts to
>> become significant. This issue does not apply to threads. Any number
>> of threads can share the same pages inside a process while sharing
>> the same PTEs. Extending this same model to sharing pages across
>> processes can eliminate this issue for sharing across processes as
>> well.
>>
>> Some of the field deployments commonly see memory pages shared
>> across 1000s of processes. On x86_64, each page requires a PTE that
>> is only 8 bytes long which is very small compared to the 4K page
>> size. When 2000 processes map the same page in their address space,
>> each one of them requires 8 bytes for its PTE and together that adds
>> up to 8K of memory just to hold the PTEs for one 4K page. On a
>> database server with 300GB SGA, a system crash was seen with
>> out-of-memory condition when 1500+ clients tried to share this SGA
>> even though the system had 512GB of memory. On this server, in the
>> worst case scenario of all 1500 processes mapping every page from
>> SGA would have required 878GB+ for just the PTEs. If these PTEs
>> could be shared, amount of memory saved is very significant.
>>
>> This patch series implements a mechanism in kernel to allow
>> userspace processes to opt into sharing PTEs. It adds a new
>> in-memory filesystem - msharefs. A file created on msharefs creates
>> a new shared region where all processes sharing that region will
>> share the PTEs as well. A process can create a new file on msharefs
>> and then mmap it which assigns a starting address and size to this
>> mshare'd region. Another process that has the right permission to
>> open the file on msharefs can then mmap this file in its address
>> space at same virtual address and size and share this region through
>> shared PTEs. An unlink() on the file marks the mshare'd region for
>> deletion once there are no more users of the region. When the mshare
>> region is deleted, all the pages used by the region are freed.
> 
>    Noting the flexibility of 'mshare' has been reduced from v1.  The
> earlier version allowed msharing of named mappings, while this patch
> is only for anonymous mappings.
>    Any plans to support named mappings?  If not, I guess *someone* will
> want it (eventually).  Minor, as the patch does not introduce new
> syscalls, but having an API which is flexible for both named and anon
> mappings would be good (this is a nit, not a strong suggestion).

I apologize for not clarifying this. The initial mmap() call looks like an anonymous mapping, but one could easily call 
mremap() later and map any other object into the same address space, which remains shared until the mshare region is torn 
down. It is my intent to support mapping any object in an mshare region.
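
To illustrate with a rough sketch (not code from this series; it assumes an mshare'd region has already been set up as in 
the cover letter, that its start address is available in mshare_start, that the file path is only an example, and that 
mremap(MREMAP_FIXED) into the mshare'd range behaves as intended):

	buf_fd = open("/data/file_to_share", O_RDWR);
	file_addr = mmap(NULL, file_size, PROT_READ | PROT_WRITE,
			MAP_SHARED, buf_fd, 0);
	/*
	 * Move the file mapping into the mshare'd region so that its page
	 * tables are shared by every process attached to the region.
	 * Needs _GNU_SOURCE and <sys/mman.h> for the MREMAP_* flags.
	 */
	mremap(file_addr, file_size, file_size,
			MREMAP_MAYMOVE | MREMAP_FIXED, (void *)mshare_start);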

> 
>    The cover letter details the problem being solved and the API, but
> gives no details of the implementation.  A paragraph on the use of a
> mm_struct per-msharefs file would be helpful.

Good point. I will do that next time.

> 
>    I've only quickly scanned the patchset; not in enough detail to
> comment on each patch, but a few observations.
> 
>    o I was expecting to see mprotect() against a mshared vma to either
> be disallowed or code to support the splitting of a mshared vma.  I
> didn't see either.

Since the mshare region is intended to support multiple objects being mapped in the region and different protections on 
different parts of the region, mprotect should be supported and should handle splitting the mshare'd vmas. Until the basic 
code is solid, it would make sense to prevent splitting vmas and add that later. I will add this code.

> 
>    o For the case where the mshare file has been closed/unmmap but not
> unlinked, a 'mshare_data' structure will leaked when the inode is
> evicted.

You are right. mshare_evict_inode() needs to call msharefs_delmm() to clean up.

> 
>    o The alignment requirement is PGDIR_SIZE, which is very large.
> Should/could this be PMD_SIZE?

Yes, PGDIR_SIZE is large. It works for the database folks who requested this feature but PMD might be more versatile. I 
have been thinking about switching to PMD since that will make it easier to move hugetlbfs page table sharing code over 
to this code.
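(For context: on x86_64 with 4K pages and 4-level page tables, PGDIR_SIZE is 512GB while PMD_SIZE is 2MB, so moving to PMD 
granularity would relax the alignment requirement considerably.)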

> 
>    o mshare should be a conditional feature (CONFIG_MSHARE ?).

I can do that. I was reluctant to add yet another CONFIG option. Since this feature is activated explicitly by userspace 
code, is it necessary to make it a config option?

> 
> 
>    I might get a chance do a finer grain review later/tomorrow.
> 
>> API
>> ===
>>
>> mshare does not introduce a new API. It instead uses existing APIs
>> to implement page table sharing. The steps to use this feature are:
>>
>> 1. Mount msharefs on /sys/fs/mshare -
>>          mount -t msharefs msharefs /sys/fs/mshare
>>
>> 2. mshare regions have alignment and size requirements. Start
>>     address for the region must be aligned to an address boundary and
>>     be a multiple of fixed size. This alignment and size requirement
>>     can be obtained by reading the file /sys/fs/mshare/mshare_info
>>     which returns a number in text format. mshare regions must be
>>     aligned to this boundary and be a multiple of this size.
>>
>> 3. For the process creating mshare region:
>>          a. Create a file on /sys/fs/mshare, for example -
>>                  fd = open("/sys/fs/mshare/shareme",
>>                                  O_RDWR|O_CREAT|O_EXCL, 0600);
>>
>>          b. mmap this file to establish starting address and size -
>>                  mmap((void *)TB(2), BUF_SIZE, PROT_READ | PROT_WRITE,
>>                          MAP_SHARED, fd, 0);
>>
>>          c. Write and read to mshared region normally.
>>
>> 4. For processes attaching to mshare'd region:
>>          a. Open the file on msharefs, for example -
>>                  fd = open("/sys/fs/mshare/shareme", O_RDWR);
>>
>>          b. Get information about mshare'd region from the file:
>>                  struct mshare_info {
>>                          unsigned long start;
>>                          unsigned long size;
>>                  } m_info;
>>
>>                  read(fd, &m_info, sizeof(m_info));
>>
>>          c. mmap the mshare'd region -
>>                  mmap(m_info.start, m_info.size,
>>                          PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
>>
>> 5. To delete the mshare region -
>>                  unlink("/sys/fs/mshare/shareme");
>>
>>
>>
>> Example Code
>> ============
>>
>> Snippet of the code that a donor process would run looks like below:
>>
>> -----------------
>>          fd = open("/sys/fs/mshare/mshare_info", O_RDONLY);
>>          read(fd, req, 128);
>>          alignsize = atoi(req);
>>          close(fd);
>>          fd = open("/sys/fs/mshare/shareme", O_RDWR|O_CREAT|O_EXCL, 0600);
>>          start = alignsize * 4;
>>          size = alignsize * 2;
>>          addr = mmap((void *)start, size, PROT_READ | PROT_WRITE,
>>                          MAP_SHARED | MAP_ANONYMOUS, 0, 0);
> 
> Typo, missing 'fd'; MAP_SHARED | MAP_ANONYMOUS, fd, 0)

Yes, you are right. I will fix that.

Thanks, Mark! I really appreciate your taking time to review this code.

--
Khalid
Andrew Morton July 2, 2022, 4:24 a.m. UTC | #3
On Wed, 29 Jun 2022 16:53:51 -0600 Khalid Aziz <khalid.aziz@oracle.com> wrote:

> This patch series implements a mechanism in kernel to allow
> userspace processes to opt into sharing PTEs. It adds a new
> in-memory filesystem - msharefs. 

Dumb question: why do we need a new filesystem for this?  Is it not
feasible to permit PTE sharing for mmaps of tmpfs/xfs/ext4/etc files?
Khalid Aziz July 6, 2022, 7:26 p.m. UTC | #4
On 7/1/22 22:24, Andrew Morton wrote:
> On Wed, 29 Jun 2022 16:53:51 -0600 Khalid Aziz <khalid.aziz@oracle.com> wrote:
> 
>> This patch series implements a mechanism in kernel to allow
>> userspace processes to opt into sharing PTEs. It adds a new
>> in-memory filesystem - msharefs.
> 
> Dumb question: why do we need a new filesystem for this?  Is it not
> feasible to permit PTE sharing for mmaps of tmpfs/xfs/ext4/etc files?

Hi Andrew,

The new filesystem is meant to provide only the control files for sharing PTEs. It contains a file that provides the 
alignment/size requirement. Other files are created as named objects to represent shared regions, and these files provide 
information about the size and virtual address of each shared region when the file is read. Actual shared data is not 
hosted on msharefs. Actual data is mmap'ed using anonymous pages, ext4/xfs/btrfs/etc files.

Thanks,
Khalid
David Hildenbrand July 8, 2022, 11:47 a.m. UTC | #5
On 02.07.22 06:24, Andrew Morton wrote:
> On Wed, 29 Jun 2022 16:53:51 -0600 Khalid Aziz <khalid.aziz@oracle.com> wrote:
> 
>> This patch series implements a mechanism in kernel to allow
>> userspace processes to opt into sharing PTEs. It adds a new
>> in-memory filesystem - msharefs. 
> 
> Dumb question: why do we need a new filesystem for this?  Is it not
> feasible to permit PTE sharing for mmaps of tmpfs/xfs/ext4/etc files?
> 

IIRC, the general opinion at LSF/MM was that this approach at hand
makes people nervous, and I at least am not convinced that we really want
to have this upstream.

What's *completely* missing from the cover letter are the dirty details:
"Actual data is mmap'ed using anonymous pages, ext4/xfs/btfrfs/etc
files.". Gah.

As raised already, "anonymous pages" makes me shiver.


(From what I read, this looks like an RFC to me, yet I see "v2". But I
am a little confused why most of the feedback at LSF/MM seems to be
ignored and people are moving forward with this approach. But maybe my
memory is wrong.)

Please, let's look into more generic page table sharing just like
hugetlb already provides to some degree. And if we need additional
alignment requirements to share multiple page table levels, then let's
look into that as well for special users.
Khalid Aziz July 8, 2022, 7:36 p.m. UTC | #6
On 7/8/22 05:47, David Hildenbrand wrote:
> On 02.07.22 06:24, Andrew Morton wrote:
>> On Wed, 29 Jun 2022 16:53:51 -0600 Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>
>>> This patch series implements a mechanism in kernel to allow
>>> userspace processes to opt into sharing PTEs. It adds a new
>>> in-memory filesystem - msharefs.
>>
>> Dumb question: why do we need a new filesystem for this?  Is it not
>> feasible to permit PTE sharing for mmaps of tmpfs/xfs/ext4/etc files?
>>
> 
> IIRC, the general opinion at LSF/MM was that this approach at hand is
> makes people nervous and I at least am not convinced that we really want
> to have this upstream.

Hi David,

You are right that sharing page tables across processes feels scary, but at the same time threads already share PTEs and 
this just extends that concept to processes. A number of people have commented on the potential usefulness of this concept 
and implementation. There were concerns raised about being able to make this safe and reliable. I had agreed to send a 
second version of the patch incorporating feedback from the last review and LSF/MM, and that is what the v2 patch is about. 
The suggestion to extend hugetlb PMD sharing was discussed briefly. The conclusion from that discussion and the earlier 
discussion on the mailing list was that hugetlb PMD sharing is built with special-case code in too many places in the 
kernel and it is better to replace it with something more general purpose than to build even more on it. Mike can correct 
me if I got that wrong.

> 
> What's *completely* missing from the cover letter are the dirty details:
> "Actual data is mmap'ed using anonymous pages, ext4/xfs/btfrfs/etc
> files.". Gah.

Yes, the cover letter in the v2 patch was a little lacking. I will add more details next time.

> 
> As raised already, "anonymous pages" makes me shiver.
> 
> 
> (To me, what I read, this looks like an RFC to me, yet I see "v2". But I
> am a little confused why most of the feedback at LSF/MM seems to be
> ignored and people are moving forward with this approach. But maybe my
> memory is wrong.)
> 
> Please, let's look into more generic page table sharing just like
> hugetlb already provides to some degree. And if we need additional
> alignment requirements to share multiple page table levels, then let's
> look into that as well for special users.
> 

Thanks,
Khalid
David Hildenbrand July 13, 2022, 2 p.m. UTC | #7
On 08.07.22 21:36, Khalid Aziz wrote:
> On 7/8/22 05:47, David Hildenbrand wrote:
>> On 02.07.22 06:24, Andrew Morton wrote:
>>> On Wed, 29 Jun 2022 16:53:51 -0600 Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>>
>>>> This patch series implements a mechanism in kernel to allow
>>>> userspace processes to opt into sharing PTEs. It adds a new
>>>> in-memory filesystem - msharefs.
>>>
>>> Dumb question: why do we need a new filesystem for this?  Is it not
>>> feasible to permit PTE sharing for mmaps of tmpfs/xfs/ext4/etc files?
>>>
>>
>> IIRC, the general opinion at LSF/MM was that this approach at hand is
>> makes people nervous and I at least am not convinced that we really want
>> to have this upstream.
> 
> Hi David,

Hi Khalid,

> 
> You are right that sharing page tables across processes feels scary, but at the same time threads already share PTEs and 
> this just extends that concept to processes.

They share a *mm* including a consistent virtual memory layout (VMA
list). Page table sharing is just a side product of that. You could even
call page tables just an implementation detail to produce that
consistent virtual memory layout -- described for that MM via a
different data structure.

> A number of people have commented on potential usefulness of this concept 
> and implementation.

... and a lot of people raised concerns. Yes, page table sharing to
reduce memory consumption/tlb misses/... is something reasonable to
have. But that doesn't require mshare, as hugetlb has proven.

The design might be useful for a handful of corner (!) cases, but as the
cover letter only talks about memory consumption of page tables, I'll
not care about those. Once these corner cases are explained and deemed
important, we might want to think of possible alternatives to explore
the solution space.

> There were concerns raised about being able to make this safe and reliable.
> I had agreed to send a 
> second version of the patch incorporating feedback from last review and LSF/MM, and that is what v2 patch is about. The 

Okay, most of the changes I saw are related to the user interface, not
to any of the actual dirty implementation-detail concerns. And the cover
letter is not really clear what's actually happening under the hood and
what the (IMHO) weird semantics of the design imply (as can be seen from
Andrew's reply).

> suggestion to extend hugetlb PMD sharing was discussed briefly. Conclusion from that discussion and earlier discussion 
> on mailing list was hugetlb PMD sharing is built with special case code in too many places in the kernel and it is 
> better to replace it with something more general purpose than build even more on it. Mike can correct me if I got that 
> wrong.

Yes, I pushed for the removal of that yet-another-hugetlb-special-stuff,
and asked the honest question if we can just remove it and replace it by
something generic in the future. And as I learned, we most probably
cannot rip that out without affecting existing user space. Even
replacing it by mshare() would degrade existing user space.

So the natural thing to reduce page table consumption (again, what this
cover letter talks about) for user space (semi- ?)automatically for
MAP_SHARED files is to factor out what hugetlb has, and teach generic MM
code to cache and reuse page tables (PTE and PMD tables should be
sufficient) where suitable.

For reasonably aligned mappings and mapping sizes, it shouldn't be too
hard (I know, locking ...), to cache and reuse page tables attached to
files -- similar to what hugetlb does, just in a generic way. We might
want a mechanism to enable/disable this for specific processes and/or
VMAs, but these are minor details.

And that could come for free for existing user space, because page
tables, and how they are handled, would just be an implementation detail.


I'd be really interested into what the major roadblocks/downsides
file-based page table sharing has. Because I am not convinced that a
mechanism like mshare() -- that has to be explicitly implemented+used by
user space -- is required for that.
Mike Kravetz July 13, 2022, 5:58 p.m. UTC | #8
On 07/13/22 16:00, David Hildenbrand wrote:
> On 08.07.22 21:36, Khalid Aziz wrote:
> > On 7/8/22 05:47, David Hildenbrand wrote:
> >> On 02.07.22 06:24, Andrew Morton wrote:
> >>> On Wed, 29 Jun 2022 16:53:51 -0600 Khalid Aziz <khalid.aziz@oracle.com> wrote:
> 
> > suggestion to extend hugetlb PMD sharing was discussed briefly. Conclusion from that discussion and earlier discussion 
> > on mailing list was hugetlb PMD sharing is built with special case code in too many places in the kernel and it is 
> > better to replace it with something more general purpose than build even more on it. Mike can correct me if I got that 
> > wrong.
> 
> Yes, I pushed for the removal of that yet-another-hugetlb-special-stuff,
> and asked the honest question if we can just remove it and replace it by
> something generic in the future. And as I learned, we most probably
> cannot rip that out without affecting existing user space. Even
> replacing it by mshare() would degrade existing user space.
> 
> So the natural thing to reduce page table consumption (again, what this
> cover letter talks about) for user space (semi- ?)automatically for
> MAP_SHARED files is to factor out what hugetlb has, and teach generic MM
> code to cache and reuse page tables (PTE and PMD tables should be
> sufficient) where suitable.
> 
> For reasonably aligned mappings and mapping sizes, it shouldn't be too
> hard (I know, locking ...), to cache and reuse page tables attached to
> files -- similar to what hugetlb does, just in a generic way. We might
> want a mechanism to enable/disable this for specific processes and/or
> VMAs, but these are minor details.
> 
> And that could come for free for existing user space, because page
> tables, and how they are handled, would just be an implementation detail.
> 
> 
> I'd be really interested into what the major roadblocks/downsides
> file-based page table sharing has. Because I am not convinced that a
> mechanism like mshare() -- that has to be explicitly implemented+used by
> user space -- is required for that.

Perhaps this is an 'opportunity' for me to write up in detail how
hugetlb pmd sharing works.  As you know, I have been struggling with
keeping that working AND safe AND performant.  Who knows, this may lead
to changes in the existing implementation.
David Hildenbrand July 13, 2022, 6:03 p.m. UTC | #9
On 13.07.22 19:58, Mike Kravetz wrote:
> On 07/13/22 16:00, David Hildenbrand wrote:
>> On 08.07.22 21:36, Khalid Aziz wrote:
>>> On 7/8/22 05:47, David Hildenbrand wrote:
>>>> On 02.07.22 06:24, Andrew Morton wrote:
>>>>> On Wed, 29 Jun 2022 16:53:51 -0600 Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>
>>> suggestion to extend hugetlb PMD sharing was discussed briefly. Conclusion from that discussion and earlier discussion 
>>> on mailing list was hugetlb PMD sharing is built with special case code in too many places in the kernel and it is 
>>> better to replace it with something more general purpose than build even more on it. Mike can correct me if I got that 
>>> wrong.
>>
>> Yes, I pushed for the removal of that yet-another-hugetlb-special-stuff,
>> and asked the honest question if we can just remove it and replace it by
>> something generic in the future. And as I learned, we most probably
>> cannot rip that out without affecting existing user space. Even
>> replacing it by mshare() would degrade existing user space.
>>
>> So the natural thing to reduce page table consumption (again, what this
>> cover letter talks about) for user space (semi- ?)automatically for
>> MAP_SHARED files is to factor out what hugetlb has, and teach generic MM
>> code to cache and reuse page tables (PTE and PMD tables should be
>> sufficient) where suitable.
>>
>> For reasonably aligned mappings and mapping sizes, it shouldn't be too
>> hard (I know, locking ...), to cache and reuse page tables attached to
>> files -- similar to what hugetlb does, just in a generic way. We might
>> want a mechanism to enable/disable this for specific processes and/or
>> VMAs, but these are minor details.
>>
>> And that could come for free for existing user space, because page
>> tables, and how they are handled, would just be an implementation detail.
>>
>>
>> I'd be really interested into what the major roadblocks/downsides
>> file-based page table sharing has. Because I am not convinced that a
>> mechanism like mshare() -- that has to be explicitly implemented+used by
>> user space -- is required for that.
> 
> Perhaps this is an 'opportunity' for me to write up in detail how
> hugetlb pmd sharing works.  As you know, I have been struggling with
> keeping that working AND safe AND performant. 

Yes, and I have your locking-related changes in my inbox marked as "to
be reviewed" :D Shedding some light on that would be highly appreciated,
especially how hugetlb-specific it currently is and for which reason.
Khalid Aziz July 14, 2022, 10:02 p.m. UTC | #10
On 7/13/22 08:00, David Hildenbrand wrote:
> On 08.07.22 21:36, Khalid Aziz wrote:
>> On 7/8/22 05:47, David Hildenbrand wrote:
>>> On 02.07.22 06:24, Andrew Morton wrote:
>>>> On Wed, 29 Jun 2022 16:53:51 -0600 Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>>>
>>>>> This patch series implements a mechanism in kernel to allow
>>>>> userspace processes to opt into sharing PTEs. It adds a new
>>>>> in-memory filesystem - msharefs.
>>>>
>>>> Dumb question: why do we need a new filesystem for this?  Is it not
>>>> feasible to permit PTE sharing for mmaps of tmpfs/xfs/ext4/etc files?
>>>>
>>>
>>> IIRC, the general opinion at LSF/MM was that this approach at hand is
>>> makes people nervous and I at least am not convinced that we really want
>>> to have this upstream.
>>
>> Hi David,
> 
> Hi Khalid,
> 
>>
>> You are right that sharing page tables across processes feels scary, but at the same time threads already share PTEs and
>> this just extends that concept to processes.
> 
> They share a *mm* including a consistent virtual memory layout (VMA
> list). Page table sharing is just a side product of that. You could even
> call page tables just an implementation detail to produce that
> consistent virtual memory layout -- described for that MM via a
> different data structure.

Yes, sharing an mm and vma chain does make it different from an implementation point of view.

> 
>> A number of people have commented on potential usefulness of this concept
>> and implementation.
> 
> ... and a lot of people raised concerns. Yes, page table sharing to
> reduce memory consumption/tlb misses/... is something reasonable to
> have. But that doesn't require mshare, as hugetlb has proven.
> 
> The design might be useful for a handful of corner (!) cases, but as the
> cover letter only talks about memory consumption of page tables, I'll
> not care about those. Once these corner cases are explained and deemed
> important, we might want to think of possible alternatives to explore
> the solution space.

Memory consumption by page tables is turning out to be a significant issue. I mentioned one real-world example from a 
customer where a 300GB SGA on a 512GB server resulted in OOM when 1500+ processes tried to map parts of the SGA into 
their address space. Some customers are able to solve this issue by switching to hugetlbfs, but that is not feasible for 
everyone.

> 
>> There were concerns raised about being able to make this safe and reliable.
>> I had agreed to send a
>> second version of the patch incorporating feedback from last review and LSF/MM, and that is what v2 patch is about. The
> 
> Okay, most of the changes I saw are related to the user interface, not
> to any of the actual dirty implementation-detail concerns. And the cover
> letter is not really clear what's actually happening under the hood and
> what the (IMHO) weird semantics of the design imply (as can be seen from
> Andrews reply).

Sure, I will add more details to the cover letter next time. msharefs needs more explanation. I will highlight the 
creation of a new mm struct for mshare regions that is not owned by any process. There was another under-the-hood change 
that is listed in the changelog but could have been highlighted - "Eliminated the need to point vm_mm in original vmas to 
the newly synthesized mshare mm". How the fields in the new mm struct are used helped make this change and could use more 
details in the cover letter.

> 
>> suggestion to extend hugetlb PMD sharing was discussed briefly. Conclusion from that discussion and earlier discussion
>> on mailing list was hugetlb PMD sharing is built with special case code in too many places in the kernel and it is
>> better to replace it with something more general purpose than build even more on it. Mike can correct me if I got that
>> wrong.
> 
> Yes, I pushed for the removal of that yet-another-hugetlb-special-stuff,
> and asked the honest question if we can just remove it and replace it by
> something generic in the future. And as I learned, we most probably
> cannot rip that out without affecting existing user space. Even
> replacing it by mshare() would degrade existing user space.
> 
> So the natural thing to reduce page table consumption (again, what this
> cover letter talks about) for user space (semi- ?)automatically for
> MAP_SHARED files is to factor out what hugetlb has, and teach generic MM
> code to cache and reuse page tables (PTE and PMD tables should be
> sufficient) where suitable.
> 
> For reasonably aligned mappings and mapping sizes, it shouldn't be too
> hard (I know, locking ...), to cache and reuse page tables attached to
> files -- similar to what hugetlb does, just in a generic way. We might
> want a mechanism to enable/disable this for specific processes and/or
> VMAs, but these are minor details.
> 
> And that could come for free for existing user space, because page
> tables, and how they are handled, would just be an implementation detail.
> 
> 
> I'd be really interested into what the major roadblocks/downsides
> file-based page table sharing has. Because I am not convinced that a
> mechanism like mshare() -- that has to be explicitly implemented+used by
> user space -- is required for that.
> 

I see two parts to what you are suggesting (please correct me if I get this wrong):

1. Implement a generic page table sharing mechanism
2. Implement a way to use this mechanism from userspace

For 1, your suggestion seems to be to extract the page table sharing code from hugetlb and make it generic. My approach is 
to create a special mm struct to host the shared page tables and create a minimal set of changes to simply get PTEs from 
this special mm struct whenever a shared VMA is accessed. There may be value in extracting the hugetlb page table sharing 
code and recasting it into this framework of a special mm struct. I will look some more into it.

As for 2, is it fair to say you are not fond of explicit opt-in from userspace and would rather have the sharing be file 
based like hugetlb? That is worth considering, but is limiting page table sharing to just file objects reasonable? A goal 
for the mshare mechanism was to allow shared objects to be files, anonymous pages, RDMA buffers, whatever. The idea being 
that if you can map it, you can share it with shared page tables. Maybe that is too ambitious a goal and I am open to 
course correction.

Thanks,
Khalid
David Hildenbrand July 18, 2022, 12:59 p.m. UTC | #11
[sorry for not being as responsive as I usually am]

>>
>> They share a *mm* including a consistent virtual memory layout (VMA
>> list). Page table sharing is just a side product of that. You could even
>> call page tables just an implementation detail to produce that
>> consistent virtual memory layout -- described for that MM via a
>> different data structure.
> 
> Yes, sharing an mm and vma chain does make it different from implementation point of view.
> 
>>
>>> A number of people have commented on potential usefulness of this concept
>>> and implementation.
>>
>> ... and a lot of people raised concerns. Yes, page table sharing to
>> reduce memory consumption/tlb misses/... is something reasonable to
>> have. But that doesn't require mshare, as hugetlb has proven.
>>
>> The design might be useful for a handful of corner (!) cases, but as the
>> cover letter only talks about memory consumption of page tables, I'll
>> not care about those. Once these corner cases are explained and deemed
>> important, we might want to think of possible alternatives to explore
>> the solution space.
> 
> Memory consumption by page tables is turning out to be significant issue. I mentioned one real-world example from a 
> customer where a 300GB SGA on a 512GB server resulted in OOM when 1500+ processes tried to map parts of the SGA into 
> their address space. Some customers are able to solve this issue by switching to hugetlbfs but that is not feasible for 
> every one.

Yes. Another use case I am aware of are KVM-based virtual machines, when
VM memory (shmem, file-backed) is not only mapped into the emulator
process, but also into other processes used to carry out I/O (e.g.,
vhost-user).

In that case, it's tempting to simply share the page tables between all
processes for the shared mapping -- automatically, just like
shmem/hugetlb already does.

[...]

>>
>>> suggestion to extend hugetlb PMD sharing was discussed briefly. Conclusion from that discussion and earlier discussion
>>> on mailing list was hugetlb PMD sharing is built with special case code in too many places in the kernel and it is
>>> better to replace it with something more general purpose than build even more on it. Mike can correct me if I got that
>>> wrong.
>>
>> Yes, I pushed for the removal of that yet-another-hugetlb-special-stuff,
>> and asked the honest question if we can just remove it and replace it by
>> something generic in the future. And as I learned, we most probably
>> cannot rip that out without affecting existing user space. Even
>> replacing it by mshare() would degrade existing user space.
>>
>> So the natural thing to reduce page table consumption (again, what this
>> cover letter talks about) for user space (semi- ?)automatically for
>> MAP_SHARED files is to factor out what hugetlb has, and teach generic MM
>> code to cache and reuse page tables (PTE and PMD tables should be
>> sufficient) where suitable.
>>
>> For reasonably aligned mappings and mapping sizes, it shouldn't be too
>> hard (I know, locking ...), to cache and reuse page tables attached to
>> files -- similar to what hugetlb does, just in a generic way. We might
>> want a mechanism to enable/disable this for specific processes and/or
>> VMAs, but these are minor details.
>>
>> And that could come for free for existing user space, because page
>> tables, and how they are handled, would just be an implementation detail.
>>
>>
>> I'd be really interested into what the major roadblocks/downsides
>> file-based page table sharing has. Because I am not convinced that a
>> mechanism like mshare() -- that has to be explicitly implemented+used by
>> user space -- is required for that.
>>
> 
> I see two parts to what you are suggesting (please correct me if I get this wrong):
> 
> 1. Implement a generic page table sharing mechanism
> 2. Implement a way to use this mechanism from userspace

Yes. Whereby 2) would usually just be some heuristic (e.g., file > X
MiB -> start sharing), with an additional way to just disable it or just
enable it. But yes, most of that stuff should just be automatic.

> 
> For 1, your suggestion seems to be extract the page table sharing code from hugetlb and make it generic. My approach is 
> to create a special mm struct to host the shared page tables and create a minimal set of changes to simply get PTEs from 
> this special mm struct whenever a shared VMA is accessed. There may be value to extracting hugetlb page table sharing 
> code and recasting it into this framework of special mm struct. I will look some more into it.

The basic idea would be that whenever a MAP_SHARED VMA has a reasonable
size, is aligned in a suitable way (including MAP offset), and
protections match, you can just share PTE tables and even PMD tables. As
page tables of shared mappings usually don't really store per-process
information (exceptions I am aware of are userfaultfd and softdirty
tracking), we can simply share/unshare page tables of shared mappings
fairly easily.

Then, you'd have e.g., 2 sets of page tables cached by the fd that can
be mapped into processes

1) PROT_READ|PROT_WRITE
2) PROT_READ

On VMA protection changes, one would have to unshare (unmap the page
table) and either map another shared one, or map a private one. I don't
think there would be need to optimize e.g., for PROT_NONE, but of
course, other combinations could make sense to cache.


PROT_NONE and other corner cases (softdirty tracking) would simply not
use shared page tables.

Shared page tables would have to be refcounted and one could e.g.,
implement a shrinker that frees unused page tables in the fd cache when
memory reclaim kicks in.

With something like that in place, page table consumption could be
reduced and vmscan/rmap walks could turn out more efficient.

> 
> As for 2, is it fair to say you are not fond of explicit opt-in from userspace and would rather have the sharing be file 
> based like hugetlb? That is worth considering but is limiting page table sharing to just file objects reasonable? A goal 
> for mshare mechanism was to allow shared objects to be files, anonymous pages, RDMA buffers, whatever. Idea being if you 
> can map it, you can share it with shared page tables. Maybe that is too ambitious a goal and I am open to course correction.


We can glue it to the file or anything else that's shared, I think -- I
don't think we particularly care, as long as it's something shared between
processes to be mapped. And to be quite honest, whenever I read about
anonymous memory (i.e., MAP_PRIVATE) I hear my inner voice screaming:
just use *shared* memory when you want to *share* memory between
processes, and optimize that if anything is missing.


Having that said, I understood from previous discussions that there is a
use case of efficient read-only protection across many processes/VMAs. I
was wondering if that could be handled on the fs-level (pte_mkwrite). I
remember I raised the idea before: if one could have a
userfaultfd-wp-style (overlay?) file (system?), user-space could
protect/unprotect file pages via a different mechanism (ioctl) and get
notified about write access via something similar to userfaultfd
user-space handlers, not via signals. Instead of adjusting VMAs, one
would only adjust file page mappings to map the relevant pages R/O when
protecting -- if page tables are shared, that would be efficient.


Now, that is just a very vague brain dump to get it out of my
(overloaded) system. What I think the overall message is: let's try not
to design new features around page table sharing; let's use page table
sharing as an rmap performance optimization and as a mechanism to reduce
page table overhead. I hope what I said makes sense, I might be just
wrong.