[v1,00/14] Add support for shared PTEs across processes

Message ID: cover.1649370874.git.khalid.aziz@oracle.com (mailing list archive)

Message

Khalid Aziz April 11, 2022, 4:05 p.m. UTC
Page tables in the kernel consume some memory, and as long as the
number of mappings being maintained is small enough, the space
consumed by page tables is not objectionable. When very few memory
pages are shared between processes, the number of page table entries
(PTEs) to maintain is mostly constrained by the number of pages of
memory on the system. As the number of shared pages and the number of
times pages are shared goes up, the amount of memory consumed by page
tables starts to become significant.

Some field deployments commonly see memory pages shared across
1000s of processes. On x86_64, each page requires a PTE that is only 8
bytes long, which is very small compared to the 4K page size. When 2000
processes map the same page in their address space, each one of them
requires 8 bytes for its PTE and together that adds up to 8K of memory
just to hold the PTEs for one 4K page. On a database server with a
300GB SGA, a system crash with an out-of-memory condition was seen
when 1500+ clients tried to share this SGA even though the system had
512GB of memory. On this server, in the worst-case scenario, all 1500
processes mapping every page from the SGA would have required 878GB+
for just the PTEs. If these PTEs could be shared, the amount of memory
saved is very significant.

This patch series implements a mechanism in the kernel to allow
userspace processes to opt into sharing PTEs. It adds two new system
calls - (1) mshare(), which can be used by a process to create a
region (we will call it an mshare'd region) which can be used by
other processes to map the same pages using shared PTEs, and (2)
mshare_unlink(), which is used to detach from the mshare'd region.
Once an mshare'd region is created, other process(es), assuming they
have the right permissions, can make the mshare() system call to map
the shared pages into their address space using the shared PTEs.
When a process is done using this mshare'd region, it makes an
mshare_unlink() system call to end its access. When the last process
accessing the mshare'd region calls mshare_unlink(), the mshare'd
region is torn down and the memory used by it is freed.


API
===

The mshare API consists of two system calls - mshare() and mshare_unlink()

--
int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)

mshare() creates and opens a new, or opens an existing, mshare'd
region that will be shared at the PTE level. "name" refers to the
shared object name that exists under /sys/fs/mshare. "addr" is the
starting address of this shared memory area and "length" is the size
of this area. oflags can be a combination of:

- O_RDONLY opens the shared memory area for read-only access by everyone
- O_RDWR opens the shared memory area for read and write access
- O_CREAT creates the named shared memory area if it does not exist
- O_EXCL If O_CREAT was also specified, and a shared memory area
  exists with that name, return an error.

mode represents the creation mode for the shared object under
/sys/fs/mshare.

mshare() returns an error code if it fails, otherwise it returns 0.

PTEs are shared at the pgdir level, and hence this imposes the
following requirements on the address and size given to mshare()
(a small sketch follows this list):

- The starting address must be aligned to the pgdir size (512GB on
  x86_64). This alignment value can be looked up in
  /proc/sys/vm/mshare_size
- The size must be a multiple of the pgdir size
- Any mappings created in this address range at any time become
  shared automatically
- The shared address range can have unmapped addresses in it. Any
  access to an unmapped address will result in SIGBUS
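As a rough, hypothetical illustration (not part of this series), a
donor could read the alignment value from the proc file mentioned
above and round its chosen start address up to it. The helper names
below are made up, and the sketch assumes the file contains a single
decimal byte count that is a power of two:

        #include <stdio.h>

        /* Hypothetical helper: read the mshare alignment (PGDIR_SIZE)
         * exported via /proc/sys/vm/mshare_size; returns 0 on failure. */
        static unsigned long mshare_alignment(void)
        {
                unsigned long align = 0;
                FILE *f = fopen("/proc/sys/vm/mshare_size", "r");

                if (f) {
                        if (fscanf(f, "%lu", &align) != 1)
                                align = 0;
                        fclose(f);
                }
                return align;
        }

        /* Round addr up to the mshare alignment; assumes a
         * power-of-two alignment value. */
        static unsigned long mshare_align_up(unsigned long addr)
        {
                unsigned long align = mshare_alignment();

                return align ? (addr + align - 1) & ~(align - 1) : addr;
        }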

Mappings within this address range behave as if they were shared
between threads, so a write to a MAP_PRIVATE mapping will create a
page which is shared between all the sharers. The first process that
declares an address range mshare'd can continue to map objects in
the shared area. All other processes that want mshare'd access to
this memory area can do so by calling mshare(). After this call, the
address range given to mshare() becomes a shared range in their
address space. Anonymous mappings will be shared and not COWed.

A file under /sys/fs/mshare can be opened and read from. A read from
this file returns two long values - (1) starting address, and (2)
size of the mshare'd region.
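
For reference, a sketch of the structure these two values are read
into (matching the fields used in the example code below; the exact
uapi layout may differ) is:

        struct mshare_info {
                unsigned long start;    /* starting virtual address */
                unsigned long size;     /* size of the region in bytes */
        };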

--
int mshare_unlink(char *name)

A shared address range created by mshare() can be destroyed using
mshare_unlink(), which removes the shared named object. Once all
processes have unmapped the shared object, the shared address range
references are de-allocated and destroyed.

mshare_unlink() returns 0 on success or -1 on error.


Example Code
============

Snippet of the code that a donor process would run looks like below:

-----------------
        addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (addr == MAP_FAILED) {
                perror("ERROR: mmap failed");
                exit(1);
        }

        err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2),
                        GB(512), O_CREAT|O_RDWR|O_EXCL, 0600);
        if (err < 0) {
                perror("mshare() syscall failed");
                exit(1);
        }

        strncpy(addr, "Some random shared text",
			sizeof("Some random shared text"));
-----------------

Snippet of code that a consumer process would execute looks like:

-----------------
	struct mshare_info minfo;

        fd = open("testregion", O_RDONLY);
        if (fd < 0) {
                perror("open failed");
                exit(1);
        }

        if ((count = read(fd, &minfo, sizeof(struct mshare_info))) > 0)
                printf("INFO: %ld bytes shared at addr 0x%lx \n",
                                minfo.size, minfo.start);
        else
                perror("read failed");

        close(fd);

        addr = (void *)minfo.start;
        err = syscall(MSHARE_SYSCALL, "testregion", addr, minfo.size,
                        O_RDWR, 0600);
        if (err < 0) {
                perror("mshare() syscall failed");
                exit(1);
        }

        printf("Guest mmap at %px:\n", addr);
        printf("%s\n", addr);
	printf("\nDone\n");

        err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
        if (err < 0) {
                perror("mshare_unlink() failed");
                exit(1);
        }
-----------------


Patch series
============

This series of patches is an initial implementation of these two
system calls. The code implements basic working functionality.

The prototypes for the two syscalls are:

SYSCALL_DEFINE5(mshare, const char *, name, unsigned long, addr,
		unsigned long, len, int, oflag, mode_t, mode)

SYSCALL_DEFINE1(mshare_unlink, const char *, name)

In order to facilitate page table sharing, this implementation adds a
new in-memory filesystem - msharefs - which will be mounted at
/sys/fs/mshare. When a new mshare'd region is created, a file with
the name given by the initial mshare() call is created under this
filesystem.  Permissions for this file are given by the "mode"
parameter to mshare(). The only operation supported on this file is
read. A read from this file returns a structure containing
information about the mshare'd region - (1) the starting virtual
address for the region, and (2) the size of the mshare'd region.

A donor process that wants to create an mshare'd region from a
section of its mapped addresses calls mshare() with the O_CREAT
oflag.  The mshare() syscall then creates a new mm_struct which will
host the page tables for the mshare'd region.  vma->vm_private_data
for the vmas covering the address range for this region is updated to
point to a structure containing a pointer to this new mm_struct.
Existing page tables are copied over to the new mm struct.

A consumer process that wants to map the mshare'd region opens
/sys/fs/mshare/<filename> and reads the starting address and size of
the mshare'd region. It then calls mshare() with these values to map
the entire region into its address space. The consumer process calls
mshare_unlink() to terminate its access.


Since RFC
=========

This patch series includes better error handling and more robust
locking, as well as an improved implementation of mshare, since the
original RFC. It also incorporates feedback from the original RFC.
The alignment and size requirement is PGDIR_SIZE, the same as in the
RFC, and this is open to change based upon further feedback. More
review of this patch series is needed and much appreciated.



Khalid Aziz (14):
  mm: Add new system calls mshare, mshare_unlink
  mm: Add msharefs filesystem
  mm: Add read for msharefs
  mm: implement mshare_unlink syscall
  mm: Add locking to msharefs syscalls
  mm/msharefs: Check for mounted filesystem
  mm: Add vm flag for shared PTE
  mm/mshare: Add basic page table sharing using mshare
  mm: Do not free PTEs for mshare'd PTEs
  mm/mshare: Check for mapped vma when mshare'ing existing mshare'd
    range
  mm/mshare: unmap vmas in mshare_unlink
  mm/mshare: Add a proc file with mshare alignment/size information
  mm/mshare: Enforce mshare'd region permissions
  mm/mshare: Copy PTEs to host mm

 Documentation/filesystems/msharefs.rst |  19 +
 arch/x86/entry/syscalls/syscall_64.tbl |   2 +
 include/linux/mm.h                     |  11 +
 include/trace/events/mmflags.h         |   3 +-
 include/uapi/asm-generic/unistd.h      |   7 +-
 include/uapi/linux/magic.h             |   1 +
 include/uapi/linux/mman.h              |   5 +
 kernel/sysctl.c                        |   7 +
 mm/Makefile                            |   2 +-
 mm/internal.h                          |   7 +
 mm/memory.c                            | 105 ++++-
 mm/mshare.c                            | 587 +++++++++++++++++++++++++
 12 files changed, 750 insertions(+), 6 deletions(-)
 create mode 100644 Documentation/filesystems/msharefs.rst
 create mode 100644 mm/mshare.c

Comments

Matthew Wilcox April 11, 2022, 5:37 p.m. UTC | #1
On Mon, Apr 11, 2022 at 10:05:44AM -0600, Khalid Aziz wrote:
> Page tables in kernel consume some of the memory and as long as number
> of mappings being maintained is small enough, this space consumed by
> page tables is not objectionable. When very few memory pages are
> shared between processes, the number of page table entries (PTEs) to
> maintain is mostly constrained by the number of pages of memory on the
> system. As the number of shared pages and the number of times pages
> are shared goes up, amount of memory consumed by page tables starts to
> become significant.

All of this is true.  However, I've found a lot of people don't see this
as compelling.  I've had more success explaining this from a different
direction:

--- 8< ---

Linux supports processes which share all of their address space (threads)
and processes that share none of their address space (tasks).  We propose
a useful intermediate model where two or more cooperating processes
can choose to share portions of their address space with each other.
The shared portion is referred to by a file descriptor which processes
can choose to attach to their own address space.

Modifications to the shared region affect all processes sharing
that region, just as changes by one thread affect all threads in a
multithreaded program.  This implies a certain level of trust between
the different processes (ie malicious processes should not be allowed
access to the mshared region).

--- 8< ---

Another argument that MM developers find compelling is that we can reduce
some of the complexity in hugetlbfs where it has the ability to share
page tables between processes.

One objection that was raised is that the mechanism for starting the
shared region is a bit clunky.  Did you investigate the proposed approach
of creating an empty address space, attaching to it and using an fd-based
mmap to modify its contents?

> int mshare_unlink(char *name)
> 
> A shared address range created by mshare() can be destroyed using
> mshare_unlink() which removes the  shared named object. Once all
> processes have unmapped the shared object, the shared address range
> references are de-allocated and destroyed.
> 
> mshare_unlink() returns 0 on success or -1 on error.

Can you explain why this is a syscall instead of being a library
function which does

	int dirfd = open("/sys/fs/mshare", O_RDONLY);
	err = unlinkat(dirfd, name, 0);
	close(dirfd);
	return err;

Does msharefs support creating directories, so that we can use file
permissions to limit who can see the sharable files?  Or is it strictly
a single-level-deep hierarchy?
Dave Hansen April 11, 2022, 6:47 p.m. UTC | #2
On 4/11/22 09:05, Khalid Aziz wrote:
> PTEs are shared at pgdir level and hence it imposes following
> requirements on the address and size given to the mshare():
> 
> - Starting address must be aligned to pgdir size (512GB on x86_64).
>   This alignment value can be looked up in /proc/sys/vm//mshare_size
> - Size must be a multiple of pgdir size
> - Any mappings created in this address range at any time become
>   shared automatically
> - Shared address range can have unmapped addresses in it. Any access
>   to unmapped address will result in SIGBUS
> 
> Mappings within this address range behave as if they were shared
> between threads, so a write to a MAP_PRIVATE mapping will create a
> page which is shared between all the sharers. The first process that
> declares an address range mshare'd can continue to map objects in
> the shared area. All other processes that want mshare'd access to
> this memory area can do so by calling mshare(). After this call, the
> address range given by mshare becomes a shared range in its address
> space. Anonymous mappings will be shared and not COWed.

What does this mean in practice?
Dave Hansen April 11, 2022, 6:51 p.m. UTC | #3
On 4/11/22 10:37, Matthew Wilcox wrote:
> Another argument that MM developers find compelling is that we can reduce
> some of the complexity in hugetlbfs where it has the ability to share
> page tables between processes.

When could this complexity reduction actually happen in practice?  Can
this mshare thingy be somehow dropped in underneath the existing
hugetlbfs implementation?  Or would userspace need to change?
Matthew Wilcox April 11, 2022, 7:08 p.m. UTC | #4
On Mon, Apr 11, 2022 at 11:51:46AM -0700, Dave Hansen wrote:
> On 4/11/22 10:37, Matthew Wilcox wrote:
> > Another argument that MM developers find compelling is that we can reduce
> > some of the complexity in hugetlbfs where it has the ability to share
> > page tables between processes.
> 
> When could this complexity reduction actually happen in practice?  Can
> this mshare thingy be somehow dropped in underneath the existing
> hugetlbfs implementation?  Or would userspace need to change?

Userspace needs to opt in to mshare, so there's going to be a transition
period where we still need hugetlbfs to support it, but I have
the impression that the users that need page table sharing are pretty
specialised and we'll be able to find them all before disabling it.

I don't think we can make it transparent to userspace, but I'll noodle
on that a bit.
Khalid Aziz April 11, 2022, 7:52 p.m. UTC | #5
On 4/11/22 11:37, Matthew Wilcox wrote:
> On Mon, Apr 11, 2022 at 10:05:44AM -0600, Khalid Aziz wrote:
>> Page tables in kernel consume some of the memory and as long as number
>> of mappings being maintained is small enough, this space consumed by
>> page tables is not objectionable. When very few memory pages are
>> shared between processes, the number of page table entries (PTEs) to
>> maintain is mostly constrained by the number of pages of memory on the
>> system. As the number of shared pages and the number of times pages
>> are shared goes up, amount of memory consumed by page tables starts to
>> become significant.
> 
> All of this is true.  However, I've found a lot of people don't see this
> as compelling.  I've had more success explaining this from a different
> direction:
> 
> --- 8< ---
> 
> Linux supports processes which share all of their address space (threads)
> and processes that share none of their address space (tasks).  We propose
> a useful intermediate model where two or more cooperating processes
> can choose to share portions of their address space with each other.
> The shared portion is referred to by a file descriptor which processes
> can choose to attach to their own address space.
> 
> Modifications to the shared region affect all processes sharing
> that region, just as changes by one thread affect all threads in a
> multithreaded program.  This implies a certain level of trust between
> the different processes (ie malicious processes should not be allowed
> access to the mshared region).
> 
> --- 8< ---
> 
> Another argument that MM developers find compelling is that we can reduce
> some of the complexity in hugetlbfs where it has the ability to share
> page tables between processes.

This all sounds reasonable.

> 
> One objection that was raised is that the mechanism for starting the
> shared region is a bit clunky.  Did you investigate the proposed approach
> of creating an empty address space, attaching to it and using an fd-based
> mmap to modify its contents?

I want to make sure I understand this correctly. In the example I gave, the process creating the mshare'd region maps
in the address space first, possibly using mmap(). It then calls mshare() to share this already-mapped region. Are you
suggesting that the process be able to call mshare() before mapping in the address range and then map things into that
address range later? If yes, it is my intent to support that after the initial implementation, as an expansion of the
original concept.

> 
>> int mshare_unlink(char *name)
>>
>> A shared address range created by mshare() can be destroyed using
>> mshare_unlink() which removes the  shared named object. Once all
>> processes have unmapped the shared object, the shared address range
>> references are de-allocated and destroyed.
>>
>> mshare_unlink() returns 0 on success or -1 on error.
> 
> Can you explain why this is a syscall instead of being a library
> function which does
> 
> 	int dirfd = open("/sys/fs/mshare");
> 	err = unlinkat(dirfd, name, 0);
> 	close(dirfd);
> 	return err;

mshare_unlink() can be a simple unlink on the file in msharefs. The API will be asymmetrical in that creating an
mshare'd region is a syscall while tearing it down is a file op. I don't mind saving a syscall slot. Would you prefer
it that way?

> 
> Does msharefs support creating directories, so that we can use file
> permissions to limit who can see the sharable files?  Or is it strictly
> a single-level-deep hierarchy?
> 

For now msharefs is single-level-deep. It can be expanded to support directories to limit visibility of filenames. Would 
you prefer to see it support directories from the beginning or can that be a future expansion of this feature?

Thanks,
Khalid
Eric W. Biederman April 11, 2022, 8:10 p.m. UTC | #6
Khalid Aziz <khalid.aziz@oracle.com> writes:

> Page tables in kernel consume some of the memory and as long as number
> of mappings being maintained is small enough, this space consumed by
> page tables is not objectionable. When very few memory pages are
> shared between processes, the number of page table entries (PTEs) to
> maintain is mostly constrained by the number of pages of memory on the
> system. As the number of shared pages and the number of times pages
> are shared goes up, amount of memory consumed by page tables starts to
> become significant.
>
> Some of the field deployments commonly see memory pages shared across
> 1000s of processes. On x86_64, each page requires a PTE that is only 8
> bytes long which is very small compared to the 4K page size. When 2000
> processes map the same page in their address space, each one of them
> requires 8 bytes for its PTE and together that adds up to 8K of memory
> just to hold the PTEs for one 4K page. On a database server with 300GB
> SGA, a system carsh was seen with out-of-memory condition when 1500+
> clients tried to share this SGA even though the system had 512GB of
> memory. On this server, in the worst case scenario of all 1500
> processes mapping every page from SGA would have required 878GB+ for
> just the PTEs. If these PTEs could be shared, amount of memory saved
> is very significant.
>
> This patch series implements a mechanism in kernel to allow userspace
> processes to opt into sharing PTEs. It adds two new system calls - (1)
> mshare(), which can be used by a process to create a region (we will
> call it mshare'd region) which can be used by other processes to map
> same pages using shared PTEs, (2) mshare_unlink() which is used to
> detach from the mshare'd region. Once an mshare'd region is created,
> other process(es), assuming they have the right permissions, can make
> the mashare() system call to map the shared pages into their address
> space using the shared PTEs.  When a process is done using this
> mshare'd region, it makes a mshare_unlink() system call to end its
> access. When the last process accessing mshare'd region calls
> mshare_unlink(), the mshare'd region is torn down and memory used by
> it is freed.
>

>
> API
> ===
>
> The mshare API consists of two system calls - mshare() and mshare_unlink()
>
> --
> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
>
> mshare() creates and opens a new, or opens an existing mshare'd
> region that will be shared at PTE level. "name" refers to shared object
> name that exists under /sys/fs/mshare. "addr" is the starting address
> of this shared memory area and length is the size of this area.
> oflags can be one of:
>
> - O_RDONLY opens shared memory area for read only access by everyone
> - O_RDWR opens shared memory area for read and write access
> - O_CREAT creates the named shared memory area if it does not exist
> - O_EXCL If O_CREAT was also specified, and a shared memory area
>   exists with that name, return an error.
>
> mode represents the creation mode for the shared object under
> /sys/fs/mshare.
>
> mshare() returns an error code if it fails, otherwise it returns 0.
>

Please don't add system calls that take names.

Please just open objects on the filesystem and allow multiple instances
of the filesystem.  Otherwise someone is going to have to come along
later and implement namespace magic to deal with your new system calls.
You already have a filesystem; all that is needed to avoid having to
introduce namespace magic is to simply allow multiple instances of the
filesystem to be mounted.

On that note: since you have a filesystem, introduce a well-known
name for a directory and in that directory place all of the information
and possibly control files for your filesystem.  No need for proc
files and the like, and if at some point you have mount options that
allow the information to be changed, you can have different mounts
with different values present.


This is just me, but I find it weird that you don't use mmap
to place the shared area from the mshare fd into your address space.

I think I would do:
	// Establish the mshare region
	addr = mmap(NULL, PGDIR_SIZE, PROT_READ | PROT_WRITE,
			MAP_SHARED | MAP_MSHARE, msharefd, 0);

	// Remove the mshare region
        addr2 = mmap(addr, PGDIR_SIZE, PROT_NONE,
        		MAP_FIXED | MAP_MUNSHARE, msharefd, 0);

I could see a point in using separate system calls instead of
adding MAP_MSHARE and MAP_MUNSHARE flags.

What are the locking implications of taking a page fault in the shared
region?

Is it a noop or is it going to make some of the nasty locking we have in
the kernel for things like truncates worse?

Eric
Khalid Aziz April 11, 2022, 10:21 p.m. UTC | #7
On 4/11/22 14:10, Eric W. Biederman wrote:
> Khalid Aziz <khalid.aziz@oracle.com> writes:
> 
>> Page tables in kernel consume some of the memory and as long as number
>> of mappings being maintained is small enough, this space consumed by
>> page tables is not objectionable. When very few memory pages are
>> shared between processes, the number of page table entries (PTEs) to
>> maintain is mostly constrained by the number of pages of memory on the
>> system. As the number of shared pages and the number of times pages
>> are shared goes up, amount of memory consumed by page tables starts to
>> become significant.
>>
>> Some of the field deployments commonly see memory pages shared across
>> 1000s of processes. On x86_64, each page requires a PTE that is only 8
>> bytes long which is very small compared to the 4K page size. When 2000
>> processes map the same page in their address space, each one of them
>> requires 8 bytes for its PTE and together that adds up to 8K of memory
>> just to hold the PTEs for one 4K page. On a database server with 300GB
>> SGA, a system carsh was seen with out-of-memory condition when 1500+
>> clients tried to share this SGA even though the system had 512GB of
>> memory. On this server, in the worst case scenario of all 1500
>> processes mapping every page from SGA would have required 878GB+ for
>> just the PTEs. If these PTEs could be shared, amount of memory saved
>> is very significant.
>>
>> This patch series implements a mechanism in kernel to allow userspace
>> processes to opt into sharing PTEs. It adds two new system calls - (1)
>> mshare(), which can be used by a process to create a region (we will
>> call it mshare'd region) which can be used by other processes to map
>> same pages using shared PTEs, (2) mshare_unlink() which is used to
>> detach from the mshare'd region. Once an mshare'd region is created,
>> other process(es), assuming they have the right permissions, can make
>> the mashare() system call to map the shared pages into their address
>> space using the shared PTEs.  When a process is done using this
>> mshare'd region, it makes a mshare_unlink() system call to end its
>> access. When the last process accessing mshare'd region calls
>> mshare_unlink(), the mshare'd region is torn down and memory used by
>> it is freed.
>>
> 
>>
>> API
>> ===
>>
>> The mshare API consists of two system calls - mshare() and mshare_unlink()
>>
>> --
>> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
>>
>> mshare() creates and opens a new, or opens an existing mshare'd
>> region that will be shared at PTE level. "name" refers to shared object
>> name that exists under /sys/fs/mshare. "addr" is the starting address
>> of this shared memory area and length is the size of this area.
>> oflags can be one of:
>>
>> - O_RDONLY opens shared memory area for read only access by everyone
>> - O_RDWR opens shared memory area for read and write access
>> - O_CREAT creates the named shared memory area if it does not exist
>> - O_EXCL If O_CREAT was also specified, and a shared memory area
>>    exists with that name, return an error.
>>
>> mode represents the creation mode for the shared object under
>> /sys/fs/mshare.
>>
>> mshare() returns an error code if it fails, otherwise it returns 0.
>>
> 
> Please don't add system calls that take names.
>   
> Please just open objects on the filesystem and allow multi-instances of
> the filesystem.  Otherwise someone is going to have to come along later
> and implement namespace magic to deal with your new system calls.  You
> already have a filesystem all that is needed to avoid having to
> introduce namespace magic is to simply allow multiple instances of the
> filesystem to be mounted.

Hi Eric,

Thanks for taking the time to provide feedback.

That sounds reasonable. Is something like this more in line with what you are thinking:

fd = open("/sys/fs/mshare/testregion", O_CREAT|O_EXCL, 0400);
mshare(fd, start, end, O_RDWR);

This sequence creates the mshare'd region and assigns it an address range (which may or may not be empty). Then a client 
process would do something like:

fd = open("/sys/fs/mshare/testregion", O_RDONLY);
mshare(fd, start, end, O_RDWR);

which maps the mshare'd region into the client process's address space.

> 
> On that note.  Since you have a filesystem, introduce a well known
> name for a directory and in that directory place all of the information
> and possibly control files for your filesystem.  No need to for proc
> files and the like, and if at somepoint you have mount options that
> allow the information to be changed you can have different mounts
> with different values present.

So have the kernel mount msharefs at /sys/fs/mshare and create a file /sys/fs/mshare/info. A read from
/sys/fs/mshare/info returns what /proc/sys/vm/mshare_size in patch 12 returns. Did I understand that correctly?

> 
> 
> This is must me.  But I find it weird that you don't use mmap
> to place the shared area from the mshare fd into your address space.
> 
> I think I would do:
> 	// Establish the mshare region
> 	addr = mmap(NULL, PGDIR_SIZE, PROT_READ | PROT_WRITE,
> 			MAP_SHARED | MAP_MSHARE, msharefd, 0);
> 
> 	// Remove the mshare region
>          addr2 = mmap(addr, PGDIR_SIZE, PROT_NONE,
>          		MAP_FIXED | MAP_MUNSHARE, msharefd, 0);
> 
> I could see a point of using separate system calls instead of
> adding MAP_SHARE and MAP_UNSHARE flags.

In that case, we can just do away with syscalls completely. For instance, to create an mshare'd region, one would:

fd = open("/sys/fs/mshare/testregion", O_CREAT|O_EXCL, 0400);
addr = mmap(NULL, PGDIR_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
or
addr = mmap(myaddr, PGDIR_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, fd, 0);

The first mmap using this fd sets the address range, which stays the same throughout the life of the region. To
populate data in this address range, the process could use mmap/mremap and other mechanisms subsequently.

To map this mshare'd region into its address space, a process would:

struct mshare_info {
         unsigned long start;
         unsigned long size;
} minfo;
fd = open("/sys/fs/mshare/testregion", O_CREAT|O_EXCL, 0400);
read(fd, &minfo, sizeof(struct mshare_info));
addr = mmap(NULL, minfo.size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

When done with the mshare'd region, the process would unlink("/sys/fs/mshare/testregion").

This changes the API significantly, but if it results in a cleaner and more maintainable API, I will make the changes.

> 
> What are the locking implications of taking a page fault in the shared
> region?
> 
> Is it a noop or is it going to make some of the nasty locking we have in
> the kernel for things like truncates worse?
> 
> Eric

Handling a page fault would require locking the mm for the faulting process and the host mm, which would synchronize
with any other process taking a page fault in the shared region at the same time.

Thanks,
Khalid
Barry Song May 30, 2022, 10:48 a.m. UTC | #8
On Tue, Apr 12, 2022 at 4:07 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>
> Page tables in kernel consume some of the memory and as long as number
> of mappings being maintained is small enough, this space consumed by
> page tables is not objectionable. When very few memory pages are
> shared between processes, the number of page table entries (PTEs) to
> maintain is mostly constrained by the number of pages of memory on the
> system. As the number of shared pages and the number of times pages
> are shared goes up, amount of memory consumed by page tables starts to
> become significant.
>
> Some of the field deployments commonly see memory pages shared across
> 1000s of processes. On x86_64, each page requires a PTE that is only 8
> bytes long which is very small compared to the 4K page size. When 2000
> processes map the same page in their address space, each one of them
> requires 8 bytes for its PTE and together that adds up to 8K of memory
> just to hold the PTEs for one 4K page. On a database server with 300GB
> SGA, a system carsh was seen with out-of-memory condition when 1500+
> clients tried to share this SGA even though the system had 512GB of
> memory. On this server, in the worst case scenario of all 1500
> processes mapping every page from SGA would have required 878GB+ for
> just the PTEs. If these PTEs could be shared, amount of memory saved
> is very significant.
>
> This patch series implements a mechanism in kernel to allow userspace
> processes to opt into sharing PTEs. It adds two new system calls - (1)
> mshare(), which can be used by a process to create a region (we will
> call it mshare'd region) which can be used by other processes to map
> same pages using shared PTEs, (2) mshare_unlink() which is used to
> detach from the mshare'd region. Once an mshare'd region is created,
> other process(es), assuming they have the right permissions, can make
> the mashare() system call to map the shared pages into their address
> space using the shared PTEs.  When a process is done using this
> mshare'd region, it makes a mshare_unlink() system call to end its
> access. When the last process accessing mshare'd region calls
> mshare_unlink(), the mshare'd region is torn down and memory used by
> it is freed.
>
>
> API
> ===
>
> The mshare API consists of two system calls - mshare() and mshare_unlink()
>
> --
> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
>
> mshare() creates and opens a new, or opens an existing mshare'd
> region that will be shared at PTE level. "name" refers to shared object
> name that exists under /sys/fs/mshare. "addr" is the starting address
> of this shared memory area and length is the size of this area.
> oflags can be one of:
>
> - O_RDONLY opens shared memory area for read only access by everyone
> - O_RDWR opens shared memory area for read and write access
> - O_CREAT creates the named shared memory area if it does not exist
> - O_EXCL If O_CREAT was also specified, and a shared memory area
>   exists with that name, return an error.
>
> mode represents the creation mode for the shared object under
> /sys/fs/mshare.
>
> mshare() returns an error code if it fails, otherwise it returns 0.
>
> PTEs are shared at pgdir level and hence it imposes following
> requirements on the address and size given to the mshare():
>
> - Starting address must be aligned to pgdir size (512GB on x86_64).
>   This alignment value can be looked up in /proc/sys/vm//mshare_size
> - Size must be a multiple of pgdir size
> - Any mappings created in this address range at any time become
>   shared automatically
> - Shared address range can have unmapped addresses in it. Any access
>   to unmapped address will result in SIGBUS
>
> Mappings within this address range behave as if they were shared
> between threads, so a write to a MAP_PRIVATE mapping will create a
> page which is shared between all the sharers. The first process that
> declares an address range mshare'd can continue to map objects in
> the shared area. All other processes that want mshare'd access to
> this memory area can do so by calling mshare(). After this call, the
> address range given by mshare becomes a shared range in its address
> space. Anonymous mappings will be shared and not COWed.
>
> A file under /sys/fs/mshare can be opened and read from. A read from
> this file returns two long values - (1) starting address, and (2)
> size of the mshare'd region.
>
> --
> int mshare_unlink(char *name)
>
> A shared address range created by mshare() can be destroyed using
> mshare_unlink() which removes the  shared named object. Once all
> processes have unmapped the shared object, the shared address range
> references are de-allocated and destroyed.
>
> mshare_unlink() returns 0 on success or -1 on error.
>
>
> Example Code
> ============
>
> Snippet of the code that a donor process would run looks like below:
>
> -----------------
>         addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
>                         MAP_SHARED | MAP_ANONYMOUS, 0, 0);
>         if (addr == MAP_FAILED)
>                 perror("ERROR: mmap failed");
>
>         err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2),
>                         GB(512), O_CREAT|O_RDWR|O_EXCL, 600);
>         if (err < 0) {
>                 perror("mshare() syscall failed");
>                 exit(1);
>         }
>
>         strncpy(addr, "Some random shared text",
>                         sizeof("Some random shared text"));
> -----------------
>
> Snippet of code that a consumer process would execute looks like:
>
> -----------------
>         struct mshare_info minfo;
>
>         fd = open("testregion", O_RDONLY);
>         if (fd < 0) {
>                 perror("open failed");
>                 exit(1);
>         }
>
>         if ((count = read(fd, &minfo, sizeof(struct mshare_info)) > 0))
>                 printf("INFO: %ld bytes shared at addr 0x%lx \n",
>                                 minfo.size, minfo.start);
>         else
>                 perror("read failed");
>
>         close(fd);
>
>         addr = (void *)minfo.start;
>         err = syscall(MSHARE_SYSCALL, "testregion", addr, minfo.size,
>                         O_RDWR, 600);
>         if (err < 0) {
>                 perror("mshare() syscall failed");
>                 exit(1);
>         }
>
>         printf("Guest mmap at %px:\n", addr);
>         printf("%s\n", addr);
>         printf("\nDone\n");
>
>         err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
>         if (err < 0) {
>                 perror("mshare_unlink() failed");
>                 exit(1);
>         }
> -----------------


Does that mean those shared pages will get page_mapcount()=1?

A big pain for a memory-limited system like a desktop/embedded system is
that reverse mapping takes tons of cpu in the memory reclamation path,
especially for those pages mapped by multiple processes. Sometimes we see
100% cpu utilization on the LRU, scanning to find out whether a page has
been accessed by reading the PTE young bit.

If we end up with only one PTE because of this patchset, it means we get
a significant performance improvement in the kernel LRU, particularly
when free memory approaches the watermarks.

But I don't see how a new system call like mshare() can be used
by those systems, as they might need some more automatic PTE-sharing
mechanism.

BTW, I suppose we are actually able to share PTEs as long as the address
is 2MB aligned?

>
>
> Patch series
> ============
>
> This series of patches is an initial implementation of these two
> system calls. This code implements working basic functionality.
>
> Prototype for the two syscalls is:
>
> SYSCALL_DEFINE5(mshare, const char *, name, unsigned long, addr,
>                 unsigned long, len, int, oflag, mode_t, mode)
>
> SYSCALL_DEFINE1(mshare_unlink, const char *, name)
>
> In order to facilitate page table sharing, this implemntation adds a
> new in-memory filesystem - msharefs which will be mounted at
> /sys/fs/mshare. When a new mshare'd region is created, a file with
> the name given by initial mshare() call is created under this
> filesystem.  Permissions for this file are given by the "mode"
> parameter to mshare(). The only operation supported on this file is
> read. A read from this file returns a structure containing
> information about mshare'd region - (1) starting virtual address for
> the region, and (2) size of mshare'd region.
>
> A donor process that wants to create an mshare'd region from a
> section of its mapped addresses calls mshare() with O_CREAT oflag.
> mshare() syscall then creates a new mm_struct which will host the
> page tables for the mshare'd region.  vma->vm_private_data for the
> vmas covering address range for this region are updated to point to
> a structure containing pointer to this new mm_struct.  Existing page
> tables are copied over to new mm struct.
>
> A consumer process that wants to map mshare'd region opens
> /sys/fs/mshare/<filename> and reads the starting address and size of
> mshare'd region. It then calls mshare() with these values to map the
> entire region in its address space. Consumer process calls
> mshare_unlink() to terminate its access.
>
>
> Since RFC
> =========
>
> This patch series includes better error handling and more robust
> locking besides improved implementation of mshare since the original
> RFC. It also incorporates feedback from original RFC. Alignment and
> size requirment are PGDIR_SIZE, same as RFC and this is open to
> change based upon further feedback. More review is needed for this
> patch series and is much appreciated.
>
>
>
> Khalid Aziz (14):
>   mm: Add new system calls mshare, mshare_unlink
>   mm: Add msharefs filesystem
>   mm: Add read for msharefs
>   mm: implement mshare_unlink syscall
>   mm: Add locking to msharefs syscalls
>   mm/msharefs: Check for mounted filesystem
>   mm: Add vm flag for shared PTE
>   mm/mshare: Add basic page table sharing using mshare
>   mm: Do not free PTEs for mshare'd PTEs
>   mm/mshare: Check for mapped vma when mshare'ing existing mshare'd
>     range
>   mm/mshare: unmap vmas in mshare_unlink
>   mm/mshare: Add a proc file with mshare alignment/size information
>   mm/mshare: Enforce mshare'd region permissions
>   mm/mshare: Copy PTEs to host mm
>
>  Documentation/filesystems/msharefs.rst |  19 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   2 +
>  include/linux/mm.h                     |  11 +
>  include/trace/events/mmflags.h         |   3 +-
>  include/uapi/asm-generic/unistd.h      |   7 +-
>  include/uapi/linux/magic.h             |   1 +
>  include/uapi/linux/mman.h              |   5 +
>  kernel/sysctl.c                        |   7 +
>  mm/Makefile                            |   2 +-
>  mm/internal.h                          |   7 +
>  mm/memory.c                            | 105 ++++-
>  mm/mshare.c                            | 587 +++++++++++++++++++++++++
>  12 files changed, 750 insertions(+), 6 deletions(-)
>  create mode 100644 Documentation/filesystems/msharefs.rst
>  create mode 100644 mm/mshare.c
>
> --
> 2.32.0
>

Thanks
Barry
David Hildenbrand May 30, 2022, 11:18 a.m. UTC | #9
On 30.05.22 12:48, Barry Song wrote:
> On Tue, Apr 12, 2022 at 4:07 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>
>> Page tables in kernel consume some of the memory and as long as number
>> of mappings being maintained is small enough, this space consumed by
>> page tables is not objectionable. When very few memory pages are
>> shared between processes, the number of page table entries (PTEs) to
>> maintain is mostly constrained by the number of pages of memory on the
>> system. As the number of shared pages and the number of times pages
>> are shared goes up, amount of memory consumed by page tables starts to
>> become significant.
>>
>> Some of the field deployments commonly see memory pages shared across
>> 1000s of processes. On x86_64, each page requires a PTE that is only 8
>> bytes long which is very small compared to the 4K page size. When 2000
>> processes map the same page in their address space, each one of them
>> requires 8 bytes for its PTE and together that adds up to 8K of memory
>> just to hold the PTEs for one 4K page. On a database server with 300GB
>> SGA, a system carsh was seen with out-of-memory condition when 1500+
>> clients tried to share this SGA even though the system had 512GB of
>> memory. On this server, in the worst case scenario of all 1500
>> processes mapping every page from SGA would have required 878GB+ for
>> just the PTEs. If these PTEs could be shared, amount of memory saved
>> is very significant.
>>
>> This patch series implements a mechanism in kernel to allow userspace
>> processes to opt into sharing PTEs. It adds two new system calls - (1)
>> mshare(), which can be used by a process to create a region (we will
>> call it mshare'd region) which can be used by other processes to map
>> same pages using shared PTEs, (2) mshare_unlink() which is used to
>> detach from the mshare'd region. Once an mshare'd region is created,
>> other process(es), assuming they have the right permissions, can make
>> the mashare() system call to map the shared pages into their address
>> space using the shared PTEs.  When a process is done using this
>> mshare'd region, it makes a mshare_unlink() system call to end its
>> access. When the last process accessing mshare'd region calls
>> mshare_unlink(), the mshare'd region is torn down and memory used by
>> it is freed.
>>
>>
>> API
>> ===
>>
>> The mshare API consists of two system calls - mshare() and mshare_unlink()
>>
>> --
>> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
>>
>> mshare() creates and opens a new, or opens an existing mshare'd
>> region that will be shared at PTE level. "name" refers to shared object
>> name that exists under /sys/fs/mshare. "addr" is the starting address
>> of this shared memory area and length is the size of this area.
>> oflags can be one of:
>>
>> - O_RDONLY opens shared memory area for read only access by everyone
>> - O_RDWR opens shared memory area for read and write access
>> - O_CREAT creates the named shared memory area if it does not exist
>> - O_EXCL If O_CREAT was also specified, and a shared memory area
>>   exists with that name, return an error.
>>
>> mode represents the creation mode for the shared object under
>> /sys/fs/mshare.
>>
>> mshare() returns an error code if it fails, otherwise it returns 0.
>>
>> PTEs are shared at pgdir level and hence it imposes following
>> requirements on the address and size given to the mshare():
>>
>> - Starting address must be aligned to pgdir size (512GB on x86_64).
>>   This alignment value can be looked up in /proc/sys/vm//mshare_size
>> - Size must be a multiple of pgdir size
>> - Any mappings created in this address range at any time become
>>   shared automatically
>> - Shared address range can have unmapped addresses in it. Any access
>>   to unmapped address will result in SIGBUS
>>
>> Mappings within this address range behave as if they were shared
>> between threads, so a write to a MAP_PRIVATE mapping will create a
>> page which is shared between all the sharers. The first process that
>> declares an address range mshare'd can continue to map objects in
>> the shared area. All other processes that want mshare'd access to
>> this memory area can do so by calling mshare(). After this call, the
>> address range given by mshare becomes a shared range in its address
>> space. Anonymous mappings will be shared and not COWed.
>>
>> A file under /sys/fs/mshare can be opened and read from. A read from
>> this file returns two long values - (1) starting address, and (2)
>> size of the mshare'd region.
>>
>> --
>> int mshare_unlink(char *name)
>>
>> A shared address range created by mshare() can be destroyed using
>> mshare_unlink() which removes the  shared named object. Once all
>> processes have unmapped the shared object, the shared address range
>> references are de-allocated and destroyed.
>>
>> mshare_unlink() returns 0 on success or -1 on error.
>>
>>
>> Example Code
>> ============
>>
>> Snippet of the code that a donor process would run looks like below:
>>
>> -----------------
>>         addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
>>                         MAP_SHARED | MAP_ANONYMOUS, 0, 0);
>>         if (addr == MAP_FAILED)
>>                 perror("ERROR: mmap failed");
>>
>>         err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2),
>>                         GB(512), O_CREAT|O_RDWR|O_EXCL, 600);
>>         if (err < 0) {
>>                 perror("mshare() syscall failed");
>>                 exit(1);
>>         }
>>
>>         strncpy(addr, "Some random shared text",
>>                         sizeof("Some random shared text"));
>> -----------------
>>
>> Snippet of code that a consumer process would execute looks like:
>>
>> -----------------
>>         struct mshare_info minfo;
>>
>>         fd = open("testregion", O_RDONLY);
>>         if (fd < 0) {
>>                 perror("open failed");
>>                 exit(1);
>>         }
>>
>>         if ((count = read(fd, &minfo, sizeof(struct mshare_info)) > 0))
>>                 printf("INFO: %ld bytes shared at addr 0x%lx \n",
>>                                 minfo.size, minfo.start);
>>         else
>>                 perror("read failed");
>>
>>         close(fd);
>>
>>         addr = (void *)minfo.start;
>>         err = syscall(MSHARE_SYSCALL, "testregion", addr, minfo.size,
>>                         O_RDWR, 600);
>>         if (err < 0) {
>>                 perror("mshare() syscall failed");
>>                 exit(1);
>>         }
>>
>>         printf("Guest mmap at %px:\n", addr);
>>         printf("%s\n", addr);
>>         printf("\nDone\n");
>>
>>         err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
>>         if (err < 0) {
>>                 perror("mshare_unlink() failed");
>>                 exit(1);
>>         }
>> -----------------
> 
> 
> Does  that mean those shared pages will get page_mapcount()=1 ?

AFAIU, for mshare() that is the case.

> 
> A big pain for a memory limited system like a desktop/embedded system is
> that reverse mapping will take tons of cpu in memory reclamation path
> especially for those pages mapped by multiple processes. sometimes,
> 100% cpu utilization on LRU to scan and find out if a page is accessed
> by reading PTE young.

Regarding PTE-table sharing:

Even if we'd account each logical mapping (independent of page table
sharing) in the page_mapcount(), we would benefit from page table
sharing. Simply when we unmap the page from the shared page table, we'd
have to adjust the mapcount accordingly. So unmapping from a single
(shared) pagetable could directly result in the mapcount dropping to zero.

What I am trying to say is: how the mapcount is handled might be an
implementation detail for PTE-sharing. Not sure how hugetlb handles that
with its PMD-table sharing.

We'd have to clarify what the mapcount actually expresses. Having the
mapcount express "is this page mapped by multiple processes or at
multiple VMAs" might be helpful in some cases. Right now it mostly
expresses exactly that.

> 
> if we result in one PTE only by this patchset, it means we are getting
> significant
> performance improvement in kernel LRU particularly when free memory
> approaches the watermarks.
> 
> But I don't see how a new system call like mshare(),  can be used
> by those systems as they might need some more automatic PTEs sharing
> mechanism.

IMHO, we should look into automatic PTE-table sharing of MAP_SHARED
mappings, similar to what hugetlb provides for PMD table sharing, which
leaves semantics unchanged for existing user space. Maybe there is a way
to factor that out and reuse it for PTE-table sharing.

I can understand that there are use cases for explicit sharing with new
(e.g., mprotect) semantics.

> 
> BTW, I suppose we are actually able to share PTEs as long as the address
> is 2MB aligned?

2MB is x86 specific, but yes. As long as we're aligned to PMDs and
* the VMA spans the complete PMD
* the VMA permissions match the page table
* no process-specific page table features are used (uffd, softdirty
  tracking)

... probably there are more corner cases. (e.g., referenced and dirty
bit don't reflect what the mapping process did, which might or might not
be relevant for current/future features)
Barry Song May 30, 2022, 11:49 a.m. UTC | #10
On Mon, May 30, 2022 at 11:18 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 30.05.22 12:48, Barry Song wrote:
> > On Tue, Apr 12, 2022 at 4:07 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
> >>
> >> Page tables in kernel consume some of the memory and as long as number
> >> of mappings being maintained is small enough, this space consumed by
> >> page tables is not objectionable. When very few memory pages are
> >> shared between processes, the number of page table entries (PTEs) to
> >> maintain is mostly constrained by the number of pages of memory on the
> >> system. As the number of shared pages and the number of times pages
> >> are shared goes up, amount of memory consumed by page tables starts to
> >> become significant.
> >>
> >> Some of the field deployments commonly see memory pages shared across
> >> 1000s of processes. On x86_64, each page requires a PTE that is only 8
> >> bytes long which is very small compared to the 4K page size. When 2000
> >> processes map the same page in their address space, each one of them
> >> requires 8 bytes for its PTE and together that adds up to 8K of memory
> >> just to hold the PTEs for one 4K page. On a database server with 300GB
> >> SGA, a system carsh was seen with out-of-memory condition when 1500+
> >> clients tried to share this SGA even though the system had 512GB of
> >> memory. On this server, in the worst case scenario of all 1500
> >> processes mapping every page from SGA would have required 878GB+ for
> >> just the PTEs. If these PTEs could be shared, amount of memory saved
> >> is very significant.
> >>
> >> This patch series implements a mechanism in kernel to allow userspace
> >> processes to opt into sharing PTEs. It adds two new system calls - (1)
> >> mshare(), which can be used by a process to create a region (we will
> >> call it mshare'd region) which can be used by other processes to map
> >> same pages using shared PTEs, (2) mshare_unlink() which is used to
> >> detach from the mshare'd region. Once an mshare'd region is created,
> >> other process(es), assuming they have the right permissions, can make
> >> the mashare() system call to map the shared pages into their address
> >> space using the shared PTEs.  When a process is done using this
> >> mshare'd region, it makes a mshare_unlink() system call to end its
> >> access. When the last process accessing mshare'd region calls
> >> mshare_unlink(), the mshare'd region is torn down and memory used by
> >> it is freed.
> >>
> >>
> >> API
> >> ===
> >>
> >> The mshare API consists of two system calls - mshare() and mshare_unlink()
> >>
> >> --
> >> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
> >>
> >> mshare() creates and opens a new, or opens an existing mshare'd
> >> region that will be shared at PTE level. "name" refers to shared object
> >> name that exists under /sys/fs/mshare. "addr" is the starting address
> >> of this shared memory area and length is the size of this area.
> >> oflags can be one of:
> >>
> >> - O_RDONLY opens shared memory area for read only access by everyone
> >> - O_RDWR opens shared memory area for read and write access
> >> - O_CREAT creates the named shared memory area if it does not exist
> >> - O_EXCL If O_CREAT was also specified, and a shared memory area
> >>   exists with that name, return an error.
> >>
> >> mode represents the creation mode for the shared object under
> >> /sys/fs/mshare.
> >>
> >> mshare() returns an error code if it fails, otherwise it returns 0.
> >>
> >> PTEs are shared at pgdir level and hence it imposes following
> >> requirements on the address and size given to the mshare():
> >>
> >> - Starting address must be aligned to pgdir size (512GB on x86_64).
> >>   This alignment value can be looked up in /proc/sys/vm//mshare_size
> >> - Size must be a multiple of pgdir size
> >> - Any mappings created in this address range at any time become
> >>   shared automatically
> >> - Shared address range can have unmapped addresses in it. Any access
> >>   to unmapped address will result in SIGBUS
> >>
> >> Mappings within this address range behave as if they were shared
> >> between threads, so a write to a MAP_PRIVATE mapping will create a
> >> page which is shared between all the sharers. The first process that
> >> declares an address range mshare'd can continue to map objects in
> >> the shared area. All other processes that want mshare'd access to
> >> this memory area can do so by calling mshare(). After this call, the
> >> address range given by mshare becomes a shared range in its address
> >> space. Anonymous mappings will be shared and not COWed.
> >>
> >> A file under /sys/fs/mshare can be opened and read from. A read from
> >> this file returns two long values - (1) starting address, and (2)
> >> size of the mshare'd region.
> >>
> >> --
> >> int mshare_unlink(char *name)
> >>
> >> A shared address range created by mshare() can be destroyed using
> >> mshare_unlink() which removes the  shared named object. Once all
> >> processes have unmapped the shared object, the shared address range
> >> references are de-allocated and destroyed.
> >>
> >> mshare_unlink() returns 0 on success or -1 on error.
> >>
> >>
> >> Example Code
> >> ============
> >>
> >> Snippet of the code that a donor process would run looks like below:
> >>
> >> -----------------
> >>         addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
> >>                         MAP_SHARED | MAP_ANONYMOUS, 0, 0);
> >>         if (addr == MAP_FAILED)
> >>                 perror("ERROR: mmap failed");
> >>
> >>         err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2),
> >>                         GB(512), O_CREAT|O_RDWR|O_EXCL, 0600);
> >>         if (err < 0) {
> >>                 perror("mshare() syscall failed");
> >>                 exit(1);
> >>         }
> >>
> >>         strncpy(addr, "Some random shared text",
> >>                         sizeof("Some random shared text"));
> >> -----------------
> >>
> >> Snippet of code that a consumer process would execute looks like:
> >>
> >> -----------------
> >>         struct mshare_info minfo;
> >>
> >>         fd = open("testregion", O_RDONLY);
> >>         if (fd < 0) {
> >>                 perror("open failed");
> >>                 exit(1);
> >>         }
> >>
> >>         if ((count = read(fd, &minfo, sizeof(struct mshare_info))) > 0)
> >>                 printf("INFO: %ld bytes shared at addr 0x%lx \n",
> >>                                 minfo.size, minfo.start);
> >>         else
> >>                 perror("read failed");
> >>
> >>         close(fd);
> >>
> >>         addr = (void *)minfo.start;
> >>         err = syscall(MSHARE_SYSCALL, "testregion", addr, minfo.size,
> >>                         O_RDWR, 0600);
> >>         if (err < 0) {
> >>                 perror("mshare() syscall failed");
> >>                 exit(1);
> >>         }
> >>
> >>         printf("Guest mmap at %px:\n", addr);
> >>         printf("%s\n", addr);
> >>         printf("\nDone\n");
> >>
> >>         err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
> >>         if (err < 0) {
> >>                 perror("mshare_unlink() failed");
> >>                 exit(1);
> >>         }
> >> -----------------
> >
> >
> > Does  that mean those shared pages will get page_mapcount()=1 ?
>
> AFAIU, for mshare() that is the case.
>
> >
> > A big pain for a memory limited system like a desktop/embedded system is
> > that reverse mapping will take tons of cpu in memory reclamation path
> > especially for those pages mapped by multiple processes. sometimes,
> > 100% cpu utilization on LRU to scan and find out if a page is accessed
> > by reading PTE young.
>
> Regarding PTE-table sharing:
>
> Even if we'd account each logical mapping (independent of page table
> sharing) in the page_mapcount(), we would benefit from page table
> sharing. Simply when we unmap the page from the shared page table, we'd
> have to adjust the mapcount accordingly. So unmapping from a single
> (shared) pagetable could directly result in the mapcount dropping to zero.
>
> What I am trying to say is: how the mapcount is handled might be an
> implementation detail for PTE-sharing. Not sure how hugetlb handles that
> with its PMD-table sharing.
>
> We'd have to clarify what the mapcount actually expresses. Having the
> mapcount express "is this page mapped by multiple processes or at
> multiple VMAs" might be helpful in some cases. Right now it mostly
> expresses exactly that.

Right, I guess mapcount, as a number by itself, isn't so important. The
only important thing is that we only need to read one PTE rather
than 1000 PTEs to figure out whether a page is young.
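To make that arithmetic explicit, here is a toy userspace model (not kernel
code; every name in it is made up for illustration): deciding whether a cold
page is young costs one accessed-bit read per private PTE, but only a single
read when the PTE table is shared.

-----------------
#include <stdbool.h>
#include <stdio.h>

#define NPROC 1000                      /* processes sharing one page */

struct toy_pte {
        bool young;                     /* models the hardware accessed bit */
};

/*
 * With private page tables, the rmap walk has to test the accessed bit
 * in every process's PTE; for a cold page that is NPROC reads.
 */
static int pte_reads_private(const struct toy_pte *ptes, int nproc)
{
        int reads = 0;

        for (int i = 0; i < nproc; i++) {
                reads++;
                if (ptes[i].young)
                        break;
        }
        return reads;
}

int main(void)
{
        struct toy_pte private_ptes[NPROC] = { { false } };
        struct toy_pte shared_pte = { false };

        printf("private tables: %d PTE reads\n",
               pte_reads_private(private_ptes, NPROC));
        /* With a shared PTE table, one read answers it for all sharers. */
        printf("shared table:   1 PTE read (young=%d)\n", shared_pte.young);
        return 0;
}
-----------------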

So is this patchset already able to do the reverse mapping only
once for a page shared by 1000 processes, rather than reading
1000 PTEs?

I mean, I haven't actually found that you touched any files in
mm/vmscan.c etc.

>
> >
> > if we result in one PTE only by this patchset, it means we are getting
> > significant
> > performance improvement in kernel LRU particularly when free memory
> > approaches the watermarks.
> >
> > But I don't see how a new system call like mshare(),  can be used
> > by those systems as they might need some more automatic PTEs sharing
> > mechanism.
>
> IMHO, we should look into automatic PTE-table sharing of MAP_SHARED
> mappings, similar to what hugetlb provides for PMD table sharing, which
> leaves semantics unchanged for existing user space. Maybe there is a way
> to factor that out and reuse it for PTE-table sharing.
>
> I can understand that there are use cases for explicit sharing with new
> (e.g., mprotect) semantics.
>
> >
> > BTW, I suppose we are actually able to share PTEs as long as the address
> > is 2MB aligned?
>
> 2MB is x86 specific, but yes. As long as we're aligned to PMDs and
> * the VMA spans the complete PMD
> * the VMA permissions match the page table
> * no process-specific page table features are used (uffd, softdirty
>   tracking)
>
> ... probably there are more corner cases. (e.g., referenced and dirty
> bit don't reflect what the mapping process did, which might or might not
> be relevant for current/future features)
>
> --
> Thanks,
>
> David / dhildenb
>

Thanks
Barry
Khalid Aziz June 29, 2022, 5:40 p.m. UTC | #11
On 5/30/22 04:48, Barry Song wrote:
> On Tue, Apr 12, 2022 at 4:07 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>
>> Page tables in kernel consume some of the memory and as long as number
>> of mappings being maintained is small enough, this space consumed by
>> page tables is not objectionable. When very few memory pages are
>> shared between processes, the number of page table entries (PTEs) to
>> maintain is mostly constrained by the number of pages of memory on the
>> system. As the number of shared pages and the number of times pages
>> are shared goes up, amount of memory consumed by page tables starts to
>> become significant.
>>
>> Some of the field deployments commonly see memory pages shared across
>> 1000s of processes. On x86_64, each page requires a PTE that is only 8
>> bytes long which is very small compared to the 4K page size. When 2000
>> processes map the same page in their address space, each one of them
>> requires 8 bytes for its PTE and together that adds up to 8K of memory
>> just to hold the PTEs for one 4K page. On a database server with 300GB
>> SGA, a system crash was seen with out-of-memory condition when 1500+
>> clients tried to share this SGA even though the system had 512GB of
>> memory. On this server, in the worst case scenario of all 1500
>> processes mapping every page from SGA would have required 878GB+ for
>> just the PTEs. If these PTEs could be shared, amount of memory saved
>> is very significant.
>>
>> This patch series implements a mechanism in kernel to allow userspace
>> processes to opt into sharing PTEs. It adds two new system calls - (1)
>> mshare(), which can be used by a process to create a region (we will
>> call it mshare'd region) which can be used by other processes to map
>> same pages using shared PTEs, (2) mshare_unlink() which is used to
>> detach from the mshare'd region. Once an mshare'd region is created,
>> other process(es), assuming they have the right permissions, can make
>> the mshare() system call to map the shared pages into their address
>> space using the shared PTEs.  When a process is done using this
>> mshare'd region, it makes a mshare_unlink() system call to end its
>> access. When the last process accessing mshare'd region calls
>> mshare_unlink(), the mshare'd region is torn down and memory used by
>> it is freed.
>>
>>
>> API
>> ===
>>
>> The mshare API consists of two system calls - mshare() and mshare_unlink()
>>
>> --
>> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
>>
>> mshare() creates and opens a new, or opens an existing mshare'd
>> region that will be shared at PTE level. "name" refers to shared object
>> name that exists under /sys/fs/mshare. "addr" is the starting address
>> of this shared memory area and length is the size of this area.
>> oflags can be one of:
>>
>> - O_RDONLY opens shared memory area for read only access by everyone
>> - O_RDWR opens shared memory area for read and write access
>> - O_CREAT creates the named shared memory area if it does not exist
>> - O_EXCL If O_CREAT was also specified, and a shared memory area
>>    exists with that name, return an error.
>>
>> mode represents the creation mode for the shared object under
>> /sys/fs/mshare.
>>
>> mshare() returns an error code if it fails, otherwise it returns 0.
>>
>> PTEs are shared at pgdir level and hence it imposes following
>> requirements on the address and size given to the mshare():
>>
>> - Starting address must be aligned to pgdir size (512GB on x86_64).
>>    This alignment value can be looked up in /proc/sys/vm/mshare_size
>> - Size must be a multiple of pgdir size
>> - Any mappings created in this address range at any time become
>>    shared automatically
>> - Shared address range can have unmapped addresses in it. Any access
>>    to unmapped address will result in SIGBUS
>>
>> Mappings within this address range behave as if they were shared
>> between threads, so a write to a MAP_PRIVATE mapping will create a
>> page which is shared between all the sharers. The first process that
>> declares an address range mshare'd can continue to map objects in
>> the shared area. All other processes that want mshare'd access to
>> this memory area can do so by calling mshare(). After this call, the
>> address range given by mshare becomes a shared range in its address
>> space. Anonymous mappings will be shared and not COWed.
>>
>> A file under /sys/fs/mshare can be opened and read from. A read from
>> this file returns two long values - (1) starting address, and (2)
>> size of the mshare'd region.
>>
>> --
>> int mshare_unlink(char *name)
>>
>> A shared address range created by mshare() can be destroyed using
>> mshare_unlink() which removes the  shared named object. Once all
>> processes have unmapped the shared object, the shared address range
>> references are de-allocated and destroyed.
>>
>> mshare_unlink() returns 0 on success or -1 on error.
>>
>>
>> Example Code
>> ============
>>
>> Snippet of the code that a donor process would run looks like below:
>>
>> -----------------
>>          addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
>>                          MAP_SHARED | MAP_ANONYMOUS, 0, 0);
>>          if (addr == MAP_FAILED)
>>                  perror("ERROR: mmap failed");
>>
>>          err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2),
>>                          GB(512), O_CREAT|O_RDWR|O_EXCL, 0600);
>>          if (err < 0) {
>>                  perror("mshare() syscall failed");
>>                  exit(1);
>>          }
>>
>>          strncpy(addr, "Some random shared text",
>>                          sizeof("Some random shared text"));
>> -----------------
>>
>> Snippet of code that a consumer process would execute looks like:
>>
>> -----------------
>>          struct mshare_info minfo;
>>
>>          fd = open("testregion", O_RDONLY);
>>          if (fd < 0) {
>>                  perror("open failed");
>>                  exit(1);
>>          }
>>
>>          if ((count = read(fd, &minfo, sizeof(struct mshare_info))) > 0)
>>                  printf("INFO: %ld bytes shared at addr 0x%lx \n",
>>                                  minfo.size, minfo.start);
>>          else
>>                  perror("read failed");
>>
>>          close(fd);
>>
>>          addr = (void *)minfo.start;
>>          err = syscall(MSHARE_SYSCALL, "testregion", addr, minfo.size,
>>                          O_RDWR, 0600);
>>          if (err < 0) {
>>                  perror("mshare() syscall failed");
>>                  exit(1);
>>          }
>>
>>          printf("Guest mmap at %px:\n", addr);
>>          printf("%s\n", addr);
>>          printf("\nDone\n");
>>
>>          err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
>>          if (err < 0) {
>>                  perror("mshare_unlink() failed");
>>                  exit(1);
>>          }
>> -----------------
> 
> 
> Does  that mean those shared pages will get page_mapcount()=1 ?
> 
> A big pain for a memory limited system like a desktop/embedded system is
> that reverse mapping will take tons of cpu in memory reclamation path
> especially for those pages mapped by multiple processes. sometimes,
> 100% cpu utilization on LRU to scan and find out if a page is accessed
> by reading PTE young.
> 
> if we result in one PTE only by this patchset, it means we are getting
> significant
> performance improvement in kernel LRU particularly when free memory
> approaches the watermarks.
> 
> But I don't see how a new system call like mshare(),  can be used
> by those systems as they might need some more automatic PTEs sharing
> mechanism.
> 
> BTW, I suppose we are actually able to share PTEs as long as the address
> is 2MB aligned?
> 

The anonymous pages that are allocated to back the virtual pages in the vmas maintained in the mshare region get added to 
the rmap only once. The mshare() system call is going away and has been replaced by an fd-based mmap in the next version 
of this patch series.
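Just to help readers follow along, here is a purely hypothetical sketch of what
fd-based usage might look like for a consumer. The path under /sys/fs/mshare
matches the cover letter, but the flags, size handling and everything else in
this snippet are assumptions, not the posted v2 interface:

-----------------
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        /* Open the mshare'd region by name, as a file under msharefs. */
        int fd = open("/sys/fs/mshare/testregion", O_RDWR);
        if (fd < 0) {
                perror("open");
                exit(1);
        }

        /*
         * An mmap of the fd would attach the caller to the shared page
         * tables; the address/size constraints from the cover letter
         * would presumably still apply.
         */
        char *addr = mmap(NULL, 512UL << 30, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED) {
                perror("mmap");
                exit(1);
        }

        printf("%s\n", addr);

        munmap(addr, 512UL << 30);
        close(fd);
        return 0;
}
-----------------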

Thanks,
Khalid
Khalid Aziz June 29, 2022, 5:48 p.m. UTC | #12
On 5/30/22 05:18, David Hildenbrand wrote:
> On 30.05.22 12:48, Barry Song wrote:
>> On Tue, Apr 12, 2022 at 4:07 AM Khalid Aziz <khalid.aziz@oracle.com> wrote:
>>>
>>> Page tables in kernel consume some of the memory and as long as number
>>> of mappings being maintained is small enough, this space consumed by
>>> page tables is not objectionable. When very few memory pages are
>>> shared between processes, the number of page table entries (PTEs) to
>>> maintain is mostly constrained by the number of pages of memory on the
>>> system. As the number of shared pages and the number of times pages
>>> are shared goes up, amount of memory consumed by page tables starts to
>>> become significant.
>>>
>>> Some of the field deployments commonly see memory pages shared across
>>> 1000s of processes. On x86_64, each page requires a PTE that is only 8
>>> bytes long which is very small compared to the 4K page size. When 2000
>>> processes map the same page in their address space, each one of them
>>> requires 8 bytes for its PTE and together that adds up to 8K of memory
>>> just to hold the PTEs for one 4K page. On a database server with 300GB
>>> SGA, a system crash was seen with out-of-memory condition when 1500+
>>> clients tried to share this SGA even though the system had 512GB of
>>> memory. On this server, in the worst case scenario of all 1500
>>> processes mapping every page from SGA would have required 878GB+ for
>>> just the PTEs. If these PTEs could be shared, amount of memory saved
>>> is very significant.
>>>
>>> This patch series implements a mechanism in kernel to allow userspace
>>> processes to opt into sharing PTEs. It adds two new system calls - (1)
>>> mshare(), which can be used by a process to create a region (we will
>>> call it mshare'd region) which can be used by other processes to map
>>> same pages using shared PTEs, (2) mshare_unlink() which is used to
>>> detach from the mshare'd region. Once an mshare'd region is created,
>>> other process(es), assuming they have the right permissions, can make
>>> the mshare() system call to map the shared pages into their address
>>> space using the shared PTEs.  When a process is done using this
>>> mshare'd region, it makes a mshare_unlink() system call to end its
>>> access. When the last process accessing mshare'd region calls
>>> mshare_unlink(), the mshare'd region is torn down and memory used by
>>> it is freed.
>>>
>>>
>>> API
>>> ===
>>>
>>> The mshare API consists of two system calls - mshare() and mshare_unlink()
>>>
>>> --
>>> int mshare(char *name, void *addr, size_t length, int oflags, mode_t mode)
>>>
>>> mshare() creates and opens a new, or opens an existing mshare'd
>>> region that will be shared at PTE level. "name" refers to shared object
>>> name that exists under /sys/fs/mshare. "addr" is the starting address
>>> of this shared memory area and length is the size of this area.
>>> oflags can be one of:
>>>
>>> - O_RDONLY opens shared memory area for read only access by everyone
>>> - O_RDWR opens shared memory area for read and write access
>>> - O_CREAT creates the named shared memory area if it does not exist
>>> - O_EXCL If O_CREAT was also specified, and a shared memory area
>>>    exists with that name, return an error.
>>>
>>> mode represents the creation mode for the shared object under
>>> /sys/fs/mshare.
>>>
>>> mshare() returns an error code if it fails, otherwise it returns 0.
>>>
>>> PTEs are shared at pgdir level and hence it imposes following
>>> requirements on the address and size given to the mshare():
>>>
>>> - Starting address must be aligned to pgdir size (512GB on x86_64).
>>>    This alignment value can be looked up in /proc/sys/vm/mshare_size
>>> - Size must be a multiple of pgdir size
>>> - Any mappings created in this address range at any time become
>>>    shared automatically
>>> - Shared address range can have unmapped addresses in it. Any access
>>>    to unmapped address will result in SIGBUS
>>>
>>> Mappings within this address range behave as if they were shared
>>> between threads, so a write to a MAP_PRIVATE mapping will create a
>>> page which is shared between all the sharers. The first process that
>>> declares an address range mshare'd can continue to map objects in
>>> the shared area. All other processes that want mshare'd access to
>>> this memory area can do so by calling mshare(). After this call, the
>>> address range given by mshare becomes a shared range in its address
>>> space. Anonymous mappings will be shared and not COWed.
>>>
>>> A file under /sys/fs/mshare can be opened and read from. A read from
>>> this file returns two long values - (1) starting address, and (2)
>>> size of the mshare'd region.
>>>
>>> --
>>> int mshare_unlink(char *name)
>>>
>>> A shared address range created by mshare() can be destroyed using
>>> mshare_unlink() which removes the  shared named object. Once all
>>> processes have unmapped the shared object, the shared address range
>>> references are de-allocated and destroyed.
>>>
>>> mshare_unlink() returns 0 on success or -1 on error.
>>>
>>>
>>> Example Code
>>> ============
>>>
>>> Snippet of the code that a donor process would run looks like below:
>>>
>>> -----------------
>>>          addr = mmap((void *)TB(2), GB(512), PROT_READ | PROT_WRITE,
>>>                          MAP_SHARED | MAP_ANONYMOUS, 0, 0);
>>>          if (addr == MAP_FAILED)
>>>                  perror("ERROR: mmap failed");
>>>
>>>          err = syscall(MSHARE_SYSCALL, "testregion", (void *)TB(2),
>>>                          GB(512), O_CREAT|O_RDWR|O_EXCL, 0600);
>>>          if (err < 0) {
>>>                  perror("mshare() syscall failed");
>>>                  exit(1);
>>>          }
>>>
>>>          strncpy(addr, "Some random shared text",
>>>                          sizeof("Some random shared text"));
>>> -----------------
>>>
>>> Snippet of code that a consumer process would execute looks like:
>>>
>>> -----------------
>>>          struct mshare_info minfo;
>>>
>>>          fd = open("testregion", O_RDONLY);
>>>          if (fd < 0) {
>>>                  perror("open failed");
>>>                  exit(1);
>>>          }
>>>
>>>          if ((count = read(fd, &minfo, sizeof(struct mshare_info))) > 0)
>>>                  printf("INFO: %ld bytes shared at addr 0x%lx \n",
>>>                                  minfo.size, minfo.start);
>>>          else
>>>                  perror("read failed");
>>>
>>>          close(fd);
>>>
>>>          addr = (void *)minfo.start;
>>>          err = syscall(MSHARE_SYSCALL, "testregion", addr, minfo.size,
>>>                          O_RDWR, 0600);
>>>          if (err < 0) {
>>>                  perror("mshare() syscall failed");
>>>                  exit(1);
>>>          }
>>>
>>>          printf("Guest mmap at %px:\n", addr);
>>>          printf("%s\n", addr);
>>>          printf("\nDone\n");
>>>
>>>          err = syscall(MSHARE_UNLINK_SYSCALL, "testregion");
>>>          if (err < 0) {
>>>                  perror("mshare_unlink() failed");
>>>                  exit(1);
>>>          }
>>> -----------------
>>
>>
>> Does  that mean those shared pages will get page_mapcount()=1 ?
> 
> AFAIU, for mshare() that is the case.
> 
>>
>> A big pain for a memory limited system like a desktop/embedded system is
>> that reverse mapping will take tons of cpu in memory reclamation path
>> especially for those pages mapped by multiple processes. sometimes,
>> 100% cpu utilization on LRU to scan and find out if a page is accessed
>> by reading PTE young.
> 
> Regarding PTE-table sharing:
> 
> Even if we'd account each logical mapping (independent of page table
> sharing) in the page_mapcount(), we would benefit from page table
> sharing. Simply when we unmap the page from the shared page table, we'd
> have to adjust the mapcount accordingly. So unmapping from a single
> (shared) pagetable could directly result in the mapcount dropping to zero.
> 
> What I am trying to say is: how the mapcount is handled might be an
> implementation detail for PTE-sharing. Not sure how hugetlb handles that
> with its PMD-table sharing.
> 
> We'd have to clarify what the mapcount actually expresses. Having the
> mapcount express "is this page mapped by multiple processes or at
> multiple VMAs" might be helpful in some cases. Right now it mostly
> expresses exactly that.

Right, that is the question - what does mapcount represent? I am interpreting it as: mapcount represents how many PTEs 
map the page. Since mshare uses one PTE for each shared page irrespective of how many processes share the page, a 
mapcount of 1 sounds reasonable to me.
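For anyone who wants to check that interpretation empirically, the per-page map
count is visible from userspace through /proc/kpagecount. A rough sketch follows
(it needs root to read PFNs, and the anonymous mmap below is only a stand-in for
a page inside an mshare'd region):

-----------------
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Read one 64-bit entry at 'index' from a /proc table of u64 values. */
static uint64_t read_u64(const char *path, uint64_t index)
{
        uint64_t val;
        int fd = open(path, O_RDONLY);

        if (fd < 0 || pread(fd, &val, sizeof(val),
                            index * sizeof(val)) != (ssize_t)sizeof(val)) {
                perror(path);
                exit(1);
        }
        close(fd);
        return val;
}

int main(void)
{
        long psz = sysconf(_SC_PAGESIZE);

        /* Stand-in for a page that would live inside an mshare'd region. */
        char *p = mmap(NULL, psz, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                exit(1);
        }
        p[0] = 1;                               /* fault the page in */

        uint64_t pm = read_u64("/proc/self/pagemap", (uintptr_t)p / psz);
        if (!(pm & (1ULL << 63))) {             /* bit 63: page present */
                fprintf(stderr, "page not present\n");
                return 1;
        }
        uint64_t pfn = pm & ((1ULL << 55) - 1); /* bits 0-54: PFN */

        /* /proc/kpagecount reports how many times the page is mapped. */
        printf("map count: %llu\n",
               (unsigned long long)read_u64("/proc/kpagecount", pfn));
        return 0;
}
-----------------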

> 
>>
>> if we result in one PTE only by this patchset, it means we are getting
>> significant
>> performance improvement in kernel LRU particularly when free memory
>> approaches the watermarks.
>>
>> But I don't see how a new system call like mshare(),  can be used
>> by those systems as they might need some more automatic PTEs sharing
>> mechanism.
> 
> IMHO, we should look into automatic PTE-table sharing of MAP_SHARED
> mappings, similar to what hugetlb provides for PMD table sharing, which
> leaves semantics unchanged for existing user space. Maybe there is a way
> to factor that out and reuse it for PTE-table sharing.
> 
> I can understand that there are use cases for explicit sharing with new
> (e.g., mprotect) semantics.

It is tempting to make this sharing automatic, and mshare may evolve to that. Since mshare assumes significant trust 
between the processes sharing pages (shared pages share attributes and possibly protection keys), it sounds dangerous 
to make that assumption automatically without the processes explicitly declaring that level of trust.
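As an illustration of that trust requirement, here is a sketch assuming the
shared-protection semantics described in the cover letter; drop_write_access()
is a made-up helper, not part of the series:

-----------------
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

/*
 * 'addr' and 'len' are assumed to describe an already-attached mshare'd
 * range, e.g. obtained as in the earlier consumer snippet.
 */
int drop_write_access(void *addr, size_t len)
{
        /*
         * Because the PTEs (and hence the protections) are shared, this
         * would presumably make the range read-only for *all* processes
         * attached to the mshare'd region, not just the caller - which
         * is exactly the trust issue discussed above.
         */
        if (mprotect(addr, len, PROT_READ) < 0) {
                perror("mprotect");
                return -1;
        }
        return 0;
}
-----------------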

Thanks,
Khalid