mbox series

[RFC,0/5] madvise MADV_DOEXEC

Message ID 1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com (mailing list archive)
Headers show
Series madvise MADV_DOEXEC | expand

Message

Anthony Yznaga July 27, 2020, 5:11 p.m. UTC
This patchset adds support for preserving an anonymous memory range across
exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
sharing memory in this manner, as opposed to re-attaching to a named shared
memory segment, is to ensure it is mapped at the same virtual address in
the new process as it was in the old one.  An intended use for this is to
preserve guest memory for guests using vfio while qemu exec's an updated
version of itself.  By ensuring the memory is preserved at a fixed address,
vfio mappings and their associated kernel data structures can remain valid.
In addition, for the qemu use case, qemu instances that back guest RAM with
anonymous memory can be updated.

Patches 1 and 2 ensure that loading of ELF load segments does not silently
clobber existing VMAS, and remove assumptions that the stack is the only
VMA in the mm when the stack is set up.  Patch 1 re-introduces the use of
MAP_FIXED_NOREPLACE to load ELF binaries that addresses the previous issues
and could be considered on its own.

Patches 3, 4, and 5 introduce the feature and an opt-in method for its use
using an ELF note.

Anthony Yznaga (5):
  elf: reintroduce using MAP_FIXED_NOREPLACE for elf executable mappings
  mm: do not assume only the stack vma exists in setup_arg_pages()
  mm: introduce VM_EXEC_KEEP
  exec, elf: require opt-in for accepting preserved mem
  mm: introduce MADV_DOEXEC

 arch/x86/Kconfig                       |   1 +
 fs/binfmt_elf.c                        | 196 +++++++++++++++++++++++++--------
 fs/exec.c                              |  33 +++++-
 include/linux/binfmts.h                |   7 +-
 include/linux/mm.h                     |   5 +
 include/uapi/asm-generic/mman-common.h |   3 +
 kernel/fork.c                          |   2 +-
 mm/madvise.c                           |  25 +++++
 mm/mmap.c                              |  47 ++++++++
 9 files changed, 266 insertions(+), 53 deletions(-)

Comments

Eric W. Biederman July 27, 2020, 5:07 p.m. UTC | #1
Anthony Yznaga <anthony.yznaga@oracle.com> writes:

> This patchset adds support for preserving an anonymous memory range across
> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
> sharing memory in this manner, as opposed to re-attaching to a named shared
> memory segment, is to ensure it is mapped at the same virtual address in
> the new process as it was in the old one.  An intended use for this is to
> preserve guest memory for guests using vfio while qemu exec's an updated
> version of itself.  By ensuring the memory is preserved at a fixed address,
> vfio mappings and their associated kernel data structures can remain valid.
> In addition, for the qemu use case, qemu instances that back guest RAM with
> anonymous memory can be updated.

What is wrong with using a file descriptor to a possibly deleted file
and re-mmaping it?

There is already MAP_FIXED that allows you to ensure you have the same
address.

I think all it would take would be a small protocol from one version
to the next to say map file descriptor #N and address #A.  Something
easily passed on the command line.

There is just enough complexity in exec currently that our
implementation of exec is already teetering.  So if we could use
existing mechanisms it would be good.

And I don't see why this version of sharing a mmap area would be
particularly general.  I do imagine that being able to force a
mmap area into a setuid executable would be a fun attack vector.

Perhaps I missed this in my skim of these patches but I did not see
anything that guarded this feature against an exec that changes an
applications privileges.

Eric


> Patches 1 and 2 ensure that loading of ELF load segments does not silently
> clobber existing VMAS, and remove assumptions that the stack is the only
> VMA in the mm when the stack is set up.  Patch 1 re-introduces the use of
> MAP_FIXED_NOREPLACE to load ELF binaries that addresses the previous issues
> and could be considered on its own.
>
> Patches 3, 4, and 5 introduce the feature and an opt-in method for its use
> using an ELF note.
>
> Anthony Yznaga (5):
>   elf: reintroduce using MAP_FIXED_NOREPLACE for elf executable mappings
>   mm: do not assume only the stack vma exists in setup_arg_pages()
>   mm: introduce VM_EXEC_KEEP
>   exec, elf: require opt-in for accepting preserved mem
>   mm: introduce MADV_DOEXEC
>
>  arch/x86/Kconfig                       |   1 +
>  fs/binfmt_elf.c                        | 196 +++++++++++++++++++++++++--------
>  fs/exec.c                              |  33 +++++-
>  include/linux/binfmts.h                |   7 +-
>  include/linux/mm.h                     |   5 +
>  include/uapi/asm-generic/mman-common.h |   3 +
>  kernel/fork.c                          |   2 +-
>  mm/madvise.c                           |  25 +++++
>  mm/mmap.c                              |  47 ++++++++
>  9 files changed, 266 insertions(+), 53 deletions(-)
Steven Sistare July 27, 2020, 6 p.m. UTC | #2
On 7/27/2020 1:07 PM, ebiederm@xmission.com wrote:
> Anthony Yznaga <anthony.yznaga@oracle.com> writes:
> 
>> This patchset adds support for preserving an anonymous memory range across
>> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
>> sharing memory in this manner, as opposed to re-attaching to a named shared
>> memory segment, is to ensure it is mapped at the same virtual address in
>> the new process as it was in the old one.  An intended use for this is to
>> preserve guest memory for guests using vfio while qemu exec's an updated
>> version of itself.  By ensuring the memory is preserved at a fixed address,
>> vfio mappings and their associated kernel data structures can remain valid.
>> In addition, for the qemu use case, qemu instances that back guest RAM with
>> anonymous memory can be updated.
> 
> What is wrong with using a file descriptor to a possibly deleted file
> and re-mmaping it?
> 
> There is already MAP_FIXED that allows you to ensure you have the same
> address.

MAP_FIXED blows away any existing mapping in that range, which is not the 
desired behavior.  We want to preserve the previously created mapping at
the same VA and co-exist with other mappings created by the new process.
There is no way to guarantee availability of a VA post-exec.

> I think all it would take would be a small protocol from one version
> to the next to say map file descriptor #N and address #A.  Something
> easily passed on the command line.
> 
> There is just enough complexity in exec currently that our
> implementation of exec is already teetering.  So if we could use
> existing mechanisms it would be good.
> 
> And I don't see why this version of sharing a mmap area would be
> particularly general.  I do imagine that being able to force a
> mmap area into a setuid executable would be a fun attack vector.

Any mmap(MAP_ANON) segment can be preserved.  That is very general, and is 
the case we need to support to upgrade legacy applications that are already
running (such as qemu) -- we cannot recode them before they are updated.

> Perhaps I missed this in my skim of these patches but I did not see
> anything that guarded this feature against an exec that changes an
> applications privileges.

The ELF opt-in flag must be set, so only selected executables will accept
incoming mappings.  The exec() code verifies that the preserved mappings 
do not overlap or become adjacent to text or stack, so it is not possible for
example for an attacker to cause stack underflow or overflow to access injected
content.  A gadget invoked by some other attack could access the preserved content.

- Steve

>> Patches 1 and 2 ensure that loading of ELF load segments does not silently
>> clobber existing VMAS, and remove assumptions that the stack is the only
>> VMA in the mm when the stack is set up.  Patch 1 re-introduces the use of
>> MAP_FIXED_NOREPLACE to load ELF binaries that addresses the previous issues
>> and could be considered on its own.
>>
>> Patches 3, 4, and 5 introduce the feature and an opt-in method for its use
>> using an ELF note.
>>
>> Anthony Yznaga (5):
>>   elf: reintroduce using MAP_FIXED_NOREPLACE for elf executable mappings
>>   mm: do not assume only the stack vma exists in setup_arg_pages()
>>   mm: introduce VM_EXEC_KEEP
>>   exec, elf: require opt-in for accepting preserved mem
>>   mm: introduce MADV_DOEXEC
>>
>>  arch/x86/Kconfig                       |   1 +
>>  fs/binfmt_elf.c                        | 196 +++++++++++++++++++++++++--------
>>  fs/exec.c                              |  33 +++++-
>>  include/linux/binfmts.h                |   7 +-
>>  include/linux/mm.h                     |   5 +
>>  include/uapi/asm-generic/mman-common.h |   3 +
>>  kernel/fork.c                          |   2 +-
>>  mm/madvise.c                           |  25 +++++
>>  mm/mmap.c                              |  47 ++++++++
>>  9 files changed, 266 insertions(+), 53 deletions(-)
Kirill Tkhai July 28, 2020, 11:34 a.m. UTC | #3
On 27.07.2020 20:11, Anthony Yznaga wrote:
> This patchset adds support for preserving an anonymous memory range across
> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
> sharing memory in this manner, as opposed to re-attaching to a named shared
> memory segment, is to ensure it is mapped at the same virtual address in
> the new process as it was in the old one.  An intended use for this is to
> preserve guest memory for guests using vfio while qemu exec's an updated
> version of itself.  By ensuring the memory is preserved at a fixed address,

So, the goal is an update of QEMU binary without a stopping of virtual machine?

> vfio mappings and their associated kernel data structures can remain valid.
> In addition, for the qemu use case, qemu instances that back guest RAM with
> anonymous memory can be updated.
> 
> Patches 1 and 2 ensure that loading of ELF load segments does not silently
> clobber existing VMAS, and remove assumptions that the stack is the only
> VMA in the mm when the stack is set up.  Patch 1 re-introduces the use of
> MAP_FIXED_NOREPLACE to load ELF binaries that addresses the previous issues
> and could be considered on its own.
> 
> Patches 3, 4, and 5 introduce the feature and an opt-in method for its use
> using an ELF note.
> 
> Anthony Yznaga (5):
>   elf: reintroduce using MAP_FIXED_NOREPLACE for elf executable mappings
>   mm: do not assume only the stack vma exists in setup_arg_pages()
>   mm: introduce VM_EXEC_KEEP
>   exec, elf: require opt-in for accepting preserved mem
>   mm: introduce MADV_DOEXEC
> 
>  arch/x86/Kconfig                       |   1 +
>  fs/binfmt_elf.c                        | 196 +++++++++++++++++++++++++--------
>  fs/exec.c                              |  33 +++++-
>  include/linux/binfmts.h                |   7 +-
>  include/linux/mm.h                     |   5 +
>  include/uapi/asm-generic/mman-common.h |   3 +
>  kernel/fork.c                          |   2 +-
>  mm/madvise.c                           |  25 +++++
>  mm/mmap.c                              |  47 ++++++++
>  9 files changed, 266 insertions(+), 53 deletions(-)
>
Christian Brauner July 28, 2020, 1:40 p.m. UTC | #4
On Mon, Jul 27, 2020 at 02:00:17PM -0400, Steven Sistare wrote:
> On 7/27/2020 1:07 PM, ebiederm@xmission.com wrote:
> > Anthony Yznaga <anthony.yznaga@oracle.com> writes:
> > 
> >> This patchset adds support for preserving an anonymous memory range across
> >> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
> >> sharing memory in this manner, as opposed to re-attaching to a named shared
> >> memory segment, is to ensure it is mapped at the same virtual address in
> >> the new process as it was in the old one.  An intended use for this is to
> >> preserve guest memory for guests using vfio while qemu exec's an updated
> >> version of itself.  By ensuring the memory is preserved at a fixed address,
> >> vfio mappings and their associated kernel data structures can remain valid.
> >> In addition, for the qemu use case, qemu instances that back guest RAM with
> >> anonymous memory can be updated.
> > 
> > What is wrong with using a file descriptor to a possibly deleted file
> > and re-mmaping it?
> > 
> > There is already MAP_FIXED that allows you to ensure you have the same
> > address.
> 
> MAP_FIXED blows away any existing mapping in that range, which is not the 
> desired behavior.  We want to preserve the previously created mapping at

There's also MAP_FIXED_NOREPLACE since v4.17 in case that helps.

Note that this should really go to linux-api too. I won't argue to
resend it since that would mean spamming everyone's inbox with the same
thread again but in case you send a revised version, please ensure to Cc
linux-api. The glibc folks are listening on there too.

Thanks!
Christian
Andy Lutomirski July 28, 2020, 2:23 p.m. UTC | #5
> On Jul 27, 2020, at 10:02 AM, Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
> 
> This patchset adds support for preserving an anonymous memory range across
> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
> sharing memory in this manner, as opposed to re-attaching to a named shared
> memory segment, is to ensure it is mapped at the same virtual address in
> the new process as it was in the old one.  An intended use for this is to
> preserve guest memory for guests using vfio while qemu exec's an updated
> version of itself.  By ensuring the memory is preserved at a fixed address,
> vfio mappings and their associated kernel data structures can remain valid.
> In addition, for the qemu use case, qemu instances that back guest RAM with
> anonymous memory can be updated.

This will be an amazing attack surface. Perhaps use of this flag should require no_new_privs?  Arguably it should also require a special flag to execve() to honor it.  Otherwise library helpers that do vfork()+exec() or posix_spawn() could be quite surprised.
Steven Sistare July 28, 2020, 2:30 p.m. UTC | #6
On 7/28/2020 10:23 AM, Andy Lutomirski wrote:
>> On Jul 27, 2020, at 10:02 AM, Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
>>
>> This patchset adds support for preserving an anonymous memory range across
>> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
>> sharing memory in this manner, as opposed to re-attaching to a named shared
>> memory segment, is to ensure it is mapped at the same virtual address in
>> the new process as it was in the old one.  An intended use for this is to
>> preserve guest memory for guests using vfio while qemu exec's an updated
>> version of itself.  By ensuring the memory is preserved at a fixed address,
>> vfio mappings and their associated kernel data structures can remain valid.
>> In addition, for the qemu use case, qemu instances that back guest RAM with
>> anonymous memory can be updated.
> 
> This will be an amazing attack surface. Perhaps use of this flag should require no_new_privs?  Arguably it should also require a special flag to execve() to honor it.  Otherwise library helpers that do vfork()+exec() or posix_spawn() could be quite surprised.

Preservation is disabled across fork, so fork/exec combo's are not affected.  We forgot to document that.

- Steve
Anthony Yznaga July 28, 2020, 5:28 p.m. UTC | #7
On 7/28/20 4:34 AM, Kirill Tkhai wrote:
> On 27.07.2020 20:11, Anthony Yznaga wrote:
>> This patchset adds support for preserving an anonymous memory range across
>> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
>> sharing memory in this manner, as opposed to re-attaching to a named shared
>> memory segment, is to ensure it is mapped at the same virtual address in
>> the new process as it was in the old one.  An intended use for this is to
>> preserve guest memory for guests using vfio while qemu exec's an updated
>> version of itself.  By ensuring the memory is preserved at a fixed address,
> So, the goal is an update of QEMU binary without a stopping of virtual machine?
Essentially, yes.  The VM is paused very briefly.

Anthony
>
>> vfio mappings and their associated kernel data structures can remain valid.
>> In addition, for the qemu use case, qemu instances that back guest RAM with
>> anonymous memory can be updated.
>>
>> Patches 1 and 2 ensure that loading of ELF load segments does not silently
>> clobber existing VMAS, and remove assumptions that the stack is the only
>> VMA in the mm when the stack is set up.  Patch 1 re-introduces the use of
>> MAP_FIXED_NOREPLACE to load ELF binaries that addresses the previous issues
>> and could be considered on its own.
>>
>> Patches 3, 4, and 5 introduce the feature and an opt-in method for its use
>> using an ELF note.
>>
>> Anthony Yznaga (5):
>>   elf: reintroduce using MAP_FIXED_NOREPLACE for elf executable mappings
>>   mm: do not assume only the stack vma exists in setup_arg_pages()
>>   mm: introduce VM_EXEC_KEEP
>>   exec, elf: require opt-in for accepting preserved mem
>>   mm: introduce MADV_DOEXEC
>>
>>  arch/x86/Kconfig                       |   1 +
>>  fs/binfmt_elf.c                        | 196 +++++++++++++++++++++++++--------
>>  fs/exec.c                              |  33 +++++-
>>  include/linux/binfmts.h                |   7 +-
>>  include/linux/mm.h                     |   5 +
>>  include/uapi/asm-generic/mman-common.h |   3 +
>>  kernel/fork.c                          |   2 +-
>>  mm/madvise.c                           |  25 +++++
>>  mm/mmap.c                              |  47 ++++++++
>>  9 files changed, 266 insertions(+), 53 deletions(-)
>>
Matthew Wilcox (Oracle) July 30, 2020, 3:22 p.m. UTC | #8
On Mon, Jul 27, 2020 at 10:11:22AM -0700, Anthony Yznaga wrote:
> This patchset adds support for preserving an anonymous memory range across
> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
> sharing memory in this manner, as opposed to re-attaching to a named shared
> memory segment, is to ensure it is mapped at the same virtual address in
> the new process as it was in the old one.  An intended use for this is to
> preserve guest memory for guests using vfio while qemu exec's an updated
> version of itself.  By ensuring the memory is preserved at a fixed address,
> vfio mappings and their associated kernel data structures can remain valid.
> In addition, for the qemu use case, qemu instances that back guest RAM with
> anonymous memory can be updated.

I just realised that something else I'm working on might be a suitable
alternative to this.  Apologies for not realising it sooner.

http://www.wil.cx/~willy/linux/sileby.html

To use this, you'd mshare() the anonymous memory range, essentially
detaching the VMA from the current process's mm_struct and reparenting
it to this new mm_struct, which has an fd referencing it.

Then you call exec(), and the exec'ed task gets to call mmap() on that
new fd to attach the memory range to its own address space.

Presto!
Christian Brauner July 30, 2020, 3:27 p.m. UTC | #9
On Thu, Jul 30, 2020 at 04:22:50PM +0100, Matthew Wilcox wrote:
> On Mon, Jul 27, 2020 at 10:11:22AM -0700, Anthony Yznaga wrote:
> > This patchset adds support for preserving an anonymous memory range across
> > exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
> > sharing memory in this manner, as opposed to re-attaching to a named shared
> > memory segment, is to ensure it is mapped at the same virtual address in
> > the new process as it was in the old one.  An intended use for this is to
> > preserve guest memory for guests using vfio while qemu exec's an updated
> > version of itself.  By ensuring the memory is preserved at a fixed address,
> > vfio mappings and their associated kernel data structures can remain valid.
> > In addition, for the qemu use case, qemu instances that back guest RAM with
> > anonymous memory can be updated.
> 
> I just realised that something else I'm working on might be a suitable
> alternative to this.  Apologies for not realising it sooner.
> 
> http://www.wil.cx/~willy/linux/sileby.html

Just skimming: make it O_CLOEXEC by default. ;)

> 
> To use this, you'd mshare() the anonymous memory range, essentially
> detaching the VMA from the current process's mm_struct and reparenting
> it to this new mm_struct, which has an fd referencing it.
> 
> Then you call exec(), and the exec'ed task gets to call mmap() on that
> new fd to attach the memory range to its own address space.
> 
> Presto!
Matthew Wilcox (Oracle) July 30, 2020, 3:34 p.m. UTC | #10
On Thu, Jul 30, 2020 at 05:27:05PM +0200, Christian Brauner wrote:
> On Thu, Jul 30, 2020 at 04:22:50PM +0100, Matthew Wilcox wrote:
> > On Mon, Jul 27, 2020 at 10:11:22AM -0700, Anthony Yznaga wrote:
> > > This patchset adds support for preserving an anonymous memory range across
> > > exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
> > > sharing memory in this manner, as opposed to re-attaching to a named shared
> > > memory segment, is to ensure it is mapped at the same virtual address in
> > > the new process as it was in the old one.  An intended use for this is to
> > > preserve guest memory for guests using vfio while qemu exec's an updated
> > > version of itself.  By ensuring the memory is preserved at a fixed address,
> > > vfio mappings and their associated kernel data structures can remain valid.
> > > In addition, for the qemu use case, qemu instances that back guest RAM with
> > > anonymous memory can be updated.
> > 
> > I just realised that something else I'm working on might be a suitable
> > alternative to this.  Apologies for not realising it sooner.
> > 
> > http://www.wil.cx/~willy/linux/sileby.html
> 
> Just skimming: make it O_CLOEXEC by default. ;)

I appreciate the suggestion, and it makes sense for many 'return an fd'
interfaces, but the point of mshare() is to, well, share.  So sharing
the fd with a child is a common usecase, unlike say sharing a timerfd.
The only other reason to use mshare() is to pass the fd over a unix
socket to a non-child, and I submit that is far less common than wanting
to share with a child.
Christian Brauner July 30, 2020, 3:54 p.m. UTC | #11
On Thu, Jul 30, 2020 at 04:34:50PM +0100, Matthew Wilcox wrote:
> On Thu, Jul 30, 2020 at 05:27:05PM +0200, Christian Brauner wrote:
> > On Thu, Jul 30, 2020 at 04:22:50PM +0100, Matthew Wilcox wrote:
> > > On Mon, Jul 27, 2020 at 10:11:22AM -0700, Anthony Yznaga wrote:
> > > > This patchset adds support for preserving an anonymous memory range across
> > > > exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
> > > > sharing memory in this manner, as opposed to re-attaching to a named shared
> > > > memory segment, is to ensure it is mapped at the same virtual address in
> > > > the new process as it was in the old one.  An intended use for this is to
> > > > preserve guest memory for guests using vfio while qemu exec's an updated
> > > > version of itself.  By ensuring the memory is preserved at a fixed address,
> > > > vfio mappings and their associated kernel data structures can remain valid.
> > > > In addition, for the qemu use case, qemu instances that back guest RAM with
> > > > anonymous memory can be updated.
> > > 
> > > I just realised that something else I'm working on might be a suitable
> > > alternative to this.  Apologies for not realising it sooner.
> > > 
> > > http://www.wil.cx/~willy/linux/sileby.html
> > 
> > Just skimming: make it O_CLOEXEC by default. ;)
> 
> I appreciate the suggestion, and it makes sense for many 'return an fd'
> interfaces, but the point of mshare() is to, well, share.  So sharing
> the fd with a child is a common usecase, unlike say sharing a timerfd.

Fair point, from reading I thought the main reason was share after
fork() but having an fd over exec() may be a good use-case too.
(Fwiw, this very much looks like what the memfd_create() api should have
looked like, i.e. mshare() could've possibly encompassed both.)

> The only other reason to use mshare() is to pass the fd over a unix
> socket to a non-child, and I submit that is far less common than wanting
> to share with a child.

Well, we have use-cases for that too. E.g. where we need to attach to an
existing pid namespace which requires a first fork() of an intermediate
process so that the caller doesn't change the pid_for_children
namespace. Then we setns() to the target pid namespace in the
intermediate process which changes the pid_ns_children namespace such
that the next process will be a proper member of the target pid
namespace. Finally, we fork() the target process that is now a full
member of the target pid namespace. We also set CLONE_PARENT so
grandfather process == father process for the second process and then
have the intermediate process exit. Some fds we only ever open or create
after the intermediate process exited and some fds we can only open or
create after the intermediate process has been created and then send the
fds via scm (since we can't share the fdtable) to the final process.

Christian
Steven Sistare July 30, 2020, 3:59 p.m. UTC | #12
On 7/30/2020 11:22 AM, Matthew Wilcox wrote:
> On Mon, Jul 27, 2020 at 10:11:22AM -0700, Anthony Yznaga wrote:
>> This patchset adds support for preserving an anonymous memory range across
>> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
>> sharing memory in this manner, as opposed to re-attaching to a named shared
>> memory segment, is to ensure it is mapped at the same virtual address in
>> the new process as it was in the old one.  An intended use for this is to
>> preserve guest memory for guests using vfio while qemu exec's an updated
>> version of itself.  By ensuring the memory is preserved at a fixed address,
>> vfio mappings and their associated kernel data structures can remain valid.
>> In addition, for the qemu use case, qemu instances that back guest RAM with
>> anonymous memory can be updated.
> 
> I just realised that something else I'm working on might be a suitable
> alternative to this.  Apologies for not realising it sooner.
> 
> http://www.wil.cx/~willy/linux/sileby.html
> 
> To use this, you'd mshare() the anonymous memory range, essentially
> detaching the VMA from the current process's mm_struct and reparenting
> it to this new mm_struct, which has an fd referencing it.
> 
> Then you call exec(), and the exec'ed task gets to call mmap() on that
> new fd to attach the memory range to its own address space.
> 
> Presto!

To be suitable for the qemu use case, we need a guarantee that the same VA range
is available in the new process, with nothing else mapped there.  From your spec,
it sounds like the new process could do a series of unrelated mmap's which could
overlap the desired va range before the silby mmap(fd) is performed??

Also, we need to support updating legacy processes that already created anon segments.
We inject code that calls MADV_DOEXEC for such segments.

- Steve
Matthew Wilcox (Oracle) July 30, 2020, 5:12 p.m. UTC | #13
On Thu, Jul 30, 2020 at 11:59:42AM -0400, Steven Sistare wrote:
> On 7/30/2020 11:22 AM, Matthew Wilcox wrote:
> > On Mon, Jul 27, 2020 at 10:11:22AM -0700, Anthony Yznaga wrote:
> >> This patchset adds support for preserving an anonymous memory range across
> >> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
> >> sharing memory in this manner, as opposed to re-attaching to a named shared
> >> memory segment, is to ensure it is mapped at the same virtual address in
> >> the new process as it was in the old one.  An intended use for this is to
> >> preserve guest memory for guests using vfio while qemu exec's an updated
> >> version of itself.  By ensuring the memory is preserved at a fixed address,
> >> vfio mappings and their associated kernel data structures can remain valid.
> >> In addition, for the qemu use case, qemu instances that back guest RAM with
> >> anonymous memory can be updated.
> > 
> > I just realised that something else I'm working on might be a suitable
> > alternative to this.  Apologies for not realising it sooner.
> > 
> > http://www.wil.cx/~willy/linux/sileby.html
> > 
> > To use this, you'd mshare() the anonymous memory range, essentially
> > detaching the VMA from the current process's mm_struct and reparenting
> > it to this new mm_struct, which has an fd referencing it.
> > 
> > Then you call exec(), and the exec'ed task gets to call mmap() on that
> > new fd to attach the memory range to its own address space.
> > 
> > Presto!
> 
> To be suitable for the qemu use case, we need a guarantee that the same VA range
> is available in the new process, with nothing else mapped there.  From your spec,
> it sounds like the new process could do a series of unrelated mmap's which could
> overlap the desired va range before the silby mmap(fd) is performed??

That could happen.  eg libc might get its text segment mapped there
randomly.  I believe Khalid was working on a solution for reserving
memory ranges.

> Also, we need to support updating legacy processes that already created anon segments.
> We inject code that calls MADV_DOEXEC for such segments.

Yes, I was assuming you'd inject code that called mshare().

Actually, since you're injecting code, why do you need the kernel to
be involved?  You can mmap the new executable and any libraries it depends
upon, set up a new stack and jump to the main() entry point, all without
calling exec().  I appreciate it'd be a fair amount of code, but it'd all
be in userspace and you can probably steal / reuse code from ld.so (I'm
not familiar with the details of how setting up an executable is done).
Steven Sistare July 30, 2020, 5:35 p.m. UTC | #14
On 7/30/2020 1:12 PM, Matthew Wilcox wrote:
> On Thu, Jul 30, 2020 at 11:59:42AM -0400, Steven Sistare wrote:
>> On 7/30/2020 11:22 AM, Matthew Wilcox wrote:
>>> On Mon, Jul 27, 2020 at 10:11:22AM -0700, Anthony Yznaga wrote:
>>>> This patchset adds support for preserving an anonymous memory range across
>>>> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
>>>> sharing memory in this manner, as opposed to re-attaching to a named shared
>>>> memory segment, is to ensure it is mapped at the same virtual address in
>>>> the new process as it was in the old one.  An intended use for this is to
>>>> preserve guest memory for guests using vfio while qemu exec's an updated
>>>> version of itself.  By ensuring the memory is preserved at a fixed address,
>>>> vfio mappings and their associated kernel data structures can remain valid.
>>>> In addition, for the qemu use case, qemu instances that back guest RAM with
>>>> anonymous memory can be updated.
>>>
>>> I just realised that something else I'm working on might be a suitable
>>> alternative to this.  Apologies for not realising it sooner.
>>>
>>> http://www.wil.cx/~willy/linux/sileby.html
>>>
>>> To use this, you'd mshare() the anonymous memory range, essentially
>>> detaching the VMA from the current process's mm_struct and reparenting
>>> it to this new mm_struct, which has an fd referencing it.
>>>
>>> Then you call exec(), and the exec'ed task gets to call mmap() on that
>>> new fd to attach the memory range to its own address space.
>>>
>>> Presto!
>>
>> To be suitable for the qemu use case, we need a guarantee that the same VA range
>> is available in the new process, with nothing else mapped there.  From your spec,
>> it sounds like the new process could do a series of unrelated mmap's which could
>> overlap the desired va range before the silby mmap(fd) is performed??
> 
> That could happen.  eg libc might get its text segment mapped there
> randomly.  I believe Khalid was working on a solution for reserving
> memory ranges.

mshare + VA reservation is another possible solution.

Or MADV_DOEXEC alone, which is ready now.  I hope we can get back to reviewing that.

>> Also, we need to support updating legacy processes that already created anon segments.
>> We inject code that calls MADV_DOEXEC for such segments.
> 
> Yes, I was assuming you'd inject code that called mshare().

OK, mshare works on existing memory and builds a new vma.

> Actually, since you're injecting code, why do you need the kernel to
> be involved?  You can mmap the new executable and any libraries it depends
> upon, set up a new stack and jump to the main() entry point, all without
> calling exec().  I appreciate it'd be a fair amount of code, but it'd all
> be in userspace and you can probably steal / reuse code from ld.so (I'm
> not familiar with the details of how setting up an executable is done).

Duplicating all the work that the kernel and loader do to exec a process would
be error prone, require ongoing maintenance, and be redundant.  Better to define 
a small kernel extension and leave exec to the kernel.

- Steve
Matthew Wilcox (Oracle) July 30, 2020, 5:49 p.m. UTC | #15
On Thu, Jul 30, 2020 at 01:35:51PM -0400, Steven Sistare wrote:
> mshare + VA reservation is another possible solution.
> 
> Or MADV_DOEXEC alone, which is ready now.  I hope we can get back to reviewing that.

We are.  This is the part of the review process where we explore other
solutions to the problem.

> >> Also, we need to support updating legacy processes that already created anon segments.
> >> We inject code that calls MADV_DOEXEC for such segments.
> > 
> > Yes, I was assuming you'd inject code that called mshare().
> 
> OK, mshare works on existing memory and builds a new vma.

Actually, reparents an existing VMA, and reuses the existing page tables.

> > Actually, since you're injecting code, why do you need the kernel to
> > be involved?  You can mmap the new executable and any libraries it depends
> > upon, set up a new stack and jump to the main() entry point, all without
> > calling exec().  I appreciate it'd be a fair amount of code, but it'd all
> > be in userspace and you can probably steal / reuse code from ld.so (I'm
> > not familiar with the details of how setting up an executable is done).
> 
> Duplicating all the work that the kernel and loader do to exec a process would
> be error prone, require ongoing maintenance, and be redundant.  Better to define 
> a small kernel extension and leave exec to the kernel.

Either this is a one-off kind of thing, in which case it doesn't need
ongoing maintenance, or it's something with broad applicability, in
which case it can live as its own userspace project.  It could even
start off life as part of qemu and then fork into its own project.
The idea of tagging an ELF executable to say "I can cope with having
chunks of my address space provided to me by my executor" is ... odd.
Steven Sistare July 30, 2020, 6:27 p.m. UTC | #16
On 7/30/2020 1:49 PM, Matthew Wilcox wrote:
> On Thu, Jul 30, 2020 at 01:35:51PM -0400, Steven Sistare wrote:
>> mshare + VA reservation is another possible solution.
>>
>> Or MADV_DOEXEC alone, which is ready now.  I hope we can get back to reviewing that.
> 
> We are.  This is the part of the review process where we explore other
> solutions to the problem.
> 
>>>> Also, we need to support updating legacy processes that already created anon segments.
>>>> We inject code that calls MADV_DOEXEC for such segments.
>>>
>>> Yes, I was assuming you'd inject code that called mshare().
>>
>> OK, mshare works on existing memory and builds a new vma.
> 
> Actually, reparents an existing VMA, and reuses the existing page tables.
> 
>>> Actually, since you're injecting code, why do you need the kernel to
>>> be involved?  You can mmap the new executable and any libraries it depends
>>> upon, set up a new stack and jump to the main() entry point, all without
>>> calling exec().  I appreciate it'd be a fair amount of code, but it'd all
>>> be in userspace and you can probably steal / reuse code from ld.so (I'm
>>> not familiar with the details of how setting up an executable is done).
>>
>> Duplicating all the work that the kernel and loader do to exec a process would
>> be error prone, require ongoing maintenance, and be redundant.  Better to define 
>> a small kernel extension and leave exec to the kernel.
> 
> Either this is a one-off kind of thing, in which case it doesn't need
> ongoing maintenance, or it's something with broad applicability, in
> which case it can live as its own userspace project.  It could even
> start off life as part of qemu and then fork into its own project.

exec will be enhanced over time in the kernel.  A separate user space implementation
would need to track that.

Reimplementing exec in userland would be a big gross mess.  Not a good solution when
we have simple and concise ways of solving the problem.

> The idea of tagging an ELF executable to say "I can cope with having
> chunks of my address space provided to me by my executor" is ... odd.

I don't disagree.  But it is useful.  We already pass a block of data containing
environment variables and arguments from one process to the next.  Preserving 
additional segments is not a big leap from there.

- Steve
Eric W. Biederman July 30, 2020, 9:58 p.m. UTC | #17
Steven Sistare <steven.sistare@oracle.com> writes:

> On 7/30/2020 1:49 PM, Matthew Wilcox wrote:
>> On Thu, Jul 30, 2020 at 01:35:51PM -0400, Steven Sistare wrote:
>>> mshare + VA reservation is another possible solution.
>>>
>>> Or MADV_DOEXEC alone, which is ready now.  I hope we can get back to reviewing that.
>> 
>> We are.  This is the part of the review process where we explore other
>> solutions to the problem.
>> 
>>>>> Also, we need to support updating legacy processes that already created anon segments.
>>>>> We inject code that calls MADV_DOEXEC for such segments.
>>>>
>>>> Yes, I was assuming you'd inject code that called mshare().
>>>
>>> OK, mshare works on existing memory and builds a new vma.
>> 
>> Actually, reparents an existing VMA, and reuses the existing page tables.
>> 
>>>> Actually, since you're injecting code, why do you need the kernel to
>>>> be involved?  You can mmap the new executable and any libraries it depends
>>>> upon, set up a new stack and jump to the main() entry point, all without
>>>> calling exec().  I appreciate it'd be a fair amount of code, but it'd all
>>>> be in userspace and you can probably steal / reuse code from ld.so (I'm
>>>> not familiar with the details of how setting up an executable is done).
>>>
>>> Duplicating all the work that the kernel and loader do to exec a process would
>>> be error prone, require ongoing maintenance, and be redundant.  Better to define 
>>> a small kernel extension and leave exec to the kernel.
>> 
>> Either this is a one-off kind of thing, in which case it doesn't need
>> ongoing maintenance, or it's something with broad applicability, in
>> which case it can live as its own userspace project.  It could even
>> start off life as part of qemu and then fork into its own project.
>
> exec will be enhanced over time in the kernel.  A separate user space implementation
> would need to track that.
>
> Reimplementing exec in userland would be a big gross mess.  Not a good solution when
> we have simple and concise ways of solving the problem.

The interesting work is not done in exec.  The interesting work of
loading an executable is done in ld.so, and ld.so lives in userspace.

>> The idea of tagging an ELF executable to say "I can cope with having
>> chunks of my address space provided to me by my executor" is ... odd.
>
> I don't disagree.  But it is useful.  We already pass a block of data containing
> environment variables and arguments from one process to the next.  Preserving 
> additional segments is not a big leap from there.

But we randomize where that block lives.

It has been found to be very useful to have randomization by default and
you are arguing against that kind of randomization.



Here is another suggestion.

Have a very simple program that does:

	for (;;) {
		handle = dlopen("/my/real/program");
		real_main = dlsym(handle, "main");
		real_main(argc, argv, envp);
		dlclose(handle);
	}

With whatever obvious adjustments are needed to fit your usecase.

That should give the same level of functionality, be portable to all
unices, and not require you to duplicate code.  It belive it limits you
to not upgrading libc, or librt but that is a comparatively small
limitation.


Given that in general the interesting work is done in userspace and that
userspace has provided an interface for reusing that work already.
I don't see the justification for adding anything to exec at this point.


Eric
Stefan Hajnoczi July 31, 2020, 9:12 a.m. UTC | #18
Hi,
Sileby looks interesting! I had just written up the following idea which
seems similar but includes a mechanism for revoking mappings.

Alexander Graf recently brought up an idea that solves the following
problem:

When process A passes shared memory file descriptors to process B there
is no way for process A to revoke access or change page protection bits
after passing the fd.

I'll describe the idea (not sure if it's exactly what Alexander had in
mind).

Memory view driver
------------------
The memory view driver allows process A to control the page table
entries of an mmap in process B. It is a character device driver that
process A opens like this:

  int fd = open("/dev/memory-view", O_RDWR);

This returns a file descriptor to a new memory view.

Next process A sets the size of the memory view:

  ftruncate(fd, 16 * GiB);

The size determines how large the memory view will be. The size is a
virtual memory concept and does not consume resources (there is no
physical memory backing this).

Process A populates the memory view with ranges from file descriptors it
wishes to share. The file descriptor can be a shared memory file
descriptor:

  int memfd = memfd_create("guest-ram, 0);
  ftruncate(memfd, 32 * GiB);

  /* Map [8GB, 10GB) at 8GB into the memory view */
  struct memview_map_fd_info = {
      .fd = memfd,
      .fd_offset = 8 * GiB,
      .size = 2 * GiB,
      .mem_offset = 8 * GiB,
      .flags = MEMVIEW_MAP_READ | MEMVIEW_MAP_WRITE,
  };
  ioctl(fd, MEMVIEW_MAP_FD, &map_fd_info);

It is also possible to populate the memory view from the page cache:

  int filefd = open("big-file.iso", O_RDONLY);

  /* Map [4GB, 12GB) at 0B into the memory view */
  struct memview_map_fd_info = {
      .fd = filefd,
      .fd_offset = 4 * GiB,
      .size = 8 * GiB,
      .mem_offset = 0,
      .flags = MEMVIEW_MAP_READ,
  };
  ioctl(fd, MEMVIEW_MAP_FD, &map_fd_info);

The memory view has now been populated like this:

Range (GiB)   Fd               Permissions
0-8           big-file.iso     read
8-10          guest-ram        read+write
10-16         <none>           <none>

Now process A gets the "view" file descriptor for this memory view. The
view file descriptor does not allow ioctls. It can be safely passed to
process B in the knowledge that process B can only mmap or close it:

  int viewfd = ioctl(fd, MEMVIEW_GET_VIEWFD);

  ...pass viewfd to process B...

Process B receives viewfd and mmaps it:

  void *ptr = mmap(NULL, 16 * GiB, PROT_READ | PROT_WRITE, MAP_SHARED,
                   viewfd, 0);

When process B accesses a page in the mmap region the memory view
driver resolves the page fault by checking if the page is mapped to an
fd and what its permissions are.

For example, accessing the page at 4GB from the start of the memory view
is an access at 8GB into big-file.iso. That's because 8GB of
big-file.iso was mapped at 0 with fd_offset 4GB.

To summarize, there is one vma in process B and the memory view driver
maps pages from the file descriptors added with ioctl(MEMVIEW_MAP_FD) by
process A.

Page protection bits are the AND of the mmap
PROT_READ/PROT_WRITE/PROT_EXEC flags with the memory view driver's
MEMVIEW_MAP_READ/MEMVIEW_MAP_WRITE/MEMVIEW_MAP_EXEC flags for the
mapping in question.

Does vmf_insert_mixed_prot() or a similar Linux API allow this?

Can the memory view driver map pages from fds without pinning the pages?

Process A can make further ioctl(MEMVIEW_MAP_FD) calls and also
ioctl(MEMVIEW_UNMAP_FD) calls to change the mappings. This requires
zapping affected process B ptes. When process B accesses those pages
again the fault handler will handle the page fault based on the latest
memory view layout.

If process B accesses a page with incorrect permissions or that has not
been configured by process A ioctl calls, a SIGSEGV/SIGBUS signal is
raised.

When process B uses mprotect(2) and other virtual memory syscalls it
is unable to increase page permissions. Instead it can only reduce them
because the pte protection bits are the AND of the mmap flags and the
memory view driver's MEMVIEW_MAP_READ/MEMVIEW_MAP_WRITE/MEMVIEW_MAP_EXEC
flags.

Use cases
---------
How to use the memory view driver for vhost-user
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
vhost-user and other out-of-process device emulation interfaces need a
way for the VMM to enforce the IOMMU mappings on the device emulation
process.

Today the VMM passes all guest RAM fds to the device emulation process
and has no way of restricting access or revoking it later. With the
memory view driver the VMM will pass one or more memory view fds instead
of the actual guest RAM fds. This allows the VMM to invoke
ioctl(MEMVIEW_MAP_FD/MEMVIEW_UNMAP_FD) to enforce permissions or revoke
access.

How to use the memory view driver for virtio-fs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The virtiofsd vhost-user process creates a memory view for the device's
DAX Window and passes it to QEMU. QEMU installs it as a kvm.ko memory
region so that the guest directly accesses the memory view.

Now virtiofsd can map portions of files into the DAX Window without
coordinating with the QEMU process. This simplifies the virtio-fs code
and should also improve DAX map/unmap performance.

Stefan
Steven Sistare July 31, 2020, 2:57 p.m. UTC | #19
On 7/30/2020 5:58 PM, ebiederm@xmission.com wrote:
> Steven Sistare <steven.sistare@oracle.com> writes:
>> On 7/30/2020 1:49 PM, Matthew Wilcox wrote:
>>> On Thu, Jul 30, 2020 at 01:35:51PM -0400, Steven Sistare wrote:
>>>> mshare + VA reservation is another possible solution.
>>>>
>>>> Or MADV_DOEXEC alone, which is ready now.  I hope we can get back to reviewing that.
>>>
>>> We are.  This is the part of the review process where we explore other
>>> solutions to the problem.
>>>
>>>>>> Also, we need to support updating legacy processes that already created anon segments.
>>>>>> We inject code that calls MADV_DOEXEC for such segments.
>>>>>
>>>>> Yes, I was assuming you'd inject code that called mshare().
>>>>
>>>> OK, mshare works on existing memory and builds a new vma.
>>>
>>> Actually, reparents an existing VMA, and reuses the existing page tables.
>>>
>>>>> Actually, since you're injecting code, why do you need the kernel to
>>>>> be involved?  You can mmap the new executable and any libraries it depends
>>>>> upon, set up a new stack and jump to the main() entry point, all without
>>>>> calling exec().  I appreciate it'd be a fair amount of code, but it'd all
>>>>> be in userspace and you can probably steal / reuse code from ld.so (I'm
>>>>> not familiar with the details of how setting up an executable is done).
>>>>
>>>> Duplicating all the work that the kernel and loader do to exec a process would
>>>> be error prone, require ongoing maintenance, and be redundant.  Better to define 
>>>> a small kernel extension and leave exec to the kernel.
>>>
>>> Either this is a one-off kind of thing, in which case it doesn't need
>>> ongoing maintenance, or it's something with broad applicability, in
>>> which case it can live as its own userspace project.  It could even
>>> start off life as part of qemu and then fork into its own project.
>>
>> exec will be enhanced over time in the kernel.  A separate user space implementation
>> would need to track that.
>>
>> Reimplementing exec in userland would be a big gross mess.  Not a good solution when
>> we have simple and concise ways of solving the problem.
> 
> The interesting work is not done in exec.  The interesting work of
> loading an executable is done in ld.so, and ld.so lives in userspace.

And yet a smart guy named Eric said:
  "There is just enough complexity in exec currently that our implementation of exec is 
  already teetering."
which suggests that a user-space re-implementation will also be complex.
In complexity lies bugs.

>>> The idea of tagging an ELF executable to say "I can cope with having
>>> chunks of my address space provided to me by my executor" is ... odd.
>>
>> I don't disagree.  But it is useful.  We already pass a block of data containing
>> environment variables and arguments from one process to the next.  Preserving 
>> additional segments is not a big leap from there.
> 
> But we randomize where that block lives.
> 
> It has been found to be very useful to have randomization by default and
> you are arguing against that kind of randomization.

In the normal use case the addresses of the preserved anon segments were
already randomized during allocation in the pre-exec process.  Preservation
across exec does not make the process less safe from an external attack.

Now, if the attacker injects content at known addresses by preserving and
exec'ing, then the post-exec process must decide whether to trust the
pre-exec process.  Let's prevent preservation when exec'ing setuid binaries.
That is not part of this patch series, but is a good idea.  So, uid does not
change across exec.  Do I trust myself?  Yes I do.  If the uid is trusted,
as it must be in the qemu use case which accesses system resources, then
that is sufficient.  If the exec raises caps, and the uid is not trusted,
then the post-exec process must do additional work to verify the pre-exec
credentials.  The method is left to the app.  We defined the ELF opt-in so
that processes that do not use preserved memory will never receive any and
will never need to deal with such verification.

Matthews sileby/mshare proposal has the same issue.  If a process opts-in
and mmap's an address in the shared region, then content becomes mapped at
a VA that was known to the pre-fork or pre-exec process.  Trust must still
be established.

> Here is another suggestion.
> 
> Have a very simple program that does:
> 
> 	for (;;) {
> 		handle = dlopen("/my/real/program");
> 		real_main = dlsym(handle, "main");
> 		real_main(argc, argv, envp);
> 		dlclose(handle);
> 	}
> 
> With whatever obvious adjustments are needed to fit your usecase.
> 
> That should give the same level of functionality, be portable to all
> unices, and not require you to duplicate code.  It belive it limits you
> to not upgrading libc, or librt but that is a comparatively small
> limitation.
> 
> 
> Given that in general the interesting work is done in userspace and that
> userspace has provided an interface for reusing that work already.
> I don't see the justification for adding anything to exec at this point. 

Thanks for the suggestion.  That is clever, and would make a fun project,
but I would not trust it for production.  These few lines are just
the first of many that it would take to reset the environment to the
well-defined post-exec initial conditions that all executables expect,
and incrementally tearing down state will be prone to bugs.  Getting a
clean slate from a kernel exec is a much more reliable design.  The use
case is creating long-lived apps that never go down, and the simplest
implementation will have the fewest bugs and is the best.  MADV_DOEXEC is
simple, and does not even require a new system call, and the kernel already
knows how to exec without bugs.

- Steve
Matthew Wilcox (Oracle) July 31, 2020, 3:27 p.m. UTC | #20
On Fri, Jul 31, 2020 at 10:57:44AM -0400, Steven Sistare wrote:
> Matthews sileby/mshare proposal has the same issue.  If a process opts-in
> and mmap's an address in the shared region, then content becomes mapped at
> a VA that was known to the pre-fork or pre-exec process.  Trust must still
> be established.

It's up to the recipient whether they try to map it at the same address
or at a fresh address.  The intended use case is a "semi-shared" address
space between two processes (ie partway between a threaded, fully-shared
address space and a forked un-shared address space), in which case
there's a certain amount of trust and cooperation between the processes.

Your preservation-across-exec use-case might or might not need the
VMA to be mapped at the same address.  I don't know whether qemu stores
pointers in this VMA which are absolute within the qemu address space.
If it's just the emulated process's address space, then everything will
be absolute within its own address space and everything will be opaque
to qemu.  If qemu is storing its own pointers in it, then it has to be
mapped at the same address.

> > Here is another suggestion.
> > 
> > Have a very simple program that does:
> > 
> > 	for (;;) {
> > 		handle = dlopen("/my/real/program");
> > 		real_main = dlsym(handle, "main");
> > 		real_main(argc, argv, envp);
> > 		dlclose(handle);
> > 	}
> > 
> > With whatever obvious adjustments are needed to fit your usecase.
> > 
> > That should give the same level of functionality, be portable to all
> > unices, and not require you to duplicate code.  It belive it limits you
> > to not upgrading libc, or librt but that is a comparatively small
> > limitation.
> > 
> > 
> > Given that in general the interesting work is done in userspace and that
> > userspace has provided an interface for reusing that work already.
> > I don't see the justification for adding anything to exec at this point. 
> 
> Thanks for the suggestion.  That is clever, and would make a fun project,
> but I would not trust it for production.  These few lines are just
> the first of many that it would take to reset the environment to the
> well-defined post-exec initial conditions that all executables expect,
> and incrementally tearing down state will be prone to bugs.  Getting a
> clean slate from a kernel exec is a much more reliable design.  The use
> case is creating long-lived apps that never go down, and the simplest
> implementation will have the fewest bugs and is the best.  MADV_DOEXEC is
> simple, and does not even require a new system call, and the kernel already
> knows how to exec without bugs.

It's a net increase of 200 lines of kernel code.  If 4 lines of userspace
code removes 200 lines of kernel code, I think I know which I prefer ...
Steven Sistare July 31, 2020, 4:11 p.m. UTC | #21
On 7/31/2020 11:27 AM, Matthew Wilcox wrote:
> On Fri, Jul 31, 2020 at 10:57:44AM -0400, Steven Sistare wrote:
>> Matthews sileby/mshare proposal has the same issue.  If a process opts-in
>> and mmap's an address in the shared region, then content becomes mapped at
>> a VA that was known to the pre-fork or pre-exec process.  Trust must still
>> be established.
> 
> It's up to the recipient whether they try to map it at the same address
> or at a fresh address.  The intended use case is a "semi-shared" address
> space between two processes (ie partway between a threaded, fully-shared
> address space and a forked un-shared address space), in which case
> there's a certain amount of trust and cooperation between the processes.

Understood, but if the recipient does map at any of the same, which is the whole
point because you want to share the page table.  The trust relationship is no
different than for the live update case.  
 
> Your preservation-across-exec use-case might or might not need the
> VMA to be mapped at the same address.  

It does.  qemu registers memory with vfio which remembers the va's in kernel
metadata for the device.

> I don't know whether qemu stores
> pointers in this VMA which are absolute within the qemu address space.
> If it's just the emulated process's address space, then everything will
> be absolute within its own address space and everything will be opaque
> to qemu.  If qemu is storing its own pointers in it, then it has to be
> mapped at the same address.

qemu does not do the latter but that could be a nice way for apps to use
preserved memory.

>>> Here is another suggestion.
>>>
>>> Have a very simple program that does:
>>>
>>> 	for (;;) {
>>> 		handle = dlopen("/my/real/program");
>>> 		real_main = dlsym(handle, "main");
>>> 		real_main(argc, argv, envp);
>>> 		dlclose(handle);
>>> 	}
>>>
>>> With whatever obvious adjustments are needed to fit your usecase.
>>>
>>> That should give the same level of functionality, be portable to all
>>> unices, and not require you to duplicate code.  It belive it limits you
>>> to not upgrading libc, or librt but that is a comparatively small
>>> limitation.
>>>
>>>
>>> Given that in general the interesting work is done in userspace and that
>>> userspace has provided an interface for reusing that work already.
>>> I don't see the justification for adding anything to exec at this point. 
>>
>> Thanks for the suggestion.  That is clever, and would make a fun project,
>> but I would not trust it for production.  These few lines are just
>> the first of many that it would take to reset the environment to the
>> well-defined post-exec initial conditions that all executables expect,
>> and incrementally tearing down state will be prone to bugs.  Getting a
>> clean slate from a kernel exec is a much more reliable design.  The use
>> case is creating long-lived apps that never go down, and the simplest
>> implementation will have the fewest bugs and is the best.  MADV_DOEXEC is
>> simple, and does not even require a new system call, and the kernel already
>> knows how to exec without bugs.
> 
> It's a net increase of 200 lines of kernel code.  If 4 lines of userspace
> code removes 200 lines of kernel code, I think I know which I prefer ...

It will be *far* more than 4 lines.
Much of the 200 lines is mostly for the elf opt in, and much of the elf code is from
anthony reviving an earlier patch that use MAP_FIXED_NOREPLACE during segment setup.

- Steve
Jason Gunthorpe July 31, 2020, 4:56 p.m. UTC | #22
On Fri, Jul 31, 2020 at 12:11:52PM -0400, Steven Sistare wrote:
> > Your preservation-across-exec use-case might or might not need the
> > VMA to be mapped at the same address.  
> 
> It does.  qemu registers memory with vfio which remembers the va's in kernel
> metadata for the device.

Once the memory is registered with vfio the VA doesn't matter, vfio
will keep the iommu pointing at the same physical pages no matter
where they are mapped.

Jason
Steven Sistare July 31, 2020, 5:15 p.m. UTC | #23
On 7/31/2020 12:56 PM, Jason Gunthorpe wrote:
> On Fri, Jul 31, 2020 at 12:11:52PM -0400, Steven Sistare wrote:
>>> Your preservation-across-exec use-case might or might not need the
>>> VMA to be mapped at the same address.  
>>
>> It does.  qemu registers memory with vfio which remembers the va's in kernel
>> metadata for the device.
> 
> Once the memory is registered with vfio the VA doesn't matter, vfio
> will keep the iommu pointing at the same physical pages no matter
> where they are mapped.

Yes, but there are other code paths that compute and use offsets between va and the
base va.  Mapping at a different va in the new process breaks vfio; I have tried it.

- Steve
Matthew Wilcox (Oracle) July 31, 2020, 5:23 p.m. UTC | #24
On Fri, Jul 31, 2020 at 12:11:52PM -0400, Steven Sistare wrote:
> On 7/31/2020 11:27 AM, Matthew Wilcox wrote:
> > On Fri, Jul 31, 2020 at 10:57:44AM -0400, Steven Sistare wrote:
> >> Matthews sileby/mshare proposal has the same issue.  If a process opts-in
> >> and mmap's an address in the shared region, then content becomes mapped at
> >> a VA that was known to the pre-fork or pre-exec process.  Trust must still
> >> be established.
> > 
> > It's up to the recipient whether they try to map it at the same address
> > or at a fresh address.  The intended use case is a "semi-shared" address
> > space between two processes (ie partway between a threaded, fully-shared
> > address space and a forked un-shared address space), in which case
> > there's a certain amount of trust and cooperation between the processes.
> 
> Understood, but if the recipient does map at any of the same, which is the whole
> point because you want to share the page table.  The trust relationship is no
> different than for the live update case.  

You don't have to map at the same address to share the page tables.
For example, on x86 if you share an 8GB region, that must be aligned at
1GB in both the donor and the recipient, but they need not be mapped at
the same address.

> > It's a net increase of 200 lines of kernel code.  If 4 lines of userspace
> > code removes 200 lines of kernel code, I think I know which I prefer ...
> 
> It will be *far* more than 4 lines.
> Much of the 200 lines is mostly for the elf opt in, and much of the elf code is from
> anthony reviving an earlier patch that use MAP_FIXED_NOREPLACE during segment setup.

It doesn't really matter how much of it is for the opt-in and how much
is for the exec path itself.  The MAP_FIXED_NOREPLACE patch is only net
+16 lines, so that's not the problem.
Jason Gunthorpe July 31, 2020, 5:48 p.m. UTC | #25
On Fri, Jul 31, 2020 at 01:15:34PM -0400, Steven Sistare wrote:
> On 7/31/2020 12:56 PM, Jason Gunthorpe wrote:
> > On Fri, Jul 31, 2020 at 12:11:52PM -0400, Steven Sistare wrote:
> >>> Your preservation-across-exec use-case might or might not need the
> >>> VMA to be mapped at the same address.  
> >>
> >> It does.  qemu registers memory with vfio which remembers the va's in kernel
> >> metadata for the device.
> > 
> > Once the memory is registered with vfio the VA doesn't matter, vfio
> > will keep the iommu pointing at the same physical pages no matter
> > where they are mapped.
> 
> Yes, but there are other code paths that compute and use offsets between va and the
> base va.  Mapping at a different va in the new process breaks vfio; I have tried it.

Maybe you could fix vfio instead of having this adventure, if vfio is
the only motivation.

Jason
Steven Sistare July 31, 2020, 5:55 p.m. UTC | #26
On 7/31/2020 1:48 PM, Jason Gunthorpe wrote:
> On Fri, Jul 31, 2020 at 01:15:34PM -0400, Steven Sistare wrote:
>> On 7/31/2020 12:56 PM, Jason Gunthorpe wrote:
>>> On Fri, Jul 31, 2020 at 12:11:52PM -0400, Steven Sistare wrote:
>>>>> Your preservation-across-exec use-case might or might not need the
>>>>> VMA to be mapped at the same address.  
>>>>
>>>> It does.  qemu registers memory with vfio which remembers the va's in kernel
>>>> metadata for the device.
>>>
>>> Once the memory is registered with vfio the VA doesn't matter, vfio
>>> will keep the iommu pointing at the same physical pages no matter
>>> where they are mapped.
>>
>> Yes, but there are other code paths that compute and use offsets between va and the
>> base va.  Mapping at a different va in the new process breaks vfio; I have tried it.
> 
> Maybe you could fix vfio instead of having this adventure, if vfio is
> the only motivation.

Maybe.  We still need to preserve an anonymous segment, though.  MADV_DOEXEC, or mshare,
or something else.  And I think the ability to preserve memory containing pointers to itself
is an interesting use case, though not ours.

- Steve
Steven Sistare July 31, 2020, 7:41 p.m. UTC | #27
On 7/27/2020 1:11 PM, Anthony Yznaga wrote:
> This patchset adds support for preserving an anonymous memory range across
> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
> sharing memory in this manner, as opposed to re-attaching to a named shared
> memory segment, is to ensure it is mapped at the same virtual address in
> the new process as it was in the old one.  An intended use for this is to
> preserve guest memory for guests using vfio while qemu exec's an updated
> version of itself.  By ensuring the memory is preserved at a fixed address,
> vfio mappings and their associated kernel data structures can remain valid.
> In addition, for the qemu use case, qemu instances that back guest RAM with
> anonymous memory can be updated.

I forgot to mention, our use case is not just theoretical.  It has been implemented
and is pretty cool (but I am biased).  The pause time for the guest is in the
100 - 200 msec range.  We submitted qemu patches for review based on the MADV_DOEXEC
proposal.  In case you are curious:
  https://lore.kernel.org/qemu-devel/1596122076-341293-1-git-send-email-steven.sistare@oracle.com/

- Steve
Eric W. Biederman Aug. 3, 2020, 3:28 p.m. UTC | #28
Steven Sistare <steven.sistare@oracle.com> writes:

> On 7/30/2020 5:58 PM, ebiederm@xmission.com wrote:
>> Here is another suggestion.
>> 
>> Have a very simple program that does:
>> 
>> 	for (;;) {
>> 		handle = dlopen("/my/real/program");
>> 		real_main = dlsym(handle, "main");
>> 		real_main(argc, argv, envp);
>> 		dlclose(handle);
>> 	}
>> 
>> With whatever obvious adjustments are needed to fit your usecase.
>> 
>> That should give the same level of functionality, be portable to all
>> unices, and not require you to duplicate code.  It belive it limits you
>> to not upgrading libc, or librt but that is a comparatively small
>> limitation.
>> 
>> 
>> Given that in general the interesting work is done in userspace and that
>> userspace has provided an interface for reusing that work already.
>> I don't see the justification for adding anything to exec at this point. 
>
> Thanks for the suggestion.  That is clever, and would make a fun project,
> but I would not trust it for production.  These few lines are just
> the first of many that it would take to reset the environment to the
> well-defined post-exec initial conditions that all executables expect,
> and incrementally tearing down state will be prone to bugs.

Agreed.

> Getting a clean slate from a kernel exec is a much more reliable
> design.

Except you are explicitly throwing that out the window, by preserving
VMAs.  You very much need to have a clean bug free shutdown to pass VMAs
reliably.

> The use case is creating long-lived apps that never go down, and the
> simplest implementation will have the fewest bugs and is the best.
> MADV_DOEXEC is simple, and does not even require a new system call,
> and the kernel already knows how to exec without bugs.

*ROFL*  I wish the kernel knew how to exec things without bugs.
The bugs are hard to hit but the ones I am aware of are not straight
forward to fix.

MADV_DOEXEC is not conceptually simple.  It completely violates the
guarantees that exec is known to make about the contents of the memory
of the new process.  This makes it very difficult to reason about.  Nor
will MADV_DOEXEC be tested very much as it has only one or two users.
Which means in the fullness of time it is likely someone will change
something that will break the implementation subtlely and the bug report
probably won't come in for 3 years, or maybe a decade.  At which point
it won't be clear if the bug even can be fixed as something else might
rely on it.

What is wrong with live migration between one qemu process and another
qemu process on the same machine not work for this use case?

Just reusing live migration would seem to be the simplest path of all,
as the code is already implemented.  Further if something goes wrong
with the live migration you can fallback to the existing process.  With
exec there is no fallback if the new version does not properly support
the handoff protocol of the old version.

Eric
James Bottomley Aug. 3, 2020, 3:42 p.m. UTC | #29
On Mon, 2020-08-03 at 10:28 -0500, Eric W. Biederman wrote:
[...]
> What is wrong with live migration between one qemu process and
> another qemu process on the same machine not work for this use case?
> 
> Just reusing live migration would seem to be the simplest path of
> all, as the code is already implemented.  Further if something goes
> wrong with the live migration you can fallback to the existing
> process.  With exec there is no fallback if the new version does not
> properly support the handoff protocol of the old version.

Actually, could I ask this another way: the other patch set you sent to
the KVM list was to snapshot the VM to a PKRAM capsule preserved across
kexec using zero copy for extremely fast save/restore.  The original
idea was to use this as part of a CRIU based snapshot, kexec to new
system, restore.  However, why can't you do a local snapshot, restart
qemu, restore using the PKRAM capsule to achieve exactly the same as
MADV_DOEXEC does but using a system that's easy to reason about?  It
may be slightly slower, but I think we're still talking milliseconds.

James
Steven Sistare Aug. 3, 2020, 7:29 p.m. UTC | #30
On 8/3/2020 11:28 AM, ebiederm@xmission.com wrote:
> Steven Sistare <steven.sistare@oracle.com> writes:
>> On 7/30/2020 5:58 PM, ebiederm@xmission.com wrote:
>>> Here is another suggestion.
>>>
>>> Have a very simple program that does:
>>>
>>> 	for (;;) {
>>> 		handle = dlopen("/my/real/program");
>>> 		real_main = dlsym(handle, "main");
>>> 		real_main(argc, argv, envp);
>>> 		dlclose(handle);
>>> 	}
>>>
>>> With whatever obvious adjustments are needed to fit your usecase.
>>>
>>> That should give the same level of functionality, be portable to all
>>> unices, and not require you to duplicate code.  It belive it limits you
>>> to not upgrading libc, or librt but that is a comparatively small
>>> limitation.
>>>
>>>
>>> Given that in general the interesting work is done in userspace and that
>>> userspace has provided an interface for reusing that work already.
>>> I don't see the justification for adding anything to exec at this point. 
>>
>> Thanks for the suggestion.  That is clever, and would make a fun project,
>> but I would not trust it for production.  These few lines are just
>> the first of many that it would take to reset the environment to the
>> well-defined post-exec initial conditions that all executables expect,
>> and incrementally tearing down state will be prone to bugs.
> 
> Agreed.
> 
>> Getting a clean slate from a kernel exec is a much more reliable
>> design.
> 
> Except you are explicitly throwing that out the window, by preserving
> VMAs.  You very much need to have a clean bug free shutdown to pass VMAs
> reliably.

Sure.  The whole community relies on you and others to provide a bug free exec.

>> The use case is creating long-lived apps that never go down, and the
>> simplest implementation will have the fewest bugs and is the best.
>> MADV_DOEXEC is simple, and does not even require a new system call,
>> and the kernel already knows how to exec without bugs.
> 
> *ROFL*  I wish the kernel knew how to exec things without bugs.

Essentially you are saying you would argue against any enhancement to exec.
Surely that is too high a bar.  We must continue to evolve an innovate and
balance risk against reward.  This use case matters to our business a lot,
and to others as well, see below.  That is the reward.  I feel you are 
overstating the risk.  Surely there is some early point in the development
cycle of some release where this can be integrated and get enough test
time and soak time to be proven reliable.

> The bugs are hard to hit but the ones I am aware of are not straight
> forward to fix.
> 
> MADV_DOEXEC is not conceptually simple.  It completely violates the
> guarantees that exec is known to make about the contents of the memory
> of the new process.  This makes it very difficult to reason about.  

I have having trouble see the difficulty.  Perhaps I am too familar with
it, but the semantics are few and easy to explain, and it does not introduce
new concepts: the post-exec process is born with a few more mappings than
previously, and non-fixed further mmaps choose addresses in the holes.

> Nor
> will MADV_DOEXEC be tested very much as it has only one or two users.
> Which means in the fullness of time it is likely someone will change
> something that will break the implementation subtlely and the bug report
> probably won't come in for 3 years, or maybe a decade.  At which point
> it won't be clear if the bug even can be fixed as something else might
> rely on it.

That's on us; we need to provide kernel tests and be diligent about testing
new releases.  This matters to our business and we will do so. 
> What is wrong with live migration between one qemu process and another
> qemu process on the same machine not work for this use case?
> 
> Just reusing live migration would seem to be the simplest path of all,
> as the code is already implemented.  Further if something goes wrong
> with the live migration you can fallback to the existing process.  With
> exec there is no fallback if the new version does not properly support
> the handoff protocol of the old version.

This is less resource intensive than live migration.  The latter ties up two
hosts, consumes lots of memory and network bandwidth, may take a long time
to converge on a busy system, and is unfeasible for guests with a huge amount
of local storeage (which we call dense I/O shapes).  Live update takes less than
1 second total, and the guest pause time is 100 - 200 msecs.  It is a very
attractive solution that other cloud vendors have implemented as well, with
their own private modifications to exec and and fork.  We have been independently
working in this area, and we are offering our implementation to the community.

- Steve
Steven Sistare Aug. 3, 2020, 8:03 p.m. UTC | #31
On 8/3/2020 11:42 AM, James Bottomley wrote:
> On Mon, 2020-08-03 at 10:28 -0500, Eric W. Biederman wrote:
> [...]
>> What is wrong with live migration between one qemu process and
>> another qemu process on the same machine not work for this use case?
>>
>> Just reusing live migration would seem to be the simplest path of
>> all, as the code is already implemented.  Further if something goes
>> wrong with the live migration you can fallback to the existing
>> process.  With exec there is no fallback if the new version does not
>> properly support the handoff protocol of the old version.
> 
> Actually, could I ask this another way: the other patch set you sent to
> the KVM list was to snapshot the VM to a PKRAM capsule preserved across
> kexec using zero copy for extremely fast save/restore.  The original
> idea was to use this as part of a CRIU based snapshot, kexec to new
> system, restore.  However, why can't you do a local snapshot, restart
> qemu, restore using the PKRAM capsule to achieve exactly the same as
> MADV_DOEXEC does but using a system that's easy to reason about?  It
> may be slightly slower, but I think we're still talking milliseconds.

Hi James, good to hear from you.  PKRAM or SysV shm could be used for
a restart in that manner, but it would only support sriov guests if the
guest exports an agent that supports suspend-to-ram, and if all guest
drivers support the suspend-to-ram method.  I have done this using a linux
guest and qemu guest agent, and IIRC the guest pause time is 500 - 1000 msec.
With MADV_DOEXEC, pause time is 100 - 200 msec.  The pause time is a handful
of seconds if the guest uses an nvme drive because CC.SHN takes so long
to persist metadata to stable storage.

We could instead pass vfio descriptors from the old process to a 3rd party escrow 
process and pass  them back to the new qemu process, but the shm that vfio has 
already registered must be remapped at the same VA as the previous process, and 
there is no interface to guarantee that.  MAP_FIXED blows away existing mappings 
and breaks the app. MAP_FIXED_NOREPLACE respects existing mappings but cannot map 
the shm and breaks the app.  Adding a feature that reserves VAs would fix that, we 
have experimnted with one.  Fixing the vfio kernel implementation to not use the 
original VA base would also work, but I don't know how doable/difficult that would be.

Both solutions would require a qemu instance to be stopped and relaunched using shm
as guest ram, and its guest rebooted, so they do not let us update legacy 
already-running instances that use anon memory.  That problem solves itself if we 
get these rfe's into linux and qemu, and eventually users shut down the legacy
instances, but that takes years and we need to do it sooner.

- Steve
Matthew Wilcox (Oracle) Aug. 4, 2020, 11:13 a.m. UTC | #32
On Tue, Aug 04, 2020 at 08:44:42AM +0000, David Laight wrote:
> From: James Bottomley
> > Sent: 03 August 2020 16:43
> > 
> > On Mon, 2020-08-03 at 10:28 -0500, Eric W. Biederman wrote:
> > [...]
> > > What is wrong with live migration between one qemu process and
> > > another qemu process on the same machine not work for this use case?
> > >
> > > Just reusing live migration would seem to be the simplest path of
> > > all, as the code is already implemented.  Further if something goes
> > > wrong with the live migration you can fallback to the existing
> > > process.  With exec there is no fallback if the new version does not
> > > properly support the handoff protocol of the old version.
> > 
> > Actually, could I ask this another way: the other patch set you sent to
> > the KVM list was to snapshot the VM to a PKRAM capsule preserved across
> > kexec using zero copy for extremely fast save/restore.  The original
> > idea was to use this as part of a CRIU based snapshot, kexec to new
> > system, restore.  However, why can't you do a local snapshot, restart
> > qemu, restore using the PKRAM capsule to achieve exactly the same as
> > MADV_DOEXEC does but using a system that's easy to reason about?  It
> > may be slightly slower, but I think we're still talking milliseconds.
> 
> 
> I've had another idea (that is probably impossible...).
> What about a 'reverse mmap' operation.
> Something that creates an fd whose contents are a chunk of the
> processes address space.

http://www.wil.cx/~willy/linux/sileby.html
Longpeng(Mike) July 8, 2021, 9:52 a.m. UTC | #33
Hi Anthony and Steven,

在 2020/7/28 1:11, Anthony Yznaga 写道:
> This patchset adds support for preserving an anonymous memory range across
> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
> sharing memory in this manner, as opposed to re-attaching to a named shared
> memory segment, is to ensure it is mapped at the same virtual address in
> the new process as it was in the old one.  An intended use for this is to
> preserve guest memory for guests using vfio while qemu exec's an updated
> version of itself.  By ensuring the memory is preserved at a fixed address,
> vfio mappings and their associated kernel data structures can remain valid.
> In addition, for the qemu use case, qemu instances that back guest RAM with
> anonymous memory can be updated.
> 

We have a requirement like yours, but ours seems more complex. We want to
isolate some memory regions from the VM's memory space and the start a child
process who will using these memory regions.

I've wrote a draft to support this feature, but I just find that my draft is
pretty like yours.

It seems that you've already abandoned this patchset, why ?

> Patches 1 and 2 ensure that loading of ELF load segments does not silently
> clobber existing VMAS, and remove assumptions that the stack is the only
> VMA in the mm when the stack is set up.  Patch 1 re-introduces the use of
> MAP_FIXED_NOREPLACE to load ELF binaries that addresses the previous issues
> and could be considered on its own.
> 
> Patches 3, 4, and 5 introduce the feature and an opt-in method for its use
> using an ELF note.
> 
> Anthony Yznaga (5):
>   elf: reintroduce using MAP_FIXED_NOREPLACE for elf executable mappings
>   mm: do not assume only the stack vma exists in setup_arg_pages()
>   mm: introduce VM_EXEC_KEEP
>   exec, elf: require opt-in for accepting preserved mem
>   mm: introduce MADV_DOEXEC
> 
>  arch/x86/Kconfig                       |   1 +
>  fs/binfmt_elf.c                        | 196 +++++++++++++++++++++++++--------
>  fs/exec.c                              |  33 +++++-
>  include/linux/binfmts.h                |   7 +-
>  include/linux/mm.h                     |   5 +
>  include/uapi/asm-generic/mman-common.h |   3 +
>  kernel/fork.c                          |   2 +-
>  mm/madvise.c                           |  25 +++++
>  mm/mmap.c                              |  47 ++++++++
>  9 files changed, 266 insertions(+), 53 deletions(-)
>
Steven Sistare July 8, 2021, 12:48 p.m. UTC | #34
On 7/8/2021 5:52 AM, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
> Hi Anthony and Steven,
> 
> 在 2020/7/28 1:11, Anthony Yznaga 写道:
>> This patchset adds support for preserving an anonymous memory range across
>> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
>> sharing memory in this manner, as opposed to re-attaching to a named shared
>> memory segment, is to ensure it is mapped at the same virtual address in
>> the new process as it was in the old one.  An intended use for this is to
>> preserve guest memory for guests using vfio while qemu exec's an updated
>> version of itself.  By ensuring the memory is preserved at a fixed address,
>> vfio mappings and their associated kernel data structures can remain valid.
>> In addition, for the qemu use case, qemu instances that back guest RAM with
>> anonymous memory can be updated.
> 
> We have a requirement like yours, but ours seems more complex. We want to
> isolate some memory regions from the VM's memory space and the start a child
> process who will using these memory regions.
> 
> I've wrote a draft to support this feature, but I just find that my draft is
> pretty like yours.
> 
> It seems that you've already abandoned this patchset, why ?

Hi Longpeng,
  The reviewers did not like the proposal for several reasons, but the showstopper
was that they did not want to add complexity to the exec path in the kernel.  You
can read the email archive for details.

We solved part of our problem by adding new vfio interfaces: VFIO_DMA_UNMAP_FLAG_VADDR
and VFIO_DMA_MAP_FLAG_VADDR.  That solves the vfio problem for shared memory, but not
for mmap MAP_ANON memory.

- Steve

>> Patches 1 and 2 ensure that loading of ELF load segments does not silently
>> clobber existing VMAS, and remove assumptions that the stack is the only
>> VMA in the mm when the stack is set up.  Patch 1 re-introduces the use of
>> MAP_FIXED_NOREPLACE to load ELF binaries that addresses the previous issues
>> and could be considered on its own.
>>
>> Patches 3, 4, and 5 introduce the feature and an opt-in method for its use
>> using an ELF note.
>>
>> Anthony Yznaga (5):
>>   elf: reintroduce using MAP_FIXED_NOREPLACE for elf executable mappings
>>   mm: do not assume only the stack vma exists in setup_arg_pages()
>>   mm: introduce VM_EXEC_KEEP
>>   exec, elf: require opt-in for accepting preserved mem
>>   mm: introduce MADV_DOEXEC
>>
>>  arch/x86/Kconfig                       |   1 +
>>  fs/binfmt_elf.c                        | 196 +++++++++++++++++++++++++--------
>>  fs/exec.c                              |  33 +++++-
>>  include/linux/binfmts.h                |   7 +-
>>  include/linux/mm.h                     |   5 +
>>  include/uapi/asm-generic/mman-common.h |   3 +
>>  kernel/fork.c                          |   2 +-
>>  mm/madvise.c                           |  25 +++++
>>  mm/mmap.c                              |  47 ++++++++
>>  9 files changed, 266 insertions(+), 53 deletions(-)
>>
>
Longpeng(Mike) July 12, 2021, 1:05 a.m. UTC | #35
Hi Steven,

在 2021/7/8 20:48, Steven Sistare 写道:
> On 7/8/2021 5:52 AM, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
>> Hi Anthony and Steven,
>>
>> 在 2020/7/28 1:11, Anthony Yznaga 写道:
>>> This patchset adds support for preserving an anonymous memory range across
>>> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
>>> sharing memory in this manner, as opposed to re-attaching to a named shared
>>> memory segment, is to ensure it is mapped at the same virtual address in
>>> the new process as it was in the old one.  An intended use for this is to
>>> preserve guest memory for guests using vfio while qemu exec's an updated
>>> version of itself.  By ensuring the memory is preserved at a fixed address,
>>> vfio mappings and their associated kernel data structures can remain valid.
>>> In addition, for the qemu use case, qemu instances that back guest RAM with
>>> anonymous memory can be updated.
>>
>> We have a requirement like yours, but ours seems more complex. We want to
>> isolate some memory regions from the VM's memory space and the start a child
>> process who will using these memory regions.
>>
>> I've wrote a draft to support this feature, but I just find that my draft is
>> pretty like yours.
>>
>> It seems that you've already abandoned this patchset, why ?
> 
> Hi Longpeng,
>   The reviewers did not like the proposal for several reasons, but the showstopper
> was that they did not want to add complexity to the exec path in the kernel.  You
> can read the email archive for details.
> 
I've read the archive and did some study these days, maybe this soluation is
more sutiable for my use case.

Let me describe my use case more clearly (just ignore if you're not interested
in it):

1. Prog A mmap() 4GB memory (anon or file-mapping), suppose the allocated VA
range is [0x40000000,0x140000000)

2. Prog A specifies [0x48000000,0x50000000) and [0x80000000,0x100000000) will be
shared by its child.

3. Prog A fork() Prog B and then Prog B exec() a new ELF binary.

4. Prog B notice the shared ranges (e.g. by input parameters or ...) and remap
them to a continuous VA range.

Do you have any suggestions ?

> We solved part of our problem by adding new vfio interfaces: VFIO_DMA_UNMAP_FLAG_VADDR
> and VFIO_DMA_MAP_FLAG_VADDR.  That solves the vfio problem for shared memory, but not
> for mmap MAP_ANON memory.
> 
> - Steve
> 
>>> Patches 1 and 2 ensure that loading of ELF load segments does not silently
>>> clobber existing VMAS, and remove assumptions that the stack is the only
>>> VMA in the mm when the stack is set up.  Patch 1 re-introduces the use of
>>> MAP_FIXED_NOREPLACE to load ELF binaries that addresses the previous issues
>>> and could be considered on its own.
>>>
>>> Patches 3, 4, and 5 introduce the feature and an opt-in method for its use
>>> using an ELF note.
>>>
>>> Anthony Yznaga (5):
>>>   elf: reintroduce using MAP_FIXED_NOREPLACE for elf executable mappings
>>>   mm: do not assume only the stack vma exists in setup_arg_pages()
>>>   mm: introduce VM_EXEC_KEEP
>>>   exec, elf: require opt-in for accepting preserved mem
>>>   mm: introduce MADV_DOEXEC
>>>
>>>  arch/x86/Kconfig                       |   1 +
>>>  fs/binfmt_elf.c                        | 196 +++++++++++++++++++++++++--------
>>>  fs/exec.c                              |  33 +++++-
>>>  include/linux/binfmts.h                |   7 +-
>>>  include/linux/mm.h                     |   5 +
>>>  include/uapi/asm-generic/mman-common.h |   3 +
>>>  kernel/fork.c                          |   2 +-
>>>  mm/madvise.c                           |  25 +++++
>>>  mm/mmap.c                              |  47 ++++++++
>>>  9 files changed, 266 insertions(+), 53 deletions(-)
>>>
>>
> .
>
Matthew Wilcox (Oracle) July 12, 2021, 1:30 a.m. UTC | #36
On Mon, Jul 12, 2021 at 09:05:45AM +0800, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
> Let me describe my use case more clearly (just ignore if you're not interested
> in it):
> 
> 1. Prog A mmap() 4GB memory (anon or file-mapping), suppose the allocated VA
> range is [0x40000000,0x140000000)
> 
> 2. Prog A specifies [0x48000000,0x50000000) and [0x80000000,0x100000000) will be
> shared by its child.
> 
> 3. Prog A fork() Prog B and then Prog B exec() a new ELF binary.
> 
> 4. Prog B notice the shared ranges (e.g. by input parameters or ...) and remap
> them to a continuous VA range.

This is dangerous.  There must be an active step for Prog B to accept
Prog A's ranges into its address space.  Otherwise Prog A could almost
completely fill Prog B's address space and so control where Prog B
places its mappings.  It could also provoke a latent bug in Prog B
if it doesn't handle address space exhaustion gracefully.

I had a proposal to handle this.  Would it meet your requirements?
https://lore.kernel.org/lkml/20200730152250.GG23808@casper.infradead.org/
Longpeng(Mike) July 13, 2021, 12:57 a.m. UTC | #37
Hi Matthew,

> -----Original Message-----
> From: Matthew Wilcox [mailto:willy@infradead.org]
> Sent: Monday, July 12, 2021 9:30 AM
> To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> <longpeng2@huawei.com>
> Cc: Steven Sistare <steven.sistare@oracle.com>; Anthony Yznaga
> <anthony.yznaga@oracle.com>; linux-kernel@vger.kernel.org;
> linux-mm@kvack.org; Gonglei (Arei) <arei.gonglei@huawei.com>
> Subject: Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
> 
> On Mon, Jul 12, 2021 at 09:05:45AM +0800, Longpeng (Mike, Cloud
> Infrastructure Service Product Dept.) wrote:
> > Let me describe my use case more clearly (just ignore if you're not
> > interested in it):
> >
> > 1. Prog A mmap() 4GB memory (anon or file-mapping), suppose the
> > allocated VA range is [0x40000000,0x140000000)
> >
> > 2. Prog A specifies [0x48000000,0x50000000) and
> > [0x80000000,0x100000000) will be shared by its child.
> >
> > 3. Prog A fork() Prog B and then Prog B exec() a new ELF binary.
> >
> > 4. Prog B notice the shared ranges (e.g. by input parameters or ...)
> > and remap them to a continuous VA range.
> 
> This is dangerous.  There must be an active step for Prog B to accept Prog A's
> ranges into its address space.  Otherwise Prog A could almost completely fill
> Prog B's address space and so control where Prog B places its mappings.  It
> could also provoke a latent bug in Prog B if it doesn't handle address space
> exhaustion gracefully.
> 
> I had a proposal to handle this.  Would it meet your requirements?
> https://lore.kernel.org/lkml/20200730152250.GG23808@casper.infradead.org/

I noticed your proposal of project Sileby and I think it can meet Steven's requirement, but I not sure whether it's suitable for mine because there's no sample code yet, is it in progress ?

According to the abstract of Sileby, I have two questions:
1. Would you plan to support the file-mapping memory sharing ? e.g. Prog A's 4G memory is backend with 2M hugetlb.
2. Does each mshare fd only containe one sharing VMA ? For large memory process (1T~4T in our env), maybe there is hundreds of memory ranges need to be shared, this will take too much fd space if so ?
Khalid Aziz Aug. 13, 2021, 7:49 p.m. UTC | #38
On Tue, 2021-07-13 at 00:57 +0000, Longpeng (Mike, Cloud Infrastructure
Service Product Dept.) wrote:
> Hi Matthew,
> 
> > -----Original Message-----
> > From: Matthew Wilcox [mailto:willy@infradead.org]
> > Sent: Monday, July 12, 2021 9:30 AM
> > To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> > <longpeng2@huawei.com>
> > Cc: Steven Sistare <steven.sistare@oracle.com>; Anthony Yznaga
> > <anthony.yznaga@oracle.com>; linux-kernel@vger.kernel.org;
> > linux-mm@kvack.org; Gonglei (Arei) <arei.gonglei@huawei.com>
> > Subject: Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
> > 
> > On Mon, Jul 12, 2021 at 09:05:45AM +0800, Longpeng (Mike, Cloud
> > Infrastructure Service Product Dept.) wrote:
> > > Let me describe my use case more clearly (just ignore if you're not
> > > interested in it):
> > > 
> > > 1. Prog A mmap() 4GB memory (anon or file-mapping), suppose the
> > > allocated VA range is [0x40000000,0x140000000)
> > > 
> > > 2. Prog A specifies [0x48000000,0x50000000) and
> > > [0x80000000,0x100000000) will be shared by its child.
> > > 
> > > 3. Prog A fork() Prog B and then Prog B exec() a new ELF binary.
> > > 
> > > 4. Prog B notice the shared ranges (e.g. by input parameters or
> > > ...)
> > > and remap them to a continuous VA range.
> > 
> > This is dangerous.  There must be an active step for Prog B to accept
> > Prog A's
> > ranges into its address space.  Otherwise Prog A could almost
> > completely fill
> > Prog B's address space and so control where Prog B places its
> > mappings.  It
> > could also provoke a latent bug in Prog B if it doesn't handle
> > address space
> > exhaustion gracefully.
> > 
> > I had a proposal to handle this.  Would it meet your requirements?
> > https://lore.kernel.org/lkml/20200730152250.GG23808@casper.infradead.org/
> 
> I noticed your proposal of project Sileby and I think it can meet
> Steven's requirement, but I not sure whether it's suitable for mine
> because there's no sample code yet, is it in progress ?

Hi Mike,

I am working on refining the ideas from project Sileby. I am also
working on designing the implementation. Since the original concept,
the mshare API has evolved further. Here is what it loks like:

The mshare API consists of two system calls - mshare() and
mshare_unlink()

mshare
======

int mshare(char *name,void *addr, size_t length, int oflags, mode_t
mode)

mshare() creates and opens a new, or opens an existing shared memory
area that will be shared at PTE level. name refers to shared object
name that exists under /dev/mshare (this is subject to change. There
might be better ways to manage the names for mshare'd areas). addr is
the starting address of this shared memory area and length is the size
of this area. oflags can be one of:

    O_RDONLY opens shared memory area for read only access by everyone
    O_RDWR opens shared memory area for read and write access
    O_CREAT creates the named shared memory area if it does not exist
    O_EXCL If O_CREAT was also specified, and a shared memory area
        exists with that name, return an error.

mode represents the creation mode for the shared object under
/dev/mshare.

Return Value
------------

mshare() returns a file descriptor. A read from this file descriptor
returns two long values - (1) starting address, and (2) size of the
shared memory area.

Notes
-----

PTEs are shared at pgdir level and hence it imposes following
requirements on the address and size given to the mshare():

    - Starting address must be aligned to pgdir size (512GB on x86_64)
    - Size must be a multiple of pgdir size
    - Any mappings created in this address range at any time become
    shared automatically
    - Shared address range can have unmapped addresses in it. Any
    access to unmapped address will result in SIGBUS

Mappings within this address range behave as if they were shared
between threads, so a write to a MAP_PRIVATE mapping will create a
page which is shared between all the sharers. The first process that
declares an address range mshare'd can continue to map objects in the
shared area. All other processes that want mshare'd access to this
memory area can do so by calling mshare(). After this call, the
address range given by mshare becomes a shared range in its address
space. Anonymous mappings will be shared and not COWed.


mshare_unlink
=============

int mshare_unlink(char *name)

A shared address range created by mshare() can be destroyed using
mshare_unlink() which removes the  shared named object. Once all
processes have unmapped the shared object, the shared address range
references are de-allocated and destroyed.

Return Value
------------

mshare_unlink() returns 0 on success or -1 on error.


Example
=======

A process can create an mshare'd area and map objects into it as
follows:

    fd = mshare("junk",  TB(1), GB(512), O_CREAT|O_RDWR, 0600);

    /* Map objects in the shared address space and/or Write data */

    mshare_unlink("junk");

Another process can then access this shared memory area with another
call to mshare():

    fd = mshare("junk", TB(1), GB(512), O_RDWR, 0600);

    /* Read and write data in TB(1)-((TB(1)+GB(512)) range */

    mshare_unlink("junk");


> 
> According to the abstract of Sileby, I have two questions:
> 1. Would you plan to support the file-mapping memory sharing ? e.g.
> Prog A's 4G memory is backend with 2M hugetlb.

Yes, file-mapped memory sharing support is planned.

> 2. Does each mshare fd only containe one sharing VMA ? For large
> memory process (1T~4T in our env), maybe there is hundreds of memory
> ranges need to be shared, this will take too much fd space if so ?
> 

No, each fd can support all VMAs covered by the address range with a
size that is multiple of pgdir size.

--
Khalid
David Laight Aug. 14, 2021, 8:07 p.m. UTC | #39
...
> > > > Let me describe my use case more clearly (just ignore if you're not
> > > > interested in it):
> > > >
> > > > 1. Prog A mmap() 4GB memory (anon or file-mapping), suppose the
> > > > allocated VA range is [0x40000000,0x140000000)
> > > >
> > > > 2. Prog A specifies [0x48000000,0x50000000) and
> > > > [0x80000000,0x100000000) will be shared by its child.
> > > >
> > > > 3. Prog A fork() Prog B and then Prog B exec() a new ELF binary.
> > > >
> > > > 4. Prog B notice the shared ranges (e.g. by input parameters or
> > > > ...)
> > > > and remap them to a continuous VA range.

Remapping to contiguous VA is going to be difficult in the
general case for (IIRC) VIVT caches.
The required cache coherence may only be attainable by
using uncached mappings.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Longpeng(Mike) Aug. 16, 2021, 12:26 a.m. UTC | #40
Hi David,

在 2021/8/15 4:07, David Laight 写道:
> ...
>>>>> Let me describe my use case more clearly (just ignore if you're not
>>>>> interested in it):
>>>>>
>>>>> 1. Prog A mmap() 4GB memory (anon or file-mapping), suppose the
>>>>> allocated VA range is [0x40000000,0x140000000)
>>>>>
>>>>> 2. Prog A specifies [0x48000000,0x50000000) and
>>>>> [0x80000000,0x100000000) will be shared by its child.
>>>>>
>>>>> 3. Prog A fork() Prog B and then Prog B exec() a new ELF binary.
>>>>>
>>>>> 4. Prog B notice the shared ranges (e.g. by input parameters or
>>>>> ...)
>>>>> and remap them to a continuous VA range.
> 
> Remapping to contiguous VA is going to be difficult in the
> general case for (IIRC) VIVT caches.
> The required cache coherence may only be attainable by
> using uncached mappings.
> 

The Prog B uses mremap() syscall to remapping the shared ranges to other places,
this is a common case for mremap in userspace.
The cache coherence should already be processed in mremap core logic, otherwise
there's maybe something wrong in mremap().

> 	David
> 
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
>
Longpeng(Mike) Aug. 16, 2021, 6:54 a.m. UTC | #41
Hi Khalid,

Thanks for your replay, PSB :)

> -----Original Message-----
> From: Khalid Aziz [mailto:khalid.aziz@oracle.com]
> Sent: Saturday, August 14, 2021 3:49 AM
> To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> <longpeng2@huawei.com>; Matthew Wilcox <willy@infradead.org>
> Cc: Steven Sistare <steven.sistare@oracle.com>; Anthony Yznaga
> <anthony.yznaga@oracle.com>; linux-kernel@vger.kernel.org;
> linux-mm@kvack.org; Gonglei (Arei) <arei.gonglei@huawei.com>
> Subject: Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
> 
> On Tue, 2021-07-13 at 00:57 +0000, Longpeng (Mike, Cloud Infrastructure Service
> Product Dept.) wrote:
> > Hi Matthew,
> >
> > > -----Original Message-----
> > > From: Matthew Wilcox [mailto:willy@infradead.org]
> > > Sent: Monday, July 12, 2021 9:30 AM
> > > To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> > > <longpeng2@huawei.com>
> > > Cc: Steven Sistare <steven.sistare@oracle.com>; Anthony Yznaga
> > > <anthony.yznaga@oracle.com>; linux-kernel@vger.kernel.org;
> > > linux-mm@kvack.org; Gonglei (Arei) <arei.gonglei@huawei.com>
> > > Subject: Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
> > >
> > > On Mon, Jul 12, 2021 at 09:05:45AM +0800, Longpeng (Mike, Cloud
> > > Infrastructure Service Product Dept.) wrote:
> > > > Let me describe my use case more clearly (just ignore if you're
> > > > not interested in it):
> > > >
> > > > 1. Prog A mmap() 4GB memory (anon or file-mapping), suppose the
> > > > allocated VA range is [0x40000000,0x140000000)
> > > >
> > > > 2. Prog A specifies [0x48000000,0x50000000) and
> > > > [0x80000000,0x100000000) will be shared by its child.
> > > >
> > > > 3. Prog A fork() Prog B and then Prog B exec() a new ELF binary.
> > > >
> > > > 4. Prog B notice the shared ranges (e.g. by input parameters or
> > > > ...)
> > > > and remap them to a continuous VA range.
> > >
> > > This is dangerous.  There must be an active step for Prog B to
> > > accept Prog A's ranges into its address space.  Otherwise Prog A
> > > could almost completely fill Prog B's address space and so control
> > > where Prog B places its mappings.  It could also provoke a latent
> > > bug in Prog B if it doesn't handle address space exhaustion
> > > gracefully.
> > >
> > > I had a proposal to handle this.  Would it meet your requirements?
> > > https://lore.kernel.org/lkml/20200730152250.GG23808@casper.infradead
> > > .org/
> >
> > I noticed your proposal of project Sileby and I think it can meet
> > Steven's requirement, but I not sure whether it's suitable for mine
> > because there's no sample code yet, is it in progress ?
> 
> Hi Mike,
> 
> I am working on refining the ideas from project Sileby. I am also working on
> designing the implementation. Since the original concept, the mshare API has
> evolved further. Here is what it loks like:
> 

What's the actual use cases of Sileby ? We can already share anonymous memory 
by memfd.

> The mshare API consists of two system calls - mshare() and
> mshare_unlink()
> 
> mshare
> ======
> 
> int mshare(char *name,void *addr, size_t length, int oflags, mode_t
> mode)
> 
> mshare() creates and opens a new, or opens an existing shared memory area that
> will be shared at PTE level. name refers to shared object name that exists under
> /dev/mshare (this is subject to change. There might be better ways to manage the
> names for mshare'd areas). addr is the starting address of this shared memory
> area and length is the size of this area. oflags can be one of:
> 
>     O_RDONLY opens shared memory area for read only access by everyone
>     O_RDWR opens shared memory area for read and write access
>     O_CREAT creates the named shared memory area if it does not exist
>     O_EXCL If O_CREAT was also specified, and a shared memory area
>         exists with that name, return an error.
> 
> mode represents the creation mode for the shared object under /dev/mshare.
> 
> Return Value
> ------------
> 
> mshare() returns a file descriptor. A read from this file descriptor returns two long
> values - (1) starting address, and (2) size of the shared memory area.
> 
> Notes
> -----
> 
> PTEs are shared at pgdir level and hence it imposes following requirements on the
> address and size given to the mshare():
> 
>     - Starting address must be aligned to pgdir size (512GB on x86_64)

The limitation seems not unreasonable, why ?

>     - Size must be a multiple of pgdir size
>     - Any mappings created in this address range at any time become
>     shared automatically
>     - Shared address range can have unmapped addresses in it. Any
>     access to unmapped address will result in SIGBUS
> 
> Mappings within this address range behave as if they were shared between
> threads, so a write to a MAP_PRIVATE mapping will create a page which is shared
> between all the sharers. The first process that declares an address range mshare'd
> can continue to map objects in the shared area. All other processes that want
> mshare'd access to this memory area can do so by calling mshare(). After this call,
> the address range given by mshare becomes a shared range in its address space.
> Anonymous mappings will be shared and not COWed.
> 
> 
> mshare_unlink
> =============
> 
> int mshare_unlink(char *name)
> 
> A shared address range created by mshare() can be destroyed using
> mshare_unlink() which removes the  shared named object. Once all processes
> have unmapped the shared object, the shared address range references are
> de-allocated and destroyed.
> 
> Return Value
> ------------
> 
> mshare_unlink() returns 0 on success or -1 on error.
> 
> 
> Example
> =======
> 
> A process can create an mshare'd area and map objects into it as
> follows:
> 
>     fd = mshare("junk",  TB(1), GB(512), O_CREAT|O_RDWR, 0600);
> 
>     /* Map objects in the shared address space and/or Write data */
> 
>     mshare_unlink("junk");
> 

Use the name to identify the range seems insecure and looks easy to be attacked.
How about to use the fd ? We can pass the fd to another process who is permit to
access the mshare'd memory.

> Another process can then access this shared memory area with another call to
> mshare():
> 
>     fd = mshare("junk", TB(1), GB(512), O_RDWR, 0600);
> 
>     /* Read and write data in TB(1)-((TB(1)+GB(512)) range */
> 
>     mshare_unlink("junk");
> 
> 
> >
> > According to the abstract of Sileby, I have two questions:
> > 1. Would you plan to support the file-mapping memory sharing ? e.g.
> > Prog A's 4G memory is backend with 2M hugetlb.
> 
> Yes, file-mapped memory sharing support is planned.
> 
> > 2. Does each mshare fd only containe one sharing VMA ? For large
> > memory process (1T~4T in our env), maybe there is hundreds of memory
> > ranges need to be shared, this will take too much fd space if so ?
> >
> 
> No, each fd can support all VMAs covered by the address range with a size that is
> multiple of pgdir size.
> 

I also made a proposal to meet our requirement inside.
Our requirement is:
1. The process A can specify some ranges to be shared by another process B.
2. The ranges can be shared in the order of the specified by process A.
3. The ranges can support more than one VMA, some of the VMAs can be anon and
  others can be file-mapped.

The proposal introduces a char device named /dev/mshare and interacts with the users
by ioctl interfaces. It supports three command currently:
- MSHARE_SIGN: specify the range want to be shared
- MSHARE_INFO: get the info of the shared ranges
- MSHARE_MMAP: map the shared ranges into the address space

Here is an example to show how to use it.

1. Suppose process A want to share four ranges:
  a. 0x1000 ~ 0x2000 --> Anonymous
  b. 0x4000 ~ 0x8000 --> PFN mapped range
  c. 0x200000 ~ 0x400000 --> 2M hugetlb
  d. 0x500000 ~ 0x501000 --> Anonymous


2. Process A write data into these four ranges according the following sequence:
  "b -- c -- a -- d"
  ( So the process B should map the ranges as the same order )


3. Process A specify the ranges:

  FD = open( /dev/msahre );
  ioctl(FD, MSHARE_SIGN, range b);
  ioctl(FD, MSHARE_SIGN, range c);
  ioctl(FD, MSHARE_SIGN, range a);
  ioctl(FD, MSHARE_SIGN, range d);


4. Process A make sure the O_CLOEXEC of FD is CLEAR


5. Process A now fork+exec process B with parameter shared_fd=FD:

  ./bin_B shared_fd=FD


6. Process B get the info by MSHARE_INFO ioctl:

  ioctl( shared_fd, MSHARE_INFO, &info );

  The info includes:
    num: the count of the shared ranges
    totalvm: the total vm size of these shared ranges
    align: align requirement


7. Process B use MSHARE_MMAP to map these ranges into its space:

  p = mmap ( info.totalvm, info.align );
  ioctl(shared_fd, MSHARE_MMAP, {MMAP_ALL, address fixed, p} );


I have wrote a draft for it and it works in local. But I'm not sure the concept is right,
so I'm glad to know that you're working on Sileby, maybe I can refer to the work of yours.


> --
> Khalid
David Hildenbrand Aug. 16, 2021, 8:02 a.m. UTC | #42
On 13.08.21 21:49, Khalid Aziz wrote:
> On Tue, 2021-07-13 at 00:57 +0000, Longpeng (Mike, Cloud Infrastructure
> Service Product Dept.) wrote:
>> Hi Matthew,
>>
>>> -----Original Message-----
>>> From: Matthew Wilcox [mailto:willy@infradead.org]
>>> Sent: Monday, July 12, 2021 9:30 AM
>>> To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
>>> <longpeng2@huawei.com>
>>> Cc: Steven Sistare <steven.sistare@oracle.com>; Anthony Yznaga
>>> <anthony.yznaga@oracle.com>; linux-kernel@vger.kernel.org;
>>> linux-mm@kvack.org; Gonglei (Arei) <arei.gonglei@huawei.com>
>>> Subject: Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
>>>
>>> On Mon, Jul 12, 2021 at 09:05:45AM +0800, Longpeng (Mike, Cloud
>>> Infrastructure Service Product Dept.) wrote:
>>>> Let me describe my use case more clearly (just ignore if you're not
>>>> interested in it):
>>>>
>>>> 1. Prog A mmap() 4GB memory (anon or file-mapping), suppose the
>>>> allocated VA range is [0x40000000,0x140000000)
>>>>
>>>> 2. Prog A specifies [0x48000000,0x50000000) and
>>>> [0x80000000,0x100000000) will be shared by its child.
>>>>
>>>> 3. Prog A fork() Prog B and then Prog B exec() a new ELF binary.
>>>>
>>>> 4. Prog B notice the shared ranges (e.g. by input parameters or
>>>> ...)
>>>> and remap them to a continuous VA range.
>>>
>>> This is dangerous.  There must be an active step for Prog B to accept
>>> Prog A's
>>> ranges into its address space.  Otherwise Prog A could almost
>>> completely fill
>>> Prog B's address space and so control where Prog B places its
>>> mappings.  It
>>> could also provoke a latent bug in Prog B if it doesn't handle
>>> address space
>>> exhaustion gracefully.
>>>
>>> I had a proposal to handle this.  Would it meet your requirements?
>>> https://lore.kernel.org/lkml/20200730152250.GG23808@casper.infradead.org/
>>
>> I noticed your proposal of project Sileby and I think it can meet
>> Steven's requirement, but I not sure whether it's suitable for mine
>> because there's no sample code yet, is it in progress ?
> 
> Hi Mike,
> 
> I am working on refining the ideas from project Sileby. I am also
> working on designing the implementation. Since the original concept,
> the mshare API has evolved further. Here is what it loks like:
> 
> The mshare API consists of two system calls - mshare() and
> mshare_unlink()
> 
> mshare
> ======
> 
> int mshare(char *name,void *addr, size_t length, int oflags, mode_t
> mode)
> 
> mshare() creates and opens a new, or opens an existing shared memory
> area that will be shared at PTE level. name refers to shared object
> name that exists under /dev/mshare (this is subject to change. There
> might be better ways to manage the names for mshare'd areas). addr is
> the starting address of this shared memory area and length is the size
> of this area. oflags can be one of:
> 
>      O_RDONLY opens shared memory area for read only access by everyone
>      O_RDWR opens shared memory area for read and write access
>      O_CREAT creates the named shared memory area if it does not exist
>      O_EXCL If O_CREAT was also specified, and a shared memory area
>          exists with that name, return an error.
> 
> mode represents the creation mode for the shared object under
> /dev/mshare.
> 
> Return Value
> ------------
> 
> mshare() returns a file descriptor. A read from this file descriptor
> returns two long values - (1) starting address, and (2) size of the
> shared memory area.
> 
> Notes
> -----
> 
> PTEs are shared at pgdir level and hence it imposes following
> requirements on the address and size given to the mshare():
> 
>      - Starting address must be aligned to pgdir size (512GB on x86_64)
>      - Size must be a multiple of pgdir size
>      - Any mappings created in this address range at any time become
>      shared automatically
>      - Shared address range can have unmapped addresses in it. Any
>      access to unmapped address will result in SIGBUS
> 
> Mappings within this address range behave as if they were shared
> between threads, so a write to a MAP_PRIVATE mapping will create a
> page which is shared between all the sharers. The first process that
> declares an address range mshare'd can continue to map objects in the
> shared area. All other processes that want mshare'd access to this
> memory area can do so by calling mshare(). After this call, the
> address range given by mshare becomes a shared range in its address
> space. Anonymous mappings will be shared and not COWed.

Did I understand correctly that you want to share actual page tables 
between processes and consequently different MMs? That sounds like a 
very bad idea.
David Laight Aug. 16, 2021, 8:07 a.m. UTC | #43
From: Longpeng
> Sent: 16 August 2021 01:26
> Hi David,
> 
> 在 2021/8/15 4:07, David Laight 写道:
> > ...
> >>>>> Let me describe my use case more clearly (just ignore if you're not
> >>>>> interested in it):
> >>>>>
> >>>>> 1. Prog A mmap() 4GB memory (anon or file-mapping), suppose the
> >>>>> allocated VA range is [0x40000000,0x140000000)
> >>>>>
> >>>>> 2. Prog A specifies [0x48000000,0x50000000) and
> >>>>> [0x80000000,0x100000000) will be shared by its child.
> >>>>>
> >>>>> 3. Prog A fork() Prog B and then Prog B exec() a new ELF binary.
> >>>>>
> >>>>> 4. Prog B notice the shared ranges (e.g. by input parameters or
> >>>>> ...)
> >>>>> and remap them to a continuous VA range.
> >
> > Remapping to contiguous VA is going to be difficult in the
> > general case for (IIRC) VIVT caches.
> > The required cache coherence may only be attainable by
> > using uncached mappings.
> >
> 
> The Prog B uses mremap() syscall to remapping the shared ranges to other places,
> this is a common case for mremap in userspace.
> The cache coherence should already be processed in mremap core logic, otherwise
> there's maybe something wrong in mremap().

Maybe it does, and probably mremap() makes it work.
But with VIVT caches if a pages gets mapped in two processes
at the same time at different offsets into the cache then
then both mappings end up being uncached.
This was always a problem with normal file mmap.
I don't know if Linux manages to pick the VA after finding the
page has another mapping - SVR4 didn't.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Matthew Wilcox (Oracle) Aug. 16, 2021, 12:07 p.m. UTC | #44
On Mon, Aug 16, 2021 at 10:02:22AM +0200, David Hildenbrand wrote:
> > Mappings within this address range behave as if they were shared
> > between threads, so a write to a MAP_PRIVATE mapping will create a
> > page which is shared between all the sharers. The first process that
> > declares an address range mshare'd can continue to map objects in the
> > shared area. All other processes that want mshare'd access to this
> > memory area can do so by calling mshare(). After this call, the
> > address range given by mshare becomes a shared range in its address
> > space. Anonymous mappings will be shared and not COWed.
> 
> Did I understand correctly that you want to share actual page tables between
> processes and consequently different MMs? That sounds like a very bad idea.

That is the entire point.  Consider a machine with 10,000 instances
of an application running (process model, not thread model).  If each
application wants to map 1TB of RAM using 2MB pages, that's 4MB of page
tables per process or 40GB of RAM for the whole machine.

There's a reason hugetlbfs was enhanced to allow this page table sharing.
I'm not a fan of the implementation as it gets some locks upside down,
so this is an attempt to generalise the concept beyond hugetlbfs.

Think of it like partial threading.  You get to share some parts, but not
all, of your address space with your fellow processes.  Obviously you
don't want to expose this to random other processes, only to other
instances of yourself being run as the same user.
David Hildenbrand Aug. 16, 2021, 12:20 p.m. UTC | #45
On 16.08.21 14:07, Matthew Wilcox wrote:
> On Mon, Aug 16, 2021 at 10:02:22AM +0200, David Hildenbrand wrote:
>>> Mappings within this address range behave as if they were shared
>>> between threads, so a write to a MAP_PRIVATE mapping will create a
>>> page which is shared between all the sharers. The first process that
>>> declares an address range mshare'd can continue to map objects in the
>>> shared area. All other processes that want mshare'd access to this
>>> memory area can do so by calling mshare(). After this call, the
>>> address range given by mshare becomes a shared range in its address
>>> space. Anonymous mappings will be shared and not COWed.
>>
>> Did I understand correctly that you want to share actual page tables between
>> processes and consequently different MMs? That sounds like a very bad idea.
> 
> That is the entire point.  Consider a machine with 10,000 instances
> of an application running (process model, not thread model).  If each
> application wants to map 1TB of RAM using 2MB pages, that's 4MB of page
> tables per process or 40GB of RAM for the whole machine.

What speaks against 1 GB pages then?

> 
> There's a reason hugetlbfs was enhanced to allow this page table sharing.
> I'm not a fan of the implementation as it gets some locks upside down,
> so this is an attempt to generalise the concept beyond hugetlbfs.

Who do we account the page tables to? What are MADV_DONTNEED semantics? 
Who cleans up the page tables? What happens during munmap? How does the 
rmap even work? How to we actually synchronize page table walkers?

See how hugetlbfs just doesn't raise these problems because we are 
sharing pages and not page tables?

TBH, I quite dislike just thinking about sharing page tables between 
processes.

> 
> Think of it like partial threading.  You get to share some parts, but not
> all, of your address space with your fellow processes.  Obviously you
> don't want to expose this to random other processes, only to other
> instances of yourself being run as the same user.

Sounds like a nice way to over-complicate MM to optimize for some 
special use cases. I know, I'm probably wrong. :)
David Hildenbrand Aug. 16, 2021, 12:27 p.m. UTC | #46
On 16.08.21 14:07, Matthew Wilcox wrote:
> On Mon, Aug 16, 2021 at 10:02:22AM +0200, David Hildenbrand wrote:
>>> Mappings within this address range behave as if they were shared
>>> between threads, so a write to a MAP_PRIVATE mapping will create a
>>> page which is shared between all the sharers. The first process that
>>> declares an address range mshare'd can continue to map objects in the
>>> shared area. All other processes that want mshare'd access to this
>>> memory area can do so by calling mshare(). After this call, the
>>> address range given by mshare becomes a shared range in its address
>>> space. Anonymous mappings will be shared and not COWed.
>>
>> Did I understand correctly that you want to share actual page tables between
>> processes and consequently different MMs? That sounds like a very bad idea.
> 
> That is the entire point.  Consider a machine with 10,000 instances
> of an application running (process model, not thread model).  If each
> application wants to map 1TB of RAM using 2MB pages, that's 4MB of page
> tables per process or 40GB of RAM for the whole machine.

Note that I am working on asynchronous reclaim of page tables, whereby I 
would even reclaim !anonymous page tables under memory pressure.

Assuming your processes don't touch all memory all the time of course 
... of course, it's a research project and will still require quite some 
work because devil is in the detail (locking).
David Hildenbrand Aug. 16, 2021, 12:30 p.m. UTC | #47
On 16.08.21 14:27, David Hildenbrand wrote:
> On 16.08.21 14:07, Matthew Wilcox wrote:
>> On Mon, Aug 16, 2021 at 10:02:22AM +0200, David Hildenbrand wrote:
>>>> Mappings within this address range behave as if they were shared
>>>> between threads, so a write to a MAP_PRIVATE mapping will create a
>>>> page which is shared between all the sharers. The first process that
>>>> declares an address range mshare'd can continue to map objects in the
>>>> shared area. All other processes that want mshare'd access to this
>>>> memory area can do so by calling mshare(). After this call, the
>>>> address range given by mshare becomes a shared range in its address
>>>> space. Anonymous mappings will be shared and not COWed.
>>>
>>> Did I understand correctly that you want to share actual page tables between
>>> processes and consequently different MMs? That sounds like a very bad idea.
>>
>> That is the entire point.  Consider a machine with 10,000 instances
>> of an application running (process model, not thread model).  If each
>> application wants to map 1TB of RAM using 2MB pages, that's 4MB of page
>> tables per process or 40GB of RAM for the whole machine.
> 
> Note that I am working on asynchronous reclaim of page tables, whereby I
> would even reclaim !anonymous page tables under memory pressure.
> 
> Assuming your processes don't touch all memory all the time of course
> ... of course, it's a research project and will still require quite some
> work because devil is in the detail (locking).
> 

Well, that comment did turn out not-so-private, so it's certainly a good 
discussion-starter.
David Hildenbrand Aug. 16, 2021, 12:42 p.m. UTC | #48
On 16.08.21 14:20, David Hildenbrand wrote:
> On 16.08.21 14:07, Matthew Wilcox wrote:
>> On Mon, Aug 16, 2021 at 10:02:22AM +0200, David Hildenbrand wrote:
>>>> Mappings within this address range behave as if they were shared
>>>> between threads, so a write to a MAP_PRIVATE mapping will create a
>>>> page which is shared between all the sharers. The first process that
>>>> declares an address range mshare'd can continue to map objects in the
>>>> shared area. All other processes that want mshare'd access to this
>>>> memory area can do so by calling mshare(). After this call, the
>>>> address range given by mshare becomes a shared range in its address
>>>> space. Anonymous mappings will be shared and not COWed.
>>>
>>> Did I understand correctly that you want to share actual page tables between
>>> processes and consequently different MMs? That sounds like a very bad idea.
>>
>> That is the entire point.  Consider a machine with 10,000 instances
>> of an application running (process model, not thread model).  If each
>> application wants to map 1TB of RAM using 2MB pages, that's 4MB of page
>> tables per process or 40GB of RAM for the whole machine.
> 
> What speaks against 1 GB pages then?
> 
>>
>> There's a reason hugetlbfs was enhanced to allow this page table sharing.
>> I'm not a fan of the implementation as it gets some locks upside down,
>> so this is an attempt to generalise the concept beyond hugetlbfs.
> 
> Who do we account the page tables to? What are MADV_DONTNEED semantics?
> Who cleans up the page tables? What happens during munmap? How does the
> rmap even work? How to we actually synchronize page table walkers?
> 
> See how hugetlbfs just doesn't raise these problems because we are
> sharing pages and not page tables?

I found what you were referring to: CONFIG_ARCH_WANT_HUGE_PMD_SHARE

I was not aware that we have such a monstrosity in the kernel.
Matthew Wilcox (Oracle) Aug. 16, 2021, 12:46 p.m. UTC | #49
On Mon, Aug 16, 2021 at 02:20:43PM +0200, David Hildenbrand wrote:
> On 16.08.21 14:07, Matthew Wilcox wrote:
> > On Mon, Aug 16, 2021 at 10:02:22AM +0200, David Hildenbrand wrote:
> > > > Mappings within this address range behave as if they were shared
> > > > between threads, so a write to a MAP_PRIVATE mapping will create a
> > > > page which is shared between all the sharers. The first process that
> > > > declares an address range mshare'd can continue to map objects in the
> > > > shared area. All other processes that want mshare'd access to this
> > > > memory area can do so by calling mshare(). After this call, the
> > > > address range given by mshare becomes a shared range in its address
> > > > space. Anonymous mappings will be shared and not COWed.
> > > 
> > > Did I understand correctly that you want to share actual page tables between
> > > processes and consequently different MMs? That sounds like a very bad idea.
> > 
> > That is the entire point.  Consider a machine with 10,000 instances
> > of an application running (process model, not thread model).  If each
> > application wants to map 1TB of RAM using 2MB pages, that's 4MB of page
> > tables per process or 40GB of RAM for the whole machine.
> 
> What speaks against 1 GB pages then?

Until recently, the CPUs only having 4 1GB TLB entries.  I'm sure we
still have customers using that generation of CPUs.  2MB pages perform
better than 1GB pages on the previous generation of hardware, and I
haven't seen numbers for the next generation yet.

> > There's a reason hugetlbfs was enhanced to allow this page table sharing.
> > I'm not a fan of the implementation as it gets some locks upside down,
> > so this is an attempt to generalise the concept beyond hugetlbfs.
> 
> Who do we account the page tables to? What are MADV_DONTNEED semantics? Who
> cleans up the page tables? What happens during munmap? How does the rmap
> even work? How to we actually synchronize page table walkers?
> 
> See how hugetlbfs just doesn't raise these problems because we are sharing
> pages and not page tables?

No, really, hugetlbfs shares page tables already.  You just didn't
notice that yet.

> > Think of it like partial threading.  You get to share some parts, but not
> > all, of your address space with your fellow processes.  Obviously you
> > don't want to expose this to random other processes, only to other
> > instances of yourself being run as the same user.
> 
> Sounds like a nice way to over-complicate MM to optimize for some special
> use cases. I know, I'm probably wrong. :)

It's really not as bad as you seem to think it is.
David Hildenbrand Aug. 16, 2021, 1:24 p.m. UTC | #50
On 16.08.21 14:46, Matthew Wilcox wrote:
> On Mon, Aug 16, 2021 at 02:20:43PM +0200, David Hildenbrand wrote:
>> On 16.08.21 14:07, Matthew Wilcox wrote:
>>> On Mon, Aug 16, 2021 at 10:02:22AM +0200, David Hildenbrand wrote:
>>>>> Mappings within this address range behave as if they were shared
>>>>> between threads, so a write to a MAP_PRIVATE mapping will create a
>>>>> page which is shared between all the sharers. The first process that
>>>>> declares an address range mshare'd can continue to map objects in the
>>>>> shared area. All other processes that want mshare'd access to this
>>>>> memory area can do so by calling mshare(). After this call, the
>>>>> address range given by mshare becomes a shared range in its address
>>>>> space. Anonymous mappings will be shared and not COWed.
>>>>
>>>> Did I understand correctly that you want to share actual page tables between
>>>> processes and consequently different MMs? That sounds like a very bad idea.
>>>
>>> That is the entire point.  Consider a machine with 10,000 instances
>>> of an application running (process model, not thread model).  If each
>>> application wants to map 1TB of RAM using 2MB pages, that's 4MB of page
>>> tables per process or 40GB of RAM for the whole machine.
>>
>> What speaks against 1 GB pages then?
> 
> Until recently, the CPUs only having 4 1GB TLB entries.  I'm sure we
> still have customers using that generation of CPUs.  2MB pages perform
> better than 1GB pages on the previous generation of hardware, and I
> haven't seen numbers for the next generation yet.

I read that somewhere else before, yet we have heavy 1 GiB page users, 
especially in the context of VMs and DPDK.

> 
>>> There's a reason hugetlbfs was enhanced to allow this page table sharing.
>>> I'm not a fan of the implementation as it gets some locks upside down,
>>> so this is an attempt to generalise the concept beyond hugetlbfs.
>>
>> Who do we account the page tables to? What are MADV_DONTNEED semantics? Who
>> cleans up the page tables? What happens during munmap? How does the rmap
>> even work? How to we actually synchronize page table walkers?
>>
>> See how hugetlbfs just doesn't raise these problems because we are sharing
>> pages and not page tables?
> 
> No, really, hugetlbfs shares page tables already.  You just didn't
> notice that yet.

So, it only works for hugetlbfs in case uffd is not in place (-> no 
per-process data in the page table) and we have an actual shared 
mappings. When unsharing, we zap the PUD entry, which will result in 
allocating a per-process page table on next fault.

I will rephrase my previous statement "hugetlbfs just doesn't raise 
these problems because we are special casing it all over the place 
already". For example, not allowing to swap such pages. Disallowing 
MADV_DONTNEED. Special hugetlbfs locking.
Matthew Wilcox (Oracle) Aug. 16, 2021, 1:32 p.m. UTC | #51
On Mon, Aug 16, 2021 at 03:24:38PM +0200, David Hildenbrand wrote:
> On 16.08.21 14:46, Matthew Wilcox wrote:
> > On Mon, Aug 16, 2021 at 02:20:43PM +0200, David Hildenbrand wrote:
> > > On 16.08.21 14:07, Matthew Wilcox wrote:
> > > > On Mon, Aug 16, 2021 at 10:02:22AM +0200, David Hildenbrand wrote:
> > > > > > Mappings within this address range behave as if they were shared
> > > > > > between threads, so a write to a MAP_PRIVATE mapping will create a
> > > > > > page which is shared between all the sharers. The first process that
> > > > > > declares an address range mshare'd can continue to map objects in the
> > > > > > shared area. All other processes that want mshare'd access to this
> > > > > > memory area can do so by calling mshare(). After this call, the
> > > > > > address range given by mshare becomes a shared range in its address
> > > > > > space. Anonymous mappings will be shared and not COWed.
> > > > > 
> > > > > Did I understand correctly that you want to share actual page tables between
> > > > > processes and consequently different MMs? That sounds like a very bad idea.
> > > > 
> > > > That is the entire point.  Consider a machine with 10,000 instances
> > > > of an application running (process model, not thread model).  If each
> > > > application wants to map 1TB of RAM using 2MB pages, that's 4MB of page
> > > > tables per process or 40GB of RAM for the whole machine.
> > > 
> > > What speaks against 1 GB pages then?
> > 
> > Until recently, the CPUs only having 4 1GB TLB entries.  I'm sure we
> > still have customers using that generation of CPUs.  2MB pages perform
> > better than 1GB pages on the previous generation of hardware, and I
> > haven't seen numbers for the next generation yet.
> 
> I read that somewhere else before, yet we have heavy 1 GiB page users,
> especially in the context of VMs and DPDK.

I wonder if those users actually benchmarked.  Or whether the memory
savings worked out so well for them that the loss of TLB performance
didn't matter.

> So, it only works for hugetlbfs in case uffd is not in place (-> no
> per-process data in the page table) and we have an actual shared mappings.
> When unsharing, we zap the PUD entry, which will result in allocating a
> per-process page table on next fault.

I think uffd was a huge mistake.  It should have been a filesystem
instead of a hack on the side of anonymous memory.

> I will rephrase my previous statement "hugetlbfs just doesn't raise these
> problems because we are special casing it all over the place already". For
> example, not allowing to swap such pages. Disallowing MADV_DONTNEED. Special
> hugetlbfs locking.

Sure, that's why I want to drag this feature out of "oh this is a
hugetlb special case" and into "this is something Linux supports".
David Hildenbrand Aug. 16, 2021, 2:10 p.m. UTC | #52
>>> Until recently, the CPUs only having 4 1GB TLB entries.  I'm sure we
>>> still have customers using that generation of CPUs.  2MB pages perform
>>> better than 1GB pages on the previous generation of hardware, and I
>>> haven't seen numbers for the next generation yet.
>>
>> I read that somewhere else before, yet we have heavy 1 GiB page users,
>> especially in the context of VMs and DPDK.
> 
> I wonder if those users actually benchmarked.  Or whether the memory
> savings worked out so well for them that the loss of TLB performance
> didn't matter.

These applications are extremely performance sensitive (i.e., RT 
workloads), that's why I'm wondering. I recall that they are most 
certainly using more than 4 GiB memory in real applications.

E.g., the doc [1] even has a note that "For 64-bit applications, it is 
recommended to use 1 GB hugepages if the platform supports them."


[1] https://doc.dpdk.org/guides-16.04/linux_gsg/sys_reqs.html

> 
>> So, it only works for hugetlbfs in case uffd is not in place (-> no
>> per-process data in the page table) and we have an actual shared mappings.
>> When unsharing, we zap the PUD entry, which will result in allocating a
>> per-process page table on next fault.
> 
> I think uffd was a huge mistake.  It should have been a filesystem
> instead of a hack on the side of anonymous memory.

Yes it was. Especially, looking at all the special-casing, for example, 
even in mm/pagewalk.c.

> 
>> I will rephrase my previous statement "hugetlbfs just doesn't raise these
>> problems because we are special casing it all over the place already". For
>> example, not allowing to swap such pages. Disallowing MADV_DONTNEED. Special
>> hugetlbfs locking.
> 
> Sure, that's why I want to drag this feature out of "oh this is a
> hugetlb special case" and into "this is something Linux supports".

I would have understood the move to optimize SHMEM internally - similar 
to how we seem to optimize hugetlbfs SHMEM right now internally. 
(although sharing page tables for shmem can still be quite tricky)

I did not follow why we have to play games with MAP_PRIVATE, and having 
private anonymous pages shared between processes that don't COW, 
introducing new syscalls etc.
Matthew Wilcox (Oracle) Aug. 16, 2021, 2:27 p.m. UTC | #53
On Mon, Aug 16, 2021 at 04:10:28PM +0200, David Hildenbrand wrote:
> > > > Until recently, the CPUs only having 4 1GB TLB entries.  I'm sure we
> > > > still have customers using that generation of CPUs.  2MB pages perform
> > > > better than 1GB pages on the previous generation of hardware, and I
> > > > haven't seen numbers for the next generation yet.
> > > 
> > > I read that somewhere else before, yet we have heavy 1 GiB page users,
> > > especially in the context of VMs and DPDK.
> > 
> > I wonder if those users actually benchmarked.  Or whether the memory
> > savings worked out so well for them that the loss of TLB performance
> > didn't matter.
> 
> These applications are extremely performance sensitive (i.e., RT workloads),

"real time does not mean real fast".  it means predictable latency.

> > > I will rephrase my previous statement "hugetlbfs just doesn't raise these
> > > problems because we are special casing it all over the place already". For
> > > example, not allowing to swap such pages. Disallowing MADV_DONTNEED. Special
> > > hugetlbfs locking.
> > 
> > Sure, that's why I want to drag this feature out of "oh this is a
> > hugetlb special case" and into "this is something Linux supports".
> 
> I would have understood the move to optimize SHMEM internally - similar to
> how we seem to optimize hugetlbfs SHMEM right now internally. (although
> sharing page tables for shmem can still be quite tricky)
> 
> I did not follow why we have to play games with MAP_PRIVATE, and having
> private anonymous pages shared between processes that don't COW, introducing
> new syscalls etc.

It's not about SHMEM, it's about file-backed pages on regular
filesystems.  I don't want to have XFS, ext4 and btrfs all with their
own implementations of ARCH_WANT_HUGE_PMD_SHARE.
David Hildenbrand Aug. 16, 2021, 2:33 p.m. UTC | #54
On 16.08.21 16:27, Matthew Wilcox wrote:
> On Mon, Aug 16, 2021 at 04:10:28PM +0200, David Hildenbrand wrote:
>>>>> Until recently, the CPUs only having 4 1GB TLB entries.  I'm sure we
>>>>> still have customers using that generation of CPUs.  2MB pages perform
>>>>> better than 1GB pages on the previous generation of hardware, and I
>>>>> haven't seen numbers for the next generation yet.
>>>>
>>>> I read that somewhere else before, yet we have heavy 1 GiB page users,
>>>> especially in the context of VMs and DPDK.
>>>
>>> I wonder if those users actually benchmarked.  Or whether the memory
>>> savings worked out so well for them that the loss of TLB performance
>>> didn't matter.
>>
>> These applications are extremely performance sensitive (i.e., RT workloads),
> 
> "real time does not mean real fast".  it means predictable latency.

I know, but that doesn't explain why you would use 2 MB vs 1 GiB.

(most of these applications want also a low predictable latency)

> 
>>>> I will rephrase my previous statement "hugetlbfs just doesn't raise these
>>>> problems because we are special casing it all over the place already". For
>>>> example, not allowing to swap such pages. Disallowing MADV_DONTNEED. Special
>>>> hugetlbfs locking.
>>>
>>> Sure, that's why I want to drag this feature out of "oh this is a
>>> hugetlb special case" and into "this is something Linux supports".
>>
>> I would have understood the move to optimize SHMEM internally - similar to
>> how we seem to optimize hugetlbfs SHMEM right now internally. (although
>> sharing page tables for shmem can still be quite tricky)
>>
>> I did not follow why we have to play games with MAP_PRIVATE, and having
>> private anonymous pages shared between processes that don't COW, introducing
>> new syscalls etc.
> 
> It's not about SHMEM, it's about file-backed pages on regular
> filesystems.  I don't want to have XFS, ext4 and btrfs all with their
> own implementations of ARCH_WANT_HUGE_PMD_SHARE.

Let me ask this way: why do we have to play such games with MAP_PRIVATE?
Matthew Wilcox (Oracle) Aug. 16, 2021, 2:40 p.m. UTC | #55
On Mon, Aug 16, 2021 at 04:33:09PM +0200, David Hildenbrand wrote:
> > > I did not follow why we have to play games with MAP_PRIVATE, and having
> > > private anonymous pages shared between processes that don't COW, introducing
> > > new syscalls etc.
> > 
> > It's not about SHMEM, it's about file-backed pages on regular
> > filesystems.  I don't want to have XFS, ext4 and btrfs all with their
> > own implementations of ARCH_WANT_HUGE_PMD_SHARE.
> 
> Let me ask this way: why do we have to play such games with MAP_PRIVATE?

Are you referring to this?

: Mappings within this address range behave as if they were shared
: between threads, so a write to a MAP_PRIVATE mapping will create a
: page which is shared between all the sharers.

If so, that's a misunderstanding, because there are no games being played.
What Khalid's saying there is that because the page tables are already
shared for that range of address space, the COW of a MAP_PRIVATE will
create a new page, but that page will be shared between all the sharers.
The second write to a MAP_PRIVATE page (by any of the sharers) will not
create a COW situation.  Just like if all the sharers were threads of
the same process.
David Hildenbrand Aug. 16, 2021, 3:01 p.m. UTC | #56
On 16.08.21 16:40, Matthew Wilcox wrote:
> On Mon, Aug 16, 2021 at 04:33:09PM +0200, David Hildenbrand wrote:
>>>> I did not follow why we have to play games with MAP_PRIVATE, and having
>>>> private anonymous pages shared between processes that don't COW, introducing
>>>> new syscalls etc.
>>>
>>> It's not about SHMEM, it's about file-backed pages on regular
>>> filesystems.  I don't want to have XFS, ext4 and btrfs all with their
>>> own implementations of ARCH_WANT_HUGE_PMD_SHARE.
>>
>> Let me ask this way: why do we have to play such games with MAP_PRIVATE?
> 
> Are you referring to this?

Yes

> 
> : Mappings within this address range behave as if they were shared
> : between threads, so a write to a MAP_PRIVATE mapping will create a
> : page which is shared between all the sharers.
> 
> If so, that's a misunderstanding, because there are no games being played.
> What Khalid's saying there is that because the page tables are already
> shared for that range of address space, the COW of a MAP_PRIVATE will
> create a new page, but that page will be shared between all the sharers.
> The second write to a MAP_PRIVATE page (by any of the sharers) will not
> create a COW situation.  Just like if all the sharers were threads of
> the same process.
> 

It actually seems to be just like I understood it. We'll have multiple 
processes share anonymous pages writable, even though they are not using 
shared memory.

IMHO, sharing page tables to optimize for something kernel-internal 
(page table consumption) should be completely transparent to user space. 
Just like ARCH_WANT_HUGE_PMD_SHARE currently is unless I am missing 
something important.

The VM_MAYSHARE check in want_pmd_share()->vma_shareable() makes me 
assume that we really only optimize for MAP_SHARED right now, never for 
MAP_PRIVATE.
Matthew Wilcox (Oracle) Aug. 16, 2021, 3:59 p.m. UTC | #57
On Mon, Aug 16, 2021 at 05:01:44PM +0200, David Hildenbrand wrote:
> On 16.08.21 16:40, Matthew Wilcox wrote:
> > On Mon, Aug 16, 2021 at 04:33:09PM +0200, David Hildenbrand wrote:
> > > > > I did not follow why we have to play games with MAP_PRIVATE, and having
> > > > > private anonymous pages shared between processes that don't COW, introducing
> > > > > new syscalls etc.
> > > > 
> > > > It's not about SHMEM, it's about file-backed pages on regular
> > > > filesystems.  I don't want to have XFS, ext4 and btrfs all with their
> > > > own implementations of ARCH_WANT_HUGE_PMD_SHARE.
> > > 
> > > Let me ask this way: why do we have to play such games with MAP_PRIVATE?
> > 
> > : Mappings within this address range behave as if they were shared
> > : between threads, so a write to a MAP_PRIVATE mapping will create a
> > : page which is shared between all the sharers.
> > 
> > If so, that's a misunderstanding, because there are no games being played.
> > What Khalid's saying there is that because the page tables are already
> > shared for that range of address space, the COW of a MAP_PRIVATE will
> > create a new page, but that page will be shared between all the sharers.
> > The second write to a MAP_PRIVATE page (by any of the sharers) will not
> > create a COW situation.  Just like if all the sharers were threads of
> > the same process.
> > 
> 
> It actually seems to be just like I understood it. We'll have multiple
> processes share anonymous pages writable, even though they are not using
> shared memory.
> 
> IMHO, sharing page tables to optimize for something kernel-internal (page
> table consumption) should be completely transparent to user space. Just like
> ARCH_WANT_HUGE_PMD_SHARE currently is unless I am missing something
> important.
> 
> The VM_MAYSHARE check in want_pmd_share()->vma_shareable() makes me assume
> that we really only optimize for MAP_SHARED right now, never for
> MAP_PRIVATE.

It's definitely *not* about being transparent to userspace.  It's about
giving userspace new functionality where multiple processes can choose
to share a portion of their address space with each other.  What any
process changes in that range changes, every sharing process sees.
mmap(), munmap(), mprotect(), mremap(), everything.
Khalid Aziz Aug. 16, 2021, 4:06 p.m. UTC | #58
On 8/16/21 9:59 AM, Matthew Wilcox wrote:
> On Mon, Aug 16, 2021 at 05:01:44PM +0200, David Hildenbrand wrote:
>> On 16.08.21 16:40, Matthew Wilcox wrote:
>>> On Mon, Aug 16, 2021 at 04:33:09PM +0200, David Hildenbrand wrote:
>>>>>> I did not follow why we have to play games with MAP_PRIVATE, and having
>>>>>> private anonymous pages shared between processes that don't COW, introducing
>>>>>> new syscalls etc.
>>>>>
>>>>> It's not about SHMEM, it's about file-backed pages on regular
>>>>> filesystems.  I don't want to have XFS, ext4 and btrfs all with their
>>>>> own implementations of ARCH_WANT_HUGE_PMD_SHARE.
>>>>
>>>> Let me ask this way: why do we have to play such games with MAP_PRIVATE?
>>>
>>> : Mappings within this address range behave as if they were shared
>>> : between threads, so a write to a MAP_PRIVATE mapping will create a
>>> : page which is shared between all the sharers.
>>>
>>> If so, that's a misunderstanding, because there are no games being played.
>>> What Khalid's saying there is that because the page tables are already
>>> shared for that range of address space, the COW of a MAP_PRIVATE will
>>> create a new page, but that page will be shared between all the sharers.
>>> The second write to a MAP_PRIVATE page (by any of the sharers) will not
>>> create a COW situation.  Just like if all the sharers were threads of
>>> the same process.
>>>
>>
>> It actually seems to be just like I understood it. We'll have multiple
>> processes share anonymous pages writable, even though they are not using
>> shared memory.
>>
>> IMHO, sharing page tables to optimize for something kernel-internal (page
>> table consumption) should be completely transparent to user space. Just like
>> ARCH_WANT_HUGE_PMD_SHARE currently is unless I am missing something
>> important.
>>
>> The VM_MAYSHARE check in want_pmd_share()->vma_shareable() makes me assume
>> that we really only optimize for MAP_SHARED right now, never for
>> MAP_PRIVATE.
> 
> It's definitely *not* about being transparent to userspace.  It's about
> giving userspace new functionality where multiple processes can choose
> to share a portion of their address space with each other.  What any
> process changes in that range changes, every sharing process sees.
> mmap(), munmap(), mprotect(), mremap(), everything.
> 

Exactly and to further elaborate, once a process calls mshare() to declare its intent to share PTEs for a range of 
address and another process accepts that sharing by calling mshare() itself, the two (or more) processes have agreed to 
share PTEs for that entire address range. A MAP_PRIVATE mapping in this address range goes against the original intent 
of sharing and what we are saying is the original intent of sharing takes precedence in case of this conflict.

--
Khalid
David Hildenbrand Aug. 16, 2021, 4:13 p.m. UTC | #59
On 16.08.21 17:59, Matthew Wilcox wrote:
> On Mon, Aug 16, 2021 at 05:01:44PM +0200, David Hildenbrand wrote:
>> On 16.08.21 16:40, Matthew Wilcox wrote:
>>> On Mon, Aug 16, 2021 at 04:33:09PM +0200, David Hildenbrand wrote:
>>>>>> I did not follow why we have to play games with MAP_PRIVATE, and having
>>>>>> private anonymous pages shared between processes that don't COW, introducing
>>>>>> new syscalls etc.
>>>>>
>>>>> It's not about SHMEM, it's about file-backed pages on regular
>>>>> filesystems.  I don't want to have XFS, ext4 and btrfs all with their
>>>>> own implementations of ARCH_WANT_HUGE_PMD_SHARE.
>>>>
>>>> Let me ask this way: why do we have to play such games with MAP_PRIVATE?
>>>
>>> : Mappings within this address range behave as if they were shared
>>> : between threads, so a write to a MAP_PRIVATE mapping will create a
>>> : page which is shared between all the sharers.
>>>
>>> If so, that's a misunderstanding, because there are no games being played.
>>> What Khalid's saying there is that because the page tables are already
>>> shared for that range of address space, the COW of a MAP_PRIVATE will
>>> create a new page, but that page will be shared between all the sharers.
>>> The second write to a MAP_PRIVATE page (by any of the sharers) will not
>>> create a COW situation.  Just like if all the sharers were threads of
>>> the same process.
>>>
>>
>> It actually seems to be just like I understood it. We'll have multiple
>> processes share anonymous pages writable, even though they are not using
>> shared memory.
>>
>> IMHO, sharing page tables to optimize for something kernel-internal (page
>> table consumption) should be completely transparent to user space. Just like
>> ARCH_WANT_HUGE_PMD_SHARE currently is unless I am missing something
>> important.
>>
>> The VM_MAYSHARE check in want_pmd_share()->vma_shareable() makes me assume
>> that we really only optimize for MAP_SHARED right now, never for
>> MAP_PRIVATE.
> 
> It's definitely *not* about being transparent to userspace.  It's about
> giving userspace new functionality where multiple processes can choose
> to share a portion of their address space with each other.  What any
> process changes in that range changes, every sharing process sees.
> mmap(), munmap(), mprotect(), mremap(), everything.

Oh okay, so it's actually much more complicated and complex than I 
thought. Thanks for clarifying that! I recall virtiofsd had similar 
requirements for sharing memory with the QEMU main process, I might be 
wrong.

"existing shared memory area" and your initial page table example made 
me assume that we are simply dealing with sharing page tables of MAP_SHARED.

It's actually something like a VMA container that you share between 
processes. And whatever VMAs are currently inside that VMA container is 
mirrored to other processes. I assume sharing page tables could actually 
be an implementation detail, especially when keeping MAP_PRIVATE 
(confusing in that context!) and other features that will give you 
surprises (uffd) out of the picture.
Matthew Wilcox (Oracle) Aug. 16, 2021, 4:15 p.m. UTC | #60
On Mon, Aug 16, 2021 at 10:06:47AM -0600, Khalid Aziz wrote:
> On 8/16/21 9:59 AM, Matthew Wilcox wrote:
> > On Mon, Aug 16, 2021 at 05:01:44PM +0200, David Hildenbrand wrote:
> > > On 16.08.21 16:40, Matthew Wilcox wrote:
> > > > On Mon, Aug 16, 2021 at 04:33:09PM +0200, David Hildenbrand wrote:
> > > > > > > I did not follow why we have to play games with MAP_PRIVATE, and having
> > > > > > > private anonymous pages shared between processes that don't COW, introducing
> > > > > > > new syscalls etc.
> > > > > > 
> > > > > > It's not about SHMEM, it's about file-backed pages on regular
> > > > > > filesystems.  I don't want to have XFS, ext4 and btrfs all with their
> > > > > > own implementations of ARCH_WANT_HUGE_PMD_SHARE.
> > > > > 
> > > > > Let me ask this way: why do we have to play such games with MAP_PRIVATE?
> > > > 
> > > > : Mappings within this address range behave as if they were shared
> > > > : between threads, so a write to a MAP_PRIVATE mapping will create a
> > > > : page which is shared between all the sharers.
> > > > 
> > > > If so, that's a misunderstanding, because there are no games being played.
> > > > What Khalid's saying there is that because the page tables are already
> > > > shared for that range of address space, the COW of a MAP_PRIVATE will
> > > > create a new page, but that page will be shared between all the sharers.
> > > > The second write to a MAP_PRIVATE page (by any of the sharers) will not
> > > > create a COW situation.  Just like if all the sharers were threads of
> > > > the same process.
> > > > 
> > > 
> > > It actually seems to be just like I understood it. We'll have multiple
> > > processes share anonymous pages writable, even though they are not using
> > > shared memory.
> > > 
> > > IMHO, sharing page tables to optimize for something kernel-internal (page
> > > table consumption) should be completely transparent to user space. Just like
> > > ARCH_WANT_HUGE_PMD_SHARE currently is unless I am missing something
> > > important.
> > > 
> > > The VM_MAYSHARE check in want_pmd_share()->vma_shareable() makes me assume
> > > that we really only optimize for MAP_SHARED right now, never for
> > > MAP_PRIVATE.
> > 
> > It's definitely *not* about being transparent to userspace.  It's about
> > giving userspace new functionality where multiple processes can choose
> > to share a portion of their address space with each other.  What any
> > process changes in that range changes, every sharing process sees.
> > mmap(), munmap(), mprotect(), mremap(), everything.
> > 
> 
> Exactly and to further elaborate, once a process calls mshare() to declare
> its intent to share PTEs for a range of address and another process accepts
> that sharing by calling mshare() itself, the two (or more) processes have
> agreed to share PTEs for that entire address range. A MAP_PRIVATE mapping in
> this address range goes against the original intent of sharing and what we
> are saying is the original intent of sharing takes precedence in case of
> this conflict.

I don't know that it's against the original intent ... I think
MAP_PRIVATE in this context means "Private to this process and every
process sharing this chunk of address space".  So a store doesn't go
through to the page cache, as it would with MAP_SHARED, but it is
visible to the other processes sharing these page tables.
Longpeng(Mike) Aug. 17, 2021, 12:47 a.m. UTC | #61
> -----Original Message-----
> From: Matthew Wilcox [mailto:willy@infradead.org]
> Sent: Monday, August 16, 2021 8:08 PM
> To: David Hildenbrand <david@redhat.com>
> Cc: Khalid Aziz <khalid.aziz@oracle.com>; Longpeng (Mike, Cloud Infrastructure
> Service Product Dept.) <longpeng2@huawei.com>; Steven Sistare
> <steven.sistare@oracle.com>; Anthony Yznaga <anthony.yznaga@oracle.com>;
> linux-kernel@vger.kernel.org; linux-mm@kvack.org; Gonglei (Arei)
> <arei.gonglei@huawei.com>
> Subject: Re: [RFC PATCH 0/5] madvise MADV_DOEXEC
> 
> On Mon, Aug 16, 2021 at 10:02:22AM +0200, David Hildenbrand wrote:
> > > Mappings within this address range behave as if they were shared
> > > between threads, so a write to a MAP_PRIVATE mapping will create a
> > > page which is shared between all the sharers. The first process that
> > > declares an address range mshare'd can continue to map objects in
> > > the shared area. All other processes that want mshare'd access to
> > > this memory area can do so by calling mshare(). After this call, the
> > > address range given by mshare becomes a shared range in its address
> > > space. Anonymous mappings will be shared and not COWed.
> >
> > Did I understand correctly that you want to share actual page tables
> > between processes and consequently different MMs? That sounds like a very bad
> idea.
> 
> That is the entire point.  Consider a machine with 10,000 instances of an
> application running (process model, not thread model).  If each application wants
> to map 1TB of RAM using 2MB pages, that's 4MB of page tables per process or
> 40GB of RAM for the whole machine.
> 
> There's a reason hugetlbfs was enhanced to allow this page table sharing.
> I'm not a fan of the implementation as it gets some locks upside down, so this is an
> attempt to generalise the concept beyond hugetlbfs.
> 
> Think of it like partial threading.  You get to share some parts, but not all, of your
> address space with your fellow processes.  Obviously you don't want to expose
> this to random other processes, only to other instances of yourself being run as the
> same user.

I understand your intent now, you want to share memory ranges by sharing the relevant
pgtable pages. 

I had implemented a similar idea to support QEMU live upgrade about four years ago
( in late 2017),

https://patents.google.com/patent/US20210089345A1

"""
[0131]
In a first possible implementation, the generation unit includes a copying subunit configured
to copy an entry corresponding to the virtual memory area in a PGD page table of the first
virtual machine to an entry corresponding to the virtual memory area in a PGD page table of
the second virtual machine.
"""

We want to share the anonymous memory between old QEMU process and the new one, 
so we limit the QEMU to mmap the VM's memory address in 4T-8T and then share the 
memory by direct copy the PGD entries ( implementation is much more complicated than I 
say ).

Besides to save memory, large memory range can be shared fast in this way.
Matthew Wilcox (Oracle) Aug. 17, 2021, 12:55 a.m. UTC | #62
On Tue, Aug 17, 2021 at 12:47:19AM +0000, Longpeng (Mike, Cloud Infrastructure Service Product Dept.) wrote:
> I understand your intent now, you want to share memory ranges by sharing the relevant
> pgtable pages. 
> 
> I had implemented a similar idea to support QEMU live upgrade about four years ago
> ( in late 2017),
> 
> https://patents.google.com/patent/US20210089345A1

I am not going to read a patent.  This conversation is now over until
I have had a conversation with my patent attorney.