mbox series

[RFC,0/5] madvise MADV_DOEXEC

Message ID 1595869887-23307-1-git-send-email-anthony.yznaga@oracle.com
Headers show
Series madvise MADV_DOEXEC | expand

Message

Anthony Yznaga July 27, 2020, 5:11 p.m. UTC
This patchset adds support for preserving an anonymous memory range across
exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
sharing memory in this manner, as opposed to re-attaching to a named shared
memory segment, is to ensure it is mapped at the same virtual address in
the new process as it was in the old one.  An intended use for this is to
preserve guest memory for guests using vfio while qemu exec's an updated
version of itself.  By ensuring the memory is preserved at a fixed address,
vfio mappings and their associated kernel data structures can remain valid.
In addition, for the qemu use case, qemu instances that back guest RAM with
anonymous memory can be updated.

Patches 1 and 2 ensure that loading of ELF load segments does not silently
clobber existing VMAS, and remove assumptions that the stack is the only
VMA in the mm when the stack is set up.  Patch 1 re-introduces the use of
MAP_FIXED_NOREPLACE to load ELF binaries that addresses the previous issues
and could be considered on its own.

Patches 3, 4, and 5 introduce the feature and an opt-in method for its use
using an ELF note.

Anthony Yznaga (5):
  elf: reintroduce using MAP_FIXED_NOREPLACE for elf executable mappings
  mm: do not assume only the stack vma exists in setup_arg_pages()
  mm: introduce VM_EXEC_KEEP
  exec, elf: require opt-in for accepting preserved mem
  mm: introduce MADV_DOEXEC

 arch/x86/Kconfig                       |   1 +
 fs/binfmt_elf.c                        | 196 +++++++++++++++++++++++++--------
 fs/exec.c                              |  33 +++++-
 include/linux/binfmts.h                |   7 +-
 include/linux/mm.h                     |   5 +
 include/uapi/asm-generic/mman-common.h |   3 +
 kernel/fork.c                          |   2 +-
 mm/madvise.c                           |  25 +++++
 mm/mmap.c                              |  47 ++++++++
 9 files changed, 266 insertions(+), 53 deletions(-)

Comments

Eric W. Biederman July 27, 2020, 5:07 p.m. UTC | #1
Anthony Yznaga <anthony.yznaga@oracle.com> writes:

> This patchset adds support for preserving an anonymous memory range across
> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
> sharing memory in this manner, as opposed to re-attaching to a named shared
> memory segment, is to ensure it is mapped at the same virtual address in
> the new process as it was in the old one.  An intended use for this is to
> preserve guest memory for guests using vfio while qemu exec's an updated
> version of itself.  By ensuring the memory is preserved at a fixed address,
> vfio mappings and their associated kernel data structures can remain valid.
> In addition, for the qemu use case, qemu instances that back guest RAM with
> anonymous memory can be updated.

What is wrong with using a file descriptor to a possibly deleted file
and re-mmaping it?

There is already MAP_FIXED that allows you to ensure you have the same
address.

I think all it would take would be a small protocol from one version
to the next to say map file descriptor #N and address #A.  Something
easily passed on the command line.

There is just enough complexity in exec currently that our
implementation of exec is already teetering.  So if we could use
existing mechanisms it would be good.

And I don't see why this version of sharing a mmap area would be
particularly general.  I do imagine that being able to force a
mmap area into a setuid executable would be a fun attack vector.

Perhaps I missed this in my skim of these patches but I did not see
anything that guarded this feature against an exec that changes an
applications privileges.

Eric


> Patches 1 and 2 ensure that loading of ELF load segments does not silently
> clobber existing VMAS, and remove assumptions that the stack is the only
> VMA in the mm when the stack is set up.  Patch 1 re-introduces the use of
> MAP_FIXED_NOREPLACE to load ELF binaries that addresses the previous issues
> and could be considered on its own.
>
> Patches 3, 4, and 5 introduce the feature and an opt-in method for its use
> using an ELF note.
>
> Anthony Yznaga (5):
>   elf: reintroduce using MAP_FIXED_NOREPLACE for elf executable mappings
>   mm: do not assume only the stack vma exists in setup_arg_pages()
>   mm: introduce VM_EXEC_KEEP
>   exec, elf: require opt-in for accepting preserved mem
>   mm: introduce MADV_DOEXEC
>
>  arch/x86/Kconfig                       |   1 +
>  fs/binfmt_elf.c                        | 196 +++++++++++++++++++++++++--------
>  fs/exec.c                              |  33 +++++-
>  include/linux/binfmts.h                |   7 +-
>  include/linux/mm.h                     |   5 +
>  include/uapi/asm-generic/mman-common.h |   3 +
>  kernel/fork.c                          |   2 +-
>  mm/madvise.c                           |  25 +++++
>  mm/mmap.c                              |  47 ++++++++
>  9 files changed, 266 insertions(+), 53 deletions(-)
Steven Sistare July 27, 2020, 6 p.m. UTC | #2
On 7/27/2020 1:07 PM, ebiederm@xmission.com wrote:
> Anthony Yznaga <anthony.yznaga@oracle.com> writes:
> 
>> This patchset adds support for preserving an anonymous memory range across
>> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
>> sharing memory in this manner, as opposed to re-attaching to a named shared
>> memory segment, is to ensure it is mapped at the same virtual address in
>> the new process as it was in the old one.  An intended use for this is to
>> preserve guest memory for guests using vfio while qemu exec's an updated
>> version of itself.  By ensuring the memory is preserved at a fixed address,
>> vfio mappings and their associated kernel data structures can remain valid.
>> In addition, for the qemu use case, qemu instances that back guest RAM with
>> anonymous memory can be updated.
> 
> What is wrong with using a file descriptor to a possibly deleted file
> and re-mmaping it?
> 
> There is already MAP_FIXED that allows you to ensure you have the same
> address.

MAP_FIXED blows away any existing mapping in that range, which is not the 
desired behavior.  We want to preserve the previously created mapping at
the same VA and co-exist with other mappings created by the new process.
There is no way to guarantee availability of a VA post-exec.

> I think all it would take would be a small protocol from one version
> to the next to say map file descriptor #N and address #A.  Something
> easily passed on the command line.
> 
> There is just enough complexity in exec currently that our
> implementation of exec is already teetering.  So if we could use
> existing mechanisms it would be good.
> 
> And I don't see why this version of sharing a mmap area would be
> particularly general.  I do imagine that being able to force a
> mmap area into a setuid executable would be a fun attack vector.

Any mmap(MAP_ANON) segment can be preserved.  That is very general, and is 
the case we need to support to upgrade legacy applications that are already
running (such as qemu) -- we cannot recode them before they are updated.

> Perhaps I missed this in my skim of these patches but I did not see
> anything that guarded this feature against an exec that changes an
> applications privileges.

The ELF opt-in flag must be set, so only selected executables will accept
incoming mappings.  The exec() code verifies that the preserved mappings 
do not overlap or become adjacent to text or stack, so it is not possible for
example for an attacker to cause stack underflow or overflow to access injected
content.  A gadget invoked by some other attack could access the preserved content.

- Steve

>> Patches 1 and 2 ensure that loading of ELF load segments does not silently
>> clobber existing VMAS, and remove assumptions that the stack is the only
>> VMA in the mm when the stack is set up.  Patch 1 re-introduces the use of
>> MAP_FIXED_NOREPLACE to load ELF binaries that addresses the previous issues
>> and could be considered on its own.
>>
>> Patches 3, 4, and 5 introduce the feature and an opt-in method for its use
>> using an ELF note.
>>
>> Anthony Yznaga (5):
>>   elf: reintroduce using MAP_FIXED_NOREPLACE for elf executable mappings
>>   mm: do not assume only the stack vma exists in setup_arg_pages()
>>   mm: introduce VM_EXEC_KEEP
>>   exec, elf: require opt-in for accepting preserved mem
>>   mm: introduce MADV_DOEXEC
>>
>>  arch/x86/Kconfig                       |   1 +
>>  fs/binfmt_elf.c                        | 196 +++++++++++++++++++++++++--------
>>  fs/exec.c                              |  33 +++++-
>>  include/linux/binfmts.h                |   7 +-
>>  include/linux/mm.h                     |   5 +
>>  include/uapi/asm-generic/mman-common.h |   3 +
>>  kernel/fork.c                          |   2 +-
>>  mm/madvise.c                           |  25 +++++
>>  mm/mmap.c                              |  47 ++++++++
>>  9 files changed, 266 insertions(+), 53 deletions(-)
Kirill Tkhai July 28, 2020, 11:34 a.m. UTC | #3
On 27.07.2020 20:11, Anthony Yznaga wrote:
> This patchset adds support for preserving an anonymous memory range across
> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
> sharing memory in this manner, as opposed to re-attaching to a named shared
> memory segment, is to ensure it is mapped at the same virtual address in
> the new process as it was in the old one.  An intended use for this is to
> preserve guest memory for guests using vfio while qemu exec's an updated
> version of itself.  By ensuring the memory is preserved at a fixed address,

So, the goal is an update of QEMU binary without a stopping of virtual machine?

> vfio mappings and their associated kernel data structures can remain valid.
> In addition, for the qemu use case, qemu instances that back guest RAM with
> anonymous memory can be updated.
> 
> Patches 1 and 2 ensure that loading of ELF load segments does not silently
> clobber existing VMAS, and remove assumptions that the stack is the only
> VMA in the mm when the stack is set up.  Patch 1 re-introduces the use of
> MAP_FIXED_NOREPLACE to load ELF binaries that addresses the previous issues
> and could be considered on its own.
> 
> Patches 3, 4, and 5 introduce the feature and an opt-in method for its use
> using an ELF note.
> 
> Anthony Yznaga (5):
>   elf: reintroduce using MAP_FIXED_NOREPLACE for elf executable mappings
>   mm: do not assume only the stack vma exists in setup_arg_pages()
>   mm: introduce VM_EXEC_KEEP
>   exec, elf: require opt-in for accepting preserved mem
>   mm: introduce MADV_DOEXEC
> 
>  arch/x86/Kconfig                       |   1 +
>  fs/binfmt_elf.c                        | 196 +++++++++++++++++++++++++--------
>  fs/exec.c                              |  33 +++++-
>  include/linux/binfmts.h                |   7 +-
>  include/linux/mm.h                     |   5 +
>  include/uapi/asm-generic/mman-common.h |   3 +
>  kernel/fork.c                          |   2 +-
>  mm/madvise.c                           |  25 +++++
>  mm/mmap.c                              |  47 ++++++++
>  9 files changed, 266 insertions(+), 53 deletions(-)
>
Christian Brauner July 28, 2020, 1:40 p.m. UTC | #4
On Mon, Jul 27, 2020 at 02:00:17PM -0400, Steven Sistare wrote:
> On 7/27/2020 1:07 PM, ebiederm@xmission.com wrote:
> > Anthony Yznaga <anthony.yznaga@oracle.com> writes:
> > 
> >> This patchset adds support for preserving an anonymous memory range across
> >> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
> >> sharing memory in this manner, as opposed to re-attaching to a named shared
> >> memory segment, is to ensure it is mapped at the same virtual address in
> >> the new process as it was in the old one.  An intended use for this is to
> >> preserve guest memory for guests using vfio while qemu exec's an updated
> >> version of itself.  By ensuring the memory is preserved at a fixed address,
> >> vfio mappings and their associated kernel data structures can remain valid.
> >> In addition, for the qemu use case, qemu instances that back guest RAM with
> >> anonymous memory can be updated.
> > 
> > What is wrong with using a file descriptor to a possibly deleted file
> > and re-mmaping it?
> > 
> > There is already MAP_FIXED that allows you to ensure you have the same
> > address.
> 
> MAP_FIXED blows away any existing mapping in that range, which is not the 
> desired behavior.  We want to preserve the previously created mapping at

There's also MAP_FIXED_NOREPLACE since v4.17 in case that helps.

Note that this should really go to linux-api too. I won't argue to
resend it since that would mean spamming everyone's inbox with the same
thread again but in case you send a revised version, please ensure to Cc
linux-api. The glibc folks are listening on there too.

Thanks!
Christian
Andy Lutomirski July 28, 2020, 2:23 p.m. UTC | #5
> On Jul 27, 2020, at 10:02 AM, Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
> 
> This patchset adds support for preserving an anonymous memory range across
> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
> sharing memory in this manner, as opposed to re-attaching to a named shared
> memory segment, is to ensure it is mapped at the same virtual address in
> the new process as it was in the old one.  An intended use for this is to
> preserve guest memory for guests using vfio while qemu exec's an updated
> version of itself.  By ensuring the memory is preserved at a fixed address,
> vfio mappings and their associated kernel data structures can remain valid.
> In addition, for the qemu use case, qemu instances that back guest RAM with
> anonymous memory can be updated.

This will be an amazing attack surface. Perhaps use of this flag should require no_new_privs?  Arguably it should also require a special flag to execve() to honor it.  Otherwise library helpers that do vfork()+exec() or posix_spawn() could be quite surprised.
Steven Sistare July 28, 2020, 2:30 p.m. UTC | #6
On 7/28/2020 10:23 AM, Andy Lutomirski wrote:
>> On Jul 27, 2020, at 10:02 AM, Anthony Yznaga <anthony.yznaga@oracle.com> wrote:
>>
>> This patchset adds support for preserving an anonymous memory range across
>> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
>> sharing memory in this manner, as opposed to re-attaching to a named shared
>> memory segment, is to ensure it is mapped at the same virtual address in
>> the new process as it was in the old one.  An intended use for this is to
>> preserve guest memory for guests using vfio while qemu exec's an updated
>> version of itself.  By ensuring the memory is preserved at a fixed address,
>> vfio mappings and their associated kernel data structures can remain valid.
>> In addition, for the qemu use case, qemu instances that back guest RAM with
>> anonymous memory can be updated.
> 
> This will be an amazing attack surface. Perhaps use of this flag should require no_new_privs?  Arguably it should also require a special flag to execve() to honor it.  Otherwise library helpers that do vfork()+exec() or posix_spawn() could be quite surprised.

Preservation is disabled across fork, so fork/exec combo's are not affected.  We forgot to document that.

- Steve
Anthony Yznaga July 28, 2020, 5:28 p.m. UTC | #7
On 7/28/20 4:34 AM, Kirill Tkhai wrote:
> On 27.07.2020 20:11, Anthony Yznaga wrote:
>> This patchset adds support for preserving an anonymous memory range across
>> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
>> sharing memory in this manner, as opposed to re-attaching to a named shared
>> memory segment, is to ensure it is mapped at the same virtual address in
>> the new process as it was in the old one.  An intended use for this is to
>> preserve guest memory for guests using vfio while qemu exec's an updated
>> version of itself.  By ensuring the memory is preserved at a fixed address,
> So, the goal is an update of QEMU binary without a stopping of virtual machine?
Essentially, yes.  The VM is paused very briefly.

Anthony
>
>> vfio mappings and their associated kernel data structures can remain valid.
>> In addition, for the qemu use case, qemu instances that back guest RAM with
>> anonymous memory can be updated.
>>
>> Patches 1 and 2 ensure that loading of ELF load segments does not silently
>> clobber existing VMAS, and remove assumptions that the stack is the only
>> VMA in the mm when the stack is set up.  Patch 1 re-introduces the use of
>> MAP_FIXED_NOREPLACE to load ELF binaries that addresses the previous issues
>> and could be considered on its own.
>>
>> Patches 3, 4, and 5 introduce the feature and an opt-in method for its use
>> using an ELF note.
>>
>> Anthony Yznaga (5):
>>   elf: reintroduce using MAP_FIXED_NOREPLACE for elf executable mappings
>>   mm: do not assume only the stack vma exists in setup_arg_pages()
>>   mm: introduce VM_EXEC_KEEP
>>   exec, elf: require opt-in for accepting preserved mem
>>   mm: introduce MADV_DOEXEC
>>
>>  arch/x86/Kconfig                       |   1 +
>>  fs/binfmt_elf.c                        | 196 +++++++++++++++++++++++++--------
>>  fs/exec.c                              |  33 +++++-
>>  include/linux/binfmts.h                |   7 +-
>>  include/linux/mm.h                     |   5 +
>>  include/uapi/asm-generic/mman-common.h |   3 +
>>  kernel/fork.c                          |   2 +-
>>  mm/madvise.c                           |  25 +++++
>>  mm/mmap.c                              |  47 ++++++++
>>  9 files changed, 266 insertions(+), 53 deletions(-)
>>
Matthew Wilcox (Oracle) July 30, 2020, 3:22 p.m. UTC | #8
On Mon, Jul 27, 2020 at 10:11:22AM -0700, Anthony Yznaga wrote:
> This patchset adds support for preserving an anonymous memory range across
> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
> sharing memory in this manner, as opposed to re-attaching to a named shared
> memory segment, is to ensure it is mapped at the same virtual address in
> the new process as it was in the old one.  An intended use for this is to
> preserve guest memory for guests using vfio while qemu exec's an updated
> version of itself.  By ensuring the memory is preserved at a fixed address,
> vfio mappings and their associated kernel data structures can remain valid.
> In addition, for the qemu use case, qemu instances that back guest RAM with
> anonymous memory can be updated.

I just realised that something else I'm working on might be a suitable
alternative to this.  Apologies for not realising it sooner.

http://www.wil.cx/~willy/linux/sileby.html

To use this, you'd mshare() the anonymous memory range, essentially
detaching the VMA from the current process's mm_struct and reparenting
it to this new mm_struct, which has an fd referencing it.

Then you call exec(), and the exec'ed task gets to call mmap() on that
new fd to attach the memory range to its own address space.

Presto!
Christian Brauner July 30, 2020, 3:27 p.m. UTC | #9
On Thu, Jul 30, 2020 at 04:22:50PM +0100, Matthew Wilcox wrote:
> On Mon, Jul 27, 2020 at 10:11:22AM -0700, Anthony Yznaga wrote:
> > This patchset adds support for preserving an anonymous memory range across
> > exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
> > sharing memory in this manner, as opposed to re-attaching to a named shared
> > memory segment, is to ensure it is mapped at the same virtual address in
> > the new process as it was in the old one.  An intended use for this is to
> > preserve guest memory for guests using vfio while qemu exec's an updated
> > version of itself.  By ensuring the memory is preserved at a fixed address,
> > vfio mappings and their associated kernel data structures can remain valid.
> > In addition, for the qemu use case, qemu instances that back guest RAM with
> > anonymous memory can be updated.
> 
> I just realised that something else I'm working on might be a suitable
> alternative to this.  Apologies for not realising it sooner.
> 
> http://www.wil.cx/~willy/linux/sileby.html

Just skimming: make it O_CLOEXEC by default. ;)

> 
> To use this, you'd mshare() the anonymous memory range, essentially
> detaching the VMA from the current process's mm_struct and reparenting
> it to this new mm_struct, which has an fd referencing it.
> 
> Then you call exec(), and the exec'ed task gets to call mmap() on that
> new fd to attach the memory range to its own address space.
> 
> Presto!
Matthew Wilcox (Oracle) July 30, 2020, 3:34 p.m. UTC | #10
On Thu, Jul 30, 2020 at 05:27:05PM +0200, Christian Brauner wrote:
> On Thu, Jul 30, 2020 at 04:22:50PM +0100, Matthew Wilcox wrote:
> > On Mon, Jul 27, 2020 at 10:11:22AM -0700, Anthony Yznaga wrote:
> > > This patchset adds support for preserving an anonymous memory range across
> > > exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
> > > sharing memory in this manner, as opposed to re-attaching to a named shared
> > > memory segment, is to ensure it is mapped at the same virtual address in
> > > the new process as it was in the old one.  An intended use for this is to
> > > preserve guest memory for guests using vfio while qemu exec's an updated
> > > version of itself.  By ensuring the memory is preserved at a fixed address,
> > > vfio mappings and their associated kernel data structures can remain valid.
> > > In addition, for the qemu use case, qemu instances that back guest RAM with
> > > anonymous memory can be updated.
> > 
> > I just realised that something else I'm working on might be a suitable
> > alternative to this.  Apologies for not realising it sooner.
> > 
> > http://www.wil.cx/~willy/linux/sileby.html
> 
> Just skimming: make it O_CLOEXEC by default. ;)

I appreciate the suggestion, and it makes sense for many 'return an fd'
interfaces, but the point of mshare() is to, well, share.  So sharing
the fd with a child is a common usecase, unlike say sharing a timerfd.
The only other reason to use mshare() is to pass the fd over a unix
socket to a non-child, and I submit that is far less common than wanting
to share with a child.
Christian Brauner July 30, 2020, 3:54 p.m. UTC | #11
On Thu, Jul 30, 2020 at 04:34:50PM +0100, Matthew Wilcox wrote:
> On Thu, Jul 30, 2020 at 05:27:05PM +0200, Christian Brauner wrote:
> > On Thu, Jul 30, 2020 at 04:22:50PM +0100, Matthew Wilcox wrote:
> > > On Mon, Jul 27, 2020 at 10:11:22AM -0700, Anthony Yznaga wrote:
> > > > This patchset adds support for preserving an anonymous memory range across
> > > > exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
> > > > sharing memory in this manner, as opposed to re-attaching to a named shared
> > > > memory segment, is to ensure it is mapped at the same virtual address in
> > > > the new process as it was in the old one.  An intended use for this is to
> > > > preserve guest memory for guests using vfio while qemu exec's an updated
> > > > version of itself.  By ensuring the memory is preserved at a fixed address,
> > > > vfio mappings and their associated kernel data structures can remain valid.
> > > > In addition, for the qemu use case, qemu instances that back guest RAM with
> > > > anonymous memory can be updated.
> > > 
> > > I just realised that something else I'm working on might be a suitable
> > > alternative to this.  Apologies for not realising it sooner.
> > > 
> > > http://www.wil.cx/~willy/linux/sileby.html
> > 
> > Just skimming: make it O_CLOEXEC by default. ;)
> 
> I appreciate the suggestion, and it makes sense for many 'return an fd'
> interfaces, but the point of mshare() is to, well, share.  So sharing
> the fd with a child is a common usecase, unlike say sharing a timerfd.

Fair point, from reading I thought the main reason was share after
fork() but having an fd over exec() may be a good use-case too.
(Fwiw, this very much looks like what the memfd_create() api should have
looked like, i.e. mshare() could've possibly encompassed both.)

> The only other reason to use mshare() is to pass the fd over a unix
> socket to a non-child, and I submit that is far less common than wanting
> to share with a child.

Well, we have use-cases for that too. E.g. where we need to attach to an
existing pid namespace which requires a first fork() of an intermediate
process so that the caller doesn't change the pid_for_children
namespace. Then we setns() to the target pid namespace in the
intermediate process which changes the pid_ns_children namespace such
that the next process will be a proper member of the target pid
namespace. Finally, we fork() the target process that is now a full
member of the target pid namespace. We also set CLONE_PARENT so
grandfather process == father process for the second process and then
have the intermediate process exit. Some fds we only ever open or create
after the intermediate process exited and some fds we can only open or
create after the intermediate process has been created and then send the
fds via scm (since we can't share the fdtable) to the final process.

Christian
Steven Sistare July 30, 2020, 3:59 p.m. UTC | #12
On 7/30/2020 11:22 AM, Matthew Wilcox wrote:
> On Mon, Jul 27, 2020 at 10:11:22AM -0700, Anthony Yznaga wrote:
>> This patchset adds support for preserving an anonymous memory range across
>> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
>> sharing memory in this manner, as opposed to re-attaching to a named shared
>> memory segment, is to ensure it is mapped at the same virtual address in
>> the new process as it was in the old one.  An intended use for this is to
>> preserve guest memory for guests using vfio while qemu exec's an updated
>> version of itself.  By ensuring the memory is preserved at a fixed address,
>> vfio mappings and their associated kernel data structures can remain valid.
>> In addition, for the qemu use case, qemu instances that back guest RAM with
>> anonymous memory can be updated.
> 
> I just realised that something else I'm working on might be a suitable
> alternative to this.  Apologies for not realising it sooner.
> 
> http://www.wil.cx/~willy/linux/sileby.html
> 
> To use this, you'd mshare() the anonymous memory range, essentially
> detaching the VMA from the current process's mm_struct and reparenting
> it to this new mm_struct, which has an fd referencing it.
> 
> Then you call exec(), and the exec'ed task gets to call mmap() on that
> new fd to attach the memory range to its own address space.
> 
> Presto!

To be suitable for the qemu use case, we need a guarantee that the same VA range
is available in the new process, with nothing else mapped there.  From your spec,
it sounds like the new process could do a series of unrelated mmap's which could
overlap the desired va range before the silby mmap(fd) is performed??

Also, we need to support updating legacy processes that already created anon segments.
We inject code that calls MADV_DOEXEC for such segments.

- Steve
Matthew Wilcox (Oracle) July 30, 2020, 5:12 p.m. UTC | #13
On Thu, Jul 30, 2020 at 11:59:42AM -0400, Steven Sistare wrote:
> On 7/30/2020 11:22 AM, Matthew Wilcox wrote:
> > On Mon, Jul 27, 2020 at 10:11:22AM -0700, Anthony Yznaga wrote:
> >> This patchset adds support for preserving an anonymous memory range across
> >> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
> >> sharing memory in this manner, as opposed to re-attaching to a named shared
> >> memory segment, is to ensure it is mapped at the same virtual address in
> >> the new process as it was in the old one.  An intended use for this is to
> >> preserve guest memory for guests using vfio while qemu exec's an updated
> >> version of itself.  By ensuring the memory is preserved at a fixed address,
> >> vfio mappings and their associated kernel data structures can remain valid.
> >> In addition, for the qemu use case, qemu instances that back guest RAM with
> >> anonymous memory can be updated.
> > 
> > I just realised that something else I'm working on might be a suitable
> > alternative to this.  Apologies for not realising it sooner.
> > 
> > http://www.wil.cx/~willy/linux/sileby.html
> > 
> > To use this, you'd mshare() the anonymous memory range, essentially
> > detaching the VMA from the current process's mm_struct and reparenting
> > it to this new mm_struct, which has an fd referencing it.
> > 
> > Then you call exec(), and the exec'ed task gets to call mmap() on that
> > new fd to attach the memory range to its own address space.
> > 
> > Presto!
> 
> To be suitable for the qemu use case, we need a guarantee that the same VA range
> is available in the new process, with nothing else mapped there.  From your spec,
> it sounds like the new process could do a series of unrelated mmap's which could
> overlap the desired va range before the silby mmap(fd) is performed??

That could happen.  eg libc might get its text segment mapped there
randomly.  I believe Khalid was working on a solution for reserving
memory ranges.

> Also, we need to support updating legacy processes that already created anon segments.
> We inject code that calls MADV_DOEXEC for such segments.

Yes, I was assuming you'd inject code that called mshare().

Actually, since you're injecting code, why do you need the kernel to
be involved?  You can mmap the new executable and any libraries it depends
upon, set up a new stack and jump to the main() entry point, all without
calling exec().  I appreciate it'd be a fair amount of code, but it'd all
be in userspace and you can probably steal / reuse code from ld.so (I'm
not familiar with the details of how setting up an executable is done).
Steven Sistare July 30, 2020, 5:35 p.m. UTC | #14
On 7/30/2020 1:12 PM, Matthew Wilcox wrote:
> On Thu, Jul 30, 2020 at 11:59:42AM -0400, Steven Sistare wrote:
>> On 7/30/2020 11:22 AM, Matthew Wilcox wrote:
>>> On Mon, Jul 27, 2020 at 10:11:22AM -0700, Anthony Yznaga wrote:
>>>> This patchset adds support for preserving an anonymous memory range across
>>>> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
>>>> sharing memory in this manner, as opposed to re-attaching to a named shared
>>>> memory segment, is to ensure it is mapped at the same virtual address in
>>>> the new process as it was in the old one.  An intended use for this is to
>>>> preserve guest memory for guests using vfio while qemu exec's an updated
>>>> version of itself.  By ensuring the memory is preserved at a fixed address,
>>>> vfio mappings and their associated kernel data structures can remain valid.
>>>> In addition, for the qemu use case, qemu instances that back guest RAM with
>>>> anonymous memory can be updated.
>>>
>>> I just realised that something else I'm working on might be a suitable
>>> alternative to this.  Apologies for not realising it sooner.
>>>
>>> http://www.wil.cx/~willy/linux/sileby.html
>>>
>>> To use this, you'd mshare() the anonymous memory range, essentially
>>> detaching the VMA from the current process's mm_struct and reparenting
>>> it to this new mm_struct, which has an fd referencing it.
>>>
>>> Then you call exec(), and the exec'ed task gets to call mmap() on that
>>> new fd to attach the memory range to its own address space.
>>>
>>> Presto!
>>
>> To be suitable for the qemu use case, we need a guarantee that the same VA range
>> is available in the new process, with nothing else mapped there.  From your spec,
>> it sounds like the new process could do a series of unrelated mmap's which could
>> overlap the desired va range before the silby mmap(fd) is performed??
> 
> That could happen.  eg libc might get its text segment mapped there
> randomly.  I believe Khalid was working on a solution for reserving
> memory ranges.

mshare + VA reservation is another possible solution.

Or MADV_DOEXEC alone, which is ready now.  I hope we can get back to reviewing that.

>> Also, we need to support updating legacy processes that already created anon segments.
>> We inject code that calls MADV_DOEXEC for such segments.
> 
> Yes, I was assuming you'd inject code that called mshare().

OK, mshare works on existing memory and builds a new vma.

> Actually, since you're injecting code, why do you need the kernel to
> be involved?  You can mmap the new executable and any libraries it depends
> upon, set up a new stack and jump to the main() entry point, all without
> calling exec().  I appreciate it'd be a fair amount of code, but it'd all
> be in userspace and you can probably steal / reuse code from ld.so (I'm
> not familiar with the details of how setting up an executable is done).

Duplicating all the work that the kernel and loader do to exec a process would
be error prone, require ongoing maintenance, and be redundant.  Better to define 
a small kernel extension and leave exec to the kernel.

- Steve
Matthew Wilcox (Oracle) July 30, 2020, 5:49 p.m. UTC | #15
On Thu, Jul 30, 2020 at 01:35:51PM -0400, Steven Sistare wrote:
> mshare + VA reservation is another possible solution.
> 
> Or MADV_DOEXEC alone, which is ready now.  I hope we can get back to reviewing that.

We are.  This is the part of the review process where we explore other
solutions to the problem.

> >> Also, we need to support updating legacy processes that already created anon segments.
> >> We inject code that calls MADV_DOEXEC for such segments.
> > 
> > Yes, I was assuming you'd inject code that called mshare().
> 
> OK, mshare works on existing memory and builds a new vma.

Actually, reparents an existing VMA, and reuses the existing page tables.

> > Actually, since you're injecting code, why do you need the kernel to
> > be involved?  You can mmap the new executable and any libraries it depends
> > upon, set up a new stack and jump to the main() entry point, all without
> > calling exec().  I appreciate it'd be a fair amount of code, but it'd all
> > be in userspace and you can probably steal / reuse code from ld.so (I'm
> > not familiar with the details of how setting up an executable is done).
> 
> Duplicating all the work that the kernel and loader do to exec a process would
> be error prone, require ongoing maintenance, and be redundant.  Better to define 
> a small kernel extension and leave exec to the kernel.

Either this is a one-off kind of thing, in which case it doesn't need
ongoing maintenance, or it's something with broad applicability, in
which case it can live as its own userspace project.  It could even
start off life as part of qemu and then fork into its own project.
The idea of tagging an ELF executable to say "I can cope with having
chunks of my address space provided to me by my executor" is ... odd.
Steven Sistare July 30, 2020, 6:27 p.m. UTC | #16
On 7/30/2020 1:49 PM, Matthew Wilcox wrote:
> On Thu, Jul 30, 2020 at 01:35:51PM -0400, Steven Sistare wrote:
>> mshare + VA reservation is another possible solution.
>>
>> Or MADV_DOEXEC alone, which is ready now.  I hope we can get back to reviewing that.
> 
> We are.  This is the part of the review process where we explore other
> solutions to the problem.
> 
>>>> Also, we need to support updating legacy processes that already created anon segments.
>>>> We inject code that calls MADV_DOEXEC for such segments.
>>>
>>> Yes, I was assuming you'd inject code that called mshare().
>>
>> OK, mshare works on existing memory and builds a new vma.
> 
> Actually, reparents an existing VMA, and reuses the existing page tables.
> 
>>> Actually, since you're injecting code, why do you need the kernel to
>>> be involved?  You can mmap the new executable and any libraries it depends
>>> upon, set up a new stack and jump to the main() entry point, all without
>>> calling exec().  I appreciate it'd be a fair amount of code, but it'd all
>>> be in userspace and you can probably steal / reuse code from ld.so (I'm
>>> not familiar with the details of how setting up an executable is done).
>>
>> Duplicating all the work that the kernel and loader do to exec a process would
>> be error prone, require ongoing maintenance, and be redundant.  Better to define 
>> a small kernel extension and leave exec to the kernel.
> 
> Either this is a one-off kind of thing, in which case it doesn't need
> ongoing maintenance, or it's something with broad applicability, in
> which case it can live as its own userspace project.  It could even
> start off life as part of qemu and then fork into its own project.

exec will be enhanced over time in the kernel.  A separate user space implementation
would need to track that.

Reimplementing exec in userland would be a big gross mess.  Not a good solution when
we have simple and concise ways of solving the problem.

> The idea of tagging an ELF executable to say "I can cope with having
> chunks of my address space provided to me by my executor" is ... odd.

I don't disagree.  But it is useful.  We already pass a block of data containing
environment variables and arguments from one process to the next.  Preserving 
additional segments is not a big leap from there.

- Steve
Eric W. Biederman July 30, 2020, 9:58 p.m. UTC | #17
Steven Sistare <steven.sistare@oracle.com> writes:

> On 7/30/2020 1:49 PM, Matthew Wilcox wrote:
>> On Thu, Jul 30, 2020 at 01:35:51PM -0400, Steven Sistare wrote:
>>> mshare + VA reservation is another possible solution.
>>>
>>> Or MADV_DOEXEC alone, which is ready now.  I hope we can get back to reviewing that.
>> 
>> We are.  This is the part of the review process where we explore other
>> solutions to the problem.
>> 
>>>>> Also, we need to support updating legacy processes that already created anon segments.
>>>>> We inject code that calls MADV_DOEXEC for such segments.
>>>>
>>>> Yes, I was assuming you'd inject code that called mshare().
>>>
>>> OK, mshare works on existing memory and builds a new vma.
>> 
>> Actually, reparents an existing VMA, and reuses the existing page tables.
>> 
>>>> Actually, since you're injecting code, why do you need the kernel to
>>>> be involved?  You can mmap the new executable and any libraries it depends
>>>> upon, set up a new stack and jump to the main() entry point, all without
>>>> calling exec().  I appreciate it'd be a fair amount of code, but it'd all
>>>> be in userspace and you can probably steal / reuse code from ld.so (I'm
>>>> not familiar with the details of how setting up an executable is done).
>>>
>>> Duplicating all the work that the kernel and loader do to exec a process would
>>> be error prone, require ongoing maintenance, and be redundant.  Better to define 
>>> a small kernel extension and leave exec to the kernel.
>> 
>> Either this is a one-off kind of thing, in which case it doesn't need
>> ongoing maintenance, or it's something with broad applicability, in
>> which case it can live as its own userspace project.  It could even
>> start off life as part of qemu and then fork into its own project.
>
> exec will be enhanced over time in the kernel.  A separate user space implementation
> would need to track that.
>
> Reimplementing exec in userland would be a big gross mess.  Not a good solution when
> we have simple and concise ways of solving the problem.

The interesting work is not done in exec.  The interesting work of
loading an executable is done in ld.so, and ld.so lives in userspace.

>> The idea of tagging an ELF executable to say "I can cope with having
>> chunks of my address space provided to me by my executor" is ... odd.
>
> I don't disagree.  But it is useful.  We already pass a block of data containing
> environment variables and arguments from one process to the next.  Preserving 
> additional segments is not a big leap from there.

But we randomize where that block lives.

It has been found to be very useful to have randomization by default and
you are arguing against that kind of randomization.



Here is another suggestion.

Have a very simple program that does:

	for (;;) {
		handle = dlopen("/my/real/program");
		real_main = dlsym(handle, "main");
		real_main(argc, argv, envp);
		dlclose(handle);
	}

With whatever obvious adjustments are needed to fit your usecase.

That should give the same level of functionality, be portable to all
unices, and not require you to duplicate code.  It belive it limits you
to not upgrading libc, or librt but that is a comparatively small
limitation.


Given that in general the interesting work is done in userspace and that
userspace has provided an interface for reusing that work already.
I don't see the justification for adding anything to exec at this point.


Eric
Stefan Hajnoczi July 31, 2020, 9:12 a.m. UTC | #18
Hi,
Sileby looks interesting! I had just written up the following idea which
seems similar but includes a mechanism for revoking mappings.

Alexander Graf recently brought up an idea that solves the following
problem:

When process A passes shared memory file descriptors to process B there
is no way for process A to revoke access or change page protection bits
after passing the fd.

I'll describe the idea (not sure if it's exactly what Alexander had in
mind).

Memory view driver
------------------
The memory view driver allows process A to control the page table
entries of an mmap in process B. It is a character device driver that
process A opens like this:

  int fd = open("/dev/memory-view", O_RDWR);

This returns a file descriptor to a new memory view.

Next process A sets the size of the memory view:

  ftruncate(fd, 16 * GiB);

The size determines how large the memory view will be. The size is a
virtual memory concept and does not consume resources (there is no
physical memory backing this).

Process A populates the memory view with ranges from file descriptors it
wishes to share. The file descriptor can be a shared memory file
descriptor:

  int memfd = memfd_create("guest-ram, 0);
  ftruncate(memfd, 32 * GiB);

  /* Map [8GB, 10GB) at 8GB into the memory view */
  struct memview_map_fd_info = {
      .fd = memfd,
      .fd_offset = 8 * GiB,
      .size = 2 * GiB,
      .mem_offset = 8 * GiB,
      .flags = MEMVIEW_MAP_READ | MEMVIEW_MAP_WRITE,
  };
  ioctl(fd, MEMVIEW_MAP_FD, &map_fd_info);

It is also possible to populate the memory view from the page cache:

  int filefd = open("big-file.iso", O_RDONLY);

  /* Map [4GB, 12GB) at 0B into the memory view */
  struct memview_map_fd_info = {
      .fd = filefd,
      .fd_offset = 4 * GiB,
      .size = 8 * GiB,
      .mem_offset = 0,
      .flags = MEMVIEW_MAP_READ,
  };
  ioctl(fd, MEMVIEW_MAP_FD, &map_fd_info);

The memory view has now been populated like this:

Range (GiB)   Fd               Permissions
0-8           big-file.iso     read
8-10          guest-ram        read+write
10-16         <none>           <none>

Now process A gets the "view" file descriptor for this memory view. The
view file descriptor does not allow ioctls. It can be safely passed to
process B in the knowledge that process B can only mmap or close it:

  int viewfd = ioctl(fd, MEMVIEW_GET_VIEWFD);

  ...pass viewfd to process B...

Process B receives viewfd and mmaps it:

  void *ptr = mmap(NULL, 16 * GiB, PROT_READ | PROT_WRITE, MAP_SHARED,
                   viewfd, 0);

When process B accesses a page in the mmap region the memory view
driver resolves the page fault by checking if the page is mapped to an
fd and what its permissions are.

For example, accessing the page at 4GB from the start of the memory view
is an access at 8GB into big-file.iso. That's because 8GB of
big-file.iso was mapped at 0 with fd_offset 4GB.

To summarize, there is one vma in process B and the memory view driver
maps pages from the file descriptors added with ioctl(MEMVIEW_MAP_FD) by
process A.

Page protection bits are the AND of the mmap
PROT_READ/PROT_WRITE/PROT_EXEC flags with the memory view driver's
MEMVIEW_MAP_READ/MEMVIEW_MAP_WRITE/MEMVIEW_MAP_EXEC flags for the
mapping in question.

Does vmf_insert_mixed_prot() or a similar Linux API allow this?

Can the memory view driver map pages from fds without pinning the pages?

Process A can make further ioctl(MEMVIEW_MAP_FD) calls and also
ioctl(MEMVIEW_UNMAP_FD) calls to change the mappings. This requires
zapping affected process B ptes. When process B accesses those pages
again the fault handler will handle the page fault based on the latest
memory view layout.

If process B accesses a page with incorrect permissions or that has not
been configured by process A ioctl calls, a SIGSEGV/SIGBUS signal is
raised.

When process B uses mprotect(2) and other virtual memory syscalls it
is unable to increase page permissions. Instead it can only reduce them
because the pte protection bits are the AND of the mmap flags and the
memory view driver's MEMVIEW_MAP_READ/MEMVIEW_MAP_WRITE/MEMVIEW_MAP_EXEC
flags.

Use cases
---------
How to use the memory view driver for vhost-user
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
vhost-user and other out-of-process device emulation interfaces need a
way for the VMM to enforce the IOMMU mappings on the device emulation
process.

Today the VMM passes all guest RAM fds to the device emulation process
and has no way of restricting access or revoking it later. With the
memory view driver the VMM will pass one or more memory view fds instead
of the actual guest RAM fds. This allows the VMM to invoke
ioctl(MEMVIEW_MAP_FD/MEMVIEW_UNMAP_FD) to enforce permissions or revoke
access.

How to use the memory view driver for virtio-fs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The virtiofsd vhost-user process creates a memory view for the device's
DAX Window and passes it to QEMU. QEMU installs it as a kvm.ko memory
region so that the guest directly accesses the memory view.

Now virtiofsd can map portions of files into the DAX Window without
coordinating with the QEMU process. This simplifies the virtio-fs code
and should also improve DAX map/unmap performance.

Stefan
Steven Sistare July 31, 2020, 2:57 p.m. UTC | #19
On 7/30/2020 5:58 PM, ebiederm@xmission.com wrote:
> Steven Sistare <steven.sistare@oracle.com> writes:
>> On 7/30/2020 1:49 PM, Matthew Wilcox wrote:
>>> On Thu, Jul 30, 2020 at 01:35:51PM -0400, Steven Sistare wrote:
>>>> mshare + VA reservation is another possible solution.
>>>>
>>>> Or MADV_DOEXEC alone, which is ready now.  I hope we can get back to reviewing that.
>>>
>>> We are.  This is the part of the review process where we explore other
>>> solutions to the problem.
>>>
>>>>>> Also, we need to support updating legacy processes that already created anon segments.
>>>>>> We inject code that calls MADV_DOEXEC for such segments.
>>>>>
>>>>> Yes, I was assuming you'd inject code that called mshare().
>>>>
>>>> OK, mshare works on existing memory and builds a new vma.
>>>
>>> Actually, reparents an existing VMA, and reuses the existing page tables.
>>>
>>>>> Actually, since you're injecting code, why do you need the kernel to
>>>>> be involved?  You can mmap the new executable and any libraries it depends
>>>>> upon, set up a new stack and jump to the main() entry point, all without
>>>>> calling exec().  I appreciate it'd be a fair amount of code, but it'd all
>>>>> be in userspace and you can probably steal / reuse code from ld.so (I'm
>>>>> not familiar with the details of how setting up an executable is done).
>>>>
>>>> Duplicating all the work that the kernel and loader do to exec a process would
>>>> be error prone, require ongoing maintenance, and be redundant.  Better to define 
>>>> a small kernel extension and leave exec to the kernel.
>>>
>>> Either this is a one-off kind of thing, in which case it doesn't need
>>> ongoing maintenance, or it's something with broad applicability, in
>>> which case it can live as its own userspace project.  It could even
>>> start off life as part of qemu and then fork into its own project.
>>
>> exec will be enhanced over time in the kernel.  A separate user space implementation
>> would need to track that.
>>
>> Reimplementing exec in userland would be a big gross mess.  Not a good solution when
>> we have simple and concise ways of solving the problem.
> 
> The interesting work is not done in exec.  The interesting work of
> loading an executable is done in ld.so, and ld.so lives in userspace.

And yet a smart guy named Eric said:
  "There is just enough complexity in exec currently that our implementation of exec is 
  already teetering."
which suggests that a user-space re-implementation will also be complex.
In complexity lies bugs.

>>> The idea of tagging an ELF executable to say "I can cope with having
>>> chunks of my address space provided to me by my executor" is ... odd.
>>
>> I don't disagree.  But it is useful.  We already pass a block of data containing
>> environment variables and arguments from one process to the next.  Preserving 
>> additional segments is not a big leap from there.
> 
> But we randomize where that block lives.
> 
> It has been found to be very useful to have randomization by default and
> you are arguing against that kind of randomization.

In the normal use case the addresses of the preserved anon segments were
already randomized during allocation in the pre-exec process.  Preservation
across exec does not make the process less safe from an external attack.

Now, if the attacker injects content at known addresses by preserving and
exec'ing, then the post-exec process must decide whether to trust the
pre-exec process.  Let's prevent preservation when exec'ing setuid binaries.
That is not part of this patch series, but is a good idea.  So, uid does not
change across exec.  Do I trust myself?  Yes I do.  If the uid is trusted,
as it must be in the qemu use case which accesses system resources, then
that is sufficient.  If the exec raises caps, and the uid is not trusted,
then the post-exec process must do additional work to verify the pre-exec
credentials.  The method is left to the app.  We defined the ELF opt-in so
that processes that do not use preserved memory will never receive any and
will never need to deal with such verification.

Matthews sileby/mshare proposal has the same issue.  If a process opts-in
and mmap's an address in the shared region, then content becomes mapped at
a VA that was known to the pre-fork or pre-exec process.  Trust must still
be established.

> Here is another suggestion.
> 
> Have a very simple program that does:
> 
> 	for (;;) {
> 		handle = dlopen("/my/real/program");
> 		real_main = dlsym(handle, "main");
> 		real_main(argc, argv, envp);
> 		dlclose(handle);
> 	}
> 
> With whatever obvious adjustments are needed to fit your usecase.
> 
> That should give the same level of functionality, be portable to all
> unices, and not require you to duplicate code.  It belive it limits you
> to not upgrading libc, or librt but that is a comparatively small
> limitation.
> 
> 
> Given that in general the interesting work is done in userspace and that
> userspace has provided an interface for reusing that work already.
> I don't see the justification for adding anything to exec at this point. 

Thanks for the suggestion.  That is clever, and would make a fun project,
but I would not trust it for production.  These few lines are just
the first of many that it would take to reset the environment to the
well-defined post-exec initial conditions that all executables expect,
and incrementally tearing down state will be prone to bugs.  Getting a
clean slate from a kernel exec is a much more reliable design.  The use
case is creating long-lived apps that never go down, and the simplest
implementation will have the fewest bugs and is the best.  MADV_DOEXEC is
simple, and does not even require a new system call, and the kernel already
knows how to exec without bugs.

- Steve
Matthew Wilcox (Oracle) July 31, 2020, 3:27 p.m. UTC | #20
On Fri, Jul 31, 2020 at 10:57:44AM -0400, Steven Sistare wrote:
> Matthews sileby/mshare proposal has the same issue.  If a process opts-in
> and mmap's an address in the shared region, then content becomes mapped at
> a VA that was known to the pre-fork or pre-exec process.  Trust must still
> be established.

It's up to the recipient whether they try to map it at the same address
or at a fresh address.  The intended use case is a "semi-shared" address
space between two processes (ie partway between a threaded, fully-shared
address space and a forked un-shared address space), in which case
there's a certain amount of trust and cooperation between the processes.

Your preservation-across-exec use-case might or might not need the
VMA to be mapped at the same address.  I don't know whether qemu stores
pointers in this VMA which are absolute within the qemu address space.
If it's just the emulated process's address space, then everything will
be absolute within its own address space and everything will be opaque
to qemu.  If qemu is storing its own pointers in it, then it has to be
mapped at the same address.

> > Here is another suggestion.
> > 
> > Have a very simple program that does:
> > 
> > 	for (;;) {
> > 		handle = dlopen("/my/real/program");
> > 		real_main = dlsym(handle, "main");
> > 		real_main(argc, argv, envp);
> > 		dlclose(handle);
> > 	}
> > 
> > With whatever obvious adjustments are needed to fit your usecase.
> > 
> > That should give the same level of functionality, be portable to all
> > unices, and not require you to duplicate code.  It belive it limits you
> > to not upgrading libc, or librt but that is a comparatively small
> > limitation.
> > 
> > 
> > Given that in general the interesting work is done in userspace and that
> > userspace has provided an interface for reusing that work already.
> > I don't see the justification for adding anything to exec at this point. 
> 
> Thanks for the suggestion.  That is clever, and would make a fun project,
> but I would not trust it for production.  These few lines are just
> the first of many that it would take to reset the environment to the
> well-defined post-exec initial conditions that all executables expect,
> and incrementally tearing down state will be prone to bugs.  Getting a
> clean slate from a kernel exec is a much more reliable design.  The use
> case is creating long-lived apps that never go down, and the simplest
> implementation will have the fewest bugs and is the best.  MADV_DOEXEC is
> simple, and does not even require a new system call, and the kernel already
> knows how to exec without bugs.

It's a net increase of 200 lines of kernel code.  If 4 lines of userspace
code removes 200 lines of kernel code, I think I know which I prefer ...
Steven Sistare July 31, 2020, 4:11 p.m. UTC | #21
On 7/31/2020 11:27 AM, Matthew Wilcox wrote:
> On Fri, Jul 31, 2020 at 10:57:44AM -0400, Steven Sistare wrote:
>> Matthews sileby/mshare proposal has the same issue.  If a process opts-in
>> and mmap's an address in the shared region, then content becomes mapped at
>> a VA that was known to the pre-fork or pre-exec process.  Trust must still
>> be established.
> 
> It's up to the recipient whether they try to map it at the same address
> or at a fresh address.  The intended use case is a "semi-shared" address
> space between two processes (ie partway between a threaded, fully-shared
> address space and a forked un-shared address space), in which case
> there's a certain amount of trust and cooperation between the processes.

Understood, but if the recipient does map at any of the same, which is the whole
point because you want to share the page table.  The trust relationship is no
different than for the live update case.  
 
> Your preservation-across-exec use-case might or might not need the
> VMA to be mapped at the same address.  

It does.  qemu registers memory with vfio which remembers the va's in kernel
metadata for the device.

> I don't know whether qemu stores
> pointers in this VMA which are absolute within the qemu address space.
> If it's just the emulated process's address space, then everything will
> be absolute within its own address space and everything will be opaque
> to qemu.  If qemu is storing its own pointers in it, then it has to be
> mapped at the same address.

qemu does not do the latter but that could be a nice way for apps to use
preserved memory.

>>> Here is another suggestion.
>>>
>>> Have a very simple program that does:
>>>
>>> 	for (;;) {
>>> 		handle = dlopen("/my/real/program");
>>> 		real_main = dlsym(handle, "main");
>>> 		real_main(argc, argv, envp);
>>> 		dlclose(handle);
>>> 	}
>>>
>>> With whatever obvious adjustments are needed to fit your usecase.
>>>
>>> That should give the same level of functionality, be portable to all
>>> unices, and not require you to duplicate code.  It belive it limits you
>>> to not upgrading libc, or librt but that is a comparatively small
>>> limitation.
>>>
>>>
>>> Given that in general the interesting work is done in userspace and that
>>> userspace has provided an interface for reusing that work already.
>>> I don't see the justification for adding anything to exec at this point. 
>>
>> Thanks for the suggestion.  That is clever, and would make a fun project,
>> but I would not trust it for production.  These few lines are just
>> the first of many that it would take to reset the environment to the
>> well-defined post-exec initial conditions that all executables expect,
>> and incrementally tearing down state will be prone to bugs.  Getting a
>> clean slate from a kernel exec is a much more reliable design.  The use
>> case is creating long-lived apps that never go down, and the simplest
>> implementation will have the fewest bugs and is the best.  MADV_DOEXEC is
>> simple, and does not even require a new system call, and the kernel already
>> knows how to exec without bugs.
> 
> It's a net increase of 200 lines of kernel code.  If 4 lines of userspace
> code removes 200 lines of kernel code, I think I know which I prefer ...

It will be *far* more than 4 lines.
Much of the 200 lines is mostly for the elf opt in, and much of the elf code is from
anthony reviving an earlier patch that use MAP_FIXED_NOREPLACE during segment setup.

- Steve
Jason Gunthorpe July 31, 2020, 4:56 p.m. UTC | #22
On Fri, Jul 31, 2020 at 12:11:52PM -0400, Steven Sistare wrote:
> > Your preservation-across-exec use-case might or might not need the
> > VMA to be mapped at the same address.  
> 
> It does.  qemu registers memory with vfio which remembers the va's in kernel
> metadata for the device.

Once the memory is registered with vfio the VA doesn't matter, vfio
will keep the iommu pointing at the same physical pages no matter
where they are mapped.

Jason
Steven Sistare July 31, 2020, 5:15 p.m. UTC | #23
On 7/31/2020 12:56 PM, Jason Gunthorpe wrote:
> On Fri, Jul 31, 2020 at 12:11:52PM -0400, Steven Sistare wrote:
>>> Your preservation-across-exec use-case might or might not need the
>>> VMA to be mapped at the same address.  
>>
>> It does.  qemu registers memory with vfio which remembers the va's in kernel
>> metadata for the device.
> 
> Once the memory is registered with vfio the VA doesn't matter, vfio
> will keep the iommu pointing at the same physical pages no matter
> where they are mapped.

Yes, but there are other code paths that compute and use offsets between va and the
base va.  Mapping at a different va in the new process breaks vfio; I have tried it.

- Steve
Matthew Wilcox (Oracle) July 31, 2020, 5:23 p.m. UTC | #24
On Fri, Jul 31, 2020 at 12:11:52PM -0400, Steven Sistare wrote:
> On 7/31/2020 11:27 AM, Matthew Wilcox wrote:
> > On Fri, Jul 31, 2020 at 10:57:44AM -0400, Steven Sistare wrote:
> >> Matthews sileby/mshare proposal has the same issue.  If a process opts-in
> >> and mmap's an address in the shared region, then content becomes mapped at
> >> a VA that was known to the pre-fork or pre-exec process.  Trust must still
> >> be established.
> > 
> > It's up to the recipient whether they try to map it at the same address
> > or at a fresh address.  The intended use case is a "semi-shared" address
> > space between two processes (ie partway between a threaded, fully-shared
> > address space and a forked un-shared address space), in which case
> > there's a certain amount of trust and cooperation between the processes.
> 
> Understood, but if the recipient does map at any of the same, which is the whole
> point because you want to share the page table.  The trust relationship is no
> different than for the live update case.  

You don't have to map at the same address to share the page tables.
For example, on x86 if you share an 8GB region, that must be aligned at
1GB in both the donor and the recipient, but they need not be mapped at
the same address.

> > It's a net increase of 200 lines of kernel code.  If 4 lines of userspace
> > code removes 200 lines of kernel code, I think I know which I prefer ...
> 
> It will be *far* more than 4 lines.
> Much of the 200 lines is mostly for the elf opt in, and much of the elf code is from
> anthony reviving an earlier patch that use MAP_FIXED_NOREPLACE during segment setup.

It doesn't really matter how much of it is for the opt-in and how much
is for the exec path itself.  The MAP_FIXED_NOREPLACE patch is only net
+16 lines, so that's not the problem.
Jason Gunthorpe July 31, 2020, 5:48 p.m. UTC | #25
On Fri, Jul 31, 2020 at 01:15:34PM -0400, Steven Sistare wrote:
> On 7/31/2020 12:56 PM, Jason Gunthorpe wrote:
> > On Fri, Jul 31, 2020 at 12:11:52PM -0400, Steven Sistare wrote:
> >>> Your preservation-across-exec use-case might or might not need the
> >>> VMA to be mapped at the same address.  
> >>
> >> It does.  qemu registers memory with vfio which remembers the va's in kernel
> >> metadata for the device.
> > 
> > Once the memory is registered with vfio the VA doesn't matter, vfio
> > will keep the iommu pointing at the same physical pages no matter
> > where they are mapped.
> 
> Yes, but there are other code paths that compute and use offsets between va and the
> base va.  Mapping at a different va in the new process breaks vfio; I have tried it.

Maybe you could fix vfio instead of having this adventure, if vfio is
the only motivation.

Jason
Steven Sistare July 31, 2020, 5:55 p.m. UTC | #26
On 7/31/2020 1:48 PM, Jason Gunthorpe wrote:
> On Fri, Jul 31, 2020 at 01:15:34PM -0400, Steven Sistare wrote:
>> On 7/31/2020 12:56 PM, Jason Gunthorpe wrote:
>>> On Fri, Jul 31, 2020 at 12:11:52PM -0400, Steven Sistare wrote:
>>>>> Your preservation-across-exec use-case might or might not need the
>>>>> VMA to be mapped at the same address.  
>>>>
>>>> It does.  qemu registers memory with vfio which remembers the va's in kernel
>>>> metadata for the device.
>>>
>>> Once the memory is registered with vfio the VA doesn't matter, vfio
>>> will keep the iommu pointing at the same physical pages no matter
>>> where they are mapped.
>>
>> Yes, but there are other code paths that compute and use offsets between va and the
>> base va.  Mapping at a different va in the new process breaks vfio; I have tried it.
> 
> Maybe you could fix vfio instead of having this adventure, if vfio is
> the only motivation.

Maybe.  We still need to preserve an anonymous segment, though.  MADV_DOEXEC, or mshare,
or something else.  And I think the ability to preserve memory containing pointers to itself
is an interesting use case, though not ours.

- Steve
Steven Sistare July 31, 2020, 7:41 p.m. UTC | #27
On 7/27/2020 1:11 PM, Anthony Yznaga wrote:
> This patchset adds support for preserving an anonymous memory range across
> exec(3) using a new madvise MADV_DOEXEC argument.  The primary benefit for
> sharing memory in this manner, as opposed to re-attaching to a named shared
> memory segment, is to ensure it is mapped at the same virtual address in
> the new process as it was in the old one.  An intended use for this is to
> preserve guest memory for guests using vfio while qemu exec's an updated
> version of itself.  By ensuring the memory is preserved at a fixed address,
> vfio mappings and their associated kernel data structures can remain valid.
> In addition, for the qemu use case, qemu instances that back guest RAM with
> anonymous memory can be updated.

I forgot to mention, our use case is not just theoretical.  It has been implemented
and is pretty cool (but I am biased).  The pause time for the guest is in the
100 - 200 msec range.  We submitted qemu patches for review based on the MADV_DOEXEC
proposal.  In case you are curious:
  https://lore.kernel.org/qemu-devel/1596122076-341293-1-git-send-email-steven.sistare@oracle.com/

- Steve
Eric W. Biederman Aug. 3, 2020, 3:28 p.m. UTC | #28
Steven Sistare <steven.sistare@oracle.com> writes:

> On 7/30/2020 5:58 PM, ebiederm@xmission.com wrote:
>> Here is another suggestion.
>> 
>> Have a very simple program that does:
>> 
>> 	for (;;) {
>> 		handle = dlopen("/my/real/program");
>> 		real_main = dlsym(handle, "main");
>> 		real_main(argc, argv, envp);
>> 		dlclose(handle);
>> 	}
>> 
>> With whatever obvious adjustments are needed to fit your usecase.
>> 
>> That should give the same level of functionality, be portable to all
>> unices, and not require you to duplicate code.  It belive it limits you
>> to not upgrading libc, or librt but that is a comparatively small
>> limitation.
>> 
>> 
>> Given that in general the interesting work is done in userspace and that
>> userspace has provided an interface for reusing that work already.
>> I don't see the justification for adding anything to exec at this point. 
>
> Thanks for the suggestion.  That is clever, and would make a fun project,
> but I would not trust it for production.  These few lines are just
> the first of many that it would take to reset the environment to the
> well-defined post-exec initial conditions that all executables expect,
> and incrementally tearing down state will be prone to bugs.

Agreed.

> Getting a clean slate from a kernel exec is a much more reliable
> design.

Except you are explicitly throwing that out the window, by preserving
VMAs.  You very much need to have a clean bug free shutdown to pass VMAs
reliably.

> The use case is creating long-lived apps that never go down, and the
> simplest implementation will have the fewest bugs and is the best.
> MADV_DOEXEC is simple, and does not even require a new system call,
> and the kernel already knows how to exec without bugs.

*ROFL*  I wish the kernel knew how to exec things without bugs.
The bugs are hard to hit but the ones I am aware of are not straight
forward to fix.

MADV_DOEXEC is not conceptually simple.  It completely violates the
guarantees that exec is known to make about the contents of the memory
of the new process.  This makes it very difficult to reason about.  Nor
will MADV_DOEXEC be tested very much as it has only one or two users.
Which means in the fullness of time it is likely someone will change
something that will break the implementation subtlely and the bug report
probably won't come in for 3 years, or maybe a decade.  At which point
it won't be clear if the bug even can be fixed as something else might
rely on it.

What is wrong with live migration between one qemu process and another
qemu process on the same machine not work for this use case?

Just reusing live migration would seem to be the simplest path of all,
as the code is already implemented.  Further if something goes wrong
with the live migration you can fallback to the existing process.  With
exec there is no fallback if the new version does not properly support
the handoff protocol of the old version.

Eric
James Bottomley Aug. 3, 2020, 3:42 p.m. UTC | #29
On Mon, 2020-08-03 at 10:28 -0500, Eric W. Biederman wrote:
[...]
> What is wrong with live migration between one qemu process and
> another qemu process on the same machine not work for this use case?
> 
> Just reusing live migration would seem to be the simplest path of
> all, as the code is already implemented.  Further if something goes
> wrong with the live migration you can fallback to the existing
> process.  With exec there is no fallback if the new version does not
> properly support the handoff protocol of the old version.

Actually, could I ask this another way: the other patch set you sent to
the KVM list was to snapshot the VM to a PKRAM capsule preserved across
kexec using zero copy for extremely fast save/restore.  The original
idea was to use this as part of a CRIU based snapshot, kexec to new
system, restore.  However, why can't you do a local snapshot, restart
qemu, restore using the PKRAM capsule to achieve exactly the same as
MADV_DOEXEC does but using a system that's easy to reason about?  It
may be slightly slower, but I think we're still talking milliseconds.

James
Steven Sistare Aug. 3, 2020, 7:29 p.m. UTC | #30
On 8/3/2020 11:28 AM, ebiederm@xmission.com wrote:
> Steven Sistare <steven.sistare@oracle.com> writes:
>> On 7/30/2020 5:58 PM, ebiederm@xmission.com wrote:
>>> Here is another suggestion.
>>>
>>> Have a very simple program that does:
>>>
>>> 	for (;;) {
>>> 		handle = dlopen("/my/real/program");
>>> 		real_main = dlsym(handle, "main");
>>> 		real_main(argc, argv, envp);
>>> 		dlclose(handle);
>>> 	}
>>>
>>> With whatever obvious adjustments are needed to fit your usecase.
>>>
>>> That should give the same level of functionality, be portable to all
>>> unices, and not require you to duplicate code.  It belive it limits you
>>> to not upgrading libc, or librt but that is a comparatively small
>>> limitation.
>>>
>>>
>>> Given that in general the interesting work is done in userspace and that
>>> userspace has provided an interface for reusing that work already.
>>> I don't see the justification for adding anything to exec at this point. 
>>
>> Thanks for the suggestion.  That is clever, and would make a fun project,
>> but I would not trust it for production.  These few lines are just
>> the first of many that it would take to reset the environment to the
>> well-defined post-exec initial conditions that all executables expect,
>> and incrementally tearing down state will be prone to bugs.
> 
> Agreed.
> 
>> Getting a clean slate from a kernel exec is a much more reliable
>> design.
> 
> Except you are explicitly throwing that out the window, by preserving
> VMAs.  You very much need to have a clean bug free shutdown to pass VMAs
> reliably.

Sure.  The whole community relies on you and others to provide a bug free exec.

>> The use case is creating long-lived apps that never go down, and the
>> simplest implementation will have the fewest bugs and is the best.
>> MADV_DOEXEC is simple, and does not even require a new system call,
>> and the kernel already knows how to exec without bugs.
> 
> *ROFL*  I wish the kernel knew how to exec things without bugs.

Essentially you are saying you would argue against any enhancement to exec.
Surely that is too high a bar.  We must continue to evolve an innovate and
balance risk against reward.  This use case matters to our business a lot,
and to others as well, see below.  That is the reward.  I feel you are 
overstating the risk.  Surely there is some early point in the development
cycle of some release where this can be integrated and get enough test
time and soak time to be proven reliable.

> The bugs are hard to hit but the ones I am aware of are not straight
> forward to fix.
> 
> MADV_DOEXEC is not conceptually simple.  It completely violates the
> guarantees that exec is known to make about the contents of the memory
> of the new process.  This makes it very difficult to reason about.  

I have having trouble see the difficulty.  Perhaps I am too familar with
it, but the semantics are few and easy to explain, and it does not introduce
new concepts: the post-exec process is born with a few more mappings than
previously, and non-fixed further mmaps choose addresses in the holes.

> Nor
> will MADV_DOEXEC be tested very much as it has only one or two users.
> Which means in the fullness of time it is likely someone will change
> something that will break the implementation subtlely and the bug report
> probably won't come in for 3 years, or maybe a decade.  At which point
> it won't be clear if the bug even can be fixed as something else might
> rely on it.

That's on us; we need to provide kernel tests and be diligent about testing
new releases.  This matters to our business and we will do so. 
> What is wrong with live migration between one qemu process and another
> qemu process on the same machine not work for this use case?
> 
> Just reusing live migration would seem to be the simplest path of all,
> as the code is already implemented.  Further if something goes wrong
> with the live migration you can fallback to the existing process.  With
> exec there is no fallback if the new version does not properly support
> the handoff protocol of the old version.

This is less resource intensive than live migration.  The latter ties up two
hosts, consumes lots of memory and network bandwidth, may take a long time
to converge on a busy system, and is unfeasible for guests with a huge amount
of local storeage (which we call dense I/O shapes).  Live update takes less than
1 second total, and the guest pause time is 100 - 200 msecs.  It is a very
attractive solution that other cloud vendors have implemented as well, with
their own private modifications to exec and and fork.  We have been independently
working in this area, and we are offering our implementation to the community.

- Steve
Steven Sistare Aug. 3, 2020, 8:03 p.m. UTC | #31
On 8/3/2020 11:42 AM, James Bottomley wrote:
> On Mon, 2020-08-03 at 10:28 -0500, Eric W. Biederman wrote:
> [...]
>> What is wrong with live migration between one qemu process and
>> another qemu process on the same machine not work for this use case?
>>
>> Just reusing live migration would seem to be the simplest path of
>> all, as the code is already implemented.  Further if something goes
>> wrong with the live migration you can fallback to the existing
>> process.  With exec there is no fallback if the new version does not
>> properly support the handoff protocol of the old version.
> 
> Actually, could I ask this another way: the other patch set you sent to
> the KVM list was to snapshot the VM to a PKRAM capsule preserved across
> kexec using zero copy for extremely fast save/restore.  The original
> idea was to use this as part of a CRIU based snapshot, kexec to new
> system, restore.  However, why can't you do a local snapshot, restart
> qemu, restore using the PKRAM capsule to achieve exactly the same as
> MADV_DOEXEC does but using a system that's easy to reason about?  It
> may be slightly slower, but I think we're still talking milliseconds.

Hi James, good to hear from you.  PKRAM or SysV shm could be used for
a restart in that manner, but it would only support sriov guests if the
guest exports an agent that supports suspend-to-ram, and if all guest
drivers support the suspend-to-ram method.  I have done this using a linux
guest and qemu guest agent, and IIRC the guest pause time is 500 - 1000 msec.
With MADV_DOEXEC, pause time is 100 - 200 msec.  The pause time is a handful
of seconds if the guest uses an nvme drive because CC.SHN takes so long
to persist metadata to stable storage.

We could instead pass vfio descriptors from the old process to a 3rd party escrow 
process and pass  them back to the new qemu process, but the shm that vfio has 
already registered must be remapped at the same VA as the previous process, and 
there is no interface to guarantee that.  MAP_FIXED blows away existing mappings 
and breaks the app. MAP_FIXED_NOREPLACE respects existing mappings but cannot map 
the shm and breaks the app.  Adding a feature that reserves VAs would fix that, we 
have experimnted with one.  Fixing the vfio kernel implementation to not use the 
original VA base would also work, but I don't know how doable/difficult that would be.

Both solutions would require a qemu instance to be stopped and relaunched using shm
as guest ram, and its guest rebooted, so they do not let us update legacy 
already-running instances that use anon memory.  That problem solves itself if we 
get these rfe's into linux and qemu, and eventually users shut down the legacy
instances, but that takes years and we need to do it sooner.

- Steve
Matthew Wilcox (Oracle) Aug. 4, 2020, 11:13 a.m. UTC | #32
On Tue, Aug 04, 2020 at 08:44:42AM +0000, David Laight wrote:
> From: James Bottomley
> > Sent: 03 August 2020 16:43
> > 
> > On Mon, 2020-08-03 at 10:28 -0500, Eric W. Biederman wrote:
> > [...]
> > > What is wrong with live migration between one qemu process and
> > > another qemu process on the same machine not work for this use case?
> > >
> > > Just reusing live migration would seem to be the simplest path of
> > > all, as the code is already implemented.  Further if something goes
> > > wrong with the live migration you can fallback to the existing
> > > process.  With exec there is no fallback if the new version does not
> > > properly support the handoff protocol of the old version.
> > 
> > Actually, could I ask this another way: the other patch set you sent to
> > the KVM list was to snapshot the VM to a PKRAM capsule preserved across
> > kexec using zero copy for extremely fast save/restore.  The original
> > idea was to use this as part of a CRIU based snapshot, kexec to new
> > system, restore.  However, why can't you do a local snapshot, restart
> > qemu, restore using the PKRAM capsule to achieve exactly the same as
> > MADV_DOEXEC does but using a system that's easy to reason about?  It
> > may be slightly slower, but I think we're still talking milliseconds.
> 
> 
> I've had another idea (that is probably impossible...).
> What about a 'reverse mmap' operation.
> Something that creates an fd whose contents are a chunk of the
> processes address space.

http://www.wil.cx/~willy/linux/sileby.html