[RESEND,RFC,0/5] Remote mapping

Message ID 20200904113116.20648-1-alazar@bitdefender.com

Message

Adalbert Lazăr Sept. 4, 2020, 11:31 a.m. UTC
This patchset adds support for the remote mapping feature.
Remote mapping, as its name suggests, is a means for transparent and
zero-copy access of a remote process' address space.

The feature was designed according to a specification suggested by
Paolo Bonzini:
>> The proposed API is a new pidfd system call, through which the parent
>> can map portions of its virtual address space into a file descriptor
>> and then pass that file descriptor to a child.
>>
>> This should be:
>>
>> - upstreamable, pidfd is the new cool thing and we could sell it as a
>> better way to do PTRACE_{PEEK,POKE}DATA
>>
>> - relatively easy to do based on the bitdefender remote process
>> mapping patches at.
>>
>> - pidfd_mem() takes a pidfd and some flags (which are 0) and returns
>> two file descriptors for respectively the control plane and the memory access.
>>
>> - the control plane accepts three ioctls
>>
>> PIDFD_MEM_MAP takes a struct like
>>
>>     struct pidfd_mem_map {
>>          uint64_t address;
>>          off_t offset;
>>          off_t size;
>>          int flags;
>>          int padding[7];
>>     }
>>
>> After this is done, the memory access fd can be mmap-ed at range
>> [offset,
>> offset+size), and it will read memory from range [address,
>> address+size) of the target descriptor.
>>
>> PIDFD_MEM_UNMAP takes a struct like
>>
>>     struct pidfd_mem_unmap {
>>          off_t offset;
>>          off_t size;
>>     }
>>
>> and unmaps the corresponding range of course.
>>
>> Finally PIDFD_MEM_LOCK forbids subsequent PIDFD_MEM_MAP or
>> PIDFD_MEM_UNMAP.  For now I think it should just check that the
>> argument is zero, bells and whistles can be added later.
>>
>> - the memory access fd can be mmap-ed as in the bitdefender patches
>> but also accessed with read/write/pread/pwrite/...  As in the
>> BitDefender patches, MMU notifiers can be used to adjust any mmap-ed
>> regions when the source address space changes.  In this case,
>> PIDFD_MEM_UNMAP could also cause a pre-existing mmap to "disappear".
(The current implementation does not yet support read/write/pread/pwrite/...)
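
A rough userspace sketch of the proposed flow (error handling trimmed;
the pidfd_mem() wrapper, its syscall number and the way the two file
descriptors are returned are assumptions, since the spec above leaves
them open):

    /* Expose [addr, addr+len) of the process identified by pidfd. */
    int fds[2];   /* assumed: fds[0] = control fd, fds[1] = access fd */
    syscall(__NR_pidfd_mem, pidfd, fds, 0);

    struct pidfd_mem_map map = {
        .address = (uint64_t)addr,  /* range in the target process */
        .offset  = 0,               /* where it appears in the access fd */
        .size    = len,
        .flags   = 0,
    };
    ioctl(fds[0], PIDFD_MEM_MAP, &map);
    ioctl(fds[0], PIDFD_MEM_LOCK, 0);   /* no further MAP/UNMAP allowed */

    /* Consumer (after receiving fds[1], e.g. over SCM_RIGHTS): */
    void *mirror = mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fds[1], 0);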

The main remote mapping patch also contains the legacy implementation which
creates a region the size of the whole process address space by means of the
REMOTE_PROC_MAP ioctl. The user is then free to mmap() any region of the
address space it wishes.

VMAs obtained by mmap()ing memory access fds mirror the contents of the remote
process address space within the specified range. Pages are installed in the
current process page tables at fault time and removed by the mmu_interval_notifier
invalidate callback. No further memory management is involved.
On attempts to access a hole, or if a mapping was removed by PIDFD_MEM_UNMAP,
or if the remote process address space was reaped by OOM, the remote mapping
fault handler returns VM_FAULT_SIGBUS.
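
For illustration, the fault path amounts to something like the sketch
below (simplified; remote_page_lookup() is a stand-in for the real
lookup against the target mm, not a function from this patchset):

    static vm_fault_t remmap_fault(struct vm_fault *vmf)
    {
        struct page *page;

        /* Find the page backing this offset in the remote address space;
         * NULL for holes, ranges removed by PIDFD_MEM_UNMAP, or a
         * target mm already reaped by OOM.
         */
        page = remote_page_lookup(vmf->vma, vmf->pgoff);
        if (!page)
            return VM_FAULT_SIGBUS;

        /* Install the remote page in the current process page tables. */
        return vmf_insert_page(vmf->vma, vmf->address, page);
    }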

At Bitdefender we are using remote mapping for virtual machine introspection:
- the QEMU running the introspected machine creates the pair of file descriptors,
passes the access fd to the introspector QEMU, and uses the control fd to allow
access to the memslots it creates for its machine;
- the QEMU running the introspector machine receives the access fd and mmap()s
the regions made available, then hotplugs the obtained memory into its machine.
This setup creates nested invalidate_range_start/end MMU notifier calls.

Patch organization:
- patch 1 allows unmap_page_range() to run without rescheduling
  Needed for remote mapping to zap current process page tables when OOM calls
  mmu_notifier_invalidate_range_start_nonblock(&range)

- patch 2 creates VMA-specific zapping behavior
  A remote mapping VMA does not own the pages it maps, so all it has to do is
  clear the PTEs.

- patch 3 removes the MMU notifier lockdep map
  It was simply incompatible with our use case.

- patch 4 is the remote mapping implementation

- patch 5 adds the suggested pidfd_mem system call

Mircea Cirjaliu (5):
  mm: add atomic capability to zap_details
  mm: let the VMA decide how zap_pte_range() acts on mapped pages
  mm/mmu_notifier: remove lockdep map, allow mmu notifier to be used in
    nested scenarios
  mm/remote_mapping: use a pidfd to access memory belonging to unrelated
    process
  pidfd_mem: implemented remote memory mapping system call

 arch/x86/entry/syscalls/syscall_32.tbl |    1 +
 arch/x86/entry/syscalls/syscall_64.tbl |    1 +
 include/linux/mm.h                     |   22 +
 include/linux/mmu_notifier.h           |    5 +-
 include/linux/pid.h                    |    1 +
 include/linux/remote_mapping.h         |   22 +
 include/linux/syscalls.h               |    1 +
 include/uapi/asm-generic/unistd.h      |    2 +
 include/uapi/linux/remote_mapping.h    |   36 +
 kernel/exit.c                          |    2 +-
 kernel/pid.c                           |   55 +
 mm/Kconfig                             |   11 +
 mm/Makefile                            |    1 +
 mm/memory.c                            |  193 ++--
 mm/mmu_notifier.c                      |   19 -
 mm/remote_mapping.c                    | 1273 ++++++++++++++++++++++++
 16 files changed, 1535 insertions(+), 110 deletions(-)
 create mode 100644 include/linux/remote_mapping.h
 create mode 100644 include/uapi/linux/remote_mapping.h
 create mode 100644 mm/remote_mapping.c


base-commit: ae83d0b416db002fe95601e7f97f64b59514d936

Comments

Jason Gunthorpe Sept. 4, 2020, 12:11 p.m. UTC | #1
On Fri, Sep 04, 2020 at 02:31:11PM +0300, Adalbert Lazăr wrote:
> VMAs obtained by mmap()ing memory access fds mirror the contents of the remote
> process address space within the specified range. Pages are installed in the
> current process page tables at fault time and removed by the mmu_interval_notifier
> invalidate callback. No further memory management is involved.
> On attempts to access a hole, or if a mapping was removed by PIDFD_MEM_UNMAP,
> or if the remote process address space was reaped by OOM, the remote mapping
> fault handler returns VM_FAULT_SIGBUS.

I still think anything along these lines needs to meet the XPMEM use
cases as well, we have to have more general solutions for such MM
stuff:

https://gitlab.com/hjelmn/xpmem

However, I think this fundamentally falls into some of the same bad
direction as xpmem.

I would much rather see this design copy & clone the VMA's than try to
mirror the PTEs inside the VMAs from the remote into a single giant
VMA and somehow split/mirror the VMA ops.

This is just too weird and fragile to be maintainable over the long
term.

For instance, one of the major bugs in things like xpmem was that they
are incompatible with get_user_pages(), largely because of this issue.

I feel like I said this already..

Jason
Mircea CIRJALIU - MELIU Sept. 4, 2020, 1:24 p.m. UTC | #2
> On Fri, Sep 04, 2020 at 02:31:11PM +0300, Adalbert Lazăr wrote:
> > VMAs obtained by mmap()ing memory access fds mirror the contents of
> > the remote process address space within the specified range. Pages are
> > installed in the current process page tables at fault time and removed
> > by the mmu_interval_notifier invalidate callback. No further memory
> management is involved.
> > On attempts to access a hole, or if a mapping was removed by
> > PIDFD_MEM_UNMAP, or if the remote process address space was reaped by
> > OOM, the remote mapping fault handler returns VM_FAULT_SIGBUS.
> 
> I still think anything along these lines needs to meet the XPMEM use cases as
> well, we have to have more general solutions for such MM
> stuff:
> 
> https://gitlab.com/hjelmn/xpmem
> 
> However, I think this fundamentally falls into some of the same bad direction
> as xpmem.
> 
> I would much rather see this design copy & clone the VMA's than try to
> mirror the PTEs inside the VMAs from the remote into a single giant VMA and
> somehow split/mirror the VMA ops.

This design was made specifically for virtual machine introspection, where we
care more about the contents of the address space than about the remote VMAs
and their vmops. (Right now only anon pages can be mapped, but I guess
we can extend it to pagecache pages as well.) I just used what seemed to be the
common denominator of all page-related operations: range invalidation.
This looks like a general solution.

IMO cloning a VMA in an address space that has a completely different layout
will present its own set of caveats: What happens if the VMA resizes/splits? 
Can you replay all the remote VMA vmops on the clone VMA?

> This is just too weird and fragile to be maintainable over a long term.
> 
> For instance, one of the major bugs in things like xpmem was that they are
> incompatible with get_user_pages(), largely because of this issue.

We support get_user_pages(); that's how we integrate with KVM.
The difference is that the page returned will not belong to the current process.

> I feel like I said this already..
> 
> Jason
Jason Gunthorpe Sept. 4, 2020, 1:39 p.m. UTC | #3
On Fri, Sep 04, 2020 at 01:24:43PM +0000, Mircea CIRJALIU - MELIU wrote:
> > On Fri, Sep 04, 2020 at 02:31:11PM +0300, Adalbert Lazăr wrote:
> > > VMAs obtained by mmap()ing memory access fds mirror the contents of
> > > the remote process address space within the specified range. Pages are
> > > installed in the current process page tables at fault time and removed
> > > by the mmu_interval_notifier invalidate callback. No further memory
> > management is involved.
> > > On attempts to access a hole, or if a mapping was removed by
> > > PIDFD_MEM_UNMAP, or if the remote process address space was reaped by
> > > OOM, the remote mapping fault handler returns VM_FAULT_SIGBUS.
> > 
> > I still think anything along these lines needs to meet the XPMEM use cases as
> > well, we have to have more general solutions for such MM
> > stuff:
> > 
> > https://gitlab.com/hjelmn/xpmem
> > 
> > However, I think this fundamentally falls into some of the same bad direction
> > as xpmem.
> > 
> > I would much rather see this design copy & clone the VMA's than try to
> > mirror the PTEs inside the VMAs from the remote into a single giant VMA and
> > somehow split/mirror the VMA ops.
> 
> This design was made specifically for virtual machine introspection, where we 
> care more about the contents of the address space, rather than the remote VMAs
> and their vmops. (Right now only anon pages can be mapped, but I guess
> we can extend to pagecache pages as well.) I just used what seemed to be the
> common denominator to all page-related operations: range invalidation.
> This looks like a general solution.

The point is that a VMA is how the MM connects its parts together;
cloning the content of a VMA without the rest of the VMA meta-data is
just going to be very fragile in the long run.

Especially if the VMA is presented as a normal VMA with working struct
pages/etc, not a pfn map.

> IMO cloning a VMA in an address space that has a completely different layout
> will present its own set of caveats: What happens if the VMA resizes/splits? 
> Can you replay all the remote VMA vmops on the clone VMA?

The mirror would have to reclone the source VMA every time the source
VMA changes.

> > This is just too weird and fragile to be maintainable over a long term.
> > 
> > For instance, one of the major bugs in things like xpmem was that they are
> > incompatible with get_user_pages(), largely because of this issue.
> 
> We support get_user_pages(), that's how we integrate with KVM.

This seems really sketchy; get_user_pages is sensitive to the VMA.
What happens when VMA flags are different/etc?

Jason
Mircea CIRJALIU - MELIU Sept. 4, 2020, 2:18 p.m. UTC | #4
> On Fri, Sep 04, 2020 at 01:24:43PM +0000, Mircea CIRJALIU - MELIU wrote:
> > > On Fri, Sep 04, 2020 at 02:31:11PM +0300, Adalbert Lazăr wrote:
> > > > VMAs obtained by mmap()ing memory access fds mirror the contents
> > > > of the remote process address space within the specified range.
> > > > Pages are installed in the current process page tables at fault
> > > > time and removed by the mmu_interval_notifier invalidate callback.
> > > > No further memory
> > > management is involved.
> > > > On attempts to access a hole, or if a mapping was removed by
> > > > PIDFD_MEM_UNMAP, or if the remote process address space was reaped
> > > > by OOM, the remote mapping fault handler returns VM_FAULT_SIGBUS.
> > >
> > > I still think anything along these lines needs to meet the XPMEM use
> > > cases as well, we have to have more general solutions for such MM
> > > stuff:
> > >
> > > https://gitlab.com/hjelmn/xpmem
> > >
> > > However, I think this fundamentally falls into some of the same bad
> > > direction as xpmem.
> > >
> > > I would much rather see this design copy & clone the VMA's than try
> > > to mirror the PTEs inside the VMAs from the remote into a single
> > > giant VMA and somehow split/mirror the VMA ops.
> >
> > This design was made specifically for virtual machine introspection,
> > where we care more about the contents of the address space, rather
> > than the remote VMAs and their vmops. (Right now only anon pages can
> > be mapped, but I guess we can extend to pagecache pages as well.) I
> > just used what seemed to be the common denominator to all page-related
> > operations: range invalidation.
> > This looks like a general solution.
> 
> The point is that a VMA is how the MM connects its parts together, cloning
> the content of a VMA without the rest of the VMA meta-data is just going to
> be very fragile in the long run..
> 
> Especially if the VMA is presented as a normal VMA with working struct
> pages/etc, not a pfn map.
>
> > IMO cloning a VMA in an address space that has a completely different
> > layout will present its own set of caveats: What happens if the VMA
> resizes/splits?
> > Can you replay all the remote VMA vmops on the clone VMA?
> 
> The mirror would have to reclone the source VMA every time the source
> VMA changes.
> 
> > > This is just too weird and fragile to be maintainable over a long term.
> > >
> > > For instance, one of the major bugs in things like xpmem was that
> > > they are incompatible with get_user_pages(), largely because of this
> issue.
> >
> > We support get_user_pages(), that's how we integrate with KVM.
> 
> This seems really sketchy, get_user_pages is sensitive to the VMA, what
> happens when VMA flags are different/etc?

A debugger shouldn't complain if a portion of the debuggee is read-only;
it just overwrites the data.

> Jason
Jason Gunthorpe Sept. 4, 2020, 2:39 p.m. UTC | #5
On Fri, Sep 04, 2020 at 02:18:37PM +0000, Mircea CIRJALIU - MELIU wrote:
> > This seems really sketchy, get_user_pages is sensitive to the VMA, what
> > happens when VMA flags are different/etc?
> 
> A debugger shouldn't complain if a portion of the debuggee is read-only,
> just overwrite the data.

At this point the kernel API here is so incredibly limited you may as
well use a memfd for passing the shared address space instead of
trying to do and maintain this complexity.

Your use case is only qemu, so what is the problem with replacing the
allocator backing VM memory in userspace? Other people have been
talking about doing a memfd already for different reasons - and memfd
can already be shared as this scheme desires.

Jason
Mircea CIRJALIU - MELIU Sept. 4, 2020, 3:40 p.m. UTC | #6
> On Fri, Sep 04, 2020 at 02:18:37PM +0000, Mircea CIRJALIU - MELIU wrote:
> > > This seems really sketchy, get_user_pages is sensitive to the VMA,
> > > what happens when VMA flags are different/etc?
> >
> > A debugger shouldn't complain if a portion of the debuggee is
> > read-only, just overwrite the data.
> 
> At this point the kernel API here is so incredibly limited you may as well use a
> memfd for passing the shared address space instead of trying to do and
> maintain this complexity.
> 
> Your use case is only qemu, so what is the problem to replace the allocator
> backing VM memory in userspace? Other people have been talking about
> doing a memfd already for different reasons - and memfd can already be
> shared as this scheme desires.

KSM doesn't work on shmem.
Once you replace the allocator you render KSM useless.

Besides that, I had a mail once from Paolo Bonzini:
>> Hi,
>>
>> here at FOSDEM we discussed having a way for a parent process to 
>> split parts of an mmap range with one or more child processes.  This 
>> turns out to be a generalization of the remote memory mapping concept 
>> that BitDefender proposed for virtual machine introspection ( 
>> https://patchwork.kernel.org/patch/11284561/).  So far the patches 
>> haven't had a great reception from the MM people, but it shouldn't be 
>> hard to adjust the API according to the sketch below.  I am also 
>> including Mircea who is the author.
>>
>> The proposed API is a new pidfd system call, through which the parent 
>> can map portions of its virtual address space into a file descriptor 
>> and then pass that file descriptor to a child.
(the rest can be found in the cover letter)

Therefore I had to write a module that peeks into anon process memory
and is compatible with KSM. This was among the requirements for the first
version of remote mapping, which ended up non-scalable.

It turns out it can peek into any kind of memory involving pages.
Also it doesn't have the overhead associated with mapping a page in a VMA.
And compared to ptrace(), it can keep the pages resident as long as needed.

Mircea
Jason Gunthorpe Sept. 4, 2020, 4:11 p.m. UTC | #7
On Fri, Sep 04, 2020 at 03:40:55PM +0000, Mircea CIRJALIU - MELIU wrote:
> > On Fri, Sep 04, 2020 at 02:18:37PM +0000, Mircea CIRJALIU - MELIU wrote:
> > > > This seems really sketchy, get_user_pages is sensitive to the VMA,
> > > > what happens when VMA flags are different/etc?
> > >
> > > A debugger shouldn't complain if a portion of the debuggee is
> > > read-only, just overwrite the data.
> > 
> > At this point the kernel API here is so incredibly limited you may as well use a
> > memfd for passing the shared address space instead of trying to do and
> > maintain this complexity.
> > 
> > Your use case is only qemu, so what is the problem to replace the allocator
> > backing VM memory in userspace? Other people have been talking about
> > doing a memfd already for different reasons - and memfd can already be
> > shared as this scheme desires.
> 
> KSM doesn't work on shmem.
> Once you replace the allocator you render KSM useless.

I suspect making memfd to work with KSM will be much less hacky than
this.

> Figures out it can peek into any kind of memory involving pages.

No, this approach is really limited to anonymous VMAs.

You might make some argument that VMA differences can be ignored if
they are all anonymous to start with, but certainly not once other
types of VMAs are included.

Jason
Florian Weimer Sept. 4, 2020, 7:19 p.m. UTC | #8
* Adalbert Lazăr:

>>> - the control plane accepts three ioctls
>>>
>>> PIDFD_MEM_MAP takes a struct like
>>>
>>>     struct pidfd_mem_map {
>>>          uint64_t address;
>>>          off_t offset;
>>>          off_t size;
>>>          int flags;
>>>          int padding[7];
>>>     }
>>>
>>> After this is done, the memory access fd can be mmap-ed at range
>>> [offset,
>>> offset+size), and it will read memory from range [address,
>>> address+size) of the target descriptor.
>>>
>>> PIDFD_MEM_UNMAP takes a struct like
>>>
>>>     struct pidfd_mem_unmap {
>>>          off_t offset;
>>>          off_t size;
>>>     }
>>>
>>> and unmaps the corresponding range of course.

off_t depends on preprocessor macros (_FILE_OFFSET_BITS), so these
types should be uint64_t, too.
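
(Something along these lines would make the layout independent of
_FILE_OFFSET_BITS; just a sketch of the suggested change:)

    struct pidfd_mem_map {
         uint64_t address;
         uint64_t offset;
         uint64_t size;
         int flags;
         int padding[7];
    };

    struct pidfd_mem_unmap {
         uint64_t offset;
         uint64_t size;
    };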

I'm not sure what the advantage is of returning separate file
descriptors, and not operating directly on the pidfd.
Andy Lutomirski Sept. 4, 2020, 7:39 p.m. UTC | #9
On Fri, Sep 4, 2020 at 4:41 AM Adalbert Lazăr <alazar@bitdefender.com> wrote:
>
> This patchset adds support for the remote mapping feature.
> Remote mapping, as its name suggests, is a means for transparent and
> zero-copy access of a remote process' address space.
>

I think this is very clever, but I find myself wondering what happens
if people start trying to abuse this by, for example, setting up a
remote mapping pointing to fun regions like userfaultfd or another
remote mapping.

I'm a little concerned that it's actually too clever and that maybe a
more straightforward solution should be investigated.  I personally
rather dislike the KVM model in which the guest address space mirrors
the host (QEMU) address space rather than being its own thing.  In
particular, the current model means that extra-special-strange
mappings like SEV-encrypted memory are required to be present in the
QEMU page tables in order for the guest to see them.

(If I had noticed that last bit before it went upstream, I would have
NAKked it.  I would still like to see it deprecated and ideally
eventually removed from the kernel.  We have absolutely no business
creating incoherent mappings like this.)

--Andy
Matthew Wilcox Sept. 4, 2020, 7:41 p.m. UTC | #10
On Fri, Sep 04, 2020 at 09:11:48AM -0300, Jason Gunthorpe wrote:
> On Fri, Sep 04, 2020 at 02:31:11PM +0300, Adalbert Lazăr wrote:
> > VMAs obtained by mmap()ing memory access fds mirror the contents of the remote
> > process address space within the specified range. Pages are installed in the
> > current process page tables at fault time and removed by the mmu_interval_notifier
> > invalidate callback. No further memory management is involved.
> > On attempts to access a hole, or if a mapping was removed by PIDFD_MEM_UNMAP,
> > or if the remote process address space was reaped by OOM, the remote mapping
> > fault handler returns VM_FAULT_SIGBUS.
> 
> I still think anything along these lines needs to meet the XPMEM use
> cases as well, we have to have more general solutions for such MM
> stuff:
> 
> https://gitlab.com/hjelmn/xpmem
> 
> However, I think this fundamentally falls into some of the same bad
> direction as xpmem.
> 
> I would much rather see this design copy & clone the VMA's than try to
> mirror the PTEs inside the VMAs from the remote into a single giant
> VMA and somehow split/mirror the VMA ops.

I'm on holiday for the next few days, but does the mshare() API work for
your use case?

Proposal: http://www.wil.cx/~willy/linux/sileby.html
Start of an implementation:
http://git.infradead.org/users/willy/linux.git/shortlog/refs/heads/mshare
Jason Gunthorpe Sept. 4, 2020, 7:49 p.m. UTC | #11
On Fri, Sep 04, 2020 at 08:41:39PM +0100, Matthew Wilcox wrote:
> On Fri, Sep 04, 2020 at 09:11:48AM -0300, Jason Gunthorpe wrote:
> > On Fri, Sep 04, 2020 at 02:31:11PM +0300, Adalbert Lazăr wrote:
> > > VMAs obtained by mmap()ing memory access fds mirror the contents of the remote
> > > process address space within the specified range. Pages are installed in the
> > > current process page tables at fault time and removed by the mmu_interval_notifier
> > > invalidate callback. No further memory management is involved.
> > > On attempts to access a hole, or if a mapping was removed by PIDFD_MEM_UNMAP,
> > > or if the remote process address space was reaped by OOM, the remote mapping
> > > fault handler returns VM_FAULT_SIGBUS.
> > 
> > I still think anything along these lines needs to meet the XPMEM use
> > cases as well, we have to have more general solutions for such MM
> > stuff:
> > 
> > https://gitlab.com/hjelmn/xpmem
> > 
> > However, I think this fundamentally falls into some of the same bad
> > direction as xpmem.
> > 
> > I would much rather see this design copy & clone the VMA's than try to
> > mirror the PTEs inside the VMAs from the remote into a single giant
> > VMA and somehow split/mirror the VMA ops.
> 
> I'm on holiday for the next few days, but does the mshare() API work for
> your use case?
>
> Proposal: http://www.wil.cx/~willy/linux/sileby.html
> Start at implementation:
> http://git.infradead.org/users/willy/linux.git/shortlog/refs/heads/mshare

Let me ask around, it seems in a similar space at least!

Thanks,
Jason
Paolo Bonzini Sept. 4, 2020, 8:08 p.m. UTC | #12
On 04/09/20 21:41, Matthew Wilcox wrote:
> Proposal: http://www.wil.cx/~willy/linux/sileby.html
> Start at implementation:
> http://git.infradead.org/users/willy/linux.git/shortlog/refs/heads/mshare

The main difference between mshare() and this is that we don't want an
all-or-nothing thing.

Adalbert's introspection thing is rather simple, but what I would like
to be able to do (and the reason why I suggested the multi-pidfd
approach) is actually a bit more complex:

- a parent process creates a range of memory

- there are multiple VM child processes.  One of these VMs is the primary
VM, the others are enclave VMs.  VMs are created by the parent process
and each VM gets a different view of the memory range through pidfd_mem.

- once an enclave VM is created, the primary VM must not be able to
access the memory that has been assigned to the enclave VM.  If the
parent unmaps the memory in the primary VM, the child must SIGBUS when
it's accessed.

- if memory is removed from a VM and assigned to another, this should
not involve any copy at all.

For this usecase the range of memory would be backed by hugetlbfs,
anonymous memory, VFIO, whatever.  Userfaultfd is certainly part of the
picture here on the VM side.  Having userfaultfd on the parent side
would be nice though I don't have a use for it right now.  I'm not sure
about non-anonymous VMAs.

Thanks,

Paolo
Paolo Bonzini Sept. 4, 2020, 8:09 p.m. UTC | #13
On 04/09/20 21:39, Andy Lutomirski wrote:
> I'm a little concerned that it's actually too clever and that maybe a
> more straightforward solution should be investigated.  I personally
> rather dislike the KVM model in which the guest address space mirrors
> the host (QEMU) address space rather than being its own thing.  In
> particular, the current model means that extra-special-strange
> mappings like SEV-encrypted memory are required to be present in the
> QEMU page tables in order for the guest to see them.
> 
> (If I had noticed that last bit before it went upstream, I would have
> NAKked it.  I would still like to see it deprecated and ideally
> eventually removed from the kernel.  We have absolutely no business
> creating incoherent mappings like this.)

NACK first and ask second, right Andy?  I see that nothing has changed
since Alan Cox left Linux.

Paolo
Paolo Bonzini Sept. 4, 2020, 8:18 p.m. UTC | #14
On 04/09/20 21:19, Florian Weimer wrote:
> I'm not sure what the advantage is of returning separate file
> descriptors, and nit operating directly on the pidfd.

For privilege separation.  So far, the common case of pidfd operations
has been that whoever possesses a pidfd has "power" over the target
process.  Here however we also want to cover the case where one
privileged process wants to set up and manage a memory range for
multiple children.  The privileged process can do so by passing the
access file descriptor via SCM_RIGHTS.
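
(For reference, handing the access fd to a child over an AF_UNIX socket
is plain SCM_RIGHTS boilerplate, nothing specific to pidfd-mem; a
sketch with error checking omitted:)

    static int send_fd(int sock, int fd)
    {
        char dummy = 'x';
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        union {
            struct cmsghdr align;
            char buf[CMSG_SPACE(sizeof(int))];
        } u;
        struct msghdr msg = {
            .msg_iov = &iov, .msg_iovlen = 1,
            .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
        };
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type  = SCM_RIGHTS;
        cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

        return sendmsg(sock, &msg, 0);
    }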

We also want different children to have visibility over different
ranges, which is why there are multiple control fds rather than using
the pidfd itself as control fd.  You could have the map/unmap/lock ioctl
on the pidfd itself and the access fd as an argument of the ioctl, but
it seems cleaner to represent the pidfd-mem control capability as its
own file descriptor.

Paolo
Andy Lutomirski Sept. 4, 2020, 8:34 p.m. UTC | #15
>> On Sep 4, 2020, at 1:09 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> 
>> On 04/09/20 21:39, Andy Lutomirski wrote:
>> I'm a little concerned that it's actually too clever and that maybe a
>> more straightforward solution should be investigated.  I personally
>> rather dislike the KVM model in which the guest address space mirrors
>> the host (QEMU) address space rather than being its own thing.  In
>> particular, the current model means that extra-special-strange
>> mappings like SEV-encrypted memory are required to be present in the
>> QEMU page tables in order for the guest to see them.
>> (If I had noticed that last bit before it went upstream, I would have
>> NAKked it.  I would still like to see it deprecated and ideally
>> eventually removed from the kernel.  We have absolutely no business
>> creating incoherent mappings like this.)
> 
> NACK first and ask second, right Andy?  I see that nothing has changed
> since Alan Cox left Linux.

NACKs are negotiable.  And maybe someone can convince me that the SEV mapping scheme is reasonable, but I would be surprised.

Regardless, you seem to be suggesting that you want to have enclave VMs in which the enclave can see some memory that the parent VM can’t see. How does this fit into the KVM mapping model?  How does this remote mapping mechanism help?  Do you want QEMU to have that memory mapped in its own pagetables?

As it stands, the way that KVM memory mappings are created seems to be convenient, but it also seems to be resulting in increasing bizarre userspace mappings.  At what point is the right solution to decouple KVM’s mappings from QEMU’s?
Paolo Bonzini Sept. 4, 2020, 9:58 p.m. UTC | #16
On 04/09/20 22:34, Andy Lutomirski wrote:
> On Sep 4, 2020, at 1:09 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> On 04/09/20 21:39, Andy Lutomirski wrote:
>>> I'm a little concerned
>>> that it's actually too clever and that maybe a more
>>> straightforward solution should be investigated.  I personally 
>>> rather dislike the KVM model in which the guest address space
>>> mirrors the host (QEMU) address space rather than being its own
>>> thing.  In particular, the current model means that
>>> extra-special-strange mappings like SEV-encrypted memory are
>>> required to be present in the QEMU page tables in order for the
>>> guest to see them. (If I had noticed that last bit before it went
>>> upstream, I would have NAKked it.  I would still like to see it
>>> deprecated and ideally eventually removed from the kernel.  We
>>> have absolutely no business creating incoherent mappings like
>>> this.)
>> 
>> NACK first and ask second, right Andy?  I see that nothing has
>> changed since Alan Cox left Linux.
> 
> NACKs are negotiable.  And maybe someone can convince me that the SEV
> mapping scheme is reasonable, but I would be surprised.

So why say NACK?  Any half-decent maintainer would hold on merging the
patches at least until the discussion is over.  Also I suppose any
deprecation proposal should come with a description of an alternative.

Anyway, for SEV the problem is DMA.  There is no way to know in advance
which memory the guest will use for I/O; it can change at any time and
the same host-physical address can even be mapped both as C=0 and C=1 by
the guest.  There's no communication protocol between the guest and the
host to tell the host _which_ memory should be mapped in QEMU.  (One was
added to support migration, but that doesn't even work with SEV-ES
processors where migration is planned to happen mostly with help from
the guest, either in the firmware or somewhere else).

But this is a digression.  (If you would like to continue the discussion
please trim the recipient list and change the subject).

> Regardless, you seem to be suggesting that you want to have enclave
> VMs in which the enclave can see some memory that the parent VM can’t
> see. How does this fit into the KVM mapping model?  How does this
> remote mapping mechanism help?  Do you want QEMU to have that memory
> mapped in its own pagetables?

There are three processes:

- the manager, which is the parent of the VMs and uses the pidfd_mem
system call

- the primary VM

- the enclave VM(s)

The primary VM and the enclave VM(s) would each get a different memory
access file descriptor.  QEMU would treat them no differently from any
other externally-provided memory backend, say hugetlbfs or memfd, so
yeah they would be mmap-ed to userspace and the host virtual address
passed as usual to KVM.

Enclave VMs could be used to store secrets and perform crypto for
example.  The enclave is measured at boot; any keys or other stuff it
needs can be provided out-of-band from the manager.

The manager can decide at any time to hide some memory from the parent
VM (in order to give it to an enclave).  This would actually be done on
request of the parent VM itself, and QEMU would probably be so kind as
to replace the "hole" left in the guest memory with zeroes.  But QEMU is
untrusted, so the manager cannot rely on QEMU behaving well.  Hence the
privilege separation model that was implemented here.

Actually Amazon has already created something like that, and Andra-Irina
Paraschiv has posted patches on the list for this.  Their implementation
is not open source, but this pidfd-mem concept is something that Andra,
Alexander Graf and I came up with as a way to 1) reimplement the feature
upstream, 2) satisfy Bitdefender's need for memory introspection, and 3)
add what seemed a useful interface anyway, for example to replace
PTRACE_{PEEK,POKE}DATA.  Though (3) would only need pread/pwrite, not
mmap, which adds a lot of the complexity.

> As it stands, the way that KVM memory mappings are created seems to
> be convenient, but it also seems to be resulting in increasing
> bizarre userspace mappings.  At what point is the right solution to
> decouple KVM’s mappings from QEMU’s?

So what you are suggesting is that KVM manages its own address space
instead of host virtual addresses (and with no relationship to host
virtual addresses, it would be just a "cookie")?  It would then need a
couple ioctls to mmap/munmap (creating and deleting VMAs) into the
address space, and those cookies would be passed to
KVM_SET_USER_MEMORY_REGION.  QEMU would still need access to these VMAs,
would it mmap a file descriptor provided by KVM?  All in all the
implementation seems quite complex, and I don't understand why it would
avoid incoherent SEV mappings; what am I missing?

Paolo
Andy Lutomirski Sept. 4, 2020, 11:17 p.m. UTC | #17
On Fri, Sep 4, 2020 at 2:59 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 04/09/20 22:34, Andy Lutomirski wrote:
> > On Sep 4, 2020, at 1:09 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >> On 04/09/20 21:39, Andy Lutomirski wrote:
> >>> I'm a little concerned
> >>> that it's actually too clever and that maybe a more
> >>> straightforward solution should be investigated.  I personally
> >>> rather dislike the KVM model in which the guest address space
> >>> mirrors the host (QEMU) address space rather than being its own
> >>> thing.  In particular, the current model means that
> >>> extra-special-strange mappings like SEV-encrypted memory are
> >>> required to be present in the QEMU page tables in order for the
> >>> guest to see them. (If I had noticed that last bit before it went
> >>> upstream, I would have NAKked it.  I would still like to see it
> >>> deprecated and ideally eventually removed from the kernel.  We
> >>> have absolutely no business creating incoherent mappings like
> >>> this.)
> >>
> >> NACK first and ask second, right Andy?  I see that nothing has
> >> changed since Alan Cox left Linux.
> >
> > NACKs are negotiable.  And maybe someone can convince me that the SEV
> > mapping scheme is reasonable, but I would be surprised.
>
> So why say NACK?  Any half-decent maintainer would hold on merging the
> patches at least until the discussion is over.  Also I suppose any
> deprecation proposal should come with a description of an alternative.

I suppose that's a fair point.

>
> Anyway, for SEV the problem is DMA.  There is no way to know in advance
> which memory the guest will use for I/O; it can change at any time and
> the same host-physical address can even be mapped both as C=0 and C=1 by
> the guest.  There's no communication protocol between the guest and the
> host to tell the host _which_ memory should be mapped in QEMU.  (One was
> added to support migration, but that doesn't even work with SEV-ES
> processors where migration is planned to happen mostly with help from
> the guest, either in the firmware or somewhere else).

There's sev_pin_memory(), so QEMU must have at least some idea of
which memory could potentially be encrypted.  Is it in fact the case
that QEMU doesn't know that some SEV pinned memory might actually be
used for DMA until the guest tries to do DMA on that memory?  If so,
yuck.

>
> But this is a digression.  (If you would like to continue the discussion
> please trim the recipient list and change the subject).

Fair enough.  And my apologies for being grumpier about SEV than was called for.

>
> > Regardless, you seem to be suggesting that you want to have enclave
> > VMs in which the enclave can see some memory that the parent VM can’t
> > see. How does this fit into the KVM mapping model?  How does this
> > remote mapping mechanism help?  Do you want QEMU to have that memory
> > mapped in its own pagetables?
>
> There are three processes:
>
> - the manager, which is the parent of the VMs and uses the pidfd_mem
> system call
>
> - the primary VM
>
> - the enclave VM(s)
>
> The primary VM and the enclave VM(s) would each get a different memory
> access file descriptor.  QEMU would treat them no differently from any
> other externally-provided memory backend, say hugetlbfs or memfd, so
> yeah they would be mmap-ed to userspace and the host virtual address
> passed as usual to KVM.

Would the VM processes mmap() these descriptors, or would KVM learn
how to handle that memory without it being mapped?

>
> Enclave VMs could be used to store secrets and perform crypto for
> example.  The enclave is measured at boot, any keys or other stuff it
> needs can be provided out-of-band from the manager
>
> The manager can decide at any time to hide some memory from the parent
> VM (in order to give it to an enclave).  This would actually be done on
>  request of the parent VM itself, and QEMU would probably be so kind as
> to replace the "hole" left in the guest memory with zeroes.  But QEMU is
> untrusted, so the manager cannot rely on QEMU behaving well.  Hence the
>  privilege separation model that was implemented here.

How does this work?  Is there a revoke mechanism, or does the parent
just munmap() the memory itself?

>
> Actually Amazon has already created something like that and Andra-Irina
> Paraschiv has posted patches on the list for this.  Their implementation
> is not open source, but this pidfd-mem concept is something that Andra,
> Alexander Graf and I came up with as a way to 1) reimplement the feature
> upstream and 2) satisfy Bitdefender's need for memory introspection 3)
> add what seemed a useful interface anyway, for example to replace
> PTRACE_{PEEK,POKE}DATA.  Though (3) would only need pread/pwrite, not
> mmap which adds a lot of the complexity.
>
> > As it stands, the way that KVM memory mappings are created seems to
> > be convenient, but it also seems to be resulting in increasing
> > bizarre userspace mappings.  At what point is the right solution to
> > decouple KVM’s mappings from QEMU’s?
>
> So what you are suggesting is that KVM manages its own address space
> instead of host virtual addresses (and with no relationship to host
> virtual addresses, it would be just a "cookie")?  It would then need a
> couple ioctls to mmap/munmap (creating and deleting VMAs) into the
> address space, and those cookies would be passed to
> KVM_SET_USER_MEMORY_REGION.  QEMU would still need access to these VMAs,
> would it mmap a file descriptor provided by KVM?  All in all the
> implementation seems quite complex, and I don't understand why it would
> avoid incoherent SEV mappings; what am I missing?

It might not avoid incoherent SEV mappings in particular, but it would
certainly enable other, somewhat related usecases.  For example, QEMU
could have KVM map a memfd without itself mapping that memfd, which
would reduce the extent to which the memory would be exposed to an
attacker who can read QEMU memory.

For this pidfd-mem scheme in particular, it might avoid the nasty
corner case I mentioned.  With pidfd-mem as in this patchset, I'm
concerned about what happens when process A maps some process B
memory, process B maps some of process A's memory, and there's a
recursive mapping that results.  Or when a process maps its own
memory, for that matter.  If KVM could map fd's directly, then there
could be a parallel mechanism for KVM to import portions of more than
one process's address space, and this particular problem would be
avoided.  So a process would create pidfd-mem-like object and pass
that to KVM (via an intermediary process or directly) and KVM could
map that, but no normal process would be allowed to map it.  This
avoids the recursion problems.

Or memfd could get fancier with operations to split memfds, remove
pages from memfds, etc.  Maybe that's overkill.

(Or a fancy recursion detector could be built, but this has been a
pain point in AF_UNIX, epoll, etc in the past.  It may be solvable,
but it won't be pretty.)

I admit that allowing KVM to map fd's directly without some specific
vm_operations support for this could be challenging, but ISTM kvm
could plausibly own an mm_struct and pagetables at the cost of some
wasted memory.  The result would, under the hood, work more or less
like the current implementation, but the API would be quite different.

>
> Paolo
>
Paolo Bonzini Sept. 5, 2020, 6:27 p.m. UTC | #18
On 05/09/20 01:17, Andy Lutomirski wrote:
> There's sev_pin_memory(), so QEMU must have at least some idea of
> which memory could potentially be encrypted.  Is it in fact the case
> that QEMU doesn't know that some SEV pinned memory might actually be
> used for DMA until the guest tries to do DMA on that memory?  If so,
> yuck.

Yes.  All the memory is pinned, all the memory could potentially be used
for DMA (of garbage if it's encrypted).  And it's the same for pretty
much all protected VM extensions (SEV, POWER, s390, Intel TDX).

>> The primary VM and the enclave VM(s) would each get a different memory
>> access file descriptor.  QEMU would treat them no differently from any
>> other externally-provided memory backend, say hugetlbfs or memfd, so
>> yeah they would be mmap-ed to userspace and the host virtual address
>> passed as usual to KVM.
> 
> Would the VM processes mmap() these descriptors, or would KVM learn
> how to handle that memory without it being mapped?

The idea is that the process mmaps them, QEMU would treat them just the
same as a hugetlbfs file descriptor for example.

>> The manager can decide at any time to hide some memory from the parent
>> VM (in order to give it to an enclave).  This would actually be done on
>> request of the parent VM itself [...] But QEMU is
>> untrusted, so the manager cannot rely on QEMU behaving well.  Hence the
>> privilege separation model that was implemented here.
> 
> How does this work?  Is there a revoke mechanism, or does the parent
> just munmap() the memory itself?

The parent has ioctls to add and remove memory from the pidfd-mem.  So
unmapping is just calling the ioctl that removes a range.
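
(In terms of the interface from the cover letter, the revoke boils down
to something like this sketch on the control fd:)

    struct pidfd_mem_unmap unmap = {
        .offset = off,   /* offset within the access fd */
        .size   = len,
    };
    ioctl(control_fd, PIDFD_MEM_UNMAP, &unmap);
    /* any existing mmap of [off, off+len) in the child now SIGBUSes */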

>> So what you are suggesting is that KVM manages its own address space
>> instead of host virtual addresses (and with no relationship to host
>> virtual addresses, it would be just a "cookie")?
> 
> [...] For this pidfd-mem scheme in particular, it might avoid the nasty
> corner case I mentioned.  With pidfd-mem as in this patchset, I'm
> concerned about what happens when process A maps some process B
> memory, process B maps some of process A's memory, and there's a
> recursive mapping that results.  Or when a process maps its own
> memory, for that matter.
> 
> Or memfd could get fancier with operations to split memfds, remove
> pages from memfds, etc.  Maybe that's overkill.

Doing it directly with memfd is certainly an option, especially since
MFD_HUGE_* exists.  Basically you'd have a system call to create a
secondary view of the memfd, and the syscall interface could still be
very similar to what is in this patch, in particular the control/access
pair.  Probably this could also be used to implement Matthew Wilcox's ideas.

I still believe that the pidfd-mem concept has merit as a
"capability-like" PTRACE_{PEEK,POKE}DATA replacement, but it would not
need any of privilege separation or mmap support, only direct read/write.

So there's two concepts mixed in one interface in this patch, with two
completely different usecases.  Merging them is clever, but perhaps too
clever.  I can say that since it was my idea. :D

Thanks,

Paolo
Christoph Hellwig Sept. 7, 2020, 7:05 a.m. UTC | #19
On Fri, Sep 04, 2020 at 11:58:57PM +0200, Paolo Bonzini wrote:
> So why say NACK?  Any half-decent maintainer would hold on merging the
> patches at least until the discussion is over.  Also I suppose any
> deprecation proposal should come with a description of an alternative.

Please stop these totally pointless and overly aggressive personal
attacks.  A maintainer's prime job is to say no.
Christian Brauner Sept. 7, 2020, 8:33 a.m. UTC | #20
Hey Paolo,

On Fri, Sep 04, 2020 at 10:18:17PM +0200, Paolo Bonzini wrote:
> On 04/09/20 21:19, Florian Weimer wrote:
> > I'm not sure what the advantage is of returning separate file
> > descriptors, and nit operating directly on the pidfd.
> 
> For privilege separation.  So far, the common case of pidfd operations
> has been that whoever possesses a pidfd has "power" over the target

I may misunderstand you but that's actually not quite true. Currently,
pidfds are just handles on processes and only convey identity.
They don't guarantee any sort of privilege over the target process. We
have had discussion to treat them more as a capability in the future but
that needs to be carefully thought out.

> process.  Here however we also want to cover the case where one
> privileged process wants to set up and manage a memory range for
> multiple children.  The privileged process can do so by passing the
> access file descriptor via SCM_RIGHTS.
> 
> We also want different children to have visibility over different
> ranges, which is why there are multiple control fds rather than using
> the pidfd itself as control fd.  You could have the map/unmap/lock ioctl
> on the pidfd itself and the access fd as an argument of the ioctl, but
> it seems cleaner to represent the pidfd-mem control capability as its
> own file descriptor.

We have very much on purpose avoided adding ioctls() on top of pidfds
and I'm not fond of the idea of starting to add them. Supporting
ioctl()s on an fd usually opens up a can of worms and makes sneaking in
questionable features more likely (I'm not saying your patchset does
that!).
If this interface holds up, I would ask you to please either keep this
as a separate fd type or propose system calls only.

Christian
Christian Brauner Sept. 7, 2020, 8:38 a.m. UTC | #21
On Sat, Sep 05, 2020 at 08:27:29PM +0200, Paolo Bonzini wrote:
> On 05/09/20 01:17, Andy Lutomirski wrote:
> > There's sev_pin_memory(), so QEMU must have at least some idea of
> > which memory could potentially be encrypted.  Is it in fact the case
> > that QEMU doesn't know that some SEV pinned memory might actually be
> > used for DMA until the guest tries to do DMA on that memory?  If so,
> > yuck.
> 
> Yes.  All the memory is pinned, all the memory could potentially be used
> for DMA (of garbage if it's encrypted).  And it's the same for pretty
> much all protected VM extensions (SEV, POWER, s390, Intel TDX).
> 
> >> The primary VM and the enclave VM(s) would each get a different memory
> >> access file descriptor.  QEMU would treat them no differently from any
> >> other externally-provided memory backend, say hugetlbfs or memfd, so
> >> yeah they would be mmap-ed to userspace and the host virtual address
> >> passed as usual to KVM.
> > 
> > Would the VM processes mmap() these descriptors, or would KVM learn
> > how to handle that memory without it being mapped?
> 
> The idea is that the process mmaps them, QEMU would treat them just the
> same as a hugetlbfs file descriptor for example.
> 
> >> The manager can decide at any time to hide some memory from the parent
> >> VM (in order to give it to an enclave).  This would actually be done on
> >> request of the parent VM itself [...] But QEMU is
> >> untrusted, so the manager cannot rely on QEMU behaving well.  Hence the
> >> privilege separation model that was implemented here.
> > 
> > How does this work?  Is there a revoke mechanism, or does the parent
> > just munmap() the memory itself?
> 
> The parent has ioctls to add and remove memory from the pidfd-mem.  So
> unmapping is just calling the ioctl that removes a range.

I would strongly suggest we move away from ioctl() patterns. If
something like this comes up in the future, just propose them as system
calls.

> 
> >> So what you are suggesting is that KVM manages its own address space
> >> instead of host virtual addresses (and with no relationship to host
> >> virtual addresses, it would be just a "cookie")?
> > 
> > [...] For this pidfd-mem scheme in particular, it might avoid the nasty
> > corner case I mentioned.  With pidfd-mem as in this patchset, I'm
> > concerned about what happens when process A maps some process B
> > memory, process B maps some of process A's memory, and there's a
> > recursive mapping that results.  Or when a process maps its own
> > memory, for that matter.
> > 
> > Or memfd could get fancier with operations to split memfds, remove
> > pages from memfds, etc.  Maybe that's overkill.
> 
> Doing it directly with memfd is certainly an option, especially since
> MFD_HUGE_* exists.  Basically you'd have a system call to create a

I like that idea way better to be honest.

Christian
Paolo Bonzini Sept. 7, 2020, 8:44 a.m. UTC | #22
On 07/09/20 09:05, Christoph Hellwig wrote:
> On Fri, Sep 04, 2020 at 11:58:57PM +0200, Paolo Bonzini wrote:
>> So why say NACK?  Any half-decent maintainer would hold on merging the
>> patches at least until the discussion is over.  Also I suppose any
>> deprecation proposal should come with a description of an alternative.
>
> Please stop these totally pointless and overly aggressive personal
> attacks.  A maintainers prime job is to say no.

Christoph,

I'm not sure who is attacking whom honestly, especially since Andy later
said "Fair enough.  And my apologies for bring grumpier about SEV than
was called for" and we had a perfectly civil and technical conversation.

Paolo
Mircea CIRJALIU - MELIU Sept. 7, 2020, 10:25 a.m. UTC | #23
> From: Andy Lutomirski <luto@kernel.org>
> I think this is very clever, but I find myself wondering what happens if people
> start trying to abuse this by, for example, setting up a remote mapping
> pointing to fun regions like userfaultfd or another remote mapping.

Can ptrace() be used to abuse fun regions of a process address space?
Remote mapping recursiveness can be eliminated by checking the VMA the
remote page is extracted from. (NYI)

> I'm a little concerned that it's actually too clever and that maybe a more
> straightforward solution should be investigated.  I personally rather dislike
> the KVM model in which the guest address space mirrors the host (QEMU)
> address space rather than being its own thing.

I've seen a few internal mmap()s throughout the kernel. Just wondering how
memory accounting is implemented for these cases. Will this be reflected in
the memory usage of the process that controls such a module? Especially in
the case of a virtual machine that needs a few GBs of memory.

Mircea
Mircea CIRJALIU - MELIU Sept. 7, 2020, 12:41 p.m. UTC | #24
> From: Andy Lutomirski <luto@kernel.org>
> > > As it stands, the way that KVM memory mappings are created seems to
> > > be convenient, but it also seems to be resulting in increasing
> > > bizarre userspace mappings.  At what point is the right solution to
> > > decouple KVM’s mappings from QEMU’s?

Convenience is one of the drivers of code reuse.

> > So what you are suggesting is that KVM manages its own address space
> > instead of host virtual addresses (and with no relationship to host
> > virtual addresses, it would be just a "cookie")?  It would then need a
> > couple ioctls to mmap/munmap (creating and deleting VMAs) into the
> > address space, and those cookies would be passed to
> > KVM_SET_USER_MEMORY_REGION.  QEMU would still need access to these
> > VMAs, would it mmap a file descriptor provided by KVM?  All in all the
> > implementation seems quite complex, and I don't understand why it
> > would avoid incoherent SEV mappings; what am I missing?
> 
> It might not avoid incoherent SEV mappings in particular, but it would
> certainly enable other, somewhat related usecases.  For example, QEMU
> could have KVM map a memfd without itself mapping that memfd, which
> would reduce the extent to which the memory would be exposed to an
> attacker who can read QEMU memory.

Isn't this security through obscurity?

> For this pidfd-mem scheme in particular, it might avoid the nasty corner case
> I mentioned.  With pidfd-mem as in this patchset, I'm concerned about what
> happens when process A maps some process B memory, process B maps
> some of process A's memory, and there's a recursive mapping that results.
> Or when a process maps its own memory, for that matter.  If KVM could map
> fd's directly, then there could be a parallel mechanism for KVM to import
> portions of more than one process's address space, and this particular
> problem would be avoided.  So a process would create pidfd-mem-like
> object and pass that to KVM (via an intermediary process or directly) and
> KVM could map that, but no normal process would be allowed to map it.  This
> avoids the recursion problems.
> 
> Or memfd could get fancier with operations to split memfds, remove pages
> from memfds, etc.  Maybe that's overkill.
> 
> (Or a fancy recursion detector could be built, but this has been a pain point in
> AF_UNIX, epoll, etc in the past.  It may be solvable, but it won't be pretty.)
> 
> I admit that allowing KVM to map fd's directly without some specific
> vm_operations support for this could be challenging, but ISTM kvm could
> plausibly own an mm_struct and pagetables at the cost of some wasted
> memory.  The result would, under the hood, work more or less like the
> current implementation, but the API would be quite different.

This looks like an attempt to pass memory-related concerns to KVM developers.
The userspace mapping mechanism is good as it is. Probably not perfect, just
good. The problem is that it's stuck with a few VMA models and needs to evolve
towards more bizarre/sketchy/weird/fragile patterns.

Also the memory code is some of the most tightly coupled code I have seen.
That probably explains the maintainers' fear of trying something new.

Mircea
Christian Brauner Sept. 7, 2020, 3:05 p.m. UTC | #25
On Fri, Sep 04, 2020 at 02:31:11PM +0300, Adalbert Lazăr wrote:
> This patchset adds support for the remote mapping feature.
> Remote mapping, as its name suggests, is a means for transparent and
> zero-copy access of a remote process' address space.
> 
> The feature was designed according to a specification suggested by
> Paolo Bonzini:
> >> The proposed API is a new pidfd system call, through which the parent
> >> can map portions of its virtual address space into a file descriptor
> >> and then pass that file descriptor to a child.
> >>
> >> This should be:
> >>
> >> - upstreamable, pidfd is the new cool thing and we could sell it as a
> >> better way to do PTRACE_{PEEK,POKE}DATA

In all honesty, that sentence made me a bit uneasy as it reads like this
is implemented on top of pidfds because it makes it more likely to go
upstream not because it is the right design. To be clear, I'm not
implying any sort of malicious intent on your part but I would suggest
to phrase this a little better. :)
Andy Lutomirski Sept. 7, 2020, 8:43 p.m. UTC | #26
On Mon, Sep 7, 2020 at 8:05 AM Christian Brauner
<christian.brauner@ubuntu.com> wrote:
>
> On Fri, Sep 04, 2020 at 02:31:11PM +0300, Adalbert Lazăr wrote:
> > This patchset adds support for the remote mapping feature.
> > Remote mapping, as its name suggests, is a means for transparent and
> > zero-copy access of a remote process' address space.
> >
> > The feature was designed according to a specification suggested by
> > Paolo Bonzini:
> > >> The proposed API is a new pidfd system call, through which the parent
> > >> can map portions of its virtual address space into a file descriptor
> > >> and then pass that file descriptor to a child.
> > >>
> > >> This should be:
> > >>
> > >> - upstreamable, pidfd is the new cool thing and we could sell it as a
> > >> better way to do PTRACE_{PEEK,POKE}DATA
>
> In all honesty, that sentence made me a bit uneasy as it reads like this
> is implemented on top of pidfds because it makes it more likely to go
> upstream not because it is the right design. To be clear, I'm not
> implying any sort of malicious intent on your part but I would suggest
> to phrase this a little better. :)


I thought about this whole thing some more, and here are some thoughts.

First, I was nervous about two things.  One was faulting in pages from
the wrong context: when a normal page fault or KVM faults in a page, the
mm is loaded (in the KVM case the mm is sort of not loaded when the
actual fault happens, but it is loaded when the fault is handled, I
think -- maybe there are workqueues involved and I'm wrong), whereas
when a remote mapping faults in a page, the mm is *not* loaded.
This ought not to be a problem, though -- get_user_pages_remote() also
faults in pages from a non-current mm, and that's at least supposed to
work correctly, so maybe this is okay.
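
For illustration only, here is a minimal kernel-side sketch of faulting a
page from a non-current mm via get_user_pages_remote(); remote_fetch_page()
is a made-up helper, and the gup signature shown follows kernels of roughly
this era (it has changed across releases):

    #include <linux/mm.h>
    #include <linux/sched/mm.h>

    /*
     * Hypothetical helper: fault in one page of a foreign mm, roughly the
     * way a remote-mapping fault handler would.  'mm' is not current->mm.
     */
    static struct page *remote_fetch_page(struct mm_struct *mm,
                                          unsigned long addr)
    {
            struct page *page = NULL;
            long got;

            if (!mmget_not_zero(mm))        /* target mm may be exiting */
                    return NULL;

            mmap_read_lock(mm);
            got = get_user_pages_remote(mm, addr, 1, 0 /* gup_flags */,
                                        &page, NULL, NULL);
            mmap_read_unlock(mm);
            mmput(mm);

            return got == 1 ? page : NULL;  /* caller does put_page() */
    }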

Second is recursion.  I think this is a genuine problem.

And I think that tying this to pidfds is the wrong approach.  In fact,
tying it to processes at all seems wrong.  There is a lot of demand
for various forms of memory isolation in which memory is mapped only
by its intended user.  Using something tied to a process mm gets in
the way of this in the same way that KVM's current mapping model gets
in the way.

All that being said, I think the whole idea of making fancy address
spaces composed from other mappable objects is neat and possibly quite
useful.  And, if you squint a bit, this is a lot like what KVM does
today.

So I suggest something that may be more generally useful as an
alternative.  This is a sketch and very subject to bikeshedding:

Create an empty address space:

int address_space_create(int flags, etc);

Map an fd into an address space:

int address_space_mmap(int asfd, int fd_to_map, offset, size, prot,
...);  /* might run out of args here */

Unmap from an address space:

int address_space_munmap(int asfd, unsigned long addr, unsigned long len);

Stick an address space into KVM:

ioctl(vmfd, KVM_MAP_ADDRESS_SPACE, asfd);  /* or similar */

Maybe some day allow mapping an address space into a process.

mmap(..., asfd, ...);
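
Purely as a hedged illustration of how these pieces might compose for the
KVM case -- every function, constant, and ioctl below is hypothetical,
taken from the sketch above rather than from any existing kernel interface:

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>

    /* Hypothetical prototypes mirroring the proposal above. */
    int address_space_create(int flags);
    int address_space_mmap(int asfd, int fd_to_map, off_t offset,
                           size_t size, int prot);
    int address_space_munmap(int asfd, unsigned long addr, unsigned long len);
    #define KVM_MAP_ADDRESS_SPACE 0xdead        /* made-up ioctl number */

    /*
     * Build a guest address space from a memfd and hand it to KVM; no
     * ordinary process ever mmap()s it, which sidesteps the recursion
     * problem discussed above.
     */
    static int give_ram_to_vm(int vmfd, int ram_memfd, size_t ram_size)
    {
            int asfd = address_space_create(0);

            if (asfd < 0)
                    return -1;
            if (address_space_mmap(asfd, ram_memfd, 0, ram_size,
                                   PROT_READ | PROT_WRITE) < 0)
                    return -1;
            return ioctl(vmfd, KVM_MAP_ADDRESS_SPACE, asfd);
    }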


And at least for now, there's a rule that address_space_mmap()ing one
address space into another address space is disallowed.


Maybe some day we also allow mremap(), madvise(), etc.  And maybe some
day we allow creating a special address_space that represents a real
process's address space.


Under the hood, an address_space could own an mm_struct that is not
used by any tasks.  And we could have special memfds that are bound to
a VM such that all you can do with them is stick them into an
address_space and map that address_space into the VM in question.  For
this to work, we would want a special vm_operation for mapping into a
VM.


What do you all think?  Is this useful?  Does it solve your problems?
Is it a good approach going forward?
Stefan Hajnoczi Sept. 9, 2020, 11:38 a.m. UTC | #27
On Mon, Sep 07, 2020 at 01:43:48PM -0700, Andy Lutomirski wrote:
> On Mon, Sep 7, 2020 at 8:05 AM Christian Brauner
> <christian.brauner@ubuntu.com> wrote:
> >
> > On Fri, Sep 04, 2020 at 02:31:11PM +0300, Adalbert Lazăr wrote:
> > > This patchset adds support for the remote mapping feature.
> > > Remote mapping, as its name suggests, is a means for transparent and
> > > zero-copy access of a remote process' address space.
> > > access of a remote process' address space.
> > >
> > > The feature was designed according to a specification suggested by
> > > Paolo Bonzini:
> > > >> The proposed API is a new pidfd system call, through which the parent
> > > >> can map portions of its virtual address space into a file descriptor
> > > >> and then pass that file descriptor to a child.
> > > >>
> > > >> This should be:
> > > >>
> > > >> - upstreamable, pidfd is the new cool thing and we could sell it as a
> > > >> better way to do PTRACE_{PEEK,POKE}DATA
> >
> > In all honesty, that sentence made me a bit uneasy as it reads like this
> > is implemented on top of pidfds because it makes it more likely to go
> > upstream not because it is the right design. To be clear, I'm not
> > implying any sort of malicious intent on your part but I would suggest
> > to phrase this a little better. :)
> 
> 
> I thought about this whole thing some more, and here are some thoughts.
> 
> First, I was nervous about two things.  One was faulting in pages from
> the wrong context: when a normal page fault or KVM faults in a page, the
> mm is loaded (in the KVM case the mm is sort of not loaded when the
> actual fault happens, but it is loaded when the fault is handled, I
> think -- maybe there are workqueues involved and I'm wrong), whereas
> when a remote mapping faults in a page, the mm is *not* loaded.
> This ought not to be a problem, though -- get_user_pages_remote() also
> faults in pages from a non-current mm, and that's at least supposed to
> work correctly, so maybe this is okay.
> 
> Second is recursion.  I think this is a genuine problem.
> 
> And I think that tying this to pidfds is the wrong approach.  In fact,
> tying it to processes at all seems wrong.  There is a lot of demand
> for various forms of memory isolation in which memory is mapped only
> by its intended user.  Using something tied to a process mm gets in
> the way of this in the same way that KVM's current mapping model gets
> in the way.
> 
> All that being said, I think the whole idea of making fancy address
> spaces composed from other mappable objects is neat and possibly quite
> useful.  And, if you squint a bit, this is a lot like what KVM does
> today.
> 
> So I suggest something that may be more generally useful as an
> alternative.  This is a sketch and very subject to bikeshedding:
> 
> Create an empty address space:
> 
> int address_space_create(int flags, etc);
> 
> Map an fd into an address space:
> 
> int address_space_mmap(int asfd, int fd_to_map, offset, size, prot,
> ...);  /* might run out of args here */
> 
> Unmap from an address space:
> 
> int address_space_munmap(int asfd, unsigned long addr, unsigned long len);
> 
> Stick an address space into KVM:
> 
> ioctl(vmfd, KVM_MAP_ADDRESS_SPACE, asfd);  /* or similar */
> 
> Maybe some day allow mapping an address space into a process.
> 
> mmap(..., asfd, ...);
> 
> 
> And at least for now, there's a rule that address_space_mmap()ing one
> address space into another address space is disallowed.
> 
> 
> Maybe some day we also allow mremap(), madvise(), etc.  And maybe some
> day we allow creating a special address_space that represents a real
> process's address space.
> 
> 
> Under the hood, an address_space could own an mm_struct that is not
> used by any tasks.  And we could have special memfds that are bound to
> a VM such that all you can do with them is stick them into an
> address_space and map that address_space into the VM in question.  For
> this to work, we would want a special vm_operation for mapping into a
> VM.
> 
> 
> What do you all think?  Is this useful?  Does it solve your problems?
> Is it a good approach going forward?

Hi Adalbert and Andy,
As everyone continues to discuss how the mechanism should look, I want
to share two use cases for something like this. Let me know if you would
like more detail on these use cases.

The requirement in both cases is that process A can map a virtual
memory range from process B so that mmap/munmap operations within the
memory range in process B also affect process A.

An enforcing vIOMMU for vhost-user and vfio-user
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
vhost-user, vfio-user, and other out-of-process device emulation
interfaces need a way for the virtual machine manager (VMM) to enforce
the vIOMMU mappings on the device emulation process. The VMM emulates
the vIOMMU and only wants to expose a subset of memory to the device
emulation process. This subset can change as the guest programs the
vIOMMU.

Today the VMM passes all guest RAM fds to the device emulation process
and has no way of restricting access or revoking it later.

The new mechanism would allow the VMM to add/remove mappings so that the
device emulation process can only access ranges of memory programmed by
the guest vIOMMU. Accesses to unmapped addresses would raise a signal.
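
A rough sketch of what the enforcement could look like on top of the
address-space-fd idea floated earlier in the thread; the wrappers and the
placement of the IOVA argument are guesses, not an existing interface:

    #include <sys/mman.h>

    /* Hypothetical, following the earlier address_space_* sketch. */
    int address_space_mmap(int asfd, int fd_to_map, off_t offset,
                           size_t size, int prot);
    int address_space_munmap(int asfd, unsigned long addr, unsigned long len);

    /*
     * VMM side: called when the guest programs or tears down a vIOMMU
     * mapping.  'dev_asfd' is the address-space fd handed to the device
     * emulation process; 'ram_fd' backs guest RAM.
     */
    static int viommu_map(int dev_asfd, int ram_fd, off_t ram_offset,
                          size_t len)
    {
            /* Expose only this window; everything else stays unmapped. */
            return address_space_mmap(dev_asfd, ram_fd, ram_offset, len,
                                      PROT_READ | PROT_WRITE);
    }

    static int viommu_unmap(int dev_asfd, unsigned long iova, size_t len)
    {
            /* Revoke: further accesses by the device process would fault. */
            return address_space_munmap(dev_asfd, iova, len);
    }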

Accelerating the virtio-fs DAX window
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The virtiofsd vhost-user process handles guest file map/unmap messages.
The map/unmap messages allow the guest to map ranges of files into its
memory space. The guest kernel then uses DAX to access the file pages
without copying their contents into the guest page cache, and mmap
MAP_SHARED mappings remain coherent when guests access the same file.

Today virtiofsd sends a message to the VMM over a UNIX domain socket
asking for an mmap/munmap. The VMM must perform the mapping on behalf of
virtiofsd. This communication and file descriptor passing is clumsy and
slow.

The new mechanism would allow virtiofsd to map/unmap without extra
coordination with the VMM. The VMM only needs to perform an initial mmap
of the DAX window so that kvm.ko can resolve page faults to that region.
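
Again only as a hedged sketch on top of the hypothetical address_space_*
interface, virtiofsd could then service a guest mapping request without a
round trip to the VMM roughly like this (all names are illustrative):

    #include <sys/mman.h>

    /* Hypothetical, following the earlier address_space_* sketch. */
    int address_space_mmap(int asfd, int fd_to_map, off_t offset,
                           size_t size, int prot);
    int address_space_munmap(int asfd, unsigned long addr, unsigned long len);

    /*
     * virtiofsd side: 'dax_asfd' is the address-space fd backing the DAX
     * window that the VMM mapped into the guest once at startup.
     */
    static int dax_map(int dax_asfd, int file_fd, off_t file_offset,
                       size_t len, int writable)
    {
            int prot = PROT_READ | (writable ? PROT_WRITE : 0);

            /* Map the file range straight into the DAX window. */
            return address_space_mmap(dax_asfd, file_fd, file_offset,
                                      len, prot);
    }

    static int dax_unmap(int dax_asfd, unsigned long window_off, size_t len)
    {
            return address_space_munmap(dax_asfd, window_off, len);
    }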

Stefan
Jason Gunthorpe Dec. 1, 2020, 6:01 p.m. UTC | #28
On Fri, Sep 04, 2020 at 08:41:39PM +0100, Matthew Wilcox wrote:
> On Fri, Sep 04, 2020 at 09:11:48AM -0300, Jason Gunthorpe wrote:
> > On Fri, Sep 04, 2020 at 02:31:11PM +0300, Adalbert Lazăr wrote:
> > > VMAs obtained by mmap()ing memory access fds mirror the contents of the remote
> > > process address space within the specified range. Pages are installed in the
> > > current process page tables at fault time and removed by the mmu_interval_notifier
> > > invalidate callbck. No further memory management is involved.
> > > On attempts to access a hole, or if a mapping was removed by PIDFD_MEM_UNMAP,
> > > or if the remote process address space was reaped by OOM, the remote mapping
> > > fault handler returns VM_FAULT_SIGBUS.
> > 
> > I still think anything along these lines needs to meet the XPMEM use
> > cases as well, we have to have more general solutions for such MM
> > stuff:
> > 
> > https://gitlab.com/hjelmn/xpmem
> > 
> > However, I think this fundamentally falls into some of the same bad
> > direction as xpmem.
> > 
> > I would much rather see this design copy & clone the VMA's than try to
> > mirror the PTEs inside the VMAs from the remote into a single giant
> > VMA and somehow split/mirror the VMA ops.
> 
> I'm on holiday for the next few days, but does the mshare() API work for
> your use case?
> 
> Proposal: http://www.wil.cx/~willy/linux/sileby.html
> Start at implementation:
> http://git.infradead.org/users/willy/linux.git/shortlog/refs/heads/mshare

I found some interest in this project here, with the detail that
pin_user_pages() should keep working -- ideally on both sides of the
share, but essentially on the side that calls mshare().

Maybe we can help out, at least cc me if you make progress :)

Thanks,
Jason