Message ID: 20200904113116.20648-1-alazar@bitdefender.com (mailing list archive)
Series: Remote mapping
On Fri, Sep 04, 2020 at 02:31:11PM +0300, Adalbert Lazăr wrote:
> VMAs obtained by mmap()ing memory access fds mirror the contents of the remote
> process address space within the specified range. Pages are installed in the
> current process page tables at fault time and removed by the mmu_interval_notifier
> invalidate callback. No further memory management is involved.
> On attempts to access a hole, or if a mapping was removed by PIDFD_MEM_UNMAP,
> or if the remote process address space was reaped by OOM, the remote mapping
> fault handler returns VM_FAULT_SIGBUS.

I still think anything along these lines needs to meet the XPMEM use
cases as well; we have to have more general solutions for such MM stuff:

https://gitlab.com/hjelmn/xpmem

However, I think this fundamentally falls into some of the same bad
direction as xpmem.

I would much rather see this design copy & clone the VMAs than try to
mirror the PTEs inside the VMAs from the remote into a single giant VMA
and somehow split/mirror the VMA ops. This is just too weird and fragile
to be maintainable over the long term.

For instance, one of the major bugs in things like xpmem was that they
are incompatible with get_user_pages(), largely because of this issue.

I feel like I said this already..

Jason
> On Fri, Sep 04, 2020 at 02:31:11PM +0300, Adalbert Lazăr wrote:
> > VMAs obtained by mmap()ing memory access fds mirror the contents of
> > the remote process address space within the specified range. Pages are
> > installed in the current process page tables at fault time and removed
> > by the mmu_interval_notifier invalidate callback. No further memory
> > management is involved.
> > On attempts to access a hole, or if a mapping was removed by
> > PIDFD_MEM_UNMAP, or if the remote process address space was reaped by
> > OOM, the remote mapping fault handler returns VM_FAULT_SIGBUS.
>
> I still think anything along these lines needs to meet the XPMEM use
> cases as well; we have to have more general solutions for such MM stuff:
>
> https://gitlab.com/hjelmn/xpmem
>
> However, I think this fundamentally falls into some of the same bad
> direction as xpmem.
>
> I would much rather see this design copy & clone the VMAs than try to
> mirror the PTEs inside the VMAs from the remote into a single giant VMA
> and somehow split/mirror the VMA ops.

This design was made specifically for virtual machine introspection, where
we care more about the contents of the address space than about the remote
VMAs and their vmops. (Right now only anon pages can be mapped, but I
guess we can extend to pagecache pages as well.) I just used what seemed
to be the common denominator of all page-related operations: range
invalidation. This looks like a general solution.

IMO cloning a VMA into an address space that has a completely different
layout will present its own set of caveats: what happens if the VMA
resizes/splits? Can you replay all the remote VMA vmops on the clone VMA?

> This is just too weird and fragile to be maintainable over the long term.
>
> For instance, one of the major bugs in things like xpmem was that they
> are incompatible with get_user_pages(), largely because of this issue.

We support get_user_pages(); that's how we integrate with KVM. The
difference is that the page returned will not belong to the current
process.

> I feel like I said this already..
>
> Jason
On Fri, Sep 04, 2020 at 01:24:43PM +0000, Mircea CIRJALIU - MELIU wrote:
> > On Fri, Sep 04, 2020 at 02:31:11PM +0300, Adalbert Lazăr wrote:
> > > VMAs obtained by mmap()ing memory access fds mirror the contents of
> > > the remote process address space within the specified range. Pages are
> > > installed in the current process page tables at fault time and removed
> > > by the mmu_interval_notifier invalidate callback. No further memory
> > > management is involved.
> > > On attempts to access a hole, or if a mapping was removed by
> > > PIDFD_MEM_UNMAP, or if the remote process address space was reaped by
> > > OOM, the remote mapping fault handler returns VM_FAULT_SIGBUS.
> >
> > I still think anything along these lines needs to meet the XPMEM use
> > cases as well; we have to have more general solutions for such MM stuff:
> >
> > https://gitlab.com/hjelmn/xpmem
> >
> > However, I think this fundamentally falls into some of the same bad
> > direction as xpmem.
> >
> > I would much rather see this design copy & clone the VMAs than try to
> > mirror the PTEs inside the VMAs from the remote into a single giant VMA
> > and somehow split/mirror the VMA ops.
>
> This design was made specifically for virtual machine introspection,
> where we care more about the contents of the address space than about
> the remote VMAs and their vmops. (Right now only anon pages can be
> mapped, but I guess we can extend to pagecache pages as well.) I just
> used what seemed to be the common denominator of all page-related
> operations: range invalidation. This looks like a general solution.

The point is that a VMA is how the MM connects its parts together; cloning
the content of a VMA without the rest of the VMA metadata is just going to
be very fragile in the long run..

Especially if the VMA is presented as a normal VMA with working struct
pages/etc, not a pfn map.

> IMO cloning a VMA into an address space that has a completely different
> layout will present its own set of caveats: what happens if the VMA
> resizes/splits? Can you replay all the remote VMA vmops on the clone VMA?

The mirror would have to reclone the source VMA every time the source
VMA changes.

> > This is just too weird and fragile to be maintainable over the long term.
> >
> > For instance, one of the major bugs in things like xpmem was that they
> > are incompatible with get_user_pages(), largely because of this issue.
>
> We support get_user_pages(); that's how we integrate with KVM.

This seems really sketchy: get_user_pages() is sensitive to the VMA. What
happens when the VMA flags are different, etc.?

Jason
> On Fri, Sep 04, 2020 at 01:24:43PM +0000, Mircea CIRJALIU - MELIU wrote:
> > > On Fri, Sep 04, 2020 at 02:31:11PM +0300, Adalbert Lazăr wrote:
> > > > VMAs obtained by mmap()ing memory access fds mirror the contents
> > > > of the remote process address space within the specified range.
> > > > Pages are installed in the current process page tables at fault
> > > > time and removed by the mmu_interval_notifier invalidate callback.
> > > > No further memory management is involved.
> > > > On attempts to access a hole, or if a mapping was removed by
> > > > PIDFD_MEM_UNMAP, or if the remote process address space was reaped
> > > > by OOM, the remote mapping fault handler returns VM_FAULT_SIGBUS.
> > >
> > > I still think anything along these lines needs to meet the XPMEM use
> > > cases as well; we have to have more general solutions for such MM
> > > stuff:
> > >
> > > https://gitlab.com/hjelmn/xpmem
> > >
> > > However, I think this fundamentally falls into some of the same bad
> > > direction as xpmem.
> > >
> > > I would much rather see this design copy & clone the VMAs than try
> > > to mirror the PTEs inside the VMAs from the remote into a single
> > > giant VMA and somehow split/mirror the VMA ops.
> >
> > This design was made specifically for virtual machine introspection,
> > where we care more about the contents of the address space than about
> > the remote VMAs and their vmops. (Right now only anon pages can be
> > mapped, but I guess we can extend to pagecache pages as well.) I just
> > used what seemed to be the common denominator of all page-related
> > operations: range invalidation. This looks like a general solution.
>
> The point is that a VMA is how the MM connects its parts together;
> cloning the content of a VMA without the rest of the VMA metadata is
> just going to be very fragile in the long run..
>
> Especially if the VMA is presented as a normal VMA with working struct
> pages/etc, not a pfn map.
>
> > IMO cloning a VMA into an address space that has a completely
> > different layout will present its own set of caveats: what happens if
> > the VMA resizes/splits? Can you replay all the remote VMA vmops on the
> > clone VMA?
>
> The mirror would have to reclone the source VMA every time the source
> VMA changes.
>
> > > This is just too weird and fragile to be maintainable over the long
> > > term.
> > >
> > > For instance, one of the major bugs in things like xpmem was that
> > > they are incompatible with get_user_pages(), largely because of this
> > > issue.
> >
> > We support get_user_pages(); that's how we integrate with KVM.
>
> This seems really sketchy: get_user_pages() is sensitive to the VMA.
> What happens when the VMA flags are different, etc.?

A debugger shouldn't complain if a portion of the debuggee is read-only,
just overwrite the data.

> Jason
On Fri, Sep 04, 2020 at 02:18:37PM +0000, Mircea CIRJALIU - MELIU wrote:
> > This seems really sketchy: get_user_pages() is sensitive to the VMA.
> > What happens when the VMA flags are different, etc.?
>
> A debugger shouldn't complain if a portion of the debuggee is read-only,
> just overwrite the data.

At this point the kernel API here is so incredibly limited that you may as
well use a memfd for passing the shared address space instead of trying to
do and maintain this complexity.

Your use case is only qemu, so what is the problem with replacing the
allocator backing VM memory in userspace? Other people have been talking
about doing a memfd already for different reasons - and a memfd can
already be shared as this scheme desires.

Jason
> On Fri, Sep 04, 2020 at 02:18:37PM +0000, Mircea CIRJALIU - MELIU wrote:
> > > This seems really sketchy: get_user_pages() is sensitive to the VMA.
> > > What happens when the VMA flags are different, etc.?
> >
> > A debugger shouldn't complain if a portion of the debuggee is
> > read-only, just overwrite the data.
>
> At this point the kernel API here is so incredibly limited that you may
> as well use a memfd for passing the shared address space instead of
> trying to do and maintain this complexity.
>
> Your use case is only qemu, so what is the problem with replacing the
> allocator backing VM memory in userspace? Other people have been talking
> about doing a memfd already for different reasons - and a memfd can
> already be shared as this scheme desires.

KSM doesn't work on shmem. Once you replace the allocator you render KSM
useless.

Besides that, I once received a mail from Paolo Bonzini:

>> Hi,
>>
>> here at FOSDEM we discussed having a way for a parent process to
>> split parts of an mmap range with one or more child processes. This
>> turns out to be a generalization of the remote memory mapping concept
>> that BitDefender proposed for virtual machine introspection (
>> https://patchwork.kernel.org/patch/11284561/). So far the patches
>> haven't had a great reception from the MM people, but it shouldn't be
>> hard to adjust the API according to the sketch below. I am also
>> including Mircea who is the author.
>>
>> The proposed API is a new pidfd system call, through which the parent
>> can map portions of its virtual address space into a file descriptor
>> and then pass that file descriptor to a child.

(the rest can be found in the cover letter)

Therefore I had to write a module that peeks into anon process memory and
is compatible with KSM. This was among the requirements for the first
version of remote mapping, which ended up non-scalable. It turns out it
can peek into any kind of memory involving pages. Also, it doesn't have
the overhead associated with mapping a page in a VMA, and compared to
ptrace() it can keep the pages resident as long as needed.

Mircea
On Fri, Sep 04, 2020 at 03:40:55PM +0000, Mircea CIRJALIU - MELIU wrote:
> > On Fri, Sep 04, 2020 at 02:18:37PM +0000, Mircea CIRJALIU - MELIU wrote:
> > > > This seems really sketchy: get_user_pages() is sensitive to the
> > > > VMA. What happens when the VMA flags are different, etc.?
> > >
> > > A debugger shouldn't complain if a portion of the debuggee is
> > > read-only, just overwrite the data.
> >
> > At this point the kernel API here is so incredibly limited that you
> > may as well use a memfd for passing the shared address space instead
> > of trying to do and maintain this complexity.
> >
> > Your use case is only qemu, so what is the problem with replacing the
> > allocator backing VM memory in userspace? Other people have been
> > talking about doing a memfd already for different reasons - and a
> > memfd can already be shared as this scheme desires.
>
> KSM doesn't work on shmem.
> Once you replace the allocator you render KSM useless.

I suspect making memfd work with KSM will be much less hacky than this.

> It turns out it can peek into any kind of memory involving pages.

No, this approach is really limited to anonymous VMAs.

You might make some argument that VMA differences can be ignored if they
are all anonymous to start with, but certainly not once other types of
VMAs are included.

Jason
* Adalbert Lazăr:

>>> - the control plane accepts three ioctls
>>>
>>> PIDFD_MEM_MAP takes a struct like
>>>
>>>     struct pidfd_mem_map {
>>>         uint64_t address;
>>>         off_t offset;
>>>         off_t size;
>>>         int flags;
>>>         int padding[7];
>>>     }
>>>
>>> After this is done, the memory access fd can be mmap-ed at range
>>> [offset, offset+size), and it will read memory from range
>>> [address, address+size) of the target descriptor.
>>>
>>> PIDFD_MEM_UNMAP takes a struct like
>>>
>>>     struct pidfd_mem_unmap {
>>>         off_t offset;
>>>         off_t size;
>>>     }
>>>
>>> and unmaps the corresponding range of course.

off_t depends on preprocessor macros (_FILE_OFFSET_BITS), so these types
should be uint64_t, too.

I'm not sure what the advantage is of returning separate file descriptors,
and not operating directly on the pidfd.
On Fri, Sep 4, 2020 at 4:41 AM Adalbert Lazăr <alazar@bitdefender.com> wrote:
>
> This patchset adds support for the remote mapping feature.
> Remote mapping, as its name suggests, is a means for transparent and
> zero-copy access of a remote process' address space.
>

I think this is very clever, but I find myself wondering what happens if
people start trying to abuse this by, for example, setting up a remote
mapping pointing to fun regions like userfaultfd or another remote mapping.

I'm a little concerned that it's actually too clever and that maybe a more
straightforward solution should be investigated. I personally rather
dislike the KVM model in which the guest address space mirrors the host
(QEMU) address space rather than being its own thing. In particular, the
current model means that extra-special-strange mappings like SEV-encrypted
memory are required to be present in the QEMU page tables in order for the
guest to see them.

(If I had noticed that last bit before it went upstream, I would have
NAKked it. I would still like to see it deprecated and ideally eventually
removed from the kernel. We have absolutely no business creating
incoherent mappings like this.)

--Andy
On Fri, Sep 04, 2020 at 09:11:48AM -0300, Jason Gunthorpe wrote:
> On Fri, Sep 04, 2020 at 02:31:11PM +0300, Adalbert Lazăr wrote:
> > VMAs obtained by mmap()ing memory access fds mirror the contents of the remote
> > process address space within the specified range. Pages are installed in the
> > current process page tables at fault time and removed by the mmu_interval_notifier
> > invalidate callback. No further memory management is involved.
> > On attempts to access a hole, or if a mapping was removed by PIDFD_MEM_UNMAP,
> > or if the remote process address space was reaped by OOM, the remote mapping
> > fault handler returns VM_FAULT_SIGBUS.
>
> I still think anything along these lines needs to meet the XPMEM use
> cases as well; we have to have more general solutions for such MM
> stuff:
>
> https://gitlab.com/hjelmn/xpmem
>
> However, I think this fundamentally falls into some of the same bad
> direction as xpmem.
>
> I would much rather see this design copy & clone the VMAs than try to
> mirror the PTEs inside the VMAs from the remote into a single giant
> VMA and somehow split/mirror the VMA ops.

I'm on holiday for the next few days, but does the mshare() API work for
your use case?

Proposal: http://www.wil.cx/~willy/linux/sileby.html
Start of an implementation:
http://git.infradead.org/users/willy/linux.git/shortlog/refs/heads/mshare
On Fri, Sep 04, 2020 at 08:41:39PM +0100, Matthew Wilcox wrote:
> On Fri, Sep 04, 2020 at 09:11:48AM -0300, Jason Gunthorpe wrote:
> > On Fri, Sep 04, 2020 at 02:31:11PM +0300, Adalbert Lazăr wrote:
> > > VMAs obtained by mmap()ing memory access fds mirror the contents of the remote
> > > process address space within the specified range. Pages are installed in the
> > > current process page tables at fault time and removed by the mmu_interval_notifier
> > > invalidate callback. No further memory management is involved.
> > > On attempts to access a hole, or if a mapping was removed by PIDFD_MEM_UNMAP,
> > > or if the remote process address space was reaped by OOM, the remote mapping
> > > fault handler returns VM_FAULT_SIGBUS.
> >
> > I still think anything along these lines needs to meet the XPMEM use
> > cases as well; we have to have more general solutions for such MM
> > stuff:
> >
> > https://gitlab.com/hjelmn/xpmem
> >
> > However, I think this fundamentally falls into some of the same bad
> > direction as xpmem.
> >
> > I would much rather see this design copy & clone the VMAs than try to
> > mirror the PTEs inside the VMAs from the remote into a single giant
> > VMA and somehow split/mirror the VMA ops.
>
> I'm on holiday for the next few days, but does the mshare() API work for
> your use case?
>
> Proposal: http://www.wil.cx/~willy/linux/sileby.html
> Start of an implementation:
> http://git.infradead.org/users/willy/linux.git/shortlog/refs/heads/mshare

Let me ask around; it seems in a similar space at least!

Thanks,
Jason
On 04/09/20 21:41, Matthew Wilcox wrote:
> Proposal: http://www.wil.cx/~willy/linux/sileby.html
> Start of an implementation:
> http://git.infradead.org/users/willy/linux.git/shortlog/refs/heads/mshare

The main difference between mshare() and this is that we don't want an
all-or-nothing thing. Adalbert's introspection thing is rather simple, but
what I would like to be able to do (and the reason why I suggested the
multi-pidfd approach) is actually a bit more complex:

- a parent process creates a range of memory

- there are multiple VM child processes. One of these VMs is a primary VM,
  the others are enclave VMs. VMs are created by the parent process and
  each VM gets a different view of the memory range through pidfd_mem.

- once an enclave VM is created, the primary VM must not be able to access
  the memory that has been assigned to the enclave VM. If the parent
  unmaps the memory in the primary VM, the child must get SIGBUS when it
  is accessed.

- if memory is removed from a VM and assigned to another, this should not
  involve any copy at all.

For this use case the range of memory would be backed by hugetlbfs,
anonymous memory, VFIO, whatever. Userfaultfd is certainly part of the
picture here on the VM side. Having userfaultfd on the parent side would
be nice, though I don't have a use for it right now. I'm not sure about
non-anonymous VMAs.

Thanks,

Paolo
On 04/09/20 21:39, Andy Lutomirski wrote:
> I'm a little concerned that it's actually too clever and that maybe a
> more straightforward solution should be investigated. I personally
> rather dislike the KVM model in which the guest address space mirrors
> the host (QEMU) address space rather than being its own thing. In
> particular, the current model means that extra-special-strange
> mappings like SEV-encrypted memory are required to be present in the
> QEMU page tables in order for the guest to see them.
>
> (If I had noticed that last bit before it went upstream, I would have
> NAKked it. I would still like to see it deprecated and ideally
> eventually removed from the kernel. We have absolutely no business
> creating incoherent mappings like this.)

NACK first and ask second, right Andy? I see that nothing has changed
since Alan Cox left Linux.

Paolo
On 04/09/20 21:19, Florian Weimer wrote:
> I'm not sure what the advantage is of returning separate file
> descriptors, and not operating directly on the pidfd.

For privilege separation. So far, the common case of pidfd operations has
been that whoever possesses a pidfd has "power" over the target process.
Here, however, we also want to cover the case where one privileged process
wants to set up and manage a memory range for multiple children. The
privileged process can do so by passing the access file descriptor via
SCM_RIGHTS.

We also want different children to have visibility over different ranges,
which is why there are multiple control fds rather than using the pidfd
itself as the control fd. You could have the map/unmap/lock ioctls on the
pidfd itself and the access fd as an argument of the ioctl, but it seems
cleaner to represent the pidfd-mem control capability as its own file
descriptor.

Paolo
> On Sep 4, 2020, at 1:09 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>> On 04/09/20 21:39, Andy Lutomirski wrote:
>> I'm a little concerned that it's actually too clever and that maybe a
>> more straightforward solution should be investigated. I personally
>> rather dislike the KVM model in which the guest address space mirrors
>> the host (QEMU) address space rather than being its own thing. In
>> particular, the current model means that extra-special-strange
>> mappings like SEV-encrypted memory are required to be present in the
>> QEMU page tables in order for the guest to see them.
>>
>> (If I had noticed that last bit before it went upstream, I would have
>> NAKked it. I would still like to see it deprecated and ideally
>> eventually removed from the kernel. We have absolutely no business
>> creating incoherent mappings like this.)
>
> NACK first and ask second, right Andy? I see that nothing has changed
> since Alan Cox left Linux.

NACKs are negotiable. And maybe someone can convince me that the SEV
mapping scheme is reasonable, but I would be surprised.

Regardless, you seem to be suggesting that you want to have enclave VMs in
which the enclave can see some memory that the parent VM can’t see. How
does this fit into the KVM mapping model? How does this remote mapping
mechanism help? Do you want QEMU to have that memory mapped in its own
page tables?

As it stands, the way that KVM memory mappings are created seems to be
convenient, but it also seems to be resulting in increasingly bizarre
userspace mappings. At what point is the right solution to decouple KVM’s
mappings from QEMU’s?
On 04/09/20 22:34, Andy Lutomirski wrote:
> On Sep 4, 2020, at 1:09 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>> On 04/09/20 21:39, Andy Lutomirski wrote:
>>> I'm a little concerned that it's actually too clever and that maybe a
>>> more straightforward solution should be investigated. I personally
>>> rather dislike the KVM model in which the guest address space mirrors
>>> the host (QEMU) address space rather than being its own thing. In
>>> particular, the current model means that extra-special-strange
>>> mappings like SEV-encrypted memory are required to be present in the
>>> QEMU page tables in order for the guest to see them. (If I had
>>> noticed that last bit before it went upstream, I would have NAKked
>>> it. I would still like to see it deprecated and ideally eventually
>>> removed from the kernel. We have absolutely no business creating
>>> incoherent mappings like this.)
>>
>> NACK first and ask second, right Andy? I see that nothing has changed
>> since Alan Cox left Linux.
>
> NACKs are negotiable. And maybe someone can convince me that the SEV
> mapping scheme is reasonable, but I would be surprised.

So why say NACK? Any half-decent maintainer would hold off merging the
patches at least until the discussion is over. Also, I suppose any
deprecation proposal should come with a description of an alternative.

Anyway, for SEV the problem is DMA. There is no way to know in advance
which memory the guest will use for I/O; it can change at any time, and
the same host-physical address can even be mapped both as C=0 and C=1 by
the guest. There's no communication protocol between the guest and the
host to tell the host _which_ memory should be mapped in QEMU. (One was
added to support migration, but that doesn't even work with SEV-ES
processors, where migration is planned to happen mostly with help from the
guest, either in the firmware or somewhere else.)

But this is a digression. (If you would like to continue the discussion,
please trim the recipient list and change the subject.)

> Regardless, you seem to be suggesting that you want to have enclave
> VMs in which the enclave can see some memory that the parent VM can’t
> see. How does this fit into the KVM mapping model? How does this
> remote mapping mechanism help? Do you want QEMU to have that memory
> mapped in its own page tables?

There are three processes:

- the manager, which is the parent of the VMs and uses the pidfd_mem
  system call

- the primary VM

- the enclave VM(s)

The primary VM and the enclave VM(s) would each get a different memory
access file descriptor. QEMU would treat them no differently from any
other externally-provided memory backend, say hugetlbfs or memfd, so yeah,
they would be mmap-ed into userspace and the host virtual address passed
as usual to KVM.

Enclave VMs could be used to store secrets and perform crypto, for
example. The enclave is measured at boot; any keys or other stuff it needs
can be provided out-of-band from the manager.

The manager can decide at any time to hide some memory from the parent VM
(in order to give it to an enclave). This would actually be done on
request of the parent VM itself, and QEMU would probably be so kind as to
replace the "hole" left in the guest memory with zeroes. But QEMU is
untrusted, so the manager cannot rely on QEMU behaving well. Hence the
privilege separation model that was implemented here.

Actually, Amazon has already created something like that, and Andra-Irina
Paraschiv has posted patches on the list for this. Their implementation is
not open source, but this pidfd-mem concept is something that Andra,
Alexander Graf and I came up with as a way to 1) reimplement the feature
upstream, 2) satisfy Bitdefender's need for memory introspection, and 3)
add what seemed a useful interface anyway, for example to replace
PTRACE_{PEEK,POKE}DATA. Though (3) would only need pread/pwrite, not mmap,
which adds a lot of the complexity.

> As it stands, the way that KVM memory mappings are created seems to
> be convenient, but it also seems to be resulting in increasingly
> bizarre userspace mappings. At what point is the right solution to
> decouple KVM’s mappings from QEMU’s?

So what you are suggesting is that KVM manages its own address space
instead of host virtual addresses (and with no relationship to host
virtual addresses, it would be just a "cookie")? It would then need a
couple of ioctls to mmap/munmap (creating and deleting VMAs) into the
address space, and those cookies would be passed to
KVM_SET_USER_MEMORY_REGION. QEMU would still need access to these VMAs;
would it mmap a file descriptor provided by KVM? All in all the
implementation seems quite complex, and I don't understand why it would
avoid incoherent SEV mappings; what am I missing?

Paolo
On Fri, Sep 4, 2020 at 2:59 PM Paolo Bonzini <pbonzini@redhat.com> wrote: > > On 04/09/20 22:34, Andy Lutomirski wrote: > > On Sep 4, 2020, at 1:09 PM, Paolo Bonzini <pbonzini@redhat.com> wrote: > >> On 04/09/20 21:39, Andy Lutomirski wrote: > >>> I'm a little concerned > >>> that it's actually too clever and that maybe a more > >>> straightforward solution should be investigated. I personally > >>> rather dislike the KVM model in which the guest address space > >>> mirrors the host (QEMU) address space rather than being its own > >>> thing. In particular, the current model means that > >>> extra-special-strange mappings like SEV-encrypted memory are > >>> required to be present in the QEMU page tables in order for the > >>> guest to see them. (If I had noticed that last bit before it went > >>> upstream, I would have NAKked it. I would still like to see it > >>> deprecated and ideally eventually removed from the kernel. We > >>> have absolutely no business creating incoherent mappings like > >>> this.) > >> > >> NACK first and ask second, right Andy? I see that nothing has > >> changed since Alan Cox left Linux. > > > > NACKs are negotiable. And maybe someone can convince me that the SEV > > mapping scheme is reasonable, but I would be surprised. > > So why say NACK? Any half-decent maintainer would hold on merging the > patches at least until the discussion is over. Also I suppose any > deprecation proposal should come with a description of an alternative. I suppose that's a fair point. > > Anyway, for SEV the problem is DMA. There is no way to know in advance > which memory the guest will use for I/O; it can change at any time and > the same host-physical address can even be mapped both as C=0 and C=1 by > the guest. There's no communication protocol between the guest and the > host to tell the host _which_ memory should be mapped in QEMU. 
(One was > added to support migration, but that doesn't even work with SEV-ES > processors where migration is planned to happen mostly with help from > the guest, either in the firmware or somewhere else). There's sev_pin_memory(), so QEMU must have at least some idea of which memory could potentially be encrypted. Is it in fact the case that QEMU doesn't know that some SEV pinned memory might actually be used for DMA until the guest tries to do DMA on that memory? If so, yuck. > > But this is a digression. (If you would like to continue the discussion > please trim the recipient list and change the subject). Fair enough. And my apologies for bring grumpier about SEV than was called for. > > > Regardless, you seem to be suggesting that you want to have enclave > > VMs in which the enclave can see some memory that the parent VM can’t > > see. How does this fit into the KVM mapping model? How does this > > remote mapping mechanism help? Do you want QEMU to have that memory > > mapped in its own pagetables? > > There are three processes: > > - the manager, which is the parent of the VMs and uses the pidfd_mem > system call > > - the primary VM > > - the enclave VM(s) > > The primary VM and the enclave VM(s) would each get a different memory > access file descriptor. QEMU would treat them no differently from any > other externally-provided memory backend, say hugetlbfs or memfd, so > yeah they would be mmap-ed to userspace and the host virtual address > passed as usual to KVM. Would the VM processes mmap() these descriptors, or would KVM learn how to handle that memory without it being mapped? > > Enclave VMs could be used to store secrets and perform crypto for > example. The enclave is measured at boot, any keys or other stuff it > needs can be provided out-of-band from the manager > > The manager can decide at any time to hide some memory from the parent > VM (in order to give it to an enclave). 
This would actually be done on > request of the parent VM itself, and QEMU would probably be so kind as > to replace the "hole" left in the guest memory with zeroes. But QEMU is > untrusted, so the manager cannot rely on QEMU behaving well. Hence the > privilege separation model that was implemented here. How does this work? Is there a revoke mechanism, or does the parent just munmap() the memory itself? > > Actually Amazon has already created something like that and Andra-Irina > Paraschiv has posted patches on the list for this. Their implementation > is not open source, but this pidfd-mem concept is something that Andra, > Alexander Graf and I came up with as a way to 1) reimplement the feature > upstream and 2) satisfy Bitdefender's need for memory introspection 3) > add what seemed a useful interface anyway, for example to replace > PTRACE_{PEEK,POKE}DATA. Though (3) would only need pread/pwrite, not > mmap which adds a lot of the complexity. > > > As it stands, the way that KVM memory mappings are created seems to > > be convenient, but it also seems to be resulting in increasing > > bizarre userspace mappings. At what point is the right solution to > > decouple KVM’s mappings from QEMU’s? > > So what you are suggesting is that KVM manages its own address space > instead of host virtual addresses (and with no relationship to host > virtual addresses, it would be just a "cookie")? It would then need a > couple ioctls to mmap/munmap (creating and deleting VMAs) into the > address space, and those cookies would be passed to > KVM_SET_USER_MEMORY_REGION. QEMU would still need access to these VMAs, > would it mmap a file descriptor provided by KVM? All in all the > implementation seems quite complex, and I don't understand why it would > avoid incoherent SEV mappings; what am I missing? It might not avoid incoherent SEV mappings in particular, but it would certainly enable other, somewhat related usecases. 
For example, QEMU could have KVM map a memfd without itself mapping that memfd, which would reduce the extent to which the memory would be exposed to an attacker who can read QEMU memory. For this pidfd-mem scheme in particular, it might avoid the nasty corner case I mentioned. With pidfd-mem as in this patchset, I'm concerned about what happens when process A maps some of process B's memory, process B maps some of process A's memory, and there's a recursive mapping that results. Or when a process maps its own memory, for that matter. If KVM could map fd's directly, then there could be a parallel mechanism for KVM to import portions of more than one process's address space, and this particular problem would be avoided. So a process would create a pidfd-mem-like object and pass that to KVM (via an intermediary process or directly) and KVM could map that, but no normal process would be allowed to map it. This avoids the recursion problems. Or memfd could get fancier with operations to split memfds, remove pages from memfds, etc. Maybe that's overkill. (Or a fancy recursion detector could be built, but this has been a pain point in AF_UNIX, epoll, etc. in the past. It may be solvable, but it won't be pretty.) I admit that allowing KVM to map fd's directly without some specific vm_operations support for this could be challenging, but ISTM kvm could plausibly own an mm_struct and pagetables at the cost of some wasted memory. The result would, under the hood, work more or less like the current implementation, but the API would be quite different. > > Paolo >
On 05/09/20 01:17, Andy Lutomirski wrote: > There's sev_pin_memory(), so QEMU must have at least some idea of > which memory could potentially be encrypted. Is it in fact the case > that QEMU doesn't know that some SEV pinned memory might actually be > used for DMA until the guest tries to do DMA on that memory? If so, > yuck. Yes. All the memory is pinned, all the memory could potentially be used for DMA (of garbage if it's encrypted). And it's the same for pretty much all protected VM extensions (SEV, POWER, s390, Intel TDX). >> The primary VM and the enclave VM(s) would each get a different memory >> access file descriptor. QEMU would treat them no differently from any >> other externally-provided memory backend, say hugetlbfs or memfd, so >> yeah they would be mmap-ed to userspace and the host virtual address >> passed as usual to KVM. > > Would the VM processes mmap() these descriptors, or would KVM learn > how to handle that memory without it being mapped? The idea is that the process mmaps them, QEMU would treat them just the same as a hugetlbfs file descriptor for example. >> The manager can decide at any time to hide some memory from the parent >> VM (in order to give it to an enclave). This would actually be done on >> request of the parent VM itself [...] But QEMU is >> untrusted, so the manager cannot rely on QEMU behaving well. Hence the >> privilege separation model that was implemented here. > > How does this work? Is there a revoke mechanism, or does the parent > just munmap() the memory itself? The parent has ioctls to add and remove memory from the pidfd-mem. So unmapping is just calling the ioctl that removes a range. >> So what you are suggesting is that KVM manages its own address space >> instead of host virtual addresses (and with no relationship to host >> virtual addresses, it would be just a "cookie")? > > [...] For this pidfd-mem scheme in particular, it might avoid the nasty > corner case I mentioned. 
With pidfd-mem as in this patchset, I'm > concerned about what happens when process A maps some process B > memory, process B maps some of process A's memory, and there's a > recursive mapping that results. Or when a process maps its own > memory, for that matter. > > Or memfd could get fancier with operations to split memfds, remove > pages from memfds, etc. Maybe that's overkill. Doing it directly with memfd is certainly an option, especially since MFD_HUGE_* exists. Basically you'd have a system call to create a secondary view of the memfd, and the syscall interface could still be very similar to what is in this patch, in particular the control/access pair. Probably this could be used also to implement Matthew Wilcox's ideas. I still believe that the pidfd-mem concept has merit as a "capability-like" PTRACE_{PEEK,POKE}DATA replacement, but it would not need any of privilege separation or mmap support, only direct read/write. So there's two concepts mixed in one interface in this patch, with two completely different usecases. Merging them is clever, but perhaps too clever. I can say that since it was my idea. :D Thanks, Paolo
On Fri, Sep 04, 2020 at 11:58:57PM +0200, Paolo Bonzini wrote: > So why say NACK? Any half-decent maintainer would hold on merging the > patches at least until the discussion is over. Also I suppose any > deprecation proposal should come with a description of an alternative. Please stop these totally pointless and overly aggressive personal attacks. A maintainer's prime job is to say no.
Hey Paolo, On Fri, Sep 04, 2020 at 10:18:17PM +0200, Paolo Bonzini wrote: > On 04/09/20 21:19, Florian Weimer wrote: > > I'm not sure what the advantage is of returning separate file > > descriptors, and not operating directly on the pidfd. > > For privilege separation. So far, the common case of pidfd operations > has been that whoever possesses a pidfd has "power" over the target I may misunderstand you but that's actually not quite true. Currently, pidfds are just handles on processes and only convey identity. They don't guarantee any sort of privilege over the target process. We have had discussions about treating them more as capabilities in the future, but that needs to be carefully thought out. > process. Here however we also want to cover the case where one > privileged process wants to set up and manage a memory range for > multiple children. The privileged process can do so by passing the > access file descriptor via SCM_RIGHTS. > > We also want different children to have visibility over different > ranges, which is why there are multiple control fds rather than using > the pidfd itself as control fd. You could have the map/unmap/lock ioctl > on the pidfd itself and the access fd as an argument of the ioctl, but > it seems cleaner to represent the pidfd-mem control capability as its > own file descriptor. We have very much on purpose avoided adding ioctls() on top of pidfds and I'm not fond of the idea of starting to add them. Supporting ioctl()s on an fd usually opens up a can of worms and makes sneaking in questionable features more likely (I'm not saying your patchset does that!). If this interface holds up, I would ask you to please either keep this as a separate fd type or please propose system calls only. Christian
On Sat, Sep 05, 2020 at 08:27:29PM +0200, Paolo Bonzini wrote: > On 05/09/20 01:17, Andy Lutomirski wrote: > > There's sev_pin_memory(), so QEMU must have at least some idea of > > which memory could potentially be encrypted. Is it in fact the case > > that QEMU doesn't know that some SEV pinned memory might actually be > > used for DMA until the guest tries to do DMA on that memory? If so, > > yuck. > > Yes. All the memory is pinned, all the memory could potentially be used > for DMA (of garbage if it's encrypted). And it's the same for pretty > much all protected VM extensions (SEV, POWER, s390, Intel TDX). > > >> The primary VM and the enclave VM(s) would each get a different memory > >> access file descriptor. QEMU would treat them no differently from any > >> other externally-provided memory backend, say hugetlbfs or memfd, so > >> yeah they would be mmap-ed to userspace and the host virtual address > >> passed as usual to KVM. > > > > Would the VM processes mmap() these descriptors, or would KVM learn > > how to handle that memory without it being mapped? > > The idea is that the process mmaps them, QEMU would treat them just the > same as a hugetlbfs file descriptor for example. > > >> The manager can decide at any time to hide some memory from the parent > >> VM (in order to give it to an enclave). This would actually be done on > >> request of the parent VM itself [...] But QEMU is > >> untrusted, so the manager cannot rely on QEMU behaving well. Hence the > >> privilege separation model that was implemented here. > > > > How does this work? Is there a revoke mechanism, or does the parent > > just munmap() the memory itself? > > The parent has ioctls to add and remove memory from the pidfd-mem. So > unmapping is just calling the ioctl that removes a range. I would strongly suggest we move away from ioctl() patterns. If something like this comes up in the future, just propose them as system calls. 
> > >> So what you are suggesting is that KVM manages its own address space > >> instead of host virtual addresses (and with no relationship to host > >> virtual addresses, it would be just a "cookie")? > > > > [...] For this pidfd-mem scheme in particular, it might avoid the nasty > > corner case I mentioned. With pidfd-mem as in this patchset, I'm > > concerned about what happens when process A maps some process B > > memory, process B maps some of process A's memory, and there's a > > recursive mapping that results. Or when a process maps its own > > memory, for that matter. > > > > Or memfd could get fancier with operations to split memfds, remove > > pages from memfds, etc. Maybe that's overkill. > > Doing it directly with memfd is certainly an option, especially since > MFD_HUGE_* exists. Basically you'd have a system call to create a I like that idea way better to be honest. Christian
On 07/09/20 09:05, Christoph Hellwig wrote: > On Fri, Sep 04, 2020 at 11:58:57PM +0200, Paolo Bonzini wrote: >> So why say NACK? Any half-decent maintainer would hold on merging the >> patches at least until the discussion is over. Also I suppose any >> deprecation proposal should come with a description of an alternative. > > Please stop these totally pointless and overly aggressive personal > attacks. A maintainer's prime job is to say no. Christoph, I'm not sure who is attacking whom honestly, especially since Andy later said "Fair enough. And my apologies for being grumpier about SEV than was called for" and we had a perfectly civil and technical conversation. Paolo
> From: Andy Lutomirski <luto@kernel.org> > I think this is very clever, but I find myself wondering what happens if people > start trying to abuse this by, for example, setting up a remote mapping > pointing to fun regions like userfaultfd or another remote mapping. Can ptrace() be used to abuse fun regions of a process address space? Remote mapping recursiveness can be eliminated by checking the VMA the remote page is extracted from. (NYI) > I'm a little concerned that it's actually too clever and that maybe a more > straightforward solution should be investigated. I personally rather dislike > the KVM model in which the guest address space mirrors the host (QEMU) > address space rather than being its own thing. I've seen a few internal mmap()s throughout the kernel. Just wondering how memory accounting is implemented for these cases. Will this be reflected in the memory usage of the process that controls such a module? Especially in the case of a virtual machine that needs a few GBs of memory. Mircea
> From: Andy Lutomirski <luto@kernel.org> > > > As it stands, the way that KVM memory mappings are created seems to > > > be convenient, but it also seems to be resulting in increasingly > > > bizarre userspace mappings. At what point is the right solution to > > > decouple KVM’s mappings from QEMU’s? Convenience is one of the drivers of code reuse. > > So what you are suggesting is that KVM manages its own address space > > instead of host virtual addresses (and with no relationship to host > > virtual addresses, it would be just a "cookie")? It would then need a > > couple of ioctls to mmap/munmap (creating and deleting VMAs) into the > > address space, and those cookies would be passed to > > KVM_SET_USER_MEMORY_REGION. QEMU would still need access to > > these VMAs, would it mmap a file descriptor provided by KVM? All in all the > > implementation seems quite complex, and I don't understand why it > > would avoid incoherent SEV mappings; what am I missing? > > It might not avoid incoherent SEV mappings in particular, but it would > certainly enable other, somewhat related usecases. For example, QEMU > could have KVM map a memfd without itself mapping that memfd, which > would reduce the extent to which the memory would be exposed to an > attacker who can read QEMU memory. Isn't this security through obscurity? > For this pidfd-mem scheme in particular, it might avoid the nasty corner case > I mentioned. With pidfd-mem as in this patchset, I'm concerned about what > happens when process A maps some of process B's memory, process B maps > some of process A's memory, and there's a recursive mapping that results. > Or when a process maps its own memory, for that matter. If KVM could map > fd's directly, then there could be a parallel mechanism for KVM to import > portions of more than one process's address space, and this particular > problem would be avoided. 
So a process would create a pidfd-mem-like > object and pass that to KVM (via an intermediary process or directly) and > KVM could map that, but no normal process would be allowed to map it. This > avoids the recursion problems. > > Or memfd could get fancier with operations to split memfds, remove pages > from memfds, etc. Maybe that's overkill. > > (Or a fancy recursion detector could be built, but this has been a pain point in > AF_UNIX, epoll, etc. in the past. It may be solvable, but it won't be pretty.) > > I admit that allowing KVM to map fd's directly without some specific > vm_operations support for this could be challenging, but ISTM kvm could > plausibly own an mm_struct and pagetables at the cost of some wasted > memory. The result would, under the hood, work more or less like the > current implementation, but the API would be quite different. This looks like an attempt to pass memory-related concerns to KVM developers. The userspace mapping mechanism is good as it is. Probably not perfect, just good. The problem is that it's stuck to a few VMA models and needs to evolve towards more bizarre/sketchy/weird/fragile patterns. Also, the memory code is some of the most tightly coupled code I have seen. That probably explains the maintainers' fear of trying something new. Mircea
On Fri, Sep 04, 2020 at 02:31:11PM +0300, Adalbert Lazăr wrote: > This patchset adds support for the remote mapping feature. > Remote mapping, as its name suggests, is a means for transparent and > zero-copy access of a remote process' address space. > > The feature was designed according to a specification suggested by > Paolo Bonzini: > >> The proposed API is a new pidfd system call, through which the parent > >> can map portions of its virtual address space into a file descriptor > >> and then pass that file descriptor to a child. > >> > >> This should be: > >> > >> - upstreamable, pidfd is the new cool thing and we could sell it as a > >> better way to do PTRACE_{PEEK,POKE}DATA In all honesty, that sentence made me a bit uneasy as it reads like this is implemented on top of pidfds because it makes it more likely to go upstream, not because it is the right design. To be clear, I'm not implying any sort of malicious intent on your part but I would suggest phrasing this a little better. :)
On Mon, Sep 7, 2020 at 8:05 AM Christian Brauner <christian.brauner@ubuntu.com> wrote: > > On Fri, Sep 04, 2020 at 02:31:11PM +0300, Adalbert Lazăr wrote: > > This patchset adds support for the remote mapping feature. > > Remote mapping, as its name suggests, is a means for transparent and > > zero-copy access of a remote process' address space. > > > > The feature was designed according to a specification suggested by > > Paolo Bonzini: > > >> The proposed API is a new pidfd system call, through which the parent > > >> can map portions of its virtual address space into a file descriptor > > >> and then pass that file descriptor to a child. > > >> > > >> This should be: > > >> > > >> - upstreamable, pidfd is the new cool thing and we could sell it as a > > >> better way to do PTRACE_{PEEK,POKE}DATA > > In all honesty, that sentence made me a bit uneasy as it reads like this > is implemented on top of pidfds because it makes it more likely to go > upstream, not because it is the right design. To be clear, I'm not > implying any sort of malicious intent on your part but I would suggest > phrasing this a little better. :) I thought about this whole thing some more, and here are some thoughts. First, I was nervous about two things. One was faulting in pages from the wrong context. (When a normal page fault or KVM faults in a page, the mm is loaded. (In the KVM case, the mm is sort of not loaded when the actual fault happens, but the mm is loaded when the fault is handled, I think. Maybe there are workqueues involved and I'm wrong.) When a remote mapping faults in a page, the mm is *not* loaded.) This ought not to be a problem, though -- get_user_pages_remote() also faults in pages from a non-current mm, and that's at least supposed to work correctly, so maybe this is okay. Second is recursion. I think this is a genuine problem. And I think that tying this to pidfds is the wrong approach. 
In fact, tying it to processes at all seems wrong. There is a lot of demand for various forms of memory isolation in which memory is mapped only by its intended user. Using something tied to a process mm gets in the way of this in the same way that KVM's current mapping model gets in the way. All that being said, I think the whole idea of making fancy address spaces composed from other mappable objects is neat and possibly quite useful. And, if you squint a bit, this is a lot like what KVM does today. So I suggest something that may be more generally useful as an alternative. This is a sketch and very subject to bikeshedding: Create an empty address space: int address_space_create(int flags, etc); Map an fd into an address space: int address_space_mmap(int asfd, int fd_to_map, offset, size, prot, ...); /* might run out of args here */ Unmap from an address space: int address_space_munmap(int asfd, unsigned long addr, unsigned long len); Stick an address space into KVM: ioctl(vmfd, KVM_MAP_ADDRESS_SPACE, asfd); /* or similar */ Maybe some day allow mapping an address space into a process. mmap(..., asfd, ...); And at least for now, there's a rule that address_space_mmap()ing one address space into another is disallowed. Maybe some day we also allow mremap(), madvise(), etc. And maybe some day we allow creating a special address_space that represents a real process's address space. Under the hood, an address_space could own an mm_struct that is not used by any tasks. And we could have special memfds that are bound to a VM such that all you can do with them is stick them into an address_space and map that address_space into the VM in question. For this to work, we would want a special vm_operation for mapping into a VM. What do you all think? Is this useful? Does it solve your problems? Is it a good approach going forward?
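[Editor's note: to make the proposed flow concrete, a purely hypothetical usage sketch follows. None of the address_space_* calls or KVM_MAP_ADDRESS_SPACE exist; every name and signature here is Andy's strawman, or an assumption layered on top of it, and this is pseudocode, not a buildable program.]

```c
/* HYPOTHETICAL -- illustrates the proposed flow only, not a real API. */
int memfd = memfd_create("guest-ram", 0);		/* exists today */
ftruncate(memfd, GUEST_RAM_SIZE);			/* exists today */

int asfd = address_space_create(0);			/* proposed */
address_space_mmap(asfd, memfd, 0, GUEST_RAM_SIZE,
		   PROT_READ | PROT_WRITE);		/* proposed */

ioctl(vmfd, KVM_MAP_ADDRESS_SPACE, asfd);		/* proposed */
/* Note QEMU itself never mmap()s the memfd, so guest RAM
 * never appears in QEMU's own page tables. */

address_space_munmap(asfd, 0, GUEST_RAM_SIZE);		/* proposed: revoke */
```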
On Mon, Sep 07, 2020 at 01:43:48PM -0700, Andy Lutomirski wrote: > On Mon, Sep 7, 2020 at 8:05 AM Christian Brauner > <christian.brauner@ubuntu.com> wrote: > > > > On Fri, Sep 04, 2020 at 02:31:11PM +0300, Adalbert Lazăr wrote: > > > This patchset adds support for the remote mapping feature. > > > Remote mapping, as its name suggests, is a means for transparent and > > > zero-copy access of a remote process' address space. > > > > > > The feature was designed according to a specification suggested by > > > Paolo Bonzini: > > > >> The proposed API is a new pidfd system call, through which the parent > > > >> can map portions of its virtual address space into a file descriptor > > > >> and then pass that file descriptor to a child. > > > >> > > > >> This should be: > > > >> > > > >> - upstreamable, pidfd is the new cool thing and we could sell it as a > > > >> better way to do PTRACE_{PEEK,POKE}DATA > > > > In all honesty, that sentence made me a bit uneasy as it reads like this > > is implemented on top of pidfds because it makes it more likely to go > > upstream, not because it is the right design. To be clear, I'm not > > implying any sort of malicious intent on your part but I would suggest > > phrasing this a little better. :) > > > I thought about this whole thing some more, and here are some thoughts. > > First, I was nervous about two things. One was faulting in pages from > the wrong context. (When a normal page fault or KVM faults in a page, > the mm is loaded. (In the KVM case, the mm is sort of not loaded when > the actual fault happens, but the mm is loaded when the fault is > handled, I think. Maybe there are workqueues involved and I'm wrong.) > When a remote mapping faults in a page, the mm is *not* loaded.) > This ought not to be a problem, though -- get_user_pages_remote() also > faults in pages from a non-current mm, and that's at least supposed to > work correctly, so maybe this is okay. 
> > Second is recursion. I think this is a genuine problem. > > And I think that tying this to pidfds is the wrong approach. In fact, > tying it to processes at all seems wrong. There is a lot of demand > for various forms of memory isolation in which memory is mapped only > by its intended user. Using something tied to a process mm gets in > the way of this in the same way that KVM's current mapping model gets > in the way. > > All that being said, I think the whole idea of making fancy address > spaces composed from other mappable objects is neat and possibly quite > useful. And, if you squint a bit, this is a lot like what KVM does > today. > > So I suggest something that may be more generally useful as an > alternative. This is a sketch and very subject to bikeshedding: > > Create an empty address space: > > int address_space_create(int flags, etc); > > Map an fd into an address space: > > int address_space_mmap(int asfd, int fd_to_map, offset, size, prot, > ...); /* might run out of args here */ > > Unmap from an address space: > > int address_space_munmap(int asfd, unsigned long addr, unsigned long len); > > Stick an address space into KVM: > > ioctl(vmfd, KVM_MAP_ADDRESS_SPACE, asfd); /* or similar */ > > Maybe some day allow mapping an address space into a process. > > mmap(..., asfd, ...); > > > And at least for now, there's a rule that an address space that is > address_space_mmapped into an address space is disallowed. > > > Maybe some day we also allow mremap(), madvise(), etc. And maybe some > day we allow creating a special address_space that represents a real > process's address space. > > > Under the hood, an address_space could own an mm_struct that is not > used by any tasks. And we could have special memfds that are bound to > a VM such that all you can do with them is stick them into an > address_space and map that address_space into the VM in question. For > this to work, we would want a special vm_operation for mapping into a > VM. 
> > > What do you all think? Is this useful? Does it solve your problems? > Is it a good approach going forward? Hi Adalbert and Andy, As everyone continues to discuss how the mechanism should look, I want to share two use cases for something like this. Let me know if you would like more detail on these use cases. The requirement in both cases is that process A can map a virtual memory range from process B so that mmap/munmap operations within the memory range in process B also affect process A. An enforcing vIOMMU for vhost-user and vfio-user ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ vhost-user, vfio-user, and other out-of-process device emulation interfaces need a way for the virtual machine manager (VMM) to enforce the vIOMMU mappings on the device emulation process. The VMM emulates the vIOMMU and only wants to expose a subset of memory to the device emulation process. This subset can change as the guest programs the vIOMMU. Today the VMM passes all guest RAM fds to the device emulation process and has no way of restricting access or revoking it later. The new mechanism would allow the VMM to add/remove mappings so that the device emulation process can only access ranges of memory programmed by the guest vIOMMU. Accesses to unmapped addresses would raise a signal. Accelerating the virtio-fs DAX window ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The virtiofsd vhost-user process handles guest file map/unmap messages. The map/unmap messages allow the guest to map ranges of files into its memory space. The guest kernel then uses DAX to access the file pages without copying their contents into the guest page cache and mmap MAP_SHARED is coherent when guests access the same file. Today virtiofsd sends a message to the VMM over a UNIX domain socket asking for an mmap/munmap. The VMM must perform the mapping on behalf of virtiofsd. This communication and file descriptor passing is clumsy and slow. 
The new mechanism would allow virtiofsd to map/unmap without extra coordination with the VMM. The VMM only needs to perform an initial mmap of the DAX window so that kvm.ko can resolve page faults to that region. Stefan
On Fri, Sep 04, 2020 at 08:41:39PM +0100, Matthew Wilcox wrote: > On Fri, Sep 04, 2020 at 09:11:48AM -0300, Jason Gunthorpe wrote: > > On Fri, Sep 04, 2020 at 02:31:11PM +0300, Adalbert Lazăr wrote: > > > VMAs obtained by mmap()ing memory access fds mirror the contents of the remote > > > process address space within the specified range. Pages are installed in the > > > current process page tables at fault time and removed by the mmu_interval_notifier > > > invalidate callback. No further memory management is involved. > > > On attempts to access a hole, or if a mapping was removed by PIDFD_MEM_UNMAP, > > > or if the remote process address space was reaped by OOM, the remote mapping > > > fault handler returns VM_FAULT_SIGBUS. > > > > I still think anything along these lines needs to meet the XPMEM use > > cases as well, we have to have more general solutions for such MM > > stuff: > > > > https://gitlab.com/hjelmn/xpmem > > > > However, I think this fundamentally falls into some of the same bad > > direction as xpmem. > > > > I would much rather see this design copy & clone the VMA's than try to > > mirror the PTEs inside the VMAs from the remote into a single giant > > VMA and somehow split/mirror the VMA ops. > > I'm on holiday for the next few days, but does the mshare() API work for > your use case? > > Proposal: http://www.wil.cx/~willy/linux/sileby.html > Start at implementation: > http://git.infradead.org/users/willy/linux.git/shortlog/refs/heads/mshare I found some interest in this project here, with the detail that pin_user_pages() should keep working, ideally on both sides of the share, but essentially on the side that calls mshare(). Maybe we can help out, at least cc me if you make progress :) Thanks, Jason