[RFC,0/5] Remote mapping

Message ID 20200903174730.2685-1-alazar@bitdefender.com (mailing list archive)

Message

Adalbert Lazăr Sept. 3, 2020, 5:47 p.m. UTC
This patchset adds support for the remote mapping feature.
Remote mapping, as its name suggests, is a means for transparent and
zero-copy access of a remote process' address space.

The feature was designed according to a specification suggested by Paolo Bonzini:
>> The proposed API is a new pidfd system call, through which the parent
>> can map portions of its virtual address space into a file descriptor
>> and then pass that file descriptor to a child.
>>
>> This should be:
>>
>> - upstreamable, pidfd is the new cool thing and we could sell it as a
>> better way to do PTRACE_{PEEK,POKE}DATA
>>
>> - relatively easy to do based on the bitdefender remote process
>> mapping patches at.
>>
>> - pidfd_mem() takes a pidfd and some flags (which are 0) and returns
>> two file descriptors for respectively the control plane and the memory access.
>>
>> - the control plane accepts three ioctls
>>
>> PIDFD_MEM_MAP takes a struct like
>>
>>     struct pidfd_mem_map {
>>          uint64_t address;
>>          off_t offset;
>>          off_t size;
>>          int flags;
>>          int padding[7];
>>     }
>>
>> After this is done, the memory access fd can be mmap-ed at range
>> [offset, offset+size), and it will read memory from range
>> [address, address+size) of the target descriptor.
>>
>> PIDFD_MEM_UNMAP takes a struct like
>>
>>     struct pidfd_mem_unmap {
>>          off_t offset;
>>          off_t size;
>>     }
>>
>> and unmaps the corresponding range of course.
>>
>> Finally PIDFD_MEM_LOCK forbids subsequent PIDFD_MEM_MAP or
>> PIDFD_MEM_UNMAP.  For now I think it should just check that the
>> argument is zero, bells and whistles can be added later.
>>
>> - the memory access fd can be mmap-ed as in the bitdefender patches
>> but also accessed with read/write/pread/pwrite/...  As in the
>> BitDefender patches, MMU notifiers can be used to adjust any mmap-ed
>> regions when the source address space changes.  In this case,
>> PIDFD_MEM_UNMAP could also cause a pre-existing mmap to "disappear".
(The implementation currently doesn't support read/write/pread/pwrite/...)
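
To make the control-plane semantics above concrete, here is a small userspace
model of them. It is only a sketch under stated assumptions: the model_* names
and struct control_model are hypothetical and are not part of the patches, the
fields use uint64_t instead of off_t, unmap only removes an exactly matching
range, and overlapping ranges are not checked. It shows how an offset in the
memory access fd would translate to an address in the target, and how the
lock operation freezes the set of ranges.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one PIDFD_MEM_MAP request (not the kernel struct). */
struct mem_map_entry {
    uint64_t address;  /* start of the range in the target address space */
    uint64_t offset;   /* start of the range in the memory access fd */
    uint64_t size;     /* length of the range in bytes */
};

#define MODEL_MAX_ENTRIES 16

struct control_model {
    struct mem_map_entry entries[MODEL_MAX_ENTRIES];
    size_t count;
    bool locked;       /* set by the PIDFD_MEM_LOCK analogue */
};

/* PIDFD_MEM_MAP analogue: 0 on success, -1 if locked, full or empty range. */
int model_map(struct control_model *c, uint64_t address,
              uint64_t offset, uint64_t size)
{
    if (c->locked || c->count == MODEL_MAX_ENTRIES || size == 0)
        return -1;
    c->entries[c->count++] = (struct mem_map_entry){ address, offset, size };
    return 0;
}

/* PIDFD_MEM_UNMAP analogue: removes an exactly matching range only. */
int model_unmap(struct control_model *c, uint64_t offset, uint64_t size)
{
    if (c->locked)
        return -1;
    for (size_t i = 0; i < c->count; i++) {
        if (c->entries[i].offset == offset && c->entries[i].size == size) {
            c->entries[i] = c->entries[--c->count];
            return 0;
        }
    }
    return -1;
}

/* PIDFD_MEM_LOCK analogue: the argument must be zero, as proposed. */
int model_lock(struct control_model *c, unsigned long arg)
{
    if (arg != 0)
        return -1;
    c->locked = true;
    return 0;
}

/* Translate an access-fd offset to a target address; false means a hole. */
bool model_translate(const struct control_model *c, uint64_t file_off,
                     uint64_t *address)
{
    for (size_t i = 0; i < c->count; i++) {
        const struct mem_map_entry *e = &c->entries[i];
        if (file_off >= e->offset && file_off - e->offset < e->size) {
            *address = e->address + (file_off - e->offset);
            return true;
        }
    }
    return false;
}
```

For example, after model_map(&c, A, 0x1000, 0x2000), the offset 0x1800
translates to A + 0x800, while 0x3000 falls just past the range and is
reported as a hole.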

The main remote mapping patch also contains the legacy implementation which
creates a region the size of the whole process address space by means of the
REMOTE_PROC_MAP ioctl. The user is then free to mmap() any region of the
address space it wishes.

VMAs obtained by mmap()ing memory access fds mirror the contents of the remote
process address space within the specified range. Pages are installed in the
current process page tables at fault time and removed by the mmu_interval_notifier
invalidate callback. No further memory management is involved.
On attempts to access a hole, or if a mapping was removed by PIDFD_MEM_UNMAP,
or if the remote process address space was reaped by OOM, the remote mapping
fault handler returns VM_FAULT_SIGBUS.

At Bitdefender we are using remote mapping for virtual machine introspection:
- the QEMU running the introspected machine creates the pair of file descriptors,
passes the access fd to the introspector QEMU, and uses the control fd to allow
access to the memslots it creates for its machine
- the QEMU running the introspector machine receives the access fd and mmap()s
the regions made available, then hotplugs the obtained memory in its machine
This setup results in nested invalidate_range_start/end MMU notifier calls.
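
The text above does not say how the access fd travels between the two QEMU
processes. One common mechanism for handing a file descriptor to another
process is SCM_RIGHTS ancillary data over an AF_UNIX socket, so the sketch
below assumes that. The send_fd() and recv_fd() helpers are hypothetical
names and are not part of the patches or of QEMU.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Send fd over a connected AF_UNIX socket. Returns 0 on success, -1 on error. */
int send_fd(int sock, int fd)
{
    char byte = 0;  /* one byte of ordinary data carries the ancillary data */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;  /* forces correct alignment of buf */
    } u;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd sent by send_fd(). Returns the new fd, or -1 on error. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } u;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd = -1;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

In the setup described above, the introspected side would call send_fd() with
the access fd and keep the control fd to itself, and the introspector side
would mmap() the descriptor returned by recv_fd().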

Patch organization:
- patch 1 allows unmap_page_range() to run without rescheduling
  Needed for remote mapping to zap current process page tables when OOM calls
  mmu_notifier_invalidate_range_start_nonblock(&range)

- patch 2 creates VMA-specific zapping behavior
  A remote mapping VMA does not own the pages it maps, so all it has to do is
  clear the PTEs.

- patch 3 removes the MMU notifier lockdep map
  It was just incompatible with our use case.

- patch 4 is the remote mapping implementation

- patch 5 adds the suggested pidfd_mem system call

Mircea Cirjaliu (5):
  mm: add atomic capability to zap_details
  mm: let the VMA decide how zap_pte_range() acts on mapped pages
  mm/mmu_notifier: remove lockdep map, allow mmu notifier to be used in
    nested scenarios
  mm/remote_mapping: use a pidfd to access memory belonging to unrelated
    process
  pidfd_mem: implemented remote memory mapping system call

 arch/x86/entry/syscalls/syscall_32.tbl |    1 +
 arch/x86/entry/syscalls/syscall_64.tbl |    1 +
 include/linux/mm.h                     |   22 +
 include/linux/mmu_notifier.h           |    5 +-
 include/linux/pid.h                    |    1 +
 include/linux/remote_mapping.h         |   22 +
 include/linux/syscalls.h               |    1 +
 include/uapi/asm-generic/unistd.h      |    2 +
 include/uapi/linux/remote_mapping.h    |   36 +
 kernel/exit.c                          |    2 +-
 kernel/pid.c                           |   55 +
 mm/Kconfig                             |   11 +
 mm/Makefile                            |    1 +
 mm/memory.c                            |  193 ++--
 mm/mmu_notifier.c                      |   19 -
 mm/remote_mapping.c                    | 1273 ++++++++++++++++++++++++
 16 files changed, 1535 insertions(+), 110 deletions(-)
 create mode 100644 include/linux/remote_mapping.h
 create mode 100644 include/uapi/linux/remote_mapping.h
 create mode 100644 mm/remote_mapping.c


CC: Christian Brauner <christian@brauner.io>
base-commit: ae83d0b416db002fe95601e7f97f64b59514d936

Comments

Adalbert Lazăr Sept. 3, 2020, 6:08 p.m. UTC | #1
CC+= Mihai, Mircea

Christian Brauner Sept. 4, 2020, 9:54 a.m. UTC | #2
On Thu, Sep 03, 2020 at 08:47:25PM +0300, Adalbert Lazăr wrote:
> This patchset adds support for the remote mapping feature.
> Remote mapping, as its name suggests, is a means for transparent and
> zero-copy access of a remote process' address space.

Hey Adalbert,

Thanks for the patch. When you resend this patch series, could you
please make sure that everyone Cced on any individual patch receives the
full patch series? I only got patch 5/5 and it's a bit annoying because
one completely lacks context of what's going on. I first thought "did
someone just add a syscall with 3 lines of commit message?". :)

Could you please resend the patch series with linux-api, me and the
following people Cced:

Andy Lutomirski <luto@kernel.org>
Arnd Bergmann <arnd@arndb.de>
Sargun Dhillon <sargun@sargun.me>
Aleksa Sarai <cyphar@cyphar.com>
Oleg Nesterov <oleg@redhat.com>
Jann Horn <jannh@google.com>
Kees Cook <keescook@chromium.org>
Matthew Wilcox <willy@infradead.org>
linux-api@vger.kernel.org

Christian
Adalbert Lazăr Sept. 4, 2020, 11:34 a.m. UTC | #3
On Fri, 4 Sep 2020 11:54:38 +0200, Christian Brauner <christian.brauner@ubuntu.com> wrote:
> On Thu, Sep 03, 2020 at 08:47:25PM +0300, Adalbert Lazăr wrote:
> > This patchset adds support for the remote mapping feature.
> > Remote mapping, as its name suggests, is a means for transparent and
> > zero-copy access of a remote process' address space.
> 
> Hey Adalbert,
> 
> Thanks for the patch. When you resend this patch series, could you
> please make sure that everyone Cced on any individual patch receives the
> full patch series? I only got patch 5/5 and it's a bit annoying because
> one completely lacks context of what's going on. I first thought "did
> someone just add a syscall with 3 lines of commit message?". :)
> 
> Could you please resend the patch series with linux-api, me and the
> following people Cced:

Done :D
Thank you, Christian