mbox series

[v4,0/5] userfaultfd: add /dev/userfaultfd for fine grained access control

Message ID 20220719195628.3415852-1-axelrasmussen@google.com (mailing list archive)
Headers show
Series userfaultfd: add /dev/userfaultfd for fine grained access control | expand

Message

Axel Rasmussen July 19, 2022, 7:56 p.m. UTC
This series is based on torvalds/master.

The series is split up like so:
- Patch 1 is a simple fixup which we should take in any case (even by itself).
- Patches 2-6 add the feature, configurable selftest support, and docs.

Why not ...?
============

- Why not /proc/[pid]/userfaultfd? The proposed use case for this is for one
  process to open a userfaultfd which can intercept another process' page
  faults. This seems to me like exactly what CAP_SYS_PTRACE is for, though, so I
  think this use case can simply use a syscall without the powers CAP_SYS_PTRACE
  grants being "too much".

- Why not use a syscall? Access to syscalls is generally controlled by
  capabilities. We don't have a capability which is used for userfaultfd access
  without also granting more / other permissions as well, and adding a new
  capability was rejected [1].

    - It's possible a LSM could be used to control access instead. I suspect
      adding a brand new one just for this would be rejected, but I think some
      existing ones like SELinux can be used to filter syscall access. Enabling
      SELinux for large production deployments which don't already use it is
      likely to be a huge undertaking though, and I don't think this use case by
      itself is enough to motivate that kind of architectural change.

Changelog
=========

v3->v4:
  - Picked up an Acked-by on 5/5.
  - Updated cover letter to cover "why not ...".
  - Refactored userfaultfd_allowed() into userfaultfd_syscall_allowed(). [Peter]
  - Removed obsolete comment from a previous version. [Peter]
  - Refactored userfaultfd_open() in selftest. [Peter]
  - Reworded admin-guide documentation. [Mike, Peter]
  - Squashed 2 commits adding /dev/userfaultfd to selftest and making selftest
    configurable. [Peter]
  - Added "syscall" test modifier (the default behavior) to selftest. [Peter]

v2->v3:
  - Rebased onto linux-next/akpm-base, in order to be based on top of the
    run_vmtests.sh refactor which was merged previously.
  - Picked up some Reviewed-by's.
  - Fixed ioctl definition (_IO instead of _IOWR), and stopped using
    compat_ptr_ioctl since it is unneeded for ioctls which don't take a pointer.
  - Removed the "handle_kernel_faults" bool, simplifying the code. The result is
    logically equivalent, but simpler.
  - Fixed userfaultfd selftest so it returns KSFT_SKIP appropriately.
  - Reworded documentation per Shuah's feedback on v2.
  - Improved example usage for userfaultfd selftest.

v1->v2:
  - Add documentation update.
  - Test *both* userfaultfd(2) and /dev/userfaultfd via the selftest.

[1]: https://lore.kernel.org/lkml/686276b9-4530-2045-6bd8-170e5943abe4@schaufler-ca.com/T/

Axel Rasmussen (5):
  selftests: vm: add hugetlb_shared userfaultfd test to run_vmtests.sh
  userfaultfd: add /dev/userfaultfd for fine grained access control
  userfaultfd: selftests: modify selftest to use /dev/userfaultfd
  userfaultfd: update documentation to describe /dev/userfaultfd
  selftests: vm: add /dev/userfaultfd test cases to run_vmtests.sh

 Documentation/admin-guide/mm/userfaultfd.rst | 41 +++++++++++-
 Documentation/admin-guide/sysctl/vm.rst      |  3 +
 fs/userfaultfd.c                             | 69 ++++++++++++++++----
 include/uapi/linux/userfaultfd.h             |  4 ++
 tools/testing/selftests/vm/run_vmtests.sh    | 11 +++-
 tools/testing/selftests/vm/userfaultfd.c     | 69 +++++++++++++++++---
 6 files changed, 169 insertions(+), 28 deletions(-)

--
2.37.0.170.g444d1eabd0-goog

Comments

Schaufler, Casey July 20, 2022, 10:16 p.m. UTC | #1
> -----Original Message-----
> From: Axel Rasmussen <axelrasmussen@google.com>
> Sent: Tuesday, July 19, 2022 12:56 PM
> To: Alexander Viro <viro@zeniv.linux.org.uk>; Andrew Morton
> <akpm@linux-foundation.org>; Dave Hansen
> <dave.hansen@linux.intel.com>; Dmitry V . Levin <ldv@altlinux.org>; Gleb
> Fotengauer-Malinovskiy <glebfm@altlinux.org>; Hugh Dickins
> <hughd@google.com>; Jan Kara <jack@suse.cz>; Jonathan Corbet
> <corbet@lwn.net>; Mel Gorman <mgorman@techsingularity.net>; Mike
> Kravetz <mike.kravetz@oracle.com>; Mike Rapoport <rppt@kernel.org>;
> Amit, Nadav <namit@vmware.com>; Peter Xu <peterx@redhat.com>;
> Shuah Khan <shuah@kernel.org>; Suren Baghdasaryan
> <surenb@google.com>; Vlastimil Babka <vbabka@suse.cz>; zhangyi
> <yi.zhang@huawei.com>
> Cc: Axel Rasmussen <axelrasmussen@google.com>; linux-
> doc@vger.kernel.org; linux-fsdevel@vger.kernel.org; linux-
> kernel@vger.kernel.org; linux-mm@kvack.org; linux-
> kselftest@vger.kernel.org
> Subject: [PATCH v4 0/5] userfaultfd: add /dev/userfaultfd for fine grained
> access control

I assume that leaving the LSM mailing list off of the CC is purely
accidental. Please, please include us in the next round.

> 
> This series is based on torvalds/master.
> 
> The series is split up like so:
> - Patch 1 is a simple fixup which we should take in any case (even by itself).
> - Patches 2-6 add the feature, configurable selftest support, and docs.
> 
> Why not ...?
> ============
> 
> - Why not /proc/[pid]/userfaultfd? The proposed use case for this is for one
>   process to open a userfaultfd which can intercept another process' page
>   faults. This seems to me like exactly what CAP_SYS_PTRACE is for, though,
> so I
>   think this use case can simply use a syscall without the powers
> CAP_SYS_PTRACE
>   grants being "too much".
> 
> - Why not use a syscall? Access to syscalls is generally controlled by
>   capabilities. We don't have a capability which is used for userfaultfd access
>   without also granting more / other permissions as well, and adding a new
>   capability was rejected [1].
> 
>     - It's possible a LSM could be used to control access instead. I suspect
>       adding a brand new one just for this would be rejected,

You won't know if you don't ask.

>       but I think some
>       existing ones like SELinux can be used to filter syscall access. Enabling
>       SELinux for large production deployments which don't already use it is
>       likely to be a huge undertaking though, and I don't think this use case by
>       itself is enough to motivate that kind of architectural change.
> 
> Changelog
> =========
> 
> v3->v4:
>   - Picked up an Acked-by on 5/5.
>   - Updated cover letter to cover "why not ...".
>   - Refactored userfaultfd_allowed() into userfaultfd_syscall_allowed().
> [Peter]
>   - Removed obsolete comment from a previous version. [Peter]
>   - Refactored userfaultfd_open() in selftest. [Peter]
>   - Reworded admin-guide documentation. [Mike, Peter]
>   - Squashed 2 commits adding /dev/userfaultfd to selftest and making
> selftest
>     configurable. [Peter]
>   - Added "syscall" test modifier (the default behavior) to selftest. [Peter]
> 
> v2->v3:
>   - Rebased onto linux-next/akpm-base, in order to be based on top of the
>     run_vmtests.sh refactor which was merged previously.
>   - Picked up some Reviewed-by's.
>   - Fixed ioctl definition (_IO instead of _IOWR), and stopped using
>     compat_ptr_ioctl since it is unneeded for ioctls which don't take a pointer.
>   - Removed the "handle_kernel_faults" bool, simplifying the code. The result
> is
>     logically equivalent, but simpler.
>   - Fixed userfaultfd selftest so it returns KSFT_SKIP appropriately.
>   - Reworded documentation per Shuah's feedback on v2.
>   - Improved example usage for userfaultfd selftest.
> 
> v1->v2:
>   - Add documentation update.
>   - Test *both* userfaultfd(2) and /dev/userfaultfd via the selftest.
> 
> [1]: https://lore.kernel.org/lkml/686276b9-4530-2045-6bd8-
> 170e5943abe4@schaufler-ca.com/T/
> 
> Axel Rasmussen (5):
>   selftests: vm: add hugetlb_shared userfaultfd test to run_vmtests.sh
>   userfaultfd: add /dev/userfaultfd for fine grained access control
>   userfaultfd: selftests: modify selftest to use /dev/userfaultfd
>   userfaultfd: update documentation to describe /dev/userfaultfd
>   selftests: vm: add /dev/userfaultfd test cases to run_vmtests.sh
> 
>  Documentation/admin-guide/mm/userfaultfd.rst | 41 +++++++++++-
>  Documentation/admin-guide/sysctl/vm.rst      |  3 +
>  fs/userfaultfd.c                             | 69 ++++++++++++++++----
>  include/uapi/linux/userfaultfd.h             |  4 ++
>  tools/testing/selftests/vm/run_vmtests.sh    | 11 +++-
>  tools/testing/selftests/vm/userfaultfd.c     | 69 +++++++++++++++++---
>  6 files changed, 169 insertions(+), 28 deletions(-)
> 
> --
> 2.37.0.170.g444d1eabd0-goog
Axel Rasmussen July 20, 2022, 11:04 p.m. UTC | #2
On Wed, Jul 20, 2022 at 3:16 PM Schaufler, Casey
<casey.schaufler@intel.com> wrote:
>
> > -----Original Message-----
> > From: Axel Rasmussen <axelrasmussen@google.com>
> > Sent: Tuesday, July 19, 2022 12:56 PM
> > To: Alexander Viro <viro@zeniv.linux.org.uk>; Andrew Morton
> > <akpm@linux-foundation.org>; Dave Hansen
> > <dave.hansen@linux.intel.com>; Dmitry V . Levin <ldv@altlinux.org>; Gleb
> > Fotengauer-Malinovskiy <glebfm@altlinux.org>; Hugh Dickins
> > <hughd@google.com>; Jan Kara <jack@suse.cz>; Jonathan Corbet
> > <corbet@lwn.net>; Mel Gorman <mgorman@techsingularity.net>; Mike
> > Kravetz <mike.kravetz@oracle.com>; Mike Rapoport <rppt@kernel.org>;
> > Amit, Nadav <namit@vmware.com>; Peter Xu <peterx@redhat.com>;
> > Shuah Khan <shuah@kernel.org>; Suren Baghdasaryan
> > <surenb@google.com>; Vlastimil Babka <vbabka@suse.cz>; zhangyi
> > <yi.zhang@huawei.com>
> > Cc: Axel Rasmussen <axelrasmussen@google.com>; linux-
> > doc@vger.kernel.org; linux-fsdevel@vger.kernel.org; linux-
> > kernel@vger.kernel.org; linux-mm@kvack.org; linux-
> > kselftest@vger.kernel.org
> > Subject: [PATCH v4 0/5] userfaultfd: add /dev/userfaultfd for fine grained
> > access control
>
> I assume that leaving the LSM mailing list off of the CC is purely
> accidental. Please, please include us in the next round.

Honestly it just hadn't occurred to me, but I'm more than happy to CC
it on future revisions.

>
> >
> > This series is based on torvalds/master.
> >
> > The series is split up like so:
> > - Patch 1 is a simple fixup which we should take in any case (even by itself).
> > - Patches 2-6 add the feature, configurable selftest support, and docs.
> >
> > Why not ...?
> > ============
> >
> > - Why not /proc/[pid]/userfaultfd? The proposed use case for this is for one
> >   process to open a userfaultfd which can intercept another process' page
> >   faults. This seems to me like exactly what CAP_SYS_PTRACE is for, though,
> > so I
> >   think this use case can simply use a syscall without the powers
> > CAP_SYS_PTRACE
> >   grants being "too much".
> >
> > - Why not use a syscall? Access to syscalls is generally controlled by
> >   capabilities. We don't have a capability which is used for userfaultfd access
> >   without also granting more / other permissions as well, and adding a new
> >   capability was rejected [1].
> >
> >     - It's possible a LSM could be used to control access instead. I suspect
> >       adding a brand new one just for this would be rejected,
>
> You won't know if you don't ask.

Fair enough - I wonder if MM folks (Andrew, Peter, Nadav especially)
would find that approach more palatable than /proc/[pid]/userfaultfd?
Would it make sense from our perspective to propose a userfaultfd- or
MM-specific LSM for controlling access to certain features?

I remember +Andrea saying Red Hat was also interested in some kind of
access control mechanism like this. Would one or the other approach be
more convenient for you?

>
> >       but I think some
> >       existing ones like SELinux can be used to filter syscall access. Enabling
> >       SELinux for large production deployments which don't already use it is
> >       likely to be a huge undertaking though, and I don't think this use case by
> >       itself is enough to motivate that kind of architectural change.
> >
> > Changelog
> > =========
> >
> > v3->v4:
> >   - Picked up an Acked-by on 5/5.
> >   - Updated cover letter to cover "why not ...".
> >   - Refactored userfaultfd_allowed() into userfaultfd_syscall_allowed().
> > [Peter]
> >   - Removed obsolete comment from a previous version. [Peter]
> >   - Refactored userfaultfd_open() in selftest. [Peter]
> >   - Reworded admin-guide documentation. [Mike, Peter]
> >   - Squashed 2 commits adding /dev/userfaultfd to selftest and making
> > selftest
> >     configurable. [Peter]
> >   - Added "syscall" test modifier (the default behavior) to selftest. [Peter]
> >
> > v2->v3:
> >   - Rebased onto linux-next/akpm-base, in order to be based on top of the
> >     run_vmtests.sh refactor which was merged previously.
> >   - Picked up some Reviewed-by's.
> >   - Fixed ioctl definition (_IO instead of _IOWR), and stopped using
> >     compat_ptr_ioctl since it is unneeded for ioctls which don't take a pointer.
> >   - Removed the "handle_kernel_faults" bool, simplifying the code. The result
> > is
> >     logically equivalent, but simpler.
> >   - Fixed userfaultfd selftest so it returns KSFT_SKIP appropriately.
> >   - Reworded documentation per Shuah's feedback on v2.
> >   - Improved example usage for userfaultfd selftest.
> >
> > v1->v2:
> >   - Add documentation update.
> >   - Test *both* userfaultfd(2) and /dev/userfaultfd via the selftest.
> >
> > [1]: https://lore.kernel.org/lkml/686276b9-4530-2045-6bd8-
> > 170e5943abe4@schaufler-ca.com/T/
> >
> > Axel Rasmussen (5):
> >   selftests: vm: add hugetlb_shared userfaultfd test to run_vmtests.sh
> >   userfaultfd: add /dev/userfaultfd for fine grained access control
> >   userfaultfd: selftests: modify selftest to use /dev/userfaultfd
> >   userfaultfd: update documentation to describe /dev/userfaultfd
> >   selftests: vm: add /dev/userfaultfd test cases to run_vmtests.sh
> >
> >  Documentation/admin-guide/mm/userfaultfd.rst | 41 +++++++++++-
> >  Documentation/admin-guide/sysctl/vm.rst      |  3 +
> >  fs/userfaultfd.c                             | 69 ++++++++++++++++----
> >  include/uapi/linux/userfaultfd.h             |  4 ++
> >  tools/testing/selftests/vm/run_vmtests.sh    | 11 +++-
> >  tools/testing/selftests/vm/userfaultfd.c     | 69 +++++++++++++++++---
> >  6 files changed, 169 insertions(+), 28 deletions(-)
> >
> > --
> > 2.37.0.170.g444d1eabd0-goog
>
Nadav Amit July 20, 2022, 11:21 p.m. UTC | #3
On Jul 20, 2022, at 4:04 PM, Axel Rasmussen <axelrasmussen@google.com> wrote:

> ⚠ External Email
> 
> On Wed, Jul 20, 2022 at 3:16 PM Schaufler, Casey
> <casey.schaufler@intel.com> wrote:
>>> -----Original Message-----
>>> From: Axel Rasmussen <axelrasmussen@google.com>
>>> Sent: Tuesday, July 19, 2022 12:56 PM
>>> To: Alexander Viro <viro@zeniv.linux.org.uk>; Andrew Morton
>>> <akpm@linux-foundation.org>; Dave Hansen
>>> <dave.hansen@linux.intel.com>; Dmitry V . Levin <ldv@altlinux.org>; Gleb
>>> Fotengauer-Malinovskiy <glebfm@altlinux.org>; Hugh Dickins
>>> <hughd@google.com>; Jan Kara <jack@suse.cz>; Jonathan Corbet
>>> <corbet@lwn.net>; Mel Gorman <mgorman@techsingularity.net>; Mike
>>> Kravetz <mike.kravetz@oracle.com>; Mike Rapoport <rppt@kernel.org>;
>>> Amit, Nadav <namit@vmware.com>; Peter Xu <peterx@redhat.com>;
>>> Shuah Khan <shuah@kernel.org>; Suren Baghdasaryan
>>> <surenb@google.com>; Vlastimil Babka <vbabka@suse.cz>; zhangyi
>>> <yi.zhang@huawei.com>
>>> Cc: Axel Rasmussen <axelrasmussen@google.com>; linux-
>>> doc@vger.kernel.org; linux-fsdevel@vger.kernel.org; linux-
>>> kernel@vger.kernel.org; linux-mm@kvack.org; linux-
>>> kselftest@vger.kernel.org
>>> Subject: [PATCH v4 0/5] userfaultfd: add /dev/userfaultfd for fine grained
>>> access control
>> 
>> I assume that leaving the LSM mailing list off of the CC is purely
>> accidental. Please, please include us in the next round.
> 
> Honestly it just hadn't occurred to me, but I'm more than happy to CC
> it on future revisions.
> 
>>> This series is based on torvalds/master.
>>> 
>>> The series is split up like so:
>>> - Patch 1 is a simple fixup which we should take in any case (even by itself).
>>> - Patches 2-6 add the feature, configurable selftest support, and docs.
>>> 
>>> Why not ...?
>>> ============
>>> 
>>> - Why not /proc/[pid]/userfaultfd? The proposed use case for this is for one
>>> process to open a userfaultfd which can intercept another process' page
>>> faults. This seems to me like exactly what CAP_SYS_PTRACE is for, though,
>>> so I
>>> think this use case can simply use a syscall without the powers
>>> CAP_SYS_PTRACE
>>> grants being "too much".
>>> 
>>> - Why not use a syscall? Access to syscalls is generally controlled by
>>> capabilities. We don't have a capability which is used for userfaultfd access
>>> without also granting more / other permissions as well, and adding a new
>>> capability was rejected [1].
>>> 
>>> - It's possible a LSM could be used to control access instead. I suspect
>>> adding a brand new one just for this would be rejected,
>> 
>> You won't know if you don't ask.
> 
> Fair enough - I wonder if MM folks (Andrew, Peter, Nadav especially)
> would find that approach more palatable than /proc/[pid]/userfaultfd?
> Would it make sense from our perspective to propose a userfaultfd- or
> MM-specific LSM for controlling access to certain features?
> 
> I remember +Andrea saying Red Hat was also interested in some kind of
> access control mechanism like this. Would one or the other approach be
> more convenient for you?

To reiterate my position - I think that /proc/[pid]/userfaultfd is very
natural and can be easily extended to support cross-process access of
userfaultfd. The necessary access controls are simple in any case. For
cross-process access, they are similar to those that are used for other
/proc/[pid]/X such as pagemap.

I have little experience with LSM and I do not know how real deployments use
them. If they are easier to deploy and people prefer them over some
pseudo-file, I cannot argue against them.
Axel Rasmussen Aug. 1, 2022, 5:13 p.m. UTC | #4
I finished up some other work and got around to writing a v5 today,
but I ran into a problem with /proc/[pid]/userfaultfd.

Files in /proc/[pid]/* are owned by the user/group which started the
process, and they don't support being chmod'ed.

For the userfaultfd device, I think we want the following semantics:
- For UFFDs created via the device, we want to always allow handling
kernel mode faults
- For security, the device should be owned by root:root by default, so
unprivileged users don't have default access to handle kernel faults
- But, the system administrator should be able to chown/chmod it, to
grant access to handling kernel faults for this process more widely.

It could be made to work like that but I think it would involve at least:

- Special casing userfaultfd in proc_pid_make_inode
- Updating setattr/getattr for /proc/[pid] to meaningfully store and
then retrieve uid/gid different from the task's, again probably
special cased for userfautlfd since we don't want this behavior for
other files

It seems to me such a change might raise eyebrows among procfs folks.
Before I spend the time to write this up, does this seem like
something that would obviously be nack'ed?

On Wed, Jul 20, 2022 at 4:21 PM Nadav Amit <namit@vmware.com> wrote:
>
> On Jul 20, 2022, at 4:04 PM, Axel Rasmussen <axelrasmussen@google.com> wrote:
>
> > ⚠ External Email
> >
> > On Wed, Jul 20, 2022 at 3:16 PM Schaufler, Casey
> > <casey.schaufler@intel.com> wrote:
> >>> -----Original Message-----
> >>> From: Axel Rasmussen <axelrasmussen@google.com>
> >>> Sent: Tuesday, July 19, 2022 12:56 PM
> >>> To: Alexander Viro <viro@zeniv.linux.org.uk>; Andrew Morton
> >>> <akpm@linux-foundation.org>; Dave Hansen
> >>> <dave.hansen@linux.intel.com>; Dmitry V . Levin <ldv@altlinux.org>; Gleb
> >>> Fotengauer-Malinovskiy <glebfm@altlinux.org>; Hugh Dickins
> >>> <hughd@google.com>; Jan Kara <jack@suse.cz>; Jonathan Corbet
> >>> <corbet@lwn.net>; Mel Gorman <mgorman@techsingularity.net>; Mike
> >>> Kravetz <mike.kravetz@oracle.com>; Mike Rapoport <rppt@kernel.org>;
> >>> Amit, Nadav <namit@vmware.com>; Peter Xu <peterx@redhat.com>;
> >>> Shuah Khan <shuah@kernel.org>; Suren Baghdasaryan
> >>> <surenb@google.com>; Vlastimil Babka <vbabka@suse.cz>; zhangyi
> >>> <yi.zhang@huawei.com>
> >>> Cc: Axel Rasmussen <axelrasmussen@google.com>; linux-
> >>> doc@vger.kernel.org; linux-fsdevel@vger.kernel.org; linux-
> >>> kernel@vger.kernel.org; linux-mm@kvack.org; linux-
> >>> kselftest@vger.kernel.org
> >>> Subject: [PATCH v4 0/5] userfaultfd: add /dev/userfaultfd for fine grained
> >>> access control
> >>
> >> I assume that leaving the LSM mailing list off of the CC is purely
> >> accidental. Please, please include us in the next round.
> >
> > Honestly it just hadn't occurred to me, but I'm more than happy to CC
> > it on future revisions.
> >
> >>> This series is based on torvalds/master.
> >>>
> >>> The series is split up like so:
> >>> - Patch 1 is a simple fixup which we should take in any case (even by itself).
> >>> - Patches 2-6 add the feature, configurable selftest support, and docs.
> >>>
> >>> Why not ...?
> >>> ============
> >>>
> >>> - Why not /proc/[pid]/userfaultfd? The proposed use case for this is for one
> >>> process to open a userfaultfd which can intercept another process' page
> >>> faults. This seems to me like exactly what CAP_SYS_PTRACE is for, though,
> >>> so I
> >>> think this use case can simply use a syscall without the powers
> >>> CAP_SYS_PTRACE
> >>> grants being "too much".
> >>>
> >>> - Why not use a syscall? Access to syscalls is generally controlled by
> >>> capabilities. We don't have a capability which is used for userfaultfd access
> >>> without also granting more / other permissions as well, and adding a new
> >>> capability was rejected [1].
> >>>
> >>> - It's possible a LSM could be used to control access instead. I suspect
> >>> adding a brand new one just for this would be rejected,
> >>
> >> You won't know if you don't ask.
> >
> > Fair enough - I wonder if MM folks (Andrew, Peter, Nadav especially)
> > would find that approach more palatable than /proc/[pid]/userfaultfd?
> > Would it make sense from our perspective to propose a userfaultfd- or
> > MM-specific LSM for controlling access to certain features?
> >
> > I remember +Andrea saying Red Hat was also interested in some kind of
> > access control mechanism like this. Would one or the other approach be
> > more convenient for you?
>
> To reiterate my position - I think that /proc/[pid]/userfaultfd is very
> natural and can be easily extended to support cross-process access of
> userfaultfd. The necessary access controls are simple in any case. For
> cross-process access, they are similar to those that are used for other
> /proc/[pid]/X such as pagemap.
>
> I have little experience with LSM and I do not know how real deployments use
> them. If they are easier to deploy and people prefer them over some
> pseudo-file, I cannot argue against them.
>
>
Nadav Amit Aug. 1, 2022, 7:53 p.m. UTC | #5
On Aug 1, 2022, at 10:13 AM, Axel Rasmussen <axelrasmussen@google.com> wrote:

> ⚠ External Email
> 
> I finished up some other work and got around to writing a v5 today,
> but I ran into a problem with /proc/[pid]/userfaultfd.
> 
> Files in /proc/[pid]/* are owned by the user/group which started the
> process, and they don't support being chmod'ed.
> 
> For the userfaultfd device, I think we want the following semantics:
> - For UFFDs created via the device, we want to always allow handling
> kernel mode faults
> - For security, the device should be owned by root:root by default, so
> unprivileged users don't have default access to handle kernel faults
> - But, the system administrator should be able to chown/chmod it, to
> grant access to handling kernel faults for this process more widely.
> 
> It could be made to work like that but I think it would involve at least:
> 
> - Special casing userfaultfd in proc_pid_make_inode
> - Updating setattr/getattr for /proc/[pid] to meaningfully store and
> then retrieve uid/gid different from the task's, again probably
> special cased for userfautlfd since we don't want this behavior for
> other files
> 
> It seems to me such a change might raise eyebrows among procfs folks.
> Before I spend the time to write this up, does this seem like
> something that would obviously be nack'ed?

[ Please avoid top-posting in the future ]

I have no interest in making your life harder than it should be. If you
cannot find a suitable alternative, I will not fight against it.

How about this alternative: how about following KVM usage-model?

IOW: You open /dev/userfaultfd, but this is not the file-descriptor that you
use for most operations. Instead you first issue an ioctl - similarly to
KVM_CREATE_VM - to get a file-descriptor for your specific process. You then
use this new file-descriptor to perform your operations (read/ioctl/etc).

This would make the fact that ioctls/reads from different processes refer to
different contexts (i.e., file-descriptors) much more natural.

Does it sound better?
Axel Rasmussen Aug. 1, 2022, 10:50 p.m. UTC | #6
On Mon, Aug 1, 2022 at 12:53 PM Nadav Amit <namit@vmware.com> wrote:
>
> On Aug 1, 2022, at 10:13 AM, Axel Rasmussen <axelrasmussen@google.com> wrote:
>
> > ⚠ External Email
> >
> > I finished up some other work and got around to writing a v5 today,
> > but I ran into a problem with /proc/[pid]/userfaultfd.
> >
> > Files in /proc/[pid]/* are owned by the user/group which started the
> > process, and they don't support being chmod'ed.
> >
> > For the userfaultfd device, I think we want the following semantics:
> > - For UFFDs created via the device, we want to always allow handling
> > kernel mode faults
> > - For security, the device should be owned by root:root by default, so
> > unprivileged users don't have default access to handle kernel faults
> > - But, the system administrator should be able to chown/chmod it, to
> > grant access to handling kernel faults for this process more widely.
> >
> > It could be made to work like that but I think it would involve at least:
> >
> > - Special casing userfaultfd in proc_pid_make_inode
> > - Updating setattr/getattr for /proc/[pid] to meaningfully store and
> > then retrieve uid/gid different from the task's, again probably
> > special cased for userfautlfd since we don't want this behavior for
> > other files
> >
> > It seems to me such a change might raise eyebrows among procfs folks.
> > Before I spend the time to write this up, does this seem like
> > something that would obviously be nack'ed?
>
> [ Please avoid top-posting in the future ]

I will remember this. Gmail's default behavior is annoying. :/

>
> I have no interest in making your life harder than it should be. If you
> cannot find a suitable alternative, I will not fight against it.
>
> How about this alternative: how about following KVM usage-model?
>
> IOW: You open /dev/userfaultfd, but this is not the file-descriptor that you
> use for most operations. Instead you first issue an ioctl - similarly to
> KVM_CREATE_VM - to get a file-descriptor for your specific process. You then
> use this new file-descriptor to perform your operations (read/ioctl/etc).
>
> This would make the fact that ioctls/reads from different processes refer to
> different contexts (i.e., file-descriptors) much more natural.
>
> Does it sound better?

Ah, that I think is more or less what my series already proposes, if I
understand you correctly.

The usage is:

fd = open(/dev/userfaultfd) /* This FD is only useful for creating new
userfaultfds */
uffd = ioctl(fd, USERFAULTFD_IOC_NEW) /* Now you get a real uffd */
close(fd); /* No longer needed now that we have a real uffd */

/* Use uffd to register, COPY, CONTINUE, whatever */

One thing we could do now or in the future is extend
USERFAULTFD_IOC_NEW to take a pid as an argument, to support creating
uffds for remote processes.



And then we get the benefit of permissions for /dev nodes working very
naturally - they default to root, but can be configured by the
sysadmin via chown/chmod, or udev rules, or whatever.
Nadav Amit Aug. 1, 2022, 11:19 p.m. UTC | #7
On Aug 1, 2022, at 3:50 PM, Axel Rasmussen <axelrasmussen@google.com> wrote:

> ⚠ External Email
> 
> On Mon, Aug 1, 2022 at 12:53 PM Nadav Amit <namit@vmware.com> wrote:
>> On Aug 1, 2022, at 10:13 AM, Axel Rasmussen <axelrasmussen@google.com> wrote:
> 
> Ah, that I think is more or less what my series already proposes, if I
> understand you correctly.
> 
> The usage is:
> 
> fd = open(/dev/userfaultfd) /* This FD is only useful for creating new
> userfaultfds */
> uffd = ioctl(fd, USERFAULTFD_IOC_NEW) /* Now you get a real uffd */
> close(fd); /* No longer needed now that we have a real uffd */
> 
> /* Use uffd to register, COPY, CONTINUE, whatever */
> 
> One thing we could do now or in the future is extend
> USERFAULTFD_IOC_NEW to take a pid as an argument, to support creating
> uffds for remote processes.
> 
> 
> 
> And then we get the benefit of permissions for /dev nodes working very
> naturally - they default to root, but can be configured by the
> sysadmin via chown/chmod, or udev rules, or whatever.

Oh. Stupid me. Then yes, using the /dev/userfaultfd is in line with other
usage models, such as KVM. And reading from each file descriptor is indeed
providing different output.