[RFC,00/13] Mount, FS, Block and Keyrings notifications [ver #4]

Message ID	155991702981.15579.6007568669839441045.stgit@warthog.procyon.org.uk (mailing list archive)
Headers	show Return-Path: <linux-security-module-owner@kernel.org> Subject: [RFC][PATCH 00/13] Mount, FS, Block and Keyrings notifications [ver #4] From: David Howells <dhowells@redhat.com> To: viro@zeniv.linux.org.uk Cc: linux-usb@vger.kernel.org, linux-security-module@vger.kernel.org, Casey Schaufler <casey@schaufler-ca.com>, Stephen Smalley <sds@tycho.nsa.gov>, Greg Kroah-Hartman <gregkh@linuxfoundation.org>, dhowells@redhat.com, raven@themaw.net, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, linux-block@vger.kernel.org, keyrings@vger.kernel.org, linux-security-module@vger.kernel.org, linux-kernel@vger.kernel.org Date: Fri, 07 Jun 2019 15:17:10 +0100 Message-ID: <155991702981.15579.6007568669839441045.stgit@warthog.procyon.org.uk> User-Agent: StGit/unknown-version MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Sender: owner-linux-security-module@vger.kernel.org Precedence: bulk
Series	Mount, FS, Block and Keyrings notifications [ver #4] \| expand [RFC,00/13] Mount, FS, Block and Keyrings notifications [ver #4] [01/13] security: Override creds in __fput() with last fputter's creds [ver #4] [02/13] uapi: General notification ring definitions [ver #4] [03/13] security: Add hooks to rule on setting a watch [ver #4] [04/13] security: Add a hook for the point of notification insertion [ver #4] [05/13] General notification queue with user mmap()'able ring buffer [ver #4] [06/13] keys: Add a notification facility [ver #4] [07/13] vfs: Add a mount-notification facility [ver #4] [08/13] vfs: Add superblock notifications [ver #4] [09/13] fsinfo: Export superblock notification counter [ver #4] [10/13] Add a general, global device notification watch list [ver #4] [11/13] block: Add block layer notifications [ver #4] [12/13] usb: Add USB subsystem notifications [ver #4] [13/13] Add sample notification program [ver #4]

David Howells June 7, 2019, 2:17 p.m. UTC

Hi Al,

Here's a set of patches to add a general variable-length notification queue
concept and to add sources of events for:

 (1) Mount topology events, such as mounting, unmounting, mount expiry,
     mount reconfiguration.

 (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O
     errors (not complete yet).

 (3) Key/keyring events, such as creating, linking and removal of keys.

 (4) General device events (single common queue) including:

     - Block layer events, such as device errors

     - USB subsystem events, such as device/bus attach/remove, device
       reset, device errors.

One of the reasons for this is so that we can remove the issue of processes
having to repeatedly and regularly scan /proc/mounts, which has proven to
be a system performance problem.  To further aid this, the fsinfo() syscall
on which this patch series depends, provides a way to access superblock and
mount information in binary form without the need to parse /proc/mounts.


LSM support is included, but controversial:

 (1) The creds of the process that did the fput() that reduced the refcount
     to zero are cached in the file struct.

 (2) __fput() overrides the current creds with the creds from (1) whilst
     doing the cleanup, thereby making sure that the creds seen by the
     destruction notification generated by mntput() appears to come from
     the last fputter.

 (3) security_post_notification() is called for each queue that we might
     want to post a notification into, thereby allowing the LSM to prevent
     covert communications.

 (?) Do I need to add security_set_watch(), say, to rule on whether a watch
     may be set in the first place?  I might need to add a variant per
     watch-type.

 (?) Do I really need to keep track of the process creds in which an
     implicit object destruction happened?  For example, imagine you create
     an fd with fsopen()/fsmount().  It is marked to dissolve the mount it
     refers to on close unless move_mount() clears that flag.  Now, imagine
     someone looking at that fd through procfs at the same time as you exit
     due to an error.  The LSM sees the destruction notification come from
     the looker if they happen to do their fput() after yours.


Design decisions:

 (1) A misc chardev is used to create and open a ring buffer:

	fd = open("/dev/watch_queue", O_RDWR);

     which is then configured and mmap'd into userspace:

	ioctl(fd, IOC_WATCH_QUEUE_SET_SIZE, BUF_SIZE);
	ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter);
	buf = mmap(NULL, BUF_SIZE * page_size, PROT_READ | PROT_WRITE,
		   MAP_SHARED, fd, 0);

     The fd cannot be read or written (though there is a facility to use
     write to inject records for debugging) and userspace just pulls data
     directly out of the buffer.

 (2) The ring index pointers are stored inside the ring and are thus
     accessible to userspace.  Userspace should only update the tail
     pointer and never the head pointer or risk breaking the buffer.  The
     kernel checks that the pointers appear valid before trying to use
     them.  A 'skip' record is maintained around the pointers.

 (3) poll() can be used to wait for data to appear in the buffer.

 (4) Records in the buffer are binary, typed and have a length so that they
     can be of varying size.

     This means that multiple heterogeneous sources can share a common
     buffer.  Tags may be specified when a watchpoint is created to help
     distinguish the sources.

 (5) The queue is reusable as there are 16 million types available, of
     which I've used 4, so there is scope for others to be used.

 (6) Records are filterable as types have up to 256 subtypes that can be
     individually filtered.  Other filtration is also available.

 (7) Each time the buffer is opened, a new buffer is created - this means
     that there's no interference between watchers.

 (8) When recording a notification, the kernel will not sleep, but will
     rather mark a queue as overrun if there's insufficient space, thereby
     avoiding userspace causing the kernel to hang.

 (9) The 'watchpoint' should be specific where possible, meaning that you
     specify the object that you want to watch.

(10) The buffer is created and then watchpoints are attached to it, using
     one of:

	keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fd, 0x01);
	mount_notify(AT_FDCWD, "/", 0, fd, 0x02);
	sb_notify(AT_FDCWD, "/mnt", 0, fd, 0x03);

     where in all three cases, fd indicates the queue and the number after
     is a tag between 0 and 255.

(11) The watch must be removed if either the watch buffer is destroyed or
     the watched object is destroyed.


Things I want to avoid:

 (1) Introducing features that make the core VFS dependent on the network
     stack or networking namespaces (ie. usage of netlink).

 (2) Dumping all this stuff into dmesg and having a daemon that sits there
     parsing the output and distributing it as this then puts the
     responsibility for security into userspace and makes handling
     namespaces tricky.  Further, dmesg might not exist or might be
     inaccessible inside a container.

 (3) Letting users see events they shouldn't be able to see.


Further things that could be considered:

 (1) Adding a keyctl call to allow a watch on a keyring to be extended to
     "children" of that keyring, such that the watch is removed from the
     child if it is unlinked from the keyring.

 (2) Adding global superblock event queue.

 (3) Propagating watches to child superblock over automounts.


The patches can be found here also:

	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=notifications

Changes:

 v4: Split the basic UAPI bits out into their own patch and then split the
     LSM hooks out into an intermediate patch.  Add LSM hooks for setting
     watches.

     Rename the *_notify() system calls to watch_*() for consistency.

 v3: I've added a USB notification source and reformulated the block
     notification source so that there's now a common watch list, for which
     the system call is now device_notify().

     I've assigned a pair of unused ioctl numbers in the 'W' series to the
     ioctls added by this series.

     I've also added a description of the kernel API to the documentation.

 v2: I've fixed various issues raised by Jann Horn and GregKH and moved to
     krefs for refcounting.  I've added some security features to try and
     give Casey Schaufler the LSM control he wants.

David
---
David Howells (13):
      security: Override creds in __fput() with last fputter's creds
      uapi: General notification ring definitions
      security: Add hooks to rule on setting a watch
      security: Add a hook for the point of notification insertion
      General notification queue with user mmap()'able ring buffer
      keys: Add a notification facility
      vfs: Add a mount-notification facility
      vfs: Add superblock notifications
      fsinfo: Export superblock notification counter
      Add a general, global device notification watch list
      block: Add block layer notifications
      usb: Add USB subsystem notifications
      Add sample notification program


 Documentation/ioctl/ioctl-number.txt   |    1 
 Documentation/security/keys/core.rst   |   58 ++
 Documentation/watch_queue.rst          |  492 ++++++++++++++++++
 arch/x86/entry/syscalls/syscall_32.tbl |    3 
 arch/x86/entry/syscalls/syscall_64.tbl |    3 
 block/Kconfig                          |    9 
 block/blk-core.c                       |   29 +
 drivers/base/Kconfig                   |    9 
 drivers/base/Makefile                  |    1 
 drivers/base/watch.c                   |   89 +++
 drivers/misc/Kconfig                   |   13 
 drivers/misc/Makefile                  |    1 
 drivers/misc/watch_queue.c             |  889 ++++++++++++++++++++++++++++++++
 drivers/usb/core/Kconfig               |   10 
 drivers/usb/core/devio.c               |   55 ++
 drivers/usb/core/hub.c                 |    3 
 fs/Kconfig                             |   21 +
 fs/Makefile                            |    1 
 fs/file_table.c                        |   12 
 fs/fsinfo.c                            |   12 
 fs/mount.h                             |   33 +
 fs/mount_notify.c                      |  187 +++++++
 fs/namespace.c                         |    9 
 fs/super.c                             |  122 ++++
 include/linux/blkdev.h                 |   15 +
 include/linux/dcache.h                 |    1 
 include/linux/device.h                 |    7 
 include/linux/fs.h                     |   79 +++
 include/linux/key.h                    |    4 
 include/linux/lsm_hooks.h              |   48 ++
 include/linux/security.h               |   35 +
 include/linux/syscalls.h               |    5 
 include/linux/usb.h                    |   19 +
 include/linux/watch_queue.h            |   87 +++
 include/uapi/linux/fsinfo.h            |   10 
 include/uapi/linux/keyctl.h            |    1 
 include/uapi/linux/watch_queue.h       |  213 ++++++++
 kernel/sys_ni.c                        |    7 
 samples/Kconfig                        |    6 
 samples/Makefile                       |    1 
 samples/vfs/test-fsinfo.c              |   13 
 samples/watch_queue/Makefile           |    9 
 samples/watch_queue/watch_test.c       |  308 +++++++++++
 security/keys/Kconfig                  |   10 
 security/keys/compat.c                 |    2 
 security/keys/gc.c                     |    5 
 security/keys/internal.h               |   30 +
 security/keys/key.c                    |   37 +
 security/keys/keyctl.c                 |   95 +++
 security/keys/keyring.c                |   17 -
 security/keys/request_key.c            |    4 
 security/security.c                    |   29 +
 52 files changed, 3121 insertions(+), 38 deletions(-)
 create mode 100644 Documentation/watch_queue.rst
 create mode 100644 drivers/base/watch.c
 create mode 100644 drivers/misc/watch_queue.c
 create mode 100644 fs/mount_notify.c
 create mode 100644 include/linux/watch_queue.h
 create mode 100644 include/uapi/linux/watch_queue.h
 create mode 100644 samples/watch_queue/Makefile
 create mode 100644 samples/watch_queue/watch_test.c

Stephen Smalley June 10, 2019, 3:21 p.m. UTC | #1

On 6/7/19 10:17 AM, David Howells wrote:
> 
> Hi Al,
> 
> Here's a set of patches to add a general variable-length notification queue
> concept and to add sources of events for:
> 
>   (1) Mount topology events, such as mounting, unmounting, mount expiry,
>       mount reconfiguration.
> 
>   (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O
>       errors (not complete yet).
> 
>   (3) Key/keyring events, such as creating, linking and removal of keys.
> 
>   (4) General device events (single common queue) including:
> 
>       - Block layer events, such as device errors
> 
>       - USB subsystem events, such as device/bus attach/remove, device
>         reset, device errors.
> 
> One of the reasons for this is so that we can remove the issue of processes
> having to repeatedly and regularly scan /proc/mounts, which has proven to
> be a system performance problem.  To further aid this, the fsinfo() syscall
> on which this patch series depends, provides a way to access superblock and
> mount information in binary form without the need to parse /proc/mounts.
> 
> 
> LSM support is included, but controversial:
> 
>   (1) The creds of the process that did the fput() that reduced the refcount
>       to zero are cached in the file struct.
> 
>   (2) __fput() overrides the current creds with the creds from (1) whilst
>       doing the cleanup, thereby making sure that the creds seen by the
>       destruction notification generated by mntput() appears to come from
>       the last fputter.
> 
>   (3) security_post_notification() is called for each queue that we might
>       want to post a notification into, thereby allowing the LSM to prevent
>       covert communications.
> 
>   (?) Do I need to add security_set_watch(), say, to rule on whether a watch
>       may be set in the first place?  I might need to add a variant per
>       watch-type.
> 
>   (?) Do I really need to keep track of the process creds in which an
>       implicit object destruction happened?  For example, imagine you create
>       an fd with fsopen()/fsmount().  It is marked to dissolve the mount it
>       refers to on close unless move_mount() clears that flag.  Now, imagine
>       someone looking at that fd through procfs at the same time as you exit
>       due to an error.  The LSM sees the destruction notification come from
>       the looker if they happen to do their fput() after yours.

I remain unconvinced that (1), (2), (3), and the final (?) above are a 
good idea.

For SELinux, I would expect that one would implement a collection of per 
watch-type WATCH permission checks on the target object (or to some 
well-defined object label like the kernel SID if there is no object) 
that allow receipt of all notifications of that watch-type for objects 
related to the target object, where "related to" is defined per watch-type.

I wouldn't expect SELinux to implement security_post_notification() at 
all.  I can't see how one can construct a meaningful, stable policy for 
it.  I'd argue that the triggering process is not posting the 
notification; the kernel is posting the notification and the watcher has 
been authorized to receive it.

> 
> 
> Design decisions:
> 
>   (1) A misc chardev is used to create and open a ring buffer:
> 
> 	fd = open("/dev/watch_queue", O_RDWR);
> 
>       which is then configured and mmap'd into userspace:
> 
> 	ioctl(fd, IOC_WATCH_QUEUE_SET_SIZE, BUF_SIZE);
> 	ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter);
> 	buf = mmap(NULL, BUF_SIZE * page_size, PROT_READ | PROT_WRITE,
> 		   MAP_SHARED, fd, 0);
> 
>       The fd cannot be read or written (though there is a facility to use
>       write to inject records for debugging) and userspace just pulls data
>       directly out of the buffer.
> 
>   (2) The ring index pointers are stored inside the ring and are thus
>       accessible to userspace.  Userspace should only update the tail
>       pointer and never the head pointer or risk breaking the buffer.  The
>       kernel checks that the pointers appear valid before trying to use
>       them.  A 'skip' record is maintained around the pointers.
> 
>   (3) poll() can be used to wait for data to appear in the buffer.
> 
>   (4) Records in the buffer are binary, typed and have a length so that they
>       can be of varying size.
> 
>       This means that multiple heterogeneous sources can share a common
>       buffer.  Tags may be specified when a watchpoint is created to help
>       distinguish the sources.
> 
>   (5) The queue is reusable as there are 16 million types available, of
>       which I've used 4, so there is scope for others to be used.
> 
>   (6) Records are filterable as types have up to 256 subtypes that can be
>       individually filtered.  Other filtration is also available.
> 
>   (7) Each time the buffer is opened, a new buffer is created - this means
>       that there's no interference between watchers.
> 
>   (8) When recording a notification, the kernel will not sleep, but will
>       rather mark a queue as overrun if there's insufficient space, thereby
>       avoiding userspace causing the kernel to hang.
> 
>   (9) The 'watchpoint' should be specific where possible, meaning that you
>       specify the object that you want to watch.
> 
> (10) The buffer is created and then watchpoints are attached to it, using
>       one of:
> 
> 	keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fd, 0x01);
> 	mount_notify(AT_FDCWD, "/", 0, fd, 0x02);
> 	sb_notify(AT_FDCWD, "/mnt", 0, fd, 0x03);
> 
>       where in all three cases, fd indicates the queue and the number after
>       is a tag between 0 and 255.
> 
> (11) The watch must be removed if either the watch buffer is destroyed or
>       the watched object is destroyed.
> 
> 
> Things I want to avoid:
> 
>   (1) Introducing features that make the core VFS dependent on the network
>       stack or networking namespaces (ie. usage of netlink).
> 
>   (2) Dumping all this stuff into dmesg and having a daemon that sits there
>       parsing the output and distributing it as this then puts the
>       responsibility for security into userspace and makes handling
>       namespaces tricky.  Further, dmesg might not exist or might be
>       inaccessible inside a container.
> 
>   (3) Letting users see events they shouldn't be able to see.
> 
> 
> Further things that could be considered:
> 
>   (1) Adding a keyctl call to allow a watch on a keyring to be extended to
>       "children" of that keyring, such that the watch is removed from the
>       child if it is unlinked from the keyring.
> 
>   (2) Adding global superblock event queue.
> 
>   (3) Propagating watches to child superblock over automounts.
> 
> 
> The patches can be found here also:
> 
> 	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=notifications
> 
> Changes:
> 
>   v4: Split the basic UAPI bits out into their own patch and then split the
>       LSM hooks out into an intermediate patch.  Add LSM hooks for setting
>       watches.
> 
>       Rename the *_notify() system calls to watch_*() for consistency.
> 
>   v3: I've added a USB notification source and reformulated the block
>       notification source so that there's now a common watch list, for which
>       the system call is now device_notify().
> 
>       I've assigned a pair of unused ioctl numbers in the 'W' series to the
>       ioctls added by this series.
> 
>       I've also added a description of the kernel API to the documentation.
> 
>   v2: I've fixed various issues raised by Jann Horn and GregKH and moved to
>       krefs for refcounting.  I've added some security features to try and
>       give Casey Schaufler the LSM control he wants.
> 
> David
> ---
> David Howells (13):
>        security: Override creds in __fput() with last fputter's creds
>        uapi: General notification ring definitions
>        security: Add hooks to rule on setting a watch
>        security: Add a hook for the point of notification insertion
>        General notification queue with user mmap()'able ring buffer
>        keys: Add a notification facility
>        vfs: Add a mount-notification facility
>        vfs: Add superblock notifications
>        fsinfo: Export superblock notification counter
>        Add a general, global device notification watch list
>        block: Add block layer notifications
>        usb: Add USB subsystem notifications
>        Add sample notification program
> 
> 
>   Documentation/ioctl/ioctl-number.txt   |    1
>   Documentation/security/keys/core.rst   |   58 ++
>   Documentation/watch_queue.rst          |  492 ++++++++++++++++++
>   arch/x86/entry/syscalls/syscall_32.tbl |    3
>   arch/x86/entry/syscalls/syscall_64.tbl |    3
>   block/Kconfig                          |    9
>   block/blk-core.c                       |   29 +
>   drivers/base/Kconfig                   |    9
>   drivers/base/Makefile                  |    1
>   drivers/base/watch.c                   |   89 +++
>   drivers/misc/Kconfig                   |   13
>   drivers/misc/Makefile                  |    1
>   drivers/misc/watch_queue.c             |  889 ++++++++++++++++++++++++++++++++
>   drivers/usb/core/Kconfig               |   10
>   drivers/usb/core/devio.c               |   55 ++
>   drivers/usb/core/hub.c                 |    3
>   fs/Kconfig                             |   21 +
>   fs/Makefile                            |    1
>   fs/file_table.c                        |   12
>   fs/fsinfo.c                            |   12
>   fs/mount.h                             |   33 +
>   fs/mount_notify.c                      |  187 +++++++
>   fs/namespace.c                         |    9
>   fs/super.c                             |  122 ++++
>   include/linux/blkdev.h                 |   15 +
>   include/linux/dcache.h                 |    1
>   include/linux/device.h                 |    7
>   include/linux/fs.h                     |   79 +++
>   include/linux/key.h                    |    4
>   include/linux/lsm_hooks.h              |   48 ++
>   include/linux/security.h               |   35 +
>   include/linux/syscalls.h               |    5
>   include/linux/usb.h                    |   19 +
>   include/linux/watch_queue.h            |   87 +++
>   include/uapi/linux/fsinfo.h            |   10
>   include/uapi/linux/keyctl.h            |    1
>   include/uapi/linux/watch_queue.h       |  213 ++++++++
>   kernel/sys_ni.c                        |    7
>   samples/Kconfig                        |    6
>   samples/Makefile                       |    1
>   samples/vfs/test-fsinfo.c              |   13
>   samples/watch_queue/Makefile           |    9
>   samples/watch_queue/watch_test.c       |  308 +++++++++++
>   security/keys/Kconfig                  |   10
>   security/keys/compat.c                 |    2
>   security/keys/gc.c                     |    5
>   security/keys/internal.h               |   30 +
>   security/keys/key.c                    |   37 +
>   security/keys/keyctl.c                 |   95 +++
>   security/keys/keyring.c                |   17 -
>   security/keys/request_key.c            |    4
>   security/security.c                    |   29 +
>   52 files changed, 3121 insertions(+), 38 deletions(-)
>   create mode 100644 Documentation/watch_queue.rst
>   create mode 100644 drivers/base/watch.c
>   create mode 100644 drivers/misc/watch_queue.c
>   create mode 100644 fs/mount_notify.c
>   create mode 100644 include/linux/watch_queue.h
>   create mode 100644 include/uapi/linux/watch_queue.h
>   create mode 100644 samples/watch_queue/Makefile
>   create mode 100644 samples/watch_queue/watch_test.c
>

Casey Schaufler June 10, 2019, 4:33 p.m. UTC | #2

On 6/10/2019 8:21 AM, Stephen Smalley wrote:
> On 6/7/19 10:17 AM, David Howells wrote:
>>
>> Hi Al,
>>
>> Here's a set of patches to add a general variable-length notification queue
>> concept and to add sources of events for:
>>
>>   (1) Mount topology events, such as mounting, unmounting, mount expiry,
>>       mount reconfiguration.
>>
>>   (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O
>>       errors (not complete yet).
>>
>>   (3) Key/keyring events, such as creating, linking and removal of keys.
>>
>>   (4) General device events (single common queue) including:
>>
>>       - Block layer events, such as device errors
>>
>>       - USB subsystem events, such as device/bus attach/remove, device
>>         reset, device errors.
>>
>> One of the reasons for this is so that we can remove the issue of processes
>> having to repeatedly and regularly scan /proc/mounts, which has proven to
>> be a system performance problem.  To further aid this, the fsinfo() syscall
>> on which this patch series depends, provides a way to access superblock and
>> mount information in binary form without the need to parse /proc/mounts.
>>
>>
>> LSM support is included, but controversial:
>>
>>   (1) The creds of the process that did the fput() that reduced the refcount
>>       to zero are cached in the file struct.
>>
>>   (2) __fput() overrides the current creds with the creds from (1) whilst
>>       doing the cleanup, thereby making sure that the creds seen by the
>>       destruction notification generated by mntput() appears to come from
>>       the last fputter.
>>
>>   (3) security_post_notification() is called for each queue that we might
>>       want to post a notification into, thereby allowing the LSM to prevent
>>       covert communications.
>>
>>   (?) Do I need to add security_set_watch(), say, to rule on whether a watch
>>       may be set in the first place?  I might need to add a variant per
>>       watch-type.
>>
>>   (?) Do I really need to keep track of the process creds in which an
>>       implicit object destruction happened?  For example, imagine you create
>>       an fd with fsopen()/fsmount().  It is marked to dissolve the mount it
>>       refers to on close unless move_mount() clears that flag.  Now, imagine
>>       someone looking at that fd through procfs at the same time as you exit
>>       due to an error.  The LSM sees the destruction notification come from
>>       the looker if they happen to do their fput() after yours.
>
> I remain unconvinced that (1), (2), (3), and the final (?) above are a good idea.
>
> For SELinux, I would expect that one would implement a collection of per watch-type WATCH permission checks on the target object (or to some well-defined object label like the kernel SID if there is no object) that allow receipt of all notifications of that watch-type for objects related to the target object, where "related to" is defined per watch-type.
>
> I wouldn't expect SELinux to implement security_post_notification() at all.  I can't see how one can construct a meaningful, stable policy for it.  I'd argue that the triggering process is not posting the notification; the kernel is posting the notification and the watcher has been authorized to receive it.

I cannot agree. There is an explicit action by a subject that results
in information being delivered to an object. Just like a signal or a
UDP packet delivery. Smack handles this kind of thing just fine. The
internal mechanism that results in the access is irrelevant from
this viewpoint. I can understand how a mechanism like SELinux that
works on finer granularity might view it differently.



>
>>
>>
>> Design decisions:
>>
>>   (1) A misc chardev is used to create and open a ring buffer:
>>
>>     fd = open("/dev/watch_queue", O_RDWR);
>>
>>       which is then configured and mmap'd into userspace:
>>
>>     ioctl(fd, IOC_WATCH_QUEUE_SET_SIZE, BUF_SIZE);
>>     ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter);
>>     buf = mmap(NULL, BUF_SIZE * page_size, PROT_READ | PROT_WRITE,
>>            MAP_SHARED, fd, 0);
>>
>>       The fd cannot be read or written (though there is a facility to use
>>       write to inject records for debugging) and userspace just pulls data
>>       directly out of the buffer.
>>
>>   (2) The ring index pointers are stored inside the ring and are thus
>>       accessible to userspace.  Userspace should only update the tail
>>       pointer and never the head pointer or risk breaking the buffer.  The
>>       kernel checks that the pointers appear valid before trying to use
>>       them.  A 'skip' record is maintained around the pointers.
>>
>>   (3) poll() can be used to wait for data to appear in the buffer.
>>
>>   (4) Records in the buffer are binary, typed and have a length so that they
>>       can be of varying size.
>>
>>       This means that multiple heterogeneous sources can share a common
>>       buffer.  Tags may be specified when a watchpoint is created to help
>>       distinguish the sources.
>>
>>   (5) The queue is reusable as there are 16 million types available, of
>>       which I've used 4, so there is scope for others to be used.
>>
>>   (6) Records are filterable as types have up to 256 subtypes that can be
>>       individually filtered.  Other filtration is also available.
>>
>>   (7) Each time the buffer is opened, a new buffer is created - this means
>>       that there's no interference between watchers.
>>
>>   (8) When recording a notification, the kernel will not sleep, but will
>>       rather mark a queue as overrun if there's insufficient space, thereby
>>       avoiding userspace causing the kernel to hang.
>>
>>   (9) The 'watchpoint' should be specific where possible, meaning that you
>>       specify the object that you want to watch.
>>
>> (10) The buffer is created and then watchpoints are attached to it, using
>>       one of:
>>
>>     keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fd, 0x01);
>>     mount_notify(AT_FDCWD, "/", 0, fd, 0x02);
>>     sb_notify(AT_FDCWD, "/mnt", 0, fd, 0x03);
>>
>>       where in all three cases, fd indicates the queue and the number after
>>       is a tag between 0 and 255.
>>
>> (11) The watch must be removed if either the watch buffer is destroyed or
>>       the watched object is destroyed.
>>
>>
>> Things I want to avoid:
>>
>>   (1) Introducing features that make the core VFS dependent on the network
>>       stack or networking namespaces (ie. usage of netlink).
>>
>>   (2) Dumping all this stuff into dmesg and having a daemon that sits there
>>       parsing the output and distributing it as this then puts the
>>       responsibility for security into userspace and makes handling
>>       namespaces tricky.  Further, dmesg might not exist or might be
>>       inaccessible inside a container.
>>
>>   (3) Letting users see events they shouldn't be able to see.
>>
>>
>> Further things that could be considered:
>>
>>   (1) Adding a keyctl call to allow a watch on a keyring to be extended to
>>       "children" of that keyring, such that the watch is removed from the
>>       child if it is unlinked from the keyring.
>>
>>   (2) Adding global superblock event queue.
>>
>>   (3) Propagating watches to child superblock over automounts.
>>
>>
>> The patches can be found here also:
>>
>>     http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=notifications
>>
>> Changes:
>>
>>   v4: Split the basic UAPI bits out into their own patch and then split the
>>       LSM hooks out into an intermediate patch.  Add LSM hooks for setting
>>       watches.
>>
>>       Rename the *_notify() system calls to watch_*() for consistency.
>>
>>   v3: I've added a USB notification source and reformulated the block
>>       notification source so that there's now a common watch list, for which
>>       the system call is now device_notify().
>>
>>       I've assigned a pair of unused ioctl numbers in the 'W' series to the
>>       ioctls added by this series.
>>
>>       I've also added a description of the kernel API to the documentation.
>>
>>   v2: I've fixed various issues raised by Jann Horn and GregKH and moved to
>>       krefs for refcounting.  I've added some security features to try and
>>       give Casey Schaufler the LSM control he wants.
>>
>> David
>> ---
>> David Howells (13):
>>        security: Override creds in __fput() with last fputter's creds
>>        uapi: General notification ring definitions
>>        security: Add hooks to rule on setting a watch
>>        security: Add a hook for the point of notification insertion
>>        General notification queue with user mmap()'able ring buffer
>>        keys: Add a notification facility
>>        vfs: Add a mount-notification facility
>>        vfs: Add superblock notifications
>>        fsinfo: Export superblock notification counter
>>        Add a general, global device notification watch list
>>        block: Add block layer notifications
>>        usb: Add USB subsystem notifications
>>        Add sample notification program
>>
>>
>>   Documentation/ioctl/ioctl-number.txt   |    1
>>   Documentation/security/keys/core.rst   |   58 ++
>>   Documentation/watch_queue.rst          |  492 ++++++++++++++++++
>>   arch/x86/entry/syscalls/syscall_32.tbl |    3
>>   arch/x86/entry/syscalls/syscall_64.tbl |    3
>>   block/Kconfig                          |    9
>>   block/blk-core.c                       |   29 +
>>   drivers/base/Kconfig                   |    9
>>   drivers/base/Makefile                  |    1
>>   drivers/base/watch.c                   |   89 +++
>>   drivers/misc/Kconfig                   |   13
>>   drivers/misc/Makefile                  |    1
>>   drivers/misc/watch_queue.c             |  889 ++++++++++++++++++++++++++++++++
>>   drivers/usb/core/Kconfig               |   10
>>   drivers/usb/core/devio.c               |   55 ++
>>   drivers/usb/core/hub.c                 |    3
>>   fs/Kconfig                             |   21 +
>>   fs/Makefile                            |    1
>>   fs/file_table.c                        |   12
>>   fs/fsinfo.c                            |   12
>>   fs/mount.h                             |   33 +
>>   fs/mount_notify.c                      |  187 +++++++
>>   fs/namespace.c                         |    9
>>   fs/super.c                             |  122 ++++
>>   include/linux/blkdev.h                 |   15 +
>>   include/linux/dcache.h                 |    1
>>   include/linux/device.h                 |    7
>>   include/linux/fs.h                     |   79 +++
>>   include/linux/key.h                    |    4
>>   include/linux/lsm_hooks.h              |   48 ++
>>   include/linux/security.h               |   35 +
>>   include/linux/syscalls.h               |    5
>>   include/linux/usb.h                    |   19 +
>>   include/linux/watch_queue.h            |   87 +++
>>   include/uapi/linux/fsinfo.h            |   10
>>   include/uapi/linux/keyctl.h            |    1
>>   include/uapi/linux/watch_queue.h       |  213 ++++++++
>>   kernel/sys_ni.c                        |    7
>>   samples/Kconfig                        |    6
>>   samples/Makefile                       |    1
>>   samples/vfs/test-fsinfo.c              |   13
>>   samples/watch_queue/Makefile           |    9
>>   samples/watch_queue/watch_test.c       |  308 +++++++++++
>>   security/keys/Kconfig                  |   10
>>   security/keys/compat.c                 |    2
>>   security/keys/gc.c                     |    5
>>   security/keys/internal.h               |   30 +
>>   security/keys/key.c                    |   37 +
>>   security/keys/keyctl.c                 |   95 +++
>>   security/keys/keyring.c                |   17 -
>>   security/keys/request_key.c            |    4
>>   security/security.c                    |   29 +
>>   52 files changed, 3121 insertions(+), 38 deletions(-)
>>   create mode 100644 Documentation/watch_queue.rst
>>   create mode 100644 drivers/base/watch.c
>>   create mode 100644 drivers/misc/watch_queue.c
>>   create mode 100644 fs/mount_notify.c
>>   create mode 100644 include/linux/watch_queue.h
>>   create mode 100644 include/uapi/linux/watch_queue.h
>>   create mode 100644 samples/watch_queue/Makefile
>>   create mode 100644 samples/watch_queue/watch_test.c
>>
>

Andy Lutomirski June 10, 2019, 4:42 p.m. UTC | #3

On Mon, Jun 10, 2019 at 9:34 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>
> On 6/10/2019 8:21 AM, Stephen Smalley wrote:
> > On 6/7/19 10:17 AM, David Howells wrote:
> >>
> >> Hi Al,
> >>
> >> Here's a set of patches to add a general variable-length notification queue
> >> concept and to add sources of events for:
> >>
> >>   (1) Mount topology events, such as mounting, unmounting, mount expiry,
> >>       mount reconfiguration.
> >>
> >>   (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O
> >>       errors (not complete yet).
> >>
> >>   (3) Key/keyring events, such as creating, linking and removal of keys.
> >>
> >>   (4) General device events (single common queue) including:
> >>
> >>       - Block layer events, such as device errors
> >>
> >>       - USB subsystem events, such as device/bus attach/remove, device
> >>         reset, device errors.
> >>
> >> One of the reasons for this is so that we can remove the issue of processes
> >> having to repeatedly and regularly scan /proc/mounts, which has proven to
> >> be a system performance problem.  To further aid this, the fsinfo() syscall
> >> on which this patch series depends, provides a way to access superblock and
> >> mount information in binary form without the need to parse /proc/mounts.
> >>
> >>
> >> LSM support is included, but controversial:
> >>
> >>   (1) The creds of the process that did the fput() that reduced the refcount
> >>       to zero are cached in the file struct.
> >>
> >>   (2) __fput() overrides the current creds with the creds from (1) whilst
> >>       doing the cleanup, thereby making sure that the creds seen by the
> >>       destruction notification generated by mntput() appears to come from
> >>       the last fputter.
> >>
> >>   (3) security_post_notification() is called for each queue that we might
> >>       want to post a notification into, thereby allowing the LSM to prevent
> >>       covert communications.
> >>
> >>   (?) Do I need to add security_set_watch(), say, to rule on whether a watch
> >>       may be set in the first place?  I might need to add a variant per
> >>       watch-type.
> >>
> >>   (?) Do I really need to keep track of the process creds in which an
> >>       implicit object destruction happened?  For example, imagine you create
> >>       an fd with fsopen()/fsmount().  It is marked to dissolve the mount it
> >>       refers to on close unless move_mount() clears that flag.  Now, imagine
> >>       someone looking at that fd through procfs at the same time as you exit
> >>       due to an error.  The LSM sees the destruction notification come from
> >>       the looker if they happen to do their fput() after yours.
> >
> > I remain unconvinced that (1), (2), (3), and the final (?) above are a good idea.
> >
> > For SELinux, I would expect that one would implement a collection of per watch-type WATCH permission checks on the target object (or to some well-defined object label like the kernel SID if there is no object) that allow receipt of all notifications of that watch-type for objects related to the target object, where "related to" is defined per watch-type.
> >
> > I wouldn't expect SELinux to implement security_post_notification() at all.  I can't see how one can construct a meaningful, stable policy for it.  I'd argue that the triggering process is not posting the notification; the kernel is posting the notification and the watcher has been authorized to receive it.
>
> I cannot agree. There is an explicit action by a subject that results
> in information being delivered to an object. Just like a signal or a
> UDP packet delivery. Smack handles this kind of thing just fine. The
> internal mechanism that results in the access is irrelevant from
> this viewpoint. I can understand how a mechanism like SELinux that
> works on finer granularity might view it differently.

I think you really need to give an example of a coherent policy that
needs this.  As it stands, your analogy seems confusing.  If someone
changes the system clock, we don't restrict who is allowed to be
notified (via, for example, TFD_TIMER_CANCEL_ON_SET) that the clock
was changed based on who changed the clock.  Similarly, if someone
tries to receive a packet on a socket, we check whether they have the
right to receive on that socket (from the endpoint in question) and,
if the sender is local, whether the sender can send to that socket.
We do not check whether the sender can send to the receiver.

The signal example is inapplicable.  Sending a signal to a process is
an explicit action done to that process, and it can easily adversely
affect the target.  Of course it requires permission.

--Andy

Casey Schaufler June 10, 2019, 6:01 p.m. UTC | #4

On 6/10/2019 9:42 AM, Andy Lutomirski wrote:
> On Mon, Jun 10, 2019 at 9:34 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>> On 6/10/2019 8:21 AM, Stephen Smalley wrote:
>>> On 6/7/19 10:17 AM, David Howells wrote:
>>>> Hi Al,
>>>>
>>>> Here's a set of patches to add a general variable-length notification queue
>>>> concept and to add sources of events for:
>>>>
>>>>   (1) Mount topology events, such as mounting, unmounting, mount expiry,
>>>>       mount reconfiguration.
>>>>
>>>>   (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O
>>>>       errors (not complete yet).
>>>>
>>>>   (3) Key/keyring events, such as creating, linking and removal of keys.
>>>>
>>>>   (4) General device events (single common queue) including:
>>>>
>>>>       - Block layer events, such as device errors
>>>>
>>>>       - USB subsystem events, such as device/bus attach/remove, device
>>>>         reset, device errors.
>>>>
>>>> One of the reasons for this is so that we can remove the issue of processes
>>>> having to repeatedly and regularly scan /proc/mounts, which has proven to
>>>> be a system performance problem.  To further aid this, the fsinfo() syscall
>>>> on which this patch series depends, provides a way to access superblock and
>>>> mount information in binary form without the need to parse /proc/mounts.
>>>>
>>>>
>>>> LSM support is included, but controversial:
>>>>
>>>>   (1) The creds of the process that did the fput() that reduced the refcount
>>>>       to zero are cached in the file struct.
>>>>
>>>>   (2) __fput() overrides the current creds with the creds from (1) whilst
>>>>       doing the cleanup, thereby making sure that the creds seen by the
>>>>       destruction notification generated by mntput() appears to come from
>>>>       the last fputter.
>>>>
>>>>   (3) security_post_notification() is called for each queue that we might
>>>>       want to post a notification into, thereby allowing the LSM to prevent
>>>>       covert communications.
>>>>
>>>>   (?) Do I need to add security_set_watch(), say, to rule on whether a watch
>>>>       may be set in the first place?  I might need to add a variant per
>>>>       watch-type.
>>>>
>>>>   (?) Do I really need to keep track of the process creds in which an
>>>>       implicit object destruction happened?  For example, imagine you create
>>>>       an fd with fsopen()/fsmount().  It is marked to dissolve the mount it
>>>>       refers to on close unless move_mount() clears that flag.  Now, imagine
>>>>       someone looking at that fd through procfs at the same time as you exit
>>>>       due to an error.  The LSM sees the destruction notification come from
>>>>       the looker if they happen to do their fput() after yours.
>>> I remain unconvinced that (1), (2), (3), and the final (?) above are a good idea.
>>>
>>> For SELinux, I would expect that one would implement a collection of per watch-type WATCH permission checks on the target object (or to some well-defined object label like the kernel SID if there is no object) that allow receipt of all notifications of that watch-type for objects related to the target object, where "related to" is defined per watch-type.
>>>
>>> I wouldn't expect SELinux to implement security_post_notification() at all.  I can't see how one can construct a meaningful, stable policy for it.  I'd argue that the triggering process is not posting the notification; the kernel is posting the notification and the watcher has been authorized to receive it.
>> I cannot agree. There is an explicit action by a subject that results
>> in information being delivered to an object. Just like a signal or a
>> UDP packet delivery. Smack handles this kind of thing just fine. The
>> internal mechanism that results in the access is irrelevant from
>> this viewpoint. I can understand how a mechanism like SELinux that
>> works on finer granularity might view it differently.
> I think you really need to give an example of a coherent policy that
> needs this.

I keep telling you, and you keep ignoring what I say.

>   As it stands, your analogy seems confusing.

It's pretty simple. I have given both the abstract
and examples.

>   If someone
> changes the system clock, we don't restrict who is allowed to be
> notified (via, for example, TFD_TIMER_CANCEL_ON_SET) that the clock
> was changed based on who changed the clock.

That's right. The system clock is not an object that
unprivileged processes can modify. In fact, it is not
an object at all. If you care to look, you will see that
Smack does nothing with the clock.

>   Similarly, if someone
> tries to receive a packet on a socket, we check whether they have the
> right to receive on that socket (from the endpoint in question) and,
> if the sender is local, whether the sender can send to that socket.
> We do not check whether the sender can send to the receiver.

Bzzzt! Smack sure does.

> The signal example is inapplicable.

From a modeling viewpoint the actions are identical.

>   Sending a signal to a process is
> an explicit action done to that process, and it can easily adversely
> affect the target.  Of course it requires permission.
>
> --Andy

Andy Lutomirski June 10, 2019, 6:22 p.m. UTC | #5

> On Jun 10, 2019, at 11:01 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
> 
>> On 6/10/2019 9:42 AM, Andy Lutomirski wrote:
>>> On Mon, Jun 10, 2019 at 9:34 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>> On 6/10/2019 8:21 AM, Stephen Smalley wrote:
>>>>> On 6/7/19 10:17 AM, David Howells wrote:
>>>>> Hi Al,
>>>>> 
>>>>> Here's a set of patches to add a general variable-length notification queue
>>>>> concept and to add sources of events for:
>>>>> 
>>>>>  (1) Mount topology events, such as mounting, unmounting, mount expiry,
>>>>>      mount reconfiguration.
>>>>> 
>>>>>  (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O
>>>>>      errors (not complete yet).
>>>>> 
>>>>>  (3) Key/keyring events, such as creating, linking and removal of keys.
>>>>> 
>>>>>  (4) General device events (single common queue) including:
>>>>> 
>>>>>      - Block layer events, such as device errors
>>>>> 
>>>>>      - USB subsystem events, such as device/bus attach/remove, device
>>>>>        reset, device errors.
>>>>> 
>>>>> One of the reasons for this is so that we can remove the issue of processes
>>>>> having to repeatedly and regularly scan /proc/mounts, which has proven to
>>>>> be a system performance problem.  To further aid this, the fsinfo() syscall
>>>>> on which this patch series depends, provides a way to access superblock and
>>>>> mount information in binary form without the need to parse /proc/mounts.
>>>>> 
>>>>> 
>>>>> LSM support is included, but controversial:
>>>>> 
>>>>>  (1) The creds of the process that did the fput() that reduced the refcount
>>>>>      to zero are cached in the file struct.
>>>>> 
>>>>>  (2) __fput() overrides the current creds with the creds from (1) whilst
>>>>>      doing the cleanup, thereby making sure that the creds seen by the
>>>>>      destruction notification generated by mntput() appears to come from
>>>>>      the last fputter.
>>>>> 
>>>>>  (3) security_post_notification() is called for each queue that we might
>>>>>      want to post a notification into, thereby allowing the LSM to prevent
>>>>>      covert communications.
>>>>> 
>>>>>  (?) Do I need to add security_set_watch(), say, to rule on whether a watch
>>>>>      may be set in the first place?  I might need to add a variant per
>>>>>      watch-type.
>>>>> 
>>>>>  (?) Do I really need to keep track of the process creds in which an
>>>>>      implicit object destruction happened?  For example, imagine you create
>>>>>      an fd with fsopen()/fsmount().  It is marked to dissolve the mount it
>>>>>      refers to on close unless move_mount() clears that flag.  Now, imagine
>>>>>      someone looking at that fd through procfs at the same time as you exit
>>>>>      due to an error.  The LSM sees the destruction notification come from
>>>>>      the looker if they happen to do their fput() after yours.
>>>> I remain unconvinced that (1), (2), (3), and the final (?) above are a good idea.
>>>> 
>>>> For SELinux, I would expect that one would implement a collection of per watch-type WATCH permission checks on the target object (or to some well-defined object label like the kernel SID if there is no object) that allow receipt of all notifications of that watch-type for objects related to the target object, where "related to" is defined per watch-type.
>>>> 
>>>> I wouldn't expect SELinux to implement security_post_notification() at all.  I can't see how one can construct a meaningful, stable policy for it.  I'd argue that the triggering process is not posting the notification; the kernel is posting the notification and the watcher has been authorized to receive it.
>>> I cannot agree. There is an explicit action by a subject that results
>>> in information being delivered to an object. Just like a signal or a
>>> UDP packet delivery. Smack handles this kind of thing just fine. The
>>> internal mechanism that results in the access is irrelevant from
>>> this viewpoint. I can understand how a mechanism like SELinux that
>>> works on finer granularity might view it differently.
>> I think you really need to give an example of a coherent policy that
>> needs this.
> 
> I keep telling you, and you keep ignoring what I say.
> 
>>  As it stands, your analogy seems confusing.
> 
> It's pretty simple. I have given both the abstract
> and examples.

You gave the /dev/null example, which is inapplicable to this patchset.

> 
>>  If someone
>> changes the system clock, we don't restrict who is allowed to be
>> notified (via, for example, TFD_TIMER_CANCEL_ON_SET) that the clock
>> was changed based on who changed the clock.
> 
> That's right. The system clock is not an object that
> unprivileged processes can modify. In fact, it is not
> an object at all. If you care to look, you will see that
> Smack does nothing with the clock.

And this is different from the mount tree how?

> 
>>  Similarly, if someone
>> tries to receive a packet on a socket, we check whether they have the
>> right to receive on that socket (from the endpoint in question) and,
>> if the sender is local, whether the sender can send to that socket.
>> We do not check whether the sender can send to the receiver.
> 
> Bzzzt! Smack sure does.

This seems dubious. I’m still trying to get you to explain to a non-Smack person why this makes sense.

> 
>> The signal example is inapplicable.
> 
> From a modeling viewpoint the actions are identical.

This seems incorrect to me and, I think, to most everyone else reading this. Can you explain?

In SELinux-ese, when you write to a file, the subject is the writer and the object is the file.  When you send a signal to a process, the object is the target process.

Casey Schaufler June 10, 2019, 7:33 p.m. UTC | #6

On 6/10/2019 11:22 AM, Andy Lutomirski wrote:
>> On Jun 10, 2019, at 11:01 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>>
>>> On 6/10/2019 9:42 AM, Andy Lutomirski wrote:
>>>> On Mon, Jun 10, 2019 at 9:34 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>>> On 6/10/2019 8:21 AM, Stephen Smalley wrote:
>>>>>> On 6/7/19 10:17 AM, David Howells wrote:
>>>>>> Hi Al,
>>>>>>
>>>>>> Here's a set of patches to add a general variable-length notification queue
>>>>>> concept and to add sources of events for:
>>>>>>
>>>>>>  (1) Mount topology events, such as mounting, unmounting, mount expiry,
>>>>>>      mount reconfiguration.
>>>>>>
>>>>>>  (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O
>>>>>>      errors (not complete yet).
>>>>>>
>>>>>>  (3) Key/keyring events, such as creating, linking and removal of keys.
>>>>>>
>>>>>>  (4) General device events (single common queue) including:
>>>>>>
>>>>>>      - Block layer events, such as device errors
>>>>>>
>>>>>>      - USB subsystem events, such as device/bus attach/remove, device
>>>>>>        reset, device errors.
>>>>>>
>>>>>> One of the reasons for this is so that we can remove the issue of processes
>>>>>> having to repeatedly and regularly scan /proc/mounts, which has proven to
>>>>>> be a system performance problem.  To further aid this, the fsinfo() syscall
>>>>>> on which this patch series depends, provides a way to access superblock and
>>>>>> mount information in binary form without the need to parse /proc/mounts.
>>>>>>
>>>>>>
>>>>>> LSM support is included, but controversial:
>>>>>>
>>>>>>  (1) The creds of the process that did the fput() that reduced the refcount
>>>>>>      to zero are cached in the file struct.
>>>>>>
>>>>>>  (2) __fput() overrides the current creds with the creds from (1) whilst
>>>>>>      doing the cleanup, thereby making sure that the creds seen by the
>>>>>>      destruction notification generated by mntput() appears to come from
>>>>>>      the last fputter.
>>>>>>
>>>>>>  (3) security_post_notification() is called for each queue that we might
>>>>>>      want to post a notification into, thereby allowing the LSM to prevent
>>>>>>      covert communications.
>>>>>>
>>>>>>  (?) Do I need to add security_set_watch(), say, to rule on whether a watch
>>>>>>      may be set in the first place?  I might need to add a variant per
>>>>>>      watch-type.
>>>>>>
>>>>>>  (?) Do I really need to keep track of the process creds in which an
>>>>>>      implicit object destruction happened?  For example, imagine you create
>>>>>>      an fd with fsopen()/fsmount().  It is marked to dissolve the mount it
>>>>>>      refers to on close unless move_mount() clears that flag.  Now, imagine
>>>>>>      someone looking at that fd through procfs at the same time as you exit
>>>>>>      due to an error.  The LSM sees the destruction notification come from
>>>>>>      the looker if they happen to do their fput() after yours.
>>>>> I remain unconvinced that (1), (2), (3), and the final (?) above are a good idea.
>>>>>
>>>>> For SELinux, I would expect that one would implement a collection of per watch-type WATCH permission checks on the target object (or to some well-defined object label like the kernel SID if there is no object) that allow receipt of all notifications of that watch-type for objects related to the target object, where "related to" is defined per watch-type.
>>>>>
>>>>> I wouldn't expect SELinux to implement security_post_notification() at all.  I can't see how one can construct a meaningful, stable policy for it.  I'd argue that the triggering process is not posting the notification; the kernel is posting the notification and the watcher has been authorized to receive it.
>>>> I cannot agree. There is an explicit action by a subject that results
>>>> in information being delivered to an object. Just like a signal or a
>>>> UDP packet delivery. Smack handles this kind of thing just fine. The
>>>> internal mechanism that results in the access is irrelevant from
>>>> this viewpoint. I can understand how a mechanism like SELinux that
>>>> works on finer granularity might view it differently.
>>> I think you really need to give an example of a coherent policy that
>>> needs this.
>> I keep telling you, and you keep ignoring what I say.
>>
>>>  As it stands, your analogy seems confusing.
>> It's pretty simple. I have given both the abstract
>> and examples.
> You gave the /dev/null example, which is inapplicable to this patchset.

That addressed an explicit objection, and pointed out
an exception to a generality you had asserted, which was
not true. It's also a red herring regarding the current
discussion.

>>>  If someone
>>> changes the system clock, we don't restrict who is allowed to be
>>> notified (via, for example, TFD_TIMER_CANCEL_ON_SET) that the clock
>>> was changed based on who changed the clock.
>> That's right. The system clock is not an object that
>> unprivileged processes can modify. In fact, it is not
>> an object at all. If you care to look, you will see that
>> Smack does nothing with the clock.
> And this is different from the mount tree how?

The mount tree can be modified by unprivileged users.
If nothing that unprivileged users can do to the mount
tree can trigger a notification you are correct, the
mount tree is very like the system clock. Is that the
case? 

>>>  Similarly, if someone
>>> tries to receive a packet on a socket, we check whether they have the
>>> right to receive on that socket (from the endpoint in question) and,
>>> if the sender is local, whether the sender can send to that socket.
>>> We do not check whether the sender can send to the receiver.
>> Bzzzt! Smack sure does.
> This seems dubious. I’m still trying to get you to explain to a non-Smack person why this makes sense.

Process A sends a packet to process B.
If A has access to TopSecret data and B is not
allowed to see TopSecret data, the delivery should
be prevented. Is that nonsensical?

>>> The signal example is inapplicable.
>> From a modeling viewpoint the actions are identical.
> This seems incorrect to me

What would be correct then? Some convoluted combination
of system entities that aren't owned or controlled by
any mechanism? 

>  and, I think, to most everyone else reading this.

That's quite the assertion. You may even be correct.

>  Can you explain?
>
> In SELinux-ese, when you write to a file, the subject is the writer and the object is the file.  When you send a signal to a process, the object is the target process.

YES!!!!!!!!!!!!

And when a process triggers a notification it is the subject
and the watching process is the object!

Subject == active entity
Object  == passive entity

Triggering an event is, like calling kill(), an action!

Andy Lutomirski June 10, 2019, 7:53 p.m. UTC | #7

On Mon, Jun 10, 2019 at 12:34 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
> >>> I think you really need to give an example of a coherent policy that
> >>> needs this.
> >> I keep telling you, and you keep ignoring what I say.
> >>
> >>>  As it stands, your analogy seems confusing.
> >> It's pretty simple. I have given both the abstract
> >> and examples.
> > You gave the /dev/null example, which is inapplicable to this patchset.
>
> That addressed an explicit objection, and pointed out
> an exception to a generality you had asserted, which was
> not true. It's also a red herring regarding the current
> discussion.

This argument is pointless.

Please humor me and just give me an example.  If you think you have
already done so, feel free to repeat yourself.  If you have no
example, then please just say so.

>
> >>>  If someone
> >>> changes the system clock, we don't restrict who is allowed to be
> >>> notified (via, for example, TFD_TIMER_CANCEL_ON_SET) that the clock
> >>> was changed based on who changed the clock.
> >> That's right. The system clock is not an object that
> >> unprivileged processes can modify. In fact, it is not
> >> an object at all. If you care to look, you will see that
> >> Smack does nothing with the clock.
> > And this is different from the mount tree how?
>
> The mount tree can be modified by unprivileged users.
> If nothing that unprivileged users can do to the mount
> tree can trigger a notification you are correct, the
> mount tree is very like the system clock. Is that the
> case?

The mount tree can't be modified by unprivileged users, unless a
privileged user very carefully configured it as such.  An unprivileged
user can create a new userns and a new mount ns, but then they're
modifying a whole different mount tree.

>
> >>>  Similarly, if someone
> >>> tries to receive a packet on a socket, we check whether they have the
> >>> right to receive on that socket (from the endpoint in question) and,
> >>> if the sender is local, whether the sender can send to that socket.
> >>> We do not check whether the sender can send to the receiver.
> >> Bzzzt! Smack sure does.
> > This seems dubious. I’m still trying to get you to explain to a non-Smack person why this makes sense.
>
> Process A sends a packet to process B.
> If A has access to TopSecret data and B is not
> allowed to see TopSecret data, the delivery should
> be prevented. Is that nonsensical?

It makes sense.  As I see it, the way that a sensible policy should do
this is by making sure that there are no sockets, pipes, etc that
Process A can write and that Process B can read.

If you really want to prevent a malicious process with TopSecret data
from sending it to a different process, then you can't use Linux on
x86 or ARM.  Maybe that will be fixed some day, but you're going to
need to use an extremely tight sandbox to make this work.

>
> >>> The signal example is inapplicable.
> >> From a modeling viewpoint the actions are identical.
> > This seems incorrect to me
>
> What would be correct then? Some convoluted combination
> of system entities that aren't owned or controlled by
> any mechanism?
>

POSIX signal restrictions aren't there to prevent two processes from
communicating.  They're there to prevent the sender from manipulating
or crashing the receiver without appropriate privilege.


> >  and, I think, to most everyone else reading this.
>
> That's quite the assertion. You may even be correct.
>
> >  Can you explain?
> >
> > In SELinux-ese, when you write to a file, the subject is the writer and the object is the file.  When you send a signal to a process, the object is the target process.
>
> YES!!!!!!!!!!!!
>
> And when a process triggers a notification it is the subject
> and the watching process is the object!
>
> Subject == active entity
> Object  == passive entity
>
> Triggering an event is, like calling kill(), an action!
>

And here is where I disagree with your interpretation.  Triggering an
event is a side effect of writing to the file.  There are *two*
security relevant actions, not one, and they are:

First, the write:

Subject == the writer
Action == write
Object == the file

Then the event, which could be modeled in a couple of ways:

Subject == the file
Action == notify
Object == the recipient

or

Subject == the recipient
Action == watch
Object == the file

By conflating these two actions into one, you've made the modeling
very hard, and you start running into all these nasty questions like
"who actually closed this open file"

Casey Schaufler June 10, 2019, 9:25 p.m. UTC | #8

On 6/10/2019 12:53 PM, Andy Lutomirski wrote:
> On Mon, Jun 10, 2019 at 12:34 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>>> I think you really need to give an example of a coherent policy that
>>>>> needs this.
>>>> I keep telling you, and you keep ignoring what I say.
>>>>
>>>>>  As it stands, your analogy seems confusing.
>>>> It's pretty simple. I have given both the abstract
>>>> and examples.
>>> You gave the /dev/null example, which is inapplicable to this patchset.
>> That addressed an explicit objection, and pointed out
>> an exception to a generality you had asserted, which was
>> not true. It's also a red herring regarding the current
>> discussion.
> This argument is pointless.
>
> Please humor me and just give me an example.  If you think you have
> already done so, feel free to repeat yourself.  If you have no
> example, then please just say so.

To repeat the /dev/null example:

Process A and process B both open /dev/null.
A and B can write and read to their hearts content
to/from /dev/null without ever once communicating.
The mutual accessibility of /dev/null in no way implies that
A and B can communicate. If A can set a watch on /dev/null,
and B triggers an event, there still has to be an access
check on the delivery of the event because delivering an event
to A is not an action on /dev/null, but on A.


>
>>>>>  If someone
>>>>> changes the system clock, we don't restrict who is allowed to be
>>>>> notified (via, for example, TFD_TIMER_CANCEL_ON_SET) that the clock
>>>>> was changed based on who changed the clock.
>>>> That's right. The system clock is not an object that
>>>> unprivileged processes can modify. In fact, it is not
>>>> an object at all. If you care to look, you will see that
>>>> Smack does nothing with the clock.
>>> And this is different from the mount tree how?
>> The mount tree can be modified by unprivileged users.
>> If nothing that unprivileged users can do to the mount
>> tree can trigger a notification you are correct, the
>> mount tree is very like the system clock. Is that the
>> case?
> The mount tree can't be modified by unprivileged users, unless a
> privileged user very carefully configured it as such.

"Unless" means *is* possible. In which case access control is
required. I will admit to being less then expert on the extent
to which mounts can be done without privilege.

>   An unprivileged
> user can create a new userns and a new mount ns, but then they're
> modifying a whole different mount tree.

Within those namespaces you can still have multiple users,
constrained be system access control policy.

>
>>>>>  Similarly, if someone
>>>>> tries to receive a packet on a socket, we check whether they have the
>>>>> right to receive on that socket (from the endpoint in question) and,
>>>>> if the sender is local, whether the sender can send to that socket.
>>>>> We do not check whether the sender can send to the receiver.
>>>> Bzzzt! Smack sure does.
>>> This seems dubious. I’m still trying to get you to explain to a non-Smack person why this makes sense.
>> Process A sends a packet to process B.
>> If A has access to TopSecret data and B is not
>> allowed to see TopSecret data, the delivery should
>> be prevented. Is that nonsensical?
> It makes sense.  As I see it, the way that a sensible policy should do
> this is by making sure that there are no sockets, pipes, etc that
> Process A can write and that Process B can read.

You can't explain UDP controls without doing the access check
on packet delivery. The sendmsg() succeeds when the packet leaves
the sender. There doesn't even have to be a socket bound to the
port. The only opportunity you have for control is on packet
delivery, which is the only point at which you can have the
information required.

> If you really want to prevent a malicious process with TopSecret data
> from sending it to a different process, then you can't use Linux on
> x86 or ARM.  Maybe that will be fixed some day, but you're going to
> need to use an extremely tight sandbox to make this work.

I won't be commenting on that.

>
>>>>> The signal example is inapplicable.
>>>> From a modeling viewpoint the actions are identical.
>>> This seems incorrect to me
>> What would be correct then? Some convoluted combination
>> of system entities that aren't owned or controlled by
>> any mechanism?
>>
> POSIX signal restrictions aren't there to prevent two processes from
> communicating.  They're there to prevent the sender from manipulating
> or crashing the receiver without appropriate privilege.

POSIX signal restrictions have a long history. In the P10031e/2c
debates both communication and manipulation where seriously
considered. I would say both are true.

>>>  and, I think, to most everyone else reading this.
>> That's quite the assertion. You may even be correct.
>>
>>>  Can you explain?
>>>
>>> In SELinux-ese, when you write to a file, the subject is the writer and the object is the file.  When you send a signal to a process, the object is the target process.
>> YES!!!!!!!!!!!!
>>
>> And when a process triggers a notification it is the subject
>> and the watching process is the object!
>>
>> Subject == active entity
>> Object  == passive entity
>>
>> Triggering an event is, like calling kill(), an action!
>>
> And here is where I disagree with your interpretation.  Triggering an
> event is a side effect of writing to the file.  There are *two*
> security relevant actions, not one, and they are:
>
> First, the write:
>
> Subject == the writer
> Action == write
> Object == the file
>
> Then the event, which could be modeled in a couple of ways:
>
> Subject == the file

Files   are   not   subjects. They are passive entities.

> Action == notify
> Object == the recipient
>
> or
>
> Subject == the recipient
> Action == watch
> Object == the file
>
> By conflating these two actions into one, you've made the modeling
> very hard, and you start running into all these nasty questions like
> "who actually closed this open file"

No, I've made the code more difficult. You can not call
the file a subject. That is just wrong. It's not a valid
model.

David Howells June 10, 2019, 10:07 p.m. UTC | #9

Casey Schaufler <casey@schaufler-ca.com> wrote:

> Process A and process B both open /dev/null.
> A and B can write and read to their hearts content
> to/from /dev/null without ever once communicating.
> The mutual accessibility of /dev/null in no way implies that
> A and B can communicate. If A can set a watch on /dev/null,
> and B triggers an event, there still has to be an access
> check on the delivery of the event because delivering an event
> to A is not an action on /dev/null, but on A.

If a process has the privilege, it appears that fanotify() allows that process
to see others accessing /dev/null (FAN_ACCESS, FAN_ACCESS_PERM).  There don't
seem to be any LSM checks there either.

On the other hand, the privilege required is CAP_SYS_ADMIN,

> > The mount tree can't be modified by unprivileged users, unless a
> > privileged user very carefully configured it as such.
> 
> "Unless" means *is* possible. In which case access control is
> required. I will admit to being less then expert on the extent
> to which mounts can be done without privilege.

Automounts in network filesystems, for example.

The initial mount of the network filesystem requires local privilege, but then
mountpoints are managed with remote privilege as granted by things like
kerberos tickets.  The local kernel has no control.

If you have CONFIG_AFS_FS enabled in your kernel, for example, and you install
the keyutils package (dnf, rpm, apt, etc.), then you should be able to do:

	mount -t afs none /mnt -o dyn
	ls /afs/grand.central.org/software/

for example.  That will go through a couple of automount points.  Assuming you
don't have a kerberos login on those servers, however, you shouldn't be able
to add new mountpoints.

Someone watching the mount topology can see events when an automount is
enacted and when it expires, the latter being an event with the system as the
subject since the expiry is done on a timeout set by the kernel.

David

Andy Lutomirski June 11, 2019, 12:13 a.m. UTC | #10

> On Jun 10, 2019, at 2:25 PM, Casey Schaufler <casey@schaufler-ca.com> wrote:
> 
>> On 6/10/2019 12:53 PM, Andy Lutomirski wrote:
>> On Mon, Jun 10, 2019 at 12:34 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>>>> I think you really need to give an example of a coherent policy that
>>>>>> needs this.
>>>>> I keep telling you, and you keep ignoring what I say.
>>>>> 
>>>>>> As it stands, your analogy seems confusing.
>>>>> It's pretty simple. I have given both the abstract
>>>>> and examples.
>>>> You gave the /dev/null example, which is inapplicable to this patchset.
>>> That addressed an explicit objection, and pointed out
>>> an exception to a generality you had asserted, which was
>>> not true. It's also a red herring regarding the current
>>> discussion.
>> This argument is pointless.
>> 
>> Please humor me and just give me an example.  If you think you have
>> already done so, feel free to repeat yourself.  If you have no
>> example, then please just say so.
> 
> To repeat the /dev/null example:
> 
> Process A and process B both open /dev/null.
> A and B can write and read to their hearts content
> to/from /dev/null without ever once communicating.
> The mutual accessibility of /dev/null in no way implies that
> A and B can communicate. If A can set a watch on /dev/null,
> and B triggers an event, there still has to be an access
> check on the delivery of the event because delivering an event
> to A is not an action on /dev/null, but on A.
> 

At discussed, this is an irrelevant straw man. This patch series does not produce events when this happens. I’m looking for a relevant example, please.
> 
> 
>>  An unprivileged
>> user can create a new userns and a new mount ns, but then they're
>> modifying a whole different mount tree.
> 
> Within those namespaces you can still have multiple users,
> constrained be system access control policy.

And the one doing the mounting will be constrained by MAC and DAC policy, as always.  The namespace creator is, from the perspective of those processes, admin.

> 
>> 
>>>>>> Similarly, if someone
>>>>>> tries to receive a packet on a socket, we check whether they have the
>>>>>> right to receive on that socket (from the endpoint in question) and,
>>>>>> if the sender is local, whether the sender can send to that socket.
>>>>>> We do not check whether the sender can send to the receiver.
>>>>> Bzzzt! Smack sure does.
>>>> This seems dubious. I’m still trying to get you to explain to a non-Smack person why this makes sense.
>>> Process A sends a packet to process B.
>>> If A has access to TopSecret data and B is not
>>> allowed to see TopSecret data, the delivery should
>>> be prevented. Is that nonsensical?
>> It makes sense.  As I see it, the way that a sensible policy should do
>> this is by making sure that there are no sockets, pipes, etc that
>> Process A can write and that Process B can read.
> 
> You can't explain UDP controls without doing the access check
> on packet delivery. The sendmsg() succeeds when the packet leaves
> the sender. There doesn't even have to be a socket bound to the
> port. The only opportunity you have for control is on packet
> delivery, which is the only point at which you can have the
> information required.

Huh?  You sendmsg() from an address to an address.  My point is that, for most purposes, that’s all the information that’s needed.

> 
>> If you really want to prevent a malicious process with TopSecret data
>> from sending it to a different process, then you can't use Linux on
>> x86 or ARM.  Maybe that will be fixed some day, but you're going to
>> need to use an extremely tight sandbox to make this work.
> 
> I won't be commenting on that.

Then why is preventing this is an absolute requirement? It’s unattainable.

> 
>> 
>>>>>> The signal example is inapplicable.
>>>>> From a modeling viewpoint the actions are identical.
>>>> This seems incorrect to me
>>> What would be correct then? Some convoluted combination
>>> of system entities that aren't owned or controlled by
>>> any mechanism?
>>> 
>> POSIX signal restrictions aren't there to prevent two processes from
>> communicating.  They're there to prevent the sender from manipulating
>> or crashing the receiver without appropriate privilege.
> 
> POSIX signal restrictions have a long history. In the P10031e/2c
> debates both communication and manipulation where seriously
> considered. I would say both are true.
> 
>>>> and, I think, to most everyone else reading this.
>>> That's quite the assertion. You may even be correct.
>>> 
>>>> Can you explain?
>>>> 
>>>> In SELinux-ese, when you write to a file, the subject is the writer and the object is the file.  When you send a signal to a process, the object is the target process.
>>> YES!!!!!!!!!!!!
>>> 
>>> And when a process triggers a notification it is the subject
>>> and the watching process is the object!
>>> 
>>> Subject == active entity
>>> Object  == passive entity
>>> 
>>> Triggering an event is, like calling kill(), an action!
>>> 
>> And here is where I disagree with your interpretation.  Triggering an
>> event is a side effect of writing to the file.  There are *two*
>> security relevant actions, not one, and they are:
>> 
>> First, the write:
>> 
>> Subject == the writer
>> Action == write
>> Object == the file
>> 
>> Then the event, which could be modeled in a couple of ways:
>> 
>> Subject == the file
> 
> Files   are   not   subjects. They are passive entities.
> 
>> Action == notify
>> Object == the recipient

Great. Then use the variant below.

>> 
>> or
>> 
>> Subject == the recipient
>> Action == watch
>> Object == the file
>> 
>> By conflating these two actions into one, you've made the modeling
>> very hard, and you start running into all these nasty questions like
>> "who actually closed this open file"
> 
> No, I've made the code more difficult.
> You can not call
> the file a subject. That is just wrong. It's not a valid
> model.

You’ve ignored the “Action == watch” variant. Do you care to comment?

David Howells June 11, 2019, 2:21 p.m. UTC | #11

To see if we can try and make progress on this, can we try and come at this
from another angle: what do LSMs *actually* need to do this?  And I grant that
each LSM might require different things.

-~-

[A] There are a bunch of things available, some of which may be coincident,
depending on the context:

 (1) The creds of the process that created a watch_queue (ie. opened
     /dev/watch_queue).

 (2) The creds of the process that set a watch (ie. called watch_sb,
     KEYCTL_NOTIFY, ...);

 (3) The creds of the process that tripped the event (which might be the
     system).

 (4) The security attributes of the object on which the watch was set (uid,
     gid, mode, labels).

 (5) The security attributes of the object on which the event was tripped.

 (6) The security attributes of all the objects between the object in (5) and
     the object in (4), assuming we work from (5) towards (4) if the two
     aren't coincident (WATCH_INFO_RECURSIVE).

At the moment, when post_one_notification() wants to write a notification into
a queue, it calls security_post_notification() to ask if it should be allowed
to do so.  This is passed (1) and (3) above plus the notification record.


[B] There are a number of places I can usefully potentially add hooks:

 (a) The point at which a watch queue is created (ie. /dev/watch_queue is
     opened).

 (b) The point at which a watch is set (ie. watch_sb).

 (c) The point at which a notification is generated (ie. an automount point is
     tripped).

 (d) The point at which a notification is delivered (ie. we write the message
     into the queue).

 (e) All the points at which we walk over an object in a chain from (c) to
     find the watch on which we can effect (d) (eg. we walk rootwards from a
     mountpoint to find watches on a branch in the mount topology).


[C] Problems that need to be resolved:

 (x) Do I need to put a security pointer in struct watch for the active LSM to
     fill in?  If so, I presume this would need passing to
     security_post_notification().

 (y) What checks should be done on object destruction after final put and what
     contexts need to be supplied?

     This one is made all the harder because the creds that are in force when
     close(), exit(), exec(), dup2(), etc. close a file descriptor might need
     to be propagated to deferred-fput, which must in turn propagate them to
     af_unix-cleanup, and thence back to deferred-fput and thence to implicit
     unmount (dissolve_on_fput()[*]).

     [*] Though it should be noted that if this happens, the subtree cannot be
     	 attached to the root of a namespace.

     Further, if several processes are sharing a file object, it's not
     predictable as to which process the final notification will come from.

 (z) Do intermediate objects, say in a mount topology notification, actually
     need to be checked against the watcher's creds?  For a mount topology
     notification, would this require calling inode_permission() for each
     intervening directory?

     Doing that might be impractical as it would probably have to be done
     outside of of the RCU read lock and the filesystem ->permission() hooks
     might want to sleep (to touch disk or talk to a server).

David

Stephen Smalley June 11, 2019, 2:32 p.m. UTC | #12

On 6/10/19 8:13 PM, Andy Lutomirski wrote:
> 
> 
>> On Jun 10, 2019, at 2:25 PM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>>
>>> On 6/10/2019 12:53 PM, Andy Lutomirski wrote:
>>> On Mon, Jun 10, 2019 at 12:34 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>>>>> I think you really need to give an example of a coherent policy that
>>>>>>> needs this.
>>>>>> I keep telling you, and you keep ignoring what I say.
>>>>>>
>>>>>>> As it stands, your analogy seems confusing.
>>>>>> It's pretty simple. I have given both the abstract
>>>>>> and examples.
>>>>> You gave the /dev/null example, which is inapplicable to this patchset.
>>>> That addressed an explicit objection, and pointed out
>>>> an exception to a generality you had asserted, which was
>>>> not true. It's also a red herring regarding the current
>>>> discussion.
>>> This argument is pointless.
>>>
>>> Please humor me and just give me an example.  If you think you have
>>> already done so, feel free to repeat yourself.  If you have no
>>> example, then please just say so.
>>
>> To repeat the /dev/null example:
>>
>> Process A and process B both open /dev/null.
>> A and B can write and read to their hearts content
>> to/from /dev/null without ever once communicating.
>> The mutual accessibility of /dev/null in no way implies that
>> A and B can communicate. If A can set a watch on /dev/null,
>> and B triggers an event, there still has to be an access
>> check on the delivery of the event because delivering an event
>> to A is not an action on /dev/null, but on A.
>>
> 
> At discussed, this is an irrelevant straw man. This patch series does not produce events when this happens. I’m looking for a relevant example, please.
>>
>>
>>>   An unprivileged
>>> user can create a new userns and a new mount ns, but then they're
>>> modifying a whole different mount tree.
>>
>> Within those namespaces you can still have multiple users,
>> constrained be system access control policy.
> 
> And the one doing the mounting will be constrained by MAC and DAC policy, as always.  The namespace creator is, from the perspective of those processes, admin.
> 
>>
>>>
>>>>>>> Similarly, if someone
>>>>>>> tries to receive a packet on a socket, we check whether they have the
>>>>>>> right to receive on that socket (from the endpoint in question) and,
>>>>>>> if the sender is local, whether the sender can send to that socket.
>>>>>>> We do not check whether the sender can send to the receiver.
>>>>>> Bzzzt! Smack sure does.
>>>>> This seems dubious. I’m still trying to get you to explain to a non-Smack person why this makes sense.
>>>> Process A sends a packet to process B.
>>>> If A has access to TopSecret data and B is not
>>>> allowed to see TopSecret data, the delivery should
>>>> be prevented. Is that nonsensical?
>>> It makes sense.  As I see it, the way that a sensible policy should do
>>> this is by making sure that there are no sockets, pipes, etc that
>>> Process A can write and that Process B can read.
>>
>> You can't explain UDP controls without doing the access check
>> on packet delivery. The sendmsg() succeeds when the packet leaves
>> the sender. There doesn't even have to be a socket bound to the
>> port. The only opportunity you have for control is on packet
>> delivery, which is the only point at which you can have the
>> information required.
> 
> Huh?  You sendmsg() from an address to an address.  My point is that, for most purposes, that’s all the information that’s needed.
> 
>>
>>> If you really want to prevent a malicious process with TopSecret data
>>> from sending it to a different process, then you can't use Linux on
>>> x86 or ARM.  Maybe that will be fixed some day, but you're going to
>>> need to use an extremely tight sandbox to make this work.
>>
>> I won't be commenting on that.
> 
> Then why is preventing this is an absolute requirement? It’s unattainable.
> 
>>
>>>
>>>>>>> The signal example is inapplicable.
>>>>>>  From a modeling viewpoint the actions are identical.
>>>>> This seems incorrect to me
>>>> What would be correct then? Some convoluted combination
>>>> of system entities that aren't owned or controlled by
>>>> any mechanism?
>>>>
>>> POSIX signal restrictions aren't there to prevent two processes from
>>> communicating.  They're there to prevent the sender from manipulating
>>> or crashing the receiver without appropriate privilege.
>>
>> POSIX signal restrictions have a long history. In the P10031e/2c
>> debates both communication and manipulation where seriously
>> considered. I would say both are true.
>>
>>>>> and, I think, to most everyone else reading this.
>>>> That's quite the assertion. You may even be correct.
>>>>
>>>>> Can you explain?
>>>>>
>>>>> In SELinux-ese, when you write to a file, the subject is the writer and the object is the file.  When you send a signal to a process, the object is the target process.
>>>> YES!!!!!!!!!!!!
>>>>
>>>> And when a process triggers a notification it is the subject
>>>> and the watching process is the object!
>>>>
>>>> Subject == active entity
>>>> Object  == passive entity
>>>>
>>>> Triggering an event is, like calling kill(), an action!
>>>>
>>> And here is where I disagree with your interpretation.  Triggering an
>>> event is a side effect of writing to the file.  There are *two*
>>> security relevant actions, not one, and they are:
>>>
>>> First, the write:
>>>
>>> Subject == the writer
>>> Action == write
>>> Object == the file
>>>
>>> Then the event, which could be modeled in a couple of ways:
>>>
>>> Subject == the file
>>
>> Files   are   not   subjects. They are passive entities.
>>
>>> Action == notify
>>> Object == the recipient
> 
> Great. Then use the variant below.
> 
>>>
>>> or
>>>
>>> Subject == the recipient
>>> Action == watch
>>> Object == the file
>>>
>>> By conflating these two actions into one, you've made the modeling
>>> very hard, and you start running into all these nasty questions like
>>> "who actually closed this open file"
>>
>> No, I've made the code more difficult.
>> You can not call
>> the file a subject. That is just wrong. It's not a valid
>> model.
> 
> You’ve ignored the “Action == watch” variant. Do you care to comment?

While I agree with this model in general, I will note two caveats when 
trying to apply this to watches/notifications:

1) The object on which the notification was triggered and the object on 
which the watch was placed are not necessarily the same and access to 
one might not imply access to the other,

2) If notifications can be triggered by read-like operations (as in 
fanotify, for example), then a "read" can be turned into a "write" flow 
through a notification.

Whether or not these caveats are applicable to the notifications in this 
series I am not clear.

Stephen Smalley June 11, 2019, 3:57 p.m. UTC | #13

On 6/11/19 10:21 AM, David Howells wrote:
> To see if we can try and make progress on this, can we try and come at this
> from another angle: what do LSMs *actually* need to do this?  And I grant that
> each LSM might require different things.

I think part of the problem here is that the discussion is too abstract 
and not dealing with the specifics of the notifications in question. 
Those details matter.

> 
> -~-
> 
> [A] There are a bunch of things available, some of which may be coincident,
> depending on the context:
> 
>   (1) The creds of the process that created a watch_queue (ie. opened
>       /dev/watch_queue).

These will be used when checking permissions to open /dev/watch_queue.

>   (2) The creds of the process that set a watch (ie. called watch_sb,
>       KEYCTL_NOTIFY, ...);

These will be used when checking permissions to set a watch.

>   (3) The creds of the process that tripped the event (which might be the
>       system).

These will be used when checking permission to perform whatever 
operation tripped the event (if the event is triggered by a userspace 
operation).

>   (4) The security attributes of the object on which the watch was set (uid,
>       gid, mode, labels).

These will be used when checking permissions to set the watch.

>   (5) The security attributes of the object on which the event was tripped.

These will be used when checking permission to perform whatever 
operation tripped the event.

>   (6) The security attributes of all the objects between the object in (5) and
>       the object in (4), assuming we work from (5) towards (4) if the two
>       aren't coincident (WATCH_INFO_RECURSIVE).

Does this apply to anything other than mount notifications?  And for 
mount notifications, isn't the notification actually for a change to the 
mount namespace, not a change to any file?  Hence, the real "object" for 
events that trigger mount notifications is the mount namespace, right? 
The watched path is just a way of identifying a subtree of the mount 
namespace for notifications - it isn't the real object being watched.

> At the moment, when post_one_notification() wants to write a notification into
> a queue, it calls security_post_notification() to ask if it should be allowed
> to do so.  This is passed (1) and (3) above plus the notification record.

Not convinced we need this.

> [B] There are a number of places I can usefully potentially add hooks:
> 
>   (a) The point at which a watch queue is created (ie. /dev/watch_queue is
>       opened).

Already covered by existing hooks on opening files.

>   (b) The point at which a watch is set (ie. watch_sb).

Yes, this requires a hook and corresponding check.

>   (c) The point at which a notification is generated (ie. an automount point is
>       tripped).

Preferably covered by existing hooks on object accesses that would 
generate notifications.

>   (d) The point at which a notification is delivered (ie. we write the message
>       into the queue).

Preferably not needed.

>   (e) All the points at which we walk over an object in a chain from (c) to
>       find the watch on which we can effect (d) (eg. we walk rootwards from a
>       mountpoint to find watches on a branch in the mount topology).

Not necessary if the real object of mount notifications is the mount 
namespace and if we do not support recursive notifications on e.g. 
directories or some other object where the two can truly diverge.

> [C] Problems that need to be resolved:
> 
>   (x) Do I need to put a security pointer in struct watch for the active LSM to
>       fill in?  If so, I presume this would need passing to
>       security_post_notification().

I don't see why or where it would get used.

>   (y) What checks should be done on object destruction after final put and what
>       contexts need to be supplied?

IMHO, no.

> 
>       This one is made all the harder because the creds that are in force when
>       close(), exit(), exec(), dup2(), etc. close a file descriptor might need
>       to be propagated to deferred-fput, which must in turn propagate them to
>       af_unix-cleanup, and thence back to deferred-fput and thence to implicit
>       unmount (dissolve_on_fput()[*]).
> 
>       [*] Though it should be noted that if this happens, the subtree cannot be
>       	 attached to the root of a namespace.
> 
>       Further, if several processes are sharing a file object, it's not
>       predictable as to which process the final notification will come from.
> 
>   (z) Do intermediate objects, say in a mount topology notification, actually
>       need to be checked against the watcher's creds?  For a mount topology
>       notification, would this require calling inode_permission() for each
>       intervening directory?

I don't think so, because the real object is the mount namespace, not 
the individual directories.

> 
>       Doing that might be impractical as it would probably have to be done
>       outside of of the RCU read lock and the filesystem ->permission() hooks
>       might want to sleep (to touch disk or talk to a server).

Casey Schaufler June 11, 2019, 4:22 p.m. UTC | #14

On 6/11/2019 7:21 AM, David Howells wrote:
> To see if we can try and make progress on this, can we try and come at this
> from another angle: what do LSMs *actually* need to do this?  And I grant that
> each LSM might require different things.
>
> -~-
>
> [A] There are a bunch of things available, some of which may be coincident,
> depending on the context:
>
>  (1) The creds of the process that created a watch_queue (ie. opened
>      /dev/watch_queue).

Smack needs this for the filesystem access required to open /dev/watch_queue.

>  (2) The creds of the process that set a watch (ie. called watch_sb,
>      KEYCTL_NOTIFY, ...);

Smack needs this to set a watch on any named object (file, key, ...).
Smack needs this as the object information for event delivery.

>  (3) The creds of the process that tripped the event (which might be the
>      system).

Smack needs this as the subject information for the event delivery.

>  (4) The security attributes of the object on which the watch was set (uid,
>      gid, mode, labels).

Smack needs this to set a watch on the named object (file, key, ...).
I am going to say that if you can't access an object you can't watch it.
I think that read access is sufficient provided that no one else can
see what watches I've created.

>  (5) The security attributes of the object on which the event was tripped.

Smack does not need these for the event mechanism as that object isn't
involved in the event delivery, except as may be required by (4).

>  (6) The security attributes of all the objects between the object in (5) and
>      the object in (4), assuming we work from (5) towards (4) if the two
>      aren't coincident (WATCH_INFO_RECURSIVE).

Smack needs these only as they would apply to (4).

> At the moment, when post_one_notification() wants to write a notification into
> a queue, it calls security_post_notification() to ask if it should be allowed
> to do so.  This is passed (1) and (3) above plus the notification record.

Is "current" (2)? Smack needs (2) for the event delivery access check.

> [B] There are a number of places I can usefully potentially add hooks:
>
>  (a) The point at which a watch queue is created (ie. /dev/watch_queue is
>      opened).

Smack would not need a new check as the filesystem checks should suffice.

>  (b) The point at which a watch is set (ie. watch_sb).

Smack would need a check to ensure the watcher has access
in cases where what is being watched is an object.

>  (c) The point at which a notification is generated (ie. an automount point is
>      tripped).

Smack does not require an explicit check here.

>  (d) The point at which a notification is delivered (ie. we write the message
>      into the queue).

Smack requires a check here. (2) as the object and (3) as the subject.

>  (e) All the points at which we walk over an object in a chain from (c) to
>      find the watch on which we can effect (d) (eg. we walk rootwards from a
>      mountpoint to find watches on a branch in the mount topology).

Smack does not require anything beyond existing checks.

> [C] Problems that need to be resolved:
>
>  (x) Do I need to put a security pointer in struct watch for the active LSM to
>      fill in?  If so, I presume this would need passing to
>      security_post_notification().

Smack does not need this.

>  (y) What checks should be done on object destruction after final put and what
>      contexts need to be supplied?

Classically there is no such thing as filesystem object deletion.
By making it possible to set a watch on that you've inadvertently
added a security relevant action to the security model. :o

>      This one is made all the harder because the creds that are in force when
>      close(), exit(), exec(), dup2(), etc. close a file descriptor might need
>      to be propagated to deferred-fput, which must in turn propagate them to
>      af_unix-cleanup, and thence back to deferred-fput and thence to implicit
>      unmount (dissolve_on_fput()[*]).
>
>      [*] Though it should be noted that if this happens, the subtree cannot be
>      	 attached to the root of a namespace.
>
>      Further, if several processes are sharing a file object, it's not
>      predictable as to which process the final notification will come from.

How about we don't add filesystem object deletion to the security model?

If what you really care about is removal of last filesystem reference
(last unlink) this shouldn't be any harder than any other watched change
(e.g. chmod) to the object.

If, on the other hand, you really want to watch for the last inode
reference and actual destruction of the thing I will suggest an
argument based on the model itself. If all of the directory entries
are unlinked the object no longer exists in the filesystem namespace.
If all of the fd's are closed (by whatever mechanism, we don't really
care) it no longer exists in any process space. At that point it has
no names, and is no longer a named object. No process (subject) deletes
it. They can't. They don't have access to it without a name. The
deletion is a system event, like setting the clock. There is no policy
that says when or even if the destruction occurs.

If and when the system gets around to cleaning up what has become
nothing more than system resources any outstanding watches can be
triggered using the system credential.

>  (z) Do intermediate objects, say in a mount topology notification, actually
>      need to be checked against the watcher's creds?  For a mount topology
>      notification, would this require calling inode_permission() for each
>      intervening directory?

Smack would not require this. Paths are an illusion.

>      Doing that might be impractical as it would probably have to be done
>      outside of of the RCU read lock and the filesystem ->permission() hooks
>      might want to sleep (to touch disk or talk to a server).
>
> David

David Howells June 12, 2019, 8:55 a.m. UTC | #15

Stephen Smalley <sds@tycho.nsa.gov> wrote:

> 2) If notifications can be triggered by read-like operations (as in fanotify,
> for example), then a "read" can be turned into a "write" flow through a
> notification.

I don't think any of the things can be classed as "read-like" operations.  At
the moment, there are the following groups:

 (1) Addition of objects (eg. key_link, mount).

 (2) Modifications to things (eg. keyctl_write, remount).

 (3) Removal of objects (eg. key_unlink, unmount, fput+FMODE_NEED_UNMOUNT).

 (4) I/O or hardware errors (eg. USB device add/remove, EDQUOT, ENOSPC).

I have not currently defined any access events.

I've been looking at the possibility of having epoll generate events this way,
but that's not as straightforward as I'd hoped and fanotify could potentially
use it also, but in both those cases, the process is already getting the
events currently by watching for them using synchronous waiting syscalls.
Instead this would generate an event to say it had happened.

David

David Howells June 12, 2019, 11:43 a.m. UTC | #16

Stephen Smalley <sds@tycho.nsa.gov> wrote:

> >   (6) The security attributes of all the objects between the object in (5)
> >       and the object in (4), assuming we work from (5) towards (4) if the
> >       two aren't coincident (WATCH_INFO_RECURSIVE).
> 
> Does this apply to anything other than mount notifications?

Not at the moment.  I'm considering making it such that you can make a watch
on a keyring get automatically propagated to keys that get added to the
keyring (and removed upon unlink) - the idea being that there is no 'single
parent path' concept for a keyring as there is for a directory.

I'm also pondering the idea of making it possible to have superblock watches
automatically propagated to superblocks created by automount points on the
watched superblock.

> And for mount notifications, isn't the notification actually for a change to
> the mount namespace, not a change to any file?

Yes.

> Hence, the real "object" for events that trigger mount notifications is the
> mount namespace, right?

Um... arguably.  Would that mean that that would need a label from somewhere?

> The watched path is just a way of identifying a subtree of the mount
> namespace for notifications - it isn't the real object being watched.

I like that argument.

Thanks,
David

David Howells June 12, 2019, 5:41 p.m. UTC | #17

Casey Schaufler <casey@schaufler-ca.com> wrote:

> >  (4) The security attributes of the object on which the watch was set (uid,
> >      gid, mode, labels).
> 
> Smack needs this to set a watch on the named object (file, key, ...).
> I am going to say that if you can't access an object you can't watch it.

So for the things I've so far defined:

 (*) Keys/keyrings require View permission, but it could be Read permission
     instead - though that may get disabled if the key type does not support
     KEYCTL_READ.

 (*) Mount/superblock watches - I've made these require execute permission on
     the specified directory.  Could be read permission instead.

 (*) Device events (block/usb) don't require any permissions, but currently
     only deliver hardware notifications.

> I think that read access is sufficient provided that no one else can
> see what watches I've created.

You can't find out what watches exist.

> > At the moment, when post_one_notification() wants to write a notification
> > into a queue, it calls security_post_notification() to ask if it should be
> > allowed to do so.  This is passed (1) and (3) above plus the notification
> > record.
> 
> Is "current" (2)? Smack needs (2) for the event delivery access check.

(2) was current_cred() when watch_sb(), KEYCTL_NOTIFY, etc. was called, but
isn't necessarily current_cred() at the point post_one_notification() is
called.  The latter is called at the point the event is generated and
current_cred() is the creds of whatever thread that is called in (which may be
a work_queue thread if it got deferred).

At the moment, I'm storing the creds of whoever opened the queue (ie. (1)) and
using that, but I could cache the creds of whoever created each watch
(ie. (2)) and pass that to post_one_notification() instead.

However, it should be noted that (1) is the creds of the buffer owner.

> >  (e) All the points at which we walk over an object in a chain from (c) to
> >      find the watch on which we can effect (d) (eg. we walk rootwards from a
> >      mountpoint to find watches on a branch in the mount topology).
> 
> Smack does not require anything beyond existing checks.

I'm glad to hear that, as this might be sufficiently impractical as to render
it unusable with Smack.  Calling i_op->permissions() a lot would suck.

> >  (y) What checks should be done on object destruction after final put and
> >      what contexts need to be supplied?
> 
> Classically there is no such thing as filesystem object deletion.
> By making it possible to set a watch on that you've inadvertently
> added a security relevant action to the security model. :o

That wasn't my original intention - I intended fsmount(2) to mount directly as
mount(2) does, but Al had other ideas.

Now fsmount(2) produces a file descriptor referring to a new mount object that
can be mounted into by move_mount(2) before being spliced into the mount
topology by move_mount(2).  However, if the fd is closed before the last step,
close() will destroy the mount subtree attached to it (kind of quasi-unmount).

That wouldn't be a problem, except that the fd from fsmount(2) can be used
anywhere an O_PATH fd can be used - including watch_mount(2), fchdir(2), ...
Further, FMODE_NEED_UNMOUNT gets cleared if the mount is spliced in at least
once.

Okay, having tried it you don't get an unmount event (since there's no actual
unmount), but you do get an event to say that your watch got deleted (if the
directory on which the watch was placed got removed from the system).

So...  does the "your watch got deleted" message need checking?  In my
opinion, it shouldn't get discarded because then the watcher doesn't know
their watch went away.

David

Casey Schaufler June 12, 2019, 6:14 p.m. UTC | #18

On 6/12/2019 10:41 AM, David Howells wrote:
> Casey Schaufler <casey@schaufler-ca.com> wrote:
>
>>>  (4) The security attributes of the object on which the watch was set (uid,
>>>      gid, mode, labels).
>> Smack needs this to set a watch on the named object (file, key, ...).
>> I am going to say that if you can't access an object you can't watch it.
> So for the things I've so far defined:
>
>  (*) Keys/keyrings require View permission, but it could be Read permission
>      instead - though that may get disabled if the key type does not support
>      KEYCTL_READ.

View is good enough.

>  (*) Mount/superblock watches - I've made these require execute permission on
>      the specified directory.  Could be read permission instead.

Execute is good enough.

>  (*) Device events (block/usb) don't require any permissions, but currently
>      only deliver hardware notifications.

How do you specify what device you want to watch?
Don't you have to access a /dev/something?

"currently" makes me nervous.

>> I think that read access is sufficient provided that no one else can
>> see what watches I've created.
> You can't find out what watches exist.

Not even your own?


>>> At the moment, when post_one_notification() wants to write a notification
>>> into a queue, it calls security_post_notification() to ask if it should be
>>> allowed to do so.  This is passed (1) and (3) above plus the notification
>>> record.
>> Is "current" (2)? Smack needs (2) for the event delivery access check.
> (2) was current_cred() when watch_sb(), KEYCTL_NOTIFY, etc. was called, but
> isn't necessarily current_cred() at the point post_one_notification() is
> called.  The latter is called at the point the event is generated and
> current_cred() is the creds of whatever thread that is called in (which may be
> a work_queue thread if it got deferred).
>
> At the moment, I'm storing the creds of whoever opened the queue (ie. (1)) and
> using that, but I could cache the creds of whoever created each watch
> (ie. (2)) and pass that to post_one_notification() instead.
>
> However, it should be noted that (1) is the creds of the buffer owner.

How are buffers shared? Who besides the buffer creator can use it?

>>>  (e) All the points at which we walk over an object in a chain from (c) to
>>>      find the watch on which we can effect (d) (eg. we walk rootwards from a
>>>      mountpoint to find watches on a branch in the mount topology).
>> Smack does not require anything beyond existing checks.
> I'm glad to hear that, as this might be sufficiently impractical as to render
> it unusable with Smack.  Calling i_op->permissions() a lot would suck.
>
>>>  (y) What checks should be done on object destruction after final put and
>>>      what contexts need to be supplied?
>> Classically there is no such thing as filesystem object deletion.
>> By making it possible to set a watch on that you've inadvertently
>> added a security relevant action to the security model. :o
> That wasn't my original intention - I intended fsmount(2) to mount directly as
> mount(2) does, but Al had other ideas.
>
> Now fsmount(2) produces a file descriptor referring to a new mount object that
> can be mounted into by move_mount(2) before being spliced into the mount
> topology by move_mount(2).  However, if the fd is closed before the last step,
> close() will destroy the mount subtree attached to it (kind of quasi-unmount).
>
> That wouldn't be a problem, except that the fd from fsmount(2) can be used
> anywhere an O_PATH fd can be used - including watch_mount(2), fchdir(2), ...
> Further, FMODE_NEED_UNMOUNT gets cleared if the mount is spliced in at least
> once.
>
> Okay, having tried it you don't get an unmount event (since there's no actual
> unmount), but you do get an event to say that your watch got deleted (if the
> directory on which the watch was placed got removed from the system).
>
> So...  does the "your watch got deleted" message need checking?  In my
> opinion, it shouldn't get discarded because then the watcher doesn't know
> their watch went away.

Can you glean information from the watch being deleted?
I wouldn't think so, and it seems like a one-time event
from the system, so I don't think an access check would
be required.

>
> David

David Howells June 12, 2019, 6:36 p.m. UTC | #19

Casey Schaufler <casey@schaufler-ca.com> wrote:

> >  (*) Device events (block/usb) don't require any permissions, but currently
> >      only deliver hardware notifications.
> 
> How do you specify what device you want to watch?

It's a general queue.

> Don't you have to access a /dev/something?

Not at the moment.  One problem is that there may not be a /dev/something for
a device (take a bridge for example), and even if it does, the device driver
doesn't necessarily have access to the path.  The messages contain the device
name string as appears in dmesg ("3-7" for a USB device, for example).

I think it would be wise to limit the general queue to hardware events that
either get triggered by someone physically mucking around with the hardware or
device errors, such as bad sectors or links going up and down.

> > You can't find out what watches exist.
> 
> Not even your own?

No.

> > However, it should be noted that (1) is the creds of the buffer owner.
> 
> How are buffers shared? Who besides the buffer creator can use it?

When you open /dev/watch_queue, you get buffers private to that file object; a
second open of the device, even by the same process, will get different
buffers.

The buffers are 'attached' to that file and are accessed by calling mmap() on
the fd; shareability is governed by how shareable the fd and a mapping are
shareable.

> Can you glean information from the watch being deleted?
> I wouldn't think so, and it seems like a one-time event
> from the system, so I don't think an access check would
> be required.

As you say, it's a one-time message per watch.  The object that got deleted
would need to be recreated, rewatched and made available to both parties.

David

Stephen Smalley June 13, 2019, 6:46 p.m. UTC | #20

On 6/12/19 7:43 AM, David Howells wrote:
> Stephen Smalley <sds@tycho.nsa.gov> wrote:
> 
>>>    (6) The security attributes of all the objects between the object in (5)
>>>        and the object in (4), assuming we work from (5) towards (4) if the
>>>        two aren't coincident (WATCH_INFO_RECURSIVE).
>>
>> Does this apply to anything other than mount notifications?
> 
> Not at the moment.  I'm considering making it such that you can make a watch
> on a keyring get automatically propagated to keys that get added to the
> keyring (and removed upon unlink) - the idea being that there is no 'single
> parent path' concept for a keyring as there is for a directory.
> 
> I'm also pondering the idea of making it possible to have superblock watches
> automatically propagated to superblocks created by automount points on the
> watched superblock.

So at the point where you can set a watch on one object O1 with label X, 
and receive notifications triggered upon operations on another object O2 
with label Y, we have to consider whether the relationship between X and 
Y is controlled in any way (possibly transitively through a series of 
checks performed earlier) and whether we can reasonably infer that the 
authorization to watch X implies the ability to be notified of 
operations on Y.  Not a problem for the mount notifications AFAICS 
because there is only truly one object - the mount namespace itself, and 
it is always our own.

> 
>> And for mount notifications, isn't the notification actually for a change to
>> the mount namespace, not a change to any file?
> 
> Yes.
> 
>> Hence, the real "object" for events that trigger mount notifications is the
>> mount namespace, right?
> 
> Um... arguably.  Would that mean that that would need a label from somewhere?

That takes us into the whole question of whether namespaces should be 
labeled (presumably from their creator), and the association between 
processes and their namespaces should be controlled.  I think when we 
originally looked at them, it wasn't much of a concern since the only 
means of creating a new namespace and associating with it was via 
clone() and then later also via unshare().  /proc/pid/ns and setns() 
changed that picture, but still requires ptrace read mode access, which 
at least provides some control over entering namespaces created by 
others. I suspect that ultimately we want namespaces to be labeled and 
controlled but that isn't your problem to solve here.

For your purposes, a process is setting a watch on its own namespace, 
and it already inherently can observe changes to that namespace without 
needing watches/notifications, and it can modify that namespace iff 
privileged wrt to the namespace.  One might argue that no check is 
required at all for setting the watch, and at most, it would be a check 
between the process and its own label to match the checking when 
accessing /proc/self/mounts. That presumes that no additional 
information is conveyed via the notification that isn't already 
available from /proc/self/mounts, particularly any information specific 
to the process that triggered the notification.  Does that make sense?

> 
>> The watched path is just a way of identifying a subtree of the mount
>> namespace for notifications - it isn't the real object being watched.
> 
> I like that argument.
> 
> Thanks,
> David
>

[RFC,00/13] Mount, FS, Block and Keyrings notifications [ver #4]

Message

Comments