[RFC,00/13] Mount, FS, Block and Keyrings notifications [ver #4]

Message ID	155991702981.15579.6007568669839441045.stgit@warthog.procyon.org.uk (mailing list archive)
Headers	show Return-Path: <linux-block-owner@kernel.org> Subject: [RFC][PATCH 00/13] Mount, FS, Block and Keyrings notifications [ver #4] From: David Howells <dhowells@redhat.com> To: viro@zeniv.linux.org.uk Cc: linux-usb@vger.kernel.org, linux-security-module@vger.kernel.org, Casey Schaufler <casey@schaufler-ca.com>, Stephen Smalley <sds@tycho.nsa.gov>, Greg Kroah-Hartman <gregkh@linuxfoundation.org>, dhowells@redhat.com, raven@themaw.net, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, linux-block@vger.kernel.org, keyrings@vger.kernel.org, linux-security-module@vger.kernel.org, linux-kernel@vger.kernel.org Date: Fri, 07 Jun 2019 15:17:10 +0100 Message-ID: <155991702981.15579.6007568669839441045.stgit@warthog.procyon.org.uk> User-Agent: StGit/unknown-version MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Sender: linux-block-owner@vger.kernel.org Precedence: bulk
Series	Mount, FS, Block and Keyrings notifications [ver #4] \| expand [RFC,00/13] Mount, FS, Block and Keyrings notifications [ver #4] [01/13] security: Override creds in __fput() with last fputter's creds [ver #4] [02/13] uapi: General notification ring definitions [ver #4] [03/13] security: Add hooks to rule on setting a watch [ver #4] [04/13] security: Add a hook for the point of notification insertion [ver #4] [05/13] General notification queue with user mmap()'able ring buffer [ver #4] [06/13] keys: Add a notification facility [ver #4] [07/13] vfs: Add a mount-notification facility [ver #4] [08/13] vfs: Add superblock notifications [ver #4] [09/13] fsinfo: Export superblock notification counter [ver #4] [10/13] Add a general, global device notification watch list [ver #4] [11/13] block: Add block layer notifications [ver #4] [12/13] usb: Add USB subsystem notifications [ver #4] [13/13] Add sample notification program [ver #4]

David Howells June 7, 2019, 2:17 p.m. UTC

Hi Al,

Here's a set of patches to add a general variable-length notification queue
concept and to add sources of events for:

 (1) Mount topology events, such as mounting, unmounting, mount expiry,
     mount reconfiguration.

 (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O
     errors (not complete yet).

 (3) Key/keyring events, such as creating, linking and removal of keys.

 (4) General device events (single common queue) including:

     - Block layer events, such as device errors

     - USB subsystem events, such as device/bus attach/remove, device
       reset, device errors.

One of the reasons for this is so that we can remove the issue of processes
having to repeatedly and regularly scan /proc/mounts, which has proven to
be a system performance problem.  To further aid this, the fsinfo() syscall
on which this patch series depends, provides a way to access superblock and
mount information in binary form without the need to parse /proc/mounts.


LSM support is included, but controversial:

 (1) The creds of the process that did the fput() that reduced the refcount
     to zero are cached in the file struct.

 (2) __fput() overrides the current creds with the creds from (1) whilst
     doing the cleanup, thereby making sure that the creds seen by the
     destruction notification generated by mntput() appears to come from
     the last fputter.

 (3) security_post_notification() is called for each queue that we might
     want to post a notification into, thereby allowing the LSM to prevent
     covert communications.

 (?) Do I need to add security_set_watch(), say, to rule on whether a watch
     may be set in the first place?  I might need to add a variant per
     watch-type.

 (?) Do I really need to keep track of the process creds in which an
     implicit object destruction happened?  For example, imagine you create
     an fd with fsopen()/fsmount().  It is marked to dissolve the mount it
     refers to on close unless move_mount() clears that flag.  Now, imagine
     someone looking at that fd through procfs at the same time as you exit
     due to an error.  The LSM sees the destruction notification come from
     the looker if they happen to do their fput() after yours.


Design decisions:

 (1) A misc chardev is used to create and open a ring buffer:

	fd = open("/dev/watch_queue", O_RDWR);

     which is then configured and mmap'd into userspace:

	ioctl(fd, IOC_WATCH_QUEUE_SET_SIZE, BUF_SIZE);
	ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter);
	buf = mmap(NULL, BUF_SIZE * page_size, PROT_READ | PROT_WRITE,
		   MAP_SHARED, fd, 0);

     The fd cannot be read or written (though there is a facility to use
     write to inject records for debugging) and userspace just pulls data
     directly out of the buffer.

 (2) The ring index pointers are stored inside the ring and are thus
     accessible to userspace.  Userspace should only update the tail
     pointer and never the head pointer or risk breaking the buffer.  The
     kernel checks that the pointers appear valid before trying to use
     them.  A 'skip' record is maintained around the pointers.

 (3) poll() can be used to wait for data to appear in the buffer.

 (4) Records in the buffer are binary, typed and have a length so that they
     can be of varying size.

     This means that multiple heterogeneous sources can share a common
     buffer.  Tags may be specified when a watchpoint is created to help
     distinguish the sources.

 (5) The queue is reusable as there are 16 million types available, of
     which I've used 4, so there is scope for others to be used.

 (6) Records are filterable as types have up to 256 subtypes that can be
     individually filtered.  Other filtration is also available.

 (7) Each time the buffer is opened, a new buffer is created - this means
     that there's no interference between watchers.

 (8) When recording a notification, the kernel will not sleep, but will
     rather mark a queue as overrun if there's insufficient space, thereby
     avoiding userspace causing the kernel to hang.

 (9) The 'watchpoint' should be specific where possible, meaning that you
     specify the object that you want to watch.

(10) The buffer is created and then watchpoints are attached to it, using
     one of:

	keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fd, 0x01);
	mount_notify(AT_FDCWD, "/", 0, fd, 0x02);
	sb_notify(AT_FDCWD, "/mnt", 0, fd, 0x03);

     where in all three cases, fd indicates the queue and the number after
     is a tag between 0 and 255.

(11) The watch must be removed if either the watch buffer is destroyed or
     the watched object is destroyed.


Things I want to avoid:

 (1) Introducing features that make the core VFS dependent on the network
     stack or networking namespaces (ie. usage of netlink).

 (2) Dumping all this stuff into dmesg and having a daemon that sits there
     parsing the output and distributing it as this then puts the
     responsibility for security into userspace and makes handling
     namespaces tricky.  Further, dmesg might not exist or might be
     inaccessible inside a container.

 (3) Letting users see events they shouldn't be able to see.


Further things that could be considered:

 (1) Adding a keyctl call to allow a watch on a keyring to be extended to
     "children" of that keyring, such that the watch is removed from the
     child if it is unlinked from the keyring.

 (2) Adding global superblock event queue.

 (3) Propagating watches to child superblock over automounts.


The patches can be found here also:

	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=notifications

Changes:

 v4: Split the basic UAPI bits out into their own patch and then split the
     LSM hooks out into an intermediate patch.  Add LSM hooks for setting
     watches.

     Rename the *_notify() system calls to watch_*() for consistency.

 v3: I've added a USB notification source and reformulated the block
     notification source so that there's now a common watch list, for which
     the system call is now device_notify().

     I've assigned a pair of unused ioctl numbers in the 'W' series to the
     ioctls added by this series.

     I've also added a description of the kernel API to the documentation.

 v2: I've fixed various issues raised by Jann Horn and GregKH and moved to
     krefs for refcounting.  I've added some security features to try and
     give Casey Schaufler the LSM control he wants.

David
---
David Howells (13):
      security: Override creds in __fput() with last fputter's creds
      uapi: General notification ring definitions
      security: Add hooks to rule on setting a watch
      security: Add a hook for the point of notification insertion
      General notification queue with user mmap()'able ring buffer
      keys: Add a notification facility
      vfs: Add a mount-notification facility
      vfs: Add superblock notifications
      fsinfo: Export superblock notification counter
      Add a general, global device notification watch list
      block: Add block layer notifications
      usb: Add USB subsystem notifications
      Add sample notification program


 Documentation/ioctl/ioctl-number.txt   |    1 
 Documentation/security/keys/core.rst   |   58 ++
 Documentation/watch_queue.rst          |  492 ++++++++++++++++++
 arch/x86/entry/syscalls/syscall_32.tbl |    3 
 arch/x86/entry/syscalls/syscall_64.tbl |    3 
 block/Kconfig                          |    9 
 block/blk-core.c                       |   29 +
 drivers/base/Kconfig                   |    9 
 drivers/base/Makefile                  |    1 
 drivers/base/watch.c                   |   89 +++
 drivers/misc/Kconfig                   |   13 
 drivers/misc/Makefile                  |    1 
 drivers/misc/watch_queue.c             |  889 ++++++++++++++++++++++++++++++++
 drivers/usb/core/Kconfig               |   10 
 drivers/usb/core/devio.c               |   55 ++
 drivers/usb/core/hub.c                 |    3 
 fs/Kconfig                             |   21 +
 fs/Makefile                            |    1 
 fs/file_table.c                        |   12 
 fs/fsinfo.c                            |   12 
 fs/mount.h                             |   33 +
 fs/mount_notify.c                      |  187 +++++++
 fs/namespace.c                         |    9 
 fs/super.c                             |  122 ++++
 include/linux/blkdev.h                 |   15 +
 include/linux/dcache.h                 |    1 
 include/linux/device.h                 |    7 
 include/linux/fs.h                     |   79 +++
 include/linux/key.h                    |    4 
 include/linux/lsm_hooks.h              |   48 ++
 include/linux/security.h               |   35 +
 include/linux/syscalls.h               |    5 
 include/linux/usb.h                    |   19 +
 include/linux/watch_queue.h            |   87 +++
 include/uapi/linux/fsinfo.h            |   10 
 include/uapi/linux/keyctl.h            |    1 
 include/uapi/linux/watch_queue.h       |  213 ++++++++
 kernel/sys_ni.c                        |    7 
 samples/Kconfig                        |    6 
 samples/Makefile                       |    1 
 samples/vfs/test-fsinfo.c              |   13 
 samples/watch_queue/Makefile           |    9 
 samples/watch_queue/watch_test.c       |  308 +++++++++++
 security/keys/Kconfig                  |   10 
 security/keys/compat.c                 |    2 
 security/keys/gc.c                     |    5 
 security/keys/internal.h               |   30 +
 security/keys/key.c                    |   37 +
 security/keys/keyctl.c                 |   95 +++
 security/keys/keyring.c                |   17 -
 security/keys/request_key.c            |    4 
 security/security.c                    |   29 +
 52 files changed, 3121 insertions(+), 38 deletions(-)
 create mode 100644 Documentation/watch_queue.rst
 create mode 100644 drivers/base/watch.c
 create mode 100644 drivers/misc/watch_queue.c
 create mode 100644 fs/mount_notify.c
 create mode 100644 include/linux/watch_queue.h
 create mode 100644 include/uapi/linux/watch_queue.h
 create mode 100644 samples/watch_queue/Makefile
 create mode 100644 samples/watch_queue/watch_test.c

Stephen Smalley June 10, 2019, 3:21 p.m. UTC | #1

On 6/7/19 10:17 AM, David Howells wrote:
> 
> Hi Al,
> 
> Here's a set of patches to add a general variable-length notification queue
> concept and to add sources of events for:
> 
>   (1) Mount topology events, such as mounting, unmounting, mount expiry,
>       mount reconfiguration.
> 
>   (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O
>       errors (not complete yet).
> 
>   (3) Key/keyring events, such as creating, linking and removal of keys.
> 
>   (4) General device events (single common queue) including:
> 
>       - Block layer events, such as device errors
> 
>       - USB subsystem events, such as device/bus attach/remove, device
>         reset, device errors.
> 
> One of the reasons for this is so that we can remove the issue of processes
> having to repeatedly and regularly scan /proc/mounts, which has proven to
> be a system performance problem.  To further aid this, the fsinfo() syscall
> on which this patch series depends, provides a way to access superblock and
> mount information in binary form without the need to parse /proc/mounts.
> 
> 
> LSM support is included, but controversial:
> 
>   (1) The creds of the process that did the fput() that reduced the refcount
>       to zero are cached in the file struct.
> 
>   (2) __fput() overrides the current creds with the creds from (1) whilst
>       doing the cleanup, thereby making sure that the creds seen by the
>       destruction notification generated by mntput() appears to come from
>       the last fputter.
> 
>   (3) security_post_notification() is called for each queue that we might
>       want to post a notification into, thereby allowing the LSM to prevent
>       covert communications.
> 
>   (?) Do I need to add security_set_watch(), say, to rule on whether a watch
>       may be set in the first place?  I might need to add a variant per
>       watch-type.
> 
>   (?) Do I really need to keep track of the process creds in which an
>       implicit object destruction happened?  For example, imagine you create
>       an fd with fsopen()/fsmount().  It is marked to dissolve the mount it
>       refers to on close unless move_mount() clears that flag.  Now, imagine
>       someone looking at that fd through procfs at the same time as you exit
>       due to an error.  The LSM sees the destruction notification come from
>       the looker if they happen to do their fput() after yours.

I remain unconvinced that (1), (2), (3), and the final (?) above are a 
good idea.

For SELinux, I would expect that one would implement a collection of per 
watch-type WATCH permission checks on the target object (or to some 
well-defined object label like the kernel SID if there is no object) 
that allow receipt of all notifications of that watch-type for objects 
related to the target object, where "related to" is defined per watch-type.

I wouldn't expect SELinux to implement security_post_notification() at 
all.  I can't see how one can construct a meaningful, stable policy for 
it.  I'd argue that the triggering process is not posting the 
notification; the kernel is posting the notification and the watcher has 
been authorized to receive it.

> 
> 
> Design decisions:
> 
>   (1) A misc chardev is used to create and open a ring buffer:
> 
> 	fd = open("/dev/watch_queue", O_RDWR);
> 
>       which is then configured and mmap'd into userspace:
> 
> 	ioctl(fd, IOC_WATCH_QUEUE_SET_SIZE, BUF_SIZE);
> 	ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter);
> 	buf = mmap(NULL, BUF_SIZE * page_size, PROT_READ | PROT_WRITE,
> 		   MAP_SHARED, fd, 0);
> 
>       The fd cannot be read or written (though there is a facility to use
>       write to inject records for debugging) and userspace just pulls data
>       directly out of the buffer.
> 
>   (2) The ring index pointers are stored inside the ring and are thus
>       accessible to userspace.  Userspace should only update the tail
>       pointer and never the head pointer or risk breaking the buffer.  The
>       kernel checks that the pointers appear valid before trying to use
>       them.  A 'skip' record is maintained around the pointers.
> 
>   (3) poll() can be used to wait for data to appear in the buffer.
> 
>   (4) Records in the buffer are binary, typed and have a length so that they
>       can be of varying size.
> 
>       This means that multiple heterogeneous sources can share a common
>       buffer.  Tags may be specified when a watchpoint is created to help
>       distinguish the sources.
> 
>   (5) The queue is reusable as there are 16 million types available, of
>       which I've used 4, so there is scope for others to be used.
> 
>   (6) Records are filterable as types have up to 256 subtypes that can be
>       individually filtered.  Other filtration is also available.
> 
>   (7) Each time the buffer is opened, a new buffer is created - this means
>       that there's no interference between watchers.
> 
>   (8) When recording a notification, the kernel will not sleep, but will
>       rather mark a queue as overrun if there's insufficient space, thereby
>       avoiding userspace causing the kernel to hang.
> 
>   (9) The 'watchpoint' should be specific where possible, meaning that you
>       specify the object that you want to watch.
> 
> (10) The buffer is created and then watchpoints are attached to it, using
>       one of:
> 
> 	keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fd, 0x01);
> 	mount_notify(AT_FDCWD, "/", 0, fd, 0x02);
> 	sb_notify(AT_FDCWD, "/mnt", 0, fd, 0x03);
> 
>       where in all three cases, fd indicates the queue and the number after
>       is a tag between 0 and 255.
> 
> (11) The watch must be removed if either the watch buffer is destroyed or
>       the watched object is destroyed.
> 
> 
> Things I want to avoid:
> 
>   (1) Introducing features that make the core VFS dependent on the network
>       stack or networking namespaces (ie. usage of netlink).
> 
>   (2) Dumping all this stuff into dmesg and having a daemon that sits there
>       parsing the output and distributing it as this then puts the
>       responsibility for security into userspace and makes handling
>       namespaces tricky.  Further, dmesg might not exist or might be
>       inaccessible inside a container.
> 
>   (3) Letting users see events they shouldn't be able to see.
> 
> 
> Further things that could be considered:
> 
>   (1) Adding a keyctl call to allow a watch on a keyring to be extended to
>       "children" of that keyring, such that the watch is removed from the
>       child if it is unlinked from the keyring.
> 
>   (2) Adding global superblock event queue.
> 
>   (3) Propagating watches to child superblock over automounts.
> 
> 
> The patches can be found here also:
> 
> 	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=notifications
> 
> Changes:
> 
>   v4: Split the basic UAPI bits out into their own patch and then split the
>       LSM hooks out into an intermediate patch.  Add LSM hooks for setting
>       watches.
> 
>       Rename the *_notify() system calls to watch_*() for consistency.
> 
>   v3: I've added a USB notification source and reformulated the block
>       notification source so that there's now a common watch list, for which
>       the system call is now device_notify().
> 
>       I've assigned a pair of unused ioctl numbers in the 'W' series to the
>       ioctls added by this series.
> 
>       I've also added a description of the kernel API to the documentation.
> 
>   v2: I've fixed various issues raised by Jann Horn and GregKH and moved to
>       krefs for refcounting.  I've added some security features to try and
>       give Casey Schaufler the LSM control he wants.
> 
> David
> ---
> David Howells (13):
>        security: Override creds in __fput() with last fputter's creds
>        uapi: General notification ring definitions
>        security: Add hooks to rule on setting a watch
>        security: Add a hook for the point of notification insertion
>        General notification queue with user mmap()'able ring buffer
>        keys: Add a notification facility
>        vfs: Add a mount-notification facility
>        vfs: Add superblock notifications
>        fsinfo: Export superblock notification counter
>        Add a general, global device notification watch list
>        block: Add block layer notifications
>        usb: Add USB subsystem notifications
>        Add sample notification program
> 
> 
>   Documentation/ioctl/ioctl-number.txt   |    1
>   Documentation/security/keys/core.rst   |   58 ++
>   Documentation/watch_queue.rst          |  492 ++++++++++++++++++
>   arch/x86/entry/syscalls/syscall_32.tbl |    3
>   arch/x86/entry/syscalls/syscall_64.tbl |    3
>   block/Kconfig                          |    9
>   block/blk-core.c                       |   29 +
>   drivers/base/Kconfig                   |    9
>   drivers/base/Makefile                  |    1
>   drivers/base/watch.c                   |   89 +++
>   drivers/misc/Kconfig                   |   13
>   drivers/misc/Makefile                  |    1
>   drivers/misc/watch_queue.c             |  889 ++++++++++++++++++++++++++++++++
>   drivers/usb/core/Kconfig               |   10
>   drivers/usb/core/devio.c               |   55 ++
>   drivers/usb/core/hub.c                 |    3
>   fs/Kconfig                             |   21 +
>   fs/Makefile                            |    1
>   fs/file_table.c                        |   12
>   fs/fsinfo.c                            |   12
>   fs/mount.h                             |   33 +
>   fs/mount_notify.c                      |  187 +++++++
>   fs/namespace.c                         |    9
>   fs/super.c                             |  122 ++++
>   include/linux/blkdev.h                 |   15 +
>   include/linux/dcache.h                 |    1
>   include/linux/device.h                 |    7
>   include/linux/fs.h                     |   79 +++
>   include/linux/key.h                    |    4
>   include/linux/lsm_hooks.h              |   48 ++
>   include/linux/security.h               |   35 +
>   include/linux/syscalls.h               |    5
>   include/linux/usb.h                    |   19 +
>   include/linux/watch_queue.h            |   87 +++
>   include/uapi/linux/fsinfo.h            |   10
>   include/uapi/linux/keyctl.h            |    1
>   include/uapi/linux/watch_queue.h       |  213 ++++++++
>   kernel/sys_ni.c                        |    7
>   samples/Kconfig                        |    6
>   samples/Makefile                       |    1
>   samples/vfs/test-fsinfo.c              |   13
>   samples/watch_queue/Makefile           |    9
>   samples/watch_queue/watch_test.c       |  308 +++++++++++
>   security/keys/Kconfig                  |   10
>   security/keys/compat.c                 |    2
>   security/keys/gc.c                     |    5
>   security/keys/internal.h               |   30 +
>   security/keys/key.c                    |   37 +
>   security/keys/keyctl.c                 |   95 +++
>   security/keys/keyring.c                |   17 -
>   security/keys/request_key.c            |    4
>   security/security.c                    |   29 +
>   52 files changed, 3121 insertions(+), 38 deletions(-)
>   create mode 100644 Documentation/watch_queue.rst
>   create mode 100644 drivers/base/watch.c
>   create mode 100644 drivers/misc/watch_queue.c
>   create mode 100644 fs/mount_notify.c
>   create mode 100644 include/linux/watch_queue.h
>   create mode 100644 include/uapi/linux/watch_queue.h
>   create mode 100644 samples/watch_queue/Makefile
>   create mode 100644 samples/watch_queue/watch_test.c
>

Casey Schaufler June 10, 2019, 4:33 p.m. UTC | #2

On 6/10/2019 8:21 AM, Stephen Smalley wrote:
> On 6/7/19 10:17 AM, David Howells wrote:
>>
>> Hi Al,
>>
>> Here's a set of patches to add a general variable-length notification queue
>> concept and to add sources of events for:
>>
>>   (1) Mount topology events, such as mounting, unmounting, mount expiry,
>>       mount reconfiguration.
>>
>>   (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O
>>       errors (not complete yet).
>>
>>   (3) Key/keyring events, such as creating, linking and removal of keys.
>>
>>   (4) General device events (single common queue) including:
>>
>>       - Block layer events, such as device errors
>>
>>       - USB subsystem events, such as device/bus attach/remove, device
>>         reset, device errors.
>>
>> One of the reasons for this is so that we can remove the issue of processes
>> having to repeatedly and regularly scan /proc/mounts, which has proven to
>> be a system performance problem.  To further aid this, the fsinfo() syscall
>> on which this patch series depends, provides a way to access superblock and
>> mount information in binary form without the need to parse /proc/mounts.
>>
>>
>> LSM support is included, but controversial:
>>
>>   (1) The creds of the process that did the fput() that reduced the refcount
>>       to zero are cached in the file struct.
>>
>>   (2) __fput() overrides the current creds with the creds from (1) whilst
>>       doing the cleanup, thereby making sure that the creds seen by the
>>       destruction notification generated by mntput() appears to come from
>>       the last fputter.
>>
>>   (3) security_post_notification() is called for each queue that we might
>>       want to post a notification into, thereby allowing the LSM to prevent
>>       covert communications.
>>
>>   (?) Do I need to add security_set_watch(), say, to rule on whether a watch
>>       may be set in the first place?  I might need to add a variant per
>>       watch-type.
>>
>>   (?) Do I really need to keep track of the process creds in which an
>>       implicit object destruction happened?  For example, imagine you create
>>       an fd with fsopen()/fsmount().  It is marked to dissolve the mount it
>>       refers to on close unless move_mount() clears that flag.  Now, imagine
>>       someone looking at that fd through procfs at the same time as you exit
>>       due to an error.  The LSM sees the destruction notification come from
>>       the looker if they happen to do their fput() after yours.
>
> I remain unconvinced that (1), (2), (3), and the final (?) above are a good idea.
>
> For SELinux, I would expect that one would implement a collection of per watch-type WATCH permission checks on the target object (or to some well-defined object label like the kernel SID if there is no object) that allow receipt of all notifications of that watch-type for objects related to the target object, where "related to" is defined per watch-type.
>
> I wouldn't expect SELinux to implement security_post_notification() at all.  I can't see how one can construct a meaningful, stable policy for it.  I'd argue that the triggering process is not posting the notification; the kernel is posting the notification and the watcher has been authorized to receive it.

I cannot agree. There is an explicit action by a subject that results
in information being delivered to an object. Just like a signal or a
UDP packet delivery. Smack handles this kind of thing just fine. The
internal mechanism that results in the access is irrelevant from
this viewpoint. I can understand how a mechanism like SELinux that
works on finer granularity might view it differently.



>
>>
>>
>> Design decisions:
>>
>>   (1) A misc chardev is used to create and open a ring buffer:
>>
>>     fd = open("/dev/watch_queue", O_RDWR);
>>
>>       which is then configured and mmap'd into userspace:
>>
>>     ioctl(fd, IOC_WATCH_QUEUE_SET_SIZE, BUF_SIZE);
>>     ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter);
>>     buf = mmap(NULL, BUF_SIZE * page_size, PROT_READ | PROT_WRITE,
>>            MAP_SHARED, fd, 0);
>>
>>       The fd cannot be read or written (though there is a facility to use
>>       write to inject records for debugging) and userspace just pulls data
>>       directly out of the buffer.
>>
>>   (2) The ring index pointers are stored inside the ring and are thus
>>       accessible to userspace.  Userspace should only update the tail
>>       pointer and never the head pointer or risk breaking the buffer.  The
>>       kernel checks that the pointers appear valid before trying to use
>>       them.  A 'skip' record is maintained around the pointers.
>>
>>   (3) poll() can be used to wait for data to appear in the buffer.
>>
>>   (4) Records in the buffer are binary, typed and have a length so that they
>>       can be of varying size.
>>
>>       This means that multiple heterogeneous sources can share a common
>>       buffer.  Tags may be specified when a watchpoint is created to help
>>       distinguish the sources.
>>
>>   (5) The queue is reusable as there are 16 million types available, of
>>       which I've used 4, so there is scope for others to be used.
>>
>>   (6) Records are filterable as types have up to 256 subtypes that can be
>>       individually filtered.  Other filtration is also available.
>>
>>   (7) Each time the buffer is opened, a new buffer is created - this means
>>       that there's no interference between watchers.
>>
>>   (8) When recording a notification, the kernel will not sleep, but will
>>       rather mark a queue as overrun if there's insufficient space, thereby
>>       avoiding userspace causing the kernel to hang.
>>
>>   (9) The 'watchpoint' should be specific where possible, meaning that you
>>       specify the object that you want to watch.
>>
>> (10) The buffer is created and then watchpoints are attached to it, using
>>       one of:
>>
>>     keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fd, 0x01);
>>     mount_notify(AT_FDCWD, "/", 0, fd, 0x02);
>>     sb_notify(AT_FDCWD, "/mnt", 0, fd, 0x03);
>>
>>       where in all three cases, fd indicates the queue and the number after
>>       is a tag between 0 and 255.
>>
>> (11) The watch must be removed if either the watch buffer is destroyed or
>>       the watched object is destroyed.
>>
>>
>> Things I want to avoid:
>>
>>   (1) Introducing features that make the core VFS dependent on the network
>>       stack or networking namespaces (ie. usage of netlink).
>>
>>   (2) Dumping all this stuff into dmesg and having a daemon that sits there
>>       parsing the output and distributing it as this then puts the
>>       responsibility for security into userspace and makes handling
>>       namespaces tricky.  Further, dmesg might not exist or might be
>>       inaccessible inside a container.
>>
>>   (3) Letting users see events they shouldn't be able to see.
>>
>>
>> Further things that could be considered:
>>
>>   (1) Adding a keyctl call to allow a watch on a keyring to be extended to
>>       "children" of that keyring, such that the watch is removed from the
>>       child if it is unlinked from the keyring.
>>
>>   (2) Adding global superblock event queue.
>>
>>   (3) Propagating watches to child superblock over automounts.
>>
>>
>> The patches can be found here also:
>>
>>     http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=notifications
>>
>> Changes:
>>
>>   v4: Split the basic UAPI bits out into their own patch and then split the
>>       LSM hooks out into an intermediate patch.  Add LSM hooks for setting
>>       watches.
>>
>>       Rename the *_notify() system calls to watch_*() for consistency.
>>
>>   v3: I've added a USB notification source and reformulated the block
>>       notification source so that there's now a common watch list, for which
>>       the system call is now device_notify().
>>
>>       I've assigned a pair of unused ioctl numbers in the 'W' series to the
>>       ioctls added by this series.
>>
>>       I've also added a description of the kernel API to the documentation.
>>
>>   v2: I've fixed various issues raised by Jann Horn and GregKH and moved to
>>       krefs for refcounting.  I've added some security features to try and
>>       give Casey Schaufler the LSM control he wants.
>>
>> David
>> ---
>> David Howells (13):
>>        security: Override creds in __fput() with last fputter's creds
>>        uapi: General notification ring definitions
>>        security: Add hooks to rule on setting a watch
>>        security: Add a hook for the point of notification insertion
>>        General notification queue with user mmap()'able ring buffer
>>        keys: Add a notification facility
>>        vfs: Add a mount-notification facility
>>        vfs: Add superblock notifications
>>        fsinfo: Export superblock notification counter
>>        Add a general, global device notification watch list
>>        block: Add block layer notifications
>>        usb: Add USB subsystem notifications
>>        Add sample notification program
>>
>>
>>   Documentation/ioctl/ioctl-number.txt   |    1
>>   Documentation/security/keys/core.rst   |   58 ++
>>   Documentation/watch_queue.rst          |  492 ++++++++++++++++++
>>   arch/x86/entry/syscalls/syscall_32.tbl |    3
>>   arch/x86/entry/syscalls/syscall_64.tbl |    3
>>   block/Kconfig                          |    9
>>   block/blk-core.c                       |   29 +
>>   drivers/base/Kconfig                   |    9
>>   drivers/base/Makefile                  |    1
>>   drivers/base/watch.c                   |   89 +++
>>   drivers/misc/Kconfig                   |   13
>>   drivers/misc/Makefile                  |    1
>>   drivers/misc/watch_queue.c             |  889 ++++++++++++++++++++++++++++++++
>>   drivers/usb/core/Kconfig               |   10
>>   drivers/usb/core/devio.c               |   55 ++
>>   drivers/usb/core/hub.c                 |    3
>>   fs/Kconfig                             |   21 +
>>   fs/Makefile                            |    1
>>   fs/file_table.c                        |   12
>>   fs/fsinfo.c                            |   12
>>   fs/mount.h                             |   33 +
>>   fs/mount_notify.c                      |  187 +++++++
>>   fs/namespace.c                         |    9
>>   fs/super.c                             |  122 ++++
>>   include/linux/blkdev.h                 |   15 +
>>   include/linux/dcache.h                 |    1
>>   include/linux/device.h                 |    7
>>   include/linux/fs.h                     |   79 +++
>>   include/linux/key.h                    |    4
>>   include/linux/lsm_hooks.h              |   48 ++
>>   include/linux/security.h               |   35 +
>>   include/linux/syscalls.h               |    5
>>   include/linux/usb.h                    |   19 +
>>   include/linux/watch_queue.h            |   87 +++
>>   include/uapi/linux/fsinfo.h            |   10
>>   include/uapi/linux/keyctl.h            |    1
>>   include/uapi/linux/watch_queue.h       |  213 ++++++++
>>   kernel/sys_ni.c                        |    7
>>   samples/Kconfig                        |    6
>>   samples/Makefile                       |    1
>>   samples/vfs/test-fsinfo.c              |   13
>>   samples/watch_queue/Makefile           |    9
>>   samples/watch_queue/watch_test.c       |  308 +++++++++++
>>   security/keys/Kconfig                  |   10
>>   security/keys/compat.c                 |    2
>>   security/keys/gc.c                     |    5
>>   security/keys/internal.h               |   30 +
>>   security/keys/key.c                    |   37 +
>>   security/keys/keyctl.c                 |   95 +++
>>   security/keys/keyring.c                |   17 -
>>   security/keys/request_key.c            |    4
>>   security/security.c                    |   29 +
>>   52 files changed, 3121 insertions(+), 38 deletions(-)
>>   create mode 100644 Documentation/watch_queue.rst
>>   create mode 100644 drivers/base/watch.c
>>   create mode 100644 drivers/misc/watch_queue.c
>>   create mode 100644 fs/mount_notify.c
>>   create mode 100644 include/linux/watch_queue.h
>>   create mode 100644 include/uapi/linux/watch_queue.h
>>   create mode 100644 samples/watch_queue/Makefile
>>   create mode 100644 samples/watch_queue/watch_test.c
>>
>

Andy Lutomirski June 10, 2019, 4:42 p.m. UTC | #3

On Mon, Jun 10, 2019 at 9:34 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>
> On 6/10/2019 8:21 AM, Stephen Smalley wrote:
> > On 6/7/19 10:17 AM, David Howells wrote:
> >>
> >> Hi Al,
> >>
> >> Here's a set of patches to add a general variable-length notification queue
> >> concept and to add sources of events for:
> >>
> >>   (1) Mount topology events, such as mounting, unmounting, mount expiry,
> >>       mount reconfiguration.
> >>
> >>   (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O
> >>       errors (not complete yet).
> >>
> >>   (3) Key/keyring events, such as creating, linking and removal of keys.
> >>
> >>   (4) General device events (single common queue) including:
> >>
> >>       - Block layer events, such as device errors
> >>
> >>       - USB subsystem events, such as device/bus attach/remove, device
> >>         reset, device errors.
> >>
> >> One of the reasons for this is so that we can remove the issue of processes
> >> having to repeatedly and regularly scan /proc/mounts, which has proven to
> >> be a system performance problem.  To further aid this, the fsinfo() syscall
> >> on which this patch series depends, provides a way to access superblock and
> >> mount information in binary form without the need to parse /proc/mounts.
> >>
> >>
> >> LSM support is included, but controversial:
> >>
> >>   (1) The creds of the process that did the fput() that reduced the refcount
> >>       to zero are cached in the file struct.
> >>
> >>   (2) __fput() overrides the current creds with the creds from (1) whilst
> >>       doing the cleanup, thereby making sure that the creds seen by the
> >>       destruction notification generated by mntput() appears to come from
> >>       the last fputter.
> >>
> >>   (3) security_post_notification() is called for each queue that we might
> >>       want to post a notification into, thereby allowing the LSM to prevent
> >>       covert communications.
> >>
> >>   (?) Do I need to add security_set_watch(), say, to rule on whether a watch
> >>       may be set in the first place?  I might need to add a variant per
> >>       watch-type.
> >>
> >>   (?) Do I really need to keep track of the process creds in which an
> >>       implicit object destruction happened?  For example, imagine you create
> >>       an fd with fsopen()/fsmount().  It is marked to dissolve the mount it
> >>       refers to on close unless move_mount() clears that flag.  Now, imagine
> >>       someone looking at that fd through procfs at the same time as you exit
> >>       due to an error.  The LSM sees the destruction notification come from
> >>       the looker if they happen to do their fput() after yours.
> >
> > I remain unconvinced that (1), (2), (3), and the final (?) above are a good idea.
> >
> > For SELinux, I would expect that one would implement a collection of per watch-type WATCH permission checks on the target object (or to some well-defined object label like the kernel SID if there is no object) that allow receipt of all notifications of that watch-type for objects related to the target object, where "related to" is defined per watch-type.
> >
> > I wouldn't expect SELinux to implement security_post_notification() at all.  I can't see how one can construct a meaningful, stable policy for it.  I'd argue that the triggering process is not posting the notification; the kernel is posting the notification and the watcher has been authorized to receive it.
>
> I cannot agree. There is an explicit action by a subject that results
> in information being delivered to an object. Just like a signal or a
> UDP packet delivery. Smack handles this kind of thing just fine. The
> internal mechanism that results in the access is irrelevant from
> this viewpoint. I can understand how a mechanism like SELinux that
> works on finer granularity might view it differently.

I think you really need to give an example of a coherent policy that
needs this.  As it stands, your analogy seems confusing.  If someone
changes the system clock, we don't restrict who is allowed to be
notified (via, for example, TFD_TIMER_CANCEL_ON_SET) that the clock
was changed based on who changed the clock.  Similarly, if someone
tries to receive a packet on a socket, we check whether they have the
right to receive on that socket (from the endpoint in question) and,
if the sender is local, whether the sender can send to that socket.
We do not check whether the sender can send to the receiver.

The signal example is inapplicable.  Sending a signal to a process is
an explicit action done to that process, and it can easily adversely
affect the target.  Of course it requires permission.

--Andy

Casey Schaufler June 10, 2019, 6:01 p.m. UTC | #4

On 6/10/2019 9:42 AM, Andy Lutomirski wrote:
> On Mon, Jun 10, 2019 at 9:34 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>> On 6/10/2019 8:21 AM, Stephen Smalley wrote:
>>> On 6/7/19 10:17 AM, David Howells wrote:
>>>> Hi Al,
>>>>
>>>> Here's a set of patches to add a general variable-length notification queue
>>>> concept and to add sources of events for:
>>>>
>>>>   (1) Mount topology events, such as mounting, unmounting, mount expiry,
>>>>       mount reconfiguration.
>>>>
>>>>   (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O
>>>>       errors (not complete yet).
>>>>
>>>>   (3) Key/keyring events, such as creating, linking and removal of keys.
>>>>
>>>>   (4) General device events (single common queue) including:
>>>>
>>>>       - Block layer events, such as device errors
>>>>
>>>>       - USB subsystem events, such as device/bus attach/remove, device
>>>>         reset, device errors.
>>>>
>>>> One of the reasons for this is so that we can remove the issue of processes
>>>> having to repeatedly and regularly scan /proc/mounts, which has proven to
>>>> be a system performance problem.  To further aid this, the fsinfo() syscall
>>>> on which this patch series depends, provides a way to access superblock and
>>>> mount information in binary form without the need to parse /proc/mounts.
>>>>
>>>>
>>>> LSM support is included, but controversial:
>>>>
>>>>   (1) The creds of the process that did the fput() that reduced the refcount
>>>>       to zero are cached in the file struct.
>>>>
>>>>   (2) __fput() overrides the current creds with the creds from (1) whilst
>>>>       doing the cleanup, thereby making sure that the creds seen by the
>>>>       destruction notification generated by mntput() appears to come from
>>>>       the last fputter.
>>>>
>>>>   (3) security_post_notification() is called for each queue that we might
>>>>       want to post a notification into, thereby allowing the LSM to prevent
>>>>       covert communications.
>>>>
>>>>   (?) Do I need to add security_set_watch(), say, to rule on whether a watch
>>>>       may be set in the first place?  I might need to add a variant per
>>>>       watch-type.
>>>>
>>>>   (?) Do I really need to keep track of the process creds in which an
>>>>       implicit object destruction happened?  For example, imagine you create
>>>>       an fd with fsopen()/fsmount().  It is marked to dissolve the mount it
>>>>       refers to on close unless move_mount() clears that flag.  Now, imagine
>>>>       someone looking at that fd through procfs at the same time as you exit
>>>>       due to an error.  The LSM sees the destruction notification come from
>>>>       the looker if they happen to do their fput() after yours.
>>> I remain unconvinced that (1), (2), (3), and the final (?) above are a good idea.
>>>
>>> For SELinux, I would expect that one would implement a collection of per watch-type WATCH permission checks on the target object (or to some well-defined object label like the kernel SID if there is no object) that allow receipt of all notifications of that watch-type for objects related to the target object, where "related to" is defined per watch-type.
>>>
>>> I wouldn't expect SELinux to implement security_post_notification() at all.  I can't see how one can construct a meaningful, stable policy for it.  I'd argue that the triggering process is not posting the notification; the kernel is posting the notification and the watcher has been authorized to receive it.
>> I cannot agree. There is an explicit action by a subject that results
>> in information being delivered to an object. Just like a signal or a
>> UDP packet delivery. Smack handles this kind of thing just fine. The
>> internal mechanism that results in the access is irrelevant from
>> this viewpoint. I can understand how a mechanism like SELinux that
>> works on finer granularity might view it differently.
> I think you really need to give an example of a coherent policy that
> needs this.

I keep telling you, and you keep ignoring what I say.

>   As it stands, your analogy seems confusing.

It's pretty simple. I have given both the abstract
and examples.

>   If someone
> changes the system clock, we don't restrict who is allowed to be
> notified (via, for example, TFD_TIMER_CANCEL_ON_SET) that the clock
> was changed based on who changed the clock.

That's right. The system clock is not an object that
unprivileged processes can modify. In fact, it is not
an object at all. If you care to look, you will see that
Smack does nothing with the clock.

>   Similarly, if someone
> tries to receive a packet on a socket, we check whether they have the
> right to receive on that socket (from the endpoint in question) and,
> if the sender is local, whether the sender can send to that socket.
> We do not check whether the sender can send to the receiver.

Bzzzt! Smack sure does.

> The signal example is inapplicable.

From a modeling viewpoint the actions are identical.

>   Sending a signal to a process is
> an explicit action done to that process, and it can easily adversely
> affect the target.  Of course it requires permission.
>
> --Andy

Andy Lutomirski June 10, 2019, 6:22 p.m. UTC | #5

> On Jun 10, 2019, at 11:01 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
> 
>> On 6/10/2019 9:42 AM, Andy Lutomirski wrote:
>>> On Mon, Jun 10, 2019 at 9:34 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>> On 6/10/2019 8:21 AM, Stephen Smalley wrote:
>>>>> On 6/7/19 10:17 AM, David Howells wrote:
>>>>> Hi Al,
>>>>> 
>>>>> Here's a set of patches to add a general variable-length notification queue
>>>>> concept and to add sources of events for:
>>>>> 
>>>>>  (1) Mount topology events, such as mounting, unmounting, mount expiry,
>>>>>      mount reconfiguration.
>>>>> 
>>>>>  (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O
>>>>>      errors (not complete yet).
>>>>> 
>>>>>  (3) Key/keyring events, such as creating, linking and removal of keys.
>>>>> 
>>>>>  (4) General device events (single common queue) including:
>>>>> 
>>>>>      - Block layer events, such as device errors
>>>>> 
>>>>>      - USB subsystem events, such as device/bus attach/remove, device
>>>>>        reset, device errors.
>>>>> 
>>>>> One of the reasons for this is so that we can remove the issue of processes
>>>>> having to repeatedly and regularly scan /proc/mounts, which has proven to
>>>>> be a system performance problem.  To further aid this, the fsinfo() syscall
>>>>> on which this patch series depends, provides a way to access superblock and
>>>>> mount information in binary form without the need to parse /proc/mounts.
>>>>> 
>>>>> 
>>>>> LSM support is included, but controversial:
>>>>> 
>>>>>  (1) The creds of the process that did the fput() that reduced the refcount
>>>>>      to zero are cached in the file struct.
>>>>> 
>>>>>  (2) __fput() overrides the current creds with the creds from (1) whilst
>>>>>      doing the cleanup, thereby making sure that the creds seen by the
>>>>>      destruction notification generated by mntput() appears to come from
>>>>>      the last fputter.
>>>>> 
>>>>>  (3) security_post_notification() is called for each queue that we might
>>>>>      want to post a notification into, thereby allowing the LSM to prevent
>>>>>      covert communications.
>>>>> 
>>>>>  (?) Do I need to add security_set_watch(), say, to rule on whether a watch
>>>>>      may be set in the first place?  I might need to add a variant per
>>>>>      watch-type.
>>>>> 
>>>>>  (?) Do I really need to keep track of the process creds in which an
>>>>>      implicit object destruction happened?  For example, imagine you create
>>>>>      an fd with fsopen()/fsmount().  It is marked to dissolve the mount it
>>>>>      refers to on close unless move_mount() clears that flag.  Now, imagine
>>>>>      someone looking at that fd through procfs at the same time as you exit
>>>>>      due to an error.  The LSM sees the destruction notification come from
>>>>>      the looker if they happen to do their fput() after yours.
>>>> I remain unconvinced that (1), (2), (3), and the final (?) above are a good idea.
>>>> 
>>>> For SELinux, I would expect that one would implement a collection of per watch-type WATCH permission checks on the target object (or to some well-defined object label like the kernel SID if there is no object) that allow receipt of all notifications of that watch-type for objects related to the target object, where "related to" is defined per watch-type.
>>>> 
>>>> I wouldn't expect SELinux to implement security_post_notification() at all.  I can't see how one can construct a meaningful, stable policy for it.  I'd argue that the triggering process is not posting the notification; the kernel is posting the notification and the watcher has been authorized to receive it.
>>> I cannot agree. There is an explicit action by a subject that results
>>> in information being delivered to an object. Just like a signal or a
>>> UDP packet delivery. Smack handles this kind of thing just fine. The
>>> internal mechanism that results in the access is irrelevant from
>>> this viewpoint. I can understand how a mechanism like SELinux that
>>> works on finer granularity might view it differently.
>> I think you really need to give an example of a coherent policy that
>> needs this.
> 
> I keep telling you, and you keep ignoring what I say.
> 
>>  As it stands, your analogy seems confusing.
> 
> It's pretty simple. I have given both the abstract
> and examples.

You gave the /dev/null example, which is inapplicable to this patchset.

> 
>>  If someone
>> changes the system clock, we don't restrict who is allowed to be
>> notified (via, for example, TFD_TIMER_CANCEL_ON_SET) that the clock
>> was changed based on who changed the clock.
> 
> That's right. The system clock is not an object that
> unprivileged processes can modify. In fact, it is not
> an object at all. If you care to look, you will see that
> Smack does nothing with the clock.

And this is different from the mount tree how?

> 
>>  Similarly, if someone
>> tries to receive a packet on a socket, we check whether they have the
>> right to receive on that socket (from the endpoint in question) and,
>> if the sender is local, whether the sender can send to that socket.
>> We do not check whether the sender can send to the receiver.
> 
> Bzzzt! Smack sure does.

This seems dubious. I’m still trying to get you to explain to a non-Smack person why this makes sense.

> 
>> The signal example is inapplicable.
> 
> From a modeling viewpoint the actions are identical.

This seems incorrect to me and, I think, to most everyone else reading this. Can you explain?

In SELinux-ese, when you write to a file, the subject is the writer and the object is the file.  When you send a signal to a process, the object is the target process.

Casey Schaufler June 10, 2019, 7:33 p.m. UTC | #6

On 6/10/2019 11:22 AM, Andy Lutomirski wrote:
>> On Jun 10, 2019, at 11:01 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>>
>>> On 6/10/2019 9:42 AM, Andy Lutomirski wrote:
>>>> On Mon, Jun 10, 2019 at 9:34 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>>> On 6/10/2019 8:21 AM, Stephen Smalley wrote:
>>>>>> On 6/7/19 10:17 AM, David Howells wrote:
>>>>>> Hi Al,
>>>>>>
>>>>>> Here's a set of patches to add a general variable-length notification queue
>>>>>> concept and to add sources of events for:
>>>>>>
>>>>>>  (1) Mount topology events, such as mounting, unmounting, mount expiry,
>>>>>>      mount reconfiguration.
>>>>>>
>>>>>>  (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O
>>>>>>      errors (not complete yet).
>>>>>>
>>>>>>  (3) Key/keyring events, such as creating, linking and removal of keys.
>>>>>>
>>>>>>  (4) General device events (single common queue) including:
>>>>>>
>>>>>>      - Block layer events, such as device errors
>>>>>>
>>>>>>      - USB subsystem events, such as device/bus attach/remove, device
>>>>>>        reset, device errors.
>>>>>>
>>>>>> One of the reasons for this is so that we can remove the issue of processes
>>>>>> having to repeatedly and regularly scan /proc/mounts, which has proven to
>>>>>> be a system performance problem.  To further aid this, the fsinfo() syscall
>>>>>> on which this patch series depends, provides a way to access superblock and
>>>>>> mount information in binary form without the need to parse /proc/mounts.
>>>>>>
>>>>>>
>>>>>> LSM support is included, but controversial:
>>>>>>
>>>>>>  (1) The creds of the process that did the fput() that reduced the refcount
>>>>>>      to zero are cached in the file struct.
>>>>>>
>>>>>>  (2) __fput() overrides the current creds with the creds from (1) whilst
>>>>>>      doing the cleanup, thereby making sure that the creds seen by the
>>>>>>      destruction notification generated by mntput() appears to come from
>>>>>>      the last fputter.
>>>>>>
>>>>>>  (3) security_post_notification() is called for each queue that we might
>>>>>>      want to post a notification into, thereby allowing the LSM to prevent
>>>>>>      covert communications.
>>>>>>
>>>>>>  (?) Do I need to add security_set_watch(), say, to rule on whether a watch
>>>>>>      may be set in the first place?  I might need to add a variant per
>>>>>>      watch-type.
>>>>>>
>>>>>>  (?) Do I really need to keep track of the process creds in which an
>>>>>>      implicit object destruction happened?  For example, imagine you create
>>>>>>      an fd with fsopen()/fsmount().  It is marked to dissolve the mount it
>>>>>>      refers to on close unless move_mount() clears that flag.  Now, imagine
>>>>>>      someone looking at that fd through procfs at the same time as you exit
>>>>>>      due to an error.  The LSM sees the destruction notification come from
>>>>>>      the looker if they happen to do their fput() after yours.
>>>>> I remain unconvinced that (1), (2), (3), and the final (?) above are a good idea.
>>>>>
>>>>> For SELinux, I would expect that one would implement a collection of per watch-type WATCH permission checks on the target object (or to some well-defined object label like the kernel SID if there is no object) that allow receipt of all notifications of that watch-type for objects related to the target object, where "related to" is defined per watch-type.
>>>>>
>>>>> I wouldn't expect SELinux to implement security_post_notification() at all.  I can't see how one can construct a meaningful, stable policy for it.  I'd argue that the triggering process is not posting the notification; the kernel is posting the notification and the watcher has been authorized to receive it.
>>>> I cannot agree. There is an explicit action by a subject that results
>>>> in information being delivered to an object. Just like a signal or a
>>>> UDP packet delivery. Smack handles this kind of thing just fine. The
>>>> internal mechanism that results in the access is irrelevant from
>>>> this viewpoint. I can understand how a mechanism like SELinux that
>>>> works on finer granularity might view it differently.
>>> I think you really need to give an example of a coherent policy that
>>> needs this.
>> I keep telling you, and you keep ignoring what I say.
>>
>>>  As it stands, your analogy seems confusing.
>> It's pretty simple. I have given both the abstract
>> and examples.
> You gave the /dev/null example, which is inapplicable to this patchset.

That addressed an explicit objection, and pointed out
an exception to a generality you had asserted, which was
not true. It's also a red herring regarding the current
discussion.

>>>  If someone
>>> changes the system clock, we don't restrict who is allowed to be
>>> notified (via, for example, TFD_TIMER_CANCEL_ON_SET) that the clock
>>> was changed based on who changed the clock.
>> That's right. The system clock is not an object that
>> unprivileged processes can modify. In fact, it is not
>> an object at all. If you care to look, you will see that
>> Smack does nothing with the clock.
> And this is different from the mount tree how?

The mount tree can be modified by unprivileged users.
If nothing that unprivileged users can do to the mount
tree can trigger a notification you are correct, the
mount tree is very like the system clock. Is that the
case? 

>>>  Similarly, if someone
>>> tries to receive a packet on a socket, we check whether they have the
>>> right to receive on that socket (from the endpoint in question) and,
>>> if the sender is local, whether the sender can send to that socket.
>>> We do not check whether the sender can send to the receiver.
>> Bzzzt! Smack sure does.
> This seems dubious. I’m still trying to get you to explain to a non-Smack person why this makes sense.

Process A sends a packet to process B.
If A has access to TopSecret data and B is not
allowed to see TopSecret data, the delivery should
be prevented. Is that nonsensical?

>>> The signal example is inapplicable.
>> From a modeling viewpoint the actions are identical.
> This seems incorrect to me

What would be correct then? Some convoluted combination
of system entities that aren't owned or controlled by
any mechanism? 

>  and, I think, to most everyone else reading this.

That's quite the assertion. You may even be correct.

>  Can you explain?
>
> In SELinux-ese, when you write to a file, the subject is the writer and the object is the file.  When you send a signal to a process, the object is the target process.

YES!!!!!!!!!!!!

And when a process triggers a notification it is the subject
and the watching process is the object!

Subject == active entity
Object  == passive entity

Triggering an event is, like calling kill(), an action!

Andy Lutomirski June 10, 2019, 7:53 p.m. UTC | #7

On Mon, Jun 10, 2019 at 12:34 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
> >>> I think you really need to give an example of a coherent policy that
> >>> needs this.
> >> I keep telling you, and you keep ignoring what I say.
> >>
> >>>  As it stands, your analogy seems confusing.
> >> It's pretty simple. I have given both the abstract
> >> and examples.
> > You gave the /dev/null example, which is inapplicable to this patchset.
>
> That addressed an explicit objection, and pointed out
> an exception to a generality you had asserted, which was
> not true. It's also a red herring regarding the current
> discussion.

This argument is pointless.

Please humor me and just give me an example.  If you think you have
already done so, feel free to repeat yourself.  If you have no
example, then please just say so.

>
> >>>  If someone
> >>> changes the system clock, we don't restrict who is allowed to be
> >>> notified (via, for example, TFD_TIMER_CANCEL_ON_SET) that the clock
> >>> was changed based on who changed the clock.
> >> That's right. The system clock is not an object that
> >> unprivileged processes can modify. In fact, it is not
> >> an object at all. If you care to look, you will see that
> >> Smack does nothing with the clock.
> > And this is different from the mount tree how?
>
> The mount tree can be modified by unprivileged users.
> If nothing that unprivileged users can do to the mount
> tree can trigger a notification you are correct, the
> mount tree is very like the system clock. Is that the
> case?

The mount tree can't be modified by unprivileged users, unless a
privileged user very carefully configured it as such.  An unprivileged
user can create a new userns and a new mount ns, but then they're
modifying a whole different mount tree.

>
> >>>  Similarly, if someone
> >>> tries to receive a packet on a socket, we check whether they have the
> >>> right to receive on that socket (from the endpoint in question) and,
> >>> if the sender is local, whether the sender can send to that socket.
> >>> We do not check whether the sender can send to the receiver.
> >> Bzzzt! Smack sure does.
> > This seems dubious. I’m still trying to get you to explain to a non-Smack person why this makes sense.
>
> Process A sends a packet to process B.
> If A has access to TopSecret data and B is not
> allowed to see TopSecret data, the delivery should
> be prevented. Is that nonsensical?

It makes sense.  As I see it, the way that a sensible policy should do
this is by making sure that there are no sockets, pipes, etc that
Process A can write and that Process B can read.

If you really want to prevent a malicious process with TopSecret data
from sending it to a different process, then you can't use Linux on
x86 or ARM.  Maybe that will be fixed some day, but you're going to
need to use an extremely tight sandbox to make this work.

>
> >>> The signal example is inapplicable.
> >> From a modeling viewpoint the actions are identical.
> > This seems incorrect to me
>
> What would be correct then? Some convoluted combination
> of system entities that aren't owned or controlled by
> any mechanism?
>

POSIX signal restrictions aren't there to prevent two processes from
communicating.  They're there to prevent the sender from manipulating
or crashing the receiver without appropriate privilege.


> >  and, I think, to most everyone else reading this.
>
> That's quite the assertion. You may even be correct.
>
> >  Can you explain?
> >
> > In SELinux-ese, when you write to a file, the subject is the writer and the object is the file.  When you send a signal to a process, the object is the target process.
>
> YES!!!!!!!!!!!!
>
> And when a process triggers a notification it is the subject
> and the watching process is the object!
>
> Subject == active entity
> Object  == passive entity
>
> Triggering an event is, like calling kill(), an action!
>

And here is where I disagree with your interpretation.  Triggering an
event is a side effect of writing to the file.  There are *two*
security relevant actions, not one, and they are:

First, the write:

Subject == the writer
Action == write
Object == the file

Then the event, which could be modeled in a couple of ways:

Subject == the file
Action == notify
Object == the recipient

or

Subject == the recipient
Action == watch
Object == the file

By conflating these two actions into one, you've made the modeling
very hard, and you start running into all these nasty questions like
"who actually closed this open file"

Casey Schaufler June 10, 2019, 9:25 p.m. UTC | #8

On 6/10/2019 12:53 PM, Andy Lutomirski wrote:
> On Mon, Jun 10, 2019 at 12:34 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>>> I think you really need to give an example of a coherent policy that
>>>>> needs this.
>>>> I keep telling you, and you keep ignoring what I say.
>>>>
>>>>>  As it stands, your analogy seems confusing.
>>>> It's pretty simple. I have given both the abstract
>>>> and examples.
>>> You gave the /dev/null example, which is inapplicable to this patchset.
>> That addressed an explicit objection, and pointed out
>> an exception to a generality you had asserted, which was
>> not true. It's also a red herring regarding the current
>> discussion.
> This argument is pointless.
>
> Please humor me and just give me an example.  If you think you have
> already done so, feel free to repeat yourself.  If you have no
> example, then please just say so.

To repeat the /dev/null example:

Process A and process B both open /dev/null.
A and B can write and read to their hearts content
to/from /dev/null without ever once communicating.
The mutual accessibility of /dev/null in no way implies that
A and B can communicate. If A can set a watch on /dev/null,
and B triggers an event, there still has to be an access
check on the delivery of the event because delivering an event
to A is not an action on /dev/null, but on A.


>
>>>>>  If someone
>>>>> changes the system clock, we don't restrict who is allowed to be
>>>>> notified (via, for example, TFD_TIMER_CANCEL_ON_SET) that the clock
>>>>> was changed based on who changed the clock.
>>>> That's right. The system clock is not an object that
>>>> unprivileged processes can modify. In fact, it is not
>>>> an object at all. If you care to look, you will see that
>>>> Smack does nothing with the clock.
>>> And this is different from the mount tree how?
>> The mount tree can be modified by unprivileged users.
>> If nothing that unprivileged users can do to the mount
>> tree can trigger a notification you are correct, the
>> mount tree is very like the system clock. Is that the
>> case?
> The mount tree can't be modified by unprivileged users, unless a
> privileged user very carefully configured it as such.

"Unless" means *is* possible. In which case access control is
required. I will admit to being less then expert on the extent
to which mounts can be done without privilege.

>   An unprivileged
> user can create a new userns and a new mount ns, but then they're
> modifying a whole different mount tree.

Within those namespaces you can still have multiple users,
constrained be system access control policy.

>
>>>>>  Similarly, if someone
>>>>> tries to receive a packet on a socket, we check whether they have the
>>>>> right to receive on that socket (from the endpoint in question) and,
>>>>> if the sender is local, whether the sender can send to that socket.
>>>>> We do not check whether the sender can send to the receiver.
>>>> Bzzzt! Smack sure does.
>>> This seems dubious. I’m still trying to get you to explain to a non-Smack person why this makes sense.
>> Process A sends a packet to process B.
>> If A has access to TopSecret data and B is not
>> allowed to see TopSecret data, the delivery should
>> be prevented. Is that nonsensical?
> It makes sense.  As I see it, the way that a sensible policy should do
> this is by making sure that there are no sockets, pipes, etc that
> Process A can write and that Process B can read.

You can't explain UDP controls without doing the access check
on packet delivery. The sendmsg() succeeds when the packet leaves
the sender. There doesn't even have to be a socket bound to the
port. The only opportunity you have for control is on packet
delivery, which is the only point at which you can have the
information required.

> If you really want to prevent a malicious process with TopSecret data
> from sending it to a different process, then you can't use Linux on
> x86 or ARM.  Maybe that will be fixed some day, but you're going to
> need to use an extremely tight sandbox to make this work.

I won't be commenting on that.

>
>>>>> The signal example is inapplicable.
>>>> From a modeling viewpoint the actions are identical.
>>> This seems incorrect to me
>> What would be correct then? Some convoluted combination
>> of system entities that aren't owned or controlled by
>> any mechanism?
>>
> POSIX signal restrictions aren't there to prevent two processes from
> communicating.  They're there to prevent the sender from manipulating
> or crashing the receiver without appropriate privilege.

POSIX signal restrictions have a long history. In the P10031e/2c
debates both communication and manipulation where seriously
considered. I would say both are true.

>>>  and, I think, to most everyone else reading this.
>> That's quite the assertion. You may even be correct.
>>
>>>  Can you explain?
>>>
>>> In SELinux-ese, when you write to a file, the subject is the writer and the object is the file.  When you send a signal to a process, the object is the target process.
>> YES!!!!!!!!!!!!
>>
>> And when a process triggers a notification it is the subject
>> and the watching process is the object!
>>
>> Subject == active entity
>> Object  == passive entity
>>
>> Triggering an event is, like calling kill(), an action!
>>
> And here is where I disagree with your interpretation.  Triggering an
> event is a side effect of writing to the file.  There are *two*
> security relevant actions, not one, and they are:
>
> First, the write:
>
> Subject == the writer
> Action == write
> Object == the file
>
> Then the event, which could be modeled in a couple of ways:
>
> Subject == the file

Files   are   not   subjects. They are passive entities.

> Action == notify
> Object == the recipient
>
> or
>
> Subject == the recipient
> Action == watch
> Object == the file
>
> By conflating these two actions into one, you've made the modeling
> very hard, and you start running into all these nasty questions like
> "who actually closed this open file"

No, I've made the code more difficult. You can not call
the file a subject. That is just wrong. It's not a valid
model.

David Howells June 10, 2019, 10:07 p.m. UTC | #9

Casey Schaufler <casey@schaufler-ca.com> wrote:

> Process A and process B both open /dev/null.
> A and B can write and read to their hearts content
> to/from /dev/null without ever once communicating.
> The mutual accessibility of /dev/null in no way implies that
> A and B can communicate. If A can set a watch on /dev/null,
> and B triggers an event, there still has to be an access
> check on the delivery of the event because delivering an event
> to A is not an action on /dev/null, but on A.

If a process has the privilege, it appears that fanotify() allows that process
to see others accessing /dev/null (FAN_ACCESS, FAN_ACCESS_PERM).  There don't
seem to be any LSM checks there either.

On the other hand, the privilege required is CAP_SYS_ADMIN,

> > The mount tree can't be modified by unprivileged users, unless a
> > privileged user very carefully configured it as such.
> 
> "Unless" means *is* possible. In which case access control is
> required. I will admit to being less then expert on the extent
> to which mounts can be done without privilege.

Automounts in network filesystems, for example.

The initial mount of the network filesystem requires local privilege, but then
mountpoints are managed with remote privilege as granted by things like
kerberos tickets.  The local kernel has no control.

If you have CONFIG_AFS_FS enabled in your kernel, for example, and you install
the keyutils package (dnf, rpm, apt, etc.), then you should be able to do:

	mount -t afs none /mnt -o dyn
	ls /afs/grand.central.org/software/

for example.  That will go through a couple of automount points.  Assuming you
don't have a kerberos login on those servers, however, you shouldn't be able
to add new mountpoints.

Someone watching the mount topology can see events when an automount is
enacted and when it expires, the latter being an event with the system as the
subject since the expiry is done on a timeout set by the kernel.

David

Andy Lutomirski June 11, 2019, 12:13 a.m. UTC | #10

> On Jun 10, 2019, at 2:25 PM, Casey Schaufler <casey@schaufler-ca.com> wrote:
> 
>> On 6/10/2019 12:53 PM, Andy Lutomirski wrote:
>> On Mon, Jun 10, 2019 at 12:34 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>>>> I think you really need to give an example of a coherent policy that
>>>>>> needs this.
>>>>> I keep telling you, and you keep ignoring what I say.
>>>>> 
>>>>>> As it stands, your analogy seems confusing.
>>>>> It's pretty simple. I have given both the abstract
>>>>> and examples.
>>>> You gave the /dev/null example, which is inapplicable to this patchset.
>>> That addressed an explicit objection, and pointed out
>>> an exception to a generality you had asserted, which was
>>> not true. It's also a red herring regarding the current
>>> discussion.
>> This argument is pointless.
>> 
>> Please humor me and just give me an example.  If you think you have
>> already done so, feel free to repeat yourself.  If you have no
>> example, then please just say so.
> 
> To repeat the /dev/null example:
> 
> Process A and process B both open /dev/null.
> A and B can write and read to their hearts content
> to/from /dev/null without ever once communicating.
> The mutual accessibility of /dev/null in no way implies that
> A and B can communicate. If A can set a watch on /dev/null,
> and B triggers an event, there still has to be an access
> check on the delivery of the event because delivering an event
> to A is not an action on /dev/null, but on A.
> 

At discussed, this is an irrelevant straw man. This patch series does not produce events when this happens. I’m looking for a relevant example, please.
> 
> 
>>  An unprivileged
>> user can create a new userns and a new mount ns, but then they're
>> modifying a whole different mount tree.
> 
> Within those namespaces you can still have multiple users,
> constrained be system access control policy.

And the one doing the mounting will be constrained by MAC and DAC policy, as always.  The namespace creator is, from the perspective of those processes, admin.

> 
>> 
>>>>>> Similarly, if someone
>>>>>> tries to receive a packet on a socket, we check whether they have the
>>>>>> right to receive on that socket (from the endpoint in question) and,
>>>>>> if the sender is local, whether the sender can send to that socket.
>>>>>> We do not check whether the sender can send to the receiver.
>>>>> Bzzzt! Smack sure does.
>>>> This seems dubious. I’m still trying to get you to explain to a non-Smack person why this makes sense.
>>> Process A sends a packet to process B.
>>> If A has access to TopSecret data and B is not
>>> allowed to see TopSecret data, the delivery should
>>> be prevented. Is that nonsensical?
>> It makes sense.  As I see it, the way that a sensible policy should do
>> this is by making sure that there are no sockets, pipes, etc that
>> Process A can write and that Process B can read.
> 
> You can't explain UDP controls without doing the access check
> on packet delivery. The sendmsg() succeeds when the packet leaves
> the sender. There doesn't even have to be a socket bound to the
> port. The only opportunity you have for control is on packet
> delivery, which is the only point at which you can have the
> information required.

Huh?  You sendmsg() from an address to an address.  My point is that, for most purposes, that’s all the information that’s needed.

> 
>> If you really want to prevent a malicious process with TopSecret data
>> from sending it to a different process, then you can't use Linux on
>> x86 or ARM.  Maybe that will be fixed some day, but you're going to
>> need to use an extremely tight sandbox to make this work.
> 
> I won't be commenting on that.

Then why is preventing this is an absolute requirement? It’s unattainable.

> 
>> 
>>>>>> The signal example is inapplicable.
>>>>> From a modeling viewpoint the actions are identical.
>>>> This seems incorrect to me
>>> What would be correct then? Some convoluted combination
>>> of system entities that aren't owned or controlled by
>>> any mechanism?
>>> 
>> POSIX signal restrictions aren't there to prevent two processes from
>> communicating.  They're there to prevent the sender from manipulating
>> or crashing the receiver without appropriate privilege.
> 
> POSIX signal restrictions have a long history. In the P10031e/2c
> debates both communication and manipulation where seriously
> considered. I would say both are true.
> 
>>>> and, I think, to most everyone else reading this.
>>> That's quite the assertion. You may even be correct.
>>> 
>>>> Can you explain?
>>>> 
>>>> In SELinux-ese, when you write to a file, the subject is the writer and the object is the file.  When you send a signal to a process, the object is the target process.
>>> YES!!!!!!!!!!!!
>>> 
>>> And when a process triggers a notification it is the subject
>>> and the watching process is the object!
>>> 
>>> Subject == active entity
>>> Object  == passive entity
>>> 
>>> Triggering an event is, like calling kill(), an action!
>>> 
>> And here is where I disagree with your interpretation.  Triggering an
>> event is a side effect of writing to the file.  There are *two*
>> security relevant actions, not one, and they are:
>> 
>> First, the write:
>> 
>> Subject == the writer
>> Action == write
>> Object == the file
>> 
>> Then the event, which could be modeled in a couple of ways:
>> 
>> Subject == the file
> 
> Files   are   not   subjects. They are passive entities.
> 
>> Action == notify
>> Object == the recipient

Great. Then use the variant below.

>> 
>> or
>> 
>> Subject == the recipient
>> Action == watch
>> Object == the file
>> 
>> By conflating these two actions into one, you've made the modeling
>> very hard, and you start running into all these nasty questions like
>> "who actually closed this open file"
> 
> No, I've made the code more difficult.
> You can not call
> the file a subject. That is just wrong. It's not a valid
> model.

You’ve ignored the “Action == watch” variant. Do you care to comment?

Stephen Smalley June 11, 2019, 2:32 p.m. UTC | #11

On 6/10/19 8:13 PM, Andy Lutomirski wrote:
> 
> 
>> On Jun 10, 2019, at 2:25 PM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>>
>>> On 6/10/2019 12:53 PM, Andy Lutomirski wrote:
>>> On Mon, Jun 10, 2019 at 12:34 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>>>>> I think you really need to give an example of a coherent policy that
>>>>>>> needs this.
>>>>>> I keep telling you, and you keep ignoring what I say.
>>>>>>
>>>>>>> As it stands, your analogy seems confusing.
>>>>>> It's pretty simple. I have given both the abstract
>>>>>> and examples.
>>>>> You gave the /dev/null example, which is inapplicable to this patchset.
>>>> That addressed an explicit objection, and pointed out
>>>> an exception to a generality you had asserted, which was
>>>> not true. It's also a red herring regarding the current
>>>> discussion.
>>> This argument is pointless.
>>>
>>> Please humor me and just give me an example.  If you think you have
>>> already done so, feel free to repeat yourself.  If you have no
>>> example, then please just say so.
>>
>> To repeat the /dev/null example:
>>
>> Process A and process B both open /dev/null.
>> A and B can write and read to their hearts content
>> to/from /dev/null without ever once communicating.
>> The mutual accessibility of /dev/null in no way implies that
>> A and B can communicate. If A can set a watch on /dev/null,
>> and B triggers an event, there still has to be an access
>> check on the delivery of the event because delivering an event
>> to A is not an action on /dev/null, but on A.
>>
> 
> At discussed, this is an irrelevant straw man. This patch series does not produce events when this happens. I’m looking for a relevant example, please.
>>
>>
>>>   An unprivileged
>>> user can create a new userns and a new mount ns, but then they're
>>> modifying a whole different mount tree.
>>
>> Within those namespaces you can still have multiple users,
>> constrained be system access control policy.
> 
> And the one doing the mounting will be constrained by MAC and DAC policy, as always.  The namespace creator is, from the perspective of those processes, admin.
> 
>>
>>>
>>>>>>> Similarly, if someone
>>>>>>> tries to receive a packet on a socket, we check whether they have the
>>>>>>> right to receive on that socket (from the endpoint in question) and,
>>>>>>> if the sender is local, whether the sender can send to that socket.
>>>>>>> We do not check whether the sender can send to the receiver.
>>>>>> Bzzzt! Smack sure does.
>>>>> This seems dubious. I’m still trying to get you to explain to a non-Smack person why this makes sense.
>>>> Process A sends a packet to process B.
>>>> If A has access to TopSecret data and B is not
>>>> allowed to see TopSecret data, the delivery should
>>>> be prevented. Is that nonsensical?
>>> It makes sense.  As I see it, the way that a sensible policy should do
>>> this is by making sure that there are no sockets, pipes, etc that
>>> Process A can write and that Process B can read.
>>
>> You can't explain UDP controls without doing the access check
>> on packet delivery. The sendmsg() succeeds when the packet leaves
>> the sender. There doesn't even have to be a socket bound to the
>> port. The only opportunity you have for control is on packet
>> delivery, which is the only point at which you can have the
>> information required.
> 
> Huh?  You sendmsg() from an address to an address.  My point is that, for most purposes, that’s all the information that’s needed.
> 
>>
>>> If you really want to prevent a malicious process with TopSecret data
>>> from sending it to a different process, then you can't use Linux on
>>> x86 or ARM.  Maybe that will be fixed some day, but you're going to
>>> need to use an extremely tight sandbox to make this work.
>>
>> I won't be commenting on that.
> 
> Then why is preventing this is an absolute requirement? It’s unattainable.
> 
>>
>>>
>>>>>>> The signal example is inapplicable.
>>>>>>  From a modeling viewpoint the actions are identical.
>>>>> This seems incorrect to me
>>>> What would be correct then? Some convoluted combination
>>>> of system entities that aren't owned or controlled by
>>>> any mechanism?
>>>>
>>> POSIX signal restrictions aren't there to prevent two processes from
>>> communicating.  They're there to prevent the sender from manipulating
>>> or crashing the receiver without appropriate privilege.
>>
>> POSIX signal restrictions have a long history. In the P10031e/2c
>> debates both communication and manipulation where seriously
>> considered. I would say both are true.
>>
>>>>> and, I think, to most everyone else reading this.
>>>> That's quite the assertion. You may even be correct.
>>>>
>>>>> Can you explain?
>>>>>
>>>>> In SELinux-ese, when you write to a file, the subject is the writer and the object is the file.  When you send a signal to a process, the object is the target process.
>>>> YES!!!!!!!!!!!!
>>>>
>>>> And when a process triggers a notification it is the subject
>>>> and the watching process is the object!
>>>>
>>>> Subject == active entity
>>>> Object  == passive entity
>>>>
>>>> Triggering an event is, like calling kill(), an action!
>>>>
>>> And here is where I disagree with your interpretation.  Triggering an
>>> event is a side effect of writing to the file.  There are *two*
>>> security relevant actions, not one, and they are:
>>>
>>> First, the write:
>>>
>>> Subject == the writer
>>> Action == write
>>> Object == the file
>>>
>>> Then the event, which could be modeled in a couple of ways:
>>>
>>> Subject == the file
>>
>> Files   are   not   subjects. They are passive entities.
>>
>>> Action == notify
>>> Object == the recipient
> 
> Great. Then use the variant below.
> 
>>>
>>> or
>>>
>>> Subject == the recipient
>>> Action == watch
>>> Object == the file
>>>
>>> By conflating these two actions into one, you've made the modeling
>>> very hard, and you start running into all these nasty questions like
>>> "who actually closed this open file"
>>
>> No, I've made the code more difficult.
>> You can not call
>> the file a subject. That is just wrong. It's not a valid
>> model.
> 
> You’ve ignored the “Action == watch” variant. Do you care to comment?

While I agree with this model in general, I will note two caveats when 
trying to apply this to watches/notifications:

1) The object on which the notification was triggered and the object on 
which the watch was placed are not necessarily the same and access to 
one might not imply access to the other,

2) If notifications can be triggered by read-like operations (as in 
fanotify, for example), then a "read" can be turned into a "write" flow 
through a notification.

Whether or not these caveats are applicable to the notifications in this 
series I am not clear.

David Howells June 12, 2019, 8:55 a.m. UTC | #12

Stephen Smalley <sds@tycho.nsa.gov> wrote:

> 2) If notifications can be triggered by read-like operations (as in fanotify,
> for example), then a "read" can be turned into a "write" flow through a
> notification.

I don't think any of the things can be classed as "read-like" operations.  At
the moment, there are the following groups:

 (1) Addition of objects (eg. key_link, mount).

 (2) Modifications to things (eg. keyctl_write, remount).

 (3) Removal of objects (eg. key_unlink, unmount, fput+FMODE_NEED_UNMOUNT).

 (4) I/O or hardware errors (eg. USB device add/remove, EDQUOT, ENOSPC).

I have not currently defined any access events.

I've been looking at the possibility of having epoll generate events this way,
but that's not as straightforward as I'd hoped and fanotify could potentially
use it also, but in both those cases, the process is already getting the
events currently by watching for them using synchronous waiting syscalls.
Instead this would generate an event to say it had happened.

David

[RFC,00/13] Mount, FS, Block and Keyrings notifications [ver #4]

Message

Comments