[RFC,00/10] Mount, FS, Block and Keyrings notifications [ver #3]

Message ID	155981411940.17513.7137844619951358374.stgit@warthog.procyon.org.uk (mailing list archive)
Headers	show Return-Path: <linux-usb-owner@kernel.org> Subject: [RFC][PATCH 00/10] Mount, FS, Block and Keyrings notifications [ver #3] From: David Howells <dhowells@redhat.com> To: viro@zeniv.linux.org.uk Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>, Casey Schaufler <casey@schaufler-ca.com>, linux-usb@vger.kernel.org, dhowells@redhat.com, raven@themaw.net, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, linux-block@vger.kernel.org, keyrings@vger.kernel.org, linux-security-module@vger.kernel.org, linux-kernel@vger.kernel.org Date: Thu, 06 Jun 2019 10:41:59 +0100 Message-ID: <155981411940.17513.7137844619951358374.stgit@warthog.procyon.org.uk> User-Agent: StGit/unknown-version MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Sender: linux-usb-owner@vger.kernel.org Precedence: bulk
Series	Mount, FS, Block and Keyrings notifications [ver #3] \| expand [RFC,00/10] Mount, FS, Block and Keyrings notifications [ver #3] [09/10] usb: Add USB subsystem notifications [ver #3]

David Howells June 6, 2019, 9:41 a.m. UTC

Hi Al,

Here's a set of patches to add a general variable-length notification queue
concept and to add sources of events for:

 (1) Mount topology events, such as mounting, unmounting, mount expiry,
     mount reconfiguration.

 (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O
     errors (not complete yet).

 (3) Key/keyring events, such as creating, linking and removal of keys.

 (4) General device events (single common queue) including:

     - Block layer events, such as device errors

     - USB subsystem events, such as device/bus attach/remove, device
       reset, device errors.

One of the reasons for this is so that we can remove the issue of processes
having to repeatedly and regularly scan /proc/mounts, which has proven to
be a system performance problem.  To further aid this, the fsinfo() syscall
on which this patch series depends, provides a way to access superblock and
mount information in binary form without the need to parse /proc/mounts.


LSM support is included, but controversial:

 (1) The creds of the process that did the fput() that reduced the refcount
     to zero are cached in the file struct.

 (2) __fput() overrides the current creds with the creds from (1) whilst
     doing the cleanup, thereby making sure that the creds seen by the
     destruction notification generated by mntput() appears to come from
     the last fputter.

 (3) security_post_notification() is called for each queue that we might
     want to post a notification into, thereby allowing the LSM to prevent
     covert communications.

 (?) Do I need to add security_set_watch(), say, to rule on whether a watch
     may be set in the first place?  I might need to add a variant per
     watch-type.

 (?) Do I really need to keep track of the process creds in which an
     implicit object destruction happened?  For example, imagine you create
     an fd with fsopen()/fsmount().  It is marked to dissolve the mount it
     refers to on close unless move_mount() clears that flag.  Now, imagine
     someone looking at that fd through procfs at the same time as you exit
     due to an error.  The LSM sees the destruction notification come from
     the looker if they happen to do their fput() after yours.


Design decisions:

 (1) A misc chardev is used to create and open a ring buffer:

	fd = open("/dev/watch_queue", O_RDWR);

     which is then configured and mmap'd into userspace:

	ioctl(fd, IOC_WATCH_QUEUE_SET_SIZE, BUF_SIZE);
	ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter);
	buf = mmap(NULL, BUF_SIZE * page_size, PROT_READ | PROT_WRITE,
		   MAP_SHARED, fd, 0);

     The fd cannot be read or written (though there is a facility to use
     write to inject records for debugging) and userspace just pulls data
     directly out of the buffer.

 (2) The ring index pointers are stored inside the ring and are thus
     accessible to userspace.  Userspace should only update the tail
     pointer and never the head pointer or risk breaking the buffer.  The
     kernel checks that the pointers appear valid before trying to use
     them.  A 'skip' record is maintained around the pointers.

 (3) poll() can be used to wait for data to appear in the buffer.

 (4) Records in the buffer are binary, typed and have a length so that they
     can be of varying size.

     This means that multiple heterogeneous sources can share a common
     buffer.  Tags may be specified when a watchpoint is created to help
     distinguish the sources.

 (5) The queue is reusable as there are 16 million types available, of
     which I've used 4, so there is scope for others to be used.

 (6) Records are filterable as types have up to 256 subtypes that can be
     individually filtered.  Other filtration is also available.

 (7) Each time the buffer is opened, a new buffer is created - this means
     that there's no interference between watchers.

 (8) When recording a notification, the kernel will not sleep, but will
     rather mark a queue as overrun if there's insufficient space, thereby
     avoiding userspace causing the kernel to hang.

 (9) The 'watchpoint' should be specific where possible, meaning that you
     specify the object that you want to watch.

(10) The buffer is created and then watchpoints are attached to it, using
     one of:

	keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fd, 0x01);
	mount_notify(AT_FDCWD, "/", 0, fd, 0x02);
	sb_notify(AT_FDCWD, "/mnt", 0, fd, 0x03);

     where in all three cases, fd indicates the queue and the number after
     is a tag between 0 and 255.

(11) The watch must be removed if either the watch buffer is destroyed or
     the watched object is destroyed.


Things I want to avoid:

 (1) Introducing features that make the core VFS dependent on the network
     stack or networking namespaces (ie. usage of netlink).

 (2) Dumping all this stuff into dmesg and having a daemon that sits there
     parsing the output and distributing it as this then puts the
     responsibility for security into userspace and makes handling
     namespaces tricky.  Further, dmesg might not exist or might be
     inaccessible inside a container.

 (3) Letting users see events they shouldn't be able to see.


Further things that could be considered:

 (1) Adding a keyctl call to allow a watch on a keyring to be extended to
     "children" of that keyring, such that the watch is removed from the
     child if it is unlinked from the keyring.

 (2) Adding global superblock event queue.

 (3) Propagating watches to child superblock over automounts.


The patches can be found here also:

	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=notifications

Changes:

 v3: I've added a USB notification source and reformulated the block
     notification source so that there's now a common watch list, for which
     the system call is now device_notify().

     I've assigned a pair of unused ioctl numbers in the 'W' series to the
     ioctls added by this series.

     I've also added a description of the kernel API to the documentation.

 v2: I've fixed various issues raised by Jann Horn and GregKH and moved to
     krefs for refcounting.  I've added some security features to try and
     give Casey Schaufler the LSM control he wants.

David
---
David Howells (10):
      security: Override creds in __fput() with last fputter's creds
      General notification queue with user mmap()'able ring buffer
      keys: Add a notification facility
      vfs: Add a mount-notification facility
      vfs: Add superblock notifications
      fsinfo: Export superblock notification counter
      Add a general, global device notification watch list
      block: Add block layer notifications
      usb: Add USB subsystem notifications
      Add sample notification program


 Documentation/ioctl/ioctl-number.txt   |    1 
 Documentation/security/keys/core.rst   |   58 ++
 Documentation/watch_queue.rst          |  492 ++++++++++++++++++
 arch/x86/entry/syscalls/syscall_32.tbl |    3 
 arch/x86/entry/syscalls/syscall_64.tbl |    3 
 block/Kconfig                          |    9 
 block/blk-core.c                       |   29 +
 drivers/base/Kconfig                   |    9 
 drivers/base/Makefile                  |    1 
 drivers/base/notify.c                  |   82 +++
 drivers/misc/Kconfig                   |   13 
 drivers/misc/Makefile                  |    1 
 drivers/misc/watch_queue.c             |  889 ++++++++++++++++++++++++++++++++
 drivers/usb/core/Kconfig               |   10 
 drivers/usb/core/devio.c               |   55 ++
 drivers/usb/core/hub.c                 |    3 
 fs/Kconfig                             |   21 +
 fs/Makefile                            |    1 
 fs/file_table.c                        |   12 
 fs/fsinfo.c                            |   12 
 fs/mount.h                             |   33 +
 fs/mount_notify.c                      |  180 ++++++
 fs/namespace.c                         |    9 
 fs/super.c                             |  116 ++++
 include/linux/blkdev.h                 |   15 +
 include/linux/dcache.h                 |    1 
 include/linux/device.h                 |    7 
 include/linux/fs.h                     |   79 +++
 include/linux/key.h                    |    4 
 include/linux/lsm_hooks.h              |   15 +
 include/linux/security.h               |   14 +
 include/linux/syscalls.h               |    5 
 include/linux/usb.h                    |   19 +
 include/linux/watch_queue.h            |   87 +++
 include/uapi/linux/fsinfo.h            |   10 
 include/uapi/linux/keyctl.h            |    1 
 include/uapi/linux/watch_queue.h       |  213 ++++++++
 kernel/sys_ni.c                        |    7 
 mm/interval_tree.c                     |    2 
 mm/memory.c                            |    1 
 samples/Kconfig                        |    6 
 samples/Makefile                       |    1 
 samples/vfs/test-fsinfo.c              |   13 
 samples/watch_queue/Makefile           |    9 
 samples/watch_queue/watch_test.c       |  310 +++++++++++
 security/keys/Kconfig                  |   10 
 security/keys/compat.c                 |    2 
 security/keys/gc.c                     |    5 
 security/keys/internal.h               |   30 +
 security/keys/key.c                    |   37 +
 security/keys/keyctl.c                 |   88 +++
 security/keys/keyring.c                |   17 -
 security/keys/request_key.c            |    4 
 security/security.c                    |    9 
 54 files changed, 3025 insertions(+), 38 deletions(-)
 create mode 100644 Documentation/watch_queue.rst
 create mode 100644 drivers/base/notify.c
 create mode 100644 drivers/misc/watch_queue.c
 create mode 100644 fs/mount_notify.c
 create mode 100644 include/linux/watch_queue.h
 create mode 100644 include/uapi/linux/watch_queue.h
 create mode 100644 samples/watch_queue/Makefile
 create mode 100644 samples/watch_queue/watch_test.c

Stephen Smalley June 6, 2019, 12:32 p.m. UTC | #1

On 6/6/19 5:41 AM, David Howells wrote:
> 
> Hi Al,
> 
> Here's a set of patches to add a general variable-length notification queue
> concept and to add sources of events for:
> 
>   (1) Mount topology events, such as mounting, unmounting, mount expiry,
>       mount reconfiguration.
> 
>   (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O
>       errors (not complete yet).
> 
>   (3) Key/keyring events, such as creating, linking and removal of keys.
> 
>   (4) General device events (single common queue) including:
> 
>       - Block layer events, such as device errors
> 
>       - USB subsystem events, such as device/bus attach/remove, device
>         reset, device errors.
> 
> One of the reasons for this is so that we can remove the issue of processes
> having to repeatedly and regularly scan /proc/mounts, which has proven to
> be a system performance problem.  To further aid this, the fsinfo() syscall
> on which this patch series depends, provides a way to access superblock and
> mount information in binary form without the need to parse /proc/mounts.
> 
> 
> LSM support is included, but controversial:
> 
>   (1) The creds of the process that did the fput() that reduced the refcount
>       to zero are cached in the file struct.
> 
>   (2) __fput() overrides the current creds with the creds from (1) whilst
>       doing the cleanup, thereby making sure that the creds seen by the
>       destruction notification generated by mntput() appears to come from
>       the last fputter.
> 
>   (3) security_post_notification() is called for each queue that we might
>       want to post a notification into, thereby allowing the LSM to prevent
>       covert communications.
> 
>   (?) Do I need to add security_set_watch(), say, to rule on whether a watch
>       may be set in the first place?  I might need to add a variant per
>       watch-type.
> 
>   (?) Do I really need to keep track of the process creds in which an
>       implicit object destruction happened?  For example, imagine you create
>       an fd with fsopen()/fsmount().  It is marked to dissolve the mount it
>       refers to on close unless move_mount() clears that flag.  Now, imagine
>       someone looking at that fd through procfs at the same time as you exit
>       due to an error.  The LSM sees the destruction notification come from
>       the looker if they happen to do their fput() after yours.


I'm not in favor of this approach. Can we check permission to the object 
being watched when a watch is set (read-like access), make sure every 
access that can trigger a notification requires a (write-like) 
permission to the accessed object, and make sure there is some sane way 
to control the relationship between the accessed object and the watched 
object (write-like)?  For cases where we have no object per se or at 
least no security structure/label associated with it, we may have to 
fall back to a coarse-grained "Can the watcher get this kind of 
notification in general?".

> 
> 
> Design decisions:
> 
>   (1) A misc chardev is used to create and open a ring buffer:
> 
> 	fd = open("/dev/watch_queue", O_RDWR);
> 
>       which is then configured and mmap'd into userspace:
> 
> 	ioctl(fd, IOC_WATCH_QUEUE_SET_SIZE, BUF_SIZE);
> 	ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter);
> 	buf = mmap(NULL, BUF_SIZE * page_size, PROT_READ | PROT_WRITE,
> 		   MAP_SHARED, fd, 0);
> 
>       The fd cannot be read or written (though there is a facility to use
>       write to inject records for debugging) and userspace just pulls data
>       directly out of the buffer.
> 
>   (2) The ring index pointers are stored inside the ring and are thus
>       accessible to userspace.  Userspace should only update the tail
>       pointer and never the head pointer or risk breaking the buffer.  The
>       kernel checks that the pointers appear valid before trying to use
>       them.  A 'skip' record is maintained around the pointers.
> 
>   (3) poll() can be used to wait for data to appear in the buffer.
> 
>   (4) Records in the buffer are binary, typed and have a length so that they
>       can be of varying size.
> 
>       This means that multiple heterogeneous sources can share a common
>       buffer.  Tags may be specified when a watchpoint is created to help
>       distinguish the sources.
> 
>   (5) The queue is reusable as there are 16 million types available, of
>       which I've used 4, so there is scope for others to be used.
> 
>   (6) Records are filterable as types have up to 256 subtypes that can be
>       individually filtered.  Other filtration is also available.
> 
>   (7) Each time the buffer is opened, a new buffer is created - this means
>       that there's no interference between watchers.
> 
>   (8) When recording a notification, the kernel will not sleep, but will
>       rather mark a queue as overrun if there's insufficient space, thereby
>       avoiding userspace causing the kernel to hang.
> 
>   (9) The 'watchpoint' should be specific where possible, meaning that you
>       specify the object that you want to watch.
> 
> (10) The buffer is created and then watchpoints are attached to it, using
>       one of:
> 
> 	keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fd, 0x01);
> 	mount_notify(AT_FDCWD, "/", 0, fd, 0x02);
> 	sb_notify(AT_FDCWD, "/mnt", 0, fd, 0x03);
> 
>       where in all three cases, fd indicates the queue and the number after
>       is a tag between 0 and 255.
> 
> (11) The watch must be removed if either the watch buffer is destroyed or
>       the watched object is destroyed.
> 
> 
> Things I want to avoid:
> 
>   (1) Introducing features that make the core VFS dependent on the network
>       stack or networking namespaces (ie. usage of netlink).
> 
>   (2) Dumping all this stuff into dmesg and having a daemon that sits there
>       parsing the output and distributing it as this then puts the
>       responsibility for security into userspace and makes handling
>       namespaces tricky.  Further, dmesg might not exist or might be
>       inaccessible inside a container.
> 
>   (3) Letting users see events they shouldn't be able to see.
> 
> 
> Further things that could be considered:
> 
>   (1) Adding a keyctl call to allow a watch on a keyring to be extended to
>       "children" of that keyring, such that the watch is removed from the
>       child if it is unlinked from the keyring.
> 
>   (2) Adding global superblock event queue.
> 
>   (3) Propagating watches to child superblock over automounts.
> 
> 
> The patches can be found here also:
> 
> 	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=notifications
> 
> Changes:
> 
>   v3: I've added a USB notification source and reformulated the block
>       notification source so that there's now a common watch list, for which
>       the system call is now device_notify().
> 
>       I've assigned a pair of unused ioctl numbers in the 'W' series to the
>       ioctls added by this series.
> 
>       I've also added a description of the kernel API to the documentation.
> 
>   v2: I've fixed various issues raised by Jann Horn and GregKH and moved to
>       krefs for refcounting.  I've added some security features to try and
>       give Casey Schaufler the LSM control he wants.
> 
> David
> ---
> David Howells (10):
>        security: Override creds in __fput() with last fputter's creds
>        General notification queue with user mmap()'able ring buffer
>        keys: Add a notification facility
>        vfs: Add a mount-notification facility
>        vfs: Add superblock notifications
>        fsinfo: Export superblock notification counter
>        Add a general, global device notification watch list
>        block: Add block layer notifications
>        usb: Add USB subsystem notifications
>        Add sample notification program
> 
> 
>   Documentation/ioctl/ioctl-number.txt   |    1
>   Documentation/security/keys/core.rst   |   58 ++
>   Documentation/watch_queue.rst          |  492 ++++++++++++++++++
>   arch/x86/entry/syscalls/syscall_32.tbl |    3
>   arch/x86/entry/syscalls/syscall_64.tbl |    3
>   block/Kconfig                          |    9
>   block/blk-core.c                       |   29 +
>   drivers/base/Kconfig                   |    9
>   drivers/base/Makefile                  |    1
>   drivers/base/notify.c                  |   82 +++
>   drivers/misc/Kconfig                   |   13
>   drivers/misc/Makefile                  |    1
>   drivers/misc/watch_queue.c             |  889 ++++++++++++++++++++++++++++++++
>   drivers/usb/core/Kconfig               |   10
>   drivers/usb/core/devio.c               |   55 ++
>   drivers/usb/core/hub.c                 |    3
>   fs/Kconfig                             |   21 +
>   fs/Makefile                            |    1
>   fs/file_table.c                        |   12
>   fs/fsinfo.c                            |   12
>   fs/mount.h                             |   33 +
>   fs/mount_notify.c                      |  180 ++++++
>   fs/namespace.c                         |    9
>   fs/super.c                             |  116 ++++
>   include/linux/blkdev.h                 |   15 +
>   include/linux/dcache.h                 |    1
>   include/linux/device.h                 |    7
>   include/linux/fs.h                     |   79 +++
>   include/linux/key.h                    |    4
>   include/linux/lsm_hooks.h              |   15 +
>   include/linux/security.h               |   14 +
>   include/linux/syscalls.h               |    5
>   include/linux/usb.h                    |   19 +
>   include/linux/watch_queue.h            |   87 +++
>   include/uapi/linux/fsinfo.h            |   10
>   include/uapi/linux/keyctl.h            |    1
>   include/uapi/linux/watch_queue.h       |  213 ++++++++
>   kernel/sys_ni.c                        |    7
>   mm/interval_tree.c                     |    2
>   mm/memory.c                            |    1
>   samples/Kconfig                        |    6
>   samples/Makefile                       |    1
>   samples/vfs/test-fsinfo.c              |   13
>   samples/watch_queue/Makefile           |    9
>   samples/watch_queue/watch_test.c       |  310 +++++++++++
>   security/keys/Kconfig                  |   10
>   security/keys/compat.c                 |    2
>   security/keys/gc.c                     |    5
>   security/keys/internal.h               |   30 +
>   security/keys/key.c                    |   37 +
>   security/keys/keyctl.c                 |   88 +++
>   security/keys/keyring.c                |   17 -
>   security/keys/request_key.c            |    4
>   security/security.c                    |    9
>   54 files changed, 3025 insertions(+), 38 deletions(-)
>   create mode 100644 Documentation/watch_queue.rst
>   create mode 100644 drivers/base/notify.c
>   create mode 100644 drivers/misc/watch_queue.c
>   create mode 100644 fs/mount_notify.c
>   create mode 100644 include/linux/watch_queue.h
>   create mode 100644 include/uapi/linux/watch_queue.h
>   create mode 100644 samples/watch_queue/Makefile
>   create mode 100644 samples/watch_queue/watch_test.c
>

David Howells June 6, 2019, 1:16 p.m. UTC | #2

Stephen Smalley <sds@tycho.nsa.gov> wrote:

This might be easier to discuss if you can reply to:

	https://lore.kernel.org/lkml/5393.1559768763@warthog.procyon.org.uk/

which is on the ver #2 posting of this patchset.

> > LSM support is included, but controversial:
> >
> >   (1) The creds of the process that did the fput() that reduced the refcount
> >       to zero are cached in the file struct.
> >
> >   (2) __fput() overrides the current creds with the creds from (1) whilst
> >       doing the cleanup, thereby making sure that the creds seen by the
> >       destruction notification generated by mntput() appears to come from
> >       the last fputter.
> >
> >   (3) security_post_notification() is called for each queue that we might
> >       want to post a notification into, thereby allowing the LSM to prevent
> >       covert communications.
> >
> >   (?) Do I need to add security_set_watch(), say, to rule on whether a watch
> >       may be set in the first place?  I might need to add a variant per
> >       watch-type.
> >
> >   (?) Do I really need to keep track of the process creds in which an
> >       implicit object destruction happened?  For example, imagine you create
> >       an fd with fsopen()/fsmount().  It is marked to dissolve the mount it
> >       refers to on close unless move_mount() clears that flag.  Now, imagine
> >       someone looking at that fd through procfs at the same time as you exit
> >       due to an error.  The LSM sees the destruction notification come from
> >       the looker if they happen to do their fput() after yours.
> 
> 
> I'm not in favor of this approach.

Which bit?  The last point?  Keeping track of the process creds after an
implicit object destruction.

> Can we check permission to the object being watched when a watch is set
> (read-like access),

Yes, and I need to do that.  I think it's likely to require an extra hook for
each entry point added because the objects are different:

	int security_watch_key(struct watch *watch, struct key *key);
	int security_watch_sb(struct watch *watch, struct path *path);
	int security_watch_mount(struct watch *watch, struct path *path);
	int security_watch_devices(struct watch *watch);

> make sure every access that can trigger a notification requires a
> (write-like) permission to the accessed object,

"write-like permssion" for whom?  The triggerer or the watcher?

There are various 'classes' of events:

 (1) System events (eg. hardware I/O errors, automount points expiring).

 (2) Direct events (eg. automounts, manual mounts, EDQUOT, key linkage).

 (3) Indirect events (eg. exit/close doing the last fput and causing an
     unmount).

Class (1) are uncaused by a process, so I use init_cred for them.  One could
argue that the automount point expiry should perhaps take place under the
creds of whoever triggered it in the first place, but we need to be careful
about long-term cred pinning.

Class (2) the causing process must've had permission to cause them - otherwise
we wouldn't have got the event.

Class (3) is interesting since it's currently entirely cleanup events and the
process may have the right to do them (close, dup2, exit, but also execve)
whether the LSM thinks it should be able to cause the object to be destroyed
or not.

It gets more complicated than that, though: multiple processes with different
security attributes can all have fds pointing to a common file object - and
the last one to close carries the can as far as the LSM is concerned.

And yet more complicated when you throw in unix sockets with partially passed
fds still in their queues.  That's what patch 01 is designed to try and cope
with.

> and make sure there is some sane way to control the relationship between the
> accessed object and the watched object (write-like)?

This is the trick.  Keys and superblocks have object labels of their own and
don't - for now - propagate their watches.  With these, the watch is on the
object you initially assign it to and it goes no further than that.

mount_notify() is the interesting case since we want to be able to detect
mount topology change events from within the vfs subtree rooted at the watched
directory without having to manually put a watch on every directory in that
subtree - or even just every mount object.

Or, maybe, that's what I'll have to do: make it mount_notify() can only apply
to the subtree within its superblock, and the caller must call mount_notify()
for every mount object it wants to monitor.  That would at least ensure that
the caller can, at that point, reach all those mount points.

> For cases where we have no object per se or at least no security
> structure/label associated with it, we may have to fall back to a
> coarse-grained "Can the watcher get this kind of notification in general?".

Agreed - and we should probably have that anyway.

David

Stephen Smalley June 6, 2019, 2:05 p.m. UTC | #3

On 6/6/19 9:16 AM, David Howells wrote:
> Stephen Smalley <sds@tycho.nsa.gov> wrote:
> 
> This might be easier to discuss if you can reply to:
> 
> 	https://lore.kernel.org/lkml/5393.1559768763@warthog.procyon.org.uk/
> 
> which is on the ver #2 posting of this patchset.

Sorry for being late to the party.  Not sure whether you're asking me to 
respond only there or both there and here to your comments below.  I'll 
start here but can revisit if it's a problem.
> 
>>> LSM support is included, but controversial:
>>>
>>>    (1) The creds of the process that did the fput() that reduced the refcount
>>>        to zero are cached in the file struct.
>>>
>>>    (2) __fput() overrides the current creds with the creds from (1) whilst
>>>        doing the cleanup, thereby making sure that the creds seen by the
>>>        destruction notification generated by mntput() appears to come from
>>>        the last fputter.
>>>
>>>    (3) security_post_notification() is called for each queue that we might
>>>        want to post a notification into, thereby allowing the LSM to prevent
>>>        covert communications.
>>>
>>>    (?) Do I need to add security_set_watch(), say, to rule on whether a watch
>>>        may be set in the first place?  I might need to add a variant per
>>>        watch-type.
>>>
>>>    (?) Do I really need to keep track of the process creds in which an
>>>        implicit object destruction happened?  For example, imagine you create
>>>        an fd with fsopen()/fsmount().  It is marked to dissolve the mount it
>>>        refers to on close unless move_mount() clears that flag.  Now, imagine
>>>        someone looking at that fd through procfs at the same time as you exit
>>>        due to an error.  The LSM sees the destruction notification come from
>>>        the looker if they happen to do their fput() after yours.
>>
>>
>> I'm not in favor of this approach.
> 
> Which bit?  The last point?  Keeping track of the process creds after an
> implicit object destruction.

(1), (2), (3), and the last point.

>> Can we check permission to the object being watched when a watch is set
>> (read-like access),
> 
> Yes, and I need to do that.  I think it's likely to require an extra hook for
> each entry point added because the objects are different:
> 
> 	int security_watch_key(struct watch *watch, struct key *key);
> 	int security_watch_sb(struct watch *watch, struct path *path);
> 	int security_watch_mount(struct watch *watch, struct path *path);
> 	int security_watch_devices(struct watch *watch);
> 
>> make sure every access that can trigger a notification requires a
>> (write-like) permission to the accessed object,
> 
> "write-like permssion" for whom?  The triggerer or the watcher?

The former, i.e. the process that performed the operation that triggered 
the notification.  Think of it as a write from the process to the 
accessed object, which triggers a notification (another write) on some 
related object (the watched object), which is then read by the watcher.

> There are various 'classes' of events:
> 
>   (1) System events (eg. hardware I/O errors, automount points expiring).
> 
>   (2) Direct events (eg. automounts, manual mounts, EDQUOT, key linkage).
> 
>   (3) Indirect events (eg. exit/close doing the last fput and causing an
>       unmount).
> 
> Class (1) are uncaused by a process, so I use init_cred for them.  One could
> argue that the automount point expiry should perhaps take place under the
> creds of whoever triggered it in the first place, but we need to be careful
> about long-term cred pinning.

This seems equivalent to just checking whether the watcher is allowed to 
get that kind of event, no other cred truly needed.

> Class (2) the causing process must've had permission to cause them - otherwise
> we wouldn't have got the event.

So we've already done a check on the causing process, and we're going to 
check whether the watcher can set the watch. We just need to establish 
the connection between the accessed object and the watched object in 
some manner.

> Class (3) is interesting since it's currently entirely cleanup events and the
> process may have the right to do them (close, dup2, exit, but also execve)
> whether the LSM thinks it should be able to cause the object to be destroyed
> or not.
> 
> It gets more complicated than that, though: multiple processes with different
> security attributes can all have fds pointing to a common file object - and
> the last one to close carries the can as far as the LSM is concerned.

Yes, I'd prefer to avoid that.  You can't write policy that is stable 
and meaningful that way.  This may fall under a similar situation as 
class (1) - all we can meaningfully do is check whether the watcher is 
allowed to see all such events.

> And yet more complicated when you throw in unix sockets with partially passed
> fds still in their queues.  That's what patch 01 is designed to try and cope
> with.
> 
>> and make sure there is some sane way to control the relationship between the
>> accessed object and the watched object (write-like)?
> 
> This is the trick.  Keys and superblocks have object labels of their own and
> don't - for now - propagate their watches.  With these, the watch is on the
> object you initially assign it to and it goes no further than that.
> 
> mount_notify() is the interesting case since we want to be able to detect
> mount topology change events from within the vfs subtree rooted at the watched
> directory without having to manually put a watch on every directory in that
> subtree - or even just every mount object. >
> Or, maybe, that's what I'll have to do: make it mount_notify() can only apply
> to the subtree within its superblock, and the caller must call mount_notify()
> for every mount object it wants to monitor.  That would at least ensure that
> the caller can, at that point, reach all those mount points.

Would that at least make it consistent with fanotify (not that it 
provides a great example)?

>> For cases where we have no object per se or at least no security
>> structure/label associated with it, we may have to fall back to a
>> coarse-grained "Can the watcher get this kind of notification in general?".
> 
> Agreed - and we should probably have that anyway.
> 
> David

Christian Brauner June 6, 2019, 2:34 p.m. UTC | #4

On Thu, Jun 06, 2019 at 10:41:59AM +0100, David Howells wrote:
> 
> Hi Al,
> 
> Here's a set of patches to add a general variable-length notification queue
> concept and to add sources of events for:
> 
>  (1) Mount topology events, such as mounting, unmounting, mount expiry,
>      mount reconfiguration.
> 
>  (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O
>      errors (not complete yet).
> 
>  (3) Key/keyring events, such as creating, linking and removal of keys.
> 
>  (4) General device events (single common queue) including:
> 
>      - Block layer events, such as device errors
> 
>      - USB subsystem events, such as device/bus attach/remove, device
>        reset, device errors.
> 
> One of the reasons for this is so that we can remove the issue of processes
> having to repeatedly and regularly scan /proc/mounts, which has proven to
> be a system performance problem.  To further aid this, the fsinfo() syscall
> on which this patch series depends, provides a way to access superblock and
> mount information in binary form without the need to parse /proc/mounts.
> 
> 
> LSM support is included, but controversial:

Apart from the LSM/security controversy here the general direction of
this patchset is pretty well received it seems.
Imho, having a notification mechanism like this is a very good thing for
userspace. So would be great if there'd be a consensus on the LSM bits.

Christian

Casey Schaufler June 6, 2019, 4:43 p.m. UTC | #5

On 6/6/2019 7:05 AM, Stephen Smalley wrote:
> On 6/6/19 9:16 AM, David Howells wrote:
>> Stephen Smalley <sds@tycho.nsa.gov> wrote:
>>
>> This might be easier to discuss if you can reply to:
>>
>>     https://lore.kernel.org/lkml/5393.1559768763@warthog.procyon.org.uk/
>>
>> which is on the ver #2 posting of this patchset.
>
> Sorry for being late to the party.  Not sure whether you're asking me to respond only there or both there and here to your comments below.  I'll start here but can revisit if it's a problem.
>>
>>>> LSM support is included, but controversial:
>>>>
>>>>    (1) The creds of the process that did the fput() that reduced the refcount
>>>>        to zero are cached in the file struct.
>>>>
>>>>    (2) __fput() overrides the current creds with the creds from (1) whilst
>>>>        doing the cleanup, thereby making sure that the creds seen by the
>>>>        destruction notification generated by mntput() appears to come from
>>>>        the last fputter.
>>>>
>>>>    (3) security_post_notification() is called for each queue that we might
>>>>        want to post a notification into, thereby allowing the LSM to prevent
>>>>        covert communications.
>>>>
>>>>    (?) Do I need to add security_set_watch(), say, to rule on whether a watch
>>>>        may be set in the first place?  I might need to add a variant per
>>>>        watch-type.
>>>>
>>>>    (?) Do I really need to keep track of the process creds in which an
>>>>        implicit object destruction happened?  For example, imagine you create
>>>>        an fd with fsopen()/fsmount().  It is marked to dissolve the mount it
>>>>        refers to on close unless move_mount() clears that flag.  Now, imagine
>>>>        someone looking at that fd through procfs at the same time as you exit
>>>>        due to an error.  The LSM sees the destruction notification come from
>>>>        the looker if they happen to do their fput() after yours.
>>>
>>>
>>> I'm not in favor of this approach.
>>
>> Which bit?  The last point?  Keeping track of the process creds after an
>> implicit object destruction.
>
> (1), (2), (3), and the last point.
>
>>> Can we check permission to the object being watched when a watch is set
>>> (read-like access),
>>
>> Yes, and I need to do that.  I think it's likely to require an extra hook for
>> each entry point added because the objects are different:
>>
>>     int security_watch_key(struct watch *watch, struct key *key);
>>     int security_watch_sb(struct watch *watch, struct path *path);
>>     int security_watch_mount(struct watch *watch, struct path *path);
>>     int security_watch_devices(struct watch *watch);
>>
>>> make sure every access that can trigger a notification requires a
>>> (write-like) permission to the accessed object,
>>
>> "write-like permssion" for whom?  The triggerer or the watcher?
>
> The former, i.e. the process that performed the operation that triggered the notification.  Think of it as a write from the process to the accessed object, which triggers a notification (another write) on some related object (the watched object), which is then read by the watcher.

We agree that the process that performed the operation that triggered
the notification is the initial subject. Smack will treat the process
with the watch set (in particular, its ring buffer) as the object
being written to. SELinux, with its finer grained controls, will
involve other things as noted above. There are other place where we
see this, notably IP packet delivery.

The implication is that the information about the triggering
process needs to be available throughout.

>
>> There are various 'classes' of events:
>>
>>   (1) System events (eg. hardware I/O errors, automount points expiring).
>>
>>   (2) Direct events (eg. automounts, manual mounts, EDQUOT, key linkage).
>>
>>   (3) Indirect events (eg. exit/close doing the last fput and causing an
>>       unmount).
>>
>> Class (1) are uncaused by a process, so I use init_cred for them.  One could
>> argue that the automount point expiry should perhaps take place under the
>> creds of whoever triggered it in the first place, but we need to be careful
>> about long-term cred pinning.
>
> This seems equivalent to just checking whether the watcher is allowed to get that kind of event, no other cred truly needed.
>
>> Class (2) the causing process must've had permission to cause them - otherwise
>> we wouldn't have got the event.
>
> So we've already done a check on the causing process, and we're going to check whether the watcher can set the watch. We just need to establish the connection between the accessed object and the watched object in some manner.

I don't agree. That is, I don't believe it is sufficient.
There is no guarantee that being able to set a watch on an
object implies that every process that can trigger the event
can send it to you.

	Watcher has Smack label W
	Triggerer has Smack label T
	Watched object has Smack label O

	Relevant Smack rules are

	W O rw
	T O rw

The watcher will be able to set the watch,
the triggerer will be able to trigger the event,
but there is nothing that would allow the watcher
to receive the event. This is not a case of watcher
reading the watched object, as the event is delivered
without any action by watcher.

>
>> Class (3) is interesting since it's currently entirely cleanup events and the
>> process may have the right to do them (close, dup2, exit, but also execve)
>> whether the LSM thinks it should be able to cause the object to be destroyed
>> or not.
>>
>> It gets more complicated than that, though: multiple processes with different
>> security attributes can all have fds pointing to a common file object - and
>> the last one to close carries the can as far as the LSM is concerned.
>
> Yes, I'd prefer to avoid that.  You can't write policy that is stable and meaningful that way.  This may fall under a similar situation as class (1) - all we can meaningfully do is check whether the watcher is allowed to see all such events.

Back in the day when we were doing security evaluations
the implications of "deleting" an object gave the security
evaluators fits. UNIX/Linux files don't get deleted, they
simply fall off the filesystem namespace when no one cares
about them anymore. The model we used back in the day was
that "delete" wasn't an operation that occurs on filesystem
objects.

But now you want to do something with security implications
when the object disappears. We could say that the event does
nothing but signal that the system has removed the watch on
your behalf because it is now meaningless. No reason to worry
about who dropped the last reference. We don't care about
that from a policy viewpoint anyway.

>
>> And yet more complicated when you throw in unix sockets with partially passed
>> fds still in their queues.  That's what patch 01 is designed to try and cope
>> with.
>>
>>> and make sure there is some sane way to control the relationship between the
>>> accessed object and the watched object (write-like)?
>>
>> This is the trick.  Keys and superblocks have object labels of their own and
>> don't - for now - propagate their watches.  With these, the watch is on the
>> object you initially assign it to and it goes no further than that.
>>
>> mount_notify() is the interesting case since we want to be able to detect
>> mount topology change events from within the vfs subtree rooted at the watched
>> directory without having to manually put a watch on every directory in that
>> subtree - or even just every mount object. >
>> Or, maybe, that's what I'll have to do: make it mount_notify() can only apply
>> to the subtree within its superblock, and the caller must call mount_notify()
>> for every mount object it wants to monitor.  That would at least ensure that
>> the caller can, at that point, reach all those mount points.
>
> Would that at least make it consistent with fanotify (not that it provides a great example)?
>
>>> For cases where we have no object per se or at least no security
>>> structure/label associated with it, we may have to fall back to a
>>> coarse-grained "Can the watcher get this kind of notification in general?".
>>
>> Agreed - and we should probably have that anyway.
>>
>> David

Andy Lutomirski June 6, 2019, 5:11 p.m. UTC | #6

On Thu, Jun 6, 2019 at 9:43 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>
> On 6/6/2019 7:05 AM, Stephen Smalley wrote:
> > On 6/6/19 9:16 AM, David Howells wrote:
> >> Stephen Smalley <sds@tycho.nsa.gov> wrote:
> >>
> >> This might be easier to discuss if you can reply to:
> >>
> >>     https://lore.kernel.org/lkml/5393.1559768763@warthog.procyon.org.uk/
> >>
> >> which is on the ver #2 posting of this patchset.
> >
> > Sorry for being late to the party.  Not sure whether you're asking me to respond only there or both there and here to your comments below.  I'll start here but can revisit if it's a problem.
> >>
> >>>> LSM support is included, but controversial:
> >>>>
> >>>>    (1) The creds of the process that did the fput() that reduced the refcount
> >>>>        to zero are cached in the file struct.
> >>>>
> >>>>    (2) __fput() overrides the current creds with the creds from (1) whilst
> >>>>        doing the cleanup, thereby making sure that the creds seen by the
> >>>>        destruction notification generated by mntput() appears to come from
> >>>>        the last fputter.
> >>>>
> >>>>    (3) security_post_notification() is called for each queue that we might
> >>>>        want to post a notification into, thereby allowing the LSM to prevent
> >>>>        covert communications.
> >>>>
> >>>>    (?) Do I need to add security_set_watch(), say, to rule on whether a watch
> >>>>        may be set in the first place?  I might need to add a variant per
> >>>>        watch-type.
> >>>>
> >>>>    (?) Do I really need to keep track of the process creds in which an
> >>>>        implicit object destruction happened?  For example, imagine you create
> >>>>        an fd with fsopen()/fsmount().  It is marked to dissolve the mount it
> >>>>        refers to on close unless move_mount() clears that flag.  Now, imagine
> >>>>        someone looking at that fd through procfs at the same time as you exit
> >>>>        due to an error.  The LSM sees the destruction notification come from
> >>>>        the looker if they happen to do their fput() after yours.
> >>>
> >>>
> >>> I'm not in favor of this approach.
> >>
> >> Which bit?  The last point?  Keeping track of the process creds after an
> >> implicit object destruction.
> >
> > (1), (2), (3), and the last point.
> >
> >>> Can we check permission to the object being watched when a watch is set
> >>> (read-like access),
> >>
> >> Yes, and I need to do that.  I think it's likely to require an extra hook for
> >> each entry point added because the objects are different:
> >>
> >>     int security_watch_key(struct watch *watch, struct key *key);
> >>     int security_watch_sb(struct watch *watch, struct path *path);
> >>     int security_watch_mount(struct watch *watch, struct path *path);
> >>     int security_watch_devices(struct watch *watch);
> >>
> >>> make sure every access that can trigger a notification requires a
> >>> (write-like) permission to the accessed object,
> >>
> >> "write-like permssion" for whom?  The triggerer or the watcher?
> >
> > The former, i.e. the process that performed the operation that triggered the notification.  Think of it as a write from the process to the accessed object, which triggers a notification (another write) on some related object (the watched object), which is then read by the watcher.
>
> We agree that the process that performed the operation that triggered
> the notification is the initial subject. Smack will treat the process
> with the watch set (in particular, its ring buffer) as the object
> being written to. SELinux, with its finer grained controls, will
> involve other things as noted above. There are other place where we
> see this, notably IP packet delivery.
>
> The implication is that the information about the triggering
> process needs to be available throughout.
>
> >
> >> There are various 'classes' of events:
> >>
> >>   (1) System events (eg. hardware I/O errors, automount points expiring).
> >>
> >>   (2) Direct events (eg. automounts, manual mounts, EDQUOT, key linkage).
> >>
> >>   (3) Indirect events (eg. exit/close doing the last fput and causing an
> >>       unmount).
> >>
> >> Class (1) are uncaused by a process, so I use init_cred for them.  One could
> >> argue that the automount point expiry should perhaps take place under the
> >> creds of whoever triggered it in the first place, but we need to be careful
> >> about long-term cred pinning.
> >
> > This seems equivalent to just checking whether the watcher is allowed to get that kind of event, no other cred truly needed.
> >
> >> Class (2) the causing process must've had permission to cause them - otherwise
> >> we wouldn't have got the event.
> >
> > So we've already done a check on the causing process, and we're going to check whether the watcher can set the watch. We just need to establish the connection between the accessed object and the watched object in some manner.
>
> I don't agree. That is, I don't believe it is sufficient.
> There is no guarantee that being able to set a watch on an
> object implies that every process that can trigger the event
> can send it to you.
>
>         Watcher has Smack label W
>         Triggerer has Smack label T
>         Watched object has Smack label O
>
>         Relevant Smack rules are
>
>         W O rw
>         T O rw
>
> The watcher will be able to set the watch,
> the triggerer will be able to trigger the event,
> but there is nothing that would allow the watcher
> to receive the event. This is not a case of watcher
> reading the watched object, as the event is delivered
> without any action by watcher.

I think this is an example of a bogus policy that should not be
supported by the kernel.

Stephen Smalley June 6, 2019, 5:16 p.m. UTC | #7

On 6/6/19 12:43 PM, Casey Schaufler wrote:
> On 6/6/2019 7:05 AM, Stephen Smalley wrote:
>> On 6/6/19 9:16 AM, David Howells wrote:
>>> Stephen Smalley <sds@tycho.nsa.gov> wrote:
>>>
>>> This might be easier to discuss if you can reply to:
>>>
>>>      https://lore.kernel.org/lkml/5393.1559768763@warthog.procyon.org.uk/
>>>
>>> which is on the ver #2 posting of this patchset.
>>
>> Sorry for being late to the party.  Not sure whether you're asking me to respond only there or both there and here to your comments below.  I'll start here but can revisit if it's a problem.
>>>
>>>>> LSM support is included, but controversial:
>>>>>
>>>>>     (1) The creds of the process that did the fput() that reduced the refcount
>>>>>         to zero are cached in the file struct.
>>>>>
>>>>>     (2) __fput() overrides the current creds with the creds from (1) whilst
>>>>>         doing the cleanup, thereby making sure that the creds seen by the
>>>>>         destruction notification generated by mntput() appears to come from
>>>>>         the last fputter.
>>>>>
>>>>>     (3) security_post_notification() is called for each queue that we might
>>>>>         want to post a notification into, thereby allowing the LSM to prevent
>>>>>         covert communications.
>>>>>
>>>>>     (?) Do I need to add security_set_watch(), say, to rule on whether a watch
>>>>>         may be set in the first place?  I might need to add a variant per
>>>>>         watch-type.
>>>>>
>>>>>     (?) Do I really need to keep track of the process creds in which an
>>>>>         implicit object destruction happened?  For example, imagine you create
>>>>>         an fd with fsopen()/fsmount().  It is marked to dissolve the mount it
>>>>>         refers to on close unless move_mount() clears that flag.  Now, imagine
>>>>>         someone looking at that fd through procfs at the same time as you exit
>>>>>         due to an error.  The LSM sees the destruction notification come from
>>>>>         the looker if they happen to do their fput() after yours.
>>>>
>>>>
>>>> I'm not in favor of this approach.
>>>
>>> Which bit?  The last point?  Keeping track of the process creds after an
>>> implicit object destruction.
>>
>> (1), (2), (3), and the last point.
>>
>>>> Can we check permission to the object being watched when a watch is set
>>>> (read-like access),
>>>
>>> Yes, and I need to do that.  I think it's likely to require an extra hook for
>>> each entry point added because the objects are different:
>>>
>>>      int security_watch_key(struct watch *watch, struct key *key);
>>>      int security_watch_sb(struct watch *watch, struct path *path);
>>>      int security_watch_mount(struct watch *watch, struct path *path);
>>>      int security_watch_devices(struct watch *watch);
>>>
>>>> make sure every access that can trigger a notification requires a
>>>> (write-like) permission to the accessed object,
>>>
>>> "write-like permssion" for whom?  The triggerer or the watcher?
>>
>> The former, i.e. the process that performed the operation that triggered the notification.  Think of it as a write from the process to the accessed object, which triggers a notification (another write) on some related object (the watched object), which is then read by the watcher.
> 
> We agree that the process that performed the operation that triggered
> the notification is the initial subject. Smack will treat the process
> with the watch set (in particular, its ring buffer) as the object
> being written to. SELinux, with its finer grained controls, will
> involve other things as noted above. There are other place where we
> see this, notably IP packet delivery.
> 
> The implication is that the information about the triggering
> process needs to be available throughout.
> 
>>
>>> There are various 'classes' of events:
>>>
>>>    (1) System events (eg. hardware I/O errors, automount points expiring).
>>>
>>>    (2) Direct events (eg. automounts, manual mounts, EDQUOT, key linkage).
>>>
>>>    (3) Indirect events (eg. exit/close doing the last fput and causing an
>>>        unmount).
>>>
>>> Class (1) are uncaused by a process, so I use init_cred for them.  One could
>>> argue that the automount point expiry should perhaps take place under the
>>> creds of whoever triggered it in the first place, but we need to be careful
>>> about long-term cred pinning.
>>
>> This seems equivalent to just checking whether the watcher is allowed to get that kind of event, no other cred truly needed.
>>
>>> Class (2) the causing process must've had permission to cause them - otherwise
>>> we wouldn't have got the event.
>>
>> So we've already done a check on the causing process, and we're going to check whether the watcher can set the watch. We just need to establish the connection between the accessed object and the watched object in some manner.
> 
> I don't agree. That is, I don't believe it is sufficient.
> There is no guarantee that being able to set a watch on an
> object implies that every process that can trigger the event
> can send it to you.
> 
> 	Watcher has Smack label W
> 	Triggerer has Smack label T
> 	Watched object has Smack label O
> 
> 	Relevant Smack rules are
> 
> 	W O rw
> 	T O rw
> 
> The watcher will be able to set the watch,
> the triggerer will be able to trigger the event,
> but there is nothing that would allow the watcher
> to receive the event. This is not a case of watcher
> reading the watched object, as the event is delivered
> without any action by watcher.

You are allowing arbitrary information flow between T and W above.  Who 
cares about notifications?

How is it different from W and T mapping the same file as a shared 
mapping and communicating by reading and writing the shared memory?  You 
aren't performing a permission check directly between W and T there.

> 
>>
>>> Class (3) is interesting since it's currently entirely cleanup events and the
>>> process may have the right to do them (close, dup2, exit, but also execve)
>>> whether the LSM thinks it should be able to cause the object to be destroyed
>>> or not.
>>>
>>> It gets more complicated than that, though: multiple processes with different
>>> security attributes can all have fds pointing to a common file object - and
>>> the last one to close carries the can as far as the LSM is concerned.
>>
>> Yes, I'd prefer to avoid that.  You can't write policy that is stable and meaningful that way.  This may fall under a similar situation as class (1) - all we can meaningfully do is check whether the watcher is allowed to see all such events.
> 
> Back in the day when we were doing security evaluations
> the implications of "deleting" an object gave the security
> evaluators fits. UNIX/Linux files don't get deleted, they
> simply fall off the filesystem namespace when no one cares
> about them anymore. The model we used back in the day was
> that "delete" wasn't an operation that occurs on filesystem
> objects.
> 
> But now you want to do something with security implications
> when the object disappears. We could say that the event does
> nothing but signal that the system has removed the watch on
> your behalf because it is now meaningless. No reason to worry
> about who dropped the last reference. We don't care about
> that from a policy viewpoint anyway.
> 
>>
>>> And yet more complicated when you throw in unix sockets with partially passed
>>> fds still in their queues.  That's what patch 01 is designed to try and cope
>>> with.
>>>
>>>> and make sure there is some sane way to control the relationship between the
>>>> accessed object and the watched object (write-like)?
>>>
>>> This is the trick.  Keys and superblocks have object labels of their own and
>>> don't - for now - propagate their watches.  With these, the watch is on the
>>> object you initially assign it to and it goes no further than that.
>>>
>>> mount_notify() is the interesting case since we want to be able to detect
>>> mount topology change events from within the vfs subtree rooted at the watched
>>> directory without having to manually put a watch on every directory in that
>>> subtree - or even just every mount object. >
>>> Or, maybe, that's what I'll have to do: make it mount_notify() can only apply
>>> to the subtree within its superblock, and the caller must call mount_notify()
>>> for every mount object it wants to monitor.  That would at least ensure that
>>> the caller can, at that point, reach all those mount points.
>>
>> Would that at least make it consistent with fanotify (not that it provides a great example)?
>>
>>>> For cases where we have no object per se or at least no security
>>>> structure/label associated with it, we may have to fall back to a
>>>> coarse-grained "Can the watcher get this kind of notification in general?".
>>>
>>> Agreed - and we should probably have that anyway.
>>>
>>> David
>

Casey Schaufler June 6, 2019, 6:33 p.m. UTC | #8

On 6/6/2019 10:11 AM, Andy Lutomirski wrote:
> On Thu, Jun 6, 2019 at 9:43 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>> ...
>> I don't agree. That is, I don't believe it is sufficient.
>> There is no guarantee that being able to set a watch on an
>> object implies that every process that can trigger the event
>> can send it to you.
>>
>>         Watcher has Smack label W
>>         Triggerer has Smack label T
>>         Watched object has Smack label O
>>
>>         Relevant Smack rules are
>>
>>         W O rw
>>         T O rw
>>
>> The watcher will be able to set the watch,
>> the triggerer will be able to trigger the event,
>> but there is nothing that would allow the watcher
>> to receive the event. This is not a case of watcher
>> reading the watched object, as the event is delivered
>> without any action by watcher.
> I think this is an example of a bogus policy that should not be
> supported by the kernel.

At this point it's pretty hard for me to care much what
you think. You don't seem to have any insight into the
implications of the features you're advocating, or their
potential consequences.

Andy Lutomirski June 6, 2019, 6:51 p.m. UTC | #9

> On Jun 6, 2019, at 11:33 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
> 
>> On 6/6/2019 10:11 AM, Andy Lutomirski wrote:
>>> On Thu, Jun 6, 2019 at 9:43 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>>> ...
>>> I don't agree. That is, I don't believe it is sufficient.
>>> There is no guarantee that being able to set a watch on an
>>> object implies that every process that can trigger the event
>>> can send it to you.
>>> 
>>>        Watcher has Smack label W
>>>        Triggerer has Smack label T
>>>        Watched object has Smack label O
>>> 
>>>        Relevant Smack rules are
>>> 
>>>        W O rw
>>>        T O rw
>>> 
>>> The watcher will be able to set the watch,
>>> the triggerer will be able to trigger the event,
>>> but there is nothing that would allow the watcher
>>> to receive the event. This is not a case of watcher
>>> reading the watched object, as the event is delivered
>>> without any action by watcher.
>> I think this is an example of a bogus policy that should not be
>> supported by the kernel.
> 
> At this point it's pretty hard for me to care much what
> you think. You don't seem to have any insight into the
> implications of the features you're advocating, or their
> potential consequences.
> 
> 

Can you try to spell it out, then?  A mostly or fully worked out example might help.

As Stephen said, it looks like you are considering cases where there is already a full communication channel between two processes, and you’re concerned that this new mechanism might add a side channel too.  If this is wrong, can you explain how?

Casey Schaufler June 6, 2019, 6:56 p.m. UTC | #10

On 6/6/2019 10:16 AM, Stephen Smalley wrote:
> On 6/6/19 12:43 PM, Casey Schaufler wrote:
>> ...
>> I don't agree. That is, I don't believe it is sufficient.
>> There is no guarantee that being able to set a watch on an
>> object implies that every process that can trigger the event
>> can send it to you.
>>
>>     Watcher has Smack label W
>>     Triggerer has Smack label T
>>     Watched object has Smack label O
>>
>>     Relevant Smack rules are
>>
>>     W O rw
>>     T O rw
>>
>> The watcher will be able to set the watch,
>> the triggerer will be able to trigger the event,
>> but there is nothing that would allow the watcher
>> to receive the event. This is not a case of watcher
>> reading the watched object, as the event is delivered
>> without any action by watcher.
>
> You are allowing arbitrary information flow between T and W above.  Who cares about notifications?

I do. If Watched object is /dev/null no data flow is possible.
There are many objects on a modern Linux system for which this
is true. Even if it's "just a file" the existence of one path
for data to flow does not justify ignoring the rules for other
data paths.

>
> How is it different from W and T mapping the same file as a shared mapping and communicating by reading and writing the shared memory?  You aren't performing a permission check directly between W and T there.

In this case there is one object O, two subjects, W and T and two accesses.

	W open O
	T open O

They fiddle about with the data in O.

In the event case, there are two objects, O and W, two subjects W and T, and
three accesses.

	W watch O
	T trigger O
	T write-event W

You can't wave away the flow of data. Different objects are involved.

An analogy is that two processes with different UIDs can open a file,
but still can't signal each other. Different mechanisms have different
policies. I'm not saying that's good, but it's the context we're in.

Andy Lutomirski June 6, 2019, 7:54 p.m. UTC | #11

On Thu, Jun 6, 2019 at 11:56 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>
> On 6/6/2019 10:16 AM, Stephen Smalley wrote:
> > On 6/6/19 12:43 PM, Casey Schaufler wrote:
> >> ...
> >> I don't agree. That is, I don't believe it is sufficient.
> >> There is no guarantee that being able to set a watch on an
> >> object implies that every process that can trigger the event
> >> can send it to you.
> >>
> >>     Watcher has Smack label W
> >>     Triggerer has Smack label T
> >>     Watched object has Smack label O
> >>
> >>     Relevant Smack rules are
> >>
> >>     W O rw
> >>     T O rw
> >>
> >> The watcher will be able to set the watch,
> >> the triggerer will be able to trigger the event,
> >> but there is nothing that would allow the watcher
> >> to receive the event. This is not a case of watcher
> >> reading the watched object, as the event is delivered
> >> without any action by watcher.
> >
> > You are allowing arbitrary information flow between T and W above.  Who cares about notifications?
>
> I do. If Watched object is /dev/null no data flow is possible.
> There are many objects on a modern Linux system for which this
> is true. Even if it's "just a file" the existence of one path
> for data to flow does not justify ignoring the rules for other
> data paths.

Aha!

Even ignoring security, writes to things like /dev/null should
probably not trigger notifications to people who are watching
/dev/null.  (There are probably lots of things like this: /dev/zero,
/dev/urandom, etc.)  David, are there any notification types that have
this issue in your patchset?  If so, is there a straightforward way to
fix it?  Generically, it seems like maybe writes to device nodes
shouldn't trigger notifications since, despite the fact that different
openers of a device node share an inode, there isn't necessarily any
connection between them.

Casey, if this is fixed in general, do you have another case where the
right to write and the right to read do not imply the right to
communicate?

> An analogy is that two processes with different UIDs can open a file,
> but still can't signal each other.

What do you mean "signal"?  If two processes with different UIDs can
open the same file for read and write, then they can communicate with
each other in many ways.  For example, one can write to the file and
the other can read it.  One can take locks and the other can read the
lock state.  They can both map it and use any number of memory access
side channels to communicate.  But, of course, they can't send each
other signals with kill().

If, however, one of these processes is using some fancy mechanism
(inotify, dnotify, kqueue, fanotify, whatever) to watch the file, and
the other one writes it, then it seems inconsistent to lie to the
watching process and say that the file wasn't written because some
security policy has decided to allow the write, allow the read, but
suppress this particular notification.  Hence my request for a real
example: when would it make sense to do this?

David Howells June 6, 2019, 9:17 p.m. UTC | #12

Andy Lutomirski <luto@kernel.org> wrote:

> > > You are allowing arbitrary information flow between T and W above.  Who
> > > cares about notifications?
> >
> > I do. If Watched object is /dev/null no data flow is possible.
> > There are many objects on a modern Linux system for which this
> > is true. Even if it's "just a file" the existence of one path
> > for data to flow does not justify ignoring the rules for other
> > data paths.
> 
> Aha!
> 
> Even ignoring security, writes to things like /dev/null should
> probably not trigger notifications to people who are watching
> /dev/null.  (There are probably lots of things like this: /dev/zero,
> /dev/urandom, etc.)

Even writes to /dev/null might generate access notifications; leastways,
vfs_read() will call fsnotify_access() afterwards on success.

Whether or not you can set marks on open device files is another matter.

> David, are there any notification types that have this issue in your
> patchset?  If so, is there a straightforward way to fix it?

I'm not sure what issue you're referring to specifically.  Do you mean whether
writes to device files generate notifications?

> Generically, it seems like maybe writes to device nodes shouldn't trigger
> notifications since, despite the fact that different openers of a device
> node share an inode, there isn't necessarily any connection between them.

With the notification types I have currently implemented, I don't even notice
any accesses to a device file unless:

 (1) Someone mounts over the top of one.

 (2) The access triggers an I/O error or device reset or causes the device to
     be attached or detached.

 (3) Wangling the device causes some other superblock event.

 (4) The driver calls request_key() and that creates a new key.

> Casey, if this is fixed in general, do you have another case where the
> right to write and the right to read do not imply the right to
> communicate?
> 
> > An analogy is that two processes with different UIDs can open a file,
> > but still can't signal each other.
> 
> What do you mean "signal"?  If two processes with different UIDs can
> open the same file for read and write, then they can communicate with
> each other in many ways.  For example, one can write to the file and
> the other can read it.  One can take locks and the other can read the
> lock state.  They can both map it and use any number of memory access
> side channels to communicate.  But, of course, they can't send each
> other signals with kill().
> 
> If, however, one of these processes is using some fancy mechanism
> (inotify, dnotify, kqueue, fanotify, whatever) to watch the file, and
> the other one writes it, then it seems inconsistent to lie to the
> watching process and say that the file wasn't written because some
> security policy has decided to allow the write, allow the read, but
> suppress this particular notification.  Hence my request for a real
> example: when would it make sense to do this?

Note that fanotify requires CAP_SYS_ADMIN, but inotify and dnotify do not.

dnotify is applied to an open file, so it might be usable on a chardev such as
/dev/null, say.

David

Andy Lutomirski June 6, 2019, 9:54 p.m. UTC | #13

> On Jun 6, 2019, at 2:17 PM, David Howells <dhowells@redhat.com> wrote:
> 
> Andy Lutomirski <luto@kernel.org> wrote:
> 
>>>> You are allowing arbitrary information flow between T and W above.  Who
>>>> cares about notifications?
>>> 
>>> I do. If Watched object is /dev/null no data flow is possible.
>>> There are many objects on a modern Linux system for which this
>>> is true. Even if it's "just a file" the existence of one path
>>> for data to flow does not justify ignoring the rules for other
>>> data paths.
>> 
>> Aha!
>> 
>> Even ignoring security, writes to things like /dev/null should
>> probably not trigger notifications to people who are watching
>> /dev/null.  (There are probably lots of things like this: /dev/zero,
>> /dev/urandom, etc.)
> 
> Even writes to /dev/null might generate access notifications; leastways,
> vfs_read() will call fsnotify_access() afterwards on success.

Hmm. I can see this being an issue, but I guess not with your patch set.

> 
> Whether or not you can set marks on open device files is another matter.
> 
>> David, are there any notification types that have this issue in your
>> patchset?  If so, is there a straightforward way to fix it?
> 
> I'm not sure what issue you're referring to specifically.  Do you mean whether
> writes to device files generate notifications?

I mean: are there cases where some action generates a notification but does not otherwise have an effect visible to the users who can receive the notification. It looks like the answer is probably “no”, which is good.

Casey, is this good enough for you, or is there still an issue?

David Howells June 6, 2019, 10:38 p.m. UTC | #14

Andy Lutomirski <luto@amacapital.net> wrote:

> I mean: are there cases where some action generates a notification but does
> not otherwise have an effect visible to the users who can receive the
> notification. It looks like the answer is probably “no”, which is good.

mount_notify().  You can get a notification that someone altered the mount
topology (eg. by mounting something).  A process receiving a notification
could then use fsinfo(), say, to reread the mount topology tree, find out
where the new mount is and wander over there to have a look - assuming they
have the permissions for pathwalk to succeed.

David

Andy Lutomirski June 6, 2019, 10:42 p.m. UTC | #15

> On Jun 6, 2019, at 3:38 PM, David Howells <dhowells@redhat.com> wrote:
> 
> Andy Lutomirski <luto@amacapital.net> wrote:
> 
>> I mean: are there cases where some action generates a notification but does
>> not otherwise have an effect visible to the users who can receive the
>> notification. It looks like the answer is probably “no”, which is good.
> 
> mount_notify().  You can get a notification that someone altered the mount
> topology (eg. by mounting something).  A process receiving a notification
> could then use fsinfo(), say, to reread the mount topology tree, find out
> where the new mount is and wander over there to have a look - assuming they
> have the permissions for pathwalk to succeed.
> 
> 

They can call fsinfo() anyway, or just read /proc/self/mounts. As far as I’m concerned, if you have CAP_SYS_ADMIN over a mount namespace and LSM policy lets you mount things, the of course you can get information to basically anyone who can use that mount namespace.

David Howells June 6, 2019, 10:50 p.m. UTC | #16

Andy Lutomirski <luto@amacapital.net> wrote:

> They can call fsinfo() anyway, or just read /proc/self/mounts. As far as I’m
> concerned, if you have CAP_SYS_ADMIN over a mount namespace and LSM policy
> lets you mount things, the of course you can get information to basically
> anyone who can use that mount namespace.

And automounts?  You don't need CAP_SYS_ADMIN to trigger one of those, but
they still generate events.  On the other hand, you need CSA to mount
something that has automounts in the first place, and if you're particularly
concerned about security, you probably don't want the processes you might be
suspicious of having access to things that contain automounts (typically
network filesystems).

David

[RFC,00/10] Mount, FS, Block and Keyrings notifications [ver #3]

Message

Comments