mbox series

[RFC,0/7] Mount, FS, Block and Keyrings notifications

Message ID 155905930702.7587.7100265859075976147.stgit@warthog.procyon.org.uk (mailing list archive)
Headers show
Series Mount, FS, Block and Keyrings notifications | expand

Message

David Howells May 28, 2019, 4:01 p.m. UTC
Hi Al,

Here's a set of patches to add a general variable-length notification queue
concept and to add sources of events for:

 (1) Mount topology events, such as mounting, unmounting, mount expiry,
     mount reconfiguration.

 (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O
     errors (not complete yet).

 (3) Block layer events, such as I/O errors.

 (4) Key/keyring events, such as creating, linking and removal of keys.

One of the reasons for this is so that we can remove the issue of processes
having to repeatedly and regularly scan /proc/mounts, which has proven to
be a system performance problem.  To further aid this, the fsinfo() syscall
on which this patch series depends, provides a way to access superblock and
mount information in binary form without the need to parse /proc/mounts.


Design decisions:

 (1) A misc chardev is used to create and open a ring buffer:

	fd = open("/dev/watch_queue", O_RDWR);

     which is then configured and mmap'd into userspace:

	ioctl(fd, IOC_WATCH_QUEUE_SET_SIZE, BUF_SIZE);
	ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter);
	buf = mmap(NULL, BUF_SIZE * page_size, PROT_READ | PROT_WRITE,
		   MAP_SHARED, fd, 0);

     The fd cannot be read or written (though there is a facility to use
     write to inject records for debugging) and userspace just pulls data
     directly out of the buffer.

 (2) The ring index pointers are stored inside the ring and are thus
     accessible to userspace.  Userspace should only update the tail
     pointer and never the head pointer or risk breaking the buffer.  The
     kernel checks that the pointers appear valid before trying to use
     them.  A 'skip' record is maintained around the pointers.

 (3) poll() can be used to wait for data to appear in the buffer.

 (4) Records in the buffer are binary, typed and have a length so that they
     can be of varying size.

     This means that multiple heterogeneous sources can share a common
     buffer.  Tags may be specified when a watchpoint is created to help
     distinguish the sources.

 (5) The queue is reusable as there are 16 million types available, of
     which I've used 4, so there is scope for others to be used.

 (6) Records are filterable as types have up to 256 subtypes that can be
     individually filtered.  Other filtration is also available.

 (7) Each time the buffer is opened, a new buffer is created - this means
     that there's no interference between watchers.

 (8) When recording a notification, the kernel will not sleep, but will
     rather mark a queue as overrun if there's insufficient space, thereby
     avoiding userspace causing the kernel to hang.

 (9) The 'watchpoint' should be specific where possible, meaning that you
     specify the object that you want to watch.

(10) The buffer is created and then watchpoints are attached to it, using
     one of:

	keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fd, 0x01);
	mount_notify(AT_FDCWD, "/", 0, fd, 0x02);
	sb_notify(AT_FDCWD, "/mnt", 0, fd, 0x03);

     where in all three cases, fd indicates the queue and the number after
     is a tag between 0 and 255.

(11) The watch must be removed if either the watch buffer is destroyed or
     the watched object is destroyed.


Things I want to avoid:

 (1) Introducing features that make the core VFS dependent on the network
     stack or networking namespaces (ie. usage of netlink).

 (2) Dumping all this stuff into dmesg and having a daemon that sits there
     parsing the output and distributing it as this then puts the
     responsibility for security into userspace and makes handling
     namespaces tricky.  Further, dmesg might not exist or might be
     inaccessible inside a container.

 (3) Letting users see events they shouldn't be able to see.


Further things that could be considered:

 (1) Adding a keyctl call to allow a watch on a keyring to be extended to
     "children" of that keyring, such that the watch is removed from the
     child if it is unlinked from the keyring.

 (2) Adding global superblock event queue.

 (3) Propagating watches to child superblock over automounts.


The patches can be found here also:

	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=notifications

David
---
David Howells (7):
      General notification queue with user mmap()'able ring buffer
      keys: Add a notification facility
      vfs: Add a mount-notification facility
      vfs: Add superblock notifications
      fsinfo: Export superblock notification counter
      block: Add block layer notifications
      Add sample notification program


 Documentation/security/keys/core.rst   |   58 ++
 Documentation/watch_queue.rst          |  311 +++++++++++
 arch/x86/entry/syscalls/syscall_32.tbl |    3 
 arch/x86/entry/syscalls/syscall_64.tbl |    3 
 block/Kconfig                          |    9 
 block/Makefile                         |    1 
 block/blk-core.c                       |   28 +
 block/blk-notify.c                     |   83 +++
 drivers/misc/Kconfig                   |   13 
 drivers/misc/Makefile                  |    1 
 drivers/misc/watch_queue.c             |  877 ++++++++++++++++++++++++++++++++
 fs/Kconfig                             |   21 +
 fs/Makefile                            |    1 
 fs/fsinfo.c                            |   12 
 fs/mount.h                             |   33 +
 fs/mount_notify.c                      |  178 ++++++
 fs/namespace.c                         |    9 
 fs/super.c                             |  116 ++++
 include/linux/blkdev.h                 |   10 
 include/linux/dcache.h                 |    1 
 include/linux/fs.h                     |   78 +++
 include/linux/key.h                    |    4 
 include/linux/lsm_hooks.h              |   15 +
 include/linux/security.h               |   14 +
 include/linux/syscalls.h               |    5 
 include/linux/watch_queue.h            |   86 +++
 include/uapi/linux/fsinfo.h            |   10 
 include/uapi/linux/keyctl.h            |    1 
 include/uapi/linux/watch_queue.h       |  185 +++++++
 kernel/sys_ni.c                        |    6 
 mm/interval_tree.c                     |    2 
 mm/memory.c                            |    1 
 samples/Kconfig                        |    6 
 samples/Makefile                       |    1 
 samples/vfs/test-fsinfo.c              |   13 
 samples/watch_queue/Makefile           |    9 
 samples/watch_queue/watch_test.c       |  284 ++++++++++
 security/keys/Kconfig                  |   10 
 security/keys/compat.c                 |    2 
 security/keys/gc.c                     |    5 
 security/keys/internal.h               |   30 +
 security/keys/key.c                    |   37 +
 security/keys/keyctl.c                 |   88 +++
 security/keys/keyring.c                |   17 -
 security/keys/request_key.c            |    4 
 security/security.c                    |    9 
 46 files changed, 2652 insertions(+), 38 deletions(-)
 create mode 100644 Documentation/watch_queue.rst
 create mode 100644 block/blk-notify.c
 create mode 100644 drivers/misc/watch_queue.c
 create mode 100644 fs/mount_notify.c
 create mode 100644 include/linux/watch_queue.h
 create mode 100644 include/uapi/linux/watch_queue.h
 create mode 100644 samples/watch_queue/Makefile
 create mode 100644 samples/watch_queue/watch_test.c

Comments

Greg KH May 28, 2019, 11:58 p.m. UTC | #1
On Tue, May 28, 2019 at 05:01:47PM +0100, David Howells wrote:
> Things I want to avoid:
> 
>  (1) Introducing features that make the core VFS dependent on the network
>      stack or networking namespaces (ie. usage of netlink).
> 
>  (2) Dumping all this stuff into dmesg and having a daemon that sits there
>      parsing the output and distributing it as this then puts the
>      responsibility for security into userspace and makes handling
>      namespaces tricky.  Further, dmesg might not exist or might be
>      inaccessible inside a container.
> 
>  (3) Letting users see events they shouldn't be able to see.

How are you handling namespaces then?  Are they determined by the
namespace of the process that opened the original device handle, or the
namespace that made the new syscall for the events to "start flowing"?

Am I missing the logic that determines this in the patches, or is that
not implemented yet?

thanks,

greg k-h
Amir Goldstein May 29, 2019, 6:33 a.m. UTC | #2
On Tue, May 28, 2019 at 7:03 PM David Howells <dhowells@redhat.com> wrote:
>
>
> Hi Al,
>
> Here's a set of patches to add a general variable-length notification queue
> concept and to add sources of events for:
>
>  (1) Mount topology events, such as mounting, unmounting, mount expiry,
>      mount reconfiguration.
>
>  (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O
>      errors (not complete yet).
>
>  (3) Block layer events, such as I/O errors.
>
>  (4) Key/keyring events, such as creating, linking and removal of keys.
>
> One of the reasons for this is so that we can remove the issue of processes
> having to repeatedly and regularly scan /proc/mounts, which has proven to
> be a system performance problem.  To further aid this, the fsinfo() syscall
> on which this patch series depends, provides a way to access superblock and
> mount information in binary form without the need to parse /proc/mounts.
>
>
> Design decisions:
>
>  (1) A misc chardev is used to create and open a ring buffer:
>
>         fd = open("/dev/watch_queue", O_RDWR);
>
>      which is then configured and mmap'd into userspace:
>
>         ioctl(fd, IOC_WATCH_QUEUE_SET_SIZE, BUF_SIZE);
>         ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter);
>         buf = mmap(NULL, BUF_SIZE * page_size, PROT_READ | PROT_WRITE,
>                    MAP_SHARED, fd, 0);
>
>      The fd cannot be read or written (though there is a facility to use
>      write to inject records for debugging) and userspace just pulls data
>      directly out of the buffer.
>
>  (2) The ring index pointers are stored inside the ring and are thus
>      accessible to userspace.  Userspace should only update the tail
>      pointer and never the head pointer or risk breaking the buffer.  The
>      kernel checks that the pointers appear valid before trying to use
>      them.  A 'skip' record is maintained around the pointers.
>
>  (3) poll() can be used to wait for data to appear in the buffer.
>
>  (4) Records in the buffer are binary, typed and have a length so that they
>      can be of varying size.
>
>      This means that multiple heterogeneous sources can share a common
>      buffer.  Tags may be specified when a watchpoint is created to help
>      distinguish the sources.
>
>  (5) The queue is reusable as there are 16 million types available, of
>      which I've used 4, so there is scope for others to be used.
>
>  (6) Records are filterable as types have up to 256 subtypes that can be
>      individually filtered.  Other filtration is also available.
>
>  (7) Each time the buffer is opened, a new buffer is created - this means
>      that there's no interference between watchers.
>
>  (8) When recording a notification, the kernel will not sleep, but will
>      rather mark a queue as overrun if there's insufficient space, thereby
>      avoiding userspace causing the kernel to hang.
>
>  (9) The 'watchpoint' should be specific where possible, meaning that you
>      specify the object that you want to watch.
>
> (10) The buffer is created and then watchpoints are attached to it, using
>      one of:
>
>         keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fd, 0x01);
>         mount_notify(AT_FDCWD, "/", 0, fd, 0x02);
>         sb_notify(AT_FDCWD, "/mnt", 0, fd, 0x03);
>
>      where in all three cases, fd indicates the queue and the number after
>      is a tag between 0 and 255.
>
> (11) The watch must be removed if either the watch buffer is destroyed or
>      the watched object is destroyed.
>
>
> Things I want to avoid:
>
>  (1) Introducing features that make the core VFS dependent on the network
>      stack or networking namespaces (ie. usage of netlink).
>
>  (2) Dumping all this stuff into dmesg and having a daemon that sits there
>      parsing the output and distributing it as this then puts the
>      responsibility for security into userspace and makes handling
>      namespaces tricky.  Further, dmesg might not exist or might be
>      inaccessible inside a container.
>
>  (3) Letting users see events they shouldn't be able to see.
>
>
> Further things that could be considered:
>
>  (1) Adding a keyctl call to allow a watch on a keyring to be extended to
>      "children" of that keyring, such that the watch is removed from the
>      child if it is unlinked from the keyring.
>
>  (2) Adding global superblock event queue.
>
>  (3) Propagating watches to child superblock over automounts.
>

David,

I am interested to know how you envision filesystem notifications would
look with this interface.

fanotify can certainly benefit from providing a ring buffer interface to read
events.

From what I have seen, a common practice of users is to monitor mounts
(somehow) and place FAN_MARK_MOUNT fanotify watches dynamically.
It'd be good if those users can use a single watch mechanism/API for
watching the mount namespace and filesystem events within mounts.

A similar usability concern is with sb_notify and FAN_MARK_FILESYSTEM.
It provides users with two complete different mechanisms to watch error
and filesystem events. That is generally not a good thing to have.

I am not asking that you implement fs_notify() before merging sb_notify()
and I understand that you have a use case for sb_notify().
I am asking that you show me the path towards a unified API (how a
typical program would look like), so that we know before merging your
new API that it could be extended to accommodate fsnotify events
where the final result will look wholesome to users.

Thanks,
Amir.
David Howells May 29, 2019, 6:45 a.m. UTC | #3
Amir Goldstein <amir73il@gmail.com> wrote:

> I am interested to know how you envision filesystem notifications would
> look with this interface.

What sort of events are you thinking of by "filesystem notifications"?  You
mean things like file changes?

David
Amir Goldstein May 29, 2019, 7:40 a.m. UTC | #4
On Wed, May 29, 2019 at 9:45 AM David Howells <dhowells@redhat.com> wrote:
>
> Amir Goldstein <amir73il@gmail.com> wrote:
>
> > I am interested to know how you envision filesystem notifications would
> > look with this interface.
>
> What sort of events are you thinking of by "filesystem notifications"?  You
> mean things like file changes?

I mean all the events provided by
http://man7.org/linux/man-pages/man7/fanotify.7.html

Which was recently (v4.20) extended to support watching a super block
and more recently (v5.1) extended to support watching directory entry
modifications.

Thanks,
Amir.
David Howells May 29, 2019, 9:09 a.m. UTC | #5
Greg KH <gregkh@linuxfoundation.org> wrote:

> >  (3) Letting users see events they shouldn't be able to see.
> 
> How are you handling namespaces then?  Are they determined by the
> namespace of the process that opened the original device handle, or the
> namespace that made the new syscall for the events to "start flowing"?

So far I haven't had to deal directly with namespaces.

mount_notify() requires you to have access to the mountpoint you want to watch
- and the entire tree rooted there is in one namespace, so your event sources
are restricted to that namespace.  Further, mount objects don't themselves
have any other namespaces, not even a user_ns.

sb_notify() requires you to have access to the superblock you want to watch.
superblocks aren't directly namespaced as a class, though individual
superblocks may participate in particular namespaces (ipc, net, etc.).  I'm
thinking some of these should be marked unwatchable (all pseudo superblocks,
kernfs-class, proc, for example).

Superblocks, however, do each have a user_ns - but you were allowed to access
the superblock by pathwalk, so you must have some access to the user_ns - I
think.

KEYCTL_NOTIFY requires you to have View access on the key you're watching.
Currently, keys have no real namespace restrictions, though I have patches to
include a namespace tag in the lookup criteria.

block_notify() doesn't require any direct access since you're watching a
global queue and there is no blockdev namespacing.  LSMs are given the option
to filter events, though.  The thought here is that if you can access dmesg,
you should be able to watch for blockdev events.


Actually, thinking further on this, restricting access to events is trickier
than I thought and than perhaps Casey was suggesting.

Say you're watching a mount object and someone in a different user_ns
namespace or with a different security label mounts on it.  What governs
whether you are allowed to see the event?

You're watching the object for changes - and it *has* changed.  Further, you
might be able to see the result of this change by other means (/proc/mounts,
for instance).

Should you be denied the event based on the security model?

On the other hand, if you're watching a tree of mount objects, it could be
argued that you should be denied access to events on any mount object you
can't reach by pathwalk.

On the third hand, if you can see it in /proc/mounts or by fsinfo(), you
should get an event for it.


> How are you handling namespaces then?

So to go back to the original question.  At the moment they haven't impinged
directly and I haven't had to deal with them directly.  There are indirect
namespace restrictions that I get for free just due to pathwalk, for instance.

David
Jan Kara May 29, 2019, 2:25 p.m. UTC | #6
On Wed 29-05-19 09:33:35, Amir Goldstein wrote:
> On Tue, May 28, 2019 at 7:03 PM David Howells <dhowells@redhat.com> wrote:
> >
> >
> > Hi Al,
> >
> > Here's a set of patches to add a general variable-length notification queue
> > concept and to add sources of events for:
> >
> >  (1) Mount topology events, such as mounting, unmounting, mount expiry,
> >      mount reconfiguration.
> >
> >  (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O
> >      errors (not complete yet).
> >
> >  (3) Block layer events, such as I/O errors.
> >
> >  (4) Key/keyring events, such as creating, linking and removal of keys.
> >
> > One of the reasons for this is so that we can remove the issue of processes
> > having to repeatedly and regularly scan /proc/mounts, which has proven to
> > be a system performance problem.  To further aid this, the fsinfo() syscall
> > on which this patch series depends, provides a way to access superblock and
> > mount information in binary form without the need to parse /proc/mounts.
> >
> >
> > Design decisions:
> >
> >  (1) A misc chardev is used to create and open a ring buffer:
> >
> >         fd = open("/dev/watch_queue", O_RDWR);
> >
> >      which is then configured and mmap'd into userspace:
> >
> >         ioctl(fd, IOC_WATCH_QUEUE_SET_SIZE, BUF_SIZE);
> >         ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter);
> >         buf = mmap(NULL, BUF_SIZE * page_size, PROT_READ | PROT_WRITE,
> >                    MAP_SHARED, fd, 0);
> >
> >      The fd cannot be read or written (though there is a facility to use
> >      write to inject records for debugging) and userspace just pulls data
> >      directly out of the buffer.
> >
> >  (2) The ring index pointers are stored inside the ring and are thus
> >      accessible to userspace.  Userspace should only update the tail
> >      pointer and never the head pointer or risk breaking the buffer.  The
> >      kernel checks that the pointers appear valid before trying to use
> >      them.  A 'skip' record is maintained around the pointers.
> >
> >  (3) poll() can be used to wait for data to appear in the buffer.
> >
> >  (4) Records in the buffer are binary, typed and have a length so that they
> >      can be of varying size.
> >
> >      This means that multiple heterogeneous sources can share a common
> >      buffer.  Tags may be specified when a watchpoint is created to help
> >      distinguish the sources.
> >
> >  (5) The queue is reusable as there are 16 million types available, of
> >      which I've used 4, so there is scope for others to be used.
> >
> >  (6) Records are filterable as types have up to 256 subtypes that can be
> >      individually filtered.  Other filtration is also available.
> >
> >  (7) Each time the buffer is opened, a new buffer is created - this means
> >      that there's no interference between watchers.
> >
> >  (8) When recording a notification, the kernel will not sleep, but will
> >      rather mark a queue as overrun if there's insufficient space, thereby
> >      avoiding userspace causing the kernel to hang.
> >
> >  (9) The 'watchpoint' should be specific where possible, meaning that you
> >      specify the object that you want to watch.
> >
> > (10) The buffer is created and then watchpoints are attached to it, using
> >      one of:
> >
> >         keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fd, 0x01);
> >         mount_notify(AT_FDCWD, "/", 0, fd, 0x02);
> >         sb_notify(AT_FDCWD, "/mnt", 0, fd, 0x03);
> >
> >      where in all three cases, fd indicates the queue and the number after
> >      is a tag between 0 and 255.
> >
> > (11) The watch must be removed if either the watch buffer is destroyed or
> >      the watched object is destroyed.
> >
> >
> > Things I want to avoid:
> >
> >  (1) Introducing features that make the core VFS dependent on the network
> >      stack or networking namespaces (ie. usage of netlink).
> >
> >  (2) Dumping all this stuff into dmesg and having a daemon that sits there
> >      parsing the output and distributing it as this then puts the
> >      responsibility for security into userspace and makes handling
> >      namespaces tricky.  Further, dmesg might not exist or might be
> >      inaccessible inside a container.
> >
> >  (3) Letting users see events they shouldn't be able to see.
> >
> >
> > Further things that could be considered:
> >
> >  (1) Adding a keyctl call to allow a watch on a keyring to be extended to
> >      "children" of that keyring, such that the watch is removed from the
> >      child if it is unlinked from the keyring.
> >
> >  (2) Adding global superblock event queue.
> >
> >  (3) Propagating watches to child superblock over automounts.
> >
> 
> David,
> 
> I am interested to know how you envision filesystem notifications would
> look with this interface.
> 
> fanotify can certainly benefit from providing a ring buffer interface to read
> events.
> 
> From what I have seen, a common practice of users is to monitor mounts
> (somehow) and place FAN_MARK_MOUNT fanotify watches dynamically.
> It'd be good if those users can use a single watch mechanism/API for
> watching the mount namespace and filesystem events within mounts.
> 
> A similar usability concern is with sb_notify and FAN_MARK_FILESYSTEM.
> It provides users with two complete different mechanisms to watch error
> and filesystem events. That is generally not a good thing to have.
> 
> I am not asking that you implement fs_notify() before merging sb_notify()
> and I understand that you have a use case for sb_notify().
> I am asking that you show me the path towards a unified API (how a
> typical program would look like), so that we know before merging your
> new API that it could be extended to accommodate fsnotify events
> where the final result will look wholesome to users.

Are you sure we want to combine notification about file changes etc. with
administrator-type notifications about the filesystem? To me these two
sound like rather different (although sometimes related) things.

								Honza
Greg KH May 29, 2019, 3:10 p.m. UTC | #7
On Wed, May 29, 2019 at 04:25:04PM +0200, Jan Kara wrote:
> > I am not asking that you implement fs_notify() before merging sb_notify()
> > and I understand that you have a use case for sb_notify().
> > I am asking that you show me the path towards a unified API (how a
> > typical program would look like), so that we know before merging your
> > new API that it could be extended to accommodate fsnotify events
> > where the final result will look wholesome to users.
> 
> Are you sure we want to combine notification about file changes etc. with
> administrator-type notifications about the filesystem? To me these two
> sound like rather different (although sometimes related) things.

This patchset is looking to create a "generic" kernel notification
system, so I think the question is valid.  It's up to the requestor to
ask for the specific type of notification.

thanks,

greg k-h
Casey Schaufler May 29, 2019, 3:41 p.m. UTC | #8
On 5/29/2019 2:09 AM, David Howells wrote:
> Greg KH <gregkh@linuxfoundation.org> wrote:
>
>>>  (3) Letting users see events they shouldn't be able to see.
>> How are you handling namespaces then?  Are they determined by the
>> namespace of the process that opened the original device handle, or the
>> namespace that made the new syscall for the events to "start flowing"?
> So far I haven't had to deal directly with namespaces.
>
> mount_notify() requires you to have access to the mountpoint you want to watch
> - and the entire tree rooted there is in one namespace, so your event sources
> are restricted to that namespace.  Further, mount objects don't themselves
> have any other namespaces, not even a user_ns.
>
> sb_notify() requires you to have access to the superblock you want to watch.
> superblocks aren't directly namespaced as a class, though individual
> superblocks may participate in particular namespaces (ipc, net, etc.).  I'm
> thinking some of these should be marked unwatchable (all pseudo superblocks,
> kernfs-class, proc, for example).
>
> Superblocks, however, do each have a user_ns - but you were allowed to access
> the superblock by pathwalk, so you must have some access to the user_ns - I
> think.
>
> KEYCTL_NOTIFY requires you to have View access on the key you're watching.
> Currently, keys have no real namespace restrictions, though I have patches to
> include a namespace tag in the lookup criteria.
>
> block_notify() doesn't require any direct access since you're watching a
> global queue and there is no blockdev namespacing.  LSMs are given the option
> to filter events, though.  The thought here is that if you can access dmesg,
> you should be able to watch for blockdev events.
>
>
> Actually, thinking further on this, restricting access to events is trickier
> than I thought and than perhaps Casey was suggesting.
>
> Say you're watching a mount object and someone in a different user_ns
> namespace or with a different security label mounts on it.  What governs
> whether you are allowed to see the event?

Conceptually it should be simple, but we have a variety of different
policies in the core OS, never mind what goes on inside the LSMs.
If you want to treat a notification like a signal you would only deliver
it if the process that performed the action that triggered the event
has the same UID as the process receiving the notification. Should you
decide to treat it like an IP packet only the LSMs would filter delivery.
If there are mode bits on the thing being watched shouldn't you respect
them?

> You're watching the object for changes - and it *has* changed.  Further, you
> might be able to see the result of this change by other means (/proc/mounts,
> for instance).
>
> Should you be denied the event based on the security model?

From a subject/object model view there are two objects and one
subject involved. The subject (active entity) is the process that
changes the first object, triggering an event. The watching process
(that will receive the notification) is the second object, because
its state will change (be written to) when the notification is
delivered. For the watching process to receive the notification
the changing process needs write access to the watching process.
The indirection of the notification mechanism isn't relevant.
If the changing process couldn't directly notify the watching process
it shouldn't be able to do it indirectly, either.

> On the other hand, if you're watching a tree of mount objects, it could be
> argued that you should be denied access to events on any mount object you
> can't reach by pathwalk.
>
> On the third hand, if you can see it in /proc/mounts or by fsinfo(), you
> should get an event for it.

Right. We've done a pretty good job of muddling the security
landscape by adding spiffy features to make life easier for
particular use cases. /proc is chuck full of examples. Objects
that can be viewed in many different ways make for confusing
security models. Try explaining /proc/234/fd/2 to a security
theory student.

>> How are you handling namespaces then?
> So to go back to the original question.  At the moment they haven't impinged
> directly and I haven't had to deal with them directly.  There are indirect
> namespace restrictions that I get for free just due to pathwalk, for instance.
>
> David
Amir Goldstein May 29, 2019, 3:53 p.m. UTC | #9
> > David,
> >
> > I am interested to know how you envision filesystem notifications would
> > look with this interface.
> >
> > fanotify can certainly benefit from providing a ring buffer interface to read
> > events.
> >
> > From what I have seen, a common practice of users is to monitor mounts
> > (somehow) and place FAN_MARK_MOUNT fanotify watches dynamically.
> > It'd be good if those users can use a single watch mechanism/API for
> > watching the mount namespace and filesystem events within mounts.
> >
> > A similar usability concern is with sb_notify and FAN_MARK_FILESYSTEM.
> > It provides users with two complete different mechanisms to watch error
> > and filesystem events. That is generally not a good thing to have.
> >
> > I am not asking that you implement fs_notify() before merging sb_notify()
> > and I understand that you have a use case for sb_notify().
> > I am asking that you show me the path towards a unified API (how a
> > typical program would look like), so that we know before merging your
> > new API that it could be extended to accommodate fsnotify events
> > where the final result will look wholesome to users.
>
> Are you sure we want to combine notification about file changes etc. with
> administrator-type notifications about the filesystem? To me these two
> sound like rather different (although sometimes related) things.
>

Well I am sure that ring buffer for fanotify events would be useful, so
seeing that David is proposing a generic notification mechanism, I wanted
to know how that mechanism could best share infrastructure with fsnotify.

But apart from that I foresee the questions from users about why the
mount notification API and filesystem events API do not have better
integration.

The way I see it, the notification queue can serve several classes
of notifications and fsnotify could be one of those classes
(at least FAN_CLASS_NOTIF fits nicely to the model).

Thanks,
Amir.
Jan Kara May 30, 2019, 11 a.m. UTC | #10
On Wed 29-05-19 18:53:21, Amir Goldstein wrote:
> > > David,
> > >
> > > I am interested to know how you envision filesystem notifications would
> > > look with this interface.
> > >
> > > fanotify can certainly benefit from providing a ring buffer interface to read
> > > events.
> > >
> > > From what I have seen, a common practice of users is to monitor mounts
> > > (somehow) and place FAN_MARK_MOUNT fanotify watches dynamically.
> > > It'd be good if those users can use a single watch mechanism/API for
> > > watching the mount namespace and filesystem events within mounts.
> > >
> > > A similar usability concern is with sb_notify and FAN_MARK_FILESYSTEM.
> > > It provides users with two complete different mechanisms to watch error
> > > and filesystem events. That is generally not a good thing to have.
> > >
> > > I am not asking that you implement fs_notify() before merging sb_notify()
> > > and I understand that you have a use case for sb_notify().
> > > I am asking that you show me the path towards a unified API (how a
> > > typical program would look like), so that we know before merging your
> > > new API that it could be extended to accommodate fsnotify events
> > > where the final result will look wholesome to users.
> >
> > Are you sure we want to combine notification about file changes etc. with
> > administrator-type notifications about the filesystem? To me these two
> > sound like rather different (although sometimes related) things.
> >
> 
> Well I am sure that ring buffer for fanotify events would be useful, so
> seeing that David is proposing a generic notification mechanism, I wanted
> to know how that mechanism could best share infrastructure with fsnotify.
> 
> But apart from that I foresee the questions from users about why the
> mount notification API and filesystem events API do not have better
> integration.
> 
> The way I see it, the notification queue can serve several classes
> of notifications and fsnotify could be one of those classes
> (at least FAN_CLASS_NOTIF fits nicely to the model).

I agree that for some type of fsnotify uses a ring buffer would make sense.
But for others - such as permission events or unlimited queues - you cannot
really use the ring buffer and I don't like the idea of having different
ways of passing fsnotify events to userspace based on notification group
type...

								Honza
David Howells June 4, 2019, 12:33 p.m. UTC | #11
Amir Goldstein <amir73il@gmail.com> wrote:

> Well I am sure that ring buffer for fanotify events would be useful, so
> seeing that David is proposing a generic notification mechanism, I wanted
> to know how that mechanism could best share infrastructure with fsnotify.
>
> But apart from that I foresee the questions from users about why the
> mount notification API and filesystem events API do not have better
> integration.
>
> The way I see it, the notification queue can serve several classes
> of notifications and fsnotify could be one of those classes
> (at least FAN_CLASS_NOTIF fits nicely to the model).

It could be done; the main thing that concerns me is that the buffer is of
limited capacity.

However, I could take this:

	struct fanotify_event_metadata {
		__u32 event_len;
		__u8 vers;
		__u8 reserved;
		__u16 metadata_len;
		__aligned_u64 mask;
		__s32 fd;
		__s32 pid;
	};

and map it to:

	struct fanotify_notification {
		struct watch_notification watch; /* WATCH_TYPE_FANOTIFY */
		__aligned_u64	mask;
		__u16		metadata_len;
		__u8		vers;
		__u8		reserved;
		__u32		reserved2;
		__s32		fd;
		__s32		pid;
	};

and some of the watch::info bit could be used:

	n->watch.info & WATCH_INFO_OVERRUN	watch queue overran
	n->watch.info & WATCH_INFO_LENGTH	event_len
	n->watch.info & WATCH_INFO_RECURSIVE	FAN_EVENT_ON_CHILD
	n->watch.info & WATCH_INFO_FLAG_0	FAN_*_PERM
	n->watch.info & WATCH_INFO_FLAG_1	FAN_Q_OVERFLOW
	n->watch.info & WATCH_INFO_FLAG_2	FAN_ON_DIR
	n->subtype				ffs(n->mask)

Ideally, I'd dispense with metadata_len, vers, reserved* and set the version
when setting the watch.

	fanotify_watch(int watchfd, unsigned int flags, u64 *mask,
		       int dirfd, const char *pathname, unsigned int at_flags);

We might also want to extend the watch_filter to allow you to, say, filter on
the first __u64 after the watch member so that you could filter on specific
events:

	struct watch_notification_type_filter {
		__u32	type;
		__u32	info_filter;
		__u32	info_mask;
		__u32	subtype_filter[8];
		__u64	payload_mask[1];
		__u64	payload_set[1];
	};

So, in this case, it would require:

	n->mask & wf->payload_mask[0] == wf->payload_set[0]

to be true to record the message.

David