[RFC,0/8] Mount, FS, Block and Keyrings notifications [ver #2]

Message ID 155966609977.17449.5624614375035334363.stgit@warthog.procyon.org.uk (mailing list archive)

Message

David Howells June 4, 2019, 4:34 p.m. UTC
Hi Al,

Here's a set of patches to add a general variable-length notification queue
concept and to add sources of events for:

 (1) Mount topology events, such as mounting, unmounting, mount expiry,
     mount reconfiguration.

 (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O
     errors (not complete yet).

 (3) Block layer events, such as I/O errors.

 (4) Key/keyring events, such as creating, linking and removal of keys.

One of the reasons for this is so that we can remove the issue of processes
having to repeatedly and regularly scan /proc/mounts, which has proven to
be a system performance problem.  To further aid this, the fsinfo() syscall,
on which this patch series depends, provides a way to access superblock and
mount information in binary form without the need to parse /proc/mounts.


LSM support is included:

 (1) The creds of the process that did the fput() that reduced the refcount
     to zero are cached in the file struct.

 (2) __fput() overrides the current creds with the creds from (1) whilst
     doing the cleanup, thereby making sure that the creds seen by the
     destruction notification generated by mntput() appear to come from
     the last fputter.

 (3) security_post_notification() is called for each queue that we might
     want to post a notification into, thereby allowing the LSM to prevent
     covert communications.

 (?) Do I need to add security_set_watch(), say, to rule on whether a watch
     may be set in the first place?  I might need to add a variant per
     watch-type.

 (?) Do I really need to keep track of the process creds in which an
     implicit object destruction happened?  For example, imagine you create
     an fd with fsopen()/fsmount().  It is marked to dissolve the mount it
     refers to on close unless move_mount() clears that flag.  Now, imagine
     someone looking at that fd through procfs at the same time as you exit
     due to an error.  The LSM sees the destruction notification come from
     the looker if they happen to do their fput() after yours.
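
The concern in the second question can be shown with a small userspace model
(not kernel code).  The names struct model_file and model_fput(), and the use
of plain integers as credential IDs, are invented for illustration; the only
thing taken from the description above is that the creds of whoever drops the
final reference are cached in the file.

```c
/* Hypothetical stand-in for struct file: a refcount plus the cached
 * identity of whoever dropped the final reference. */
struct model_file {
	int refcount;
	int last_fputter;	/* -1 until the refcount reaches zero */
};

/* Drop one reference on behalf of 'cred_id'.  Returns 1 if this call
 * dropped the last reference, and so would run the cleanup that emits
 * the destruction notification, or 0 otherwise. */
static int model_fput(struct model_file *f, int cred_id)
{
	if (--f->refcount > 0)
		return 0;
	f->last_fputter = cred_id;	/* the notification is attributed here */
	return 1;
}
```

If the owner (say ID 1000) exits first and a procfs looker (say ID 2000)
drops its reference afterwards, the model attributes the destruction to
2000, which is the situation the question describes.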


Design decisions:

 (1) A misc chardev is used to create and open a ring buffer:

	fd = open("/dev/watch_queue", O_RDWR);

     which is then configured and mmap'd into userspace:

	ioctl(fd, IOC_WATCH_QUEUE_SET_SIZE, BUF_SIZE);
	ioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter);
	buf = mmap(NULL, BUF_SIZE * page_size, PROT_READ | PROT_WRITE,
		   MAP_SHARED, fd, 0);

     The fd cannot be read or written (though there is a facility to use
     write to inject records for debugging) and userspace just pulls data
     directly out of the buffer.

 (2) The ring index pointers are stored inside the ring and are thus
     accessible to userspace.  Userspace should only update the tail
     pointer and never the head pointer or risk breaking the buffer.  The
     kernel checks that the pointers appear valid before trying to use
     them.  A 'skip' record is maintained around the pointers.

 (3) poll() can be used to wait for data to appear in the buffer.

 (4) Records in the buffer are binary, typed and have a length so that they
     can be of varying size.

     This means that multiple heterogeneous sources can share a common
     buffer.  Tags may be specified when a watchpoint is created to help
     distinguish the sources.

 (5) The queue is reusable as there are 16 million types available, of
     which I've used 4, so there is scope for others to be used.

 (6) Records are filterable as types have up to 256 subtypes that can be
     individually filtered.  Other filtration is also available.

 (7) Each time the buffer is opened, a new buffer is created - this means
     that there's no interference between watchers.

 (8) When recording a notification, the kernel will not sleep, but will
     rather mark a queue as overrun if there's insufficient space, thereby
     avoiding userspace causing the kernel to hang.

 (9) The 'watchpoint' should be specific where possible, meaning that you
     specify the object that you want to watch.

(10) The buffer is created and then watchpoints are attached to it, using
     one of:

	keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fd, 0x01);
	mount_notify(AT_FDCWD, "/", 0, fd, 0x02);
	sb_notify(AT_FDCWD, "/mnt", 0, fd, 0x03);

     where in all three cases, fd indicates the queue and the number after
     is a tag between 0 and 255.

(11) The watch must be removed if either the watch buffer is destroyed or
     the watched object is destroyed.
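
As a rough illustration of points (2), (4), (5), (6) and (8), here is a
single-threaded userspace model of such a ring.  It is only a sketch under
assumptions: the names (model_ring, model_post, model_consume), the 8-byte
slot size and the header packing are invented here and are not the layout in
include/uapi/linux/watch_queue.h, the indices live beside the ring rather
than inside it, and the memory barriers a real shared buffer needs are
omitted.

```c
#include <stdint.h>

#define RING_SLOTS 16u		/* must be a power of two */

/* Hypothetical ring: the producer only advances head, the consumer only
 * advances tail.  Each record is one header slot followed by payload
 * slots; the header packs length, a 24-bit type and an 8-bit subtype. */
struct model_ring {
	uint32_t head;
	uint32_t tail;
	int overrun;		/* set instead of sleeping when full */
	uint64_t slots[RING_SLOTS];
};

/* Post a record without blocking.  Returns 1 on success, or 0 after
 * marking the ring as overrun if there is insufficient space. */
static int model_post(struct model_ring *r, uint32_t type, uint8_t subtype,
		      const uint64_t *payload, uint32_t payload_slots)
{
	uint32_t need = 1 + payload_slots;
	uint32_t used = r->head - r->tail;

	if (need > RING_SLOTS - used) {
		r->overrun = 1;
		return 0;
	}
	r->slots[r->head & (RING_SLOTS - 1)] =
		((uint64_t)need << 32) | ((type & 0xffffffu) << 8) | subtype;
	for (uint32_t i = 0; i < payload_slots; i++)
		r->slots[(r->head + 1 + i) & (RING_SLOTS - 1)] = payload[i];
	r->head += need;
	return 1;
}

/* Consume one record.  Returns its length in slots, or 0 if the ring is
 * empty or the header length does not fit the data available. */
static uint32_t model_consume(struct model_ring *r, uint32_t *type,
			      uint8_t *subtype)
{
	if (r->head == r->tail)
		return 0;
	uint64_t hdr = r->slots[r->tail & (RING_SLOTS - 1)];
	uint32_t len = (uint32_t)(hdr >> 32);

	if (len == 0 || len > r->head - r->tail)
		return 0;
	*type = (uint32_t)(hdr >> 8) & 0xffffffu;
	*subtype = (uint8_t)(hdr & 0xff);
	r->tail += len;		/* the only index the consumer may touch */
	return len;
}
```

Because every record carries its own length, a consumer can step over
record types it does not understand, which is what lets heterogeneous
sources share one buffer.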


Things I want to avoid:

 (1) Introducing features that make the core VFS dependent on the network
     stack or networking namespaces (ie. usage of netlink).

 (2) Dumping all this stuff into dmesg and having a daemon that sits there
     parsing the output and distributing it as this then puts the
     responsibility for security into userspace and makes handling
     namespaces tricky.  Further, dmesg might not exist or might be
     inaccessible inside a container.

 (3) Letting users see events they shouldn't be able to see.


Further things that could be considered:

 (1) Adding a keyctl call to allow a watch on a keyring to be extended to
     "children" of that keyring, such that the watch is removed from the
     child if it is unlinked from the keyring.

 (2) Adding a global superblock event queue.

 (3) Propagating watches to child superblocks over automounts.


The patches can be found here also:

	http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=notifications

Changes:

 v2: I've fixed various issues raised by Jann Horn and GregKH and moved to
     krefs for refcounting.  I've added some security features to try and
     give Casey Schaufler the LSM control he wants.

David
---
David Howells (8):
      security: Override creds in __fput() with last fputter's creds
      General notification queue with user mmap()'able ring buffer
      keys: Add a notification facility
      vfs: Add a mount-notification facility
      vfs: Add superblock notifications
      fsinfo: Export superblock notification counter
      block: Add block layer notifications
      Add sample notification program


 Documentation/security/keys/core.rst   |   58 ++
 Documentation/watch_queue.rst          |  328 ++++++++++++
 arch/x86/entry/syscalls/syscall_32.tbl |    3 
 arch/x86/entry/syscalls/syscall_64.tbl |    3 
 block/Kconfig                          |    9 
 block/Makefile                         |    1 
 block/blk-core.c                       |   29 +
 block/blk-notify.c                     |   83 +++
 drivers/misc/Kconfig                   |   13 
 drivers/misc/Makefile                  |    1 
 drivers/misc/watch_queue.c             |  895 ++++++++++++++++++++++++++++++++
 fs/Kconfig                             |   21 +
 fs/Makefile                            |    1 
 fs/file_table.c                        |   12 
 fs/fsinfo.c                            |   12 
 fs/mount.h                             |   33 +
 fs/mount_notify.c                      |  186 +++++++
 fs/namespace.c                         |    9 
 fs/super.c                             |  117 ++++
 include/linux/blkdev.h                 |   10 
 include/linux/dcache.h                 |    1 
 include/linux/fs.h                     |   79 +++
 include/linux/key.h                    |    4 
 include/linux/lsm_hooks.h              |   15 +
 include/linux/security.h               |   14 +
 include/linux/syscalls.h               |    5 
 include/linux/watch_queue.h            |   87 +++
 include/uapi/linux/fsinfo.h            |   10 
 include/uapi/linux/keyctl.h            |    1 
 include/uapi/linux/watch_queue.h       |  185 +++++++
 kernel/sys_ni.c                        |    7 
 mm/interval_tree.c                     |    2 
 mm/memory.c                            |    1 
 samples/Kconfig                        |    6 
 samples/Makefile                       |    1 
 samples/vfs/test-fsinfo.c              |   13 
 samples/watch_queue/Makefile           |    9 
 samples/watch_queue/watch_test.c       |  284 ++++++++++
 security/keys/Kconfig                  |   10 
 security/keys/compat.c                 |    2 
 security/keys/gc.c                     |    5 
 security/keys/internal.h               |   30 +
 security/keys/key.c                    |   37 +
 security/keys/keyctl.c                 |   89 +++
 security/keys/keyring.c                |   17 -
 security/keys/request_key.c            |    4 
 security/security.c                    |    9 
 47 files changed, 2713 insertions(+), 38 deletions(-)
 create mode 100644 Documentation/watch_queue.rst
 create mode 100644 block/blk-notify.c
 create mode 100644 drivers/misc/watch_queue.c
 create mode 100644 fs/mount_notify.c
 create mode 100644 include/linux/watch_queue.h
 create mode 100644 include/uapi/linux/watch_queue.h
 create mode 100644 samples/watch_queue/Makefile
 create mode 100644 samples/watch_queue/watch_test.c

Comments

Andy Lutomirski June 4, 2019, 5:43 p.m. UTC | #1
On Tue, Jun 4, 2019 at 9:35 AM David Howells <dhowells@redhat.com> wrote:
>
>
> Hi Al,
>
> Here's a set of patches to add a general variable-length notification queue
> concept and to add sources of events for:

I asked before and didn't see a response, so I'll ask again.  Why are
you paying any attention at all to the creds that generate an event?
It seems like the resulting security model will be very hard to
understand and probably buggy.  Can't you define a sensible model in
which only the listener creds matter?

> LSM support is included:
>
>  (1) The creds of the process that did the fput() that reduced the refcount
>      to zero are cached in the file struct.
>
>  (2) __fput() overrides the current creds with the creds from (1) whilst
>      doing the cleanup, thereby making sure that the creds seen by the
>      destruction notification generated by mntput() appears to come from
>      the last fputter.

That looks like duct tape that is, at best, likely to be very buggy.

>
>  (3) security_post_notification() is called for each queue that we might
>      want to post a notification into, thereby allowing the LSM to prevent
>      covert communications.

This seems like the wrong approach.  If an LSM wants to prevent covert
communication from, say, mount actions, then it shouldn't allow the
watch to be set up in the first place.
Casey Schaufler June 4, 2019, 8:31 p.m. UTC | #2
On 6/4/2019 10:43 AM, Andy Lutomirski wrote:
> On Tue, Jun 4, 2019 at 9:35 AM David Howells <dhowells@redhat.com> wrote:
>>
>> Hi Al,
>>
>> Here's a set of patches to add a general variable-length notification queue
>> concept and to add sources of events for:
> I asked before and didn't see a response, so I'll ask again.  Why are
> you paying any attention at all to the creds that generate an event?
> It seems like the resulting security model will be very hard to
> understand and probably buggy.  Can't you define a sensible model in
> which only the listener creds matter?

We've spent the last 18 months reeling from the implications
of what can happen when one process has the ability to snoop
on another. Introducing yet another mechanism that is trivial
to exploit is a very bad idea.

I will try to explain the problem once again. If process A
sends a signal (writes information) to process B the kernel
checks that either process A has the same UID as process B
or that process A has privilege to override that policy.
Process B is passive in this access control decision, while
process A is active. In the event delivery case, process A
does something (e.g. modifies a keyring) that generates an
event, which is then sent to process B's event buffer. Again,
A is active and B is passive. Process A must have write access
(defined by some policy) to process B's event buffer. To
implement such a policy requires A's credential, and some
information about the object (passive entity) to which the
event is being delivered. You can't just use the credential
from Process B because it is not the active entity, it is the
passive entity.
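
The signal rule described above can be written down as a tiny model.  This
is a simplified sketch, not the kernel's actual check (which compares more
than one kind of user ID); struct model_cred and model_may_signal() are
hypothetical names, and the 'privileged' flag merely stands in for a
capability such as CAP_KILL.

```c
#include <stdbool.h>

struct model_cred {
	unsigned int uid;
	bool privileged;	/* stand-in for holding an override capability */
};

/* The sender (A) is the subject and the receiver (B) is the object: the
 * decision is made from A's credential plus information about B. */
static bool model_may_signal(const struct model_cred *sender,
			     const struct model_cred *receiver)
{
	return sender->privileged || sender->uid == receiver->uid;
}
```

The argument being made is that event delivery would need a check of the
same shape, with B's event buffer in the role of the object.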


>
>> LSM support is included:
>>
>>  (1) The creds of the process that did the fput() that reduced the refcount
>>      to zero are cached in the file struct.
>>
>>  (2) __fput() overrides the current creds with the creds from (1) whilst
>>      doing the cleanup, thereby making sure that the creds seen by the
>>      destruction notification generated by mntput() appears to come from
>>      the last fputter.
> That looks like duct tape that is, at best, likely to be very buggy.
>
>>  (3) security_post_notification() is called for each queue that we might
>>      want to post a notification into, thereby allowing the LSM to prevent
>>      covert communications.
> This seems like the wrong approach.  If an LSM wants to prevent covert
> communication from, say, mount actions, then it shouldn't allow the
> watch to be set up in the first place.
David Howells June 4, 2019, 8:39 p.m. UTC | #3
Andy Lutomirski <luto@kernel.org> wrote:

> > Here's a set of patches to add a general variable-length notification queue
> > concept and to add sources of events for:
> 
> I asked before and didn't see a response, so I'll ask again.  Why are you
> paying any attention at all to the creds that generate an event?

Casey responded to you.  It's one of his requirements.

I'm not sure of the need, and I particularly don't like trying to make
indirect destruction events (mount destruction keyed on fput, for instance)
carry the creds of the triggerer.  Indeed, the trigger can come from all sorts
of places - including af_unix queue destruction, someone poking around in
procfs, a variety of processes fputting simultaneously.  Only one of them can
win, and the LSM needs to handle *all* the possibilities.

However, the LSMs (or at least SELinux) ignore f_cred and use current_cred()
when checking permissions.  See selinux_revalidate_file_permission() for
example - it uses current_cred() not file->f_cred to re-evaluate the perms,
and the fd might be shared between a number of processes with different creds.

> This seems like the wrong approach.  If an LSM wants to prevent covert
> communication from, say, mount actions, then it shouldn't allow the
> watch to be set up in the first place.

Yeah, I can agree to that.  Casey?

David
Andy Lutomirski June 4, 2019, 8:57 p.m. UTC | #4
On Tue, Jun 4, 2019 at 1:39 PM David Howells <dhowells@redhat.com> wrote:
>
> Andy Lutomirski <luto@kernel.org> wrote:
>
> > > Here's a set of patches to add a general variable-length notification queue
> > > concept and to add sources of events for:
> >
> > I asked before and didn't see a response, so I'll ask again.  Why are you
> > paying any attention at all to the creds that generate an event?
>
> Casey responded to you.  It's one of his requirements.
>

It being a "requirement" doesn't make it okay.

> However, the LSMs (or at least SELinux) ignore f_cred and use current_cred()
> when checking permissions.  See selinux_revalidate_file_permission() for
> example - it uses current_cred() not file->f_cred to re-evaluate the perms,
> and the fd might be shared between a number of processes with different creds.

That's a bug.  It's arguably a rather severe bug.  If I ever get
around to writing the patch I keep thinking of, the one that warns when
creds are used from an invalid context, it will warn here.

Let's please not repeat this.
Andy Lutomirski June 4, 2019, 9:05 p.m. UTC | #5
On Tue, Jun 4, 2019 at 1:31 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
>
> On 6/4/2019 10:43 AM, Andy Lutomirski wrote:
> > On Tue, Jun 4, 2019 at 9:35 AM David Howells <dhowells@redhat.com> wrote:
> >>
> >> Hi Al,
> >>
> >> Here's a set of patches to add a general variable-length notification queue
> >> concept and to add sources of events for:
> > I asked before and didn't see a response, so I'll ask again.  Why are
> > you paying any attention at all to the creds that generate an event?
> > > It seems like the resulting security model will be very hard to
> > understand and probably buggy.  Can't you define a sensible model in
> > which only the listener creds matter?
>
> We've spent the last 18 months reeling from the implications
> of what can happen when one process has the ability to snoop
> on another. Introducing yet another mechanism that is trivial
> to exploit is a very bad idea.

If you're talking about Spectre, etc, this is IMO entirely irrelevant.
Among other things, setting these watches can and should require some
degree of privilege.

>
> I will try to explain the problem once again. If process A
> sends a signal (writes information) to process B the kernel
> checks that either process A has the same UID as process B
> or that process A has privilege to override that policy.
> Process B is passive in this access control decision, while
> process A is active.

Are you stating what you see to be a requirement?

> Process A must have write access
> (defined by some policy) to process B's event buffer.

No, stop right here.  Process B is monitoring some aspect of the
system.  Process A is doing something.  Process B should need
permission to monitor whatever it's monitoring, and process A should
have permission to do whatever it's doing.  I don't think it makes
sense to try to ascribe an identity to the actor doing some action to
decide to omit it from the watch -- this has all kinds of correctness
issues.

If you're writing a policy and you don't like letting process B spy on
processes doing various things, then disallow that type of spying.
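
A minimal sketch of the listener-only model, for contrast with the
subject/object check.  Everything here is hypothetical: the struct and
function names are invented, and the "owner or UID 0" rule is only a
placeholder for whatever policy would decide who may watch an object.

```c
#include <stdbool.h>
#include <stddef.h>

struct model_watch {
	unsigned int listener_uid;
	unsigned int events_seen;
};

struct model_object {
	unsigned int owner_uid;
	struct model_watch *watch;	/* NULL when nobody is watching */
};

/* Permission is decided once, at watch setup, from the listener alone. */
static bool model_set_watch(struct model_object *obj, struct model_watch *w)
{
	if (w->listener_uid != 0 && w->listener_uid != obj->owner_uid)
		return false;
	obj->watch = w;
	return true;
}

/* Posting never looks at who caused the event: the actor's own permission
 * to perform the action was checked elsewhere, by the action itself. */
static void model_post_event(struct model_object *obj)
{
	if (obj->watch)
		obj->watch->events_seen++;
}
```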

> To
> implement such a policy requires A's credential,

You may not design a new mechanism that looks at the credential in a
context where looking at a credential is invalid unless you have some
very strong justification for why all of the known reasons that it's a
bad idea don't apply to what you're doing.

So, without a much stronger justification, NAK.
Casey Schaufler June 4, 2019, 9:11 p.m. UTC | #6
On 6/4/2019 1:39 PM, David Howells wrote:
> Andy Lutomirski <luto@kernel.org> wrote:
>
>>> Here's a set of patches to add a general variable-length notification queue
>>> concept and to add sources of events for:
>> I asked before and didn't see a response, so I'll ask again.  Why are you
>> paying any attention at all to the creds that generate an event?
> Casey responded to you.  It's one of his requirements.

Process A takes an action. As a result of that action,
an event is written to Process B's event buffer. This isn't
a covert channel, it's a direct access, just like sending
a signal. Process A is the subject and the event buffer,
which is part of Process B, is the object.


> I'm not sure of the need, and I particularly don't like trying to make
> indirect destruction events (mount destruction keyed on fput, for instance)
> carry the creds of the triggerer.  Indeed, the trigger can come from all sorts
> of places - including af_unix queue destruction, someone poking around in
> procfs, a variety of processes fputting simultaneously.  Only one of them can
> win, and the LSM needs to handle *all* the possibilities.

Yes, it's a hairy problem. It was a significant factor in the
demise of kdbus.

> However, the LSMs (or at least SELinux) ignore f_cred and use current_cred()
> when checking permissions.  See selinux_revalidate_file_permission() for
> example - it uses current_cred() not file->f_cred to re-evaluate the perms,
> and the fd might be shared between a number of processes with different creds.
>
>> This seems like the wrong approach.  If an LSM wants to prevent covert
>> communication from, say, mount actions, then it shouldn't allow the
>> watch to be set up in the first place.
> Yeah, I can agree to that.  Casey?

Back to your earlier point, you don't know where the
event is coming from when you create the event watch.
If you enforce at watch time, what are you going to check?
Isn't this going to be considered too restrictive?
Casey Schaufler June 4, 2019, 10:03 p.m. UTC | #7
On 6/4/2019 2:05 PM, Andy Lutomirski wrote:
> On Tue, Jun 4, 2019 at 1:31 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
>> On 6/4/2019 10:43 AM, Andy Lutomirski wrote:
>>> On Tue, Jun 4, 2019 at 9:35 AM David Howells <dhowells@redhat.com> wrote:
>>>> Hi Al,
>>>>
>>>> Here's a set of patches to add a general variable-length notification queue
>>>> concept and to add sources of events for:
>>> I asked before and didn't see a response, so I'll ask again.  Why are
>>> you paying any attention at all to the creds that generate an event?
>>> It seems like the resulting security model will be very hard to
>>> understand and probably buggy.  Can't you define a sensible model in
>>> which only the listener creds matter?
>> We've spent the last 18 months reeling from the implications
>> of what can happen when one process has the ability to snoop
>> on another. Introducing yet another mechanism that is trivial
>> to exploit is a very bad idea.
> If you're talking about Spectre, etc, this is IMO entirely irrelevant.

We're seeing significant interest in using obscure mechanisms
in system exploits. Mechanisms will be exploited.

> Among other things, setting these watches can and should require some
> degree of privilege.

Requiring privilege would address the concerns for most
situations, although I don't see that it would help for
SELinux. SELinux does not generally put much credence in
what others consider "privilege".

Extreme care would probably be required for namespaces, too.

>
>> I will try to explain the problem once again. If process A
>> sends a signal (writes information) to process B the kernel
>> checks that either process A has the same UID as process B
>> or that process A has privilege to override that policy.
>> Process B is passive in this access control decision, while
>> process A is active.
> Are you stating what you see to be a requirement?

Basic subject/object access control is the core of
the Linux security model. Yes, there are exceptions,
but mostly they're historical in origin.


>> Process A must have write access
>> (defined by some policy) to process B's event buffer.
> No, stop right here.

Listening ...

>   Process B is monitoring some aspect of the
> system.

Process B is not "monitoring". At some point in the past it
has registered a request for information should an event occur.
It is currently passive.

> Process A is doing something.

Yes. It is active.

> Process B should need
> permission to monitor whatever it's monitoring,

OK, I'm good with that. But the only time you
can tell that is when the event is registered,
and at that time you can't tell who might be causing
the event. (Or can you?)

> and process A should
> have permission to do whatever it's doing.

So there needs to be some connection between what B
can request events for and what events A can cause.
Then you can deny B's requests because of A.

>   I don't think it makes
> sense to try to ascribe an identity to the actor doing some action to
> decide to omit it from the watch -- this has all kinds of correctness
> issues.

It works for signals and UDP, but in general I get the concern.

> If you're writing a policy and you don't like letting process B spy on
> processes doing various things, then disallow that type of spying.

That gets you into a situation where you can't do the legitimate
monitoring you want to do just because there's the off chance you
might see something you shouldn't. "I hate security! It's confusing,
and always gets in the way!"

>> To
>> implement such a policy requires A's credential,
> You may not design a new mechanism that looks at the credential in a
> context where looking at a credential is invalid unless you have some
> very strong justification for why all of the known reasons that it's a
> bad idea don't apply to what you're doing.

Point. But you also don't get to ignore basic security policy
just because someone's spiffy lazy memory free cache hashing
tree (or similar mechanism) throws away references to important
information while it's still needed.

> So, without a much stronger justification, NAK.

I try to be reasonable. Really. All I want is something
with a security model that can be explained coherently 
within the context of the basic Linux security model.
There are enough variations as it is.
Andy Lutomirski June 5, 2019, 4:19 a.m. UTC | #8
On Tue, Jun 4, 2019 at 6:18 PM Stephen Smalley
<stephen.smalley@gmail.com> wrote:
>
> On Tue, Jun 4, 2019 at 4:58 PM Andy Lutomirski <luto@kernel.org> wrote:
>>
>> On Tue, Jun 4, 2019 at 1:39 PM David Howells <dhowells@redhat.com> wrote:
>> >
>> > Andy Lutomirski <luto@kernel.org> wrote:
>> >
>> > > > Here's a set of patches to add a general variable-length notification queue
>> > > > concept and to add sources of events for:
>> > >
>> > > I asked before and didn't see a response, so I'll ask again.  Why are you
>> > > paying any attention at all to the creds that generate an event?
>> >
>> > Casey responded to you.  It's one of his requirements.
>> >
>>
>> It being a "requirement" doesn't make it okay.
>>
>> > However, the LSMs (or at least SELinux) ignore f_cred and use current_cred()
>> > when checking permissions.  See selinux_revalidate_file_permission() for
>> > example - it uses current_cred() not file->f_cred to re-evaluate the perms,
>> > and the fd might be shared between a number of processes with different creds.
>>
>> That's a bug.  It's arguably a rather severe bug.  If I ever get
>> around to writing the patch I keep thinking of that will warn if we
>> use creds from invalid contexts, it will warn.
>
>
> No, not a bug.  Working as designed. Initial validation on open, but
> revalidation upon read/write if something has changed since open (process
> SID differs from opener, inode SID has changed, policy has changed).
> Current subject SID should be used for the revalidation. It's a MAC vs DAC
> difference.
>

Can you explain how the design is valid, then?  Consider nasty cases like this:

$ sudo -u lotsofgarbage 2>/dev/whatever

It is certainly the case that drivers, fs code, and other core code
MUST NOT look at current_cred() in the context of syscalls like
open().  Jann, I, and others have found quite a few rootable bugs of
this sort.  What makes MAC special here?

I would believe there are cases where auditing write() callers makes
some sense, but anyone reading those logs needs to understand that the
creds are dubious at best.
David Howells June 5, 2019, 8:41 a.m. UTC | #9
Casey Schaufler <casey@schaufler-ca.com> wrote:

> I will try to explain the problem once again. If process A
> sends a signal (writes information) to process B the kernel
> checks that either process A has the same UID as process B
> or that process A has privilege to override that policy.
> Process B is passive in this access control decision, while
> process A is active. In the event delivery case, process A
> does something (e.g. modifies a keyring) that generates an
> event, which is then sent to process B's event buffer.

I think this might be the core sticking point here.  It looks like two
different situations:

 (1) A explicitly sends event to B (eg. signalling, sendmsg, etc.)

 (2) A implicitly and unknowingly sends event to B as a side effect of some
     other action (eg. B has a watch for the event A did).

The LSM treats them as the same: that is, A must have MAC authorisation to send
a message to B.

But there are problems with not sending the event:

 (1) B's internal state is then corrupt (or, at least, unknowingly invalid).

 (2) B can potentially figure out that the event happened by other means.


I've implemented four event sources so far:

 (1) Keys/keyrings.  You can only get events on a key you have View permission
     on and the other process has to have write access to it, so I think this
     is good enough.

 (2) Block layer.  Currently this will only get you hardware error events,
     which is probably safe.  I'm not sure you can manipulate those without
     permission to directly access the device files.

 (3) Superblock.  This is trickier since it can see events that can be
     manufactured (R/W <-> R/O remounting, EDQUOT) as well as events that
     can't without hardware control (EIO, network link loss, RF kill).

 (4) Mount topology.  This is the trickiest since it allows you to see events
     beyond the point at which you placed your watch (in essence, you place a
     subtree watch).

     The question is what permission checking should I do?  Ideally, I'd
     emulate a pathwalk between the watchpoint and the eventing object to see
     if the owner of the watchpoint could reach it.

     I'd need to do a reverse walk, calling inode_permission(MAY_NOT_BLOCK)
     for each directory between the eventing object and the watchpoint to see
     if one rejects it - but some filesystems have a permission check that
     can't be called in this state.

     It would also be necessary to do this separately for each watchpoint in
     the parental chain.

     Further, each permissions check would generate an audit event and could
     generate FAN_ACCESS and/or FAN_ACCESS_PERM fanotify events - which could
     be a problem if fanotify is also trying to post those events to the same
     watch queue.
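
The reverse walk for the mount-topology case can be sketched in userspace
as follows.  This is a model under assumptions, not kernel code: struct
model_dir and model_event_visible() are invented names, and the
'searchable' flag stands in for the result that
inode_permission(MAY_NOT_BLOCK) would give for the watch owner on that
directory.  The sketch also checks the two end points, which may or may
not be wanted.

```c
#include <stdbool.h>
#include <stddef.h>

struct model_dir {
	struct model_dir *parent;	/* NULL at the root */
	bool searchable;		/* could the watch owner traverse it? */
};

/* Walk upwards from the directory where the event happened.  The event is
 * reported only if the watchpoint is reached and no directory on the way
 * (end points included) rejects the watch owner. */
static bool model_event_visible(const struct model_dir *event,
				const struct model_dir *watchpoint)
{
	for (const struct model_dir *d = event; d; d = d->parent) {
		if (!d->searchable)
			return false;
		if (d == watchpoint)
			return true;
	}
	return false;	/* the event lies outside the watched subtree */
}
```

With several watchpoints in the parental chain, this walk would have to be
repeated once per watchpoint, which is the cost noted above.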

David
Stephen Smalley June 5, 2019, 1:47 p.m. UTC | #10
On 6/5/19 12:19 AM, Andy Lutomirski wrote:
> On Tue, Jun 4, 2019 at 6:18 PM Stephen Smalley
> <stephen.smalley@gmail.com> wrote:
>>
>> On Tue, Jun 4, 2019 at 4:58 PM Andy Lutomirski <luto@kernel.org> wrote:
>>>
>>> On Tue, Jun 4, 2019 at 1:39 PM David Howells <dhowells@redhat.com> wrote:
>>>>
>>>> Andy Lutomirski <luto@kernel.org> wrote:
>>>>
>>>>>> Here's a set of patches to add a general variable-length notification queue
>>>>>> concept and to add sources of events for:
>>>>>
>>>>> I asked before and didn't see a response, so I'll ask again.  Why are you
>>>>> paying any attention at all to the creds that generate an event?
>>>>
>>>> Casey responded to you.  It's one of his requirements.
>>>>
>>>
>>> It being a "requirement" doesn't make it okay.
>>>
>>>> However, the LSMs (or at least SELinux) ignore f_cred and use current_cred()
>>>> when checking permissions.  See selinux_revalidate_file_permission() for
>>>> example - it uses current_cred() not file->f_cred to re-evaluate the perms,
>>>> and the fd might be shared between a number of processes with different creds.
>>>
>>> That's a bug.  It's arguably a rather severe bug.  If I ever get
>>> around to writing the patch I keep thinking of that will warn if we
>>> use creds from invalid contexts, it will warn.
>>
>>
>> No, not a bug.  Working as designed. Initial validation on open, but
>> revalidation upon read/write if something has changed since open (process
>> SID differs from opener, inode SID has changed, policy has changed).
>> Current subject SID should be used for the revalidation. It's a MAC vs DAC
>> difference.
>>
> 
> Can you explain how the design is valid, then?  Consider nasty cases like this:
> 
> $ sudo -u lotsofgarbage 2>/dev/whatever

(sorry for the previous html email; gmail or my inability to properly 
use it strikes again!)

Here we have four (or more) opportunities to say no:
1) Upon selinux_inode_permission(), when checking write access to 
/dev/whatever in the context of the shell process,
2) Upon selinux_file_open(), when checking and caching the open and 
write access for shell to /dev/whatever in the file security struct,
3) Upon selinux_bprm_committing_creds() -> flush_unauthorized_files(), 
when revalidating write access to /dev/whatever in the context of sudo,
4) Upon selinux_file_permission() -> 
selinux_revalidate_file_permission(), when revalidating write access to 
/dev/whatever in the context of sudo.

If any of those fail, then access is denied, so unless both the shell 
and sudo are authorized to write to /dev/whatever, it is a no-go.  NB 
Only the shell context requires open permission here; the sudo context 
only needs write.

> It is certainly the case that drivers, fs code, and other core code
> MUST NOT look at current_cred() in the context of syscalls like
> open().  Jann, I, and others have found quite a few rootable bugs of
> this sort.  What makes MAC special here?

Do you mean syscalls like write(), not open()?  I think your concern is 
that they apply some check only during write() and not open() and 
therefore are susceptible to confused deputy scenario above.  In 
contrast we are validating access at open, transfer/inherit, and use. If 
we use file->f_cred instead of current_cred() in 
selinux_revalidate_file_permission() and the current process SID differs 
from that of the opener, we'll never apply a check for the actual 
security context performing the write(), so information can flow in 
violation of the MAC policy.

> I would believe there are cases where auditing write() callers makes
> some sense, but anyone reading those logs needs to understand that the
> creds are dubious at best.
Casey Schaufler June 5, 2019, 2:50 p.m. UTC | #11
On 6/5/2019 1:41 AM, David Howells wrote:
> Casey Schaufler <casey@schaufler-ca.com> wrote:
>
>> I will try to explain the problem once again. If process A
>> sends a signal (writes information) to process B the kernel
>> checks that either process A has the same UID as process B
>> or that process A has privilege to override that policy.
>> Process B is passive in this access control decision, while
>> process A is active. In the event delivery case, process A
>> does something (e.g. modifies a keyring) that generates an
>> event, which is then sent to process B's event buffer.
> I think this might be the core sticking point here.  It looks like two
> different situations:
>
>  (1) A explicitly sends event to B (eg. signalling, sendmsg, etc.)
>
>  (2) A implicitly and unknowingly sends event to B as a side effect of some
>      other action (eg. B has a watch for the event A did).
>
> The LSM treats them as the same: that is B must have MAC authorisation to send
> a message to A.

YES!

Threat is about what you can do, not what you intend to do.

And it would be really great if you put some thought into what
a rational model would be for UID based controls, too.

> But there are problems with not sending the event:
>
>  (1) B's internal state is then corrupt (or, at least, unknowingly invalid).

Then B is a badly written program.

>  (2) B can potentially figure out that the event happened by other means.

Then why does it need the event mechanism in the first place?

> I've implemented four event sources so far:
>
>  (1) Keys/keyrings.  You can only get events on a key you have View permission
>      on and the other process has to have write access to it, so I think this
>      is good enough.

Sounds fine.

>  (2) Block layer.  Currently this will only get you hardware error events,
>      which is probably safe.  I'm not sure you can manipulate those without
>      permission to directly access the device files.

There's an argument to be made that this should require CAP_SYS_ADMIN,
or that an LSM like SELinux might include hardware error events in
policy, but generally I agree that system generated events like this
are both harmless and pointless for the general public to watch.

>  (3) Superblock.  This is trickier since it can see events that can be
>      manufactured (R/W <-> R/O remounting, EDQUOT) as well as events that
>      can't without hardware control (EIO, network link loss, RF kill).

The events generated by processes (the 1st set) need controls
like keys. The events generated by the system (the 2nd set) may
need controls like the block layer.


>  (4) Mount topology.  This is the trickiest since it allows you to see events
>      beyond the point at which you placed your watch (in essence, you place a
>      subtree watch).

Like keys.

>      The question is what permission checking should I do?  Ideally, I'd
>      emulate a pathwalk between the watchpoint and the eventing object to see
>      if the owner of the watchpoint could reach it.

That will depend, as I've been saying, on what causes
the event to be generated. If it's from a process, the
question is "can the active process, the one that generated
the event, write to the passive, watching process?"
If it's the system on a hardware event, you may want the watcher
to have CAP_SYS_ADMIN.

>      I'd need to do a reverse walk, calling inode_permission(MAY_NOT_BLOCK)
>      for each directory between the eventing object and the watchpoint to see
>      if one rejects it - but some filesystems have a permission check that
>      can't be called in this state.

This is for setting the watch, right?

>      It would also be necessary to do this separately for each watchpoint in
>      the parental chain.
>
>      Further, each permissions check would generate an audit event and could
>      generate FAN_ACCESS and/or FAN_ACCESS_PERM fanotify events - which could
>      be a problem if fanotify is also trying to post those events to the same
>      watch queue.

If you required that the watching process open(dir) what
you want to watch you'd get this for free. Or did I miss
something obvious?

> David
Andy Lutomirski June 5, 2019, 4:04 p.m. UTC | #12
On Wed, Jun 5, 2019 at 7:51 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>
> On 6/5/2019 1:41 AM, David Howells wrote:
> > Casey Schaufler <casey@schaufler-ca.com> wrote:
> >
> >> I will try to explain the problem once again. If process A
> >> sends a signal (writes information) to process B the kernel
> >> checks that either process A has the same UID as process B
> >> or that process A has privilege to override that policy.
> >> Process B is passive in this access control decision, while
> >> process A is active. In the event delivery case, process A
> >> does something (e.g. modifies a keyring) that generates an
> >> event, which is then sent to process B's event buffer.
> > I think this might be the core sticking point here.  It looks like two
> > different situations:
> >
> >  (1) A explicitly sends event to B (eg. signalling, sendmsg, etc.)
> >
> >  (2) A implicitly and unknowingly sends event to B as a side effect of some
> >      other action (eg. B has a watch for the event A did).
> >
> > The LSM treats them as the same: that is B must have MAC authorisation to send
> > a message to A.
>
> YES!
>
> Threat is about what you can do, not what you intend to do.
>
> And it would be really great if you put some thought into what
> a rational model would be for UID based controls, too.
>
> > But there are problems with not sending the event:
> >
> >  (1) B's internal state is then corrupt (or, at least, unknowingly invalid).
>
> Then B is a badly written program.

Either I'm misunderstanding you or I strongly disagree.  If B has
authority to detect a certain action, and A has authority to perform
that action, then refusing to notify B because B is somehow missing
some special authorization to be notified by A is nuts.  This is just
introducing incorrectness into the design in support of a
not-actually-helpful security idea.

If I can read /proc/self/mounts, I can detect changes to my mount
namespace.  Giving me a faster and nicer way to do this is fine, AS
LONG AS IT ACTUALLY WORKS.  "Works" means it needs to detect all
changes.
David Howells June 5, 2019, 4:56 p.m. UTC | #13
Casey Schaufler <casey@schaufler-ca.com> wrote:

> YES!

I'm trying to decide if that's fervour or irritation at this point ;-)

> And it would be really great if you put some thought into what
> a rational model would be for UID based controls, too.

I have put some thought into it, but I don't see a single rational model.  It
depends very much on the situation.

In any case, that's what I was referring to when I said I might need to call
inode_permission().  But UIDs don't exist for all filesystems, for example,
and there are no UIDs on superblocks, mount objects or hardware events.

Now, I could see that you ignore UIDs on things like keys and
hardware-triggered events, but how does this interact with things like mount
watches that see directories that have UIDs?

Are you advocating making it such that process B can only see events triggered
by process A if they have the same UID, for example?

David
Casey Schaufler June 5, 2019, 5:01 p.m. UTC | #14
On 6/5/2019 9:04 AM, Andy Lutomirski wrote:
> On Wed, Jun 5, 2019 at 7:51 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>> On 6/5/2019 1:41 AM, David Howells wrote:
>>> Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>
>>>> I will try to explain the problem once again. If process A
>>>> sends a signal (writes information) to process B the kernel
>>>> checks that either process A has the same UID as process B
>>>> or that process A has privilege to override that policy.
>>>> Process B is passive in this access control decision, while
>>>> process A is active. In the event delivery case, process A
>>>> does something (e.g. modifies a keyring) that generates an
>>>> event, which is then sent to process B's event buffer.
>>> I think this might be the core sticking point here.  It looks like two
>>> different situations:
>>>
>>>  (1) A explicitly sends event to B (eg. signalling, sendmsg, etc.)
>>>
>>>  (2) A implicitly and unknowingly sends event to B as a side effect of some
>>>      other action (eg. B has a watch for the event A did).
>>>
>>> The LSM treats them as the same: that is B must have MAC authorisation to send
>>> a message to A.
>> YES!
>>
>> Threat is about what you can do, not what you intend to do.
>>
>> And it would be really great if you put some thought into what
>> a rational model would be for UID based controls, too.
>>
>>> But there are problems with not sending the event:
>>>
>>>  (1) B's internal state is then corrupt (or, at least, unknowingly invalid).
>> Then B is a badly written program.
> Either I'm misunderstanding you or I strongly disagree.

A program needs to be aware of the conditions under
which it gets events, *including the possibility that
it may not get an event that it's not allowed*. Do you
regularly write programs that go into corrupt states
if an open() fails? Or where read() returns less than
the amount of data you ask for?

>   If B has
> authority to detect a certain action, and A has authority to perform
> that action, then refusing to notify B because B is somehow missing
> some special authorization to be notified by A is nuts.

You are hand-waving the notion of authority. You are assuming
that if A can read X and B can read X that A can write B.

>   This is just
> introducing incorrectness into the design in support of a
> not-actually-helpful security idea.

Where is the incorrectness? Are you seriously saying that
you expect all events to be generated exactly as you think
they should? Have you ever even used systemd? 

> If I can read /proc/self/mounts, I can detect changes to my mount
> namespace.

Then read /proc/self/mounts!
Can't you poll on an fd open on /proc/self/mounts?
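
For reference, polling the mount table looks roughly like the sketch
below.  It relies on the behaviour described in proc(5), where a mount or
unmount marks an open fd on the mounts file with a priority event
(POLLPRI).  The helper names open_mounts() and wait_for_mount_change()
are made up for this example.

```c
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Open the caller's mount table for monitoring; returns an fd or -1. */
int open_mounts(void)
{
	return open("/proc/self/mounts", O_RDONLY);
}

/*
 * Wait up to timeout_ms for the mount table to change.
 * Returns 1 if a change was signalled, 0 on timeout, -1 on error.
 */
int wait_for_mount_change(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLPRI };
	int n;

	assert(fd >= 0);

	n = poll(&pfd, 1, timeout_ms);
	if (n <= 0)
		return n;	/* 0: timed out, -1: error with errno set */
	return (pfd.revents & (POLLPRI | POLLERR)) ? 1 : 0;
}
```

A wakeup here only says that something changed.  The caller still has to
re-read and re-parse the whole file to find out what.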

>   Giving me a faster and nicer way to do this is fine, AS
> LONG AS IT ACTUALLY WORKS.  "Works" means it needs to detect all
> changes.

So long as "WORKS" includes maintaining the system security
policy, I agree. No, I don't. We already have too many bizarre
and unnatural mechanisms to address whimsical special cases.
If speed is such an issue you could look at making /proc better.
David Howells June 5, 2019, 5:21 p.m. UTC | #15
Casey Schaufler <casey@schaufler-ca.com> wrote:

> > But there are problems with not sending the event:
> >
> >  (1) B's internal state is then corrupt (or, at least, unknowingly invalid).
> 
> Then B is a badly written program.

No.  It may have the expectation that it will get events but then it is denied
those events and doesn't even know they've happened.

> >  (2) B can potentially figure out that the event happened by other means.
> 
> Then why does it need the event mechanism in the first place?

Why does a CPU have interrupt lines?  It can always continuously poll the
hardware.  Why do poll() and select() exist?

> > I've implemented four event sources so far:
> >
> >  (1) Keys/keyrings.  You can only get events on a key you have View permission
> >      on and the other process has to have write access to it, so I think this
> >      is good enough.
> 
> Sounds fine.
> 
> >  (2) Block layer.  Currently this will only get you hardware error events,
> >      which is probably safe.  I'm not sure you can manipulate those without
> >      permission to directly access the device files.
> 
> There's an argument to be made that this should require CAP_SYS_ADMIN,
> or that an LSM like SELinux might include hardware error events in
> policy, but generally I agree that system generated events like this
> are both harmless and pointless for the general public to watch.

CAP_SYS_ADMIN is probably too broad a hammer - this is something you might
want to let a file manager or desktop environment use.  I wonder if we could
add a CAP_SYS_NOTIFY - or is it too late for adding new caps?

> >  (3) Superblock.  This is trickier since it can see events that can be
> >      manufactured (R/W <-> R/O remounting, EDQUOT) as well as events that
> >      can't without hardware control (EIO, network link loss, RF kill).
> 
> The events generated by processes (the 1st set) need controls
> like keys. The events generated by the system (the 2nd set) may
> need controls like the block layer.
>
>
> > (4)  Mount topology.  This is the trickiest since it allows you to see
> >      events beyond the point at which you placed your watch (in essence,
> >      you place a subtree watch).
> 
> Like keys.
> 
> >      The question is what permission checking should I do?  Ideally, I'd
> >      emulate a pathwalk between the watchpoint and the eventing object to
> >      see if the owner of the watchpoint could reach it.
> 
> That will depend, as I've been saying, on what causes
> the event to be generated. If it's from a process, the
> question is "can the active process, the one that generated
> the event, write to the passive, watching process?"
> If it's the system on a hardware event, you may want the watcher
> to have CAP_SYS_ADMIN.
> 
> >      I'd need to do a reverse walk, calling
> >      inode_permission(MAY_NOT_BLOCK) for each directory between the
> >      eventing object and the watchpoint to see if one rejects it - but
> >      some filesystems have a permission check that can't be called in this
> >      state.
> 
> This is for setting the watch, right?

No.  Setting the watch requires execute permission on the directory on which
you're setting the watch, but there's no way to know what permissions will be
required for an event at that point.

I'm talking about when an event is generated (hence "eventing object").
Imagine you have a subpath:

	dirA/dirB/dirC/dirD/dirE

where dir* are directories.  If you place a watch on dirA and then an event
occurs on dirE (such as someone mounting on it), I do a walk back up the
parental tree, in the order:

	dirE, dirD, dirC, dirB, dirA

If I need to check permissions on all the directories, I would find the
watchpoint on dirA, then I would have to repeat the walk to find out whether
the owner of the watchpoint can access all of those directories (perhaps
skipping dirA since I had permission to place a watchpoint thereon).

Note that this is subject to going awry if there's a race versus rename().
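
A minimal sketch of that two-pass walk, in userspace C with made-up
types.  struct dir, its fields and find_deliverable_watch() are
hypothetical; watcher_may_search stands in for whatever
inode_permission() would return for the watch owner.  It handles only the
nearest watchpoint and does nothing about the rename() race.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a directory on the path; not a kernel dentry. */
struct dir {
	const char *name;
	struct dir *parent;		/* NULL at the top of the tree */
	bool watcher_may_search;	/* stand-in for a permission check */
	bool has_watch;
};

/*
 * Returns the nearest watched ancestor of event_dir (or event_dir itself)
 * if the watch owner could search every directory below the watchpoint,
 * otherwise NULL.
 */
const struct dir *find_deliverable_watch(const struct dir *event_dir)
{
	const struct dir *watch, *d;

	assert(event_dir != NULL);

	/* First pass: walk up from the eventing object to the nearest watch. */
	for (watch = event_dir; watch && !watch->has_watch; watch = watch->parent)
		;
	if (!watch)
		return NULL;

	/*
	 * Second pass: repeat the walk, checking each directory below the
	 * watchpoint.  The watchpoint itself is skipped, since its owner
	 * already had permission to place the watch there.
	 */
	for (d = event_dir; d != watch; d = d->parent)
		if (!d->watcher_may_search)
			return NULL;
	return watch;
}
```

With a watch on dirA and an event on the deepest directory, a single
unsearchable directory in between is enough to suppress delivery.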

> >      It would also be necessary to do this separately for each watchpoint in
> >      the parental chain.
> >
> >      Further, each permissions check would generate an audit event and
> >      could generate FAN_ACCESS and/or FAN_ACCESS_PERM fanotify events -
> >      which could be a problem if fanotify is also trying to post those
> >      events to the same watch queue.
> 
> If you required that the watching process open(dir) what
> you want to watch you'd get this for free. Or did I miss
> something obvious?

A subtree watch, such as the mount topology watch, watches not only the
directory and mount object you pointed directly at, but the subtree rooted
thereon.

Take the sample program in the last patch.  It places a watch on "/" with no
filter against WATCH_INFO_RECURSIVE, so it sees all mount topology events that
happen under the VFS path subtree rooted at "/" - whether or not it can
actually pathwalk to those mounts.

David
Casey Schaufler June 5, 2019, 5:40 p.m. UTC | #16
On 6/5/2019 9:56 AM, David Howells wrote:
> Casey Schaufler <casey@schaufler-ca.com> wrote:
>
>> YES!
> I'm trying to decide if that's fervour or irritation at this point ;-)

I think I finally got the point that the underlying mechanism,
direct or indirect, isn't the issue. It's the end result that
matters. That makes me happier.

>> And it would be really great if you put some thought into what
>> a rational model would be for UID based controls, too.
> I have put some thought into it, but I don't see a single rational model.  It
> depends very much on the situation.

Right. You're mixing the kind of things that can generate events,
and that makes having a single policy difficult.

> In any case, that's what I was referring to when I said I might need to call
> inode_permission().  But UIDs don't exist for all filesystems, for example,
> and there are no UIDs on superblocks, mount objects or hardware events.

If you open() or stat() a file on those filesystems the UID
used in the access control comes from somewhere. Setting a watch
on things with UIDs should use the access mode on the file,
just like any other filesystem operation.

Things like superblocks are stickier because we don't generally
think of them as objects. If you can do statfs(), you should be
able to set a watch on the filesystem metadata.

How would you specify a watch for a hardware event? If you say
you have to open /dev/mumble to set a watch for mumbles, you're
good there, too.

> Now, I could see that you ignore UIDs on things like keys and
> hardware-triggered events, but how does this interact with things like mount
> watches that see directories that have UIDs?
>
> Are you advocating making it such that process B can only see events triggered
> by process A if they have the same UID, for example?

It's always seemed arbitrary to me that you can't open
your process up to get signals from other users. What about
putting mode bits on your ring buffer? By default you could
only accept your own events, but you could do a rb_chmod(0222)
and let all events through. Subject to additional LSM restrictions,
of course. That would require the cred of the process that
triggered the event or a system cred for "hardware" events.
If you don't like mode bits you could use an ACL for fine
granularity or a single "let'em all in" bit for coarse.

I'm not against access, I'm against uncontrolled access
in conflict with basic system policy.

> David
Andy Lutomirski June 5, 2019, 5:47 p.m. UTC | #17
> On Jun 5, 2019, at 10:01 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
> 
>> On 6/5/2019 9:04 AM, Andy Lutomirski wrote:
>>> On Wed, Jun 5, 2019 at 7:51 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>> On 6/5/2019 1:41 AM, David Howells wrote:
>>>> Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>> 
>>>>> I will try to explain the problem once again. If process A
>>>>> sends a signal (writes information) to process B the kernel
>>>>> checks that either process A has the same UID as process B
>>>>> or that process A has privilege to override that policy.
>>>>> Process B is passive in this access control decision, while
>>>>> process A is active. In the event delivery case, process A
>>>>> does something (e.g. modifies a keyring) that generates an
>>>>> event, which is then sent to process B's event buffer.
>>>> I think this might be the core sticking point here.  It looks like two
>>>> different situations:
>>>> 
>>>> (1) A explicitly sends event to B (eg. signalling, sendmsg, etc.)
>>>> 
>>>> (2) A implicitly and unknowingly sends event to B as a side effect of some
>>>>     other action (eg. B has a watch for the event A did).
>>>> 
>>>> The LSM treats them as the same: that is B must have MAC authorisation to send
>>>> a message to A.
>>> YES!
>>> 
>>> Threat is about what you can do, not what you intend to do.
>>> 
>>> And it would be really great if you put some thought into what
>>> a rational model would be for UID based controls, too.
>>> 
>>>> But there are problems with not sending the event:
>>>> 
>>>> (1) B's internal state is then corrupt (or, at least, unknowingly invalid).
>>> Then B is a badly written program.
>> Either I'm misunderstanding you or I strongly disagree.
> 
> A program needs to be aware of the conditions under
> which it gets event, *including the possibility that
> it may not get an event that it's not allowed*. Do you
> regularly write programs that go into corrupt states
> if an open() fails? Or where read() returns less than
> the amount of data you ask for?

I do not regularly write programs that handle read() omitting data in the
middle of a TCP stream.  I also don’t write programs that wait for processes
to die and need to handle the case where a child is dead, waitid() can see
it, but SIGCHLD wasn’t sent because “security”.

> 
>>  If B has
>> authority to detect a certain action, and A has authority to perform
>> that action, then refusing to notify B because B is somehow missing
>> some special authorization to be notified by A is nuts.
> 
> You are hand-waving the notion of authority. You are assuming
> that if A can read X and B can read X that A can write B.

No, read it again please. I’m assuming that if A can *write* X and B can
read X then A can send information to B.
Casey Schaufler June 5, 2019, 6:12 p.m. UTC | #18
On 6/5/2019 10:47 AM, Andy Lutomirski wrote:
>> On Jun 5, 2019, at 10:01 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>>
>>> On 6/5/2019 9:04 AM, Andy Lutomirski wrote:
>>>> On Wed, Jun 5, 2019 at 7:51 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>>> On 6/5/2019 1:41 AM, David Howells wrote:
>>>>> Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>>>
>>>>>> I will try to explain the problem once again. If process A
>>>>>> sends a signal (writes information) to process B the kernel
>>>>>> checks that either process A has the same UID as process B
>>>>>> or that process A has privilege to override that policy.
>>>>>> Process B is passive in this access control decision, while
>>>>>> process A is active. In the event delivery case, process A
>>>>>> does something (e.g. modifies a keyring) that generates an
>>>>>> event, which is then sent to process B's event buffer.
>>>>> I think this might be the core sticking point here.  It looks like two
>>>>> different situations:
>>>>>
>>>>> (1) A explicitly sends event to B (eg. signalling, sendmsg, etc.)
>>>>>
>>>>> (2) A implicitly and unknowingly sends event to B as a side effect of some
>>>>>     other action (eg. B has a watch for the event A did).
>>>>>
>>>>> The LSM treats them as the same: that is B must have MAC authorisation to send
>>>>> a message to A.
>>>> YES!
>>>>
>>>> Threat is about what you can do, not what you intend to do.
>>>>
>>>> And it would be really great if you put some thought into what
>>>> a rational model would be for UID based controls, too.
>>>>
>>>>> But there are problems with not sending the event:
>>>>>
>>>>> (1) B's internal state is then corrupt (or, at least, unknowingly invalid).
>>>> Then B is a badly written program.
>>> Either I'm misunderstanding you or I strongly disagree.
>> A program needs to be aware of the conditions under
>> which it gets event, *including the possibility that
>> it may not get an event that it's not allowed*. Do you
>> regularly write programs that go into corrupt states
>> if an open() fails? Or where read() returns less than
>> the amount of data you ask for?
> I do not regularly write programs that handle read() omitting data in the middle of a TCP stream.  I also don’t write programs that wait for processes to die and need to handle the case where a child is dead, waitid() can see it, but SIGCHLD wasn’t sent because “security”.
>
>>>  If B has
>>> authority to detect a certain action, and A has authority to perform
>>> that action, then refusing to notify B because B is somehow missing
>>> some special authorization to be notified by A is nuts.
>> You are hand-waving the notion of authority. You are assuming
>> that if A can read X and B can read X that A can write B.
> No, read it again please. I’m assuming that if A can *write* X and B can read X then A can send information to B.

That is *not* a valid assumption:

	A can write to /dev/null.
	B can read from /dev/null.
	Does not imply B can read what A wrote.
	Does not imply A can send a signal to B.

	A can send a UDP datagram to port 3343
	B is bound to port 3343
	Does not imply the packet will be delivered
Stephen Smalley June 5, 2019, 6:25 p.m. UTC | #19
On 6/5/19 1:47 PM, Andy Lutomirski wrote:
> 
>> On Jun 5, 2019, at 10:01 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>>
>>> On 6/5/2019 9:04 AM, Andy Lutomirski wrote:
>>>> On Wed, Jun 5, 2019 at 7:51 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>>> On 6/5/2019 1:41 AM, David Howells wrote:
>>>>> Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>>>
>>>>>> I will try to explain the problem once again. If process A
>>>>>> sends a signal (writes information) to process B the kernel
>>>>>> checks that either process A has the same UID as process B
>>>>>> or that process A has privilege to override that policy.
>>>>>> Process B is passive in this access control decision, while
>>>>>> process A is active. In the event delivery case, process A
>>>>>> does something (e.g. modifies a keyring) that generates an
>>>>>> event, which is then sent to process B's event buffer.
>>>>> I think this might be the core sticking point here.  It looks like two
>>>>> different situations:
>>>>>
>>>>> (1) A explicitly sends event to B (eg. signalling, sendmsg, etc.)
>>>>>
>>>>> (2) A implicitly and unknowingly sends event to B as a side effect of some
>>>>>      other action (eg. B has a watch for the event A did).
>>>>>
>>>>> The LSM treats them as the same: that is B must have MAC authorisation to send
>>>>> a message to A.
>>>> YES!
>>>>
>>>> Threat is about what you can do, not what you intend to do.
>>>>
>>>> And it would be really great if you put some thought into what
>>>> a rational model would be for UID based controls, too.
>>>>
>>>>> But there are problems with not sending the event:
>>>>>
>>>>> (1) B's internal state is then corrupt (or, at least, unknowingly invalid).
>>>> Then B is a badly written program.
>>> Either I'm misunderstanding you or I strongly disagree.
>>
>> A program needs to be aware of the conditions under
>> which it gets event, *including the possibility that
>> it may not get an event that it's not allowed*. Do you
>> regularly write programs that go into corrupt states
>> if an open() fails? Or where read() returns less than
>> the amount of data you ask for?
> 
> I do not regularly write programs that handle read() omitting data in the middle of a TCP stream.  I also don’t write programs that wait for processes to die and need to handle the case where a child is dead, waitid() can see it, but SIGCHLD wasn’t sent because “security”.
> 
>>
>>>   If B has
>>> authority to detect a certain action, and A has authority to perform
>>> that action, then refusing to notify B because B is somehow missing
>>> some special authorization to be notified by A is nuts.
>>
>> You are hand-waving the notion of authority. You are assuming
>> that if A can read X and B can read X that A can write B.
> 
> No, read it again please. I’m assuming that if A can *write* X and B can read X then A can send information to B.

I guess the questions here are:

1) How do we handle recursive notification support, since we can't check 
that B can read everything below a given directory easily?  Perhaps we 
can argue that if I have watch permission to / then that implies 
visibility to everything below it but that is rather broad.

2) Is there always a corresponding labeled object in view for each of 
these notifications to which we can check access when the watch is set?

3) Are notifications only generated for write events or can they be 
generated by processes that only have read access to the object?
Greg KH June 5, 2019, 7:28 p.m. UTC | #20
On Wed, Jun 05, 2019 at 02:25:33PM -0400, Stephen Smalley wrote:
> On 6/5/19 1:47 PM, Andy Lutomirski wrote:
> > 
> > > On Jun 5, 2019, at 10:01 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
> > > 
> > > > On 6/5/2019 9:04 AM, Andy Lutomirski wrote:
> > > > > On Wed, Jun 5, 2019 at 7:51 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
> > > > > > On 6/5/2019 1:41 AM, David Howells wrote:
> > > > > > Casey Schaufler <casey@schaufler-ca.com> wrote:
> > > > > > 
> > > > > > > I will try to explain the problem once again. If process A
> > > > > > > sends a signal (writes information) to process B the kernel
> > > > > > > checks that either process A has the same UID as process B
> > > > > > > or that process A has privilege to override that policy.
> > > > > > > Process B is passive in this access control decision, while
> > > > > > > process A is active. In the event delivery case, process A
> > > > > > > does something (e.g. modifies a keyring) that generates an
> > > > > > > event, which is then sent to process B's event buffer.
> > > > > > I think this might be the core sticking point here.  It looks like two
> > > > > > different situations:
> > > > > > 
> > > > > > (1) A explicitly sends event to B (eg. signalling, sendmsg, etc.)
> > > > > > 
> > > > > > (2) A implicitly and unknowingly sends event to B as a side effect of some
> > > > > >      other action (eg. B has a watch for the event A did).
> > > > > > 
> > > > > > The LSM treats them as the same: that is B must have MAC authorisation to send
> > > > > > a message to A.
> > > > > YES!
> > > > > 
> > > > > Threat is about what you can do, not what you intend to do.
> > > > > 
> > > > > And it would be really great if you put some thought into what
> > > > > a rational model would be for UID based controls, too.
> > > > > 
> > > > > > But there are problems with not sending the event:
> > > > > > 
> > > > > > (1) B's internal state is then corrupt (or, at least, unknowingly invalid).
> > > > > Then B is a badly written program.
> > > > Either I'm misunderstanding you or I strongly disagree.
> > > 
> > > A program needs to be aware of the conditions under
> > > which it gets event, *including the possibility that
> > > it may not get an event that it's not allowed*. Do you
> > > regularly write programs that go into corrupt states
> > > if an open() fails? Or where read() returns less than
> > > the amount of data you ask for?
> > 
> > I do not regularly write programs that handle read() omitting data in the middle of a TCP stream.  I also don’t write programs that wait for processes to die and need to handle the case where a child is dead, waitid() can see it, but SIGCHLD wasn’t sent because “security”.
> > 
> > > 
> > > >   If B has
> > > > authority to detect a certain action, and A has authority to perform
> > > > that action, then refusing to notify B because B is somehow missing
> > > > some special authorization to be notified by A is nuts.
> > > 
> > > You are hand-waving the notion of authority. You are assuming
> > > that if A can read X and B can read X that A can write B.
> > 
> > No, read it again please. I’m assuming that if A can *write* X and B can read X then A can send information to B.
> 
> I guess the questions here are:
> 
> 1) How do we handle recursive notification support, since we can't check
> that B can read everything below a given directory easily?  Perhaps we can
> argue that if I have watch permission to / then that implies visibility to
> everything below it but that is rather broad.

How do you handle fanotify today which I think can do this?

thanks,

greg k-h
Stephen Smalley June 5, 2019, 9:01 p.m. UTC | #21
On 6/5/19 3:28 PM, Greg KH wrote:
> On Wed, Jun 05, 2019 at 02:25:33PM -0400, Stephen Smalley wrote:
>> On 6/5/19 1:47 PM, Andy Lutomirski wrote:
>>>
>>>> On Jun 5, 2019, at 10:01 AM, Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>>
>>>>> On 6/5/2019 9:04 AM, Andy Lutomirski wrote:
>>>>>> On Wed, Jun 5, 2019 at 7:51 AM Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>>>>> On 6/5/2019 1:41 AM, David Howells wrote:
>>>>>>> Casey Schaufler <casey@schaufler-ca.com> wrote:
>>>>>>>
>>>>>>>> I will try to explain the problem once again. If process A
>>>>>>>> sends a signal (writes information) to process B the kernel
>>>>>>>> checks that either process A has the same UID as process B
>>>>>>>> or that process A has privilege to override that policy.
>>>>>>>> Process B is passive in this access control decision, while
>>>>>>>> process A is active. In the event delivery case, process A
>>>>>>>> does something (e.g. modifies a keyring) that generates an
>>>>>>>> event, which is then sent to process B's event buffer.
>>>>>>> I think this might be the core sticking point here.  It looks like two
>>>>>>> different situations:
>>>>>>>
>>>>>>> (1) A explicitly sends event to B (eg. signalling, sendmsg, etc.)
>>>>>>>
>>>>>>> (2) A implicitly and unknowingly sends event to B as a side effect of some
>>>>>>>       other action (eg. B has a watch for the event A did).
>>>>>>>
>>>>>>> The LSM treats them as the same: that is B must have MAC authorisation to send
>>>>>>> a message to A.
>>>>>> YES!
>>>>>>
>>>>>> Threat is about what you can do, not what you intend to do.
>>>>>>
>>>>>> And it would be really great if you put some thought into what
>>>>>> a rational model would be for UID based controls, too.
>>>>>>
>>>>>>> But there are problems with not sending the event:
>>>>>>>
>>>>>>> (1) B's internal state is then corrupt (or, at least, unknowingly invalid).
>>>>>> Then B is a badly written program.
>>>>> Either I'm misunderstanding you or I strongly disagree.
>>>>
>>>> A program needs to be aware of the conditions under
>>>> which it gets event, *including the possibility that
>>>> it may not get an event that it's not allowed*. Do you
>>>> regularly write programs that go into corrupt states
>>>> if an open() fails? Or where read() returns less than
>>>> the amount of data you ask for?
>>>
>>> I do not regularly write programs that handle read() omitting data in the middle of a TCP stream.  I also don’t write programs that wait for processes to die and need to handle the case where a child is dead, waitid() can see it, but SIGCHLD wasn’t sent because “security”.
>>>
>>>>
>>>>>    If B has
>>>>> authority to detect a certain action, and A has authority to perform
>>>>> that action, then refusing to notify B because B is somehow missing
>>>>> some special authorization to be notified by A is nuts.
>>>>
>>>> You are hand-waving the notion of authority. You are assuming
>>>> that if A can read X and B can read X that A can write B.
>>>
>>> No, read it again please. I’m assuming that if A can *write* X and B can read X then A can send information to B.
>>
>> I guess the questions here are:
>>
>> 1) How do we handle recursive notification support, since we can't check
>> that B can read everything below a given directory easily?  Perhaps we can
>> argue that if I have watch permission to / then that implies visibility to
>> everything below it but that is rather broad.
> 
> How do you handle fanotify today which I think can do this?

Doesn't appear to have been given much thought; looks like 
fanotify_init() checks capable(CAP_SYS_ADMIN) and fanotify_mark() checks 
inode_permission(MAY_READ) on the mount/directory/file.  File 
descriptors for monitored files returned upon events at least get vetted 
through security_file_open() so that can prevent the monitoring process 
from receiving arbitrary descriptors. Would be preferable if 
fanotify_mark() did some kind of security_path_watch() or similar check, 
and distinguished mounts versus directories since monitoring of 
directories is not recursive.
David Howells June 5, 2019, 9:06 p.m. UTC | #22
Casey Schaufler <casey@schaufler-ca.com> wrote:

> Right. You're mixing the kind of things that can generate events,
> and that makes having a single policy difficult.

Whilst that's true, the notifications are clearly marked as to type, so it
should be possible to select different policies for different notification
types.
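
As a rough illustration of per-type policy selection, here is a userspace
sketch in C.  None of it is taken from the patch set: the enum values, the
demo_* names and both policies are hypothetical and are only meant to show
the type of a notification picking the check that gets applied.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical notification types; the names are illustrative only. */
enum demo_watch_type {
	DEMO_WATCH_TYPE_KEY,
	DEMO_WATCH_TYPE_MOUNT,
	DEMO_WATCH_TYPE_DEVICE,
	DEMO_WATCH_TYPE__NR
};

/* Minimal stand-in for a credential; real creds carry far more. */
struct demo_cred {
	unsigned int uid;
};

typedef int (*demo_policy_fn)(const struct demo_cred *watcher,
			      const struct demo_cred *trigger);

/* Assumed policy: deliver only when watcher and trigger share a UID. */
static int demo_policy_same_uid(const struct demo_cred *watcher,
				const struct demo_cred *trigger)
{
	return watcher->uid == trigger->uid ? 0 : -EACCES;
}

/* Assumed policy: always deliver (say, for hardware-triggered events). */
static int demo_policy_allow(const struct demo_cred *watcher,
			     const struct demo_cred *trigger)
{
	(void)watcher;
	(void)trigger;
	return 0;
}

/* One policy per notification type. */
static const demo_policy_fn demo_policies[DEMO_WATCH_TYPE__NR] = {
	[DEMO_WATCH_TYPE_KEY]    = demo_policy_same_uid,
	[DEMO_WATCH_TYPE_MOUNT]  = demo_policy_same_uid,
	[DEMO_WATCH_TYPE_DEVICE] = demo_policy_allow,
};

/* Returns 0 to allow delivery or a negative errno to refuse it. */
int demo_check_notification(enum demo_watch_type type,
			    const struct demo_cred *watcher,
			    const struct demo_cred *trigger)
{
	assert(watcher && trigger);
	if ((unsigned int)type >= DEMO_WATCH_TYPE__NR || !demo_policies[type])
		return -EINVAL;
	return demo_policies[type](watcher, trigger);
}
```

Which policy belongs to which type is exactly the open question in this
thread; the table above is a placeholder, not a proposal.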

Question for you: what does the LSM *actually* need?  There are a bunch of
things available, some of which may be the same thing:

 (1) The creds of the process that created a watch_queue (ie. opened
     /dev/watch_queue).

 (2) The creds of the process that set a watch (ie. called sb_notify,
     KEYCTL_NOTIFY, ...).

 (3) The creds of the process that tripped the event (which might be the
     system).

 (4) The security attributes of the object on which the watch was set (uid,
     gid, mode, labels).

 (5) The security attributes of the object on which the event was tripped.

 (6) The security attributes of all the objects between the object in (5) and
     the object in (4), assuming we work from (5) towards (4) if the two
     aren't coincident (WATCH_INFO_RECURSIVE).

At the moment, when post_one_notification() wants to write a notification into
a queue, it calls security_post_notification() to ask if it should be allowed
to do so.  This is passed (1) and (3) above plus the notification record.
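
To make the call flow concrete, here is a simplified userspace model in C of
the path just described.  It is a sketch, not the kernel code: the sketch_*
types, the fixed-size ring and the same-UID-or-system policy inside the hook
are all assumptions made for illustration.  Only the shape (the hook receives
the queue creator's creds, the trigger's creds and the record) follows the
description above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified credential; is_system marks events raised by the system. */
struct sketch_cred {
	unsigned int uid;
	bool is_system;
};

/* Simplified notification record. */
struct sketch_notification {
	unsigned int type;
	unsigned int subtype;
};

#define SKETCH_QUEUE_SLOTS 4

struct sketch_queue {
	const struct sketch_cred *cred;	/* (1) creds of the queue creator */
	struct sketch_notification ring[SKETCH_QUEUE_SLOTS];
	size_t count;
};

/*
 * Stand-in for the security hook.  It is handed (1), (3) and the record.
 * The decision made here is an assumed placeholder policy.
 */
static int sketch_security_post_notification(
		const struct sketch_cred *queue_cred,
		const struct sketch_cred *trigger_cred,
		const struct sketch_notification *n)
{
	(void)n;
	if (trigger_cred->is_system || trigger_cred->uid == queue_cred->uid)
		return 0;
	return -EACCES;
}

/* Returns true if the record was written into the queue. */
bool sketch_post_one_notification(struct sketch_queue *q,
				  const struct sketch_cred *trigger_cred,
				  const struct sketch_notification *n)
{
	assert(q && q->cred && trigger_cred && n);
	if (sketch_security_post_notification(q->cred, trigger_cred, n) != 0)
		return false;	/* refused: the watcher never sees it */
	if (q->count == SKETCH_QUEUE_SLOTS)
		return false;	/* queue full */
	q->ring[q->count++] = *n;
	return true;
}
```

Note that a refused record is dropped without any indication to the watcher,
which is the behaviour being argued over earlier in the thread.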

The only problem I really have is that for a destruction message you want to
get the creds of who did the last put on an object and caused it to be
destroyed - I think everything else probably gets the right creds, even if
they aren't even in the same namespaces (mount propagation, yuck).

However, that one is a biggie because close()/exit() must propagate it to
deferred-fput, which must propagate it to af_unix-cleanup, and thence back to
deferred-fput and thence to implicit unmount (dissolve_on_fput()[*]).

[*] Though it should be noted that if this happens, the subtree cannot be
    attached to the root of a namespace.

> > In any case, that's what I was referring to when I said I might need to call
> > inode_permission().  But UIDs don't exist for all filesystems, for example,
> > and there are no UIDs on superblocks, mount objects or hardware events.
> 
> If you open() or stat() a file on those filesystems the UID
> used in the access control comes from somewhere. Setting a watch
> on things with UIDs should use the access mode on the file,
> just like any other filesystem operation.

Another question for you: Do I need to let the LSM pass judgement on a watch
that a process is trying to set?  I think I probably do.  This would require
separate hooks for different object types:

	int security_watch_key(struct watch *watch, struct key *key);
	int security_watch_sb(struct watch *watch, struct path *path);
	int security_watch_mount(struct watch *watch, struct path *path);
	int security_watch_devices(struct watch *watch);

so that the LSM can see the object the watch is being placed on (the last has
a global queue, so there is no object).  

Further, do I need to put a "void *security" pointer in struct watch and
indicate to the LSM the object being watched?  The watch could then be passed
to security_post_notification() instead of the watch queue creds (which I
could then dispense with).

	security_post_notification(const struct watch *watch,
				   const struct cred *trigger_cred,
				   struct watch_notification *n);
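
A possible shape for that, modelled in userspace C.  This is only a sketch
under assumptions: the toy_* names, the contents of the blob and the UID
comparisons are hypothetical.  The one idea taken from the text above is
that the LSM hangs its own state off the watch when the watch is set and
reads it back when a notification is about to be posted.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_cred {
	unsigned int uid;
};

struct toy_notification {
	unsigned int type;
};

/* The watch carries an opaque pointer owned by the LSM. */
struct toy_watch {
	void *security;
};

/* What a hypothetical LSM might record about the watch. */
struct toy_watch_blob {
	unsigned int watcher_uid;
	unsigned int object_uid;
};

/*
 * Called when a watch is set.  The caller supplies the blob storage to keep
 * the sketch allocation-free.  Assumed rule: the watcher must own the object.
 */
int toy_security_watch_set(struct toy_watch *watch,
			   struct toy_watch_blob *blob,
			   const struct toy_cred *watcher,
			   unsigned int object_uid)
{
	assert(watch && blob && watcher);
	if (watcher->uid != object_uid)
		return -EACCES;
	blob->watcher_uid = watcher->uid;
	blob->object_uid = object_uid;
	watch->security = blob;
	return 0;
}

/*
 * Called before posting.  It takes the watch rather than the queue creds.
 * Assumed rule: the trigger must match the UID recorded at watch time.
 */
int toy_security_post_notification(const struct toy_watch *watch,
				   const struct toy_cred *trigger_cred,
				   const struct toy_notification *n)
{
	const struct toy_watch_blob *blob;

	assert(watch && trigger_cred && n);
	blob = watch->security;
	if (!blob)
		return -EACCES;	/* no LSM state: refuse in this sketch */
	return trigger_cred->uid == blob->watcher_uid ? 0 : -EACCES;
}
```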


Also, should I let the LSM audit/edit the filter set by
IOC_WATCH_QUEUE_SET_FILTER?  Userspace can't retrieve the filter, so the LSM
could edit it to exclude certain things.  That might be a bit too complicated,
though.

> Things like superblocks are stickier because we don't generally
> think of them as objects. If you can do statfs(), you should be
> able to set a watch on the filesystem metadata.
> 
> How would you specify a watch for a hardware event? If you say
> you have to open /dev/mumble to set a watch for mumbles, you're
> good there, too.

That's not how that works at the moment.  There's a global watch list for
device events.  I've repurposed it to carry any device's events - so it will
carry blockdev events (I/O errors only at the moment) and usb events
(add/remove device, add/remove bus, reset device at the moment).

> > Now, I could see that you ignore UIDs on things like keys and
> > hardware-triggered events, but how does this interact with things like mount
> > watches that see directories that have UIDs?
> >
> > Are you advocating making it such that process B can only see events
> > triggered by process A if they have the same UID, for example?
> 
> It's always seemed arbitrary to me that you can't open your process up to
> get signals from other users. What about putting mode bits on your ring
> buffer? By default you could only accept your own events, but you could do a
> rb_chmod(0222) and let all events through.

Ummm...  This mechanism is pretty much about events generated by others.
Depending on what you mean by 'you' and 'your own events', it might be considered
that you would know what events you were directly causing and wouldn't need a
notification system for it.

> Subject to LSM additional restrictions, of course. That would require the cred
> of the process that triggered the event or a system cred for "hardware"
> events.  If you don't like mode bits you could use an ACL for fine
> granularity or a single "let'em all in" bit for coarse.

I'm not entirely sure how an ACL would help.  If someone creates a watch
queue, sets an ACL with only a "let everything in" ACE, we're back to the
situation we're in now.

As I understand it, the issue you have is stopping them getting events that
they're willing to accept that you think they shouldn't be allowed.

> I'm not against access, I'm against uncontrolled access in conflict with
> basic system policy.

David