
[RFC,v2,19/19] ima: Setup securityfs for IMA namespace

Message ID 20211203023118.1447229-20-stefanb@linux.ibm.com (mailing list archive)
State New, archived
Series ima: Namespace IMA with audit support in IMA-ns

Commit Message

Stefan Berger Dec. 3, 2021, 2:31 a.m. UTC
Set up securityfs with symlinks, directories, and files for IMA
namespacing support. The same directory structure that IMA uses on the
host is also created for the namespacing case.

Increment the user namespace's refcount_teardown value by '1' once
securityfs has been successfully set up, since the initialization of the
filesystem causes an additional reference to the user namespace to be
taken. The early teardown function will delete the file system and release
the additional reference.

The securityfs file and directory ownerships cannot be set when the
IMA namespace is initialized. Therefore, delay the setup of the file
system to a later point when securityfs initializes the fs_context.

This filesystem can now be mounted as follows:

mount -t securityfs /sys/kernel/security/ /sys/kernel/security/

The following directories, symlinks, and files are then available.

$ ls -l sys/kernel/security/
total 0
lr--r--r--. 1 root root 0 Dec  2 00:18 ima -> integrity/ima
drwxr-xr-x. 3 root root 0 Dec  2 00:18 integrity

$ ls -l sys/kernel/security/ima/
total 0
-r--r-----. 1 root root 0 Dec  2 00:18 ascii_runtime_measurements
-r--r-----. 1 root root 0 Dec  2 00:18 binary_runtime_measurements
-rw-------. 1 root root 0 Dec  2 00:18 policy
-r--r-----. 1 root root 0 Dec  2 00:18 runtime_measurements_count
-r--r-----. 1 root root 0 Dec  2 00:18 violations

Signed-off-by: Stefan Berger <stefanb@linux.ibm.com>
---
 include/linux/ima.h                      |  17 +++
 security/inode.c                         |   8 ++
 security/integrity/ima/ima.h             |   2 +
 security/integrity/ima/ima_fs.c          | 157 ++++++++++++++++++++++-
 security/integrity/ima/ima_init_ima_ns.c |   6 +-
 security/integrity/ima/ima_ns.c          |   2 +
 6 files changed, 189 insertions(+), 3 deletions(-)

Comments

Stefan Berger Dec. 3, 2021, 3:07 p.m. UTC | #1
On 12/2/21 21:31, Stefan Berger wrote:
>   extern struct ima_namespace init_ima_ns;
> diff --git a/security/inode.c b/security/inode.c
> index 2738a7b31469..6223f1d838f6 100644
> --- a/security/inode.c
> +++ b/security/inode.c
> @@ -22,6 +22,7 @@
>   #include <linux/lsm_hooks.h>
>   #include <linux/magic.h>
>   #include <linux/user_namespace.h>
> +#include <linux/ima.h>
>   
>   static struct vfsmount *securityfs_mount;
>   static int securityfs_mount_count;
> @@ -63,6 +64,13 @@ static const struct fs_context_operations securityfs_context_ops = {
>   
>   static int securityfs_init_fs_context(struct fs_context *fc)
>   {
> +	int rc;
> +
> +	if (fc->user_ns->ima_ns->late_fs_init) {
> +		rc = fc->user_ns->ima_ns->late_fs_init(fc->user_ns);
> +		if (rc)
> +			return rc;
> +	}
>   	fc->ops = &securityfs_context_ops;
>   	return 0;
>   }


The kernel test robot made me change it to this:

static int securityfs_init_fs_context(struct fs_context *fc)
{
         fc->ops = &securityfs_context_ops;

         return ima_ns_late_fs_init(fc->user_ns);
}

It relies on this helper, which is used when CONFIG_IMA_NS is defined:

static inline int ima_ns_late_fs_init(struct user_namespace *user_ns)
{
         struct ima_namespace *ns = user_ns->ima_ns;

         if (ns->late_fs_init)
                 return ns->late_fs_init(ns);

         return 0;
}
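
For readers unfamiliar with the pattern, here is a minimal userspace sketch of a config-gated helper with an optional hook. The struct layouts, the demo hook, and the stub for the case where CONFIG_IMA_NS is not defined are assumptions for illustration; only the defined case appears above. The sketch passes the user namespace to the hook, as the quoted hunk does, which is also an assumption about the final signature.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for the kernel structures. */
struct user_namespace;

struct ima_namespace {
	/* Optional hook; NULL until the namespace has been set up. */
	int (*late_fs_init)(struct user_namespace *user_ns);
};

struct user_namespace {
	struct ima_namespace *ima_ns;
};

#define CONFIG_IMA_NS 1	/* remove this line to build the stub variant */

#ifdef CONFIG_IMA_NS
static inline int ima_ns_late_fs_init(struct user_namespace *user_ns)
{
	struct ima_namespace *ns;

	assert(user_ns != NULL && user_ns->ima_ns != NULL);
	ns = user_ns->ima_ns;

	/* Only call the hook if one has been installed. */
	if (ns->late_fs_init)
		return ns->late_fs_init(user_ns);

	return 0;
}
#else
/* Assumed stub: without IMA namespacing there is nothing to initialize. */
static inline int ima_ns_late_fs_init(struct user_namespace *user_ns)
{
	(void)user_ns;
	return 0;
}
#endif

/* Demo hook that only counts how often it was invoked. */
static int demo_calls;

static int demo_late_fs_init(struct user_namespace *user_ns)
{
	(void)user_ns;
	demo_calls++;
	return 0;
}
```

With the hook left NULL the helper returns 0 without doing anything; once demo_late_fs_init is installed, each call to the helper invokes it exactly once.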

    Stefan
James Bottomley Dec. 3, 2021, 5:03 p.m. UTC | #2
On Thu, 2021-12-02 at 21:31 -0500, Stefan Berger wrote:
[...]
>  static int securityfs_init_fs_context(struct fs_context *fc)
>  {
> +	int rc;
> +
> +	if (fc->user_ns->ima_ns->late_fs_init) {
> +		rc = fc->user_ns->ima_ns->late_fs_init(fc->user_ns);
> +		if (rc)
> +			return rc;
> +	}
>  	fc->ops = &securityfs_context_ops;
>  	return 0;
>  }

I know I suggested this, but to get this to work in general, it's going
to have to not be specific to IMA, so it's going to have to become
something generic like a notifier chain.  The other problem is it's
only working still by accident:

> +int ima_fs_ns_init(struct ima_namespace *ns)
> +{
> +	ns->mount = securityfs_ns_create_mount(ns->user_ns);

This actually triggers on the call to securityfs_init_fs_context, but
nothing happens because the callback is null.  Every subsequent use of
fscontext will trigger this.  The point of a keyed superblock is that
fill_super is only called once per key, that's the place we should be
doing this.   It should also probably be a blocking notifier so any
consumer of securityfs can be namespaced by registering for this
notifier.

> +	if (IS_ERR(ns->mount)) {
> +		ns->mount = NULL;
> +		return -1;
> +	}
> +	ns->mount_count = 1;

This is a bit nasty, too: we're spilling the guts of mount count
tracking into IMA instead of encapsulating it inside securityfs.

> +
> +	/* Adjust the trigger for user namespace's early teardown of dependent
> +	 * namespaces. Due to the filesystem there's an additional reference
> +	 * to the user namespace.
> +	 */
> +	ns->user_ns->refcount_teardown += 1;
> +
> +	ns->late_fs_init = ima_fs_ns_late_init;
> +
> +	return 0;
> +}

I think what should be happening is that we shouldn't do the
simple_pin_fs, which creates the inodes, ahead of time; we should do it
inside fill_super using a notifier, meaning it gets called once per
key, creates the root dentry then triggers the notifier which
instantiates all the namespaced entries.  We can still use
simple_pin_fs for this because there's no locking across fill_super. 
This would mean fill_super would be called the first time the
securityfs is mounted inside the namespace.

If we do it this way, we can now make securityfs have its own mount and
mount_count inside the user namespace, which it uses internally to the
securityfs code, thus avoiding exposing them to ima or any other
namespaced consumer.

I also think we now don't need the securityfs_ns_ duplicated functions
because the callback via the notifier chain now ensures we can use the
namespace they were created in to distinguish between non namespaced
and namespaced entries.

So non-namespaced consumers of securityfs would do what they do now
(calling the securityfs_create on initialization) and namespaced
consumers would register a callback on the notifier which would get
called once for every namespace the securityfs gets mounted in.

I also theorize if we do it with notifiers, we could have a notifier on
kill_sb to tear down all the entries.  If we do this, I think we don't
have to pin any more.
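
To make the idea concrete, here is a rough userspace sketch of "fill_super fires a notifier once per key". It does not use the kernel's notifier API; every name in it (ns_notifier, register_ns_notifier, sketch_fill_super, sketch_mount, demo_populate) is hypothetical, and the keyed-superblock lookup is reduced to a small array of keys.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_KEYS 8	/* arbitrary bound for this sketch */

/* Hypothetical, simplified notifier block (not the kernel's struct). */
struct ns_notifier {
	int (*call)(struct ns_notifier *nb, const void *ns_key);
	struct ns_notifier *next;
	int hits;	/* how many namespaces this consumer populated */
};

static struct ns_notifier *chain_head;
static const void *filled_keys[MAX_KEYS];	/* keys that already have a superblock */
static size_t nr_filled;

/* Namespaced consumers register here instead of creating entries early. */
static void register_ns_notifier(struct ns_notifier *nb)
{
	assert(nb != NULL && nb->call != NULL);
	nb->next = chain_head;
	chain_head = nb;
}

/* Stand-in for fill_super: runs the chain so consumers can add entries. */
static int sketch_fill_super(const void *ns_key)
{
	struct ns_notifier *nb;
	int rc;

	for (nb = chain_head; nb; nb = nb->next) {
		rc = nb->call(nb, ns_key);
		if (rc)
			return rc;
	}
	return 0;
}

/* Stand-in for a mount: fill_super runs only for a key not seen before. */
static int sketch_mount(const void *ns_key)
{
	size_t i;

	for (i = 0; i < nr_filled; i++)
		if (filled_keys[i] == ns_key)
			return 0;	/* superblock for this key already exists */

	if (nr_filled == MAX_KEYS)
		return -1;

	filled_keys[nr_filled++] = ns_key;
	return sketch_fill_super(ns_key);
}

/* Demo consumer callback: pretend to create this namespace's entries. */
static int demo_populate(struct ns_notifier *nb, const void *ns_key)
{
	(void)ns_key;
	nb->hits++;
	return 0;
}
```

Mounting twice with the same key runs each registered callback once; mounting with a second key runs each of them once more. Non-namespaced consumers would not register at all and would keep creating their entries at initialization.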

James
Stefan Berger Dec. 3, 2021, 6:06 p.m. UTC | #3
On 12/3/21 12:03, James Bottomley wrote:
> On Thu, 2021-12-02 at 21:31 -0500, Stefan Berger wrote:
> [...]
>>   static int securityfs_init_fs_context(struct fs_context *fc)
>>   {
>> +	int rc;
>> +
>> +	if (fc->user_ns->ima_ns->late_fs_init) {
>> +		rc = fc->user_ns->ima_ns->late_fs_init(fc->user_ns);
>> +		if (rc)
>> +			return rc;
>> +	}
>>   	fc->ops = &securityfs_context_ops;
>>   	return 0;
>>   }
> I know I suggested this, but to get this to work in general, it's going
> to have to not be specific to IMA, so it's going to have to become
> something generic like a notifier chain.  The other problem is it's
> only working still by accident:

I had thought about this also but the rationale was:

securityfs is compiled due to CONFIG_IMA_NS and the user namespace 
exists there and that has a pointer now to ima_namespace, which can have 
that callback. I assumed that other namespaced subsystems could also be 
reached then via such a callback, but I don't know.

I suppose any late filesystem init callchain would have to be connected 
to the user_namespace somehow?


>
>> +int ima_fs_ns_init(struct ima_namespace *ns)
>> +{
>> +	ns->mount = securityfs_ns_create_mount(ns->user_ns);
> This actually triggers on the call to securityfs_init_fs_context, but
> nothing happens because the callback is null.  Every subsequent use of
> fscontext will trigger this.  The point of a keyed superblock is that
> fill_super is only called once per key, that's the place we should be
> doing this.   It should also probably be a blocking notifier so any
> consumer of securityfs can be namespaced by registering for this
> notifier.


What I don't like about the fill_super is that it gets called too early:

[   67.058611] securityfs_ns_create_mount @ 102 target user_ns: ffff95c010698c80; nr_extents: 0
[   67.059836] securityfs_fill_super @ 47  user_ns: ffff95c010698c80; nr_extents: 0

We are switching to the target user namespace in 
securityfs_ns_create_mount. The expected nr_extents at this point is 0, 
since user_ns hasn't been configured yet. But then securityfs_fill_super
is also called with nr_extents 0. We cannot use that; it's too early!


>
>> +	if (IS_ERR(ns->mount)) {
>> +		ns->mount = NULL;
>> +		return -1;
>> +	}
>> +	ns->mount_count = 1;
> This is a bit nasty, too: we're spilling the guts of mount count
> tracking into IMA instead of encapsulating it inside securityfs.


Ok, I can make this disappear.


>
>> +
>> +	/* Adjust the trigger for user namespace's early teardown of dependent
>> +	 * namespaces. Due to the filesystem there's an additional reference
>> +	 * to the user namespace.
>> +	 */
>> +	ns->user_ns->refcount_teardown += 1;
>> +
>> +	ns->late_fs_init = ima_fs_ns_late_init;
>> +
>> +	return 0;
>> +}
> I think what should be happening is that we shouldn't do the
> simple_pin_fs, which creates the inodes, ahead of time; we should do it
> inside fill_super using a notifier, meaning it gets called once per

fill_super would only work for the init_user_ns from what I can see.


> key, creates the root dentry then triggers the notifier which
> instantiates all the namespaced entries.  We can still use
> simple_pin_fs for this because there's no locking across fill_super.
> This would mean fill_super would be called the first time the
> securityfs is mounted inside the namespace.


I guess I would need to know how fill_super would work or how it could 
be called late/delayed as well.


>
> If we do it this way, we can now make securityfs have its own mount and
> mount_count inside the user namespace, which it uses internally to the
> securityfs code, thus avoiding exposing them to ima or any other
> namespaced consumer.
>
> I also think we now don't need the securityfs_ns_ duplicated functions
> because the callback via the notifier chain now ensures we can use the
> namespace they were created in to distinguish between non namespaced
> and namespaced entries.

Is there then no need to pass a separate vfsmount * in anymore? Where 
would the vfsmount pointer reside? For now it's in ima_namespace, but it 
sounds like it should be in a more centralized place? Should it also be 
connected to the user_namespace so we can pick it up using get_user_ns()?


>
> So non-namespaced consumers of securityfs would do what they do now
> (calling the securityfs_create on initialization) and namespaced
> consumers would register a callback on the notifier which would get
> called once for every namespace the securityfs gets mounted in.
>
> I also theorize if we do it with notifiers, we could have a notifier on
> kill_sb to tear down all the entries.  If we do this, I think we don't
> have to pin any more.
>
> James
>
>

diff --git a/security/inode.c b/security/inode.c
index ed5f1c533776..49c9839642ed 100644
--- a/security/inode.c
+++ b/security/inode.c
@@ -44,6 +44,8 @@ static int securityfs_fill_super(struct super_block *sb, struct fs_context *fc)
         static const struct tree_descr files[] = {{""}};
         int error;

+       printk(KERN_INFO "%s @ %u  user_ns: %px; nr_extents: %d\n", __func__, __LINE__, fc->user_ns, fc->user_ns->uid_map.nr_extents);
+
         error = simple_fill_super(sb, SECURITYFS_MAGIC, files);
         if (error)
                 return error;
@@ -97,6 +99,8 @@ struct vfsmount *securityfs_ns_create_mount(struct user_namespace *user_ns)
         put_user_ns(fc->user_ns);
         fc->user_ns = get_user_ns(user_ns);

+       printk(KERN_INFO "%s @ %u target user_ns: %px; nr_extents: %d\n", __func__, __LINE__, fc->user_ns, fc->user_ns->uid_map.nr_extents);
+
         mnt = fc_mount(fc);
         put_fs_context(fc);
         return mnt;
James Bottomley Dec. 3, 2021, 6:50 p.m. UTC | #4
On Fri, 2021-12-03 at 13:06 -0500, Stefan Berger wrote:
> On 12/3/21 12:03, James Bottomley wrote:
> > On Thu, 2021-12-02 at 21:31 -0500, Stefan Berger wrote:
> > [...]
> > >   static int securityfs_init_fs_context(struct fs_context *fc)
> > >   {
> > > +	int rc;
> > > +
> > > +	if (fc->user_ns->ima_ns->late_fs_init) {
> > > +		rc = fc->user_ns->ima_ns->late_fs_init(fc->user_ns);
> > > +		if (rc)
> > > +			return rc;
> > > +	}
> > >   	fc->ops = &securityfs_context_ops;
> > >   	return 0;
> > >   }
> > I know I suggested this, but to get this to work in general, it's
> > going to have to not be specific to IMA, so it's going to have to
> > become something generic like a notifier chain.  The other problem
> > is it's only working still by accident:
> 
> I had thought about this also but the rationale was:
> 
> securityfs is compiled due to CONFIG_IMA_NS and the user namespace 
> exists there and that has a pointer now to ima_namespace, which can
> have that callback. I assumed that other namespaced subsystems could
> also be  reached then via such a callback, but I don't know.

Well securityfs is supposed to exist for LSMs.  At some point each of
those is going to need to be namespaced, which may eventually be quite
a pile of callbacks, which is why I thought of a notifier.

> I suppose any late filesystem init callchain would have to be
> connected to the user_namespace somehow?

I don't think so; I think just moving some securityfs entries into the
user_namespace and managing the notifier chain from within securityfs
will do for now.  [although I'd have to spec this out in code before I
knew for sure].

> > > +int ima_fs_ns_init(struct ima_namespace *ns)
> > > +{
> > > +	ns->mount = securityfs_ns_create_mount(ns->user_ns);
> > This actually triggers on the call to securityfs_init_fs_context,
> > but nothing happens because the callback is null.  Every subsequent
> > > use of fscontext will trigger this.  The point of a keyed superblock
> > > is that fill_super is only called once per key, that's the place we
> > > should be doing this.   It should also probably be a blocking
> > > notifier so any consumer of securityfs can be namespaced by
> > registering for this notifier.
> 
> What I don't like about the fill_super is that it gets called too
> early:
> 
> [   67.058611] securityfs_ns_create_mount @ 102 target user_ns: ffff95c010698c80; nr_extents: 0
> [   67.059836] securityfs_fill_super @ 47  user_ns: ffff95c010698c80; nr_extents: 0

Right, it's being activated by securityfs_ns_create_mount which is
called as soon as the user_ns is created.

> We are switching to the target user namespace in 
> securityfs_ns_create_mount. The expected nr_extents at this point is
> 0, since user_ns hasn't been configured yet. But then
> securityfs_fill_super is also called with nr_extents 0. We cannot use
> that, it's too early!

Exactly, so I was thinking of not having a securityfs_ns_create_mount
at all.  All the securityfs_ns_create.. calls would be in the notifier
call chain. This means there's nothing to fill the superblock until an
actual mount on it is called.

> > > +	if (IS_ERR(ns->mount)) {
> > > +		ns->mount = NULL;
> > > +		return -1;
> > > +	}
> > > +	ns->mount_count = 1;
> > This is a bit nasty, too: we're spilling the guts of mount count
> > tracking into IMA instead of encapsulating it inside securityfs.
> 
> Ok, I can make this disappear.
> 
> 
> > > +
> > > +	/* Adjust the trigger for user namespace's early teardown of dependent
> > > +	 * namespaces. Due to the filesystem there's an additional reference
> > > +	 * to the user namespace.
> > > +	 */
> > > +	ns->user_ns->refcount_teardown += 1;
> > > +
> > > +	ns->late_fs_init = ima_fs_ns_late_init;
> > > +
> > > +	return 0;
> > > +}
> > > I think what should be happening is that we shouldn't do the
> > simple_pin_fs, which creates the inodes, ahead of time; we should
> > do it inside fill_super using a notifier, meaning it gets called
> > once per
> 
> fill_super would only work for the init_user_ns from what I can see.
> 
> 
> > key, creates the root dentry then triggers the notifier which
> > instantiates all the namespaced entries.  We can still use
> > simple_pin_fs for this because there's no locking across
> > fill_super.
> > This would mean fill_super would be called the first time the
> > securityfs is mounted inside the namespace.
> 
> I guess I would need to know how fill_super would work or how it
> could be called late/delayed as well.

So it would be called early in the init_user_ns by non-namespaced
consumers of securityfs, like it is now.

Namespaced consumers wouldn't call any securityfs_ns_create callbacks
to create dentries until they were notified from the fill_super
notifier, which would now only be triggered on first mount of
securityfs inside the namespace.

> > If we do it this way, we can now make securityfs have its own mount
> > and mount_count inside the user namespace, which it uses internally
> > to the securityfs code, thus avoiding exposing them to ima or any
> > other namespaced consumer.
> > 
> > I also think we now don't need the securityfs_ns_ duplicated
> > functions because the callback via the notifier chain now ensures
> > > we can use the namespace they were created in to distinguish between
> > non namespaced and namespaced entries.
> 
> Is there then no need to pass a separate vfsmount * in anymore? 

I don't think so no.  It could be entirely managed internally to
securityfs.

> Where would the vfsmount pointer reside? For now it's in
> ima_namespace, but it sounds like it should be in a more centralized
> place? Should it also be  connected to the user_namespace so we can
> pick it up using get_user_ns()?

exactly.  I think struct user_namespace should have two elements gated
by a #ifdef CONFIG_SECURITYFS which are the vfsmount and the
mount_count for passing into simple_pin_fs.
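
As a rough illustration of what that could look like, here is a userspace sketch with the two fields gated by the config symbol and a pin/release pair that mimics the counting behaviour of simple_pin_fs and simple_release_fs. The field names securityfs_mount and securityfs_mount_count, the helpers sketch_pin_fs and sketch_release_fs, and the malloc-backed placeholder mount are all assumptions for illustration, not the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CONFIG_SECURITYFS 1

struct vfsmount { int placeholder; };	/* opaque stand-in */

struct user_namespace {
	int level;
#ifdef CONFIG_SECURITYFS
	struct vfsmount *securityfs_mount;	/* hypothetical field name */
	int securityfs_mount_count;		/* hypothetical field name */
#endif
};

/* Mimics pin semantics: create the mount on first use, then only count. */
static int sketch_pin_fs(struct vfsmount **mount, int *count)
{
	assert(mount != NULL && count != NULL);

	if (*mount == NULL) {
		*mount = malloc(sizeof(**mount));
		if (*mount == NULL)
			return -1;	/* the kernel would return -ENOMEM */
	}
	(*count)++;
	return 0;
}

/* Mimics release semantics: drop the mount when the last pin goes away. */
static void sketch_release_fs(struct vfsmount **mount, int *count)
{
	assert(mount != NULL && count != NULL && *count > 0);

	if (--(*count) == 0) {
		free(*mount);
		*mount = NULL;
	}
}
```

Because both values live in the user namespace, a consumer such as IMA would never see them; securityfs would pass their addresses to the pin and release helpers internally.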


James
Stefan Berger Dec. 3, 2021, 7:11 p.m. UTC | #5
On 12/3/21 13:50, James Bottomley wrote:
> On Fri, 2021-12-03 at 13:06 -0500, Stefan Berger wrote:
>> On 12/3/21 12:03, James Bottomley wrote:
>>> On Thu, 2021-12-02 at 21:31 -0500, Stefan Berger wrote:
>>> [...]
>>>>    static int securityfs_init_fs_context(struct fs_context *fc)
>>>>    {
>>>> +	int rc;
>>>> +
>>>> +	if (fc->user_ns->ima_ns->late_fs_init) {
>>>> +		rc = fc->user_ns->ima_ns->late_fs_init(fc->user_ns);
>>>> +		if (rc)
>>>> +			return rc;
>>>> +	}
>>>>    	fc->ops = &securityfs_context_ops;
>>>>    	return 0;
>>>>    }
>>> I know I suggested this, but to get this to work in general, it's
>>> going to have to not be specific to IMA, so it's going to have to
>>> become something generic like a notifier chain.  The other problem
>>> is it's only working still by accident:
>> I had thought about this also but the rationale was:
>>
>> securityfs is compiled due to CONFIG_IMA_NS and the user namespace
>> exists there and that has a pointer now to ima_namespace, which can
>> have that callback. I assumed that other namespaced subsystems could
>> also be  reached then via such a callback, but I don't know.
> Well securityfs is supposed to exist for LSMs.  At some point each of
> those is going to need to be namespaced, which may eventually be quite
> a pile of callbacks, which is why I thought of a notifier.
>
>> I suppose any late filesystem init callchain would have to be
>> connected to the user_namespace somehow?
> I don't think so; I think just moving some securityfs entries into the
> user_namespace and managing the notifier chain from within securityfs
> will do for now.  [although I'd have to spec this out in code before I
> knew for sure].

It doesn't have to be right in the user_namespace. The IMA namespace is 
connected to the user namespace and holds the dentries now...

Please spec it out...


>
>>>> +int ima_fs_ns_init(struct ima_namespace *ns)
>>>> +{
>>>> +	ns->mount = securityfs_ns_create_mount(ns->user_ns);
>>> This actually triggers on the call to securityfs_init_fs_context,
>>> but nothing happens because the callback is null.  Every subsequent
>>> use of fscontext will trigger this.  The point of a keyed superblock
>>> is that fill_super is only called once per key, that's the place we
>>> should be doing this.   It should also probably be a blocking
>>> notifier so any consumer of securityfs can be namespaced by
>>> registering for this notifier.
>> What I don't like about the fill_super is that it gets called too
>> early:
>>
>> [   67.058611] securityfs_ns_create_mount @ 102 target user_ns: ffff95c010698c80; nr_extents: 0
>> [   67.059836] securityfs_fill_super @ 47  user_ns: ffff95c010698c80; nr_extents: 0
> Right, it's being activated by securityfs_ns_create_mount which is
> called as soon as the user_ns is created.

Well, that doesn't help us then...


>> We are switching to the target user namespace in
>> securityfs_ns_create_mount. The expected nr_extents at this point is
>> 0, since user_ns hasn't been configured yet. But then
>> securityfs_fill_super is also called with nr_extents 0. We cannot use
>> that, it's too early!
> Exactly, so I was thinking of not having a securityfs_ns_create_mount
> at all.  All the securityfs_ns_create.. calls would be in the notifier

But we need to somehow have a call to get_tree_keyed() and have that 
user namespace switched out. I don't know how else to do this other than 
having some function that does that and that is now called 
securityfs_ns_create_mount().

get_tree_keyed() will also call the fill_super() which is called when 
securityfs_ns_create_mount() is called.

[  196.739071] ima_fs_ns_init @ 639 before securityfs_ns_create_mount()
[  196.740426] securityfs_init_fs_context @ 72  user_ns: ffffffff98a3cc60; nr_extents: 1
[  196.741519] securityfs_ns_create_mount @ 105 target user_ns: ffff9e239753eb80; nr_extents: 0
[  196.742657] securityfs_get_tree @ 60 before get_tree_keyed()
[  196.743418] securityfs_fill_super @ 47  user_ns: ffff9e239753eb80; nr_extents: 0
[  196.744467] ima_fs_ns_init @ 641 after securityfs_ns_create_mount()
[  196.745304] ima: Allocated hash algorithm: sha256
[  196.757650] securityfs_init_fs_context @ 72  user_ns: ffff9e239753eb80; nr_extents: 1
[  196.758759] securityfs_get_tree @ 60 before get_tree_keyed()

You said it works by 'accident'. I know it works because the function 
securityfs_init_fs_context() that now populates the filesystem via the 
late_fs_init() is getting called twice. Does 'accident' here mean the 
call sequence could change?


>
>> Where would the vfsmount pointer reside? For now it's in
>> ima_namespace, but it sounds like it should be in a more centralized
>> place? Should it also be  connected to the user_namespace so we can
>> pick it up using get_user_ns()?
> exactly.  I think struct user_namespace should have two elements gated
> by a #ifdef CONFIG_SECURITYFS which are the vfsmount and the
> mount_count for passing into simple_pin_fs.

Also that we can do for as long as it flies beyond the conversation 
here... :-) Anyone else have an opinion ?

   Stefan


>
> James
>
>
Casey Schaufler Dec. 3, 2021, 7:37 p.m. UTC | #6
On 12/3/2021 10:50 AM, James Bottomley wrote:
> On Fri, 2021-12-03 at 13:06 -0500, Stefan Berger wrote:
>> On 12/3/21 12:03, James Bottomley wrote:
>>> On Thu, 2021-12-02 at 21:31 -0500, Stefan Berger wrote:
>>> [...]
>>>>    static int securityfs_init_fs_context(struct fs_context *fc)
>>>>    {
>>>> +	int rc;
>>>> +
>>>> +	if (fc->user_ns->ima_ns->late_fs_init) {
>>>> +		rc = fc->user_ns->ima_ns->late_fs_init(fc->user_ns);
>>>> +		if (rc)
>>>> +			return rc;
>>>> +	}
>>>>    	fc->ops = &securityfs_context_ops;
>>>>    	return 0;
>>>>    }
>>> I know I suggested this, but to get this to work in general, it's
>>> going to have to not be specific to IMA, so it's going to have to
>>> become something generic like a notifier chain.  The other problem
>>> is it's only working still by accident:
>> I had thought about this also but the rationale was:
>>
>> securityfs is compiled due to CONFIG_IMA_NS and the user namespace
>> exists there and that has a pointer now to ima_namespace, which can
>> have that callback. I assumed that other namespaced subsystems could
>> also be  reached then via such a callback, but I don't know.
> Well securityfs is supposed to exist for LSMs.  At some point each of
> those is going to need to be namespaced, which may eventually be quite
> a pile of callbacks, which is why I thought of a notifier.

While AppArmor, lockdown and the integrity family use securityfs,
SELinux and Smack do not. They have their own independent filesystems.
Implementations of namespacing for each of SELinux and Smack have been
proposed, but nothing has been adopted. It would be really handy to
namespace the infrastructure rather than each individual LSM, but I
fear that's a bigger project than anyone will be taking on any time
soon. It's likely to encounter many of the same issues that I've been
dealing with for module stacking.

>
>> I suppose any late filesystem init callchain would have to be
>> connected to the user_namespace somehow?
> I don't think so; I think just moving some securityfs entries into the
> user_namespace and managing the notifier chain from within securityfs
> will do for now.  [although I'd have to spec this out in code before I
> knew for sure].
>
>>>> +int ima_fs_ns_init(struct ima_namespace *ns)
>>>> +{
>>>> +	ns->mount = securityfs_ns_create_mount(ns->user_ns);
>>> This actually triggers on the call to securityfs_init_fs_context,
>>> but nothing happens because the callback is null.  Every subsequent
>>> use of fscontext will trigger this.  The point of a keyed superblock
>>> is that fill_super is only called once per key, that's the place we
>>> should be doing this.   It should also probably be a blocking
>>> notifier so any consumer of securityfs can be namespaced by
>>> registering for this notifier.
>> What I don't like about the fill_super is that it gets called too
>> early:
>>
>> [   67.058611] securityfs_ns_create_mount @ 102 target user_ns: ffff95c010698c80; nr_extents: 0
>> [   67.059836] securityfs_fill_super @ 47  user_ns: ffff95c010698c80; nr_extents: 0
> Right, it's being activated by securityfs_ns_create_mount which is
> called as soon as the user_ns is created.
>
>> We are switching to the target user namespace in
>> securityfs_ns_create_mount. The expected nr_extents at this point is
>> 0, since user_ns hasn't been configured yet. But then
>> securityfs_fill_super is also called with nr_extents 0. We cannot use
>> that, it's too early!
> Exactly, so I was thinking of not having a securityfs_ns_create_mount
> at all.  All the securityfs_ns_create.. calls would be in the notifier
> call chain. This means there's nothing to fill the superblock until an
> actual mount on it is called.
>
>>>> +	if (IS_ERR(ns->mount)) {
>>>> +		ns->mount = NULL;
>>>> +		return -1;
>>>> +	}
>>>> +	ns->mount_count = 1;
>>> This is a bit nasty, too: we're spilling the guts of mount count
>>> tracking into IMA instead of encapsulating it inside securityfs.
>> Ok, I can make this disappear.
>>
>>
>>>> +
>>>> +	/* Adjust the trigger for user namespace's early teardown of dependent
>>>> +	 * namespaces. Due to the filesystem there's an additional reference
>>>> +	 * to the user namespace.
>>>> +	 */
>>>> +	ns->user_ns->refcount_teardown += 1;
>>>> +
>>>> +	ns->late_fs_init = ima_fs_ns_late_init;
>>>> +
>>>> +	return 0;
>>>> +}
>>> I think what should be happening is that we shouldn't do the
>>> simple_pin_fs, which creates the inodes, ahead of time; we should
>>> do it inside fill_super using a notifier, meaning it gets called
>>> once per
>> fill_super would only work for the init_user_ns from what I can see.
>>
>>
>>> key, creates the root dentry then triggers the notifier which
>>> instantiates all the namespaced entries.  We can still use
>>> simple_pin_fs for this because there's no locking across
>>> fill_super.
>>> This would mean fill_super would be called the first time the
>>> securityfs is mounted inside the namespace.
>> I guess I would need to know how fill_super would work or how it
>> could be called late/delayed as well.
> So it would be called early in the init_user_ns by non-namespaced
> consumers of securityfs, like it is now.
>
> Namespaced consumers wouldn't call any securityfs_ns_create callbacks
> to create dentries until they were notified from the fill_super
> notifier, which would now only be triggered on first mount of
> securityfs inside the namespace.
>
>>> If we do it this way, we can now make securityfs have its own mount
>>> and mount_count inside the user namespace, which it uses internally
>>> to the securityfs code, thus avoiding exposing them to ima or any
>>> other namespaced consumer.
>>>
>>> I also think we now don't need the securityfs_ns_ duplicated
>>> functions because the callback via the notifier chain now ensures
>>> we can use the namespace they were created in to distinguish between
>>> non namespaced and namespaced entries.
>> Is there then no need to pass a separate vfsmount * in anymore?
> I don't think so no.  It could be entirely managed internally to
> securityfs.
>
>> Where would the vfsmount pointer reside? For now it's in
>> ima_namespace, but it sounds like it should be in a more centralized
>> place? Should it also be  connected to the user_namespace so we can
>> pick it up using get_user_ns()?
> exactly.  I think struct user_namespace should have two elements gated
> by a #ifdef CONFIG_SECURITYFS which are the vfsmount and the
> mount_count for passing into simple_pin_fs.
>
>
> James
>
>
Stefan Berger Dec. 4, 2021, 12:33 a.m. UTC | #7
On 12/3/21 14:11, Stefan Berger wrote:
>
> On 12/3/21 13:50, James Bottomley wrote:
>
>
>>
>>> Where would the vfsmount pointer reside? For now it's in
>>> ima_namespace, but it sounds like it should be in a more centralized
>>> place? Should it also be  connected to the user_namespace so we can
>>> pick it up using get_user_ns()?
>> exactly.  I think struct user_namespace should have two elements gated
>> by a #ifdef CONFIG_SECURITYFS which are the vfsmount and the
>> mount_count for passing into simple_pin_fs.
>
> Also that we can do for as long as it flies beyond the conversation 
> here... :-) Anyone else have an opinion ?

I moved it now and this greatly reduced the number of changes. The 
dentries are now all in the ima_namespace and it works with one API. Thanks!

I wonder whether to move the integrity dir also into the ima_namespace. 
It's generated in integrity/iint.c, so not in the IMA territory... For 
the IMA namespacing case I need to create it as well, though.

https://elixir.bootlin.com/linux/latest/source/security/integrity/iint.c#L218

    Stefan
James Bottomley Dec. 6, 2021, 4:27 a.m. UTC | #8
On Fri, 2021-12-03 at 14:11 -0500, Stefan Berger wrote:
> On 12/3/21 13:50, James Bottomley wrote:
> > On Fri, 2021-12-03 at 13:06 -0500, Stefan Berger wrote:
[...]
> > > I suppose any late filesystem init callchain would have to be
> > > connected to the user_namespace somehow?
> >  
> > I don't think so; I think just moving some securityfs entries into
> > the user_namespace and managing the notifier chain from within
> > securityfs will do for now.  [although I'd have to spec this out in
> > code before I knew for sure].
> 
> It doesn't have to be right in the user_namespace. The IMA namespace
> is  connected to the user namespace and holds the dentries now...
> 
> Please spec it out...

OK, this is what I have.  fill_super turned out to be a locking
nightmare, so I triggered it from the fs_context free callback instead
(which doesn't have the once per keyed superblock property, so I added a
flag in the user namespace).  I've got it to the point where the event is
triggered on mount and unmount, so all the entries for the namespace are
added when the filesystem is mounted and removed when it's unmounted.
This style of addition no longer needs the simple_pin_fs, because the
add/remove callbacks substitute (plus, if we pinned, kill_sb wouldn't
trigger on unmount).  The default behaviour still does pinning and
unpinning, but that can be keyed off the current user_namespace.

This is all on top of your current series ... some of the functions
should probably be renamed, but I kept them to show how the code was
migrating in this sketch.
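
To make the add-once/remove-on-unmount flow above easier to follow before
reading the diff, here is a minimal userspace model of it.  It is only a
sketch: ns_model, notifier_model, model_mount() and model_kill_super() are
hypothetical stand-ins for the user namespace, the blocking notifier chain,
securityfs_free_context() and securityfs_kill_super(); no kernel API is
used, and the init_user_ns special case is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only: none of these types are the kernel's. */
enum { NS_ADD, NS_REMOVE };

struct ns_model {
	bool notifier_sent;	/* mirrors securityfs_notifier_sent */
	bool populated;		/* stands in for the namespace's dentries */
	int add_events;		/* counts how often NS_ADD was delivered */
};

struct notifier_model {
	void (*call)(unsigned long msg, struct ns_model *ns);
	struct notifier_model *next;
};

static struct notifier_model *chain;

/* Push a subscriber onto the (unlocked, single-threaded) chain. */
static void chain_register(struct notifier_model *nb)
{
	assert(nb && nb->call);
	nb->next = chain;
	chain = nb;
}

static void chain_call(unsigned long msg, struct ns_model *ns)
{
	struct notifier_model *nb;

	for (nb = chain; nb; nb = nb->next)
		nb->call(msg, ns);
}

/* Runs on every mount attempt; the flag makes NS_ADD fire only once. */
static void model_mount(struct ns_model *ns)
{
	if (ns->notifier_sent)
		return;
	ns->notifier_sent = true;
	chain_call(NS_ADD, ns);
}

/* Runs when the superblock goes away; re-arms the flag for a remount. */
static void model_kill_super(struct ns_model *ns)
{
	chain_call(NS_REMOVE, ns);
	ns->notifier_sent = false;
}

/* Hypothetical subscriber standing in for ima_ns_notify(). */
static void ima_model_notify(unsigned long msg, struct ns_model *ns)
{
	switch (msg) {
	case NS_ADD:
		ns->populated = true;
		ns->add_events++;
		break;
	case NS_REMOVE:
		ns->populated = false;
		break;
	}
}

static struct notifier_model ima_model_nb = { .call = ima_model_notify };
```

Mounting twice in this model delivers NS_ADD once; after
model_kill_super() a further mount delivers it again, which is the
behaviour the flag in the user namespace is meant to give.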

James

---

From 59c45daa8698c66c3bcebfb194123977d548a9a6 Mon Sep 17 00:00:00 2001
From: James Bottomley <James.Bottomley@HansenPartnership.com>
Date: Sat, 4 Dec 2021 16:38:37 +0000
Subject: [PATCH] rework securityfs

---
 include/linux/security.h                 |  28 +--
 include/linux/user_namespace.h           |  21 +-
 security/inode.c                         | 292 ++++++++---------------
 security/integrity/ima/ima.h             |   3 +-
 security/integrity/ima/ima_fs.c          | 174 +++++---------
 security/integrity/ima/ima_init_ima_ns.c |   2 -
 security/integrity/ima/ima_ns.c          |   7 -
 7 files changed, 166 insertions(+), 361 deletions(-)

diff --git a/include/linux/security.h b/include/linux/security.h
index 83b3af3c2959..2f37651da6e5 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -29,6 +29,7 @@
 #include <linux/fs.h>
 #include <linux/slab.h>
 #include <linux/err.h>
+#include <linux/notifier.h>
 #include <linux/string.h>
 #include <linux/mm.h>
 
@@ -1919,6 +1920,13 @@ static inline void security_audit_rule_free(void *lsmrule)
 
 #ifdef CONFIG_SECURITYFS
 
+enum {
+	SECURITYFS_NS_ADD,
+	SECURITYFS_NS_REMOVE,
+};
+
+extern int securityfs_register_ns_notifier(struct notifier_block *nb);
+extern int securityfs_unregister_ns_notifier(struct notifier_block *nb);
 extern struct dentry *securityfs_create_file(const char *name, umode_t mode,
 					     struct dentry *parent, void *data,
 					     const struct file_operations *fops);
@@ -1929,20 +1937,6 @@ struct dentry *securityfs_create_symlink(const char *name,
 					 const struct inode_operations *iops);
 extern void securityfs_remove(struct dentry *dentry);
 
-extern struct dentry *securityfs_ns_create_file(const char *name, umode_t mode,
-						struct dentry *parent, void *data,
-						const struct file_operations *fops,
-						struct vfsmount **mount, int *mount_count);
-extern struct dentry *securityfs_ns_create_dir(const char *name, struct dentry *parent,
-					       struct vfsmount **mount, int *mount_count);
-struct dentry *securityfs_ns_create_symlink(const char *name,
-					    struct dentry *parent,
-					    const char *target,
-					    const struct inode_operations *iops,
-					    struct vfsmount **mount, int *mount_count);
-extern void securityfs_ns_remove(struct dentry *dentry,
-				 struct vfsmount **mount, int *mount_count);
-struct vfsmount *securityfs_ns_create_mount(struct user_namespace *user_ns);
 
 #else /* CONFIG_SECURITYFS */
 
@@ -1962,9 +1956,9 @@ static inline struct dentry *securityfs_create_file(const char *name,
 }
 
 static inline struct dentry *securityfs_create_symlink(const char *name,
-					struct dentry *parent,
-					const char *target,
-					const struct inode_operations *iops)
+						       struct dentry *parent,
+						       const char *target,
+						       const struct inode_operations *iops)
 {
 	return ERR_PTR(-ENODEV);
 }
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 8f7870b37c73..6b8bd060d8c4 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -103,11 +103,10 @@ struct user_namespace {
 #ifdef CONFIG_IMA
 	struct ima_namespace	*ima_ns;
 #endif
-	/* The refcount at which to start tearing down dependent namespaces
-	 * (currently only IMA) that may hold additional references to the
-	 * user namespace.
-	 */
-	unsigned int            refcount_teardown;
+#ifdef CONFIG_SECURITYFS
+	struct vfsmount		*securityfs_mount;
+	bool			securityfs_notifier_sent;
+#endif
 } __randomize_layout;
 
 struct ucounts {
@@ -158,19 +157,11 @@ static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
 extern int create_user_ns(struct cred *new);
 extern int unshare_userns(unsigned long unshare_flags, struct cred **new_cred);
 extern void __put_user_ns(struct user_namespace *ns);
-extern void ima_ns_userns_early_teardown(struct ima_namespace *ns);
 
 static inline void put_user_ns(struct user_namespace *ns)
 {
-	if (ns) {
-		if (refcount_dec_and_test(&ns->ns.count))
-			__put_user_ns(ns);
-		else if (refcount_read(&ns->ns.count) == ns->refcount_teardown) {
-#ifdef CONFIG_IMA_NS
-			ima_ns_userns_early_teardown(ns->ima_ns);
-#endif
-		}
-	}
+	if (ns && refcount_dec_and_test(&ns->ns.count))
+		__put_user_ns(ns);
 }
 
 struct seq_operations;
diff --git a/security/inode.c b/security/inode.c
index 6223f1d838f6..62ab4630dc31 100644
--- a/security/inode.c
+++ b/security/inode.c
@@ -18,15 +18,17 @@
 #include <linux/pagemap.h>
 #include <linux/init.h>
 #include <linux/namei.h>
+#include <linux/notifier.h>
 #include <linux/security.h>
 #include <linux/lsm_hooks.h>
 #include <linux/magic.h>
 #include <linux/user_namespace.h>
 #include <linux/ima.h>
 
-static struct vfsmount *securityfs_mount;
 static int securityfs_mount_count;
 
+static BLOCKING_NOTIFIER_HEAD(securityfs_ns_notifier);
+
 static void securityfs_free_inode(struct inode *inode)
 {
 	if (S_ISLNK(inode->i_mode))
@@ -39,6 +41,31 @@ static const struct super_operations securityfs_super_operations = {
 	.free_inode	= securityfs_free_inode,
 };
 
+static struct file_system_type fs_type;
+
+static void securityfs_free_context(struct fs_context *fc)
+{
+	struct user_namespace *ns = fc->user_ns;
+	if (ns == &init_user_ns ||
+	    ns->securityfs_notifier_sent)
+		return;
+
+	ns->securityfs_notifier_sent = true;
+
+	ns->securityfs_mount = vfs_kern_mount(&fs_type, SB_KERNMOUNT,
+					      fs_type.name, NULL);
+	if (IS_ERR(ns->securityfs_mount)) {
+		printk(KERN_ERR "kern mount on securityfs ERROR: %ld\n",
+		       PTR_ERR(ns->securityfs_mount));
+		ns->securityfs_mount = NULL;
+		return;
+	}
+
+	blocking_notifier_call_chain(&securityfs_ns_notifier,
+				     SECURITYFS_NS_ADD, fc->user_ns);
+	mntput(ns->securityfs_mount);
+}
+
 static int securityfs_fill_super(struct super_block *sb, struct fs_context *fc)
 {
 	static const struct tree_descr files[] = {{""}};
@@ -60,52 +87,44 @@ static int securityfs_get_tree(struct fs_context *fc)
 
 static const struct fs_context_operations securityfs_context_ops = {
 	.get_tree	= securityfs_get_tree,
+	.free		= securityfs_free_context,
 };
 
 static int securityfs_init_fs_context(struct fs_context *fc)
 {
-	int rc;
-
-	if (fc->user_ns->ima_ns->late_fs_init) {
-		rc = fc->user_ns->ima_ns->late_fs_init(fc->user_ns);
-		if (rc)
-			return rc;
-	}
 	fc->ops = &securityfs_context_ops;
 	return 0;
 }
 
+static void securityfs_kill_super(struct super_block *sb)
+{
+	struct user_namespace *ns = sb->s_fs_info;
+
+	if (ns != &init_user_ns)
+		blocking_notifier_call_chain(&securityfs_ns_notifier,
+					     SECURITYFS_NS_REMOVE,
+					     sb->s_fs_info);
+	ns->securityfs_notifier_sent = false;
+	ns->securityfs_mount = NULL;
+	kill_litter_super(sb);
+}
+
 static struct file_system_type fs_type = {
 	.owner =	THIS_MODULE,
 	.name =		"securityfs",
 	.init_fs_context = securityfs_init_fs_context,
-	.kill_sb =	kill_litter_super,
+	.kill_sb =	securityfs_kill_super,
 	.fs_flags =	FS_USERNS_MOUNT,
 };
 
-/**
- * securityfs_ns_create_mount - create instance of securityfs in given user namespace
- *
- * @user_ns: the user namespace to create the vfsmount in
- *
- * This function returns a pointer to the vfsmount or an error code. The vfsmount
- * has to be used when creating or removing filesystem dentries.
- */
-struct vfsmount *securityfs_ns_create_mount(struct user_namespace *user_ns)
+int securityfs_register_ns_notifier(struct notifier_block *nb)
 {
-	struct fs_context *fc;
-	struct vfsmount *mnt;
-
-	fc = fs_context_for_mount(&fs_type, SB_KERNMOUNT);
-	if (IS_ERR(fc))
-		return ERR_CAST(fc);
-
-	put_user_ns(fc->user_ns);
-	fc->user_ns = get_user_ns(user_ns);
+	return blocking_notifier_chain_register(&securityfs_ns_notifier, nb);
+}
 
-	mnt = fc_mount(fc);
-	put_fs_context(fc);
-	return mnt;
+int securityfs_unregister_ns_notifier(struct notifier_block *nb)
+{
+	return blocking_notifier_chain_unregister(&securityfs_ns_notifier, nb);
 }
 
 /**
@@ -147,24 +166,27 @@ struct vfsmount *securityfs_ns_create_mount(struct user_namespace *user_ns)
 static struct dentry *securityfs_create_dentry(const char *name, umode_t mode,
 					struct dentry *parent, void *data,
 					const struct file_operations *fops,
-					const struct inode_operations *iops,
-					struct vfsmount **mount, int *mount_count)
+					const struct inode_operations *iops)
 {
 	struct dentry *dentry;
 	struct inode *dir, *inode;
 	int error;
+	struct user_namespace *ns = current_user_ns();
 
 	if (!(mode & S_IFMT))
 		mode = (mode & S_IALLUGO) | S_IFREG;
 
-	pr_debug("securityfs: creating file '%s'\n",name);
+	pr_debug("securityfs: creating file '%s', ns=%u\n",name, ns->ns.inum);
 
-	error = simple_pin_fs(&fs_type, mount, mount_count);
-	if (error)
-		return ERR_PTR(error);
+	if (ns == &init_user_ns) {
+		error = simple_pin_fs(&fs_type, &ns->securityfs_mount,
+				      &securityfs_mount_count);
+		if (error)
+			return ERR_PTR(error);
+	}
 
 	if (!parent)
-		parent = (*mount)->mnt_root;
+		parent = ns->securityfs_mount->mnt_root;
 
 	dir = d_inode(parent);
 
@@ -209,7 +231,9 @@ static struct dentry *securityfs_create_dentry(const char *name, umode_t mode,
 	dentry = ERR_PTR(error);
 out:
 	inode_unlock(dir);
-	simple_release_fs(mount, mount_count);
+	if (ns == &init_user_ns)
+		simple_release_fs(&ns->securityfs_mount,
+				  &securityfs_mount_count);
 	return dentry;
 }
 
@@ -242,46 +266,10 @@ struct dentry *securityfs_create_file(const char *name, umode_t mode,
 				      struct dentry *parent, void *data,
 				      const struct file_operations *fops)
 {
-	return securityfs_create_dentry(name, mode, parent, data, fops, NULL,
-					&securityfs_mount,
-					&securityfs_mount_count);
+	return securityfs_create_dentry(name, mode, parent, data, fops, NULL);
 }
 EXPORT_SYMBOL_GPL(securityfs_create_file);
 
-/**
- * securityfs_ns_create_file - create a file in the securityfs_ns filesystem
- *
- * @name: a pointer to a string containing the name of the file to create.
- * @mode: the permission that the file should have
- * @parent: a pointer to the parent dentry for this file.  This should be a
- *          directory dentry if set.  If this parameter is %NULL, then the
- *          file will be created in the root of the securityfs_ns filesystem.
- * @data: a pointer to something that the caller will want to get to later
- *        on.  The inode.i_private pointer will point to this value on
- *        the open() call.
- * @fops: a pointer to a struct file_operations that should be used for
- *        this file.
- * @mount: Pointer to a pointer of a an existing vfsmount
- * @mount_count: The mount_count that goes along with the @mount
- *
- * This function creates a file in securityfs_ns with the given @name.
- *
- * This function returns a pointer to a dentry if it succeeds.  This
- * pointer must be passed to the securityfs_ns_remove() function when the file
- * is to be removed (no automatic cleanup happens if your module is unloaded,
- * you are responsible here).  If an error occurs, the function will return
- * the error value (via ERR_PTR).
- */
-struct dentry *securityfs_ns_create_file(const char *name, umode_t mode,
-					 struct dentry *parent, void *data,
-					 const struct file_operations *fops,
-					 struct vfsmount **mount, int *mount_count)
-{
-	return securityfs_create_dentry(name, mode, parent, data, fops, NULL,
-					mount, mount_count);
-}
-EXPORT_SYMBOL_GPL(securityfs_ns_create_file);
-
 /**
  * securityfs_create_dir - create a directory in the securityfs filesystem
  *
@@ -308,55 +296,6 @@ struct dentry *securityfs_create_dir(const char *name, struct dentry *parent)
 }
 EXPORT_SYMBOL_GPL(securityfs_create_dir);
 
-/**
- * securityfs_ns_create_dir - create a directory in the securityfs_ns filesystem
- *
- * @name: a pointer to a string containing the name of the directory to
- *        create.
- * @parent: a pointer to the parent dentry for this file.  This should be a
- *          directory dentry if set.  If this parameter is %NULL, then the
- *          directory will be created in the root of the securityfs_ns filesystem.
- * @mount: Pointer to a pointer of a an existing vfsmount
- * @mount_count: The mount_count that goes along with the @mount
- *
- * This function creates a directory in securityfs_ns with the given @name.
- *
- * This function returns a pointer to a dentry if it succeeds.  This
- * pointer must be passed to the securityfs_ns_remove() function when the file
- * is to be removed (no automatic cleanup happens if your module is unloaded,
- * you are responsible here).  If an error occurs, the function will return
- * the error value (via ERR_PTR).
- */
-struct dentry *securityfs_ns_create_dir(const char *name, struct dentry *parent,
-					struct vfsmount **mount, int *mount_count)
-{
-	return securityfs_ns_create_file(name, S_IFDIR | 0755, parent, NULL, NULL,
-					 mount, mount_count);
-}
-EXPORT_SYMBOL_GPL(securityfs_ns_create_dir);
-
-static struct dentry *_securityfs_create_symlink(const char *name,
-						 struct dentry *parent,
-						 const char *target,
-						 const struct inode_operations *iops,
-						 struct vfsmount **mount, int *mount_count)
-{
-	struct dentry *dent;
-	char *link = NULL;
-
-	if (target) {
-		link = kstrdup(target, GFP_KERNEL);
-		if (!link)
-			return ERR_PTR(-ENOMEM);
-	}
-	dent = securityfs_create_dentry(name, S_IFLNK | 0444, parent,
-					link, NULL, iops, mount, mount_count);
-	if (IS_ERR(dent))
-		kfree(link);
-
-	return dent;
-}
-
 /**
  * securityfs_create_symlink - create a symlink in the securityfs filesystem
  *
@@ -388,48 +327,40 @@ struct dentry *securityfs_create_symlink(const char *name,
 					 const char *target,
 					 const struct inode_operations *iops)
 {
-	return _securityfs_create_symlink(name, parent, target, iops,
-					  &securityfs_mount, &securityfs_mount_count);
+	struct dentry *dent;
+	char *link = NULL;
+
+	if (target) {
+		link = kstrdup(target, GFP_KERNEL);
+		if (!link)
+			return ERR_PTR(-ENOMEM);
+	}
+	dent = securityfs_create_dentry(name, S_IFLNK | 0444, parent,
+					link, NULL, iops);
+	if (IS_ERR(dent))
+		kfree(link);
+
+	return dent;
 }
-EXPORT_SYMBOL_GPL(securityfs_create_symlink);
+EXPORT_SYMBOL(securityfs_create_symlink);
 
 /**
- * securityfs_ns_create_symlink - create a symlink in the securityfs_ns filesystem
+ * securityfs_remove - removes a file or directory from the securityfs filesystem
  *
- * @name: a pointer to a string containing the name of the symlink to
- *        create.
- * @parent: a pointer to the parent dentry for the symlink.  This should be a
- *          directory dentry if set.  If this parameter is %NULL, then the
- *          directory will be created in the root of the securityfs_ns filesystem.
- * @target: a pointer to a string containing the name of the symlink's target.
- *          If this parameter is %NULL, then the @iops parameter needs to be
- *          setup to handle .readlink and .get_link inode_operations.
- * @mount: Pointer to a pointer of a an existing vfsmount
- * @mount_count: The mount_count that goes along with the @mount
+ * @dentry: a pointer to a the dentry of the file or directory to be removed.
  *
- * This function creates a symlink in securityfs_ns with the given @name.
+ * This function removes a file or directory in securityfs that was previously
+ * created with a call to another securityfs function (like
+ * securityfs_create_file() or variants thereof.)
  *
- * This function returns a pointer to a dentry if it succeeds.  This
- * pointer must be passed to the securityfs_ns_remove() function when the file
- * is to be removed (no automatic cleanup happens if your module is unloaded,
- * you are responsible here).  If an error occurs, the function will return
- * the error value (via ERR_PTR).
+ * This function is required to be called in order for the file to be
+ * removed. No automatic cleanup of files will happen when a module is
+ * removed; you are responsible here.
  */
-struct dentry *securityfs_ns_create_symlink(const char *name,
-					    struct dentry *parent,
-					    const char *target,
-					    const struct inode_operations *iops,
-					    struct vfsmount **mount, int *mount_count)
-{
-	return _securityfs_create_symlink(name, parent, target, iops,
-					  mount, mount_count);
-}
-EXPORT_SYMBOL_GPL(securityfs_ns_create_symlink);
-
-static void _securityfs_remove(struct dentry *dentry,
-			       struct vfsmount **mount, int *mount_count)
+void securityfs_remove(struct dentry *dentry)
 {
 	struct inode *dir;
+	struct user_namespace *ns = current_user_ns();
 
 	if (!dentry || IS_ERR(dentry))
 		return;
@@ -444,49 +375,12 @@ static void _securityfs_remove(struct dentry *dentry,
 		dput(dentry);
 	}
 	inode_unlock(dir);
-	simple_release_fs(mount, mount_count);
+	if (ns == &init_user_ns)
+		simple_release_fs(&ns->securityfs_mount,
+				  &securityfs_mount_count);
 }
+EXPORT_SYMBOL(securityfs_remove);
 
-/**
- * securityfs_remove - removes a file or directory from the securityfs filesystem
- *
- * @dentry: a pointer to a the dentry of the file or directory to be removed.
- *
- * This function removes a file or directory in securityfs that was previously
- * created with a call to another securityfs function (like
- * securityfs_create_file() or variants thereof.)
- *
- * This function is required to be called in order for the file to be
- * removed. No automatic cleanup of files will happen when a module is
- * removed; you are responsible here.
- */
-void securityfs_remove(struct dentry *dentry)
-{
-	_securityfs_remove(dentry, &securityfs_mount, &securityfs_mount_count);
-}
-
-EXPORT_SYMBOL_GPL(securityfs_remove);
-
-/**
- * securityfs_ns_remove - removes a file or directory from the securityfs_ns filesystem
- *
- * @dentry: a pointer to a the dentry of the file or directory to be removed.
- * @mount: Pointer to a pointer of a an existing vfsmount
- * @mount_count: The mount_count that goes along with the @mount
- *
- * This function removes a file or directory in securityfs_ns that was previously
- * created with a call to another securityfs_ns function (like
- * securityfs_ns_create_file() or variants thereof.)
- *
- * This function is required to be called in order for the file to be
- * removed. No automatic cleanup of files will happen when a module is
- * removed; you are responsible here.
- */
-void securityfs_ns_remove(struct dentry *dentry, struct vfsmount **mount, int *mount_count)
-{
-	_securityfs_remove(dentry, mount, mount_count);
-}
-EXPORT_SYMBOL_GPL(securityfs_ns_remove);
 
 #ifdef CONFIG_SECURITY
 static struct dentry *lsm_dentry;
@@ -511,6 +405,8 @@ static int __init securityfs_init(void)
 	if (retval)
 		return retval;
 
+	init_user_ns.securityfs_mount = NULL;
+
 	retval = register_filesystem(&fs_type);
 	if (retval) {
 		sysfs_remove_mount_point(kernel_kobj, "security");
diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
index 9bcd71bb716c..12b7df65a5ff 100644
--- a/security/integrity/ima/ima.h
+++ b/security/integrity/ima/ima.h
@@ -139,8 +139,7 @@ struct ns_status {
 /* Internal IMA function definitions */
 int ima_init(void);
 int ima_fs_init(void);
-int ima_fs_ns_init(struct ima_namespace *ns);
-void ima_fs_ns_free(struct ima_namespace *ns);
+void ima_fs_ns_free(void);
 int ima_add_template_entry(struct ima_namespace *ns,
 			   struct ima_template_entry *entry, int violation,
 			   const char *op, struct inode *inode,
diff --git a/security/integrity/ima/ima_fs.c b/security/integrity/ima/ima_fs.c
index 65b2af7c14dd..26f26e8756a8 100644
--- a/security/integrity/ima/ima_fs.c
+++ b/security/integrity/ima/ima_fs.c
@@ -26,6 +26,8 @@
 
 #include "ima.h"
 
+int ima_fs_ns_init(void);
+
 bool ima_canonical_fmt;
 static int __init default_canonical_fmt_setup(char *str)
 {
@@ -360,14 +362,6 @@ static ssize_t ima_write_policy(struct file *file, const char __user *buf,
 	return result;
 }
 
-static struct dentry *ima_dir;
-static struct dentry *ima_symlink;
-static struct dentry *binary_runtime_measurements;
-static struct dentry *ascii_runtime_measurements;
-static struct dentry *runtime_measurements_count;
-static struct dentry *violations;
-static struct dentry *ima_policy;
-
 enum ima_fs_flags {
 	IMA_FS_BUSY,
 };
@@ -437,14 +431,8 @@ static int ima_release_policy(struct inode *inode, struct file *file)
 
 	ima_update_policy(ns);
 #if !defined(CONFIG_IMA_WRITE_POLICY) && !defined(CONFIG_IMA_READ_POLICY)
-	if (ns == &init_ima_ns) {
-		securityfs_remove(ima_policy);
-		ima_policy = NULL;
-	} else {
-		securityfs_ns_remove(ns->dentry[IMAFS_DENTRY_POLICY],
-				     &ns->mount, &ns->mount_count);
-		ns->dentry[IMAFS_DENTRY_POLICY] = NULL;
-	}
+	securityfs_remove(ns->dentry[IMAFS_DENTRY_POLICY]);
+	ns->dentry[IMAFS_DENTRY_POLICY] = NULL;
 #elif defined(CONFIG_IMA_WRITE_POLICY)
 	clear_bit(IMA_FS_BUSY, &ns->ima_fs_flags);
 #elif defined(CONFIG_IMA_READ_POLICY)
@@ -461,60 +449,32 @@ static const struct file_operations ima_measure_policy_ops = {
 	.llseek = generic_file_llseek,
 };
 
-int __init ima_fs_init(void)
+static int ima_fs_ns_late_init(struct user_namespace *user_ns);
+static void ima_fs_ns_free_dentries(struct ima_namespace *ns);
+static int ima_ns_notify(struct notifier_block *this, unsigned long msg,
+			    void *data)
 {
-	ima_dir = securityfs_create_dir("ima", integrity_dir);
-	if (IS_ERR(ima_dir))
-		return -1;
-
-	ima_symlink = securityfs_create_symlink("ima", NULL, "integrity/ima",
-						NULL);
-	if (IS_ERR(ima_symlink))
-		goto out;
-
-	binary_runtime_measurements =
-	    securityfs_create_file("binary_runtime_measurements",
-				   S_IRUSR | S_IRGRP, ima_dir, NULL,
-				   &ima_measurements_ops);
-	if (IS_ERR(binary_runtime_measurements))
-		goto out;
-
-	ascii_runtime_measurements =
-	    securityfs_create_file("ascii_runtime_measurements",
-				   S_IRUSR | S_IRGRP, ima_dir, NULL,
-				   &ima_ascii_measurements_ops);
-	if (IS_ERR(ascii_runtime_measurements))
-		goto out;
-
-	runtime_measurements_count =
-	    securityfs_create_file("runtime_measurements_count",
-				   S_IRUSR | S_IRGRP, ima_dir, NULL,
-				   &ima_measurements_count_ops);
-	if (IS_ERR(runtime_measurements_count))
-		goto out;
-
-	violations =
-	    securityfs_create_file("violations", S_IRUSR | S_IRGRP,
-				   ima_dir, NULL, &ima_htable_violations_ops);
-	if (IS_ERR(violations))
-		goto out;
+	struct user_namespace *ns = data;
+
+	switch (msg) {
+	case SECURITYFS_NS_ADD:
+		ima_fs_ns_late_init(ns);
+		break;
+	case SECURITYFS_NS_REMOVE:
+		ima_fs_ns_free_dentries(ns->ima_ns);
+		break;
+	}
+	return 0;
+}
 
-	ima_policy = securityfs_create_file("policy", POLICY_FILE_FLAGS,
-					    ima_dir, NULL,
-					    &ima_measure_policy_ops);
-	if (IS_ERR(ima_policy))
-		goto out;
+static struct notifier_block ima_ns_notifier = {
+	.notifier_call = ima_ns_notify,
+};
 
-	return 0;
-out:
-	securityfs_remove(violations);
-	securityfs_remove(runtime_measurements_count);
-	securityfs_remove(ascii_runtime_measurements);
-	securityfs_remove(binary_runtime_measurements);
-	securityfs_remove(ima_symlink);
-	securityfs_remove(ima_dir);
-	securityfs_remove(ima_policy);
-	return -1;
+int __init ima_fs_init(void)
+{
+	ima_fs_ns_init();
+	return ima_fs_ns_late_init(&init_user_ns);
 }
 
 static void ima_fs_ns_free_dentries(struct ima_namespace *ns)
@@ -528,12 +488,10 @@ static void ima_fs_ns_free_dentries(struct ima_namespace *ns)
 			/* files first */
 			continue;
 		}
-		securityfs_ns_remove(ns->dentry[i], &ns->mount, &ns->mount_count);
+		securityfs_remove(ns->dentry[i]);
 	}
-	securityfs_ns_remove(ns->dentry[IMAFS_DENTRY_DIR],
-			     &ns->mount, &ns->mount_count);
-	securityfs_ns_remove(ns->dentry[IMAFS_DENTRY_INTEGRITY_DIR],
-			     &ns->mount, &ns->mount_count);
+	securityfs_remove(ns->dentry[IMAFS_DENTRY_DIR]);
+	securityfs_remove(ns->dentry[IMAFS_DENTRY_INTEGRITY_DIR]);
 
 	memset(ns->dentry, 0, sizeof(ns->dentry));
 
@@ -551,25 +509,27 @@ static int ima_fs_ns_late_init(struct user_namespace *user_ns)
 	if (ns->dentry[IMAFS_DENTRY_INTEGRITY_DIR])
 		return 0;
 
-	ns->dentry[IMAFS_DENTRY_INTEGRITY_DIR] =
-	    securityfs_ns_create_dir("integrity", NULL,
-				     &ns->mount, &ns->mount_count);
+	/* FIXME: update when evm and integrity are namespaced */
+	if (user_ns != &init_user_ns)
+		ns->dentry[IMAFS_DENTRY_INTEGRITY_DIR] =
+			securityfs_create_dir("integrity", NULL);
+	else
+		ns->dentry[IMAFS_DENTRY_INTEGRITY_DIR] = integrity_dir;
 	if (IS_ERR(ns->dentry[IMAFS_DENTRY_INTEGRITY_DIR])) {
 		ns->dentry[IMAFS_DENTRY_INTEGRITY_DIR] = NULL;
 		goto out;
 	}
 
 	ns->dentry[IMAFS_DENTRY_DIR] =
-	    securityfs_ns_create_dir("ima", ns->dentry[IMAFS_DENTRY_INTEGRITY_DIR],
-				     &ns->mount, &ns->mount_count);
+		securityfs_create_dir("ima",
+				      ns->dentry[IMAFS_DENTRY_INTEGRITY_DIR]);
 	if (IS_ERR(ns->dentry[IMAFS_DENTRY_DIR])) {
 		ns->dentry[IMAFS_DENTRY_DIR] = NULL;
 		goto out;
 	}
 
 	ns->dentry[IMAFS_DENTRY_SYMLINK] =
-	    securityfs_ns_create_symlink("ima", NULL, "integrity/ima", NULL,
-				     &ns->mount, &ns->mount_count);
+		securityfs_create_symlink("ima", NULL, "integrity/ima", NULL);
 	if (IS_ERR(ns->dentry[IMAFS_DENTRY_SYMLINK])) {
 		ns->dentry[IMAFS_DENTRY_SYMLINK] = NULL;
 		goto out;
@@ -577,88 +537,62 @@ static int ima_fs_ns_late_init(struct user_namespace *user_ns)
 
 	parent = ns->dentry[IMAFS_DENTRY_DIR];
 	ns->dentry[IMAFS_DENTRY_BINARY_RUNTIME_MEASUREMENTS] =
-	    securityfs_ns_create_file("binary_runtime_measurements",
+	    securityfs_create_file("binary_runtime_measurements",
 				   S_IRUSR | S_IRGRP, parent, NULL,
-				   &ima_measurements_ops,
-				   &ns->mount, &ns->mount_count);
+				   &ima_measurements_ops);
 	if (IS_ERR(ns->dentry[IMAFS_DENTRY_BINARY_RUNTIME_MEASUREMENTS])) {
 		ns->dentry[IMAFS_DENTRY_BINARY_RUNTIME_MEASUREMENTS] = NULL;
 		goto out;
 	}
 
 	ns->dentry[IMAFS_DENTRY_ASCII_RUNTIME_MEASUREMENTS] =
-	    securityfs_ns_create_file("ascii_runtime_measurements",
+	    securityfs_create_file("ascii_runtime_measurements",
 				   S_IRUSR | S_IRGRP, parent, NULL,
-				   &ima_ascii_measurements_ops,
-				   &ns->mount, &ns->mount_count);
+				   &ima_ascii_measurements_ops);
 	if (IS_ERR(ns->dentry[IMAFS_DENTRY_ASCII_RUNTIME_MEASUREMENTS])) {
 		ns->dentry[IMAFS_DENTRY_ASCII_RUNTIME_MEASUREMENTS] = NULL;
 		goto out;
 	}
 
 	ns->dentry[IMAFS_DENTRY_RUNTIME_MEASUREMENTS_COUNT] =
-	    securityfs_ns_create_file("runtime_measurements_count",
+	    securityfs_create_file("runtime_measurements_count",
 				   S_IRUSR | S_IRGRP, parent, NULL,
-				   &ima_measurements_count_ops,
-				   &ns->mount, &ns->mount_count);
+				   &ima_measurements_count_ops);
 	if (IS_ERR(ns->dentry[IMAFS_DENTRY_RUNTIME_MEASUREMENTS_COUNT])) {
 		ns->dentry[IMAFS_DENTRY_RUNTIME_MEASUREMENTS_COUNT] = NULL;
 		goto out;
 	}
 
 	ns->dentry[IMAFS_DENTRY_VIOLATIONS] =
-	    securityfs_ns_create_file("violations", S_IRUSR | S_IRGRP,
-				   parent, NULL, &ima_htable_violations_ops,
-				   &ns->mount, &ns->mount_count);
+	    securityfs_create_file("violations", S_IRUSR | S_IRGRP,
+				   parent, NULL, &ima_htable_violations_ops);
 	if (IS_ERR(ns->dentry[IMAFS_DENTRY_VIOLATIONS])) {
 		ns->dentry[IMAFS_DENTRY_VIOLATIONS] = NULL;
 		goto out;
 	}
 
 	ns->dentry[IMAFS_DENTRY_IMA_POLICY] =
-	    securityfs_ns_create_file("policy", POLICY_FILE_FLAGS,
-				   parent, NULL, &ima_measure_policy_ops,
-				   &ns->mount, &ns->mount_count);
+	    securityfs_create_file("policy", POLICY_FILE_FLAGS,
+				   parent, NULL, &ima_measure_policy_ops);
 	if (IS_ERR(ns->dentry[IMAFS_DENTRY_IMA_POLICY])) {
 		ns->dentry[IMAFS_DENTRY_IMA_POLICY] = NULL;
 		goto out;
 	}
 
-
 	return 0;
 
 out:
-	ima_fs_ns_free_dentries(ns);
+	ima_fs_ns_free_dentries(user_ns->ima_ns);
 
 	return -1;
 }
 
-int ima_fs_ns_init(struct ima_namespace *ns)
+int ima_fs_ns_init(void)
 {
-	ns->mount = securityfs_ns_create_mount(ns->user_ns);
-	if (IS_ERR(ns->mount)) {
-		ns->mount = NULL;
-		return -1;
-	}
-	ns->mount_count = 1;
-
-	/* Adjust the trigger for user namespace's early teardown of dependent
-	 * namespaces. Due to the filesystem there's an additional reference
-	 * to the user namespace.
-	 */
-	ns->user_ns->refcount_teardown += 1;
-
-	ns->late_fs_init = ima_fs_ns_late_init;
-
-	return 0;
+	return securityfs_register_ns_notifier(&ima_ns_notifier);
 }
 
-void ima_fs_ns_free(struct ima_namespace *ns)
+void ima_fs_ns_free(void)
 {
-	ima_fs_ns_free_dentries(ns);
-	if (ns->mount) {
-		mntput(ns->mount);
-		ns->mount_count -= 1;
-	}
-	ns->mount = NULL;
+	securityfs_unregister_ns_notifier(&ima_ns_notifier);
 }
diff --git a/security/integrity/ima/ima_init_ima_ns.c b/security/integrity/ima/ima_init_ima_ns.c
index 86a89502c0c5..38d075a2c38d 100644
--- a/security/integrity/ima/ima_init_ima_ns.c
+++ b/security/integrity/ima/ima_init_ima_ns.c
@@ -54,8 +54,6 @@ int ima_init_namespace(struct ima_namespace *ns)
 	mutex_init(&ns->ima_write_mutex);
 	ns->valid_policy = 1;
 	ns->ima_fs_flags = 0;
-	if (ns != &init_ima_ns)
-		rc = ima_fs_ns_init(ns);
 
 	return rc;
 }
diff --git a/security/integrity/ima/ima_ns.c b/security/integrity/ima/ima_ns.c
index 9d5917c97fcc..4c147e0c1801 100644
--- a/security/integrity/ima/ima_ns.c
+++ b/security/integrity/ima/ima_ns.c
@@ -65,13 +65,6 @@ struct ima_namespace *copy_ima_ns(struct ima_namespace *old_ns,
 	return create_ima_ns(user_ns);
 }
 
-void ima_ns_userns_early_teardown(struct ima_namespace *ns)
-{
-	pr_debug("%s: ns=0x%p\n", __func__, ns);
-	ima_fs_ns_free(ns);
-}
-EXPORT_SYMBOL(ima_ns_userns_early_teardown);
-
 static void destroy_ima_ns(struct ima_namespace *ns)
 {
 	pr_debug("DESTROY ima_ns: 0x%p\n", ns);
Christian Brauner Dec. 6, 2021, 11:52 a.m. UTC | #9
On Fri, Dec 03, 2021 at 07:33:39PM -0500, Stefan Berger wrote:
> 
> On 12/3/21 14:11, Stefan Berger wrote:
> > 
> > On 12/3/21 13:50, James Bottomley wrote:
> > 
> > 
> > > 
> > > > Where would the vfsmount pointer reside? For now it's in
> > > > ima_namespace, but it sounds like it should be in a more centralized
> > > > place? Should it also be  connected to the user_namespace so we can
> > > > pick it up using get_user_ns()?
> > > exactly.  I think struct user_namespace should have two elements gated
> > > by a #ifdef CONFIG_SECURITYFS which are the vfsmount and the
> > > mount_count for passing into simple_pin_fs.
> > 
> > Also that we can do for as long as it flies beyond the conversation
> > here... :-) Anyone else have an opinion ?
> 
> I moved it now and this greatly reduced the amount of changes. The dentries
> are now all in the ima_namespace and it works with one API. Thanks!

Ideally you only have one entry in struct user_namespace for ima that
encompasses all information needed; not multiple entries. Similar to
what I did for binfmt_misc
https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/commit/?h=fs.binfmt_misc&id=eb50eb90a694e05f6fd6533951a56ca3ed040761
if that works.
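
As a rough illustration of the single-entry layout suggested above, the
per-namespace state could hang off one pointer instead of several fields.
This is only a userspace sketch under assumed names: securityfs_ns_model,
user_ns_model and the two helpers are hypothetical and are not taken from
the binfmt_misc commit or from any posted patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical container for everything one subsystem needs per namespace. */
struct securityfs_ns_model {
	void *mount;		/* stands in for struct vfsmount * */
	bool notifier_sent;	/* the "already populated" flag */
};

/* Model of the namespace: one pointer, not one field per item. */
struct user_ns_model {
	struct securityfs_ns_model *securityfs;
};

/* Allocate the container lazily; returns 0 on success, -1 on failure. */
static int user_ns_model_init_securityfs(struct user_ns_model *ns)
{
	assert(ns);
	if (ns->securityfs)
		return 0;
	ns->securityfs = calloc(1, sizeof(*ns->securityfs));
	return ns->securityfs ? 0 : -1;
}

/* Release the container and clear the single entry. */
static void user_ns_model_free_securityfs(struct user_ns_model *ns)
{
	assert(ns);
	free(ns->securityfs);
	ns->securityfs = NULL;
}
```

With this shape, adding further per-namespace state later only changes the
container, not the namespace structure itself.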
Christian Brauner Dec. 6, 2021, 12:08 p.m. UTC | #10
On Fri, Dec 03, 2021 at 11:37:14AM -0800, Casey Schaufler wrote:
> On 12/3/2021 10:50 AM, James Bottomley wrote:
> > On Fri, 2021-12-03 at 13:06 -0500, Stefan Berger wrote:
> > > On 12/3/21 12:03, James Bottomley wrote:
> > > > On Thu, 2021-12-02 at 21:31 -0500, Stefan Berger wrote:
> > > > [...]
> > > > >    static int securityfs_init_fs_context(struct fs_context *fc)
> > > > >    {
> > > > > +	int rc;
> > > > > +
> > > > > +	if (fc->user_ns->ima_ns->late_fs_init) {
> > > > > +		rc = fc->user_ns->ima_ns->late_fs_init(fc->user_ns);
> > > > > +		if (rc)
> > > > > +			return rc;
> > > > > +	}
> > > > >    	fc->ops = &securityfs_context_ops;
> > > > >    	return 0;
> > > > >    }
> > > > I know I suggested this, but to get this to work in general, it's
> > > > going to have to not be specific to IMA, so it's going to have to
> > > > become something generic like a notifier chain.  The other problem
> > > > is it's only working still by accident:
> > > I had thought about this also but the rationale was:
> > > 
> > > securityfs is compiled due to CONFIG_IMA_NS and the user namespace
> > > exists there and that has a pointer now to ima_namespace, which can
> > > have that callback. I assumed that other namespaced subsystems could
> > > also be  reached then via such a callback, but I don't know.
> > Well securityfs is supposed to exist for LSMs.  At some point each of
> > those is going to need to be namespaced, which may eventually be quite
> > a pile of callbacks, which is why I thought of a notifier.
> 
> While AppArmor, lockdown and the integrity family use securityfs,
> SELinux and Smack do not. They have their own independent filesystems.
> Implementations of namespacing for each of SELinux and Smack have been
> proposed, but nothing has been adopted. It would be really handy to
> namespace the infrastructure rather than each individual LSM, but I
> fear that's a bigger project than anyone will be taking on any time
> soon. It's likely to encounter many of the same issues that I've been
> dealing with for module stacking.

The main thing that bothers me is that it uses simple_pin_fs() and
simple_unpin_fs() which I would try hard to get rid of if possible. The
existence of this global pinning logic makes namespacing it properly
more difficult than it needs to be and it creates imho wonky semantics
where the last unmount doesn't really destroy the superblock. Instead
subsequent mounts resurface the same superblock. There might be an
inherent design reason why this needs to be this way but I would advise
against these semantics for anything that wants to be namespaced.
Probably the first securityfs mount in init_user_ns can follow these
semantics but ones tied to a non-initial user namespace should not as
the userns can go away. In that case the pinning logic seems strange as
conceptually the userns pins the securityfs mount as evidenced by the
fact that we key by it in get_tree_keyed().

> 
> > 
> > > I suppose any late filesystem init callchain would have to be
> > > connected to the user_namespace somehow?
> > I don't think so; I think just moving some securityfs entries into the
> > user_namespace and managing the notifier chain from within securityfs
> > will do for now.  [although I'd have to spec this out in code before I
> > knew for sure].
> > 
> > > > > +int ima_fs_ns_init(struct ima_namespace *ns)
> > > > > +{
> > > > > +	ns->mount = securityfs_ns_create_mount(ns->user_ns);
> > > > This actually triggers on the call to securityfs_init_fs_context,
> > > > but nothing happens because the callback is null.  Every subsequent
> > > > > use of fscontext will trigger this.  The point of a keyed superblock
> > > > is that fill_super is only called once per key, that's the place we
> > > > should be doing this.   It should also probably be a blocking
> > > > > notifier so any consumer of securityfs can be namespaced by
> > > > registering for this notifier.
> > > What I don't like about the fill_super is that it gets called too
> > > early:
> > > 
> > > [   67.058611] securityfs_ns_create_mount @ 102 target user_ns:
> > > ffff95c010698c80; nr_extents: 0
> > > [   67.059836] securityfs_fill_super @ 47  user_ns:
> > > ffff95c010698c80;
> > > nr_extents: 0
> > Right, it's being activated by securityfs_ns_create_mount which is
> > called as soon as the user_ns is created.
> > 
> > > We are switching to the target user namespace in
> > > securityfs_ns_create_mount. The expected nr_extents at this point is
> > > 0, since user_ns hasn't been configured, yet. But then
> > > security_fill_super is also called with nr_extents 0. We cannot use
> > > that, it's too early!
> > Exactly, so I was thinking of not having a securityfs_ns_create_mount
> > at all.  All the securityfs_ns_create.. calls would be in the notifier
> > call chain. This means there's nothing to fill the superblock until an
> > actual mount on it is called.
> > 
> > > > > +	if (IS_ERR(ns->mount)) {
> > > > > +		ns->mount = NULL;
> > > > > +		return -1;
> > > > > +	}
> > > > > +	ns->mount_count = 1;
> > > > This is a bit nasty, too: we're spilling the guts of mount count
> > > > tracking into IMA instead of encapsulating it inside securityfs.
> > > Ok, I can make this disappear.
> > > 
> > > 
> > > > > +
> > > > > +	/* Adjust the trigger for user namespace's early teardown of
> > > > > dependent
> > > > > +	 * namespaces. Due to the filesystem there's an additional
> > > > > reference
> > > > > +	 * to the user namespace.
> > > > > +	 */
> > > > > +	ns->user_ns->refcount_teardown += 1;
> > > > > +
> > > > > +	ns->late_fs_init = ima_fs_ns_late_init;
> > > > > +
> > > > > +	return 0;
> > > > > +}
> > > > > I think what should be happening is that we shouldn't do the
> > > > simple_pin_fs, which creates the inodes, ahead of time; we should
> > > > do it inside fill_super using a notifier, meaning it gets called
> > > > once per
> > > fill_super would only work for the init_user_ns from what I can see.
> > > 
> > > 
> > > > key, creates the root dentry then triggers the notifier which
> > > > instantiates all the namespaced entries.  We can still use
> > > > simple_pin_fs for this because there's no locking across
> > > > fill_super.
> > > > This would mean fill_super would be called the first time the
> > > > securityfs is mounted inside the namespace.
> > > I guess I would need to know how fill_super would work or how it
> > > could be called late/delayed as well.
> > So it would be called early in the init_user_ns by non-namespaced
> > consumers of securityfs, like it is now.
> > 
> > Namespaced consumers wouldn't call any securityfs_ns_create callbacks
> > to create dentries until they were notified from the fill_super
> > notifier, which would now only be triggered on first mount of
> > securityfs inside the namespace.
> > 
> > > > If we do it this way, we can now make securityfs have its own mount
> > > > and mount_count inside the user namespace, which it uses internally
> > > > to the securityfs code, thus avoiding exposing them to ima or any
> > > > other namespaced consumer.
> > > > 
> > > > I also think we now don't need the securityfs_ns_ duplicated
> > > > functions because the callback via the notifier chain now ensures
> > > > > we can use the namespace they were created in to distinguish between
> > > > non namespaced and namespaced entries.
> > > Is there then no need to pass a separate vfsmount * in anymore?
> > I don't think so no.  It could be entirely managed internally to
> > securityfs.
> > 
> > > Where would the vfsmount pointer reside? For now it's in
> > > ima_namespace, but it sounds like it should be in a more centralized
> > > place? Should it also be  connected to the user_namespace so we can
> > > pick it up using get_user_ns()?
> > exactly.  I think struct user_namespace should have two elements gated
> > by a #ifdef CONFIG_SECURITYFS which are the vfsmount and the
> > mount_count for passing into simple_pin_fs.
> > 
> > 
> > James
> > 
> > 
>
James Bottomley Dec. 6, 2021, 1:38 p.m. UTC | #11
On Mon, 2021-12-06 at 13:08 +0100, Christian Brauner wrote:
> On Fri, Dec 03, 2021 at 11:37:14AM -0800, Casey Schaufler wrote:
> > On 12/3/2021 10:50 AM, James Bottomley wrote:
> > > On Fri, 2021-12-03 at 13:06 -0500, Stefan Berger wrote:
> > > > On 12/3/21 12:03, James Bottomley wrote:
> > > > > On Thu, 2021-12-02 at 21:31 -0500, Stefan Berger wrote:
> > > > > [...]
> > > > > >    static int securityfs_init_fs_context(struct fs_context
> > > > > > *fc)
> > > > > >    {
> > > > > > +	int rc;
> > > > > > +
> > > > > > +	if (fc->user_ns->ima_ns->late_fs_init) {
> > > > > > +		rc = fc->user_ns->ima_ns->late_fs_init(fc-
> > > > > > >user_ns);
> > > > > > +		if (rc)
> > > > > > +			return rc;
> > > > > > +	}
> > > > > >    	fc->ops = &securityfs_context_ops;
> > > > > >    	return 0;
> > > > > >    }
> > > > > I know I suggested this, but to get this to work in general,
> > > > > it's going to have to not be specific to IMA, so it's going
> > > > > to have to become something generic like a notifier
> > > > > chain.  The other problem is it's only working still by
> > > > > accident:
> > > >  
> > > > I had thought about this also but the rationale was:
> > > > 
> > > > securityfs is compiled due to CONFIG_IMA_NS and the user
> > > > namespace exists there and that has a pointer now to
> > > > ima_namespace, which can have that callback. I assumed that
> > > > other namespaced subsystems could also be  reached then via
> > > > such a callback, but I don't know.
> > >  
> > > Well securityfs is supposed to exist for LSMs.  At some point
> > > each of those is going to need to be namespaced, which may
> > > eventually be quite a pile of callbacks, which is why I thought
> > > of a notifier.
> > 
> > While AppArmor, lockdown and the integrity family use securityfs,
> > SELinux and Smack do not. They have their own independent
> > filesystems. Implementations of namespacing for each of SELinux and
> > Smack have been proposed, but nothing has been adopted. It would be
> > really handy to namespace the infrastructure rather than each
> > individual LSM, but I fear that's a bigger project than anyone will
> > be taking on any time soon. It's likely to encounter many of the
> > same issues that I've been dealing with for module stacking.
> 
> The main thing that bothers me is that it uses simple_pin_fs() and
> simple_unpin_fs() which I would try hard to get rid of if possible.
> The existence of this global pinning logic makes namespacing it
> properly more difficult then it needs to be and it creates imho wonky
> semantics where the last unmount doesn't really destroy the
> superblock.

So in the notifier sketch I posted, I got rid of the pinning but only
for the non root user namespace use case ... which basically means only
for converted consumers of securityfs.  The last unmount of securityfs
inside the namespace now does destroy the superblock ... I checked.

The same isn't true for the last unmount of the root namespace, but
that has to be so to keep the current semantics.

>  Instead subsequents mounts resurface the same superblock. There
> might be an inherent design reason why this needs to be this way but
> I would advise against these semantics for anything that wants to be
> namespaced. Probably the first securityfs mount in init_user_ns can
> follow these semantics but ones tied to a non-initial user namespace
> should not as the userns can go away. In that case the pinning logic
> seems strange as conceptually the userns pins the securityfs mount as
> evidenced by the fact that we key by it in get_tree_keyed().

Yes, that's basically what I did: pin if ns == &init_user_ns but don't
pin if not.  However, I'm still not sure I got the triggers right.  We
have to trigger the notifier call (which adds the namespaced file
entries) from context free, because that's the first place the
superblock mount is fully set up ... I can't do it in fill_super
because the mount isn't fully initialized (and the locking prevents
it).  I did manage to get the notifier for teardown triggered from
kill_super, though.
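
To illustrate the "pin if ns == &init_user_ns but don't pin if not" rule,
here is a minimal userspace model, not kernel code. The struct layouts, the
pin_count field and the sb_mount()/sb_umount() helpers are assumptions made
up for this sketch; pin_count only stands in for the extra reference that
simple_pin_fs() would hold.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins; these are NOT the kernel's definitions. */
struct user_namespace { int id; };
static struct user_namespace init_user_ns = { 0 };

struct super_block {
	struct user_namespace *s_user_ns;
	int pin_count;	/* models the reference simple_pin_fs() would take */
	int mounts;	/* number of active mounts of this superblock */
	bool alive;
};

/* Hypothetical helper: mount the superblock on behalf of namespace ns. */
static void sb_mount(struct super_block *sb, struct user_namespace *ns)
{
	if (!sb->alive) {
		sb->s_user_ns = ns;
		sb->alive = true;
		/* Only the initial user namespace keeps the extra pin. */
		if (ns == &init_user_ns)
			sb->pin_count = 1;
	}
	sb->mounts++;
}

/* Hypothetical helper: the last unmount destroys an unpinned superblock. */
static void sb_umount(struct super_block *sb)
{
	sb->mounts--;
	if (sb->mounts == 0 && sb->pin_count == 0)
		sb->alive = false;
}
```

In this model the host superblock survives its last unmount and the
namespaced one does not, which is the behaviour described above.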

James
Stefan Berger Dec. 6, 2021, 2:03 p.m. UTC | #12
On 12/5/21 23:27, James Bottomley wrote:
> On Fri, 2021-12-03 at 14:11 -0500, Stefan Berger wrote:
>> On 12/3/21 13:50, James Bottomley wrote:
>>> On Fri, 2021-12-03 at 13:06 -0500, Stefan Berger wrote:
> [...]
>>>> I suppose any late filesystem init callchain would have to be
>>>> connected to the user_namespace somehow?
>>>   
>>> I don't think so; I think just moving some securityfs entries into
>>> the user_namespace and managing the notifier chain from within
>>> securityfs will do for now.  [although I'd have to spec this out in
>>> code before I knew for sure].
>> It doesn't have to be right in the user_namespace. The IMA namespace
>> is  connected to the user namespace and holds the dentries now...
>>
>> Please spec it out...
> OK, this is what I have.  fill_super turned out to be a locking
> nightmare, so I triggered it from free context instead (which doesn't
> have the once per keyed superblock property, so I added a flag in the
> user namespace).  I've got it to the point where the event is triggered
> on mount and unmount, so all the entries for the namespace are added
> when the filesystem is mounted and remove when it's unmounted.  This
> style of addition no longer needs the simple_pin_fs, because the
> add/remove callbacks substitute (plus, if we pinned, the free_super
> wouldn't trigger on unmount).  The default behaviour still does pinning
> and unpinning, but that can be keyed off the current user_namespace.
>
> This is all on top of your current series ... some of the functions
> should probably be renamed, but I kept them to show how the code was
> migrating in this sketch.
>
> James
>
> ---
>
>  From 59c45daa8698c66c3bcebfb194123977d548a9a6 Mon Sep 17 00:00:00 2001
> From: James Bottomley <James.Bottomley@HansenPartnership.com>
> Date: Sat, 4 Dec 2021 16:38:37 +0000
> Subject: [PATCH] rework securityfs
>
> ---
>
> -
> -static void _securityfs_remove(struct dentry *dentry,
> -			       struct vfsmount **mount, int *mount_count)
> +void securityfs_remove(struct dentry *dentry)
>   {
>   	struct inode *dir;
> +	struct user_namespace *ns = current_user_ns();

I had problems with this in this place, so I had to use

struct user_namespace *user_ns = dentry->d_sb->s_user_ns;

I'll try to split up your patch and post a v3 with it then. Or is it too early?

   Stefan
James Bottomley Dec. 6, 2021, 2:11 p.m. UTC | #13
On Mon, 2021-12-06 at 09:03 -0500, Stefan Berger wrote:
> On 12/5/21 23:27, James Bottomley wrote:
> > On Fri, 2021-12-03 at 14:11 -0500, Stefan Berger wrote:
> > > On 12/3/21 13:50, James Bottomley wrote:
> > > > On Fri, 2021-12-03 at 13:06 -0500, Stefan Berger wrote:
> > [...]
> > > > > I suppose any late filesystem init callchain would have to be
> > > > > connected to the user_namespace somehow?
> > > >   
> > > > I don't think so; I think just moving some securityfs entries
> > > > into
> > > > the user_namespace and managing the notifier chain from within
> > > > securityfs will do for now.  [although I'd have to spec this
> > > > out in
> > > > code before I knew for sure].
> > > It doesn't have to be right in the user_namespace. The IMA
> > > namespace
> > > is  connected to the user namespace and holds the dentries now...
> > > 
> > > Please spec it out...
> > OK, this is what I have.  fill_super turned out to be a locking
> > nightmare, so I triggered it from free context instead (which
> > doesn't
> > have the once per keyed superblock property, so I added a flag in
> > the
> > user namespace).  I've got it to the point where the event is
> > triggered
> > on mount and unmount, so all the entries for the namespace are
> > added
> > when the filesystem is mounted and remove when it's
> > unmounted.  This
> > style of addition no longer needs the simple_pin_fs, because the
> > add/remove callbacks substitute (plus, if we pinned, the free_super
> > wouldn't trigger on unmount).  The default behaviour still does
> > pinning
> > and unpinning, but that can be keyed off the current
> > user_namespace.
> > 
> > This is all on top of your current series ... some of the functions
> > should probably be renamed, but I kept them to show how the code
> > was
> > migrating in this sketch.
> > 
> > James
> > 
> > ---
> > 
> >  From 59c45daa8698c66c3bcebfb194123977d548a9a6 Mon Sep 17 00:00:00
> > 2001
> > From: James Bottomley <James.Bottomley@HansenPartnership.com>
> > Date: Sat, 4 Dec 2021 16:38:37 +0000
> > Subject: [PATCH] rework securityfs
> > 
> > ---
> > 
> > -
> > -static void _securityfs_remove(struct dentry *dentry,
> > -			       struct vfsmount **mount, int
> > *mount_count)
> > +void securityfs_remove(struct dentry *dentry)
> >   {
> >   	struct inode *dir;
> > +	struct user_namespace *ns = current_user_ns();
> 
> I had problems with this in this place. So I had to use use
> 
> struct user_namespace *user_ns = dentry->d_sb->s_user_ns;

Yes, I think that works ... the owner in the parent namespace could
actually unmount it, so keying off the user namespace it was mounted on
is definitely the correct form.
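
A small userspace model of why the superblock's namespace is the safer key.
The structs below only mimic the kernel fields named above
(dentry->d_sb->s_user_ns); dentry_user_ns() is a hypothetical helper for
this sketch, not an existing kernel function.

```c
#include <assert.h>

/* Userspace stand-ins that mirror only the fields discussed here. */
struct user_namespace { const char *name; };
struct super_block { struct user_namespace *s_user_ns; };
struct dentry { struct super_block *d_sb; };

/*
 * Hypothetical helper: return the namespace the securityfs instance was
 * mounted in. This does not depend on which task calls it, unlike a
 * lookup of the caller's current user namespace.
 */
static struct user_namespace *dentry_user_ns(const struct dentry *dentry)
{
	return dentry->d_sb->s_user_ns;
}
```

If a task in the parent namespace removes an entry that belongs to a child's
securityfs, the helper still returns the child namespace, whereas the
caller's own namespace would be the parent.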

> I'll try to split up your patch and post a v3 with then. Or is it too
> early?

It's never too early to see what the series is shaping up as.  However,
I'm still not sure I got the right trigger for the SECURITYFS_NS_ADD
notifier, so that may still have to move ... or even that there isn't
some locking subtlety I missed in triggering SECURITYFS_NS_REMOVE from
kill_sb.

I also suspect Christian will want a pointer to the securityfs pieces
in struct user_namespace rather than discrete elements added directly.

James
Christian Brauner Dec. 6, 2021, 2:11 p.m. UTC | #14
On Fri, Dec 03, 2021 at 01:06:13PM -0500, Stefan Berger wrote:
> 
> On 12/3/21 12:03, James Bottomley wrote:
> > On Thu, 2021-12-02 at 21:31 -0500, Stefan Berger wrote:
> > [...]
> > >   static int securityfs_init_fs_context(struct fs_context *fc)
> > >   {
> > > +	int rc;
> > > +
> > > +	if (fc->user_ns->ima_ns->late_fs_init) {
> > > +		rc = fc->user_ns->ima_ns->late_fs_init(fc->user_ns);
> > > +		if (rc)
> > > +			return rc;
> > > +	}
> > >   	fc->ops = &securityfs_context_ops;
> > >   	return 0;
> > >   }
> > I know I suggested this, but to get this to work in general, it's going
> > to have to not be specific to IMA, so it's going to have to become
> > something generic like a notifier chain.  The other problem is it's
> > only working still by accident:
> 
> I had thought about this also but the rationale was:
> 
> securityfs is compiled due to CONFIG_IMA_NS and the user namespace exists
> there and that has a pointer now to ima_namespace, which can have that
> callback. I assumed that other namespaced subsystems could also be reached
> then via such a callback, but I don't know.
> 
> I suppose any late filesystem init callchain would have to be connected to
> the user_namespace somehow?
> 
> 
> > 
> > > +int ima_fs_ns_init(struct ima_namespace *ns)
> > > +{
> > > +	ns->mount = securityfs_ns_create_mount(ns->user_ns);
> > This actually triggers on the call to securityfs_init_fs_context, but
> > nothing happens because the callback is null.  Every subsequent use of
> > > fscontext will trigger this.  The point of a keyed superblock is that
> > fill_super is only called once per key, that's the place we should be
> > doing this.   It should also probably be a blocking notifier so any
> > consumer of securityfs can be namespaced by registering for this
> > notifier.
> 
> 
> What I don't like about the fill_super is that it gets called too early:
> 
> [   67.058611] securityfs_ns_create_mount @ 102 target user_ns:
> ffff95c010698c80; nr_extents: 0
> [   67.059836] securityfs_fill_super @ 47  user_ns: ffff95c010698c80;
> nr_extents: 0
> 
> We are switching to the target user namespace in securityfs_ns_create_mount.
> The expected nr_extents at this point is 0, since user_ns hasn't been
> configured, yet. But then security_fill_super is also called with nr_extents
> 0. We cannot use that, it's too early!

So the problem is that someone could mount securityfs before any
idmappings are set up or what? How does moving the setup to a later stage
help at all? I'm struggling to make sense of this. When or even if
idmappings are written isn't under IMA's control. Someone could mount
securityfs without any idmappings set up. In that case they should get
what they deserve, everything owned by overflowuid/overflowgid, no? Or
you can require in fill_super that kuid 0 and kgid 0 are mapped and fail
if they aren't.
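
A userspace sketch of such a check, under the assumption that "mapped" means
ID 0 inside the namespace has an entry in the uid and gid maps. The extent
layout loosely follows the first/lower_first/count triple written to
uid_map; the struct names and check_root_mapped() are made up for this
illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define INVALID_ID ((uint32_t)-1)

/* One mapping extent: IDs [first, first + count) map to lower_first + offset. */
struct id_extent { uint32_t first, lower_first, count; };
struct id_map { uint32_t nr_extents; struct id_extent extent[5]; };

/* Translate an ID inside the namespace to the ID outside it. */
static uint32_t map_id_down(const struct id_map *map, uint32_t id)
{
	for (uint32_t i = 0; i < map->nr_extents; i++) {
		const struct id_extent *e = &map->extent[i];

		if (id >= e->first && id - e->first < e->count)
			return e->lower_first + (id - e->first);
	}
	return INVALID_ID;
}

/* Hypothetical check a fill_super could make before creating entries. */
static int check_root_mapped(const struct id_map *uid_map,
			     const struct id_map *gid_map)
{
	if (map_id_down(uid_map, 0) == INVALID_ID ||
	    map_id_down(gid_map, 0) == INVALID_ID)
		return -EINVAL;
	return 0;
}
```

With nr_extents still 0, as in the log quoted above, the check fails; once
an extent covering ID 0 has been written it passes.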
Christian Brauner Dec. 6, 2021, 2:13 p.m. UTC | #15
On Mon, Dec 06, 2021 at 08:38:29AM -0500, James Bottomley wrote:
> On Mon, 2021-12-06 at 13:08 +0100, Christian Brauner wrote:
> > On Fri, Dec 03, 2021 at 11:37:14AM -0800, Casey Schaufler wrote:
> > > On 12/3/2021 10:50 AM, James Bottomley wrote:
> > > > On Fri, 2021-12-03 at 13:06 -0500, Stefan Berger wrote:
> > > > > On 12/3/21 12:03, James Bottomley wrote:
> > > > > > On Thu, 2021-12-02 at 21:31 -0500, Stefan Berger wrote:
> > > > > > [...]
> > > > > > >    static int securityfs_init_fs_context(struct fs_context
> > > > > > > *fc)
> > > > > > >    {
> > > > > > > +	int rc;
> > > > > > > +
> > > > > > > +	if (fc->user_ns->ima_ns->late_fs_init) {
> > > > > > > +		rc = fc->user_ns->ima_ns->late_fs_init(fc-
> > > > > > > >user_ns);
> > > > > > > +		if (rc)
> > > > > > > +			return rc;
> > > > > > > +	}
> > > > > > >    	fc->ops = &securityfs_context_ops;
> > > > > > >    	return 0;
> > > > > > >    }
> > > > > > I know I suggested this, but to get this to work in general,
> > > > > > it's going to have to not be specific to IMA, so it's going
> > > > > > to have to become something generic like a notifier
> > > > > > chain.  The other problem is it's only working still by
> > > > > > accident:
> > > > >  
> > > > > I had thought about this also but the rationale was:
> > > > > 
> > > > > securityfs is compiled due to CONFIG_IMA_NS and the user
> > > > > namespace exists there and that has a pointer now to
> > > > > ima_namespace, which can have that callback. I assumed that
> > > > > other namespaced subsystems could also be  reached then via
> > > > > such a callback, but I don't know.
> > > >  
> > > > Well securityfs is supposed to exist for LSMs.  At some point
> > > > each of those is going to need to be namespaced, which may
> > > > eventually be quite a pile of callbacks, which is why I thought
> > > > of a notifier.
> > > 
> > > While AppArmor, lockdown and the integrity family use securityfs,
> > > SELinux and Smack do not. They have their own independent
> > > filesystems. Implementations of namespacing for each of SELinux and
> > > Smack have been proposed, but nothing has been adopted. It would be
> > > really handy to namespace the infrastructure rather than each
> > > individual LSM, but I fear that's a bigger project than anyone will
> > > be taking on any time soon. It's likely to encounter many of the
> > > same issues that I've been dealing with for module stacking.
> > 
> > The main thing that bothers me is that it uses simple_pin_fs() and
> > simple_unpin_fs() which I would try hard to get rid of if possible.
> > The existence of this global pinning logic makes namespacing it
> > properly more difficult then it needs to be and it creates imho wonky
> > semantics where the last unmount doesn't really destroy the
> > superblock.
> 
> So in the notifier sketch I posted, I got rid of the pinning but only
> for the non root user namespace use case ... which basically means only
> for converted consumers of securityfs.  The last unmount of securityfs
> inside the namespace now does destroy the superblock ... I checked.

Yeah, I saw. I'm struggling to follow the series but I pulled Stefan's
branch and put your patch on top of it so I can peruse it.

> 
> The same isn't true for the last unmount of the root namespace, but
> that has to be so to keep the current semantics.
> 
> >  Instead subsequents mounts resurface the same superblock. There
> > might be an inherent design reason why this needs to be this way but
> > I would advise against these semantics for anything that wants to be
> > namespaced. Probably the first securityfs mount in init_user_ns can
> > follow these semantics but ones tied to a non-initial user namespace
> > should not as the userns can go away. In that case the pinning logic
> > seems strange as conceptually the userns pins the securityfs mount as
> > evidenced by the fact that we key by it in get_tree_keyed().
> 
> Yes, that's basically what I did: pin if ns == &init_user_ns but don't
> pin if not.  However, I'm still not sure I got the triggers right.  We
> have to trigger the notifier call (which adds the namespaced file
> entries) from context free, because that's the first place the
> superblock mount is fully set up ... I can't do it in fill_super
> because the mount isn't fully initialized (and the locking prevents
> it).  I did manage to get the notifier for teardown triggered from
> kill_super, though.

Once Stefan answers my questions about fill_super I _might_ have an idea
how to improve this.
James Bottomley Dec. 6, 2021, 2:21 p.m. UTC | #16
On Mon, 2021-12-06 at 15:11 +0100, Christian Brauner wrote:
> On Fri, Dec 03, 2021 at 01:06:13PM -0500, Stefan Berger wrote:
> > On 12/3/21 12:03, James Bottomley wrote:
[...]
> > > > +int ima_fs_ns_init(struct ima_namespace *ns)
> > > > +{
> > > > +	ns->mount = securityfs_ns_create_mount(ns->user_ns);
> > >  
> > > This actually triggers on the call to securityfs_init_fs_context,
> > > but nothing happens because the callback is null.  Every
> > > subsequent use of fscontext will trigger this.  The point of a
> > > > keyed superblock is that fill_super is only called once per key,
> > > that's the place we should be doing this.   It should also
> > > probably be a blocking notifier so any consumer of securityfs can
> > > be namespaced by registering for this notifier.
> > 
> > What I don't like about the fill_super is that it gets called too
> > early:
> > 
> > [   67.058611] securityfs_ns_create_mount @ 102 target user_ns:
> > ffff95c010698c80; nr_extents: 0
> > [   67.059836] securityfs_fill_super @ 47  user_ns:
> > ffff95c010698c80;
> > nr_extents: 0
> > 
> > We are switching to the target user namespace in
> > securityfs_ns_create_mount.  The expected nr_extents at this point
> > is 0, since user_ns hasn't been configured, yet. But then
> > security_fill_super is also called with nr_extents 0. We cannot use
> > that, it's too early!
> 
> So the problem is that someone could mount securityfs before any
> idmappings are setup or what?

Yes, not exactly: we put a call to initialize IMA in create_user_ns()
but it's too early to have the mappings, so we can't create the
securityfs entries in that call.  We need the inode to pick up the root
owner from the s_user_ns mappings, so we can't create the dentries for
the IMA securityfs entries until those mappings exist.

I'm assuming that by the time someone tries to mount securityfs inside
the namespace, the mappings are set up, which is why triggering the
notifier to add the files on first mount seems like the best place to
put it.
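
Roughly, the "add the files on first mount" idea could look like the
following userspace model. It is only a sketch: the event names echo the
ones used in this thread, but the enum values, the callback signature, the
securityfs_populated flag and every function below are assumptions for
illustration, not the code from the posted patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative event names; the values are assumptions. */
enum securityfs_event { SECURITYFS_NS_ADD, SECURITYFS_NS_REMOVE };

/* Userspace stand-in: a flag plus a counter of created entries. */
struct user_namespace { bool securityfs_populated; int entries; };

typedef int (*notifier_fn)(enum securityfs_event ev, struct user_namespace *ns);

#define MAX_NOTIFIERS 8
static notifier_fn chain[MAX_NOTIFIERS];
static size_t chain_len;

/* Hypothetical registration hook for namespaced securityfs consumers. */
static int securityfs_register_notifier(notifier_fn fn)
{
	if (chain_len == MAX_NOTIFIERS)
		return -1;
	chain[chain_len++] = fn;
	return 0;
}

/* Call every registered consumer in order; stop at the first error. */
static int call_chain(enum securityfs_event ev, struct user_namespace *ns)
{
	for (size_t i = 0; i < chain_len; i++) {
		int rc = chain[i](ev, ns);

		if (rc)
			return rc;
	}
	return 0;
}

/* Runs on every mount; the flag provides the once-per-namespace property. */
static int securityfs_mounted(struct user_namespace *ns)
{
	int rc;

	if (ns->securityfs_populated)
		return 0;
	rc = call_chain(SECURITYFS_NS_ADD, ns);
	if (rc == 0)
		ns->securityfs_populated = true;
	return rc;
}

/* Runs when the superblock is torn down. */
static void securityfs_killed(struct user_namespace *ns)
{
	if (!ns->securityfs_populated)
		return;
	call_chain(SECURITYFS_NS_REMOVE, ns);
	ns->securityfs_populated = false;
}

/* Example consumer standing in for IMA: adds or removes its five files. */
static int ima_notify(enum securityfs_event ev, struct user_namespace *ns)
{
	ns->entries += (ev == SECURITYFS_NS_ADD) ? 5 : -5;
	return 0;
}
```

A second mount in the same namespace does not add the entries again, and a
mount after teardown repopulates them.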

>  How does moving the setup to a later stage help at all? I'm
> struggling to make sense of this.

It's not moving all the setup, just the creation of the securityfs
entries.

>  When or even if idmappings are written isn't under imas control.
> Someone could mount securityfs without any idmappings setup. In that
> case they should get what they deserve, everything owner by
> overflowuid/overflowgid, no?

Right, in the current scheme of doing things, if they still haven't
written the mappings by the time they do the mount, they're just going
to get nobody/nogroup as uid/gid, but that's their own fault.

> Or you can require in fill_super that kuid 0 and kgid 0 are mapped
> and fail if they aren't.

We can't create the securityfs entries in fill_super ... I already
tried and the locking just won't allow it.  And if we create them ahead
of time, that creation of the entries will trigger fill_super because we
need the superblock to hang the dentries off.

James
Christian Brauner Dec. 6, 2021, 2:42 p.m. UTC | #17
On Mon, Dec 06, 2021 at 09:21:15AM -0500, James Bottomley wrote:
> On Mon, 2021-12-06 at 15:11 +0100, Christian Brauner wrote:
> > On Fri, Dec 03, 2021 at 01:06:13PM -0500, Stefan Berger wrote:
> > > On 12/3/21 12:03, James Bottomley wrote:
> [...]
> > > > > +int ima_fs_ns_init(struct ima_namespace *ns)
> > > > > +{
> > > > > +	ns->mount = securityfs_ns_create_mount(ns->user_ns);
> > > >  
> > > > This actually triggers on the call to securityfs_init_fs_context,
> > > > but nothing happens because the callback is null.  Every
> > > > subsequent use of fscontext will trigger this.  The point of a
> > > > > keyed superblock is that fill_super is only called once per key,
> > > > that's the place we should be doing this.   It should also
> > > > probably be a blocking notifier so any consumer of securityfs can
> > > > be namespaced by registering for this notifier.
> > > 
> > > What I don't like about the fill_super is that it gets called too
> > > early:
> > > 
> > > [   67.058611] securityfs_ns_create_mount @ 102 target user_ns:
> > > ffff95c010698c80; nr_extents: 0
> > > [   67.059836] securityfs_fill_super @ 47  user_ns:
> > > ffff95c010698c80;
> > > nr_extents: 0
> > > 
> > > We are switching to the target user namespace in
> > > securityfs_ns_create_mount.  The expected nr_extents at this point
> > > is 0, since user_ns hasn't been configured, yet. But then
> > > security_fill_super is also called with nr_extents 0. We cannot use
> > > that, it's too early!
> > 
> > So the problem is that someone could mount securityfs before any
> > idmappings are setup or what?
> 
> Yes, not exactly: we put a call to initialize IMA in create_user_ns()
> but it's too early to have the mappings, so we can't create the
> securityfs entries in that call.  We need the inode to pick up the root
> owner from the s_user_ns mappings, so we can't create the dentries for
> the IMA securityfs entries until those mappings exist.
> 
> I'm assuming that by the time someone tries to mount securityfs inside
> the namespace, the mappings are set up, which is why triggering the
> notifier to add the files on first mount seems like the best place to
> put it.
> 
> >  How does moving the setup to a later stage help at all? I'm
> > struggling to make sense of this.
> 
> It's not moving all the setup, just the creation of the securityfs
> entries.
> 
> >  When or even if idmappings are written isn't under imas control.
> > Someone could mount securityfs without any idmappings setup. In that
> > case they should get what they deserve, everything owner by
> > overflowuid/overflowgid, no?
> 
> Right, in the current scheme of doing things, if they still haven't
> written the mappings by the time they do the mount, they're just going
> to get nobody/nogroup as uid/gid, but that's their own fault.
> 
> > Or you can require in fill_super that kuid 0 and kgid 0 are mapped
> > and fail if they aren't.
> 
> We can't create the securityfs entries in fill_super ... I already
> tried and the locking just won't allow it.  And if we create them ahead

What is the locking issue there exactly?

I'm looking at ima_fs_ns_late_init() and there's nothing there that
would cause obvious issues. You might not be able to use
securityfs_create_*() in there for some reason but that just means you
need to add a simple helper. Nearly every filesystem that needs to
pre-create files does it in fill_super. So I really fail to see what the
issue is currently. I must just be missing something obvious.
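
As a rough picture of what pre-creating the files from fill_super with a
simple helper could mean, here is a table-driven userspace sketch. The file
names and modes come from the listing in the commit message; the struct
layouts, sb_create_file() and this fill_super() are stand-ins invented for
the illustration and say nothing about the locking in the real code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct file_descr { const char *name; unsigned int mode; };

/* Names and modes as shown by "ls -l" in the commit message. */
static const struct file_descr ima_files[] = {
	{ "ascii_runtime_measurements",  0440 },
	{ "binary_runtime_measurements", 0440 },
	{ "policy",                      0600 },
	{ "runtime_measurements_count",  0440 },
	{ "violations",                  0440 },
};

#define MAX_ENTRIES 16

/* Userspace stand-in for a superblock: just a flat list of entries. */
struct super_block {
	struct file_descr entries[MAX_ENTRIES];
	size_t nr_entries;
};

/* Hypothetical helper that adds one entry to the superblock. */
static int sb_create_file(struct super_block *sb, const struct file_descr *d)
{
	if (sb->nr_entries == MAX_ENTRIES)
		return -1;
	sb->entries[sb->nr_entries++] = *d;
	return 0;
}

/* Populate the superblock from the static table, once per superblock. */
static int fill_super(struct super_block *sb)
{
	for (size_t i = 0; i < sizeof(ima_files) / sizeof(ima_files[0]); i++) {
		int rc = sb_create_file(sb, &ima_files[i]);

		if (rc)
			return rc;
	}
	return 0;
}
```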
James Bottomley Dec. 6, 2021, 2:51 p.m. UTC | #18
On Mon, 2021-12-06 at 15:42 +0100, Christian Brauner wrote:
> On Mon, Dec 06, 2021 at 09:21:15AM -0500, James Bottomley wrote:
> > On Mon, 2021-12-06 at 15:11 +0100, Christian Brauner wrote:
> > > On Fri, Dec 03, 2021 at 01:06:13PM -0500, Stefan Berger wrote:
> > > > On 12/3/21 12:03, James Bottomley wrote:
> > [...]
> > > > > > +int ima_fs_ns_init(struct ima_namespace *ns)
> > > > > > +{
> > > > > > +	ns->mount = securityfs_ns_create_mount(ns->user_ns);
> > > > >  
> > > > > This actually triggers on the call to
> > > > > securityfs_init_fs_context,
> > > > > but nothing happens because the callback is null.  Every
> > > > > subsequent use of fscontext will trigger this.  The point of
> > > > > a
> > > > > > keyed superblock is that fill_super is only called once per
> > > > > key,
> > > > > that's the place we should be doing this.   It should also
> > > > > probably be a blocking notifier so any consumer of securityfs
> > > > > can
> > > > > be namespaced by registering for this notifier.
> > > > 
> > > > What I don't like about the fill_super is that it gets called
> > > > too
> > > > early:
> > > > 
> > > > [   67.058611] securityfs_ns_create_mount @ 102 target user_ns:
> > > > ffff95c010698c80; nr_extents: 0
> > > > [   67.059836] securityfs_fill_super @ 47  user_ns:
> > > > ffff95c010698c80;
> > > > nr_extents: 0
> > > > 
> > > > We are switching to the target user namespace in
> > > > securityfs_ns_create_mount.  The expected nr_extents at this
> > > > point
> > > > is 0, since user_ns hasn't been configured, yet. But then
> > > > securityfs_fill_super is also called with nr_extents 0. We cannot
> > > > use
> > > > that, it's too early!
> > > 
> > > So the problem is that someone could mount securityfs before any
> > > idmappings are setup or what?
> > 
> > Yes, not exactly: we put a call to initialize IMA in
> > create_user_ns()
> > but it's too early to have the mappings, so we can't create the
> > securityfs entries in that call.  We need the inode to pick up the
> > root
> > owner from the s_user_ns mappings, so we can't create the dentries
> > for
> > the IMA securityfs entries until those mappings exist.
> > 
> > I'm assuming that by the time someone tries to mount securityfs
> > inside
> > the namespace, the mappings are set up, which is why triggering the
> > notifier to add the files on first mount seems like the best place
> > to
> > put it.
> > 
> > >  How does moving the setup to a later stage help at all? I'm
> > > struggling to make sense of this.
> > 
> > It's not moving all the setup, just the creation of the securityfs
> > entries.
> > 
> > >  When or even if idmappings are written isn't under IMA's control.
> > > Someone could mount securityfs without any idmappings setup. In
> > > that
> > > case they should get what they deserve, everything owned by
> > > overflowuid/overflowgid, no?
> > 
> > Right, in the current scheme of doing things, if they still haven't
> > written the mappings by the time they do the mount, they're just
> > going
> > to get nobody/nogroup as uid/gid, but that's their own fault.
> > 
> > > Or you can require in fill_super that kuid 0 and kgid 0 are
> > > mapped
> > > and fail if they aren't.
> > 
> > We can't create the securityfs entries in fill_super ... I already
> > tried and the locking just won't allow it.  And if we create them
> > ahead
> 
> What is the locking issue there exactly?

The main problem is we have no vfsmount and we can't create one in
there because the fill super is triggered by the vfsmount creation for
the actual mount.  It's all done under the sb->s_umount semaphore.

> I'm looking at ima_fs_ns_late_init() and there's nothing there that
> would cause obvious issues. You might not be able to use
> securityfs_create_*() in there for some reason but that just means
> you need to add a simple helper. Nearly every filesystem that needs
> to pre-create files does it in fill_super. So I really fail to see
> what the issue is currently. I must just be missing something obvious.

I think we might get it to work if we keep the root dentry in the
securityfs namespace entries instead of the vfsmount; I'll investigate.

James
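
The "fill_super is only called once per key" property that this exchange relies on can be modeled in userspace as follows. This is a sketch under stated assumptions: `get_tree_keyed_model()`, `struct sb_model`, `struct sb_table` and `count_fill()` are hypothetical stand-ins and do not mirror the kernel's implementation; the key plays the role of the user namespace pointer. The model keeps superblocks forever, whereas a real superblock can be destroyed on last unmount and filled again later.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SB 8

struct sb_model {
	const void *key;        /* stands in for the user namespace pointer */
	int fill_calls;         /* how often the fill callback ran for this sb */
};

struct sb_table {
	struct sb_model sb[MAX_SB];
	size_t nr;
};

typedef int (*fill_super_fn)(struct sb_model *sb);

/* Return the superblock for @key, creating and filling it on first use. */
static struct sb_model *get_tree_keyed_model(struct sb_table *t,
					     const void *key,
					     fill_super_fn fill)
{
	struct sb_model *sb;

	assert(t && fill);

	/* An existing superblock for this key is reused without refilling. */
	for (size_t i = 0; i < t->nr; i++)
		if (t->sb[i].key == key)
			return &t->sb[i];

	if (t->nr == MAX_SB)
		return NULL;

	sb = &t->sb[t->nr];
	sb->key = key;
	sb->fill_calls = 0;
	if (fill(sb) != 0)
		return NULL;    /* slot is not committed on failure */
	t->nr++;
	return sb;
}

/* Example fill callback: this is where entries would be pre-created. */
static int count_fill(struct sb_model *sb)
{
	sb->fill_calls++;
	return 0;
}
```

Mounting twice with the same key returns the same object with `fill_calls == 1`, which is why the fill callback is a natural place for one-time population of namespaced entries.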
Christian Brauner Dec. 6, 2021, 3:44 p.m. UTC | #19
On Mon, Dec 06, 2021 at 08:38:29AM -0500, James Bottomley wrote:
> On Mon, 2021-12-06 at 13:08 +0100, Christian Brauner wrote:
> > On Fri, Dec 03, 2021 at 11:37:14AM -0800, Casey Schaufler wrote:
> > > On 12/3/2021 10:50 AM, James Bottomley wrote:
> > > > On Fri, 2021-12-03 at 13:06 -0500, Stefan Berger wrote:
> > > > > On 12/3/21 12:03, James Bottomley wrote:
> > > > > > On Thu, 2021-12-02 at 21:31 -0500, Stefan Berger wrote:
> > > > > > [...]
> > > > > > >    static int securityfs_init_fs_context(struct fs_context
> > > > > > > *fc)
> > > > > > >    {
> > > > > > > +	int rc;
> > > > > > > +
> > > > > > > +	if (fc->user_ns->ima_ns->late_fs_init) {
> > > > > > > +		rc = fc->user_ns->ima_ns->late_fs_init(fc-
> > > > > > > >user_ns);
> > > > > > > +		if (rc)
> > > > > > > +			return rc;
> > > > > > > +	}
> > > > > > >    	fc->ops = &securityfs_context_ops;
> > > > > > >    	return 0;
> > > > > > >    }
> > > > > > I know I suggested this, but to get this to work in general,
> > > > > > it's going to have to not be specific to IMA, so it's going
> > > > > > to have to become something generic like a notifier
> > > > > > chain.  The other problem is it's only working still by
> > > > > > accident:
> > > > >  
> > > > > I had thought about this also but the rationale was:
> > > > > 
> > > > > securityfs is compiled due to CONFIG_IMA_NS and the user
> > > > > namespace exists there and that has a pointer now to
> > > > > ima_namespace, which can have that callback. I assumed that
> > > > > other namespaced subsystems could also be  reached then via
> > > > > such a callback, but I don't know.
> > > >  
> > > > Well securityfs is supposed to exist for LSMs.  At some point
> > > > each of those is going to need to be namespaced, which may
> > > > eventually be quite a pile of callbacks, which is why I thought
> > > > of a notifier.
> > > 
> > > While AppArmor, lockdown and the integrity family use securityfs,
> > > SELinux and Smack do not. They have their own independent
> > > filesystems. Implementations of namespacing for each of SELinux and
> > > Smack have been proposed, but nothing has been adopted. It would be
> > > really handy to namespace the infrastructure rather than each
> > > individual LSM, but I fear that's a bigger project than anyone will
> > > be taking on any time soon. It's likely to encounter many of the
> > > same issues that I've been dealing with for module stacking.
> > 
> > The main thing that bothers me is that it uses simple_pin_fs() and
> > simple_unpin_fs() which I would try hard to get rid of if possible.
> > The existence of this global pinning logic makes namespacing it
> > > properly more difficult than it needs to be and it creates imho wonky
> > semantics where the last unmount doesn't really destroy the
> > superblock.
> 
> So in the notifier sketch I posted, I got rid of the pinning but only
> for the non root user namespace use case ... which basically means only
> for converted consumers of securityfs.  The last unmount of securityfs
> inside the namespace now does destroy the superblock ... I checked.
> 
> The same isn't true for the last unmount of the root namespace, but
> that has to be so to keep the current semantics.
> 
> >  Instead subsequent mounts resurface the same superblock. There
> > might be an inherent design reason why this needs to be this way but
> > I would advise against these semantics for anything that wants to be
> > namespaced. Probably the first securityfs mount in init_user_ns can
> > follow these semantics but ones tied to a non-initial user namespace
> > should not as the userns can go away. In that case the pinning logic
> > seems strange as conceptually the userns pins the securityfs mount as
> > evidenced by the fact that we key by it in get_tree_keyed().
> 
> Yes, that's basically what I did: pin if ns == &init_user_ns but don't
> pin if not.  However, I'm still not sure I got the triggers right.  We
> have to trigger the notifier call (which adds the namespaced file
> entries) from context free, because that's the first place the
> superblock mount is fully set up ... I can't do it in fill_super
> because the mount isn't fully initialized (and the locking prevents
> it).  I did manage to get the notifier for teardown triggered from
> kill_super, though.

I don't think you need a vfsmount at all to be honest. I think this can
all be done without much ceremony. Here's a brutalist completely
untested patch outlining one approach:

From 4fc2d88d4194e3473fd545864a8bb0759036ed5e Mon Sep 17 00:00:00 2001
From: Christian Brauner <christian.brauner@ubuntu.com>
Date: Mon, 6 Dec 2021 14:08:28 +0100
Subject: [PATCH] !!!! HERE BE DRAGONS - COMPLETELY UNTESTED !!!!

---
 include/linux/securityfs.h      |  20 +++++
 include/linux/user_namespace.h  |   1 +
 kernel/user_namespace.c         |   3 +
 security/inode.c                | 129 ++++++++++++++++++++++++++++++--
 security/integrity/ima/ima.h    |   1 +
 security/integrity/ima/ima_fs.c |  20 ++++-
 6 files changed, 165 insertions(+), 9 deletions(-)
 create mode 100644 include/linux/securityfs.h

diff --git a/include/linux/securityfs.h b/include/linux/securityfs.h
new file mode 100644
index 000000000000..2e973be160b1
--- /dev/null
+++ b/include/linux/securityfs.h
@@ -0,0 +1,20 @@
+#ifndef __LINUX_SECURITYFS_H
+#define __LINUX_SECURITYFS_H
+
+struct vfsmount;
+
+#ifdef CONFIG_SECURITYFS
+
+/*
+ * Allocated once per user_ns the first time securityfs is mounted.  Can be
+ * used to stash securityfs relevant state that absolutely needs to survive
+ * super_block destruction on last umount.
+ */
+struct securityfs_info {
+	// pointer to relevant ima stuff or instance?
+	// pointer to relevant apparmor stuff or instance?
+	// pointer to relevant selinux stuff or instance?
+};
+#endif /* CONFIG_SECURITYFS */
+
+#endif /* ! __LINUX_SECURITYFS_H */
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 6b8bd060d8c4..42676f5bcd43 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -103,6 +103,7 @@ struct user_namespace {
 #ifdef CONFIG_IMA
 	struct ima_namespace	*ima_ns;
 #endif
+	struct securityfs_info *securityfs_info;
 #ifdef CONFIG_SECURITYFS
 	struct vfsmount		*securityfs_mount;
 	bool			securityfs_notifier_sent;
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index c26885343b19..d65b20d8a90b 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -211,6 +211,9 @@ static void free_user_ns(struct work_struct *work)
 		}
 #ifdef CONFIG_IMA
 		put_ima_ns(ns->ima_ns);
+#endif
+#ifdef CONFIG_SECURITYFS
+		kfree(ns->securityfs_info);
 #endif
 		retire_userns_sysctls(ns);
 		key_free_user_ns(ns);
diff --git a/security/inode.c b/security/inode.c
index 62ab4630dc31..1c3b2797367d 100644
--- a/security/inode.c
+++ b/security/inode.c
@@ -66,18 +66,57 @@ static void securityfs_free_context(struct fs_context *fc)
 	mntput(ns->securityfs_mount);
 }
 
+
+/* 
+ * This is really just a helper we would need in case we wanted to retrieve
+ * securityfs_info independent of the super_block. If that's not needed, then
+ * you can as well remove the smp_load_acquire() and the associated
+ * smp_store_release().
+ */
+struct securityfs_info *to_securityfs_info(struct user_namespace *user_ns)
+{
+
+	return smp_load_acquire(&user_ns->securityfs_info);
+}
+
 static int securityfs_fill_super(struct super_block *sb, struct fs_context *fc)
 {
+	int err;
 	static const struct tree_descr files[] = {{""}};
-	int error;
-
-	error = simple_fill_super(sb, SECURITYFS_MAGIC, files);
-	if (error)
-		return error;
+	struct user_namespace *user_ns = sb->s_user_ns;
+	struct securityfs_info *securityfs_info;
+
+	/*
+	 * Allocate a new securityfs_info instance for this userns.
+	 * While multiple superblocks can exist they are keyed by userns in
+	 * s_fs_info for user_ns. Hence, the vfs guarantees that
+	 * securityfs_fill_super() is called exactly once whenever a
+	 * securityfs superblock for a userns is created. This in turn lets us
+	 * conclude that when a securityfs superblock is created for the first
+	 * time for a userns there's no one racing us. Therefore we don't need
+	 * any barriers when we dereference securityfs_info.
+	 */
+	securityfs_info = user_ns->securityfs_info;
+	if (!securityfs_info) {
+		securityfs_info = kzalloc(sizeof(struct securityfs_info), GFP_KERNEL);
+		if (!securityfs_info)
+			return -ENOMEM;
+
+		// TODO: Initialize securityfs_info
+
+		/* 
+		 * Pairs with smp_load_acquire() in to_securityfs_info().
+		 *
+		 * Please see the comment there.
+		 */
+		smp_store_release(&user_ns->securityfs_info, securityfs_info);
+	}
 
-	sb->s_op = &securityfs_super_operations;
+	err = simple_fill_super(sb, SECURITYFS_MAGIC, files);
+	if (!err)
+		sb->s_op = &securityfs_super_operations;
 
-	return 0;
+	return err ? err : ima_fs_ns_late_init(sb);
 }
 
 static int securityfs_get_tree(struct fs_context *fc)
@@ -237,6 +276,82 @@ static struct dentry *securityfs_create_dentry(const char *name, umode_t mode,
 	return dentry;
 }
 
+struct dentry *securityfs_create_dentry_ns(struct super_block *sb,
+					   const char *name, umode_t mode,
+					   struct dentry *parent, void *data,
+					   const struct file_operations *fops,
+					   const struct inode_operations *iops)
+{
+	struct dentry *dentry;
+	struct inode *dir, *inode;
+	int error;
+	struct user_namespace *ns = sb->s_user_ns;
+
+	if (!(mode & S_IFMT))
+		mode = (mode & S_IALLUGO) | S_IFREG;
+
+	pr_debug("securityfs: creating file '%s', ns=%u\n",name, ns->ns.inum);
+
+	if (ns == &init_user_ns) {
+		error = simple_pin_fs(&fs_type, &ns->securityfs_mount,
+				      &securityfs_mount_count);
+		if (error)
+			return ERR_PTR(error);
+	}
+
+	/* You really just require to always pass the parent? */
+	if (!parent)
+		parent = sb->s_root;
+
+	dir = d_inode(parent);
+
+	inode_lock(dir);
+	dentry = lookup_one_len(name, parent, strlen(name));
+	if (IS_ERR(dentry))
+		goto out;
+
+	if (d_really_is_positive(dentry)) {
+		error = -EEXIST;
+		goto out1;
+	}
+
+	inode = new_inode(dir->i_sb);
+	if (!inode) {
+		error = -ENOMEM;
+		goto out1;
+	}
+
+	inode->i_ino = get_next_ino();
+	inode->i_mode = mode;
+	inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
+	inode->i_private = data;
+	if (S_ISDIR(mode)) {
+		inode->i_op = &simple_dir_inode_operations;
+		inode->i_fop = &simple_dir_operations;
+		inc_nlink(inode);
+		inc_nlink(dir);
+	} else if (S_ISLNK(mode)) {
+		inode->i_op = iops ? iops : &simple_symlink_inode_operations;
+		inode->i_link = data;
+	} else {
+		inode->i_fop = fops;
+	}
+	d_instantiate(dentry, inode);
+	dget(dentry);
+	inode_unlock(dir);
+	return dentry;
+
+out1:
+	dput(dentry);
+	dentry = ERR_PTR(error);
+out:
+	inode_unlock(dir);
+	if (ns == &init_user_ns)
+		simple_release_fs(&ns->securityfs_mount,
+				  &securityfs_mount_count);
+	return dentry;
+}
+
 /**
  * securityfs_create_file - create a file in the securityfs filesystem
  *
diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
index 12b7df65a5ff..806f19215052 100644
--- a/security/integrity/ima/ima.h
+++ b/security/integrity/ima/ima.h
@@ -140,6 +140,7 @@ struct ns_status {
 int ima_init(void);
 int ima_fs_init(void);
 void ima_fs_ns_free(void);
+int ima_fs_ns_late_init(struct super_block *sb);
 int ima_add_template_entry(struct ima_namespace *ns,
 			   struct ima_template_entry *entry, int violation,
 			   const char *op, struct inode *inode,
diff --git a/security/integrity/ima/ima_fs.c b/security/integrity/ima/ima_fs.c
index 26f26e8756a8..4b25912db448 100644
--- a/security/integrity/ima/ima_fs.c
+++ b/security/integrity/ima/ima_fs.c
@@ -500,11 +500,27 @@ static void ima_fs_ns_free_dentries(struct ima_namespace *ns)
 /* Function to populeate namespace SecurityFS once user namespace
  * has been configured.
  */
-static int ima_fs_ns_late_init(struct user_namespace *user_ns)
+int ima_fs_ns_late_init(struct super_block *sb)
 {
-	struct ima_namespace *ns = user_ns->ima_ns;
+	/*
+	 * We know that s_user_ns === ima_ns->user_ns.
+	 *
+	 * In other words, here we can go from superblock to relevant
+	 * namespaces never from namespace to superblock. Ideally we try to
+	 * avoid going from namespace to superblock.
+	 */
+	struct ima_namespace *ns = sb->s_user_ns->ima_ns;
 	struct dentry *parent;
 
+
+	// TODO:
+	//
+	// Port this to use new helpers that take a super_block as argument.
+	//
+	// This allows us to get rid of any vfsmount dependencies.
+	//
+	// Probably should also be renamed to something better.
+
 	/* already initialized? */
 	if (ns->dentry[IMAFS_DENTRY_INTEGRITY_DIR])
 		return 0;
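
The smp_store_release()/smp_load_acquire() pairing used for securityfs_info in the patch above follows the usual "initialize fully, then publish the pointer" pattern. A userspace analogue using C11 atomics might look like the sketch below. It is an illustration only: `struct user_ns_model`, `struct securityfs_info_model`, `publish_info()` and `to_info()` are hypothetical names, and `memory_order_release`/`memory_order_acquire` are used here as stand-ins for the kernel primitives.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdlib.h>

struct securityfs_info_model {
	int initialized;        /* placeholder for real per-namespace state */
};

struct user_ns_model {
	_Atomic(struct securityfs_info_model *) info;
};

/* Reader side: may run on any thread, independent of the superblock. */
static struct securityfs_info_model *to_info(struct user_ns_model *ns)
{
	assert(ns != NULL);
	return atomic_load_explicit(&ns->info, memory_order_acquire);
}

/* Writer side: assumed to run only from the single fill_super-like path,
 * so a relaxed load is enough for the "already allocated?" check. */
static int publish_info(struct user_ns_model *ns)
{
	struct securityfs_info_model *info;

	assert(ns != NULL);
	if (atomic_load_explicit(&ns->info, memory_order_relaxed))
		return 0;       /* survives from an earlier mount */

	info = calloc(1, sizeof(*info));
	if (!info)
		return -ENOMEM;

	info->initialized = 1;  /* finish all initialization first ... */

	/* ... then publish; pairs with the acquire load in to_info(). */
	atomic_store_explicit(&ns->info, info, memory_order_release);
	return 0;
}
```

A reader that observes a non-NULL pointer through `to_info()` is then guaranteed to also observe the initialized contents. If no reader exists outside the fill path, plain loads and stores would do, as the comment in the patch notes.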
James Bottomley Dec. 6, 2021, 4:25 p.m. UTC | #20
On Mon, 2021-12-06 at 16:44 +0100, Christian Brauner wrote:
> On Mon, Dec 06, 2021 at 08:38:29AM -0500, James Bottomley wrote:
> > On Mon, 2021-12-06 at 13:08 +0100, Christian Brauner wrote:
[...]
> > >  Instead subsequent mounts resurface the same superblock. There
> > > might be an inherent design reason why this needs to be this way
> > > but I would advise against these semantics for anything that
> > > wants to be namespaced. Probably the first securityfs mount in
> > > init_user_ns can follow these semantics but ones tied to a non-
> > > initial user namespace should not as the userns can go away. In
> > > that case the pinning logic seems strange as conceptually the
> > > userns pins the securityfs mount as evidenced by the fact that we
> > > key by it in get_tree_keyed().
> > 
> > Yes, that's basically what I did: pin if ns == &init_user_ns but
> > don't pin if not.  However, I'm still not sure I got the triggers
> > right.  We have to trigger the notifier call (which adds the
> > namespaced file entries) from context free, because that's the
> > first place the superblock mount is fully set up ... I can't do it
> > in fill_super because the mount isn't fully initialized (and the
> > locking prevents it).  I did manage to get the notifier for
> > teardown triggered from kill_super, though.
> 
> I don't think you need a vfsmount at all to be honest. I think this
> can all be done without much ceremony. Here's a brutalist completely
> untested patch outlining one approach:

This is what I did (incremental to Stefan's series + my previous
patch): it avoids superblock threading by switching to a root dentry in
the securityfs user namespace area ... or am I being too simple again
... ?

I'm still a bit unhappy about triggering a blocking notifier under the
umount semaphore ...

James

---
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 6b8bd060d8c4..03a0879376a0 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -104,8 +104,7 @@ struct user_namespace {
 	struct ima_namespace	*ima_ns;
 #endif
 #ifdef CONFIG_SECURITYFS
-	struct vfsmount		*securityfs_mount;
-	bool			securityfs_notifier_sent;
+	struct dentry		*securityfs_root;
 #endif
 } __randomize_layout;
 
diff --git a/security/inode.c b/security/inode.c
index 62ab4630dc31..863fccfd3687 100644
--- a/security/inode.c
+++ b/security/inode.c
@@ -25,6 +25,7 @@
 #include <linux/user_namespace.h>
 #include <linux/ima.h>
 
+static struct vfsmount *securityfs_mount;
 static int securityfs_mount_count;
 
 static BLOCKING_NOTIFIER_HEAD(securityfs_ns_notifier);
@@ -41,42 +42,22 @@ static const struct super_operations securityfs_super_operations = {
 	.free_inode	= securityfs_free_inode,
 };
 
-static struct file_system_type fs_type;
-
-static void securityfs_free_context(struct fs_context *fc)
-{
-	struct user_namespace *ns = fc->user_ns;
-	if (ns == &init_user_ns ||
-	    ns->securityfs_notifier_sent)
-		return;
-
-	ns->securityfs_notifier_sent = true;
-
-	ns->securityfs_mount = vfs_kern_mount(&fs_type, SB_KERNMOUNT,
-					      fs_type.name, NULL);
-	if (IS_ERR(ns->securityfs_mount)) {
-		printk(KERN_ERR "kern mount on securityfs ERROR: %ld\n",
-		       PTR_ERR(ns->securityfs_mount));
-		ns->securityfs_mount = NULL;
-		return;
-	}
-
-	blocking_notifier_call_chain(&securityfs_ns_notifier,
-				     SECURITYFS_NS_ADD, fc->user_ns);
-	mntput(ns->securityfs_mount);
-}
-
 static int securityfs_fill_super(struct super_block *sb, struct fs_context *fc)
 {
 	static const struct tree_descr files[] = {{""}};
 	int error;
+	struct user_namespace *ns = fc->user_ns;
 
 	error = simple_fill_super(sb, SECURITYFS_MAGIC, files);
 	if (error)
 		return error;
 
+	ns->securityfs_root = sb->s_root;
+
 	sb->s_op = &securityfs_super_operations;
 
+	blocking_notifier_call_chain(&securityfs_ns_notifier,
+				     SECURITYFS_NS_ADD, ns);
 	return 0;
 }
 
@@ -87,7 +68,6 @@ static int securityfs_get_tree(struct fs_context *fc)
 
 static const struct fs_context_operations securityfs_context_ops = {
 	.get_tree	= securityfs_get_tree,
-	.free		= securityfs_free_context,
 };
 
 static int securityfs_init_fs_context(struct fs_context *fc)
@@ -104,8 +84,7 @@ static void securityfs_kill_super(struct super_block *sb)
 		blocking_notifier_call_chain(&securityfs_ns_notifier,
 					     SECURITYFS_NS_REMOVE,
 					     sb->s_fs_info);
-	ns->securityfs_notifier_sent = false;
-	ns->securityfs_mount = NULL;
+	ns->securityfs_root = NULL;
 	kill_litter_super(sb);
 }
 
@@ -179,14 +158,18 @@ static struct dentry *securityfs_create_dentry(const char *name, umode_t mode,
 	pr_debug("securityfs: creating file '%s', ns=%u\n",name, ns->ns.inum);
 
 	if (ns == &init_user_ns) {
-		error = simple_pin_fs(&fs_type, &ns->securityfs_mount,
+		error = simple_pin_fs(&fs_type, &securityfs_mount,
 				      &securityfs_mount_count);
 		if (error)
 			return ERR_PTR(error);
 	}
 
-	if (!parent)
-		parent = ns->securityfs_mount->mnt_root;
+	if (!parent) {
+		if (ns == &init_user_ns)
+			parent = securityfs_mount->mnt_root;
+		else
+			parent = ns->securityfs_root;
+	}
 
 	dir = d_inode(parent);
 
@@ -232,7 +215,7 @@ static struct dentry *securityfs_create_dentry(const char *name, umode_t mode,
 out:
 	inode_unlock(dir);
 	if (ns == &init_user_ns)
-		simple_release_fs(&ns->securityfs_mount,
+		simple_release_fs(&securityfs_mount,
 				  &securityfs_mount_count);
 	return dentry;
 }
@@ -376,7 +359,7 @@ void securityfs_remove(struct dentry *dentry)
 	}
 	inode_unlock(dir);
 	if (ns == &init_user_ns)
-		simple_release_fs(&ns->securityfs_mount,
+		simple_release_fs(&securityfs_mount,
 				  &securityfs_mount_count);
 }
 EXPORT_SYMBOL(securityfs_remove);
@@ -405,8 +388,6 @@ static int __init securityfs_init(void)
 	if (retval)
 		return retval;
 
-	init_user_ns.securityfs_mount = NULL;
-
 	retval = register_filesystem(&fs_type);
 	if (retval) {
 		sysfs_remove_mount_point(kernel_kobj, "security");
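
The notifier idea in this patch (securityfs announces "namespace added" from fill_super and "namespace removed" from kill_super, and each consumer reacts) can be sketched in userspace as below. This is a simplified model, not the kernel notifier API: `struct notifier_chain`, `chain_register()`, `chain_call()`, `struct ns_model` and `ima_consumer()` are hypothetical, the `NS_ADD`/`NS_REMOVE` events are modeled on the `SECURITYFS_NS_ADD`/`SECURITYFS_NS_REMOVE` values used in the thread, and stopping at the first failing callback is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Modeled on SECURITYFS_NS_ADD / SECURITYFS_NS_REMOVE from the thread. */
enum ns_event { NS_ADD, NS_REMOVE };

typedef int (*ns_notifier_fn)(enum ns_event ev, void *ns);

#define MAX_NOTIFIERS 4

struct notifier_chain {
	ns_notifier_fn fn[MAX_NOTIFIERS];
	size_t nr;
};

/* Register one consumer; fails when the fixed-size chain is full. */
static int chain_register(struct notifier_chain *c, ns_notifier_fn fn)
{
	assert(c && fn);
	if (c->nr == MAX_NOTIFIERS)
		return -1;
	c->fn[c->nr++] = fn;
	return 0;
}

/* Call every consumer in registration order; stop at the first failure. */
static int chain_call(struct notifier_chain *c, enum ns_event ev, void *ns)
{
	assert(c != NULL);
	for (size_t i = 0; i < c->nr; i++) {
		int rc = c->fn[i](ev, ns);

		if (rc)
			return rc;
	}
	return 0;
}

/* Stand-in for a namespaced consumer such as IMA. */
struct ns_model {
	int ima_entries;        /* number of entries currently populated */
};

static int ima_consumer(enum ns_event ev, void *data)
{
	struct ns_model *ns = data;

	/* IMAFS_DENTRY_LAST is 8 in the posted series. */
	ns->ima_entries = (ev == NS_ADD) ? 8 : 0;
	return 0;
}
```

The design point is that securityfs itself stays unaware of IMA: any other consumer only needs to register another callback on the same chain.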
Stefan Berger Dec. 6, 2021, 5:22 p.m. UTC | #21
On 12/6/21 09:11, James Bottomley wrote:
> On Mon, 2021-12-06 at 09:03 -0500, Stefan Berger wrote:
>
> It's never too early to see what the series is shaping up as.  However,
> I'm still not sure I got the right trigger for the SECURITYFS_NS_ADD
> notifier, so that may still have to move ... or even that there isn't
> some locking subtlety I missed in triggering SECURITYFS_NS_REMOVE from
> kill_sb.

I'll post v3 with the early changes and your Signed-off-by's added where 
I made changes to files.

    Stefan

Patch

diff --git a/include/linux/ima.h b/include/linux/ima.h
index 889e9c70cbfb..a13f934f15fc 100644
--- a/include/linux/ima.h
+++ b/include/linux/ima.h
@@ -220,6 +220,18 @@  struct ima_h_table {
 	struct hlist_head queue[IMA_MEASURE_HTABLE_SIZE];
 };
 
+enum {
+	IMAFS_DENTRY_INTEGRITY_DIR = 0,
+	IMAFS_DENTRY_DIR,
+	IMAFS_DENTRY_SYMLINK,
+	IMAFS_DENTRY_BINARY_RUNTIME_MEASUREMENTS,
+	IMAFS_DENTRY_ASCII_RUNTIME_MEASUREMENTS,
+	IMAFS_DENTRY_RUNTIME_MEASUREMENTS_COUNT,
+	IMAFS_DENTRY_VIOLATIONS,
+	IMAFS_DENTRY_IMA_POLICY,
+	IMAFS_DENTRY_LAST
+};
+
 struct ima_namespace {
 	struct kref kref;
 	struct user_namespace *user_ns;
@@ -266,6 +278,11 @@  struct ima_namespace {
 	struct mutex ima_write_mutex;
 	unsigned long ima_fs_flags;
 	int valid_policy;
+
+	struct dentry *dentry[IMAFS_DENTRY_LAST];
+	struct vfsmount *mount;
+	int mount_count;
+	int (*late_fs_init)(struct user_namespace *user_ns);
 };
 
 extern struct ima_namespace init_ima_ns;
diff --git a/security/inode.c b/security/inode.c
index 2738a7b31469..6223f1d838f6 100644
--- a/security/inode.c
+++ b/security/inode.c
@@ -22,6 +22,7 @@ 
 #include <linux/lsm_hooks.h>
 #include <linux/magic.h>
 #include <linux/user_namespace.h>
+#include <linux/ima.h>
 
 static struct vfsmount *securityfs_mount;
 static int securityfs_mount_count;
@@ -63,6 +64,13 @@  static const struct fs_context_operations securityfs_context_ops = {
 
 static int securityfs_init_fs_context(struct fs_context *fc)
 {
+	int rc;
+
+	if (fc->user_ns->ima_ns->late_fs_init) {
+		rc = fc->user_ns->ima_ns->late_fs_init(fc->user_ns);
+		if (rc)
+			return rc;
+	}
 	fc->ops = &securityfs_context_ops;
 	return 0;
 }
diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
index bb9763cd5fb1..9bcd71bb716c 100644
--- a/security/integrity/ima/ima.h
+++ b/security/integrity/ima/ima.h
@@ -139,6 +139,8 @@  struct ns_status {
 /* Internal IMA function definitions */
 int ima_init(void);
 int ima_fs_init(void);
+int ima_fs_ns_init(struct ima_namespace *ns);
+void ima_fs_ns_free(struct ima_namespace *ns);
 int ima_add_template_entry(struct ima_namespace *ns,
 			   struct ima_template_entry *entry, int violation,
 			   const char *op, struct inode *inode,
diff --git a/security/integrity/ima/ima_fs.c b/security/integrity/ima/ima_fs.c
index 6766bb8262f2..65b2af7c14dd 100644
--- a/security/integrity/ima/ima_fs.c
+++ b/security/integrity/ima/ima_fs.c
@@ -22,6 +22,7 @@ 
 #include <linux/parser.h>
 #include <linux/vmalloc.h>
 #include <linux/ima.h>
+#include <linux/namei.h>
 
 #include "ima.h"
 
@@ -436,8 +437,14 @@  static int ima_release_policy(struct inode *inode, struct file *file)
 
 	ima_update_policy(ns);
 #if !defined(CONFIG_IMA_WRITE_POLICY) && !defined(CONFIG_IMA_READ_POLICY)
-	securityfs_remove(ima_policy);
-	ima_policy = NULL;
+	if (ns == &init_ima_ns) {
+		securityfs_remove(ima_policy);
+		ima_policy = NULL;
+	} else {
+		securityfs_ns_remove(ns->dentry[IMAFS_DENTRY_IMA_POLICY],
+				     &ns->mount, &ns->mount_count);
+		ns->dentry[IMAFS_DENTRY_IMA_POLICY] = NULL;
+	}
 #elif defined(CONFIG_IMA_WRITE_POLICY)
 	clear_bit(IMA_FS_BUSY, &ns->ima_fs_flags);
 #elif defined(CONFIG_IMA_READ_POLICY)
@@ -509,3 +516,149 @@  int __init ima_fs_init(void)
 	securityfs_remove(ima_policy);
 	return -1;
 }
+
+static void ima_fs_ns_free_dentries(struct ima_namespace *ns)
+{
+	size_t i;
+
+	for (i = 0; i < IMAFS_DENTRY_LAST; i++) {
+		switch (i) {
+		case IMAFS_DENTRY_DIR:
+		case IMAFS_DENTRY_INTEGRITY_DIR:
+			/* files first */
+			continue;
+		}
+		securityfs_ns_remove(ns->dentry[i], &ns->mount, &ns->mount_count);
+	}
+	securityfs_ns_remove(ns->dentry[IMAFS_DENTRY_DIR],
+			     &ns->mount, &ns->mount_count);
+	securityfs_ns_remove(ns->dentry[IMAFS_DENTRY_INTEGRITY_DIR],
+			     &ns->mount, &ns->mount_count);
+
+	memset(ns->dentry, 0, sizeof(ns->dentry));
+
+}
+
+/* Function to populeate namespace SecurityFS once user namespace
+ * has been configured.
+ */
+static int ima_fs_ns_late_init(struct user_namespace *user_ns)
+{
+	struct ima_namespace *ns = user_ns->ima_ns;
+	struct dentry *parent;
+
+	/* already initialized? */
+	if (ns->dentry[IMAFS_DENTRY_INTEGRITY_DIR])
+		return 0;
+
+	ns->dentry[IMAFS_DENTRY_INTEGRITY_DIR] =
+	    securityfs_ns_create_dir("integrity", NULL,
+				     &ns->mount, &ns->mount_count);
+	if (IS_ERR(ns->dentry[IMAFS_DENTRY_INTEGRITY_DIR])) {
+		ns->dentry[IMAFS_DENTRY_INTEGRITY_DIR] = NULL;
+		goto out;
+	}
+
+	ns->dentry[IMAFS_DENTRY_DIR] =
+	    securityfs_ns_create_dir("ima", ns->dentry[IMAFS_DENTRY_INTEGRITY_DIR],
+				     &ns->mount, &ns->mount_count);
+	if (IS_ERR(ns->dentry[IMAFS_DENTRY_DIR])) {
+		ns->dentry[IMAFS_DENTRY_DIR] = NULL;
+		goto out;
+	}
+
+	ns->dentry[IMAFS_DENTRY_SYMLINK] =
+	    securityfs_ns_create_symlink("ima", NULL, "integrity/ima", NULL,
+				     &ns->mount, &ns->mount_count);
+	if (IS_ERR(ns->dentry[IMAFS_DENTRY_SYMLINK])) {
+		ns->dentry[IMAFS_DENTRY_SYMLINK] = NULL;
+		goto out;
+	}
+
+	parent = ns->dentry[IMAFS_DENTRY_DIR];
+	ns->dentry[IMAFS_DENTRY_BINARY_RUNTIME_MEASUREMENTS] =
+	    securityfs_ns_create_file("binary_runtime_measurements",
+				   S_IRUSR | S_IRGRP, parent, NULL,
+				   &ima_measurements_ops,
+				   &ns->mount, &ns->mount_count);
+	if (IS_ERR(ns->dentry[IMAFS_DENTRY_BINARY_RUNTIME_MEASUREMENTS])) {
+		ns->dentry[IMAFS_DENTRY_BINARY_RUNTIME_MEASUREMENTS] = NULL;
+		goto out;
+	}
+
+	ns->dentry[IMAFS_DENTRY_ASCII_RUNTIME_MEASUREMENTS] =
+	    securityfs_ns_create_file("ascii_runtime_measurements",
+				   S_IRUSR | S_IRGRP, parent, NULL,
+				   &ima_ascii_measurements_ops,
+				   &ns->mount, &ns->mount_count);
+	if (IS_ERR(ns->dentry[IMAFS_DENTRY_ASCII_RUNTIME_MEASUREMENTS])) {
+		ns->dentry[IMAFS_DENTRY_ASCII_RUNTIME_MEASUREMENTS] = NULL;
+		goto out;
+	}
+
+	ns->dentry[IMAFS_DENTRY_RUNTIME_MEASUREMENTS_COUNT] =
+	    securityfs_ns_create_file("runtime_measurements_count",
+				   S_IRUSR | S_IRGRP, parent, NULL,
+				   &ima_measurements_count_ops,
+				   &ns->mount, &ns->mount_count);
+	if (IS_ERR(ns->dentry[IMAFS_DENTRY_RUNTIME_MEASUREMENTS_COUNT])) {
+		ns->dentry[IMAFS_DENTRY_RUNTIME_MEASUREMENTS_COUNT] = NULL;
+		goto out;
+	}
+
+	ns->dentry[IMAFS_DENTRY_VIOLATIONS] =
+	    securityfs_ns_create_file("violations", S_IRUSR | S_IRGRP,
+				   parent, NULL, &ima_htable_violations_ops,
+				   &ns->mount, &ns->mount_count);
+	if (IS_ERR(ns->dentry[IMAFS_DENTRY_VIOLATIONS])) {
+		ns->dentry[IMAFS_DENTRY_VIOLATIONS] = NULL;
+		goto out;
+	}
+
+	ns->dentry[IMAFS_DENTRY_IMA_POLICY] =
+	    securityfs_ns_create_file("policy", POLICY_FILE_FLAGS,
+				   parent, NULL, &ima_measure_policy_ops,
+				   &ns->mount, &ns->mount_count);
+	if (IS_ERR(ns->dentry[IMAFS_DENTRY_IMA_POLICY])) {
+		ns->dentry[IMAFS_DENTRY_IMA_POLICY] = NULL;
+		goto out;
+	}
+
+
+	return 0;
+
+out:
+	ima_fs_ns_free_dentries(ns);
+
+	return -1;
+}
+
+int ima_fs_ns_init(struct ima_namespace *ns)
+{
+	ns->mount = securityfs_ns_create_mount(ns->user_ns);
+	if (IS_ERR(ns->mount)) {
+		ns->mount = NULL;
+		return -1;
+	}
+	ns->mount_count = 1;
+
+	/* Adjust the trigger for user namespace's early teardown of dependent
+	 * namespaces. Due to the filesystem there's an additional reference
+	 * to the user namespace.
+	 */
+	ns->user_ns->refcount_teardown += 1;
+
+	ns->late_fs_init = ima_fs_ns_late_init;
+
+	return 0;
+}
+
+void ima_fs_ns_free(struct ima_namespace *ns)
+{
+	ima_fs_ns_free_dentries(ns);
+	if (ns->mount) {
+		mntput(ns->mount);
+		ns->mount_count -= 1;
+	}
+	ns->mount = NULL;
+}
diff --git a/security/integrity/ima/ima_init_ima_ns.c b/security/integrity/ima/ima_init_ima_ns.c
index 22ff74e85a5f..86a89502c0c5 100644
--- a/security/integrity/ima/ima_init_ima_ns.c
+++ b/security/integrity/ima/ima_init_ima_ns.c
@@ -20,6 +20,8 @@ 
 
 int ima_init_namespace(struct ima_namespace *ns)
 {
+	int rc = 0;
+
 	ns->ns_status_tree = RB_ROOT;
 	rwlock_init(&ns->ns_status_lock);
 	ns->ns_status_cache = KMEM_CACHE(ns_status, SLAB_PANIC);
@@ -52,8 +54,10 @@  int ima_init_namespace(struct ima_namespace *ns)
 	mutex_init(&ns->ima_write_mutex);
 	ns->valid_policy = 1;
 	ns->ima_fs_flags = 0;
+	if (ns != &init_ima_ns)
+		rc = ima_fs_ns_init(ns);
 
-	return 0;
+	return rc;
 }
 
 int __init ima_ns_init(void)
diff --git a/security/integrity/ima/ima_ns.c b/security/integrity/ima/ima_ns.c
index 4260f96c4eca..9d5917c97fcc 100644
--- a/security/integrity/ima/ima_ns.c
+++ b/security/integrity/ima/ima_ns.c
@@ -67,6 +67,8 @@  struct ima_namespace *copy_ima_ns(struct ima_namespace *old_ns,
 
 void ima_ns_userns_early_teardown(struct ima_namespace *ns)
 {
+	pr_debug("%s: ns=0x%p\n", __func__, ns);
+	ima_fs_ns_free(ns);
 }
 EXPORT_SYMBOL(ima_ns_userns_early_teardown);
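
The teardown order in ima_fs_ns_free_dentries() above (files and the symlink first, then the "ima" directory, then "integrity") matters when directory removal refuses non-empty directories. The userspace sketch below models that ordering. All names (`struct tree_model`, `populate()`, `remove_entry()`, `teardown()`, the `ENT_*` indices) are hypothetical, the tree is reduced to four entries, and the "non-empty directory cannot be removed" rule is an assumption of this model.

```c
#include <assert.h>
#include <stddef.h>

enum { ENT_INTEGRITY_DIR, ENT_IMA_DIR, ENT_SYMLINK, ENT_POLICY, ENT_LAST };

struct entry_model {
	int present;
	int parent;     /* index of the parent directory, -1 for the root */
};

struct tree_model {
	struct entry_model e[ENT_LAST];
};

/* Build integrity/, integrity/ima/, the ima symlink and ima/policy. */
static void populate(struct tree_model *t)
{
	t->e[ENT_INTEGRITY_DIR] = (struct entry_model){ 1, -1 };
	t->e[ENT_IMA_DIR]       = (struct entry_model){ 1, ENT_INTEGRITY_DIR };
	t->e[ENT_SYMLINK]       = (struct entry_model){ 1, -1 };
	t->e[ENT_POLICY]        = (struct entry_model){ 1, ENT_IMA_DIR };
}

static int has_children(const struct tree_model *t, int idx)
{
	for (int i = 0; i < ENT_LAST; i++)
		if (t->e[i].present && t->e[i].parent == idx)
			return 1;
	return 0;
}

/* Remove one entry; a directory that still has children is refused. */
static int remove_entry(struct tree_model *t, int idx)
{
	assert(idx >= 0 && idx < ENT_LAST);
	if (!t->e[idx].present)
		return 0;
	if (has_children(t, idx))
		return -1;
	t->e[idx].present = 0;
	return 0;
}

/* Same order as the patch: leaves first, then directories bottom-up. */
static int teardown(struct tree_model *t)
{
	int rc = 0;

	for (int i = 0; i < ENT_LAST; i++) {
		if (i == ENT_INTEGRITY_DIR || i == ENT_IMA_DIR)
			continue;       /* files first */
		rc |= remove_entry(t, i);
	}
	rc |= remove_entry(t, ENT_IMA_DIR);
	rc |= remove_entry(t, ENT_INTEGRITY_DIR);
	return rc;
}
```

Removing a directory before its children fails in this model, while `teardown()` leaves every slot empty.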