diff mbox series

[RFC,1/1] fix NULL mnt [was Re: apparmor NULL pointer dereference on resume [efivarfs]]

Message ID 7bdcc2c5d8022d2f1a7ec23c0351f7816d4464c8.camel@HansenPartnership.com (mailing list archive)
State New
Headers show
Series [RFC,1/1] fix NULL mnt [was Re: apparmor NULL pointer dereference on resume [efivarfs]] | expand

Commit Message

James Bottomley March 14, 2025, 2:59 p.m. UTC
On Tue, 2025-03-11 at 09:01 -0400, James Bottomley wrote:
> On Tue, 2025-03-11 at 09:45 +0100, Christian Brauner wrote:
[...]
> > But since efivars does only ever have a single global superblock,
> > one possibility is to an internal superblock that always exits and
> > is resurfaced whenever userspace mounts efivarfs. That's
> > essentially the devtmpfs model.
> > 
> > Then you can stash:
> > 
> > static struct vfsmount *efivarfs_mnt;
> > 
> > globally and use that in efivarfs_pm_notify() to fill in struct
> > path.
> 
> I didn't see devtmpfs when looking for examples, since it's hiding
> outside of the fs/ directory.  However, it does seem to be a bit
> legacy nasty as an example to copy.  However, I get the basics: we'd
> instantiate the mnt and superblock on init (stashing mnt in the sfi
> so the notifier gets it).  Then we can do the variable population on
> reconfigure, just in case an EFI system doesn't want to mount
> efivarfs to save memory.
> 
> I can code that up if I can get an answer to the uid/gid parameter
> question above.

I coded up the naive implementation and it definitely works, but it
suffers from the problem that everything that pins in the module init
routine (like configfs) does in that once inserted the module can never
be removed.  Plus, for efivarfs, we would allocate all resources on
module insertion not on first mount.  The final problem we'd have is
that the uid/gid parameters for variable creation would be taken from
the kernel internal mount, so if they got specified on a user mount,
they'd be ignored (because the variable inodes are already created).

To answer some of your other questions:

> (1) Is it guaranteed that efivarfs_pm_notify() is only called once a
>     superblock exists?

Yes, as you realized.

> (2) Is it guaranteed that efivarfs_pm_notify() is only called when
>     and while a mount for the superblock exists?

No, but the behaviour is correct because the notifier needs to update
the variable list and we create the variable list in
efivarfs_fill_super.  Now you can argue this is suboptimal because if
userspace didn't ever mount, we'd simply destroy it all again on last
put of the superblock so it's wasted effort, but its function is
correct.

> Another question is whether the superblock can be freed while
> efivarfs_pm_notify() is running? I think that can't happen because
> blocking_notifier_chain_unregister(&efivar_ops_nh, &sfi->nb) will
> block in efivarfs_kill_sb() until all outstanding calls to
> efivarfs_pm_notify() are finished?

That's right: a blocking notifier is called under the notifier list
rwsem.  It's taken read for calls but write for register/unregister, so
efivarfs_kill_sb would block in the unregister until the call chain was
executed.

Taking into account the module removal issue, the simplest way I found
to fix the issue was to call vfs_kern_mount() from the notifier to get
a struct vfsmount before opening the path.  We ensure it's gone by
calling mntput immediately after open, but, by that time, the open file
is pinning the vfsmnt if the open was successful.

If this looks OK to everyone I'll code it up as a fix which can be cc'd
to stable.

Regards,

James

---

Comments

Christian Brauner March 15, 2025, 10:04 a.m. UTC | #1
On Fri, Mar 14, 2025 at 10:59:14AM -0400, James Bottomley wrote:
> On Tue, 2025-03-11 at 09:01 -0400, James Bottomley wrote:
> > On Tue, 2025-03-11 at 09:45 +0100, Christian Brauner wrote:
> [...]
> > > But since efivars does only ever have a single global superblock,
> > > one possibility is to an internal superblock that always exits and
> > > is resurfaced whenever userspace mounts efivarfs. That's
> > > essentially the devtmpfs model.
> > > 
> > > Then you can stash:
> > > 
> > > static struct vfsmount *efivarfs_mnt;
> > > 
> > > globally and use that in efivarfs_pm_notify() to fill in struct
> > > path.
> > 
> > I didn't see devtmpfs when looking for examples, since it's hiding
> > outside of the fs/ directory.  However, it does seem to be a bit
> > legacy nasty as an example to copy.  However, I get the basics: we'd
> > instantiate the mnt and superblock on init (stashing mnt in the sfi
> > so the notifier gets it).  Then we can do the variable population on
> > reconfigure, just in case an EFI system doesn't want to mount
> > efivarfs to save memory.
> > 
> > I can code that up if I can get an answer to the uid/gid parameter
> > question above.
> 
> I coded up the naive implementation and it definitely works, but it
> suffers from the problem that everything that pins in the module init
> routine (like configfs) does in that once inserted the module can never
> be removed.  Plus, for efivarfs, we would allocate all resources on
> module insertion not on first mount.  The final problem we'd have is
> that the uid/gid parameters for variable creation would be taken from
> the kernel internal mount, so if they got specified on a user mount,
> they'd be ignored (because the variable inodes are already created).
> 
> To answer some of your other questions:
> 
> > (1) Is it guaranteed that efivarfs_pm_notify() is only called once a
> >     superblock exists?
> 
> Yes, as you realized.
> 
> > (2) Is it guaranteed that efivarfs_pm_notify() is only called when
> >     and while a mount for the superblock exists?
> 
> No, but the behaviour is correct because the notifier needs to update
> the variable list and we create the variable list in
> efivarfs_fill_super.  Now you can argue this is suboptimal because if
> userspace didn't ever mount, we'd simply destroy it all again on last
> put of the superblock so it's wasted effort, but its function is
> correct.
> 
> > Another question is whether the superblock can be freed while
> > efivarfs_pm_notify() is running? I think that can't happen because
> > blocking_notifier_chain_unregister(&efivar_ops_nh, &sfi->nb) will
> > block in efivarfs_kill_sb() until all outstanding calls to
> > efivarfs_pm_notify() are finished?
> 
> That's right: a blocking notifier is called under the notifier list
> rwsem.  It's taken read for calls but write for register/unregister, so
> efivarfs_kill_sb would block in the unregister until the call chain was
> executed.
> 
> Taking into account the module removal issue, the simplest way I found
> to fix the issue was to call vfs_kern_mount() from the notifier to get

Yeah, Al had already mentioned that. I initially had the same idea but
since I didn't know enough about the notifier block stuff I wasn't sure
whether there's some odd deadlock that could be caused by this.

> a struct vfsmount before opening the path.  We ensure it's gone by
> calling mntput immediately after open, but, by that time, the open file
> is pinning the vfsmnt if the open was successful.
> 
> If this looks OK to everyone I'll code it up as a fix which can be cc'd
> to stable.
> 
> Regards,
> 
> James
> 
> ---
> 
> diff --git a/fs/efivarfs/super.c b/fs/efivarfs/super.c
> index 6eae8cf655c1..e2e6575b5abf 100644
> --- a/fs/efivarfs/super.c
> +++ b/fs/efivarfs/super.c
> @@ -474,12 +474,14 @@ static int efivarfs_check_missing(efi_char16_t *name16, efi_guid_t vendor,
>  	return err;
>  }
>  
> +static struct file_system_type efivarfs_type;
> +
>  static int efivarfs_pm_notify(struct notifier_block *nb, unsigned long action,
>  			      void *ptr)
>  {
>  	struct efivarfs_fs_info *sfi = container_of(nb, struct efivarfs_fs_info,
>  						    pm_nb);
> -	struct path path = { .mnt = NULL, .dentry = sfi->sb->s_root, };
> +	struct path path;
>  	struct efivarfs_ctx ectx = {
>  		.ctx = {
>  			.actor	= efivarfs_actor,
> @@ -501,9 +503,17 @@ static int efivarfs_pm_notify(struct notifier_block *nb, unsigned long action,
>  
>  	pr_info("efivarfs: resyncing variable state\n");
>  
> -	/* O_NOATIME is required to prevent oops on NULL mnt */
> +	path.dentry = sfi->sb->s_root;
> +	path.mnt = vfs_kern_mount(&efivarfs_type, SB_KERNMOUNT,
> +				  efivarfs_type.name, NULL);
> +	if (IS_ERR(path.mnt)) {
> +		pr_err("efivarfs: internal mount failed\n");
> +		return PTR_ERR(path.mnt);
> +	}

I see some issues with this. A umount by another task could already hit:

sb->kill_sb == efivarfs_kill_super()

which means the superblock is already marked as dying.

By calling vfs_kern_mount() unconditionally you end up calling
vfs_get_tree() and then get_tree_single() again. That would mean
efivarfs_pm_notify() now waits for the old superblock to be dead.

But the old superblock waits in efivarfs_kill_sb() for
efivarfs_pm_notify() to finish before actually killing the old
superblock.

So this would deadlock.

So you need to make sure that the superbock a) isn't dead and b) doesn't
go away behind your back:

diff --git a/fs/efivarfs/super.c b/fs/efivarfs/super.c
index 6eae8cf655c1..6a4f95c27697 100644
--- a/fs/efivarfs/super.c
+++ b/fs/efivarfs/super.c
@@ -474,6 +474,8 @@ static int efivarfs_check_missing(efi_char16_t *name16, efi_guid_t vendor,
        return err;
 }

+static struct file_system_type efivarfs_type;
+
 static int efivarfs_pm_notify(struct notifier_block *nb, unsigned long action,
                              void *ptr)
 {
@@ -499,6 +501,31 @@ static int efivarfs_pm_notify(struct notifier_block *nb, unsigned long action,
        if (rescan_done)
                return NOTIFY_DONE;

+       /*
+        * Ensure that efivarfs is still alive and cannot go away behind
+        * our back.
+        */
+       if (!atomic_inc_not_zero(&sfi->sb->s_active))
+               return NOTIFY_DONE;
+
+       path.mnt = vfs_kern_mount(&efivarfs_type, SB_KERNMOUNT,
+                                 efivarfs_type.name, NULL);

Since efivars uses a single global superblock and we know that sfi->sb
is still alive (After all we've just pinned it above.) vfs_kern_mount()
will reuse the same superblock.

There's two cases to consider:

(1) vfs_kern_mount() was successful. In this case path->mnt will hold an
    active superblock reference that will be released asynchronously via
    __fput(). That is safe and correct.

(2) vfs_kern_mount() fails. That's an issue because you need to call
    deactivate_super() which will have a similar deadlock problem.

    If efivarfs_pm_notify() now holds the last reference to the
    superblock then deactivate_super() super will put that last
    reference and call efivarfs_kill_super() which in turn will wait for
    efivarfs_pm_notify() to finish. => deadlock

So in the error case you need to offload the call to deactivate_super()
to a workqueue.

diff --git a/fs/efivarfs/super.c b/fs/efivarfs/super.c
index 6eae8cf655c1..288c1dd8622b 100644
--- a/fs/efivarfs/super.c
+++ b/fs/efivarfs/super.c
@@ -474,6 +474,8 @@ static int efivarfs_check_missing(efi_char16_t *name16, efi_guid_t vendor,
        return err;
 }

+static struct file_system_type efivarfs_type;
+
 static int efivarfs_pm_notify(struct notifier_block *nb, unsigned long action,
                              void *ptr)
 {
@@ -499,6 +501,39 @@ static int efivarfs_pm_notify(struct notifier_block *nb, unsigned long action,
        if (rescan_done)
                return NOTIFY_DONE;

+       /*
+        * Ensure that efivarfs is still alive and cannot go away behind
+        * our back.
+        */
+       if (!atomic_inc_not_zero(&sfi->sb->s_active))
+               return NOTIFY_DONE;
+
+       path.mnt = vfs_kern_mount(&efivarfs_type, SB_KERNMOUNT,
+                                 efivarfs_type.name, NULL);
+       /*
+        * Since efivars uses a single global superblock and we know
+        * that sfi->sb is still alive (After all we've just pinned it
+        * above.) vfs_kern_mount() will reuse the same superblock.
+        *
+        * If vfs_kern_mount() was successful path->mnt will hold an
+        * active superblock reference that will be released
+        * asynchronously via __fput().
+        *
+        * If vfs_kern_mount() fails we might be the ones to hold the
+        * last reference now so we need to call deactivate_super(). But
+        * we need to ensure that this is done asynchronously so
+        * efivarfs_kill_super() doesn't deadlock by waiting on
+        * efivarfs_pm_notify() to finish.
+        */
+       if (IS_ERR(path.mnt)) {
+
+               /* TODO: offload to workqueue so that we don't deadlock. */
+               deactivate_super(sfi->sb);
+               pr_err("efivarfs: internal mount failed\n");
+               return PTR_ERR(path.mnt);
+       }
+       atomic_dec(&sfi->sb->s_active);
+
        pr_info("efivarfs: resyncing variable state\n");

        /* O_NOATIME is required to prevent oops on NULL mnt */
James Bottomley March 15, 2025, 6:41 p.m. UTC | #2
On Sat, 2025-03-15 at 11:04 +0100, Christian Brauner wrote:
[...]
> Since efivars uses a single global superblock and we know that sfi-
> >sb is still alive (After all we've just pinned it above.)
> vfs_kern_mount() will reuse the same superblock.
> 
> There's two cases to consider:
> 
> (1) vfs_kern_mount() was successful. In this case path->mnt will hold
> an active superblock reference that will be released asynchronously
> via __fput(). That is safe and correct.
> 
> (2) vfs_kern_mount() fails. That's an issue because you need to call
> deactivate_super() which will have a similar deadlock problem.
> If efivarfs_pm_notify() now holds the last reference to the
> superblock then deactivate_super() super will put that last
> reference and call efivarfs_kill_super() which in turn will wait for
> efivarfs_pm_notify() to finish. => deadlock
> 
> So in the error case you need to offload the call to
> deactivate_super() to a workqueue.

OK, got that (although it did make my head explode a bit ... this is
certainly subtle stuff).  To do the delayed work for the
deactivate_super(), I hijacked the superblock destroy_work structure
which I think is safe because by the time the work structure is
executed, we own it and so it can be reused for the actual destroy_work
in deactivate_super().

However, there's another problem: the mntput after kernel_file_open may
synchronously call cleanup_mnt() (and thus deactivate_super()) if the
open fails because it's marked MNT_INTERNAL, which is caused by
SB_KERNMOUNT.  I fixed this just by not passing the SB_KERNMOUNT flag,
which feels a bit hacky.

I've put together everything at the bottom, however, I can't test the
error legs of this because trying to trigger and unmount and hybernate
at exactly the right point is pretty much impossible.  The rest seems
to work as advertised, although I would like a tested-by from the
apparmour people because I do run apparmour in my debian test rig but
don't see the problem.

Regards,

James

---

diff --git a/fs/efivarfs/super.c b/fs/efivarfs/super.c
index 6eae8cf655c1..2d826e98066b 100644
--- a/fs/efivarfs/super.c
+++ b/fs/efivarfs/super.c
@@ -474,12 +474,25 @@ static int efivarfs_check_missing(efi_char16_t *name16, efi_guid_t vendor,
 	return err;
 }
 
+static void efivarfs_deactivate_super_work(struct work_struct *work)
+{
+	struct super_block *s = container_of(work, struct super_block,
+					     destroy_work);
+	/*
+	 * note: here s->destroy_work is free for reuse (which
+	 * will happen in deactivate_super)
+	 */
+	deactivate_super(s);
+}
+
+static struct file_system_type efivarfs_type;
+
 static int efivarfs_pm_notify(struct notifier_block *nb, unsigned long action,
 			      void *ptr)
 {
 	struct efivarfs_fs_info *sfi = container_of(nb, struct efivarfs_fs_info,
 						    pm_nb);
-	struct path path = { .mnt = NULL, .dentry = sfi->sb->s_root, };
+	struct path path;
 	struct efivarfs_ctx ectx = {
 		.ctx = {
 			.actor	= efivarfs_actor,
@@ -487,6 +500,7 @@ static int efivarfs_pm_notify(struct notifier_block *nb, unsigned long action,
 		.sb = sfi->sb,
 	};
 	struct file *file;
+	struct super_block *s = sfi->sb;
 	static bool rescan_done = true;
 
 	if (action == PM_HIBERNATION_PREPARE) {
@@ -499,11 +513,39 @@ static int efivarfs_pm_notify(struct notifier_block *nb, unsigned long action,
 	if (rescan_done)
 		return NOTIFY_DONE;
 
+	/* ensure single superblock is alive and pin it */
+	if (!atomic_inc_not_zero(&s->s_active))
+		return NOTIFY_DONE;
+
 	pr_info("efivarfs: resyncing variable state\n");
 
-	/* O_NOATIME is required to prevent oops on NULL mnt */
+	path.dentry = sfi->sb->s_root;
+
+	/* do not add SB_KERNMOUNT which causes MNT_INTERNAL, see below */
+	path.mnt = vfs_kern_mount(&efivarfs_type, 0,
+				  efivarfs_type.name, NULL);
+	if (IS_ERR(path.mnt)) {
+		pr_err("efivarfs: internal mount failed\n");
+		/*
+		 * We may be the last pinner of the superblock but
+		 * calling efivarfs_kill_sb from within the notifier
+		 * here would deadlock trying to unregister it
+		 */
+		INIT_WORK(&s->destroy_work, efivarfs_deactivate_super_work);
+		schedule_work(&s->destroy_work);
+		return PTR_ERR(path.mnt);
+	}
+
+	/* path.mnt now has pin on superblock, so this must be above one */
+	atomic_dec(&s->s_active);
+
 	file = kernel_file_open(&path, O_RDONLY | O_DIRECTORY | O_NOATIME,
 				current_cred());
+	/*
+	 * safe even if last put because no MNT_INTERNAL means this
+	 * will do delayed deactivate_super and not deadlock
+	 */
+	mntput(path.mnt);
 	if (IS_ERR(file))
 		return NOTIFY_DONE;
diff mbox series

Patch

diff --git a/fs/efivarfs/super.c b/fs/efivarfs/super.c
index 6eae8cf655c1..e2e6575b5abf 100644
--- a/fs/efivarfs/super.c
+++ b/fs/efivarfs/super.c
@@ -474,12 +474,14 @@  static int efivarfs_check_missing(efi_char16_t *name16, efi_guid_t vendor,
 	return err;
 }
 
+static struct file_system_type efivarfs_type;
+
 static int efivarfs_pm_notify(struct notifier_block *nb, unsigned long action,
 			      void *ptr)
 {
 	struct efivarfs_fs_info *sfi = container_of(nb, struct efivarfs_fs_info,
 						    pm_nb);
-	struct path path = { .mnt = NULL, .dentry = sfi->sb->s_root, };
+	struct path path;
 	struct efivarfs_ctx ectx = {
 		.ctx = {
 			.actor	= efivarfs_actor,
@@ -501,9 +503,17 @@  static int efivarfs_pm_notify(struct notifier_block *nb, unsigned long action,
 
 	pr_info("efivarfs: resyncing variable state\n");
 
-	/* O_NOATIME is required to prevent oops on NULL mnt */
+	path.dentry = sfi->sb->s_root;
+	path.mnt = vfs_kern_mount(&efivarfs_type, SB_KERNMOUNT,
+				  efivarfs_type.name, NULL);
+	if (IS_ERR(path.mnt)) {
+		pr_err("efivarfs: internal mount failed\n");
+		return PTR_ERR(path.mnt);
+	}
+
 	file = kernel_file_open(&path, O_RDONLY | O_DIRECTORY | O_NOATIME,
 				current_cred());
+	mntput(path.mnt);
 	if (IS_ERR(file))
 		return NOTIFY_DONE;