diff mbox series

[03/19] fs: release anon dev_t in deactivate_locked_super

Message ID 20230913111013.77623-4-hch@lst.de (mailing list archive)
State New, archived
Headers show
Series [01/19] fs: reflow deactivate_locked_super | expand

Commit Message

Christoph Hellwig Sept. 13, 2023, 11:09 a.m. UTC
Releasing an anon dev_t is a very common thing when freeing a
super_block, as that's done for basically any not block based file
system (modulo the odd mtd special case).  So instead of requiring
a special ->kill_sb helper and a lot of boilerplate in more complicated
file systems, just release the anon dev_t in deactivate_locked_super if
the super_block was using one.

As the freeing is done after the main call to kill_super_notify, this
removes the need for having two slightly different call sites for it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/bdev.c                    |  1 -
 drivers/dax/super.c             |  1 -
 drivers/dma-buf/dma-buf.c       |  1 -
 drivers/gpu/drm/drm_drv.c       |  1 -
 drivers/misc/cxl/api.c          |  1 -
 drivers/scsi/cxlflash/ocxl_hw.c |  1 -
 fs/9p/vfs_super.c               |  2 +-
 fs/afs/super.c                  |  2 +-
 fs/aio.c                        |  1 -
 fs/anon_inodes.c                |  1 -
 fs/autofs/inode.c               |  4 ++--
 fs/btrfs/super.c                |  3 ++-
 fs/btrfs/tests/btrfs-tests.c    |  1 -
 fs/ceph/super.c                 |  2 +-
 fs/coda/inode.c                 |  1 -
 fs/ecryptfs/main.c              |  3 ++-
 fs/erofs/super.c                |  4 ++--
 fs/fuse/inode.c                 |  2 +-
 fs/fuse/virtio_fs.c             |  2 +-
 fs/hostfs/hostfs_kern.c         |  2 +-
 fs/kernfs/mount.c               |  2 +-
 fs/nfs/super.c                  |  2 +-
 fs/nsfs.c                       |  1 -
 fs/openpromfs/inode.c           |  1 -
 fs/orangefs/super.c             |  2 +-
 fs/overlayfs/super.c            |  1 -
 fs/pipe.c                       |  1 -
 fs/proc/root.c                  |  4 ++--
 fs/smb/client/cifsfs.c          |  2 +-
 fs/super.c                      | 22 ++++++++--------------
 fs/ubifs/super.c                |  3 ++-
 fs/vboxsf/super.c               |  1 -
 include/linux/fs.h              |  1 -
 kernel/resource.c               |  1 -
 mm/secretmem.c                  |  1 -
 net/socket.c                    |  1 -
 security/apparmor/apparmorfs.c  |  1 -
 37 files changed, 30 insertions(+), 53 deletions(-)

Comments

Al Viro Sept. 13, 2023, 11:27 p.m. UTC | #1
On Wed, Sep 13, 2023 at 08:09:57AM -0300, Christoph Hellwig wrote:
> Releasing an anon dev_t is a very common thing when freeing a
> super_block, as that's done for basically any not block based file
> system (modulo the odd mtd special case).  So instead of requiring
> a special ->kill_sb helper and a lot of boilerplate in more complicated
> file systems, just release the anon dev_t in deactivate_locked_super if
> the super_block was using one.
> 
> As the freeing is done after the main call to kill_super_notify, this
> removes the need for having two slightly different call sites for it.

Huh?  At this stage in your series freeing is still in ->kill_sb()
instances, after the calls of kill_anon_super() you've turned into
the calls of generic_shutdown_super().

You do split it off into a separate method later in the series, but
at this point you are reopening the same UAF that had been dealt with
in dc3216b14160 "super: ensure valid info".

Either move the introduction of ->free_sb() before that one, or
split it into lifting put_anon_bdev() (left here) and getting rid
of kill_anon_super() (after ->free_sb() introduction).
Al Viro Sept. 14, 2023, 2:37 a.m. UTC | #2
On Thu, Sep 14, 2023 at 12:27:12AM +0100, Al Viro wrote:
> On Wed, Sep 13, 2023 at 08:09:57AM -0300, Christoph Hellwig wrote:
> > Releasing an anon dev_t is a very common thing when freeing a
> > super_block, as that's done for basically any not block based file
> > system (modulo the odd mtd special case).  So instead of requiring
> > a special ->kill_sb helper and a lot of boilerplate in more complicated
> > file systems, just release the anon dev_t in deactivate_locked_super if
> > the super_block was using one.
> > 
> > As the freeing is done after the main call to kill_super_notify, this
> > removes the need for having two slightly different call sites for it.
> 
> Huh?  At this stage in your series freeing is still in ->kill_sb()
> instances, after the calls of kill_anon_super() you've turned into
> the calls of generic_shutdown_super().
> 
> You do split it off into a separate method later in the series, but
> at this point you are reopening the same UAF that had been dealt with
> in dc3216b14160 "super: ensure valid info".
> 
> Either move the introduction of ->free_sb() before that one, or
> split it into lifting put_anon_bdev() (left here) and getting rid
> of kill_anon_super() (after ->free_sb() introduction).

Actually, looking at the final stage in the series, you still have
kill_super_notify() done *AFTER* ->free_sb() call.  So the problem
persists until the very end...
Al Viro Sept. 14, 2023, 5:38 a.m. UTC | #3
On Thu, Sep 14, 2023 at 03:37:05AM +0100, Al Viro wrote:
> On Thu, Sep 14, 2023 at 12:27:12AM +0100, Al Viro wrote:
> > On Wed, Sep 13, 2023 at 08:09:57AM -0300, Christoph Hellwig wrote:
> > > Releasing an anon dev_t is a very common thing when freeing a
> > > super_block, as that's done for basically any not block based file
> > > system (modulo the odd mtd special case).  So instead of requiring
> > > a special ->kill_sb helper and a lot of boilerplate in more complicated
> > > file systems, just release the anon dev_t in deactivate_locked_super if
> > > the super_block was using one.
> > > 
> > > As the freeing is done after the main call to kill_super_notify, this
> > > removes the need for having two slightly different call sites for it.
> > 
> > Huh?  At this stage in your series freeing is still in ->kill_sb()
> > instances, after the calls of kill_anon_super() you've turned into
> > the calls of generic_shutdown_super().
> > 
> > You do split it off into a separate method later in the series, but
> > at this point you are reopening the same UAF that had been dealt with
> > in dc3216b14160 "super: ensure valid info".
> > 
> > Either move the introduction of ->free_sb() before that one, or
> > split it into lifting put_anon_bdev() (left here) and getting rid
> > of kill_anon_super() (after ->free_sb() introduction).
> 
> Actually, looking at the final stage in the series, you still have
> kill_super_notify() done *AFTER* ->free_sb() call.  So the problem
> persists until the very end...

It's worse - look at the rationale for 2c18a63b760a "super: wait until
we passed kill super".  Basically, "don't remove from the lists
until after block device closing".  IOW, we have

* stuff that needs to be done before generic_shutdown_super() (things
like pinned dentries on ramfs, etc.)
* generic_shutdown_super() itself (dentry/inode eviction, optionally
->put_super())
* stuff that needs to be done before eviction from the lists (block
device closing, since 2c18a63b760a)
* eviction from the lists
* stuff that needs to be done *after* eviction from the lists.

BTW, this part of commit message in 2c18a63b760a is rather confused:
    Recent rework moved block device closing out of sb->put_super() and into
    sb->kill_sb() to avoid deadlocks as s_umount is held in put_super() and
    blkdev_put() can end up taking s_umount again.

That was *NOT* what a recent rework had done.  Block device closing had never
been inside ->put_super() - at no point since that (closing, that is) had been
introduced back in 0.97 ;-)  ->put_super() predates it (0.95c+).

The race is real, but the cause is not some kind of move of blkdev_put().
Your 2ea6f68932f7 "fs: use the super_block as holder when mounting file
systems" is where it actually came from.

Christoph, could you explain what the hell do we need that for?  It does
create the race in question and AFAICS 2c18a63b760a (and followups trying
to plug holes in it) had been nothing but headache.

Old logics: if mount attempt with a different fs type happens, -EBUSY
is precisely corrent - we would've gotten just that if mount() came
before umount().  If the type matches, we might
	1) come before deactivate_locked_super() by umount(2).
No problem, we succeed.
	2) come after the beginning of shutdown, but before the
removal from the list; fine, we'll wait for the sucker to be
unlocked (which happens in the end of generic_shutdown_super()),
notice it's dead and create a new superblock.  Since the only
part left on the umount side is closing the device, we are
just fine.
	3) come after the removal from the list.  So we won't
wait for the old superblock to be unlocked, other than that
it's exactly the same as (2).  It doesn't matter whether we
open the device before or after close by umount - same owner
anyway, no -EBUSY.

Your "owner shall be the superblock" breaks that...

If you want to mess with _three_-way split of ->kill_sb(),
please start with writing down the rules re what should
go into each of those parts; such writeup should go into
Documentation/filesystems/porting anyway, even if the
split is a two-way one, BTW.
Christian Brauner Sept. 14, 2023, 7:56 a.m. UTC | #4
> BTW, this part of commit message in 2c18a63b760a is rather confused:
>     Recent rework moved block device closing out of sb->put_super() and into
>     sb->kill_sb() to avoid deadlocks as s_umount is held in put_super() and
>     blkdev_put() can end up taking s_umount again.
> 
> That was *NOT* what a recent rework had done.  Block device closing had never
> been inside ->put_super() - at no point since that (closing, that is) had been
> introduced back in 0.97 ;-)  ->put_super() predates it (0.95c+).

I think the commit message probably just isn't clear enough. The main
block device of a superblock isn't closed in sb->put_super(). That's
always been closed in kill_block_super() after generic_shutdown_super().

But afaict filesystem like ext4 and xfs may have additional block
devices open exclusively and closed them in sb->put_super():

xfs_fs_put_super()
-> xfs_close_devices()
   -> xfs_blkdev_put()
      -> blkdev_put()

ext4_put_super()
-> ext4_blkdev_remove()
   -> blkdev_put()
Christian Brauner Sept. 14, 2023, 2:02 p.m. UTC | #5
> Christoph, could you explain what the hell do we need that for?  It does
> create the race in question and AFAICS 2c18a63b760a (and followups trying
> to plug holes in it) had been nothing but headache.
> 
> Old logics: if mount attempt with a different fs type happens, -EBUSY
> is precisely corrent - we would've gotten just that if mount() came
> before umount().  If the type matches, we might
> 	1) come before deactivate_locked_super() by umount(2).
> No problem, we succeed.
> 	2) come after the beginning of shutdown, but before the
> removal from the list; fine, we'll wait for the sucker to be
> unlocked (which happens in the end of generic_shutdown_super()),
> notice it's dead and create a new superblock.  Since the only
> part left on the umount side is closing the device, we are
> just fine.
> 	3) come after the removal from the list.  So we won't
> wait for the old superblock to be unlocked, other than that
> it's exactly the same as (2).  It doesn't matter whether we
> open the device before or after close by umount - same owner
> anyway, no -EBUSY.
> 
> Your "owner shall be the superblock" breaks that...
> 
> If you want to mess with _three_-way split of ->kill_sb(),
> please start with writing down the rules re what should
> go into each of those parts; such writeup should go into
> Documentation/filesystems/porting anyway, even if the
> split is a two-way one, BTW.

Hm, I think that characterization of Christoph's changes is a bit harsh.

Yes, you're right that making the superblock and not the filesytem type
the bd_holder changes the logic and we are aware of that of course. And
it requires changes such as moving additional block device closing from
where some callers currently do it.

But the filesytem type is not a very useful holder itself and has other
drawbacks. The obvious one being that it requires us to wade through all
superblocks on the system trying to find the superblock associated with
a given block device continously grabbing and dropping sb_lock and
s_umount. None of that is very pleasant nor elegant and it is for sure
not very easy to understand (Plus, it's broken for btrfs freezing and
syncing via block level ioctls.).

Using the superblock as holder makes this go away and is overall a lot
more useful and intuitive and can be extended to filesystems with
multiple devices (Of which we apparently are bound to get more.).

So I think this change is worth the pain.

It's a fair point that these lifetime rules should be documented in
Documentation/filesystems/. The old lifetime documentation is too sparse
to be useful though.
Al Viro Sept. 14, 2023, 4:58 p.m. UTC | #6
On Thu, Sep 14, 2023 at 04:02:25PM +0200, Christian Brauner wrote:

> Yes, you're right that making the superblock and not the filesytem type
> the bd_holder changes the logic and we are aware of that of course. And
> it requires changes such as moving additional block device closing from
> where some callers currently do it.

Details, please?

> But the filesytem type is not a very useful holder itself and has other
> drawbacks. The obvious one being that it requires us to wade through all
> superblocks on the system trying to find the superblock associated with
> a given block device continously grabbing and dropping sb_lock and
> s_umount. None of that is very pleasant nor elegant and it is for sure
> not very easy to understand (Plus, it's broken for btrfs freezing and
> syncing via block level ioctls.).

"Constantly" is a bit of a stretch - IIRC, we grabbed sb_lock once, then
went through the list comparing ->s_bdev (without any extra locking),
then bumped ->s_count on the found superblock, dropped sb_lock,
grabbed ->s_umount on the sucker and verified it's still alive.

Repeated grabbing of any lock happened only on a race with fs shutdown;
normal case is one spin_lock, one spin_unlock, one down_read().

Oh, well...

> Using the superblock as holder makes this go away and is overall a lot
> more useful and intuitive and can be extended to filesystems with
> multiple devices (Of which we apparently are bound to get more.).
>
> So I think this change is worth the pain.
> 
> It's a fair point that these lifetime rules should be documented in
> Documentation/filesystems/. The old lifetime documentation is too sparse
> to be useful though.

What *are* these lifetime rules?  Seriously, you have 3 chunks of
fs-dependent actions at the moment:
	* the things needed to get rid of internal references pinning
inodes/dentries + whatever else we need done before generic_shutdown_super()
	* the stuff to be done between generic_shutdown_super() and
making the sucker invisible to sget()/sget_fc()
	* the stuff that must be done after we are sure that sget
callbacks won't be looking at this instance.

Note that Christoph's series has mashed (2) and (3) together, resulting
in UAF in a bunch of places.  And I'm dead serious about
Documentation/filesystems/porting being the right place; any development
tree of any filesystem (in-tree one or not) will have to go through the
changes and figure out WTF to do with their existing code.  We are
going to play whack-a-mole for at least several years as development
branches get rebased and merged.

Incidentally, I'm going to add a (belated by 10 years) chunk in porting.rst
re making sure that anything in superblock that might be needed by methods
called in RCU mode should *not* be freed without an RCU delay...  Should've
done that back in 3.12 merge window when RCU'd vfsmounts went in; as it
is, today we have several filesystems with exact same kind of breakage.
hfsplus and affs breakage had been there in 3.13 (missed those two), exfat
and ntfs3 - introduced later, by initial merges of filesystems in question.
Missed on review...

Hell knows - perhaps Documentation/filesystems/whack-a-mole might be a good
idea...
Al Viro Sept. 14, 2023, 7:23 p.m. UTC | #7
On Thu, Sep 14, 2023 at 05:58:05PM +0100, Al Viro wrote:

> Incidentally, I'm going to add a (belated by 10 years) chunk in porting.rst
> re making sure that anything in superblock that might be needed by methods
> called in RCU mode should *not* be freed without an RCU delay...  Should've
> done that back in 3.12 merge window when RCU'd vfsmounts went in; as it
> is, today we have several filesystems with exact same kind of breakage.
> hfsplus and affs breakage had been there in 3.13 (missed those two), exfat
> and ntfs3 - introduced later, by initial merges of filesystems in question.
> Missed on review...
> 
> Hell knows - perhaps Documentation/filesystems/whack-a-mole might be a good
> idea...

Actually, utf8 casefolding stuff also has the same problem, so ext4 and f2fs
with casefolding are also affected ;-/
Christian Brauner Sept. 15, 2023, 7:40 a.m. UTC | #8
On Thu, Sep 14, 2023 at 08:23:31PM +0100, Al Viro wrote:
> On Thu, Sep 14, 2023 at 05:58:05PM +0100, Al Viro wrote:
> 
> > Incidentally, I'm going to add a (belated by 10 years) chunk in porting.rst
> > re making sure that anything in superblock that might be needed by methods
> > called in RCU mode should *not* be freed without an RCU delay...  Should've
> > done that back in 3.12 merge window when RCU'd vfsmounts went in; as it
> > is, today we have several filesystems with exact same kind of breakage.
> > hfsplus and affs breakage had been there in 3.13 (missed those two), exfat
> > and ntfs3 - introduced later, by initial merges of filesystems in question.
> > Missed on review...
> > 
> > Hell knows - perhaps Documentation/filesystems/whack-a-mole might be a good
> > idea...

pitfalls.rst or common-bugs.rst

or something like that.

> 
> Actually, utf8 casefolding stuff also has the same problem, so ext4 and f2fs
> with casefolding are also affected ;-/
Christian Brauner Sept. 15, 2023, 9:44 a.m. UTC | #9
On Thu, Sep 14, 2023 at 05:58:05PM +0100, Al Viro wrote:
> On Thu, Sep 14, 2023 at 04:02:25PM +0200, Christian Brauner wrote:
> 
> > Yes, you're right that making the superblock and not the filesytem type
> > the bd_holder changes the logic and we are aware of that of course. And
> > it requires changes such as moving additional block device closing from
> > where some callers currently do it.
> 
> Details, please?

Filesystems like xfs and ext4 that closed additional block devices (For
example, the logdev= mount option for xfs.) in put_super() could go
through stuff like:

blkdev_put()
-> bdev->bd_disk->fops->release() == lo_release()
   -> __loop_clr_fd()
      -> disk_force_media_change()
         -> __invalidate_device()
            -> get_super()

which wouldn't have been a problem before because get_super() matched on
sb->s_bdev which obviously doesn't work because a log device or rt
device or whatever isn't the main block device. So we couldn't have
deadlocked.

But the fact that it is called in that manner from that place in the
first place is wildly adventurous especially considering that there
isn't __a single comment__ in that code why that is safe. So good luck
figuring this all out.

Now that we don't have to do that s_bdev matching thing anymore because
we directly associate the superblock with the block device we can go
straight from block device to superblock. But now calling blkdev_put()
under put_super() which holds s_umount could deadlock. So it's moved to
kill_sb where it should've always been called. Even without the
potential deadlock in the new scheme that's cleaner and easier to
understand imho and it just works for any block device.

> Note that Christoph's series has mashed (2) and (3) together, resulting
> in UAF in a bunch of places.  And I'm dead serious about

Yes, that I did fix as far as I'm aware. If the rules would've been
written down where when something was freed we would've had an easier
time figuring this out though. But they weren't so we missed it.

> Documentation/filesystems/porting being the right place; any development

Yes, agreed. I'll write a document for Christoph's next version.

I know that what you're saying is roughly that we shouldn't make the
same mistake as were done before but the fact that the old lifetime
rules weren't documented in any meaningful way and now we get grumbled
at in turn makes me grumble a bit. :) But overall point duly taken.

> tree of any filesystem (in-tree one or not) will have to go through the
> changes and figure out WTF to do with their existing code.  We are
> going to play whack-a-mole for at least several years as development
> branches get rebased and merged.

Let me write something up.

> 
> Incidentally, I'm going to add a (belated by 10 years) chunk in porting.rst
> re making sure that anything in superblock that might be needed by methods
> called in RCU mode should *not* be freed without an RCU delay...  Should've
> done that back in 3.12 merge window when RCU'd vfsmounts went in; as it
> is, today we have several filesystems with exact same kind of breakage.
> hfsplus and affs breakage had been there in 3.13 (missed those two), exfat
> and ntfs3 - introduced later, by initial merges of filesystems in question.
> Missed on review...

Cool, thanks for adding that.
Christian Brauner Sept. 15, 2023, 2:12 p.m. UTC | #10
> > tree of any filesystem (in-tree one or not) will have to go through the
> > changes and figure out WTF to do with their existing code.  We are
> > going to play whack-a-mole for at least several years as development
> > branches get rebased and merged.
> 
> Let me write something up.

So here I've written two porting.rst patches that aim to reflect the
current state of things (They do _not_ reflect what's in Christoph's
series here as that'ss again pretty separate and will require additional
spelling out.).

I'm adding explanation for both the old and new logic fwiw. I hope to
upstream these docs soon so we all have something to point to.

From 200666901f53db74edf309d48e3c74fd275a822a Mon Sep 17 00:00:00 2001
From: Christian Brauner <brauner@kernel.org>
Date: Fri, 15 Sep 2023 16:01:02 +0200
Subject: [PATCH 1/2] porting: document new block device opening order

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 Documentation/filesystems/porting.rst | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index deac4e973ddc..f436b64b77bf 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -949,3 +949,27 @@ mmap_lock held.  All in-tree users have been audited and do not seem to
 depend on the mmap_lock being held, but out of tree users should verify
 for themselves.  If they do need it, they can return VM_FAULT_RETRY to
 be called with the mmap_lock held.
+
+---
+
+**mandatory**
+
+The order of opening block devices and matching or creating superblocks has
+changed.
+
+The old logic opened block devices first and then tried to find a
+suitable superblock to reuse based on the block device pointer.
+
+The new logic finds or creates a superblock first, opening block devices
+afterwards. Since opening block devices cannot happen under s_umount because of
+lock ordering requirements s_umount is now dropped while opening block
+devices and reacquired before calling fill_super().
+
+In the old logic concurrent mounters would find the superblock on the list of
+active superblock for the filesystem type. Since the first opener of the block
+device would hold s_umount they would wait until the superblock became either
+born or died prematurely due to initialization failure.
+
+Since the new logic drops s_umount concurrent mounters could grab s_umount and
+would spin. Instead they are now made to wait using an explicit wait-wake
+mechanism without having to hold s_umount.
Al Viro Sept. 15, 2023, 2:28 p.m. UTC | #11
On Fri, Sep 15, 2023 at 04:12:07PM +0200, Christian Brauner wrote:
> +	static void some_fs_kill_sb(struct super_block *sb)
> +	{
> +		struct some_fs_info *info = sb->s_fs_info;
> +
> +		kill_*_super(sb);
> +		kfree(info);
> +	}
> +
> +It's best practice to never deviate from this pattern.

The last part is flat-out incorrect.  If e.g. fatfs or cifs ever switches
to that pattern, you'll get UAF - they need freeing of ->s_fs_info
of anything that ever had been mounted done with RCU delay; moreover,
unload_nls() in fatfs needs to be behind the same.

Lifetime rules for fs-private parts of superblock are really private to
filesystem; their use by sget/sget_fc callbacks might impose restrictions
on those, but that again is none of the VFS business.
Al Viro Sept. 15, 2023, 2:33 p.m. UTC | #12
On Fri, Sep 15, 2023 at 03:28:14PM +0100, Al Viro wrote:
> On Fri, Sep 15, 2023 at 04:12:07PM +0200, Christian Brauner wrote:
> > +	static void some_fs_kill_sb(struct super_block *sb)
> > +	{
> > +		struct some_fs_info *info = sb->s_fs_info;
> > +
> > +		kill_*_super(sb);
> > +		kfree(info);
> > +	}
> > +
> > +It's best practice to never deviate from this pattern.
> 
> The last part is flat-out incorrect.  If e.g. fatfs or cifs ever switches
> to that pattern, you'll get UAF - they need freeing of ->s_fs_info
> of anything that ever had been mounted done with RCU delay; moreover,
> unload_nls() in fatfs needs to be behind the same.
> 
> Lifetime rules for fs-private parts of superblock are really private to
> filesystem; their use by sget/sget_fc callbacks might impose restrictions
> on those, but that again is none of the VFS business.

PS: and no, we don't want to impose such RCU delay on every filesystem
out there; what's more, there's nothing to prohibit e.g. having ->s_fs_info
pointing to a refcounted fs-private object (possibly shared by various
superblocks), so freeing might very well be "drop the reference and destroy
if refcount has reached 0".
Christian Brauner Sept. 15, 2023, 2:40 p.m. UTC | #13
> Lifetime rules for fs-private parts of superblock are really private to

Fine, I'll drop that. It's still correct that a filesystem needs to take
care when it frees sb->s_fs_info. See the RCU fun you just encountered.
Christoph Hellwig Sept. 26, 2023, 9:31 a.m. UTC | #14
On Thu, Sep 14, 2023 at 09:56:57AM +0200, Christian Brauner wrote:
> > BTW, this part of commit message in 2c18a63b760a is rather confused:
> >     Recent rework moved block device closing out of sb->put_super() and into
> >     sb->kill_sb() to avoid deadlocks as s_umount is held in put_super() and
> >     blkdev_put() can end up taking s_umount again.
> > 
> > That was *NOT* what a recent rework had done.  Block device closing had never
> > been inside ->put_super() - at no point since that (closing, that is) had been
> > introduced back in 0.97 ;-)  ->put_super() predates it (0.95c+).
> 
> I think the commit message probably just isn't clear enough. The main
> block device of a superblock isn't closed in sb->put_super(). That's
> always been closed in kill_block_super() after generic_shutdown_super().

Yes.

> But afaict filesystem like ext4 and xfs may have additional block
> devices open exclusively and closed them in sb->put_super():
> 
> xfs_fs_put_super()
> -> xfs_close_devices()
>    -> xfs_blkdev_put()
>       -> blkdev_put()
> 
> ext4_put_super()
> -> ext4_blkdev_remove()
>    -> blkdev_put()

Yes.
Christoph Hellwig Sept. 26, 2023, 9:38 a.m. UTC | #15
On Thu, Sep 14, 2023 at 12:27:12AM +0100, Al Viro wrote:
> On Wed, Sep 13, 2023 at 08:09:57AM -0300, Christoph Hellwig wrote:
> > Releasing an anon dev_t is a very common thing when freeing a
> > super_block, as that's done for basically any not block based file
> > system (modulo the odd mtd special case).  So instead of requiring
> > a special ->kill_sb helper and a lot of boilerplate in more complicated
> > file systems, just release the anon dev_t in deactivate_locked_super if
> > the super_block was using one.
> > 
> > As the freeing is done after the main call to kill_super_notify, this
> > removes the need for having two slightly different call sites for it.
> 
> Huh?  At this stage in your series freeing is still in ->kill_sb()
> instances, after the calls of kill_anon_super() you've turned into
> the calls of generic_shutdown_super().

The above refers to freeing the anon dev_t, which at this stage is done
right after the kill_super_notify in generic_shutdown_super.

> You do split it off into a separate method later in the series, but
> at this point you are reopening the same UAF that had been dealt with
> in dc3216b14160 "super: ensure valid info".

How?

Old sequence before his patch:

	deactivate_locked_super()
	  -> kill_anon_super()
	    -> generic_shutdown_super()
	    -> kill_super_notify()
	    -> free_anon_bdev()
	  -> kill_super_notify()

New sequence with this patch:

	deactivate_locked_super()
	  -> generic_shutdown_super()
	    -> kill_super_notify()
	    -> free_anon_bdev()
Christoph Hellwig Sept. 26, 2023, 9:41 a.m. UTC | #16
On Thu, Sep 14, 2023 at 06:38:43AM +0100, Al Viro wrote:
> It's worse - look at the rationale for 2c18a63b760a "super: wait until
> we passed kill super".  Basically, "don't remove from the lists
> until after block device closing".  IOW, we have

As of this stage we don't even touch anything related to block devices..

> That was *NOT* what a recent rework had done.  Block device closing had never
> been inside ->put_super() - at no point since that (closing, that is) had been
> introduced back in 0.97 ;-)  ->put_super() predates it (0.95c+).
> 
> The race is real, but the cause is not some kind of move of blkdev_put().
> Your 2ea6f68932f7 "fs: use the super_block as holder when mounting file
> systems" is where it actually came from.
> 
> Christoph, could you explain what the hell do we need that for?  It does
> create the race in question and AFAICS 2c18a63b760a (and followups trying
> to plug holes in it) had been nothing but headache.

Because it allows us to actually get from the bdev to the holder directly,
someting we've badly neeed for a while.
Al Viro Sept. 26, 2023, 9:25 p.m. UTC | #17
On Tue, Sep 26, 2023 at 11:38:34AM +0200, Christoph Hellwig wrote:

> How?
> 
> Old sequence before his patch:
> 
> 	deactivate_locked_super()
> 	  -> kill_anon_super()
> 	    -> generic_shutdown_super()
> 	    -> kill_super_notify()
> 	    -> free_anon_bdev()
> 	  -> kill_super_notify()
> 
> New sequence with this patch:
> 
> 	deactivate_locked_super()
> 	  -> generic_shutdown_super()
> 	    -> kill_super_notify()
> 	    -> free_anon_bdev()
> 

Before your patch: foo_kill_super() calls kill_anon_super(),
which calls kill_super_notify(), which removes the sucker from
the list, then frees ->s_fs_info.  After your patch:
removal from the lists happens via the call of kill_super_notify()
*after* both of your methods had been called, while freeing
->s_fs_info happens from the method call.  IOW, you've restored
the situation prior to "super: ensure valid info".  The whole
point of that commit had been to make sure that we have nothing
in the lists with ->s_fs_info pointing to a freed object.

It's not about free_anon_bdev(); that part is fine - it's the
"we can drop the weird second call site of kill_super_notify()"
thing that is broken.

Al, still slogging through the rcu pathwalk races in the methods...
The latest catch: nfs_set_verifier() can get called on a dentry
that had just been seen to have positive parent, but is not
pinned down.
	grab ->d_lock; OK, we know that dentry won't get freed under us
	fetch ->d_parent->d_inode
	pass that to nfs_verify_change_attribute()
... which assumes that inode it's been given is not NULL.  Normally it
would've been - ->d_lock stabilizes ->d_parent, and negative dentries
obviously have no children.  Except that we might've been just hit
by dentry_kill() due to eviction on memory pressure, got ->d_lock
right after that and proceeded to play with ->d_parent, just as
that parent is going through dentry_kill() from the same eviction on
memory pressure...  If it gets to dentry_unlink_inode() before we get to
fetching ->d_parent->d_inode, nfs_verify_change_attribute(NULL, whatever)
is going to oops...
Al Viro Sept. 27, 2023, 10:29 p.m. UTC | #18
On Tue, Sep 26, 2023 at 10:25:15PM +0100, Al Viro wrote:

> Before your patch: foo_kill_super() calls kill_anon_super(),
> which calls kill_super_notify(), which removes the sucker from
> the list, then frees ->s_fs_info.  After your patch:
> removal from the lists happens via the call of kill_super_notify()
> *after* both of your methods had been called, while freeing
> ->s_fs_info happens from the method call.  IOW, you've restored
> the situation prior to "super: ensure valid info".  The whole
> point of that commit had been to make sure that we have nothing
> in the lists with ->s_fs_info pointing to a freed object.

More detailed example: take a look at NFS.  We have ->get_tree() there
call sget_fc() with nfs_compare_super() as possible 'test' callback.
It does look at ->s_fs_info of the superblocks found on the list
of instances for fs type in question.  Moreover, it proceeds to
call nfs_compare_mount_options(), which chases pointers from that
(at the very least fetch ->client in nfs_server instance ->s_fs_info
points to and dereferences that).

We really, really do not want nfs_free_server() happen while the
superblock is visible in the instances list.  Now, in your tree
nfs_free_sb() call nfs_free_server().  *Without* having called
kill_super_notify() first - you do that only after the call of
->free_sb().

So with this series applied we have UAF on race between mount and
umount.  For NFS.  No block devices involved.

Old logics had been "after generic_shutdown_super() the private
parts of superblock belong to filesystem alone; they might be
accessed by methods called from RCU pathwalk, but that's it".

I still don't see any clear rules for the new one.  And the more
I'm looking, the more sceptical I get about the approach you've
taken, TBH...
Christoph Hellwig Oct. 2, 2023, 6:46 a.m. UTC | #19
On Tue, Sep 26, 2023 at 10:25:15PM +0100, Al Viro wrote:
> Before your patch: foo_kill_super() calls kill_anon_super(),
> which calls kill_super_notify(), which removes the sucker from
> the list, then frees ->s_fs_info.  After your patch:
> removal from the lists happens via the call of kill_super_notify()
> *after* both of your methods had been called, while freeing
> ->s_fs_info happens from the method call.  IOW, you've restored
> the situation prior to "super: ensure valid info".  The whole
> point of that commit had been to make sure that we have nothing
> in the lists with ->s_fs_info pointing to a freed object.
> 
> It's not about free_anon_bdev(); that part is fine - it's the
> "we can drop the weird second call site of kill_super_notify()"
> thing that is broken.

The point has been to only release the anon dev_t after
kill_super_notify, to prevent two of them beeing reused.

Which we do as the free_anon_bdev is done directly in
deactivate_locked_super.  The new ->free_sb for non-block file systems
frees resources, but none of them matter for sget.
Al Viro Oct. 9, 2023, 9:57 p.m. UTC | #20
On Mon, Oct 02, 2023 at 08:46:46AM +0200, Christoph Hellwig wrote:
> On Tue, Sep 26, 2023 at 10:25:15PM +0100, Al Viro wrote:
> > Before your patch: foo_kill_super() calls kill_anon_super(),
> > which calls kill_super_notify(), which removes the sucker from
> > the list, then frees ->s_fs_info.  After your patch:
> > removal from the lists happens via the call of kill_super_notify()
> > *after* both of your methods had been called, while freeing
> > ->s_fs_info happens from the method call.  IOW, you've restored
> > the situation prior to "super: ensure valid info".  The whole
> > point of that commit had been to make sure that we have nothing
> > in the lists with ->s_fs_info pointing to a freed object.
> > 
> > It's not about free_anon_bdev(); that part is fine - it's the
> > "we can drop the weird second call site of kill_super_notify()"
> > thing that is broken.
> 
> The point has been to only release the anon dev_t after
> kill_super_notify, to prevent two of them beeing reused.
> 
> Which we do as the free_anon_bdev is done directly in
> deactivate_locked_super.  The new ->free_sb for non-block file systems
> frees resources, but none of them matter for sget.

We keep talking past each other...  Let me try again:
at the tip of your branch you have

static struct file_system_type ubifs_fs_type = {
        .name    = "ubifs",
	.owner   = THIS_MODULE,
	.mount   = ubifs_mount,
	.free_sb = ubifs_free_sb,
};

static void ubifs_free_sb(struct super_block *s)
{
        kfree(s->s_fs_info);
}

static struct dentry *ubifs_mount(struct file_system_type *fs_type, int flags,
                        const char *name, void *data)
{
	...
        sb = sget(fs_type, sb_test, sb_set, flags, c);
	...
}

static int sb_test(struct super_block *sb, void *data)
{
        struct ubifs_info *c1 = data;
        struct ubifs_info *c = sb->s_fs_info;

        return c->vi.cdev == c1->vi.cdev;
}

See the problem?  Mainline has

static void kill_ubifs_super(struct super_block *s)
{
        struct ubifs_info *c = s->s_fs_info;
        kill_anon_super(s);
        kfree(c);
}
and
void kill_anon_super(struct super_block *sb)
{
        dev_t dev = sb->s_dev;
        generic_shutdown_super(sb);
        kill_super_notify(sb);
        free_anon_bdev(dev);
}

That removes the superblock from the list of instances before its
->s_fs_info is freed.  In your branch removal happens here:

        if (fs->shutdown_sb)
                fs->shutdown_sb(s);
        generic_shutdown_super(s);
        if (fs->free_sb)
                fs->free_sb(s);

        kill_super_notify(s);

That comes *after* ubifs_free_sb() has freed ->s_fs_info.  And there's
nothing to stop ubifs_mount() (on a completely unrelated device) to get
called right at that moment.  Doing the sget() call quoted above.  Now,
in sget() we have
                hlist_for_each_entry(old, &type->fs_supers, s_instances) {
                        if (!test(old, data))
and that will hit sb_test(old, data), with old being a superblock still
in ->fs_supers, but with ->s_fs_info already freed.  So in sb_test()
we have c equal to old->s_fs_info and
        return c->vi.cdev == c1->vi.cdev;
is a bloody use after free.

Here we are unlikely to get fucked over - it's a plain fetch from freed
object.  If you look at e.g. nfs, you'll see a lot more than that -
pointer chasing from freed (and possibly reused) object.  The only
difference is that there you have sget_fc() instead of sget() - same
loop anyway.

The bottom line: in the form it is posted, your series reintroduces the
class of UAF that had been added by taking removal from the instances
list out of generic_shutdown_super() and then papered over by adding
that kill_super_notify() into kill_anon_super().

And frankly, I believe that the root cause is the insistence that
list removal should happen after generic_shutdown_super().  Sure, you
want the superblock to serve as bdev holder, which leads to fun
with -EBUSY if mount comes while umount still hadn't closed the
device.  I suspect that it would make a lot more sense to
introduce an intermediate state - "held, but will be released
in a short while".  You already have something similar, but
only for the entire disk ->bd_claiming stuff.

Add a new primitive (will_release_bdev()), so that attempts to
claim the sucker will wait until it gets released instead of
failing with -EBUSY.  And do *that* before generic_shutdown_super()
when unmounting something that is block-based.  Allows to bring
the list removal back where it used to be, no UAF at all...

IMO that direction is a lot more promising.
Christian Brauner Oct. 10, 2023, 8:44 a.m. UTC | #21
> list removal should happen after generic_shutdown_super().  Sure, you
> want the superblock to serve as bdev holder, which leads to fun
> with -EBUSY if mount comes while umount still hadn't closed the
> device.  I suspect that it would make a lot more sense to
> introduce an intermediate state - "held, but will be released
> in a short while".  You already have something similar, but
> only for the entire disk ->bd_claiming stuff.
> 
> Add a new primitive (will_release_bdev()), so that attempts to
> claim the sucker will wait until it gets released instead of
> failing with -EBUSY.  And do *that* before generic_shutdown_super()
> when unmounting something that is block-based.  Allows to bring
> the list removal back where it used to be, no UAF at all...

This is essentially equivalent to what is done right now. Only that this
would then happen in the block layer. I'm not sure it would buy us that
much. In all likelyhood we just get a range of other issues to fix.
Al Viro Oct. 17, 2023, 7:50 p.m. UTC | #22
On Tue, Oct 10, 2023 at 10:44:09AM +0200, Christian Brauner wrote:
> > list removal should happen after generic_shutdown_super().  Sure, you
> > want the superblock to serve as bdev holder, which leads to fun
> > with -EBUSY if mount comes while umount still hadn't closed the
> > device.  I suspect that it would make a lot more sense to
> > introduce an intermediate state - "held, but will be released
> > in a short while".  You already have something similar, but
> > only for the entire disk ->bd_claiming stuff.
> > 
> > Add a new primitive (will_release_bdev()), so that attempts to
> > claim the sucker will wait until it gets released instead of
> > failing with -EBUSY.  And do *that* before generic_shutdown_super()
> > when unmounting something that is block-based.  Allows to bring
> > the list removal back where it used to be, no UAF at all...
> 
> This is essentially equivalent to what is done right now. Only that this
> would then happen in the block layer. I'm not sure it would buy us that
> much. In all likelyhood we just get a range of other issues to fix.

The difference is, we separate the "close the block device" (which
can't be done until we stopped generating any IO on it, obviously)
from "tell anyone who wants to claim the sucker that we are going
to release it and they just need to wait".  That can be done before
generic_shutdown_super(), or from it (e.g. from ->put_super()),
untangling the ordering mess.
diff mbox series

Patch

diff --git a/block/bdev.c b/block/bdev.c
index f3b13aa1b7d428..9db691401497bb 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -358,7 +358,6 @@  static int bd_init_fs_context(struct fs_context *fc)
 static struct file_system_type bd_type = {
 	.name		= "bdev",
 	.init_fs_context = bd_init_fs_context,
-	.kill_sb	= kill_anon_super,
 };
 
 struct super_block *blockdev_superblock __read_mostly;
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 0da9232ea1754b..a9315b7396e68a 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -397,7 +397,6 @@  static int dax_init_fs_context(struct fs_context *fc)
 static struct file_system_type dax_fs_type = {
 	.name		= "dax",
 	.init_fs_context = dax_init_fs_context,
-	.kill_sb	= kill_anon_super,
 };
 
 static int dax_test(struct inode *inode, void *data)
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 21916bba77d58b..7313e99f6e8ea5 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -125,7 +125,6 @@  static int dma_buf_fs_init_context(struct fs_context *fc)
 static struct file_system_type dma_buf_fs_type = {
 	.name = "dmabuf",
 	.init_fs_context = dma_buf_fs_init_context,
-	.kill_sb = kill_anon_super,
 };
 
 static int dma_buf_mmap_internal(struct file *file, struct vm_area_struct *vma)
diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index 3eda026ffac6a9..83676229cbe233 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -530,7 +530,6 @@  static struct file_system_type drm_fs_type = {
 	.name		= "drm",
 	.owner		= THIS_MODULE,
 	.init_fs_context = drm_fs_init_fs_context,
-	.kill_sb	= kill_anon_super,
 };
 
 static struct inode *drm_fs_inode_new(void)
diff --git a/drivers/misc/cxl/api.c b/drivers/misc/cxl/api.c
index d85c5653086357..05b40076a0b481 100644
--- a/drivers/misc/cxl/api.c
+++ b/drivers/misc/cxl/api.c
@@ -44,7 +44,6 @@  static struct file_system_type cxl_fs_type = {
 	.name		= "cxl",
 	.owner		= THIS_MODULE,
 	.init_fs_context = cxl_fs_init_fs_context,
-	.kill_sb	= kill_anon_super,
 };
 
 
diff --git a/drivers/scsi/cxlflash/ocxl_hw.c b/drivers/scsi/cxlflash/ocxl_hw.c
index 6542818e595a64..20f22610b104df 100644
--- a/drivers/scsi/cxlflash/ocxl_hw.c
+++ b/drivers/scsi/cxlflash/ocxl_hw.c
@@ -43,7 +43,6 @@  static struct file_system_type ocxlflash_fs_type = {
 	.name		= "ocxlflash",
 	.owner		= THIS_MODULE,
 	.init_fs_context = ocxlflash_fs_init_fs_context,
-	.kill_sb	= kill_anon_super,
 };
 
 /*
diff --git a/fs/9p/vfs_super.c b/fs/9p/vfs_super.c
index 73db55c050bf10..9e60eddf5179ed 100644
--- a/fs/9p/vfs_super.c
+++ b/fs/9p/vfs_super.c
@@ -217,7 +217,7 @@  static void v9fs_kill_super(struct super_block *s)
 
 	p9_debug(P9_DEBUG_VFS, " %p\n", s);
 
-	kill_anon_super(s);
+	generic_shutdown_super(s);
 
 	v9fs_session_cancel(v9ses);
 	v9fs_session_close(v9ses);
diff --git a/fs/afs/super.c b/fs/afs/super.c
index 95d713074dc813..754b9828233497 100644
--- a/fs/afs/super.c
+++ b/fs/afs/super.c
@@ -544,7 +544,7 @@  static void afs_kill_super(struct super_block *sb)
 	 */
 	if (as->volume)
 		rcu_assign_pointer(as->volume->sb, NULL);
-	kill_anon_super(sb);
+	generic_shutdown_super(sb);
 	if (as->volume)
 		afs_deactivate_volume(as->volume);
 	afs_destroy_sbi(as);
diff --git a/fs/aio.c b/fs/aio.c
index a4c2a6bac72ce9..de56e4a880debe 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -293,7 +293,6 @@  static int __init aio_setup(void)
 	static struct file_system_type aio_fs = {
 		.name		= "aio",
 		.init_fs_context = aio_init_fs_context,
-		.kill_sb	= kill_anon_super,
 	};
 	aio_mnt = kern_mount(&aio_fs);
 	if (IS_ERR(aio_mnt))
diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 24192a7667edf7..9c670bbe0f62ce 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -52,7 +52,6 @@  static int anon_inodefs_init_fs_context(struct fs_context *fc)
 static struct file_system_type anon_inode_fs_type = {
 	.name		= "anon_inodefs",
 	.init_fs_context = anon_inodefs_init_fs_context,
-	.kill_sb	= kill_anon_super,
 };
 
 static struct inode *anon_inode_make_secure_inode(
diff --git a/fs/autofs/inode.c b/fs/autofs/inode.c
index 2b49662ed237de..c3b64799155840 100644
--- a/fs/autofs/inode.c
+++ b/fs/autofs/inode.c
@@ -44,7 +44,7 @@  void autofs_kill_sb(struct super_block *sb)
 	/*
 	 * In the event of a failure in get_sb_nodev the superblock
 	 * info is not present so nothing else has been setup, so
-	 * just call kill_anon_super when we are called from
+	 * just call generic_shutdown_super when we are called from
 	 * deactivate_super.
 	 */
 	if (sbi) {
@@ -54,7 +54,7 @@  void autofs_kill_sb(struct super_block *sb)
 	}
 
 	pr_debug("shutting down\n");
-	kill_litter_super(sb);
+	generic_shutdown_super(sb);
 	if (sbi)
 		kfree_rcu(sbi, rcu);
 }
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 09bfe68d2ea3fc..01b86bd4eae8dc 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2141,7 +2141,8 @@  static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 static void btrfs_kill_super(struct super_block *sb)
 {
 	struct btrfs_fs_info *fs_info = btrfs_sb(sb);
-	kill_anon_super(sb);
+
+	generic_shutdown_super(sb);
 	btrfs_free_fs_info(fs_info);
 }
 
diff --git a/fs/btrfs/tests/btrfs-tests.c b/fs/btrfs/tests/btrfs-tests.c
index ca09cf9afce800..c30280376fc32e 100644
--- a/fs/btrfs/tests/btrfs-tests.c
+++ b/fs/btrfs/tests/btrfs-tests.c
@@ -48,7 +48,6 @@  static int btrfs_test_init_fs_context(struct fs_context *fc)
 static struct file_system_type test_type = {
 	.name		= "btrfs_test_fs",
 	.init_fs_context = btrfs_test_init_fs_context,
-	.kill_sb	= kill_anon_super,
 };
 
 struct inode *btrfs_new_test_inode(void)
diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index 2d7f5a8d4a9260..7feef0b35b97b5 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -1557,7 +1557,7 @@  static void ceph_kill_sb(struct super_block *s)
 	}
 
 	mdsc->stopping = CEPH_MDSC_STOPPING_FLUSHED;
-	kill_anon_super(s);
+	generic_shutdown_super(s);
 
 	fsc->client->extra_mon_dispatch = NULL;
 	ceph_fs_debugfs_cleanup(fsc);
diff --git a/fs/coda/inode.c b/fs/coda/inode.c
index 0c7c2528791ebc..2d4ee3c7e8654b 100644
--- a/fs/coda/inode.c
+++ b/fs/coda/inode.c
@@ -325,7 +325,6 @@  struct file_system_type coda_fs_type = {
 	.owner		= THIS_MODULE,
 	.name		= "coda",
 	.mount		= coda_mount,
-	.kill_sb	= kill_anon_super,
 	.fs_flags	= FS_BINARY_MOUNTDATA,
 };
 MODULE_ALIAS_FS("coda");
diff --git a/fs/ecryptfs/main.c b/fs/ecryptfs/main.c
index 2dc927ba067fec..d99b2311759166 100644
--- a/fs/ecryptfs/main.c
+++ b/fs/ecryptfs/main.c
@@ -625,7 +625,8 @@  static struct dentry *ecryptfs_mount(struct file_system_type *fs_type, int flags
 static void ecryptfs_kill_block_super(struct super_block *sb)
 {
 	struct ecryptfs_sb_info *sb_info = ecryptfs_superblock_to_private(sb);
-	kill_anon_super(sb);
+
+	generic_shutdown_super(sb);
 	if (!sb_info)
 		return;
 	ecryptfs_destroy_mount_crypt_stat(&sb_info->mount_crypt_stat);
diff --git a/fs/erofs/super.c b/fs/erofs/super.c
index 44a24d573f1fd3..07c36ccf454e53 100644
--- a/fs/erofs/super.c
+++ b/fs/erofs/super.c
@@ -878,12 +878,12 @@  static void erofs_kill_sb(struct super_block *sb)
 
 	/* pseudo mount for anon inodes */
 	if (sb->s_flags & SB_KERNMOUNT) {
-		kill_anon_super(sb);
+		generic_shutdown_super(sb);
 		return;
 	}
 
 	if (erofs_is_fscache_mode(sb))
-		kill_anon_super(sb);
+		generic_shutdown_super(sb);
 	else
 		kill_block_super(sb);
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 2e4eb7cf26fb33..42523edb32fd53 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1840,7 +1840,7 @@  EXPORT_SYMBOL(fuse_mount_destroy);
 static void fuse_kill_sb_anon(struct super_block *sb)
 {
 	fuse_sb_destroy(sb);
-	kill_anon_super(sb);
+	generic_shutdown_super(sb);
 	fuse_mount_destroy(get_fuse_mount_super(sb));
 }
 
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 5f1be1da92ce94..0a0d593e5a9c79 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -1398,7 +1398,7 @@  static void virtio_kill_sb(struct super_block *sb)
 		if (last)
 			virtio_fs_conn_destroy(fm);
 	}
-	kill_anon_super(sb);
+	generic_shutdown_super(sb);
 	fuse_mount_destroy(fm);
 }
 
diff --git a/fs/hostfs/hostfs_kern.c b/fs/hostfs/hostfs_kern.c
index dc5a5cea5fae41..97f3c9709418c9 100644
--- a/fs/hostfs/hostfs_kern.c
+++ b/fs/hostfs/hostfs_kern.c
@@ -981,7 +981,7 @@  static struct dentry *hostfs_read_sb(struct file_system_type *type,
 
 static void hostfs_kill_sb(struct super_block *s)
 {
-	kill_anon_super(s);
+	generic_shutdown_super(s);
 	kfree(s->s_fs_info);
 }
 
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index c4bf26142eec9b..772d059d4054ec 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -399,7 +399,7 @@  void kernfs_kill_sb(struct super_block *sb)
 	 * Remove the superblock from fs_supers/s_instances
 	 * so we can't find it, before freeing kernfs_super_info.
 	 */
-	kill_anon_super(sb);
+	generic_shutdown_super(sb);
 	kfree(info);
 }
 
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 0d6473cb00cb3e..29d6a55b9d400d 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -1341,7 +1341,7 @@  void nfs_kill_super(struct super_block *s)
 	struct nfs_server *server = NFS_SB(s);
 
 	nfs_sysfs_move_sb_to_server(server);
-	kill_anon_super(s);
+	generic_shutdown_super(s);
 
 	nfs_fscache_release_super_cookie(s);
 
diff --git a/fs/nsfs.c b/fs/nsfs.c
index 647a22433bd8a9..d5fb7c00fe21f1 100644
--- a/fs/nsfs.c
+++ b/fs/nsfs.c
@@ -277,7 +277,6 @@  static int nsfs_init_fs_context(struct fs_context *fc)
 static struct file_system_type nsfs = {
 	.name = "nsfs",
 	.init_fs_context = nsfs_init_fs_context,
-	.kill_sb = kill_anon_super,
 };
 
 void __init nsfs_init(void)
diff --git a/fs/openpromfs/inode.c b/fs/openpromfs/inode.c
index b2457cb97fa008..3b51dd926d11c8 100644
--- a/fs/openpromfs/inode.c
+++ b/fs/openpromfs/inode.c
@@ -427,7 +427,6 @@  static struct file_system_type openprom_fs_type = {
 	.owner		= THIS_MODULE,
 	.name		= "openpromfs",
 	.init_fs_context = openpromfs_init_fs_context,
-	.kill_sb	= kill_anon_super,
 };
 MODULE_ALIAS_FS("openpromfs");
 
diff --git a/fs/orangefs/super.c b/fs/orangefs/super.c
index 5254256a224d7a..42cb3e9b1effee 100644
--- a/fs/orangefs/super.c
+++ b/fs/orangefs/super.c
@@ -602,7 +602,7 @@  void orangefs_kill_sb(struct super_block *sb)
 	gossip_debug(GOSSIP_SUPER_DEBUG, "orangefs_kill_sb: called\n");
 
 	/* provided sb cleanup */
-	kill_anon_super(sb);
+	generic_shutdown_super(sb);
 
 	if (!ORANGEFS_SB(sb)) {
 		mutex_lock(&orangefs_request_mutex);
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index def266b5e2a33b..194cf18787496d 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -1507,7 +1507,6 @@  struct file_system_type ovl_fs_type = {
 	.init_fs_context	= ovl_init_fs_context,
 	.parameters		= ovl_parameter_spec,
 	.fs_flags		= FS_USERNS_MOUNT,
-	.kill_sb		= kill_anon_super,
 };
 MODULE_ALIAS_FS("overlay");
 
diff --git a/fs/pipe.c b/fs/pipe.c
index 6c1a9b1db9076c..858e8b19d78527 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -1439,7 +1439,6 @@  static int pipefs_init_fs_context(struct fs_context *fc)
 static struct file_system_type pipe_fs_type = {
 	.name		= "pipefs",
 	.init_fs_context = pipefs_init_fs_context,
-	.kill_sb	= kill_anon_super,
 };
 
 #ifdef CONFIG_SYSCTL
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 9191248f2dacb4..2282366449ac0b 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -262,14 +262,14 @@  static void proc_kill_sb(struct super_block *sb)
 	struct proc_fs_info *fs_info = proc_sb_info(sb);
 
 	if (!fs_info) {
-		kill_anon_super(sb);
+		generic_shutdown_super(sb);
 		return;
 	}
 
 	dput(fs_info->proc_self);
 	dput(fs_info->proc_thread_self);
 
-	kill_anon_super(sb);
+	generic_shutdown_super(sb);
 	put_pid_ns(fs_info->pid_ns);
 	kfree(fs_info);
 }
diff --git a/fs/smb/client/cifsfs.c b/fs/smb/client/cifsfs.c
index 22869cda13565e..fb792acffeca37 100644
--- a/fs/smb/client/cifsfs.c
+++ b/fs/smb/client/cifsfs.c
@@ -302,7 +302,7 @@  static void cifs_kill_sb(struct super_block *sb)
 		cifs_sb->root = NULL;
 	}
 
-	kill_anon_super(sb);
+	generic_shutdown_super(sb);
 	cifs_umount(cifs_sb);
 }
 
diff --git a/fs/super.c b/fs/super.c
index ab234e6af48605..bbe55f0651cca4 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -438,10 +438,6 @@  static void kill_super_notify(struct super_block *sb)
 {
 	lockdep_assert_not_held(&sb->s_umount);
 
-	/* already notified earlier */
-	if (sb->s_flags & SB_DEAD)
-		return;
-
 	/*
 	 * Remove it from @fs_supers so it isn't found by new
 	 * sget{_fc}() walkers anymore. Any concurrent mounter still
@@ -491,6 +487,13 @@  void deactivate_locked_super(struct super_block *s)
 
 	kill_super_notify(s);
 
+	/*
+	 * If the super_block was using an anon dev_t, release it now that we've
+	 * notified everyone that the super_block is going away.
+	 */
+	if (s->s_dev && MAJOR(s->s_dev) == 0)
+		free_anon_bdev(s->s_dev);
+
 	/*
 	 * Since list_lru_destroy() may sleep, we cannot call it from
 	 * put_super(), where we hold the sb_lock. Therefore we destroy
@@ -1291,20 +1294,11 @@  int set_anon_super(struct super_block *s, void *data)
 }
 EXPORT_SYMBOL(set_anon_super);
 
-void kill_anon_super(struct super_block *sb)
-{
-	dev_t dev = sb->s_dev;
-	generic_shutdown_super(sb);
-	kill_super_notify(sb);
-	free_anon_bdev(dev);
-}
-EXPORT_SYMBOL(kill_anon_super);
-
 void kill_litter_super(struct super_block *sb)
 {
 	if (sb->s_root)
 		d_genocide(sb->s_root);
-	kill_anon_super(sb);
+	generic_shutdown_super(sb);
 }
 EXPORT_SYMBOL(kill_litter_super);
 
diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index b08fb28d16b55b..6527175591a729 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -2350,7 +2350,8 @@  static struct dentry *ubifs_mount(struct file_system_type *fs_type, int flags,
 static void kill_ubifs_super(struct super_block *s)
 {
 	struct ubifs_info *c = s->s_fs_info;
-	kill_anon_super(s);
+
+	generic_shutdown_super(s);
 	kfree(c);
 }
 
diff --git a/fs/vboxsf/super.c b/fs/vboxsf/super.c
index 1fb8f4df60cbb3..aa7e627bf9bd4b 100644
--- a/fs/vboxsf/super.c
+++ b/fs/vboxsf/super.c
@@ -450,7 +450,6 @@  static struct file_system_type vboxsf_fs_type = {
 	.owner			= THIS_MODULE,
 	.name			= "vboxsf",
 	.init_fs_context	= vboxsf_init_fs_context,
-	.kill_sb		= kill_anon_super
 };
 
 /* Module initialization/finalization handlers */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4aeb3fa1192771..129b8c0c83960b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2382,7 +2382,6 @@  extern struct dentry *mount_subtree(struct vfsmount *mnt, const char *path);
 void retire_super(struct super_block *sb);
 void generic_shutdown_super(struct super_block *sb);
 void kill_block_super(struct super_block *sb);
-void kill_anon_super(struct super_block *sb);
 void kill_litter_super(struct super_block *sb);
 void deactivate_super(struct super_block *sb);
 void deactivate_locked_super(struct super_block *sb);
diff --git a/kernel/resource.c b/kernel/resource.c
index b1763b2fd7ef3e..fde412517ef0cc 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -1979,7 +1979,6 @@  static struct file_system_type iomem_fs_type = {
 	.name		= "iomem",
 	.owner		= THIS_MODULE,
 	.init_fs_context = iomem_fs_init_fs_context,
-	.kill_sb	= kill_anon_super,
 };
 
 static int __init iomem_init_inode(void)
diff --git a/mm/secretmem.c b/mm/secretmem.c
index 3afb5ad701e14a..74e1cdb1317cd7 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -275,7 +275,6 @@  static int secretmem_init_fs_context(struct fs_context *fc)
 static struct file_system_type secretmem_fs = {
 	.name		= "secretmem",
 	.init_fs_context = secretmem_init_fs_context,
-	.kill_sb	= kill_anon_super,
 };
 
 static int __init secretmem_init(void)
diff --git a/net/socket.c b/net/socket.c
index c8b08b32f097ec..a137b08a7d94d1 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -425,7 +425,6 @@  static struct vfsmount *sock_mnt __read_mostly;
 static struct file_system_type sock_fs_type = {
 	.name =		"sockfs",
 	.init_fs_context = sockfs_init_fs_context,
-	.kill_sb =	kill_anon_super,
 };
 
 /*
diff --git a/security/apparmor/apparmorfs.c b/security/apparmor/apparmorfs.c
index bd6a910f65282a..eceb5443842cdf 100644
--- a/security/apparmor/apparmorfs.c
+++ b/security/apparmor/apparmorfs.c
@@ -198,7 +198,6 @@  static struct file_system_type aafs_ops = {
 	.owner = THIS_MODULE,
 	.name = AAFS_NAME,
 	.init_fs_context = apparmorfs_init_fs_context,
-	.kill_sb = kill_anon_super,
 };
 
 /**