[RFC,0/6] fs,block: yield devices

Message ID 20231024-vfs-super-rework-v1-0-37a8aa697148@kernel.org (mailing list archive)

Message

Christian Brauner Oct. 24, 2023, 2:53 p.m. UTC
Hey,

This is a mechanism that allows the holder of a block device to yield
device access before actually closing the block device.

If someone yields a device then any concurrent opener claiming the
device exclusively with the same blk_holder_ops as the current owner can
wait for the device to be given up. Filesystems by default use
fs_holder_ops and so can wait on each other.

This mechanism allows us to simplify superblock handling quite a bit at
the expense of requiring filesystems to yield devices. A filesystem must
yield devices under s_umount. This allows costly work to be done outside
of s_umount.
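
Roughly, the intended split looks like this (only a sketch: bdev_yield() is
a stand-in name for whatever helper patch 5 adds, and it assumes that
->yield_devices() is invoked from generic_shutdown_super() while s_umount
is still held):

	static void example_yield_devices(struct super_block *sb)
	{
		/* Under sb->s_umount: cheap marking only, no blocking work. */
		if (sb->s_bdev)
			bdev_yield(sb->s_bdev);	/* placeholder name */
	}

	static void example_kill_sb(struct super_block *sb)
	{
		struct block_device *bdev = sb->s_bdev;

		/* ->yield_devices() runs in here, under s_umount. */
		generic_shutdown_super(sb);

		/*
		 * s_umount is no longer held: the expensive sync and close
		 * can happen here while a concurrent exclusive opener using
		 * the same blk_holder_ops waits for the yielded device.
		 */
		if (bdev) {
			sync_blockdev(bdev);
			blkdev_put(bdev, sb);
		}
	}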

There's nothing wrong with the way we currently do things but this does
allow us to simplify things and kill a whole class of theoretical UAFs
when walking the superblock list.

I had originally considered doing it this way but wasn't able (time-wise)
to code it up. Since we recently had that discussion again, here it is.

Survives both xfstests and blktests. I also tested this by introducing
custom delays into kill_block_super() to widen the race window between the
superblock being removed from the instance list and the device being fully
closed and synced.
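
For illustration, the kind of delay I mean looks something like this (a
test-only hack, not part of the series; the surrounding function body is
only approximated):

	void kill_block_super(struct super_block *sb)
	{
		struct block_device *bdev = sb->s_bdev;

		generic_shutdown_super(sb);

		/* Test-only delay to widen the race window described above. */
		msleep(2000);

		if (bdev) {
			sync_blockdev(bdev);
			blkdev_put(bdev, sb);
		}
	}
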
Based on vfs.super and the freezer work sent out earlier. The commit
messages are very barebones and possible side effects are less thoroughly
analyzed than usual.

Thanks!
Christian

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Christian Brauner (6):
      fs: simplify setup_bdev_super() calls
      xfs: simplify device handling
      ext4: simplify device handling
      bdev: simplify waiting for concurrent claimers
      block: mark device as about to be released
      fs: add ->yield_devices()

 block/bdev.c              | 54 +++++++++++++++++++++++++++-----------
 fs/ext4/super.c           | 15 ++++++++---
 fs/super.c                | 67 ++++++++++++++---------------------------------
 fs/xfs/xfs_super.c        | 46 +++++++++++++++++++++-----------
 include/linux/blk_types.h |  8 +++++-
 include/linux/blkdev.h    |  1 +
 include/linux/fs.h        |  1 +
 7 files changed, 109 insertions(+), 83 deletions(-)
---
base-commit: c6cc4b13e95115c13433136a17150768d562a54c
change-id: 20231024-vfs-super-rework-ca447a3240c9

Comments

Jan Kara Oct. 25, 2023, 5:20 p.m. UTC | #1
Hello!

On Tue 24-10-23 16:53:38, Christian Brauner wrote:
> This is a mechanism that allows the holder of a block device to yield
> device access before actually closing the block device.
> 
> If someone yields a device then any concurrent opener claiming the
> device exclusively with the same blk_holder_ops as the current owner can
> wait for the device to be given up. Filesystems by default use
> fs_holder_ops and so can wait on each other.
> 
> This mechanism allows us to simplify superblock handling quite a bit at
> the expense of requiring filesystems to yield devices. A filesystem must
> yield devices under s_umount. This allows costly work to be done outside
> of s_umount.
> 
> There's nothing wrong with the way we currently do things but this does
> allow us to simplify things and kill a whole class of theoretical UAFs
> when walking the superblock list.

I'm not sure why it is better to create a new ->yield callback called under
sb->s_umount rather than just move blkdev_put() calls back into
->put_super? Or at least yielding could be done in ->put_super instead of
a new callback if for some reason that is better than releasing the devices
right away in ->put_super.

								Honza
Christian Brauner Oct. 25, 2023, 8:46 p.m. UTC | #2
On Wed, Oct 25, 2023 at 07:20:57PM +0200, Jan Kara wrote:
> Hello!
> 
> On Tue 24-10-23 16:53:38, Christian Brauner wrote:
> > This is a mechanism that allows the holder of a block device to yield
> > device access before actually closing the block device.
> > 
> > If someone yields a device then any concurrent opener claiming the
> > device exclusively with the same blk_holder_ops as the current owner can
> > wait for the device to be given up. Filesystems by default use
> > fs_holder_ops and so can wait on each other.
> > 
> > This mechanism allows us to simplify superblock handling quite a bit at
> > the expense of requiring filesystems to yield devices. A filesystem must
> > yield devices under s_umount. This allows costly work to be done outside
> > of s_umount.
> > 
> > There's nothing wrong with the way we currently do things but this does
> > allow us to simplify things and kill a whole class of theoretical UAFs
> > when walking the superblock list.
> 
> I'm not sure why it is better to create a new ->yield callback called under
> sb->s_umount rather than just move blkdev_put() calls back into
> ->put_super? Or at least yielding could be done in ->put_super instead of

The main reason was to not call potentially expensive
blkdev_put()/bdev_release() under s_umount. If we don't care about this
though then this shouldn't be a problem. And yes, then we need to move
blkdev_put()/bdev_release() under s_umount including the main block
device. IOW, we need to ensure that all bdev calls are done under
s_umount before we remove the superblock from the instance list. I think
that should be fine but I wanted to propose an alternative to that as
well: cheap mark-for-release under s_umount and heavy-duty without
s_umount. But I guess it doesn't matter because most filesystems used
to close devices under s_umount before anyway. Let me know what you
think makes the most sense.
Jan Kara Oct. 26, 2023, 10:35 a.m. UTC | #3
On Wed 25-10-23 22:46:29, Christian Brauner wrote:
> On Wed, Oct 25, 2023 at 07:20:57PM +0200, Jan Kara wrote:
> > Hello!
> > 
> > On Tue 24-10-23 16:53:38, Christian Brauner wrote:
> > > This is a mechanism that allows the holder of a block device to yield
> > > device access before actually closing the block device.
> > > 
> > > If someone yields a device then any concurrent opener claiming the
> > > device exclusively with the same blk_holder_ops as the current owner can
> > > wait for the device to be given up. Filesystems by default use
> > > fs_holder_ops and so can wait on each other.
> > > 
> > > This mechanism allows us to simplify superblock handling quite a bit at
> > > the expense of requiring filesystems to yield devices. A filesystem must
> > > yield devices under s_umount. This allows costly work to be done outside
> > > of s_umount.
> > > 
> > > There's nothing wrong with the way we currently do things but this does
> > > allow us to simplify things and kill a whole class of theoretical UAFs
> > > when walking the superblock list.
> > 
> > I'm not sure why it is better to create a new ->yield callback called under
> > sb->s_umount rather than just move blkdev_put() calls back into
> > ->put_super? Or at least yielding could be done in ->put_super instead of
> 
> The main reason was to not call potentially expensive
> blkdev_put()/bdev_release() under s_umount. If we don't care about this
> though then this shouldn't be a problem.

So I would not be really concerned about performance here. The superblock
is dying, nobody can do anything about that until the superblock is fully
dead and cleaned up. Maybe some places could skip such superblocks more
quickly but so far I'm not convinced it matters in practice (e.g. writeback
holds s_umount over the whole sync(1) time and nobody complains). And as
you mention below, we have been doing this for a long time without anybody
really complaining.

> And yes, then we need to move
> blkdev_put()/bdev_release() under s_umount including the main block
> device. IOW, we need to ensure that all bdev calls are done under
> s_umount before we remove the superblock from the instance list.

This is about those seemingly spurious "device busy" errors when the
superblock hasn't closed its devices yet, isn't it?  But we now remove
the superblock from the s_instances list in kill_super_notify() and until that
moment SB_DYING is protecting us from racing mounts. So is there some other
problem?

> I think
> that should be fine but I wanted to propose an alternative to that as
> well: cheap mark-for-release under s_umount and heavy-duty without
> s_umount. But I guess it doesn't matter because most filesystems used
> to close devices under s_umount before anyway. Let me know what you
> think makes the most sense.

I think we should make it as simple as possible for filesystems. As I said
above I don't think s_umount hold time really matters so the only thing
that limits us is the locking constraints. During superblock shutdown the only
thing that currently cannot be done under s_umount (that I'm aware of) is the
teardown of the sb->s_bdi because that waits for writeback threads and
those can be blocked waiting for s_umount.

So if we wanted we could just move ext4 & xfs bdev closing back into
->put_super to avoid ext4_kill_sb() and somewhat slim down xfs_mount_free()
but otherwise I don't see much space for simplification or need for adding
more callbacks :)

								Honza
Christian Brauner Oct. 26, 2023, 11:50 a.m. UTC | #4
On Tue, 24 Oct 2023 16:53:38 +0200, Christian Brauner wrote:
> Hey,
> 
> This is a mechanism that allows the holder of a block device to yield
> device access before actually closing the block device.
> 
> If someone yields a device then any concurrent opener claiming the
> device exclusively with the same blk_holder_ops as the current owner can
> wait for the device to be given up. Filesystems by default use
> fs_holder_ops and so can wait on each other.
> 
> [...]

for v6.8

---

Applied to the vfs.super branch of the vfs/vfs.git tree.
Patches in the vfs.super branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series, allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible, patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.super

[1/6] fs: simplify setup_bdev_super() calls
      https://git.kernel.org/vfs/vfs/c/f15112134111
[2/6] xfs: simplify device handling
      https://git.kernel.org/vfs/vfs/c/cbded4a095bd
[3/6] ext4: simplify device handling
      https://git.kernel.org/vfs/vfs/c/ac85cf49ed93
Christian Brauner Oct. 26, 2023, 12:07 p.m. UTC | #5
On Thu, Oct 26, 2023 at 12:35:03PM +0200, Jan Kara wrote:
> On Wed 25-10-23 22:46:29, Christian Brauner wrote:
> > On Wed, Oct 25, 2023 at 07:20:57PM +0200, Jan Kara wrote:
> > > Hello!
> > > 
> > > On Tue 24-10-23 16:53:38, Christian Brauner wrote:
> > > > This is a mechanism that allows the holder of a block device to yield
> > > > device access before actually closing the block device.
> > > > 
> > > > If someone yields a device then any concurrent opener claiming the
> > > > device exclusively with the same blk_holder_ops as the current owner can
> > > > wait for the device to be given up. Filesystems by default use
> > > > fs_holder_ops and so can wait on each other.
> > > > 
> > > > This mechanism allows us to simplify superblock handling quite a bit at
> > > > the expense of requiring filesystems to yield devices. A filesystem must
> > > > yield devices under s_umount. This allows costly work to be done outside
> > > > of s_umount.
> > > > 
> > > > There's nothing wrong with the way we currently do things but this does
> > > > allow us to simplify things and kill a whole class of theoretical UAFs
> > > > when walking the superblock list.
> > > 
> > > I'm not sure why it is better to create a new ->yield callback called under
> > > sb->s_umount rather than just move blkdev_put() calls back into
> > > ->put_super? Or at least yielding could be done in ->put_super instead of
> > 
> > The main reason was to not call potentially expensive
> > blkdev_put()/bdev_release() under s_umount. If we don't care about this
> > though then this shouldn't be a problem.
> 
> So I would not be really concerned about performance here. The superblock
> is dying, nobody can do anything about that until the superblock is fully
> dead and cleaned up. Maybe some places could skip such superblocks more
> quickly but so far I'm not convinced it matters in practice (e.g. writeback
> holds s_umount over the whole sync(1) time and nobody complains). And as
> you mention below, we have been doing this for a long time without anybody
> really complaining.

Ok, that's a good point.

> 
> > And yes, then we need to move
> > blkdev_put()/bdev_release() under s_umount including the main block
> > device. IOW, we need to ensure that all bdev calls are done under
> > s_umount before we remove the superblock from the instance list.
> 
> This is about those seemingly spurious "device busy" errors when the
> superblock hasn't closed its devices yet, isn't it?  But we now remove

Yes, because we tie superblock and block device neatly together.

> the superblock from the s_instances list in kill_super_notify() and until that
> moment SB_DYING is protecting us from racing mounts. So is there some other
> problem?

No, there isn't a problem at all. It's all working fine but it was
initially a little annoying as we had to update filesystems to ensure
that sb->s_fs_info is kept alive. But it's all fixed.

The possible advantage is that if we drop all block devices under
s_umount then we can remove the superblock from fs_type->s_instances in
the old location again. I'm not convinced it's worth it but it's a
possible simplification. I'm not even arguing it's objectively better; I
think it's a matter of taste in the end.

> 
> > I think
> > that should be fine but I wanted to propose an alternative to that as
> > well: cheap mark-for-release under s_umount and heavy-duty without
> > s_umount. But I guess it doesn't matter because most filesystems used
> > to close devices under s_umount before anyway. Let me know what you
> > think makes the most sense.
> 
> I think we should make it as simple as possible for filesystems. As I said

Yes, I fully agree.

The really good thing about the current mechanism is that it's
completely vfs-only. With the s_umount/open_mutex ordering fixed, filesystems
can now close block devices wherever they like and the vfs will ensure that
you don't get spurious EBUSY. And the filesystem doesn't have to know a
thing about it or take any care.

> above I don't think s_umount hold time really matters so the only thing
> that limits us is the locking constraints. During superblock shutdown the only
> thing that currently cannot be done under s_umount (that I'm aware of) is the
> teardown of the sb->s_bdi because that waits for writeback threads and
> those can be blocked waiting for s_umount.

Which we don't need to change if it's working fine.

> 
> So if we wanted we could just move ext4 & xfs bdev closing back into
> ->put_super to avoid ext4_kill_sb() and somewhat slim down xfs_mount_free()
> but otherwise I don't see much space for simplification or need for adding
> more callbacks :)

I mean, I would only think that is useful if we really wanted to close
all block devices under s_umount to possibly remove the waiting mechanism
we have right now. Otherwise I think it's churn for the sake of churn, and
I quite like what we have right now.
Jan Kara Oct. 26, 2023, 1:04 p.m. UTC | #6
On Thu 26-10-23 14:07:38, Christian Brauner wrote:
> On Thu, Oct 26, 2023 at 12:35:03PM +0200, Jan Kara wrote:
> > This is about those seemingly spurious "device busy" errors when the
> > superblock hasn't closed its devices yet, isn't it?  But we now remove
> 
> Yes, because we tie superblock and block device neatly together.
> 
> > the superblock from the s_instances list in kill_super_notify() and until that
> > moment SB_DYING is protecting us from racing mounts. So is there some other
> > problem?
> 
> No, there isn't a problem at all. It's all working fine but it was
> initially a little annoying as we had to update filesystems to ensure
> that sb->s_fs_info is kept alive. But it's all fixed.
> 
> The possible advantage is that if we drop all block devices under
> s_umount then we can remove the superblock from fs_type->s_instances in
> the old location again. I'm not convinced it's worth it but it's a
> possible simplification. I'm not even arguing it's objectively better; I
> think it's a matter of taste in the end.

Yes. But dropping the main bdev under s_umount is not easy to do with the
current callback structure. We have kill_block_super() calling into
generic_shutdown_super(). Now logically generic_shutdown_super() wants to
tear down the bdi (for which it needs to drop s_umount) but the bdev is closed only
in kill_block_super() because that's the sb-on-bdev specific call.
Previously we got away with this without spurious EBUSY errors because the
bdev holder was the filesystem type and not the superblock. But after
changing the holder it isn't fine anymore and we have to play these games
with s_instances and SB_DYING.

What we could probably do is to have something like:

	if (sb->s_bdev) {
		/* The superblock is the holder since the holder rework. */
		blkdev_put(sb->s_bdev, sb);
		sb->s_bdev = NULL;
	}
	if (sb->s_mtd) {
		put_mtd_device(sb->s_mtd);
		sb->s_mtd = NULL;
	}

in generic_shutdown_super() and then remove the sb from s_instances, drop
s_umount and clean up the bdi.
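
Spelled out a bit more, the tail of generic_shutdown_super() would then
roughly be (sketch only, reusing the current field and symbol names;
bdi_unregister() and the MTD case omitted):

	/* Devices were closed above, while s_umount is still held. */
	spin_lock(&sb_lock);
	hlist_del_init(&sb->s_instances);
	spin_unlock(&sb_lock);

	/*
	 * Only now drop s_umount; the bdi teardown must stay outside of it
	 * because it waits for writeback workers.
	 */
	up_write(&sb->s_umount);
	if (sb->s_bdi != &noop_backing_dev_info) {
		bdi_put(sb->s_bdi);
		sb->s_bdi = &noop_backing_dev_info;
	}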

Then we can get rid of SB_DYING, kill_super_notify() and stuff, but depending on
your taste it could also be viewed as a kind of layering violation so I'm not
100% convinced this is definitely a way to go.

								Honza
Christian Brauner Oct. 26, 2023, 3:08 p.m. UTC | #7
> your taste it could also be viewed as a kind of layering violation so I'm not
> 100% convinced this is definitely a way to go.

Yeah, I'm not convinced either. As I said, I really like that right now
it's a vfs thing only and we don't have specific requirements about how
devices are closed, which is really nice. So let's just leave it as is?
Jan Kara Oct. 26, 2023, 3:58 p.m. UTC | #8
On Thu 26-10-23 17:08:00, Christian Brauner wrote:
> your taste it could also be viewed as a kind of layering violation so I'm not
> > 100% convinced this is definitely a way to go.
> 
> Yeah, I'm not convinced either. As I said, I really like that right now
> it's a vfs thing only and we don't have specific requirements about how
> devices are closed, which is really nice. So let's just leave it as is?

Yes, fine by me.

								Honza
Christoph Hellwig Oct. 27, 2023, 7:24 a.m. UTC | #9
On Thu, Oct 26, 2023 at 05:08:00PM +0200, Christian Brauner wrote:
> your taste it could also be viewed as a kind of layering violation so I'm not
> > 100% convinced this is definitely a way to go.
> 
> Yeah, I'm not convinced either. As I said, I really like that right now
> it's a vfs thing only and we don't have specific requirements about how
> devices are closed, which is really nice. So let's just leave it as is?

Yes, I can't really get excited about this change in any way.