fs: Avoid grabbing sb->s_umount under bdev->bd_holder_lock

Message ID 20231018152924.3858-1-jack@suse.cz (mailing list archive)
State New, archived
Series fs: Avoid grabbing sb->s_umount under bdev->bd_holder_lock

Commit Message

Jan Kara Oct. 18, 2023, 3:29 p.m. UTC
The implementation of bdev holder operations such as fs_bdev_mark_dead()
and fs_bdev_sync() grabs the sb->s_umount semaphore under
bdev->bd_holder_lock. This is problematic because it leads to
disk->open_mutex -> sb->s_umount lock ordering, which is counterintuitive
(usually we grab higher level (e.g. filesystem) locks first and lower
level (e.g. block layer) locks later) and indeed makes lockdep complain
about possible locking cycles whenever we open a block device while
holding the sb->s_umount semaphore. Implement a function
bdev_super_lock_shared() which safely transitions from holding
bdev->bd_holder_lock to holding sb->s_umount on an alive superblock
without introducing the problematic lock dependency. We use this
function in fs_bdev_sync() and fs_bdev_mark_dead().

Signed-off-by: Jan Kara <jack@suse.cz>
---
 block/bdev.c  |  5 +++--
 block/ioctl.c |  5 +++--
 fs/super.c    | 48 ++++++++++++++++++++++++++++++------------------
 3 files changed, 36 insertions(+), 22 deletions(-)

Comments

Christoph Hellwig Oct. 18, 2023, 3:46 p.m. UTC | #1
Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>
Christian Brauner Oct. 19, 2023, 8:16 a.m. UTC | #2
On Wed, 18 Oct 2023 17:29:24 +0200, Jan Kara wrote:
> The implementation of bdev holder operations such as fs_bdev_mark_dead()
> and fs_bdev_sync() grabs the sb->s_umount semaphore under
> bdev->bd_holder_lock. This is problematic because it leads to
> disk->open_mutex -> sb->s_umount lock ordering, which is counterintuitive
> (usually we grab higher level (e.g. filesystem) locks first and lower
> level (e.g. block layer) locks later) and indeed makes lockdep complain
> about possible locking cycles whenever we open a block device while
> holding the sb->s_umount semaphore. Implement a function
> bdev_super_lock_shared() which safely transitions from holding
> bdev->bd_holder_lock to holding sb->s_umount on an alive superblock
> without introducing the problematic lock dependency. We use this
> function in fs_bdev_sync() and fs_bdev_mark_dead().
> 
> [...]

Thanks!

---

Applied to the vfs.super branch of the vfs/vfs.git tree.
Patches in the vfs.super branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible, patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.super

[1/1] fs: Avoid grabbing sb->s_umount under bdev->bd_holder_lock
      https://git.kernel.org/vfs/vfs/c/4f4f1b3da625
Christian Brauner Oct. 19, 2023, 8:33 a.m. UTC | #3
On Wed, Oct 18, 2023 at 05:29:24PM +0200, Jan Kara wrote:
> The implementation of bdev holder operations such as fs_bdev_mark_dead()
> and fs_bdev_sync() grabs the sb->s_umount semaphore under
> bdev->bd_holder_lock. This is problematic because it leads to
> disk->open_mutex -> sb->s_umount lock ordering, which is counterintuitive
> (usually we grab higher level (e.g. filesystem) locks first and lower
> level (e.g. block layer) locks later) and indeed makes lockdep complain
> about possible locking cycles whenever we open a block device while
> holding the sb->s_umount semaphore. Implement a function

With this patch together with my series that Christoph sent out for me
Link: https://lore.kernel.org/r/20231017184823.1383356-1-hch@lst.de
two days ago (tyvm!), the lockdep false positives are all gone and we
have also eliminated the counterintuitive ordering requirement that
forced us to give up s_umount before opening block devices.

I verified that yesterday and did a bunch of testing via sudden
device removal.

Christoph had thankfully added generic/730 and generic/731 to emulate
some device removal. I also messed around with the loop code and
specifically used LOOP_CHANGE_FD to force a disk_force_media_change() on
a filesystem.
Jan Kara Oct. 19, 2023, 10:57 a.m. UTC | #4
On Thu 19-10-23 10:33:36, Christian Brauner wrote:
> On Wed, Oct 18, 2023 at 05:29:24PM +0200, Jan Kara wrote:
> > The implementation of bdev holder operations such as fs_bdev_mark_dead()
> > and fs_bdev_sync() grabs the sb->s_umount semaphore under
> > bdev->bd_holder_lock. This is problematic because it leads to
> > disk->open_mutex -> sb->s_umount lock ordering, which is counterintuitive
> > (usually we grab higher level (e.g. filesystem) locks first and lower
> > level (e.g. block layer) locks later) and indeed makes lockdep complain
> > about possible locking cycles whenever we open a block device while
> > holding the sb->s_umount semaphore. Implement a function
> 
> With this patch together with my series that Christoph sent out for me
> Link: https://lore.kernel.org/r/20231017184823.1383356-1-hch@lst.de
> two days ago (tyvm!), the lockdep false positives are all gone and we
> have also eliminated the counterintuitive ordering requirement that
> forced us to give up s_umount before opening block devices.
> 
> I verified that yesterday and did a bunch of testing via sudden
> device removal.
> 
> Christoph had thankfully added generic/730 and generic/731 to emulate
> some device removal. I also messed around with the loop code and
> specifically used LOOP_CHANGE_FD to force a disk_force_media_change() on
> a filesystem.

Ah, glad to hear that! So now we can also slowly work on undoing the
"unlock s_umount, open bdev, lock s_umount" games we have introduced in
several places. But I guess let's wait a bit for the dust to settle :)

								Honza
Christoph Hellwig Oct. 19, 2023, 1:40 p.m. UTC | #5
On Thu, Oct 19, 2023 at 10:33:36AM +0200, Christian Brauner wrote:
> some device removal. I also messed around with the loop code and
> specifically used LOOP_CHANGE_FD to force a disk_force_media_change() on
> a filesystem.

Can you wire that up for either blktests or xfstests as well?
Christian Brauner Oct. 20, 2023, 11:18 a.m. UTC | #6
On Thu, Oct 19, 2023 at 12:57:17PM +0200, Jan Kara wrote:
> On Thu 19-10-23 10:33:36, Christian Brauner wrote:
> > On Wed, Oct 18, 2023 at 05:29:24PM +0200, Jan Kara wrote:
> > > The implementation of bdev holder operations such as fs_bdev_mark_dead()
> > > and fs_bdev_sync() grabs the sb->s_umount semaphore under
> > > bdev->bd_holder_lock. This is problematic because it leads to
> > > disk->open_mutex -> sb->s_umount lock ordering, which is counterintuitive
> > > (usually we grab higher level (e.g. filesystem) locks first and lower
> > > level (e.g. block layer) locks later) and indeed makes lockdep complain
> > > about possible locking cycles whenever we open a block device while
> > > holding the sb->s_umount semaphore. Implement a function
> > 
> > With this patch together with my series that Christoph sent out for me
> > Link: https://lore.kernel.org/r/20231017184823.1383356-1-hch@lst.de
> > two days ago (tyvm!), the lockdep false positives are all gone and we
> > have also eliminated the counterintuitive ordering requirement that
> > forced us to give up s_umount before opening block devices.
> > 
> > I verified that yesterday and did a bunch of testing via sudden
> > device removal.
> > 
> > Christoph had thankfully added generic/730 and generic/731 to emulate
> > some device removal. I also messed around with the loop code and
> > specifically used LOOP_CHANGE_FD to force a disk_force_media_change() on
> > a filesystem.
> 
> Ah, glad to hear that! So now we can also slowly work on undoing the
> "unlock s_umount, open bdev, lock s_umount" games we have introduced in
> several places. But I guess let's wait a bit for the dust to settle :)

Yeah, I had thought about this right away as well. And agreed, that can
happen later. :)
Christian Brauner Oct. 20, 2023, 11:31 a.m. UTC | #7
On Thu, Oct 19, 2023 at 06:40:27AM -0700, Christoph Hellwig wrote:
> On Thu, Oct 19, 2023 at 10:33:36AM +0200, Christian Brauner wrote:
> > some device removal. I also messed around with the loop code and
> > specifically used LOOP_CHANGE_FD to force a disk_force_media_change() on
> > a filesystem.
> 
> Can you wire that up for either blktests or xfstests as well?

Yeah, I'll try to find some time to do this.

So while I was testing this I realized that the behavior of
LOOP_CHANGE_FD changed in commit 9f65c489b68d ("loop: raise media_change
event") and I'm not clear whether this is actually correct or not.

loop(4) states

"Switch the backing store of the loop device to the new file identified
 by the file descriptor specified in the (third) ioctl(2) argument,
 which is an integer.  This operation is possible only if the loop
 device is read-only and the new backing store is the same size and type
 as the old backing store."

So the original use-case for this ioctl seems to have been to silently
swap out the backing store. Specifically, it seems to have been used once
upon a time for live images together with device mapper over read-only
loop devices, where device mapper could direct the writes to some other
location or similar.

Assuming that's correct, I think as long as you have something like
device mapper that doesn't use blk_holder_ops it would still work. But
that commit changed behavior for filesystems because we now do:

loop_change_fd()
-> disk_force_media_change()
   -> bdev_mark_dead()
      -> bdev->bd_holder_ops->mark_dead::fs_mark_dead()

So in essence if you have a filesystem on a loop device via:

truncate -s 10G img1
mkfs.xfs -f img1
LOOP_DEV=$(sudo losetup -f --read-only --show img1)

truncate -s 10G img2
sudo ./loop_change_fd $LOOP_DEV ./img2

We'll shut down that filesystem. I personally think that's correct, and
it's buggy not to do that when we have the ability to inform the fs
about media removal, but technically it kinda defeats the original
use-case for LOOP_CHANGE_FD.
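
(The loop_change_fd helper invoked above isn't shown in the thread; the
following is a minimal sketch of what it could look like, with the file
name, usage, and error handling as illustrative assumptions:)

/* loop_change_fd.c - swap the backing file of a read-only loop device */
#include <fcntl.h>
#include <linux/loop.h>
#include <stdio.h>
#include <sys/ioctl.h>

int main(int argc, char **argv)
{
	int loop_fd, backing_fd;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <loopdev> <backing-file>\n", argv[0]);
		return 1;
	}
	loop_fd = open(argv[1], O_RDONLY);
	backing_fd = open(argv[2], O_RDONLY);
	if (loop_fd < 0 || backing_fd < 0) {
		perror("open");
		return 1;
	}
	/*
	 * Ask the loop driver to switch the backing store. Per loop(4) the
	 * loop device must be read-only; since commit 9f65c489b68d this
	 * path also triggers disk_force_media_change().
	 */
	if (ioctl(loop_fd, LOOP_CHANGE_FD, backing_fd) < 0) {
		perror("LOOP_CHANGE_FD");
		return 1;
	}
	return 0;
}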

In practice, however, I don't think it matters because I think no one is
using LOOP_CHANGE_FD in that way. Right now all this is useful for is a
bdev_mark_dead() test.

And one final question:

dev_set_uevent_suppress(disk_to_dev(lo->lo_disk), 1);
disk_force_media_change(lo->lo_disk);
/* more stuff */
dev_set_uevent_suppress(disk_to_dev(lo->lo_disk), 0);

What exactly does that achieve? Does it just delay the delivery of the
uevent until after the disk sequence number has been changed in
disk_force_media_change()? Because it doesn't seem to actually prevent
uevent generation.
Jan Kara Oct. 20, 2023, 12:04 p.m. UTC | #8
On Fri 20-10-23 13:31:20, Christian Brauner wrote:
> On Thu, Oct 19, 2023 at 06:40:27AM -0700, Christoph Hellwig wrote:
> > On Thu, Oct 19, 2023 at 10:33:36AM +0200, Christian Brauner wrote:
> > > some device removal. I also messed around with the loop code and
> > > specifically used LOOP_CHANGE_FD to force a disk_force_media_change() on
> > > a filesystem.
> > 
> > Can you wire that up for either blktests or xfstests as well?
> 
> Yeah, I'll try to find some time to do this.
> 
> So while I was testing this I realized that the behavior of
> LOOP_CHANGE_FD changed in commit 9f65c489b68d ("loop: raise media_change
> event") and I'm not clear whether this is actually correct or not.
> 
> loop(4) states
> 
> "Switch the backing store of the loop device to the new file identified
>  by the file descriptor specified in the (third) ioctl(2) argument,
>  which is an integer.  This operation is possible only if the loop
>  device is read-only and the new backing store is the same size and type
>  as the old backing store."
> 
> So the original use-case for this ioctl seems to have been to silently
> swap out the backing store. Specifically, it seems to have been used once
> upon a time for live images together with device mapper over read-only
> loop devices, where device mapper could direct the writes to some other
> location or similar.

LOOP_CHANGE_FD was an old hacky way to switch from a read-only
installation image to a full-blown RW filesystem without a reboot. I'm
not sure about the details of how it was supposed to work, but
reportedly Al Viro implemented it for Fedora installations back in the
old days.

> Assuming that's correct, I think as long as you have something like
> device mapper that doesn't use blk_holder_ops it would still work. But
> that commit changed behavior for filesystems because we now do:
> 
> loop_change_fd()
> -> disk_force_media_change()
>    -> bdev_mark_dead()
>       -> bdev->bd_holder_ops->mark_dead::fs_mark_dead()
> 
> So in essence if you have a filesystem on a loop device via:
> 
> truncate -s 10G img1
> mkfs.xfs -f img1
> LOOP_DEV=$(sudo losetup -f --read-only --show img1)
> 
> truncate -s 10G img2
> sudo ./loop_change_fd $LOOP_DEV ./img2
> 
> We'll shut down that filesystem. I personally think that's correct, and
> it's buggy not to do that when we have the ability to inform the fs
> about media removal, but technically it kinda defeats the original
> use-case for LOOP_CHANGE_FD.

I agree that it breaks the original use-case for LOOP_CHANGE_FD. I'd say
it also shows that nobody is likely using LOOP_CHANGE_FD anymore. Maybe
it's time to try deprecating it?

> In practice, however, I don't think it matters because I think no one is
> using LOOP_CHANGE_FD in that way. Right now all this is useful for is a
> bdev_mark_dead() test.

:)

> And one final question:
> 
> dev_set_uevent_suppress(disk_to_dev(lo->lo_disk), 1);
> disk_force_media_change(lo->lo_disk);
> /* more stuff */
> dev_set_uevent_suppress(disk_to_dev(lo->lo_disk), 0);
> 
> What exactly does that achieve? Does it just delay the delivery of the
> uevent until after the disk sequence number has been changed in
> disk_force_media_change()? Because it doesn't seem to actually prevent
> uevent generation.

Well, if you grep for dev_get_uevent_suppress() you'll notice there is
exactly *one* place looking at it - the generation of the ADD event when
adding a partition bdev. I'm not sure what the rationale behind this
functionality is.

								Honza
Christian Brauner Oct. 23, 2023, 7:40 a.m. UTC | #9
> I agree that it breaks the original use-case for LOOP_CHANGE_FD. I'd say
> it also shows that nobody is likely using LOOP_CHANGE_FD anymore. Maybe
> it's time to try deprecating it?

Yeah, why don't we try that. In its current form the concept isn't very
useful and is arguably broken, which no one has really noticed for years.

* codesearch.debian.net has zero users
* codesearch on github has zero users
* cs.android.com has zero users

The only hits are definitions, plus strace, ltp, and stress-ng, which
sort of test this functionality. I say we try to deprecate this.
Christian Brauner Oct. 23, 2023, 2:08 p.m. UTC | #10
> > And one final question:
> > 
> > dev_set_uevent_suppress(disk_to_dev(lo->lo_disk), 1);
> > disk_force_media_change(lo->lo_disk);
> > /* more stuff */
> > dev_set_uevent_suppress(disk_to_dev(lo->lo_disk), 0);
> > 
> > What exactly does that achieve? Does it just delay the delivery of the
> > uevent until after the disk sequence number has been changed in
> > disk_force_media_change()? Because it doesn't seem to actually prevent
> > uevent generation.
> 
> Well, if you grep for dev_get_uevent_suppress() you'll notice there is
> exactly *one* place looking at it - the generation of the ADD event when
> adding a partition bdev. I'm not sure what the rationale behind this
> functionality is.

I looked at dev_set_uevent_suppress() before, and what it does is fully
prevent the generation of uevents for the kobject. It doesn't just hold
them back like the "uncork" comments in loop_change_fd() and
loop_configure() suggest:

static inline void dev_set_uevent_suppress(struct device *dev, int val)
{
        dev->kobj.uevent_suppress = val;
}

and then

int kobject_uevent_env(struct kobject *kobj, enum kobject_action action,
                       char *envp_ext[])
{

        [...]
 
        /* skip the event, if uevent_suppress is set*/
        if (kobj->uevent_suppress) {
                pr_debug("kobject: '%s' (%p): %s: uevent_suppress "
                                 "caused the event to drop!\n",
                                 kobject_name(kobj), kobj, __func__);
                return 0;
        }

So commit 498ef5c777d9 ("loop: suppress uevents while reconfiguring the device")
tried to fix a problem where uevents were generated for LOOP_SET_FD
before LOOP_SET_STATUS* was called.

That broke LOOP_CONFIGURE because LOOP_CONFIGURE is supposed to be
LOOP_SET_FD + LOOP_SET_STATUS in one shot.

Then commit bb430b694226 ("loop: LOOP_CONFIGURE: send uevents for partitions")
fixed that by moving loop_reread_partitions() out of the uevent
suppression.

Now you get uevents if you trigger a partition rescan, but only if there
are actually partitions. What you never get, however, is a media change
event even though we do increment the disk sequence number and attach an
image to the loop device.

This seems problematic because we leave userspace unable to listen for
attaching images to a loop device. Shouldn't we generate the media
change event after we're done setting up the device and before the
partition rescan for LOOP_CONFIGURE?
Christoph Hellwig Oct. 24, 2023, 7:06 a.m. UTC | #11
On Mon, Oct 23, 2023 at 04:08:21PM +0200, Christian Brauner wrote:
> Now you get uevents if you trigger a partition rescan, but only if there
> are actually partitions. What you never get, however, is a media change
> event even though we do increment the disk sequence number and attach an
> image to the loop device.
> 
> This seems problematic because we leave userspace unable to listen for
> attaching images to a loop device. Shouldn't we generate the media
> change event after we're done setting up the device and before the
> partition rescan for LOOP_CONFIGURE?

Maybe.  I think people mostly didn't care about the event anyway, but
about the changing sequence number to check that the content hasn't
changed.
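
(Not from the thread: a minimal sketch of how userspace can read that
sequence number, using the BLKGETDISKSEQ ioctl from <linux/fs.h>;
the file name and usage are illustrative only:)

/* diskseq.c - print the disk sequence number of a block device */
#include <fcntl.h>
#include <linux/fs.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>

int main(int argc, char **argv)
{
	uint64_t seq;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <blockdev>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0 || ioctl(fd, BLKGETDISKSEQ, &seq) < 0) {
		perror(argv[1]);
		return 1;
	}
	/* The same value is exposed in /sys/block/<disk>/diskseq. */
	printf("%llu\n", (unsigned long long)seq);
	return 0;
}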
Christian Brauner Oct. 24, 2023, 8:42 a.m. UTC | #12
On Tue, Oct 24, 2023 at 12:06:26AM -0700, Christoph Hellwig wrote:
> On Mon, Oct 23, 2023 at 04:08:21PM +0200, Christian Brauner wrote:
> > Now you get uevents if you trigger a partition rescan, but only if there
> > are actually partitions. What you never get, however, is a media change
> > event even though we do increment the disk sequence number and attach an
> > image to the loop device.
> > 
> > This seems problematic because we leave userspace unable to listen for
> > attaching images to a loop device. Shouldn't we generate the media
> > change event after we're done setting up the device and before the
> > partition rescan for LOOP_CONFIGURE?
> 
> Maybe.  I think people mostly didn't care about the event anyway, but
> about the changing sequence number to check that the content hasn't
> changed.

Yes, exactly. The core of it is the changed sequence number, but you
don't get that notification if you don't have any partitions and you
request a partition rescan from LOOP_CONFIGURE.

So I think we should always send the media change event for the sequence
number independently of the partition notification.
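
(Again not from the thread: a small libudev listener, illustrative only,
showing what userspace would observe if the media change event were
always sent - block uevents carry the action and a DISKSEQ property:)

/* uevent_watch.c - print block uevents and their DISKSEQ; link with -ludev */
#include <libudev.h>
#include <poll.h>
#include <stdio.h>

int main(void)
{
	struct udev *udev = udev_new();
	struct udev_monitor *mon = udev_monitor_new_from_netlink(udev, "udev");
	struct pollfd pfd;

	udev_monitor_filter_add_match_subsystem_devtype(mon, "block", NULL);
	udev_monitor_enable_receiving(mon);
	pfd.fd = udev_monitor_get_fd(mon);
	pfd.events = POLLIN;

	for (;;) {
		struct udev_device *dev;
		const char *seq;

		if (poll(&pfd, 1, -1) <= 0)
			continue;
		dev = udev_monitor_receive_device(mon);
		if (!dev)
			continue;
		seq = udev_device_get_property_value(dev, "DISKSEQ");
		printf("%s %s DISKSEQ=%s\n", udev_device_get_action(dev),
		       udev_device_get_sysname(dev), seq ? seq : "-");
		udev_device_unref(dev);
	}
}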

Patch

diff --git a/block/bdev.c b/block/bdev.c
index f3b13aa1b7d4..a9a485aae6b0 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -961,9 +961,10 @@  void bdev_mark_dead(struct block_device *bdev, bool surprise)
 	mutex_lock(&bdev->bd_holder_lock);
 	if (bdev->bd_holder_ops && bdev->bd_holder_ops->mark_dead)
 		bdev->bd_holder_ops->mark_dead(bdev, surprise);
-	else
+	else {
+		mutex_unlock(&bdev->bd_holder_lock);
 		sync_blockdev(bdev);
-	mutex_unlock(&bdev->bd_holder_lock);
+	}
 
 	invalidate_bdev(bdev);
 }
diff --git a/block/ioctl.c b/block/ioctl.c
index d5f5cd61efd7..fc492f9d34ae 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -370,9 +370,10 @@  static int blkdev_flushbuf(struct block_device *bdev, unsigned cmd,
 	mutex_lock(&bdev->bd_holder_lock);
 	if (bdev->bd_holder_ops && bdev->bd_holder_ops->sync)
 		bdev->bd_holder_ops->sync(bdev);
-	else
+	else {
+		mutex_unlock(&bdev->bd_holder_lock);
 		sync_blockdev(bdev);
-	mutex_unlock(&bdev->bd_holder_lock);
+	}
 
 	invalidate_bdev(bdev);
 	return 0;
diff --git a/fs/super.c b/fs/super.c
index 2d762ce67f6e..8b80d03e7cb4 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1419,32 +1419,45 @@  EXPORT_SYMBOL(sget_dev);
 
 #ifdef CONFIG_BLOCK
 /*
- * Lock a super block that the callers holds a reference to.
+ * Lock the superblock that is holder of the bdev. Returns the superblock
+ * pointer if we successfully locked the superblock and it is alive. Otherwise
+ * we return NULL and just unlock bdev->bd_holder_lock.
  *
- * The caller needs to ensure that the super_block isn't being freed while
- * calling this function, e.g. by holding a lock over the call to this function
- * and the place that clears the pointer to the superblock used by this function
- * before freeing the superblock.
+ * The function must be called with bdev->bd_holder_lock and releases it.
  */
-static bool super_lock_shared_active(struct super_block *sb)
+static struct super_block *bdev_super_lock_shared(struct block_device *bdev)
+	__releases(&bdev->bd_holder_lock)
 {
-	bool born = super_lock_shared(sb);
+	struct super_block *sb = bdev->bd_holder;
+	bool born;
 
+	lockdep_assert_held(&bdev->bd_holder_lock);
+	/* Make sure sb doesn't go away from under us */
+	spin_lock(&sb_lock);
+	sb->s_count++;
+	spin_unlock(&sb_lock);
+	mutex_unlock(&bdev->bd_holder_lock);
+
+	born = super_lock_shared(sb);
 	if (!born || !sb->s_root || !(sb->s_flags & SB_ACTIVE)) {
 		super_unlock_shared(sb);
-		return false;
+		put_super(sb);
+		return NULL;
 	}
-	return true;
+	/*
+	 * The superblock is active and we hold s_umount, we can drop our
+	 * temporary reference now.
+	 */
+	put_super(sb);
+	return sb;
 }
 
 static void fs_bdev_mark_dead(struct block_device *bdev, bool surprise)
 {
-	struct super_block *sb = bdev->bd_holder;
-
-	/* bd_holder_lock ensures that the sb isn't freed */
-	lockdep_assert_held(&bdev->bd_holder_lock);
+	struct super_block *sb;
 
-	if (!super_lock_shared_active(sb))
+	sb = bdev_super_lock_shared(bdev);
+	if (!sb)
 		return;
 
 	if (!surprise)
@@ -1459,11 +1472,10 @@  static void fs_bdev_mark_dead(struct block_device *bdev, bool surprise)
 
 static void fs_bdev_sync(struct block_device *bdev)
 {
-	struct super_block *sb = bdev->bd_holder;
-
-	lockdep_assert_held(&bdev->bd_holder_lock);
+	struct super_block *sb;
 
-	if (!super_lock_shared_active(sb))
+	sb = bdev_super_lock_shared(bdev);
+	if (!sb)
 		return;
 	sync_filesystem(sb);
 	super_unlock_shared(sb);