diff mbox series

block: Do not reread partition table on exclusively open device

Message ID 20221130175653.24299-1-jack@suse.cz (mailing list archive)
State New, archived
Headers show
Series block: Do not reread partition table on exclusively open device | expand

Commit Message

Jan Kara Nov. 30, 2022, 5:56 p.m. UTC
Since commit 10c70d95c0f2 ("block: remove the bd_openers checks in
blk_drop_partitions") we allow rereading of partition table although
there are users of the block device. This has an undesirable consequence
that e.g. if sda and sdb are assembled to a RAID1 device md0 with
partitions, BLKRRPART ioctl on sda will rescan partition table and
create sda1 device. This partition device under a raid device confuses
some programs (such as libstorage-ng used for initial partitioning for
distribution installation) leading to failures.

Fix the problem refusing to rescan partitions if there is another user
that has the block device exclusively open.

Link: https://lore.kernel.org/all/20221130135344.2ul4cyfstfs3znxg@quack3
Fixes: 10c70d95c0f2 ("block: remove the bd_openers checks in blk_drop_partitions")
Signed-off-by: Jan Kara <jack@suse.cz>
---
 block/blk.h   |  2 +-
 block/genhd.c |  7 +++++--
 block/ioctl.c | 12 +++++++-----
 3 files changed, 13 insertions(+), 8 deletions(-)

Comments

Jens Axboe Nov. 30, 2022, 10:52 p.m. UTC | #1
On 11/30/22 10:56 AM, Jan Kara wrote:
> Since commit 10c70d95c0f2 ("block: remove the bd_openers checks in
> blk_drop_partitions") we allow rereading of partition table although
> there are users of the block device. This has an undesirable consequence
> that e.g. if sda and sdb are assembled to a RAID1 device md0 with
> partitions, BLKRRPART ioctl on sda will rescan partition table and
> create sda1 device. This partition device under a raid device confuses
> some programs (such as libstorage-ng used for initial partitioning for
> distribution installation) leading to failures.
> 
> Fix the problem refusing to rescan partitions if there is another user
> that has the block device exclusively open.

This looks nice and clean to me. Was pondering whether to queue this up
for 6.1, but it's old enough that I think we should just funnel it through
6.2 and mark it stable.
Jens Axboe Nov. 30, 2022, 10:54 p.m. UTC | #2
On Wed, 30 Nov 2022 18:56:53 +0100, Jan Kara wrote:
> Since commit 10c70d95c0f2 ("block: remove the bd_openers checks in
> blk_drop_partitions") we allow rereading of partition table although
> there are users of the block device. This has an undesirable consequence
> that e.g. if sda and sdb are assembled to a RAID1 device md0 with
> partitions, BLKRRPART ioctl on sda will rescan partition table and
> create sda1 device. This partition device under a raid device confuses
> some programs (such as libstorage-ng used for initial partitioning for
> distribution installation) leading to failures.
> 
> [...]

Applied, thanks!

[1/1] block: Do not reread partition table on exclusively open device
      commit: 8d67fc20caf8b08ef6c90a04970846a950d46dd1

Best regards,
Christoph Hellwig Dec. 1, 2022, 7:10 a.m. UTC | #3
Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>
Jan Kara Dec. 1, 2022, 11:44 a.m. UTC | #4
Hello!

I'm sorry but this patch has a brownpaper bag bug because BLKRRPART now
bails also if there is no exlusive opener (which didn't have any adverse
effects on my test system because the BLKRRPART calls issued by
systemd-udev are mostly a pointless exercise as I've found out). I'll send
a fixup patch.  Either fold it into this patch or just add on top. Thanks
and I'm sorry for the trouble!
								Honza

On Wed 30-11-22 23:10:02, Christoph Hellwig wrote:
> Looks good:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
Yu Kuai Jan. 31, 2023, 2 p.m. UTC | #5
Hi, Jan

在 2022/12/01 1:56, Jan Kara 写道:

> -int disk_scan_partitions(struct gendisk *disk, fmode_t mode)
> +int disk_scan_partitions(struct gendisk *disk, fmode_t mode, void *owner)
>   {
>   	struct block_device *bdev;
>   
> @@ -366,6 +366,9 @@ int disk_scan_partitions(struct gendisk *disk, fmode_t mode)
>   		return -EINVAL;
>   	if (disk->open_partitions)
>   		return -EBUSY;
> +	/* Someone else has bdev exclusively open? */
> +	if (disk->part0->bd_holder != owner)
> +		return -EBUSY;
>   
>   	set_bit(GD_NEED_PART_SCAN, &disk->state);
>   	bdev = blkdev_get_by_dev(disk_devt(disk), mode, NULL);

It just by code review, but I think the above checking should be inside
open_mutex, which is used to protect bdev claming. Otherwise there can
be race that this check is pass while someone else exclusive open the
disk before blkdev_get_by_dev().

How do you think about open coding blkdev_get_by_dev() here, something
like:

diff --git a/block/genhd.c b/block/genhd.c
index 23cf83b3331c..341af4db7d54 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -358,7 +358,7 @@ EXPORT_SYMBOL_GPL(disk_uevent);

  int disk_scan_partitions(struct gendisk *disk, fmode_t mode, void *owner)
  {
-       struct block_device *bdev;
+       int ret = 0;

         if (disk->flags & (GENHD_FL_NO_PART | GENHD_FL_HIDDEN))
                 return -EINVAL;
@@ -366,16 +366,31 @@ int disk_scan_partitions(struct gendisk *disk, 
fmode_t mode, void *owner)
                 return -EINVAL;
         if (disk->open_partitions)
                 return -EBUSY;
+
+       disk_block_events(disk);
+       mutex_lock(&disk->open_mutex);
+
         /* Someone else has bdev exclusively open? */
-       if (disk->part0->bd_holder && disk->part0->bd_holder != owner)
-               return -EBUSY;
+       if (disk->part0->bd_holder && disk->part0->bd_holder != owner) {
+               ret = -EBUSY;
+               goto out;
+       }
+
+       if (!disk_live(disk)) {
+               ret = -ENODEV;
+               goto out;
+       }

         set_bit(GD_NEED_PART_SCAN, &disk->state);
-       bdev = blkdev_get_by_dev(disk_devt(disk), mode, NULL);
-       if (IS_ERR(bdev))
-               return PTR_ERR(bdev);
-       blkdev_put(bdev, mode);
-       return 0;
+       ret = blkdev_get_whole(disk->part0, mode);
+out:
+       mutex_unlock(&disk->open_mutex);
+       disk_unblock_events(disk);
+
+       if (!ret)
+               blkdev_put_whole(disk->part0, mode);
+
+       return ret;
  }
Christoph Hellwig Jan. 31, 2023, 2:15 p.m. UTC | #6
It might be easier to just move the check into blkdev_get_whole,
which also ensures that non-excluisve openers don't cause a partition
scan while someone else has the device exclusively open.

diff --git a/block/bdev.c b/block/bdev.c
index edc110d90df404..a831b6c9c627d7 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -666,25 +666,28 @@ static void blkdev_flush_mapping(struct block_device *bdev)
 static int blkdev_get_whole(struct block_device *bdev, fmode_t mode)
 {
 	struct gendisk *disk = bdev->bd_disk;
-	int ret;
+	int ret = 0;
 
-	if (disk->fops->open) {
+	if (disk->fops->open)
 		ret = disk->fops->open(bdev, mode);
-		if (ret) {
-			/* avoid ghost partitions on a removed medium */
-			if (ret == -ENOMEDIUM &&
-			     test_bit(GD_NEED_PART_SCAN, &disk->state))
-				bdev_disk_changed(disk, true);
+
+	if (ret) {
+		/* avoid ghost partitions on a removed medium */
+		if (ret != -ENOMEDIUM)
 			return ret;
-		}
+	} else {
+		if (!atomic_read(&bdev->bd_openers))
+			set_init_blocksize(bdev);
+		atomic_inc(&bdev->bd_openers);
 	}
 
-	if (!atomic_read(&bdev->bd_openers))
-		set_init_blocksize(bdev);
-	if (test_bit(GD_NEED_PART_SCAN, &disk->state))
+	/*
+	 * Skip the partition scan if someone else has the device exclusively
+	 * open.
+	 */
+	if (test_bit(GD_NEED_PART_SCAN, &disk->state) && !bdev->bd_holder)
 		bdev_disk_changed(disk, false);
-	atomic_inc(&bdev->bd_openers);
-	return 0;
+	return ret;
 }
 
 static void blkdev_put_whole(struct block_device *bdev, fmode_t mode)
diff --git a/block/genhd.c b/block/genhd.c
index 23cf83b3331cde..4a7b1b8b9efdb5 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -366,9 +366,6 @@ int disk_scan_partitions(struct gendisk *disk, fmode_t mode, void *owner)
 		return -EINVAL;
 	if (disk->open_partitions)
 		return -EBUSY;
-	/* Someone else has bdev exclusively open? */
-	if (disk->part0->bd_holder && disk->part0->bd_holder != owner)
-		return -EBUSY;
 
 	set_bit(GD_NEED_PART_SCAN, &disk->state);
 	bdev = blkdev_get_by_dev(disk_devt(disk), mode, NULL);
Yu Kuai Feb. 1, 2023, 1:04 a.m. UTC | #7
Hi, Christoph

在 2023/01/31 22:15, Christoph Hellwig 写道:
> It might be easier to just move the check into blkdev_get_whole,
> which also ensures that non-excluisve openers don't cause a partition
> scan while someone else has the device exclusively open.
> 
> diff --git a/block/bdev.c b/block/bdev.c
> index edc110d90df404..a831b6c9c627d7 100644
> --- a/block/bdev.c
> +++ b/block/bdev.c
> @@ -666,25 +666,28 @@ static void blkdev_flush_mapping(struct block_device *bdev)
>   static int blkdev_get_whole(struct block_device *bdev, fmode_t mode)
>   {
>   	struct gendisk *disk = bdev->bd_disk;
> -	int ret;
> +	int ret = 0;
>   
> -	if (disk->fops->open) {
> +	if (disk->fops->open)
>   		ret = disk->fops->open(bdev, mode);
> -		if (ret) {
> -			/* avoid ghost partitions on a removed medium */
> -			if (ret == -ENOMEDIUM &&
> -			     test_bit(GD_NEED_PART_SCAN, &disk->state))
> -				bdev_disk_changed(disk, true);
> +
> +	if (ret) {
> +		/* avoid ghost partitions on a removed medium */
> +		if (ret != -ENOMEDIUM)
>   			return ret;
> -		}
> +	} else {
> +		if (!atomic_read(&bdev->bd_openers))
> +			set_init_blocksize(bdev);
> +		atomic_inc(&bdev->bd_openers);
>   	}
>   
> -	if (!atomic_read(&bdev->bd_openers))
> -		set_init_blocksize(bdev);
> -	if (test_bit(GD_NEED_PART_SCAN, &disk->state))
> +	/*
> +	 * Skip the partition scan if someone else has the device exclusively
> +	 * open.
> +	 */
> +	if (test_bit(GD_NEED_PART_SCAN, &disk->state) && !bdev->bd_holder)
>   		bdev_disk_changed(disk, false);

I think this is wrong here... We should at least allow the exclusively
opener to scan partition, right?

Thanks,
Kuai
Christoph Hellwig Feb. 1, 2023, 6:22 a.m. UTC | #8
On Wed, Feb 01, 2023 at 09:04:12AM +0800, Yu Kuai wrote:
> > +	if (test_bit(GD_NEED_PART_SCAN, &disk->state) && !bdev->bd_holder)
> >   		bdev_disk_changed(disk, false);
> 
> I think this is wrong here... We should at least allow the exclusively
> opener to scan partition, right?

bd_holder is only set in bd_finish_claiming, which is called after
the partition rescan.
Yu Kuai Feb. 1, 2023, 7:20 a.m. UTC | #9
Hi,

在 2023/02/01 14:22, Christoph Hellwig 写道:
> On Wed, Feb 01, 2023 at 09:04:12AM +0800, Yu Kuai wrote:
>>> +	if (test_bit(GD_NEED_PART_SCAN, &disk->state) && !bdev->bd_holder)
>>>    		bdev_disk_changed(disk, false);
>>
>> I think this is wrong here... We should at least allow the exclusively
>> opener to scan partition, right?
> 
> bd_holder is only set in bd_finish_claiming, which is called after
> the partition rescan.
> .
> 

I mean that someone open bdev exclusively first, and then call ioctl to
rescan partition.

By the way, disk_scan_partitions() won't claim bdev in the first
place...

Thanks,
Kuai
Yu Kuai Feb. 2, 2023, 12:50 p.m. UTC | #10
Hi, Jan and Christoph

在 2022/12/01 1:56, Jan Kara 写道:
> Since commit 10c70d95c0f2 ("block: remove the bd_openers checks in
> blk_drop_partitions") we allow rereading of partition table although
> there are users of the block device. This has an undesirable consequence
> that e.g. if sda and sdb are assembled to a RAID1 device md0 with
> partitions, BLKRRPART ioctl on sda will rescan partition table and
> create sda1 device. This partition device under a raid device confuses
> some programs (such as libstorage-ng used for initial partitioning for
> distribution installation) leading to failures.
> 
> Fix the problem refusing to rescan partitions if there is another user
> that has the block device exclusively open.

Still by code review, I just found another race that cound cause disk
can't scan partition while creating:

t1: create device	t2: open device exclusively
device_add_disk()
  bdev_add
   insert_inode_hash
                         ...
                         bd_finish_claiming
                          bdev->bd_holder = holder
disk_scan_partitions
// check holder and failed

This race is artificial and unlikely to happend in practice, but this
is easy to fix by following:

diff --git a/block/genhd.c b/block/genhd.c
index 09f2ac548832..23e87753313b 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -497,6 +497,11 @@ int __must_check device_add_disk(struct device 
*parent, struct gendisk *disk,
                 if (ret)
                         goto out_unregister_bdi;

+               /*
+                * In case user open device before disk_scan_partitions(),
+                * set state here, so that blkdev_open() can scan 
partitions.
+                */
+               set_bit(GD_NEED_PART_SCAN, &disk->state);
                 bdev_add(disk->part0, ddev->devt);
                 if (get_capacity(disk))
                         disk_scan_partitions(disk, FMODE_READ, NULL);
Yu Kuai Feb. 6, 2023, 3:24 a.m. UTC | #11
Hi, Jan and Chirstoph

在 2023/02/01 15:20, Yu Kuai 写道:
> Hi,
> 
> 在 2023/02/01 14:22, Christoph Hellwig 写道:
>> On Wed, Feb 01, 2023 at 09:04:12AM +0800, Yu Kuai wrote:
>>>> +    if (test_bit(GD_NEED_PART_SCAN, &disk->state) && !bdev->bd_holder)
>>>>            bdev_disk_changed(disk, false);
>>>
>>> I think this is wrong here... We should at least allow the exclusively
>>> opener to scan partition, right?
>>
>> bd_holder is only set in bd_finish_claiming, which is called after
>> the partition rescan.
>> .
>>
> 
> I mean that someone open bdev exclusively first, and then call ioctl to
> rescan partition.

Any suggestions?

Thanks,
Kuai
Jan Kara Feb. 8, 2023, 12:02 p.m. UTC | #12
Hi!

On Mon 06-02-23 11:24:10, Yu Kuai wrote:
> 在 2023/02/01 15:20, Yu Kuai 写道:
> > Hi,
> > 
> > 在 2023/02/01 14:22, Christoph Hellwig 写道:
> > > On Wed, Feb 01, 2023 at 09:04:12AM +0800, Yu Kuai wrote:
> > > > > +    if (test_bit(GD_NEED_PART_SCAN, &disk->state) && !bdev->bd_holder)
> > > > >            bdev_disk_changed(disk, false);
> > > > 
> > > > I think this is wrong here... We should at least allow the exclusively
> > > > opener to scan partition, right?
> > > 
> > > bd_holder is only set in bd_finish_claiming, which is called after
> > > the partition rescan.
> > > .
> > > 
> > 
> > I mean that someone open bdev exclusively first, and then call ioctl to
> > rescan partition.
> 
> Any suggestions?

After some thought I don't like opencoding blkdev_get_by_dev() in disk_scan
partitions. But I agree Christoph's approach with blkdev_get_whole() does
not quite work either. We could propagate holder/owner into
blkdev_get_whole() to fix Christoph's check but still we are left with a
question what to do with GD_NEED_PART_SCAN set bit when we get into
blkdev_get_whole() and find out we are not elligible to rescan partitions.
Because then some exclusive opener later might be caught by surprise when
the partition rescan happens due to this bit being set from the past failed
attempt to rescan.

So what we could do is play a similar trick as we do in the loop device and
do in disk_scan_partitions():

	/*
	 * If we don't hold exclusive handle for the device, upgrade to it
	 * here to avoid changing partitions under exclusive owner.
	 */
	if (!(mode & FMODE_EXCL)) {
		error = bd_prepare_to_claim(disk->part0, disk_scan_partitions);
		if (error)
			return error;
	}
	set_bit(GD_NEED_PART_SCAN, &disk->state);
	bdev = blkdev_get_by_dev(disk_devt(disk), mode & ~FMODE_EXCL, NULL);
	if (IS_ERR(bdev)) {
		error = PTR_ERR(bdev);
		goto abort;
	}
	blkdev_put(bdev, mode & ~FMODE_EXCL);
	error = 0;
abort:
	if (!(mode & FMODE_EXCL))
		bd_abort_claiming(disk->part0, disk_scan_partitions);
	return error;

So esentially we'll temporarily block any exlusive openers by claiming the
bdev while we set the GD_NEED_PART_SCAN and force partition rescan. What do
you think?

								Honza
Yu Kuai Feb. 8, 2023, 2:13 p.m. UTC | #13
Hi,

在 2023/02/08 20:02, Jan Kara 写道:
> 
> After some thought I don't like opencoding blkdev_get_by_dev() in disk_scan
> partitions. But I agree Christoph's approach with blkdev_get_whole() does
> not quite work either. We could propagate holder/owner into
> blkdev_get_whole() to fix Christoph's check but still we are left with a
> question what to do with GD_NEED_PART_SCAN set bit when we get into
> blkdev_get_whole() and find out we are not elligible to rescan partitions.
> Because then some exclusive opener later might be caught by surprise when
> the partition rescan happens due to this bit being set from the past failed
> attempt to rescan.
> 
> So what we could do is play a similar trick as we do in the loop device and
> do in disk_scan_partitions():
> 
> 	/*
> 	 * If we don't hold exclusive handle for the device, upgrade to it
> 	 * here to avoid changing partitions under exclusive owner.
> 	 */
> 	if (!(mode & FMODE_EXCL)) {
This is not necessary, all the caller make sure FMODE_EXCL is not set.

> 		error = bd_prepare_to_claim(disk->part0, disk_scan_partitions);
> 		if (error)
> 			return error;
> 	}
 From what I see, if thread open device excl first, and then call ioctl()
to reread partition, this will cause this ioctl() to fail?

> 	set_bit(GD_NEED_PART_SCAN, &disk->state);
> 	bdev = blkdev_get_by_dev(disk_devt(disk), mode & ~FMODE_EXCL, NULL);
> 	if (IS_ERR(bdev)) {
> 		error = PTR_ERR(bdev);
> 		goto abort;
> 	}
> 	blkdev_put(bdev, mode & ~FMODE_EXCL);
> 	error = 0;
> abort:
> 	if (!(mode & FMODE_EXCL))
> 		bd_abort_claiming(disk->part0, disk_scan_partitions);
> 	return error;
> 
> So esentially we'll temporarily block any exlusive openers by claiming the
> bdev while we set the GD_NEED_PART_SCAN and force partition rescan. What do
> you think?
> 
> 								Honza
>
Jan Kara Feb. 9, 2023, 9:04 a.m. UTC | #14
On Wed 08-02-23 22:13:02, Yu Kuai wrote:
> Hi,
> 
> 在 2023/02/08 20:02, Jan Kara 写道:
> > 
> > After some thought I don't like opencoding blkdev_get_by_dev() in disk_scan
> > partitions. But I agree Christoph's approach with blkdev_get_whole() does
> > not quite work either. We could propagate holder/owner into
> > blkdev_get_whole() to fix Christoph's check but still we are left with a
> > question what to do with GD_NEED_PART_SCAN set bit when we get into
> > blkdev_get_whole() and find out we are not elligible to rescan partitions.
> > Because then some exclusive opener later might be caught by surprise when
> > the partition rescan happens due to this bit being set from the past failed
> > attempt to rescan.
> > 
> > So what we could do is play a similar trick as we do in the loop device and
> > do in disk_scan_partitions():
> > 
> > 	/*
> > 	 * If we don't hold exclusive handle for the device, upgrade to it
> > 	 * here to avoid changing partitions under exclusive owner.
> > 	 */
> > 	if (!(mode & FMODE_EXCL)) {
> This is not necessary, all the caller make sure FMODE_EXCL is not set.

Yes, but we need to propagate it correctly from blkdev_common_ioctl() now,
exactly so that ioctl does not fail if you exclusively opened the device as
you realized below :)

> > 		error = bd_prepare_to_claim(disk->part0, disk_scan_partitions);
> > 		if (error)
> > 			return error;
> > 	}
> From what I see, if thread open device excl first, and then call ioctl()
> to reread partition, this will cause this ioctl() to fail?

								Honza
Yu Kuai Feb. 9, 2023, 9:32 a.m. UTC | #15
Hi,

在 2023/02/09 17:04, Jan Kara 写道:
> On Wed 08-02-23 22:13:02, Yu Kuai wrote:
>> Hi,
>>
>> 在 2023/02/08 20:02, Jan Kara 写道:
>>>
>>> After some thought I don't like opencoding blkdev_get_by_dev() in disk_scan
>>> partitions. But I agree Christoph's approach with blkdev_get_whole() does
>>> not quite work either. We could propagate holder/owner into
>>> blkdev_get_whole() to fix Christoph's check but still we are left with a
>>> question what to do with GD_NEED_PART_SCAN set bit when we get into
>>> blkdev_get_whole() and find out we are not elligible to rescan partitions.
>>> Because then some exclusive opener later might be caught by surprise when
>>> the partition rescan happens due to this bit being set from the past failed
>>> attempt to rescan.
>>>
>>> So what we could do is play a similar trick as we do in the loop device and
>>> do in disk_scan_partitions():
>>>
>>> 	/*
>>> 	 * If we don't hold exclusive handle for the device, upgrade to it
>>> 	 * here to avoid changing partitions under exclusive owner.
>>> 	 */
>>> 	if (!(mode & FMODE_EXCL)) {
>> This is not necessary, all the caller make sure FMODE_EXCL is not set.
> 
> Yes, but we need to propagate it correctly from blkdev_common_ioctl() now,
> exactly so that ioctl does not fail if you exclusively opened the device as
> you realized below :)

Ok, I get it that you want to pass in FMODE_EXCL from ioctl(). But I'm
not sure let others fail to open device extl is a good idea.

I still prefer to open code blkdev_get_by_dev(), because many operations
is not necessary here. And this way, we can clear GD_NEED_PART_SCAN
inside open_mutex if rescan failed.

Thanks,
Kuai

> 
>>> 		error = bd_prepare_to_claim(disk->part0, disk_scan_partitions);
>>> 		if (error)
>>> 			return error;
>>> 	}
>>  From what I see, if thread open device excl first, and then call ioctl()
>> to reread partition, this will cause this ioctl() to fail?
> 
> 								Honza
>
Jan Kara Feb. 9, 2023, 9:57 a.m. UTC | #16
On Thu 09-02-23 17:32:45, Yu Kuai wrote:
> 在 2023/02/09 17:04, Jan Kara 写道:
> > On Wed 08-02-23 22:13:02, Yu Kuai wrote:
> > > Hi,
> > > 
> > > 在 2023/02/08 20:02, Jan Kara 写道:
> > > > 
> > > > After some thought I don't like opencoding blkdev_get_by_dev() in disk_scan
> > > > partitions. But I agree Christoph's approach with blkdev_get_whole() does
> > > > not quite work either. We could propagate holder/owner into
> > > > blkdev_get_whole() to fix Christoph's check but still we are left with a
> > > > question what to do with GD_NEED_PART_SCAN set bit when we get into
> > > > blkdev_get_whole() and find out we are not elligible to rescan partitions.
> > > > Because then some exclusive opener later might be caught by surprise when
> > > > the partition rescan happens due to this bit being set from the past failed
> > > > attempt to rescan.
> > > > 
> > > > So what we could do is play a similar trick as we do in the loop device and
> > > > do in disk_scan_partitions():
> > > > 
> > > > 	/*
> > > > 	 * If we don't hold exclusive handle for the device, upgrade to it
> > > > 	 * here to avoid changing partitions under exclusive owner.
> > > > 	 */
> > > > 	if (!(mode & FMODE_EXCL)) {
> > > This is not necessary, all the caller make sure FMODE_EXCL is not set.
> > 
> > Yes, but we need to propagate it correctly from blkdev_common_ioctl() now,
> > exactly so that ioctl does not fail if you exclusively opened the device as
> > you realized below :)
> 
> Ok, I get it that you want to pass in FMODE_EXCL from ioctl(). But I'm
> not sure let others fail to open device extl is a good idea.

Other exclusive openers will not fail. They will block until we call
bd_abort_claiming() after the partitions have been reread.

> I still prefer to open code blkdev_get_by_dev(), because many operations
> is not necessary here. And this way, we can clear GD_NEED_PART_SCAN
> inside open_mutex if rescan failed.

I understand many operations are not needed there but this is not a hot
path and leaking of bdev internal details into genhd.c is not a good
practice either (e.g. you'd have to make blkdev_get_whole() extern).

We could create a special helper for partition rescan in block/bdev.c to
hide the details but think that bd_start_claiming() - bd_abort_claiming()
trick is the simplest solution to temporarily grab exclusive ownership if
we don't have it.

								Honza
Yu Kuai Feb. 9, 2023, 11:19 a.m. UTC | #17
Hi,

在 2023/02/09 17:57, Jan Kara 写道:
> On Thu 09-02-23 17:32:45, Yu Kuai wrote:
>> 在 2023/02/09 17:04, Jan Kara 写道:
>>> On Wed 08-02-23 22:13:02, Yu Kuai wrote:
>>>> Hi,
>>>>
>>>> 在 2023/02/08 20:02, Jan Kara 写道:
>>>>>
>>>>> After some thought I don't like opencoding blkdev_get_by_dev() in disk_scan
>>>>> partitions. But I agree Christoph's approach with blkdev_get_whole() does
>>>>> not quite work either. We could propagate holder/owner into
>>>>> blkdev_get_whole() to fix Christoph's check but still we are left with a
>>>>> question what to do with GD_NEED_PART_SCAN set bit when we get into
>>>>> blkdev_get_whole() and find out we are not elligible to rescan partitions.
>>>>> Because then some exclusive opener later might be caught by surprise when
>>>>> the partition rescan happens due to this bit being set from the past failed
>>>>> attempt to rescan.
>>>>>
>>>>> So what we could do is play a similar trick as we do in the loop device and
>>>>> do in disk_scan_partitions():
>>>>>
>>>>> 	/*
>>>>> 	 * If we don't hold exclusive handle for the device, upgrade to it
>>>>> 	 * here to avoid changing partitions under exclusive owner.
>>>>> 	 */
>>>>> 	if (!(mode & FMODE_EXCL)) {
>>>> This is not necessary, all the caller make sure FMODE_EXCL is not set.
>>>
>>> Yes, but we need to propagate it correctly from blkdev_common_ioctl() now,
>>> exactly so that ioctl does not fail if you exclusively opened the device as
>>> you realized below :)
>>
>> Ok, I get it that you want to pass in FMODE_EXCL from ioctl(). But I'm
>> not sure let others fail to open device extl is a good idea.
> 
> Other exclusive openers will not fail. They will block until we call
> bd_abort_claiming() after the partitions have been reread.

Yes, you're right, rescan and other exclusively openers will be
synchronized by bd_prepare_to_claim().
> 
>> I still prefer to open code blkdev_get_by_dev(), because many operations
>> is not necessary here. And this way, we can clear GD_NEED_PART_SCAN
>> inside open_mutex if rescan failed.
> 
> I understand many operations are not needed there but this is not a hot
> path and leaking of bdev internal details into genhd.c is not a good
> practice either (e.g. you'd have to make blkdev_get_whole() extern).

I was thinking that disk_scan_partitions() can be moved to bdev.c to
avoid that...
> 
> We could create a special helper for partition rescan in block/bdev.c to
> hide the details but think that bd_start_claiming() - bd_abort_claiming()
> trick is the simplest solution to temporarily grab exclusive ownership if
> we don't have it.

Now I'm good with this solution. By the way, do you think we must make
sure the first partition scan will be proceed?

Thanks,
Kuai
> 
> 								Honza
>
Jan Kara Feb. 9, 2023, 1:58 p.m. UTC | #18
On Thu 09-02-23 19:19:48, Yu Kuai wrote:
> 在 2023/02/09 17:57, Jan Kara 写道:
> > On Thu 09-02-23 17:32:45, Yu Kuai wrote:
> > > 在 2023/02/09 17:04, Jan Kara 写道:
> > > I still prefer to open code blkdev_get_by_dev(), because many operations
> > > is not necessary here. And this way, we can clear GD_NEED_PART_SCAN
> > > inside open_mutex if rescan failed.
> > 
> > I understand many operations are not needed there but this is not a hot
> > path and leaking of bdev internal details into genhd.c is not a good
> > practice either (e.g. you'd have to make blkdev_get_whole() extern).
> 
> I was thinking that disk_scan_partitions() can be moved to bdev.c to
> avoid that...
> > 
> > We could create a special helper for partition rescan in block/bdev.c to
> > hide the details but think that bd_start_claiming() - bd_abort_claiming()
> > trick is the simplest solution to temporarily grab exclusive ownership if
> > we don't have it.
> 
> Now I'm good with this solution. By the way, do you think we must make
> sure the first partition scan will be proceed?

Sorry, I'm not sure what your are asking about here :) Can you please
elaborate more?

								Honza
Yu Kuai Feb. 10, 2023, 1:58 a.m. UTC | #19
Hi,

在 2023/02/09 21:58, Jan Kara 写道:

> Sorry, I'm not sure what your are asking about here :) Can you please
> elaborate more?


It's another artificail race that will cause part scan fail in
device_add_disk().

bdev_add() -> user can open the device now

disk_scan_partitions -> will fail is the device is already opened 
exclusively

I'm thinking about set disk state before bdev_add()...

Thanks,
Kuai
Jan Kara Feb. 10, 2023, 10:31 a.m. UTC | #20
On Fri 10-02-23 09:58:36, Yu Kuai wrote:
> Hi,
> 
> 在 2023/02/09 21:58, Jan Kara 写道:
> 
> > Sorry, I'm not sure what your are asking about here :) Can you please
> > elaborate more?
> 
> 
> It's another artificail race that will cause part scan fail in
> device_add_disk().
> 
> bdev_add() -> user can open the device now
> 
> disk_scan_partitions -> will fail is the device is already opened
> exclusively
> 
> I'm thinking about set disk state before bdev_add()...

Oh, right. Yes, that should be a good fix to set GD_NEED_PART_SCAN before
calling bdev_add().

								Honza
Yu Kuai Feb. 11, 2023, 3:46 a.m. UTC | #21
Hi,

在 2023/02/10 18:31, Jan Kara 写道:
> On Fri 10-02-23 09:58:36, Yu Kuai wrote:
>> Hi,
>>
>> 在 2023/02/09 21:58, Jan Kara 写道:
>>
>>> Sorry, I'm not sure what your are asking about here :) Can you please
>>> elaborate more?
>>
>>
>> It's another artificail race that will cause part scan fail in
>> device_add_disk().
>>
>> bdev_add() -> user can open the device now
>>
>> disk_scan_partitions -> will fail is the device is already opened
>> exclusively
>>
>> I'm thinking about set disk state before bdev_add()...
> 
> Oh, right. Yes, that should be a good fix to set GD_NEED_PART_SCAN before
> calling bdev_add().

Glad to here that 
diff mbox series

Patch

diff --git a/block/blk.h b/block/blk.h
index a186ea20f39d..8b75a95b28d6 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -436,7 +436,7 @@  static inline struct kmem_cache *blk_get_queue_kmem_cache(bool srcu)
 }
 struct request_queue *blk_alloc_queue(int node_id, bool alloc_srcu);
 
-int disk_scan_partitions(struct gendisk *disk, fmode_t mode);
+int disk_scan_partitions(struct gendisk *disk, fmode_t mode, void *owner);
 
 int disk_alloc_events(struct gendisk *disk);
 void disk_add_events(struct gendisk *disk);
diff --git a/block/genhd.c b/block/genhd.c
index 0f9769db2de8..012529d36f5b 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -356,7 +356,7 @@  void disk_uevent(struct gendisk *disk, enum kobject_action action)
 }
 EXPORT_SYMBOL_GPL(disk_uevent);
 
-int disk_scan_partitions(struct gendisk *disk, fmode_t mode)
+int disk_scan_partitions(struct gendisk *disk, fmode_t mode, void *owner)
 {
 	struct block_device *bdev;
 
@@ -366,6 +366,9 @@  int disk_scan_partitions(struct gendisk *disk, fmode_t mode)
 		return -EINVAL;
 	if (disk->open_partitions)
 		return -EBUSY;
+	/* Someone else has bdev exclusively open? */
+	if (disk->part0->bd_holder != owner)
+		return -EBUSY;
 
 	set_bit(GD_NEED_PART_SCAN, &disk->state);
 	bdev = blkdev_get_by_dev(disk_devt(disk), mode, NULL);
@@ -500,7 +503,7 @@  int __must_check device_add_disk(struct device *parent, struct gendisk *disk,
 
 		bdev_add(disk->part0, ddev->devt);
 		if (get_capacity(disk))
-			disk_scan_partitions(disk, FMODE_READ);
+			disk_scan_partitions(disk, FMODE_READ, NULL);
 
 		/*
 		 * Announce the disk and partitions after all partitions are
diff --git a/block/ioctl.c b/block/ioctl.c
index 60121e89052b..96617512982e 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -467,9 +467,10 @@  static int blkdev_bszset(struct block_device *bdev, fmode_t mode,
  * user space. Note the separate arg/argp parameters that are needed
  * to deal with the compat_ptr() conversion.
  */
-static int blkdev_common_ioctl(struct block_device *bdev, fmode_t mode,
-				unsigned cmd, unsigned long arg, void __user *argp)
+static int blkdev_common_ioctl(struct file *file, fmode_t mode, unsigned cmd,
+			       unsigned long arg, void __user *argp)
 {
+	struct block_device *bdev = I_BDEV(file->f_mapping->host);
 	unsigned int max_sectors;
 
 	switch (cmd) {
@@ -527,7 +528,8 @@  static int blkdev_common_ioctl(struct block_device *bdev, fmode_t mode,
 			return -EACCES;
 		if (bdev_is_partition(bdev))
 			return -EINVAL;
-		return disk_scan_partitions(bdev->bd_disk, mode & ~FMODE_EXCL);
+		return disk_scan_partitions(bdev->bd_disk, mode & ~FMODE_EXCL,
+					    file);
 	case BLKTRACESTART:
 	case BLKTRACESTOP:
 	case BLKTRACETEARDOWN:
@@ -605,7 +607,7 @@  long blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 		break;
 	}
 
-	ret = blkdev_common_ioctl(bdev, mode, cmd, arg, argp);
+	ret = blkdev_common_ioctl(file, mode, cmd, arg, argp);
 	if (ret != -ENOIOCTLCMD)
 		return ret;
 
@@ -674,7 +676,7 @@  long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 		break;
 	}
 
-	ret = blkdev_common_ioctl(bdev, mode, cmd, arg, argp);
+	ret = blkdev_common_ioctl(file, mode, cmd, arg, argp);
 	if (ret == -ENOIOCTLCMD && disk->fops->compat_ioctl)
 		ret = disk->fops->compat_ioctl(bdev, mode, cmd, arg);