
[7/8] loop: remove lo_refcount and avoid lo_mutex in ->open / ->release

Message ID 20220316084519.2850118-8-hch@lst.de (mailing list archive)
State New, archived
Series [1/8] loop: de-duplicate the idle worker freeing code

Commit Message

Christoph Hellwig March 16, 2022, 8:45 a.m. UTC
lo_refcount counts how many openers a loop device has, but that count
is already provided by the block layer in the bd_openers field of the
whole-disk block_device.  Remove lo_refcount and allow opens to
succeed even on devices being deleted - now that ->free_disk is
implemented we can handle that race gracefully and all I/O on it will
just fail. Similarly there is a small race window now where
loop_control_remove does not synchronize the delete vs the remove
due to bd_openers not being under lo_mutex protection, but we can
handle that just as gracefully.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/block/loop.c | 36 +++++++-----------------------------
 drivers/block/loop.h |  1 -
 2 files changed, 7 insertions(+), 30 deletions(-)

Comments

Jan Kara March 16, 2022, 11:22 a.m. UTC | #1
On Wed 16-03-22 09:45:18, Christoph Hellwig wrote:
> lo_refcount counts how many openers a loop device has, but that count
> is already provided by the block layer in the bd_openers field of the
> whole-disk block_device.  Remove lo_refcount and allow opens to
> succeed even on devices being deleted - now that ->free_disk is
> implemented we can handle that race gracefully and all I/O on it will
> just fail. Similarly there is a small race window now where
> loop_control_remove does not synchronize the delete vs the remove
> due to bd_openers not being under lo_mutex protection, but we can
> handle that just as gracefully.

Honestly, I'm a bit uneasy about these races and the ability to have an open
but already removed loop device (given the use of loop devices in container
environments, potential problems could have security implications). But I guess
it is no different from having an open hot-unplugged SCSI disk. There may just
be an expectation from userspace that an open either blocks loop device removal
or fails. But I guess we can deal with that if some real breakage happens - it
does not seem that hard to solve - we just need loop_control_remove() to grab
disk->open_mutex to make the transition to the Lo_deleting state safe and keep
the Lo_deleting check in lo_open(). Plus we'd need to use READ_ONCE /
WRITE_ONCE for lo_state accesses.
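
A rough sketch of that alternative (illustrative only, not the posted patch;
the lock ordering of disk->open_mutex against lo_mutex and loop_ctl_mutex
would still need to be checked):

/*
 * Sketch: keep a minimal lo_open() and serialize the Lo_deleting
 * transition on disk->open_mutex instead of lo_mutex.
 */
static int lo_open(struct block_device *bdev, fmode_t mode)
{
	struct loop_device *lo = bdev->bd_disk->private_data;

	/* the block layer calls ->open with disk->open_mutex held */
	if (READ_ONCE(lo->lo_state) == Lo_deleting)
		return -ENXIO;
	return 0;
}

/* in loop_control_remove(), after the existing unbound check: */
	mutex_lock(&lo->lo_disk->open_mutex);
	if (lo->lo_disk->part0->bd_openers > 0) {
		mutex_unlock(&lo->lo_disk->open_mutex);
		return -EBUSY;
	}
	/* any open() from here on sees Lo_deleting and fails */
	WRITE_ONCE(lo->lo_state, Lo_deleting);
	mutex_unlock(&lo->lo_disk->open_mutex);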

> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  drivers/block/loop.c | 36 +++++++-----------------------------
>  drivers/block/loop.h |  1 -
>  2 files changed, 7 insertions(+), 30 deletions(-)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index cbaa18bcad1fe..c270f3715d829 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1244,7 +1244,7 @@ static int loop_clr_fd(struct loop_device *lo)
>  	 * <dev>/do something like mkfs/losetup -d <dev> causing the losetup -d
>  	 * command to fail with EBUSY.
>  	 */
> -	if (atomic_read(&lo->lo_refcnt) > 1) {
> +	if (lo->lo_disk->part0->bd_openers > 1) {

But bd_openers can be read safely only under disk->open_mutex. So for this
to be safe against compiler playing nasty tricks with optimizations, we
need to either make bd_openers atomic_t or use READ_ONCE / WRITE_ONCE when
accessing it.
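
For illustration, the READ_ONCE() variant would look roughly like this in
loop_clr_fd() (a sketch only, not part of the posted patch; the locked
writers in the block layer would need matching WRITE_ONCE() updates):

	/*
	 * Sketch: annotate the lockless read so the compiler emits exactly
	 * one load and cannot tear or re-read the value.
	 */
	if (READ_ONCE(lo->lo_disk->part0->bd_openers) > 1) {
		lo->lo_flags |= LO_FLAGS_AUTOCLEAR;
		mutex_unlock(&lo->lo_mutex);
		return 0;
	}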

> @@ -1724,33 +1724,15 @@ static int lo_compat_ioctl(struct block_device *bdev, fmode_t mode,
>  }
>  #endif
>  
> -static int lo_open(struct block_device *bdev, fmode_t mode)
> -{
> -	struct loop_device *lo = bdev->bd_disk->private_data;
> -	int err;
> -
> -	err = mutex_lock_killable(&lo->lo_mutex);
> -	if (err)
> -		return err;
> -	if (lo->lo_state == Lo_deleting)
> -		err = -ENXIO;
> -	else
> -		atomic_inc(&lo->lo_refcnt);
> -	mutex_unlock(&lo->lo_mutex);
> -	return err;
> -}
> -

Tetsuo has observed [1] that not grabbing lo_mutex when opening a loop device
tends to break systemd-udevd, because in loop_configure() we send the
DISK_EVENT_MEDIA_CHANGE event before the device is fully configured (but
the configuration gets finished before we release the lo_mutex), and so
systemd-udevd gets spurious I/O errors when probing new loop devices and is
unhappy. So I think this is the right way to go, but we need to reshuffle
loop_configure() a bit first.

[1] https://lore.kernel.org/all/a72c59c6-298b-e4ba-b1f5-2275afab49a1@I-love.SAKURA.ne.jp/T/#u
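
Purely as an illustration of the kind of reshuffle that could work (the
helpers are the ones in the 5.17 loop driver, but this ordering is an
assumption, not the eventual fix):

	/* in loop_configure(): finish setting the device up first ... */
	loop_update_rotational(lo);
	loop_update_dio(lo);
	loop_sysfs_init(lo);

	size = get_loop_size(lo, file);
	loop_set_size(lo, size);

	/* ... and only then let userspace know the device changed */
	disk_force_media_change(lo->lo_disk, DISK_EVENT_MEDIA_CHANGE);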


								Honza
Tetsuo Handa March 16, 2022, 1:14 p.m. UTC | #2
On 2022/03/16 20:22, Jan Kara wrote:
>> @@ -1244,7 +1244,7 @@ static int loop_clr_fd(struct loop_device *lo)
>>  	 * <dev>/do something like mkfs/losetup -d <dev> causing the losetup -d
>>  	 * command to fail with EBUSY.
>>  	 */
>> -	if (atomic_read(&lo->lo_refcnt) > 1) {
>> +	if (lo->lo_disk->part0->bd_openers > 1) {
> 
> But bd_openers can be read safely only under disk->open_mutex. So for this
> to be safe against compiler playing nasty tricks with optimizations, we
> need to either make bd_openers atomic_t or use READ_ONCE / WRITE_ONCE when
> accessing it.

Use of READ_ONCE() / WRITE_ONCE() is not for avoiding races but for making
sure that memory access happens only once. It is data_race() which is needed
for tolerating and annotating races. For example, data_race(lo->lo_state) is
needed when accessing lo->lo_state without lo->lo_mutex held.

Use of atomic_t for lo->lo_disk->part0->bd_openers does not help, for currently
lo->lo_mutex is held in order to avoid races. That is, it is disk->open_mutex
which loop_clr_fd() needs to hold when accessing lo->lo_disk->part0->bd_openers.
Jan Kara March 16, 2022, 2:38 p.m. UTC | #3
On Wed 16-03-22 22:14:50, Tetsuo Handa wrote:
> On 2022/03/16 20:22, Jan Kara wrote:
> >> @@ -1244,7 +1244,7 @@ static int loop_clr_fd(struct loop_device *lo)
> >>  	 * <dev>/do something like mkfs/losetup -d <dev> causing the losetup -d
> >>  	 * command to fail with EBUSY.
> >>  	 */
> >> -	if (atomic_read(&lo->lo_refcnt) > 1) {
> >> +	if (lo->lo_disk->part0->bd_openers > 1) {
> > 
> > But bd_openers can be read safely only under disk->open_mutex. So for this
> > to be safe against compiler playing nasty tricks with optimizations, we
> > need to either make bd_openers atomic_t or use READ_ONCE / WRITE_ONCE when
> > accessing it.
> 
> Use of READ_ONCE() / WRITE_ONCE() is not for avoiding races but for making
> sure that memory access happens only once. It is data_race() which is needed
> for tolerating and annotating races. For example, data_race(lo->lo_state) is
> needed when accessing lo->lo_state without lo->lo_mutex held.

Well, but another effect of READ_ONCE() / WRITE_ONCE() is that it
effectively forces the compiler to not store any intermediate value in
bd_openers. If you have code like bdev->bd_openers++, and bd_openers has
a value of, say, 1, the compiler is fully within its rights if an unlocked reader
sees the values 1, 0, 3, 2. It would have to be a vicious compiler, but the C
standard allows that and some of the optimizations compilers end up doing
result in code which is not far from this (read more about KCSAN and the
motivation behind it for details). So data_race() annotation is *not*
enough for unlocked bd_openers usage.

> Use of atomic_t for lo->lo_disk->part0->bd_openers does not help, for
> currently lo->lo_mutex is held in order to avoid races. That is, it is
> disk->open_mutex which loop_clr_fd() needs to hold when accessing
> lo->lo_disk->part0->bd_openers.

It does help because with atomic_t, seeing any intermediate values is not
possible even for unlocked readers.
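
To make the contrast concrete, the three variants being discussed would look
as follows (illustrative only; bd_openers is a plain int in the tree this
patch is based on, so the atomic_t line assumes a type change):

	/* plain int: only safe if every access holds disk->open_mutex;
	 * an unlocked reader may observe torn/intermediate values */
	bdev->bd_openers++;

	/* WRITE_ONCE(): writers still need the lock, but the store is a
	 * single untorn write that lockless READ_ONCE() readers can pair with */
	WRITE_ONCE(bdev->bd_openers, bdev->bd_openers + 1);

	/* atomic_t (hypothetical type change): the update itself is atomic,
	 * so no reader can ever observe an intermediate value */
	atomic_inc(&bdev->bd_openers);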

								Honza
Christoph Hellwig March 22, 2022, 11:09 a.m. UTC | #4
On Wed, Mar 16, 2022 at 03:38:55PM +0100, Jan Kara wrote:
> Well, but another effect of READ_ONCE() / WRITE_ONCE() is that it
> effectively forces the compiler to not store any intermediate value in
> bd_openers. If you have code like bdev->bd_openers++, and bd_openers has
> a value of, say, 1, the compiler is fully within its rights if an unlocked reader
> sees the values 1, 0, 3, 2. It would have to be a vicious compiler, but the C
> standard allows that and some of the optimizations compilers end up doing
> result in code which is not far from this (read more about KCSAN and the
> motivation behind it for details). So data_race() annotation is *not*
> enough for unlocked bd_openers usage.
> 
> > Use of atomic_t for lo->lo_disk->part0->bd_openers does not help, for
> > currently lo->lo_mutex is held in order to avoid races. That is, it is
> > disk->open_mutex which loop_clr_fd() needs to hold when accessing
> > lo->lo_disk->part0->bd_openers.
> 
> It does help because with atomic_t, seeing any intermediate values is not
> possible even for unlocked readers.

The Linux memory model guarantees atomic reads from 32-bit integers.
But if it makes everyone happier I could do a READ_ONCE here.
Jan Kara March 23, 2022, 12:18 p.m. UTC | #5
On Tue 22-03-22 12:09:08, Christoph Hellwig wrote:
> On Wed, Mar 16, 2022 at 03:38:55PM +0100, Jan Kara wrote:
> > Well, but another effect of READ_ONCE() / WRITE_ONCE() is that it
> > effectively forces the compiler to not store any intermediate value in
> > bd_openers. If you have code like bdev->bd_openers++, and bd_openers has
> > a value of, say, 1, the compiler is fully within its rights if an unlocked reader
> > sees the values 1, 0, 3, 2. It would have to be a vicious compiler, but the C
> > standard allows that and some of the optimizations compilers end up doing
> > result in code which is not far from this (read more about KCSAN and the
> > motivation behind it for details). So data_race() annotation is *not*
> > enough for unlocked bd_openers usage.
> > 
> > > Use of atomic_t for lo->lo_disk->part0->bd_openers does not help, for
> > > currently lo->lo_mutex is held in order to avoid races. That is, it is
> > > disk->open_mutex which loop_clr_fd() needs to hold when accessing
> > > lo->lo_disk->part0->bd_openers.
> > 
> > It does help because with atomic_t, seeing any intermediate values is not
> > possible even for unlocked readers.
> 
> The Linux memory model guarantees atomic reads from 32-bit integers.
> But if it makes everyone happier I could do a READ_ONCE here.

Sure, the read is atomic wrt other CPU instructions, but it is not atomic
wrt how the compiler decides to implement bdev->bd_openers++. So we need to
make these bd_openers *updates* atomic so that the unlocked reads are
really safe. That being said, I consider the concerns mostly theoretical, so
I don't insist, but some checker will surely complain sooner rather than
later.

								Honza
Tetsuo Handa March 23, 2022, 1:09 p.m. UTC | #6
On 2022/03/23 21:18, Jan Kara wrote:
>>>> Use of atomic_t for lo->lo_disk->part0->bd_openers does not help, for
>>>> currently lo->lo_mutex is held in order to avoid races. That is, it is
>>>> disk->open_mutex which loop_clr_fd() needs to hold when accessing
>>>> lo->lo_disk->part0->bd_openers.
>>>
>>> It does help because with atomic_t, seeing any intermediate values is not
>>> possible even for unlocked readers.
>>
>> The Linux memory model guarantees atomic reads from 32-bit integers.
>> But if it makes everyone happier I could do a READ_ONCE here.
> 
> Sure, the read is atomic wrt other CPU instructions, but it is not atomic
> wrt how the compiler decides to implement bdi->bd_openers++. So we need to
> make these bd_openers *updates* atomic so that the unlocked reads are
> really safe. That being said I consider the concerns mostly theoretical so
> I don't insist but some checker will surely complain sooner rather than
> later.

KCSAN will complain about the access without data_race(). But what I'm saying
here is that loop_clr_fd() needs to hold disk->open_mutex, because
blkdev_get_whole() is protected using disk->open_mutex. Then KCSAN will not
complain about this access.

static int loop_clr_fd(struct loop_device *lo)
{
        int err;

        err = mutex_lock_killable(&lo->lo_mutex);
        if (err)
                return err;
        if (lo->lo_state != Lo_bound) {
                mutex_unlock(&lo->lo_mutex);
                return -ENXIO;
        }
        if (lo->lo_disk->part0->bd_openers > 1) { // Sees lo->lo_disk->part0->bd_openers == 1.
                lo->lo_flags |= LO_FLAGS_AUTOCLEAR;
                mutex_unlock(&lo->lo_mutex);
                return 0;
        }
	// Preemption starts.
							static int blkdev_get_whole(struct block_device *bdev, fmode_t mode)
							{
							        struct gendisk *disk = bdev->bd_disk;
							        int ret;
							
							        if (disk->fops->open) { // disk->fops->open == NULL because lo_open() is removed.
							                ret = disk->fops->open(bdev, mode);
							                if (ret) {
							                        /* avoid ghost partitions on a removed medium */
							                        if (ret == -ENOMEDIUM &&
							                             test_bit(GD_NEED_PART_SCAN, &disk->state))
							                                bdev_disk_changed(disk, true);
							                        return ret;
							                }
							        }
							
							        if (!bdev->bd_openers)
							                set_init_blocksize(bdev);
							        if (test_bit(GD_NEED_PART_SCAN, &disk->state))
							                bdev_disk_changed(disk, false);
							        bdev->bd_openers++;
							        return 0;
							}
	// Preemption ends.
        lo->lo_state = Lo_rundown;
        mutex_unlock(&lo->lo_mutex);

        __loop_clr_fd(lo, false); // Despite lo->lo_disk->part0->bd_openers > 1.
        return 0;
}
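
For reference, the change argued for above would look roughly like this
(a sketch only; whether disk->open_mutex can be nested inside lo_mutex
without inverting the ordering used by lo_release() would still need to
be checked):

	/* sketch: close the window against blkdev_get_whole() by holding
	 * disk->open_mutex across the bd_openers check */
	mutex_lock(&lo->lo_disk->open_mutex);
	if (lo->lo_disk->part0->bd_openers > 1) {
		lo->lo_flags |= LO_FLAGS_AUTOCLEAR;
		mutex_unlock(&lo->lo_disk->open_mutex);
		mutex_unlock(&lo->lo_mutex);
		return 0;
	}
	lo->lo_state = Lo_rundown;
	mutex_unlock(&lo->lo_disk->open_mutex);
	mutex_unlock(&lo->lo_mutex);

	__loop_clr_fd(lo, false);
	return 0;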

Patch

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index cbaa18bcad1fe..c270f3715d829 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1244,7 +1244,7 @@  static int loop_clr_fd(struct loop_device *lo)
 	 * <dev>/do something like mkfs/losetup -d <dev> causing the losetup -d
 	 * command to fail with EBUSY.
 	 */
-	if (atomic_read(&lo->lo_refcnt) > 1) {
+	if (lo->lo_disk->part0->bd_openers > 1) {
 		lo->lo_flags |= LO_FLAGS_AUTOCLEAR;
 		mutex_unlock(&lo->lo_mutex);
 		return 0;
@@ -1724,33 +1724,15 @@  static int lo_compat_ioctl(struct block_device *bdev, fmode_t mode,
 }
 #endif
 
-static int lo_open(struct block_device *bdev, fmode_t mode)
-{
-	struct loop_device *lo = bdev->bd_disk->private_data;
-	int err;
-
-	err = mutex_lock_killable(&lo->lo_mutex);
-	if (err)
-		return err;
-	if (lo->lo_state == Lo_deleting)
-		err = -ENXIO;
-	else
-		atomic_inc(&lo->lo_refcnt);
-	mutex_unlock(&lo->lo_mutex);
-	return err;
-}
-
 static void lo_release(struct gendisk *disk, fmode_t mode)
 {
 	struct loop_device *lo = disk->private_data;
 
-	mutex_lock(&lo->lo_mutex);
-	if (atomic_dec_return(&lo->lo_refcnt))
-		goto out_unlock;
+	if (disk->part0->bd_openers > 0)
+		return;
 
-	if (lo->lo_flags & LO_FLAGS_AUTOCLEAR) {
-		if (lo->lo_state != Lo_bound)
-			goto out_unlock;
+	mutex_lock(&lo->lo_mutex);
+	if (lo->lo_state == Lo_bound && (lo->lo_flags & LO_FLAGS_AUTOCLEAR)) {
 		lo->lo_state = Lo_rundown;
 		mutex_unlock(&lo->lo_mutex);
 		/*
@@ -1760,8 +1742,6 @@  static void lo_release(struct gendisk *disk, fmode_t mode)
 		__loop_clr_fd(lo, true);
 		return;
 	}
-
-out_unlock:
 	mutex_unlock(&lo->lo_mutex);
 }
 
@@ -1775,7 +1755,6 @@  static void lo_free_disk(struct gendisk *disk)
 
 static const struct block_device_operations lo_fops = {
 	.owner =	THIS_MODULE,
-	.open =		lo_open,
 	.release =	lo_release,
 	.ioctl =	lo_ioctl,
 #ifdef CONFIG_COMPAT
@@ -2029,7 +2008,6 @@  static int loop_add(int i)
 	 */
 	if (!part_shift)
 		disk->flags |= GENHD_FL_NO_PART;
-	atomic_set(&lo->lo_refcnt, 0);
 	mutex_init(&lo->lo_mutex);
 	lo->lo_number		= i;
 	spin_lock_init(&lo->lo_lock);
@@ -2119,12 +2097,12 @@  static int loop_control_remove(int idx)
 	if (ret)
 		goto mark_visible;
 	if (lo->lo_state != Lo_unbound ||
-	    atomic_read(&lo->lo_refcnt) > 0) {
+	    lo->lo_disk->part0->bd_openers > 0) {
 		mutex_unlock(&lo->lo_mutex);
 		ret = -EBUSY;
 		goto mark_visible;
 	}
-	/* Mark this loop device no longer open()-able. */
+	/* Mark this loop device as no more bound, but not quite unbound yet */
 	lo->lo_state = Lo_deleting;
 	mutex_unlock(&lo->lo_mutex);
 
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index 082d4b6bfc6a6..449d562738c52 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -28,7 +28,6 @@  struct loop_func_table;
 
 struct loop_device {
 	int		lo_number;
-	atomic_t	lo_refcnt;
 	loff_t		lo_offset;
 	loff_t		lo_sizelimit;
 	int		lo_flags;