Message ID | 20220316084519.2850118-8-hch@lst.de (mailing list archive)
---|---
State | New, archived
Series | [1/8] loop: de-duplicate the idle worker freeing code
On Wed 16-03-22 09:45:18, Christoph Hellwig wrote:
> lo_refcount counts how many openers a loop device has, but that count
> is already provided by the block layer in the bd_openers field of the
> whole-disk block_device. Remove lo_refcount and allow opens to
> succeed even on devices being deleted - now that ->free_disk is
> implemented we can handle that race gracefully and all I/O on it will
> just fail. Similarly there is a small race window now where
> loop_control_remove does not synchronize the delete vs the remove
> due to bd_openers not being under lo_mutex protection, but we can
> handle that just as gracefully.

Honestly, I'm a bit uneasy about these races and the ability to keep an
already-removed loop device open (given the use of loop devices in
container environments, potential problems could have security
implications). But I guess it is no different from having an open
hot-unplugged SCSI disk. There may just be an expectation from userspace
that your open either blocks loop device removal or fails. But I guess we
can deal with that if some real breakage happens - it does not seem that
hard to solve - we just need loop_control_remove() to grab
disk->open_mutex to make the transition to the Lo_deleting state safe, and
keep the Lo_deleting check in lo_open(). Plus we'd need to use READ_ONCE /
WRITE_ONCE for lo_state accesses.

> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  drivers/block/loop.c | 36 +++++++-----------------------------
>  drivers/block/loop.h |  1 -
>  2 files changed, 7 insertions(+), 30 deletions(-)
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index cbaa18bcad1fe..c270f3715d829 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1244,7 +1244,7 @@ static int loop_clr_fd(struct loop_device *lo)
>  	 * <dev>/do something like mkfs/losetup -d <dev> causing the losetup -d
>  	 * command to fail with EBUSY.
>  	 */
> -	if (atomic_read(&lo->lo_refcnt) > 1) {
> +	if (lo->lo_disk->part0->bd_openers > 1) {

But bd_openers can be read safely only under disk->open_mutex. So for this
to be safe against the compiler playing nasty tricks with optimizations, we
need to either make bd_openers atomic_t or use READ_ONCE / WRITE_ONCE when
accessing it.

> @@ -1724,33 +1724,15 @@ static int lo_compat_ioctl(struct block_device *bdev, fmode_t mode,
>  }
>  #endif
>
> -static int lo_open(struct block_device *bdev, fmode_t mode)
> -{
> -	struct loop_device *lo = bdev->bd_disk->private_data;
> -	int err;
> -
> -	err = mutex_lock_killable(&lo->lo_mutex);
> -	if (err)
> -		return err;
> -	if (lo->lo_state == Lo_deleting)
> -		err = -ENXIO;
> -	else
> -		atomic_inc(&lo->lo_refcnt);
> -	mutex_unlock(&lo->lo_mutex);
> -	return err;
> -}
> -

Tetsuo has observed [1] that not grabbing lo_mutex when opening a loop
device tends to break systemd-udevd, because in loop_configure() we send
the DISK_EVENT_MEDIA_CHANGE event before the device is fully configured
(the configuration gets finished before we release lo_mutex), so
systemd-udevd gets spurious IO errors when probing new loop devices and is
unhappy. So I think this is the right way to go, but we need to reshuffle
loop_configure() a bit first.

[1] https://lore.kernel.org/all/a72c59c6-298b-e4ba-b1f5-2275afab49a1@I-love.SAKURA.ne.jp/T/#u

								Honza
On 2022/03/16 20:22, Jan Kara wrote:
>> @@ -1244,7 +1244,7 @@ static int loop_clr_fd(struct loop_device *lo)
>>  	 * <dev>/do something like mkfs/losetup -d <dev> causing the losetup -d
>>  	 * command to fail with EBUSY.
>>  	 */
>> -	if (atomic_read(&lo->lo_refcnt) > 1) {
>> +	if (lo->lo_disk->part0->bd_openers > 1) {
>
> But bd_openers can be read safely only under disk->open_mutex. So for this
> to be safe against compiler playing nasty tricks with optimizations, we
> need to either make bd_openers atomic_t or use READ_ONCE / WRITE_ONCE when
> accessing it.

Use of READ_ONCE() / WRITE_ONCE() is not for avoiding races but for making
sure that a memory access happens only once. It is data_race() which is
needed for tolerating and annotating races. For example,
data_race(lo->lo_state) is needed when accessing lo->lo_state without
lo->lo_mutex held.

Use of atomic_t for lo->lo_disk->part0->bd_openers does not help, for
currently lo->lo_mutex is held in order to avoid races. That is, it is
disk->open_mutex which loop_clr_fd() needs to hold when accessing
lo->lo_disk->part0->bd_openers.
On Wed 16-03-22 22:14:50, Tetsuo Handa wrote:
> On 2022/03/16 20:22, Jan Kara wrote:
> >> @@ -1244,7 +1244,7 @@ static int loop_clr_fd(struct loop_device *lo)
> >>  	 * <dev>/do something like mkfs/losetup -d <dev> causing the losetup -d
> >>  	 * command to fail with EBUSY.
> >>  	 */
> >> -	if (atomic_read(&lo->lo_refcnt) > 1) {
> >> +	if (lo->lo_disk->part0->bd_openers > 1) {
> >
> > But bd_openers can be read safely only under disk->open_mutex. So for this
> > to be safe against compiler playing nasty tricks with optimizations, we
> > need to either make bd_openers atomic_t or use READ_ONCE / WRITE_ONCE when
> > accessing it.
>
> Use of READ_ONCE() / WRITE_ONCE() is not for avoiding races but for making
> sure that a memory access happens only once. It is data_race() which is
> needed for tolerating and annotating races. For example,
> data_race(lo->lo_state) is needed when accessing lo->lo_state without
> lo->lo_mutex held.

Well, but another effect of READ_ONCE() / WRITE_ONCE() is that it
effectively forces the compiler not to store any intermediate value in
bd_openers. If you have code like bdev->bd_openers++, and bd_openers has a
value of, say, 1, the compiler is fully within its rights if an unlocked
reader sees the values 1, 0, 3, 2. It would have to be a vicious compiler,
but the C standard allows that, and some of the optimizations compilers end
up doing result in code which is not far from this (read more about KCSAN
and the motivation behind it for details). So a data_race() annotation is
*not* enough for unlocked bd_openers usage.

> Use of atomic_t for lo->lo_disk->part0->bd_openers does not help, for
> currently lo->lo_mutex is held in order to avoid races. That is, it is
> disk->open_mutex which loop_clr_fd() needs to hold when accessing
> lo->lo_disk->part0->bd_openers.

It does help, because with atomic_t, seeing any intermediate value is not
possible even for unlocked readers.

								Honza
On Wed, Mar 16, 2022 at 03:38:55PM +0100, Jan Kara wrote:
> Well, but another effect of READ_ONCE() / WRITE_ONCE() is that it
> effectively forces the compiler not to store any intermediate value in
> bd_openers. If you have code like bdev->bd_openers++, and bd_openers has a
> value of, say, 1, the compiler is fully within its rights if an unlocked
> reader sees the values 1, 0, 3, 2. It would have to be a vicious compiler,
> but the C standard allows that, and some of the optimizations compilers end
> up doing result in code which is not far from this (read more about KCSAN
> and the motivation behind it for details). So a data_race() annotation is
> *not* enough for unlocked bd_openers usage.
>
> > Use of atomic_t for lo->lo_disk->part0->bd_openers does not help, for
> > currently lo->lo_mutex is held in order to avoid races. That is, it is
> > disk->open_mutex which loop_clr_fd() needs to hold when accessing
> > lo->lo_disk->part0->bd_openers.
>
> It does help, because with atomic_t, seeing any intermediate value is not
> possible even for unlocked readers.

The Linux memory model guarantees atomic reads from 32-bit integers.
But if it makes everyone happier I could do a READ_ONCE here.
On Tue 22-03-22 12:09:08, Christoph Hellwig wrote:
> On Wed, Mar 16, 2022 at 03:38:55PM +0100, Jan Kara wrote:
> > Well, but another effect of READ_ONCE() / WRITE_ONCE() is that it
> > effectively forces the compiler not to store any intermediate value in
> > bd_openers. If you have code like bdev->bd_openers++, and bd_openers has a
> > value of, say, 1, the compiler is fully within its rights if an unlocked
> > reader sees the values 1, 0, 3, 2. It would have to be a vicious compiler,
> > but the C standard allows that, and some of the optimizations compilers end
> > up doing result in code which is not far from this (read more about KCSAN
> > and the motivation behind it for details). So a data_race() annotation is
> > *not* enough for unlocked bd_openers usage.
> >
> > > Use of atomic_t for lo->lo_disk->part0->bd_openers does not help, for
> > > currently lo->lo_mutex is held in order to avoid races. That is, it is
> > > disk->open_mutex which loop_clr_fd() needs to hold when accessing
> > > lo->lo_disk->part0->bd_openers.
> >
> > It does help, because with atomic_t, seeing any intermediate value is not
> > possible even for unlocked readers.
>
> The Linux memory model guarantees atomic reads from 32-bit integers.
> But if it makes everyone happier I could do a READ_ONCE here.

Sure, the read is atomic wrt other CPU instructions, but it is not atomic
wrt how the compiler decides to implement bdev->bd_openers++. So we need to
make these bd_openers *updates* atomic so that the unlocked reads are
really safe. That being said, I consider the concerns mostly theoretical,
so I don't insist, but some checker will surely complain sooner rather
than later.

								Honza
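[Editorial aside: the distinction Jan draws between atomic reads and atomic *updates* can be demonstrated with standard C11 atomics, a userspace analogue of the kernel's atomic_t. This is an illustrative sketch with hypothetical names, not kernel code: with an atomic read-modify-write, no reader (locked or not) can ever observe a torn or intermediate counter value, and concurrent increments are never lost.]

```c
#include <stdatomic.h>
#include <pthread.h>

enum { NTHREADS = 4, ITERS = 100000 };

/* Userspace stand-in for an atomic_t bd_openers counter. */
static atomic_int openers;

static void *opener(void *arg)
{
	(void)arg;
	for (int i = 0; i < ITERS; i++)
		/* An atomic RMW: increments cannot be lost and no reader
		 * ever sees a torn value, unlike a plain `openers++`,
		 * which is a non-atomic load-add-store sequence. */
		atomic_fetch_add_explicit(&openers, 1, memory_order_relaxed);
	return NULL;
}

/* Runs NTHREADS concurrent incrementers and returns the final count,
 * which is exact precisely because the updates are atomic. */
int count_openers(void)
{
	pthread_t t[NTHREADS];

	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, opener, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);
	return atomic_load(&openers);
}
```

With a plain `int` counter the same program could drop increments, which is the "updates must be atomic" half of Jan's argument; Christoph's point covers only the read side.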
On 2022/03/23 21:18, Jan Kara wrote:
>>>> Use of atomic_t for lo->lo_disk->part0->bd_openers does not help, for
>>>> currently lo->lo_mutex is held in order to avoid races. That is, it is
>>>> disk->open_mutex which loop_clr_fd() needs to hold when accessing
>>>> lo->lo_disk->part0->bd_openers.
>>>
>>> It does help, because with atomic_t, seeing any intermediate value is not
>>> possible even for unlocked readers.
>>
>> The Linux memory model guarantees atomic reads from 32-bit integers.
>> But if it makes everyone happier I could do a READ_ONCE here.
>
> Sure, the read is atomic wrt other CPU instructions, but it is not atomic
> wrt how the compiler decides to implement bdev->bd_openers++. So we need to
> make these bd_openers *updates* atomic so that the unlocked reads are
> really safe. That being said, I consider the concerns mostly theoretical,
> so I don't insist, but some checker will surely complain sooner rather
> than later.

KCSAN will complain about accesses without data_race(). But what I'm saying
here is that loop_clr_fd() needs to hold disk->open_mutex, for
blkdev_get_whole() is protected using disk->open_mutex. Then KCSAN will not
complain about this access.

static int loop_clr_fd(struct loop_device *lo)
{
	int err;

	err = mutex_lock_killable(&lo->lo_mutex);
	if (err)
		return err;
	if (lo->lo_state != Lo_bound) {
		mutex_unlock(&lo->lo_mutex);
		return -ENXIO;
	}
	if (lo->lo_disk->part0->bd_openers > 1) { // Sees lo->lo_disk->part0->bd_openers == 1.
		lo->lo_flags |= LO_FLAGS_AUTOCLEAR;
		mutex_unlock(&lo->lo_mutex);
		return 0;
	}
	// Preemption starts.

	static int blkdev_get_whole(struct block_device *bdev, fmode_t mode)
	{
		struct gendisk *disk = bdev->bd_disk;
		int ret;

		if (disk->fops->open) { // disk->fops->open == NULL because lo_open() is removed.
			ret = disk->fops->open(bdev, mode);
			if (ret) {
				/* avoid ghost partitions on a removed medium */
				if (ret == -ENOMEDIUM &&
				    test_bit(GD_NEED_PART_SCAN, &disk->state))
					bdev_disk_changed(disk, true);
				return ret;
			}
		}

		if (!bdev->bd_openers)
			set_init_blocksize(bdev);
		if (test_bit(GD_NEED_PART_SCAN, &disk->state))
			bdev_disk_changed(disk, false);
		bdev->bd_openers++;
		return 0;
	}
	// Preemption ends.

	lo->lo_state = Lo_rundown;
	mutex_unlock(&lo->lo_mutex);
	__loop_clr_fd(lo, false); // Despite lo->lo_disk->part0->bd_openers > 1.
	return 0;
}
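[Editorial aside: the locking structure Tetsuo argues for, checking bd_openers under the same disk->open_mutex that blkdev_get_whole() holds while incrementing it, can be modeled in userspace. All names below are hypothetical stand-ins; a pthread mutex plays the role of open_mutex. Because the check and the increment serialize on one lock, the interleaving shown in the race walkthrough above becomes impossible.]

```c
#include <pthread.h>

/* Userspace model: bd_openers is only ever read or written under
 * open_mutex, so the "check openers, then decide to run down" step
 * cannot interleave with a concurrent open. */
static pthread_mutex_t open_mutex = PTHREAD_MUTEX_INITIALIZER;
static int bd_openers;

/* Models blkdev_get_whole(): the increment happens under open_mutex. */
void model_open(void)
{
	pthread_mutex_lock(&open_mutex);
	bd_openers++;
	pthread_mutex_unlock(&open_mutex);
}

/* Models the check in loop_clr_fd(): returns 1 (busy, would take the
 * AUTOCLEAR path) when another opener still holds the device, 0 when
 * rundown is safe. Taking open_mutex here closes the window in which
 * an open could slip in between the check and the decision. */
int model_clr_fd(void)
{
	int busy;

	pthread_mutex_lock(&open_mutex);
	busy = bd_openers > 1;
	pthread_mutex_unlock(&open_mutex);
	return busy;
}
```

The `> 1` mirrors the kernel check: the caller of loop_clr_fd() itself accounts for one open, so only additional openers make the device busy.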
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index cbaa18bcad1fe..c270f3715d829 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1244,7 +1244,7 @@ static int loop_clr_fd(struct loop_device *lo)
 	 * <dev>/do something like mkfs/losetup -d <dev> causing the losetup -d
 	 * command to fail with EBUSY.
 	 */
-	if (atomic_read(&lo->lo_refcnt) > 1) {
+	if (lo->lo_disk->part0->bd_openers > 1) {
 		lo->lo_flags |= LO_FLAGS_AUTOCLEAR;
 		mutex_unlock(&lo->lo_mutex);
 		return 0;
@@ -1724,33 +1724,15 @@ static int lo_compat_ioctl(struct block_device *bdev, fmode_t mode,
 }
 #endif
 
-static int lo_open(struct block_device *bdev, fmode_t mode)
-{
-	struct loop_device *lo = bdev->bd_disk->private_data;
-	int err;
-
-	err = mutex_lock_killable(&lo->lo_mutex);
-	if (err)
-		return err;
-	if (lo->lo_state == Lo_deleting)
-		err = -ENXIO;
-	else
-		atomic_inc(&lo->lo_refcnt);
-	mutex_unlock(&lo->lo_mutex);
-	return err;
-}
-
 static void lo_release(struct gendisk *disk, fmode_t mode)
 {
 	struct loop_device *lo = disk->private_data;
 
-	mutex_lock(&lo->lo_mutex);
-	if (atomic_dec_return(&lo->lo_refcnt))
-		goto out_unlock;
+	if (disk->part0->bd_openers > 0)
+		return;
 
-	if (lo->lo_flags & LO_FLAGS_AUTOCLEAR) {
-		if (lo->lo_state != Lo_bound)
-			goto out_unlock;
+	mutex_lock(&lo->lo_mutex);
+	if (lo->lo_state == Lo_bound && (lo->lo_flags & LO_FLAGS_AUTOCLEAR)) {
 		lo->lo_state = Lo_rundown;
 		mutex_unlock(&lo->lo_mutex);
 		/*
@@ -1760,8 +1742,6 @@ static void lo_release(struct gendisk *disk, fmode_t mode)
 		__loop_clr_fd(lo, true);
 		return;
 	}
-
-out_unlock:
 	mutex_unlock(&lo->lo_mutex);
 }
 
@@ -1775,7 +1755,6 @@ static void lo_free_disk(struct gendisk *disk)
 
 static const struct block_device_operations lo_fops = {
 	.owner =	THIS_MODULE,
-	.open =		lo_open,
 	.release =	lo_release,
 	.ioctl =	lo_ioctl,
 #ifdef CONFIG_COMPAT
@@ -2029,7 +2008,6 @@ static int loop_add(int i)
 	 */
 	if (!part_shift)
 		disk->flags |= GENHD_FL_NO_PART;
-	atomic_set(&lo->lo_refcnt, 0);
 	mutex_init(&lo->lo_mutex);
 	lo->lo_number		= i;
 	spin_lock_init(&lo->lo_lock);
@@ -2119,12 +2097,12 @@ static int loop_control_remove(int idx)
 	if (ret)
 		goto mark_visible;
 	if (lo->lo_state != Lo_unbound ||
-	    atomic_read(&lo->lo_refcnt) > 0) {
+	    lo->lo_disk->part0->bd_openers > 0) {
 		mutex_unlock(&lo->lo_mutex);
 		ret = -EBUSY;
 		goto mark_visible;
 	}
-	/* Mark this loop device no longer open()-able. */
+	/* Mark this loop device as no more bound, but not quite unbound yet */
 	lo->lo_state = Lo_deleting;
 	mutex_unlock(&lo->lo_mutex);
 
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index 082d4b6bfc6a6..449d562738c52 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -28,7 +28,6 @@ struct loop_func_table;
 
 struct loop_device {
 	int		lo_number;
-	atomic_t	lo_refcnt;
 	loff_t		lo_offset;
 	loff_t		lo_sizelimit;
 	int		lo_flags;
lo_refcount counts how many openers a loop device has, but that count
is already provided by the block layer in the bd_openers field of the
whole-disk block_device. Remove lo_refcount and allow opens to
succeed even on devices being deleted - now that ->free_disk is
implemented we can handle that race gracefully and all I/O on it will
just fail. Similarly there is a small race window now where
loop_control_remove does not synchronize the delete vs the remove
due to bd_openers not being under lo_mutex protection, but we can
handle that just as gracefully.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/block/loop.c | 36 +++++++-----------------------------
 drivers/block/loop.h |  1 -
 2 files changed, 7 insertions(+), 30 deletions(-)