Message ID | 20230126033358.1880-2-demi@invisiblethingslab.com (mailing list archive) |
---|---|
State | New, archived |
Series | Allow race-free block device handling |
On Wed, Jan 25, 2023 at 10:33:53PM -0500, Demi Marie Obenour wrote:
> The newly added blkdev_get_file() function allows kernel code to create
> a struct file for any block device. The main use-case is for the
> struct file to be exposed to userspace as a file descriptor. A future
> patch will modify the DM_DEV_CREATE_CREATE ioctl to allow userspace to
> get a file descriptor to the newly created block device, avoiding nasty
> race conditions.

NAK. Do not add weird side-way interfaces to the block layer.
On Mon, Jan 30, 2023 at 12:08:23AM -0800, Christoph Hellwig wrote:
> On Wed, Jan 25, 2023 at 10:33:53PM -0500, Demi Marie Obenour wrote:
> > The newly added blkdev_get_file() function allows kernel code to create
> > a struct file for any block device. The main use-case is for the
> > struct file to be exposed to userspace as a file descriptor. A future
> > patch will modify the DM_DEV_CREATE_CREATE ioctl to allow userspace to
> > get a file descriptor to the newly created block device, avoiding nasty
> > race conditions.
>
> NAK. Do not add weird side-way interfaces to the block layer.

What do you recommend instead? This solves a real problem for
device-mapper users and I am not aware of a better solution.
On Mon, Jan 30, 2023 at 02:22:39PM -0500, Demi Marie Obenour wrote:
> What do you recommend instead? This solves a real problem for
> device-mapper users and I am not aware of a better solution.

You could start with explaining the problem and what other methods you
tried that failed. In the end it's not my job to fix your problem. I
generally gladly help, but this kind of attitude doesn't get very far.
On Tue, Jan 31, 2023 at 12:53:03AM -0800, Christoph Hellwig wrote:
> On Mon, Jan 30, 2023 at 02:22:39PM -0500, Demi Marie Obenour wrote:
> > What do you recommend instead? This solves a real problem for
> > device-mapper users and I am not aware of a better solution.
>
> You could start with explaining the problem and what other methods
> you tried that failed. In the end it's not my job to fix your problem.

I’m working on a “block not-script” (Xen block device hotplug script
written in C) for Qubes OS. The current hotplug script is a shell
script that takes a global lock, which serializes all invocations and
significantly slows down VM creation and destruction. My C program
avoids this problem.

One of the goals of the not-script is to never leak resources, even if
it dies with SIGKILL or is never called with the “remove” argument to
destroy the devices it created. Therefore, whenever possible, it relies
on automatic destruction of devices that are no longer used. I have
managed to make this work for loop devices, provided that the Xen
blkback driver is patched to accept a diskseq in the physical-device
Xenstore node. I have *not* managed to make this work for device-mapper
devices, however. One of the problems is that there is no way to
atomically create a device-mapper device and obtain a file descriptor to
it such that the device will be destroyed when no longer used.

To solve this problem, I added a new flag (DM_FILE_DESCRIPTOR_FLAG) that
asks the device-mapper driver to provide userspace a file descriptor for
the device that was just created. The uAPI will likely change in future
versions of the patch, but the general idea will not.

While it is easy to provide userspace with an FD to any struct file, it
is *not* easy to obtain a struct file for a given struct block_device.
I could have had device-mapper implement everything itself, but that
would have duplicated a large amount of code already in the block layer.
Instead, I decided to refactor the block layer to provide a function
that does exactly what was needed. The result was this patch. In the
future, I would like to add an ioctl for /dev/loop-control that creates
a loop device and returns a file descriptor to the loop device. I could
also see iSCSI supporting this, with the socket file descriptor being
passed in from userspace.

blkdev_do_open() does not solve any problem for me at this time.
Instead, it represents the code shared by blkdev_get_by_dev() and
blkdev_get_file(). I decided to export it because it could be of
independent use to others. In particular, it could potentially
simplify disk_scan_partitions() in block/genhd.c, pkt_new_dev() in
pktcdvd, backing_dev_store() in zram, and f2fs_scan_devices() in f2fs.

I hope this is enough information. If it is not, feel free to ask for
more.
On Tue, Jan 31, 2023 at 11:27:59AM -0500, Demi Marie Obenour wrote:
> While it is easy to provide userspace with an FD to any struct file, it
> is *not* easy to obtain a struct file for a given struct block_device.
> I could have had device-mapper implement everything itself, but that
> would have duplicated a large amount of code already in the block layer.
> Instead, I decided to refactor the block layer to provide a function
> that does exactly what was needed. The result was this patch. In the
> future, I would like to add an ioctl for /dev/loop-control that creates
> a loop device and returns a file descriptor to the loop device. I could
> also see iSCSI supporting this, with the socket file descriptor being
> passed in from userspace.

And it is somewhat intentional that you can't. Block device inodes
have interesting life times and are never directly exposed to userspace
at all. They are internal, and only f_mapping of a file system inode
delegates to them for I/O. Your patch now magically exposes them to
userspace. And it then bypasses all pathname and inode permission
based access checks and auditing. So we can't just do it.

> blkdev_do_open() does not solve any problem for me at this time.
> Instead, it represents the code shared by blkdev_get_by_dev() and
> blkdev_get_file(). I decided to export it because it could be of
> independent use to others. In particular, it could potentially
> simplify disk_scan_partitions() in block/genhd.c, pkt_new_dev() in
> pktcdvd, backing_dev_store() in zram, and f2fs_scan_devices() in f2fs.

All these need to actually open the underlying device as they do I/O.
Doing I/O without opening the device is a no-go.
On Tue, Jan 31, 2023 at 11:45:55PM -0800, Christoph Hellwig wrote:
> On Tue, Jan 31, 2023 at 11:27:59AM -0500, Demi Marie Obenour wrote:
> > While it is easy to provide userspace with an FD to any struct file, it
> > is *not* easy to obtain a struct file for a given struct block_device.
> > I could have had device-mapper implement everything itself, but that
> > would have duplicated a large amount of code already in the block layer.
> > Instead, I decided to refactor the block layer to provide a function
> > that does exactly what was needed. The result was this patch. In the
> > future, I would like to add an ioctl for /dev/loop-control that creates
> > a loop device and returns a file descriptor to the loop device. I could
> > also see iSCSI supporting this, with the socket file descriptor being
> > passed in from userspace.
>
> And it is somewhat intentional that you can't. Block device inodes
> have interesting life times and are never directly exposed to userspace
> at all. They are internal, and only f_mapping of a file system inode
> delegates to them for I/O. Your patch now magically exposes them to
> userspace.

The intention is that the file descriptor is equivalent to what one
would get by first creating the device and then opening it. If it is
not, that is a bug in one of my patches.

> And it then bypasses all pathname and inode permission
> based access checks and auditing. So we can't just do it.

Accessing /dev/mapper/control is already enough to panic the kernel, so
presumably only fully trusted userspace can make the ioctl to begin
with. Furthermore, this only allows a userspace process to get a file
descriptor to the device-mapper device it itself created.

> > blkdev_do_open() does not solve any problem for me at this time.
> > Instead, it represents the code shared by blkdev_get_by_dev() and
> > blkdev_get_file(). I decided to export it because it could be of
> > independent use to others. In particular, it could potentially
> > simplify disk_scan_partitions() in block/genhd.c, pkt_new_dev() in
> > pktcdvd, backing_dev_store() in zram, and f2fs_scan_devices() in f2fs.
>
> All these need to actually open the underlying device as they do I/O.
> Doing I/O without opening the device is a no-go.

blkdev_do_open() *does* open the device. If it doesn’t, that’s a bug.
In v2 I will add the same access control checks that blkdev_get_by_dev()
does. Is this sufficient?
On Tue, Jan 31, 2023 at 11:27:59AM -0500, Demi Marie Obenour wrote:
> On Tue, Jan 31, 2023 at 12:53:03AM -0800, Christoph Hellwig wrote:
> > On Mon, Jan 30, 2023 at 02:22:39PM -0500, Demi Marie Obenour wrote:
> > > What do you recommend instead? This solves a real problem for
> > > device-mapper users and I am not aware of a better solution.
> >
> > You could start with explaining the problem and what other methods
> > you tried that failed. In the end it's not my job to fix your problem.
>
> I’m working on a “block not-script” (Xen block device hotplug script
> written in C) for Qubes OS. The current hotplug script is a shell
> script that takes a global lock, which serializes all invocations and
> significantly slows down VM creation and destruction. My C program
> avoids this problem.
>
> One of the goals of the not-script is to never leak resources, even if
> it dies with SIGKILL or is never called with the “remove” argument to

If it dies, you still can restart one new instance for handling the device
leak by running one simple daemon to monitor if not-script is live.

> destroy the devices it created. Therefore, whenever possible, it relies
> on automatic destruction of devices that are no longer used. I have

This automatic destruction of devices is supposed to be done in
userspace, cause only userspace knows when device is needed, when
it is needed.

So not sure this kind of work should be involved in kernel.

Thanks,
Ming
On Thu, Feb 02, 2023 at 04:49:54PM +0800, Ming Lei wrote:
> On Tue, Jan 31, 2023 at 11:27:59AM -0500, Demi Marie Obenour wrote:
> > On Tue, Jan 31, 2023 at 12:53:03AM -0800, Christoph Hellwig wrote:
> > > On Mon, Jan 30, 2023 at 02:22:39PM -0500, Demi Marie Obenour wrote:
> > > > What do you recommend instead? This solves a real problem for
> > > > device-mapper users and I am not aware of a better solution.
> > >
> > > You could start with explaining the problem and what other methods
> > > you tried that failed. In the end it's not my job to fix your problem.
> >
> > I’m working on a “block not-script” (Xen block device hotplug script
> > written in C) for Qubes OS. The current hotplug script is a shell
> > script that takes a global lock, which serializes all invocations and
> > significantly slows down VM creation and destruction. My C program
> > avoids this problem.
> >
> > One of the goals of the not-script is to never leak resources, even if
> > it dies with SIGKILL or is never called with the “remove” argument to
>
> If it dies, you still can restart one new instance for handling the device
> leak by running one simple daemon to monitor if not-script is live.

This requires userspace to maintain state that persists across process
restarts, and is also non-compositional. If there was a userspace
daemon that was responsible for all block device management in the
system, this would be more reasonable, but no such daemon exists.
Furthermore, the amount of code required in userspace dwarfs the amount
of code my patches add to the kernel, both in size and complexity.

> > destroy the devices it created. Therefore, whenever possible, it relies
> > on automatic destruction of devices that are no longer used. I have
>
> This automatic destruction of devices is supposed to be done in
> userspace, cause only userspace knows when device is needed, when
> it is needed.

In my use-case, the last reference to the device is held by the blkback
driver in the kernel. More generally, any case where a device is
created for a single purpose and should be destroyed when no longer used
will benefit from this. Encrypted swap devices are a simple example, as
they can be destroyed with a single “swapoff” command.
diff --git a/block/bdev.c b/block/bdev.c
index edc110d90df4041e7d337976951bd0d17525f1f7..09cb5ef900ca9ad5b21250bb63e64cc2a79f9289 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -459,10 +459,33 @@ static struct file_system_type bd_type = {
 struct super_block *blockdev_superblock __read_mostly;
 EXPORT_SYMBOL_GPL(blockdev_superblock);
 
+static struct vfsmount *bd_mnt __read_mostly;
+
+struct file *
+blkdev_get_file(struct block_device *bdev, fmode_t flags, void *holder)
+{
+	struct inode *inode;
+	struct file *filp;
+	int ret;
+
+	ret = blkdev_do_open(bdev, flags, holder);
+	if (ret)
+		return ERR_PTR(ret);
+	inode = bdev->bd_inode;
+	filp = alloc_file_pseudo(inode, bd_mnt, "[block]", flags | O_CLOEXEC, &def_blk_fops);
+	if (IS_ERR(filp)) {
+		blkdev_put(bdev, flags);
+	} else {
+		filp->f_mapping = inode->i_mapping;
+		filp->f_wb_err = filemap_sample_wb_err(filp->f_mapping);
+	}
+	return filp;
+}
+EXPORT_SYMBOL(blkdev_get_file);
+
 void __init bdev_cache_init(void)
 {
 	int err;
-	static struct vfsmount *bd_mnt;
 
 	bdev_cachep = kmem_cache_create("bdev_cache", sizeof(struct bdev_inode),
 			0, (SLAB_HWCACHE_ALIGN|SLAB_RECLAIM_ACCOUNT|
@@ -775,7 +798,7 @@ void blkdev_put_no_open(struct block_device *bdev)
  *
  * Use this interface ONLY if you really do not have anything better - i.e. when
  * you are behind a truly sucky interface and all you are given is a device
- * number. Everything else should use blkdev_get_by_path().
+ * number. Everything else should use blkdev_get_by_path() or blkdev_do_open().
  *
  * CONTEXT:
  * Might sleep.
@@ -785,9 +808,7 @@ void blkdev_put_no_open(struct block_device *bdev)
  */
 struct block_device *blkdev_get_by_dev(dev_t dev, fmode_t mode, void *holder)
 {
-	bool unblock_events = true;
 	struct block_device *bdev;
-	struct gendisk *disk;
 	int ret;
 
 	ret = devcgroup_check_permission(DEVCG_DEV_BLOCK,
@@ -800,18 +821,52 @@ struct block_device *blkdev_get_by_dev(dev_t dev, fmode_t mode, void *holder)
 	bdev = blkdev_get_no_open(dev);
 	if (!bdev)
 		return ERR_PTR(-ENXIO);
-	disk = bdev->bd_disk;
+
+	ret = blkdev_do_open(bdev, mode, holder);
+	if (ret) {
+		blkdev_put_no_open(bdev);
+		return ERR_PTR(ret);
+	}
+
+	return bdev;
+}
+EXPORT_SYMBOL(blkdev_get_by_dev);
+
+/**
+ * blkdev_do_open - open a block device by device pointer
+ * @bdev: pointer to the device to open
+ * @mode: FMODE_* mask
+ * @holder: exclusive holder identifier
+ *
+ * Open the block device pointed to by @bdev. If @mode includes
+ * %FMODE_EXCL, the block device is opened with exclusive access. Specifying
+ * %FMODE_EXCL with a %NULL @holder is invalid. Exclusive opens may nest for
+ * the same @holder.
+ *
+ * Unlike blkdev_get_by_dev() and bldev_get_by_path(), this function does not
+ * do any permission checks. The most common use-case is where the device
+ * was freshly created by userspace.
+ *
+ * CONTEXT:
+ * Might sleep.
+ *
+ * RETURNS:
+ * Reference 0 on success, -errno on failure.
+ */
+int blkdev_do_open(struct block_device *bdev, fmode_t mode, void *holder) {
+	struct gendisk *disk = bdev->bd_disk;
+	int ret = -ENXIO;
+	bool unblock_events = true;
 
 	if (mode & FMODE_EXCL) {
 		ret = bd_prepare_to_claim(bdev, holder);
 		if (ret)
-			goto put_blkdev;
+			return ret;
 	}
 
 	disk_block_events(disk);
 
 	mutex_lock(&disk->open_mutex);
-	ret = -ENXIO;
 	if (!disk_live(disk))
 		goto abort_claiming;
 	if (!try_module_get(disk->fops->owner))
@@ -842,7 +897,7 @@ struct block_device *blkdev_get_by_dev(dev_t dev, fmode_t mode, void *holder)
 
 	if (unblock_events)
 		disk_unblock_events(disk);
-	return bdev;
+	return 0;
 put_module:
 	module_put(disk->fops->owner);
 abort_claiming:
@@ -850,11 +905,9 @@ struct block_device *blkdev_get_by_dev(dev_t dev, fmode_t mode, void *holder)
 		bd_abort_claiming(bdev, holder);
 	mutex_unlock(&disk->open_mutex);
 	disk_unblock_events(disk);
-put_blkdev:
-	blkdev_put_no_open(bdev);
-	return ERR_PTR(ret);
+	return ret;
 }
-EXPORT_SYMBOL(blkdev_get_by_dev);
+EXPORT_SYMBOL(blkdev_do_open);
 
 /**
  * blkdev_get_by_path - open a block device by name
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 43d4e073b1115e4628a001081fbf08b296d342df..04635cb5ee29d22394a34c65eb34bea4e7847d8d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -325,6 +325,11 @@ typedef int (*report_zones_cb)(struct blk_zone *zone, unsigned int idx,
 
 void disk_set_zoned(struct gendisk *disk, enum blk_zoned_model model);
 
+struct file *
+blkdev_get_file(struct block_device *bdev, fmode_t flags, void *holder);
+
+int blkdev_do_open(struct block_device *bdev, fmode_t flags, void *holder);
+
 #ifdef CONFIG_BLK_DEV_ZONED
 
 #define BLK_ALL_ZONES  ((unsigned int)-1)
The newly added blkdev_get_file() function allows kernel code to create
a struct file for any block device. The main use-case is for the
struct file to be exposed to userspace as a file descriptor. A future
patch will modify the DM_DEV_CREATE_CREATE ioctl to allow userspace to
get a file descriptor to the newly created block device, avoiding nasty
race conditions.

Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
---
 block/bdev.c           | 77 +++++++++++++++++++++++++++++++++++-------
 include/linux/blkdev.h |  5 +++
 2 files changed, 70 insertions(+), 12 deletions(-)
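
Editorial sketch, not part of the patch: the refactoring splits one function into a core that returns 0 or -errno (blkdev_do_open()) and a wrapper that returns a pointer or an ERR_PTR-encoded error and drops its own reference on failure (blkdev_get_by_dev()). The userspace mock below illustrates only that calling convention. The toy_* names and struct are hypothetical, and ERR_PTR/PTR_ERR/IS_ERR are re-implemented here for illustration, modeled on the kernel helpers of the same names.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Userspace stand-ins for the kernel's error-pointer helpers. */
static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline bool IS_ERR(const void *ptr)
{
	/* The top MAX_ERRNO addresses are reserved for encoded errors. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical toy device: just enough state to show the pattern. */
struct toy_bdev {
	bool live;	/* stands in for disk_live() */
	int openers;	/* successful opens */
	int refs;	/* references taken by lookups */
};

/*
 * Core step, shaped like blkdev_do_open() in the patch: returns 0 or
 * -errno and never touches the caller's reference.
 */
int toy_do_open(struct toy_bdev *bdev)
{
	if (!bdev->live)
		return -ENXIO;
	bdev->openers++;
	return 0;
}

/*
 * Wrapper, shaped like the refactored blkdev_get_by_dev(): takes a
 * reference, calls the core, and on failure drops the reference and
 * converts the errno into an error pointer.
 */
struct toy_bdev *toy_get(struct toy_bdev *bdev)
{
	int ret;

	bdev->refs++;
	ret = toy_do_open(bdev);
	if (ret) {
		bdev->refs--;
		return ERR_PTR(ret);
	}
	return bdev;
}
```

Callers of the wrapper check the result with IS_ERR() and recover the errno with PTR_ERR(); callers of the core compare the return value against 0. Keeping reference handling in the wrapper is what lets a second caller such as blkdev_get_file() reuse the core with a device pointer it already holds.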