Message ID | 20250415001405.GA25659@frogsfrogsfrogs (mailing list archive) |
---|---|
State | New |
Series | [RFC PATCH 1/2] block: fix race between set_blocksize and read paths |
On Mon, Apr 14, 2025 at 05:14:05PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> With the new large sector size support, it's now the case that
> set_blocksize can change i_blksize and the folio order in a manner that
> conflicts with a concurrent reader and causes a kernel crash.
>
> Specifically, let's say that udev-worker calls libblkid to detect the
> labels on a block device. The read call can create an order-0 folio to
> read the first 4096 bytes from the disk. But then udev is preempted.
>
> Next, someone tries to mount an 8k-sectorsize filesystem from the same
> block device. The filesystem calls set_blocksize, which sets i_blksize
> to 8192 and the minimum folio order to 1.
>
> Now udev resumes, still holding the order-0 folio it allocated. It then
> tries to schedule a read bio and do_mpage_readahead tries to create
> bufferheads for the folio. Unfortunately, blocks_per_folio == 0 because
> the page size is 4096 but the blocksize is 8192, so no bufferheads are
> attached and the bh walk never sets bdev. We then submit the bio with a
> NULL block device and crash.

Do we have a testcase for blktests or xfstests for this? The issue is
subtle and some of the code in the patch looks easy to accidentally
break again (not the fault of this patch primarily).

>         } else {
> +               inode_lock_shared(bd_inode);
>                 ret = blkdev_buffered_write(iocb, from);
> +               inode_unlock_shared(bd_inode);

Does this need a comment why we take i_rwsem?

> +       inode_lock_shared(bd_inode);
>         ret = filemap_read(iocb, to, ret);
> +       inode_unlock_shared(bd_inode);

Same here. Especially as the protection is now heavier than for most
file systems.

I also wonder if we need locking asserts in some of the write side
functions that expect the shared inode lock and invalidate lock now?
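To make the blocks_per_folio == 0 step concrete, here is a minimal userspace sketch of the failing arithmetic. The variable names only approximate the kernel's bufferhead read path; this is not the kernel code itself, just the shift that underflows to zero when a stale order-0 folio meets a larger block size:

```c
#include <stdio.h>

int main(void)
{
        unsigned int i_blkbits = 13;    /* blksize_bits(8192) after set_blocksize */
        size_t folio_size = 4096;       /* stale order-0 folio from before the change */

        /* The bufferhead walk computes how many blocks fit in the folio. */
        unsigned int blocks_per_folio = folio_size >> i_blkbits;

        /* Prints 0: a loop over zero blocks attaches no bufferheads and
         * never sets a block device, so the bio is later submitted with
         * a NULL bdev. */
        printf("blocks_per_folio = %u\n", blocks_per_folio);
        return 0;
}
```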
On Tue, Apr 15, 2025 at 09:41:32PM -0700, Christoph Hellwig wrote:
> On Mon, Apr 14, 2025 at 05:14:05PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > With the new large sector size support, it's now the case that
> > set_blocksize can change i_blksize and the folio order in a manner
> > that conflicts with a concurrent reader and causes a kernel crash.
> >
> > Specifically, let's say that udev-worker calls libblkid to detect the
> > labels on a block device. The read call can create an order-0 folio
> > to read the first 4096 bytes from the disk. But then udev is
> > preempted.
> >
> > Next, someone tries to mount an 8k-sectorsize filesystem from the
> > same block device. The filesystem calls set_blocksize, which sets
> > i_blksize to 8192 and the minimum folio order to 1.
> >
> > Now udev resumes, still holding the order-0 folio it allocated. It
> > then tries to schedule a read bio and do_mpage_readahead tries to
> > create bufferheads for the folio. Unfortunately, blocks_per_folio ==
> > 0 because the page size is 4096 but the blocksize is 8192, so no
> > bufferheads are attached and the bh walk never sets bdev. We then
> > submit the bio with a NULL block device and crash.
>
> Do we have a testcase for blktests or xfstests for this? The issue is
> subtle and some of the code in the patch looks easy to accidentally
> break again (not the fault of this patch primarily).

It's the same patch as:
https://lore.kernel.org/linux-fsdevel/20250408175125.GL6266@frogsfrogsfrogs/

which is to say, xfs/032 with while true; do blkid; done running in the
background to increase the chances of a collision.

> >         } else {
> > +               inode_lock_shared(bd_inode);
> >                 ret = blkdev_buffered_write(iocb, from);
> > +               inode_unlock_shared(bd_inode);
>
> Does this need a comment why we take i_rwsem?
>
> > +       inode_lock_shared(bd_inode);
> >         ret = filemap_read(iocb, to, ret);
> > +       inode_unlock_shared(bd_inode);
>
> Same here. Especially as the protection is now heavier than for most
> file systems.

Yeah, somewhere we need a better comment. How about this for
set_blocksize:

/*
 * Flush and truncate the pagecache before we reconfigure the
 * mapping geometry because folio sizes are variable now. If
 * a reader has already allocated a folio whose size is smaller
 * than the new min_order but invokes readahead after the new
 * min_order becomes visible, readahead will think there are
 * "zero" blocks per folio and crash.
 */

And then the read/write paths can say something simpler:

/*
 * Take i_rwsem and invalidate_lock to avoid racing with a
 * blocksize change punching out the pagecache.
 */

> I also wonder if we need locking asserts in some of the write side
> functions that expect the shared inode lock and invalidate lock now?

Probably. Do you have specific places in mind?

--D
On Tue, Apr 15, 2025 at 10:01:44PM -0700, Darrick J. Wong wrote:
> It's the same patch as:
> https://lore.kernel.org/linux-fsdevel/20250408175125.GL6266@frogsfrogsfrogs/
>
> which is to say, xfs/032 with while true; do blkid; done running in the
> background to increase the chances of a collision.

I think the xfs-zoned CI actually hit this with 032 without any extra
action, though.

> /*
>  * Flush and truncate the pagecache before we reconfigure the
>  * mapping geometry because folio sizes are variable now. If
>  * a reader has already allocated a folio whose size is smaller
>  * than the new min_order but invokes readahead after the new
>  * min_order becomes visible, readahead will think there are
>  * "zero" blocks per folio and crash.
>  */
>
> And then the read/write paths can say something simpler:
>
> /*
>  * Take i_rwsem and invalidate_lock to avoid racing with a
>  * blocksize change punching out the pagecache.
>  */

Sounds reasonable.

> > I also wonder if we need locking asserts in some of the write side
> > functions that expect the shared inode lock and invalidate lock now?
>
> Probably. Do you have specific places in mind?

Looking at it more closely: no. We're only calling very low-level
helpers, so this might not actually be feasible.
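For readers wondering what such asserts would look like: a hedged sketch using the stock lockdep helpers. The function name is hypothetical and nothing like this is in the patch; as noted above, the thread concluded it may not be feasible in the very low-level helpers actually involved:

```c
#include <linux/fs.h>
#include <linux/lockdep.h>
#include <linux/pagemap.h>

/* Illustrative only: what a write-side helper could assert if it
 * expects the shared inode lock and the invalidate lock to be held. */
static void assert_bdev_geometry_locks_held(struct inode *inode)
{
        lockdep_assert_held_read(&inode->i_rwsem);
        lockdep_assert_held(&inode->i_mapping->invalidate_lock);
}
```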
On Apr 15, 2025 / 22:14, Christoph Hellwig wrote:
> On Tue, Apr 15, 2025 at 10:01:44PM -0700, Darrick J. Wong wrote:
> > It's the same patch as:
> > https://lore.kernel.org/linux-fsdevel/20250408175125.GL6266@frogsfrogsfrogs/
> >
> > which is to say, xfs/032 with while true; do blkid; done running in
> > the background to increase the chances of a collision.
>
> I think the xfs-zoned CI actually hit this with 032 without any extra
> action, though.

I observed xfs/032 hang using the kernel on the linux-xfs/for-next
branch at git hash 71700ac47ad8. Before the hang, the kernel reported
the messages below:

Oops: general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
CPU: 21 UID: 0 PID: 3187783 Comm: (udev-worker) Not tainted 6.15.0-rc1-kts-xfs-g71700ac47ad+ #1 PREEMPT(lazy)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014
RIP: 0010:guard_bio_eod+0x52/0x5b0

The failure was recreated in a stable manner. I applied this patch
series and confirmed the failure disappears. Good. (I needed to resolve
conflicts, though.)

This patch series fixes the block layer, so IMO it's better to have a
test case in blktests to confirm the fix. I created a blktests test
case which recreates the failure using blockdev and fio commands. Will
post it soon.
diff --git a/block/bdev.c b/block/bdev.c
index 7b4e35a661b0c9..0cbdac46d98d86 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -169,11 +169,23 @@ int set_blocksize(struct file *file, int size)
 
         /* Don't change the size if it is same as current */
         if (inode->i_blkbits != blksize_bits(size)) {
+                /* Prevent concurrent IO operations */
+                inode_lock(inode);
+                filemap_invalidate_lock(inode->i_mapping);
+
+                /*
+                 * Flush and truncate the pagecache before we reconfigure the
+                 * mapping geometry because folio sizes are variable now.
+                 */
                 sync_blockdev(bdev);
+                kill_bdev(bdev);
+
                 inode->i_blkbits = blksize_bits(size);
                 mapping_set_folio_order_range(inode->i_mapping,
                                               get_order(size), get_order(size));
                 kill_bdev(bdev);
+                filemap_invalidate_unlock(inode->i_mapping);
+                inode_unlock(inode);
         }
         return 0;
 }
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 0c77244a35c92e..8f15d1aa6eb89a 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -343,6 +343,7 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, blk_mode_t mode,
                 op = REQ_OP_ZONE_RESET;
 
                 /* Invalidate the page cache, including dirty pages. */
+                inode_lock(bdev->bd_mapping->host);
                 filemap_invalidate_lock(bdev->bd_mapping);
                 ret = blkdev_truncate_zone_range(bdev, mode, &zrange);
                 if (ret)
@@ -364,8 +365,10 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, blk_mode_t mode,
         ret = blkdev_zone_mgmt(bdev, op, zrange.sector, zrange.nr_sectors);
 
 fail:
-        if (cmd == BLKRESETZONE)
+        if (cmd == BLKRESETZONE) {
                 filemap_invalidate_unlock(bdev->bd_mapping);
+                inode_unlock(bdev->bd_mapping->host);
+        }
 
         return ret;
 }
diff --git a/block/fops.c b/block/fops.c
index be9f1dbea9ce0a..f46ae08fac33dd 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -746,7 +746,9 @@ static ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
                 ret = direct_write_fallback(iocb, from, ret,
                                 blkdev_buffered_write(iocb, from));
         } else {
+                inode_lock_shared(bd_inode);
                 ret = blkdev_buffered_write(iocb, from);
+                inode_unlock_shared(bd_inode);
         }
 
         if (ret > 0)
@@ -757,6 +759,7 @@ static ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
 
 static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
 {
+        struct inode *bd_inode = bdev_file_inode(iocb->ki_filp);
         struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host);
         loff_t size = bdev_nr_bytes(bdev);
         loff_t pos = iocb->ki_pos;
@@ -793,7 +796,9 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
                 goto reexpand;
         }
 
+        inode_lock_shared(bd_inode);
         ret = filemap_read(iocb, to, ret);
+        inode_unlock_shared(bd_inode);
 
 reexpand:
         if (unlikely(shorted))
@@ -836,6 +841,7 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
         if ((start | len) & (bdev_logical_block_size(bdev) - 1))
                 return -EINVAL;
 
+        inode_lock(inode);
         filemap_invalidate_lock(inode->i_mapping);
 
         /*
@@ -868,6 +874,7 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
 
  fail:
         filemap_invalidate_unlock(inode->i_mapping);
+        inode_unlock(inode);
         return error;
 }
diff --git a/block/ioctl.c b/block/ioctl.c
index faa40f383e2736..e472cc1030c60c 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -142,6 +142,7 @@ static int blk_ioctl_discard(struct block_device *bdev, blk_mode_t mode,
         if (err)
                 return err;
 
+        inode_lock(bdev->bd_mapping->host);
         filemap_invalidate_lock(bdev->bd_mapping);
         err = truncate_bdev_range(bdev, mode, start, start + len - 1);
         if (err)
@@ -174,6 +175,7 @@ static int blk_ioctl_discard(struct block_device *bdev, blk_mode_t mode,
         blk_finish_plug(&plug);
 fail:
         filemap_invalidate_unlock(bdev->bd_mapping);
+        inode_unlock(bdev->bd_mapping->host);
         return err;
 }
 
@@ -199,12 +201,14 @@ static int blk_ioctl_secure_erase(struct block_device *bdev, blk_mode_t mode,
             end > bdev_nr_bytes(bdev))
                 return -EINVAL;
 
+        inode_lock(bdev->bd_mapping->host);
         filemap_invalidate_lock(bdev->bd_mapping);
         err = truncate_bdev_range(bdev, mode, start, end - 1);
         if (!err)
                 err = blkdev_issue_secure_erase(bdev, start >> 9, len >> 9,
                                                 GFP_KERNEL);
         filemap_invalidate_unlock(bdev->bd_mapping);
+        inode_unlock(bdev->bd_mapping->host);
         return err;
 }
 
@@ -236,6 +240,7 @@ static int blk_ioctl_zeroout(struct block_device *bdev, blk_mode_t mode,
                 return -EINVAL;
 
         /* Invalidate the page cache, including dirty pages */
+        inode_lock(bdev->bd_mapping->host);
         filemap_invalidate_lock(bdev->bd_mapping);
         err = truncate_bdev_range(bdev, mode, start, end);
         if (err)
@@ -246,6 +251,7 @@ static int blk_ioctl_zeroout(struct block_device *bdev, blk_mode_t mode,
 
 fail:
         filemap_invalidate_unlock(bdev->bd_mapping);
+        inode_unlock(bdev->bd_mapping->host);
         return err;
 }
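Condensed, the ordering the diff establishes is i_rwsem (outer) before mapping->invalidate_lock (inner) on every path that can change or punch out the page cache, with readers taking i_rwsem shared. A minimal sketch of that pattern, with an illustrative function name that is not part of the patch:

```c
#include <linux/fs.h>
#include <linux/pagemap.h>

/* Sketch of the locking pattern applied above wherever the bdev page
 * cache geometry can change; the function name is hypothetical. */
static void bdev_geometry_update(struct inode *inode)
{
        /* Exclusive i_rwsem first, then the invalidate lock. */
        inode_lock(inode);
        filemap_invalidate_lock(inode->i_mapping);

        /* ... sync, kill_bdev(), update i_blkbits and folio order ... */

        filemap_invalidate_unlock(inode->i_mapping);
        inode_unlock(inode);
}
```

Readers such as blkdev_read_iter() then only need inode_lock_shared() to exclude a concurrent geometry change, which is exactly what the fops.c hunks add.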