diff mbox series

iomap: add a workaround for racy i_size updates on block devices

Message ID 20230925095133.311224-1-hch@lst.de (mailing list archive)
State New, archived
Headers show
Series iomap: add a workaround for racy i_size updates on block devices | expand

Commit Message

Christoph Hellwig Sept. 25, 2023, 9:51 a.m. UTC
A szybot reproducer that does write I/O while truncating the size of a
block device can end up in clean_bdev_aliases, which tries to clean the
bdev aliases that it uses.  This is because iomap_to_bh automatically
sets the BH_New flag when outside of i_size.  For block devices updates
to i_size are racy and we can hit this case in a tiny race window,
leading to the eventual clean_bdev_aliases call.  Fix this by erroring
out of > i_size I/O on block devices.

Reported-by: syzbot+1fa947e7f09e136925b8@syzkaller.appspotmail.com
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: syzbot+1fa947e7f09e136925b8@syzkaller.appspotmail.com
---
 fs/buffer.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

Comments

Darrick J. Wong Sept. 25, 2023, 3:09 p.m. UTC | #1
On Mon, Sep 25, 2023 at 11:51:33AM +0200, Christoph Hellwig wrote:
> A szybot reproducer that does write I/O while truncating the size of a
> block device can end up in clean_bdev_aliases, which tries to clean the
> bdev aliases that it uses.  This is because iomap_to_bh automatically
> sets the BH_New flag when outside of i_size.  For block devices updates
> to i_size are racy and we can hit this case in a tiny race window,
> leading to the eventual clean_bdev_aliases call.  Fix this by erroring
> out of > i_size I/O on block devices.
> 
> Reported-by: syzbot+1fa947e7f09e136925b8@syzkaller.appspotmail.com
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Tested-by: syzbot+1fa947e7f09e136925b8@syzkaller.appspotmail.com
> ---
>  fs/buffer.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index a6785cd07081cb..12e9a71c693d74 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2058,8 +2058,17 @@ iomap_to_bh(struct inode *inode, sector_t block, struct buffer_head *bh,
>  		fallthrough;
>  	case IOMAP_MAPPED:
>  		if ((iomap->flags & IOMAP_F_NEW) ||
> -		    offset >= i_size_read(inode))
> +		    offset >= i_size_read(inode)) {
> +			/*
> +			 * This can happen if truncating the block device races
> +			 * with the check in the caller as i_size updates on
> +			 * block devices aren't synchronized by i_rwsem for
> +			 * block devices.

Why /are/ bdevs special like this (not holding i_rwsem during a
truncate) anyway?  Is it because we require the sysadmin to coordinate
device shrink vs. running programs?

--D

> +			 */
> +			if (S_ISBLK(inode->i_mode))
> +				return -EIO;
>  			set_buffer_new(bh);
> +		}
>  		bh->b_blocknr = (iomap->addr + offset - iomap->offset) >>
>  				inode->i_blkbits;
>  		set_buffer_mapped(bh);
> -- 
> 2.39.2
>
Christoph Hellwig Sept. 25, 2023, 3:18 p.m. UTC | #2
On Mon, Sep 25, 2023 at 08:09:02AM -0700, Darrick J. Wong wrote:
> > +			/*
> > +			 * This can happen if truncating the block device races
> > +			 * with the check in the caller as i_size updates on
> > +			 * block devices aren't synchronized by i_rwsem for
> > +			 * block devices.
> 
> Why /are/ bdevs special like this (not holding i_rwsem during a
> truncate) anyway?  Is it because we require the sysadmin to coordinate
> device shrink vs. running programs?

It's not just truncate, they also don't hold a lock on write.

I think the reason is that there is no such things as the block allocator
and block truncation that happens for block devices, they historically
had a fixed size, and at some point we allowed to change that size
by various crude means that are only slowly becoming more standardized
and formal.  Real block device size changes are about 100% growing of
the device, as that is an actually useful feature.  Shrinks OTOH are
usuall a "cute" hack: block drivers set the size to 0 stop I/O when they
are shut down.  I've been wanting to replace that with an actual check
in the bdev fd I/O path for a while, but that would also mean the
shrinking case would still be around, just exercised a lot less.
Darrick J. Wong Sept. 25, 2023, 3:53 p.m. UTC | #3
On Mon, Sep 25, 2023 at 05:18:16PM +0200, Christoph Hellwig wrote:
> On Mon, Sep 25, 2023 at 08:09:02AM -0700, Darrick J. Wong wrote:
> > > +			/*
> > > +			 * This can happen if truncating the block device races
> > > +			 * with the check in the caller as i_size updates on
> > > +			 * block devices aren't synchronized by i_rwsem for
> > > +			 * block devices.
> > 
> > Why /are/ bdevs special like this (not holding i_rwsem during a
> > truncate) anyway?  Is it because we require the sysadmin to coordinate
> > device shrink vs. running programs?
> 
> It's not just truncate, they also don't hold a lock on write.

Oh!  So they don't.  Heh.

> I think the reason is that there is no such things as the block allocator
> and block truncation that happens for block devices, they historically
> had a fixed size, and at some point we allowed to change that size
> by various crude means that are only slowly becoming more standardized
> and formal.  Real block device size changes are about 100% growing of
> the device, as that is an actually useful feature.  Shrinks OTOH are
> usuall a "cute" hack: block drivers set the size to 0 stop I/O when they
> are shut down.  I've been wanting to replace that with an actual check
> in the bdev fd I/O path for a while, but that would also mean the
> shrinking case would still be around, just exercised a lot less.

You call bdev shrink a cute hack, cloud tenants call it a cost-reducing
activity, and cloud vendors call it a revenue opportunity because
shrinking filesystems is un***** expensive in terms of CPU time and IO
usage. ;)

Anyway, I'm not going to argue with longstanding blockdev precedent.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D
diff mbox series

Patch

diff --git a/fs/buffer.c b/fs/buffer.c
index a6785cd07081cb..12e9a71c693d74 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2058,8 +2058,17 @@  iomap_to_bh(struct inode *inode, sector_t block, struct buffer_head *bh,
 		fallthrough;
 	case IOMAP_MAPPED:
 		if ((iomap->flags & IOMAP_F_NEW) ||
-		    offset >= i_size_read(inode))
+		    offset >= i_size_read(inode)) {
+			/*
+			 * This can happen if truncating the block device races
+			 * with the check in the caller as i_size updates on
+			 * block devices aren't synchronized by i_rwsem for
+			 * block devices.
+			 */
+			if (S_ISBLK(inode->i_mode))
+				return -EIO;
 			set_buffer_new(bh);
+		}
 		bh->b_blocknr = (iomap->addr + offset - iomap->offset) >>
 				inode->i_blkbits;
 		set_buffer_mapped(bh);