[V2] block: fail unaligned bio from submit_bio_noacct()

Message ID 20240324133702.1328237-1-ming.lei@redhat.com (mailing list archive)
State New, archived

Commit Message

Ming Lei March 24, 2024, 1:37 p.m. UTC
For any FS bio, its start sector and size must be aligned to the
queue's logical block size from the beginning, because the bio split
code can't turn an unaligned bio into an aligned one.

This rule is obvious, but some users may still send unaligned bios to
the block layer. dm-integrity has been observed doing so, which causes
a double free of the driver's DMA meta buffer.

So fail unaligned bios fast in submit_bio_noacct() to avoid further
trouble.

Meanwhile, remove this kind of check from the dio and discard code paths.

Cc: Keith Busch <kbusch@kernel.org>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
V2:
	- remove the check in dio and discard code path
	- check .bi_sector with (logical_block_size >> 9) - 1

 block/blk-core.c | 16 ++++++++++++++++
 block/blk-lib.c  | 17 -----------------
 block/fops.c     |  3 +--
 3 files changed, 17 insertions(+), 19 deletions(-)
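
The alignment rule described in the commit message can be tried outside the
kernel. The following is a minimal user-space sketch, not the patch itself:
`struct bio_model` and `bio_model_aligned()` are hypothetical names invented
for illustration, and the logical block size is passed in directly instead of
being read from a queue or block device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors, as in the kernel */

/* Hypothetical user-space stand-in for the bio fields the check reads. */
struct bio_model {
	uint64_t sector;	/* start, in 512-byte sectors (like bi_sector) */
	uint32_t size;		/* length, in bytes (like bi_size) */
};

/* True when both start and length are multiples of the logical block size. */
static bool bio_model_aligned(const struct bio_model *bio, uint32_t lbs)
{
	/* The mask trick only works for power-of-two block sizes. */
	assert(lbs >= 512 && (lbs & (lbs - 1)) == 0);

	if (bio->size & (lbs - 1))
		return false;
	if (bio->sector & ((lbs >> SECTOR_SHIFT) - 1))
		return false;
	return true;
}
```

With a 4096-byte logical block size the sector mask is 7, so a bio starting at
sector 8 passes and one starting at sector 9 does not. With a 512-byte block
size the sector mask is 0 and only the length is constrained.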

Comments

Mike Snitzer March 24, 2024, 9:48 p.m. UTC | #1
On Sun, Mar 24 2024 at  9:37P -0400,
Ming Lei <ming.lei@redhat.com> wrote:

> For any FS bio, its start sector and size must be aligned to the
> queue's logical block size from the beginning, because the bio split
> code can't turn an unaligned bio into an aligned one.
> 
> This rule is obvious, but some users may still send unaligned bios to
> the block layer. dm-integrity has been observed doing so, which causes
> a double free of the driver's DMA meta buffer.
> 
> So fail unaligned bios fast in submit_bio_noacct() to avoid further
> trouble.
> 
> Meanwhile, remove this kind of check from the dio and discard code paths.
> 
> Cc: Keith Busch <kbusch@kernel.org>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Mikulas Patocka <mpatocka@redhat.com>
> Cc: Mike Snitzer <snitzer@kernel.org>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> V2:
> 	- remove the check in dio and discard code path
> 	- check .bi_sector with (logical_block_size >> 9) - 1
> 
>  block/blk-core.c | 16 ++++++++++++++++
>  block/blk-lib.c  | 17 -----------------
>  block/fops.c     |  3 +--
>  3 files changed, 17 insertions(+), 19 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index a16b5abdbbf5..2d86922f95e3 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -729,6 +729,19 @@ void submit_bio_noacct_nocheck(struct bio *bio)
>  		__submit_bio_noacct(bio);
>  }
>  
> +static bool bio_check_alignment(struct bio *bio, struct request_queue *q)
> +{
> +	unsigned int bs = q->limits.logical_block_size;
> +
> +	if (bio->bi_iter.bi_size & (bs - 1))
> +		return false;
> +
> +	if (bio->bi_iter.bi_sector & ((bs >> SECTOR_SHIFT) - 1))
> +		return false;
> +
> +	return true;
> +}
> +

You missed Christoph's reply to v1 where he offered:
"This should just use bdev_logical_block_size() on bio->bi_bdev."

Otherwise, looks good.

Mike
Christoph Hellwig March 24, 2024, 11:25 p.m. UTC | #2
On Sun, Mar 24, 2024 at 09:37:02PM +0800, Ming Lei wrote:
> +static bool bio_check_alignment(struct bio *bio, struct request_queue *q)
> +{
> +	unsigned int bs = q->limits.logical_block_size;
> +
> +	if (bio->bi_iter.bi_size & (bs - 1))
> +		return false;
> +
> +	if (bio->bi_iter.bi_sector & ((bs >> SECTOR_SHIFT) - 1))
> +		return false;
> +
> +	return true;
> +}


This should still use bdev_logical_block_size.  And maybe it's just me,
but I think dropping the blank lines after the false returns would
actually make it more readable.

> diff --git a/block/fops.c b/block/fops.c
> index 679d9b752fe8..75595c728190 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -37,8 +37,7 @@ static blk_opf_t dio_bio_write_op(struct kiocb *iocb)
>  static bool blkdev_dio_unaligned(struct block_device *bdev, loff_t pos,
>  			      struct iov_iter *iter)
>  {
> -	return pos & (bdev_logical_block_size(bdev) - 1) ||
> -		!bdev_iter_is_aligned(bdev, iter);
> +	return !bdev_iter_is_aligned(bdev, iter);

If you drop this:

 - we now actually go all the way down to building and submitting a
   bio for a trivial bounds check.
 - you get a trivial-to-trigger WARN_ON.

I'd strongly advise against dropping this check.
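
The first point can be illustrated with a small user-space model. This is a
sketch under stated assumptions, not the code in block/fops.c:
`dio_model_submit()` and its `bios_built` counter are hypothetical, and only
the file-offset part of the check is modelled (the iov_iter alignment test is
left out).

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical model of a direct-I/O entry point; not the kernel's code. */
static int dio_model_submit(int64_t pos, uint32_t lbs, unsigned int *bios_built)
{
	/* The mask trick only works for power-of-two block sizes. */
	assert(lbs >= 512 && (lbs & (lbs - 1)) == 0);

	/* Early check on the file offset: cheap, and nothing is allocated yet. */
	if (pos & (int64_t)(lbs - 1))
		return -EINVAL;

	/* Only aligned requests reach the point where a bio would be built. */
	(*bios_built)++;
	return 0;
}
```

In this model an offset of 512 on a 4096-byte device returns -EINVAL and leaves
the counter untouched, which is the behaviour the early check preserves.
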
Ming Lei March 25, 2024, 3:03 a.m. UTC | #3
On Sun, Mar 24, 2024 at 04:25:04PM -0700, Christoph Hellwig wrote:
> On Sun, Mar 24, 2024 at 09:37:02PM +0800, Ming Lei wrote:
> > +static bool bio_check_alignment(struct bio *bio, struct request_queue *q)
> > +{
> > +	unsigned int bs = q->limits.logical_block_size;
> > +
> > +	if (bio->bi_iter.bi_size & (bs - 1))
> > +		return false;
> > +
> > +	if (bio->bi_iter.bi_sector & ((bs >> SECTOR_SHIFT) - 1))
> > +		return false;
> > +
> > +	return true;
> > +}
> 
> 
> This should still use bdev_logical_block_size.  And maybe it's just me,
> but I think dropping the blank lines after the false returns would
> actually make it more readable.

OK, will remove the blank line.

> 
> > diff --git a/block/fops.c b/block/fops.c
> > index 679d9b752fe8..75595c728190 100644
> > --- a/block/fops.c
> > +++ b/block/fops.c
> > @@ -37,8 +37,7 @@ static blk_opf_t dio_bio_write_op(struct kiocb *iocb)
> >  static bool blkdev_dio_unaligned(struct block_device *bdev, loff_t pos,
> >  			      struct iov_iter *iter)
> >  {
> > -	return pos & (bdev_logical_block_size(bdev) - 1) ||
> > -		!bdev_iter_is_aligned(bdev, iter);
> > +	return !bdev_iter_is_aligned(bdev, iter);
> 
> If you drop this:
> 
>  - we now actually go all the way down to building and submitting a
>    bio for a trivial bounds check.
>  - you get a trivial-to-trigger WARN_ON.
> 
> I'd strongly advise against dropping this check.

OK.

Also, q->limits.logical_block_size is the only limit fetched in the
small-BS I/O fast path, so I think log(lbs) could be cached in
request_queue to avoid the extra fetch of q->limits. That should be
easier to do with your recent atomic queue limit update changes.


Thanks, 
Ming
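
The caching idea can be sketched in user space as follows. Everything here is
an assumption made for illustration: `struct queue_model`, its `lbs_shift`
field and both helpers are hypothetical, and no such field is claimed to exist
in request_queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors, as in the kernel */

/* Hypothetical queue model caching log2 of the logical block size. */
struct queue_model {
	uint8_t lbs_shift;	/* e.g. 9 for 512 bytes, 12 for 4096 bytes */
};

/* Compute the shift once, when the limit is set. */
static void queue_model_set_lbs(struct queue_model *q, uint32_t lbs)
{
	/* Only power-of-two sizes of at least one sector are accepted. */
	assert(lbs >= 512 && (lbs & (lbs - 1)) == 0);

	q->lbs_shift = 0;
	while ((UINT32_C(1) << q->lbs_shift) < lbs)
		q->lbs_shift++;
}

/* Fast path: derive both masks from the cached shift. */
static bool queue_model_aligned(const struct queue_model *q,
				uint64_t sector, uint32_t size)
{
	uint32_t byte_mask = (UINT32_C(1) << q->lbs_shift) - 1;
	uint64_t sector_mask =
		(UINT64_C(1) << (q->lbs_shift - SECTOR_SHIFT)) - 1;

	return !(size & byte_mask) && !(sector & sector_mask);
}
```

The fast path then reads a single byte from the queue model and rebuilds the
byte and sector masks with shifts, instead of loading the block size from a
separate limits structure.
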
Christoph Hellwig March 25, 2024, 3:12 a.m. UTC | #4
On Mon, Mar 25, 2024 at 11:03:25AM +0800, Ming Lei wrote:
> Also, q->limits.logical_block_size is the only limit fetched in the
> small-BS I/O fast path, so I think log(lbs) could be cached in
> request_queue to avoid the extra fetch of q->limits. That should be
> easier to do with your recent atomic queue limit update changes.

So.  One thing I've been thinking of for a while (and which Bart also
mentioned) is that queue_limits currently is a bit of a mess between
the actual queue limits, and the gendisk configuration.   The logical
block size is firmly in the latter, and we should probably move it
to the gendisk eventually.  Depending on how converting the SCSI ULDs
to the atomic queue limits API goes that might happen rather sooner
than later.
Ming Lei March 25, 2024, 3:50 a.m. UTC | #5
On Sun, Mar 24, 2024 at 08:12:01PM -0700, Christoph Hellwig wrote:
> On Mon, Mar 25, 2024 at 11:03:25AM +0800, Ming Lei wrote:
> > Also, q->limits.logical_block_size is the only limit fetched in the
> > small-BS I/O fast path, so I think log(lbs) could be cached in
> > request_queue to avoid the extra fetch of q->limits. That should be
> > easier to do with your recent atomic queue limit update changes.
> 
> So.  One thing I've been thinking of for a while (and which Bart also
> mentioned) is that queue_limits currently is a bit of a mess between
> the actual queue limits, and the gendisk configuration.   The logical
> block size is firmly in the latter, and we should probably move it

lbs and pbs belong to the disk, but some others may not be so obvious.

Strictly speaking, elevator/blkcg belong to the disk too, but they still
live in request_queue. :-)

Thanks, 
Ming
Keith Busch March 25, 2024, 6:53 p.m. UTC | #6
On Sun, Mar 24, 2024 at 09:37:02PM +0800, Ming Lei wrote:
> @@ -780,6 +793,9 @@ void submit_bio_noacct(struct bio *bio)
>  		}
>  	}
>  
> +	if (WARN_ON_ONCE(!bio_check_alignment(bio, q)))
> +		goto end_io;
> +

The "status" at this point is "BLK_STS_IOERR", so user space would see
EIO, but the existing checks return EINVAL. I'm not sure if that's "ok",
but assuming it is, I think the user visible different behavior should
be mentioned in the changelog.

Alternatively, maybe we want an asynchronous way to return EINVAL for
these conditions. It's more informative to a user where the problem is
than a generic EIO. There is no BLK_STS_ value that translates to
EINVAL, though, so maybe we need a new block status code like
BLK_STS_INVALID_REQUEST.
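
A rough user-space sketch of what such a mapping could look like is shown
below. The enum and `sts_model_to_errno()` are hypothetical; they only echo the
names used in this discussion and do not reproduce the kernel's blk_status_t
table.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical status codes; only the names echo the discussion above. */
enum sts_model {
	STS_MODEL_OK,
	STS_MODEL_IOERR,		/* generic failure, like BLK_STS_IOERR */
	STS_MODEL_INVALID_REQUEST,	/* proposed: malformed request */
};

/* Success must stay zero so callers can test a status with !sts. */
static_assert(STS_MODEL_OK == 0, "success status must be zero");

/* Translate a completion status into the errno user space would see. */
static int sts_model_to_errno(enum sts_model sts)
{
	switch (sts) {
	case STS_MODEL_OK:
		return 0;
	case STS_MODEL_INVALID_REQUEST:
		return -EINVAL;
	case STS_MODEL_IOERR:
	default:
		return -EIO;
	}
}
```

With a dedicated status value, a bio rejected for bad alignment could complete
with -EINVAL, while any status without its own mapping would still fall back
to -EIO.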

> @@ -53,10 +52,6 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
>  		return -EOPNOTSUPP;
>  	}
>  
> -	bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;
> -	if ((sector | nr_sects) & bs_mask)
> -		return -EINVAL;
> -
>  	if (!nr_sects)
>  		return -EINVAL;
Ming Lei March 26, 2024, 1:19 a.m. UTC | #7
On Mon, Mar 25, 2024 at 11:53:45AM -0700, Keith Busch wrote:
> On Sun, Mar 24, 2024 at 09:37:02PM +0800, Ming Lei wrote:
> > @@ -780,6 +793,9 @@ void submit_bio_noacct(struct bio *bio)
> >  		}
> >  	}
> >  
> > +	if (WARN_ON_ONCE(!bio_check_alignment(bio, q)))
> > +		goto end_io;
> > +
> 
> The "status" at this point is "BLK_STS_IOERR", so user space would see
> EIO, but the existing checks return EINVAL. I'm not sure if that's "ok",
> but assuming it is, I think the user visible different behavior should
> be mentioned in the changelog.
> 
> Alternatively, maybe we want an asynchronous way to return EINVAL for

It has to be returned asynchronously, because submit_bio*() returns
void.

> these conditions. It's more informative to a user where the problem is
> than a generic EIO. There is no BLK_STS_ value that translates to
> EINVAL, though, so maybe we need a new block status code like
> BLK_STS_INVALID_REQUEST.

Yeah, I agree, but that is an existing issue. The 'status' should have
been initialized to 'BLK_STS_INVALID_REQUEST' or 'BLK_STS_INVALID' in
submit_bio_noacct(), and every check failure can be treated as -EINVAL.


Thanks, 
Ming
Patch

diff --git a/block/blk-core.c b/block/blk-core.c
index a16b5abdbbf5..2d86922f95e3 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -729,6 +729,19 @@  void submit_bio_noacct_nocheck(struct bio *bio)
 		__submit_bio_noacct(bio);
 }
 
+static bool bio_check_alignment(struct bio *bio, struct request_queue *q)
+{
+	unsigned int bs = q->limits.logical_block_size;
+
+	if (bio->bi_iter.bi_size & (bs - 1))
+		return false;
+
+	if (bio->bi_iter.bi_sector & ((bs >> SECTOR_SHIFT) - 1))
+		return false;
+
+	return true;
+}
+
 /**
  * submit_bio_noacct - re-submit a bio to the block device layer for I/O
  * @bio:  The bio describing the location in memory and on the device.
@@ -780,6 +793,9 @@  void submit_bio_noacct(struct bio *bio)
 		}
 	}
 
+	if (WARN_ON_ONCE(!bio_check_alignment(bio, q)))
+		goto end_io;
+
 	if (!test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
 		bio_clear_polled(bio);
 
diff --git a/block/blk-lib.c b/block/blk-lib.c
index a6954eafb8c8..ea1a7d16ffdf 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -39,7 +39,6 @@  int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, struct bio **biop)
 {
 	struct bio *bio = *biop;
-	sector_t bs_mask;
 
 	if (bdev_read_only(bdev))
 		return -EPERM;
@@ -53,10 +52,6 @@  int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
 		return -EOPNOTSUPP;
 	}
 
-	bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;
-	if ((sector | nr_sects) & bs_mask)
-		return -EINVAL;
-
 	if (!nr_sects)
 		return -EINVAL;
 
@@ -217,11 +212,6 @@  int __blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 		unsigned flags)
 {
 	int ret;
-	sector_t bs_mask;
-
-	bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;
-	if ((sector | nr_sects) & bs_mask)
-		return -EINVAL;
 
 	ret = __blkdev_issue_write_zeroes(bdev, sector, nr_sects, gfp_mask,
 			biop, flags);
@@ -250,15 +240,10 @@  int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
 		sector_t nr_sects, gfp_t gfp_mask, unsigned flags)
 {
 	int ret = 0;
-	sector_t bs_mask;
 	struct bio *bio;
 	struct blk_plug plug;
 	bool try_write_zeroes = !!bdev_write_zeroes_sectors(bdev);
 
-	bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;
-	if ((sector | nr_sects) & bs_mask)
-		return -EINVAL;
-
 retry:
 	bio = NULL;
 	blk_start_plug(&plug);
@@ -313,8 +298,6 @@  int blkdev_issue_secure_erase(struct block_device *bdev, sector_t sector,
 
 	if (max_sectors == 0)
 		return -EOPNOTSUPP;
-	if ((sector | nr_sects) & bs_mask)
-		return -EINVAL;
 	if (bdev_read_only(bdev))
 		return -EPERM;
 
diff --git a/block/fops.c b/block/fops.c
index 679d9b752fe8..75595c728190 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -37,8 +37,7 @@  static blk_opf_t dio_bio_write_op(struct kiocb *iocb)
 static bool blkdev_dio_unaligned(struct block_device *bdev, loff_t pos,
 			      struct iov_iter *iter)
 {
-	return pos & (bdev_logical_block_size(bdev) - 1) ||
-		!bdev_iter_is_aligned(bdev, iter);
+	return !bdev_iter_is_aligned(bdev, iter);
 }
 
 #define DIO_INLINE_BIO_VECS 4