block: support zone append bvecs

Message ID: 739a96e185f008c238fcf06cb22068016149ad4a.1616497531.git.johannes.thumshirn@wdc.com

Commit Message

Johannes Thumshirn March 23, 2021, 11:06 a.m. UTC
Some time ago Christoph reported that, using zoned btrfs, we'll likely
trigger the WARN_ON_ONCE() in bio_iov_iter_get_pages() which checks that
we're not submitting a bvec with REQ_OP_ZONE_APPEND, but I couldn't
reproduce it back then.

Now Naohiro was able to trigger the bug as well with xfstests generic/095
on a zoned btrfs.

There is nothing that prevents bvec submissions via REQ_OP_ZONE_APPEND as
long as the hardware's zone append limit is respected.

Reported-by: Naohiro Aota <naohiro.aota@wdc.com>
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
---
 block/bio.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

Comments

Christoph Hellwig March 23, 2021, 12:30 p.m. UTC | #1
On Tue, Mar 23, 2021 at 08:06:56PM +0900, Johannes Thumshirn wrote:
> diff --git a/block/bio.c b/block/bio.c
> index 26b7f721cda8..215fe24a01ee 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1094,8 +1094,14 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  	int ret = 0;
>  
>  	if (iov_iter_is_bvec(iter)) {
> -		if (WARN_ON_ONCE(bio_op(bio) == REQ_OP_ZONE_APPEND))
> -			return -EINVAL;
> +		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
> +			struct request_queue *q = bio->bi_bdev->bd_disk->queue;
> +			unsigned int max_append =
> +				queue_max_zone_append_sectors(q) << 9;
> +
> +			if (WARN_ON_ONCE(iter->count > max_append))
> +				return -EINVAL;
> +		}

That is not correct.  bio_iov_iter_get_pages just fills the bio as far
as we can, and then returns 0 for the next call to continue.  Basically
what you want here is a partial version of bio_iov_bvec_set.
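
For reference, the calling convention described above looks roughly like
this (a minimal sketch, loosely modeled on a direct I/O style submission
loop; bdev, nr_vecs, iter and ret are assumed to be set up by the caller
and are illustrative, not an actual call site):

	do {
		struct bio *bio = bio_alloc(GFP_KERNEL, nr_vecs);

		bio_set_dev(bio, bdev);
		bio->bi_opf = REQ_OP_ZONE_APPEND;
		/*
		 * bio_iov_iter_get_pages() fills the bio with as much of
		 * the iter as fits and advances the iter past the consumed
		 * part; a 0 return means "call again with the rest", not
		 * "everything fit".
		 */
		ret = bio_iov_iter_get_pages(bio, iter);
		if (ret)
			break;
		submit_bio(bio);
	} while (iov_iter_count(iter));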
Johannes Thumshirn March 24, 2021, 7:07 a.m. UTC | #2
On 23/03/2021 13:30, Christoph Hellwig wrote:
> On Tue, Mar 23, 2021 at 08:06:56PM +0900, Johannes Thumshirn wrote:
>> diff --git a/block/bio.c b/block/bio.c
>> index 26b7f721cda8..215fe24a01ee 100644
>> --- a/block/bio.c
>> +++ b/block/bio.c
>> @@ -1094,8 +1094,14 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>>  	int ret = 0;
>>  
>>  	if (iov_iter_is_bvec(iter)) {
>> -		if (WARN_ON_ONCE(bio_op(bio) == REQ_OP_ZONE_APPEND))
>> -			return -EINVAL;
>> +		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
>> +			struct request_queue *q = bio->bi_bdev->bd_disk->queue;
>> +			unsigned int max_append =
>> +				queue_max_zone_append_sectors(q) << 9;
>> +
>> +			if (WARN_ON_ONCE(iter->count > max_append))
>> +				return -EINVAL;
>> +		}
> 
> That is not correct.  bio_iov_iter_get_pages just fills the bio as far
> as we can, and then returns 0 for the next call to continue.  Basically
> what you want here is a partial version of bio_iov_bvec_set.
> 

Isn't that what I did? The above is checking if we have REQ_OP_ZONE_APPEND and
then returns -EINVAL if iter->count is bigger than queue_max_zone_append_sectors().
If the check doesn't fail, it's going to call bio_iov_bvec_set().
Christoph Hellwig March 24, 2021, 7:11 a.m. UTC | #3
On Wed, Mar 24, 2021 at 07:07:27AM +0000, Johannes Thumshirn wrote:
> >>  	if (iov_iter_is_bvec(iter)) {
> >> -		if (WARN_ON_ONCE(bio_op(bio) == REQ_OP_ZONE_APPEND))
> >> -			return -EINVAL;
> >> +		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
> >> +			struct request_queue *q = bio->bi_bdev->bd_disk->queue;
> >> +			unsigned int max_append =
> >> +				queue_max_zone_append_sectors(q) << 9;
> >> +
> >> +			if (WARN_ON_ONCE(iter->count > max_append))
> >> +				return -EINVAL;
> >> +		}
> > 
> > That is not correct.  bio_iov_iter_get_pages just fills the bio as far
> > as we can, and then returns 0 for the next call to continue.  Basically
> > what you want here is a partial version of bio_iov_bvec_set.
> > 
> 
> Isn't that what I did? The above is checking if we have REQ_OP_ZONE_APPEND and
> then returns -EINVAL if iter->count is bigger than queue_max_zone_append_sectors().
> If the check doesn't fail, it's going to call bio_iov_bvec_set().

And that is the problem.  It should not fail, the payload is decoupled
from the max_append size.

Doing the proper thing is not too hard as described above - make sure
the bi_iter points to only the chunk of iter passed in that fits, and
only advance the passed in iter by that amount.
Johannes Thumshirn March 24, 2021, 7:12 a.m. UTC | #4
On 24/03/2021 08:11, Christoph Hellwig wrote:
> On Wed, Mar 24, 2021 at 07:07:27AM +0000, Johannes Thumshirn wrote:
>>>>  	if (iov_iter_is_bvec(iter)) {
>>>> -		if (WARN_ON_ONCE(bio_op(bio) == REQ_OP_ZONE_APPEND))
>>>> -			return -EINVAL;
>>>> +		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
>>>> +			struct request_queue *q = bio->bi_bdev->bd_disk->queue;
>>>> +			unsigned int max_append =
>>>> +				queue_max_zone_append_sectors(q) << 9;
>>>> +
>>>> +			if (WARN_ON_ONCE(iter->count > max_append))
>>>> +				return -EINVAL;
>>>> +		}
>>>
>>> That is not correct.  bio_iov_iter_get_pages just fills the bio as far
>>> as we can, and then returns 0 for the next call to continue.  Basically
>>> what you want here is a partial version of bio_iov_bvec_set.
>>>
>>
>> Isn't that what I did? The above is checking if we have REQ_OP_ZONE_APPEND and
>> then returns -EINVAL if iter->count is bigger than queue_max_zone_append_sectors().
>> If the check doesn't fail, it's going to call bio_iov_bvec_set().
> 
> And that is the problem.  It should not fail, the payload is decoupled
> from the max_append size.
> 
> Doing the proper thing is not too hard as described above - make sure
> the bi_iter points to only the chunk of iter passed in that fits, and
> only advance the passed in iter by that amount.
> 

Ah got it now,
thanks
Johannes Thumshirn March 24, 2021, 10:09 a.m. UTC | #5
On 24/03/2021 08:13, Johannes Thumshirn wrote:
> On 24/03/2021 08:11, Christoph Hellwig wrote:
>> On Wed, Mar 24, 2021 at 07:07:27AM +0000, Johannes Thumshirn wrote:
>>>>>  	if (iov_iter_is_bvec(iter)) {
>>>>> -		if (WARN_ON_ONCE(bio_op(bio) == REQ_OP_ZONE_APPEND))
>>>>> -			return -EINVAL;
>>>>> +		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
>>>>> +			struct request_queue *q = bio->bi_bdev->bd_disk->queue;
>>>>> +			unsigned int max_append =
>>>>> +				queue_max_zone_append_sectors(q) << 9;
>>>>> +
>>>>> +			if (WARN_ON_ONCE(iter->count > max_append))
>>>>> +				return -EINVAL;
>>>>> +		}
>>>>
>>>> That is not correct.  bio_iov_iter_get_pages just fills the bio as far
>>>> as we can, and then returns 0 for the next call to continue.  Basically
>>>> what you want here is a partial version of bio_iov_bvec_set.
>>>>
>>>
>>> Isn't that what I did? The above is checking if we have REQ_OP_ZONE_APPEND and
>>> then returns -EINVAL if iter->count is bigger than queue_max_zone_append_sectors().
>>> If the check doesn't fail, it's going to call bio_iov_bvec_set().
>>
>> And that is the problem.  It should not fail, the payload is decoupled
>> from the max_append size.
>>
>> Doing the proper thing is not too hard as described above - make sure
>> the bi_iter points to only the chunk of iter passed in that fits, and
>> only advance the passed in iter by that amount.
>>
> 
> Ah got it now,
> thanks
> 

Stupid question, but wouldn't it be sufficient if I did (this can still be
simplified):

diff --git a/block/bio.c b/block/bio.c
index 26b7f721cda8..9c529b2db8fa 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -964,6 +964,16 @@ static int bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)
        return 0;
 }
 
+static int bio_iov_append_bvec_set(struct bio *bio, struct iov_iter *iter)
+{
+       struct request_queue *q = bio->bi_bdev->bd_disk->queue;
+       unsigned int max_append = queue_max_zone_append_sectors(q) << 9;
+
+       iov_iter_truncate(iter, max_append);
+
+       return bio_iov_bvec_set(bio, iter);
+}
+
 #define PAGE_PTRS_PER_BVEC     (sizeof(struct bio_vec) / sizeof(struct page *))
 
 /**
@@ -1094,8 +1104,8 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
        int ret = 0;
 
        if (iov_iter_is_bvec(iter)) {
-               if (WARN_ON_ONCE(bio_op(bio) == REQ_OP_ZONE_APPEND))
-                       return -EINVAL;
+               if (bio_op(bio) == REQ_OP_ZONE_APPEND)
+                       return bio_iov_append_bvec_set(bio, iter);
                return bio_iov_bvec_set(bio, iter);
        }

The above had a successful xfstests run and several fio --ioengine=splice runs.
Damien Le Moal March 24, 2021, 10:22 a.m. UTC | #6
On 2021/03/24 19:09, Johannes Thumshirn wrote:
> On 24/03/2021 08:13, Johannes Thumshirn wrote:
>> On 24/03/2021 08:11, Christoph Hellwig wrote:
>>> On Wed, Mar 24, 2021 at 07:07:27AM +0000, Johannes Thumshirn wrote:
>>>>>>  	if (iov_iter_is_bvec(iter)) {
>>>>>> -		if (WARN_ON_ONCE(bio_op(bio) == REQ_OP_ZONE_APPEND))
>>>>>> -			return -EINVAL;
>>>>>> +		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
>>>>>> +			struct request_queue *q = bio->bi_bdev->bd_disk->queue;
>>>>>> +			unsigned int max_append =
>>>>>> +				queue_max_zone_append_sectors(q) << 9;
>>>>>> +
>>>>>> +			if (WARN_ON_ONCE(iter->count > max_append))
>>>>>> +				return -EINVAL;
>>>>>> +		}
>>>>>
>>>>> That is not correct.  bio_iov_iter_get_pages just fills the bio as far
>>>>> as we can, and then returns 0 for the next call to continue.  Basically
>>>>> what you want here is a partial version of bio_iov_bvec_set.
>>>>>
>>>>
>>>> Isn't that what I did? The above is checking if we have REQ_OP_ZONE_APPEND and
>>>> then returns -EINVAL if iter->count is bigger than queue_max_zone_append_sectors().
>>>> If the check doesn't fail, it's going to call bio_iov_bvec_set().
>>>
>>> And that is the problem.  It should not fail, the payload is decoupled
>>> from the max_append size.
>>>
>>> Doing the proper thing is not too hard as described above - make sure
>>> the bi_iter points to only the chunk of iter passed in that fits, and
>>> only advance the passed in iter by that amount.
>>>
>>
>> Ah got it now,
>> thanks
>>
> 
> Stupid question, but wouldn't it be sufficient if I did (this can still be
> simplified):
> 
> diff --git a/block/bio.c b/block/bio.c
> index 26b7f721cda8..9c529b2db8fa 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -964,6 +964,16 @@ static int bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)
>         return 0;
>  }
>  
> +static int bio_iov_append_bvec_set(struct bio *bio, struct iov_iter *iter)
> +{
> +       struct request_queue *q = bio->bi_bdev->bd_disk->queue;
> +       unsigned int max_append = queue_max_zone_append_sectors(q) << 9;
> +
> +       iov_iter_truncate(iter, max_append);

Why truncate? The caller of bio_iov_iter_get_pages() will see a shorter iter
after this returns and will not loop over to send the remaining data.

I think you need a special bio_iov_bvec_set() or you need to patch it so that
the bio size is at most max_append instead of iter->count. Same for the
segment count: the bio must be set with the number of segments covering up to
max_append bytes instead of iter->nr_segs.

Then you can do:

iov_iter_advance(iter, max_append);

before returning, like bio_iov_bvec_set() does. No?

> +
> +       return bio_iov_bvec_set(bio, iter);
> +}
> +
>  #define PAGE_PTRS_PER_BVEC     (sizeof(struct bio_vec) / sizeof(struct page *))
>  
>  /**
> @@ -1094,8 +1104,8 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>         int ret = 0;
>  
>         if (iov_iter_is_bvec(iter)) {
> -               if (WARN_ON_ONCE(bio_op(bio) == REQ_OP_ZONE_APPEND))
> -                       return -EINVAL;
> +               if (bio_op(bio) == REQ_OP_ZONE_APPEND)
> +                       return bio_iov_append_bvec_set(bio, iter);
>                 return bio_iov_bvec_set(bio, iter);
>         }
> 
> The above had a successful xfstests run and several fio --ioengine=splice runs.
>
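
Concretely, the special bio_iov_bvec_set() suggested above might look like
the following minimal sketch (the helper name and the min()-based capping
are assumptions for illustration, not a tested implementation; the field
setup mirrors the bio_iov_bvec_set() shown elsewhere in this thread):

static int bio_iov_bvec_set_capped(struct bio *bio, struct iov_iter *iter,
				   size_t max_bytes)
{
	size_t size = min(iov_iter_count(iter), max_bytes);

	WARN_ON_ONCE(bio->bi_max_vecs);

	/*
	 * bi_vcnt is left at nr_segs here; capping it to the vecs that
	 * actually fall within size, as suggested above, would be the
	 * more precise variant.
	 */
	bio->bi_vcnt = iter->nr_segs;
	bio->bi_io_vec = (struct bio_vec *)iter->bvec;
	bio->bi_iter.bi_bvec_done = iter->iov_offset;
	bio->bi_iter.bi_size = size;	/* cap the bio, not the whole iter */
	bio_set_flag(bio, BIO_NO_PAGE_REF);
	bio_set_flag(bio, BIO_CLONED);

	/* only consume the part of the iter this bio covers */
	iov_iter_advance(iter, size);
	return 0;
}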
Christoph Hellwig March 24, 2021, 11:05 a.m. UTC | #7
On Wed, Mar 24, 2021 at 10:09:32AM +0000, Johannes Thumshirn wrote:
> Stupid question, but wouldn't it be sufficient if I did (this can still be
> simplified):

No, this loses data if the iter is bigger than what you truncate it to.
With your last patch you probably just did not test with large enough
iters to get beyond the zone append limit.

> diff --git a/block/bio.c b/block/bio.c
> index 26b7f721cda8..9c529b2db8fa 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -964,6 +964,16 @@ static int bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)
>         return 0;
>  }
>  
> +static int bio_iov_append_bvec_set(struct bio *bio, struct iov_iter *iter)
> +{
> +       struct request_queue *q = bio->bi_bdev->bd_disk->queue;
> +       unsigned int max_append = queue_max_zone_append_sectors(q) << 9;
> +
> +       iov_iter_truncate(iter, max_append);
> +
> +       return bio_iov_bvec_set(bio, iter);

OTOH if you copy the iter by value to a local one first and then make
sure the original iter is advanced, it should work.  We don't really
need the iter advance for the original one, though.  Something like:

diff --git a/block/bio.c b/block/bio.c
index a1c4d2900c7a83..7d9e01580f2ab1 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -949,7 +949,7 @@ void bio_release_pages(struct bio *bio, bool mark_dirty)
 }
 EXPORT_SYMBOL_GPL(bio_release_pages);
 
-static int bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)
+static void __bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)
 {
 	WARN_ON_ONCE(bio->bi_max_vecs);
 
@@ -959,7 +959,11 @@ static int bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)
 	bio->bi_iter.bi_size = iter->count;
 	bio_set_flag(bio, BIO_NO_PAGE_REF);
 	bio_set_flag(bio, BIO_CLONED);
+}
 
+static int bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)
+{
+	__bio_iov_bvec_set(bio, iter);
 	iov_iter_advance(iter, iter->count);
 	return 0;
 }
@@ -1019,6 +1023,17 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	return 0;
 }
 
+static int bio_iov_bvec_set_append(struct bio *bio, struct iov_iter *iter)
+{
+	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
+	struct iov_iter i = *iter;
+
+	iov_iter_truncate(&i, queue_max_zone_append_sectors(q) << 9);
+	__bio_iov_bvec_set(bio, &i);
+	iov_iter_advance(iter, i.count);
+	return 0;
+}
+
 static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
 {
 	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
@@ -1094,8 +1109,8 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	int ret = 0;
 
 	if (iov_iter_is_bvec(iter)) {
-		if (WARN_ON_ONCE(bio_op(bio) == REQ_OP_ZONE_APPEND))
-			return -EINVAL;
+		if (bio_op(bio) == REQ_OP_ZONE_APPEND)
+			return bio_iov_bvec_set_append(bio, iter);
 		return bio_iov_bvec_set(bio, iter);
 	}
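
To make the resulting semantics concrete (the numbers below are an assumed
example: a 127k zone append limit and a 1M bvec iter):

	ret = bio_iov_iter_get_pages(bio, iter);
	/*
	 * bio now covers the first 127k and iter has been advanced by
	 * 127k; the caller submits the bio, loops, and the next call
	 * picks up the remaining 897k.
	 */

The local copy only duplicates the iterator bookkeeping; the bio still
references the caller's bvec array directly (hence BIO_NO_PAGE_REF and
BIO_CLONED), so nothing is copied and, unlike with the plain truncate,
nothing is dropped.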
Johannes Thumshirn March 24, 2021, 12:12 p.m. UTC | #8
On 24/03/2021 12:05, Christoph Hellwig wrote:
> On Wed, Mar 24, 2021 at 10:09:32AM +0000, Johannes Thumshirn wrote:
>> Stupid question, but wouldn't it be sufficient if I did (this can still be
>> simplified):
> 
> No, this loses data if the iter is bigger than what you truncate it to.
> With your last patch you probably just did not test with large enough
> iters to get beyond the zone append limit.

Hmm, I did a fio job with bs=128k (and my null_blk device has 127k
zone_append_max_bytes) and verify=md5.
 
Anyhow, I like your solution more. Fio looks good (bs=1M this time, to be
sure), xfstests still pending.

> 
>> diff --git a/block/bio.c b/block/bio.c
>> index 26b7f721cda8..9c529b2db8fa 100644
>> --- a/block/bio.c
>> +++ b/block/bio.c
>> @@ -964,6 +964,16 @@ static int bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)
>>         return 0;
>>  }
>>  
>> +static int bio_iov_append_bvec_set(struct bio *bio, struct iov_iter *iter)
>> +{
>> +       struct request_queue *q = bio->bi_bdev->bd_disk->queue;
>> +       unsigned int max_append = queue_max_zone_append_sectors(q) << 9;
>> +
>> +       iov_iter_truncate(iter, max_append);
>> +
>> +       return bio_iov_bvec_set(bio, iter);
> 
> OTOH if you copy the iter by value to a local one first and then
> make sure the original iter is advanced it should work.  We don't
> really need the iter advance for the original one, though.  Something like:
> 
> diff --git a/block/bio.c b/block/bio.c
> index a1c4d2900c7a83..7d9e01580f2ab1 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -949,7 +949,7 @@ void bio_release_pages(struct bio *bio, bool mark_dirty)
>  }
>  EXPORT_SYMBOL_GPL(bio_release_pages);
>  
> -static int bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)
> +static void __bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)
>  {
>  	WARN_ON_ONCE(bio->bi_max_vecs);
>  
> @@ -959,7 +959,11 @@ static int bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)
>  	bio->bi_iter.bi_size = iter->count;
>  	bio_set_flag(bio, BIO_NO_PAGE_REF);
>  	bio_set_flag(bio, BIO_CLONED);
> +}
>  
> +static int bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)
> +{
> +	__bio_iov_bvec_set(bio, iter);
>  	iov_iter_advance(iter, iter->count);
>  	return 0;
>  }
> @@ -1019,6 +1023,17 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  	return 0;
>  }
>  
> +static int bio_iov_bvec_set_append(struct bio *bio, struct iov_iter *iter)
> +{
> +	struct request_queue *q = bio->bi_bdev->bd_disk->queue;
> +	struct iov_iter i = *iter;
> +
> +	iov_iter_truncate(&i, queue_max_zone_append_sectors(q) << 9);
> +	__bio_iov_bvec_set(bio, &i);
> +	iov_iter_advance(iter, i.count);
> +	return 0;
> +}
> +
>  static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
>  {
>  	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
> @@ -1094,8 +1109,8 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  	int ret = 0;
>  
>  	if (iov_iter_is_bvec(iter)) {
> -		if (WARN_ON_ONCE(bio_op(bio) == REQ_OP_ZONE_APPEND))
> -			return -EINVAL;
> +		if (bio_op(bio) == REQ_OP_ZONE_APPEND)
> +			return bio_iov_bvec_set_append(bio, iter);
>  		return bio_iov_bvec_set(bio, iter);
>  	}
>  
>

Patch

diff --git a/block/bio.c b/block/bio.c
index 26b7f721cda8..215fe24a01ee 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1094,8 +1094,14 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	int ret = 0;
 
 	if (iov_iter_is_bvec(iter)) {
-		if (WARN_ON_ONCE(bio_op(bio) == REQ_OP_ZONE_APPEND))
-			return -EINVAL;
+		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
+			struct request_queue *q = bio->bi_bdev->bd_disk->queue;
+			unsigned int max_append =
+				queue_max_zone_append_sectors(q) << 9;
+
+			if (WARN_ON_ONCE(iter->count > max_append))
+				return -EINVAL;
+		}
 		return bio_iov_bvec_set(bio, iter);
 	}