[v2] block: don't allow multiple bios for IOCB_NOWAIT issue

Message ID 1ce71005-c81b-374d-5bcf-e3b7e7c48d0d@kernel.dk (mailing list archive)
State New, archived

Commit Message

Jens Axboe Jan. 16, 2023, 9:06 p.m. UTC
If we're doing a large IO request which needs to be split into multiple
bios for issue, then we can run into the same situation as the below
marked commit fixes - parts will complete just fine, one or more parts
will fail to allocate a request. This will result in a partially
completed read or write request, where the caller gets EAGAIN even though
parts of the IO completed just fine.

Do the same for large bios as we do for splits - fail a NOWAIT request
with EAGAIN. This isn't technically fixing an issue in the below marked
patch, but for stable purposes, we should have either none of them or
both.

This depends on: 613b14884b85 ("block: handle bio_split_to_limits() NULL return")

Cc: stable@vger.kernel.org # 5.15+
Fixes: 9cea62b2cbab ("block: don't allow splitting of a REQ_NOWAIT bio")
Link: https://github.com/axboe/liburing/issues/766
Reported-and-tested-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

---

Since v1: catch this at submit time instead, since we can have various
valid cases where the number of single page segments will not take a
bio segment (page merging, huge pages).
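
The submit-time rule described above can be modelled in a few lines of standalone C. This is only an illustrative sketch, not the kernel code: `MODEL_BIO_MAX_BYTES` and `model_submit_check()` are hypothetical names, and the fixed per-bio capacity is an assumption made for the example (as the v1 note says, the real amount a bio can map depends on page merging and huge pages).

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical capacity of a single bio, assumed fixed for this model. */
#define MODEL_BIO_MAX_BYTES (256u * 4096u)

/*
 * Model of the check done after the first bio has been filled:
 * returns 0 if the request may be issued, or -EAGAIN if a nonblocking
 * request still has data left and would therefore need another bio.
 */
static int model_submit_check(size_t total_bytes, bool nowait)
{
	size_t mapped = total_bytes < MODEL_BIO_MAX_BYTES ?
			total_bytes : MODEL_BIO_MAX_BYTES;
	/* Plays the role of iov_iter_count(iter) in the patch. */
	size_t remaining = total_bytes - mapped;

	if (nowait && remaining)
		return -EAGAIN;
	return 0;
}
```

In this model a blocking request of any size passes, while a nonblocking one passes only if it fits into the single modelled bio.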

Comments

Damien Le Moal Jan. 16, 2023, 11:20 p.m. UTC | #1
On 1/17/23 06:06, Jens Axboe wrote:
> If we're doing a large IO request which needs to be split into multiple
> bios for issue, then we can run into the same situation as the below
> marked commit fixes - parts will complete just fine, one or more parts
> will fail to allocate a request. This will result in a partially
> completed read or write request, where the caller gets EAGAIN even though
> parts of the IO completed just fine.
> 
> Do the same for large bios as we do for splits - fail a NOWAIT request
> with EAGAIN. This isn't technically fixing an issue in the below marked
> patch, but for stable purposes, we should have either none of them or
> both.
> 
> This depends on: 613b14884b85 ("block: handle bio_split_to_limits() NULL return")
> 
> Cc: stable@vger.kernel.org # 5.15+
> Fixes: 9cea62b2cbab ("block: don't allow splitting of a REQ_NOWAIT bio")
> Link: https://github.com/axboe/liburing/issues/766
> Reported-and-tested-by: Michael Kelley <mikelley@microsoft.com>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> 
> ---
> 
> Since v1: catch this at submit time instead, since we can have various
> valid cases where the number of single page segments will not take a
> bio segment (page merging, huge pages).
> 
> diff --git a/block/fops.c b/block/fops.c
> index 50d245e8c913..d2e6be4e3d1c 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -221,6 +221,24 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
>  			bio_endio(bio);
>  			break;
>  		}
> +		if (iocb->ki_flags & IOCB_NOWAIT) {
> +			/*
> +			 * This is nonblocking IO, and we need to allocate
> +			 * another bio if we have data left to map. As we
> +			 * cannot guarantee that one of the sub bios will not
> +			 * fail getting issued FOR NOWAIT and as error results
> +			 * are coalesced across all of them, be safe and ask for
> +			 * a retry of this from blocking context.
> +			 */
> +			if (unlikely(iov_iter_count(iter))) {
> +				bio_release_pages(bio, false);
> +				bio_clear_flag(bio, BIO_REFFED);
> +				bio_put(bio);
> +				blk_finish_plug(&plug);
> +				return -EAGAIN;

Doesn't this mean that for a really very large IO request that has 100%
chance of being split, the user will always get -EAGAIN ? Not that I mind,
doing super large IOs with NOWAIT is not a smart thing to do in the first
place... But as a user interface, it seems that this will prevent any
forward progress for such really large NOWAIT IOs. Is that OK ?

> +			}
> +			bio->bi_opf |= REQ_NOWAIT;
> +		}
>  
>  		if (is_read) {
>  			if (dio->flags & DIO_SHOULD_DIRTY)
> @@ -228,9 +246,6 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
>  		} else {
>  			task_io_account_write(bio->bi_iter.bi_size);
>  		}
> -		if (iocb->ki_flags & IOCB_NOWAIT)
> -			bio->bi_opf |= REQ_NOWAIT;
> -
>  		dio->size += bio->bi_iter.bi_size;
>  		pos += bio->bi_iter.bi_size;
>
Jens Axboe Jan. 16, 2023, 11:28 p.m. UTC | #2
On 1/16/23 4:20 PM, Damien Le Moal wrote:
> On 1/17/23 06:06, Jens Axboe wrote:
>> If we're doing a large IO request which needs to be split into multiple
>> bios for issue, then we can run into the same situation as the below
>> marked commit fixes - parts will complete just fine, one or more parts
>> will fail to allocate a request. This will result in a partially
>> completed read or write request, where the caller gets EAGAIN even though
>> parts of the IO completed just fine.
>>
>> Do the same for large bios as we do for splits - fail a NOWAIT request
>> with EAGAIN. This isn't technically fixing an issue in the below marked
>> patch, but for stable purposes, we should have either none of them or
>> both.
>>
>> This depends on: 613b14884b85 ("block: handle bio_split_to_limits() NULL return")
>>
>> Cc: stable@vger.kernel.org # 5.15+
>> Fixes: 9cea62b2cbab ("block: don't allow splitting of a REQ_NOWAIT bio")
>> Link: https://github.com/axboe/liburing/issues/766
>> Reported-and-tested-by: Michael Kelley <mikelley@microsoft.com>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>
>> ---
>>
>> Since v1: catch this at submit time instead, since we can have various
>> valid cases where the number of single page segments will not take a
>> bio segment (page merging, huge pages).
>>
>> diff --git a/block/fops.c b/block/fops.c
>> index 50d245e8c913..d2e6be4e3d1c 100644
>> --- a/block/fops.c
>> +++ b/block/fops.c
>> @@ -221,6 +221,24 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
>>  			bio_endio(bio);
>>  			break;
>>  		}
>> +		if (iocb->ki_flags & IOCB_NOWAIT) {
>> +			/*
>> +			 * This is nonblocking IO, and we need to allocate
>> +			 * another bio if we have data left to map. As we
>> +			 * cannot guarantee that one of the sub bios will not
>> +			 * fail getting issued FOR NOWAIT and as error results
>> +			 * are coalesced across all of them, be safe and ask for
>> +			 * a retry of this from blocking context.
>> +			 */
>> +			if (unlikely(iov_iter_count(iter))) {
>> +				bio_release_pages(bio, false);
>> +				bio_clear_flag(bio, BIO_REFFED);
>> +				bio_put(bio);
>> +				blk_finish_plug(&plug);
>> +				return -EAGAIN;
> 
> Doesn't this mean that for a really very large IO request that has 100%
> chance of being split, the user will always get -EAGAIN ? Not that I mind,
> doing super large IOs with NOWAIT is not a smart thing to do in the first
> place... But as a user interface, it seems that this will prevent any
> forward progress for such really large NOWAIT IOs. Is that OK ?

Right, if you asked for NOWAIT, then it would not necessarily succeed if
it:

1) Needs multiple bios
2) Needs splitting

You're expected to attempt blocking issue at that point. Reasoning is
explained in this (and the previous commit related to the issue),
otherwise you end up with potentially various amounts of the request
being written to disk or read from disk, but EAGAIN being returned for
the request as a whole.
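
From the userspace side, "attempt blocking issue at that point" could look roughly like the sketch below. It assumes a glibc that exposes `preadv2()` and `RWF_NOWAIT`; the helper name `read_nowait_then_block()` is made up for illustration. Treating `EOPNOTSUPP` like `EAGAIN` is an extra assumption of this sketch, covering files that do not support `RWF_NOWAIT` at all.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Try a nonblocking read first. If the kernel answers EAGAIN (or the file
 * does not support RWF_NOWAIT and answers EOPNOTSUPP), reissue the same
 * read without the flag so it is allowed to block. Any other error, and
 * any successful or short read, is returned to the caller unchanged.
 */
static ssize_t read_nowait_then_block(int fd, void *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	ssize_t ret;

	ret = preadv2(fd, &iov, 1, off, RWF_NOWAIT);
	if (ret >= 0 || (errno != EAGAIN && errno != EOPNOTSUPP))
		return ret;

	/* Repeating the NOWAIT attempt may never succeed; block instead. */
	return preadv2(fd, &iov, 1, off, 0);
}
```

The point of the design is that the fallback is a single blocking attempt rather than a retry loop with `RWF_NOWAIT` still set. io_uring users normally do not need this, since io_uring performs the blocking retry on their behalf.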
Jens Axboe Jan. 16, 2023, 11:29 p.m. UTC | #3
On 1/16/23 4:28 PM, Jens Axboe wrote:
> On 1/16/23 4:20 PM, Damien Le Moal wrote:
>> On 1/17/23 06:06, Jens Axboe wrote:
>>> If we're doing a large IO request which needs to be split into multiple
>>> bios for issue, then we can run into the same situation as the below
>>> marked commit fixes - parts will complete just fine, one or more parts
>>> will fail to allocate a request. This will result in a partially
>>> completed read or write request, where the caller gets EAGAIN even though
>>> parts of the IO completed just fine.
>>>
>>> Do the same for large bios as we do for splits - fail a NOWAIT request
>>> with EAGAIN. This isn't technically fixing an issue in the below marked
>>> patch, but for stable purposes, we should have either none of them or
>>> both.
>>>
>>> This depends on: 613b14884b85 ("block: handle bio_split_to_limits() NULL return")
>>>
>>> Cc: stable@vger.kernel.org # 5.15+
>>> Fixes: 9cea62b2cbab ("block: don't allow splitting of a REQ_NOWAIT bio")
>>> Link: https://github.com/axboe/liburing/issues/766
>>> Reported-and-tested-by: Michael Kelley <mikelley@microsoft.com>
>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>
>>> ---
>>>
>>> Since v1: catch this at submit time instead, since we can have various
>>> valid cases where the number of single page segments will not take a
>>> bio segment (page merging, huge pages).
>>>
>>> diff --git a/block/fops.c b/block/fops.c
>>> index 50d245e8c913..d2e6be4e3d1c 100644
>>> --- a/block/fops.c
>>> +++ b/block/fops.c
>>> @@ -221,6 +221,24 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
>>>  			bio_endio(bio);
>>>  			break;
>>>  		}
>>> +		if (iocb->ki_flags & IOCB_NOWAIT) {
>>> +			/*
>>> +			 * This is nonblocking IO, and we need to allocate
>>> +			 * another bio if we have data left to map. As we
>>> +			 * cannot guarantee that one of the sub bios will not
>>> +			 * fail getting issued FOR NOWAIT and as error results
>>> +			 * are coalesced across all of them, be safe and ask for
>>> +			 * a retry of this from blocking context.
>>> +			 */
>>> +			if (unlikely(iov_iter_count(iter))) {
>>> +				bio_release_pages(bio, false);
>>> +				bio_clear_flag(bio, BIO_REFFED);
>>> +				bio_put(bio);
>>> +				blk_finish_plug(&plug);
>>> +				return -EAGAIN;
>>
>> Doesn't this mean that for a really very large IO request that has 100%
>> chance of being split, the user will always get -EAGAIN ? Not that I mind,
>> doing super large IOs with NOWAIT is not a smart thing to do in the first
>> place... But as a user interface, it seems that this will prevent any
>> forward progress for such really large NOWAIT IOs. Is that OK ?
> 
> Right, if you asked for NOWAIT, then it would not necessarily succeed if
> it:
> 
> 1) Needs multiple bios
> 2) Needs splitting
> 
> You're expected to attempt blocking issue at that point. Reasoning is
> explained in this (and the previous commit related to the issue),
> otherwise you end up with potentially various amounts of the request
> being written to disk or read from disk, but EAGAIN being returned for
> the request as a whole.

BTW, this is no different than eg doing a buffered read and needing to
read in the data. You'd get EAGAIN, and no amount of repeated retries
would change that. You need to either block for the IO at that point, or
otherwise start it so it will become available directly at some later
point (eg readahead).
Damien Le Moal Jan. 16, 2023, 11:31 p.m. UTC | #4
On 1/17/23 08:28, Jens Axboe wrote:
> On 1/16/23 4:20 PM, Damien Le Moal wrote:
>> On 1/17/23 06:06, Jens Axboe wrote:
>>> If we're doing a large IO request which needs to be split into multiple
>>> bios for issue, then we can run into the same situation as the below
>>> marked commit fixes - parts will complete just fine, one or more parts
>>> will fail to allocate a request. This will result in a partially
>>> completed read or write request, where the caller gets EAGAIN even though
>>> parts of the IO completed just fine.
>>>
>>> Do the same for large bios as we do for splits - fail a NOWAIT request
>>> with EAGAIN. This isn't technically fixing an issue in the below marked
>>> patch, but for stable purposes, we should have either none of them or
>>> both.
>>>
>>> This depends on: 613b14884b85 ("block: handle bio_split_to_limits() NULL return")
>>>
>>> Cc: stable@vger.kernel.org # 5.15+
>>> Fixes: 9cea62b2cbab ("block: don't allow splitting of a REQ_NOWAIT bio")
>>> Link: https://github.com/axboe/liburing/issues/766
>>> Reported-and-tested-by: Michael Kelley <mikelley@microsoft.com>
>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>
>>> ---
>>>
>>> Since v1: catch this at submit time instead, since we can have various
>>> valid cases where the number of single page segments will not take a
>>> bio segment (page merging, huge pages).
>>>
>>> diff --git a/block/fops.c b/block/fops.c
>>> index 50d245e8c913..d2e6be4e3d1c 100644
>>> --- a/block/fops.c
>>> +++ b/block/fops.c
>>> @@ -221,6 +221,24 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
>>>  			bio_endio(bio);
>>>  			break;
>>>  		}
>>> +		if (iocb->ki_flags & IOCB_NOWAIT) {
>>> +			/*
>>> +			 * This is nonblocking IO, and we need to allocate
>>> +			 * another bio if we have data left to map. As we
>>> +			 * cannot guarantee that one of the sub bios will not
>>> +			 * fail getting issued FOR NOWAIT and as error results
>>> +			 * are coalesced across all of them, be safe and ask for
>>> +			 * a retry of this from blocking context.
>>> +			 */
>>> +			if (unlikely(iov_iter_count(iter))) {
>>> +				bio_release_pages(bio, false);
>>> +				bio_clear_flag(bio, BIO_REFFED);
>>> +				bio_put(bio);
>>> +				blk_finish_plug(&plug);
>>> +				return -EAGAIN;
>>
>> Doesn't this mean that for a really very large IO request that has 100%
>> chance of being split, the user will always get -EAGAIN ? Not that I mind,
>> doing super large IOs with NOWAIT is not a smart thing to do in the first
>> place... But as a user interface, it seems that this will prevent any
>> forward progress for such really large NOWAIT IOs. Is that OK ?
> 
> Right, if you asked for NOWAIT, then it would not necessarily succeed if
> it:
> 
> 1) Needs multiple bios
> 2) Needs splitting
> 
> You're expected to attempt blocking issue at that point. Reasoning is
> explained in this (and the previous commit related to the issue),
> otherwise you end up with potentially various amounts of the request
> being written to disk or read from disk, but EAGAIN being returned for
> the request as a whole.

Yes, I understood all that and completely agree with it.

I was only wondering if this change may not surprise some (bad) userspace
stuff. Do we need to update some man page or other doc, mentioning that
there are no guarantees that a NOWAIT IO may actually be executed if it
is too large (e.g. larger than max_sectors_kb) ?
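
An application that wants to avoid issuing NOWAIT IOs that are likely to be split could compare its IO size against the queue limit exported in sysfs. The sketch below is one possible way to do that; both helper names are hypothetical, and the path passed in would be something like `/sys/block/<dev>/queue/max_sectors_kb`. Staying under this limit is not a guarantee that a NOWAIT IO succeeds, since it is only one of the reasons a request may need more than one bio or a split.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Read a single unsigned value from a sysfs-style file such as
 * /sys/block/<dev>/queue/max_sectors_kb.
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 */
static int read_queue_limit_kb(const char *path, unsigned long *kb)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = fscanf(f, "%lu", kb) == 1;
	fclose(f);
	return ok ? 0 : -1;
}

/* True if an IO of len bytes is larger than a limit given in KiB. */
static bool io_exceeds_limit(size_t len, unsigned long limit_kb)
{
	return (unsigned long long)len > (unsigned long long)limit_kb * 1024ULL;
}
```

A caller would read the limit once per device and issue IOs larger than it from blocking context directly, rather than trying NOWAIT first.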
Damien Le Moal Jan. 16, 2023, 11:32 p.m. UTC | #5
On 1/17/23 08:29, Jens Axboe wrote:
> On 1/16/23 4:28 PM, Jens Axboe wrote:
>> On 1/16/23 4:20 PM, Damien Le Moal wrote:
>>> On 1/17/23 06:06, Jens Axboe wrote:
>>>> If we're doing a large IO request which needs to be split into multiple
>>>> bios for issue, then we can run into the same situation as the below
>>>> marked commit fixes - parts will complete just fine, one or more parts
>>>> will fail to allocate a request. This will result in a partially
>>>> completed read or write request, where the caller gets EAGAIN even though
>>>> parts of the IO completed just fine.
>>>>
>>>> Do the same for large bios as we do for splits - fail a NOWAIT request
>>>> with EAGAIN. This isn't technically fixing an issue in the below marked
>>>> patch, but for stable purposes, we should have either none of them or
>>>> both.
>>>>
>>>> This depends on: 613b14884b85 ("block: handle bio_split_to_limits() NULL return")
>>>>
>>>> Cc: stable@vger.kernel.org # 5.15+
>>>> Fixes: 9cea62b2cbab ("block: don't allow splitting of a REQ_NOWAIT bio")
>>>> Link: https://github.com/axboe/liburing/issues/766
>>>> Reported-and-tested-by: Michael Kelley <mikelley@microsoft.com>
>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>>
>>>> ---
>>>>
>>>> Since v1: catch this at submit time instead, since we can have various
>>>> valid cases where the number of single page segments will not take a
>>>> bio segment (page merging, huge pages).
>>>>
>>>> diff --git a/block/fops.c b/block/fops.c
>>>> index 50d245e8c913..d2e6be4e3d1c 100644
>>>> --- a/block/fops.c
>>>> +++ b/block/fops.c
>>>> @@ -221,6 +221,24 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
>>>>  			bio_endio(bio);
>>>>  			break;
>>>>  		}
>>>> +		if (iocb->ki_flags & IOCB_NOWAIT) {
>>>> +			/*
>>>> +			 * This is nonblocking IO, and we need to allocate
>>>> +			 * another bio if we have data left to map. As we
>>>> +			 * cannot guarantee that one of the sub bios will not
>>>> +			 * fail getting issued FOR NOWAIT and as error results
>>>> +			 * are coalesced across all of them, be safe and ask for
>>>> +			 * a retry of this from blocking context.
>>>> +			 */
>>>> +			if (unlikely(iov_iter_count(iter))) {
>>>> +				bio_release_pages(bio, false);
>>>> +				bio_clear_flag(bio, BIO_REFFED);
>>>> +				bio_put(bio);
>>>> +				blk_finish_plug(&plug);
>>>> +				return -EAGAIN;
>>>
>>> Doesn't this mean that for a really very large IO request that has 100%
>>> chance of being split, the user will always get -EAGAIN ? Not that I mind,
>>> doing super large IOs with NOWAIT is not a smart thing to do in the first
>>> place... But as a user interface, it seems that this will prevent any
>>> forward progress for such really large NOWAIT IOs. Is that OK ?
>>
>> Right, if you asked for NOWAIT, then it would not necessarily succeed if
>> it:
>>
>> 1) Needs multiple bios
>> 2) Needs splitting
>>
>> You're expected to attempt blocking issue at that point. Reasoning is
>> explained in this (and the previous commit related to the issue),
>> otherwise you end up with potentially various amounts of the request
>> being written to disk or read from disk, but EAGAIN being returned for
>> the request as a whole.
> 
> BTW, this is no different than eg doing a buffered read and needing to
> read in the data. You'd get EAGAIN, and no amount of repeated retries
> would change that. You need to either block for the IO at that point, or
> otherwise start it so it will become available directly at some later
> point (eg readahead).

Indeed.
Jens Axboe Jan. 16, 2023, 11:39 p.m. UTC | #6
On 1/16/23 4:31 PM, Damien Le Moal wrote:
> On 1/17/23 08:28, Jens Axboe wrote:
>> On 1/16/23 4:20 PM, Damien Le Moal wrote:
>>> On 1/17/23 06:06, Jens Axboe wrote:
>>>> If we're doing a large IO request which needs to be split into multiple
>>>> bios for issue, then we can run into the same situation as the below
>>>> marked commit fixes - parts will complete just fine, one or more parts
>>>> will fail to allocate a request. This will result in a partially
>>>> completed read or write request, where the caller gets EAGAIN even though
>>>> parts of the IO completed just fine.
>>>>
>>>> Do the same for large bios as we do for splits - fail a NOWAIT request
>>>> with EAGAIN. This isn't technically fixing an issue in the below marked
>>>> patch, but for stable purposes, we should have either none of them or
>>>> both.
>>>>
>>>> This depends on: 613b14884b85 ("block: handle bio_split_to_limits() NULL return")
>>>>
>>>> Cc: stable@vger.kernel.org # 5.15+
>>>> Fixes: 9cea62b2cbab ("block: don't allow splitting of a REQ_NOWAIT bio")
>>>> Link: https://github.com/axboe/liburing/issues/766
>>>> Reported-and-tested-by: Michael Kelley <mikelley@microsoft.com>
>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>>
>>>> ---
>>>>
>>>> Since v1: catch this at submit time instead, since we can have various
>>>> valid cases where the number of single page segments will not take a
>>>> bio segment (page merging, huge pages).
>>>>
>>>> diff --git a/block/fops.c b/block/fops.c
>>>> index 50d245e8c913..d2e6be4e3d1c 100644
>>>> --- a/block/fops.c
>>>> +++ b/block/fops.c
>>>> @@ -221,6 +221,24 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
>>>>  			bio_endio(bio);
>>>>  			break;
>>>>  		}
>>>> +		if (iocb->ki_flags & IOCB_NOWAIT) {
>>>> +			/*
>>>> +			 * This is nonblocking IO, and we need to allocate
>>>> +			 * another bio if we have data left to map. As we
>>>> +			 * cannot guarantee that one of the sub bios will not
>>>> +			 * fail getting issued FOR NOWAIT and as error results
>>>> +			 * are coalesced across all of them, be safe and ask for
>>>> +			 * a retry of this from blocking context.
>>>> +			 */
>>>> +			if (unlikely(iov_iter_count(iter))) {
>>>> +				bio_release_pages(bio, false);
>>>> +				bio_clear_flag(bio, BIO_REFFED);
>>>> +				bio_put(bio);
>>>> +				blk_finish_plug(&plug);
>>>> +				return -EAGAIN;
>>>
>>> Doesn't this mean that for a really very large IO request that has 100%
>>> chance of being split, the user will always get -EAGAIN ? Not that I mind,
>>> doing super large IOs with NOWAIT is not a smart thing to do in the first
>>> place... But as a user interface, it seems that this will prevent any
>>> forward progress for such really large NOWAIT IOs. Is that OK ?
>>
>> Right, if you asked for NOWAIT, then it would not necessarily succeed if
>> it:
>>
>> 1) Needs multiple bios
>> 2) Needs splitting
>>
>> You're expected to attempt blocking issue at that point. Reasoning is
>> explained in this (and the previous commit related to the issue),
>> otherwise you end up with potentially various amounts of the request
>> being written to disk or read from disk, but EAGAIN being returned for
>> the request as a whole.
> 
> Yes, I understood all that and completely agree with it.
> 
> I was only wondering if this change may not surprise some (bad) userspace
> stuff. Do we need to update some man page or other doc, mentioning that
> there are no guarantees that a NOWAIT IO may actually be executed if it
> is too large (e.g. larger than max_sectors_kb) ?

We can certainly add it to the man pages talking about RWF_NOWAIT. But
there's never been a guarantee that any EAGAIN will later succeed
under the same conditions, and honestly there are various conditions
where this is already not true. And those same cases would spuriously
yield EAGAIN before already, it's not a new condition for those sizes
of IOs.
Damien Le Moal Jan. 17, 2023, 2:17 a.m. UTC | #7
On 1/17/23 08:39, Jens Axboe wrote:
> On 1/16/23 4:31 PM, Damien Le Moal wrote:
>> On 1/17/23 08:28, Jens Axboe wrote:
>>> On 1/16/23 4:20 PM, Damien Le Moal wrote:
>>>> On 1/17/23 06:06, Jens Axboe wrote:
>>>>> If we're doing a large IO request which needs to be split into multiple
>>>>> bios for issue, then we can run into the same situation as the below
>>>>> marked commit fixes - parts will complete just fine, one or more parts
>>>>> will fail to allocate a request. This will result in a partially
>>>>> completed read or write request, where the caller gets EAGAIN even though
>>>>> parts of the IO completed just fine.
>>>>>
>>>>> Do the same for large bios as we do for splits - fail a NOWAIT request
>>>>> with EAGAIN. This isn't technically fixing an issue in the below marked
>>>>> patch, but for stable purposes, we should have either none of them or
>>>>> both.
>>>>>
>>>>> This depends on: 613b14884b85 ("block: handle bio_split_to_limits() NULL return")
>>>>>
>>>>> Cc: stable@vger.kernel.org # 5.15+
>>>>> Fixes: 9cea62b2cbab ("block: don't allow splitting of a REQ_NOWAIT bio")
>>>>> Link: https://github.com/axboe/liburing/issues/766
>>>>> Reported-and-tested-by: Michael Kelley <mikelley@microsoft.com>
>>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>>>
>>>>> ---
>>>>>
>>>>> Since v1: catch this at submit time instead, since we can have various
>>>>> valid cases where the number of single page segments will not take a
>>>>> bio segment (page merging, huge pages).
>>>>>
>>>>> diff --git a/block/fops.c b/block/fops.c
>>>>> index 50d245e8c913..d2e6be4e3d1c 100644
>>>>> --- a/block/fops.c
>>>>> +++ b/block/fops.c
>>>>> @@ -221,6 +221,24 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
>>>>>  			bio_endio(bio);
>>>>>  			break;
>>>>>  		}
>>>>> +		if (iocb->ki_flags & IOCB_NOWAIT) {
>>>>> +			/*
>>>>> +			 * This is nonblocking IO, and we need to allocate
>>>>> +			 * another bio if we have data left to map. As we
>>>>> +			 * cannot guarantee that one of the sub bios will not
>>>>> +			 * fail getting issued FOR NOWAIT and as error results
>>>>> +			 * are coalesced across all of them, be safe and ask for
>>>>> +			 * a retry of this from blocking context.
>>>>> +			 */
>>>>> +			if (unlikely(iov_iter_count(iter))) {
>>>>> +				bio_release_pages(bio, false);
>>>>> +				bio_clear_flag(bio, BIO_REFFED);
>>>>> +				bio_put(bio);
>>>>> +				blk_finish_plug(&plug);
>>>>> +				return -EAGAIN;
>>>>
>>>> Doesn't this mean that for a really very large IO request that has 100%
>>>> chance of being split, the user will always get -EAGAIN ? Not that I mind,
>>>> doing super large IOs with NOWAIT is not a smart thing to do in the first
>>>> place... But as a user interface, it seems that this will prevent any
>>>> forward progress for such really large NOWAIT IOs. Is that OK ?
>>>
>>> Right, if you asked for NOWAIT, then it would not necessarily succeed if
>>> it:
>>>
>>> 1) Needs multiple bios
>>> 2) Needs splitting
>>>
>>> You're expected to attempt blocking issue at that point. Reasoning is
>>> explained in this (and the previous commit related to the issue),
>>> otherwise you end up with potentially various amounts of the request
>>> being written to disk or read from disk, but EAGAIN being returned for
>>> the request as a whole.
>>
>> Yes, I understood all that and completely agree with it.
>>
>> I was only wondering if this change may not surprise some (bad) userspace
>> stuff. Do we need to update some man page or other doc, mentioning that
>> there are no guarantees that a NOWAIT IO may actually be executed if it
>> is too large (e.g. larger than max_sectors_kb) ?
> 
> We can certainly add it to the man pages talking about RWF_NOWAIT. But
> there's never been a guarantee that any EAGAIN will later succeed
> under the same conditions, and honestly there are various conditions
> where this is already not true. And those same cases would spuriously
> yield EAGAIN before already, it's not a new condition for those sizes
> of IOs.

OK. Thanks.
Michael Kelley (LINUX) Jan. 21, 2023, 4:56 a.m. UTC | #8
From: Jens Axboe <axboe@kernel.dk> Sent: Monday, January 16, 2023 1:06 PM
> 
> If we're doing a large IO request which needs to be split into multiple
> bios for issue, then we can run into the same situation as the below
> marked commit fixes - parts will complete just fine, one or more parts
> will fail to allocate a request. This will result in a partially
> completed read or write request, where the caller gets EAGAIN even though
> parts of the IO completed just fine.
> 
> Do the same for large bios as we do for splits - fail a NOWAIT request
> with EAGAIN. This isn't technically fixing an issue in the below marked
> patch, but for stable purposes, we should have either none of them or
> both.
> 
> This depends on: 613b14884b85 ("block: handle bio_split_to_limits() NULL return")
> 
> Cc: stable@vger.kernel.org # 5.15+
> Fixes: 9cea62b2cbab ("block: don't allow splitting of a REQ_NOWAIT bio")
> Link: https://github.com/axboe/liburing/issues/766
> Reported-and-tested-by: Michael Kelley <mikelley@microsoft.com>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> 
> ---
> 
> Since v1: catch this at submit time instead, since we can have various
> valid cases where the number of single page segments will not take a
> bio segment (page merging, huge pages).
> 
> diff --git a/block/fops.c b/block/fops.c
> index 50d245e8c913..d2e6be4e3d1c 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -221,6 +221,24 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
>  			bio_endio(bio);
>  			break;
>  		}
> +		if (iocb->ki_flags & IOCB_NOWAIT) {
> +			/*
> +			 * This is nonblocking IO, and we need to allocate
> +			 * another bio if we have data left to map. As we
> +			 * cannot guarantee that one of the sub bios will not
> +			 * fail getting issued FOR NOWAIT and as error results
> +			 * are coalesced across all of them, be safe and ask for
> +			 * a retry of this from blocking context.
> +			 */
> +			if (unlikely(iov_iter_count(iter))) {
> +				bio_release_pages(bio, false);
> +				bio_clear_flag(bio, BIO_REFFED);
> +				bio_put(bio);
> +				blk_finish_plug(&plug);
> +				return -EAGAIN;
> +			}
> +			bio->bi_opf |= REQ_NOWAIT;
> +		}
> 
>  		if (is_read) {
>  			if (dio->flags & DIO_SHOULD_DIRTY)
> @@ -228,9 +246,6 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
>  		} else {
>  			task_io_account_write(bio->bi_iter.bi_size);
>  		}
> -		if (iocb->ki_flags & IOCB_NOWAIT)
> -			bio->bi_opf |= REQ_NOWAIT;
> -
>  		dio->size += bio->bi_iter.bi_size;
>  		pos += bio->bi_iter.bi_size;
> 

I've wrapped up my testing on this patch.  All testing was via io_uring -- I did
not test other paths.  Testing was against a combination of this patch and the
previous patch set for a similar problem. [1]

I tested with a simple test program to issue single I/Os, and verified the
expected paths were taken through the block layer and io_uring code for
various size I/Os, including over 1 Mbyte.  No EAGAIN errors were seen.
This testing was with a 6.1 kernel.

Also tested the original app that surfaced the problem.  It's a larger scale
workload using io_uring, and is where the problem was originally
encountered.  That workload runs on a purpose-built 5.15 kernel, so I
backported both patches to 5.15 for this testing.  All looks good.
No EAGAIN errors were seen.

Michael

[1] https://lore.kernel.org/linux-block/20230104160938.62636-1-axboe@kernel.dk/
Jens Axboe Jan. 21, 2023, 9:10 p.m. UTC | #9
On 1/20/23 9:56 PM, Michael Kelley (LINUX) wrote:
> From: Jens Axboe <axboe@kernel.dk> Sent: Monday, January 16, 2023 1:06 PM
>>
>> If we're doing a large IO request which needs to be split into multiple
>> bios for issue, then we can run into the same situation as the below
>> marked commit fixes - parts will complete just fine, one or more parts
>> will fail to allocate a request. This will result in a partially
>> completed read or write request, where the caller gets EAGAIN even though
>> parts of the IO completed just fine.
>>
>> Do the same for large bios as we do for splits - fail a NOWAIT request
>> with EAGAIN. This isn't technically fixing an issue in the below marked
>> patch, but for stable purposes, we should have either none of them or
>> both.
>>
>> This depends on: 613b14884b85 ("block: handle bio_split_to_limits() NULL return")
>>
>> Cc: stable@vger.kernel.org # 5.15+
>> Fixes: 9cea62b2cbab ("block: don't allow splitting of a REQ_NOWAIT bio")
>> Link: https://github.com/axboe/liburing/issues/766
>> Reported-and-tested-by: Michael Kelley <mikelley@microsoft.com>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>
>> ---
>>
>> Since v1: catch this at submit time instead, since we can have various
>> valid cases where the number of single page segments will not take a
>> bio segment (page merging, huge pages).
>>
>> diff --git a/block/fops.c b/block/fops.c
>> index 50d245e8c913..d2e6be4e3d1c 100644
>> --- a/block/fops.c
>> +++ b/block/fops.c
>> @@ -221,6 +221,24 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
>>  			bio_endio(bio);
>>  			break;
>>  		}
>> +		if (iocb->ki_flags & IOCB_NOWAIT) {
>> +			/*
>> +			 * This is nonblocking IO, and we need to allocate
>> +			 * another bio if we have data left to map. As we
>> +			 * cannot guarantee that one of the sub bios will not
>> +			 * fail getting issued FOR NOWAIT and as error results
>> +			 * are coalesced across all of them, be safe and ask for
>> +			 * a retry of this from blocking context.
>> +			 */
>> +			if (unlikely(iov_iter_count(iter))) {
>> +				bio_release_pages(bio, false);
>> +				bio_clear_flag(bio, BIO_REFFED);
>> +				bio_put(bio);
>> +				blk_finish_plug(&plug);
>> +				return -EAGAIN;
>> +			}
>> +			bio->bi_opf |= REQ_NOWAIT;
>> +		}
>>
>>  		if (is_read) {
>>  			if (dio->flags & DIO_SHOULD_DIRTY)
>> @@ -228,9 +246,6 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
>>  		} else {
>>  			task_io_account_write(bio->bi_iter.bi_size);
>>  		}
>> -		if (iocb->ki_flags & IOCB_NOWAIT)
>> -			bio->bi_opf |= REQ_NOWAIT;
>> -
>>  		dio->size += bio->bi_iter.bi_size;
>>  		pos += bio->bi_iter.bi_size;
>>
> 
> I've wrapped up my testing on this patch.  All testing was via
> io_uring -- I did not test other paths.  Testing was against a
> combination of this patch and the previous patch set for a similar
> problem. [1]
> 
> I tested with a simple test program to issue single I/Os, and verified
> the expected paths were taken through the block layer and io_uring
> code for various size I/Os, including over 1 Mbyte.  No EAGAIN errors
> were seen. This testing was with a 6.1 kernel.
> 
> Also tested the original app that surfaced the problem.  It's a larger
> scale workload using io_uring, and is where the problem was originally
> encountered.  That workload runs on a purpose-built 5.15 kernel, so I
> backported both patches to 5.15 for this testing.  All looks good. No
> EAGAIN errors were seen.

Thanks a lot for your thorough testing! Can you share the 5.15
backports, so we can put them into 5.15-stable as well potentially?
Michael Kelley (LINUX) Jan. 24, 2023, 4:03 p.m. UTC | #10
From: Jens Axboe <axboe@kernel.dk> Sent: Saturday, January 21, 2023 1:11 PM
> 
> On 1/20/23 9:56 PM, Michael Kelley (LINUX) wrote:
> > From: Jens Axboe <axboe@kernel.dk> Sent: Monday, January 16, 2023 1:06 PM

[snip]

> >
> > I've wrapped up my testing on this patch.  All testing was via
> > io_uring -- I did not test other paths.  Testing was against a
> > combination of this patch and the previous patch set for a similar
> > problem. [1]
> >
> > I tested with a simple test program to issue single I/Os, and verified
> > the expected paths were taken through the block layer and io_uring
> > code for various size I/Os, including over 1 Mbyte.  No EAGAIN errors
> > were seen. This testing was with a 6.1 kernel.
> >
> > Also tested the original app that surfaced the problem.  It's a larger
> > scale workload using io_uring, and is where the problem was originally
> > encountered.  That workload runs on a purpose-built 5.15 kernel, so I
> > backported both patches to 5.15 for this testing.  All looks good. No
> > EAGAIN errors were seen.
> 
> Thanks a lot for your thorough testing! Can you share the 5.15
> backports, so we can put them into 5.15-stable as well potentially?
> 

Certainly.  What's the best way to do that?  Should I send them to you,
or to the linux-block list?  Or post directly to stable@vger.kernel.org?
If the latter, maybe I need to wait until it has an upstream commit ID
that can be referenced.  Also, you or someone should do a quick review
of the backport to make sure I didn't break something in a path I
didn't test.

Michael
Jens Axboe Jan. 24, 2023, 4:13 p.m. UTC | #11
On 1/24/23 9:03 AM, Michael Kelley (LINUX) wrote:
> From: Jens Axboe <axboe@kernel.dk> Sent: Saturday, January 21, 2023 1:11 PM
>>
>> On 1/20/23 9:56 PM, Michael Kelley (LINUX) wrote:
>>> From: Jens Axboe <axboe@kernel.dk> Sent: Monday, January 16, 2023 1:06 PM
> 
> [snip]
> 
>>>
>>> I've wrapped up my testing on this patch.  All testing was via
>>> io_uring -- I did not test other paths.  Testing was against a
>>> combination of this patch and the previous patch set for a similar
>>> problem. [1]
>>>
>>> I tested with a simple test program to issue single I/Os, and verified
>>> the expected paths were taken through the block layer and io_uring
>>> code for various size I/Os, including over 1 Mbyte.  No EAGAIN errors
>>> were seen. This testing was with a 6.1 kernel.
>>>
>>> Also tested the original app that surfaced the problem.  It's a larger
>>> scale workload using io_uring, and is where the problem was originally
>>> encountered.  That workload runs on a purpose-built 5.15 kernel, so I
>>> backported both patches to 5.15 for this testing.  All looks good. No
>>> EAGAIN errors were seen.
>>
>> Thanks a lot for your thorough testing! Can you share the 5.15
>> backports, so we can put them into 5.15-stable as well potentially?
>>
> 
> Certainly.  What's the best way to do that?  Should I send them to you,
> or to the linux-block list?  Or post directly to stable@vger.kernel.org?
> If the latter, maybe I need to wait until it has an upstream commit ID
> that can be referenced.  Also, you or someone should do a quick review
> of the backport to make sure I didn't break something in a path I
> didn't test.

Just send them to the block list, then we have them for when the commit
hits upstream and gives us a chance to review them upfront.

Thanks!
Michael Kelley (LINUX) Jan. 28, 2023, 4:34 p.m. UTC | #12
From: Jens Axboe <axboe@kernel.dk> Sent: Tuesday, January 24, 2023 8:13 AM
> 
> On 1/24/23 9:03 AM, Michael Kelley (LINUX) wrote:
> > From: Jens Axboe <axboe@kernel.dk> Sent: Saturday, January 21, 2023 1:11 PM
> >>
> >> On 1/20/23 9:56 PM, Michael Kelley (LINUX) wrote:
> >>> From: Jens Axboe <axboe@kernel.dk> Sent: Monday, January 16, 2023 1:06 PM
> >
> > [snip]
> >
> >>>
> >>> I've wrapped up my testing on this patch.  All testing was via
> >>> io_uring -- I did not test other paths.  Testing was against a
> >>> combination of this patch and the previous patch set for a similar
> >>> problem. [1]
> >>>
> >>> I tested with a simple test program to issue single I/Os, and verified
> >>> the expected paths were taken through the block layer and io_uring
> >>> code for various size I/Os, including over 1 Mbyte.  No EAGAIN errors
> >>> were seen. This testing was with a 6.1 kernel.
> >>>
> >>> Also tested the original app that surfaced the problem.  It's a larger
> >>> scale workload using io_uring, and is where the problem was originally
> >>> encountered.  That workload runs on a purpose-built 5.15 kernel, so I
> >>> backported both patches to 5.15 for this testing.  All looks good. No
> >>> EAGAIN errors were seen.
> >>
> >> Thanks a lot for your thorough testing! Can you share the 5.15
> >> backports, so we can put them into 5.15-stable as well potentially?
> >>
> >
> > Certainly.  What's the best way to do that?  Should I send them to you,
> > or to the linux-block list?  Or post directly to stable@vger.kernel.org?
> > If the latter, maybe I need to wait until it has an upstream commit ID
> > that can be referenced.  Also, you or someone should do a quick review
> > of the backport to make sure I didn't break something in a path I
> > didn't test.
> 
> Just send them to the block list, then we have them for when the commit
> hits upstream and gives us a chance to review them upfront.
> 
> Thanks!
> 

Your first two patches for handling bio splitting have already been
backported to 5.15 and are included in 5.15.90.   However, in reviewing
the backports, stable commit 613b14884b85 didn't update dm_submit_bio()
to check for a NULL bio being returned by blk_queue_split().   Presumably
that needs to be fixed unless there is a reason the check isn't needed
(which I didn't see).

Separately, I've posted a v5.15.90 backport for the __blkdev_direct_IO()
fix for multiple bios.

Michael

Patch

diff --git a/block/fops.c b/block/fops.c
index 50d245e8c913..d2e6be4e3d1c 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -221,6 +221,24 @@  static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 			bio_endio(bio);
 			break;
 		}
+		if (iocb->ki_flags & IOCB_NOWAIT) {
+			/*
+			 * This is nonblocking IO, and we need to allocate
+			 * another bio if we have data left to map. As we
+			 * cannot guarantee that one of the sub bios will not
+			 * fail getting issued FOR NOWAIT and as error results
+			 * are coalesced across all of them, be safe and ask for
+			 * a retry of this from blocking context.
+			 */
+			if (unlikely(iov_iter_count(iter))) {
+				bio_release_pages(bio, false);
+				bio_clear_flag(bio, BIO_REFFED);
+				bio_put(bio);
+				blk_finish_plug(&plug);
+				return -EAGAIN;
+			}
+			bio->bi_opf |= REQ_NOWAIT;
+		}
 
 		if (is_read) {
 			if (dio->flags & DIO_SHOULD_DIRTY)
@@ -228,9 +246,6 @@  static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 		} else {
 			task_io_account_write(bio->bi_iter.bi_size);
 		}
-		if (iocb->ki_flags & IOCB_NOWAIT)
-			bio->bi_opf |= REQ_NOWAIT;
-
 		dio->size += bio->bi_iter.bi_size;
 		pos += bio->bi_iter.bi_size;
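
For readers who want to see the effect of the hunk above in isolation, here is a minimal userspace sketch of the submit-time check. It is not kernel code: model_direct_io() and its max_bio_bytes parameter are hypothetical names used only for illustration, and the byte cap is a simplification, since the real limit on a bio is counted in segments rather than bytes. The point it tries to show is that a nonblocking request which would need a second bio is refused with -EAGAIN before anything is issued, so the caller never sees a partial completion.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical userspace model of the submit loop, not kernel code.
 *
 * total:         bytes the caller wants to transfer
 * max_bio_bytes: assumed cap on how much one bio can map
 * nowait:        stands in for IOCB_NOWAIT being set on the kiocb
 *
 * Returns the number of bytes submitted, -EAGAIN when a nonblocking
 * request would need more than one bio, or -EINVAL for a zero cap.
 */
static long model_direct_io(size_t total, size_t max_bio_bytes, bool nowait)
{
	size_t remaining = total;
	long submitted = 0;

	if (max_bio_bytes == 0)
		return -EINVAL;

	while (remaining > 0) {
		/* Map as much as fits into the current (modelled) bio. */
		size_t chunk = remaining < max_bio_bytes ? remaining : max_bio_bytes;

		remaining -= chunk;

		/*
		 * Mirrors the patch: data left to map after filling this bio
		 * means another bio would be needed, so bail out before
		 * issuing anything and ask for a retry from blocking context.
		 */
		if (nowait && remaining > 0)
			return -EAGAIN;

		submitted += (long)chunk;
	}

	return submitted;
}
```

A caller that gets -EAGAIN back would be expected to retry the same request with nowait set to false, which in this model always submits the full length.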