
[v2,1/1] iomap: propagate nowait to block layer

Message ID f287a7882a4c4576e90e55ecc5ab8bf634579afd.1741090631.git.asml.silence@gmail.com (mailing list archive)
State: New
Series: [v2,1/1] iomap: propagate nowait to block layer

Commit Message

Pavel Begunkov March 4, 2025, 12:18 p.m. UTC
There are reports of high io_uring submission latency for ext4 and xfs,
which is due to iomap not propagating the nowait flag to the block layer,
resulting in waiting for IO during tag allocation.

Because of how errors are propagated back, we can't set REQ_NOWAIT
for multi-bio IO; in this case return -EAGAIN and let the caller
handle it, for example by reissuing it from a blocking context.
This is aligned with how raw bdev direct IO handles it.

Cc: stable@vger.kernel.org
Link: https://github.com/axboe/liburing/issues/826#issuecomment-2674131870
Reported-by: wu lei <uwydoc@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---

v2:
	Fail multi-bio nowait submissions

 fs/iomap/direct-io.c | 26 +++++++++++++++++++++++---
 1 file changed, 23 insertions(+), 3 deletions(-)
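
For background, the nowait contract this patch relies on is visible from
userspace via preadv2(2) and RWF_NOWAIT. The following is a minimal
sketch (illustrative only, not part of the series) of the "try nowait
first, reissue from a blocking context on EAGAIN" pattern the commit
message describes:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>

int main(int argc, char **argv)
{
	void *buf;
	struct iovec iov;
	ssize_t ret;
	int fd;

	/* O_DIRECT needs an aligned buffer. */
	if (argc < 2 || posix_memalign(&buf, 4096, 4096))
		return 1;
	iov.iov_base = buf;
	iov.iov_len = 4096;

	fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd < 0)
		return 1;

	ret = preadv2(fd, &iov, 1, 0, RWF_NOWAIT);
	if (ret < 0 && errno == EAGAIN) {
		/*
		 * The kernel could not complete the read without
		 * blocking; reissue as a plain blocking read, which is
		 * what io_uring's fallback path does via a kernel worker.
		 */
		ret = preadv2(fd, &iov, 1, 0, 0);
	}
	printf("read %zd bytes\n", ret);
	return ret < 0;
}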

Comments

Christoph Hellwig March 4, 2025, 4:07 p.m. UTC | #1
On Tue, Mar 04, 2025 at 12:18:07PM +0000, Pavel Begunkov wrote:
>  	    ((dio->flags & IOMAP_DIO_NEED_SYNC) && !use_fua) ||
> -	    ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode)))
> +	    ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode))) {
>  		dio->flags &= ~IOMAP_DIO_CALLER_COMP;
>  
> +		if (!is_sync_kiocb(dio->iocb) &&
> +		    (dio->iocb->ki_flags & IOCB_NOWAIT))
> +			return -EAGAIN;

Black magic without comments explaining it.

> +	if (!is_sync_kiocb(dio->iocb) && (dio->iocb->ki_flags & IOCB_NOWAIT)) {
> +		/*
> +		 * This is nonblocking IO, and we might need to allocate
> +		 * multiple bios. In this case, as we cannot guarantee that
> +		 * one of the sub bios will not fail getting issued FOR NOWAIT
> +		 * and as error results are coalesced across all of them, ask
> +		 * for a retry of this from blocking context.
> +		 */
> +		if (bio_iov_vecs_to_alloc(dio->submit.iter, BIO_MAX_VECS + 1) >
> +					  BIO_MAX_VECS)

This is not very accurate in times of multi-page bvecs and large order
folios all over.

I think you really need to bite the bullet and support early returns
from the non-blocking bio submission path.
Pavel Begunkov March 4, 2025, 4:41 p.m. UTC | #2
On 3/4/25 16:07, Christoph Hellwig wrote:
> On Tue, Mar 04, 2025 at 12:18:07PM +0000, Pavel Begunkov wrote:
>>   	    ((dio->flags & IOMAP_DIO_NEED_SYNC) && !use_fua) ||
>> -	    ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode)))
>> +	    ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode))) {
>>   		dio->flags &= ~IOMAP_DIO_CALLER_COMP;
>>   
>> +		if (!is_sync_kiocb(dio->iocb) &&
>> +		    (dio->iocb->ki_flags & IOCB_NOWAIT))
>> +			return -EAGAIN;
> 
> Black magic without comments explaining it.

I can copy the comment from below if you wish.

>> +	if (!is_sync_kiocb(dio->iocb) && (dio->iocb->ki_flags & IOCB_NOWAIT)) {
>> +		/*
>> +		 * This is nonblocking IO, and we might need to allocate
>> +		 * multiple bios. In this case, as we cannot guarantee that
>> +		 * one of the sub bios will not fail getting issued FOR NOWAIT
>> +		 * and as error results are coalesced across all of them, ask
>> +		 * for a retry of this from blocking context.
>> +		 */
>> +		if (bio_iov_vecs_to_alloc(dio->submit.iter, BIO_MAX_VECS + 1) >
>> +					  BIO_MAX_VECS)
> 
> This is not very accurate in times of multi-page bvecs and large order
> folios all over.

bio_iov_vecs_to_alloc() can overestimate, i.e. the check might return
-EAGAIN in more cases than required but not the other way around,
that should be enough for a fix such as this patch. Or did I maybe
misunderstand you?
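
To illustrate the point, the guard can be restated as the following
predicate (the helper name is invented for this sketch): capping the
estimate at BIO_MAX_VECS + 1 and still getting back more than
BIO_MAX_VECS means the payload cannot fit in a single bio, so nowait
submission is refused up front. Overestimation only makes the check
more conservative, never less.

#include <linux/bio.h>
#include <linux/uio.h>

/* Returns true when the iterator would require more than one bio. */
static bool iomap_dio_needs_multiple_bios(struct iov_iter *iter)
{
	return bio_iov_vecs_to_alloc(iter, BIO_MAX_VECS + 1) > BIO_MAX_VECS;
}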

> I think you really need to bite the bullet and support early returns
> from the non-blocking bio submission path.

Assuming you're suggesting I implement that, I can't say I'm excited by
the idea of reworking a non-trivial chunk of the block layer to fix a problem
and then porting it up to some 5.x, especially since it was already
attempted before by someone and ultimately got reverted.
Christoph Hellwig March 4, 2025, 4:59 p.m. UTC | #3
On Tue, Mar 04, 2025 at 04:41:40PM +0000, Pavel Begunkov wrote:
> bio_iov_vecs_to_alloc() can overestimate, i.e. the check might return
> -EAGAIN in more cases than required but not the other way around,
> that should be enough for a fix such as this patch. Or did I maybe
> misunderstand you?

No you didn't, but we need to do this properly.

> Assuming you're suggesting I implement that, I can't say I'm excited by
> the idea of reworking a non-trivial chunk of the block layer to fix a problem
> and then porting it up to some 5.x, especially since it was already
> attempted before by someone and ultimately got reverted.

Stop whining.  Backporting does not matter for upstream development,
and I'm pretty sure you'd qualify for a job where you don't have to do
this if you actually care and don't just like to complain.
Jens Axboe March 4, 2025, 5:36 p.m. UTC | #4
On 3/4/25 9:59 AM, Christoph Hellwig wrote:
> Stop whining.  Backporting does not matter for upstream development,
> and I'm pretty sure you'd qualify for a job where you don't have to do
> this if you actually care and don't just like to complain.

Sorry, but wtf is this? Can we please keep it factual.

You may not need to care about backporting, but those of us supporting
stable and actual production certainly do. Not that this should drive
upstream development in any way, it's entirely unrelated to the problem
at hand.
Pavel Begunkov March 4, 2025, 5:54 p.m. UTC | #5
On 3/4/25 16:59, Christoph Hellwig wrote:
> On Tue, Mar 04, 2025 at 04:41:40PM +0000, Pavel Begunkov wrote:
>> bio_iov_vecs_to_alloc() can overestimate, i.e. the check might return
>> -EAGAIN in more cases than required but not the other way around,
>> that should be enough for a fix such as this patch. Or did I maybe
>> misunderstand you?
> 
> No you didn't, but we need to do this properly.
> 
>> Assuming you're suggesting I implement that, I can't say I'm excited by
>> the idea of reworking a non-trivial chunk of the block layer to fix a problem
>> and then porting it up to some 5.x, especially since it was already
>> attempted before by someone and ultimately got reverted.
> 
> Stop whining.  Backporting does not matter for upstream development,
> and I'm pretty sure you'd qualify for a job where you don't have to do
> this if you actually care and don't just like to complain.

That's an interesting choice of words. Dear Christoph, I'm afraid you
don't realise that you're the one whining at a person who is actually
trying to fix it. I'd appreciate it if you stopped this bullshit,
especially if you're not willing to fix it yourself. If you do, please,
be my guest and prove me wrong.

Stable does exist, and people do care about it, and people do care about
problems being fixed and not delayed because of your desire for someone
else to do some random grand rework for you. If there are _actual_
problems with the patch, please let me know. Some rarer cases being
less efficient is not a problem.
Darrick J. Wong March 4, 2025, 7:22 p.m. UTC | #6
On Tue, Mar 04, 2025 at 04:41:40PM +0000, Pavel Begunkov wrote:
> On 3/4/25 16:07, Christoph Hellwig wrote:
> > On Tue, Mar 04, 2025 at 12:18:07PM +0000, Pavel Begunkov wrote:
> > >   	    ((dio->flags & IOMAP_DIO_NEED_SYNC) && !use_fua) ||
> > > -	    ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode)))
> > > +	    ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode))) {
> > >   		dio->flags &= ~IOMAP_DIO_CALLER_COMP;
> > > +		if (!is_sync_kiocb(dio->iocb) &&
> > > +		    (dio->iocb->ki_flags & IOCB_NOWAIT))
> > > +			return -EAGAIN;
> > 
> > Black magic without comments explaining it.
> 
> I can copy the comment from below if you wish.
> 
> > > +	if (!is_sync_kiocb(dio->iocb) && (dio->iocb->ki_flags & IOCB_NOWAIT)) {
> > > +		/*
> > > +		 * This is nonblocking IO, and we might need to allocate
> > > +		 * multiple bios. In this case, as we cannot guarantee that
> > > +		 * one of the sub bios will not fail getting issued FOR NOWAIT
> > > +		 * and as error results are coalesced across all of them, ask
> > > +		 * for a retry of this from blocking context.
> > > +		 */
> > > +		if (bio_iov_vecs_to_alloc(dio->submit.iter, BIO_MAX_VECS + 1) >
> > > +					  BIO_MAX_VECS)
> > 
> > This is not very accurate in times of multi-page bvecs and large order
> > folios all over.
> 
> bio_iov_vecs_to_alloc() can overestimate, i.e. the check might return
> -EAGAIN in more cases than required but not the other way around,
> that should be enough for a fix such as this patch. Or did I maybe
> misunderstand you?
> 
> > I think you really need to bite the bullet and support early returns
> > from the non-blocking bio submission path.
> 
> Assuming you're suggesting I implement that, I can't say I'm excited by
> the idea of reworking a non-trivial chunk of the block layer to fix a problem
> and then porting it up to some 5.x, especially since it was already
> attempted before by someone and ultimately got reverted.

[I'm going to ignore the sarcasm downthread because I don't like it and
will not participate in prolonging that.]

So don't.  XFS LTS generally doesn't pull large chunks of new code into
old kernels, we just tell people they need to keep moving forward if
they want new code, or even bug fixes that get really involved.  You
want an XFS that doesn't allocate xfs_bufs from reclaim?  Well, you have
to move to 6.12, we're not going to backport a ton of super invasive
changes to 6.6, let alone 5.x.

We don't let old kernel source dictate changes to new kernels.

--D

> -- 
> Pavel Begunkov
> 
>
Pavel Begunkov March 4, 2025, 8:35 p.m. UTC | #7
On 3/4/25 19:22, Darrick J. Wong wrote:
...
>> Assuming you're suggesting I implement that, I can't say I'm excited by
>> the idea of reworking a non-trivial chunk of the block layer to fix a problem
>> and then porting it up to some 5.x, especially since it was already
>> attempted before by someone and ultimately got reverted.

Clarification: the mentioned work was reverted or pulled out _upstream_,
it wasn't about backporting.

> [I'm going to ignore the sarcasm downthread because I don't like it and
> will not participate in prolonging that.]
> 
> So don't.  XFS LTS generally doesn't pull large chunks of new code into

I agree, and that's why I'm trying to have a small fix. I think
this patch is concise if you disregard the comments taking up some
lines. And Christoph even confirmed that the main check in the patch
does what's intended, i.e. disallowing setups where multiple bios
would be generated from the iterator.

> old kernels, we just tell people they need to keep moving forward if
> they want new code, or even bug fixes that get really involved.  You

It's still a broken io_uring uapi promise though, and I'd still need
to address it in older kernels somehow. For example, we can have a
patch like this, which IMHO would be the ideal option.

Another option is to push all io_uring filesystem / iomap requests
to the slow path (where blocking is possible) and accept a meaningful
perf regression for those who still use fs+io_uring direct IO. And
I'm not being dramatic about it; it's essentially what users
who detect the problem already do, either from userspace or by
disabling io_uring altogether.

Even then it'd leave the question of how to fix it upstream. I don't see
the full scope, but it has non-trivial nuances and might well turn
out to be a very involved project and take a lot of time I don't have
at the moment.

Darrick, any thoughts on the patch? Is there any reason why
it wouldn't work?

> want an XFS that doesn't allocate xfs_bufs from reclaim?  Well, you have
> to move to 6.12, we're not going to backport a ton of super invasive
> changes to 6.6, let alone 5.x.
> 
> We don't let old kernel source dictate changes to new kernels.
Dave Chinner March 4, 2025, 9:11 p.m. UTC | #8
On Tue, Mar 04, 2025 at 12:18:07PM +0000, Pavel Begunkov wrote:
> There are reports of high io_uring submission latency for ext4 and xfs,
> which is due to iomap not propagating the nowait flag to the block layer,
> resulting in waiting for IO during tag allocation.
> 
> Because of how errors are propagated back, we can't set REQ_NOWAIT
> for multi-bio IO; in this case return -EAGAIN and let the caller
> handle it, for example by reissuing it from a blocking context.
> This is aligned with how raw bdev direct IO handles it.
> 
> Cc: stable@vger.kernel.org
> Link: https://github.com/axboe/liburing/issues/826#issuecomment-2674131870
> Reported-by: wu lei <uwydoc@gmail.com>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
> 
> v2:
> 	Fail multi-bio nowait submissions
> 
>  fs/iomap/direct-io.c | 26 +++++++++++++++++++++++---
>  1 file changed, 23 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index b521eb15759e..07c336fdf4f0 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -363,9 +363,14 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  	 */
>  	if (need_zeroout ||
>  	    ((dio->flags & IOMAP_DIO_NEED_SYNC) && !use_fua) ||
> -	    ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode)))
> +	    ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode))) {
>  		dio->flags &= ~IOMAP_DIO_CALLER_COMP;
>  
> +		if (!is_sync_kiocb(dio->iocb) &&
> +		    (dio->iocb->ki_flags & IOCB_NOWAIT))
> +			return -EAGAIN;
> +	}

How are we getting here with IOCB_NOWAIT IO? This is either
sub-block unaligned write IO, a write IO that requires
allocation (i.e. write beyond EOF), or an O_DSYNC write
on hardware that doesn't support REQ_FUA.

The first 2 cases should have already been filtered out by the
filesystem before we ever get here. The latter doesn't require
multiple IOs in IO submission - the O_DSYNC IO submission (if any is
required) occurs from data IO completion context, and so it will not
block IO submission at all.

So what type of IO in what mapping condition is triggering the need
to return EAGAIN here?

> +
>  	/*
>  	 * The rules for polled IO completions follow the guidelines as the
>  	 * ones we set for inline and deferred completions. If none of those
> @@ -374,6 +379,23 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  	if (!(dio->flags & (IOMAP_DIO_INLINE_COMP|IOMAP_DIO_CALLER_COMP)))
>  		dio->iocb->ki_flags &= ~IOCB_HIPRI;
>  
> +	bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua, atomic);
> +
> +	if (!is_sync_kiocb(dio->iocb) && (dio->iocb->ki_flags & IOCB_NOWAIT)) {
> +		/*
> +		 * This is nonblocking IO, and we might need to allocate
> +		 * multiple bios. In this case, as we cannot guarantee that
> +		 * one of the sub bios will not fail getting issued FOR NOWAIT
> +		 * and as error results are coalesced across all of them, ask
> +		 * for a retry of this from blocking context.
> +		 */
> +		if (bio_iov_vecs_to_alloc(dio->submit.iter, BIO_MAX_VECS + 1) >
> +					  BIO_MAX_VECS)
> +			return -EAGAIN;
> +
> +		bio_opf |= REQ_NOWAIT;
> +	}

Ok, so this allows a max sized bio to be used. So, what, 1MB on 4kB
page size is the maximum IO size for IOCB_NOWAIT now? I bet that's
not documented anywhere....

Ah. This doesn't fix the problem at all.

Say, for example, I have a hardware storage device with a max
hardware IO size of 128kB. This is from the workstation I'm typing
this email on:

$ cat /sys/block/nvme0n1/queue/max_hw_sectors_kb
128
$  cat /sys/block/nvme0n1/queue/max_segments
33
$

We build a 1MB bio above, set REQ_NOWAIT, then:

submit_bio
  ....
  blk_mq_submit_bio
    __bio_split_to_limits(bio, &q->limits, &nr_segs);
      bio_split_rw()
        .....
        split:
	.....
        /*                                                                       
         * We can't sanely support splitting for a REQ_NOWAIT bio. End it        
         * with EAGAIN if splitting is required and return an error pointer.     
         */                                                                      
        if (bio->bi_opf & REQ_NOWAIT)                                            
                return -EAGAIN;                                                  


So, REQ_NOWAIT effectively limits bio submission to the maximum
single IO size of the underlying storage. So, we can't use
REQ_NOWAIT without actually looking at request queue limits before
we start building the IO - yuk.

REQ_NOWAIT still feels completely unusable to me....

-Dave.
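
A sketch of the kind of queue-limits pre-check this argument implies a
submitter would be forced into (the helper is hypothetical; and as the
RAID5 example in a later reply shows, even this would not be sufficient
once stacking drivers split bios for their own reasons):

#include <linux/blkdev.h>

/*
 * Hypothetical pre-check: refuse REQ_NOWAIT when the bio exceeds what
 * the queue can take in a single request, since __bio_split_to_limits()
 * fails REQ_NOWAIT bios that need splitting.  Byte size alone is not
 * enough; segment limits (max_segments) would need the same treatment.
 */
static bool bio_may_avoid_split(struct block_device *bdev, struct bio *bio)
{
	struct request_queue *q = bdev_get_queue(bdev);

	return (bio->bi_iter.bi_size >> SECTOR_SHIFT) <=
		queue_max_hw_sectors(q);
}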
Pavel Begunkov March 4, 2025, 10:47 p.m. UTC | #9
On 3/4/25 21:11, Dave Chinner wrote:
...
>> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
>> index b521eb15759e..07c336fdf4f0 100644
>> --- a/fs/iomap/direct-io.c
>> +++ b/fs/iomap/direct-io.c
>> @@ -363,9 +363,14 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>>   	 */
>>   	if (need_zeroout ||
>>   	    ((dio->flags & IOMAP_DIO_NEED_SYNC) && !use_fua) ||
>> -	    ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode)))
>> +	    ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode))) {
>>   		dio->flags &= ~IOMAP_DIO_CALLER_COMP;
>>   
>> +		if (!is_sync_kiocb(dio->iocb) &&
>> +		    (dio->iocb->ki_flags & IOCB_NOWAIT))
>> +			return -EAGAIN;
>> +	}
> 
> How are we getting here with IOCB_NOWAIT IO? This is either
> sub-block unaligned write IO, a write IO that requires
> allocation (i.e. write beyond EOF), or an O_DSYNC write
> on hardware that doesn't support REQ_FUA.
> 
> The first 2 cases should have already been filtered out by the
> filesystem before we ever get here. The latter doesn't require
> multiple IOs in IO submission - the O_DSYNC IO submission (if any is
> required) occurs from data IO completion context, and so it will not
> block IO submission at all.
> 
> So what type of IO in what mapping condition is triggering the need
> to return EAGAIN here?

I didn't hit it, but neither could I prove that it's impossible.
I'll drop the hunk since you're saying it shouldn't be needed.

>>   	/*
>>   	 * The rules for polled IO completions follow the guidelines as the
>>   	 * ones we set for inline and deferred completions. If none of those
>> @@ -374,6 +379,23 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>>   	if (!(dio->flags & (IOMAP_DIO_INLINE_COMP|IOMAP_DIO_CALLER_COMP)))
>>   		dio->iocb->ki_flags &= ~IOCB_HIPRI;
>>   
>> +	bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua, atomic);
>> +
>> +	if (!is_sync_kiocb(dio->iocb) && (dio->iocb->ki_flags & IOCB_NOWAIT)) {
>> +		/*
>> +		 * This is nonblocking IO, and we might need to allocate
>> +		 * multiple bios. In this case, as we cannot guarantee that
>> +		 * one of the sub bios will not fail getting issued FOR NOWAIT
>> +		 * and as error results are coalesced across all of them, ask
>> +		 * for a retry of this from blocking context.
>> +		 */
>> +		if (bio_iov_vecs_to_alloc(dio->submit.iter, BIO_MAX_VECS + 1) >
>> +					  BIO_MAX_VECS)
>> +			return -EAGAIN;
>> +
>> +		bio_opf |= REQ_NOWAIT;
>> +	}
> 
> Ok, so this allows a max sized bio to be used. So, what, 1MB on 4kB
> page size is the maximum IO size for IOCB_NOWAIT now? I bet that's
> not documented anywhere....
> 
> Ah. This doesn't fix the problem at all.
> 
> Say, for example, I have a hardware storage device with a max
> hardware IO size of 128kB. This is from the workstation I'm typing
> this email on:
> 
> $ cat /sys/block/nvme0n1/queue/max_hw_sectors_kb
> 128
> $  cat /sys/block/nvme0n1/queue/max_segments
> 33
> $
> 
> We build a 1MB bio above, set REQ_NOWAIT, then:
> 
> submit_bio
>    ....
>    blk_mq_submit_bio
>      __bio_split_to_limits(bio, &q->limits, &nr_segs);
>        bio_split_rw()
>          .....
>          split:
> 	.....
>          /*
>           * We can't sanely support splitting for a REQ_NOWAIT bio. End it
>           * with EAGAIN if splitting is required and return an error pointer.
>           */
>          if (bio->bi_opf & REQ_NOWAIT)
>                  return -EAGAIN;
> 
> 
> So, REQ_NOWAIT effectively limits bio submission to the maximum
> single IO size of the underlying storage. So, we can't use
> REQ_NOWAIT without actually looking at request queue limits before
> we start building the IO - yuk.
> 
> REQ_NOWAIT still feels completely unusable to me....

Not great, but better than not using the async path at all (i.e. executing
via a kernel worker) as far as io_uring is concerned. It should cover a good
share of users, and io_uring has a fallback path, so userspace won't even
know about the error. The per-byte overhead also bites less at 128K.

The patch already limits the change to async users only, but if you're
concerned about non-io_uring users, I can try to narrow it to io_uring
requests only, even though I don't think it's a good way.

Are you only concerned about the size being too restrictive or do you
see any other problems?
Christoph Hellwig March 4, 2025, 11:26 p.m. UTC | #10
On Tue, Mar 04, 2025 at 10:36:16AM -0700, Jens Axboe wrote:
> stable and actual production certainly do. Not that this should drive
> upstream development in any way, it's entirely unrelated to the problem
> at hand.

And that's exactly what I'm saying.  Do the right thing instead of
whining about backports to old kernels.
Christoph Hellwig March 4, 2025, 11:28 p.m. UTC | #11
On Tue, Mar 04, 2025 at 05:54:51PM +0000, Pavel Begunkov wrote:
> That's an interesting choice of words. Dear Christoph, I'm afraid you
> don't realise that you're the one whining at a person who is actually
> trying to fix it. I'd appreciate it if you stopped this bullshit,
> especially if you're not willing to fix it yourself. If you do, please,
> be my guest and prove me wrong.

I'm not bullshitting.  I put my money where my mouth is, and you're
complaining.
Christoph Hellwig March 4, 2025, 11:40 p.m. UTC | #12
On Tue, Mar 04, 2025 at 10:47:48PM +0000, Pavel Begunkov wrote:
> Are you only concerned about the size being too restrictive or do you
> see any other problems?

You can't know the size beforehand.  A stacking block driver can split
the I/O for absolutely any reason it wants, and there is no requirement
to propagate that up to the queue limits.  That's why we need to be
able to synchronously return the wouldblock error to get this right.
Jens Axboe March 4, 2025, 11:43 p.m. UTC | #13
On 3/4/25 4:26 PM, Christoph Hellwig wrote:
> On Tue, Mar 04, 2025 at 10:36:16AM -0700, Jens Axboe wrote:
>> stable and actual production certainly do. Not that this should drive
>> upstream development in any way, it's entirely unrelated to the problem
>> at hand.
> 
> And that's exactly what I'm saying.  Do the right thing instead of
> whining about backports to old kernels.

Yep we agree on that, that's obvious. What I'm objecting to is your
delivery, which was personal rather than factual, which you should imho
apologize for. It's not that hard to stay on topic and avoid random
statements like that, which are entirely useless, counterproductive, and
just bullshit to be honest.

And honestly pretty tiring that this needs to be said, still. Really.
Christoph Hellwig March 4, 2025, 11:49 p.m. UTC | #14
On Tue, Mar 04, 2025 at 04:43:29PM -0700, Jens Axboe wrote:
> On 3/4/25 4:26 PM, Christoph Hellwig wrote:
> > On Tue, Mar 04, 2025 at 10:36:16AM -0700, Jens Axboe wrote:
> >> stable and actual production certainly do. Not that this should drive
> >> upstream development in any way, it's entirely unrelated to the problem
> >> at hand.
> > 
> > And that's exactly what I'm saying.  Do the right thing instead of
> > whining about backports to old kernels.
> 
> Yep we agree on that, that's obvious. What I'm objecting to is your
> delivery, which was personal rather than factual, which you should imho
> apologize for.

I thus sincerely apologize for flaming Pavel for whining about
backporting, but I'd still prefer he had taken up that proposal
on technical grounds, as I see absolutely no alternative to
synchronously returning an error.

> And honestly pretty tiring that this needs to be said, still. Really.

And I'm really tired of folks whining about backporting instead of
staying on topic.
Christoph Hellwig March 5, 2025, 12:01 a.m. UTC | #15
On Tue, Mar 04, 2025 at 08:35:52PM +0000, Pavel Begunkov wrote:
> Clarification: the mentioned work was reverted or pulled out _upstream_,
> it wasn't about backporting.

I don't think we ever tried synchronous reporting of wouldblock errors,
but maybe I'm just too old and confused by now.

> lines. And Christoph even confirmed that the main check in the patch
> does what's intended,

I absolutely did not.

> Another option is to push all io_uring filesystem / iomap requests
> to the slow path (where blocking is possible) and accept a meaningful
> perf regression for those who still use fs+io_uring direct IO. And
> I'm not being dramatic about it; it's essentially what users
> who detect the problem already do, either from userspace or by
> disabling io_uring altogether.

If you don't want to do synchronous wouldblock errors that's your
only option.  I think it would suck badly, but it's certainly easier
to backport.
Pavel Begunkov March 5, 2025, 12:14 a.m. UTC | #16
On 3/4/25 23:49, Christoph Hellwig wrote:
> On Tue, Mar 04, 2025 at 04:43:29PM -0700, Jens Axboe wrote:
>> On 3/4/25 4:26 PM, Christoph Hellwig wrote:
>>> On Tue, Mar 04, 2025 at 10:36:16AM -0700, Jens Axboe wrote:
>>>> stable and actual production certainly do. Not that this should drive
>>>> upstream development in any way, it's entirely unrelated to the problem
>>>> at hand.
>>>
>>> And that's exactly what I'm saying.  Do the right thing instead of
>>> whining about backports to old kernels.
>>
>> Yep we agree on that, that's obvious. What I'm objecting to is your
>> delivery, which was personal rather than factual, which you should imho
>> apologize for.
> 
> I thus sincerely apologize for flaming Pavel for whining about

Again, pretty interesting choice of words, because it was more akin
to a tantrum. I hope you don't think you'd be taken seriously after
that; I can't, and not only from this conversation.

> backporting, but I'd still prefer he had taken up that proposal
> on technical grounds, as I see absolutely no alternative to
> synchronously returning an error.

The "You have to do X" hardly reveals any details to claim "technical
grounds". Take stable out of the question, and I would still think
that a simple understandable fix for upstream is better than jumping
to a major rework right away.

>> And honestly pretty tiring that this needs to be said, still. Really.
> 
>> And I'm really tired of folks whining about backporting instead of
>> staying on topic.
Pavel Begunkov March 5, 2025, 12:18 a.m. UTC | #17
On 3/5/25 00:14, Pavel Begunkov wrote:
> On 3/4/25 23:49, Christoph Hellwig wrote:
>> On Tue, Mar 04, 2025 at 04:43:29PM -0700, Jens Axboe wrote:
>>> On 3/4/25 4:26 PM, Christoph Hellwig wrote:
>>>> On Tue, Mar 04, 2025 at 10:36:16AM -0700, Jens Axboe wrote:
>>>>> stable and actual production certainly do. Not that this should drive
>>>>> upstream development in any way, it's entirely unrelated to the problem
>>>>> at hand.
>>>>
>>>> And that's exactly what I'm saying.  Do the right thing instead of
>>>> whining about backports to old kernels.
>>>
>>> Yep we agree on that, that's obvious. What I'm objecting to is your
>>> delivery, which was personal rather than factual, which you should imho
>>> apologize for.
>>
>> I thus sincerely apologize for flaming Pavel for whining about
> 
> Again, pretty interesting choice of words, because it was more akin
> to a tantrum. I hope you don't think you'd be taken seriously after
> that; I can't, and not only from this conversation.
> 
>> backporting, but I'd still prefer he had taken up that proposal
>> on technical grounds, as I see absolutely no alternative to
>> synchronously returning an error.
> 
> The "You have to do X" hardly reveals any details to claim "technical
> grounds". Take stable out of the question, and I would still think
> that a simple understandable fix for upstream is better than jumping
> to a major rework right away.

That is _if_ there is a simple fix, but your argument was that
it's "not very accurate", which is not very useful.

>>> And honestly pretty tiring that this needs to be said, still. Really.
>>
>> And I'm really tired of folks whining about backporting instead of
>> staying on topic.
Pavel Begunkov March 5, 2025, 12:45 a.m. UTC | #18
On 3/5/25 00:01, Christoph Hellwig wrote:
> On Tue, Mar 04, 2025 at 08:35:52PM +0000, Pavel Begunkov wrote:
>> Clarification: the mentioned work was reverted or pulled out _upstream_,
>> it wasn't about back porting.
> 
> I don't think we ever tried synchronous reporting of wouldblock errors,
> but maybe I'm just too old and confused by now.

It's not something recent. After some digging I think the
one I remember is

https://lore.kernel.org/all/20190717150445.1131-2-axboe@kernel.dk/

Removed by

commit 7b6620d7db566a46f49b4b9deab9fa061fd4b59b
Author: Jens Axboe <axboe@kernel.dk>
Date:   Thu Aug 15 11:09:16 2019 -0600

     block: remove REQ_NOWAIT_INLINE


>> lines. And Christoph even confirmed that the main check in the patch
>> does what's intended,
> 
> I absolutely did not.
> 
>> Another option is to push all io_uring filesystem / iomap requests
>> to the slow path (where blocking is possible) and accept a meaningful
>> perf regression for those who still use fs+io_uring direct IO. And
>> I'm not being dramatic about it; it's essentially what users
>> who detect the problem already do, either from userspace or by
>> disabling io_uring altogether.
> 
> If you don't want to do synchronous wouldblock errors that's your
> only option.  I think it would suck badly, but it's certainly easier
> to backport.

Is there some intrinsic difference between iomap and the block file
code in block/fops.c? Or is that one broken?
Dave Chinner March 5, 2025, 1:19 a.m. UTC | #19
On Tue, Mar 04, 2025 at 10:47:48PM +0000, Pavel Begunkov wrote:
> On 3/4/25 21:11, Dave Chinner wrote:
> > > @@ -374,6 +379,23 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
> > >   	if (!(dio->flags & (IOMAP_DIO_INLINE_COMP|IOMAP_DIO_CALLER_COMP)))
> > >   		dio->iocb->ki_flags &= ~IOCB_HIPRI;
> > > +	bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua, atomic);
> > > +
> > > +	if (!is_sync_kiocb(dio->iocb) && (dio->iocb->ki_flags & IOCB_NOWAIT)) {
> > > +		/*
> > > +		 * This is nonblocking IO, and we might need to allocate
> > > +		 * multiple bios. In this case, as we cannot guarantee that
> > > +		 * one of the sub bios will not fail getting issued FOR NOWAIT
> > > +		 * and as error results are coalesced across all of them, ask
> > > +		 * for a retry of this from blocking context.
> > > +		 */
> > > +		if (bio_iov_vecs_to_alloc(dio->submit.iter, BIO_MAX_VECS + 1) >
> > > +					  BIO_MAX_VECS)
> > > +			return -EAGAIN;
> > > +
> > > +		bio_opf |= REQ_NOWAIT;
> > > +	}
> > 
> > Ok, so this allows a max sized bio to be used. So, what, 1MB on 4kB
> > page size is the maximum IO size for IOCB_NOWAIT now? I bet that's
> > not documented anywhere....
> > 
> > Ah. This doesn't fix the problem at all.
> > 
> > Say, for example, I have a hardware storage device with a max
> > hardware IO size of 128kB. This is from the workstation I'm typing
> > this email on:
> > 
> > $ cat /sys/block/nvme0n1/queue/max_hw_sectors_kb
> > 128
> > $  cat /sys/block/nvme0n1/queue/max_segments
> > 33
> > $
> > 
> > We build a 1MB bio above, set REQ_NOWAIT, then:
> > 
> > submit_bio
> >    ....
> >    blk_mq_submit_bio
> >      __bio_split_to_limits(bio, &q->limits, &nr_segs);
> >        bio_split_rw()
> >          .....
> >          split:
> > 	.....
> >          /*
> >           * We can't sanely support splitting for a REQ_NOWAIT bio. End it
> >           * with EAGAIN if splitting is required and return an error pointer.
> >           */
> >          if (bio->bi_opf & REQ_NOWAIT)
> >                  return -EAGAIN;
> > 
> > 
> > So, REQ_NOWAIT effectively limits bio submission to the maximum
> > single IO size of the underlying storage. So, we can't use
> > REQ_NOWAIT without actually looking at request queue limits before
> > we start building the IO - yuk.
> > 
> > REQ_NOWAIT still feels completely unusable to me....
> 
> Not great, but better than not using the async path at all (i.e. executing
> via a kernel worker) as far as io_uring is concerned.

I really don't care about what io_uring thinks or does. If the block
layer REQ_NOWAIT semantics are unusable for non-blocking IO
submission, then that's the problem that needs fixing. This isn't a
problem we can (or should) try to work around in the iomap layer.

> It should cover a good
> share of users, and io_uring has a fallback path, so userspace won't even
> know about the error. The per-byte overhead also bites less at 128K.

That 128kB limit I demonstrated is not a fixed number. Stacking
block devices (e.g. software RAID devices) can split bios at any
sector offset within the submitted bio.

For example: we have RAID5 with a 64kB chunk size, so max REQ_NOWAIT
io size is 64kB according to the queue limits. However, if we do a
64kB IO at a 60kB chunk offset, that bio is going to be split into a
4kB bio and a 60kB bio because they are issued to different physical
devices.....

There is no way the bio submitter can know that this behaviour will
occur, nor should they even be attempting to predict when/if such
splitting may occur.

> The patch already limits the change to async users only, but if you're
> concerned about non-io_uring users, I can try to narrow it to io_uring
> requests only, even though I don't think it's a good way.

I'm concerned about any calling context that sets IOCB_NOWAIT. I
don't care if it's io_uring, AIO or a synchronous
preadv2(RWF_NOWAIT) call: we still have the same issue that
REQ_NOWAIT submission-side -EAGAIN is only reported through IO
completion callbacks.

> Are you only concerned about the size being too restrictive or do you
> see any other problems?

I'm concerned about the fact that REQ_NOWAIT is not usable as it
stands. We've identified bio chaining as an issue, now bio splitting
is an issue, and I'm sure if we look further there will be other
cases that are issues (e.g. bounce buffers).

The underlying problem here is that bio submission errors are
reported through bio completion mechanisms, not directly back to the
submitting context. Fix that problem in the block layer API, and
then iomap can use REQ_NOWAIT without having to care about what the
block layer is doing under the covers.

-Dave.
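
A sketch of the synchronous error reporting being asked for here (the
function name and shape are invented; only the API shape is the point):
submission-time failures such as BLK_STS_AGAIN would come back as a
return value, while errors after submission still arrive via
->bi_end_io.

#include <linux/bio.h>
#include <linux/blkdev.h>

/*
 * Hypothetical: like submit_bio(), but returns submission-time
 * failures (e.g. BLK_STS_AGAIN from tag allocation or splitting)
 * directly to the caller.
 */
blk_status_t submit_bio_sync_err(struct bio *bio);

static int iomap_submit_bio_nowait(struct bio *bio)
{
	blk_status_t sts = submit_bio_sync_err(bio);

	if (sts == BLK_STS_AGAIN) {
		bio_put(bio);	/* nothing was issued */
		return -EAGAIN;	/* caller retries from blocking context */
	}
	return blk_status_to_errno(sts);
}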
Christoph Hellwig March 5, 2025, 1:34 a.m. UTC | #20
On Wed, Mar 05, 2025 at 12:45:52AM +0000, Pavel Begunkov wrote:
> It's not something recent. After some digging I think the
> one I remember is
> 
> https://lore.kernel.org/all/20190717150445.1131-2-axboe@kernel.dk/
> 
> Removed by
> 
> commit 7b6620d7db566a46f49b4b9deab9fa061fd4b59b
> Author: Jens Axboe <axboe@kernel.dk>
> Date:   Thu Aug 15 11:09:16 2019 -0600
> 
>     block: remove REQ_NOWAIT_INLINE

That does indeed look pretty broken, mostly because it doesn't actually
return the errors inline as return values but tries to emulate it.

> > If you don't want to do synchronous wouldblock errors that's your
> > only option.  I think it would suck badly, but it's certainly easier
> > to backport.
> 
> Is there some intrinsic difference between iomap and the block file
> code in block/fops.c? Or is that one broken?

block/fops.c has roughly the same issue, but it's not anywhere near
as severe, as block/fops.c doesn't have any file system state that
gets confused by the async BLK_STS_AGAIN.  But unless io_uring knows
how to rewind the state for that case it probably also doesn't work
properly.  It might be worth looking into testcases that actually
exercise nowait I/O on block devices with annoyingly small limits
or that split a lot (like RAID0).
Christoph Hellwig March 5, 2025, 2:10 p.m. UTC | #21
On Wed, Mar 05, 2025 at 12:19:46PM +1100, Dave Chinner wrote:
> I really don't care about what io_uring thinks or does. If the block
> layer REQ_NOWAIT semantics are unusable for non-blocking IO
> submission, then that's the problem that needs fixing. This isn't a
> problem we can (or should) try to work around in the iomap layer.

Agreed.  The problem are the block layer semantics.  iomap/xfs really
just is the messenger here.

> For example: we have RAID5 with a 64kB chunk size, so max REQ_NOWAIT
> io size is 64kB according to the queue limits. However, if we do a
> 64kB IO at a 60kB chunk offset, that bio is going to be split into a
> 4kB bio and a 60kB bio because they are issued to different physical
> devices.....
> 
> There is no way the bio submitter can know that this behaviour will
> occur, nor should they even be attempting to predict when/if such
> splitting may occur.

And for something that has a real block allocator it could also be
entirely dynamic.  But I'm not sure if dm-thinp or bcache do anything
like that at the moment.

> > Are you only concerned about the size being too restrictive or do you
> > see any other problems?
> 
> I'm concerned about the fact that REQ_NOWAIT is not usable as it
> stands. We've identified bio chaining as an issue, now bio splitting
> is an issue, and I'm sure if we look further there will be other
> cases that are issues (e.g. bounce buffers).
> 
> The underlying problem here is that bio submission errors are
> reported through bio completion mechanisms, not directly back to the
> submitting context. Fix that problem in the block layer API, and
> then iomap can use REQ_NOWAIT without having to care about what the
> block layer is doing under the covers.

Exactly.  Either they need to be reported synchronously, or maybe we
need a block layer hook in bio_endio that retries the given bio on a
workqueue without ever bubbling up to the caller.  But allowing delayed
BLK_STS_AGAIN is going to mess up any non-trivial caller.  But even
for the plain block device it will cause duplicate I/O where some
blocks have already been read/written and then will get resubmitted.

I'm not sure that breaks any atomicity assumptions as we don't really
give explicit ones for block devices (except maybe for the new
RWF_ATOMIC flag?), but it certainly is unexpected and suboptimal.
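
The bio_endio hook floated above could look roughly like the following
(all names are invented; this is only an illustration of the idea, not
proposed code):

#include <linux/bio.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

/*
 * Hypothetical retry context: instead of completing a nowait bio with
 * BLK_STS_AGAIN, strip REQ_NOWAIT and resubmit it from a workqueue so
 * the wouldblock error never reaches the caller through the
 * completion path.
 */
struct bio_nowait_retry {
	struct work_struct work;
	struct bio *bio;
};

static void bio_nowait_retry_fn(struct work_struct *work)
{
	struct bio_nowait_retry *r =
		container_of(work, struct bio_nowait_retry, work);

	r->bio->bi_opf &= ~REQ_NOWAIT;	/* this time, allow blocking */
	submit_bio(r->bio);
	kfree(r);
}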

Patch

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index b521eb15759e..07c336fdf4f0 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -363,9 +363,14 @@  static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 	 */
 	if (need_zeroout ||
 	    ((dio->flags & IOMAP_DIO_NEED_SYNC) && !use_fua) ||
-	    ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode)))
+	    ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode))) {
 		dio->flags &= ~IOMAP_DIO_CALLER_COMP;
 
+		if (!is_sync_kiocb(dio->iocb) &&
+		    (dio->iocb->ki_flags & IOCB_NOWAIT))
+			return -EAGAIN;
+	}
+
 	/*
 	 * The rules for polled IO completions follow the guidelines as the
 	 * ones we set for inline and deferred completions. If none of those
@@ -374,6 +379,23 @@  static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 	if (!(dio->flags & (IOMAP_DIO_INLINE_COMP|IOMAP_DIO_CALLER_COMP)))
 		dio->iocb->ki_flags &= ~IOCB_HIPRI;
 
+	bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua, atomic);
+
+	if (!is_sync_kiocb(dio->iocb) && (dio->iocb->ki_flags & IOCB_NOWAIT)) {
+		/*
+		 * This is nonblocking IO, and we might need to allocate
+		 * multiple bios. In this case, as we cannot guarantee that
+		 * one of the sub bios will not fail getting issued FOR NOWAIT
+		 * and as error results are coalesced across all of them, ask
+		 * for a retry of this from blocking context.
+		 */
+		if (bio_iov_vecs_to_alloc(dio->submit.iter, BIO_MAX_VECS + 1) >
+					  BIO_MAX_VECS)
+			return -EAGAIN;
+
+		bio_opf |= REQ_NOWAIT;
+	}
+
 	if (need_zeroout) {
 		/* zero out from the start of the block to the write offset */
 		pad = pos & (fs_block_size - 1);
@@ -383,8 +405,6 @@  static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 			goto out;
 	}
 
-	bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua, atomic);
-
 	nr_pages = bio_iov_vecs_to_alloc(dio->submit.iter, BIO_MAX_VECS);
 	do {
 		size_t n;