
[4/8] iomap: complete polled IO inline

Message ID 20230720181310.71589-5-axboe@kernel.dk (mailing list archive)
State Superseded
Series Improve async iomap DIO performance

Commit Message

Jens Axboe July 20, 2023, 6:13 p.m. UTC
Polled IO is only allowed for conditions where task completion is safe
anyway, so we can always complete it inline. This cannot easily be
checked with a submission side flag, as the block layer may clear the
polled flag and turn it into a regular IO instead. Hence we need to
check this at completion time. If REQ_POLLED is still set, then we know
that this IO was successfully polled, and is completing in task context.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/iomap/direct-io.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

Comments

Christoph Hellwig July 21, 2023, 6:16 a.m. UTC | #1
Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>
Darrick J. Wong July 21, 2023, 3:19 p.m. UTC | #2
On Thu, Jul 20, 2023 at 12:13:06PM -0600, Jens Axboe wrote:
> Polled IO is only allowed for conditions where task completion is safe
> anyway, so we can always complete it inline. This cannot easily be
> checked with a submission side flag, as the block layer may clear the
> polled flag and turn it into a regular IO instead. Hence we need to
> check this at completion time. If REQ_POLLED is still set, then we know
> that this IO was successfully polled, and is completing in task context.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  fs/iomap/direct-io.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 9f97d0d03724..c3ea1839628f 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -173,9 +173,19 @@ void iomap_dio_bio_end_io(struct bio *bio)
>  	}
>  
>  	/*
> -	 * Flagged with IOMAP_DIO_INLINE_COMP, we can complete it inline
> +	 * Flagged with IOMAP_DIO_INLINE_COMP, we can complete it inline.
> +	 * Ditto for polled requests - if the flag is still set at completion
> +	 * time, then we know the request was actually polled and completion

Glad you added the comment here pointing out that REQ_POLLED must
*still* be set after the bio has been executed, because that was the
only question I had about this patch.

> +	 * is called from the task itself. This is why we need to check it
> +	 * here rather than flag it at issue time.
>  	 */
> -	if (dio->flags & IOMAP_DIO_INLINE_COMP) {
> +	if ((dio->flags & IOMAP_DIO_INLINE_COMP) || (bio->bi_opf & REQ_POLLED)) {
> +		/*
> +		 * For polled IO, we need to clear ->private as it points to
> +		 * the bio being polled for. The completion side uses it to
> +		 * know if a given request has been found yet or not. For
> +		 * non-polled IO, ->private isn't applicable.

Thanks for the clarifying note here too.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D


> +		 */
>  		WRITE_ONCE(iocb->private, NULL);
>  		iomap_dio_complete_work(&dio->aio.work);
>  		goto release_bio;
> -- 
> 2.40.1
>
Dave Chinner July 21, 2023, 9:43 p.m. UTC | #3
On Thu, Jul 20, 2023 at 12:13:06PM -0600, Jens Axboe wrote:
> Polled IO is only allowed for conditions where task completion is safe
> anyway, so we can always complete it inline. This cannot easily be
> checked with a submission side flag, as the block layer may clear the
> polled flag and turn it into a regular IO instead. Hence we need to
> check this at completion time. If REQ_POLLED is still set, then we know
> that this IO was successfully polled, and is completing in task context.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  fs/iomap/direct-io.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 9f97d0d03724..c3ea1839628f 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -173,9 +173,19 @@ void iomap_dio_bio_end_io(struct bio *bio)
>  	}
>  
>  	/*
> -	 * Flagged with IOMAP_DIO_INLINE_COMP, we can complete it inline
> +	 * Flagged with IOMAP_DIO_INLINE_COMP, we can complete it inline.
> +	 * Ditto for polled requests - if the flag is still set at completion
> +	 * time, then we know the request was actually polled and completion
> +	 * is called from the task itself. This is why we need to check it
> +	 * here rather than flag it at issue time.
>  	 */
> -	if (dio->flags & IOMAP_DIO_INLINE_COMP) {
> +	if ((dio->flags & IOMAP_DIO_INLINE_COMP) || (bio->bi_opf & REQ_POLLED)) {

This still smells wrong to me. Let me see if I can work out why...

<spelunk!>

When we set up the IO in iomap_dio_bio_iter(), we do this:

        /*
         * We can only poll for single bio I/Os.
         */
        if (need_zeroout ||
            ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode)))
                dio->iocb->ki_flags &= ~IOCB_HIPRI;

The "need_zeroout" covers writes into unwritten regions that require
conversion at IO completion, and the latter check is for writes
extending EOF. i.e. this covers the cases where we have dirty
metadata for this specific write and so may need transactions or
journal/metadata IO during IO completion.

The only case it doesn't cover is clearing IOCB_HIPRI for O_DSYNC IO
that may require a call to generic_write_sync() in completion. That
is, if we aren't using FUA, we will not have IOMAP_DIO_INLINE_COMP
set, but will still do polled IO.

I think this is a bug. We don't want to be issuing more IO in
REQ_POLLED task context during IO completion, and O_DSYNC IO
completion for non-FUA IO requires a journal flush and that can
issue lots of journal IO and wait on it in completion process.

Hence I think we should only be setting REQ_POLLED in the cases
where IOCB_HIPRI and IOMAP_DIO_INLINE_COMP are both set.  If
IOMAP_DIO_INLINE_COMP is set on the dio, then it doesn't matter what
context we are in at completion time or whether REQ_POLLED was set
or cleared during the IO....

That means the above check should be:

        /*
         * We can only poll for single bio I/Os that can run inline
	 * completion.
         */
        if (need_zeroout ||
	    (iocb_is_dsync(dio->iocb) && !use_fua) ||
            ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode)))
                dio->iocb->ki_flags &= ~IOCB_HIPRI;

or if we change the logic such that we calculate IOMAP_DIO_INLINE_COMP
first:

	if (!(dio->flags & IOMAP_DIO_INLINE_COMP))
		dio->iocb->ki_flags &= ~IOCB_HIPRI;

Then we don't need to care about polled IO on the completion side at
all at the iomap layer because it doesn't change the completion
requirements at all...

Cheers,

Dave.
Jens Axboe July 22, 2023, 3:10 a.m. UTC | #4
On 7/21/23 3:43 PM, Dave Chinner wrote:
> On Thu, Jul 20, 2023 at 12:13:06PM -0600, Jens Axboe wrote:
>> Polled IO is only allowed for conditions where task completion is safe
>> anyway, so we can always complete it inline. This cannot easily be
>> checked with a submission side flag, as the block layer may clear the
>> polled flag and turn it into a regular IO instead. Hence we need to
>> check this at completion time. If REQ_POLLED is still set, then we know
>> that this IO was successfully polled, and is completing in task context.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>  fs/iomap/direct-io.c | 14 ++++++++++++--
>>  1 file changed, 12 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
>> index 9f97d0d03724..c3ea1839628f 100644
>> --- a/fs/iomap/direct-io.c
>> +++ b/fs/iomap/direct-io.c
>> @@ -173,9 +173,19 @@ void iomap_dio_bio_end_io(struct bio *bio)
>>  	}
>>  
>>  	/*
>> -	 * Flagged with IOMAP_DIO_INLINE_COMP, we can complete it inline
>> +	 * Flagged with IOMAP_DIO_INLINE_COMP, we can complete it inline.
>> +	 * Ditto for polled requests - if the flag is still set at completion
>> +	 * time, then we know the request was actually polled and completion
>> +	 * is called from the task itself. This is why we need to check it
>> +	 * here rather than flag it at issue time.
>>  	 */
>> -	if (dio->flags & IOMAP_DIO_INLINE_COMP) {
>> +	if ((dio->flags & IOMAP_DIO_INLINE_COMP) || (bio->bi_opf & REQ_POLLED)) {
> 
> This still smells wrong to me. Let me see if I can work out why...
> 
> <spelunk!>
> 
> When we set up the IO in iomap_dio_bio_iter(), we do this:
> 
>         /*
>          * We can only poll for single bio I/Os.
>          */
>         if (need_zeroout ||
>             ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode)))
>                 dio->iocb->ki_flags &= ~IOCB_HIPRI;
> 
> The "need_zeroout" covers writes into unwritten regions that require
> conversion at IO completion, and the latter check is for writes
> extending EOF. i.e. this covers the cases where we have dirty
> metadata for this specific write and so may need transactions or
> journal/metadata IO during IO completion.
> 
> The only case it doesn't cover is clearing IOCB_HIPRI for O_DSYNC IO
> that may require a call to generic_write_sync() in completion. That
> is, if we aren't using FUA, will not have IOMAP_DIO_INLINE_COMP set,
> but still do polled IO.
> 
> I think this is a bug. We don't want to be issuing more IO in
> REQ_POLLED task context during IO completion, and O_DSYNC IO
> completion for non-FUA IO requires a journal flush and that can
> issue lots of journal IO and wait on it in completion process.
> 
> Hence I think we should only be setting REQ_POLLED in the cases
> where IOCB_HIPRI and IOMAP_DIO_INLINE_COMP are both set.  If
> IOMAP_DIO_INLINE_COMP is set on the dio, then it doesn't matter what
> context we are in at completion time or whether REQ_POLLED was set
> or cleared during the IO....
> 
> That means the above check should be:
> 
>         /*
>          * We can only poll for single bio I/Os that can run inline
> 	 * completion.
>          */
>         if (need_zeroout ||
> 	    (iocb_is_dsync(dio->iocb) && !use_fua) ||
>             ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode)))
>                 dio->iocb->ki_flags &= ~IOCB_HIPRI;

Looks like you are right, it would not be a great idea to handle that
off polled IO completion. It'd work just fine, but anything generating
more IO should go to a helper. I'll make that change.

> or if we change the logic such that we calculate IOMAP_DIO_INLINE_COMP
> first:
> 
> 	if (!(dio->flags & IOMAP_DIO_INLINE_COMP))
> 		dio->iocb->ki_flags &= ~IOCB_HIPRI;
> 
> Then we don't need to care about polled IO on the completion side at
> all at the iomap layer because it doesn't change the completion
> requirements at all...

That still isn't true, because you can still happily issue it as polled
IO and get the flag cleared, ending up with an IRQ-based completion. This
would work for most cases, but e.g. the xfs dio end_io handler will grab:

spin_lock(&ip->i_flags_lock);

if the inode got truncated. Maybe that can't happen because we did
inode_dio_begin() higher up? Still seems saner to check for the polled
flag at completion to me...
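
For context, the lock in question serialises in-core inode size updates
at DIO write completion. A simplified sketch of that pattern, loosely
following xfs_dio_write_end_io() (illustrative, not the verbatim XFS
code):

        /*
         * If this write extended EOF, update the in-core inode size.
         * The spinlock serialises racing IO completions updating the
         * size, which is why the completion context matters.
         */
        if (offset + size > i_size_read(inode)) {
                spin_lock(&ip->i_flags_lock);
                if (offset + size > i_size_read(inode))
                        i_size_write(inode, offset + size);
                spin_unlock(&ip->i_flags_lock);
        }
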
Dave Chinner July 22, 2023, 11:05 p.m. UTC | #6
On Fri, Jul 21, 2023 at 09:10:57PM -0600, Jens Axboe wrote:
> On 7/21/23 3:43 PM, Dave Chinner wrote:
> > On Thu, Jul 20, 2023 at 12:13:06PM -0600, Jens Axboe wrote:
> >> Polled IO is only allowed for conditions where task completion is safe
> >> anyway, so we can always complete it inline. This cannot easily be
> >> checked with a submission side flag, as the block layer may clear the
> >> polled flag and turn it into a regular IO instead. Hence we need to
> >> check this at completion time. If REQ_POLLED is still set, then we know
> >> that this IO was successfully polled, and is completing in task context.
> >>
> >> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> >> ---
> >>  fs/iomap/direct-io.c | 14 ++++++++++++--
> >>  1 file changed, 12 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> >> index 9f97d0d03724..c3ea1839628f 100644
> >> --- a/fs/iomap/direct-io.c
> >> +++ b/fs/iomap/direct-io.c
> >> @@ -173,9 +173,19 @@ void iomap_dio_bio_end_io(struct bio *bio)
> >>  	}
> >>  
> >>  	/*
> >> -	 * Flagged with IOMAP_DIO_INLINE_COMP, we can complete it inline
> >> +	 * Flagged with IOMAP_DIO_INLINE_COMP, we can complete it inline.
> >> +	 * Ditto for polled requests - if the flag is still set at completion
> >> +	 * time, then we know the request was actually polled and completion
> >> +	 * is called from the task itself. This is why we need to check it
> >> +	 * here rather than flag it at issue time.
> >>  	 */
> >> -	if (dio->flags & IOMAP_DIO_INLINE_COMP) {
> >> +	if ((dio->flags & IOMAP_DIO_INLINE_COMP) || (bio->bi_opf & REQ_POLLED)) {
> > 
> > This still smells wrong to me. Let me see if I can work out why...
> > 
> > <spelunk!>
> > 
> > When we set up the IO in iomap_dio_bio_iter(), we do this:
> > 
> >         /*
> >          * We can only poll for single bio I/Os.
> >          */
> >         if (need_zeroout ||
> >             ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode)))
> >                 dio->iocb->ki_flags &= ~IOCB_HIPRI;
> > 
> > The "need_zeroout" covers writes into unwritten regions that require
> > conversion at IO completion, and the latter check is for writes
> > extending EOF. i.e. this covers the cases where we have dirty
> > metadata for this specific write and so may need transactions or
> > journal/metadata IO during IO completion.
> > 
> > The only case it doesn't cover is clearing IOCB_HIPRI for O_DSYNC IO
> > that may require a call to generic_write_sync() in completion. That
> > is, if we aren't using FUA, we will not have IOMAP_DIO_INLINE_COMP
> > set, but will still do polled IO.
> > 
> > I think this is a bug. We don't want to be issuing more IO in
> > REQ_POLLED task context during IO completion, and O_DSYNC IO
> > completion for non-FUA IO requires a journal flush and that can
> > issue lots of journal IO and wait on it in completion process.
> > 
> > Hence I think we should only be setting REQ_POLLED in the cases
> > where IOCB_HIPRI and IOMAP_DIO_INLINE_COMP are both set.  If
> > IOMAP_DIO_INLINE_COMP is set on the dio, then it doesn't matter what
> > context we are in at completion time or whether REQ_POLLED was set
> > or cleared during the IO....
> > 
> > That means the above check should be:
> > 
> >         /*
> >          * We can only poll for single bio I/Os that can run inline
> > 	 * completion.
> >          */
> >         if (need_zeroout ||
> > 	    (iocb_is_dsync(dio->iocb) && !use_fua) ||
> >             ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode)))
> >                 dio->iocb->ki_flags &= ~IOCB_HIPRI;
> 
> Looks like you are right, it would not be a great idea to handle that
> off polled IO completion. It'd work just fine, but anything generating
> more IO should go to a helper. I'll make that change.
> 
> > or if we change the logic such that we calculate IOMAP_DIO_INLINE_COMP
> > first:
> > 
> > 	if (!(dio->flags & IOMAP_DIO_INLINE_COMP))
> > 		dio->iocb->ki_flags &= ~IOCB_HIPRI;
> > 
> > Then we don't need to care about polled IO on the completion side at
> > all at the iomap layer because it doesn't change the completion
> > requirements at all...
> 
> That still isn't true, because you can still happily issue it as polled
> IO and get the flag cleared, ending up with an IRQ-based completion. This
> would work for most cases, but e.g. the xfs dio end_io handler will grab:
> 
> spin_lock(&ip->i_flags_lock);
> 
> if the inode got truncated. Maybe that can't happen because we did
> inode_dio_begin() higher up?

Yes, truncate, hole punch, etc. block on inode_dio_wait() with the
i_rwsem held, which means it blocks new DIO submissions and waits
for all in-flight DIO to drain before the truncate operation starts.
inode_dio_complete() does not get called until after the filesystem
->endio completion has run, so there's no possibility of
truncate-like operations actually racing with DIO completion at
all...
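
In code form, the ordering described above, using the generic inode DIO
accounting helpers (a sketch of the synchronisation pattern, not the
literal XFS call chain):

        /* DIO submission side */
        inode_dio_begin(inode);         /* bump i_dio_count */
        submit_bio(bio);
        /* ->end_io runs the filesystem completion work, then: */
        inode_dio_end(inode);           /* drop i_dio_count, wake waiters */

        /* truncate / hole punch side */
        inode_lock(inode);              /* i_rwsem blocks new DIO submission */
        inode_dio_wait(inode);          /* wait for i_dio_count to reach zero */
        /* now safe to change layout/size */
        inode_unlock(inode);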

> Still seems saner to check for the polled
> flag at completion to me...

I disagree. If truncate (or anything that removes extents or reduces
inode size) is running whilst DIO to that range is still in
progress, we have a use-after-free situation that will cause data
and/or filesystem corruption. It's just not a safe thing to allow,
so we prevent it from occurring at a high level in the filesystem
and the result is that low level IO code just doesn't need to
care about races with layout/size changing operations...

-Dave.
Jens Axboe July 24, 2023, 10:35 p.m. UTC | #7
On 7/22/23 5:05 PM, Dave Chinner wrote:
> On Fri, Jul 21, 2023 at 09:10:57PM -0600, Jens Axboe wrote:
>> On 7/21/23 3:43 PM, Dave Chinner wrote:
>>> On Thu, Jul 20, 2023 at 12:13:06PM -0600, Jens Axboe wrote:
>>>> Polled IO is only allowed for conditions where task completion is safe
>>>> anyway, so we can always complete it inline. This cannot easily be
>>>> checked with a submission side flag, as the block layer may clear the
>>>> polled flag and turn it into a regular IO instead. Hence we need to
>>>> check this at completion time. If REQ_POLLED is still set, then we know
>>>> that this IO was successfully polled, and is completing in task context.
>>>>
>>>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>>> ---
>>>>  fs/iomap/direct-io.c | 14 ++++++++++++--
>>>>  1 file changed, 12 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
>>>> index 9f97d0d03724..c3ea1839628f 100644
>>>> --- a/fs/iomap/direct-io.c
>>>> +++ b/fs/iomap/direct-io.c
>>>> @@ -173,9 +173,19 @@ void iomap_dio_bio_end_io(struct bio *bio)
>>>>  	}
>>>>  
>>>>  	/*
>>>> -	 * Flagged with IOMAP_DIO_INLINE_COMP, we can complete it inline
>>>> +	 * Flagged with IOMAP_DIO_INLINE_COMP, we can complete it inline.
>>>> +	 * Ditto for polled requests - if the flag is still set at completion
>>>> +	 * time, then we know the request was actually polled and completion
>>>> +	 * is called from the task itself. This is why we need to check it
>>>> +	 * here rather than flag it at issue time.
>>>>  	 */
>>>> -	if (dio->flags & IOMAP_DIO_INLINE_COMP) {
>>>> +	if ((dio->flags & IOMAP_DIO_INLINE_COMP) || (bio->bi_opf & REQ_POLLED)) {
>>>
>>> This still smells wrong to me. Let me see if I can work out why...
>>>
>>> <spelunk!>
>>>
>>> When we set up the IO in iomap_dio_bio_iter(), we do this:
>>>
>>>         /*
>>>          * We can only poll for single bio I/Os.
>>>          */
>>>         if (need_zeroout ||
>>>             ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode)))
>>>                 dio->iocb->ki_flags &= ~IOCB_HIPRI;
>>>
>>> The "need_zeroout" covers writes into unwritten regions that require
>>> conversion at IO completion, and the latter check is for writes
>>> extending EOF. i.e. this covers the cases where we have dirty
>>> metadata for this specific write and so may need transactions or
>>> journal/metadata IO during IO completion.
>>>
>>> The only case it doesn't cover is clearing IOCB_HIPRI for O_DSYNC IO
>>> that may require a call to generic_write_sync() in completion. That
>>> is, if we aren't using FUA, we will not have IOMAP_DIO_INLINE_COMP
>>> set, but will still do polled IO.
>>>
>>> I think this is a bug. We don't want to be issuing more IO in
>>> REQ_POLLED task context during IO completion, and O_DSYNC IO
>>> completion for non-FUA IO requires a journal flush and that can
>>> issue lots of journal IO and wait on it in completion process.
>>>
>>> Hence I think we should only be setting REQ_POLLED in the cases
>>> where IOCB_HIPRI and IOMAP_DIO_INLINE_COMP are both set.  If
>>> IOMAP_DIO_INLINE_COMP is set on the dio, then it doesn't matter what
>>> context we are in at completion time or whether REQ_POLLED was set
>>> or cleared during the IO....
>>>
>>> That means the above check should be:
>>>
>>>         /*
>>>          * We can only poll for single bio I/Os that can run inline
>>> 	 * completion.
>>>          */
>>>         if (need_zeroout ||
>>> 	    (iocb_is_dsync(dio->iocb) && !use_fua) ||
>>>             ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode)))
>>>                 dio->iocb->ki_flags &= ~IOCB_HIPRI;
>>
>> Looks like you are right, it would not be a great idea to handle that
>> off polled IO completion. It'd work just fine, but anything generating
>> more IO should go to a helper. I'll make that change.
>>
>>> or if we change the logic such that we calculate IOMAP_DIO_INLINE_COMP
>>> first:
>>>
>>> 	if (!(dio->flags & IOMAP_DIO_INLINE_COMP))
>>> 		dio->iocb->ki_flags &= ~IOCB_HIPRI;
>>>
>>> Then we don't need to care about polled IO on the completion side at
>>> all at the iomap layer because it doesn't change the completion
>>> requirements at all...
>>
>> That still isn't true, because you can still happily issue it as polled
>> IO and get the flag cleared, ending up with an IRQ-based completion. This
>> would work for most cases, but e.g. the xfs dio end_io handler will grab:
>>
>> spin_lock(&ip->i_flags_lock);
>>
>> if the inode got truncated. Maybe that can't happen because we did
>> inode_dio_begin() higher up?
> 
> Yes, truncate, hole punch, etc. block on inode_dio_wait() with the
> i_rwsem held, which means it blocks new DIO submissions and waits
> for all in-flight DIO to drain before the truncate operation starts.
> inode_dio_complete() does not get called until after the filesystem
> ->endio completion has run, so there's no possibility of
> truncate-like operations actually racing with DIO completion at
> all...

OK so that part is all fine at least. This means we'll never hit that
lock, as we know we'll be within the range.

>> Still seems saner to check for the polled
>> flag at completion to me...
> 
> I disagree. If truncate (or anything that removes extents or reduces
> inode size) is running whilst DIO to that range is still in
> progress, we have a use-after-free situation that will cause data
> and/or filesystem corruption. It's just not a safe thing to allow,
> so we prevent it from occurring at a high level in the filesystem
> and the result is that low level IO code just doesn't need to
> care about races with layout/size changing operations...

I looked at the rest of the completion side to ensure that writes would
be sane. And one thing that is problematic _without_ checking for POLLED
is invalidating inode pages post IO completion. That can happen if
someone is doing buffered IO on the same file, which arguably is silly
and not really supported, but we need to do it, and it needs task context
to do so.
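
Roughly the shape of the invalidation in question, as done from
iomap_dio_complete() for writes (simplified sketch, offsets reduced
to page units):

        /*
         * Kick any page cache that raced with the direct write over
         * the written range. invalidate_inode_pages2_range() can
         * block, so this needs task context, not IRQ context.
         */
        if ((dio->flags & IOMAP_DIO_WRITE) && dio->size)
                invalidate_inode_pages2_range(inode->i_mapping,
                                offset >> PAGE_SHIFT,
                                (offset + dio->size - 1) >> PAGE_SHIFT);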

So I think I'll just simplify things a bit - get rid of the inline
completion for writes, and only have them be doable via CALLER_COMP
(which was DEFER_COMP before). Then we always have task context. This
doesn't change the other side; we'll still only do them for the cases
outlined. But it does mean that non-CALLER_COMP can't ever set
INLINE_COMP, and that latter flag just goes away.
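
A sketch of the completion shape that plan implies (IOMAP_DIO_CALLER_COMP
follows the rename described above; the deferred-completion hook name is
an assumption based on later revisions of this series):

        if (dio->flags & IOMAP_DIO_INLINE_COMP) {
                /* now reads only - no further IO can be generated here */
                WRITE_ONCE(iocb->private, NULL);
                iomap_dio_complete_work(&dio->aio.work);
                goto release_bio;
        }

        if (dio->flags & IOMAP_DIO_CALLER_COMP) {
                /*
                 * Writes: hand the dio back to the submitting task,
                 * which runs ->dio_complete from task context.
                 */
                iocb->private = dio;
                iocb->dio_complete = iomap_dio_deferred_complete;
                iocb->ki_complete(iocb, 0);
                goto release_bio;
        }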

Patch

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 9f97d0d03724..c3ea1839628f 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -173,9 +173,19 @@  void iomap_dio_bio_end_io(struct bio *bio)
 	}
 
 	/*
-	 * Flagged with IOMAP_DIO_INLINE_COMP, we can complete it inline
+	 * Flagged with IOMAP_DIO_INLINE_COMP, we can complete it inline.
+	 * Ditto for polled requests - if the flag is still set at completion
+	 * time, then we know the request was actually polled and completion
+	 * is called from the task itself. This is why we need to check it
+	 * here rather than flag it at issue time.
 	 */
-	if (dio->flags & IOMAP_DIO_INLINE_COMP) {
+	if ((dio->flags & IOMAP_DIO_INLINE_COMP) || (bio->bi_opf & REQ_POLLED)) {
+		/*
+		 * For polled IO, we need to clear ->private as it points to
+		 * the bio being polled for. The completion side uses it to
+		 * know if a given request has been found yet or not. For
+		 * non-polled IO, ->private isn't applicable.
+		 */
 		WRITE_ONCE(iocb->private, NULL);
 		iomap_dio_complete_work(&dio->aio.work);
 		goto release_bio;