[6/6] iomap: support IOCB_DIO_DEFER

Message ID	20230719195417.1704513-7-axboe@kernel.dk (mailing list archive)
State	New
Headers	show Return-Path: <io-uring-owner@vger.kernel.org> From: Jens Axboe <axboe@kernel.dk> To: io-uring@vger.kernel.org, linux-xfs@vger.kernel.org Cc: hch@lst.de, andres@anarazel.de, david@fromorbit.com, Jens Axboe <axboe@kernel.dk> Subject: [PATCH 6/6] iomap: support IOCB_DIO_DEFER Date: Wed, 19 Jul 2023 13:54:17 -0600 Message-Id: <20230719195417.1704513-7-axboe@kernel.dk> In-Reply-To: <20230719195417.1704513-1-axboe@kernel.dk> References: <20230719195417.1704513-1-axboe@kernel.dk> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	Improve async iomap DIO performance \| expand [PATCHSET,v3,0/6] Improve async iomap DIO performance [1/6] iomap: cleanup up iomap_dio_bio_end_io() [2/6] iomap: add IOMAP_DIO_INLINE_COMP [3/6] iomap: treat a write through cache the same as FUA [4/6] fs: add IOCB flags related to passing back dio completions [5/6] io_uring/rw: add write support for IOCB_DIO_DEFER [6/6] iomap: support IOCB_DIO_DEFER

Message ID

20230719195417.1704513-7-axboe@kernel.dk (mailing list archive)

State

New

Headers

From: Jens Axboe <axboe@kernel.dk>
To: io-uring@vger.kernel.org, linux-xfs@vger.kernel.org
Cc: hch@lst.de, andres@anarazel.de, david@fromorbit.com,
        Jens Axboe <axboe@kernel.dk>
Subject: [PATCH 6/6] iomap: support IOCB_DIO_DEFER
Date: Wed, 19 Jul 2023 13:54:17 -0600
Message-Id: <20230719195417.1704513-7-axboe@kernel.dk>
In-Reply-To: <20230719195417.1704513-1-axboe@kernel.dk>
References: <20230719195417.1704513-1-axboe@kernel.dk>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

Series

Improve async iomap DIO performance | expand

Commit Message

Jens Axboe July 19, 2023, 7:54 p.m. UTC

If IOCB_DIO_DEFER is set, utilize that to set kiocb->dio_complete handler
and data for that callback. Rather than punt the completion to a
workqueue, we pass back the handler and data to the issuer and will get a
callback from a safe task context.

Using the following fio job to randomly dio write 4k blocks at
queue depths of 1..16:

fio --name=dio-write --filename=/data1/file --time_based=1 \
--runtime=10 --bs=4096 --rw=randwrite --norandommap --buffered=0 \
--cpus_allowed=4 --ioengine=io_uring --iodepth=$depth

shows the following results before and after this patch:

	Stock	Patched		Diff
=======================================
QD1	155K	162K		+ 4.5%
QD2	290K	313K		+ 7.9%
QD4	533K	597K		+12.0%
QD8	604K	827K		+36.9%
QD16	615K	845K		+37.4%

which shows nice wins all around. If we factored in per-IOP efficiency,
the wins look even nicer. This becomes apparent as queue depth rises,
as the offloaded workqueue completions runs out of steam.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/iomap/direct-io.c | 47 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 46 insertions(+), 1 deletion(-)

Comments

Christoph Hellwig July 20, 2023, 4:59 a.m. UTC | #1

> +	if (dio->flags & IOMAP_DIO_DEFER_COMP) {
> +		/* only polled IO cares about private cleared */
> +		iocb->private = dio;

FYI, I find this comment a bit weird as it comments on what we're
not doing in a path where it is irreleant.  I'd rather only clear
the private data in the path where polling is applicable and have
a comment there why it is cleared.  That probably belongs into the
first patch restructuring the function.

> @@ -277,12 +308,15 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  		 * data IO that doesn't require any metadata updates (including
>  		 * after IO completion such as unwritten extent conversion) and
>  		 * the underlying device supports FUA. This allows us to avoid
> -		 * cache flushes on IO completion.
> +		 * cache flushes on IO completion. If we can't use FUA and
> +		 * need to sync, disable in-task completions.

... because iomap_dio_complete will have to call generic_write_sync to
do a blocking ->fsync call.

> @@ -308,6 +342,8 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>  		pad = pos & (fs_block_size - 1);
>  		if (pad)
>  			iomap_dio_zero(iter, dio, pos - pad, pad);
> +
> +		dio->flags &= ~IOMAP_DIO_DEFER_COMP;

Why does zeroing disable the deferred completions?  I can't really think
of why, which is probably a strong indicator why this needs a comment.

Jens Axboe July 20, 2023, 4:27 p.m. UTC | #2

On 7/19/23 10:59?PM, Christoph Hellwig wrote:
>> +	if (dio->flags & IOMAP_DIO_DEFER_COMP) {
>> +		/* only polled IO cares about private cleared */
>> +		iocb->private = dio;
> 
> FYI, I find this comment a bit weird as it comments on what we're
> not doing in a path where it is irreleant.  I'd rather only clear
> the private data in the path where polling is applicable and have
> a comment there why it is cleared.  That probably belongs into the
> first patch restructuring the function.

OK, that makes sense.

>> @@ -277,12 +308,15 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>>  		 * data IO that doesn't require any metadata updates (including
>>  		 * after IO completion such as unwritten extent conversion) and
>>  		 * the underlying device supports FUA. This allows us to avoid
>> -		 * cache flushes on IO completion.
>> +		 * cache flushes on IO completion. If we can't use FUA and
>> +		 * need to sync, disable in-task completions.
> 
> ... because iomap_dio_complete will have to call generic_write_sync to
> do a blocking ->fsync call.

Will add to that comment.

>> @@ -308,6 +342,8 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>>  		pad = pos & (fs_block_size - 1);
>>  		if (pad)
>>  			iomap_dio_zero(iter, dio, pos - pad, pad);
>> +
>> +		dio->flags &= ~IOMAP_DIO_DEFER_COMP;
> 
> Why does zeroing disable the deferred completions?  I can't really
> think of why, which is probably a strong indicator why this needs a
> comment.

If zerooout is set, then it's a new or unwritten extent and we need
further processing at write time which may trigger more IO. I'll add a
comment.

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index b30c3edf2ef3..b7055d50dd99 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -20,6 +20,7 @@ 
  * Private flags for iomap_dio, must not overlap with the public ones in
  * iomap.h:
  */
+#define IOMAP_DIO_DEFER_COMP	(1 << 26)
 #define IOMAP_DIO_INLINE_COMP	(1 << 27)
 #define IOMAP_DIO_WRITE_FUA	(1 << 28)
 #define IOMAP_DIO_NEED_SYNC	(1 << 29)
@@ -131,6 +132,11 @@  ssize_t iomap_dio_complete(struct iomap_dio *dio)
 }
 EXPORT_SYMBOL_GPL(iomap_dio_complete);
 
+static ssize_t iomap_dio_deferred_complete(void *data)
+{
+	return iomap_dio_complete(data);
+}
+
 static void iomap_dio_complete_work(struct work_struct *work)
 {
 	struct iomap_dio *dio = container_of(work, struct iomap_dio, aio.work);
@@ -180,6 +186,31 @@  void iomap_dio_bio_end_io(struct bio *bio)
 		goto release_bio;
 	}
 
+	/*
+	 * If this dio is flagged with IOMAP_DIO_DEFER_COMP, then schedule
+	 * our completion that way to avoid an async punt to a workqueue.
+	 */
+	if (dio->flags & IOMAP_DIO_DEFER_COMP) {
+		/* only polled IO cares about private cleared */
+		iocb->private = dio;
+		iocb->dio_complete = iomap_dio_deferred_complete;
+
+		/*
+		 * Invoke ->ki_complete() directly. We've assigned out
+		 * dio_complete callback handler, and since the issuer set
+		 * IOCB_DIO_DEFER, we know their ki_complete handler will
+		 * notice ->dio_complete being set and will defer calling that
+		 * handler until it can be done from a safe task context.
+		 *
+		 * Note that the 'res' being passed in here is not important
+		 * for this case. The actual completion value of the request
+		 * will be gotten from dio_complete when that is run by the
+		 * issuer.
+		 */
+		iocb->ki_complete(iocb, 0);
+		goto release_bio;
+	}
+
 	/*
 	 * Async DIO completion that requires filesystem level completion work
 	 * gets punted to a work queue to complete as the operation may require
@@ -277,12 +308,15 @@  static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 		 * data IO that doesn't require any metadata updates (including
 		 * after IO completion such as unwritten extent conversion) and
 		 * the underlying device supports FUA. This allows us to avoid
-		 * cache flushes on IO completion.
+		 * cache flushes on IO completion. If we can't use FUA and
+		 * need to sync, disable in-task completions.
 		 */
 		if (!(iomap->flags & (IOMAP_F_SHARED|IOMAP_F_DIRTY)) &&
 		    (dio->flags & IOMAP_DIO_WRITE_FUA) &&
 		    (bdev_fua(iomap->bdev) || !bdev_write_cache(iomap->bdev)))
 			use_fua = true;
+		else if (dio->flags & IOMAP_DIO_NEED_SYNC)
+			dio->flags &= ~IOMAP_DIO_DEFER_COMP;
 	}
 
 	/*
@@ -308,6 +342,8 @@  static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
 		pad = pos & (fs_block_size - 1);
 		if (pad)
 			iomap_dio_zero(iter, dio, pos - pad, pad);
+
+		dio->flags &= ~IOMAP_DIO_DEFER_COMP;
 	}
 
 	/*
@@ -547,6 +583,15 @@  __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		iomi.flags |= IOMAP_WRITE;
 		dio->flags |= IOMAP_DIO_WRITE;
 
+		/*
+		 * Flag as supporting deferred completions, if the issuer
+		 * groks it. This can avoid a workqueue punt for writes.
+		 * We may later clear this flag if we need to do other IO
+		 * as part of this IO completion.
+		 */
+		if (iocb->ki_flags & IOCB_DIO_DEFER)
+			dio->flags |= IOMAP_DIO_DEFER_COMP;
+
 		if (dio_flags & IOMAP_DIO_OVERWRITE_ONLY) {
 			ret = -EAGAIN;
 			if (iomi.pos >= dio->i_size ||