Message ID | 20230719195417.1704513-7-axboe@kernel.dk (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | Improve async iomap DIO performance | expand |
> + if (dio->flags & IOMAP_DIO_DEFER_COMP) { > + /* only polled IO cares about private cleared */ > + iocb->private = dio; FYI, I find this comment a bit weird as it comments on what we're not doing in a path where it is irreleant. I'd rather only clear the private data in the path where polling is applicable and have a comment there why it is cleared. That probably belongs into the first patch restructuring the function. > @@ -277,12 +308,15 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter, > * data IO that doesn't require any metadata updates (including > * after IO completion such as unwritten extent conversion) and > * the underlying device supports FUA. This allows us to avoid > - * cache flushes on IO completion. > + * cache flushes on IO completion. If we can't use FUA and > + * need to sync, disable in-task completions. ... because iomap_dio_complete will have to call generic_write_sync to do a blocking ->fsync call. > @@ -308,6 +342,8 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter, > pad = pos & (fs_block_size - 1); > if (pad) > iomap_dio_zero(iter, dio, pos - pad, pad); > + > + dio->flags &= ~IOMAP_DIO_DEFER_COMP; Why does zeroing disable the deferred completions? I can't really think of why, which is probably a strong indicator why this needs a comment.
On 7/19/23 10:59?PM, Christoph Hellwig wrote: >> + if (dio->flags & IOMAP_DIO_DEFER_COMP) { >> + /* only polled IO cares about private cleared */ >> + iocb->private = dio; > > FYI, I find this comment a bit weird as it comments on what we're > not doing in a path where it is irreleant. I'd rather only clear > the private data in the path where polling is applicable and have > a comment there why it is cleared. That probably belongs into the > first patch restructuring the function. OK, that makes sense. >> @@ -277,12 +308,15 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter, >> * data IO that doesn't require any metadata updates (including >> * after IO completion such as unwritten extent conversion) and >> * the underlying device supports FUA. This allows us to avoid >> - * cache flushes on IO completion. >> + * cache flushes on IO completion. If we can't use FUA and >> + * need to sync, disable in-task completions. > > ... because iomap_dio_complete will have to call generic_write_sync to > do a blocking ->fsync call. Will add to that comment. >> @@ -308,6 +342,8 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter, >> pad = pos & (fs_block_size - 1); >> if (pad) >> iomap_dio_zero(iter, dio, pos - pad, pad); >> + >> + dio->flags &= ~IOMAP_DIO_DEFER_COMP; > > Why does zeroing disable the deferred completions? I can't really > think of why, which is probably a strong indicator why this needs a > comment. If zerooout is set, then it's a new or unwritten extent and we need further processing at write time which may trigger more IO. I'll add a comment.
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c index b30c3edf2ef3..b7055d50dd99 100644 --- a/fs/iomap/direct-io.c +++ b/fs/iomap/direct-io.c @@ -20,6 +20,7 @@ * Private flags for iomap_dio, must not overlap with the public ones in * iomap.h: */ +#define IOMAP_DIO_DEFER_COMP (1 << 26) #define IOMAP_DIO_INLINE_COMP (1 << 27) #define IOMAP_DIO_WRITE_FUA (1 << 28) #define IOMAP_DIO_NEED_SYNC (1 << 29) @@ -131,6 +132,11 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio) } EXPORT_SYMBOL_GPL(iomap_dio_complete); +static ssize_t iomap_dio_deferred_complete(void *data) +{ + return iomap_dio_complete(data); +} + static void iomap_dio_complete_work(struct work_struct *work) { struct iomap_dio *dio = container_of(work, struct iomap_dio, aio.work); @@ -180,6 +186,31 @@ void iomap_dio_bio_end_io(struct bio *bio) goto release_bio; } + /* + * If this dio is flagged with IOMAP_DIO_DEFER_COMP, then schedule + * our completion that way to avoid an async punt to a workqueue. + */ + if (dio->flags & IOMAP_DIO_DEFER_COMP) { + /* only polled IO cares about private cleared */ + iocb->private = dio; + iocb->dio_complete = iomap_dio_deferred_complete; + + /* + * Invoke ->ki_complete() directly. We've assigned out + * dio_complete callback handler, and since the issuer set + * IOCB_DIO_DEFER, we know their ki_complete handler will + * notice ->dio_complete being set and will defer calling that + * handler until it can be done from a safe task context. + * + * Note that the 'res' being passed in here is not important + * for this case. The actual completion value of the request + * will be gotten from dio_complete when that is run by the + * issuer. + */ + iocb->ki_complete(iocb, 0); + goto release_bio; + } + /* * Async DIO completion that requires filesystem level completion work * gets punted to a work queue to complete as the operation may require @@ -277,12 +308,15 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter, * data IO that doesn't require any metadata updates (including * after IO completion such as unwritten extent conversion) and * the underlying device supports FUA. This allows us to avoid - * cache flushes on IO completion. + * cache flushes on IO completion. If we can't use FUA and + * need to sync, disable in-task completions. */ if (!(iomap->flags & (IOMAP_F_SHARED|IOMAP_F_DIRTY)) && (dio->flags & IOMAP_DIO_WRITE_FUA) && (bdev_fua(iomap->bdev) || !bdev_write_cache(iomap->bdev))) use_fua = true; + else if (dio->flags & IOMAP_DIO_NEED_SYNC) + dio->flags &= ~IOMAP_DIO_DEFER_COMP; } /* @@ -308,6 +342,8 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter, pad = pos & (fs_block_size - 1); if (pad) iomap_dio_zero(iter, dio, pos - pad, pad); + + dio->flags &= ~IOMAP_DIO_DEFER_COMP; } /* @@ -547,6 +583,15 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, iomi.flags |= IOMAP_WRITE; dio->flags |= IOMAP_DIO_WRITE; + /* + * Flag as supporting deferred completions, if the issuer + * groks it. This can avoid a workqueue punt for writes. + * We may later clear this flag if we need to do other IO + * as part of this IO completion. + */ + if (iocb->ki_flags & IOCB_DIO_DEFER) + dio->flags |= IOMAP_DIO_DEFER_COMP; + if (dio_flags & IOMAP_DIO_OVERWRITE_ONLY) { ret = -EAGAIN; if (iomi.pos >= dio->i_size ||
If IOCB_DIO_DEFER is set, utilize that to set kiocb->dio_complete handler and data for that callback. Rather than punt the completion to a workqueue, we pass back the handler and data to the issuer and will get a callback from a safe task context. Using the following fio job to randomly dio write 4k blocks at queue depths of 1..16: fio --name=dio-write --filename=/data1/file --time_based=1 \ --runtime=10 --bs=4096 --rw=randwrite --norandommap --buffered=0 \ --cpus_allowed=4 --ioengine=io_uring --iodepth=$depth shows the following results before and after this patch: Stock Patched Diff ======================================= QD1 155K 162K + 4.5% QD2 290K 313K + 7.9% QD4 533K 597K +12.0% QD8 604K 827K +36.9% QD16 615K 845K +37.4% which shows nice wins all around. If we factored in per-IOP efficiency, the wins look even nicer. This becomes apparent as queue depth rises, as the offloaded workqueue completions runs out of steam. Signed-off-by: Jens Axboe <axboe@kernel.dk> --- fs/iomap/direct-io.c | 47 +++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 46 insertions(+), 1 deletion(-)