[3/3] block: enable bio allocation cache for IRQ driven IO

Message ID: 20211215163009.15269-4-axboe@kernel.dk
State: New, archived
Series: Improve IRQ driven performance

Commit Message

Jens Axboe Dec. 15, 2021, 4:30 p.m. UTC
We currently cannot use the bio recycling allocation cache for IRQ driven
IO, as the cache isn't IRQ safe (by design).

Add a way for the completion side to pass back a bio that needs freeing,
so we can do it from the io_uring side. io_uring completions always
run in task context.

This is good for about a 13% improvement in IRQ driven IO, taking us from
around 6.3M/core to 7.1M/core IOPS.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/fops.c       | 11 ++++++++---
 fs/io_uring.c      |  6 ++++++
 include/linux/fs.h |  4 ++++
 3 files changed, 18 insertions(+), 3 deletions(-)
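
In short: when the submitter has opted in via IOCB_BIO_PASSBACK, the
completion handler stashes the bio in iocb->private instead of putting it
from IRQ context, and the io_uring side does the bio_put() once it runs in
task context, where the allocation cache can be used. A condensed sketch of
that flow (the function names below are made up for illustration; the flag
names are the ones added by this patch, and the real end_io handler goes
through a dio container and handles page dirtying/release, elided here):

static void example_end_io(struct bio *bio)
{
	struct kiocb *iocb = bio->bi_private;
	int ret = blk_status_to_errno(bio->bi_status);

	if (iocb->ki_flags & IOCB_BIO_PASSBACK) {
		/* IRQ context: the alloc cache isn't IRQ safe, defer the put */
		iocb->ki_flags |= IOCB_PRIV_IS_BIO;
		iocb->private = bio;
	} else {
		bio_put(bio);
	}

	iocb->ki_complete(iocb, ret);
}

/* io_uring completion, running in task context */
static void example_task_complete(struct kiocb *kiocb)
{
	if (kiocb->ki_flags & IOCB_PRIV_IS_BIO)
		bio_put(kiocb->private);	/* can recycle via the bio alloc cache */
}

Note that the patch also moves the ->ki_complete() call to after the bio
handling, so iocb->private and the flag are set up before the io_uring
completion path can observe the request as done.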

Comments

Christoph Hellwig Dec. 16, 2021, 2:33 p.m. UTC | #1
On Wed, Dec 15, 2021 at 09:30:09AM -0700, Jens Axboe wrote:
> We currently cannot use the bio recycling allocation cache for IRQ driven
> IO, as the cache isn't IRQ safe (by design).
> 
> Add a way for the completion side to pass back a bio that needs freeing,
> so we can do it from the io_uring side. io_uring completions always
> run in task context.
> 
> This is good for about a 13% improvement in IRQ driven IO, taking us from
> around 6.3M/core to 7.1M/core IOPS.

The numbers look great, but I really hate how it ties the caller into
using a bio.  I'll have to think hard about a better structure.

Just curious:  are the numbers with retpolines or without?  Do you care
about the cost of indirect calls with retpolines for these benchmarks?
Jens Axboe Dec. 16, 2021, 3:41 p.m. UTC | #2
On 12/16/21 7:33 AM, Christoph Hellwig wrote:
> On Wed, Dec 15, 2021 at 09:30:09AM -0700, Jens Axboe wrote:
>> We currently cannot use the bio recycling allocation cache for IRQ driven
>> IO, as the cache isn't IRQ safe (by design).
>>
>> Add a way for the completion side to pass back a bio that needs freeing,
>> so we can do it from the io_uring side. io_uring completions always
>> run in task context.
>>
>> This is good for about a 13% improvement in IRQ driven IO, taking us from
>> around 6.3M/core to 7.1M/core IOPS.
> 
> The numbers look great, but I really hate how it ties the caller into
> using a bio.  I'll have to think hard about a better structure.

Yeah, it's definitely not the prettiest, but I haven't been able to come
up with a better approach so far. I don't think the bio association is
the worst part; I can't imagine we'd want to tie it to anything else. I
might be wrong...

Ideally we'd flip the completion model upside down here, instead of
having ->bi_end_io() call ->ki_complete.
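
Purely as a hypothetical illustration of that inversion (not something
proposed in this thread, and all names below are made up): ->bi_end_io()
would only record the outcome and notify the kiocb owner, and the owner
would then complete the request and put the bio from its own task context,
rather than the block layer calling back into ->ki_complete() from IRQ
context:

static void example_end_io_flipped(struct bio *bio)
{
	struct kiocb *iocb = bio->bi_private;

	/* record the result; no ->ki_complete() call from IRQ context */
	iocb->private = bio;
	example_notify_owner(iocb);	/* hypothetical: queue task work / wake the owner */
}

static void example_owner_complete(struct kiocb *iocb)
{
	struct bio *bio = iocb->private;
	int ret = blk_status_to_errno(bio->bi_status);

	bio_put(bio);	/* task context, so the alloc cache can be used */
	example_finish_request(iocb, ret);	/* hypothetical owner-side completion */
}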

> Just curious:  are the numbers with retpolines or without?  Do you care
> about the cost of indirect calls with retpolines for these benchmarks?

No mitigations enabled for these tests, so no retpoline. I definitely do
care about indirect call performance. The peak testing so far has been
mostly focused on the best case: various common but costly features
disabled, and no mitigations enabled. That's mostly because it's necessary
to break the issues down one by one, and the intent has always been to
move towards making the common options less expensive too. Most of the
patches cover both bases, but there are also cases where we want to fix
only the items that become costly when features/mitigations are enabled.

Patch

diff --git a/block/fops.c b/block/fops.c
index bcf866b07edc..c7794d42be85 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -296,14 +296,19 @@  static void blkdev_bio_end_io_async(struct bio *bio)
 		ret = blk_status_to_errno(bio->bi_status);
 	}
 
-	iocb->ki_complete(iocb, ret);
-
 	if (dio->flags & DIO_SHOULD_DIRTY) {
 		bio_check_pages_dirty(bio);
 	} else {
 		bio_release_pages(bio, false);
-		bio_put(bio);
+		if (iocb->ki_flags & IOCB_BIO_PASSBACK) {
+			iocb->ki_flags |= IOCB_PRIV_IS_BIO;
+			iocb->private = bio;
+		} else {
+			bio_put(bio);
+		}
 	}
+
+	iocb->ki_complete(iocb, ret);
 }
 
 static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 1a647a6a5add..b0302c0407e6 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2828,6 +2828,11 @@  static void io_req_task_complete(struct io_kiocb *req, bool *locked)
 	unsigned int cflags = io_put_kbuf(req);
 	int res = req->result;
 
+#ifdef CONFIG_BLOCK
+	if (req->rw.kiocb.ki_flags & IOCB_PRIV_IS_BIO)
+		bio_put(req->rw.kiocb.private);
+#endif
+
 	if (*locked) {
 		io_req_complete_state(req, res, cflags);
 		io_req_add_compl_list(req);
@@ -3024,6 +3029,7 @@  static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 	} else {
 		if (kiocb->ki_flags & IOCB_HIPRI)
 			return -EINVAL;
+		kiocb->ki_flags |= IOCB_ALLOC_CACHE | IOCB_BIO_PASSBACK;
 		kiocb->ki_complete = io_complete_rw;
 	}
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6b8dc1a78df6..bf7a76dfdc29 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -322,6 +322,10 @@  enum rw_hint {
 #define IOCB_NOIO		(1 << 20)
 /* can use bio alloc cache */
 #define IOCB_ALLOC_CACHE	(1 << 21)
+/* iocb supports bio passback */
+#define IOCB_BIO_PASSBACK	(1 << 22)
+/* iocb->private holds bio to put */
+#define IOCB_PRIV_IS_BIO	(1 << 23)
 
 struct kiocb {
 	struct file		*ki_filp;