Message ID | 20210810163728.265939-1-axboe@kernel.dk
---|---
Series | Enable bio recycling for polled IO
On 8/10/21 10:37 AM, Jens Axboe wrote:
> Hi,
>
> This is v3 of this patchset. We're back to passing the cache pointer
> in the kiocb, I do think that's the cleanest and it's also the most
> efficient approach. A patch has been added to remove a member from
> the io_uring req_rw structure, so that the kiocb size bump doesn't
> result in the per-command part of io_kiocb to bump into the next
> cacheline.
>
> Another benefit of this approach is that we get per-ring caching.
> That means if an application splits polled IO into two threads, one
> doing submit and one doing reaps, then we still get the full benefit
> of the bio caching.
>
> The tldr; here is that we get about a 10% bump in polled performance
> with this patchset, as we can recycle bio structures essentially for
> free. Outside of that, explanations in each patch. I've also got an
> iomap patch, but trying to keep this single user until there's
> agreement on the direction.
>
> Against for-5.15/io_uring, and can also be found in my
> io_uring-bio-cache.3 branch.

As a reference. Before the patch:

axboe@amd ~/g/fio (master)> sudo taskset -c 0 t/io_uring -b512 -d128 -s32 -c32 -p1 -F1 -B1 /dev/nvme3n1
i 8, argc 9
Added file /dev/nvme3n1 (submitter 0)
sq_ring ptr = 0x0x7f63a7c42000
sqes ptr    = 0x0x7f63a7c40000
cq_ring ptr = 0x0x7f63a7c3e000
polled=1, fixedbufs=1, register_files=1, buffered=0
QD=128, sq_ring=128, cq_ring=256
submitter=1600
IOPS=3111520, IOS/call=32/31, inflight=128 (128)

or around 3.1M IOPS (single thread, single core), and after:

axboe@amd ~/g/fio (master)> sudo taskset -c 0 t/io_uring -b512 -d128 -s32 -c32 -p1 -F1 -B1 /dev/nvme3n1
i 8, argc 9
Added file /dev/nvme3n1 (submitter 0)
sq_ring ptr = 0x0x7f62726bc000
sqes ptr    = 0x0x7f62726ba000
cq_ring ptr = 0x0x7f62726b8000
polled=1, fixedbufs=1, register_files=1, buffered=0
QD=128, sq_ring=128, cq_ring=256
submitter=1791
IOPS=3417120, IOS/call=32/31, inflight=128 (128)

which is about a ~10% increase in per-core IOPS for this kind of workload.
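[Editor's note, not part of the thread: the t/io_uring invocation above (-p1 -F1 -B1) corresponds roughly to the following minimal liburing sketch of a polled, O_DIRECT read using a registered buffer and a registered file. The device path, block size, queue depth of 128 and the lack of error handling are illustrative assumptions; the device must have poll queues enabled for IORING_SETUP_IOPOLL to work.]

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <liburing.h>

#define BS 4096

int main(int argc, char **argv)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	struct iovec iov;
	void *buf;
	int fd, ret;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <block device>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* IOPOLL ring: completions are reaped by polling, not by IRQs (-p1) */
	ret = io_uring_queue_init(128, &ring, IORING_SETUP_IOPOLL);
	if (ret) {
		fprintf(stderr, "queue_init: %d\n", ret);
		return 1;
	}

	/* register one fixed buffer (-B1) and one fixed file (-F1) */
	if (posix_memalign(&buf, BS, BS))
		return 1;
	iov.iov_base = buf;
	iov.iov_len = BS;
	io_uring_register_buffers(&ring, &iov, 1);
	io_uring_register_files(&ring, &fd, 1);

	/* one fixed-buffer read from registered file index 0, offset 0 */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read_fixed(sqe, 0, buf, BS, 0, 0);
	sqe->flags |= IOSQE_FIXED_FILE;

	io_uring_submit(&ring);
	ret = io_uring_wait_cqe(&ring, &cqe);	/* busy-polls on IOPOLL rings */
	if (!ret) {
		printf("read %d bytes\n", cqe->res);
		io_uring_cqe_seen(&ring, cqe);
	}

	io_uring_queue_exit(&ring);
	return 0;
}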
I really don't like all the layering violations in here.  What is the
problem with a simple (optional) percpu cache in the bio_set?  Something
like the completely untested patch below:

diff --git a/block/bio.c b/block/bio.c
index 33160007f4e0..edd4a83b96fa 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -25,6 +25,11 @@
 #include "blk.h"
 #include "blk-rq-qos.h"
 
+struct bio_alloc_cache {
+	struct bio_list free_list;
+	unsigned int nr;
+};
+
 static struct biovec_slab {
 	int nr_vecs;
 	char *name;
@@ -239,6 +244,35 @@ static void bio_free(struct bio *bio)
 	}
 }
 
+static inline void __bio_init(struct bio *bio)
+{
+	bio->bi_next = NULL;
+	bio->bi_bdev = NULL;
+	bio->bi_opf = 0;
+	bio->bi_flags = bio->bi_ioprio = bio->bi_write_hint = 0;
+	bio->bi_status = 0;
+	bio->bi_iter.bi_sector = 0;
+	bio->bi_iter.bi_size = 0;
+	bio->bi_iter.bi_idx = 0;
+	bio->bi_iter.bi_bvec_done = 0;
+	bio->bi_end_io = NULL;
+	bio->bi_private = NULL;
+#ifdef CONFIG_BLK_CGROUP
+	bio->bi_blkg = NULL;
+	bio->bi_issue.value = 0;
+#ifdef CONFIG_BLK_CGROUP_IOCOST
+	bio->bi_iocost_cost = 0;
+#endif
+#endif
+#ifdef CONFIG_BLK_INLINE_ENCRYPTION
+	bio->bi_crypt_context = NULL;
+#endif
+#ifdef CONFIG_BLK_DEV_INTEGRITY
+	bio->bi_integrity = NULL;
+#endif
+	bio->bi_vcnt = 0;
+}
+
 /*
  * Users of this function have their own bio allocation. Subsequently,
  * they must remember to pair any call to bio_init() with bio_uninit()
@@ -247,7 +281,7 @@ static void bio_free(struct bio *bio)
 void bio_init(struct bio *bio, struct bio_vec *table,
 	      unsigned short max_vecs)
 {
-	memset(bio, 0, sizeof(*bio));
+	__bio_init(bio);
 	atomic_set(&bio->__bi_remaining, 1);
 	atomic_set(&bio->__bi_cnt, 1);
 	bio->bi_cookie = BLK_QC_T_NONE;
@@ -470,6 +504,31 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, unsigned short nr_iovecs,
 }
 EXPORT_SYMBOL(bio_alloc_bioset);
 
+struct bio *bio_alloc_iocb(struct kiocb *iocb, unsigned short nr_vecs,
+		struct bio_set *bs)
+{
+	struct bio_alloc_cache *cache = NULL;
+	struct bio *bio;
+
+	if (!(iocb->ki_flags & IOCB_HIPRI) ||
+	    !(iocb->ki_flags & IOCB_NOWAIT) ||
+	    nr_vecs > BIO_INLINE_VECS)
+		return bio_alloc_bioset(GFP_KERNEL, nr_vecs, bs);
+
+	cache = per_cpu_ptr(bs->cache, get_cpu());
+	bio = bio_list_pop(&cache->free_list);
+	if (bio) {
+		bio_init(bio, nr_vecs ? bio->bi_inline_vecs : NULL, nr_vecs);
+		cache->nr--;
+	}
+	put_cpu();
+
+	if (!bio)
+		bio = bio_alloc_bioset(GFP_KERNEL, nr_vecs, bs);
+	bio_set_flag(bio, BIO_CACHEABLE);
+	return bio;
+}
+
 /**
  * bio_kmalloc - kmalloc a bio for I/O
  * @gfp_mask: the GFP_* mask given to the slab allocator
@@ -588,6 +647,46 @@ void guard_bio_eod(struct bio *bio)
 	bio_truncate(bio, maxsector << 9);
 }
 
+#define ALLOC_CACHE_MAX		512
+#define ALLOC_CACHE_SLACK	64
+
+static void bio_alloc_cache_prune(struct bio_alloc_cache *cache,
+				  unsigned int nr)
+{
+	struct bio *bio;
+	unsigned int i;
+
+	i = 0;
+	while ((bio = bio_list_pop(&cache->free_list)) != NULL) {
+		cache->nr--;
+		bio_free(bio);
+		if (++i == nr)
+			break;
+	}
+}
+
+#if 0
+// XXX: add a cpu down notifier to call this
+void bio_alloc_cache_destroy(struct bio_alloc_cache *cache)
+{
+	bio_alloc_cache_prune(cache, -1U);
+}
+#endif
+
+static void bio_add_to_cache(struct bio *bio)
+{
+	struct bio_alloc_cache *cache;
+
+	bio_uninit(bio);
+
+	cache = per_cpu_ptr(bio->bi_pool->cache, get_cpu());
+	bio_list_add_head(&cache->free_list, bio);
+	cache->nr++;
+	if (cache->nr > ALLOC_CACHE_MAX + ALLOC_CACHE_SLACK)
+		bio_alloc_cache_prune(cache, ALLOC_CACHE_SLACK);
+	put_cpu();
+}
+
 /**
  * bio_put - release a reference to a bio
  * @bio: bio to release reference to
@@ -598,17 +697,16 @@ void guard_bio_eod(struct bio *bio)
  **/
 void bio_put(struct bio *bio)
 {
-	if (!bio_flagged(bio, BIO_REFFED))
-		bio_free(bio);
-	else {
+	if (bio_flagged(bio, BIO_REFFED)) {
 		BIO_BUG_ON(!atomic_read(&bio->__bi_cnt));
-
-		/*
-		 * last put frees it
-		 */
-		if (atomic_dec_and_test(&bio->__bi_cnt))
-			bio_free(bio);
+		if (!atomic_dec_and_test(&bio->__bi_cnt))
+			return;
 	}
+
+	if (bio_flagged(bio, BIO_CACHEABLE))
+		bio_add_to_cache(bio);
+	else
+		bio_free(bio);
 }
 EXPORT_SYMBOL(bio_put);
 
@@ -1487,6 +1585,7 @@ int biovec_init_pool(mempool_t *pool, int pool_entries)
  */
 void bioset_exit(struct bio_set *bs)
 {
+	free_percpu(bs->cache);
 	if (bs->rescue_workqueue)
 		destroy_workqueue(bs->rescue_workqueue);
 	bs->rescue_workqueue = NULL;
@@ -1548,12 +1647,18 @@ int bioset_init(struct bio_set *bs,
 	    biovec_init_pool(&bs->bvec_pool, pool_size))
 		goto bad;
 
-	if (!(flags & BIOSET_NEED_RESCUER))
-		return 0;
-
-	bs->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
-	if (!bs->rescue_workqueue)
-		goto bad;
+	if (flags & BIOSET_NEED_RESCUER) {
+		bs->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM,
+						       0);
+		if (!bs->rescue_workqueue)
+			goto bad;
+	}
+
+	if (flags & BIOSET_PERCPU_CACHE) {
+		bs->cache = alloc_percpu(struct bio_alloc_cache);
+		if (!bs->cache)
+			goto bad;
+	}
 
 	return 0;
 bad:
@@ -1594,7 +1699,8 @@ static int __init init_bio(void)
 				SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
 	}
 
-	if (bioset_init(&fs_bio_set, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS))
+	if (bioset_init(&fs_bio_set, BIO_POOL_SIZE, 0,
+			BIOSET_NEED_BVECS | BIOSET_PERCPU_CACHE))
 		panic("bio: can't allocate bios\n");
 
 	if (bioset_integrity_create(&fs_bio_set, BIO_POOL_SIZE))
diff --git a/fs/block_dev.c b/fs/block_dev.c
index e95889ff4fba..c67043bfb788 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -376,8 +376,7 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 	    (bdev_logical_block_size(bdev) - 1))
 		return -EINVAL;
 
-	bio = bio_alloc_bioset(GFP_KERNEL, nr_pages, &blkdev_dio_pool);
-
+	bio = bio_alloc_iocb(iocb, nr_pages, &blkdev_dio_pool);
 	dio = container_of(bio, struct blkdev_dio, bio);
 	dio->is_sync = is_sync = is_sync_kiocb(iocb);
 	if (dio->is_sync) {
@@ -452,7 +451,7 @@ static ssize_t __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 		}
 
 		submit_bio(bio);
-		bio = bio_alloc(GFP_KERNEL, nr_pages);
+		bio = bio_alloc_iocb(iocb, nr_pages, &fs_bio_set);
 	}
 
 	if (!(iocb->ki_flags & IOCB_HIPRI))
@@ -497,7 +496,9 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
 
 static __init int blkdev_init(void)
 {
-	return bioset_init(&blkdev_dio_pool, 4, offsetof(struct blkdev_dio, bio), BIOSET_NEED_BVECS);
+	return bioset_init(&blkdev_dio_pool, 4,
+				offsetof(struct blkdev_dio, bio),
+				BIOSET_NEED_BVECS | BIOSET_PERCPU_CACHE);
 }
 module_init(blkdev_init);
 
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 35de19f2ae88..69850bfddf18 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -400,6 +400,7 @@ static inline struct bio *bio_next_split(struct bio *bio, int sectors,
 enum {
 	BIOSET_NEED_BVECS	= BIT(0),
 	BIOSET_NEED_RESCUER	= BIT(1),
+	BIOSET_PERCPU_CACHE	= BIT(2),
 };
 extern int bioset_init(struct bio_set *, unsigned int, unsigned int, int flags);
 extern void bioset_exit(struct bio_set *);
@@ -656,7 +657,7 @@ static inline void bio_inc_remaining(struct bio *bio)
 struct bio_set {
 	struct kmem_cache *bio_slab;
 	unsigned int front_pad;
-
+	struct bio_alloc_cache __percpu *cache;
 	mempool_t bio_pool;
 	mempool_t bvec_pool;
 #if defined(CONFIG_BLK_DEV_INTEGRITY)
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index e3a70dd0470b..7a7d9c6b33ee 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -300,6 +300,7 @@ enum {
 	BIO_TRACKED,		/* set if bio goes through the rq_qos path */
 	BIO_REMAPPED,
 	BIO_ZONE_WRITE_LOCKED,	/* Owns a zoned device zone write lock */
+	BIO_CACHEABLE,		/* can be added to the percpu cache */
 	BIO_FLAG_LAST
 };
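[Editor's note, not part of the thread: the free-list behaviour in the patch above can be modelled in plain userspace C, which may help when reading the kernel version. The sketch below mirrors only the push/pop fast path and the ALLOC_CACHE_MAX/ALLOC_CACHE_SLACK pruning; fake_bio, cache_alloc() and cache_free() are made-up stand-ins, and there is no per-CPU, preemption or polling handling here.]

#include <stdio.h>
#include <stdlib.h>

#define ALLOC_CACHE_MAX		512
#define ALLOC_CACHE_SLACK	64

struct fake_bio {
	struct fake_bio *next;
	char payload[128];
};

struct alloc_cache {
	struct fake_bio *free_list;
	unsigned int nr;
};

/* release up to @nr cached entries back to the allocator */
static void cache_prune(struct alloc_cache *cache, unsigned int nr)
{
	struct fake_bio *bio;
	unsigned int i = 0;

	while ((bio = cache->free_list) != NULL) {
		cache->free_list = bio->next;
		cache->nr--;
		free(bio);
		if (++i == nr)
			break;
	}
}

static struct fake_bio *cache_alloc(struct alloc_cache *cache)
{
	struct fake_bio *bio = cache->free_list;

	if (bio) {				/* fast path: recycle */
		cache->free_list = bio->next;
		cache->nr--;
		return bio;
	}
	return malloc(sizeof(*bio));		/* slow path: real allocation */
}

static void cache_free(struct alloc_cache *cache, struct fake_bio *bio)
{
	bio->next = cache->free_list;		/* push onto head of free list */
	cache->free_list = bio;
	cache->nr++;
	if (cache->nr > ALLOC_CACHE_MAX + ALLOC_CACHE_SLACK)
		cache_prune(cache, ALLOC_CACHE_SLACK);
}

int main(void)
{
	struct alloc_cache cache = { 0 };
	struct fake_bio *bios[700];
	int i;

	for (i = 0; i < 700; i++)
		bios[i] = cache_alloc(&cache);
	for (i = 0; i < 700; i++)
		cache_free(&cache, bios[i]);

	printf("cached entries after churn: %u (bounded near %d)\n",
	       cache.nr, ALLOC_CACHE_MAX + ALLOC_CACHE_SLACK);
	cache_prune(&cache, -1U);		/* teardown: drop everything */
	return 0;
}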
On Wed, Aug 11, 2021 at 09:26:33AM +0100, Christoph Hellwig wrote:
> I really don't like all the layering violations in here.  What is the
> problem with a simple (optional) percpu cache in the bio_set?  Something
> like the completely untested patch below:

A slightly updated version that actually compiles and survives minimal
testing is here:

http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/bio-cache
On 8/11/21 5:13 AM, Christoph Hellwig wrote:
> On Wed, Aug 11, 2021 at 09:26:33AM +0100, Christoph Hellwig wrote:
>> I really don't like all the layering violations in here.  What is the
>> problem with a simple (optional) percpu cache in the bio_set?  Something
>> like the completely untested patch below:
>
> A slightly updated version that actually compiles and survives minimal
> testing is here:
>
> http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/bio-cache

I like this approach, I've done something very similar in the past. But
the key here is the opt-in, to avoid needing IRQ/locking for the cache
which defeats the purpose. That's why it was done just for polling,
obviously.

I'll test a bit and re-spin the series.
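[Editor's note, not part of the thread: a rough userspace analogy of the cost Jens is pointing at. A cache touched only from one context needs no synchronization, while one that must also be safe against another context (IRQ completion, in the kernel case) pays for a lock or IRQ disabling on every operation. Here an uncontended pthread mutex stands in for that per-operation cost; build with cc -O2 -pthread.]

#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 10000000L

static volatile long counter;	/* volatile so the loops are not folded away */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static double now(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
	double t0, t1, t2;
	long i;

	t0 = now();
	for (i = 0; i < ITERS; i++)		/* opt-in case: no locking */
		counter++;
	t1 = now();
	for (i = 0; i < ITERS; i++) {		/* must tolerate another context */
		pthread_mutex_lock(&lock);
		counter++;
		pthread_mutex_unlock(&lock);
	}
	t2 = now();

	printf("unlocked: %.3fs, locked: %.3fs\n", t1 - t0, t2 - t1);
	return 0;
}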
On Wed, Aug 11, 2021 at 12:13:23PM +0100, Christoph Hellwig wrote:
> On Wed, Aug 11, 2021 at 09:26:33AM +0100, Christoph Hellwig wrote:
> > I really don't like all the layering violations in here.  What is the
> > problem with a simple (optional) percpu cache in the bio_set?  Something
> > like the completely untested patch below:
>
> A slightly updated version that actually compiles and survives minimal
> testing is here:
>
> http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/bio-cache

Just wondering if the percpu bio cache really performs better than slab,
which should behave very similarly to a percpu allocation. And here the
cached bio is just a single fixed-length bio with inline bvecs.

BTW, in the patch 'io_uring: ask for bio caching', you have to avoid
setting IOCB_ALLOC_CACHE, just like Jens did, otherwise it will break
when using io_uring over a queue without POLL capability.

Thanks,
Ming
On Wed, Aug 11, 2021 at 09:05:47AM -0600, Jens Axboe wrote:
> On 8/11/21 5:13 AM, Christoph Hellwig wrote:
> > On Wed, Aug 11, 2021 at 09:26:33AM +0100, Christoph Hellwig wrote:
> >> I really don't like all the layering violations in here.  What is the
> >> problem with a simple (optional) percpu cache in the bio_set?  Something
> >> like the completely untested patch below:
> >
> > A slightly updated version that actually compiles and survives minimal
> > testing is here:
> >
> > http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/bio-cache
>
> I like this approach, I've done something very similar in the past. But
> the key here is the opt-in, to avoid needing IRQ/locking for the cache
> which defeats the purpose. That's why it was done just for polling,
> obviously.
>
> I'll test a bit and re-spin the series.

The series above now has the opt-in, as I realized the same after a
little testing.