Message ID | 20200824043121.13421-2-wanghonghao@bytedance.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [1/2] QSLIST: add atomic replace operation | expand |
On Mon, Aug 24, 2020 at 12:31:21PM +0800, wanghonghao wrote: > This patch replace the global coroutine queue with a lock-free stack of which > the elements are coroutine queues. Threads can put coroutine queues into the > stack or take queues from it and each coroutine queue has exactly > POOL_BATCH_SIZE coroutines. Note that the stack is not strictly LIFO, but it's > enough for buffer pool. > > Coroutines will be put into thread-local pools first while release. Now the > fast pathes of both allocation and release are atomic-free, and there won't > be too many coroutines remain in a single thread since POOL_BATCH_SIZE has been > reduced to 16. > > In practice, I've run a VM with two block devices binding to two different > iothreads, and run fio with iodepth 128 on each device. It maintains around > 400 coroutines and has about 1% chance of calling to `qemu_coroutine_new` > without this patch. And with this patch, it maintains no more than 273 > coroutines and doesn't call `qemu_coroutine_new` after initial allocations. Does throughput or IOPS change? Is the main purpose of this patch to reduce memory consumption? Stefan
The purpose of this patch is to improve performance without increasing memory consumption. My test case: QEMU command line arguments -drive file=/dev/nvme2n1p1,format=raw,if=none,id=local0,cache=none,aio=native \ -device virtio-blk,id=blk0,drive=local0,iothread=iothread0,num-queues=4 \ -drive file=/dev/nvme3n1p1,format=raw,if=none,id=local1,cache=none,aio=native \ -device virtio-blk,id=blk1,drive=local1,iothread=iothread1,num-queues=4 \ run these two fio jobs at the same time [job-vda] filename=/dev/vda iodepth=64 ioengine=libaio rw=randrw bs=4k size=300G rwmixread=80 direct=1 numjobs=2 runtime=60 [job-vdb] filename=/dev/vdb iodepth=64 ioengine=libaio rw=randrw bs=4k size=300G rwmixread=90 direct=1 numjobs=2 loops=1 runtime=60 without this patch, test 3 times: total iops: 278548.1, 312374.1, 276638.2 with this patch, test 3 times: total iops: 368370.9, 335693.2, 327693.1 18.9% improvement in average. In addition, we are also using a distributed block storage, of which the io latency is much more than local nvme devices because of the network overhead. So it needs higher iodepth(>=256) to reach its max throughput. Without this patch, it has more than 5% chance of calling `qemu_coroutine_new` and the iops is less than 100K, while the iops is about 260K with this patch. On the other hand, there's a simpler way to reduce or eliminate the cost of `qemu_coroutine_new` is to increase POOL_BATCH_SIZE. But it will also bring much more memory consumption which we don't expect. So it's the purpose of this patch. Stefan Hajnoczi <stefanha@redhat.com> 于2020年8月25日周二 下午10:52写道: > > On Mon, Aug 24, 2020 at 12:31:21PM +0800, wanghonghao wrote: > > This patch replace the global coroutine queue with a lock-free stack of which > > the elements are coroutine queues. Threads can put coroutine queues into the > > stack or take queues from it and each coroutine queue has exactly > > POOL_BATCH_SIZE coroutines. Note that the stack is not strictly LIFO, but it's > > enough for buffer pool. > > > > Coroutines will be put into thread-local pools first while release. Now the > > fast pathes of both allocation and release are atomic-free, and there won't > > be too many coroutines remain in a single thread since POOL_BATCH_SIZE has been > > reduced to 16. > > > > In practice, I've run a VM with two block devices binding to two different > > iothreads, and run fio with iodepth 128 on each device. It maintains around > > 400 coroutines and has about 1% chance of calling to `qemu_coroutine_new` > > without this patch. And with this patch, it maintains no more than 273 > > coroutines and doesn't call `qemu_coroutine_new` after initial allocations. > > Does throughput or IOPS change? > > Is the main purpose of this patch to reduce memory consumption? > > Stefan
Hi, I'd like to know if there are any other problems with this patch, or if there is a better implement to improve coroutine pool. 王洪浩 <wanghonghao@bytedance.com> 于2020年8月26日周三 下午2:06写道: > > The purpose of this patch is to improve performance without increasing > memory consumption. > > My test case: > QEMU command line arguments > -drive file=/dev/nvme2n1p1,format=raw,if=none,id=local0,cache=none,aio=native \ > -device virtio-blk,id=blk0,drive=local0,iothread=iothread0,num-queues=4 \ > -drive file=/dev/nvme3n1p1,format=raw,if=none,id=local1,cache=none,aio=native \ > -device virtio-blk,id=blk1,drive=local1,iothread=iothread1,num-queues=4 \ > > run these two fio jobs at the same time > [job-vda] > filename=/dev/vda > iodepth=64 > ioengine=libaio > rw=randrw > bs=4k > size=300G > rwmixread=80 > direct=1 > numjobs=2 > runtime=60 > > [job-vdb] > filename=/dev/vdb > iodepth=64 > ioengine=libaio > rw=randrw > bs=4k > size=300G > rwmixread=90 > direct=1 > numjobs=2 > loops=1 > runtime=60 > > without this patch, test 3 times: > total iops: 278548.1, 312374.1, 276638.2 > with this patch, test 3 times: > total iops: 368370.9, 335693.2, 327693.1 > > 18.9% improvement in average. > > In addition, we are also using a distributed block storage, of which > the io latency is much more than local nvme devices because of the > network overhead. So it needs higher iodepth(>=256) to reach its max > throughput. > Without this patch, it has more than 5% chance of calling > `qemu_coroutine_new` and the iops is less than 100K, while the iops is > about 260K with this patch. > > On the other hand, there's a simpler way to reduce or eliminate the > cost of `qemu_coroutine_new` is to increase POOL_BATCH_SIZE. But it > will also bring much more memory consumption which we don't expect. > So it's the purpose of this patch. > > Stefan Hajnoczi <stefanha@redhat.com> 于2020年8月25日周二 下午10:52写道: > > > > On Mon, Aug 24, 2020 at 12:31:21PM +0800, wanghonghao wrote: > > > This patch replace the global coroutine queue with a lock-free stack of which > > > the elements are coroutine queues. Threads can put coroutine queues into the > > > stack or take queues from it and each coroutine queue has exactly > > > POOL_BATCH_SIZE coroutines. Note that the stack is not strictly LIFO, but it's > > > enough for buffer pool. > > > > > > Coroutines will be put into thread-local pools first while release. Now the > > > fast pathes of both allocation and release are atomic-free, and there won't > > > be too many coroutines remain in a single thread since POOL_BATCH_SIZE has been > > > reduced to 16. > > > > > > In practice, I've run a VM with two block devices binding to two different > > > iothreads, and run fio with iodepth 128 on each device. It maintains around > > > 400 coroutines and has about 1% chance of calling to `qemu_coroutine_new` > > > without this patch. And with this patch, it maintains no more than 273 > > > coroutines and doesn't call `qemu_coroutine_new` after initial allocations. > > > > Does throughput or IOPS change? > > > > Is the main purpose of this patch to reduce memory consumption? > > > > Stefan
On Tue, Sep 29, 2020 at 11:24:14AM +0800, 王洪浩 wrote: > Hi, I'd like to know if there are any other problems with this patch, > or if there is a better implement to improve coroutine pool. Please rebase onto qemu.git/master and resend the patch as a top-level email thread. I think v2 was overlooked because it was sent as a reply: https://wiki.qemu.org/Contribute/SubmitAPatch Thanks, Stefan
diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c index c3caa6c770..070d492edc 100644 --- a/util/qemu-coroutine.c +++ b/util/qemu-coroutine.c @@ -21,13 +21,14 @@ #include "block/aio.h" enum { - POOL_BATCH_SIZE = 64, + POOL_BATCH_SIZE = 16, + POOL_MAX_BATCHES = 32, }; -/** Free list to speed up creation */ -static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool); -static unsigned int release_pool_size; -static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool); +/** Free stack to speed up creation */ +static QSLIST_HEAD(, Coroutine) pool[POOL_MAX_BATCHES]; +static int pool_top; +static __thread QSLIST_HEAD(, Coroutine) alloc_pool; static __thread unsigned int alloc_pool_size; static __thread Notifier coroutine_pool_cleanup_notifier; @@ -49,20 +50,26 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry, void *opaque) if (CONFIG_COROUTINE_POOL) { co = QSLIST_FIRST(&alloc_pool); if (!co) { - if (release_pool_size > POOL_BATCH_SIZE) { - /* Slow path; a good place to register the destructor, too. */ - if (!coroutine_pool_cleanup_notifier.notify) { - coroutine_pool_cleanup_notifier.notify = coroutine_pool_cleanup; - qemu_thread_atexit_add(&coroutine_pool_cleanup_notifier); + int top; + + /* Slow path; a good place to register the destructor, too. */ + if (!coroutine_pool_cleanup_notifier.notify) { + coroutine_pool_cleanup_notifier.notify = coroutine_pool_cleanup; + qemu_thread_atexit_add(&coroutine_pool_cleanup_notifier); + } + + while ((top = atomic_read(&pool_top)) > 0) { + if (atomic_cmpxchg(&pool_top, top, top - 1) != top) { + continue; } - /* This is not exact; there could be a little skew between - * release_pool_size and the actual size of release_pool. But - * it is just a heuristic, it does not need to be perfect. - */ - alloc_pool_size = atomic_xchg(&release_pool_size, 0); - QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool); + QSLIST_MOVE_ATOMIC(&alloc_pool, &pool[top - 1]); co = QSLIST_FIRST(&alloc_pool); + + if (co) { + alloc_pool_size = POOL_BATCH_SIZE; + break; + } } } if (co) { @@ -86,16 +93,30 @@ static void coroutine_delete(Coroutine *co) co->caller = NULL; if (CONFIG_COROUTINE_POOL) { - if (release_pool_size < POOL_BATCH_SIZE * 2) { - QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next); - atomic_inc(&release_pool_size); - return; - } + int top, value, old; + if (alloc_pool_size < POOL_BATCH_SIZE) { QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next); alloc_pool_size++; return; } + + for (top = atomic_read(&pool_top); top < POOL_MAX_BATCHES; top++) { + QSLIST_REPLACE_ATOMIC(&pool[top], &alloc_pool); + if (!QSLIST_EMPTY(&alloc_pool)) { + continue; + } + + value = top + 1; + + do { + old = atomic_cmpxchg(&pool_top, top, value); + } while (old != top && (top = old) < value); + + QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next); + alloc_pool_size = 1; + return; + } } qemu_coroutine_delete(co);
This patch replace the global coroutine queue with a lock-free stack of which the elements are coroutine queues. Threads can put coroutine queues into the stack or take queues from it and each coroutine queue has exactly POOL_BATCH_SIZE coroutines. Note that the stack is not strictly LIFO, but it's enough for buffer pool. Coroutines will be put into thread-local pools first while release. Now the fast pathes of both allocation and release are atomic-free, and there won't be too many coroutines remain in a single thread since POOL_BATCH_SIZE has been reduced to 16. In practice, I've run a VM with two block devices binding to two different iothreads, and run fio with iodepth 128 on each device. It maintains around 400 coroutines and has about 1% chance of calling to `qemu_coroutine_new` without this patch. And with this patch, it maintains no more than 273 coroutines and doesn't call `qemu_coroutine_new` after initial allocations. Signed-off-by: wanghonghao <wanghonghao@bytedance.com> --- util/qemu-coroutine.c | 63 ++++++++++++++++++++++++++++--------------- 1 file changed, 42 insertions(+), 21 deletions(-)