| Message ID | 20220701091348.215320-1-chao.gao@intel.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | [RFC,v1] util/aio: Keep notification disabled as much as possible |
On Fri, Jul 01, 2022 at 05:13:48PM +0800, Chao Gao wrote:
> When measuring FIO read performance (cache=writethrough, bs=4k, iodepth=64) in
> VMs, we observe ~80K/s notifications (e.g., EPT_MISCONFIG) from guest to qemu.

It's not clear to me what caused the frequent poll_set_started(ctx,
false) calls and whether this patch is correct. I have posted some
questions below about the nature of this issue.

If ctx->fdmon_ops->wait() is called while polling is still enabled then
hangs or unnecessary latency can occur. For example, consider an fd
handler that temporarily suppresses fd activity between poll start/end.
The thread would be blocked in ->wait() and the fd will never become
readable. Even if a hang doesn't occur because there is a timeout, there
would be extra latency because the fd doesn't become readable and we
have to wait for the timeout to expire so we can poll again. So we must
be sure it's safe to leave polling enabled across the event loop and I'm
not sure if this patch is okay.

> Currently, poll_set_started(ctx, false) is called in try_poll_mode() to enable
> virtqueue notification in below 4 cases:
>
> 1. ctx->poll_ns is 0
> 2. a zero timeout is passed to try_poll_mode()
> 3. polling succeeded but reported as no progress
> 4. polling failed and reported as no progress
>
> To minimize unnecessary guest notifications, keep notification disabled when
> it is possible, i.e., polling is enabled and last polling doesn't fail.

What is the exact definition of polling success/failure?

> Keep notification disabled for case #2 and #3; handle case #2 simply by a call

Did you see case #2 happening often? What was the cause?

> of run_poll_handlers_once() and for case #3, differentiate successful/failed
> polling and skip the call of poll_set_started(ctx, false) for successful ones.

This is probably the most interesting case. When polling detects an
event, that's considered "progress", except for aio_notify() events.
aio_notify() is an internal event for restarting the event loop when
something has changed (e.g. fd handlers have been added/removed). I
wouldn't expect it to interfere with polling frequently since that
requires another QEMU thread doing something to the AioContext, which
should be rare.

Was aio_notify() interfering with polling in your case? Do you know why?

> With this patch applied, FIO seq-read performance (bs=4k, iodepth=64,
> cache=writethrough) in VMs increases from 330K/s to 413K/s IOPS.
>
> Below are changes in VM-exit statistics:
>
> w/o this patch (duration: 10s):
>             VM-EXIT  Samples  Samples%   Time%  Min Time  Max Time  Avg time
>       EPT_MISCONFIG   797440    99.34%  98.58%    0.39us   57.92us  0.60us ( +-  0.05% )
>           MSR_WRITE     3672     0.46%   1.15%    0.88us    4.97us  1.52us ( +-  0.53% )
>  EXTERNAL_INTERRUPT     1638     0.20%   0.27%    0.59us   11.04us  0.81us ( +-  1.71% )
>
> w/ this patch (duration: 10s):
>             VM-EXIT  Samples  Samples%   Time%  Min Time  Max Time  Avg time
>           MSR_WRITE     3524    84.11%  87.17%    0.86us    4.43us  1.74us ( +-  0.60% )
>  EXTERNAL_INTERRUPT      515    12.29%  10.05%    0.64us    8.95us  1.37us ( +-  1.54% )
>       EPT_MISCONFIG      151     3.60%   2.79%    0.40us   52.07us  1.30us ( +- 31.44% )
>
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> ---
>  util/aio-posix.c | 16 ++++++++++++----
>  1 file changed, 12 insertions(+), 4 deletions(-)
>
> diff --git a/util/aio-posix.c b/util/aio-posix.c
> index 731f3826c0..bd2076145b 100644
> --- a/util/aio-posix.c
> +++ b/util/aio-posix.c
> @@ -519,7 +519,7 @@ static bool remove_idle_poll_handlers(AioContext *ctx,
>   * Returns: true if progress was made, false otherwise
>   */
>  static bool run_poll_handlers(AioContext *ctx, AioHandlerList *ready_list,
> -                              int64_t max_ns, int64_t *timeout)
> +                              int64_t max_ns, int64_t *timeout, bool *success)
>  {
>      bool progress;
>      int64_t start_time, elapsed_time;
> @@ -553,6 +553,8 @@ static bool run_poll_handlers(AioContext *ctx, AioHandlerList *ready_list,
>          progress = true;
>      }
>
> +    *success = !*timeout;
> +
>      /* If time has passed with no successful polling, adjust *timeout to
>       * keep the same ending time.
>       */
> @@ -577,22 +579,28 @@ static bool run_poll_handlers(AioContext *ctx, AioHandlerList *ready_list,
>  static bool try_poll_mode(AioContext *ctx, AioHandlerList *ready_list,
>                            int64_t *timeout)
>  {
> -    int64_t max_ns;
> +    int64_t max_ns, start_time;
> +    bool success = false;
>
>      if (QLIST_EMPTY_RCU(&ctx->poll_aio_handlers)) {
>          return false;
>      }
>
> +    if (!*timeout) {
> +        start_time = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
> +        return run_poll_handlers_once(ctx, ready_list, start_time, timeout);
> +    }
> +
>      max_ns = qemu_soonest_timeout(*timeout, ctx->poll_ns);
>      if (max_ns && !ctx->fdmon_ops->need_wait(ctx)) {
>          poll_set_started(ctx, ready_list, true);
>
> -        if (run_poll_handlers(ctx, ready_list, max_ns, timeout)) {
> +        if (run_poll_handlers(ctx, ready_list, max_ns, timeout, &success)) {
>              return true;
>          }
>      }
>
> -    if (poll_set_started(ctx, ready_list, false)) {
> +    if (!success && poll_set_started(ctx, ready_list, false)) {
>          *timeout = 0;
>          return true;
>      }
> --
> 2.25.1
On Wed, Jul 06, 2022 at 12:59:29PM +0100, Stefan Hajnoczi wrote:
>On Fri, Jul 01, 2022 at 05:13:48PM +0800, Chao Gao wrote:
>> When measuring FIO read performance (cache=writethrough, bs=4k, iodepth=64) in
>> VMs, we observe ~80K/s notifications (e.g., EPT_MISCONFIG) from guest to qemu.
>
>It's not clear to me what caused the frequent poll_set_started(ctx,
>false) calls and whether this patch is correct. I have posted some

Me neither. That's why the patch was marked RFC.

>questions below about the nature of this issue.
>
>If ctx->fdmon_ops->wait() is called while polling is still enabled then
>hangs or unnecessary latency can occur. For example, consider an fd
>handler that temporarily suppresses fd activity between poll start/end.
>The thread would be blocked in ->wait() and the fd will never become
>readable. Even if a hang doesn't occur because there is a timeout, there
>would be extra latency because the fd doesn't become readable and we
>have to wait for the timeout to expire so we can poll again. So we must
>be sure it's safe to leave polling enabled across the event loop and I'm
>not sure if this patch is okay.

Thanks for the explanation.

In aio_poll():

    if (timeout || ctx->fdmon_ops->need_wait(ctx)) {
        ctx->fdmon_ops->wait(ctx, &ready_list, timeout);
    }

If @timeout is zero, then ctx->fdmon_ops->wait() won't be called.

In case #2 and #3 below, it is guaranteed that @timeout is zero after
try_poll_mode() returns. So, I think it is safe to keep polling enabled
for these two cases.

>
>> Currently, poll_set_started(ctx, false) is called in try_poll_mode() to enable
>> virtqueue notification in below 4 cases:
>>
>> 1. ctx->poll_ns is 0
>> 2. a zero timeout is passed to try_poll_mode()
>> 3. polling succeeded but reported as no progress
>> 4. polling failed and reported as no progress
>>
>> To minimize unnecessary guest notifications, keep notification disabled when
>> it is possible, i.e., polling is enabled and last polling doesn't fail.
>
>What is the exact definition of polling success/failure?

Polling success: found some events pending.
Polling failure: timeout.

success/failure are used because I saw the below comment in
run_poll_handlers_once(), so I thought they were well-known terms:

    /*
     * Polling was successful, exit try_poll_mode immediately
     * to adjust the next polling time.
     */

>
>> Keep notification disabled for case #2 and #3; handle case #2 simply by a call
>
>Did you see case #2 happening often? What was the cause?

I think so. I can add some tracepoints and collect statistics.

IMO (of course, I can be totally wrong), the cause is:
when a worker thread in the thread pool completes an IO request, a bottom
half is queued by worker_thread()->qemu_bh_schedule(). Pending bottom
halves lead to aio_compute_timeout() setting the timeout to zero, and then
a zero timeout is passed to try_poll_mode().

>
>> of run_poll_handlers_once() and for case #3, differentiate successful/failed
>> polling and skip the call of poll_set_started(ctx, false) for successful ones.
>
>This is probably the most interesting case. When polling detects an
>event, that's considered "progress", except for aio_notify() events.
>aio_notify() is an internal event for restarting the event loop when
>something has changed (e.g. fd handlers have been added/removed). I
>wouldn't expect it to interfere with polling frequently since that
>requires another QEMU thread doing something to the AioContext, which
>should be rare.
>
>Was aio_notify() interfering with polling in your case? Do you know why?

Yes, it was. The reason is the same: after finishing IO requests, worker
threads queue bottom halves, during which aio_notify() is called.
On Wed, Jul 06, 2022 at 10:12:14PM +0800, Chao Gao wrote:
> On Wed, Jul 06, 2022 at 12:59:29PM +0100, Stefan Hajnoczi wrote:
> >On Fri, Jul 01, 2022 at 05:13:48PM +0800, Chao Gao wrote:
> >> When measuring FIO read performance (cache=writethrough, bs=4k, iodepth=64) in
> >> VMs, we observe ~80K/s notifications (e.g., EPT_MISCONFIG) from guest to qemu.
> >
> >It's not clear to me what caused the frequent poll_set_started(ctx,
> >false) calls and whether this patch is correct. I have posted some
>
> Me neither. That's why the patch was marked RFC.

Your explanation about worker threads calling qemu_bh_schedule() ->
aio_notify() makes sense to me. Polling mode is stopped prematurely when
the event loop goes to process the BH. There is no need to stop polling
mode; the next event loop iteration could resume polling mode.

> >questions below about the nature of this issue.
> >
> >If ctx->fdmon_ops->wait() is called while polling is still enabled then
> >hangs or unnecessary latency can occur. For example, consider an fd
> >handler that temporarily suppresses fd activity between poll start/end.
> >The thread would be blocked in ->wait() and the fd will never become
> >readable. Even if a hang doesn't occur because there is a timeout, there
> >would be extra latency because the fd doesn't become readable and we
> >have to wait for the timeout to expire so we can poll again. So we must
> >be sure it's safe to leave polling enabled across the event loop and I'm
> >not sure if this patch is okay.
>
> Thanks for the explanation.
>
> In aio_poll():
>
>     if (timeout || ctx->fdmon_ops->need_wait(ctx)) {
>         ctx->fdmon_ops->wait(ctx, &ready_list, timeout);
>     }
>
> If @timeout is zero, then ctx->fdmon_ops->wait() won't be called.
>
> In case #2 and #3 below, it is guaranteed that @timeout is zero after
> try_poll_mode() returns. So, I think it is safe to keep polling enabled
> for these two cases.

I think you're right, it's safe since timeout is zeroed.

> >
> >> Currently, poll_set_started(ctx, false) is called in try_poll_mode() to enable
> >> virtqueue notification in below 4 cases:
> >>
> >> 1. ctx->poll_ns is 0
> >> 2. a zero timeout is passed to try_poll_mode()
> >> 3. polling succeeded but reported as no progress
> >> 4. polling failed and reported as no progress
> >>
> >> To minimize unnecessary guest notifications, keep notification disabled when
> >> it is possible, i.e., polling is enabled and last polling doesn't fail.
> >
> >What is the exact definition of polling success/failure?
>
> Polling success: found some events pending.
> Polling failure: timeout.
>
> success/failure are used because I saw the below comment in
> run_poll_handlers_once(), so I thought they were well-known terms:
>
>     /*
>      * Polling was successful, exit try_poll_mode immediately
>      * to adjust the next polling time.
>      */
>
> >
> >> Keep notification disabled for case #2 and #3; handle case #2 simply by a call
> >
> >Did you see case #2 happening often? What was the cause?
>
> I think so. I can add some tracepoints and collect statistics.
>
> IMO (of course, I can be totally wrong), the cause is:
> when a worker thread in the thread pool completes an IO request, a bottom
> half is queued by worker_thread()->qemu_bh_schedule(). Pending bottom
> halves lead to aio_compute_timeout() setting the timeout to zero, and then
> a zero timeout is passed to try_poll_mode().
>
> >
> >> of run_poll_handlers_once() and for case #3, differentiate successful/failed
> >> polling and skip the call of poll_set_started(ctx, false) for successful ones.
> >
> >This is probably the most interesting case. When polling detects an
> >event, that's considered "progress", except for aio_notify() events.
> >aio_notify() is an internal event for restarting the event loop when
> >something has changed (e.g. fd handlers have been added/removed). I
> >wouldn't expect it to interfere with polling frequently since that
> >requires another QEMU thread doing something to the AioContext, which
> >should be rare.
> >
> >Was aio_notify() interfering with polling in your case? Do you know why?
>
> Yes, it was. The reason is the same: after finishing IO requests, worker
> threads queue bottom halves, during which aio_notify() is called.

Does this patch solve the issue? The idea is to defer
poll_set_started(false) for as long as possible.

diff --git a/util/aio-posix.c b/util/aio-posix.c
index 731f3826c0..536f8b2534 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -591,12 +591,6 @@ static bool try_poll_mode(AioContext *ctx, AioHandlerList *ready_list,
             return true;
         }
     }
-
-    if (poll_set_started(ctx, ready_list, false)) {
-        *timeout = 0;
-        return true;
-    }
-
     return false;
 }

@@ -657,6 +651,11 @@ bool aio_poll(AioContext *ctx, bool blocking)
      * system call---a single round of run_poll_handlers_once suffices.
      */
     if (timeout || ctx->fdmon_ops->need_wait(ctx)) {
+        if (poll_set_started(ctx, &ready_list, false)) {
+            timeout = 0;
+            progress = true;
+        }
+
         ctx->fdmon_ops->wait(ctx, &ready_list, timeout);
     }
On Thu, Jul 07, 2022 at 10:04:23AM +0100, Stefan Hajnoczi wrote:
>
>Does this patch solve the issue? The idea is to defer
>poll_set_started(false) for as long as possible.

Good idea! It is straightforward.

>
>diff --git a/util/aio-posix.c b/util/aio-posix.c
>index 731f3826c0..536f8b2534 100644
>--- a/util/aio-posix.c
>+++ b/util/aio-posix.c
>@@ -591,12 +591,6 @@ static bool try_poll_mode(AioContext *ctx, AioHandlerList *ready_list,
>             return true;
>         }
>     }
>-
>-    if (poll_set_started(ctx, ready_list, false)) {
>-        *timeout = 0;
>-        return true;
>-    }
>-
>     return false;
> }
>
>@@ -657,6 +651,11 @@ bool aio_poll(AioContext *ctx, bool blocking)
>      * system call---a single round of run_poll_handlers_once suffices.
>      */
>     if (timeout || ctx->fdmon_ops->need_wait(ctx)) {
>+        if (poll_set_started(ctx, &ready_list, false)) {
>+            timeout = 0;
>+            progress = true;

In this case, is it ok to skip the call of ->wait() below? If yes, maybe
put the call in the "else" path.

>+        }
>+
>         ctx->fdmon_ops->wait(ctx, &ready_list, timeout);
>     }
>

Anyway,

Reviewed-by: Chao Gao <chao.gao@intel.com>

And my tests show your change works well. The number of notifications is
significantly reduced from ~80K/s to tens.

Tested-by: Chao Gao <chao.gao@intel.com>
diff --git a/util/aio-posix.c b/util/aio-posix.c
index 731f3826c0..bd2076145b 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -519,7 +519,7 @@ static bool remove_idle_poll_handlers(AioContext *ctx,
  * Returns: true if progress was made, false otherwise
  */
 static bool run_poll_handlers(AioContext *ctx, AioHandlerList *ready_list,
-                              int64_t max_ns, int64_t *timeout)
+                              int64_t max_ns, int64_t *timeout, bool *success)
 {
     bool progress;
     int64_t start_time, elapsed_time;
@@ -553,6 +553,8 @@ static bool run_poll_handlers(AioContext *ctx, AioHandlerList *ready_list,
         progress = true;
     }
 
+    *success = !*timeout;
+
     /* If time has passed with no successful polling, adjust *timeout to
      * keep the same ending time.
      */
@@ -577,22 +579,28 @@ static bool run_poll_handlers(AioContext *ctx, AioHandlerList *ready_list,
 static bool try_poll_mode(AioContext *ctx, AioHandlerList *ready_list,
                           int64_t *timeout)
 {
-    int64_t max_ns;
+    int64_t max_ns, start_time;
+    bool success = false;
 
     if (QLIST_EMPTY_RCU(&ctx->poll_aio_handlers)) {
         return false;
     }
 
+    if (!*timeout) {
+        start_time = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
+        return run_poll_handlers_once(ctx, ready_list, start_time, timeout);
+    }
+
     max_ns = qemu_soonest_timeout(*timeout, ctx->poll_ns);
     if (max_ns && !ctx->fdmon_ops->need_wait(ctx)) {
         poll_set_started(ctx, ready_list, true);
 
-        if (run_poll_handlers(ctx, ready_list, max_ns, timeout)) {
+        if (run_poll_handlers(ctx, ready_list, max_ns, timeout, &success)) {
             return true;
         }
     }
 
-    if (poll_set_started(ctx, ready_list, false)) {
+    if (!success && poll_set_started(ctx, ready_list, false)) {
         *timeout = 0;
         return true;
     }
When measuring FIO read performance (cache=writethrough, bs=4k,
iodepth=64) in VMs, we observe ~80K/s notifications (e.g.,
EPT_MISCONFIG) from guest to qemu.

Currently, poll_set_started(ctx, false) is called in try_poll_mode() to
enable virtqueue notification in below 4 cases:

1. ctx->poll_ns is 0
2. a zero timeout is passed to try_poll_mode()
3. polling succeeded but reported as no progress
4. polling failed and reported as no progress

To minimize unnecessary guest notifications, keep notification disabled
when it is possible, i.e., polling is enabled and last polling doesn't
fail.

Keep notification disabled for case #2 and #3; handle case #2 simply by
a call of run_poll_handlers_once() and for case #3, differentiate
successful/failed polling and skip the call of poll_set_started(ctx,
false) for successful ones.

With this patch applied, FIO seq-read performance (bs=4k, iodepth=64,
cache=writethrough) in VMs increases from 330K/s to 413K/s IOPS.

Below are changes in VM-exit statistics:

w/o this patch (duration: 10s):
            VM-EXIT  Samples  Samples%   Time%  Min Time  Max Time  Avg time
      EPT_MISCONFIG   797440    99.34%  98.58%    0.39us   57.92us  0.60us ( +-  0.05% )
          MSR_WRITE     3672     0.46%   1.15%    0.88us    4.97us  1.52us ( +-  0.53% )
 EXTERNAL_INTERRUPT     1638     0.20%   0.27%    0.59us   11.04us  0.81us ( +-  1.71% )

w/ this patch (duration: 10s):
            VM-EXIT  Samples  Samples%   Time%  Min Time  Max Time  Avg time
          MSR_WRITE     3524    84.11%  87.17%    0.86us    4.43us  1.74us ( +-  0.60% )
 EXTERNAL_INTERRUPT      515    12.29%  10.05%    0.64us    8.95us  1.37us ( +-  1.54% )
      EPT_MISCONFIG      151     3.60%   2.79%    0.40us   52.07us  1.30us ( +- 31.44% )

Signed-off-by: Chao Gao <chao.gao@intel.com>
---
 util/aio-posix.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)