
[RFC,v1] util/aio: Keep notification disabled as much as possible

Message ID 20220701091348.215320-1-chao.gao@intel.com (mailing list archive)
State New, archived

Commit Message

Chao Gao July 1, 2022, 9:13 a.m. UTC
When measuring FIO read performance (cache=writethrough, bs=4k, iodepth=64) in
VMs, we observe ~80K/s notifications (e.g., EPT_MISCONFIG) from guest to qemu.

Currently, poll_set_started(ctx, false) is called in try_poll_mode() to enable
virtqueue notification in the following four cases:

1. ctx->poll_ns is 0
2. a zero timeout is passed to try_poll_mode()
3. polling succeeded but reported as no progress
4. polling failed and reported as no progress
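
For reference, the pre-patch try_poll_mode() flow behind these four cases looks
roughly like this (a simplified sketch reconstructed from the diff context at
the end of this mail, not the verbatim source):

static bool try_poll_mode(AioContext *ctx, AioHandlerList *ready_list,
                          int64_t *timeout)
{
    int64_t max_ns;

    if (QLIST_EMPTY_RCU(&ctx->poll_aio_handlers)) {
        return false;
    }

    /* Case #1 (ctx->poll_ns == 0) and case #2 (*timeout == 0) make max_ns
     * zero, so the polling block below is skipped entirely.
     */
    max_ns = qemu_soonest_timeout(*timeout, ctx->poll_ns);
    if (max_ns && !ctx->fdmon_ops->need_wait(ctx)) {
        poll_set_started(ctx, ready_list, true);

        /* Progress returns here; cases #3 and #4 (no progress) fall through. */
        if (run_poll_handlers(ctx, ready_list, max_ns, timeout)) {
            return true;
        }
    }

    /* All four cases end up here, re-enabling virtqueue notification. */
    if (poll_set_started(ctx, ready_list, false)) {
        *timeout = 0;
        return true;
    }

    return false;
}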

To minimize unnecessary guest notifications, keep notification disabled whenever
possible, i.e., when polling is enabled and the last polling attempt did not fail.

Keep notification disabled for cases #2 and #3: handle case #2 simply with a call
to run_poll_handlers_once(), and for case #3, differentiate successful from failed
polling and skip the call to poll_set_started(ctx, false) for successful ones.

With this patch applied, FIO seq-read performance (bs=4k, iodepth=64,
cache=writethrough) in VMs increases from 330K/s to 413K/s IOPS.

Below are changes in VM-exit statistics:
w/o this patch (duration:10s):
           VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time         Avg time

     EPT_MISCONFIG     797440    99.34%    98.58%      0.39us     57.92us      0.60us ( +-   0.05% )
         MSR_WRITE       3672     0.46%     1.15%      0.88us      4.97us      1.52us ( +-   0.53% )
EXTERNAL_INTERRUPT       1638     0.20%     0.27%      0.59us     11.04us      0.81us ( +-   1.71% )

w/ this patch (duration:10s):
          VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time         Avg time

         MSR_WRITE       3524    84.11%    87.17%      0.86us      4.43us      1.74us ( +-   0.60% )
EXTERNAL_INTERRUPT        515    12.29%    10.05%      0.64us      8.95us      1.37us ( +-   1.54% )
     EPT_MISCONFIG        151     3.60%     2.79%      0.40us     52.07us      1.30us ( +-  31.44% )

Signed-off-by: Chao Gao <chao.gao@intel.com>
---
 util/aio-posix.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

Comments

Stefan Hajnoczi July 6, 2022, 11:59 a.m. UTC | #1
On Fri, Jul 01, 2022 at 05:13:48PM +0800, Chao Gao wrote:
> When measuring FIO read performance (cache=writethrough, bs=4k, iodepth=64) in
> VMs, we observe ~80K/s notifications (e.g., EPT_MISCONFIG) from guest to qemu.

It's not clear to me what caused the frequent poll_set_started(ctx,
false) calls and whether this patch is correct. I have posted some
questions below about the nature of this issue.

If ctx->fdmon_ops->wait() is called while polling is still enabled then
hangs or unnecessary latency can occur. For example, consider an fd
handler that temporarily suppresses fd activity between poll start/end.
The thread would be blocked in ->wait() and the fd will never become
readable. Even if a hang doesn't occur because there is a timeout, there
would be extra latency because the fd doesn't become readable and we
have to wait for the timeout to expire so we can poll again. So we must
be sure it's safe to leave polling enabled across the event loop and I'm
not sure if this patch is okay.
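
To illustrate the concern (the handler names below are hypothetical, not actual
QEMU code): a poll handler can register io_poll_begin/io_poll_end callbacks via
aio_set_event_notifier_poll(), and poll_set_started() invokes them when polling
starts/stops. If io_poll_end() is never run before ->wait(), the fd can stay
quiet:

/* Hypothetical handler, for illustration only. */
static void my_poll_begin(EventNotifier *n)
{
    /* Invoked via poll_set_started(ctx, ..., true): tell the guest not to
     * kick the notifier, so the fd stays quiet while we poll.
     */
}

static void my_poll_end(EventNotifier *n)
{
    /* Invoked via poll_set_started(ctx, ..., false): re-enable notification
     * and pick up any request that arrived while notifications were off.
     */
}

static void my_setup(AioContext *ctx, EventNotifier *n)
{
    aio_set_event_notifier_poll(ctx, n, my_poll_begin, my_poll_end);
}

/*
 * If poll_set_started(ctx, ..., false) is skipped before
 * ctx->fdmon_ops->wait(), my_poll_end() never runs, notifications stay
 * suppressed, and the thread can sleep on an fd that will never fire (or
 * must wait for the timeout to expire).
 */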

> 
> Currently, poll_set_started(ctx, false) is called in try_poll_mode() to enable
> virtqueue notification in the following four cases:
> 
> 1. ctx->poll_ns is 0
> 2. a zero timeout is passed to try_poll_mode()
> 3. polling succeeded but reported as no progress
> 4. polling failed and reported as no progress
> 
> To minimize unnecessary guest notifications, keep notification disabled whenever
> possible, i.e., when polling is enabled and the last polling attempt did not fail.

What is the exact definition of polling success/failure?

> 
> Keep notification disabled for cases #2 and #3: handle case #2 simply with a call

Did you see case #2 happening often? What was the cause?

> to run_poll_handlers_once(), and for case #3, differentiate successful from failed
> polling and skip the call to poll_set_started(ctx, false) for successful ones.

This is probably the most interesting case. When polling detects an
event, that's considered "progress", except for aio_notify() events.
aio_notify() is an internal event for restarting the event loop when
something has changed (e.g. fd handlers have been added/removed). I
wouldn't expect it to interfere with polling frequently since that requires
another QEMU thread doing something to the AioContext, which should be
rare.

Was aio_notify() interfering with polling in your case? Do you know why?
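
For context, an abridged sketch of the loop in run_poll_handlers_once() (not the
verbatim code; some checks and the ready-list bookkeeping are elided) shows the
aio_notify() special case: a detected event always zeroes *timeout, but it only
counts as progress when it did not come from the AioContext's own notifier:

    QLIST_FOREACH_RCU(node, &ctx->poll_aio_handlers, node_poll) {
        if (node->io_poll(node->opaque)) {
            /*
             * Polling was successful, exit try_poll_mode immediately
             * to adjust the next polling time.
             */
            *timeout = 0;
            if (node->opaque != &ctx->notifier) {
                /* A "real" event, e.g. a virtqueue kick or I/O completion. */
                progress = true;
            }
            /* An aio_notify() event zeroes *timeout but is not progress. */
        }
    }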

> 
> With this patch applied, FIO seq-read performance (bs=4k, iodepth=64,
> cache=writethrough) in VMs increases from 330K/s to 413K/s IOPS.
> 
> Below are changes in VM-exit statistics:
> w/o this patch (duration:10s):
>            VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time         Avg time
> 
>      EPT_MISCONFIG     797440    99.34%    98.58%      0.39us     57.92us      0.60us ( +-   0.05% )
>          MSR_WRITE       3672     0.46%     1.15%      0.88us      4.97us      1.52us ( +-   0.53% )
> EXTERNAL_INTERRUPT       1638     0.20%     0.27%      0.59us     11.04us      0.81us ( +-   1.71% )
> 
> w/ this patch (duration:10s):
>           VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time         Avg time
> 
>          MSR_WRITE       3524    84.11%    87.17%      0.86us      4.43us      1.74us ( +-   0.60% )
> EXTERNAL_INTERRUPT        515    12.29%    10.05%      0.64us      8.95us      1.37us ( +-   1.54% )
>      EPT_MISCONFIG        151     3.60%     2.79%      0.40us     52.07us      1.30us ( +-  31.44% )
> 
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> ---
>  util/aio-posix.c | 16 ++++++++++++----
>  1 file changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/util/aio-posix.c b/util/aio-posix.c
> index 731f3826c0..bd2076145b 100644
> --- a/util/aio-posix.c
> +++ b/util/aio-posix.c
> @@ -519,7 +519,7 @@ static bool remove_idle_poll_handlers(AioContext *ctx,
>   * Returns: true if progress was made, false otherwise
>   */
>  static bool run_poll_handlers(AioContext *ctx, AioHandlerList *ready_list,
> -                              int64_t max_ns, int64_t *timeout)
> +                              int64_t max_ns, int64_t *timeout, bool *success)
>  {
>      bool progress;
>      int64_t start_time, elapsed_time;
> @@ -553,6 +553,8 @@ static bool run_poll_handlers(AioContext *ctx, AioHandlerList *ready_list,
>          progress = true;
>      }
>  
> +    *success = !*timeout;
> +
>      /* If time has passed with no successful polling, adjust *timeout to
>       * keep the same ending time.
>       */
> @@ -577,22 +579,28 @@ static bool run_poll_handlers(AioContext *ctx, AioHandlerList *ready_list,
>  static bool try_poll_mode(AioContext *ctx, AioHandlerList *ready_list,
>                            int64_t *timeout)
>  {
> -    int64_t max_ns;
> +    int64_t max_ns, start_time;
> +    bool success = false;
>  
>      if (QLIST_EMPTY_RCU(&ctx->poll_aio_handlers)) {
>          return false;
>      }
>  
> +    if (!*timeout) {
> +        start_time = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
> +        return run_poll_handlers_once(ctx, ready_list, start_time, timeout);
> +    }
> +
>      max_ns = qemu_soonest_timeout(*timeout, ctx->poll_ns);
>      if (max_ns && !ctx->fdmon_ops->need_wait(ctx)) {
>          poll_set_started(ctx, ready_list, true);
>  
> -        if (run_poll_handlers(ctx, ready_list, max_ns, timeout)) {
> +        if (run_poll_handlers(ctx, ready_list, max_ns, timeout, &success)) {
>              return true;
>          }
>      }
>  
> -    if (poll_set_started(ctx, ready_list, false)) {
> +    if (!success && poll_set_started(ctx, ready_list, false)) {
>          *timeout = 0;
>          return true;
>      }
> -- 
> 2.25.1
>
Chao Gao July 6, 2022, 2:12 p.m. UTC | #2
On Wed, Jul 06, 2022 at 12:59:29PM +0100, Stefan Hajnoczi wrote:
>On Fri, Jul 01, 2022 at 05:13:48PM +0800, Chao Gao wrote:
>> When measuring FIO read performance (cache=writethrough, bs=4k, iodepth=64) in
>> VMs, we observe ~80K/s notifications (e.g., EPT_MISCONFIG) from guest to qemu.
>
>It's not clear to me what caused the frequent poll_set_started(ctx,
>false) calls and whether this patch is correct. I have posted some

Me neither. That's why the patch was marked RFC.

>questions below about the nature of this issue.
>
>If ctx->fdmon_ops->wait() is called while polling is still enabled then
>hangs or unnecessary latency can occur. For example, consider an fd
>handler that temporarily suppresses fd activity between poll start/end.
>The thread would be blocked in ->wait() and the fd will never become
>readable. Even if a hang doesn't occur because there is a timeout, there
>would be extra latency because the fd doesn't become readable and we
>have to wait for the timeout to expire so we can poll again. So we must
>be sure it's safe to leave polling enabled across the event loop and I'm
>not sure if this patch is okay.

Thanks for the explanation.

in aio_poll(),

    if (timeout || ctx->fdmon_ops->need_wait(ctx)) {
        ctx->fdmon_ops->wait(ctx, &ready_list, timeout);
    }

if @timeout is zero, then ctx->fdmon_ops->wait() won't be called.

In cases #2 and #3 below, it is guaranteed that @timeout is zero after
try_poll_mode() returns. So I think it is safe to keep polling enabled
for these two cases.

>
>> 
>> Currently, poll_set_started(ctx, false) is called in try_poll_mode() to enable
>> virtqueue notification in the following four cases:
>> 
>> 1. ctx->poll_ns is 0
>> 2. a zero timeout is passed to try_poll_mode()
>> 3. polling succeeded but reported as no progress
>> 4. polling failed and reported as no progress
>> 
>> To minimize unnecessary guest notifications, keep notification disabled whenever
>> possible, i.e., when polling is enabled and the last polling attempt did not fail.
>
>What is the exact definition of polling success/failure?

Polling success: found some events pending.
Polling failure: timeout.

I used success/failure because I saw the comment below in
run_poll_handlers_once() and thought they were well-known terms.

            /*
             * Polling was successful, exit try_poll_mode immediately
             * to adjust the next polling time.
             */

>
>> 
>> Keep notification disabled for cases #2 and #3: handle case #2 simply with a call
>
>Did you see case #2 happening often? What was the cause?

I think so. I can add some tracepoints and collect statistics.

IMO (of course, I can be totally wrong), the cause is:
when a worker thread in thread poll completes an IO request, a bottom
half is queued by work_thread()->qemu_bh_schedule(). Pending bottom
halves lead to aio_compute_timeout() setting timeout to zero and then
a zero timeout is passed to try_poll_mode().

>
>> to run_poll_handlers_once(), and for case #3, differentiate successful from failed
>> polling and skip the call to poll_set_started(ctx, false) for successful ones.
>
>This is probably the most interesting case. When polling detects an
>event, that's considered "progress", except for aio_notify() events.
>aio_notify() is an internal event for restarting the event loop when
>something has changed (e.g. fd handlers have been added/removed). I
>wouldn't expect it to interfere with polling frequently since that requires
>another QEMU thread doing something to the AioContext, which should be
>rare.
>
>Was aio_notify() interfering with polling in your case? Do you know why?

Yes, it was. The reason is the same: after finishing IO requests, worker
threads queue bottom halves, and queuing a bottom half calls aio_notify().
Stefan Hajnoczi July 7, 2022, 9:04 a.m. UTC | #3
On Wed, Jul 06, 2022 at 10:12:14PM +0800, Chao Gao wrote:
> On Wed, Jul 06, 2022 at 12:59:29PM +0100, Stefan Hajnoczi wrote:
> >On Fri, Jul 01, 2022 at 05:13:48PM +0800, Chao Gao wrote:
> >> When measuring FIO read performance (cache=writethrough, bs=4k, iodepth=64) in
> >> VMs, we observe ~80K/s notifications (e.g., EPT_MISCONFIG) from guest to qemu.
> >
> >It's not clear to me what caused the frequent poll_set_started(ctx,
> >false) calls and whether this patch is correct. I have posted some
> 
> Me neither. That's why the patch was marked RFC.

Your explanation about worker threads calling qemu_bh_schedule() ->
aio_notify() makes sense to me. Polling mode is stopped prematurely when
the event loop goes to process the BH. There is no need to stop polling
mode; the next event loop iteration could resume polling mode.

> >questions below about the nature of this issue.
> >
> >If ctx->fdmon_ops->wait() is called while polling is still enabled then
> >hangs or unnecessary latency can occur. For example, consider an fd
> >handler that temporarily suppresses fd activity between poll start/end.
> >The thread would be blocked in ->wait() and the fd will never become
> >readable. Even if a hang doesn't occur because there is a timeout, there
> >would be extra latency because the fd doesn't become readable and we
> >have to wait for the timeout to expire so we can poll again. So we must
> >be sure it's safe to leave polling enabled across the event loop and I'm
> >not sure if this patch is okay.
> 
> Thanks for the explanation.
> 
> in aio_poll(),
> 
>     if (timeout || ctx->fdmon_ops->need_wait(ctx)) {
>         ctx->fdmon_ops->wait(ctx, &ready_list, timeout);
>     }
> 
> if @timeout is zero, then ctx->fdmon_ops->wait() won't be called.
> 
> In cases #2 and #3 below, it is guaranteed that @timeout is zero after
> try_poll_mode() returns. So I think it is safe to keep polling enabled
> for these two cases.

I think you're right, it's safe since timeout is zeroed.

> 
> >
> >> 
> >> Currently, poll_set_started(ctx, false) is called in try_poll_mode() to enable
> >> virtqueue notification in the following four cases:
> >> 
> >> 1. ctx->poll_ns is 0
> >> 2. a zero timeout is passed to try_poll_mode()
> >> 3. polling succeeded but reported as no progress
> >> 4. polling failed and reported as no progress
> >> 
> >> To minimize unnecessary guest notifications, keep notification disabled whenever
> >> possible, i.e., when polling is enabled and the last polling attempt did not fail.
> >
> >What is the exact definition of polling success/failure?
> 
> Polling success: found some events pending.
> Polling failure: timeout.
> 
> I used success/failure because I saw the comment below in
> run_poll_handlers_once() and thought they were well-known terms.
> 
>             /*
>              * Polling was successful, exit try_poll_mode immediately
>              * to adjust the next polling time.
>              */
> 
> >
> >> 
> >> Keep notification disabled for cases #2 and #3: handle case #2 simply with a call
> >
> >Did you see case #2 happening often? What was the cause?
> 
> I think so. I can add some tracepoints and collect statistics.
> 
> IMO (of course, I can be totally wrong), the cause is:
> when a worker thread in the thread pool completes an IO request, a bottom
> half is queued via worker_thread()->qemu_bh_schedule(). Pending bottom
> halves cause aio_compute_timeout() to return zero, and that zero timeout
> is then passed to try_poll_mode().
> 
> >
> >> to run_poll_handlers_once(), and for case #3, differentiate successful from failed
> >> polling and skip the call to poll_set_started(ctx, false) for successful ones.
> >
> >This is probably the most interesting case. When polling detects an
> >event, that's considered "progress", except for aio_notify() events.
> >aio_notify() is an internal event for restarting the event loop when
> >something has changed (e.g. fd handlers have been added/removed). I
> >wouldn't expect it to intefere polling frequently since that requires
> >another QEMU thread doing something to the AioContext, which should be
> >rare.
> >
> >Was aio_notify() interfering with polling in your case? Do you know why?
> 
> Yes, it was. The reason is the same: after finishing IO requests, worker
> threads queue bottom halves, and queuing a bottom half calls aio_notify().

Does this patch solve the issue? The idea is to defer
poll_set_started(false) for as long as possible.

diff --git a/util/aio-posix.c b/util/aio-posix.c
index 731f3826c0..536f8b2534 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -591,12 +591,6 @@ static bool try_poll_mode(AioContext *ctx, AioHandlerList *ready_list,
             return true;
         }
     }
-
-    if (poll_set_started(ctx, ready_list, false)) {
-        *timeout = 0;
-        return true;
-    }
-
     return false;
 }

@@ -657,6 +651,11 @@ bool aio_poll(AioContext *ctx, bool blocking)
      * system call---a single round of run_poll_handlers_once suffices.
      */
     if (timeout || ctx->fdmon_ops->need_wait(ctx)) {
+        if (poll_set_started(ctx, &ready_list, false)) {
+            timeout = 0;
+            progress = true;
+        }
+
         ctx->fdmon_ops->wait(ctx, &ready_list, timeout);
     }
Chao Gao July 7, 2022, 10:16 a.m. UTC | #4
On Thu, Jul 07, 2022 at 10:04:23AM +0100, Stefan Hajnoczi wrote:
>
>Does this patch solve the issue? The idea is to defer
>poll_set_started(false) for as long as possible.

Good idea! It is straightforward.

>
>diff --git a/util/aio-posix.c b/util/aio-posix.c
>index 731f3826c0..536f8b2534 100644
>--- a/util/aio-posix.c
>+++ b/util/aio-posix.c
>@@ -591,12 +591,6 @@ static bool try_poll_mode(AioContext *ctx, AioHandlerList *ready_list,
>             return true;
>         }
>     }
>-
>-    if (poll_set_started(ctx, ready_list, false)) {
>-        *timeout = 0;
>-        return true;
>-    }
>-
>     return false;
> }
>
>@@ -657,6 +651,11 @@ bool aio_poll(AioContext *ctx, bool blocking)
>      * system call---a single round of run_poll_handlers_once suffices.
>      */
>     if (timeout || ctx->fdmon_ops->need_wait(ctx)) {
>+        if (poll_set_started(ctx, &ready_list, false)) {
>+            timeout = 0;
>+            progress = true;

In this case, is it OK to skip the call to ->wait() below? If so, maybe put
the call in the "else" path (roughly as sketched after the quoted diff).

>+        }
>+
>         ctx->fdmon_ops->wait(ctx, &ready_list, timeout);
>     }
>
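
That is, something like this (illustrative sketch only, not a tested change):

    if (timeout || ctx->fdmon_ops->need_wait(ctx)) {
        if (poll_set_started(ctx, &ready_list, false)) {
            /* Stopping polling made some handlers ready; no need to wait. */
            timeout = 0;
            progress = true;
        } else {
            ctx->fdmon_ops->wait(ctx, &ready_list, timeout);
        }
    }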

Anyway,
Reviewed-by: Chao Gao <chao.gao@intel.com>

And my tests show your change works well. The number of notifications is
significantly reduced from ~80K/s to tens.

Tested-by: Chao Gao <chao.gao@intel.com>

Patch

diff --git a/util/aio-posix.c b/util/aio-posix.c
index 731f3826c0..bd2076145b 100644
--- a/util/aio-posix.c
+++ b/util/aio-posix.c
@@ -519,7 +519,7 @@  static bool remove_idle_poll_handlers(AioContext *ctx,
  * Returns: true if progress was made, false otherwise
  */
 static bool run_poll_handlers(AioContext *ctx, AioHandlerList *ready_list,
-                              int64_t max_ns, int64_t *timeout)
+                              int64_t max_ns, int64_t *timeout, bool *success)
 {
     bool progress;
     int64_t start_time, elapsed_time;
@@ -553,6 +553,8 @@  static bool run_poll_handlers(AioContext *ctx, AioHandlerList *ready_list,
         progress = true;
     }
 
+    *success = !*timeout;
+
     /* If time has passed with no successful polling, adjust *timeout to
      * keep the same ending time.
      */
@@ -577,22 +579,28 @@  static bool run_poll_handlers(AioContext *ctx, AioHandlerList *ready_list,
 static bool try_poll_mode(AioContext *ctx, AioHandlerList *ready_list,
                           int64_t *timeout)
 {
-    int64_t max_ns;
+    int64_t max_ns, start_time;
+    bool success = false;
 
     if (QLIST_EMPTY_RCU(&ctx->poll_aio_handlers)) {
         return false;
     }
 
+    if (!*timeout) {
+        start_time = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
+        return run_poll_handlers_once(ctx, ready_list, start_time, timeout);
+    }
+
     max_ns = qemu_soonest_timeout(*timeout, ctx->poll_ns);
     if (max_ns && !ctx->fdmon_ops->need_wait(ctx)) {
         poll_set_started(ctx, ready_list, true);
 
-        if (run_poll_handlers(ctx, ready_list, max_ns, timeout)) {
+        if (run_poll_handlers(ctx, ready_list, max_ns, timeout, &success)) {
             return true;
         }
     }
 
-    if (poll_set_started(ctx, ready_list, false)) {
+    if (!success && poll_set_started(ctx, ready_list, false)) {
         *timeout = 0;
         return true;
     }