[RFCv2,0/9] kvm/arm64: Support Async Page Fault

Message ID 20200508032919.52147-1-gshan@redhat.com
Series kvm/arm64: Support Async Page Fault

Message

Gavin Shan May 8, 2020, 3:29 a.m. UTC
There are two stages of page faults: stage 1 page faults are handled by
the guest itself, while the guest traps to the host when a page fault is
caused by the stage 2 page table, for example on a missing page. The
guest (vCPU) is then suspended until the requested page is populated,
and populating it may involve I/O on the host, for instance when the
page was previously swapped out. In that case, the vCPU has to be
suspended for several milliseconds, depending on the swapping media,
regardless of the overall system load.

The series adds asynchronous page fault support to improve the situation.
A signal (PAGE_NOT_PRESENT) is sent from the host to the guest if the
requested page can't be populated immediately, and a worker is started to
populate the page in the background. On receiving the PAGE_NOT_PRESENT
signal, the guest either picks another runnable process or puts the
current (faulting) process into a power-saving state. After the worker
has populated the requested page, another signal (PAGE_READY) is sent
from the host to the guest, and the guest wakes up the faulting process
on receiving it.

The signals are conveyed through a control block whose physical address
is passed from the guest to the host through a dedicated KVM vendor
specific hypercall; the control block is then visible and accessible to
both host and guest. The hypercall is also used to enable, disable and
configure the functionality. Notifications are delivered by injecting a
data abort exception whenever there are pending signals, and the
corresponding exception handler is invoked in the guest kernel.
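
To make this concrete, here is a minimal guest-side sketch of what the
shared control block and its registration could look like. The struct
layout, the hypercall function ID and all identifiers below are
illustrative assumptions, not the ABI actually proposed by this series:

    /*
     * Hypothetical guest-side view of the shared control block and its
     * registration. The function ID, struct layout and field names are
     * made up for illustration; they are not this series' ABI.
     */
    #include <linux/arm-smccc.h>
    #include <linux/mm.h>

    #define ARM_SMCCC_KVM_FUNC_APF_ENABLE   0xc6000040      /* assumed ID */

    struct kvm_apf_control_block {
            u32 reason;     /* 0: none, 1: PAGE_NOT_PRESENT, 2: PAGE_READY */
            u32 token;      /* identifies the outstanding page fault */
    };

    static struct kvm_apf_control_block apf_cb __aligned(64);

    static int kvm_apf_enable(void)
    {
            struct arm_smccc_res res;

            /* Tell the host where the shared control block lives. */
            arm_smccc_1_1_invoke(ARM_SMCCC_KVM_FUNC_APF_ENABLE,
                                 virt_to_phys(&apf_cb), &res);

            return res.a0 == SMCCC_RET_SUCCESS ? 0 : -ENXIO;
    }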

Testing
=======
The tests are carried out on the following machine. A guest with a single
vCPU and 4GB of memory is started, and the QEMU process is put into a
memory cgroup (v1) whose memory limit is set to 2GB. In the guest, there
are two threads, one memory bound and one CPU bound. The memory-bound
thread allocates all available memory, accesses it, and then frees it.
The CPU-bound thread simply executes blocks of "nop" instructions. The
test is carried out 5 times in a row and the average number (per minute)
of executed blocks in the CPU-bound thread is taken as the indicator of
improvement.
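
Since the test program isn't part of this posting, the following is only
a rough user-space sketch of the CPU-bound thread; the block size of 1024
nops and the counting scheme are assumptions:

    /*
     * Rough sketch of the CPU-bound benchmark thread: executes blocks
     * of "nop" and counts completed blocks. The block size (1024) is
     * an assumption; the actual test program isn't included here.
     */
    #include <pthread.h>
    #include <stdatomic.h>

    static atomic_ulong nop_blocks;

    static void *cpu_bound_thread(void *arg)
    {
            for (;;) {
                    for (int i = 0; i < 1024; i++)
                            asm volatile("nop");
                    atomic_fetch_add(&nop_blocks, 1);   /* one block done */
            }

            return NULL;
    }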

   Vendor: GIGABYTE   CPU: 224 x Cavium ThunderX2(R) CPU CN9975 v2.2 @ 2.0GHz
   Memory: 32GB       Disk: Fusion-MPT SAS-3 (PCIe3.0 x8)

   Without-APF: 7029030180/minute = avg(7559625120 5962155840 7823208540
                                        7629633480 6170527920)
   With-APF:    8286827472/minute = avg(8464584540 8177073360 8262723180
                                        8095084020 8434672260)
   Outcome:     +17.8%

Another test case is to measure the time consumed by the application, but
with the CPU-bound thread disabled.

   Without-APF: 40.3s = avg(40.6 39.3 39.2 41.6 41.2)
   With-APF:    40.8s = avg(40.6 41.1 40.9 41.0 40.7)
   Outcome:     +1.2%

I also added some code in the host to capture the number of async page
faults, the time used for swapin, and its maximal/minimal values when
async page fault is enabled. During this test, the CPU-bound thread is
disabled. About 30% of the time is spent in swapin.

   Number of async page faults:    7555 times
   Total time used by application: 42.2 seconds
   Total time used by swapin:      12.7 seconds   (30%)
         Minimal swapin time:      36.2 us
         Maximal swapin time:      55.7 ms

Changelog
=========
RFCv1 -> RFCv2
   * Rebase to v5.7-rc3
   * Performance data                                                   (Marc Zyngier)
   * Replace IMPDEF system register with KVM vendor specific hypercall  (Mark Rutland)
   * Based on Will's KVM vendor hypercall probe mechanism               (Will Deacon)
   * Don't use IMPDEF DFSC (0x43). Async page fault reason is conveyed
     by the control block                                               (Mark Rutland)
   * Delayed wakeup mechanism in guest kernel                           (Gavin Shan)
   * Stability improvement in the guest kernel: delayed wakeup mechanism,
     external abort disallowed region, lazily clear async page fault,
     disabled interrupt on acquiring the head's lock and so on          (Gavin Shan)
   * Stability improvement in the host kernel: serialized async page
     faults etc.                                                        (Gavin Shan)
   * Performance improvement in guest kernel: percpu sleeper head       (Gavin Shan)

Gavin Shan (7):
  kvm/arm64: Rename kvm_vcpu_get_hsr() to kvm_vcpu_get_esr()
  kvm/arm64: Detach ESR operator from vCPU struct
  kvm/arm64: Replace hsr with esr
  kvm/arm64: Export kvm_handle_user_mem_abort() with prefault mode
  kvm/arm64: Support async page fault
  kernel/sched: Add cpu_rq_is_locked()
  arm64: Support async page fault

Will Deacon (2):
  arm64: Probe for the presence of KVM hypervisor services during boot
  arm/arm64: KVM: Advertise KVM UID to guests via SMCCC

 arch/arm64/Kconfig                       |  11 +
 arch/arm64/include/asm/exception.h       |   3 +
 arch/arm64/include/asm/hypervisor.h      |  11 +
 arch/arm64/include/asm/kvm_emulate.h     |  83 +++--
 arch/arm64/include/asm/kvm_host.h        |  47 +++
 arch/arm64/include/asm/kvm_para.h        |  40 +++
 arch/arm64/include/uapi/asm/Kbuild       |   2 -
 arch/arm64/include/uapi/asm/kvm_para.h   |  22 ++
 arch/arm64/kernel/entry.S                |  33 ++
 arch/arm64/kernel/process.c              |   4 +
 arch/arm64/kernel/setup.c                |  35 ++
 arch/arm64/kvm/Kconfig                   |   1 +
 arch/arm64/kvm/Makefile                  |   2 +
 arch/arm64/kvm/handle_exit.c             |  48 +--
 arch/arm64/kvm/hyp/switch.c              |  33 +-
 arch/arm64/kvm/hyp/vgic-v2-cpuif-proxy.c |   7 +-
 arch/arm64/kvm/inject_fault.c            |   4 +-
 arch/arm64/kvm/sys_regs.c                |  38 +-
 arch/arm64/mm/fault.c                    | 434 +++++++++++++++++++++++
 include/linux/arm-smccc.h                |  32 ++
 include/linux/sched.h                    |   1 +
 kernel/sched/core.c                      |   8 +
 virt/kvm/arm/arm.c                       |  40 ++-
 virt/kvm/arm/async_pf.c                  | 335 +++++++++++++++++
 virt/kvm/arm/hyp/aarch32.c               |   4 +-
 virt/kvm/arm/hyp/vgic-v3-sr.c            |   7 +-
 virt/kvm/arm/hypercalls.c                |  37 +-
 virt/kvm/arm/mmio.c                      |  27 +-
 virt/kvm/arm/mmu.c                       |  69 +++-
 29 files changed, 1264 insertions(+), 154 deletions(-)
 create mode 100644 arch/arm64/include/asm/kvm_para.h
 create mode 100644 arch/arm64/include/uapi/asm/kvm_para.h
 create mode 100644 virt/kvm/arm/async_pf.c

Comments

Gavin Shan May 25, 2020, 11:39 p.m. UTC | #1
On 5/8/20 1:29 PM, Gavin Shan wrote:
[...]

A kindly ping... Marc/Mark/Will, please let me know your comments
on this. Thanks in advance!

Mark Rutland May 26, 2020, 1:09 p.m. UTC | #2
Hi Gavin,

At a high level I'm rather fearful of this series. I can see many ways
that this can break, and I can also see that even if/when we get things
into a working state, constant vigilance will be required for any
changes to the entry code.

I'm not keen on injecting non-architectural exceptions in this way, and
I'm also not keen on how deep the PV hooks are injected currently (e.g.
in the ret_to_user path).

I see a few patches have preparatory cleanup that I think would be
worthwhile regardless of this series; if you could factor those out and
send them on their own it would get that out of the way and make it
easier to review the series itself. Similarly, there's some duplication
of code from arch/x86 which I think can be factored out to virt/kvm
instead as preparatory work.

Generally, I also think that you need to spend some time on commit
messages and/or documentation to better explain the concepts and
expected usage. I had to reverse-engineer the series by reviewing it in
entirety before I had an idea as to how basic parts of it strung
together, and a more thorough conceptual explanation would make it much
easier to critique the approach rather than the individual patches.

On Fri, May 08, 2020 at 01:29:10PM +1000, Gavin Shan wrote:
> Testing
> =======
> The tests are carried out on the following machine. A guest with a single
> vCPU and 4GB of memory is started, and the QEMU process is put into a
> memory cgroup (v1) whose memory limit is set to 2GB. In the guest, there
> are two threads, one memory bound and one CPU bound. The memory-bound
> thread allocates all available memory, accesses it, and then frees it.
> The CPU-bound thread simply executes blocks of "nop" instructions.

I appreciate this is a microbenchmark, but that sounds far from
realistic.

Is there a specific real workload that this is expected to be
representative of?

Can you run tests with a real workload? For example, a kernel build
inside the VM?

> The test is carried out 5 times in a row and the average number (per
> minute) of executed blocks in the CPU-bound thread is taken as the
> indicator of improvement.
> 
>    Vendor: GIGABYTE   CPU: 224 x Cavium ThunderX2(R) CPU CN9975 v2.2 @ 2.0GHz
>    Memory: 32GB       Disk: Fusion-MPT SAS-3 (PCIe3.0 x8)
> 
>    Without-APF: 7029030180/minute = avg(7559625120 5962155840 7823208540
>                                         7629633480 6170527920)
>    With-APF:    8286827472/minute = avg(8464584540 8177073360 8262723180
>                                         8095084020 8434672260)
>    Outcome:     +17.8%
> 
> Another test case is to measure the time consumed by the application, but
> with the CPU-bound thread disabled.
> 
>    Without-APF: 40.3s = avg(40.6 39.3 39.2 41.6 41.2)
>    With-APF:    40.8s = avg(40.6 41.1 40.9 41.0 40.7)
>    Outcome:     +1.2%

So this is pure overhead in that case?

I think we need to see a real workload that this benefits. As it stands
it seems that this is a lot of complexity to game a synthetic benchmark.

Thanks,
Mark.

Gavin Shan May 27, 2020, 2:39 a.m. UTC | #3
Hi Mark,

On 5/26/20 11:09 PM, Mark Rutland wrote:
> At a high level I'm rather fearful of this series. I can see many ways
> that this can break, and I can also see that even if/when we get things
> into a working state, constant vigilance will be required for any
> changes to the entry code.
> 
> I'm not keen on injecting non-architectural exceptions in this way, and
> I'm also not keen on how deep the PV hooks are injected currently (e.g.
> in the ret_to_user path).
> 

First of all, thank you for your time and for your continued comments.
Since the series is tagged as RFC, it's no surprise that some things are
broken. However, could you please provide more details? If I understand
correctly, you're referring to the added entry code and the injected PV
hooks. With more details about your concerns, I can work out solutions.

Let me briefly explain why we need the PV hooks injected in ret_to_user.
There are two fashions of wakeup, which I would call direct wakeup and
delayed wakeup. The sleeping process is woken up directly when the
PAGE_READY notification is received from the host; that is direct wakeup.
However, there are cases where direct wakeup can't be carried out, for
example when the sleeper and the waker are the same process, or when the
(CFS) runqueue has been locked by somebody else. In those cases, the
wakeup is delayed until the idle process runs or until ret_to_user; that
is how delayed wakeup works.
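
A minimal sketch of that decision, assuming hypothetical helper names
(only cpu_rq_is_locked() is added by this series; the other identifiers
are made up for illustration):

    /* Sketch of PAGE_READY handling: direct vs delayed wakeup. */
    static void apf_page_ready(struct task_struct *sleeper)
    {
            /*
             * Direct wakeup is unsafe when the sleeper is the current
             * task or the (CFS) runqueue lock is already held; defer
             * the wakeup to the idle process or to ret_to_user.
             */
            if (sleeper == current || cpu_rq_is_locked(smp_processor_id()))
                    apf_queue_delayed_wakeup(sleeper);  /* drained later */
            else
                    wake_up_process(sleeper);           /* direct wakeup */
    }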

> I see a few patches have preparatory cleanup that I think would be
> worthwhile regardless of this series; if you could factor those out and
> send them on their own it would get that out of the way and make it
> easier to review the series itself. Similarly, there's some duplication
> of code from arch/x86 which I think can be factored out to virt/kvm
> instead as preparatory work.
> 

Yep, I agree there are several cleanup patches that can be posted
separately and merged in advance. I will do that, and thanks for the
comments.

About the code shared between arm64/x86, I need some time to investigate.
Basically, I agree with doing so. I've also included Paolo here to get
his opinion.

No doubt these are all preparatory steps, to make the review a bit
easier as you said :)

> Generally, I also think that you need to spend some time on commit
> messages and/or documentation to better explain the concepts and
> expected usage. I had to reverse-engineer the series by reviewing it in
> entirety before I had an idea as to how basic parts of it strung
> together, and a more thorough conceptual explanation would make it much
> easier to critique the approach rather than the individual patches.
> 

Yes, sure. I will do that in the future. Sorry that reverse-engineering
the series took so much of your time. In the next revision, I will put
more information in the cover letter and commit logs to explain how
things are designed and how they work :)

> On Fri, May 08, 2020 at 01:29:10PM +1000, Gavin Shan wrote:
>> Testing
>> =======
>> The tests are carried out on the following machine. A guest with a single
>> vCPU and 4GB of memory is started, and the QEMU process is put into a
>> memory cgroup (v1) whose memory limit is set to 2GB. In the guest, there
>> are two threads, one memory bound and one CPU bound. The memory-bound
>> thread allocates all available memory, accesses it, and then frees it.
>> The CPU-bound thread simply executes blocks of "nop" instructions.
> 
> I appreciate this is a microbenchmark, but that sounds far from
> realistic.
> 
> Is there a specific real workload that this is expected to be
> representative of?
> 
> Can you run tests with a real workload? For example, a kernel build
> inside the VM?
> 

Yeah, I agree it's far from a realistic workload. However, it's the test
case that was suggested when async page fault was proposed on day one,
according to the following document. On page 34, you can see a benchmark
similar to what we're doing.

https://www.linux-kvm.org/images/a/ac/2010-forum-Async-page-faults.pdf

OK. I will also test with a kernel-build workload, or another one that
better represents a real use case.

>> The test is carried out 5 times in a row and the average number (per
>> minute) of executed blocks in the CPU-bound thread is taken as the
>> indicator of improvement.
>>
>>     Vendor: GIGABYTE   CPU: 224 x Cavium ThunderX2(R) CPU CN9975 v2.2 @ 2.0GHz
>>     Memory: 32GB       Disk: Fusion-MPT SAS-3 (PCIe3.0 x8)
>>
>>     Without-APF: 7029030180/minute = avg(7559625120 5962155840 7823208540
>>                                          7629633480 6170527920)
>>     With-APF:    8286827472/minute = avg(8464584540 8177073360 8262723180
>>                                          8095084020 8434672260)
>>     Outcome:     +17.8%
>>
>> Another test case is to measure the time consumed by the application, but
>> with the CPU-bound thread disabled.
>>
>>     Without-APF: 40.3s = avg(40.6 39.3 39.2 41.6 41.2)
>>     With-APF:    40.8s = avg(40.6 41.1 40.9 41.0 40.7)
>>     Outcome:     +1.2%
> 
> So this is pure overhead in that case?
> 

Yes, it's pure overhead, mainly contributed by the PV code injected in
ret_to_user.

> I think we need to see a real workload that this benefits. As it stands
> it seems that this is a lot of complexity to game a synthetic benchmark.
> 
> Thanks,
> Mark.
> 

[...]

Thanks,
Gavin
Marc Zyngier May 27, 2020, 7:48 a.m. UTC | #4
On 2020-05-27 03:39, Gavin Shan wrote:
> Hi Mark,

[...]

>> Can you run tests with a real workload? For example, a kernel build
>> inside the VM?
>> 
> 
> Yeah, I agree it's far from a realistic workload. However, it's the test
> case that was suggested when async page fault was proposed on day one,
> according to the following document. On page 34, you can see a benchmark
> similar to what we're doing.
> 
> https://www.linux-kvm.org/images/a/ac/2010-forum-Async-page-faults.pdf

My own question is whether this even makes any sense 10 years later.

The HW has massively changed, and this adds a whole lot of complexity
to both the hypervisor and the guest. It also plays very ugly games
with the exception model, which doesn't give me the warm fuzzy feeling
that it's going to be great.

> OK. I will also test with a kernel-build workload, or another one that
> better represents a real use case.

Thanks,

         M.
Paolo Bonzini May 27, 2020, 4:10 p.m. UTC | #5
On 27/05/20 09:48, Marc Zyngier wrote:
> 
> My own question is whether this even makes any sense 10 years later.
> The HW has massively changed, and this adds a whole lot of complexity
> to both the hypervisor and the guest.

It still makes sense, but indeed it's for different reasons.  One
example is host page cache sharing, where (parts of) the host page cache
are visible to the guest.  In this context, async page faults are used
for any kind of host page faults, not just paging out memory due to
overcommit.

But I agree that it is very, very important to design the exception model
first, as we're witnessing in x86 land the problems with a poor design.
Nothing major, but just pain all around.

Paolo
