KVM: x86: async_pf: check earlier if can deliver async pf

Message ID: 20241118130403.23184-1-kalyazin@amazon.com (mailing list archive)
State: New
From: Nikita Kalyazin <kalyazin@amazon.com>
To: <pbonzini@redhat.com>, <seanjc@google.com>, <tglx@linutronix.de>, <mingo@redhat.com>, <bp@alien8.de>, <dave.hansen@linux.intel.com>, <hpa@zytor.com>, <kvm@vger.kernel.org>, <linux-kernel@vger.kernel.org>
CC: <david@redhat.com>, <peterx@redhat.com>, <oleg@redhat.com>, <vkuznets@redhat.com>, <gshan@redhat.com>, <graf@amazon.de>, <jgowans@amazon.com>, <roypat@amazon.co.uk>, <derekmn@amazon.com>, <nsaenz@amazon.es>, <xmarcalx@amazon.com>, <kalyazin@amazon.com>
Subject: [PATCH] KVM: x86: async_pf: check earlier if can deliver async pf
Date: Mon, 18 Nov 2024 13:04:03 +0000

Nikita Kalyazin Nov. 18, 2024, 1:04 p.m. UTC

On x86, async pagefault events can only be delivered if the page fault
was triggered by guest userspace, not kernel.  This is because
the guest may be in non-sleepable context and will not be able
to reschedule.

However existing implementation pays the following overhead even for the
kernel-originated faults, even though it is known in advance that they
cannot be processed asynchronously:
 - allocate async PF token
 - create and schedule an async work

This patch avoids the overhead above in case of kernel-originated faults
by moving the `kvm_can_deliver_async_pf` check from
`kvm_arch_async_page_not_present` to `__kvm_faultin_pfn`.

Note that the existing check `kvm_can_do_async_pf` already calls
`kvm_can_deliver_async_pf` internally, however it only does that if the
`kvm_hlt_in_guest` check is true, ie userspace requested KVM not to exit
on guest halts via `KVM_CAP_X86_DISABLE_EXITS`.  In that case the code
proceeds with the async fault processing with the following
justification in 1dfdb45ec510ba27e366878f97484e9c9e728902 ("KVM: x86:
clean up conditions for asynchronous page fault handling"):

"Even when asynchronous page fault is disabled, KVM does not want to pause
the host if a guest triggers a page fault; instead it will put it into
an artificial HLT state that allows running other host processes while
allowing interrupt delivery into the guest."

Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
---
 arch/x86/kvm/mmu/mmu.c | 3 ++-
 arch/x86/kvm/x86.c     | 5 ++---
 arch/x86/kvm/x86.h     | 2 ++
 3 files changed, 6 insertions(+), 4 deletions(-)


base-commit: d96c77bd4eeba469bddbbb14323d2191684da82a

Vitaly Kuznetsov Nov. 18, 2024, 5:58 p.m. UTC | #1

Nikita Kalyazin <kalyazin@amazon.com> writes:

> On x86, async pagefault events can only be delivered if the page fault
> was triggered by guest userspace, not kernel.  This is because
> the guest may be in non-sleepable context and will not be able
> to reschedule.

We used to set KVM_ASYNC_PF_SEND_ALWAYS for Linux guests before

commit 3a7c8fafd1b42adea229fd204132f6a2fb3cd2d9
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Fri Apr 24 09:57:56 2020 +0200

    x86/kvm: Restrict ASYNC_PF to user space

but KVM side of the feature is kind of still there, namely

kvm_pv_enable_async_pf() sets

    vcpu->arch.apf.send_user_only = !(data & KVM_ASYNC_PF_SEND_ALWAYS);

and then we check it in 

kvm_can_deliver_async_pf():

     if (vcpu->arch.apf.send_user_only &&
         kvm_x86_call(get_cpl)(vcpu) == 0)
             return false;

and this can still be used by some legacy guests I suppose. How about
we start with removing this completely? It does not matter if some
legacy guest wants to get an APF for CPL0, we are never obliged to
actually use the mechanism.

>
> However existing implementation pays the following overhead even for the
> kernel-originated faults, even though it is known in advance that they
> cannot be processed asynchronously:
>  - allocate async PF token
>  - create and schedule an async work
>
> This patch avoids the overhead above in case of kernel-originated faults
> by moving the `kvm_can_deliver_async_pf` check from
> `kvm_arch_async_page_not_present` to `__kvm_faultin_pfn`.
>
> Note that the existing check `kvm_can_do_async_pf` already calls
> `kvm_can_deliver_async_pf` internally, however it only does that if the
> `kvm_hlt_in_guest` check is true, ie userspace requested KVM not to exit
> on guest halts via `KVM_CAP_X86_DISABLE_EXITS`.  In that case the code
> proceeds with the async fault processing with the following
> justification in 1dfdb45ec510ba27e366878f97484e9c9e728902 ("KVM: x86:
> clean up conditions for asynchronous page fault handling"):
>
> "Even when asynchronous page fault is disabled, KVM does not want to pause
> the host if a guest triggers a page fault; instead it will put it into
> an artificial HLT state that allows running other host processes while
> allowing interrupt delivery into the guest."
>
> Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 3 ++-
>  arch/x86/kvm/x86.c     | 5 ++---
>  arch/x86/kvm/x86.h     | 2 ++
>  3 files changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 22e7ad235123..11d29d15b6cd 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4369,7 +4369,8 @@ static int __kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
>  			trace_kvm_async_pf_repeated_fault(fault->addr, fault->gfn);
>  			kvm_make_request(KVM_REQ_APF_HALT, vcpu);
>  			return RET_PF_RETRY;
> -		} else if (kvm_arch_setup_async_pf(vcpu, fault)) {
> +		} else if (kvm_can_deliver_async_pf(vcpu) &&
> +			kvm_arch_setup_async_pf(vcpu, fault)) {
>  			return RET_PF_RETRY;
>  		}
>  	}
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 2e713480933a..8edae75b39f7 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -13355,7 +13355,7 @@ static inline bool apf_pageready_slot_free(struct kvm_vcpu *vcpu)
>  	return !val;
>  }
>  
> -static bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu)
> +bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu)
>  {
>  
>  	if (!kvm_pv_async_pf_enabled(vcpu))
> @@ -13406,8 +13406,7 @@ bool kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
>  	trace_kvm_async_pf_not_present(work->arch.token, work->cr2_or_gpa);
>  	kvm_add_async_pf_gfn(vcpu, work->arch.gfn);
>  
> -	if (kvm_can_deliver_async_pf(vcpu) &&
> -	    !apf_put_user_notpresent(vcpu)) {
> +	if (!apf_put_user_notpresent(vcpu)) {
>  		fault.vector = PF_VECTOR;
>  		fault.error_code_valid = true;
>  		fault.error_code = 0;
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index ec623d23d13d..9647f41e5c49 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -387,6 +387,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  fastpath_t handle_fastpath_set_msr_irqoff(struct kvm_vcpu *vcpu);
>  fastpath_t handle_fastpath_hlt(struct kvm_vcpu *vcpu);
>  
> +bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu);
> +
>  extern struct kvm_caps kvm_caps;
>  extern struct kvm_host_values kvm_host;
>  
>
> base-commit: d96c77bd4eeba469bddbbb14323d2191684da82a
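
[Editorial sketch] The send_user_only logic Vitaly describes above can be modelled in a few lines of standalone C. This is only an illustration under stated assumptions: the toy_ and TOY_ names are hypothetical, the two flag values mirror the KVM_ASYNC_PF_ENABLED and KVM_ASYNC_PF_SEND_ALWAYS definitions quoted later in this thread, and the other conditions that the real kvm_can_deliver_async_pf() checks (for example is_guest_mode()) are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirror of the flag values quoted from kvm_para.h later in this thread. */
#define TOY_ASYNC_PF_ENABLED     (1 << 0)
#define TOY_ASYNC_PF_SEND_ALWAYS (1 << 1)

/* Hypothetical stand-in for the per-vCPU APF state; not the kernel struct. */
struct toy_apf {
	bool enabled;
	bool send_user_only;
};

/* Models the flag decoding done when the guest writes the enable MSR. */
struct toy_apf toy_enable_async_pf(uint64_t data)
{
	struct toy_apf apf = {
		.enabled = (data & TOY_ASYNC_PF_ENABLED) != 0,
		.send_user_only = !(data & TOY_ASYNC_PF_SEND_ALWAYS),
	};

	return apf;
}

/* Models only the CPL part of the deliverability check. */
bool toy_can_deliver_async_pf(const struct toy_apf *apf, int cpl)
{
	assert(cpl >= 0 && cpl <= 3);

	if (!apf->enabled)
		return false;

	/* A guest that did not ask for SEND_ALWAYS gets no event at CPL0. */
	if (apf->send_user_only && cpl == 0)
		return false;

	return true;
}
```

In this model a guest that sets only the enable bit is refused delivery at CPL0, while a legacy guest that also sets the send-always bit is not, which is the case the commit message's first sentence does not cover.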

Sean Christopherson Nov. 19, 2024, 1:24 p.m. UTC | #2

On Mon, Nov 18, 2024, Nikita Kalyazin wrote:
> On x86, async pagefault events can only be delivered if the page fault
> was triggered by guest userspace, not kernel.  This is because
> the guest may be in non-sleepable context and will not be able
> to reschedule.
> 
> However existing implementation pays the following overhead even for the
> kernel-originated faults, even though it is known in advance that they
> cannot be processed asynchronously:
>  - allocate async PF token
>  - create and schedule an async work

Very deliberately, because as noted below, async page faults aren't limited to
the paravirt case.

> This patch avoids the overhead above in case of kernel-originated faults

Please avoid "This patch".

> by moving the `kvm_can_deliver_async_pf` check from
> `kvm_arch_async_page_not_present` to `__kvm_faultin_pfn`.
> 
> Note that the existing check `kvm_can_do_async_pf` already calls
> `kvm_can_deliver_async_pf` internally, however it only does that if the
> `kvm_hlt_in_guest` check is true, ie userspace requested KVM not to exit
> on guest halts via `KVM_CAP_X86_DISABLE_EXITS`.  In that case the code
> proceeds with the async fault processing with the following
> justification in 1dfdb45ec510ba27e366878f97484e9c9e728902 ("KVM: x86:
> clean up conditions for asynchronous page fault handling"):
> 
> "Even when asynchronous page fault is disabled, KVM does not want to pause
> the host if a guest triggers a page fault; instead it will put it into
> an artificial HLT state that allows running other host processes while
> allowing interrupt delivery into the guest."

None of this justifies breaking host-side, non-paravirt async page faults.  If a
vCPU hits a missing page, KVM can schedule out the vCPU and let something else
run on the pCPU, or enter idle and let the SMT sibling get more cycles, or maybe
even enter a low enough sleep state to let other cores turbo a wee bit.

I have no objection to disabling host async page faults, e.g. it's probably a net
negative for 1:1 vCPU:pCPU pinned setups, but such disabling needs an opt-in from
userspace.
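
[Editorial sketch] The behavioural difference Sean objects to can be summarised as a small decision model. It is a simplification, not the kernel code: the enum and function names are hypothetical, can_do_async stands for the result of kvm_can_do_async_pf(), can_deliver stands for the result of kvm_can_deliver_async_pf(), and the repeated-fault case and a failing apf_put_user_notpresent() are ignored.

```c
#include <stdbool.h>

/* Hypothetical outcome labels for this sketch; not kernel identifiers. */
enum toy_outcome {
	TOY_SYNC_FAULT,      /* vCPU task handles the fault synchronously */
	TOY_HOST_ASYNC_HALT, /* work queued; vCPU put into the artificial HLT state */
	TOY_PV_ASYNC_PF,     /* work queued; "page not present" event sent to the guest */
};

/* Current flow: deliverability is checked only after the work is queued. */
enum toy_outcome toy_fault_path_current(bool can_do_async, bool can_deliver)
{
	if (!can_do_async)
		return TOY_SYNC_FAULT;

	/* Token allocation and async work scheduling happen in both cases. */
	return can_deliver ? TOY_PV_ASYNC_PF : TOY_HOST_ASYNC_HALT;
}

/* Proposed flow: the check gates the setup, so the halt outcome disappears. */
enum toy_outcome toy_fault_path_proposed(bool can_do_async, bool can_deliver)
{
	if (!can_do_async || !can_deliver)
		return TOY_SYNC_FAULT;

	return TOY_PV_ASYNC_PF;
}
```

The two functions agree everywhere except for can_do_async = true, can_deliver = false: the current flow parks the vCPU in the host-side async state, while the proposed flow falls back to a synchronous fault. That one cell is the host-side, non-paravirt case.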

Nikita Kalyazin Nov. 21, 2024, 5:59 p.m. UTC | #3

On 19/11/2024 13:24, Sean Christopherson wrote:
>> This patch avoids the overhead above in case of kernel-originated faults
> 
> Please avoid "This patch".

Ack, thanks.

>> by moving the `kvm_can_deliver_async_pf` check from
>> `kvm_arch_async_page_not_present` to `__kvm_faultin_pfn`.
>>
>> Note that the existing check `kvm_can_do_async_pf` already calls
>> `kvm_can_deliver_async_pf` internally, however it only does that if the
>> `kvm_hlt_in_guest` check is true, ie userspace requested KVM not to exit
>> on guest halts via `KVM_CAP_X86_DISABLE_EXITS`.  In that case the code
>> proceeds with the async fault processing with the following
>> justification in 1dfdb45ec510ba27e366878f97484e9c9e728902 ("KVM: x86:
>> clean up conditions for asynchronous page fault handling"):
>>
>> "Even when asynchronous page fault is disabled, KVM does not want to pause
>> the host if a guest triggers a page fault; instead it will put it into
>> an artificial HLT state that allows running other host processes while
>> allowing interrupt delivery into the guest."
> 
> None of this justifies breaking host-side, non-paravirt async page faults.  If a
> vCPU hits a missing page, KVM can schedule out the vCPU and let something else
> run on the pCPU, or enter idle and let the SMT sibling get more cycles, or maybe
> even enter a low enough sleep state to let other cores turbo a wee bit.
> 
> I have no objection to disabling host async page faults, e.g. it's probably a net
> negative for 1:1 vCPU:pCPU pinned setups, but such disabling needs an opt-in from
> userspace.

That's a good point, I didn't think about it.  The async work would 
still need to execute somewhere in that case (or sleep in GUP until the 
page is available).  If processing the fault synchronously, the vCPU 
thread can also sleep in the same way freeing the pCPU for something 
else, so the amount of work to be done looks equivalent (please correct 
me otherwise).  What's the net gain of moving that to an async work in 
the host async fault case?  "while allowing interrupt delivery into the 
guest." -- is this the main advantage?

Nikita Kalyazin Nov. 21, 2024, 6:10 p.m. UTC | #4

On 18/11/2024 17:58, Vitaly Kuznetsov wrote:
> Nikita Kalyazin <kalyazin@amazon.com> writes:
> 
>> On x86, async pagefault events can only be delivered if the page fault
>> was triggered by guest userspace, not kernel.  This is because
>> the guest may be in non-sleepable context and will not be able
>> to reschedule.
> 
> We used to set KVM_ASYNC_PF_SEND_ALWAYS for Linux guests before
> 
> commit 3a7c8fafd1b42adea229fd204132f6a2fb3cd2d9
> Author: Thomas Gleixner <tglx@linutronix.de>
> Date:   Fri Apr 24 09:57:56 2020 +0200
> 
>      x86/kvm: Restrict ASYNC_PF to user space
> 
> but KVM side of the feature is kind of still there, namely
> 
> kvm_pv_enable_async_pf() sets
> 
>      vcpu->arch.apf.send_user_only = !(data & KVM_ASYNC_PF_SEND_ALWAYS);
> 
> and then we check it in
> 
> kvm_can_deliver_async_pf():
> 
>       if (vcpu->arch.apf.send_user_only &&
>           kvm_x86_call(get_cpl)(vcpu) == 0)
>               return false;
> 
> and this can still be used by some legacy guests I suppose. How about
> we start with removing this completely? It does not matter if some
> legacy guest wants to get an APF for CPL0, we are never obliged to
> actually use the mechanism.

If I understand you correctly, the change you propose is rather 
orthogonal to the original one as the check is performed after the work 
has been already allocated (in kvm_setup_async_pf).  Would you expect 
tangible savings from omitting the send_user_only check?

> 
> --
> Vitaly

Sean Christopherson Nov. 21, 2024, 9:05 p.m. UTC | #5

On Thu, Nov 21, 2024, Nikita Kalyazin wrote:
> On 19/11/2024 13:24, Sean Christopherson wrote:
> > None of this justifies breaking host-side, non-paravirt async page faults.  If a
> > vCPU hits a missing page, KVM can schedule out the vCPU and let something else
> > run on the pCPU, or enter idle and let the SMT sibling get more cycles, or maybe
> > even enter a low enough sleep state to let other cores turbo a wee bit.
> > 
> > I have no objection to disabling host async page faults, e.g. it's probably a net
> > negative for 1:1 vCPU:pCPU pinned setups, but such disabling needs an opt-in from
> > userspace.
> 
> That's a good point, I didn't think about it.  The async work would still
> need to execute somewhere in that case (or sleep in GUP until the page is
> available).

The "async work" is often an I/O operation, e.g. to pull in the page from disk,
or over the network from the source.  The *CPU* doesn't need to actively do
anything for those operations.  The I/O is initiated, so the CPU can do something
else, or go idle if there's no other work to be done.

> If processing the fault synchronously, the vCPU thread can also sleep in the
> same way freeing the pCPU for something else,

If and only if the vCPU can handle a PV async #PF.  E.g. if the guest kernel flat
out doesn't support PV async #PF, or the fault happened while the guest was in an
incompatible mode, etc.

If KVM doesn't do async #PFs of any kind, the vCPU will spin on the fault until
the I/O completes and the page is ready.

> so the amount of work to be done looks equivalent (please correct me
> otherwise).  What's the net gain of moving that to an async work in the host
> async fault case? "while allowing interrupt delivery into the guest." -- is
> this the main advantage?

Vitaly Kuznetsov Nov. 22, 2024, 9:33 a.m. UTC | #6

Nikita Kalyazin <kalyazin@amazon.com> writes:

> On 18/11/2024 17:58, Vitaly Kuznetsov wrote:
>> Nikita Kalyazin <kalyazin@amazon.com> writes:
>> 
>>> On x86, async pagefault events can only be delivered if the page fault
>>> was triggered by guest userspace, not kernel.  This is because
>>> the guest may be in non-sleepable context and will not be able
>>> to reschedule.
>> 
>> We used to set KVM_ASYNC_PF_SEND_ALWAYS for Linux guests before
>> 
>> commit 3a7c8fafd1b42adea229fd204132f6a2fb3cd2d9
>> Author: Thomas Gleixner <tglx@linutronix.de>
>> Date:   Fri Apr 24 09:57:56 2020 +0200
>> 
>>      x86/kvm: Restrict ASYNC_PF to user space
>> 
>> but KVM side of the feature is kind of still there, namely
>> 
>> kvm_pv_enable_async_pf() sets
>> 
>>      vcpu->arch.apf.send_user_only = !(data & KVM_ASYNC_PF_SEND_ALWAYS);
>> 
>> and then we check it in
>> 
>> kvm_can_deliver_async_pf():
>> 
>>       if (vcpu->arch.apf.send_user_only &&
>>           kvm_x86_call(get_cpl)(vcpu) == 0)
>>               return false;
>> 
>> and this can still be used by some legacy guests I suppose. How about
>> we start with removing this completely? It does not matter if some
>> legacy guest wants to get an APF for CPL0, we are never obliged to
>> actually use the mechanism.
>
> If I understand you correctly, the change you propose is rather 
> orthogonal to the original one as the check is performed after the work 
> has been already allocated (in kvm_setup_async_pf).  Would you expect 
> tangible savings from omitting the send_user_only check?
>

No, I don't expect any performance benefits. Basically, I was referring
to the description of your patch: "On x86, async pagefault events can
only be delivered if the page fault was triggered by guest userspace,
not kernel" and strictly speaking this is not true today as we still
support KVM_ASYNC_PF_SEND_ALWAYS in KVM. Yes, modern Linux guest don't
use it but the flag is there. Basically, my suggestion is to start with
a cleanup (untested):

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6d9f763a7bb9..d0906830a9fb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -974,7 +974,6 @@ struct kvm_vcpu_arch {
                u64 msr_int_val; /* MSR_KVM_ASYNC_PF_INT */
                u16 vec;
                u32 id;
-               bool send_user_only;
                u32 host_apf_flags;
                bool delivery_as_pf_vmexit;
                bool pageready_pending;
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index a1efa7907a0b..5558a1ec3dc9 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -87,7 +87,7 @@ struct kvm_clock_pairing {
 #define KVM_MAX_MMU_OP_BATCH           32
 
 #define KVM_ASYNC_PF_ENABLED                   (1 << 0)
-#define KVM_ASYNC_PF_SEND_ALWAYS               (1 << 1)
+#define KVM_ASYNC_PF_SEND_ALWAYS               (1 << 1) /* deprecated */
 #define KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT     (1 << 2)
 #define KVM_ASYNC_PF_DELIVERY_AS_INT           (1 << 3)
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 83fe0a78146f..cd15e738ca9b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3585,7 +3585,6 @@ static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
                                        sizeof(u64)))
                return 1;
 
-       vcpu->arch.apf.send_user_only = !(data & KVM_ASYNC_PF_SEND_ALWAYS);
        vcpu->arch.apf.delivery_as_pf_vmexit = data & KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT;
 
        kvm_async_pf_wakeup_all(vcpu);
@@ -13374,8 +13373,7 @@ static bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu)
        if (!kvm_pv_async_pf_enabled(vcpu))
                return false;
 
-       if (vcpu->arch.apf.send_user_only &&
-           kvm_x86_call(get_cpl)(vcpu) == 0)
+       if (kvm_x86_call(get_cpl)(vcpu) == 0)
                return false;
 
        if (is_guest_mode(vcpu)) {

Sean Christopherson Nov. 22, 2024, 2:32 p.m. UTC | #7

On Fri, Nov 22, 2024, Vitaly Kuznetsov wrote:
> diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
> index a1efa7907a0b..5558a1ec3dc9 100644
> --- a/arch/x86/include/uapi/asm/kvm_para.h
> +++ b/arch/x86/include/uapi/asm/kvm_para.h
> @@ -87,7 +87,7 @@ struct kvm_clock_pairing {
>  #define KVM_MAX_MMU_OP_BATCH           32
>  
>  #define KVM_ASYNC_PF_ENABLED                   (1 << 0)
> -#define KVM_ASYNC_PF_SEND_ALWAYS               (1 << 1)
> +#define KVM_ASYNC_PF_SEND_ALWAYS               (1 << 1) /* deprecated */
>  #define KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT     (1 << 2)
>  #define KVM_ASYNC_PF_DELIVERY_AS_INT           (1 << 3)
>  
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 83fe0a78146f..cd15e738ca9b 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3585,7 +3585,6 @@ static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
>                                         sizeof(u64)))
>                 return 1;
>  
> -       vcpu->arch.apf.send_user_only = !(data & KVM_ASYNC_PF_SEND_ALWAYS);
>         vcpu->arch.apf.delivery_as_pf_vmexit = data & KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT;
>  
>         kvm_async_pf_wakeup_all(vcpu);
> @@ -13374,8 +13373,7 @@ static bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu)
>         if (!kvm_pv_async_pf_enabled(vcpu))
>                 return false;
>  
> -       if (vcpu->arch.apf.send_user_only &&
> -           kvm_x86_call(get_cpl)(vcpu) == 0)
> +       if (kvm_x86_call(get_cpl)(vcpu) == 0)

By x86's general definition of "user", this should be "!= 3" :-)
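
[Editorial sketch] The two candidate conditions differ only for CPL 1 and CPL 2. A minimal standalone comparison, with hypothetical function names and the CPL passed in as a plain integer rather than read through kvm_x86_call(get_cpl)(vcpu):

```c
#include <assert.h>
#include <stdbool.h>

/* Condition from the untested cleanup: refuse delivery only at CPL0. */
bool toy_refuse_if_cpl0(int cpl)
{
	assert(cpl >= 0 && cpl <= 3);
	return cpl == 0;
}

/* Condition suggested in review: refuse delivery unless the guest is at CPL3. */
bool toy_refuse_if_not_user(int cpl)
{
	assert(cpl >= 0 && cpl <= 3);
	return cpl != 3;
}
```

Both predicates refuse at CPL0 and allow at CPL3. At CPL1 and CPL2 the first allows delivery and the second refuses it.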

Nikita Kalyazin Nov. 25, 2024, 3:50 p.m. UTC | #8

On 21/11/2024 21:05, Sean Christopherson wrote:
> On Thu, Nov 21, 2024, Nikita Kalyazin wrote:
>> On 19/11/2024 13:24, Sean Christopherson wrote:
>>> None of this justifies breaking host-side, non-paravirt async page faults.  If a
>>> vCPU hits a missing page, KVM can schedule out the vCPU and let something else
>>> run on the pCPU, or enter idle and let the SMT sibling get more cycles, or maybe
>>> even enter a low enough sleep state to let other cores turbo a wee bit.
>>>
>>> I have no objection to disabling host async page faults, e.g. it's probably a net
>>> negative for 1:1 vCPU:pCPU pinned setups, but such disabling needs an opt-in from
>>> userspace.
>>
>> That's a good point, I didn't think about it.  The async work would still
>> need to execute somewhere in that case (or sleep in GUP until the page is
>> available).
> 
> The "async work" is often an I/O operation, e.g. to pull in the page from disk,
> or over the network from the source.  The *CPU* doesn't need to actively do
> anything for those operations.  The I/O is initiated, so the CPU can do something
> else, or go idle if there's no other work to be done.
> 
>> If processing the fault synchronously, the vCPU thread can also sleep in the
>> same way freeing the pCPU for something else,
> 
> If and only if the vCPU can handle a PV async #PF.  E.g. if the guest kernel flat
> out doesn't support PV async #PF, or the fault happened while the guest was in an
> incompatible mode, etc.
> 
> If KVM doesn't do async #PFs of any kind, the vCPU will spin on the fault until
> the I/O completes and the page is ready.

I ran a little experiment to see that by backing guest memory by a file 
on FUSE and delaying response to one of the read operations to emulate a 
delay in fault processing.

1. Original (the patch isn't applied)

vCPU thread (disk-sleeping):

[<0>] kvm_vcpu_block+0x62/0xe0
[<0>] kvm_arch_vcpu_ioctl_run+0x240/0x1e30
[<0>] kvm_vcpu_ioctl+0x2f1/0x860
[<0>] __x64_sys_ioctl+0x87/0xc0
[<0>] do_syscall_64+0x47/0x110
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e

Async task (disk-sleeping):

[<0>] folio_wait_bit_common+0x116/0x2e0
[<0>] filemap_fault+0xe5/0xcd0
[<0>] __do_fault+0x30/0xc0
[<0>] do_fault+0x9a/0x580
[<0>] __handle_mm_fault+0x684/0x8a0
[<0>] handle_mm_fault+0xc9/0x220
[<0>] __get_user_pages+0x248/0x12c0
[<0>] get_user_pages_remote+0xef/0x470
[<0>] async_pf_execute+0x99/0x1c0
[<0>] process_one_work+0x145/0x360
[<0>] worker_thread+0x294/0x3b0
[<0>] kthread+0xdb/0x110
[<0>] ret_from_fork+0x2d/0x50
[<0>] ret_from_fork_asm+0x1a/0x30

2. With the patch applied (no async task)

vCPU thread (disk-sleeping):

[<0>] folio_wait_bit_common+0x116/0x2e0
[<0>] filemap_fault+0xe5/0xcd0
[<0>] __do_fault+0x30/0xc0
[<0>] do_fault+0x36f/0x580
[<0>] __handle_mm_fault+0x684/0x8a0
[<0>] handle_mm_fault+0xc9/0x220
[<0>] __get_user_pages+0x248/0x12c0
[<0>] get_user_pages_unlocked+0xf7/0x380
[<0>] hva_to_pfn+0x2a2/0x440
[<0>] __kvm_faultin_pfn+0x5e/0x90
[<0>] kvm_mmu_faultin_pfn+0x1ec/0x690
[<0>] kvm_tdp_page_fault+0xba/0x160
[<0>] kvm_mmu_do_page_fault+0x1cc/0x210
[<0>] kvm_mmu_page_fault+0x8e/0x600
[<0>] vmx_handle_exit+0x14c/0x6c0
[<0>] kvm_arch_vcpu_ioctl_run+0xeb1/0x1e30
[<0>] kvm_vcpu_ioctl+0x2f1/0x860
[<0>] __x64_sys_ioctl+0x87/0xc0
[<0>] do_syscall_64+0x47/0x110
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e

In both cases the fault handling code is blocked and the pCPU is free 
for other tasks.  I can't see the vCPU spinning on the IO to get 
completed if the async task isn't created.  I tried that with and 
without async PF enabled by the guest (MSR_KVM_ASYNC_PF_EN).

What am I missing?

>> so the amount of work to be done looks equivalent (please correct me
>> otherwise).  What's the net gain of moving that to an async work in the host
>> async fault case? "while allowing interrupt delivery into the guest." -- is
>> this the main advantage?

Sean Christopherson Nov. 26, 2024, 12:06 a.m. UTC | #9

On Mon, Nov 25, 2024, Nikita Kalyazin wrote:
> On 21/11/2024 21:05, Sean Christopherson wrote:
> > On Thu, Nov 21, 2024, Nikita Kalyazin wrote:
> > > On 19/11/2024 13:24, Sean Christopherson wrote:
> > > > None of this justifies breaking host-side, non-paravirt async page faults.  If a
> > > > vCPU hits a missing page, KVM can schedule out the vCPU and let something else
> > > > run on the pCPU, or enter idle and let the SMT sibling get more cycles, or maybe
> > > > even enter a low enough sleep state to let other cores turbo a wee bit.
> > > > 
> > > > I have no objection to disabling host async page faults, e.g. it's probably a net
> > > > negative for 1:1 vCPU:pCPU pinned setups, but such disabling needs an opt-in from
> > > > userspace.
> > > 
> > > That's a good point, I didn't think about it.  The async work would still
> > > need to execute somewhere in that case (or sleep in GUP until the page is
> > > available).
> > 
> > The "async work" is often an I/O operation, e.g. to pull in the page from disk,
> > or over the network from the source.  The *CPU* doesn't need to actively do
> > anything for those operations.  The I/O is initiated, so the CPU can do something
> > else, or go idle if there's no other work to be done.
> > 
> > > If processing the fault synchronously, the vCPU thread can also sleep in the
> > > same way freeing the pCPU for something else,
> > 
> > If and only if the vCPU can handle a PV async #PF.  E.g. if the guest kernel flat
> > out doesn't support PV async #PF, or the fault happened while the guest was in an
> > incompatible mode, etc.
> > 
> > If KVM doesn't do async #PFs of any kind, the vCPU will spin on the fault until
> > the I/O completes and the page is ready.
> 
> I ran a little experiment to see that by backing guest memory by a file on
> FUSE and delaying response to one of the read operations to emulate a delay
> in fault processing.

...

> In both cases the fault handling code is blocked and the pCPU is free for
> other tasks.  I can't see the vCPU spinning on the IO to get completed if
> the async task isn't created.  I tried that with and without async PF
> enabled by the guest (MSR_KVM_ASYNC_PF_EN).
> 
> What am I missing?

Ah, I was wrong about the vCPU spinning.

The goal is specifically to schedule() from KVM context, i.e. from kvm_vcpu_block(),
so that if a virtual interrupt arrives for the guest, KVM can wake the vCPU and
deliver the IRQ, e.g. to reduce latency for interrupt delivery, and possibly even
to let the guest schedule in a different task if the IRQ is the guest's tick.

Letting mm/ or fs/ do schedule() means the only wake event even for the vCPU task
is the completion of the I/O (or whatever the fault is waiting on).
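
[Editorial sketch] The distinction drawn above comes down to which events can wake the sleeping vCPU task. The following toy predicate restates it; the enum and function names are hypothetical and the model covers only the two events mentioned in this message.

```c
#include <stdbool.h>

/* Where the vCPU task went to sleep (labels are illustrative only). */
enum toy_sleep_site {
	TOY_SLEEP_IN_KVM,   /* blocked from KVM context, e.g. kvm_vcpu_block() */
	TOY_SLEEP_IN_MM_FS, /* blocked inside mm/ or fs/ during a synchronous fault */
};

enum toy_event {
	TOY_EVENT_FAULT_DONE, /* the I/O, or whatever the fault waits on, completed */
	TOY_EVENT_GUEST_IRQ,  /* a virtual interrupt arrived for the guest */
};

/* Returns true if the event wakes the vCPU task in this model. */
bool toy_event_wakes_vcpu(enum toy_sleep_site site, enum toy_event event)
{
	/* Completion of the fault wakes the task in either case. */
	if (event == TOY_EVENT_FAULT_DONE)
		return true;

	/* A guest IRQ only helps if KVM itself did the schedule(). */
	return site == TOY_SLEEP_IN_KVM;
}
```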

Nikita Kalyazin Nov. 26, 2024, 3:35 p.m. UTC | #10

On 26/11/2024 00:06, Sean Christopherson wrote:
> On Mon, Nov 25, 2024, Nikita Kalyazin wrote:
>> On 21/11/2024 21:05, Sean Christopherson wrote:
>>> On Thu, Nov 21, 2024, Nikita Kalyazin wrote:
>>>> On 19/11/2024 13:24, Sean Christopherson wrote:
>>>>> None of this justifies breaking host-side, non-paravirt async page faults.  If a
>>>>> vCPU hits a missing page, KVM can schedule out the vCPU and let something else
>>>>> run on the pCPU, or enter idle and let the SMT sibling get more cycles, or maybe
>>>>> even enter a low enough sleep state to let other cores turbo a wee bit.
>>>>>
>>>>> I have no objection to disabling host async page faults, e.g. it's probably a net
>>>>> negative for 1:1 vCPU:pCPU pinned setups, but such disabling needs an opt-in from
>>>>> userspace.
>>>>
>>>> That's a good point, I didn't think about it.  The async work would still
>>>> need to execute somewhere in that case (or sleep in GUP until the page is
>>>> available).
>>>
>>> The "async work" is often an I/O operation, e.g. to pull in the page from disk,
>>> or over the network from the source.  The *CPU* doesn't need to actively do
>>> anything for those operations.  The I/O is initiated, so the CPU can do something
>>> else, or go idle if there's no other work to be done.
>>>
>>>> If processing the fault synchronously, the vCPU thread can also sleep in the
>>>> same way freeing the pCPU for something else,
>>>
>>> If and only if the vCPU can handle a PV async #PF.  E.g. if the guest kernel flat
>>> out doesn't support PV async #PF, or the fault happened while the guest was in an
>>> incompatible mode, etc.
>>>
>>> If KVM doesn't do async #PFs of any kind, the vCPU will spin on the fault until
>>> the I/O completes and the page is ready.
>>
>> I ran a little experiment to see that by backing guest memory by a file on
>> FUSE and delaying response to one of the read operations to emulate a delay
>> in fault processing.
> 
> ...
> 
>> In both cases the fault handling code is blocked and the pCPU is free for
>> other tasks.  I can't see the vCPU spinning on the IO to get completed if
>> the async task isn't created.  I tried that with and without async PF
>> enabled by the guest (MSR_KVM_ASYNC_PF_EN).
>>
>> What am I missing?
> 
> Ah, I was wrong about the vCPU spinning.
> 
> The goal is specifically to schedule() from KVM context, i.e. from kvm_vcpu_block(),
> so that if a virtual interrupt arrives for the guest, KVM can wake the vCPU and
> deliver the IRQ, e.g. to reduce latency for interrupt delivery, and possibly even
> to let the guest schedule in a different task if the IRQ is the guest's tick.
> 
> Letting mm/ or fs/ do schedule() means the only wake event even for the vCPU task
> is the completion of the I/O (or whatever the fault is waiting on).

Ok, great, then that's how I understood it last time.  The only thing 
that is not entirely clear to me is like Vitaly says, 
KVM_ASYNC_PF_SEND_ALWAYS is no longer set, because we don't want to 
inject IRQs into the guest when it's in kernel mode, but the "host async 
PF" case would still allow IRQs (eg ticks like you said).  Why is it 
safe to deliver them?

>>>>> I have no objection to disabling host async page faults, e.g. it's probably a net
>>>>> negative for 1:1 vCPU:pCPU pinned setups, but such disabling needs an opt-in from
>>>>> userspace.

Back to this, I couldn't see a significant effect of this optimisation 
with the original async PF so happy to give it up, but it does make a 
difference when applied to async PF user [2] in my setup.  Would a new 
cap be a good way for users to express their opt-in for it?

[1]: 
https://lore.kernel.org/kvm/20241118130403.23184-1-kalyazin@amazon.com/T/#ma719a9cb3e036e24ea8512abf9a625ddeaccfc96
[2]: 
https://lore.kernel.org/kvm/20241118123948.4796-1-kalyazin@amazon.com/T/

Sean Christopherson Nov. 26, 2024, 10:10 p.m. UTC | #11

On Tue, Nov 26, 2024, Nikita Kalyazin wrote:
> On 26/11/2024 00:06, Sean Christopherson wrote:
> > On Mon, Nov 25, 2024, Nikita Kalyazin wrote:
> > > In both cases the fault handling code is blocked and the pCPU is free for
> > > other tasks.  I can't see the vCPU spinning on the IO to get completed if
> > > the async task isn't created.  I tried that with and without async PF
> > > enabled by the guest (MSR_KVM_ASYNC_PF_EN).
> > > 
> > > What am I missing?
> > 
> > Ah, I was wrong about the vCPU spinning.
> > 
> > The goal is specifically to schedule() from KVM context, i.e. from kvm_vcpu_block(),
> > so that if a virtual interrupt arrives for the guest, KVM can wake the vCPU and
> > deliver the IRQ, e.g. to reduce latency for interrupt delivery, and possibly even
> > to let the guest schedule in a different task if the IRQ is the guest's tick.
> > 
> > Letting mm/ or fs/ do schedule() means the only wake event even for the vCPU task
> > is the completion of the I/O (or whatever the fault is waiting on).
> 
> Ok, great, then that's how I understood it last time.  The only thing that
> is not entirely clear to me is that, like Vitaly says, KVM_ASYNC_PF_SEND_ALWAYS is
> no longer set, because we don't want to inject IRQs into the guest when it's
> in kernel mode, but the "host async PF" case would still allow IRQs (eg
> ticks like you said).  Why is it safe to deliver them?

IRQs are fine, the problem with PV async #PF is that it directly injects a #PF,
which the kernel may not be prepared to handle.
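
A rough sketch of the distinction discussed here, not code from the patch or
from KVM.  All names below (toy_vcpu, toy_fault_outcome, etc.) are
hypothetical, and the checks are deliberately simplified: it assumes PV async
#PF injection is allowed only when the guest enabled it and either
KVM_ASYNC_PF_SEND_ALWAYS is set or the guest is in user mode, that the "host
async PF" fallback blocks the vCPU in KVM where an IRQ can wake it, and that
without any async work the task sleeps in the fault handler until the I/O
completes.

```c
#include <stdbool.h>

/* Hypothetical outcome labels for this sketch; not KVM identifiers. */
enum fault_outcome {
    INJECT_PV_ASYNC_PF,      /* guest receives a "page not present" #PF */
    HALT_IN_KVM,             /* "host async PF": vCPU blocks in KVM context */
    BLOCK_IN_FAULT_HANDLER   /* no async work: task sleeps in mm/ or fs/ */
};

/* Hypothetical, heavily simplified view of the relevant vCPU state. */
struct toy_vcpu {
    bool pv_async_pf_enabled;  /* guest enabled PV async #PF */
    bool send_always;          /* KVM_ASYNC_PF_SEND_ALWAYS requested */
    bool guest_in_user_mode;   /* fault came from guest userspace */
};

/* Assumption: a #PF may be injected only if the guest opted in and is
 * either in user mode or asked to be notified unconditionally. */
static bool toy_can_inject_pv_async_pf(const struct toy_vcpu *v)
{
    if (!v->pv_async_pf_enabled)
        return false;
    return v->send_always || v->guest_in_user_mode;
}

/* What happens to the vCPU while the host resolves the fault. */
static enum fault_outcome toy_fault_outcome(const struct toy_vcpu *v,
                                            bool async_work_queued)
{
    if (!async_work_queued)
        return BLOCK_IN_FAULT_HANDLER;
    return toy_can_inject_pv_async_pf(v) ? INJECT_PV_ASYNC_PF : HALT_IN_KVM;
}

/* An IRQ can reach the guest unless the task sleeps in the fault handler,
 * where the only wake event is completion of the fault. */
static bool toy_irq_deliverable_while_waiting(enum fault_outcome o)
{
    return o != BLOCK_IN_FAULT_HANDLER;
}
```

In this model, a kernel-mode fault with async work queued yields HALT_IN_KVM,
so an IRQ can still be delivered without ever injecting a #PF; skipping the
async work for that same fault yields BLOCK_IN_FAULT_HANDLER, which gives up
IRQ wakeups until the fault completes.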

> > > > > > I have no objection to disabling host async page faults,
> > > > > > e.g. it's probably a net negative for 1:1 vCPU:pCPU pinned setups,
> > > > > > but such disabling needs an opt-in from userspace.
>
> Back to this, I couldn't see a significant effect of this optimisation with
> the original async PF so happy to give it up, but it does make a difference
> when applied to async PF user [2] in my setup.  Would a new cap be a good
> way for users to express their opt-in for it?

This probably needs to be handled in the context of the async #PF user series.
If that series never lands, adding a new cap is likely a waste.  And I suspect
that even then, a capability may not be warranted (truly don't know, haven't
looked at your other series).

Nikita Kalyazin Nov. 27, 2024, 10:35 a.m. UTC | #12

On 26/11/2024 22:10, Sean Christopherson wrote:
> On Tue, Nov 26, 2024, Nikita Kalyazin wrote:
>> On 26/11/2024 00:06, Sean Christopherson wrote:
>>> On Mon, Nov 25, 2024, Nikita Kalyazin wrote:
>>>> In both cases the fault handling code is blocked and the pCPU is free for
>>>> other tasks.  I can't see the vCPU spinning on the IO to get completed if
>>>> the async task isn't created.  I tried that with and without async PF
>>>> enabled by the guest (MSR_KVM_ASYNC_PF_EN).
>>>>
>>>> What am I missing?
>>>
>>> Ah, I was wrong about the vCPU spinning.
>>>
>>> The goal is specifically to schedule() from KVM context, i.e. from kvm_vcpu_block(),
>>> so that if a virtual interrupt arrives for the guest, KVM can wake the vCPU and
>>> deliver the IRQ, e.g. to reduce latency for interrupt delivery, and possibly even
>>> to let the guest schedule in a different task if the IRQ is the guest's tick.
>>>
>>> Letting mm/ or fs/ do schedule() means the only wake event even for the vCPU task
>>> is the completion of the I/O (or whatever the fault is waiting on).
>>
>> Ok, great, then that's how I understood it last time.  The only thing that
>> is not entirely clear to me is that, like Vitaly says, KVM_ASYNC_PF_SEND_ALWAYS is
>> no longer set, because we don't want to inject IRQs into the guest when it's
>> in kernel mode, but the "host async PF" case would still allow IRQs (eg
>> ticks like you said).  Why is it safe to deliver them?
> 
> IRQs are fine, the problem with PV async #PF is that it directly injects a #PF,
> which the kernel may not be prepared to handle.

You're right indeed, I was overfocused on IRQs for some reason.

>>>>>>> I have no objection to disabling host async page faults,
>>>>>>> e.g. it's probably a net negative for 1:1 vCPU:pCPU pinned setups,
>>>>>>> but such disabling needs an opt-in from userspace.
>>
>> Back to this, I couldn't see a significant effect of this optimisation with
>> the original async PF so happy to give it up, but it does make a difference
>> when applied to async PF user [2] in my setup.  Would a new cap be a good
>> way for users to express their opt-in for it?
> 
> This probably needs to be handled in the context of the async #PF user series.
> If that series never lands, adding a new cap is likely a waste.  And I suspect
> that even then, a capability may not be warranted (truly don't know, haven't
> looked at your other series).

Yes, I meant that to be included in the async #PF user series (if 
required), not this one.  Just wanted to bring it up here, because the 
thread already had the relevant context.  Thanks.
