Message ID: 20241118130403.23184-1-kalyazin@amazon.com
State: New
Series: KVM: x86: async_pf: check earlier if can deliver async pf
Nikita Kalyazin <kalyazin@amazon.com> writes:

> On x86, async pagefault events can only be delivered if the page fault
> was triggered by guest userspace, not kernel. This is because
> the guest may be in non-sleepable context and will not be able
> to reschedule.

We used to set KVM_ASYNC_PF_SEND_ALWAYS for Linux guests before

commit 3a7c8fafd1b42adea229fd204132f6a2fb3cd2d9
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Fri Apr 24 09:57:56 2020 +0200

    x86/kvm: Restrict ASYNC_PF to user space

but KVM side of the feature is kind of still there, namely
kvm_pv_enable_async_pf() sets

  vcpu->arch.apf.send_user_only = !(data & KVM_ASYNC_PF_SEND_ALWAYS);

and then we check it in kvm_can_deliver_async_pf():

  if (vcpu->arch.apf.send_user_only &&
      kvm_x86_call(get_cpl)(vcpu) == 0)
          return false;

and this can still be used by some legacy guests, I suppose. How about
we start with removing this completely? It does not matter if some
legacy guest wants to get an APF for CPL0, we are never obliged to
actually use the mechanism.

> However existing implementation pays the following overhead even for the
> kernel-originated faults, even though it is known in advance that they
> cannot be processed asynchronously:
> - allocate async PF token
> - create and schedule an async work
>
> This patch avoids the overhead above in case of kernel-originated faults
> by moving the `kvm_can_deliver_async_pf` check from
> `kvm_arch_async_page_not_present` to `__kvm_faultin_pfn`.
>
> Note that the existing check `kvm_can_do_async_pf` already calls
> `kvm_can_deliver_async_pf` internally, however it only does that if the
> `kvm_hlt_in_guest` check is true, ie userspace requested KVM not to exit
> on guest halts via `KVM_CAP_X86_DISABLE_EXITS`. In that case the code
> proceeds with the async fault processing with the following
> justification in 1dfdb45ec510ba27e366878f97484e9c9e728902 ("KVM: x86:
> clean up conditions for asynchronous page fault handling"):
>
> "Even when asynchronous page fault is disabled, KVM does not want to pause
> the host if a guest triggers a page fault; instead it will put it into
> an artificial HLT state that allows running other host processes while
> allowing interrupt delivery into the guest."
>
> Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 3 ++-
>  arch/x86/kvm/x86.c     | 5 ++---
>  arch/x86/kvm/x86.h     | 2 ++
>  3 files changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 22e7ad235123..11d29d15b6cd 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4369,7 +4369,8 @@ static int __kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
>  			trace_kvm_async_pf_repeated_fault(fault->addr, fault->gfn);
>  			kvm_make_request(KVM_REQ_APF_HALT, vcpu);
>  			return RET_PF_RETRY;
> -		} else if (kvm_arch_setup_async_pf(vcpu, fault)) {
> +		} else if (kvm_can_deliver_async_pf(vcpu) &&
> +			   kvm_arch_setup_async_pf(vcpu, fault)) {
>  			return RET_PF_RETRY;
>  		}
>  	}
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 2e713480933a..8edae75b39f7 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -13355,7 +13355,7 @@ static inline bool apf_pageready_slot_free(struct kvm_vcpu *vcpu)
>  	return !val;
>  }
>
> -static bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu)
> +bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu)
>  {
>
>  	if (!kvm_pv_async_pf_enabled(vcpu))
> @@ -13406,8 +13406,7 @@ bool kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
>  	trace_kvm_async_pf_not_present(work->arch.token, work->cr2_or_gpa);
>  	kvm_add_async_pf_gfn(vcpu, work->arch.gfn);
>
> -	if (kvm_can_deliver_async_pf(vcpu) &&
> -	    !apf_put_user_notpresent(vcpu)) {
> +	if (!apf_put_user_notpresent(vcpu)) {
>  		fault.vector = PF_VECTOR;
>  		fault.error_code_valid = true;
>  		fault.error_code = 0;
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index ec623d23d13d..9647f41e5c49 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -387,6 +387,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  fastpath_t handle_fastpath_set_msr_irqoff(struct kvm_vcpu *vcpu);
>  fastpath_t handle_fastpath_hlt(struct kvm_vcpu *vcpu);
>
> +bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu);
> +
>  extern struct kvm_caps kvm_caps;
>  extern struct kvm_host_values kvm_host;
>
> base-commit: d96c77bd4eeba469bddbbb14323d2191684da82a
On Mon, Nov 18, 2024, Nikita Kalyazin wrote:
> On x86, async pagefault events can only be delivered if the page fault
> was triggered by guest userspace, not kernel. This is because
> the guest may be in non-sleepable context and will not be able
> to reschedule.
>
> However existing implementation pays the following overhead even for the
> kernel-originated faults, even though it is known in advance that they
> cannot be processed asynchronously:
> - allocate async PF token
> - create and schedule an async work

Very deliberately, because as noted below, async page faults aren't limited
to the paravirt case.

> This patch avoids the overhead above in case of kernel-originated faults

Please avoid "This patch".

> by moving the `kvm_can_deliver_async_pf` check from
> `kvm_arch_async_page_not_present` to `__kvm_faultin_pfn`.
>
> Note that the existing check `kvm_can_do_async_pf` already calls
> `kvm_can_deliver_async_pf` internally, however it only does that if the
> `kvm_hlt_in_guest` check is true, ie userspace requested KVM not to exit
> on guest halts via `KVM_CAP_X86_DISABLE_EXITS`. In that case the code
> proceeds with the async fault processing with the following
> justification in 1dfdb45ec510ba27e366878f97484e9c9e728902 ("KVM: x86:
> clean up conditions for asynchronous page fault handling"):
>
> "Even when asynchronous page fault is disabled, KVM does not want to pause
> the host if a guest triggers a page fault; instead it will put it into
> an artificial HLT state that allows running other host processes while
> allowing interrupt delivery into the guest."

None of this justifies breaking host-side, non-paravirt async page faults.
If a vCPU hits a missing page, KVM can schedule out the vCPU and let
something else run on the pCPU, or enter idle and let the SMT sibling get
more cycles, or maybe even enter a low enough sleep state to let other
cores turbo a wee bit.

I have no objection to disabling host async page faults, e.g. it's probably
a net negative for 1:1 vCPU:pCPU pinned setups, but such disabling needs an
opt-in from userspace.
On 19/11/2024 13:24, Sean Christopherson wrote:
>> This patch avoids the overhead above in case of kernel-originated faults
>
> Please avoid "This patch".

Ack, thanks.

>> by moving the `kvm_can_deliver_async_pf` check from
>> `kvm_arch_async_page_not_present` to `__kvm_faultin_pfn`.
>>
>> Note that the existing check `kvm_can_do_async_pf` already calls
>> `kvm_can_deliver_async_pf` internally, however it only does that if the
>> `kvm_hlt_in_guest` check is true, ie userspace requested KVM not to exit
>> on guest halts via `KVM_CAP_X86_DISABLE_EXITS`. In that case the code
>> proceeds with the async fault processing with the following
>> justification in 1dfdb45ec510ba27e366878f97484e9c9e728902 ("KVM: x86:
>> clean up conditions for asynchronous page fault handling"):
>>
>> "Even when asynchronous page fault is disabled, KVM does not want to pause
>> the host if a guest triggers a page fault; instead it will put it into
>> an artificial HLT state that allows running other host processes while
>> allowing interrupt delivery into the guest."
>
> None of this justifies breaking host-side, non-paravirt async page faults.
> If a vCPU hits a missing page, KVM can schedule out the vCPU and let
> something else run on the pCPU, or enter idle and let the SMT sibling get
> more cycles, or maybe even enter a low enough sleep state to let other
> cores turbo a wee bit.
>
> I have no objection to disabling host async page faults, e.g. it's probably
> a net negative for 1:1 vCPU:pCPU pinned setups, but such disabling needs an
> opt-in from userspace.

That's a good point, I didn't think about it. The async work would still
need to execute somewhere in that case (or sleep in GUP until the page is
available). If processing the fault synchronously, the vCPU thread can
also sleep in the same way, freeing the pCPU for something else, so the
amount of work to be done looks equivalent (please correct me otherwise).
What's the net gain of moving that to an async work in the host async
fault case? "while allowing interrupt delivery into the guest." -- is this
the main advantage?
On 18/11/2024 17:58, Vitaly Kuznetsov wrote:
> Nikita Kalyazin <kalyazin@amazon.com> writes:
>
>> On x86, async pagefault events can only be delivered if the page fault
>> was triggered by guest userspace, not kernel. This is because
>> the guest may be in non-sleepable context and will not be able
>> to reschedule.
>
> We used to set KVM_ASYNC_PF_SEND_ALWAYS for Linux guests before
>
> commit 3a7c8fafd1b42adea229fd204132f6a2fb3cd2d9
> Author: Thomas Gleixner <tglx@linutronix.de>
> Date:   Fri Apr 24 09:57:56 2020 +0200
>
>     x86/kvm: Restrict ASYNC_PF to user space
>
> but KVM side of the feature is kind of still there, namely
> kvm_pv_enable_async_pf() sets
>
>   vcpu->arch.apf.send_user_only = !(data & KVM_ASYNC_PF_SEND_ALWAYS);
>
> and then we check it in kvm_can_deliver_async_pf():
>
>   if (vcpu->arch.apf.send_user_only &&
>       kvm_x86_call(get_cpl)(vcpu) == 0)
>           return false;
>
> and this can still be used by some legacy guests I suppose. How about
> we start with removing this completely? It does not matter if some
> legacy guest wants to get an APF for CPL0, we are never obliged to
> actually use the mechanism.

If I understand you correctly, the change you propose is rather
orthogonal to the original one, as the check is performed after the work
has been already allocated (in kvm_setup_async_pf). Would you expect
tangible savings from omitting the send_user_only check?

> --
> Vitaly
On Thu, Nov 21, 2024, Nikita Kalyazin wrote:
> On 19/11/2024 13:24, Sean Christopherson wrote:
> > None of this justifies breaking host-side, non-paravirt async page faults.
> > If a vCPU hits a missing page, KVM can schedule out the vCPU and let
> > something else run on the pCPU, or enter idle and let the SMT sibling get
> > more cycles, or maybe even enter a low enough sleep state to let other
> > cores turbo a wee bit.
> >
> > I have no objection to disabling host async page faults, e.g. it's probably
> > a net negative for 1:1 vCPU:pCPU pinned setups, but such disabling needs an
> > opt-in from userspace.
>
> That's a good point, I didn't think about it. The async work would still
> need to execute somewhere in that case (or sleep in GUP until the page is
> available).

The "async work" is often an I/O operation, e.g. to pull in the page from disk,
or over the network from the source. The *CPU* doesn't need to actively do
anything for those operations. The I/O is initiated, so the CPU can do
something else, or go idle if there's no other work to be done.

> If processing the fault synchronously, the vCPU thread can also sleep in the
> same way freeing the pCPU for something else,

If and only if the vCPU can handle a PV async #PF. E.g. if the guest kernel
flat out doesn't support PV async #PF, or the fault happened while the guest
was in an incompatible mode, etc. If KVM doesn't do async #PFs of any kind,
the vCPU will spin on the fault until the I/O completes and the page is ready.

> so the amount of work to be done looks equivalent (please correct me
> otherwise). What's the net gain of moving that to an async work in the host
> async fault case? "while allowing interrupt delivery into the guest." -- is
> this the main advantage?
Nikita Kalyazin <kalyazin@amazon.com> writes:

> On 18/11/2024 17:58, Vitaly Kuznetsov wrote:
>> Nikita Kalyazin <kalyazin@amazon.com> writes:
>>
>>> On x86, async pagefault events can only be delivered if the page fault
>>> was triggered by guest userspace, not kernel. This is because
>>> the guest may be in non-sleepable context and will not be able
>>> to reschedule.
>>
>> We used to set KVM_ASYNC_PF_SEND_ALWAYS for Linux guests before
>>
>> commit 3a7c8fafd1b42adea229fd204132f6a2fb3cd2d9
>> Author: Thomas Gleixner <tglx@linutronix.de>
>> Date:   Fri Apr 24 09:57:56 2020 +0200
>>
>>     x86/kvm: Restrict ASYNC_PF to user space
>>
>> but KVM side of the feature is kind of still there, namely
>> kvm_pv_enable_async_pf() sets
>>
>>   vcpu->arch.apf.send_user_only = !(data & KVM_ASYNC_PF_SEND_ALWAYS);
>>
>> and then we check it in kvm_can_deliver_async_pf():
>>
>>   if (vcpu->arch.apf.send_user_only &&
>>       kvm_x86_call(get_cpl)(vcpu) == 0)
>>           return false;
>>
>> and this can still be used by some legacy guests I suppose. How about
>> we start with removing this completely? It does not matter if some
>> legacy guest wants to get an APF for CPL0, we are never obliged to
>> actually use the mechanism.
>
> If I understand you correctly, the change you propose is rather
> orthogonal to the original one as the check is performed after the work
> has been already allocated (in kvm_setup_async_pf). Would you expect
> tangible savings from omitting the send_user_only check?

No, I don't expect any performance benefits. Basically, I was referring
to the description of your patch:

"On x86, async pagefault events can only be delivered if the page fault
was triggered by guest userspace, not kernel"

and strictly speaking this is not true today, as we still support
KVM_ASYNC_PF_SEND_ALWAYS in KVM. Yes, modern Linux guests don't use it,
but the flag is there.

Basically, my suggestion is to start with a cleanup (untested):

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6d9f763a7bb9..d0906830a9fb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -974,7 +974,6 @@ struct kvm_vcpu_arch {
 		u64 msr_int_val; /* MSR_KVM_ASYNC_PF_INT */
 		u16 vec;
 		u32 id;
-		bool send_user_only;
 		u32 host_apf_flags;
 		bool delivery_as_pf_vmexit;
 		bool pageready_pending;
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index a1efa7907a0b..5558a1ec3dc9 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -87,7 +87,7 @@ struct kvm_clock_pairing {
 #define KVM_MAX_MMU_OP_BATCH           32
 
 #define KVM_ASYNC_PF_ENABLED			(1 << 0)
-#define KVM_ASYNC_PF_SEND_ALWAYS		(1 << 1)
+#define KVM_ASYNC_PF_SEND_ALWAYS		(1 << 1) /* deprecated */
 #define KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT	(1 << 2)
 #define KVM_ASYNC_PF_DELIVERY_AS_INT		(1 << 3)
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 83fe0a78146f..cd15e738ca9b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3585,7 +3585,6 @@ static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
 					sizeof(u64)))
 		return 1;
 
-	vcpu->arch.apf.send_user_only = !(data & KVM_ASYNC_PF_SEND_ALWAYS);
 	vcpu->arch.apf.delivery_as_pf_vmexit = data & KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT;
 
 	kvm_async_pf_wakeup_all(vcpu);
@@ -13374,8 +13373,7 @@ static bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu)
 	if (!kvm_pv_async_pf_enabled(vcpu))
 		return false;
 
-	if (vcpu->arch.apf.send_user_only &&
-	    kvm_x86_call(get_cpl)(vcpu) == 0)
+	if (kvm_x86_call(get_cpl)(vcpu) == 0)
 		return false;
 
 	if (is_guest_mode(vcpu)) {
On Fri, Nov 22, 2024, Vitaly Kuznetsov wrote:
> diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
> index a1efa7907a0b..5558a1ec3dc9 100644
> --- a/arch/x86/include/uapi/asm/kvm_para.h
> +++ b/arch/x86/include/uapi/asm/kvm_para.h
> @@ -87,7 +87,7 @@ struct kvm_clock_pairing {
>  #define KVM_MAX_MMU_OP_BATCH           32
>
>  #define KVM_ASYNC_PF_ENABLED			(1 << 0)
> -#define KVM_ASYNC_PF_SEND_ALWAYS		(1 << 1)
> +#define KVM_ASYNC_PF_SEND_ALWAYS		(1 << 1) /* deprecated */
>  #define KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT	(1 << 2)
>  #define KVM_ASYNC_PF_DELIVERY_AS_INT	(1 << 3)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 83fe0a78146f..cd15e738ca9b 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3585,7 +3585,6 @@ static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
>  					sizeof(u64)))
>  		return 1;
>
> -	vcpu->arch.apf.send_user_only = !(data & KVM_ASYNC_PF_SEND_ALWAYS);
>  	vcpu->arch.apf.delivery_as_pf_vmexit = data & KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT;
>
>  	kvm_async_pf_wakeup_all(vcpu);
> @@ -13374,8 +13373,7 @@ static bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu)
>  	if (!kvm_pv_async_pf_enabled(vcpu))
>  		return false;
>
> -	if (vcpu->arch.apf.send_user_only &&
> -	    kvm_x86_call(get_cpl)(vcpu) == 0)
> +	if (kvm_x86_call(get_cpl)(vcpu) == 0)

By x86's general definition of "user", this should be "!= 3" :-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 22e7ad235123..11d29d15b6cd 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4369,7 +4369,8 @@ static int __kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
 			trace_kvm_async_pf_repeated_fault(fault->addr, fault->gfn);
 			kvm_make_request(KVM_REQ_APF_HALT, vcpu);
 			return RET_PF_RETRY;
-		} else if (kvm_arch_setup_async_pf(vcpu, fault)) {
+		} else if (kvm_can_deliver_async_pf(vcpu) &&
+			   kvm_arch_setup_async_pf(vcpu, fault)) {
 			return RET_PF_RETRY;
 		}
 	}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2e713480933a..8edae75b39f7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13355,7 +13355,7 @@ static inline bool apf_pageready_slot_free(struct kvm_vcpu *vcpu)
 	return !val;
 }
 
-static bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu)
+bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu)
 {
 
 	if (!kvm_pv_async_pf_enabled(vcpu))
@@ -13406,8 +13406,7 @@ bool kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
 	trace_kvm_async_pf_not_present(work->arch.token, work->cr2_or_gpa);
 	kvm_add_async_pf_gfn(vcpu, work->arch.gfn);
 
-	if (kvm_can_deliver_async_pf(vcpu) &&
-	    !apf_put_user_notpresent(vcpu)) {
+	if (!apf_put_user_notpresent(vcpu)) {
 		fault.vector = PF_VECTOR;
 		fault.error_code_valid = true;
 		fault.error_code = 0;
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index ec623d23d13d..9647f41e5c49 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -387,6 +387,8 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 fastpath_t handle_fastpath_set_msr_irqoff(struct kvm_vcpu *vcpu);
 fastpath_t handle_fastpath_hlt(struct kvm_vcpu *vcpu);
 
+bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu);
+
 extern struct kvm_caps kvm_caps;
 extern struct kvm_host_values kvm_host;
On x86, async pagefault events can only be delivered if the page fault
was triggered by guest userspace, not kernel. This is because
the guest may be in non-sleepable context and will not be able
to reschedule.

However existing implementation pays the following overhead even for the
kernel-originated faults, even though it is known in advance that they
cannot be processed asynchronously:
- allocate async PF token
- create and schedule an async work

This patch avoids the overhead above in case of kernel-originated faults
by moving the `kvm_can_deliver_async_pf` check from
`kvm_arch_async_page_not_present` to `__kvm_faultin_pfn`.

Note that the existing check `kvm_can_do_async_pf` already calls
`kvm_can_deliver_async_pf` internally, however it only does that if the
`kvm_hlt_in_guest` check is true, ie userspace requested KVM not to exit
on guest halts via `KVM_CAP_X86_DISABLE_EXITS`. In that case the code
proceeds with the async fault processing with the following
justification in 1dfdb45ec510ba27e366878f97484e9c9e728902 ("KVM: x86:
clean up conditions for asynchronous page fault handling"):

"Even when asynchronous page fault is disabled, KVM does not want to pause
the host if a guest triggers a page fault; instead it will put it into
an artificial HLT state that allows running other host processes while
allowing interrupt delivery into the guest."

Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
---
 arch/x86/kvm/mmu/mmu.c | 3 ++-
 arch/x86/kvm/x86.c     | 5 ++---
 arch/x86/kvm/x86.h     | 2 ++
 3 files changed, 6 insertions(+), 4 deletions(-)

base-commit: d96c77bd4eeba469bddbbb14323d2191684da82a