[RFC,4/6] KVM: x86: acknowledgment mechanism for async pf page ready notifications

Message ID 20200429093634.1514902-5-vkuznets@redhat.com (mailing list archive)
State New, archived
Series KVM: x86: Interrupt-based mechanism for async_pf 'page present' notifications

Commit Message

Vitaly Kuznetsov April 29, 2020, 9:36 a.m. UTC
If two 'page ready' notifications happen back to back, the second one is
not delivered and the only mechanism we currently have is the
kvm_check_async_pf_completion() check in the vcpu_run() loop. That check
is only performed on the next vmexit, whenever it happens, and in some
cases that may take a while. With interrupt-based 'page ready'
notification delivery the situation is even worse: unlike exceptions,
interrupts are not handled immediately, so we must check whether the slot
is empty. This is slow and unnecessary. Introduce a dedicated
MSR_KVM_ASYNC_PF_ACK MSR through which the guest communicates that the
slot is free and the host should check its notification queue. Mandate
using it for interrupt-based type 2 APF event delivery.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 Documentation/virt/kvm/msr.rst       | 16 +++++++++++++++-
 arch/x86/include/uapi/asm/kvm_para.h |  1 +
 arch/x86/kvm/x86.c                   |  9 ++++++++-
 3 files changed, 24 insertions(+), 2 deletions(-)
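
For illustration, the guest side of this protocol could look roughly like
the sketch below. The handler name and the assumption that the token for
the ready page is read from the shared 'struct kvm_vcpu_pv_apf_data' are
illustrative, not taken from this series:

	/* Hypothetical guest-side 'page ready' interrupt handler. */
	static void kvm_async_pf_ready(void)
	{
		/* Assumption: the token lands in the shared struct. */
		u32 token = __this_cpu_read(apf_reason.token);

		/* Wake whichever task is waiting on this token. */
		kvm_async_pf_task_wake(token);

		/* Free the slot... */
		__this_cpu_write(apf_reason.reason, 0);

		/* ...and tell the host it may deliver the next event. */
		wrmsrl(MSR_KVM_ASYNC_PF_ACK, 1);
	}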

Comments

Andy Lutomirski April 29, 2020, 5:28 p.m. UTC | #1
On Wed, Apr 29, 2020 at 2:36 AM Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
>
> If two page ready notifications happen back to back the second one is not
> delivered and the only mechanism we currently have is
> kvm_check_async_pf_completion() check in vcpu_run() loop. The check will
> only be performed with the next vmexit when it happens and in some cases
> it may take a while. With interrupt based page ready notification delivery
> the situation is even worse: unlike exceptions, interrupts are not handled
> immediately so we must check if the slot is empty. This is slow and
> unnecessary. Introduce dedicated MSR_KVM_ASYNC_PF_ACK MSR to communicate
> the fact that the slot is free and host should check its notification
> queue. Mandate using it for interrupt based type 2 APF event delivery.

This seems functional, but I'm wondering if it could be a bit simpler and
more efficient if the data structure were a normal descriptor ring with
the same number of slots as the maximum number of waiting pages.
Then there would never need to be any notification from the guest
back to the host, since there would always be room for a notification.

It might be even better if a single unified data structure were used
for both notifications.
Paolo Bonzini April 29, 2020, 5:40 p.m. UTC | #2
On 29/04/20 19:28, Andy Lutomirski wrote:
> This seems functional, but I'm wondering if it could a bit simpler and
> more efficient if the data structure was a normal descriptor ring with
> the same number slots as whatever the maximum number of waiting pages
> is.  Then there would never need to be any notification from the guest
> back to the host, since there would always be room for a notification.

No, it would be much more complicated code for a slow path which is
already orders of magnitude slower than a vmexit.  It would also use
much more memory.

> It might be even better if a single unified data structure was used
> for both notifications.

That's a very bad idea since one is synchronous and one is asynchronous.
 Part of the proposal we agreed upon was to keep "page not ready"
synchronous while making "page ready" an interrupt.  The data structure
for "page not ready" will be #VE.

Paolo
Andy Lutomirski April 30, 2020, 12:45 a.m. UTC | #3
On Wed, Apr 29, 2020 at 10:40 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 29/04/20 19:28, Andy Lutomirski wrote:
> > This seems functional, but I'm wondering if it could a bit simpler and
> > more efficient if the data structure was a normal descriptor ring with
> > the same number slots as whatever the maximum number of waiting pages
> > is.  Then there would never need to be any notification from the guest
> > back to the host, since there would always be room for a notification.
>
> No, it would be much more complicated code for a slow path which is
> already order of magnitudes slower than a vmexit.  It would also use
> much more memory.

Fair enough.

>
> > It might be even better if a single unified data structure was used
> > for both notifications.
>
> That's a very bad idea since one is synchronous and one is asynchronous.
>  Part of the proposal we agreed upon was to keep "page not ready"
> synchronous while making "page ready" an interrupt.  The data structure
> for "page not ready" will be #VE.
>

#VE on SVM will be interesting, to say the least, and I think that a
solution that is VMX specific doesn't make much sense.  #VE also has
unpleasant issues involving the contexts in which it can occur.  You
will have quite a hard time convincing me to ack the addition of a #VE
entry handler for this.  I think a brand new vector is the right
solution.
Paolo Bonzini April 30, 2020, 6:46 a.m. UTC | #4
On 30/04/20 02:45, Andy Lutomirski wrote:
>> That's a very bad idea since one is synchronous and one is asynchronous.
>>  Part of the proposal we agreed upon was to keep "page not ready"
>> synchronous while making "page ready" an interrupt.  The data structure
>> for "page not ready" will be #VE.
>
> #VE on SVM will be interesting, to say the least, and I think that a
> solution that is VMX specific doesn't make much sense.

You can always inject it manually.  The same is true of Haswell and
earlier processors.

> #VE also has
> unpleasant issues involving the contexts in which it can occur.  You
> will have quite a hard time convincing me to ack the addition of a #VE
> entry handler for this.  I think a brand new vector is the right
> solution.

I need --verbose. :)  For #VE I liked the idea of re-enabling it from an
IPI, at least in the case where we cannot move out of the IST stack.
And any other vector that behaves like an exception would have the same
issue, wouldn't it (especially re-entrancy)?

Paolo
Paolo Bonzini April 30, 2020, 6:49 a.m. UTC | #5
On 29/04/20 11:36, Vitaly Kuznetsov wrote:
> +	case MSR_KVM_ASYNC_PF_ACK:
> +		if (data & 0x1)
> +			kvm_check_async_pf_completion(vcpu);
> +		break;

Does this work if interrupts are turned off?  I think in that case
kvm_check_async_pf_completion will refuse to make progress.  You need to
make this bit stateful (e.g. 1 = async PF in progress, 0 = not in
progress), and check that for page ready notifications instead of
EFLAGS.IF.  This probably means that:

- it might be simpler to move it to the vector MSR

- it's definitely much simpler to remove the #PF-based mechanism for
injecting page ready notifications.

Thanks,

Paolo
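
For reference, the loop in question looks roughly like this in
virt/kvm/async_pf.c at the time of this series (paraphrased, cleanup
elided):

	void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
	{
		struct kvm_async_pf *work;

		while (!list_empty_careful(&vcpu->async_pf.done) &&
		       kvm_arch_can_inject_async_page_present(vcpu)) {
			spin_lock(&vcpu->async_pf.lock);
			work = list_first_entry(&vcpu->async_pf.done,
						typeof(*work), link);
			list_del(&work->link);
			spin_unlock(&vcpu->async_pf.lock);

			/* Inject 'page ready' for this work item. */
			kvm_arch_async_page_ready(vcpu, work);
			kvm_arch_async_page_present(vcpu, work);
			/* ... dequeue and free 'work' ... */
		}
	}

With interrupts off, kvm_arch_can_inject_async_page_present() returns
false and the loop stalls, which is the situation described above.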
Vitaly Kuznetsov April 30, 2020, 8:40 a.m. UTC | #6
Paolo Bonzini <pbonzini@redhat.com> writes:

> On 29/04/20 11:36, Vitaly Kuznetsov wrote:
>> +	case MSR_KVM_ASYNC_PF_ACK:
>> +		if (data & 0x1)
>> +			kvm_check_async_pf_completion(vcpu);
>> +		break;
>
> Does this work if interrupts are turned off?
>  I think in that case
> kvm_check_async_pf_completion will refuse to make progress.
> You need to make this bit stateful (e.g. 1 = async PF in progress, 0 =
> not in progress), and check that for page ready notifications instead of
> EFLAGS.IF.  
> This probably means that;
>
> - it might be simpler to move it to the vector MSR

I didn't want to merge 'ACK' with the vector MSR as it forces the guest
to remember the setting. It doesn't matter at all for Linux as we
hardcode the interrupt number, but I can imagine an OS assigning IRQ
numbers dynamically; it would need to keep a record to avoid doing rdmsr.

> - it's definitely much simpler to remove the #PF-based mechanism for
> injecting page ready notifications.

Yeah, the logic in kvm_can_do_async_pf()/kvm_can_deliver_async_pf()
becomes cumbersome. If we are to drop the #PF-based mechanism I'd split it
completely from the remaining synchronous #PF for page-not-present:
basically, we only need to check that the slot (which we agreed becomes
completely separate) is empty; interrupt/pending exception/... state
becomes irrelevant.
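
As a sketch of where that split could lead, the 'page ready' injection
gate could reduce to a pure slot check. Function and field names here are
hypothetical, assuming the separate slot is reachable through a
gfn_to_hva_cache:

	/* Hypothetical: is the separate 'page ready' slot free? */
	static bool kvm_async_pf_slot_free(struct kvm_vcpu *vcpu)
	{
		u32 token;

		if (kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.apf.data,
					  &token, sizeof(token)))
			return false;

		return token == 0;
	}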
Paolo Bonzini April 30, 2020, 9:27 a.m. UTC | #7
On 30/04/20 10:40, Vitaly Kuznetsov wrote:
>>  I think in that case
>> kvm_check_async_pf_completion will refuse to make progress.
>> You need to make this bit stateful (e.g. 1 = async PF in progress, 0 =
>> not in progress), and check that for page ready notifications instead of
>> EFLAGS.IF.  
>> This probably means that;
>>
>> - it might be simpler to move it to the vector MSR
> I didn't want to merge 'ACK' with the vector MSR as it forces the guest
> to remember the setting. It doesn't matter at all for Linux as we
> hardcode the interrupt number but I can imaging an OS assigning IRQ
> numbers dynamically, it'll need to keep record to avoid doing rdmsr.

I would expect that it needs to keep it in a global variable anyway, but
yes this is a good point.  You can also keep the ACK MSR and store the
pending bit in the other MSR, kind of like you have separate ISR and EOI
registers in the LAPIC.

>> - it's definitely much simpler to remove the #PF-based mechanism for
>> injecting page ready notifications.
> Yea, the logic in kvm_can_do_async_pf()/kvm_can_deliver_async_pf()
> becomes cumbersome. If we are to drop #PF-based mechanism I'd split it
> completely from the remaining synchronious #PF for page-not-present:
> basically, we only need to check that the slot (which we agreed becomes
> completely separate) is empty, interrupt/pending expception/... state
> becomes irrelevant.

Yes, that's a good point.

Paolo
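
Purely as an illustration of the ISR/EOI-like alternative: the vector MSR
could carry both the vector and a host-maintained pending bit (bit
positions made up, not what this series implements):

	#define KVM_ASYNC_PF_VEC_MASK	GENMASK_ULL(7, 0)  /* interrupt vector */
	#define KVM_ASYNC_PF_PENDING	BIT_ULL(8)  /* set by host, cleared via ACK */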
Vitaly Kuznetsov April 30, 2020, 11:33 a.m. UTC | #8
Paolo Bonzini <pbonzini@redhat.com> writes:

> On 30/04/20 10:40, Vitaly Kuznetsov wrote:
>>>  I think in that case
>>> kvm_check_async_pf_completion will refuse to make progress.
>>> You need to make this bit stateful (e.g. 1 = async PF in progress, 0 =
>>> not in progress), and check that for page ready notifications instead of
>>> EFLAGS.IF.  
>>> This probably means that;
>>>
>>> - it might be simpler to move it to the vector MSR
>> I didn't want to merge 'ACK' with the vector MSR as it forces the guest
>> to remember the setting. It doesn't matter at all for Linux as we
>> hardcode the interrupt number but I can imaging an OS assigning IRQ
>> numbers dynamically, it'll need to keep record to avoid doing rdmsr.
>
> I would expect that it needs to keep it in a global variable anyway, but
> yes this is a good point.  You can also keep the ACK MSR and store the
> pending bit in the other MSR, kind of like you have separate ISR and EOI
> registers in the LAPIC.
>

Honestly, I was inspired by Hyper-V's HV_X64_MSR_EOM MSR, as the protocol
we're trying to come up with here is very similar to Hyper-V messaging. :)

I'm not exactly sure why we need the pending bit after we drop #PF. When
we call kvm_check_async_pf_completion() from the MSR_KVM_ASYNC_PF_ACK
write, it will (in case there are page ready events in the queue) check if
the slot is empty, put one there and raise an IRQ regardless of the
guest's current state. It may or may not get injected immediately, but we
don't care. The second invocation of kvm_check_async_pf_completion() from
vcpu_run() will just go away.

I'm probably just missing something; I'll think about it again while
working on v1. It seems nobody is against the idea in general. Thanks!
Paolo Bonzini April 30, 2020, 11:43 a.m. UTC | #9
On 30/04/20 13:33, Vitaly Kuznetsov wrote:
>> I would expect that it needs to keep it in a global variable anyway, but
>> yes this is a good point.  You can also keep the ACK MSR and store the
>> pending bit in the other MSR, kind of like you have separate ISR and EOI
>> registers in the LAPIC.
>>
> Honestly I was inspired by Hyper-V's HV_X64_MSR_EOM MSR as the protocol
> we're trying to come up with here is very similar to HV messaging)

Oh, that's true actually.

> I'm not exactly sure why we need the pending bit after we drop #PF. When
> we call kvm_check_async_pf_completion() from MSR_KVM_ASYNC_PF_ACK write
> it will (in case there are page ready events in the queue) check if the
> slot is empty, put one there and raise IRQ regardless of guest's current
> state. It may or may not get injected immediately but we don't care.
> The second invocation of kvm_check_async_pf_completion() from vcpu_run()
> will just go away.

You're right, you can just use the value in the guest to see if the
guest is ready.  This is also similar to how #VE handles re-entrancy;
however, because this is an interrupt, we have IF to delay the IRQ until
after the interrupt handler has finished.

By dropping the #PF page ready case, we can also drop the ugly case
where WRMSR injects a page ready page fault even if IF=0.  That one is
safe on Linux, but Andy didn't like it.

Paolo
Gavin Shan May 5, 2020, 12:36 a.m. UTC | #10
Hi Vitaly,

On 4/29/20 7:36 PM, Vitaly Kuznetsov wrote:
> If two page ready notifications happen back to back the second one is not
> delivered and the only mechanism we currently have is
> kvm_check_async_pf_completion() check in vcpu_run() loop. The check will
> only be performed with the next vmexit when it happens and in some cases
> it may take a while. With interrupt based page ready notification delivery
> the situation is even worse: unlike exceptions, interrupts are not handled
> immediately so we must check if the slot is empty. This is slow and
> unnecessary. Introduce dedicated MSR_KVM_ASYNC_PF_ACK MSR to communicate
> the fact that the slot is free and host should check its notification
> queue. Mandate using it for interrupt based type 2 APF event delivery.
> 
> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> ---
>   Documentation/virt/kvm/msr.rst       | 16 +++++++++++++++-
>   arch/x86/include/uapi/asm/kvm_para.h |  1 +
>   arch/x86/kvm/x86.c                   |  9 ++++++++-
>   3 files changed, 24 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/msr.rst b/Documentation/virt/kvm/msr.rst
> index 7433e55f7184..18db3448db06 100644
> --- a/Documentation/virt/kvm/msr.rst
> +++ b/Documentation/virt/kvm/msr.rst
> @@ -219,6 +219,11 @@ data:
>   	If during pagefault APF reason is 0 it means that this is regular
>   	page fault.
>   
> +	For interrupt based delivery, guest has to write '1' to
> +	MSR_KVM_ASYNC_PF_ACK every time it clears reason in the shared
> +	'struct kvm_vcpu_pv_apf_data', this forces KVM to re-scan its
> +	queue and deliver next pending notification.
> +
>   	During delivery of type 1 APF cr2 contains a token that will
>   	be used to notify a guest when missing page becomes
>   	available. When page becomes available type 2 APF is sent with
> @@ -340,4 +345,13 @@ data:
>   
>   	To switch to interrupt based delivery of type 2 APF events guests
>   	are supposed to enable asynchronous page faults and set bit 3 in
> -	MSR_KVM_ASYNC_PF_EN first.
> +
> +MSR_KVM_ASYNC_PF_ACK:
> +	0x4b564d07
> +
> +data:
> +	Asynchronous page fault acknowledgment. When the guest is done
> +	processing type 2 APF event and 'reason' field in 'struct
> +	kvm_vcpu_pv_apf_data' is cleared it is supposed to write '1' to
> +	Bit 0 of the MSR, this caused the host to re-scan its queue and
> +	check if there are more notifications pending.

I'm not sure I understood the usage of MSR_KVM_ASYNC_PF_ACK
completely. It seems it's used to trap to the host, to get a chance
to check/deliver pending page ready events. If there are no pending
events, no work will be done in the trap. If that's true, might it
be a good idea to trap conditionally, meaning writing to ASYNC_PF_ACK
only if there really are pending events?

Thanks,
Gavin

> diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
> index 1bbb0b7e062f..5c7449980619 100644
> --- a/arch/x86/include/uapi/asm/kvm_para.h
> +++ b/arch/x86/include/uapi/asm/kvm_para.h
> @@ -51,6 +51,7 @@
>   #define MSR_KVM_PV_EOI_EN      0x4b564d04
>   #define MSR_KVM_POLL_CONTROL	0x4b564d05
>   #define MSR_KVM_ASYNC_PF2	0x4b564d06
> +#define MSR_KVM_ASYNC_PF_ACK	0x4b564d07
>   
>   struct kvm_steal_time {
>   	__u64 steal;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 861dce1e7cf5..e3b91ac33bfd 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1243,7 +1243,7 @@ static const u32 emulated_msrs_all[] = {
>   	HV_X64_MSR_TSC_EMULATION_STATUS,
>   
>   	MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
> -	MSR_KVM_PV_EOI_EN, MSR_KVM_ASYNC_PF2,
> +	MSR_KVM_PV_EOI_EN, MSR_KVM_ASYNC_PF2, MSR_KVM_ASYNC_PF_ACK,
>   
>   	MSR_IA32_TSC_ADJUST,
>   	MSR_IA32_TSCDEADLINE,
> @@ -2915,6 +2915,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>   		if (kvm_pv_enable_async_pf2(vcpu, data))
>   			return 1;
>   		break;
> +	case MSR_KVM_ASYNC_PF_ACK:
> +		if (data & 0x1)
> +			kvm_check_async_pf_completion(vcpu);
> +		break;
>   	case MSR_KVM_STEAL_TIME:
>   
>   		if (unlikely(!sched_info_on()))
> @@ -3194,6 +3198,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>   	case MSR_KVM_ASYNC_PF2:
>   		msr_info->data = vcpu->arch.apf.msr2_val;
>   		break;
> +	case MSR_KVM_ASYNC_PF_ACK:
> +		msr_info->data = 0;
> +		break;
>   	case MSR_KVM_STEAL_TIME:
>   		msr_info->data = vcpu->arch.st.msr_val;
>   		break;
>
Vitaly Kuznetsov May 5, 2020, 8:16 a.m. UTC | #11
Gavin Shan <gshan@redhat.com> writes:

> Hi Vitaly,
>
> On 4/29/20 7:36 PM, Vitaly Kuznetsov wrote:
>> If two page ready notifications happen back to back the second one is not
>> delivered and the only mechanism we currently have is
>> kvm_check_async_pf_completion() check in vcpu_run() loop. The check will
>> only be performed with the next vmexit when it happens and in some cases
>> it may take a while. With interrupt based page ready notification delivery
>> the situation is even worse: unlike exceptions, interrupts are not handled
>> immediately so we must check if the slot is empty. This is slow and
>> unnecessary. Introduce dedicated MSR_KVM_ASYNC_PF_ACK MSR to communicate
>> the fact that the slot is free and host should check its notification
>> queue. Mandate using it for interrupt based type 2 APF event delivery.
>> 
>> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
>> ---
>>   Documentation/virt/kvm/msr.rst       | 16 +++++++++++++++-
>>   arch/x86/include/uapi/asm/kvm_para.h |  1 +
>>   arch/x86/kvm/x86.c                   |  9 ++++++++-
>>   3 files changed, 24 insertions(+), 2 deletions(-)
>> 
>> diff --git a/Documentation/virt/kvm/msr.rst b/Documentation/virt/kvm/msr.rst
>> index 7433e55f7184..18db3448db06 100644
>> --- a/Documentation/virt/kvm/msr.rst
>> +++ b/Documentation/virt/kvm/msr.rst
>> @@ -219,6 +219,11 @@ data:
>>   	If during pagefault APF reason is 0 it means that this is regular
>>   	page fault.
>>   
>> +	For interrupt based delivery, guest has to write '1' to
>> +	MSR_KVM_ASYNC_PF_ACK every time it clears reason in the shared
>> +	'struct kvm_vcpu_pv_apf_data', this forces KVM to re-scan its
>> +	queue and deliver next pending notification.
>> +
>>   	During delivery of type 1 APF cr2 contains a token that will
>>   	be used to notify a guest when missing page becomes
>>   	available. When page becomes available type 2 APF is sent with
>> @@ -340,4 +345,13 @@ data:
>>   
>>   	To switch to interrupt based delivery of type 2 APF events guests
>>   	are supposed to enable asynchronous page faults and set bit 3 in
>> -	MSR_KVM_ASYNC_PF_EN first.
>> +
>> +MSR_KVM_ASYNC_PF_ACK:
>> +	0x4b564d07
>> +
>> +data:
>> +	Asynchronous page fault acknowledgment. When the guest is done
>> +	processing type 2 APF event and 'reason' field in 'struct
>> +	kvm_vcpu_pv_apf_data' is cleared it is supposed to write '1' to
>> +	Bit 0 of the MSR, this caused the host to re-scan its queue and
>> +	check if there are more notifications pending.
>
> I'm not sure if I understood the usage of MSR_KVM_ASYNC_PF_ACK
> completely. It seems it's used to trapped to host, to have chance
> to check/deliver pending page ready events. If there is no pending
> events, no work will be finished in the trap. If it's true, it might
> be good idea to trap conditionally, meaning writing to ASYNC_PF_ACK
> if there are really pending events?

How does the guest know whether the host has any pending events or not?

The problem we're trying to address with the ACK MSR is the following:
imagine the host has two 'page ready' notifications back to back. It puts
the token for the first one in the slot and raises an IRQ, but how do we
know when the slot becomes free so we can inject the second one?
Currently, we have the kvm_check_async_pf_completion() check in the
vcpu_run() loop, but this doesn't guarantee timely delivery of the event;
we just hope that there's going to be a vmexit 'some time soon' and we'll
piggy-back onto that. Normally this works, but in some special cases it
may take really long before a vmexit happens. Also, we're penalizing each
vmexit with an unneeded check. The ACK MSR is intended to solve these
issues.
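
Schematically, the protocol the ACK MSR establishes (a summary of the
above, not new semantics):

	host:   async work completes -> queued on vcpu->async_pf.done
	host:   slot empty -> write token into the slot, inject the
	        'page ready' IRQ
	guest:  IRQ handler wakes the task, clears the slot, writes '1'
	        to MSR_KVM_ASYNC_PF_ACK
	host:   the WRMSR vmexit runs kvm_check_async_pf_completion()
	        and delivers the next queued event, if any, without
	        waiting for an unrelated vmexit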
Gavin Shan May 5, 2020, 8:51 a.m. UTC | #12
Hi Vitaly,

On 5/5/20 6:16 PM, Vitaly Kuznetsov wrote:
>> On 4/29/20 7:36 PM, Vitaly Kuznetsov wrote:
>>> If two page ready notifications happen back to back the second one is not
>>> delivered and the only mechanism we currently have is
>>> kvm_check_async_pf_completion() check in vcpu_run() loop. The check will
>>> only be performed with the next vmexit when it happens and in some cases
>>> it may take a while. With interrupt based page ready notification delivery
>>> the situation is even worse: unlike exceptions, interrupts are not handled
>>> immediately so we must check if the slot is empty. This is slow and
>>> unnecessary. Introduce dedicated MSR_KVM_ASYNC_PF_ACK MSR to communicate
>>> the fact that the slot is free and host should check its notification
>>> queue. Mandate using it for interrupt based type 2 APF event delivery.
>>>
>>> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
>>> ---
>>>    Documentation/virt/kvm/msr.rst       | 16 +++++++++++++++-
>>>    arch/x86/include/uapi/asm/kvm_para.h |  1 +
>>>    arch/x86/kvm/x86.c                   |  9 ++++++++-
>>>    3 files changed, 24 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/Documentation/virt/kvm/msr.rst b/Documentation/virt/kvm/msr.rst
>>> index 7433e55f7184..18db3448db06 100644
>>> --- a/Documentation/virt/kvm/msr.rst
>>> +++ b/Documentation/virt/kvm/msr.rst
>>> @@ -219,6 +219,11 @@ data:
>>>    	If during pagefault APF reason is 0 it means that this is regular
>>>    	page fault.
>>>    
>>> +	For interrupt based delivery, guest has to write '1' to
>>> +	MSR_KVM_ASYNC_PF_ACK every time it clears reason in the shared
>>> +	'struct kvm_vcpu_pv_apf_data', this forces KVM to re-scan its
>>> +	queue and deliver next pending notification.
>>> +
>>>    	During delivery of type 1 APF cr2 contains a token that will
>>>    	be used to notify a guest when missing page becomes
>>>    	available. When page becomes available type 2 APF is sent with
>>> @@ -340,4 +345,13 @@ data:
>>>    
>>>    	To switch to interrupt based delivery of type 2 APF events guests
>>>    	are supposed to enable asynchronous page faults and set bit 3 in
>>> -	MSR_KVM_ASYNC_PF_EN first.
>>> +
>>> +MSR_KVM_ASYNC_PF_ACK:
>>> +	0x4b564d07
>>> +
>>> +data:
>>> +	Asynchronous page fault acknowledgment. When the guest is done
>>> +	processing type 2 APF event and 'reason' field in 'struct
>>> +	kvm_vcpu_pv_apf_data' is cleared it is supposed to write '1' to
>>> +	Bit 0 of the MSR, this caused the host to re-scan its queue and
>>> +	check if there are more notifications pending.
>>
>> I'm not sure if I understood the usage of MSR_KVM_ASYNC_PF_ACK
>> completely. It seems it's used to trapped to host, to have chance
>> to check/deliver pending page ready events. If there is no pending
>> events, no work will be finished in the trap. If it's true, it might
>> be good idea to trap conditionally, meaning writing to ASYNC_PF_ACK
>> if there are really pending events?
> 
> How does the guest know if host has any pending events or not?
> 

One way would be through struct kvm_vcpu_pv_apf_data, which is visible
to both host and guest. In the host, the workers that have completed their
work (GUP) are queued onto vcpu->async_pf.done. So we need a bit in
struct kvm_vcpu_pv_apf_data, written by the host and read by the guest,
to see if there is pending work. I don't even think the writer/reader
need to be synchronized.

> The problem we're trying to address with ACK msr is the following:
> imagine host has two 'page ready' notifications back to back. It puts
> token for the first on in the slot and raises an IRQ but how do we know
> when the slot becomes free so we can inject the second one? Currently,
> we have kvm_check_async_pf_completion() check in vcpu_run() loop but
> this doesn't guarantee timely delivery of the event, we just hope that
> there's going to be a vmexit 'some time soon' and we'll piggy back onto
> that. Normally this works but in some special cases it may take really
> long before a vmexit happens. Also, we're penalizing each vmexit with an
> unneeded check. ACK msr is intended to solve these issues.
> 

Thanks for the explanation. I think vmexit frequency is somewhat
proportional to the workload, meaning more vmexits would be observed if
the VM has more workload, and async page faults are part of that workload.
I agree it makes sense to proactively poke the pending ('page ready')
events, but it would be reasonable to do so conditionally, to avoid
overhead. However, I'm not sure how much overhead it would have on x86 :)

Thanks,
Gavin
Paolo Bonzini May 5, 2020, 9:54 a.m. UTC | #13
On 05/05/20 10:51, Gavin Shan wrote:
>> How does the guest know if host has any pending events or not?
> 
> One way would be through struct kvm_vcpu_pv_apf_data, which is visible
> to host and guest. In the host, the workers that have completed their
> work (GUP) are queued into vcpu->async_pf.done. So we need a bit in
> struct kvm_vcpu_pv_apf_data, written by host while read by guest to
> see if there pending work. I even don't think the writer/reader need
> to be synchronized.

No, this cannot work.  The problem is that you need a way to tell the
host "you can give me another ready page", and a memory location is not
enough for that.

Paolo

Patch

diff --git a/Documentation/virt/kvm/msr.rst b/Documentation/virt/kvm/msr.rst
index 7433e55f7184..18db3448db06 100644
--- a/Documentation/virt/kvm/msr.rst
+++ b/Documentation/virt/kvm/msr.rst
@@ -219,6 +219,11 @@  data:
 	If during pagefault APF reason is 0 it means that this is regular
 	page fault.
 
+	For interrupt based delivery, the guest has to write '1' to
+	MSR_KVM_ASYNC_PF_ACK every time it clears 'reason' in the shared
+	'struct kvm_vcpu_pv_apf_data'; this forces KVM to re-scan its
+	queue and deliver the next pending notification.
+
 	During delivery of type 1 APF cr2 contains a token that will
 	be used to notify a guest when missing page becomes
 	available. When page becomes available type 2 APF is sent with
@@ -340,4 +345,13 @@  data:
 
 	To switch to interrupt based delivery of type 2 APF events guests
 	are supposed to enable asynchronous page faults and set bit 3 in
-	MSR_KVM_ASYNC_PF_EN first.
+
+MSR_KVM_ASYNC_PF_ACK:
+	0x4b564d07
+
+data:
+	Asynchronous page fault acknowledgment. When the guest is done
+	processing a type 2 APF event and the 'reason' field in 'struct
+	kvm_vcpu_pv_apf_data' is cleared, it is supposed to write '1' to
+	bit 0 of the MSR; this causes the host to re-scan its queue and
+	check if there are more notifications pending.
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 1bbb0b7e062f..5c7449980619 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -51,6 +51,7 @@ 
 #define MSR_KVM_PV_EOI_EN      0x4b564d04
 #define MSR_KVM_POLL_CONTROL	0x4b564d05
 #define MSR_KVM_ASYNC_PF2	0x4b564d06
+#define MSR_KVM_ASYNC_PF_ACK	0x4b564d07
 
 struct kvm_steal_time {
 	__u64 steal;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 861dce1e7cf5..e3b91ac33bfd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1243,7 +1243,7 @@  static const u32 emulated_msrs_all[] = {
 	HV_X64_MSR_TSC_EMULATION_STATUS,
 
 	MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
-	MSR_KVM_PV_EOI_EN, MSR_KVM_ASYNC_PF2,
+	MSR_KVM_PV_EOI_EN, MSR_KVM_ASYNC_PF2, MSR_KVM_ASYNC_PF_ACK,
 
 	MSR_IA32_TSC_ADJUST,
 	MSR_IA32_TSCDEADLINE,
@@ -2915,6 +2915,10 @@  int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		if (kvm_pv_enable_async_pf2(vcpu, data))
 			return 1;
 		break;
+	case MSR_KVM_ASYNC_PF_ACK:
+		if (data & 0x1)
+			kvm_check_async_pf_completion(vcpu);
+		break;
 	case MSR_KVM_STEAL_TIME:
 
 		if (unlikely(!sched_info_on()))
@@ -3194,6 +3198,9 @@  int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_KVM_ASYNC_PF2:
 		msr_info->data = vcpu->arch.apf.msr2_val;
 		break;
+	case MSR_KVM_ASYNC_PF_ACK:
+		msr_info->data = 0;
+		break;
 	case MSR_KVM_STEAL_TIME:
 		msr_info->data = vcpu->arch.st.msr_val;
 		break;