diff mbox series

[4/8] KVM: x86: interrupt based APF page-ready event delivery

Message ID 20200511164752.2158645-5-vkuznets@redhat.com (mailing list archive)
State New, archived
Headers show
Series KVM: x86: Interrupt-based mechanism for async_pf 'page present' notifications | expand

Commit Message

Vitaly Kuznetsov May 11, 2020, 4:47 p.m. UTC
Concerns were expressed around APF delivery via synthetic #PF exception as
in some cases such delivery may collide with real page fault. For type 2
(page ready) notifications we can easily switch to using an interrupt
instead. Introduce new MSR_KVM_ASYNC_PF_INT mechanism and deprecate the
legacy one.

One notable difference between the two mechanisms is that interrupt may not
get handled immediately so whenever we would like to deliver next event
(regardless of its type) we must be sure the guest had read and cleared
previous event in the slot.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 Documentation/virt/kvm/msr.rst       | 91 +++++++++++++++++---------
 arch/x86/include/asm/kvm_host.h      |  4 +-
 arch/x86/include/uapi/asm/kvm_para.h |  6 ++
 arch/x86/kvm/x86.c                   | 95 ++++++++++++++++++++--------
 4 files changed, 140 insertions(+), 56 deletions(-)
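
The enablement sequence described by the new documentation in this patch boils
down to: write the interrupt vector to MSR_KVM_ASYNC_PF_INT first, then enable
APF through MSR_KVM_ASYNC_PF_EN with bit 3 set. A rough guest-side sketch
follows; the per-CPU variable, the helper name and the vector argument are
hypothetical, and the bit names are assumed to follow the KVM_ASYNC_PF_*
convention of arch/x86/include/uapi/asm/kvm_para.h:

#include <linux/types.h>
#include <linux/percpu-defs.h>
#include <asm/msr.h>
#include <asm/pgtable_types.h>
#include <asm/kvm_para.h>

/* Hypothetical per-CPU copy of the shared area; 64-byte aligned and zeroed. */
static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_data) __aligned(64);

static void kvm_guest_apf_enable(u8 vector)
{
	u64 pa = slow_virt_to_phys(this_cpu_ptr(&apf_data));

	/* The vector MSR must be written before APF is enabled. */
	wrmsrl(MSR_KVM_ASYNC_PF_INT, vector);

	/* Bit 0 enables APF, bit 3 opts in to interrupt based 'page ready' delivery. */
	wrmsrl(MSR_KVM_ASYNC_PF_EN, pa | KVM_ASYNC_PF_ENABLED |
				    KVM_ASYNC_PF_DELIVERY_AS_INT);

	__this_cpu_write(apf_data.enabled, 1);
}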

Comments

Vivek Goyal May 12, 2020, 2:24 p.m. UTC | #1
On Mon, May 11, 2020 at 06:47:48PM +0200, Vitaly Kuznetsov wrote:
> Concerns were expressed around APF delivery via synthetic #PF exception as
> in some cases such delivery may collide with real page fault. For type 2
> (page ready) notifications we can easily switch to using an interrupt
> instead. Introduce new MSR_KVM_ASYNC_PF_INT mechanism and deprecate the
> legacy one.
> 
> One notable difference between the two mechanisms is that interrupt may not
> get handled immediately so whenever we would like to deliver next event
> (regardless of its type) we must be sure the guest had read and cleared
> previous event in the slot.
> 
> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> ---
>  Documentation/virt/kvm/msr.rst       | 91 +++++++++++++++++---------
>  arch/x86/include/asm/kvm_host.h      |  4 +-
>  arch/x86/include/uapi/asm/kvm_para.h |  6 ++
>  arch/x86/kvm/x86.c                   | 95 ++++++++++++++++++++--------
>  4 files changed, 140 insertions(+), 56 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/msr.rst b/Documentation/virt/kvm/msr.rst
> index 33892036672d..f988a36f226a 100644
> --- a/Documentation/virt/kvm/msr.rst
> +++ b/Documentation/virt/kvm/msr.rst
> @@ -190,35 +190,54 @@ MSR_KVM_ASYNC_PF_EN:
>  	0x4b564d02
>  
>  data:
> -	Bits 63-6 hold 64-byte aligned physical address of a
> -	64 byte memory area which must be in guest RAM and must be
> -	zeroed. Bits 5-3 are reserved and should be zero. Bit 0 is 1
> -	when asynchronous page faults are enabled on the vcpu 0 when
> -	disabled. Bit 1 is 1 if asynchronous page faults can be injected
> -	when vcpu is in cpl == 0. Bit 2 is 1 if asynchronous page faults
> -	are delivered to L1 as #PF vmexits.  Bit 2 can be set only if
> -	KVM_FEATURE_ASYNC_PF_VMEXIT is present in CPUID.
> -
> -	First 4 byte of 64 byte memory location will be written to by
> -	the hypervisor at the time of asynchronous page fault (APF)
> -	injection to indicate type of asynchronous page fault. Value
> -	of 1 means that the page referred to by the page fault is not
> -	present. Value 2 means that the page is now available. Disabling
> -	interrupt inhibits APFs. Guest must not enable interrupt
> -	before the reason is read, or it may be overwritten by another
> -	APF. Since APF uses the same exception vector as regular page
> -	fault guest must reset the reason to 0 before it does
> -	something that can generate normal page fault.  If during page
> -	fault APF reason is 0 it means that this is regular page
> -	fault.
> -
> -	During delivery of type 1 APF cr2 contains a token that will
> -	be used to notify a guest when missing page becomes
> -	available. When page becomes available type 2 APF is sent with
> -	cr2 set to the token associated with the page. There is special
> -	kind of token 0xffffffff which tells vcpu that it should wake
> -	up all processes waiting for APFs and no individual type 2 APFs
> -	will be sent.
> +	Asynchronous page fault (APF) control MSR.
> +
> +	Bits 63-6 hold 64-byte aligned physical address of a 64 byte memory area
> +	which must be in guest RAM and must be zeroed. This memory is expected
> +	to hold a copy of the following structure::
> +
> +	  struct kvm_vcpu_pv_apf_data {
> +		__u32 reason;
> +		__u32 pageready_token;
> +		__u8 pad[56];
> +		__u32 enabled;
> +	  };
> +
> +	Bits 5-4 of the MSR are reserved and should be zero. Bit 0 is set to 1
> +	when asynchronous page faults are enabled on the vcpu, 0 when disabled.
> +	Bit 1 is 1 if asynchronous page faults can be injected when vcpu is in
> +	cpl == 0. Bit 2 is 1 if asynchronous page faults are delivered to L1 as
> +	#PF vmexits.  Bit 2 can be set only if KVM_FEATURE_ASYNC_PF_VMEXIT is
> +	present in CPUID. Bit 3 enables interrupt based delivery of type 2
> +	(page present) events.

Hi Vitaly,

"Bit 3 enables interrupt based delivery of type 2 events". So one has to
opt in to enable it. If this bit is 0, we will continue to deliver
page ready events using #PF? This probably will be needed to ensure
backward compatibility also.

> +
> +	First 4 byte of 64 byte memory location ('reason') will be written to
> +	by the hypervisor at the time APF type 1 (page not present) injection.
> +	The only possible values are '0' and '1'.

What do "reason" values "0" and "1" signify?

Previously this value could be 1 for PAGE_NOT_PRESENT and 2 for
PAGE_READY. So it looks like we took away the "PAGE_READY" reason because
it will be delivered using interrupts.

But that seems like an opt-in. If that's the case, then we should still
retain the PAGE_READY reason. If we are getting rid of page_ready delivery
via #PF, then interrupt based delivery should not be optional. What am I
missing?

Also, the previous text had the following line:

"Guest must not enable interrupt before the reason is read, or it may be
 overwritten by another APF".

So this is not a requirement anymore?

> Type 1 events are currently
> +	always delivered as synthetic #PF exception. During delivery of type 1
> +	APF CR2 register contains a token that will be used to notify the guest
> +	when missing page becomes available. Guest is supposed to write '0' to
> +	the location when it is done handling type 1 event so the next one can
> +	be delivered.
> +
> +	Note, since APF type 1 uses the same exception vector as regular page
> +	fault, guest must reset the reason to '0' before it does something that
> +	can generate normal page fault. If during a page fault APF reason is '0'
> +	it means that this is regular page fault.
> +
> +	Bytes 5-7 of 64 byte memory location ('pageready_token') will be written
> +	to by the hypervisor at the time of type 2 (page ready) event injection.
> +	The content of these bytes is a token which was previously delivered as
> +	type 1 event. The event indicates the page in now available. Guest is
> +	supposed to write '0' to the location when it is done handling type 2
> +	event so the next one can be delivered. MSR_KVM_ASYNC_PF_INT MSR
> +	specifying the interrupt vector for type 2 APF delivery needs to be
> +	written to before enabling APF mechanism in MSR_KVM_ASYNC_PF_EN.

What is the value of the "reason" field supposed to be for type 2 events?
I liked the previous values "KVM_PV_REASON_PAGE_READY" and
"KVM_PV_REASON_PAGE_NOT_PRESENT". The names themselves made it plenty
clear what they mean. They also allowed for easy extension, where this
protocol could be extended to deliver other "reasons", like errors.

So if we are using a common structure "kvm_vcpu_pv_apf_data" to deliver
type 1 and type 2 events, to me it makes sense to retain the existing
KVM_PV_REASON_PAGE_READY and KVM_PV_REASON_PAGE_NOT_PRESENT. Just that
in the new scheme of things, KVM_PV_REASON_PAGE_NOT_PRESENT will be
delivered using #PF (and later possibly using #VE) and
KVM_PV_REASON_PAGE_READY will be delivered using an interrupt.

> +
> +	Note, previously, type 2 (page present) events were delivered via the
> +	same #PF exception as type 1 (page not present) events but this is
> +	now deprecated.

> If bit 3 (interrupt based delivery) is not set APF events are not delivered.

So all the old guests which were getting async pf will suddenly find
that async pf does not work anymore (after a hypervisor update). And
some of them might report it as a performance issue (if there were any
performance benefits to be had with async pf).

[..]
>  
>  bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
>  {
> -	if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED))
> +	if (!kvm_pv_async_pf_enabled(vcpu))
>  		return true;

What does the above mean? If async pf is not enabled, then it returns true,
implying one can inject an async 'page present' event. But if async pf is
not enabled, there is no need to inject these events.

Thanks
Vivek
Vitaly Kuznetsov May 12, 2020, 3:50 p.m. UTC | #2
Vivek Goyal <vgoyal@redhat.com> writes:

> On Mon, May 11, 2020 at 06:47:48PM +0200, Vitaly Kuznetsov wrote:
>> Concerns were expressed around APF delivery via synthetic #PF exception as
>> in some cases such delivery may collide with real page fault. For type 2
>> (page ready) notifications we can easily switch to using an interrupt
>> instead. Introduce new MSR_KVM_ASYNC_PF_INT mechanism and deprecate the
>> legacy one.
>> 
>> One notable difference between the two mechanisms is that interrupt may not
>> get handled immediately so whenever we would like to deliver next event
>> (regardless of its type) we must be sure the guest had read and cleared
>> previous event in the slot.
>> 
>> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
>> ---
>>  Documentation/virt/kvm/msr.rst       | 91 +++++++++++++++++---------
>>  arch/x86/include/asm/kvm_host.h      |  4 +-
>>  arch/x86/include/uapi/asm/kvm_para.h |  6 ++
>>  arch/x86/kvm/x86.c                   | 95 ++++++++++++++++++++--------
>>  4 files changed, 140 insertions(+), 56 deletions(-)
>> 
>> diff --git a/Documentation/virt/kvm/msr.rst b/Documentation/virt/kvm/msr.rst
>> index 33892036672d..f988a36f226a 100644
>> --- a/Documentation/virt/kvm/msr.rst
>> +++ b/Documentation/virt/kvm/msr.rst
>> @@ -190,35 +190,54 @@ MSR_KVM_ASYNC_PF_EN:
>>  	0x4b564d02
>>  
>>  data:
>> -	Bits 63-6 hold 64-byte aligned physical address of a
>> -	64 byte memory area which must be in guest RAM and must be
>> -	zeroed. Bits 5-3 are reserved and should be zero. Bit 0 is 1
>> -	when asynchronous page faults are enabled on the vcpu 0 when
>> -	disabled. Bit 1 is 1 if asynchronous page faults can be injected
>> -	when vcpu is in cpl == 0. Bit 2 is 1 if asynchronous page faults
>> -	are delivered to L1 as #PF vmexits.  Bit 2 can be set only if
>> -	KVM_FEATURE_ASYNC_PF_VMEXIT is present in CPUID.
>> -
>> -	First 4 byte of 64 byte memory location will be written to by
>> -	the hypervisor at the time of asynchronous page fault (APF)
>> -	injection to indicate type of asynchronous page fault. Value
>> -	of 1 means that the page referred to by the page fault is not
>> -	present. Value 2 means that the page is now available. Disabling
>> -	interrupt inhibits APFs. Guest must not enable interrupt
>> -	before the reason is read, or it may be overwritten by another
>> -	APF. Since APF uses the same exception vector as regular page
>> -	fault guest must reset the reason to 0 before it does
>> -	something that can generate normal page fault.  If during page
>> -	fault APF reason is 0 it means that this is regular page
>> -	fault.
>> -
>> -	During delivery of type 1 APF cr2 contains a token that will
>> -	be used to notify a guest when missing page becomes
>> -	available. When page becomes available type 2 APF is sent with
>> -	cr2 set to the token associated with the page. There is special
>> -	kind of token 0xffffffff which tells vcpu that it should wake
>> -	up all processes waiting for APFs and no individual type 2 APFs
>> -	will be sent.
>> +	Asynchronous page fault (APF) control MSR.
>> +
>> +	Bits 63-6 hold 64-byte aligned physical address of a 64 byte memory area
>> +	which must be in guest RAM and must be zeroed. This memory is expected
>> +	to hold a copy of the following structure::
>> +
>> +	  struct kvm_vcpu_pv_apf_data {
>> +		__u32 reason;
>> +		__u32 pageready_token;
>> +		__u8 pad[56];
>> +		__u32 enabled;
>> +	  };
>> +
>> +	Bits 5-4 of the MSR are reserved and should be zero. Bit 0 is set to 1
>> +	when asynchronous page faults are enabled on the vcpu, 0 when disabled.
>> +	Bit 1 is 1 if asynchronous page faults can be injected when vcpu is in
>> +	cpl == 0. Bit 2 is 1 if asynchronous page faults are delivered to L1 as
>> +	#PF vmexits.  Bit 2 can be set only if KVM_FEATURE_ASYNC_PF_VMEXIT is
>> +	present in CPUID. Bit 3 enables interrupt based delivery of type 2
>> +	(page present) events.
>
> Hi Vitaly,
>
> "Bit 3 enables interrupt based delivery of type 2 events". So one has to
> opt in to enable it. If this bit is 0, we will continue to deliver
> page ready events using #PF? This probably will be needed to ensure
> backward compatibility also.
>

No, as Paolo suggested, we don't enable the mechanism at all if bit 3 is
0. Legacy (unaware) guests will think that they've enabled the mechanism
but it won't work; they won't see any APF notifications.
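
(The helper the diff introduces presumably boils down to requiring both bit 0
and bit 3, something along these lines; the exact body isn't quoted in this
thread, so treat this as a sketch of the semantics, with bit 3's name assumed:)

static bool kvm_pv_async_pf_enabled(struct kvm_vcpu *vcpu)
{
	/* APF only counts as enabled when bit 0 and bit 3 are both set. */
	u64 mask = KVM_ASYNC_PF_ENABLED | KVM_ASYNC_PF_DELIVERY_AS_INT;

	return (vcpu->arch.apf.msr_val & mask) == mask;
}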

>> +
>> +	First 4 byte of 64 byte memory location ('reason') will be written to
>> +	by the hypervisor at the time APF type 1 (page not present) injection.
>> +	The only possible values are '0' and '1'.
>
> What do "reason" values "0" and "1" signify?
>
> Previously this value could be 1 for PAGE_NOT_PRESENT and 2 for
> PAGE_READY. So looks like we took away reason "PAGE_READY" because it will
> be delivered using interrupts.
>
> But that seems like an opt in. If that's the case, then we should still
> retain PAGE_READY reason. If we are getting rid of page_ready using
> #PF, then interrupt based deliver should not be optional. What am I
> missing.

It is not optional now :-)

>
> Also previous text had following line.
>
> "Guest must not enable interrupt before the reason is read, or it may be
>  overwritten by another APF".
>
> So this is not a requirement anymore?
>

It still stands for type 1 (page not present) events.

>> Type 1 events are currently
>> +	always delivered as synthetic #PF exception. During delivery of type 1
>> +	APF CR2 register contains a token that will be used to notify the guest
>> +	when missing page becomes available. Guest is supposed to write '0' to
>> +	the location when it is done handling type 1 event so the next one can
>> +	be delivered.
>> +
>> +	Note, since APF type 1 uses the same exception vector as regular page
>> +	fault, guest must reset the reason to '0' before it does something that
>> +	can generate normal page fault. If during a page fault APF reason is '0'
>> +	it means that this is regular page fault.
>> +
>> +	Bytes 5-7 of 64 byte memory location ('pageready_token') will be written
>> +	to by the hypervisor at the time of type 2 (page ready) event injection.
>> +	The content of these bytes is a token which was previously delivered as
>> +	type 1 event. The event indicates the page in now available. Guest is
>> +	supposed to write '0' to the location when it is done handling type 2
>> +	event so the next one can be delivered. MSR_KVM_ASYNC_PF_INT MSR
>> +	specifying the interrupt vector for type 2 APF delivery needs to be
>> +	written to before enabling APF mechanism in MSR_KVM_ASYNC_PF_EN.
>
> What is supposed to be value of "reason" field for type2 events. I
> had liked previous values "KVM_PV_REASON_PAGE_READY" and
> "KVM_PV_REASON_PAGE_NOT_PRESENT". Name itself made it plenty clear, what
> it means. Also it allowed for easy extension where this protocol could
> be extended to deliver other "reasons", like error.
>
> So if we are using a common structure "kvm_vcpu_pv_apf_data" to deliver
> type1 and type2 events, to me it makes sense to retain existing
> KVM_PV_REASON_PAGE_READY and KVM_PV_REASON_PAGE_NOT_PRESENT. Just that
> in new scheme of things, KVM_PV_REASON_PAGE_NOT_PRESENT will be delivered
> using #PF (and later possibly using #VE) and KVM_PV_REASON_PAGE_READY
> will be delivered using interrupt.

We use different fields for page-not-present and page-ready events so
there is no intersection. If we start setting KVM_PV_REASON_PAGE_READY
to 'reason' we may accidentally destroy a 'page-not-present' event.

With this patchset we have two completely separate channels:
1) Page-not-present goes through #PF and 'reason' in struct
kvm_vcpu_pv_apf_data.
2) Page-ready goes through interrupt and 'pageready_token' in the same
kvm_vcpu_pv_apf_data.
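
A rough guest-side view of the two channels, with made-up handler and helper
names; the ack-by-writing-zero behaviour and the CR2 token are taken from the
documentation above, and 'apf_data' stands for the guest's registered copy of
struct kvm_vcpu_pv_apf_data:

/* Channel 1: synthetic #PF, 'page not present'; the token arrives in CR2. */
static bool guest_handle_apf_notpresent(u32 token)
{
	if (!__this_cpu_read(apf_data.reason))
		return false;				/* regular #PF, not an APF */

	__this_cpu_write(apf_data.reason, 0);		/* ack: next event may be written */
	apf_task_wait(current, token);			/* freeze until 'page ready' */
	return true;
}

/* Channel 2: interrupt, 'page ready'; the token arrives in the shared slot. */
static void guest_handle_apf_pageready(void)
{
	u32 token = __this_cpu_read(apf_data.pageready_token);

	apf_task_wake(token);				/* wake the task(s) waiting on this token */
	__this_cpu_write(apf_data.pageready_token, 0);	/* ack */
}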

>
>> +
>> +	Note, previously, type 2 (page present) events were delivered via the
>> +	same #PF exception as type 1 (page not present) events but this is
>> +	now deprecated.
>
>> If bit 3 (interrupt based delivery) is not set APF events are not delivered.
>
> So all the old guests which were getting async pf will suddenly find
> that async pf does not work anymore (after hypervisor update). And
> some of them might report it as performance issue (if there were any
> performance benefits to be had with async pf).

We still do APF_HALT, but generally yes, there might be some performance
implications. My RFC was preserving the #PF path but the suggestion was to
retire it completely (and I kinda like that because it makes a lot of
code go away).

>
> [..]
>>  
>>  bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
>>  {
>> -	if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED))
>> +	if (!kvm_pv_async_pf_enabled(vcpu))
>>  		return true;
>
> What does above mean. If async pf is not enabled, then it returns true,
> implying one can inject async page present. But if async pf is not
> enabled, there is no need to inject these events.

AFAIU this is a protection against the guest suddenly disabling the APF
mechanism. What do we do with all the 'page ready' events afterwards? We
can't deliver them anymore, so we just eat them (hoping the guest will
unfreeze all processes on its own before disabling the mechanism).

It is the existing logic; my patch doesn't change it.

Thanks for the review!
Vivek Goyal May 12, 2020, 6:07 p.m. UTC | #3
On Tue, May 12, 2020 at 05:50:53PM +0200, Vitaly Kuznetsov wrote:
> Vivek Goyal <vgoyal@redhat.com> writes:
> 
> > On Mon, May 11, 2020 at 06:47:48PM +0200, Vitaly Kuznetsov wrote:
> >> Concerns were expressed around APF delivery via synthetic #PF exception as
> >> in some cases such delivery may collide with real page fault. For type 2
> >> (page ready) notifications we can easily switch to using an interrupt
> >> instead. Introduce new MSR_KVM_ASYNC_PF_INT mechanism and deprecate the
> >> legacy one.
> >> 
> >> One notable difference between the two mechanisms is that interrupt may not
> >> get handled immediately so whenever we would like to deliver next event
> >> (regardless of its type) we must be sure the guest had read and cleared
> >> previous event in the slot.
> >> 
> >> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> >> ---
> >>  Documentation/virt/kvm/msr.rst       | 91 +++++++++++++++++---------
> >>  arch/x86/include/asm/kvm_host.h      |  4 +-
> >>  arch/x86/include/uapi/asm/kvm_para.h |  6 ++
> >>  arch/x86/kvm/x86.c                   | 95 ++++++++++++++++++++--------
> >>  4 files changed, 140 insertions(+), 56 deletions(-)
> >> 
> >> diff --git a/Documentation/virt/kvm/msr.rst b/Documentation/virt/kvm/msr.rst
> >> index 33892036672d..f988a36f226a 100644
> >> --- a/Documentation/virt/kvm/msr.rst
> >> +++ b/Documentation/virt/kvm/msr.rst
> >> @@ -190,35 +190,54 @@ MSR_KVM_ASYNC_PF_EN:
> >>  	0x4b564d02
> >>  
> >>  data:
> >> -	Bits 63-6 hold 64-byte aligned physical address of a
> >> -	64 byte memory area which must be in guest RAM and must be
> >> -	zeroed. Bits 5-3 are reserved and should be zero. Bit 0 is 1
> >> -	when asynchronous page faults are enabled on the vcpu 0 when
> >> -	disabled. Bit 1 is 1 if asynchronous page faults can be injected
> >> -	when vcpu is in cpl == 0. Bit 2 is 1 if asynchronous page faults
> >> -	are delivered to L1 as #PF vmexits.  Bit 2 can be set only if
> >> -	KVM_FEATURE_ASYNC_PF_VMEXIT is present in CPUID.
> >> -
> >> -	First 4 byte of 64 byte memory location will be written to by
> >> -	the hypervisor at the time of asynchronous page fault (APF)
> >> -	injection to indicate type of asynchronous page fault. Value
> >> -	of 1 means that the page referred to by the page fault is not
> >> -	present. Value 2 means that the page is now available. Disabling
> >> -	interrupt inhibits APFs. Guest must not enable interrupt
> >> -	before the reason is read, or it may be overwritten by another
> >> -	APF. Since APF uses the same exception vector as regular page
> >> -	fault guest must reset the reason to 0 before it does
> >> -	something that can generate normal page fault.  If during page
> >> -	fault APF reason is 0 it means that this is regular page
> >> -	fault.
> >> -
> >> -	During delivery of type 1 APF cr2 contains a token that will
> >> -	be used to notify a guest when missing page becomes
> >> -	available. When page becomes available type 2 APF is sent with
> >> -	cr2 set to the token associated with the page. There is special
> >> -	kind of token 0xffffffff which tells vcpu that it should wake
> >> -	up all processes waiting for APFs and no individual type 2 APFs
> >> -	will be sent.
> >> +	Asynchronous page fault (APF) control MSR.
> >> +
> >> +	Bits 63-6 hold 64-byte aligned physical address of a 64 byte memory area
> >> +	which must be in guest RAM and must be zeroed. This memory is expected
> >> +	to hold a copy of the following structure::
> >> +
> >> +	  struct kvm_vcpu_pv_apf_data {
> >> +		__u32 reason;
> >> +		__u32 pageready_token;
> >> +		__u8 pad[56];
> >> +		__u32 enabled;
> >> +	  };
> >> +
> >> +	Bits 5-4 of the MSR are reserved and should be zero. Bit 0 is set to 1
> >> +	when asynchronous page faults are enabled on the vcpu, 0 when disabled.
> >> +	Bit 1 is 1 if asynchronous page faults can be injected when vcpu is in
> >> +	cpl == 0. Bit 2 is 1 if asynchronous page faults are delivered to L1 as
> >> +	#PF vmexits.  Bit 2 can be set only if KVM_FEATURE_ASYNC_PF_VMEXIT is
> >> +	present in CPUID. Bit 3 enables interrupt based delivery of type 2
> >> +	(page present) events.
> >
> > Hi Vitaly,
> >
> > "Bit 3 enables interrupt based delivery of type 2 events". So one has to
> > opt in to enable it. If this bit is 0, we will continue to deliver
> > page ready events using #PF? This probably will be needed to ensure
> > backward compatibility also.
> >
> 
> No, as Paolo suggested we don't enable the mechanism at all if bit3 is
> 0. Legacy (unaware) guests will think that they've enabled the mechanism
> but it won't work, they won't see any APF notifications.
> 
> >> +
> >> +	First 4 byte of 64 byte memory location ('reason') will be written to
> >> +	by the hypervisor at the time APF type 1 (page not present) injection.
> >> +	The only possible values are '0' and '1'.
> >
> > What do "reason" values "0" and "1" signify?
> >
> > Previously this value could be 1 for PAGE_NOT_PRESENT and 2 for
> > PAGE_READY. So looks like we took away reason "PAGE_READY" because it will
> > be delivered using interrupts.
> >
> > But that seems like an opt in. If that's the case, then we should still
> > retain PAGE_READY reason. If we are getting rid of page_ready using
> > #PF, then interrupt based deliver should not be optional. What am I
> > missing.
> 
> It is not optional now :-)
> 
> >
> > Also previous text had following line.
> >
> > "Guest must not enable interrupt before the reason is read, or it may be
> >  overwritten by another APF".
> >
> > So this is not a requirement anymore?
> >
> 
> It still stands for type 1 (page not present) events.
> 
> >> Type 1 events are currently
> >> +	always delivered as synthetic #PF exception. During delivery of type 1
> >> +	APF CR2 register contains a token that will be used to notify the guest
> >> +	when missing page becomes available. Guest is supposed to write '0' to
> >> +	the location when it is done handling type 1 event so the next one can
> >> +	be delivered.
> >> +
> >> +	Note, since APF type 1 uses the same exception vector as regular page
> >> +	fault, guest must reset the reason to '0' before it does something that
> >> +	can generate normal page fault. If during a page fault APF reason is '0'
> >> +	it means that this is regular page fault.
> >> +
> >> +	Bytes 5-7 of 64 byte memory location ('pageready_token') will be written
> >> +	to by the hypervisor at the time of type 2 (page ready) event injection.
> >> +	The content of these bytes is a token which was previously delivered as
> >> +	type 1 event. The event indicates the page in now available. Guest is
> >> +	supposed to write '0' to the location when it is done handling type 2
> >> +	event so the next one can be delivered. MSR_KVM_ASYNC_PF_INT MSR
> >> +	specifying the interrupt vector for type 2 APF delivery needs to be
> >> +	written to before enabling APF mechanism in MSR_KVM_ASYNC_PF_EN.
> >
> > What is supposed to be value of "reason" field for type2 events. I
> > had liked previous values "KVM_PV_REASON_PAGE_READY" and
> > "KVM_PV_REASON_PAGE_NOT_PRESENT". Name itself made it plenty clear, what
> > it means. Also it allowed for easy extension where this protocol could
> > be extended to deliver other "reasons", like error.
> >
> > So if we are using a common structure "kvm_vcpu_pv_apf_data" to deliver
> > type1 and type2 events, to me it makes sense to retain existing
> > KVM_PV_REASON_PAGE_READY and KVM_PV_REASON_PAGE_NOT_PRESENT. Just that
> > in new scheme of things, KVM_PV_REASON_PAGE_NOT_PRESENT will be delivered
> > using #PF (and later possibly using #VE) and KVM_PV_REASON_PAGE_READY
> > will be delivered using interrupt.
> 
> We use different fields for page-not-present and page-ready events so
> there is no intersection. If we start setting KVM_PV_REASON_PAGE_READY
> to 'reason' we may accidentally destroy a 'page-not-present' event.

This is confusing. So you mean that at one point in time we might be using
the same shared data structure for two events.

- ->reason will be set to 1 and you will inject the page_not_present
  exception.

- If some page gets ready, you will now set ->token and queue the
  page ready exception.

It's very confusing. Can't we serialize the delivery of these events, so
that only one is in progress and this structure is used by one event
at a time?

Also, how do I extend it now to do error delivery? Please keep that in
mind. We don't want to be redesigning this stuff again. It's already
very complicated.

I really need the ->reason field to be usable in both paths so that an
error can be delivered.

And this notion of the same structure being shared across multiple events
at the same time is just going to create more confusion, IMHO. If we
can decouple it by serializing it, that definitely feels simpler to
understand.

> 
> With this patchset we have two completely separate channels:
> 1) Page-not-present goes through #PF and 'reason' in struct
> kvm_vcpu_pv_apf_data.
> 2) Page-ready goes through interrupt and 'pageready_token' in the same
> kvm_vcpu_pv_apf_data.
> 
> >
> >> +
> >> +	Note, previously, type 2 (page present) events were delivered via the
> >> +	same #PF exception as type 1 (page not present) events but this is
> >> +	now deprecated.
> >
> >> If bit 3 (interrupt based delivery) is not set APF events are not delivered.
> >
> > So all the old guests which were getting async pf will suddenly find
> > that async pf does not work anymore (after hypervisor update). And
> > some of them might report it as performance issue (if there were any
> > performance benefits to be had with async pf).
> 
> We still do APF_HALT but generally yes, there might be some performance
> implications. My RFC was preserving #PF path but the suggestion was to
> retire it completely. (and I kinda like it because it makes a lot of
> code go away)

Ok. I don't have a strong opinion here. If Paolo likes it this way, so be
it. :-)

> 
> >
> > [..]
> >>  
> >>  bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
> >>  {
> >> -	if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED))
> >> +	if (!kvm_pv_async_pf_enabled(vcpu))
> >>  		return true;
> >
> > What does above mean. If async pf is not enabled, then it returns true,
> > implying one can inject async page present. But if async pf is not
> > enabled, there is no need to inject these events.
> 
> AFAIU this is a protection agains guest suddenly disabling APF
> mechanism.

Can we provide that protection in the MSR implementation? That is, once APF
is enabled, it can't be disabled. Or is it a feature that we allow the
guest to disable APF, and we want it that way?

> What do we do with all the 'page ready' events after, we
> can't deliver them anymore. So we just eat them (hoping guest will
> unfreeze all processes on its own before disabling the mechanism).
> 
> It is the existing logic, my patch doesn't change it.

I see it's the existing logic. It is just very confusing, and it would be
good if we can at least explain it with some comments.

I don't know what to make of this.

bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
{
        if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED))
                return true;
        else
                return kvm_can_do_async_pf(vcpu);
}

If the feature is disabled, then we do inject the async pf 'page present'
event. If the feature is enabled, we check whether we can inject the async
pf right now or not.

It would probably help if this check for the feature being enabled/disabled
were outside kvm_arch_can_inject_async_page_present(), at its callsite,
and there we could explain why it is important to inject 'page ready'
events despite the fact that the feature is disabled.
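
For illustration, the restructuring suggested above might end up looking
roughly like this at the call site (kvm_arch_async_page_present() and 'work'
are the existing names around that call; the predicate is just the current
helper inlined, and the comment wording is only a sketch):

	/*
	 * 'Page ready' events are flushed out even if the guest has disabled
	 * APF in the meantime: the host cannot keep them pending forever, so
	 * they are consumed rather than queued, on the assumption that the
	 * guest woke up its frozen tasks before turning the feature off.
	 */
	if (!kvm_pv_async_pf_enabled(vcpu) || kvm_can_do_async_pf(vcpu))
		kvm_arch_async_page_present(vcpu, work);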

Thanks
Vivek
Vitaly Kuznetsov May 13, 2020, 9:03 a.m. UTC | #4
Vivek Goyal <vgoyal@redhat.com> writes:

> On Tue, May 12, 2020 at 05:50:53PM +0200, Vitaly Kuznetsov wrote:
>> Vivek Goyal <vgoyal@redhat.com> writes:
>> 
>> >
>> > So if we are using a common structure "kvm_vcpu_pv_apf_data" to deliver
>> > type1 and type2 events, to me it makes sense to retain existing
>> > KVM_PV_REASON_PAGE_READY and KVM_PV_REASON_PAGE_NOT_PRESENT. Just that
>> > in new scheme of things, KVM_PV_REASON_PAGE_NOT_PRESENT will be delivered
>> > using #PF (and later possibly using #VE) and KVM_PV_REASON_PAGE_READY
>> > will be delivered using interrupt.
>> 
>> We use different fields for page-not-present and page-ready events so
>> there is no intersection. If we start setting KVM_PV_REASON_PAGE_READY
>> to 'reason' we may accidentally destroy a 'page-not-present' event.
>
> This is confusing. So you mean at one point of time we might be using
> same shared data structure for two events.
>
> - ->reason will be set to 1 and you will inject page_not_present
>   execption.
>
> - If some page gets ready, you will now set ->token and queue 
>   page ready exception. 
>
> Its very confusing. Can't we serialize the delivery of these events. So
> that only one is in progress so that this structure is used by one event
> at a time.

This is not how the mechanism (currently) works:
- A process accesses a page which is swapped out.

- We deliver a synchronous APF (#PF) to the guest; it freezes the process
and switches to another one.

- Another process accesses a swapped-out page, APF is delivered and it
also gets frozen.

...

- At some point one of the previously unavailable pages becomes available
(not necessarily the last or the first one) and we deliver this info via
an asynchronous APF (now an interrupt).

- Note, after we deliver the interrupt and before it is actually
consumed we may have another synchronous APF (#PF) event.

So we really need separate memory locations for synchronous (type-1,
#PF, ...) and asynchronous (type-2, interrupt, ...) data.

The naming is unfortunate and misleading, I agree. What is currently
named 'reason' should be something like 'apf_flag_for_pf'; it is just
there to distinguish a real #PF from an APF. This is going away in the
future, as we won't be abusing #PF anymore, so I'd keep it as it is now,
maybe add another comment somewhere. The other location is
'pageready_token' and it actually contains the token. This is here to
stay long term, so any suggestions for better naming are welcome.

We could've separated these two memory locations completely and
e.g. used the remaining 56 bits of MSR_KVM_ASYNC_PF_INT as the new
location information. Maybe we should do that just to avoid the
confusion.

>
> Also how do I extend it now to do error delivery. Please keep that in
> mind. We don't want to be redesigning this stuff again. Its already
> very complicated.
>
> I really need ->reason field to be usable in both the paths so that
> error can be delivered.

If we want to use 'reason' for both, we'll get into a weird scenario where
an exception is blocking an interrupt and, what's more confusing, vice
versa. I'd like to avoid this complexity in the KVM code. My suggestion
would be to rename 'reason' to something like 'pf_abuse_flag' so it
doesn't cause the confusion, and add a new 'reason' after 'token'.

>
> And this notion of same structure being shared across multiple events
> at the same time is just going to create more confusion, IMHO. If we
> can decouple it by serializing it, that definitely feels simpler to
> understand.

What if we just add sub-structures to the structure, e.g. 

struct kvm_vcpu_pv_apf_data {
        struct {
            __u32 apf_flag;
        } legacy_apf_data;
        struct {
            __u32 token;
        } apf_interrupt_data;
        ....
        __u8 pad[56];
        __u32 enabled;
};

would it make it more obvious?

>
>> 
>> With this patchset we have two completely separate channels:
>> 1) Page-not-present goes through #PF and 'reason' in struct
>> kvm_vcpu_pv_apf_data.
>> 2) Page-ready goes through interrupt and 'pageready_token' in the same
>> kvm_vcpu_pv_apf_data.
>> 
>> >
>> >> +
>> >> +	Note, previously, type 2 (page present) events were delivered via the
>> >> +	same #PF exception as type 1 (page not present) events but this is
>> >> +	now deprecated.
>> >
>> >> If bit 3 (interrupt based delivery) is not set APF events are not delivered.
>> >
>> > So all the old guests which were getting async pf will suddenly find
>> > that async pf does not work anymore (after hypervisor update). And
>> > some of them might report it as performance issue (if there were any
>> > performance benefits to be had with async pf).
>> 
>> We still do APF_HALT but generally yes, there might be some performance
>> implications. My RFC was preserving #PF path but the suggestion was to
>> retire it completely. (and I kinda like it because it makes a lot of
>> code go away)
>
> Ok. I don't have strong opinion here. If paolo likes it this way, so be
> it. :-)

APF is a slowpath for overcommitted scenarios, and when we switch to
APF_HALT we allow the host to run some other guest while the PF is
processed. This is not the same from the guest's perspective, but from the
host's we're fine as we're not wasting cycles.

>
>> 
>> >
>> > [..]
>> >>  
>> >>  bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
>> >>  {
>> >> -	if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED))
>> >> +	if (!kvm_pv_async_pf_enabled(vcpu))
>> >>  		return true;
>> >
>> > What does above mean. If async pf is not enabled, then it returns true,
>> > implying one can inject async page present. But if async pf is not
>> > enabled, there is no need to inject these events.
>> 
>> AFAIU this is a protection agains guest suddenly disabling APF
>> mechanism.
>
> Can we provide that protection in MSR implementation. That is once APF
> is enabled, it can't be disabled. Or it is a feature that we allow
> guest to disable APF and want it that way?

We need to allow disabling the feature. Think about the kdump scenario,
for example. Before we switch to the kdump kernel we need to make sure
there's no more 'magic' memory which can suddenly change. Also, the kdump
kernel may not even support APF, so it will get very confused when
APF events get delivered.
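
(For context, the guest side already tears APF down on reboot/crash simply by
clearing the MSR; this is roughly what arch/x86/kernel/kvm.c does today,
trimmed down:)

static void kvm_pv_disable_apf(void)
{
	if (!__this_cpu_read(apf_reason.enabled))
		return;

	/* Writing 0 disables APF; the host stops injecting events for this vCPU. */
	wrmsrl(MSR_KVM_ASYNC_PF_EN, 0);
	__this_cpu_write(apf_reason.enabled, 0);
}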

>
>> What do we do with all the 'page ready' events after, we
>> can't deliver them anymore. So we just eat them (hoping guest will
>> unfreeze all processes on its own before disabling the mechanism).
>> 
>> It is the existing logic, my patch doesn't change it.
>
> I see its existing logic. Just it is very confusing and will be good
> if we can atleast explain it with some comments.
>
> I don't know what to make out of this.
>
> bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
> {
>         if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED))
>                 return true;
>         else
>                 return kvm_can_do_async_pf(vcpu);
> }
>
> If feature is disabled, then do inject async pf page present. If feature
> is enabled and check whether we can inject async pf right now or not.
>
> It probably will help if this check if feature being enabled/disabled
> is outside kvm_arch_can_inject_async_page_present() at the callsite
> of kvm_arch_can_inject_async_page_present() and there we explain that
> why it is important to inject page ready events despite the fact
> that feature is disabled.

This code would definitely love some comments added; I will do that in the
next version. And I'll also think about how to improve the readability.
Thanks for the feedback!
Vivek Goyal May 13, 2020, 1:53 p.m. UTC | #5
On Wed, May 13, 2020 at 11:03:48AM +0200, Vitaly Kuznetsov wrote:
> Vivek Goyal <vgoyal@redhat.com> writes:
> 
> > On Tue, May 12, 2020 at 05:50:53PM +0200, Vitaly Kuznetsov wrote:
> >> Vivek Goyal <vgoyal@redhat.com> writes:
> >> 
> >> >
> >> > So if we are using a common structure "kvm_vcpu_pv_apf_data" to deliver
> >> > type1 and type2 events, to me it makes sense to retain existing
> >> > KVM_PV_REASON_PAGE_READY and KVM_PV_REASON_PAGE_NOT_PRESENT. Just that
> >> > in new scheme of things, KVM_PV_REASON_PAGE_NOT_PRESENT will be delivered
> >> > using #PF (and later possibly using #VE) and KVM_PV_REASON_PAGE_READY
> >> > will be delivered using interrupt.
> >> 
> >> We use different fields for page-not-present and page-ready events so
> >> there is no intersection. If we start setting KVM_PV_REASON_PAGE_READY
> >> to 'reason' we may accidentally destroy a 'page-not-present' event.
> >
> > This is confusing. So you mean at one point of time we might be using
> > same shared data structure for two events.
> >
> > - ->reason will be set to 1 and you will inject page_not_present
> >   execption.
> >
> > - If some page gets ready, you will now set ->token and queue 
> >   page ready exception. 
> >
> > Its very confusing. Can't we serialize the delivery of these events. So
> > that only one is in progress so that this structure is used by one event
> > at a time.
> 
> This is not how the mechanism (currently) works:
> - A process accesses a page which is swapped out
> 
> - We deliver synchronious APF (#PF) to the guest, it freezes the process
> and switches to another one.
> 
> - Another process accesses a swapped out page, APF is delivered and it
> also got frozen
> 
> ...
> 
> - At some point one of the previously unavailable pages become available
> (not necessarily the last or the first one) and we deliver this info via
> asynchronous APF (now interrupt).
> 
> - Note, after we deliver the interrupt and before it is actually
> consumed we may have another synchronous APF (#PF) event.
> 
> So we really need to separate memory locations for synchronous (type-1,
> #PF,...) and asynchronous (type-2, interrupt, ...) data.
> 
> The naming is unfortunate and misleading, I agree. What is currently
> named 'reason' should be something like 'apf_flag_for_pf' and it just
> means to distinguish real #PF from APF. This is going away in the
> future, we won't be abusing #PF anymore so I'd keep it as it is now,
> maybe add another comment somewhere.

Shouldn't we do the #VE changes at the same time as well? Otherwise we are
again creating this intermediate mode where synchronous notifications
happen using #PF. Once we move to #VE, won't old guests again stop
getting async #PF?

Also, there is the issue of who is using this common shared area
between host and guest (struct kvm_vcpu_pv_apf_data) and how to
use fields in this struct without breaking existing guests.

For now, I see you have modified the meaning of fields of this structure
(->reason) because you are not concerned about older guests, since they
will not receive async pf to begin with. If that's the case, even renaming
'reason' makes sense. Having said that, as we are planning to not use this
data structure for synchronous notifications, creating this intermediate
mode is just going to create more confusion. So in the new ABI we should
not even try to make use of ->reason, and we should design the #VE changes
at the same time. The core problem with the current ABI is racy access to
->reason, and this patch set does not get rid of that race anyway.

> The other location is
> 'pageready_token' and it actually contains the token. This is to stay
> long term so any suggestions for better naming are welcome.

pageready_token makes sense if this structure is being used for both
types of notifications; otherwise just ->token might be enough.

> 
> We could've separated these two memory locations completely and
> e.g. used the remaining 56 bits of MSR_KVM_ASYNC_PF_INT as the new
> location information. Maybe we should do that just to avoid the
> confusion.

If this is V2 of the ABI, why not do a "struct kvm_vcpu_pv_apf_data_v2"
instead? The only reason you seem to be sticking to the existing structure
and the ->reason field is the #PF based sync notifications, which are
planned to go away soon. So if we do both changes together, there is no
need to keep this confusion w.r.t. ->reason and the existing structure.
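
Purely illustrative, not from the patches: one possible 'v2' layout following
the suggestions in this thread, with a token slot for interrupt based delivery
plus a generic 'reason' that future notification types (e.g. errors) could use:

struct kvm_vcpu_pv_apf_data_v2 {
	__u32 token;	/* 'page ready' token, delivered via interrupt */
	__u32 reason;	/* reserved for future event types, e.g. errors */
	__u8 pad[56];
	__u32 enabled;
};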

> 
> >
> > Also how do I extend it now to do error delivery. Please keep that in
> > mind. We don't want to be redesigning this stuff again. Its already
> > very complicated.
> >
> > I really need ->reason field to be usable in both the paths so that
> > error can be delivered.
> 
> If we want to use 'reason' for both we'll get into a weird scenario when
> exception is blocking interrupt and, what's more confusing, vice
> versa. I'd like to avoid this complexity in KVM code. My suggestion
> would be to rename 'reason' to something like 'pf_abuse_flag' so it
> doesn't cause the confusion and add new 'reason' after 'token'.
> 
> >
> > And this notion of same structure being shared across multiple events
> > at the same time is just going to create more confusion, IMHO. If we
> > can decouple it by serializing it, that definitely feels simpler to
> > understand.
> 
> What if we just add sub-structures to the structure, e.g. 
> 
> struct kvm_vcpu_pv_apf_data {
>         struct {
>             __u32 apf_flag;
>         } legacy_apf_data;
>         struct {
>             __u32 token;
>         } apf_interrupt_data;
>         ....
>         __u8 pad[56];
>         __u32 enabled;
> };    
> 
> would it make it more obvious?

To me this seems like an improvement. Just looking at it, it is clear
to me who is using what. But soon legacy_apf_data will be unused. It's
not clear to me when we will be able to get rid of apf_flag and reuse it
for something else without having to worry about older guests.

This would be the second best option, if for some reason we don't want to
do the #VE changes at the same time. To make the new design cleaner and
more understandable, doing the #VE changes at the same time would be good.

> 
> >
> >> 
> >> With this patchset we have two completely separate channels:
> >> 1) Page-not-present goes through #PF and 'reason' in struct
> >> kvm_vcpu_pv_apf_data.
> >> 2) Page-ready goes through interrupt and 'pageready_token' in the same
> >> kvm_vcpu_pv_apf_data.
> >> 
> >> >
> >> >> +
> >> >> +	Note, previously, type 2 (page present) events were delivered via the
> >> >> +	same #PF exception as type 1 (page not present) events but this is
> >> >> +	now deprecated.
> >> >
> >> >> If bit 3 (interrupt based delivery) is not set APF events are not delivered.
> >> >
> >> > So all the old guests which were getting async pf will suddenly find
> >> > that async pf does not work anymore (after hypervisor update). And
> >> > some of them might report it as performance issue (if there were any
> >> > performance benefits to be had with async pf).
> >> 
> >> We still do APF_HALT but generally yes, there might be some performance
> >> implications. My RFC was preserving #PF path but the suggestion was to
> >> retire it completely. (and I kinda like it because it makes a lot of
> >> code go away)
> >
> > Ok. I don't have strong opinion here. If paolo likes it this way, so be
> > it. :-)
> 
> APF is a slowpath for overcommited scenarios and when we switch to
> APF_HALT we allow the host to run some other guest while PF is
> processed. This is not the same from guest's perspective but from host's
> we're fine as we're not wasting cycles.
> 
> >
> >> 
> >> >
> >> > [..]
> >> >>  
> >> >>  bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
> >> >>  {
> >> >> -	if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED))
> >> >> +	if (!kvm_pv_async_pf_enabled(vcpu))
> >> >>  		return true;
> >> >
> >> > What does above mean. If async pf is not enabled, then it returns true,
> >> > implying one can inject async page present. But if async pf is not
> >> > enabled, there is no need to inject these events.
> >> 
> >> AFAIU this is a protection agains guest suddenly disabling APF
> >> mechanism.
> >
> > Can we provide that protection in MSR implementation. That is once APF
> > is enabled, it can't be disabled. Or it is a feature that we allow
> > guest to disable APF and want it that way?
> 
> We need to allow to disable the feature. E.g. think about kdump
> scenario, for example. Before we switch to kdump kernel we need to make
> sure there's no more 'magic' memory which can suggenly change.

So are we waiting to receive all page ready events before we switch
to the kdump kernel? If yes, that's not good. The existing kernel is
corrupted and nothing meaningful can be done, so we should switch to the
next kernel asap. The previous kernel is dying anyway, so we don't care
about receiving page ready events.

> Also,
> kdump kernel may not even support APF so it will get very confused when
> APF events get delivered.

The new kernel can just ignore these events if it does not support async
pf?

This is somewhat similar to devices still raising interrupts in the new
kernel, and the solution for that seemed to be doing a "reset" of the
devices in the new kernel. We probably need similar logic where the new
kernel simply disables "async pf" so that we don't get new notifications.

IOW, waiting in the crashed kernel for all page ready events does not
sound like a good option. It reduces the reliability of the kdump
operation.

> 
> >
> >> What do we do with all the 'page ready' events after, we
> >> can't deliver them anymore. So we just eat them (hoping guest will
> >> unfreeze all processes on its own before disabling the mechanism).
> >> 
> >> It is the existing logic, my patch doesn't change it.
> >
> > I see its existing logic. Just it is very confusing and will be good
> > if we can atleast explain it with some comments.
> >
> > I don't know what to make out of this.
> >
> > bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
> > {
> >         if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED))
> >                 return true;
> >         else
> >                 return kvm_can_do_async_pf(vcpu);
> > }
> >
> > If feature is disabled, then do inject async pf page present. If feature
> > is enabled and check whether we can inject async pf right now or not.
> >
> > It probably will help if this check if feature being enabled/disabled
> > is outside kvm_arch_can_inject_async_page_present() at the callsite
> > of kvm_arch_can_inject_async_page_present() and there we explain that
> > why it is important to inject page ready events despite the fact
> > that feature is disabled.
> 
> This code would definitely love some comments added, will do in the next
> version. And I'll also think how to improve the readability, thanks for
> the feedback!

It definitely needs a comment (if it makes sense to still send page
ready events to the guest).

Thanks
Vivek
Vivek Goyal May 13, 2020, 2:03 p.m. UTC | #6
On Wed, May 13, 2020 at 09:53:50AM -0400, Vivek Goyal wrote:

[..]
> > > And this notion of same structure being shared across multiple events
> > > at the same time is just going to create more confusion, IMHO. If we
> > > can decouple it by serializing it, that definitely feels simpler to
> > > understand.
> > 
> > What if we just add sub-structures to the structure, e.g. 
> > 
> > struct kvm_vcpu_pv_apf_data {
> >         struct {
> >             __u32 apf_flag;
> >         } legacy_apf_data;
> >         struct {
> >             __u32 token;
> >         } apf_interrupt_data;
> >         ....
> >         __u8 pad[56];
> >         __u32 enabled;
> > };    
> > 
> > would it make it more obvious?

On second thought, given we are not planning to use
this structure for synchronous events anymore, I think defining the
sub-structures might be overkill. Maybe a simple comment will do.

struct kvm_vcpu_pv_apf_data {
	/* Used by page fault based page not present notifications. Soon
	 * it will be legacy
	 */
	__u32 apf_flag;
	/* Used for interrupt based page ready notifications */
	__u32 token;
	...
	...
}

Thanks
Vivek
Vitaly Kuznetsov May 13, 2020, 2:23 p.m. UTC | #7
Vivek Goyal <vgoyal@redhat.com> writes:

> On Wed, May 13, 2020 at 11:03:48AM +0200, Vitaly Kuznetsov wrote:
>> Vivek Goyal <vgoyal@redhat.com> writes:
>> 
>> > On Tue, May 12, 2020 at 05:50:53PM +0200, Vitaly Kuznetsov wrote:
>> >> Vivek Goyal <vgoyal@redhat.com> writes:
>> >> 
>> >> >
>> >> > So if we are using a common structure "kvm_vcpu_pv_apf_data" to deliver
>> >> > type1 and type2 events, to me it makes sense to retain existing
>> >> > KVM_PV_REASON_PAGE_READY and KVM_PV_REASON_PAGE_NOT_PRESENT. Just that
>> >> > in new scheme of things, KVM_PV_REASON_PAGE_NOT_PRESENT will be delivered
>> >> > using #PF (and later possibly using #VE) and KVM_PV_REASON_PAGE_READY
>> >> > will be delivered using interrupt.
>> >> 
>> >> We use different fields for page-not-present and page-ready events so
>> >> there is no intersection. If we start setting KVM_PV_REASON_PAGE_READY
>> >> to 'reason' we may accidentally destroy a 'page-not-present' event.
>> >
>> > This is confusing. So you mean at one point of time we might be using
>> > same shared data structure for two events.
>> >
>> > - ->reason will be set to 1 and you will inject page_not_present
>> >   execption.
>> >
>> > - If some page gets ready, you will now set ->token and queue 
>> >   page ready exception. 
>> >
>> > Its very confusing. Can't we serialize the delivery of these events. So
>> > that only one is in progress so that this structure is used by one event
>> > at a time.
>> 
>> This is not how the mechanism (currently) works:
>> - A process accesses a page which is swapped out
>> 
>> - We deliver synchronious APF (#PF) to the guest, it freezes the process
>> and switches to another one.
>> 
>> - Another process accesses a swapped out page, APF is delivered and it
>> also got frozen
>> 
>> ...
>> 
>> - At some point one of the previously unavailable pages become available
>> (not necessarily the last or the first one) and we deliver this info via
>> asynchronous APF (now interrupt).
>> 
>> - Note, after we deliver the interrupt and before it is actually
>> consumed we may have another synchronous APF (#PF) event.
>> 
>> So we really need to separate memory locations for synchronous (type-1,
>> #PF,...) and asynchronous (type-2, interrupt, ...) data.
>> 
>> The naming is unfortunate and misleading, I agree. What is currently
>> named 'reason' should be something like 'apf_flag_for_pf' and it just
>> means to distinguish real #PF from APF. This is going away in the
>> future, we won't be abusing #PF anymore so I'd keep it as it is now,
>> maybe add another comment somewhere.
>
> Shouldn't we do #VE changes also at the same time. Otherwise we are
> again creating this intermediate mode where synchronous notifications
> happen using #PF. Once we move to #VE, old guests will again stop
> getting async #PF? 
>
> Also there is this issue of who is using this common shared area
> between host and guest (struct kvm_vcpu_pv_apf_data) and how to 
> use fields in this struct without breaking existing guests.
>
> For now I see you have modified meaning of fields of this structure
> (->reason) beacuse you are not concerned about older guets because
> they will not receive async pf to begin with. If that's the case,
> renaming even reason makes sense. Havind said that, as we are
> planning to not use this data structure for synchronous notifications,
> creating this intermediate mode is just going to create more
> confusion. So in new ABI, we should not even try to make use of
> ->reason and design #VE changes at the same time. Core problem with
> current ABI is racy access to ->reason and this patch set does not
> get rid of that race anyway.

Well, at least 'page ready' notifications don't access 'reason' and can't
collide with a real #PF, which is already an improvement :-)

>
>> The other location is
>> 'pageready_token' and it actually contains the token. This is to stay
>> long term so any suggestions for better naming are welcome.
>
> pageready_token makes sense if this structure is being used both
> types of notifications otherwise just ->token might be enough.

Yes, sure, when we stop using this structure for type-1 APF the
'pageready_' prefix becomes superfluous.

>
>> 
>> We could've separated these two memory locations completely and
>> e.g. used the remaining 56 bits of MSR_KVM_ASYNC_PF_INT as the new
>> location information. Maybe we should do that just to avoid the
>> confusion.
>
> If this is V2 of ABI, why not do "struct struct kvm_vcpu_pv_apf_data_v2"
> instead? Only reason you seem to be sticking to existing structure and
> ->reason fields because of #PF based sync notifications which is planned
> to go away soon. So if we do both the changes together, there is no
> need to keep this confusion w.r.t ->reason and existing structure.
>

Yes, sure. I don't know if Paolo already has #VE patches in his stash
but we can definitely keep this on the list until the second half is
done, no rush.

>> 
>> >
>> > Also how do I extend it now to do error delivery. Please keep that in
>> > mind. We don't want to be redesigning this stuff again. Its already
>> > very complicated.
>> >
>> > I really need ->reason field to be usable in both the paths so that
>> > error can be delivered.
>> 
>> If we want to use 'reason' for both we'll get into a weird scenario when
>> exception is blocking interrupt and, what's more confusing, vice
>> versa. I'd like to avoid this complexity in KVM code. My suggestion
>> would be to rename 'reason' to something like 'pf_abuse_flag' so it
>> doesn't cause the confusion and add new 'reason' after 'token'.
>> 
>> >
>> > And this notion of same structure being shared across multiple events
>> > at the same time is just going to create more confusion, IMHO. If we
>> > can decouple it by serializing it, that definitely feels simpler to
>> > understand.
>> 
>> What if we just add sub-structures to the structure, e.g. 
>> 
>> struct kvm_vcpu_pv_apf_data {
>>         struct {
>>             __u32 apf_flag;
>>         } legacy_apf_data;
>>         struct {
>>             __u32 token;
>>         } apf_interrupt_data;
>>         ....
>>         __u8 pad[56];
>>         __u32 enabled;
>> };    
>> 
>> would it make it more obvious?
>
> To me this seems like an improvement. Just looking at it, it is clear
> to me who is using what. But soon legacy_apf_data will be unused. Its
> not clear to me when will be able to get rid of apf_flag and reuse it
> for something else without having to worry about older guests.
>
> This would be the second-best option if for some reason we don't want to
> do the #VE changes at the same time. To make the new design cleaner and
> more understandable, doing the #VE changes at the same time would be good.

No objections :-)

>
>> 
>> >
>> >> 
>> >> With this patchset we have two completely separate channels:
>> >> 1) Page-not-present goes through #PF and 'reason' in struct
>> >> kvm_vcpu_pv_apf_data.
>> >> 2) Page-ready goes through interrupt and 'pageready_token' in the same
>> >> kvm_vcpu_pv_apf_data.
>> >> 
>> >> >
>> >> >> +
>> >> >> +	Note, previously, type 2 (page present) events were delivered via the
>> >> >> +	same #PF exception as type 1 (page not present) events but this is
>> >> >> +	now deprecated.
>> >> >
>> >> >> If bit 3 (interrupt based delivery) is not set APF events are not delivered.
>> >> >
>> >> > So all the old guests which were getting async pf will suddenly find
>> >> > that async pf does not work anymore (after hypervisor update). And
>> >> > some of them might report it as a performance issue (if there were any
>> >> > performance benefits to be had with async pf).
>> >> 
>> >> We still do APF_HALT but generally yes, there might be some performance
>> >> implications. My RFC was preserving #PF path but the suggestion was to
>> >> retire it completely. (and I kinda like it because it makes a lot of
>> >> code go away)
>> >
>> > Ok. I don't have a strong opinion here. If Paolo likes it this way, so be
>> > it. :-)
>> 
>> APF is a slowpath for overcommitted scenarios, and when we switch to
>> APF_HALT we allow the host to run some other guest while the PF is
>> processed. This is not the same from the guest's perspective, but from
>> the host's we're fine as we're not wasting cycles.
>> 
>> >
>> >> 
>> >> >
>> >> > [..]
>> >> >>  
>> >> >>  bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
>> >> >>  {
>> >> >> -	if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED))
>> >> >> +	if (!kvm_pv_async_pf_enabled(vcpu))
>> >> >>  		return true;
>> >> >
>> >> > What does the above mean? If async pf is not enabled, then it returns true,
>> >> > implying one can inject async page present. But if async pf is not
>> >> > enabled, there is no need to inject these events.
>> >> 
>> >> AFAIU this is a protection against the guest suddenly disabling the APF
>> >> mechanism.
>> >
>> > Can we provide that protection in the MSR implementation? That is, once
>> > APF is enabled, it can't be disabled. Or is it a feature that we allow
>> > the guest to disable APF and want it that way?
>> 
>> We need to allow disabling the feature. Think about the kdump scenario,
>> for example. Before we switch to the kdump kernel we need to make
>> sure there's no more 'magic' memory which can suddenly change.
>
> So are we waiting to receive all page ready events before we switch
> to the kdump kernel? If yes, that's not good. The existing kernel is
> corrupted and nothing meaningful can be done, so we should switch to the
> next kernel asap. The previous kernel is dying anyway, so we don't care
> about receiving page ready events.

No, we don't wait. We just disable the feature so no new notifications
will get delivered (if they are, the new kernel will get confused). If we
happen to access one of the previously-unavailable pages again the host
will just freeze us until the page is ready, no harm done.

>
>> Also,
>> kdump kernel may not even support APF so it will get very confused when
>> APF events get delivered.
>
> New kernel can just ignore these events if it does not support async
> pf? 
>
> This is somewhat similar to devices still doing interrupts in new
> kernel. And solution for that seemed to be doing a "reset" of devices
> in new kernel. We probably need similar logic where in new kernel
> we simply disable "async pf" so that we don't get new notifications.

Right and that's what we're doing - just disabling new notifications.

>
> IOW, waiting in crashed kernel for all page ready events does not
> sound like a good option. It reduces the reliability of kdump
> operation. 

AFAIK we don't wait (kvm_pv_guest_cpu_reboot() -> kvm_pv_disable_apf()
-> wrmsrl()).
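
For reference, roughly what that guest-side path looks like (a simplified
sketch, not the exact upstream code; names and details may differ slightly):

  /* arch/x86/kernel/kvm.c (simplified) */
  static void kvm_pv_disable_apf(void)
  {
          if (!__this_cpu_read(apf_reason.enabled))
                  return;

          /* Clearing the MSR drops KVM_ASYNC_PF_ENABLED, so the host
           * stops injecting APF events for this vCPU from now on. */
          wrmsrl(MSR_KVM_ASYNC_PF_EN, 0);
          __this_cpu_write(apf_reason.enabled, 0);
  }

  static void kvm_pv_guest_cpu_reboot(void *unused)
  {
          /* Runs on every CPU from the reboot/crash path. */
          kvm_pv_disable_apf();
  }

So the dying kernel only switches the mechanism off; it never waits for
outstanding 'page ready' events.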

>
>> 
>> >
>> >> What do we do with all the 'page ready' events afterwards? We
>> >> can't deliver them anymore. So we just eat them (hoping the guest will
>> >> unfreeze all processes on its own before disabling the mechanism).
>> >> 
>> >> It is the existing logic, my patch doesn't change it.
>> >
>> > I see it's existing logic. It's just very confusing, and it would be good
>> > if we can at least explain it with some comments.
>> >
>> > I don't know what to make out of this.
>> >
>> > bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
>> > {
>> >         if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED))
>> >                 return true;
>> >         else
>> >                 return kvm_can_do_async_pf(vcpu);
>> > }
>> >
>> > If the feature is disabled, then inject async pf page present. If the
>> > feature is enabled, check whether we can inject async pf right now or not.
>> >
>> > It probably will help if this check for the feature being enabled/disabled
>> > is moved outside kvm_arch_can_inject_async_page_present(), to its callsite,
>> > and there we explain why it is important to inject page ready events
>> > despite the fact that the feature is disabled.
>> 
>> This code would definitely love some comments added; will do in the next
>> version. And I'll also think about how to improve the readability, thanks
>> for the feedback!
>
> It definitely needs a comment (if it makes sense to still send page
> ready events to the guest).
>
Vivek Goyal May 13, 2020, 6:46 p.m. UTC | #8
On Wed, May 13, 2020 at 04:23:55PM +0200, Vitaly Kuznetsov wrote:

[..]
> >> Also,
> >> kdump kernel may not even support APF so it will get very confused when
> >> APF events get delivered.
> >
> > New kernel can just ignore these events if it does not support async
> > pf? 
> >
> > This is somewhat similar to devices still doing interrupts in new
> > kernel. And solution for that seemed to be doing a "reset" of devices
> > in new kernel. We probably need similar logic where in new kernel
> > we simply disable "async pf" so that we don't get new notifications.
> 
> Right and that's what we're doing - just disabling new notifications.

Nice.

So why is there a need to deliver "page ready" notifications
to the guest after the guest has disabled async pf? At least kdump does not
seem to need it. It will boot into the second kernel anyway, irrespective
of whether it receives page ready or not.

Thanks
Vivek
Vitaly Kuznetsov May 14, 2020, 8:08 a.m. UTC | #9
Vivek Goyal <vgoyal@redhat.com> writes:

> On Wed, May 13, 2020 at 04:23:55PM +0200, Vitaly Kuznetsov wrote:
>
> [..]
>> >> Also,
>> >> kdump kernel may not even support APF so it will get very confused when
>> >> APF events get delivered.
>> >
>> > New kernel can just ignore these events if it does not support async
>> > pf? 
>> >
>> > This is somewhat similar to devices still doing interrupts in new
>> > kernel. And solution for that seemed to be doing a "reset" of devices
>> > in new kernel. We probably need similar logic where in new kernel
>> > we simply disable "async pf" so that we don't get new notifications.
>> 
>> Right and that's what we're doing - just disabling new notifications.
>
> Nice.
>
> So why is there a need to deliver "page ready" notifications
> to the guest after the guest has disabled async pf? At least kdump does not
> seem to need it. It will boot into the second kernel anyway, irrespective
> of whether it receives page ready or not.

We don't deliver anything to the guest after it disables APF (neither
'page ready' for what was previously missing, nor 'page not ready' for
new faults). kvm_arch_can_inject_async_page_present() is just another
misnomer; it should be named something like
'kvm_arch_can_unqueue_async_page_present()', meaning that a 'page ready'
notification can be 'unqueued' from the internal KVM queue. We will either
deliver it (when the guest has APF enabled) or just drop it (when the guest
has APF disabled). The only case when it has to stay in the queue is when
the guest has APF enabled and the slot is still busy (so it didn't get to
process a previously delivered notification). We will try to deliver it
again after the guest writes to MSR_KVM_ASYNC_PF_ACK.
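
In other words, on top of this patch the renamed helper would look roughly
like this (just a sketch to illustrate the semantics; the actual rename can
come later):

  /*
   * Can the next 'page ready' event be taken off the internal APF queue?
   *  - guest has APF disabled: yes, the event will simply be dropped;
   *  - guest has APF enabled: only if the 'pageready_token' slot reads
   *    back as 0, i.e. the guest has acked the previous notification.
   */
  bool kvm_arch_can_unqueue_async_page_present(struct kvm_vcpu *vcpu)
  {
          if (!kvm_pv_async_pf_enabled(vcpu))
                  return true;

          return apf_pageready_slot_free(vcpu);
  }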
Vivek Goyal May 14, 2020, 1:31 p.m. UTC | #10
On Thu, May 14, 2020 at 10:08:37AM +0200, Vitaly Kuznetsov wrote:
> Vivek Goyal <vgoyal@redhat.com> writes:
> 
> > On Wed, May 13, 2020 at 04:23:55PM +0200, Vitaly Kuznetsov wrote:
> >
> > [..]
> >> >> Also,
> >> >> kdump kernel may not even support APF so it will get very confused when
> >> >> APF events get delivered.
> >> >
> >> > New kernel can just ignore these events if it does not support async
> >> > pf? 
> >> >
> >> > This is somewhat similar to devices still doing interrupts in new
> >> > kernel. And solution for that seemed to be doing a "reset" of devices
> >> > in new kernel. We probably need similar logic where in new kernel
> >> > we simply disable "async pf" so that we don't get new notifications.
> >> 
> >> Right and that's what we're doing - just disabling new notifications.
> >
> > Nice.
> >
> > So why is there a need to deliver "page ready" notifications
> > to the guest after the guest has disabled async pf? At least kdump does not
> > seem to need it. It will boot into the second kernel anyway, irrespective
> > of whether it receives page ready or not.
> 
> We don't deliver anything to the guest after it disables APF (neither
> 'page ready' for what was previously missing, nor 'page not ready' for
> new faults). kvm_arch_can_inject_async_page_present() is just another
> misnomer; it should be named something like
> 'kvm_arch_can_unqueue_async_page_present()', meaning that a 'page ready'
> notification can be 'unqueued' from the internal KVM queue. We will either
> deliver it (when the guest has APF enabled) or just drop it (when the guest
> has APF disabled). The only case when it has to stay in the queue is when
> the guest has APF enabled and the slot is still busy (so it didn't get to
> process a previously delivered notification). We will try to deliver it
> again after the guest writes to MSR_KVM_ASYNC_PF_ACK.

This makes sense. Renaming this function to make it clearer will help
in understanding the code better.

Vivek
diff mbox series

Patch

diff --git a/Documentation/virt/kvm/msr.rst b/Documentation/virt/kvm/msr.rst
index 33892036672d..f988a36f226a 100644
--- a/Documentation/virt/kvm/msr.rst
+++ b/Documentation/virt/kvm/msr.rst
@@ -190,35 +190,54 @@  MSR_KVM_ASYNC_PF_EN:
 	0x4b564d02
 
 data:
-	Bits 63-6 hold 64-byte aligned physical address of a
-	64 byte memory area which must be in guest RAM and must be
-	zeroed. Bits 5-3 are reserved and should be zero. Bit 0 is 1
-	when asynchronous page faults are enabled on the vcpu 0 when
-	disabled. Bit 1 is 1 if asynchronous page faults can be injected
-	when vcpu is in cpl == 0. Bit 2 is 1 if asynchronous page faults
-	are delivered to L1 as #PF vmexits.  Bit 2 can be set only if
-	KVM_FEATURE_ASYNC_PF_VMEXIT is present in CPUID.
-
-	First 4 byte of 64 byte memory location will be written to by
-	the hypervisor at the time of asynchronous page fault (APF)
-	injection to indicate type of asynchronous page fault. Value
-	of 1 means that the page referred to by the page fault is not
-	present. Value 2 means that the page is now available. Disabling
-	interrupt inhibits APFs. Guest must not enable interrupt
-	before the reason is read, or it may be overwritten by another
-	APF. Since APF uses the same exception vector as regular page
-	fault guest must reset the reason to 0 before it does
-	something that can generate normal page fault.  If during page
-	fault APF reason is 0 it means that this is regular page
-	fault.
-
-	During delivery of type 1 APF cr2 contains a token that will
-	be used to notify a guest when missing page becomes
-	available. When page becomes available type 2 APF is sent with
-	cr2 set to the token associated with the page. There is special
-	kind of token 0xffffffff which tells vcpu that it should wake
-	up all processes waiting for APFs and no individual type 2 APFs
-	will be sent.
+	Asynchronous page fault (APF) control MSR.
+
+	Bits 63-6 hold 64-byte aligned physical address of a 64 byte memory area
+	which must be in guest RAM and must be zeroed. This memory is expected
+	to hold a copy of the following structure::
+
+	  struct kvm_vcpu_pv_apf_data {
+		__u32 reason;
+		__u32 pageready_token;
+		__u8 pad[56];
+		__u32 enabled;
+	  };
+
+	Bits 5-4 of the MSR are reserved and should be zero. Bit 0 is set to 1
+	when asynchronous page faults are enabled on the vcpu, 0 when disabled.
+	Bit 1 is 1 if asynchronous page faults can be injected when vcpu is in
+	cpl == 0. Bit 2 is 1 if asynchronous page faults are delivered to L1 as
+	#PF vmexits.  Bit 2 can be set only if KVM_FEATURE_ASYNC_PF_VMEXIT is
+	present in CPUID. Bit 3 enables interrupt based delivery of type 2
+	(page present) events.
+
+	First 4 byte of 64 byte memory location ('reason') will be written to
+	by the hypervisor at the time APF type 1 (page not present) injection.
+	The only possible values are '0' and '1'. Type 1 events are currently
+	always delivered as synthetic #PF exception. During delivery of type 1
+	APF CR2 register contains a token that will be used to notify the guest
+	when missing page becomes available. Guest is supposed to write '0' to
+	the location when it is done handling type 1 event so the next one can
+	be delivered.
+
+	Note, since APF type 1 uses the same exception vector as regular page
+	fault, guest must reset the reason to '0' before it does something that
+	can generate normal page fault. If during a page fault APF reason is '0'
+	it means that this is regular page fault.
+
+	Bytes 5-7 of 64 byte memory location ('pageready_token') will be written
+	to by the hypervisor at the time of type 2 (page ready) event injection.
+	The content of these bytes is a token which was previously delivered as
+	type 1 event. The event indicates the page in now available. Guest is
+	supposed to write '0' to the location when it is done handling type 2
+	event so the next one can be delivered. MSR_KVM_ASYNC_PF_INT MSR
+	specifying the interrupt vector for type 2 APF delivery needs to be
+	written to before enabling APF mechanism in MSR_KVM_ASYNC_PF_EN.
+
+	Note, previously, type 2 (page present) events were delivered via the
+	same #PF exception as type 1 (page not present) events but this is
+	now deprecated. If bit 3 (interrupt based delivery) is not set APF
+	events are not delivered.
 
 	If APF is disabled while there are outstanding APFs, they will
 	not be delivered.
@@ -319,3 +338,17 @@  data:
 
 	KVM guests can request the host not to poll on HLT, for example if
 	they are performing polling themselves.
+
+MSR_KVM_ASYNC_PF_INT:
+	0x4b564d06
+
+data:
+	Second asynchronous page fault (APF) control MSR.
+
+	Bits 0-7: APIC vector for delivery of type 2 APF events (page ready
+	notifications).
+	Bits 8-63: Reserved
+
+	Interrupt vector for asynchnonous page ready notifications delivery.
+	The vector has to be set up before asynchronous page fault mechanism
+	is enabled in MSR_KVM_ASYNC_PF_EN.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 42a2d0d3984a..4a72ba26a532 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -763,7 +763,9 @@  struct kvm_vcpu_arch {
 		bool halted;
 		gfn_t gfns[roundup_pow_of_two(ASYNC_PF_PER_VCPU)];
 		struct gfn_to_hva_cache data;
-		u64 msr_val;
+		u64 msr_en_val; /* MSR_KVM_ASYNC_PF_EN */
+		u64 msr_int_val; /* MSR_KVM_ASYNC_PF_INT */
+		u16 vec;
 		u32 id;
 		bool send_user_only;
 		u32 host_apf_reason;
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index e3602a1de136..c55ffa2c40b6 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -50,6 +50,7 @@ 
 #define MSR_KVM_STEAL_TIME  0x4b564d03
 #define MSR_KVM_PV_EOI_EN      0x4b564d04
 #define MSR_KVM_POLL_CONTROL	0x4b564d05
+#define MSR_KVM_ASYNC_PF_INT	0x4b564d06
 
 struct kvm_steal_time {
 	__u64 steal;
@@ -81,6 +82,11 @@  struct kvm_clock_pairing {
 #define KVM_ASYNC_PF_ENABLED			(1 << 0)
 #define KVM_ASYNC_PF_SEND_ALWAYS		(1 << 1)
 #define KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT	(1 << 2)
+#define KVM_ASYNC_PF_DELIVERY_AS_INT		(1 << 3)
+
+/* MSR_KVM_ASYNC_PF_INT */
+#define KVM_ASYNC_PF_VEC_MASK			GENMASK(7, 0)
+
 
 /* Operations for KVM_HC_MMU_OP */
 #define KVM_MMU_OP_WRITE_PTE            1
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 28868cc16e4d..d1e9b072d279 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1243,7 +1243,7 @@  static const u32 emulated_msrs_all[] = {
 	HV_X64_MSR_TSC_EMULATION_STATUS,
 
 	MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
-	MSR_KVM_PV_EOI_EN,
+	MSR_KVM_PV_EOI_EN, MSR_KVM_ASYNC_PF_INT,
 
 	MSR_IA32_TSC_ADJUST,
 	MSR_IA32_TSCDEADLINE,
@@ -2645,17 +2645,24 @@  static int xen_hvm_config(struct kvm_vcpu *vcpu, u64 data)
 	return r;
 }
 
+static inline bool kvm_pv_async_pf_enabled(struct kvm_vcpu *vcpu)
+{
+	u64 mask = KVM_ASYNC_PF_ENABLED | KVM_ASYNC_PF_DELIVERY_AS_INT;
+
+	return (vcpu->arch.apf.msr_en_val & mask) == mask;
+}
+
 static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
 {
 	gpa_t gpa = data & ~0x3f;
 
-	/* Bits 3:5 are reserved, Should be zero */
-	if (data & 0x38)
+	/* Bits 4:5 are reserved, Should be zero */
+	if (data & 0x30)
 		return 1;
 
-	vcpu->arch.apf.msr_val = data;
+	vcpu->arch.apf.msr_en_val = data;
 
-	if (!(data & KVM_ASYNC_PF_ENABLED)) {
+	if (!kvm_pv_async_pf_enabled(vcpu)) {
 		kvm_clear_async_pf_completion_queue(vcpu);
 		kvm_async_pf_hash_reset(vcpu);
 		return 0;
@@ -2667,7 +2674,25 @@  static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
 
 	vcpu->arch.apf.send_user_only = !(data & KVM_ASYNC_PF_SEND_ALWAYS);
 	vcpu->arch.apf.delivery_as_pf_vmexit = data & KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT;
+
 	kvm_async_pf_wakeup_all(vcpu);
+
+	return 0;
+}
+
+static int kvm_pv_enable_async_pf_int(struct kvm_vcpu *vcpu, u64 data)
+{
+	/* Bits 8-63 are reserved */
+	if (data >> 8)
+		return 1;
+
+	if (!lapic_in_kernel(vcpu))
+		return 1;
+
+	vcpu->arch.apf.msr_int_val = data;
+
+	vcpu->arch.apf.vec = data & KVM_ASYNC_PF_VEC_MASK;
+
 	return 0;
 }
 
@@ -2883,6 +2908,10 @@  int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		if (kvm_pv_enable_async_pf(vcpu, data))
 			return 1;
 		break;
+	case MSR_KVM_ASYNC_PF_INT:
+		if (kvm_pv_enable_async_pf_int(vcpu, data))
+			return 1;
+		break;
 	case MSR_KVM_STEAL_TIME:
 
 		if (unlikely(!sched_info_on()))
@@ -3157,7 +3186,10 @@  int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		msr_info->data = vcpu->arch.time;
 		break;
 	case MSR_KVM_ASYNC_PF_EN:
-		msr_info->data = vcpu->arch.apf.msr_val;
+		msr_info->data = vcpu->arch.apf.msr_en_val;
+		break;
+	case MSR_KVM_ASYNC_PF_INT:
+		msr_info->data = vcpu->arch.apf.msr_int_val;
 		break;
 	case MSR_KVM_STEAL_TIME:
 		msr_info->data = vcpu->arch.st.msr_val;
@@ -9500,7 +9532,8 @@  void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	vcpu->arch.cr2 = 0;
 
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
-	vcpu->arch.apf.msr_val = 0;
+	vcpu->arch.apf.msr_en_val = 0;
+	vcpu->arch.apf.msr_int_val = 0;
 	vcpu->arch.st.msr_val = 0;
 
 	kvmclock_reset(vcpu);
@@ -10362,10 +10395,24 @@  static inline int apf_put_user_notpresent(struct kvm_vcpu *vcpu)
 
 static inline int apf_put_user_ready(struct kvm_vcpu *vcpu, u32 token)
 {
-	u64 val = (u64)token << 32 | KVM_PV_REASON_PAGE_READY;
+	unsigned int offset = offsetof(struct kvm_vcpu_pv_apf_data,
+				       pageready_token);
 
-	return kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.apf.data, &val,
-				      sizeof(val));
+	return kvm_write_guest_offset_cached(vcpu->kvm, &vcpu->arch.apf.data,
+					     &token, offset, sizeof(token));
+}
+
+static inline bool apf_pageready_slot_free(struct kvm_vcpu *vcpu)
+{
+	unsigned int offset = offsetof(struct kvm_vcpu_pv_apf_data,
+				       pageready_token);
+	u32 val;
+
+	if (kvm_read_guest_offset_cached(vcpu->kvm, &vcpu->arch.apf.data,
+					 &val, offset, sizeof(val)))
+		return false;
+
+	return !val;
 }
 
 static bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu)
@@ -10373,9 +10420,8 @@  static bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu)
 	if (!vcpu->arch.apf.delivery_as_pf_vmexit && is_guest_mode(vcpu))
 		return false;
 
-	if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED) ||
-	    (vcpu->arch.apf.send_user_only &&
-	     kvm_x86_ops.get_cpl(vcpu) == 0))
+	if (!kvm_pv_async_pf_enabled(vcpu) ||
+	    (vcpu->arch.apf.send_user_only && kvm_x86_ops.get_cpl(vcpu) == 0))
 		return false;
 
 	return true;
@@ -10431,7 +10477,10 @@  void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
 void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
 				 struct kvm_async_pf *work)
 {
-	struct x86_exception fault;
+	struct kvm_lapic_irq irq = {
+		.delivery_mode = APIC_DM_FIXED,
+		.vector = vcpu->arch.apf.vec
+	};
 
 	if (work->wakeup_all)
 		work->arch.token = ~0; /* broadcast wakeup */
@@ -10439,26 +10488,20 @@  void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
 		kvm_del_async_pf_gfn(vcpu, work->arch.gfn);
 	trace_kvm_async_pf_ready(work->arch.token, work->cr2_or_gpa);
 
-	if (vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED &&
-	    !apf_put_user_ready(vcpu, work->arch.token)) {
-			fault.vector = PF_VECTOR;
-			fault.error_code_valid = true;
-			fault.error_code = 0;
-			fault.nested_page_fault = false;
-			fault.address = work->arch.token;
-			fault.async_page_fault = true;
-			kvm_inject_page_fault(vcpu, &fault);
-	}
+	if (kvm_pv_async_pf_enabled(vcpu) &&
+	    !apf_put_user_ready(vcpu, work->arch.token))
+		kvm_apic_set_irq(vcpu, &irq, NULL);
+
 	vcpu->arch.apf.halted = false;
 	vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
 }
 
 bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
 {
-	if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED))
+	if (!kvm_pv_async_pf_enabled(vcpu))
 		return true;
 	else
-		return kvm_can_do_async_pf(vcpu);
+		return apf_pageready_slot_free(vcpu);
 }
 
 void kvm_arch_start_assignment(struct kvm *kvm)