
[RFC,3/6] KVM: x86: interrupt based APF page-ready event delivery

Message ID 20200429093634.1514902-4-vkuznets@redhat.com (mailing list archive)
State New, archived
Series KVM: x86: Interrupt-based mechanism for async_pf 'page present' notifications

Commit Message

Vitaly Kuznetsov April 29, 2020, 9:36 a.m. UTC
Concerns were expressed around APF delivery via a synthetic #PF exception, as
in some cases such delivery may collide with a real page fault. For type 2
(page ready) notifications we can easily switch to using an interrupt
instead. Introduce the new MSR_KVM_ASYNC_PF2 mechanism.

One notable difference between the two mechanisms is that an interrupt may not
get handled immediately, so whenever we would like to deliver the next event
(regardless of its type) we must be sure the guest has read and cleared the
previous event in the slot.
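
The "slot must be clear" requirement can be sketched with a tiny user-space model (illustrative names only; the real host-side check is the apf_slot_free() helper added by this patch):

```c
/* Minimal user-space model of the per-vCPU APF slot protocol described
 * above (illustrative names, not actual KVM or guest code): the host may
 * only deliver a new event once the guest has zeroed the 'reason' word. */
#include <assert.h>
#include <stdint.h>

static uint32_t apf_slot;	/* shared per-vCPU reason word, 0 == free */

/* Host side: mirrors the apf_slot_free() gating -- refuse to deliver
 * while the previous event is still unconsumed. */
static int host_try_deliver(uint32_t reason)
{
	if (apf_slot != 0)
		return 0;	/* guest hasn't read/cleared the last event */
	apf_slot = reason;
	return 1;
}

/* Guest side: the #PF/interrupt handler reads the reason and writes 0
 * back, which is what re-arms delivery of the next event. */
static uint32_t guest_consume(void)
{
	uint32_t reason = apf_slot;
	apf_slot = 0;
	return reason;
}
```

Here host_try_deliver() plays the role of the kvm_can_do_async_pf() check below, and guest_consume() stands in for the guest's event handler.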

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 Documentation/virt/kvm/msr.rst       | 38 +++++++++++---
 arch/x86/include/asm/kvm_host.h      |  5 +-
 arch/x86/include/uapi/asm/kvm_para.h |  6 +++
 arch/x86/kvm/x86.c                   | 77 ++++++++++++++++++++++++++--
 4 files changed, 113 insertions(+), 13 deletions(-)

Comments

Paolo Bonzini April 29, 2020, 10:54 a.m. UTC | #1
On 29/04/20 11:36, Vitaly Kuznetsov wrote:
> +
> +	Type 1 page (page missing) events are currently always delivered as
> +	synthetic #PF exception. Type 2 (page ready) are either delivered
> +	by #PF exception (when bit 3 of MSR_KVM_ASYNC_PF_EN is clear) or
> +	via an APIC interrupt (when bit 3 set). APIC interrupt delivery is
> +	controlled by MSR_KVM_ASYNC_PF2.

I think we should (in the non-RFC version) block async page faults
completely and only keep APF_HALT unless the guest is using page ready
interrupt delivery.

Paolo
Vitaly Kuznetsov April 29, 2020, 12:40 p.m. UTC | #2
Paolo Bonzini <pbonzini@redhat.com> writes:

> On 29/04/20 11:36, Vitaly Kuznetsov wrote:
>> +
>> +	Type 1 page (page missing) events are currently always delivered as
>> +	synthetic #PF exception. Type 2 (page ready) are either delivered
>> +	by #PF exception (when bit 3 of MSR_KVM_ASYNC_PF_EN is clear) or
>> +	via an APIC interrupt (when bit 3 set). APIC interrupt delivery is
>> +	controlled by MSR_KVM_ASYNC_PF2.
>
> I think we should (in the non-RFC version) block async page faults
> completely and only keep APF_HALT unless the guest is using page ready
> interrupt delivery.

Sure, we can do that. This is, however, a significant behavioral change:
APF_HALT frees the host, not the guest, so even if the combined
performance of all guests on the same pCPU remains the same, guests with
e.g. a lot of simultaneously running processes may suffer more.

In theory, we can keep the two mechanisms side by side for as long as we
want, but if the end goal is to have '#PF abuse eliminated' then we'll
have to get rid of the legacy one some day. The day when the new
mechanism lands is also a good choice :-)
Paolo Bonzini April 29, 2020, 1:21 p.m. UTC | #3
On 29/04/20 14:40, Vitaly Kuznetsov wrote:
> Paolo Bonzini <pbonzini@redhat.com> writes:
> 
>> On 29/04/20 11:36, Vitaly Kuznetsov wrote:
>>> +
>>> +	Type 1 page (page missing) events are currently always delivered as
>>> +	synthetic #PF exception. Type 2 (page ready) are either delivered
>>> +	by #PF exception (when bit 3 of MSR_KVM_ASYNC_PF_EN is clear) or
>>> +	via an APIC interrupt (when bit 3 set). APIC interrupt delivery is
>>> +	controlled by MSR_KVM_ASYNC_PF2.
>>
>> I think we should (in the non-RFC version) block async page faults
>> completely and only keep APF_HALT unless the guest is using page ready
>> interrupt delivery.
> 
> Sure, we can do that. This is, however, a significant behavioral change:
> APF_HALT frees the host, not the guest, so even if the combined
> performance of all guests on the same pCPU remains the same, guests with
> e.g. a lot of simultaneously running processes may suffer more.

Yes, it is a significant change.  However the resulting clean up in the
spec is significant, because we don't have type 2 notifications at all
anymore.

(APF_HALT does free the guest a little bit by allowing interrupt
delivery during a host page fault; in particular it lets the scheduler
tick run, which improves responsiveness quite noticeably.)

Likewise, I think we should clean up the guest side without prejudice.
Patch 6 should disable async page fault unless page-ready interrupts are
available, and drop the page ready case from the #PF handler.

Thanks,

Paolo

> In theory, we can keep two mechanisms side by side for as long as we
> want, but if the end goal is to have '#PF abuse eliminated' then we'll
> have to get rid of the legacy one some day. The day when the new
> mechanism lands is also a good choice :-)
Peter Xu April 29, 2020, 9:27 p.m. UTC | #4
Hi, Vitaly,

On Wed, Apr 29, 2020 at 11:36:31AM +0200, Vitaly Kuznetsov wrote:
> +	Type 1 page (page missing) events are currently always delivered as
> +	synthetic #PF exception. Type 2 (page ready) are either delivered
> +	by #PF exception (when bit 3 of MSR_KVM_ASYNC_PF_EN is clear) or
> +	via an APIC interrupt (when bit 3 set). APIC interrupt delivery is
> +	controlled by MSR_KVM_ASYNC_PF2.

How about MSR_KVM_ASYNC_PF_INT instead of MSR_KVM_ASYNC_PF2 (to match
MSR_KVM_ASYNC_PF_EN and MSR_KVM_ASYNC_PF_ACK, where they're all
MSR_KVM_ASYNC_PF_* with a meaningful ending word)?

> +
> +	For #PF delivery, disabling interrupt inhibits APFs. Guest must
> +	not enable interrupt before the reason is read, or it may be
> +	overwritten by another APF. Since APF uses the same exception
> +	vector as regular page fault guest must reset the reason to 0
> +	before it does something that can generate normal page fault.
> +	If during pagefault APF reason is 0 it means that this is regular
> +	page fault.
>  
>  	During delivery of type 1 APF cr2 contains a token that will
>  	be used to notify a guest when missing page becomes
> @@ -319,3 +326,18 @@ data:
>  
>  	KVM guests can request the host not to poll on HLT, for example if
>  	they are performing polling themselves.
> +
> +MSR_KVM_ASYNC_PF2:
> +	0x4b564d06
> +
> +data:
> +	Second asynchronous page fault control MSR.
> +
> +	Bits 0-7: APIC vector for interrupt based delivery of type 2 APF
> +	events (page ready notification).
> +        Bit 8: Interrupt based delivery of type 2 APF events is enabled
> +        Bits 9-63: Reserved

(may need to fix up the indents)

> +
> +	To switch to interrupt based delivery of type 2 APF events guests
> +	are supposed to enable asynchronous page faults and set bit 3 in
> +	MSR_KVM_ASYNC_PF_EN first.
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 42a2d0d3984a..6215f61450cb 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -763,12 +763,15 @@ struct kvm_vcpu_arch {
>  		bool halted;
>  		gfn_t gfns[roundup_pow_of_two(ASYNC_PF_PER_VCPU)];
>  		struct gfn_to_hva_cache data;
> -		u64 msr_val;
> +		u64 msr_val; /* MSR_KVM_ASYNC_PF_EN */
> +		u64 msr2_val; /* MSR_KVM_ASYNC_PF2 */
> +		u16 vec;
>  		u32 id;
>  		bool send_user_only;
>  		u32 host_apf_reason;
>  		unsigned long nested_apf_token;
>  		bool delivery_as_pf_vmexit;
> +		bool delivery_as_int;
>  	} apf;
>  
>  	/* OSVW MSRs (AMD only) */
> diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
> index df2ba34037a2..1bbb0b7e062f 100644
> --- a/arch/x86/include/uapi/asm/kvm_para.h
> +++ b/arch/x86/include/uapi/asm/kvm_para.h
> @@ -50,6 +50,7 @@
>  #define MSR_KVM_STEAL_TIME  0x4b564d03
>  #define MSR_KVM_PV_EOI_EN      0x4b564d04
>  #define MSR_KVM_POLL_CONTROL	0x4b564d05
> +#define MSR_KVM_ASYNC_PF2	0x4b564d06
>  
>  struct kvm_steal_time {
>  	__u64 steal;
> @@ -81,6 +82,11 @@ struct kvm_clock_pairing {
>  #define KVM_ASYNC_PF_ENABLED			(1 << 0)
>  #define KVM_ASYNC_PF_SEND_ALWAYS		(1 << 1)
>  #define KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT	(1 << 2)
> +#define KVM_ASYNC_PF_DELIVERY_AS_INT		(1 << 3)
> +
> +#define KVM_ASYNC_PF2_VEC_MASK			GENMASK(7, 0)
> +#define KVM_ASYNC_PF2_ENABLED			BIT(8)

There are two enable bits: this new one in ASYNC_PF2 and the old one in
ASYNC_PF_EN.  Could it work with only one knob (e.g., setting bit 0 of
ASYNC_PF_EN always to enable apf)?  After all, we already have bit 3 of
ASYNC_PF_EN to show whether the interrupt is enabled, which seems to be
the real switch for whether to enable the interrupt for apf.

If we can keep the only knob in ASYNC_PF_EN (bit 0), IIUC we can also
keep the below kvm_async_pf_wakeup_all() untouched (so we only set bit 0
of ASYNC_PF_EN after configuring everything).
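
For reference, the bit layout under discussion can be modelled in user space (a sketch mirroring the KVM_ASYNC_PF2_* defines and the reserved-bits check in kvm_pv_enable_async_pf2(); not kernel code):

```c
/* User-space model of the RFC's MSR_KVM_ASYNC_PF2 field layout:
 * bits 0-7 vector, bit 8 enable, bits 9-63 reserved. */
#include <assert.h>
#include <stdint.h>

#define KVM_ASYNC_PF2_VEC_MASK	0xffULL		/* GENMASK(7, 0) */
#define KVM_ASYNC_PF2_ENABLED	(1ULL << 8)	/* BIT(8) */

struct apf2_cfg {
	uint8_t vec;	/* APIC vector for 'page ready' interrupts */
	int enabled;	/* interrupt-based delivery switched on */
};

/* Mirrors kvm_pv_enable_async_pf2(): reject writes with reserved
 * bits 9-63 set, otherwise split out the vector and enable bit. */
static int apf2_decode(uint64_t data, struct apf2_cfg *cfg)
{
	if (data & ~0x1ffULL)
		return -1;
	cfg->vec = data & KVM_ASYNC_PF2_VEC_MASK;
	cfg->enabled = !!(data & KVM_ASYNC_PF2_ENABLED);
	return 0;
}
```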

Thanks,

> +
>  
>  /* Operations for KVM_HC_MMU_OP */
>  #define KVM_MMU_OP_WRITE_PTE            1
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 7c21c0cf0a33..861dce1e7cf5 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1243,7 +1243,7 @@ static const u32 emulated_msrs_all[] = {
>  	HV_X64_MSR_TSC_EMULATION_STATUS,
>  
>  	MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
> -	MSR_KVM_PV_EOI_EN,
> +	MSR_KVM_PV_EOI_EN, MSR_KVM_ASYNC_PF2,
>  
>  	MSR_IA32_TSC_ADJUST,
>  	MSR_IA32_TSCDEADLINE,
> @@ -2649,8 +2649,8 @@ static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
>  {
>  	gpa_t gpa = data & ~0x3f;
>  
> -	/* Bits 3:5 are reserved, Should be zero */
> -	if (data & 0x38)
> +	/* Bits 4:5 are reserved, Should be zero */
> +	if (data & 0x30)
>  		return 1;
>  
>  	vcpu->arch.apf.msr_val = data;
> @@ -2667,7 +2667,35 @@ static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
>  
>  	vcpu->arch.apf.send_user_only = !(data & KVM_ASYNC_PF_SEND_ALWAYS);
>  	vcpu->arch.apf.delivery_as_pf_vmexit = data & KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT;
> -	kvm_async_pf_wakeup_all(vcpu);
> +	vcpu->arch.apf.delivery_as_int = data & KVM_ASYNC_PF_DELIVERY_AS_INT;
> +
> +	/*
> +	 * If delivery via interrupt is configured make sure MSR_KVM_ASYNC_PF2
> +	 * was written to before sending 'wakeup all'.
> +	 */
> +	if (!vcpu->arch.apf.delivery_as_int ||
> +	    vcpu->arch.apf.msr2_val & KVM_ASYNC_PF2_ENABLED)
> +		kvm_async_pf_wakeup_all(vcpu);
> +
> +	return 0;
> +}
> +
> +static int kvm_pv_enable_async_pf2(struct kvm_vcpu *vcpu, u64 data)
> +{
> +	/* Bits 9-63 are reserved */
> +	if (data & ~0x1ff)
> +		return 1;
> +
> +	if (!lapic_in_kernel(vcpu))
> +		return 1;
> +
> +	vcpu->arch.apf.msr2_val = data;
> +
> +	vcpu->arch.apf.vec = data & KVM_ASYNC_PF2_VEC_MASK;
> +
> +	if (data & KVM_ASYNC_PF2_ENABLED)
> +		kvm_async_pf_wakeup_all(vcpu);
> +
>  	return 0;
>  }
>  
> @@ -2883,6 +2911,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  		if (kvm_pv_enable_async_pf(vcpu, data))
>  			return 1;
>  		break;
> +	case MSR_KVM_ASYNC_PF2:
> +		if (kvm_pv_enable_async_pf2(vcpu, data))
> +			return 1;
> +		break;
>  	case MSR_KVM_STEAL_TIME:
>  
>  		if (unlikely(!sched_info_on()))
> @@ -3159,6 +3191,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  	case MSR_KVM_ASYNC_PF_EN:
>  		msr_info->data = vcpu->arch.apf.msr_val;
>  		break;
> +	case MSR_KVM_ASYNC_PF2:
> +		msr_info->data = vcpu->arch.apf.msr2_val;
> +		break;
>  	case MSR_KVM_STEAL_TIME:
>  		msr_info->data = vcpu->arch.st.msr_val;
>  		break;
> @@ -10367,6 +10402,16 @@ static int apf_get_user(struct kvm_vcpu *vcpu, u32 *val)
>  				      sizeof(u32));
>  }
>  
> +static bool apf_slot_free(struct kvm_vcpu *vcpu)
> +{
> +	u32 val;
> +
> +	if (apf_get_user(vcpu, &val))
> +		return false;
> +
> +	return !val;
> +}
> +
>  static bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu)
>  {
>  	if (!vcpu->arch.apf.delivery_as_pf_vmexit && is_guest_mode(vcpu))
> @@ -10382,11 +10427,23 @@ static bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu)
>  
>  bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
>  {
> +	/*
> +	 * TODO: when we are injecting a 'page present' event with an interrupt
> +	 * we may ignore pending exceptions.
> +	 */
>  	if (unlikely(!lapic_in_kernel(vcpu) ||
>  		     kvm_event_needs_reinjection(vcpu) ||
>  		     vcpu->arch.exception.pending))
>  		return false;
>  
> +	/*
> +	 * Regardless of the type of event we're trying to deliver, we need to
> +	 * check that the previous event was already consumed; this may not be
> +	 * the case with interrupt based delivery.
> +	 */
> +	if (vcpu->arch.apf.delivery_as_int && !apf_slot_free(vcpu))
> +		return false;
> +
>  	if (kvm_hlt_in_guest(vcpu->kvm) && !kvm_can_deliver_async_pf(vcpu))
>  		return false;
Vitaly Kuznetsov April 30, 2020, 8:31 a.m. UTC | #5
Peter Xu <peterx@redhat.com> writes:

> Hi, Vitaly,
>
> On Wed, Apr 29, 2020 at 11:36:31AM +0200, Vitaly Kuznetsov wrote:
>> +	Type 1 page (page missing) events are currently always delivered as
>> +	synthetic #PF exception. Type 2 (page ready) are either delivered
>> +	by #PF exception (when bit 3 of MSR_KVM_ASYNC_PF_EN is clear) or
>> +	via an APIC interrupt (when bit 3 set). APIC interrupt delivery is
>> +	controlled by MSR_KVM_ASYNC_PF2.
>
> How about MSR_KVM_ASYNC_PF_INT instead of MSR_KVM_ASYNC_PF2 (to match
> MSR_KVM_ASYNC_EN and MSR_KVM_ASYNC_ACK where they're all MSR_KVM_ASYNC_* with a
> meaningful ending word)?
>

Sure.

>> +
>> +	For #PF delivery, disabling interrupt inhibits APFs. Guest must
>> +	not enable interrupt before the reason is read, or it may be
>> +	overwritten by another APF. Since APF uses the same exception
>> +	vector as regular page fault guest must reset the reason to 0
>> +	before it does something that can generate normal page fault.
>> +	If during pagefault APF reason is 0 it means that this is regular
>> +	page fault.
>>  
>>  	During delivery of type 1 APF cr2 contains a token that will
>>  	be used to notify a guest when missing page becomes
>> @@ -319,3 +326,18 @@ data:
>>  
>>  	KVM guests can request the host not to poll on HLT, for example if
>>  	they are performing polling themselves.
>> +
>> +MSR_KVM_ASYNC_PF2:
>> +	0x4b564d06
>> +
>> +data:
>> +	Second asynchronous page fault control MSR.
>> +
>> +	Bits 0-7: APIC vector for interrupt based delivery of type 2 APF
>> +	events (page ready notification).
>> +        Bit 8: Interrupt based delivery of type 2 APF events is enabled
>> +        Bits 9-63: Reserved
>
> (may need to fix up the indents)
>
>> +
>> +	To switch to interrupt based delivery of type 2 APF events guests
>> +	are supposed to enable asynchronous page faults and set bit 3 in
>> +	MSR_KVM_ASYNC_PF_EN first.
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 42a2d0d3984a..6215f61450cb 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -763,12 +763,15 @@ struct kvm_vcpu_arch {
>>  		bool halted;
>>  		gfn_t gfns[roundup_pow_of_two(ASYNC_PF_PER_VCPU)];
>>  		struct gfn_to_hva_cache data;
>> -		u64 msr_val;
>> +		u64 msr_val; /* MSR_KVM_ASYNC_PF_EN */
>> +		u64 msr2_val; /* MSR_KVM_ASYNC_PF2 */
>> +		u16 vec;
>>  		u32 id;
>>  		bool send_user_only;
>>  		u32 host_apf_reason;
>>  		unsigned long nested_apf_token;
>>  		bool delivery_as_pf_vmexit;
>> +		bool delivery_as_int;
>>  	} apf;
>>  
>>  	/* OSVW MSRs (AMD only) */
>> diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
>> index df2ba34037a2..1bbb0b7e062f 100644
>> --- a/arch/x86/include/uapi/asm/kvm_para.h
>> +++ b/arch/x86/include/uapi/asm/kvm_para.h
>> @@ -50,6 +50,7 @@
>>  #define MSR_KVM_STEAL_TIME  0x4b564d03
>>  #define MSR_KVM_PV_EOI_EN      0x4b564d04
>>  #define MSR_KVM_POLL_CONTROL	0x4b564d05
>> +#define MSR_KVM_ASYNC_PF2	0x4b564d06
>>  
>>  struct kvm_steal_time {
>>  	__u64 steal;
>> @@ -81,6 +82,11 @@ struct kvm_clock_pairing {
>>  #define KVM_ASYNC_PF_ENABLED			(1 << 0)
>>  #define KVM_ASYNC_PF_SEND_ALWAYS		(1 << 1)
>>  #define KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT	(1 << 2)
>> +#define KVM_ASYNC_PF_DELIVERY_AS_INT		(1 << 3)
>> +
>> +#define KVM_ASYNC_PF2_VEC_MASK			GENMASK(7, 0)
>> +#define KVM_ASYNC_PF2_ENABLED			BIT(8)
>
> There are two enable bits, this one in ASYNC_PF_EN and the other old one in
> ASYNC_PF2.  Could it work with only one knob (e.g., set bit 0 of ASYNC_PF_EN
> always to enable apf)?  After all we have had bit 3 of ASYNC_PF_EN to show
> whether interrupt is enabled, which seems to be the real switch for whether to
> enable interrupt for apf.
>
> If we can keep the only knob in ASYNC_PF_EN (bit 0), iiuc we can also keep the
> below kvm_async_pf_wakeup_all() untouched (so we only set bit 0 of ASYNC_PF_EN
> after configure everything).

Yes,

as we need to write to two MSRs to configure the new mechanism, ordering
becomes important. If the guest writes to ASYNC_PF_EN first to establish
the shared memory structure, the interrupt vector in ASYNC_PF2 is not yet
set (and AFAIR '0' is a valid interrupt!), so if an async pf happens
immediately after that we'll be forced to inject interrupt 0 in the guest
and it'll get confused and likely miss the event.

We can probably mandate the reverse sequence: the guest has to set up the
interrupt in ASYNC_PF2 first and then write to ASYNC_PF_EN (with both
bit 0 and bit 3). In that case the additional 'enable' bit in ASYNC_PF2
seems redundant. This protocol doesn't look too complex for guests to
follow.
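
That mandated ordering could be modelled like this (a user-space sketch; wrmsr() is a stand-in for the real guest MSR write, and treating vector 0 as "not configured" follows the AFAIR caveat above):

```c
/* User-space model of the enable ordering discussed above: the guest
 * programs the vector via ASYNC_PF2 before flipping bits 0 and 3 of
 * ASYNC_PF_EN, so the host never has to inject vector 0.
 * MSR numbers match kvm_para.h; wrmsr() is not the real instruction. */
#include <assert.h>
#include <stdint.h>

#define MSR_KVM_ASYNC_PF_EN	0x4b564d02
#define MSR_KVM_ASYNC_PF2	0x4b564d06

static uint64_t msr_pf_en, msr_pf2;

/* Stand-in for the guest's MSR write path. */
static void wrmsr(uint32_t msr, uint64_t val)
{
	if (msr == MSR_KVM_ASYNC_PF_EN)
		msr_pf_en = val;
	else if (msr == MSR_KVM_ASYNC_PF2)
		msr_pf2 = val;
}

/* 1 if interrupt delivery, once requested, has a vector configured. */
static int enable_ordering_ok(void)
{
	int wants_int = (msr_pf_en & 1) && (msr_pf_en & (1 << 3));
	return !wants_int || (msr_pf2 & 0xff) != 0;
}

/* Step 1: program the vector; step 2: enable APF with bits 0 and 3. */
static void guest_enable(uint64_t shared_gpa, uint8_t vector)
{
	wrmsr(MSR_KVM_ASYNC_PF2, vector);
	wrmsr(MSR_KVM_ASYNC_PF_EN, shared_gpa | (1 << 3) | 1);
}
```

With this protocol the host can rely on ASYNC_PF2 carrying a usable vector by the time it observes bit 3 of ASYNC_PF_EN set.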

>
> Thanks,
>
>> +
>>  
>>  /* Operations for KVM_HC_MMU_OP */
>>  #define KVM_MMU_OP_WRITE_PTE            1
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 7c21c0cf0a33..861dce1e7cf5 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -1243,7 +1243,7 @@ static const u32 emulated_msrs_all[] = {
>>  	HV_X64_MSR_TSC_EMULATION_STATUS,
>>  
>>  	MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
>> -	MSR_KVM_PV_EOI_EN,
>> +	MSR_KVM_PV_EOI_EN, MSR_KVM_ASYNC_PF2,
>>  
>>  	MSR_IA32_TSC_ADJUST,
>>  	MSR_IA32_TSCDEADLINE,
>> @@ -2649,8 +2649,8 @@ static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
>>  {
>>  	gpa_t gpa = data & ~0x3f;
>>  
>> -	/* Bits 3:5 are reserved, Should be zero */
>> -	if (data & 0x38)
>> +	/* Bits 4:5 are reserved, Should be zero */
>> +	if (data & 0x30)
>>  		return 1;
>>  
>>  	vcpu->arch.apf.msr_val = data;
>> @@ -2667,7 +2667,35 @@ static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
>>  
>>  	vcpu->arch.apf.send_user_only = !(data & KVM_ASYNC_PF_SEND_ALWAYS);
>>  	vcpu->arch.apf.delivery_as_pf_vmexit = data & KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT;
>> -	kvm_async_pf_wakeup_all(vcpu);
>> +	vcpu->arch.apf.delivery_as_int = data & KVM_ASYNC_PF_DELIVERY_AS_INT;
>> +
>> +	/*
>> +	 * If delivery via interrupt is configured make sure MSR_KVM_ASYNC_PF2
>> +	 * was written to before sending 'wakeup all'.
>> +	 */
>> +	if (!vcpu->arch.apf.delivery_as_int ||
>> +	    vcpu->arch.apf.msr2_val & KVM_ASYNC_PF2_ENABLED)
>> +		kvm_async_pf_wakeup_all(vcpu);
>> +
>> +	return 0;
>> +}
>> +
>> +static int kvm_pv_enable_async_pf2(struct kvm_vcpu *vcpu, u64 data)
>> +{
>> +	/* Bits 9-63 are reserved */
>> +	if (data & ~0x1ff)
>> +		return 1;
>> +
>> +	if (!lapic_in_kernel(vcpu))
>> +		return 1;
>> +
>> +	vcpu->arch.apf.msr2_val = data;
>> +
>> +	vcpu->arch.apf.vec = data & KVM_ASYNC_PF2_VEC_MASK;
>> +
>> +	if (data & KVM_ASYNC_PF2_ENABLED)
>> +		kvm_async_pf_wakeup_all(vcpu);
>> +
>>  	return 0;
>>  }
>>  
>> @@ -2883,6 +2911,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>>  		if (kvm_pv_enable_async_pf(vcpu, data))
>>  			return 1;
>>  		break;
>> +	case MSR_KVM_ASYNC_PF2:
>> +		if (kvm_pv_enable_async_pf2(vcpu, data))
>> +			return 1;
>> +		break;
>>  	case MSR_KVM_STEAL_TIME:
>>  
>>  		if (unlikely(!sched_info_on()))
>> @@ -3159,6 +3191,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>>  	case MSR_KVM_ASYNC_PF_EN:
>>  		msr_info->data = vcpu->arch.apf.msr_val;
>>  		break;
>> +	case MSR_KVM_ASYNC_PF2:
>> +		msr_info->data = vcpu->arch.apf.msr2_val;
>> +		break;
>>  	case MSR_KVM_STEAL_TIME:
>>  		msr_info->data = vcpu->arch.st.msr_val;
>>  		break;
>> @@ -10367,6 +10402,16 @@ static int apf_get_user(struct kvm_vcpu *vcpu, u32 *val)
>>  				      sizeof(u32));
>>  }
>>  
>> +static bool apf_slot_free(struct kvm_vcpu *vcpu)
>> +{
>> +	u32 val;
>> +
>> +	if (apf_get_user(vcpu, &val))
>> +		return false;
>> +
>> +	return !val;
>> +}
>> +
>>  static bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu)
>>  {
>>  	if (!vcpu->arch.apf.delivery_as_pf_vmexit && is_guest_mode(vcpu))
>> @@ -10382,11 +10427,23 @@ static bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu)
>>  
>>  bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
>>  {
>> +	/*
>> +	 * TODO: when we are injecting a 'page present' event with an interrupt
>> +	 * we may ignore pending exceptions.
>> +	 */
>>  	if (unlikely(!lapic_in_kernel(vcpu) ||
>>  		     kvm_event_needs_reinjection(vcpu) ||
>>  		     vcpu->arch.exception.pending))
>>  		return false;
>>  
>> +	/*
>> +	 * Regardless of the type of event we're trying to deliver, we need to
>> +	 * check that the previous event was already consumed; this may not be
>> +	 * the case with interrupt based delivery.
>> +	 */
>> +	if (vcpu->arch.apf.delivery_as_int && !apf_slot_free(vcpu))
>> +		return false;
>> +
>>  	if (kvm_hlt_in_guest(vcpu->kvm) && !kvm_can_deliver_async_pf(vcpu))
>>  		return false;
Peter Xu April 30, 2020, 1:28 p.m. UTC | #6
On Thu, Apr 30, 2020 at 10:31:32AM +0200, Vitaly Kuznetsov wrote:
> as we need to write to two MSRs to configure the new mechanism ordering
> becomes important. If the guest writes to ASYNC_PF_EN first to establish
> the shared memory structure the interrupt in ASYNC_PF2 is not yet set
> (and AFAIR '0' is a valid interrupt!) so if an async pf happens
> immediately after that we'll be forced to inject INT0 in the guest and
> it'll get confused and likely miss the event.
> 
> We can probably mandate the reverse sequence: guest has to set up
> interrupt in ASYNC_PF2 first and then write to ASYNC_PF_EN (with both
> bit 0 and bit 3). In that case the additional 'enable' bit in ASYNC_PF2
> seems redundant. This protocol doesn't look too complex for guests to
> follow.

Yep, looks good.  We should also update the document about that.

Thanks,
Vitaly Kuznetsov April 30, 2020, 1:49 p.m. UTC | #7
Peter Xu <peterx@redhat.com> writes:

> On Thu, Apr 30, 2020 at 10:31:32AM +0200, Vitaly Kuznetsov wrote:
>> as we need to write to two MSRs to configure the new mechanism ordering
>> becomes important. If the guest writes to ASYNC_PF_EN first to establish
>> the shared memory structure the interrupt in ASYNC_PF2 is not yet set
>> (and AFAIR '0' is a valid interrupt!) so if an async pf happens
>> immediately after that we'll be forced to inject INT0 in the guest and
>> it'll get confused and likely miss the event.
>> 
>> We can probably mandate the reverse sequence: guest has to set up
>> interrupt in ASYNC_PF2 first and then write to ASYNC_PF_EN (with both
>> bit 0 and bit 3). In that case the additional 'enable' bit in ASYNC_PF2
>> seems redundant. This protocol doesn't look too complex for guests to
>> follow.
>
> Yep looks good.  We should also update the document too about the fact.
>

Of course. It will also mention the fact that the #PF-based mechanism is
now deprecated and that, unless the guest opts into interrupt based
delivery, async_pf is not functional (except for APF_HALT).
Vivek Goyal May 5, 2020, 3:22 p.m. UTC | #8
On Wed, Apr 29, 2020 at 11:36:31AM +0200, Vitaly Kuznetsov wrote:
> Concerns were expressed around APF delivery via synthetic #PF exception as
> in some cases such delivery may collide with real page fault. For type 2
> (page ready) notifications we can easily switch to using an interrupt
> instead. Introduce new MSR_KVM_ASYNC_PF2 mechanism.
> 
> One notable difference between the two mechanisms is that interrupt may not
> get handled immediately so whenever we would like to deliver next event
> (regardless of its type) we must be sure the guest had read and cleared
> previous event in the slot.
> 
> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> ---
>  Documentation/virt/kvm/msr.rst       | 38 +++++++++++---
>  arch/x86/include/asm/kvm_host.h      |  5 +-
>  arch/x86/include/uapi/asm/kvm_para.h |  6 +++
>  arch/x86/kvm/x86.c                   | 77 ++++++++++++++++++++++++++--
>  4 files changed, 113 insertions(+), 13 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/msr.rst b/Documentation/virt/kvm/msr.rst
> index 33892036672d..7433e55f7184 100644
> --- a/Documentation/virt/kvm/msr.rst
> +++ b/Documentation/virt/kvm/msr.rst
> @@ -203,14 +203,21 @@ data:
>  	the hypervisor at the time of asynchronous page fault (APF)
>  	injection to indicate type of asynchronous page fault. Value
>  	of 1 means that the page referred to by the page fault is not
> -	present. Value 2 means that the page is now available. Disabling
> -	interrupt inhibits APFs. Guest must not enable interrupt
> -	before the reason is read, or it may be overwritten by another
> -	APF. Since APF uses the same exception vector as regular page
> -	fault guest must reset the reason to 0 before it does
> -	something that can generate normal page fault.  If during page
> -	fault APF reason is 0 it means that this is regular page
> -	fault.
> +	present. Value 2 means that the page is now available.
> +
> +	Type 1 page (page missing) events are currently always delivered as
> +	synthetic #PF exception. Type 2 (page ready) are either delivered
> +	by #PF exception (when bit 3 of MSR_KVM_ASYNC_PF_EN is clear) or
> +	via an APIC interrupt (when bit 3 set). APIC interrupt delivery is
> +	controlled by MSR_KVM_ASYNC_PF2.
> +
> +	For #PF delivery, disabling interrupt inhibits APFs. Guest must
> +	not enable interrupt before the reason is read, or it may be
> +	overwritten by another APF. Since APF uses the same exception
> +	vector as regular page fault guest must reset the reason to 0
> +	before it does something that can generate normal page fault.
> +	If during pagefault APF reason is 0 it means that this is regular
> +	page fault.

Hi Vitaly,

Again, thinking about how errors will be delivered. Will these be using
the same interrupt path?

As you mentioned, if interrupts are disabled, APFs are blocked. That
means the host will fall back to a synchronous fault? If yes, that means
we will need a mechanism to report errors in the synchronous path too.

Thanks
Vivek

Patch

diff --git a/Documentation/virt/kvm/msr.rst b/Documentation/virt/kvm/msr.rst
index 33892036672d..7433e55f7184 100644
--- a/Documentation/virt/kvm/msr.rst
+++ b/Documentation/virt/kvm/msr.rst
@@ -203,14 +203,21 @@  data:
 	the hypervisor at the time of asynchronous page fault (APF)
 	injection to indicate type of asynchronous page fault. Value
 	of 1 means that the page referred to by the page fault is not
-	present. Value 2 means that the page is now available. Disabling
-	interrupt inhibits APFs. Guest must not enable interrupt
-	before the reason is read, or it may be overwritten by another
-	APF. Since APF uses the same exception vector as regular page
-	fault guest must reset the reason to 0 before it does
-	something that can generate normal page fault.  If during page
-	fault APF reason is 0 it means that this is regular page
-	fault.
+	present. Value 2 means that the page is now available.
+
+	Type 1 page (page missing) events are currently always delivered as
+	synthetic #PF exception. Type 2 (page ready) are either delivered
+	by #PF exception (when bit 3 of MSR_KVM_ASYNC_PF_EN is clear) or
+	via an APIC interrupt (when bit 3 set). APIC interrupt delivery is
+	controlled by MSR_KVM_ASYNC_PF2.
+
+	For #PF delivery, disabling interrupt inhibits APFs. Guest must
+	not enable interrupt before the reason is read, or it may be
+	overwritten by another APF. Since APF uses the same exception
+	vector as regular page fault guest must reset the reason to 0
+	before it does something that can generate normal page fault.
+	If during page fault APF reason is 0 it means that this is regular
+	page fault.
 
 	During delivery of type 1 APF cr2 contains a token that will
 	be used to notify a guest when missing page becomes
@@ -319,3 +326,18 @@  data:
 
 	KVM guests can request the host not to poll on HLT, for example if
 	they are performing polling themselves.
+
+MSR_KVM_ASYNC_PF2:
+	0x4b564d06
+
+data:
+	Second asynchronous page fault control MSR.
+
+	Bits 0-7: APIC vector for interrupt based delivery of type 2 APF
+	events (page ready notification).
+	Bit 8: Interrupt based delivery of type 2 APF events is enabled
+	Bits 9-63: Reserved
+
+	To switch to interrupt based delivery of type 2 APF events guests
+	are supposed to enable asynchronous page faults and set bit 3 in
+	MSR_KVM_ASYNC_PF_EN first.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 42a2d0d3984a..6215f61450cb 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -763,12 +763,15 @@  struct kvm_vcpu_arch {
 		bool halted;
 		gfn_t gfns[roundup_pow_of_two(ASYNC_PF_PER_VCPU)];
 		struct gfn_to_hva_cache data;
-		u64 msr_val;
+		u64 msr_val; /* MSR_KVM_ASYNC_PF_EN */
+		u64 msr2_val; /* MSR_KVM_ASYNC_PF2 */
+		u16 vec;
 		u32 id;
 		bool send_user_only;
 		u32 host_apf_reason;
 		unsigned long nested_apf_token;
 		bool delivery_as_pf_vmexit;
+		bool delivery_as_int;
 	} apf;
 
 	/* OSVW MSRs (AMD only) */
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index df2ba34037a2..1bbb0b7e062f 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -50,6 +50,7 @@ 
 #define MSR_KVM_STEAL_TIME  0x4b564d03
 #define MSR_KVM_PV_EOI_EN      0x4b564d04
 #define MSR_KVM_POLL_CONTROL	0x4b564d05
+#define MSR_KVM_ASYNC_PF2	0x4b564d06
 
 struct kvm_steal_time {
 	__u64 steal;
@@ -81,6 +82,11 @@  struct kvm_clock_pairing {
 #define KVM_ASYNC_PF_ENABLED			(1 << 0)
 #define KVM_ASYNC_PF_SEND_ALWAYS		(1 << 1)
 #define KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT	(1 << 2)
+#define KVM_ASYNC_PF_DELIVERY_AS_INT		(1 << 3)
+
+#define KVM_ASYNC_PF2_VEC_MASK			GENMASK(7, 0)
+#define KVM_ASYNC_PF2_ENABLED			BIT(8)
+
 
 /* Operations for KVM_HC_MMU_OP */
 #define KVM_MMU_OP_WRITE_PTE            1
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7c21c0cf0a33..861dce1e7cf5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1243,7 +1243,7 @@  static const u32 emulated_msrs_all[] = {
 	HV_X64_MSR_TSC_EMULATION_STATUS,
 
 	MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
-	MSR_KVM_PV_EOI_EN,
+	MSR_KVM_PV_EOI_EN, MSR_KVM_ASYNC_PF2,
 
 	MSR_IA32_TSC_ADJUST,
 	MSR_IA32_TSCDEADLINE,
@@ -2649,8 +2649,8 @@  static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
 {
 	gpa_t gpa = data & ~0x3f;
 
-	/* Bits 3:5 are reserved, Should be zero */
-	if (data & 0x38)
+	/* Bits 4:5 are reserved, Should be zero */
+	if (data & 0x30)
 		return 1;
 
 	vcpu->arch.apf.msr_val = data;
@@ -2667,7 +2667,35 @@  static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
 
 	vcpu->arch.apf.send_user_only = !(data & KVM_ASYNC_PF_SEND_ALWAYS);
 	vcpu->arch.apf.delivery_as_pf_vmexit = data & KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT;
-	kvm_async_pf_wakeup_all(vcpu);
+	vcpu->arch.apf.delivery_as_int = data & KVM_ASYNC_PF_DELIVERY_AS_INT;
+
+	/*
+	 * If delivery via interrupt is configured make sure MSR_KVM_ASYNC_PF2
+	 * was written to before sending 'wakeup all'.
+	 */
+	if (!vcpu->arch.apf.delivery_as_int ||
+	    vcpu->arch.apf.msr2_val & KVM_ASYNC_PF2_ENABLED)
+		kvm_async_pf_wakeup_all(vcpu);
+
+	return 0;
+}
+
+static int kvm_pv_enable_async_pf2(struct kvm_vcpu *vcpu, u64 data)
+{
+	/* Bits 9-63 are reserved */
+	if (data & ~0x1ff)
+		return 1;
+
+	if (!lapic_in_kernel(vcpu))
+		return 1;
+
+	vcpu->arch.apf.msr2_val = data;
+
+	vcpu->arch.apf.vec = data & KVM_ASYNC_PF2_VEC_MASK;
+
+	if (data & KVM_ASYNC_PF2_ENABLED)
+		kvm_async_pf_wakeup_all(vcpu);
+
 	return 0;
 }
 
@@ -2883,6 +2911,10 @@  int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		if (kvm_pv_enable_async_pf(vcpu, data))
 			return 1;
 		break;
+	case MSR_KVM_ASYNC_PF2:
+		if (kvm_pv_enable_async_pf2(vcpu, data))
+			return 1;
+		break;
 	case MSR_KVM_STEAL_TIME:
 
 		if (unlikely(!sched_info_on()))
@@ -3159,6 +3191,9 @@  int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_KVM_ASYNC_PF_EN:
 		msr_info->data = vcpu->arch.apf.msr_val;
 		break;
+	case MSR_KVM_ASYNC_PF2:
+		msr_info->data = vcpu->arch.apf.msr2_val;
+		break;
 	case MSR_KVM_STEAL_TIME:
 		msr_info->data = vcpu->arch.st.msr_val;
 		break;
@@ -10367,6 +10402,16 @@  static int apf_get_user(struct kvm_vcpu *vcpu, u32 *val)
 				      sizeof(u32));
 }
 
+static bool apf_slot_free(struct kvm_vcpu *vcpu)
+{
+	u32 val;
+
+	if (apf_get_user(vcpu, &val))
+		return false;
+
+	return !val;
+}
+
 static bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu)
 {
 	if (!vcpu->arch.apf.delivery_as_pf_vmexit && is_guest_mode(vcpu))
@@ -10382,11 +10427,23 @@  static bool kvm_can_deliver_async_pf(struct kvm_vcpu *vcpu)
 
 bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
 {
+	/*
+	 * TODO: when we are injecting a 'page present' event with an interrupt
+	 * we may ignore pending exceptions.
+	 */
 	if (unlikely(!lapic_in_kernel(vcpu) ||
 		     kvm_event_needs_reinjection(vcpu) ||
 		     vcpu->arch.exception.pending))
 		return false;
 
+	/*
+	 * Regardless of the type of event we're trying to deliver, we need to
+	 * check that the previous event was already consumed; this may not be
+	 * the case with interrupt based delivery.
+	 */
+	if (vcpu->arch.apf.delivery_as_int && !apf_slot_free(vcpu))
+		return false;
+
 	if (kvm_hlt_in_guest(vcpu->kvm) && !kvm_can_deliver_async_pf(vcpu))
 		return false;
 
@@ -10441,6 +10498,8 @@  void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
 
 	if (vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED &&
 	    !apf_put_user(vcpu, KVM_PV_REASON_PAGE_READY, work->arch.token)) {
+		if (!vcpu->arch.apf.delivery_as_int) {
+			/* Page ready delivery via #PF */
 			fault.vector = PF_VECTOR;
 			fault.error_code_valid = true;
 			fault.error_code = 0;
@@ -10448,6 +10507,16 @@  void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
 			fault.address = work->arch.token;
 			fault.async_page_fault = true;
 			kvm_inject_page_fault(vcpu, &fault);
+		} else if (vcpu->arch.apf.msr2_val & KVM_ASYNC_PF2_ENABLED) {
+			/* Page ready delivery via interrupt */
+			struct kvm_lapic_irq irq = {
+				.delivery_mode = APIC_DM_FIXED,
+				.vector = vcpu->arch.apf.vec
+			};
+
+			/* Assuming LAPIC is enabled */
+			kvm_apic_set_irq(vcpu, &irq, NULL);
+		}
 	}
 	vcpu->arch.apf.halted = false;
 	vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;