diff mbox series

[v2,12/25] KVM: VMX: Handle FRED event data

Message ID 20240207172646.3981-13-xin3.li@intel.com (mailing list archive)
State New, archived
Headers show
Series Enable FRED with KVM VMX | expand

Commit Message

Li, Xin3 Feb. 7, 2024, 5:26 p.m. UTC
Set injected-event data when injecting a #PF, #DB, or #NM caused
by extended feature disable using FRED event delivery, and save
original-event data so it can later be used as injected-event data
when the original event is re-injected.

Unlike IDT event delivery, which uses extra CPU registers as part of
an event context (e.g., %cr2 for #PF), FRED saves a complete event
context in its stack frame, e.g., FRED saves the faulting linear
address of a #PF into the event data field defined in its stack frame.

Thus a new VMX control field called injected-event data is added
to provide the event data that will be pushed into a FRED stack
frame for VM entries that inject an event using FRED event delivery.
In addition, a new VM exit information field called original-event
data is added to store the event data that would have been saved into
a FRED stack frame for VM exits that occur during FRED event delivery.
After such a VM exit is handled and the original event is ready to be
delivered again, the data in the original-event data VMCS field needs
to be copied into the injected-event data VMCS field so that the
original event is injected with the same event data.
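
For illustration, the propagation on re-injection could look like the
following minimal sketch (the helper name is hypothetical; the VMCS field
names and accessors are the ones used by this patch):

	static void fred_propagate_event_data(struct kvm_vcpu *vcpu)
	{
		/*
		 * Carry the event data that the CPU saved for a VM exit that
		 * occurred during FRED event delivery over to the field used
		 * for injection, so the re-injected event pushes the same
		 * event data onto the guest's FRED stack frame.
		 */
		u64 event_data = vmcs_read64(ORIGINAL_EVENT_DATA);

		vmcs_write64(INJECTED_EVENT_DATA, event_data);
	}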

Signed-off-by: Xin Li <xin3.li@intel.com>
Tested-by: Shan Kang <shan.kang@intel.com>
---

Changes since v1:
* Document that event data should be equal to CR2/DR6/IA32_XFD_ERR instead
  of using WARN_ON() (Chao Gao).
* Zero event data if a #NM was not caused by extended feature disable
  (Chao Gao).
---
 arch/x86/include/asm/vmx.h |   4 ++
 arch/x86/kvm/vmx/vmx.c     | 109 ++++++++++++++++++++++++++++---------
 arch/x86/kvm/vmx/vmx.h     |   1 +
 arch/x86/kvm/x86.c         |  10 +++-
 4 files changed, 95 insertions(+), 29 deletions(-)

Comments

Chao Gao April 30, 2024, 3:14 a.m. UTC | #1
>index ee61d2c25cb0..f622fb90a098 100644
>--- a/arch/x86/kvm/vmx/vmx.c
>+++ b/arch/x86/kvm/vmx/vmx.c
>@@ -1871,9 +1871,29 @@ static void vmx_inject_exception(struct kvm_vcpu *vcpu)
>                vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
>                             vmx->vcpu.arch.event_exit_inst_len);
>                intr_info |= INTR_TYPE_SOFT_EXCEPTION;
>-       } else
>+       } else {
>                intr_info |= INTR_TYPE_HARD_EXCEPTION;
>
>+               if (kvm_is_fred_enabled(vcpu)) {
>+                       u64 event_data = 0;
>+
>+                       if (is_debug(intr_info))
>+                               /*
>+                                * Compared to DR6, FRED #DB event data saved on
>+                                * the stack frame have bits 4 ~ 11 and 16 ~ 31
>+                                * inverted, i.e.,
>+                                *   fred_db_event_data = dr6 ^ 0xFFFF0FF0UL
>+                                */
>+                               event_data = vcpu->arch.dr6 ^ DR6_RESERVED;
>+                       else if (is_page_fault(intr_info))
>+                               event_data = vcpu->arch.cr2;
>+                       else if (is_nm_fault(intr_info))
>+                               event_data = to_vmx(vcpu)->fred_xfd_event_data;
>+

IMO, deriving an event_data from CR2/DR6 is a little short-sighted because the
event_data and CR2/DR6 __can__ be different, e.g., L1 VMM __can__ set CR2 to A
and event_data field to B (!=A) when injecting #PF.

And this approach cannot be extended to handle a (future) exception whose
event_data isn't tied to a dedicated register like CR2/DR6.

Adding a new field fred_xfd_event_data in struct vcpu has problems too:
fred_xfd_event_data gets lost during migration; strictly speaking, event_data
is tied to an exception rather than a CPU. E.g., the CPU may detect a nested
exception while delivering one, and both have their own event_data.

I think we can make event_data a property of exceptions, i.e., add a payload2
to struct kvm_queued_exception, and add new APIs to the kvm_queue_exception*
family to accept a payload2. In VMX code, just program payload2 into the VMCS
event_data field if FRED is enabled. The KVM ABI should be extended as well to
pass payload2 to userspace, like how the payload is handled in
kvm_vcpu_ioctl_x86_get/put_vcpu_events.
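
For illustration only, a rough sketch of that idea (the payload2 field and
the *_p2 helper below are hypothetical, not existing KVM interfaces):

	struct kvm_queued_exception {
		/* ... existing fields ... */
		bool has_payload;
		unsigned long payload;
		u64 payload2;	/* hypothetical: e.g. FRED event data */
	};

	/* Hypothetical API mirroring kvm_queue_exception_p(). */
	void kvm_queue_exception_p2(struct kvm_vcpu *vcpu, unsigned int nr,
				    unsigned long payload, u64 payload2);

	/* In vmx_inject_exception(), when FRED is enabled for the guest: */
	if (kvm_is_fred_enabled(vcpu))
		vmcs_write64(INJECTED_EVENT_DATA, vcpu->arch.exception.payload2);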
Li, Xin3 May 10, 2024, 9:36 a.m. UTC | #2
> >+               if (kvm_is_fred_enabled(vcpu)) {
> >+                       u64 event_data = 0;
> >+
> >+                       if (is_debug(intr_info))
> >+                               /*
> >+                                * Compared to DR6, FRED #DB event data saved on
> >+                                * the stack frame have bits 4 ~ 11 and 16 ~ 31
> >+                                * inverted, i.e.,
> >+                                *   fred_db_event_data = dr6 ^ 0xFFFF0FF0UL
> >+                                */
> >+                               event_data = vcpu->arch.dr6 ^ DR6_RESERVED;
> >+                       else if (is_page_fault(intr_info))
> >+                               event_data = vcpu->arch.cr2;
> >+                       else if (is_nm_fault(intr_info))
> >+                               event_data =
> >+ to_vmx(vcpu)->fred_xfd_event_data;
> >+
> 
> IMO, deriving an event_data from CR2/DR6 is a little short-sighted because the
> event_data and CR2/DR6 __can__ be different, e.g., L1 VMM __can__ set CR2 to A
> and event_data field to B (!=A) when injecting #PF.

VMM should guarantee a FRED guest _sees_ consistent values in CR2/DR6
and event data. If not, it's just a VMM bug that we need to fix.

> 
> And this approach cannot be extended to handle a (future) exception whose
> event_data isn't tied to a dedicated register like CR2/DR6.

See below.

> Adding a new field fred_xfd_event_data in struct vcpu has problems too:
> fred_xfd_event_data gets lost during migration;

I'm not bothered, because this is not hard to fix, right?

> strickly speaking, event_data is tied
> to an exception rather than a CPU. e.g., the CPU may detect a nested exception when
> delivering one and both have their own event_data.

No, don't get me wrong. Event data has to be _regenerated_ after
a nested exception is handled and the original instruction flow is
restarted; sometimes the original event could even be gone.

We shouldn't say event data is tied to an exception or to a CPU;
that is just confusing, or misleading.

> I think we can make event_data a property of exceptions. i.e., add a payload2 to
> struct kvm_queued_exception. and add new APIs to kvm_queue_exception* family to
> accept a payload2 and in VMX code, just program payload2 to the VMCS event_data
> field if FRED is enabled. KVM ABI should be extended as well to pass
> payload2 to userspace like how the payload is handled in
> kvm_vcpu_ioctl_x86_get/put_vcpu_events.

Yes, it's very likely that we will need to add a payload2 in the future,
but NOT now. Two reasons:

1) The first-generation FRED is designed to NOT go too far from what
   IDT can do. FRED event data is conceptually an alias of CR2/DR6
   in the latest FRED spec (not considering xfd event data for now),
   and the existing payload is a nice match for now;

2) FRED is an extensible CPU architecture, which allows the structure
   of event data to become much bigger and more complicated. Let's not
   assume anything and add a payload2 too early.
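
For illustration, the existing payload mechanism referred to above boils
down to the following (a sketch assuming the current kvm_queue_exception_p()
and kvm_queue_exception_e_p() helpers; dr6, error_code and fault_address
stand in for whatever values the caller has at hand):

	/* #DB: the DR6 value is the payload, merged into vcpu->arch.dr6. */
	kvm_queue_exception_p(vcpu, DB_VECTOR, dr6);

	/* #PF: the faulting linear address is the payload, landing in CR2. */
	kvm_queue_exception_e_p(vcpu, PF_VECTOR, error_code, fault_address);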
Chao Gao May 11, 2024, 3:03 a.m. UTC | #3
On Fri, May 10, 2024 at 05:36:03PM +0800, Li, Xin3 wrote:
>> >+               if (kvm_is_fred_enabled(vcpu)) {
>> >+                       u64 event_data = 0;
>> >+
>> >+                       if (is_debug(intr_info))
>> >+                               /*
>> >+                                * Compared to DR6, FRED #DB event data saved on
>> >+                                * the stack frame have bits 4 ~ 11 and 16 ~ 31
>> >+                                * inverted, i.e.,
>> >+                                *   fred_db_event_data = dr6 ^ 0xFFFF0FF0UL
>> >+                                */
>> >+                               event_data = vcpu->arch.dr6 ^ DR6_RESERVED;
>> >+                       else if (is_page_fault(intr_info))
>> >+                               event_data = vcpu->arch.cr2;
>> >+                       else if (is_nm_fault(intr_info))
>> >+                               event_data =
>> >+ to_vmx(vcpu)->fred_xfd_event_data;
>> >+
>> 
>> IMO, deriving an event_data from CR2/DR6 is a little short-sighted because the
>> event_data and CR2/DR6 __can__ be different, e.g., L1 VMM __can__ set CR2 to A
>> and event_data field to B (!=A) when injecting #PF.
>
>VMM should guarantee a FRED guest _sees_ consistent values in CR6/DR6
>and event data. If not it's just a VMM bug that we need to fix.

I don't get why the VMM should.

I know the hardware will guarantee this. And likely KVM will also do this.
But I don't think it is necessary for KVM to assume L1 VMM will guarantee
this, because as long as the L2 guest is enlightened to read event_data from
the stack only, the ABI between L1 VMM and L2 guest can be: CR2/DR6 may be
out of sync with the event_data. I am not saying it is good that L1 VMM
deviates from the real hardware behavior. But how L1 VMM defines this ABI
with L2 has nothing to do with KVM as L0. KVM shouldn't make assumptions
about that.
Sean Christopherson June 12, 2024, 10:52 p.m. UTC | #4
On Wed, Feb 07, 2024, Xin Li wrote:
> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index 4889754415b5..6b796c5c9c2b 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -256,8 +256,12 @@ enum vmcs_field {
>  	PID_POINTER_TABLE_HIGH		= 0x00002043,
>  	SECONDARY_VM_EXIT_CONTROLS	= 0x00002044,
>  	SECONDARY_VM_EXIT_CONTROLS_HIGH	= 0x00002045,
> +	INJECTED_EVENT_DATA		= 0x00002052,
> +	INJECTED_EVENT_DATA_HIGH	= 0x00002053,
>  	GUEST_PHYSICAL_ADDRESS          = 0x00002400,
>  	GUEST_PHYSICAL_ADDRESS_HIGH     = 0x00002401,
> +	ORIGINAL_EVENT_DATA		= 0x00002404,
> +	ORIGINAL_EVENT_DATA_HIGH	= 0x00002405,

Are these the actual names from the SDM?  E.g. is there no FRED_ prefix to clue
in readers that they are FRED specific? (unless they aren't FRED specific?)

>  	VMCS_LINK_POINTER               = 0x00002800,
>  	VMCS_LINK_POINTER_HIGH          = 0x00002801,
>  	GUEST_IA32_DEBUGCTL             = 0x00002802,
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index ee61d2c25cb0..f622fb90a098 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -1871,9 +1871,29 @@ static void vmx_inject_exception(struct kvm_vcpu *vcpu)
>  		vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
>  			     vmx->vcpu.arch.event_exit_inst_len);
>  		intr_info |= INTR_TYPE_SOFT_EXCEPTION;
> -	} else
> +	} else {
>  		intr_info |= INTR_TYPE_HARD_EXCEPTION;
>  
> +		if (kvm_is_fred_enabled(vcpu)) {
> +			u64 event_data = 0;
> +
> +			if (is_debug(intr_info))
> +				/*
> +				 * Compared to DR6, FRED #DB event data saved on
> +				 * the stack frame have bits 4 ~ 11 and 16 ~ 31
> +				 * inverted, i.e.,
> +				 *   fred_db_event_data = dr6 ^ 0xFFFF0FF0UL
> +				 */
> +				event_data = vcpu->arch.dr6 ^ DR6_RESERVED;
> +			else if (is_page_fault(intr_info))
> +				event_data = vcpu->arch.cr2;
> +			else if (is_nm_fault(intr_info))
> +				event_data = to_vmx(vcpu)->fred_xfd_event_data;
> +
> +			vmcs_write64(INJECTED_EVENT_DATA, event_data);
> +		}
> +	}
> +
>  	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, intr_info);
>  
>  	vmx_clear_hlt(vcpu);
> @@ -7082,8 +7102,11 @@ static void handle_nm_fault_irqoff(struct kvm_vcpu *vcpu)
>  	 *
>  	 * Queuing exception is done in vmx_handle_exit. See comment there.
>  	 */
> -	if (vcpu->arch.guest_fpu.fpstate->xfd)
> +	if (vcpu->arch.guest_fpu.fpstate->xfd) {
>  		rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
> +		to_vmx(vcpu)->fred_xfd_event_data = vcpu->arch.cr0 & X86_CR0_TS

kvm_is_cr0_bit_set(), don't read vcpu->arch.cr0 directly.

> +			? 0 : vcpu->arch.guest_fpu.xfd_err;

Maybe this?

		if (kvm_is_cr0_bit_set(vcpu, X86_CR0_TS))
			to_vmx(vcpu)->fred_xfd_event_data = 0;
		else
			to_vmx(vcpu)->fred_xfd_event_data = vcpu->arch.guest_fpu.xfd_err;

Hmm, but why does this need to be cached _now_?  I.e. why does fred_xfd_event_data
need to exist?  Wouldn't it be simpler and more robust to use vcpu->arch.guest_fpu.xfd_err
directly in vmx_inject_exception()?

> +	}
>  }
>  
>  static void handle_exception_irqoff(struct vcpu_vmx *vmx)
> @@ -7199,29 +7222,28 @@ static void vmx_recover_nmi_blocking(struct vcpu_vmx *vmx)
>  					      vmx->loaded_vmcs->entry_time));
>  }
>  
> -static void __vmx_complete_interrupts(struct kvm_vcpu *vcpu,
> -				      u32 idt_vectoring_info,
> -				      int instr_len_field,
> -				      int error_code_field)
> +static void __vmx_complete_interrupts(struct kvm_vcpu *vcpu, bool vectoring)
>  {
> -	u8 vector;
> -	int type;
> -	bool idtv_info_valid;
> -
> -	idtv_info_valid = idt_vectoring_info & VECTORING_INFO_VALID_MASK;
> +	u32 event_id = vectoring ? to_vmx(vcpu)->idt_vectoring_info
> +				 : vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);


Preferred style for ternary operators is:

	u32 event_id = vectoring ? to_vmx(vcpu)->idt_vectoring_info :
				   vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);

That said, I don't think this is a net positive versus passing in all params.
The bare true/false is somewhat inscrutable, and in this code, it's hard to
understand why KVM looks at X instead of Y without the context of the caller.

> +	int instr_len_field = vectoring ? VM_EXIT_INSTRUCTION_LEN
> +					: VM_ENTRY_INSTRUCTION_LEN;
> +	int error_code_field = vectoring ? IDT_VECTORING_ERROR_CODE
> +					 : VM_ENTRY_EXCEPTION_ERROR_CODE;
> +	int event_data_field = vectoring ? ORIGINAL_EVENT_DATA
> +					 : INJECTED_EVENT_DATA;
> +	u8 vector = event_id & INTR_INFO_VECTOR_MASK;
> +	int type = event_id & INTR_INFO_INTR_TYPE_MASK;
>  
>  	vcpu->arch.nmi_injected = false;
>  	kvm_clear_exception_queue(vcpu);
>  	kvm_clear_interrupt_queue(vcpu);
>  
> -	if (!idtv_info_valid)
> +	if (!(event_id & INTR_INFO_VALID_MASK))
>  		return;
>  
>  	kvm_make_request(KVM_REQ_EVENT, vcpu);
>  
> -	vector = idt_vectoring_info & VECTORING_INFO_VECTOR_MASK;
> -	type = idt_vectoring_info & VECTORING_INFO_TYPE_MASK;
> -
>  	switch (type) {
>  	case INTR_TYPE_NMI_INTR:
>  		vcpu->arch.nmi_injected = true;
> @@ -7236,10 +7258,31 @@ static void __vmx_complete_interrupts(struct kvm_vcpu *vcpu,
>  		vcpu->arch.event_exit_inst_len = vmcs_read32(instr_len_field);
>  		fallthrough;
>  	case INTR_TYPE_HARD_EXCEPTION:
> -		if (idt_vectoring_info & VECTORING_INFO_DELIVER_CODE_MASK) {
> -			u32 err = vmcs_read32(error_code_field);
> -			kvm_requeue_exception_e(vcpu, vector, err);
> -		} else
> +		if (kvm_is_fred_enabled(vcpu)) {
> +			/* Save event data for being used as injected-event data */
> +			u64 event_data = vmcs_read64(event_data_field);
> +
> +			switch (vector) {
> +			case DB_VECTOR:
> +				/* %dr6 should be equal to (event_data ^ DR6_RESERVED) */

DR6, no need to use assembly syntax, but I'd just drop this comment, as well as
the CR2 comment.  They add no insight beyond what the code literally does.

> +				vcpu->arch.dr6 = event_data ^ DR6_RESERVED;
> +				break;
> +			case NM_VECTOR:
> +				to_vmx(vcpu)->fred_xfd_event_data = event_data;
> +				break;
> +			case PF_VECTOR:
> +				/* %cr2 should be equal to event_data */
> +				vcpu->arch.cr2 = event_data;
> +				break;
> +			default:
> +				WARN_ON(event_data != 0);
> +				break;
> +			}
> +		}
Sean Christopherson June 12, 2024, 11:31 p.m. UTC | #5
On Sat, May 11, 2024, Chao Gao wrote:
> On Fri, May 10, 2024 at 05:36:03PM +0800, Li, Xin3 wrote:
> >> >+               if (kvm_is_fred_enabled(vcpu)) {
> >> >+                       u64 event_data = 0;
> >> >+
> >> >+                       if (is_debug(intr_info))
> >> >+                               /*
> >> >+                                * Compared to DR6, FRED #DB event data saved on
> >> >+                                * the stack frame have bits 4 ~ 11 and 16 ~ 31
> >> >+                                * inverted, i.e.,
> >> >+                                *   fred_db_event_data = dr6 ^ 0xFFFF0FF0UL
> >> >+                                */
> >> >+                               event_data = vcpu->arch.dr6 ^ DR6_RESERVED;
> >> >+                       else if (is_page_fault(intr_info))
> >> >+                               event_data = vcpu->arch.cr2;
> >> >+                       else if (is_nm_fault(intr_info))
> >> >+                               event_data =
> >> >+ to_vmx(vcpu)->fred_xfd_event_data;
> >> >+
> >> 
> >> IMO, deriving an event_data from CR2/DR6 is a little short-sighted because the
> >> event_data and CR2/DR6 __can__ be different, e.g., L1 VMM __can__ set CR2 to A
> >> and event_data field to B (!=A) when injecting #PF.
> >
> >VMM should guarantee a FRED guest _sees_ consistent values in CR6/DR6
> >and event data. If not it's just a VMM bug that we need to fix.
> 
> I don't get why VMM should.
> 
> I know the hardware will guarantee this. And likely KVM will also do this.
> but I don't think it is necessary for KVM to assume L1 VMM will guarantee
> this. because as long as L2 guest is enlightened to read event_data from stack
> only, the ABI between L1 VMM and L2 guest can be: CR2/DR6 may be out of sync
> with the event_data. I am not saying it is good that L1 VMM deviates from the
> real hardware behavior. But how L1 VMM defines this ABI with L2 has nothing to
> do with KVM as L0. KVM shouldn't make assumptions on that.

Right, but in that case the propagation of event_data would be from vmcs12 =>
vmcs02, which is handled by prepare_vmcs02_early().

This flow specifically handles exception injection from _L0 KVM_, in which
case KVM should always follow the architectural behavior.

Ahh, but the code in __vmx_complete_interrupts() is wrong.  Overwriting
vcpu->arch.{dr6,cr2} is wrong, because there's no telling what was in vmcs02.
And even if vmcs02 holds DR6/CR2 values, those might be L2 values, i.e. they
shouldn't clobber the vCPU state.

It's not clear to me that we need to do anything new for FRED in
__vmx_complete_interrupts().  The relevant VMCS fields should already hold the
correct values; there's no reason to clobber vCPU state.  The reason KVM grabs
things like instruction length and error code is because that information is
visible to other aspects of injection, e.g. to adjust RIP and push the error
code on the stack.
Chao Gao June 13, 2024, 5:29 a.m. UTC | #6
On Wed, Jun 12, 2024 at 04:31:35PM -0700, Sean Christopherson wrote:
>On Sat, May 11, 2024, Chao Gao wrote:
>> On Fri, May 10, 2024 at 05:36:03PM +0800, Li, Xin3 wrote:
>> >> >+               if (kvm_is_fred_enabled(vcpu)) {
>> >> >+                       u64 event_data = 0;
>> >> >+
>> >> >+                       if (is_debug(intr_info))
>> >> >+                               /*
>> >> >+                                * Compared to DR6, FRED #DB event data saved on
>> >> >+                                * the stack frame have bits 4 ~ 11 and 16 ~ 31
>> >> >+                                * inverted, i.e.,
>> >> >+                                *   fred_db_event_data = dr6 ^ 0xFFFF0FF0UL
>> >> >+                                */
>> >> >+                               event_data = vcpu->arch.dr6 ^ DR6_RESERVED;
>> >> >+                       else if (is_page_fault(intr_info))
>> >> >+                               event_data = vcpu->arch.cr2;
>> >> >+                       else if (is_nm_fault(intr_info))
>> >> >+                               event_data =
>> >> >+ to_vmx(vcpu)->fred_xfd_event_data;
>> >> >+
>> >> 
>> >> IMO, deriving an event_data from CR2/DR6 is a little short-sighted because the
>> >> event_data and CR2/DR6 __can__ be different, e.g., L1 VMM __can__ set CR2 to A
>> >> and event_data field to B (!=A) when injecting #PF.
>> >
>> >VMM should guarantee a FRED guest _sees_ consistent values in CR6/DR6
>> >and event data. If not it's just a VMM bug that we need to fix.
>> 
>> I don't get why VMM should.
>> 
>> I know the hardware will guarantee this. And likely KVM will also do this.
>> but I don't think it is necessary for KVM to assume L1 VMM will guarantee
>> this. because as long as L2 guest is enlightened to read event_data from stack
>> only, the ABI between L1 VMM and L2 guest can be: CR2/DR6 may be out of sync
>> with the event_data. I am not saying it is good that L1 VMM deviates from the
>> real hardware behavior. But how L1 VMM defines this ABI with L2 has nothing to
>> do with KVM as L0. KVM shouldn't make assumptions on that.
>
>Right, but in that case the propagation of event_data would be from vmcs12 =>
>vmcs02, which is handled by prepare_vmcs02_early().

Yes. But delivering this event to L2 may cause a VM exit, so L0 KVM may need to
re-inject this event ...

>
>For this flow, it specifically handles exception injection from _L0 KVM_, in which
>case KVM should always follow the architectural behavior.

... and go through this exception injection flow. For such an event, there is no
guarantee that the associated event data is consistent with the vCPU's
DR6/CR2/XFD_ERR.

>
>Ahh, but the code in with __vmx_complete_interrupts() is wrong.  Overwriting
>vcpu->arch.{dr6,cr2} is wrong, because theres no telling what was in vmcs02.
>And even if vmcs02 holds DR6/CR2 values, those might be L2 values, i.e. shouldn't
>clobber the vCPU state.

Exactly.

>
>It's not clear to me that we need to do anything new for FRED in
>__vmx_complete_interrupts().  The relevant VMCS fields should already hold the
>correct values, there's no reason to clobber vCPU state.  The reason KVM grabs

The whole point is to cache the ORIGINAL_EVENT_DATA VMCS field so that KVM can
set it back to the INJECTED_EVENT_DATA VMCS field when reinjecting the pending
event in IDT-vectoring information.

>things like instruction length and error code is because that information is
>visible to other aspects of injection, e.g. to adjust RIP and pushed the error
>code on the stack.
Sean Christopherson June 13, 2024, 4:57 p.m. UTC | #7
On Wed, Feb 07, 2024, Xin Li wrote:
> @@ -7382,6 +7419,24 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
>  
>  	vmx_disable_fb_clear(vmx);
>  
> +	/*
> +	 * %cr2 needs to be saved after a VM exit and restored before a VM
> +	 * entry in case a VM exit happens immediately after delivery of a
> +	 * guest #PF but before guest reads %cr2.
> +	 *
> +	 * A FRED guest should read its #PF faulting linear address from
> +	 * the event data field in its FRED stack frame instead of %cr2.
> +	 * But the FRED 5.0 spec still requires a FRED CPU to update %cr2
> +	 * in the normal way, thus %cr2 is still updated even for a FRED
> +	 * guest.
> +	 *
> +	 * Note, an NMI could interrupt KVM:
> +	 *   1) after VM exit but before CR2 is saved.
> +	 *   2) after CR2 is restored but before VM entry.
> +	 * And a #PF could happen durng NMI handlng, which overwrites %cr2.
> +	 * Thus exc_nmi() should save and restore %cr2 upon entering and
> +	 * before leaving to make sure %cr2 not corrupted.
> +	 */

This is 99.9% noise.  What software does or does not do with respect to CR2 is
completely irrelevant.  The *only* thing that matters is the architectural
behavior, and architecturally guest CR2 _must_ be up-to-date at all times because
CR2 accesses cannot be intercepted.  So, just say:

	/*
	 * Note, even though FRED delivers the faulting linear address via the
	 * event data field on the stack, CR2 is still updated.
	 */

>  	if (vcpu->arch.cr2 != native_read_cr2())
>  		native_write_cr2(vcpu->arch.cr2);
>
Sean Christopherson June 13, 2024, 5:57 p.m. UTC | #8
On Thu, Jun 13, 2024, Chao Gao wrote:
> On Wed, Jun 12, 2024 at 04:31:35PM -0700, Sean Christopherson wrote:
> >On Sat, May 11, 2024, Chao Gao wrote:
> >> On Fri, May 10, 2024 at 05:36:03PM +0800, Li, Xin3 wrote:
> >> >> >+               if (kvm_is_fred_enabled(vcpu)) {
> >> >> >+                       u64 event_data = 0;
> >> >> >+
> >> >> >+                       if (is_debug(intr_info))
> >> >> >+                               /*
> >> >> >+                                * Compared to DR6, FRED #DB event data saved on
> >> >> >+                                * the stack frame have bits 4 ~ 11 and 16 ~ 31
> >> >> >+                                * inverted, i.e.,
> >> >> >+                                *   fred_db_event_data = dr6 ^ 0xFFFF0FF0UL
> >> >> >+                                */
> >> >> >+                               event_data = vcpu->arch.dr6 ^ DR6_RESERVED;
> >> >> >+                       else if (is_page_fault(intr_info))
> >> >> >+                               event_data = vcpu->arch.cr2;
> >> >> >+                       else if (is_nm_fault(intr_info))
> >> >> >+                               event_data =
> >> >> >+ to_vmx(vcpu)->fred_xfd_event_data;
> >> >> >+
> >> >> 
> >> >> IMO, deriving an event_data from CR2/DR6 is a little short-sighted because the
> >> >> event_data and CR2/DR6 __can__ be different, e.g., L1 VMM __can__ set CR2 to A
> >> >> and event_data field to B (!=A) when injecting #PF.
> >> >
> >> >VMM should guarantee a FRED guest _sees_ consistent values in CR6/DR6
> >> >and event data. If not it's just a VMM bug that we need to fix.
> >> 
> >> I don't get why VMM should.
> >> 
> >> I know the hardware will guarantee this. And likely KVM will also do this.
> >> but I don't think it is necessary for KVM to assume L1 VMM will guarantee
> >> this. because as long as L2 guest is enlightened to read event_data from stack
> >> only, the ABI between L1 VMM and L2 guest can be: CR2/DR6 may be out of sync
> >> with the event_data. I am not saying it is good that L1 VMM deviates from the
> >> real hardware behavior. But how L1 VMM defines this ABI with L2 has nothing to
> >> do with KVM as L0. KVM shouldn't make assumptions on that.
> >
> >Right, but in that case the propagation of event_data would be from vmcs12 =>
> >vmcs02, which is handled by prepare_vmcs02_early().
> 
> Yes. But delivering this event to L2 may cause VM-exit. So, L0 KVM may need to
> re-inject this event ...
> 
> >
> >For this flow, it specifically handles exception injection from _L0 KVM_, in which
> >case KVM should always follow the architectural behavior.
> 
> ... and go through this exception injection flow. For such an event, there is no
> guarantee that the associated event data is consistent with the vCPU's
> DR6/CR2/XFD_ERR.
> 
> >
> >Ahh, but the code in with __vmx_complete_interrupts() is wrong.  Overwriting
> >vcpu->arch.{dr6,cr2} is wrong, because theres no telling what was in vmcs02.
> >And even if vmcs02 holds DR6/CR2 values, those might be L2 values, i.e. shouldn't
> >clobber the vCPU state.
> 
> Exactly.
> 
> >
> >It's not clear to me that we need to do anything new for FRED in
> >__vmx_complete_interrupts().  The relevant VMCS fields should already hold the
> >correct values, there's no reason to clobber vCPU state.  The reason KVM grabs
> 
> The whole point is to cache the ORIGINAL_EVENT_DATA VMCS field so that KVM can
> set it back to the INJECTED_EVENT_DATA VMCS field when reinjecting the pending
> event in IDT-vectoring information.

Hrm, right.  I was thinking INJECTED_EVENT_DATA would already hold the correct
data, but that's only true when the VM-Exit occurred on an injected event, i.e.
when KVM already set the relevant fields.  Ah, and not capturing the state would
lead to loss of data on migration.

I think the right way to handle this is to add kvm_queued_exception.event_data,
and then fill event_data during kvm_deliver_exception_payload() when injecting
an event for the first time, and set it directly when re-injecting an event.  The
event data is effectively the same thing as the payload; it just happens to be
delivered on the event stack frame, not via architectural register state.

And I think we should also rework kvm_requeue_exception() to open code stuffing
vcpu->arch.exception instead of using kvm_multiple_exception().  The two flows
(injection vs. re-injection) don't actually have that much in common, and the
common parts are the super duper simple things, e.g. actually setting values and
requested KVM_REQ_EVENT.

Aha!  And there is a pre-existing bug in handle_nm_fault_irqoff(), as it clobbers
guest XFD_ERR if CR0.TS=1.

Speaking of XFD_ERR, I think the best way to deal with that is to pass it along
as a payload, but then simply do nothing when delivering the payload... until
FRED comes along, and then kvm_deliver_exception_payload() can be responsible
for setting FRED's ex->event_data.

With that combination of tweaks, "normal" injection always sets event_data via
the payload, and because re-injected events never deliver payloads (already
delivered), KVM will naturally avoid clobbering ex->event_data with stale state
(and obviously doesn't need to overwrite CR2 or DR6).

Attached patches are compile-tested only.  They're a subset of the overall series,
hopefully it's fairly easy to understand where they slot in.
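
For illustration only, a sketch of that direction (this is not the attached
patches, which are not part of this archive; the event_data field is
hypothetical, and the DR6 handling is elided):

	static void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu,
						  struct kvm_queued_exception *ex)
	{
		if (!ex->has_payload)
			return;

		switch (ex->vector) {
		case DB_VECTOR:
			/* Existing DR6 merge logic elided for brevity. */
			break;
		case PF_VECTOR:
			vcpu->arch.cr2 = ex->payload;
			break;
		case NM_VECTOR:
			/* XFD_ERR passed as payload: nothing to deliver here. */
			break;
		}

		/*
		 * Hypothetical new field: with FRED, the payload doubles as
		 * the event data pushed onto the guest's FRED stack frame.
		 */
		ex->event_data = ex->payload;

		ex->has_payload = false;
		ex->payload = 0;
	}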
Li, Xin3 June 13, 2024, 6:02 p.m. UTC | #9
> On Wed, Feb 07, 2024, Xin Li wrote:
> > @@ -7382,6 +7419,24 @@ static noinstr void vmx_vcpu_enter_exit(struct
> kvm_vcpu *vcpu,
> >
> >  	vmx_disable_fb_clear(vmx);
> >
> > +	/*
> > +	 * %cr2 needs to be saved after a VM exit and restored before a VM
> > +	 * entry in case a VM exit happens immediately after delivery of a
> > +	 * guest #PF but before guest reads %cr2.
> > +	 *
> > +	 * A FRED guest should read its #PF faulting linear address from
> > +	 * the event data field in its FRED stack frame instead of %cr2.
> > +	 * But the FRED 5.0 spec still requires a FRED CPU to update %cr2
> > +	 * in the normal way, thus %cr2 is still updated even for a FRED
> > +	 * guest.
> > +	 *
> > +	 * Note, an NMI could interrupt KVM:
> > +	 *   1) after VM exit but before CR2 is saved.
> > +	 *   2) after CR2 is restored but before VM entry.
> > +	 * And a #PF could happen durng NMI handlng, which overwrites %cr2.
> > +	 * Thus exc_nmi() should save and restore %cr2 upon entering and
> > +	 * before leaving to make sure %cr2 not corrupted.
> > +	 */
> 
> This is 99.9% noise.  What software does or does not do with respect to CR2 is
> completely irrelevant.  The *only* thing that matters is the architectural
> behavior, and architecturally guest CR2 _must_ be up-to-date at all times because
> CR2 accesses cannot be intercepted.  So, just say:
> 
> 	/*
> 	 * Note, even though FRED delivers the faulting linear address via the
> 	 * event data field on the stack, CR2 is still updated.
> 	 */

Will do!

There is a reason for this comment because it won't be architectural:
https://lore.kernel.org/lkml/c0c7c605-d487-483e-a034-983b76ee1dfa@zytor.com/

FRED is designed to atomically save and restore _full_ supervisor/user
context upon event delivery and return.  But unfortunately, KVM still has
to save/restore guest CR2 explicitly due to the issue mentioned above.

Thanks!
    Xin

Patch

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 4889754415b5..6b796c5c9c2b 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -256,8 +256,12 @@  enum vmcs_field {
 	PID_POINTER_TABLE_HIGH		= 0x00002043,
 	SECONDARY_VM_EXIT_CONTROLS	= 0x00002044,
 	SECONDARY_VM_EXIT_CONTROLS_HIGH	= 0x00002045,
+	INJECTED_EVENT_DATA		= 0x00002052,
+	INJECTED_EVENT_DATA_HIGH	= 0x00002053,
 	GUEST_PHYSICAL_ADDRESS          = 0x00002400,
 	GUEST_PHYSICAL_ADDRESS_HIGH     = 0x00002401,
+	ORIGINAL_EVENT_DATA		= 0x00002404,
+	ORIGINAL_EVENT_DATA_HIGH	= 0x00002405,
 	VMCS_LINK_POINTER               = 0x00002800,
 	VMCS_LINK_POINTER_HIGH          = 0x00002801,
 	GUEST_IA32_DEBUGCTL             = 0x00002802,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index ee61d2c25cb0..f622fb90a098 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1871,9 +1871,29 @@  static void vmx_inject_exception(struct kvm_vcpu *vcpu)
 		vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
 			     vmx->vcpu.arch.event_exit_inst_len);
 		intr_info |= INTR_TYPE_SOFT_EXCEPTION;
-	} else
+	} else {
 		intr_info |= INTR_TYPE_HARD_EXCEPTION;
 
+		if (kvm_is_fred_enabled(vcpu)) {
+			u64 event_data = 0;
+
+			if (is_debug(intr_info))
+				/*
+				 * Compared to DR6, FRED #DB event data saved on
+				 * the stack frame have bits 4 ~ 11 and 16 ~ 31
+				 * inverted, i.e.,
+				 *   fred_db_event_data = dr6 ^ 0xFFFF0FF0UL
+				 */
+				event_data = vcpu->arch.dr6 ^ DR6_RESERVED;
+			else if (is_page_fault(intr_info))
+				event_data = vcpu->arch.cr2;
+			else if (is_nm_fault(intr_info))
+				event_data = to_vmx(vcpu)->fred_xfd_event_data;
+
+			vmcs_write64(INJECTED_EVENT_DATA, event_data);
+		}
+	}
+
 	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, intr_info);
 
 	vmx_clear_hlt(vcpu);
@@ -7082,8 +7102,11 @@  static void handle_nm_fault_irqoff(struct kvm_vcpu *vcpu)
 	 *
 	 * Queuing exception is done in vmx_handle_exit. See comment there.
 	 */
-	if (vcpu->arch.guest_fpu.fpstate->xfd)
+	if (vcpu->arch.guest_fpu.fpstate->xfd) {
 		rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
+		to_vmx(vcpu)->fred_xfd_event_data = vcpu->arch.cr0 & X86_CR0_TS
+			? 0 : vcpu->arch.guest_fpu.xfd_err;
+	}
 }
 
 static void handle_exception_irqoff(struct vcpu_vmx *vmx)
@@ -7199,29 +7222,28 @@  static void vmx_recover_nmi_blocking(struct vcpu_vmx *vmx)
 					      vmx->loaded_vmcs->entry_time));
 }
 
-static void __vmx_complete_interrupts(struct kvm_vcpu *vcpu,
-				      u32 idt_vectoring_info,
-				      int instr_len_field,
-				      int error_code_field)
+static void __vmx_complete_interrupts(struct kvm_vcpu *vcpu, bool vectoring)
 {
-	u8 vector;
-	int type;
-	bool idtv_info_valid;
-
-	idtv_info_valid = idt_vectoring_info & VECTORING_INFO_VALID_MASK;
+	u32 event_id = vectoring ? to_vmx(vcpu)->idt_vectoring_info
+				 : vmcs_read32(VM_ENTRY_INTR_INFO_FIELD);
+	int instr_len_field = vectoring ? VM_EXIT_INSTRUCTION_LEN
+					: VM_ENTRY_INSTRUCTION_LEN;
+	int error_code_field = vectoring ? IDT_VECTORING_ERROR_CODE
+					 : VM_ENTRY_EXCEPTION_ERROR_CODE;
+	int event_data_field = vectoring ? ORIGINAL_EVENT_DATA
+					 : INJECTED_EVENT_DATA;
+	u8 vector = event_id & INTR_INFO_VECTOR_MASK;
+	int type = event_id & INTR_INFO_INTR_TYPE_MASK;
 
 	vcpu->arch.nmi_injected = false;
 	kvm_clear_exception_queue(vcpu);
 	kvm_clear_interrupt_queue(vcpu);
 
-	if (!idtv_info_valid)
+	if (!(event_id & INTR_INFO_VALID_MASK))
 		return;
 
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
 
-	vector = idt_vectoring_info & VECTORING_INFO_VECTOR_MASK;
-	type = idt_vectoring_info & VECTORING_INFO_TYPE_MASK;
-
 	switch (type) {
 	case INTR_TYPE_NMI_INTR:
 		vcpu->arch.nmi_injected = true;
@@ -7236,10 +7258,31 @@  static void __vmx_complete_interrupts(struct kvm_vcpu *vcpu,
 		vcpu->arch.event_exit_inst_len = vmcs_read32(instr_len_field);
 		fallthrough;
 	case INTR_TYPE_HARD_EXCEPTION:
-		if (idt_vectoring_info & VECTORING_INFO_DELIVER_CODE_MASK) {
-			u32 err = vmcs_read32(error_code_field);
-			kvm_requeue_exception_e(vcpu, vector, err);
-		} else
+		if (kvm_is_fred_enabled(vcpu)) {
+			/* Save event data for being used as injected-event data */
+			u64 event_data = vmcs_read64(event_data_field);
+
+			switch (vector) {
+			case DB_VECTOR:
+				/* %dr6 should be equal to (event_data ^ DR6_RESERVED) */
+				vcpu->arch.dr6 = event_data ^ DR6_RESERVED;
+				break;
+			case NM_VECTOR:
+				to_vmx(vcpu)->fred_xfd_event_data = event_data;
+				break;
+			case PF_VECTOR:
+				/* %cr2 should be equal to event_data */
+				vcpu->arch.cr2 = event_data;
+				break;
+			default:
+				WARN_ON(event_data != 0);
+				break;
+			}
+		}
+
+		if (event_id & INTR_INFO_DELIVER_CODE_MASK)
+			kvm_requeue_exception_e(vcpu, vector, vmcs_read32(error_code_field));
+		else
 			kvm_requeue_exception(vcpu, vector);
 		break;
 	case INTR_TYPE_SOFT_INTR:
@@ -7255,18 +7298,12 @@  static void __vmx_complete_interrupts(struct kvm_vcpu *vcpu,
 
 static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
 {
-	__vmx_complete_interrupts(&vmx->vcpu, vmx->idt_vectoring_info,
-				  VM_EXIT_INSTRUCTION_LEN,
-				  IDT_VECTORING_ERROR_CODE);
+	__vmx_complete_interrupts(&vmx->vcpu, true);
 }
 
 static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
 {
-	__vmx_complete_interrupts(vcpu,
-				  vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
-				  VM_ENTRY_INSTRUCTION_LEN,
-				  VM_ENTRY_EXCEPTION_ERROR_CODE);
-
+	__vmx_complete_interrupts(vcpu, false);
 	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0);
 }
 
@@ -7382,6 +7419,24 @@  static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
 
 	vmx_disable_fb_clear(vmx);
 
+	/*
+	 * %cr2 needs to be saved after a VM exit and restored before a VM
+	 * entry in case a VM exit happens immediately after delivery of a
+	 * guest #PF but before guest reads %cr2.
+	 *
+	 * A FRED guest should read its #PF faulting linear address from
+	 * the event data field in its FRED stack frame instead of %cr2.
+	 * But the FRED 5.0 spec still requires a FRED CPU to update %cr2
+	 * in the normal way, thus %cr2 is still updated even for a FRED
+	 * guest.
+	 *
+	 * Note, an NMI could interrupt KVM:
+	 *   1) after VM exit but before CR2 is saved.
+	 *   2) after CR2 is restored but before VM entry.
+	 * And a #PF could happen durng NMI handlng, which overwrites %cr2.
+	 * Thus exc_nmi() should save and restore %cr2 upon entering and
+	 * before leaving to make sure %cr2 not corrupted.
+	 */
 	if (vcpu->arch.cr2 != native_read_cr2())
 		native_write_cr2(vcpu->arch.cr2);
 
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 176ad39be406..d5738c5a4814 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -266,6 +266,7 @@  struct vcpu_vmx {
 	u32                   exit_intr_info;
 	u32                   idt_vectoring_info;
 	ulong                 rflags;
+	u64                   fred_xfd_event_data;
 
 	/*
 	 * User return MSRs are always emulated when enabled in the guest, but
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4e8d60f248e3..00c0062726ae 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -680,8 +680,14 @@  static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
 			vcpu->arch.exception.injected = true;
 			if (WARN_ON_ONCE(has_payload)) {
 				/*
-				 * A reinjected event has already
-				 * delivered its payload.
+				 * For a reinjected event, KVM delivers its
+				 * payload through:
+				 *   #PF: save %cr2 into arch.cr2 immediately
+				 *        after VM exits.
+				 *   #DB: save %dr6 into arch.dr6 later in
+				 *        sync_dirty_debug_regs().
+				 *
+				 * For FRED guest, see __vmx_complete_interrupts().
 				 */
 				has_payload = false;
 				payload = 0;