[v3,13/22] kvm: x86: Intercept #NM for saving IA32_XFD_ERR

Message ID	20211222124052.644626-14-jing2.liu@intel.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <kvm-owner@kernel.org> From: Jing Liu <jing2.liu@intel.com> To: x86@kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, pbonzini@redhat.com, corbet@lwn.net, shuah@kernel.org Cc: seanjc@google.com, jun.nakajima@intel.com, kevin.tian@intel.com, jing2.liu@linux.intel.com, jing2.liu@intel.com, guang.zeng@intel.com, wei.w.wang@intel.com, yang.zhong@intel.com Subject: [PATCH v3 13/22] kvm: x86: Intercept #NM for saving IA32_XFD_ERR Date: Wed, 22 Dec 2021 04:40:43 -0800 Message-Id: <20211222124052.644626-14-jing2.liu@intel.com> In-Reply-To: <20211222124052.644626-1-jing2.liu@intel.com> References: <20211222124052.644626-1-jing2.liu@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	AMX Support in KVM \| expand [v3,00/22] AMX Support in KVM [v3,01/22] x86/fpu: Extend fpu_xstate_prctl() with guest permissions [v3,02/22] x86/fpu: Prepare guest FPU for dynamically enabled FPU features [v3,03/22] kvm: x86: Fix xstate_required_size() to follow XSTATE alignment rule [v3,04/22] kvm: x86: Exclude unpermitted xfeatures at KVM_GET_SUPPORTED_CPUID [v3,05/22] kvm: x86: Check permitted dynamic xfeatures at KVM_SET_CPUID2 [v3,06/22] x86/fpu: Make XFD initialization in __fpstate_reset() a function argument [v3,07/22] x86/fpu: Add guest support to xfd_enable_feature() [v3,08/22] x86/fpu: Provide fpu_update_guest_perm_features() for guest [v3,09/22] kvm: x86: Enable dynamic XSAVE features at KVM_SET_CPUID2 [v3,10/22] x86/fpu: Provide fpu_update_guest_xfd() for IA32_XFD emulation [v3,11/22] kvm: x86: Add emulation for IA32_XFD [v3,12/22] x86/fpu: Prepare xfd_err in struct fpu_guest [v3,13/22] kvm: x86: Intercept #NM for saving IA32_XFD_ERR [v3,14/22] kvm: x86: Emulate IA32_XFD_ERR for guest [v3,15/22] kvm: x86: Disable RDMSR interception of IA32_XFD_ERR [v3,16/22] kvm: x86: Add XCR0 support for Intel AMX [v3,17/22] kvm: x86: Add CPUID support for Intel AMX [v3,18/22] x86/fpu: Add uabi_size to guest_fpu [v3,19/22] kvm: x86: Get/set expanded xstate buffer [v3,20/22] kvm: selftests: Add support for KVM_CAP_XSAVE2 [v3,21/22] x86/fpu: Provide fpu_sync_guest_vmexit_xfd_state() [v3,22/22] kvm: x86: Disable interception for IA32_XFD on demand

Jing Liu Dec. 22, 2021, 12:40 p.m. UTC

Guest IA32_XFD_ERR is generally modified in two places:

  - Set by CPU when #NM is triggered;
  - Cleared by guest in its #NM handler;

Intercept #NM for the first case, if guest writes XFD as nonzero for
the first time which indicates guest is possible to use XFD generating
the exception. #NM is rare if the guest doesn't use dynamic features.
Otherwise, there is at most one exception per guest task given a
dynamic feature.

Save the current XFD_ERR value to the guest_fpu container in the #NM
VM-exit handler. This must be done with interrupt/preemption disabled,
otherwise the unsaved MSR value may be clobbered by host operations.

Inject a virtual #NM to the guest after saving the MSR value.

Restore the host value (always ZERO outside of the host #NM
handler) before enabling preemption.

Restore the guest value from the guest_fpu container right before
entering the guest (with preemption disabled).

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jing Liu <jing2.liu@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/vmx/vmcs.h         |  5 +++++
 arch/x86/kvm/vmx/vmx.c          | 22 +++++++++++++++++++++-
 arch/x86/kvm/x86.c              |  6 ++++++
 4 files changed, 33 insertions(+), 1 deletion(-)

Sean Christopherson Dec. 29, 2021, 12:09 a.m. UTC | #1

On Wed, Dec 22, 2021, Jing Liu wrote:
> Guest IA32_XFD_ERR is generally modified in two places:
> 
>   - Set by CPU when #NM is triggered;
>   - Cleared by guest in its #NM handler;
> 
> Intercept #NM for the first case, if guest writes XFD as nonzero for
> the first time which indicates guest is possible to use XFD generating
> the exception. #NM is rare if the guest doesn't use dynamic features.
> Otherwise, there is at most one exception per guest task given a
> dynamic feature.
> 
> Save the current XFD_ERR value to the guest_fpu container in the #NM
> VM-exit handler. This must be done with interrupt/preemption disabled,

Assuming my below understanding is correct, drop the "preemption" bit, it's
misleading.

> otherwise the unsaved MSR value may be clobbered by host operations.
> 
> Inject a virtual #NM to the guest after saving the MSR value.
> 
> Restore the host value (always ZERO outside of the host #NM
> handler) before enabling preemption.

AIUI, changelog is wrong, code is right.  This must be done before _IRQs_ are
enabled, same as handling TIF_NEED_FPU_LOAD. 

> Restore the guest value from the guest_fpu container right before
> entering the guest (with preemption disabled).

Same complaint about preemption.

> Suggested-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Jing Liu <jing2.liu@intel.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/vmx/vmcs.h         |  5 +++++
>  arch/x86/kvm/vmx/vmx.c          | 22 +++++++++++++++++++++-
>  arch/x86/kvm/x86.c              |  6 ++++++
>  4 files changed, 33 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 555f4de47ef2..f7a661f35d1a 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -640,6 +640,7 @@ struct kvm_vcpu_arch {
>  	u64 smi_count;
>  	bool tpr_access_reporting;
>  	bool xsaves_enabled;
> +	bool trap_nm;
>  	u64 ia32_xss;
>  	u64 microcode_version;
>  	u64 arch_capabilities;

...

> @@ -763,6 +764,9 @@ void vmx_update_exception_bitmap(struct kvm_vcpu *vcpu)
>  		vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, match);
>  	}
>  
> +	if (vcpu->arch.trap_nm)
> +		eb |= (1u << NM_VECTOR);
> +
>  	vmcs_write32(EXCEPTION_BITMAP, eb);
>  }
>  
> @@ -1960,6 +1964,13 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  	case MSR_KERNEL_GS_BASE:
>  		vmx_write_guest_kernel_gs_base(vmx, data);
>  		break;
> +	case MSR_IA32_XFD:
> +		ret = kvm_set_msr_common(vcpu, msr_info);
> +		if (!ret && data) {
> +			vcpu->arch.trap_nm = true;
> +			vmx_update_exception_bitmap(vcpu);

This is wrong, it fails to clear vcpu->arch.trap_nm and update the bitmap if the
MSR is cleared.

But why even bother with an extra flag?  Can't vmx_update_exception_bitmap() get
the guest's MSR_IA32_XFD value and intercept #NM accordingly?  Then you could
even handle this fully in kvm_set_msr_common(), e.g.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2c9606380bca..c6c936d2b298 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3704,6 +3704,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
                        return 1;

                fpu_update_guest_xfd(&vcpu->arch.guest_fpu, data);
+               /* Blah blah blah blah */
+               static_call(kvm_x86_update_exception_bitmap)(vcpu);
                break;
        case MSR_IA32_XFD_ERR:
                if (!msr_info->host_initiated &&

> +		}
> +		break;
>  #endif
>  	case MSR_IA32_SYSENTER_CS:
>  		if (is_guest_mode(vcpu))
> @@ -4746,7 +4757,7 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
>  	vect_info = vmx->idt_vectoring_info;
>  	intr_info = vmx_get_intr_info(vcpu);
>  
> -	if (is_machine_check(intr_info) || is_nmi(intr_info))
> +	if (is_machine_check(intr_info) || is_nmi(intr_info) || is_nm(intr_info))
>  		return 1; /* handled by handle_exception_nmi_irqoff() */
>  
>  	if (is_invalid_opcode(intr_info))
> @@ -6350,6 +6361,12 @@ static void handle_interrupt_nmi_irqoff(struct kvm_vcpu *vcpu,
>  	kvm_after_interrupt(vcpu);
>  }
>  
> +static void handle_exception_nm(struct kvm_vcpu *vcpu)

This needs a different name, it's waaaay too close to the base handle_exception_nmi(),
which runs with IRQs _on_.  And please add "_irqoff" at the end.  Maybe handle_nm_fault_irqoff()?

> +{
> +	rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
> +	kvm_queue_exception(vcpu, NM_VECTOR);
> +}
> +
>  static void handle_exception_nmi_irqoff(struct vcpu_vmx *vmx)
>  {
>  	const unsigned long nmi_entry = (unsigned long)asm_exc_nmi_noist;
> @@ -6358,6 +6375,9 @@ static void handle_exception_nmi_irqoff(struct vcpu_vmx *vmx)
>  	/* if exit due to PF check for async PF */
>  	if (is_page_fault(intr_info))
>  		vmx->vcpu.arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
> +	/* if exit due to NM, handle before preemptions are enabled */
> +	else if (is_nm(intr_info))

Same naming complaint about this helper, it looks like an is_nmi() typo.  is_nm_fault()?

> +		handle_exception_nm(&vmx->vcpu);
>  	/* Handle machine checks before interrupts are enabled */
>  	else if (is_machine_check(intr_info))
>  		kvm_machine_check();

Tian, Kevin Dec. 29, 2021, 2:52 a.m. UTC | #2

> From: Sean Christopherson <seanjc@google.com>
> Sent: Wednesday, December 29, 2021 8:10 AM
> 
> On Wed, Dec 22, 2021, Jing Liu wrote:
> > Guest IA32_XFD_ERR is generally modified in two places:
> >
> >   - Set by CPU when #NM is triggered;
> >   - Cleared by guest in its #NM handler;
> >
> > Intercept #NM for the first case, if guest writes XFD as nonzero for
> > the first time which indicates guest is possible to use XFD generating
> > the exception. #NM is rare if the guest doesn't use dynamic features.
> > Otherwise, there is at most one exception per guest task given a
> > dynamic feature.
> >
> > Save the current XFD_ERR value to the guest_fpu container in the #NM
> > VM-exit handler. This must be done with interrupt/preemption disabled,
> 
> Assuming my below understanding is correct, drop the "preemption" bit, it's
> misleading.

code-wise yes. In concept we just want to highlight that this operation 
must be completed when both interrupt and preemption are disabled.

But we can also drop preemption if you prefer to, since preemption is
certainly disabled  when interrupt is disabled.

> 
> > otherwise the unsaved MSR value may be clobbered by host operations.
> >
> > Inject a virtual #NM to the guest after saving the MSR value.
> >
> > Restore the host value (always ZERO outside of the host #NM
> > handler) before enabling preemption.
> 
> AIUI, changelog is wrong, code is right.  This must be done before _IRQs_ are
> enabled, same as handling TIF_NEED_FPU_LOAD.

yes

> 
> > Restore the guest value from the guest_fpu container right before
> > entering the guest (with preemption disabled).
> 
> Same complaint about preemption.
> 
> > Suggested-by: Thomas Gleixner <tglx@linutronix.de>
> > Signed-off-by: Jing Liu <jing2.liu@intel.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |  1 +
> >  arch/x86/kvm/vmx/vmcs.h         |  5 +++++
> >  arch/x86/kvm/vmx/vmx.c          | 22 +++++++++++++++++++++-
> >  arch/x86/kvm/x86.c              |  6 ++++++
> >  4 files changed, 33 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/include/asm/kvm_host.h
> b/arch/x86/include/asm/kvm_host.h
> > index 555f4de47ef2..f7a661f35d1a 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -640,6 +640,7 @@ struct kvm_vcpu_arch {
> >  	u64 smi_count;
> >  	bool tpr_access_reporting;
> >  	bool xsaves_enabled;
> > +	bool trap_nm;
> >  	u64 ia32_xss;
> >  	u64 microcode_version;
> >  	u64 arch_capabilities;
> 
> ...
> 
> > @@ -763,6 +764,9 @@ void vmx_update_exception_bitmap(struct
> kvm_vcpu *vcpu)
> >  		vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, match);
> >  	}
> >
> > +	if (vcpu->arch.trap_nm)
> > +		eb |= (1u << NM_VECTOR);
> > +
> >  	vmcs_write32(EXCEPTION_BITMAP, eb);
> >  }
> >
> > @@ -1960,6 +1964,13 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu,
> struct msr_data *msr_info)
> >  	case MSR_KERNEL_GS_BASE:
> >  		vmx_write_guest_kernel_gs_base(vmx, data);
> >  		break;
> > +	case MSR_IA32_XFD:
> > +		ret = kvm_set_msr_common(vcpu, msr_info);
> > +		if (!ret && data) {
> > +			vcpu->arch.trap_nm = true;
> > +			vmx_update_exception_bitmap(vcpu);
> 
> This is wrong, it fails to clear vcpu->arch.trap_nm and update the bitmap if
> the
> MSR is cleared.

In concept you are right if just looking at this patch. It's pointless to
trap #NM if guest xfd is cleared.

But here we need think about patch22 which disables write interception
for xfd. With that in consideration we use the 1st non-zero write as the
hint indicating that guest might enable xfd-related usages thus always
trap #NM after this point.

It's not a good ordering, but Paolo wants to put the optimization in the
end of this series. But we do need to put a clear comment here explaining
the always-trap policy.

> 
> But why even bother with an extra flag?  Can't
> vmx_update_exception_bitmap() get
> the guest's MSR_IA32_XFD value and intercept #NM accordingly?  Then you

Above is the reason for the extra flag

> could
> even handle this fully in kvm_set_msr_common(), e.g.
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 2c9606380bca..c6c936d2b298 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3704,6 +3704,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu,
> struct msr_data *msr_info)
>                         return 1;
> 
>                 fpu_update_guest_xfd(&vcpu->arch.guest_fpu, data);
> +               /* Blah blah blah blah */
> +               static_call(kvm_x86_update_exception_bitmap)(vcpu);
>                 break;
>         case MSR_IA32_XFD_ERR:
>                 if (!msr_info->host_initiated &&
> 
> > +		}
> > +		break;
> >  #endif
> >  	case MSR_IA32_SYSENTER_CS:
> >  		if (is_guest_mode(vcpu))
> > @@ -4746,7 +4757,7 @@ static int handle_exception_nmi(struct kvm_vcpu
> *vcpu)
> >  	vect_info = vmx->idt_vectoring_info;
> >  	intr_info = vmx_get_intr_info(vcpu);
> >
> > -	if (is_machine_check(intr_info) || is_nmi(intr_info))
> > +	if (is_machine_check(intr_info) || is_nmi(intr_info) ||
> is_nm(intr_info))
> >  		return 1; /* handled by handle_exception_nmi_irqoff() */
> >
> >  	if (is_invalid_opcode(intr_info))
> > @@ -6350,6 +6361,12 @@ static void handle_interrupt_nmi_irqoff(struct
> kvm_vcpu *vcpu,
> >  	kvm_after_interrupt(vcpu);
> >  }
> >
> > +static void handle_exception_nm(struct kvm_vcpu *vcpu)
> 
> This needs a different name, it's waaaay too close to the base
> handle_exception_nmi(),
> which runs with IRQs _on_.  And please add "_irqoff" at the end.  Maybe
> handle_nm_fault_irqoff()?

sounds good.

> 
> > +{
> > +	rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
> > +	kvm_queue_exception(vcpu, NM_VECTOR);
> > +}
> > +
> >  static void handle_exception_nmi_irqoff(struct vcpu_vmx *vmx)
> >  {
> >  	const unsigned long nmi_entry = (unsigned long)asm_exc_nmi_noist;
> > @@ -6358,6 +6375,9 @@ static void handle_exception_nmi_irqoff(struct
> vcpu_vmx *vmx)
> >  	/* if exit due to PF check for async PF */
> >  	if (is_page_fault(intr_info))
> >  		vmx->vcpu.arch.apf.host_apf_flags =
> kvm_read_and_reset_apf_flags();
> > +	/* if exit due to NM, handle before preemptions are enabled */
> > +	else if (is_nm(intr_info))
> 
> Same naming complaint about this helper, it looks like an is_nmi() typo.
> is_nm_fault()?

will fix

> 
> > +		handle_exception_nm(&vmx->vcpu);
> >  	/* Handle machine checks before interrupts are enabled */
> >  	else if (is_machine_check(intr_info))
> >  		kvm_machine_check();

Tian, Kevin Dec. 29, 2021, 6:50 a.m. UTC | #3

> From: Tian, Kevin
> Sent: Wednesday, December 29, 2021 10:53 AM
> > > +	case MSR_IA32_XFD:
> > > +		ret = kvm_set_msr_common(vcpu, msr_info);
> > > +		if (!ret && data) {
> > > +			vcpu->arch.trap_nm = true;
> > > +			vmx_update_exception_bitmap(vcpu);
> >
> > This is wrong, it fails to clear vcpu->arch.trap_nm and update the bitmap if
> > the
> > MSR is cleared.
> 
> In concept you are right if just looking at this patch. It's pointless to
> trap #NM if guest xfd is cleared.
> 
> But here we need think about patch22 which disables write interception
> for xfd. With that in consideration we use the 1st non-zero write as the
> hint indicating that guest might enable xfd-related usages thus always
> trap #NM after this point.
> 
> It's not a good ordering, but Paolo wants to put the optimization in the
> end of this series. But we do need to put a clear comment here explaining
> the always-trap policy.

Given write emulation of XFD is not disabled in this patch, it reads cleaner
to always update exception bitmap according to guest xfd value at this
stage. So we will follow your suggestion here and then change to check
msr bitmap when write emulation is disabled in patch22.

Thanks
Kevin

Tian, Kevin Dec. 29, 2021, 8:13 a.m. UTC | #4

> From: Sean Christopherson <seanjc@google.com>
> Sent: Wednesday, December 29, 2021 8:10 AM
>
> But why even bother with an extra flag?  Can't
> vmx_update_exception_bitmap() get
> the guest's MSR_IA32_XFD value and intercept #NM accordingly?  Then you
> could
> even handle this fully in kvm_set_msr_common(), e.g.
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 2c9606380bca..c6c936d2b298 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3704,6 +3704,8 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu,
> struct msr_data *msr_info)
>                         return 1;
> 
>                 fpu_update_guest_xfd(&vcpu->arch.guest_fpu, data);
> +               /* Blah blah blah blah */
> +               static_call(kvm_x86_update_exception_bitmap)(vcpu);
>                 break;

btw follow Paolo's suggestion here:

https://lore.kernel.org/all/6e95b6f7-44dc-7e48-4e6e-81cf85fc11c6@redhat.com/

He prefers to disabling xfd interception in vmx specific code instead of
introducing a new callback to be called in common code. Since we want
to always trap #NM after xfd interception is disabled (which is done
after calling the msr common code), it also reads cleaner to call
update_exception_bitmap() from vmx.c instead of here.

Thanks
Kevin

Sean Christopherson Dec. 29, 2021, 5:37 p.m. UTC | #5

On Wed, Dec 29, 2021, Tian, Kevin wrote:
> > From: Sean Christopherson <seanjc@google.com>
> > Sent: Wednesday, December 29, 2021 8:10 AM
> > 
> > On Wed, Dec 22, 2021, Jing Liu wrote:
> > > Guest IA32_XFD_ERR is generally modified in two places:
> > >
> > >   - Set by CPU when #NM is triggered;
> > >   - Cleared by guest in its #NM handler;
> > >
> > > Intercept #NM for the first case, if guest writes XFD as nonzero for
> > > the first time which indicates guest is possible to use XFD generating
> > > the exception. #NM is rare if the guest doesn't use dynamic features.
> > > Otherwise, there is at most one exception per guest task given a
> > > dynamic feature.
> > >
> > > Save the current XFD_ERR value to the guest_fpu container in the #NM
> > > VM-exit handler. This must be done with interrupt/preemption disabled,
> > 
> > Assuming my below understanding is correct, drop the "preemption" bit, it's
> > misleading.
> 
> code-wise yes. In concept we just want to highlight that this operation 
> must be completed when both interrupt and preemption are disabled.

No no no no no.  Yes, disabling IRQs also disables preemption, but that's not at
all relevant, e.g. KVM could handle preemption via kvm_sched_{in,out}().  Handling
this with IRQs disable is 100% mandatory because MSR_IA32_XFD_ERR can be indirectly
consumed in (soft) IRQ context, end of story.

> But we can also drop preemption if you prefer to, since preemption is
> certainly disabled  when interrupt is disabled.

[v3,13/22] kvm: x86: Intercept #NM for saving IA32_XFD_ERR

Commit Message

Comments

Patch