| Message ID | 20211208000359.2853257-16-yang.zhong@intel.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | AMX Support in KVM |
On 12/8/21 01:03, Yang Zhong wrote:
> 	kvm_steal_time_set_preempted(vcpu);
> 	srcu_read_unlock(&vcpu->kvm->srcu, idx);
>
> +	if (vcpu->preempted)
> +		fpu_save_guest_xfd_err(&vcpu->arch.guest_fpu);
> +

Instead of checking vcpu->preempted, can you instead check if the active
FPU is the guest FPU?  That is, save if current->thread.fpu->fpstate->is_guest?

Paolo
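A minimal sketch of the check Paolo suggests, assuming the fpstate->is_guest flag introduced earlier in the series is still set whenever the guest FPU registers are loaded (the helper name fpu_save_guest_xfd_err() is taken from the patch under review):

	/* Sketch: save XFD_ERR only while the guest fpstate is still the
	 * active one, i.e. fpu_swap_kvm_fpstate() has not switched back
	 * to the task fpstate yet. */
	if (current->thread.fpu.fpstate->is_guest)
		fpu_save_guest_xfd_err(&vcpu->arch.guest_fpu);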
On 12/8/21 01:03, Yang Zhong wrote:
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -219,6 +219,11 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>  		kvm_apic_set_version(vcpu);
>  	}
>
> +	/* Enable saving guest XFD_ERR */
> +	best = kvm_find_cpuid_entry(vcpu, 7, 0);
> +	if (best && cpuid_entry_has(best, X86_FEATURE_AMX_TILE))
> +		vcpu->arch.guest_fpu.xfd_err = 0;
> +

This is incorrect.  Instead it should check whether leaf 0xD includes
any dynamic features.

Paolo
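One possible form of that check, as a sketch only; testing XFEATURE_MASK_USER_DYNAMIC against sub-leaf 0 EAX is an assumption based on the host-side naming for the dynamically enabled xfeatures, not something the review spells out:

	/* Sketch: enable XFD_ERR handling only if guest CPUID leaf 0xD
	 * exposes at least one XFD-managed (dynamic) xstate feature. */
	best = kvm_find_cpuid_entry(vcpu, 0xd, 0);
	if (best && (best->eax & XFEATURE_MASK_USER_DYNAMIC))
		vcpu->arch.guest_fpu.xfd_err = 0;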
On Tue, Dec 07 2021 at 19:03, Yang Zhong wrote:
> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> index 5089f2e7dc22..9811dc98d550 100644
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -238,6 +238,7 @@ bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
>  	fpstate->is_guest = true;
>
>  	gfpu->fpstate = fpstate;
> +	gfpu->xfd_err = XFD_ERR_GUEST_DISABLED;

This wants to be part of the previous patch, which introduces the field.

>  	gfpu->user_xfeatures = fpu_user_cfg.default_features;
>  	gfpu->user_perm = fpu_user_cfg.default_features;
>  	fpu_init_guest_permissions(gfpu);
> @@ -297,6 +298,7 @@ int fpu_swap_kvm_fpstate(struct fpu_guest *guest_fpu, bool enter_guest)
>  		fpu->fpstate = guest_fps;
>  		guest_fps->in_use = true;
>  	} else {
> +		fpu_save_guest_xfd_err(guest_fpu);

Hmm. See below.

>  		guest_fps->in_use = false;
>  		fpu->fpstate = fpu->__task_fpstate;
>  		fpu->__task_fpstate = NULL;
> @@ -4550,6 +4550,9 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>  	kvm_steal_time_set_preempted(vcpu);
>  	srcu_read_unlock(&vcpu->kvm->srcu, idx);
>
> +	if (vcpu->preempted)
> +		fpu_save_guest_xfd_err(&vcpu->arch.guest_fpu);

I'm not really exited about the thought of an exception cause register
in guest clobbered state.

Aside of that I really have to ask the question why all this is needed?

#NM in the guest is slow path, right? So why are you trying to optimize
for it?

The straight forward solution to this is:

  1) Trap #NM and MSR_XFD_ERR write

  2) When the guest triggers #NM is takes an VMEXIT and the host
     does:

         rdmsrl(MSR_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);

     injects the #NM and goes on.

  3) When the guest writes to MSR_XFD_ERR it takes an VMEXIT and
     the host does:

         vcpu->arch.guest_fpu.xfd_err = msrval;
         wrmsrl(MSR_XFD_ERR, msrval);

     and goes back.

  4) Before entering the preemption disabled section of the VCPU loop
     do:

         if (vcpu->arch.guest_fpu.xfd_err)
                 wrmsrl(MSR_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);

  5) Before leaving the preemption disabled section of the VCPU loop
     do:

         if (vcpu->arch.guest_fpu.xfd_err)
                 wrmsrl(MSR_XFD_ERR, 0);

It's really that simple and pretty much 0 overhead for the regular case.

If the guest triggers #NM with a high frequency then taking the VMEXITs
is the least of the problems. That's not a realistic use case, really.

Hmm?

Thanks,

        tglx
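Putting those five steps into code, a rough sketch could look like the following. The helper names (handle_nm_exit(), kvm_set_xfd_err(), xfd_err_enter_guest(), xfd_err_leave_guest()) are hypothetical, and MSR_IA32_XFD_ERR is the arch-defined name for the MSR Thomas abbreviates as MSR_XFD_ERR; step 3 is later corrected by Paolo (see below).

	/* 2) #NM VMEXIT handler, run while preemption/IRQs are still off */
	static void handle_nm_exit(struct kvm_vcpu *vcpu)
	{
		/* Latch the guest's XFD error cause before anything can clobber it */
		rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
		kvm_queue_exception(vcpu, NM_VECTOR);	/* reflect #NM into the guest */
	}

	/* 3) WRMSR(MSR_IA32_XFD_ERR) emulation, as originally proposed */
	static void kvm_set_xfd_err(struct kvm_vcpu *vcpu, u64 msrval)
	{
		vcpu->arch.guest_fpu.xfd_err = msrval;
		wrmsrl(MSR_IA32_XFD_ERR, msrval);
	}

	/* 4) before entering the preemption disabled section of the run loop */
	static void xfd_err_enter_guest(struct kvm_vcpu *vcpu)
	{
		if (vcpu->arch.guest_fpu.xfd_err)
			wrmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
	}

	/* 5) before leaving the preemption disabled section of the run loop */
	static void xfd_err_leave_guest(struct kvm_vcpu *vcpu)
	{
		if (vcpu->arch.guest_fpu.xfd_err)
			wrmsrl(MSR_IA32_XFD_ERR, 0);
	}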
On 12/11/21 01:10, Thomas Gleixner wrote:
>  2) When the guest triggers #NM is takes an VMEXIT and the host
>     does:
>
>         rdmsrl(MSR_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
>
>     injects the #NM and goes on.
>
>  3) When the guest writes to MSR_XFD_ERR it takes an VMEXIT and
>     the host does:
>
>         vcpu->arch.guest_fpu.xfd_err = msrval;
>         wrmsrl(MSR_XFD_ERR, msrval);

No wrmsrl here I think, the host value is 0 and should stay so.  Instead
the wrmsrl will happen the next time the VCPU loop is entred.

Paolo
> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Saturday, December 11, 2021 8:11 AM
>
> On Tue, Dec 07 2021 at 19:03, Yang Zhong wrote:
> > diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> > index 5089f2e7dc22..9811dc98d550 100644
> > --- a/arch/x86/kernel/fpu/core.c
> > +++ b/arch/x86/kernel/fpu/core.c
> > @@ -238,6 +238,7 @@ bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
> >  	fpstate->is_guest = true;
> >
> >  	gfpu->fpstate = fpstate;
> > +	gfpu->xfd_err = XFD_ERR_GUEST_DISABLED;
>
> This wants to be part of the previous patch, which introduces the field.
>
> >  	gfpu->user_xfeatures = fpu_user_cfg.default_features;
> >  	gfpu->user_perm = fpu_user_cfg.default_features;
> >  	fpu_init_guest_permissions(gfpu);
> > @@ -297,6 +298,7 @@ int fpu_swap_kvm_fpstate(struct fpu_guest *guest_fpu, bool enter_guest)
> >  		fpu->fpstate = guest_fps;
> >  		guest_fps->in_use = true;
> >  	} else {
> > +		fpu_save_guest_xfd_err(guest_fpu);
>
> Hmm. See below.
>
> >  		guest_fps->in_use = false;
> >  		fpu->fpstate = fpu->__task_fpstate;
> >  		fpu->__task_fpstate = NULL;
> > @@ -4550,6 +4550,9 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
> >  	kvm_steal_time_set_preempted(vcpu);
> >  	srcu_read_unlock(&vcpu->kvm->srcu, idx);
> >
> > +	if (vcpu->preempted)
> > +		fpu_save_guest_xfd_err(&vcpu->arch.guest_fpu);
>
> I'm not really exited about the thought of an exception cause register
> in guest clobbered state.
>
> Aside of that I really have to ask the question why all this is needed?
>
> #NM in the guest is slow path, right? So why are you trying to optimize
> for it?

This is really good information. The current logic is obviously
based on the assumption that #NM is frequently triggered.

>
> The straight forward solution to this is:
>
>  1) Trap #NM and MSR_XFD_ERR write

and #NM vmexit handler should be called in kvm_x86_handle_exit_irqoff()
before preemption is enabled, otherwise there is still a small window
where MSR_XFD_ERR might be clobbered after preemption enable and
before #NM handler is actually called.

>
>  2) When the guest triggers #NM is takes an VMEXIT and the host
>     does:
>
>         rdmsrl(MSR_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
>
>     injects the #NM and goes on.
>
>  3) When the guest writes to MSR_XFD_ERR it takes an VMEXIT and
>     the host does:
>
>         vcpu->arch.guest_fpu.xfd_err = msrval;
>         wrmsrl(MSR_XFD_ERR, msrval);
>
>     and goes back.
>
>  4) Before entering the preemption disabled section of the VCPU loop
>     do:
>
>         if (vcpu->arch.guest_fpu.xfd_err)
>                 wrmsrl(MSR_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
>
>  5) Before leaving the preemption disabled section of the VCPU loop
>     do:
>
>         if (vcpu->arch.guest_fpu.xfd_err)
>                 wrmsrl(MSR_XFD_ERR, 0);
>
> It's really that simple and pretty much 0 overhead for the regular case.

Much cleaner.

>
> If the guest triggers #NM with a high frequency then taking the VMEXITs
> is the least of the problems. That's not a realistic use case, really.
>
> Hmm?
>
> Thanks,
>
>         tglx

Thanks
Kevin
> From: Paolo Bonzini
> Sent: Saturday, December 11, 2021 9:32 AM
>
> On 12/11/21 01:10, Thomas Gleixner wrote:
> >  2) When the guest triggers #NM is takes an VMEXIT and the host
> >     does:
> >
> >         rdmsrl(MSR_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
> >
> >     injects the #NM and goes on.
> >
> >  3) When the guest writes to MSR_XFD_ERR it takes an VMEXIT and
> >     the host does:
> >
> >         vcpu->arch.guest_fpu.xfd_err = msrval;
> >         wrmsrl(MSR_XFD_ERR, msrval);
>
> No wrmsrl here I think, the host value is 0 and should stay so. Instead
> the wrmsrl will happen the next time the VCPU loop is entred.
>

To elaborate I guess the reason is because MSR_XFD_ERR should always
contain host value 0 after preemption is enabled, while WRMSR emulation
is called with preemption enabled. Then we just need wait for the next
time the vcpu loop is entered to restore the guest value after preemption
is disabled.
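In other words, the WRMSR emulation would only record the value and leave the hardware MSR at the host value of 0. A sketch with Paolo's correction applied (helper name hypothetical):

	/* Sketch: WRMSR emulation runs with preemption enabled, so only record
	 * the guest value here; the hardware MSR is written later, inside the
	 * preemption disabled section, right before VM entry. */
	static void kvm_set_xfd_err(struct kvm_vcpu *vcpu, u64 msrval)
	{
		vcpu->arch.guest_fpu.xfd_err = msrval;	/* no wrmsrl() here */
	}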
On Sat, Dec 11 2021 at 02:31, Paolo Bonzini wrote:
> On 12/11/21 01:10, Thomas Gleixner wrote:
>>  2) When the guest triggers #NM is takes an VMEXIT and the host
>>     does:
>>
>>         rdmsrl(MSR_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
>>
>>     injects the #NM and goes on.
>>
>>  3) When the guest writes to MSR_XFD_ERR it takes an VMEXIT and
>>     the host does:
>>
>>         vcpu->arch.guest_fpu.xfd_err = msrval;
>>         wrmsrl(MSR_XFD_ERR, msrval);
>
> No wrmsrl here I think, the host value is 0 and should stay so. Instead
> the wrmsrl will happen the next time the VCPU loop is entred.

I assumed this can be handled in the fast path, but either way.
Kevin,

On Sat, Dec 11 2021 at 03:07, Kevin Tian wrote:
>> From: Thomas Gleixner <tglx@linutronix.de>
>> #NM in the guest is slow path, right? So why are you trying to optimize
>> for it?
>
> This is really good information. The current logic is obviously
> based on the assumption that #NM is frequently triggered.

More context.

When an application want's to use AMX, it invokes the prctl() which
grants permission. If permission is granted then still the kernel FPU
state buffers are default size and XFD is armed.

When a thread of that process issues the first AMX (tile) instruction,
then #NM is raised.

The #NM handler does:

  1) Read MSR_XFD_ERR. If 0, goto regular #NM

  2) Write MSR_XFD_ERR to 0

  3) Check whether the process has permission granted. If not,
     raise SIGILL and return.

  4) Allocate and install a larger FPU state buffer for the task.
     If allocation fails, raise SIGSEGV and return.

  5) Disarm XFD for that task

That means one thread takes at max. one AMX/XFD related #NM during its
lifetime, which means two VMEXITs.

If there are other XFD controlled facilities in the future, then it will
be NR_USED_XFD_CONTROLLED_FACILITIES * 2 VMEXITs per thread which uses
them. Not the end of the world either.

Looking at the targeted application space it's pretty unlikely that
tasks which utilize AMX are going to be so short lived that the overhead
of these VMEXITs really matters.

This of course can be revisited when there is a sane use case, but
optimizing for it prematurely does not buy us anything else than
pointless complexity.

>> The straight forward solution to this is:
>>
>>  1) Trap #NM and MSR_XFD_ERR write
>
> and #NM vmexit handler should be called in kvm_x86_handle_exit_irqoff()
> before preemption is enabled, otherwise there is still a small window
> where MSR_XFD_ERR might be clobbered after preemption enable and
> before #NM handler is actually called.

Yes.

Thanks,

        tglx
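For reference, a condensed sketch of that host-side #NM path; xstate_permitted() and the exact signal plumbing are placeholders for illustration, not the real handler:

	static bool handle_xfd_nm(struct pt_regs *regs)
	{
		u64 xfd_err;

		/* 1) Read the error MSR; zero means this is a regular #NM */
		rdmsrl(MSR_IA32_XFD_ERR, xfd_err);
		if (!xfd_err)
			return false;

		/* 2) Clear it so a later #NM starts from a clean state */
		wrmsrl(MSR_IA32_XFD_ERR, 0);

		/* 3) No permission granted via prctl() -> SIGILL */
		if (!xstate_permitted(current, xfd_err)) {	/* hypothetical check */
			force_sig(SIGILL);
			return true;
		}

		/* 4) Reallocate a larger fpstate buffer, 5) disarm XFD on success;
		 *    on allocation failure raise SIGSEGV. */
		if (xfd_enable_feature(xfd_err))		/* hypothetical helper */
			force_sig(SIGSEGV);

		return true;
	}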
> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Saturday, December 11, 2021 9:29 PM
>
> Kevin,
>
> On Sat, Dec 11 2021 at 03:07, Kevin Tian wrote:
> >> From: Thomas Gleixner <tglx@linutronix.de>
> >> #NM in the guest is slow path, right? So why are you trying to optimize
> >> for it?
> >
> > This is really good information. The current logic is obviously
> > based on the assumption that #NM is frequently triggered.
>
> More context.
>
> When an application want's to use AMX, it invokes the prctl() which
> grants permission. If permission is granted then still the kernel FPU
> state buffers are default size and XFD is armed.
>
> When a thread of that process issues the first AMX (tile) instruction,
> then #NM is raised.
>
> The #NM handler does:
>
>   1) Read MSR_XFD_ERR. If 0, goto regular #NM
>
>   2) Write MSR_XFD_ERR to 0
>
>   3) Check whether the process has permission granted. If not,
>      raise SIGILL and return.
>
>   4) Allocate and install a larger FPU state buffer for the task.
>      If allocation fails, raise SIGSEGV and return.
>
>   5) Disarm XFD for that task
>
> That means one thread takes at max. one AMX/XFD related #NM during its
> lifetime, which means two VMEXITs.
>
> If there are other XFD controlled facilities in the future, then it will
> be NR_USED_XFD_CONTROLLED_FACILITIES * 2 VMEXITs per thread which uses
> them. Not the end of the world either.
>
> Looking at the targeted application space it's pretty unlikely that
> tasks which utilize AMX are going to be so short lived that the overhead
> of these VMEXITs really matters.
>
> This of course can be revisited when there is a sane use case, but
> optimizing for it prematurely does not buy us anything else than
> pointless complexity.

I get all above. I guess the original open is also about the frequency
of #NM not due to XFD. For Linux guest looks it's not a problem since
CR0.TS is not set now when math emulation is not required:

DEFINE_IDTENTRY(exc_device_not_available)
{
	...

	/* This should not happen. */
	if (WARN(cr0 & X86_CR0_TS, "CR0.TS was set")) {
		/* Try to fix it up and carry on. */
		write_cr0(cr0 & ~X86_CR0_TS);
	} else {
		/*
		 * Something terrible happened, and we're better off trying
		 * to kill the task than getting stuck in a never-ending
		 * loop of #NM faults.
		 */
		die("unexpected #NM exception", regs, 0);
	}
}

It may affect guest which still uses CR0.TS to do lazy save. But likely
modern OSes all move to eager save approach so always trapping #NM
should be fine. Is this understanding correct?

Thanks
Kevin
On 12/12/21 02:50, Tian, Kevin wrote:
>>
>> If there are other XFD controlled facilities in the future, then it will
>> be NR_USED_XFD_CONTROLLED_FACILITIES * 2 VMEXITs per thread which uses
>> them. Not the end of the world either.
>>
>> Looking at the targeted application space it's pretty unlikely that
>> tasks which utilize AMX are going to be so short lived that the overhead
>> of these VMEXITs really matters.
>>
>> This of course can be revisited when there is a sane use case, but
>> optimizing for it prematurely does not buy us anything else than
>> pointless complexity.
>
> It may affect guest which still uses CR0.TS to do lazy save. But likely
> modern OSes all move to eager save approach so always trapping #NM
> should be fine.

You also don't need to trap #NM if CPUID includes no dynamic bits,
because then XFD will never be nonzero.

Paolo
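On VMX that condition could be wired into the CPUID update path; a sketch with a hypothetical vmx_update_nm_intercept() helper, assuming guest_supported_xcr0 already reflects the guest CPUID and that the VMCS is loaded when this runs:

	/* Sketch: intercept #NM only when the guest can actually arm XFD,
	 * i.e. its CPUID exposes at least one dynamically enabled xfeature. */
	static void vmx_update_nm_intercept(struct kvm_vcpu *vcpu)
	{
		u32 eb = vmcs_read32(EXCEPTION_BITMAP);

		if (vcpu->arch.guest_supported_xcr0 & XFEATURE_MASK_USER_DYNAMIC)
			eb |= (1u << NM_VECTOR);
		else
			eb &= ~(1u << NM_VECTOR);

		vmcs_write32(EXCEPTION_BITMAP, eb);
	}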
On Fri, Dec 10, 2021 at 11:01:15PM +0100, Paolo Bonzini wrote:
> On 12/8/21 01:03, Yang Zhong wrote:
> >--- a/arch/x86/kvm/cpuid.c
> >+++ b/arch/x86/kvm/cpuid.c
> >@@ -219,6 +219,11 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> > 	kvm_apic_set_version(vcpu);
> > 	}
> >+	/* Enable saving guest XFD_ERR */
> >+	best = kvm_find_cpuid_entry(vcpu, 7, 0);
> >+	if (best && cpuid_entry_has(best, X86_FEATURE_AMX_TILE))
> >+		vcpu->arch.guest_fpu.xfd_err = 0;
> >+
>
> This is incorrect. Instead it should check whether leaf 0xD
> includes any dynamic features.
>

  Thanks Paolo,

  So ditto for "[PATCH 04/19] kvm: x86: Check guest xstate permissions
  when KVM_SET_CPUID2".

  Yang

> Paolo
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 5089f2e7dc22..9811dc98d550 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -238,6 +238,7 @@ bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
 	fpstate->is_guest = true;
 
 	gfpu->fpstate = fpstate;
+	gfpu->xfd_err = XFD_ERR_GUEST_DISABLED;
 	gfpu->user_xfeatures = fpu_user_cfg.default_features;
 	gfpu->user_perm = fpu_user_cfg.default_features;
 	fpu_init_guest_permissions(gfpu);
@@ -297,6 +298,7 @@ int fpu_swap_kvm_fpstate(struct fpu_guest *guest_fpu, bool enter_guest)
 		fpu->fpstate = guest_fps;
 		guest_fps->in_use = true;
 	} else {
+		fpu_save_guest_xfd_err(guest_fpu);
 		guest_fps->in_use = false;
 		fpu->fpstate = fpu->__task_fpstate;
 		fpu->__task_fpstate = NULL;
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index f3c61205bbf4..ea51b986ee67 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -219,6 +219,11 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 		kvm_apic_set_version(vcpu);
 	}
 
+	/* Enable saving guest XFD_ERR */
+	best = kvm_find_cpuid_entry(vcpu, 7, 0);
+	if (best && cpuid_entry_has(best, X86_FEATURE_AMX_TILE))
+		vcpu->arch.guest_fpu.xfd_err = 0;
+
 	best = kvm_find_cpuid_entry(vcpu, 0xD, 0);
 	if (!best)
 		vcpu->arch.guest_supported_xcr0 = 0;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 6198b13c4846..0db8bdf273e2 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -161,6 +161,7 @@ static u32 vmx_possible_passthrough_msrs[MAX_POSSIBLE_PASSTHROUGH_MSRS] = {
 	MSR_GS_BASE,
 	MSR_KERNEL_GS_BASE,
 	MSR_IA32_XFD,
+	MSR_IA32_XFD_ERR,
 #endif
 	MSR_IA32_SYSENTER_CS,
 	MSR_IA32_SYSENTER_ESP,
@@ -7153,6 +7154,7 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
 static void vmx_update_intercept_xfd(struct kvm_vcpu *vcpu)
 {
 	vmx_set_intercept_for_msr(vcpu, MSR_IA32_XFD, MSR_TYPE_R, false);
+	vmx_set_intercept_for_msr(vcpu, MSR_IA32_XFD_ERR, MSR_TYPE_RW, false);
 }
 
 static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index bf9d3051cd6c..0a00242a91e7 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -340,7 +340,7 @@ struct vcpu_vmx {
 	struct lbr_desc lbr_desc;
 
 	/* Save desired MSR intercept (read: pass-through) state */
-#define MAX_POSSIBLE_PASSTHROUGH_MSRS	14
+#define MAX_POSSIBLE_PASSTHROUGH_MSRS	15
 	struct {
 		DECLARE_BITMAP(read, MAX_POSSIBLE_PASSTHROUGH_MSRS);
 		DECLARE_BITMAP(write, MAX_POSSIBLE_PASSTHROUGH_MSRS);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d127b229dd29..8b033c9241d6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4550,6 +4550,9 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 	kvm_steal_time_set_preempted(vcpu);
 	srcu_read_unlock(&vcpu->kvm->srcu, idx);
 
+	if (vcpu->preempted)
+		fpu_save_guest_xfd_err(&vcpu->arch.guest_fpu);
+
 	static_call(kvm_x86_vcpu_put)(vcpu);
 	vcpu->arch.last_host_tsc = rdtsc();
 }
@@ -9951,6 +9954,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	if (test_thread_flag(TIF_NEED_FPU_LOAD))
 		switch_fpu_return();
 
+	fpu_restore_guest_xfd_err(&vcpu->arch.guest_fpu);
+
 	if (unlikely(vcpu->arch.switch_db_regs)) {
 		set_debugreg(0, 7);
 		set_debugreg(vcpu->arch.eff_db[0], 0);