Message ID | 20230904013555.725413-3-tao1.su@linux.intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | KVM: x86: Fix a WARN in kvm_apic_send_ipi() | expand |
On Mon, Sep 04, 2023 at 09:35:55AM +0800, Tao Su wrote: >When IPI virtualization is enabled, a WARN is triggered if bit12 of ICR >MSR is set after APIC-write VM-exit. The reason is kvm_apic_send_ipi() >thinks the APIC_ICR_BUSY bit should be cleared because KVM has no delay, >but kvm_apic_write_nodecode() doesn't clear the APIC_ICR_BUSY bit. > >Since bit12 of ICR is no longer BUSY bit but UNUSED bit in x2APIC mode, >and SDM has no detail about how hardware will handle the UNUSED bit12 >set, we tested on Intel CPU (SRF/GNR) with IPI virtualization and found >the UNUSED bit12 was also cleared by hardware without #GP. Therefore, >the clearing of bit12 should be still kept being consistent with the >hardware behavior. > >Fixes: 5413bcba7ed5 ("KVM: x86: Add support for vICR APIC-write VM-Exits in x2APIC mode") >Signed-off-by: Tao Su <tao1.su@linux.intel.com> >Tested-by: Yi Lai <yi1.lai@intel.com> >--- > arch/x86/kvm/lapic.c | 27 ++++++++++++++++++++------- > 1 file changed, 20 insertions(+), 7 deletions(-) > >diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c >index a983a16163b1..09a376aeb4a0 100644 >--- a/arch/x86/kvm/lapic.c >+++ b/arch/x86/kvm/lapic.c >@@ -1482,8 +1482,17 @@ void kvm_apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high) > { > struct kvm_lapic_irq irq; > >- /* KVM has no delay and should always clear the BUSY/PENDING flag. */ >- WARN_ON_ONCE(icr_low & APIC_ICR_BUSY); >+ /* >+ * In non-x2apic mode, KVM has no delay and should always clear the >+ * BUSY/PENDING flag. In x2apic mode, KVM should clear the unused bit12 >+ * of ICR since hardware will also clear this bit. Although >+ * APIC_ICR_BUSY and X2APIC_ICR_UNUSED_12 are same, they mean different >+ * things in different modes. >+ */ >+ if (!apic_x2apic_mode(apic)) >+ WARN_ON_ONCE(icr_low & APIC_ICR_BUSY); >+ else >+ WARN_ON_ONCE(icr_low & X2APIC_ICR_UNUSED_12); Hi Tao, Using the alias for APIC_ICR_BUSY is not strictly necessary for the bug fix. I think it is better to move this hunk into patch 1. > > irq.vector = icr_low & APIC_VECTOR_MASK; > irq.delivery_mode = icr_low & APIC_MODE_MASK; >@@ -2429,13 +2438,12 @@ void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset) > * ICR is a single 64-bit register when x2APIC is enabled. For legacy > * xAPIC, ICR writes need to go down the common (slightly slower) path > * to get the upper half from ICR2. >+ * >+ * TODO: optimize to just emulate side effect w/o one more write > */ > if (apic_x2apic_mode(apic) && offset == APIC_ICR) { >- val = kvm_lapic_get_reg64(apic, APIC_ICR); >- kvm_apic_send_ipi(apic, (u32)val, (u32)(val >> 32)); >- trace_kvm_apic_write(APIC_ICR, val); >+ kvm_x2apic_icr_write(apic, val); > } else { >- /* TODO: optimize to just emulate side effect w/o one more write */ > val = kvm_lapic_get_reg(apic, offset); > kvm_lapic_reg_write(apic, offset, (u32)val); > } >@@ -3122,7 +3130,12 @@ int kvm_lapic_set_vapic_addr(struct kvm_vcpu *vcpu, gpa_t vapic_addr) > > int kvm_x2apic_icr_write(struct kvm_lapic *apic, u64 data) > { >- data &= ~APIC_ICR_BUSY; >+ /* >+ * The Delivery Status Bit(bit 12) is removed in x2apic mode, but this >+ * bit is also cleared by hardware, so keep consistent with hardware >+ * behavior to clear this bit. >+ */ >+ data &= ~X2APIC_ICR_UNUSED_12; Ditto. and can you also swap patch 1 with patch 2? Because patch 1 is an optional improvement, putting such a patch at the end of a patch set gives maintainers an easy choice about whether they want it or not. > > kvm_apic_send_ipi(apic, (u32)data, (u32)(data >> 32)); > kvm_lapic_set_reg64(apic, APIC_ICR, data); >-- >2.34.1 >
On Mon, Sep 04, 2023 at 10:46:38AM +0800, Chao Gao wrote: > On Mon, Sep 04, 2023 at 09:35:55AM +0800, Tao Su wrote: > >When IPI virtualization is enabled, a WARN is triggered if bit12 of ICR > >MSR is set after APIC-write VM-exit. The reason is kvm_apic_send_ipi() > >thinks the APIC_ICR_BUSY bit should be cleared because KVM has no delay, > >but kvm_apic_write_nodecode() doesn't clear the APIC_ICR_BUSY bit. > > > >Since bit12 of ICR is no longer BUSY bit but UNUSED bit in x2APIC mode, > >and SDM has no detail about how hardware will handle the UNUSED bit12 > >set, we tested on Intel CPU (SRF/GNR) with IPI virtualization and found > >the UNUSED bit12 was also cleared by hardware without #GP. Therefore, > >the clearing of bit12 should be still kept being consistent with the > >hardware behavior. > > > >Fixes: 5413bcba7ed5 ("KVM: x86: Add support for vICR APIC-write VM-Exits in x2APIC mode") > >Signed-off-by: Tao Su <tao1.su@linux.intel.com> > >Tested-by: Yi Lai <yi1.lai@intel.com> > >--- > > arch/x86/kvm/lapic.c | 27 ++++++++++++++++++++------- > > 1 file changed, 20 insertions(+), 7 deletions(-) > > > >diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c > >index a983a16163b1..09a376aeb4a0 100644 > >--- a/arch/x86/kvm/lapic.c > >+++ b/arch/x86/kvm/lapic.c > >@@ -1482,8 +1482,17 @@ void kvm_apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high) > > { > > struct kvm_lapic_irq irq; > > > >- /* KVM has no delay and should always clear the BUSY/PENDING flag. */ > >- WARN_ON_ONCE(icr_low & APIC_ICR_BUSY); > >+ /* > >+ * In non-x2apic mode, KVM has no delay and should always clear the > >+ * BUSY/PENDING flag. In x2apic mode, KVM should clear the unused bit12 > >+ * of ICR since hardware will also clear this bit. Although > >+ * APIC_ICR_BUSY and X2APIC_ICR_UNUSED_12 are same, they mean different > >+ * things in different modes. > >+ */ > >+ if (!apic_x2apic_mode(apic)) > >+ WARN_ON_ONCE(icr_low & APIC_ICR_BUSY); > >+ else > >+ WARN_ON_ONCE(icr_low & X2APIC_ICR_UNUSED_12); > > > Hi Tao, > > Using the alias for APIC_ICR_BUSY is not strictly necessary for the bug fix. > I think it is better to move this hunk into patch 1. > > > > > irq.vector = icr_low & APIC_VECTOR_MASK; > > irq.delivery_mode = icr_low & APIC_MODE_MASK; > >@@ -2429,13 +2438,12 @@ void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset) > > * ICR is a single 64-bit register when x2APIC is enabled. For legacy > > * xAPIC, ICR writes need to go down the common (slightly slower) path > > * to get the upper half from ICR2. > >+ * > >+ * TODO: optimize to just emulate side effect w/o one more write > > */ > > if (apic_x2apic_mode(apic) && offset == APIC_ICR) { > >- val = kvm_lapic_get_reg64(apic, APIC_ICR); > >- kvm_apic_send_ipi(apic, (u32)val, (u32)(val >> 32)); > >- trace_kvm_apic_write(APIC_ICR, val); > >+ kvm_x2apic_icr_write(apic, val); > > } else { > >- /* TODO: optimize to just emulate side effect w/o one more write */ > > val = kvm_lapic_get_reg(apic, offset); > > kvm_lapic_reg_write(apic, offset, (u32)val); > > } > >@@ -3122,7 +3130,12 @@ int kvm_lapic_set_vapic_addr(struct kvm_vcpu *vcpu, gpa_t vapic_addr) > > > > int kvm_x2apic_icr_write(struct kvm_lapic *apic, u64 data) > > { > >- data &= ~APIC_ICR_BUSY; > >+ /* > >+ * The Delivery Status Bit(bit 12) is removed in x2apic mode, but this > >+ * bit is also cleared by hardware, so keep consistent with hardware > >+ * behavior to clear this bit. > >+ */ > >+ data &= ~X2APIC_ICR_UNUSED_12; > > Ditto. > > and can you also swap patch 1 with patch 2? Because patch 1 is an optional > improvement, putting such a patch at the end of a patch set gives > maintainers an easy choice about whether they want it or not. Yes, I just think the alias will make the code clearer because x2APIC does remove bit12. If it is not necessary, I will directly drop this alias in version 2. Thanks, Tao > > > > > kvm_apic_send_ipi(apic, (u32)data, (u32)(data >> 32)); > > kvm_lapic_set_reg64(apic, APIC_ICR, data); > >-- > >2.34.1 > >
Hi Tao, kernel test robot noticed the following build warnings: [auto build test WARNING on 708283abf896dd4853e673cc8cba70acaf9bf4ea] url: https://github.com/intel-lab-lkp/linux/commits/Tao-Su/x86-apic-Introduce-X2APIC_ICR_UNUSED_12-for-x2APIC-mode/20230904-093801 base: 708283abf896dd4853e673cc8cba70acaf9bf4ea patch link: https://lore.kernel.org/r/20230904013555.725413-3-tao1.su%40linux.intel.com patch subject: [PATCH 2/2] KVM: x86: Clear X2APIC_ICR_UNUSED_12 after APIC-write VM-exit config: x86_64-randconfig-003-20230904 (https://download.01.org/0day-ci/archive/20230904/202309041224.CDj6t1BN-lkp@intel.com/config) compiler: clang version 16.0.4 (https://github.com/llvm/llvm-project.git ae42196bc493ffe877a7e3dff8be32035dea4d07) reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230904/202309041224.CDj6t1BN-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202309041224.CDj6t1BN-lkp@intel.com/ All warnings (new ones prefixed by >>): >> arch/x86/kvm/lapic.c:2445:30: warning: variable 'val' is uninitialized when used here [-Wuninitialized] kvm_x2apic_icr_write(apic, val); ^~~ arch/x86/kvm/lapic.c:2435:9: note: initialize the variable 'val' to silence this warning u64 val; ^ = 0 1 warning generated. vim +/val +2445 arch/x86/kvm/lapic.c 2430 2431 /* emulate APIC access in a trap manner */ 2432 void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset) 2433 { 2434 struct kvm_lapic *apic = vcpu->arch.apic; 2435 u64 val; 2436 2437 /* 2438 * ICR is a single 64-bit register when x2APIC is enabled. For legacy 2439 * xAPIC, ICR writes need to go down the common (slightly slower) path 2440 * to get the upper half from ICR2. 2441 * 2442 * TODO: optimize to just emulate side effect w/o one more write 2443 */ 2444 if (apic_x2apic_mode(apic) && offset == APIC_ICR) { > 2445 kvm_x2apic_icr_write(apic, val); 2446 } else { 2447 val = kvm_lapic_get_reg(apic, offset); 2448 kvm_lapic_reg_write(apic, offset, (u32)val); 2449 } 2450 } 2451 EXPORT_SYMBOL_GPL(kvm_apic_write_nodecode); 2452
On Mon, Sep 04, 2023 at 09:35:55AM +0800, Tao Su wrote: > When IPI virtualization is enabled, a WARN is triggered if bit12 of ICR > MSR is set after APIC-write VM-exit. The reason is kvm_apic_send_ipi() > thinks the APIC_ICR_BUSY bit should be cleared because KVM has no delay, > but kvm_apic_write_nodecode() doesn't clear the APIC_ICR_BUSY bit. > > Since bit12 of ICR is no longer BUSY bit but UNUSED bit in x2APIC mode, > and SDM has no detail about how hardware will handle the UNUSED bit12 > set, we tested on Intel CPU (SRF/GNR) with IPI virtualization and found > the UNUSED bit12 was also cleared by hardware without #GP. Therefore, > the clearing of bit12 should be still kept being consistent with the > hardware behavior. > > Fixes: 5413bcba7ed5 ("KVM: x86: Add support for vICR APIC-write VM-Exits in x2APIC mode") > Signed-off-by: Tao Su <tao1.su@linux.intel.com> > Tested-by: Yi Lai <yi1.lai@intel.com> > --- > arch/x86/kvm/lapic.c | 27 ++++++++++++++++++++------- > 1 file changed, 20 insertions(+), 7 deletions(-) > > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c > index a983a16163b1..09a376aeb4a0 100644 > --- a/arch/x86/kvm/lapic.c > +++ b/arch/x86/kvm/lapic.c > @@ -1482,8 +1482,17 @@ void kvm_apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high) > { > struct kvm_lapic_irq irq; > > - /* KVM has no delay and should always clear the BUSY/PENDING flag. */ > - WARN_ON_ONCE(icr_low & APIC_ICR_BUSY); > + /* > + * In non-x2apic mode, KVM has no delay and should always clear the > + * BUSY/PENDING flag. In x2apic mode, KVM should clear the unused bit12 > + * of ICR since hardware will also clear this bit. Although > + * APIC_ICR_BUSY and X2APIC_ICR_UNUSED_12 are same, they mean different > + * things in different modes. > + */ > + if (!apic_x2apic_mode(apic)) > + WARN_ON_ONCE(icr_low & APIC_ICR_BUSY); > + else > + WARN_ON_ONCE(icr_low & X2APIC_ICR_UNUSED_12); > > irq.vector = icr_low & APIC_VECTOR_MASK; > irq.delivery_mode = icr_low & APIC_MODE_MASK; > @@ -2429,13 +2438,12 @@ void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset) > * ICR is a single 64-bit register when x2APIC is enabled. For legacy > * xAPIC, ICR writes need to go down the common (slightly slower) path > * to get the upper half from ICR2. > + * > + * TODO: optimize to just emulate side effect w/o one more write > */ > if (apic_x2apic_mode(apic) && offset == APIC_ICR) { > - val = kvm_lapic_get_reg64(apic, APIC_ICR); Sorry for the mistake, I notice this line is removed accidentally, I will add it back in the next version. > - kvm_apic_send_ipi(apic, (u32)val, (u32)(val >> 32)); > - trace_kvm_apic_write(APIC_ICR, val); > + kvm_x2apic_icr_write(apic, val); > } else { > - /* TODO: optimize to just emulate side effect w/o one more write */ > val = kvm_lapic_get_reg(apic, offset); > kvm_lapic_reg_write(apic, offset, (u32)val); > } > @@ -3122,7 +3130,12 @@ int kvm_lapic_set_vapic_addr(struct kvm_vcpu *vcpu, gpa_t vapic_addr) > > int kvm_x2apic_icr_write(struct kvm_lapic *apic, u64 data) > { > - data &= ~APIC_ICR_BUSY; > + /* > + * The Delivery Status Bit(bit 12) is removed in x2apic mode, but this > + * bit is also cleared by hardware, so keep consistent with hardware > + * behavior to clear this bit. > + */ > + data &= ~X2APIC_ICR_UNUSED_12; > > kvm_apic_send_ipi(apic, (u32)data, (u32)(data >> 32)); > kvm_lapic_set_reg64(apic, APIC_ICR, data); > -- > 2.34.1 >
+Suravee On Mon, Sep 04, 2023, Tao Su wrote: > When IPI virtualization is enabled, a WARN is triggered if bit12 of ICR > MSR is set after APIC-write VM-exit. The reason is kvm_apic_send_ipi() > thinks the APIC_ICR_BUSY bit should be cleared because KVM has no delay, > but kvm_apic_write_nodecode() doesn't clear the APIC_ICR_BUSY bit. > > Since bit12 of ICR is no longer BUSY bit but UNUSED bit in x2APIC mode, > and SDM has no detail about how hardware will handle the UNUSED bit12 > set, we tested on Intel CPU (SRF/GNR) with IPI virtualization and found > the UNUSED bit12 was also cleared by hardware without #GP. Therefore, > the clearing of bit12 should be still kept being consistent with the > hardware behavior. I'm confused. If hardware clears the bit, then why is it set in the vAPIC page after a trap-like APIC-write VM-Exit? In other words, how is this not a ucode or hardware bug? Suravee, can you confirm what happens on AMD with x2AVIC? Does hardware *always* clear the busy bit if it's set by the guest? If so, then we could "optimize" avic_incomplete_ipi_interception() to skip the busy check, e.g. diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c index cfc8ab773025..4bf0bb250147 100644 --- a/arch/x86/kvm/svm/avic.c +++ b/arch/x86/kvm/svm/avic.c @@ -513,7 +513,7 @@ int avic_incomplete_ipi_interception(struct kvm_vcpu *vcpu) * in which case KVM needs to emulate the ICR write as well in * order to clear the BUSY flag. */ - if (icrl & APIC_ICR_BUSY) + if (!apic_x2apic_mode(apic) && (icrl & APIC_ICR_BUSY)) kvm_apic_write_nodecode(vcpu, APIC_ICR); else kvm_apic_send_ipi(apic, icrl, icrh); > Fixes: 5413bcba7ed5 ("KVM: x86: Add support for vICR APIC-write VM-Exits in x2APIC mode") > Signed-off-by: Tao Su <tao1.su@linux.intel.com> > Tested-by: Yi Lai <yi1.lai@intel.com> > --- > arch/x86/kvm/lapic.c | 27 ++++++++++++++++++++------- > 1 file changed, 20 insertions(+), 7 deletions(-) > > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c > index a983a16163b1..09a376aeb4a0 100644 > --- a/arch/x86/kvm/lapic.c > +++ b/arch/x86/kvm/lapic.c > @@ -1482,8 +1482,17 @@ void kvm_apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high) > { > struct kvm_lapic_irq irq; > > - /* KVM has no delay and should always clear the BUSY/PENDING flag. */ > - WARN_ON_ONCE(icr_low & APIC_ICR_BUSY); > + /* > + * In non-x2apic mode, KVM has no delay and should always clear the > + * BUSY/PENDING flag. In x2apic mode, KVM should clear the unused bit12 > + * of ICR since hardware will also clear this bit. Although > + * APIC_ICR_BUSY and X2APIC_ICR_UNUSED_12 are same, they mean different > + * things in different modes. > + */ > + if (!apic_x2apic_mode(apic)) > + WARN_ON_ONCE(icr_low & APIC_ICR_BUSY); > + else > + WARN_ON_ONCE(icr_low & X2APIC_ICR_UNUSED_12); NAK to the new name, KVM is absolutely not going to zero an arbitrary "unused" bit. If Intel wants to reclaim bit 12 for something useful in the future, then Intel can ship CPUs that don't touch the "reserved" bit, and deal with all the fun of finding and updating all software that unnecessarily sets the busy bit in x2apic mode. If we really want to pretend that Intel has more than a snowball's chance in hell of doing something useful with bit 12, then the right thing to do in KVM is to ignore the bit entirely and let the guest keep the pieces, e.g. diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 113ca9661ab2..36ec195a3339 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -1473,8 +1473,13 @@ void kvm_apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high) { struct kvm_lapic_irq irq; - /* KVM has no delay and should always clear the BUSY/PENDING flag. */ - WARN_ON_ONCE(icr_low & APIC_ICR_BUSY); + /* + * KVM has no delay and should always clear the BUSY/PENDING flag. + * The flag doesn't exist in x2APIC mode; both the SDM and APM state + * that the flag "Must Be Zero", but neither Intel nor AMD enforces + * that (or any other reserved bits in ICR). + */ + WARN_ON_ONCE(!apic_x2apic_mode(apic) && (icr_low & APIC_ICR_BUSY)); irq.vector = icr_low & APIC_VECTOR_MASK; irq.delivery_mode = icr_low & APIC_MODE_MASK; @@ -3113,8 +3118,6 @@ int kvm_lapic_set_vapic_addr(struct kvm_vcpu *vcpu, gpa_t vapic_addr) int kvm_x2apic_icr_write(struct kvm_lapic *apic, u64 data) { - data &= ~APIC_ICR_BUSY; - kvm_apic_send_ipi(apic, (u32)data, (u32)(data >> 32)); kvm_lapic_set_reg64(apic, APIC_ICR, data); trace_kvm_apic_write(APIC_ICR, data); diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c index cfc8ab773025..4bf0bb250147 100644 --- a/arch/x86/kvm/svm/avic.c +++ b/arch/x86/kvm/svm/avic.c @@ -513,7 +513,7 @@ int avic_incomplete_ipi_interception(struct kvm_vcpu *vcpu) * in which case KVM needs to emulate the ICR write as well in * order to clear the BUSY flag. */ - if (icrl & APIC_ICR_BUSY) + if (!apic_x2apic_mode(apic) && (icrl & APIC_ICR_BUSY)) kvm_apic_write_nodecode(vcpu, APIC_ICR); else kvm_apic_send_ipi(apic, icrl, icrh);
On Tue, Sep 05, 2023 at 04:03:36PM -0700, Sean Christopherson wrote: > +Suravee > > On Mon, Sep 04, 2023, Tao Su wrote: > > When IPI virtualization is enabled, a WARN is triggered if bit12 of ICR > > MSR is set after APIC-write VM-exit. The reason is kvm_apic_send_ipi() > > thinks the APIC_ICR_BUSY bit should be cleared because KVM has no delay, > > but kvm_apic_write_nodecode() doesn't clear the APIC_ICR_BUSY bit. > > > > Since bit12 of ICR is no longer BUSY bit but UNUSED bit in x2APIC mode, > > and SDM has no detail about how hardware will handle the UNUSED bit12 > > set, we tested on Intel CPU (SRF/GNR) with IPI virtualization and found > > the UNUSED bit12 was also cleared by hardware without #GP. Therefore, > > the clearing of bit12 should be still kept being consistent with the > > hardware behavior. > > I'm confused. If hardware clears the bit, then why is it set in the vAPIC page > after a trap-like APIC-write VM-Exit? In other words, how is this not a ucode > or hardware bug? Sorry, I didn't describe it clearly. On bare-metal, bit12 of ICR MSR will be cleared after setting this bit. If bit12 is set in guest, the bit is not cleared in the vAPIC page after APIC-write VM-Exit. So whether to clear bit12 in vAPIC page needs to be considered. > > Suravee, can you confirm what happens on AMD with x2AVIC? Does hardware *always* > clear the busy bit if it's set by the guest? If so, then we could "optimize" > avic_incomplete_ipi_interception() to skip the busy check, e.g. > > diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c > index cfc8ab773025..4bf0bb250147 100644 > --- a/arch/x86/kvm/svm/avic.c > +++ b/arch/x86/kvm/svm/avic.c > @@ -513,7 +513,7 @@ int avic_incomplete_ipi_interception(struct kvm_vcpu *vcpu) > * in which case KVM needs to emulate the ICR write as well in > * order to clear the BUSY flag. > */ > - if (icrl & APIC_ICR_BUSY) > + if (!apic_x2apic_mode(apic) && (icrl & APIC_ICR_BUSY)) > kvm_apic_write_nodecode(vcpu, APIC_ICR); > else > kvm_apic_send_ipi(apic, icrl, icrh); > > > Fixes: 5413bcba7ed5 ("KVM: x86: Add support for vICR APIC-write VM-Exits in x2APIC mode") > > Signed-off-by: Tao Su <tao1.su@linux.intel.com> > > Tested-by: Yi Lai <yi1.lai@intel.com> > > --- > > arch/x86/kvm/lapic.c | 27 ++++++++++++++++++++------- > > 1 file changed, 20 insertions(+), 7 deletions(-) > > > > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c > > index a983a16163b1..09a376aeb4a0 100644 > > --- a/arch/x86/kvm/lapic.c > > +++ b/arch/x86/kvm/lapic.c > > @@ -1482,8 +1482,17 @@ void kvm_apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high) > > { > > struct kvm_lapic_irq irq; > > > > - /* KVM has no delay and should always clear the BUSY/PENDING flag. */ > > - WARN_ON_ONCE(icr_low & APIC_ICR_BUSY); > > + /* > > + * In non-x2apic mode, KVM has no delay and should always clear the > > + * BUSY/PENDING flag. In x2apic mode, KVM should clear the unused bit12 > > + * of ICR since hardware will also clear this bit. Although > > + * APIC_ICR_BUSY and X2APIC_ICR_UNUSED_12 are same, they mean different > > + * things in different modes. > > + */ > > + if (!apic_x2apic_mode(apic)) > > + WARN_ON_ONCE(icr_low & APIC_ICR_BUSY); > > + else > > + WARN_ON_ONCE(icr_low & X2APIC_ICR_UNUSED_12); > > NAK to the new name, KVM is absolutely not going to zero an arbitrary "unused" > bit. If Intel wants to reclaim bit 12 for something useful in the future, then > Intel can ship CPUs that don't touch the "reserved" bit, and deal with all the > fun of finding and updating all software that unnecessarily sets the busy bit in > x2apic mode. > > If we really want to pretend that Intel has more than a snowball's chance in hell > of doing something useful with bit 12, then the right thing to do in KVM is to > ignore the bit entirely and let the guest keep the pieces, e.g. Yes, agree. Currently bit12 is unused and cleared by hardware on bare-metal, clearing bit12 in guest is for keeping the same behavior as bare-metal. But KVM should not to zero the unused bit until SDM reclaims it for something, so ignoring the bit in KVM should be better. Thanks, Tao > > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c > index 113ca9661ab2..36ec195a3339 100644 > --- a/arch/x86/kvm/lapic.c > +++ b/arch/x86/kvm/lapic.c > @@ -1473,8 +1473,13 @@ void kvm_apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high) > { > struct kvm_lapic_irq irq; > > - /* KVM has no delay and should always clear the BUSY/PENDING flag. */ > - WARN_ON_ONCE(icr_low & APIC_ICR_BUSY); > + /* > + * KVM has no delay and should always clear the BUSY/PENDING flag. > + * The flag doesn't exist in x2APIC mode; both the SDM and APM state > + * that the flag "Must Be Zero", but neither Intel nor AMD enforces > + * that (or any other reserved bits in ICR). > + */ > + WARN_ON_ONCE(!apic_x2apic_mode(apic) && (icr_low & APIC_ICR_BUSY)); > > irq.vector = icr_low & APIC_VECTOR_MASK; > irq.delivery_mode = icr_low & APIC_MODE_MASK; > @@ -3113,8 +3118,6 @@ int kvm_lapic_set_vapic_addr(struct kvm_vcpu *vcpu, gpa_t vapic_addr) > > int kvm_x2apic_icr_write(struct kvm_lapic *apic, u64 data) > { > - data &= ~APIC_ICR_BUSY; > - > kvm_apic_send_ipi(apic, (u32)data, (u32)(data >> 32)); > kvm_lapic_set_reg64(apic, APIC_ICR, data); > trace_kvm_apic_write(APIC_ICR, data); > diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c > index cfc8ab773025..4bf0bb250147 100644 > --- a/arch/x86/kvm/svm/avic.c > +++ b/arch/x86/kvm/svm/avic.c > @@ -513,7 +513,7 @@ int avic_incomplete_ipi_interception(struct kvm_vcpu *vcpu) > * in which case KVM needs to emulate the ICR write as well in > * order to clear the BUSY flag. > */ > - if (icrl & APIC_ICR_BUSY) > + if (!apic_x2apic_mode(apic) && (icrl & APIC_ICR_BUSY)) > kvm_apic_write_nodecode(vcpu, APIC_ICR); > else > kvm_apic_send_ipi(apic, icrl, icrh); >
On Wed, Sep 06, 2023, Tao Su wrote: > On Tue, Sep 05, 2023 at 04:03:36PM -0700, Sean Christopherson wrote: > > +Suravee > > > > On Mon, Sep 04, 2023, Tao Su wrote: > > > When IPI virtualization is enabled, a WARN is triggered if bit12 of ICR > > > MSR is set after APIC-write VM-exit. The reason is kvm_apic_send_ipi() > > > thinks the APIC_ICR_BUSY bit should be cleared because KVM has no delay, > > > but kvm_apic_write_nodecode() doesn't clear the APIC_ICR_BUSY bit. > > > > > > Since bit12 of ICR is no longer BUSY bit but UNUSED bit in x2APIC mode, > > > and SDM has no detail about how hardware will handle the UNUSED bit12 > > > set, we tested on Intel CPU (SRF/GNR) with IPI virtualization and found > > > the UNUSED bit12 was also cleared by hardware without #GP. Therefore, > > > the clearing of bit12 should be still kept being consistent with the > > > hardware behavior. > > > > I'm confused. If hardware clears the bit, then why is it set in the vAPIC page > > after a trap-like APIC-write VM-Exit? In other words, how is this not a ucode > > or hardware bug? > > Sorry, I didn't describe it clearly. > > On bare-metal, bit12 of ICR MSR will be cleared after setting this bit. > > If bit12 is set in guest, the bit is not cleared in the vAPIC page after APIC-write > VM-Exit. So whether to clear bit12 in vAPIC page needs to be considered. I got that, the behavior just seems odd to me. And I'm grumpy that Intel punted the problem to software. But the SDM specifically calls out that this is the correct behavior :-/ Specifically, in the context of IPI virtualization: If ECX contains 830H, a general-protection fault occurs if any of bits 31:20, 17:16, or 13 of EAX is non-zero. and If ECX contains 830H, the processor then checks the value of VICR to determine whether the following are all true: Bits 19:18 (destination shorthand) are 00B (no shorthand). Bit 15 (trigger mode) is 0 (edge). Bit 12 (unused) is 0. Bit 11 (destination mode) is 0 (physical). Bits 10:8 (delivery mode) are 000B (fixed). If all of the items above are true, the processor performs IPI virtualization using the 8-bit vector in byte 0 of VICR and the 32-bit APIC ID in VICR[63:32] (see Section 30.1.6). Otherwise, the logical processor causes an APIC-write VM exit (see Section 30.4.3.3). If ECX contains 830H, the processor then checks the value of VICR to determine whether the following are all true: I.e. the "unused" busy bit must be zero. The part that makes me grumpy is that hardware does check that other reserved bits are actually zero: If special processing applies, no general-protection exception is produced due to the fact that the local APIC is in xAPIC mode. However, WRMSR does perform the normal reserved-bit checking: - If ECX contains 808H or 83FH, a general-protection fault occurs if either EDX or EAX[31:8] is non-zero. - If ECX contains 80BH, a general-protection fault occurs if either EDX or EAX is non-zero. - If ECX contains 830H, a general-protection fault occurs if any of bits 31:20, 17:16, or 13 of EAX is non-zero. Which implies that the hardware *does* enforce all the other reserved bits, but punted bit 12 to the hypervisor :-( That said, I think we have an "out". Under the x2APIC section, regarding ICR, the SDM also says: It remains readable only to aid in debugging; however, software should not assume the value returned by reading the ICR is the last written value. I.e. KVM basically has free reign to do whatever it wants, so long as it doesn't confuse userspace or break KVM's ABI. As much as I want to say "do nothing", clearing bit 12 so that it reads back as '0' is the safer approach. Just please don't add a new #define for, it's far easier to understand what's going on if we just use APIC_ICR_BUSY, especially given that I highly doubt the bit will actually be repurposed for something new. FWIW, I also suspect that hardware isn't clearing the busy bit per se, I suspect that hardware simply reads the bit as zero. Side topic, unless I'm blind, KVM is missing the reserved bits #GP checks for ICR bits bits 31:20, 17:16, and 13.
On Wed, Sep 06, 2023 at 03:17:44PM -0700, Sean Christopherson wrote: > On Wed, Sep 06, 2023, Tao Su wrote: > > On Tue, Sep 05, 2023 at 04:03:36PM -0700, Sean Christopherson wrote: > > > +Suravee > > > > > > On Mon, Sep 04, 2023, Tao Su wrote: > > > > When IPI virtualization is enabled, a WARN is triggered if bit12 of ICR > > > > MSR is set after APIC-write VM-exit. The reason is kvm_apic_send_ipi() > > > > thinks the APIC_ICR_BUSY bit should be cleared because KVM has no delay, > > > > but kvm_apic_write_nodecode() doesn't clear the APIC_ICR_BUSY bit. > > > > > > > > Since bit12 of ICR is no longer BUSY bit but UNUSED bit in x2APIC mode, > > > > and SDM has no detail about how hardware will handle the UNUSED bit12 > > > > set, we tested on Intel CPU (SRF/GNR) with IPI virtualization and found > > > > the UNUSED bit12 was also cleared by hardware without #GP. Therefore, > > > > the clearing of bit12 should be still kept being consistent with the > > > > hardware behavior. > > > > > > I'm confused. If hardware clears the bit, then why is it set in the vAPIC page > > > after a trap-like APIC-write VM-Exit? In other words, how is this not a ucode > > > or hardware bug? > > > > Sorry, I didn't describe it clearly. > > > > On bare-metal, bit12 of ICR MSR will be cleared after setting this bit. > > > > If bit12 is set in guest, the bit is not cleared in the vAPIC page after APIC-write > > VM-Exit. So whether to clear bit12 in vAPIC page needs to be considered. > > I got that, the behavior just seems odd to me. And I'm grumpy that Intel punted > the problem to software. But the SDM specifically calls out that this is the > correct behavior :-/ > > Specifically, in the context of IPI virtualization: > > If ECX contains 830H, a general-protection fault occurs if any of bits 31:20, > 17:16, or 13 of EAX is non-zero. > > and > > If ECX contains 830H, the processor then checks the value of VICR to determine > whether the following are all true: > > Bits 19:18 (destination shorthand) are 00B (no shorthand). > Bit 15 (trigger mode) is 0 (edge). > Bit 12 (unused) is 0. > Bit 11 (destination mode) is 0 (physical). > Bits 10:8 (delivery mode) are 000B (fixed). > > If all of the items above are true, the processor performs IPI virtualization > using the 8-bit vector in byte 0 of VICR and the 32-bit APIC ID in VICR[63:32] > (see Section 30.1.6). Otherwise, the logical processor causes an APIC-write VM > exit (see Section 30.4.3.3). If ECX contains 830H, the processor then checks > the value of VICR to determine whether the following are all true: > > I.e. the "unused" busy bit must be zero. The part that makes me grumpy is that > hardware does check that other reserved bits are actually zero: > > If special processing applies, no general-protection exception is produced due > to the fact that the local APIC is in xAPIC mode. However, WRMSR does perform > the normal reserved-bit checking: > - If ECX contains 808H or 83FH, a general-protection fault occurs if either EDX or EAX[31:8] is non-zero. > - If ECX contains 80BH, a general-protection fault occurs if either EDX or EAX is non-zero. > - If ECX contains 830H, a general-protection fault occurs if any of bits 31:20, 17:16, or 13 of EAX is non-zero. > > Which implies that the hardware *does* enforce all the other reserved bits, but > punted bit 12 to the hypervisor :-( > > That said, I think we have an "out". Under the x2APIC section, regarding ICR, > the SDM also says: > > It remains readable only to aid in debugging; however, software should not > assume the value returned by reading the ICR is the last written value. > > I.e. KVM basically has free reign to do whatever it wants, so long as it doesn't > confuse userspace or break KVM's ABI. As much as I want to say "do nothing", > clearing bit 12 so that it reads back as '0' is the safer approach. Just please > don't add a new #define for, it's far easier to understand what's going on if we > just use APIC_ICR_BUSY, especially given that I highly doubt the bit will actually > be repurposed for something new. Thanks for such a detailed analysis. I will submit a patch to clear bit12 since it is a safer approach, and also drop the unnecessary alias. > > FWIW, I also suspect that hardware isn't clearing the busy bit per se, I suspect > that hardware simply reads the bit as zero. > > Side topic, unless I'm blind, KVM is missing the reserved bits #GP checks for ICR > bits bits 31:20, 17:16, and 13. Yes, it is needed to check the reserved bits of ICR and inject #GP when IPI virtualization disabled (otherwise it is injected by hardware). The related kselftest is also necessary to verify the #GP is correctly emulated. Thanks, Tao
Hello, kernel test robot noticed "kernel-selftests.kvm.xapic_state_test.fail" on: commit: 6ccade077c151e719397b0a86f48554db9aa1c24 ("[PATCH 2/2] KVM: x86: Clear X2APIC_ICR_UNUSED_12 after APIC-write VM-exit") url: https://github.com/intel-lab-lkp/linux/commits/Tao-Su/x86-apic-Introduce-X2APIC_ICR_UNUSED_12-for-x2APIC-mode/20230904-093801 patch link: https://lore.kernel.org/all/20230904013555.725413-3-tao1.su@linux.intel.com/ patch subject: [PATCH 2/2] KVM: x86: Clear X2APIC_ICR_UNUSED_12 after APIC-write VM-exit in testcase: kernel-selftests version: kernel-selftests-x86_64-60acb023-1_20230329 with following parameters: group: kvm compiler: gcc-12 test machine: 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480+ (Sapphire Rapids) with 256G memory (please refer to attached dmesg/kmsg for entire log/backtrace) besides, we also noticed kvm.smm_test.fail which keeps pass on parent. 83c02e23e21fec27 6ccade077c151e719397b0a86f4 ---------------- --------------------------- fail:runs %reproduction fail:runs | | | :6 100% 6:6 kernel-selftests.kvm.smm_test.fail :6 100% 6:6 kernel-selftests.kvm.xapic_state_test.fail If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <oliver.sang@intel.com> | Closes: https://lore.kernel.org/oe-lkp/202309242109.43637c74-oliver.sang@intel.com # timeout set to 120 # selftests: kvm: smm_test # ==== Test Assertion Failure ==== # x86_64/smm_test.c:179: stage_reported == stage || stage_reported == SMRAM_STAGE # pid=6354 tid=6354 errno=4 - Interrupted system call # 1 0x00000000004026fc: main at smm_test.c:179 # 2 0x00007f08310cb189: ?? ??:0 # 3 0x00007f08310cb244: ?? ??:0 # 4 0x0000000000402810: _start at ??:? # Unexpected stage: #3, got 4 not ok 25 selftests: kvm: smm_test # exit=254 ... # timeout set to 120 # selftests: kvm: xapic_state_test # ==== Test Assertion Failure ==== # x86_64/xapic_state_test.c:78: __a == __b # pid=7081 tid=7081 errno=4 - Interrupted system call # 1 0x0000000000402a1f: ____test_icr at xapic_state_test.c:78 # 2 0x0000000000402c51: __test_icr at xapic_state_test.c:95 (discriminator 3) # 3 (inlined by) test_icr at xapic_state_test.c:106 (discriminator 3) # 4 0x0000000000402433: main at xapic_state_test.c:197 # 5 0x00007f8a0c660189: ?? ??:0 # 6 0x00007f8a0c660244: ?? ??:0 # 7 0x0000000000402640: _start at ??:? # ASSERT_EQ(icr & ~APIC_ICR_BUSY, val & ~APIC_ICR_BUSY) failed. # icr & ~APIC_ICR_BUSY is 0 # val & ~APIC_ICR_BUSY is 0x44000 not ok 47 selftests: kvm: xapic_state_test # exit=254 The kernel config and materials to reproduce are available at: https://download.01.org/0day-ci/archive/20230924/202309242109.43637c74-oliver.sang@intel.com
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index a983a16163b1..09a376aeb4a0 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -1482,8 +1482,17 @@ void kvm_apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high) { struct kvm_lapic_irq irq; - /* KVM has no delay and should always clear the BUSY/PENDING flag. */ - WARN_ON_ONCE(icr_low & APIC_ICR_BUSY); + /* + * In non-x2apic mode, KVM has no delay and should always clear the + * BUSY/PENDING flag. In x2apic mode, KVM should clear the unused bit12 + * of ICR since hardware will also clear this bit. Although + * APIC_ICR_BUSY and X2APIC_ICR_UNUSED_12 are same, they mean different + * things in different modes. + */ + if (!apic_x2apic_mode(apic)) + WARN_ON_ONCE(icr_low & APIC_ICR_BUSY); + else + WARN_ON_ONCE(icr_low & X2APIC_ICR_UNUSED_12); irq.vector = icr_low & APIC_VECTOR_MASK; irq.delivery_mode = icr_low & APIC_MODE_MASK; @@ -2429,13 +2438,12 @@ void kvm_apic_write_nodecode(struct kvm_vcpu *vcpu, u32 offset) * ICR is a single 64-bit register when x2APIC is enabled. For legacy * xAPIC, ICR writes need to go down the common (slightly slower) path * to get the upper half from ICR2. + * + * TODO: optimize to just emulate side effect w/o one more write */ if (apic_x2apic_mode(apic) && offset == APIC_ICR) { - val = kvm_lapic_get_reg64(apic, APIC_ICR); - kvm_apic_send_ipi(apic, (u32)val, (u32)(val >> 32)); - trace_kvm_apic_write(APIC_ICR, val); + kvm_x2apic_icr_write(apic, val); } else { - /* TODO: optimize to just emulate side effect w/o one more write */ val = kvm_lapic_get_reg(apic, offset); kvm_lapic_reg_write(apic, offset, (u32)val); } @@ -3122,7 +3130,12 @@ int kvm_lapic_set_vapic_addr(struct kvm_vcpu *vcpu, gpa_t vapic_addr) int kvm_x2apic_icr_write(struct kvm_lapic *apic, u64 data) { - data &= ~APIC_ICR_BUSY; + /* + * The Delivery Status Bit(bit 12) is removed in x2apic mode, but this + * bit is also cleared by hardware, so keep consistent with hardware + * behavior to clear this bit. + */ + data &= ~X2APIC_ICR_UNUSED_12; kvm_apic_send_ipi(apic, (u32)data, (u32)(data >> 32)); kvm_lapic_set_reg64(apic, APIC_ICR, data);